
Visualise Daily Wikipedia Trends using Hive, Zeppelin Notebooks and Airflow

In this project, we'll work with Apache Airflow and write a scheduled workflow that downloads data from the Wikipedia archives, uploads it to S3, processes it in Hive and finally analyzes it in Zeppelin Notebooks.


What will you learn

  • Workflows and their uses
  • Apache Airflow
  • Working with Qubole and S3
  • Hive table creation and data processing
  • Charting via Zeppelin Notebooks

What will you get

  • Access to a recording of the complete project
  • Access to all material related to the project, such as data files and solution files

Project Description

In this project we build a live workflow for a real project using Apache Airflow, a cutting-edge workflow management platform. We will go through the use cases of workflows, the different tools available to manage them, important features of workflow tools such as the CLI and UI, and how Airflow is different. We will install Airflow and run some simple workflows.

We will download the raw page counts data from the Wikipedia archive and process it via Hadoop. We will then map the processed data to SQL tables in Hive to identify the most viewed pages of a given day. Finally, we will visualize the processed data in Zeppelin Notebooks to identify the daily trends. We will use Qubole to power Hadoop and the Notebooks.
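To make the "most viewed pages" step concrete, here is a minimal sketch in plain Python. In the project itself this aggregation runs in Hive; the snippet is only a stand-in for the idea. It assumes each row of the page counts data has four space-separated fields (project code, page title, view count, bytes transferred), and the sample rows and the function name `top_pages` are made up for illustration.

```python
from collections import Counter


def top_pages(lines, project="en", n=3):
    """Sum view counts per page title for one project and return the n largest."""
    views = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip rows that do not match the assumed four-field layout
        proj, title, count, _size = parts
        if proj == project and count.isdigit():
            views[title] += int(count)
    return views.most_common(n)


# Hypothetical sample rows, not real Wikipedia data.
sample = [
    "en Main_Page 120 4000",
    "en Apache_Hive 15 900",
    "de Hauptseite 80 3000",
    "en Main_Page 30 1000",
    "en Apache_Airflow 45 2100",
    "broken line",
]

print(top_pages(sample, n=2))
```

The same grouping and ordering would typically be expressed as a GROUP BY with an ORDER BY on the summed view count once the data is mapped to a Hive table.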

Each step, such as downloading the data, copying it to S3, creating tables and processing them via Hadoop, will be a task in Airflow, and we will learn how to craft a scheduled workflow in Airflow.
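The way these steps depend on one another can be sketched as a small dependency graph. The example below uses only Python's standard library (`graphlib`, Python 3.9+) and is not Airflow's API; in Airflow each entry would be a task inside a scheduled DAG. The task names are hypothetical labels for the steps listed above.

```python
from graphlib import TopologicalSorter

# Each key is a task; its value is the set of tasks that must finish first.
# Task names are hypothetical labels, not identifiers from the project.
dependencies = {
    "download_pagecounts": set(),
    "upload_to_s3": {"download_pagecounts"},
    "create_hive_table": {"upload_to_s3"},
    "process_in_hive": {"create_hive_table"},
}


def execution_order(deps):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

A workflow manager adds scheduling, retries and monitoring on top of this ordering, which is what makes it suited to a job that has to run every day.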



Member Technical Staff at Qubole

"I am a software developer with an experience of over 8yrs in writing applications. I've a 2yr old kid, who makes sure that I don't get any leisure time :). You'll find me in a movie theater or munching my head in new techs during the spare time."