Big Data Hadoop Project-Visualize Daily Wikipedia Trends

In this big data project, we'll work with Apache Airflow and write a scheduled workflow that downloads data from the Wikipedia archives, uploads it to S3, processes it in Hive, and finally analyzes it in Zeppelin Notebooks.

Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with iPython notebooks and datasets.

Project Experience

Add project experience to your Linkedin/Github profiles.

Customer Love

James Peebles

Data Analytics Leader, IQVIA

This is one of the best of investments you can make with regards to career progression and growth in technological knowledge. I was pointed in this direction by a mentor in the IT world who I highly...

Ray Han

Tech Leader | Stanford / Yale University

I think that they are fantastic. I attended Yale and Stanford and have worked at Honeywell, Oracle, and Arthur Andersen (Accenture) in the US. I have taken Big Data and Hadoop, NoSQL, Spark, Hadoop...

What will you learn

Creating your own virtual environment in Python
Installing the Dependencies in the environment
Understanding Workflows and their uses
Installing Apache Airflow, the Airflow Web server and the Airflow Scheduler
Creating Tasks in Airflow and setting up the downstream dependencies
Working with Qubole and S3
Creating page table in Hive using SQL dumps
Registering the Database and Extracting the desired Data
Understanding how to design schemas and perform inner joins, double joins, etc.
Visualizing and executing paths in Airflow
Fetching Incoming Data and putting it on S3
Filtering Data Via Hive and Hadoop
Mapping the filtered data with the SQL data
Creating your own Airflow Scheduler on Qubole for automatic task completion
Final Charting via Zeppelin Notebooks
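
The mapping of filtered page counts onto the page table from the SQL dump is, in essence, an inner join. The following is only a minimal sketch of that idea: it uses Python's built-in sqlite3 module as a stand-in for Hive, and the table names (`page`, `filtered_counts`), their columns, and the sample rows are all made up for illustration rather than taken from the project.

```python
import sqlite3


def top_pages(conn, limit=3):
    """Inner-join filtered counts to the page table and rank pages by views."""
    query = """
        SELECT p.page_id, p.page_title, SUM(c.views) AS total_views
        FROM filtered_counts AS c
        INNER JOIN page AS p ON p.page_title = c.page_title
        GROUP BY p.page_id, p.page_title
        ORDER BY total_views DESC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


# In-memory database with hypothetical tables and sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
    CREATE TABLE filtered_counts (page_title TEXT, views INTEGER);
""")
conn.executemany(
    "INSERT INTO page VALUES (?, ?)",
    [(1, "Main_Page"), (2, "Apache_Hadoop"), (3, "Apache_Hive")],
)
conn.executemany(
    "INSERT INTO filtered_counts VALUES (?, ?)",
    [("Main_Page", 500), ("Apache_Hadoop", 40),
     ("Main_Page", 300), ("Not_In_Dump", 999)],
)

for row in top_pages(conn):
    print(row)
```

Because the join is an inner join, titles that appear on only one side (here `Apache_Hive` and `Not_In_Dump`) drop out of the result.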

Project Description

In this big data project, we build a live workflow for a real project using Apache Airflow, a modern workflow management platform. We will go through the use cases for workflows, the different tools available to manage them, important workflow features such as the CLI and UI, and how Airflow is different. We will then install Airflow and run some simple workflows.

In this big data Hadoop project, we will download the raw page counts data from the Wikipedia archive and process it via Hadoop. We will then map the processed data to the raw SQL data to identify the most viewed pages of a given day, and visualize the result in Zeppelin Notebooks to identify the daily trends. We will use Qubole to power Hadoop and the notebooks.
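
To make the filtering step concrete, here is a small sketch in plain Python of what the processing does logically: parse each page count record, keep one project, and total the views per page. It assumes each record is a space-separated line of project code, page title, view count, and bytes transferred; check that layout against the files you actually download. The function names and sample lines are hypothetical, and in the project itself this work is done by Hive on Hadoop, not by a script like this.

```python
from collections import Counter


def parse_line(line):
    """Parse one 'project title views bytes' record; return None if malformed."""
    parts = line.split()
    if len(parts) != 4:
        return None
    project, title, views, _size = parts
    if not views.isdigit():
        return None
    return project, title, int(views)


def top_pages_for_project(lines, project="en", limit=3):
    """Sum views per title for one project and return the most viewed titles."""
    totals = Counter()
    for line in lines:
        record = parse_line(line)
        if record is None or record[0] != project:
            continue  # skip malformed lines and other projects
        totals[record[1]] += record[2]
    return totals.most_common(limit)


# Made-up sample records, not real Wikipedia data.
sample = [
    "en Main_Page 500 123456",
    "en Apache_Hadoop 40 2048",
    "de Hauptseite 70 4096",
    "en Main_Page 300 654321",
    "malformed line",
]

print(top_pages_for_project(sample))
```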

All steps, such as downloading, copying data to S3, creating tables, and processing them via Hadoop, will be tasks in Airflow, and we will learn how to craft a scheduled workflow in Airflow.
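
The idea of tasks with upstream and downstream dependencies can be sketched without Airflow at all. The example below is not Airflow's API: it uses the standard library's `graphlib` (Python 3.9+) to order four placeholder tasks so that each one runs only after the tasks it depends on. The task names and the `run_workflow` helper are hypothetical. In Airflow, the same ordering would be declared on operators inside a DAG, for example with `set_downstream` or the `>>` operator.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, upstream):
    """Run each task once, only after all of its upstream tasks have run."""
    order = list(TopologicalSorter(upstream).static_order())
    log = [tasks[name]() for name in order]
    return order, log


# Placeholder callables standing in for the real steps.
tasks = {
    "download_pagecounts": lambda: "downloaded",
    "copy_to_s3": lambda: "copied",
    "create_tables": lambda: "tables created",
    "process_with_hadoop": lambda: "processed",
}

# Each key lists the tasks that must finish before it starts.
upstream = {
    "copy_to_s3": {"download_pagecounts"},
    "create_tables": {"copy_to_s3"},
    "process_with_hadoop": {"create_tables"},
}

order, log = run_workflow(tasks, upstream)
print(order)
```

A scheduler such as Airflow adds what this sketch leaves out: running the graph on a schedule, retrying failed tasks, and showing run status in a UI.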

Curriculum For This Mini Project

03h 59m