Hive Project - Visualising Website Clickstream Data with Apache Hadoop

Analyze clickstream data of a website using Hadoop Hive to increase sales by optimizing every aspect of the customer experience on the website from the first mouse click to the last.


Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with iPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.


What will you learn

Understanding the problem statement
Understanding Clickstream data and its need
Understanding the architecture and tools used for the solution
How to create an AWS EC2 instance?
What is Apache Flume and when to use it?
How to install Apache Flume on an EC2 instance?
How to set up a Flume agent to ingest Clickstream logs?
How to create Spark SQL tables over AWS S3?
What is Apache Airflow?
How to install Apache Airflow on an EC2 instance?
How to create an Airflow DAG to automate the workflow?
What is Tableau?
Why is Tableau better than other BI tools?
How to connect Tableau to a Spark SQL server?
Bringing insights from Clickstream analysis using Tableau
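A Flume agent like the one set up in this project is defined in a properties file that wires a source, a channel, and a sink together. The sketch below is a minimal illustration, not the project's actual configuration: the agent name, log path, and S3 bucket are assumptions.

```properties
# Hypothetical Flume agent "clickagent": tail the web server access log
# and deliver events to an S3 path via the HDFS sink.
clickagent.sources = weblog
clickagent.channels = mem
clickagent.sinks = s3sink

# Source: follow the access log as new clicks are appended
clickagent.sources.weblog.type = exec
clickagent.sources.weblog.command = tail -F /var/log/httpd/access_log
clickagent.sources.weblog.channels = mem

# Channel: buffer events in memory between source and sink
clickagent.channels.mem.type = memory
clickagent.channels.mem.capacity = 10000

# Sink: write date-partitioned files to S3 (s3a:// path is an assumption)
clickagent.sinks.s3sink.type = hdfs
clickagent.sinks.s3sink.channel = mem
clickagent.sinks.s3sink.hdfs.path = s3a://my-bucket/clickstream/%Y-%m-%d
clickagent.sinks.s3sink.hdfs.fileType = DataStream
clickagent.sinks.s3sink.hdfs.useLocalTimeStamp = true
```

Started with `flume-ng agent -n clickagent -f click.conf`, this agent streams each new log line into the bucket that the Spark SQL tables are later defined over.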

Project Description

Clickstream data records the flow or trail of a user as they visit a website. For example, if your site has pages A-Z and you want to see how many people land on page G and then go to page B, you can analyze this data to see the clickstream pattern of your visitors. Clickstream data is stored in semi-structured web logs, so the term "web log analysis" that you will often hear means the same thing as analyzing clickstream data. Segmenting and analyzing this clickstream data gives you a more refined look at your customers' behavior patterns, from the moment they land on your website until they either buy your product or leave without buying.
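To make "semi-structured web logs" concrete, each click typically arrives as one combined-log-format line, and the referrer field is what reveals the page-to-page trail. A minimal stdlib sketch (the sample line and field names are invented for illustration):

```python
import re

# Combined Log Format: IP, timestamp, request line, status, bytes,
# referrer, user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) \S+" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_click(line):
    """Return a dict of clickstream fields, or None if the line is malformed."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

sample = ('203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
          '"GET /page-g HTTP/1.1" 200 2326 '
          '"https://example.com/page-a" "Mozilla/5.0"')

click = parse_click(sample)
# The referrer shows this visitor reached /page-g from /page-a.
print(click["ip"], click["page"], click["referrer"])
```

Grouping parsed records by IP and sorting by timestamp reconstructs each visitor's click path, which is exactly what the Hive queries in this project operate on at scale.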

You have built a wonderful website, and your transaction page has all the information a visitor needs to know before buying the product. Still, you see that a huge number of your website visitors leave without buying a single product. Often this is because of a broken link or path somewhere that prevents users from quickly and easily buying your product. Hadoop helps you extract, store, and analyze the clickstream (web log) data and merge it with traditional customer data, in order to get better insights into visitor behavior and optimize the path to purchase. Hive is the easiest of the Hadoop tools to learn. If you come from a data warehousing background and know SQL well, working with Hive will be a breeze. Hive is a data warehouse infrastructure built on top of Hadoop and is quite versatile in its usage, as it supports different storage types such as plain text, RCFile, Amazon S3, HBase, and ORC. Hive has its own SQL-like language with schemas, called HiveQL, which transparently converts queries into MapReduce or Apache Spark jobs.
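As a sketch of the kind of HiveQL this project involves, the following defines an external table over logs landed on S3 and asks one of the path questions from the description. The table layout, column names, and S3 location are assumptions for illustration, not the project's actual schema.

```sql
-- Hypothetical external table over tab-delimited clickstream records on S3.
CREATE EXTERNAL TABLE IF NOT EXISTS clickstream (
  ip        STRING,
  ts        STRING,
  page      STRING,
  referrer  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3a://my-bucket/clickstream/';

-- How many visitors land on page G and then go to page B?
SELECT COUNT(*) AS g_to_b_visits
FROM clickstream
WHERE page = '/page-b'
  AND referrer LIKE '%/page-g';
```

Because the table is external, Hive only reads the files in place; Flume can keep appending new logs to the same location without any reloading step.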

You will be working on solving these business problems for the end-user in this Hadoop Hive Project:

  • Optimizing the click-through path of the users

  • What is the optimal path for a user to follow in order to buy the product?

  • After how many clicks does a user lose interest in buying a product?

  • Which products do users usually buy together?

  • Where should website resources be allocated to give visitors the best user experience, so that they return?

Curriculum For This Mini Project

Introduction to the Business Problem
Solution Architecture
Create EC2 Instance
Why is Flume used
Install Flume
Flume Configuration - Ingesting Clickstream Logs
Create Spark SQL Table
Installing Airflow
Create Airflow Automation
Why Tableau for Visualisation - Overview
Analysis with Tableau - Overview
Connecting Tableau to Spark SQL Server
Analysis with Tableau - 1
Analysis with Tableau - 2