Process a Million Song Dataset to Predict Song Preferences

In this big data project, we will discover songs by artists associated with different cultures across the globe.


Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with IPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

Customer Love

Mohamed Yusef Ahmed

Software Developer at Taske

Recently I became interested in Hadoop as I think it's a great platform for storing and analyzing large structured and unstructured data sets. The experts did a great job not only explaining the...

Ray Han

Tech Leader | Stanford / Yale University

I think that they are fantastic. I attended Yale and Stanford and have worked at Honeywell, Oracle, and Arthur Andersen (Accenture) in the US. I have taken Big Data and Hadoop, NoSQL, Spark, Hadoop...

What will you learn

Roadmap of the project
Horizontal scalability of Hadoop and vertical scalability of RDBMS
Pig local and MapReduce execution modes: how they work and how they differ
Challenges in Pig MapReduce program
Overcoming Bandwidth challenges using Pig Tez
Analyzing large datasets easily and efficiently
Understanding the Haversine formula and its application in a Pig Latin UDF
Downloading the dataset and setting up Cloudera VMware
Pig Latin UDF library "DataFu" (created by LinkedIn) for data localization
Logging in to the server using Xshell 5 and PuTTY
Using the data flow programming language "Pig Latin" for analysis
Using HDF5 as the data repository
Performing Basic EDA using Apache Ambari
Extracting Data from individual file and collectively pre-processing it
Registering UDF on Pig
Creating tables and using relational functions (Group, Join, CrossJoin, Filter, etc.)
Working with Hierarchical Data Format (HDF5)
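
The Haversine formula mentioned in the list above gives the great-circle distance between two points from their latitudes and longitudes. The project applies it through a Pig Latin UDF; the Python sketch below is only an illustration of the formula itself. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for this example, not part of the project code.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, in kilometres


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)

    # Haversine of the central angle between the two points
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)

    # Clamp to 1.0 to guard against tiny floating-point overshoot before asin
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


if __name__ == "__main__":
    # Two points on the equator, 90 degrees of longitude apart
    print(round(haversine_km(0.0, 0.0, 0.0, 90.0), 1))
```

A function like this could be used to group artists by how close their recorded locations are, which is one plausible way to relate artists to regions.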

Project Description

This big data Hadoop project aims to be the best possible offline evaluation of a music recommendation system. Any type of algorithm can be used: collaborative filtering, content-based methods, web crawling. Because it relies on the Million Song Dataset, the data for this project is completely open: almost everything is known and possibly available.

What is the task in a few words? You have: 

  1. the full listening history for 1M users, 
  2. half of the listening history for 110K users (10K validation set, 100K test set), 

and you must predict the missing half. How much easier can it get?

The most straightforward approach to this task is pure collaborative filtering, but remember that there is a wealth of information available to you through the Million Song Dataset. Go ahead, explore!
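
To make the task concrete, here is a minimal sketch of user-based collaborative filtering on toy data. It is one simple way to approach the problem, not the project's actual solution: each training user "votes" for the songs a test user has not yet heard, weighted by how many songs the two users share. The function name `recommend` and all user and song IDs are hypothetical.

```python
from collections import Counter


def recommend(full_histories, visible_half, k=3):
    """Rank songs missing from visible_half by overlap-weighted votes.

    full_histories: dict mapping user id -> set of song ids (training users)
    visible_half:   set of song ids known for the test user
    """
    scores = Counter()
    for songs in full_histories.values():
        overlap = len(songs & visible_half)
        if overlap == 0:
            continue  # this user shares nothing with the test user
        for song in songs - visible_half:
            scores[song] += overlap

    # Highest score first; ties broken by song id for deterministic output
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [song for song, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy data: IDs are made up for illustration
    full_histories = {
        "u1": {"s1", "s2", "s3"},
        "u2": {"s1", "s4"},
        "u3": {"s5", "s6"},
    }
    visible = {"s1", "s2"}
    print(recommend(full_histories, visible))
```

On the real dataset the same idea would need to be expressed as a distributed job, since comparing every test user with a million training histories in memory does not scale.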

Similar Projects

In this Hive project, you will work on denormalizing JSON data and creating Hive scripts with the ORC file format.

Hive Project - Understand the various types of SCDs and implement these slowly changing dimensions in Hadoop Hive and Spark.

Analyze clickstream data of a website using Hadoop Hive to increase sales by optimizing every aspect of the customer experience on the website from the first mouse click to the last.

Curriculum For This Mini Project

02h 35m
02h 41m