Big Data Project on Processing Unstructured Data using Spark

In this project, we will evaluate and demonstrate how to handle unstructured data using Spark.

Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with iPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

Customer Love


Swati Patra

Systems Advisor, IBM

I have 11 years of experience and work with IBM. My domain is Travel, Hospitality and Banking - both sectors process lots of data. The way the projects were set up and the mentors' explanation was...

Subhabrata Biswas

Lead Consultant, ITC Infotech

The project orientation is very much unique and it helps to understand the real time scenarios most of the industries are dealing with. And there is no limit, one can go through as many projects...

What will you learn

Giving unstructured data some structure
Programmatically creating data schema using Spark
Handling bad data
Revisiting Spark and Hive integration
Incremental updates in Spark
Automating your data pipeline
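
To make "programmatically creating data schema" and "handling bad data" a little more concrete, here is a minimal sketch in plain Python rather than Spark. It is not the project's code: the field names, the converters in `FIELD_SPEC`, and the `apply_schema` helper are all hypothetical and exist only for illustration. The idea is that the schema lives in data (a list of name and converter pairs) and any row that does not fit it is set aside with a reason instead of failing the whole run.

```python
from datetime import datetime

# Hypothetical schema, defined as data: (column name, converter).
# None of these field names come from the project itself.
FIELD_SPEC = [
    ("loan_id", str),
    ("balance", float),
    ("issue_date", lambda s: datetime.strptime(s, "%Y%m%d").date()),
]


def apply_schema(raw_rows, spec=FIELD_SPEC):
    """Return (good, bad): typed records, and rejected rows with a reason."""
    good, bad = [], []
    for row in raw_rows:
        # A row with the wrong number of fields cannot match the schema.
        if len(row) != len(spec):
            bad.append((row, "wrong field count"))
            continue
        try:
            record = {
                name: convert(value.strip())
                for (name, convert), value in zip(spec, row)
            }
        except ValueError as exc:
            # Conversion failed: keep the row and the error for later review.
            bad.append((row, str(exc)))
            continue
        good.append(record)
    return good, bad


rows = [
    ["A001", "1500.50", "20190101"],  # fits the schema
    ["A002", "n/a", "20190201"],      # balance is not a number
    ["A003", "99"],                   # a field is missing
]
good, bad = apply_schema(rows)
```

Keeping the rejected rows next to the reason they failed makes it possible to inspect and reprocess them later, which is usually preferable to dropping them silently.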

Project Description

Not all datasets come structured. Or better put, there are more unstructured or semi-structured datasets than structured ones. As data engineers, we should give data a reasonable amount of structure, or schema, before it becomes useful for any downstream operation.

In this Hackerday session, we will evaluate and demonstrate how to handle rather unstructured datasets from the data disclosure history site. The dataset is free-text data that comes with a codebook describing it. A lot happens between the codebook and the data, and we will cover all of it in this session.
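
As a rough illustration of how a codebook can drive parsing, the sketch below assumes the codebook lists each field's name, 1-based start position, and length, so every line of free text can be sliced into named fields. That fixed-width layout is an assumption, not something the description above states, and the `CODEBOOK` entries, field names, and `slice_record` helper are hypothetical. It is written in plain Python and is not the project's Spark code.

```python
# Hypothetical codebook: (field name, 1-based start position, length).
# The real codebook and its layout are not shown in this description.
CODEBOOK = [
    ("pool_id", 1, 6),
    ("issuer", 7, 4),
    ("rate", 11, 5),
]


def slice_record(line, codebook=CODEBOOK):
    """Cut one fixed-width line into a dict of stripped string fields."""
    # The widest field end tells us the minimum acceptable line length.
    width = max(start - 1 + length for _, start, length in codebook)
    if len(line) < width:
        raise ValueError(f"line too short: expected {width} chars, got {len(line)}")
    return {
        name: line[start - 1:start - 1 + length].strip()
        for name, start, length in codebook
    }


record = slice_record("AB12349999 3.50")
```

Because the layout is held in `CODEBOOK` rather than hard-coded in the parser, a change in the codebook only requires editing that list.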

Ginnie Mae is a federally owned corporation that helps to create and guarantee mortgage-backed securities in the US housing market. It is a lot more than that.

Similar Projects

Use an aviation dataset to simulate a complex real-world big data pipeline based on messaging, with AWS QuickSight, Druid, NiFi, Kafka, and Hive.

Learn to write a Hadoop Hive Program for real-time querying.

In this big data project, we will discover songs by artists associated with different cultures across the globe.

Curriculum For This Mini Project