
Big Data Project on Processing Unstructured Data using Spark

In this project, we will evaluate and demonstrate how to handle unstructured data using Spark.
What are the prerequisites for this project?
  • Students are expected to have a fair knowledge of Big Data and Hadoop, particularly HDFS, Spark, and Hive.
  • A working Hadoop sandbox or a cloud-hosted Hadoop installation with Spark and Hive.

What will you learn

  • Giving unstructured data some structure
  • Programmatically creating data schema using Spark
  • Handling bad data
  • Revisiting Spark and Hive integration
  • Incremental updates in Spark
  • Automating your data pipeline
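
As a rough illustration of the "programmatically creating a schema" and "handling bad data" items above, here is a minimal plain-Python sketch. It is not the project's code: the column names (pool_id, issue_date, balance), the date format, and the helper names are all hypothetical. The idea is that the schema is held as data (a list of name/converter pairs), and any record that does not fit it is set aside with a reason instead of stopping the job. In Spark, the same idea would usually be expressed with a StructType schema built in code.

```python
from datetime import datetime

def to_date(text):
    # Assumed date format (YYYYMMDD); raises ValueError on anything else.
    return datetime.strptime(text, "%Y%m%d").date()

# Hypothetical schema held as data, so it could be generated from a codebook.
SCHEMA = [
    ("pool_id", str),
    ("issue_date", to_date),
    ("balance", float),
]

def apply_schema(fields, schema):
    """Return (row_dict, None) on success or (None, reason) for a bad record."""
    if len(fields) != len(schema):
        return None, f"expected {len(schema)} fields, got {len(fields)}"
    row = {}
    for (name, convert), raw in zip(schema, fields):
        try:
            row[name] = convert(raw.strip())
        except ValueError as exc:
            return None, f"{name}: {exc}"
    return row, None

def split_good_bad(records, schema):
    """Separate records that fit the schema from those that do not."""
    good, bad = [], []
    for fields in records:
        row, reason = apply_schema(fields, schema)
        if reason is None:
            good.append(row)
        else:
            bad.append((fields, reason))
    return good, bad

# Made-up sample records: one valid, one with a bad date, one too short.
records = [
    ["AB1234", "20180101", "1500.50"],
    ["CD5678", "2018-13-01", "10"],
    ["EF9012", "20180201"],
]
good, bad = split_good_bad(records, SCHEMA)
```

Keeping the rejected records together with the reason they failed makes it possible to inspect them later rather than silently dropping them.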

Project Description

Not all datasets come structured. Or, better put, there are more unstructured or semi-structured datasets than structured ones. As data engineers, we should give data a reasonable amount of structure, or a schema, before it becomes useful for any downstream operation.

In this Hackerday session, we will evaluate and demonstrate how to handle rather unstructured datasets from the ginniemae.gov data disclosure history site. The dataset is free-text data that comes with a codebook describing it. A lot happens between the codebook and the data, and we will see all of it in this session.
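
To make the codebook idea concrete, here is a minimal sketch that assumes the codebook describes each field by a start position and a length, that is, a fixed-width layout. That assumption, the layout entries, and the sample line are all hypothetical and are not taken from the actual Ginnie Mae files.

```python
# Hypothetical codebook excerpt: (field name, 1-based start position, length).
LAYOUT = [
    ("record_type", 1, 1),
    ("pool_id", 2, 6),
    ("issue_date", 8, 8),
    ("balance", 16, 10),
]

def parse_record(line, layout):
    """Slice one fixed-width line into a dict of stripped string fields."""
    record = {}
    for name, start, length in layout:
        begin = start - 1  # codebook positions are assumed to be 1-based
        record[name] = line[begin:begin + length].strip()
    return record

# Made-up sample line built to match the hypothetical layout above.
sample = "P" + "AB1234" + "20180101" + "0001500.50"
parsed = parse_record(sample, LAYOUT)
```

In Spark, a function like this could be applied to every line of a text file with a map over the lines, leaving type conversion and validation to a later step.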

Ginnie Mae is a federally owned corporation that helps create and guarantee mortgage-backed securities in the US housing market. It is a lot more than that; see https://www.investopedia.com/terms/g/ginniemae.asp for more.

Instructors

 
Michael

Big Data & Enterprise Software Engineer

I am passionate about software development, databases, data analysis and the Android platform. My native language is Java, but no one has stopped me so far from learning and using Angular and Node.js. Data and data analysis are thrilling, and so are my experiences with SQL on Oracle, Microsoft SQL Server, Postgres and MyS...

What is Hackerday?

Stay up to date with technology trends by working on projects

Live online coding sessions led by industry experts

Build 2-4 projects a month, each lasting 6 hours and designed to teach you advanced concepts

Code in groups and connect with your community