Each project comes with 2-5 hours of micro-videos explaining the solution.
Get access to 50+ solved projects with IPython notebooks and datasets.
Add project experience to your LinkedIn/GitHub profiles.
We have learned that Hadoop's distributed file system (HDFS) was engineered to favor a few large files over many small ones. However, we rarely control how data arrives. Many ingestion pipelines deliver data in small pieces, and whether or not we are implementing a data lake on HDFS, we have to deal with these inputs.
In this online Hadoop project, we continue the series on data engineering by discussing and implementing various ways to resolve the small file problem in Hadoop.
We will start by defining the problem and showing why it is often unavoidable, then look at how to identify bottlenecks in a Hadoop cluster caused by small files and at the various ways to resolve them.
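As a rough illustration of the identify-then-resolve idea, here is a minimal Python sketch that flags small files in a listing and groups them into compaction batches close to one block in size. It is not the project's own solution: the 128 MiB block size, the 0.5 threshold ratio, and the names `find_small_files` and `plan_compaction` are assumptions made for this example, and the listing is plain in-memory data rather than output from a real cluster.

```python
from typing import List, Tuple

# Assumption: 128 MiB block size (a common HDFS default; check your cluster).
BLOCK_SIZE = 128 * 1024 * 1024
# Hypothetical threshold: treat files under half a block as "small".
SMALL_FILE_RATIO = 0.5

FileInfo = Tuple[str, int]  # (path, size in bytes)


def find_small_files(files: List[FileInfo],
                     block_size: int = BLOCK_SIZE,
                     ratio: float = SMALL_FILE_RATIO) -> List[FileInfo]:
    """Return the files whose size is below ratio * block_size."""
    limit = block_size * ratio
    return [f for f in files if f[1] < limit]


def plan_compaction(small_files: List[FileInfo],
                    target_size: int = BLOCK_SIZE) -> List[List[FileInfo]]:
    """Greedily group small files into batches of at most target_size bytes.

    Each batch is a candidate for merging into a single larger file.
    """
    batches: List[List[FileInfo]] = []
    current: List[FileInfo] = []
    current_size = 0
    # Largest first, so big pieces are placed before the small remainders.
    for path, size in sorted(small_files, key=lambda f: f[1], reverse=True):
        if current and current_size + size > target_size:
            batches.append(current)
            current, current_size = [], 0
        current.append((path, size))
        current_size += size
    if current:
        batches.append(current)
    return batches


MIB = 1024 * 1024
# Made-up listing for illustration only.
listing = [
    ("/data/a", 10 * MIB),
    ("/data/b", 200 * MIB),
    ("/data/c", 60 * MIB),
    ("/data/d", 60 * MIB),
    ("/data/e", 20 * MIB),
]
small = find_small_files(listing)
plan = plan_compaction(small)
```

In practice the listing would come from the cluster itself, and each batch would be merged by whatever compaction mechanism the project settles on; the sketch only shows the planning step.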
In this Apache Spark SQL project, we will go through provisioning data for retrieval using Spark SQL.
In this project, we will look at two database platforms, MongoDB and Cassandra, and examine the philosophical differences in how these databases work and perform analytical queries.
In this Spark project, we will continue building the data warehouse from the previous project, Yelp Data Processing Using Spark And Hive Part 1, and do further data processing to develop diverse data products.