Each project comes with 2-5 hours of micro-videos explaining the solution.
Get access to 50+ solved projects with IPython notebooks and datasets.
Add project experience to your LinkedIn/GitHub profiles.
Why Big Data?
Real-time streaming data is captured at regular intervals from millions of IoT devices: sensor readings, clickstreams, logs from device APIs, along with historical data from SQL databases. To store such huge volumes of data arriving with high velocity and varying veracity, we need an efficient, scalable storage system distributed across different nodes, either on premises or in the cloud. This is where Hadoop comes in, and it can be split into two parts: storage and processing. Storage is handled by HDFS, and processing is done with MapReduce.
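To make the storage/processing split concrete, here is a minimal MapReduce word count sketch written for Hadoop Streaming. The file name and the map/reduce command-line modes are illustrative assumptions; in practice the script is submitted with the hadoop-streaming jar against input that already sits in HDFS.

```python
#!/usr/bin/env python3
# wordcount.py (hypothetical name) -- a minimal Hadoop Streaming sketch.
# Run as the mapper with "python wordcount.py map" and as the reducer with
# "python wordcount.py reduce"; Hadoop Streaming pipes HDFS data through stdin/stdout.
import sys

def mapper():
    # Emit (word, 1) for every word on every input line.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so all counts for a word arrive together.
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```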
A data pipeline refers to a system for moving data from one system to another. The data may or may not be transformed, and it may be processed in real time (streaming) rather than in batches. The pipeline covers everything from extracting or capturing data with various tools, storing the raw data, cleaning and validating it, and transforming it into a query-worthy format, to visualising KPIs and orchestrating the whole process.
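As a rough illustration of those stages, here is a small pipeline skeleton; every function name, the source URL, and the output path are hypothetical, and each stage simply hands its result to the next while an orchestration function ties them together.

```python
# Illustrative pipeline skeleton: extract -> clean/validate -> transform -> load,
# with a single orchestration function running the stages in order.
import json
from urllib.request import urlopen

def extract(url):
    # Capture raw data from a source API.
    with urlopen(url) as resp:
        return json.load(resp)

def clean(records):
    # Validate: drop records missing required fields.
    return [r for r in records if r.get("id") is not None]

def transform(records):
    # Reshape into a query-worthy format (flat rows).
    return [{"id": r["id"], "value": r.get("value", 0)} for r in records]

def load(rows, path):
    # Store the processed rows for downstream querying and visualisation.
    with open(path, "w") as f:
        json.dump(rows, f)

def run_pipeline(url, path):
    # Orchestration: run the stages end to end.
    load(transform(clean(extract(url))), path)
```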
What are we going to do?
We are going to extract data from APIs using Python, parse it, save it locally on an EC2 instance, and then upload it to HDFS. Next we read the data from HDFS with PySpark and perform the analysis, applying Kryo serialization and Spark optimisation techniques. An external table is created on Hive/Presto, and finally the data is visualised with AWS QuickSight.
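A hedged sketch of that ingestion-and-analysis flow is shown below. The API URL, local path, HDFS location, and the analysis column are placeholders rather than the project's actual values, and the sketch assumes the API returns a JSON array of records.

```python
# Sketch: extract from an API, save locally on EC2, push to HDFS, read with PySpark.
import json
import subprocess
import requests
from pyspark.sql import SparkSession

API_URL = "https://api.example.com/events"      # hypothetical source API
LOCAL_PATH = "/home/ec2-user/data/events.json"  # local path on the EC2 instance
HDFS_PATH = "hdfs:///user/hadoop/raw/events.json"

# 1. Extract: pull the data from the API with Python and parse the JSON.
records = requests.get(API_URL, timeout=30).json()

# 2. Save locally on the EC2 instance as newline-delimited JSON.
with open(LOCAL_PATH, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# 3. Upload the raw file to HDFS using the Hadoop CLI.
subprocess.run(["hdfs", "dfs", "-put", "-f", LOCAL_PATH, HDFS_PATH], check=True)

# 4. Read the data back with PySpark, enabling Kryo serialization.
spark = (
    SparkSession.builder
    .appName("api-to-hdfs-analysis")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)
df = spark.read.json(HDFS_PATH)
df.groupBy("event_type").count().show()  # placeholder analysis column
```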
In this big data project, we'll work with Apache Airflow and write a scheduled workflow that downloads data from the Wikipedia archives, uploads it to S3, processes it in Hive, and finally analyses it in Zeppelin notebooks.
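A minimal sketch of such a scheduled workflow, assuming Airflow 2.x with the AWS and Hive CLIs available on the worker; the DAG id, dates, bucket name, archive URL, and script path are all placeholders.

```python
# Sketch of a daily Airflow DAG: download an archive, upload to S3, process in Hive.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="wikipedia_pageviews",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    download = BashOperator(
        task_id="download_archive",
        # Placeholder URL: the dated pageviews archive path is omitted here.
        bash_command="curl -o /tmp/pageviews.gz https://dumps.wikimedia.org/other/pageviews/",
    )
    upload = BashOperator(
        task_id="upload_to_s3",
        bash_command="aws s3 cp /tmp/pageviews.gz s3://my-bucket/raw/{{ ds }}/",
    )
    process = BashOperator(
        task_id="process_in_hive",
        bash_command="hive -f /opt/scripts/load_pageviews.hql",
    )
    download >> upload >> process
```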
In this Databricks Azure tutorial project, you will use Spark SQL to analyse the MovieLens dataset and provide movie recommendations. As part of this you will deploy Azure Data Factory and data pipelines, and visualise the analysis.
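As a taste of the Spark SQL side, here is a simplified "top-rated movies" query over the MovieLens files; the mount paths and the rating-count threshold are assumptions, and the full project goes well beyond this single query.

```python
# Sketch: register MovieLens ratings and movies as views and query them with Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("movielens-sql").getOrCreate()

ratings = spark.read.csv("/mnt/movielens/ratings.csv", header=True, inferSchema=True)
movies = spark.read.csv("/mnt/movielens/movies.csv", header=True, inferSchema=True)
ratings.createOrReplaceTempView("ratings")
movies.createOrReplaceTempView("movies")

# Recommend the highest-rated titles among movies rated often enough to trust the average.
top_movies = spark.sql("""
    SELECT m.title,
           ROUND(AVG(r.rating), 2) AS avg_rating,
           COUNT(*)                AS num_ratings
    FROM ratings r
    JOIN movies m ON r.movieId = m.movieId
    GROUP BY m.title
    HAVING COUNT(*) >= 100
    ORDER BY avg_rating DESC
    LIMIT 20
""")
top_movies.show(truncate=False)
```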
In this Hadoop project, we continue the data engineering series by discussing and implementing various ways to solve the Hadoop small file problem.
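One common remedy, sketched here in PySpark under assumed paths and an assumed target partition count, is simply to compact the small files: read them all and rewrite them as a handful of larger files in a columnar format.

```python
# Sketch: compact many small HDFS files into a few larger Parquet files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

# Read a directory full of small JSON files from HDFS.
df = spark.read.json("hdfs:///user/hadoop/raw/events/")

# Coalesce to a small number of partitions so Spark writes a few large files
# instead of thousands of tiny ones, then store them back in a compact format.
df.coalesce(8).write.mode("overwrite").parquet("hdfs:///user/hadoop/compact/events/")
```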