In this project, we will talk about Apache Zeppelin. We will write code, write notes, build charts and share it all in a single data analytics environment using Hive, Spark and Pig.
Apache Zeppelin: What it is and how it works
Installing Zeppelin interpreters
Running Spark, Hive and Pig code on your notebook
Writing markdown notes or narrative text
Collaborating and sharing your notebook with others
and more...
Data analysis and Collaboration using Apache Zeppelin, Pig and Hive
In this project, we will be performing an OLAP cube design using the AdventureWorks dataset. The deliverable for this session will be to design a cube, build and implement it using Kylin, query the cube and even connect familiar tools (like Excel) to our new cube.
Apache Kylin and how it works
Installing Apache Kylin in our Quickstart VM
Designing a star schema on our AdventureWorks database
Implementing our star schema in Kylin
Writing aggregate queries against a Kylin cube
and more...
Online Analytical Processing and Visualization of Retail Data with Apache Kylin
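A Kylin cube exists to answer aggregate queries over a star schema without scanning the raw fact table. Here is a rough, pure-Python sketch of the kind of GROUP BY a cube pre-computes; the tiny product dimension and sales fact below are made-up stand-ins for AdventureWorks tables:

```python
from collections import defaultdict

# Made-up, tiny stand-ins for an AdventureWorks-style star schema:
# one fact table (sales) keyed into one dimension table (product).
dim_product = {
    1: {"product": "Road Bike", "category": "Bikes"},
    2: {"product": "Helmet", "category": "Accessories"},
    3: {"product": "Mountain Bike", "category": "Bikes"},
}

fact_sales = [
    {"product_id": 1, "amount": 1200.0},
    {"product_id": 2, "amount": 35.0},
    {"product_id": 3, "amount": 900.0},
    {"product_id": 1, "amount": 1100.0},
]

def sales_by_category(facts, dim):
    """Join the fact table to the dimension and aggregate -- the kind of
    GROUP BY a Kylin cube answers from pre-computed results."""
    totals = defaultdict(float)
    for row in facts:
        totals[dim[row["product_id"]]["category"]] += row["amount"]
    return dict(totals)

print(sales_by_category(fact_sales, dim_product))
# {'Bikes': 3200.0, 'Accessories': 35.0}
```

Kylin's value is that it materializes these aggregates ahead of time, so the query above becomes a lookup rather than a scan.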
In this project, we will use the Yelp review dataset to analyze businesses and reviews, ingest the final output of our data processing into Elasticsearch, and use Kibana, the visualization tool in the ELK stack, to build various kinds of ad-hoc reports from the data.
Ingesting data from a relational database using Sqoop
Ingesting data from a relational database directly into Spark
Processing relational data in Spark
Ingesting processed data into Elasticsearch
Visualizing review analytics using Kibana
Analyze and Visualize Online Review Data using Spark, Elasticsearch, Sqoop and Kibana
In this big data project, we will see how data ingestion and loading are done with the Kafka Connect API, while transformation is done with the Kafka Streams API.
Kafka and Data warehousing
Real-time data warehousing
Kafka Connect API
Kafka Streams API
End-to-end Kafka pipeline
Building Real-Time Data Pipelines with Kafka Connect
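The Connect-ingests, Streams-transforms split can be pictured with a small pure-Python stand-in. The topics, record fields and transformation below are invented for illustration; this is not the Kafka API:

```python
# Pure-Python stand-in for a Kafka pipeline: Connect would feed the source
# "topic", a Streams topology filters and reshapes records, and Connect
# would drain the sink "topic" into the warehouse. All names and fields
# below are invented for illustration.
source_topic = [
    {"user": "a", "event": "click", "value": 3},
    {"user": "b", "event": "view", "value": 1},
    {"user": "a", "event": "click", "value": 7},
]

def transform(records):
    # filter(): keep only click events; map(): reshape for the sink
    return [
        {"key": r["user"], "clicks": r["value"]}
        for r in records
        if r["event"] == "click"
    ]

sink_topic = transform(source_topic)
print(sink_topic)
# [{'key': 'a', 'clicks': 3}, {'key': 'a', 'clicks': 7}]
```

In the real pipeline the filter/map pair would be a Streams topology running continuously, not a batch function call.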
This is a continuation of the previous hackerday, "Tough engineering choices with large datasets in Hive Part - 1", where we will continue working on processing big datasets using Hive.
Common misuse/abuse of Hive
How to use and interpret Hive's explain command
File formats and their relative performance (Text, JSON, SequenceFile, Avro, ORC and Parquet)
Compression
Spark and Hive for transformation
and more...
Tough engineering choices with large datasets in Hive Part - 2
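To get a feel for why format and compression choices matter so much in Hive, here is a minimal pure-Python illustration using zlib. The sample row is made up, and real ORC/Parquet compression works per column rather than on whole files, but the underlying effect is the same: repetitive row-oriented text compresses dramatically.

```python
import zlib

# Repetitive row-oriented text (the sample row is made up) compresses
# dramatically; columnar formats like ORC and Parquet exploit the same
# redundancy per column, which is why format choice matters in Hive.
raw = b"2017-01-01,store_42,item_7,SOLD\n" * 1000
compressed = zlib.compress(raw)

print(f"raw={len(raw)} bytes, compressed={len(compressed)} bytes")
```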
In this hackerday, we will be performing real-time processing of log entries from applications, using Kafka for the streaming architecture in a microservice sense.
Re-state the case for real-time processing of log files
Run through our application and real-time log collection using Flume Log4J appenders
Flume-Kafka integration (Kafka as channel or sink)
Kafka Streams
Kafka Connect
and more...
Real-time log processing using streaming architecture, Part 2
In this hackerday, we are going to bring processing to the speed layer of the lambda architecture, which opens up capabilities to monitor application performance in real time, measure user experience in real time, and raise real-time alerts in case of security incidents.
Making a case for real time processing of log files
Getting logs at real time using Flume Log4J appenders
Making a case for Kafka for log aggregation
Storing log events as time-series datasets in HBase
Integrating Hive and HBase for data retrieval using queries
and more...
Real-time log processing using streaming architecture
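Storing log events as a time series in HBase largely comes down to row-key design. A common pattern is salt + entity + reversed timestamp, so that writes spread across regions and scans return the newest rows first. A small illustrative sketch follows; the key layout and names are assumptions, not a fixed HBase convention:

```python
import zlib

# Illustrative HBase row-key design for time-series log events: a small
# salt spreads writes across regions, and (MAX_TS - timestamp) makes rows
# sort newest-first under HBase's lexicographic ordering.
MAX_TS = 10**13  # larger than any epoch-millis timestamp we expect

def row_key(app_id: str, ts_millis: int, buckets: int = 4) -> str:
    salt = zlib.crc32(app_id.encode()) % buckets  # deterministic salt
    return f"{salt}|{app_id}|{MAX_TS - ts_millis:013d}"

older = row_key("web-app", 1_500_000_000_000)
newer = row_key("web-app", 1_500_000_060_000)
print(sorted([older, newer]))  # the newer event sorts first
```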
In this project, our goal is to build an argument for a generalized streaming architecture for reactive data ingestion, based on a microservice architecture.
Streaming architectures (Lambda & Kappa)
MQTT and the IoT
Deciding between MQTT-Spark Streaming and Kafka-Spark Streaming
Using Kafka as a data hub for streaming architecture
HBase and Spark Integration using the Spark HBase connector
and more...
General architecture for building IoT infrastructure
In this project, we will use two NoSQL databases (HBase and MongoDB) to store Yelp business attributes, and also learn how to retrieve this data for processing or querying.
Why store data in a NoSQL database
Revisit NoSQL databases concepts
Storing sparse business attributes in HBase
Storing sparse business attributes in MongoDB
Integrating Hive and NoSQL databases for data retrieval using query
and more...
Data Engineering on Yelp Dataset - NoSQL Storage
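The appeal of HBase or MongoDB for Yelp's business attributes is that the data is sparse: most businesses set only a handful of the hundreds of possible attributes, and a column-family or document store keeps only what exists. A dict-based pure-Python sketch of that storage model, with made-up row keys and column names:

```python
# Dict-based sketch of sparse attribute storage: like an HBase row or a
# MongoDB document, a business stores only the attributes it actually
# has -- there are no NULL-filled columns.
table = {}  # row_key -> {column: value}

def put(row_key, column, value):
    table.setdefault(row_key, {})[column] = value

def get(row_key, column, default=None):
    return table.get(row_key, {}).get(column, default)

put("biz_001", "attr:WiFi", "free")
put("biz_001", "attr:GoodForKids", True)
put("biz_002", "attr:DogsAllowed", False)  # biz_002 never sets WiFi

print(get("biz_001", "attr:WiFi"))         # free
print(get("biz_002", "attr:WiFi", "n/a"))  # n/a -- simply absent
```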
In this project, we are going to do network analysis using a graph database so that we can find patterns in how a social network affects business reviews and ratings.
Introducing key terminology in graph databases
A short introduction to Cypher
Spark-Neo4j connector
Introduction to Spark GraphX
Data analysis using GraphX and Neo4j
Yelp data processing using Spark and Neo4j
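Before reaching for GraphX or Cypher, the core idea of the analysis can be shown with a plain adjacency list: model friendships as an undirected graph and compute each reviewer's degree, the simplest measure of network influence. The users and edges below are invented:

```python
from collections import defaultdict

# Invented friendships standing in for the Yelp social graph. Degree
# (friend count) is the simplest "network influence" measure one might
# later correlate with review counts and ratings.
friendships = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("cat", "dan")]

adjacency = defaultdict(set)
for a, b in friendships:  # undirected: record both directions
    adjacency[a].add(b)
    adjacency[b].add(a)

degrees = {user: len(friends) for user, friends in adjacency.items()}
print(degrees)  # cat is the most connected reviewer
```

GraphX's `degrees` operator and Neo4j's `MATCH`-and-count queries compute the same quantity at scale.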
In this big data project, we will continue from the previous Hive project, "Data engineering on Yelp Datasets using Hadoop tools", and do the entire data processing using Spark.
Doing data processing using Spark
Normalizing and denormalizing datasets into Hive tables
Various ways of integrating Hive and Spark
Various complex data structures in Hive through Spark
Exporting some of the processed datasets to RDBMS
Yelp Data Processing Using Spark And Hive Part 1
In this project, we are going to continue the series on data engineering by discussing and implementing various ways to solve the Hadoop small file problem.
What the small file problem in Hadoop is
How it arises (batch and streaming modes)
Solution (Streaming): Using Flume
Solution (Streaming): Preprocessing and storing in a NoSQL database
Solution (Batch): Merging before storing in HDFS
and more...
Solving the Hadoop Small File Problem
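The batch-side fix, merging before storing in HDFS, amounts to concatenating many small inputs into one container while keeping an index so each original file stays addressable; SequenceFiles and HAR archives apply the same idea at HDFS scale. A minimal in-memory sketch with made-up file names and contents:

```python
import io

# Concatenate many small "files" into one buffer, keeping an index of
# (offset, length) per name so each original stays readable on its own.
small_files = {"log_a.txt": b"alpha", "log_b.txt": b"bravo-bravo", "log_c.txt": b"c"}

merged = io.BytesIO()
index = {}
for name, data in small_files.items():
    index[name] = (merged.tell(), len(data))
    merged.write(data)

def read_back(name):
    offset, length = index[name]
    merged.seek(offset)
    return merged.read(length)

print(read_back("log_b.txt"))  # b'bravo-bravo'
```

One large file instead of thousands of tiny ones means one HDFS block mapping and far less NameNode memory pressure.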
In this Spark project, we will discuss real-time monitoring of taxis in a city. The real-time data stream will be simulated using Flume, and ingestion will be done using Spark Streaming.
Key-Value (NoSQL) Databases
Using Redis as a pub/sub message-oriented middleware
Using Redis as a caching server/persistence store
Streaming data with Flume/Spark integration
Real-time processing and display of streamed data on a "dashboard".
and more...
Real-time Auto Tracking with Spark-Redis
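The two Redis roles in this project, pub/sub middleware and cache, can be mimicked in a few lines of plain Python to show the data flow the dashboard relies on. Channel names, taxi IDs and coordinates are invented; real Redis would use PUBLISH/SUBSCRIBE and SET/GET:

```python
from collections import defaultdict

# In-memory mimic of Redis as both a pub/sub hub (taxi position updates
# fan out to subscribers) and a key-value cache (latest position per taxi).
subscribers = defaultdict(list)  # channel -> list of callbacks
latest_position = {}             # cache: taxi_id -> (lat, lon)

def subscribe(channel, callback):
    subscribers[channel].append(callback)

def publish(channel, message):
    for callback in subscribers[channel]:
        callback(message)

# the "dashboard" subscribes and keeps the cache current
subscribe("taxi:updates", lambda m: latest_position.update({m["taxi"]: m["pos"]}))

publish("taxi:updates", {"taxi": "T-17", "pos": (45.46, 9.19)})
publish("taxi:updates", {"taxi": "T-17", "pos": (45.47, 9.20)})
print(latest_position["T-17"])  # the most recent position wins
```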
In this Hadoop project, you will use a sample application log file from an application server to demonstrate a scaled-down server log processing pipeline.
The benefits of log-mining in certain industries
A full log-mining application use-case
Using Flume to ingest log data
Using Spark to process data
Integrating Kafka for complex-event alerting
and more...
Web Server Log Processing using Hadoop
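Whatever the transport (Flume into Spark, Kafka out), the first processing step is parsing raw lines into structured fields. A sketch using the common Apache/NCSA access-log layout; the sample line is invented:

```python
import re

# Parse one line of the common Apache/NCSA access-log layout into named
# fields; downstream steps (aggregation, alerting) consume these records.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)

line = '10.0.0.5 - - [12/Mar/2017:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5120'
record = LOG_PATTERN.match(line).groupdict()
print(record["ip"], record["method"], record["path"], record["status"])
```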
In this project, we will analyze the level and strength of interactions between the different coverage areas of a telecom provider in the city of Milan.
Introduction to Graphs
Introduction to Spark GraphX
Building a graph structure from our dataset
Use of Spark GraphX graph operators
Graph Visualization
and more...
Analysis of Community Interactions using Spark GraphX
In this Hive project, we will build a Hive data warehouse from a raw dataset stored in HDFS and present the data in a relational structure so that querying the data will be natural.
How to run Hive queries on Spark
Hadoop data warehousing with Hive
Using the interactive Scala Build Tool (sbt) with Spark
Data serialization with a Kryo serialization example
Performance optimization using caching
and more...
Building a Data Warehouse using Spark on Hive
In this project, we'll work with Apache Airflow and write a scheduled workflow that downloads data from Wikipedia archives, uploads it to S3, processes it in Hive and finally analyzes it in Zeppelin notebooks.
Workflows and their uses
Apache Airflow
Working with Qubole and S3
Hive table creation and data processing
Charting via Zeppelin Notebooks
Visualise Daily Wikipedia Trends using Hive, Zeppelin Notebooks and Airflow
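Airflow's core abstraction is a DAG of tasks run in dependency order. The scheduling idea can be sketched with a toy topological sort; the task names mirror this project's pipeline, but the code is not Airflow's API:

```python
# Toy stand-in for Airflow's scheduler: each task runs only after all of
# its dependencies have run. Task names mirror this project's pipeline.
depends_on = {
    "download_wikipedia_dump": [],
    "upload_to_s3": ["download_wikipedia_dump"],
    "process_in_hive": ["upload_to_s3"],
    "analyze_in_zeppelin": ["process_in_hive"],
}

def run_order(dag):
    done, order = set(), []
    while len(order) < len(dag):  # assumes the graph is acyclic
        for task, deps in dag.items():
            if task not in done and all(d in done for d in deps):
                done.add(task)
                order.append(task)
    return order

print(run_order(depends_on))
```

In real Airflow, the same dependencies would be declared with operators and `set_upstream`/`>>`, and the scheduler would also handle retries, backfills and timetables.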
Analyze clickstream data of a website using Hadoop Hive to increase sales by optimizing every aspect of the customer experience on the website from the first mouse click to the last.
Analyzing JSON data and loading the JSON format into Hive
Creating a schema for the fields in the table
Creating queries to set up the EXTERNAL TABLE in Hive
Creating a new table to copy the data into
Creating a query to populate and filter the data
and more...
Hive Project - Visualising Website Clickstream Data with Apache Hadoop
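The Hive steps above, parsing JSON, defining a schema, then filtering into a new table, can be previewed in pure Python on a few invented click events:

```python
import json

# Invented click events; parsing then filtering/projecting them mirrors
# what the EXTERNAL TABLE plus INSERT ... SELECT queries do in Hive.
raw_lines = [
    '{"user": "u1", "page": "/product/42", "action": "click"}',
    '{"user": "u2", "page": "/home", "action": "view"}',
    '{"user": "u1", "page": "/checkout", "action": "click"}',
]

events = [json.loads(line) for line in raw_lines]
clicks = [(e["user"], e["page"]) for e in events if e["action"] == "click"]
print(clicks)
# [('u1', '/product/42'), ('u1', '/checkout')]
```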
Use the Hadoop ecosystem to glean valuable insights from the Yelp dataset. You will analyze the different patterns that can be found in the Yelp dataset to come up with various approaches to solving a business problem.
Analyzing JSON data and loading the JSON format into Hive
Creating a schema for the fields in the table
Creating queries to set up the EXTERNAL TABLE in Hive
Creating a new table to copy the data into
Creating a query to populate and filter the data
and more...
Data Mining Project on Yelp Dataset using Hadoop Hive