In this big data project, you will use Hadoop, Flume, Spark and Hive to process the Web Server logs dataset to glean more insights on the log data.
Schedule 60-minute live interactive 1-to-1 video sessions with experts.
Unlimited number of sessions with no extra charges. Yes, unlimited!
Give us 72 hours' notice with a problem statement so we can match you to the right expert.
Schedule recurring sessions weekly, bi-weekly, or monthly.
If you find a favorite expert, schedule all future sessions with them.
250+ end-to-end project solutions
Each project solves a real business problem from start to finish. These projects cover the domains of Data Science, Machine Learning, Data Engineering, Big Data and Cloud.
15 new projects added every month
New projects every month to help you stay up to date with the latest tools and techniques.
500,000 lines of code
Each project comes with verified and tested solutions including code, queries, configuration files, and scripts. Download and reuse them.
600+ hours of videos
Cloud Lab Workspace
Unlimited 1:1 sessions
Technical Support
Chat with our technical experts to solve any issues you face while building your projects.
7 Days risk-free trial
We offer an unconditional 7-day money-back guarantee. Use the product for 7 days, and if you don't like it, we will issue a full refund. No terms or conditions.
Payment Options
0% interest monthly payment schemes available for all countries.
Agenda
Data processing is a crucial step in understanding any data. In this video series, we explore the features and techniques Hadoop provides for storing data in a distributed manner on HDFS. We then use Hive to create a projection of the data stored in HDFS, Flume to ingest data from external systems into HDFS, and Spark with Scala to process and transform the NASA log data to gain insights.
Aim:
In this project, you will use Hadoop, Flume, Spark and Hive to process the Web Server logs dataset and gain more insight into the log data. As part of this, you will create an Azure virtual machine and install Hadoop, Flume, Spark, Scala and Hive. You will then run a Flume agent, build Scala code, submit Spark jobs, and run Hive queries against the dataset.
Data Format:
The logs are an ASCII file with one line per request, with the following columns:
Host making the request: a hostname when possible, otherwise the Internet address.
Timestamp in the format "DAY MON DD HH:MM:SS YYYY", where DAY is the day of the week, MON is the name of the month, DD is the day of the month, HH:MM:SS is the time of day using a 24-hour clock, and YYYY is the year. The timezone is -0400.
Request given in quotes.
HTTP reply code.
Bytes in the reply.
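As a rough illustration of how these columns might be pulled out of a line in Scala, here is a minimal sketch. It assumes each line follows a Common Log Format style layout (host, two placeholder fields, a bracketed timestamp, the quoted request, the reply code, and the byte count, where a missing byte count is written as "-"). That layout, the LogRecord type and the LogLineParser object are assumptions made for illustration and are not taken from the project code. The timestamp is kept as a plain string rather than parsed.

```scala
// Hypothetical record type; field names are illustrative only.
final case class LogRecord(
  host: String,
  timestamp: String, // kept as raw text, not parsed into a date
  request: String,
  status: Int,
  bytes: Long
)

object LogLineParser {
  // Assumed layout: host - - [timestamp] "request" status bytes
  private val LinePattern =
    """^(\S+) \S+ \S+ \[([^\]]+)\] "(.*)" (\d{3}) (\d+|-)$""".r

  // Returns None for lines that do not match the assumed layout.
  def parse(line: String): Option[LogRecord] = line.trim match {
    case LinePattern(host, ts, request, status, bytes) =>
      // Treat a "-" byte count as zero bytes (an assumption).
      val size = if (bytes == "-") 0L else bytes.toLong
      Some(LogRecord(host, ts, request, status.toInt, size))
    case _ => None
  }

  def main(args: Array[String]): Unit = {
    // Sample line made up for this sketch.
    val sample =
      """example.host.net - - [01/Jul/1995:00:00:01 -0400] "GET /index.html HTTP/1.0" 200 6245"""
    println(parse(sample))
  }
}
```

Returning an Option lets malformed lines be dropped or counted separately instead of failing the whole job, which tends to matter with raw server logs.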
Tech Stack
➔ Language: Scala
➔ Services: Microsoft Azure, Hadoop, Hive, Flume, Spark
Scala
Scala is a multi-paradigm, general-purpose, high-level programming language. It is object-oriented and also supports functional programming. Scala code compiles to Java bytecode and runs on the Java Virtual Machine (JVM). The name is short for "scalable language", and Scala can also target JavaScript runtimes through Scala.js.
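To make the mix of object-oriented and functional styles concrete, here is a small sketch in plain Scala with no Spark dependency. The Hit case class, the StatusSummary object and the idea of summarising log hits by status code are hypothetical examples chosen for this page's dataset, not code from the project.

```scala
// Hypothetical, simplified record; a case class is an immutable data object.
final case class Hit(host: String, status: Int, bytes: Long)

object StatusSummary {
  // Functional style: group hits by status code, then map each group to its size.
  def countByStatus(hits: Seq[Hit]): Map[Int, Int] =
    hits.groupBy(_.status).map { case (status, group) => status -> group.size }

  // Sum the bytes of successful (2xx) replies with filter, map and sum.
  def bytesForSuccess(hits: Seq[Hit]): Long =
    hits.filter(h => h.status >= 200 && h.status < 300).map(_.bytes).sum

  def main(args: Array[String]): Unit = {
    // Sample values made up for this sketch.
    val hits = Seq(Hit("a", 200, 100L), Hit("b", 404, 0L), Hit("a", 200, 50L))
    println(countByStatus(hits))
    println(bytesForSuccess(hits))
  }
}
```

Spark's Scala API offers similarly named operations such as filter and map on distributed data, so this collection style carries over reasonably well when moving to Spark jobs.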
Hive
Apache Hive is a fault-tolerant distributed data warehouse that allows for massive-scale analytics. Using SQL, Hive allows users to read, write, and manage petabytes of data. Hive is built on top of Apache Hadoop, an open-source platform for storing and processing large amounts of data. As a result, Hive is inextricably linked to Hadoop and is designed to process petabytes of data quickly. Hive is distinguished by its ability to query large datasets with a SQL-like interface utilizing Apache Tez or MapReduce.
Flume
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its architecture is simple and flexible, based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications.
Approach
Sign in to the Microsoft Azure account
Create a virtual machine
Select the tab to create a new VM
Add the basic configuration details to create a VM instance
Connect to the virtual machine
Download and install the PuTTY application
Add the configuration details
Install Hadoop, Hive, Flume, Scala and Spark
Execute the scripts in code.Zip step by step