“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.” — Grace Hopper, American computer scientist (in reference to big data)
Google's developers seemed to take this quote seriously when they published their research paper on GFS (Google File System) in 2003. Little did anyone know that this paper would change how we perceive and process data. From it spawned the big data legend - Hadoop - and its capability for processing enormous amounts of data.
All movie buffs know how the hero of a movie rises above the odds and takes everything by storm. Such is the story of the elephant in the big data room - “Hadoop”. Surprised? Yes, Doug Cutting named the Hadoop framework after his son’s tiny toy elephant. Development originally started in the Apache Nutch project but was later moved into the Hadoop sub-project. Since then, it has been evolving continuously and changing the big data world.
It has been 10 years since Hadoop first disrupted the big data world, but many are still unaware of how much this technology has changed the data analysis scene. We wanted to go back to the very basics of Hadoop and explain it as plainly as possible. It is critical that you understand what Hadoop is, what it does, and how it works before you decide to steer your career in that direction.
Without further ado, let’s begin with Hadoop explained in detail.
What is Hadoop?
Like we said, we will go back to the very basics and answer all the questions you have about this big data technology - Hadoop. To keep things simple, imagine that you have a file whose size is greater than the total storage capacity of your system. It would be impossible to store that file in that single storage space. Hadoop is a framework that lets users store such huge files (greater than any single PC’s capacity) by splitting them into blocks and spreading those blocks across many machines.
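To picture how a file larger than any single disk can still be stored, here is a minimal Python sketch. This is not Hadoop code - just an illustration of the idea: split the data into fixed-size blocks and spread them across machines, the way HDFS splits files into blocks across DataNodes (the tiny block size and node names here are made up for readability).

```python
# Illustrative only: mimics how HDFS splits a big file into blocks
# and assigns each block to a different machine (DataNode).
BLOCK_SIZE = 8  # HDFS defaults to 128 MB; 8 bytes keeps the demo readable

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut the data into fixed-size blocks, like HDFS does with files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes):
    """Assign blocks to nodes round-robin. Real HDFS also replicates each
    block (3 copies by default) so a failed node loses nothing."""
    placement = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        placement[nodes[i % len(nodes)]].append(block)
    return placement

data = b"a file far too large to fit on any single machine"
blocks = split_into_blocks(data)
placement = place_blocks(blocks, ["node1", "node2", "node3"])

# Reassembling the blocks in order recovers the original file.
assert b"".join(blocks) == data
```

No single "node" holds the whole file, yet the file is fully recoverable - which is the core trick that makes storage scale with the number of machines.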
Hadoop is a collection of open source libraries for processing large data sets ("large" here means something on the scale of the 4 million search queries Google handles per minute) across clusters of thousands of computers. In earlier days, organizations had to buy expensive hardware to attain high availability. Hadoop has overcome this dependency, as it does not rely on hardware but instead achieves high availability and detects points of failure through software itself.
As we all know, a blockbuster movie requires a strong lead role but it also requires promising supporting actors as well. So, let’s have a look at the four important libraries of Hadoop, which have made it a super hero-
- Hadoop Common – The role of this character is to provide common utilities that can be used across all modules.
- Hadoop MapReduce - The right hand of our actor, carrying out all the work assigned to it, i.e. job scheduling and processing across the cluster. Hadoop acts like a data warehousing system, so it needs a library like MapReduce to actually process the data.
- Hadoop Distributed File System (HDFS) – The left hand, which maintains all the records i.e. file system management across the cluster.
- Hadoop YARN – Introduced in Hadoop 2.0, YARN takes over the cluster resource management and job scheduling duties that MapReduce used to handle, leaving MapReduce to focus purely on processing the data.
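To make the star of the cast, MapReduce, less abstract, here is a tiny Python sketch of the classic word-count job. This is a single-machine simulation of the map, shuffle, and reduce phases, purely for illustration - in a real Hadoop cluster each phase runs in parallel across many nodes, and the functions are typically written in Java:

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does automatically between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all counts for one word into a total."""
    return key, sum(values)

lines = ["hadoop stores big data", "hadoop processes big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts["hadoop"] == 2 and counts["big"] == 2
```

The split into independent map and reduce steps is exactly what lets Hadoop scatter the work across thousands of commodity machines.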
Hadoop has also given birth to countless other innovations in the big data space. Apache Spark, which grew out of the Hadoop ecosystem, has been the most talked about of these technologies. Hadoop and Spark are the most talked about pairing in the big data world in 2016.
Read more about the connection between Hadoop vs Spark.
If you would like more information about Big Data careers, please click the orange "Request Info" button on top of this page.
Why use Hadoop?
Hadoop is used where a large amount of data is generated and your business requires insights from that data. The power of Hadoop lies in its framework: virtually any software, from data ingestion tools to visualization tools, can be plugged into it. It can be extended from one system to thousands of systems in a cluster, and these systems can be low-end commodity machines. Hadoop does not depend upon hardware for high availability. The two primary answers to the question “Why use Hadoop?” are –
- The cost savings with Hadoop are dramatic when compared to the legacy systems.
- It has a robust community support that is evolving over time with novel advancements.
What is Hadoop used for?
Hadoop has become the go-to big data technology because of its power for processing large amounts of semi-structured and unstructured data. Hadoop is not, however, known for its processing speed when dealing with small data sets.
If you are wondering what Hadoop is used for, or the circumstances under which using Hadoop is helpful, here’s the answer-
- Hadoop is used in big data applications that gather data from disparate data sources in different formats. HDFS is flexible in storing diverse data types, whether your data contains audio or video files (unstructured), record-level data as in an ERP system (structured), or log files and XML files (semi-structured). Hadoop is used in big data applications that have to merge and join data - clickstream data, social media data, transaction data or any other data format.
- In large-scale enterprise projects that require clusters of servers, where specialized data management and programming skills are limited and implementations are a costly affair, Hadoop can be used to build an enterprise data hub for the future.
- Do not make the mistake of using Hadoop when your data is just too small, say in megabytes or gigabytes. To achieve high scalability and to save both money and time, Hadoop should be used only when the datasets run to terabytes or petabytes; otherwise it is better to use Postgres or Microsoft Excel.
Hadoop cannot be an out-of-the-box solution for all big data problems and should be best used in applications that can make the most of its capability to store voluminous amount of data at an economical cost. If your data is too small or is sensitive then using Hadoop might not be an ideal choice.
How does Hadoop work?
As mentioned in the prequel, Hadoop is an ecosystem of libraries, and each library has its own dedicated tasks to perform. HDFS writes data once to the server and then reads and reuses it many times. Compared with the continuous read and write cycles of other file systems, this write-once, read-many pattern is the source of the speed with which Hadoop works, and it makes HDFS a strong solution for dealing with voluminous, varied data.
The Job Tracker is the master node which manages all the Task Tracker slave nodes and executes the jobs. Whenever some data is required, a request is sent to the NameNode, which is the master node (the smart node of the cluster) of HDFS and manages all the DataNode slave nodes. The request is passed on to the relevant DataNodes, which serve the required data. There is also the concept of a heartbeat in Hadoop: every slave node periodically sends a heartbeat to its master node as an indication that the slave node is alive.
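The heartbeat idea can be sketched in a few lines of Python. This is an illustration of the concept only, not Hadoop's actual implementation (the class, names, and the 3-second timeout are all made up; Hadoop's real intervals differ): the master records when each slave last checked in and treats any slave that has been silent too long as dead.

```python
# Conceptual sketch of the Hadoop heartbeat mechanism (not real Hadoop code).
TIMEOUT = 3.0  # arbitrary demo value; Hadoop's real timeouts are configurable

class Master:
    def __init__(self):
        self.last_heartbeat = {}

    def receive_heartbeat(self, slave, now):
        """Called whenever a slave node sends its periodic 'I am alive' signal."""
        self.last_heartbeat[slave] = now

    def live_slaves(self, now):
        """Slaves heard from within the timeout window are considered alive;
        the master reassigns work away from the others."""
        return {s for s, t in self.last_heartbeat.items() if now - t <= TIMEOUT}

master = Master()
master.receive_heartbeat("datanode1", now=0.0)
master.receive_heartbeat("datanode2", now=0.0)
master.receive_heartbeat("datanode1", now=5.0)  # datanode2 has fallen silent
# At t=6.0 only datanode1 is still considered alive.
```

This simple check is what lets Hadoop detect failures in software, rather than relying on expensive fault-tolerant hardware.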
MapReduce or YARN are used for scheduling and processing. Hadoop MapReduce executes a sequence of jobs, where each job is a Java application that runs on the data. Instead of writing MapReduce jobs directly, using querying tools like Pig and Hive gives data hunters strong power and flexibility.
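To see why tools like Hive are so convenient, consider what the query layer does for you. A Hive query such as `SELECT word, COUNT(*) FROM words GROUP BY word` (the table name `words` is hypothetical here) compiles down to MapReduce jobs behind the scenes, so you never write the map and reduce functions yourself. In plain Python the same aggregation is one expression - sketched below purely as an analogy, not as Hive code:

```python
from collections import Counter

# What a GROUP BY / COUNT(*) query conceptually asks for:
# tally how often each key appears. Hive turns such a declarative
# query into MapReduce jobs automatically.
words = ["hadoop", "hive", "hadoop", "pig", "hive", "hadoop"]
counts = Counter(words)  # e.g. counts["hadoop"] is 3
```

The appeal is the same in both cases: you declare *what* you want, and the tool figures out *how* to compute it across the cluster.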
How to use Hadoop?
Every movie has a fascinating story, but it is the job of the director to make the best use of the cast. The same applies to the elephant in the big data room: Hadoop can be used in various ways, and it is up to the Data Scientist, Business Analyst, Developer and other big data professionals to decide how they would like to harness its power. To truly harness the power of Hadoop, professionals should learn everything about the Hadoop ecosystem and master the skillset. Organizations that lack highly skilled Hadoop talent can make use of Hadoop distributions from top big data vendors like Cloudera, Hortonworks or MapR.
Want to know more about the various Hadoop Distributions you can exploit? Click Here
What does Hadoop do?
Well, being a versatile actor, Hadoop can fit into many roles depending on the script of the movie (the business needs). It can be used for product recommendations, identifying diseases, fraud detection, building indexes, sentiment analysis, infrastructure management, energy savings, online travel, etc. Hadoop distributes a job across the cluster and gets it done in very limited time, and that too on a cluster of commodity hardware, saving both time and money - the ultimate goal of any business.
Hadoop's uses apply to diverse markets - whether a retailer wants to deliver effective search answers to a customer's query or a financial firm wants to do accurate portfolio evaluation and risk analysis, Hadoop can address all these problems. Today, the whole world is crazy about social networking and online shopping. So, let's take a look at Hadoop uses from these two perspectives.
When scrolling through your Facebook news feed, you see a lot of relevant advertisements pop up, based on the pages you have visited. Facebook also collects data from other mobile apps installed on your smartphone and gives you suggestions on your Facebook wall based on your browsing history.
Hadoop is used extensively at Facebook, which stores close to 250 billion photos, with 350 million new photos uploaded every day. Facebook uses Hadoop in multiple ways-
- Facebook uses Hadoop and Hive to generate reports for advertisers that help them track the success of their advertising campaigns.
- Facebook's Messaging app runs on top of HBase, Hadoop's NoSQL database
- Facebook uses Hive Hadoop for faster querying on various graph tools.
Retail giants like Walmart, Amazon, and Nordstrom collect data about customers' browsing history, location, IP addresses, items viewed, etc. Take a scenario where you are looking at an iPhone on a website: it will show other items like iPhone cases, screen protectors, etc., based on patterns derived from other customers who viewed the same items and purchased them.
Social media and retail are not the only industries where Hadoop is implemented; other industries are extensively leveraging the power of Hadoop - healthcare, banking, insurance, finance, gas plants, manufacturing, etc.
Here are some best picks from DeZyre Hadoop blog on various Hadoop Uses –
Who uses Hadoop?
There are several companies using Hadoop across myriad industries and here’s a quick snapshot of the same –
- Caesars Entertainment is using Hadoop to identify customer segments and create marketing campaigns targeting each of the customer segments.
- Chevron uses Hadoop to support its service that helps its consumers save money on their energy bills every month.
- AOL uses Hadoop for statistics generation, ETL style processing and behavioral analysis.
- eBay uses Hadoop for search engine optimization and research.
- InMobi uses Hadoop on 700 nodes with 16800 cores for various analytics, data science and machine learning applications.
- Skybox Imaging uses Hadoop to store and process images to identify patterns in geographic change.
- Tinder uses Hadoop to “Swipe Right” on behavioral analytics to create personalized matches.
- Apixio uses Hadoop for semantic analysis so that doctors can have better answers to the questions related to patient’s health.
The list of companies using Hadoop is huge and here’s an interesting read on 121 companies using Hadoop in the big data world-
This blog post is just an overview of the growing Hadoop ecosystem that handles all modern big data problems. The need for Hadoop is no longer in question; the only question now is how one can make the best of it. Learning Hadoop can be the best career move in 2016. If you think Hadoop is the right career for you, then you can talk to one of our career counselors about how to get started on the Hadoop learning path.
Want to know how much a Hadoop Developer earns at various companies? CLICK HERE