Apache Hadoop turns 10: The Rise and Glory of Hadoop

It is hard to believe that the first Hadoop cluster went into production at Yahoo ten years ago, on January 28th, 2006. Back then, nobody knew that an open source technology like Apache Hadoop would set off a revolution in the world of big data. Since 2006, interest in Hadoop has grown rapidly, making it a cornerstone technology for businesses that want to power world-class products and deliver a best-in-class user experience. We might be a bit late, but it is still worth wishing the poster child for big data analytics a belated Happy Birthday! As January 28th, 2016 marks Hadoop’s tenth birthday, DeZyre takes the opportunity to celebrate the remarkable presence of the tiny toy elephant in the big data community by highlighting the milestones achieved and the momentum Hadoop has gained in the last ten years.

With more than 1.7 million lines of code, 12,000+ commits and 800 contributors since 2006, Hadoop has passed several milestones that mark the growth of this revolutionary technology from 2006 to 2016. Hadoop has gained stardom in the IT industry because of two important factors: the tidal wave of big data, and the Apache open source license, which makes it available to anyone free of cost.

In 10 years, a large community has grown around Hadoop, which now runs much of what happens behind the scenes on the web: choosing the ads that show up in a search engine, the stories you see on your Yahoo homepage, and the news feed Facebook assembles based on your browsing history. Hadoop is an innovation in the big data world that has changed how data is processed. It can be used by anyone, from IT giants like IBM and Google to small marketers, for profitable business outcomes.

Hadoop in 2006 - The Hardest Step

Hadoop was born from the open source web crawler project Nutch. Doug Cutting joined Yahoo in 2006 and started a new subproject from Nutch, naming it Hadoop after his young son’s tiny toy elephant. Doug Cutting and Mike Cafarella had only 5 machines to work with, and operating the system required several manual steps. It also offered no reliability; if a machine was lost, its data was lost with it. In 2006, Hadoop was not really capable of handling production search workloads, as it ran on only 5 to 20 nodes without much performance efficiency. It was very difficult to work with, and only people with a fierce passion for coding tried their hand at it.

From 2006 to 2016, Hadoop has grown in every possible way: the number of users, Hadoop developers, related projects, contributors to core Hadoop, total commits, lines of code written, and more.

Milestones in the History of Hadoop from 2007-2015

The number of components in the Hadoop ecosystem has grown from 2 in 2006 to 25+ in 2016. In 2007, with the work of several Yahoo engineers and the use of several thousand computers, Hadoop became a reliable and comparatively stable system for processing petabytes of data on commodity hardware. Hadoop 0.14.1 was released on 4 September 2007.

Hadoop became a top-level Apache project in 2008 and also won the Terabyte Sort Benchmark. In July 2008, Yahoo’s Hadoop cluster sorted 1 TB of data in 209 seconds, breaking the previous record of 297 seconds. This was a major milestone in the history of Hadoop, as it was the first time that an open source, Java-based project had won that benchmark.
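To put the two benchmark times in perspective, the short Python sketch below converts them into aggregate sort throughput. It is only back-of-the-envelope arithmetic, and it assumes a decimal terabyte (10^12 bytes); the function and variable names are illustrative and do not come from the benchmark itself.

```python
# Back-of-the-envelope comparison of the two terabyte sort times.
# Assumption: 1 TB is taken as 10**12 bytes (decimal terabyte).
TERABYTE_BYTES = 10**12


def throughput_gb_per_s(seconds, data_bytes=TERABYTE_BYTES):
    """Return the aggregate sort throughput in gigabytes per second."""
    return data_bytes / seconds / 10**9


previous_record = throughput_gb_per_s(297)  # earlier record: 297 seconds
hadoop_record = throughput_gb_per_s(209)    # Yahoo's Hadoop run: 209 seconds
speedup = 297 / 209                         # ratio of the two run times

print(f"Previous record: {previous_record:.2f} GB/s")
print(f"Hadoop record:   {hadoop_record:.2f} GB/s")
print(f"Speedup:         {speedup:.2f}x")
```

Under that assumption, the Hadoop run works out to roughly 4.8 GB/s across the whole cluster, about 1.4 times the throughput of the previous record.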

In February 2008, Yahoo launched the world’s largest Apache Hadoop production application, the Yahoo Search Webmap, which ran on a Linux cluster with more than 10,000 cores. The web map data contained approximately 1 trillion links between pages in the index. This was another major milestone in the history of Hadoop, as it proved that even at an early stage of development Hadoop was capable of handling web-scale projects at an economical cost. Today, the results returned for the billions of search queries executed at Yahoo every month depend on data produced by Hadoop clusters.

2008 also made its mark in the history of Hadoop with two new sub-projects: Pig, developed at Yahoo, and Hive, developed by the Data Infrastructure Team at Facebook. The intent was to bring the concepts of rows, columns, partitions, tables and something SQL-like to the world of Hadoop while maintaining its flexibility and extensibility. Hive provided a SQL access framework for Hadoop, so that people without a programming background could make the best use of it. A few months later, the Apache Pig sub-project was added to the Hadoop ecosystem to ease programming for people who were not familiar with MapReduce concepts and to optimize execution.

With the vision of bringing Hadoop and related sub-projects to traditional enterprises, Cloudera was founded in 2008 to commercialize Hadoop and help organizations develop enterprise-grade big data solutions. Today, Cloudera is a leading innovator and one of the largest contributors in the Apache open source community. Doug Cutting joined Cloudera as Chief Architect in 2009. To overcome the single point of failure (SPOF) problem, a new feature named HDFS High Availability (HA) NameNode was added to the Hadoop trunk in March 2012, providing support for deploying two NameNodes in an active/passive configuration. To address the criticisms of Hadoop’s scaling limitations and job handling, YARN was developed in August 2012.

In February 2014, Apache Spark (dubbed the Swiss Army Knife of Big Data) graduated from an Incubator project to a top-level project at the Apache Software Foundation. It is recognized for its ease of use and for its speed, with programs reported to run up to 100x faster than on Hadoop MapReduce. To complement the existing storage options, HBase and HDFS, Cloudera launched Kudu, a new Hadoop storage engine for faster analytics on fast data. Kudu aims to combine the capabilities of HDFS and HBase by providing efficient columnar scans along with fast updates and inserts to support common real-time use cases.

Hadoop has been a game changer since 2006, helping developers process big data faster and build better methods for page layout, advertising, spell-checking and more. With several sub-projects like Pig, Hive, HBase, Impala, Kudu and others, Hadoop is today, and will continue to be, the go-to technology for easy and affordable analysis of big data.

Ten years on, Yahoo still claims to have the largest Hadoop deployment in the world: over 35,000 Hadoop servers hosted across 16 clusters, with a combined 600 petabytes of storage capacity, executing close to 34 million compute jobs every month.

Hadoop Predictions for 2016

Gartner’s Merv Adrian said- "Despite investment in training programs, broad availability of Hadoop administration skills (value enablers) and analytics skills (value creators) is 2-3 years out. Don't underestimate this uphill climb. Next, applications taking advantage of the Hadoop ecosystem need to emerge. To this point, the work done on Hadoop has effectively been artisanal data management—everything is built by hand. That's fine if you've got the skills, but most enterprises don't build software—they buy it. Hadoop can be a great place for those applications to reside."

Because expert talent is scarce, enterprises still face a 2 to 3-year ramp-up before they are up to speed with Hadoop skills. As interest in and adoption of traditional technologies like Oracle falls, the demand for Hadoop professionals is increasing at a highly accelerated pace. With companies like Uber and Facebook offering a glimpse of the future by making the best use of Hadoop’s power, the future of Hadoop is bright, with many more milestones to be achieved in the next 10 years. As more companies bake Hadoop analytics into their products, the demand for Hadoop developers is expected to increase over time.

Advice for Developers interested in working with Hadoop

With the unfolding of the data century, Hadoop is a good fit for tasks ranging from machine learning to real-time querying and analysis of large amounts of data. All this is thanks to the big data community: the contributors, the committers and everyone who has helped Hadoop grow at a rapid pace and made it the de facto standard for big data processing. As big data trends continue in 2016, Hadoop will continue to make a remarkable impact in the big data world, with many more advanced tools and enhancements to meet new challenges.

2016 is the best time to get started on learning Hadoop and building applications on it. Hadoop has come a long way in 10 years, with several complementary projects and tools that form an integral part of the Hadoop ecosystem. With the combination of Hadoop and Spark enabling novel business use cases for the enterprise, at a lower cost of storing and processing big data, Hadoop and Spark skills will be in huge demand for the next 5 years.

Excited to see what the next 10 years have in store for Apache Hadoop? Learn Hadoop to become a contributor!



