“What is Hadoop?” might seem a simple question, but the answer is not so simple, because over time Hadoop has grown into a complex ecosystem of competing and complementary projects. The path to learning Hadoop is steep, and using the Hadoop framework successfully is not easy either. Hadoop is an interesting and powerful framework that makes even big data look small through faster data processing, and it can fill many different roles in an enterprise depending on the kind of data involved. This article looks at how to get real value from Hadoop and helps you understand where it fits in business operations.
“Too big for Excel is not ‘Big Data’,” said Chris Stucchio, data scientist.
Hadoop: Data Darling of Businesses
When businesses interested in big data analytics ask how to get started, they are often advised to begin with Hadoop, the Apache Software Foundation’s open source distributed computing framework.
Hadoop has become the data darling of small, medium and large businesses alike, but there are a few important questions most companies forget to ask when adopting Hadoop as an enterprise solution for big data analysis:
- Is the data big enough for Hadoop? Does the business have several petabytes of data or more?
- Is Hadoop a good fit for the way the business needs to use big data analysis?
- Does the business actually have big data problems that call for Hadoop?
- Is the business going to have a steady influx of data?
- How much data will the business operate on?
Hadoop is a great framework for processing large data sets, but it is not a silver bullet that fits every business case for big data analysis. There are several good reasons why Hadoop is an attractive option: it offers distributed computing at an economical cost, and it can process the huge amounts of data generated by social media, the Internet of Things, mobile technology and other digital technologies. These advantages, along with high-profile Hadoop implementations at large companies such as Facebook, Amazon, eBay, Walmart and Yahoo, are increasingly driving the adoption of Hadoop among big data analytics companies.
What Hadoop means for not so big “Big Data”
If a business has only a few tens of gigabytes of analytics data, Hadoop is too heavy a tool for the job. Companies should not follow trends blindly and adopt Hadoop as their enterprise solution for big data analytics; they should follow their business requirements instead. Apache Hadoop is designed specifically to tackle big data problems, and if a business does not have big data problems, Hadoop is not the best fit for it.
Before the advent of Hadoop, companies chose to forgo capturing data because there was no feasible way to store and process large volumes of it. Now that Hadoop can help with big data analytics, organizations have started capturing data they previously discarded and using it to solve business problems. With a growing number of businesses showing interest in Hadoop for good reasons, this open source framework has transformed the way data analysts look at storing and processing large and diverse data sets. However, several companies are turning Hadoop into an all-purpose processing platform without considering that it makes more sense for some big data applications than for workloads that only have small datasets to process.
Organizations should not adopt Hadoop if the data they want to process is measured in a few tens of gigabytes. Hadoop in its MapReduce-based incarnation has a few limitations in the way big data applications are programmed and in the speed at which results are obtained. Businesses whose data-driven problems are measured in a few gigabytes should save themselves the hassle of using Hadoop and opt instead for a relational database such as Postgres or a spreadsheet such as Microsoft Excel. To take advantage of Hadoop’s superior scalability and save a considerable amount of money and time, organizations with a really big “Big Data” dataset (measured in terabytes or more) should use Hadoop.
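The rule of thumb in this section can be written down as a small sketch. The cutoffs below are assumptions read off the article’s wording (“a few tens of gigabytes” versus “terabytes or more”), and the function name `suggest_tool` is hypothetical; treat it as an illustration of the reasoning, not a sizing standard.

```python
GB = 10**9
TB = 10**12

def suggest_tool(dataset_bytes: int) -> str:
    """Map a dataset size to the kind of tool this article suggests."""
    # Assumed cutoff for "a few tens of gigabytes".
    if dataset_bytes < 100 * GB:
        return "relational database or spreadsheet"
    # The article gives no guidance for the range in between.
    if dataset_bytes < 1 * TB:
        return "evaluate case by case"
    # "Terabytes or more" is where the article recommends Hadoop.
    return "Hadoop"

for size in (20 * GB, 500 * GB, 5 * TB):
    print(f"{size / GB:.0f} GB -> {suggest_tool(size)}")
```

Volume is only one input to the decision; the structure and complexity of the data matter as well.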
Now, the answer to the much-awaited question: “How much data is considered ‘Big Data’ for using Hadoop?”
The answer to this question is a little tricky, as it varies from company to company and is not determined by the volume of data alone.
“Big Data starts at 1.5 PB because that’s what fits in memory on Blue Waters (one of the most powerful supercomputers),” said Bill Gropp.
For instance, some big data analytics companies might consider 10 terabytes to be big data, whereas for others big data starts at 1 petabyte. Big data need not be big in volume; the term also covers the complexity of processing large amounts of diverse data. Census data, for example, has huge volume but need not be considered big data, because it is structured and can easily be stored in traditional legacy systems. Data collected from social media, on the other hand, is big data (even though it is smaller than census data) because it is unstructured and cannot be stored in relational databases.
23andMe, a personal genetic profiling service, charges a fee of $99 to profile a human genome. The data generated from one individual’s sequenced DNA is approximately 800 MB. That is not a lot, so if somebody told you this was big data, you would laugh it off. However, within that 800 MB of DNA sequence there are close to 4 billion pieces of information and thousands of patterns, which makes 800 MB of data a big processing challenge. One need not have petabytes or exabytes of data to have a big data opportunity.
RedMonk analyst Stephen O’Grady said: “Larger dataset sizes present unique computational challenges. But the structure, workload, accessibility and even location of the data may prove equally challenging.”
Most companies believe that they have a big “Big Data” dataset, but that is usually not the case. Hadoop is designed for petabyte-scale computing, yet most real-world Hadoop jobs process less than 100 gigabytes of input: 90% of Hadoop jobs at Facebook process less than 100 gigabytes of data, and on average Hadoop jobs at Yahoo or Microsoft process less than 14 gigabytes. Jobs of this size can run on a single scale-up server rather than a scale-out cluster.
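As a back-of-the-envelope check on the figures above, the sketch below compares the quoted job sizes with the memory of one large server. The server size (512 GB of RAM) and the usable fraction (half) are illustrative assumptions, not numbers from the article, and `fits_on_one_server` is a hypothetical helper.

```python
# Job input sizes quoted in the article, in gigabytes (upper bounds).
JOB_INPUT_GB = {
    "Facebook (90% of jobs)": 100,
    "Yahoo / Microsoft (average job)": 14,
}

# Assumptions for illustration only: one scale-up server with 512 GB
# of RAM, half of which is available for job data.
SERVER_RAM_GB = 512
USABLE_FRACTION = 0.5

def fits_on_one_server(input_gb, ram_gb=SERVER_RAM_GB, usable=USABLE_FRACTION):
    """Return True if the job input fits in the usable memory of one server."""
    return input_gb <= ram_gb * usable

for label, size_gb in JOB_INPUT_GB.items():
    if fits_on_one_server(size_gb):
        verdict = "a scale-up server is enough"
    else:
        verdict = "needs a scale-out cluster"
    print(f"{label}: {size_gb} GB -> {verdict}")
```

Under these assumptions both quoted job sizes fit comfortably on one machine, while a petabyte-scale input (1,000,000 GB) would not.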
Hadoop holds a lot of promise in the big data market because it provides democratized access to large-scale storage and computational power. However, to realize Hadoop’s full potential, organizations need to apply it to genuinely big datasets. Many mid-sized companies are looking to gain a big data edge and want to boast an architecture similar to that of Google or Yahoo, particularly when an open source technology like Hadoop is available. These companies need to realize that Hadoop’s most touted benefits can be attained only if the enterprise has big data problems.
Certainly, IT giants such as Amazon, Google, Facebook and Yahoo have been big Hadoop users for years. However, it is important for mid-sized businesses to understand where Hadoop can fit in their operations:
- Hadoop is the choice when companies have really big “Big Data” datasets.
- If companies are spending heavily on archiving valuable data, it is a smart move to set up Hadoop clusters and retain the data until they work out how to use it for big data analysis.
- Hadoop is the right choice for companies having an “enterprise data hub” vision.
- Hadoop makes life easy for organizations that celebrate data diversity by mixing and matching different types of data, such as social sentiment data, transaction data, geo-location data and clickstream data.
If companies really want to get the most out of their Hadoop investment, they need to ensure that they use Hadoop for big “Big Data” datasets or for datasets that pose big data computing challenges.