As Big Data grows bigger, a number of specialized tasks are emerging. Tasks such as data preparation and presentation, once perceived as minor add-on skills, now matter more than ever. Data preparation, which was earlier seen as a “nice to have” skill, is now a core requirement. Data scientists are expected to be masters of data preparation, processing, analysis, and presentation. In this article we explore why data preparation is so important, the issues data scientists face with present-day data preparation tools, and how Apache Spark aids fast data processing and data preparation. “Fast data processing with Spark” is the reason Apache Spark’s popularity among enterprises is gaining momentum.
All the hard work data scientists put into data transformation, analysis and presentation can amount to nothing if the data is not sound in the first place. Data preparation is the process of checking and cleaning the data that will later be processed and analyzed.
Whether you are doing market research or conducting an experiment, there is a good chance that data will come in from different sources, in different formats. There is a need to track which data has already been processed and which is still outstanding. Data is becoming more valuable by the day, which is why it has become imperative to preserve the original data so that analysis results can be traced back to it. Well-prepared data is accurate, easy to structure and free of ambiguity, making it ideal to feed into Spark clusters.
Well-prepared data not only helps during data entry; it is easily read by transformation programs and is usable in analysis. Missing values, invalid entries and outliers are examples of incorrect data that can significantly alter analysis and presentation, and they have to be detected and handled before raw data can be processed.
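As an illustration, these checks can be sketched in a few lines of plain Python. The field names, the valid-currency list and the outlier limit below are hypothetical; a real pipeline would derive such rules from the business domain and run them at scale on Spark or a dedicated preparation tool:

```python
def prepare(records, valid_currencies=frozenset({"USD", "EUR"}), outlier_limit=10_000.0):
    """Split raw transaction rows into clean rows and rejected rows.

    Each rejected row is tagged with the reason it failed:
    a missing value, an invalid entry, or an outlier.
    """
    clean, rejected = [], []
    for row in records:
        amount = row.get("amount")
        currency = row.get("currency")
        if amount is None or currency is None:      # missing value
            rejected.append((row, "missing"))
        elif currency not in valid_currencies:      # invalid entry
            rejected.append((row, "invalid"))
        elif abs(amount) > outlier_limit:           # outlier
            rejected.append((row, "outlier"))
        else:
            clean.append(row)
    return clean, rejected

raw = [
    {"amount": 120.0, "currency": "USD"},
    {"amount": None,  "currency": "EUR"},   # missing value
    {"amount": 75.0,  "currency": "XXX"},   # invalid entry
    {"amount": 1e6,   "currency": "USD"},   # outlier
]
clean, rejected = prepare(raw)
```

Keeping the rejected rows, along with the reason for rejection, preserves a trail back to the original data rather than silently discarding it.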
Steps such as cleaning, integration, transformation and validation are commonly used to get raw data ready for software-based processing and analysis.
Gone are the days when data was collected manually and missing values and outliers were treated by hand. We are in the era of Big Data, where tens of thousands of rows of data are commonplace. Every financial transaction we make generates data that is of value to the store we swipe our card at, the brands we purchase, the bank that issues our credit card, and so on. Given the critical purposes (fraud detection, risk prediction, transaction monitoring, etc.) for which Big Data systems are used, it is imperative to have specialized data preparation tools that can process data chunks from different sources and integrate them for further transformation, analysis and presentation.
Businesses can no longer afford to view data preparation tools as an add-on to Big Data analytics solutions. Before we get to the latest buzzword in data preparation tools – Apache Spark – we would like to reiterate that data preparation tools are not competing with Big Data analytics software. If anything, they are enablers.
As businesses add customers and expand their operations, the data at their disposal grows at a breakneck pace. The bigger the dataset, the greater the need for real-time analytics to build actionable insights. Apache Spark delivers results quickly and eliminates delays that can be lethal for business processes. Spark’s parallel in-memory data processing is much faster than approaches that require disk access between steps. In the following section we will explore the advantages of Apache Spark in Big Data:
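The core idea behind that speed can be sketched on a single machine with the standard library. This is an analogy, not Spark itself: Spark splits a dataset into partitions and processes them in parallel across a cluster, keeping intermediate results in memory rather than writing them to disk between stages. Here, a thread pool plays the role of the cluster and in-memory lists play the role of partitions:

```python
from concurrent.futures import ThreadPoolExecutor

def partition_sum(chunk):
    # Each worker processes its partition entirely in memory;
    # nothing is written to disk between stages.
    return sum(x * x for x in chunk)

data = list(range(100_000))
n = 4
# Naive round-robin split into 4 partitions, loosely analogous
# to how Spark partitions an RDD across executors.
partitions = [data[i::n] for i in range(n)]

with ThreadPoolExecutor(max_workers=n) as pool:
    total = sum(pool.map(partition_sum, partitions))
```

The result is identical to the sequential computation; the difference is that the work is divided and each piece stays in memory, which is the pattern Spark applies at cluster scale.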
Hadoop MapReduce has enjoyed a good run in the Big Data ecosystem for the past few years. As Big Data gets bigger and businesses get more demanding, developers and data scientists are getting hungrier for speed. In a bid to separate signal from noise, developers require a simpler, faster approach to solving problems that keep getting more complex by the day. Data suppliers have clearly thrown their weight behind Hadoop MapReduce’s younger and nimbler rival – Apache Spark.
Cloudera chief strategy officer Mike Olson sums up the “breathtaking” growth of Spark and a paradigm shift in customer preferences witnessed by his Hadoop distributor company – “Before very long, we expect that Spark will be the dominant general-purpose processing framework for Hadoop. If you want a good, general-purpose engine these days, you’re choosing Apache Spark, not Apache MapReduce”. On closer reading, one can clearly make out what he meant by associating “general-purpose processing” with Spark. The message is clear – Hadoop MapReduce isn’t dead yet and it is not going to be any time soon. However, if you are building an analytics platform from the ground up, Spark should be your first choice.
Spark has gained acceptance with all major Hadoop distributions. Also, Spark is an open-source platform, which means businesses no longer need to be apprehensive about adopting it, wondering what will happen if they choose to move away from the Hadoop ecosystem later. In simpler terms, businesses can write their entire analytics infrastructure on the Apache Spark platform and carry it with them to any Big Data environment, should they choose to move at a later stage.
Despite the good things Hadoop brought to the table, one of the major concerns the developer community had about adopting it was that the learning curve was too steep, which made practitioners hard to find. It did get simpler with time, but the criticism hasn’t completely gone away. Apache Spark doesn’t require its users to master complex programming patterns such as Java-based MapReduce. Anyone with database knowledge and scripting skills will be able to use Spark.
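To give a sense of what “scripting skills” means in practice, here is a hedged PySpark sketch (it assumes a local Spark installation and a hypothetical log file named errors.txt) that filters a text file and counts matching lines – a job that would take far more ceremony in classic Java MapReduce:

```python
from pyspark.sql import SparkSession

# A hypothetical local session; in production this would point at a cluster.
spark = SparkSession.builder.master("local[*]").appName("prep-demo").getOrCreate()

# errors.txt is a hypothetical input file with one log line per row.
lines = spark.read.text("errors.txt")

# Keep only the lines containing the word ERROR and count them.
errors = lines.filter(lines.value.contains("ERROR"))
print(errors.count())

spark.stop()
```

The whole job is a short script built from a handful of high-level operations, which is what makes Spark approachable to anyone comfortable with databases and scripting.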
Apache Spark hasn’t been around long. It is still in its growth stage and appears set for broad adoption. It is always a good time to be an early adopter of a technology that is on its way to being embraced globally by businesses of all sizes. After all, first-mover advantage applies to everyone!
Theoretical advantages aside, a number of business representatives have endorsed Apache Spark and the advantages their teams derived from it. One such account is summarized below:
Brian Kursar is director of Data Science at Toyota Motor Sales USA. At the Spark Summit in June, he shared his team’s experience and the improvement Apache Spark brought to their application. His customer experience analysis application integrated over 700 million records (spanning social media, survey data and call centers) to identify customer turnover issues and zero in on problem areas. This would enable the team to eliminate the causes and get involved at a personal level when required.
Hadoop MapReduce had been their preferred choice for running the analysis and this usually took over 160 hours. In the digital age, information can often be a perishable commodity and a week’s time was considered too long to study the outcome and take corrective measures. The entire job was rewritten for Spark and the same analysis was completed in four hours!
This seemingly impossible improvement is less surprising when we consider that Matei Zaharia created Spark during his PhD at the University of California, Berkeley, to overcome the limitations he had observed in Hadoop MapReduce while interning at early Hadoop adopters, Facebook included.
Hadoop MapReduce is not going out the door today or even in the next few years. But Apache Spark has firmly established itself as a leader for fast data processing.
Do you believe that Apache Spark will displace Hadoop MapReduce faster than predicted?