Latest Update made on May 24, 2016.
Big Data has found a comfortable home inside the Hadoop ecosystem. Hadoop based data stores have gained wide acceptance around the world by developers, programmers, data scientists, and database experts. Convenient as the data storage is, Hadoop’s inherent reporting mechanism presents a number of challenges to be overcome, the prominent ones being learning a new language from scratch (for querying and reporting on Hadoop data sets) and performance related issues.
Since the emergence of the Hadoop ecosystem, programmers and analysts having years of experience with SQL - suddenly felt crippled with the new big data technology. They were required to learn a new querying language all over again to effectively utilize the benefits provided by Hadoop. SQL being the most popular database language, the need of the hour was to combine the immense storage capacity of Hadoop with SQL, resulting into a SQL–on–Hadoop tool which would enable programmers to dig out relevant data from an ever growing Hadoop repository.
If you would like more information about Big Data careers, please click the orange "Request Info" button on top of this page.
A number of SQL–on–Hadoop tools have been developed since then, which allow programmers to utilize their existing SQL expertise on Hadoop data stores. They aim to put a comfortable and familiar SQL based front end to query vast expanse of data stored under Hadoop architecture. While the jury is out on the best tools available, in this article we glance through some of the popular tools and list the pros and cons of using them.
SQL-on-Hadoop Tool: Apache Hive
Developed by Facebook for internal assignments, Hive has quickly gained traction and has become a top choice for running queries on Hadoop for experienced SQL practitioners. The big catch is that even though it provides an SQL like querying environment, it uses the MapReduce methodology in the background to query the database and return results. It also supports user defined functions and allows processing of compressed data. If performance parameters and algorithm does not concern you much, this is an excellent platform to get your tasks accomplished from day one.
It is one of the de-facto tools which are present by default on almost all Hadoop installations. Given that there is minimal initial investment to get started with, it makes for a good tool to be tried. One noticeable performance issue you are likely to face with Hive is that the queries are usually very slow because of MapReduce associated overheads.
Hive, however, does have its share of limitations. It supports only four file formats – text, SequenceFile, ORC and RCFile. Though these cover most of the popular file formats, this is something to be kept in mind before you decide on using Hive. The future of Hive looks better though, with Hortonworks continuously improving its Apache Tez project, a new back-end for Hive aimed at improving response time.
For Interview Questions and Answers for Apache Hive, click here.
For the complete list of big data companies and their salaries- CLICK HERE
SQL-on-Hadoop Tool: Cloudera Impala
Impala provides developers the luxury of running interactive SQL queries on Hadoop Distributed File System (HDFS) and HBase. Though Hive also provides an SQL like interface, it follows batch processing which can lead to lags if one is looking for performance oriented alternative. Impala overcomes that shortcoming, by running queries in real time which allows integration of SQL based business intelligence tools with the Hadoop data store.
Developed by Cloudera, Impala is available as an open source tool and supports popular file formats such as text, LZO, SequenceFile, Avro and RCFile. It supports cloud based architecture through Amazon’s Elastic MapReduce. Impala’s ANSI SQL compatibility means there is minimum business disruption as developers and analysts can be productive from day one, without needing to learn any new language. It provides integration for a number of other Business Intelligence tools to guarantee continuity in work.
Getting started with Impala does require considerable groundwork. To fully utilize the power of Impala, your data needs to be stored in Parquet file format, which can be an exercise in itself. Impala’s requirement of installing demons across the cluster as well as lack of support for YARN can also be painful bottlenecks.
SQL-on-Hadoop Tool: Presto
It is another initiative by Facebook, which is available as an open source tool. Written in Java, it shares many similarities with Impala, some of which are:
- Provides interactive experience.
- Requires considerable groundwork, i.e. installation across a number of nodes.
- For optimal performance, it requires data to be stored in a particular file format (RCFile).
In addition, Presto provides interoperability with Hive meta-store. Presto allows you to combine data from multiple sources, which is a major advantage for enterprise wide deployments. The big difference from Impala is that Presto is not supported by any of the major vendors. Hence, if your plan is to get an enterprise wide deployment (and subsequent support on an ongoing basis), you might want to consider other options, even though some of the well-known technology giants such as Airbnb and Dropbox are already using it.
SQL-on-Hadoop Tool: Shark
As one of the first specialized SQL-on-Hadoop projects, Shark started as an alternative to having to run Hive on MapReduce. It aimed to retain all the functionalities of Hive, yet deliver superior performance. Written in Scala by UC Berkeley, it is built on Apache’s data processing engine Spark, unlike Impala and Hive. It has earned a reputation of being a faster alternative to Map-Reduce and there is an ever growing community of users around the world.
Though Shark does offer performance improvements, it can offer it only till a point. It faces challenges when it comes to maintaining scalability without compromising on the performance parameters. As Shark runs on top of Hive, it has to live with a complex codebase inherited from Hive, which becomes difficult to maintain and tune beyond a certain point.
As businesses around the world keep getting hungry and impatient for data, integration of sophisticated business analytics with SQL presents harder challenges and there comes a point where a system which was primarily designed for MapReduce architecture struggles to meet the strict performance demands.
There are a number of other SQL-for-Hadoop tools and they all have their own sets of pros and cons. While it is not possible to review each one, we are signing off by compiling a check-list that an ideal tool should contain:
- SQL syntax friendly
- Efficient with joins operations
- Analytics capable, preferably with an in-built engine
- Support for native Hadoop file format
Pivotal, an enterprise level SQL-on-Hadoop product, is capable of handling most of the demands of modern day analytics and it ticks almost all the boxes. Its integrated analytics engine comes with machine learning capabilities, which improves performance with usage. Modern day organizations focused on data analysis, demand query languages to be able to handle statistical, mathematical and machine learning algorithms such as regression, factor analysis, hypothesis testing, etc. and Pivotal presents a very capable option.
With plenty of options at your disposal, we hope SQL masters would benefit a lot from these tools and hit the ground running once the right tool is chosen. If you are planning to start your Hadoop training any time soon, then it is imperative that you do some research on the technical background required to learn Hadoop. You will find the following posts quite helpful :