Top SQL-on-Hadoop Tools

A list of top sql-on-hadoop tools which have similar syntax to SQL programming and can be used by programmers who are not familiar with Java but know SQL.

Top SQL-on-Hadoop Tools
 |  BY ProjectPro

Big Data has found a comfortable home inside the Hadoop ecosystem. Hadoop based data stores have gained wide acceptance around the world by developers, programmers, data scientists, and database experts. Convenient as the data storage is, Hadoop’s inherent reporting mechanism presents a number of challenges to be overcome, the prominent ones being learning a new language from scratch (for querying and reporting on Hadoop data sets) and performance related issues.


Big Data Hadoop Project-Visualize Daily Wikipedia Trends

Downloadable solution code | Explanatory videos | Tech Support

Start Project

Since the emergence of the Hadoop ecosystem, programmers and analysts having years of experience with SQL - suddenly felt crippled with the new big data technology. They were required to learn a new querying language all over again to effectively utilize the benefits provided by Hadoop.  SQL being the most popular database language, the need of the hour was to combine the immense storage capacity of Hadoop with SQL, resulting into a SQL–on–Hadoop tool which would enable programmers to dig out relevant data from an ever growing Hadoop repository.

SQL-on-Hadoop Tools

SQL-on-Hadoop Tools

A number of SQL–on–Hadoop tools have been developed since then, which allow programmers to utilize their existing SQL expertise on Hadoop data stores. They aim to put a comfortable and familiar SQL-based front end to query the vast expanse of data stored under Hadoop architecture. While the jury is out on the best tools available, in this article we glance through some of the popular tools and list the pros and cons of using them.

SQL-on-Hadoop Tool: Apache Hive

Developed by Facebook for internal assignments, Hive has quickly gained traction and has become a top choice for running queries on Hadoop for experienced SQL practitioners. The big catch is that even though it provides an SQL-like querying environment, it uses the MapReduce methodology in the background to query the database and return results. It also supports user-defined functions and allows processing of compressed data. If performance parameters and algorithm does not concern you much, this is an excellent platform to get your tasks accomplished from day one.

Explore SQL Database Projects to Add them to Your Data Engineer Resume.

It is one of the de-facto tools which are present by default on almost all Hadoop installations. Given that there is a minimal initial investment to get started with, it makes for a good tool to be tried. One noticeable performance issue you are likely to face with Hive is that the queries are usually very slow because of MapReduce associated overheads.

Hive, however, does have its share of limitations. It supports only four file formats – text, SequenceFile, ORC and RCFile. Though these cover most of the popular file formats, this is something to be kept in mind before you decide on using Hive. The future of Hive looks better though, with Hortonworks continuously improving its Apache Tez project, a new back-end for Hive aimed at improving response time. 

For Interview Questions and Answers for Apache Hive, click here. 

Get FREE Access to Data Analytics Example Codes for Data Cleaning, Data Munging, and Data Visualization

SQL-on-Hadoop Tool: Cloudera Impala

Impala provides developers the luxury of running interactive SQL queries on Hadoop Distributed File System (HDFS) and HBase. Though Hive also provides an SQL like interface, it follows batch processing which can lead to lags if one is looking for performance oriented alternative. Impala overcomes that shortcoming, by running queries in real time which allows integration of SQL based business intelligence tools with the Hadoop data store.

Developed by Cloudera, Impala is available as an open source tool and supports popular file formats such as text, LZO, SequenceFile, Avro and RCFile. It supports cloud based architecture through Amazon’s Elastic MapReduce. Impala’s ANSI SQL compatibility means there is minimum business disruption as developers and analysts can be productive from day one, without needing to learn any new language. It provides integration for a number of other Business Intelligence tools to guarantee continuity in work.

Getting started with Impala does require considerable groundwork. To fully utilize the power of Impala, your data needs to be stored in Parquet file format, which can be an exercise in itself. Impala’s requirement of installing demons across the cluster as well as lack of support for YARN can also be painful bottlenecks.

SQL-on-Hadoop Tool: Presto

It is another initiative by Facebook, which is available as an open source tool. Written in Java, it shares many similarities with Impala, some of which are:

  • Provides interactive experience.
  • Requires considerable groundwork, i.e. installation across a number of nodes.
  • For optimal performance, it requires data to be stored in a particular file format (RCFile).

In addition, Presto provides interoperability with Hive meta-store. Presto allows you to combine data from multiple sources, which is a major advantage for enterprise wide deployments. The big difference from Impala is that Presto is not supported by any of the major vendors. Hence, if your plan is to get an enterprise wide deployment (and subsequent support on an ongoing basis), you might want to consider other options, even though some of the well-known technology giants such as Airbnb and Dropbox are already using it.

Learn to Manage and Schedule Hadoop Jobs with Oozie Workflow Scheduler

Access to a curated library of 250+ end-to-end industry projects with solution code, videos and tech support.

Request a demo

SQL-on-Hadoop Tool: Shark

As one of the first specialized SQL-on-Hadoop projects, Shark started as an alternative to having to run Hive on MapReduce. It aimed to retain all the functionalities of Hive, yet deliver superior performance. Written in Scala by UC Berkeley, it is built on Apache’s data processing engine Spark, unlike Impala and Hive. It has earned a reputation of being a faster alternative to Map-Reduce and there is an ever growing community of users around the world.

Though Shark does offer performance improvements, it can offer it only till a point. It faces challenges when it comes to maintaining scalability without compromising on the performance parameters. As Shark runs on top of Hive, it has to live with a complex codebase inherited from Hive, which becomes difficult to maintain and tune beyond a certain point.

Get More Practice, More Big Data and Analytics Projects, and More guidance.Fast-Track Your Career Transition with ProjectPro

As businesses around the world keep getting hungry and impatient for data, integration of sophisticated business analytics with SQL presents harder challenges and there comes a point where a system which was primarily designed for MapReduce architecture struggles to meet the strict performance demands.

There are a number of other SQL-for-Hadoop tools and they all have their own sets of pros and cons. While it is not possible to review each one, we are signing off by compiling a check-list that an ideal tool should contain:

  • SQL syntax friendly
  • Efficient with joins operations
  • Scalable
  • Analytics capable, preferably with an in-built engine
  • Support for native Hadoop file format

Here's what valued users are saying about ProjectPro

I come from Northwestern University, which is ranked 9th in the US. Although the high-quality academics at school taught me all the basics I needed, obtaining practical experience was a challenge. This is when I was introduced to ProjectPro, and the fact that I am on my second subscription year...

Abhinav Agarwal

Graduate Student at Northwestern University

I come from a background in Marketing and Analytics and when I developed an interest in Machine Learning algorithms, I did multiple in-class courses from reputed institutions though I got good theoretical knowledge, the practical approach, real word application, and deployment knowledge were...

Ameeruddin Mohammed

ETL (Abintio) developer at IBM

Not sure what you are looking for?

View All Projects

Pivotal, an enterprise level SQL-on-Hadoop product, is capable of handling most of the demands of modern day analytics and it ticks almost all the boxes. Its integrated analytics engine comes with machine learning capabilities, which improves performance with usage. Modern day organizations focused on data analysis, demand query languages to be able to handle statistical, mathematical and machine learning algorithms such as regression, factor analysis, hypothesis testing, etc. and Pivotal presents a very capable option.

With plenty of options at your disposal, we hope SQL masters would benefit a lot from these tools and hit the ground running once the right tool is chosen. If you are planning to start your Hadoop training any time soon, then it is imperative that you do some research on the technical background required to learn Hadoop. You will find the following posts quite helpful :

How much SQL is required to learn Hadoop?

How much Java is required to learn Hadoop?

Prerequisites to learn Hadoop.

Spark SQL vs Apache Drill

PREVIOUS

NEXT

Access Solved Big Data and Data Science Projects

About the Author

ProjectPro

ProjectPro is the only online platform designed to help professionals gain practical, hands-on experience in big data, data engineering, data science, and machine learning related technologies. Having over 270+ reusable project templates in data science and big data with step-by-step walkthroughs,

Meet The Author arrow link