Apache Kafka is breaking barriers by removing the dependence on the slow batch processing that Hadoop relies on. This is one of the reasons Kafka was developed at LinkedIn, where an early goal was to make working with Hadoop easier. Kafka addresses some of Hadoop's limitations, but it will not eliminate Hadoop itself. Apache Kafka is an enabler: a fault-tolerant, publish-subscribe message broker. A simple way to picture it is as a central nervous system that collects data from various sources. This data is voluminous and constantly changing, and it can be anything from clickstream data and activity or web logs to consumer data. Apache Kafka captures all of it and makes it available to enterprise users in real time. This blog post explores why Apache Kafka was developed, what it does and what makes it so popular for Big Data analysis.
When Big Data wasn’t as big as it is today, gathering and storing vast volumes of data was the primary challenge in the technology space. Now that Big Data has been around for years, there are a number of options for storing it. Petabyte-scale storage is within reach even for mid-sized organizations. As the problem of storing enormous data volumes got solved, another one reared up: what to do with so much data?
Data Analytics is one of the most sought-after technical skills in modern organizations. There are a number of programming languages and tools available, but they all come with their own share of limitations. Expensive licenses, inability to read data from different sources, issues with real-time processing and difficulty operating in a distributed environment are some of the common problems businesses face while handling Big Data.
Before analytics can happen, an important task has to be carried out. Data from multiple sources (banking transactions, website traffic, app traffic, transponders and sensors), in different formats, has to be fed into one common platform and made available for further processing. What compounds the challenge is that there is no uniformity across data sources: each uses its own technology to store data. Beyond the speed required to ingest real-time data and convert it into a common form for further analytics, scalability is a major challenge.
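To make the "common form" idea concrete, here is a minimal sketch of how records from two very different sources might be converted into one shared envelope before being handed to a broker. The field names, the CSV layout and the sensor JSON shape are all assumptions made up for illustration; they are not part of Kafka or of any particular system.

```python
import json
from datetime import datetime, timezone


def from_bank_csv(line):
    """Parse a hypothetical 'account,amount,epoch_seconds' CSV line."""
    account, amount, ts = line.strip().split(",")
    return {
        "source": "bank",
        "key": account,
        "timestamp": datetime.fromtimestamp(int(ts), tz=timezone.utc).isoformat(),
        "payload": {"amount": float(amount)},
    }


def from_sensor_json(text):
    """Parse a hypothetical sensor reading that arrives as JSON."""
    raw = json.loads(text)
    return {
        "source": "sensor",
        "key": raw["device_id"],
        "timestamp": raw["time"],
        "payload": {"temperature_c": raw["temp"]},
    }


def to_message(record):
    """Serialize the common envelope to bytes, ready to hand to a broker."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


if __name__ == "__main__":
    records = [
        from_bank_csv("ACC-17,250.00,1700000000"),
        from_sensor_json(
            '{"device_id": "s-9", "time": "2023-11-14T22:13:20+00:00", "temp": 21.5}'
        ),
    ]
    for record in records:
        print(to_message(record))
```

Every record ends up with the same four top-level fields, so downstream consumers do not need to know which system produced it.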
Apache Kafka attempts to solve this issue. Initially developed by LinkedIn for managing its internal data, it has steadily gained popularity. Written in Scala and Java, Apache Kafka was open sourced in 2011. It aims to provide a scalable, high-throughput and low-latency platform for handling real-time data feeds. Scalability is one feature that makes Apache Kafka stand out. High fault tolerance is one of the key features desired in a real-time messaging system, and Kafka ticks that box as well. These qualities set it apart from traditional messaging technologies such as RabbitMQ, JMS and AMQP.
What makes Kafka popular?
Kafka can fit into almost any industry, across various use cases. Be it gathering data from a fleet of thousands of trucks owned by a supply chain company or data from the connected appliances that control a high-tech office space, Kafka can broker massive message streams for analysis. It works in combination with popular processing and storage tools such as Apache Storm, Apache HBase and Apache Spark.
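The publish-subscribe idea behind this can be sketched in a few lines. The example below is a toy, in-memory stand-in for a topic and is not the Kafka client API: the `Topic` class, the group names and the truck messages are all hypothetical. It only illustrates one property, namely that messages are appended to a log and each consumer group reads that log at its own pace. Real Kafka additionally splits a topic into partitions and tracks offsets per partition.

```python
from collections import defaultdict


class Topic:
    """An append-only, in-memory log standing in for a topic (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.log = []                    # messages in arrival order
        self.offsets = defaultdict(int)  # next offset to read, per consumer group

    def publish(self, message):
        """Append a message and return the offset it was stored at."""
        self.log.append(message)
        return len(self.log) - 1

    def poll(self, group, max_messages=10):
        """Return the next unread messages for a group and advance its offset."""
        start = self.offsets[group]
        batch = self.log[start:start + max_messages]
        self.offsets[group] = start + len(batch)
        return batch


if __name__ == "__main__":
    trucks = Topic("truck-positions")
    trucks.publish({"truck": "T-1", "lat": 52.52, "lon": 13.40})
    trucks.publish({"truck": "T-2", "lat": 48.14, "lon": 11.58})

    # Two independent groups each see every message.
    print(trucks.poll("routing"))
    print(trucks.poll("billing"))

    # A group that has caught up gets an empty batch until new data arrives.
    print(trucks.poll("routing"))
```

Because reading does not remove anything from the log, adding a new analysis tool is a matter of adding another group, without touching the producers.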
Kafka – a perfect candidate for Distributed Messaging
Storing data on different nodes spread across the IT infrastructure is an effective way to hedge risk and maintain high availability. Though all organizations want to adopt a distributed layout, concerns about system availability and slow transfer rates often turn out to be show stoppers. A modern Big Data architect has to keep in mind the breakneck rate of increase in data volumes. Keeping a system future-proof by having adequate storage without compromising on speed is one aspect where Kafka emerges as a clear winner.
Scalability means data can be streamed through a large number of nodes with low latency. Even if a small number of nodes go down temporarily, the load is rebalanced to other nodes and high availability is not compromised. Hence, there is no single point of failure. Kafka works more like a peer-to-peer system than a traditional master-slave model: each partition has its own leader, and leadership is spread across the brokers rather than concentrated in one master node. Producers, brokers and consumers, coordinated through ZooKeeper, work together to deliver messages across platforms at high throughput and uptime.
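The "no single point of failure" claim is easier to see with a small simulation. The sketch below is a deliberate simplification and not how Kafka is implemented: the `Cluster` class, the broker names and the round-robin replica placement are assumptions for illustration, and real Kafka chooses leaders from the set of in-sync replicas through a controller. It only shows that when each partition is replicated on more than one broker, losing a broker still leaves every partition with a live leader.

```python
class Cluster:
    """Toy model of partition replicas spread over brokers (illustrative only)."""

    def __init__(self, brokers, partitions, replication_factor):
        if replication_factor > len(brokers):
            raise ValueError("replication factor cannot exceed the number of brokers")
        self.alive = set(brokers)
        # Place replicas round-robin so leadership is spread across brokers.
        self.replicas = {
            p: [brokers[(p + i) % len(brokers)] for i in range(replication_factor)]
            for p in range(partitions)
        }

    def fail(self, broker):
        """Mark a broker as down."""
        self.alive.discard(broker)

    def leader(self, partition):
        """Return the first live replica for a partition, or None if all are down."""
        for broker in self.replicas[partition]:
            if broker in self.alive:
                return broker
        return None


if __name__ == "__main__":
    cluster = Cluster(["b0", "b1", "b2"], partitions=3, replication_factor=2)
    print({p: cluster.leader(p) for p in cluster.replicas})

    cluster.fail("b1")  # one broker goes down
    print({p: cluster.leader(p) for p in cluster.replicas})
```

With a replication factor of 2 on three brokers, any single broker can fail and all three partitions remain available; a partition becomes unavailable only if every broker holding one of its replicas is down.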
Since LinkedIn open sourced it to the Apache community in 2011, a huge number of developers have come on board, and a large, ever-growing community supports the enterprise adoption of Apache Kafka.
What does Kafka do?
Kafka is used extensively across industries as a general-purpose messaging system where high availability and real-time data integration and analytics are of utmost importance. Some of the common uses of Kafka are:
- Batch Data Processing
- Website and Web Applications Activity Tracking
- Sensor Data Collection
- Aggregating Data Logs
- Gathering Real Time Traffic Information
- Medical Parameter Monitoring
Within five years of being open sourced, Kafka has found favor with high-profile companies, including:
- Cisco Systems
- Goldman Sachs
IBM is taking Apache Kafka a step further by developing Message Hub (presently in beta). It provides a scalable and reliable messaging system for asynchronous message distribution in cloud-based environments. IBM’s new Streaming Analytics service is in the process of using it to analyze millions of events per second for instant analytics and automated decision-making systems.
The popularity of Kafka is further highlighted by the fact that the number of monthly downloads grew about sevenfold in the one-year period between August 2014 and July 2015, according to Confluent, a company founded by Kafka’s early developers.
Where is Kafka heading?
As more and more large businesses turn to Kafka for distributed messaging, it is likely to become the focal point of the distributed messaging universe. The Internet of Things (IoT) is the next big thing in the technology space, and given the vast volumes of data that connected devices will produce every second, Kafka developers clearly have plenty on their plate.
In the words of Neha Narkhede, co-founder and CTO of Confluent, “What Kafka allows you to do is move data across the company and make it available as a continuously free-flowing stream within seconds to people who need to make use of it. And it does it at scale.”