Hadoop is beginning to live up to its promise of being the backbone technology for Big Data storage and analytics. Companies across the globe have started to migrate their data into Hadoop to join the stalwarts who already adopted Hadoop a while ago. It is important to study and understand several Hadoop use cases for these simple reasons –
- Hadoop is still a complex technology with several limitations and complications.
- All Data is not Big Data and might not require a Hadoop solution.
- Companies that use Hadoop, often fall into the error of not using the right Hadoop architecture and tools, thereby reducing Hadoop’s efficiency.
If you would like more information about Big Data careers, please click the orange "Request Info" button on top of this page.
Studying Hadoop use cases will help to –
1.) Understand what kind of big data problems need Hadoop and
2) What sort of infrastructure should one have in order to set up and work on the Hadoop framework.
The two obvious benefits of using Hadoop is that, it provides storage for any kind of data from various sources and provides a platform for proficient analytics of the data with low latency. Hadoop is well known to be a distributed, scalable and fault-tolerant system. It can store petabytes with relatively low infrastructure investment. Hadoop runs on clusters of commodity servers. All such servers have local storage and CPU which can store few terabytes on its local disk.
Hadoop has two critical components, which we should explore before looking into industry use cases of Hadoop:
Hadoop Distributed File System (HDFS)
The storage system for Hadoop is known as HDFS. HDFS system breaks the incoming data into multiple packets and distributes it among different servers connected in the clusters. That way every server, stores a fragment of the entire data set and all such fragments are replicated on more than one server to achieve fault tolerance.
MapReduce is a distributed data processing framework. HDFS distributes a dataset to different servers but Hadoop MapReduce is the connecting framework responsible to distribute the work and aggregate the results obtained through data processing.
Apache Hadoop provides solution to the problem caused by large volume of complex data. With the result of growth in data, additional servers can be used to store and analyse the data at low cost. This is complemented by processing power of the servers in a cluster by MapReduce.
These two components define Hadoop, as it gained importance in data storage and analysis, over the legacy systems, due to its distributed processing framework.
Let’s take a look at some Hadoop use cases in various industries.
Hadoop in the Financial Sector
The Finance sector is one of the major users of Hadoop. One of the primary use cases for Hadoop, was in risk modelling, to solve the question for banks to evaluate customers and markets better than legacy systems.
These days we notice that many banks compile separate data warehouses into a single repository backed by Hadoop for quick and easy analysis. Hadoop clusters are used by banks to create more accurate risk analysis models for the customers in its portfolio. Such risk analysis helps banks to manage their financial security and offer customized products and services.
Hadoop has helped the financial sector, maintain a better risk record in the aftermath of 2008 economic downturn. Before that, every regional branch of the bank maintained a legacy data warehouse framework isolated from a global entity. Data such as checking and saving transactions, home mortgage details, credit card transactions and other financial details of every customer was restricted to local database systems, due to which, banks failed to paint a comprehensive risk portfolio of their customers.
After the economic recession, most of the financial institutions and national monetary associations started maintaining a single Hadoop Cluster containing more than petabytes of financial data aggregated from multiple enterprise and legacy database systems. Along with aggregating, banks and financial institutions started pulling in other data sources - such as customer call records, chat and web logs, email correspondence and others. When such unprecedented scale of data is analysed with the assistance of Hadoop, MapReduce and techniques like sentiment analysis, text processing, pattern matching, graph creating; banks were able to identify same customers across different sources along with accurate risk assessment.
This proved that not just banks but any company which has fragmented and incomplete picture of its customers, would improve its customer satisfaction and revenue by creating a global data warehouse. Currently banks as well as government financial institutions use HDFS and MapReduce to commence anti money laundering practices, asset valuation and portfolio analysis.
Predicting market behaviour is another classic problem in the financial sector. Different market sources such as, stock exchanges, banks, revenue and proceeds department, securities market - hold massive volume of data themselves but their performance is interdependent. Hadoop provides the technological backbone by creating a platform where data from such sources can be compiled and processed in real-time.
Morgan Stanley with assets over 350 billion is one of the world’s biggest financial services organizations. It relies on the Hadoop framework to make industry critical investment decisions. Hadoop provides scalability and better results through it administrator and can manage petabytes of data which is not possible with traditional database systems.
JPMorgan Chase is another financial giant which provides services in more than 100 countries. Such large commercial banks can leverage big data analytics more effectively by using frameworks like Hadoop on massive volumes of structured and unstructured data. JPMorgan Chase has mentioned it on various channels that they prefer to use HDFS to support the exponentially growing size of data as well as for low latency processing of complex unstructured data.
Larry Feinsmith, Managing Director of Information Technology at JPMorgan Chase said that, “Hadoop's ability to store vast volumes of unstructured data allows the company to collect and store web logs, transaction data and social media data. Hadoop allows us to store data that we never stored before. The data is aggregated into a common platform for use in a range of customer-focused data mining and data analytics tools”.
For the complete list of big data companies and their salaries- CLICK HERE
Financial institutions use Big Data Analytics and Hadoop to analyse world economy, fraud detection, derive value based products for customers, analyse credit market data, effective cash and credit management and for providing an enriching customer experience.
Hadoop in Healthcare Sector
No other industry has benefitted from the use of Hadoop as much as the Healthcare industry has. Healthcare industry leverages Big Data for curing diseases, reducing medical cost, predicting and managing epidemics and maintaining the quality of human life by keeping track of large scale health index and metrics. This section will elaborate about the usage of Big Data and Hadoop in the healthcare industry.
The big data generated in the healthcare sector is mainly due to patient record keeping and regulatory requirements. McKinsey projected that efficient usage of Big Data and Hadoop in healthcare industry can reduce the data warehousing expenses by $300-$500 billion globally. The data generated by electronic health devices is difficult to analyse using the traditional database management systems. Complexity and volume of the healthcare data is the primary driving force behind the transition from legacy systems to Hadoop in the healthcare industry. Using Hadoop on such scale of data helps in easy and quick data representation, database design, clinical decision analytics, data querying and fault tolerance.
Hadoop as a database system allows the storage of unstructured healthcare data in its native form. For billions of medical records, Hadoop provides unconstrained parallel data processing, fault tolerance and storage for mass amount of unstructured datasets. Since HDFS and MapReduce have the ability for processing terabytes of data, it makes Hadoop indispensable for healthcare sector’s big data problems.
As a case study, we will discuss a healthcare information technology company which was required to save seven years’ worth of historical claims and remittance data. Using traditional database system to store this data, created trouble in data retention while processing millions of claims each day. They created a solution by allowing Hadoop system to store the massive data. The complex processes of normalizing data, logging terabytes of information and data querying for analytics was done smoothly thereafter on a Hadoop system developed by Cloudera.
Big Data and Hadoop technology is also applied in the Healthcare Insurance Business. Using distributed database system within healthcare intelligence applications - assists medical insurance companies, hospitals and beneficiaries to increase their product value by devising smart business solutions. For example, if a medical insurance company needs to estimate general age of populace, below which individuals are not victim of a specific disease – they can create profitable policies suitable for different parts of the population. Such an estimation needs processing of massive amounts of data including medicines, geographic regions, patient care records, diseases, symptoms, etc. Hadoop and MapReduce will prove to be economical in processing such massive unstructured information.
The objective of using Hadoop in Healthcare is to store and analyse medical data which can be leveraged for evaluating public health trends with population of billions, as well as to deliberately create treatment options for individual patients as per their needs.
Hadoop for Telecom Industry
Let’s looks at couple of examples from the telecom industry where Big Data and Hadoop solved a critical business problem. China Mobil Guangdong company has an extensive network of customers. Not only do they have to maintain billions of mobile call record, but it is essential to business, that they have real-time access to call records and billing information of the customers. Traditional database management system couldn’t scale in this scenario, so they came up with a cost effective solution using Hadoop. With the help of Intel tech, they used HBase to store billions of call record details. As per an estimate, nearly 30 Terabytes of data is added to their database on a monthly basis.
Another example of Big Data management in the telecom industry comes from Nokia. They store and analyse massive volume of data from their manufactured mobile phones. To paint a fair picture of Nokia’s Big Data, they manage 100 TB of structured data along with 500+ TB of semi-structured data. Hadoop Distributed Framework System provided by Cloudera manages all variety of Nokia’s data and processes it in a scale of petabytes.
The Telecommunication industry includes the local and long distance calls, wireless communications, test messaging, high speed internet, data communication, television streaming and similar other means of satellite based communication. As we have seen in the last couple of examples, the telecommunication industry generates high velocity and volume of data. Because of such massive competition, companies target customers by catering specific products to stand out. Insights like location, interest, activities, and preferences of customers are being used to create marketing and promotional plans for Telecom Industry.
To sum it up, Big Data in the Telecom sector needs a scalable, robust, fault tolerant and precise data analytics software for storing and analysing massive volumes of communication data in real time. Some of the obvious benefits of such robust database system is the ability to generate accurate billing information with call, messages and data records. Major industries in this vertical extrapolate historical data and bandwidth consumption records, this is used to estimate bandwidth usage and communication trends by consumer, which is required for advanced network infrastructure planning and product/scheme design.
The following verticals of telecom industry get benefitted the most, with Big Data and Hadoop usage:
- Call Data Records Management
- Servicing of Telecom Data Equipment
- Advanced Telecom infrastructure planning
- Creating new products and services
- Network traffic analytics
Hadoop in Retail Sector
Any large scale retailer doing transactional data analysis needs to put together massive quantities of Point of Sale transaction data coming from various data sources, with an objective of predicting demand, increasing profit and creating targeted marketing and promotional campaigns.
Retail analytics is one of the major consumers of data warehousing industry and is also responsible for its innovation and growth. This is due to the ability to collect and store far more data about consumers, their behaviour and consumption - both online and in stores. The historical sales record along with the help of Hadoop and MapReduce are used for increasing profit margins and sales.
The new data that retailers generate today, requires sophisticated processing like language processing, pattern recognition, sentiment analysis, etc. Therefore, traditional database management systems are no longer a cost effective platform for storing complex data meant for such analysis.
The solution to this problem is straightforward. Load a historical transactional point of sales data, into a Hadoop cluster. Upon which you can build analytics application using Hive, MapReduce and Apache Spark. This gives you a fault tolerant system with low latency, which can be used for analysing massive quantities of data at a comparable price.
There are several examples of retail companies leveraging Big Data through Hadoop. In this section, we will discuss the examples of Etsy and Sears. Etsy is an online market place, whereas Sears has both online and, brick and mortar stores. Both these companies needed to analyse large volumes of log data, for marketing campaigns, sales management, inventory management, etc. Amazon Elastic MapReduce services were used to create a Hadoop cluster. They stored and analysed the data to estimate consumer behaviour, search recommendation, product placement, inventory management, targeted marketing, product promotion, etc.
Big Data and Hadoop is being used in the retail industry for the following use cases:
- Retail analytics for inventory forecasting.
- Retail analytics for dynamic pricing of products.
- Retail analytics for supply chain efficiency.
- Retail analytics for targeted and customized promotion and marketing.
- Retail analytics in fraud detection and prevention.
With the ever increasing interaction over social media and retail channels, customers are comparing products, services and prices for multiple online and store retailers. With this kind of behaviour, consumers can quickly shift from one retailer to another, which makes it absolutely essential for companies in the retail sector to tap this information and keep track of the leaks in the sales funnel. It is necessary for a retail company to tap retail big data analytics, using Hadoop to understand customer’s purchase behaviour.
Hadoop for Building Recommendation System
Building recommendation system is essential for various verticals like online retailers, online newsfeeds, online videos and movies, etc. In this section we will discuss a case study which uses Big Data and Hadoop to build a recommendation system.
A leading online matrimonial service needs to measure compatibility between individual profiles in order to make quality matches for a potential relationship. Let’s look into how Hadoop helped people find romance.
The data acquired by the company is taken from the users during their signup and during their online behaviour. People are required to fill surveys in order to describe their attributes and characteristics they look for in a partner. The company blends this information with demographics knowledge and their online activities to build a comprehensive image of the respective user as well as to create a list of recommended set of partners. The data included is a combination of structured information, matching stats, which is used for scoring and recommendation algorithms.
With an increase in the number of subscribers of the service, the data associated also grew; making it difficult to compare every possible pair of profiles. The traditional database system of the company could not scale up to match the growth of customer base. Hadoop was brought in to replace the data storage and the analytics system, because of its fast analytic support and low cost storage.
Incorporating such massive volumes of data along with fault tolerant processing has enabled the online dating service to improve its recommendation system through robust evaluation of compatibility scores. Behaviour of the user like profile visits and duration of those visits through collaborative filtering gives accurate information about user preferences.
Let us know in the comments below if we have missed out on any important big data use cases for Hadoop.