The desire to save every bit and byte of data for future use and to make data-driven decisions is key to staying ahead in the competitive world of business operations. Low-cost storage systems like Hadoop and Amazon S3 make this possible. For the same cost, organizations can now store 50 times as much data in a Hadoop data lake as in a data warehouse. The data lake is gaining momentum across organizations, and everyone wants to know how to implement one and why. The data lake architecture leverages analytics capabilities for big data processing and helps businesses address operational challenges that were difficult to solve using conventional data warehousing technologies.
Several people write that data lakes are replacing data warehouses, but this is just another technology hype that gets in the way of the effective use of data. Whether the data lake will replace the data warehouse or the two will complement each other is currently the hottest discussion in the big data community. This article explores the much-debated question of “Data Lake vs. Data Warehouse”, to which ProjectPro industry experts add the option “or Both Coexist”.
Data warehouses do a good job at what they are meant to do, but with disparate data sources and different data types like transaction logs, social media data, tweets, user reviews, and clickstream data, data lakes fulfil a critical need. Data warehouses can store only structured data in a standard format that fits in rows and columns. To manage semi-structured and unstructured data, organizations are adopting the data lake architecture for greater flexibility. Data lakes can store any type and amount of data and make it available to applications when required, which is not possible with a conventional data warehouse. Because they follow a “Schema on Read” approach, data lakes avoid the cost and complexity of transforming data at ingestion time.
Suppose somebody asks you, “How will you handle analytics for 64 TB of data that a company creates every month: with a data lake or a data warehouse?” To answer this question, you must first understand what a data lake is and how it differs from a data warehouse.
Pentaho CTO James Dixon coined the term Data Lake in 2010 and defined it as follows - “If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption. Translate this into the data version of the term and the contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
Forbes magazine defines the data lake in comparison to a data warehouse as follows: “The difference between a data lake and a data warehouse is that the data in a data warehouse is pre-categorized at the point of entry, which can dictate how it's going to be analysed.”
Gartner describes data lakes as “marketed as enterprise wide data management platforms for analysing disparate sources of data in its native format”.
Suppose that a retailer wants to analyse POS data using a Market Basket Analysis approach and find answers to questions like –
All these questions need to be answered from multiple business dimensions like demographics of the store, location of the store, date and time of the year, type of the customer, demographics of the customer, etc. This kind of market basket analysis can be performed by storing the POS data in a traditional data warehouse and performing business intelligence analysis.
Now, what if the retailer wants to find out which transactions were hand-coded into the system and which sales transactions were scanned into the POS system? The retailer might require this information to find out –
To answer such questions, retailers cannot use the data stored in the data warehouse. They need to analyse the logs of the actual transactions, which contain details like the ID of the sales representative, the time of the transaction, and whether the transaction was scanned through the POS system or hand-coded. This is where data lakes come into action: the detailed transaction logs must be accessed in raw format for analysis. Data lakes have gained popularity because data can be loaded into the storage repository without first defining a data model. Data lakes follow a Schema on Read approach, unlike data warehouses, which follow a Schema on Load (also called Schema on Write) approach.
Any discussion of data lakes and big data is closely associated with the Apache Hadoop ecosystem, which leads to a description of how to build a data lake using the power of the tiny toy elephant, Hadoop. A Hadoop data lake is a data management platform that consists of one or more Hadoop clusters for storing and processing huge amounts of non-relational data like sensor data, JSON objects, clickstream data, and social media data. The Hadoop data lake has become popular because it offers a cost-effective and technically feasible way to solve big data challenges. It is considered an evolution of the existing data architecture.
Many people think that data lakes are just a re-creation of the data warehouse, but is this the truth behind the shiny new data lakes?
Data warehouses require a pre-defined ETL (Extract, Transform and Load) process to extract the data and bring it into the warehouse. When loading data into a data warehouse, both the schema and the queries that will run against it must be known in advance. If a query changes in a way the schema does not support, the data has to be re-ingested into the data warehouse.
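As a minimal sketch of this kind of raw-log analysis, the Python snippet below keeps transaction log lines exactly as they arrived and only interprets them when the question is asked. The log format and the field names (`txn_id`, `rep_id`, `ts`, `entry_mode`) are assumptions made for illustration; a real POS system will log something different.

```python
import json
from collections import Counter

# Hypothetical raw transaction log lines, stored exactly as they arrived.
# The field names (rep_id, ts, entry_mode) are assumptions for illustration.
RAW_LOG_LINES = [
    '{"txn_id": 1, "rep_id": "R17", "ts": "2016-03-01T09:15:00", "entry_mode": "scanned"}',
    '{"txn_id": 2, "rep_id": "R17", "ts": "2016-03-01T09:20:00", "entry_mode": "hand_coded"}',
    '{"txn_id": 3, "rep_id": "R42", "ts": "2016-03-01T09:21:00", "entry_mode": "scanned"}',
    '{"txn_id": 4, "rep_id": "R17", "ts": "2016-03-01T09:30:00", "entry_mode": "hand_coded"}',
]


def entry_modes_by_rep(lines):
    """Parse each raw line at read time and count entry modes per sales rep."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        counts[(record["rep_id"], record["entry_mode"])] += 1
    return counts


counts = entry_modes_by_rep(RAW_LOG_LINES)
for (rep_id, mode), n in sorted(counts.items()):
    print(rep_id, mode, n)
```

Nothing about the stored lines had to change to answer this question; a different question would only need a different reading function.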
Data lakes follow an ELT process, i.e. Extract, Load and Transform. The schema is defined only when the data is pulled and accessed for analysis. Data is stored at the leaf level in an untransformed state, and a schema is applied only to fulfil data analysis requirements.
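The difference between the two pipelines can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: an in-memory SQLite table stands in for the warehouse, a plain list of raw lines stands in for the lake, and the record fields (`sku`, `amount`, `store`, `coupon`) are hypothetical.

```python
import json
import sqlite3

# Hypothetical source records; the field names are assumptions for illustration.
SOURCE = [
    '{"sku": "A1", "amount": 12.5, "store": "north", "coupon": "SPRING"}',
    '{"sku": "B2", "amount": 7.0, "store": "south", "coupon": null}',
]

# ETL (schema on write): the table is defined first, and the transform step
# keeps only the columns the planned reports need. "coupon" is dropped.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (sku TEXT, amount REAL, store TEXT)")
for line in SOURCE:
    rec = json.loads(line)
    warehouse.execute(
        "INSERT INTO sales VALUES (?, ?, ?)",
        (rec["sku"], rec["amount"], rec["store"]),
    )
warehouse_total = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]

# ELT (schema on read): load the untransformed lines, transform per question.
lake = list(SOURCE)


def coupon_sales(raw_lines):
    """A question nobody planned for: total sales made with a coupon."""
    records = (json.loads(line) for line in raw_lines)
    return sum(r["amount"] for r in records if r["coupon"] is not None)


lake_coupon_total = coupon_sales(lake)
print(warehouse_total, lake_coupon_total)
```

The warehouse table answers the planned question (total sales) directly, but the coupon question cannot be answered from it without changing the schema and loading the data again. The raw lines in the lake still carry the `coupon` field, so only a new reading function is needed.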
Database professionals analyse various data sources to understand the business processes and then profile the data into a structured data model for reporting. This takes a considerable amount of time, most of which is spent deciding what data should and should not be included in the data warehouse. Usually, if some data is not required to answer specific business questions or for reporting, it is not loaded into the data warehouse, as this simplifies the data model and saves expensive disk storage space.
On the contrary, in a data lake all data is loaded into the storage repository irrespective of its use. This lets businesses dig into the repository whenever they require specific data for any kind of analysis. Commodity, off-the-shelf servers make the data lake's hardware infrastructure easily and economically scalable to petabytes.
As explained earlier, data lakes place raw data into large storage repositories, like the HDFS used by Hadoop, where it can be analysed without a predefined structure (Schema on Read). Unlike a data warehouse, which relies on a schema defined up front (Schema on Write), a data lake is open to all kinds of analytics. In a data lake architecture, parsing and a schema are applied when a data scientist reads the raw data from the lake. Organizations cannot simply pick “Schema on Read” or “Schema on Write” once and for all, as the choice depends on several factors.
Suppose that there is a data set containing millions of PDF documents that have been scanned from paper business cards. There is a choice to make when writing this data. The organisation can apply a schema to the data before writing it, by organizing the data from each PDF into a table with columns like name, company name, email ID, phone number and so forth. This is the schema on write approach, as the data on each business card is mapped and written to predefined columns in a data warehouse. The other way is to simply dump all the PDF documents into the data lake and identify later the schema required for a given analysis. Data lakes make exploratory analytics possible with tools like Hadoop, Spark, Hive and Apache Drill, as the data scientist controls the parser and the schema. Neither approach is better in general; they are two very different approaches, and the right one depends on what a data scientist or developer is trying to do.
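The business card example can be sketched as follows. This is only an illustration under assumptions: the text is taken to be already extracted from the PDFs (the OCR step is out of scope), each card is assumed to have exactly four lines in a fixed order, and the function names are hypothetical.

```python
import re

# Hypothetical text already extracted from two scanned cards (OCR itself is
# out of scope here). The four-line layout is an assumption.
RAW_CARDS = [
    "Jane Doe\nAcme Corp\njane.doe@example.com\n+1 555 0100",
    "Ravi Kumar\nGlobex Ltd\nravi@example.org\n+91 98765 43210",
]

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")


def write_with_schema(card_text):
    """Schema on write: map each card to fixed columns before storing it."""
    name, company, email, phone = card_text.split("\n")
    return {"name": name, "company": company, "email": email, "phone": phone}


def read_email_domains(raw_cards):
    """Schema on read: keep the raw text, parse only what this analysis needs."""
    domains = []
    for text in raw_cards:
        match = EMAIL.search(text)
        if match:
            domains.append(match.group().split("@")[1])
    return domains


table = [write_with_schema(card) for card in RAW_CARDS]  # warehouse-style rows
domains = read_email_domains(RAW_CARDS)                  # lake-style query
print(table[0]["company"], domains)
```

The schema on write path fails as soon as a card does not fit the expected four lines, which is the price of having clean columns up front. The schema on read path tolerates any layout but pushes the parsing work onto every analysis that reads the cards.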
The storage industry has a lot to offer in terms of low-cost, horizontally scalable platforms for storing large datasets. Hadoop evolved as a batch processing framework built on top of low-cost hardware and storage, and many companies have started using Hadoop as a data lake because its storage is economical, unlike data warehouses, which are expensive.
Data warehouse technology has existed for decades and is mature in terms of security, whereas data lakes are still emerging. Data is more secure in a data warehouse than in a data lake. With significant efforts under way to make big data technologies secure, data lakes are likely to become more secure and mature over time.
A data warehouse is a logical representation of structured data used by business users at different levels for decision making. Decision makers in an organization cannot make critical decisions based on inaccurate data. However, as the volume of unstructured data increases, organizations have to implement a data lake to complement the data warehouse and support the discovery of new questions. The architecture, contents and structure of a data lake are determined by the types of analytics that a business cannot support using a conventional data warehouse architecture alone.
The emergence of the data lake has led to lamentable dialogues in the big data community that compare data warehouses unfavourably with data lakes. A new, hot concept elbows the older technology out of the way, and this feeds the misleading idea that data lakes will replace data warehouses. Data lakes can do several things that data warehouses cannot, and the reverse is also true. The data lake is groundwork for a data warehouse, not a replacement for it. Moving away from the hype, we conclude that the data warehouse is here to stay and far from dead. The data lake and the data warehouse are two different technologies serving different business needs.