How to join Excel data from Multiple files using Pandas?

This recipe helps you join Excel data from multiple files using Pandas.

Recipe Objective

In most big data scenarios, we need to merge multiple files or tables, based on various conditions, into a unified data model for quicker data analysis. In this recipe we merge several Excel files based on certain join conditions.

System requirements:

  • Install the pandas Python module: pip install pandas
  • The code below can be run in a Jupyter notebook or any Python console
  • In this scenario we use three Excel files to perform the joins: a Products dataset, an Orders dataset, and a Customers dataset (a sketch that generates matching sample files follows this list)
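
If you do not already have these workbooks, the sketch below generates minimal sample files. The column names (Product_id, Order_id, quantity, product_name, customer_name, customer_email) are assumptions chosen only to match the merge keys used later in this recipe, not a required schema. Writing .xlsx files also requires the openpyxl package (pip install openpyxl).

import pandas as pd

# Assumed sample schemas -- adjust to match your real data
pd.DataFrame({
    "Order_id": [1, 2, 3],
    "Product_id": [101, 102, 103],
    "quantity": [2, 1, 5],
}).to_excel("orders.xlsx", index=False)

pd.DataFrame({
    "Product_id": [101, 102, 104],
    "product_name": ["pen", "notebook", "stapler"],
}).to_excel("products.xlsx", index=False)

pd.DataFrame({
    "Product_id": [101, 103, 104],
    "Order_id": [1, 3, 4],
    "customer_name": ["Alice", "Bob", "Carol"],
    "customer_email": ["alice@example.com", "bob@example.com", "carol@example.com"],
}).to_excel("customers.xlsx", index=False)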

Step 1: Import the modules

In this example we use the pandas library, which provides data structures and operations for manipulating numerical tables and time series.

import pandas as pd

Step 2: Read the Excel Files

In the code below we read the data from the Excel files and create DataFrames using the pandas library.

orders = pd.read_excel("orders.xlsx")
products = pd.read_excel("products.xlsx")
customers = pd.read_excel("customers.xlsx")
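
Optionally, a quick sanity check that each frame loaded and that the join key used later (Product_id) is present:

# print the shape and column names of each DataFrame
for name, df in [("orders", orders), ("products", products), ("customers", customers)]:
    print(name, df.shape, list(df.columns))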

Step 3: Join operations on the DataFrames

Using the merge function in the pandas library, we can perform all of the standard database join operations between the DataFrames built from the Excel data. The how parameter of merge selects the join type: left, right, inner, or outer.

Left Join:

import pandas as pd

orders = pd.read_excel("orders.xlsx")
products = pd.read_excel("products.xlsx")
customers = pd.read_excel("customers.xlsx")

# keep every order, attaching customer details where Product_id matches
result = pd.merge(
    orders,
    customers[["Product_id", "Order_id", "customer_name", "customer_email"]],
    on="Product_id",
    how="left",
)
result.head()

Output of the above code:
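
Note that only the customer columns needed here are selected before merging. If both frames share a non-key column (for example, if the orders sheet also carries an Order_id column), pandas disambiguates the duplicates in the result with the default suffixes _x and _y.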

Inner Join:

import pandas as pd

orders = pd.read_excel("orders.xlsx")
products = pd.read_excel("products.xlsx")
customers = pd.read_excel("customers.xlsx")

# keep only the rows where Product_id exists in both frames
result = pd.merge(products, customers, on="Product_id", how="inner", indicator=True)
result.head()

Output of the above code:
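
The indicator=True argument adds a _merge column to the result, marking each row as left_only, right_only, or both. Counting its values is a quick way to audit how many rows actually matched; this continues from the inner-join example above:

# _merge is added by indicator=True
result["_merge"].value_counts()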

Right Join:

import pandas as pd

orders = pd.read_excel("orders.xlsx")
products = pd.read_excel("products.xlsx")
customers = pd.read_excel("customers.xlsx")

# keep every customer row, attaching order details where Product_id matches
result = pd.merge(
    orders,
    customers[["Product_id", "Order_id", "customer_name", "customer_email"]],
    on="Product_id",
    how="right",
    indicator=True,
)
result.head()

Output of the above code:

Outer Join:

import pandas as pd

orders = pd.read_excel("orders.xlsx")
products = pd.read_excel("products.xlsx")
customers = pd.read_excel("customers.xlsx")

# keep all rows from both frames, matched where possible
result = pd.merge(products, customers, on="Product_id", how="outer", indicator=True)
result.head()

Output of the above code:

Step 4: Write the result to a file

After getting the result, write it out to a local file (or to HDFS). The example below writes an Excel file; a CSV alternative appears at the end of this step.

import pandas as pd

orders = pd.read_excel("orders.xlsx")
products = pd.read_excel("products.xlsx")
customers = pd.read_excel("customers.xlsx")

result = pd.merge(
    orders,
    customers[["Product_id", "Order_id", "customer_name", "customer_email"]],
    on="Product_id",
)
result.head()

# write the results to hdfs/local as an Excel file
result.to_excel("Results.xlsx", index=False)

Output of the above code: an Excel file (Results.xlsx) written to the directory from which the code is run.
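
If a CSV file is preferred instead, to_csv works the same way (the file name here is only illustrative):

# alternative: write the merged result as CSV
result.to_csv("Results.csv", index=False)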
