In most of the big data scenarios , we need to merge multiple files or tables based on the various conditions to a unified data model for quicker data analysis purposes.in this recipe we are going to merge various excel files based on the certain conditions
pip install pandas
In this example we are going to use the pandas library , this library is used for data manipulation pandas data structures and operations for manipulating numerical tables and time series
Import pandas as pd
In the below code we are going read the data from excel files, and create dataframes using pandas library.
orders = pd. read_excel('orders.xlsx')
products =pd.read_excel("products.xlsx")
customers = pd.read_excel("customers.xlsx")
using the merge function in the pandas library , all database join operations between the pandas from the excel data. using the "how" parameter in the merge function we will perform the join operations like left, right,..etc.
Left Join :
import pandas as pd
orders = pd. read_excel('orders.xlsx')
products =pd.read_excel("products.xlsx")
customers = pd.read_excel("customers.xlsx")
result = pd.merge(orders,customers[["Product_id","Order_id","customer_name",'customer_email']],on='Product_id', how='left')
result.head()
Output of the above code:
Inner Join :
import pandas as pd
orders = pd. read_excel('orders.xlsx')
products =pd.read_excel("products.xlsx")
customers = pd.read_excel("customers.xlsx")
result= pd.merge(products,customers,on='Product_id',how='inner',indicator=True)
result.head()
Output of the above code:
Right Join :
import pandas as pd
orders = pd. read_excel('orders.xlsx')
products =pd.read_excel("products.xlsx")
customers = pd.read_excel("customers.xlsx")
result = pd.merge(orders, customers[["Product_id","Order_id","customer_name",'customer_email']],
on='Product_id',
how='right',
indicator=True)
result.head()
Output of the above code:
Outer Join :
import pandas as pd
orders = pd. read_excel('orders.xlsx')
products =pd.read_excel("products.xlsx")
customers = pd.read_excel("customers.xlsx")
result= pd.merge(products,customers,on='Product_id',how='outer',indicator=True)
result.head()
Output of the above code:
After getting the result write to the hdfs or local file
import pandas as pd
orders = pd. read_excel('orders.xlsx')
products =pd.read_excel("products.xlsx")
customers = pd.read_excel("customers.xlsx")
result = pd.merge(orders,
customers[["Product_id","Order_id","customer_name",'customer_email']],
on='Product_id')
result.head()
# write the results to the hdfs/ local
result.to_excel("Results.xlsx", index = False)
Output of the above code : an excel file which will be written to current location of execution of the code and it looks like below