How to join Excel data from Multiple files using Pandas?

This recipe helps you join Excel data from multiple files using pandas.


Recipe Objective

In most big data scenarios, we need to merge multiple files or tables into a unified data model so that the data can be analyzed more quickly. In this recipe we merge several Excel files based on certain join conditions.

System requirements:

  • Install the pandas Python module: pip install pandas
  • The code below can be run in a Jupyter notebook or any Python console
  • In this scenario we use three Excel files to perform the joins: a Products dataset, an Orders dataset, and a Customers dataset (a sketch for generating sample files follows this list)
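
If you do not already have these three workbooks, the following is a minimal sketch that generates small sample versions (the column names and values here are hypothetical, chosen to match the joins used later in this recipe; pandas typically needs the openpyxl package installed to write .xlsx files):

import pandas as pd

# Hypothetical sample data; real datasets will differ
pd.DataFrame({
    "Product_id": [1, 2, 3],
    "product_name": ["Pen", "Book", "Lamp"],
}).to_excel("products.xlsx", index=False)

pd.DataFrame({
    "Product_id": [1, 2, 4],
    "Order_id": [101, 102, 103],
    "order_qty": [5, 2, 1],
}).to_excel("orders.xlsx", index=False)

pd.DataFrame({
    "Product_id": [1, 2, 4],
    "Order_id": [101, 102, 103],
    "customer_name": ["Asha", "Ben", "Cara"],
    "customer_email": ["asha@example.com", "ben@example.com", "cara@example.com"],
}).to_excel("customers.xlsx", index=False)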

Step 1: Import the modules

In this example we use the pandas library, which provides data structures and operations for manipulating numerical tables and time series.

import pandas as pd

Step 2: Read the Excel Files

In the code below, we read the data from the Excel files and create DataFrames using the pandas library.

orders = pd.read_excel("orders.xlsx")
products = pd.read_excel("products.xlsx")
customers = pd.read_excel("customers.xlsx")
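
A quick sanity check after loading is to confirm the column names each DataFrame actually carries, since the joins below assume a shared Product_id column (the lists printed will depend on your files):

# Mismatched or misspelled column names are the most common merge error
print(orders.columns.tolist())
print(products.columns.tolist())
print(customers.columns.tolist())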

Step 3: Join operations on the DataFrames

Using the merge function in the pandas library, we can perform all of the database-style join operations between the DataFrames built from the Excel data. The how parameter of merge selects the join type: left, right, inner, or outer.
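
Before applying this to the Excel data, here is a minimal toy illustration (with hypothetical two-row DataFrames) of how the how parameter changes which rows survive the merge:

import pandas as pd

left = pd.DataFrame({"Product_id": [1, 2], "order_qty": [5, 2]})
right = pd.DataFrame({"Product_id": [2, 3], "customer_name": ["Ben", "Cara"]})

# 'inner' keeps only the matching key (2); 'left' keeps 1 and 2;
# 'right' keeps 2 and 3; 'outer' keeps all of 1, 2 and 3
for how in ["inner", "left", "right", "outer"]:
    print(how, len(pd.merge(left, right, on="Product_id", how=how)))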

Left Join:

import pandas as pd

orders = pd.read_excel("orders.xlsx")
products = pd.read_excel("products.xlsx")
customers = pd.read_excel("customers.xlsx")

result = pd.merge(orders,
                  customers[["Product_id", "Order_id", "customer_name", "customer_email"]],
                  on="Product_id", how="left")
result.head()


Inner Join:

import pandas as pd

orders = pd.read_excel("orders.xlsx")
products = pd.read_excel("products.xlsx")
customers = pd.read_excel("customers.xlsx")

result = pd.merge(products, customers, on="Product_id", how="inner", indicator=True)
result.head()


Right Join:

import pandas as pd

orders = pd.read_excel("orders.xlsx")
products = pd.read_excel("products.xlsx")
customers = pd.read_excel("customers.xlsx")

result = pd.merge(orders,
                  customers[["Product_id", "Order_id", "customer_name", "customer_email"]],
                  on="Product_id", how="right", indicator=True)
result.head()


Outer Join:

import pandas as pd

orders = pd.read_excel("orders.xlsx")
products = pd.read_excel("products.xlsx")
customers = pd.read_excel("customers.xlsx")

result = pd.merge(products, customers, on="Product_id", how="outer", indicator=True)
result.head()
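
The indicator=True argument used in the inner, right, and outer joins above adds a _merge column that marks each row as 'left_only', 'right_only', or 'both'. After an outer join it is a quick way to audit how many rows failed to match:

# Count matched rows versus rows present on only one side
print(result["_merge"].value_counts())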


Step 4: Write the result to a file

After getting the result, write it to HDFS or a local file.

import pandas as pd

orders = pd.read_excel("orders.xlsx")
products = pd.read_excel("products.xlsx")
customers = pd.read_excel("customers.xlsx")

result = pd.merge(orders,
                  customers[["Product_id", "Order_id", "customer_name", "customer_email"]],
                  on="Product_id")
result.head()

# write the result to the local file system (or copy it to HDFS afterwards)
result.to_excel("Results.xlsx", index=False)

Output of the above code: an Excel file named Results.xlsx, written to the directory from which the code is executed.
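
If you would rather write CSV instead of Excel (for example, before copying the file into HDFS with a separate tool), pandas' to_csv works the same way:

# Plain-text alternative to the Excel output above
result.to_csv("Results.csv", index=False)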
