How to join Excel data from Multiple files using Pandas?

This recipe helps you join Excel data from multiple files using Pandas.

Recipe Objective

In most big data scenarios, we need to merge multiple files or tables, based on various conditions, into a unified data model for faster analysis. In this recipe we merge several Excel files based on certain join conditions.

System requirements :

  • Install the pandas Python module: pip install pandas (reading .xlsx files also needs an Excel engine such as openpyxl: pip install openpyxl)
  • The code below can be run in a Jupyter notebook or any Python console
  • In this scenario we use three Excel files to perform the joins: a Products dataset, an Orders dataset, and a Customers dataset (a sketch for generating stand-in files follows this list)
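
If you do not have these workbooks handy, the following is a minimal sketch that generates small stand-in files. The key columns (Product_id, Order_id, customer_name, customer_email) follow the merge code later in this recipe; all values and the remaining columns (quantity, product_name) are made up for illustration.

import pandas as pd

# hypothetical sample data; replace with your real workbooks
pd.DataFrame({"Order_id": [1, 2, 3],
              "Product_id": [101, 102, 103],
              "quantity": [2, 1, 5]}).to_excel("orders.xlsx", index=False)

pd.DataFrame({"Product_id": [101, 102, 104],
              "product_name": ["pen", "book", "lamp"]}).to_excel("products.xlsx", index=False)

pd.DataFrame({"Product_id": [101, 102, 103],
              "Order_id": [1, 2, 3],
              "customer_name": ["Asha", "Ben", "Chen"],
              "customer_email": ["asha@example.com", "ben@example.com", "chen@example.com"]}).to_excel("customers.xlsx", index=False)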

Step 1: Import the modules

In this example we use the pandas library, which provides data structures and operations for manipulating numerical tables and time series.

import pandas as pd

Step 2: Read the Excel Files

In the code below we read the data from the Excel files and create DataFrames using the pandas library.

orders = pd.read_excel('orders.xlsx')
products = pd.read_excel('products.xlsx')
customers = pd.read_excel('customers.xlsx')
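
If a workbook contains several sheets, read_excel can target one explicitly; the sheet name below is a placeholder, and naming the engine is optional since pandas uses openpyxl for .xlsx files by default:

# read one worksheet ("Sheet1" is a hypothetical name) with an explicit engine
orders = pd.read_excel('orders.xlsx', sheet_name='Sheet1', engine='openpyxl')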

Step 3: Join operations on the Data frames

Using the merge function in the pandas library, we can perform all of the database-style join operations between the DataFrames built from the Excel data. The how parameter of merge selects the join type: left, right, inner, or outer.
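
As a quick illustration with made-up values, here is how the how parameter changes which keys survive the merge:

import pandas as pd

left = pd.DataFrame({"Product_id": [1, 2, 3], "qty": [5, 2, 7]})
right = pd.DataFrame({"Product_id": [2, 3, 4], "name": ["pen", "book", "lamp"]})

pd.merge(left, right, on="Product_id", how="inner")  # keys 2, 3 (present on both sides)
pd.merge(left, right, on="Product_id", how="left")   # keys 1, 2, 3 (all of left)
pd.merge(left, right, on="Product_id", how="right")  # keys 2, 3, 4 (all of right)
pd.merge(left, right, on="Product_id", how="outer")  # keys 1, 2, 3, 4 (union)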

Left Join:

import pandas as pd

orders = pd.read_excel('orders.xlsx')
products = pd.read_excel('products.xlsx')
customers = pd.read_excel('customers.xlsx')

# keep every row of orders and attach matching customer details
result = pd.merge(orders,
                  customers[["Product_id", "Order_id", "customer_name", "customer_email"]],
                  on='Product_id', how='left')
result.head()

Output of the above code: result.head() displays the first five rows of the left-joined DataFrame.

Inner Join:

import pandas as pd

orders = pd.read_excel('orders.xlsx')
products = pd.read_excel('products.xlsx')
customers = pd.read_excel('customers.xlsx')

# keep only rows whose Product_id appears in both frames;
# indicator=True adds a _merge column recording each row's source
result = pd.merge(products, customers, on='Product_id', how='inner', indicator=True)
result.head()

Output of the above code: result.head() displays the first five rows of the inner-joined DataFrame.
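
The indicator=True flag adds a _merge column that records where each row's key was found ('left_only', 'right_only', or 'both'). For an inner join every row is matched by definition, which a quick check confirms:

# every row of an inner join comes from both sides
result["_merge"].value_counts()  # only 'both' should have a nonzero count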

Right Join:

import pandas as pd

orders = pd.read_excel('orders.xlsx')
products = pd.read_excel('products.xlsx')
customers = pd.read_excel('customers.xlsx')

# keep every row of the customers selection, with order details where they match
result = pd.merge(orders,
                  customers[["Product_id", "Order_id", "customer_name", "customer_email"]],
                  on='Product_id', how='right', indicator=True)
result.head()

Output of the above code: result.head() displays the first five rows of the right-joined DataFrame.

Outer Join:

import pandas as pd

orders = pd.read_excel('orders.xlsx')
products = pd.read_excel('products.xlsx')
customers = pd.read_excel('customers.xlsx')

# keep every row from both frames, filling non-matches with NaN
result = pd.merge(products, customers, on='Product_id', how='outer', indicator=True)
result.head()

Output of the above code: result.head() displays the first five rows of the outer-joined DataFrame.
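
An outer join combined with indicator=True is also a handy way to audit mismatches between two workbooks; a small sketch:

# count matched and one-sided keys
print(result["_merge"].value_counts())

# products with no matching customer record, and vice versa
products_only = result[result["_merge"] == "left_only"]
customers_only = result[result["_merge"] == "right_only"]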

Step 4: Write the result to a file

After getting the result, write it out to the local filesystem, from where it can be copied to HDFS if needed.

import pandas as pd

orders = pd.read_excel('orders.xlsx')
products = pd.read_excel('products.xlsx')
customers = pd.read_excel('customers.xlsx')

result = pd.merge(orders,
                  customers[["Product_id", "Order_id", "customer_name", "customer_email"]],
                  on='Product_id')
result.head()

# write the result to an Excel file in the current directory
result.to_excel("Results.xlsx", index=False)

Output of the above code: an Excel file named Results.xlsx, written to the directory from which the code is executed.
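
If you would rather have a CSV file, to_csv is the drop-in counterpart:

# same result as a CSV file, without the index column
result.to_csv("Results.csv", index=False)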
