How to save a DataFrame to MySQL in PySpark

Recipe Objective: How to Save a DataFrame to MySQL in PySpark?

Data merging and aggregation are day-to-day activities on most big data platforms. In this recipe, we will create a PySpark dataframe and save it to a MySQL database table over JDBC.

System requirements:

  • Ubuntu installed in a virtual machine
  • A single-node Hadoop installation
  • PySpark (or Spark) installed on Ubuntu
  • The code below can be run in a Jupyter notebook or any Python console.
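If you prefer to supply the connector at launch time instead of configuring it inside the script, the jar can be passed on the command line. This is a sketch; the jar path is an assumption and should point at wherever you downloaded the MySQL Connector/J jar:

```shell
# Start pyspark with the MySQL JDBC connector on the classpath.
# The jar path is an assumption -- point it at your own copy of the connector.
pyspark --jars /home/hduser/mysql-connector-java-5.1.47/mysql-connector-java-5.1.47.jar
```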

Step 1: Import the modules

First, import the pyspark and pyspark.sql modules and create a SparkSession, passing the MySQL JDBC connector jar to the session so that Spark can talk to MySQL:

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row

spark = SparkSession.builder \
    .config("spark.jars", "/home/hduser/mysql-connector-java-5.1.47/mysql-connector-java-5.1.47.jar") \
    .master("local").appName("PySpark_MySQL_test2").getOrCreate()

Step 2: Create a dataframe to store in MySQL

Next, create a dataframe to save to the MySQL table. The Row class lives in the pyspark.sql submodule, which is why we imported it in Step 1.

studentDf = spark.createDataFrame([
    Row(id=1, name='vijay', marks=67),
    Row(id=2, name='Ajay', marks=88),
    Row(id=3, name='jay', marks=79),
    Row(id=4, name='vinay', marks=67),
])

Step 3: View the data in the dataframe

Here we view the top rows of the dataframe; show(5) prints up to five rows, so all four rows of our dataframe appear.

studentDf.show(5)

Step 4: Save the dataframe to the MySQL table

Here we save the dataframe to the MySQL table created earlier. To do so, we use the dataframe's write API with the jdbc format and then call save(), as shown below.

studentDf.select("id", "name", "marks").write.format("jdbc") \
    .option("url", "jdbc:mysql://127.0.0.1:3306/dezyre_db?useUnicode=true&characterEncoding=UTF-8&useSSL=false") \
    .option("driver", "com.mysql.jdbc.Driver").option("dbtable", "students") \
    .option("user", "root").option("password", "root").save()

To check that the dataframe was saved, log in to the MySQL database and query the students table.

Conclusion

In this recipe, we learned how to save a PySpark dataframe to a MySQL table over JDBC.
