Create dataframe from a JSON File and write into the DBFS

This recipe shows how to create a DataFrame from a JSON file, read data from DBFS, and write it back into DBFS, covering single-line, multiline, and multiple-file reads as well as custom schemas and save modes.

Recipe Objective: How to create DataFrame from a JSON File, read Data from DBFS and write into the DBFS?

In today's world, data is generated at scale and with complex structures such as lists, maps, and structs, and it is produced, processed, and transformed in various formats like CSV, Parquet, Avro, and JSON. Among these, JSON is a very popular format for maintaining data. So, here we will learn about creating a DataFrame from a JSON file, reading data from DBFS, and writing it into DBFS (the Databricks File System).


Implementation Info:

  1. Databricks Community Edition (click here)
  2. Spark-Scala
  3. Sample data file 1 (click here)
  4. Sample data file 2 (click here)
  5. Sample data file 3 (click here)
  6. Storage: Databricks File System (DBFS)

Step 1: Uploading data to DBFS

Follow the steps below to upload data files from your local machine to DBFS:

  1. Click Create in the Databricks menu.
  2. Click Table in the drop-down menu; this opens the Create New Table UI.
  3. In the UI, specify the folder name in which you want to save your files.
  4. Click Browse to upload and select files from your local machine.
  5. The resulting path looks like /FileStore/tables/your folder name/your file

Refer to the image below for example

bigdata_1.PNG
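Once the upload finishes, you can verify from a notebook cell that the files landed where you expect. A minimal sketch, assuming the file names used later in this recipe; dbutils and display() are available only inside a Databricks notebook:

```scala
// List everything under /FileStore/tables/ to confirm the upload.
// The file names shown in the output should match the files used
// in the later steps (zipcode.json, zipcode2.json, zip_multiline.json).
display(dbutils.fs.ls("/FileStore/tables/"))
```

The same check can be done with the `%fs ls /FileStore/tables/` magic command in a notebook cell.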

Step 2: Read JSON File into DataFrame

Using spark.read.json("path") or spark.read.format("json").load("path"), you can read a JSON file into a Spark DataFrame. These methods take a file path as an argument; in our use case, the file path is "/FileStore/tables/zipcode.json". Here we use the Databricks built-in function display() to view the data in the DataFrame. Unlike the CSV reader, the JSON data source infers the schema from the input file by default, so there is no need to set option("inferSchema", true).

//read single json file into dataframe
val df = spark.read.json("/FileStore/tables/zipcode.json")
df.printSchema()
display(df) //display() works only in a Databricks notebook

bigdata_2.PNG

Step 3: Reading a Multiline JSON File

Sometimes you may want to read records from JSON files in which a single record spans multiple lines. To read such files, set the multiline option to true; by default it is false. We need to specify option("multiline", true) explicitly.

//read multiline json file into dataframe
println("Read Data From Multiline Json")
val df2 = spark.read.option("multiline",true).json("/FileStore/tables/zip_multiline.json")
df2.printSchema()
df2.show()

bigdata_3.PNG

Step 4: Reading Multiple Json Files

The .json() method can also read multiple JSON files from different paths. Just pass all file names with their respective paths, separated by commas, as shown below.

We can also read all JSON files from a directory into a DataFrame just by passing the directory as the path to the json() method.

//reading multiple files
println("Reading from Multiple files")
val df3 = spark.read.json("/FileStore/tables/zipcode.json", "/FileStore/tables/zipcode2.json")
println("Record count "+df3.count())
df3.show(2)

bigdata_4.PNG
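The directory-based variant mentioned above can be sketched as follows; the folder /FileStore/tables/zipcodes/ is an assumed path containing the sample files, not one created earlier in this recipe:

```scala
// Read every JSON file under a directory into a single DataFrame.
// The directory path here is an assumption for illustration; Spark
// will union all JSON files it finds under that folder.
val dfDir = spark.read.json("/FileStore/tables/zipcodes/")
println("Record count " + dfDir.count())
dfDir.show(2)
```

This is convenient when new files arrive in the same folder over time, since no file list has to be maintained in code.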

Step 5: Reading files with a custom schema

A Spark schema defines the structure of the data; in other words, it is the structure of the DataFrame. Spark SQL provides the StructType and StructField classes to specify the structure of a DataFrame programmatically. If you know the file schema ahead of time and do not want to rely on the default inferSchema behavior for column names and types, supply user-defined column names and types through the schema option.

A StructType object can be constructed by StructType(fields: Seq[StructField])

A StructField object can be constructed by StructField(java.lang.String name, DataType dataType, boolean nullable, Metadata metadata)

While creating a DataFrame, we can specify its structure by using StructType and StructField. StructType is a collection of StructFields used to define the column name, data type, and a flag for nullability. Using StructField, we can also add nested struct schemas, ArrayType for arrays, and MapType for key-value pairs.
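As a sketch of the nested types mentioned above, a hypothetical schema combining a nested struct, an array, and a map could look like this (the field names are illustrative, not taken from the sample files):

```scala
import org.apache.spark.sql.types._

// Hypothetical schema illustrating nested StructType, ArrayType and MapType.
val nestedSchema = new StructType()
  .add("Name", StringType, true)
  .add("Address", new StructType()                 // nested struct
    .add("City", StringType, true)
    .add("State", StringType, true), true)
  .add("PhoneNumbers", ArrayType(StringType), true)          // array of strings
  .add("Attributes", MapType(StringType, StringType), true)  // key-value pairs
```

Such a schema would then be passed to the reader the same way as the flat schema in the code below, via .schema(nestedSchema).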

import org.apache.spark.sql.types._
println("Read Json file By defining custom schema")
val schema = new StructType()
  .add("City",StringType,true)
  .add("RecordNumber",IntegerType,true)
  .add("State",StringType,true)
  .add("ZipCodeType",StringType,true)
  .add("Zipcode",LongType,true)
val df_with_schema = spark.read.schema(schema) //apply the custom schema
  .option("multiline",true)
  .json("/FileStore/tables/zip_multiline.json")
df_with_schema.printSchema()
df_with_schema.show(false)

bigdata_5.PNG

Step 6: Writing DataFrame into DBFS(DataBricks File System)

Here, we write the DataFrame into DBFS, into the spark_training folder created earlier. Using Databricks removes the need to maintain our own VMs for Spark and HDFS. We then use Databricks file-system commands to view the contents of the folder we wrote into DBFS.

Spark's DataFrameWriter also has a mode() method to specify the SaveMode; this method accepts either one of the strings below or a constant from the SaveMode class.

overwrite – mode is used to overwrite the existing file; alternatively, you can use SaveMode.Overwrite.

append – To add the data to the existing file; alternatively, you can use SaveMode.Append.

ignore – Ignores write operation when the file already exists; alternatively, you can use SaveMode.Ignore.

errorifexists or error – This is the default; if the file already exists, the write fails with an error. Alternatively, you can use SaveMode.ErrorIfExists.
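To illustrate the modes listed above, appending a second DataFrame to the same target folder could be sketched as follows; df2 and the output path are assumptions carried over from earlier steps of this recipe:

```scala
import org.apache.spark.sql.SaveMode

// Append to the existing output instead of replacing it.
df2.write.mode(SaveMode.Append).format("json").save("/FileStore/tables/spark_training/")

// The string form is equivalent:
// df2.write.mode("append").format("json").save("/FileStore/tables/spark_training/")
```

With append, Spark adds new part files alongside the existing ones rather than deleting the folder first, as overwrite would.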

import org.apache.spark.sql.SaveMode
//writing the file into DBFS
df.write.mode(SaveMode.Overwrite).format("json").save("/FileStore/tables/spark_training/")
//DBFS after writing the DataFrame
display(dbutils.fs.ls("/FileStore/tables/spark_training/"))

bigdata_6.PNG

Conclusion

We have learned to read JSON files with single-line and multiline records into a Spark DataFrame. We have also learned to read single and multiple files simultaneously and to write DataFrames back to DBFS as JSON files using different save modes.

