Create dataframe from a JSON File and write into the DBFS

This recipe shows how to create a DataFrame from a JSON file, read data from DBFS, and write it back into DBFS, covering single-line, multiline, and multiple-file reads as well as custom schemas and save modes.

Recipe Objective: How to create DataFrame from a JSON File, read Data from DBFS and write into the DBFS?

In today's world, data is generated at scale and with complex structures such as lists, maps, and structs, and it is produced, processed, and transformed in various formats like CSV, Parquet, Avro, and JSON. Among these, JSON is a very popular format for maintaining data. So, here we will learn about creating a DataFrame from a JSON file, reading data from DBFS, and writing it into DBFS (the Databricks File System).


Implementation Info:

  1. Databricks Community Edition (click here)
  2. Spark-Scala
  3. Sample data file 1 (click here)
  4. Sample data file 2 (click here)
  5. Sample data file 3 (click here)
  6. Storage: Databricks File System (DBFS)

Step 1: Uploading data to DBFS

Follow the steps below to upload data files from your local machine to DBFS:

  1. Click Create in the Databricks menu.
  2. Click Table in the drop-down menu; this opens the Create New Table UI.
  3. In the UI, specify the folder name in which you want to save your files.
  4. Click Browse to upload and select files from your local machine.
  5. The resulting path looks like /FileStore/tables/your folder name/your file

Refer to the image below for example

bigdata_1.PNG
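Once the upload finishes, you can verify from a notebook cell that the files landed where you expect. A minimal sketch, assuming the file names used later in this recipe; dbutils and display() are available only inside a Databricks notebook:

```scala
// List everything under /FileStore/tables/ to confirm the upload.
// The file names shown in the output should match the files used
// in the later steps (zipcode.json, zipcode2.json, zip_multiline.json).
display(dbutils.fs.ls("/FileStore/tables/"))
```

The same check can be done with the `%fs ls /FileStore/tables/` magic command in a notebook cell.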

Step 2: Read JSON File into DataFrame

Using spark.read.json("path") or spark.read.format("json").load("path"), you can read a JSON file into a Spark DataFrame. These methods take a file path as an argument; in our use case, the file path is "/FileStore/tables/zipcode.json". Here we use the Databricks built-in function display() to view the data in the DataFrame. Unlike the CSV reader, the JSON data source infers the schema from the input file by default, so there is no need to set option("inferSchema", true).

//read single json file into dataframe
val df = spark.read.json("/FileStore/tables/zipcode.json")
df.printSchema()
display(df) //display() works only in a Databricks notebook

bigdata_2.PNG

Step 3: Reading a Multiline JSON File

Sometimes you may want to read records from JSON files in which a single record spans multiple lines. To read such files, set the multiline option to true; by default it is false. We need to specify option("multiline", true) explicitly.

//read multiline json file into dataframe
println("Read Data From Multiline Json")
val df2 = spark.read.option("multiline",true).json("/FileStore/tables/zip_multiline.json")
df2.printSchema()
df2.show()

bigdata_3.PNG

Step 4: Reading Multiple Json Files

The .json() method can also read multiple JSON files from different paths. Just pass all file names with their respective paths, separated by commas, as shown below.

We can also read all JSON files from a directory into a DataFrame just by passing the directory as the path to the json() method.

//reading multiple files
println("Reading from Multiple files")
val df3 = spark.read.json("/FileStore/tables/zipcode.json", "/FileStore/tables/zipcode2.json")
println("Record count "+df3.count())
df3.show(2)

bigdata_4.PNG
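The directory-based variant mentioned above can be sketched as follows; the folder /FileStore/tables/zipcodes/ is an assumed path containing the sample files, not one created earlier in this recipe:

```scala
// Read every JSON file under a directory into a single DataFrame.
// The directory path here is an assumption for illustration; Spark
// will union all JSON files it finds under that folder.
val dfDir = spark.read.json("/FileStore/tables/zipcodes/")
println("Record count " + dfDir.count())
dfDir.show(2)
```

This is convenient when new files arrive in the same folder over time, since no file list has to be maintained in code.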

Step 5: Reading files with a custom schema

A Spark schema defines the structure of the data; in other words, it is the structure of the DataFrame. Spark SQL provides the StructType and StructField classes to specify the structure of a DataFrame programmatically. If you know the file schema ahead of time and do not want to rely on the default inferSchema behavior for column names and types, supply user-defined column names and types through the schema option.

A StructType object can be constructed by StructType(fields: Seq[StructField])

A StructField object can be constructed by StructField(java.lang.String name, DataType dataType, boolean nullable, Metadata metadata)

While creating a DataFrame, we can specify its structure by using StructType and StructField. StructType is a collection of StructFields used to define the column name, data type, and a flag for nullability. Using StructField, we can also add nested struct schemas, ArrayType for arrays, and MapType for key-value pairs.
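As a sketch of the nested types mentioned above, a hypothetical schema combining a nested struct, an array, and a map could look like this (the field names are illustrative, not taken from the sample files):

```scala
import org.apache.spark.sql.types._

// Hypothetical schema illustrating nested StructType, ArrayType and MapType.
val nestedSchema = new StructType()
  .add("Name", StringType, true)
  .add("Address", new StructType()                 // nested struct
    .add("City", StringType, true)
    .add("State", StringType, true), true)
  .add("PhoneNumbers", ArrayType(StringType), true)          // array of strings
  .add("Attributes", MapType(StringType, StringType), true)  // key-value pairs
```

Such a schema would then be passed to the reader the same way as the flat schema in the code below, via .schema(nestedSchema).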

import org.apache.spark.sql.types._
println("Read Json file By defining custom schema")
val schema = new StructType()
  .add("City",StringType,true)
  .add("RecordNumber",IntegerType,true)
  .add("State",StringType,true)
  .add("ZipCodeType",StringType,true)
  .add("Zipcode",LongType,true)
val df_with_schema = spark.read.schema(schema) //apply the custom schema
  .option("multiline",true)
  .json("/FileStore/tables/zip_multiline.json")
df_with_schema.printSchema()
df_with_schema.show(false)

bigdata_5.PNG

Step 6: Writing DataFrame into DBFS(DataBricks File System)

Here, we write the DataFrame into DBFS, into the spark_training folder created earlier. Using Databricks removes the need to maintain our own VMs for Spark and HDFS. We then use Databricks file-system commands to view the contents of the folder we wrote into DBFS.

Spark's DataFrameWriter also has a mode() method to specify the SaveMode; this method accepts either one of the strings below or a constant from the SaveMode class.

overwrite – mode is used to overwrite the existing file; alternatively, you can use SaveMode.Overwrite.

append – To add the data to the existing file; alternatively, you can use SaveMode.Append.

ignore – Ignores write operation when the file already exists; alternatively, you can use SaveMode.Ignore.

errorifexists or error – This is the default; if the file already exists, the write fails with an error. Alternatively, you can use SaveMode.ErrorIfExists.
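To illustrate the modes listed above, appending a second DataFrame to the same target folder could be sketched as follows; df2 and the output path are assumptions carried over from earlier steps of this recipe:

```scala
import org.apache.spark.sql.SaveMode

// Append to the existing output instead of replacing it.
df2.write.mode(SaveMode.Append).format("json").save("/FileStore/tables/spark_training/")

// The string form is equivalent:
// df2.write.mode("append").format("json").save("/FileStore/tables/spark_training/")
```

With append, Spark adds new part files alongside the existing ones rather than deleting the folder first, as overwrite would.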

import org.apache.spark.sql.SaveMode
//writing the file into DBFS
df.write.mode(SaveMode.Overwrite).format("json").save("/FileStore/tables/spark_training/")
//DBFS after writing the DataFrame
display(dbutils.fs.ls("/FileStore/tables/spark_training/"))

bigdata_6.PNG

Conclusion

We have learned to read JSON files with single-line and multiline records into a Spark DataFrame. We have also learned to read single and multiple files simultaneously and to write DataFrames back to DBFS as JSON files using different save modes.

