Read and write a DataFrame into a text file in Apache Spark

This recipe helps you read and write data as a DataFrame in text file format in Apache Spark. The DataFrame in Apache Spark is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or Python, but offers richer optimizations.

Recipe Objective - How to read and write data as a DataFrame into a text file format in Apache Spark?

The DataFrame in Apache Spark is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or Python, but offers richer optimizations. DataFrames can be constructed from a wide array of sources: structured data files, tables in Hive, external databases, or existing Resilient Distributed Datasets (RDDs). A text file is a computer file structured as a sequence of lines of electronic text, stored as data within a computer file system; "text file" refers to a type of container, whereas "plain text" refers to a type of content. Apache Spark provides several ways to read .txt files: the sparkContext.textFile() and sparkContext.wholeTextFiles() methods read into a Resilient Distributed Dataset (RDD), and the spark.read.text() and spark.read.textFile() methods read into a DataFrame or Dataset from a local or HDFS file.
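For reference, here is a minimal sketch of these four read APIs, assuming an existing SparkSession named spark and a sample file at /FileStore/tables/textfile.txt (the wildcard path passed to wholeTextFiles is purely illustrative):

// Reading a text file with the RDD and the DataFrame/Dataset APIs
import org.apache.spark.sql.{DataFrame, Dataset}

// RDD APIs: one record per line, or one (filename, content) pair per file
val linesRdd = spark.sparkContext.textFile("/FileStore/tables/textfile.txt")
val filesRdd = spark.sparkContext.wholeTextFiles("/FileStore/tables/*.txt")

// DataFrame/Dataset APIs: a single string column named "value"
val linesDf: DataFrame = spark.read.text("/FileStore/tables/textfile.txt")
val linesDs: Dataset[String] = spark.read.textFile("/FileStore/tables/textfile.txt")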

System Requirements

  • Scala (2.12 version)
  • Apache Spark (3.1.1 version)

This recipe explains the Spark DataFrame and the various options available while reading and writing data as a DataFrame into a text file.

Implementing a Spark Text File in Databricks

nullValue: The nullValue option specifies a string that should be treated as null while reading. For example, if a date column containing the value "2000-01-01" should be read as missing, set that string as the nullValue and the DataFrame will hold null instead.

dateFormat: The dateFormat option sets the format used to parse input DateType and TimestampType columns. In Spark 3.x the patterns follow the java.time datetime patterns (earlier Spark versions used java.text.SimpleDateFormat).
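As a hedged illustration only (these two options belong to the CSV and JSON readers rather than to the plain text reader, and the file path and columns below are hypothetical), a read that uses both options could look like this:

// Illustrative sketch: nullValue and dateFormat on a CSV read (hypothetical file)
val ordersDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("nullValue", "2000-01-01")   // treat this exact string as null
  .option("dateFormat", "yyyy-MM-dd")  // pattern used to parse DateType columns
  .csv("/FileStore/tables/orders.csv")
ordersDf.printSchema()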

// Importing Packages
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import spark.implicits._


The Spark SQL and implicits packages are imported to read and write data as a DataFrame into text file format.

// Implementing Text File
object TextFile {

  def main(args: Array[String]): Unit = {

    val spark: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("Spark Text File")
      .getOrCreate()

    // Provides the Encoders used by the map transformation below
    // (required when running outside the Databricks notebook)
    import spark.implicits._

    // Reading the text file returns a DataFrame with a single column named "value"
    val dataframe: DataFrame = spark.read.text("/FileStore/tables/textfile.txt")
    dataframe.printSchema()
    dataframe.show(false)

    // Converting each record into columns by splitting on commas
    // using a map transformation
    val dataframe2 = dataframe.map(f => {
      val element = f.getString(0).split(",")
      (element(0), element(1))
    })
    dataframe2.printSchema()
    dataframe2.show(false)

    // Writing of the text file: the text sink supports only a single string column,
    // so the columns are concatenated back; the output goes to a new directory
    // because Spark cannot write to the same path it is reading from
    dataframe2.map(row => row._1 + "," + row._2)
      .write
      .text("/FileStore/tables/textfile_output")
  }
}


A TextFile object is created in which the Spark session is initiated. The dataframe value reads textfile.txt using the spark.read.text("path") function. The dataframe2 value converts the records (a single column named "value") into separate columns by splitting each line on commas with a map transformation. Finally, because the text sink accepts only a single string column, the columns are concatenated back into one string and written out using the dataframe.write.text("path") function.
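The same split can also be expressed without a map, using the built-in split and concat_ws column functions. The following is only a sketch; the column names and the output directory are assumptions rather than part of the original recipe:

// Alternative split using built-in column functions (sketch)
import org.apache.spark.sql.functions.{split, concat_ws, col}

val columnsDf = dataframe
  .withColumn("parts", split(col("value"), ","))
  .select(col("parts").getItem(0).as("first_column"),
          col("parts").getItem(1).as("second_column"))
columnsDf.show(false)

// The text sink accepts a single string column, so join the columns back before writing
columnsDf
  .select(concat_ws(",", col("first_column"), col("second_column")).as("value"))
  .write
  .mode("overwrite")
  .text("/FileStore/tables/textfile_columns_output")   // hypothetical output directory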

