Apache Avro: Read and Write a DataFrame in Avro Format in Spark

This recipe explains what Apache Avro is and how to read and write data as a DataFrame in the Avro file format in Spark. Apache Avro is an open-source, row-based data-serialization and data-exchange framework for Hadoop and big data projects. It is widely used with Apache Spark, especially in Kafka-based data pipelines.

Recipe Objective - What is Apache Avro, and how do you read and write data as a DataFrame in the Avro file format?

Apache Avro is an open-source, row-based data-serialization and data-exchange framework for Hadoop and big data projects. Spark's support for the format originated in the spark-avro library, initially developed by Databricks as an open-source package for reading and writing data in the Avro file format and later merged into Apache Spark itself. Avro is widely used with Apache Spark, especially in Kafka-based data pipelines. When Avro data is stored in a file, its schema is always stored with it, so any program can process the file later. Avro was built to serialize and exchange big data between different Hadoop-based projects. It serializes the data in a compact binary format, while the schema is in JSON format, defining the field names and data types. Avro supports complex data structures such as arrays, maps, and arrays of maps. It is also multi-language: data written in one language can be read by programs in another. In Apache Avro, code generation is not required to read or write data files, which allows simple integration with many dynamic languages.
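To illustrate Avro's JSON schema format, the humans.avsc file referenced later in this recipe could look like the following. The record name, namespace, and field layout here are assumptions that mirror the example dataframe's columns, not output taken from the recipe:

```json
{
  "type": "record",
  "name": "Human",
  "namespace": "com.example.avro",
  "fields": [
    {"name": "firstname",  "type": "string"},
    {"name": "middlename", "type": "string"},
    {"name": "lastname",   "type": "string"},
    {"name": "dob_year",   "type": "int"},
    {"name": "dob_month",  "type": "int"},
    {"name": "gender",     "type": "string"},
    {"name": "salary",     "type": "int"}
  ]
}
```

Each field declares a name and a type, and the schema travels with the data in every Avro file, which is what lets any downstream program decode the binary payload.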

System Requirements

  • Scala (2.12 version)
  • Apache Spark (3.1.1 version)
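Note that since Spark 2.4, the Avro data source ships as the external spark-avro module, so it must be added to the classpath. With the versions above, the sbt dependency (or the equivalent --packages flag for spark-shell/spark-submit) would be:

```scala
// build.sbt — the spark-avro module matching Scala 2.12 / Spark 3.1.1
libraryDependencies += "org.apache.spark" %% "spark-avro" % "3.1.1"

// or, on the command line:
// spark-shell --packages org.apache.spark:spark-avro_2.12:3.1.1
```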

This recipe explains what Apache Avro is and how to read and write data as a DataFrame in the Avro file format.

Using the Avro file format in Databricks

// Importing packages
import java.io.File
import org.apache.avro.Schema
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions._
import spark.sqlContext.implicits._


The java.io.File class, the Avro Schema class, Spark SQL's SaveMode, SparkSession, and functions, and the sqlContext implicits are imported into the environment to read and write data as a DataFrame into an Avro file.

// Implementing Avro file read and write
object ExampleAvro {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[1]")
      .appName("Avro File System")
      .getOrCreate()
    val avdata = Seq(("Rahul ", "", "Chopra", 2019, 2, "M", 5000),
      ("Manoj ", "Sharma", "", 2012, 2, "M", 4563),
      ("Sachin ", "", "Aggarwal", 2012, 2, "M", 4563),
      ("Anita ", "Singh", "Malhotra", 2004, 6, "F", 4892),
      ("Harshdeep", "Singh", "White", 2008, 6, "", -2))
    val column = Seq("firstname", "middlename", "lastname",
      "dob_year", "dob_month", "gender", "salary")
    val dataframe = avdata.toDF(column: _*)

    // Writing the Avro file
    dataframe.write.format("avro")
      .mode(SaveMode.Overwrite)
      .save("/tmp/spark_out/avro/humans.avro")

    // Reading the Avro file
    spark.read.format("avro")
      .load("/tmp/spark_out/avro/humans.avro")
      .show()

    // Writing Avro partitions
    dataframe.write.partitionBy("dob_year", "dob_month")
      .format("avro")
      .mode(SaveMode.Overwrite)
      .save("/tmp/spark_out/avro/humans_partition.avro")

    // Reading the Avro partitions, filtering on a partition column
    // (2012 is a year that actually exists in the sample data)
    spark.read
      .format("avro")
      .load("/tmp/spark_out/avro/humans_partition.avro")
      .where(col("dob_year") === 2012)
      .show()

    // Supplying an explicit Avro schema
    val AvroSchema = new Schema.Parser()
      .parse(new File("src/main/resources/humans.avsc"))
    spark.read
      .format("avro")
      .option("avroSchema", AvroSchema.toString)
      .load("/tmp/spark_out/avro/humans.avro")
      .show()
  }
}


The ExampleAvro object is created, in which a Spark session is initiated. "avdata" is the Spark DataFrame created using a Seq() object. The "column" value is also created using a Seq() object and consists of firstname, middlename, lastname, dob_year, dob_month, gender, and salary. The "dataframe" value converts the "avdata" rows into a DataFrame with these column names using the toDF() function. The DataFrame is written in the Avro file format to humans.avro, and that file is then read back using the load() function. The DataFrame is also partitioned by dob_year and dob_month and written to humans_partition.avro; the partitioned humans_partition.avro file is read back using the load() function, filtering on a partition column. Finally, a schema is stored in the humans.avsc file and supplied via option() while reading the Avro file; this schema describes the structure of the Avro file with its field names and data types.
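Since the recipe highlights Kafka-based pipelines, a minimal sketch of how the same Avro schema could be used with Spark Structured Streaming is shown below. The broker address localhost:9092 and the topic name "humans" are hypothetical; from_avro is the function provided by the spark-avro module for decoding Avro-encoded binary columns:

// Sketch: decoding Avro-encoded Kafka messages with spark-avro.
// Assumes a Kafka broker at localhost:9092 and a "humans" topic (both hypothetical).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.col

object KafkaAvroSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]")
      .appName("Kafka Avro Sketch").getOrCreate()

    // The same JSON schema used elsewhere in this recipe, loaded as a string
    val jsonSchema = new String(java.nio.file.Files.readAllBytes(
      java.nio.file.Paths.get("src/main/resources/humans.avsc")))

    val kafkaDF = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "humans")
      .load()

    // Kafka delivers the payload as binary; from_avro decodes it into columns
    val decoded = kafkaDF
      .select(from_avro(col("value"), jsonSchema).as("human"))
      .select("human.*")

    decoded.writeStream.format("console").start().awaitTermination()
  }
}

Because the schema rides along with the decode call rather than being inferred, the same humans.avsc file governs both the batch reads above and this streaming read.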
