Apache Avro: Read and Write a DataFrame in Avro Format in Spark

This recipe explains what Apache Avro is and how to read and write data as a DataFrame in the Avro file format in Spark. Apache Avro is an open-source, row-based data-serialization and data-exchange framework for Hadoop and big data projects. It is widely used with Apache Spark, especially in Kafka-based data pipelines.

Recipe Objective - What is Apache Avro, and how do you read and write data as a DataFrame in the Avro file format?

Apache Avro is an open-source, row-based data-serialization and data-exchange framework for Hadoop and big data projects. Spark's support for the format originated in the spark-avro library, initially developed by Databricks as an open-source package for reading and writing data in the Avro file format and later merged into Apache Spark itself. Avro is widely used with Apache Spark, especially in Kafka-based data pipelines. When Avro data is stored in a file, its schema is always stored with it, so any program can process the file later. Avro was built to serialize and exchange big data between different Hadoop-based projects. It serializes the data in a compact binary format, while the schema is in JSON format, defining the field names and data types. Avro supports complex data structures such as arrays, maps, and arrays of maps. It is also multi-language: data written in one language can be read by programs in another. In Apache Avro, code generation is not required to read or write data files, which allows simple integration with many dynamic languages.
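To illustrate Avro's JSON schema format, the humans.avsc file referenced later in this recipe could look like the following. The record name, namespace, and field layout here are assumptions that mirror the example dataframe's columns, not output taken from the recipe:

```json
{
  "type": "record",
  "name": "Human",
  "namespace": "com.example.avro",
  "fields": [
    {"name": "firstname",  "type": "string"},
    {"name": "middlename", "type": "string"},
    {"name": "lastname",   "type": "string"},
    {"name": "dob_year",   "type": "int"},
    {"name": "dob_month",  "type": "int"},
    {"name": "gender",     "type": "string"},
    {"name": "salary",     "type": "int"}
  ]
}
```

Each field declares a name and a type, and the schema travels with the data in every Avro file, which is what lets any downstream program decode the binary payload.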

System Requirements

  • Scala (2.12 version)
  • Apache Spark (3.1.1 version)
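Note that since Spark 2.4, the Avro data source ships as the external spark-avro module, so it must be added to the classpath. With the versions above, the sbt dependency (or the equivalent --packages flag for spark-shell/spark-submit) would be:

```scala
// build.sbt — the spark-avro module matching Scala 2.12 / Spark 3.1.1
libraryDependencies += "org.apache.spark" %% "spark-avro" % "3.1.1"

// or, on the command line:
// spark-shell --packages org.apache.spark:spark-avro_2.12:3.1.1
```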

This recipe explains what Apache Avro is and how to read and write data as a DataFrame in the Avro file format.

Using the Avro file format in Databricks

// Importing packages
import java.io.File
import org.apache.avro.Schema
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions._
import spark.sqlContext.implicits._


The java.io.File class, the Avro Schema class, Spark SQL's SaveMode, SparkSession, and functions, and the sqlContext implicits are imported into the environment to read and write data as a DataFrame into an Avro file.

// Implementing Avro file read and write
object ExampleAvro {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[1]")
      .appName("Avro File System")
      .getOrCreate()
    val avdata = Seq(("Rahul ", "", "Chopra", 2019, 2, "M", 5000),
      ("Manoj ", "Sharma", "", 2012, 2, "M", 4563),
      ("Sachin ", "", "Aggarwal", 2012, 2, "M", 4563),
      ("Anita ", "Singh", "Malhotra", 2004, 6, "F", 4892),
      ("Harshdeep", "Singh", "White", 2008, 6, "", -2))
    val column = Seq("firstname", "middlename", "lastname",
      "dob_year", "dob_month", "gender", "salary")
    val dataframe = avdata.toDF(column: _*)

    // Writing the Avro file
    dataframe.write.format("avro")
      .mode(SaveMode.Overwrite)
      .save("/tmp/spark_out/avro/humans.avro")

    // Reading the Avro file
    spark.read.format("avro")
      .load("/tmp/spark_out/avro/humans.avro")
      .show()

    // Writing Avro partitions
    dataframe.write.partitionBy("dob_year", "dob_month")
      .format("avro")
      .mode(SaveMode.Overwrite)
      .save("/tmp/spark_out/avro/humans_partition.avro")

    // Reading the Avro partitions, filtering on a partition column
    // (2012 is a year that actually exists in the sample data)
    spark.read
      .format("avro")
      .load("/tmp/spark_out/avro/humans_partition.avro")
      .where(col("dob_year") === 2012)
      .show()

    // Supplying an explicit Avro schema
    val AvroSchema = new Schema.Parser()
      .parse(new File("src/main/resources/humans.avsc"))
    spark.read
      .format("avro")
      .option("avroSchema", AvroSchema.toString)
      .load("/tmp/spark_out/avro/humans.avro")
      .show()
  }
}


The ExampleAvro object is created, in which a Spark session is initiated. "avdata" is the Spark DataFrame created using a Seq() object. The "column" value is also created using a Seq() object and consists of firstname, middlename, lastname, dob_year, dob_month, gender, and salary. The "dataframe" value converts the "avdata" rows into a DataFrame with these column names using the toDF() function. The DataFrame is written in the Avro file format to humans.avro, and that file is then read back using the load() function. The DataFrame is also partitioned by dob_year and dob_month and written to humans_partition.avro; the partitioned humans_partition.avro file is read back using the load() function, filtering on a partition column. Finally, a schema is stored in the humans.avsc file and supplied via option() while reading the Avro file; this schema describes the structure of the Avro file with its field names and data types.
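Since the recipe highlights Kafka-based pipelines, a minimal sketch of how the same Avro schema could be used with Spark Structured Streaming is shown below. The broker address localhost:9092 and the topic name "humans" are hypothetical; from_avro is the function provided by the spark-avro module for decoding Avro-encoded binary columns:

// Sketch: decoding Avro-encoded Kafka messages with spark-avro.
// Assumes a Kafka broker at localhost:9092 and a "humans" topic (both hypothetical).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.col

object KafkaAvroSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]")
      .appName("Kafka Avro Sketch").getOrCreate()

    // The same JSON schema used elsewhere in this recipe, loaded as a string
    val jsonSchema = new String(java.nio.file.Files.readAllBytes(
      java.nio.file.Paths.get("src/main/resources/humans.avsc")))

    val kafkaDF = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "humans")
      .load()

    // Kafka delivers the payload as binary; from_avro decodes it into columns
    val decoded = kafkaDF
      .select(from_avro(col("value"), jsonSchema).as("human"))
      .select("human.*")

    decoded.writeStream.format("console").start().awaitTermination()
  }
}

Because the schema rides along with the decode call rather than being inferred, the same humans.avsc file governs both the batch reads above and this streaming read.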
