How to convert files from XML to CSV format in NiFi

This recipe helps you convert files from XML to CSV format in NiFi

Recipe Objective: How to convert files from XML format to CSV format in NiFi?

In most big data scenarios, Apache NiFi is used as open-source software for automating and managing the data flow between systems. It is a robust and reliable system for processing and distributing data, and it provides a web-based user interface to create, monitor, and control data flows. In big data environments, NiFi is widely used to capture real-time streaming data from sources such as databases and to process and analyze it. Converting XML data to CSV is a common requirement in large-scale big data environments.


System requirements: a running Apache NiFi instance.

Step 1: Configure the GetFile

The GetFile processor creates FlowFiles from files in a directory; NiFi will ignore files it doesn't have at least read permission for. Here we pick up the file from a local directory.

[Image: GetFile processor configuration]

In the SCHEDULING tab, we scheduled this processor with a Run Schedule of 60 sec and Execution set to Primary node. Here we ingest driver data from the drivers.xml file in a local directory; for that, we configured the Input Directory property and provided the file name.
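The recipe does not reproduce the contents of drivers.xml, but for the flow to work the file needs a repeating record element. As an illustration only (the element and field names here are assumptions, not taken from the recipe), the input might look like this:

```xml
<!-- Illustrative sample of drivers.xml; actual element and field names
     must match the schema configured later in the AvroSchemaRegistry. -->
<drivers>
    <driver>
        <driverId>10</driverId>
        <name>George Vetticaden</name>
        <location>Santa Clara</location>
        <certified>N</certified>
        <wagePlan>miles</wagePlan>
    </driver>
    <driver>
        <driverId>11</driverId>
        <name>Jamie Engesser</name>
        <location>San Jose</location>
        <certified>N</certified>
        <wagePlan>hours</wagePlan>
    </driver>
</drivers>
```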

Step 2: Configure the UpdateAttribute

Updates the Attributes for a FlowFile using the Attribute Expression Language and/or deletes the attributes based on a regular expression.

Here we use UpdateAttribute to set the schema name used by the Avro schema registry, as below.

[Image: UpdateAttribute configuration adding the schema.name attribute]

As shown above, we added a new attribute, schema.name, with the value drivers.
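Concretely, this is a single dynamic property on the UpdateAttribute processor; the reader and writer services created in the next step can then resolve the schema by this name (typically through the ${schema.name} Expression Language reference in their Schema Name property):

```
schema.name = drivers
```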

Step 3: Configure the ConvertRecord and Create Controller Services:

Here we use an XMLReader controller service that references a schema in an AvroSchemaRegistry controller service. The AvroSchemaRegistry contains a "drivers" schema that defines information about each record (field names, field IDs, field types). A CSVRecordSetWriter controller service references the same AvroSchemaRegistry schema.

In the ConvertRecord processor's properties tab, open the drop-down in the Record Reader value column, as shown below, then click on "Create new service".

[Image: ConvertRecord properties tab with the Record Reader drop-down]

You will then get a pop-up as below; select XMLReader in the Compatible Controller Services drop-down:

[Image: Add Controller Service pop-up for the XMLReader]

Follow the same steps to create a controller service for the CSVRecordSetWriter, as below:

[Image: Add Controller Service pop-up for the CSVRecordSetWriter]

To enable the controller services, select the gear icon from the Operate Palette:

[Image: Gear icon in the Operate Palette]

This opens the NiFi Flow Configuration window. Select the Controller Services tab:

[Image: NiFi Flow Configuration window, Controller Services tab]

Click on the "+" symbol to add the Avro schema registry; it will be added as shown in the image above. Then click on the gear symbol and configure it as below:

[Image: AvroSchemaRegistry configuration]

In the property name field, we provide the schema name (drivers), and in the value field, the Avro schema. Click OK and enable the AvroSchemaRegistry by selecting the lightning bolt icon. This will then allow you to enable the XMLReader and CSVRecordSetWriter controller services.
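The screenshot carries the actual schema text; as a sketch, an Avro schema matching the illustrative driver records from Step 1 would look like the following (field names and types are assumptions, not taken from the recipe):

```json
{
  "type": "record",
  "name": "drivers",
  "fields": [
    { "name": "driverId",  "type": "int" },
    { "name": "name",      "type": "string" },
    { "name": "location",  "type": ["null", "string"] },
    { "name": "certified", "type": ["null", "string"] },
    { "name": "wagePlan",  "type": ["null", "string"] }
  ]
}
```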

Configure the XMLReader as below:

[Image: XMLReader configuration]

Also, configure the CSVRecordSetWriter as below:

[Image: CSVRecordSetWriter configuration]
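The exact values are in the screenshots above; for a schema-name lookup like this one, the key properties of the two services typically look something like the following sketch (property names and defaults can vary by NiFi version):

```
XMLReader
    Schema Access Strategy  : Use 'Schema Name' Property
    Schema Registry         : AvroSchemaRegistry
    Schema Name             : ${schema.name}
    Expect Records as Array : false

CSVRecordSetWriter
    Schema Access Strategy  : Use 'Schema Name' Property
    Schema Registry         : AvroSchemaRegistry
    Schema Name             : ${schema.name}
    Include Header Line     : true
```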

After that, click Apply. You will then see the XMLReader and CSVRecordSetWriter controller services; select the lightning bolt icon for each of these services to enable them. All the controller services should be enabled at this point:

[Image: Controller services enabled via the lightning bolt icon]

Step 4: Configure the UpdateAttribute to update the filename

Updates the attributes for a FlowFile using the Attribute Expression Language and/or deletes attributes based on a regular expression. Here, we give the FlowFile a name.

[Image: UpdateAttribute configuration setting the filename attribute]
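In property terms, this step overwrites the FlowFile's filename attribute with a dynamic property; the value below is an illustrative choice, not necessarily what the screenshot uses:

```
filename = drivers
```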

The output of the filename:

[Image: Output of the filename attribute]

Step 5: Configure the UpdateAttribute to update file extension

Updates the attributes for a FlowFile using the Attribute Expression Language and/or deletes attributes based on a regular expression.

Configure the UpdateAttribute processor as below; it adds the file name with the .csv extension as an attribute to the FlowFile.

[Image: UpdateAttribute configuration appending the .csv extension]
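Assuming the filename attribute was set in the previous step, the property added here typically uses the NiFi Expression Language to append the extension:

```
filename = ${filename}.csv
```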

The output of the filename:

[Image: Output of the filename with the .csv extension]

Step 6: Configure the PutFile

The PutFile processor writes the contents of a FlowFile to the local file system; in other words, we store the converted CSV content in a local directory. For that, we configured it as below:

[Image: PutFile processor configuration]

As shown in the above image, we provided a directory name to store and access the file.

The output file is stored locally, and the data looks as below:

[Image: Output CSV file stored in the local directory]
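Matching the illustrative input and schema sketched above, the converted CSV written by PutFile would look roughly like this (the header line comes from the CSVRecordSetWriter):

```
driverId,name,location,certified,wagePlan
10,George Vetticaden,Santa Clara,N,miles
11,Jamie Engesser,San Jose,N,hours
```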

Conclusion

Here we learned to convert files from XML format to CSV format in NiFi.
