How to extract values from XML data in NiFi


Recipe Objective: How to extract values from XML data in NiFi?

In most big data scenarios, Apache NiFi is used as open-source software for automating and managing the data flow between systems. It is a robust, reliable system for processing and distributing data, and it provides a web-based user interface to create, monitor, and control data flows. Ingesting data from files and databases is a common way to capture, process, and analyze real-time streaming data in big data environments, and XML is a widely used format in such large-scale systems. In this scenario, we will split an XML file into multiple XML documents and then extract the attributes and their values from the XML data.



Step 1: Configure the GetFile

Creates FlowFiles from files in a directory. NiFi will ignore files it doesn't have at least read permissions for. Here we are getting the file from the local directory.

bigdata_1.jpg

In the SCHEDULING tab, we scheduled this processor to run every 60 seconds (Run Schedule) and set Execution to Primary node. Here we are ingesting the drivers_data.xml file from a local directory; we configured the Input Directory and provided the filename.
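The recipe does not show the contents of drivers_data.xml, so as a working assumption, the steps below treat it as a root element with repeated row children carrying attributes. A minimal sketch of that assumed layout, parsed with Python's standard library:

```python
import xml.etree.ElementTree as ET

# Hypothetical drivers_data.xml content; the real file's schema is not
# shown in the recipe, so the element and attribute names are assumptions.
DRIVERS_XML = """\
<dataset>
  <row id="10" name="George Vetticaden"><location>Santa Clara</location></row>
  <row id="11" name="Jamie Engesser"><location>San Jose</location></row>
</dataset>
"""

root = ET.fromstring(DRIVERS_XML)
# The root holds one <row> element per driver record.
print(root.tag, len(root))
```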

Step 2: Configure the SplitXML

Splits an XML File into multiple separate FlowFiles, each comprising a child or descendant of the original root element.

bigdata_2.jpg

In the above, we set the Split Depth to 1. A depth of 1 splits the root's children into separate FlowFiles, whereas a depth of 2 splits the root's children's children, and so forth.
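To make the split behavior concrete, here is a rough Python equivalent of SplitXML at depth 1, using an assumed input layout (the real schema is not shown in the recipe): each direct child of the root becomes its own document, just as each row becomes its own FlowFile.

```python
import xml.etree.ElementTree as ET

# Assumed input; element and attribute names are illustrative only.
xml_doc = """<dataset>
  <row id="10"><location>Santa Clara</location></row>
  <row id="11"><location>San Jose</location></row>
</dataset>"""

root = ET.fromstring(xml_doc)
# Split Depth 1: serialize each direct child of the root as a separate
# document, mirroring one FlowFile per <row> out of SplitXML.
splits = [ET.tostring(child, encoding="unicode") for child in root]
for doc in splits:
    print(doc)
```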

The output of the split XML data:

bigdata_3.jpg

To extract the attributes and their values from the XML data, we will use the EvaluateXPath processor.

Step 3: Configure the EvaluateXPath

This processor evaluates one or more XPaths against the content of a FlowFile. The results of those XPaths are assigned to FlowFile attributes or written to the FlowFile itself, depending on the configuration of the processor. XPaths are entered by adding user-defined properties; the name of each property maps to the attribute name into which the result is placed (when the Destination is flowfile-attribute; otherwise, the property name is ignored). The value of each property must be a valid XPath expression.

bigdata_4.jpg

As shown in the above image, we evaluate the attribute values from the XML data. In this data, "row" is the parent element and "location" is its child.
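A small sketch of what EvaluateXPath does to one split document, with ElementTree standing in for the XPath engine. The property names (driver.id, driver.location) and the document layout are assumptions for illustration; in NiFi the equivalent XPaths would be along the lines of string(/row/@id) and string(/row/location).

```python
import xml.etree.ElementTree as ET

# One split document, as produced by the SplitXML step; names are assumed.
flowfile_content = '<row id="10"><location>Santa Clara</location></row>'

row = ET.fromstring(flowfile_content)
# With Destination = flowfile-attribute, each user-defined property in
# EvaluateXPath becomes a FlowFile attribute holding the XPath result.
attributes = {
    "driver.id": row.get("id"),
    "driver.location": row.findtext("location"),
}
print(attributes)
```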

The output of the Evaluated attribute values:

bigdata_5.jpg

Step 4: Configure the ReplaceText

Updates the content of a FlowFile by evaluating a Regular Expression (regex) against it and replacing the section of the content that matches the Regular Expression with some alternate value.

bigdata_6.jpg

After evaluating the required attributes and their values, we arrange them column by column using ReplaceText, configured as shown below.
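The idea of this step can be sketched in Python: a regex matching the whole content replaces it with the attribute values joined as one delimited row (in NiFi the replacement would be an Expression Language template such as ${driver.id},${driver.location}). The attribute names and values here are assumptions carried over from the previous sketch.

```python
import re

# Hypothetical attribute values evaluated in the previous step.
attributes = {"driver.id": "10", "driver.location": "Santa Clara"}

flowfile_content = '<row id="10"><location>Santa Clara</location></row>'

# ReplaceText: match the entire content and replace it with the
# comma-separated attribute values, one CSV row per FlowFile.
csv_row = re.sub(
    r"(?s)^.*$",
    lambda m: ",".join(attributes.values()),
    flowfile_content,
)
print(csv_row)  # 10,Santa Clara
```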

bigdata_7.jpg

The output of the data is as below:

bigdata_8.jpg

Step 5: Configure the MergeContent

Merges a group of FlowFiles based on a user-defined strategy and packages them into a single FlowFile. Here we merge the single-row FlowFiles into groups of 1000 rows, configured as below.

bigdata_9.jpg

In the above, we set the Delimiter Strategy to Text and, for the Demarcator value, pressed Shift+Enter and clicked OK, because every row needs to be placed on a new line.
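A newline Demarcator with the Text delimiter strategy is equivalent to joining the queued FlowFile contents with "\n". A tiny sketch, using made-up row values:

```python
# Single-row FlowFile contents queued for MergeContent; values are made up.
rows = ["10,Santa Clara", "11,San Jose", "12,Mountain View"]

# Delimiter Strategy = Text with a newline Demarcator (Shift+Enter in the
# NiFi property editor) joins the fragments with a line break between rows.
merged = "\n".join(rows)
print(merged)
```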

The output of the data:

bigdata_10.jpg

Step 6: Configure the UpdateAttribute to update the filename

Updates the attributes of a FlowFile using the Attribute Expression Language and/or deletes attributes based on a regular expression. Here, we set a name for the FlowFiles.

bigdata_11.jpg

The output of the file name:

bigdata_12.jpg

Step 7: Configure the UpdateAttribute to update file extension

We configured the UpdateAttribute processor as below; it appends the .xml extension to the filename attribute of the FlowFiles.
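The two UpdateAttribute steps together amount to rewriting the filename attribute with Expression Language, e.g. something like drivers_${now():format('yyyyMMddHHmmss')}.xml. The prefix and timestamp pattern here are assumptions, not taken from the recipe; this sketch only shows the shape of the attribute update:

```python
import datetime

# FlowFile attributes before the update; the original filename comes
# from GetFile. The "drivers" base name is an assumed example.
attributes = {"filename": "drivers"}

# UpdateAttribute rewrites the filename with a timestamp and the .xml
# extension, analogous to ${now():format('yyyyMMddHHmmss')} in NiFi.
timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
attributes["filename"] = f"{attributes['filename']}_{timestamp}.xml"
print(attributes["filename"])
```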

bigdata_13.jpg

The output of the file name:

bigdata_14.jpg

Step 8: Configure the PutFile

Writes the contents of a FlowFile to the local file system; that is, we store the converted CSV content in a local directory, for which we configured the processor as shown below:

bigdata_15.jpg

As shown in the above image, we provided a directory name to store and access the file.

The output file stored locally, and its data, look as below:

bigdata_16.jpg

Conclusion

Here we learned to extract values from XML data in NiFi.

