Convert XLSX file to CSV and store it into HDFS in NiFi

This recipe explains how to convert an XLSX file to CSV and store it in HDFS using NiFi. Apache NiFi is open-source software for automating and managing the flow of data between systems.

Recipe Objective: How to use GetFile to fetch an XLSX file from the local filesystem, convert it to CSV, and store it in HDFS with NiFi?

In most big data scenarios, Apache NiFi is used as open-source software for automating and managing the data flow between systems. It is a robust and reliable system for processing and distributing data, and it provides a web-based user interface for creating, monitoring, and controlling data flows. In this scenario, we fetch an XLSX file from the local filesystem, convert it to CSV, and store the result in HDFS.
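
At a high level, the flow chains three processors, and each step below configures one of them:

    GetFile  -->  ConvertExcelToCSVProcessor  -->  PutHDFS
    (read Employee.xlsx)   (one CSV FlowFile per sheet)   (write the CSV to HDFS)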

System requirements: a running Apache NiFi instance with access to a Hadoop cluster (HDFS).

Note: This scenario demonstrates how to configure and use the ConvertExcelToCSVProcessor. We have the XLSX file on the local filesystem, and its data looks as shown below.

[Image: bigdata_1.jpg - sample data in the Employee.xlsx file]

Step 1: Configure the GetFile

The GetFile processor creates FlowFiles from files in a directory; NiFi ignores files for which it does not have at least read permission. Here we are fetching the file from a local directory.

[Image: bigdata_2.jpg - GetFile processor configuration]

Here we are ingesting the Employee.xlsx file from a local directory. For that, we configure the Input Directory property and provide the file name in the File Filter property, as in the sketch below.
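
A minimal sketch of the GetFile configuration used here; the input directory is an assumed example path, so substitute your own:

    Input Directory  : /home/nifi/input       (assumed example path)
    File Filter      : Employee.xlsx
    Keep Source File : false                  (default: remove the source file after ingest)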

Step 2: Configure the ConvertExcelToCSVProcessor

ConvertExcelToCSVProcessor consumes a Microsoft Excel document and converts each worksheet to CSV. Each sheet from the incoming Excel document generates a new FlowFile that is output from this processor, and each output FlowFile's contents are formatted as a CSV file in which each row of the Excel sheet becomes a new line.

[Image: bigdata_3.jpg - ConvertExcelToCSVProcessor configuration]

As shown in the above image, we set the Sheets to Extract property to Employees, the name of the worksheet to convert; a configuration sketch follows.
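
A minimal sketch of the ConvertExcelToCSVProcessor configuration used here; apart from Sheets to Extract, the values shown are the processor defaults:

    Sheets to Extract      : Employees        (worksheet name from this recipe)
    Number of Rows to Skip : 0                (default: convert every row)
    Format Cell Values     : false            (default: raw cell values, no Excel formatting applied)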

The output filename:

[Image: bigdata_4.jpg - filename of the converted FlowFile]

The converted data looks as shown below:

[Image: bigdata_5.jpg - converted CSV data]

Step 3: Configure the PutHDFS

PutHDFS writes FlowFile data to the Hadoop Distributed File System (HDFS). Here we are writing the converted CSV data and storing it in HDFS; configure the processor as below.

Note: In the Hadoop Configuration Resources property, we should provide the 'core-site.xml' and 'hdfs-site.xml' files, because otherwise Hadoop searches the classpath for a 'core-site.xml' and 'hdfs-site.xml' file or reverts to a default configuration.

[Image: bigdata_6.jpg - PutHDFS processor configuration]

In the above image, we provided the Hadoop configuration resources, and in the Directory property we gave the name of the HDFS directory in which to store the files. We set the Conflict Resolution Strategy to append so that newly arriving data is appended to the existing file. A sketch of the configuration follows.
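
A minimal sketch of the PutHDFS configuration used here; the configuration-file paths and the HDFS directory are assumed examples, so substitute your own:

    Hadoop Configuration Resources : /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml   (assumed paths)
    Directory                      : /user/nifi/employee_csv                                          (assumed HDFS directory)
    Conflict Resolution Strategy   : append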

The stored data in HDFS and its file structure:

[Image: bigdata_7.jpg - file stored in HDFS]

Note: If you get permission errors when storing into HDFS, go to the Hadoop installation folder, edit hdfs-site.xml, add the property below, and restart HDFS (on Hadoop 2.x and later the property is named dfs.permissions.enabled):

    <property>
      <name>dfs.permissions</name>
      <value>false</value>
    </property>

Conclusion

Here we learned to use GetFile to fetch an XLSX file from the local filesystem, convert it to CSV, and store it in HDFS with NiFi.
