How to take a source file and output the file in text format in HDFS

This recipe helps you take a source file and output the file in text format in HDFS. The Hadoop fs shell command text takes a source file and outputs the file in text format. It detects the encoding of the file and decodes it to plain text.

Recipe Objective: How to take a source file and output the file in text format in HDFS?

Let us see how to take a source file and output the file in text format in HDFS.


Prerequisites:

Before proceeding with the recipe, make sure a single-node Hadoop cluster is installed on your local EC2 instance. If it is not already installed, follow the link below to do the same.

Steps to set up an environment:

  • In AWS, create an EC2 instance and log in to Cloudera Manager using the public IP of the EC2 instance. Log in to the instance through PuTTY or a terminal and check whether HDFS is installed (see the commands after this list). If it is not installed, refer to the installation links provided above.
  • Type “<your public IP>:7180” in the web browser and log in to Cloudera Manager, where you can check whether Hadoop is installed.
  • If the required services are not visible in the Cloudera cluster, you can add them by clicking “Add Services” on the cluster.
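To verify from the terminal that Hadoop and HDFS are working on the instance, you can run the standard checks below (a minimal sketch; it assumes the hadoop and hdfs binaries are already on your PATH):

hadoop version          # prints the installed Hadoop version
hdfs dfsadmin -report   # summarizes the state of HDFS and its datanodes (may require HDFS superuser privileges)
hadoop fs -ls /         # lists the root of the HDFS filesystem

If these commands return sensible output, HDFS is up and you can proceed with the recipe.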

Taking a source file and outputting the file in text format in HDFS:

The Hadoop fs shell command text takes a source file and outputs the file in text format. It detects the encoding of the file and decodes it to plain text; the supported input formats include zip and TextRecordInputStream. The syntax for the command is:

hadoop fs -text <source file>

In this recipe, we read a zip file named “upgrad” present in the root user’s directory and output its contents in text format.
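As a sketch of the full flow, assuming the archive sits locally at /root/upgrad.zip and the HDFS home directory of the root user is /user/root (both paths are illustrative, not taken from the original recipe):

hadoop fs -put /root/upgrad.zip /user/root/    # copy the local zip file into HDFS (skip if it is already there)
hadoop fs -ls /user/root                       # confirm the file is present in HDFS
hadoop fs -text /user/root/upgrad.zip          # decode the zip file and print its contents as plain text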

Command output: the decoded plain-text contents of the file are printed to the console.


