Display the tail of a file and its aggregate length in the HDFS

This recipe explains the tail of a file and its aggregate length, and how to display them in HDFS.

Recipe Objective: How to display the tail of a file and its aggregate length in the HDFS?

In this recipe, we see how to display the tail of a file in the HDFS and find its aggregate length.

Prerequisites:

Before proceeding with the recipe, make sure single-node Hadoop is installed on your local EC2 instance. If it is not already installed, follow the link above to do the same.


Steps to set up an environment:

  • In AWS, create an EC2 instance and log in to Cloudera Manager with the public IP mentioned in the EC2 instance. Log in to putty/terminal and check if HDFS is installed. If it is not installed, please find the link provided above for installation.
  • Type “&lt;your public IP&gt;:7180” in the web browser and log in to Cloudera Manager, where you can check if Hadoop is installed.
  • If the required services are not visible in the Cloudera cluster, you may add them by clicking on “Add Services” in the cluster to add them to your local instance.
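Once the services are running, a quick sanity check from the terminal confirms that HDFS is reachable. This is a minimal sketch; the exact output depends on your Hadoop version and distribution:

```shell
# Print the installed Hadoop version
hdfs version

# Summarize cluster capacity and live DataNodes
hdfs dfsadmin -report | head -n 10

# List the root of the distributed filesystem
hadoop fs -ls /
```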

Displaying the tail of a file in HDFS:

We often come across scenarios where a file's content is extensive, as is usually the case in Big Data, and simply displaying the entire file would end up draining resources. In such cases, we use the “-tail” argument to display only the last kilobyte of the file.

Step 1: Switch to root user from ec2-user using the “sudo -i” command.

[Screenshot: switching to the root user with “sudo -i”]

Step 2: Displaying the previous few entries of the file

Passing the “-tail” argument in the hadoop fs command, followed by the full path of the file we would like to display, returns the last kilobyte of the file. The syntax for the same is given below:

hadoop fs -tail &lt;file path&gt;

Below is the sample output when I tried displaying the tail of a file “flights_data.txt.”

[Screenshot: output of “hadoop fs -tail” on flights_data.txt]
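The steps above can be sketched end to end. This example assumes a local file named flights_data.txt and a home directory of /user/root in HDFS; adjust the paths for your environment:

```shell
# Copy the local file into HDFS (assumes /user/root exists)
hadoop fs -put flights_data.txt /user/root/flights_data.txt

# Display the last kilobyte of the file stored in HDFS
hadoop fs -tail /user/root/flights_data.txt

# The -f option follows the file as it grows,
# similar to the Unix "tail -f" command
hadoop fs -tail -f /user/root/flights_data.txt
```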

Finding the aggregate length of a file:

The “-du” parameter helps us find the length of a file in the HDFS. It returns three columns: the size of the file, disk space consumed with all replicas, and full pathname.

hdfs dfs -dus &lt;file path&gt;

This returns the aggregate length of the file, the total space consumed by all its replicas, and its full path. Please note that “-dus” is a deprecated shorthand for “-du -s”; both produce the same output.
However, if we wish to display the result in a more readable format, pass the “-h” option in the above command. The syntax for the same is:

hdfs dfs -du -h &lt;file path&gt;

For example, let us see the aggregate length of the “flights_data.txt” file. From the output in the picture below, we can observe the difference in how the result is displayed when using “-dus” and “-du -h.”

[Screenshot: output of “-dus” vs. “-du -h” for flights_data.txt]
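The two forms can be compared side by side. This sketch assumes the file sits at /user/root/flights_data.txt in HDFS; the reported numbers depend on your file size and replication factor:

```shell
# Aggregate length in bytes, space consumed by all replicas, and path
hdfs dfs -du -s /user/root/flights_data.txt

# The same figures in human-readable form (K, M, G suffixes)
hdfs dfs -du -s -h /user/root/flights_data.txt
```

With the default replication factor of 3, the second column is typically three times the first.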

