Display the tail of a file and its aggregate length in the HDFS

This recipe explains the tail of a file and its aggregate length, and how to display them in HDFS.

Recipe Objective: How to display the tail of a file and its aggregate length in the HDFS?

In this recipe, we see how to display the tail of a file in the HDFS and find its aggregate length.

Prerequisites:

Before proceeding with the recipe, make sure single-node Hadoop is installed on your local EC2 instance. If it is not already installed, follow the link above to do the same.


Steps to set up an environment:

  • In AWS, create an EC2 instance and log in to Cloudera Manager with the public IP mentioned in the EC2 instance. Log in to putty/terminal and check if HDFS is installed. If it is not installed, please find the link provided above for installation.
  • Type “&lt;your public IP&gt;:7180” in the web browser and log in to Cloudera Manager, where you can check if Hadoop is installed.
  • If the required services are not visible in the Cloudera cluster, you may add them by clicking on “Add Services” in the cluster to add them to your local instance.
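Once the services are running, a quick sanity check from the terminal confirms that HDFS is reachable. This is a minimal sketch; the exact output depends on your Hadoop version and distribution:

```shell
# Print the installed Hadoop version
hdfs version

# Summarize cluster capacity and live DataNodes
hdfs dfsadmin -report | head -n 10

# List the root of the distributed filesystem
hadoop fs -ls /
```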

Displaying the tail of a file in HDFS:

We often come across scenarios where a file's content is extensive, as is usually the case in Big Data, and simply displaying the entire file would end up draining resources. In such cases, we use the “-tail” argument to display only the last kilobyte of the file.

Step 1: Switch to root user from ec2-user using the “sudo -i” command.

[Screenshot: switching to the root user with “sudo -i”]

Step 2: Displaying the previous few entries of the file

Passing the “-tail” argument in the hadoop fs command, followed by the full path of the file we would like to display, returns the last kilobyte of the file. The syntax for the same is given below:

hadoop fs -tail &lt;file path&gt;

Below is the sample output when I tried displaying the tail of a file “flights_data.txt.”

[Screenshot: output of “hadoop fs -tail” on flights_data.txt]
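The steps above can be sketched end to end. This example assumes a local file named flights_data.txt and a home directory of /user/root in HDFS; adjust the paths for your environment:

```shell
# Copy the local file into HDFS (assumes /user/root exists)
hadoop fs -put flights_data.txt /user/root/flights_data.txt

# Display the last kilobyte of the file stored in HDFS
hadoop fs -tail /user/root/flights_data.txt

# The -f option follows the file as it grows,
# similar to the Unix "tail -f" command
hadoop fs -tail -f /user/root/flights_data.txt
```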

Finding the aggregate length of a file:

The “-du” parameter helps us find the length of a file in the HDFS. It returns three columns: the size of the file, disk space consumed with all replicas, and full pathname.

hdfs dfs -dus &lt;file path&gt;

This returns the aggregate length of the file, the total space consumed by all its replicas, and its full path. Please note that “-dus” is a deprecated shorthand for “-du -s”; both produce the same output.
However, if we wish to display the result in a more readable format, pass the “-h” option in the above command. The syntax for the same is:

hdfs dfs -du -h &lt;file path&gt;

For example, let us see the aggregate length of the “flights_data.txt” file. From the output in the picture below, we can observe the difference in how the result is displayed when using “-dus” and “-du -h.”

[Screenshot: output of “-dus” vs. “-du -h” for flights_data.txt]
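The two forms can be compared side by side. This sketch assumes the file sits at /user/root/flights_data.txt in HDFS; the reported numbers depend on your file size and replication factor:

```shell
# Aggregate length in bytes, space consumed by all replicas, and path
hdfs dfs -du -s /user/root/flights_data.txt

# The same figures in human-readable form (K, M, G suffixes)
hdfs dfs -du -s -h /user/root/flights_data.txt
```

With the default replication factor of 3, the second column is typically three times the first.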

