How to return checksum information of files in HDFS


Recipe Objective: How to return checksum information of files in HDFS?

Data integrity in HDFS: a checksum is a small block of data derived from another block of digital data, used to detect errors that may have been introduced during its transmission or storage. By themselves, checksums verify data integrity but are not relied upon to verify data authenticity. HDFS transparently checksums all data written to it and, by default, verifies those checksums when reading the data back. A separate checksum is created for every dfs.bytes-per-checksum bytes of data (512 bytes by default). In this recipe, we see how to return the checksum information of files in HDFS.
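When checksum verification fails on a read, HDFS marks the affected block replica as corrupt and serves the data from a healthy replica instead. As a side note (not required for this recipe), corrupt blocks under a path can be listed with fsck:

hdfs fsck / -list-corruptfileblocks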


Prerequisites:

Before proceeding with the recipe, make sure a single-node Hadoop installation is set up and running on your local EC2 instance.

Steps to set up an environment:

  • In AWS, create an EC2 instance and log in to Cloudera Manager using the public IP of that instance. Log in to the instance through PuTTY or a terminal and check whether Hadoop is installed; if it is not, install single-node Hadoop as mentioned in the prerequisites. A quick command-line check is shown after this list.
  • Type “<your public IP>:7180” in the web browser and log in to Cloudera Manager, where you can check whether the Hadoop services are up and running.
  • If the required services are not visible in the Cloudera cluster, add them by clicking “Add Services” in the cluster.
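As a quick sanity check (assuming the Hadoop binaries are on the PATH of your terminal session), the following commands confirm that Hadoop is installed and its daemons are running:

hadoop version

jps

The first prints the installed Hadoop version; the second lists the running Java daemons, which should include NameNode and DataNode on a single-node setup.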

Returning checksum information of a file in HDFS:

Let us take a file present in the HDFS home directory as an example and find the checksum information of that file. First, list the files available in the home directory:

hdfs dfs -ls

It returns the list of files present in the HDFS home directory of the current user. The output looks similar to:

(Screenshot: output of hdfs dfs -ls listing the files in the home directory)
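Note that hdfs dfs -ls without a path argument lists the current user's HDFS home directory (typically /user/<username>). To list the actual filesystem root instead, pass the path explicitly:

hdfs dfs -ls /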

The checksum of a file can be returned using the command:

hdfs dfs -checksum <file path>

Let’s find the checksum info of the “test.txt” file present in our home directory.
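Assuming test.txt sits directly in the home directory, the command is:

hdfs dfs -checksum test.txt

The output of this command is shown below.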

(Screenshot: output of hdfs dfs -checksum for the test.txt file)
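The output contains three fields: the file path, the checksum algorithm (MD5-of-0MD5-of-512CRC32C with the default settings, i.e. an MD5 of the per-block MD5s of 512-byte CRC32C checksums), and the checksum value in hexadecimal. Schematically:

<file path>   MD5-of-0MD5-of-512CRC32C   <checksum bytes in hex>

Because the value depends on the block size and bytes-per-checksum settings as well as the file contents, it is most useful for comparing copies of a file between HDFS locations that use the same configuration.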

