HANDS-ON-LAB

Ingest and query data in HBase/Phoenix

Problem Statement

This hands-on exercise, Ingest and query data in HBase/Phoenix, walks through loading a CSV data file into HBase via Apache Phoenix. You will copy the Wikipedia edit-stream sample file "wikiticker-2015-09-12.csv" into the HBase/Phoenix container, create a Phoenix table named "wikiticker" with the columns and data types listed below, load the CSV data into it, and then read the first 10 rows both through Phoenix SQL and directly from the underlying HBase table.

Tasks

  1. Prepare the environment: Set up HBase/Phoenix and ensure that it is properly installed and configured.

  2. Copy the CSV file to the HBase/Phoenix container: Use the provided command to copy the CSV file "wikiticker-2015-09-12.csv" into the HBase/Phoenix container.

  3. Create the Phoenix table: Write the necessary SQL statements to create a table named "wikiticker" in Phoenix with the specified columns and data types.

  4. Load data into the Phoenix table: Use Phoenix's built-in capabilities to load the data from the CSV file into the "wikiticker" table.

  5. Read the first 10 rows from the Phoenix table: Write a SQL query in Phoenix to retrieve the first 10 rows from the "wikiticker" table.

  6. Read the first 10 rows from the HBase table: Use HBase commands or a tool like the HBase shell to read the first 10 rows directly from the HBase table.


Learn how to ingest a CSV data file into HBase/Phoenix, create tables, and query data from both Phoenix and HBase. Set up the environment, load data, and retrieve information.

Learnings

  • HBase/Phoenix setup and configuration: Set up and configure HBase/Phoenix in your environment.

  • Data ingestion: Copy the CSV file into the HBase/Phoenix container and load the data into the Phoenix table.

  • Table creation in Phoenix: Write the SQL statement to create a table in Phoenix with the specified columns and data types.

  • Data retrieval from Phoenix: Write a SQL query to retrieve the desired data from the Phoenix table.

  • Data retrieval from HBase: Use HBase commands or tools to directly read data from the HBase table.

 

Data types

Column               Data type
------               ---------
time (primary key)   string
channel              string
cityName             string
countryIsoCode       string
countryName          string
isAnonymous          string
isMinor              string
isNew                string
isRobot              string
isUnpatrolled        string
metroCode            string
namespace            string
regionIsoCode        string
regionName           string
user                 string
delta                integer
added                integer
deleted              integer
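Based on the column list above, the Phoenix DDL for Task 3 might look like the following sketch. Phoenix uses VARCHAR for strings, requires a primary key, and needs double quotes around column names to preserve their mixed case:

```sql
-- Hypothetical Phoenix DDL derived from the data-types table above
CREATE TABLE IF NOT EXISTS wikiticker (
    "time"           VARCHAR PRIMARY KEY,
    "channel"        VARCHAR,
    "cityName"       VARCHAR,
    "countryIsoCode" VARCHAR,
    "countryName"    VARCHAR,
    "isAnonymous"    VARCHAR,
    "isMinor"        VARCHAR,
    "isNew"          VARCHAR,
    "isRobot"        VARCHAR,
    "isUnpatrolled"  VARCHAR,
    "metroCode"      VARCHAR,
    "namespace"      VARCHAR,
    "regionIsoCode"  VARCHAR,
    "regionName"     VARCHAR,
    "user"           VARCHAR,
    "delta"          INTEGER,
    "added"          INTEGER,
    "deleted"        INTEGER
);
```

The column order in the DDL should match the column order of the CSV file so that a positional bulk load maps each field correctly.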

FAQs

Q1. What is HBase?

HBase is a distributed, scalable NoSQL database built on top of the Hadoop Distributed File System (HDFS). It provides real-time read/write access to large datasets and is designed to handle big data workloads.
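For example, basic reads and writes through the HBase shell look like this (the table name "demo" and column family "cf" are illustrative):

```
hbase> create 'demo', 'cf'
hbase> put 'demo', 'row1', 'cf:name', 'alice'
hbase> get 'demo', 'row1'
hbase> scan 'demo', {LIMIT => 10}
```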

 

Q2. What is Phoenix?

Phoenix is a SQL query engine for Apache HBase. It provides an SQL-like interface for querying and managing data stored in HBase, enabling users to leverage familiar SQL syntax and tools for data exploration and analysis.
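For instance, once the "wikiticker" table from this lab exists, Phoenix can answer an aggregate question over HBase-stored data with plain SQL (a hypothetical query, not one of the lab tasks):

```sql
-- Count edits per channel, most active channels first
SELECT "channel", COUNT(*) AS edit_count
FROM wikiticker
GROUP BY "channel"
ORDER BY COUNT(*) DESC
LIMIT 10;
```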

 

Q3. Why use HBase/Phoenix for data storage and querying?

The HBase/Phoenix combination offers a scalable and flexible solution for storing and querying large-scale structured data. HBase provides distributed storage and high availability, while Phoenix adds a SQL layer for easy data querying and manipulation.