HANDS-ON-LAB

Ingest and query data in HBase/Phoenix

Problem Statement

This hands-on exercise, Ingest and query data in HBase/Phoenix, walks through loading a CSV data file into HBase via Apache Phoenix. You will copy the Wikipedia edit-stream sample file "wikiticker-2015-09-12.csv" into the HBase/Phoenix container, create a Phoenix table named "wikiticker" with the columns and data types listed below, load the CSV data into it, and then read the first 10 rows both through Phoenix SQL and directly from the underlying HBase table.

Tasks

  1. Prepare the environment: Set up HBase/Phoenix and ensure that it is properly installed and configured.

  2. Copy the CSV file to the HBase/Phoenix container: Use the provided command to copy the CSV file "wikiticker-2015-09-12.csv" into the HBase/Phoenix container.

  3. Create the Phoenix table: Write the necessary SQL statements to create a table named "wikiticker" in Phoenix with the specified columns and data types.

  4. Load data into the Phoenix table: Use Phoenix's built-in capabilities to load the data from the CSV file into the "wikiticker" table.

  5. Read the first 10 rows from the Phoenix table: Write a SQL query in Phoenix to retrieve the first 10 rows from the "wikiticker" table.

  6. Read the first 10 rows from the HBase table: Use HBase commands or a tool like the HBase shell to read the first 10 rows directly from the HBase table.


Learn how to ingest a CSV data file into HBase/Phoenix, create tables, and query data from both Phoenix and HBase. Set up the environment, load data, and retrieve information.

Learnings

  • HBase/Phoenix setup and configuration: Set up and configure HBase/Phoenix in your environment.

  • Data ingestion: Copy the CSV file into the HBase/Phoenix container and load the data into the Phoenix table.

  • Table creation in Phoenix: Write the SQL statement to create a table in Phoenix with the specified columns and data types.

  • Data retrieval from Phoenix: Write a SQL query to retrieve the desired data from the Phoenix table.

  • Data retrieval from HBase: Use HBase commands or tools to directly read data from the HBase table.

 

Data types

Column               Data type
------               ---------
time (primary key)   string
channel              string
cityName             string
countryIsoCode       string
countryName          string
isAnonymous          string
isMinor              string
isNew                string
isRobot              string
isUnpatrolled        string
metroCode            string
namespace            string
regionIsoCode        string
regionName           string
user                 string
delta                integer
added                integer
deleted              integer
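Based on the column list above, the Phoenix DDL for Task 3 might look like the following sketch. Phoenix uses VARCHAR for strings, requires a primary key, and needs double quotes around column names to preserve their mixed case:

```sql
-- Hypothetical Phoenix DDL derived from the data-types table above
CREATE TABLE IF NOT EXISTS wikiticker (
    "time"           VARCHAR PRIMARY KEY,
    "channel"        VARCHAR,
    "cityName"       VARCHAR,
    "countryIsoCode" VARCHAR,
    "countryName"    VARCHAR,
    "isAnonymous"    VARCHAR,
    "isMinor"        VARCHAR,
    "isNew"          VARCHAR,
    "isRobot"        VARCHAR,
    "isUnpatrolled"  VARCHAR,
    "metroCode"      VARCHAR,
    "namespace"      VARCHAR,
    "regionIsoCode"  VARCHAR,
    "regionName"     VARCHAR,
    "user"           VARCHAR,
    "delta"          INTEGER,
    "added"          INTEGER,
    "deleted"        INTEGER
);
```

The column order in the DDL should match the column order of the CSV file so that a positional bulk load maps each field correctly.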

FAQs

Q1. What is HBase?

HBase is a distributed, scalable NoSQL database built on top of the Hadoop Distributed File System (HDFS). It provides real-time read/write access to large datasets and is designed to handle big data workloads.
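For example, basic reads and writes through the HBase shell look like this (the table name "demo" and column family "cf" are illustrative):

```
hbase> create 'demo', 'cf'
hbase> put 'demo', 'row1', 'cf:name', 'alice'
hbase> get 'demo', 'row1'
hbase> scan 'demo', {LIMIT => 10}
```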

 

Q2. What is Phoenix?

Phoenix is a SQL query engine for Apache HBase. It provides an SQL-like interface for querying and managing data stored in HBase, enabling users to leverage familiar SQL syntax and tools for data exploration and analysis.
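For instance, once the "wikiticker" table from this lab exists, Phoenix can answer an aggregate question over HBase-stored data with plain SQL (a hypothetical query, not one of the lab tasks):

```sql
-- Count edits per channel, most active channels first
SELECT "channel", COUNT(*) AS edit_count
FROM wikiticker
GROUP BY "channel"
ORDER BY COUNT(*) DESC
LIMIT 10;
```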

 

Q3. Why use HBase/Phoenix for data storage and querying?

The HBase/Phoenix combination offers a scalable and flexible solution for storing and querying large-scale structured data. HBase provides distributed storage and high availability, while Phoenix adds a SQL layer for easy data querying and manipulation.