In big data scenarios, when the data volume is huge, we may need to work with a subset of the data to speed up analysis. Sampling is the technique of selecting and analyzing a subset of the data in order to identify patterns and trends.
Here we use an employee-related comma-separated values (CSV) dataset to create a Hive table locally.
The data looks as follows:
Before creating a table, open the Hive shell and create a database using the queries below:
CREATE DATABASE dezyre_db;
USE dezyre_db;
The above queries run as follows:
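To confirm that the database exists before switching to it, you can optionally list the databases (a quick check, not part of the original steps):
hive> SHOW DATABASES;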
Here we create a Hive table into which the data will be loaded. Use the query below to create it:
CREATE TABLE employee (
employee_id int,
company_id int,
seniority int,
salary int,
join_date string,
quit_date string,
dept string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1");
The above query runs as follows:
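To confirm that the table was created with the expected columns, you can optionally describe it (a quick check, not part of the original steps):
hive> DESCRIBE employee;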
Loading the data into the Hive table and verifying the data
LOAD DATA LOCAL INPATH '/home/bigdata/Downloads/empdata.csv' INTO TABLE employee;
Verify the data by running a SELECT query as follows:
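The verification query itself is not shown above; as a minimal sketch, a simple SELECT with a LIMIT confirms that the rows were loaded:
hive> SELECT * FROM employee LIMIT 10;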
The windowing feature creates a window over a set of data so that aggregations can be computed across it. The standard aggregations COUNT(), SUM(), MIN(), MAX(), and AVG() can all be used as window functions, along with the analytical functions LEAD, LAG, FIRST_VALUE, and LAST_VALUE.
For example, to count the employees in each department with a window function:
hive> SELECT dept, COUNT(employee_id) OVER (PARTITION BY dept) FROM employee;
Output of the above query:
The above query returns the department count once per employee, so the result contains duplicate rows. To avoid the duplicates, wrap the query and apply DISTINCT as follows:
hive> SELECT DISTINCT * FROM (SELECT dept, COUNT(employee_id) OVER (PARTITION BY dept) AS dept_count FROM employee) AS dist_count;
Output of the above query:
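As an alternative sketch (not part of the original steps; the alias dept_count is illustrative), the same per-department counts can be obtained with a plain GROUP BY, which avoids the duplicate rows in the first place:
hive> SELECT dept, COUNT(employee_id) AS dept_count FROM employee GROUP BY dept;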
To get the maximum salary per employee and department, ordered from highest to lowest:
hive> SELECT employee_id, dept, MAX(salary) AS max_sal FROM employee GROUP BY employee_id, dept ORDER BY max_sal DESC;
The above query runs as follows:
Output of the above query:
To get the minimum salary per employee and department:
hive> SELECT employee_id, dept, MIN(salary) AS min_sal FROM employee GROUP BY employee_id, dept;
Output of the above query:
To get the average salary in the marketing department:
hive> SELECT AVG(salary) AS avg_sal FROM employee WHERE dept='marketing';
Output of the above query:
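Aggregations such as AVG can also be used with the windowing feature itself. As a minimal sketch (the alias avg_dept_sal is an illustrative name), the following attaches each department's average salary to every row:
hive> SELECT employee_id, dept, salary, AVG(salary) OVER (PARTITION BY dept) AS avg_dept_sal FROM employee;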
LEAD returns the value of the given column from the next row within the partition, ordered here by salary:
hive> SELECT employee_id, company_id, seniority, dept, salary, LEAD(employee_id) OVER (PARTITION BY dept ORDER BY salary) FROM employee;
The above query runs as follows:
Output of the above query:
LAG returns the value of the given column from the previous row within the partition:
hive> SELECT employee_id, company_id, seniority, dept, salary, LAG(employee_id) OVER (PARTITION BY dept ORDER BY salary) FROM employee;
The above query runs as follows:
Output of the above query:
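LEAD and LAG also accept an optional offset and default value. As a sketch (the aliases are illustrative), the following looks two rows ahead and one row behind within each department, returning 0 when no such row exists:
hive> SELECT employee_id, dept, salary, LEAD(salary, 2, 0) OVER (PARTITION BY dept ORDER BY salary) AS salary_two_ahead, LAG(salary, 1, 0) OVER (PARTITION BY dept ORDER BY salary) AS salary_one_behind FROM employee;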
FIRST_VALUE returns the first value of the given column within the partition, according to the specified ordering:
hive> SELECT employee_id, company_id, seniority, dept, salary, FIRST_VALUE(employee_id) OVER (PARTITION BY dept ORDER BY salary) FROM employee;
The above query runs as follows:
Output of the above query:
LAST_VALUE returns the last value of the given column within the partition, according to the specified ordering:
hive> SELECT employee_id, company_id, seniority, dept, salary, LAST_VALUE(employee_id) OVER (PARTITION BY dept ORDER BY salary) FROM employee;
The above query runs as follows:
Output of the above query:
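Note that with ORDER BY and the default window frame, LAST_VALUE returns the value from the current row rather than from the last row of the partition. As a minimal sketch (the alias last_emp is illustrative), specifying the frame explicitly gives the true last value per department:
hive> SELECT employee_id, dept, salary, LAST_VALUE(employee_id) OVER (PARTITION BY dept ORDER BY salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_emp FROM employee;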