I find two datasets: employee and salary for learning and practicing. After putting two files into HDFS, we just need to create tables:
create external table employee ( employee_id INT, birthday DATE, first_name STRING, family_name STRING, gender CHAR(1), work_day DATE) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with serdeproperties ( "separatorChar" = ",", "quoteChar" = "'" ) stored as textfile location '/employee/'; create external table salary ( employee_id INT, salary INT, start_date DATE, end_date DATE) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with serdeproperties ( "separatorChar" = ",", "quoteChar" = "'" ) stored as textfile location '/salary/';
Now we could analyze the data.
Find the oldest 10 employees.
select * from employee order by birthday asc limit 10;
Find all the employees joined the corporation in January 1990.
select * from employee where work_day >= '1990-01-01' and work_day <= '1990-01-31';
Find the top 10 employees earned the highest average salary. Notice we use 'order by' here because 'sort by' only produce local order in reducer.
select e.first_name, e.family_name, avg(s.salary) as avg_salary from employee as e join salary as s on (e.employee_id == s.employee_id) group by e.first_name, e.family_name order by avg_salary limit 10;
Let's find out whether this corporation has sex discrimination:
SELECT e.gender, AVG(s.salary) AS avg_salary FROM employee AS e JOIN salary AS s ON (e.employee_id == s.employee_id) GROUP BY e.gender;
The result is:
F 63767.607741168045 M 63839.90097030445
Looks good 🙂