Given two tables, salary and employee, we can use Pig to find the employees with the highest average salary:
-- Load the salary records (an employee may have several).
salary = LOAD '/user/robin/salaries/salaries.csv' USING PigStorage(',')
    AS (uid:int, salary:int, begin:chararray, end:chararray);

-- Load the employee records.
employee = LOAD '/user/robin/employees/employees.csv' USING PigStorage(',')
    AS (uid:int, birth:chararray, givenname:chararray, familyname:chararray, gender:chararray, work:chararray);

-- Inner join on uid; joined fields get prefixes such as employee::uid and salary::salary.
jo = JOIN employee BY uid, salary BY uid;

-- Group by employee, average each employee's salary rows, then sort by that average, highest first.
res = ORDER (
    FOREACH (
        GROUP jo BY (employee::uid, employee::birth, employee::givenname, employee::familyname, employee::gender, employee::work)
    ) GENERATE
        group.employee::uid,
        group.employee::givenname,
        group.employee::familyname,
        AVG(jo.salary::salary) AS avg_salary
) BY avg_salary DESC;

-- Remove any previous output first; STORE fails if the target directory already exists.
fs -rmr /user/robin/join_result;
STORE res INTO '/user/robin/join_result' USING PigStorage(',');
The result is:
109334,'Tsutomu','Alameldin',141835.33333333334
205000,'Charmane','Griswold',141064.63636363635
43624,'Tokuyasu','Pesch',138492.94444444444
493158,'Lidong','Meriste',138312.875
37558,'Juichirou','Thambidurai',138215.85714285713
276633,'Shin','Birdsall',136711.73333333334
238117,'Mitsuyuki','Stanfel',136026.2
46439,'Ibibia','Junet',135747.73333333334
254466,'Honesty','Mukaidono',135541.0625
253939,'Sanjai','Luders',135042.25
....
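To look at only the top rows interactively instead of reading the stored file, something like the following could be run in the Grunt shell. This is a minimal sketch that assumes the relations jo and res from the script above are already defined in the same session; the alias top10 is a hypothetical name introduced here for illustration.

```
-- Assumes jo and res were defined earlier in this session.
-- Print the schema of the joined relation; this shows the prefixed
-- field names (employee::uid, salary::salary, ...) used in the script.
DESCRIBE jo;

-- Keep only the first 10 rows of the sorted result.
-- top10 is a hypothetical alias, not part of the original script.
top10 = LIMIT res 10;

-- Run the pipeline and print the rows to the console.
DUMP top10;
```

Because res is already sorted by avg_salary in descending order, applying LIMIT directly to it should return the ten highest averages.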