Build dataflow to get monthly top price of Land Trading in UK
The dataset is downloaded from UK government data web(The total data size is more than 3GB). And, I am using Apache Oozie to run Hive and Sqoop job periodically. The Hive script “land_price.hql”:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 |
-- Import data from external table to parquet table SET mapred.job.queue.name=root.default; SET mapreduce.input.fileinputformat.split.minsize=64000000; SET mapreduce.input.fileinputformat.split.maxsize=256000000; CREATE TABLE IF NOT EXISTS realestates ( transaction_id STRING, price INT, date_of_transfer DATE, postcode STRING, property_type CHAR(1), old_new CHAR(1), duration CHAR(1), paon STRING, saon STRING, street STRING, locality STRING, town_city STRING, district STRING, country STRING, ppd_category_type CHAR(1), record_status CHAR(1)) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( "separatorChar" = ",", "quoteChar" = "'" ) STORED AS textfile LOCATION '/user/realestates/'; CREATE TABLE IF NOT EXISTS realestates_p ( transaction_id STRING, price INT, date_of_transfer STRING, postcode STRING, property_type CHAR(1), old_new CHAR(1), duration CHAR(1), paon STRING, saon STRING, street STRING, locality STRING, town_city STRING, district STRING, country STRING, ppd_category_type CHAR(1), record_status CHAR(1)) CLUSTERED BY (transaction_id) INTO 8 BUCKETS STORED AS ORC; INSERT OVERWRITE TABLE realestates_p SELECT transaction_id, CAST(SUBSTR(TRIM(price), 2, LENGTH(price)-2) AS INT), date_of_transfer, postcode, SUBSTR(property_type, 2, LENGTH(property_type)-2), SUBSTR(old_new, 2, LENGTH(old_new)-2), SUBSTR(duration, 2, LENGTH(duration)-2), paon, saon, street, locality, town_city, district, country, SUBSTR(ppd_category_type, 2, LENGTH(ppd_category_type)-2), SUBSTR(record_status, 2, LENGTH(record_status)-2) FROM realestates; -- Generate new table for max price of every month CREATE TABLE month_top ROW FORMAT DELIMITED FIELDS TERMINATED BY "," STORED AS TEXTFILE AS SELECT MAX(price) AS max_price, month, town_city, district, country FROM ( SELECT SUBSTR(date_of_transfer, 2, 7) AS month, price, street, locality, town_city, district, country FROM realestates_p ) month_view GROUP BY month, town_city, district, country SORT BY max_price; |
We want Hive job to run on queue “root.default” in YARN (and other jobs in “root.mr”),… Read more »