The dataset is downloaded from the UK government’s open data site (the total data size is more than 3 GB). I use Apache Oozie to run the Hive and Sqoop jobs periodically.

The Hive script “land_price.hql”:

-- Import data from the CSV-backed table into an ORC table

SET mapred.job.queue.name=root.default;
SET mapreduce.input.fileinputformat.split.minsize=64000000;
SET mapreduce.input.fileinputformat.split.maxsize=256000000;

CREATE TABLE IF NOT EXISTS realestates (
                          transaction_id STRING,
                          price          INT,
                          date_of_transfer DATE,
                          postcode         STRING,
                          property_type    CHAR(1),
                          old_new          CHAR(1),
                          duration         CHAR(1),
                          paon             STRING,
                          saon             STRING,
                          street           STRING,
                          locality         STRING,
                          town_city        STRING,
                          district         STRING,
                          country          STRING,
                          ppd_category_type CHAR(1),
                          record_status     CHAR(1))
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
  WITH SERDEPROPERTIES (
                        "separatorChar" = ",",
                        "quoteChar"     = "'"
                       )
  STORED AS textfile
  LOCATION '/user/realestates/';

CREATE TABLE IF NOT EXISTS realestates_p (
                            transaction_id STRING,
                            price          INT,
                            date_of_transfer STRING,
                            postcode         STRING,
                            property_type CHAR(1),
                            old_new          CHAR(1),
                            duration         CHAR(1),
                            paon             STRING,
                            saon             STRING,
                            street           STRING,
                            locality         STRING,
                            town_city        STRING,
                            district         STRING,
                            country          STRING,
                            ppd_category_type CHAR(1),
                            record_status     CHAR(1))
  CLUSTERED BY (transaction_id) INTO 8 BUCKETS
  STORED AS ORC;
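The CLUSTERED BY clause hash-partitions rows on transaction_id into 8 buckets. The assignment logic is roughly the following (a Python sketch; CRC32 stands in for Hive’s own hash function, and the sample transaction id is made up):

```python
import zlib

def bucket_for(transaction_id: str, num_buckets: int = 8) -> int:
    # Hive routes each row to hash(clustering column) mod num_buckets;
    # CRC32 is only a stand-in for Hive's actual hash function here.
    return zlib.crc32(transaction_id.encode("utf-8")) % num_buckets

# Made-up transaction id in the UK price-paid GUID style
print(bucket_for("{0C7B8DA0-352C-4E16-E050-A8C063055C3A}"))
```

Rows with the same transaction_id always land in the same bucket, which is what makes bucketed joins and sampling on that column efficient.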

INSERT OVERWRITE TABLE realestates_p
    SELECT transaction_id, CAST(SUBSTR(TRIM(price), 2, LENGTH(price)-2) AS INT), date_of_transfer, postcode,
               SUBSTR(property_type, 2, LENGTH(property_type)-2),
               SUBSTR(old_new, 2, LENGTH(old_new)-2),
               SUBSTR(duration, 2, LENGTH(duration)-2),
               paon, saon, street, locality, town_city, district, country,
               SUBSTR(ppd_category_type, 2, LENGTH(ppd_category_type)-2),
               SUBSTR(record_status, 2, LENGTH(record_status)-2)
    FROM realestates;

-- Generate a new table with the max price of every month
-- (drop it first so the script can be re-run by the coordinator)

DROP TABLE IF EXISTS month_top;
CREATE TABLE month_top
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
STORED AS TEXTFILE AS
SELECT MAX(price) AS max_price, month, town_city, district, country FROM (
  SELECT SUBSTR(date_of_transfer, 2, 7) AS month, price, street, locality, town_city, district, country FROM realestates_p
) month_view
GROUP BY month, town_city, district, country
SORT BY max_price;
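The month_top query keeps, for every (month, town_city, district, country) group, the maximum sale price. The same aggregation in plain Python over a few made-up rows (the HQL uses SUBSTR(date_of_transfer, 2, 7) because the raw dates still carry a leading quote; the unquoted dates below just take the first 7 characters):

```python
from collections import defaultdict

# Made-up sample rows: (date_of_transfer, price, town_city, district, country)
rows = [
    ("2015-10-01",  500000, "WOKINGHAM", "WOKINGHAM", "ENGLAND"),
    ("2015-10-15", 1500000, "WOKINGHAM", "WOKINGHAM", "ENGLAND"),
    ("2015-11-02",  300000, "READING",   "READING",   "ENGLAND"),
]

month_top = defaultdict(int)
for date, price, town_city, district, country in rows:
    month = date[:7]                                  # e.g. "2015-10"
    key = (month, town_city, district, country)
    month_top[key] = max(month_top[key], price)       # MAX(price) per group

# Like the SORT BY max_price at the end of the query:
for key, max_price in sorted(month_top.items(), key=lambda kv: kv[1]):
    print(max_price, *key)
```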

We want the Hive job to run on the “root.default” queue in YARN (and the other jobs on “root.mr”), so we set “mapred.job.queue.name” to “root.default”.

Remember to use SUBSTR() in Hive to strip the double-quote character (") when importing data from the raw CSV file.
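In Python terms, SUBSTR(s, 2, LENGTH(s) - 2) keeps everything between the first and last character; a sketch of the trick, not Hive itself:

```python
def strip_quotes(field: str) -> str:
    # Equivalent of Hive's SUBSTR(field, 2, LENGTH(field) - 2):
    # drop the first and last character, i.e. the surrounding quotes.
    return field[1:-1]

print(strip_quotes('"D"'))            # property_type with quotes removed
print(int(strip_quotes('"250000"')))  # price stripped, then cast, as in the INSERT
```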

The “coordinator.xml” for Apache Oozie (the name and schedule attributes are elided here):

<coordinator-app name="..." frequency="..." start="..." end="..."
                 timezone="..." xmlns="uri:oozie:coordinator:0.2">
  <controls>
    <concurrency>1</concurrency>
  </controls>
  <action>
    <workflow>
      <app-path>${appDir}</app-path>
    </workflow>
  </action>
</coordinator-app>
The “workflow.xml” for Apache Oozie:

<workflow-app name="..." xmlns="uri:oozie:workflow:0.4">
  <global>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
  </global>

  <start to="fork-jobs"/>

  <fork name="fork-jobs">
    <path start="hive-job"/>
    <path start="teragen-job"/>
  </fork>

  <action name="hive-job">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <job-xml>hive-site.xml</job-xml>
      <configuration>
        <property>
          <name>mapred.job.queue.name</name>
          <value>root.default</value>
        </property>
      </configuration>
      <script>land_price.hql</script>
    </hive>

    <ok to="sqoop-job"/>
    <error to="fail"/>
  </action>

  <action name="sqoop-job">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.job.queue.name</name>
          <value>root.default</value>
        </property>
      </configuration>
      <command>export -Dmapred.job.queue.name=root.default
        --connect jdbc:mysql://192.168.0.1/robin
        --username root --password root --table month_top
        --export-dir /user/hive/warehouse/month_top</command>
    </sqoop>

    <ok to="join-jobs"/>
    <error to="fail"/>
  </action>

  <action name="teragen-job">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare>
        <delete path="${nameNode}${inputDir}"/>
      </prepare>
      <configuration>
        <property>
          <name>mapreduce.job.queuename</name>
          <value>root.mr</value>
        </property>
      </configuration>
      <main-class>org.apache.hadoop.examples.terasort.TeraGen</main-class>
      <arg>-Dmapreduce.job.queuename=root.mr</arg>
      <arg>-Dmapred.map.tasks=96</arg>
      <arg>${numRows}</arg>
      <arg>${inputDir}</arg>
    </java>
    <ok to="terasort-job"/>
    <error to="fail"/>
  </action>

  <action name="terasort-job">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare>
        <delete path="${nameNode}${outputDir}"/>
      </prepare>
      <configuration>
        <property>
          <name>mapreduce.job.queuename</name>
          <value>root.mr</value>
        </property>
        <property>
          <name>mapreduce.input.fileinputformat.split.minsize</name>
          <value>4294967296</value>
        </property>
      </configuration>
      <main-class>org.apache.hadoop.examples.terasort.TeraSort</main-class>
      <arg>-Dmapreduce.job.queuename=root.mr</arg>
      <arg>${inputDir}</arg>
      <arg>${outputDir}</arg>
    </java>

    <ok to="join-jobs"/>
    <error to="fail"/>
  </action>

  <join name="join-jobs" to="end"/>

  <kill name="fail">
    <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>

  <end name="end"/>
</workflow-app>

We run two jobs in parallel here: Hive and TeraSort (TeraSort is not useful in a real production environment, but it is a good stand-in for my company’s proprietary jobs).

Sqoop once reported the error “javax.xml.parsers.ParserConfigurationException: Feature ‘http://apache.org/xml/features/xinclude’ is not recognized”.
The solution is to change the file “/usr/lib/hadoop/bin/hadoop” like this:

HADOOP_OPTS="$HADOOP_OPTS -Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,NullAppender} \
    -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl"

“job.properties” for Oozie:

jobTracker=192.168.0.1:8032
nameNode=hdfs://nameservice1
inputDir=/user/hive/tera
outputDir=/user/hive/result
appDir=/user/oozie/myapp
numRows=12345678
oozie.coord.application.path=${appDir}/coordinator.xml
oozie.use.system.libpath=true

Remember to set “oozie.use.system.libpath=true”, so that Oozie can find the Hive and Sqoop share libraries and run those jobs correctly.

The script to create the MySQL table:

create table robin.month_top (
    price int(4),
    month char(16),
    town_city char(64),
    district char(128),
    country char(64));
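Sqoop export splits each exported text line on the field terminator and maps the fields positionally onto the MySQL columns (which is why the MySQL column order must match the Hive table). Roughly, with a made-up row, not Sqoop’s actual code:

```python
# One made-up line from /user/hive/warehouse/month_top (comma-delimited)
line = "1500000,2015-10,WOKINGHAM,WOKINGHAM,ENGLAND"

# Positional mapping onto the columns (price, month, town_city, district, country)
price, month, town_city, district, country = line.split(",")
row = (int(price), month, town_city, district, country)
print(row)
```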

After the Oozie coordinator is launched, it will eventually load the resulting data into the MySQL table:


[Screenshot: contents of the month_top table in MySQL]

It looks like the land price in “WOKINGHAM” in October 2015 was extremely high.