For running the TeraSort test case in Oozie, the “Java Action” is a better choice than the “MapReduce Action”, because TeraSort needs to run
TeraInputFormat.writePartitionFile(job, partitionFile);
first and then load ‘partitionFile’ through “TotalOrderPartitioner”. It is not a simple MapReduce job that needs merely a few properties.
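To show why the partition file has to exist before the sort job starts, here is a minimal, self-contained sketch of total-order partitioning. It is a simplified illustration and not Hadoop's implementation: the class name `TotalOrderSketch`, both method names, and the sample keys are hypothetical. Step 1 plays the role of writing the partition file (cut points chosen from a sample of keys), and step 2 plays the role of the partitioner that loads those cut points.

```java
import java.util.Arrays;

/** Simplified illustration of total-order partitioning; not Hadoop's code. */
class TotalOrderSketch {

    /** Step 1: choose numPartitions - 1 cut points from a sample of keys. */
    static String[] chooseCutPoints(String[] sampleKeys, int numPartitions) {
        String[] sorted = sampleKeys.clone();
        Arrays.sort(sorted);
        String[] cuts = new String[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            // Take evenly spaced keys from the sorted sample.
            cuts[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return cuts;
    }

    /** Step 2: route a key to the partition whose key range contains it. */
    static int partitionFor(String key, String[] cutPoints) {
        int pos = Arrays.binarySearch(cutPoints, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent;
        // a key equal to a cut point goes to the partition above that cut.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        String[] sample = {"m", "c", "x", "f", "q", "a", "t", "j"};
        String[] cuts = chooseCutPoints(sample, 4);
        System.out.println("cut points: " + Arrays.toString(cuts));
        for (String key : new String[] {"b", "f", "n", "z"}) {
            System.out.println(key + " -> partition " + partitionFor(key, cuts));
        }
    }
}
```

Because every key in partition i sorts before every key in partition i + 1, concatenating the sorted reducer outputs gives a globally sorted result. The cut points must be computed before any key can be routed, which is the extra step a plain MapReduce action has no place for.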
The directory of this “TerasortApp”, which uses the “Java Action” of Oozie, looks like this:
TerasortApp/
├── job.properties
├── lib
│   └── hadoop-mapreduce-examples.jar
└── workflow.xml
The core of this App is “workflow.xml”:
${jobTracker}
${nameNode}
org.apache.hadoop.examples.terasort.TeraGen
-Dmapred.map.tasks=96
${numRows}
${inputDir}
${jobTracker}
${nameNode}
mapreduce.input.fileinputformat.split.minsize
4294967296
org.apache.hadoop.examples.terasort.TeraSort
${inputDir}
${outputDir}
Failed to terasort!
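Laid out as Oozie XML, the values above would sit in a workflow along the following lines. This is a sketch, not the original file: the element names come from the Oozie workflow schema, but the schema version, the application name, the action names (`teragen`, `terasort`, `fail`, `end`) and the transitions between them are my assumptions.

```xml
<!-- Sketch only: app name, action names, schema version and transitions are assumed. -->
<workflow-app xmlns="uri:oozie:workflow:0.2" name="terasort-wf">
    <start to="teragen"/>

    <!-- Generate the input data with TeraGen. -->
    <action name="teragen">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>org.apache.hadoop.examples.terasort.TeraGen</main-class>
            <arg>-Dmapred.map.tasks=96</arg>
            <arg>${numRows}</arg>
            <arg>${inputDir}</arg>
        </java>
        <ok to="terasort"/>
        <error to="fail"/>
    </action>

    <!-- Sort the generated data with TeraSort. -->
    <action name="terasort">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapreduce.input.fileinputformat.split.minsize</name>
                    <value>4294967296</value>
                </property>
            </configuration>
            <main-class>org.apache.hadoop.examples.terasort.TeraSort</main-class>
            <arg>${inputDir}</arg>
            <arg>${outputDir}</arg>
        </java>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Failed to terasort!</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The variables `${jobTracker}`, `${nameNode}`, `${numRows}`, `${inputDir}` and `${outputDir}` are expected to be defined in job.properties.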
Note 1. In a Cloudera environment, the Web UI fails at its last step, creating the sharelib for the Oozie service. To fix this problem, create the sharelib manually and then check that it is listed:
$ sudo -u oozie /usr/lib/oozie/bin/oozie-setup.sh sharelib create -fs hdfs://localhost:8020 -locallib /usr/lib/oozie/oozie-sharelib-yarn/
$ sudo -u oozie oozie admin -shareliblist -oozie http://localhost:11000/oozie
[Available ShareLib]
oozie
hive
distcp
hcatalog
sqoop
mapreduce-streaming
spark
hive2
pig
Note 2. We can’t use the ‘mapred.map.tasks’ property to change the number of mappers in TeraSort, because the mapper count follows from the number of input splits, not from this property. Therefore I use the ‘mapreduce.input.fileinputformat.split.minsize’ property to limit the number of mappers.
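The effect of the minimum split size can be estimated with a little arithmetic. The sketch below assumes that the input format computes the split size as max(minSize, min(maxSize, blockSize)), which is the formula used by the new-API FileInputFormat as far as I know. The class name `SplitEstimator`, the 128 MiB block size and the 10 GiB file size are hypothetical values chosen for illustration; only the 4294967296-byte minimum comes from the workflow.

```java
/** Rough estimate of how many input splits (and thus mappers) one file yields. */
class SplitEstimator {

    /** Assumed split-size formula: max(minSize, min(maxSize, blockSize)). */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Ceiling division; ignores the small slop Hadoop allows on the last split. */
    static long estimateSplits(long fileBytes, long splitSize) {
        return (fileBytes + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;        // assumed HDFS block size
        long maxSize = Long.MAX_VALUE;              // assumed: no maximum split size set
        long fileBytes = 10L * 1024 * 1024 * 1024;  // hypothetical 10 GiB input file

        // Without a minimum split size, the split size falls back to the block size.
        long defaultSplit = computeSplitSize(blockSize, 1L, maxSize);
        // With the value from the workflow (4 GiB), splits become much larger.
        long tunedSplit = computeSplitSize(blockSize, 4294967296L, maxSize);

        System.out.println("default splits: " + estimateSplits(fileBytes, defaultSplit));
        System.out.println("tuned splits:   " + estimateSplits(fileBytes, tunedSplit));
    }
}
```

Under these assumptions, the hypothetical 10 GiB file goes from 80 splits to 3. Splits are computed per input file, so the total number of mappers is the sum over all files in the input directory.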