Spark

Books I read in year 2017

2017 was not an easy year for me, so I read a lot of books to comfort myself. The books shown above are just the ten I rated highest.
I can’t remember how many times I have read “The Old Man and the Sea”. This time I read it for my daughter, because she wanted to hear some “fantastic and powerful” stories. After I told her the whole story (changing some parts to suit a child, of course), she was a little puzzled. Hmm, it is hard for a little girl to understand, even though the story is truly “powerful” 🙂
At the beginning of 2017 (around February), my old colleague Jian Mei asked me to help build a bird-classification application. I accepted and started learning Deep Learning from scratch. This task actually made me feel “alive” again: I began to learn new frameworks (MXNet and TensorFlow) and to read papers (I hadn’t read papers for many years) and thick books (such as “Deep Learning”). In the end we completed the application for classifying Chinese birds, and I began a new career in the Deep Learning area. Thanks again to my old colleague.
When I was a child, I read a small comic book about a family living on an isolated island. In January, I found an English book in a bookshop named “The Swiss Family Robinson”, and suddenly realized it must be the original of that old comic book. So I bought it. It cost me only 20 RMB (about 3 dollars; books are remarkably cheap in China). Over the following weeks I read the book chapter by chapter and retold each chapter to my daughter. This time she could understand the book and became interested in this old story.
In 2015, I traveled to Boston with my colleague Coly Li to give a presentation at the Linux Vault conference. In the MIT bookshop I bought many books, including a history book named “Ancient Rome” (I had read a lot of books about ancient Rome, but none of them in English). 2015 and 2016 were very busy years, though, and I only found enough time to finish the book in 2017. It contains many pictures, which makes it good for children too. Maybe in the future I could read it to my children.

Using Spark SQL to convert a CSV file to Parquet

After downloading data from the “Food and Agriculture Organization of the United Nations”, I got many CSV files. One of the files is named “Trade_Crops_Livestock_E_All_Data_(Normalized).csv” and looks like this:

Area Code,Area,Item Code,Item,Element Code,Element,Year Code,Year,Unit,Value,Flag
"2","Afghanistan","231","Almonds shelled","5910","Export Quantity","1961","1961","tonnes","0.000000",""
"2","Afghanistan","231","Almonds shelled","5910","Export Quantity","1962","1962","tonnes","0.000000",""
"2","Afghanistan","231","Almonds shelled","5910","Export Quantity","1963","1963","tonnes","0.000000",""
......

To load this CSV file into Spark and dump it to Parquet format, I wrote this code:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd._
/* Area Code,Area,Item Code,Item,Element Code,Element,Year Code,Year,Unit,Value,Flag */
case class Trade(area_code:Int, area:String, item_code:Int, item:String, element_code:Int,
                 element:String, year:Int, unit:String, value:Double, flag:String)
object TradeCrops {
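  // Strip the double quotes that wrap every CSV field.
  def scrub(str: String): String = str.replace("\"", "")
  // Parse a quoted integer field; fall back to 0 when it is not a valid number.
  def toInt(str: String): Int =
    scala.util.Try(scrub(str).toInt).getOrElse(0)
  // Parse a quoted decimal field; fall back to 0.0 when it is not a valid number.
  def toDouble(str: String): Double =
    scala.util.Try(scrub(str).toDouble).getOrElse(0.0)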
  def toTrade(line:String):Trade = {
    val fields = line.split("\",")
    Trade(
      toInt(fields(0)),
      scrub(fields(1)),
      toInt(fields(2)),
      scrub(fields(3)),
      toInt(fields(4)),
      scrub(fields(5)),
      toInt(fields(7)),
      scrub(fields(8)),
      toDouble(fields(9)),
      scrub(fields(10))
    )
  }
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("Trade Crops Application")
    val sc = new SparkContext(conf)
    val spark = SparkSession.builder()
        .appName("Spark SQL Trade Crops")
        .getOrCreate()
    val file = sc.textFile("hdfs:///FAO/Trade_Crops.csv")
    val tradeRDD = file.filter(_.split("\",").length == 11).map(toTrade(_))
    val tradeDF = spark.createDataFrame(tradeRDD)
    tradeDF.write.parquet("hdfs:///FAO/Trade_Crops.parquet")
  }
}

The build.sbt is:

lazy val root = (project in file("."))
  .settings(
    name := "FAO",
    version := "1.0",
    scalaVersion := "2.11.7",
    unmanagedJars in Compile += file("/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.6.jar"),
    libraryDependencies ++= Seq(
      "org.apache.spark" % "spark-core_2.11" % "2.1.1",
      "org.apache.spark" % "spark-sql_2.11" % "2.1.1",
      "org.apache.hadoop" % "hadoop-client" % "2.6.0"
    )
  )

Always remember to add the dependency on “spark-sql”, or else the compiler will report “createDataFrame() is not a member of spark”.
Finally, the submit script is:

/disk1/spark-2.1.1-bin-hadoop2.6/bin/spark-submit --class TradeCrops \
  --master yarn \
  --driver-memory 2G \
  --executor-memory 2G \
  --executor-cores 1 \
  --num-executors 64 \
  ./target/scala-2.11/FAO_2.11-1.0.jar
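
The toTrade function above splits each line on the two characters `",`, which works for this file because every field is quoted. As a minimal sketch of a slightly more defensive alternative, the helper below (a hypothetical name, not part of the original program) walks the line character by character and only treats a comma as a separator when it is outside double quotes. It assumes fields never contain escaped quotes; a doubled quote inside a field is simply dropped.

```scala
object CsvLineSketch {
  // Split one CSV line into unquoted fields. A comma inside a quoted field
  // does not start a new field. Escaped quotes ("") are NOT supported here.
  def splitCsvLine(line: String): Vector[String] = {
    val fields = Vector.newBuilder[String]
    val current = new StringBuilder
    var inQuotes = false
    for (c <- line) c match {
      case '"' => inQuotes = !inQuotes      // entering or leaving a quoted field
      case ',' if !inQuotes =>              // field separator
        fields += current.toString
        current.clear()
      case other => current += other        // ordinary character
    }
    fields += current.toString              // last field has no trailing comma
    fields.result()
  }
}
```

With such a helper, the `length == 11` filter would be applied to the result of `splitCsvLine(line)`, and the fields would no longer need to be scrubbed of quotes.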

Books I read in year 2016

Here comes the last day of 2016. It is also time for me to review my harvest of knowledge, or rather of books.
Frankly speaking, the book “The Hard Thing About Hard Things” literally frightened me and made me give up any idea of joining a startup company in China. Maybe that was for the best: many startup companies failed at the end of this year, and I was fortunate to avoid the storm.
Diving deeper into the ocean of the “Hadoop Ecosystem”, or “Big Data”, I found that Spark is a really convenient and powerful framework (compared to MapReduce) that can implement complicated algorithms or data flows in a few lines of code. Scala is surely also a key element of Spark’s efficiency and concision.
Today, even an ordinary person can imagine a sci-fi story about modern people fighting alien invaders. But what would happen if aliens attacked the Earth in ancient times? What about the Middle Ages? Enter the funny and bold sci-fi novel “The High Crusade”: a medieval army defeats the alien invaders and goes even further, occupying a frontier planet of a gigantic alien empire. It is really beyond my imagination 🙂

Problem about running Hive-2.0.1 on Spark-1.6.2

When I launched Hive-2.0.1 on Spark-1.6.2, it reported an error:

FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;

After changing “spark.master” from “yarn-cluster” to “local” and adding “--hiveconf hive.root.logger=DEBUG,console” to the hive command, it printed details like:

java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;
        at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<init>(ScalaNumberDeserializersModule.scala:49)
        at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<clinit>(ScalaNumberDeserializersModule.scala)
        at com.fasterxml.jackson.module.scala.deser.ScalaNumberDeserializersModule$class.$init$(ScalaNumberDeserializersModule.scala:61)
        at com.fasterxml.jackson.module.scala.DefaultScalaModule.<init>(DefaultScalaModule.scala:19)
        at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<init>(DefaultScalaModule.scala:35)
        at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<clinit>(DefaultScalaModule.scala)
        at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:81)

One article suggests replacing the fasterxml.jackson package with a newer version, but the problem remained the same even after I completed the replacement.
Then I found [HIVE-13301] in JIRA:

This is because calcite has a shaded 2.1.1 version of jackson-databind in it. You can probably remove that from the jar and leave the jackson-databind alone in the hive distro.

This explains everything clearly: Hive was using the jackson-databind-2.1.1 classes shaded into the calcite package instead of lib/jackson-databind-2.4.2.jar, so updating the latter had no effect.
Thus, we should remove the shaded jackson-databind-2.1.1 from calcite-avatica-1.5.0.jar:

cd ${HIVE_HOME}/lib/
mkdir tmp
cd tmp
# Extract classes from jar
jar -xf ../calcite-avatica-1.5.0.jar
# Remove old jackson-classes in calcite-avatica
find . -name "*jackson*"|xargs rm -rf
# Build new calcite-avatica jar without jackson-classes
jar -cf calcite-avatica-1.5.0.jar *
cp calcite-avatica-1.5.0.jar ../

Hive now uses lib/jackson-databind-2.4.2.jar and runs correctly.
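
Before repacking a jar like this, it can help to confirm which entries are actually shaded into it. The following is a minimal sketch using only the JDK's `java.util.jar.JarFile`; the object and method names are hypothetical, and the jar path in the usage note is just the one from the steps above.

```scala
import java.util.jar.JarFile
import scala.collection.mutable.ListBuffer

object ShadedClassFinder {
  // Return the names of all entries in the jar whose path contains `keyword`.
  def entriesContaining(jarPath: String, keyword: String): List[String] = {
    val jar = new JarFile(jarPath)
    try {
      val names = ListBuffer.empty[String]
      val entries = jar.entries()
      while (entries.hasMoreElements) {
        val name = entries.nextElement().getName
        if (name.contains(keyword)) names += name
      }
      names.toList
    } finally {
      jar.close() // always release the file handle
    }
  }
}
```

For example, `ShadedClassFinder.entriesContaining("calcite-avatica-1.5.0.jar", "jackson")` should return an empty list once the shaded classes have been removed.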

Using Linear Regression to filter SMS spam on Spark

Using the sample from “SMS Spam Collection v. 1”, I wrote a simple program on Spark to classify normal and spam messages.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
object SimpleRegression {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Regression")
    val sc = new SparkContext(conf)
    val smsData = sc.textFile("hdfs://127.0.0.1/user/robin/SMSSpamCollection")
    // Each line is "<label>\t<message>"; startsWith also copes with lines shorter than the label.
    val normal = smsData.filter(_.startsWith("ham\t"))
      .map(line => line.substring(4))
    val spam = smsData.filter(_.startsWith("spam\t"))
      .map(line => line.substring(5))
    // Create a HashingTF instance to map message text to vectors of 100,000 features.
    val tf = new HashingTF(numFeatures = 100000)
    // Each message is split into words, and each word is mapped to one feature.
    val spamFeatures = spam.map(sms => tf.transform(sms.split(" ")))
    val normalFeatures = normal.map(sms => tf.transform(sms.split(" ")))
    val positiveExamples = spamFeatures.map(features => LabeledPoint(100, features))
    val negativeExamples = normalFeatures.map(features => LabeledPoint(-100, features))
    val trainingData = positiveExamples.union(negativeExamples)
    trainingData.cache() // Cache since SGD is an iterative algorithm.
    // Run Linear Regression using the SGD algorithm.
    val model = new LinearRegressionWithSGD().run(trainingData)
    // Test on a positive example (spam) and a negative one (normal).
    val posTest = tf.transform(
      ("Someone has contacted our dating service and entered your phone because they fancy you").split(" "))
    val negTest = tf.transform(
      ("Hi Dady, I started studying Spark the other").split(" "))
    println("Prediction for positive test example: " + model.predict(posTest))
    println("Prediction for negative test example: " + model.predict(negTest))
  }
}

The “build.sbt” file contains:

lazy val root = (project in file("."))
    .settings(
        name := "test",
        version := "1.0",
        scalaVersion := "2.10.6",
        unmanagedJars in Compile += file("/home/sanbai/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar"),
        libraryDependencies ++= Seq(
            "org.apache.spark" % "spark-core_2.10" % "2.0.1",
            "org.apache.spark" % "spark-hive_2.10" % "2.0.1",
            "org.apache.spark" % "spark-mllib_2.10" % "2.0.1",
            "org.apache.spark" % "spark-streaming_2.10" % "2.0.1",
            "org.apache.hadoop" % "hadoop-client" % "2.7.2",
            "org.xerial.snappy" % "snappy-java" % "1.1.2"
        )
    )

After submitting the job to YARN:

./bin/spark-submit --class SimpleRegression \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 2G \
  --executor-memory 14G \
  --executor-cores 1 \
  --num-executors 1 \
  --queue spark \
  /home/sanbai/myspark/target/scala-2.10/test_2.10-1.0.jar

we can retrieve the log of the job with:

bin/yarn logs -applicationId application_1473140384986_0096

And the result is:

Prediction for positive test example: 24.238025869328453
Prediction for negative test example: -34.879236141966544

From now on, we can treat a message with a negative prediction as normal and one with a positive prediction as spam (or use 10 instead of 0 as the boundary).
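
The decision rule can be written down as a tiny function. This is only a sketch: the name `isSpam` and the default boundary of 0 are assumptions made for illustration, and the scores in the usage note are the two predictions printed above.

```scala
object SpamThresholdSketch {
  // A message counts as spam when its predicted score exceeds the boundary.
  def isSpam(score: Double, boundary: Double = 0.0): Boolean =
    score > boundary
}
```

With the results above, `isSpam(24.238025869328453)` is true and `isSpam(-34.879236141966544)` is false; raising the boundary to 10 does not change either answer.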
This is just an example: the sample dataset is too small, so the model can only filter obvious spam messages. To identify more spam, we need to add more features, such as the topic of each message, the total number of words, and the frequency of special words.
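
For readers unfamiliar with what HashingTF does in the program above, here is a minimal plain-Scala sketch of the hashing trick. It is an illustration only: it uses `String.hashCode`, whereas the hash function inside Spark may differ, so the indices will not necessarily match Spark's. The object and method names are hypothetical.

```scala
object HashingTrickSketch {
  // Map each word to an index in [0, numFeatures) and count how often each
  // index is hit. The result is a sparse term-frequency vector.
  def termFrequencies(words: Seq[String], numFeatures: Int): Map[Int, Double] =
    words
      .map(w => ((w.hashCode % numFeatures) + numFeatures) % numFeatures) // non-negative modulo
      .groupBy(identity)
      .map { case (index, hits) => index -> hits.size.toDouble }
}
```

Different words can collide on the same index; a large `numFeatures` such as the 100,000 used above keeps collisions rare.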

Why does my Spark job hang?

After I ran my small Spark machine-learning application, the job hung and its Spark UI displayed nothing for more than 5 minutes.
That was weird, so I looked at the logs in the YARN UI:

16/09/30 17:05:33 INFO ipc.Client: Retrying connect to server: user/110.75.167.140:8020. Already tried 0 time(s); maxRetries=45
16/09/30 17:05:53 INFO ipc.Client: Retrying connect to server: user/110.75.167.140:8020. Already tried 1 time(s); maxRetries=45
16/09/30 17:06:13 INFO ipc.Client: Retrying connect to server: user/110.75.167.140:8020. Already tried 2 time(s); maxRetries=45
16/09/30 17:06:33 INFO ipc.Client: Retrying connect to server: user/110.75.167.140:8020. Already tried 3 time(s); maxRetries=45
16/09/30 17:06:53 INFO ipc.Client: Retrying connect to server: user/110.75.167.140:8020. Already tried 4 time(s); maxRetries=45
16/09/30 17:07:13 INFO ipc.Client: Retrying connect to server: user/110.75.167.140:8020. Already tried 5 time(s); maxRetries=45
16/09/30 17:07:33 INFO ipc.Client: Retrying connect to server: user/110.75.167.140:8020. Already tried 6 time(s); maxRetries=45

I don’t have any IP that looks like “110.75.x.x”. Why was the Spark job trying to connect to it?
After reviewing the code carefully, I found the problem:

    val conf = new SparkConf().setAppName("Simple Regression")
    val sc = new SparkContext(conf)
    val smsData = sc.textFile("hdfs://user/sanbai/SMSSpamCollection")

I had forgotten to put the NameNode address into the HDFS URI, so “user” was taken as the host name (hence the “user/110.75.167.140:8020” in the log). The correct code should be:

    val smsData = sc.textFile("hdfs://127.0.0.1/user/sanbai/SMSSpamCollection")

Now the application runs correctly.
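
The mistake is easy to see once the URI is split into its parts. The sketch below uses the JDK's `java.net.URI` as a stand-in for how an `hdfs://` path is interpreted; the object and method names are hypothetical.

```scala
import java.net.URI

object HdfsUriSketch {
  // Split a URI string into its optional host and its path.
  def hostAndPath(uri: String): (Option[String], String) = {
    val parsed = new URI(uri)
    (Option(parsed.getHost), parsed.getPath)
  }
}
```

`hostAndPath("hdfs://user/sanbai/SMSSpamCollection")` yields the host `user` and the path `/sanbai/SMSSpamCollection`, while `hostAndPath("hdfs://127.0.0.1/user/sanbai/SMSSpamCollection")` yields the host `127.0.0.1` and the intended path `/user/sanbai/SMSSpamCollection`. With three slashes (`hdfs:///user/...`) there is no host at all.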

“java.io.IOException: failed to uncompress the chunk” in Apache Spark

I ran spark-submit on my YARN cluster with Spark-1.6.2:

./bin/spark-submit --class TerasortApp \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 4G \
  --executor-memory 12G \
  --executor-cores 4 \
  --num-executors 16 \
  --conf spark.yarn.executor.memoryOverhead=4000 \
  --conf spark.memory.useLegacyMode=true \
  --conf spark.shuffle.memoryFraction=0.6 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:ArrayAllocationWarningSize=2048M" \
  --queue spark \
  /home/sanbai/myspark/target/scala-2.10/test_2.10-1.0.jar

The job failed, and the log reported:

com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2)
Serialization trace:
bytes (org.apache.hadoop.io.Text)
  at com.esotericsoftware.kryo.io.Input.fill(Input.java:142)
  at com.esotericsoftware.kryo.io.Input.require(Input.java:169)
  at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:317)
  at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:297)
  at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
  at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
  at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
  at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
  at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
  at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
  at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:228)
  at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:171)
  at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
  at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
  at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

Some people on the internet say this may be caused by a compatibility problem between Spark-1.6.2 and Snappy. Therefore I added

--conf spark.io.compression.codec=lz4

to my spark-submit shell script to change the compression codec from Snappy to lz4. This time everything went fine.

Terasort for Spark (part2 / 2)

In the previous article, we used Spark to sort a large dataset generated by Teragen. But it took much longer than the Hadoop MapReduce framework, so we are going to optimize it.
Profiling with the Spark UI shows that the “Shuffle” reads and writes too much data from and to the hard disk, which surely hurts performance severely.

Hadoop’s “Terasort” uses the class “TotalOrderPartitioner” to map all the data to a large number of partitions in key order, so every “Reduce” task only needs to sort the data of its own partition (it needs almost no data shuffled from other partitions). This saves a lot of network bandwidth and CPU usage.
Therefore we can modify our Scala code to sort every partition locally:

    logData.partitionBy(new TeraSortPartitioner(512))
      .mapPartitions(iter => {
        iter.toVector.sortBy(kv => kv._1.getBytes).iterator
      })
      .saveAsNewAPIHadoopFile[TeraOutputFormat]("hdfs://127.0.0.1/output")

The spark-submit command should also be changed:

./bin/spark-submit --class TerasortApp \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 2000M \
  --executor-memory 5200M \
  --executor-cores 1 \
  --num-executors 64 \
  --conf spark.yarn.executor.memoryOverhead=900 \
  --conf spark.shuffle.memoryFraction=0.6 \
  --conf spark.kryoserializer.buffer.max=2000m \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
  --queue spark \
  /home/sanbai/myspark/target/scala-2.10/Terasort_2.10-1.0.jar

This time, the job took only 10 minutes to sort the data!
[Screenshot from the “Job Browser” of Hue]
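
Why is a local sort enough? If the partitioner sends every key to a partition by key range, then reading the partitions in order, each sorted on its own, already gives a globally sorted result. The plain-Scala sketch below illustrates this without Spark. It is a simplification under stated assumptions: it partitions on the first key byte only (a stand-in for TeraSortPartitioner, which uses a 7-byte prefix), and all names are hypothetical.

```scala
object LocalSortSketch {
  // Stand-in range partitioner: spread the 0..255 range of the first byte
  // evenly over numPartitions. Assumes keys are non-empty.
  def partitionOf(key: Array[Byte], numPartitions: Int): Int =
    ((key(0) & 0xff) * numPartitions) / 256

  // Lexicographic comparison that treats bytes as unsigned values.
  def compareUnsigned(a: Array[Byte], b: Array[Byte]): Int =
    a.zip(b)
      .map { case (x, y) => (x & 0xff) - (y & 0xff) }
      .find(_ != 0)
      .getOrElse(a.length - b.length)

  // Group keys by partition, sort each partition locally, and concatenate
  // the partitions in index order.
  def sortWithinPartitions(keys: Seq[Array[Byte]], numPartitions: Int): Seq[Array[Byte]] =
    keys.groupBy(partitionOf(_, numPartitions))
      .toSeq
      .sortBy(_._1)
      .flatMap { case (_, part) => part.sortWith(compareUnsigned(_, _) < 0) }
}
```

No key ever has to be compared with a key from another partition, which is what removes the second shuffle from the Spark job.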

Terasort for Spark (part1 / 2)

We can use Spark to sort the data generated by Hadoop’s Teragen.
TerasortApp.scala

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.Partitioner
import org.apache.spark.rdd._
import org.apache.hadoop.examples.terasort.TeraInputFormat
import org.apache.hadoop.examples.terasort.TeraOutputFormat
import org.apache.hadoop.io.Text
import com.google.common.primitives.Longs
import com.google.common.primitives.UnsignedBytes
case class TeraSortPartitioner(numPartitions: Int) extends Partitioner {
  import TeraSortPartitioner._
  val rangePerPart = (max - min) / numPartitions
  override def getPartition(key: Any): Int = {
    val b = key.asInstanceOf[Text].getBytes()
    val prefix = Longs.fromBytes(0, b(0), b(1), b(2), b(3), b(4), b(5), b(6))
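    // Clamp: the integer division can yield numPartitions for the very largest prefixes.
    math.min((prefix / rangePerPart).toInt, numPartitions - 1)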
  }
}
object TeraSortPartitioner {
  val min = Longs.fromBytes(0, 0, 0, 0, 0, 0, 0, 0)
  val max = Longs.fromBytes(0, -1, -1, -1, -1, -1, -1, -1)  // 0xff = -1
}
object TerasortApp {
  implicit val caseInsensitiveOrdering = UnsignedBytes.lexicographicalComparator
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .registerKryoClasses(Array(classOf[Text]))
      .setAppName("Simple Application")
    val sc = new SparkContext(conf)
    var logData = sc.newAPIHadoopFile("hdfs://127.0.0.1/tera", classOf[TeraInputFormat], classOf[Text], classOf[Text])
    logData.partitionBy(new TeraSortPartitioner(logData.partitions.size))
      .sortBy(kv => kv._1.getBytes)
      .saveAsNewAPIHadoopFile[TeraOutputFormat]("hdfs://127.0.0.1/output")
  }
}

build.sbt

lazy val root = (project in file("."))
    .settings(
        name := "Terasort",
        version := "1.0",
        scalaVersion := "2.10.6",
        unmanagedJars in Compile += file("/home/sanbai/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar"),
        libraryDependencies ++= Seq(
            "org.apache.spark" % "spark-core_2.10" % "1.6.2",
            "org.apache.hadoop" % "hadoop-client" % "2.7.2"
        )
    )

After building the jar file, we can submit it to Spark (I run Spark in yarn-cluster mode):

./bin/spark-submit --class TerasortApp \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 2000M \
  --executor-memory 2000M \
  --executor-cores 1 \
  --num-executors 128 \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --conf spark.shuffle.memoryFraction=0.9 \
  --conf spark.storage.memoryFraction=0.9 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=85" \
  --queue spark \
  /home/sanbai/myspark/target/scala-2.10/Terasort_2.10-1.0.jar

It takes 17 minutes to complete the task, while Hadoop’s own “terasort” tool takes only 8 minutes to sort all the data. The reason is that I haven’t used a TotalOrderPartitioner, so Spark has to sort all the data across different partitions (and across different servers), which costs a lot of network resources and delays the progress.

Remember to use scala-2.10 to build the app for Spark-1.6.x; otherwise Spark will report an error like:
scala.runtime.VolatileObjectRef.zero()Lscala/runtime/VolatileObjectRef
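
The arithmetic inside TeraSortPartitioner can be reproduced without Spark or Guava. The sketch below packs the first seven key bytes into a Long, which should give the same value as `Longs.fromBytes(0, b(0), ..., b(6))`, and divides by the per-partition range. The clamp to the last partition is an addition of this sketch: because the range is rounded down, the largest prefixes would otherwise map to index `numPartitions`. Names are hypothetical, and keys are assumed to be at least 7 bytes long.

```scala
object PrefixPartitionSketch {
  // Largest possible 7-byte prefix: 0x00FFFFFFFFFFFFFF.
  val maxPrefix: Long = (1L << 56) - 1

  // Big-endian value of the first 7 bytes, each treated as unsigned.
  def prefixOf(key: Array[Byte]): Long =
    key.take(7).foldLeft(0L)((acc, b) => (acc << 8) | (b & 0xffL))

  // Partition index for a key, clamped into [0, numPartitions).
  def partitionOf(key: Array[Byte], numPartitions: Int): Int = {
    val rangePerPart = maxPrefix / numPartitions
    math.min((prefixOf(key) / rangePerPart).toInt, numPartitions - 1)
  }
}
```

With 512 partitions, a key of all 0x00 bytes lands in partition 0 and a key of all 0xff bytes lands in partition 511.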

Deploy Hive on Spark

The MapReduce framework is too slow for real-time analytic queries, so we need to change Hive’s execution engine from “mr” to “spark”:
1. Set the environment for Spark:

export SPARK_HOME=/home/my/spark/

2. Copy the configuration XML file for Hive:

cp /home/my/hive/conf/hive-default.xml.template /home/my/hive/conf/hive-site.xml

and change these configuration items:


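<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>

<property>
  <name>spark.executor.memory</name>
  <value>4g</value>
</property>

<property>
  <name>spark.serializer</name>
  <value>org.apache.spark.serializer.KryoSerializer</value>
</property>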

Note: remember to replace every “${system:java.io.tmpdir}/${system:user.name}” in hive-site.xml with “/tmp/my/”.