Category Archives: bigdata

A problem with running Hive-2.0.1 on Spark-1.6.2

When I launched Hive-2.0.1 on Spark-1.6.2, it reported errors:

After changing “spark.master” from “yarn-cluster” to “local”, I added “--hiveconf hive.root.logger=DEBUG,console” to the hive command.
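As a minimal sketch (assuming the hive binary is on the PATH), that invocation is:

    hive --hiveconf hive.root.logger=DEBUG,console

With the debug logger enabled, it printed out details like: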

This article suggests replacing the fasterxml.jackson package with a newer version, but the problem remained the same even after I completed the… Read more »

Using Linear Regression to filter SMS spam messages on Spark

Using the sample from “SMS Spam Collection v. 1”, I wrote a simple Spark program to classify normal and spam messages.
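The program itself is not preserved in this excerpt; below is a minimal sketch of such a classifier (the file path, feature size, and 0.5 threshold are my assumptions; the tab-separated label/text format follows the SMS Spam Collection dataset):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    object SpamFilter {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SpamFilter"))
        val tf = new HashingTF(10000)

        // Each line of the dataset is "<label>\t<text>", label "spam" or "ham"
        val data = sc.textFile("SMSSpamCollection").map { line =>
          val Array(label, text) = line.split("\t", 2)
          LabeledPoint(if (label == "spam") 1.0 else 0.0, tf.transform(text.split(" ")))
        }.cache()

        // Fit a linear model; treat scores above 0.5 as spam
        val model = LinearRegressionWithSGD.train(data, 100)
        val score = model.predict(tf.transform("free prize call now".split(" ")))
        println(s"score = $score, spam = ${score > 0.5}")
        sc.stop()
      }
    }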

and the “build.sbt” file contains:
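The file is not preserved here; a build definition of roughly this shape works for a Spark 1.6 job (versions are assumptions; Spark 1.6.2 is built for Scala 2.10 by default):

    name := "spam-filter"
    version := "0.1"
    scalaVersion := "2.10.6"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % "1.6.2" % "provided",
      "org.apache.spark" %% "spark-mllib" % "1.6.2" % "provided"
    )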

After submitting the job to YARN:
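The exact command is not preserved; a typical submission (class, jar name, and path are assumptions) would be:

    spark-submit --master yarn-cluster --class SpamFilter \
      target/scala-2.10/spam-filter_2.10-0.1.jar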

We can retrieve the job's log with:
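In yarn-cluster mode the driver output goes to YARN's aggregated logs, so the standard command is (the application id is a placeholder):

    yarn logs -applicationId <applicationId>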

And the result is:… Read more »

Why does my Spark job hang?


After running my small Spark machine-learning application, the job hangs and its Spark UI displays nothing for more than 5 minutes. That is weird, and I see some logs in the YARN UI:

I don't have any IP that looks like “110.75.x.x”. Why is the… Read more »

“java.io.Exception: failed to uncompress the chunk” in Apache Spark

After I ran spark-submit on my YARN cluster with Spark-1.6.2:

The job failed, and the log reported:

Somebody on the internet said this might be caused by a compatibility problem between Spark-1.6.2 and Snappy. Therefore I added an option to my spark-submit shell script to change the compression algorithm:
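The exact flag is not preserved in this excerpt; Spark's standard knob for this is spark.io.compression.codec, so a setting of roughly this shape (the value is an assumption, not necessarily what the post used) is likely:

    # assumed workaround: move block compression off Snappy
    --conf spark.io.compression.codec=lz4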

… Read more »

Terasort for Spark (part 1 / 2)

We can use Spark to sort all the data generated by Hadoop's Teragen. TerasortApp.scala:
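The listing is missing from this excerpt; a minimal sketch of such an app follows. Treating each 100-byte Teragen record (a 10-byte key plus a 90-byte value) as a text line is a simplifying assumption; the real post may read the data with TeraInputFormat instead. Input and output paths come from the command line.

    import org.apache.spark.{SparkConf, SparkContext}

    object TerasortApp {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("Terasort"))
        // Sort records by their 10-character key prefix and write them back out
        sc.textFile(args(0))
          .sortBy(line => line.take(10))
          .saveAsTextFile(args(1))
        sc.stop()
      }
    }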

build.sbt
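Again, a minimal build definition (versions are assumptions):

    name := "terasort-app"
    version := "0.1"
    scalaVersion := "2.10.6"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2" % "provided"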

After building the jar file, we can submit it to Spark (I run Spark in yarn-cluster mode):
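The original command is not preserved; a typical invocation (executor sizing and jar name are assumptions) looks like:

    spark-submit --master yarn-cluster --class TerasortApp \
      --num-executors 8 --executor-memory 4G \
      terasort-app_2.10-0.1.jar /teragen/output /terasort/output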

It took 17 minutes to complete the task, but the “terasort” tool… Read more »

Deploy Hive on Spark


The MapReduce framework is too slow for real-time analytic queries, so we need to change Hive's engine from “mr” to “spark” (link): 1. Set the environment for Spark:
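A sketch of the usual setup (the path is an assumption):

    export SPARK_HOME=/opt/spark
    export PATH=$SPARK_HOME/bin:$PATH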

2. Copy the configuration XML file for Hive:
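Presumably something along these lines (paths are assumptions):

    cp $HIVE_HOME/conf/hive-default.xml.template $HIVE_HOME/conf/hive-site.xml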

and change these configuration items:
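The concrete items are not preserved in this excerpt; the canonical Hive-on-Spark properties (values here are assumptions) look like:

    <property>
      <name>hive.execution.engine</name>
      <value>spark</value>
    </property>
    <property>
      <name>spark.master</name>
      <value>yarn-cluster</value>
    </property>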

Notice: remember to replace all “${system:java.io.tmpdir}/${system:user.name}” in… Read more »

Partitioning and Bucketing a Hive table

In a previous article, we used sample datasets to join two tables in Hive. To improve the performance of a table join, we can also use partitioning or bucketing. Let's first create a Parquet-format table with partitions and buckets:
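The DDL itself is missing here; a sketch of such a table (names and columns are assumptions, loosely following the employee dataset from the earlier article):

    CREATE TABLE employee_pb (
      emp_no INT,
      first_name STRING,
      last_name STRING,
      hire_date DATE
    )
    PARTITIONED BY (gender STRING)
    CLUSTERED BY (emp_no) INTO 8 BUCKETS
    STORED AS PARQUET;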

Then import data into it:
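A dynamic-partition insert of roughly this shape (again an assumption, with employee as the source table; note the partition column goes last in the SELECT):

    SET hive.enforce.bucketing = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;

    INSERT INTO TABLE employee_pb PARTITION (gender)
    SELECT emp_no, first_name, last_name, hire_date, gender FROM employee;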

But it reports an error:

Read more »

Example datasets for learning Hive

I found two datasets, employee and salary, for learning and practicing. After putting the two files into HDFS, we just need to create the tables:
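The DDL is not preserved here; a sketch under the assumption that the files are comma-separated (the column names follow the classic employees sample data):

    CREATE TABLE employee (
      emp_no INT, birth_date DATE, first_name STRING,
      last_name STRING, gender STRING, hire_date DATE
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    CREATE TABLE salary (
      emp_no INT, salary INT, from_date DATE, to_date DATE
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    LOAD DATA INPATH '/data/employee.csv' INTO TABLE employee;
    LOAD DATA INPATH '/data/salary.csv' INTO TABLE salary;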

Now we can analyze the data. Find the 10 oldest employees:
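For instance (assuming the birth_date column above):

    SELECT * FROM employee ORDER BY birth_date LIMIT 10;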

Find all the employees who joined the corporation in January 1990:
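Again assuming the hire_date column:

    SELECT * FROM employee WHERE year(hire_date) = 1990 AND month(hire_date) = 1;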

Find the top… Read more »