Category Archives: bigdata

Partitioning and Bucketing Hive table

In previous article, we use sample datasets to join two tables in Hive. To promote the performance of table join, we could also use Partition or Bucket. Let’s first create a parquet format table with partition and bucket:

Then import data into it:

But it reports error:

Read more »

Example datasets for learning Hive

I find two datasets: employee and salary for learning and practicing. After putting two files into HDFS, we just need to create tables:

Now we could analyze the data. Find the oldest 10 employees.

Find all the employees joined the corporation in January 1990.

Find the top… Read more »

Use MapReduce to join two datasets

The two datasets are:

To join the two tables above by “student id”, we need to use MultipleInputs. The code is:

Compile and run it:

And the result in /my is:

Some tips about Hive

      No Comments on Some tips about Hive

Found some tips about Hive in my learning progress: 1. When I start “bin/hive” at first time, these errors report:

The solution is simple:

Actually, we’d better use mysql instead of derby for multi-users environment. 2. Control the number of mappers for SQL jobs. If a SQL job… Read more »