Some tips about Hive from my learning process:

1. When I started “bin/hive” for the first time, this error was reported:

Exception in thread "main" java.lang.RuntimeException: Hive metastore database is not initialized. Please use schematool (e.g. ./schematool -initSchema -dbType ...) to create the schema. If needed, don't forget to include the option to auto-create the underlying database in your JDBC connection string (e.g. ?createDatabaseIfNotExist=true for mysql)

The solution is simple:

mv metastore_db metastore_db.tmp
schematool -initSchema -dbType derby

Actually, we had better use MySQL instead of Derby in a multi-user environment, since the embedded Derby metastore only allows one connection at a time.
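A minimal hive-site.xml sketch for a MySQL-backed metastore — the host, database name, and credentials below are placeholders, not values from my setup:

```xml
<!-- hive-site.xml: MySQL-backed metastore (placeholder host/db/credentials) -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost:3306/hive_meta?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>
```

Then initialize the schema with `schematool -initSchema -dbType mysql` instead of `derby`.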

2. Control the number of mappers for SQL jobs. If a SQL job uses too many mappers, the process context switches (including frequently launching and stopping JVMs) will cost extra CPU resources. We could use

set mapreduce.input.fileinputformat.split.maxsize=...
set mapreduce.input.fileinputformat.split.minsize=...

to change the number of mappers for all SQL jobs.
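Under the hood, FileInputFormat picks a split size of max(minSize, min(maxSize, blockSize)), and each file contributes roughly fileSize / splitSize mappers. A toy Python sketch of that arithmetic (the file and block sizes are made-up examples):

```python
import math

def split_size(block_size, min_size, max_size):
    # FileInputFormat.computeSplitSize: max(minSize, min(maxSize, blockSize))
    return max(min_size, min(max_size, block_size))

def num_mappers(file_sizes, block_size, min_size, max_size):
    size = split_size(block_size, min_size, max_size)
    return sum(math.ceil(f / size) for f in file_sizes)

MB = 1024 * 1024
# One 1 GB file on 128 MB blocks with default-ish bounds -> 8 mappers
print(num_mappers([1024 * MB], 128 * MB, 1, 256 * MB))         # 8
# Raising minsize to 256 MB makes splits bigger -> only 4 mappers
print(num_mappers([1024 * MB], 128 * MB, 256 * MB, 256 * MB))  # 4
```

So raising `split.minsize` reduces the mapper count, while lowering `split.maxsize` increases it.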

3. After I imported 1TB of data into an ORC-format table, the size of the table was just 250GB. But after I imported the same 1TB of data into a Parquet-format table, the size was 900GB. It looks like Apache ORC compresses my particular data much more effectively.

4. Use partitions carefully.

create table users (name string, age smallint) partitioned by (ca string);

Now we have a table named “users” which is partitioned by the field “ca”.

hive> insert into users values("robindong", 36);
FAILED: SemanticException 1:12 Need to specify partition columns because the destination table is partitioned. Error encountered near token 'users'
hive> insert into users values("robindong", 36, "China");
FAILED: SemanticException 1:12 Need to specify partition columns because the destination table is partitioned. Error encountered near token 'users'

We can’t use a normal INSERT clause to insert records into a partitioned table. Try this instead:

insert into users partition (ca="China") values("robindong", 36);

Now there is a record in the HDFS directory “/user/hive/warehouse/users/ca=China/”.
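If we want the partition value to come from the data itself rather than a fixed `ca="China"`, Hive’s dynamic partitioning supports that. A sketch, assuming the cluster allows nonstrict dynamic partition mode:

```sql
-- Dynamic partitioning: the trailing value supplies the partition column "ca"
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into users partition (ca) values ("robindong", 36, "China");
```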
In the book “Programming Hive”, it says we could copy the data of a partition directory to AWS S3 and then point the partition’s location at it. But what if I point the partition at a new, empty HDFS directory? Let’s try:

alter table users partition(ca = 'China') set location '/empty/';
hive> select * from users where ca='China';
OK
Time taken: 0.298 seconds

Because the partition has been pointed at an empty directory, the SELECT can’t find any records now. That is what “schema on read” means: the table definition describes how to interpret whatever files are present, and nothing guarantees the files are there.
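A toy Python sketch of the “schema on read” idea (a hypothetical file layout, not Hive’s actual reader): the partition’s rows are simply whatever files happen to sit under its directory, so an empty directory yields an empty result.

```python
import csv
import os
import tempfile

def read_partition(path):
    """Schema on read: parse whatever files sit in the directory, if any."""
    rows = []
    for name in sorted(os.listdir(path)):
        with open(os.path.join(path, name)) as f:
            rows.extend((r[0], int(r[1])) for r in csv.reader(f))
    return rows

part = tempfile.mkdtemp()   # stands in for /user/hive/warehouse/users/ca=China
with open(os.path.join(part, "000000_0"), "w") as f:
    f.write("robindong,36\n")
print(read_partition(part))  # [('robindong', 36)]

empty = tempfile.mkdtemp()   # stands in for the empty '/empty/' location
print(read_partition(empty))  # []
```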

5. Debug.

bin/hive --hiveconf hive.root.logger=DEBUG,console

This will print a lot of debug information, which is useful for finding root causes such as:

Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.hive.common.util.ReflectionUtil.setJobConf(ReflectionUtil.java:112)
        ... 20 more
Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found.
        at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:139)
        at org.apache.hadoop.io.compress.CompressionCodecFactory.&lt;init&gt;(CompressionCodecFactory.java:179)
        at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45)
        ... 25 more
Caused by: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
        at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:132)
        ... 27 more