After setting up the whole Hadoop environment, I used DistCp to copy large files across the distributed cluster. But it reported an error:
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/Job
at java.lang.Class.getDeclaredMethods0(Native Method)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.Job
... 7 more
It seems it can't even find the basic MapReduce class, so I checked the classpath settings for Hadoop:
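The effective classpath used by the hadoop command, plus the HADOOP_CLASSPATH variable itself, can be printed like this:
hadoop classpath
echo $HADOOP_CLASSPATH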
Pretty strange: HADOOP_CLASSPATH already contains the 'mapreduce' directories, so it should be able to find the 'Job' class, unless the MapReduce jars actually live somewhere else.
Finally, I found that the real MapReduce jars are indeed in a different location. Therefore I added those directories to HADOOP_CLASSPATH: edit ~/.bashrc and add a line like the one below.
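The exact paths depend on the distribution, so the ones below are only illustrative; point them at wherever the MapReduce jars really are:
# Illustrative paths; adjust to the actual location of the MapReduce jars
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-mapreduce/lib/*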
DistCp works now.
Question 1: The Flume process reports "Expected timestamp in the Flume event headers, but it was null".
Solution 1: The Flume process expects events that carry a timestamp header, but these events don't have one. To send plain text events to Flume, we need to tell it to generate a timestamp for every event by itself. Put lines like the ones below into the configuration:
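A minimal way to do this is Flume's built-in timestamp interceptor; here I assume the source is named r1 in agent a1 (the interceptor name i1 is arbitrary):
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
Alternatively, setting hdfs.useLocalTimeStamp = true on the HDFS sink makes the sink use the local time instead of an event timestamp.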
Question 2: The HDFS sink generates an enormous number of small files at high frequency, even though we have set "a1.sinks.k2.hdfs.rollInterval=600".
Solution 2: We still need to set "rollCount" and "rollSize", because Flume rolls the file as soon as any one of the "rollInterval", "rollCount", or "rollSize" conditions is met.
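For instance, to roll only on the 600-second interval for sink k2, the count- and size-based triggers can be disabled by setting them to 0:
a1.sinks.k2.hdfs.rollInterval = 600
a1.sinks.k2.hdfs.rollCount = 0
a1.sinks.k2.hdfs.rollSize = 0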
Question 3: The Flume process exits and reports "Exception in thread "SinkRunner-PollingRunner-DefaultSinkProcessor" java.lang.OutOfMemoryError: GC overhead limit exceeded".
Solution 3: Simply add JAVA_OPTS="-Xms12g -Xmx12g" (my server has more than 16 GB of physical memory) to "/usr/lib/flume-ng/bin/flume-ng".
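That is, near the top of the launcher script (12 GB simply fits my server's RAM; size it to yours):
# /usr/lib/flume-ng/bin/flume-ng
JAVA_OPTS="-Xms12g -Xmx12g"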
—— My configuration file for Flume ——
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = hour
a1.sinks.k2.hdfs.round = true
a1.sinks.k2.hdfs.roundValue = 1
a1.sinks.k2.hdfs.roundUnit = hour
The startup command for the Cloudera environment:
sudo -u hdfs flume-ng agent --conf ./ --conf-file example.conf \
-name a1 -Dflume.root.logger=INFO,console \
After launching my small Spark machine-learning application, the job hung and its page in the Spark UI showed nothing for more than 5 minutes.
That was weird, and I saw some logs like these in the YARN UI:
16/09/30 17:05:33 INFO ipc.Client: Retrying connect to server: user/110.75.x.x:8020. Already tried 0 time(s); maxRetries=45
16/09/30 17:05:53 INFO ipc.Client: Retrying connect to server: user/110.75.x.x:8020. Already tried 1 time(s); maxRetries=45
16/09/30 17:06:13 INFO ipc.Client: Retrying connect to server: user/110.75.x.x:8020. Already tried 2 time(s); maxRetries=45
16/09/30 17:06:33 INFO ipc.Client: Retrying connect to server: user/110.75.x.x:8020. Already tried 3 time(s); maxRetries=45
16/09/30 17:06:53 INFO ipc.Client: Retrying connect to server: user/110.75.x.x:8020. Already tried 4 time(s); maxRetries=45
16/09/30 17:07:13 INFO ipc.Client: Retrying connect to server: user/110.75.x.x:8020. Already tried 5 time(s); maxRetries=45
16/09/30 17:07:33 INFO ipc.Client: Retrying connect to server: user/110.75.x.x:8020. Already tried 6 time(s); maxRetries=45
I don't have any machine with an IP like "110.75.x.x", so why was the Spark job trying to connect to it?
After reviewing the code carefully, I found the problem:
val conf = new SparkConf().setAppName("Simple Regression")
val sc = new SparkContext(conf)
val smsData = sc.textFile("hdfs://user/sanbai/SMSSpamCollection")
It was me who forgot to put the NameNode address in the HDFS URI, so "user" was parsed as the NameNode hostname and the IPC client kept retrying whatever that name resolved to. The correct code should be:
val smsData = sc.textFile("hdfs://127.0.0.1/user/sanbai/SMSSpamCollection")
Now the application runs correctly.
I had to change a program written in C from writing local files to writing to HDFS. After studying the C API examples in libhdfs, I finished converting open()/write()/read() and so on to hdfsOpenFile()/hdfsWrite()/hdfsRead(). But when I ran the new program, many problems occurred. The first one: after fork(), I could not open HDFS files any more, presumably because libhdfs embeds a JVM in the process through JNI and the JVM does not survive a fork() in the child. This problem looks quite common in the community and has no solution yet.
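For reference, here is a minimal sketch of what the libhdfs write path looks like before the fork() issue shows up; the NameNode host, port, and file path are placeholders, not values from my real program:
#include <fcntl.h>     /* O_WRONLY */
#include <stdio.h>
#include <string.h>
#include <hdfs.h>      /* libhdfs header */

int main(void)
{
    /* Connect to the NameNode; host and port are placeholders. */
    hdfsFS fs = hdfsConnect("namenode-host", 8020);
    if (fs == NULL) {
        fprintf(stderr, "hdfsConnect failed\n");
        return 1;
    }

    /* Roughly the counterpart of open(): O_WRONLY creates/overwrites the file. */
    hdfsFile file = hdfsOpenFile(fs, "/tmp/my.db", O_WRONLY, 0, 0, 0);
    if (file == NULL) {
        fprintf(stderr, "hdfsOpenFile failed\n");
        hdfsDisconnect(fs);
        return 1;
    }

    /* Counterpart of write(). */
    const char *msg = "hello";
    hdfsWrite(fs, file, msg, (tSize)strlen(msg));

    /* Push buffered data out, then close and disconnect. */
    hdfsHFlush(fs, file);
    hdfsCloseFile(fs, file);
    hdfsDisconnect(fs);
    return 0;
}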
So I had to try the fuse-dfs tool instead. Following the steps in this article, I successfully built and ran fuse-dfs:
./fuse_dfs_wrapper.sh -d dfs://x.x.x.x:8020 /data/ -obig_writes
But something weird happened:
fd = open("my.db", O_WRONLY | O_CREAT | O_TRUNC, 0644);
write(fd, "hello", 5);
fsync(fd);
After fsync(), the size of "my.db" shown by the "ls" command on the mountpoint "/data" was still zero! This caused the program to report an error and stop processing.
The reason is that fuse-dfs does not implement the fsync() interface of FUSE. After I added an fsync() implementation backed by hdfsHSync(), it works now. But the performance is poor: only about 10~20 MB/s over the network.
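A rough sketch of that handler is below. This is not the actual fuse-dfs patch: it assumes the open handler stored the hdfsFile handle in fi->fh and that the hdfsFS connection is reachable through a global, so treat the names as illustrative.
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <hdfs.h>
#include <errno.h>
#include <stdint.h>

extern hdfsFS global_fs;   /* assumed: connection opened at mount time */

/* fsync handler: force buffered writes down to the DataNodes. */
static int dfs_fsync(const char *path, int isdatasync, struct fuse_file_info *fi)
{
    hdfsFile file = (hdfsFile)(uintptr_t)fi->fh;   /* assumed: stored by the open handler */
    (void)path;
    (void)isdatasync;

    if (hdfsHSync(global_fs, file) != 0)
        return -EIO;
    return 0;
}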
Consequently, I decided to use GlusterFS instead of HDFS, because it needs no modification to the user program at all and has supported erasure coding since version 3.6 (which dramatically reduces the storage space consumed).