Category Archives: bigdata

Using Spark-SQL to transfer CSV file to Parquet

After downloading data from the “Food and Agriculture Organization of the United Nations”, I got many CSV files. One of the files is named “Trade_Crops_Livestock_E_All_Data_(Normalized).csv” and it looks like:

To load this CSV file into Spark and dump it in Parquet format, I wrote this code:
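(The original snippet was not captured on this archive page. As a rough sketch of such a job, assuming Spark 2.x and hypothetical input/output paths, it could look like:)

```scala
import org.apache.spark.sql.SparkSession

object CsvToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CsvToParquet")
      .getOrCreate()

    // Read the FAO CSV: the header row supplies column names,
    // inferSchema lets Spark guess the column types
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("Trade_Crops_Livestock_E_All_Data_(Normalized).csv")

    // Dump the DataFrame in Parquet format (output path is a placeholder)
    df.write.parquet("trade_crops_livestock.parquet")

    spark.stop()
  }
}
```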

The build.sbt is
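(The build.sbt itself was not captured either; a minimal sketch, assuming Spark 2.4 on Scala 2.11 — versions are assumptions:)

```scala
name := "csv-to-parquet"
version := "0.1"
scalaVersion := "2.11.12"

// "provided" because spark-submit supplies the Spark jars at runtime
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided"
```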

Read more »

Some tips about “Amazon Redshift Database Developer Guide”

Show diststyle of tables

Details about distribution styles
How to COPY multiple files into Redshift from S3
You can “Group” (or “Order”) by column number, not column name

COPY with automatic compression: to apply automatic compression to an empty table, regardless of its current compression encodings, run the… Read more »
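(To illustrate the first and last tips — the table, bucket, and IAM role names below are hypothetical:)

```sql
-- Show the distribution style of every user table
SELECT "table", diststyle FROM svv_table_info;

-- COPY every file under an S3 prefix into an empty table,
-- forcing automatic compression analysis with COMPUPDATE ON
COPY trade_data
FROM 's3://my-bucket/trade/'   -- a prefix matches multiple files
IAM_ROLE 'arn:aws:iam::123456789012:role/my-copy-role'
CSV
COMPUPDATE ON;
```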

Enable audit log for AWS Redshift

When I was trying to enable the Audit Log for AWS Redshift, I chose to use an existing bucket in S3. But it reported an error:

According to this document, I needed to change the permissions of the bucket “redshift-robin”. So I entered the AWS Console for S3 and clicked the bucket name of… Read more »

Read paper “iShuffle: Improving Hadoop Performance with Shuffle-on-Write”

Paper reference: iShuffle: Improving Hadoop Performance with Shuffle-on-Write. Background: a job in Hadoop consists of three main stages: map, shuffle, and reduce (actually, the shuffle stage is contained within the reduce stage). What is the problem? The shuffle phase needs to migrate a large amount of data from the nodes running map tasks to… Read more »

Build dataflow to get monthly top price of Land Trading in UK

The dataset is downloaded from the UK government data website (the total data size is more than 3 GB). I am using Apache Oozie to run Hive and Sqoop jobs periodically. The Hive script is “land_price.hql”:

We want the Hive job to run on the queue “root.default” in YARN (and the other jobs in “”),… Read more »
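(One way to pin an Oozie Hive action to a YARN queue is a `mapreduce.job.queuename` property in the action's configuration — the action name and schema version below are assumptions:)

```xml
<action name="land-price-hive">
  <hive xmlns="uri:oozie:hive-action:0.5">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <!-- Send the Hive job's MapReduce tasks to the root.default queue -->
      <property>
        <name>mapreduce.job.queuename</name>
        <value>root.default</value>
      </property>
    </configuration>
    <script>land_price.hql</script>
  </hive>
  <ok to="end"/>
  <error to="fail"/>
</action>
```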

Use Oozie to run terasort


The better choice of “Action” for running the terasort test case in Oozie is the “Java Action” instead of the “Mapreduce Action”, because terasort needs to run

first and then load the ‘partitionFile’ with the “TotalOrderPartitioner”. It’s not a simple MapReduce job that needs merely a few properties. The directory of this “TerasortApp”, which uses… Read more »
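(A Java action hands control to TeraSort's own `main()`, which samples the input, writes the partition file, and configures the TotalOrderPartitioner itself — steps a plain map-reduce action's property list cannot express. A sketch of such an action, with placeholder input/output paths:)

```xml
<action name="terasort">
  <java>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <!-- TeraSort's main() does the sampling and partition-file setup -->
    <main-class>org.apache.hadoop.examples.terasort.TeraSort</main-class>
    <arg>${nameNode}/user/${wf:user()}/terasort-input</arg>
    <arg>${nameNode}/user/${wf:user()}/terasort-output</arg>
  </java>
  <ok to="end"/>
  <error to="fail"/>
</action>
```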