Category Archives: bigdata

Some tips about “Amazon Redshift Database Developer Guide”

Show diststyle of tables

Details about distribution styles: http://docs.aws.amazon.com/redshift/latest/dg/viewing-distribution-styles.html

How to COPY multiple files into Redshift from S3: http://docs.aws.amazon.com/redshift/latest/dg/t_loading-tables-from-s3.html

You can “Group” (or “Order”) by a column's position number instead of its name
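The tip about grouping or ordering by number can be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for Redshift (SQLite also accepts positional references in GROUP BY and ORDER BY), and the table `sales`, its columns, and the function `totals_by_position` are hypothetical names that do not come from the original post.

```python
import sqlite3


def totals_by_position(rows):
    """Sum amounts per region, grouping and ordering by column position.

    Assumption: an in-memory SQLite database stands in for Redshift, and
    the 'sales' table with (region, amount) columns is hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    # GROUP BY 1      -> first column of the select list (region)
    # ORDER BY 2 DESC -> second column of the select list (the SUM)
    query = """
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY 1
        ORDER BY 2 DESC
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [("east", 10), ("west", 25), ("east", 5), ("north", 7)]
    for region, total in totals_by_position(sample):
        print(region, total)
```

Positional references save typing when the select list contains long expressions, but they silently change meaning if the select list is reordered, so column names are usually safer in queries that will be maintained.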

COPY with automatic compression

To apply automatic compression to an empty table, regardless of its current compression encodings, run the… Read more »

Enable audit log for AWS Redshift

When I was trying to enable the Audit Log for AWS Redshift, I chose to use an existing bucket in S3, but it reported an error:

According to this document, I needed to change the permissions of the bucket “redshift-robin”. So I opened the AWS Console for S3 and clicked the bucket name of… Read more »

Read paper “iShuffle: Improving Hadoop Performance with Shuffle-on-Write”

Paper reference: iShuffle: Improving Hadoop Performance with Shuffle-on-Write. Background: A job in Hadoop consists of three main stages: map, shuffle, and reduce (in practice, the shuffle stage is contained in the reduce stage). What is the problem? The shuffle phase needs to migrate a large amount of data from the nodes running the map job to… Read more »

Build dataflow to get monthly top price of Land Trading in UK

The dataset was downloaded from the UK government data website (the total data size is more than 3 GB). I am using Apache Oozie to run the Hive and Sqoop jobs periodically. The Hive script “land_price.hql”:

We want the Hive job to run on the queue “root.default” in YARN (and other jobs on “root.mr”),… Read more »

Use Oozie to run terasort

A better choice of “Action” for running the terasort test case in Oozie is “Java Action” rather than “Mapreduce Action”, because terasort needs to run

first and then load ‘partitonFile’ with “TotalOrderPartitioner”. It is not a simple MapReduce job that needs merely a few properties. The directory of this “TerasortApp” which using… Read more »

Install CDH (Cloudera Distribution Hadoop) by Cloudera Manager

Recently I was trying to install Cloudera-5.8.3 on my CentOS 7 machines, and here are some steps and troubleshooting tips: 0. If you are not in the USA, access to the Cloudera repository of RPMs (or Parcels) is desperately slow, so we need to move… Read more »

Using Pig to join two tables and sort it

Given two tables, salary and employee, we can use Pig to find the highest-paid employees:

The result is: