AWS

Some problems about using AWS DMS

AWS DMS is a new type of service for migrating data between different types of databases and data warehouses. I ran into some problems when trying to use it in a production environment.
Problem 1. When I used a MySQL server on AWS RDS as the source of a replication task, the task reported an error right after it started:

Last failure message
Last Error Task error notification received from subtask 0, thread 0 [reptask/replicationtask.c:2673] [1020418] Error Code [10001] : Binary Logging must be enabled for MySQL server; Errors in MySQL server binary logging configuration. Follow all prerequisites for 'MySQL as a source in DMS' from https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MySQL.html or'MySQL as a target in DMS' from https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.MySQL.html ; Failed while preparing stream component 'st_0_WBK5KGUWQAH6VKEP4I5LH2EFHE'.; Cannot initialize subtask; Stream component 'st_0_WBK5KGUWQAH6VKEP4I5LH2EFHE' terminated [reptask/replicationtask.c:2680] [1020418] Stop Reason FATAL_ERROR Error Level FATAL

The failure message looks terrible, but at least it points to a doc to follow. After I changed the configuration as below:

binlog_format	ROW
binlog_checksum	NONE
binlog_row_image	FULL

the error was still there.
The real answer is here, since I used RDS instead of self-managed MySQL. After I added one line of Terraform code to enable “automatic backups”:

resource "aws_db_instance" "test_gaf" {
  ......
  backup_retention_period     = 10
}

the replication task began to work without the error.
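To make the checklist above easier to re-run, here is a minimal Python sketch that compares a set of MySQL variables and the RDS backup retention period against the values from this post. The function name and the input shape are hypothetical; in practice the variables would come from `SHOW VARIABLES` and the retention period from the RDS instance settings.

```python
# Values this post set on the RDS parameter group (taken from the text above).
REQUIRED_BINLOG_SETTINGS = {
    "binlog_format": "ROW",
    "binlog_checksum": "NONE",
    "binlog_row_image": "FULL",
}


def find_dms_source_problems(mysql_variables, backup_retention_days):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for name, expected in REQUIRED_BINLOG_SETTINGS.items():
        actual = str(mysql_variables.get(name, "")).upper()
        if actual != expected:
            problems.append(f"{name} is {actual or 'unset'}, expected {expected}")
    # On RDS, binary logging only takes effect when automatic backups are on,
    # i.e. when backup_retention_period is greater than 0.
    if backup_retention_days <= 0:
        problems.append("automatic backups are disabled (backup_retention_period is 0)")
    return problems


good_variables = {"binlog_format": "ROW", "binlog_checksum": "NONE", "binlog_row_image": "FULL"}
print(find_dms_source_problems(good_variables, 0))
```

With the three binlog settings correct but backups disabled, the sketch reports exactly one problem, which matches what happened to me.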
Problem 2. After the replication task had been running for a while, exporting data from MySQL to AWS Redshift, a new error appeared in the Redshift load logs:

019-10-29T04:41:27 [TARGET_LOAD ]E: RetCode: SQL_ERROR SqlState: XX000 NativeError: 30 Message: [Amazon][Amazon Redshift] (30) Error occurred while trying to execute a query: [SQLState XX000] ERROR: User arn:aws:redshift:us-east-1:262284277472:dbuser:analytics-20190902/masteruser is not authorized to assume IAM Role arn:aws:iam::262284277472:role/dms-access-for-endpoint DETAIL: ----------------------------------------------- error: User arn:aws:redshift:us-east-1:262284277472:dbuser:analytics-20190902/masteruser is not authorized to assume IAM Role arn:aws:iam::262284277472:role/dms-access-for-endpoint code: 8001 context: IAM Role=arn:aws:iam::262284277472:role/dms-access-for-endpoint query: 1799 location: xen_aws_credentials_mgr.cpp:321 process: padbmaster [pid=21755] ----------------------------------------------- [1022502] (ar_odbc_stmt.c:4622)

Why is masteruser not authorized? The answer is here. Below is the Terraform code:

data "aws_iam_policy_document" "dms_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      identifiers = ["dms.amazonaws.com"]
      type        = "Service"
    }
  }
  statement {
    actions = ["sts:AssumeRole"]
    # By https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Security.APIRole.html,
    # we also need principal `redshift.amazonaws.com`
    principals {
      identifiers = ["redshift.amazonaws.com"]
      type        = "Service"
    }
  }
}

This gives “dms_assume_role” two trusted entities: dms.amazonaws.com and redshift.amazonaws.com.
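For readers who don't use Terraform, the same trust policy can be sketched as a plain JSON document built in Python. This is only an illustration of the document shape; the helper name `build_trust_policy` is hypothetical, and attaching the policy to the role is left out.

```python
import json


def build_trust_policy(service_principals):
    """Build an IAM trust policy that lets each listed service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service},
                "Action": "sts:AssumeRole",
            }
            for service in service_principals
        ],
    }


# Both DMS and Redshift must be trusted, as in the Terraform code above.
trust_policy = build_trust_policy(["dms.amazonaws.com", "redshift.amazonaws.com"])
print(json.dumps(trust_policy, indent=2))
```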

Problem 3. There was still an error in the Redshift load log (so many errors in AWS DMS…):

Error	Type	Raw Field Value
Invalid timestamp format or value [YYYY-MM-DD HH24:MI:SS]	timestamp	0000-00-00 00:00:00

It seems the answer is here. So I added “acceptanydate=true;timeformat=auto” to the “extra connection settings” of the Redshift endpoint. But the error just changed to:

Error	Type	Raw Field Value
Invalid data	timestamp	0000-00-00 00:00:00

After searching for almost two days, I found that the cause was in the Redshift schema, which is created automatically by the AWS DMS replication task.

CREATE TABLE my (
    ...
    mydate TIMESTAMP DEFAULT '0000-00-00 00:00:00' NOT NULL,
    ...
)

Since the schema doesn’t allow the “mydate” column to be null, but “acceptanydate=true” tries to convert “0000-00-00 00:00:00” to null, the final error from Redshift is “Invalid data”.
The solution to this problem is to create the Redshift table manually so that the “mydate” column is nullable, and to change the target table preparation mode of the replication task to “TRUNCATE_BEFORE_LOAD”, so that DMS keeps the manually created table instead of dropping and recreating it.
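The interaction between the two settings is easier to see in a toy model. The sketch below is not Redshift's real logic, just an illustration under the assumption described above: the zero date is converted to NULL first, and only then is the column's nullability checked. All names are hypothetical.

```python
ZERO_TIMESTAMP = "0000-00-00 00:00:00"


def load_timestamp(raw_value, nullable):
    """Toy model: zero dates become NULL, then the NOT NULL constraint is applied."""
    value = None if raw_value == ZERO_TIMESTAMP else raw_value
    if value is None and not nullable:
        # Corresponds to the "Invalid data" error in the load log.
        raise ValueError("Invalid data")
    return value


# A nullable column accepts the converted value.
print(load_timestamp(ZERO_TIMESTAMP, nullable=True))

# The auto-created NOT NULL column rejects it.
try:
    load_timestamp(ZERO_TIMESTAMP, nullable=False)
except ValueError as error:
    print(error)
```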

Some tips about using AWS Glue

Configuring the data format
To use AWS Glue, I wrote a ‘catalog table’ into my Terraform script:

resource "aws_glue_catalog_table" "my_table" {
...
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
    ser_de_info {
      name                  = "SerDeCsv"
      serialization_library = "org.apache.hadoop.hive.serde2.OpenCSVSerde"
      parameters = {
        "separatorChar" = ","
        "quoteChar"     = "'"
      }
    }
...
}

But when a PySpark script tried to access this table, it reported:

py4j.protocol.Py4JJavaError: An error occurred while calling o58.getCatalogSource.
: com.amazonaws.services.glue.util.NonFatalException: Formats not supported for SparkSQL data sources. Got csv
	at com.amazonaws.services.glue.SparkSQLDataSource.setFormat(DataSource.scala:641)
	at com.amazonaws.services.glue.GlueContext.getCatalogSource(GlueContext.scala:254)
	at com.amazonaws.services.glue.GlueContext.getCatalogSource(GlueContext.scala:139)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)

It seems we can’t use ‘OpenCSVSerde’ here. The configuration that actually works is:

Input format: org.apache.hadoop.mapred.TextInputFormat
Output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Serde serialization lib: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Serde parameters: field.delim ,
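The same settings can be written as the storage descriptor dictionary that, as far as I know, boto3's Glue `create_table` accepts inside `TableInput`. This sketch only builds the dictionary and does not call AWS; the helper name, the S3 location, and the column list are made-up examples.

```python
def build_csv_storage_descriptor(location, columns):
    """Build a Glue storage descriptor for comma-delimited text using LazySimpleSerDe."""
    return {
        "Columns": [{"Name": name, "Type": col_type} for name, col_type in columns],
        "Location": location,
        "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        "SerdeInfo": {
            "Name": "SerDeCsv",
            "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
            "Parameters": {"field.delim": ","},
        },
    }


# Hypothetical bucket and columns, for illustration only.
descriptor = build_csv_storage_descriptor(
    "s3://example-bucket/my_table/",
    [("id", "int"), ("name", "string")],
)
print(descriptor["SerdeInfo"]["SerializationLibrary"])
```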

The version of Zeppelin
When I used Zeppelin to run a PySpark script, it reported this error:

org.apache.thrift.TApplicationException: Internal error processing createInterpreter
	at org.apache.thrift.TApplicationException.read(TApplicationException.java:111)
	at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
	at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_createInterpreter(RemoteInterpreterService.java:209)
	at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.createInterpreter(RemoteInterpreterService.java:192)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreter$2.call(RemoteInterpreter.java:169)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreter$2.call(RemoteInterpreter.java:165)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.callRemoteFunction(RemoteInterpreterProcess.java:135)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.internal_create(RemoteInterpreter.java:165)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.open(RemoteInterpreter.java:132)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:299)
	at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:407)
	at org.apache.zeppelin.scheduler.Job.run(Job.java:188)
	at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:315)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

According to the document:

The latest release of Apache Zeppelin, 0.8.x, is not supported. Download the older release named zeppelin-0.7.3-bin-all.tgz from the download page and follow the installation instructions.

Some ideas about building streaming ETL on AWS

After a discussion with the technical support engineers from AWS, I got more information about how to combine AWS services into a streaming ETL architecture, step by step.
The main architecture could be described by the diagram below:

AWS S3 is the de facto data lake. All the data, whether it comes from AWS RDS, AWS DynamoDB, or other custom sources, can be written into AWS S3 in a specific format such as Apache Parquet or Apache ORC (CSV is not recommended because it is poorly suited to data scans and compression). Then data engineers can use AWS Glue to extract the data from AWS S3, transform it (using PySpark or something like it), and load it into AWS Redshift.
Frequently used data can also be kept in AWS Redshift for optimised queries. When tables from both AWS S3 and AWS Redshift need to be joined, we can use AWS Redshift Spectrum.
BTW, I also joined a workshop about Databricks’ new Unified Data Analytics and Machine Learning Platform, which is built on AWS. It contains:
1. Delta Lake for data storage and schema enforcement.
2. Notebooks that let users write code and run it directly to process and analyze data with Apache Spark, just like Jupyter Notebook.
3. MLflow, which uses the data above to train machine learning models.
I used Apache Spark for learning about 4 years ago. At that time, I had to build the Java/Scala package myself, then upload and run it. Debugging was tedious because I could only scan the CLI logs again and again to find mistakes in the code. Now Databricks gives data scientists and developers a much more convenient solution.

Anyone who is interested in this platform can try the free edition: https://databricks.com/try-databricks

Investigating about Streaming ETL solutions

Normal ETL solutions need to deliver all data from transactional databases to the data warehouse. For instance, DBAs or data scientists usually deploy a script that exports a whole table from the database to the data warehouse every hour. To accelerate this process, we decided to use a streaming ETL solution on AWS (or GCP, if possible).
First, I tested AWS Data Pipeline. Although it’s called ‘Pipeline’, it needs a Last Modified Column in the customer’s MySQL table so that it can decide which part of the table to extract in each run. Only the changed rows, meaning those whose Last Modified Column value has been updated, are extracted. However, our MySQL tables don’t have this column, and adding it along with the corresponding logic in code would be too tedious for an old infrastructure. So AWS Data Pipeline is not a suitable solution for us.
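To show what a Last Modified Column buys you, here is a minimal sketch of watermark-based incremental extraction. It uses an in-memory SQLite table so it runs anywhere; the table, column names, and the function are hypothetical and are not how AWS Data Pipeline is implemented internally.

```python
import sqlite3


def extract_changed_rows(conn, last_seen):
    """Return rows modified after last_seen, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, payload, last_modified FROM orders "
        "WHERE last_modified > ? ORDER BY last_modified",
        (last_seen,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_seen
    return rows, new_watermark


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT, last_modified TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "a", "2019-10-01 00:00:00"), (2, "b", "2019-10-02 00:00:00")],
)

# First run: only the row changed after the stored watermark is extracted.
first_rows, watermark = extract_changed_rows(conn, "2019-10-01 00:00:00")

# An update must also bump last_modified, otherwise the change is never seen.
conn.execute("UPDATE orders SET payload = 'a2', last_modified = '2019-10-03 00:00:00' WHERE id = 1")
second_rows, watermark_after_update = extract_changed_rows(conn, watermark)
print(first_rows, second_rows)
```

The weak point is visible in the update step: every writer has to maintain the column, which is exactly the logic our old code base lacks.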
Then I found the tutorial, and my colleague found another doc at the same time. Combining these two suggestions, I worked out a viable solution:

An in-house service that uses pymysqlreplication and boto3 to parse the binlog from MySQL and write the parsed events into AWS Kinesis (or Kafka)
Another in-house service that reads these events and exports them into AWS Redshift

Since AWS Redshift is a columnar data warehouse, inserting, updating, or deleting rows one by one severely hurts its performance. So we need to use S3 to store intermediate files and the ‘COPY’ command to batch the operations, as below:
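A rough sketch of the batching step in Python, assuming insert-only events that have already been parsed into dictionaries. The function names, the bucket path, and the role ARN are hypothetical, and the actual S3 upload and the handling of updates and deletes are left out.

```python
import csv
import io


def events_to_csv(events, columns):
    """Serialize a batch of row events into one CSV payload for a single S3 object."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for event in events:
        writer.writerow([event[column] for column in columns])
    return buffer.getvalue()


def build_copy_statement(table, s3_uri, iam_role):
    """Build one COPY statement that loads the whole batch at once."""
    return f"COPY {table} FROM '{s3_uri}' IAM_ROLE '{iam_role}' CSV;"


batch = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
payload = events_to_csv(batch, ["id", "name"])
statement = build_copy_statement(
    "employee",
    "s3://example-bucket/batches/0001.csv",            # hypothetical location
    "arn:aws:iam::123456789012:role/ExampleRedshift",  # hypothetical role
)
print(payload)
print(statement)
```

The point of the design is that Redshift sees one COPY per batch instead of one INSERT per row.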

Migrate blog to AWS’s ec2

My blog had been hosted on Linost since 2013. But recently the support staff from Linost notified me that my site had pushed the CPU usage of the host machine to 100%, so the hosting system automatically ‘limited’ my resources, which actually means my site was shut down completely.
The first thing I wanted to do was log in to my host machine over SSH. Unfortunately, Linost doesn’t support SSH login. Without SSH and all the Linux commands, how could I find the cause of the high load and resolve it?
Finally, I chose AWS EC2 for my new hosting machine. To reduce the cost, I picked ‘t2.nano’, the cheapest instance type. Although it has only 512MB of memory, that is adequate for running a basic WordPress blog. Additionally, I bought a reserved instance by paying upfront for a whole year, which reduced the cost further (about a 50% discount).
Using EC2 has another advantage: I don’t need to install MySQL/Apache/PHP/WordPress myself. With Jetware’s AMI (Amazon Machine Image), a basic WordPress blog can be launched with a few clicks. Jetware’s AMI uses LEMP (Linux/nginx web Engine/MySQL/PHP) as its basic software stack, and also includes phpMyAdmin for managing MySQL. This AMI is totally free. The only small defect is that the MySQL ‘root’ account has an empty password. But we can fix that simply:

# Login mysql command line
mysql -uroot
# Set password for root user on 'localhost'
SET PASSWORD FOR root@localhost = PASSWORD('yourpassword');

By typing ‘https://donghao.org/phpmyadmin/’ in the browser, I can manage MySQL very easily:

That’s awesome! Thanks to Jetware.

Some tips about “Amazon Redshift Database Developer Guide”

Show diststyle of tables

SELECT relname, reldiststyle
FROM   pg_class
WHERE  relname='salary' OR relname='employee';

Details about distribution styles: http://docs.aws.amazon.com/redshift/latest/dg/viewing-distribution-styles.html
How to COPY multiple files into Redshift from S3
http://docs.aws.amazon.com/redshift/latest/dg/t_loading-tables-from-s3.html
“GROUP BY” (or “ORDER BY”) can take a column number instead of a column name

SELECT listing.sellerid, sum(sales.qtysold)
FROM   sales, listing
WHERE  sales.salesid = listing.listid
AND    listing.listtime > '2008-12-01'
AND    sales.saletime > '2008-12-01'
GROUP BY 1
ORDER BY 1;

COPY with automatic compression
To apply automatic compression to an empty table, regardless of its current compression encodings, run the COPY command with the COMPUPDATE option set to ON. To disable automatic compression, run the COPY command with the COMPUPDATE option set to OFF.
Change diststyle of table

CREATE TABLE userseven DISTSTYLE EVEN AS
SELECT * FROM users;

Show storage space of columns

select col, max(blocknum)
from stv_blocklist b, stv_tbl_perm p
where (b.tbl=p.id) and name ='lineorder'
and col < 17
group by name, col
order by col;

Change current environment in SQL Editor

set query_group to test;
set session authorization 'adminwlm';
set wlm_query_slot_count to 3; /* override current level */

Primary key and foreign key
Amazon Redshift does not enforce primary key and foreign key constraints, but the query optimizer uses them when it generates query plans. If you set primary keys and foreign keys, your application must maintain the validity of the keys.  
Distribution info in EXPLAIN
DS_DIST_NONE
No redistribution is required, because corresponding slices are collocated on the compute nodes. You will typically have only one DS_DIST_NONE step, the join between the fact table and one dimension table.
DS_DIST_ALL_NONE
No redistribution is required, because the inner join table used DISTSTYLE ALL. The entire table is located on every node.
DS_DIST_INNER
The inner table is redistributed.
DS_DIST_OUTER
The outer table is redistributed.
DS_BCAST_INNER
A copy of the entire inner table is broadcast to all the compute nodes.
DS_DIST_ALL_INNER
The entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL.
DS_DIST_BOTH
Both tables are redistributed.
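When an EXPLAIN plan is long, it helps to pull out only the steps that move data. The sketch below scans plan text for the DS_ labels listed above and keeps the ones that imply redistribution or broadcast. The sample plan is a made-up, abridged fragment, and the function name is hypothetical.

```python
import re

# Labels that mean no data movement, per the list above.
NO_REDISTRIBUTION = {"DS_DIST_NONE", "DS_DIST_ALL_NONE"}


def redistribution_steps(explain_text):
    """Return the DS_ labels in a plan that involve moving rows between slices."""
    labels = re.findall(r"\bDS_[A-Z_]+\b", explain_text)
    return [label for label in labels if label not in NO_REDISTRIBUTION]


# Abridged, invented plan text used only to exercise the function.
sample_plan = """
XN Hash Join DS_BCAST_INNER
  -> XN Hash Join DS_DIST_NONE
  -> XN Hash Join DS_DIST_BOTH
"""
print(redistribution_steps(sample_plan))
```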
Create Like

create table likesales (like sales);
insert into likesales (select * from sales);
drop table sales;
alter table likesales rename to sales;

Interleaved skew

select tbl as tbl_id, stv_tbl_perm.name as table_name,
col, interleaved_skew, last_reindex
from svv_interleaved_columns, stv_tbl_perm
where svv_interleaved_columns.tbl = stv_tbl_perm.id
and interleaved_skew is not null;

The value for interleaved_skew is a ratio that indicates the amount of skew. A value of 1 means there is no skew. If the skew is greater than 1.4, a VACUUM REINDEX will usually improve performance unless the skew is inherent in the underlying set.
About interleaved sort key: http://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html#t_Sorting_data-interleaved
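The 1.4 rule of thumb can be applied to the query result with a few lines of Python. This is a minimal sketch that assumes the rows have already been fetched as (table_name, col, interleaved_skew) tuples; the function name and the sample rows are hypothetical.

```python
REINDEX_SKEW_THRESHOLD = 1.4


def tables_needing_reindex(rows, threshold=REINDEX_SKEW_THRESHOLD):
    """Return sorted table names with at least one column whose skew exceeds the threshold."""
    names = {
        table_name
        for table_name, _col, skew in rows
        if skew is not None and skew > threshold
    }
    return sorted(names)


# Invented rows in the shape (table_name, col, interleaved_skew).
sample_rows = [
    ("sales", 0, 1.0),
    ("sales", 1, 2.3),
    ("events", 0, 1.4),
    ("users", 0, None),
]
print(tables_needing_reindex(sample_rows))
```

Whether VACUUM REINDEX really helps still depends on whether the skew is inherent in the data, so the output is only a list of candidates.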
Concurrent write
Concurrent write operations are supported in Amazon Redshift in a protective way, using write locks on tables and the principle of serializable isolation.  
UNLOAD

unload  ('select * from venue order by venueid')
to 's3://mybucket/tickit/venue/reload_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest
delimiter '|';
truncate venue;
copy venue
from 's3://mybucket/tickit/venue/reload_manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest
delimiter '|';

Redshift UDF
In addition to the Python Standard Library, the following modules are part of the Amazon Redshift implementation:
* numpy 1.8.2  
* pandas 0.14.1  
* python-dateutil 2.2  
* pytz 2015.7  
* scipy 0.12.1  
* six 1.3.0  
* wsgiref 0.1.2

CREATE FUNCTION f_within_range (x1 float, y1 float, x2 float, y2 float) RETURNS bool
 IMMUTABLE as $$
    def distance(x1, y1, x2, y2):
        import math
        return math.sqrt((y2 - y1) ** 2 + (x2 - x1) ** 2)
    return distance(x1, y1, x2, y2) < 20
$$ LANGUAGE plpythonu;

Data Join
* Nested Loop: The least optimal join, a nested loop is used mainly for cross-joins (Cartesian products) and some inequality joins.
* Hash Join and Hash: Typically faster than a nested loop join, a hash join and hash are used for inner joins and left and right outer joins. These operators are used when joining tables where the join columns are not both distribution keys and sort keys. The hash operator creates the hash table for the inner table in the join; the hash join operator reads the outer table, hashes the joining column, and finds matches in the inner hash table.
* Merge Join: Typically the fastest join, a merge join is used for inner joins and outer joins. The merge join is not used for full joins. This operator is used when joining tables where the join columns are both distribution keys and sort keys, and when less than 20 percent of the joining tables are unsorted. It reads two sorted tables in order and finds the matching rows. To view the percent of unsorted rows, query the SVV_TABLE_INFO system table.
wlm_query_slot_count
You can temporarily override the amount of memory assigned to a query by setting the wlm_query_slot_count parameter to specify the number of slots allocated to the query. By default, WLM queues have a concurrency level of 5.
VARCHAR
A VARCHAR(12) column can contain 12 single-byte characters, 6 two-byte characters, 4 three-byte characters, or 3 four-byte characters.
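Since the VARCHAR length counts bytes rather than characters, a quick check before loading can save a failed COPY. A minimal sketch, assuming the data is UTF-8 encoded; the function name is hypothetical.

```python
def fits_varchar(text, max_bytes):
    """Check whether text fits a VARCHAR(max_bytes) column when stored as UTF-8."""
    return len(text.encode("utf-8")) <= max_bytes


print(fits_varchar("a" * 12, 12))           # 12 single-byte characters
print(fits_varchar("\u00e9" * 6, 12))       # 6 two-byte characters
print(fits_varchar("\u20ac" * 4, 12))       # 4 three-byte characters
print(fits_varchar("\U0001F600" * 3, 12))   # 3 four-byte characters
print(fits_varchar("\u00e9" * 7, 12))       # 14 bytes, too long
```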
Tuple in Redshift SQL

select * from venue
where (venuecity, venuestate) in (('Miami', 'FL'), ('Tampa', 'FL'))
order by venueid;

SIMILAR TO

SELECT gender, COUNT(gender) FROM employee WHERE first_name SIMILAR TO '%ein%' GROUP BY gender;

analyze_threshold_percent
To reduce processing time and improve overall system performance, Amazon Redshift skips analyzing a table if the percentage of rows that have changed since the last ANALYZE command run is lower than the analyze threshold specified by the analyze_threshold_percent parameter. By default, analyze_threshold_percent is 10.
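The skip rule is simple arithmetic, sketched below. This only mirrors the sentence above and is not Redshift's actual implementation; the function name and the choice to never skip an empty table are my own assumptions.

```python
def should_skip_analyze(changed_rows, total_rows, threshold_percent=10.0):
    """Return True when the changed-row percentage is below the analyze threshold."""
    if total_rows == 0:
        # Assumption: with no rows there is no meaningful percentage, so do not skip.
        return False
    changed_percent = 100.0 * changed_rows / total_rows
    return changed_percent < threshold_percent


print(should_skip_analyze(5, 100))    # 5% changed
print(should_skip_analyze(50, 100))   # 50% changed
```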
COPY from DynamoDB
Setting READRATIO to 100 or higher will enable Amazon Redshift to consume the entirety of the DynamoDB table's provisioned throughput, which will seriously degrade the performance of concurrent read operations against the same table during the COPY session. Write traffic will be unaffected.
Different databases in Redshift
After you have created the TICKIT database, you can connect to the new database from your SQL client. Use the same connection parameters as you used for your current connection, but change the database name to tickit.
Interleaved Sort Key
A maximum of eight columns can be specified for an interleaved sort key.
Truncate a string by casting in SQL

insert into t1(col1) values('Incomplete'::char(3));

INSERT INTO from SELECT

INSERT INTO hello
( SELECT employee_id, first_name
  FROM employee ORDER BY 2)

Prepare and execute PLAN

DROP TABLE IF EXISTS prep1;
CREATE TABLE prep1 (c1 int, c2 char(20));
PREPARE prep_insert_plan (int, char)
AS insert into prep1 values ($1, $2);
EXECUTE prep_insert_plan (1, 'one');
EXECUTE prep_insert_plan (2, 'two');
EXECUTE prep_insert_plan (3, 'three');
DEALLOCATE prep_insert_plan;

Powerful 'WITH' for sub-query in SQL

WITH workage AS (
SELECT employee_id, datediff(day, birthday, work_day)/365 AS work_age FROM employee)
SELECT COUNT(employee_id), work_age FROM workage GROUP BY work_age ORDER BY 1 DESC;

UNLOAD with compression

unload ('select * from employee')
TO 's3://robin-data-023/employee_'
iam_role 'arn:aws:iam::589631040421:role/fullRedshift'
gzip;

VACUUM

VACUUM FULL salary TO 100 PERCENT;

'OVER' in SQL

SELECT *,
100*salary/FIRST_VALUE(salary)
OVER (PARTITION BY employee_id ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS most_salary_percent
FROM salary
WHERE employee_id = 10001 OR employee_id = 10002;

Show and set current settings

select current_setting('query_group');
select set_config('query_group', 'test', true);

Show blocks(1MB) allocated to each column in the 'salary' table

select col, count(*)
from stv_blocklist, stv_tbl_perm
where stv_blocklist.tbl = stv_tbl_perm.id
and stv_blocklist.slice = stv_tbl_perm.slice
and stv_tbl_perm.name = 'salary'
group by col
order by col;

Slice and Col
slice: Node slice
col: Every table you create has three hidden columns appended to it: INSERT_XID, DELETE_XID, ROW_ID
In STV_SLICES, we can see the mapping between slices and nodes. A single node has two slices: 0 and 1
Commonly used tables for meta information
pg_table_def
svv_table_info

Example datasets for Amazon RedShift

Last year, I imported two datasets into Hive. This time, I will load these two datasets into Amazon Redshift instead.
After creating a Redshift cluster in my VPC, I couldn’t connect to it even with an Elastic IP. Then I compared the parameters of my VPC with those of AWS’s default VPC, and eventually saw the vital differences. First, set the “Network ACL” in the “VPC” service of AWS:

Then, add a rule to the “Route table” that lets the node reach anywhere (0.0.0.0/0) through an “Internet Gateway” (also created in the “VPC” service):

Now I could connect to my RedShift cluster.
Create an S3 bucket with the AWS CLI:

aws s3 mb s3://robin-data-023 --region us-west-2

Upload the two CSV files into the bucket:

aws s3 cp salaries.csv s3://robin-data-023/
aws s3 cp employees.csv s3://robin-data-023/

Create tables in Redshift by using SQL-Bench:

create table employee (
employee_id INTEGER primary key distkey,
birthday DATE sortkey,
first_name VARCHAR(64),
family_name VARCHAR(64),
gender CHAR(1),
work_day DATE
);
create table salary (
employee_id INTEGER primary key distkey,
salary INTEGER,
start_date DATE sortkey,
end_date DATE
);

Don’t put a blank space or tab (‘\t’) before a column name when creating a table, or else Redshift will treat the column name as
” employee_id”
” salary”
…

Load data from s3 to RedShift by COPY, the powerful tool for ETL in AWS.

copy employee
from 's3://robin-data-023/employees.csv'
iam_role 'arn:aws:iam::589631040421:role/fullRedshift'
csv quote as '\'';
copy salary
from 's3://robin-data-023/salaries.csv'
iam_role 'arn:aws:iam::589631040421:role/fullRedshift'
csv quote as '\'';

We could see the success report like this:

Warnings:
Load into table 'employee' completed, 300024 record(s) loaded successfully.
0 rows affected
COPY executed successfully
Execution time: 21.84s
Warnings:
Load into table 'salary' completed, 2819810 record(s) loaded successfully.
0 rows affected
COPY executed successfully
Execution time: 19.66s

The report says “Warnings” and “successfully” at the same time, which is a little weird. But don’t worry, that’s normal for SQL-Bench.
Now we can run this script, which was written last year (but we need to change ‘==’ to ‘=’ for compatibility):

SELECT e.gender, AVG(s.salary) AS avg_salary
    FROM employee AS e
          JOIN salary AS s
            ON (e.employee_id = s.employee_id)
GROUP BY e.gender;

The result is

Enable audit log for AWS Redshift

When I was trying to enable the audit log for AWS Redshift, I chose to use an existing bucket in S3. But it reported an error:

"Cannot read ACLs of bucket redshift-robin. Please ensure that your IAM permissions are set up correctly."
"Service: AmazonRedshift; Status Code: 400; Error Code: InsufficientS3BucketPolicyFault ...."

According to this document, I needed to change the permissions of the bucket “redshift-robin”. So I opened the AWS Console for S3, clicked the bucket name “redshift-robin” in the left panel, and saw the description of its permissions:

Press “Add Bucket Policy”, and in the pop-up window, press “AWS Policy Generator”. This opens the generator, which makes creating a policy easy.
Add two policies for “redshift-robin”:

“902366379725” is the Redshift account ID for the us-west-2 region (Oregon).

Click “Generate Policy”, and copy the generated JSON to the “Bucket Policy Editor”:

Press “Save”. Now we can enable the Redshift audit log for bucket “redshift-robin”:
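For reference, the two policy statements can also be generated in Python instead of through the Policy Generator. This is a sketch under assumptions: the principal `user/logs` and the two actions (`s3:PutObject` and `s3:GetBucketAcl`) follow the AWS documentation for Redshift audit logging as I remember it from that time, so check the current documentation before using it. The function name is hypothetical.

```python
import json


def build_audit_log_bucket_policy(bucket, redshift_account_id):
    """Build a bucket policy that lets the regional Redshift account write audit logs."""
    # Assumption: the logging principal is the "logs" user of the regional account.
    principal = {"AWS": f"arn:aws:iam::{redshift_account_id}:user/logs"}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:GetBucketAcl",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }


# Bucket name and account ID are the ones used in this post (us-west-2).
audit_policy = build_audit_log_bucket_policy("redshift-robin", "902366379725")
print(json.dumps(audit_policy, indent=2))
```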

Book notes about “Amazon Redshift Database Developer Guide”

Although I have been familiar with cloud computing for many years, I haven’t looked inside many of the services provided by Amazon Web Services. My company (Alibaba) has its own cloud platform, Aliyun, so we are only allowed to use home-made cloud products, such as ECS (like EC2 in AWS), RDS (like RDS in AWS), and ODPS (like EMR in AWS).
These days I have been reading some sections of “Amazon Redshift Database Developer Guide” on my Kindle during my commute.
Amazon Redshift is built on PostgreSQL, which is not very popular in China but pretty famous in Japan and the USA. The book says that primary keys and foreign keys are informational only and that constraints are not enforced at all. I guess that is because Redshift distributes the rows of every table across different servers in the cluster, so enforcing constraints is almost impossible.
Columnar storage is used in Redshift because it is a perfect fit for OLAP (OnLine Analytical Processing), where users tend to retrieve or load a tremendous number of records. Column-oriented storage is also well suited to compression and saves a colossal amount of disk space.
The interesting thing is that the architectures of Amazon Redshift and Greenplum look very similar: both distribute the rows, and both use PostgreSQL as the back-end engine. Greenplum was open-sourced recently, which makes it much easier for ordinary users to build a private OLAP platform. This raises a new question for me: if users can build a private cloud on their bare-metal servers very easily (with software such as OpenStack, OpenShift, Mesos, Greenplum, etc.), is it still necessary for them to build their services and store their data in a public cloud? Or will the only value of the public cloud be maintaining and managing large numbers of bare-metal servers?