VPC – Robin on Linux

Example datasets for Amazon RedShift

Last year, I imported two datasets to Hive. Currently, I will load two these two datasets into Amazon RedShift instead.
After created a RedShift Cluster in my VPC, I couldn’t connect to it even with Elastic IP. Then I check the parameters of my VPC between AWS’s default VPC, and eventually saw the vital differences. First, set “Network ACL” in “VPC” of AWS:

Then, add rule in “Route table”, which let node to access Anywhere(0.0.0.0/0) through “Internet Gateway” (also created in “VPC” service):

Now I could connect to my RedShift cluster.
Create s3 bucket by AWS Cli:

aws s3 mb s3://robin-data-023 --region us-west-2

Upload two csv files into bucekt:

aws s3 cp salaries.csv s3://robin-data-023/
aws s3 cp employees.csv s3://robin-data-023/

Create tables in Redshift by using SQL-Bench:

create table employee (
employee_id INTEGER primary key distkey,
birthday DATE sortkey,
first_name VARCHAR(64),
family_name VARCHAR(64),
gender CHAR(1),
work_day DATE
);
create table salary (
employee_id INTEGER primary key distkey,
salary INTEGER,
start_date DATE sortkey,
end_date DATE
);

Don’t put blank space or tab(‘\t’) before column name when creating table. or else Redshift will consider column name as
” employee_id”
” salary”
…

Load data from s3 to RedShift by COPY, the powerful tool for ETL in AWS.

copy employee
from 's3://robin-data-023/employees.csv'
iam_role 'arn:aws:iam::589631040421:role/fullRedshift'
csv quote as '\'';
copy salary
from 's3://robin-data-023/salaries.csv'
iam_role 'arn:aws:iam::589631040421:role/fullRedshift'
csv quote as '\'';

We could see the success report like this:

Warnings:
Load into table 'employee' completed, 300024 record(s) loaded successfully.
0 rows affected
COPY executed successfully
Execution time: 21.84s
Warnings:
Load into table 'salary' completed, 2819810 record(s) loaded successfully.
0 rows affected
COPY executed successfully
Execution time: 19.66s

There are “Warnings” but “successfully”, a little weird. But don’t worry, it’s ok for SQL-Bench.
Currently we could run this script which was wrote last year (But need to change ‘==’ to ‘=’ for compatible problem):

SELECT e.gender, AVG(s.salary) AS avg_salary
    FROM employee AS e
          JOIN salary AS s
            ON (e.employee_id = s.employee_id)
GROUP BY e.gender;

The result is