Category Archives: bigdata

An example of using Spark Structured Streaming

This snippet will monitor two directories and join the data from them when there is a new CSV file in any directory.

The join operation is implemented by Spark SQL which is easy to use (for DBA), and also easy to maintain. Some articles said if the Spark process… Read more »

A problem of using Pyspark SQL

Here is the code:

It will report error after running ‘cat xxx.py|bin/pyspark’:

I used to think it was because ‘2’ is a string, so I changed ‘row’ to be ‘[2, 29, 29, 29]’. But the error also changed to:

Then I searched on google, and find this… Read more »

Processing date and time in AWS Redshift

Since AWS Redshift don’t have function like FROM_UNIX(), it’s much more weird to get formatted time from a UNIX timestamp (called ‘epoch’ in Reshift):

Ref: https://stackoverflow.com/questions/39815425/how-to-convert-epoch-to-datetime-redshift If we want to see the statistics result group by hours:

Some tips about using AWS Glue

Configure about data format To use AWS Glue, I write a ‘catalog table’ into my Terraform script:

But after using PySpark script to access this table, it reports:

Seems we can’t use ‘OpenCSVSerde’. Actually, the correct answer is:

The version of zeppelin When using zeppelin to run… Read more »

Using Spark-SQL to transfer CSV file to Parquet

After downloading data from “Food and Agriculture Organization of United Nations”, I get many CSV files. One of the file is named “Trade_Crops_Livestock_E_All_Data_(Normalized).csv” and it looks like:

To load this CSV file into Spark and dump it to Parquet format, I wrote these codes:

The build.sbt is

Read more »

Some tips about “Amazon Redshift Database Developer Guide”

Show diststyle of tables

Details about distribution styles: http://docs.aws.amazon.com/redshift/latest/dg/viewing-distribution-styles.html How to COPY multiple files into Redshift from S3 http://docs.aws.amazon.com/redshift/latest/dg/t_loading-tables-from-s3.html Could “Group” (or “Order”) by number, not column name

COPY with automatical compression To apply automatic compression to an empty table, regardless of its current compression encodings, run the… Read more »