Author Archives: Robin Dong

An example of using Spark Structured Streaming

This snippet monitors two directories and joins their data whenever a new CSV file appears in either directory.

The join operation is implemented with Spark SQL, which is easy to use (for DBAs) and also easy to maintain. Some articles said if the Spark process… Read more »
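The full snippet is not reproduced in this excerpt. A minimal sketch of the idea, assuming PySpark is installed, might look like the following. The directory arguments, the view names (`users`, `orders`), and the columns are all hypothetical and chosen only for illustration.

```python
# Hypothetical view and column names; the original post's schema is not shown.
JOIN_SQL = """
SELECT u.id, u.name, o.amount
FROM users u
JOIN orders o ON u.id = o.user_id
"""


def start_join_stream(dir_users, dir_orders):
    """Watch two CSV directories and print joined rows to the console."""
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import (
        DoubleType, IntegerType, StringType, StructField, StructType,
    )

    spark = SparkSession.builder.appName("csv-stream-join").getOrCreate()

    # Streaming file sources need an explicit schema by default.
    users_schema = StructType([
        StructField("id", IntegerType()),
        StructField("name", StringType()),
    ])
    orders_schema = StructType([
        StructField("user_id", IntegerType()),
        StructField("amount", DoubleType()),
    ])

    users = (spark.readStream.schema(users_schema)
             .option("header", "true").csv(dir_users))
    orders = (spark.readStream.schema(orders_schema)
              .option("header", "true").csv(dir_orders))

    # Register both streams as views so the join can be written in SQL.
    users.createOrReplaceTempView("users")
    orders.createOrReplaceTempView("orders")

    joined = spark.sql(JOIN_SQL)

    # Stream-stream inner joins run in append mode.
    return joined.writeStream.format("console").outputMode("append").start()
```

Calling `start_join_stream(...)` with two directory paths and then `.awaitTermination()` on the returned query keeps the job running. Note that this sketch sets no watermark, so the join state grows for as long as the job runs.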

A problem of using Pyspark SQL

Here is the code:

It reports an error after running ‘cat xxx.py|bin/pyspark’:

At first I thought it was because ‘2’ is a string, so I changed ‘row’ to ‘[2, 29, 29, 29]’. But then the error changed to:

Then I searched on Google and found this… Read more »

Processing date and time in AWS Redshift

Since AWS Redshift doesn’t have a function like MySQL’s FROM_UNIXTIME(), it’s more awkward to get a formatted time from a UNIX timestamp (called ‘epoch’ in Redshift):

Ref: https://stackoverflow.com/questions/39815425/how-to-convert-epoch-to-datetime-redshift

If we want to see the statistics grouped by hour:
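The original queries are not shown in this excerpt. As a rough sketch, the common Redshift idiom is to add the epoch value, as an interval of seconds, to `TIMESTAMP 'epoch'`, and then truncate to the hour for grouping. The helper names below, the table `events`, and the column `created_at` are hypothetical. The pure-Python function only mirrors the same arithmetic so it can be checked locally.

```python
from datetime import datetime, timedelta


def epoch_to_timestamp_sql(column):
    """Redshift expression turning UNIX seconds in `column` into a TIMESTAMP."""
    return f"TIMESTAMP 'epoch' + {column} * INTERVAL '1 second'"


def hourly_count_sql(table, column):
    """Query counting rows per hour, based on an epoch-seconds column."""
    ts = epoch_to_timestamp_sql(column)
    return (
        f"SELECT DATE_TRUNC('hour', {ts}) AS hour, COUNT(*) AS cnt\n"
        f"FROM {table}\n"
        f"GROUP BY 1\n"
        f"ORDER BY 1;"
    )


def epoch_to_datetime(seconds):
    """Same arithmetic in plain Python: 1970-01-01 00:00:00 plus N seconds."""
    return datetime(1970, 1, 1) + timedelta(seconds=seconds)


# Hypothetical table and column names, for illustration only.
print(hourly_count_sql("events", "created_at"))
```

The generated SQL text can be pasted into any Redshift client. Building it from Python is just a convenience here, not something the original post necessarily did.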

Some tips about using AWS Glue

Configuring the data format

To use AWS Glue, I wrote a ‘catalog table’ in my Terraform script:

But when a PySpark script accesses this table, it reports:

It seems we can’t use ‘OpenCSVSerde’. Actually, the correct answer is:

The version of Zeppelin

When using Zeppelin to run… Read more »

The uneasy way to implement SSDLite by myself

SSDLite is a variant of the Single Shot MultiBox Detector (SSD). It uses MobileNetV2 instead of VGG as its backbone, which makes detection extremely fast. I was trying to implement SSDLite from the code base of ssd.pytorch. Although it was not easy work, I finally learned a lot from the entire… Read more »

A tip about Terraform

Terraform is an interesting (in my opinion) tool for implementing Infrastructure-as-Code. When I first used it to write a production script yesterday, I got an error report:

After some searching on Google, I found the cause: it couldn’t find my AWS credentials on my computer. Actually I… Read more »