Normal ETL solutions need to deliver all data from transactional databases to data warehouse. For instance, DBAs or Data Scientists usually deploy a script to export whole table from database to data warehouse each hour. To accelerate this process, we decided to use Streaming ETL solution in AWS(or GCP, if possible).
Firstly, I tested the AWS Data Pipeline. Although it’s called ‘Pipeline’, it needs a Last Modified Column in customer’s MySQL table so it could decide which part of the table should be extracted in each turn. The new rows, which means their Last Modified Column values had been updated, will be extracted. However, our MySQL tables don’t have this column, and adding these column and corresponding logics in code will be too tedious for a old infrastructure. The AWS Data Pipeline is not a suitable solution for us.
Then, I found the tutorial and my colleague found another doc at the same time. Combining these two suggestions, I thought out a viable solution:
- A in-house service using pymysqlreplication and boto3 to parse binlog from MySQL, and write these parsed-out events into AWS Kinesis (or Kafka)
- Another in-house service read these events and exported them into AWS RedShift
Since the AWS Redshift is a columnar storage data warehouse, inserting/updating/deleting data one by one will severely hurts its performance. So We need to use S3 service to store the intermediate files, and ‘COPY’ command to batch the operations, as below: