Category Archives: bigdata

A problem with running Hive-2.0.1 on Spark-1.6.2

When I launched Hive-2.0.1 on Spark-1.6.2, it reported errors:

After changing “spark.master” from “yarn-cluster” to “local”, I added “--hiveconf hive.root.logger=DEBUG,console” to the hive command.
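As a minimal sketch (assuming the hive binary is on the PATH), that invocation is:

    hive --hiveconf hive.root.logger=DEBUG,console

With the debug logger enabled, it printed out details like: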

This article suggests replacing the fasterxml.jackson package with a newer version, but the problem remained the same even after I completed the… Read more »

Using Linear Regression to filter SMS spam messages on Spark

Using the sample from “SMS Spam Collection v. 1”, I wrote a simple Spark program to classify normal and spam messages.
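The program itself is not preserved in this excerpt; below is a minimal sketch of such a classifier (the file path, feature size, and 0.5 threshold are my assumptions; the tab-separated label/text format follows the SMS Spam Collection dataset):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    object SpamFilter {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SpamFilter"))
        val tf = new HashingTF(10000)

        // Each line of the dataset is "<label>\t<text>", label "spam" or "ham"
        val data = sc.textFile("SMSSpamCollection").map { line =>
          val Array(label, text) = line.split("\t", 2)
          LabeledPoint(if (label == "spam") 1.0 else 0.0, tf.transform(text.split(" ")))
        }.cache()

        // Fit a linear model; treat scores above 0.5 as spam
        val model = LinearRegressionWithSGD.train(data, 100)
        val score = model.predict(tf.transform("free prize call now".split(" ")))
        println(s"score = $score, spam = ${score > 0.5}")
        sc.stop()
      }
    }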

and the “build.sbt” file contains:
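The file is not preserved here; a build definition of roughly this shape works for a Spark 1.6 job (versions are assumptions; Spark 1.6.2 is built for Scala 2.10 by default):

    name := "spam-filter"
    version := "0.1"
    scalaVersion := "2.10.6"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % "1.6.2" % "provided",
      "org.apache.spark" %% "spark-mllib" % "1.6.2" % "provided"
    )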

After submitting the job to YARN:
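The exact command is not preserved; a typical submission (class, jar name, and path are assumptions) would be:

    spark-submit --master yarn-cluster --class SpamFilter \
      target/scala-2.10/spam-filter_2.10-0.1.jar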

We can retrieve the job's log with:
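In yarn-cluster mode the driver output goes to YARN's aggregated logs, so the standard command is (the application id is a placeholder):

    yarn logs -applicationId <applicationId>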

And the result is:… Read more »

Why does my Spark job hang?


After running my small Spark machine-learning application, the job hangs and its Spark UI displays nothing for more than 5 minutes. That is weird, and I see some logs in the YARN UI:

I don't have any IP that looks like “110.75.x.x”. Why is the… Read more »

“java.io.Exception: failed to uncompress the chunk” in Apache Spark

After I ran spark-submit on my YARN cluster with Spark-1.6.2:

The job failed, and the log reported:

Somebody on the internet said this might be caused by a compatibility problem between Spark-1.6.2 and Snappy. Therefore I added an option to my spark-submit shell script to change the compression algorithm:
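The exact flag is not preserved in this excerpt; Spark's standard knob for this is spark.io.compression.codec, so a setting of roughly this shape (the value is an assumption, not necessarily what the post used) is likely:

    # assumed workaround: move block compression off Snappy
    --conf spark.io.compression.codec=lz4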

… Read more »

Terasort for Spark (part 1 / 2)

We can use Spark to sort all the data generated by Hadoop's Teragen. TerasortApp.scala:
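The listing is missing from this excerpt; a minimal sketch of such an app follows. Treating each 100-byte Teragen record (a 10-byte key plus a 90-byte value) as a text line is a simplifying assumption; the real post may read the data with TeraInputFormat instead. Input and output paths come from the command line.

    import org.apache.spark.{SparkConf, SparkContext}

    object TerasortApp {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("Terasort"))
        // Sort records by their 10-character key prefix and write them back out
        sc.textFile(args(0))
          .sortBy(line => line.take(10))
          .saveAsTextFile(args(1))
        sc.stop()
      }
    }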

build.sbt
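Again, a minimal build definition (versions are assumptions):

    name := "terasort-app"
    version := "0.1"
    scalaVersion := "2.10.6"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2" % "provided"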

After building the jar file, we can submit it to Spark (I run Spark in yarn-cluster mode):
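The original command is not preserved; a typical invocation (executor sizing and jar name are assumptions) looks like:

    spark-submit --master yarn-cluster --class TerasortApp \
      --num-executors 8 --executor-memory 4G \
      terasort-app_2.10-0.1.jar /teragen/output /terasort/output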

It took 17 minutes to complete the task, but the “terasort” tool… Read more »

Deploy Hive on Spark


The MapReduce framework is too slow for real-time analytic queries, so we need to change Hive's engine from “mr” to “spark” (link): 1. Set the environment for Spark:
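A sketch of the usual setup (the path is an assumption):

    export SPARK_HOME=/opt/spark
    export PATH=$SPARK_HOME/bin:$PATH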

2. Copy the configuration XML file for Hive:
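Presumably something along these lines (paths are assumptions):

    cp $HIVE_HOME/conf/hive-default.xml.template $HIVE_HOME/conf/hive-site.xml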

and change these configuration items:
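The concrete items are not preserved in this excerpt; the canonical Hive-on-Spark properties (values here are assumptions) look like:

    <property>
      <name>hive.execution.engine</name>
      <value>spark</value>
    </property>
    <property>
      <name>spark.master</name>
      <value>yarn-cluster</value>
    </property>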

Notice: remember to replace all “${system:java.io.tmpdir}/${system:user.name}” in… Read more »

Partitioning and Bucketing a Hive table

In a previous article, we used sample datasets to join two tables in Hive. To improve the performance of a table join, we can also use partitioning or bucketing. Let's first create a Parquet-format table with partitions and buckets:
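The DDL itself is missing here; a sketch of such a table (names and columns are assumptions, loosely following the employee dataset from the earlier article):

    CREATE TABLE employee_pb (
      emp_no INT,
      first_name STRING,
      last_name STRING,
      hire_date DATE
    )
    PARTITIONED BY (gender STRING)
    CLUSTERED BY (emp_no) INTO 8 BUCKETS
    STORED AS PARQUET;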

Then import data into it:
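A dynamic-partition insert of roughly this shape (again an assumption, with employee as the source table; note the partition column goes last in the SELECT):

    SET hive.enforce.bucketing = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;

    INSERT INTO TABLE employee_pb PARTITION (gender)
    SELECT emp_no, first_name, last_name, hire_date, gender FROM employee;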

But it reports an error:

Read more »

Example datasets for learning Hive

I found two datasets, employee and salary, for learning and practicing. After putting the two files into HDFS, we just need to create the tables:
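The DDL is not preserved here; a sketch under the assumption that the files are comma-separated (the column names follow the classic employees sample data):

    CREATE TABLE employee (
      emp_no INT, birth_date DATE, first_name STRING,
      last_name STRING, gender STRING, hire_date DATE
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    CREATE TABLE salary (
      emp_no INT, salary INT, from_date DATE, to_date DATE
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    LOAD DATA INPATH '/data/employee.csv' INTO TABLE employee;
    LOAD DATA INPATH '/data/salary.csv' INTO TABLE salary;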

Now we can analyze the data. Find the 10 oldest employees:
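For instance (assuming the birth_date column above):

    SELECT * FROM employee ORDER BY birth_date LIMIT 10;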

Find all the employees who joined the corporation in January 1990:
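Again assuming the hire_date column:

    SELECT * FROM employee WHERE year(hire_date) = 1990 AND month(hire_date) = 1;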

Find the top… Read more »