Apache Spark

A problem with using DataFrames in Apache Spark

Here is the code for loading a CSV file (the employee table) into an Apache Spark DataFrame:

    import org.apache.spark.sql.types._

    // ss is an existing SparkSession
    val schema = StructType(
      Seq(
        StructField("id", LongType),
        StructField("birthday", DateType),
        StructField("firstname", StringType),
        StructField("lastname", StringType),
        StructField("gender", StringType),
        StructField("workingdate", DateType)
      )
    )
    val df = ss.read.format("csv")
      .option("header", "false")
      .schema(schema)
      .load("employees.csv")
    df.show()

But after I ran the jar in Spark, it reported:

+----+--------+---------+--------+------+-----------+
|  id|birthday|firstname|lastname|gender|workingdate|
+----+--------+---------+--------+------+-----------+
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
+----+--------+---------+--------+------+-----------+

It seems the data was not loaded correctly.
After carefully reviewing the documentation for the CSV format, I noticed that the quote character in my CSV file is ' instead of ". So I added an option to my code to let Spark recognise the single quote:

    val df = ss.read.format("csv")
      .option("header", "false")
      .option("quote", "'")
      .schema(schema)
      .load("employees.csv")

This time the CSV was read properly:

+-----+----------+---------+-----------+------+-----------+
|   id|  birthday|firstname|   lastname|gender|workingdate|
+-----+----------+---------+-----------+------+-----------+
|10001|1953-09-02|   Georgi|    Facello|     M| 1986-06-26|
|10002|1964-06-02|  Bezalel|     Simmel|     F| 1985-11-21|
|10003|1959-12-03|    Parto|    Bamford|     M| 1986-08-28|
|10004|1954-05-01|Chirstian|    Koblick|     M| 1986-12-01|
|10005|1955-01-21|  Kyoichi|   Maliniak|     M| 1989-09-12|
|10006|1953-04-20|   Anneke|    Preusig|     F| 1989-06-02|
|10007|1957-05-23|  Tzvetan|  Zielinski|     F| 1989-02-10|
|10008|1958-02-19|   Saniya|   Kalloufi|     M| 1994-09-15|
|10009|1952-04-19|   Sumant|       Peac|     F| 1985-02-18|
|10010|1963-06-01|Duangkaew|   Piveteau|     F| 1989-08-24|
|10011|1953-11-07|     Mary|      Sluis|     F| 1990-01-22|
|10012|1960-10-04| Patricio|  Bridgland|     M| 1992-12-18|
|10013|1963-06-07|Eberhardt|     Terkki|     M| 1985-10-20|
|10014|1956-02-12|    Berni|      Genin|     M| 1987-03-11|
|10015|1959-08-19| Guoxiang|  Nooteboom|     M| 1987-07-02|
|10016|1961-05-02| Kazuhito|Cappelletti|     M| 1995-01-27|
|10017|1958-07-06|Cristinel|  Bouloucos|     F| 1993-08-03|
|10018|1954-06-19| Kazuhide|       Peha|     F| 1987-04-03|
|10019|1953-01-23|  Lillian|    Haddadi|     M| 1999-04-30|
|10020|1952-12-24|   Mayuko|    Warwick|     M| 1991-01-26|
+-----+----------+---------+-----------+------+-----------+
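The effect of the quote character can also be seen outside Spark. The following is only a sketch using Python's standard csv module, not Spark's parser, and the two sample lines are an assumption about what employees.csv looks like (values wrapped in single quotes). With the default double-quote setting the single quotes stay attached to each value, which would explain why a value such as '1953-09-02' cannot be converted to a date.

```python
import csv
import io

# Assumed shape of employees.csv: values wrapped in single quotes.
sample = (
    "10001,'1953-09-02','Georgi','Facello','M','1986-06-26'\n"
    "10002,'1964-06-02','Bezalel','Simmel','F','1985-11-21'\n"
)


def parse(text, quotechar):
    """Parse CSV text with the given quote character and return the rows."""
    return list(csv.reader(io.StringIO(text), quotechar=quotechar))


default_rows = parse(sample, '"')  # single quotes remain part of each value
fixed_rows = parse(sample, "'")    # single quotes are consumed as quoting

print(default_rows[0][1])  # '1953-09-02' (quotes still attached)
print(fixed_rows[0][1])    # 1953-09-02
```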

An example of using Spark Structured Streaming

This snippet monitors two directories and joins the data from them whenever a new CSV file appears in either directory.

    from pyspark.context import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.types import IntegerType, LongType, StringType, StructField, StructType

    sc = SparkContext()
    sqlContext = SQLContext.getOrCreate(sc)

    # Schema of the CSV files under /tmp/pl/
    pl_schema = StructType([
        StructField('id', LongType(), True),
        StructField('gid', LongType(), True),
        StructField('pid', LongType(), True),
        StructField('firstlogin', IntegerType(), True),
    ])
    pl_df = sqlContext.readStream.schema(pl_schema).csv('/tmp/pl/')
    pl_df.createOrReplaceTempView('pl_mapping')

    # Schema of the CSV files under /tmp/user/
    user_schema = StructType([
        StructField('id', LongType(), True),
        StructField('fullname', StringType(), True),
    ])
    user_df = sqlContext.readStream.schema(user_schema).csv('/tmp/user/')
    user_df.createOrReplaceTempView('users')

    # Assumption: gid is the column in pl_mapping that refers to users.id
    result = sqlContext.sql(
        "SELECT u.id, u.fullname FROM users AS u "
        "JOIN pl_mapping AS pl ON u.id = pl.gid"
    )
    query = (result.writeStream
             .outputMode('append')
             .format('csv')
             .option('path', '/tmp/result/')
             .option('checkpointLocation', '/tmp/ckpt/')
             .start())
    print('Starting')
    query.awaitTermination(3600)  # wait for up to one hour

The join is expressed in Spark SQL, which is easy to use (especially for DBAs) and easy to maintain.
Some articles say that if the Spark process restarts after a failure, the checkpoint lets it resume from the last uncompleted position. I tried this on my local machine and noticed that it does produce some duplicated rows after a restart. This is a severe problem for a production environment, so I will look into it in my next tests.
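To measure how many rows are duplicated after a restart, a small script along these lines might help. It is only a sketch: the function name count_duplicate_rows is made up for illustration, and it assumes the CSV sink writes its output as files named part-*.csv under the result directory.

```python
import csv
import glob
import os
from collections import Counter


def count_duplicate_rows(result_dir):
    """Return a Counter of rows that occur more than once across all part files."""
    counts = Counter()
    # Assumption: output files in result_dir are named part-*.csv.
    for path in sorted(glob.glob(os.path.join(result_dir, "part-*.csv"))):
        with open(path, newline="") as f:
            for row in csv.reader(f):
                counts[tuple(row)] += 1
    # Keep only the rows that were seen more than once.
    return Counter({row: n for row, n in counts.items() if n > 1})
```

Calling count_duplicate_rows('/tmp/result/') before and after a restart gives a rough comparison. A row can also repeat legitimately when one user matches several rows in pl_mapping, so the count is only a starting point.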

A problem with using PySpark SQL

Here is the code:

    from pyspark.sql import SQLContext
    from pyspark.context import SparkContext
    from pyspark.sql.types import *

    sc = SparkContext()
    sqlContext = SQLContext.getOrCreate(sc)
    schema = StructType([
        StructField('id', LongType(), True),
        StructField('gid', LongType(), True),
        StructField('pid', LongType(), True),
        StructField('firstlogin', IntegerType(), True),
    ])
    row = ['2', '29', '29', '29']
    df = sqlContext.createDataFrame(row, schema)
    df.show()

Running it with 'cat xxx.py|bin/pyspark' reports an error:

TypeError: StructType can not accept object '2' in type <class 'str'>

At first I thought this was because '2' is a string, so I changed row to [2, 29, 29, 29]. But the error just changed to:

TypeError: StructType can not accept object 2 in type <class 'int'>

Then I searched on Google and found an article suggesting that I had forgotten to convert the Python list to a Spark RDD.
But in the end I found the real reason: createDataFrame expects a list of rows, so I just needed to wrap my list in another pair of brackets!
The correct code is:

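    # Integers, to match the LongType and IntegerType fields in the schema
    row = [2, 29, 29, 29]
    # createDataFrame expects a list of rows, so wrap the single row in a list
    df = sqlContext.createDataFrame([row], schema)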
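Because createDataFrame checks each value against the schema, string values such as '2' are rejected by LongType and IntegerType fields even once the row is wrapped in a list. When the values arrive as strings, they can be converted first. The helper below is a hypothetical sketch (to_long_rows is not part of PySpark) and runs without Spark:

```python
def to_long_rows(raw_rows):
    """Convert rows of numeric strings into tuples of ints, one tuple per row."""
    return [tuple(int(value) for value in row) for row in raw_rows]


row = ['2', '29', '29', '29']
rows = to_long_rows([row])  # note the outer list: one element per row
print(rows)                 # [(2, 29, 29, 29)]

# With the sqlContext and schema from the snippet above (not created here),
# the converted rows could then be passed on:
# df = sqlContext.createDataFrame(rows, schema)
```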

Robin on Linux
