PySpark

Some hints on Dataproc

When running a job in the cluster of Dataproc, it reported:

java.util.concurrent.ExecutionException: java.lang.ClassNotFoundException: Failed to find data source: BIGQUERY.

The reason is I haven’t added the Jar file for BigQuery. After adding the new Jar file into properties to the template of creating a cluster:

properties:
          spark:spark.jars: gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.11-0.18.1.jar

the job starts to read data from BigQuery tables.

Remember not to use gs://spark-lib/bigquery/spark-bigquery-latest.jar because it will hang your job when you are reading BigQuery tables. Seems even google makes a significant mistake in their cloud platform :p

2. If a PySpark job needs to use some additional packages in the Dataproc cluster, what should we do?

Still need to add more items in the template to let it install pip packages:

    clusterName: robin
    config:
      gceClusterConfig:
        metadata:
          enable-cloud-sql-proxy-on-workers: 'false'
          use-cloud-sql-private-ip: 'false'
          PIP_PACKAGES: 'google-cloud-storage google-api-python-client google-auth'
      initializationActions:
      - executableFile: gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh
        executionTimeout: 600s

3. To see how a Hive table be created

show create table <table>;

How to gracefully end a PySpark application

This article recommend using “return” to jump out of a PySpark application. But after I did by following what he said. It reports error:

  File "test.py", line 333
    return
    ^
SyntaxError: 'return' outside function

Seems it can’t work. After trying to run PySpark application on my own laptop, I finally got the correct answer:

import sys
if df.rdd.isEmpty():
  sys.exit(0)

A problem of using Pyspark SQL

Here is the code:

from pyspark.sql import SQLContext
from pyspark.context import SparkContext
from pyspark.sql.types import *
from typing import List
sc = SparkContext()
sqlContext = SQLContext.getOrCreate(sc)
schema = StructType([StructField('id', LongType(), True),
                      StructField('gid', LongType(), True),
                      StructField('pid', LongType(), True),
                      StructField('firstlogin', IntegerType(), True)
])
row = ['2', '29', '29', '29']
df = sqlContext.createDataFrame(row, schema)
df.show()

It will report error after running ‘cat xxx.py|bin/pyspark’:

TypeError: StructType can not accept object '2' in type

I used to think it was because ‘2’ is a string, so I changed ‘row’ to be ‘[2, 29, 29, 29]’. But the error also changed to:

TypeError: StructType can not accept object 2 in type

Then I searched on google, and find this article. Looks like I forgot to transfer ‘list’ of python to ‘RDD’ of Apache Spark.
But at last, I found the real reason: I just need to add ‘[]’ between my ‘list’!
The right code is here:

row = ['2', '29', '29', '29']
df = sqlContext.createDataFrame([row], schema)

Some tips about using AWS Glue

Configure about data format
To use AWS Glue, I write a ‘catalog table’ into my Terraform script:

resource "aws_glue_catalog_table" "my_table" {
...
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
    ser_de_info {
      name                  = "SerDeCsv"
      serialization_library = "org.apache.hadoop.hive.serde2.OpenCSVSerde"
      parameters = {
        "separatorChar" = ","
        "quoteChar"     = "'"
      }
    }
...
}

But after using PySpark script to access this table, it reports:

py4j.protocol.Py4JJavaError: An error occurred while calling o58.getCatalogSource.
: com.amazonaws.services.glue.util.NonFatalException: Formats not supported for SparkSQL data sources. Got csv
	at com.amazonaws.services.glue.SparkSQLDataSource.setFormat(DataSource.scala:641)
	at com.amazonaws.services.glue.GlueContext.getCatalogSource(GlueContext.scala:254)
	at com.amazonaws.services.glue.GlueContext.getCatalogSource(GlueContext.scala:139)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)

Seems we can’t use ‘OpenCSVSerde’. Actually, the correct answer is:

Input format: org.apache.hadoop.mapred.TextInputFormat
Output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Serde serialization lib: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Serde parameters: field.delim ,

The version of zeppelin
When using zeppelin to run PySpark script, it reports error:

org.apache.thrift.TApplicationException: Internal error processing createInterpreter
	at org.apache.thrift.TApplicationException.read(TApplicationException.java:111)
	at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
	at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_createInterpreter(RemoteInterpreterService.java:209)
	at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.createInterpreter(RemoteInterpreterService.java:192)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreter$2.call(RemoteInterpreter.java:169)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreter$2.call(RemoteInterpreter.java:165)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.callRemoteFunction(RemoteInterpreterProcess.java:135)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.internal_create(RemoteInterpreter.java:165)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.open(RemoteInterpreter.java:132)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:299)
	at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:407)
	at org.apache.zeppelin.scheduler.Job.run(Job.java:188)
	at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:315)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

According to the document:

The latest release of Apache Zeppelin, 0.8.x, is not supported. Download the older release named zeppelin-0.7.3-bin-all.tgz from the download page and follow the installation instructions.

Robin on Linux

PySpark