Here is the code:
from pyspark.sql import SQLContext
from pyspark.context import SparkContext
from pyspark.sql.types import *

sc = SparkContext()
sqlContext = SQLContext.getOrCreate(sc)

schema = StructType([
    StructField('id', LongType(), True),
    StructField('gid', LongType(), True),
    StructField('pid', LongType(), True),
    StructField('firstlogin', IntegerType(), True)
])

row = ['2', '29', '29', '29']
df = sqlContext.createDataFrame(row, schema)
df.show()
Running it with 'cat xxx.py | bin/pyspark' reports this error:
TypeError: StructType can not accept object '2' in type <type 'str'>
At first I thought it was because '2' is a string, so I changed row to [2, 29, 29, 29]. But the error only changed to:
TypeError: StructType can not accept object 2 in type <type 'int'>
Then I searched on Google and found this article, which suggested that I had forgotten to convert my Python list into a Spark RDD.
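That RDD route would look roughly like this (a minimal sketch, not the article's exact code; the tuple holds made-up integer values chosen to match the declared column types, and sc, sqlContext, and schema are the ones defined above):

rdd = sc.parallelize([(2, 29, 29, 29)])  # an RDD holding a single row
df = sqlContext.createDataFrame(rdd, schema)
df.show()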
But at last I found the real reason: I just needed to wrap my list in another pair of brackets! createDataFrame expects a collection of rows, so my bare list was being read as four single-value rows instead of one four-value row, and each bare value failed the StructType check.
Here is the corrected code:
row = ['2', '29', '29', '29']
df = sqlContext.createDataFrame([row], schema)
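To make the list-of-rows shape concrete, here is a small sketch with a second, made-up row. I used plain ints here so the values match the declared LongType/IntegerType fields; depending on the Spark version, per-field schema verification can still reject string values even after the nesting fix:

rows = [
    [2, 29, 29, 29],   # one inner list = one row
    [3, 30, 30, 30],   # hypothetical second row, for illustration only
]
df = sqlContext.createDataFrame(rows, schema)
df.show()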