bigdata

How to gracefully end a PySpark application

This article recommends using “return” to exit a PySpark application. But when I followed its advice, Python reported an error:

  File "test.py", line 333
    return
    ^
SyntaxError: 'return' outside function

It doesn’t work, because “return” is only valid inside a function. After experimenting with a PySpark application on my own laptop, I finally found an approach that works:

import sys
if df.rdd.isEmpty():
  sys.exit(0)
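Another option, sketched here under the assumption that the job can be wrapped in a function, is to move the code into a main() function so that “return” becomes legal. The rows list is a hypothetical stand-in for the DataFrame, and the emptiness check stands in for df.rdd.isEmpty().

```python
def main(rows):
    """Run the job; `rows` is a hypothetical stand-in for a Spark DataFrame."""
    if not rows:  # stands in for df.rdd.isEmpty()
        print("no input, exiting early")
        return 0  # legal here because we are inside a function
    print(f"processing {len(rows)} rows")
    return 0


exit_code = main([])
# In a real script, finish with: import sys; sys.exit(exit_code)
```

Returning an exit code from main() keeps the early-exit logic testable, and the single sys.exit() call stays at the very end of the script.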

An old bug about PyArrow

To save memory in a program using Pandas, I changed the type of some columns from string to category, as the reference suggests.

cols = ["os_type", "cpu_type", "chip_brand"]
df[cols] = df[cols].astype("category")

This saved at least half of the memory in my case. But when I used pyarrow to write the DataFrame to Parquet

df.to_parquet("my.parquet")

it reported an error:

Invalid: BinaryArray cannot contain more than 2147483646 bytes, have 2147483647

It’s a bug in old versions of pyarrow that was fixed in Sep 2019. Upgrading from pyarrow-0.12.1 to pyarrow-0.17.1 fixed the error.

But the story doesn’t end here.

With pyarrow-0.12.1, the snippet below returns an object of type <pyarrow.lib.Column>

import pyarrow.parquet as pq
table = pq.read_table(path)
table.column(0)

and this object also has a “Column name” attribute.

But with pyarrow-0.17.1, the same code returns an object of type <pyarrow.lib.ChunkedArray>, which doesn’t have a “Column name”.

This difference can make existing code fail (it broke our program). So beware: after you upgrade pyarrow (or any other Python library), run the tests to make sure all the legacy code still works properly.

A few notes for Pandas and BigQuery

1. Get the memory size of a Pandas DataFrame

df.memory_usage(deep=True).sum()
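The same call is handy for checking how much a dtype change saves. The sketch below uses made-up data with few distinct values, which is the case where the category dtype helps most.

```python
import pandas as pd

# Made-up data: a few distinct values repeated many times.
df = pd.DataFrame({"os_type": ["android", "ios", "windows"] * 10_000})

before = df.memory_usage(deep=True).sum()
df["os_type"] = df["os_type"].astype("category")
after = df.memory_usage(deep=True).sum()

print(f"string: {before} bytes, category: {after} bytes")
```

deep=True matters for string columns: without it, Pandas may only count the pointers to the Python objects and not the strings themselves.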

2. Upload a large DataFrame of Pandas to BigQuery table

If your DataFrame is too big, the upload fails with “UDF out of memory”:

google.api_core.exceptions.BadRequest: 400 Resources exceeded during query execution: UDF out of memory.; Failed to read Parquet file [...]. This might happen if the file contains a row that is too large, or if the total size of the pages loaded for the queried columns is too large.

The solution is as simple as splitting the DataFrame and uploading the pieces one by one:

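import numpy as np
from google.cloud import bigquery

client = bigquery.Client()
# Append every chunk to the same table (table_id is the destination table)
job_config = bigquery.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
for df_chunk in np.array_split(df, 10):
    job = client.load_table_from_dataframe(df_chunk, table_id, job_config=job_config)
    job.result()  # wait for this load job to finish before sending the next chunk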
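np.array_split(df, 10) fixes the number of chunks. If a cap on the rows per chunk is more convenient, a plain slicing helper works too. This is only a sketch: iter_chunks and the 10-row limit are hypothetical, and each chunk would be passed to load_table_from_dataframe exactly as in the loop above.

```python
import pandas as pd


def iter_chunks(df, chunk_rows):
    """Yield consecutive row slices of at most `chunk_rows` rows."""
    for start in range(0, len(df), chunk_rows):
        yield df.iloc[start:start + chunk_rows]


df = pd.DataFrame({"card_id": range(25)})
chunks = list(iter_chunks(df, 10))
print([len(chunk) for chunk in chunks])
```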

3. Restore table in BigQuery

How to recover a deleted table in BigQuery? Just use the bq command:

bq cp dataset.table@1577833205000 dataset.new_table

If your <timestamp> is not valid, the bq command tells you which <timestamp> is valid for this table. Then you can run the command again with that value.
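The number after the @ is a point in time in milliseconds since the Unix epoch. A small helper can build it from a readable date; snapshot_millis is a hypothetical name, and the date below is simply the one that matches the value in the command above.

```python
from datetime import datetime, timezone


def snapshot_millis(moment):
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    return int(moment.timestamp() * 1000)


# 2019-12-31 23:00:05 UTC corresponds to the value used in the command above.
millis = snapshot_millis(datetime(2019, 12, 31, 23, 0, 5, tzinfo=timezone.utc))
print(f"bq cp dataset.table@{millis} dataset.new_table")
```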

Import date column in Pandas to BigQuery

Imagine we have a small CSV file:

name,enroll_time
robin,2021-01-15 09:50:33
tony,2021-01-14 01:50:33
jaime,2021-01-13 00:50:33
tyrion,2021-2-15 13:22:17
bran,2022-3-16 14:00:01

Let’s load it into a Pandas DataFrame and upload it to a BigQuery table:

import pandas as pd
from google.cloud import bigquery
df = pd.read_csv("test.csv", parse_dates=["enroll_time"], index_col=0)
schema = []
schema.append(bigquery.SchemaField("name", "STRING"))
schema.append(bigquery.SchemaField("enroll_time", "DATE"))
job_config = bigquery.LoadJobConfig(schema=schema)
bq_client = bigquery.Client()
table = "project.dataset.test_table"
job = bq_client.load_table_from_dataframe(
    df, table, job_config=job_config
)
job.result()

But it reports an error:

  File "pyarrow/array.pxi", line 176, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 85, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to date32[day] would lose data: 1610704233000000000

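At first it seemed that the BigQuery library couldn’t recognize 1610704233000000000 as nanoseconds, so I tried dividing the value by 1e9, but that failed too. The real cause is that the values carry a time of day, which a DATE column cannot hold, so pyarrow refuses the lossy cast.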

What we actually need to do is use TIMESTAMP instead of DATE as the type of the enroll_time column:

schema.append(bigquery.SchemaField("name", "STRING"))
schema.append(bigquery.SchemaField("enroll_time", "TIMESTAMP"))

and the BigQuery library then accepts the column, even with nanosecond resolution.
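If a DATE column is really what is wanted, another option is to drop the time of day in Pandas first, since the cast only fails when a value carries a time of day. The sketch below shows just the Pandas side; the enroll_date column name is made up, and whether load_table_from_dataframe accepts the resulting column for a DATE field depends on the library versions, so treat that part as an assumption to verify.

```python
import io

import pandas as pd

csv_text = """name,enroll_time
robin,2021-01-15 09:50:33
tony,2021-01-14 01:50:33
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["enroll_time"])

# Drop the time of day so that nothing is lost when the column is treated as a date.
df["enroll_date"] = df["enroll_time"].dt.date
print(df[["name", "enroll_date"]])
```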

Solving a problem with pivot() in Pandas

Below is an example from the official pandas documentation for pivot():

import pandas as pd
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
result = df.pivot(index='foo', columns='bar', values='baz')
print(result)

The result is:

bar  A  B  C
foo
one  1  2  3
two  4  5  6

But if there are duplicate rows in the DataFrame, it reports an error:

ValueError: Index contains duplicate entries, cannot reshape

To fix this, we can just drop_duplicates() first and then pivot():

result = df.drop_duplicates().pivot(index='foo', columns='bar', values='baz')

As a matter of fact, there are situations that drop_duplicates() can’t fix, for example when two rows share the same (foo, bar) pair but differ in other columns:

df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'A', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
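To see which rows clash, duplicated() can be restricted to the index and column keys. This is a small sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'A', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': ['x', 'y', 'z', 'q', 'w', 't']})

# keep=False marks every member of a duplicated (foo, bar) pair, not just the later ones.
clashes = df[df.duplicated(subset=['foo', 'bar'], keep=False)]
print(clashes)
```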

Now we need to replace pivot() with groupby() and unstack(), keeping the first value of each (foo, bar) group:

result = (df.groupby(["foo", "bar"])
    .baz
    .first()
    .unstack()
)

And the result is

bar    A    B    C
foo
one  1.0  NaN  3.0
two  4.0  5.0  6.0
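pivot_table() with an aggregation function is another way to get the same shape. A sketch on the same data, using aggfunc='first' to mirror the first() call above:

```python
import pandas as pd

df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'A', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': ['x', 'y', 'z', 'q', 'w', 't']})

# Unlike pivot(), pivot_table() aggregates rows that share the same (foo, bar) pair.
result = df.pivot_table(index='foo', columns='bar', values='baz', aggfunc='first')
print(result)
```

Whichever form is used, the choice of aggregation (first, last, sum, ...) decides which of the clashing values survives, so it is worth picking deliberately.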

Get the schema of a parquet file

Previously I used this snippet to get all the column names of a Parquet file:

import pandas as pd
df = pd.read_parquet("hello.parquet")
print(list(df.columns))

But if the Parquet file is large (even a moderate size such as 1GB), this causes an OOM on my small VM (about 4GB RAM).

Actually, what I want is just the column names, not the whole data. Since Parquet is a well-structured format that stores its schema in the file footer, there must be some way to read only the schema instead of all the data.

And, here it is:

import pyarrow.parquet as pq
schema = pq.read_schema("hello.parquet", memory_map=True)
print(list(schema.names))

Efficient reading in pandas

My previous code read all the data and then kept only the one column I needed:

import pandas as pd
df = pd.read_csv("data.csv")["card_id"]

In the test environment, this program used more than 10GB of memory because the data file is so large.

To reduce the memory usage, I switched to usecols:

import pandas as pd
df = pd.read_csv("data.csv", usecols=["card_id"])

After that, the program used less than 1GB of memory.
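The two versions are not quite interchangeable: indexing after the read gives a Series, while usecols gives a one-column DataFrame. A small sketch with made-up in-memory CSV data:

```python
import io

import pandas as pd

csv_text = "card_id,amount,note\n1,9.5,a\n2,3.0,b\n"

full = pd.read_csv(io.StringIO(csv_text))
only_ids = pd.read_csv(io.StringIO(csv_text), usecols=["card_id"])

print(type(full["card_id"]).__name__, list(full.columns))
print(type(only_ids).__name__, list(only_ids.columns))
```

If the downstream code expects a Series, only_ids["card_id"] gives one without bringing the other columns back.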

One caveat: the argument name differs between readers. read_csv() uses usecols, while read_sql() (when reading a table) and read_parquet() take a columns argument instead.

Transfer Redshift SQL to BigQuery SQL

In my recent work, I needed to run some SQL snippets from Redshift on Google’s BigQuery platform. Since every data warehouse has its own SQL dialect, some translation work couldn’t be avoided.

Here are some tricks:

Redshift	BigQuery
field::VARCHAR	CAST(field AS STRING)
isnull(), nvl()	ifnull()
dateadd()	date_add()
datediff()	date_diff()
union	union distinct (or union all to keep duplicates)
field ILIKE pattern	UPPER(field) LIKE UPPER(pattern)
split_part(string, delimiter, part)	split(string, delimiter)[safe_ordinal(part)]

In Redshift we can select columns like this:

SELECT
  SQRT(score) AS new_score,
  new_score * 10
FROM ...

But in BigQuery we can’t reference a column alias defined with “AS” in the same SELECT list. The SQL in BigQuery should be:

SELECT
  SQRT(score) AS new_score,
  SQRT(score) * 10
FROM ...

BigQuery also has the “WITH” clause, which can replace a “temporary table”, can be nested, and is very powerful:

WITH result AS (
   WITH example AS ( SELECT * FROM `dataset.table` )
   SELECT * FROM example
)
SELECT * FROM result

Some tips about BigQuery on GCP

Migrate SQL script from AWS Redshift to BigQuery

CONVERT_TIMEZONE('AEDT',getdate())::DATE

in Redshift should be changed to

current_date("Australia/Sydney")

in BigQuery.
Since BigQuery doesn’t force type conversion, a NULL value in Redshift may end up as either a real NULL or the string ‘NULL’ in BigQuery. Make sure you use both

column is NULL

and

column = 'NULL'

for checking.
In BigQuery, we can also define UDFs like this:

create temp function change_date(the_date DATE, offset_day INT64) AS (
    DATE_ADD(the_date, INTERVAL offset_day DAY)
);
create temp function next_thursday(the_date DATE) AS (
    change_date(DATE_TRUNC(the_date, WEEK(THURSDAY)), 7)
);
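To sanity-check a UDF like next_thursday before running it in BigQuery, the same arithmetic can be mirrored locally. This is a sketch in Python, not a BigQuery API; it assumes that DATE_TRUNC with WEEK(THURSDAY) steps back to the most recent Thursday and leaves a Thursday unchanged.

```python
from datetime import date, timedelta

THURSDAY = 3  # date.weekday(): Monday is 0


def next_thursday(the_date):
    """Mirror of the SQL UDF: step back to the latest Thursday, then add 7 days."""
    days_since_thursday = (the_date.weekday() - THURSDAY) % 7
    return the_date - timedelta(days=days_since_thursday) + timedelta(days=7)


print(next_thursday(date(2021, 1, 15)))  # input is a Friday
```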

Performance improvement of BigQuery SQL
Removing ‘DISTINCT’ from the SQL and de-duplicating the data later in Pandas can speed up the whole data-processing pipeline. Even ‘CAST’ in BigQuery can hurt performance. The best way to find the bottlenecks in your SQL is to look at the ‘Execution details’ in the GUI.
Loading speed
With pandas-gbq, we can speed up reading a BigQuery table by adding the argument ‘use_bqstorage_api=True’ to the ‘read_gbq()’ function:

df = pandas_gbq.read_gbq(bqsqlfile, project_id='myproject', use_bqstorage_api=True)

Recently learned tips about NumPy and Pandas

Precision
After running this snippet:

import numpy as np
a = np.array([0.112233445566778899], dtype=np.float32)
b = np.array([0.112233445566778899], dtype=np.float64)
print(a, b)

It prints:

[0.11223345] [0.11223345]

Why do np.float32 and np.float64 produce the same output? Because NumPy prints arrays with a default precision of 8 digits, so the print options need to be changed to see the difference.
Let’s set the option before printing:

import numpy as np
a = np.array([0.112233445566778899], dtype=np.float32)
b = np.array([0.112233445566778899], dtype=np.float64)
np.set_printoptions(precision=18)
print(a, b)

The result becomes:

[0.112233445] [0.1122334455667789]

which looks much more reasonable.
Furthermore, why does it print ‘0.1122334455667789’, which has only 16 digits instead of 18? Because float64 only supports about 15 to 16 significant decimal digits, as this reference says.
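NumPy can report these limits itself through np.finfo. A short sketch:

```python
import numpy as np

for dtype in (np.float32, np.float64):
    info = np.finfo(dtype)
    # precision is the approximate number of decimal digits the type can represent reliably
    print(dtype.__name__, "decimal digits:", info.precision, "bits:", info.bits)

float32_digits = np.finfo(np.float32).precision
float64_digits = np.finfo(np.float64).precision
```

The reported precision is a conservative count of guaranteed digits, which is why a printed float64 can still show a 16th digit.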

Hidden metadata
I had two Parquet files that differed when compared with ‘cksum’. But after exporting both of them as CSV files:

import pandas as pd
df = pd.read_parquet("my.parquet")
df.to_csv("my.csv")
...

The two output CSV files are exactly the same.
Then what happened in those two Parquet files? Does a Parquet file have some hidden metadata in it?
As a matter of fact, a Parquet file written by Pandas keeps the DataFrame’s index as part of its metadata, while a CSV file carries no such metadata. If we drop the index before writing out the Parquet file:

df = df.reset_index(drop=True)  # reset_index() returns a new DataFrame, so assign it back
df.to_parquet("my.parquet")
...

the two Parquet files become identical.