We were using client.query() (from the BigQuery Python API) to insert selected data into a specific partition of a table. But the script reported errors like:
google.api_core.exceptions.BadRequest: 400 Some rows belong to different partitions rather than destination partition
A note I found said the cause might be an incorrect date format for the partition. I checked the code, but the partition format was correct.
The real reason is the input: the “selected data”. The data to be inserted comes from this SQL:
SELECT col1, col2, "2023-01-06" as partition_date FROM my_table;
The partition date set by the Python script for the destination table, bigquery.QueryJobConfig(destination="new_table$20230103"), is “2023-01-03”, but the source data’s partition date is “2023-01-06”. That mismatch is what triggers the error above.
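A minimal sketch of the corrected job, assuming new_table is partitioned on the partition_date DATE column (the project and dataset names are placeholders): the $YYYYMMDD decorator on the destination has to match the date the SELECT writes into the partition column.
from google.cloud import bigquery

client = bigquery.Client()

# The destination partition decorator ($20230106) must match the value the
# SELECT produces for the partitioning column ("2023-01-06").
job_config = bigquery.QueryJobConfig(
    destination="my_project.my_dataset.new_table$20230106",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
sql = 'SELECT col1, col2, DATE "2023-01-06" AS partition_date FROM my_table'
client.query(sql, job_config=job_config).result()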
Here is my code to query a BigQuery table:
from google.cloud import bigquery
from google.cloud.bigquery_storage import BigQueryReadClient
client = bigquery.Client()
storage_client = BigQueryReadClient()
df = client.query("select * from my_table1").to_dataframe(bqstorage_client=storage_client)
Then it reported the error:
“Access Denied: Project PRJ_B: User does not have bigquery.jobs.create permission in project PRJ_B.”
But I actually want to launch the job in project PRJ_A, so I added a shell command “gcloud config set project PRJ_A” before running the Python script. The errors continued.
After searching the BigQuery Python API docs, I found that bigquery.Client() accepts a project argument:
client = bigquery.Client(project="PRJ_A")
Now the script works well.
Two days ago we hit a weird error when running a SELECT through the BigQuery Python API:
Error: google.api_core.exceptions.BadRequest: 400 Bad int64 value: BA1D
I checked the SELECT SQL, but it doesn’t contain any type like “int64”.
After a “binary search” through the SQL code, I finally found that the SQL actually queries a view, and the view’s code looks like:
CAST(col1 AS INT64) AS COL1,
CAST(col2 AS INT64) AS COL2,
The correct solution is to change “cast” to “safe_cast”.
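For example, a minimal sketch of the difference, using the literal value from the error message above: CAST fails on a non-numeric string, while SAFE_CAST returns NULL and lets the query succeed.
from google.cloud import bigquery

client = bigquery.Client()

# SAFE_CAST returns NULL instead of raising "Bad int64 value" on 'BA1D'.
sql = "SELECT SAFE_CAST('BA1D' AS INT64) AS col1"
rows = client.query(sql).result()
print(list(rows)[0].col1)  # None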
Here is the lesson for me: some errors may come not from the SQL you write directly but from the views it queries indirectly…
I had barely paid attention to the pandas datetime64 type. But yesterday a problem struck me.
It was a parquet file with a column “start_date”:
0 2022-03-22 00:00:00+11:00
1 2022-03-22 00:00:00+11:00
2 2022-03-22 00:00:00+11:00
3 2022-03-22 00:00:00+11:00
4 2022-03-22 00:00:00+11:00
They look like “2022-03-22”, a Tuesday. But after I exported this into BigQuery and selected them, they became “2022-03-21 UTC”, which is a Monday.
The problem is the timezone attached to this column:
start_date datetime64[ns, Australia/Sydney]
To be aligned with BigQuery, we just need to remove the timezone so the value stays “2022-03-22”.
The solution is brute-force but simple:
df["start_date"] = df["start_date"].dt.tz_localize(None)
How could I conveniently get the CREATE statement (DDL) of an existing table in BigQuery? The result looks like:
CREATE TABLE `data-to-insights.taxi.tlc_yellow_trips_2018_sample` ...
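A minimal sketch of one common way to retrieve it, assuming the dataset’s INFORMATION_SCHEMA views are available (the table here is just the example above): the ddl column of INFORMATION_SCHEMA.TABLES holds the CREATE statement.
from google.cloud import bigquery

client = bigquery.Client()

# INFORMATION_SCHEMA.TABLES exposes a `ddl` column with the CREATE statement.
sql = """
    SELECT ddl
    FROM `data-to-insights.taxi.INFORMATION_SCHEMA.TABLES`
    WHERE table_name = 'tlc_yellow_trips_2018_sample'
"""
for row in client.query(sql).result():
    print(row.ddl)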
If you accidentally truncate a table in BigQuery, you can try this article to recover the data. Furthermore, I found that the “bq cp project:dataset.table@-36000 project:dataset.table” method did not work in my situation. The only working solution was “FOR SYSTEM_TIME AS OF”:
CREATE TABLE `mydataset.newtable` AS
SELECT * FROM `mydataset.mytable`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR);
and then “bq cp project:mydataset.newtable project:mydataset.mytable”.
I have just finished a project migrating a Spark job to BigQuery, or more precisely: migrating Python code to SQL. It was tedious work but improved the performance significantly: from a 4-hour PySpark runtime to half an hour on BigQuery (honors belong to BigQuery!).
Here are a few notes from the migration, or really just SQL tips:
1. To create or overwrite a temporary table:
CREATE OR REPLACE TEMP TABLE `my_temp_tbl` AS ...
2. Select all columns from a table except some special ones:
SELECT * EXCEPT(year, month, day) FROM ...
3. To do a pivot() on BigQuery: https://hoffa.medium.com/easy-pivot-in-bigquery-one-step-5a1f13c6c710. The key is the EXECUTE IMMEDIATE clause, which works like eval() in Python: it takes a string as input and runs it as a SQL snippet (see the sketch after this list).
4. The LIMIT clause is terribly slow when the table is very big. The best solution for me is to use “bq extract” to export the data to GCS as parquet files, and then read each part of those files with a program.
5. Parquet files can have column names that contain a hyphen, like “last-year” or “real-name”, but BigQuery only supports column names with underscores, like “last_year” and “real_name”. So “bq load” will automatically convert the column name “last-year” in the parquet file to “last_year” in the BigQuery table.
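Regarding note 3, here is a minimal sketch of the EXECUTE IMMEDIATE idea against a made-up `mydataset.sales` table (the table and its quarter/amount columns are assumptions for illustration): the inner SELECT builds the PIVOT statement as a string from the distinct quarter values, and EXECUTE IMMEDIATE runs that string as SQL.
from google.cloud import bigquery

client = bigquery.Client()

# Build the pivot statement dynamically, then run it with EXECUTE IMMEDIATE,
# which evaluates its string argument as a SQL statement (like eval() in Python).
sql = """
EXECUTE IMMEDIATE (
  SELECT CONCAT(
    "SELECT * FROM `mydataset.sales` ",
    "PIVOT (SUM(amount) FOR quarter IN (",
    STRING_AGG(DISTINCT CONCAT("'", quarter, "'"), ","),
    "))"
  )
  FROM `mydataset.sales`
)
"""
rows = client.query(sql).result()  # results of the script's last statement
for row in rows:
    print(dict(row))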
We can easily add a new column to a table in BigQuery:
ALTER TABLE mydataset.mytable
ADD COLUMN new_col STRING
But when you want to delete or rename an existing column, there is no SQL to do it. The only way to delete or rename an existing column is to rebuild the table with the bq command-line tool:
bq show --format=prettyjson mydataset.mytable > schema.json
# Edit schema.json so that only the columns you want to keep remain
bq mk --table mydataset.new_mytable schema.json
# Export the data from `mytable` into `new_mytable` (see the sketch below)
bq rm --table mydataset.mytable
bq cp mydataset.new_mytable mydataset.mytable
And remember to back up your data before doing this!
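The “export the data” step above is not spelled out; a minimal sketch of one way to do it with a query job (the project and column names are placeholders and should match the edited schema.json):
from google.cloud import bigquery

client = bigquery.Client()

# Copy only the columns we want to keep into the new table; the SELECT list
# must match the schema created from the edited schema.json.
job_config = bigquery.QueryJobConfig(
    destination="my_project.mydataset.new_mytable",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
sql = "SELECT kept_col1, kept_col2 FROM `my_project.mydataset.mytable`"
client.query(sql, job_config=job_config).result()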
1. Get the memory size of a Pandas DataFrame
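A minimal sketch of one common way to check this (memory_usage with deep=True also measures the contents of object/string columns):
import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob"], "age": [30, 40]})

# Per-column memory usage in bytes; deep=True also counts the Python
# objects referenced by object/string columns.
print(df.memory_usage(deep=True))
print("total bytes:", df.memory_usage(deep=True).sum())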
2. Upload a large Pandas DataFrame to a BigQuery table
If the DataFrame is too big, the upload will report “UDF out of memory”:
google.api_core.exceptions.BadRequest: 400 Resources exceeded during query execution: UDF out of memory.; Failed to read Parquet file [...]. This might happen if the file contains a row that is too large, or if the total size of the pages loaded for the queried columns is too large.
The solution is as simple as splitting the DataFrame and uploading the chunks one by one:
import numpy as np
from google.cloud import bigquery

client = bigquery.Client()
for df_chunk in np.array_split(df, 10):
    job_config = bigquery.LoadJobConfig()
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
    job = client.load_table_from_dataframe(df_chunk, table_id, job_config=job_config)
    job.result()  # wait for each chunk to finish before uploading the next
3. Restore table in BigQuery
How to recover a deleted table in BigQuery? Just use:
bq cp dataset.table@1577833205000 dataset.new_table
If the <timestamp> is not correct, the bq command will give you a notification about which <timestamp> is right for this table. Then you can use that correct <timestamp> to copy the deleted table into a new one.
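The number after “@” is a snapshot time in milliseconds since the Unix epoch; a minimal sketch of computing it for, say, two hours ago:
import time

# Snapshot decorators use milliseconds since the Unix epoch.
two_hours_ago_ms = int((time.time() - 2 * 3600) * 1000)
print(f"bq cp dataset.table@{two_hours_ago_ms} dataset.new_table")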