I have just finished a project migrating a Spark job to BigQuery, or more precisely, migrating Python code to SQL. It was tedious work, but it improved performance significantly: from a 4-hour PySpark runtime to half an hour on BigQuery (the credit belongs to BigQuery!).
Here are a few notes from the migration, or just some SQL tips:
1. To create or overwrite a temporary table:
CREATE OR REPLACE TEMP TABLE `my_temp_tbl` AS ...
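For example, a minimal sketch (the table and column names are made up for illustration):

-- Build a temp table from a hypothetical orders table; the temp table
-- lives only for the duration of the script or session.
CREATE OR REPLACE TEMP TABLE `my_temp_tbl` AS
SELECT user_id, SUM(amount) AS total_amount
FROM `my_project.my_dataset.orders`
GROUP BY user_id;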
2. To select all columns from a table except a few specific ones:
SELECT * EXCEPT(year, month, day) FROM ...
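For instance, a small sketch on a hypothetical date-partitioned table, where year, month, and day are only needed for filtering and not wanted in the output:

-- Hypothetical table; the partition-style columns are filtered on
-- but excluded from the result.
SELECT * EXCEPT(year, month, day)
FROM `my_project.my_dataset.events`
WHERE year = 2023 AND month = 1 AND day = 15;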
3. To do a pivot() on BigQuery, see https://hoffa.medium.com/easy-pivot-in-bigquery-one-step-5a1f13c6c710. The key is the EXECUTE IMMEDIATE clause, which works like eval() in Python: it takes a string as input and runs it as a SQL snippet.
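Here is a minimal sketch of the idea (not the query from that article; the table and column names are hypothetical): build the pivot query as a string, then run it with EXECUTE IMMEDIATE.

DECLARE pivot_cols STRING;

-- Build the list of pivot expressions dynamically from the data, e.g.
-- "SUM(IF(month = 1, amount, NULL)) AS month_1, SUM(IF(month = 2, ...".
SET pivot_cols = (
  SELECT STRING_AGG(
    FORMAT("SUM(IF(month = %d, amount, NULL)) AS month_%d", month, month), ", ")
  FROM (SELECT DISTINCT month FROM `my_project.my_dataset.sales`)
);

-- EXECUTE IMMEDIATE runs the assembled string as SQL, just like eval() in Python.
EXECUTE IMMEDIATE FORMAT("""
  SELECT product, %s
  FROM `my_project.my_dataset.sales`
  GROUP BY product
""", pivot_cols);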
4. Using the OFFSET clause with LIMIT is terribly slow when the table is very big. The best solution for me was to use “bq extract” to export the data to GCS as Parquet files, and then read each part of those files with a program.
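As a sketch of the export step: besides the bq extract command, the same export can also be written in SQL with an EXPORT DATA statement (the bucket and table names here are made up):

-- Export the table to GCS as sharded Parquet files; the wildcard in the URI
-- lets BigQuery write multiple parts, which can then be read one by one.
EXPORT DATA OPTIONS (
  uri = 'gs://my-bucket/exports/my_table/part-*.parquet',
  format = 'PARQUET',
  overwrite = true
) AS
SELECT * FROM `my_project.my_dataset.my_table`;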
5. Parquet files may use column names that contain a hyphen, like “last-year” or “real-name”, but BigQuery only supports column names with underscores, like “last_year” and “real_name”. So “bq load” will automatically convert the column name “last-year” in the Parquet file to “last_year” in the BigQuery table.
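After loading, the table is queried with the underscore names; a tiny sketch on a hypothetical loaded table:

-- The columns were “last-year” and “real-name” in the Parquet file;
-- after bq load they become last_year and real_name in BigQuery.
SELECT last_year, real_name
FROM `my_project.my_dataset.loaded_from_parquet`;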