We were using client.query() (from the BigQuery Python API) to insert selected data into a specific partition of a table. But the script reported errors like:
google.api_core.exceptions.BadRequest: 400 Some rows belong to different partitions rather than destination partition
A note I found said the cause might be an incorrect date format for the partition. I checked the code, but the partition format was correct.
The real reason is the input: the “selected data”. The data to be inserted comes from this SQL:
SELECT col1, col2, "2023-01-06" as partition_date FROM my_table;
The partition date set by the Python script for the destination table, bigquery.QueryJobConfig(destination="new_table$20230103"), is “2023-01-03”, but the source data’s partition date is “2023-01-06”. That mismatch is what triggers the error above.
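A minimal sketch of the corrected job, assuming new_table is partitioned on the partition_date DATE column (the project and dataset names are placeholders): the $YYYYMMDD decorator on the destination has to match the date the SELECT writes into the partition column.
from google.cloud import bigquery

client = bigquery.Client()

# The destination partition decorator ($20230106) must match the value the
# SELECT produces for the partitioning column ("2023-01-06").
job_config = bigquery.QueryJobConfig(
    destination="my_project.my_dataset.new_table$20230106",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
sql = 'SELECT col1, col2, DATE "2023-01-06" AS partition_date FROM my_table'
client.query(sql, job_config=job_config).result()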
Here is my code to query a BigQuery table:
from google.cloud import bigquery
from google.cloud.bigquery_storage import BigQueryReadClient
client = bigquery.Client()
storage_client = BigQueryReadClient()
df = client.query("select * from my_table1").to_dataframe(bqstorage_client=storage_client)
Then it reported the error:
“Access Denied: Project PRJ_B: User does not have bigquery.jobs.create permission in project PRJ_B.”
But I actually want to launch the job in project PRJ_A, so I added a shell command “gcloud config set project PRJ_A” before running the Python script. The errors continued.
After searching the BigQuery Python API docs, I found that bigquery.Client() accepts a project argument:
client = bigquery.Client(project="PRJ_A")
Now the script works well.
Two days ago we hit a weird error when running a SELECT through the BigQuery Python API:
Error: google.api_core.exceptions.BadRequest: 400 Bad int64 value: BA1D
I checked the SELECT SQL, but it doesn’t contain any type like “int64”.
After a “binary search” through the SQL code, I finally found that the SQL actually queries a view, and the view’s code looks like:
CAST(col1 AS INT64) AS COL1,
CAST(col2 AS INT64) AS COL2,
The correct solution is to change “cast” to “safe_cast”.
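For example, a minimal sketch of the difference, using the literal value from the error message above: CAST fails on a non-numeric string, while SAFE_CAST returns NULL and lets the query succeed.
from google.cloud import bigquery

client = bigquery.Client()

# SAFE_CAST returns NULL instead of raising "Bad int64 value" on 'BA1D'.
sql = "SELECT SAFE_CAST('BA1D' AS INT64) AS col1"
rows = client.query(sql).result()
print(list(rows)[0].col1)  # None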
Here is the lesson for me: some errors may come not from the SQL you write directly but from the views it queries indirectly…
I had barely paid attention to the pandas datetime64 type. But yesterday a problem struck me.
It was a parquet file with a column “start_date”:
0 2022-03-22 00:00:00+11:00
1 2022-03-22 00:00:00+11:00
2 2022-03-22 00:00:00+11:00
3 2022-03-22 00:00:00+11:00
4 2022-03-22 00:00:00+11:00
They look like “2022-03-22”, a Tuesday. But after I exported this into BigQuery and selected them, they became “2022-03-21 UTC”, which is a Monday.
The problem is the timezone attached to this column:
start_date datetime64[ns, Australia/Sydney]
To be aligned with BigQuery, we just need to remove the timezone so the value stays “2022-03-22”.
The solution is brute-force but simple:
df["start_date"] = df["start_date"].dt.tz_localize(None)
How could I conveniently get the CREATE statement (DDL) of an existing table in BigQuery? The result looks like:
CREATE TABLE `data-to-insights.taxi.tlc_yellow_trips_2018_sample` ...
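A minimal sketch of one common way to retrieve it, assuming the dataset’s INFORMATION_SCHEMA views are available (the table here is just the example above): the ddl column of INFORMATION_SCHEMA.TABLES holds the CREATE statement.
from google.cloud import bigquery

client = bigquery.Client()

# INFORMATION_SCHEMA.TABLES exposes a `ddl` column with the CREATE statement.
sql = """
    SELECT ddl
    FROM `data-to-insights.taxi.INFORMATION_SCHEMA.TABLES`
    WHERE table_name = 'tlc_yellow_trips_2018_sample'
"""
for row in client.query(sql).result():
    print(row.ddl)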
If you accidentally truncate a table in BigQuery, you can try this article to recover the data. Furthermore, I found that the “bq cp project:dataset.table@-36000 project:dataset.table” method did not work in my situation. The only working solution was “FOR SYSTEM_TIME AS OF”:
CREATE TABLE `mydataset.newtable` AS
SELECT * FROM `mydataset.mytable`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR);
and then “bq cp project:mydataset.newtable project:mydataset.mytable”.
I have just finished a project migrating a Spark job to BigQuery, or more precisely: migrating Python code to SQL. It was tedious work but improved the performance significantly: from a 4-hour PySpark runtime to half an hour on BigQuery (honors belong to BigQuery!).
Here are a few notes from the migration, or really just SQL tips:
1. To create or overwrite a temporary table:
CREATE OR REPLACE TEMP TABLE `my_temp_tbl` AS ...
2. Select all columns from a table except some special ones:
SELECT * EXCEPT(year, month, day) FROM ...
3. To do a pivot() on BigQuery: https://hoffa.medium.com/easy-pivot-in-bigquery-one-step-5a1f13c6c710. The key is the EXECUTE IMMEDIATE clause, which works like eval() in Python: it takes a string as input and runs it as a SQL snippet (see the sketch after this list).
4. The LIMIT clause is terribly slow when the table is very big. The best solution for me is to use “bq extract” to export the data to GCS as parquet files, and then read each part of those files with a program.
5. Parquet files can have column names that contain a hyphen, like “last-year” or “real-name”, but BigQuery only supports column names with underscores, like “last_year” and “real_name”. So “bq load” will automatically convert the column name “last-year” in the parquet file to “last_year” in the BigQuery table.
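Regarding note 3, here is a minimal sketch of the EXECUTE IMMEDIATE idea against a made-up `mydataset.sales` table (the table and its quarter/amount columns are assumptions for illustration): the inner SELECT builds the PIVOT statement as a string from the distinct quarter values, and EXECUTE IMMEDIATE runs that string as SQL.
from google.cloud import bigquery

client = bigquery.Client()

# Build the pivot statement dynamically, then run it with EXECUTE IMMEDIATE,
# which evaluates its string argument as a SQL statement (like eval() in Python).
sql = """
EXECUTE IMMEDIATE (
  SELECT CONCAT(
    "SELECT * FROM `mydataset.sales` ",
    "PIVOT (SUM(amount) FOR quarter IN (",
    STRING_AGG(DISTINCT CONCAT("'", quarter, "'"), ","),
    "))"
  )
  FROM `mydataset.sales`
)
"""
rows = client.query(sql).result()  # results of the script's last statement
for row in rows:
    print(dict(row))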
We can easily add a new column to a table in BigQuery:
ALTER TABLE mydataset.mytable
ADD COLUMN new_col STRING
But when you want to delete or rename an existing column, there is no SQL to do it. The only way to delete or rename an existing column is to rebuild the table with the bq command-line tool:
bq show --format=prettyjson mydataset.mytable > schema.json
# Edit schema.json so that only the columns you want to keep remain
bq mk --table mydataset.new_mytable schema.json
# Export the data from `mytable` into `new_mytable` (see the sketch below)
bq rm --table mydataset.mytable
bq cp mydataset.new_mytable mydataset.mytable
And remember to back up your data before doing this!
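The “export the data” step above is not spelled out; a minimal sketch of one way to do it with a query job (the project and column names are placeholders and should match the edited schema.json):
from google.cloud import bigquery

client = bigquery.Client()

# Copy only the columns we want to keep into the new table; the SELECT list
# must match the schema created from the edited schema.json.
job_config = bigquery.QueryJobConfig(
    destination="my_project.mydataset.new_mytable",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
sql = "SELECT kept_col1, kept_col2 FROM `my_project.mydataset.mytable`"
client.query(sql, job_config=job_config).result()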
1. Get the memory size of a Pandas DataFrame
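A minimal sketch of one common way to check this (memory_usage with deep=True also measures the contents of object/string columns):
import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob"], "age": [30, 40]})

# Per-column memory usage in bytes; deep=True also counts the Python
# objects referenced by object/string columns.
print(df.memory_usage(deep=True))
print("total bytes:", df.memory_usage(deep=True).sum())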
2. Upload a large Pandas DataFrame to a BigQuery table
If the DataFrame is too big, the upload will report “UDF out of memory”:
google.api_core.exceptions.BadRequest: 400 Resources exceeded during query execution: UDF out of memory.; Failed to read Parquet file [...]. This might happen if the file contains a row that is too large, or if the total size of the pages loaded for the queried columns is too large.
The solution is as simple as splitting the DataFrame and uploading the chunks one by one:
import numpy as np
from google.cloud import bigquery

client = bigquery.Client()
for df_chunk in np.array_split(df, 10):
    job_config = bigquery.LoadJobConfig()
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
    job = client.load_table_from_dataframe(df_chunk, table_id, job_config=job_config)
    job.result()  # wait for each chunk to finish before uploading the next
3. Restore table in BigQuery
How to recover a deleted table in BigQuery? Just use:
bq cp dataset.table@1577833205000 dataset.new_table
If the <timestamp> is not correct, the bq command will give you a notification about which <timestamp> is right for this table. Then you can use that correct <timestamp> to copy the deleted table into a new one.
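The number after “@” is a snapshot time in milliseconds since the Unix epoch; a minimal sketch of computing it for, say, two hours ago:
import time

# Snapshot decorators use milliseconds since the Unix epoch.
two_hours_ago_ms = int((time.time() - 2 * 3600) * 1000)
print(f"bq cp dataset.table@{two_hours_ago_ms} dataset.new_table")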