I finally got a few hours to dig into the code of LightGBM.
I have had some questions about LightGBM for a while, and now I can answer some of them myself. Even if some of the answers turn out to be wrong, that is still better than no answer at all 🙂
Q: Will LightGBM construct several trees at once as one model?
A: No. Within a single boosting iteration it constructs exactly one tree; the final model is the additive ensemble of the trees from all iterations.
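To make that concrete, here is a minimal sketch of the additive structure. The types below are hypothetical stand-ins for illustration, not LightGBM's real classes (its actual tree lives in include/LightGBM/tree.h):

#include <vector>

// Hypothetical stand-in for one trained tree (not LightGBM's Tree class).
struct SimpleTree {
  double Predict(const std::vector<double>& /*features*/) const {
    // A real tree would traverse its splits; a constant output
    // keeps the sketch runnable.
    return 0.1;
  }
};

// The boosted model is just the sum of one tree per iteration.
struct BoostedModel {
  std::vector<SimpleTree> trees;  // one entry per boosting iteration

  double Predict(const std::vector<double>& features) const {
    double score = 0.0;
    for (const auto& tree : trees) {
      score += tree.Predict(features);  // additive ensemble
    }
    return score;
  }
};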
Q: How does LightGBM choose the feature with the highest split gain?
A: It simply iterates over all features (a plain loop in the code) and finds the best split threshold for each of them. It then picks the feature and threshold with the highest gain.
"src/treelearner/serial_tree_learner.cpp"
158 Tree* SerialTreeLearner::Train(const score_t* gradients, const score_t *hessians) {
...
185 int init_splits = ForceSplits(tree_ptr, &left_leaf, &right_leaf, &cur_depth);
186
187 for (int split = init_splits; split < config_->num_leaves - 1; ++split) {
188 // some initial works before finding best split
189 if (BeforeFindBestSplit(tree_ptr, left_leaf, right_leaf)) {
190 // find best threshold for every feature
191 FindBestSplits(tree_ptr);
192 }
...
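One detail worth calling out in this loop: a binary tree with num_leaves leaves needs exactly num_leaves - 1 splits, and each pass of the loop performs one split on whichever leaf currently has the best gain. This is leaf-wise (best-first) growth, as opposed to level-wise growth. Here is a conceptual sketch of that control flow; GrowLeafWise and the decaying fake gains are made up for illustration, though LightGBM does keep a comparable per-leaf array (best_split_per_leaf_):

#include <algorithm>
#include <iterator>
#include <vector>

// Conceptual leaf-wise growth: track the best split gain of every current
// leaf and always split the leaf with the global maximum.
std::vector<int> GrowLeafWise(int num_leaves) {
  std::vector<double> best_gain_per_leaf = {1.0};  // start with the root leaf
  std::vector<int> split_order;
  for (int split = 0; split < num_leaves - 1; ++split) {  // same bound as Train()
    auto it = std::max_element(best_gain_per_leaf.begin(),
                               best_gain_per_leaf.end());
    int best_leaf =
        static_cast<int>(std::distance(best_gain_per_leaf.begin(), it));
    split_order.push_back(best_leaf);
    // A real learner would split best_leaf here and recompute the children's
    // gains; we fake decaying gains to keep the sketch runnable.
    double gain = best_gain_per_leaf[best_leaf];
    best_gain_per_leaf[best_leaf] = gain * 0.5;  // "left child" reuses the slot
    best_gain_per_leaf.push_back(gain * 0.4);    // "right child" gets a new slot
  }
  return split_order;
}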
"src/treelearner/serial_tree_learner.cpp"
322 void SerialTreeLearner::FindBestSplits(const Tree* tree) {
323 std::vector<int8_t> is_feature_used(num_features_, 0);
324 #pragma omp parallel for schedule(static, 256) if (num_features_ >= 512)
325 for (int feature_index = 0; feature_index < num_features_; ++feature_index) {
326 if (!col_sampler_.is_feature_used_bytree()[feature_index]) continue;
327 if (parent_leaf_histogram_array_ != nullptr
328 && !parent_leaf_histogram_array_[feature_index].is_splittable()) {
329 smaller_leaf_histogram_array_[feature_index].set_is_splittable(false);
330 continue;
331 }
332 is_feature_used[feature_index] = 1;
333 }
...
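Once the usable features are marked, each feature's histogram is scanned bin by bin for the best threshold, and the single best (feature, threshold) pair across all features wins. Below is a simplified sketch of one such scan, assuming the standard second-order gain G²/(H + λ) that GBDT libraries use; the Bin struct and BestSplitGain are made up for illustration, and LightGBM's real FeatureHistogram::FindBestThreshold handles far more (L1 regularization, missing values, categorical features, constraints):

#include <algorithm>
#include <cstddef>
#include <vector>

// Made-up bin type: summed gradients / hessians of the samples in the bin.
struct Bin {
  double sum_gradient;
  double sum_hessian;
};

// Score of a leaf holding total gradient g and hessian h, with L2
// regularization lambda: g^2 / (h + lambda).
static double LeafScore(double g, double h, double lambda) {
  return (g * g) / (h + lambda);
}

// Scan every threshold of one feature's histogram and return the best
// gain(split) = score(left) + score(right) - score(parent).
double BestSplitGain(const std::vector<Bin>& bins, double lambda) {
  double total_g = 0.0, total_h = 0.0;
  for (const Bin& b : bins) {
    total_g += b.sum_gradient;
    total_h += b.sum_hessian;
  }
  const double parent_score = LeafScore(total_g, total_h, lambda);

  double best_gain = 0.0;
  double left_g = 0.0, left_h = 0.0;
  // Threshold after bin i: bins[0..i] go left, the rest go right.
  for (std::size_t i = 0; i + 1 < bins.size(); ++i) {
    left_g += bins[i].sum_gradient;
    left_h += bins[i].sum_hessian;
    const double gain = LeafScore(left_g, left_h, lambda) +
                        LeafScore(total_g - left_g, total_h - left_h, lambda) -
                        parent_score;
    best_gain = std::max(best_gain, gain);
  }
  return best_gain;
}

The loop in FindBestSplits then effectively runs a scan like this for every feature flagged in is_feature_used and keeps the global maximum.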
Q: In a LightGBM model file, every iteration shows “num_leaves=63”. Shouldn't the depth and the number of leaves change from iteration to iteration?
A: I can't answer this one yet. I still need to look into the code to find out…