machine learning

Kubeflow deployment: part 1

By following the document, I tried to deploy the management cluster of Kubeflow. But after running make apply-cluster it reported:

The management cluster name "kubeflow-mgmt" is valid.
# Delete the directory so any resources that have been removed
# from the manifests will be pruned
rm -rf build/cluster
mkdir -p build/cluster
kustomize build ./cluster -o build/cluster
# Create the cluster
anthoscli apply -f build/cluster
I0723 14:53:19.329785   24546 main.go:230] reconcile serviceusage.cnrm.cloud.google.com/Service container.googleapis.com
I0723 14:53:23.236897   24546 main.go:230] reconcile container.cnrm.cloud.google.com/ContainerCluster kubeflow-mgmt
Unexpected error: error reconciling objects: error reconciling ContainerCluster:gcp-wow-rwds-ai-mlchapter-dev/kubeflow-mgmt: error creating GKE cluster kubeflow-mgmt: googleapi: Error 400: Project "gcp-wow-rwds-ai-mlchapter-dev" has no network named "default".
make: *** [apply-cluster] Error 1

The reason for this error is that Kubeflow could only use the network with the name “default” in GCP as its VPC. This issue is still open and has been pointed to anthos.

Workaround: Create a new GKE cluster manually, and set MGMT_NAME to the existed cluster name

export MGMT_NAME=kubeflow-exp

Then the make apply-cluster would work properly.

First experiments about Vertex AI of Google Cloud

As the above menu show in the Vertex AI, it is trying to include all common processes of building and running a machine learning model.

For my experiment, I just create a Dataset by loading file from GCS. Unfortunately, the loading process support only CSV file as tabular data so I have to convert my big PARQUET file into CSV format first (really inconvenient).

Strange error

But after I created a training process by using builtin XGBoost container. It report a strange error:

There is an invalid column, but what’s the name of it? The GUI didn’t show. I finally find out that it’s a column with an empty name. Seems Vertex AI couldn’t even process a table with a column of an empty name.

2. AutoML

After manually removed the column with an empty name and select AutoML for my tabular data. The training went successfully. The final regression L1 loss is 0.237, just the same result with my own LightGBM model.

3. Custom Pakcage

By following this document, I create a custom Python package for my training of the XGBoost model. The self-brew package use environment-variable to get Dataset from GCS. The final L1 loss is slightly worse than LightGBM.

Frankly speaking, I haven’t seen any advantage of Vertex AI over our home-brew Argo/K8S training framework. In the Vertex AI training process, those special errors, like OOM(Out Of Memory), are hard to discover.

Debug CUDA error for PyTorch

After I changed my dataset for my code, the training failed:

/tmp/pip-req-build-_tx3iysr/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:310: operator(): block: [0,0,0], thread: [59,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/tmp/pip-req-build-_tx3iysr/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:310: operator(): block: [0,0,0], thread: [60,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/tmp/pip-req-build-_tx3iysr/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:310: operator(): block: [0,0,0], thread: [61,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/tmp/pip-req-build-_tx3iysr/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:310: operator(): block: [0,0,0], thread: [62,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/tmp/pip-req-build-_tx3iysr/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:310: operator(): block: [0,0,0], thread: [63,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
  File "train.py", line 337, in <module>
    train(args, train_loader, eval_loader)
  File "train.py", line 189, in train
    sounds = aug(sounds)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 881, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sanbai/birds_sound_classification/utils/augment.py", line 13, in forward
    image = (image - image.mean()) / image.std()
RuntimeError: CUDA error: device-side assert triggered

It’s terribly hard to find out the reason for this common error “RuntimeError: CUDA error: device-side assert triggered”. But someone on Github recommends a method: adding CUDA_LAUNCH_BLOCKING=1 before the program.

Now the real error behind RuntimeError shows up: it’s the wrong number of categories I set to the model.

Source code reading of LightGBM

Finally I get a few hours to look into the code of LightGBM.

I used to have some questions about LighGBM, and now fortunately I can answer some of them by myself. Even some answers may be wrong, that is still better than no answer at all 🙂

Q: Will LightGBM contruct a couple of trees as one model?

A: No. It will only contruct one tree as a model for a dataset

Q: How would LightGBM choose the feature that has the highest gain in entropy?

A: It will simply iterate all features (with a loop, in code) and try to find the best split for all of them. After that, it will pick the feature and the split with the highest gain

"src/treelearner/serial_tree_learner.cpp"
158 Tree* SerialTreeLearner::Train(const score_t* gradients, const score_t *hessians) {
...
185   int init_splits = ForceSplits(tree_ptr, &left_leaf, &right_leaf, &cur_depth);
186
187   for (int split = init_splits; split < config_->num_leaves - 1; ++split) {
188     // some initial works before finding best split
189     if (BeforeFindBestSplit(tree_ptr, left_leaf, right_leaf)) {
190       // find best threshold for every feature
191       FindBestSplits(tree_ptr);
192     }
...

"src/treelearner/serial_tree_learner.cpp"
322 void SerialTreeLearner::FindBestSplits(const Tree* tree) {
323   std::vector<int8_t> is_feature_used(num_features_, 0);
324   #pragma omp parallel for schedule(static, 256) if (num_features_ >= 512)
325   for (int feature_index = 0; feature_index < num_features_; ++feature_index) {
326     if (!col_sampler_.is_feature_used_bytree()[feature_index]) continue;
327     if (parent_leaf_histogram_array_ != nullptr
328         && !parent_leaf_histogram_array_[feature_index].is_splittable()) {
329       smaller_leaf_histogram_array_[feature_index].set_is_splittable(false);
330       continue;
331     }
332     is_feature_used[feature_index] = 1;
333   }
...

Q: In the model file of LightGBM, it shows “num_leaves=63” in every iteration. Shouldn’t it change the depth and leaves of model for every iteration?

A: Can’t answer yet. Still need to look into the code to see why…

Some thoughts about cuDF and cuML

I just received an email from NVIDIA about their RAPIDS. Although the cuDF and cuML look fantastic for a data scientist. I am still doubtful about them.

In our daily work, we usually process small DataFrame by Pandas, so cuDF will be too expensive since it needs GPU. And even we need to join two large DataFrame, we tend to use BigQuery, for it’s distributed and relatively cheap. The only proper case for cuDF I think is some heavy operations on less than 8GB data. Who need so many heavy operations on a DataFrame? I don’t know.

For cuML, it’s more like a GPU version scikit-learn. Actually, for tabular data we use XGBoost/LightGBM, for non-structure-data we use PyTorch/Tensorflow. Who will even use scikit-learn? Not even mention the cuML.

A stupid mistake in the new deep learning experiment

After my old colleague, JianMei prepared about 1TB data of the birds’ sound records (every mp3 file will be transferred to an image by using spectrogram and split into chunks with each chunk 2.5 seconds period. After all, every file is a 1250×78 multi-dimension array), I started training with almost the same code using in bird image classification.

The train-accuracy rises very slowly so I add a line of code to normalize every input sample:

image = (image - image.mean()) / image.std()

After that, the train-accuracy could rise faster, but the eval-accuracy still quite low.

In order to find out the root of the problem, I started to train from only two classes: “Black-capped Donacobius” and “Blue-eared Barbet”

Eval accuracy:0.965278 | Train accuracy:1.000000

The result seems pretty good. So I increate the number of classes to 20

Eval accuracy:0.211102 | Train accuracy:0.840278
Eval accuracy:0.203834 | Train accuracy:0.885417
Eval accuracy:0.191514 | Train accuracy:0.904247
Eval accuracy:0.193245 | Train accuracy:0.916667
Eval accuracy:0.210894 | Train accuracy:0.916667
Eval accuracy:0.190269 | Train accuracy:0.932292

Are some types of bird hard to generalize in deep learning model? Then I began to consider how to find these “hard to train” bird type: maybe start from 2 classes and increase the number of classes step by step, and then draw a few curves about the train-accuracy and eval-accuracy…

Suddenly I realized that I just use normalization in the training sample but not evaluation sample!

What a stupid mistake. It wasted me a whole stuffy afternoon for nothing. I really should remember this lesson: do what you do in training samples to evaluation samples, except dropout.

A struggle to keep the accuracy

In this August, we have got 0.83 evaluation accuracy for DIB-10K dataset. But since last month, we have updated the dataset and the accuracy could only get to 0.82.

The first doubtful point is the Weight Standardization method we used for micro-batch (since the model is too big). So I turned to try gradient-accumulation and use this snippet as an example because it won’t need me to change my code heavily:

model.zero_grad()                                   # Reset gradients tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)
    loss = loss / accumulation_steps
    loss.backward()                                 # Backward pass
    if (i+1) % accumulation_steps == 0:
        optimizer.step()
        model.zero_grad()
        if (i+1) % evaluation_steps == 0:
            evaluate_model()

But after changing my code and retrain the model, the accuracy still keep around 0.82:

Epoch     4: reducing learning rate of group 0 to 1.0000e-01.
[2020-12-16 05:53:29] Eval accuracy: 0.8283 | Train accuracy: 0.8187
[2020-12-16 10:01:40] Eval accuracy: 0.8284 | Train accuracy: 0.8938
[2020-12-16 14:11:35] Eval accuracy: 0.8284 | Train accuracy: 0.8313
Epoch     7: reducing learning rate of group 0 to 5.0000e-02.
[2020-12-16 18:21:47] Eval accuracy: 0.8285 | Train accuracy: 0.8750
[2020-12-16 22:31:19] Eval accuracy: 0.8285 | Train accuracy: 0.8313
[2020-12-17 02:41:37] Eval accuracy: 0.8284 | Train accuracy: 0.8625
Epoch    10: reducing learning rate of group 0 to 2.5000e-02.
[2020-12-17 06:52:05] Eval accuracy: 0.8286 | Train accuracy: 0.8500
[2020-12-17 11:02:11] Eval accuracy: 0.8285 | Train accuracy: 0.8063
[2020-12-17 15:12:23] Eval accuracy: 0.8286 | Train accuracy: 0.8375
Epoch    13: reducing learning rate of group 0 to 1.2500e-02.
[2020-12-17 19:22:04] Eval accuracy: 0.8285 | Train accuracy: 0.8313

This makes me really desperate. Maybe I should temporarily put this task aside and go on other works.

Export YOLOv5 models for mobile device

Somebody has finished the work about exporting YOLOv5 models to tflite model. To use it, we only need to:

git clone --single-branch --branch tf-export https://github.com/zldrobit/yolov5.git
cd yolov5
# it will download all pytorch models
sh -x weights/download_weights.sh
# export a tflite model from yolov5l
PYTHONPATH=. python3 models/tf.py --weights yolov5l.pt --cfg models/yolov5l.yaml --img 640
# there will be a tflite model file
ls yolov5l-fp16.tflite

The model file yolov5l-fp16.tflite is 91MB, which is a little too big but still could be put into a mobile phone.

The awesome YOLOv5

I just found a repository YOLOv5 from Github. It’s not just the models are accurate and fast but also easy to get and to use.

Just download the code, install some dependent libraries. And you can just run a simple command:

python3 detect.py --weights yolov5l.pt

Then it will automatically download the YOLO v5 Large model, process all images in inference/images, and put the annotated images into inference/output.

Let’s see some images predicted by YOLO v5 Large:

And every image cost no more than 2 seconds to predict on my laptop. Isn’t it awesome? At least it’s much better than my experiments…

TabNet: a new neural-network architecture for tabular data

The neural network seems mostly to be used on Computer Vision and Natural Language Processing scenarios, while tree-models like GBDT are mainly used for tabular data.

But why?

Although this article tries to give an explanation of this, it hasn’t been so promising to me. In my humble opinion, the neural network could finally surpass, or at least be competitive, to the GBDT model.

For example, the paper <TabNet: Attentive Interpretable Tabular Learning> describe a Transformer-like model to simulate the tree-model. The PyTorch implementation is here. I have used it on our own data and it finally reached 90% accuracy ( the accuracy of LightGBM is 93%). In spite of the lower accuracy, this is the first neural model reached 90% accuracy when used on our private data. The author has already done a great job.