treelite is an easy-to-use tool for accelerating the prediction speed of XGBoost/LightGBM models. Three days ago, I tested it on my 4-core virtual machine and found that it cut the running time in half. But after I deployed it to our Kubernetes cluster, it ran even slower than the plain LightGBM library!
I noticed some strange phenomena in the Kubernetes environment:

  1. If I run only one pod, it runs as fast as in my earlier virtual-machine test. But if I run 100 pods simultaneously, they all become very slow.
  2. After logging in to a pod for profiling, I found that the application's CPU usage would start at 400% (I had set the CPU request and limit to "4") and gradually drop below 10%.
  3. No memory swapping happened on any node.
  4. I used sleep 3600 to hang all the pods, logged in to one pod with kubectl exec -ti <pod_name> /bin/bash, and ran my application manually. It still ran very fast.

This weird problem haunted me, dejected me, and frustrated me for two days, and almost ruined a night of my sleep. Only when I finally looked into the source code of treelite did the answer jump out:

    // src/thread_pool/thread_pool.h
    for (int i = 0; i < num_worker_; ++i) {
      thread_[i] = std::thread(task_, incoming_queue_[i].get(),
                               /* ... */);
    }
    /* bind threads to cores */

treelite pins its worker threads to CPU cores by default!
This explained everything: when we started 100 pods, each one was trying to pin its 4 threads to CPU cores 0-3. In a container environment, every pod can see all the CPU cores of the node (unlike a virtual machine, where each VM sees only its own 4 cores). Hence they all pinned their threads to the first 4 CPU cores of the node. Using only 4 CPU cores to run all 100 pods? That's a terrible disaster!

The conclusion is that some acceleration tricks that worked very well in the classic-server era cannot be carried over unchanged to the container era.