Finally I got a few hours to look into the LightGBM code.
I have had some questions about LightGBM for a while, and now fortunately I can answer some of them myself. Even if some of the answers turn out to be wrong, that is still better than no answer at all 🙂
Q: Will LightGBM construct a couple of trees as one model?
A: No. The tree learner constructs only one tree per call; the boosted model is built up by training one such tree per iteration.
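To make that concrete, here is a minimal sketch of the idea, not LightGBM's actual API: the boosted model is just a sequence of single trees, one per iteration, and a prediction sums their (shrunken) outputs. The lambda "trees" below are hard-coded stand-ins; real trees are fit to gradients of the loss.

```cpp
#include <functional>
#include <iostream>
#include <vector>

// Hypothetical stand-in for a single decision tree: maps a feature value
// to a leaf output. In LightGBM this role is played by the Tree returned
// by SerialTreeLearner::Train().
using Tree = std::function<double(double)>;

int main() {
  const double learning_rate = 0.1;
  std::vector<Tree> model;  // the boosted model is just a list of trees

  // One tree is constructed per boosting iteration (illustration only).
  for (int iter = 0; iter < 3; ++iter) {
    model.push_back([iter](double x) { return x > iter ? 1.0 : -1.0; });
  }

  // Prediction sums the shrunken outputs of all trees.
  double x = 1.5, score = 0.0;
  for (const Tree& tree : model) {
    score += learning_rate * tree(x);
  }
  std::cout << "score = " << score << "\n";  // 0.1 * (1 + 1 - 1) = 0.1
}
```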
Q: How does LightGBM choose the feature with the highest split gain?
A: It simply iterates over all features (with a loop in the code, parallelized via OpenMP) and tries to find the best split for each of them. Afterwards, it picks the feature and threshold whose split yields the highest gain:
"src/treelearner/serial_tree_learner.cpp" 158 Tree* SerialTreeLearner::Train(const score_t* gradients, const score_t *hessians) { ... 185 int init_splits = ForceSplits(tree_ptr, &left_leaf, &right_leaf, &cur_depth); 186 187 for (int split = init_splits; split < config_->num_leaves - 1; ++split) { 188 // some initial works before finding best split 189 if (BeforeFindBestSplit(tree_ptr, left_leaf, right_leaf)) { 190 // find best threshold for every feature 191 FindBestSplits(tree_ptr); 192 } ...
"src/treelearner/serial_tree_learner.cpp" 322 void SerialTreeLearner::FindBestSplits(const Tree* tree) { 323 std::vector<int8_t> is_feature_used(num_features_, 0); 324 #pragma omp parallel for schedule(static, 256) if (num_features_ >= 512) 325 for (int feature_index = 0; feature_index < num_features_; ++feature_index) { 326 if (!col_sampler_.is_feature_used_bytree()[feature_index]) continue; 327 if (parent_leaf_histogram_array_ != nullptr 328 && !parent_leaf_histogram_array_[feature_index].is_splittable()) { 329 smaller_leaf_histogram_array_[feature_index].set_is_splittable(false); 330 continue; 331 } 332 is_feature_used[feature_index] = 1; 333 } ...
Q: In the LightGBM model file, "num_leaves=63" appears for every iteration. Shouldn't the depth and number of leaves of the tree change from iteration to iteration?
A: Can’t answer yet. Still need to look into the code to see why…