I finally got a few hours to dig into the code of LightGBM.
I have had some questions about LightGBM for a while, and now I can answer some of them myself. Even if some of the answers turn out to be wrong, that is still better than no answer at all 🙂
Q: Will LightGBM construct several trees at once as one model?
A: No. Within a single boosting iteration it constructs exactly one tree; the final model is the additive ensemble of the trees from all iterations.
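To make that concrete, here is a minimal sketch of the additive structure. The types below are hypothetical stand-ins for illustration, not LightGBM's real classes (its actual tree lives in include/LightGBM/tree.h):

#include <vector>

// Hypothetical stand-in for one trained tree (not LightGBM's Tree class).
struct SimpleTree {
  double Predict(const std::vector<double>& /*features*/) const {
    // A real tree would traverse its splits; a constant output
    // keeps the sketch runnable.
    return 0.1;
  }
};

// The boosted model is just the sum of one tree per iteration.
struct BoostedModel {
  std::vector<SimpleTree> trees;  // one entry per boosting iteration

  double Predict(const std::vector<double>& features) const {
    double score = 0.0;
    for (const auto& tree : trees) {
      score += tree.Predict(features);  // additive ensemble
    }
    return score;
  }
};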
Q: How does LightGBM choose the feature with the highest split gain?
A: It simply iterates over all features (a plain loop in the code) and finds the best split threshold for each of them. It then picks the feature and threshold with the highest gain.
"src/treelearner/serial_tree_learner.cpp"
158 Tree* SerialTreeLearner::Train(const score_t* gradients, const score_t *hessians) {
...
185 int init_splits = ForceSplits(tree_ptr, &left_leaf, &right_leaf, &cur_depth);
186
187 for (int split = init_splits; split < config_->num_leaves - 1; ++split) {
188 // some initial works before finding best split
189 if (BeforeFindBestSplit(tree_ptr, left_leaf, right_leaf)) {
190 // find best threshold for every feature
191 FindBestSplits(tree_ptr);
192 }
...
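One detail worth calling out in this loop: a binary tree with num_leaves leaves needs exactly num_leaves - 1 splits, and each pass of the loop performs one split on whichever leaf currently has the best gain. This is leaf-wise (best-first) growth, as opposed to level-wise growth. Here is a conceptual sketch of that control flow; GrowLeafWise and the decaying fake gains are made up for illustration, though LightGBM does keep a comparable per-leaf array (best_split_per_leaf_):

#include <algorithm>
#include <iterator>
#include <vector>

// Conceptual leaf-wise growth: track the best split gain of every current
// leaf and always split the leaf with the global maximum.
std::vector<int> GrowLeafWise(int num_leaves) {
  std::vector<double> best_gain_per_leaf = {1.0};  // start with the root leaf
  std::vector<int> split_order;
  for (int split = 0; split < num_leaves - 1; ++split) {  // same bound as Train()
    auto it = std::max_element(best_gain_per_leaf.begin(),
                               best_gain_per_leaf.end());
    int best_leaf =
        static_cast<int>(std::distance(best_gain_per_leaf.begin(), it));
    split_order.push_back(best_leaf);
    // A real learner would split best_leaf here and recompute the children's
    // gains; we fake decaying gains to keep the sketch runnable.
    double gain = best_gain_per_leaf[best_leaf];
    best_gain_per_leaf[best_leaf] = gain * 0.5;  // "left child" reuses the slot
    best_gain_per_leaf.push_back(gain * 0.4);    // "right child" gets a new slot
  }
  return split_order;
}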
"src/treelearner/serial_tree_learner.cpp"
322 void SerialTreeLearner::FindBestSplits(const Tree* tree) {
323 std::vector<int8_t> is_feature_used(num_features_, 0);
324 #pragma omp parallel for schedule(static, 256) if (num_features_ >= 512)
325 for (int feature_index = 0; feature_index < num_features_; ++feature_index) {
326 if (!col_sampler_.is_feature_used_bytree()[feature_index]) continue;
327 if (parent_leaf_histogram_array_ != nullptr
328 && !parent_leaf_histogram_array_[feature_index].is_splittable()) {
329 smaller_leaf_histogram_array_[feature_index].set_is_splittable(false);
330 continue;
331 }
332 is_feature_used[feature_index] = 1;
333 }
...
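Once the usable features are marked, each feature's histogram is scanned bin by bin for the best threshold, and the single best (feature, threshold) pair across all features wins. Below is a simplified sketch of one such scan, assuming the standard second-order gain G²/(H + λ) that GBDT libraries use; the Bin struct and BestSplitGain are made up for illustration, and LightGBM's real FeatureHistogram::FindBestThreshold handles far more (L1 regularization, missing values, categorical features, constraints):

#include <algorithm>
#include <cstddef>
#include <vector>

// Made-up bin type: summed gradients / hessians of the samples in the bin.
struct Bin {
  double sum_gradient;
  double sum_hessian;
};

// Score of a leaf holding total gradient g and hessian h, with L2
// regularization lambda: g^2 / (h + lambda).
static double LeafScore(double g, double h, double lambda) {
  return (g * g) / (h + lambda);
}

// Scan every threshold of one feature's histogram and return the best
// gain(split) = score(left) + score(right) - score(parent).
double BestSplitGain(const std::vector<Bin>& bins, double lambda) {
  double total_g = 0.0, total_h = 0.0;
  for (const Bin& b : bins) {
    total_g += b.sum_gradient;
    total_h += b.sum_hessian;
  }
  const double parent_score = LeafScore(total_g, total_h, lambda);

  double best_gain = 0.0;
  double left_g = 0.0, left_h = 0.0;
  // Threshold after bin i: bins[0..i] go left, the rest go right.
  for (std::size_t i = 0; i + 1 < bins.size(); ++i) {
    left_g += bins[i].sum_gradient;
    left_h += bins[i].sum_hessian;
    const double gain = LeafScore(left_g, left_h, lambda) +
                        LeafScore(total_g - left_g, total_h - left_h, lambda) -
                        parent_score;
    best_gain = std::max(best_gain, gain);
  }
  return best_gain;
}

The loop in FindBestSplits then effectively runs a scan like this for every feature flagged in is_feature_used and keeps the global maximum.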
Q: In a LightGBM model file, every iteration shows “num_leaves=63”. Shouldn't the depth and the number of leaves change from iteration to iteration?
A: I can't answer this one yet. I still need to look into the code to find out…