SciPy – Robin on Linux

Using XGBoost to predict large sparse data

To predict with XGBoost, I wrote code like this:

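import xgboost as xgb
from scipy.sparse import csr_matrix, load_npz

# MODEL_PATH points to a previously saved XGBoost model
test = load_npz('test.npz')
test = csr_matrix(test, dtype='float32')
xgb_model = xgb.Booster({'n_jobs': -1})
xgb_model.load_model(MODEL_PATH)
result = xgb_model.predict(test)  # this call raises the error below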

But it reported an error:

  File "/usr/lib/python3.7/site-packages/scipy/sparse/base.py", line 689, in __getattr__
    raise AttributeError(attr + " not found")
AttributeError: feature_names not found

It seems that Booster.predict() does not accept a SciPy csr_matrix directly. Maybe I need to convert the sparse data to dense:

result = xgb_model.predict(test.todense())

But that failed too:

  File "/usr/lib/python3.7/site-packages/scipy/sparse/base.py", line 1187, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError

The ‘test’ data is too big, so it can’t even be converted to a dense matrix!
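A rough size estimate makes the MemoryError easier to understand. The sketch below compares the bytes a dense copy would need with the bytes a CSR matrix actually holds. The helper names (dense_nbytes, sparse_nbytes) and the demo matrix are made up for illustration; the real test set is not reproduced here.

```python
import numpy as np
from scipy.sparse import random as sparse_random


def dense_nbytes(mat):
    """Bytes that mat.todense() would have to allocate (rows * cols * itemsize)."""
    rows, cols = mat.shape
    return rows * cols * mat.dtype.itemsize


def sparse_nbytes(mat):
    """Bytes held by a CSR matrix: stored values + column indices + row pointers."""
    return mat.data.nbytes + mat.indices.nbytes + mat.indptr.nbytes


# Hypothetical stand-in for the real test set: 10,000 x 5,000, 0.1% non-zero.
demo = sparse_random(10_000, 5_000, density=0.001, format='csr',
                     dtype=np.float32, random_state=0)

print(f"dense : {dense_nbytes(demo) / 1e6:.1f} MB")
print(f"sparse: {sparse_nbytes(demo) / 1e6:.1f} MB")
```

Running dense_nbytes(test) on the real matrix before calling todense() shows up front whether the dense copy can fit in RAM.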
Booster.predict() won’t take the sparse matrix as is, and my sparse data cannot be converted to dense. Then what should I do?
Actually, the solution is incredibly simple: just wrap the sparse matrix in XGBoost’s DMatrix!

result = xgb_model.predict(xgb.DMatrix(test))