Using XGBoost to predict large sparse data

To run prediction with XGBoost, I wrote code like this:
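It went roughly like this (a reconstruction, since the original snippet was lost; the model file name and variable names are placeholders):

```python
import xgboost as xgb

# 'model.bin' is a placeholder for my previously trained model
bst = xgb.Booster(model_file='model.bin')

# 'test' is a large scipy.sparse.csr_matrix of features
predictions = bst.predict(test)
```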

But it reported an error:

It seems SciPy’s csr_matrix is not accepted here. Maybe I need to convert the sparse data to dense:
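Something like this (again a sketch):

```python
# densify the csr_matrix before predicting
test_dense = test.toarray()   # allocates the whole dense array in memory
predictions = bst.predict(test_dense)
```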

But it still reported an error:

The ‘test’ data is too big, so it can’t even be converted to dense data!
So predict doesn’t take my sparse matrix directly, and the sparse data cannot be changed to dense. Then what should I do?

Actually, the solution is incredibly simple: just use XGBoost’s DMatrix!
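A minimal sketch of the working version:

```python
import xgboost as xgb

# DMatrix accepts scipy.sparse matrices directly, so there is no need
# to densify the data at all
dtest = xgb.DMatrix(test)
predictions = bst.predict(dtest)
```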

Some summaries for Kaggle’s competition ‘Humpback Whale Identification’

This time, I spent only one month on the competition “Humpback Whale Identification”. But I still made a small step forward compared with my previous competitions. Here are my summaries:

1. Do review the ‘kernels’ on the competition page; they teach a lot of information and new techniques. By using a Siamese network rather than a classic classification model, I eventually beat my overfitting problem, thanks to suggestions from the competition’s ‘kernel’ page.

2. Bravely use cutting-edge models, such as ResNeXt50 / DenseNet121. They are more powerful and easy to use.

3. Do use fine-tuning. Don’t train a model from scratch every time!

4. Ensemble learning is really powerful. I ensembled three different models to produce the final result.

There are also some tips for future challenges (they may be correct, they may be wrong):

1. albumentations is a handy library for image augmentation (see the sketch after this list)

2. A cosine-decay learning rate performed worse than an exponential-decay learning rate

3. LeakyReLU doesn’t work significantly better than ReLU

4. Bigger image size may not lead to higher accuracy
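About tip 1: here is a minimal albumentations pipeline. The transforms and probabilities below are only illustrative, not the exact ones I used in the competition.

```python
import albumentations as A
import numpy as np

# an illustrative augmentation pipeline
transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1, rotate_limit=15, p=0.5),
    A.RandomBrightnessContrast(p=0.3),
])

image = np.zeros((224, 224, 3), dtype=np.uint8)  # stand-in for a real photo
augmented = transform(image=image)['image']
```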

Using ResNeXt in Keras 2.2.4

To use ResNeXt50, I wrote my code following the Keras API documentation:
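Roughly like this (reconstructed from memory, since the original snippet was lost):

```python
# import ResNeXt50 the way the Keras documentation described
from keras.applications.resnext import ResNeXt50

model = ResNeXt50(weights='imagenet')
```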

But it reported errors:

That’s weird. The code doesn’t work as the documentation says.
So I checked the source of Keras 2.2.4 (the version on my computer), and noticed that this version uses ‘keras_applications’ instead of ‘keras.applications’.
Then I changed my code:
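Probably like this (again a reconstruction):

```python
# import from the standalone keras_applications package instead
from keras_applications.resnext import ResNeXt50

model = ResNeXt50(weights='imagenet')
```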

But it reported another error:

Without any other choice, I had to check the code of ‘/usr/lib/python3.6/site-packages/keras_applications/resnet_common.py’ too. Finally, I realised that the ResNeXt50() function needs some extra keyword arguments:
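My best reconstruction of the fix (the argument list follows what resnet_common.py expects, as far as I can tell):

```python
import keras
from keras_applications.resnext import ResNeXt50

# keras_applications is shared between Keras and TensorFlow, so the caller
# has to tell it which backend/layers/models/utils modules to use
model = ResNeXt50(weights='imagenet',
                  backend=keras.backend,
                  layers=keras.layers,
                  models=keras.models,
                  utils=keras.utils)
```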

Now the program runs the ResNeXt50 model correctly. This GitHub issue explains the detail: ‘keras_applications’ can be used by both Keras and TensorFlow, so the caller needs to pass the framework’s modules into the model function.

Some tips about using Keras

1. How to use part of a model

The ‘img_embed’ model is part of ‘branch_model’. We should realise that ‘Model()’ is a CPU-heavy call, so the sub-model needs to be created only once and can then be reused many times:
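A sketch of the idea; the tiny stand-in network below is hypothetical, and only the ‘branch_model’ / ‘img_embed’ names come from my real code:

```python
import numpy as np
from keras.layers import Dense, Input
from keras.models import Model

# a hypothetical stand-in for 'branch_model'
inp = Input(shape=(16,))
hidden = Dense(8, name='embed')(inp)
branch_model = Model(inp, Dense(1)(hidden))

# build the sub-model ONCE: constructing Model() is expensive
img_embed = Model(inputs=branch_model.input,
                  outputs=branch_model.get_layer('embed').output)

features = img_embed.predict(np.random.rand(4, 16))  # then reuse it freely
```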

2. How to save a model when using ‘multi_gpu_model’

We should keep a reference to the original model; only by using it can we save the model to a file:
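A minimal sketch, assuming a machine with two GPUs:

```python
import numpy as np
from keras.layers import Dense, Input
from keras.models import Model
from keras.utils import multi_gpu_model

inp = Input(shape=(16,))
model = Model(inp, Dense(1, activation='sigmoid')(inp))  # keep this reference!

parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(optimizer='adam', loss='binary_crossentropy')
parallel_model.fit(np.random.rand(64, 16), np.random.randint(0, 2, (64, 1)))

model.save('model.h5')  # save via the original model, not the multi-GPU wrapper
```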

Some tips about Python, Pandas, and Tensorflow

Here are some useful tips for using Keras, Pandas, and TensorFlow to build models.

1. When using applications.inception_v3.InceptionV3(include_top=False, weights=‘imagenet’) to get pretrained parameters for the InceptionV3 model, the console reported:

The solution is here. Just install some packages:

2. Could we use ‘add’ to merge two DataFrames in Pandas? Let’s try:
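A small reconstruction of the experiment (the original snippet was lost; the values are made up):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame()  # an empty DataFrame

print(df1 + df2)
#     a   b
# 0 NaN NaN
# 1 NaN NaN
```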

The result is all ‘NaN’. The operator ‘+’ just works as ‘pandas.DataFrame.add‘: it tries to add the values column by column, but the second DataFrame is empty, so adding a number to a nonexistent value yields ‘NaN’.
To merge two DataFrames, we should use ‘append’ instead:
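For example (continuing the made-up data from above):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df3 = pd.DataFrame({'a': [5], 'b': [6]})

# append() concatenates rows instead of adding values
print(df1.append(df3, ignore_index=True))
#    a  b
# 0  1  3
# 1  2  4
# 2  5  6
```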

3. Why doesn’t the Estimator of TensorFlow print out logs?

The logging_hook had been set up, but it never printed anything. The solution is just to add one line before running the Estimator:
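That line raises the log verbosity (TensorFlow 1.x API):

```python
import tensorflow as tf

# Estimator logs, including LoggingTensorHook output, only appear
# once the verbosity is raised to INFO
tf.logging.set_verbosity(tf.logging.INFO)
```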

LinearSVC versus SVC in scikit-learn

In the competition ‘Quora Insincere Questions Classification’, I wanted to use simple TF-IDF statistics as a baseline:
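A minimal sketch of that baseline; the original snippet was lost, and the toy texts and labels below just stand in for the Quora data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = ['how do magnets work', 'why are you so stupid', 'what is gravity']
labels = [0, 1, 0]  # 1 = insincere question

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)   # sparse TF-IDF features

clf = LinearSVC()
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(['how does a rainbow form'])))
```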

The result is not bad:

But after I changed LinearSVC to SVC(kernel=’linear’), the program couldn’t work out any result even after 12 hours!
Was I doing anything wrong? On the page of sklearn.svm.LinearSVC, there is a note:

Similar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.

Also on the page of sklearn.svm.SVC, there is another note:

The implementation is based on libsvm. The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples.

That’s the answer: LinearSVC is the right choice to process a large number of samples.

Using keras.layers.Embedding instead of python dictionary

First, I used a function to transform words into word embeddings:
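A hypothetical sketch of that approach: look up each word’s vector in a plain Python dict, one word at a time, on the CPU.

```python
import numpy as np

# a toy embedding table standing in for the real pretrained vectors
embedding_dict = {'hello': np.random.rand(300), 'world': np.random.rand(300)}

def words_to_vectors(words, dim=300):
    # single-threaded dict lookups; unknown words map to zeros
    return np.stack([embedding_dict.get(w, np.zeros(dim)) for w in words])

vectors = words_to_vectors(['hello', 'world', 'whale'])
```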

But I noticed that it consumed quite a lot of CPU while GPU usage stayed low. The reason is simple: doing dictionary lookups in single-threaded Python is inefficient. We should use the Embedding layer of Keras to put the whole word-embedding table into GPU memory.
The code is not difficult to understand:
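A sketch of the Embedding-layer version; the random matrix below stands in for the real pretrained vectors:

```python
import numpy as np
from keras.layers import Embedding, Input
from keras.models import Model

vocab_size, embed_dim = 100000, 300
embedding_matrix = np.random.rand(vocab_size, embed_dim)

inp = Input(shape=(None,), dtype='int32')      # sequences of word indices
emb = Embedding(vocab_size, embed_dim,
                weights=[embedding_matrix],
                trainable=False)(inp)          # frozen pretrained table on the GPU
model = Model(inp, emb)
```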

This time, the program ran twice as fast as before. Using GPU memory (GDDR) to look up word embeddings is the right way.

A few other lessons from Kaggle’s competition ‘Human Protein Atlas Image Classification’

Practice makes progress, so I joined Kaggle’s new competition ‘Human Protein Atlas Image Classification’ right after the previous one.
I used to think I could get a higher ranking in an image-processing competition. But actually, I haven’t even entered the top half of the leaderboard. After almost three months of trial and error, here are my reflections:

1. To solve the imbalanced-data problem, we need to use ‘focal loss’ instead of the normal cross-entropy loss (a sketch follows this list). I should have looked at other experts’ kernels earlier, so I could pick up new techniques as soon as possible.

2. To augment images, ‘lower resolution’ may be a better way than ‘mixup’

3. Try SGD and Cosine Decay, not only RMSProp

4. MobileNet may cause more severe overfitting than ResNet

5. If dropout and weight decay still can’t give a better regularization effect, what should we do? (An open question; feature engineering may be the answer)

6. Use a more powerful DNN framework, such as Keras, so I can spend more time on the model itself
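About point 1: a binary focal loss for Keras could be sketched like this (my own paraphrase of Lin et al., ‘Focal Loss for Dense Object Detection’, not the exact code I used):

```python
from keras import backend as K

def binary_focal_loss(gamma=2.0, alpha=0.25):
    # down-weights well-classified examples so training focuses on
    # hard, rare-class samples
    def loss(y_true, y_pred):
        y_pred = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
        pt = y_true * y_pred + (1 - y_true) * (1 - y_pred)   # prob. of the true class
        w = y_true * alpha + (1 - y_true) * (1 - alpha)      # class-balancing weight
        return -K.mean(w * K.pow(1.0 - pt, gamma) * K.log(pt))
    return loss

# usage: model.compile(optimizer='adam', loss=binary_focal_loss())
```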

Some errors in dataset pipeline of Tensorflow

To extend image datasets by using mixup, I used a snippet like this to mix two images:
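A reconstruction of that (buggy) snippet; the zero/one arrays below just stand in for two real images:

```python
import numpy as np

# two uint8 images of shape (512, 512, 4)
major_image = np.zeros((512, 512, 4), dtype=np.uint8)
minor_image = np.ones((512, 512, 4), dtype=np.uint8)

ratio = 0.8  # mixup blending ratio
new_image = major_image * ratio + minor_image * (1.0 - ratio)
```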

But after generating images with this snippet, training reported errors:

The size of each image is 512x512x4 = 1048576 bytes, but I couldn’t understand why any image would have a size of 8388608 bytes.
At first I suspected the dataset pipeline of TensorFlow. But after changing the pipeline code, I found the problem was not in TensorFlow.
Again and again, I reviewed my image-generating code and added some debug stubs. Finally, I found the problem: it was not TensorFlow’s fault, but mine.
Multiplying a ‘uint8’ array by a float silently promotes the result: the type of ‘new_image’ is ‘float64’, while ‘major_image’ and ‘minor_image’ are ‘uint8’! A ‘float64’ uses 8 bytes to store one element, and that explains the ‘8388608’ in the error message.
To mix up images correctly, the result should be cast back to uint8:
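That is (continuing the reconstruction above):

```python
import numpy as np

major_image = np.zeros((512, 512, 4), dtype=np.uint8)
minor_image = np.ones((512, 512, 4), dtype=np.uint8)

ratio = 0.8
# cast back to uint8 so each pixel channel stays 1 byte again
new_image = (major_image * ratio + minor_image * (1.0 - ratio)).astype(np.uint8)
assert new_image.nbytes == 512 * 512 * 4  # 1048576 bytes
```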

Books I read in year 2018

In 2018, I continued to learn more about machine learning and deep learning. “Deep Learning” was pretty suitable for me, and “Hands-On Machine Learning with Scikit-Learn and TensorFlow” was a wonderful supplement for programming practice. I also learned some basic knowledge about reinforcement learning.

To teach my daughters programming, I read some books about Arduino. In the process of learning Arduino, I became more and more interested in electronics myself! After reading more technical documents about electronic components (diodes, transistors, capacitors, relays, thyristors, etc.) and microcontrollers (ATmega from Atmel, MSP430 from Texas Instruments, STM8 from ST, and so on), I opened my eyes to a whole new area.

History books are always my favorite type. The most astonishing history book I read in 2018 was “The Last Panther”, which tells an extremely cruel but real story from WWII.

Kazuo Inamori is a famous entrepreneur in Japan, and I read some books written by him at the end of the year. Surprisingly, his books truly inspired me and even changed parts of my thinking. I really want to thank him for his teaching.