Keras

Summaries for Kaggle’s competition ‘Histopathologic Cancer Detection’

First, I want to thank Alex Donchuk for his advice in the discussion forum of the ‘Histopathologic Cancer Detection’ competition. It helped me a lot.
1. Alex used ‘SE-ResNeXt50’; I used the standard ‘ResNeXt50’ instead. This may be why my public leaderboard score of ‘0.9716’ is not as good as Alex’s. After the competition, I spent some time reading the paper on ‘SE-ResNeXt50’. It is a simple and interesting idea for improving the architecture of a neural network. Maybe I can use this model in my next Kaggle competition.
2. In this competition, I split the training dataset into ten folds and trained three different models on different train/eval splits. Ensembling these three models gave a nice score. Bagging seems to be a good method in practice.
3. After training a model to a ‘so far so good’ f1-score using SGD with ReduceLROnPlateau in Keras, I used it as the ‘base model’ for subsequent fine-tuning. By ensembling all the high-scoring fine-tuned models, I eventually got my best score. This strategy comes from Snapshot Ensembles.
4. By the way, ReduceLROnPlateau is really useful when using SGD as the optimizer.
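
The fold split and the ensembling step from points 2 and 3 can be sketched in plain Python. This is only an illustration: the helper names are made up, the probabilities are invented, and it assumes the ensemble is a simple average of predicted probabilities, which the notes above do not state.

```python
import random

def make_folds(n_samples, n_folds=10, seed=0):
    """Shuffle sample indices and deal them into n_folds roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def train_eval_split(folds, eval_fold):
    """Use one fold for evaluation and all remaining folds for training."""
    train = [i for k, fold in enumerate(folds) if k != eval_fold for i in fold]
    return train, list(folds[eval_fold])

def average_predictions(per_model_probs):
    """Average the predicted probabilities of several models, sample by sample."""
    n_models = len(per_model_probs)
    return [sum(probs) / n_models for probs in zip(*per_model_probs)]

folds = make_folds(n_samples=100)
# Three models, each evaluated on a different fold (assumed choice of folds).
splits = [train_eval_split(folds, k) for k in (0, 1, 2)]

# Invented probabilities from three models for four test images.
ensemble = average_predictions([
    [0.9, 0.2, 0.6, 0.1],
    [0.8, 0.4, 0.3, 0.2],
    [0.7, 0.3, 0.6, 0.0],
])
```

Each model sees a slightly different 90% of the data, so their errors are not identical, and averaging tends to smooth them out.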

Using ResNeXt in Keras 2.2.4

To use ResNeXt50, I wrote my code following the Keras API documentation:

keras.applications.resnext.ResNeXt50(...)

But it reported an error:

AttributeError: module 'keras.applications' has no attribute 'resnext'

That’s weird. The code doesn’t work as the documentation says.
So I checked the source of Keras 2.2.4 (the version on my computer) and noticed that this version uses ‘keras_applications’ instead of ‘keras.applications’.
Then I changed my code:

keras_applications.resnext.ResNeXt50(input_tensor = pinp, include_top = False, weights = 'imagenet')

But it reported another error:

Using TensorFlow backend.
Traceback (most recent call last):
  File "ktrain.py", line 292, in <module>
    main()
  File "ktrain.py", line 277, in main
    model, orig_model, branch_model, head_model = build_model(args)
  File "ktrain.py", line 210, in build_model
    branch_model = resnet_model(args, img_shape)
  File "ktrain.py", line 164, in resnet_model
    base_model = keras_applications.resnext.ResNeXt50(input_tensor = pinp, include_top = False, weights = 'imagenet')
  File "/usr/lib/python3.6/site-packages/keras_applications/resnet_common.py", line 555, in ResNeXt50
    **kwargs)
  File "/usr/lib/python3.6/site-packages/keras_applications/resnet_common.py", line 348, in ResNet
    data_format=backend.image_data_format(),
AttributeError: 'NoneType' object has no attribute 'image_data_format'

With no other choice, I had to read ‘/usr/lib/python3.6/site-packages/keras_applications/resnet_common.py’ as well. Finally, I realised that the ResNeXt50() function needs four more arguments:

keras_applications.resnext.ResNeXt50(
        input_tensor = pinp, include_top = False, weights = 'imagenet',
        backend = keras.backend, layers = keras.layers, models = keras.models, utils = keras.utils)

Now the program can run the ResNeXt50 model correctly. This GitHub issue explains the details: ‘keras_applications’ can be used with both Keras and TensorFlow, so the framework's own modules have to be passed into the model function.
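
The mechanism behind that error can be reproduced with a toy example. The function and class below are hypothetical stand-ins, not the real keras_applications code; they only show how a library that receives its backend through keyword arguments fails with the same kind of AttributeError when the argument is missing.

```python
# Toy stand-in for a backend-agnostic model library (NOT keras_applications itself).
def toy_resnext50(include_top=True, **kwargs):
    # The library has no framework of its own, so it reads it from the kwargs.
    backend = kwargs.get('backend')  # None when the caller passes nothing
    return backend.image_data_format()

class FakeBackend:
    """Hypothetical object standing in for keras.backend."""
    @staticmethod
    def image_data_format():
        return 'channels_last'

# Without a backend, the attribute lookup happens on None and fails.
try:
    toy_resnext50(include_top=False)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# With the backend injected, the same call succeeds.
data_format = toy_resnext50(include_top=False, backend=FakeBackend)
```

Passing `backend`, `layers`, `models` and `utils` explicitly, as in the fixed call above, is the same kind of injection.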

Some tips about using Keras

1. How to use part of a model

    model = load_model(sys.argv[1], custom_objects = {'fmeasure': fmeasure})
    branch_model = model.get_layer(name = 'model_1')
    img_embed = Model(inputs = [branch_model.get_input_at(0)],
                      outputs = [branch_model.get_output_at(0)])

The ‘img_embed’ model wraps ‘branch_model’, which is itself one part of the loaded model. Note that ‘Model()’ is a CPU-heavy call, so the sub-model should be created only once and then reused many times.
2. How to save a model when using ‘multi_gpu_model’

orig_model = Model(x, y)
model = multi_gpu_model(orig_model, gpus = 2)
model.compile(optim, loss='binary_crossentropy', metrics = ['acc'])
....
model.fit_generator()
orig_model.save(PATH)

We should keep a reference to the original model, because only through it can we save the model to a file.
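
For the first tip, one way to make sure the expensive construction runs only once is to cache the result. The sketch below uses a placeholder instead of the real load_model(...) / Model(...) calls; the function name and the file name 'model.h5' are made up for illustration.

```python
import functools

build_calls = 0  # counts how often the expensive constructor really runs

@functools.lru_cache(maxsize=None)
def get_img_embed(model_path):
    """Build the sub-model for model_path once; later calls return the cached object."""
    global build_calls
    build_calls += 1
    # Placeholder for the load_model(...) and Model(...) lines from tip 1.
    return {'source': model_path}

first = get_img_embed('model.h5')
second = get_img_embed('model.h5')  # served from the cache, nothing is rebuilt
```

Inside a prediction loop, calling `get_img_embed(path)` is then cheap after the first call.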

Using keras.layers.Embedding instead of python dictionary

At first, I used a function to transform words into word embeddings:

def text_to_array(text, embeddings_index):
    empty_embed = np.zeros(EMBEDDING_LENGTH, dtype = np.float32)
    text = text[:-1].split()[:MAX_TEXT_LENGTH]
    embeds = []
    for x in text:
        em = embeddings_index.get(x)
        if em is not None:
            embeds.append(em)
    embeds += [empty_embed] * (MAX_TEXT_LENGTH - len(embeds))
    return np.array(embeds, dtype = np.float32)

But I noticed that it used a lot of CPU while GPU usage stayed low. The reason is simple: looking words up in a dictionary with single-threaded Python is inefficient. We should use the Embedding layer in Keras to put the whole word-embedding table into GPU memory.
The code is not difficult to understand:

...
    # word_index = tokenizer.word_index
    nb_words = max(MAX_WORDS, len(word_index))
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
    for word, i in tqdm(word_index.items()):
        if i >= nb_words: continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None: embedding_matrix[i] = embedding_vector
...
    inp = Input(shape = (MAX_TEXT_LENGTH,))
    net = Embedding(embedding_matrix.shape[0], EMBEDDING_LENGTH, weights=[embedding_matrix], trainable = False)(inp)
...

This time, the program ran twice as fast as before. Using GPU memory (GDDR) for the word-embedding lookup is the right way.
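
With an Embedding layer, the input is a fixed-length row of integer word ids instead of a matrix of vectors. A dependency-free sketch of that conversion is below. In practice the Keras tokenizer and padding utilities would normally do this; the function name and the tiny vocabulary are assumptions, and, like text_to_array above, the sketch drops unknown words and pads at the end.

```python
def text_to_indices(text, word_index, max_len):
    """Map whitespace-separated words to integer ids, padded with 0 to max_len."""
    words = text.split()[:max_len]
    # Unknown words are skipped, mirroring text_to_array above.
    ids = [word_index[w] for w in words if w in word_index]
    return ids + [0] * (max_len - len(ids))

# Hypothetical vocabulary; ids start at 1 so that 0 can mean "padding".
word_index = {'the': 1, 'cat': 2, 'sat': 3}
row = text_to_indices('the cat quietly sat', word_index, max_len=6)
```

Each id in `row` selects one row of `embedding_matrix`, and that lookup happens inside the model rather than in a Python loop.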

Robin on Linux
