Initially, I used a function to transform each word into its word embedding:
def text_to_array(text, embeddings_index):
    empty_embed = np.zeros(EMBEDDING_LENGTH, dtype=np.float32)
    # Drop the trailing newline, split into words, truncate to MAX_TEXT_LENGTH.
    text = text[:-1].split()[:MAX_TEXT_LENGTH]
    embeds = []
    for x in text:
        em = embeddings_index.get(x)
        if em is not None:
            embeds.append(em)
    # Pad with zero vectors up to the fixed length.
    embeds += [empty_embed] * (MAX_TEXT_LENGTH - len(embeds))
    return np.array(embeds, dtype=np.float32)
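As a quick sanity check, the function can be exercised with a toy embedding index. The constants and the toy words here are illustrative values, not the ones from the real dataset:

```python
import numpy as np

EMBEDDING_LENGTH = 4   # toy value; real GloVe/fastText vectors are e.g. 300-d
MAX_TEXT_LENGTH = 5

def text_to_array(text, embeddings_index):
    empty_embed = np.zeros(EMBEDDING_LENGTH, dtype=np.float32)
    text = text[:-1].split()[:MAX_TEXT_LENGTH]
    embeds = []
    for x in text:
        em = embeddings_index.get(x)
        if em is not None:
            embeds.append(em)
    embeds += [empty_embed] * (MAX_TEXT_LENGTH - len(embeds))
    return np.array(embeds, dtype=np.float32)

# Toy index with two known words; unknown words are dropped, then zero-padded.
toy_index = {
    "hello": np.ones(EMBEDDING_LENGTH, dtype=np.float32),
    "world": np.full(EMBEDDING_LENGTH, 2.0, dtype=np.float32),
}
arr = text_to_array("hello unknown world\n", toy_index)
print(arr.shape)  # (5, 4): always MAX_TEXT_LENGTH rows
```

Note that every call walks the Python dict once per word, which is exactly the per-sample CPU work described next.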
But I noticed that this consumed a lot of CPU while GPU utilization stayed low. The reason is simple: doing dictionary lookups in single-threaded Python is inefficient. Instead, we can use the Embedding layer in Keras to put the whole word-embedding table into GPU memory.
The code is not difficult to understand:
...
# word_index = tokenizer.word_index
nb_words = min(MAX_WORDS, len(word_index) + 1)  # +1 because Tokenizer ids start at 1
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in tqdm(word_index.items()):
    if i >= nb_words:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
...
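Conceptually, what the Embedding layer does with this matrix is replace the per-word dict lookups with a single integer-indexing operation, which the GPU executes in one shot. A NumPy sketch with toy sizes:

```python
import numpy as np

nb_words, embed_size = 10, 4  # toy sizes
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(0.0, 0.1, (nb_words, embed_size)).astype(np.float32)

# A padded batch of token ids (what the tokenizer/padding step produces); 0 = padding.
batch = np.array([[3, 7, 0, 0],
                  [1, 2, 9, 0]])

# Embedding lookup is just fancy indexing: one matrix row per token id.
looked_up = embedding_matrix[batch]
print(looked_up.shape)  # (2, 4, 4): batch, seq_len, embed_size
```

The model code below wires exactly this lookup into the network, with the matrix frozen (`trainable=False`).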
inp = Input(shape=(MAX_TEXT_LENGTH,))
net = Embedding(embedding_matrix.shape[0], EMBEDDING_LENGTH,
                weights=[embedding_matrix], trainable=False)(inp)
...
This time, the program ran about twice as fast as before. Doing the embedding lookup in GPU memory (GDDR) is the right way.
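One detail the snippets above assume: with the Embedding layer, the model's input must be integer id sequences rather than strings. Keras' `Tokenizer` plus `pad_sequences` produce these; their behaviour can be sketched in plain Python (a simplification: the real `Tokenizer` assigns ids by word frequency, while this toy version assigns them in order of first appearance; both reserve 0 for padding):

```python
def build_word_index(texts):
    # Mirror keras Tokenizer's word_index: ids start at 1, 0 is reserved for padding.
    word_index = {}
    for text in texts:
        for word in text.split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index

def texts_to_padded_ids(texts, word_index, max_len):
    # Mirror texts_to_sequences + pad_sequences(padding='post'): truncate, then zero-pad.
    rows = []
    for text in texts:
        ids = [word_index.get(w, 0) for w in text.split()][:max_len]
        rows.append(ids + [0] * (max_len - len(ids)))
    return rows

texts = ["how are you", "you are here"]
word_index = build_word_index(texts)
ids = texts_to_padded_ids(texts, word_index, max_len=4)
print(ids)  # [[1, 2, 3, 0], [3, 2, 4, 0]]
```

These id matrices are what gets fed to `inp`, and the frozen Embedding layer turns them into vectors on the GPU.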