Initially, I used a function to transform each word into its word embedding:
def text_to_array(text, embeddings_index):
    empty_embed = np.zeros(EMBEDDING_LENGTH, dtype=np.float32)
    # Drop the trailing newline, split into words, truncate to MAX_TEXT_LENGTH.
    text = text[:-1].split()[:MAX_TEXT_LENGTH]
    embeds = []
    for x in text:
        em = embeddings_index.get(x)
        if em is not None:
            embeds.append(em)
    # Pad with zero vectors up to the fixed length.
    embeds += [empty_embed] * (MAX_TEXT_LENGTH - len(embeds))
    return np.array(embeds, dtype=np.float32)
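As a quick sanity check, the function can be exercised with a toy embedding index. The constants and the toy words here are illustrative values, not the ones from the real dataset:

```python
import numpy as np

EMBEDDING_LENGTH = 4   # toy value; real GloVe/fastText vectors are e.g. 300-d
MAX_TEXT_LENGTH = 5

def text_to_array(text, embeddings_index):
    empty_embed = np.zeros(EMBEDDING_LENGTH, dtype=np.float32)
    text = text[:-1].split()[:MAX_TEXT_LENGTH]
    embeds = []
    for x in text:
        em = embeddings_index.get(x)
        if em is not None:
            embeds.append(em)
    embeds += [empty_embed] * (MAX_TEXT_LENGTH - len(embeds))
    return np.array(embeds, dtype=np.float32)

# Toy index with two known words; unknown words are dropped, then zero-padded.
toy_index = {
    "hello": np.ones(EMBEDDING_LENGTH, dtype=np.float32),
    "world": np.full(EMBEDDING_LENGTH, 2.0, dtype=np.float32),
}
arr = text_to_array("hello unknown world\n", toy_index)
print(arr.shape)  # (5, 4): always MAX_TEXT_LENGTH rows
```

Note that every call walks the Python dict once per word, which is exactly the per-sample CPU work described next.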
But I noticed that this consumed a lot of CPU while GPU utilization stayed low. The reason is simple: doing dictionary lookups in single-threaded Python is inefficient. Instead, we can use the Embedding layer in Keras to put the whole word-embedding table into GPU memory.
The code is not difficult to understand:
...
# word_index = tokenizer.word_index
nb_words = min(MAX_WORDS, len(word_index) + 1)  # +1 because Tokenizer ids start at 1
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in tqdm(word_index.items()):
    if i >= nb_words:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
...
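Conceptually, what the Embedding layer does with this matrix is replace the per-word dict lookups with a single integer-indexing operation, which the GPU executes in one shot. A NumPy sketch with toy sizes:

```python
import numpy as np

nb_words, embed_size = 10, 4  # toy sizes
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(0.0, 0.1, (nb_words, embed_size)).astype(np.float32)

# A padded batch of token ids (what the tokenizer/padding step produces); 0 = padding.
batch = np.array([[3, 7, 0, 0],
                  [1, 2, 9, 0]])

# Embedding lookup is just fancy indexing: one matrix row per token id.
looked_up = embedding_matrix[batch]
print(looked_up.shape)  # (2, 4, 4): batch, seq_len, embed_size
```

The model code below wires exactly this lookup into the network, with the matrix frozen (`trainable=False`).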
inp = Input(shape=(MAX_TEXT_LENGTH,))
net = Embedding(embedding_matrix.shape[0], EMBEDDING_LENGTH,
                weights=[embedding_matrix], trainable=False)(inp)
...
This time, the program ran about twice as fast as before. Doing the embedding lookup in GPU memory (GDDR) is the right way.
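One detail the snippets above assume: with the Embedding layer, the model's input must be integer id sequences rather than strings. Keras' `Tokenizer` plus `pad_sequences` produce these; their behaviour can be sketched in plain Python (a simplification: the real `Tokenizer` assigns ids by word frequency, while this toy version assigns them in order of first appearance; both reserve 0 for padding):

```python
def build_word_index(texts):
    # Mirror keras Tokenizer's word_index: ids start at 1, 0 is reserved for padding.
    word_index = {}
    for text in texts:
        for word in text.split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index

def texts_to_padded_ids(texts, word_index, max_len):
    # Mirror texts_to_sequences + pad_sequences(padding='post'): truncate, then zero-pad.
    rows = []
    for text in texts:
        ids = [word_index.get(w, 0) for w in text.split()][:max_len]
        rows.append(ids + [0] * (max_len - len(ids)))
    return rows

texts = ["how are you", "you are here"]
word_index = build_word_index(texts)
ids = texts_to_padded_ids(texts, word_index, max_len=4)
print(ids)  # [[1, 2, 3, 0], [3, 2, 4, 0]]
```

These id matrices are what gets fed to `inp`, and the frozen Embedding layer turns them into vectors on the GPU.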