In the paper Attention Is All You Need, the Transformer neural network was introduced for the first time in 2017. One year later, BERT appeared. Last year I gave a short presentation at my previous company about the Transformer and BERT, as shown below:

A couple of days ago I started reviewing the Transformer paper, and I found myself recommending the article The Illustrated Transformer once again. It really helped me understand many of the details in the Transformer.

But one question still jumped out at me: what is the decoder in the Transformer actually for? How does information flow from the encoder to the decoder? After thinking about it for quite a while, I figured it out: the Transformer was originally designed for machine translation. The encoder “transforms” the source-language sentence into a set of Keys and Values; the decoder “transforms” a target-language word into a Query. Combining a Query with those Keys and Values through attention produces a vector, which is essentially the embedding of the next word in the target language.
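The Query-against-Keys-and-Values step can be sketched as scaled dot-product attention. Below is a minimal numpy sketch, not the paper's full multi-head implementation; the shapes and random inputs are my own assumptions for illustration:

```python
import numpy as np

def cross_attention(query, keys, values):
    """Scaled dot-product attention: one decoder Query attends
    over the encoder's Keys and Values."""
    d_k = keys.shape[-1]
    scores = query @ keys.T / np.sqrt(d_k)   # similarity of the Query to each Key
    weights = np.exp(scores - scores.max())  # softmax over source positions
    weights = weights / weights.sum()
    return weights @ values                  # weighted mix of the Values

rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 8))    # 5 source words, each a d_k=8 Key (illustrative sizes)
values = rng.normal(size=(5, 8))  # one Value per source word
query = rng.normal(size=(8,))     # one target-side word turned into a Query

out = cross_attention(query, keys, values)
print(out.shape)  # (8,) -- one vector in the target embedding space
```

The output has the same dimension as a Value, so it can serve as the representation of the next target word.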

Here is a diagram I drew. I hope it clears up the confusion I had.

“Ich bin ein guter Kerl” is German for “I am a good guy”. By encoding all the German words into Keys and Values, and decoding “good” into a Query, the Transformer can finally output the embedding vector of “guy”.
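To turn that output vector back into an actual word, a real Transformer applies a learned linear projection followed by a softmax over the vocabulary. As a toy sketch of the idea, a nearest-neighbor lookup against a tiny made-up embedding table does something similar (the vocabulary and one-hot embeddings here are purely illustrative):

```python
import numpy as np

# Hypothetical 4-word target vocabulary with one-hot "embeddings",
# purely for illustration -- real models learn dense embeddings.
vocab = ["I", "am", "good", "guy"]
emb = np.eye(4)

def nearest_word(vector):
    """Pick the word whose embedding is most similar to the
    attention output vector."""
    sims = emb @ vector
    return vocab[int(np.argmax(sims))]

# Suppose cross-attention produced a vector close to the "guy" embedding:
out = np.array([0.1, 0.0, 0.2, 0.9])
print(nearest_word(out))  # guy
```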