The Transformer neural network was first introduced in 2017, in the paper Attention Is All You Need. One year later, BERT appeared. Last year I gave a short presentation about the Transformer and BERT at my previous company, as shown below:
A few days ago I started reviewing the Transformer paper, and I found myself recommending the article The Illustrated Transformer once again. That article really helped me understand many of the details in the Transformer.
But one question still jumped out at me: what is the decoder in the Transformer actually for? How does information flow from the encoder to the decoder? After thinking about it for quite a while, I figured it out: the Transformer was originally built for machine translation. The encoder "transforms" the source-language sentence into a set of Keys and Values; the decoder "transforms" a target-language word into a Query. Using that Query together with the Keys and Values, the model produces a vector, which is effectively the embedding of the next word in the target language.
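Here is a minimal sketch of that Query/Key/Value interaction, assuming a single attention head and plain NumPy vectors (the real Transformer additionally learns projection matrices for Q, K, and V, uses multiple heads, and stacks several layers):

```python
import numpy as np

def cross_attention(query, keys, values):
    """Scaled dot-product attention: one decoder Query attends
    over the encoder's Keys and Values (single head, no masking)."""
    d_k = keys.shape[-1]
    # Similarity between the Query and every encoder Key.
    scores = query @ keys.T / np.sqrt(d_k)   # shape: (num_source_words,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over source positions
    # The output is a weighted mix of the encoder Values.
    return weights @ values                  # shape: (d_v,)
```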
Here is a diagram I drew; I hope it clears up my own confusion.
“Ich bin ein guter Kerl” is German for “I am a good guy”. By encoding all the German words into Keys and Values, and decoding “good” into a Query, the Transformer can finally output the embedding vector of “guy”.