For NLU (Natural Language Understanding), we typically use a bidirectional language model (like BERT), but for NLG (Natural Language Generation), a left-to-right unidirectional language model (like GPT) is the only choice.
Could we accomplish these two tasks by using one unified language model?
In this paper, the authors use a self-attention mask matrix to run these different tasks within one model:
The pivotal equation for this method is:
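Restated in LaTeX following the paper's notation (where $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}_l$ are linear projections of the previous layer's output $\mathbf{H}^{l-1}$), the masked self-attention of layer $l$ is:

$$
\mathbf{A}_l = \operatorname{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_k}} + \mathbf{M}\right)\mathbf{V}_l,
\qquad
\mathbf{M}_{ij} =
\begin{cases}
0, & \text{allow to attend} \\
-\infty, & \text{prevent from attending}
\end{cases}
$$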
“M is the mask matrix and determines whether a pair of tokens can be attended to each other.”
“Unidirectional LM is done by using a triangular matrix for the self-attention mask M (as in the above equation), where the upper triangular part of the self-attention mask is set to −∞, and the other elements to 0”
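To make the mask patterns concrete, here is a minimal NumPy sketch (not the authors' code) of the three configurations; `src_len`/`tgt_len` is my own naming for the two segments of a seq2seq example:

```python
import numpy as np

NEG_INF = -1e9  # stands in for -infinity in the additive attention mask M

def bidirectional_mask(seq_len):
    """BERT-style: every token can attend to every token."""
    return np.zeros((seq_len, seq_len))

def left_to_right_mask(seq_len):
    """Unidirectional LM: the upper triangle (future positions) is blocked,
    as in the quote above; everything else stays 0."""
    mask = np.zeros((seq_len, seq_len))
    mask[np.triu_indices(seq_len, k=1)] = NEG_INF
    return mask

def seq2seq_mask(src_len, tgt_len):
    """Seq2seq LM: source tokens attend bidirectionally among themselves;
    target tokens attend to the full source and only to earlier targets."""
    total = src_len + tgt_len
    mask = np.full((total, total), NEG_INF)
    mask[:src_len, :src_len] = 0.0            # source <-> source
    mask[src_len:, :src_len] = 0.0            # target  -> source
    mask[src_len:, src_len:] = left_to_right_mask(tgt_len)  # target -> earlier target
    return mask

# M is simply added to Q K^T / sqrt(d_k) before the softmax, e.g.:
# scores = q @ k.T / np.sqrt(d_k) + left_to_right_mask(seq_len)
```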
“Within one training batch, 1/3 of the time we use the bidirectional LM objective, 1/3 of the time we employ the sequence-to-sequence LM objective, and both left-to-right and right-to-left LM objectives are sampled with the rate of 1/6”
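For concreteness, a tiny sketch (names are mine, not from the paper's code) of drawing the per-batch objective with the quoted rates:

```python
import random

# Illustrative only: sample which LM objective a training batch uses.
OBJECTIVES = ["bidirectional", "seq2seq", "left-to-right", "right-to-left"]
RATES = [1/3, 1/3, 1/6, 1/6]

def sample_objective(rng=random):
    return rng.choices(OBJECTIVES, weights=RATES, k=1)[0]
```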
Note that what changes within a training batch is the objective (bidirectional / unidirectional / seq2seq), not the samples.