Idea

Previous Work

  • ELMo: uses bidirectional information (the model can access both the left and the right context), but on a traditional architecture (RNNs)
  • GPT: uses unidirectional information (left-to-right only), but on a more modern architecture (the Transformer)

Innovation

Train a Transformer on bidirectional information, so that every token's representation is conditioned on both its left and right context.
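
A standard left-to-right (or right-to-left) language model cannot simply be made bidirectional, because each word would then indirectly "see itself" through the context. BERT's answer is the masked language model: hide a fraction of the input tokens and predict only those. Below is a minimal sketch of that masking step, assuming integer token ids as input; `mask_id` and `vocab_size` are placeholders for the tokenizer's actual values.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15):
    """BERT-style masking: hide ~15% of tokens so the model must use
    both left and right context to predict them."""
    labels = [-100] * len(token_ids)  # -100: PyTorch's ignore index, position not predicted
    masked = list(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                   # predict the original token here
            r = random.random()
            if r < 0.8:
                masked[i] = mask_id           # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = random.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the token unchanged
    return masked, labels
```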

Framework

Steps

  • Pre-training: the model is trained on unlabeled data over different pre-training tasks
  • Fine-tuning: the model is first initialized with the pre-trained parameters, and then all of the parameters are fine-tuned using labeled data from the downstream task (see the sketch after this list)
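
The sketch below illustrates the fine-tuning step, assuming the Hugging Face `transformers` library (the paper itself does not prescribe one): the encoder is initialized from pre-trained weights, a fresh classification head is attached, and every parameter receives gradients from the labeled data.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Initialize from the pre-trained parameters; the classification head is new.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Toy labeled downstream data (hypothetical sentiment examples).
batch = tokenizer(["a great movie", "a dull movie"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
outputs = model(**batch, labels=labels)  # all parameters are fine-tuned
outputs.loss.backward()
optimizer.step()
```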

Tip

[CLS] is a special token that stands for classification: its final hidden state is used as the aggregate representation of the whole sequence for classification tasks. Since the input to BERT can be a sentence pair such as (Question, Answer), another special token [SEP] is used to separate the two sentences.
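
A quick way to see where these tokens land, again assuming the Hugging Face tokenizer: passing two sentences produces [CLS] at the start and a [SEP] after each sentence.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Where is the cat?", "On the mat.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'where', 'is', 'the', 'cat', '?', '[SEP]', 'on', 'the', 'mat', '.', '[SEP]']
```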

Embedding