Idea
Previous Works
- ELMo: uses bidirectional information (the model can access both the left and right context) on traditional architectures (RNNs/LSTMs)
- GPT: uses unidirectional information (left-to-right only) on a more modern architecture (the Transformer)
Innovation
Train a Transformer on bidirectional information, i.e., let every token condition on both its left and right context at once
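As a rough illustration (not from the paper), the difference shows up in the self-attention mask: a left-to-right model applies a causal mask, while BERT's self-attention is unmasked in both directions.

```python
import torch

seq_len = 5

# Unidirectional (GPT-style): token i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# Bidirectional (BERT-style): every token may attend to every position.
full_mask = torch.ones(seq_len, seq_len)

print(causal_mask)
print(full_mask)
```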
Framework
Steps
- Pre-training: the model is trained on unlabeled data over different pre-training tasks
- Fine-tuning: the model is first initialized with the pre-trained parameters, and then all of the parameters are fine-tuned using labeled data from the downstream task (see the sketch below)
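A minimal sketch of this two-step workflow, assuming the Hugging Face `transformers` library and a toy single-example sentiment task (both are assumptions of this note, not part of the paper):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Step 1: initialize the model with the pre-trained parameters.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

# Step 2: fine-tune all parameters on labeled downstream data
# (a single toy example here; a real run loops over a dataset).
inputs = tokenizer("this movie was great", return_tensors="pt")
labels = torch.tensor([1])  # hypothetical positive-sentiment label

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
```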
Tip
[CLS] is a special token placed at the start of every input and stands for the classification task. Since the input to BERT can be a sentence pair such as (Question, Answer), another special token, [SEP], is used to separate the two sentences (a toy packing example is sketched below).
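A toy sketch of how a sentence pair is packed into one input sequence; the word-level tokens here are hypothetical (real BERT uses WordPiece sub-tokens):

```python
# Hypothetical (Question, Answer) pair, already split into tokens.
question = ["what", "is", "bert"]
answer = ["a", "bidirectional", "transformer"]

# [CLS] starts the sequence; [SEP] closes each sentence.
tokens = ["[CLS]"] + question + ["[SEP]"] + answer + ["[SEP]"]

# Segment ids: 0 for [CLS] and the first sentence, 1 for the second.
segment_ids = [0] * (len(question) + 2) + [1] * (len(answer) + 1)

print(tokens)
print(segment_ids)
```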
Embedding
- Token Embeddings: See Word2Vec
- Position Embeddings: see positional encoding in Transformers (BERT learns these embeddings rather than using fixed sinusoids)
- Segment Embeddings: indicate whether a token belongs to sentence A or sentence B (see the sketch below)
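A minimal sketch, assuming BERT-base sizes (vocab 30522, max length 512, hidden size 768), of how the three embeddings are summed to form each token's input representation:

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768  # BERT-base style sizes

token_emb = nn.Embedding(vocab_size, hidden)   # token embeddings
position_emb = nn.Embedding(max_len, hidden)   # learned position embeddings
segment_emb = nn.Embedding(2, hidden)          # sentence A / sentence B

input_ids = torch.tensor([[101, 2054, 2003, 102]])  # toy token ids
segment_ids = torch.tensor([[0, 0, 0, 0]])           # all in sentence A
positions = torch.arange(input_ids.size(1)).unsqueeze(0)

# The input representation is the element-wise sum of the three embeddings.
embeddings = (token_emb(input_ids)
              + position_emb(positions)
              + segment_emb(segment_ids))
print(embeddings.shape)  # (1, 4, 768)
```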