Anonymous
The BERT model is a pretrained language model and one of the first widely adopted pretrained contextual language models; its architecture is based on the Transformer encoder introduced in the paper "Attention Is All You Need". It takes a sequence as input (usually text data) and outputs a context-aware representation of the input. "Context-aware" means that the representation of each item in the sequence is computed with respect to the other items in the sequence, which is done by a mechanism called self-attention.
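To make "context-aware representation" concrete, here is a minimal sketch of pulling token representations out of a pretrained BERT checkpoint. It assumes the Hugging Face transformers library and PyTorch are installed; the checkpoint name and example sentence are just illustrative choices.

```python
# Minimal sketch: context-aware token representations from a pretrained BERT.
# Assumes the Hugging Face `transformers` library and PyTorch are available.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "The bank raised interest rates."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One vector per token; the vector for "bank" depends on the whole sentence,
# which is exactly what "context-aware" means here.
token_embeddings = outputs.last_hidden_state  # shape: (1, num_tokens, 768)
print(token_embeddings.shape)
```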
More specifically, BERT consists of a stack of Transformer encoder blocks, where each block contains multiple self-attention heads followed by a feed-forward layer; a positional embedding is added to the input sequence so that each element carries information about its position. The self-attention mechanism is essentially a weighted average: for each item in the sequence, the weights come from how strongly that item attends to the other items, essentially how similar it is to them.
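The "weighted average" view of self-attention can be shown in a short toy implementation. This is a single-head sketch in plain NumPy under the usual scaled dot-product formulation; the matrix names and sizes are illustrative and not taken from BERT's actual code.

```python
# Toy single-head self-attention: each output vector is a weighted average
# of the value vectors, with weights given by query/key similarity.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) input sequence; Wq/Wk/Wv: projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # how much each item attends to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> attention weights
    return weights @ V                            # weighted average of the values

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8): one context-aware vector per input item
```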
BERT is generally stronger than traditional counterparts such as Word2Vec and GloVe because its embeddings are contextual rather than static. Moreover, it is pretrained on a large amount of data, so it performs very well on language-related tasks. Many NLP tasks, such as sentiment analysis or part-of-speech tagging, that used to require a lot of feature engineering can now be done with BERT.
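As an example of how little feature engineering is needed, a sentiment classifier can be run in a few lines with the Hugging Face pipeline API. The checkpoint below is a DistilBERT model fine-tuned on SST-2, used here as an illustrative BERT-style choice; any BERT variant fine-tuned for sentiment would work the same way.

```python
# Sentiment analysis with a BERT-style model and no hand-crafted features.
# Assumes the Hugging Face `transformers` library; the checkpoint is illustrative.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("The movie was surprisingly good."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```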