Anonymous
The BERT model is a pretrained language model and one of the first widely adopted pretrained contextual language models; its architecture is based on the Transformer encoder introduced in the paper "Attention Is All You Need". It takes a sequence as input (usually text data) and outputs a context-aware representation of the input. "Context-aware" means that the representation of each item in the sequence is computed with respect to the other items in the sequence, which is done by a mechanism called self-attention.
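To make "context-aware representation" concrete, here is a minimal sketch of pulling token representations out of a pretrained BERT checkpoint. It assumes the Hugging Face transformers library and PyTorch are installed; the checkpoint name and example sentence are just illustrative choices.

```python
# Minimal sketch: context-aware token representations from a pretrained BERT.
# Assumes the Hugging Face `transformers` library and PyTorch are available.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "The bank raised interest rates."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One vector per token; the vector for "bank" depends on the whole sentence,
# which is exactly what "context-aware" means here.
token_embeddings = outputs.last_hidden_state  # shape: (1, num_tokens, 768)
print(token_embeddings.shape)
```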
More specifically, BERT consists of a stack of Transformer encoder blocks, where each block contains multiple self-attention heads followed by a feed-forward layer; a positional embedding is added to the input sequence so that each element carries information about its position. The self-attention mechanism is essentially a weighted average: for each item in the sequence, the weights come from how strongly that item attends to the other items, essentially how similar it is to them.
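The "weighted average" view of self-attention can be shown in a short toy implementation. This is a single-head sketch in plain NumPy under the usual scaled dot-product formulation; the matrix names and sizes are illustrative and not taken from BERT's actual code.

```python
# Toy single-head self-attention: each output vector is a weighted average
# of the value vectors, with weights given by query/key similarity.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) input sequence; Wq/Wk/Wv: projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # how much each item attends to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> attention weights
    return weights @ V                            # weighted average of the values

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8): one context-aware vector per input item
```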
BERT is generally stronger than traditional counterparts such as Word2Vec and GloVe because its embeddings are contextual rather than static. Moreover, it is pretrained on a large amount of data, so it performs very well on language-related tasks. Many NLP tasks, such as sentiment analysis or part-of-speech tagging, that used to require a lot of feature engineering can now be done with BERT.
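As an example of how little feature engineering is needed, a sentiment classifier can be run in a few lines with the Hugging Face pipeline API. The checkpoint below is a DistilBERT model fine-tuned on SST-2, used here as an illustrative BERT-style choice; any BERT variant fine-tuned for sentiment would work the same way.

```python
# Sentiment analysis with a BERT-style model and no hand-crafted features.
# Assumes the Hugging Face `transformers` library; the checkpoint is illustrative.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("The movie was surprisingly good."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```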