<aside> 💡
https://arxiv.org/pdf/1810.04805 | BERT Paper Link
This paper builds closely on the Transformer paper, so it is advisable to read that one as well.
This is not a replacement for the original paper.
</aside>
For a certain something, I'll need to read a lot of papers in the coming days, and I thought, what better way to retain the knowledge than by writing blogs? (I know a lot of people are already writing blogs, I know it! But please let me cook.) So, here it is. Since the Transformer has already been written about so many times by so many writers, let's start with BERT!
The Transformer architecture was undoubtedly magnificent. Neural networks could now effectively learn from and generate tokens based on previous tokens, but here lay the problem that BERT set out to solve. You see, language models pre-trained the standard left-to-right way (OpenAI GPT, for example, uses a Transformer decoder like this) can only look at the previous tokens and couldn't care less about the tokens that come next, meaning it is completely possible for them to miss the context of a word.
Example:
I am water. I can code.
Here, a left-to-right model reads the first sentence and, seeing only the tokens to the left of "water", takes water to mean water (the fluid you drink) and not water (yours truly), because the clue "I can code" only shows up later.
This happens because a left-to-right model looks at the tokens one by one, from left to right, each token attending only to the tokens before it, and then generates the next token.
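To make that concrete, here's a tiny sketch (my own toy example, not something from the paper) of the causal mask a left-to-right model applies inside self-attention: row i has 1s only up to column i, so the token at position i literally cannot see anything to its right.

```python
import torch

# Toy illustration: a left-to-right (causal) attention mask for the sentence
# "I am water . I can code ." (illustrative whitespace tokenization).
tokens = ["I", "am", "water", ".", "I", "can", "code", "."]
n = len(tokens)

# Lower-triangular matrix: position i may attend only to positions j <= i.
causal_mask = torch.tril(torch.ones(n, n))

# "water" sits at position 2, so its row stops at column 2 --
# "can" and "code" are masked out (0s) and can never influence it.
print(causal_mask[2])
# tensor([1., 1., 1., 0., 0., 0., 0., 0.])
```

In a real model, the zeros in this mask become large negative numbers added to the attention scores before the softmax, which is what stops information flowing in from the right.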
This doesn't happen in BERT. BERT uses bidirectional self-attention: it looks at all the tokens simultaneously, left and right context alike, so BERT will know that the sentence talks about me.
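If you want to see this for yourself, here's a small hedged experiment using the Hugging Face `transformers` library (the model name, the sentences, and the cosine-similarity check are my own choices, not from the paper or this post): because BERT attends over the whole sentence, the vector it produces for "water" changes depending on the words around it, including the ones to its right.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def water_vector(sentence: str) -> torch.Tensor:
    """Return BERT's contextual embedding for the token 'water' in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    # Find where 'water' landed in the tokenized input.
    water_idx = inputs["input_ids"][0].tolist().index(
        tokenizer.convert_tokens_to_ids("water"))
    return hidden[water_idx]

v1 = water_vector("I am water. I can code.")     # "water" as a person
v2 = water_vector("I drank a glass of water.")   # "water" as a liquid
sim = torch.cosine_similarity(v1, v2, dim=0)
print(f"cosine similarity between the two 'water' vectors: {sim.item():.3f}")
```

You should see a similarity well below 1.0, i.e. the two "water" vectors are not the same, which is exactly the "same word, different context" behaviour bidirectional attention buys you.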
BERT is used for many tasks like