Natural Language Processing: the age of Transformers
Build • Olga Petrova • 06/08/20 • 12 min read
This article is the first installment of a two-post series on Building a machine reading comprehension system using the latest advances in deep learning for NLP. Stay tuned for the second part, where we'll introduce a pre-trained model called BERT that will take your NLP projects to the next level!
In the recent past, if you specialized in natural language processing (NLP), there may have been times when you felt a little jealous of your colleagues working in computer vision. It seemed as if they had all the fun: the annual ImageNet classification challenge, Neural Style Transfer, Generative Adversarial Networks, to name a few. At last, the dry spell is over, and the NLP revolution is well underway! It would be fair to say that the turning point was 2017, when the Transformer network was introduced in Google's "Attention Is All You Need" paper. Multiple further advances have followed since then, one of the most important being BERT - the subject of our next article.
To lay the groundwork for the Transformer discussion, let's start by looking at one of the common categories of NLP tasks: the sequence-to-sequence (seq2seq) problems. They are pretty much exactly what their name suggests: both the inputs and the outputs of a seq2seq task are sequences. In the context of NLP, there are typically additional restrictions put in place:
The elements of the sequence are tokens corresponding to some set vocabulary (often including an Unknown token for the out-of-vocabulary words).
The order inside the sequence matters.
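To make the two restrictions concrete, here is a minimal sketch of turning a sentence into a sequence of token ids. The vocabulary, the `<unk>` spelling of the Unknown token, and the `encode` helper are all hypothetical and chosen only for illustration; real systems build the vocabulary from a training corpus and usually use a more careful tokenizer than whitespace splitting.

```python
# Hypothetical toy vocabulary; id 0 is reserved for the Unknown token.
UNK = "<unk>"
vocab = {UNK: 0, "je": 1, "suis": 2, "étudiant": 3}


def encode(sentence, vocab):
    """Map a whitespace-tokenized sentence to a list of token ids."""
    tokens = sentence.lower().split()
    # Any word missing from the vocabulary falls back to the Unknown id.
    return [vocab.get(token, vocab[UNK]) for token in tokens]


ids_a = encode("je suis étudiant", vocab)
ids_b = encode("étudiant suis je", vocab)    # same words, different order
ids_c = encode("je suis professeur", vocab)  # "professeur" is out of vocabulary
```

Here `ids_a` and `ids_b` contain the same ids in a different order, so they are different sequences, and the out-of-vocabulary word in `ids_c` is mapped to the Unknown id.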
Next we shall take a moment to remember the fallen heroes, without whom we would not be where we are today. I am, of course, referring to the RNNs - Recurrent Neural Networks, a concept that became almost synonymous with NLP in the deep learning field.
The predecessor to Transformers: the RNN Encoder-Decoder
This story takes us all the way back to 2014 (Ref, another Ref), when the idea of approaching seq2seq problems via two Recurrent Neural Networks combined into an Encoder-Decoder model was born. Let's demonstrate this architecture on a simple example from the Machine Translation task. Take a French-English sentence pair, where the input is "je suis étudiant" and the output is "I am a student". First, "je" (or, most likely, a word embedding for the token representing "je"), often accompanied by a constant vector hE0 which could be either learned or fixed, gets fed into the Encoder RNN. This results in the output vector hE1 (hidden state 1), which serves as the next input for the Encoder RNN, together with the second element in the input sequence, "suis". The output of this operation, hE2, and "étudiant" are again fed into the Encoder, producing the last Encoder hidden state for this training sample, hE3. The hE3 vector depends on all of the tokens inside the input sequence, so the idea is that it should represent the meaning of the entire phrase. For this reason it is also referred to as the context vector.
The context vector is the first input to the Decoder RNN, which should then generate the first element of the output sequence, "I" (in reality, the last layer of the Decoder is typically a softmax over the vocabulary, but for simplicity we can just keep the most likely element at the end of every Decoder step). Additionally, the Decoder RNN produces a hidden state hD1. We feed hD1 and the previous output "I" back into the Decoder to hopefully get "am" as our second output. This process of generating and feeding outputs back into the Decoder continues until we produce an <EOS> - the end of the sentence token, which signifies that our job here is done.
The RNN Encoder-Decoder model in action.
To avoid any confusion, there is something that I would like to draw your attention to. The multiple RNN blocks appear in the Figure because of the multiple elements of the sequence that get fed into / generated by the networks, but make no mistake - there is only one Encoder RNN and one Decoder RNN at play here. It may help to think of the repeated blocks as the same RNN at different timesteps, or as multiple RNNs with shared weights that are invoked one after another.
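The walkthrough above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the setup from the 2014 papers: the cell is a plain tanh RNN rather than an LSTM or GRU, all sizes and vocabularies are hypothetical, hE0 is fixed to zeros, the context vector is used as the Decoder's initial hidden state, and the weights are random and untrained, so the generated tokens are arbitrary. The point is the data flow: a single set of Encoder weights and a single set of Decoder weights are reused at every timestep.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabularies and sizes, chosen only for illustration.
src_vocab = {"je": 0, "suis": 1, "étudiant": 2}
tgt_vocab = ["<EOS>", "I", "am", "a", "student"]
emb_dim, hid_dim = 8, 16

# Untrained random parameters: one Encoder RNN and one Decoder RNN.
src_emb = rng.normal(size=(len(src_vocab), emb_dim))
tgt_emb = rng.normal(size=(len(tgt_vocab), emb_dim))
enc_Wx = rng.normal(size=(hid_dim, emb_dim)) * 0.1
enc_Wh = rng.normal(size=(hid_dim, hid_dim)) * 0.1
dec_Wx = rng.normal(size=(hid_dim, emb_dim)) * 0.1
dec_Wh = rng.normal(size=(hid_dim, hid_dim)) * 0.1
dec_out = rng.normal(size=(len(tgt_vocab), hid_dim)) * 0.1


def rnn_step(Wx, Wh, x, h):
    """Plain tanh RNN cell: new hidden state from input x and previous state h."""
    return np.tanh(Wx @ x + Wh @ h)


def encode(tokens):
    h = np.zeros(hid_dim)  # hE0, fixed to zeros in this sketch
    states = []
    for tok in tokens:  # the same Encoder weights are applied at every timestep
        h = rnn_step(enc_Wx, enc_Wh, src_emb[src_vocab[tok]], h)
        states.append(h)
    return states  # states[-1] plays the role of the context vector


def decode(context, max_len=10):
    h = context               # assumption: context vector initializes the Decoder state
    prev = np.zeros(emb_dim)  # no previous output exists at the first step
    output = []
    for _ in range(max_len):
        h = rnn_step(dec_Wx, dec_Wh, prev, h)
        logits = dec_out @ h
        idx = int(np.argmax(logits))  # greedy choice; argmax of softmax equals argmax of logits
        output.append(tgt_vocab[idx])
        if tgt_vocab[idx] == "<EOS>":
            break
        prev = tgt_emb[idx]  # feed the chosen token back into the Decoder
    return output


encoder_states = encode(["je", "suis", "étudiant"])
context_vector = encoder_states[-1]
translation = decode(context_vector)
```

With trained weights, `translation` would ideally be `["I", "am", "a", "student", "<EOS>"]`; here it only demonstrates the generate-and-feed-back loop and the stop condition.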
This architecture may seem simple (especially until we sit down to actually write the code with LSTMs or GRUs thrown in for good measure), but it turns out to be remarkably effective for many NLP tasks. In fact, Google Translate has been using it under the hood since 2016. However, the RNN Encoder-Decoder models do suffer from certain drawbacks:
First problem with RNNs: Attention to the rescue
The RNN approach as described above does not work particularly well for longer sentences. Think about it: the meaning of the entire input sequence is expected to be captured by a single context vector with fixed dimensionality. This could work well enough for "Je suis étudiant", but what if your input looks more like this:
"It was a wrong number that started it, the telephone ringing three times in the dead of night, and the voice on the other end asking for someone he was not."
Good luck encoding that into a context vector! However, there turns out to be a solution, known as the Attention mechanism.
Schematics of (left) a conventional RNN Encoder-Decoder and (right) an RNN Encoder-Decoder with Attention
The basic idea behind Attention is simple: instead of passing only the last hidden state (the context vector) to the Decoder, we give it all the hidden states that come out of the Encoder. In our example that would mean hE1, hE2 and hE3. At every step, the Decoder determines which of them gets attended to (i.e., where to pay attention) via a softmax layer, which assigns each Encoder hidden state a weight between 0 and 1. Apart from adding this additional structure, the basic RNN Encoder-Decoder architecture remains the same, yet the resulting model performs much better when it comes to longer input sequences.
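As a rough sketch of a single Attention step: score every Encoder hidden state against the current Decoder state, turn the scores into weights with a softmax, and take the weighted sum of the Encoder states. The scoring function here is a plain dot product, which is a simplification chosen for brevity; attention in RNN Encoder-Decoder models is often computed with a small learned scoring network instead. The `attend` helper, the array shapes and the random vectors are all hypothetical stand-ins.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()


def attend(decoder_state, encoder_states):
    """Dot-product attention: weight each Encoder state by its similarity to the Decoder state."""
    scores = encoder_states @ decoder_state  # one score per Encoder state
    weights = softmax(scores)                # positive weights that sum to 1
    context = weights @ encoder_states       # weighted sum of Encoder states
    return context, weights


rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(3, 4))  # stand-ins for hE1, hE2, hE3
decoder_state = rng.normal(size=4)        # stand-in for the current Decoder hidden state

context, weights = attend(decoder_state, encoder_states)
```

Unlike the single fixed context vector, `context` is recomputed at every Decoder step, so different output tokens can draw on different parts of the input.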
Second problem with Recurrent NNs: they are (surprise!) Recurrent
The other problem plaguing RNNs has to do with the R inside the name: the computation in a Recurrent neural network is, by definition, sequential. What does this property entail? A sequential computation cannot be parallelized across timesteps, since we have to wait for the previous step to finish before we move on to the next one. This lengthens both the training time and the time it takes to run inference.
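The sequential bottleneck is easy to see in code. In the sketch below (a plain tanh RNN with hypothetical toy sizes and random weights), the input projections do not depend on each other and can be computed for all timesteps in one matrix product, whereas each hidden state needs the previous one, so that part has to run as an ordered loop.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, in_dim, hid_dim = 5, 3, 4  # hypothetical toy sizes
X = rng.normal(size=(seq_len, in_dim))    # one input vector per timestep
Wx = rng.normal(size=(hid_dim, in_dim))
Wh = rng.normal(size=(hid_dim, hid_dim))

# No dependence between positions: one matrix product covers every timestep.
projected = X @ Wx.T  # shape (seq_len, hid_dim)

# The recurrence is different: h_t needs h_{t-1}, so the steps must run in order.
h = np.zeros(hid_dim)
hidden_states = []
for t in range(seq_len):
    h = np.tanh(projected[t] + Wh @ h)
    hidden_states.append(h)
hidden_states = np.stack(hidden_states)  # shape (seq_len, hid_dim)
```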
One of the ways around the sequential dilemma is to use Convolutional neural networks (CNNs) instead of…