Understanding text with BERT
Build • Olga Petrova • 10/09/20 • 17 min read
This article is the second installment of a two-part post on Building a machine reading comprehension system using the latest advances in deep learning for NLP. Here we are going to look at a new language representation model called BERT (Bidirectional Encoder Representations from Transformers).
The written word has allowed human communication to transcend both time and distance, revolutionizing the world order in the process. For most of us, understanding text seems almost too trivial a task to reflect upon, yet there is much complexity to it. Decoding symbols to extract meaning requires not only a wide enough vocabulary, but also the ability to choose the appropriate meaning of a word based on the context, and the ability to understand the organization of a sentence as well as a passage. One of the ways to assess reading comprehension is to pose questions based on a given text. This task is familiar to all of us from childhood, so it comes as no surprise that machine reading comprehension is also often cast as question-answering.
The Data: meet the SQuAD
There are multiple ways of answering questions based on a text. For instance, this may or may not involve text summarization and/or inference - tactics that are necessary when the answer is not explicitly stated in the body of the text. In the simpler case where it is, the task narrows down to span extraction.
For this machine learning task, the inputs come in the form of a Context / Question pair, and the outputs are Answers: pairs of integers indexing the start and the end of the answer's text inside the Context. This is precisely how one of the largest question-answering datasets, the Stanford Question Answering Dataset (SQuAD), is organized. Here is one training sample from SQuAD 1.0:
Context: Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
Question: The Basilica of the Sacred heart at Notre Dame is beside to which structure?
Answer: start_position: 49, end_position: 51
49 and 51 are the indices that span "the Main Building" in the fourth sentence of the Context.
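As a rough illustration of span extraction, the sketch below recovers the answer text from the two indices. It assumes that the indices count whitespace-separated words starting from zero and that the end index is inclusive; the article does not spell this convention out, but it is consistent with the sample above. The helper name `extract_span` is made up for this example, and only the first four sentences of the Context are included for brevity.

```python
# First four sentences of the Context from the sample above.
context = (
    "Architecturally, the school has a Catholic character. "
    "Atop the Main Building's gold dome is a golden statue of the Virgin Mary. "
    "Immediately in front of the Main Building and facing it, is a copper statue "
    'of Christ with arms upraised with the legend "Venite Ad Me Omnes". '
    "Next to the Main Building is the Basilica of the Sacred Heart."
)


def extract_span(text, start_position, end_position):
    """Return the words from start_position to end_position (inclusive).

    Assumption: positions index whitespace-separated words, counted from zero.
    """
    words = text.split()
    return " ".join(words[start_position:end_position + 1])


answer = extract_span(context, 49, 51)
print(answer)  # the Main Building
```

A span-extraction model never generates this text itself: it only has to predict the two integers, and the answer is then read off the Context.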
Transfer learning in NLP
Having humans read text passages, then ask and answer questions based on them, is a time-consuming and expensive task. Even the largest datasets, such as SQuAD, rarely go beyond 100,000 training samples - which is not that much in the deep learning world. A technique called transfer learning is invaluable when training data is scarce; however, its use in NLP has been limited compared to the wide success that transfer learning has enjoyed in computer vision. To a large extent, the latter has been due to the existence of ImageNet - a dataset with over 14 million images, each hand-annotated with the label of the subject that appears in the photo.
The idea behind transfer learning is simple: we develop a model for a task for which we have enough training data, then use that model (or a part of it) as a starting point for a different (downstream, or target) task, presumably one that we don't have as much data for. The tricky part? Choosing the right task for the pre-training stage. We want the resulting model to generalise sufficiently well beyond the original problem: thus, the features of the data that the model learns to identify have to be, in some sense, general. For computer vision the solution came in the form of image classification. It turned out that many downstream tasks could benefit from building upon a network that had initially been trained to recognise objects that appear in images. Conveniently, such networks can be trained on ImageNet! Unfortunately, NLP has no equivalent of ImageNet: a vast labeled dataset that can be used for supervised learning. To make matters worse, it is not clear which task is best to use for pre-training. Language modeling? Natural language inference?
In this article, we will look at BERT: one of the major milestones in transfer learning for NLP. Here is the TL;DR summary for the impatient:
BERT is the Encoder of the Transformer that has been trained on two supervised tasks, which have been created out of the Wikipedia corpus in an unsupervised way: 1) predicting words that have been randomly masked out of sentences and 2) determining whether sentence B could follow sentence A in a text passage. The result is a pre-trained Encoder that embeds words while taking into account their surrounding context. When supplemented with an additional fully connected layer, BERT was able to achieve state-of-the-art results on 11 downstream tasks at the time that it was released in 2018.
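To see how labeled training examples can be created from raw text without any human annotation, here is a minimal sketch of the two tasks in plain Python. It is an illustration under stated assumptions, not BERT's actual data pipeline: the function names and the sample sentences are invented, the 15% masking rate and the 50/50 split between true and random sentence pairs are taken from the BERT paper rather than from this article, and the masking is simplified so that every selected token becomes `[MASK]` (the paper sometimes keeps the token or swaps in a random one instead).

```python
import random

MASK = "[MASK]"


def make_masked_example(tokens, rng, mask_prob=0.15):
    """Hide random tokens; return the masked tokens and the hidden originals.

    Simplification: every selected token is replaced by [MASK].
    """
    masked = list(tokens)
    labels = {}  # position -> original token the model should predict
    for i, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK
            labels[i] = token
    return masked, labels


def make_next_sentence_example(sentences, i, rng):
    """Pair sentence i with its true successor or with a random other sentence.

    Requires i < len(sentences) - 1 and at least three sentences.
    """
    sentence_a = sentences[i]
    if rng.random() < 0.5:
        return sentence_a, sentences[i + 1], True
    # Negative example: any sentence except sentence i and its true successor.
    others = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
    return sentence_a, rng.choice(others), False


rng = random.Random(0)  # fixed seed so that runs are repeatable
sentences = [
    "the school has a catholic character",
    "atop the main building is a golden statue",
    "next to the main building is the basilica",
    "behind the basilica is the grotto",
]
masked, labels = make_masked_example(sentences[0].split(), rng)
sentence_a, sentence_b, is_next = make_next_sentence_example(sentences, 1, rng)
print(masked, labels)
print(sentence_a, "|", sentence_b, "|", is_next)
```

In both cases the labels (the hidden words, and whether B really follows A) come for free from the text itself, which is what makes the tasks supervised in form but unsupervised in how the data is obtained.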
What is BERT?
BERT: Bidirectional Encoder Representations from Transformers.
Much like the title of the Attention is all you need paper, the meaning of the acronym BERT is the epitome of a spoiler. Being an Encoder of a Transformer (I bet Representation was mainly put in there to make the abbreviation work - too bad, I would have rather had Pre-trained in the name), BERT is Bidirectional by design due to the nature of the Encoder Self-Attention in the Transformer architecture. BERT seeks to provide a pre-trained method for obtaining contextualized word embeddings, which can then be used for a wide variety of downstream NLP tasks. And provide it does - at the time that the BERT paper was published in 2018, BERT-based NLP models surpassed the previous state-of-the-art results on eleven different NLP tasks, including Question-Answering. A pre-trained BERT model serves as a way to embed words in a given sentence while taking into account their context: the final word embeddings are none other than the hidden states produced by the Transformer's Encoder.
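To make "bidirectional" concrete, here is a toy version of unmasked self-attention in plain Python. It is a deliberately stripped-down sketch: the queries, keys and values are the input vectors themselves, with none of the learned projections, multiple heads or stacked layers of the actual Transformer Encoder, and the two-dimensional vectors are invented for illustration. What it does show is that each output vector mixes in information from positions on both sides, so the same input vector yields a different output in a different context.

```python
import math


def self_attention(vectors):
    """Unmasked scaled dot-product self-attention with Q = K = V = vectors."""
    d = len(vectors[0])
    outputs = []
    for query in vectors:
        # Similarity of this position to every position, left and right alike.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in vectors
        ]
        # Softmax over the scores (shifted by the maximum for stability).
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # The output is a weighted average of all the input vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)]
        )
    return outputs


# Two "sentences" that share their first two vectors and differ only in the last.
base = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
changed = [[1.0, 0.0], [0.0, 1.0], [-1.0, 2.0]]

out_base = self_attention(base)
out_changed = self_attention(changed)
print(out_base[0])
print(out_changed[0])
```

The first position holds the same input vector in both sequences, yet its output differs because a later position changed. A left-to-right model with a causal mask would not show this effect at position 0.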
What are the tasks that BERT had been (pre)trained on? Turns out…