OpenAI, published June 11, 2018

Improving language understanding with unsupervised learning

Illustration: Ben Barry

We’ve obtained state-of-the-art results on a suite of diverse language tasks with a scalable, task-agnostic system, which we’re also releasing. Our approach combines two existing ideas: transformers and unsupervised pre-training. These results provide a convincing example that pairing supervised learning methods with unsupervised pre-training works very well; this is an idea that many have explored in the past, and we hope our result motivates further research into applying this idea on larger and more diverse datasets.

| Dataset | Task | SOTA | Ours |
|---|---|---|---|
| SNLI | Textual entailment | 89.3 | 89.9 |
| MNLI matched | Textual entailment | 80.6 | 82.1 |
| MNLI mismatched | Textual entailment | 80.1 | 81.4 |
| SciTail | Textual entailment | 83.3 | 88.3 |
| QNLI | Textual entailment | 82.3 | 88.1 |
| RTE | Textual entailment | 61.7 | 56.0 |
| STS-B | Semantic similarity | 81.0 | 82.0 |
| QQP | Semantic similarity | 66.1 | 70.3 |
| MRPC | Semantic similarity | 86.0 | 82.3 |
| RACE | Reading comprehension | 53.3 | 59.0 |
| ROCStories | Commonsense reasoning | 77.6 | 86.5 |
| COPA | Commonsense reasoning | 71.2 | 78.6 |
| SST-2 | Sentiment analysis | 93.2 | 91.3 |
| CoLA | Linguistic acceptability | 35.0 | 45.4 |
| GLUE | Multi-task benchmark | 68.9 | 72.8 |
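
To read the results at a glance, the margin over the previous state of the art can be computed per row. The following is a minimal Python sketch that uses only the scores listed above; the variable names are illustrative and not part of the released system.

```python
# Scores copied from the results listed above: (previous SOTA, ours).
results = {
    "SNLI": (89.3, 89.9),
    "MNLI matched": (80.6, 82.1),
    "MNLI mismatched": (80.1, 81.4),
    "SciTail": (83.3, 88.3),
    "QNLI": (82.3, 88.1),
    "RTE": (61.7, 56.0),
    "STS-B": (81.0, 82.0),
    "QQP": (66.1, 70.3),
    "MRPC": (86.0, 82.3),
    "RACE": (53.3, 59.0),
    "ROCStories": (77.6, 86.5),
    "COPA": (71.2, 78.6),
    "SST-2": (93.2, 91.3),
    "CoLA": (35.0, 45.4),
    "GLUE": (68.9, 72.8),
}

# Margin over the previous state of the art, in points, rounded to one decimal.
deltas = {name: round(ours - sota, 1) for name, (sota, ours) in results.items()}

# Rows where the new system is ahead of, or behind, the previous best.
improved = [name for name, d in deltas.items() if d > 0]
regressed = [name for name, d in deltas.items() if d < 0]

# Print rows from the largest gain to the largest loss.
for name, d in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:16s} {d:+.1f}")
```

On these numbers the system is ahead in 12 of the 15 rows (counting the aggregate GLUE score as a row) and behind on RTE, MRPC, and SST-2.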

Our system works in two stages: first we train a transformer model on a very large amount of data in an unsupervised manner, using language modeling as a training signal; then we fine-tune this model on much smaller supervised datasets to help it solve specific tasks. We developed this approach following our sentiment neuron work, in which we noted that unsupervised learning techniques can yield surprisingly discriminative features when trained on enough data. Here, we wanted to further explore this idea: can we develop one model, train it in an unsupervised way on a large amount of data, and then fine-tune the model to achieve good performance on many different tasks? Our results indicate that this approach works surprisingly well; the same core model can be fine-tuned for very different tasks with minimal adaptation.
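
The first stage needs no labels because the text itself supplies the targets: each token is predicted from the tokens that precede it. Below is a minimal Python sketch of how raw text turns into training pairs. The function name `next_token_examples`, the whitespace tokenizer, and the `context_size` parameter are assumptions made for illustration; this is not the released system's code.

```python
from typing import List, Tuple


def next_token_examples(
    tokens: List[str], context_size: int
) -> List[Tuple[List[str], str]]:
    """Turn raw text into (context, next token) pairs without any human labels."""
    examples = []
    for i in range(1, len(tokens)):
        # Use up to `context_size` preceding tokens as the model input.
        context = tokens[max(0, i - context_size):i]
        examples.append((context, tokens[i]))
    return examples


# Whitespace tokenization is a simplification for illustration only.
tokens = "the movie was surprisingly good".split()
pairs = next_token_examples(tokens, context_size=3)
for context, target in pairs:
    print(context, "->", target)
```

Every position in the text yields one training example, which is why this signal scales with the amount of raw text available rather than with labeling effort.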

This work builds on the approach introduced in Semi-supervised Sequence Learning, which showed how to improve document classification performance by using unsupervised pre-training of an LSTM followed by supervised fine-tuning. It also extends ULMFiT, research that shows how a single dataset-agnostic LSTM language model can be fine-tuned to get state-of-the-art performance on a variety of document classification datasets; our work shows how a Transformer-based model can be used in this approach to succeed at a broader range of tasks beyond document classification, such as commonsense reasoning, semantic similarity, and reading comprehension. It is also similar to, but more task-agnostic than, ELMo, which incorporates pre-training but uses task-customized architectures to get state-of-the-art results on a broad suite of tasks.

Very little tuning was used to achieve our results. All datasets use a single forward language model, without any ensembling, and the majority of the reported results use the exact same hyperparameter settings.

A result we are particularly excited about is the performance of our approach on three datasets (COPA, RACE, and ROCStories) designed to test commonsense reasoning and reading comprehension. Our model obtains new state-of-the-art results on these datasets by a wide margin: 7.4, 5.7, and 8.9 points respectively over the previous best. These datasets are thought to require multi-sentence reasoning and significant world knowledge to solve, suggesting that our model improves these skills predominantly via unsupervised learning. This gives hope for developing complex language understanding capabilities via unsupervised techniques.

Why unsupervised learning?

Supervised learning is at the core of most of the recent success of machine learning. However, it can require large, carefully cleaned, and expensive-to-create datasets to work well. Unsupervised learning is attractive because of its potential to address these drawbacks. Since unsupervised learning removes the bottleneck of explicit human labeling, it also scales well with current trends of increasing compute and availability of raw data. Unsupervised learning is a very active area of research, but practical uses of it are often still limited.

There’s been a recent push to try to further language capabilities by using unsupervised learning to augment systems with large amounts of unlabeled data; representations of words trained via unsupervised techniques can use large datasets consisting of terabytes of information and, when integrated with supervised learning, improve performance on a wide range of NLP tasks. Until recently, these unsupervised techniques for NLP (for example, GloVe and word2vec) used simple models (word vectors) and training signals (the local co-occurrence of words). Skip-Thought Vectors is a notable early demonstration of the potential improvements more complex approaches can realize. But new techniques are now being used which are further boosting performance. These include the use of pre-trained sentence representation models, contextualized word vectors (notably ELMo and CoVe), and approaches which use customized architectures to fuse unsupervised pre-training with supervised fine-tuning, like our own.
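
The "local co-occurrence" training signal can be made concrete with a small example. The Python sketch below only gathers counts of which words appear near each other inside a fixed window; it does not train any vectors and is not the GloVe or word2vec algorithm. The function name `cooccurrence_counts` and the `window` parameter are hypothetical names chosen for illustration.

```python
from collections import Counter
from typing import Iterable, List


def cooccurrence_counts(sentences: Iterable[List[str]], window: int = 2) -> Counter:
    """Count ordered (word, neighbor) pairs that occur within `window` positions."""
    counts: Counter = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own neighbor
                    counts[(word, tokens[j])] += 1
    return counts


# A tiny made-up corpus, tokenized on whitespace for simplicity.
corpus = [
    "the film was great".split(),
    "the film was dull".split(),
]
counts = cooccurrence_counts(corpus, window=1)
print(counts[("film", "was")])
```

Statistics like these describe each word only by its immediate neighbors, which is the limitation that sentence-level and contextualized representations try to move past.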

We also noticed we can use the underlying language model to begin to perform tasks without ever training on them. For example, performance on tasks like picking the right answer to a multiple choice question steadily increases as the underlying language model improves. While the absolute performance of these methods is still often quite low compared to the supervised state-of-the-art (for question answering it is still outperformed by a simple sliding-window baseline), it is encouraging that this behavior is robust across a broad set of tasks. Randomly initialized networks containing no information about the task and the world perform no better than random using these heuristics. This provides some insight into why generative pre-training can improve performance on downstream tasks.
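
One way to picture performing a task without training on it is to let the language model score each candidate answer and pick the one it finds most likely. The Python sketch below illustrates that idea with a toy bigram model standing in for the Transformer. The scoring rule (average per-token log-probability), the function names, and the tiny corpus are all assumptions made for illustration; they are not necessarily the heuristics behind the reported results.

```python
import math
from collections import Counter
from typing import Callable, List


def train_bigram_lm(corpus: List[List[str]]) -> Callable[[str, str], float]:
    """Return log P(word | previous word) for an add-one-smoothed bigram model."""
    context_counts: Counter = Counter()
    bigram_counts: Counter = Counter()
    vocab = set()
    for tokens in corpus:
        vocab.update(tokens)
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    size = len(vocab)

    def log_prob(prev: str, word: str) -> float:
        return math.log((bigram_counts[(prev, word)] + 1) / (context_counts[prev] + size))

    return log_prob


def pick_answer(
    log_prob: Callable[[str, str], float],
    question: List[str],
    answers: List[List[str]],
) -> int:
    """Return the index of the answer with the highest average token log-probability.

    Assumes a non-empty question and non-empty answers.
    """
    def score(answer: List[str]) -> float:
        tokens = question + answer
        start = len(question)
        total = sum(log_prob(tokens[i - 1], tokens[i]) for i in range(start, len(tokens)))
        return total / len(answer)

    return max(range(len(answers)), key=lambda k: score(answers[k]))


# Unlabeled text is the only "training" data; no question-answer pairs are used.
corpus = [
    "the sky is blue".split(),
    "the grass is green".split(),
    "the sky is blue today".split(),
]
lm = train_bigram_lm(corpus)
choice = pick_answer(lm, "the sky is".split(), [["green"], ["blue"]])
print(choice)
```

A model with no knowledge of the text would assign every candidate the same score, so its choice would be arbitrary; this mirrors the observation that randomly initialized networks do no better than random.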

We can also use the existing language functionality in the model to perform sentiment analysis. For the Stanford Sentiment Treebank dataset, which consists of sentences from positive and negative movie reviews, we can use the language model to guess whether a review is…
