What does this writing signal mean?

OpenAI Writing: Discovering types for entity disambiguation

Captured source

source ↗

openai.com/openai.com/index/discovering-types-for-entity-disambiguation

Discovering types for entity disambiguation

Source ↗

published Feb 7, 2018seen 6dcaptured 3dhttp 200method exa

Discovering types for entity disambiguation | OpenAI

February 7, 2018

Publication

Discovering types for entity disambiguation

Read paper

Loading…

We’ve built a system for automatically figuring out which object is meant by a word by having a neural network decide if the word belongs to each of about 100 automatically-discovered “types” (non-exclusive categories).

For example, given a sentence like “the prey saw the jaguar cross the jungle”, rather than trying to reason directly whether jaguar means the car, the animal, or something else, the system plays “20 questions” with a pre-chosen set of categories. This approach gives a big boost in state-of-the-art on several entity disambiguation datasets.

We achieve 94.88% accuracy on CoNLL (YAGO)⁠(previous state of the arts: 91.50⁠%, 91.70⁠%) and 90.85% on TAC KBP 2010 challenge⁠(previous state of the arts: 87.20⁠%, and 87.70⁠%). Previous methods used distributed representations⁠. Types can go almost all the way on these tasks, as perfect type prediction would give accuracies of 98.6-99%.

High-level overview

Our system uses the following steps:

1. Extract every Wikipedia-internal link to determine, for each word, the set of conceivable entities it can refer to. For example, when encountering the link[jaguar](https://en.wikipedia.org/wiki/Jaguar) in a Wikipedia page, we conclude thathttps://en.wikipedia.org/wiki/Jaguar is one of the meanings of jaguar. 2. Walk the Wikipedia category tree (using the Wikidata⁠ knowledge graph) to determine, for each entity, the set of categories it belongs to. For example, at the bottom of Jaguar cars Wikipedia page⁠ are the following categories (which themselves have their own categories, such as Automobiles⁠): “British brands | Car brands | Jaguar cars | Jaguar vehicles.” 3. Pick a list of ~100 categories to be your “type” system, and optimize over this choice of categories so that they compactly express any entity. We know the mapping of entities to categories, so given a type system, we can represent each entity as a ~100-dimensional binary vector indicating membership in each category. 4. Using every Wikipedia-internal link and its surrounding context, produce training data mapping a word plus context to the ~100-dimensional binary representation of the corresponding entity, and train a neural network to predict this mapping. This chains together the previous steps: Wikipedia links map a word to an entity, we know the categories for each entity from step 2, and step 3 picked the categories in our type system. 5. At test time, given a word and surrounding context, our neural network’s output can be interpreted as the probability that the word belongs to each category. If we knew the exact set of category memberships, we would narrow down to one entity (assuming well-chosen categories). But instead, we must play a probabilistic 20 questions: use Bayes’ theorem⁠ to calculate the chance of the word disambiguating to each of its possible entities.

More examples

Here are some other examples of our system in action:

Cleaning the data

Wikidata’s knowledge graph can be turned into a source of training data for mapping fine-grained entities to types. We apply its instance of⁠ relation recursively to determine the set of types for any given entity — for example, any descendent node of the human⁠ node has type human. Wikipedia can also provide entity-to-type mapping through its category link⁠.

Wikipedia-internal link statistics provide a good estimate of the chance a particular phrase refers to some entity. However, this is noisy since Wikipedia will often link to specific instance of a type rather than the type itself (anaphora⁠— e.g. king → Charles I of England) or link from a nickname (metonymy⁠). This results in an explosion of associated entities (e.g. king has 974 associated entities) and distorted link frequencies (e.g. queen links to the band Queen⁠ 4920 times, Elizabeth II⁠ 1430 times, and monarch⁠ only 32 times).

The easiest approach is to prune⁠ rare⁠ links⁠, but this loses information. We instead use the Wikidata property graph to heuristically turn links into their “generic” meaning, as illustrated below.

After this process, king goes from 974 to 14 associated entities, while the number of links from queen to monarch⁠ increases from 32 to 3553.

Learning a good type system

We need to select the best type system and parameters such that disambiguation accuracy is maximized. There’s a huge number of possible sets of types, making an exact solution intractable. Instead, we use heuristic search or stochastic optimization (evolutionary algorithm) to select a type system, and gradient descent to train a type classifier to predict the behavior of the type system.

We need to select types that are discriminating (so quickly whittle down the possible set of entities), while being easy to learn (so surrounding context is informative for a neural network to infer that a type applies). We inform our search with two heuristics: learnability (average of area under the curve⁠ [AUC] scores of a classifier trained to predict type membership), and oracle accuracy (how well we would disambiguate entities if we predicted all types perfectly).

Type system evolution

We train binary classifiers to predict membership in each of the 150,000 most common types in our dataset, given a window of context. The area under the curve⁠(AUC) of the classifier becomes the “learnability score” for that type. High AUC means it’s easy to predict this type from context; poor performance can mean we have little training data or that a word window isn’t terribly helpful (this tends to be true for unnatural categories like ISBNs⁠). Our full model takes several days to train, so we instead use a much smaller model as a proxy in our “learnability score”, which takes only 2.5s to train.

We can now use these learnability scores and count statistics to estimate the performance of a given subset of types as our type system. Below you can run the Cross Entropy Method⁠ to discover types in your browser. Note how changing sample size and penalties affects the solution.

To better visualize what parts of the type system design are easy and hard, we invite you to try your hand at designing your own below. After choosing a high-level domain you can start looking at ambiguous examples. The possible answers are shown as circles on…

Excerpt shown — open the source for the full document.