google-deepmind/transformer_grammars
Description: Transformer Grammars: Augmenting Transformer Language Models with Syntactic Inductive Biases at Scale, TACL (2022)
Language: Python
License: Apache-2.0
Stars: 137
Forks: 8
Open issues: 17
Created: 2023-01-27T14:30:43Z
Pushed: 2026-05-20T00:04:26Z
Default branch: main
Fork: no
Archived: no
README:
Transformer Grammars
Transformer Grammars are Transformer-like models of the joint structure and sequence of words of a sentence or document. Specifically, they model the sequence of actions describing a linearized tree. Their distinguishing feature is that the attention mask used in the Transformer core is a function of the structure itself, so that representations of constituents are composed recursively. The approach is fully described in our paper Transformer Grammars: Augmenting Transformer Language Models with Syntactic Inductive Biases at Scale, Sartran et al., TACL (2022), available from MIT Press.
Code organization
The code is organized as follows:
transformer_grammars/
├─ transformer_grammars/   TG module used by the entrypoints
│  ├─ data/                Dataset, tokenizer, input transformation, etc.
│  ├─ models/              Core model code
│  │  ├─ masking/          C++ masking code, for the attention mask, relative
│  │                       positions, etc.
│  ├─ training/            Training loop
├─ configs/                Configs used for the paper
├─ example/                Example data + scripts to train and use a model
├─ tools/                  Misc tools to prepare the data, cf. below
├─ train.py                Entrypoint for training
├─ score.py                Entrypoint for scoring
├─ sample.py               Entrypoint for sampling
Installation
This code was tested on Google Cloud Compute Engine, using an N1 instance, an NVIDIA V100 GPU, and the disk image Debian 10 based Deep Learning VM with CUDA 11.3 preinstalled, M102. In particular, the Python version it contains is 3.7, for which we install the corresponding jaxlib package.
1. Download the code from the transformer_grammars repository.
git clone https://github.com/deepmind/transformer_grammars.git
cd transformer_grammars
2. Create a virtual environment.
python -m venv .tgenv
source .tgenv/bin/activate
3. Install the package (in development mode) and its dependencies.
./install.sh
This also builds the C++ extension that is required to compute the attention mask, relative positions, memory update, etc.
4. Run the test suite.
nosetests transformer_grammars
Quick start: example
We provide in the example/ directory parsed sentences from Dickens's A Tale of Two Cities, prepared using spaCy for sentence segmentation (en_core_web_md) and Benepar for parsing (benepar_en3_large), and split into {train,valid,test}.txt. The following can be done from that directory:
- Run the data preparation described below at once with ./prepare_data.sh.
- Train a model for a few steps with ./run_training.sh.
- Use it to score the test set with ./run_scoring.sh.
- Use it to generate samples with ./run_sampling.sh.
NOTE: Such a small model trained for so few steps will give bad results -- this is only meant as an end-to-end example of training and using a TG model. To really train and use the model, please follow the instructions below.
Training and using a TG model
Data preparation
Expected input
The expected input format is one tree per line, with POS tags, e.g.
(S (S (NP (NNP John) (NNP Blair) (CC &) (NNP Co.)) (VP (VBZ is) (ADVP (RB close) (PP (TO to) (NP (DT an) (NN agreement) (S (VP (TO to) (VP (VB sell) (NP (PRP$ its) (NX (NX (NN TV) (NN station) (NN advertising) (NN representation) (NN operation)) (CC and) (NX (NN program) (NN production) (NN unit)))) (PP (TO to) (NP (NP (DT an) (NN investor) (NN group)) (VP (VBN led) (PP (IN by) (NP (NP (NNP James) (NNP H.) (NNP Rosenfield)) (, ,) (NP (DT a) (JJ former) (NNP CBS) (NNP Inc.) (NN executive))))))))))))))) (, ,) (NP (NN industry) (NNS sources)) (VP (VBD said)) (. .))
We assume that 3 such files are available: train.txt, valid.txt, test.txt in $DATA.
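A quick sanity check before conversion can catch lines that do not hold exactly one tree. The following is a minimal sketch, not part of the repository: the helper names is_single_tree and check_file are hypothetical, and it assumes literal parentheses in the text are escaped (for example as -LRB- and -RRB-), so that every ( and ) is structural.

```python
from typing import List


def is_single_tree(line: str) -> bool:
    """Return True if `line` holds exactly one balanced bracketed tree.

    Assumption: parentheses only appear as tree brackets, never inside words.
    """
    text = line.strip()
    if not text.startswith("(") or not text.endswith(")"):
        return False
    depth = 0
    last = len(text) - 1
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # closing bracket without a matching opener
            if depth == 0 and i != last:
                return False  # a second top-level tree follows on the same line
    return depth == 0


def check_file(path: str) -> List[int]:
    """Return the 1-based numbers of the lines that are not a single tree."""
    with open(path, encoding="utf-8") as f:
        return [n for n, line in enumerate(f, start=1) if not is_single_tree(line)]
```

Running check_file on each of train.txt, valid.txt and test.txt and expecting an empty list is one way to use it.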
Whilst not specific to our work, we describe below how we prepared the example data.
Convert to Choe-Charniak
Convert all files to "Choe-Charniak" format (i.e. one sequence of actions per line, including opening and closing non-terminals but excluding POS tags) using tools/convert_to_choe_charniak.py:
for SPLIT in train valid test
do
python tools/convert_to_choe_charniak.py --input $DATA/${SPLIT}.txt --output $DATA/${SPLIT}.cc
done
Now, there are two ways of transforming this data into sequences of integers for modelling:
- one uses SentencePiece (recommended), with which we can train a tokenization model
- the other involves learning a word-based vocabulary
Exactly one or the other needs to be done.
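To make the Choe-Charniak format described above concrete, here is a minimal sketch of the conversion for a single tree. It is an illustration only, not the code in tools/convert_to_choe_charniak.py, which remains authoritative. The function name to_choe_charniak is hypothetical, and the sketch assumes every word sits under a POS tag, every bracket has a non-empty label, and literal parentheses are escaped.

```python
import re
from typing import List

# Split a bracketed tree into "(", ")" and whitespace-free symbols.
_TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")
_BRACKETS = ("(", ")")


def to_choe_charniak(tree: str) -> str:
    """Linearize one POS-tagged tree into a sequence of actions.

    Non-terminals become "(X" ... "X)"; POS tags are dropped and only the
    word is kept.
    """
    tokens = _TOKEN_RE.findall(tree)
    out: List[str] = []
    stack: List[str] = []  # labels of the currently open non-terminals
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            label = tokens[i + 1]
            is_preterminal = (
                i + 3 < len(tokens)
                and tokens[i + 2] not in _BRACKETS
                and tokens[i + 3] == ")"
            )
            if is_preterminal:
                # "( TAG word )": emit the word only.
                out.append(tokens[i + 2])
                i += 4
            else:
                stack.append(label)
                out.append("(" + label)
                i += 2
        elif tok == ")":
            out.append(stack.pop() + ")")
            i += 1
        else:
            raise ValueError(f"unexpected token outside a POS tag: {tok!r}")
    if stack:
        raise ValueError("unbalanced tree: unclosed non-terminals remain")
    return " ".join(out)
```

For example, to_choe_charniak("(S (NP (DT The) (JJ blue) (NN bird)) (VP (VBZ sings)))") returns "(S (NP The blue bird NP) (VP sings VP) S)".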
With SentencePiece
Training a tokenizer (SentencePiece)
Create a directory and set the $TOKENIZER environment variable to its path.
Train a tokenizer on the training data with the following, adjusting the user-defined symbols to reflect the non-terminals in the data. The list can be obtained automatically with perl -0pe 's/ /\n/g' < $DATA/train.cc | grep -E '\(|\)' | LC_ALL=C sort | uniq | perl -0pe 's/\n/,/g', but do check it.
MODEL_PREFIX=$TOKENIZER/spm
NON_TERMINALS=`perl -0pe 's/ /\n/g' < $DATA/train.cc | grep -E '\(|\)' | LC_ALL=C sort | uniq | perl -0pe 's/\n/,/g'`
spm_train \
--input=$DATA/train.cc \
--model_prefix=$MODEL_PREFIX \
--vocab_size=32768 \
--character_coverage=1.0 \
--pad_id=0 \
--bos_id=1 \
--eos_id=2 \
--unk_id=3 \
--user_defined_symbols=${NON_TERMINALS::-1} \
--max_sentence_length=100000 \
--shuffle_input_sentence=true
This produces two files, $TOKENIZER/spm.model and $TOKENIZER/spm.vocab. The first one is the actual model, the second is the vocabulary, one token per line.
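Because the list of user-defined symbols should be checked by hand, it can help to build it in a form that is easy to inspect. The sketch below mirrors the perl/grep/sort pipeline used for NON_TERMINALS in Python. The function name collect_non_terminals is hypothetical, and the sketch assumes the .cc files are whitespace-separated, as in the Choe-Charniak example.

```python
from typing import Iterable, List


def collect_non_terminals(lines: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated tokens that contain a parenthesis.

    This keeps both opening ("(NP") and closing ("NP)") symbols, like
    grep -E '\\(|\\)' in the shell pipeline. Python sorts strings by code
    point, which matches LC_ALL=C byte order for UTF-8 text.
    """
    symbols = set()
    for line in lines:
        for tok in line.split():
            if "(" in tok or ")" in tok:
                symbols.add(tok)
    return sorted(symbols)


def as_flag_value(symbols: List[str]) -> str:
    """Join the symbols with commas, with no trailing comma."""
    return ",".join(symbols)
```

Opening $DATA/train.cc and passing the file object to collect_non_terminals yields the list to review; as_flag_value then gives a string in the comma-separated form expected by --user_defined_symbols.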
Tokenizing the data (SentencePiece)
Encode the train/validation/test data with:
for SPLIT in train valid test
do
spm_encode \
--output_format=id \
--model=$TOKENIZER/spm.model \
--input=$DATA/${SPLIT}.cc \
--output=$DATA/${SPLIT}.enc
done
We want our user-defined symbols to implicitly represent whitespace before and after them as necessary, but SentencePiece treats them literally. We therefore end up with extraneous space tokens:
(S (NP The blue bird NP) (VP sings VP) S)
is tokenized into
▁ (S ▁ (NP ▁The ▁blue ▁ bird ▁ NP) ▁…