What does this repo signal mean?

Google (DeepMind / Gemini) published google-deepmind/transformer_grammars, a Python repository. Repository signals like this one can surface tooling, evaluation, infrastructure, or model-adjacent work before it appears in a launch post. High-signal details: repo google-deepmind/transformer_grammars · language Python. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Google (DeepMind / Gemini) Repo: google-deepmind/transformer_grammars

Captured source

GitHub · github.com/google-deepmind/transformer_grammars

google-deepmind/transformer_grammars repository metadata

published Jan 27, 2023 · seen 5d · captured 10h · http 200 · method plain

google-deepmind/transformer_grammars

Description: Transformer Grammars: Augmenting Transformer Language Models with Syntactic Inductive Biases at Scale, TACL (2022)

Language: Python

License: Apache-2.0

Stars: 137

Forks: 8

Open issues: 17

Created: 2023-01-27T14:30:43Z

Pushed: 2026-05-20T00:04:26Z

Default branch: main

Fork: no

Archived: no

README:

Transformer Grammars

Transformer Grammars are Transformer-like models of the joint structure and sequence of words of a sentence or document. Specifically, they model the sequence of actions describing a linearized tree. Their distinguishing feature is that the attention mask used in the Transformer core is a function of the structure itself, so that representations of constituents are composed recursively. The approach is fully described in our paper Transformer Grammars: Augmenting Transformer Language Models with Syntactic Inductive Biases at Scale, Sartran et al., TACL (2022), available from MIT Press.

Code organization

The code is organized as follows:

transformer_grammars/
├─ transformer_grammars/    TG module used by the entrypoints
│  ├─ data/                 Dataset, tokenizer, input transformation, etc.
│  ├─ models/               Core model code
│  │  ├─ masking/           C++ masking code, for the attention mask, relative
│  │                        positions, etc.
│  ├─ training/             Training loop
├─ configs/                 Configs used for the paper
├─ example/                 Example data + scripts to train and use a model
├─ tools/                   Misc tools to prepare the data, cf. below
├─ train.py                 Entrypoint for training
├─ score.py                 Entrypoint for scoring
├─ sample.py                Entrypoint for sampling

Installation

This code was tested on Google Cloud Compute Engine, using an N1 instance, an NVIDIA V100 GPU, and the disk image Debian 10 based Deep Learning VM with CUDA 11.3 preinstalled, M102. In particular, the Python version it contains is 3.7, for which we install the corresponding jaxlib package.

1. Download the code from the transformer_grammars repository.

git clone https://github.com/deepmind/transformer_grammars.git
cd transformer_grammars

2. Create a virtual environment.

python -m venv .tgenv
source .tgenv/bin/activate

3. Install the package (in development mode) and its dependencies.

./install.sh

This also builds the C++ extension that is required to compute the attention mask, relative positions, memory update, etc.

4. Run the test suite.

nosetests transformer_grammars

Quick start: example

We provide in the example/ directory parsed sentences from Dickens's A Tale of Two Cities, prepared using spaCy for sentence segmentation (en_core_web_md) and Benepar for parsing (benepar_en3_large), and split into {train,valid,test}.txt. The following can be done from that directory:

Run the data preparation described below in one go with ./prepare_data.sh.
Train a model for a few steps with ./run_training.sh.
Use it to score the test set with ./run_scoring.sh.
Use it to generate samples with ./run_sampling.sh.

NOTE: Such a small model trained for so few steps will give bad results -- this is only meant as an end-to-end example of training and using a TG model. To really train and use the model, please follow the instructions below.

Training and using a TG model

Data preparation

Expected input

The expected input format is one tree per line, with POS tags, e.g.

(S (S (NP (NNP John) (NNP Blair) (CC &) (NNP Co.)) (VP (VBZ is) (ADVP (RB close) (PP (TO to) (NP (DT an) (NN agreement) (S (VP (TO to) (VP (VB sell) (NP (PRP$ its) (NX (NX (NN TV) (NN station) (NN advertising) (NN representation) (NN operation)) (CC and) (NX (NN program) (NN production) (NN unit)))) (PP (TO to) (NP (NP (DT an) (NN investor) (NN group)) (VP (VBN led) (PP (IN by) (NP (NP (NNP James) (NNP H.) (NNP Rosenfield)) (, ,) (NP (DT a) (JJ former) (NNP CBS) (NNP Inc.) (NN executive))))))))))))))) (, ,) (NP (NN industry) (NNS sources)) (VP (VBD said)) (. .))
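To inspect such a line programmatically, the following minimal Python sketch reads one bracketed tree and lists its (POS tag, word) pairs. It is not part of the repository: the names `parse_tree` and `leaves` are hypothetical, and the code assumes well-formed input in which no token contains whitespace or parentheses and every word sits directly under a POS tag.

```python
import re

# Hypothetical helpers for illustration; not taken from the repository.
TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")


def parse_tree(line):
    """Parse one bracketed tree into nested lists: [label, child, ...].

    A preterminal such as (NNP John) becomes ["NNP", "John"].
    """
    tokens = TOKEN_RE.findall(line)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return [label] + children

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def leaves(tree):
    """Return (POS tag, word) pairs in sentence order."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(label, children[0])]
    pairs = []
    for child in children:
        if isinstance(child, list):
            pairs.extend(leaves(child))
    return pairs
```

Calling `leaves(parse_tree(line))` on each line of train.txt is one quick way to check that every word carries a POS tag before running the conversion step.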

We assume that 3 such files are available: train.txt, valid.txt, test.txt in $DATA.

Whilst not specific to our work, we describe below how we prepared the example data.

Convert to Choe-Charniak

Convert all files to "Choe-Charniak" format (i.e. one sequence of actions, including opening and closing non-terminals, but excluding POS tags, per line) using tools/convert_to_choe_charniak.py:

for SPLIT in train valid test
do
  python tools/convert_to_choe_charniak.py --input "$DATA/${SPLIT}.txt" --output "$DATA/${SPLIT}.cc"
done
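To make the target format concrete, the following Python sketch performs the same kind of conversion on a single tree. It is an illustration only, not the code in tools/convert_to_choe_charniak.py: the function name `to_choe_charniak` is hypothetical, and the sketch assumes well-formed trees in which every word sits directly under a POS tag.

```python
import re

# Hypothetical illustration; the repository's own converter may differ.
TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")
PARENS = ("(", ")")


def to_choe_charniak(line):
    """Turn "(S (NP (DT The) ...))" into "(S (NP The ... NP) S)"."""
    tokens = TOKEN_RE.findall(line)
    out = []
    stack = []  # labels of the non-terminals that are currently open
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            label = tokens[i + 1]
            is_preterminal = (
                i + 3 < len(tokens)
                and tokens[i + 2] not in PARENS
                and tokens[i + 3] == ")"
            )
            if is_preterminal:
                # "(TAG word)": keep the word, drop the POS tag.
                out.append(tokens[i + 2])
                i += 4
            else:
                # Opening non-terminal action, e.g. "(NP".
                out.append("(" + label)
                stack.append(label)
                i += 2
        elif tok == ")":
            # Closing non-terminal action, e.g. "NP)".
            out.append(stack.pop() + ")")
            i += 1
        else:
            raise ValueError(f"word outside a POS tag: {tok!r}")
    if stack:
        raise ValueError("unbalanced tree: unclosed non-terminals")
    return " ".join(out)
```

Each opening non-terminal becomes one action, each word becomes one action, and each closing bracket becomes a closing action that repeats the label, which is the shape of the sequences the later tokenization steps consume.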

Now, there are two ways of transforming this data into sequences of integers for modelling:

one uses SentencePiece (recommended), with which we can train a tokenization model
the other involves learning a word-based vocabulary

Exactly one or the other needs to be done.

With SentencePiece

Training a tokenizer (SentencePiece)

Create a directory and set the $TOKENIZER environment variable to its path.

Train a tokenizer on the training data with the following command, adjusting the user-defined symbols to reflect the non-terminals in the data. The list can be obtained automatically with perl -0pe 's/ /\n/g' < $DATA/train.cc | grep -E '\(|\)' | LC_ALL=C sort | uniq | perl -0pe 's/\n/,/g', but do check it.

MODEL_PREFIX="$TOKENIZER/spm"
NON_TERMINALS=$(perl -0pe 's/ /\n/g' < "$DATA/train.cc" | grep -E '\(|\)' | LC_ALL=C sort | uniq | perl -0pe 's/\n/,/g')
spm_train \
  --input="$DATA/train.cc" \
  --model_prefix="$MODEL_PREFIX" \
  --vocab_size=32768 \
  --character_coverage=1.0 \
  --pad_id=0 \
  --bos_id=1 \
  --eos_id=2 \
  --unk_id=3 \
  --user_defined_symbols="${NON_TERMINALS::-1}" \
  --max_sentence_length=100000 \
  --shuffle_input_sentence=true
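For readers who prefer to build and review the symbol list in Python, here is a minimal sketch of what the perl/grep/sort/uniq pipeline computes. The function name `non_terminal_symbols` is hypothetical and not part of the repository. Like the shell pipeline, it keeps any whitespace-separated token containing a parenthesis, so a word with a literal parenthesis would be included too, which is why the list should be checked by hand.

```python
def non_terminal_symbols(lines):
    """Collect the unique tokens containing "(" or ")", sorted, comma-joined.

    Hypothetical helper mirroring the shell pipeline: Python sorts strings
    by code point, which matches the byte order of LC_ALL=C sort on UTF-8.
    No trailing comma is produced, so nothing needs to be stripped.
    """
    symbols = set()
    for line in lines:
        for token in line.split():
            if "(" in token or ")" in token:
                symbols.add(token)
    return ",".join(sorted(symbols))
```

With a file on disk, the input can be an open file handle, e.g. `non_terminal_symbols(open(path, encoding="utf-8"))`, where `path` points at train.cc.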

This produces two files, $TOKENIZER/spm.model and $TOKENIZER/spm.vocab. The first one is the actual model, the second is the vocabulary, one token per line.

Tokenizing the data (SentencePiece)

Encode the train/validation/test data with:

for SPLIT in train valid test
do
  spm_encode \
    --output_format=id \
    --model="$TOKENIZER/spm.model" \
    --input="$DATA/${SPLIT}.cc" \
    --output="$DATA/${SPLIT}.enc"
done

We want our user-defined symbols to implicitly represent whitespace before and after them as necessary, but SentencePiece treats them literally. We therefore end up with extraneous space tokens:

(S (NP The blue bird NP) (VP sings VP) S)

is tokenized into

▁ (S ▁ (NP ▁The ▁blue ▁ bird ▁ NP) ▁…
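As an illustration of the problem, the Python sketch below drops a standalone whitespace piece whenever it directly precedes a non-terminal symbol. This excerpt does not show how the repository handles these tokens, so this is an assumption-laden sketch and not its method: the function names are hypothetical, non-terminals are recognized only by a leading "(" or trailing ")", and the code works on pieces, whereas the encoding step above emits ids, so a real version would compare against the corresponding ids.

```python
WHITESPACE_PIECE = "\u2581"  # "▁", the SentencePiece whitespace marker


def is_non_terminal(piece):
    """Heuristic (assumption): "(NP" opens and "NP)" closes a non-terminal."""
    return piece.startswith("(") or piece.endswith(")")


def drop_extraneous_spaces(pieces):
    """Remove standalone "▁" pieces that directly precede a non-terminal."""
    kept = []
    for i, piece in enumerate(pieces):
        next_piece = pieces[i + 1] if i + 1 < len(pieces) else None
        if (
            piece == WHITESPACE_PIECE
            and next_piece is not None
            and is_non_terminal(next_piece)
        ):
            continue  # the non-terminal already implies the whitespace
        kept.append(piece)
    return kept
```

Whitespace markers fused to a word, such as "▁The", are left untouched, since they carry real spacing information for the terminals.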
