What does this repo signal mean?

Amazon (Nova) published amazon-science/temporal-reasoning-dataset (Python). Repository signals like this one expose tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo amazon-science/temporal-reasoning-dataset · language Python · Amazon Science dataset for temporal reasoning tasks in NLP. onlylabs links this event to 1 captured evidence page and 6 related repo signals. It also maps to the "Data demand" and "Evals and quality" categories in the data-business radar.

Amazon (Nova) Repo: amazon-science/temporal-reasoning-dataset

Captured source

GitHub: github.com/amazon-science/temporal-reasoning-dataset

amazon-science/temporal-reasoning-dataset repository metadata

published May 13, 2026 · seen Jun 5 · captured Jun 11 · http 200 · method plain

amazon-science/temporal-reasoning-dataset

Description: 🔬 Replication Package for the paper: "Benchmarking Multilingual Temporal Reasoning in LLMs: The Temporal Reasoning Dataset"

Language: Python

License: NOASSERTION

Stars: 2

Forks: 0

Open issues: 0

Created: 2026-05-13T13:07:08Z

Pushed: 2026-05-13T14:50:54Z

Default branch: main

Fork: no

Archived: no

README:

🕐 Temporal Reasoning Dataset (TRD)

[License: CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)

> 🔬 Replication Package for the paper:
> ["Benchmarking Multilingual Temporal Reasoning in LLMs: The Temporal Reasoning Dataset"](https://aclanthology.org/2026.iwsds-1.19.pdf)
> *Presented at IWSDS 2026*

> ⚠️ Academic Release Notice
> This code is being released solely for academic and scientific reproducibility purposes, in support of the methods and findings described in the associated publication. Pull requests are not being accepted in order to maintain the code exactly as it was used in the paper.

A programmatically generated, multilingual benchmark designed to evaluate temporal reasoning capabilities in Large Language Models (LLMs). This package allows you to reproduce the dataset used in our research with a single command.

---

📖 Overview

TRD generates question-answer pairs across 10 languages and 9 temporal task types, enabling systematic evaluation of LLM temporal reasoning through 4 experimental axes:

| Experiment | Description | Default Samples |
|------------|-------------|-----------------|
| 🔄 Variations | Medium difficulty baseline across all task types | 9,000 |
| 📈 Difficulties | 5 difficulty levels (short to very_very_long) | 45,000 |
| 🎯 Insertions | Contextual distractors (similar/dissimilar) | 18,000 |
| 🧠 Memorization | Temporal shift 2025-2095 (8 year epochs) | 32,000 |

📊 Total: 104,000 samples with default settings (seed=9 for reproducibility).
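The totals in the table can be cross-checked with a few lines of arithmetic. This is a sketch under stated assumptions, not something the README spells out: the figure of 100 samples per task and language is inferred from 9,000 / (9 × 10), and the 10-year step between memorization epochs is inferred from the `2025_` and `2035_` file names in the output layout. The per-epoch breakdown of the memorization experiment is not given in the excerpt, so only the epoch count is checked.

```python
# Default sample counts per experiment, copied from the table above.
DEFAULT_SAMPLES = {
    "variations": 9_000,
    "difficulties": 45_000,
    "insertions": 18_000,
    "memorization": 32_000,
}

NUM_TASKS = 9
NUM_LANGUAGES = 10
SAMPLES_PER_CELL = 100  # assumption: inferred from 9,000 / (9 * 10)

variations = NUM_TASKS * NUM_LANGUAGES * SAMPLES_PER_CELL
difficulties = 5 * variations  # 5 difficulty levels
insertions = 2 * variations    # similar + dissimilar distractors

# Assumption: epochs are 10 years apart (2025, 2035, ..., 2095).
epochs = list(range(2025, 2096, 10))

total = sum(DEFAULT_SAMPLES.values())
print(variations, difficulties, insertions, len(epochs), total)
# -> 9000 45000 18000 8 104000
```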

---

🌍 Supported Languages

Our dataset spans 10 languages across multiple language families to ensure comprehensive multilingual evaluation:

| Code | Language | Family |
|------|----------|--------|
| 🇺🇸 en_US | English | Indo-European (Germanic) |
| 🇪🇸 es_ES | Spanish | Indo-European (Romance) |
| 🇩🇪 de_DE | German | Indo-European (Germanic) |
| 🇫🇷 fr_FR | French | Indo-European (Romance) |
| 🇮🇹 it_IT | Italian | Indo-European (Romance) |
| 🇧🇷 pt_BR | Portuguese | Indo-European (Romance) |
| 🇳🇱 nl_NL | Dutch | Indo-European (Germanic) |
| 🇯🇵 ja_JP | Japanese | Japonic |
| 🇸🇦 ar_SA | Arabic | Afro-Asiatic |
| 🇮🇳 hi_IN | Hindi | Indo-European (Indo-Aryan) |

---

📋 Task Types

Nine carefully designed temporal reasoning tasks:

| # | Task | Description |
|---|------|-------------|
| 1️⃣ | date_addition | Adding days/weeks/months/years to a date |
| 2️⃣ | date_subtraction | Subtracting days/weeks/months/years from a date |
| 3️⃣ | time_addition | Adding hours/minutes to a time |
| 4️⃣ | time_subtraction | Subtracting hours/minutes from a time |
| 5️⃣ | date_duration | Days between two dates |
| 6️⃣ | time_duration | Hours/minutes between two times |
| 7️⃣ | date_recurrence | Next occurrence of recurring events |
| 8️⃣ | interval_date | Date intervals and ranges |
| 9️⃣ | day_of_week | Day name for a given date |
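To make the task descriptions concrete, here is a minimal sketch of the arithmetic behind several of them, using only Python's standard `datetime` module. It is not TRD's generator code. The start date comes from the README's JSON example; the other dates and times are illustrative. Month and year offsets are left out because `timedelta` does not support them, and the excerpt does not say how TRD handles month-end cases.

```python
from datetime import date, datetime, timedelta

# Fixed English names avoid any dependence on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

start = date(2027, 8, 7)

# date_addition / date_subtraction: shift a date by days or weeks.
added = start + timedelta(days=8)
subtracted = start - timedelta(weeks=2)

# date_duration: number of days between two dates.
duration_days = (date(2027, 12, 25) - start).days

# day_of_week: weekday() returns 0 for Monday through 6 for Sunday.
day_name = DAY_NAMES[start.weekday()]

# time_addition: attach the time to a date so it can wrap past midnight.
later = datetime(2027, 8, 7, 22, 45) + timedelta(hours=3, minutes=30)
time_added = later.strftime("%H:%M")

# time_duration: hours and minutes between two times on the same day.
gap = datetime(2027, 8, 7, 17, 5) - datetime(2027, 8, 7, 9, 20)
hours, minutes = divmod(int(gap.total_seconds()) // 60, 60)

print(added, subtracted, duration_days, day_name, time_added, (hours, minutes))
# -> 2027-08-15 2027-07-24 140 Saturday 02:15 (7, 45)
```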

---

🚀 Installation

pip install trd

Or install from source:

git clone https://github.com/amazon-science/temporal-reasoning-dataset.git
cd temporal-reasoning-dataset
pip install -e .

---

⚡ Quick Start

🔁 Reproduce Paper Results

To generate the exact dataset used in our IWSDS 2026 paper:

trd generate all --output ./dataset --seed 9

This creates all 104,000 samples organized by experiment type.
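One way to check that two runs with the same seed produced the same dataset is to hash every file under each output directory and compare the digests. The sketch below uses only the standard library and does not call `trd`; `directory_digest` is a hypothetical helper, not part of the package. The README does not state that output is byte-identical across runs, and the JSON format carries a `generated_at` field that would likely differ between runs, so this check is more plausible for the default CSV output.

```python
import hashlib
from pathlib import Path


def directory_digest(root: str) -> str:
    """Return one SHA-256 digest covering every file under root."""
    digest = hashlib.sha256()
    root_path = Path(root)
    # Sort so the result does not depend on filesystem listing order.
    for path in sorted(p for p in root_path.rglob("*") if p.is_file()):
        # Include the relative path so renamed or moved files change the digest.
        digest.update(path.relative_to(root_path).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

After generating into two separate directories with `--seed 9`, comparing `directory_digest("./dataset_a") == directory_digest("./dataset_b")` indicates whether the two runs match (the directory names here are placeholders).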

💻 Command Line Interface

# Generate all experiments (104K samples)
trd generate all --output ./dataset

# Generate specific experiments
trd generate variations --output ./dataset
trd generate difficulties --output ./dataset
trd generate insertions --output ./dataset
trd generate memorization --output ./dataset

# Customize generation
trd generate variations \
  --samples 50 \
  --languages en_US,de_DE,ja_JP \
  --format json \
  --seed 42 \
  --output ./my_dataset

🐍 Python API

from trd import generate_all, generate_variations

# Generate complete dataset (reproduces paper results)
total = generate_all(output_dir="./dataset", seed=9)
print(f"✅ Generated {total} samples")

# Generate specific experiment
total = generate_variations(
    samples_per_task=100,
    languages=["en_US", "de_DE", "ja_JP"],
    output_dir="./dataset",
    seed=9,
    format="csv"
)

# Use experiment classes directly
from trd import VariationsExperiment
from trd.config.languages import LANGUAGE_CODES

exp = VariationsExperiment(
    samples_per_task=100,
    languages=LANGUAGE_CODES,
    output_dir="./dataset",
    seed=9
)
exp.generate()

🔧 Using Individual Generators

from trd.generators import get_generator
from trd.config import MEDIUM_TIMEFRAME

# Get a specific generator
GeneratorClass = get_generator("date_addition")
generator = GeneratorClass(MEDIUM_TIMEFRAME, "en_US")

# Generate samples
samples = generator.generate_samples(10)

for sample in samples:
    print(f"❓ Q: {sample.question}")
    print(f"✅ A: {sample.answer}")
    print()
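
The idea of a seeded, programmatic generator can be illustrated without installing the package. The following is a toy sketch, not TRD's implementation: `ToySample` and `toy_date_addition` are hypothetical names, the date range and offsets are arbitrary, and only the question wording is borrowed from the README's JSON example. It shows why a fixed seed yields the same samples on every run.

```python
import random
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ToySample:
    question: str
    answer: str


def toy_date_addition(n: int, seed: int) -> list[ToySample]:
    """Generate n date-addition questions deterministically from a seed."""
    rng = random.Random(seed)  # private RNG, independent of global state
    samples = []
    for _ in range(n):
        start = date(2027, 1, 1) + timedelta(days=rng.randrange(365))
        offset = rng.randint(1, 30)
        answer = start + timedelta(days=offset)
        samples.append(ToySample(
            question=(f"Today is {start.isoformat()}, what is the date "
                      f"going to be in {offset} days?"),
            answer=answer.isoformat(),
        ))
    return samples


first = toy_date_addition(3, seed=9)
second = toy_date_addition(3, seed=9)
print(first == second)  # True: same seed, same samples
```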

---

📁 Output Format

CSV Format (default)

Each experiment generates CSV files organized by task and language:

dataset/
├── variations/
│   ├── date_addition_medium_en_US.csv
│   ├── date_addition_medium_es_ES.csv
│   └── ...
├── difficulties/
│   ├── short/
│   ├── medium/
│   ├── long/
│   ├── very_long/
│   └── very_very_long/
├── insertions/
│   ├── similar/
│   └── dissimilar/
└── memorization/
    ├── 2025_date_addition_medium_en_US.csv
    ├── 2035_date_addition_medium_en_US.csv
    └── ...
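
Because task names, difficulty names, and language codes all contain underscores, splitting a file name on `_` is ambiguous. A minimal sketch of one way around this is to match known suffixes instead. It assumes the pattern `[epoch_]task_difficulty_language.csv` generalizes from the file names shown above, which the README does not confirm for every directory; `parse_filename` and `FileInfo` are hypothetical helpers, not part of `trd`.

```python
from typing import NamedTuple, Optional

TASKS = ("date_addition", "date_subtraction", "time_addition",
         "time_subtraction", "date_duration", "time_duration",
         "date_recurrence", "interval_date", "day_of_week")
# Longest names first so "_very_very_long" is tried before "_long".
DIFFICULTIES = ("very_very_long", "very_long", "medium", "short", "long")
LANGUAGES = ("en_US", "es_ES", "de_DE", "fr_FR", "it_IT",
             "pt_BR", "nl_NL", "ja_JP", "ar_SA", "hi_IN")


class FileInfo(NamedTuple):
    epoch: Optional[int]  # only present for memorization files
    task: str
    difficulty: str
    language: str


def _strip_suffix(stem: str, options: tuple) -> tuple:
    """Remove the first matching '_option' suffix and return (rest, option)."""
    for option in options:
        if stem.endswith("_" + option):
            return stem[: -(len(option) + 1)], option
    raise ValueError(f"no known suffix in {stem!r}")


def parse_filename(name: str) -> FileInfo:
    stem = name[:-4] if name.endswith(".csv") else name
    stem, language = _strip_suffix(stem, LANGUAGES)
    stem, difficulty = _strip_suffix(stem, DIFFICULTIES)
    epoch = None
    head, _, rest = stem.partition("_")
    if head.isdigit():  # leading year prefix, e.g. "2035_"
        epoch, stem = int(head), rest
    if stem not in TASKS:
        raise ValueError(f"unknown task {stem!r}")
    return FileInfo(epoch, stem, difficulty, language)


print(parse_filename("2035_date_addition_medium_en_US.csv"))
# -> FileInfo(epoch=2035, task='date_addition', difficulty='medium', language='en_US')
```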

JSON Format

{
  "metadata": {
    "experiment": "variations",
    "generated_at": "2025-01-15T12:00:00",
    "seed": 9,
    "total_samples": 9000
  },
  "samples": [
    {
      "id": "var_en_US_date_addition_001",
      "language": "en_US",
      "task_type": "date_addition",
      "question": "Today is 2027-08-07, what is the date going to be in 8 days?",
      "answer": "2027-08-15"
    }
  ]
}
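
A file in this shape can be scored against model outputs with the standard `json` module. The sketch below is an assumption-laden illustration: exact string match is used as the metric, which the excerpt does not confirm as the paper's scoring method, and `exact_match_by_task` and the `predictions` mapping are hypothetical. The inline JSON reuses the README's example sample.

```python
import json
from collections import defaultdict

RAW = """
{
  "metadata": {"experiment": "variations", "seed": 9, "total_samples": 9000},
  "samples": [
    {
      "id": "var_en_US_date_addition_001",
      "language": "en_US",
      "task_type": "date_addition",
      "question": "Today is 2027-08-07, what is the date going to be in 8 days?",
      "answer": "2027-08-15"
    }
  ]
}
"""


def exact_match_by_task(dataset: dict, predictions: dict) -> dict:
    """Return per-task accuracy; a missing prediction counts as wrong."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for sample in dataset["samples"]:
        task = sample["task_type"]
        total[task] += 1
        predicted = predictions.get(sample["id"], "").strip()
        if predicted == sample["answer"]:
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


dataset = json.loads(RAW)
# Hypothetical model output keyed by sample id.
predictions = {"var_en_US_date_addition_001": "2027-08-15"}
print(exact_match_by_task(dataset, predictions))
# -> {'date_addition': 1.0}
```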

---

📊 Difficulty Levels (Table 5)

Each difficulty level uses specific timeframe configurations from the paper:

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

New dataset, low stars

Amazon (Nova) has a repo signal matching data demand, evals and quality.

Data demand · Evals and quality