amazon-science/temporal-reasoning-dataset
Description: 🔬 Replication Package for the paper: "Benchmarking Multilingual Temporal Reasoning in LLMs: The Temporal Reasoning Dataset"
Language: Python
License: NOASSERTION
Stars: 2
Forks: 0
Open issues: 0
Created: 2026-05-13T13:07:08Z
Pushed: 2026-05-13T14:50:54Z
Default branch: main
Fork: no
Archived: no
README:
🕐 Temporal Reasoning Dataset (TRD)

> 🔬 Replication Package for the paper:
> ["Benchmarking Multilingual Temporal Reasoning in LLMs: The Temporal Reasoning Dataset"](https://aclanthology.org/2026.iwsds-1.19.pdf)
> *Presented at IWSDS 2026*

> ⚠️ Academic Release Notice
> This code is being released solely for academic and scientific reproducibility purposes, in support of the methods and findings described in the associated publication. Pull requests are not being accepted in order to maintain the code exactly as it was used in the paper.

A programmatically generated, multilingual benchmark designed to evaluate temporal reasoning capabilities in Large Language Models (LLMs). This package allows you to reproduce the dataset used in our research with a single command.
---
📖 Overview
TRD generates question-answer pairs across 10 languages and 9 temporal task types, enabling systematic evaluation of LLM temporal reasoning along 4 experimental axes:
| Experiment | Description | Default Samples |
|------------|-------------|-----------------|
| 🔄 Variations | Medium difficulty baseline across all task types | 9,000 |
| 📈 Difficulties | 5 difficulty levels (short to very_very_long) | 45,000 |
| 🎯 Insertions | Contextual distractors (similar/dissimilar) | 18,000 |
| 🧠 Memorization | Temporal shift 2025-2095 (8 year epochs) | 32,000 |
📊 Total: 104,000 samples with default settings (seed=9 for reproducibility).
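The default total can be cross-checked from the figures in the table above. The sketch below is not part of the `trd` package. It assumes 100 samples per task-language pair (inferred from 9,000 = 9 tasks × 10 languages × 100, not stated in this README) and assumes the memorization epochs are spaced 10 years apart, which is consistent with the 2025 and 2035 file names in the output tree.

```python
# Cross-check of the default sample counts listed in the experiments table.
# ASSUMPTION: 100 samples per (task, language) pair; inferred, not documented.
SAMPLES_PER_TASK = 100
N_TASKS, N_LANGUAGES = 9, 10

variations = N_TASKS * N_LANGUAGES * SAMPLES_PER_TASK  # medium-difficulty baseline
difficulties = 5 * variations                          # 5 difficulty levels
insertions = 2 * variations                            # similar + dissimilar
memorization = 32_000                                  # taken from the table as-is

total = variations + difficulties + insertions + memorization

# ASSUMPTION: one epoch every 10 years from 2025 through 2095.
epochs = list(range(2025, 2096, 10))

print(f"total samples: {total:,}")
print(f"memorization epochs: {epochs}")
```

The memorization count is copied from the table because the README does not say how it breaks down by task.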
---
🌍 Supported Languages
Our dataset spans 10 languages across multiple language families to ensure comprehensive multilingual evaluation:
| Code | Language | Family |
|------|----------|--------|
| 🇺🇸 en_US | English | Indo-European (Germanic) |
| 🇪🇸 es_ES | Spanish | Indo-European (Romance) |
| 🇩🇪 de_DE | German | Indo-European (Germanic) |
| 🇫🇷 fr_FR | French | Indo-European (Romance) |
| 🇮🇹 it_IT | Italian | Indo-European (Romance) |
| 🇧🇷 pt_BR | Portuguese | Indo-European (Romance) |
| 🇳🇱 nl_NL | Dutch | Indo-European (Germanic) |
| 🇯🇵 ja_JP | Japanese | Japonic |
| 🇸🇦 ar_SA | Arabic | Afro-Asiatic |
| 🇮🇳 hi_IN | Hindi | Indo-European (Indo-Aryan) |
---
📋 Task Types
Nine carefully designed temporal reasoning tasks:
| # | Task | Description |
|---|------|-------------|
| 1️⃣ | date_addition | Adding days/weeks/months/years to a date |
| 2️⃣ | date_subtraction | Subtracting days/weeks/months/years from a date |
| 3️⃣ | time_addition | Adding hours/minutes to a time |
| 4️⃣ | time_subtraction | Subtracting hours/minutes from a time |
| 5️⃣ | date_duration | Days between two dates |
| 6️⃣ | time_duration | Hours/minutes between two times |
| 7️⃣ | date_recurrence | Next occurrence of recurring events |
| 8️⃣ | interval_date | Date intervals and ranges |
| 9️⃣ | day_of_week | Day name for a given date |
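To make the task definitions concrete, here is a minimal sketch of reference answers for four of the tasks, using only the Python standard library. It does not use or reproduce the `trd` generators. The function names mirror the task names for readability, and the ISO date and `HH:MM` answer formats are assumptions based on the sample question shown in the JSON output example.

```python
from datetime import date, datetime, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def date_addition(start: str, days: int) -> str:
    """Return the ISO date that falls `days` days after `start`."""
    return (date.fromisoformat(start) + timedelta(days=days)).isoformat()

def date_duration(start: str, end: str) -> int:
    """Return the number of days from `start` to `end`."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days

def day_of_week(day: str) -> str:
    """Return the English day name for an ISO date."""
    return DAY_NAMES[date.fromisoformat(day).weekday()]

def time_addition(start: str, hours: int, minutes: int) -> str:
    """Add hours and minutes to an HH:MM time, wrapping past midnight."""
    moment = datetime.strptime(start, "%H:%M")
    moment += timedelta(hours=hours, minutes=minutes)
    return moment.strftime("%H:%M")

# The README's sample question: "Today is 2027-08-07, what is the date
# going to be in 8 days?"
print(date_addition("2027-08-07", 8))
print(date_duration("2027-08-07", "2027-08-15"))
print(day_of_week("2027-08-07"))
print(time_addition("23:30", 1, 45))
```

Independent reference functions like these can serve as a sanity check on generated answers, although the dataset's own answer formatting for non-English locales may differ.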
---
🚀 Installation
pip install trd
Or install from source:
git clone https://github.com/amazon-science/temporal-reasoning-dataset.git
cd temporal-reasoning-dataset
pip install -e .
---
⚡ Quick Start
🔁 Reproduce Paper Results
To generate the exact dataset used in our IWSDS 2026 paper:
trd generate all --output ./dataset --seed 9
This creates all 104,000 samples organized by experiment type.
💻 Command Line Interface
# Generate all experiments (104K samples)
trd generate all --output ./dataset

# Generate specific experiments
trd generate variations --output ./dataset
trd generate difficulties --output ./dataset
trd generate insertions --output ./dataset
trd generate memorization --output ./dataset

# Customize generation
trd generate variations \
  --samples 50 \
  --languages en_US,de_DE,ja_JP \
  --format json \
  --seed 42 \
  --output ./my_dataset
🐍 Python API
from trd import generate_all, generate_variations
# Generate complete dataset (reproduces paper results)
total = generate_all(output_dir="./dataset", seed=9)
print(f"✅ Generated {total} samples")
# Generate specific experiment
total = generate_variations(
    samples_per_task=100,
    languages=["en_US", "de_DE", "ja_JP"],
    output_dir="./dataset",
    seed=9,
    format="csv"
)
# Use experiment classes directly
from trd import VariationsExperiment
from trd.config.languages import LANGUAGE_CODES
exp = VariationsExperiment(
    samples_per_task=100,
    languages=LANGUAGE_CODES,
    output_dir="./dataset",
    seed=9
)
exp.generate()
🔧 Using Individual Generators
from trd.generators import get_generator
from trd.config import MEDIUM_TIMEFRAME
# Get a specific generator
GeneratorClass = get_generator("date_addition")
generator = GeneratorClass(MEDIUM_TIMEFRAME, "en_US")
# Generate samples
samples = generator.generate_samples(10)
for sample in samples:
    print(f"❓ Q: {sample.question}")
    print(f"✅ A: {sample.answer}")
    print()
---
📁 Output Format
CSV Format (default)
Each experiment generates CSV files organized by task and language:
dataset/
├── variations/
│   ├── date_addition_medium_en_US.csv
│   ├── date_addition_medium_es_ES.csv
│   └── ...
├── difficulties/
│   ├── short/
│   ├── medium/
│   ├── long/
│   ├── very_long/
│   └── very_very_long/
├── insertions/
│   ├── similar/
│   └── dissimilar/
└── memorization/
    ├── 2025_date_addition_medium_en_US.csv
    ├── 2035_date_addition_medium_en_US.csv
    └── ...
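The file names in the tree appear to encode the task, difficulty, and language, with a leading year for memorization files. The sketch below parses that pattern. It assumes the convention `[year_]task_difficulty_language.csv` holds for every file, which this README shows only for the listed examples. `parse_filename` and `FileInfo` are hypothetical helpers, not part of `trd`.

```python
from pathlib import Path
from typing import NamedTuple, Optional

# Task names and language codes as listed in this README.
TASKS = ["date_addition", "date_subtraction", "time_addition",
         "time_subtraction", "date_duration", "time_duration",
         "date_recurrence", "interval_date", "day_of_week"]
LANGUAGES = ["en_US", "es_ES", "de_DE", "fr_FR", "it_IT",
             "pt_BR", "nl_NL", "ja_JP", "ar_SA", "hi_IN"]

class FileInfo(NamedTuple):
    year: Optional[int]  # set only for memorization files
    task: str
    difficulty: str
    language: str

def parse_filename(path: str) -> FileInfo:
    """Split a dataset file name into its assumed components."""
    stem = Path(path).stem
    year = None
    head, _, rest = stem.partition("_")
    if head.isdigit():  # optional year prefix, e.g. "2035_"
        year, stem = int(head), rest
    # Task and difficulty names contain underscores, so match against
    # the known task and language lists instead of splitting on "_".
    task = next((t for t in TASKS if stem.startswith(t + "_")), None)
    language = next((c for c in LANGUAGES if stem.endswith("_" + c)), None)
    if task is None or language is None:
        raise ValueError(f"unrecognized file name: {path}")
    difficulty = stem[len(task) + 1 : -(len(language) + 1)]
    return FileInfo(year, task, difficulty, language)

print(parse_filename("dataset/variations/date_addition_medium_en_US.csv"))
print(parse_filename("dataset/memorization/2035_date_addition_medium_en_US.csv"))
```

Matching against the known lists avoids mis-splitting names such as `day_of_week` or `very_very_long`.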
JSON Format
{
  "metadata": {
    "experiment": "variations",
    "generated_at": "2025-01-15T12:00:00",
    "seed": 9,
    "total_samples": 9000
  },
  "samples": [
    {
      "id": "var_en_US_date_addition_001",
      "language": "en_US",
      "task_type": "date_addition",
      "question": "Today is 2027-08-07, what is the date going to be in 8 days?",
      "answer": "2027-08-15"
    }
  ]
}
---
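A file in the JSON format shown above can be scored against a model with a few lines of Python. The sketch below is an illustration only: exact string match is an assumed metric (the paper's scoring may differ), and `predict` stands in for whatever model call you use. The field names come from the JSON example.

```python
import json
from typing import Callable

def exact_match_accuracy(dataset: dict, predict: Callable[[str], str]) -> float:
    """Fraction of samples whose prediction equals the gold answer.

    ASSUMPTION: whitespace-trimmed exact match; not necessarily the
    metric used in the paper.
    """
    samples = dataset["samples"]
    if not samples:
        return 0.0
    correct = sum(
        predict(sample["question"]).strip() == sample["answer"].strip()
        for sample in samples
    )
    return correct / len(samples)

# Inline copy of the README's JSON example; in practice, load the
# generated file with json.load() instead.
dataset = json.loads("""
{
  "metadata": {"experiment": "variations", "seed": 9, "total_samples": 9000},
  "samples": [
    {
      "id": "var_en_US_date_addition_001",
      "language": "en_US",
      "task_type": "date_addition",
      "question": "Today is 2027-08-07, what is the date going to be in 8 days?",
      "answer": "2027-08-15"
    }
  ]
}
""")

def constant_model(question: str) -> str:
    """Hypothetical stand-in for an LLM call; always gives one answer."""
    return "2027-08-15"

print(exact_match_accuracy(dataset, constant_model))
```

Grouping the same computation by `language` or `task_type` would give the per-language and per-task breakdowns that the benchmark is designed for.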
📊 Difficulty Levels (Table 5)
Each difficulty level uses specific timeframe configurations from the paper:
| Level | Days | Hours | Recurrence Every | Recurrence Q | Duration Date | Duration Time (min) | Year Range |…