
amazon-science/when-thinking-fails-RLLM-if-evaluation


Language: Python

License: NOASSERTION

Stars: 0

Forks: 0

Open issues: 0

Created: 2025-10-20T22:41:15Z

Pushed: 2025-10-22T18:01:02Z

Default branch: main

Fork: no

Archived: no

README:

When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs

🎉 Our paper has been accepted to NeurIPS 2025 as a Spotlight paper (https://neurips.cc/virtual/2025/poster/115354)!

Quickstart

1. Clone the repo

2. Install dependencies

pip install -r requirements.txt

3. Generate CoT and non-CoT outputs

  • Use reasoning_prompt_template() in templates.py to prompt LLMs on IFEval and ComplexBench instructions.
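The pairing of CoT and non-CoT prompts for the same instruction can be sketched as below. This is a minimal illustration only: the signature of reasoning_prompt_template() is not shown in this README, so the two template strings and the build_prompt_pair helper are hypothetical stand-ins, not the repo's actual templates.

```python
# Hypothetical templates for illustration; the real ones live in templates.py.
COT_TEMPLATE = (
    "Think step by step before giving your final response.\n\n"
    "Instruction: {instruction}"
)
DIRECT_TEMPLATE = "Instruction: {instruction}"


def build_prompt_pair(instruction: str) -> dict[str, str]:
    """Return a CoT prompt and a non-CoT prompt for the same instruction."""
    return {
        "cot": COT_TEMPLATE.format(instruction=instruction),
        "non_cot": DIRECT_TEMPLATE.format(instruction=instruction),
    }


pair = build_prompt_pair("Write a haiku without using the letter e.")
print(pair["cot"])
print(pair["non_cot"])
```

Keeping both prompts keyed by the same instruction makes it straightforward to compare instruction-following accuracy with and without reasoning on identical inputs.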

4. Clean and parse responses

from utils import clean_think_responses

# Example usage:
responses = [...] # raw model generations
answers, thinking = clean_think_responses(responses)
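For readers without the repo at hand, the separation of reasoning from the final answer might look roughly like the sketch below. It assumes the model wraps its reasoning in <think>...</think> tags; split_think is a hypothetical helper and is not the implementation of clean_think_responses.

```python
import re

# Assumption: reasoning is delimited by <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think(response: str) -> tuple[str, str]:
    """Return (answer, thinking) for one raw generation."""
    # Collect every reasoning span, then remove the spans to leave the answer.
    thinking = "\n".join(part.strip() for part in THINK_BLOCK.findall(response))
    answer = THINK_BLOCK.sub("", response).strip()
    return answer, thinking


raw = "<think>The user wants two words.</think>Hello world"
answer, thinking = split_think(raw)
print(answer)    # Hello world
print(thinking)  # The user wants two words.
```

A response with no think tags passes through unchanged as the answer, with an empty string for the reasoning.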

---

Evaluate mitigation strategies

We compare four mitigation methods proposed in the paper:

1. Few-shot in-context learning
2. Self-reflection
3. Self-selective reasoning
4. Classifier-selective reasoning

The code and evaluation metrics reproduce the results reported in the paper.
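As their names suggest, the two selective methods decide per instruction whether to reason at all. The routing idea can be sketched as follows, assuming the decision comes either from the model itself (self-selective) or from a separate classifier (classifier-selective). All names here are hypothetical and the callables are toy stubs, not the repo's code.

```python
from typing import Callable


def selective_generate(
    instruction: str,
    should_reason: Callable[[str], bool],
    generate_with_cot: Callable[[str], str],
    generate_direct: Callable[[str], str],
) -> str:
    """Route one instruction to CoT or direct generation."""
    if should_reason(instruction):
        return generate_with_cot(instruction)
    return generate_direct(instruction)


# Toy stand-ins: a real setup would call an LLM and a learned decision rule.
def toy_decider(instruction: str) -> bool:
    return "solve" in instruction.lower()


def toy_cot(instruction: str) -> str:
    return f"[reasoned] {instruction}"


def toy_direct(instruction: str) -> str:
    return f"[direct] {instruction}"


print(selective_generate("Solve 2 + 2.", toy_decider, toy_cot, toy_direct))
print(selective_generate("Reply in all caps.", toy_decider, toy_cot, toy_direct))
```

Passing the decision rule in as a callable lets the same routing function cover both variants: only the source of should_reason changes.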

---

Data

  • IFEval: the original benchmark dataset (541 prompts with their constraints), publicly available on Hugging Face.
  • ComplexBench: we include an English translation (covering its evaluation prompts, rule-based evaluation functions, etc.) under its original license.
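To give a feel for what a rule-based evaluation function does, here is a minimal sketch of two verifiable constraints and a combined check. The checker names and constraints are hypothetical examples for illustration; the benchmarks ship their own evaluation functions.

```python
from typing import Callable

Check = Callable[[str], bool]


def max_words(limit: int) -> Check:
    """Constraint: the response has at most `limit` whitespace-separated words."""
    return lambda response: len(response.split()) <= limit


def includes_keywords(keywords: list[str]) -> Check:
    """Constraint: every keyword appears in the response, ignoring case."""
    def check(response: str) -> bool:
        lowered = response.lower()
        return all(keyword.lower() in lowered for keyword in keywords)
    return check


def follows_all(response: str, checks: list[Check]) -> bool:
    """A response counts as following the instruction only if every check passes."""
    return all(check(response) for check in checks)


checks = [max_words(10), includes_keywords(["ocean"])]
print(follows_all("The ocean is wide and blue.", checks))  # True
print(follows_all("The sea is wide and blue.", checks))    # False
```

Because each constraint is a deterministic function of the response text, the same checks can score CoT and non-CoT outputs without a judge model.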

---

Templates

All prompt templates are defined in templates.py. Plain-text copies are also included under the prompts/ folder for reference.
