amazon-science/when-thinking-fails-RLLM-if-evaluation
Language: Python
License: NOASSERTION
Stars: 0
Forks: 0
Open issues: 0
Created: 2025-10-20T22:41:15Z
Pushed: 2025-10-22T18:01:02Z
Default branch: main
Fork: no
Archived: no
README:
When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs
🎉 Our paper has been accepted to NeurIPS 2025 as a Spotlight paper (https://neurips.cc/virtual/2025/poster/115354)!
Quickstart
1. Clone the repo
2. Install dependencies
pip install -r requirements.txt
3. Generate CoT and non-CoT outputs
- Use reasoning_prompt_template() in templates.py to prompt LLMs on IFEval and ComplexBench instructions.
4. Clean and parse responses
from utils import clean_think_responses

# Example usage:
responses = [...]  # raw model generations
answers, thinking = clean_think_responses(responses)
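To illustrate what the parsing step involves, here is a minimal sketch of separating reasoning traces from final answers. It assumes the reasoning is wrapped in `<think>...</think>` tags, which may not match the convention used in this repo; `split_think_blocks` is a hypothetical stand-in, not the actual `clean_think_responses` implementation. It mirrors the `(answers, thinking)` return order shown in the usage above.

```python
import re
from typing import List, Tuple

# Assumption: reasoning is wrapped in <think>...</think>. The repo's own
# clean_think_responses may rely on a different delimiter or format.
_THINK_RE = re.compile(r"<think>(.*?)</think>", flags=re.DOTALL)


def split_think_blocks(responses: List[str]) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: return (answers, thinking), one entry per response."""
    answers: List[str] = []
    thinking: List[str] = []
    for raw in responses:
        # Collect every reasoning span found in the raw generation.
        spans = _THINK_RE.findall(raw)
        thinking.append("\n".join(span.strip() for span in spans))
        # Whatever remains after removing the spans is treated as the answer.
        answers.append(_THINK_RE.sub("", raw).strip())
    return answers, thinking
```

A response with no reasoning span yields an empty string in `thinking`, so the two lists stay aligned with the input.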
---
Evaluate mitigation strategies
We compare four mitigation methods proposed in the paper:
1. Few-shot in-context learning
2. Self-reflection
3. Self-selective reasoning
4. Classifier-selective reasoning
The code and evaluation metrics reproduce the results reported in the paper.
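As their names suggest, the two selective-reasoning methods decide per instruction whether the model should reason before answering. The sketch below shows that routing pattern only. It assumes the decision is available as a boolean callable (backed by the model itself or by a separate classifier); `selective_generate` and its parameters are hypothetical names, not functions from this repo.

```python
from typing import Callable


def selective_generate(
    instruction: str,
    should_reason: Callable[[str], bool],
    generate_with_cot: Callable[[str], str],
    generate_direct: Callable[[str], str],
) -> str:
    """Hypothetical router: use CoT only when the selector asks for it."""
    # should_reason stands in for either selector: a model self-judgment
    # (self-selective) or a trained classifier (classifier-selective).
    if should_reason(instruction):
        return generate_with_cot(instruction)
    return generate_direct(instruction)
```

Keeping the selector as a plain callable means the same routing code can serve both variants; only the source of the decision changes.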
---
Data
- IFEval: original benchmark dataset (541 prompts + constraints), publicly available at Hugging Face.
- ComplexBench: we include an English-translated version (covering its evaluation prompts, rule-based evaluation functions, etc.), distributed under the original license.
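Both benchmarks pair prompts with constraints that can be checked programmatically. The following is a rough sketch of scoring responses against rule-based checks, for example to compare CoT and non-CoT outputs on the same prompts. The `min_word_count` check and the `accuracy` helper are illustrative assumptions and are not the benchmarks' actual evaluation functions.

```python
from typing import Callable, List

Check = Callable[[str], bool]


def min_word_count(n: int) -> Check:
    """Toy rule-based check: the response contains at least n words."""
    return lambda response: len(response.split()) >= n


def accuracy(responses: List[str], checks: List[Check]) -> float:
    """Fraction of responses that pass their paired check."""
    if not responses:
        return 0.0
    # Each response is scored only against the check at the same index.
    passed = sum(1 for r, check in zip(responses, checks) if check(r))
    return passed / len(responses)
```

Running `accuracy` once on the CoT answers and once on the non-CoT answers, with the same list of checks, gives two directly comparable scores.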
---
Templates
All prompt templates are defined in templates.py. Plain-text copies are also included under the prompts/ folder for reference.
Notability
5.0/10: Research repo from amazon-science; the topic is notable, but no traction information is available.