google-deepmind/questbench

Python

Open original ↗

Captured source

source ↗
published Jan 10, 2025seen 6dcaptured 9hhttp 200method plain

google-deepmind/questbench

Language: Python

License: Apache-2.0

Stars: 40

Forks: 6

Open issues: 1

Created: 2025-01-10T20:15:10Z

Pushed: 2025-05-15T18:10:02Z

Default branch: main

Fork: no

Archived: no

README:

QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?

Recently, a large amount of work has focused on improving large language models' (LLMs') performance on reasoning benchmarks such as math and logic. However, past work has largely assumed that tasks are well-defined. In the real world, queries to LLMs are often underspecified, only solvable through acquiring missing information. We formalize this as a constraint satisfaction problem (CSP) with missing variable assignments. Using a special case of this formalism where only one necessary variable assignment is missing, we can rigorously evaluate an LLM's ability to identify the minimal necessary question to ask and quantify axes of difficulty levels for each problem. We present QuestBench, a set of underspecified reasoning tasks solvable by asking at most one question, which includes: (1) Logic-Q: Logical reasoning tasks with one missing proposition, (2) Planning-Q: PDDL planning problems with initial states that are partially-observed, (3) GSM-Q: Human-annotated grade school math problems with one missing variable assignment, and (4) GSME-Q: a version of GSM-Q where word problems are translated into equations by human annotators. The LLM is tasked with selecting the correct clarification question(s) from a list of options. While state-of-the-art models excel at GSM-Q and GSME-Q, their accuracy is only 40-50% on Logic-Q and Planning-Q. Analysis demonstrates that the ability to solve well-specified reasoning problems may not be sufficient for success on our benchmark: models have difficulty identifying the right question to ask, even when they can solve the fully specified version of the problem. Furthermore, in the Planning-Q domain, LLMs tend not to hedge, even when explicitly presented with the option to predict ``not sure.'' This highlights the need for deeper investigation into models' information acquisition capabilities.

[Paper link](https://arxiv.org/abs/2503.22674) | [Download dataset](https://storage.googleapis.com/questbench/questbench_data.tar.gz)

This repository contains code for generating QuestBench data and evaluating LLMs on it.

Installation

1. Begin by creating a conda environment to contain the packages needed for QuestBench. You can install anaconda here: https://docs.anaconda.com/miniconda/install/#quick-command-line-install

conda create -n questbench PYTHON=3.11
conda activate questbench

2. Install PyTorch following the instructions here: https://pytorch.org/get-started/locally/

3. Install the remaining requirements

pip install -r requirements.txt

Download datasets

1. Click here to download the datasets.

2. After downloading, expand the compressed file.

tar -xzvf questbench_data.tar.gz

Run evaluations

Set your api key to be able to use Gemini models

export GOOGLE_API_KEY=

Login to HuggingFace to be able to use Gemma models, and start a vllm server with the desired model

huggingface-cli login
vllm serve "google/gemma-2-2b-it" --port
  • Substitute the model name with google/gemma-2-9b-it or google/gemma-2-27b-it as necessary.

Set your openai key to be able to use GPT models

export OPENAI_API_KEY=
export OPENAI_ORGANIZATION=
export OPENAI_PROJECT=

Next, run the eval

python mc_eval.py \
--model_name \
--domain_name [GSM_csp|Planning|SL|GSM_verbal] \
--eval_mode [mc|isambig|fullinfo] \
--data_dir \
--data_file \
--prompt_mode [|cot|fs4] \
--results_dir \
--batch_size 1 \
(--model_role_name assistant)
(--vllm_port )
  • We currently support the following --model_name:
  • gemini-1.5-pro
  • gemini-1.5-flash
  • gemini-2.0-flash-thinking-exp
  • gpt-4o
  • o1-preview
  • claude-3-5-sonnet-20241022
  • gemma_2_27b
  • gemma_2_9b
  • gemma_2_2b
  • Other Gemini models can be found here. Other OpenAI models can be used by adding their names to GPT_COSTS in model_utils.py. Other Anthropic models can be used by adding their names to CLAUDE_MODELS in model_utils.py.
  • If OpenAI or Anthropic models are used, add the --model_role_name assistant option. Otherwise do not add it.
  • Set batch_size to be lower than your RPS rate limit.
  • If a gemma-2 model is used, specify a VLLM port.
  • --data_dir should be set to the directory containing all the data files. By default, --data_dir is set to questbench_data/.
  • --data_file should be set to the appropriate file for the domain. If you downloaded the datasets from the public website, the data files should be set to
questbench_data/Logic-Q/simplelogic_heldout_1k.csv
questbench_data/Planning-Q/planning_heldout_7500.csv
questbench_data/GSM-Q/gsm_CSP_heldout_pilot.csv
questbench_data/GSM-Q/gsm_verbal_heldout_pilot.csv

Generate datasets

Before running any code, be sure to run

export PYTHONPATH=.

Logic-Q

Generate 1-sufficient rulesets

python SimpleLogic/generate_ruleset.py \
--sl_dir \
--start_idx \
--end_idx

Make Logic-Q data from 1-sufficient rulesets

python SimpleLogic/make_data.py \
--sl_dir \
--max_problems_to_sample_per_ruleset

Planning-Q

Generate 1-sufficient CSPs

python Planning/make_planning_data.py \
--pddl_dir \
--output_dir

Run remaining commands in under "Make data" header in Make Planning-Q data from 1-sufficient CSPs

python Planning/make_data.py \
--input_dir \
--output_dir

where input_dir is the output_dir from the previous command.

GSM-Q

GSM-Q was created through human annotation.

Please see the technical report for more details.

Citing this work

@misc{li2025questbenchllmsaskright,
title={QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?},
author={Belinda Z. Li and Been Kim and Zi Wang},
year={2025},
eprint={2503.22674},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2503.22674},
}

License and disclaimer

Copyright 2025 Google LLC

All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the…

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

New repo with low traction