RepoAmazon (Nova)Amazon (Nova)published Sep 5, 2025seen 5d

amazon-science/DiverseAgentEntropy

Python

Open original ↗

Captured source

source ↗

amazon-science/DiverseAgentEntropy

Language: Python

License: Apache-2.0

Stars: 1

Forks: 0

Open issues: 0

Created: 2025-09-05T12:47:43Z

Pushed: 2025-09-12T14:37:50Z

Default branch: main

Fork: no

Archived: no

README:

DiverseAgentEntropy

This is the repo to reproduce the paper DiverseAgentEntropy, a novel method for quantifying the uncertainty in the factual parametric knowledge of Large Language Models (LLMs).

Setup

Please install the packages in ``requirements.txt

Data preparation

You should prepare QA datasets in the following format under ``../data``:

[
{
"question": "",
"gold_answer": "",
},
{
"question": "",
"gold_answer": "",
},
// ... more entries ...
]

API inference

We are currently using the AWS Bedrock platform, but you can switch to any other API call method by updating the ``../code/utils.py`` file accordingly.

Diverse agent question generation

We first generate varied questions about the same underlying original query with different contexts.

python -m code.question_generation.pipeline_question_generation --dataset_name=dataset_name --model_name=model_name

We then select n questions as the final questions for agent interaction.

python -m code.question_generation.pipeline_question_selection --dataset_name=dataset_name --model_name=model_name

Agent Interaction

We use the selected diverse questions to encourage agent interactions to further reveal the model's knowledge of the original query.

python -m code.agent_interaction.pipeline_agent_interaction --dataset_name=dataset_name --model_name=model_name --mode=origin

We then implement the abstention policy to get the uncertainty score for each original query.

python -m code.evaluation.agent_evaluation --dataset_name=dataset_name --model_name=model_name --mode=origin

Baseline

We use the following code to run the self-consistency-based SemanticEntropy baseline.

python -m code.baseline.vanilla_qa --dataset_name=dataset_name --model_name=model_name

We evaluate the baseline:

python -m code.evaluation.vanilla_evaluation --dataset_name=dataset_name --model_name=model_name

Evaluation

Finally, we evaluate different methods and draw the AR-curve.

python -m code.evaluation.draw_figure --dataset_name=dataset_name --model_name=model_name

License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

🐛 Issues

If you encounter any issues or have questions, please open an issue on GitHub or contact the authors.

Notability

notability 2.0/10

Low stars, obscure repo.