amazon-science/journey-before-destination
Language: Python
License: NOASSERTION
Stars: 1
Forks: 0
Open issues: 10
Created: 2026-02-13T22:20:11Z
Pushed: 2026-06-05T23:53:03Z
Default branch: main
Fork: no
Archived: no
README: Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking
Paper
This repository provides the implementation used in Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking (EACL 2026 Oral Presentation).
Figure. Reasoning-chain faithfulness does not always align with final-answer correctness. (a–b) Visually unfaithful reasoning chains that nonetheless yield correct answers on perception tasks. (c) A visually faithful chain producing an incorrect answer, where the error arises from reasoning rather than perception.
# Using this Codebase
## Setup
We built our environment with Python 3.9.21.
1. **Install dependencies**

       pip install -r requirements.txt

2. **Configure paths and model settings**
   - Populate the required fields in `config.ini`:
     - `project_root`: Absolute path to the root directory of this codebase.
     - `hf_home`: Directory used to cache Hugging Face models and datasets.
     - `batch_size`: Batch size for vanilla prompting experiments. For self-reflection, batch size is always 1.
     - `bedrock_judge_model`: Model ID for a Bedrock-based VLM judge. A list of supported models is available [here](https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html).
     - `anthropic_version`: Specifies the model version to use when running the VLM judge via Amazon Bedrock.
     - `huggingface_judge_model`: Model alias for an open-source VLM judge. Currently supported:
       - `qwen` → `Qwen/Qwen2.5-VL-72B-Instruct`
       - `llava` → `llava-hf/llava-v1.6-34b-hf`

## Running Mitigation Strategy

The proposed mitigation strategy detects unfaithful sentences during generation using Claude 3.7 Sonnet, and then corrects those sentences via a self-reflection mechanism. The implementation of the self-reflection algorithm can be found at:

    how_to_intervene/self_reflection.py

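The detect-then-correct idea described above can be sketched roughly as follows. This is only an illustration, not the code in `how_to_intervene/self_reflection.py`: the function name `generate_with_reflection`, the three callables it takes, and the `max_revisions` and `max_sentences` limits are all hypothetical. In the actual pipeline the generator would be the VLM and the judge would be Claude 3.7 Sonnet.

```python
from typing import Callable, List, Optional


def generate_with_reflection(
    propose_sentence: Callable[[List[str]], Optional[str]],
    is_faithful: Callable[[List[str], str], bool],
    revise_sentence: Callable[[List[str], str], str],
    max_revisions: int = 2,
    max_sentences: int = 64,
) -> List[str]:
    """Build a reasoning chain sentence by sentence (hypothetical sketch).

    propose_sentence: returns the next sentence given the chain so far,
        or None when generation is finished.
    is_faithful: judge that checks one sentence against the image/context.
    revise_sentence: asks the generator to rewrite a flagged sentence.
    """
    chain: List[str] = []
    while len(chain) < max_sentences:
        sentence = propose_sentence(chain)
        if sentence is None:
            break
        revisions = 0
        # Re-generate a flagged sentence until the judge accepts it
        # or the revision budget is exhausted.
        while revisions < max_revisions and not is_faithful(chain, sentence):
            sentence = revise_sentence(chain, sentence)
            revisions += 1
        chain.append(sentence)
    return chain
```

Judging each sentence before it is appended keeps later sentences from being conditioned on an earlier unfaithful one. The revision cap is an assumption added here so the loop always terminates.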
### Example Commands

To run the mitigation strategy,

    python run_mitigation.py \
        --method self-reflection \
        --model_alias thinklite-vl \
        --dataset mmeval-pro-perception \
        --response_filename responses.json

To run the vanilla baseline,

    python run_mitigation.py \
        --method vanilla \
        --model_alias thinklite-vl \
        --dataset mmeval-pro-perception \
        --response_filename responses.json

Results are printed to the console. A typical output looks like:

    Unfaithful perception steps: 0.0570902394106814 (31/543).
    Illogical reasoning steps: 0.059782608695652176 (22/368).
    Accuracy: 0.7413793103448276 (129/174).
    Faithfulness: {'total_sentences': 912, 'perception_sentences': 543, 'hallucinated_sentences': 31, 'reasoning_sentences': 368, 'illogical_sentences': 22}

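The two step-level ratios in the sample output match the counts in the `Faithfulness` dictionary (31/543 and 22/368). A minimal sketch of that arithmetic is given below, assuming the dictionary keys shown in the sample output. The helper name `faithfulness_rates` and the returned key names are hypothetical and are not part of the repository.

```python
from typing import Dict


def faithfulness_rates(counts: Dict[str, int]) -> Dict[str, float]:
    """Recompute the step-level rates from sentence counts (hypothetical helper)."""

    def ratio(numerator: int, denominator: int) -> float:
        # Guard against an empty category instead of dividing by zero.
        return numerator / denominator if denominator else 0.0

    return {
        "unfaithful_perception_rate": ratio(
            counts["hallucinated_sentences"], counts["perception_sentences"]
        ),
        "illogical_reasoning_rate": ratio(
            counts["illogical_sentences"], counts["reasoning_sentences"]
        ),
    }


# Counts copied from the sample console output.
sample_counts = {
    "total_sentences": 912,
    "perception_sentences": 543,
    "hallucinated_sentences": 31,
    "reasoning_sentences": 368,
    "illogical_sentences": 22,
}
rates = faithfulness_rates(sample_counts)
print(rates)
```

Accuracy is not derivable from this dictionary. In the sample output it is reported separately as correct answers over questions (129/174).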
### Command-Line Arguments

The main script supports the following parameters:

- `method`: Generation strategy. Options: `vanilla`, `self-reflection`.
- `model_alias`: Vision–language model used for generation. These are simplified aliases (not Hugging Face IDs). The mapping from `model_alias` to Hugging Face identifiers is defined in `utils/model.py`. Currently supported:
  - `openvlthinker`
  - `mmeureka`
  - `ocean-r1`
  - `thinklite-vl`
- `dataset`: Evaluation dataset to run on. Options: `mmeval-pro-perception`, `hallusionbench`, `mmvp`.
- `question`: Runs single-sample inference using the provided question. If set, this overrides the `dataset` option.
- `image_paths`: Space-separated list of image file paths corresponding to `question`. This supersedes the `dataset` parameter.
- `response_filename`: Name of the JSON file used to store model outputs. All VLM generations, along with the corresponding inputs, are written to this file prior to LLM-based judging. An example of the saved JSON format is shown below:

      [
        {
          "query": "How many example pictures have you seen?\n \nA. 6\nB. 8\nC. 10\nD. 12",
          "gt_answer": "B",
          "model_response": "To determine how many example pictures have been shown, let's count them step by step...",
          "judge_response": "Evaluation of Model's Reasoning\n\nSentence 1: \"The first row has 3 example pictures.\"\nType: PERCEPTION\nFaithfulness: FAITHFUL\n\nSentence 2:...",
          "image_base64": "..."
        },
        { ... }
      ]

- `judge_model_category`: Judge used for scoring VLM outputs.
  - Options: `bedrock` (default), `huggingface`.
  - If `bedrock` is selected, the model specified by `bedrock_judge_model` in `config.ini` is used.
  - If `huggingface` is selected, the model specified by `huggingface_judge_model` in `config.ini` is used.

### Citation

If you find our work useful, please cite our paper:

    @inproceedings{uppaal2026journey,
      title={Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking},
      author={Uppaal, Rheeya and Htut, Phu Mon and Bai, Min and Pappas, Nikolaos and Qi, Zheng and Swamy, Sandesh},
      booktitle={The 19th Conference of the European Chapter of the Association for Computational Linguistics},
      year={2026}
    }

## License

This work is licensed under the terms specified in the [LICENSE](LICENSE) file.