What does this repo signal mean?

Amazon (Nova) published the Python repository amazon-science/journey-before-destination. Repository signals like this one can surface tooling, evaluation, infrastructure, or model-adjacent work before it appears in a launch post. Key details: repo amazon-science/journey-before-destination · language Python · low stars, routine repo. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Captured source

GitHub: github.com/amazon-science/journey-before-destination

amazon-science/journey-before-destination repository metadata

published Feb 13, 2026 · seen Jun 5 · captured Jun 11 · HTTP 200 · method plain

amazon-science/journey-before-destination

Language: Python

License: NOASSERTION

Stars: 1

Forks: 0

Open issues: 10

Created: 2026-02-13T22:20:11Z

Pushed: 2026-06-05T23:53:03Z

Default branch: main

Fork: no

Archived: no

README: Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking

This repository provides the implementation used in Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking (EACL 2026 Oral Presentation).

Figure. Reasoning-chain faithfulness does not always align with final-answer correctness. (a–b) Visually unfaithful reasoning chains that nonetheless yield correct answers on perception tasks. (c) A visually faithful chain producing an incorrect answer, where the error arises from reasoning rather than perception.

Using this Codebase

Setup

We built our environment with Python 3.9.21.

1. **Install dependencies**: `pip install -r requirements.txt`
2. **Configure paths and model settings**
   - Populate the required fields in `config.ini`:
     - `project_root`: Absolute path to the root directory of this codebase.
     - `hf_home`: Directory used to cache Hugging Face models and datasets.
     - `batch_size`: Batch size for vanilla prompting experiments. For self-reflection, batch size is always 1.
     - `bedrock_judge_model`: Model ID for a Bedrock-based VLM judge. A list of supported models is available [here](https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html).
     - `anthropic_version`: Specifies the model version to use when running the VLM judge via Amazon Bedrock.
     - `huggingface_judge_model`: Model alias for an open-source VLM judge. Currently supported:
       - `qwen` → `Qwen/Qwen2.5-VL-72B-Instruct`
       - `llava` → `llava-hf/llava-v1.6-34b-hf`
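
As an illustration of how these fields might be consumed, here is a minimal sketch using Python's standard `configparser`. The section name `[settings]`, the sample values, and the `load_settings` helper are assumptions made for this example; the repository's actual `config.ini` layout and loading code may differ. Only the field names and the judge alias mapping come from the list above.

```python
import configparser

# Hypothetical config.ini contents. The [settings] section name and every
# value here are placeholders for illustration, not taken from the repository.
SAMPLE_CONFIG = """
[settings]
project_root = /path/to/journey-before-destination
hf_home = /path/to/hf_cache
batch_size = 8
bedrock_judge_model = <bedrock-model-id>
anthropic_version = <anthropic-version>
huggingface_judge_model = qwen
"""

# Alias-to-model mapping as listed in the README.
HF_JUDGE_MODELS = {
    "qwen": "Qwen/Qwen2.5-VL-72B-Instruct",
    "llava": "llava-hf/llava-v1.6-34b-hf",
}


def load_settings(text: str, section: str = "settings") -> dict:
    """Parse INI text and resolve the Hugging Face judge alias."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    cfg = parser[section]
    alias = cfg["huggingface_judge_model"]
    if alias not in HF_JUDGE_MODELS:
        raise ValueError(f"Unsupported huggingface_judge_model: {alias!r}")
    return {
        "project_root": cfg["project_root"],
        "hf_home": cfg["hf_home"],
        "batch_size": cfg.getint("batch_size"),  # stored as text, read as int
        "bedrock_judge_model": cfg["bedrock_judge_model"],
        "anthropic_version": cfg["anthropic_version"],
        "huggingface_judge_model_id": HF_JUDGE_MODELS[alias],
    }


settings = load_settings(SAMPLE_CONFIG)
print(settings["batch_size"], settings["huggingface_judge_model_id"])
```

Validating the alias at load time surfaces a typo in `huggingface_judge_model` before any model download starts.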

## Running Mitigation Strategy

The proposed mitigation strategy detects unfaithful sentences during generation using Claude 3.7 Sonnet, and then corrects those sentences via a self-reflection mechanism.
The self-reflection algorithm is implemented in `how_to_intervene/self_reflection.py`.
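
The detect-then-correct idea described above can be sketched as a simple loop. This is not the code in `how_to_intervene/self_reflection.py`: the function name `generate_with_reflection`, the callables `next_sentence`, `is_faithful`, and `revise`, and the retry limits are all hypothetical stand-ins. In the paper's setup the faithfulness check is performed by Claude 3.7 Sonnet; here it is a toy string test so the example runs without any model.

```python
from typing import Callable, List, Optional


def generate_with_reflection(
    next_sentence: Callable[[List[str]], Optional[str]],
    is_faithful: Callable[[str], bool],
    revise: Callable[[List[str], str], str],
    max_sentences: int = 32,
    max_retries: int = 2,
) -> List[str]:
    """Build a reasoning chain sentence by sentence, revising flagged ones."""
    chain: List[str] = []
    while len(chain) < max_sentences:
        sentence = next_sentence(chain)
        if sentence is None:  # generator signals the chain is finished
            break
        retries = 0
        # Ask for a revision while the judge flags the sentence, up to a limit.
        while not is_faithful(sentence) and retries < max_retries:
            sentence = revise(chain, sentence)
            retries += 1
        chain.append(sentence)
    return chain


# Toy stand-ins for the generator, the judge, and the reviser (all hypothetical).
_script = iter(["The image shows three cats.", "So the answer is B."])


def toy_next(chain: List[str]) -> Optional[str]:
    return next(_script, None)


def toy_judge(sentence: str) -> bool:
    return "three cats" not in sentence


def toy_revise(chain: List[str], sentence: str) -> str:
    return sentence.replace("three cats", "two cats")


chain = generate_with_reflection(toy_next, toy_judge, toy_revise)
print(chain)
```

Because each sentence is checked as it is produced, this kind of loop processes one sample at a time, which is consistent with the README's note that self-reflection always uses a batch size of 1.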

### Example Commands
To run the mitigation strategy:

    python run_mitigation.py \
        --method self-reflection \
        --model_alias thinklite-vl \
        --dataset mmeval-pro-perception \
        --response_filename responses.json

To run the vanilla baseline:

    python run_mitigation.py \
        --method vanilla \
        --model_alias thinklite-vl \
        --dataset mmeval-pro-perception \
        --response_filename responses.json

Results are printed to the console. A typical output looks like:

    Unfaithful perception steps: 0.0570902394106814 (31/543).
    Illogical reasoning steps: 0.059782608695652176 (22/368).
    Accuracy: 0.7413793103448276 (129/174).
    Faithfulness: {'total_sentences': 912, 'perception_sentences': 543, 'hallucinated_sentences': 31, 'reasoning_sentences': 368, 'illogical_sentences': 22}
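
The ratios in this sample output follow directly from the counts in the `Faithfulness` dictionary. The sketch below recomputes them; the `rate` helper and variable names are illustrative, and the assumption that each ratio is a plain count divided by its category total is inferred from the printed fractions rather than from the repository code.

```python
# Counts copied from the sample output in the README.
faithfulness = {
    "total_sentences": 912,
    "perception_sentences": 543,
    "hallucinated_sentences": 31,
    "reasoning_sentences": 368,
    "illogical_sentences": 22,
}
correct_answers, total_questions = 129, 174


def rate(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


unfaithful_perception = rate(
    faithfulness["hallucinated_sentences"], faithfulness["perception_sentences"]
)
illogical_reasoning = rate(
    faithfulness["illogical_sentences"], faithfulness["reasoning_sentences"]
)
accuracy = rate(correct_answers, total_questions)

print(f"Unfaithful perception steps: {unfaithful_perception:.4f}")
print(f"Illogical reasoning steps: {illogical_reasoning:.4f}")
print(f"Accuracy: {accuracy:.4f}")
```

In this sample, the perception and reasoning counts sum to 911, one fewer than `total_sentences`, so one sentence appears to fall into neither category.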

### Command-Line Arguments
The main script supports the following parameters:

- `method`: Generation strategy. Options: `vanilla`, `self-reflection`.
- `model_alias`: Vision–language model used for generation. These are simplified aliases (not Hugging Face IDs); the mapping from `model_alias` to Hugging Face identifiers is defined in `utils/model.py`. Currently supported:
  - `openvlthinker`
  - `mmeureka`
  - `ocean-r1`
  - `thinklite-vl`
- `dataset`: Evaluation dataset to run on. Options: `mmeval-pro-perception`, `hallusionbench`, `mmvp`.
- `question`: Runs single-sample inference using the provided question. If set, this overrides the `dataset` option.
- `image_paths`: Space-separated list of image file paths corresponding to `question`. If set, this also overrides the `dataset` option.
- `response_filename`: Name of the JSON file used to store model outputs. All VLM generations, along with the corresponding inputs, are written to this file prior to LLM-based judging. An example of the saved JSON format is shown below:

    [
      {
        "query": "How many example pictures have you seen?\n \nA. 6\nB. 8\nC. 10\nD. 12",
        "gt_answer": "B",
        "model_response": "To determine how many example pictures have been shown, let's count them step by step...",
        "judge_response": "Evaluation of Model's Reasoning\n\nSentence 1: \"The first row has 3 example pictures.\"\nType: PERCEPTION\nFaithfulness: FAITHFUL\n\nSentence 2:...",
        "image_base64": "..."
      },
      { ... }
    ]

- `judge_model_category`: Judge used for scoring VLM outputs.
  - Options: `bedrock` (default), `huggingface`.
  - If `bedrock` is selected, the model specified by `bedrock_judge_model` in `config.ini` is used.
  - If `huggingface` is selected, the model specified by `huggingface_judge_model` in `config.ini` is used.
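
For readers who want to post-process the file named by `response_filename`, the sketch below shows one way to pull per-sentence labels out of `judge_response`. It assumes the judge output keeps the `Sentence N: "..."`, `Type: ...`, `Faithfulness: ...` layout seen in the example record. The second sentence, the `UNFAITHFUL` label, and the `parse_judge_response` helper are invented for illustration, so the regular expression may need adjusting against real outputs.

```python
import json
import re

# One record shaped like the README example. Sentence 2 and the label
# "UNFAITHFUL" are assumptions added to make the example complete.
records_json = json.dumps([
    {
        "query": "How many example pictures have you seen?\n \nA. 6\nB. 8\nC. 10\nD. 12",
        "gt_answer": "B",
        "model_response": "To determine how many example pictures have been shown...",
        "judge_response": (
            "Evaluation of Model's Reasoning\n\n"
            'Sentence 1: "The first row has 3 example pictures."\n'
            "Type: PERCEPTION\nFaithfulness: FAITHFUL\n\n"
            'Sentence 2: "The second row has 5 example pictures."\n'
            "Type: PERCEPTION\nFaithfulness: UNFAITHFUL\n"
        ),
        "image_base64": "",
    }
])

# Matches one judged sentence: its index, quoted text, type, and label.
SENTENCE_RE = re.compile(
    r'Sentence (\d+): "(.*)"\nType: (\w+)\nFaithfulness: (\w+)'
)


def parse_judge_response(text: str) -> list:
    """Extract per-sentence judgments from a judge response string."""
    return [
        {"index": int(idx), "sentence": sent, "type": kind, "label": label}
        for idx, sent, kind, label in SENTENCE_RE.findall(text)
    ]


records = json.loads(records_json)  # stands in for reading the saved file
parsed = parse_judge_response(records[0]["judge_response"])
perception = [p for p in parsed if p["type"] == "PERCEPTION"]
unfaithful = [p for p in perception if p["label"] != "FAITHFUL"]
print(len(perception), len(unfaithful))
```

With a real run, `records` would instead come from `json.load` on the file passed as `response_filename`.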

### Citation

If you find our work useful, please cite our paper:

    @inproceedings{uppaal2026journey,
      title={Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking},
      author={Uppaal, Rheeya and Htut, Phu Mon and Bai, Min and Pappas, Nikolaos and Qi, Zheng and Swamy, Sandesh},
      booktitle={The 19th Conference of the European Chapter of the Association for Computational Linguistics},
      year={2026}
    }

## License
This work is licensed under the terms specified in the [LICENSE](LICENSE) file.

Notability

Notability: 1.0/10 (low stars, routine repo)