Repo · OpenBMB (MiniCPM) · published Oct 14, 2024

OpenBMB/VisRAG

Python


OpenBMB/VisRAG

Description: Parsing-free RAG supported by VLMs

Language: Python

License: Apache-2.0

Stars: 963

Forks: 76

Open issues: 0

Created: 2024-10-14T19:29:00Z

Pushed: 2025-12-07T07:45:09Z

Default branch: master

Fork: no

Archived: no

README:

VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation

• 📖 Introduction • 🎉 News • ✨ VisRAG Pipeline • ⚙️ Setup • ⚡️ Training

• 📃 Evaluation • 🔧 Usage • 📄 License • 📧 Contact • 📈 Star History

📖 Introduction

EVisRAG (VisRAG 2.0) is an evidence-guided Vision Retrieval-augmented Generation framework that equips VLMs for multi-image questions by first linguistically observing retrieved images to collect per-image evidence, then reasoning over those cues to answer. EVisRAG trains with Reward-Scoped GRPO, applying fine-grained token-level rewards to jointly optimize visual perception and reasoning.

VisRAG is a novel vision-language model (VLM)-based RAG pipeline. Instead of first parsing a document to obtain text, the pipeline embeds the document directly as an image using a VLM, then retrieves it to enhance the generation of a VLM. Compared to traditional text-based RAG, VisRAG maximizes the retention and utilization of the information in the original documents, eliminating the information loss introduced during parsing.
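The retrieval step described above can be pictured with a minimal sketch. It assumes that page images and the text query have already been embedded into a shared vector space by some VLM; the toy vectors, the page ids, and the `retrieve` helper are all hypothetical and are not taken from the repository.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, page_vecs, top_k=2):
    """Return the ids of the top_k pages ranked by similarity to the query."""
    ranked = sorted(
        page_vecs,
        key=lambda pid: cosine(query_vec, page_vecs[pid]),
        reverse=True,
    )
    return ranked[:top_k]

# Toy vectors standing in for VLM embeddings of page images (assumption:
# a real system would obtain these from an embedding model such as VisRAG-Ret).
page_vecs = {
    "page_1": [0.9, 0.1, 0.0],
    "page_2": [0.1, 0.8, 0.1],
    "page_3": [0.7, 0.3, 0.0],
}
query_vec = [1.0, 0.2, 0.0]  # toy embedding of the text query

top_pages = retrieve(query_vec, page_vecs, top_k=2)
print(top_pages)  # ['page_1', 'page_3']
```

The retrieved page images, rather than parsed text, would then be passed to the generator together with the question.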

🎉 News

  • 20251207: Released all benchmarks on HuggingFace.
  • 20251118: Both EVisRAG and VisRAG can be easily reproduced within UltraRAG v2.
  • 20251022: Uploaded all evaluation benchmarks to the VisRAG Collection.
  • 20251014: Released EVisRAG-3B on HuggingFace.
  • 20251014: Released EVisRAG (VisRAG 2.0), an end-to-end Vision-Language Model. Released our Paper on arXiv. Released our Model on HuggingFace. Released our Code on GitHub.
  • 20241111: Released our VisRAG Pipeline on GitHub, now supporting visual understanding across multiple PDF documents.
  • 20241104: Released our VisRAG Pipeline on Hugging Face Space.
  • 20241031: Released our VisRAG Pipeline on Colab. Released code for converting files to images, which can be found at visrag_scripts/file2img.
  • 20241015: Released our train data and test data on Hugging Face which can be found in the VisRAG Collection on Hugging Face. It is referenced at the beginning of this page.
  • 20241014: Released our Paper on arXiv. Released our Model on Hugging Face. Released our Code on GitHub.

✨ VisRAG Pipeline

EVisRAG

EVisRAG is an end-to-end framework that equips VLMs with precise visual perception during reasoning in multi-image scenarios. We trained and released VLRMs with EVisRAG built on Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-3B-Instruct.
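The observe-then-reason order described in the introduction can be pictured as two explicit steps. This is only an illustrative sketch: `vlm` is a hypothetical callable taking a prompt and a list of images, the prompts are invented for the example, and the released model may well produce evidence and answer within a single generation rather than through separate calls.

```python
def answer_with_evidence(question, images, vlm):
    """Collect a note per image first, then reason over the collected notes.

    `vlm` is a hypothetical callable: (prompt: str, images: list) -> str.
    """
    evidence = []
    for idx, image in enumerate(images, start=1):
        # Step 1: describe each retrieved image on its own.
        note = vlm(f"Describe what image {idx} shows that is relevant to: {question}", [image])
        evidence.append(f"Image {idx}: {note}")
    # Step 2: answer using the per-image notes as cues.
    reasoning_prompt = "\n".join(evidence) + f"\nUsing the evidence above, answer: {question}"
    return vlm(reasoning_prompt, images), evidence

def fake_vlm(prompt, images):
    """Stub standing in for a real model so the sketch runs offline."""
    return f"stub reply for {len(images)} image(s)"

answer, evidence = answer_with_evidence(
    "What was the 2023 revenue?", ["page_a.png", "page_b.png"], fake_vlm
)
print(evidence)
print(answer)
```

Swapping `fake_vlm` for a wrapper around an actual model would be the only change needed to try the flow on real pages.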

VisRAG-Ret

VisRAG-Ret is a document embedding model built on MiniCPM-V 2.0, a vision-language model that integrates SigLIP as the vision encoder and MiniCPM-2B as the language model.

VisRAG-Gen

In the paper, we use MiniCPM-V 2.0, MiniCPM-V 2.6, and GPT-4o as the generators. You can, however, use any VLM you like!

⚙️ Setup

EVisRAG

git clone https://github.com/OpenBMB/VisRAG.git
conda create --name EVisRAG python==3.10
conda activate EVisRAG
cd VisRAG
pip install -r EVisRAG_requirements.txt

VisRAG

git clone https://github.com/OpenBMB/VisRAG.git
conda create --name VisRAG python==3.10.8
conda activate VisRAG
conda install nvidia/label/cuda-11.8.0::cuda-toolkit
cd VisRAG
pip install -r requirements.txt
pip install -e .
cd timm_modified
pip install -e .
cd ..

Note: timm_modified is an enhanced version of the timm library that supports gradient checkpointing, which we use in our training process to reduce memory usage.

⚡️ Training

EVisRAG

To train EVisRAG effectively, we introduce Reward-Scoped Group Relative Policy Optimization (RS-GRPO), which binds fine-grained rewards to scope-specific tokens to jointly optimize visual perception and reasoning abilities of VLMs.
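The idea of binding rewards to scope-specific tokens can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation in src/RS-GRPO: it uses the usual GRPO group normalization (reward minus group mean, divided by group standard deviation), and the reward names, scope labels, and helper functions are hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization within one group of sampled responses.

    Uses the population standard deviation; implementations may differ.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def scoped_token_advantages(token_scopes, scope_advantage):
    """Give each token the advantage of its scope (0.0 if it has no scope)."""
    return [scope_advantage.get(scope, 0.0) for scope in token_scopes]

# Hypothetical group of 3 responses to one prompt, each scored with two
# separate rewards (illustrative names, not from the repository).
perception_rewards = [1.0, 0.0, 1.0]  # was the per-image evidence correct?
reasoning_rewards = [1.0, 0.0, 0.0]   # was the final answer correct?

perception_adv = group_relative_advantages(perception_rewards)
reasoning_adv = group_relative_advantages(reasoning_rewards)

# Scope label for every token of the third response: evidence tokens first,
# then reasoning tokens.
token_scopes = ["observe", "observe", "reason", "reason", "reason"]
token_adv = scoped_token_advantages(
    token_scopes,
    {"observe": perception_adv[2], "reason": reasoning_adv[2]},
)
print(token_adv)
```

In this toy group the third response gathered correct evidence but reached a wrong answer, so its evidence tokens receive a positive advantage while its reasoning tokens receive a negative one, instead of one scalar being spread over the whole sequence.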

*Stage 1: SFT* (based on LLaMA-Factory)

git clone https://github.com/hiyouga/LLaMA-Factory.git
bash evisrag_scripts/full_sft.sh

*Stage 2: RS-GRPO* (based on Easy-R1)

bash evisrag_scripts/run_rsgrpo.sh

Notes:

1. The training data is available on Hugging Face under EVisRAG-Train, which is referenced at the beginning of this page.
2. We adopt a two-stage training strategy. In the first stage, clone LLaMA-Factory and update the model path in the full_sft.sh script. In the second stage, we use RS-GRPO, our customized algorithm built on Easy-R1 and designed specifically for EVisRAG; its implementation can be found in src/RS-GRPO.

VisRAG-Ret

Our training dataset for VisRAG-Ret contains 362,110 Query-Document (Q-D) pairs. It comprises the training sets of openly available academic datasets (34%) and a synthetic dataset (66%) built from pages of web-crawled PDF documents, augmented with pseudo-queries generated by a VLM (GPT-4o).
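As a rough guide to the size of each part, the percentages above can be turned into approximate pair counts. The percentages are rounded in the source, so the exact counts may differ slightly; the variable names are illustrative.

```python
TOTAL_PAIRS = 362_110

# Shares as stated in the text (rounded to whole percent).
shares = {"academic": 0.34, "synthetic": 0.66}

approx_counts = {name: round(TOTAL_PAIRS * share) for name, share in shares.items()}
print(approx_counts)  # {'academic': 123117, 'synthetic': 238993}
```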

bash visrag_scripts/train_retriever/train.sh 2048 16 8 0.02 1 true false config/deepspeed.json 1e-5 false wmean causal 1 true 2 false

Note: 1. Our training data can be found in the VisRAG collection on Hugging Face, referenced at the beginning of this page. The In-domain-data and Synthetic-data are stored separately because they differ substantially. If you wish to train with the complete dataset, you’ll need to…

