OpenBMB/VisRAG
Description: Parsing-free RAG supported by VLMs
Language: Python
License: Apache-2.0
Stars: 963
Forks: 76
Open issues: 0
Created: 2024-10-14T19:29:00Z
Pushed: 2025-12-07T07:45:09Z
Default branch: master
Fork: no
Archived: no
README:
VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation
• 📖 Introduction • 🎉 News • ✨ VisRAG Pipeline • ⚙️ Setup • ⚡️ Training
• 📃 Evaluation • 🔧 Usage • 📄 License • 📧 Contact • 📈 Star History
📖 Introduction
EVisRAG (VisRAG 2.0) is an evidence-guided Vision Retrieval-augmented Generation framework that equips VLMs for multi-image questions by first linguistically observing retrieved images to collect per-image evidence, then reasoning over those cues to answer. EVisRAG trains with Reward-Scoped GRPO, applying fine-grained token-level rewards to jointly optimize visual perception and reasoning.
VisRAG is a novel vision-language model (VLM)-based RAG pipeline. Instead of first parsing the document to obtain text, the pipeline uses a VLM to embed the document directly as an image, then retrieves it to enhance the generation of a VLM. Compared to traditional text-based RAG, VisRAG retains and uses more of the information in the original documents, eliminating the information loss introduced during parsing.
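The retrieval step described above can be pictured as nearest-neighbor search over page-image embeddings. The following is a minimal sketch, not the VisRAG implementation: the toy vectors stand in for embeddings that a VLM-based retriever would produce, and the names `cosine`, `retrieve_top_k`, and the page file names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query_vec: List[float],
    page_vecs: Dict[str, List[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Rank page images by similarity to the query and keep the best k."""
    scored = [(page_id, cosine(query_vec, vec)) for page_id, vec in page_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Assumption: in a real pipeline these vectors come from embedding each
# page *image* with a VLM; no text parsing happens before this point.
page_vecs = {
    "page_1.png": [1.0, 0.0, 0.0],
    "page_2.png": [0.0, 1.0, 0.0],
    "page_3.png": [0.7, 0.7, 0.0],
}
query_vec = [0.9, 0.1, 0.0]

top_pages = retrieve_top_k(query_vec, page_vecs, k=2)
for page_id, score in top_pages:
    print(page_id, round(score, 3))
```

The retrieved page images, rather than extracted text, would then be passed to the generator VLM together with the question.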
🎉 News
- 20251207: Released all benchmarks on HuggingFace.
- 20251118: Both EVisRAG and VisRAG can be easily reproduced within UltraRAG v2.
- 20251022: Uploaded all evaluation benchmarks to the VisRAG Collection.
- 20251014: Released EVisRAG-3B on HuggingFace.
- 20251014: Released EVisRAG (VisRAG 2.0), an end-to-end Vision-Language Model. Released our Paper on arXiv. Released our Model on HuggingFace. Released our Code on GitHub.
- 20241111: Released our VisRAG Pipeline on GitHub, now supporting visual understanding across multiple PDF documents.
- 20241104: Released our VisRAG Pipeline on Hugging Face Space.
- 20241031: Released our VisRAG Pipeline on Colab. Released code for converting files to images, which can be found at visrag_scripts/file2img.
- 20241015: Released our train data and test data on Hugging Face, which can be found in the VisRAG Collection on Hugging Face. It is referenced at the beginning of this page.
- 20241014: Released our Paper on arXiv. Released our Model on Hugging Face. Released our Code on GitHub.
✨ VisRAG Pipeline
EVisRAG
EVisRAG is an end-to-end framework that equips VLMs with precise visual perception during reasoning in multi-image scenarios. We trained and released VLRMs with EVisRAG built on Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-3B-Instruct.
VisRAG-Ret
VisRAG-Ret is a document embedding model built on MiniCPM-V 2.0, a vision-language model that integrates SigLIP as the vision encoder and MiniCPM-2B as the language model.
VisRAG-Gen
In the paper, we use MiniCPM-V 2.0, MiniCPM-V 2.6, and GPT-4o as the generators. In fact, you can use any VLM you like!
⚙️ Setup
EVisRAG
git clone https://github.com/OpenBMB/VisRAG.git
conda create --name EVisRAG python==3.10
conda activate EVisRAG
cd EVisRAG
pip install -r EVisRAG_requirements.txt
VisRAG
git clone https://github.com/OpenBMB/VisRAG.git
conda create --name VisRAG python==3.10.8
conda activate VisRAG
conda install nvidia/label/cuda-11.8.0::cuda-toolkit
cd VisRAG
pip install -r requirements.txt
pip install -e .
cd timm_modified
pip install -e .
cd ..
Note: 1. timm_modified is an enhanced version of the timm library that supports gradient checkpointing, which we use in our training process to reduce memory usage.
⚡️ Training
EVisRAG
To train EVisRAG effectively, we introduce Reward-Scoped Group Relative Policy Optimization (RS-GRPO), which binds fine-grained rewards to scope-specific tokens to jointly optimize visual perception and reasoning abilities of VLMs.
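To make the idea of binding rewards to scope-specific tokens more concrete, here is a minimal sketch under stated assumptions. It is not the implementation in src/RS-GRPO. It uses the standard GRPO group-relative advantage (reward minus group mean, divided by group standard deviation) and assumes, hypothetically, that each token is tagged with a scope such as `"observe"` or `"reason"` and that each scope has its own per-response reward. The function names, scope labels, and the choice of population standard deviation are illustrative only.

```python
from statistics import mean, pstdev
from typing import Dict, List


def group_relative(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standard GRPO-style advantage: normalize rewards within one group
    of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ here
    return [(r - mu) / (sigma + eps) for r in rewards]


def scoped_token_advantages(
    token_scopes: List[List[str]],
    scope_rewards: Dict[str, List[float]],
) -> List[List[float]]:
    """Give every token the group-relative advantage of its own scope.

    token_scopes[i][t] is the (hypothetical) scope label of token t in
    response i; scope_rewards[scope][i] is the reward of response i for
    that scope.
    """
    scope_adv = {scope: group_relative(rs) for scope, rs in scope_rewards.items()}
    return [
        [scope_adv[scope][i] for scope in scopes]
        for i, scopes in enumerate(token_scopes)
    ]


# Two sampled responses for one prompt (toy data).
token_scopes = [
    ["observe", "observe", "reason"],
    ["observe", "reason", "reason"],
]
scope_rewards = {
    "observe": [1.0, 0.0],  # response 0 collected better evidence
    "reason": [0.0, 1.0],   # response 1 reasoned to the right answer
}

advantages = scoped_token_advantages(token_scopes, scope_rewards)
for row in advantages:
    print([round(a, 2) for a in row])
```

In this toy case the evidence tokens of the first response get a positive advantage while its reasoning token gets a negative one, which a single sequence-level reward could not express.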
*Stage1: SFT* (based on LLaMA-Factory)
git clone https://github.com/hiyouga/LLaMA-Factory.git
bash evisrag_scripts/full_sft.sh
*Stage2: RS-GRPO* (based on Easy-R1)
bash evisrag_scripts/run_rsgrpo.sh
Notes:
1. The training data is available on Hugging Face under EVisRAG-Train, which is referenced at the beginning of this page.
2. We adopt a two-stage training strategy. In the first stage, please clone LLaMA-Factory and update the model path in the full_sft.sh script. In the second stage, we built our customized algorithm RS-GRPO based on Easy-R1, specifically designed for EVisRAG, whose implementation can be found in src/RS-GRPO.
VisRAG-Ret
Our training dataset for VisRAG-Ret contains 362,110 Query-Document (Q-D) pairs. It combines the train sets of openly available academic datasets (34%) with a synthetic dataset (66%) made up of pages from web-crawled PDF documents, augmented with pseudo-queries generated by a VLM (GPT-4o).
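As a rough sanity check on the split above, the approximate number of pairs in each part can be derived from the total. The percentages are rounded in the description, so the counts below are estimates rather than the exact dataset sizes; the variable names are illustrative.

```python
TOTAL_PAIRS = 362_110      # total Q-D pairs stated above
ACADEMIC_SHARE = 0.34      # rounded share of academic train sets

# Estimate the academic part, then assign the remainder to the synthetic
# part so that the two estimates always add up to the stated total.
academic_pairs = round(TOTAL_PAIRS * ACADEMIC_SHARE)
synthetic_pairs = TOTAL_PAIRS - academic_pairs

print(f"academic (approx.):  {academic_pairs:,}")
print(f"synthetic (approx.): {synthetic_pairs:,}")
```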
bash visrag_scripts/train_retriever/train.sh 2048 16 8 0.02 1 true false config/deepspeed.json 1e-5 false wmean causal 1 true 2 false
Note: 1. Our training data can be found in the VisRAG collection on Hugging Face, referenced at the beginning of this page. Please note that we have separated the In-domain-data and Synthetic-data due to their distinct differences. If you wish to train with the complete dataset, you’ll need to…