What does this repo signal mean?

OpenBMB (MiniCPM) published OpenBMB/VisRAG (Python). Repository signals like this one surface tooling, evaluation, infrastructure, or model-adjacent work that may not yet have appeared in a launch post. High-signal details: repo OpenBMB/VisRAG · language Python · Solid new repo, 962 stars, by notable lab. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

OpenBMB (MiniCPM) Repo: OpenBMB/VisRAG

Captured source

GitHub · github.com/OpenBMB/VisRAG

OpenBMB/VisRAG repository metadata

published Oct 14, 2024 · seen 5d · captured 11h · http 200 · method plain

OpenBMB/VisRAG

Description: Parsing-free RAG supported by VLMs

Language: Python

License: Apache-2.0

Stars: 963

Forks: 76

Open issues: 0

Created: 2024-10-14T19:29:00Z

Pushed: 2025-12-07T07:45:09Z

Default branch: master

Fork: no

Archived: no

README:

VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation

• 📖 Introduction • 🎉 News • ✨ VisRAG Pipeline • ⚙️ Setup • ⚡️ Training

• 📃 Evaluation • 🔧 Usage • 📄 License • 📧 Contact • 📈 Star History

📖 Introduction

EVisRAG (VisRAG 2.0) is an evidence-guided Vision Retrieval-augmented Generation framework that equips VLMs for multi-image questions by first linguistically observing retrieved images to collect per-image evidence, then reasoning over those cues to answer. EVisRAG trains with Reward-Scoped GRPO, applying fine-grained token-level rewards to jointly optimize visual perception and reasoning.

VisRAG is a novel vision-language model (VLM)-based RAG pipeline. Instead of first parsing the document to obtain text, the pipeline embeds the document directly as an image using a VLM and then retrieves it to augment the generation of a VLM. Compared with traditional text-based RAG, VisRAG retains and uses more of the information in the original documents, eliminating the information loss introduced by parsing.
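The retrieval step described above can be illustrated with a minimal sketch. It assumes that page images and queries have already been embedded into vectors and that cosine similarity is the scoring function; the README excerpt specifies neither. The file names, the toy 3-dimensional vectors, and the `retrieve` helper are all hypothetical and do not come from the VisRAG code base.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, page_vecs, k=2):
    """Return the ids of the k page images most similar to the query."""
    ranked = sorted(
        page_vecs,
        key=lambda page_id: cosine(query_vec, page_vecs[page_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy vectors standing in for VLM embeddings of page images (hypothetical values).
page_vecs = {
    "report_p1.png": [0.9, 0.1, 0.0],
    "report_p2.png": [0.1, 0.8, 0.1],
    "slides_p7.png": [0.7, 0.2, 0.1],
}
query_vec = [1.0, 0.0, 0.0]  # stand-in for the embedded text query

top_pages = retrieve(query_vec, page_vecs, k=2)
print(top_pages)
```

Because the pages are never parsed into text, the retrieved items are the page images themselves, which can then be handed to the generator.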

🎉 News

20251207: Released all benchmarks on Hugging Face.
20251118: Both EVisRAG and VisRAG can be easily reproduced within UltraRAG v2.
20251022: Uploaded all evaluation benchmarks to VisRAG Collections.
20251014: Released EVisRAG-3B on Hugging Face.
20251014: Released EVisRAG (VisRAG 2.0), an end-to-end Vision-Language Model. Released our Paper on arXiv. Released our Model on Hugging Face. Released our Code on GitHub.
20241111: Released our VisRAG Pipeline on GitHub, now supporting visual understanding across multiple PDF documents.
20241104: Released our VisRAG Pipeline on Hugging Face Space.
20241031: Released our VisRAG Pipeline on Colab. Released code for converting files to images, which can be found at visrag_scripts/file2img.
20241015: Released our train data and test data on Hugging Face which can be found in the VisRAG Collection on Hugging Face. It is referenced at the beginning of this page.
20241014: Released our Paper on arXiv. Released our Model on Hugging Face. Released our Code on GitHub.

✨ VisRAG Pipeline

EVisRAG

EVisRAG is an end-to-end framework that equips VLMs with precise visual perception during reasoning in multi-image scenarios. We trained and released VLRMs with EVisRAG built on Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-3B-Instruct.

VisRAG-Ret

VisRAG-Ret is a document embedding model built on MiniCPM-V 2.0, a vision-language model that integrates SigLIP as the vision encoder and MiniCPM-2B as the language model.

VisRAG-Gen

In the paper, we use MiniCPM-V 2.0, MiniCPM-V 2.6, and GPT-4o as the generators. In practice, you can use any VLM you like.
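One way to keep the generator swappable is to treat it as a plain callable. The sketch below is an assumption about how such wiring could look, not the repository's API: the `Generator` type, the `answer` function, and the `echo_generator` stub are hypothetical names, and a real setup would replace the stub with a call to an actual VLM.

```python
from typing import Callable, List

# Hypothetical interface: a generator takes a question plus retrieved page
# image paths and returns an answer string.
Generator = Callable[[str, List[str]], str]

def answer(question: str, retrieved_pages: List[str], generate: Generator) -> str:
    """Pass the question and retrieved page images to whichever generator is plugged in."""
    return generate(question, retrieved_pages)

def echo_generator(question: str, pages: List[str]) -> str:
    """Stub standing in for a real VLM call (for illustration only)."""
    return f"[stub answer to {question!r} using {len(pages)} page image(s)]"

result = answer(
    "What was the Q3 revenue?",
    ["report_p1.png", "slides_p7.png"],
    echo_generator,
)
print(result)
```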

⚙️ Setup

EVisRAG

git clone https://github.com/OpenBMB/VisRAG.git
conda create --name EVisRAG python==3.10
conda activate EVisRAG
cd VisRAG
pip install -r EVisRAG_requirements.txt

VisRAG

git clone https://github.com/OpenBMB/VisRAG.git
conda create --name VisRAG python==3.10.8
conda activate VisRAG
conda install nvidia/label/cuda-11.8.0::cuda-toolkit
cd VisRAG
pip install -r requirements.txt
pip install -e .
cd timm_modified
pip install -e .
cd ..

Note: timm_modified is an enhanced version of the timm library that supports gradient checkpointing, which we use in our training process to reduce memory usage.

⚡️ Training

EVisRAG

To train EVisRAG effectively, we introduce Reward-Scoped Group Relative Policy Optimization (RS-GRPO), which binds fine-grained rewards to scope-specific tokens to jointly optimize visual perception and reasoning abilities of VLMs.
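To make the idea of scope-bound rewards concrete, here is a rough sketch and not the authors' implementation (which the README places in src/RS-GRPO). It assumes the standard GRPO advantage, where each reward is standardized against the other responses sampled for the same prompt, and it further assumes that every token carries a scope label and receives the advantage computed from that scope's reward. The scope names `observe` and `reason`, the data layout, and both function names are hypothetical.

```python
from statistics import mean, pstdev

def group_relative(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def scoped_token_advantages(group, eps=1e-6):
    """Assign each token the group-relative advantage of its own scope.

    `group` is a list of sampled responses for one prompt. Each response has
    'scopes' (one scope label per token) and 'rewards' (scope label -> scalar).
    Every response is assumed to carry a reward for every scope.
    """
    scope_names = sorted({s for resp in group for s in resp["rewards"]})
    adv = {
        s: group_relative([resp["rewards"][s] for resp in group], eps)
        for s in scope_names
    }
    return [[adv[s][i] for s in resp["scopes"]] for i, resp in enumerate(group)]

# Two sampled responses with hypothetical per-scope rewards.
group = [
    {"scopes": ["observe", "observe", "reason"],
     "rewards": {"observe": 1.0, "reason": 0.0}},
    {"scopes": ["observe", "reason", "reason"],
     "rewards": {"observe": 0.0, "reason": 1.0}},
]
token_adv = scoped_token_advantages(group)
print(token_adv)
```

In this toy group the first response is pushed up on its observation tokens and down on its reasoning token, which a single sequence-level reward could not express.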

Stage 1: SFT (based on LLaMA-Factory)

git clone https://github.com/hiyouga/LLaMA-Factory.git
bash evisrag_scripts/full_sft.sh

Stage 2: RS-GRPO (based on Easy-R1)

bash evisrag_scripts/run_rsgrpo.sh

Notes:

1. The training data is available on Hugging Face under EVisRAG-Train, which is referenced at the beginning of this page.
2. We adopt a two-stage training strategy. In the first stage, clone LLaMA-Factory and update the model path in the full_sft.sh script. In the second stage, we built our customized algorithm, RS-GRPO, on top of Easy-R1 specifically for EVisRAG; its implementation can be found in src/RS-GRPO.

VisRAG-Ret

Our training dataset for VisRAG-Ret contains 362,110 Query-Document (Q-D) pairs. It consists of the train sets of openly available academic datasets (34%) and a synthetic dataset (66%) built from pages of web-crawled PDF documents, augmented with pseudo-queries generated by a VLM (GPT-4o).
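For a sense of scale, the quoted shares can be turned into approximate pair counts. The percentages in the README are rounded, so the figures below are estimates rather than official numbers, and the dictionary keys are labels chosen here for illustration.

```python
TOTAL_PAIRS = 362_110

# Shares as quoted in the README (rounded to whole percentages).
SHARES_PERCENT = {"academic_train_sets": 34, "synthetic_pseudo_queries": 66}

# Approximate pair counts implied by the rounded shares.
approx_counts = {
    name: round(TOTAL_PAIRS * pct / 100) for name, pct in SHARES_PERCENT.items()
}
print(approx_counts)
```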

bash visrag_scripts/train_retriever/train.sh 2048 16 8 0.02 1 true false config/deepspeed.json 1e-5 false wmean causal 1 true 2 false

Note: Our training data can be found in the VisRAG collection on Hugging Face, referenced at the beginning of this page. We have separated the In-domain-data and Synthetic-data because they differ substantially. If you wish to train with the complete dataset, you’ll need to…

Excerpt shown — open the source for the full document.

Notability

notability 6.0/10

Solid new repo, 962 stars, by notable lab.