zai-org/CogVLM2
Description: GPT4V-level open-source multi-modal model based on Llama3-8B
Language: Python
License: Apache-2.0
Stars: 2438
Forks: 163
Open issues: 61
Created: 2024-05-10T09:07:11Z
Pushed: 2025-03-03T03:01:31Z
Default branch: main
Fork: no
Archived: no
README:
CogVLM2 & CogVLM2-Video
[Chinese README](./README_zh.md)
👋 Join our Wechat · 💡Try CogVLM2 Online 💡Try CogVLM2-Video Online
📍Experience the larger-scale CogVLM model on the ZhipuAI Open Platform.
Recent updates
- 🔥 News: `2024/8/30`: The CogVLM2 paper has been published on arXiv.
- 🔥 News: `2024/7/12`: We have released the CogVLM2-Video online web demo; you are welcome to try it.
- 🔥 News: `2024/7/8`: We released CogVLM2-Video, the video understanding version of the CogVLM2 model. By extracting keyframes, it can interpret continuous images. The model supports videos of up to 1 minute. See more in our blog.
- 🔥 News: `2024/6/8`: We released the CogVLM2 TGI weights, a version of the model that can be served with TGI. See the inference code here.
- 🔥 News: `2024/6/5`: We released GLM-4V-9B, which uses the same data and training recipes as CogVLM2 but with GLM-9B as the language backbone. We removed the visual experts to reduce the model size to 13B. More details are in the GLM-4 repo.
- 🔥 News: `2024/5/24`: We released the Int4 version of the model, which requires only 16GB of video memory for inference. You can also run an on-the-fly int4 version by passing `--quant 4`.
- 🔥 News: `2024/5/20`: We released the next-generation model CogVLM2, which is based on llama3-8b and is equivalent to (or better than) GPT-4V in most cases! Welcome to download!
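As a rough sanity check on the Int4 memory figure above, the sketch below estimates the memory taken by the weights alone at different precisions. The 19B parameter count is an assumption read off the `19B` suffix in the model names, and the helper `weight_memory_gb` is hypothetical, not part of this repository. It ignores activations, the KV cache, and any components kept at higher precision.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Return the memory needed to store the weights alone, in GB (10**9 bytes)."""
    if bits_per_param <= 0:
        raise ValueError("bits_per_param must be positive")
    # bits -> bytes -> gigabytes
    return num_params * bits_per_param / 8 / 1e9


# Assumption: parameter count taken from the "19B" suffix in the model names.
NUM_PARAMS = 19e9

for bits in (16, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the 4-bit weights come to about 9.5 GB. The gap to the stated 16GB is plausibly taken up by runtime overhead, but the source does not break this down.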
Model introduction
We are launching the new-generation CogVLM2 series and open-sourcing two models based on Meta-Llama-3-8B-Instruct. Compared with the previous generation of CogVLM open source models, the CogVLM2 series has the following improvements:
1. Significant improvements on many benchmarks such as TextVQA and DocVQA.
2. Support for an 8K context length.
3. Support for image resolutions up to **1344 * 1344**.
4. An open source model version that supports both **Chinese and English**.
You can see the details of the CogVLM2 family of open source models in the table below:
| Model Name | cogvlm2-llama3-chat-19B | cogvlm2-llama3-chinese-chat-19B | cogvlm2-video-llama3-chat | cogvlm2-video-llama3-base |
|------------------|-------------------------|---------------------------------|---------------------------|---------------------------|
| Base Model | Meta-Llama-3-8B-Instruct | Meta-Llama-3-8B-Instruct | Meta-Llama-3-8B-Instruct | Meta-Llama-3-8B-Instruct |
| Language | English | Chinese, English | English | English |
| Task | Image Understanding, Multi-turn Dialogue Model | Image Understanding, Multi-turn Dialogue Model | Video Understanding, Single-turn Dialogue Model | Video Understanding, Base Model, No Dialogue |
| Model Link | 🤗 Huggingface 🤖 ModelScope 💫 Wise Model | 🤗 Huggingface 🤖 ModelScope 💫 Wise Model | 🤗 Huggingface 🤖 ModelScope | 🤗 Huggingface 🤖 ModelScope |
| Experience Link | 📙 Official Page | 📙 Official Page 🤖 ModelScope | 📙 Official Page 🤖 ModelScope | / |
| Int4 Model | 🤗 Huggingface 🤖 ModelScope 💫 Wise Model | 🤗 Huggingface 🤖 ModelScope 💫 Wise Model | / | / |
| Text Length | 8K | 8K | 2K | 2K |
| Image Resolution | 1344 * 1344 | 1344 * 1344 | 224 * 224 (Video, take the first 24 frames) | 224 * 224 (Video, take the average 24 frames) |
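The two video models in the table differ in how 24 frames are chosen from a clip. The sketch below illustrates one possible reading: "first 24 frames" as the leading frames, and "average 24 frames" as 24 evenly spaced frames. That second interpretation is an assumption, and both helper functions are hypothetical illustrations rather than the repository's own sampling code.

```python
def first_n_indices(total_frames: int, n: int = 24) -> list[int]:
    """Indices of the first n frames (fewer if the clip is shorter)."""
    return list(range(min(n, max(total_frames, 0))))


def uniform_indices(total_frames: int, n: int = 24) -> list[int]:
    """Indices of n evenly spaced frames (assumed meaning of "average 24 frames")."""
    if total_frames <= 0:
        return []
    if total_frames <= n:
        return list(range(total_frames))
    step = total_frames / n
    # Take the frame at the midpoint of each of the n equal-length segments.
    return [int(i * step + step / 2) for i in range(n)]


# Example: a hypothetical 240-frame clip.
print(first_n_indices(240))
print(uniform_indices(240))
```

Evenly spaced sampling covers the whole clip, while taking the leading frames only sees its beginning; which behavior each checkpoint actually expects should be confirmed against the repository's demo code.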
Benchmark
Image Understanding
Our open source models achieve good results on many benchmarks compared with the previous generation of CogVLM open source models. Their performance is competitive with some non-open-source models, as shown in the table below:
| Model | Open Source | LLM Size | TextVQA | DocVQA | ChartQA | OCRbench | VCR_EASY | VCR_HARD | MMMU | MMVet | MMBench |…
Excerpt shown — open the source for the full document.