What does this repo signal mean?

Zhipu AI (GLM) published zai-org/CogVLM2 (Python). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo zai-org/CogVLM2 · language Python. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Captured source

GitHub: github.com/zai-org/CogVLM2

zai-org/CogVLM2 repository metadata

Published May 10, 2024 · seen 5d · captured 11h · HTTP 200 · method plain

zai-org/CogVLM2

Description: GPT4V-level open-source multi-modal model based on Llama3-8B

Language: Python

License: Apache-2.0

Stars: 2438

Forks: 163

Open issues: 61

Created: 2024-05-10T09:07:11Z

Pushed: 2025-03-03T03:01:31Z

Default branch: main

Fork: no

Archived: no

README:

CogVLM2 & CogVLM2-Video

[Chinese README](./README_zh.md)

👋 Join our WeChat · 💡 Try CogVLM2 Online · 💡 Try CogVLM2-Video Online

📍Experience the larger-scale CogVLM model on the ZhipuAI Open Platform.

Recent updates

🔥 News: `2024/8/30`: The CogVLM2 paper has been published on arXiv.
🔥 News: `2024/7/12`: We have released an online web demo of CogVLM2-Video. You are welcome to try it.
🔥 News: `2024/7/8`: We released CogVLM2-Video, the video understanding version of the CogVLM2 model. By extracting keyframes, it can interpret sequences of images, and it supports videos of up to 1 minute. See more in our blog.

🔥 News: `2024/6/8`: We released the CogVLM2 TGI weights, a version of the model that can be served for inference with TGI. See the inference code here.

🔥 News: `2024/6/5`: We released GLM-4V-9B, which uses the same data and training recipes as CogVLM2 but with GLM-9B as the language backbone. We removed the visual experts to reduce the model size to 13B. More details are available in the GLM-4 repo.

🔥 News: `2024/5/24`: We have released the Int4 version of the model, which requires only 16GB of GPU memory for inference. You can also quantize to Int4 on the fly by passing `--quant 4`.

🔥 News: `2024/5/20`: We released the next-generation model CogVLM2, which is based on llama3-8b and is equivalent to (or better than) GPT-4V in most cases. You are welcome to download it!
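
As a rough sanity check on the 16GB figure in the `2024/5/24` item, the sketch below estimates weight storage only for a 19B-parameter model (the size given in the model names) at 16-bit and 4-bit precision. The function name is hypothetical, and the calculation assumes every parameter is stored at the same bit width. It ignores activations, the KV cache, and framework overhead, so actual memory use will be higher than these numbers.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


# Assumption: 19e9 parameters, read off the "19B" suffix in the model names.
PARAMS_19B = 19e9

fp16_gb = weight_memory_gb(PARAMS_19B, 16)  # 16-bit weights
int4_gb = weight_memory_gb(PARAMS_19B, 4)   # 4-bit weights

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f" 4-bit weights: ~{int4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights alone fit comfortably within 16GB, leaving the rest for runtime buffers.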

Model introduction

We are launching the new-generation CogVLM2 series and open-sourcing two models based on Meta-Llama-3-8B-Instruct. Compared with the previous generation of open-source CogVLM models, the CogVLM2 series offers the following improvements:

1. Significant improvements on many benchmarks, such as TextVQA and DocVQA.
2. Support for 8K content length.
3. Support for image resolutions up to **1344 * 1344**.
4. An open-source model version that supports both **Chinese and English**.

You can see the details of the CogVLM2 family of open source models in the table below:

| Model Name | cogvlm2-llama3-chat-19B | cogvlm2-llama3-chinese-chat-19B | cogvlm2-video-llama3-chat | cogvlm2-video-llama3-base |
|------------------|------------------|------------------|------------------|------------------|
| Base Model | Meta-Llama-3-8B-Instruct | Meta-Llama-3-8B-Instruct | Meta-Llama-3-8B-Instruct | Meta-Llama-3-8B-Instruct |
| Language | English | Chinese, English | English | English |
| Task | Image Understanding, Multi-turn Dialogue Model | Image Understanding, Multi-turn Dialogue Model | Video Understanding, Single-turn Dialogue Model | Video Understanding, Base Model, No Dialogue |
| Model Link | 🤗 Huggingface 🤖 ModelScope 💫 Wise Model | 🤗 Huggingface 🤖 ModelScope 💫 Wise Model | 🤗 Huggingface 🤖 ModelScope | 🤗 Huggingface 🤖 ModelScope |
| Experience Link | 📙 Official Page | 📙 Official Page 🤖 ModelScope | 📙 Official Page 🤖 ModelScope | / |
| Int4 Model | 🤗 Huggingface 🤖 ModelScope 💫 Wise Model | 🤗 Huggingface 🤖 ModelScope 💫 Wise Model | / | / |
| Text Length | 8K | 8K | 2K | 2K |
| Image Resolution | 1344 * 1344 | 1344 * 1344 | 224 * 224 (Video, take the first 24 frames) | 224 * 224 (Video, take the average 24 frames) |
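
The Image Resolution row distinguishes two ways of picking 24 frames from a video. The sketch below illustrates the difference, assuming that "take the average 24 frames" means sampling 24 frames at evenly spaced positions; that reading is an interpretation, not something the README confirms. Both function names are hypothetical, and this is not the repository's actual video loading code.

```python
def first_n_frame_indices(total_frames: int, n: int = 24) -> list[int]:
    """Indices of the first n frames (or all frames if the video is shorter)."""
    return list(range(min(n, total_frames)))


def uniform_frame_indices(total_frames: int, n: int = 24) -> list[int]:
    """Indices of n frames spread evenly across the whole video.

    Assumption: one frame is taken from the centre of each of n
    equal-width segments of the video.
    """
    if total_frames <= n:
        return list(range(total_frames))
    step = total_frames / n
    return [int(step * i + step / 2) for i in range(n)]


# Example: a 240-frame clip.
head = first_n_frame_indices(240)
spread = uniform_frame_indices(240)
print(head[:3], head[-1])
print(spread[:3], spread[-1])
```

The first strategy only ever sees the opening of the clip, while the second covers its full duration at a coarser temporal resolution.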

Benchmark

Image Understanding

Compared with the previous generation of open-source CogVLM models, our open-source models achieve good results on many benchmarks. Their performance is competitive with some closed-source models, as shown in the table below:

Excerpt shown — open the source for the full document.