zai-org/CogVLM
Python
Description: a state-of-the-art-level open visual language model | multimodal pre-trained model
Language: Python
License: Apache-2.0
Stars: 6738
Forks: 453
Open issues: 70
Created: 2023-09-18T02:12:50Z
Pushed: 2024-05-29T10:01:33Z
Default branch: main
Fork: no
Archived: no
README:
CogVLM & CogAgent
📗 [Chinese README](./README_zh.md)
🌟 Jump to detailed introduction: [Introduction to CogVLM](#introduction-to-cogvlm), 🆕 [Introduction to CogAgent](#introduction-to-cogagent)
📔 For more detailed usage information, please refer to: CogVLM & CogAgent's technical documentation (in Chinese)
CogVLM
📖 Paper: CogVLM: Visual Expert for Pretrained Language Models
CogVLM is a powerful open-source visual language model (VLM). CogVLM-17B has 10 billion visual parameters and 7 billion language parameters, and supports image understanding and multi-turn dialogue at a resolution of 490*490.
CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC.
CogAgent
📖 Paper: CogAgent: A Visual Language Model for GUI Agents
CogAgent is an open-source visual language model that improves on CogVLM. CogAgent-18B has 11 billion visual parameters and 7 billion language parameters, and supports image understanding at a resolution of 1120*1120. On top of the capabilities of CogVLM, it also has GUI image Agent capabilities.
CogAgent-18B achieves state-of-the-art generalist performance on 9 classic cross-modal benchmarks, including VQAv2, OK-VQA, TextVQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. It significantly surpasses existing models on GUI operation datasets including AITW and Mind2Web.
🌐 Web Demo for CogVLM2: this link
Table of Contents
- [CogVLM \& CogAgent](#cogvlm--cogagent)
- [Release](#release)
- [Get Started](#get-started)
- [Option 1: Inference Using Web Demo.](#option-1-inference-using-web-demo)
- [Option 2:Deploy CogVLM / CogAgent by yourself](#option-2deploy-cogvlm--cogagent-by-yourself)
- [Situation 2.1 CLI (SAT version)](#situation-21-cli-sat-version)
- [Situation 2.2 CLI (Huggingface version)](#situation-22-cli-huggingface-version)
- [Situation 2.3 Web Demo](#situation-23-web-demo)
- [Option 3:Finetuning CogAgent / CogVLM](#option-3finetuning-cogagent--cogvlm)
- [Option 4: OpenAI Vision format](#option-4-openai-vision-format)
- [Hardware requirement](#hardware-requirement)
- [Model checkpoints](#model-checkpoints)
- [Introduction to CogVLM](#introduction-to-cogvlm)
- [Examples](#examples)
- [Introduction to CogAgent](#introduction-to-cogagent)
- [GUI Agent Examples](#gui-agent-examples)
- [Cookbook](#cookbook)
- [Task Prompts](#task-prompts)
- [Which --version to use](#which---version-to-use)
- [FAQ](#faq)
- [License](#license)
- [Citation \& Acknowledgements](#citation--acknowledgements)
Release
- 🔥🔥🔥 News: ``2024/5/20``: We released the next generation of the model, [CogVLM2](https://github.com/THUDM/CogVLM2), which is based on llama3-8b and is on par with (or better than) GPT-4V in most cases! DOWNLOAD and TRY!
- 🔥🔥 News: ``2024/4/5``: CogAgent was selected as a CVPR 2024 Highlight!
- 🔥 News: ``2023/12/26``: We have released the [CogVLM-SFT-311K](dataset.md) dataset, which contains over 150,000 pieces of data that we used for training CogVLM v1.0 only. Feel free to follow and use it.
- News: ``2023/12/18``: New Web UI launched! We have launched a new web UI based on Streamlit, where users can easily talk to CogVLM and CogAgent for a better user experience.
- News: ``2023/12/15``: CogAgent officially launched! CogAgent is an image understanding model developed based on CogVLM. It features vision-based GUI Agent capabilities and further enhancements in image understanding. It supports image input at a resolution of 1120*1120 and offers multiple abilities, including multi-turn dialogue with images, GUI Agent, Grounding, and more.
- News: ``2023/12/8`` We have updated the checkpoint of cogvlm-grounding-generalist to cogvlm-grounding-generalist-v1.1, which was trained with image augmentation and is therefore more robust. See [details](#introduction-to-cogvlm).
- News: ``2023/12/7`` CogVLM now supports 4-bit quantization! You can run inference with just 11GB of GPU memory!
- News: ``2023/11/20`` We have updated the checkpoint of cogvlm-chat to cogvlm-chat-v1.1, unified the versions of chat and VQA, and refreshed the SOTA on various datasets. See [details](#introduction-to-cogvlm).
- News: ``2023/11/20`` We released [cogvlm-chat](https://huggingface.co/THUDM/cogvlm-chat-hf), [cogvlm-grounding-generalist](https://huggingface.co/THUDM/cogvlm-grounding-generalist-hf)/[base](https://huggingface.co/THUDM/cogvlm-grounding-base-hf), [cogvlm-base-490](https://huggingface.co/THUDM/cogvlm-base-490-hf)/[224](https://huggingface.co/THUDM/cogvlm-base-224-hf) on 🤗Huggingface. You can now run inference with transformers in [a few lines of code](#situation-22-cli-huggingface-version)!
- ``2023/10/27`` CogVLM bilingual version is available online! Welcome to try it out!
- ``2023/10/5`` CogVLM-17B released.
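
The 11GB figure quoted for 4-bit quantization can be sanity-checked with a rough estimate of weight storage. The sketch below is only a back-of-the-envelope calculation: it assumes every one of the 17 billion parameters is stored at the stated bit width, which the README does not confirm (some layers may stay in higher precision), and it ignores activations and other runtime buffers. The function name `weight_memory_gb` is made up for this illustration.

```python
# Rough weight-storage estimate for CogVLM-17B.
# Assumption: all parameters share one bit width; real checkpoints may differ.

VISUAL_PARAMS = 10e9    # README: 10 billion visual parameters
LANGUAGE_PARAMS = 7e9   # README: 7 billion language parameters


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


total_params = VISUAL_PARAMS + LANGUAGE_PARAMS

bf16_gb = weight_memory_gb(total_params, 16)  # 2 bytes per parameter
int4_gb = weight_memory_gb(total_params, 4)   # half a byte per parameter

print(f"bf16 weights:  {bf16_gb:.1f} GB")
print(f"4-bit weights: {int4_gb:.1f} GB")
```

Under these assumptions the weights alone take about 34 GB in bf16 and about 8.5 GB at 4 bits, which leaves some headroom within 11GB for runtime overhead.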
Get Started
Option 1: Inference Using Web Demo.
- Click here to enter the CogVLM2 Demo.
If you need to use the Agent and Grounding functions, please refer to [Cookbook - Task Prompts](#task-prompts).
Option 2:Deploy CogVLM / CogAgent by yourself
We support two interfaces for model inference: a CLI and a web demo. If you want to use the model in your own Python code, the CLI scripts are easy to adapt to your case.
First, we need to install the dependencies.

    # CUDA >= 11.8
    pip install -r requirements.txt
    python -m spacy download en_core_web_sm

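
Before installing, it may help to confirm that the local environment meets the `CUDA >= 11.8` requirement. The following is a minimal sketch, not part of the repository: the helper names `parse_version` and `cuda_is_supported` are hypothetical, and the optional lookup relies on `torch.version.cuda`, which reports the CUDA version PyTorch was built against rather than the toolkit installed on the system.

```python
from typing import Tuple

# Minimum CUDA version stated in the install instructions.
MIN_CUDA = (11, 8)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted string such as "11.8" into (11, 8)."""
    return tuple(int(part) for part in version.strip().split("."))


def cuda_is_supported(version: str, minimum: Tuple[int, ...] = MIN_CUDA) -> bool:
    """Return True when `version` is at least `minimum`."""
    return parse_version(version) >= minimum


if __name__ == "__main__":
    try:
        import torch  # optional: only used to read the local CUDA version
        detected = torch.version.cuda  # None on CPU-only builds
    except ImportError:
        detected = None

    if detected is None:
        print("No CUDA-enabled PyTorch build detected.")
    else:
        status = "OK" if cuda_is_supported(detected) else "too old"
        print(f"CUDA {detected}: {status}")
```

Comparing integer tuples rather than raw strings matters here: as strings, `"11.10"` sorts before `"11.8"`, while as tuples `(11, 10)` is correctly treated as the newer version.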
All code for inference is located under the ``basic_demo/`` directory. Please switch to this directory first before proceeding with further operations.
Situation 2.1 CLI (SAT version)
Run CLI demo via:

    # CogAgent
    python cli_demo_sat.py --from_pretrained cogagent-chat --version chat --bf16 --stream_chat
    python cli_demo_sat.py --from_pretrained cogagent-vqa --version chat_old --bf16 --stream_chat
    # CogVLM
    python cli_demo_sat.py --from_pretrained cogvlm-chat --version chat_old --bf16 --stream_chat…

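
Each checkpoint in the commands above is paired with a specific `--version` value, so a small wrapper can keep the two in sync when launching the demo from Python. This is a hedged sketch rather than repository code: `build_cli_command` and `run_cli` are hypothetical helpers, the mapping covers only the three checkpoint and version pairs visible above, and only the flags shown there are used.

```python
import subprocess
import sys
from typing import List

# Checkpoint -> --version pairs copied from the CLI commands shown above.
# Checkpoints not listed there are deliberately left out.
VERSION_FOR_CHECKPOINT = {
    "cogagent-chat": "chat",
    "cogagent-vqa": "chat_old",
    "cogvlm-chat": "chat_old",
}


def build_cli_command(checkpoint: str) -> List[str]:
    """Build the argument list for cli_demo_sat.py for a known checkpoint."""
    if checkpoint not in VERSION_FOR_CHECKPOINT:
        raise ValueError(f"no --version known for checkpoint {checkpoint!r}")
    return [
        sys.executable, "cli_demo_sat.py",
        "--from_pretrained", checkpoint,
        "--version", VERSION_FOR_CHECKPOINT[checkpoint],
        "--bf16",
        "--stream_chat",
    ]


def run_cli(checkpoint: str, demo_dir: str = "basic_demo") -> int:
    """Launch the demo from `demo_dir` and return its exit code."""
    return subprocess.run(build_cli_command(checkpoint), cwd=demo_dir).returncode
```

Calling `run_cli("cogagent-chat")` from the repository root would start the interactive demo inside `basic_demo/`. Using `sys.executable` instead of a bare `python` keeps the child process in the same interpreter and virtual environment as the caller.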
Excerpt shown — open the source for the full document.