Repo: Zhipu AI (GLM), published Sep 18, 2023

zai-org/CogVLM

Python



Description: a state-of-the-art-level open visual language model | multimodal pre-trained model

Language: Python

License: Apache-2.0

Stars: 6738

Forks: 453

Open issues: 70

Created: 2023-09-18T02:12:50Z

Pushed: 2024-05-29T10:01:33Z

Default branch: main

Fork: no

Archived: no

README:

CogVLM & CogAgent

📗 [Chinese README](./README_zh.md)

🌟 Jump to detailed introduction: [Introduction to CogVLM](#introduction-to-cogvlm), 🆕 [Introduction to CogAgent](#introduction-to-cogagent)

📔 For more detailed usage information, please refer to: CogVLM & CogAgent's technical documentation (in Chinese)

CogVLM

📖 Paper: CogVLM: Visual Expert for Pretrained Language Models

CogVLM is a powerful open-source visual language model (VLM). CogVLM-17B has 10 billion visual parameters and 7 billion language parameters, supporting image understanding and multi-turn dialogue with a resolution of 490*490.

CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC.

CogAgent

📖 Paper: CogAgent: A Visual Language Model for GUI Agents

CogAgent is an open-source visual language model built on CogVLM. CogAgent-18B has 11 billion visual parameters and 7 billion language parameters, and supports image understanding at a resolution of 1120*1120. In addition to the capabilities of CogVLM, it has GUI image agent capabilities.

CogAgent-18B achieves state-of-the-art generalist performance on 9 classic cross-modal benchmarks, including VQAv2, OK-VQA, TextVQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. It significantly surpasses existing models on GUI operation datasets including AITW and Mind2Web.

🌐 Web Demo for CogVLM2: this link

Table of Contents

  • [CogVLM \& CogAgent](#cogvlm--cogagent)
  • [Release](#release)
  • [Get Started](#get-started)
  • [Option 1: Inference Using Web Demo.](#option-1-inference-using-web-demo)
  • [Option 2:Deploy CogVLM / CogAgent by yourself](#option-2deploy-cogvlm--cogagent-by-yourself)
  • [Situation 2.1 CLI (SAT version)](#situation-21-cli-sat-version)
  • [Situation 2.2 CLI (Huggingface version)](#situation-22-cli-huggingface-version)
  • [Situation 2.3 Web Demo](#situation-23-web-demo)
  • [Option 3:Finetuning CogAgent / CogVLM](#option-3finetuning-cogagent--cogvlm)
  • [Option 4: OpenAI Vision format](#option-4-openai-vision-format)
  • [Hardware requirement](#hardware-requirement)
  • [Model checkpoints](#model-checkpoints)
  • [Introduction to CogVLM](#introduction-to-cogvlm)
  • [Examples](#examples)
  • [Introduction to CogAgent](#introduction-to-cogagent)
  • [GUI Agent Examples](#gui-agent-examples)
  • [Cookbook](#cookbook)
  • [Task Prompts](#task-prompts)
  • [Which --version to use](#which---version-to-use)
  • [FAQ](#faq)
  • [License](#license)
  • [Citation \& Acknowledgements](#citation--acknowledgements)

Release

  • 🔥🔥🔥 News: ``2024/5/20``: We released the next-generation model, [CogVLM2](https://github.com/THUDM/CogVLM2), which is based on llama3-8b and is on par with (or better than) GPT-4V in most cases! DOWNLOAD and TRY!
  • 🔥🔥 News: ``2024/4/5``: CogAgent was selected as a CVPR 2024 Highlight!
  • 🔥 News: ``2023/12/26``: We have released the [CogVLM-SFT-311K](dataset.md) dataset, which contains over 150,000 pieces of data that we used only for training CogVLM v1.0. You are welcome to follow and use it.

  • News: ``2023/12/18``: New Web UI launched! We have launched a new web UI based on Streamlit, where users can easily talk to CogVLM and CogAgent for a better user experience.

  • News: ``2023/12/15``: CogAgent officially launched! CogAgent is an image understanding model developed based on CogVLM. It features visual-based GUI Agent capabilities and further enhancements in image understanding. It supports image input at a resolution of 1120*1120 and offers multiple abilities, including multi-turn dialogue with images, GUI Agent, Grounding, and more.

  • News: ``2023/12/8``: We have updated the checkpoint of cogvlm-grounding-generalist to cogvlm-grounding-generalist-v1.1, which was trained with image augmentation and is therefore more robust. See [details](#introduction-to-cogvlm).

  • News: ``2023/12/7``: CogVLM now supports 4-bit quantization! You can run inference with just 11GB of GPU memory!
  • News: ``2023/11/20``: We have updated the checkpoint of cogvlm-chat to cogvlm-chat-v1.1, unified the chat and VQA versions, and set new SOTA results on various datasets. See [details](#introduction-to-cogvlm).

  • News: ``2023/11/20``: We released [cogvlm-chat](https://huggingface.co/THUDM/cogvlm-chat-hf), [cogvlm-grounding-generalist](https://huggingface.co/THUDM/cogvlm-grounding-generalist-hf)/[base](https://huggingface.co/THUDM/cogvlm-grounding-base-hf), [cogvlm-base-490](https://huggingface.co/THUDM/cogvlm-base-490-hf)/[224](https://huggingface.co/THUDM/cogvlm-base-224-hf) on 🤗Huggingface. You can now run inference with transformers in [a few lines of code](#situation-22-cli-huggingface-version)!
  • ``2023/10/27``: The CogVLM bilingual version is available online! Welcome to try it out!
  • ``2023/10/5``: CogVLM-17B released.
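
As a rough sanity check on the 4-bit quantization note above, the following sketch estimates the memory taken by the weights alone at different precisions. It assumes every one of the 17 billion parameters (10B visual + 7B language, as stated in this README) is stored at the given bit width, which may not match how the released quantized checkpoint is actually laid out. The helper name `weight_memory_gb` is hypothetical and is not part of the repository.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# CogVLM-17B: 10B visual + 7B language parameters (figures from this README).
COGVLM_PARAMS = 10e9 + 7e9

# Assumption: all parameters share one precision; real checkpoints may mix them.
for label, bits in [("bf16", 16), ("4-bit", 4)]:
    gb = weight_memory_gb(COGVLM_PARAMS, bits)
    print(f"{label}: ~{gb:.1f} GB for weights")
```

Under these assumptions the 4-bit weights come to roughly 8.5 GB, which is in the same range as the 11GB figure quoted above once activations and other runtime overhead are added on top.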

Get Started

Option 1: Inference Using Web Demo.

If you need to use Agent and Grounding functions, please refer to [Cookbook - Task Prompts](#task-prompts)

Option 2:Deploy CogVLM / CogAgent by yourself

We support two interfaces for model inference: a CLI and a web demo. If you want to use the model in your own Python code, the CLI scripts are easy to adapt to your case.

First, we need to install the dependencies.

# CUDA >= 11.8
pip install -r requirements.txt
python -m spacy download en_core_web_sm
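
Before installing, it can help to confirm that the environment meets the `CUDA >= 11.8` note above. The sketch below is one possible check, not something shipped with the repository: `cuda_version_ok` and `torch_cuda_version` are hypothetical helper names, and `torch.version.cuda` reports the CUDA version PyTorch was built against (or `None` on CPU-only builds), which is not necessarily the driver version on the machine.

```python
from typing import Optional, Tuple

MIN_CUDA: Tuple[int, int] = (11, 8)  # from the "CUDA >= 11.8" note above


def cuda_version_ok(version: Optional[str],
                    minimum: Tuple[int, int] = MIN_CUDA) -> bool:
    """Return True if a 'major.minor' CUDA version string meets the minimum."""
    if not version:
        return False
    parts = version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor) >= minimum


def torch_cuda_version() -> Optional[str]:
    """CUDA version PyTorch was built with, or None if unavailable."""
    try:
        import torch  # assumption: PyTorch is the framework in use
    except ImportError:
        return None
    return torch.version.cuda
```

Calling `cuda_version_ok(torch_cuda_version())` returns `False` when PyTorch is missing or is a CPU-only build, so a `False` result is a prompt to investigate rather than a definitive diagnosis.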

All code for inference is located under the ``basic_demo/`` directory. Please switch to this directory first before proceeding with further operations.

Situation 2.1 CLI (SAT version)

Run CLI demo via:

# CogAgent
python cli_demo_sat.py --from_pretrained cogagent-chat --version chat --bf16 --stream_chat
python cli_demo_sat.py --from_pretrained cogagent-vqa --version chat_old --bf16 --stream_chat

# CogVLM
python cli_demo_sat.py --from_pretrained cogvlm-chat --version chat_old --bf16 --stream_chat…
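
The CLI invocations above can also be assembled from Python, which is convenient when scripting several model and version combinations. This is a minimal sketch that uses only the flags shown in the commands above (`--from_pretrained`, `--version`, `--bf16`, `--stream_chat`); the helper names `build_cli_command` and `launch` are hypothetical, and the `basic_demo` working directory is taken from the instructions earlier in this section.

```python
import shlex
import subprocess
from typing import List


def build_cli_command(from_pretrained: str, version: str,
                      bf16: bool = True,
                      stream_chat: bool = True) -> List[str]:
    """Build the argument list for cli_demo_sat.py from the flags shown above."""
    cmd = ["python", "cli_demo_sat.py",
           "--from_pretrained", from_pretrained,
           "--version", version]
    if bf16:
        cmd.append("--bf16")
    if stream_chat:
        cmd.append("--stream_chat")
    return cmd


def launch(cmd: List[str], demo_dir: str = "basic_demo") -> None:
    """Run the interactive CLI; assumes the repository root is the cwd."""
    subprocess.run(cmd, cwd=demo_dir, check=True)


# Print the command instead of running it, so nothing is downloaded here.
print(shlex.join(build_cli_command("cogagent-chat", "chat")))
```

Passing the arguments as a list avoids shell quoting issues; `launch` is left uncalled here because starting the demo loads the model checkpoint.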
