RepoZhipu AI (GLM)Zhipu AI (GLM)published May 29, 2022seen 5d

zai-org/CogVideo

Python

Open original ↗

Captured source

source ↗
published May 29, 2022seen 5dcaptured 10hhttp 200method plain

zai-org/CogVideo

Description: text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)

Language: Python

License: Apache-2.0

Stars: 12774

Forks: 1305

Open issues: 113

Created: 2022-05-29T06:46:18Z

Pushed: 2025-11-04T11:19:04Z

Default branch: main

Fork: no

Archived: no

README:

CogVideo & CogVideoX

[中文阅读](./README_zh.md)

[日本語で読む](./README_ja.md)

Experience the CogVideoX-5B model online at 🤗 Huggingface Space or 🤖 ModelScope Space

📚 View the paper and user guide

👋 Join our WeChat and Discord

📍 Visit QingYing and API Platform to experience larger-scale commercial video generation models.

Project Updates

  • 🔥🔥 News: ``2025/03/24``: We have launched CogKit, a fine-tuning and inference framework for the CogView4 and CogVideoX series. This toolkit allows you to fully explore and utilize our multimodal generation models.
  • 🔥 News: ``2025/02/28: DDIM Inverse is now supported in CogVideoX-5B and CogVideoX1.5-5B`. Check [here](inference/ddim_inversion.py).
  • 🔥 News: ``2025/01/08: We have updated the code for Lora fine-tuning based on the diffusers` version model, which uses less GPU memory. For more details, please see [here](finetune/README.md).
  • 🔥 News: ``2024/11/15: We released the CogVideoX1.5` model in the diffusers version. Only minor parameter adjustments are needed to continue using previous code.
  • 🔥 News: ``2024/11/08``: We have released the CogVideoX1.5 model. CogVideoX1.5 is an upgraded version of the open-source model CogVideoX.

The CogVideoX1.5-5B series supports 10-second videos with higher resolution, and CogVideoX1.5-5B-I2V supports video generation at any resolution. The SAT code has already been updated, while the diffusers version is still under adaptation. Download the SAT version code here.

  • 🔥 News: ``2024/10/13: A more cost-effective fine-tuning framework for CogVideoX-5B` that works with a single

4090 GPU, cogvideox-factory, has been released. It supports fine-tuning with multiple resolutions. Feel free to use it!

  • 🔥 News: ``2024/10/10``: We have updated our technical report. Please

click here to view it. More training details and a demo have been added. To see the demo, click here.- 🔥 News: ``2024/10/09``: We have publicly released the technical documentation for CogVideoX fine-tuning on Feishu, further increasing distribution flexibility. All examples in the public documentation can be fully reproduced.

  • 🔥 News: ``2024/9/19``: We have open-sourced the CogVideoX series image-to-video model CogVideoX-5B-I2V.

This model can take an image as a background input and generate a video combined with prompt words, offering greater controllability. With this, the CogVideoX series models now support three tasks: text-to-video generation, video continuation, and image-to-video generation. Welcome to try it online at Experience.

  • 🔥 ``2024/9/19``: The Caption

model CogVLM2-Caption, used in the training process of CogVideoX to convert video data into text descriptions, has been open-sourced. Welcome to download and use it.

  • 🔥 ``2024/8/27``: We have open-sourced a larger model in the CogVideoX series, CogVideoX-5B. We have

significantly optimized the model's inference performance, greatly lowering the inference threshold. You can run CogVideoX-2B on older GPUs like GTX 1080TI, and CogVideoX-5B on desktop GPUs like RTX 3060. Please strictly follow the [requirements](requirements.txt) to update and install dependencies, and refer to [cli_demo](inference/cli_demo.py) for inference code. Additionally, the open-source license for the CogVideoX-2B model has been changed to the Apache 2.0 License.

  • 🔥 ``2024/8/6``: We have open-sourced 3D Causal VAE, used for CogVideoX-2B, which can reconstruct videos with

almost no loss.

  • 🔥 ``2024/8/6``: We have open-sourced the first model of the CogVideoX series video generation models, **CogVideoX-2B

**.

  • 🌱 Source: ``2022/5/19``: We have open-sourced the CogVideo video generation model (now you can see it in

the CogVideo branch). This is the first open-source large Transformer-based text-to-video generation model. You can access the ICLR'23 paper for technical details.

Table of Contents

Jump to a specific section:

  • [Quick Start](#quick-start)
  • [Prompt Optimization](#prompt-optimization)
  • [SAT](#sat)
  • [Diffusers](#diffusers)
  • [Gallery](#gallery)
  • [CogVideoX-5B](#cogvideox-5b)
  • [CogVideoX-2B](#cogvideox-2b)
  • [Model Introduction](#model-introduction)
  • [Friendly Links](#friendly-links)
  • [Project Structure](#project-structure)
  • [Quick Start with Colab](#quick-start-with-colab)
  • [Inference](#inference)
  • [finetune](#finetune)
  • [sat](#sat-1)
  • [Tools](#tools)
  • [CogVideo(ICLR'23)](#cogvideoiclr23)
  • [Citation](#citation)
  • [Model-License](#model-license)

Quick Start

Prompt Optimization

Before running the model, please refer to [this guide](inference/convert_demo.py) to see how we use large models like GLM-4 (or other comparable products, such as GPT-4) to optimize the model. This is crucial because the model is trained with long prompts, and a good prompt directly impacts the quality of the video generation.

SAT

Please make sure your Python version is between 3.10 and 3.12, inclusive of both 3.10 and 3.12.

Follow instructions in [sat_demo](sat/README.md): Contains the inference code and fine-tuning code of SAT weights. It is recommended to improve based on the CogVideoX model structure. Innovative researchers use this code to better perform rapid stacking and development.

Diffusers

Please make sure your Python version is between 3.10 and 3.12, inclusive of both 3.10 and 3.12.

pip install -r requirements.txt

Then follow [diffusers_demo](inference/cli_demo.py): A more detailed explanation of the inference code, mentioning the significance of common parameters.

For more details on quantized inference, please refer to…

Excerpt shown — open the source for the full document.