ForkSiliconFlowSiliconFlowpublished Sep 6, 2024seen 5d

siliconflow/CogVideo-P

forked from zai-org/CogVideo

Open original ↗

Captured source

source ↗
published Sep 6, 2024seen 5dcaptured 9hhttp 200method plain

siliconflow/CogVideo-P

Description: Text-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)

License: Apache-2.0

Stars: 9

Forks: 0

Open issues: 0

Created: 2024-09-06T01:25:59Z

Pushed: 2024-10-11T13:52:12Z

Default branch: main

Fork: yes

Parent repository: zai-org/CogVideo

Archived: no

README:

CogVideo & CogVideoX

[中文阅读](./README_zh.md)

[日本語で読む](./README_ja.md)

Experience the CogVideoX-5B model online at 🤗 Huggingface Space or 🤖 ModelScope Space

📚 View the paper and user guide

👋 Join our WeChat and Discord

📍 Visit QingYing and API Platform to experience larger-scale commercial video generation models.

Update and News

  • 🔥🔥 News: ``2024/8/29: By adding pipe.enable_sequential_cpu_offload() and pipe.vae.enable_slicing()` to the

inference code of CogVideoX-5B, VRAM usage can be reduced to 5GB. Please check the updated [cli_demo](inference/cli_demo.py).

  • 🔥 News: ``2024/8/27``: The CogVideoX-2B model's open-source license has been changed to the **Apache 2.0

License**.

  • 🔥 News: ``2024/8/27``: We have open-sourced a larger model in the CogVideoX series, CogVideoX-5B.

We have significantly optimized the model's inference performance, greatly lowering the inference threshold. You can run CogVideoX-2B on older GPUs like the GTX 1080TI, and run the CogVideoX-5B model on mid-range GPUs like the RTX 3060. Please ensure you update and install the dependencies according to the [requirements](requirements.txt), and refer to the [cli_demo](inference/cli_demo.py) for inference code.

  • 🔥 News: ``2024/8/20``: VEnhancer now supports enhancing videos

generated by CogVideoX, achieving higher resolution and higher quality video rendering. We welcome you to try it out by following the [tutorial](tools/venhancer/README_zh.md).

  • 🔥 News: ``2024/8/15: The SwissArmyTransformer dependency in CogVideoX has been upgraded to 0.4.12`.

Fine-tuning no longer requires installing SwissArmyTransformer from source. Additionally, the Tied VAE technique has been applied in the implementation within the diffusers library. Please install diffusers and accelerate libraries from source. Inference for CogVideoX now requires only 12GB of VRAM. The inference code needs to be modified. Please check [cli_demo](inference/cli_demo.py).

  • 🔥 News: ``2024/8/12``: The CogVideoX paper has been uploaded to arxiv. Feel free to check out

the paper.

  • 🔥 News: ``2024/8/7: CogVideoX has been integrated into diffusers` version 0.30.0. Inference can now be

performed on a single 3090 GPU. For more details, please refer to the [code](inference/cli_demo.py).

  • 🔥 News: ``2024/8/6``: We have also open-sourced 3D Causal VAE used in CogVideoX-2B, which can

reconstruct the video almost losslessly.

  • 🔥 News: ``2024/8/6``: We have open-sourced CogVideoX-2B,the first model in the CogVideoX series of video

generation models.

  • 🌱 Source: ``2022/5/19: We have open-sourced **CogVideo** (now you can see in CogVideo` branch),the first

open-sourced pretrained text-to-video model, and you can check ICLR'23 CogVideo Paper for technical details.

More powerful models with larger parameter sizes are on the way~ Stay tuned!

Table of Contents

Jump to a specific section:

  • [Quick Start](#Quick-Start)
  • [SAT](#sat)
  • [Diffusers](#Diffusers)
  • [CogVideoX-2B Video Works](#cogvideox-2b-gallery)
  • [Introduction to the CogVideoX Model](#Model-Introduction)
  • [Full Project Structure](#project-structure)
  • [Inference](#inference)
  • [SAT](#sat)
  • [Tools](#tools)
  • [Introduction to CogVideo(ICLR'23) Model](#cogvideoiclr23)
  • [Citations](#Citation)
  • [Open Source Project Plan](#Open-Source-Project-Plan)
  • [Model License](#Model-License)

Quick Start

Prompt Optimization

Before running the model, please refer to [this guide](inference/convert_demo.py) to see how we use large models like GLM-4 (or other comparable products, such as GPT-4) to optimize the model. This is crucial because the model is trained with long prompts, and a good prompt directly impacts the quality of the video generation.

SAT

Please make sure your Python version is between 3.10 and 3.12, inclusive of both 3.10 and 3.12.

Follow instructions in [sat_demo](sat/README.md): Contains the inference code and fine-tuning code of SAT weights. It is recommended to improve based on the CogVideoX model structure. Innovative researchers use this code to better perform rapid stacking and development.

Diffusers

Please make sure your Python version is between 3.10 and 3.12, inclusive of both 3.10 and 3.12.

pip install -r requirements.txt

Then follow [diffusers_demo](inference/cli_demo.py): A more detailed explanation of the inference code, mentioning the significance of common parameters.

Gallery

CogVideoX-5B

CogVideoX-2B

To view the corresponding prompt words for the gallery, please click [here](resources/galary_prompt.md)

Model Introduction

CogVideoX is an open-source version of the video generation model originating from QingYing. The table below displays the list of video generation models we currently offer, along with their foundational information.

Model Name CogVideoX-2B CogVideoX-5B

Model Description Entry-level model, balancing compatibility. Low cost for running and secondary development. Larger model with higher video generation quality and better visual effects.

Inference Precision FP16* (Recommended), BF16, FP32, FP8*, INT8, no support for INT4 BF16 (Recommended), FP16, FP32, FP8*, INT8, no support for INT4

Single GPU VRAM Consumption

SAT FP16: 18GB diffusers FP16: starting from 4GB* diffusers INT8(torchao): starting from 3.6GB* SAT BF16: 26GB diffusers BF16: starting from 5GB* diffusers INT8(torchao): starting from 4.4GB*

Multi-GPU Inference VRAM Consumption FP16: 10GB* using diffusers BF16: 15GB* using diffusers

Inference Speed (Step = 50, FP/BF16) Single A100: ~90 seconds Single H100: ~45 seconds Single A100: ~180 seconds Single H100: ~90 seconds

Fine-tuning Precision FP16 BF16

Fine-tuning VRAM Consumption (per GPU) 47 GB (bs=1, LORA) 61 GB (bs=2, LORA) 62GB (bs=1, SFT) 63 GB (bs=1, LORA) 80 GB (bs=2, LORA) 75GB (bs=1, SFT)

Prompt Language English*

Prompt Length Limit 226…

Excerpt shown — open the source for the full document.

Notability

notability 1.0/10

Routine fork with minimal traction