What does this fork signal mean?

SiliconFlow forked deepbeepmeep/mmgp as siliconflow/mmgp. This fork signal points to upstream code the lab may be inspecting, patching, or building on. High-signal details: repo siliconflow/mmgp · parent deepbeepmeep/mmgp · Routine fork, no notable traction. onlylabs links this event to 1 captured evidence page and 6 related fork signals.

SiliconFlow Fork: siliconflow/mmgp

Captured source

source ↗

GitHub/github.com/siliconflow/mmgp

siliconflow/mmgp repository metadata

published Aug 7, 2025 · seen 5d · captured 9h · http 200 · method plain

siliconflow/mmgp

Description: Memory Management for the GPU Poor, run the latest open source frontier models on consumer Nvidia GPUs

License: NOASSERTION

Stars: 0

Forks: 0

Open issues: 0

Created: 2025-08-07T01:44:19Z

Pushed: 2025-07-30T20:15:46Z

Default branch: main

Fork: yes

Parent repository: deepbeepmeep/mmgp

Archived: no

README:

Memory Management 3.5.3 for the GPU Poor by DeepBeepMeep

This module contains multiple optimisations so that models such as Flux (and derived), Mochi, CogView, HunyuanVideo, ... can run smoothly on a VRAM-limited GPU with 12 to 24 GB. It is a replacement for the accelerate library, which should in theory manage offloading but doesn't work properly with models that are loaded / unloaded several times in a pipe (e.g. VAE).

Requirements:

VRAM: minimum 6 GB, recommended 24 GB (RTX 3090 / RTX 4090)
RAM: minimum 24 GB, recommended 48 GB

This module features 5 profiles in order to be able to run the model at a decent speed on a low end consumer config (24 GB of RAM and 6 GB of VRAM) and to run it at a very good speed (if not the best) on a high end consumer config (48 GB of RAM and 24 GB of VRAM). These RAM requirements are for Linux systems. Due to different memory management, Windows will require an extra 16 GB of RAM to run the corresponding profile.

Each profile may use a combination of the following:

Low RAM consumption (thanks to a rewritten safetensors library) that allows low RAM on the fly quantization
Smart automated loading / unloading of models in the GPU to avoid unloading models that may be needed again soon
Smart slicing of models to reduce memory occupied by models in the VRAM
Ability to pin models to reserved RAM to accelerate transfers to VRAM
Async transfers to VRAM to avoid a pause when loading a new slice of a model
Automated on the fly quantization or ability to load pre quantized models
Pretrained LoRA support with low RAM requirements
Support for PyTorch compilation on Linux and WSL (supported on pure Windows but requires a complex Triton installation).

Sample applications that use mmgp

It is recommended to have a look at these applications to see how mmgp was implemented in each of them:

Wan2GP: https://github.com/deepbeepmeep/Wan2GP

An excellent text to video and image to video generator that supports the best Open Source Video Architectures: Wan, Hunyuan and LTX Video

Hunyuan3D-2GP: https://github.com/deepbeepmeep/Hunyuan3D-2GP

A great image to 3D and text to 3D tool by the Tencent team. Thanks to mmgp it can run with less than 6 GB of VRAM

HunyuanVideoGP: https://github.com/deepbeepmeep/HunyuanVideoGP

One of the best open source text to video generators

FluxFillGP: https://github.com/deepbeepmeep/FluxFillGP

One of the best inpainting / outpainting tools based on Flux that can run with less than 12 GB of VRAM.

Cosmos1GP: https://github.com/deepbeepmeep/Cosmos1GP

This application includes two models: a text to world generator and an image / video to world generator (probably the best open source image to video generator).

OminiControlGP: https://github.com/deepbeepmeep/OminiControlGP

A very powerful Flux derived application that can be used to transfer an object of your choice into a prompted scene. With mmgp you can run it with only 6 GB of VRAM.

YuE GP: https://github.com/deepbeepmeep/YuEGP

A great song generator (instruments + singer's voice) based on prompted Lyrics and a genre description. Thanks to mmgp you can run it with less than 10 GB of VRAM without waiting forever.

Installation

First you need to install the module in your current project with:

pip install mmgp

Usage

It is almost plug and play and just needs to be invoked from the main app just after the model pipeline has been created.

1) First make sure that the pipeline explicitly loads the models in the CPU device, for instance:

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16).to("cpu")

2) Once every potential Lora has been loaded and merged, add the following lines for a quick setup:

from mmgp import offload, profile_type
offload.profile(pipe, profile_type.HighRAM_LowVRAM)

You can choose between 5 profiles depending on your hardware:

HighRAM_HighVRAM (1): at least 48 GB of RAM and 24 GB of VRAM: the fastest, well suited for an RTX 3090 / RTX 4090, but consumes much more VRAM; adapted for fast shorter videos or small batches of pictures
HighRAM_LowVRAM (2): at least 48 GB of RAM and 12 GB of VRAM: a bit slower, better suited for RTX 3070/3080/4070/4080 or for RTX 3090 / RTX 4090 with large picture batches or long videos
LowRAM_HighVRAM (3): at least 32 GB of RAM and 24 GB of VRAM: adapted for RTX 3090 / RTX 4090 with limited RAM but at the cost of VRAM (shorter videos / fewer images)
LowRAM_LowVRAM (4): at least 32 GB of RAM and 12 GB of VRAM: if you have little VRAM or want to generate longer videos / more images
VerylowRAM_LowVRAM (5): at least 24 GB of RAM and 10 GB of VRAM: if you don't have much, it won't be fast, but maybe it will work

Profiles 2 (High RAM) and 4 (Low RAM) are the most recommended profiles since they are versatile (support for long videos for a slight performance cost). If you use a Flux derived application, profiles 1 and 3 will offer much faster generation times. In any case, a safe approach is to start from profile 5 (default profile) and then go down progressively to profile 4 and then to profile 2 as long as the app remains responsive and doesn't trigger any out of memory error.
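The thresholds in the profile list can be expressed as a small lookup. The following is a minimal sketch only: `suggest_profile`, `PROFILES` and `WINDOWS_EXTRA_RAM_GB` are hypothetical names that are not part of mmgp, and the numbers are copied from the list above plus the README's note that Windows needs an extra 16 GB of RAM.

```python
from typing import Optional

# (profile number, name, minimum RAM in GB, minimum VRAM in GB),
# copied from the profile list in the README, fastest profile first.
PROFILES = [
    (1, "HighRAM_HighVRAM", 48, 24),
    (2, "HighRAM_LowVRAM", 48, 12),
    (3, "LowRAM_HighVRAM", 32, 24),
    (4, "LowRAM_LowVRAM", 32, 12),
    (5, "VerylowRAM_LowVRAM", 24, 10),
]

# README: Windows requires an extra 16 GB of RAM for the corresponding profile.
WINDOWS_EXTRA_RAM_GB = 16


def suggest_profile(ram_gb: float, vram_gb: float, windows: bool = False,
                    prefer_versatile: bool = True) -> Optional[str]:
    """Hypothetical helper: name a profile whose stated minimums fit the machine.

    With prefer_versatile=True, profiles 2 and 4 (the ones the README calls
    most versatile) win over the faster profiles 1 and 3 when both fit.
    Returns None when not even profile 5 fits.
    """
    usable_ram = ram_gb - (WINDOWS_EXTRA_RAM_GB if windows else 0)
    fitting = [p for p in PROFILES if usable_ram >= p[2] and vram_gb >= p[3]]
    if not fitting:
        return None
    if prefer_versatile:
        for number, name, _, _ in fitting:
            if number in (2, 4):
                return name
    return fitting[0][1]


if __name__ == "__main__":
    print(suggest_profile(64, 24))                # Linux, high-end machine
    print(suggest_profile(48, 24, windows=True))  # same VRAM, Windows RAM penalty
```

A static lookup like this only gives a starting point; the README's advice to begin at profile 5 and move towards profile 2 while watching for out of memory errors still applies.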

By default the model named 'transformer' will be quantized to 8 bits for all profiles. If you don't want that you may specify the optional parameter *quantizeTransformer = False*.
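To see why 8-bit quantization matters on a VRAM-limited card, a rough back-of-the-envelope estimate of weight storage helps. This sketch assumes a transformer of about 12 billion parameters purely for illustration; `weight_footprint_gb` is a hypothetical helper, and the estimate ignores activations, quantization scales and any other runtime overhead.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption for illustration: a transformer with ~12 billion parameters.
params = 12e9

bf16_gb = weight_footprint_gb(params, 16)  # weights stored as bfloat16
int8_gb = weight_footprint_gb(params, 8)   # weights quantized to 8 bits

print(f"bfloat16: {bf16_gb:.1f} GB, int8: {int8_gb:.1f} GB")
```

Under these assumptions, halving the bits per weight halves the storage needed for the weights, which is the difference between filling a 24 GB card and leaving room for other pipeline components.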

Every parameter set automatically by a profile can be overridden with one or multiple parameters accepted by *offload.all* (see below):

from mmgp import offload, profile_type
offload.profile(pipe, profile_type.HighRAM_LowVRAM, budgets = 1000)

If you want to know which parameters are set by a specific profile, you can use the parameter *verboseLevel=2*.

It is highly recommended to put the *from mmgp import offload, profile_type* at the top of your main Python file (that is, as the first import) so that all the existing safetensors calls are redirected to mmgp.
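The excerpt does not show how mmgp redirects safetensors calls, so the following is only a generic Python illustration of why import order can matter when a library replaces functions at import time. `fake_lib` and its `load_file` are hypothetical stand-ins built in memory; they are not safetensors or mmgp code.

```python
import sys
import types

# Hypothetical stand-in for a library whose loader is replaced later on.
fake_lib = types.ModuleType("fake_lib")
fake_lib.load_file = lambda path: f"original:{path}"
sys.modules["fake_lib"] = fake_lib

# Code that imports the function BEFORE the replacement keeps the old reference.
from fake_lib import load_file as early_ref


def patched_load_file(path):
    """Replacement loader, standing in for a redirected implementation."""
    return f"patched:{path}"


# The replacement happens here (in a real setup, when the patching module is imported).
fake_lib.load_file = patched_load_file

# Code that imports AFTER the replacement sees the new function.
from fake_lib import load_file as late_ref

early_result = early_ref("model.safetensors")
late_result = late_ref("model.safetensors")
print(early_result)
print(late_result)
```

Names bound with `from module import name` before a replacement keep pointing at the original function, which is one plausible reason to make the patching import the very first one.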

Alternatively you may want to create your own profile with specific parameters:

For example:

from mmgp import offload
offload.all(pipe, pinnedMemory=True,…

Excerpt shown — open the source for the full document.

Notability

notability 1.0/10

Routine fork, no notable traction.