Tencent-Hunyuan/UniCom
Language: Python
License: NOASSERTION
Stars: 33
Forks: 4
Open issues: 0
Created: 2026-04-13T09:24:30Z
Pushed: 2026-04-13T10:38:53Z
Default branch: main
Fork: no
Archived: no
README:
UniCom: Unified Multimodal Modeling via Compressed Continuous Semantic Representations
Official code for the paper UniCom: Unified Multimodal Modeling via Compressed Continuous Semantic Representations.
UniCom is a unified large-scale multimodal model that performs generation directly over compressed visual embeddings. This repository includes the inference pipeline for text-to-image generation, image editing, and image reconstruction.

*Figure: We compare different unified modeling choices in terms of convergence speed and consistency on editing tasks, and ultimately build UniCom with the Path I transfusion-style formulation rather than the Path II query-guided design.*
🔥 Key Contributions
- Model: We propose UniCom, a unified large-scale multimodal model that performs generation directly over compressed visual embeddings and serves as a unified interface for both understanding and generation.
- Paradigm: We establish an effective paradigm for unifying visual understanding and generation by predicting continuous compressed visual embeddings, and show that compressing visual features along the channel dimension is a particularly effective way to preserve both semantics and fine-grained details.
- Results: UniCom achieves state-of-the-art or competitive performance across image reconstruction, text-to-image generation, and challenging image editing tasks, with especially strong performance on editing benchmarks.
Setup
1. Download Checkpoints
Download all checkpoints at once via huggingface-cli:
huggingface-cli download tencent/Unicom-Unified-Multimodal-Modeling-via-Compressed-Continuous-Semantic-Representations --repo-type model --local-dir ./model_zoo/ --resume-download
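The same download can be scripted from Python. The sketch below assumes the `huggingface_hub` package (which ships `huggingface-cli`) is installed and uses its `snapshot_download` function. The helper name `download_checkpoints` is illustrative and not part of this repository.

```python
from pathlib import Path

# Repository id copied from the huggingface-cli command above.
REPO_ID = (
    "tencent/Unicom-Unified-Multimodal-Modeling-via-"
    "Compressed-Continuous-Semantic-Representations"
)


def download_checkpoints(local_dir: str = "./model_zoo/") -> Path:
    """Download every file of the model repo into local_dir (hypothetical helper)."""
    # Imported lazily so this module still imports when huggingface_hub is absent.
    from huggingface_hub import snapshot_download

    path = snapshot_download(repo_id=REPO_ID, repo_type="model", local_dir=local_dir)
    return Path(path)


# Usage (requires network access and several GB of disk space):
#     download_checkpoints("./model_zoo/")
```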
You can also download each component separately:
| Component | Local Path | Link |
| --- | --- | --- |
| UniCom (text → SigLIP) | model_zoo/unicom_hf_model/ | Download |
| Decoder Transformer (SigLIP → image) | model_zoo/unicom_decoder_transformer.pt | Download |
| Flux VAE | model_zoo/flux-vae/ | Download |
| SigLIP2 | model_zoo/siglip2-so400m-patch16-naflex/ | Download |
After downloading, verify the expected directory layout:
model_zoo/
├── unicom_hf_model/
├── unicom_decoder_transformer.pt
├── flux-vae/
└── siglip2-so400m-patch16-naflex/
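The layout check can be automated. This is a minimal sketch that only tests whether the four documented entries exist; it does not validate their contents. The function name `missing_checkpoints` is illustrative.

```python
from pathlib import Path

# Entries taken from the layout above; a trailing "/" marks a directory.
EXPECTED = [
    "unicom_hf_model/",
    "unicom_decoder_transformer.pt",
    "flux-vae/",
    "siglip2-so400m-patch16-naflex/",
]


def missing_checkpoints(root: str = "./model_zoo") -> list[str]:
    """Return the expected entries that are absent under root."""
    base = Path(root)
    missing = []
    for entry in EXPECTED:
        path = base / entry.rstrip("/")
        present = path.is_dir() if entry.endswith("/") else path.is_file()
        if not present:
            missing.append(entry)
    return missing
```

An empty list from `missing_checkpoints()` means all four components are in place.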
2. Environment Setup
conda create -n unicom python=3.12 -y
conda activate unicom
Install PyTorch first according to your CUDA version. Example for CUDA 12.8:
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
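After installation, a quick check can confirm that the PyTorch build matches the intended CUDA version. The sketch below uses only standard PyTorch attributes; the function name `torch_summary` is illustrative, and it returns `None` if PyTorch is not installed.

```python
import importlib.util


def torch_summary():
    """Return version info for the installed PyTorch, or None if it is absent."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    return {
        "torch": torch.__version__,           # e.g. a 2.8.0 build for the example above
        "cuda_build": torch.version.cuda,     # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


print(torch_summary())
```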
🚀 Usage
Case 1: Text-to-image generation
python run_unicom_decoder_pipeline.py \
    --model-path ./model_zoo/unicom_hf_model \
    --prompt "A ginger kitten tangled in a ball of wool, looking puzzled." \
    --output-dir ./output/t2i_demo \
    --diff-infer-steps 50 \
    --seed 42 \
    --image-size auto \
    --n-samples-per-prompt 4
Case 2: Single-image editing
python run_unicom_decoder_pipeline.py \
    --model-path ./model_zoo/unicom_hf_model \
    --prompt "Add a blue baseball cap on the boy's head" \
    --image ./UniCom/assets/demo_imgs/input_0.jpg \
    --image-size auto \
    --seed 42 \
    --output-dir ./output/ti2i_demo \
    --diff-infer-steps 50
Case 3: Multi-image editing
python run_unicom_decoder_pipeline.py \
    --model-path ./model_zoo/unicom_hf_model \
    --prompt "Place the chair from the second image onto the snow in the third image, and then place the coffee cup from the first image onto the chair." \
    --image ./UniCom/assets/demo_imgs/input_1_0.png ./UniCom/assets/demo_imgs/input_1_1.png ./UniCom/assets/demo_imgs/input_1_2.png \
    --image-size auto \
    --seed 42 \
    --output-dir ./output/ti2i_multi_demo \
    --diff-infer-steps 50
Case 4: CSV-based batch inference
python run_unicom_decoder_pipeline.py \
    --model-path ./model_zoo/unicom_hf_model \
    --csv-path ./UniCom/eval/t2i.csv \
    --output-dir ./output/t2i_demo_csv \
    --num-gpus 8 \
    --decoder-device 0,1,2,3,4,5,6,7 \
    --image-size auto \
    --diff-infer-steps 50 \
    --n-samples-per-prompt 4
# no cot
python run_unicom_decoder_pipeline.py \
    --model-path ./model_zoo/unicom_hf_model \
    --csv-path ./UniCom/eval/t2i.csv \
    --output-dir ./output/t2i_demo_csv_nocot \
    --num-gpus 8 \
    --decoder-device 0,1,2,3,4,5,6,7 \
    --image-size auto \
    --diff-infer-steps 50 \
    --bot-task vanilla \
    --use-system-prompt en_vanilla \
    --n-samples-per-prompt 4
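The cases above differ mainly in which flags are passed to the same script, so the command line can be assembled programmatically. The sketch below is a hypothetical helper, not part of this repository. It uses only flags that appear in the examples above and makes no assumption about which combinations the script accepts. Other flags (such as `--num-gpus` or `--bot-task`) can be passed through `extra`.

```python
import shlex


def build_command(model_path, output_dir, prompt=None, images=(), csv_path=None,
                  steps=50, seed=None, n_samples=None, extra=()):
    """Assemble an argv list for run_unicom_decoder_pipeline.py (hypothetical helper)."""
    cmd = [
        "python", "run_unicom_decoder_pipeline.py",
        "--model-path", model_path,
        "--output-dir", output_dir,
        "--image-size", "auto",
        "--diff-infer-steps", str(steps),
    ]
    if prompt is not None:
        cmd += ["--prompt", prompt]
    if csv_path is not None:
        cmd += ["--csv-path", csv_path]
    if images:
        cmd += ["--image", *images]       # one or more input images for editing
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if n_samples is not None:
        cmd += ["--n-samples-per-prompt", str(n_samples)]
    return cmd + list(extra)


cmd = build_command(
    "./model_zoo/unicom_hf_model", "./output/ti2i_demo",
    prompt="Add a blue baseball cap on the boy's head",
    images=["./UniCom/assets/demo_imgs/input_0.jpg"], seed=42,
)
print(shlex.join(cmd))  # shell-quoted form, suitable for logging
```

The returned list can be handed to `subprocess.run(cmd, check=True)` to launch the pipeline.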
Output structure
The pipeline first exports latent representations, then decodes them into images:
output_dir/
|-- latents/
|   |-- results.csv
|   `-- *.pt
`-- images/
    `-- *.png
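A finished run can be sanity-checked by counting the files in this layout. The sketch below is illustrative: the helper name `summarize_output` is hypothetical, and it assumes `results.csv` begins with a header row. The column names are not documented here, so the code does not rely on them.

```python
import csv
from pathlib import Path


def summarize_output(output_dir: str) -> dict:
    """Count the artifacts the pipeline is documented to write under output_dir."""
    root = Path(output_dir)
    latents = list((root / "latents").glob("*.pt"))
    images = list((root / "images").glob("*.png"))
    results = root / "latents" / "results.csv"
    rows = 0
    if results.is_file():
        with results.open(newline="", encoding="utf-8") as fh:
            # DictReader treats the first row as the header (an assumption).
            rows = sum(1 for _ in csv.DictReader(fh))
    return {"latents": len(latents), "images": len(images), "csv_rows": rows}
```

A mismatch between the latent and image counts would suggest that the decoding stage did not finish for every exported latent.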
🧩 Reconstruction
UniCom_Decoder also supports reconstruction directly from input images.
Reconstruction demo
bash UniCom_Decoder/scripts/run.sh \
    --config-file UniCom_Decoder/configs/reconstruction_demo.yaml
The demo images are stored in UniCom_Decoder/assets/demo_recon_imgs/.
Each saved output is a side-by-side comparison:
- left: input image
- right: reconstructed image
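To compare the two halves separately, the saved image can be cut down the middle. The sketch below only computes the crop boxes, and it assumes both halves have equal width with no separator or padding between them, which the source does not state. The function name `split_boxes` is illustrative.

```python
def split_boxes(width: int, height: int):
    """Return (left, upper, right, lower) crop boxes for the left and right halves."""
    mid = width // 2
    input_box = (0, 0, mid, height)        # left: input image
    recon_box = (mid, 0, width, height)    # right: reconstructed image
    return input_box, recon_box


print(split_boxes(1024, 512))
```

The boxes follow the `(left, upper, right, lower)` convention used by Pillow, so each can be passed to `Image.crop` on the opened comparison image.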
Recommended reconstruction settings
The default demo config already uses the recommended settings:
mode: eval_gt
aba_mode: compression_64_siglip
condition_mode: siglip2
cfg_scale: 1.0
infer_steps: 50
flow_shift: 3.0
siglip2_max_num_patches: 1024
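When experimenting with a modified config, it can help to see which numeric settings have drifted from the recommended values. The sketch below is a hypothetical helper that assumes the config has already been loaded into a flat dictionary (for example with a YAML parser). The real file may nest these keys differently.

```python
# Numeric values copied from the recommended settings above.
RECOMMENDED = {
    "cfg_scale": 1.0,
    "infer_steps": 50,
    "flow_shift": 3.0,
    "siglip2_max_num_patches": 1024,
}


def diff_from_recommended(config: dict) -> dict:
    """Map each differing key to a (actual, recommended) pair."""
    return {
        key: (config.get(key), value)
        for key, value in RECOMMENDED.items()
        if config.get(key) != value
    }


print(diff_from_recommended({**RECOMMENDED, "infer_steps": 25}))
```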
🙏 Acknowledgement
This project builds upon several excellent open-source projects and research efforts.
-…