replicate/cog-yue
forked from multimodal-art-projection/YuE
Description: YuE: Open Full-song Music Generation Foundation Model, something similar to Suno.ai but open
Language: Python
License: Apache-2.0
Stars: 9
Forks: 4
Open issues: 0
Created: 2025-02-03T11:11:01Z
Pushed: 2025-11-07T14:57:53Z
Default branch: main
Fork: yes
Parent repository: multimodal-art-projection/YuE
Archived: no
README:
Demo 🎶 | 📑 Paper (coming soon)
YuE-s1-7B-anneal-en-cot 🤗 | YuE-s1-7B-anneal-en-icl 🤗 | YuE-s1-7B-anneal-jp-kr-cot 🤗
YuE-s1-7B-anneal-jp-kr-icl 🤗 | YuE-s1-7B-anneal-zh-cot 🤗 | YuE-s1-7B-anneal-zh-icl 🤗
YuE-s2-1B-general 🤗 | YuE-upsampler 🤗
---
Our model's name is YuE (乐). In Chinese, the word means "music" and "happiness." Some of you may find words that start with Yu hard to pronounce. If so, you can just call it "yeah." We wrote a song with our model's name; listen to it [here](assets/logo/yue.mp3).
YuE is a groundbreaking series of open-source foundation models designed for music generation, specifically for transforming lyrics into full songs (lyrics2song). It can generate a complete song, lasting several minutes, that includes both a catchy vocal track and accompaniment track. YuE is capable of modeling diverse genres/languages/vocal techniques. Please visit the **Demo Page** for amazing vocal performance.
News and Updates
- 2025.01.30 🔥 Inference Update: We now support dual-track ICL mode! You can prompt the model with a reference song, and it will generate a new song in a similar style (voice cloning, music style transfer, etc.). Try it out! See demo video by @abrakjamson on X. 🔥🔥🔥
- 2025.01.30 🔥 Announcement: A New Era Under Apache 2.0 🔥: We are thrilled to announce that, in response to overwhelming requests from our community, YuE is now officially licensed under the Apache 2.0 license. We sincerely hope this marks a watershed moment—akin to what Stable Diffusion and LLaMA have achieved in their respective fields—for music generation and creative AI. 🎉🎉🎉
- 2025.01.29 🎉: We have updated the license description. We ENCOURAGE artists and content creators to sample and incorporate outputs generated by our model into their own works, and even monetize them. The only requirement is to credit our name: YuE by HKUST/M-A-P (alphabetic order).
- 2025.01.28 🫶: Thanks to Fahd for creating a tutorial on how to quickly get started with YuE. Here is his demonstration.
- 2025.01.26 🔥: We have released the YuE series.
---
TODOs📋
- [ ] Example finetune code for enabling BPM control using 🤗 Transformers.
- [ ] Support stemgen mode https://github.com/multimodal-art-projection/YuE/issues/21
- [ ] Support llama.cpp https://github.com/ggerganov/llama.cpp/issues/11467
- [ ] Support gradio interface. https://github.com/multimodal-art-projection/YuE/issues/1
- [ ] Online serving on huggingface space.
- [ ] Support transformers tensor parallel. https://github.com/multimodal-art-projection/YuE/issues/7
- [x] Support dual-track ICL mode.
- [x] Fix "instrumental" naming bug in output files. https://github.com/multimodal-art-projection/YuE/pull/26
- [x] Support seeding https://github.com/multimodal-art-projection/YuE/issues/20
---
Hardware and Performance
GPU Memory
YuE requires significant GPU memory for generating long sequences. Below are the recommended configurations:
- For GPUs with 24GB memory or less: Run up to 2 sessions concurrently to avoid out-of-memory (OOM) errors. You could also try YuEGP to see if it helps reduce VRAM usage or improve speed.
- For full song generation (many sessions, e.g., 4 or more): Use GPUs with at least 80GB memory, e.g., H800, A100, or multiple RTX 4090s with tensor parallel.
The interface lets you specify the number of sessions. By default, the model runs 2 sessions (1 verse + 1 chorus) to avoid OOM issues.
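The memory guidance above can be sketched as a small helper that caps a requested session count. This is only an illustration: `recommended_sessions` is a hypothetical name, and the handling of GPUs between 24GB and 80GB (treated conservatively, like the 24GB case) is an assumption, since the guidance above does not cover that range.

```python
def recommended_sessions(vram_gb: float, requested: int = 2) -> int:
    """Cap a requested session count using the two guidance points above.

    Assumptions (not from the README): GPUs between 24 GB and 80 GB are
    treated like the 24 GB case, and no upper cap is applied at 80 GB+.
    """
    if requested < 1:
        raise ValueError("requested must be at least 1")
    if vram_gb >= 80:
        return requested      # enough memory for full-song generation
    return min(requested, 2)  # stay at the default of 2 sessions to avoid OOM


print(recommended_sessions(24, requested=4))  # 2
print(recommended_sessions(80, requested=4))  # 4
```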
Execution Time
On an H800 GPU, generating 30 seconds of audio takes 150 seconds. On an RTX 4090 GPU, generating 30 seconds of audio takes approximately 360 seconds.
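To put these numbers in perspective, the sketch below converts them into a slowdown factor (compute seconds per second of audio) and a rough estimate for a longer song. The linear scaling with audio length is an assumption made for illustration; actual times for long songs may differ, and the function names are hypothetical.

```python
# Timings quoted above: seconds of compute per 30 seconds of audio.
SECONDS_PER_30S = {"H800": 150, "RTX 4090": 360}


def slowdown_factor(gpu: str) -> float:
    """Compute seconds needed per second of generated audio."""
    return SECONDS_PER_30S[gpu] / 30


def estimated_minutes(gpu: str, audio_seconds: float) -> float:
    """Estimate wall-clock minutes, ASSUMING cost grows linearly with length."""
    return slowdown_factor(gpu) * audio_seconds / 60


for gpu in SECONDS_PER_30S:
    # Prints the slowdown factor and the estimate for a 3-minute song.
    print(gpu, slowdown_factor(gpu), estimated_minutes(gpu, 180))
```

Under that assumption, a 3-minute song works out to roughly 15 minutes on an H800 and 36 minutes on an RTX 4090.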
---
Quickstart
Quick start VIDEO TUTORIAL by Fahd: Link here. We recommend watching this video if you are not familiar with machine learning or the command line.
1. Install environment and dependencies
Make sure to install FlashAttention 2 properly to reduce VRAM usage.
    # We recommend using conda to create a new environment.
    conda create -n yue python=3.8 # Python >=3.8 is recommended.
    conda activate yue
    # install cuda >= 11.8
    conda install pytorch torchvision torchaudio cudatoolkit=11.8 -c pytorch -c nvidia
    pip install -r <(curl -sSL https://raw.githubusercontent.com/multimodal-art-projection/YuE/main/requirements.txt)

    # For saving GPU memory, FlashAttention 2 is mandatory.
    # Without it, long audio may lead to out-of-memory (OOM) errors.
    # Be careful about matching the cuda version and flash-attn version
    pip install flash-attn --no-build-isolation
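After installation, a quick check like the one below can confirm that the key packages are visible to the active environment. This is a minimal sketch, not part of YuE: `check_environment` is a hypothetical helper, and it only tests whether the `torch` and `flash_attn` modules can be located, not whether their CUDA versions match.

```python
import importlib.util
import sys


def check_environment(modules=("torch", "flash_attn")) -> dict:
    """Report whether each module can be located, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Run it inside the `yue` conda environment; a `MISSING` entry for `flash_attn` suggests the last install step did not succeed.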
2. Download the infer code and tokenizer
    # Make sure you have git-lfs installed (https://git-lfs.com)
    # if you don't have root, see https://github.com/git-lfs/git-lfs/issues/4134#issuecomment-1635204943
    sudo apt update
    sudo apt install git-lfs
    git lfs install
    git clone https://github.com/multimodal-art-projection/YuE.git

    cd YuE/inference/
    git clone https://huggingface.co/m-a-p/xcodec_mini_infer
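The commands above should leave an `xcodec_mini_infer` directory inside `YuE/inference/`. The sketch below checks for that layout and for `git-lfs` on the `PATH` before you move on. `missing_prerequisites` is a hypothetical helper, and the default path assumes you run it from the directory where you cloned YuE.

```python
import shutil
from pathlib import Path


def missing_prerequisites(inference_dir: str = "YuE/inference") -> list:
    """Return human-readable problems; an empty list means the layout looks right."""
    problems = []
    if shutil.which("git-lfs") is None:
        problems.append("git-lfs is not on PATH")
    root = Path(inference_dir)
    if not root.is_dir():
        problems.append(f"{root} not found (was the YuE repository cloned?)")
    elif not (root / "xcodec_mini_infer").is_dir():
        problems.append(f"{root / 'xcodec_mini_infer'} not found")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("Problem:", problem)
```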
3. Run the inference
Now generate music with YuE using 🤗 Transformers. Make sure steps [1](#1-install-environment-and-dependencies) and [2](#2-download-the-infer-code-and-tokenizer) are properly set up.
Note:
- Set `--run_n_segments` to the number of lyric sections if you want to generate a full song. Additionally, you can increase `--stage2_batch_size` based on your available GPU memory.
- You may customize the prompt in `genre.txt` and `lyrics.txt`. See prompt engineering guide [here](#prompt-engineering-guide).
- You can increase `--stage2_batch_size` to speed up the inference, but be careful of OOM.
- LM ckpts will be automatically downloaded from huggingface.
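Since `--run_n_segments` should match the number of lyric sections for a full song, a small script can count them for you. The sketch below assumes each section in `lyrics.txt` begins with a bracketed label such as `[verse]` or `[chorus]` on its own line; that format is an assumption here, so check the prompt engineering guide before relying on it. `count_sections` is a hypothetical helper.

```python
import re

# ASSUMED format: each section starts with a bracketed label on its own line.
SECTION_LABEL = re.compile(r"^[ \t]*\[[^\[\]\n]+\][ \t]*$", re.MULTILINE)


def count_sections(lyrics: str) -> int:
    """Count bracketed section labels such as [verse] or [chorus]."""
    return len(SECTION_LABEL.findall(lyrics))


example = """[verse]
First line of the verse
Second line of the verse

[chorus]
First line of the chorus
"""

n = count_sections(example)
print(f"--run_n_segments {n}")  # --run_n_segments 2
```

To use it on a real file, read the text with `open("lyrics.txt", encoding="utf-8").read()` and pass it to `count_sections`.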
    # This is the CoT mode.
    cd YuE/inference/
    python infer.py \
        --cuda_idx 0 \
        --stage1_model m-a-p/YuE-s1-7B-anneal-en-cot \
        --stage2_model m-a-p/YuE-s2-1B-general \
        --genre_txt…