microsoft/Tutel
Description: Tutel MoE: Optimized Mixture-of-Experts Library, Support GptOss/DeepSeek/Kimi-K2/Qwen3 using FP8/NVFP4/MXFP4
Language: C
License: MIT
Stars: 993
Forks: 109
Open issues: 54
Created: 2021-08-06T02:44:04Z
Pushed: 2026-06-18T05:36:03Z
Default branch: main
Fork: no
Archived: no
README:
Tutel
Tutel MoE: An Optimized Mixture-of-Experts Implementation, and the first parallel solution proposing "No-penalty Parallelism/Sparsity/Capacity/.. Switching" for modern training and inference workloads with dynamic behaviors.
- Supported Framework: PyTorch (recommended: >= 2.0)
- Supported GPUs: CUDA(fp64/fp32/fp16/bf16), ROCm(fp64/fp32/fp16/bf16)
- Supported CPU: fp64/fp32
- Supports direct NVFP4/MXFP4/BlockwiseFP8 inference for MoE-based DeepSeek / Kimi / Qwen3 / GptOSS on A100/A800/H100/MI300/..
> [!TIP]
> #### Steps for GLM-5/5.1/5.2 (Claude-Code Mode):
>
> ```sh
> [Model Downloads]
> pip3 install -U "huggingface_hub[cli]" --upgrade
> hf download --local-dir lukealonso/GLM-5.2-NVFP4 lukealonso/GLM-5.2-NVFP4
> hf download --local-dir nvidia/GLM-5.1-NVFP4 nvidia/GLM-5.1-NVFP4
> hf download --local-dir nvidia/GLM-5-NVFP4 nvidia/GLM-5-NVFP4
>
> [ND_A100_80G_v4: Server GLM-5/5.1 (A100/H100/B200 only)]
> docker run -e WORKER=1 -e LOCAL_SIZE=8 -p 8000:8000 -it --rm --ipc=host --shm-size=8g \
>     --ulimit memlock=-1 --ulimit stack=67108864 -v /:/host -w /host$(pwd) \
>     -v /usr/lib/x86_64-linux-gnu/libcuda.so.1:/usr/lib/x86_64-linux-gnu/libcuda.so.1 --privileged \
>     tutelgroup/deepseek-671b:a100x8-chat-20260618 --serve=core \
>     --try_path lukealonso/GLM-5.2-NVFP4 \
>     --try_path nvidia/GLM-5.1-NVFP4 \
>     --try_path nvidia/GLM-5-NVFP4 \
>     --max_seq_len 1000000
>
> [ND_MI300_192G_v5: Server GLM-5/5.1 (MI300 only)]
> docker run -e WORKER=1 -e LOCAL_SIZE=8 -p 8000:8000 -it --rm --ipc=host --shm-size=8g \
>     --ulimit memlock=-1 --ulimit stack=67108864 -v /:/host -w /host$(pwd) \
>     --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add=video \
>     tutelgroup/deepseek-671b:mi300x8-chat-20260618 --serve=core \
>     --try_path lukealonso/GLM-5.2-NVFP4 \
>     --try_path nvidia/GLM-5.1-NVFP4 \
>     --try_path nvidia/GLM-5-NVFP4 \
>     --max_seq_len 1000000
> ```
>
> #### Setup Claude Code for Linux / WSL (Ubuntu >= 24.04):
> ```sh
> sudo apt-get install -y npm
> sudo npm install -g @anthropic-ai/claude-code@2.1.101
> cat > run_claude.sh << EOF
> mkdir -p config/
> export ANTHROPIC_BASE_URL="http://0.0.0.0:8000"
> export ANTHROPIC_API_KEY="sk-ant-api00-local-mock-key"
> export CLAUDE_CONFIG_DIR="config"
> export DISABLE_AUTOUPDATER=1
> echo '{"customApiKeyResponses": {"approved": ["api00-local-mock-key"]}}' > config/.claude.json
> claude
> EOF
>
> # A file created with `cat >` is not executable by default.
> chmod +x run_claude.sh
> ./run_claude.sh
> ```
>
> #### Setup Claude Code for Windows (>= 10.0):
> ```sh
> winget install OpenJS.NodeJS.LTS
> winget install --id Git.Git -e --source winget
> npm install -g @anthropic-ai/claude-code@2.1.101
> (
> echo(@echo off
> echo(if not exist config mkdir config
> echo(set ANTHROPIC_BASE_URL=http://0.0.0.0:8000
> echo(set ANTHROPIC_API_KEY=sk-ant-api00-local-mock-key
> echo(set CLAUDE_CONFIG_DIR=config
> echo(set DISABLE_AUTOUPDATER=1
> echo(echo({"customApiKeyResponses": {"approved": ["api00-local-mock-key"]}} ^> config\.claude.json
> echo(claude
> ) > run_claude.bat
>
> .\run_claude.bat
> ```

------------------
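The Linux and Windows launcher scripts in the Claude Code steps perform the same setup: create a `config` directory, write `.claude.json` with an approved key entry, and export four environment variables before starting `claude`. The following is a minimal Python sketch of that setup, not part of Tutel or Claude Code. The helper names `write_claude_config` and `claude_env` are hypothetical. The 20-character suffix is an assumption inferred only from the values in those steps, where the approved entry equals the last 20 characters of the mock key.

```python
import json
import os
from pathlib import Path

# Values copied from the setup steps; the key is a local mock, not a real credential.
MOCK_API_KEY = "sk-ant-api00-local-mock-key"
BASE_URL = "http://0.0.0.0:8000"


def write_claude_config(config_dir, api_key=MOCK_API_KEY, suffix_len=20):
    """Write <config_dir>/.claude.json approving the tail of api_key.

    Assumption: the approved entry is the last `suffix_len` characters of
    the key, which matches the values shown in the setup steps.
    """
    config_path = Path(config_dir) / ".claude.json"
    config_path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"customApiKeyResponses": {"approved": [api_key[-suffix_len:]]}}
    config_path.write_text(json.dumps(payload))
    return config_path


def claude_env(config_dir, base_url=BASE_URL, api_key=MOCK_API_KEY):
    """Return a copy of the current environment with the four exported variables."""
    env = dict(os.environ)
    env.update(
        {
            "ANTHROPIC_BASE_URL": base_url,
            "ANTHROPIC_API_KEY": api_key,
            "CLAUDE_CONFIG_DIR": str(config_dir),
            "DISABLE_AUTOUPDATER": "1",
        }
    )
    return env
```

With these helpers, a launcher could call `write_claude_config("config")` and then start the CLI with `subprocess.run(["claude"], env=claude_env("config"))`, assuming `claude` is on `PATH` and the Tutel server is already listening on port 8000.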
> [!TIP]
> #### Steps for Kimi-K2.6/2.7/DeepSeek V3.2 (Long-Context Mode):
>
> ```sh
> [Model Downloads]
> pip3 install -U "huggingface_hub[cli]" --upgrade
> hf download moonshotai/Kimi-K2.7-Code --local-dir moonshotai/Kimi-K2.7-Code
> hf download moonshotai/Kimi-K2.6 --local-dir moonshotai/Kimi-K2.6
> hf download nvidia/Kimi-K2.5-NVFP4 --local-dir nvidia/Kimi-K2.5-NVFP4
> hf download nvidia/Kimi-K2-Thinking-NVFP4 --local-dir nvidia/Kimi-K2-Thinking-NVFP4
> hf download nvidia/DeepSeek-V3.2-NVFP4 --local-dir nvidia/DeepSeek-V3.2-NVFP4
>
> [DeepSeek V3.2 Long-Context (ND_A100/H100/B200 only)]
> docker run -e LOCAL_SIZE=8 -e WORKER=1 -it --rm --ipc=host --net=host --shm-size=8g \
>     --ulimit memlock=-1 --ulimit stack=67108864 -v /:/host -w /host$(pwd) -v /tmp:/tmp \
>     -v /usr/lib/x86_64-linux-gnu/libcuda.so.1:/usr/lib/x86_64-linux-gnu/libcuda.so.1 --privileged \
>     tutelgroup/deepseek-671b:a100x8-chat-20260618 --serve=webui --listen_port 8000 \
>     --try_path moonshotai/Kimi-K2.7-Code \
>     --try_path moonshotai/Kimi-K2.6 \
>     --try_path nvidia/Kimi-K2.5-NVFP4 \
>     --try_path nvidia/Kimi-K2-Thinking-NVFP4 \
>     --try_path nvidia/DeepSeek-V3.2-NVFP4 \
>     --try_path nvidia/DeepSeek-R1-NVFP4 \
>     --max_seq_len 16384
>
> [DeepSeek V3.2 Long-Context (ND_MI300_192G_v5 only)]
> docker run -e LOCAL_SIZE=8 -e WORKER=1 -it --rm --ipc=host --net=host --shm-size=8g \
>     --ulimit memlock=-1 --ulimit stack=67108864 --device=/dev/kfd --device=/dev/dri --group-add=video \
>     --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v /:/host -w /host$(pwd) -v /tmp:/tmp \
>     tutelgroup/deepseek-671b:mi300x8-chat-20260618 --serve=webui --listen_port 8000 \
>     --try_path moonshotai/Kimi-K2.7-Code \
>     --try_path moonshotai/Kimi-K2.6 \
>     --try_path nvidia/Kimi-K2.5-NVFP4 \
>     --try_path nvidia/Kimi-K2-Thinking-NVFP4 \
>     --try_path nvidia/DeepSeek-V3.2-NVFP4 \
>     --try_path nvidia/DeepSeek-R1-NVFP4 \
>     --max_seq_len 1000000
>
> [OpenAI/Ollama/Direct Request]
> curl -N -X POST http://0.0.0.0:8000/chat -d '{"text": "Write a Python code of the Quicksort algorithm."}'
> python3 -m tutel.examples.oai_request_stream --url '0.0.0.0:8000' --prompt 'Write a Python code of the Quicksort algorithm.'
>
> [Open-WebUI URL for Web browsers]
> xdg-open http://0.0.0.0:8000
> ```

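The direct `curl` request to `/chat` can also be issued from Python using only the standard library. This is a minimal sketch, assuming the server accepts the same JSON body as the `curl` example (`{"text": ...}`) and streams its reply. The function names `build_chat_request` and `stream_chat` are hypothetical. The response format is not documented in this excerpt, so the sketch yields raw bytes rather than parsing them.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://0.0.0.0:8000"):
    """Build a POST request equivalent to the curl example for /chat."""
    body = json.dumps({"text": prompt}).encode("utf-8")
    # No Content-Type header is set here; like `curl -d`, urllib then sends
    # application/x-www-form-urlencoded by default.
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat", data=body, method="POST"
    )


def stream_chat(prompt, base_url="http://0.0.0.0:8000", chunk_size=1024):
    """Yield raw response bytes as they arrive (similar to `curl -N`)."""
    request = build_chat_request(prompt, base_url)
    with urllib.request.urlopen(request) as response:
        while True:
            chunk = response.read1(chunk_size)
            if not chunk:  # empty bytes means the server closed the stream
                break
            yield chunk
```

With a server running, `for chunk in stream_chat("Write a Python code of the Quicksort algorithm."): sys.stdout.buffer.write(chunk)` would print the reply incrementally (after `import sys`). Yielding bytes instead of decoded text avoids splitting a multi-byte UTF-8 character across chunk boundaries.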
------------------

> [!TIP]
> #### Steps for Microsoft VibeVoice (Multimodality Mode):
>
> ```sh
> [Model Downloads]
> pip3 install -U "huggingface_hub[cli]" --upgrade
> hf download microsoft/VibeVoice-1.5B --local-dir microsoft/VibeVoice-1.5B
> hf download Qwen/Qwen2.5-1.5B --local-dir Qwen/Qwen2.5-1.5B
>
> hf download microsoft/VibeVoice-Large --local-dir aoi-ot/VibeVoice-Large
> hf download Qwen/Qwen2.5-7B --local-dir Qwen/Qwen2.5-7B
>
> [Microsoft VibeVoice (ND_A100/H100/B200 only)]
> docker run -e LOCAL_SIZE=1 -it --rm -p 8001:8000 --shm-size=8g \
>     --ulimit memlock=-1 --ulimit stack=67108864 -v /:/host -w /host$(pwd) -v /tmp:/tmp \
>     -v /usr/lib/x86_64-linux-gnu/libcuda.so.1:/usr/lib/x86_64-linux-gnu/libcuda.so.1 --privileged \
>     -e...
> ```
Excerpt shown — open the source for the full document.
Notability: 7.0/10. Notable MoE framework release with strong community traction (993 stars).