PaddlePaddle/FastDeploy v2.4.0
PaddlePaddle/FastDeploy
Captured source
source ↗published Jan 23, 2026seen 5dcaptured 8hhttp 200method plain
v2.4.0
Repository: PaddlePaddle/FastDeploy
Tag: v2.4.0
Published: 2026-01-23T02:20:55Z
Prerelease: no
Release notes:
核心推理能力与模型支持增强
- 支持文本
prompt_logprob及全量logprob能力 #4769 - 支持离线推理中基于 ZMQ 的
logprobs / prompt_logprobs,并引入max_logprobs参数 #4897 - 支持在线推理中基于 ZMQ 的
logprobs / prompt_logprobs,并优化通信方式 #5089 - 新增
logprobs / prompt_logprobs的token_id解码控制开关 #5463 - 受限解码新增
llguidance后端 #5124 - CUDAGraph 支持投机解码 Draft Model 加速(默认关闭)
- [Speculative Decoding] 解耦
draft_tokens后处理流程 #5205 - 支持 Pooling 模型 Runner
- 支持 Reward 模型
- Pooling 模型通用
embedding接口 #4344 - Pooling 模型定制
reward接口 #4518 - 新增开源模型 Ernie-4.5-VL-28B-A3B-Thinking 的
reasoning_parser,兼容- / _命名规则 #4571 #4668 - 支持通过
chat_template_kwargs.options.thinking_mode控制思考开关 - 支持多模模型传入
prompt_token_ids请求,并通过messages输入多模数据,实现 tokens-in / tokens-out 能力
并行架构、调度与 MoE 能力演进
- GLM / Qwen 模型消除 EP 空跑时的通信开销 #5254
- 支持 MoE 分 chunk 执行 #4575
- 支持 EPLB(Expert Load Balancing)#4782
- 支持 EPLB 重排与冗余专家策略 #5142 #5143 #5178 #5239 #5918
- 支持路由重放机制
- PD 分离支持 Deepseek V3 模型 EP 并行部署 #5251
- PD 分离支持 Qwen3-MoE 模型 EP 并行部署 #4691
- PD 分离支持 Prefill 与 Decode 使用不同 TP Size #5296
- 新增 Python 版本 Router,支持集中式与分离式部署调度 #4709
- 支持多步 MTP + CUDAGraph + PD 分离
- 支持 MTP 无损验证
- 支持 MTP 分 chunk #5343
多模态、缓存与量化能力增强
- 支持多模单 batch、纯文本多 batch 混合 Prefill 调度 #4611
- 支持多模 Prefix Cache #4803
- 动态量化支持 Prefix Cache #5125
- 修复并支持多模 Prefix Cache 与 CUDAGraph 同时开启 #4679
- 支持 W4AFP8 动态量化 #5282
- 支持静态 C8 scale 单独加载 #4624
- 完善 Machete 对不同量化 group size 的支持 #4911
- 支持 Flash Mask Attention Backend 接入 #5104 #5134 #5387
- v1 Loader 加载性能优化 #4532
- 支持预编译包功能 #4729
多硬件平台支持扩展
P800
- 支持多模 Prefix Cache #5356
- 支持 PD 分离 #5179
- 支持思考模型思考强度限制 #4761
- 支持 TP + EP 并行 #4688 #4836
Intel HPU
- 新增 Prefix Caching 支持 #4971
- 新增 Chunked Prefill 支持 #5289
Iluvatar GPU
- 支持 ERNIE-4.5-21B-A3B 与 ERNIE-4.5-VL-28B-A3B-Thinking #4774 #4995
- 修复多项 CI 问题 #4972 #5012 #5100
MetaX
- 支持 ERNIE-4.5-VL-28B #4820
- 新增 Cutlass MoE #4602 #4685 #5128
- 支持 default_v1 loader #4956 #5001
- 优化 Flash MLA 性能 #4915
- 新增 Triton MoE 的 default_v1 loader 与 quant_config #5030
- 支持 ENABLE_V1_KVCACHE_SCHEDULER #5163
性能优化、可观测性与稳定性修复
性能与通信优化
- AppendAttn 算子支持 CUDA-PDL #5072
- DeepGemm H2D 消除 #5262
- 优化集中式 EP 通信逻辑 #5145
- 移除 CUDA Graph 下 Append Attention 的 DtoH 同步开销
- 支持两阶段低时延通信 #4162
- 支持 TP + EP 混合并行 #4615 #5315 #5353
- 默认编译 RDMA,降低多模 CUDAGraph 开销
可观测性与安全
- 支持基于请求级别的细粒度链路追踪 #5458
- 添加 trace_id / span_id 自动注入与开关 #4692 #5765
- 新增
--api-key权限校验参数 #4806
稳定性与 Bug 修复
- 修复 logprob / prompt_logprob 计算、序列化及通信相关问题 #4681 #4884 #5237 #5335
- 修复 EP、PD 分离、MTP、Prefix Cache、量化、多模态等多类推理场景下的稳定性问题
- 修复多硬件(XPU / MetaX / Luvatar / P800)算子与参数校验问题
What's Changed
- [BugFix] fix total_block_num init error in worker_process by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/4553
- [BugFix] Fix graph opt test case by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4634
- [Feature] add mm token usage by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/4570
- [XPU] Update the return value of TextImageGatherScatter by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4636
- [Docs] Add PaddleOCR-VL-0.9B best practices by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4658
- [XPU] fix pos_emb_type bug by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4638
- [Docs] add Qwen25vl yaml by @xjkmfa in https://github.com/PaddlePaddle/FastDeploy/pull/4662
- [Feature] add a new reasoning parser by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4571
- [XPU] [CI] Increase pytest timeout for XPU ep test by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4665
- add noaux_tc to unitest fused_moe by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4656
- [EP] fix several bugs in data parallel by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/4657
- [OP] Add InferShape&InferDtype for
per_token_quant_paddingby @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4667 - 【Hackathon 9th No.86】autogen
MoeFastHardamardImplWrappertemplate_instantiation by @ccsuzzh in https://github.com/PaddlePaddle/FastDeploy/pull/4592 - [UT] Add ut for speculative sampler by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/4650
- [Doc] update docs by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/4675
- [Graph Optimization] Add the CUDAGraph usage switch for Draft Model by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4601
- [CI] Add test for paddleocr_vl by @Limerances in https://github.com/PaddlePaddle/FastDeploy/pull/4627
- [unitest]add real gate_correction_bias weight to mock real data dispatch by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4676
- [noauxtc_kernel] remove useless code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4643
- [BugFix] fix offline llm chat "enable_thinking" is always "False" by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4686
- [BugFix] fix total_block_num init error in worker_process and test_async_llm not throw error by @xyxinyang in https://github.com/PaddlePaddle/FastDeploy/pull/4687
- [BugFix] fix --logprobs-mode raw_logits by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4681
- [XPU] xpu currently disable prefix cache for VL model by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4695
- [XPU] [CI] Add Vl case by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4649
- [BugFix] Fix finish reason in _create_chat_completion_choice by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4582
- [Feature] Unify the registration name recognition for tool_parser and reasoning_parser to “-” by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4668
- [BugFix] fix unittest of get_save_output_v1 by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/4701
- [XPU] [CI] Lock xvllm version by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4715
- [Graph Optimization] SOT+CUDAGraph support ERNIE4.5T VL 28B / 424B by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4645
- [Feature] support mtp distribution equivalence verification by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/4699
- [KVCache] Support kv cache scale load by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/4624
*…
Excerpt shown — open the source for the full document.
Notability
notability 6.0/10Notable deployment tool update