What does this release signal mean?

Baidu (ERNIE) published PaddlePaddle/FastDeploy v2.4.0 (PaddlePaddle/FastDeploy). This release signal is evidence of what shipped, changed, or was packaged for users. High-signal details: Notable deployment tool update · v2.4.0 Repository: PaddlePaddle/FastDeploy Tag: v2.4.0 Published: 2026-01-23T02:20:55Z Prerelease: no Release notes: 核心推理能力与模型支持增强 * 支持文本 `prompt_logprob` 及全量 `logprob`.... onlylabs links this event to 1 captured evidence page and 6 related release signals.

Baidu (ERNIE) Release: PaddlePaddle/FastDeploy v2.4.0

Captured source

source ↗

GitHub/github.com/PaddlePaddle/FastDeploy

PaddlePaddle/FastDeploy v2.4.0

Source ↗

published Jan 23, 2026seen Jun 5captured Jun 11http 200method plain

v2.4.0

Repository: PaddlePaddle/FastDeploy

Tag: v2.4.0

Published: 2026-01-23T02:20:55Z

Prerelease: no

Release notes:

核心推理能力与模型支持增强

支持文本 prompt_logprob 及全量 logprob 能力 #4769
支持离线推理中基于 ZMQ 的 logprobs / prompt_logprobs，并引入 max_logprobs 参数 #4897
支持在线推理中基于 ZMQ 的 logprobs / prompt_logprobs，并优化通信方式 #5089
新增 logprobs / prompt_logprobs 的 token_id 解码控制开关 #5463
受限解码新增 llguidance 后端 #5124
CUDAGraph 支持投机解码 Draft Model 加速(默认关闭)
[Speculative Decoding] 解耦 draft_tokens 后处理流程 #5205
支持 Pooling 模型 Runner
支持 Reward 模型
Pooling 模型通用 embedding 接口 #4344
Pooling 模型定制 reward 接口 #4518
新增开源模型 Ernie-4.5-VL-28B-A3B-Thinking 的 reasoning_parser，兼容 - / _ 命名规则 #4571 #4668
支持通过 chat_template_kwargs.options.thinking_mode 控制思考开关
支持多模模型传入 prompt_token_ids 请求，并通过 messages 输入多模数据，实现 tokens-in / tokens-out 能力

并行架构、调度与 MoE 能力演进

GLM / Qwen 模型消除 EP 空跑时的通信开销 #5254
支持 MoE 分 chunk 执行 #4575
支持 EPLB（Expert Load Balancing）#4782
支持 EPLB 重排与冗余专家策略 #5142 #5143 #5178 #5239 #5918
支持路由重放机制
PD 分离支持 Deepseek V3 模型 EP 并行部署 #5251
PD 分离支持 Qwen3-MoE 模型 EP 并行部署 #4691
PD 分离支持 Prefill 与 Decode 使用不同 TP Size #5296
新增 Python 版本 Router，支持集中式与分离式部署调度 #4709
支持多步 MTP + CUDAGraph + PD 分离
支持 MTP 无损验证
支持 MTP 分 chunk #5343

多模态、缓存与量化能力增强

支持多模单 batch、纯文本多 batch 混合 Prefill 调度 #4611
支持多模 Prefix Cache #4803
动态量化支持 Prefix Cache #5125
修复并支持多模 Prefix Cache 与 CUDAGraph 同时开启 #4679
支持 W4AFP8 动态量化 #5282
支持静态 C8 scale 单独加载 #4624
完善 Machete 对不同量化 group size 的支持 #4911
支持 Flash Mask Attention Backend 接入 #5104 #5134 #5387
v1 Loader 加载性能优化 #4532
支持预编译包功能 #4729

多硬件平台支持扩展

P800

支持多模 Prefix Cache #5356
支持 PD 分离 #5179
支持思考模型思考强度限制 #4761
支持 TP + EP 并行 #4688 #4836

Intel HPU

新增 Prefix Caching 支持 #4971
新增 Chunked Prefill 支持 #5289

Iluvatar GPU

支持 ERNIE-4.5-21B-A3B 与 ERNIE-4.5-VL-28B-A3B-Thinking #4774 #4995
修复多项 CI 问题 #4972 #5012 #5100

MetaX

支持 ERNIE-4.5-VL-28B #4820
新增 Cutlass MoE #4602 #4685 #5128
支持 default_v1 loader #4956 #5001
优化 Flash MLA 性能 #4915
新增 Triton MoE 的 default_v1 loader 与 quant_config #5030
支持 ENABLE_V1_KVCACHE_SCHEDULER #5163

性能优化、可观测性与稳定性修复

性能与通信优化

AppendAttn 算子支持 CUDA-PDL #5072
DeepGemm H2D 消除 #5262
优化集中式 EP 通信逻辑 #5145
移除 CUDA Graph 下 Append Attention 的 DtoH 同步开销
支持两阶段低时延通信 #4162
支持 TP + EP 混合并行 #4615 #5315 #5353
默认编译 RDMA，降低多模 CUDAGraph 开销

可观测性与安全

支持基于请求级别的细粒度链路追踪 #5458
添加 trace_id / span_id 自动注入与开关 #4692 #5765
新增 --api-key 权限校验参数 #4806

稳定性与 Bug 修复

修复 logprob / prompt_logprob 计算、序列化及通信相关问题 #4681 #4884 #5237 #5335
修复 EP、PD 分离、MTP、Prefix Cache、量化、多模态等多类推理场景下的稳定性问题
修复多硬件（XPU / MetaX / Luvatar / P800）算子与参数校验问题

What's Changed

[BugFix] fix total_block_num init error in worker_process by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/4553
[BugFix] Fix graph opt test case by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4634
[Feature] add mm token usage by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/4570
[XPU] Update the return value of TextImageGatherScatter by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4636
[Docs] Add PaddleOCR-VL-0.9B best practices by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4658
[XPU] fix pos_emb_type bug by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4638
[Docs] add Qwen25vl yaml by @xjkmfa in https://github.com/PaddlePaddle/FastDeploy/pull/4662
[Feature] add a new reasoning parser by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4571
[XPU] [CI] Increase pytest timeout for XPU ep test by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4665
add noaux_tc to unitest fused_moe by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4656
[EP] fix several bugs in data parallel by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/4657
[OP] Add InferShape&InferDtype for per_token_quant_padding by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4667
【Hackathon 9th No.86】autogen MoeFastHardamardImplWrapper template_instantiation by @ccsuzzh in https://github.com/PaddlePaddle/FastDeploy/pull/4592
[UT] Add ut for speculative sampler by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/4650
[Doc] update docs by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/4675
[Graph Optimization] Add the CUDAGraph usage switch for Draft Model by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4601
[CI] Add test for paddleocr_vl by @Limerances in https://github.com/PaddlePaddle/FastDeploy/pull/4627
[unitest]add real gate_correction_bias weight to mock real data dispatch by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4676
[noauxtc_kernel] remove useless code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4643
[BugFix] fix offline llm chat "enable_thinking" is always "False" by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4686
[BugFix] fix total_block_num init error in worker_process and test_async_llm not throw error by @xyxinyang in https://github.com/PaddlePaddle/FastDeploy/pull/4687
[BugFix] fix --logprobs-mode raw_logits by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4681
[XPU] xpu currently disable prefix cache for VL model by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4695
[XPU] [CI] Add Vl case by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4649
[BugFix] Fix finish reason in _create_chat_completion_choice by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4582
[Feature] Unify the registration name recognition for tool_parser and reasoning_parser to “-” by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4668
[BugFix] fix unittest of get_save_output_v1 by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/4701
[XPU] [CI] Lock xvllm version by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4715
[Graph Optimization] SOT+CUDAGraph support ERNIE4.5T VL 28B / 424B by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4645
[Feature] support mtp distribution equivalence verification by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/4699
[KVCache] Support kv cache scale load by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/4624

*...

Excerpt shown — open the source for the full document.

Notability

notability 6.0/10

Notable deployment tool update