What does this release signal mean?

Baidu (ERNIE) published FastDeploy v2.2.0 in the PaddlePaddle/FastDeploy repository. This release signal is evidence of what shipped, changed, or was packaged for users. High-signal details: Notable deployment toolkit update from Baidu · Repository: PaddlePaddle/FastDeploy · Tag: v2.2.0 · Published: 2025-09-08T16:17:00Z · Prerelease: no · Release notes: New Features - bad_words in the sampling strategy accepts token ids -.... onlylabs links this event to 1 captured evidence page and 6 related release signals.

Baidu (ERNIE) Release: PaddlePaddle/FastDeploy v2.2.0

Captured source

GitHub/github.com/PaddlePaddle/FastDeploy

PaddlePaddle/FastDeploy v2.2.0

published Sep 8, 2025 · seen Jun 5 · captured Jun 11 · http 200 · method plain

v2.2.0

Repository: PaddlePaddle/FastDeploy

Tag: v2.2.0

Published: 2025-09-08T16:17:00Z

Prerelease: no

Release notes:

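New Features

bad_words in the sampling strategy now accepts token ids
Added support for the Qwen2.5-VL model series (video requests do not support enable-chunked-prefill)
The prompt field of the API-Server completions endpoint accepts a list of token ids and supports batch inference
Added function call parsing; function call results can be parsed via ``tool-call-parse``
Support for a custom chat_template at service startup or in a request
Support for loading a model's chat_template.jinja file
Error responses now include the exception stack trace, and exception logging is more complete
Added a speculative decoding method that mixes MTP and Ngram
Support for Tree Attention in speculative decoding
Enhanced model loading: weights are loaded through an iterator, further improving load speed and memory usage
API-Server log format improved, with timestamps added
Added a plugin mechanism that lets users extend custom functionality without modifying FastDeploy core code
Marlin kernel files are generated automatically from template configuration at compile time
Support for loading ERNIE and Qwen series models in native HuggingFace Safetensors format
Improved DP+TP+EP hybrid parallel inference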
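The two token-id entries in the feature list above (bad_words and the completions prompt field) can be pictured as a request body. The following is a minimal sketch only: the helper name `build_completions_payload`, the model name, and the field names `prompt`, `max_tokens`, and `bad_words_token_ids` are assumptions modeled on OpenAI-style completions requests, not a schema confirmed by these release notes. It builds the JSON body without sending it anywhere.

```python
import json
from typing import Dict, List, Optional, Union

# One prompt is either plain text or a list of token ids.
Prompt = Union[str, List[int]]


def build_completions_payload(
    model: str,
    prompts: List[Prompt],
    max_tokens: int = 64,
    bad_words_token_ids: Optional[List[int]] = None,
) -> Dict[str, object]:
    """Build a JSON-serializable body for an OpenAI-style completions call.

    ASSUMPTION: every field name here is illustrative; check the FastDeploy
    documentation for the exact request schema before relying on it.
    """
    if not prompts:
        raise ValueError("at least one prompt is required")
    if max_tokens < 1:
        # Reject max_tokens=0 on the client instead of relying on the server.
        raise ValueError("max_tokens must be a positive integer")
    for prompt in prompts:
        if isinstance(prompt, list) and not all(isinstance(t, int) for t in prompt):
            raise ValueError("token-id prompts must contain only integers")

    payload: Dict[str, object] = {
        "model": model,
        "prompt": prompts,  # several prompts in one request = batch inference
        "max_tokens": max_tokens,
    }
    if bad_words_token_ids:
        # Hypothetical field: token ids the sampler should never emit.
        payload["bad_words_token_ids"] = list(bad_words_token_ids)
    return payload


if __name__ == "__main__":
    body = build_completions_payload(
        "demo-model",  # placeholder model name
        [[101, 2023, 102], [101, 2060, 102]],
        max_tokens=16,
        bad_words_token_ids=[7, 8],
    )
    print(json.dumps(body, indent=2))
```

Passing token ids instead of text avoids a second tokenization step on the server when the client has already tokenized its input.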

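Performance Optimizations

Added the W4Afp8 MoE Group GEMM operator
CUDA Graph now supports long contexts above 32K
Optimized the moe_topk_select operator, improving MoE model performance
Added the Machete WINT4 GEMM operator to speed up WINT4 GEMM; enable it with FD_USE_MACHETE=1
Chunked prefill is enabled by default
The V1 KVCache scheduling policy and context caching are enabled by default
MTP supports inference with more draft tokens, improving the multi-step acceptance rate
Added pluggable lightweight sparse attention to accelerate long-context inference
Decode supports adaptive two-stage All-to-All communication, improving communication speed
DeepSeek series models can enable Flash-Attention-V3 in the encoder stage of the MLA Backend
Support for horizontal fusion of the q_a_proj & kv_a_proj_with_mqa linear layers in DeepSeek series models
API-Server adds a zmq dealer mode communication management module; connection reuse further raises the maximum concurrency the service can support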
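The draft-token and acceptance-rate entry in the performance list above refers to speculative decoding, where cheap draft tokens are proposed and then checked against the target model. The sketch below illustrates the general idea with an n-gram lookup drafter and greedy verification. It is a generic toy, not FastDeploy's implementation: the function names are hypothetical, and the "target model" is a stand-in callable. Real engines verify all draft tokens in one batched forward pass; the loop here is sequential only for clarity.

```python
from typing import Callable, List, Sequence


def propose_ngram_draft(context: Sequence[int], ngram_size: int, num_draft: int) -> List[int]:
    """Propose draft tokens by finding the last `ngram_size` tokens earlier in the context.

    Returns the tokens that followed the most recent earlier match, or [] if none.
    """
    if ngram_size <= 0 or num_draft <= 0 or len(context) <= ngram_size:
        return []
    tail = list(context[-ngram_size:])
    # Scan backwards so the most recent earlier occurrence wins; the starting
    # index excludes the tail's own position.
    for start in range(len(context) - ngram_size - 1, -1, -1):
        if list(context[start:start + ngram_size]) == tail:
            begin = start + ngram_size
            return list(context[begin:begin + num_draft])
    return []


def count_accepted(
    context: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[List[int]], int],
) -> int:
    """Greedy verification: accept draft tokens while they match the target's next token."""
    prefix = list(context)
    accepted = 0
    for token in draft:
        if target_next(prefix) != token:
            break  # first mismatch ends acceptance for this step
        prefix.append(token)
        accepted += 1
    return accepted


if __name__ == "__main__":
    history = [1, 2, 3, 4, 1, 2]
    draft = propose_ngram_draft(history, ngram_size=2, num_draft=3)

    # Stand-in target model: always predicts "previous token + 1".
    def toy_target(prefix: List[int]) -> int:
        return prefix[-1] + 1

    print(draft, count_accepted(history, draft, toy_target))
```

The number of accepted tokens per step is what an acceptance rate measures: the more draft tokens the target model agrees with, the fewer target forward passes are needed per generated token.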

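Bug Fixes

The completion endpoint now supports echo
Fixed a context cache management bug under V1 scheduling
Fixed inconsistent outputs across two runs for Qwen models with top_p fixed at 0
Fixed uvicorn randomly crashing at startup or at runtime with multiple workers
Fixed how logprobs are aggregated for multiple prompts in the API-Server completions endpoint
Fixed a sampling issue in MTP
Fixed an incorrect cache transfer signal in PD disaggregation
Fixed the flow control signal not being released when an exception is thrown
Fixed the exception failing to be raised when ``max_tokens`` is 0
Fixed a hang on exit of offline inference in EP + DP hybrid mode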
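The top_p=0 entry in the bug-fix list above is easier to read with the sampling rule in mind. The release notes do not describe the fix itself, so the following is only a generic sketch of nucleus (top-p) filtering, not FastDeploy's sampler; the function name `top_p_filter` is hypothetical. It shows why top_p=0 is expected to behave like greedy decoding: the candidate set shrinks to the single most probable token, so repeated runs should agree.

```python
from typing import List


def top_p_filter(probs: List[float], top_p: float) -> List[int]:
    """Return the token indices kept by nucleus filtering, most probable first.

    Tokens are added in descending probability until their cumulative
    probability reaches `top_p`; at least one token is always kept.
    """
    if not probs:
        raise ValueError("probs must not be empty")
    if not 0.0 <= top_p <= 1.0:
        raise ValueError("top_p must be in [0, 1]")

    # Break ties by index so the ordering is deterministic.
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))

    kept: List[int] = []
    cumulative = 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    return kept


if __name__ == "__main__":
    distribution = [0.1, 0.6, 0.3]
    print(top_p_filter(distribution, 0.0))  # only the most probable token
    print(top_p_filter(distribution, 0.8))  # the two most probable tokens
```

With a single candidate left there is nothing random to sample, which is why differing outputs at top_p=0 count as a bug rather than expected variance.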

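Documentation

Updated the usage and conflict relationships of several techniques in the best practices document
Added multi-node tensor parallel deployment documentation
Added data parallel deployment documentation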

Other

CI adds an approval gate for custom operators
Config cleanup and standardization

What's Changed

Describe PR diff coverage using JSON file by @XieYunshen in https://github.com/PaddlePaddle/FastDeploy/pull/3114
[CI] add xpu ci case by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/3111
disable test_cuda_graph.py by @XieYunshen in https://github.com/PaddlePaddle/FastDeploy/pull/3124
[CE] Add base test class for web server testing by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/3120
[OPs] MoE Preprocess OPs Support 160 Experts by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/3121
[Docs] Optimal Deployment by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/2768
fix stop seq unittest by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/3126
[XPU]Fix out-of-memory issue during single-XPU deployment by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/3133
[Code Simplification] Refactor Post-processing in VL Model Forward Method by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/2937
add case by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/3150
fix ci by @XieYunshen in https://github.com/PaddlePaddle/FastDeploy/pull/3141
Fa3 supports centralized mode by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/3112
Add CI cases by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/3155
[XPU]Updata XPU dockerfiles by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/3144
[Feature] remove dependency on enable_mm and refine multimodal's code by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/3014
【Inference Optimize】Support automatic generation of marlin kernel by @chang-wenbin in https://github.com/PaddlePaddle/FastDeploy/pull/3149
Update __init__.py by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/3163
fix load_pre_sharded_checkpoint by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/3152
【Feature】add fd plugins && rm model_classes by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/3123
[Bug Fix] fix pd disaggregated kv cache signal by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/3172
Update test_base_chat.py by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/3183
Fix approve shell scripts by @YuanRisheng in https://github.com/PaddlePaddle/FastDeploy/pull/3108
[Bug Fix] fix the bug in test_sampler by @zeroRains in https://github.com/PaddlePaddle/FastDeploy/pull/3157
【Feature】support qwen3 name_mapping by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/3179
remove useless code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/3166
[Bug fix] Fix cudagraph when use ep. by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/3130
[Bugfix] Fix uninitialized decoded_token and add corresponding unit t… by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/3195
[CI] add test_compare_top_logprobs by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/3191
fix expertwise_scale by @rsmallblue in https://github.com/PaddlePaddle/FastDeploy/pull/3181
[FIX]fix bad_words when sending requests consecutively by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/3197
[plugin] Custom model_runner/model support by @lizhenyun01 in https://github.com/PaddlePaddle/FastDeploy/pull/3186
Add more base chat cases by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/3203
Add switch to apply fine-grained per token quant fp8 by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/3192
[Bug Fix]Fix bug of append attention test case by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/3202
add more cases by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/3207
fix coverage report by @XieYunshen in https://github.com/PaddlePaddle/FastDeploy/pull/3198
[New Feature] fa3 supports flash mask by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/3184
[Test] scaled_gemm_f8_i4_f16 skip test while sm != 89 by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/3210
[EP] Refactor DeepEP Engine Organization for Mixed Mode & Buffer Management Optimization by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/3182
[Bug fix] Fix lm head bias by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/3185
Ce add repitation early stop cases by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/3213
[BugFix]fix test_air_top_p_sampling name by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/3211
[BugFix] support real batch_size by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/3109
Ce add bad cases by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/3215
revise noaux_tc by @rsmallblue in https://github.com/PaddlePaddle/FastDeploy/pull/3164
[Bug Fix] Fix...

Excerpt shown — open the source for the full document.

Notability

notability 6.0/10

Notable deployment toolkit update from Baidu