PaddlePaddle/FastDeploy v2.2.0
Repository: PaddlePaddle/FastDeploy
Tag: v2.2.0
Published: 2025-09-08T16:17:00Z
Prerelease: no
Release notes:
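New Features
- bad_words in the sampling strategy now accepts token ids
- Added support for the Qwen2.5-VL model series (video requests do not support enable-chunked-prefill)
- The prompt field of the API-Server completions endpoint now accepts a list of token ids and supports batch inference
- Added function call parsing; function call results can be parsed via ``tool-call-parse``
- Support for a custom chat_template at service startup or per request
- Support for loading a model's chat_template.jinja file
- Request error responses now include the exception stack trace; exception logging has been improved
- Added a speculative decoding method that mixes MTP and Ngram
- Support for Tree Attention in speculative decoding
- Enhanced model loading: weights are now loaded through an iterator, further improving loading speed and memory usage
- API-Server log format improved, with timestamps added
- Added a plugin mechanism that lets users extend custom functionality without modifying FastDeploy core code
- Marlin kernel files can be generated automatically at compile time from template configuration
- Support for loading ERNIE and Qwen series models in native HuggingFace Safetensors format
- Improved DP+TP+EP hybrid parallel inference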
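The token-id form of the completions prompt field can be sketched as a request body. This is a minimal illustration, not the documented FastDeploy schema: the helper name `build_completion_payload`, the token ids, and the list-of-lists shape for batching (borrowed from the OpenAI-style completions convention) are all assumptions.

```python
import json


def build_completion_payload(prompt, max_tokens=32):
    """Build a JSON-serializable body for an OpenAI-style completions request.

    `prompt` may be a list of token ids (one request) or a list of such
    lists (a batch). The batch shape is an assumption, not confirmed by
    the release notes.
    """
    return {"prompt": prompt, "max_tokens": max_tokens}


# Hypothetical token ids; real ids depend on the model's tokenizer.
single = build_completion_payload([101, 2023, 2003, 102])
batch = build_completion_payload([[101, 2023, 102], [101, 2178, 102]])

print(json.dumps(single))
print(json.dumps(batch))
```

The body would then be sent to the server's completions route with any HTTP client; the route and port depend on how the service was started.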
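Performance Optimizations
- Added a W4Afp8 MoE Group GEMM operator
- CUDA Graph now supports long sequences over 32K
- Optimized the moe_topk_select operator, improving MoE model performance
- Added a Machete WINT4 GEMM operator to optimize WINT4 GEMM performance; enable it with FD_USE_MACHETE=1
- Chunked prefill is enabled by default
- The V1 KVCache scheduling strategy and context caching are enabled by default
- MTP supports inference with more draft tokens, improving the multi-step acceptance rate
- Added pluggable lightweight sparse attention to accelerate long-context inference
- Decode supports adaptive two-stage All-to-All communication, improving communication speed
- DeepSeek series models can enable Flash-Attention-V3 in the encoder stage of the MLA backend
- DeepSeek series models support horizontal fusion of the q_a_proj and kv_a_proj_with_mqa linear layers
- API-Server adds a zmq dealer mode communication management module with connection reuse, further raising the maximum concurrency the service can support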
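The FD_USE_MACHETE=1 switch mentioned above is an environment variable, so it has to be set in the environment of the process that runs FastDeploy. A minimal sketch of preparing that environment from Python follows; the helper name `with_machete` is hypothetical, and how the server itself is launched is left out because the release notes do not specify it.

```python
import os


def with_machete(env=None):
    """Return a copy of `env` (default: the current environment) with
    FD_USE_MACHETE set to "1". The input mapping is not modified."""
    new_env = dict(os.environ if env is None else env)
    new_env["FD_USE_MACHETE"] = "1"
    return new_env


# Example with a small explicit environment.
env = with_machete({"PATH": "/usr/bin"})
print(env["FD_USE_MACHETE"])
```

The resulting mapping can be passed as the `env` argument of `subprocess.run` or `subprocess.Popen` when starting the service.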
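Bug Fixes
- Added echo support to the completion endpoint
- Fixed a context cache management bug under V1 scheduling
- Fixed inconsistent outputs between two runs of Qwen models with top_p fixed at 0
- Fixed uvicorn multi-worker processes randomly dying at startup or during runtime
- Fixed how logprobs are aggregated for multiple prompts in the API-Server completions endpoint
- Fixed a sampling issue in MTP
- Fixed an incorrect cache transfer signal in PD disaggregation
- Fixed the flow-control signal not being released when an exception is thrown
- Fixed the failure to raise an exception when ``max_tokens`` is 0
- Fixed offline inference hanging on exit in EP + DP hybrid mode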
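For context on the top_p=0 fix above: under the usual definition of nucleus (top-p) filtering, a threshold of 0 keeps only the single most probable token, so repeated runs are expected to agree. The sketch below is a generic pure-Python illustration of that definition, not FastDeploy's sampling kernel; the function name and the example probabilities are assumptions.

```python
def top_p_filter(probs, top_p):
    """Return the indices kept by nucleus filtering, most probable first.

    Tokens are added in descending probability until their cumulative
    mass reaches `top_p`; at least one token is always kept.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for idx in order:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:
            break
    return kept


probs = [0.1, 0.6, 0.3]
greedy = top_p_filter(probs, 0.0)   # only the top token survives
wider = top_p_filter(probs, 0.8)    # top two tokens survive
print(greedy, wider)
```

With a single surviving candidate there is nothing left to sample, which is why differing outputs at top_p=0 indicate a bug rather than expected randomness.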
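Documentation
- Updated the usage and conflict relationships of some techniques in the best-practices documentation
- Added multi-node tensor parallel deployment documentation
- Added data parallel deployment documentation
Others
- CI now adds an approval gate for custom operators
- Config cleanup and standardization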
What's Changed
- Describe PR diff coverage using JSON file by @XieYunshen in https://github.com/PaddlePaddle/FastDeploy/pull/3114
- [CI] add xpu ci case by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/3111
- disable test_cuda_graph.py by @XieYunshen in https://github.com/PaddlePaddle/FastDeploy/pull/3124
- [CE] Add base test class for web server testing by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/3120
- [OPs] MoE Preprocess OPs Support 160 Experts by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/3121
- [Docs] Optimal Deployment by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/2768
- fix stop seq unittest by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/3126
- [XPU]Fix out-of-memory issue during single-XPU deployment by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/3133
- [Code Simplification] Refactor Post-processing in VL Model Forward Method by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/2937
- add case by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/3150
- fix ci by @XieYunshen in https://github.com/PaddlePaddle/FastDeploy/pull/3141
- Fa3 supports centralized mode by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/3112
- Add CI cases by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/3155
- [XPU]Updata XPU dockerfiles by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/3144
- [Feature] remove dependency on enable_mm and refine multimodal's code by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/3014
- 【Inference Optimize】Support automatic generation of marlin kernel by @chang-wenbin in https://github.com/PaddlePaddle/FastDeploy/pull/3149
- Update __init__.py by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/3163
- fix load_pre_sharded_checkpoint by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/3152
- 【Feature】add fd plugins && rm model_classes by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/3123
- [Bug Fix] fix pd disaggregated kv cache signal by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/3172
- Update test_base_chat.py by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/3183
- Fix approve shell scripts by @YuanRisheng in https://github.com/PaddlePaddle/FastDeploy/pull/3108
- [Bug Fix] fix the bug in test_sampler by @zeroRains in https://github.com/PaddlePaddle/FastDeploy/pull/3157
- 【Feature】support qwen3 name_mapping by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/3179
- remove useless code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/3166
- [Bug fix] Fix cudagraph when use ep. by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/3130
- [Bugfix] Fix uninitialized decoded_token and add corresponding unit t… by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/3195
- [CI] add test_compare_top_logprobs by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/3191
- fix expertwise_scale by @rsmallblue in https://github.com/PaddlePaddle/FastDeploy/pull/3181
- [FIX]fix bad_words when sending requests consecutively by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/3197
- [plugin] Custom model_runner/model support by @lizhenyun01 in https://github.com/PaddlePaddle/FastDeploy/pull/3186
- Add more base chat cases by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/3203
- Add switch to apply fine-grained per token quant fp8 by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/3192
- [Bug Fix]Fix bug of append attention test case by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/3202
- add more cases by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/3207
- fix coverage report by @XieYunshen in https://github.com/PaddlePaddle/FastDeploy/pull/3198
- [New Feature] fa3 supports flash mask by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/3184
- [Test] scaled_gemm_f8_i4_f16 skip test while sm != 89 by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/3210
- [EP] Refactor DeepEP Engine Organization for Mixed Mode & Buffer Management Optimization by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/3182
- [Bug fix] Fix lm head bias by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/3185
- Ce add repitation early stop cases by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/3213
- [BugFix]fix test_air_top_p_sampling name by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/3211
- [BugFix] support real batch_size by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/3109
- Ce add bad cases by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/3215
- revise noaux_tc by @rsmallblue in https://github.com/PaddlePaddle/FastDeploy/pull/3164
- [Bug Fix] Fix…