Release · Baidu (ERNIE) · published Sep 8, 2025

PaddlePaddle/FastDeploy v2.2.0

v2.2.0

Repository: PaddlePaddle/FastDeploy

Tag: v2.2.0

Published: 2025-09-08T16:17:00Z

Prerelease: no

Release notes:

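New Features

  • bad_words in the sampling strategy now accepts token IDs
  • Added support for the Qwen2.5-VL model series (video requests do not support enable-chunked-prefill)
  • The prompt field of the API-Server completions endpoint now accepts lists of token IDs and supports batch inference
  • Added function call parsing; function call results can be parsed via ``tool-call-parse``
  • Support for a custom chat_template at service startup or per request
  • Support for loading a model's chat_template.jinja file
  • Error responses now include exception stack traces, and exception logging is improved
  • Added a speculative decoding method that combines MTP and Ngram
  • Added Tree Attention support for speculative decoding
  • Improved model loading: weights are now loaded through an iterator, further improving load speed and memory usage
  • Improved the API-Server log format and added timestamps
  • Added a plugin mechanism that lets users extend FastDeploy with custom functionality without modifying its core code
  • Marlin kernel files can now be generated automatically from template configuration at compile time
  • Support for loading ERNIE and Qwen series models in native HuggingFace Safetensors format
  • Improved DP+TP+EP hybrid parallel inference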
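
As an illustration of the completions change listed above, the sketch below builds request bodies whose prompt is either one list of token IDs or a batch of such lists. It assumes the API-Server follows the OpenAI-style completions schema. The helper build_completion_payload, the model name, and the token IDs are hypothetical, and nothing is sent over the network.

```python
import json


def _is_id_list(value):
    """True for a non-empty list made up only of integers."""
    return isinstance(value, list) and len(value) > 0 and all(
        isinstance(t, int) for t in value
    )


def build_completion_payload(prompt, max_tokens=32, model="default"):
    """Build a JSON-serializable body for an OpenAI-style completions request.

    `prompt` may be a string, one list of token IDs, or a list of
    token-ID lists (a batch). The model name is a placeholder (assumption).
    """
    is_batch = (
        isinstance(prompt, list)
        and len(prompt) > 0
        and all(_is_id_list(p) for p in prompt)
    )
    if not (isinstance(prompt, str) or _is_id_list(prompt) or is_batch):
        raise TypeError("prompt must be str, list[int], or list[list[int]]")
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}


# Hypothetical token IDs; real IDs come from the served model's tokenizer.
single = build_completion_payload([101, 2023, 2003])
batch = build_completion_payload([[101, 2023], [101, 2045, 2003]])
print(json.dumps(batch))
```

Validating the prompt shape on the client side is a design choice of this sketch; the server may accept or reject shapes differently.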

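Performance Optimizations

  • Added the W4Afp8 MoE Group GEMM operator
  • CUDA Graph now supports long inputs exceeding 32K
  • Optimized the moe_topk_select operator, improving MoE model performance
  • Added the Machete WINT4 GEMM operator to optimize WINT4 GEMM performance; enable it with FD_USE_MACHETE=1
  • Chunked prefill is now enabled by default
  • The V1 KVCache scheduling policy and context caching are now enabled by default
  • MTP supports inference with more draft tokens, improving the multi-step acceptance rate
  • Added pluggable lightweight sparse attention to accelerate long-context inference
  • Added adaptive two-stage All-to-All communication for Decode, improving communication speed
  • DeepSeek series models can now enable Flash-Attention-V3 in the encoder stage of the MLA backend
  • Support for horizontal fusion of the q_a_proj and kv_a_proj_with_mqa linear layers in DeepSeek series models
  • Added a zmq dealer mode communication management module to the API-Server; connection reuse further raises the maximum concurrency the service can support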
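
The Machete opt-in listed above is an environment variable, so it has to be present in the environment of the serving process. A minimal sketch follows: the variable name FD_USE_MACHETE comes from these notes, while the helper machete_env and the launch command are placeholders (assumptions), not a documented FastDeploy invocation.

```python
import os


def machete_env(base_env=None):
    """Return a copy of the environment with the Machete WINT4 GEMM opt-in set."""
    env = dict(os.environ if base_env is None else base_env)
    env["FD_USE_MACHETE"] = "1"  # flag name taken from the release notes
    return env


env = machete_env({"PATH": "/usr/bin"})

# Placeholder command (assumption): replace with the real serving entry point.
launch_cmd = ["python", "serve.py"]
# subprocess.run(launch_cmd, env=env, check=True)  # not executed: serve.py is hypothetical

print(env["FD_USE_MACHETE"])
```

Copying the environment instead of mutating os.environ keeps the flag scoped to the launched process.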

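Bug Fixes

  • Added echo support to the completion endpoint
  • Fixed a context cache management bug under V1 scheduling
  • Fixed inconsistent outputs between two runs of Qwen models with top_p fixed at 0
  • Fixed uvicorn crashing randomly at startup and at runtime when running multiple workers
  • Fixed how logprobs are aggregated for multiple prompts in the API-Server completions endpoint
  • Fixed a sampling issue in MTP
  • Fixed an incorrect cache transfer signal in PD disaggregation
  • Fixed the flow-control signal not being released when an exception is thrown
  • Fixed the failure to raise an exception when ``max_tokens`` is 0
  • Fixed a hang on exit during offline inference in EP + DP hybrid mode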
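
For context on the top_p item above, the following is a generic sketch of nucleus (top-p) filtering, not FastDeploy's sampler. Under the common convention that the most likely token is always kept, top_p=0 leaves exactly one candidate, so repeated runs would be expected to produce the same output. The function name top_p_filter and the probabilities are illustrative assumptions.

```python
def top_p_filter(probs, top_p):
    """Return the indices kept by nucleus filtering, most likely first.

    Tokens are added in descending probability until the cumulative
    mass reaches top_p; the most likely token is always kept.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


probs = [0.1, 0.6, 0.3]  # made-up distribution over three tokens
greedy_like = top_p_filter(probs, 0.0)  # only the most likely token survives
wider = top_p_filter(probs, 0.8)        # keeps tokens until 80% mass is covered
print(greedy_like, wider)
```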

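Documentation

  • Updated the best-practices documentation on how certain techniques are used and how they conflict with each other
  • Added multi-node tensor parallel deployment documentation
  • Added data parallel deployment documentation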

Other

  • CI now enforces an approval gate for custom operators
  • Config cleanup and standardization

What's Changed

  • Describe PR diff coverage using JSON file by @XieYunshen in https://github.com/PaddlePaddle/FastDeploy/pull/3114
  • [CI] add xpu ci case by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/3111
  • disable test_cuda_graph.py by @XieYunshen in https://github.com/PaddlePaddle/FastDeploy/pull/3124
  • [CE] Add base test class for web server testing by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/3120
  • [OPs] MoE Preprocess OPs Support 160 Experts by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/3121
  • [Docs] Optimal Deployment by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/2768
  • fix stop seq unittest by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/3126
  • [XPU]Fix out-of-memory issue during single-XPU deployment by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/3133
  • [Code Simplification] Refactor Post-processing in VL Model Forward Method by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/2937
  • add case by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/3150
  • fix ci by @XieYunshen in https://github.com/PaddlePaddle/FastDeploy/pull/3141
  • Fa3 supports centralized deployment by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/3112
  • Add CI cases by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/3155
  • [XPU]Updata XPU dockerfiles by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/3144
  • [Feature] remove dependency on enable_mm and refine multimodal's code by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/3014
  • 【Inference Optimize】Support automatic generation of marlin kernel by @chang-wenbin in https://github.com/PaddlePaddle/FastDeploy/pull/3149
  • Update __init__.py by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/3163
  • fix load_pre_sharded_checkpoint by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/3152
  • 【Feature】add fd plugins && rm model_classes by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/3123
  • [Bug Fix] fix pd disaggregated kv cache signal by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/3172
  • Update test_base_chat.py by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/3183
  • Fix approve shell scripts by @YuanRisheng in https://github.com/PaddlePaddle/FastDeploy/pull/3108
  • [Bug Fix] fix the bug in test_sampler by @zeroRains in https://github.com/PaddlePaddle/FastDeploy/pull/3157
  • 【Feature】support qwen3 name_mapping by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/3179
  • remove useless code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/3166
  • [Bug fix] Fix cudagraph when use ep. by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/3130
  • [Bugfix] Fix uninitialized decoded_token and add corresponding unit t… by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/3195
  • [CI] add test_compare_top_logprobs by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/3191
  • fix expertwise_scale by @rsmallblue in https://github.com/PaddlePaddle/FastDeploy/pull/3181
  • [FIX]fix bad_words when sending requests consecutively by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/3197
  • [plugin] Custom model_runner/model support by @lizhenyun01 in https://github.com/PaddlePaddle/FastDeploy/pull/3186
  • Add more base chat cases by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/3203
  • Add switch to apply fine-grained per token quant fp8 by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/3192
  • [Bug Fix]Fix bug of append attention test case by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/3202
  • add more cases by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/3207
  • fix coverage report by @XieYunshen in https://github.com/PaddlePaddle/FastDeploy/pull/3198
  • [New Feature] fa3 supports flash mask by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/3184
  • [Test] scaled_gemm_f8_i4_f16 skip test while sm != 89 by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/3210
  • [EP] Refactor DeepEP Engine Organization for Mixed Mode & Buffer Management Optimization by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/3182
  • [Bug fix] Fix lm head bias by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/3185
  • Ce add repitation early stop cases by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/3213
  • [BugFix]fix test_air_top_p_sampling name by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/3211
  • [BugFix] support real batch_size by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/3109
  • Ce add bad cases by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/3215
  • revise noaux_tc by @rsmallblue in https://github.com/PaddlePaddle/FastDeploy/pull/3164
  • [Bug Fix] Fix…

Excerpt shown — open the source for the full document.
