amazon-science/adaptive-layerwise-perturbation
Python
Language: Python
License: Apache-2.0
Stars: 1
Forks: 0
Open issues: 3
Created: 2026-05-14T17:44:17Z
Pushed: 2026-05-14T19:08:21Z
Default branch: main
Fork: no
Archived: no
README:
Adaptive Layerwise Perturbation (ALP)
Chenlu Ye\*, Xuanchang Zhang\*, Yifan Hao, Zhou Yu, Ziji Zhang, Abhinav Gullapalli, Hao Chen, Jing Huang, Tong Zhang
University of Illinois Urbana-Champaign, Amazon
---
Introduction
Policy staleness and training-inference mismatch are key challenges in LLM reinforcement learning. Modern RL pipelines use separate systems for rollout generation (e.g., BF16 vLLM) and policy training (e.g., FP32 FSDP), introducing distributional gaps between the behavior policy and the training policy. These gaps destabilize training through inflated importance sampling ratios and noisy gradient estimates.
Adaptive Layerwise Perturbation (ALP) addresses this by injecting learnable Gaussian perturbations into transformer hidden states across all layers during policy updates. The perturbed policy serves as the importance sampling (IS) numerator against the unperturbed inference policy. By flattening the policy landscape through noise injection, ALP suppresses the heavy tails of the IS ratio and keeps training stable.
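The injection step can be illustrated with a minimal framework-free sketch. Everything named here (`perturb_hidden`, `forward`, the `tanh` placeholder layer, the list-of-floats hidden state, and injecting the noise after each layer) is an illustrative assumption, not the repository's implementation. It only shows Gaussian noise scaled by one coefficient per layer, with the coefficient stored in log-space.

```python
import math
import random

def perturb_hidden(hidden, log_sigma, rng):
    """Add N(0, sigma^2) noise to each element, with sigma = exp(log_sigma)."""
    sigma = math.exp(log_sigma)  # strictly positive for any real log_sigma
    return [h + sigma * rng.gauss(0.0, 1.0) for h in hidden]

def forward(hidden, log_sigmas, rng=None):
    """Toy stack with one hypothetical layer per entry in log_sigmas.

    rng=None gives the unperturbed pass (stand-in for the inference policy);
    passing an rng gives the perturbed pass (stand-in for the training policy).
    """
    for log_sigma in log_sigmas:
        # Placeholder for a real transformer layer (assumption).
        hidden = [math.tanh(h) for h in hidden]
        if rng is not None:
            hidden = perturb_hidden(hidden, log_sigma, rng)
    return hidden

# One scalar coefficient per layer, all initialised to sigma_0 = 1e-2.
log_sigmas = [math.log(1e-2)] * 4
x = [0.5, -1.0, 2.0]

clean = forward(x, log_sigmas)
noisy = forward(x, log_sigmas, rng=random.Random(0))
max_gap = max(abs(a - b) for a, b in zip(clean, noisy))
```

In a real training setup the entries of `log_sigmas` would be trainable parameters, so that an optimizer step on the log-value can never make the noise scale negative.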
This repository contains experiments for both multi-turn tool-integrated reasoning and single-turn RL settings.
Repository Structure
adaptive-layerwise-perturbation/
├── README.md                # This file
├── multi-turn/              # Multi-turn tool-integrated reasoning experiments (Qwen2.5-7B)
│   ├── datasets/
│   ├── eval/
│   ├── figures/
│   ├── recipe/
│   ├── sandbox/
│   ├── scripts/
│   ├── sft/
│   └── ...
└── single-turn/             # Single-turn RL experiments (verl-based)
    ├── run_scripts/
    ├── scripts/
    └── ...
---
Method
This codebase implements four rollout-correction strategies for LLM-RL:
- GSPO (Baseline): Group-level sequence policy optimization with no mismatch correction. Standard clipped importance ratio at the token level.
- Seq-Bypass: Uses rollout (vLLM) log-probabilities directly as old_log_probs in the loss denominator, bypassing the reference policy evaluation.
- MIS/TIS (Masked / Truncated Importance Sampling): Computes an auxiliary IS ratio between the FSDP training policy and the vLLM rollout policy. Outlier ratios are masked (MIS) or truncated (TIS) to stabilize training.
- ALP (Adaptive Layerwise Perturbation): Injects learnable Gaussian perturbations $\delta \sim \mathcal{N}(0, \sigma^2 I)$ into transformer hidden states across all layers during policy updates. The perturbed policy serves as the IS numerator. The learnable $\sigma$ is a scalar coefficient per layer.
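The difference between masking and truncating outlier ratios can be sketched in a few lines. The one-sided threshold `cap`, the per-token granularity, and the function names are assumptions made for illustration. The repository's actual rule (for example two-sided bounds or sequence-level ratios) may differ.

```python
import math

def is_ratios(train_logps, rollout_logps):
    """Per-token importance ratio pi_train / pi_rollout from log-probabilities."""
    return [math.exp(t - r) for t, r in zip(train_logps, rollout_logps)]

def truncate(ratios, cap):
    """Truncation: clamp any ratio above `cap` down to `cap`."""
    return [min(r, cap) for r in ratios]

def mask(ratios, cap):
    """Masking: give weight 0.0 to tokens whose ratio exceeds `cap`."""
    return [r if r <= cap else 0.0 for r in ratios]

train_logps = [-1.00, -0.50, -0.20]
rollout_logps = [-1.00, -0.60, -2.00]  # last token has a large train/rollout gap

ratios = is_ratios(train_logps, rollout_logps)
truncated = truncate(ratios, cap=2.0)
masked = mask(ratios, cap=2.0)
```

Here the third token has ratio exp(1.8), roughly 6.05. Truncation keeps that token in the loss with weight 2.0, while masking drops it entirely. The other two tokens are unchanged by either rule.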
---
Results
Multi-Turn Tool-Integrated Reasoning (Qwen2.5-7B)
| Method | Average Score |
|--------|---------------|
| Seq-ALP | 50.53 |
| Token-ALP | 49.62 |
| Token-MIS | 48.74 |
| Seq-MIS | 46.94 |
| Seq-Bypass | 46.66 |
| GSPO (baseline) | 46.57 |
Ablation: Layer Range for ALP
| Layer Range | Score |
|-------------|-------|
| All layers (0-27) | 50.53 |
| Late layers (23-27) | 48.66 |
| Middle layers (12-17) | 48.51 |
| Early layers (0-5) | 48.25 |
All-layer perturbation outperforms every partial-layer variant by roughly 1.9 to 2.3 points (50.53 vs. 48.25 to 48.66), suggesting that mismatch correction benefits from noise distributed across the full transformer stack.
---
ALP Configuration
Key Parameters
| Parameter | Config Key | Description | Default |
|-----------|-----------|-------------|---------|
| USE_PERTURBATION | actor_rollout_ref.actor.use_perturbation | Enable/disable ALP perturbation | True |
| PERTURB_STD | actor_rollout_ref.actor.perturb_std | Initial standard deviation $\sigma_0$ for Gaussian noise. The actual noise scale is $\exp(\log(\sigma_0))$, optimized in log-space to stay non-negative. | 1e-6 |
| coef_learnable | coef_learnable (in model config.json) | If True, the per-layer noise coefficient $\sigma_l$ is a learnable nn.Parameter updated via gradient descent. If False, $\sigma_l$ is fixed at perturb_std. | True |
| PERTURB_LR | actor_rollout_ref.actor.perturb_lr | Learning rate for the learnable perturbation coefficients (only used when coef_learnable=True) | 5e-4 |
| PERTURB_START_LAYER | actor_rollout_ref.actor.perturb_start_layer | Start layer index for perturbation (inclusive) | 0 |
| PERTURB_END_LAYER | actor_rollout_ref.actor.perturb_end_layer | End layer index for perturbation (exclusive). null means through the last layer. | null |
| PERTURB_PATCH | env PERTURB_PATCH | Transformer monkey-patch for noise injection. Options: qwen2 (Qwen2/2.5), qwen3, llama (LLaMA 3.x) | qwen2 |
| LOSS_MODE | actor_rollout_ref.actor.policy_loss.loss_mode | Loss aggregation: token (token-level ALP), sequence (sequence-level ALP), vanilla, cum-token | sequence |
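The inclusive-start, exclusive-end convention for the layer range can be made concrete with a small helper. `resolve_layer_range` is a hypothetical function written for this illustration, not part of the codebase. The layer count of 28 is inferred from the ablation's "All layers (0-27)" row, and `None` stands in for the config value null.

```python
def resolve_layer_range(start, end, num_layers):
    """Return the layer indices to perturb.

    `start` is inclusive, `end` is exclusive, and end=None means
    "through the last layer", mirroring the table above.
    """
    stop = num_layers if end is None else end
    if not 0 <= start <= stop <= num_layers:
        raise ValueError(f"invalid layer range [{start}, {stop}) for {num_layers} layers")
    return list(range(start, stop))

NUM_LAYERS = 28  # assumption: inferred from "All layers (0-27)"

all_layers = resolve_layer_range(0, None, NUM_LAYERS)   # defaults: every layer
late_layers = resolve_layer_range(23, 28, NUM_LAYERS)   # layers 23, 24, 25, 26, 27
```

If the ablation's "23-27" label is read as inclusive on both ends, it corresponds to a start of 23 and an end of 28 under this convention. That mapping is an inference from the table, not something the README states.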
Enabling Learnable Coefficients
To use learnable perturbation coefficients, add these fields to the model's config.json before training:
{
  "use_perturbation": true,
  "coef_learnable": true,
  "perturb_std": 1e-2
}
Noise Seed Mechanism
The perturbation patch uses a stateless seeded Generator to ensure gradient-checkpointing correctness. Before every forward pass, a deterministic seed is set on each decoder layer (layer._noise_seed). During the forward pass, a local torch.Generator is created with seed = _noise_seed + layer_idx, producing identical noise on both the original forward and gradient-checkpoint recomputation. This guarantees correct gradients when enable_gradient_checkpointing=True.
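The reason a stateless, locally seeded generator reproduces the same noise on recomputation can be shown with a small sketch. To stay dependency-free it uses the standard library's `random.Random` as a stand-in for `torch.Generator`. `ToyLayer` and its `noise` method are hypothetical. Only the `_noise_seed` attribute and the `seed = _noise_seed + layer_idx` rule come from the description above.

```python
import random

class ToyLayer:
    """Hypothetical stand-in for a decoder layer carrying a _noise_seed attribute."""

    def __init__(self, layer_idx):
        self.layer_idx = layer_idx
        self._noise_seed = 0

    def noise(self, n):
        # A fresh local generator on every call: no global RNG state is
        # read or advanced, so the output depends only on the seed.
        gen = random.Random(self._noise_seed + self.layer_idx)
        return [gen.gauss(0.0, 1.0) for _ in range(n)]

layers = [ToyLayer(i) for i in range(3)]

step_seed = 1234
for layer in layers:
    layer._noise_seed = step_seed  # set before the forward pass

first = [layer.noise(4) for layer in layers]       # original forward
random.random()                                    # unrelated global RNG use in between
recomputed = [layer.noise(4) for layer in layers]  # checkpoint-style recomputation
```

Because each call rebuilds its generator from the stored seed, `first` and `recomputed` match exactly even though the global RNG was used in between. Noise drawn from a shared, stateful generator would differ between the two passes.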
For learnable coefficients, the noise injection is additionally wrapped in torch.utils.checkpoint.checkpoint() to avoid storing full-size noise activations while still computing gradients for the coefficient.
---
Multi-Turn Experiments
Multi-turn training uses Qwen2.5-7B and requires a sandbox service for code execution during rollout, as the agent interleaves natural language reasoning with executable code.
See [multi-turn/README.md](multi-turn/README.md) for full details. Key highlights:
Prerequisites
- 8 H100 GPUs recommended
- Docker with GPU support (NVIDIA Container Toolkit)
- Sandbox service for code execution
Build and Start Container
cd multi-turn
docker build -t verl_sandbox -f docker/Dockerfile.simpletir .
Start Sandbox Service…