Repo, Amazon (Nova), published May 14, 2026

amazon-science/adaptive-layerwise-perturbation

Language: Python

License: Apache-2.0

Stars: 1

Forks: 0

Open issues: 3

Created: 2026-05-14T17:44:17Z

Pushed: 2026-05-14T19:08:21Z

Default branch: main

Fork: no

Archived: no

README:

Adaptive Layerwise Perturbation (ALP)

Chenlu Ye\*, Xuanchang Zhang\*, Yifan Hao, Zhou Yu, Ziji Zhang, Abhinav Gullapalli, Hao Chen, Jing Huang, Tong Zhang

University of Illinois Urbana-Champaign, Amazon

---

Introduction

Policy staleness and training-inference mismatch are key challenges in LLM reinforcement learning. Modern RL pipelines use separate systems for rollout generation (e.g., BF16 vLLM) and policy training (e.g., FP32 FSDP), introducing distributional gaps between the behavior policy and the training policy. These gaps destabilize training through inflated importance sampling ratios and noisy gradient estimates.

Adaptive Layerwise Perturbation (ALP) addresses this by injecting learnable Gaussian perturbations into transformer hidden states across all layers during policy updates. The perturbed policy serves as the importance sampling numerator against the unperturbed inference policy. Because noise injection flattens the policy landscape, ALP damps the heavy tails of the IS ratio and keeps training stable.
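As a rough illustration of the perturbation step, the sketch below adds Gaussian noise to a hidden-state vector, with a per-layer scale kept positive by storing its logarithm. It uses plain Python lists in place of tensors and omits the decoder layers themselves; the names `perturb_hidden` and `log_sigmas` are hypothetical and are not identifiers from this repository.

```python
import math
import random


def perturb_hidden(hidden, log_sigma, rng):
    """Return hidden + sigma * eps, with eps ~ N(0, I) and sigma = exp(log_sigma)."""
    sigma = math.exp(log_sigma)  # exp keeps the scale strictly positive
    return [h + sigma * rng.gauss(0.0, 1.0) for h in hidden]


rng = random.Random(0)
hidden = [0.5, -1.2, 3.0]

# One hypothetical log-scale per layer, each starting from sigma_0 = 1e-6.
log_sigmas = [math.log(1e-6)] * 4

# A real model would run a decoder layer between perturbations;
# here only the noise injection is shown.
for log_sigma in log_sigmas:
    hidden = perturb_hidden(hidden, log_sigma, rng)

print(hidden)
```

With such a small initial scale, the perturbed vector stays very close to the original; in the learnable setting the scale would then be adjusted during training.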

This repository contains experiments for both multi-turn tool-integrated reasoning and single-turn RL settings.

Repository Structure

adaptive-layerwise-perturbation/
├── README.md          # This file
├── multi-turn/        # Multi-turn tool-integrated reasoning experiments (Qwen2.5-7B)
│   ├── datasets/
│   ├── eval/
│   ├── figures/
│   ├── recipe/
│   ├── sandbox/
│   ├── scripts/
│   ├── sft/
│   └── ...
└── single-turn/       # Single-turn RL experiments (verl-based)
    ├── run_scripts/
    ├── scripts/
    └── ...

---

Method

This codebase implements four rollout-correction strategies for LLM-RL:

  • GSPO (Baseline): Group Sequence Policy Optimization with no mismatch correction. It applies the standard clipped importance ratio at the token level.
  • Seq-Bypass: Uses rollout (vLLM) log-probabilities directly as old_log_probs in the loss denominator, bypassing the reference policy evaluation.
  • MIS/TIS (Masked / Truncated Importance Sampling): Computes an auxiliary IS ratio between the FSDP training policy and the vLLM rollout policy. Outlier ratios are masked (MIS) or truncated (TIS) to stabilize training.
  • ALP (Adaptive Layerwise Perturbation): Injects learnable Gaussian perturbations $\delta \sim \mathcal{N}(0, \sigma^2 I)$ into transformer hidden states across all layers during policy updates. The perturbed policy serves as the IS numerator. The learnable $\sigma$ is a scalar coefficient per layer.
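To make the difference between masking and truncation concrete, here is a minimal sketch of both corrections applied to per-token IS ratios. It is not the repository's implementation: the function names and the thresholds (`cap=2.0`, `low=0.5`, `high=2.0`) are assumptions chosen for illustration, and plain lists stand in for tensors.

```python
import math


def token_is_ratios(train_logprobs, rollout_logprobs):
    """Per-token ratio pi_train / pi_rollout computed from log-probabilities."""
    return [math.exp(t - r) for t, r in zip(train_logprobs, rollout_logprobs)]


def truncate_ratios(ratios, cap):
    """TIS-style correction: clip each ratio from above at `cap`."""
    return [min(r, cap) for r in ratios]


def mask_ratios(ratios, low, high):
    """MIS-style correction: zero the weight of tokens whose ratio leaves [low, high]."""
    return [r if low <= r <= high else 0.0 for r in ratios]


train_lp = [-1.00, -2.00, -0.50]    # log-probs under the training policy
rollout_lp = [-1.00, -3.00, -0.50]  # log-probs under the rollout policy
ratios = token_is_ratios(train_lp, rollout_lp)  # middle token has ratio e

print(truncate_ratios(ratios, cap=2.0))        # [1.0, 2.0, 1.0]
print(mask_ratios(ratios, low=0.5, high=2.0))  # [1.0, 0.0, 1.0]
```

Truncation keeps the outlier token in the loss with a bounded weight, while masking drops it entirely.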

---

Results

Multi-Turn Tool-Integrated Reasoning (Qwen2.5-7B)

| Method | Average Score |
|--------|---------------|
| Seq-ALP | 50.53 |
| Token-ALP | 49.62 |
| Token-MIS | 48.74 |
| Seq-MIS | 46.94 |
| Seq-Bypass | 46.66 |
| GSPO (baseline) | 46.57 |

Ablation: Layer Range for ALP

| Layer Range | Score |
|-------------|-------|
| All layers (0-27) | 50.53 |
| Late layers (23-27) | 48.66 |
| Middle layers (12-17) | 48.51 |
| Early layers (0-5) | 48.25 |

All-layer perturbation outperforms every partial-layer variant by roughly 1.9 to 2.3 points, suggesting that mismatch correction benefits from noise distributed across the full transformer stack.

---

ALP Configuration

Key Parameters

| Parameter | Config Key | Description | Default |
|-----------|-----------|-------------|---------|
| USE_PERTURBATION | actor_rollout_ref.actor.use_perturbation | Enable/disable ALP perturbation | True |
| PERTURB_STD | actor_rollout_ref.actor.perturb_std | Initial standard deviation $\sigma_0$ for Gaussian noise. The noise scale is stored as $\log \sigma$ (initialized to $\log \sigma_0$) and applied as $\exp(\log \sigma)$, so it stays positive during optimization. | 1e-6 |
| coef_learnable | coef_learnable (in model config.json) | If True, the per-layer noise coefficient $\sigma_l$ is a learnable nn.Parameter updated via gradient descent. If False, $\sigma_l$ is fixed at perturb_std. | True |
| PERTURB_LR | actor_rollout_ref.actor.perturb_lr | Learning rate for the learnable perturbation coefficients (only used when coef_learnable=True) | 5e-4 |
| PERTURB_START_LAYER | actor_rollout_ref.actor.perturb_start_layer | Start layer index for perturbation (inclusive) | 0 |
| PERTURB_END_LAYER | actor_rollout_ref.actor.perturb_end_layer | End layer index for perturbation (exclusive). null means through the last layer. | null |
| PERTURB_PATCH | env PERTURB_PATCH | Transformer monkey-patch for noise injection. Options: qwen2 (Qwen2/2.5), qwen3, llama (LLaMA 3.x) | qwen2 |
| LOSS_MODE | actor_rollout_ref.actor.policy_loss.loss_mode | Loss aggregation: token (token-level ALP), sequence (sequence-level ALP), vanilla, cum-token | sequence |
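The start/end semantics above (inclusive start, exclusive end, null meaning "through the last layer") can be sketched as follows. `resolve_perturb_layers` is a hypothetical helper written for this illustration, not a function from the codebase.

```python
def resolve_perturb_layers(num_layers, start_layer=0, end_layer=None):
    """Indices of perturbed layers: start inclusive, end exclusive, None = last layer."""
    end = num_layers if end_layer is None else end_layer
    if not 0 <= start_layer <= end <= num_layers:
        raise ValueError(
            f"invalid layer range [{start_layer}, {end}) for {num_layers} layers"
        )
    return list(range(start_layer, end))


# A 28-layer model (layers 0-27) with the defaults perturbs every layer.
print(len(resolve_perturb_layers(28)))             # 28
# Perturbing layers 23-27 needs end_layer=28 or None, because the end is exclusive.
print(resolve_perturb_layers(28, start_layer=23))  # [23, 24, 25, 26, 27]
```

Note that the ablation table writes ranges inclusively (for example 12-17), which under this convention would correspond to a start of 12 and an end of 18.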

Enabling Learnable Coefficients

To use learnable perturbation coefficients, add these fields to the model's config.json before training:

{
  "use_perturbation": true,
  "coef_learnable": true,
  "perturb_std": 1e-2
}
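If you prefer to patch the file programmatically, a small script along these lines should work. It is a sketch using only the standard library; `enable_learnable_alp` is a hypothetical helper name, and the path you pass depends on where your model checkpoint lives.

```python
import json
from pathlib import Path


def enable_learnable_alp(config_path, perturb_std=1e-2):
    """Add the three ALP fields to an existing config.json, keeping all other keys."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config.update(
        {
            "use_perturbation": True,
            "coef_learnable": True,
            "perturb_std": perturb_std,
        }
    )
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Call it once on the model's config.json before launching training; existing keys are left untouched and the three fields are overwritten if already present.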

Noise Seed Mechanism

The perturbation patch uses a stateless seeded Generator to ensure gradient-checkpointing correctness. Before every forward pass, a deterministic seed is set on each decoder layer (layer._noise_seed). During the forward pass, a local torch.Generator is created with seed = _noise_seed + layer_idx, producing identical noise on both the original forward and gradient-checkpoint recomputation. This guarantees correct gradients when enable_gradient_checkpointing=True.

For learnable coefficients, the noise injection is additionally wrapped in torch.utils.checkpoint.checkpoint() to avoid storing full-size noise activations while still computing gradients for the coefficient.
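The seeding scheme described above can be illustrated with a small stand-in. This sketch uses Python's `random.Random` rather than `torch.Generator` so that it runs without PyTorch; the function name `layer_noise` and the seed value are assumptions for illustration only.

```python
import random


def layer_noise(noise_seed, layer_idx, size):
    """Draw Gaussian noise from a fresh generator seeded with noise_seed + layer_idx."""
    rng = random.Random(noise_seed + layer_idx)  # local generator, no global state
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


noise_seed = 1234  # hypothetical seed set on the layers before the forward pass
first_pass = layer_noise(noise_seed, layer_idx=3, size=4)
recompute = layer_noise(noise_seed, layer_idx=3, size=4)  # checkpoint recomputation
other_layer = layer_noise(noise_seed, layer_idx=4, size=4)

print(first_pass == recompute)    # True: same seed, same noise
print(first_pass == other_layer)  # False: a different layer gets a different stream
```

Because the generator is rebuilt from the seed on every call, the recomputed forward pass sees exactly the noise used in the original pass, which is the property gradient checkpointing relies on.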

---

Multi-Turn Experiments

Multi-turn training uses Qwen2.5-7B and requires a sandbox service for code execution during rollout, as the agent interleaves natural language reasoning with executable code.

See [multi-turn/README.md](multi-turn/README.md) for full details. Key highlights:

Prerequisites

Build and Start Container

cd multi-turn
docker build -t verl_sandbox -f docker/Dockerfile.simpletir .

Start Sandbox Service…
