Model · Together AI · published Feb 3, 2026 · seen 5d

togethercomputer/Aurora-Spec-Qwen3-Coder-Next-FP8

published Feb 3, 2026 · seen 5d · captured 9h · http 200 · method plain · task text-generation · license apache-2.0 · params 519M · downloads 124 · likes 20

Aurora-Spec-Qwen3-Coder-Next-FP8

Model Description

This is an EAGLE3 draft model trained from scratch (random initialization) using the Aurora inference-time training framework for speculative decoding. Unlike traditional approaches that fine-tune pre-trained models, this model is built entirely through Aurora's online training process. The model is optimized to generate high-quality draft tokens for the Qwen/Qwen3-Coder-Next-FP8 target model, reaching up to 1.51× end-to-end speedup on code generation at batch size 1 in the reported benchmarks.

Key Features

  • Training Approach: Trained from scratch (random initialization) - no pre-training required
  • Framework: Trained with Aurora - an advanced inference-time training system
  • Architecture: EAGLE3 speculative decoding draft model
  • Target Model: Qwen/Qwen3-Coder-Next-FP8
  • Training Data: OnlineSD Code Dataset
  • Performance: Achieves an average accept length of about 3.1 tokens for speculative decoding (3.06 to 3.07 at lookahead 5 in the reported benchmarks)
  • Training: 10,000 training steps over 80,000 inference requests

Target Model

This draft model is specifically designed to work with:

  • Model: Qwen/Qwen3-Coder-Next-FP8
  • Type: Code generation language model
  • Precision: FP8 quantized
  • Domain: Programming and code synthesis

The draft model learns to predict the target model's token distribution during inference-time training, enabling efficient speculative decoding.

Architecture

EAGLE3 Speculative Decoding

This model implements the EAGLE3 (Extrapolation Algorithm for Greater Language-model Efficiency) architecture:

  • Draft Model: Lightweight model that generates candidate tokens
  • Tree-based Attention: Enables parallel verification of multiple draft tokens
  • Auto-regressive Generation: Produces speculative token sequences
  • Dynamic Adaptation: Updates during inference to match target model distribution
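The draft-then-verify loop described above can be illustrated with a minimal sketch. It assumes greedy decoding and a single linear chain of draft tokens, not the tree-based verification EAGLE3 uses, and it calls the target once per position for readability where a real system would verify all positions in one forward pass. The functions `toy_target`, `toy_draft`, and `speculative_step` are hypothetical names for illustration and are not part of Aurora or any library.

```python
from typing import Callable, List, Sequence

Token = int
NextToken = Callable[[Sequence[Token]], Token]


def speculative_step(
    context: Sequence[Token],
    draft_next: NextToken,
    target_next: NextToken,
    lookahead: int,
) -> List[Token]:
    """Run one greedy draft-and-verify step and return the emitted tokens."""
    # 1) The draft model proposes `lookahead` tokens auto-regressively.
    proposal: List[Token] = []
    for _ in range(lookahead):
        proposal.append(draft_next(list(context) + proposal))

    # 2) The target checks each proposed token in order. The first mismatch
    #    is replaced by the target's own token and the step ends.
    emitted: List[Token] = []
    for tok in proposal:
        expected = target_next(list(context) + emitted)
        if tok != expected:
            emitted.append(expected)
            return emitted
        emitted.append(tok)

    # 3) Every draft token matched, so the target contributes one extra token.
    emitted.append(target_next(list(context) + emitted))
    return emitted


def toy_target(ctx: Sequence[Token]) -> Token:
    """Toy target: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: Sequence[Token]) -> Token:
    """Toy draft: agrees with the target except when the target would say 7."""
    nxt = (ctx[-1] + 1) % 10
    return 0 if nxt == 7 else nxt


partial_accept = speculative_step([3], toy_draft, toy_target, lookahead=5)
full_accept = speculative_step([7], toy_draft, toy_target, lookahead=3)
```

In this toy setup the first call stops at the draft's first wrong guess, and the second accepts all three draft tokens plus one token from the target. Whether the accept length reported in this card counts that extra target token is not stated in the source.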

Model Structure

  • Initialization: Trained from scratch (random initialization, no pre-training)
  • Base Architecture: Single-layer Transformer decoder
  • Precision: FP8 (8-bit floating point)
  • Speculative Steps: 5 tokens per iteration
  • Attention Mechanism: Tree-based for parallel draft verification
  • Training Paradigm: Online learning during inference (Aurora framework)

Training Details

Aurora Framework

This model was trained from scratch using Aurora, an inference-time training framework that:

  • No Pre-training Required: Starts from random initialization and learns entirely through online training
  • Updates the draft model dynamically during inference
  • Uses reverse KL divergence for distribution matching (minimizing KL(draft || target))
  • Employs online learning with periodic model updates
  • Optimizes for both draft quality and speculative acceptance rate
  • Demonstrates that effective draft models can be built from scratch without expensive pre-training
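To make the distribution-matching objective concrete, here is a small sketch of KL divergence over a two-token vocabulary. It assumes the conventional reading of reverse KL as KL(draft || target), with the expectation taken under the draft distribution. The probabilities are made up for illustration, and `kl_divergence` is a hypothetical helper, not Aurora's loss implementation.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats for two discrete distributions on the same support."""
    # Terms with p_i == 0 contribute nothing, by the usual convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Illustrative next-token distributions (assumed values, not measured data).
target = [0.5, 0.5]
draft = [0.9, 0.1]

reverse_kl = kl_divergence(draft, target)  # expectation under the draft
forward_kl = kl_divergence(target, draft)  # expectation under the target

print(f"reverse KL(draft || target) = {reverse_kl:.4f}")
print(f"forward KL(target || draft) = {forward_kl:.4f}")
```

The two directions give different values for the same pair of distributions, which is why the direction of the divergence matters when describing the training objective.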

Training Configuration

  • Hardware: NVIDIA H200 GPU
  • Training Steps: 10,000 steps over 80,000 inference requests
  • Learning Rate: 1e-4
  • TTT Length: 5 tokens
  • Speculative Steps: 5
  • Update Interval: Every 10 requests
  • Loss Weights:
      • NTP Loss: 1.0
      • Prediction Loss: 1.0
  • KL Divergence: Reverse KL divergence (draft → target)
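The "Update Interval: Every 10 requests" setting can be pictured as a buffer that triggers a draft-model update each time it fills. The sketch below is only an illustration of that cadence under assumed names: `UpdateScheduler`, `on_request_finished`, and `train_fn` are hypothetical and do not come from Aurora.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class UpdateScheduler:
    """Collects finished requests and triggers a training update per interval."""

    train_fn: Callable[[List[dict]], None]  # hypothetical training callback
    update_interval: int = 10  # "Update Interval: Every 10 requests"
    _buffer: List[dict] = field(default_factory=list)
    updates_done: int = 0

    def on_request_finished(self, sample: dict) -> bool:
        """Record one served request; return True if an update was triggered."""
        self._buffer.append(sample)
        if len(self._buffer) < self.update_interval:
            return False
        # Hand a copy of the buffered samples to the trainer, then reset.
        self.train_fn(list(self._buffer))
        self._buffer.clear()
        self.updates_done += 1
        return True


# Stand-in trainer that only records the batches it receives.
seen_batches: List[List[dict]] = []
scheduler = UpdateScheduler(train_fn=seen_batches.append)

for request_id in range(35):
    scheduler.on_request_finished({"request_id": request_id})

print(scheduler.updates_done)
```

At one update per 10 requests, 80,000 requests would correspond to 8,000 update rounds. The card does not say how those rounds map onto the 10,000 reported training steps.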

Dataset

Trained on the OnlineSD Code Dataset, which contains diverse coding examples suitable for training speculative decoding models.

Benchmarks

End-to-End Throughput Performance

Measured on a holdout dataset from the OnlineSD Code Dataset using the final Aurora checkpoint.

Qwen3-Coder-Next: end-to-end throughput under varying batch size and lookahead

We report tokens-per-second (TPS) statistics and speedup relative to the no-speculation baseline.

| BS | Config | Mean TPS | P50 TPS | P05 TPS | P95 TPS | Speedup (Mean) | Acc Len |
|:---:|:---------|:--------:|:-------:|:-------:|:-------:|:--------------:|:-------:|
| 1 | w/o spec | 176.4 | 178.0 | 172.3 | 178.4 | -- | -- |
| | lookahead 3 | 252.1 | 254.8 | 208.8 | 291.6 | 1.43× | 2.67 |
| | lookahead 4 | 263.1 | 264.0 | 211.8 | 312.7 | 1.49× | 2.91 |
| | lookahead 5 | 265.7 | 264.8 | 208.7 | 320.5 | 1.51× | 3.06 |
| 8 | w/o spec | 119.8 | 121.5 | 104.8 | 134.6 | -- | -- |
| | lookahead 3 | 141.0 | 138.9 | 110.4 | 178.5 | 1.18× | 2.67 |
| | lookahead 4 | 142.5 | 141.2 | 110.3 | 181.6 | 1.19× | 2.91 |
| | lookahead 5 | 146.3 | 143.5 | 109.6 | 189.5 | 1.23× | 3.07 |
| 16 | w/o spec | 99.6 | 102.1 | 74.5 | 119.2 | -- | -- |
| | lookahead 3 | 104.0 | 100.5 | 75.6 | 151.9 | 1.04× | 2.67 |
| | lookahead 4 | 105.6 | 101.1 | 77.5 | 149.7 | 1.06× | 2.92 |
| | lookahead 5 | 107.6 | 103.7 | 75.7 | 156.6 | 1.09× | 3.06 |
| 32 | w/o spec | 85.0 | 88.7 | 54.5 | 104.5 | -- | -- |
| | lookahead 3 | 78.9 | 72.8 | 53.0 | 122.3 | 0.93× | 2.68 |
| | lookahead 4 | 79.5 | 73.7 | 52.9 | 124.7 | 0.94× | 2.91 |
| | lookahead 5 | 80.3 | 72.6 | 52.8 | 130.7 | 0.94× | 3.06 |
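The Speedup (Mean) column appears to be the ratio of each configuration's mean TPS to the no-speculation mean at the same batch size. The short sketch below recomputes that ratio from the rounded Mean TPS values in the table. The dictionary layout and the `mean_speedups` helper are illustrative assumptions, not part of the original benchmark code.

```python
from typing import Dict

# Mean TPS values copied from the table, keyed by batch size and configuration.
MEAN_TPS: Dict[int, Dict[str, float]] = {
    1: {"w/o spec": 176.4, "lookahead 3": 252.1, "lookahead 4": 263.1, "lookahead 5": 265.7},
    8: {"w/o spec": 119.8, "lookahead 3": 141.0, "lookahead 4": 142.5, "lookahead 5": 146.3},
    16: {"w/o spec": 99.6, "lookahead 3": 104.0, "lookahead 4": 105.6, "lookahead 5": 107.6},
    32: {"w/o spec": 85.0, "lookahead 3": 78.9, "lookahead 4": 79.5, "lookahead 5": 80.3},
}


def mean_speedups(rows: Dict[str, float]) -> Dict[str, float]:
    """Speedup of each speculative config over the no-speculation baseline."""
    baseline = rows["w/o spec"]
    return {cfg: tps / baseline for cfg, tps in rows.items() if cfg != "w/o spec"}


SPEEDUPS = {bs: mean_speedups(rows) for bs, rows in MEAN_TPS.items()}

for bs, by_config in SPEEDUPS.items():
    for cfg, ratio in by_config.items():
        print(f"BS {bs:>2}  {cfg}: {ratio:.2f}x")
```

Recomputed this way, most entries match the table. Two land 0.01 lower (1.22 and 1.08 for lookahead 5 at batch sizes 8 and 16, against the reported 1.23× and 1.09×), presumably because the reported values were computed from unrounded throughput.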

Performance Across Different Batch Sizes

Aurora provides the largest gains at small-to-moderate batch sizes, with up to 1.51× speedup at batch size 1, demonstrating the effectiveness of speculative decoding for latency-critical scenarios. The benefits diminish as batch size increases:

  • Batch Size 1 (Best Case): Up to 1.51× speedup with lookahead 5 configuration (3.06 average accept length). At low batch sizes, the cost of draft generation and verification is well amortized by reduced target model forward passes.
  • Batch Size 8 (Moderate): 1.23× speedup with lookahead 5 configuration (3.07 average accept length). Speculative decoding still provides meaningful throughput improvements for moderate batching.
  • Batch Size 16 (Diminishing Returns): 1.09× speedup with lookahead 5 configuration (3.06 average accept length). Benefits become marginal as verification overhead increases relative to baseline throughput.
  • Batch Size 32 (Negative Returns): At large batch sizes, verification overhead dominates and speculative decoding becomes slightly slower than the baseline (0.93-0.94×). The…


Notability

notability 5.0/10

Low downloads, but specialized speculative decoding model.