What does this model signal mean?

Together AI published togethercomputer/Aurora-Spec-Qwen3-Coder-Next-FP8. This model signal is evidence of what shipped on model infrastructure and how the release is positioned. High-signal details: license apache-2.0 · 98 HF downloads · low downloads, but a specialized speculative decoding model. onlylabs links this event to 1 captured evidence page and 6 related model signals.

Together AI Model: togethercomputer/Aurora-Spec-Qwen3-Coder-Next-FP8

Captured source

source ↗

Hugging Face: huggingface.co/togethercomputer/Aurora-Spec-Qwen3-Coder-Next-FP8

togethercomputer/Aurora-Spec-Qwen3-Coder-Next-FP8 model card

published Feb 3, 2026 · seen Jun 6 · captured Jun 11 · http 200 · method plain · task text-generation · license apache-2.0 · params 519M · downloads 98 · likes 20

Aurora-Spec-Qwen3-Coder-Next-FP8

Model Description

This is an EAGLE3 draft model trained from scratch (random initialization) using the Aurora inference-time training framework for speculative decoding. Unlike traditional approaches that fine-tune pre-trained models, this model is built entirely through Aurora's online training process. The model is optimized to generate high-quality draft tokens for the Qwen/Qwen3-Coder-Next-FP8 target model, achieving significant speedups in code generation tasks.

Key Features

Training Approach: Trained from scratch (random initialization) - no pre-training required
Framework: Trained with Aurora - an advanced inference-time training system
Architecture: EAGLE3 speculative decoding draft model
Target Model: Qwen/Qwen3-Coder-Next-FP8
Training Data: OnlineSD Code Dataset
Performance: Achieves an average accept length of about 3.1 tokens for speculative decoding (3.06 at lookahead 5 in the benchmarks)
Training: 10,000 training steps over 80,000 inference requests

Target Model

This draft model is specifically designed to work with:

Model: Qwen/Qwen3-Coder-Next-FP8
Type: Code generation language model
Precision: FP8 quantized
Domain: Programming and code synthesis

The draft model learns to predict the target model's token distribution during inference-time training, enabling efficient speculative decoding.

Architecture

EAGLE3 Speculative Decoding

This model implements the EAGLE3 (Extrapolation Algorithm for Greater Language-model Efficiency) architecture:

Draft Model: Lightweight model that generates candidate tokens
Tree-based Attention: Enables parallel verification of multiple draft tokens
Auto-regressive Generation: Produces speculative token sequences
Dynamic Adaptation: Updates during inference to match target model distribution
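
To make the draft-then-verify loop concrete, here is a minimal sketch of one speculative decoding iteration. It is a simplification, not EAGLE3 itself: it uses a single chain of draft tokens instead of a token tree, greedy matching instead of probabilistic acceptance, and toy stand-in models. The names `speculative_step`, `target_model`, and `draft_model` are hypothetical and do not come from the model card.

```python
from typing import Callable, List, Sequence

# Hypothetical type: a "model" here is any function mapping a context to its greedy next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], lookahead: int) -> List[int]:
    """Run one draft-then-verify iteration and return the tokens emitted."""
    # 1. The draft model proposes `lookahead` tokens auto-regressively.
    proposed: List[int] = []
    for _ in range(lookahead):
        proposed.append(draft(context + proposed))

    # 2. The target checks each position. A real system scores all positions
    #    in one batched forward pass; this toy calls the target per position.
    emitted: List[int] = []
    for token in proposed:
        expected = target(context + emitted)
        if token != expected:
            emitted.append(expected)  # replace the first mismatch and stop
            return emitted
        emitted.append(token)

    # 3. Every draft token matched, so the target contributes one bonus token.
    emitted.append(target(context + emitted))
    return emitted


def target_model(ctx: Sequence[int]) -> int:
    """Toy target: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_model(ctx: Sequence[int]) -> int:
    """Toy draft: agrees with the target except right after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


partial = speculative_step(target_model, draft_model, [1], lookahead=5)
full = speculative_step(target_model, draft_model, [5], lookahead=3)
print(partial, full)
```

In this toy, the number of tokens returned per call plays the role of the accept length: the closer the draft tracks the target, the more tokens each target verification yields.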

Model Structure

Initialization: Trained from scratch (random initialization, no pre-training)
Base Architecture: Single-layer Transformer decoder
Precision: FP8 (8-bit floating point)
Speculative Steps: 5 tokens per iteration
Attention Mechanism: Tree-based for parallel draft verification
Training Paradigm: Online learning during inference (Aurora framework)

Training Details

Aurora Framework

This model was trained from scratch using Aurora, an inference-time training framework that:

Requires no pre-training: starts from random initialization and learns entirely through online training
Updates the draft model dynamically during inference
Uses reverse KL divergence for distribution matching (minimizing KL(target || draft))
Employs online learning with periodic model updates
Optimizes for both draft quality and speculative acceptance rate
Demonstrates that effective draft models can be built from scratch without expensive pre-training
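
KL divergence is asymmetric, so the direction matters when matching a draft distribution to a target distribution. The card names the objective "reverse KL" and writes it as KL(target || draft) in this list and as "draft → target" under Training Configuration. The sketch below simply computes both directions on made-up next-token distributions to show that they differ; it makes no claim about which one Aurora implements. The distributions and the helper name `kl_divergence` are illustrative assumptions.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) = sum_i p_i * log(p_i / q_i), in nats. Terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Made-up next-token distributions over a 3-token vocabulary (not from the card).
target = [0.7, 0.2, 0.1]
draft = [0.5, 0.3, 0.2]

kl_target_draft = kl_divergence(target, draft)  # expectation taken under the target
kl_draft_target = kl_divergence(draft, target)  # expectation taken under the draft

print(f"KL(target || draft) = {kl_target_draft:.4f}")
print(f"KL(draft || target) = {kl_draft_target:.4f}")
```

Both quantities are zero only when the two distributions coincide, but they penalize mismatches differently: the expectation is weighted by whichever distribution appears first.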

Training Configuration

Hardware: NVIDIA H200 GPU
Training Steps: 10,000 steps over 80,000 inference requests
Learning Rate: 1e-4
TTT Length: 5 tokens
Speculative Steps: 5
Update Interval: Every 10 requests
Loss Weights: NTP Loss 1.0, Prediction Loss 1.0
KL Divergence: Reverse KL divergence (draft → target)
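
The "Update Interval: Every 10 requests" setting can be pictured as a small scheduler that buffers served requests and triggers a draft-model update once enough have accumulated. The sketch below is a hypothetical illustration of that trigger only: the class `OnlineUpdateScheduler`, its fields, and the callback are assumptions, not Aurora's API, and the excerpt does not specify how the 10,000 training steps map onto the 80,000 requests.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class OnlineUpdateScheduler:
    """Hypothetical helper: buffer served requests, trigger an update every N of them."""
    update_fn: Callable[[List[Any]], None]  # stand-in for a draft-model training call
    update_interval: int = 10               # from the card: update every 10 requests
    buffer: List[Any] = field(default_factory=list)
    updates_done: int = 0

    def record(self, request_trace: Any) -> bool:
        """Store one served request; return True if this call triggered an update."""
        self.buffer.append(request_trace)
        if len(self.buffer) < self.update_interval:
            return False
        self.update_fn(self.buffer)
        self.buffer = []  # start a fresh buffer; the list passed to update_fn is untouched
        self.updates_done += 1
        return True


# Usage: the callback here only records how many traces each update received.
batch_sizes: List[int] = []
scheduler = OnlineUpdateScheduler(update_fn=lambda traces: batch_sizes.append(len(traces)))
for request_id in range(25):
    scheduler.record(request_id)

print(scheduler.updates_done, batch_sizes, len(scheduler.buffer))
```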

Dataset

Trained on the OnlineSD Code Dataset, which contains diverse coding examples suitable for training speculative decoding models.

Benchmarks

End-to-End Throughput Performance

Measured on a holdout dataset from the OnlineSD Code Dataset using the final Aurora checkpoint.

Qwen-Coder-Next: end-to-end throughput under varying batch size and lookahead

We report tokens-per-second (TPS) statistics and speedup relative to the no-speculation baseline.

| BS | Config | Mean TPS | P50 TPS | P05 TPS | P95 TPS | Speedup (Mean) | Acc Len |
|:---:|:---------|:--------:|:-------:|:-------:|:-------:|:--------------:|:-------:|
| 1 | w/o spec | 176.4 | 178.0 | 172.3 | 178.4 | -- | -- |
| | lookahead 3 | 252.1 | 254.8 | 208.8 | 291.6 | 1.43× | 2.67 |
| | lookahead 4 | 263.1 | 264.0 | 211.8 | 312.7 | 1.49× | 2.91 |
| | lookahead 5 | 265.7 | 264.8 | 208.7 | 320.5 | 1.51× | 3.06 |
| 8 | w/o spec | 119.8 | 121.5 | 104.8 | 134.6 | -- | -- |
| | lookahead 3 | 141.0 | 138.9 | 110.4 | 178.5 | 1.18× | 2.67 |
| | lookahead 4 | 142.5 | 141.2 | 110.3 | 181.6 | 1.19× | 2.91 |
| | lookahead 5 | 146.3 | 143.5 | 109.6 | 189.5 | 1.23× | 3.07 |
| 16 | w/o spec | 99.6 | 102.1 | 74.5 | 119.2 | -- | -- |
| | lookahead 3 | 104.0 | 100.5 | 75.6 | 151.9 | 1.04× | 2.67 |
| | lookahead 4 | 105.6 | 101.1 | 77.5 | 149.7 | 1.06× | 2.92 |
| | lookahead 5 | 107.6 | 103.7 | 75.7 | 156.6 | 1.09× | 3.06 |
| 32 | w/o spec | 85.0 | 88.7 | 54.5 | 104.5 | -- | -- |
| | lookahead 3 | 78.9 | 72.8 | 53.0 | 122.3 | 0.93× | 2.68 |
| | lookahead 4 | 79.5 | 73.7 | 52.9 | 124.7 | 0.94× | 2.91 |
| | lookahead 5 | 80.3 | 72.6 | 52.8 | 130.7 | 0.94× | 3.06 |
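
The Speedup (Mean) column can be cross-checked by dividing each configuration's mean TPS by the no-speculation mean TPS at the same batch size. The sketch below does that with the values from the table; the recomputed ratios agree with the reported column to within about 0.01, which is consistent with rounding. It also derives a rough per-iteration cost under one explicit assumption that the card does not state: that Acc Len is the average number of tokens emitted per draft-then-verify iteration. The function names are illustrative.

```python
from typing import Dict

# Mean TPS copied from the benchmark table: {batch_size: {config: mean_tps}}.
MEAN_TPS: Dict[int, Dict[str, float]] = {
    1: {"w/o spec": 176.4, "lookahead 3": 252.1, "lookahead 4": 263.1, "lookahead 5": 265.7},
    8: {"w/o spec": 119.8, "lookahead 3": 141.0, "lookahead 4": 142.5, "lookahead 5": 146.3},
    16: {"w/o spec": 99.6, "lookahead 3": 104.0, "lookahead 4": 105.6, "lookahead 5": 107.6},
    32: {"w/o spec": 85.0, "lookahead 3": 78.9, "lookahead 4": 79.5, "lookahead 5": 80.3},
}
# Acc Len for the lookahead 5 rows, also copied from the table.
ACC_LEN_LOOKAHEAD_5: Dict[int, float] = {1: 3.06, 8: 3.07, 16: 3.06, 32: 3.06}


def mean_speedups(mean_tps: Dict[int, Dict[str, float]]) -> Dict[int, Dict[str, float]]:
    """Speedup of each speculative config over the no-speculation baseline."""
    result: Dict[int, Dict[str, float]] = {}
    for batch_size, rows in mean_tps.items():
        baseline = rows["w/o spec"]
        result[batch_size] = {
            config: tps / baseline for config, tps in rows.items() if config != "w/o spec"
        }
    return result


def relative_iteration_cost(speedup: float, accept_length: float) -> float:
    """ASSUMPTION: one iteration emits `accept_length` tokens on average, so its cost
    in units of one baseline decode step is accept_length / speedup."""
    return accept_length / speedup


SPEEDUPS = mean_speedups(MEAN_TPS)
COSTS = {
    bs: relative_iteration_cost(SPEEDUPS[bs]["lookahead 5"], ACC_LEN_LOOKAHEAD_5[bs])
    for bs in MEAN_TPS
}
for bs, rows in SPEEDUPS.items():
    ratios = ", ".join(f"{config} {ratio:.2f}x" for config, ratio in rows.items())
    print(f"BS {bs:>2}: {ratios} | lookahead 5 iteration cost ~{COSTS[bs]:.2f} baseline steps")
```

Under that assumption, the estimated cost of one speculative iteration grows with batch size while the accept length stays near 3.06, which matches the pattern of shrinking and eventually negative speedups in the table.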

Performance Across Different Batch Sizes

Aurora provides the largest gains at small-to-moderate batch sizes, with up to 1.51× speedup at batch size 1, demonstrating the effectiveness of speculative decoding for latency-critical scenarios. The benefits diminish as batch size increases:

Batch Size 1 (Best Case): Up to 1.51× speedup with lookahead 5 configuration (3.06 average accept length). At low batch sizes, the cost of draft generation and verification is well amortized by reduced target model forward passes.

Batch Size 8 (Moderate): 1.23× speedup with lookahead 5 configuration (3.07 average accept length). Speculative decoding still provides meaningful throughput improvements for moderate batching.

Batch Size 16 (Diminishing Returns): 1.09× speedup with lookahead 5 configuration (3.06 average accept length). Benefits become marginal as verification overhead increases relative to baseline throughput.

Batch Size 32 (Negative Returns): At large batch sizes, verification overhead dominates and speculative decoding becomes slightly slower than the baseline (0.93-0.94×). The...

Excerpt shown — open the source for the full document.

Notability

notability 5.0/10

Low downloads, but specialized speculative decoding model.