nvidia/nemotron-3.5-asr-streaming-0.6b
Nemotron 3.5 ASR
> [!Note]
> This model is the multilingual extension of nvidia/nemotron-speech-streaming-en-0.6b, adding language-ID prompt conditioning to support transcription across 40 language-locales from a single model.
Nemotron 3.5 ASR is a multilingual, streaming Automatic Speech Recognition (ASR) model built to deliver high-quality transcription across both low-latency streaming and high-throughput batch workloads. Developed by NVIDIA, this 600M-parameter model transcribes speech into text with native support for punctuation and capitalization. The chunk size is configurable at runtime, with options including 80 ms, 160 ms, 320 ms, 560 ms, and 1120 ms.
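To make the chunk-size options concrete, the sketch below converts each listed duration into audio samples and encoder frames. The 16 kHz sample rate and the 80 ms encoder frame are assumptions that this excerpt does not state, and `chunk_geometry` is a hypothetical helper name, so treat the numbers as illustrative only.

```python
# Chunk sizes listed in the model description, in milliseconds.
CHUNK_SIZES_MS = (80, 160, 320, 560, 1120)

# Assumptions (not stated in this excerpt): 16 kHz mono input audio and an
# 80 ms encoder frame (10 ms features followed by 8x subsampling).
SAMPLE_RATE_HZ = 16_000
ENCODER_FRAME_MS = 80


def chunk_geometry(chunk_ms: int) -> tuple[int, int]:
    """Return (audio samples, encoder frames) covered by one chunk."""
    if chunk_ms <= 0 or chunk_ms % ENCODER_FRAME_MS != 0:
        raise ValueError(f"chunk size must be a positive multiple of {ENCODER_FRAME_MS} ms")
    samples = SAMPLE_RATE_HZ * chunk_ms // 1000
    frames = chunk_ms // ENCODER_FRAME_MS
    return samples, frames


for ms in CHUNK_SIZES_MS:
    samples, frames = chunk_geometry(ms)
    print(f"{ms:>5} ms -> {samples:>6} samples, {frames:>2} encoder frames")
```

Under these assumptions every listed chunk size is a whole number of encoder frames, which is consistent with the options being multiples of 80 ms.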
By leveraging a state-of-the-art Cache-Aware FastConformer-RNNT architecture, the model eliminates redundant overlapping computations common in traditional "buffered" streaming. This allows it to process only new audio chunks while reusing cached encoder context, significantly improving computational efficiency and minimizing end-to-end delay without sacrificing accuracy.
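The efficiency argument can be illustrated with a simple frame count. The sketch below is a toy cost model, not the model's actual inference code: it assumes a buffered scheme that re-encodes a fixed left context at every step, and compares it with a cache-aware scheme that encodes each frame exactly once. The function names and the example sizes are hypothetical.

```python
def buffered_frames(total_frames: int, chunk: int, left_context: int) -> int:
    """Frames encoded when every step re-encodes its left context plus the new chunk."""
    processed = 0
    for start in range(0, total_frames, chunk):
        new = min(chunk, total_frames - start)
        context = min(left_context, start)  # no history exists before the stream starts
        processed += context + new
    return processed


def cache_aware_frames(total_frames: int, chunk: int) -> int:
    """Frames encoded when history is read from a cache: each frame is encoded once."""
    processed = 0
    for start in range(0, total_frames, chunk):
        processed += min(chunk, total_frames - start)
    return processed


# Hypothetical stream: 100 frames, 2-frame chunks, 70 frames of left context.
buffered = buffered_frames(total_frames=100, chunk=2, left_context=70)
cached = cache_aware_frames(total_frames=100, chunk=2)
print(f"buffered: {buffered} frames, cache-aware: {cached} frames")
```

In this toy model the gap widens as chunks get smaller relative to the left context, which is the regime low-latency streaming operates in.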
It was trained on a massive ASR dataset and is engineered to perform across diverse and challenging acoustic conditions.
This model is ready for commercial use.
---
License/Terms of Use
Governing Terms: Use of the model is governed by the OpenMDW-1.1 license.
Deployment Geography
Global
Use Case
This model is for transcription of multilingual audio.
Release Date
- Hugging Face [06/04/2026] via https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b
References
[1] Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition
[2] Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition
[3] NVIDIA Granary
Why Choose Nemotron 3.5 ASR?
- 🌍 Single Multilingual Model: Transcribes 40 language-locales from one model through language-ID prompt conditioning, with optional automatic language detection.
- ⚡ Native Streaming Architecture: Cache-aware design enables efficient processing of continuous audio streams, designed and optimized for low-latency voice agent applications.
- 💰 Improved Operational Efficiency: Delivers superior throughput compared to traditional buffered streaming approaches. This allows for a higher number of parallel streams within the same GPU memory constraints, directly reducing operational costs for production environments.
- 🎛️ Dynamic Runtime Flexibility: Choose the optimal operating point on the latency-accuracy Pareto curve at inference time. No re-training is required to adjust for different use-case requirements.
- 📝 Punctuation & Capitalization: Built-in support for punctuation and capitalization in output text.
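Choosing an operating point at inference time can be sketched as a small selection rule. The example below assumes that larger chunks favor accuracy and smaller chunks favor latency, and it treats chunk duration as a stand-in for latency, ignoring compute and network time. `pick_chunk_size` is a hypothetical helper, not part of any NVIDIA API.

```python
# Chunk sizes listed in the model description, in milliseconds.
CHUNK_SIZES_MS = (80, 160, 320, 560, 1120)


def pick_chunk_size(latency_budget_ms: int) -> int:
    """Return the largest listed chunk size that fits within the latency budget.

    Assumption: a larger chunk gives the encoder more context per step, so the
    largest chunk that fits the budget is the preferred operating point.
    """
    candidates = [c for c in CHUNK_SIZES_MS if c <= latency_budget_ms]
    if not candidates:
        raise ValueError(
            f"budget of {latency_budget_ms} ms is below the smallest chunk "
            f"({min(CHUNK_SIZES_MS)} ms)"
        )
    return max(candidates)


print(pick_chunk_size(200))   # a voice agent with a tight budget
print(pick_chunk_size(2000))  # a batch-style workload with a loose budget
```

Because the choice is only a runtime parameter, the same deployed checkpoint could serve both callers.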
---
Supported Languages
The model supports 40 language-locales in total, across three tiers:
- Transcription-ready (19 locales): highest-accuracy ASR, ready out of the box.
- Broad-coverage (13 locales): production ASR for 13 additional locales.
- Adaptation-ready (8 locales): recognized by the tokenizer; fine-tune on in-domain data to unlock full transcription.
| Tier | Languages (locales) |
| :--- | :--- |
| Transcription-ready (19 locales) | English (en-US, en-GB), Spanish (es-US, es-ES), French (fr-FR, fr-CA), Italian (it-IT), Portuguese (pt-BR, pt-PT), Dutch (nl-NL), German (de-DE), Turkish (tr-TR), Russian (ru-RU), Arabic (ar-AR), Hindi (hi-IN), Japanese (ja-JP), Korean (ko-KR), Vietnamese (vi-VN), Ukrainian (uk-UA) |
| Broad-coverage (13 locales) | Polish (pl-PL), Swedish (sv-SE), Czech (cs-CZ), Norwegian Bokmål (nb-NO), Danish (da-DK), Bulgarian (bg-BG), Finnish (fi-FI), Croatian (hr-HR), Slovak (sk-SK), Mandarin (zh-CN), Hungarian (hu-HU), Romanian (ro-RO), Estonian (et-EE) |
| Adaptation-ready (8 locales) | Greek (el-GR), Lithuanian (lt-LT), Latvian (lv-LV), Maltese (mt-MT), Slovenian (sl-SI), Hebrew (he-IL), Thai (th-TH), Norwegian Nynorsk (nn-NO) |
> Note: Transcription-ready and broad-coverage locales (32 total) produce ASR transcription out of the box; adaptation-ready locales require fine-tuning on in-domain data to enable full transcription. The model supports uppercase and lowercase letters, punctuation, spaces, and apostrophes.
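For routing logic, the tier table can be encoded as a lookup. The sketch below copies the locale codes from the table in this section; the dictionary layout and the helper names `tier_of` and `transcribes_out_of_the_box` are hypothetical and not part of the model's API.

```python
# Locale codes copied from the tier table in this section.
TIERS = {
    "transcription-ready": {
        "en-US", "en-GB", "es-US", "es-ES", "fr-FR", "fr-CA", "it-IT",
        "pt-BR", "pt-PT", "nl-NL", "de-DE", "tr-TR", "ru-RU", "ar-AR",
        "hi-IN", "ja-JP", "ko-KR", "vi-VN", "uk-UA",
    },
    "broad-coverage": {
        "pl-PL", "sv-SE", "cs-CZ", "nb-NO", "da-DK", "bg-BG", "fi-FI",
        "hr-HR", "sk-SK", "zh-CN", "hu-HU", "ro-RO", "et-EE",
    },
    "adaptation-ready": {
        "el-GR", "lt-LT", "lv-LV", "mt-MT", "sl-SI", "he-IL", "th-TH", "nn-NO",
    },
}


def tier_of(locale: str) -> str:
    """Return the tier name for a locale, or raise if the locale is not listed."""
    for tier, locales in TIERS.items():
        if locale in locales:
            return tier
    raise KeyError(f"unsupported locale: {locale}")


def transcribes_out_of_the_box(locale: str) -> bool:
    """Adaptation-ready locales need fine-tuning first; the other two tiers do not."""
    return tier_of(locale) != "adaptation-ready"


print(tier_of("de-DE"), transcribes_out_of_the_box("de-DE"))
print(tier_of("th-TH"), transcribes_out_of_the_box("th-TH"))
```

A caller could use a check like this to reject or reroute requests for adaptation-ready locales before any audio reaches the model.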
> Note: For English-only transcription, we recommend the Nemotron ASR Streaming (English) model. For all other transcription-ready locales, we recommend Nemotron 3.5 ASR for its expanded multilingual capabilities.
> [!Tip]
> Automatic language detection / language tagging: When run with `target_lang=auto`, the model detects the spoken language and emits the corresponding language code/tag in the output following the terminal punctuation. This lets a single deployment transcribe mixed-language traffic and automatically label each utterance with its detected language, with no separate language-ID component required.
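Downstream code usually needs to separate the transcript from the trailing language tag. This excerpt does not specify the tag syntax, so the sketch below assumes a purely hypothetical form: a locale code in angle brackets at the end of the text, such as `<de-DE>`. Check the real output format before relying on it; `split_language_tag` is an illustrative helper, not a library function.

```python
import re
from typing import Optional

# Hypothetical tag syntax (an assumption, not confirmed by the model card):
# a locale code in angle brackets at the very end of the transcript,
# for example "Guten Morgen. <de-DE>".
_TRAILING_TAG = re.compile(r"\s*<([a-z]{2}-[A-Z]{2})>\s*$")


def split_language_tag(text: str) -> tuple[str, Optional[str]]:
    """Split a transcript into (text without tag, detected locale or None)."""
    match = _TRAILING_TAG.search(text)
    if match is None:
        return text.strip(), None
    return text[: match.start()].strip(), match.group(1)


print(split_language_tag("Guten Morgen. <de-DE>"))
print(split_language_tag("Hello there."))
```

Returning `None` when no tag is present lets the same helper handle output produced with an explicit target language.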
---
Model Architecture
Architecture Type: FastConformer-CacheAware-RNNT with Prompt
This model consists of a cache-aware streaming Parakeet (FastConformer) encoder with an RNN-T decoder and language-ID prompt conditioning. It is based on the Cache-Aware [\[1\]](#ref-1) FastConformer [\[2\]](#ref-2) architecture with 24 encoder layers and an RNNT (Recurrent Neural Network Transducer) decoder. The cache-aware streaming design enables efficient processing of audio in chunks while maintaining context from previous frames. Unlike buffered inference, this model maintains caches for all encoder self-attention and convolution layers. This enables reuse of hidden states at every streaming step, where cached activations eliminate redundant computations. As a result, there are no overlapping computations; each processed frame is strictly non-overlapping. This model leverages prompts to…