Model · IBM (Granite) · published Apr 16, 2026 · seen 5d

ibm-granite/granite-speech-4.1-2b

published Apr 16, 2026 · seen 5d · captured 9h · http 200 · method plain · task automatic-speech-recognition · license apache-2.0 · library transformers · params 2.3B · downloads 518k · likes 132

Granite-Speech-4.1-2B

Model Summary: Granite Speech 4.1 2B is a compact and efficient speech-language model, specifically designed for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST) for English, French, German, Spanish, Portuguese and Japanese.

The model was trained on 174,000 hours of audio from public ASR and AST corpora, as well as synthetic datasets tailored to support Japanese ASR, keyword-biased ASR and speech translation. Granite Speech 4.1 2B was trained by modality-aligning an intermediate checkpoint of granite-4.0-1b-base to speech on publicly available open-source corpora containing audio inputs and text targets. Compared to its predecessor granite-4.0-1b-speech, this model has the same parameter count (the new name reflects the actual model size rather than the size of the base LLM) and provides the following additional capabilities and improvements:

  • Higher transcription accuracy for multilingual ASR due to a novel dual-head CTC encoder with both graphemic and BPE outputs and frame importance sampling to focus on informative parts of the audio
  • Punctuation and truecasing for ASR and AST in all languages (including German noun capitalization) with a simple prompt change
  • Better keyword list biasing capability for enhanced recognition of names, acronyms and technical jargon

Two additional model variants explore different capabilities and inference optimization:

Evaluations:

We evaluated granite-speech-4.1-2b alongside other speech-language models in the sub-8B parameter range, as well as dedicated ASR and AST systems, on standard public benchmarks. The evaluation placed particular emphasis on English ASR and also covered multilingual ASR and AST in both the X-En and En-X directions.

[Figure: granite-speech-4.1-2b-wer1-crop]

[Figure: granite-speech-4.1-2b-wer2-crop]

[Figure: granite-speech-4.1-2b-bleu1-crop]

[Figure: granite-speech-4.1-2b-bleu2-crop]

Performance on the Open ASR leaderboard (as of April 2026): [Figure: rtfx_wer]

We evaluated the model’s keyword list biasing (KWB) capability by comparing performance with and without KWB applied at inference time. We report the F1 scores of transcribed keywords during ASR tasks, excluding common words from the evaluation. [Figure: kwb-f1.v2]
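To make the keyword F1 metric concrete, here is a minimal sketch of how such a score could be computed for one reference/hypothesis pair. The exact scoring procedure behind the reported numbers is not given in this card, so the tokenization, the restriction to single-word keywords, and the function names `tokenize` and `keyword_f1` are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens (assumed normalization)."""
    return re.findall(r"[\w']+", text.lower())


def keyword_f1(reference: str, hypothesis: str, keywords: list[str]) -> float:
    """F1 over occurrences of single-word keywords (hypothetical helper)."""
    vocab = {k.lower() for k in keywords}
    ref_counts = Counter(t for t in tokenize(reference) if t in vocab)
    hyp_counts = Counter(t for t in tokenize(hypothesis) if t in vocab)
    # A keyword occurrence is a true positive if it appears on both sides.
    true_pos = sum((ref_counts & hyp_counts).values())
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(hyp_counts.values())
    recall = true_pos / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Made-up example: the hypothesis misses one of three keywords.
score = keyword_f1(
    "Granite runs on watsonx and Kubernetes",
    "granite runs on Watson X and Kubernetes",
    ["Granite", "watsonx", "Kubernetes"],
)
print(round(score, 2))  # 0.8
```

Running the same function on transcripts produced with and without a keyword list would give the kind of comparison described above; excluding common words amounts to leaving them out of `keywords`.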

We also evaluated our model on a variety of corpora to assess its punctuation and capitalization capabilities. We report the metrics as defined in LibriSpeech-PC. PER (punctuation error rate) measures errors in the insertion, deletion, or substitution of punctuation marks (periods, commas, and question marks). Cap-F1 (capitalization F1) measures how accurately the model capitalizes relevant words in the output. Note that our Cap-F1 is computed on Levenshtein-aligned matching word pairs rather than fully matching sentences, allowing evaluation even in the presence of ASR errors.

| Test Set | PER (↓) | Cap-F1 (↑) |
|:---------|:----:|:------:|
| LScln | 25.70 | 89.71 |
| LSoth | 22.27 | 91.26 |
| VoxPopuli | 24.86 | 95.35 |
| Earnings-22 | 22.87 | 95.19 |
| CV-EN | 9.13 | 96.75 |
| CV-DE | 3.66 | 99.50† |
| CV-ES | 11.61 | 95.68 |
| CV-FR | 11.00 | 97.25 |
| CV-PT | 7.86 | 98.51 |

† *We report a Cap-F1 of 99.5 on German, where noun capitalization is required.*
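The Cap-F1 variant described above, computed on Levenshtein-aligned matching word pairs, can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code used for the table: words are split on whitespace, surrounding punctuation is stripped, a word counts as capitalized when its first character is uppercase, and the helpers `align_matches` and `cap_f1` are hypothetical names.

```python
import string


def align_matches(ref: list[str], hyp: list[str]) -> list[tuple[str, str]]:
    """Return (ref_word, hyp_word) pairs that are equal ignoring case
    along one minimum-edit-distance (Levenshtein) word alignment."""
    r = [w.lower() for w in ref]
    h = [w.lower() for w in hyp]
    n, m = len(r), len(h)
    # dist[i][j] = word-level edit distance between r[:i] and h[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Backtrace from the bottom-right corner, collecting exact matches only.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if dist[i][j] == dist[i - 1][j - 1] + (r[i - 1] != h[j - 1]):
            if r[i - 1] == h[j - 1]:
                pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1  # reference word deleted in the hypothesis
        else:
            j -= 1  # extra word inserted in the hypothesis
    pairs.reverse()
    return pairs


def cap_f1(reference: str, hypothesis: str) -> float:
    """Capitalization F1 over aligned matching word pairs (assumed definition)."""
    def words(text: str) -> list[str]:
        stripped = (w.strip(string.punctuation) for w in text.split())
        return [w for w in stripped if w]

    tp = fp = fn = 0
    for ref_w, hyp_w in align_matches(words(reference), words(hypothesis)):
        ref_cap, hyp_cap = ref_w[:1].isupper(), hyp_w[:1].isupper()
        tp += ref_cap and hyp_cap
        fp += hyp_cap and not ref_cap
        fn += ref_cap and not hyp_cap
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


# Made-up German example: one noun is lowercased and one extra word is inserted.
print(round(cap_f1("Der Hund lief nach Berlin.", "Der hund lief schnell nach Berlin"), 2))  # 0.8
```

Because only aligned matching pairs are scored, the inserted word "schnell" does not affect the result, which is the property that allows evaluation in the presence of ASR errors.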

Release Date: April 29, 2026

License: Apache 2.0

Supported Languages: English, French, German, Spanish, Portuguese, Japanese

Intended Use: The model is intended to be used in enterprise applications that involve processing of speech inputs. In particular, the model is well-suited for English, French, German, Spanish, Portuguese and Japanese speech-to-text and speech translations to and from English for the same languages, plus English-to-Italian and English-to-Mandarin.

Usage:

The Granite Speech model is supported natively in transformers>=4.52.1. Below is a simple example of how to use the granite-speech-4.1-2b model.

Usage with transformers

First, make sure to install a recent version of transformers:

pip install transformers torchaudio soundfile

Then run the code:

import torch
import torchaudio
from huggingface_hub import hf_hub_download
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "ibm-granite/granite-speech-4.1-2b"
processor = AutoProcessor.from_pretrained(model_name)
tokenizer = processor.tokenizer
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name, device_map=device, torch_dtype=torch.bfloat16
)

# Load audio
audio_path = hf_hub_download(repo_id=model_name, filename="multilingual_sample.wav")
wav, sr = torchaudio.load(audio_path, normalize=True)
assert wav.shape[0] == 1 and sr == 16000 # mono, 16kHz

# Create text prompt
user_prompt = "transcribe the speech with proper punctuation and capitalization."
chat = [
    {"role": "user", "content": user_prompt},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

# Run the processor + model
model_inputs = processor(prompt, wav, device=device, return_tensors="pt").to(device)
model_outputs = model.generate(
    **model_inputs, max_new_tokens=200, do_sample=False, num_beams=1
)

# Transformers includes the input IDs in the…
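The captured example is cut off at the comment above, and the original continuation is not reproduced here. As a general illustration of the idea that comment points to (generated sequences start with the prompt's token IDs, so those are dropped before decoding), the following sketch uses plain Python lists and a hypothetical helper `strip_prompt`. It is not the model card's own code.

```python
def strip_prompt(sequences, prompt_len):
    """Return each sequence without its first `prompt_len` token IDs."""
    return [list(seq[prompt_len:]) for seq in sequences]


# Hypothetical IDs: three prompt tokens followed by two generated tokens.
generated = [[101, 7, 8, 42, 43]]
new_tokens = strip_prompt(generated, prompt_len=3)
print(new_tokens)  # [[42, 43]]
```

With tensors, the same step is usually a slice such as `model_outputs[:, prompt_len:]`, with `prompt_len` taken from the length of the input IDs (assumed here to be available as `model_inputs["input_ids"]`), followed by `tokenizer.batch_decode(..., skip_special_tokens=True)`.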


Notability

notability 9.0/10

Very high HF downloads; notable IBM speech model