Model · IBM (Granite) · published Feb 27, 2026

ibm-granite/granite-4.0-1b-speech

task: automatic-speech-recognition · license: apache-2.0 · library: transformers · params: 2.3B · downloads: 93k · likes: 247

Granite-4.0-1b-speech

Model Summary: Granite-4.0-1b-speech is a compact and efficient speech-language model, specifically designed for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST).

The model was trained on a collection of public corpora comprising diverse ASR and AST datasets, together with synthetic datasets tailored to support Japanese ASR, keyword-biased ASR and speech translation. Granite-4.0-1b-speech was trained by modality-aligning granite-4.0-1b-base to speech on publicly available open-source corpora containing audio inputs and text targets. Compared to granite-speech-3.3-2b and granite-speech-3.3-8b, this model has the following additional capabilities and improvements:

  • Supports multilingual speech inputs in English, French, German, Spanish, Portuguese and Japanese,
  • Provides higher transcription accuracy for English ASR and faster inference through better encoder training and speculative decoding,
  • Has half the number of parameters of granite-speech-3.3-2b for running on resource-constrained devices,
  • Adds keyword list biasing capability for enhanced name and acronym recognition
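Keyword biasing is driven by the prompt text. As a minimal sketch, assuming the "Keywords: ..." suffix mentioned in the usage example's comment is a comma-separated list appended to the user prompt (the helper name `build_asr_prompt` and the exact separator are illustrative assumptions, not part of the model's API):

```python
def build_asr_prompt(instruction, keywords=None):
    """Return the user prompt, optionally with a keyword-biasing suffix.

    Assumption: keywords are appended as 'Keywords: a, b, c'. Check the
    full model card for the exact format the model expects.
    """
    prompt = instruction.strip()
    # Drop empty entries and surrounding whitespace
    cleaned = [k.strip() for k in (keywords or []) if k and k.strip()]
    if cleaned:
        prompt = f"{prompt} Keywords: {', '.join(cleaned)}"
    return prompt


base = "can you transcribe the speech into a written format?"
print(build_asr_prompt(base, ["IBM", "Granite", "vLLM"]))
```

Keeping the suffix logic in one place makes it easy to adjust if the expected format differs from this assumption.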

Evaluations:

We evaluated granite-4.0-1b-speech alongside other speech-language models with fewer than 8B parameters, as well as dedicated ASR and AST systems. The evaluation spanned multiple public benchmarks, with particular emphasis on English ASR tasks, while also including multilingual ASR and AST for X-En and En-X translations.

[Figures: WER (wer1, wer2) and BLEU (bleu1, bleu2) benchmark comparison charts for granite-4.0-1b-speech]

Performance on **HuggingFace Open ASR leaderboard**:

| model | Average WER | RTFx | AMI | Earnings22 | Gigaspeech | LS Clean | LS Other | SPGISpeech | Tedlium | Voxpopuli |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ibm-granite/granite-4.0-1b-speech | 5.52 | 280.02 | 8.44 | 8.48 | 10.14 | 1.42 | 2.85 | 3.89 | 3.1 | 5.84 |
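The leaderboard's two headline metrics are word error rate (WER, lower is better) and inverse real-time factor (RTFx, higher is better). The following sketch is not the leaderboard's own scoring code, which also applies text normalization. It shows the standard word-level edit-distance definition of WER and checks that the reported average is the plain mean of the eight per-dataset scores in the table:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # At iteration i, prev[j] is the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def rtfx(audio_seconds, processing_seconds):
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    return audio_seconds / processing_seconds


# Per-dataset WER (%) copied from the table above
scores = {
    "AMI": 8.44, "Earnings22": 8.48, "Gigaspeech": 10.14, "LS Clean": 1.42,
    "LS Other": 2.85, "SPGISpeech": 3.89, "Tedlium": 3.1, "Voxpopuli": 5.84,
}
average_wer = sum(scores.values()) / len(scores)
print(f"Average WER: {average_wer:.2f}")
```

An RTFx of 280.02 therefore means roughly 280 seconds of audio transcribed per second of processing on the leaderboard's hardware.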

Release Date: March 6, 2026

License: Apache 2.0

Supported Languages: English, French, German, Spanish, Portuguese, Japanese

Intended Use: The model is intended to be used in enterprise applications that involve processing of speech inputs. In particular, the model is well-suited for English, French, German, Spanish, Portuguese and Japanese speech-to-text and speech translations to and from English for the same languages, plus English-to-Italian and English-to-Mandarin.
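The language coverage described above can be encoded as a small lookup. This is only an illustrative sketch: the ISO 639-1 codes and the function name `is_supported` are assumptions introduced here, and `zh` stands in for Mandarin.

```python
ASR_LANGUAGES = {"en", "fr", "de", "es", "pt", "ja"}
EXTRA_EN_TARGETS = {"it", "zh"}  # English-to-Italian and English-to-Mandarin only


def is_supported(source, target=None):
    """Return True if the source/target pair is listed under Intended Use.

    target=None (or target == source) means transcription (ASR).
    """
    if target is None or target == source:
        return source in ASR_LANGUAGES
    if source == "en":
        return target in (ASR_LANGUAGES - {"en"}) | EXTRA_EN_TARGETS
    # X-to-En translation for the non-English ASR languages
    return target == "en" and source in ASR_LANGUAGES


print(is_supported("ja"))        # Japanese ASR
print(is_supported("en", "it"))  # English-to-Italian translation
print(is_supported("it", "en"))  # Italian-to-English is not listed
```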

Generation:

The Granite Speech model is supported natively in transformers>=4.52.1. Below is a simple example of how to use granite-4.0-1b-speech.

Usage with transformers

First, make sure to install a recent version of transformers:

pip install transformers torchaudio soundfile

Then run the code:

import torch
import torchaudio
from huggingface_hub import hf_hub_download
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "ibm-granite/granite-4.0-1b-speech"
processor = AutoProcessor.from_pretrained(model_name)
tokenizer = processor.tokenizer
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name, device_map=device, torch_dtype=torch.bfloat16
)

# Load audio
audio_path = hf_hub_download(repo_id=model_name, filename="multilingual_sample.wav")
wav, sr = torchaudio.load(audio_path, normalize=True)
assert wav.shape[0] == 1 and sr == 16000  # mono, 16kHz

# Create text prompt; "<|audio|>" marks where the audio features are inserted
user_prompt = "<|audio|>can you transcribe the speech into a written format?"
# For keyword biasing, append "Keywords: " followed by a comma-separated keyword list
chat = [
    {"role": "user", "content": user_prompt},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

# Run the processor + model
model_inputs = processor(prompt, wav, device=device, return_tensors="pt").to(device)
model_outputs = model.generate(
    **model_inputs, max_new_tokens=200, do_sample=False, num_beams=1
)

# Transformers includes the input IDs in the response, so slice them off
num_input_tokens = model_inputs["input_ids"].shape[-1]
new_tokens = model_outputs[0, num_input_tokens:].unsqueeze(0)
output_text = tokenizer.batch_decode(
    new_tokens, add_special_tokens=False, skip_special_tokens=True
)
print(f"STT output = {output_text[0]}")
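The assertion in the example requires mono audio sampled at 16 kHz. A minimal pre-flight check using only Python's standard `wave` module (so it applies to uncompressed PCM WAV files only) might look like the sketch below. The helper name `check_wav` is an assumption made here, and files that fail the check would need downmixing or resampling, for instance with `torchaudio.functional.resample`, before being passed to the processor.

```python
import wave

EXPECTED_SAMPLE_RATE = 16000  # Hz, matching the assert in the example above


def check_wav(path):
    """Return (ok, message) describing whether a PCM WAV file is mono 16 kHz."""
    with wave.open(path, "rb") as f:
        channels = f.getnchannels()
        rate = f.getframerate()
        seconds = f.getnframes() / rate
    problems = []
    if channels != 1:
        problems.append(f"{channels} channels (expected mono)")
    if rate != EXPECTED_SAMPLE_RATE:
        problems.append(f"{rate} Hz (expected {EXPECTED_SAMPLE_RATE} Hz)")
    if problems:
        return False, "; ".join(problems)
    return True, f"mono, {rate} Hz, {seconds:.2f} s"
```

Calling `check_wav(audio_path)` before loading the file gives a readable error message instead of a bare assertion failure.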

Usage with vLLM

First, make sure to install vLLM:

pip install vllm
  • Code for offline mode:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

model_id = "ibm-granite/granite-4.0-1b-speech"
tokenizer = AutoTokenizer.from_pretrained(model_id)

def get_prompt(question: str, has_audio: bool):
    """Build the input prompt to send to vLLM."""
    if has_audio:
        # Prepend the audio placeholder token so the audio is attached to the prompt
        question = f"<|audio|>{question}"
    chat = [
        {
            "role": "user",
            "content": question
        }
    ]
    return tokenizer.apply_chat_template(chat, tokenize=False)

model = LLM(
    model=model_id,
    max_model_len=2048,  # This may be needed for lower resource devices.
    limit_mm_per_prompt={"audio": 1},
)

question = "can you transcribe the speech into a written format?"
prompt_with_audio = get_prompt(
    question=question,
    has_audio=True,
)
audio = AudioAsset("mary_had_lamb").audio_and_sample_rate

inputs = {
    "prompt": prompt_with_audio,
    "multi_modal_data": {
        "audio": audio,
    }
}

outputs = model.generate(
    inputs,
    sampling_params=SamplingParams(
        temperature=0.2,
        max_tokens=64,
    ),
)
print(f"Audio Example - Question: {question}")
print(f"Generated text:…
