ibm-granite/granite-speech-models
Language: Jupyter Notebook
Stars: 44
Forks: 6
Open issues: 3
Created: 2025-07-07T21:02:29Z
Pushed: 2026-04-28T14:32:26Z
Default branch: main
Fork: no
Archived: no
README:
Granite Speech Models
Model Summary: Granite Speech models are compact and efficient speech-language models, specifically designed for automatic speech recognition (ASR) and automatic speech translation (AST). Granite Speech models use a two-pass design, unlike integrated models that combine speech and language into a single pass. Initial calls to Granite Speech will transcribe audio files into text. To process the transcribed text using the underlying Granite language model, users must make a second call as each step must be explicitly initiated.
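The two-pass flow described above can be sketched as two explicit calls chained by the caller. This is a minimal illustration, not the repository's API: `two_pass`, `transcribe`, and `ask_llm` are hypothetical names, and the stand-in functions replace real model calls so the sketch runs without downloading any weights.

```python
from typing import Callable, Dict


def two_pass(
    audio_path: str,
    transcribe: Callable[[str], str],
    ask_llm: Callable[[str], str],
    instruction: str,
) -> Dict[str, str]:
    """Run the two explicit calls: speech -> text, then text -> LLM response."""
    # Pass 1: the speech call only produces a transcript.
    transcript = transcribe(audio_path)
    # Pass 2: a separate, text-only call processes that transcript.
    answer = ask_llm(f"{instruction}\n\n{transcript}")
    return {"transcript": transcript, "answer": answer}


# Hypothetical stand-ins for the real model calls.
def fake_transcribe(path: str) -> str:
    return "the quick brown fox"


def fake_llm(prompt: str) -> str:
    return f"[{len(prompt.split())} words received]"


result = two_pass("sample.wav", fake_transcribe, fake_llm, "Summarize this transcript:")
print(result["transcript"])
print(result["answer"])
```

In real use, `transcribe` would wrap a generate call with an audio prompt and `ask_llm` a second generate call with a text-only prompt; keeping them as separate callables mirrors the requirement that each step be initiated explicitly.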
These models were trained on a collection of public corpora comprising diverse datasets for ASR and AST as well as synthetic datasets tailored to support the speech translation task. granite-speech-3.3-2b/8b models were trained by modality aligning granite-3.3-2b/8b-instruct to speech on publicly available open source corpora containing audio inputs and text targets.
- Compared to revision 3.3.1, revision 3.3.2 supports multilingual speech inputs in English, French, German, Spanish and Portuguese and provides additional accuracy improvements for English ASR.
- Compared to the initial release, revision 3.3.2 is also trained on additional data and uses a deeper acoustic encoder for improved transcription accuracy.
Evaluations:
We evaluated Granite Speech models alongside other speech-language models with fewer than 8B parameters, as well as dedicated ASR and AST systems, on standard benchmarks. The evaluation spanned multiple public benchmarks, with particular emphasis on English ASR tasks while also including multilingual ASR and AST for X-En and En-X translations.
Release Date: June 19, 2025
License: Apache 2.0
Supported Languages: English, French, German, Spanish, Portuguese
Intended Use: The model is intended to be used in enterprise applications that involve processing of speech inputs. In particular, the model is well-suited for English, French, German, Spanish and Portuguese speech-to-text and speech translations to and from English for the same languages plus English-to-Japanese and English-to-Mandarin. The model can also be used for tasks that involve text-only input since it calls the underlying Granite model when the user specifies a prompt that does not contain audio.
Generation:
Granite Speech models are supported natively in transformers from the main branch. Below is a simple example of how to use the granite-speech-3.3-8b revision 3.3.2 model.
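The transformers example in this README asserts that the input audio is mono and sampled at 16 kHz. A small pre-flight check on the file header can catch a mismatch before any model is loaded. The sketch below uses only the Python standard library; `check_wav` and `write_silence` are hypothetical helper names, and the `wave` module reads uncompressed PCM WAV files only, so other formats would need a different reader.

```python
import os
import tempfile
import wave
from typing import Tuple


def check_wav(path: str, expected_rate: int = 16000) -> Tuple[int, int]:
    """Return (channels, sample_rate); raise ValueError unless mono at expected_rate."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
    if channels != 1:
        raise ValueError(f"expected mono audio, got {channels} channels")
    if rate != expected_rate:
        raise ValueError(f"expected {expected_rate} Hz, got {rate} Hz")
    return channels, rate


def write_silence(path: str, rate: int, channels: int = 1, seconds: float = 0.1) -> None:
    """Write a short silent 16-bit PCM WAV file, used here only for the demo."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * int(rate * seconds) * channels)


tmp_dir = tempfile.mkdtemp()
good = os.path.join(tmp_dir, "good.wav")
bad = os.path.join(tmp_dir, "bad.wav")
write_silence(good, 16000)
write_silence(bad, 44100, channels=2)

print(check_wav(good))
try:
    check_wav(bad)
except ValueError as err:
    print("rejected:", err)
```

A file that fails the check would need to be downmixed and resampled before being passed to the processor.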
Usage with transformers
First, make sure to install a recent version of transformers:
pip install "transformers>=4.52.4" torchaudio peft soundfile
Then run the code:
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from huggingface_hub import hf_hub_download
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "ibm-granite/granite-speech-3.3-8b"
processor = AutoProcessor.from_pretrained(model_name)
tokenizer = processor.tokenizer
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name, device_map=device, torch_dtype=torch.bfloat16
)
# load audio
audio_path = hf_hub_download(repo_id=model_name, filename="10226_10111_000000.wav")
wav, sr = torchaudio.load(audio_path, normalize=True)
assert wav.shape[0] == 1 and sr == 16000 # mono, 16khz
# create text prompt
system_prompt = "Knowledge Cutoff Date: April 2024.\nToday's Date: April 9, 2025.\nYou are Granite, developed by IBM. You are a helpful AI assistant"
user_prompt = "<|audio|>can you transcribe the speech into a written format?"
chat = [
    dict(role="system", content=system_prompt),
    dict(role="user", content=user_prompt),
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
# run the processor+model
model_inputs = processor(prompt, wav, device=device, return_tensors="pt").to(device)
model_outputs = model.generate(**model_inputs, max_new_tokens=200, do_sample=False)
# Transformers includes the input IDs in the response.
num_input_tokens = model_inputs["input_ids"].shape[-1]
new_tokens = torch.unsqueeze(model_outputs[0, num_input_tokens:], dim=0)
output_text = tokenizer.batch_decode(
    new_tokens, add_special_tokens=False, skip_special_tokens=True
)
print(f"STT output = {output_text[0].upper()}")
Usage with vLLM
First, make sure to install the latest version of vLLM:
pip install vllm --upgrade
- Code for offline mode:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset
from vllm.lora.request import LoRARequest
model_id = "ibm-granite/granite-speech-3.3-8b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
def get_prompt(question: str, has_audio: bool):
    """Build the input prompt to send to vLLM."""
    if has_audio:
        # Prepend the audio placeholder token so the prompt marks where the audio goes.
        question = f"<|audio|>{question}"
    chat = [
        {
            "role": "user",
            "content": question
        }
    ]
    return tokenizer.apply_chat_template(chat, tokenize=False)
# NOTE - you may see warnings about multimodal lora layers being ignored;
# this is okay as the lora in this model is only applied to the LLM.
model = LLM(
    model=model_id,
    enable_lora=True,
    max_lora_rank=64,
    max_model_len=2048,  # This may be needed for lower resource devices.
    limit_mm_per_prompt={"audio": 1},
)
### 1. Example with Audio [make sure to use the lora]
question = "can you transcribe the speech into a written…