Repo · IBM (Granite) · published Jul 7, 2025

ibm-granite/granite-speech-models

Jupyter Notebook

Language: Jupyter Notebook

Stars: 44

Forks: 6

Open issues: 3

Created: 2025-07-07T21:02:29Z

Pushed: 2026-04-28T14:32:26Z

Default branch: main

Fork: no

Archived: no

README:

:books: Tech Report | :hugs: HuggingFace Collection | :trophy: OpenASR leaderboard | :wrench: Finetuning Example

Granite Speech Models

Model Summary: Granite Speech models are compact, efficient speech-language models designed for automatic speech recognition (ASR) and automatic speech translation (AST). Unlike integrated models that combine speech and language in a single pass, Granite Speech uses a two-pass design. The first call to Granite Speech transcribes an audio file into text. To process that transcript with the underlying Granite language model, the user must make a second call, because each step has to be initiated explicitly.
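
The two-pass flow can be sketched as two explicit calls. This is a minimal illustration rather than the library's API: `transcribe` and `generate_text` are hypothetical callables standing in for an audio call to Granite Speech and a text-only call to the underlying Granite language model, and the stub functions exist only so the sketch runs without a model.

```python
from typing import Callable


def two_pass(
    audio_path: str,
    instruction: str,
    transcribe: Callable[[str], str],      # hypothetical: audio file -> transcript
    generate_text: Callable[[str], str],   # hypothetical: text prompt -> response
) -> tuple[str, str]:
    """Run the two explicit steps and return (transcript, response)."""
    # Pass 1: speech -> text.
    transcript = transcribe(audio_path)
    # Pass 2: a separate, text-only call that operates on the transcript.
    prompt = f"{instruction}\n\n{transcript}"
    response = generate_text(prompt)
    return transcript, response


# Stub callables (assumptions for illustration only).
def fake_transcribe(path: str) -> str:
    return "the quick brown fox"


def fake_generate(prompt: str) -> str:
    return prompt.upper()


transcript, response = two_pass(
    "sample.wav", "Summarize:", fake_transcribe, fake_generate
)
```

Keeping the two steps as separate calls mirrors the description above: nothing is sent to the language model until the caller decides to make the second call.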

These models were trained on a collection of public corpora comprising diverse ASR and AST datasets, along with synthetic datasets tailored to support speech translation. The granite-speech-3.3-2b/8b models were trained by modality-aligning granite-3.3-2b/8b-instruct to speech on publicly available open-source corpora containing audio inputs and text targets.

  • Compared to revision 3.3.1, revision 3.3.2 supports multilingual speech inputs in English, French, German, Spanish and Portuguese and provides additional accuracy improvements for English ASR.
  • Compared to the initial release, revision 3.3.2 is also trained on additional data and uses a deeper acoustic encoder for improved transcription accuracy.

Evaluations:

We evaluated Granite Speech models alongside other speech-language models with fewer than 8B parameters, as well as dedicated ASR and AST systems, on standard public benchmarks. The evaluation emphasized English ASR and also covered multilingual ASR and AST for X-En and En-X translation.
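
ASR benchmarks are commonly scored by word error rate (WER): the word-level edit distance between the model output and the reference transcript, divided by the number of reference words. The sketch below assumes that metric and is not the evaluation code behind these results; `word_error_rate` is a hypothetical helper, and real pipelines apply fuller text normalization than the lowercasing shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.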

Release Date: June 19, 2025

License: Apache 2.0

Supported Languages: English, French, German, Spanish, Portuguese

Intended Use: The model is intended for enterprise applications that process speech input. It is well suited to speech-to-text in English, French, German, Spanish and Portuguese, and to speech translation to and from English for those languages, plus English-to-Japanese and English-to-Mandarin. It can also handle text-only tasks: when a prompt contains no audio, the model calls the underlying Granite language model.

Generation:

Granite Speech models are supported natively in transformers from the main branch. Below is a simple example of how to use the granite-speech-3.3-8b revision 3.3.2 model.

Usage with transformers

First, make sure to install a recent version of transformers:

pip install "transformers>=4.52.4" torchaudio peft soundfile

Then run the code:

import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from huggingface_hub import hf_hub_download

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "ibm-granite/granite-speech-3.3-8b"
processor = AutoProcessor.from_pretrained(model_name)
tokenizer = processor.tokenizer
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name, device_map=device, torch_dtype=torch.bfloat16
)
# load audio
audio_path = hf_hub_download(repo_id=model_name, filename="10226_10111_000000.wav")
wav, sr = torchaudio.load(audio_path, normalize=True)
assert wav.shape[0] == 1 and sr == 16000 # mono, 16khz

# create text prompt; the <|audio|> placeholder marks where the audio goes
system_prompt = "Knowledge Cutoff Date: April 2024.\nToday's Date: April 9, 2025.\nYou are Granite, developed by IBM. You are a helpful AI assistant"
user_prompt = "<|audio|>can you transcribe the speech into a written format?"
chat = [
    dict(role="system", content=system_prompt),
    dict(role="user", content=user_prompt),
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

# run the processor+model
model_inputs = processor(prompt, wav, device=device, return_tensors="pt").to(device)
model_outputs = model.generate(**model_inputs, max_new_tokens=200, do_sample=False)

# Transformers includes the input IDs in the response.
num_input_tokens = model_inputs["input_ids"].shape[-1]
new_tokens = torch.unsqueeze(model_outputs[0, num_input_tokens:], dim=0)
output_text = tokenizer.batch_decode(
    new_tokens, add_special_tokens=False, skip_special_tokens=True
)
print(f"STT output = {output_text[0].upper()}")
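
The example above asserts that the input is mono and sampled at 16 kHz. A minimal pre-check for that requirement, using only the standard library's `wave` module and assuming an uncompressed PCM WAV file, might look like the sketch below. `check_wav` is a hypothetical helper name; a file that fails the check would need downmixing or resampling (for instance with torchaudio) before it is passed to the processor.

```python
import wave


def check_wav(path: str, expected_rate: int = 16000) -> tuple[bool, str]:
    """Return (ok, message) saying whether a PCM WAV file is mono at expected_rate."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        rate = wav_file.getframerate()
    if channels != 1:
        return False, f"expected mono audio, found {channels} channels"
    if rate != expected_rate:
        return False, f"expected {expected_rate} Hz, found {rate} Hz"
    return True, "ok"
```

Running the check before loading the model gives a readable message instead of a bare assertion failure partway through the script.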

Usage with vLLM

First, make sure to install the latest version of vLLM:

pip install vllm --upgrade
  • Code for offline mode:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset
from vllm.lora.request import LoRARequest

model_id = "ibm-granite/granite-speech-3.3-8b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

def get_prompt(question: str, has_audio: bool):
    """Build the input prompt to send to vLLM."""
    if has_audio:
        # Prefix the audio placeholder so the model knows where the audio goes.
        question = f"<|audio|>{question}"
    chat = [
        {
            "role": "user",
            "content": question
        }
    ]
    return tokenizer.apply_chat_template(chat, tokenize=False)

# NOTE - you may see warnings about multimodal lora layers being ignored;
# this is okay as the lora in this model is only applied to the LLM.
model = LLM(
    model=model_id,
    enable_lora=True,
    max_lora_rank=64,
    max_model_len=2048,  # This may be needed for lower resource devices.
    limit_mm_per_prompt={"audio": 1},
)

### 1. Example with Audio [make sure to use the lora]
question = "can you transcribe the speech into a written…

Excerpt shown — open the source for the full document.
