microsoft/Staccato-Stuttered-ASR

Language: Python

License: MIT

Stars: 0

Forks: 0

Open issues: 15

Created: 2026-01-07T21:54:17Z

Pushed: 2026-06-05T23:41:43Z

Default branch: main

Fork: no

Archived: no

README:

Staccato: Stuttered Speech Recognition

Staccato is a speech recognition pipeline optimized for transcribing stuttered speech. It combines OpenAI's Whisper model with GPT-4o to produce accurate transcriptions that capture the speaker's intended message, filtering out involuntary disfluencies.

Overview

People who stutter speak with involuntary sound repetitions, word repetitions, prolongations, and blocks. Standard ASR models like Whisper are not trained on stuttered speech data and often produce poor transcriptions. Staccato addresses this by:

1. Whisper transcription: Initial transcription using Whisper Large V3
2. GPT-4o refinement: Uses GPT-4o with audio understanding to refine the transcription, leveraging both the audio and initial transcription
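
The two stages above can be pictured as a draft pass followed by a refinement pass that receives both the audio and the draft. The following is a minimal sketch of that control flow only. The names `two_stage_transcribe`, `fake_whisper`, and `fake_refiner` are hypothetical and do not come from the repository, and the stand-in functions return fixed strings instead of calling Whisper or GPT-4o.

```python
from typing import Callable, Sequence

# Hypothetical type aliases: any callables with these shapes will do.
Transcriber = Callable[[Sequence[float], int], str]   # audio, sample rate -> draft text
Refiner = Callable[[Sequence[float], int, str], str]  # audio, sample rate, draft -> final text


def two_stage_transcribe(audio, sample_rate, transcribe: Transcriber, refine: Refiner) -> dict:
    """Run a draft ASR pass, then a refinement pass that sees the audio and the draft."""
    draft = transcribe(audio, sample_rate)
    final = refine(audio, sample_rate, draft)
    return {"draft": draft, "final": final}


# Stand-ins so the sketch runs without a GPU or an API key.
def fake_whisper(audio, sample_rate):
    return "I w-w-want want to go"


def fake_refiner(audio, sample_rate, draft):
    # A real refiner would call an audio-capable model; this one returns a fixed string.
    return "I want to go"


result = two_stage_transcribe([0.0] * 16000, 16000, fake_whisper, fake_refiner)
print(result["final"])
```

Keeping the draft alongside the final text makes it easy to compare what the refinement stage changed.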

Installation

Prerequisites

  • Python 3.12+
  • NVIDIA GPU with CUDA support (CUDA 11.8+)
  • CUDA toolkit installed (nvcc must be in PATH)
  • Linux (tested on Ubuntu)
  • Git
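
Before starting setup, it can help to confirm the prerequisites above from Python. This is a rough sketch using only the standard library. The function name `check_prerequisites` is an assumption for illustration. It checks the Python version, the operating system, and whether `git` and `nvcc` are on PATH. It does not detect the GPU or verify the CUDA version.

```python
import platform
import shutil
import sys


def check_prerequisites(min_python=(3, 12)) -> dict:
    """Return a mapping of prerequisite name -> whether it appears satisfied."""
    return {
        "python": sys.version_info[:2] >= min_python,
        "linux": platform.system() == "Linux",
        "git": shutil.which("git") is not None,
        "nvcc": shutil.which("nvcc") is not None,  # CUDA toolkit compiler on PATH
    }


for name, ok in check_prerequisites().items():
    print(f"{name:8s} {'ok' if ok else 'MISSING'}")
```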

Setup

1. Clone the repository:

git clone https://github.com/microsoft/Staccato-Stuttered-ASR.git
cd Staccato-Stuttered-ASR

2. Install uv:

curl -LsSf https://astral.sh/uv/install.sh | sh

3. Restart your terminal to ensure uv is in your PATH.

4. Install dependencies:

uv sync

5. Rebuild flash-attn from source (required for Flash Attention 2 support):

export PATH=/usr/local/cuda/bin:$PATH
export CUDA_HOME=/usr/local/cuda
rm -rf .venv/lib/python3.12/site-packages/flash_attn*
uv cache clean flash-attn
uv pip install flash-attn --no-build-isolation --no-binary flash-attn

> Note: Compilation takes 5-10 minutes and requires the CUDA toolkit. If nvcc --version fails, install the CUDA toolkit first.
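
To check the two conditions in the note without leaving Python, something like the following may be useful. It is a sketch under assumptions: the helper names `nvcc_version` and `flash_attn_installed` are hypothetical, and the second helper only reports whether a `flash_attn` package is discoverable, not whether its compiled extension loads correctly on your GPU.

```python
import importlib.util
import subprocess


def nvcc_version():
    """Return the output of `nvcc --version`, or None if nvcc cannot be run."""
    try:
        proc = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return proc.stdout.strip()


def flash_attn_installed() -> bool:
    """True if a `flash_attn` package is discoverable in the current environment."""
    return importlib.util.find_spec("flash_attn") is not None


print("nvcc:", "found" if nvcc_version() else "not found")
print("flash_attn:", "installed" if flash_attn_installed() else "not installed")
```

Run it with `uv run python` so that the check looks inside the project's virtual environment.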

6. Download Whisper model:

mkdir -p ./data/models
uv run huggingface-cli download --local-dir ./data/models/whisper-large-v3 openai/whisper-large-v3
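
After the download finishes, a quick sanity check on the target directory can catch an interrupted or misplaced download. The sketch below is a heuristic, not something the repository provides: the function name `looks_like_model_dir` is hypothetical, and it only tests for a `config.json` file, which Hugging Face Transformers model repositories include. It does not validate the weights.

```python
from pathlib import Path


def looks_like_model_dir(model_dir) -> bool:
    """Heuristic: the directory exists and contains a config.json."""
    model_dir = Path(model_dir)
    return model_dir.is_dir() and (model_dir / "config.json").is_file()


print(looks_like_model_dir("./data/models/whisper-large-v3"))
```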

7. Create a .env file with your OpenAI API key:

echo "OPENAI_API_KEY=your-api-key-here" > .env
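
A common mistake is leaving the placeholder value in place. The following sketch parses a simple `.env` file with the standard library and reports whether `OPENAI_API_KEY` looks real. The helpers `read_env_file` and `has_real_api_key` are hypothetical names for illustration. How the pipeline itself reads the file is not described in this README, so this check is independent of it.

```python
from pathlib import Path


def read_env_file(path=".env") -> dict:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for line in env_path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def has_real_api_key(env: dict) -> bool:
    """True if OPENAI_API_KEY is set to something other than the placeholder."""
    key = env.get("OPENAI_API_KEY", "")
    return bool(key) and key != "your-api-key-here"


print("API key configured:", has_real_api_key(read_env_file(".env")))
```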

Usage

Command Line

Transcribe audio files from the terminal:

# With your own audio file
uv run python src/approaches/dspy_whisper_gpt_4o/pipeline.py \
  --audio-files /path/to/your/audio.wav

# If you downloaded sample data (multiple files)
uv run python src/approaches/dspy_whisper_gpt_4o/pipeline.py \
  --audio-files data/processed/fluencybank/wav_clips/27fb_086_000.wav \
  data/processed/fluencybank/wav_clips/44m_381_000.wav

# With custom model directory
uv run python src/approaches/dspy_whisper_gpt_4o/pipeline.py \
  --model-dir ./data/models/whisper-large-v3 \
  --audio-files /path/to/your/audio.wav
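
For a whole folder of clips, the command line can be assembled programmatically. This sketch builds the same argument list as the commands above from every `.wav` file in a directory. The function name `build_command` is hypothetical. Only the `--audio-files` and `--model-dir` flags shown above are used.

```python
from pathlib import Path

PIPELINE = "src/approaches/dspy_whisper_gpt_4o/pipeline.py"


def build_command(audio_dir, model_dir=None) -> list:
    """Build the pipeline command for all .wav files in audio_dir (sorted by name)."""
    wavs = sorted(str(p) for p in Path(audio_dir).glob("*.wav"))
    if not wavs:
        raise FileNotFoundError(f"no .wav files in {audio_dir}")
    cmd = ["uv", "run", "python", PIPELINE]
    if model_dir is not None:
        cmd += ["--model-dir", str(model_dir)]
    return cmd + ["--audio-files", *wavs]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.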

Python API

from pathlib import Path
import sys

# Add src to path
sys.path.insert(0, 'src')

import librosa
from approaches.dspy_whisper_gpt_4o.pipeline import DspyPipeline

# Initialize pipeline
pipe = DspyPipeline(
    model_dir=Path("./data/models/whisper-large-v3"),
)
pipe.load_model()

# Load audio
audio, sr = librosa.load("path/to/audio.wav")

# Transcribe
result = pipe.transcribe([audio], sr)
print(result)
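
Since `transcribe` takes a list of audio arrays and a sample rate, several files can be passed in one call. The helper below is a sketch, and `transcribe_files` is a hypothetical name. It loads every file at 16 kHz, the rate Whisper's feature extractor expects, whereas `librosa.load` defaults to 22050 Hz. Whether the pipeline resamples internally is not stated in this README, so loading at 16 kHz is a cautious assumption rather than a documented requirement.

```python
def transcribe_files(pipe, paths, load_fn=None, target_sr=16000):
    """Load each path at target_sr and pass the whole batch to pipe.transcribe."""
    if load_fn is None:
        import librosa  # imported lazily so the helper can be exercised without it

        def load_fn(path):
            audio, _ = librosa.load(path, sr=target_sr)  # resamples on load
            return audio

    audios = [load_fn(p) for p in paths]
    return pipe.transcribe(audios, target_sr)
```

With the `pipe` object from the example above, `transcribe_files(pipe, ["a.wav", "b.wav"])` would send both clips through in one batch.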

Project Structure

src/
├── approaches/
│   ├── dspy_whisper_gpt_4o/
│   │   └── pipeline.py          # Main pipeline (Whisper + GPT-4o)
│   └── vanilla_whisper/
│       └── whisper.py           # WhisperTranscriber wrapper
└── evaluation/
    └── inference_models/
        └── asr_model_base.py    # Base class for ASR models
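
The contents of `asr_model_base.py` are not shown in this README. As an illustration of the general pattern only, a shared base class for ASR models might resemble the sketch below. The class names `ASRModel` and `EchoModel` are hypothetical. The two method names mirror the `load_model()` and `transcribe(audios, sr)` calls in the Python API example, but the real base class may differ.

```python
from abc import ABC, abstractmethod


class ASRModel(ABC):  # hypothetical name, not taken from the repository
    """Interface sketch: load weights once, then transcribe batches of audio."""

    @abstractmethod
    def load_model(self) -> None:
        ...

    @abstractmethod
    def transcribe(self, audios, sample_rate):
        ...


class EchoModel(ASRModel):
    """Trivial subclass used only to show how the interface is filled in."""

    def load_model(self) -> None:
        self.loaded = True

    def transcribe(self, audios, sample_rate):
        # Returns one empty transcript per input clip.
        return ["" for _ in audios]


model = EchoModel()
model.load_model()
print(model.transcribe([[0.0], [0.0]], 16000))
```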

Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for contribution guidelines.
