microsoft/Staccato-Stuttered-ASR
Python
Language: Python
License: MIT
Stars: 0
Forks: 0
Open issues: 15
Created: 2026-01-07T21:54:17Z
Pushed: 2026-06-05T23:41:43Z
Default branch: main
Fork: no
Archived: no
README:
Staccato: Stuttered Speech Recognition
Staccato is a speech recognition pipeline optimized for transcribing stuttered speech. It combines OpenAI's Whisper model with GPT-4o to produce accurate transcriptions that capture the speaker's intended message, filtering out involuntary disfluencies.
Overview
People who stutter speak with involuntary sound repetitions, word repetitions, prolongations, and blocks. Standard ASR models like Whisper are not trained on stuttered speech data and often produce poor transcriptions. Staccato addresses this by:
1. Whisper transcription: Initial transcription using Whisper Large V3
2. GPT-4o refinement: Uses GPT-4o with audio understanding to refine the transcription, leveraging both the audio and initial transcription
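The two stages above can be pictured as a simple composition of functions. The sketch below is illustrative only: `transcribe_then_refine`, `fake_whisper`, and `fake_refiner` are hypothetical names, and the stand-in callables do not call Whisper or GPT-4o. The real interfaces live in the repository's pipeline.py and may look quite different.

```python
from typing import Callable, Sequence

# Hypothetical type aliases; the real pipeline's signatures may differ.
Audio = Sequence[float]
DraftASR = Callable[[Audio], str]        # stage 1: audio -> draft transcript
Refiner = Callable[[Audio, str], str]    # stage 2: (audio, draft) -> refined transcript


def transcribe_then_refine(audio: Audio, draft_asr: DraftASR, refine: Refiner) -> str:
    """Run the draft ASR, then pass both the audio and the draft to the refiner."""
    draft = draft_asr(audio)
    return refine(audio, draft)


# Stand-ins for demonstration only; no model is loaded or called.
def fake_whisper(audio: Audio) -> str:
    return "I I I w-want to go go home"


def fake_refiner(audio: Audio, draft: str) -> str:
    # Toy rule: keep the part after a hyphen and drop immediate word repeats.
    words = []
    for word in draft.split():
        word = word.split("-")[-1]       # "w-want" -> "want"
        if not words or words[-1].lower() != word.lower():
            words.append(word)
    return " ".join(words)


if __name__ == "__main__":
    print(transcribe_then_refine([0.0] * 16000, fake_whisper, fake_refiner))
```

The point of the shape is that the second stage receives the audio as well as the draft text, which matches the description of GPT-4o using both inputs.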
Installation
Prerequisites
- Python 3.12+
- NVIDIA GPU with CUDA support (CUDA 11.8+)
- CUDA toolkit installed (nvcc must be in PATH)
- Linux (tested on Ubuntu)
- Git
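Some of these prerequisites can be checked before starting the setup. The following is a minimal sketch using only the standard library; `check_prerequisites` is a hypothetical helper, not part of the repository, and it does not verify the GPU itself or the CUDA version.

```python
import platform
import shutil
import sys


def check_prerequisites(min_python=(3, 12)):
    """Return a dict mapping each checkable prerequisite to True or False."""
    return {
        "python": sys.version_info >= min_python,
        "nvcc_on_path": shutil.which("nvcc") is not None,
        "git_on_path": shutil.which("git") is not None,
        "linux": platform.system() == "Linux",
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```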
Setup
1. Clone the repository:
git clone https://github.com/YOUR_USERNAME/staccato.git
cd staccato
2. Install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
3. Restart your terminal to ensure uv is in your PATH.
4. Install dependencies:
uv sync
5. Rebuild flash-attn from source (required for Flash Attention 2 support):
export PATH=/usr/local/cuda/bin:$PATH
export CUDA_HOME=/usr/local/cuda
rm -rf .venv/lib/python3.12/site-packages/flash_attn*
uv cache clean flash-attn
uv pip install flash-attn --no-build-isolation --no-binary flash-attn
> Note: This compilation takes 5-10 minutes and requires the CUDA toolkit. If nvcc --version fails, install the CUDA toolkit first.
6. Download Whisper model:
mkdir -p ./data/models
uv run huggingface-cli download --local-dir ./data/models/whisper-large-v3 openai/whisper-large-v3
7. Create a .env file with your OpenAI API key:
echo "OPENAI_API_KEY=your-api-key-here" > .env
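The README does not say how the pipeline reads this file. As a rough illustration of what loading it involves, the sketch below parses KEY=VALUE lines into the process environment with the standard library; `load_env_file` is a hypothetical helper, and the project may well use a dedicated package instead.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines into os.environ, skipping blanks and comments."""
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for line in env_path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        os.environ.setdefault(key, value)  # variables already set are not overwritten
        loaded[key] = value
    return loaded


if __name__ == "__main__":
    load_env_file()
    print("OPENAI_API_KEY set:", "OPENAI_API_KEY" in os.environ)
```

Keep the .env file out of version control so the key is not committed by accident.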
Usage
Command Line
Transcribe audio files from the terminal:
# With your own audio file
uv run python src/approaches/dspy_whisper_gpt_4o/pipeline.py \
  --audio-files /path/to/your/audio.wav

# If you downloaded sample data (multiple files)
uv run python src/approaches/dspy_whisper_gpt_4o/pipeline.py \
  --audio-files data/processed/fluencybank/wav_clips/27fb_086_000.wav \
  data/processed/fluencybank/wav_clips/44m_381_000.wav

# With custom model directory
uv run python src/approaches/dspy_whisper_gpt_4o/pipeline.py \
  --model-dir ./data/models/whisper-large-v3 \
  --audio-files /path/to/your/audio.wav
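For a whole directory of clips, the same command line can be assembled programmatically. This is a small sketch that only uses the `--audio-files` and `--model-dir` flags shown above; `build_command` is a hypothetical helper, and the clip directory is taken from the sample-data example.

```python
import subprocess
from pathlib import Path

PIPELINE = "src/approaches/dspy_whisper_gpt_4o/pipeline.py"


def build_command(audio_files, model_dir=None):
    """Assemble the argv list for the pipeline CLI invocation."""
    cmd = ["uv", "run", "python", PIPELINE]
    if model_dir is not None:
        cmd += ["--model-dir", str(model_dir)]
    cmd += ["--audio-files", *map(str, audio_files)]
    return cmd


if __name__ == "__main__":
    clips = sorted(Path("data/processed/fluencybank/wav_clips").glob("*.wav"))
    if clips:  # only launch the pipeline when there is something to transcribe
        subprocess.run(build_command(clips), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.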
Python API
from pathlib import Path
import sys
# Add src to path
sys.path.insert(0, 'src')
import librosa
from approaches.dspy_whisper_gpt_4o.pipeline import DspyPipeline
# Initialize pipeline
pipe = DspyPipeline(
    model_dir=Path("./data/models/whisper-large-v3"),
)
pipe.load_model()
# Load audio (librosa resamples to 22,050 Hz mono by default; sr holds that rate)
audio, sr = librosa.load("path/to/audio.wav")
# Transcribe
result = pipe.transcribe([audio], sr)
print(result)
Project Structure
src/
├── approaches/
│   ├── dspy_whisper_gpt_4o/
│   │   └── pipeline.py          # Main pipeline (Whisper + GPT-4o)
│   └── vanilla_whisper/
│       └── whisper.py           # WhisperTranscriber wrapper
└── evaluation/
    └── inference_models/
        └── asr_model_base.py    # Base class for ASR models
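The layout suggests that each approach implements a shared base class. The sketch below shows one way such an interface could look, modeled on the `load_model()` and `transcribe(audios, sr)` calls from the Python API example. `ASRModelBase`, `DurationModel`, and the list-of-strings return type are assumptions for illustration, not the contents of asr_model_base.py.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence


class ASRModelBase(ABC):
    """Hypothetical interface; the repository's real base class may differ."""

    @abstractmethod
    def load_model(self) -> None:
        """Load weights or open API clients."""

    @abstractmethod
    def transcribe(self, audios: List[Sequence[float]], sr: int) -> List[str]:
        """Return one transcript per input clip (assumed return type)."""


class DurationModel(ASRModelBase):
    """Toy subclass for demonstration: reports each clip's duration."""

    def __init__(self) -> None:
        self.loaded = False

    def load_model(self) -> None:
        self.loaded = True

    def transcribe(self, audios, sr):
        if not self.loaded:
            raise RuntimeError("call load_model() first")
        return [f"{len(a) / sr:.2f}s of audio" for a in audios]


if __name__ == "__main__":
    model = DurationModel()
    model.load_model()
    print(model.transcribe([[0.0] * 16000], 16000))
```

A common interface of this kind lets evaluation code swap one approach for another without changing how it calls them.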
Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md) for contribution guidelines.
Notability
notability 5.0/10
New repo from Microsoft on ASR, no traction info