amazon-science/LARCQ
Description: Codes of LARCQ Paper (Interspeech 2025)
Language: Python
License: Apache-2.0
Stars: 0
Forks: 0
Open issues: 10
Created: 2025-08-12T22:35:31Z
Pushed: 2026-02-05T18:59:38Z
Default branch: main
Fork: no
Archived: no
README:
🚀 Official code for our Interspeech paper *On Retrieval of Long Audios with Complex Text Queries*
- Project website https://sites.google.com/view/larcq
- Paper https://www.isca-archive.org/interspeech_2025/yang25n_interspeech.html
@inproceedings{yang25n_interspeech,
title = {On Retrieval of Long Audios with Complex Text Queries},
author = {Ruochu Yang and Milind Rao and Harshavardhan Sundar and Anirudh Raju and Aparna Khare and Srinath Tankasala and Di He and Venkatesh Ravichandran},
year = {2025},
booktitle = {Interspeech 2025},
pages = {2660--2664},
doi = {10.21437/Interspeech.2025-2085},
issn = {2958-1796},
}
Prerequisite
1. Configure environments
conda create -n larcq python=3.10
conda activate larcq
pip install -r requirements.txt
pip install -e hf-dev-train/transformers-main
pip install -e peft-main
2. Download benchmarks
Save the benchmarks in the datasets folder.
Due to license restrictions, we cannot open-source our Clotho_LARCQ and SoundDescs_LARCQ benchmarks. However, we provide the code for generating them, and you can use the same code to generate any LARCQ-style benchmark you want.
3. Download models
- Download the clap-htsat-fused model from the Hugging Face model link. Save the model in the models folder.
- Download the gpt2 model from the Hugging Face model link. Save the model in the models folder.
- Download the Llama-2-7b-chat-hf-qformer folder from the Google Drive website link. Save the folder in the models folder.
- Download the stage5_epoch2 folder from the Google Drive website link. Unzip and save the folder in the models folder.
- Download the clapcap_weights_2023.pth checkpoint from the Hugging Face website link. Save the checkpoint in the models folder.
- Download the opt-iml-max-1.3b folder from the Hugging Face website link. Unzip and save the folder in the models folder.
- Download the foundation.pt checkpoint from the Hugging Face website link. Save the checkpoint in the models folder.
- Download the ms-marco-MiniLM-L-6-v2 folder from the Hugging Face website link. Unzip and save the folder in the models folder.
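Before running anything, it can help to confirm that all eight downloads landed in the models folder. The following is a minimal sketch, not part of the repository: the helper name missing_model_entries is hypothetical, and it assumes each item sits directly under models/ with exactly the name listed above.

```python
from pathlib import Path

# Names copied from the download list above. The flat layout directly
# under models/ is an assumption, not something the repository confirms.
EXPECTED_MODEL_ENTRIES = [
    "clap-htsat-fused",
    "gpt2",
    "Llama-2-7b-chat-hf-qformer",
    "stage5_epoch2",
    "clapcap_weights_2023.pth",
    "opt-iml-max-1.3b",
    "foundation.pt",
    "ms-marco-MiniLM-L-6-v2",
]


def missing_model_entries(models_dir="models"):
    """Return the expected files/folders that do not exist under models_dir."""
    root = Path(models_dir)
    return [name for name in EXPECTED_MODEL_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_model_entries()
    if missing:
        print("Missing from models/:", ", ".join(missing))
    else:
        print("All expected model files and folders are present.")
```

If your local layout differs (for example, nested subfolders after unzipping), adjust the names in the list accordingly.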
4. Nvidia GPUs
The results in the paper were generated on a machine with Nvidia GPUs. We recommend having four GPUs, with nvidia-smi installed and working.
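A quick way to check how many GPUs are visible is to count the lines printed by nvidia-smi -L. This is a small sketch under the assumption that nvidia-smi is on your PATH; the helper name count_visible_gpus is hypothetical and is not part of the repository.

```python
import shutil
import subprocess


def count_visible_gpus():
    """Count GPUs listed by `nvidia-smi -L`; return 0 if the tool is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return 0
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return 0
    # Each GPU is reported on its own line starting with "GPU <index>:".
    return sum(1 for line in result.stdout.splitlines() if line.startswith("GPU "))


if __name__ == "__main__":
    print(f"Visible Nvidia GPUs: {count_visible_gpus()}")
```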
LARCQ Benchmark Generation
1. Clotho_LARCQ benchmark
We provide the code for generating our Clotho_LARCQ benchmark from the Clotho Version 2.1 dataset, so you can follow the same steps to create any LARCQ benchmark you want.
(1) Download the clotho_audio_evaluation.7z folder and the clotho_captions_evaluation.csv file from the Zenodo website link. Save them in the datasets/Clotho folder.
(2) Synthesize long-audio-long-query pairs as LARCQ benchmarks
Run terminal command
python -m benchmark_generation.synthesize
The raw LARCQ captions are saved as datasets/Clotho_LARCQ/raw_LARCQ_captions.csv
The LARCQ audios are saved as datasets/Clotho_LARCQ/audios/
(3) Run LLMs to refine the raw LARCQ captions
We use two options to refine the raw LARCQ captions into natural long queries.
- Condense the raw captions
Run terminal command
python -m benchmark_generation.llm_condense
The condensed LARCQ captions are saved as datasets/Clotho_LARCQ/condensed_caption.csv
- Rephrase the raw captions
Run terminal command
python -m benchmark_generation.llm_rephrase
The rephrased LARCQ captions are saved as datasets/Clotho_LARCQ/rephrased_caption.csv
2. SoundDescs_LARCQ benchmark
(1) Download the original SoundDescs dataset from the official GitHub website link. Save it in the datasets/SoundDescs folder.
(2) We keep audios between 75 and 150 seconds long whose captions exceed 150 characters, and treat those captions as complex queries. This yields 1639 audio-query pairs, which form our SoundDescs_LARCQ benchmark.
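The filtering rule in step (2) can be sketched as follows. This is an illustration only, not the repository's script: the AudioCaptionPair record and function names are hypothetical, and treating the 75 and 150 second bounds as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class AudioCaptionPair:
    """Hypothetical record for one SoundDescs item (not the repo's data format)."""
    audio_id: str
    duration_s: float
    caption: str


def is_larcq_candidate(pair, min_s=75.0, max_s=150.0, min_chars=150):
    """Keep audios of 75-150 s whose caption is longer than 150 characters.

    Inclusive duration bounds are an assumption; the README only says
    "between 75-150 seconds".
    """
    return min_s <= pair.duration_s <= max_s and len(pair.caption) > min_chars


def filter_pairs(pairs):
    """Return only the pairs that satisfy the LARCQ length criteria."""
    return [p for p in pairs if is_larcq_candidate(p)]


if __name__ == "__main__":
    examples = [
        AudioCaptionPair("a", 90.0, "x" * 200),   # kept
        AudioCaptionPair("b", 30.0, "x" * 200),   # audio too short
        AudioCaptionPair("c", 120.0, "x" * 100),  # caption too short
    ]
    print([p.audio_id for p in filter_pairs(examples)])
```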
Run Pipeline
Our pipeline consists of two main parts: multi-modal retrieval and ALM/LLM refining.
1. Run multi-modal retrieval
The retrieval scripts are in the folder pipeline/multi_modal_retrieval. Each script is independent and can be directly executed, which means that you can evaluate any method on any dataset for comprehensive comparison.
(1) retrieval_no_chunking.py retrieves the relevant audios for the queries without any audio chunking or query chunking.
Run terminal command
python -m pipeline.multi_modal_retrieval.retrieval_no_chunking
Retrieved short-list audios are saved as results/retrieved_results/{benchmark}/retrieved_audios_no_chunking.csv
(2) retrieval_audio_chunking.py retrieves the relevant audios for the queries with audio chunking only (max/sum vote), without any query chunking.
Run terminal command
python -m pipeline.multi_modal_retrieval.retrieval_audio_chunking
Retrieved short-list audios are saved as results/retrieved_results/{benchmark}/retrieved_audios_audio_chunking.csv
(3) retrieval_query_chunking.py retrieves the relevant audios for the queries with query chunking only (max/sum vote), without any audio chunking.
Run terminal command
python -m pipeline.multi_modal_retrieval.retrieval_query_chunking
Retrieved short-list audios are saved as results/retrieved_results/{benchmark}/retrieved_audios_query_chunking.csv
(4) retrieval_audio_chunking_query_chunking.py retrieves the audios with each of the four combinations of audio chunking vote × query chunking vote: max × sum, sum × sum, sum × max, and max × max.
Run terminal command `python -m…
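To make the max/sum vote combinations concrete, here is a small sketch of how chunk-level similarities could be collapsed into one score per query-audio pair. It is an illustration under stated assumptions, not the repository's implementation: the function names are hypothetical, the similarity values are made up, and reducing over audio chunks first and query chunks second is an assumed order.

```python
# Vote functions applied to a list of chunk-level similarity scores.
VOTES = {"max": max, "sum": sum}


def aggregate_score(sim, audio_vote="max", query_vote="sum"):
    """Collapse a (query chunks x audio chunks) similarity grid into one score.

    sim[i][j] is the similarity between query chunk i and audio chunk j.
    Assumption: the audio vote is applied first (per query chunk), then the
    query vote is applied to the resulting per-query-chunk scores.
    """
    per_query_chunk = [VOTES[audio_vote](row) for row in sim]
    return VOTES[query_vote](per_query_chunk)


def rank_audios(sims_by_audio, audio_vote, query_vote, top_k=10):
    """Return the top_k audio ids ordered by aggregated score, best first."""
    scores = {
        audio_id: aggregate_score(sim, audio_vote, query_vote)
        for audio_id, sim in sims_by_audio.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


if __name__ == "__main__":
    # Made-up similarities for one query split into 2 chunks, against two
    # audios that are each split into 2 chunks.
    sims = {
        "audio_a": [[0.1, 0.9], [0.4, 0.2]],
        "audio_b": [[0.5, 0.5], [0.5, 0.5]],
    }
    for audio_vote in ("max", "sum"):
        for query_vote in ("max", "sum"):
            ranking = rank_audios(sims, audio_vote, query_vote)
            print(f"audio {audio_vote} x query {query_vote}: {ranking}")
```

In this toy example, max × max favors audio_a because of its single strong chunk match, while sum × sum favors audio_b because its matches are uniformly moderate. This is why comparing all four combinations on a benchmark can be informative.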