Repo | Amazon (Nova) | published Aug 12, 2025

amazon-science/LARCQ

Python

Description: Codes of LARCQ Paper (Interspeech 2025)

Language: Python

License: Apache-2.0

Stars: 0

Forks: 0

Open issues: 10

Created: 2025-08-12T22:35:31Z

Pushed: 2026-02-05T18:59:38Z

Default branch: main

Fork: no

Archived: no

README:

🚀 Official code for our Interspeech 2025 paper *On Retrieval of Long Audios with Complex Text Queries*

  • Project website https://sites.google.com/view/larcq
  • Paper https://www.isca-archive.org/interspeech_2025/yang25n_interspeech.html
@inproceedings{yang25n_interspeech,
title = {On Retrieval of Long Audios with Complex Text Queries},
author = {Ruochu Yang and Milind Rao and Harshavardhan Sundar and Anirudh Raju and Aparna Khare and Srinath Tankasala and Di He and Venkatesh Ravichandran},
year = {2025},
booktitle = {Interspeech 2025},
pages = {2660--2664},
doi = {10.21437/Interspeech.2025-2085},
issn = {2958-1796},
}

Prerequisite

1. Configure environments

conda create -n larcq python=3.10
conda activate larcq
pip install -r requirements.txt
pip install -e hf-dev-train/transformers-main
pip install -e peft-main

2. Download benchmarks

Save the benchmarks in the datasets folder.

Due to license restrictions, we cannot open-source our Clotho_LARCQ and SoundDescs_LARCQ benchmarks. However, we provide the code for generating them, and you can use the same code to generate any LARCQ-style benchmark you want.

3. Download models

  • Download the clap-htsat-fused model from the Hugging Face model link. Save the model in the models folder.
  • Download the gpt2 model from the Hugging Face model link. Save the model in the models folder.
  • Download the Llama-2-7b-chat-hf-qformer folder from the Google Drive website link. Save the folder in the models folder.
  • Download the stage5_epoch2 folder from the Google Drive website link. Unzip and save the folder in the models folder.
  • Download the clapcap_weights_2023.pth checkpoint from the Hugging Face website link. Save the checkpoint in the models folder.
  • Download the opt-iml-max-1.3b folder from the Hugging Face website link. Unzip and save the folder in the models folder.
  • Download the foundation.pt checkpoint from the Hugging Face website link. Save the checkpoint in the models folder.
  • Download the ms-marco-MiniLM-L-6-v2 folder from the Hugging Face website link. Unzip and save the folder in the models folder.
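
Since there are eight separate downloads, a quick check of the models folder can save a failed run later. The following is a minimal sketch, not part of the repository: the entry names are taken from the list above, while the flat layout directly under models/ and the helper name missing_model_entries are assumptions.

```python
from pathlib import Path

# Entry names come from the download list above.
# Assumption: each one sits directly under the models folder.
EXPECTED_MODEL_ENTRIES = [
    "clap-htsat-fused",
    "gpt2",
    "Llama-2-7b-chat-hf-qformer",
    "stage5_epoch2",
    "clapcap_weights_2023.pth",
    "opt-iml-max-1.3b",
    "foundation.pt",
    "ms-marco-MiniLM-L-6-v2",
]


def missing_model_entries(models_dir="models"):
    """Return the expected files or folders that are absent from models_dir."""
    root = Path(models_dir)
    return [name for name in EXPECTED_MODEL_ENTRIES if not (root / name).exists()]
```

Calling missing_model_entries() from the repository root returns an empty list once every download is in place; otherwise it lists what is still missing.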

4. Nvidia GPUs

The results in the paper were generated on a machine with Nvidia GPUs. We recommend having four GPUs available and confirming that nvidia-smi works before running the pipeline.
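
One way to confirm the GPU setup before running anything is to count the devices that nvidia-smi reports. This is a small sketch rather than repository code: it relies on the `nvidia-smi -L` listing, which prints one `GPU <index>: ...` line per device, and the function names are hypothetical.

```python
import shutil
import subprocess


def parse_gpu_list(listing):
    """Count devices in the text printed by `nvidia-smi -L`."""
    return sum(1 for line in listing.splitlines() if line.strip().startswith("GPU "))


def count_visible_gpus():
    """Return the number of GPUs nvidia-smi reports, or 0 if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return 0
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    if result.returncode != 0:
        return 0
    return parse_gpu_list(result.stdout)
```

If count_visible_gpus() returns fewer than four, the machine has fewer GPUs than recommended above.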

LARCQ Benchmark Generation

1. Clotho_LARCQ benchmark

We provide the code for generating our Clotho_LARCQ benchmark from the Clotho Version 2.1 dataset, so you can follow it to create any LARCQ benchmark you want.

(1) Download the clotho_audio_evaluation.7z archive and the clotho_captions_evaluation.csv file from the Zenodo website link. Save them in the datasets/Clotho folder.

(2) Synthesize long-audio-long-query pairs as LARCQ benchmarks

Run terminal command python -m benchmark_generation.synthesize

The raw LARCQ captions are saved as datasets/Clotho_LARCQ/raw_LARCQ_captions.csv. The LARCQ audios are saved in datasets/Clotho_LARCQ/audios/.
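
The synthesis step itself is not described in this README. As a rough illustration only, the sketch below assumes a long pair is built by concatenating several short clips and joining their captions into one raw caption. The function name, the num_clips parameter, and the representation of waveforms as lists of samples are all hypothetical and may differ from what benchmark_generation.synthesize does.

```python
import random


def synthesize_pair(clips, captions, num_clips=3, seed=0):
    """Build one long audio and one raw long caption from short clips.

    clips: list of waveforms, each a list of float samples at a shared sample rate.
    captions: list of caption strings aligned with clips.
    Assumption: concatenation in sampled order; the real script may differ.
    """
    if len(clips) != len(captions):
        raise ValueError("clips and captions must be aligned")
    rng = random.Random(seed)
    chosen = rng.sample(range(len(clips)), num_clips)
    # Concatenate the chosen waveforms end to end.
    long_audio = [sample for i in chosen for sample in clips[i]]
    # Join the matching captions in the same order.
    raw_caption = " ".join(captions[i].strip() for i in chosen)
    return long_audio, raw_caption, chosen
```

The joined caption produced this way reads as a list of disconnected sentences, which is why a refinement step follows.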

(3) Run LLMs to refine the raw LARCQ captions

We provide two options for refining the raw LARCQ captions into natural long queries.

  • Condense the raw captions

Run terminal command python -m benchmark_generation.llm_condense

The condensed LARCQ captions are saved as datasets/Clotho_LARCQ/condensed_caption.csv

  • Rephrase the raw captions

Run terminal command python -m benchmark_generation.llm_rephrase

The rephrased LARCQ captions are saved as datasets/Clotho_LARCQ/rephrased_caption.csv

2. SoundDescs_LARCQ benchmark

(1) Download the original SoundDescs dataset from the official GitHub website link. Save it in the datasets/SoundDescs folder.

(2) We keep audios that are between 75 and 150 seconds long and whose captions exceed 150 characters, treating those captions as complex queries. This yields 1639 audio-query pairs, which form our SoundDescs_LARCQ benchmark.
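
The filtering rule can be sketched in a few lines. This is an illustration under stated assumptions, not the repository's script: the record layout (audio id, duration in seconds, caption) and the function names are hypothetical, and treating the 75 and 150 second bounds as inclusive is a guess.

```python
def is_larcq_pair(duration_sec, caption, min_sec=75.0, max_sec=150.0, min_chars=150):
    """True if the audio length is within range and the caption is long enough.

    Assumption: duration bounds are inclusive; the caption must exceed min_chars.
    """
    return min_sec <= duration_sec <= max_sec and len(caption) > min_chars


def filter_pairs(records):
    """Keep (audio_id, duration_sec, caption) records that satisfy the rule."""
    return [(aid, dur, cap) for aid, dur, cap in records if is_larcq_pair(dur, cap)]
```

The same function works for other source datasets by changing the three thresholds.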

Run Pipeline

Our pipeline consists of two main parts: multi-modal retrieval and ALM/LLM refining.

1. Run multi-modal retrieval

The retrieval scripts are in the folder pipeline/multi_modal_retrieval. Each script is independent and can be run directly, so you can evaluate any method on any dataset for a comprehensive comparison.

(1) retrieval_no_chunking.py retrieves the relevant audios for the queries without any audio chunking or query chunking.

Run terminal command python -m pipeline.multi_modal_retrieval.retrieval_no_chunking

Retrieved short-list audios are saved as results/retrieved_results/{benchmark}/retrieved_audios_no_chunking.csv

(2) retrieval_audio_chunking.py retrieves the relevant audios using only audio chunking (max or sum vote), without any query chunking.

Run terminal command python -m pipeline.multi_modal_retrieval.retrieval_audio_chunking

Retrieved short-list audios are saved as results/retrieved_results/{benchmark}/retrieved_audios_audio_chunking.csv

(3) retrieval_query_chunking.py retrieves the relevant audios using only query chunking (max or sum vote), without any audio chunking.

Run terminal command python -m pipeline.multi_modal_retrieval.retrieval_query_chunking

Retrieved short-list audios are saved as results/retrieved_results/{benchmark}/retrieved_audios_query_chunking.csv

(4) retrieval_audio_chunking_query_chunking.py retrieves the audios with each of the four combinations of votes: audio chunking max vote × query chunking sum vote, audio chunking sum vote × query chunking sum vote, audio chunking sum vote × query chunking max vote, and audio chunking max vote × query chunking max vote. Run terminal command `python -m…
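
To make the max and sum votes concrete, here is a minimal sketch of how chunk-level similarity scores could be combined into one score per audio. It is an assumption-laden illustration, not the repository's implementation: the function names, the nested-list score layout, and the choice to aggregate over audio chunks first and query chunks second are all hypothetical.

```python
def aggregate(scores, vote):
    """Combine a list of scores with a max vote or a sum vote."""
    if vote == "max":
        return max(scores)
    if vote == "sum":
        return sum(scores)
    raise ValueError(f"unknown vote: {vote}")


def score_audio(sim, audio_vote="max", query_vote="sum"):
    """Score one audio against one query.

    sim[q][a] is the similarity between query chunk q and audio chunk a.
    Assumption: audio chunks are aggregated first, then query chunks.
    """
    per_query_chunk = [aggregate(row, audio_vote) for row in sim]
    return aggregate(per_query_chunk, query_vote)


def rank_audios(sim_by_audio, audio_vote="max", query_vote="sum", top_k=10):
    """Return the ids of the top_k audios, highest score first."""
    scored = {
        audio_id: score_audio(sim, audio_vote, query_vote)
        for audio_id, sim in sim_by_audio.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:top_k]
```

With no chunking, sim is a single 1 × 1 entry and both votes reduce to the plain query-audio similarity. A max vote favors an audio with one strongly matching chunk, while a sum vote favors an audio that matches moderately across many chunks, so the four combinations can produce different rankings.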
