RepoAmazon (Nova)Amazon (Nova)published Oct 22, 2023seen 5d

amazon-science/avgen-eval-toolkit

Python

Open original ↗

Captured source

source ↗

amazon-science/avgen-eval-toolkit

Language: Python

License: Apache-2.0

Stars: 19

Forks: 5

Open issues: 6

Created: 2023-10-22T02:02:11Z

Pushed: 2026-02-05T18:29:50Z

Default branch: main

Fork: no

Archived: no

README:

Audio-Visual Synchrony Evaluation

Paper: Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers' Opinion Scores

If you use this work, please cite:

@misc{goncalves2024peavs,
title={PEAVS: Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers' Opinion Scores},
author={Lucas Goncalves and Prashant Mathur and Chandrashekhar Lavania and Metehan Cekic and Marcello Federico and Kyu J. Han},
year={2024},
eprint={2404.07336},
archivePrefix={arXiv},
primaryClass={cs.CV}
}

Environment

Create the environment from the environment.yml file:

conda create --name eval_toolkit --file spec-file.txt
source activate eval_toolkit
pip install -r requirements.txt

Make sure you have cmd line FFmpeg 6.0 installed. You can check for installation in the terminal, by typing ffmpeg -version and press enter

Activate environment with source activate eval_toolkit

Video Evaluation

This toolbox includes the following metrics:

  • FVD: Frechet Video Distance
  • FID: Frechet Inception distance, realized by inceptionv3
  • KID: Kernel Inception Distance
  • LPIPS: Learned Perceptual Image Patch Similarity
  • MiFID: Memorization-Informed Frechet Inception Distance
  • SSIM: Structural Similarity Index Measure
  • MS-SSIM: Multi-Scale SSIM
  • PSNR: Peak Signal-to-Noise Ratio
  • PSNRB: Peak Signal To Noise Ratio With Blocked Effect
  • VMAF: Video Multi-Method Assessment Fusion
  • VIF: Visual Information Fidelity
  • CLIP-Score: Implemented with CLIP VIT model

Running the metric

python3 run_video_eval.py --preds_folder /path/to/generated/videos --target_folder /path/to/the/target/videos --num_frames {Number of frames in your video or to be used for evaluation} --output path/to/NAME_YOUR_RESULTS_FILE.txt

NOTE:

  • Using --metrics you can specify the metrics you want to run ['fvd', 'ssim', 'psnr', 'psnrb', 'lpips', 'ms_ssim', 'clip', 'fid', 'kid', 'mifid', 'vmaf', 'vif'].
  • To run CLIP-Score, name your video as per the caption used to generate it (e.g., “A horse running on a road” becomes A_horse_running_on_a_road.mp4).
  • Ensure the /path/to/generated/videos and /path/to/the/target/videos directories contain an equal number of files with matching names.
  • With --clip_model, choose the version of the CLIP model to use. Default is ‘openai/clip-vit-base-patch16’.
  • With --net, specify the backbone network for LPIPS (options: ‘alex’, ‘vgg’, ‘squeeze’. Default is ‘alex’).
  • With --feat_layer, select the inceptionv3 feature layer for FID, KID, and MIFID computation (options: 64, 192, 768, 2048. Default is 64).
  • With --calculate_per_frame, set the number of frame steps for each metric computation (e.g., for a 32-frame video and argument of 8, metrics will be computed every 8 frames). Default is 8. All metrics will run over the entire video frames regardless of this argument.
  • If you have issues running VMAF please ensure you have cmd line FFmpeg 6.0 and ffmpeg-quality-metrics installed

Audio Evaluation

This toolbox includes the following metrics:

  • FAD: Frechet audio distance
  • ISc: Inception score
  • FD: Frechet distance, realized by PANNs, a state-of-the-art audio classification model
  • KL: KL divergence (softmax over logits)
  • KL_Sigmoid: KL divergence (sigmoid over logits)
  • SI_SDR: Scale-Invariant Signal-to-Distortion Ratio
  • SDR: Signal-to-Distortion Ratio
  • SI_SNR: Scale-Invariant Signal-to-Noise Ratio
  • SNR: Signal-to-Noise Ratio
  • PESQ: Perceptual Evaluation of Speech Quality
  • STOI: Short-Time Objective Intelligibility
  • CLAP-Score: Implemented with LAION-AI/CLAP

Running the metric

python3 run_audio_eval.py --preds_folder /path/to/generated/audios --target_folder /path/to/the/target_audios --metrics SI_SDR SDR SI_SNR SNR PESQ STOI CLAP FAD ISC FD KL --results NAME_YOUR_RESULTS_FILE.txt

NOTE:

  • Using --metrics, you can specify the metrics to run [SI_SDR, SDR, SI_SNR, SNR, PESQ, STOI, CLAP, FAD, ISC, FD, KL].
  • If no /path/to/the/target/audios is provided, only reference-free metrics will be run on your audios.
  • For CLAP-Score, name your audio as per the caption used to generate it (e.g., “A car revving while honking” becomes A_car_revving_while_honking.wav).
  • Ensure /path/to/generated/audios and /path/to/the/target/audios directories contain an equal number of files with matching names. Otherwise, only reference-free metrics will be run.
  • Using --clap_model, you can can specify the CLAP model id to be used for CLAP score computations. clap_model = 0 --> 630k non-fusion ckpt; clap_model = 1 --> 630k+audioset non-fusion ckpt; clap_model = 2 --> 630k fusion ckpt; clap_model = 3 --> 630k+audioset fusion ckpt

For general audio less than 10-sec: use 0 or 1; For general audio with variable-length: use 2 or 3; more model can be use with a few modifications (please refer to: https:github.com/LAION-AI/CLAP)

Distortions Generation

This script applies specified distortions to audio, video, or audio-visual media.

Dependencies

Create the environment from the environment.yml file:

conda env create -f environment_distortions.yml

Getting Started

2. Run the script:

python3 [YOUR SCRIPT NAME].py --input_path [PATH_TO_MEDIA_FOLDER] --dest_path [PATH_TO_DESTINATION_FOLDER] --media_type [TYPE_OF_MEDIA] [OTHER_OPTIONS]

Arguments

  • --input_path: Path to the folder containing media files you want to apply distortions to.
  • --dest_path: Path to the destination folder where the distorted media files will be saved.
  • --media_type: Specify the type of media you want to distort. Options are:
  • audios
  • videos
  • audiovisual

Optional Distortion Type Arguments

Specify specific distortion types or use the default ones:

  • --audio_distortions: List of distortion types for audio content. (e.g., gaussian, pops, low_pass, etc.)
  • --visual_distortions: List of distortion types for visual content. (e.g., speed_up, black_rectangles, gaussian_noise, etc.)
  • --audiovisual_distortions: List of distortion types for audio-visual content. (e.g., audio_shift, audio_speed_up, video_speed_up, etc.)

Note: Each distortion argument accepts multiple values. For example:…

Excerpt shown — open the source for the full document.