RepoAmazon (Nova)Amazon (Nova)published Sep 30, 2024seen 5d

amazon-science/uniqsketch

C++

Open original ↗

Captured source

source ↗
published Sep 30, 2024seen 5dcaptured 10hhttp 200method plain

amazon-science/uniqsketch

Description: UniqSketch: sensitive and resource-efficient strain-level detection in metagenomes

Language: C++

License: MIT

Stars: 1

Forks: 1

Open issues: 0

Created: 2024-09-30T21:18:24Z

Pushed: 2026-04-28T19:18:46Z

Default branch: main

Fork: no

Archived: no

README:

UniqSketch: Sensitive and resource-efficient strain/abundance identification in metagenomics data

UniqSketch is a sensitive tool for identifying strains and their relative abundances in metagenomics samples. The algorithm consists of two stages: Indexing and Querying.

  • Indexing: takes a list of target sequences (usually reference genomes) and identifies all k-mers within each target that uniquely belong to that specific sequence. It then checks a set of criteria such as low-complexity on the k-mer candidates to refine the list and provide the final set of unique k-mers for each target. To efficiently keep track of k-mers and their frequencies, UniqSketch utilizes the Bloom filter data structure as an alternative to hash tables.
  • Querying: raw read sequencing data are queried against the sketch index. To remove k-mers caused by sequencing errors (k-mers with very low depth), UniqSketch utilizes a cascading Bloom filter to discard weak k-mers. The solid k-mers are then queried against the index and the count for the related reference target is updated accordingly. After all solid k-mers in the raw read input data are queried, the reference count table is computed to generate the final output of most abundant references.

Installation

conda installation (recommended)

conda install -c bioconda -c conda-forge uniqsketch

Using CMake

git clone https://github.com/amazon-science/uniqsketch
cd uniqsketch
cmake -S . -B build
cmake --build build

Binaries are placed in build/bin/.

Using Make

git clone https://github.com/amazon-science/uniqsketch
cd uniqsketch
make

Binaries are placed in bin/.

Verifying installation

uniqsketch --help

Running Tests

# CMake
cmake --build build --target test

# Make
make test

Quick Start

UniqSketch consists of three utilities:

  • uniqsketch: build a signature index from a reference/target database
  • querysketch: query input read files against a constructed index
  • comparesketch: compute pairwise similarity between two sets of reference genomes

1. Construct a sketch index of size 81bp from three references and store it in sketch_index.tsv:

uniqsketch --sensitive -k81 -o sketch_index.tsv ref_1.fa ref_2.fa ref_3.fa

2. Cluster nearly identical references and build a sketch from representatives only:

uniqsketch --cluster=2000 --sensitive -k81 -o sketch_index.tsv @refs.txt

References with fewer than 2000 unique k-mers between them are merged into clusters. A clusters.tsv file is written to the output directory mapping each reference to its cluster representative.

3. Query a metagenomics sample with read files r1.fq.gz and r2.fq.gz:

querysketch --r1 r1.fq.gz --r2 r2.fq.gz --ref sketch_index.tsv --out out_sample.tsv

Output

uniqsketch

  • sketch_uniq.tsv: primary output — tab-separated file storing the final set of signatures for each reference.
  • outdir_uniqsketch/: folder containing a file per reference with all signature candidates.
  • db_uniq_count.tsv: tab-separated summary of all references and total number of candidate signatures.
  • clusters.tsv (when --cluster is used): tab-separated file mapping each reference to its cluster representative, with columns: cluster_id, representative, member, num_kmers.

querysketch

  • out_sample.tsv: primary output reporting references and their abundance within the sample:
ref abundance count
ref_1 0.7862 3033
ref_2 0.1993 769
ref_3 0.01426 55
  • log_out_sample.tsv: detailed log showing for each reference the total number of assigned reads and all signature:count pairs.
  • logread_out_sample.tsv: read-level log listing all matched read IDs per reference.

Usage

uniqsketch

Usage: uniqsketch [OPTION] @LIST_FILES (or FILES)
Creates unique sketch from a list of fasta reference files.
A list of files containing file names in each row can be passed with @ prefix.

Options:

-t, --threads=N use N parallel threads [1]
-k, --kmer=N the length of kmer [81]
-b, --bit=N use N bits per element in Bloom filter [128]
-d, --hash1=N distinct Bloom filter hash number [5]
-s, --hash2=N repeat Bloom filter hash number [5]
-c, --cov=N number of unique k-mers to represent a reference [100]
-f, --outdir=STRING dir for universe unique k-mers [outdir_uniqsketch]
-o, --out=STRING the output sketch file name [sketch_uniq.tsv]
-r, --stat=STRING the output unique kmer stat file name [db_uniq_count.tsv]
-e, --entropy sets the aggregate entropy rate threshold [0.65]
--cluster=N cluster similar references with N unique k-mer threshold [0=off]
--sensitive sets sensitivity parameter c to 100
--very-sensitive sets sensitivity parameter c to 1000
--help display this help and exit
--version output version information and exit

querysketch

Usage: querysketch [OPTIONS] [ARGS]
Identify references and their abundance in FILE(S).

Acceptable file formats: fastq in compressed formats gz, bz, zip, xz.

Options:

-t, --threads=N use N parallel threads [1]
-b, --bit=N use N bits per element in Bloom filter [64]
-o, --out=STRING the output file name
-l, --r1=STRING input read 1
-r, --r2=STRING input read 2
-g, --ref=STRING input uniqsketch reference
-h, --hit=N number of uniqsketch hits to call a reference [10]
-a, --acutoff=N abundance cutoff to report [0.0]
-s, --rcutoff=N read cutoff to report [2]
--sensitive sets sensitivity parameter h=10
--very-sensitive sets sensitivity parameter h=5
--solid only use solid k-mers in reads
--help display this help and exit
--version version information and exit

comparesketch

Usage: comparesketch [OPTION] LIST1 LIST2
Compare two sets of fasta reference files |LIST1|*|LIST2|.
Two lists of files containing file paths in each row.

Options:

-t, --threads=N use N parallel threads [1]
-k, --kmer=N the length of kmer [81]
-b, --bit=N use N bits per element in Bloom filter [16]
-d, --hash=N Bloom filter hash number [3]
-g, --gsize=N approximate size for reference sequence [5000000]
-o, --out=STRING the output similarity file name [reference_similarity.tsv]
--help display this help and exit
--version…

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

New Amazon research repo, low traction