obiuniq: dereplicate sequence data sets #

Preliminary AI-generated documentation

This page was automatically generated by an AI assistant and has not yet been reviewed or validated by the OBITools4 development team. It may contain inaccuracies or incomplete information. Use with caution and refer to the command’s --help output for authoritative option descriptions.

Description #

obiuniq groups identical sequences together and replaces each group with a single representative, recording the total number of original occurrences as an abundance count. Dereplication is a standard first step in amplicon sequencing workflows: a typical fastq file from an NGS run contains many copies of the same amplicon sequence, and reducing these to unique entries with counts dramatically reduces the computational burden on downstream tools.

By default, two sequences are considered identical if and only if their nucleotide strings are exactly the same. obiuniq scans through all input sequences, groups the duplicates, and writes one record per group in fasta format. The output record inherits the identifier of the first sequence in the group and carries a count attribute recording how many input sequences it represents.
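
The grouping rule described above can be sketched in a few lines of Python. This is only an illustration of the semantics, not the obiuniq implementation: the function name `dereplicate` and the (identifier, sequence) pair layout are assumptions made for the example.

```python
def dereplicate(records):
    """Group identical sequences, keeping the first identifier and a count.

    `records` is an iterable of (identifier, sequence) pairs; this layout is
    a hypothetical stand-in for parsed reads, not an OBITools4 structure.
    """
    groups = {}  # sequence -> [first identifier, count]; dicts keep insertion order
    for identifier, sequence in records:
        if sequence in groups:
            groups[sequence][1] += 1          # exact string match: same group
        else:
            groups[sequence] = [identifier, 1]  # first occurrence names the group
    return [(ident, seq, count) for seq, (ident, count) in groups.items()]


reads = [
    ("seq001", "ATCGATCGATCGATCGATCG"),
    ("seq002", "ATCGATCGATCGATCGATCG"),
    ("seq004", "GCTAGCTAGCTAGCTAGCTA"),
]
for ident, seq, count in dereplicate(reads):
    print(ident, seq, count)
```

The three reads collapse to two records: the first keeps the identifier seq001 with a count of 2, the second keeps seq004 with a count of 1.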

graph TD
  A@{ shape: doc, label: "reads.fastq" }
  C[obiuniq]
  D@{ shape: doc, label: "out_basic.fastq" }
  A --> C:::obitools
  C --> D
  classDef obitools fill:#99d57c

The file reads.fastq contains eight amplicon reads from two samples. Several reads share the same nucleotide sequence, representing genuine replication in the sequencing library.

📄 reads.fastq
@seq001 {"sample": "s1", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@seq002 {"sample": "s1", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@seq003 {"sample": "s1", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@seq004 {"sample": "s1", "primer": "p1"}
GCTAGCTAGCTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIIII
@seq005 {"sample": "s1", "primer": "p1"}
GCTAGCTAGCTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIIII
@seq006 {"sample": "s2", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@seq007 {"sample": "s2", "primer": "p1"}
TTTTTTTTTTTTTTTTTTTT
+
IIIIIIIIIIIIIIIIIIII
@seq008 {"primer": "p1"}
CCCCCCCCCCCCCCCCCCCC
+
IIIIIIIIIIIIIIIIIIII

Running obiuniq collapses all duplicates into unique representatives:

obiuniq reads.fastq -o out_basic.fastq
📄 out_basic.fastq
>seq008 {"count":1,"primer":"p1"}
cccccccccccccccccccc
>seq001 {"count":4,"primer":"p1"}
atcgatcgatcgatcgatcg
>seq004 {"count":2,"primer":"p1","sample":"s1"}
gctagctagctagctagcta
>seq007 {"count":1,"primer":"p1","sample":"s2"}
tttttttttttttttttttt

The eight input reads reduce to four unique sequences. The count attribute records how many reads were merged into each representative: seq001 (ATCGATCG…) appeared four times across both samples and now carries "count":4. Its output record no longer carries a sample attribute, apparently because the merged reads disagree on its value. Note that obiuniq always produces fasta output, even from fastq input and even when the output file name ends in .fastq as it does here, because quality scores cannot be meaningfully combined across merged reads.
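
Downstream scripts often need the abundance stored in the title line. The sketch below reads the JSON annotations shown in out_basic.fastq with Python's standard `json` module and checks that the counts add up to the eight input reads. The helper name `read_counts` is hypothetical, and treating a missing count as 1 is an assumption of this example.

```python
import json


def read_counts(fasta_text):
    """Return {identifier: count} from FASTA text with JSON title-line annotations."""
    counts = {}
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # skip sequence lines
        identifier, _, annotation = line[1:].partition(" ")
        attributes = json.loads(annotation) if annotation else {}
        counts[identifier] = attributes.get("count", 1)  # assumption: missing count means 1
    return counts


example = """>seq008 {"count":1,"primer":"p1"}
cccccccccccccccccccc
>seq001 {"count":4,"primer":"p1"}
atcgatcgatcgatcgatcg
>seq004 {"count":2,"primer":"p1","sample":"s1"}
gctagctagctagctagcta
>seq007 {"count":1,"primer":"p1","sample":"s2"}
tttttttttttttttttttt
"""
counts = read_counts(example)
print(counts["seq001"])      # 4
print(sum(counts.values()))  # 8, the number of input reads
```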

When the same sequence genuinely occurs in multiple experimental samples and per-sample abundance matters, grouping by sequence alone is too aggressive. The --category-attribute option (repeatable, short: -c) adds metadata fields to the grouping criterion: two reads are only merged when they are nucleotide-identical and share the same value for every listed attribute. Combining -c sample with --no-singleton groups reads per sample and then discards groups with a single member, a common noise-reduction step to remove likely sequencing errors:

obiuniq -c sample --no-singleton reads.fastq -o out_no_singleton.fastq
📄 out_no_singleton.fastq
>seq001 {"count":3,"primer":"p1","sample":"s1"}
atcgatcgatcgatcgatcg
>seq004 {"count":2,"primer":"p1","sample":"s1"}
gctagctagctagctagcta

Of the five per-sample groups formed, only the two from sample s1 have more than one member and survive the singleton filter. The two groups from s2 and the group holding the unannotated read (seq008) each contain a single read and are discarded.
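
The effect of -c sample combined with --no-singleton can be mimicked with a composite grouping key. The following is a minimal sketch of that logic under stated assumptions, not the obiuniq code: the function `dereplicate_by` and the (identifier, sequence, attributes) triple layout are hypothetical, and only the sample attribute of each read is reproduced.

```python
def dereplicate_by(records, categories, na_value="NA", no_singleton=False):
    """Group reads by sequence plus the listed attributes.

    `records` holds (identifier, sequence, attributes) triples; this layout
    is a hypothetical stand-in for parsed reads.
    """
    groups = {}  # (sequence, category values...) -> [first identifier, count]
    for identifier, sequence, attributes in records:
        # Reads lacking a category attribute fall under the placeholder value.
        key = (sequence,) + tuple(attributes.get(c, na_value) for c in categories)
        group = groups.setdefault(key, [identifier, 0])
        group[1] += 1
    result = [(ident, key, count) for key, (ident, count) in groups.items()]
    if no_singleton:
        result = [g for g in result if g[2] > 1]  # drop groups with one member
    return result


A, G, T, C = "ATCGATCGATCGATCGATCG", "GCTAGCTAGCTAGCTAGCTA", "T" * 20, "C" * 20
reads = [
    ("seq001", A, {"sample": "s1"}), ("seq002", A, {"sample": "s1"}),
    ("seq003", A, {"sample": "s1"}), ("seq004", G, {"sample": "s1"}),
    ("seq005", G, {"sample": "s1"}), ("seq006", A, {"sample": "s2"}),
    ("seq007", T, {"sample": "s2"}), ("seq008", C, {}),
]
kept = dereplicate_by(reads, ["sample"], no_singleton=True)
for ident, key, count in kept:
    print(ident, key[1], count)
```

On the eight example reads this prints seq001 with count 3 and seq004 with count 2, both for sample s1, which matches the two records in out_no_singleton.fastq.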

Synopsis #

obiuniq [--batch-mem <string>] [--batch-size <int>] [--batch-size-max <int>]
        [--category-attribute|-c <CATEGORY>]... [--chunk-count <int>]
        [--compress|-Z] [--csv] [--debug] [--ecopcr] [--embl]
        [--fail-on-taxonomy] [--fasta] [--fasta-output] [--fastq]
        [--fastq-output] [--genbank] [--help|-h|-?] [--in-memory]
        [--input-OBI-header] [--input-json-header] [--json-output]
        [--max-cpu <int>] [--merge|-m <KEY>]... [--na-value <NA_NAME>]
        [--no-order] [--no-progressbar] [--no-singleton]
        [--out|-o <FILENAME>] [--output-OBI-header|-O] [--output-json-header]
        [--pprof] [--pprof-goroutine <int>] [--pprof-mutex <int>]
        [--raw-taxid] [--silent-warning] [--skip-empty] [--solexa]
        [--taxonomy|-t <string>] [--u-to-t] [--update-taxid] [--version]
        [--with-leaves] [<args>]

Options #

obiuniq specific options #

  • --category-attribute | -c <CATEGORY>: Adds one metadata attribute to the grouping criterion. Two sequences are placed in the same group only when they are nucleotide-identical and share the same value for every attribute listed with -c. Can be repeated to combine multiple attributes (e.g. -c sample -c primer). Records missing a listed attribute receive the value set by --na-value.
  • --chunk-count <int>: Controls how many internal partitions the dataset is split into during processing (default: 100). A higher value reduces per-partition memory usage at the cost of more temporary files; a lower value reduces I/O at the cost of higher peak memory.
  • --in-memory: Stores intermediate data chunks in RAM rather than in temporary disk files. Speeds up processing for datasets that fit comfortably in available memory; omit this flag for large datasets that exceed available RAM.
  • --merge | -m <KEY>: Creates an output attribute named merged_KEY that maps each observed value of the KEY attribute to the count of input sequences carrying that value within the group. Can be repeated to track multiple attributes. Useful for tracking which sample or category contributions were collapsed into each group.
  • --na-value <NA_NAME>: Value assigned to a category attribute when a sequence record does not carry that attribute (default: "NA"). All sequences lacking the attribute are grouped together under this placeholder.
  • --no-singleton: Discards all output records whose abundance count is exactly one, i.e. groups that contain a single input sequence. When -c is used, the count is evaluated per group (sequence plus category values), not per sequence across the entire input. Removing singletons is a standard heuristic for excluding sequencing errors from further analysis.
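
How the partitions controlled by --chunk-count are built is not described on this page. One common way to dereplicate data in partitions, shown here purely as an assumption, is to route every sequence to a chunk by hashing it: identical sequences then always land in the same chunk, so each chunk can be counted on its own with a small memory footprint. The names `chunk_index` and `dereplicate_in_chunks` are hypothetical.

```python
import zlib
from collections import Counter


def chunk_index(sequence, chunk_count):
    """Map a sequence to a chunk; identical sequences always get the same index."""
    return zlib.crc32(sequence.encode("ascii")) % chunk_count


def dereplicate_in_chunks(sequences, chunk_count=100):
    """Count sequences chunk by chunk (assumed scheme, not the obiuniq code)."""
    chunks = [[] for _ in range(chunk_count)]
    for sequence in sequences:
        chunks[chunk_index(sequence, chunk_count)].append(sequence)
    totals = Counter()
    for chunk in chunks:
        # Each chunk holds only a fraction of the data and is counted independently.
        totals.update(Counter(chunk))
    return totals


print(dereplicate_in_chunks(["ACGT", "ACGT", "TTTT"], chunk_count=4))
```

In this sketch the chunks are plain Python lists held in memory, which corresponds loosely to the --in-memory behaviour; writing each chunk to a temporary file instead would trade speed for lower peak memory.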

Taxonomic options #

  • --taxonomy | -t <string>: Path to the taxonomic database.
  • --fail-on-taxonomy: Causes obiuniq to exit with an error if a taxid in the data is not a currently valid taxon in the loaded taxonomy.

  • --raw-taxid: Print taxids in output without supplementary information (taxon name and rank).
  • --update-taxid: Automatically replace merged taxids with the most recent valid taxid.
  • --with-leaves: When taxonomy is extracted from a sequence file, add sequences as leaves of their taxid annotation.

Controlling the input data #

OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step.
The file format options #
  • --fasta: indicates that sequence data is in fasta format.
  • --fastq: indicates that sequence data is in fastq format.
  • --embl: indicates that sequence data is in EMBL-ENA flatfile format.
  • --csv: indicates that sequence data is in CSV format.
  • --genbank: indicates that sequence data is in GenBank flatfile format.
  • --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.
Controlling the way OBITools4 formats annotations #
These options only apply to the FASTA and FASTQ formats.
  • --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
  • --input-json-header: FASTA/FASTQ title line annotations follow the JSON format.
Controlling quality score decoding #
This option only applies to the FASTQ format.
  • --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)

Controlling the output data #

  • --compress | -Z : output is compressed using gzip. (default: false)
  • --no-order: by default, OBITools ensure that the order of sequences in the output file matches their order in the input file, and when multiple files are given they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Processing multiple files in parallel may also require more memory.
  • --fasta-output: writes sequence data in fasta format (default if quality data is not available).
  • --fastq-output: writes sequence data in fastq format (default if quality data is available).
  • --json-output: writes sequence data in JSON format.
  • --out | -o <FILENAME>: filename used for saving the output (default: “-”, the standard output)
  • --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON).
  • --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format).
  • --skip-empty: sequences of length equal to zero are removed from the output (default: false).
  • --no-progressbar: deactivates progress bar display (default: false).

General options #

  • --help | -h|-? : shows this help.
  • --version: prints the version and exits.
  • --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.
  • --max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
  • --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
  • --batch-size <INTEGER>: minimum number of sequences per batch for parallel processing (floor, default: 1, env: OBIBATCHSIZE)
  • --batch-size-max <INTEGER>: maximum number of sequences per batch for parallel processing (ceiling, default: 2000, env: OBIBATCHSIZEMAX)
  • --batch-mem <STRING>: maximum memory per batch (e.g. 128K, 64M, 1G; default: 128M; set to 0 to disable, env: OBIBATCHMEM)
  • --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
  • --pprof: enables pprof server. Look at the log for details. (default: false).
  • --pprof-mutex <INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
  • --pprof-goroutine <INTEGER>: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)

Examples #

Track sample contributions across merged groups using --merge:

The file reads.fastq contains amplicon reads from two samples. Dereplicating per sample and recording in merged_sample how many reads from each sample were merged into each group reveals the sample origin of every representative. Sequences with no sample attribute are grouped under the placeholder UNKNOWN.

📄 reads.fastq
@seq001 {"sample": "s1", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@seq002 {"sample": "s1", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@seq003 {"sample": "s1", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@seq004 {"sample": "s1", "primer": "p1"}
GCTAGCTAGCTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIIII
@seq005 {"sample": "s1", "primer": "p1"}
GCTAGCTAGCTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIIII
@seq006 {"sample": "s2", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@seq007 {"sample": "s2", "primer": "p1"}
TTTTTTTTTTTTTTTTTTTT
+
IIIIIIIIIIIIIIIIIIII
@seq008 {"primer": "p1"}
CCCCCCCCCCCCCCCCCCCC
+
IIIIIIIIIIIIIIIIIIII
obiuniq -c sample --merge sample --na-value UNKNOWN reads.fastq -o out_merge.fastq
📄 out_merge.fastq
>seq008 {"count":1,"merged_sample":{"UNKNOWN":1},"primer":"p1"}
cccccccccccccccccccc
>seq001 {"count":3,"merged_sample":{"s1":3},"primer":"p1","sample":"s1"}
atcgatcgatcgatcgatcg
>seq006 {"count":1,"merged_sample":{"s2":1},"primer":"p1","sample":"s2"}
atcgatcgatcgatcgatcg
>seq004 {"count":2,"merged_sample":{"s1":2},"primer":"p1","sample":"s1"}
gctagctagctagctagcta
>seq007 {"count":1,"merged_sample":{"s2":1},"primer":"p1","sample":"s2"}
tttttttttttttttttttt
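
The merged_sample maps above are per-group tallies of the sample attribute. A minimal Python sketch of that tally follows; the helper `merged_attribute` is hypothetical and only models the behaviour described for --merge and --na-value, it is not taken from the OBITools4 source.

```python
from collections import Counter


def merged_attribute(group, key, na_value="NA"):
    """Count, within one group, how many reads carry each value of `key`.

    `group` is a list of attribute dictionaries, one per merged read;
    reads lacking `key` are tallied under `na_value`.
    """
    return dict(Counter(attributes.get(key, na_value) for attributes in group))


# The three s1 reads merged into seq001 when grouping with -c sample.
print(merged_attribute([{"sample": "s1"}] * 3, "sample"))
# The unannotated read seq008, with --na-value UNKNOWN.
print(merged_attribute([{"primer": "p1"}], "sample", na_value="UNKNOWN"))
# Without -c sample, all four ATCG... reads would share one group.
print(merged_attribute([{"sample": "s1"}] * 3 + [{"sample": "s2"}], "sample"))
```

The last call illustrates why --merge is also useful without -c: according to the option description, a single group spanning both samples would then record the per-sample breakdown of its reads.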

Dereplicate across two files with no assumed ordering, grouping by sample and primer:

The files sample1.fastq and sample2.fastq contain reads from two samples (s1 and s2), each covering two primers. Using --no-order allows both files to be opened and processed in parallel, at the cost of no longer guaranteeing the order of the records in the output. Grouping by both sample and primer keeps distinct amplicon types separate, producing one representative per unique (sequence, sample, primer) combination.

📄 sample1.fastq
@s1_seq001 {"sample": "s1", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@s1_seq002 {"sample": "s1", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@s1_seq003 {"sample": "s1", "primer": "p2"}
GCTAGCTAGCTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIIII
@s1_seq004 {"sample": "s1", "primer": "p2"}
GCTAGCTAGCTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIIII
📄 sample2.fastq
@s2_seq001 {"sample": "s2", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@s2_seq002 {"sample": "s2", "primer": "p2"}
TTTTTTTTTTTTTTTTTTTT
+
IIIIIIIIIIIIIIIIIIII
@s2_seq003 {"sample": "s2", "primer": "p2"}
TTTTTTTTTTTTTTTTTTTT
+
IIIIIIIIIIIIIIIIIIII
obiuniq --no-order -c sample -c primer sample1.fastq sample2.fastq -o out_multifile.fastq
📄 out_multifile.fastq
>s1_seq001 {"count":2,"primer":"p1","sample":"s1"}
atcgatcgatcgatcgatcg
>s2_seq001 {"count":1,"primer":"p1","sample":"s2"}
atcgatcgatcgatcgatcg
>s1_seq004 {"count":2,"primer":"p2","sample":"s1"}
gctagctagctagctagcta
>s2_seq002 {"count":2,"primer":"p2","sample":"s2"}
tttttttttttttttttttt

Use in-memory chunking for faster processing of smaller datasets:

For datasets that fit comfortably in RAM, --in-memory avoids temporary disk I/O and speeds up dereplication. The --chunk-count parameter controls how many internal partitions are used (here increased to 200 for finer granularity). The --compress flag writes the output as a gzip-compressed file.

obiuniq --in-memory --chunk-count 200 --compress -o out_inmemory.fastq.gz reads.fastq
obiuniq --help