`obiuniq`: dereplicate sequence data sets #

Description #

obiuniq groups identical sequences together and replaces each group with a single representative (a process commonly known as dereplication), recording the total number of original occurrences as an abundance count. Dereplication is a standard first step in amplicon sequencing workflows. A typical fastq file from an NGS run contains many copies of the same amplicon sequence. Reducing these to unique entries with counts dramatically reduces the computational burden on downstream tools.

By default, two sequences are considered identical if and only if their nucleotide strings are exactly the same. obiuniq scans through all input sequences, grouping the duplicates and writing one record per group in fasta format. The output record randomly inherits the identifier of one of the sequences in the group and carries a ‘count’ attribute recording how many input sequences it represents. Only information shared by all members of the sequence group is transferred to the representative sequence. Therefore, quality information is discarded from fastq files, because reads with identical sequences generally do not share identical quality scores.

graph TD
  A@{ shape: doc, label: "reads.fastq" }
  C[obiuniq]
  D@{ shape: doc, label: "out_basic.fasta" }
  A --> C:::obitools
  C --> D
  classDef obitools fill:#99d57c

The file reads.fastq contains eight amplicon reads from two samples. Several reads share the same nucleotide sequence, representing genuine replication in the sequencing library.

📄 reads.fastq

@seq001 {"sample": "s1", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@seq002 {"sample": "s1", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@seq003 {"sample": "s1", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@seq004 {"sample": "s1", "primer": "p1"}
GCTAGCTAGCTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIIII
@seq005 {"sample": "s1", "primer": "p1"}
GCTAGCTAGCTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIIII
@seq006 {"sample": "s2", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@seq007 {"sample": "s2", "primer": "p1"}
TTTTTTTTTTTTTTTTTTTT
+
IIIIIIIIIIIIIIIIIIII
@seq008 {"sample": "s1", "primer": "p1"}
CCCCCCCCCCCCCCCCCCCC
+
IIIIIIIIIIIIIIIIIIII

Running obiuniq collapses all duplicates into unique representatives:

obiuniq reads.fastq > out_basic.fasta

📄 out_basic.fasta

>seq008 {"count":1,"primer":"p1","sample":"s1"}
cccccccccccccccccccc
>seq001 {"count":4,"primer":"p1"}
atcgatcgatcgatcgatcg
>seq004 {"count":2,"primer":"p1","sample":"s1"}
gctagctagctagctagcta
>seq007 {"count":1,"primer":"p1","sample":"s2"}
tttttttttttttttttttt

The eight input reads are reduced to four unique sequences. The count attribute records how many reads were merged into each representative. For example, seq001 (ATCGATCG…) appeared four times across both samples and now carries "count":4. Because this sequence occurs in both samples s1 and s2, the sample attribute is not shared by all the sequences identical to seq001, so it is discarded from the obiuniq result. Note that obiuniq always produces fasta output, even from fastq input, because quality scores cannot be meaningfully combined across merged reads.
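The grouping rule described above can be sketched in a few lines of Python. This is only an illustration, not the obiuniq implementation: `dereplicate` and `READS` are hypothetical names, and the sketch keeps the identifier of the first read in each group, whereas obiuniq may keep the identifier of any member.

```python
# Illustrative sketch only; not the obiuniq implementation.
# The eight reads of reads.fastq as (identifier, sequence, attributes).
READS = [
    ("seq001", "ATCGATCGATCGATCGATCG", {"sample": "s1", "primer": "p1"}),
    ("seq002", "ATCGATCGATCGATCGATCG", {"sample": "s1", "primer": "p1"}),
    ("seq003", "ATCGATCGATCGATCGATCG", {"sample": "s1", "primer": "p1"}),
    ("seq004", "GCTAGCTAGCTAGCTAGCTA", {"sample": "s1", "primer": "p1"}),
    ("seq005", "GCTAGCTAGCTAGCTAGCTA", {"sample": "s1", "primer": "p1"}),
    ("seq006", "ATCGATCGATCGATCGATCG", {"sample": "s2", "primer": "p1"}),
    ("seq007", "TTTTTTTTTTTTTTTTTTTT", {"sample": "s2", "primer": "p1"}),
    ("seq008", "CCCCCCCCCCCCCCCCCCCC", {"sample": "s1", "primer": "p1"}),
]


def dereplicate(reads):
    """Group identical sequences and keep only attributes shared by a whole group."""
    groups = {}
    for ident, sequence, attrs in reads:
        key = sequence.lower()
        if key not in groups:
            # Assumption: the first read seen gives its identifier to the group.
            groups[key] = {"id": ident, "count": 0, "attrs": dict(attrs)}
        group = groups[key]
        group["count"] += 1
        # Drop every attribute whose value differs from this read's value.
        group["attrs"] = {
            name: value
            for name, value in group["attrs"].items()
            if attrs.get(name) == value
        }
    return groups


result = dereplicate(READS)
for sequence, group in result.items():
    print(group["id"], group["count"], group["attrs"], sequence)
```

In this sketch the ATCG group ends with a count of 4 and keeps only the primer attribute, because its members disagree on sample.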

When the same sequence occurs in multiple experimental samples and per-sample abundance matters, grouping by sequence alone can be too aggressive. The --category-attribute option (short: -c) adds metadata fields to the grouping criterion. This option can be used as many times as needed. Sequences are then merged only if they are nucleotide-identical and share the same value for every listed attribute. In the example below, we categorise by sample. Thus, the previous group containing seq001 (ATCGATCG…) is now split into two, one for each of the samples s1 and s2.

obiuniq -c sample reads.fastq > out_per_sample.fasta

📄 out_per_sample.fasta

>seq008 {"count":1,"primer":"p1","sample":"s1"}
cccccccccccccccccccc
>seq006 {"count":1,"primer":"p1","sample":"s2"}
atcgatcgatcgatcgatcg
>seq001 {"count":3,"primer":"p1","sample":"s1"}
atcgatcgatcgatcgatcg
>seq004 {"count":2,"primer":"p1","sample":"s1"}
gctagctagctagctagcta
>seq007 {"count":1,"primer":"p1","sample":"s2"}
tttttttttttttttttttt
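The effect of adding an attribute to the grouping criterion can be illustrated with a short Python sketch. It is not how obiuniq is implemented; `count_groups` and `READS` are hypothetical names, and the reads are reduced to (sequence, sample) pairs for brevity.

```python
from collections import Counter

# Illustrative sketch only; the eight reads of reads.fastq as (sequence, sample).
READS = (
    [("ATCG" * 5, "s1")] * 3
    + [("GCTA" * 5, "s1")] * 2
    + [("ATCG" * 5, "s2"), ("T" * 20, "s2"), ("C" * 20, "s1")]
)


def count_groups(reads, by_sample):
    """Count reads per group; the key is the sequence, optionally with the sample."""
    keys = [(seq, sample) if by_sample else (seq,) for seq, sample in reads]
    return Counter(keys)


by_sequence = count_groups(READS, by_sample=False)
by_sequence_and_sample = count_groups(READS, by_sample=True)

print(len(by_sequence), "groups when grouping by sequence only")
print(len(by_sequence_and_sample), "groups when the sample is part of the key")
```

Extending the key from the sequence alone to the (sequence, sample) pair splits the ATCG group into one group of three reads for s1 and one group of one read for s2.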

Another solution is to track sample contributions across merged groups using the --merge option.

This option reverts to the four sequence variants observed in the first example. However, the merged_sample attribute shows which samples each representative sequence originated from, by recording how many reads from each sample were merged into each group.

obiuniq --merge sample reads.fastq > out_merge.fasta

📄 out_merge.fasta

>seq008 {"count":1,"merged_sample":{"s1":1},"primer":"p1","sample":"s1"}
cccccccccccccccccccc
>seq001 {"count":4,"merged_sample":{"s1":3,"s2":1},"primer":"p1"}
atcgatcgatcgatcgatcg
>seq004 {"count":2,"merged_sample":{"s1":2},"primer":"p1","sample":"s1"}
gctagctagctagctagcta
>seq007 {"count":1,"merged_sample":{"s2":1},"primer":"p1","sample":"s2"}
tttttttttttttttttttt
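A minimal Python sketch of the bookkeeping behind merged_sample follows. It only mirrors the behaviour shown in the output above and makes no claim about the internal obiuniq code; `merge_counts` and `READS` are hypothetical names.

```python
from collections import Counter, defaultdict

# Illustrative sketch only; the eight reads of reads.fastq as (sequence, sample).
READS = (
    [("ATCG" * 5, "s1")] * 3
    + [("GCTA" * 5, "s1")] * 2
    + [("ATCG" * 5, "s2"), ("T" * 20, "s2"), ("C" * 20, "s1")]
)


def merge_counts(reads):
    """For each sequence, count how many reads came from each sample."""
    merged = defaultdict(Counter)
    for sequence, sample in reads:
        merged[sequence][sample] += 1
    return {sequence: dict(counts) for sequence, counts in merged.items()}


merged_sample = merge_counts(READS)
for sequence, per_sample in merged_sample.items():
    # The group count is the sum of the per-sample counts.
    print(sequence.lower(), sum(per_sample.values()), per_sample)
```

The sequences are still grouped by nucleotide string alone, so four groups remain, but each group now carries its own per-sample breakdown.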

Whether you choose the -c or the -m option depends on how you will process your data later on. If you plan to use methods that process sequences sample by sample, it is better to use the -c option and to split the result file by sample using the obidistribute command.

obiuniq -c sample reads.fastq \
  | obidistribute -p sample_%s.fasta -c sample

This pipeline produces two files, sample_s1.fasta and sample_s2.fasta:

📄 sample_s1.fasta

>seq008 {"count":1,"primer":"p1","sample":"s1"}
cccccccccccccccccccc
>seq001 {"count":3,"primer":"p1","sample":"s1"}
atcgatcgatcgatcgatcg
>seq004 {"count":2,"primer":"p1","sample":"s1"}
gctagctagctagctagcta

📄 sample_s2.fasta

>seq006 {"count":1,"primer":"p1","sample":"s2"}
atcgatcgatcgatcgatcg
>seq007 {"count":1,"primer":"p1","sample":"s2"}
tttttttttttttttttttt

However, with this approach, the explicit information that seq001 and seq006 are identical and shared by two samples is lost.
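The following Python sketch mimics the splitting step on the records of out_per_sample.fasta, to show what has to be recomputed afterwards. It is an illustration under stated assumptions, not the obidistribute implementation: `distribute` and `RECORDS` are hypothetical names, and the file name pattern is copied from the command above.

```python
from collections import defaultdict

# Illustrative sketch only; records of out_per_sample.fasta as
# (identifier, sample, count, sequence).
RECORDS = [
    ("seq008", "s1", 1, "cccccccccccccccccccc"),
    ("seq006", "s2", 1, "atcgatcgatcgatcgatcg"),
    ("seq001", "s1", 3, "atcgatcgatcgatcgatcg"),
    ("seq004", "s1", 2, "gctagctagctagctagcta"),
    ("seq007", "s2", 1, "tttttttttttttttttttt"),
]


def distribute(records, pattern="sample_%s.fasta"):
    """Assign each record to a file name built from its sample value."""
    files = defaultdict(list)
    for ident, sample, count, sequence in records:
        files[pattern % sample].append((ident, count, sequence))
    return dict(files)


files = distribute(RECORDS)

# Finding sequences shared between samples now requires comparing the files.
s1 = {sequence for _, _, sequence in files["sample_s1.fasta"]}
s2 = {sequence for _, _, sequence in files["sample_s2.fasta"]}
shared = s1 & s2
print(sorted(files), shared)
```

Here the shared sequence is found again only by intersecting the two files, which is the extra work that the -m option avoids.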

The OBITools4 algorithms process all samples’ data at once to identify OTUs shared among samples. This is why, in a pipeline based on OBITools4, the obiuniq -m option is usually preferred over the -c option.

Synopsis #

obiuniq [--batch-mem <string>] [--batch-size <int>] [--batch-size-max <int>]
        [--category-attribute|-c <CATEGORY>]... [--chunk-count <int>]
        [--compress|-Z] [--csv] [--debug] [--ecopcr] [--embl]
        [--fail-on-taxonomy] [--fasta] [--fasta-output] [--fastq]
        [--fastq-output] [--genbank] [--help|-h|-?] [--in-memory]
        [--input-OBI-header] [--input-json-header] [--json-output]
        [--max-cpu <int>] [--merge|-m <KEY>]... [--na-value <NA_NAME>]
        [--no-order] [--no-progressbar] [--no-singleton]
        [--out|-o <FILENAME>] [--output-OBI-header|-O] [--output-json-header]
        [--pprof] [--pprof-goroutine <int>] [--pprof-mutex <int>]
        [--raw-taxid] [--silent-warning] [--skip-empty] [--solexa]
        [--taxonomy|-t <string>] [--u-to-t] [--update-taxid] [--version]
        [--with-leaves] [<args>]

Options #

`obiuniq` specific options #

--category-attribute | -c <CATEGORY>: Adds one metadata attribute to the grouping criterion. Two sequences are placed in the same group only when they are nucleotide-identical and share the same value for every attribute listed with -c. Can be repeated to combine multiple attributes (e.g. -c sample -c primer). Records missing a listed attribute receive the value set by --na-value.
--chunk-count <int>: Controls how many internal partitions the dataset is split into during processing (default: 100). A higher value reduces per-partition memory usage at the cost of more temporary files; a lower value reduces I/O at the cost of higher peak memory.
--in-memory: Stores intermediate data chunks in RAM rather than in temporary disk files. Speeds up processing for datasets that fit comfortably in available memory; omit this flag for large datasets that exceed available RAM.
--merge | -m <KEY>: Creates an output attribute named merged_KEY that maps each observed value of the KEY attribute to the count of input sequences carrying that value within the group. Can be repeated to track multiple attributes. Useful for tracking which sample or category contributions were collapsed into each group.
--na-value <NA_NAME>: Value assigned to a category attribute when a sequence record does not carry that attribute (default: "NA"). All sequences lacking the attribute are grouped together under this placeholder.
--no-singleton: Discards all output records whose abundance count is exactly one, i.e., sequences that occur only once across the entire input. Removing singletons is a standard heuristic for excluding a large part of the PCR artifacts and sequencing errors from further analysis.
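As an illustration of --no-singleton, the Python sketch below filters the counts listed in the two output files shown in the Description. It assumes that the filter applies to the count of each output record, as the option description states. `drop_singletons` is a hypothetical helper, not part of OBITools4.

```python
# Illustrative sketch only: assumes the filter is applied to the count of
# each output record, as the option description states.
def drop_singletons(records):
    """Keep only the (identifier, count) records whose count is greater than one."""
    return [(ident, count) for ident, count in records if count > 1]


# Counts taken from out_basic.fasta (grouping by sequence only).
basic = [("seq008", 1), ("seq001", 4), ("seq004", 2), ("seq007", 1)]
# Counts taken from out_per_sample.fasta (grouping with -c sample).
per_sample = [("seq008", 1), ("seq006", 1), ("seq001", 3), ("seq004", 2), ("seq007", 1)]

kept_basic = drop_singletons(basic)
kept_per_sample = drop_singletons(per_sample)
print(kept_basic)
print(kept_per_sample)
```

Under this reading, combining --no-singleton with -c sample would also drop the seq006 record, even though its sequence occurs four times in the whole data set. This is worth checking on real data before relying on it.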

Taxonomic options #

--taxonomy | -t <string>: Path to the taxonomic database.
--fail-on-taxonomy: Cause obiuniq to exit with an error if a taxid in the data is not a currently valid taxon in the loaded taxonomy.
--raw-taxid: Print taxids in output without supplementary information (taxon name and rank).
--update-taxid: Automatically replace merged taxids with the most recent valid taxid.
--with-leaves: When taxonomy is extracted from a sequence file, add sequences as leaves of their taxid annotation.

Controlling the input data #

OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step.

The file format options #

--fasta: indicates that sequence data is in fasta format.
--fastq: indicates that sequence data is in fastq format.
--embl: indicates that sequence data is in EMBL-ENA flatfile format.
--csv: indicates that sequence data is in CSV format.
--genbank: indicates that sequence data is in GenBank flatfile format.
--ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.

Controlling the way OBITools4 formats annotations #

These options only apply to the FASTA and FASTQ formats.

--input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
--input-json-header: FASTA/FASTQ title line annotations follow the JSON format.
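A title line with JSON annotations, as in the output files shown in the Description, can be read with any JSON parser. The Python sketch below is one way to do so outside OBITools4; `parse_title` is a hypothetical helper, and it assumes the identifier is separated from the JSON object by a single space.

```python
import json


def parse_title(line):
    """Split a FASTA/FASTQ title line into (identifier, annotations)."""
    body = line.strip().lstrip(">@")  # remove the FASTA or FASTQ record marker
    ident, _, rest = body.partition(" ")
    annotations = json.loads(rest) if rest.strip() else {}
    return ident, annotations


ident, annotations = parse_title(
    '>seq001 {"count":4,"merged_sample":{"s1":3,"s2":1},"primer":"p1"}'
)
print(ident, annotations["count"], annotations["merged_sample"])
```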

Controlling quality score decoding #

This option only applies to the FASTQ format.

--solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)
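For readers unfamiliar with the two encodings, the Python sketch below shows the commonly documented difference. Sanger quality characters store a Phred score with an ASCII offset of 33, whereas the old Solexa encoding stores an odds-based score with an offset of 64. The function names are hypothetical, and whether OBITools4 converts Solexa scores to the Phred scale in exactly this way is an assumption.

```python
import math


def sanger_to_phred(char):
    """Decode one Sanger-encoded quality character (ASCII offset 33)."""
    return ord(char) - 33


def solexa_to_phred(char):
    """Decode one old Solexa quality character (ASCII offset 64) and
    convert the odds-based Solexa score to the Phred scale."""
    q_solexa = ord(char) - 64
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


print(sanger_to_phred("I"))            # 'I' is the quality used in reads.fastq
print(round(solexa_to_phred("h"), 3))  # a high Solexa score is close to its Phred value
print(round(solexa_to_phred("@"), 3))  # low scores differ noticeably between scales
```

The two scales nearly coincide for high-quality bases and diverge for low-quality ones, which is why the encoding of old files matters.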

Controlling the output data #

--compress | -Z : output is compressed using gzip. (default: false)
--no-order: by default, OBITools preserve the order of the sequences between the input and the output, and multiple input files are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation.
--fasta-output: writes sequence data in fasta format (default if quality data is not available).
--fastq-output: writes sequence data in fastq format (default if quality data is available).
--json-output: writes sequence data in JSON format.
--out | -o <FILENAME>: filename used for saving the output (default: “-”, the standard output)
--output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON).
--output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format).
--skip-empty: sequences of length equal to zero are removed from the output (default: false).
--no-progressbar: deactivates progress bar display (default: false).

General options #

--help | -h|-? : shows this help.
--version: prints the version and exits.
--silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.

--max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
--force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
--batch-size <INTEGER>: minimum number of sequences per batch for parallel processing (floor, default: 1, env: OBIBATCHSIZE)
--batch-size-max <INTEGER>: maximum number of sequences per batch for parallel processing (ceiling, default: 2000, env: OBIBATCHSIZEMAX)
--batch-mem <STRING>: maximum memory per batch (e.g. 128K, 64M, 1G; default: 128M; set to 0 to disable, env: OBIBATCHMEM)

--debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
--pprof: enables pprof server. Look at the log for details. (default: false).
--pprof-mutex <INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
--pprof-goroutine <INTEGER>: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)

Examples #

Dealing with missing data

With the -c and -m options, obiuniq relies on values stored in attributes that may be missing for some sequences. In that case, a placeholder value (NA by default) is substituted for the missing information.

📄 reads_missing.fastq

@seq001 {"sample": "s1", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@seq002 {"sample": "s1", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@seq003 {"sample": "s1", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@seq004 {"primer": "p1"}
GCTAGCTAGCTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIIII
@seq005 {"sample": "s1", "primer": "p1"}
GCTAGCTAGCTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIIII
@seq006 {"sample": "s2", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@seq007 {"sample": "s2", "primer": "p1"}
TTTTTTTTTTTTTTTTTTTT
+
IIIIIIIIIIIIIIIIIIII
@seq008 {"primer": "p1"}
CCCCCCCCCCCCCCCCCCCC
+
IIIIIIIIIIIIIIIIIIII

obiuniq --merge sample \
        reads_missing.fastq \
    > out_missing.fasta

📄 out_missing.fasta

>seq008 {"count":1,"merged_sample":{"NA":1},"primer":"p1"}
cccccccccccccccccccc
>seq001 {"count":4,"merged_sample":{"s1":3,"s2":1},"primer":"p1"}
atcgatcgatcgatcgatcg
>seq004 {"count":2,"merged_sample":{"NA":1,"s1":1},"primer":"p1"}
gctagctagctagctagcta
>seq007 {"count":1,"merged_sample":{"s2":1},"primer":"p1","sample":"s2"}
tttttttttttttttttttt

The --na-value option sets this placeholder value.

obiuniq --merge sample \
        --na-value UNKNOWN \
        reads_missing.fastq \
    > out_unknown.fasta

📄 out_unknown.fasta

>seq008 {"count":1,"merged_sample":{"UNKNOWN":1},"primer":"p1"}
cccccccccccccccccccc
>seq001 {"count":4,"merged_sample":{"s1":3,"s2":1},"primer":"p1"}
atcgatcgatcgatcgatcg
>seq004 {"count":2,"merged_sample":{"UNKNOWN":1,"s1":1},"primer":"p1"}
gctagctagctagctagcta
>seq007 {"count":1,"merged_sample":{"s2":1},"primer":"p1","sample":"s2"}
tttttttttttttttttttt
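The placeholder mechanism can be sketched in Python as a default value used when an attribute is looked up. This only reproduces the behaviour visible in the outputs above; `merge_counts`, `READS` and the `no_sample` placeholder are hypothetical names chosen for the illustration.

```python
from collections import Counter, defaultdict

# Illustrative sketch only; the reads of reads_missing.fastq as
# (sequence, attributes). seq004 and seq008 carry no sample attribute.
READS = [
    ("ATCG" * 5, {"sample": "s1", "primer": "p1"}),
    ("ATCG" * 5, {"sample": "s1", "primer": "p1"}),
    ("ATCG" * 5, {"sample": "s1", "primer": "p1"}),
    ("GCTA" * 5, {"primer": "p1"}),
    ("GCTA" * 5, {"sample": "s1", "primer": "p1"}),
    ("ATCG" * 5, {"sample": "s2", "primer": "p1"}),
    ("T" * 20, {"sample": "s2", "primer": "p1"}),
    ("C" * 20, {"primer": "p1"}),
]


def merge_counts(reads, key, na_value="NA"):
    """Count the values of `key` per sequence, using na_value when it is absent."""
    merged = defaultdict(Counter)
    for sequence, attrs in reads:
        merged[sequence][attrs.get(key, na_value)] += 1
    return {sequence: dict(counts) for sequence, counts in merged.items()}


default_merge = merge_counts(READS, "sample")
custom_merge = merge_counts(READS, "sample", na_value="no_sample")
print(default_merge["GCTA" * 5])
print(custom_merge["GCTA" * 5])
```

Reads lacking the attribute are counted under a single placeholder key, so the per-sample counts of a group still sum to its total count.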

Dereplicate across two files with no assumed ordering, grouping by sample and primer:

The files sample1.fastq and sample2.fastq contain reads from two samples, s1 and s2, covering two primers. Using --no-order allows obiuniq to read both files in parallel, at the cost of no longer guaranteeing the order of the sequences in the output. Grouping by both sample and primer keeps distinct amplicon types separate, producing one representative per unique (sequence, sample, primer) combination.

📄 sample1.fastq

@s1_seq001 {"sample": "s1", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@s1_seq002 {"sample": "s1", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@s1_seq003 {"sample": "s1", "primer": "p2"}
GCTAGCTAGCTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIIII
@s1_seq004 {"sample": "s1", "primer": "p2"}
GCTAGCTAGCTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIIII

📄 sample2.fastq

@s2_seq001 {"sample": "s2", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@s2_seq002 {"sample": "s2", "primer": "p2"}
TTTTTTTTTTTTTTTTTTTT
+
IIIIIIIIIIIIIIIIIIII
@s2_seq003 {"sample": "s2", "primer": "p2"}
TTTTTTTTTTTTTTTTTTTT
+
IIIIIIIIIIIIIIIIIIII

obiuniq --no-order \
        -c sample \
        -c primer \
        sample1.fastq sample2.fastq \
    > out_multifile.fasta

📄 out_multifile.fasta

>s1_seq001 {"count":2,"primer":"p1","sample":"s1"}
atcgatcgatcgatcgatcg
>s2_seq001 {"count":1,"primer":"p1","sample":"s2"}
atcgatcgatcgatcgatcg
>s1_seq004 {"count":2,"primer":"p2","sample":"s1"}
gctagctagctagctagcta
>s2_seq002 {"count":2,"primer":"p2","sample":"s2"}
tttttttttttttttttttt
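The Python sketch below illustrates why the read order does not affect the counts in this example. The group key is the full (sequence, sample, primer) triple, so only the order in which groups are first seen can change. `FILE1`, `FILE2` and `group_counts` are hypothetical names, and the sketch is not the obiuniq implementation.

```python
from collections import Counter

# Illustrative sketch only; reads as (sequence, sample, primer) triples.
FILE1 = [("ATCG" * 5, "s1", "p1")] * 2 + [("GCTA" * 5, "s1", "p2")] * 2
FILE2 = [("ATCG" * 5, "s2", "p1")] + [("T" * 20, "s2", "p2")] * 2


def group_counts(reads):
    """Count reads per (sequence, sample, primer) combination."""
    return Counter(reads)


first_then_second = group_counts(FILE1 + FILE2)
second_then_first = group_counts(FILE2 + FILE1)

# Same groups and counts, whatever the order in which the reads were seen.
print(first_then_second == second_then_first)
# Only the order of first appearance differs.
print(list(first_then_second)[0], list(second_then_first)[0])
```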

Use in-memory chunking for faster processing of small datasets:

For datasets that fit comfortably in RAM, --in-memory avoids temporary disk I/O and speeds up dereplication. The --chunk-count parameter controls how many internal partitions are used (here increased to 200 for finer granularity). The --compress flag writes the output as a gzip-compressed file.

obiuniq --in-memory \
        --chunk-count 200 \
        --compress \
        --out out_inmemory.fasta.gz \
        reads.fastq
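To see why the number of partitions affects memory use but not the result, consider the Python sketch below. It assumes a hash-based partitioning scheme in which identical sequences always land in the same chunk. The document does not describe how obiuniq actually assigns sequences to chunks, so `dereplicate_in_chunks` is a hypothetical illustration only.

```python
import zlib
from collections import Counter


def dereplicate_in_chunks(sequences, chunk_count):
    """Dereplicate by first spreading the sequences over chunk_count chunks."""
    chunks = [[] for _ in range(chunk_count)]
    for sequence in sequences:
        # Assumption: a deterministic hash selects the chunk, so identical
        # sequences always end up in the same chunk.
        index = zlib.crc32(sequence.encode("ascii")) % chunk_count
        chunks[index].append(sequence)
    counts = Counter()
    for chunk in chunks:
        # Each chunk can be dereplicated on its own, one at a time.
        counts.update(Counter(chunk))
    return counts


SEQUENCES = ["ATCG" * 5] * 4 + ["GCTA" * 5] * 2 + ["T" * 20, "C" * 20]

for chunk_count in (1, 3, 200):
    assert dereplicate_in_chunks(SEQUENCES, chunk_count) == Counter(SEQUENCES)
print(dereplicate_in_chunks(SEQUENCES, 200))
```

With such a scheme, more chunks mean that fewer sequences are held together at any one time, which matches the trade-off described for --chunk-count.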

Printing the command help

obiuniq --help
