obiuniq: dereplicate sequence data sets #
This page was automatically generated by an AI assistant and has not yet been
reviewed or validated by the OBITools4 development team. It may contain
inaccuracies or incomplete information. Use with caution and refer to the command’s
--help output for authoritative option descriptions.
Description #
obiuniq groups identical sequences together and replaces each group with a
single representative, recording the total number of original occurrences as an abundance
count. Dereplication is a standard first step in amplicon sequencing workflows: a typical
fastq file from an NGS run contains many copies of the same amplicon sequence,
and reducing these to unique entries with counts dramatically reduces the computational
burden on downstream tools.
By default, two sequences are considered identical if and only if their nucleotide strings
are exactly the same. obiuniq scans through all input sequences, groups the
duplicates, and writes one record per group in fasta format. The output record
inherits the identifier of the first sequence in the group and carries a count attribute
recording how many input sequences it represents.
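The count attribute can be read back from an output title line with a few lines of Python. This is a minimal sketch, assuming the default JSON header format shown in the examples on this page; `parse_header` is a hypothetical helper name and is not part of OBITools4.

```python
import json

def parse_header(line):
    """Split a FASTA title line into (identifier, annotations dict).

    Assumes the annotations follow the identifier as a single JSON object,
    as in the example outputs on this page.
    """
    body = line.rstrip("\n").lstrip(">")
    identifier, _, rest = body.partition(" ")
    annotations = json.loads(rest) if rest.strip() else {}
    return identifier, annotations

seq_id, annotations = parse_header('>seq001 {"count":4,"primer":"p1"}')
print(seq_id, annotations["count"])
```

A title line without annotations yields an empty dictionary, so downstream code can treat a missing count as it sees fit.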
graph TD
A@{ shape: doc, label: "reads.fastq" }
C[obiuniq]
D@{ shape: doc, label: "out_basic.fastq" }
A --> C:::obitools
C --> D
classDef obitools fill:#99d57c
The file reads.fastq contains eight amplicon reads from two samples. Several reads share the same nucleotide sequence, representing genuine replication in the sequencing library.
reads.fastq
@seq001 {"sample": "s1", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@seq002 {"sample": "s1", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@seq003 {"sample": "s1", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@seq004 {"sample": "s1", "primer": "p1"}
GCTAGCTAGCTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIIII
@seq005 {"sample": "s1", "primer": "p1"}
GCTAGCTAGCTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIIII
@seq006 {"sample": "s2", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@seq007 {"sample": "s2", "primer": "p1"}
TTTTTTTTTTTTTTTTTTTT
+
IIIIIIIIIIIIIIIIIIII
@seq008 {"primer": "p1"}
CCCCCCCCCCCCCCCCCCCC
+
IIIIIIIIIIIIIIIIIIII
Running obiuniq collapses all duplicates into unique representatives:
obiuniq reads.fastq -o out_basic.fastq
>seq008 {"count":1,"primer":"p1"}
cccccccccccccccccccc
>seq001 {"count":4,"primer":"p1"}
atcgatcgatcgatcgatcg
>seq004 {"count":2,"primer":"p1","sample":"s1"}
gctagctagctagctagcta
>seq007 {"count":1,"primer":"p1","sample":"s2"}
tttttttttttttttttttt
The eight input reads reduce to four unique sequences. The count attribute records how
many reads were merged into each representative: seq001 (ATCGATCG…) appeared four times
across both samples and now carries "count":4. Note that obiuniq always produces fasta
output, even from fastq input, because quality scores cannot be meaningfully combined
across merged reads.
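The default grouping rule can be sketched in Python. This is an illustration of the behaviour described above (exact string identity, first identifier kept, occurrences counted), not the actual obiuniq implementation; `dereplicate` is a hypothetical function name. Records come out in first-seen order here, which may differ from the order obiuniq writes them in.

```python
def dereplicate(records):
    """Group (identifier, sequence) pairs by exact sequence identity.

    Returns a list of (first identifier, count) pairs, one per distinct sequence.
    """
    groups = {}
    for identifier, sequence in records:
        if sequence not in groups:
            # The first read seen with this sequence becomes the representative.
            groups[sequence] = [identifier, 0]
        groups[sequence][1] += 1
    return [(identifier, count) for identifier, count in groups.values()]

# The eight reads of reads.fastq, reduced to identifier and sequence.
reads = [
    ("seq001", "ATCGATCGATCGATCGATCG"),
    ("seq002", "ATCGATCGATCGATCGATCG"),
    ("seq003", "ATCGATCGATCGATCGATCG"),
    ("seq004", "GCTAGCTAGCTAGCTAGCTA"),
    ("seq005", "GCTAGCTAGCTAGCTAGCTA"),
    ("seq006", "ATCGATCGATCGATCGATCG"),
    ("seq007", "TTTTTTTTTTTTTTTTTTTT"),
    ("seq008", "CCCCCCCCCCCCCCCCCCCC"),
]

for identifier, count in dereplicate(reads):
    print(identifier, count)
```

The counts agree with the example output: 4 for seq001, 2 for seq004, and 1 each for seq007 and seq008.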
When the same sequence genuinely occurs in multiple experimental samples and per-sample
abundance matters, grouping by sequence alone is too aggressive. The --category-attribute
option (repeatable, short: -c) adds metadata fields to the grouping criterion: two reads
are only merged when they are nucleotide-identical and share the same value for every
listed attribute. Combining -c sample with --no-singleton groups reads per sample and
then discards groups with a single member, a common noise-reduction step to remove
likely sequencing errors:
obiuniq -c sample --no-singleton reads.fastq -o out_no_singleton.fastq
>seq001 {"count":3,"primer":"p1","sample":"s1"}
atcgatcgatcgatcgatcg
>seq004 {"count":2,"primer":"p1","sample":"s1"}
gctagctagctagctagcta
Of the per-sample groups formed, only the two from sample s1 have more than one member
and survive the singleton filter. The groups from s2 and the unannotated sequence each
appear only once and are discarded.
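The combined effect of -c and --no-singleton can be sketched as follows. This is a minimal illustration of the grouping rule described above, not the tool's own code; `dereplicate_by` and its parameters are hypothetical names, and the "NA" placeholder mirrors the documented default of --na-value.

```python
def dereplicate_by(records, categories, na_value="NA", no_singleton=False):
    """Group reads by sequence plus the values of the listed attributes.

    records: iterable of (identifier, sequence, annotations dict).
    Returns a list of (first identifier, count) pairs.
    """
    groups = {}
    for identifier, sequence, annotations in records:
        # Reads lacking an attribute all share the placeholder value.
        key = (sequence,) + tuple(annotations.get(c, na_value) for c in categories)
        group = groups.setdefault(key, [identifier, 0])
        group[1] += 1
    result = [(identifier, count) for identifier, count in groups.values()]
    if no_singleton:
        # Drop groups built from a single read.
        result = [(identifier, count) for identifier, count in result if count > 1]
    return result

A, G = "ATCGATCGATCGATCGATCG", "GCTAGCTAGCTAGCTAGCTA"
reads = [
    ("seq001", A, {"sample": "s1"}), ("seq002", A, {"sample": "s1"}),
    ("seq003", A, {"sample": "s1"}), ("seq004", G, {"sample": "s1"}),
    ("seq005", G, {"sample": "s1"}), ("seq006", A, {"sample": "s2"}),
    ("seq007", "TTTTTTTTTTTTTTTTTTTT", {"sample": "s2"}),
    ("seq008", "CCCCCCCCCCCCCCCCCCCC", {}),
]

print(dereplicate_by(reads, ["sample"], no_singleton=True))
```

Without the singleton filter the same call returns five groups, because the ATCG… sequence is split between samples s1 and s2.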
Synopsis #
obiuniq [--batch-mem <string>] [--batch-size <int>] [--batch-size-max <int>]
[--category-attribute|-c <CATEGORY>]... [--chunk-count <int>]
[--compress|-Z] [--csv] [--debug] [--ecopcr] [--embl]
[--fail-on-taxonomy] [--fasta] [--fasta-output] [--fastq]
[--fastq-output] [--genbank] [--help|-h|-?] [--in-memory]
[--input-OBI-header] [--input-json-header] [--json-output]
[--max-cpu <int>] [--merge|-m <KEY>]... [--na-value <NA_NAME>]
[--no-order] [--no-progressbar] [--no-singleton]
[--out|-o <FILENAME>] [--output-OBI-header|-O] [--output-json-header]
[--pprof] [--pprof-goroutine <int>] [--pprof-mutex <int>]
[--raw-taxid] [--silent-warning] [--skip-empty] [--solexa]
[--taxonomy|-t <string>] [--u-to-t] [--update-taxid] [--version]
[--with-leaves] [<args>]
Options #
obiuniq specific options #
--category-attribute|-c <CATEGORY>: Adds one metadata attribute to the grouping criterion. Two sequences are placed in the same group only when they are nucleotide-identical and share the same value for every attribute listed with -c. Can be repeated to combine multiple attributes (e.g. -c sample -c primer). Records missing a listed attribute receive the value set by --na-value.
--chunk-count <int>: Controls how many internal partitions the dataset is split into during processing (default: 100). A higher value reduces per-partition memory usage at the cost of more temporary files; a lower value reduces I/O at the cost of higher peak memory.
--in-memory: Stores intermediate data chunks in RAM rather than in temporary disk files. Speeds up processing for datasets that fit comfortably in available memory; omit this flag for large datasets that exceed available RAM.
--merge|-m <KEY>: Creates an output attribute named merged_KEY that maps each observed value of the KEY attribute to the count of input sequences carrying that value within the group. Can be repeated to track multiple attributes. Useful for tracking which sample or category contributions were collapsed into each group.
--na-value <NA_NAME>: Value assigned to a category attribute when a sequence record does not carry that attribute (default: "NA"). All sequences lacking the attribute are grouped together under this placeholder.
--no-singleton: Discards all output records whose abundance count is exactly one, that is, groups formed from a single input sequence. Removing singletons is a standard heuristic for excluding sequencing errors from further analysis.
Taxonomic options #
--taxonomy|-t <string>: Path to the taxonomic database.
--fail-on-taxonomy: Causes obiuniq to exit with an error if a taxid in the data is not a currently valid taxon in the loaded taxonomy.
--raw-taxid: Print taxids in output without supplementary information (taxon name and rank).
--update-taxid: Automatically replace merged taxids with the most recent valid taxid.
--with-leaves: When taxonomy is extracted from a sequence file, add sequences as leaves of their taxid annotation.
Controlling the input data #
OBITools4 usually recognizes the input file format automatically, including whether the file is compressed with gzip. Some rare files can be misidentified, so the following options let the user force the format and bypass the format identification step.
The file format options #
--fasta: indicates that sequence data is in fasta format.
--fastq: indicates that sequence data is in fastq format.
--embl: indicates that sequence data is in EMBL-ENA flatfile format.
--csv: indicates that sequence data is in CSV format.
--genbank: indicates that sequence data is in GenBank flatfile format.
--ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.
Controlling the way OBITools4 are formatting annotations #
These options only apply to the FASTA and FASTQ formats.
--input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
--input-json-header: FASTA/FASTQ title line annotations follow the JSON format.
Controlling quality score decoding #
This option only applies to the FASTQ format.
--solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)
Controlling the output data #
--compress|-Z: output is compressed using gzip. (default: false)
--no-order: by default, OBITools keeps the order of sequences the same between the input and the output, and when several input files are given they are processed one at a time. With --no-order, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Processing multiple files in parallel may also require more memory.
--fasta-output: writes sequence data in fasta format (default if quality data is not available).
--fastq-output: writes sequence data in fastq format (default if quality data is available).
--json-output: writes sequence data in JSON format.
--out|-o <FILENAME>: filename used for saving the output (default: “-”, the standard output)
--output-OBI-header|-O: writes output FASTA/FASTQ title line annotations in OBI format (default: JSON).
--output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format).
--skip-empty: sequences of length equal to zero are removed from the output (default: false).
--no-progressbar: deactivates progress bar display (default: false).
General options #
--help|-h|-?: shows this help.
--version: prints the version and exits.
--silent-warning: tells obitools to stop displaying warnings. This behaviour can also be controlled by setting the OBIWARNINGS environment variable.
Computation related options #
--max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory. Reducing the number of CPUs used is therefore also a way to indirectly limit the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
--force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
--batch-size <INTEGER>: minimum number of sequences per batch for parallel processing (floor, default: 1, env: OBIBATCHSIZE)
--batch-size-max <INTEGER>: maximum number of sequences per batch for parallel processing (ceiling, default: 2000, env: OBIBATCHSIZEMAX)
--batch-mem <STRING>: maximum memory per batch (e.g. 128K, 64M, 1G; default: 128M; set to 0 to disable, env: OBIBATCHMEM)
Debug related options #
--debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
--pprof: enables pprof server. Look at the log for details. (default: false).
--pprof-mutex <INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
--pprof-goroutine <INTEGER>: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)
Examples #
Track sample contributions across merged groups using --merge:
The file reads.fastq contains amplicon reads from two samples. Dereplicating
per sample and recording in merged_sample how many reads from each sample were merged into
each group reveals the sample origin of every representative. Sequences with no sample
attribute are grouped under the placeholder UNKNOWN.
@seq001 {"sample": "s1", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@seq002 {"sample": "s1", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@seq003 {"sample": "s1", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@seq004 {"sample": "s1", "primer": "p1"}
GCTAGCTAGCTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIIII
@seq005 {"sample": "s1", "primer": "p1"}
GCTAGCTAGCTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIIII
@seq006 {"sample": "s2", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@seq007 {"sample": "s2", "primer": "p1"}
TTTTTTTTTTTTTTTTTTTT
+
IIIIIIIIIIIIIIIIIIII
@seq008 {"primer": "p1"}
CCCCCCCCCCCCCCCCCCCC
+
IIIIIIIIIIIIIIIIIIII
obiuniq -c sample --merge sample --na-value UNKNOWN reads.fastq -o out_merge.fastq
>seq008 {"count":1,"merged_sample":{"UNKNOWN":1},"primer":"p1"}
cccccccccccccccccccc
>seq001 {"count":3,"merged_sample":{"s1":3},"primer":"p1","sample":"s1"}
atcgatcgatcgatcgatcg
>seq006 {"count":1,"merged_sample":{"s2":1},"primer":"p1","sample":"s2"}
atcgatcgatcgatcgatcg
>seq004 {"count":2,"merged_sample":{"s1":2},"primer":"p1","sample":"s1"}
gctagctagctagctagcta
>seq007 {"count":1,"merged_sample":{"s2":1},"primer":"p1","sample":"s2"}
tttttttttttttttttttt
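The merged_sample attribute in this output is a per-group tally of the values of the merged key. The sketch below reproduces that tally in Python under the assumption, consistent with the output above, that reads lacking the key are counted under the --na-value placeholder. `dereplicate_with_merge` is a hypothetical name, not an OBITools4 function.

```python
from collections import Counter

def dereplicate_with_merge(records, categories, merge_key, na_value="NA"):
    """Group reads by sequence and category values, tallying merge_key per group.

    Returns a dict mapping each representative identifier to its
    count and its merged_<merge_key> table.
    """
    groups = {}
    for identifier, sequence, annotations in records:
        key = (sequence,) + tuple(annotations.get(c, na_value) for c in categories)
        group = groups.setdefault(
            key, {"id": identifier, "count": 0, "merged": Counter()}
        )
        group["count"] += 1
        group["merged"][annotations.get(merge_key, na_value)] += 1
    return {
        g["id"]: {"count": g["count"], "merged_" + merge_key: dict(g["merged"])}
        for g in groups.values()
    }

A, G = "ATCGATCGATCGATCGATCG", "GCTAGCTAGCTAGCTAGCTA"
reads = [
    ("seq001", A, {"sample": "s1"}), ("seq002", A, {"sample": "s1"}),
    ("seq003", A, {"sample": "s1"}), ("seq004", G, {"sample": "s1"}),
    ("seq005", G, {"sample": "s1"}), ("seq006", A, {"sample": "s2"}),
    ("seq007", "TTTTTTTTTTTTTTTTTTTT", {"sample": "s2"}),
    ("seq008", "CCCCCCCCCCCCCCCCCCCC", {}),
]

result = dereplicate_with_merge(reads, ["sample"], "sample", na_value="UNKNOWN")
for identifier, info in result.items():
    print(identifier, info)
```

Because the grouping key already includes sample, every merged_sample table here has a single entry. Calling the function with an empty category list instead would give seq001 the table {"s1": 3, "s2": 1}.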
Dereplicate across two files with no assumed ordering, grouping by sample and primer:
The files sample1.fastq and sample2.fastq contain reads from two samples, covering
two primers. Using --no-order allows the two files to be read in parallel, so the order
of the records in the output is not guaranteed. Grouping by both sample and primer
keeps distinct amplicon types separate, producing one representative per unique
(sequence, sample, primer) combination.
@s1_seq001 {"sample": "s1", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@s1_seq002 {"sample": "s1", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@s1_seq003 {"sample": "s1", "primer": "p2"}
GCTAGCTAGCTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIIII
@s1_seq004 {"sample": "s1", "primer": "p2"}
GCTAGCTAGCTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIIII
@s2_seq001 {"sample": "s2", "primer": "p1"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@s2_seq002 {"sample": "s2", "primer": "p2"}
TTTTTTTTTTTTTTTTTTTT
+
IIIIIIIIIIIIIIIIIIII
@s2_seq003 {"sample": "s2", "primer": "p2"}
TTTTTTTTTTTTTTTTTTTT
+
IIIIIIIIIIIIIIIIIIII
obiuniq --no-order -c sample -c primer sample1.fastq sample2.fastq -o out_multifile.fastq
>s1_seq001 {"count":2,"primer":"p1","sample":"s1"}
atcgatcgatcgatcgatcg
>s2_seq001 {"count":1,"primer":"p1","sample":"s2"}
atcgatcgatcgatcgatcg
>s1_seq004 {"count":2,"primer":"p2","sample":"s1"}
gctagctagctagctagcta
>s2_seq002 {"count":2,"primer":"p2","sample":"s2"}
tttttttttttttttttttt
Use in-memory chunking for faster processing of smaller datasets:
For datasets that fit comfortably in RAM, --in-memory avoids temporary disk I/O and
speeds up dereplication. The --chunk-count parameter controls how many internal partitions
are used (here increased to 200 for finer granularity). The --compress flag writes the
output as a gzip-compressed file.
obiuniq --in-memory --chunk-count 200 --compress -o out_inmemory.fastq.gz reads.fastq
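One way to picture why partitioning helps: if every sequence is routed to a partition chosen from a hash of its nucleotide string, all copies of a sequence land in the same partition, and each partition can be dereplicated on its own with a smaller memory footprint. The sketch below illustrates that general idea only. It is an assumption about the strategy, not a description of how obiuniq implements --chunk-count, and the function names are hypothetical.

```python
import zlib
from collections import Counter

def partition(sequences, chunk_count):
    """Route each sequence to one of chunk_count buckets by a stable hash."""
    chunks = [[] for _ in range(chunk_count)]
    for sequence in sequences:
        index = zlib.crc32(sequence.encode("ascii")) % chunk_count
        chunks[index].append(sequence)
    return chunks

def dereplicate_chunked(sequences, chunk_count=100):
    """Count occurrences chunk by chunk; identical sequences never span chunks."""
    counts = {}
    for chunk in partition(sequences, chunk_count):
        # Keys of different chunks are disjoint, so a plain update is safe.
        counts.update(Counter(chunk))
    return counts

sequences = (
    ["ATCGATCGATCGATCGATCG"] * 4
    + ["GCTAGCTAGCTAGCTAGCTA"] * 2
    + ["TTTTTTTTTTTTTTTTTTTT", "CCCCCCCCCCCCCCCCCCCC"]
)
print(dereplicate_chunked(sequences, chunk_count=200))
```

The result does not depend on the number of partitions; only the size of each partition changes.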
obiuniq --help