obidistribute: split a sequence file into multiple files #

Description #

obidistribute splits a set of biological sequences into multiple output files according to one of three distribution strategies: annotation-based classification, round-robin batch assignment, or hash-based sharding.

The most common use case in metabarcoding is demultiplexing: sequences carry a tag annotation (e.g., sample_id) and obidistribute writes each sample’s sequences into its own file. The output filename for each group is built from a user-supplied pattern containing %s, which is replaced by the classifier value or batch index. For example, with --pattern samples_%s.fastq and a sample_id annotation, sequences labelled sampleA will be written to samples_sampleA.fastq.

When no classifier is specified, sequences can be split into a fixed number of batches (--batches) for parallel downstream processing, or sharded deterministically by hash (--hash) so that the partitioning is reproducible regardless of input order. With --append, existing output files are extended rather than overwritten, which makes incremental demultiplexing possible. Sequences lacking the classifier annotation are written to a fallback file named using the --na-value (default: NA).
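The difference between overwriting and appending can be pictured with ordinary file modes. The Python sketch below is only an illustration of that idea, not the tool's own code: the helper names `write_records` and `count_records` and the file name are hypothetical, and the only assumption carried over from the text above is that --append behaves like opening the output in append mode.

```python
import os
import tempfile


def write_records(path, records, append=False):
    """Write FASTA records to path; append=True keeps existing content."""
    mode = "a" if append else "w"
    with open(path, mode) as handle:
        for seq_id, sequence in records:
            handle.write(f">{seq_id}\n{sequence}\n")


def count_records(path):
    """Count FASTA records by counting title lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "samples_sampleA.fasta")

# First run creates the file, second run extends it (append behaviour).
write_records(target, [("seq001", "ATCG")])
write_records(target, [("seq002", "GCTA")], append=True)
appended_count = count_records(target)

# A third run without append replaces the previous content.
write_records(target, [("seq003", "TTAG")])
overwritten_count = count_records(target)
```

In this sketch the file holds two records after the appending run and a single record after the overwriting run, which is why appending suits demultiplexing data that arrives in several passes.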

graph TD
  A@{ shape: doc, label: "reads.fastq" }
  C[obidistribute]
  D@{ shape: doc, label: "samples_sampleA.fastq" }
  E@{ shape: doc, label: "samples_sampleB.fastq" }
  F@{ shape: doc, label: "samples_NA.fastq" }
  A --> C:::obitools
  C --> D
  C --> E
  C --> F
  classDef obitools fill:#99d57c

To illustrate annotation-based demultiplexing, consider the following fastq input file where each sequence carries a sample_id tag:

📄 reads.fastq
@seq001 {"sample_id": "sampleA"}
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@seq002 {"sample_id": "sampleA"}
GCTAGCTAGCTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIIII
@seq003 {"sample_id": "sampleA"}
TTAGCTAATCGGTAATCGGT
+
IIIIIIIIIIIIIIIIIIII
@seq004 {"sample_id": "sampleB"}
CCGGAATTCCGGAATTCCGG
+
IIIIIIIIIIIIIIIIIIII
@seq005 {"sample_id": "sampleB"}
TTAAGGCCTTAAGGCCTTAA
+
IIIIIIIIIIIIIIIIIIII
@seq006 {"sample_id": "sampleB"}
AACCTTGGAACCTTGGAACC
+
IIIIIIIIIIIIIIIIIIII
@seq007 {"sample_id": "sampleC"}
GCATGCATGCATGCATGCAT
+
IIIIIIIIIIIIIIIIIIII
@seq008 {"sample_id": "sampleC"}
CATGCATGCATGCATGCATG
+
IIIIIIIIIIIIIIIIIIII
@seq009 {"sample_id": "sampleA"}
ATGATGATGATGATGATGAT
+
IIIIIIIIIIIIIIIIIIII
@seq010 {}
GGATCGATCGATCGATCGAT
+
IIIIIIIIIIIIIIIIIIII

Running obidistribute with --classifier sample_id dispatches each sequence to a separate file based on the value of that tag:

obidistribute --classifier sample_id --pattern samples_%s.fastq \
  --no-progressbar --input-json-header reads.fastq

This produces samples_sampleA.fastq, samples_sampleB.fastq, samples_sampleC.fastq, and samples_NA.fastq (for the sequence with no sample_id annotation). Each output file contains only the sequences belonging to that sample, with all original annotations preserved.
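The dispatch logic described above can be sketched in a few lines of Python. This is a simplified model rather than the OBITools4 implementation: the functions `parse_header` and `dispatch` are hypothetical names, the parser assumes strict four-line FASTQ records with a JSON title-line annotation, and tag values are simply converted with `str()`.

```python
import json
from collections import defaultdict


def parse_header(line):
    """Split a '@id {json}' title line into (id, annotations dict)."""
    seq_id, _, rest = line[1:].strip().partition(" ")
    annotations = json.loads(rest) if rest.strip() else {}
    return seq_id, annotations


def dispatch(fastq_text, classifier, pattern, na_value="NA"):
    """Group sequence ids by output filename, as built from the pattern."""
    lines = fastq_text.strip().splitlines()
    groups = defaultdict(list)
    for i in range(0, len(lines), 4):  # one FASTQ record = 4 lines
        seq_id, annotations = parse_header(lines[i])
        # Sequences without the tag fall back to the NA value.
        value = str(annotations.get(classifier, na_value))
        groups[pattern.replace("%s", value)].append(seq_id)
    return dict(groups)


reads = """@seq001 {"sample_id": "sampleA"}
ATCG
+
IIII
@seq007 {"sample_id": "sampleC"}
GCAT
+
IIII
@seq010 {}
GGAT
+
IIII
"""

result = dispatch(reads, "sample_id", "samples_%s.fastq")
```

Here `result` maps each generated filename to the ids it would receive, with seq010 ending up under samples_NA.fastq because it has no sample_id tag.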

Synopsis #

obidistribute --pattern|-p <string> [--append|-A] [--batch-mem <string>]
              [--batch-size <int>] [--batch-size-max <int>]
              [--batches|-n <int>] [--classifier|-c <string>] [--compress|-Z]
              [--csv] [--debug] [--directory|-d <string>] [--ecopcr] [--embl]
              [--fasta] [--fasta-output] [--fastq] [--fastq-output]
              [--genbank] [--hash|-H <int>] [--help|-h|-?]
              [--input-OBI-header] [--input-json-header] [--json-output]
              [--max-cpu <int>] [--na-value <string>] [--no-order]
              [--no-progressbar] [--out|-o <FILENAME>]
              [--output-OBI-header|-O] [--output-json-header] [--pprof]
              [--pprof-goroutine <int>] [--pprof-mutex <int>]
              [--silent-warning] [--skip-empty] [--solexa] [--u-to-t]
              [--version] [<args>]

Options #

Required options #

  • --pattern | -p <STRING>: The template used to build the names of the output files. The variable part is represented by %s. Example: samples_%s.fastq. This option is required.

obidistribute specific options #

  • --classifier | -c <STRING>: The name of an annotation tag on the sequences. Sequences are dispatched into separate files based on the value of this tag. The tag value must be a string, integer, or boolean. Default: "".
  • --batches | -n <INT>: Splits the input into exactly N batches by round-robin assignment, regardless of sequence metadata. Batch output files are named using 1-based indices. Default: 0.
  • --hash | -H <INT>: Splits the input into at most N batches using a hash of the sequence. Produces deterministic, reproducible sharding. Shard output files are named using 0-based indices. Default: 0.
  • --directory | -d <STRING>: Used together with --classifier: the name of a tag whose value is used to organise output files into subdirectories. Default: "".
  • --append | -A : Appends sequences to output files if they already exist, instead of overwriting them. Default: false.
  • --na-value <STRING>: Value used as the filename component when a sequence does not have the classifier tag defined. Default: "NA".

Controlling the input data #

OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. In rare cases a file can be misidentified, so the following options allow the user to force the format, bypassing the format identification step.

The file format options #

  • --fasta: indicates that sequence data is in fasta format.
  • --fastq: indicates that sequence data is in fastq format.
  • --embl: indicates that sequence data is in EMBL-ENA flatfile format.
  • --csv: indicates that sequence data is in CSV format.
  • --genbank: indicates that sequence data is in GenBank flatfile format.
  • --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.

Controlling the way OBITools4 formats annotations #

These options only apply to the FASTA and FASTQ formats.

  • --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
  • --input-json-header: FASTA/FASTQ title line annotations follow the JSON format.

Controlling quality score decoding #

This option only applies to the FASTQ format.

  • --solexa: decodes the quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)

Controlling the output data #

  • --compress | -Z : output is compressed using gzip. (default: false)
  • --no-order: by default, OBITools preserves the order of the sequences between the input file and the output file. When multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Processing multiple files in parallel may also require more memory.
  • --fasta-output: writes sequence data in fasta format (default if quality data is not available).
  • --fastq-output: writes sequence data in fastq format (default if quality data is available).
  • --json-output: writes sequence data in JSON format.
  • --out | -o <FILENAME>: filename used for saving the output (default: “-”, the standard output)
  • --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON).
  • --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format).
  • --skip-empty: sequences of length equal to zero are removed from the output (default: false).
  • --no-progressbar: deactivates progress bar display (default: false).

General options #

  • --help | -h|-? : shows this help.
  • --version: prints the version and exits.
  • --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.
  • --max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
  • --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
  • --batch-size <INTEGER>: minimum number of sequences per batch for parallel processing (floor, default: 1, env: OBIBATCHSIZE)
  • --batch-size-max <INTEGER>: maximum number of sequences per batch for parallel processing (ceiling, default: 2000, env: OBIBATCHSIZEMAX)
  • --batch-mem <STRING>: maximum memory per batch (e.g. 128K, 64M, 1G; default: 128M; set to 0 to disable, env: OBIBATCHMEM)
  • --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
  • --pprof: enables pprof server. Look at the log for details. (default: false).
  • --pprof-mutex <INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
  • --pprof-goroutine <INTEGER>: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)

Examples #

Split a dataset into equal batches for parallel processing #

The file sequences.fasta contains 10 fasta sequences without any annotations. Using --batches 3 distributes them by round-robin into three files of nearly equal size, suitable for parallel downstream analysis. Because 10 is not a multiple of 3, one file receives one more sequence than the others. Batch files are named with 1-based indices.

📄 sequences.fasta
>seq001 test sequence 1
ATCGATCGATCGATCGATCG
>seq002 test sequence 2
GCTAGCTAGCTAGCTAGCTA
>seq003 test sequence 3
TTAGCTAATCGGTAATCGGT
>seq004 test sequence 4
CCGGAATTCCGGAATTCCGG
>seq005 test sequence 5
TTAAGGCCTTAAGGCCTTAA
>seq006 test sequence 6
AACCTTGGAACCTTGGAACC
>seq007 test sequence 7
GCATGCATGCATGCATGCAT
>seq008 test sequence 8
CATGCATGCATGCATGCATG
>seq009 test sequence 9
ATGATGATGATGATGATGAT
>seq010 test sequence 10
GGATCGATCGATCGATCGAT

obidistribute --batches 3 \
  --pattern chunk_%s.fasta \
  --fasta-output --no-progressbar \
  sequences.fasta

📄 chunk_1.fasta
>seq001 {"definition":"test sequence 1"}
atcgatcgatcgatcgatcg
>seq004 {"definition":"test sequence 4"}
ccggaattccggaattccgg
>seq007 {"definition":"test sequence 7"}
gcatgcatgcatgcatgcat
>seq010 {"definition":"test sequence 10"}
ggatcgatcgatcgatcgat
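
Round-robin assignment with 1-based file indices can be modelled as follows. This Python sketch is an illustration under stated assumptions, not the tool's code: `round_robin` is a hypothetical helper, and it assumes sequences are assigned one at a time in input order, which is consistent with the chunk_1.fasta listing above.

```python
def round_robin(ids, n_batches, pattern):
    """Assign ids to n_batches files in turn, using 1-based indices."""
    files = {}
    for position, seq_id in enumerate(ids):
        batch_index = position % n_batches + 1  # 1-based batch number
        name = pattern.replace("%s", str(batch_index))
        files.setdefault(name, []).append(seq_id)
    return files


ids = [f"seq{i:03d}" for i in range(1, 11)]  # seq001 .. seq010
chunks = round_robin(ids, 3, "chunk_%s.fasta")
sizes = {name: len(members) for name, members in chunks.items()}
```

In this model chunk_1.fasta receives seq001, seq004, seq007 and seq010, and the remaining two files receive three sequences each.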

Hash-based reproducible sharding #

Hash-based sharding with --hash 4 assigns sequences deterministically to one of four shards based on a hash of the sequence content. Running the same command twice on the same input always produces the same assignment, making it useful for reproducible workflows. Shard files are named with 0-based indices.

obidistribute --hash 4 \
  --pattern shard_%s.fastq \
  --no-progressbar reads.fastq

📄 shard_0.fastq
@seq002 {"sample_id":"sampleA"}
gctagctagctagctagcta
+
IIIIIIIIIIIIIIIIIIII
@seq008 {"sample_id":"sampleC"}
catgcatgcatgcatgcatg
+
IIIIIIIIIIIIIIIIIIII
@seq010
ggatcgatcgatcgatcgat
+
IIIIIIIIIIIIIIIIIIII
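
The property that matters for hash-based sharding is that the shard depends only on the sequence, not on its position in the input. The Python sketch below illustrates this with `zlib.crc32` as a stand-in hash. The hash function actually used by obidistribute is not stated here, so the shard numbers from this sketch will not match the tool's output, and `shard_index` and `shard` are hypothetical names.

```python
import zlib


def shard_index(sequence, n_shards):
    """Map a sequence to a 0-based shard using a stand-in hash (CRC32)."""
    return zlib.crc32(sequence.lower().encode("ascii")) % n_shards


def shard(records, n_shards, pattern):
    """Group sequence ids by shard filename."""
    files = {}
    for seq_id, sequence in records:
        name = pattern.replace("%s", str(shard_index(sequence, n_shards)))
        files.setdefault(name, set()).add(seq_id)
    return files


records = [
    ("seq001", "ATCGATCGATCGATCGATCG"),
    ("seq002", "GCTAGCTAGCTAGCTAGCTA"),
    ("seq008", "CATGCATGCATGCATGCATG"),
    ("seq010", "GGATCGATCGATCGATCGAT"),
]

forward = shard(records, 4, "shard_%s.fastq")
backward = shard(list(reversed(records)), 4, "shard_%s.fastq")
same_partition = forward == backward  # input order does not matter
```

Because nothing forces every hash value to occur, some shards may stay empty, which is why --hash is documented as producing at most N files.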

Custom NA label for unclassified sequences #

When a sequence lacks the annotation used for classification, it is written to a fallback file. By default that file uses the label NA, but --na-value lets you choose a more descriptive name such as unclassified.

obidistribute --classifier sample_id \
  --na-value unclassified \
  --pattern out_ex6_%s.fastq \
  --no-progressbar --input-json-header reads.fastq

📄 out_ex6_unclassified.fastq
@seq010
ggatcgatcgatcgatcgat
+
IIIIIIIIIIIIIIIIIIII

Display help #

obidistribute --help