`obidemerge`: split merged sequence records back into individual, sample-annotated copies #

Description #

In a typical metabarcoding workflow, obiuniq collapses identical sequences into a single representative record. To avoid losing information carried by the individual sequences, such as sample provenance, its --merge (-m) option retains an occurrence frequency table in the representative record. For instance, when obiuniq is run with -m sample, each dereplicated record contains a merged_sample attribute that stores, for each original sample, the number of times the sequence was observed in that sample. This compact representation is efficient for clustering and denoising, but other downstream analyses require the original per-sample view (i.e., one record per sample and per unique sequence).
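To make the merged representation concrete, the following minimal Python sketch mimics the kind of per-sample table described above. It is an illustration only, not the obiuniq implementation: the `dereplicate` function and the toy reads are hypothetical, and only the `count` and `merged_sample` attribute names come from this page.

```python
from collections import Counter, defaultdict


def dereplicate(reads):
    """Collapse identical sequences and keep a per-sample occurrence table.

    `reads` is an iterable of (sequence, sample) pairs. The result maps each
    unique sequence to annotations shaped like the ones shown on this page:
    a total `count` and a `merged_sample` map of sample name -> occurrences.
    """
    table = defaultdict(Counter)
    for sequence, sample in reads:
        table[sequence][sample] += 1
    return {
        sequence: {
            "count": sum(per_sample.values()),
            "merged_sample": dict(per_sample),
        }
        for sequence, per_sample in table.items()
    }


# Hypothetical toy data: four reads, two distinct sequences.
reads = [
    ("ACGT", "sampleA"),
    ("ACGT", "sampleA"),
    ("ACGT", "sampleB"),
    ("TTGG", "sampleB"),
]
unique = dereplicate(reads)
print(unique["ACGT"])
```

The per-sample map is what allows the dereplication to be reversed later without going back to the raw reads.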

obidemerge reverses the merging step. For each input sequence with a merged_* statistic attribute, one output sequence is produced for each entry in the statistics map. For example, in the case of demerging the merged_sample attribute, each output copy has its sample attribute set to the sample name and its count attribute set to the recorded abundance. The original statistics attribute (merged_sample) is removed from all output sequences. Sequences that carry no statistics for the chosen attribute are passed through unchanged.

The attribute name passed to the -d option of obidemerge is the logical attribute name (e.g., sample), not the internal storage name. The tool prepends merged_ internally when looking up the attribute. Therefore, after running obiuniq --merge sample, which stores statistics under merged_sample, you must call obidemerge with -d sample.
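The expansion rule and the merged_ prefix lookup described above can be sketched in a few lines of Python. This is a simplified model operating on plain dictionaries, not the actual obidemerge code: the `demerge` function is hypothetical, and passing an empty statistics map through unchanged is an assumption of this sketch.

```python
def demerge(annotations, attribute="sample"):
    """Expand one record's annotations into one copy per statistics entry.

    `attribute` is the logical name (e.g. "sample"); the statistics are
    looked up under "merged_" + attribute, as described on this page.
    """
    key = "merged_" + attribute
    stats = annotations.get(key)
    if not stats:
        # No statistics for this attribute: pass the record through unchanged.
        # (Treating an empty map the same way is an assumption of this sketch.)
        return [dict(annotations)]

    # Every copy drops the statistics attribute itself.
    base = {k: v for k, v in annotations.items() if k != key}
    copies = []
    for name, abundance in stats.items():
        copy = dict(base)
        copy[attribute] = name      # e.g. sample = "sampleA"
        copy["count"] = abundance   # abundance recorded for that entry
        copies.append(copy)
    return copies


record = {"merged_sample": {"sampleA": 5, "sampleB": 3, "sampleC": 1}, "count": 9}
for copy in demerge(record, "sample"):
    print(copy)
```

Calling `demerge({"count": 6})` returns the single unchanged record, which mirrors the pass-through behaviour for sequences without statistics.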

graph TD
  A@{ shape: doc, label: "unique.fasta" }
  C[obidemerge]
  D@{ shape: doc, label: "per_sample_merged.fasta" }
  A --> C:::obitools
  C --> D
  classDef obitools fill:#99d57c

The following file illustrates a typical input: three sequences carry merged_sample statistics recording how many reads were observed per sample, while one sequence (seq004) has no per-sample breakdown and will be passed through unchanged.

📄 unique.fasta

>seq001 {"merged_sample": {"sampleA": 5, "sampleB": 3, "sampleC": 1}, "count": 9}
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
>seq002 {"merged_sample": {"sampleA": 2, "sampleD": 7}, "count": 9}
TTGGCCAATTGGCCAATTGGCCAATTGGCCAATTGGCCAA
>seq003 {"merged_sample": {"sampleB": 4}, "count": 4}
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTA
>seq004 {"count": 6}
AAAACCCCGGGGTTTTAAAACCCCGGGGTTTTAAAACCCC

Running obidemerge with -d sample expands each entry of the merged_sample attribute into a separate record, setting sample and count on each copy:

obidemerge -d sample unique.fasta > per_sample_merged.fasta

📄 per_sample_merged.fasta

>seq001 {"count":5,"sample":"sampleA"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
>seq001 {"count":3,"sample":"sampleB"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
>seq001 {"count":1,"sample":"sampleC"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
>seq002 {"count":2,"sample":"sampleA"}
ttggccaattggccaattggccaattggccaattggccaa
>seq002 {"count":7,"sample":"sampleD"}
ttggccaattggccaattggccaattggccaattggccaa
>seq003 {"count":4,"sample":"sampleB"}
gctagctagctagctagctagctagctagctagctagcta
>seq004 {"count":6}
aaaaccccggggttttaaaaccccggggttttaaaacccc

seq001 yielded three copies (one per sample in merged_sample), seq002 yielded two, seq003 yielded one, and seq004 — which had no merged_sample attribute — was passed through unchanged with its original count of 6.
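As an illustration of the per-sample view this output enables, the Python sketch below tallies the number of reads per sample from the demerged records shown above. It assumes title lines of the form `>identifier {JSON annotations}`; the `reads_per_sample` helper is hypothetical and is not part of OBITools4.

```python
import json
from collections import Counter


def reads_per_sample(fasta_lines):
    """Sum the `count` attribute of every record, grouped by `sample`."""
    totals = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        # Assumed layout: ">identifier {JSON annotations}"
        _, _, header = line[1:].strip().partition(" ")
        annotations = json.loads(header) if header else {}
        if "sample" in annotations:
            totals[annotations["sample"]] += annotations.get("count", 1)
    return totals


demerged = """\
>seq001 {"count":5,"sample":"sampleA"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
>seq001 {"count":3,"sample":"sampleB"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
>seq001 {"count":1,"sample":"sampleC"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
>seq002 {"count":2,"sample":"sampleA"}
ttggccaattggccaattggccaattggccaattggccaa
>seq002 {"count":7,"sample":"sampleD"}
ttggccaattggccaattggccaattggccaattggccaa
>seq003 {"count":4,"sample":"sampleB"}
gctagctagctagctagctagctagctagctagctagcta
>seq004 {"count":6}
aaaaccccggggttttaaaaccccggggttttaaaacccc
"""

totals = reads_per_sample(demerged.splitlines())
for sample, n in sorted(totals.items()):
    print(sample, n)
```

The per-sample totals add up to 22, the sum of the original counts of seq001, seq002 and seq003 (9 + 9 + 4); seq004 has no sample attribute and is not assigned to any sample.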

Synopsis #

obidemerge [--demerge|-d <string>] [--out|-o <FILENAME>]
           [--fasta-output] [--fastq-output] [--json-output]
           [--compress|-Z] [--taxonomy|-t <string>]
           [--fasta] [--fastq] [--csv] [--embl] [--genbank] [--ecopcr]
           [--max-cpu <int>] [--batch-size <int>] [--no-progressbar]
           [<args>]

Options #

`obidemerge` specific options #

--demerge | -d <attribute>: Name of the sequence attribute that holds the merged statistics to expand. Each key in that statistics map becomes a separate output sequence. The tool looks for the attribute named merged_<attribute> in the sequence annotations — pass the logical name without the merged_ prefix. Default: sample

Taxonomic options #

--taxonomy | -t <string>: Path to the taxonomic database.

Controlling the input data #

OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step.

The file format options #

--fasta: indicates that sequence data is in fasta format.
--fastq: indicates that sequence data is in fastq format.
--embl: indicates that sequence data is in EMBL-ENA flatfile format.
--csv: indicates that sequence data is in CSV format.
--genbank: indicates that sequence data is in GenBank flatfile format.
--ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.

Controlling the way OBITools4 formats annotations #

These options only apply to the FASTA and FASTQ formats.

--input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
--input-json-header: FASTA/FASTQ title line annotations follow the JSON format.

Controlling quality score decoding #

This option only applies to the FASTQ format.

--solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)
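For readers unfamiliar with the two encodings, the Python sketch below shows the commonly documented difference between them: Sanger quality characters use an ASCII offset of 33 and encode Phred scores directly, while old Solexa characters use an offset of 64 and encode a log-odds score that can be converted to the Phred scale. This reflects the general FASTQ conventions, not the OBITools4 source code, and the function names are hypothetical.

```python
import math


def decode_sanger(quality_string):
    """Decode a Sanger-encoded quality string (ASCII offset 33) to Phred scores."""
    return [ord(c) - 33 for c in quality_string]


def decode_solexa(quality_string):
    """Decode an old Solexa-encoded quality string (ASCII offset 64).

    Solexa scores are log-odds values; they are converted to the Phred scale
    with Q_phred = 10 * log10(10 ** (Q_solexa / 10) + 1).
    """
    phred = []
    for c in quality_string:
        q_solexa = ord(c) - 64
        phred.append(10 * math.log10(10 ** (q_solexa / 10) + 1))
    return phred


print(decode_sanger("I!"))
print([round(q, 2) for q in decode_solexa("h@")])
```

The two scales nearly coincide for high-quality bases and diverge for low-quality ones, which is why decoding a Solexa file as Sanger gives misleading scores.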

Controlling the output data #

--compress | -Z : output is compressed using gzip. (default: false)
--no-order: by default, OBITools preserve the order of the sequences between the input and the output, and multiple input files are processed one at a time. With the --no-order option, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Processing multiple files in parallel may also require more memory.
--fasta-output: writes sequence data in fasta format (default if quality data is not available).
--fastq-output: writes sequence data in fastq format (default if quality data is available).
--json-output: writes sequence data in JSON format.
--out | -o <FILENAME>: filename used for saving the output (default: “-”, the standard output)
--output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON).
--output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format).
--skip-empty: sequences of length equal to zero are removed from the output (default: false).
--no-progressbar: deactivates progress bar display (default: false).

General options #

--help | -h | -?: shows this help.
--version: prints the version and exits.
--silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.

--max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
--force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
--batch-size <INTEGER>: minimum number of sequences per batch for parallel processing (floor, default: 1, env: OBIBATCHSIZE)
--batch-size-max <INTEGER>: maximum number of sequences per batch for parallel processing (ceiling, default: 2000, env: OBIBATCHSIZEMAX)
--batch-mem <STRING>: maximum memory per batch (e.g. 128K, 64M, 1G; default: 128M; set to 0 to disable, env: OBIBATCHMEM)

--debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
--pprof: enables pprof server. Look at the log for details. (default: false).
--pprof-mutex <INTEGER>: enables profiling of mutex locks. (default: 10, env: OBIPPROFMUTEX)
--pprof-goroutine <INTEGER>: enables profiling of goroutine blocking. (default: 6060, env: OBIPPROFGOROUTINE)

Examples #

Display help #

obidemerge --help
