`obimatrix`: build a sample × sequence count matrix from annotated sequences #

Description #

After a metabarcoding sequencing experiment has been processed through the OBITools4 pipeline:

paired read alignment: obipairing
PCR demultiplexing: obimultiplex
sequence dereplication: obiuniq

each unique sequence (OTU) carries a merged_sample attribute: a dictionary mapping every sample name to the number of reads that matched that sequence in that sample. The main aim of obimatrix is to read these annotations and assembles them into a single sample × sequence count matrix (also called an OTU table), written in CSV format to standard output. By default obimatrix works with the merged_sample attribute, but it can apply the same process to any map attribute.

The file dereplicated.fasta below shows what such an annotated fasta file looks like. Each sequence header carries both the merged_sample map (used to fill the matrix) and additional attributes such as count, taxid, and definition that can optionally appear as extra annotation columns.

📄 dereplicated.fasta

>seq001 {"count": 15, "merged_sample": {"sample_A": 10, "sample_B": 5, "sample_C": 0}, "by_locality": {"loc_north": 12, "loc_south": 3}, "taxid": 2, "scientific_name": "Bacteria", "definition": "16S rRNA Bacteria OTU_001"}
ACGTACGTACGTACGTACGT
>seq002 {"count": 28, "merged_sample": {"sample_A": 0, "sample_B": 20, "sample_C": 8}, "by_locality": {"loc_north": 0, "loc_south": 28}, "taxid": 2759, "scientific_name": "Eukaryota", "definition": "18S rRNA Eukaryota OTU_002"}
TTGGCCAAGGTTCCAAGGTT
>seq003 {"count": 22, "merged_sample": {"sample_A": 7, "sample_B": 0, "sample_C": 15}, "by_locality": {"loc_north": 10, "loc_south": 12}, "taxid": 2157, "scientific_name": "Archaea", "definition": "16S rRNA Archaea OTU_003"}
GCGCGCATGCGCGCATGCGC
>seq004 {"count": 9, "merged_sample": {"sample_A": 3, "sample_B": 3, "sample_C": 3}, "by_locality": {"loc_north": 5, "loc_south": 4}, "taxid": 2, "scientific_name": "Bacteria", "definition": "16S rRNA Bacteria OTU_004"}
AAATTTCCCGGGAAATTTCC

graph TD
  A@{ shape: doc, label: "dereplicated.fasta" }
  C[obimatrix]
  D([stdout])
  A --> C:::obitools
  C --> D
  classDef obitools fill:#99d57c

By default obimatrix the matrix is oriented so that the rows correspond to samples and the columns to sequences. This is the layout expected by many ecological analysis tools such as the R packages vegan and phyloseq. Running obimatrix on the file above produces:

obimatrix dereplicated.fasta | csvlook

| id       | seq001 | seq002 | seq003 | seq004 |
| -------- | ------ | ------ | ------ | ------ |
| sample_A |     10 |      0 |      7 |      3 |
| sample_B |      5 |     20 |      0 |      3 |
| sample_C |      0 |      8 |     15 |      3 |

The --map option selects which sequence attribute to use as the source dictionary (default: merged_sample). This makes it straightforward to pivot on any other per-sequence map attribute. For example, using the by_locality attribute present in the demo file gives a locality × sequence table instead:

obimatrix --map by_locality dereplicated.fasta | csvlook

| id        | seq001 | seq002 | seq003 | seq004 |
| --------- | ------ | ------ | ------ | ------ |
| loc_north |     12 |      0 |     10 |      5 |
| loc_south |      3 |     28 |     12 |      4 |

Setting --transpose rparameter reverses the orientation, meaning that each row becomes a sequence rather than a sample. In this mode, extra per-sequence annotation columns become available: --count appends the total abundance across all samples, --definition appends the free-text title, --taxon the NCBI taxid, --sequence the nucleotide sequence, and --quality the Phred quality string. The option --keep|k <key> appends a column corresponding to the key attribute. By default the sequence IDs are not displayed, add the option --ids. These supplementary annotation columns are silently dropped in the default mode because the standard orientation cannot preserve them.

obimatrix --ids \
          --transpose \
          --count dereplicated.fasta \
    | csvlook

| id     | count | sample_A | sample_B | sample_C |
| ------ | ----- | -------- | -------- | -------- |
| seq001 |    15 |       10 |        5 |        0 |
| seq002 |    28 |        0 |       20 |        8 |
| seq003 |    22 |        7 |        0 |       15 |
| seq004 |     9 |        3 |        3 |        3 |

A third output format, selected with --three-columns, produces a long (tidy) table with one row per (sequence, sample) pair and exactly three columns: sequence ID, sample name, and count. This format is convenient for import into R’s tidyverse or Python’s pandas.

obimatrix --three-columns dereplicated.fasta \
   | csvlook

| id     | sample   | count |
| ------ | -------- | ----- |
| seq004 | sample_A |     3 |
| seq004 | sample_B |     3 |
| seq004 | sample_C |     3 |
| seq001 | sample_A |    10 |
| seq001 | sample_B |     5 |
| seq001 | sample_C |     0 |
| seq002 | sample_A |     0 |
| seq002 | sample_B |    20 |
| seq002 | sample_C |     8 |
| seq003 | sample_C |    15 |
| seq003 | sample_A |     7 |
| seq003 | sample_B |     0 |

Synopsis #

obimatrix [--allow-empty] [--auto] [--batch-mem <string>]
          [--batch-size <int>] [--batch-size-max <int>] [--count] [--csv]
          [--debug] [--definition|-d] [--ecopcr] [--embl] [--fasta] [--fastq]
          [--genbank] [--help|-h|-?] [--ids|-i] [--input-OBI-header]
          [--input-json-header] [--keep|-k <KEY>]... [--map <string>]
          [--map-na-value <string>] [--max-cpu <int>] [--na-value <NAVALUE>]
          [--no-order] [--obipairing] [--pprof] [--pprof-goroutine <int>]
          [--pprof-mutex <int>] [--quality|-q] [--sample-name <string>]
          [--sequence|-s] [--silent-warning] [--solexa] [--taxon]
          [--three-columns] [--transpose] [--u-to-t] [--value-name <string>]
          [--version] [<args>]

Options #

`obimatrix` specific options #

Matrix construction

--map <ATTR>: Name of the sequence attribute used as the sample→count map. The attribute must be a dictionary whose keys are sample identifiers and values are integer read counts. In a standard OBITools4 workflow this attribute is named merged_sample and is created by obiuniq. (default: merged_sample)
--map-na-value <VALUE>: String written in matrix cells when a sample has no count for a given sequence. (default: 0)
--allow-empty: By default, sequences whose map attribute is absent cause a fatal error. Setting this flag prevents the error and includes such sequences in the output, filling all sample cells with the --map-na-value string.
--auto: Inspect the first batch of input sequences and automatically derive the list of sequence attributes to include as extra fixed columns. Only has a visible effect in --transpose=false mode.
--keep | -k <KEY>: Append the sequence attribute named KEY as an extra fixed column in the output. This is equivalent to the dedicated column flags (--definition, --count, etc.) but for arbitrary user-defined attribute names. The flag can be repeated. Only has a visible effect in --transpose=false mode. This flag does not restrict which sample columns appear in the matrix.

Output format

--transpose: Transpose the matrix so that rows correspond to samples and columns to sequences. Set --transpose=false for the sequence-per-row orientation. In transposed mode, all extra attribute columns (--ids, --sequence, --quality, --count, --definition, --taxon, --obipairing, --keep, --auto) are silently discarded. (default: true)
--three-columns: Write a long-format (tidy) CSV table instead of a wide matrix. Each row encodes one (sequence_id, sample, value) triple. Column names are set by --sample-name and --value-name.
--sample-name <NAME>: Name of the column holding sample identifiers in three-column output. (default: sample)
--value-name <NAME>: Name of the column holding count values in three-column output. (default: count)

Extra annotation columns (visible only in --transpose=false mode)

--ids | -i : Append a column containing the sequence identifier.
--sequence | -s : Append a column containing the nucleotide sequence.
--quality | -q : Append a column named qualities containing the Phred quality string.
--definition | -d : Append a column containing the sequence definition (free-text title).
--count: Append a column containing the total read count of the sequence across all samples.
--taxon: Append a column named taxon containing the NCBI taxid of each sequence.
--obipairing: Append the eight attributes generated by obipairing (mode, seq_a_single, seq_b_single, ali_dir, score, score_norm, seq_ab_match, pairing_mismatches). Attributes absent from a sequence are written as NA.

CSV input

--na-value <NAVALUE>: String used to identify missing values when reading a CSV input file. (default: NA)

Controlling the input data #

OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step.

The file format options #

--fasta: indicates that sequence data is in fasta format.
--fastq: indicates that sequence data is in fastq format.
--embl: indicates that sequence data is in EMBL-ENA flatfile format.
--csv: indicates that sequence data is in CSV format.
--genbank: indicates that sequence data is in GenBank flatfile format.
--ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.

Controlling the way OBITools4 are formatting annotations #

These options only apply to the FASTA and FASTQ formats

--input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
--input-json-header: FASTA/FASTQ title line annotations follow the JSON format.

Controlling quality score decoding #

This option only applies to the FASTQ formats

--solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)

General options #

--help | -h|-? : shows this help.
--version: prints the version and exits.
--silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.

--max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
--force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
--batch-size <INTEGER>: minimum number of sequences per batch for parallel processing (floor, default: 1, env: OBIBATCHSIZE)
--batch-size-max <INTEGER>: maximum number of sequences per batch for parallel processing (ceiling, default: 2000, env: OBIBATCHSIZEMAX)
--batch-mem <STRING>: maximum memory per batch (e.g. 128K, 64M, 1G; default: 128M; set to 0 to disable, env: OBIBATCHMEM)

--debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
--pprof: enables pprof server. Look at the log for details. (default: false).
--pprof-mutex <INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
--pprof-goroutine <INTEGER>: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)

Examples #

obimatrix --help

obimatrix: build a sample × sequence count matrix from annotated sequences #