`obisummary`: summarise the main information from a sequence file #

Description #

obisummary provides a rapid statistical overview of a biological sequence dataset. Rather than transforming sequences, it reads them and outputs a single structured record describing global counts and the types of annotations present. When sample information is present, it also outputs per-sample statistics.

The output record is organised into three sections. The count section reports the total number of reads (accounting for the count attribute of each sequence), the number of distinct sequence variants, and the cumulative sequence length. The annotations section enumerates every annotation key found in the dataset, classifying each as scalar, map, or vector. The samples section appears only when merged-sample data is present and lists per-sample reads, variants, and singletons.
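The classification of annotation keys described above can be illustrated with a minimal Python sketch. It is not the obisummary implementation: the function names `classify` and `annotation_keys`, the sample records, and the `primers` key are hypothetical, and the rule used here (dictionary means map, list means vector, anything else means scalar) is an assumption about how the three categories are told apart.

```python
from collections import Counter


def classify(value):
    """Return 'map', 'vector', or 'scalar' for one annotation value."""
    if isinstance(value, dict):
        return "map"
    if isinstance(value, (list, tuple)):
        return "vector"
    return "scalar"


def annotation_keys(records):
    """Count, per kind, how many records carry each annotation key."""
    keys = {"scalar": Counter(), "map": Counter(), "vector": Counter()}
    for annotations in records:
        for key, value in annotations.items():
            keys[classify(value)][key] += 1
    return {kind: dict(counter) for kind, counter in keys.items()}


# Hypothetical annotation dictionaries, one per sequence.
records = [
    {"count": 5, "merged_sample": {"s1": 3, "s2": 2}},
    {"count": 3, "merged_sample": {"s1": 3}, "primers": ["fwd", "rev"]},
]
summary = annotation_keys(records)
print(summary)
```

Each number in the result is the number of records carrying that key, which mirrors how the `keys` entries read in the example outputs on this page.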

obisummary is typically run after obiuniq or obiclean to validate the state of a dataset. The default output format is JSON; use --yaml-output for YAML output.

graph TD
  A@{ shape: doc, label: "simple.fasta" }
  C[obisummary]
  D@{ shape: doc, label: "cleaned.json" }
  A --> C:::obitools
  C --> D
  classDef obitools fill:#99d57c

The file simple.fasta contains five fasta sequences with abundance annotations:

📄 simple.fasta

>seq001 {"count":5}
ACGTACGTACGTACGTACGT
>seq002 {"count":3}
TGCATGCATGCATGCATGCA
>seq003 {"count":1}
AAACCCGGGTTTTAAAACCC
>seq004 {"count":2}
GCTAGCTAGCTAGCTAGCTA
>seq005 {"count":10}
AATTCCGGAATTCCGGAATT

Running obisummary on this file produces a JSON record:

obisummary simple.fasta

{
  "annotations": {
    "keys": {
      "scalar": {
        "count": 5
      }
    },
    "map_attributes": 0,
    "scalar_attributes": 1,
    "vector_attributes": 0
  },
  "count": {
    "reads": 21,
    "total_length": 100,
    "variants": 5
  }
}

The count.variants field (5) is the number of distinct sequences; count.reads (21) is the sum of all count attributes; count.total_length (100) is the total nucleotide count. The annotations section confirms that count is the only scalar annotation present.
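The arithmetic behind the three count fields can be checked by hand with a short Python sketch. This is only an illustration under stated assumptions, not the way obisummary is implemented: the helper names `parse_fasta` and `count_section` are hypothetical, a sequence without a count attribute is assumed to weigh 1, and one variant is counted per record.

```python
import json

# Content of simple.fasta as shown above.
FASTA = """\
>seq001 {"count":5}
ACGTACGTACGTACGTACGT
>seq002 {"count":3}
TGCATGCATGCATGCATGCA
>seq003 {"count":1}
AAACCCGGGTTTTAAAACCC
>seq004 {"count":2}
GCTAGCTAGCTAGCTAGCTA
>seq005 {"count":10}
AATTCCGGAATTCCGGAATT
"""


def _record(header, chunks):
    """Split a title line into identifier and JSON annotations."""
    ident, _, rest = header.partition(" ")
    annotations = json.loads(rest) if rest.strip() else {}
    return ident, annotations, "".join(chunks)


def parse_fasta(text):
    """Yield (identifier, annotations, sequence) from FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif line.strip():
            chunks.append(line.strip())
    if header is not None:
        yield _record(header, chunks)


def count_section(records):
    """Accumulate reads, variants, and total sequence length."""
    reads = variants = total_length = 0
    for _, annotations, sequence in records:
        reads += annotations.get("count", 1)  # assumed default weight of 1
        variants += 1
        total_length += len(sequence)
    return {"reads": reads, "total_length": total_length, "variants": variants}


counts = count_section(parse_fasta(FASTA))
print(counts)  # {'reads': 21, 'total_length': 100, 'variants': 5}
```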

To obtain the same information in YAML format, add --yaml-output:

obisummary --yaml-output simple.fasta

annotations:
    keys:
        scalar:
            count: 5
    map_attributes: 0
    scalar_attributes: 1
    vector_attributes: 0
count:
    reads: 21
    total_length: 100
    variants: 5

Synopsis #

obisummary [--batch-mem <string>] [--batch-size <int>]
           [--batch-size-max <int>] [--csv] [--debug] [--ecopcr] [--embl]
           [--fasta] [--fastq] [--genbank] [--help|-h|-?]
           [--input-OBI-header] [--input-json-header] [--json-output]
           [--map <string>]... [--max-cpu <int>] [--no-order] [--pprof]
           [--pprof-goroutine <int>] [--pprof-mutex <int>] [--silent-warning]
           [--solexa] [--u-to-t] [--version] [--yaml-output] [<args>]

Options #

`obisummary` specific options #

--json-output: Print the result as a JSON record. This is the default behaviour; this flag makes the choice explicit.
--yaml-output: Print the result as a YAML record instead of the default JSON format.
--map <string>: Name of a map attribute to include in the summary detail. This option may be repeated to request multiple map attributes.
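
Because JSON is the default output, the record is easy to consume from a script. The sketch below, in Python, parses the record shown in the Description section and derives one extra figure from it. The variable names are illustrative, and the use of `.get` for the samples section simply reflects that this section is absent when no merged-sample data is present.

```python
import json

# JSON record copied from the obisummary output for simple.fasta.
RECORD = """
{
  "annotations": {
    "keys": {"scalar": {"count": 5}},
    "map_attributes": 0,
    "scalar_attributes": 1,
    "vector_attributes": 0
  },
  "count": {"reads": 21, "total_length": 100, "variants": 5}
}
"""

summary = json.loads(RECORD)
reads = summary["count"]["reads"]
variants = summary["count"]["variants"]
mean_reads = reads / variants

# The samples section may be missing, so fall back to an empty mapping.
samples = summary.get("samples", {}).get("sample_stats", {})

print(f"{reads} reads in {variants} variants ({mean_reads:.1f} reads per variant)")
print(f"{len(samples)} samples described")
```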

Controlling the input data #

OBITools4 usually detects the input file format automatically, including whether the file is compressed with GZIP. In the rare cases where a file is misidentified, the following options let the user force the format and bypass the detection step.

The file format options #

--fasta: indicates that sequence data is in fasta format.
--fastq: indicates that sequence data is in fastq format.
--embl: indicates that sequence data is in EMBL-ENA flatfile format.
--csv: indicates that sequence data is in CSV format.
--genbank: indicates that sequence data is in GenBank flatfile format.
--ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.

Controlling the way OBITools4 formats annotations #

These options only apply to the FASTA and FASTQ formats.

--input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
--input-json-header: FASTA/FASTQ title line annotations follow the JSON format.

Controlling quality score decoding #

This option only applies to the FASTQ format.

--solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)
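
For readers unfamiliar with the two encodings, the difference can be sketched in Python. The offsets (33 for Sanger, 64 for Solexa) and the Solexa-to-Phred conversion formula are the commonly published definitions; the function names are hypothetical, and whether OBITools4 converts Solexa scores to the Phred scale internally is not stated on this page.

```python
import math


def sanger_quality(char):
    """Phred score of one character in the standard Sanger encoding."""
    return ord(char) - 33


def solexa_quality(char):
    """Solexa score of one character (ASCII offset 64, can be negative)."""
    return ord(char) - 64


def solexa_to_phred(q_solexa):
    """Convert a Solexa score to the equivalent Phred score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


print(sanger_quality("I"))   # 40
print(solexa_quality("h"))   # 40
print(round(solexa_to_phred(solexa_quality("@")), 2))
```

The two scales agree closely for high scores and diverge for low ones, which is why decoding an old Solexa file as Sanger gives misleading qualities.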

General options #

--help | -h | -?: shows this help.
--version: prints the version and exits.
--silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.

--max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
--force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
--batch-size <INTEGER>: minimum number of sequences per batch for parallel processing (floor, default: 1, env: OBIBATCHSIZE)
--batch-size-max <INTEGER>: maximum number of sequences per batch for parallel processing (ceiling, default: 2000, env: OBIBATCHSIZEMAX)
--batch-mem <STRING>: maximum memory per batch (e.g. 128K, 64M, 1G; default: 128M; set to 0 to disable, env: OBIBATCHMEM)

--debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
--pprof: enables pprof server. Look at the log for details. (default: false).
--pprof-mutex <INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
--pprof-goroutine <INTEGER>: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)

Examples #

The file sequences.fasta contains five fasta sequences, two of which are singletons (count equal to 1). The following pipeline uses obigrep to discard singletons before summarising the remaining reads:

📄 sequences.fasta

>seq001 {"count":5}
ACGTACGTACGTACGTACGT
>seq002 {"count":1}
TGCATGCATGCATGCATGCA
>seq003 {"count":3}
AAACCCGGGTTTTAAAACCC
>seq004 {"count":2}
GCTAGCTAGCTAGCTAGCTA
>seq005 {"count":1}
AATTCCGGAATTCCGGAATT

obigrep -p 'annotations.count > 1' sequences.fasta \
  | obisummary --yaml-output  \
  > out_pipeline.yaml

📄 out_pipeline.yaml

annotations:
    keys:
        scalar:
            count: 3
    map_attributes: 0
    scalar_attributes: 1
    vector_attributes: 0
count:
    reads: 10
    total_length: 60
    variants: 3
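
The effect of the filter can be checked with a small Python sketch that compares the summary before and after discarding singletons. It only reproduces the arithmetic of this example: the `summarise` helper is hypothetical, and the records are transcribed by hand from sequences.fasta.

```python
# (identifier, count, sequence) triples transcribed from sequences.fasta.
RECORDS = [
    ("seq001", 5, "ACGTACGTACGTACGTACGT"),
    ("seq002", 1, "TGCATGCATGCATGCATGCA"),
    ("seq003", 3, "AAACCCGGGTTTTAAAACCC"),
    ("seq004", 2, "GCTAGCTAGCTAGCTAGCTA"),
    ("seq005", 1, "AATTCCGGAATTCCGGAATT"),
]


def summarise(records):
    """Return the three count fields for an iterable of records."""
    records = list(records)
    return {
        "reads": sum(count for _, count, _ in records),
        "total_length": sum(len(seq) for _, _, seq in records),
        "variants": len(records),
    }


before = summarise(RECORDS)
# Same predicate as the obigrep call: keep sequences with count > 1.
after = summarise(r for r in RECORDS if r[1] > 1)
discarded = {key: before[key] - after[key] for key in before}

print(after)      # {'reads': 10, 'total_length': 60, 'variants': 3}
print(discarded)  # {'reads': 2, 'total_length': 40, 'variants': 2}
```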

Aggregate read counts per map attribute with `--map` #

The --map option names a map attribute and instructs obisummary to accumulate, for each key of that attribute, the total number of reads across all sequences. The option may be repeated to request several attributes at once.

The file merged.fasta contains four fasta sequences produced after dereplication. Each carries a merged_sample map added by obiuniq, an obiclean_weight map (per-sample read counts as written by obiclean) and a marker map identifying which PCR target was amplified:

📄 merged.fasta

>seq001 {"count":5,"merged_sample":{"s1":3,"s2":2},"obiclean_weight":{"s1":6,"s2":2},"marker":{"16S":5}}
ACGTACGTACGTACGTACGT
>seq002 {"count":3,"merged_sample":{"s1":3},"obiclean_weight":{"s1":3},"marker":{"COI":3}}
TGCATGCATGCATGCATGCA
>seq003 {"count":8,"merged_sample":{"s2":8},"obiclean_weight":{"s2":9},"marker":{"16S":8}}
AAACCCGGGTTTTAAAACCC
>seq004 {"count":2,"merged_sample":{"s1":1,"s2":1},"obiclean_weight":{"s1":1,"s2":1},"marker":{"COI":2}}
GCTAGCTAGCTAGCTAGCTA

obisummary --map obiclean_weight \
           --map marker \
           --yaml-output  \
    merged.fasta > out_map.yaml

📄 out_map.yaml

annotations:
    keys:
        map:
            marker: 4
            merged_sample: 4
            obiclean_weight: 4
        scalar:
            count: 4
    map_attributes: 3
    scalar_attributes: 1
    vector_attributes: 0
count:
    reads: 18
    total_length: 80
    variants: 4
map_summaries:
    marker:
        16S: 13
        COI: 5
    obiclean_weight:
        s1: 10
        s2: 12
samples:
    sample_count: 2
    sample_stats:
        s1:
            reads: 7
            singletons: 1
            variants: 3
        s2:
            reads: 11
            singletons: 1
            variants: 3

The new map_summaries section provides experiment-wide totals for each requested attribute: obiclean_weight sums to 10 for s1 and 12 for s2, while marker 16S accounts for 13 reads and COI for 5. The samples.sample_stats section is produced automatically, without any --map option. In this output its per-sample read totals (7 for s1, 11 for s2) match the sums of the merged_sample values rather than those of obiclean_weight, and it adds variant and singleton counts.
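
The numbers in out_map.yaml can be reproduced by hand with the following Python sketch. It is an illustration, not the obisummary implementation: `map_summary` and `sample_stats` are hypothetical helpers, and computing the per-sample statistics from merged_sample is an assumption chosen because it reproduces the `reads` values 7 and 11 shown in the output.

```python
from collections import defaultdict

# Annotations transcribed from merged.fasta.
RECORDS = [
    {"count": 5, "merged_sample": {"s1": 3, "s2": 2},
     "obiclean_weight": {"s1": 6, "s2": 2}, "marker": {"16S": 5}},
    {"count": 3, "merged_sample": {"s1": 3},
     "obiclean_weight": {"s1": 3}, "marker": {"COI": 3}},
    {"count": 8, "merged_sample": {"s2": 8},
     "obiclean_weight": {"s2": 9}, "marker": {"16S": 8}},
    {"count": 2, "merged_sample": {"s1": 1, "s2": 1},
     "obiclean_weight": {"s1": 1, "s2": 1}, "marker": {"COI": 2}},
]


def map_summary(records, attribute):
    """Sum the values of a map attribute, key by key, over all records."""
    totals = defaultdict(int)
    for annotations in records:
        for key, value in annotations.get(attribute, {}).items():
            totals[key] += value
    return dict(totals)


def sample_stats(records, attribute="merged_sample"):
    """Per-sample reads, variants, and singletons (assumed source: merged_sample)."""
    stats = defaultdict(lambda: {"reads": 0, "singletons": 0, "variants": 0})
    for annotations in records:
        for sample, reads in annotations.get(attribute, {}).items():
            stats[sample]["reads"] += reads
            stats[sample]["variants"] += 1
            if reads == 1:
                stats[sample]["singletons"] += 1
    return dict(stats)


marker_totals = map_summary(RECORDS, "marker")
weight_totals = map_summary(RECORDS, "obiclean_weight")
per_sample = sample_stats(RECORDS)

print(marker_totals)  # {'16S': 13, 'COI': 5}
print(weight_totals)  # {'s1': 10, 's2': 12}
print(per_sample)
```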

Summarise a real metabarcoding dataset (wolf tutorial) #

The file wolf.taxo.ann.fasta.gz is the final output of the OBITools4 wolf tutorial. It contains 26 sequence variants produced by the full pipeline — paired-end assembly, demultiplexing, dereplication, chimera filtering, and taxonomic assignment with obitag. Running obisummary on this file gives an immediate overview of the whole experiment:

obisummary --yaml-output \
    wolf.taxo.ann.fasta.gz > out_wolf.yaml

📄 out_wolf.yaml

annotations:
    keys:
        map:
            merged_sample: 26
            obiclean_status: 26
            obiclean_weight: 26
        scalar:
            count: 26
            obitag_bestid: 26
            obitag_bestmatch: 26
            obitag_difference: 26
            obitag_match_count: 26
            obitag_rank: 26
            scientific_name: 26
            taxid: 26
    map_attributes: 3
    scalar_attributes: 8
    vector_attributes: 0
count:
    reads: 31337
    total_length: 2585
    variants: 26
samples:
    sample_count: 4
    sample_stats:
        13a_F730603:
            obiclean_bad: 0
            reads: 7337
            singletons: 1
            variants: 8
        15a_F730814:
            obiclean_bad: 0
            reads: 7568
            singletons: 0
            variants: 3
        26a_F040644:
            obiclean_bad: 0
            reads: 11059
            singletons: 0
            variants: 13
        29a_F260619:
            obiclean_bad: 0
            reads: 5373
            singletons: 1
            variants: 9

The summary reveals 26 variants representing 31 337 reads spread across 4 samples, 8 scalar attributes (abundance, taxonomy, and obitag scores) and 3 map attributes (merged_sample, obiclean_status, obiclean_weight).
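
A quick sanity check on such a summary is to verify that the per-sample read totals add up to count.reads, and to express each sample as a share of the experiment. The Python sketch below does this with the values copied from out_wolf.yaml. That the two totals agree is an observation about this dataset, not a documented guarantee of obisummary.

```python
# Values copied from samples.sample_stats and count.reads in out_wolf.yaml.
SAMPLE_READS = {
    "13a_F730603": 7337,
    "15a_F730814": 7568,
    "26a_F040644": 11059,
    "29a_F260619": 5373,
}
TOTAL_READS = 31337

total = sum(SAMPLE_READS.values())
shares = {
    sample: round(100 * reads / TOTAL_READS, 1)
    for sample, reads in SAMPLE_READS.items()
}

print(total == TOTAL_READS)  # True
for sample, share in shares.items():
    print(f"{sample}: {share}% of reads")
```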

obisummary --help
