obisummary

obisummary: summarize the main information from a sequence file #

Preliminary AI-generated documentation

This page was automatically generated by an AI assistant and has not yet been reviewed or validated by the OBITools4 development team. It may contain inaccuracies or incomplete information. Use with caution and refer to the command’s --help output for authoritative option descriptions.

Description #

obisummary provides a rapid statistical overview of a biological sequence dataset. Rather than transforming sequences, it reads them and outputs a single structured record describing global counts, annotation types, and, when sample information is present, per-sample statistics.

The output record is organised into three sections. The count section reports the total number of reads (accounting for the count attribute of each sequence), the number of distinct sequence variants, and the cumulative sequence length. The annotations section enumerates every annotation key found in the dataset, classifying each as scalar, map, or vector. The samples section appears only when merged-sample data is present and lists per-sample reads, variants, and singletons.
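As a rough illustration of this classification, the following Python sketch builds an annotations-like section from already-parsed annotation dictionaries. It is not obisummary's actual code. It assumes that JSON objects are reported as map, arrays as vector, and everything else as scalar, and that the number listed under each key is the number of sequences carrying that key. The function name annotation_section and the sample attributes (merged_sample, tags) are hypothetical and used for illustration only.

```python
def annotation_section(annotation_dicts):
    """Build an annotations-like summary from per-sequence annotation dicts.

    Assumption: dict -> "map", list/tuple -> "vector", anything else -> "scalar".
    """
    keys = {}
    for annotations in annotation_dicts:
        for key, value in annotations.items():
            if isinstance(value, dict):
                kind = "map"
            elif isinstance(value, (list, tuple)):
                kind = "vector"
            else:
                kind = "scalar"
            by_kind = keys.setdefault(kind, {})
            # Count how many sequences carry this key.
            by_kind[key] = by_kind.get(key, 0) + 1

    section = {"keys": keys}
    for kind in ("map", "scalar", "vector"):
        # Number of distinct keys of each kind.
        section[f"{kind}_attributes"] = len(keys.get(kind, {}))
    return section


# Hypothetical annotations for three sequences.
example = [
    {"count": 5, "merged_sample": {"s1": 3, "s2": 2}},
    {"count": 1, "merged_sample": {"s1": 1}},
    {"count": 2, "tags": ["a", "b"]},
]
print(annotation_section(example))
```

Under these assumptions, the example reports one scalar key (count, seen on three sequences), one map key (merged_sample, seen on two) and one vector key (tags, seen on one).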

obisummary is typically run after obiuniq or obiclean to validate the state of a dataset. The default output format is JSON; use --yaml-output for YAML output.

graph TD
  A@{ shape: doc, label: "cleaned.fasta" }
  C[obisummary]
  D([stdout])
  A --> C:::obitools
  C --> D
  classDef obitools fill:#99d57c

The file cleaned.fasta contains five fasta sequences with abundance annotations:

📄 cleaned.fasta
>seq001 {"count":5}
ACGTACGTACGTACGTACGT
>seq002 {"count":3}
TGCATGCATGCATGCATGCA
>seq003 {"count":1}
AAACCCGGGTTTTAAAACCC
>seq004 {"count":2}
GCTAGCTAGCTAGCTAGCTA
>seq005 {"count":10}
AATTCCGGAATTCCGGAATT

Running obisummary on this file produces a JSON record:

obisummary cleaned.fasta
{
  "annotations": {
    "keys": {
      "scalar": {
        "count": 5
      }
    },
    "map_attributes": 0,
    "scalar_attributes": 1,
    "vector_attributes": 0
  },
  "count": {
    "reads": 21,
    "total_length": 100,
    "variants": 5
  }
}

The count.variants field (5) is the number of distinct sequences; count.reads (21) is the sum of all count attributes; count.total_length (100) is the total nucleotide count. The annotations section confirms that count is the only scalar annotation present.
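The arithmetic behind these three fields can be checked independently of the tool. The sketch below is not obisummary's implementation; it is a minimal Python illustration that parses the content of cleaned.fasta and recomputes the count section. The helper names parse_fasta and count_section are hypothetical, and treating a missing count attribute as 1 is an assumption.

```python
import json

# Content of cleaned.fasta, copied from the example above.
FASTA = """\
>seq001 {"count":5}
ACGTACGTACGTACGTACGT
>seq002 {"count":3}
TGCATGCATGCATGCATGCA
>seq003 {"count":1}
AAACCCGGGTTTTAAAACCC
>seq004 {"count":2}
GCTAGCTAGCTAGCTAGCTA
>seq005 {"count":10}
AATTCCGGAATTCCGGAATT
"""


def parse_fasta(text):
    """Yield (annotations, sequence) pairs from FASTA text with JSON headers."""
    annotations, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if annotations is not None:
                yield annotations, "".join(chunks)
            # Everything after the identifier is expected to be a JSON object.
            _, _, header = line[1:].partition(" ")
            annotations = json.loads(header) if header.startswith("{") else {}
            chunks = []
        elif line.strip():
            chunks.append(line.strip())
    if annotations is not None:
        yield annotations, "".join(chunks)


def count_section(records):
    """Recompute reads, variants and total_length for parsed records."""
    reads = variants = total_length = 0
    for annotations, sequence in records:
        reads += annotations.get("count", 1)  # assumption: default abundance is 1
        variants += 1                         # one entry per sequence record
        total_length += len(sequence)         # lengths are not weighted by count
    return {"reads": reads, "total_length": total_length, "variants": variants}


print(count_section(parse_fasta(FASTA)))
# {'reads': 21, 'total_length': 100, 'variants': 5}
```

Note that total_length (100) equals five sequences of 20 nucleotides each, so in this example it is the unweighted sum of sequence lengths rather than a sum weighted by count.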

To obtain the same information in YAML format, add --yaml-output:

obisummary --yaml-output cleaned.fasta
annotations:
    keys:
        scalar:
            count: 5
    map_attributes: 0
    scalar_attributes: 1
    vector_attributes: 0
count:
    reads: 21
    total_length: 100
    variants: 5
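Because the default output is a single JSON record, it is straightforward to consume from a script. The following Python sketch shows one possible way to do so; it is an illustration, not part of OBITools4. The function names summarize, mean_abundance and mean_length are hypothetical, and summarize assumes that an obisummary executable is available on the PATH. The derived metrics are computed here on the record shown above, so the example runs without the tool installed.

```python
import json
import subprocess


def summarize(path):
    """Run obisummary on `path` and return its JSON record as a dict.

    Assumes the `obisummary` executable is installed and on the PATH.
    """
    result = subprocess.run(
        ["obisummary", "--json-output", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def mean_abundance(summary):
    """Average number of reads per distinct sequence variant."""
    counts = summary["count"]
    return counts["reads"] / counts["variants"]


def mean_length(summary):
    """Average sequence length per variant."""
    counts = summary["count"]
    return counts["total_length"] / counts["variants"]


# The record printed for cleaned.fasta in the example above.
EXAMPLE = json.loads("""
{
  "annotations": {
    "keys": {"scalar": {"count": 5}},
    "map_attributes": 0,
    "scalar_attributes": 1,
    "vector_attributes": 0
  },
  "count": {"reads": 21, "total_length": 100, "variants": 5}
}
""")

print(mean_abundance(EXAMPLE))  # 4.2
print(mean_length(EXAMPLE))     # 20.0
```

With OBITools4 installed, `summarize("cleaned.fasta")` would be expected to return a dictionary with the same structure as EXAMPLE.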

Synopsis #

obisummary [--batch-mem <string>] [--batch-size <int>]
           [--batch-size-max <int>] [--csv] [--debug] [--ecopcr] [--embl]
           [--fasta] [--fastq] [--genbank] [--help|-h|-?]
           [--input-OBI-header] [--input-json-header] [--json-output]
           [--map <string>]... [--max-cpu <int>] [--no-order] [--pprof]
           [--pprof-goroutine <int>] [--pprof-mutex <int>] [--silent-warning]
           [--solexa] [--u-to-t] [--version] [--yaml-output] [<args>]

Options #

obisummary specific options #

  • --json-output: Print the result as a JSON record. This is the default behaviour; this flag makes the choice explicit.
  • --yaml-output: Print the result as a YAML record instead of the default JSON format.
  • --map <string>: Name of a map attribute to include in the summary detail. This option may be repeated to request multiple map attributes.

Controlling the input data #

OBITools4 usually detects the input file format automatically, including whether the file is compressed with GZIP. In rare cases a file is misidentified; the following options let the user force the format and bypass the detection step.

The file format options #

  • --fasta: indicates that sequence data is in fasta format.
  • --fastq: indicates that sequence data is in fastq format.
  • --embl: indicates that sequence data is in EMBL-ENA flatfile format.
  • --csv: indicates that sequence data is in CSV format.
  • --genbank: indicates that sequence data is in GenBank flatfile format.
  • --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.

Controlling the way OBITools4 formats annotations #

These options only apply to the FASTA and FASTQ formats.

  • --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
  • --input-json-header: FASTA/FASTQ title line annotations follow the JSON format.

Controlling quality score decoding #

This option only applies to the FASTQ format.

  • --solexa: decodes the quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)

General options #

  • --help | -h | -?: shows this help.
  • --version: prints the version and exits.
  • --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.
  • --max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
  • --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
  • --batch-size <INTEGER>: minimum number of sequences per batch for parallel processing (floor, default: 1, env: OBIBATCHSIZE)
  • --batch-size-max <INTEGER>: maximum number of sequences per batch for parallel processing (ceiling, default: 2000, env: OBIBATCHSIZEMAX)
  • --batch-mem <STRING>: maximum memory per batch (e.g. 128K, 64M, 1G; default: 128M; set to 0 to disable, env: OBIBATCHMEM)
  • --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
  • --pprof: enables pprof server. Look at the log for details. (default: false).
  • --pprof-mutex <INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
  • --pprof-goroutine <INTEGER>: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)

Examples #

The file reads.fastq contains three fastq reads. The following command produces a YAML summary, explicitly forcing fastq parsing and requesting YAML output format:

📄 reads.fastq
@seq001 fastq read one
ACGTACGTACGTACGTACGT
+
IIIIIIIIIIIIIIIIIIII
@seq002 fastq read two
TGCATGCATGCATGCATGCA
+
IIIIIIIIIIIIIIIIIIII
@seq003 fastq read three
AAACCCGGGTTTTAAAACCC
+
IIIIIIIIIIIIIIIIIIII
obisummary --yaml-output --fastq reads.fastq > out_yaml.yaml
📄 out_yaml.yaml
annotations:
    keys:
        scalar:
            definition: 3
    map_attributes: 0
    scalar_attributes: 1
    vector_attributes: 0
count:
    reads: 3
    total_length: 60
    variants: 3

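The file sequences.fasta contains five fasta sequences, two of which are singletons (count equal to 1). The following pipeline uses obigrep to discard singletons before summarising the remaining reads. Because --yaml-output is not given, the record is written as JSON, despite the .yaml extension of the output file name: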

📄 sequences.fasta
>seq001 {"count":5}
ACGTACGTACGTACGTACGT
>seq002 {"count":1}
TGCATGCATGCATGCATGCA
>seq003 {"count":3}
AAACCCGGGTTTTAAAACCC
>seq004 {"count":2}
GCTAGCTAGCTAGCTAGCTA
>seq005 {"count":1}
AATTCCGGAATTCCGGAATT
obigrep -p 'annotations.count > 1' sequences.fasta | obisummary --fasta > out_pipeline.yaml
📄 out_pipeline.yaml
{
  "annotations": {
    "keys": {
      "scalar": {
        "count": 3
      }
    },
    "map_attributes": 0,
    "scalar_attributes": 1,
    "vector_attributes": 0
  },
  "count": {
    "reads": 10,
    "total_length": 60,
    "variants": 3
  }
}
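The figures in this record follow directly from the filter. As a quick sanity check, the Python sketch below mimics the effect of the pipeline on the five records of sequences.fasta: it keeps only the entries whose count is greater than 1 and recomputes the three count fields. It only illustrates the arithmetic and does not call obigrep or obisummary; the variable names are hypothetical.

```python
# (count, sequence) pairs copied from sequences.fasta.
records = [
    (5, "ACGTACGTACGTACGTACGT"),
    (1, "TGCATGCATGCATGCATGCA"),
    (3, "AAACCCGGGTTTTAAAACCC"),
    (2, "GCTAGCTAGCTAGCTAGCTA"),
    (1, "AATTCCGGAATTCCGGAATT"),
]

# Same predicate as in the pipeline: keep sequences with count > 1.
kept = [(count, seq) for count, seq in records if count > 1]

summary = {
    "reads": sum(count for count, _ in kept),
    "total_length": sum(len(seq) for _, seq in kept),
    "variants": len(kept),
}
print(summary)
# {'reads': 10, 'total_length': 60, 'variants': 3}
```

The two singletons (seq002 and seq005) are removed, which leaves 3 variants, 5 + 3 + 2 = 10 reads and 3 × 20 = 60 nucleotides.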
obisummary --help