obisummary: summarise the main information from a sequence file #
Description #
obisummary provides a rapid statistical overview of a biological sequence
dataset. Rather than transforming sequences, it reads them and outputs a single
structured record describing global counts and annotation types. When sample
information is present, it also outputs per-sample statistics.
The output record is organised into three sections. The count section reports
the total number of reads (accounting for the count attribute of each sequence),
the number of distinct sequence variants, and the cumulative sequence length. The
annotations section enumerates every annotation key found in the dataset,
classifying each as scalar, map, or vector. The samples section appears only
when merged-sample data is present and lists per-sample reads, variants, and
singletons.
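The classification of annotation keys can be pictured with a short Python sketch. It assumes that annotations are JSON objects, that a JSON object counts as a map and a JSON array as a vector, and that the number reported per key is how many sequences carry that key. The `classify` and `summarise_keys` helpers and the sample annotations are hypothetical and are not part of OBITools4.

```python
import json
from collections import defaultdict


def classify(value):
    """Return 'map', 'vector', or 'scalar' for a decoded JSON value."""
    if isinstance(value, dict):
        return "map"
    if isinstance(value, list):
        return "vector"
    return "scalar"


def summarise_keys(headers):
    """Count, for each annotation key, how many records carry it, grouped by kind."""
    keys = defaultdict(lambda: defaultdict(int))
    for header in headers:
        for key, value in json.loads(header).items():
            keys[classify(value)][key] += 1
    return {kind: dict(counts) for kind, counts in keys.items()}


# Hypothetical annotations, made up for illustration only.
headers = [
    '{"count": 5, "merged_sample": {"s1": 3, "s2": 2}}',
    '{"count": 3, "merged_sample": {"s1": 3}, "path": [1, 2, 3]}',
]

summary = summarise_keys(headers)
print(summary)
```

In this sketch `count` lands under scalar, `merged_sample` under map, and `path` under vector, which mirrors the grouping of the annotations section.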
obisummary is typically run after obiuniq or obiclean to validate the state of a
dataset. The default output format is JSON; use --yaml-output for YAML output.
graph TD
A@{ shape: doc, label: "simple.fasta" }
C[obisummary]
D@{ shape: doc, label: "cleaned.json" }
A --> C:::obitools
C --> D
classDef obitools fill:#99d57c
The file simple.fasta contains five fasta sequences with abundance annotations:
📄 simple.fasta
>seq001 {"count":5}
ACGTACGTACGTACGTACGT
>seq002 {"count":3}
TGCATGCATGCATGCATGCA
>seq003 {"count":1}
AAACCCGGGTTTTAAAACCC
>seq004 {"count":2}
GCTAGCTAGCTAGCTAGCTA
>seq005 {"count":10}
AATTCCGGAATTCCGGAATT
Running obisummary on this file produces a JSON record:
obisummary simple.fasta
{
"annotations": {
"keys": {
"scalar": {
"count": 5
}
},
"map_attributes": 0,
"scalar_attributes": 1,
"vector_attributes": 0
},
"count": {
"reads": 21,
"total_length": 100,
"variants": 5
}
}
The count.variants field (5) is the number of distinct sequences; count.reads
(21) is the sum of all count attributes; count.total_length (100) is the total
nucleotide count. The annotations section confirms that count is the only scalar
annotation present.
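The three figures of the count section can be cross-checked by hand. The following Python sketch recomputes them from the content of simple.fasta; it is not how obisummary is implemented. It assumes JSON title-line annotations, treats a missing count as 1, and counts one variant per record, which holds only for a dereplicated file. The `read_fasta` and `count_section` helpers are hypothetical names.

```python
import json

SIMPLE_FASTA = """\
>seq001 {"count":5}
ACGTACGTACGTACGTACGT
>seq002 {"count":3}
TGCATGCATGCATGCATGCA
>seq003 {"count":1}
AAACCCGGGTTTTAAAACCC
>seq004 {"count":2}
GCTAGCTAGCTAGCTAGCTA
>seq005 {"count":10}
AATTCCGGAATTCCGGAATT
"""


def read_fasta(text):
    """Yield (identifier, annotations, sequence) for each FASTA record."""
    ident, annotations, chunks = None, {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if ident is not None:
                yield ident, annotations, "".join(chunks)
            ident, _, raw = line[1:].partition(" ")
            annotations = json.loads(raw) if raw else {}
            chunks = []
        else:
            chunks.append(line)
    if ident is not None:
        yield ident, annotations, "".join(chunks)


def count_section(text):
    """Recompute reads, variants, and total_length for a dereplicated file."""
    reads = variants = total_length = 0
    for _, annotations, sequence in read_fasta(text):
        reads += annotations.get("count", 1)  # assumption: missing count means 1
        variants += 1                         # one record per distinct sequence
        total_length += len(sequence)         # lengths are not weighted by count
    return {"reads": reads, "total_length": total_length, "variants": variants}


counts = count_section(SIMPLE_FASTA)
print(counts)  # {'reads': 21, 'total_length': 100, 'variants': 5}
```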
To obtain the same information in YAML format, add --yaml-output:
obisummary --yaml-output simple.fasta
annotations:
keys:
scalar:
count: 5
map_attributes: 0
scalar_attributes: 1
vector_attributes: 0
count:
reads: 21
total_length: 100
variants: 5
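Because the default record is plain JSON, it can be consumed by a downstream script. The sketch below loads the JSON record shown earlier from a string literal and derives two simple indicators. The indicator names are hypothetical, and in practice the text would come from the standard output of obisummary rather than from a literal.

```python
import json

# The JSON record produced for simple.fasta, reproduced from the example above.
RECORD = """
{
  "annotations": {
    "keys": {"scalar": {"count": 5}},
    "map_attributes": 0,
    "scalar_attributes": 1,
    "vector_attributes": 0
  },
  "count": {"reads": 21, "total_length": 100, "variants": 5}
}
"""

summary = json.loads(RECORD)
count = summary["count"]

# Hypothetical derived indicators, not fields emitted by obisummary.
mean_reads_per_variant = count["reads"] / count["variants"]
mean_variant_length = count["total_length"] / count["variants"]

# The samples section is only present when sample information exists.
has_samples = "samples" in summary

print(mean_reads_per_variant, mean_variant_length, has_samples)
```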
Synopsis #
obisummary [--batch-mem <string>] [--batch-size <int>]
[--batch-size-max <int>] [--csv] [--debug] [--ecopcr] [--embl]
[--fasta] [--fastq] [--genbank] [--help|-h|-?]
[--input-OBI-header] [--input-json-header] [--json-output]
[--map <string>]... [--max-cpu <int>] [--no-order] [--pprof]
[--pprof-goroutine <int>] [--pprof-mutex <int>] [--silent-warning]
[--solexa] [--u-to-t] [--version] [--yaml-output] [<args>]
Options #
obisummary specific options #
--json-output: Print the result as a JSON record. This is the default behaviour; this flag makes the choice explicit.
--yaml-output: Print the result as a YAML record instead of the default JSON format.
--map <string>: Name of a map attribute to include in the summary detail. This option may be repeated to request multiple map attributes.
Controlling the input data #
OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step.
The file format options #
--fasta: indicates that sequence data is in fasta format.
--fastq: indicates that sequence data is in fastq format.
--embl: indicates that sequence data is in EMBL-ENA flatfile format.
--csv: indicates that sequence data is in CSV format.
--genbank: indicates that sequence data is in GenBank flatfile format.
--ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.
Controlling the way OBITools4 formats annotations #
These options only apply to the FASTA and FASTQ formats.
--input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
--input-json-header: FASTA/FASTQ title line annotations follow the JSON format.
Controlling quality score decoding #
This option only applies to the FASTQ format.
--solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)
General options #
--help|-h|-?: shows this help.
--version: prints the version and exits.
--silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.
Computation related options #
--max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
--force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
--batch-size <INTEGER>: minimum number of sequences per batch for parallel processing (floor, default: 1, env: OBIBATCHSIZE)
--batch-size-max <INTEGER>: maximum number of sequences per batch for parallel processing (ceiling, default: 2000, env: OBIBATCHSIZEMAX)
--batch-mem <STRING>: maximum memory per batch (e.g. 128K, 64M, 1G; default: 128M; set to 0 to disable, env: OBIBATCHMEM)
Debug related options #
--debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
--pprof: enables pprof server. Look at the log for details. (default: false).
--pprof-mutex <INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
--pprof-goroutine <INTEGER>: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)
Examples #
The file sequences.fasta contains five fasta sequences, two of which are
singletons (count equal to 1). The following pipeline uses obigrep to discard
singletons before summarising the remaining reads:
>seq001 {"count":5}
ACGTACGTACGTACGTACGT
>seq002 {"count":1}
TGCATGCATGCATGCATGCA
>seq003 {"count":3}
AAACCCGGGTTTTAAAACCC
>seq004 {"count":2}
GCTAGCTAGCTAGCTAGCTA
>seq005 {"count":1}
AATTCCGGAATTCCGGAATT
obigrep -p 'annotations.count > 1' sequences.fasta \
| obisummary --yaml-output \
> out_pipeline.yaml
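The effect of this filter on the counts can be worked out by hand. The Python sketch below applies the same "count greater than 1" criterion to the content of sequences.fasta and recomputes the three count figures. It is an illustration of the arithmetic only and does not reproduce the actual content of out_pipeline.yaml. The `records` helper is a hypothetical name, and it assumes each sequence fits on a single line.

```python
import json

SEQUENCES_FASTA = """\
>seq001 {"count":5}
ACGTACGTACGTACGTACGT
>seq002 {"count":1}
TGCATGCATGCATGCATGCA
>seq003 {"count":3}
AAACCCGGGTTTTAAAACCC
>seq004 {"count":2}
GCTAGCTAGCTAGCTAGCTA
>seq005 {"count":1}
AATTCCGGAATTCCGGAATT
"""


def records(text):
    """Pair each title line with the single sequence line that follows it."""
    lines = [line for line in text.splitlines() if line.strip()]
    for title, sequence in zip(lines[0::2], lines[1::2]):
        ident, _, raw = title[1:].partition(" ")
        yield ident, json.loads(raw), sequence


# Keep only the records whose count is greater than 1 (singletons are dropped).
kept = [(i, a, s) for i, a, s in records(SEQUENCES_FASTA) if a["count"] > 1]

expected_counts = {
    "reads": sum(a["count"] for _, a, _ in kept),
    "total_length": sum(len(s) for _, _, s in kept),
    "variants": len(kept),
}
print([i for i, _, _ in kept])  # ['seq001', 'seq003', 'seq004']
print(expected_counts)          # {'reads': 10, 'total_length': 60, 'variants': 3}
```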
Aggregate read counts per map attribute with --map #
The --map option names a map attribute and instructs obisummary to accumulate,
for each key of that attribute, the total number of reads across all sequences.
The option may be repeated to request several attributes at once.
The file merged.fasta contains four fasta sequences produced after
dereplication. Each carries a merged_sample map added by obiuniq, an
obiclean_weight map (per-sample read counts as written by obiclean) and a marker
map identifying which PCR target was amplified:
>seq001 {"count":5,"merged_sample":{"s1":3,"s2":2},"obiclean_weight":{"s1":6,"s2":2},"marker":{"16S":5}}
ACGTACGTACGTACGTACGT
>seq002 {"count":3,"merged_sample":{"s1":3},"obiclean_weight":{"s1":3},"marker":{"COI":3}}
TGCATGCATGCATGCATGCA
>seq003 {"count":8,"merged_sample":{"s2":8},"obiclean_weight":{"s2":9},"marker":{"16S":8}}
AAACCCGGGTTTTAAAACCC
>seq004 {"count":2,"merged_sample":{"s1":1,"s2":1},"obiclean_weight":{"s1":1,"s2":1},"marker":{"COI":2}}
GCTAGCTAGCTAGCTAGCTA
obisummary --map obiclean_weight \
--map marker \
--yaml-output \
merged.fasta > out_map.yaml
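The per-key accumulation can be illustrated with a minimal Python sketch. It assumes that accumulating a map attribute means summing, key by key, the integer values found in every sequence. It is applied here to the marker attribute of merged.fasta only. The `sum_map_attribute` helper is a hypothetical name and not part of OBITools4.

```python
import json
from collections import Counter

MERGED_FASTA = """\
>seq001 {"count":5,"merged_sample":{"s1":3,"s2":2},"obiclean_weight":{"s1":6,"s2":2},"marker":{"16S":5}}
ACGTACGTACGTACGTACGT
>seq002 {"count":3,"merged_sample":{"s1":3},"obiclean_weight":{"s1":3},"marker":{"COI":3}}
TGCATGCATGCATGCATGCA
>seq003 {"count":8,"merged_sample":{"s2":8},"obiclean_weight":{"s2":9},"marker":{"16S":8}}
AAACCCGGGTTTTAAAACCC
>seq004 {"count":2,"merged_sample":{"s1":1,"s2":1},"obiclean_weight":{"s1":1,"s2":1},"marker":{"COI":2}}
GCTAGCTAGCTAGCTAGCTA
"""


def sum_map_attribute(text, name):
    """Sum the values of map attribute `name`, key by key, over all title lines."""
    totals = Counter()
    for line in text.splitlines():
        if not line.startswith(">"):
            continue  # skip sequence lines
        _, _, raw = line[1:].partition(" ")
        annotations = json.loads(raw)
        for key, value in annotations.get(name, {}).items():
            totals[key] += value
    return dict(totals)


marker_totals = sum_map_attribute(MERGED_FASTA, "marker")
print(marker_totals)  # {'16S': 13, 'COI': 5}
```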
The new map_summaries section provides experiment-wide totals: sample s1
contributed 7 reads, s2 contributed 11; marker 16S accounts for 13 reads
and COI for 5. Note that samples.sample_stats is derived automatically from
obiclean_weight whenever that attribute is present and gives the same per-sample
read totals together with variant and singleton counts.
Summarise a real metabarcoding dataset (wolf tutorial) #
The file wolf.taxo.ann.fasta.gz is the final output of the OBITools4 wolf
tutorial. It contains 26 sequence variants produced by the full pipeline
(paired-end assembly, demultiplexing, dereplication, chimera filtering, and
taxonomic assignment with obitag). Running obisummary on this file gives an
immediate overview of the whole experiment:
obisummary --yaml-output \
wolf.taxo.ann.fasta.gz > out_wolf.yaml
The summary reveals 26 variants representing 31 337 reads spread across 4 samples,
8 scalar attributes (abundance, taxonomy, and obitag scores) and 3 map attributes
(merged_sample, obiclean_status, obiclean_weight).
obisummary --help