obisummary: summarize the main information from a sequence file #
This page was automatically generated by an AI assistant and has not yet been
reviewed or validated by the OBITools4 development team. It may contain
inaccuracies or incomplete information. Use with caution and refer to the command’s
--help output for authoritative option descriptions.
Description #
obisummary provides a rapid statistical overview of a biological sequence
dataset. Rather than transforming sequences, it reads them and outputs a single
structured record describing global counts, annotation types, and, when sample
information is present, per-sample statistics.
The output record is organised into three sections. The count section reports
the total number of reads (accounting for the count attribute of each sequence),
the number of distinct sequence variants, and the cumulative sequence length. The
annotations section enumerates every annotation key found in the dataset,
classifying each as scalar, map, or vector. The samples section appears only
when merged-sample data is present and lists per-sample reads, variants, and
singletons.
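The annotations section described above can be pictured with a short Python sketch. It is only an illustration, not the OBITools4 implementation: it assumes that a dictionary-like value is reported as a map, a list-like value as a vector, and anything else as a scalar, and that each key is counted once per sequence carrying it. The function names and the sample records (including the merged_sample and tags keys) are hypothetical.

```python
from collections import Counter


def classify(value):
    """Return 'map', 'vector', or 'scalar' for one annotation value (assumed rule)."""
    if isinstance(value, dict):
        return "map"
    if isinstance(value, (list, tuple)):
        return "vector"
    return "scalar"


def summarize_annotations(records):
    """Count, per kind, how many records carry each annotation key."""
    keys = {"scalar": Counter(), "map": Counter(), "vector": Counter()}
    for annotations in records:
        for key, value in annotations.items():
            keys[classify(value)][key] += 1
    return keys


# Hypothetical annotation dictionaries, one per sequence.
records = [
    {"count": 5, "merged_sample": {"s1": 3, "s2": 2}},
    {"count": 1, "merged_sample": {"s1": 1}},
    {"count": 2, "tags": ["a", "b"]},
]

summary = summarize_annotations(records)
for kind, counter in summary.items():
    print(kind, dict(counter))
```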
obisummary is typically run after obiuniq or obiclean to validate the state of a
dataset. The default output format is JSON; use --yaml-output for YAML output.
graph TD
A@{ shape: doc, label: "cleaned.fasta" }
C[obisummary]
D([stdout])
A --> C:::obitools
C --> D
classDef obitools fill:#99d57c
The file cleaned.fasta contains five fasta sequences with abundance annotations:
>seq001 {"count":5}
ACGTACGTACGTACGTACGT
>seq002 {"count":3}
TGCATGCATGCATGCATGCA
>seq003 {"count":1}
AAACCCGGGTTTTAAAACCC
>seq004 {"count":2}
GCTAGCTAGCTAGCTAGCTA
>seq005 {"count":10}
AATTCCGGAATTCCGGAATT
Running obisummary on this file produces a JSON record:
obisummary cleaned.fasta
{
  "annotations": {
    "keys": {
      "scalar": {
        "count": 5
      }
    },
    "map_attributes": 0,
    "scalar_attributes": 1,
    "vector_attributes": 0
  },
  "count": {
    "reads": 21,
    "total_length": 100,
    "variants": 5
  }
}
The count.variants field (5) is the number of distinct sequences; count.reads
(21) is the sum of all count attributes; count.total_length (100) is the total
nucleotide count. The annotations section confirms that count is the only scalar
annotation present.
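The three values of the count section can be checked by hand. The following Python sketch recomputes them from the content of cleaned.fasta. It is an independent illustration rather than the way obisummary works internally. The helper names are hypothetical, the parser assumes one sequence line per record with a JSON annotation in the title line, and a missing count attribute is assumed to be worth 1.

```python
import json

FASTA = """\
>seq001 {"count":5}
ACGTACGTACGTACGTACGT
>seq002 {"count":3}
TGCATGCATGCATGCATGCA
>seq003 {"count":1}
AAACCCGGGTTTTAAAACCC
>seq004 {"count":2}
GCTAGCTAGCTAGCTAGCTA
>seq005 {"count":10}
AATTCCGGAATTCCGGAATT
"""


def read_fasta(text):
    """Yield (annotations, sequence) pairs; assumes single-line sequences."""
    header = None
    for line in text.splitlines():
        if line.startswith(">"):
            header = line
        elif header is not None:
            # Everything after the first space of the title line is the JSON annotation.
            _, _, raw = header.partition(" ")
            yield (json.loads(raw) if raw else {}), line.strip()
            header = None


def count_section(text):
    """Recompute reads, variants and total_length for a FASTA text."""
    reads = variants = total_length = 0
    for annotations, sequence in read_fasta(text):
        reads += annotations.get("count", 1)  # assumption: missing count means 1
        variants += 1
        total_length += len(sequence)
    return {"reads": reads, "total_length": total_length, "variants": variants}


print(count_section(FASTA))
```

Note that total_length sums the length of each variant once; it is not weighted by the count attribute, which is why it equals 100 and not 420 here.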
To obtain the same information in YAML format, add --yaml-output:
obisummary --yaml-output cleaned.fasta
annotations:
  keys:
    scalar:
      count: 5
  map_attributes: 0
  scalar_attributes: 1
  vector_attributes: 0
count:
  reads: 21
  total_length: 100
  variants: 5
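Because the default output is a single JSON record, it is easy to consume from a script. The sketch below loads the record shown above with Python's standard json module and derives two simple indicators. The variable names are illustrative, and the top-level key name samples used in the last lines is an assumption based on the section name given in the description.

```python
import json

# The JSON record printed by `obisummary cleaned.fasta` in the example above.
record_text = """
{
  "annotations": {
    "keys": {"scalar": {"count": 5}},
    "map_attributes": 0,
    "scalar_attributes": 1,
    "vector_attributes": 0
  },
  "count": {"reads": 21, "total_length": 100, "variants": 5}
}
"""

record = json.loads(record_text)
counts = record["count"]

# Derived indicators (not part of the obisummary output itself).
mean_length = counts["total_length"] / counts["variants"]
reads_per_variant = counts["reads"] / counts["variants"]

print(f"mean variant length: {mean_length:.1f}")
print(f"reads per variant:   {reads_per_variant:.1f}")

# Assumed key name: the per-sample section is absent from this record.
samples = record.get("samples")
print("per-sample section present:", samples is not None)
```

In practice the same record could be read from a file written with a shell redirection, for instance with json.load on an open file handle.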
Synopsis #
obisummary [--batch-mem <string>] [--batch-size <int>]
[--batch-size-max <int>] [--csv] [--debug] [--ecopcr] [--embl]
[--fasta] [--fastq] [--genbank] [--help|-h|-?]
[--input-OBI-header] [--input-json-header] [--json-output]
[--map <string>]... [--max-cpu <int>] [--no-order] [--pprof]
[--pprof-goroutine <int>] [--pprof-mutex <int>] [--silent-warning]
[--solexa] [--u-to-t] [--version] [--yaml-output] [<args>]
Options #
obisummary specific options #
--json-output: Print the result as a JSON record. This is the default behaviour; this flag makes the choice explicit.
--yaml-output: Print the result as a YAML record instead of the default JSON format.
--map <string>: Name of a map attribute to include in the summary detail. This option may be repeated to request multiple map attributes.
Controlling the input data #
OBITools4 generally recognizes the input file format, and also whether the input file is compressed with GZIP. Rare files can still be misidentified, so the following options let the user force the format and bypass the format identification step.
The file format options #
--fasta: indicates that sequence data is in fasta format.
--fastq: indicates that sequence data is in fastq format.
--embl: indicates that sequence data is in EMBL-ENA flatfile format.
--csv: indicates that sequence data is in CSV format.
--genbank: indicates that sequence data is in GenBank flatfile format.
--ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.
Controlling the way OBITools4 formats annotations #
These options only apply to the FASTA and FASTQ formats.
--input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
--input-json-header: FASTA/FASTQ title line annotations follow the JSON format.
Controlling quality score decoding #
This option only applies to the FASTQ format.
--solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)
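The difference between the two quality encodings can be illustrated with a short Python sketch. It relies on the commonly documented conventions (Sanger: Phred scores with an ASCII offset of 33; old Solexa: Solexa-scaled scores with an ASCII offset of 64) and is not taken from the OBITools4 source code. The function names are hypothetical.

```python
import math


def sanger_to_phred(quality):
    """Decode a Sanger-encoded quality string (Phred scores, ASCII offset 33)."""
    return [ord(char) - 33 for char in quality]


def solexa_to_phred(quality):
    """Decode an old Solexa quality string (ASCII offset 64) into Phred scores.

    Solexa scores use a different scale than Phred scores:
    Q_phred = 10 * log10(10 ** (Q_solexa / 10) + 1)
    """
    scores = []
    for char in quality:
        q_solexa = ord(char) - 64
        scores.append(10 * math.log10(10 ** (q_solexa / 10) + 1))
    return scores


# 'I' is the quality character used in the reads.fastq example of this page.
print(sanger_to_phred("IIII"))
# 'h' is the Solexa character for a Solexa score of 40.
print([round(q, 2) for q in solexa_to_phred("hhhh")])
```

The two scales nearly coincide for high scores and diverge for low ones, which is why decoding a file with the wrong convention mostly distorts the low-quality positions, on top of the offset shift.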
General options #
--help|-h|-?: shows this help.
--version: prints the version and exits.
--silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.
Computation related options #
--max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
--force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
--batch-size <INTEGER>: minimum number of sequences per batch for parallel processing (floor, default: 1, env: OBIBATCHSIZE)
--batch-size-max <INTEGER>: maximum number of sequences per batch for parallel processing (ceiling, default: 2000, env: OBIBATCHSIZEMAX)
--batch-mem <STRING>: maximum memory per batch (e.g. 128K, 64M, 1G; default: 128M; set to 0 to disable, env: OBIBATCHMEM)
Debug related options #
--debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
--pprof: enables pprof server. Look at the log for details. (default: false).
--pprof-mutex <INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
--pprof-goroutine <INTEGER>: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)
Examples #
The file reads.fastq contains three fastq reads. The following command produces a YAML summary, explicitly forcing fastq parsing and requesting YAML output format:
@seq001 fastq read one
ACGTACGTACGTACGTACGT
+
IIIIIIIIIIIIIIIIIIII
@seq002 fastq read two
TGCATGCATGCATGCATGCA
+
IIIIIIIIIIIIIIIIIIII
@seq003 fastq read three
AAACCCGGGTTTTAAAACCC
+
IIIIIIIIIIIIIIIIIIII
obisummary --yaml-output --fastq reads.fastq > out_yaml.yaml
The file sequences.fasta contains five fasta sequences, two of which are
singletons (count equal to 1). The following pipeline uses obigrep to discard
singletons before summarising the remaining reads:
>seq001 {"count":5}
ACGTACGTACGTACGTACGT
>seq002 {"count":1}
TGCATGCATGCATGCATGCA
>seq003 {"count":3}
AAACCCGGGTTTTAAAACCC
>seq004 {"count":2}
GCTAGCTAGCTAGCTAGCTA
>seq005 {"count":1}
AATTCCGGAATTCCGGAATT
obigrep -p 'annotations.count > 1' sequences.fasta | obisummary --fasta > out_pipeline.yaml
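To anticipate what the count section should contain after the filtering step, the pipeline above can be mimicked in a few lines of Python. This is a hand-written check and not a call to obigrep or obisummary: the records are copied from sequences.fasta, and the filter simply keeps the entries whose count is greater than 1.

```python
# (identifier, count attribute, sequence) copied from sequences.fasta.
records = [
    ("seq001", 5, "ACGTACGTACGTACGTACGT"),
    ("seq002", 1, "TGCATGCATGCATGCATGCA"),
    ("seq003", 3, "AAACCCGGGTTTTAAAACCC"),
    ("seq004", 2, "GCTAGCTAGCTAGCTAGCTA"),
    ("seq005", 1, "AATTCCGGAATTCCGGAATT"),
]

# Same selection rule as the obigrep predicate: keep counts strictly above 1.
kept = [record for record in records if record[1] > 1]

expected_counts = {
    "reads": sum(count for _, count, _ in kept),
    "total_length": sum(len(sequence) for _, _, sequence in kept),
    "variants": len(kept),
}

print([identifier for identifier, _, _ in kept])
print(expected_counts)
```

Under these assumptions, the summary of the filtered data should report three variants (seq001, seq003 and seq004), 10 reads and a total length of 60 nucleotides.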
obisummary --help