obicount: counting sequence records #
Description #
Counts the sequence records in a sequence file and returns three values. The first is the number of sequence variants, that is, the actual number of sequence records in the file. Each sequence record carries a count attribute (taken to be 1 if absent) that gives the number of times the sequence has been observed in the data set. In the following example, the first sequence record has no count attribute and therefore counts for 1, whereas the second sequence record has a count attribute equal to 2.
>AB061527 {"definition":"Sorex unguiculatus mitochondrial NA, complete genome.","family_name":"Soricidae","family_taxid":9376,"genus_name":"Sorex","genus_taxid":9379,"obicleandb_level":"family","obicleandb_trusted":2.2137847111025621e-13,"species_name":"Sorex unguiculatus","species_taxid":62275,"taxid":62275}
ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat
agcttaaaactcaaaggacttggcggtgctttatatccct
>AL355887 {"count":2,"definition":"Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN.","family_name":"Hominidae","family_taxid":9604,"genus_name":"Homo","genus_taxid":9605,"obicleandb_level":"genus","obicleandb_trusted":0,"species_name":"Homo sapiens","species_taxid":9606,"taxid":9606}
ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac
agcttaaaactcaaaggacctggcagttctttatatccct
The second value returned is therefore the sum of the count values over all sequence records: 3 for the example file above. The third value is the number of nucleotides stored in the file, that is, the sum of the sequence lengths (200 for the example file).
graph TD
  A@{ shape: doc, label: "my_sequences.fastq" }
  C[obicount]
  D@{ shape: doc, label: "counts.csv" }
  A --> C:::obitools
  C --> D
  classDef obitools fill:#99d57c
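The three values described above can be reproduced by hand for the two-record example. The Python sketch below is not how obicount is implemented; it only illustrates the arithmetic, assuming FASTA input whose title lines carry JSON annotations. The function name `count_records` is hypothetical, and the annotations are shortened to the fields that matter here.

```python
import json

def count_records(fasta_text):
    """Return (variants, reads, symbols) for FASTA text with JSON title-line annotations."""
    variants = reads = symbols = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            variants += 1
            # The JSON annotation block, if any, starts at the first "{".
            brace = line.find("{")
            annotations = json.loads(line[brace:]) if brace != -1 else {}
            # A missing count attribute is treated as 1.
            reads += int(annotations.get("count", 1))
        else:
            # Sequence lines contribute their length to the symbol total.
            symbols += len(line)
    return variants, reads, symbols

# The two records from the example, with shortened annotations.
example = """\
>AB061527 {"taxid":62275}
ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat
agcttaaaactcaaaggacttggcggtgctttatatccct
>AL355887 {"count":2,"taxid":9606}
ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac
agcttaaaactcaaaggacctggcagttctttatatccct
"""

print(count_records(example))  # (2, 3, 200)
```

Two records give 2 variants, the counts 1 and 2 give 3 reads, and each sequence is 100 nucleotides long, giving 200 symbols.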
Synopsis #
obicount [--batch-size <int>] [--debug] [--ecopcr] [--embl] [--fasta]
[--fastq] [--force-one-cpu] [--genbank] [--help|-h|-?]
[--input-OBI-header] [--input-json-header] [--max-cpu <int>]
[--no-order] [--pprof] [--pprof-goroutine <int>]
[--pprof-mutex <int>] [--reads|-r] [--solexa] [--symbols|-s]
[--variants|-v] [--version] [<args>]
Options #
obicount specific options #
--variants|-v: When present, output the number of variants (sequence records) in the sequence file.
--reads|-r: When present, output the number of reads (the sum of sequence counts) in the sequence file.
--symbols|-s: When present, output the number of symbols (nucleotides) in the sequence file.
Controlling the input data #
OBITools4 generally recognizes the input file format automatically, including whether the input file is compressed with GZIP. In rare cases a file can be misidentified, so the following options let the user force the format, bypassing the format identification step.
The file format options #
--fasta: indicates that sequence data is in FASTA format.
--fastq: indicates that sequence data is in FASTQ format.
--embl: indicates that sequence data is in EMBL-ENA flatfile format.
--csv: indicates that sequence data is in CSV format.
--genbank: indicates that sequence data is in GenBank flatfile format.
--ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.
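To illustrate why automatic identification usually works but can occasionally fail, here is a minimal content-sniffing sketch in Python. It is an assumption-laden illustration, not OBITools4's actual detection logic: it relies only on the GZIP magic bytes and on the conventional first characters of each format, and the function name `sniff_format` is hypothetical.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any GZIP stream

def sniff_format(raw):
    """Return (is_gzipped, guessed_format) from the raw bytes of a file."""
    compressed = raw.startswith(GZIP_MAGIC)
    if compressed:
        raw = gzip.decompress(raw)
    text = raw.lstrip()
    # Guess the format from the conventional first characters of each format.
    if text.startswith(b">"):
        fmt = "fasta"
    elif text.startswith(b"@"):
        fmt = "fastq"
    elif text.startswith(b"ID "):
        fmt = "embl"
    elif text.startswith(b"LOCUS"):
        fmt = "genbank"
    else:
        fmt = "unknown"  # this is where a forcing option would be needed
    return compressed, fmt

print(sniff_format(b">seq1\nacgt\n"))                           # (False, 'fasta')
print(sniff_format(gzip.compress(b"@read1\nacgt\n+\nIIII\n")))  # (True, 'fastq')
```

A heuristic of this kind returns "unknown", or a wrong guess, for files with unusual leading content, which is the situation the format options above are meant to handle.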
Controlling the way OBITools4 formats annotations #
These options only apply to the FASTA and FASTQ formats.
--input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
--input-json-header: FASTA/FASTQ title line annotations follow the JSON format.
Controlling quality score decoding #
This option only applies to the FASTQ format.
--solexa: decodes the quality string according to the Solexa specification (default: the standard Sanger encoding is used, env: OBISSOLEXA).
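The difference between the two encodings can be sketched as follows. This Python example uses the commonly documented conventions (Sanger: ASCII offset 33 on the Phred scale; Solexa: ASCII offset 64 on the Solexa scale, converted here to Phred). How OBITools4 handles these scores internally is not described in this page, so treat the function names and the conversion step as illustrative assumptions.

```python
import math

def decode_sanger(quality):
    """Sanger encoding: Phred score = ASCII code - 33."""
    return [ord(c) - 33 for c in quality]

def decode_solexa(quality):
    """Solexa encoding: Solexa score = ASCII code - 64, converted to the Phred scale."""
    phred = []
    for c in quality:
        q_solexa = ord(c) - 64
        # Standard Solexa-to-Phred conversion: Q = 10 * log10(10^(Qs/10) + 1)
        phred.append(10 * math.log10(10 ** (q_solexa / 10) + 1))
    return phred

print(decode_sanger("I5"))  # [40, 20]
print(decode_solexa("h@"))  # roughly [40.0, 3.01]
```

The same character therefore decodes to very different scores under the two conventions, which is why the encoding has to be stated explicitly for Solexa data.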
General options #
--help|-h|-?: shows this help.
--version: prints the version and exits.
--silent-warning: tells OBITools to stop displaying warnings. This behaviour can also be controlled by setting the OBIWARNINGS environment variable.
Computation related options #
--max-cpu <INTEGER>: OBITools can take advantage of your computer's multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory, so reducing the number of CPUs used is also an indirect way to limit the memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
--force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
--batch-size <INTEGER>: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE).
Debug related options #
--debug: enables debug mode by setting the log level to debug (default: false, env: OBIDEBUG).
--pprof: enables the pprof server. Look at the log for details (default: false).
--pprof-mutex <INTEGER>: enables profiling of mutex locks (default: 10, env: OBIPPROFMUTEX).
--pprof-goroutine <INTEGER>: enables the goroutine blocking profile (default: 6060, env: OBIPPROFGOROUTINE).
Examples #
By default, the obicount command outputs the number of variants, reads and symbols in the sequence file.
obicount my_sequence_file.fasta
INFO[0000] Number of workers set 16
INFO[0000] Found 1 files to process
INFO[0000] xxx.fastq.gz mime type: text/fastq
entities,n
variants,43221
reads,43221
symbols,4391530
The output is in CSV format and can be converted to Markdown for more readable output using the csvtomd command.
obicount my_sequence_file.fasta | csvtomd
entities | n
----------|---------
variants | 43221
reads | 43221
symbols | 4391530
The conversion can also be done with the csvlook command from the csvkit package.
obicount my_sequence_file.fasta | csvlook
| entities | n |
| -------- | --------- |
| variants | 43 221 |
| reads | 43 221 |
| symbols | 4 391 530 |
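Besides csvtomd and csvlook, the CSV output can be consumed programmatically. The Python sketch below parses the CSV text shown in the first example into a dictionary; the function name `parse_counts` is hypothetical, and the sketch assumes that only the CSV lines (not the INFO log lines) are passed in.

```python
import csv
import io

def parse_counts(csv_text):
    """Parse obicount-style CSV text into a dict mapping entity name to integer count."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["entities"]: int(row["n"]) for row in reader}

# CSV lines copied from the example output above.
output = """\
entities,n
variants,43221
reads,43221
symbols,4391530
"""

counts = parse_counts(output)
# Mean sequence length, derived from two of the reported values.
print(counts["symbols"] / counts["variants"])
```

Working from the dictionary makes it straightforward to derive secondary statistics, such as the mean sequence length or the mean number of reads per variant.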
When using the --variants, --reads or --symbols option, the output contains only the values corresponding to the specified options.
obicount -v --reads my_sequence_file.fasta | csvlook
| entities | n |
| -------- | ------ |
| variants | 43 221 |
| reads | 43 221 |
As with all OBITools commands, the input file can be compressed with GZIP.
obicount my_sequence_file.fasta.gz | csvlook
| entities | n |
| -------- | --------- |
| variants | 43 221 |
| reads | 43 221 |
| symbols | 4 391 530 |