`obicount`: counting sequence records #

Description #

Count the sequence records in a sequence file. It returns three values. The first is the number of sequence records. Each sequence record is associated with a count attribute (equal to 1 if absent); this attribute gives the number of times that sequence was observed in the non-dereplicated data set. In the following example, the first sequence record has no count attribute and therefore counts for 1, whereas the second sequence record has a count attribute equal to 2.

>AB061527 {"definition":"Sorex unguiculatus mitochondrial NA, complete genome.","family_name":"Soricidae","family_taxid":9376,"genus_name":"Sorex","genus_taxid":9379,"obicleandb_level":"family","obicleandb_trusted":2.2137847111025621e-13,"species_name":"Sorex unguiculatus","species_taxid":62275,"taxid":62275}
ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat
agcttaaaactcaaaggacttggcggtgctttatatccct
>AL355887 {"count":2,"definition":"Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN.","family_name":"Hominidae","family_taxid":9604,"genus_name":"Homo","genus_taxid":9605,"obicleandb_level":"genus","obicleandb_trusted":0,"species_name":"Homo sapiens","species_taxid":9606,"taxid":9606}
ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac
agcttaaaactcaaaggacctggcagttctttatatccct

The second value returned is therefore the sum of the count values over all sequence records: 3 for the example file above. The last value is the number of nucleotides stored in the file, that is, the sum of the sequence lengths, ignoring the count attribute (200 for the example file: two records of 100 nucleotides each).
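The three values can be illustrated with a minimal Python sketch. It is not the OBITools4 implementation: the function name `count_records`, the shortened sequences, and the simplified title-line parsing (identifier, one space, then a JSON annotation block) are assumptions made for the example.

```python
import json

def count_records(fasta_text):
    """Return (variants, reads, symbols) for FASTA text with JSON annotations."""
    variants = reads = symbols = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # One title line per sequence record.
            variants += 1
            # Title line: identifier, then an optional JSON annotation block.
            _, _, annotation = line[1:].partition(" ")
            tags = json.loads(annotation) if annotation.startswith("{") else {}
            # A missing count attribute is treated as 1.
            reads += int(tags.get("count", 1))
        else:
            # Sequence lengths are summed without weighting by count.
            symbols += len(line)
    return variants, reads, symbols

# Hypothetical two-record file, shortened from the example above.
example = """>seq1 {"taxid":62275}
ttagccctaaacttaggtat
ttaatctaac
>seq2 {"count":2,"taxid":9606}
ttagccctaaactctagtag
"""
print(count_records(example))  # (2, 3, 50)
```

The second record contributes 2 to the reads total but only its own 20 nucleotides to the symbols total.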

graph TD
  A@{ shape: doc, label: "my_sequences.fastq" }
  C[obicount]
  D@{ shape: doc, label: "counts.csv" }
  A --> C:::obitools
  C --> D
  classDef obitools fill:#99d57c

Synopsis #

obicount [--batch-size <int>] [--csv] [--debug] [--ecopcr] [--embl] [--fasta]
         [--fastq] [--force-one-cpu] [--genbank] [--help|-h|-?]
         [--input-OBI-header] [--input-json-header] [--max-cpu <int>]
         [--no-order] [--pprof] [--pprof-goroutine <int>]
         [--pprof-mutex <int>] [--reads|-r] [--silent-warning] [--solexa]
         [--symbols|-s] [--u-to-t] [--variants|-v] [--version] [<args>]

Options #

`obicount` specific options #

--variants | -v : when present, output only the number of sequence records in the file.
--reads | -r : when present, output only the sum of sequence counts in the file.
--symbols | -s : when present, output only the number of nucleotides in the file.

It is possible to combine two of the above options.

Controlling the input data #

OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. In rare cases a file may be misidentified, so the following options allow the user to force the format and bypass the format identification step.
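To give an idea of what content-based format identification can look like, here is a rough Python sketch. It does not reproduce how OBITools4 actually detects formats; it only relies on two well-known conventions (GZIP data starts with the bytes `1f 8b`, FASTA records start with `>` and FASTQ records with `@`). The function name `guess_format` is hypothetical.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any GZIP stream

def guess_format(data):
    """Guess (format, is_gzipped) from the full raw content of a file."""
    compressed = data.startswith(GZIP_MAGIC)
    if compressed:
        data = gzip.decompress(data)
    # Look at the first non-blank character of the decompressed content.
    first = data.lstrip()[:1]
    if first == b">":
        return "fasta", compressed
    if first == b"@":
        return "fastq", compressed
    # Anything else would need a more specific test or an explicit option.
    return "unknown", compressed

print(guess_format(b">s1\nacgt\n"))
print(guess_format(gzip.compress(b"@r1\nacgt\n+\nIIII\n")))
```

A heuristic of this kind can be fooled by unusual files, which is the situation the format options are meant to handle.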

The file format options #

--fasta: indicates that sequence data is in fasta format.
--fastq: indicates that sequence data is in fastq format.
--embl: indicates that sequence data is in EMBL-ENA flatfile format.
--csv: indicates that sequence data is in CSV format.
--genbank: indicates that sequence data is in GenBank flatfile format.
--ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.

Controlling the way OBITools4 formats annotations #

These options only apply to the FASTA and FASTQ formats.

--input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
--input-json-header: FASTA/FASTQ title line annotations follow the JSON format.

Controlling quality score decoding #

This option only applies to the FASTQ format.

--solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)
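The difference between the two encodings can be sketched in Python as follows. This illustrates the published conventions (Sanger: ASCII offset 33 on Phred scores; old Solexa: ASCII offset 64 on Solexa-scaled scores), not the internal OBITools4 code, and the function names are hypothetical.

```python
import math

def sanger_quality(char):
    """Phred score of one quality character in the standard Sanger encoding."""
    return ord(char) - 33

def solexa_quality(char):
    """Phred-equivalent score of one character in the old Solexa encoding."""
    q_solexa = ord(char) - 64
    # Convert the Solexa-scaled score to the Phred scale.
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)

print(sanger_quality("I"))            # 40
print(round(solexa_quality("h"), 2))
```

Reading a Solexa-encoded file with the default Sanger decoding would shift every score by 31, which is why the encoding has to be stated explicitly for such files.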

General options #

--help | -h|-? : shows this help.
--version: prints the version and exits.
--silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.

--max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
--force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
--batch-size <INTEGER>: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE)

--debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
--pprof: enables pprof server. Look at the log for details. (default: false).
--pprof-mutex <INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
--pprof-goroutine <INTEGER>: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)

Examples #

By default, the obicount command will output the number of sequence records (variants), sum of counts (reads), and number of nucleotides (symbols) in the sequence file.

obicount my_sequence_file.fasta

INFO[0000] Number of workers set 16
INFO[0000] Found 1 files to process
INFO[0000] xxx.fastq.gz mime type: text/fastq

entities,n
variants,43221
reads,43221
symbols,4391530

The output is in CSV format and can be transformed into Markdown for a prettier output using the csvtomd command.

obicount my_sequence_file.fasta | csvtomd

entities  |  n
----------|---------
variants  |  43221
reads     |  43221
symbols   |  4391530

The conversion can also be done with the csvlook command from the csvkit package.

obicount my_sequence_file.fasta | csvlook

| entities |         n |
| -------- | --------- |
| variants |    43 221 |
| reads    |    43 221 |
| symbols  | 4 391 530 |
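Because the raw output is plain CSV with an `entities,n` header, it can also be loaded directly in a script. The following Python sketch assumes that only the CSV lines have been captured in a string; the helper name `parse_obicount_csv` is hypothetical.

```python
import csv
import io

def parse_obicount_csv(text):
    """Turn the CSV output shown above into a {entity: count} dictionary."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["entities"]: int(row["n"]) for row in reader}

# CSV lines copied from the first example above.
output = """entities,n
variants,43221
reads,43221
symbols,4391530
"""
counts = parse_obicount_csv(output)
print(counts)
# Mean sequence length, derived from two of the three values.
print(counts["symbols"] / counts["variants"])
```

Lines that are absent from the output, for instance when only some values are requested, are simply missing from the dictionary.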

When using the --variants, --reads or --symbols option, the output only contains the number(s) corresponding to the specified option(s).

obicount -v --reads my_sequence_file.fasta | csvlook

| entities |      n |
| -------- | ------ |
| variants | 43 221 |
| reads    | 43 221 |

As with all OBITools commands, a GZIP compressed input file can be used.

obicount my_sequence_file.fasta.gz | csvlook

| entities |         n |
| -------- | --------- |
| variants |    43 221 |
| reads    |    43 221 |
| symbols  | 4 391 530 |
