Customising the execution of OBITools #
OBITools are a set of UNIX commands that can be used from a UNIX shell. They can be used interactively from a terminal, or as part of a shell script to automate a data analysis pipeline. Each OBITools command implements an algorithm to process the data. For example, the obicount command implements an algorithm to count the number of sequences in a sequence file.
>AB061527 {"count":1,"definition":"Sorex unguiculatus mitochondrial NA, complete genome.","family_name":"Soricidae","family_taxid":9376,"genus_name":"Sorex","genus_taxid":9379,"obicleandb_level":"family","obicleandb_trusted":2.2137847111025621e-13,"species_name":"Sorex unguiculatus","species_taxid":62275,"taxid":62275}
ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat
agcttaaaactcaaaggacttggcggtgctttatatccct
>AL355887 {"count":2,"definition":"Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN.","family_name":"Hominidae","family_taxid":9604,"genus_name":"Homo","genus_taxid":9605,"obicleandb_level":"genus","obicleandb_trusted":0,"species_name":"Homo sapiens","species_taxid":9606,"taxid":9606}
ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac
agcttaaaactcaaaggacctggcagttctttatatccct
obicount two_sequences.fasta
entities,n
variants,2
reads,3
symbols,200
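The figures reported by obicount can be cross-checked with standard UNIX tools. The sketch below is not part of OBITools. It recreates the example file in a temporary directory, with the header annotations abbreviated to the count attribute only (an assumption made for brevity), and recomputes the three values with grep, sed and awk.

```shell
# Recreate the example file in a temporary directory.
# Headers are shortened to the "count" attribute only (illustrative assumption).
workdir=$(mktemp -d)
fasta="$workdir/two_sequences.fasta"
cat > "$fasta" <<'EOF'
>AB061527 {"count":1}
ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat
agcttaaaactcaaaggacttggcggtgctttatatccct
>AL355887 {"count":2}
ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac
agcttaaaactcaaaggacctggcagttctttatatccct
EOF

# variants: one per header line
variants=$(grep -c '^>' "$fasta")

# reads: sum of the "count" attributes found in the header lines
reads=$(sed -n 's/^>.*"count":\([0-9][0-9]*\).*/\1/p' "$fasta" | awk '{ s += $1 } END { print s + 0 }')

# symbols: total length of the sequence lines
symbols=$(awk '!/^>/ { n += length($0) } END { print n + 0 }' "$fasta")

printf 'variants,%s\nreads,%s\nsymbols,%s\n' "$variants" "$reads" "$symbols"
```

For this file the sketch reports 2 variants, 3 reads and 200 symbols, which matches the obicount output shown above.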
In addition to its name, an OBITools command has a number of options that allow you to customise its behaviour. For example, the obicount command has the --symbols option, which tells it to count only the total number of nucleotides in the sequence file.
obicount --symbols two_sequences.fasta
entities,n
symbols,200
If you compare the two outputs, you will notice that the obicount command without the --symbols option reports the total number of nucleotides as well as the number of sequence variants and the number of reads, whereas with the --symbols option it reports only the total number of nucleotides.
Multiple ways to specify the same option #
Unix options are specified on the command line by adding them after the command name. They can take two forms:
- The long option name, which is the name of the option preceded by two hyphens, for example --help.
- For some options, such as the help option, there is also a short version of the option. This consists of a single character preceded by a single hyphen, for example -h.
If multiple forms of the same option exist, they are separated in the documentation by a vertical bar |. For example, the help option exists in its long form --help and in two short forms, -h and -?. These different forms are represented as follows: --help|-h|-?.
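As a minimal sketch of how these forms are typed in practice: the ? character is also a shell wildcard that matches any single character, so the -? form is safer when quoted. The snippet illustrates this with a throwaway file named -x (a hypothetical name chosen for the demonstration) and then, only if obicount is installed, calls the three equivalent forms.

```shell
# '?' is a shell wildcard: an unquoted -? can be replaced by a matching file name.
demo_dir=$(mktemp -d)
touch "$demo_dir/-x"    # hypothetical file whose name matches the pattern -?

unquoted=$(cd "$demo_dir" && printf '%s\n' -?)    # expands to the file name
quoted=$(cd "$demo_dir" && printf '%s\n' '-?')    # stays a literal -?

echo "unquoted: $unquoted, quoted: $quoted"

# The three documented forms are equivalent (run only if obicount is available).
if command -v obicount >/dev/null 2>&1; then
    obicount --help
    obicount -h
    obicount '-?'
fi
```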
Specifying an option through environment variables #
Options such as --max-cpu, which specifies the maximum number of CPU cores used by OBITools, can be specified when running the command
obicount --max-cpu=4 my_sequence.fasta
or by declaring an environment variable. For this example, the environment variable corresponding to the --max-cpu option is OBIMAXCPU. When using the bash or zsh shells, the environment variable can be set using the export command:
export OBIMAXCPU=4
Once the environment variable is set, any OBITools command run in the same shell session will use at most four CPU cores in this case, without the need to specify the --max-cpu option.
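The scope of an exported variable can be illustrated with plain shell commands. In the sketch below, sh -c stands in for an OBITools command (an assumption, so that the snippet runs without OBITools installed). It shows that a child process inherits the exported value, and that a NAME=value prefix overrides it for a single command only.

```shell
# Set the variable for the rest of the shell session.
export OBIMAXCPU=4

# Child processes started from this shell inherit the value.
inherited=$(sh -c 'printf "%s" "$OBIMAXCPU"')

# A NAME=value prefix changes the variable for that single command only.
one_off=$(OBIMAXCPU=2 sh -c 'printf "%s" "$OBIMAXCPU"')

echo "inherited=$inherited one_off=$one_off session=$OBIMAXCPU"

# With OBITools installed, the two styles would look like this
# (my_sequence.fasta is the file name used in the text):
#   obicount my_sequence.fasta              # sees OBIMAXCPU=4 from the session
#   OBIMAXCPU=2 obicount my_sequence.fasta  # sees 2 for this command only
```

The session value is unchanged after the one-off command; unset OBIMAXCPU removes it.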
Some OBITools options are shared by most of the commands. These options are listed in the following sections.
Controlling the input data #
OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step.
The file format options #
- --fasta: indicates that sequence data is in fasta format.
- --fastq: indicates that sequence data is in fastq format.
- --embl: indicates that sequence data is in EMBL-ENA flatfile format.
- --csv: indicates that sequence data is in CSV format.
- --genbank: indicates that sequence data is in GenBank flatfile format.
- --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.
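As an illustration of forcing the format, the sketch below writes a small FASTA file under a name that gives no hint of its format (sequences.txt and its content are hypothetical) and passes --fasta so that the format identification step is bypassed. The obicount call is only attempted if the command is found on the PATH.

```shell
# Hypothetical FASTA content stored under an uninformative file name.
work=$(mktemp -d)
cat > "$work/sequences.txt" <<'EOF'
>seq1 {"count":1}
acgtacgtac
EOF

if command -v obicount >/dev/null 2>&1; then
    # --fasta forces the input format instead of relying on detection.
    obicount --fasta "$work/sequences.txt"
else
    echo "obicount not found; skipping the call" >&2
fi
```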
Controlling the way OBITools4 are formatting annotations #
These options only apply to the FASTA and FASTQ formats.
- --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
- --input-json-header: FASTA/FASTQ title line annotations follow the JSON format.
Controlling quality score decoding #
This option only applies to the FASTQ format.
- --solexa: decodes quality string according to the Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)
Controlling the output data #
- --compress|-Z: output is compressed using gzip. (default: false)
- --no-order: by default, OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation.
- --fasta-output: writes sequence data in fasta format (default if quality data is not available).
- --fastq-output: writes sequence data in fastq format (default if quality data is available).
- --json-output: writes sequence data in JSON format.
- --out|-o <FILENAME>: filename used for saving the output (default: “-”, the standard output)
- --output-OBI-header|-O: writes output FASTA/FASTQ title line annotations in OBI format (default: JSON).
- --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format).
- --skip-empty: sequences of length equal to zero are removed from the output (default: false).
- --no-progressbar: deactivates progress bar display (default: false).
General options #
- --help|-h|-?: shows this help.
- --version: prints the version and exits.
- --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.
Computation related options #
- --max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
- --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
- --batch-size <INTEGER>: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE)
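Since reducing the number of CPUs is described as an indirect way to limit memory use, one possible approach is to cap OBITools at a fraction of the available cores. The sketch below uses half of them, which is an arbitrary choice, and relies on getconf _NPROCESSORS_ONLN to count the online cores, falling back to 1 if that query fails. The batch size of 500 is purely illustrative.

```shell
# Number of online cores; fall back to 1 if getconf cannot report it.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Use half of them, but never fewer than one.
half=$(( cores / 2 ))
[ "$half" -ge 1 ] || half=1

# Equivalent to passing --max-cpu and --batch-size on each command line.
export OBIMAXCPU="$half"
export OBIBATCHSIZE=500   # illustrative value; the documented default is 1000

echo "OBIMAXCPU=$OBIMAXCPU OBIBATCHSIZE=$OBIBATCHSIZE"
```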
Debug related options #
- --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
- --pprof: enables pprof server. Look at the log for details. (default: false).
- --pprof-mutex <INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
- --pprof-goroutine <INTEGER>: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)