obiannotate: edit sequence annotations #
Description #
Add or edit annotations associated with sequences in a sequence file.
Synopsis #
obiannotate [--add-lca-in <ATTRIBUTE>] [--aho-corasick <string>]
[--allows-indels] [--approx-pattern <PATTERN>]...
[--attribute|-a <KEY=VALUE>]... [--batch-size <int>] [--clear]
[--compressed|-Z] [--csv] [--cut <###:###>] [--debug]
[--definition|-D <PATTERN>]... [--delete-tag <KEY>]...
[--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta]
[--fasta-output] [--fastq] [--fastq-output] [--force-one-cpu]
[--genbank] [--has-attribute|-A <KEY>]... [--help|-h|-?]
[--id-list <FILENAME>] [--identifier|-I <PATTERN>]...
[--ignore-taxon|-i <TAXID>]... [--input-OBI-header]
[--input-json-header] [--inverse-match|-v] [--json-output]
[--keep|-k <KEY>]... [--lca-error <#.###>] [--length]
[--max-count|-C <COUNT>] [--max-cpu <int>]
[--max-length|-L <LENGTH>] [--min-count|-c <COUNT>]
[--min-length|-l <LENGTH>] [--no-order] [--no-progressbar]
[--only-forward] [--out|-o <FILENAME>] [--output-OBI-header|-O]
[--output-json-header]
[--paired-mode <forward|reverse|and|or|andnot|xor>]
[--paired-with <FILENAME>] [--pattern <string>]
[--pattern-error <int>] [--pattern-name <string>] [--pprof]
[--pprof-goroutine <int>] [--pprof-mutex <int>]
[--predicate|-p <EXPRESSION>]... [--raw-taxid]
[--rename-tag|-R <NEW_NAME=OLD_NAME>]...
[--require-rank <RANK_NAME>]...
[--restrict-to-taxon|-r <TAXID>]... [--scientific-name]
[--sequence|-s <PATTERN>]... [--set-identifier <EXPRESSION>]
[--set-tag|-S <KEY=EXPRESSION>]... [--skip-empty] [--solexa]
[--taxonomic-path] [--taxonomic-rank] [--taxonomy|-t <string>]
[--update-taxid] [--valid-taxid] [--version]
[--with-taxon-at-rank <RANK_NAME>]... [<args>]
Options #
obiannotate specific options #
Identifier modification #
--set-identifier
<EXPRESSION>: An expression used to assign the new identifier of the sequence.
Attribute modification #
--clear
: Clears all attributes associated with the sequence records.
--delete-tag
<KEY>: Deletes the attribute named KEY. When this attribute is missing, the sequence record is skipped and the next one is examined.
--keep|-k
<KEY>: Keeps only the attribute named KEY. Several -k options can be combined.
--rename-tag|-R
<NEW_NAME=OLD_NAME>: Renames the attribute OLD_NAME to NEW_NAME. When the attribute named OLD_NAME is missing, the sequence record is skipped and the next one is examined.
--set-tag|-S
<KEY=EXPRESSION>: Creates a new attribute named KEY whose value is computed from EXPRESSION.
Sequence-related annotation #
--aho-corasick
<string>: Adds an aho-corasick attribute with the count of matches of the provided patterns.
--length
: Adds an attribute with seq_length as key and the sequence length as value.
--pattern
<string>: Adds a pattern attribute containing the pattern, a pattern_match attribute indicating the matched sequence, and a pattern_error attribute indicating the number of differences between the pattern and its match in the sequence.
--pattern-name
<string>: Specifies the name to use as prefix for the attributes reporting the match. (default: “pattern”)
Sequence modification #
--cut
<###:###>: A pattern describing how to cut the sequence.
Taxonomy annotation #
--add-lca-in
<KEY>: From the taxonomic annotation of the sequence (taxid attribute or merged_taxid attribute), a new attribute named KEY is added with the taxid of the lowest common ancestor corresponding to the current annotation.
--lca-error
<#.###>: Error rate tolerated on the taxonomic description when computing the lowest common ancestor. At most a fraction lca-error of the taxonomic information can disagree with the estimated LCA. (default: 0.000000)
--scientific-name
: Annotates the sequence with its scientific name.
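To make the --lca-error fraction concrete, here is a minimal Python sketch. It is not the OBITools4 implementation: it assumes each taxid is described by a hypothetical root-to-leaf lineage list, and it assumes the estimated LCA is the deepest taxon shared by at least a fraction 1 - lca_error of those lineages.

```python
from collections import Counter


def lca_with_error(lineages, lca_error=0.0):
    """Return the deepest taxon shared by at least (1 - lca_error) of the lineages.

    `lineages` is a list of root-to-leaf lists of taxon names. This data layout
    and the agreement rule are assumptions made for this sketch.
    """
    if not lineages:
        return None
    needed = (1.0 - lca_error) * len(lineages)
    best = None
    candidates = lineages
    depth = 0
    while True:
        # Count the taxa found at this depth among lineages still in agreement.
        taxa = Counter(l[depth] for l in candidates if len(l) > depth)
        if not taxa:
            return best
        taxon, support = taxa.most_common(1)[0]
        if support < needed:
            return best
        best = taxon
        candidates = [l for l in candidates if len(l) > depth and l[depth] == taxon]
        depth += 1


# Toy lineages, for illustration only.
example = [
    ["root", "Chordata", "Mammalia", "Primates", "Homo"],
    ["root", "Chordata", "Mammalia", "Primates", "Homo"],
    ["root", "Chordata", "Mammalia", "Primates", "Pan"],
    ["root", "Chordata", "Aves", "Passeriformes"],
]
print(lca_with_error(example, 0.0))   # Chordata
print(lca_with_error(example, 0.25))  # Primates
```

With no tolerated error, the single bird lineage pulls the LCA up to Chordata. Tolerating 25% disagreement lets the three primate lineages decide.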
Taxonomy options #
Check taxids against a taxonomy #
OBITools4 commands can load a taxonomy database while processing sequence data. When a taxonomy is loaded, the command checks the validity of the taxids it encounters. Three cases can occur:
- The taxon is valid.
- The taxon is no longer valid, but a new one replaces it.
- The taxon is no longer valid, and no new taxid exists to replace it.
In the first case, the taxid is rewritten in the following form:
TAXCOD:TAXID [SCIENTIFIC NAME]@RANK
For example, with the NCBI taxonomy, the human taxid looks like:
taxon:9606 [Homo sapiens]@species
This rewriting does not happen if the --raw-taxid option is set. In that case, only the raw taxid is kept:
9606
In the second case, a warning message is logged on the standard error. If the --update-taxid option is set, the command updates the expired taxid to its new equivalent, and the rules for a valid taxon apply. Otherwise, the old taxid is kept in the result.
In the last case, a warning message is also issued on the standard error, and the invalid taxid is kept as is. If the --fail-on-taxonomy option is set, the command stops and exits with an error at the first invalid taxid encountered in the input data.
--taxonomy|-t
<string>: Path to the taxonomic database.
--raw-taxid
: Displays the raw taxid for each displayed taxon. (default: false)
--update-taxid
: Makes obitools automatically update taxids that are declared merged into a newer one. (default: false)
--fail-on-taxonomy
: Makes obitools fail with an error if a taxid in use is not currently valid. (default: false)
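The three cases described in this section can be sketched as a small decision function. This is an illustrative Python sketch under stated assumptions, not OBITools4 code. The `taxonomy` and `merged` dictionaries, the function name, and the exception type are hypothetical. It also assumes that --fail-on-taxonomy only concerns taxids with no replacement, which the text leaves open.

```python
import sys


class InvalidTaxidError(Exception):
    """Raised in this sketch when fail_on_taxonomy is set and a taxid is unknown."""


def resolve_taxid(taxid, taxonomy, merged, *, raw_taxid=False,
                  update_taxid=False, fail_on_taxonomy=False, code="taxon"):
    """Return the taxid annotation to write for `taxid`.

    taxonomy: {taxid: (scientific_name, rank)} for valid taxa (hypothetical layout).
    merged:   {old_taxid: new_taxid} for expired taxids that have a replacement.
    """
    if taxid not in taxonomy and taxid in merged:
        # Second case: expired taxid with a replacement.
        print(f"warning: taxid {taxid} is replaced by {merged[taxid]}", file=sys.stderr)
        if not update_taxid:
            return taxid  # the old taxid is kept in the result
        taxid = merged[taxid]
    if taxid not in taxonomy:
        # Last case: no valid taxon and no replacement.
        if fail_on_taxonomy:
            raise InvalidTaxidError(taxid)
        print(f"warning: taxid {taxid} is not valid", file=sys.stderr)
        return taxid
    # First case: valid taxon.
    if raw_taxid:
        return taxid
    name, rank = taxonomy[taxid]
    return f"{code}:{taxid} [{name}]@{rank}"


taxonomy = {"9606": ("Homo sapiens", "species")}
print(resolve_taxid("9606", taxonomy, {}))                  # taxon:9606 [Homo sapiens]@species
print(resolve_taxid("9606", taxonomy, {}, raw_taxid=True))  # 9606
```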
--taxonomic-rank
: Annotates the sequence with its taxonomic rank.
--taxonomic-path
: Annotates the sequence with its taxonomic path.
--with-taxon-at-rank
<RANK_NAME>: Adds taxonomic annotation at taxonomic rank RANK_NAME.
Selecting sequence records #
Selection based on the sequence #
Strict matching #
--sequence|-s
<PATTERN>: A regular expression pattern used to match the sequence. Only the entries whose sequence matches the pattern are kept. Regular expression patterns are case-insensitive.
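As a rough illustration of case-insensitive matching, the following Python sketch uses the standard `re` module as a stand-in. OBITools4 is not written in Python, so its regular expression syntax may differ in details. The sketch also assumes that a match anywhere in the sequence is enough to keep the record.

```python
import re


def matches_sequence(pattern, sequence):
    """Return True if `pattern` matches somewhere in `sequence`, ignoring case."""
    return re.search(pattern, sequence, flags=re.IGNORECASE) is not None


print(matches_sequence("^gat+aca", "GATTACAGG"))  # True
print(matches_sequence("ccc$", "gattaca"))        # False
```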
Approximate matching #
--approx-pattern
<PATTERN>: A DNA pattern used to match the sequence. Only the entries whose sequence matches the pattern are kept. DNA patterns are case-insensitive. They can be matched allowing for errors: mismatches, insertions, or deletions.
--allows-indels
: Allows indels during DNA pattern matching (see the --approx-pattern option).
--pattern-error
<INTEGER>: Maximum number of errors allowed when searching for patterns in DNA (default: 0, see the --approx-pattern option).
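The difference between counting mismatches only and also allowing indels can be sketched in Python as follows. This is a textbook illustration (sliding Hamming distance, and a semi-global edit distance in the style of Sellers), not the algorithm used by OBITools4. It handles plain letters only, and the function names are hypothetical.

```python
def hamming_hits(pattern, sequence, max_errors=0):
    """Start positions where `pattern` matches with at most `max_errors` mismatches."""
    p, s = pattern.lower(), sequence.lower()
    hits = []
    for i in range(len(s) - len(p) + 1):
        errors = sum(a != b for a, b in zip(p, s[i:i + len(p)]))
        if errors <= max_errors:
            hits.append(i)
    return hits


def min_edit_errors(pattern, sequence):
    """Fewest mismatches, insertions, or deletions aligning `pattern` to any substring."""
    p, s = pattern.lower(), sequence.lower()
    prev = [0] * (len(s) + 1)  # a match may start anywhere in the sequence
    for i in range(1, len(p) + 1):
        cur = [i] + [0] * len(s)
        for j in range(1, len(s) + 1):
            cur[j] = min(prev[j - 1] + (p[i - 1] != s[j - 1]),  # match or mismatch
                         prev[j] + 1,                           # pattern base left unpaired
                         cur[j - 1] + 1)                        # sequence base left unpaired
        prev = cur
    return min(prev)  # a match may end anywhere in the sequence


def approx_match(pattern, sequence, pattern_error=0, allows_indels=False):
    """Mimic the selection rule: keep the record if the pattern is found within the error budget."""
    if allows_indels:
        return min_edit_errors(pattern, sequence) <= pattern_error
    return bool(hamming_hits(pattern, sequence, pattern_error))


# One substitution: found with one error, indels not needed.
print(approx_match("GATTACA", "ccgatcacatt", pattern_error=1))                      # True
# One deleted base: only found when indels are allowed.
print(approx_match("GATTACA", "ccgatacatt", pattern_error=1))                       # False
print(approx_match("GATTACA", "ccgatacatt", pattern_error=1, allows_indels=True))   # True
```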
Selection based on the sequence identifier #
--identifier|-I
<REGEX>: Regular expression pattern to be tested against the sequence identifier. The pattern is case-insensitive.
--id-list
<FILENAME>: Points to a text file containing the list of sequence record identifiers to be selected. The file contains a single identifier per line.
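The identifier list format (one identifier per line) can be illustrated with a short Python sketch. The helper names and the `(identifier, sequence)` record layout are hypothetical. Ignoring blank lines and surrounding whitespace is an assumption of this sketch, not documented behaviour.

```python
def load_id_list(path):
    """Read one identifier per line into a set (blank lines ignored by assumption)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def select_by_id(records, wanted_ids):
    """Keep the (identifier, sequence) pairs whose identifier is in `wanted_ids`."""
    return [record for record in records if record[0] in wanted_ids]
```

A file containing the two lines `seq1` and `seq3` would select only the records with those identifiers.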
Selection based on the sequence definition #
--definition|-D
<REGEX>: Regular expression pattern to be tested against the sequence definition. The pattern is case-insensitive.
Selection based on the sequence properties #
--min-count|-c
<COUNT>: Selects the sequence records for which the number of occurrences (i.e., the count attribute) is equal to or greater than the defined minimum count.
--max-count|-C
<COUNT>: Selects the sequence records for which the number of occurrences (i.e., the count attribute) is equal to or smaller than the defined maximum count.
--min-length|-l
<LENGTH>: Selects the sequence records for which the sequence length is equal to or greater than the defined minimum sequence length.
--max-length|-L
<LENGTH>: Selects the sequence records for which the sequence length is equal to or smaller than the defined maximum sequence length.
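All four bounds are inclusive. A minimal Python sketch of the combined predicate follows. The function name and the dictionary of attributes are hypothetical, and treating a missing count attribute as 1 is an assumption of this sketch.

```python
def passes_filters(sequence, attributes, min_count=None, max_count=None,
                   min_length=None, max_length=None):
    """Return True if the record satisfies every bound that is set (bounds are inclusive)."""
    count = attributes.get("count", 1)  # assumption: a missing count means 1
    length = len(sequence)
    return all([
        min_count is None or count >= min_count,
        max_count is None or count <= max_count,
        min_length is None or length >= min_length,
        max_length is None or length <= max_length,
    ])


print(passes_filters("acgtacgt", {"count": 3}, min_count=3, max_length=8))  # True
print(passes_filters("acgtacgt", {"count": 3}, min_length=9))               # False
```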
Controlling the input data #
OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. However, some rare files can be misidentified, so the following options allow the user to force the format, bypassing the format identification step.
The file format options #
--fasta
: indicates that sequence data is in fasta format.
--fastq
: indicates that sequence data is in fastq format.
--embl
: indicates that sequence data is in EMBL-ENA flatfile format.
--csv
: indicates that sequence data is in CSV format.
--genbank
: indicates that sequence data is in GenBank flatfile format.
--ecopcr
: indicates that sequence data is in the old ecoPCR tabulated format.
Controlling the way OBITools4 are formatting annotations #
These options only apply to the FASTA and FASTQ formats.
--input-OBI-header
: FASTA/FASTQ title line annotations follow the old OBI format.
--input-json-header
: FASTA/FASTQ title line annotations follow the JSON format.
Controlling quality score decoding #
This option only applies to the FASTQ format.
--solexa
: decodes quality string according to the Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)
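For readers unfamiliar with the two encodings, here is a short Python sketch of how quality characters are conventionally decoded. Sanger uses Phred scores with ASCII offset 33. The original Solexa encoding uses ASCII offset 64 and a slightly different score, which can be converted to the Phred scale. How OBITools4 represents these scores internally is not described here; the function names are hypothetical.

```python
import math


def decode_sanger(quality):
    """Phred scores from a Sanger-encoded quality string (ASCII offset 33)."""
    return [ord(char) - 33 for char in quality]


def decode_solexa(quality):
    """Phred-scale scores from a Solexa-encoded quality string (ASCII offset 64)."""
    scores = []
    for char in quality:
        q_solexa = ord(char) - 64
        # Standard Solexa-to-Phred conversion.
        scores.append(10 * math.log10(10 ** (q_solexa / 10) + 1))
    return scores


print(decode_sanger("I5!"))  # [40, 20, 0]
```

The two scales nearly coincide for high-quality bases and diverge for low-quality ones, which is why reading a Solexa file with the Sanger convention gives misleading scores.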
Controlling the output data #
--compress|-Z
: output is compressed using gzip. (default: false)
--no-order
: By default, the OBITools ensure that sequences are written in the same order as in the input. When multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory.
--fasta-output
: writes sequence data in fasta format (default if quality data is not available).
--fastq-output
: writes sequence data in fastq format (default if quality data is available).
--json-output
: writes sequence data in JSON format.
--out|-o
<FILENAME>: filename used for saving the output (default: “-”, the standard output)
--output-OBI-header|-O
: writes output FASTA/FASTQ title line annotations in OBI format (default: JSON).
--output-json-header
: writes output FASTA/FASTQ title line annotations in JSON format (the default format).
--skip-empty
: sequences of length equal to zero are removed from the output (default: false).
--no-progressbar
: deactivates progress bar display (default: false).
General options #
--help|-h|-?
: shows this help.
--version
: prints the version and exits.
--silent-warning
: tells obitools to stop displaying warnings. This behaviour can also be controlled by setting the OBIWARNINGS environment variable.
Computation related options #
--max-cpu
<INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory. Reducing the number of CPUs used is therefore also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
--force-one-cpu
: forces the use of a single CPU core for parallel processing (default: false).
--batch-size
<INTEGER>: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE)
Debug related options #
--debug
: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
--pprof
: enables pprof server. Look at the log for details. (default: false).
--pprof-mutex
<INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
--pprof-goroutine
<INTEGER>: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)
Examples #
obiannotate --help
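When obiannotate is driven from a script, a call can be assembled from the options documented above. The Python sketch below is only an illustration: the file names are hypothetical, and passing the input file as a trailing argument is an assumption based on the `[<args>]` entry of the synopsis. The command is run only if obiannotate and the input file are both present.

```python
import os
import shlex
import shutil
import subprocess


def build_command(input_path, output_path):
    """Build an obiannotate call that adds the seq_length attribute and writes FASTA."""
    return ["obiannotate", "--length", "--fasta-output", "--out", output_path, input_path]


cmd = build_command("input.fasta", "annotated.fasta")  # hypothetical file names
print(shlex.join(cmd))
if shutil.which("obiannotate") and os.path.exists("input.fasta"):
    subprocess.run(cmd, check=True)
```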