obiannotate: edit sequence annotations #
Description #
obiannotate is a tool for editing the sequence records of a dataset. It allows you to add, delete or modify annotations of sequence records, as well as edit the identifier, definition or sequence itself.
There are two particularly important groups of options in obiannotate. The first group is shared with obigrep and enables the selection of sequences. The second group specifies the changes to be made to the sequence records. In obigrep, the selection options determine which sequences the program will retain in its output. In contrast, every sequence in the input dataset is included in the result produced by obiannotate; however, only the sequences selected by the selection options are modified according to the editing options. Non-selected sequences are transferred to the result without modification.
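The difference between the two commands can be modelled in a few lines of Python. This is only an illustrative sketch, not the OBITools4 implementation: the record layout (a dict with id, sequence and tags) and the function names grep_like and annotate_like are assumptions made for the example.

```python
from typing import Any, Callable, Dict, List

# Hypothetical record layout: {"id": str, "sequence": str, "tags": dict}
Record = Dict[str, Any]


def grep_like(records: List[Record],
              selected: Callable[[Record], bool]) -> List[Record]:
    """obigrep-style behaviour: only the selected records reach the output."""
    return [r for r in records if selected(r)]


def annotate_like(records: List[Record],
                  selected: Callable[[Record], bool],
                  edit: Callable[[Record], Record]) -> List[Record]:
    """obiannotate-style behaviour: every record reaches the output,
    but only the selected ones go through the edit function."""
    return [edit(r) if selected(r) else r for r in records]


records = [
    {"id": "seqA1", "sequence": "cgatgctgcatgctagtgctagtcgat", "tags": {}},
    {"id": "seqB1", "sequence": "tagctagctagctagctagctagctagcta", "tags": {}},
]


def is_b(record: Record) -> bool:
    # Selection: identifiers starting with "seqB"
    return record["id"].startswith("seqB")


def add_foo(record: Record) -> Record:
    # Edit: return a copy of the record with a foo tag set to 3
    return {**record, "tags": {**record["tags"], "foo": 3}}


kept = grep_like(records, is_b)
edited = annotate_like(records, is_b, add_foo)

print([r["id"] for r in kept])                  # ['seqB1']
print([(r["id"], r["tags"]) for r in edited])   # [('seqA1', {}), ('seqB1', {'foo': 3})]
```

The same selection function is passed to both helpers, which mirrors the fact that the selection options are shared between the two commands.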
The selection options #
The edition options #
Edition of the annotations #
OBITools4 stores the annotations attached to each sequence using a tag/value mechanism. The annotation of a sequence is a set of tags, each associated with a value. Therefore, annotating a sequence means changing this set of tags by adding new tags, deleting existing ones, or changing the value associated with a tag.
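As a rough illustration of the tag/value model, the sketch below reads the JSON annotations found on a FASTA title line, such as those shown in the examples of this page, into a Python dict and writes them back. It is a simplified model, not the OBITools4 parser: the function names are hypothetical, and the code assumes that everything after the identifier is a single JSON object.

```python
import json
from typing import Any, Dict, Tuple


def parse_title_line(line: str) -> Tuple[str, Dict[str, Any]]:
    """Split a FASTA title line into (identifier, tags).

    ASSUMPTION: the text following the identifier is either empty
    or exactly one JSON object holding the tags.
    """
    body = line.lstrip(">").strip()
    identifier, _, rest = body.partition(" ")
    rest = rest.strip()
    tags = json.loads(rest) if rest.startswith("{") else {}
    return identifier, tags


def format_title_line(identifier: str, tags: Dict[str, Any]) -> str:
    """Write the identifier and its tags back as a FASTA title line."""
    if not tags:
        return f">{identifier}"
    return f">{identifier} " + json.dumps(tags, sort_keys=True, separators=(",", ":"))


identifier, tags = parse_title_line('>seqA1 {"foo":3}')
tags["foo"] = "bar"          # changing the value associated with a tag
print(format_title_line(identifier, tags))   # >seqA1 {"foo":"bar"}
```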
Adding annotations #
To add a new tag/value pair to a sequence, obiannotate provides the generic option --set-tag.
Consider the following file, empty.fasta:
>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1
tagctagctagctagctagctagctagcta
>seqA2
gtagctagctagctagctagctagctaga
>seqC1
cgatgctccatgctagtgctagtcgatga
>seqB2
cgatggctccatgctagtgctagtcgatga
To add a foo tag with the numeric value 3 to every sequence, the command is:
obiannotate --set-tag foo=3 empty.fasta
>seqA1 {"foo":3}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"foo":3}
tagctagctagctagctagctagctagcta
>seqA2 {"foo":3}
gtagctagctagctagctagctagctaga
>seqC1 {"foo":3}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"foo":3}
cgatggctccatgctagtgctagtcgatga
The argument of the --set-tag option, foo=3, can be decomposed into two parts separated by the equal sign. The left part, foo, is the name of the target tag; the right part is the value to assign to the tag. The left part must be a plain string, whereas the right part is an expression written in the OBITools4 expression language. Here the expression is simply 3, which evaluates to the integer value 3.
To assign a string value to a tag, the right part of the option argument must be a valid OBITools4 expression that evaluates to a string: "bar", with double quotes flanking the text to be assigned. To prevent the Bash UNIX shell from interpreting the option argument foo="bar" itself, the argument must be protected by single quotes.
obiannotate --set-tag 'foo="bar"' empty.fasta
>seqA1 {"foo":"bar"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"foo":"bar"}
tagctagctagctagctagctagctagcta
>seqA2 {"foo":"bar"}
gtagctagctagctagctagctagctaga
>seqC1 {"foo":"bar"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"foo":"bar"}
cgatggctccatgctagtgctagtcgatga
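The effect of the single quotes can be checked without running obiannotate. The sketch below uses Python's shlex module, which applies POSIX shell word-splitting and quote-removal rules, as a stand-in for what Bash does with the command line in this simple case.

```python
import shlex

# Without single quotes, the shell removes the double quotes itself.
unquoted = shlex.split('obiannotate --set-tag foo="bar" empty.fasta')

# With single quotes, the double quotes reach the program untouched.
quoted = shlex.split("""obiannotate --set-tag 'foo="bar"' empty.fasta""")

print(unquoted[2])   # foo=bar
print(quoted[2])     # foo="bar"
```

In the unquoted form, the expression received by the program is bar rather than the string literal "bar", which is why the single quotes are needed.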
As the right part is an expression, it can be more complex and perform some basic computations. In the next example, the foo tag is set to the sequence identifier prefixed by "bar-".
obiannotate --set-tag 'foo="bar-" + sequence.Id()' empty.fasta
>seqA1 {"foo":"bar-seqA1"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"foo":"bar-seqB1"}
tagctagctagctagctagctagctagcta
>seqA2 {"foo":"bar-seqA2"}
gtagctagctagctagctagctagctaga
>seqC1 {"foo":"bar-seqC1"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"foo":"bar-seqB2"}
cgatggctccatgctagtgctagtcgatga
The complete description of the OBITools4 expression language is available in the OBITools4 documentation.
All the previous examples tag every sequence in the same way, but you can also use obiannotate to modify the annotations of only a subset of the sequences. As explained in the introduction of this documentation, this is achieved by combining selection and editing options.
For instance, to add a foo tag only to the sequence with the identifier seqA2, combine the selection option -I seqA2 with the editing option --set-tag 'foo="bar"':
obiannotate -I seqA2 --set-tag 'foo="bar"' empty.fasta
>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1
tagctagctagctagctagctagctagcta
>seqA2 {"foo":"bar"}
gtagctagctagctagctagctagctaga
>seqC1
cgatgctccatgctagtgctagtcgatga
>seqB2
cgatggctccatgctagtgctagtcgatga
Used with obigrep, the -I seqA2 option would have retained only the sequence that obiannotate modified.
obigrep -I seqA2 empty.fasta
>seqA2
gtagctagctagctagctagctagctaga
Since the selection options are shared between obiannotate and obigrep, a good way to check which sequences will be modified by obiannotate is to first test the selection options with obigrep. Only the sequences present in the obigrep output will be edited by obiannotate.
obigrep -l 30 empty.fasta
>seqB1
tagctagctagctagctagctagctagcta
>seqB2
cgatggctccatgctagtgctagtcgatga
obiannotate -l 30 \
--set-tag 'foo="bar-" + sequence.Id()' \
empty.fasta
>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1 {"foo":"bar-seqB1"}
tagctagctagctagctagctagctagcta
>seqA2
gtagctagctagctagctagctagctaga
>seqC1
cgatgctccatgctagtgctagtcgatga
>seqB2 {"foo":"bar-seqB2"}
cgatggctccatgctagtgctagtcgatga
Renaming tags #
Renaming tags can be useful for accounting for changes in a pipeline, adapting old datasets to new scripts, or saving annotations produced by an OBITools command before rerunning it with different parameters. Consider the following fasta file, five_tags.fasta:
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga
If you want to keep the current taxonomic annotations as a reference before running the obitag command, so that the new annotations can later be compared with the old ones, you can rename the taxid tag to ref_taxid and then run the obitag command, which will set a new taxid tag.
obiannotate --rename-tag ref_taxid=taxid five_tags.fasta
>seqA1 {"count":1,"ref_taxid":"taxon:9606 [Homo sapiens]@species","tata":"bar","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"ref_taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","tata":"bar","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"ref_taxid":"taxon:9605 [Homo]@genus","tata":"foo","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"ref_taxid":"taxon:9604 [Hominidae]@family","tata":"foo","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga
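The renaming shown above can be modelled on a plain dict of tags. This is a sketch of the observed behaviour only; the function name rename_tag is hypothetical. The argument order follows the NEW_NAME=OLD_NAME form of the option, and a record without the old tag is returned unchanged, as seqB2 is in the example. What obiannotate does when NEW_NAME already exists is not described here; this sketch simply overwrites it.

```python
from typing import Any, Dict


def rename_tag(tags: Dict[str, Any], new_name: str, old_name: str) -> Dict[str, Any]:
    """Return a copy of tags where old_name has been renamed to new_name."""
    if old_name not in tags:
        # Nothing to rename: the annotations are left as they are.
        return dict(tags)
    renamed = {key: value for key, value in tags.items() if key != old_name}
    renamed[new_name] = tags[old_name]
    return renamed


seq_a2 = {"count": 5, "tata": "foo", "taxid": "taxon:9605 [Homo]@genus", "toto": "tutu"}
seq_b2 = {"count": 25, "tata": "bar"}

print(rename_tag(seq_a2, "ref_taxid", "taxid"))
print(rename_tag(seq_b2, "ref_taxid", "taxid"))
```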
--number
Adding sequence related annotations #
--length
--aho-corasick
Edit taxonomy related annotations #
--scientific-name
--with-taxon-at-rank <RANK_NAME>
--taxonomic-rank
--taxonomic-path
--raw-taxid
--add-lca-in <SLOT_NAME> --lca-error <#.###>
Deleting annotations #
There are three options that allow for deleting annotations associated with a sequence.
The simplest one is --clear. It removes every annotation associated with a sequence.
Consider the fasta sequence file five_tags.fasta:
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga
The next command removes all the annotations:
obiannotate --clear five_tags.fasta
>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1
tagctagctagctagctagctagctagcta
>seqA2
gtagctagctagctagctagctagctaga
>seqC1
cgatgctccatgctagtgctagtcgatga
>seqB2
cgatggctccatgctagtgctagtcgatga
If you combine a selection option with the --clear option, here -C 10, which selects all the sequences occurring at most ten times, the annotations are deleted only on the selected sequences. For the other sequences, the annotations are kept. In the output below, seqB1, which carries no count tag, is also cleared.
obiannotate -C 10 --clear five_tags.fasta
>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1
tagctagctagctagctagctagctagcta
>seqA2
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga
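The result above can be reproduced with a small model of the -C 10 selection followed by a clear. This is a sketch, not the OBITools4 code. It ASSUMES that a record without a count tag is treated as occurring once, which is consistent with seqB1 being cleared in the output above; the helper names are hypothetical and only the tags relevant to the example are listed.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]

records: List[Record] = [
    {"id": "seqA1", "tags": {"count": 1, "tata": "bar"}},
    {"id": "seqB1", "tags": {"tata": "bar"}},            # no count tag
    {"id": "seqA2", "tags": {"count": 5, "tata": "foo"}},
    {"id": "seqC1", "tags": {"count": 15, "tata": "foo"}},
    {"id": "seqB2", "tags": {"count": 25, "tata": "bar"}},
]


def occurrence_count(record: Record) -> int:
    # ASSUMPTION: a missing count tag is read as a count of 1.
    return record["tags"].get("count", 1)


def clear_if_max_count(records: List[Record], max_count: int) -> List[Record]:
    """Model of '-C max_count --clear': clear the tags of the selected
    records and pass the other records through unchanged."""
    return [
        {**r, "tags": {}} if occurrence_count(r) <= max_count else r
        for r in records
    ]


result = clear_if_max_count(records, 10)
for r in result:
    print(r["id"], r["tags"])
```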
It is possible to delete a given tag by name using the --delete-tag option. In the following example, the taxid tag is deleted. As the seqB2 sequence has no taxid tag, it is not affected.
obiannotate --delete-tag taxid five_tags.fasta
>seqA1 {"count":1,"tata":"bar","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga
Several --delete-tag options can be used in a single obiannotate command.
obiannotate --delete-tag taxid \
--delete-tag count \
five_tags.fasta
>seqA1 {"tata":"bar","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"tata":"foo","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"tata":"foo","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"tata":"bar"}
cgatggctccatgctagtgctagtcgatga
The last way to delete annotations is indirect. It relies on the --keep option, which names the annotation to be kept. Consequently, all the other tags are deleted.
obiannotate --keep taxid five_tags.fasta
>seqA1 {"taxid":"taxon:9606 [Homo sapiens]@species"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies"}
tagctagctagctagctagctagctagcta
>seqA2 {"taxid":"taxon:9605 [Homo]@genus"}
gtagctagctagctagctagctagctaga
>seqC1 {"taxid":"taxon:9604 [Hominidae]@family"}
cgatgctccatgctagtgctagtcgatga
>seqB2
cgatggctccatgctagtgctagtcgatga
As with --delete-tag, several --keep options can be provided to keep several annotations.
obiannotate --keep taxid \
--keep count \
five_tags.fasta
>seqA1 {"count":1,"taxid":"taxon:9606 [Homo sapiens]@species"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"taxid":"taxon:9605 [Homo]@genus"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"taxid":"taxon:9604 [Hominidae]@family"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25}
cgatggctccatgctagtgctagtcgatga
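The three ways of deleting annotations described in this section can be summarised as three operations on a dict of tags. The sketch below is a model of the outputs shown above, with hypothetical function names; it is not how OBITools4 implements the options.

```python
from typing import Any, Dict, Iterable


def clear(tags: Dict[str, Any]) -> Dict[str, Any]:
    """Model of --clear: drop every tag."""
    return {}


def delete_tags(tags: Dict[str, Any], names: Iterable[str]) -> Dict[str, Any]:
    """Model of one or more --delete-tag options: drop the named tags."""
    dropped = set(names)
    return {key: value for key, value in tags.items() if key not in dropped}


def keep_tags(tags: Dict[str, Any], names: Iterable[str]) -> Dict[str, Any]:
    """Model of one or more --keep options: drop everything but the named tags."""
    kept = set(names)
    return {key: value for key, value in tags.items() if key in kept}


seq_a1 = {"count": 1, "tata": "bar",
          "taxid": "taxon:9606 [Homo sapiens]@species", "toto": "titi"}
seq_b2 = {"count": 25, "tata": "bar"}

print(delete_tags(seq_a1, ["taxid", "count"]))   # {'tata': 'bar', 'toto': 'titi'}
print(keep_tags(seq_a1, ["taxid", "count"]))
print(keep_tags(seq_b2, ["taxid"]))              # {}
```

A tag named in --delete-tag or --keep that is absent from a record has no effect on that record, as seqB2 shows in the examples.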
Changing annotation values #
Edition of the identifier #
--set-identifier
Edition of the definition #
Edition of the sequence #
--cut <###:###>
--sequence
Synopsis #
obiannotate [--add-lca-in <SLOT_NAME>] [--aho-corasick <string>]
[--allows-indels] [--approx-pattern <PATTERN>]...
[--attribute|-a <KEY=VALUE>]... [--batch-size <int>] [--clear]
[--compress|-Z] [--csv] [--cut <###:###>] [--debug]
[--definition|-D <PATTERN>]... [--delete-tag <KEY>]... [--ecopcr]
[--embl] [--fail-on-taxonomy] [--fasta] [--fasta-output]
[--fastq] [--fastq-output] [--force-one-cpu] [--genbank]
[--has-attribute|-A <KEY>]... [--help|-h|-?]
[--id-list <FILENAME>] [--identifier|-I <PATTERN>]...
[--ignore-taxon|-i <TAXID>]... [--input-OBI-header]
[--input-json-header] [--inverse-match|-v] [--json-output]
[--keep|-k <KEY>]... [--lca-error <#.###>] [--length]
[--max-count|-C <COUNT>] [--max-cpu <int>]
[--max-length|-L <LENGTH>] [--min-count|-c <COUNT>]
[--min-length|-l <LENGTH>] [--no-order] [--no-progressbar]
[--number] [--only-forward] [--out|-o <FILENAME>]
[--output-OBI-header|-O] [--output-json-header]
[--paired-mode <forward|reverse|and|or|andnot|xor>]
[--pattern <string>] [--pattern-error <int>]
[--pattern-name <string>] [--pprof] [--pprof-goroutine <int>]
[--pprof-mutex <int>] [--predicate|-p <EXPRESSION>]...
[--raw-taxid] [--rename-tag|-R <NEW_NAME=OLD_NAME>]...
[--require-rank <RANK_NAME>]...
[--restrict-to-taxon|-r <TAXID>]... [--scientific-name]
[--sequence|-s <PATTERN>]... [--set-identifier <EXPRESSION>]
[--set-tag|-S <KEY=EXPRESSION>]... [--silent-warning]
[--skip-empty] [--solexa] [--taxonomic-path] [--taxonomic-rank]
[--taxonomy|-t <string>] [--update-taxid] [--valid-taxid]
[--version] [--with-leaves] [--with-taxon-at-rank <RANK_NAME>]...
[<args>]
Options #
obiannotate specific options #
Identifier modification #
--set-identifier <EXPRESSION>: An expression used to assign the new identifier of the sequence.
Attribute modification #
--clear: Clears all attributes associated with the sequence records.
--delete-tag <KEY>: Deletes the attribute named KEY. When this attribute is missing, the sequence record is skipped (left unchanged) and the next one is examined.
--keep|-k <KEY>: Keeps only the attribute named KEY. Several -k options can be combined.
--rename-tag|-R <NEW_NAME=OLD_NAME>: Renames the attribute OLD_NAME to NEW_NAME. When the attribute named OLD_NAME is missing, the sequence record is skipped (left unchanged) and the next one is examined.
--set-tag|-S <KEY=EXPRESSION>: Creates a new attribute named KEY, set to the value computed from EXPRESSION.
Sequence-related annotation #
--aho-corasick <string>: Adds an aho-corasick attribute with the count of matches of the provided patterns.
--length: Adds an attribute with seq_length as the key and the sequence length as the value.
--pattern <string>: Adds a pattern attribute containing the pattern, a pattern_match attribute indicating the matched sequence, and a pattern_error attribute indicating the number of differences between the pattern and the matched sequence.
--pattern-name <string>: Specifies the name to use as prefix for the attributes reporting the match. (default: “pattern”)
Sequence modification #
--cut <###:###>: A pattern describing how to cut the sequence.
Taxonomy annotation #
--add-lca-in <KEY>: From the taxonomic annotation of the sequence (taxid attribute or merged_taxid attribute), a new attribute named KEY is added with the taxid of the lowest common ancestor corresponding to the current annotation.
--lca-error <#.###>: Error rate tolerated on the taxonomic description when computing the lowest common ancestor. At most a fraction lca-error of the taxonomic information can disagree with the estimated LCA. (default: 0.000000)
--scientific-name: Annotates the sequence with its scientific name.
Taxonomy options #
Check taxids against a taxonomy #
OBITools4 can load a taxonomy database while processing sequence data. When a taxonomy is loaded, the command checks the validity of the taxids during processing. Three cases can occur:
- The taxon is valid.
- The taxon is no longer valid, but a new one replaces it.
- The taxon is no longer valid, and no new taxid exists to replace it.
In the first case, the taxid is rewritten in its long form:
TAXCOD:TAXID [SCIENTIFIC NAME]@RANK
For example, with the NCBI taxonomy, the human taxid looks like:
taxon:9606 [Homo sapiens]@species
That rewriting does not happen if the --raw-taxid option is set. In that case, only the raw taxid is kept:
9606
In the second case, a warning message is logged to the standard error. If the --update-taxid option is set, the command updates the expired taxid to its new equivalent, and the valid-taxon rules apply. Otherwise, the old taxid is kept in the result.
In the last case, a warning message is also issued to the standard error, and the non-valid taxid is kept as is. If the --fail-on-taxonomy option is set, the command stops and exits with an error at the first non-valid taxid encountered in the input data.
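When post-processing annotated files, the long taxid form TAXCOD:TAXID [SCIENTIFIC NAME]@RANK described above can be split back into its components. The regular expression below is a sketch written from that textual description only; the names Taxid and parse_long_taxid are hypothetical, and the pattern has not been checked against every value OBITools4 may produce.

```python
import re
from typing import NamedTuple, Optional


class Taxid(NamedTuple):
    code: str    # taxonomy code, e.g. "taxon"
    taxid: str   # raw taxid, e.g. "9606"
    name: str    # scientific name
    rank: str    # taxonomic rank


# TAXCOD:TAXID [SCIENTIFIC NAME]@RANK
_LONG_FORM = re.compile(
    r"^(?P<code>[^:\s]+):(?P<taxid>[^\s\[]+) \[(?P<name>.*)\]@(?P<rank>.+)$"
)


def parse_long_taxid(text: str) -> Optional[Taxid]:
    """Return the components of a long-form taxid, or None if the text
    is not in that form (for instance a raw taxid such as 9606)."""
    match = _LONG_FORM.match(text)
    if match is None:
        return None
    return Taxid(match["code"], match["taxid"], match["name"], match["rank"])


print(parse_long_taxid("taxon:9606 [Homo sapiens]@species"))
print(parse_long_taxid("9606"))   # None
```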
--taxonomy|-t <string>: Path to the taxonomic database.
--raw-taxid: Displays the raw taxid for each displayed taxon. (default: false)
--update-taxid: Makes obitools automatically update the taxids that are declared merged to a newer one. (default: false)
--fail-on-taxonomy: Makes obitools fail with an error if a used taxid is not a currently valid one. (default: false)
--taxonomic-rank: Annotates the sequence with its taxonomic rank.
--taxonomic-path: Annotates the sequence with its taxonomic path.
--with-taxon-at-rank <RANK_NAME>: Adds a taxonomic annotation at the taxonomic rank RANK_NAME.
Selecting sequence records #
Selection based on the sequence #
Strict matching #
--sequence|-s <PATTERN>: A regular expression pattern used to match the sequence. Only the entries whose sequence matches the pattern are kept. Regular expression patterns are case-insensitive.
Approximate matching #
--approx-pattern <PATTERN>: A DNA pattern used to match the sequence. Only the entries whose sequence matches the pattern are kept. DNA patterns are case-insensitive. They can be matched allowing for errors: mismatches, insertions or deletions.
--allows-indels: Allows for indels during DNA pattern matching (see the --approx-pattern option).
--pattern-error <INTEGER>: Maximum number of errors allowed when searching for patterns in DNA (default: 0, see the --approx-pattern option).
Selection based on the sequence identifier #
--identifier|-I <REGEX>: Regular expression pattern to be tested against the sequence identifier. The pattern is case-insensitive.
--id-list <FILENAME>: Points to a text file containing the list of sequence record identifiers to be selected. The file format consists of a single identifier per line.
Selection based on the sequence definition #
--definition|-D <REGEX>: Regular expression pattern to be tested against the sequence definition. The pattern is case-insensitive.
Selection based on the sequence properties #
--min-count|-c <COUNT>: Selects the sequence records for which the number of occurrences (i.e. the count attribute) is equal to or greater than the defined minimum count.
--max-count|-C <COUNT>: Selects the sequence records for which the number of occurrences (i.e. the count attribute) is equal to or smaller than the defined maximum count.
--min-length|-l <LENGTH>: Selects the sequence records for which the sequence length is equal to or greater than the defined minimum sequence length.
--max-length|-L <LENGTH>: Selects the sequence records for which the sequence length is equal to or less than the defined maximum sequence length.
Controlling the input data #
OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step.
The file format options #
--fasta: Indicates that sequence data is in fasta format.
--fastq: Indicates that sequence data is in fastq format.
--embl: Indicates that sequence data is in EMBL-ENA flatfile format.
--csv: Indicates that sequence data is in CSV format.
--genbank: Indicates that sequence data is in GenBank flatfile format.
--ecopcr: Indicates that sequence data is in the old ecoPCR tabulated format.
Controlling the way OBITools4 are formatting annotations #
These options only apply to the FASTA and FASTQ formats.
--input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
--input-json-header: FASTA/FASTQ title line annotations follow the JSON format.
Controlling quality score decoding #
This option only applies to the FASTQ format.
--solexa: Decodes the quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)
Controlling the output data #
--compress|-Z: Output is compressed using gzip. (default: false)
--no-order: By default, OBITools ensures that the order of the sequences is the same in the input and output files, and when multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation.
--fasta-output: Writes sequence data in fasta format (default if quality data is not available).
--fastq-output: Writes sequence data in fastq format (default if quality data is available).
--json-output: Writes sequence data in JSON format.
--out|-o <FILENAME>: Filename used for saving the output (default: “-”, the standard output).
--output-OBI-header|-O: Writes output FASTA/FASTQ title line annotations in OBI format (default: JSON).
--output-json-header: Writes output FASTA/FASTQ title line annotations in JSON format (the default format).
--skip-empty: Sequences of length equal to zero are removed from the output (default: false).
--no-progressbar: Deactivates progress bar display (default: false).
General options #
--help|-h|-?: Shows this help.
--version: Prints the version and exits.
--silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.
Computation related options #
--max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
--force-one-cpu: Forces the use of a single CPU core for parallel processing (default: false).
--batch-size <INTEGER>: Number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE)
Debug related options #
--debug: Enables debug mode, by setting the log level to debug (default: false, env: OBIDEBUG)
--pprof: Enables the pprof server. Look at the log for details. (default: false).
--pprof-mutex <INTEGER>: Enables profiling of mutex locks. (default: 10, env: OBIPPROFMUTEX)
--pprof-goroutine <INTEGER>: Enables profiling of goroutine blocking. (default: 6060, env: OBIPPROFGOROUTINE)
Examples #
obiannotate --help