obiannotate: edit sequence annotations #
Description #
obiannotate is a tool for editing sequence records in a dataset. It enables you to add, delete or modify annotations, as well as edit identifiers, definitions and sequences.
There are two particularly important groups of options in obiannotate. The first group, which is shared with obigrep, is used to select sequences. The second group specifies the changes to be made to the selected sequence records. In obigrep, the selection options determine which sequences the program will retain in its output. By contrast, obiannotate includes every sequence occurring in the input dataset in the output file; however, only the sequences selected by the selection options are modified according to the editing options. Non-selected sequences are transferred to the result without modification.
The selection options #
They correspond to the selection options described in the obigrep documentation.
The editing options #
Editing the annotations #
OBITools4 stores the annotations attached to each sequence using a tag/value system. The annotation of a sequence is a set of tags, each of which is associated with a value. Annotating a sequence therefore means changing this set of tags by adding new tags, deleting existing ones, or changing the value associated with a tag.
Adding annotations #
To add a new tag/value pair to a sequence, obiannotate provides the generic option --set-tag.
Consider the following file, empty.fasta:
>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1
tagctagctagctagctagctagctagcta
>seqA2
gtagctagctagctagctagctagctaga
>seqC1
cgatgctccatgctagtgctagtcgatga
>seqB2
cgatggctccatgctagtgctagtcgatga
To add a foo tag with the numeric value 3 to each sequence, the command is:
obiannotate --set-tag foo=3 empty.fasta
>seqA1 {"foo":3}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"foo":3}
tagctagctagctagctagctagctagcta
>seqA2 {"foo":3}
gtagctagctagctagctagctagctaga
>seqC1 {"foo":3}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"foo":3}
cgatggctccatgctagtgctagtcgatga
The argument of the --set-tag option, foo=3, can be decomposed into two parts separated by an equal sign.
The left part, foo, is the name of the target tag, and the right part is the value to be assigned to the tag.
The left part must be a string. The right part is actually an expression written in the OBITools4 expression language. Here the expression is a simple 3, which evaluates to the integer value 3.
To assign a string value to a tag, the right-hand side of the option argument must be a valid string in the OBITools4 expression language. For example, the text bar must be written as "bar", with double quotation marks flanking the text to be assigned. However, to prevent the Bash UNIX shell from interpreting the quotation marks, the option value must be protected by a single quotation mark on each side: 'foo="bar"'.
obiannotate --set-tag 'foo="bar"' empty.fasta
>seqA1 {"foo":"bar"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"foo":"bar"}
tagctagctagctagctagctagctagcta
>seqA2 {"foo":"bar"}
gtagctagctagctagctagctagctaga
>seqC1 {"foo":"bar"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"foo":"bar"}
cgatggctccatgctagtgctagtcgatga
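The effect of the single quotation marks can be checked without running obiannotate, by printing an argument exactly as the shell hands it to a program. The following is a minimal sketch; show_arg is a hypothetical helper introduced only for this illustration.

```shell
# show_arg is a hypothetical helper: it prints its first argument
# exactly as the shell passes it to a program.
show_arg() {
    printf '%s\n' "$1"
}

# Unprotected: the shell strips the double quotes before the program runs.
show_arg foo="bar"      # prints: foo=bar

# Protected by single quotes: the double quotes reach the program intact.
show_arg 'foo="bar"'    # prints: foo="bar"
```

In the unprotected form the right-hand side arrives as bar without quotation marks, so it is no longer written as a string.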
As the right part is an expression, it can be more complex and perform some basic computations. In the next example the foo tag is assigned a value based on the sequence identifier prefixed by "bar-".
obiannotate --set-tag 'foo="bar-" + sequence.Id()' empty.fasta
>seqA1 {"foo":"bar-seqA1"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"foo":"bar-seqB1"}
tagctagctagctagctagctagctagcta
>seqA2 {"foo":"bar-seqA2"}
gtagctagctagctagctagctagctaga
>seqC1 {"foo":"bar-seqC1"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"foo":"bar-seqB2"}
cgatggctccatgctagtgctagtcgatga
The complete description of the OBITools4 expression language is available here.
All the previous examples tag every sequence in the same way, but you can also use obiannotate to modify the annotation of only a subset of the sequences. As explained in the introduction of this documentation, this is achieved by combining selection and editing options.
For instance, adding a foo tag only to the sequence with the id seqA2 is achieved by combining the selection option -I seqA2 and the editing option --set-tag 'foo="bar"':
obiannotate -I seqA2 --set-tag 'foo="bar"' empty.fasta
>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1
tagctagctagctagctagctagctagcta
>seqA2 {"foo":"bar"}
gtagctagctagctagctagctagctaga
>seqC1
cgatgctccatgctagtgctagtcgatga
>seqB2
cgatggctccatgctagtgctagtcgatga
Used with obigrep, the -I seqA2 option would have selected only the modified sequence.
obigrep -I seqA2 empty.fasta
>seqA2
gtagctagctagctagctagctagctaga
As the selection options are shared between obiannotate and obigrep, a good method of checking which sequences will be modified by obiannotate is to first check the selection options with obigrep. Only sequences present in the obigrep output will be edited by obiannotate.
obigrep -l 30 empty.fasta
>seqB1
tagctagctagctagctagctagctagcta
>seqB2
cgatggctccatgctagtgctagtcgatga
obiannotate -l 30 \
--set-tag 'foo="bar-" + sequence.Id()' \
empty.fasta
>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1 {"foo":"bar-seqB1"}
tagctagctagctagctagctagctagcta
>seqA2
gtagctagctagctagctagctagctaga
>seqC1
cgatgctccatgctagtgctagtcgatga
>seqB2 {"foo":"bar-seqB2"}
cgatggctccatgctagtgctagtcgatga
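The difference between the two commands can be summarised with a toy model written with awk. This is only a sketch of the selection logic, not the OBITools4 implementation: it assumes a FASTA file with one sequence line per record and no existing annotations, and the function names and the selected tag are hypothetical.

```shell
# grep_like MIN FILE: keep only the records whose sequence length is >= MIN
# (the behaviour described for obigrep -l MIN).
grep_like() {
    awk -v min="$1" '
        /^>/ { header = $0; next }
        length($0) >= min { print header; print $0 }
    ' "$2"
}

# annotate_like MIN FILE: keep every record, but tag only the selected ones
# (the behaviour described for obiannotate -l MIN with an editing option).
annotate_like() {
    awk -v min="$1" '
        /^>/ { header = $0; next }
        {
            if (length($0) >= min)
                header = header " {\"selected\":true}"
            print header
            print $0
        }
    ' "$2"
}
```

With the same threshold, grep_like outputs a subset of the records, while annotate_like outputs all of them and modifies exactly that subset.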
Renaming tags #
Renaming tags can be useful when accounting for changes in a pipeline, e.g. adapting old datasets to new scripts. It can also be useful for saving annotations produced by an OBITools command before rerunning it with different parameters. Consider the following FASTA file, five_tags.fasta:
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga
If you want to keep the taxonomic annotations as a reference before running the obitag command to produce new ones, so that you can compare the new annotations with the old ones later, you can rename the taxid tag to ref_taxid and then run the obitag command. This will set a new taxid tag.
obiannotate --rename-tag ref_taxid=taxid five_tags.fasta
>seqA1 {"count":1,"ref_taxid":"taxon:9606 [Homo sapiens]@species","tata":"bar","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"ref_taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","tata":"bar","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"ref_taxid":"taxon:9605 [Homo]@genus","tata":"foo","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"ref_taxid":"taxon:9604 [Hominidae]@family","tata":"foo","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga
Adding a serial number to each sequence #
Adding a serial number to each sequence can be useful. This can be done using the obiannotate command with the --number option. This option adds a new tag to each sequence with the name seq_number and an integer value that increments for each sequence.
obiannotate --number empty.fasta
>seqA1 {"seq_number":1}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"seq_number":2}
tagctagctagctagctagctagctagcta
>seqA2 {"seq_number":3}
gtagctagctagctagctagctagctaga
>seqC1 {"seq_number":4}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"seq_number":5}
cgatggctccatgctagtgctagtcgatga
Adding sequence related annotations #
- Annotating sequences with their length
The sequence length can be added to the annotation using the --length option, which adds the seq_length tag.
obiannotate --length empty.fasta
>seqA1 {"seq_length":27}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"seq_length":30}
tagctagctagctagctagctagctagcta
>seqA2 {"seq_length":29}
gtagctagctagctagctagctagctaga
>seqC1 {"seq_length":29}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"seq_length":30}
cgatggctccatgctagtgctagtcgatga
- Counting occurrences of a set of patterns
The --aho-corasick option counts the occurrences of a set of patterns stored in a text file, one pattern per line. The patterns are strictly matched against both strands of the DNA sequence using the Aho-Corasick multiple pattern matching algorithm. The option takes as its argument the name of the file containing these patterns.
obiannotate --aho-corasick motifs.txt empty.fasta
>seqA1 {"aho_corasick":2,"aho_corasick_Fwd":2,"aho_corasick_Rev":0}
cgatgctgcatgctagtgctagtcgat
>seqB1
tagctagctagctagctagctagctagcta
>seqA2
gtagctagctagctagctagctagctaga
>seqC1 {"aho_corasick":2,"aho_corasick_Fwd":1,"aho_corasick_Rev":1}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"aho_corasick":2,"aho_corasick_Fwd":1,"aho_corasick_Rev":1}
cgatggctccatgctagtgctagtcgatga
When used with the --aho-corasick option, obiannotate adds the following three tags:
aho_corasick: the total number of matches on the sequence
aho_corasick_Fwd: the number of matches on the forward strand
aho_corasick_Rev: the number of matches on the reverse strand
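The relation between the total, forward and reverse counts can be illustrated with a toy model. The sketch below counts exact occurrences of a single motif by plain scanning; it does not implement the Aho-Corasick algorithm and is not the OBITools4 code. It assumes that a reverse-strand match is a match on the reverse complement of the sequence; count_motif and the motifs used are hypothetical.

```shell
# count_motif SEQ MOTIF prints "total forward reverse" for one lowercase motif.
count_motif() {
    awk -v seq="$1" -v motif="$2" '
        # Return the reverse complement of a lowercase DNA string.
        function revcomp(s,    i, c, out) {
            out = ""
            for (i = length(s); i >= 1; i--) {
                c = substr(s, i, 1)
                if (c == "a") c = "t"
                else if (c == "t") c = "a"
                else if (c == "c") c = "g"
                else if (c == "g") c = "c"
                out = out c
            }
            return out
        }
        # Count the occurrences of m in s, overlapping ones included.
        function count(s, m,    n, i) {
            n = 0
            for (i = 1; i + length(m) - 1 <= length(s); i++)
                if (substr(s, i, length(m)) == m) n++
            return n
        }
        BEGIN {
            fwd = count(seq, motif)
            rev = count(revcomp(seq), motif)
            print fwd + rev, fwd, rev
        }
    '
}

count_motif cgatgctgcatgctagtgctagtcgat gctag   # prints: 2 2 0
count_motif acgt acgt                           # prints: 2 1 1
```

The second call shows that, under this assumption, a motif equal to its own reverse complement is counted once on each strand.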
- Matching a primer against sequences
It is possible to identify sequences that match a primer using the same algorithm as the one used by obipcr or obimultiplex. Four options control this feature:
--pattern <PATTERN>: the primer sequence to be searched. The pattern follows the DNA Pattern grammar, which allows the use of IUPAC DNA codes and the marking of non-mutable positions.
--pattern-error <INT>: the maximum number of errors allowed when matching the primer. Default is 0.
obiannotate --pattern tagctagctcgctagcta \
--pattern-error 3 \
empty.fasta
>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1 {"pattern":"tagctagctcgctagcta","pattern_error":1,"pattern_location":"1..18","pattern_match":"tagctagctagctagcta"}
tagctagctagctagctagctagctagcta
>seqA2 {"pattern":"tagctagctcgctagcta","pattern_error":1,"pattern_location":"2..19","pattern_match":"tagctagctagctagcta"}
gtagctagctagctagctagctagctaga
>seqC1
cgatgctccatgctagtgctagtcgatga
>seqB2
cgatggctccatgctagtgctagtcgatga
--pattern-name <STRING>: the name used as a prefix for the tags reporting the match. Default is pattern.
obiannotate --pattern tagctagctcgctagcta \
--pattern-error 3 \
--pattern-name primer1 \
empty.fasta
>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1 {"primer1_error":1,"primer1_location":"1..18","primer1_match":"tagctagctagctagcta","primer1_pattern":"tagctagctcgctagcta"}
tagctagctagctagctagctagctagcta
>seqA2 {"primer1_error":1,"primer1_location":"2..19","primer1_match":"tagctagctagctagcta","primer1_pattern":"tagctagctcgctagcta"}
gtagctagctagctagctagctagctaga
>seqC1
cgatgctccatgctagtgctagtcgatga
>seqB2
cgatggctccatgctagtgctagtcgatga
--allows-indels: by default the program will not allow indels in patterns, but you can use this option to enable them. When enabled, an error can be a mismatch or an insertion/deletion.
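To make the notion of a match with a bounded number of errors concrete, here is a deliberately simplified sketch. It counts mismatches only, on the forward strand, with no IUPAC codes and no indels, and it reports the first window that fits the error budget. It is a toy model, not the algorithm shared with obipcr and obimultiplex, and find_primer is a hypothetical name.

```shell
# find_primer SEQ PATTERN MAX_ERR prints "start..end errors" for the first
# window with at most MAX_ERR mismatches, or nothing if no window qualifies.
find_primer() {
    awk -v seq="$1" -v pat="$2" -v max="$3" '
        BEGIN {
            max = max + 0                       # force a numeric comparison
            n = length(pat)
            for (start = 1; start + n - 1 <= length(seq); start++) {
                err = 0
                # Compare position by position, stop once the budget is exceeded.
                for (i = 1; i <= n && err <= max; i++)
                    if (substr(seq, start + i - 1, 1) != substr(pat, i, 1))
                        err++
                if (err <= max) {
                    printf "%d..%d %d\n", start, start + n - 1, err
                    exit
                }
            }
        }
    '
}

find_primer tagctagctagctagctagctagctagcta tagctagctcgctagcta 3   # prints: 1..18 1
find_primer gtagctagctagctagctagctagctaga tagctagctcgctagcta 3    # prints: 2..19 1
```

On these two sequences the sketch reports the same location and error count as the pattern_location and pattern_error tags shown in the example output.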
Edit taxonomy related annotations #
--scientific-name
--with-taxon-at-rank <RANK_NAME>
--taxonomic-rank
--taxonomic-path
--raw-taxid
--add-lca-in <SLOT_NAME> --lca-error <#.###>
Deleting annotations #
There are three options for deleting annotations associated with a sequence. The easiest is the --clear option, which removes all annotations associated with a sequence.
Consider the FASTA sequence file five_tags.fasta:
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga
The next command removes all the annotations
obiannotate --clear five_tags.fasta
>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1
tagctagctagctagctagctagctagcta
>seqA2
gtagctagctagctagctagctagctaga
>seqC1
cgatgctccatgctagtgctagtcgatga
>seqB2
cgatggctccatgctagtgctagtcgatga
Combining the -C 10 selection option, which selects all sequences that occur at most ten times, and the --clear option will delete annotations only on the selected sequences. The annotations on other sequences are kept.
obiannotate -C 10 --clear five_tags.fasta
>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1
tagctagctagctagctagctagctagcta
>seqA2
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga
Using the --delete-tag option, it is possible to delete a tag based on its name. In the following example, the taxid tag is deleted. The seqB2 sequence is not affected because it does not have a taxid tag.
obiannotate --delete-tag taxid five_tags.fasta
>seqA1 {"count":1,"tata":"bar","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga
You can use several --delete-tag options in a single obiannotate command.
obiannotate --delete-tag taxid \
--delete-tag count \
five_tags.fasta
>seqA1 {"tata":"bar","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"tata":"foo","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"tata":"foo","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"tata":"bar"}
cgatggctccatgctagtgctagtcgatga
The last method for deleting annotations is indirect. It is based on the --keep option, which indicates which annotation should be kept. Consequently, all the other tags that are not kept are deleted.
obiannotate --keep taxid five_tags.fasta
>seqA1 {"taxid":"taxon:9606 [Homo sapiens]@species"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies"}
tagctagctagctagctagctagctagcta
>seqA2 {"taxid":"taxon:9605 [Homo]@genus"}
gtagctagctagctagctagctagctaga
>seqC1 {"taxid":"taxon:9604 [Hominidae]@family"}
cgatgctccatgctagtgctagtcgatga
>seqB2
cgatggctccatgctagtgctagtcgatga
Similarly to the --delete-tag option, several --keep options can be provided to keep multiple annotations.
obiannotate --keep taxid \
--keep count \
five_tags.fasta
>seqA1 {"count":1,"taxid":"taxon:9606 [Homo sapiens]@species"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"taxid":"taxon:9605 [Homo]@genus"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"taxid":"taxon:9604 [Hominidae]@family"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25}
cgatggctccatgctagtgctagtcgatga
Changing annotation values #
Editing the identifier #
You can update the identifier of a sequence using the --set-id option. One useful application of this option is substituting the long id generated by the sequencer with a new, short id based on a number incremented from sequence to sequence, such as the serial number generated by the --number option. To do so, use two piped obiannotate commands. The first command adds the seq_number annotation to the sequences. Then, the second command updates the sequence id from the newly added seq_number tag.
obiannotate --number empty.fasta \
| obiannotate --set-id 'sprintf("motus_%04d", annotations.seq_number)'
>motus_0001 {"seq_number":1}
cgatgctgcatgctagtgctagtcgat
>motus_0002 {"seq_number":2}
tagctagctagctagctagctagctagcta
>motus_0003 {"seq_number":3}
gtagctagctagctagctagctagctaga
>motus_0004 {"seq_number":4}
cgatgctccatgctagtgctagtcgatga
>motus_0005 {"seq_number":5}
cgatggctccatgctagtgctagtcgatga
The sprintf function in the OBITools4 expression language is used to format sequence identifiers. It requires a format string, "motus_%04d" in this case, which describes how the new identifier will be generated. The %04d in the format string will be replaced by the second argument of the sprintf function, annotations.seq_number. This argument is the number associated with the sequence in the file. The d specifies that the number is a decimal integer, and the 4 specifies that the number will be padded to four digits. The 0 before the 4 specifies that the number will be padded with zeros.
The results of the sprintf function are presented above. The first sequence is identified as motus_0001, the second as motus_0002, and so on.
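The %04d conversion can be previewed with the shell's own printf utility, which accepts the same conversion. This sketch assumes that the sprintf of the expression language follows the usual printf conventions, as the description above suggests; make_id is a hypothetical helper.

```shell
# make_id N: format a serial number the way the "motus_%04d" format describes.
make_id() {
    printf 'motus_%04d\n' "$1"
}

make_id 1       # prints: motus_0001
make_id 42      # prints: motus_0042
make_id 12345   # prints: motus_12345
```

The width of four is a minimum: numbers with more than four digits are printed in full rather than truncated.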
Editing the sequence #
Extracting a fragment of the sequence #
You can extract a fragment of a sequence using the --cut option. This option requires an argument in the form of #:###, where # is the start position and ### is the end position of the fragment. Position numbering is one-based, and the fragment includes the limits.
obiannotate --cut 2:7 five_tags.fasta > five_tags_sub_2_7.fasta
>seqA1_sub[2..7] {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
gatgct
>seqB1_sub[2..7] {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
agctag
>seqA2_sub[2..7] {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
tagcta
>seqC1_sub[2..7] {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
gatgct
>seqB2_sub[2..7] {"count":25,"tata":"bar"}
gatggc
If # is absent the fragment extracted starts from the beginning of the sequence.
obiannotate --cut :7 five_tags.fasta
>seqA1_sub[1..7] {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgct
>seqB1_sub[1..7] {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctag
>seqA2_sub[1..7] {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagcta
>seqC1_sub[1..7] {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgct
>seqB2_sub[1..7] {"count":25,"tata":"bar"}
cgatggc
If ### is absent the fragment extracted ends at the end of the sequence.
obiannotate --cut 2: five_tags.fasta
>seqA1_sub[2..27] {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
gatgctgcatgctagtgctagtcgat
>seqB1_sub[2..30] {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
agctagctagctagctagctagctagcta
>seqA2_sub[2..29] {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
tagctagctagctagctagctagctaga
>seqC1_sub[2..29] {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
gatgctccatgctagtgctagtcgatga
>seqB2_sub[2..30] {"count":25,"tata":"bar"}
gatggctccatgctagtgctagtcgatga
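For positive coordinates, the one-based, inclusive convention is the same as that of the POSIX cut utility, which makes it easy to check what a given range should return. The sketch below works on a bare sequence string rather than on a FASTA file; subseq is a hypothetical helper and not part of OBITools4.

```shell
# subseq SEQ RANGE: extract characters using one-based, inclusive positions.
# RANGE uses the cut syntax: 2-7, -7 (from the start) or 2- (to the end).
subseq() {
    printf '%s\n' "$1" | cut -c"$2"
}

subseq cgatgctgcatgctagtgctagtcgat 2-7   # like --cut 2:7, prints: gatgct
subseq cgatgctgcatgctagtgctagtcgat -7    # like --cut :7,  prints: cgatgct
subseq cgatgctgcatgctagtgctagtcgat 2-    # like --cut 2:,  prints: gatgctgcatgctagtgctagtcgat
```

The three results match the seqA1 fragments shown in the outputs above.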
Following Python usage, negative coordinates are counted from the end of the sequence: -1 is the last position of the sequence, -2 is the second-to-last position, and so on.
obiannotate --cut='-7:-2' five_tags.fasta
>seqA1_sub[22..26] {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
gtcga
>seqB1_sub[25..29] {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagct
>seqA2_sub[24..28] {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gctag
>seqC1_sub[24..28] {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatg
>seqB2_sub[25..29] {"count":25,"tata":"bar"}
cgatg
When using negative coordinates, as in the above command, the value starts with a dash and could be mistaken for another option. To avoid this, the option has to be written followed by the = sign, without any space between the option and the value: --cut='-7:-2'
Editing the sequence itself #
The nucleic sequence of a sequence record is considered by OBITools as a special annotation tag named sequence. Therefore, it is possible to edit the sequence itself by using the obiannotate command with the --set-tag option.
obiannotate --set-tag sequence='"acgtacgt"' five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
acgtacgt
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
acgtacgt
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
acgtacgt
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
acgtacgt
>seqB2 {"count":25,"tata":"bar"}
acgtacgt
As for the other tags, the --set-tag option requires an expression written in the OBITools4 expression language and returning a string.
Synopsis #
obiannotate [--add-lca-in <SLOT_NAME>] [--aho-corasick <string>]
[--allows-indels] [--approx-pattern <PATTERN>]...
[--attribute|-a <KEY=VALUE>]... [--batch-size <int>] [--clear]
[--compress|-Z] [--csv] [--cut <###:###>] [--debug]
[--definition|-D <PATTERN>]... [--delete-tag <KEY>]... [--ecopcr]
[--embl] [--fail-on-taxonomy] [--fasta] [--fasta-output]
[--fastq] [--fastq-output] [--force-one-cpu] [--genbank]
[--has-attribute|-A <KEY>]... [--help|-h|-?]
[--id-list <FILENAME>] [--identifier|-I <PATTERN>]...
[--ignore-taxon|-i <TAXID>]... [--input-OBI-header]
[--input-json-header] [--inverse-match|-v] [--json-output]
[--keep|-k <KEY>]... [--lca-error <#.###>] [--length]
[--max-count|-C <COUNT>] [--max-cpu <int>]
[--max-length|-L <LENGTH>] [--min-count|-c <COUNT>]
[--min-length|-l <LENGTH>] [--no-order] [--no-progressbar]
[--number] [--only-forward] [--out|-o <FILENAME>]
[--output-OBI-header|-O] [--output-json-header]
[--paired-mode <forward|reverse|and|or|andnot|xor>]
[--pattern <string>] [--pattern-error <int>]
[--pattern-name <string>] [--pprof] [--pprof-goroutine <int>]
[--pprof-mutex <int>] [--predicate|-p <EXPRESSION>]...
[--raw-taxid] [--rename-tag|-R <NEW_NAME=OLD_NAME>]...
[--require-rank <RANK_NAME>]...
[--restrict-to-taxon|-r <TAXID>]... [--scientific-name]
[--sequence|-s <PATTERN>]... [--set-identifier <EXPRESSION>]
[--set-tag|-S <KEY=EXPRESSION>]... [--silent-warning]
[--skip-empty] [--solexa] [--taxonomic-path] [--taxonomic-rank]
[--taxonomy|-t <string>] [--u-to-t] [--update-taxid]
[--valid-taxid] [--version] [--with-leaves]
[--with-taxon-at-rank <RANK_NAME>]... [<args>]
Options #
obiannotate specific options #
Identifier modification #
--set-identifier <EXPRESSION>: An expression used to assign the new id of the sequence.
Attribute modification #
--clear: Clears all attributes associated with the sequence records.
--delete-tag <KEY>: Deletes the attribute named KEY. When this attribute is missing, the sequence record is skipped and the next one is examined.
--keep|-k <KEY>: Keeps only the attribute named KEY. Several -k options can be combined.
--rename-tag|-R <NEW_NAME=OLD_NAME>: Changes the attribute name OLD_NAME to NEW_NAME. When the attribute named OLD_NAME is missing, the sequence record is skipped and the next one is examined.
--set-tag|-S <KEY=EXPRESSION>: Creates a new attribute named KEY, set to a value computed from EXPRESSION.
Sequence-related annotation #
--aho-corasick <string>: Adds an aho_corasick attribute with the count of matches of the provided patterns.
--length: Adds an attribute with seq_length as the key and the sequence length as the value.
--pattern <string>: Adds a pattern attribute containing the pattern, a pattern_match attribute indicating the matched sequence, and a pattern_error attribute indicating the number of differences between the pattern and the match on the sequence.
--pattern-name <string>: Specifies the name to use as a prefix for the attributes reporting the match. (default: "pattern")
Sequence modification #
--cut <###:###>: A pattern describing how to cut the sequence.
Taxonomy annotation #
--add-lca-in <KEY>: From the taxonomic annotation of the sequence (taxid attribute or merged_taxid attribute), a new attribute named KEY is added with the taxid of the lowest common ancestor corresponding to the current annotation.
--lca-error <#.###>: Error rate tolerated on the taxonomic description when computing the lowest common ancestor. At most a fraction lca-error of the taxonomic information can disagree with the estimated LCA. (default: 0.000000)
--scientific-name: Annotates the sequence with its scientific name.
Taxonomy options #
Check taxids against a taxonomy #
OBITools4 can load a taxonomy database while processing sequence data. When a taxonomy is loaded, the command checks the validity of the taxids during processing. Three cases can occur:
- The taxon is valid.
- The taxon is no longer valid, but a new one replaces it.
- The taxon is no longer valid, and no new taxid exists to replace it.
In the first case, the taxid is rewritten in the long form:
TAXCOD:TAXID [SCIENTIFIC NAME]@RANK
For example, with the NCBI taxonomy, the human taxid looks like:
taxon:9606 [Homo sapiens]@species
That rewriting doesn't happen if the --raw-taxid option is set.
In that case only the raw taxid is conserved.
9606
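When post-processing annotated files outside OBITools4, the long form can be split back into its parts with sed. This is a small sketch that assumes the TAXCOD:TAXID [SCIENTIFIC NAME]@RANK layout shown above; raw_taxid and taxid_rank are hypothetical helpers, not OBITools4 commands.

```shell
# raw_taxid LONG_FORM: extract the TAXID field, between the colon and the space.
raw_taxid() {
    printf '%s\n' "$1" | sed 's/^[^:]*:\([^ ]*\) .*$/\1/'
}

# taxid_rank LONG_FORM: extract the RANK field, after the last @.
taxid_rank() {
    printf '%s\n' "$1" | sed 's/^.*@//'
}

raw_taxid 'taxon:9606 [Homo sapiens]@species'    # prints: 9606
taxid_rank 'taxon:9606 [Homo sapiens]@species'   # prints: species
```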
In the second case, a warning message is logged on the standard error. If the --update-taxid option is set, the command updates the expired taxid to the new equivalent one, and the valid taxon rules apply. Otherwise, the old taxid is maintained in the result.
In the last case, a warning message is also issued on the standard error, and the non-valid taxid is kept as is. If the --fail-on-taxonomy option is set, the command stops and exits with an error at the first non-valid taxid encountered in the input data.
--taxonomy|-t <string>: Path to the taxonomic database.
--raw-taxid: Displays the raw taxid for each displayed taxon. (default: false)
--update-taxid: Makes obitools automatically update the taxids that are declared merged to a newer one (default: false).
--fail-on-taxonomy: Makes obitools fail with an error if a taxid in use is not currently valid (default: false).
--taxonomic-rank: Annotates the sequence with its taxonomic rank.
--taxonomic-path: Annotates the sequence with its taxonomic path.
--with-taxon-at-rank <RANK_NAME>: Adds taxonomic annotation at taxonomic rank RANK_NAME.
Selecting sequence records #
Selection based on the sequence #
Strict matching #
--sequence|-s <PATTERN>: A regular expression pattern used to match the sequence. Only the entries whose sequence matches the pattern are kept. Regular expression patterns are case-insensitive.
Approximate matching #
--approx-pattern <PATTERN>: A DNA pattern used to match the sequence. Only the entries whose sequence matches the pattern are kept. DNA patterns are case-insensitive. They can be matched allowing for errors: mismatches, insertions or deletions.
--allows-indels: allows for indels during DNA pattern matching (see the --approx-pattern option).
--pattern-error <INTEGER>: maximum number of errors allowed when searching for patterns in DNA (default 0, see the --approx-pattern option).
Selection based on the sequence identifier #
--identifier|-I <REGEX>: Regular expression pattern to be tested against the sequence identifier. The pattern is case-insensitive.
--id-list <FILENAME>: points to a text file containing the list of sequence record identifiers to be selected. The file format consists of a single identifier per line.
Selection based on the sequence definition #
--definition|-D <REGEX>: Regular expression pattern to be tested against the sequence definition. The pattern is case-insensitive.
Selection based on the sequence properties #
--min-count|-c <COUNT>: selects the sequence records for which the number of occurrences (i.e. the count attribute) is equal to or greater than the defined minimum count.
--max-count|-C <COUNT>: selects the sequence records for which the number of occurrences (i.e. the count attribute) is equal to or smaller than the defined maximum count.
--min-length|-l <LENGTH>: selects the sequence records for which the sequence length is equal to or greater than the defined minimum sequence length.
--max-length|-L <LENGTH>: selects the sequence records for which the sequence length is equal to or less than the defined maximum sequence length.
Controlling the input data #
OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step.
The file format options #
--fasta: indicates that sequence data is in fasta format.
--fastq: indicates that sequence data is in fastq format.
--embl: indicates that sequence data is in EMBL-ENA flatfile format.
--csv: indicates that sequence data is in CSV format.
--genbank: indicates that sequence data is in GenBank flatfile format.
--ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.
Controlling the way OBITools4 formats annotations #
These options only apply to the FASTA and FASTQ formats.
--input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
--input-json-header: FASTA/FASTQ title line annotations follow the JSON format.
Controlling quality score decoding #
This option only applies to the FASTQ format.
--solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)
Controlling the output data #
--compress|-Z: output is compressed using gzip. (default: false)
--no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation.
--fasta-output: writes sequence data in fasta format (default if quality data is not available).
--fastq-output: writes sequence data in fastq format (default if quality data is available).
--json-output: writes sequence data in JSON format.
--out|-o <FILENAME>: filename used for saving the output (default: "-", the standard output)
--output-OBI-header|-O: writes output FASTA/FASTQ title line annotations in OBI format (default: JSON).
--output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format).
--skip-empty: sequences of length equal to zero are removed from the output (default: false).
--no-progressbar: deactivates progress bar display (default: false).
General options #
--help|-h|-?: shows this help.
--version: prints the version and exits.
--silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.
Computation related options #
--max-cpu <INTEGER>: OBITools can take advantage of your computer's multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
--force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
--batch-size <INTEGER>: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE)
Debug related options #
--debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
--pprof: enables pprof server. Look at the log for details. (default: false).
--pprof-mutex <INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
--pprof-goroutine <INTEGER>: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)
Examples #
obiannotate --help