obigrep: filter a sequence file #
Description #
obigrep is a tool for selecting a subset of sequences based on a set of criteria. Sequences from the input dataset that match all the criteria are retained and printed in the result, while other sequences are discarded.
Selection criteria can be based on different aspects of the sequence data, such as:
- The sequence identifier (ID)
- The sequence annotations
- The sequence itself
Selection based on sequence identifier (ID) #
There are two ways of selecting sequences according to their identifier:
- Using a regular pattern with option -I
- Using a list of identifiers (IDs) provided in a file with option --id-list
On the following five-sequence sample file:
📄 five_ids.fasta
>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1
tagctagctagctagctagctagctagcta
>seqA2
gtagctagctagctagctagctagctaga
>seqC1
cgatgctgcatgctagtgctagtcgatga
>seqB2
tagctagctagctagctagctagctagcta
To select sequences with IDs “seqA1” and “seqB1”, you can use the command
obigrep -I '^seq[AB]1$' five_ids.fasta
>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1
tagctagctagctagctagctagctagcta
The regular pattern ^seq[AB]1$ can be read as follows:
- ^ at the beginning means that the identifier must start with the pattern
- seq is an exact match for that string
- [AB] means any character in the set {A, B}
- 1 is an exact match for that character
- $ at the end of the pattern means that the identifier must end with the pattern.
If the starting ^ had been omitted, the pattern would have matched any sequence ID containing "seq" followed by a character from the set {A, B} and ending with "1"; for example, the IDs my_seqA1 or my_seqB1 would have been selected.
If the ending $ had been omitted, the pattern would have matched any sequence ID starting with "seq" followed by a character from the set {A, B} and then "1", whatever comes after; for example, the IDs seqA102 or seqB1023456789 would have been selected.
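The effect of the two anchors can be checked with a minimal sketch. It uses Python's re module purely as a stand-in (obigrep's own regular expression engine may differ in details), and the identifiers my_seqA1 and seqA102 are the hypothetical extra IDs mentioned above, not part of five_ids.fasta.

```python
import re

# The five IDs of five_ids.fasta plus two hypothetical IDs from the text.
ids = ["seqA1", "seqB1", "seqA2", "seqC1", "seqB2", "my_seqA1", "seqA102"]


def select(pattern, identifiers):
    """Keep the identifiers in which the pattern is found anywhere."""
    rx = re.compile(pattern)
    return [i for i in identifiers if rx.search(i)]


anchored = select(r"^seq[AB]1$", ids)   # both anchors
no_start = select(r"seq[AB]1$", ids)    # starting ^ omitted
no_end = select(r"^seq[AB]1", ids)      # ending $ omitted

print(anchored)
print(no_start)
print(no_end)
```

Only the fully anchored pattern restricts the selection to seqA1 and seqB1; each of the other two patterns lets one of the extra identifiers through.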
Another solution to extract these sequences would be to use a text file containing their IDs, one per line, as follows:
📄 seqAB.txt
seqA1
seqB1
This seqAB.txt file can then be used as an index file by obigrep:
obigrep --id-list seqAB.txt five_ids.fasta
>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1
tagctagctagctagctagctagctagcta
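As a rough illustration of what selection by an identifier list amounts to, the sketch below reads FASTA text and keeps the records whose ID belongs to a set. The helper read_fasta is a hypothetical minimal parser written for this example; it is not how obigrep is implemented.

```python
FIVE_IDS = """>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1
tagctagctagctagctagctagctagcta
>seqA2
gtagctagctagctagctagctagctaga
>seqC1
cgatgctgcatgctagtgctagtcgatga
>seqB2
tagctagctagctagctagctagctagcta
"""


def read_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    seq_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # The identifier is the first word of the header line.
            seq_id, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


# Same content as a one-ID-per-line file listing seqA1 and seqB1.
wanted = {"seqA1", "seqB1"}
selected = [(i, s) for i, s in read_fasta(FIVE_IDS) if i in wanted]

for seq_id, seq in selected:
    print(f">{seq_id}\n{seq}")
```

Using a set makes each membership test constant time, which matters when the ID list is long.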
Selection based on sequence definition #
Each sequence record can have a definition describing the sequence. In fasta or fastq format, this definition is found in the header of each sequence record, starting at the second word (the first word being the sequence id), or after the annotations between braces in the OBITools4 extended version of these formats.
📄 three_def.fasta
>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1 my beautiful sequence
tagctagctagctagctagctagctagcta
>seqA2 {"count":10} my pretty sequence
gtagctagctagctagctagctagctaga
In the three_def.fasta example file:
- seqA1 has no definition
- seqB1 definition is my beautiful sequence
- seqA2 definition is my pretty sequence
The -D or --definition option lets you specify a regular pattern to select only those sequences whose definition matches the pattern. The example below selects sequences whose definition contains the word pretty.
obigrep -D pretty three_def.fasta
>seqA2 {"count":10,"definition":"my pretty sequence"}
gtagctagctagctagctagctagctaga
As you can see in the result, OBITools4 commands store the definition present in the original file as an annotation tag called definition. It is actually this tag that is tested by the -D option.
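The header layout described in this section (identifier, optional annotations between braces, optional definition) can be sketched as follows. The function parse_header is a hypothetical helper written for this illustration, assuming JSON annotations as in the examples; it is not the parser used by OBITools4.

```python
import json
import re


def parse_header(header):
    """Split a header (without '>') into (identifier, annotations).

    A free-text definition, if any, is stored under the 'definition' key,
    mirroring the output shown above.
    """
    parts = header.split(maxsplit=1)
    seq_id = parts[0]
    rest = parts[1].strip() if len(parts) > 1 else ""
    annotations = {}
    if rest.startswith("{"):
        # raw_decode returns the parsed object and the index where it ends.
        annotations, end = json.JSONDecoder().raw_decode(rest)
        rest = rest[end:].strip()
    if rest:
        annotations["definition"] = rest
    return seq_id, annotations


headers = [
    'seqA1',
    'seqB1 my beautiful sequence',
    'seqA2 {"count":10} my pretty sequence',
]
parsed = [parse_header(h) for h in headers]

# Rough equivalent of a definition filter on the word "pretty".
pretty = [i for i, a in parsed if re.search("pretty", a.get("definition", ""))]
print(pretty)
```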
Selection based on the annotations #
Selection based on any annotation #
The obigrep tool can also be used to select sequences based on their annotations. Annotations consist of all the tags and values added to each sequence header in the fasta/fastq file. For instance, consider a sequence file, named five_tags.fasta in the following examples, with these headers:
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga
Selecting sequences having a tag whatever its value #
The -A option selects sequences having the given attribute, whatever its value. In the following example, all the sequences having the count attribute are selected.
obigrep -A "count" five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga
Only four sequences are retained; the sequence seqB1 is excluded because it does not have the count tag.
Selecting sequences having a tag with a specific value #
The -a option selects sequences in which the given attribute is set to a value matching the provided regular pattern. In the following example, only the sequence seqA1, whose toto attribute contains the value titi, is selected.
obigrep -a toto="titi" five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
As the value is a regular pattern, it is possible to be less strict. For example, the following command selects all sequences whose toto attribute has a value beginning with t (^ at the start of the expression).
obigrep -a toto="^t" five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
The sequence seqC1 is excluded because its toto attribute contains the value foo, which does not begin with t, while seqB2 is excluded because it does not have a toto attribute.
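The difference between testing for the presence of a tag and testing its value against a pattern can be sketched as below. The dictionaries reproduce the count, tata and toto tags of five_tags.fasta (taxid is left out for brevity); the two helper functions are hypothetical, and Python's re module stands in for obigrep's pattern engine.

```python
import re

records = {
    "seqA1": {"count": 1, "tata": "bar", "toto": "titi"},
    "seqB1": {"tata": "bar", "toto": "tata"},
    "seqA2": {"count": 5, "tata": "foo", "toto": "tutu"},
    "seqC1": {"count": 15, "tata": "foo", "toto": "foo"},
    "seqB2": {"count": 25, "tata": "bar"},
}


def has_attribute(annotations, key):
    """Presence test: true whatever the value of the tag."""
    return key in annotations


def attribute_matches(annotations, key, pattern):
    """Value test: the tag must exist and its value must match the pattern."""
    if key not in annotations:
        return False
    return re.search(pattern, str(annotations[key])) is not None


with_count = [n for n, a in records.items() if has_attribute(a, "count")]
toto_titi = [n for n, a in records.items() if attribute_matches(a, "toto", "titi")]
toto_t = [n for n, a in records.items() if attribute_matches(a, "toto", "^t")]

print(with_count)
print(toto_titi)
print(toto_t)
```

A record lacking the tag fails the value test outright, which is why seqB2 never appears in the toto selections.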
Selection based on the sequence abundances #
In amplicon sequencing experiments, a sequence may be observed many times. The obiuniq command can be used to dereplicate strictly identical sequences. The number of strictly identical sequence reads merged into a single sequence record is stored in the count annotation tag of that sequence record. It is common to filter out sequences that are too rare or too abundant, depending on the purpose of the experiment. There are two ways to select sequence records based on this count tag:
- The --min-count or -c option, followed by a numeric argument, selects sequence records with a count greater than or equal to that argument.
- The --max-count or -C option, followed by a numeric argument, selects sequence records with a count less than or equal to that argument.
If the count tag is missing from a sequence record, it is assumed to be equal to 1.
obigrep -c 2 five_tags.fasta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga
This command removes singleton sequences (sequences observed only once): here seqA1, which has a count tag equal to 1, and seqB1, which has no count tag defined.
The next command excludes from its results all the sequences occurring more than ten times.
obigrep -C 10 five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
As usual, both options can be combined:
obigrep -c 2 -C 10 five_tags.fasta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
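The two inclusive bounds and the default count of 1 can be summarized in a minimal sketch. The function keep_by_count is a hypothetical helper; the counts are those of five_tags.fasta, with seqB1 deliberately lacking the tag.

```python
counts = {
    "seqA1": {"count": 1},
    "seqB1": {},               # no count tag: treated as 1
    "seqA2": {"count": 5},
    "seqC1": {"count": 15},
    "seqB2": {"count": 25},
}


def keep_by_count(annotations, min_count=None, max_count=None):
    """Apply inclusive lower and upper bounds to the count tag."""
    count = annotations.get("count", 1)
    if min_count is not None and count < min_count:
        return False
    if max_count is not None and count > max_count:
        return False
    return True


at_least_2 = [n for n, a in counts.items() if keep_by_count(a, min_count=2)]
at_most_10 = [n for n, a in counts.items() if keep_by_count(a, max_count=10)]
both = [n for n, a in counts.items() if keep_by_count(a, 2, 10)]

print(at_least_2)
print(at_most_10)
print(both)
```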
Selection based on taxonomic annotation #
Taxonomy-based selection is always performed on the taxid attribute of a sequence, even if the sequence carries other taxonomic information stored in other attributes such as scientific_name or family_taxid. To use taxonomy-based selection with obigrep, it is mandatory to load a taxonomy using the -t or --taxonomy option.
Selecting sequences belonging to a clade #
If you do not have a taxonomy dump already downloaded, you must first download one using the following obitaxonomy command. The taxonomy will be stored in a file named ncbitaxo.tgz. This compressed archive can later be supplied to other OBITools4 commands.
obitaxonomy --download-ncbi --out ncbitaxo.tgz
To select the sequences belonging to the Homo sapiens species, the first step is to extract the taxid corresponding to the species of interest from the downloaded taxonomy using the obitaxonomy command.
- The -t option indicates the taxonomy to load.
- The --fixed option indicates that the query string is to be considered as the exact name of the species, not as a regular pattern.
- The --rank species option indicates that only taxa having the species taxonomic rank are of interest.
- "Homo sapiens" is the query string used to match the taxonomy names.
The csvlook command is used to present the CSV output of obitaxonomy as a readable table.
obitaxonomy -t ncbitaxo.tgz --fixed --rank species "Homo sapiens" | csvlook -I
| taxid | parent | taxonomic_rank | scientific_name |
| --------------------------------- | ----------------------- | -------------- | --------------- |
| taxon:9606 [Homo sapiens]@species | taxon:9605 [Homo]@genus | species | Homo sapiens |
The obigrep option to select sequences belonging to a taxon is -r or --restrict-to-taxon. The option requires as argument the taxid of the clade of interest, here 9606 for Homo sapiens.
obigrep -t ncbitaxo.tgz -r taxon:9606 five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
Only sequences seqA1 and seqB1, annotated as belonging to the target clade Homo sapiens or to one of its subspecies (Homo sapiens neanderthalensis), are retained. Sequence seqA2 is not retained as it is annotated at genus level as Homo and therefore does not belong to the Homo sapiens clade; neither is sequence seqC1, annotated at family level as Hominidae. The last sequence, seqB2, has no taxonomic annotation and is therefore considered to be annotated at the root of the taxonomy, which is not part of the Homo sapiens species clade.
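Clade membership boils down to walking from a taxid towards the root and checking whether the target taxid is met on the way. The sketch below assumes a hypothetical, heavily simplified parent map (intermediate ranks of the real NCBI taxonomy are skipped, and the root is given taxid 1 for the purpose of the example); it only illustrates the reasoning, not the OBITools4 implementation.

```python
# Hypothetical toy taxonomy: child taxid -> parent taxid (root is its own parent).
PARENT = {63221: 9606, 9606: 9605, 9605: 9604, 9604: 1, 1: 1}
ROOT = 1

# taxid annotation of each record in five_tags.fasta (None: no taxid tag).
taxids = {"seqA1": 9606, "seqB1": 63221, "seqA2": 9605, "seqC1": 9604, "seqB2": None}


def belongs_to(taxid, clade):
    """True if clade is taxid itself or one of its ancestors."""
    while True:
        if taxid == clade:
            return True
        parent = PARENT[taxid]
        if parent == taxid:      # reached the root without meeting the clade
            return False
        taxid = parent


# Records without a taxid are treated as annotated at the root.
restricted = [n for n, t in taxids.items() if belongs_to(t or ROOT, 9606)]
ignored = [n for n, t in taxids.items() if not belongs_to(t or ROOT, 9606)]

print(restricted)
print(ignored)
```

The second list is simply the complement of the first, which is the behaviour expected from an option performing the reverse selection.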
Excluding sequences belonging to a clade #
The -i option, or --ignore-taxon in its long form, performs the reverse selection of the -r option presented above. It only retains sequences that do not belong to the target clade whose taxid is passed as an argument.
obigrep -t ncbitaxo.tgz -i taxon:9606 five_tags.fasta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga
Here, only the sequences seqA2, seqC1 and seqB2 are retained, as none of them belongs to the Homo sapiens species.
Keep only sequences with taxonomic information at a given rank #
A taxid, when associated with a taxonomy, not only provides information at its taxonomic rank, but also makes it possible to retrieve information at any higher rank. For example, from a species taxid, it is expected that by querying the taxonomy, it will be possible to retrieve the corresponding genus or family taxid. obigrep allows you to select sequences annotated by a taxid capable of providing information at a given taxonomic rank using the --require-rank option.
To retrieve all ranks defined by a taxonomy, it is possible to use the obitaxonomy command with the -l option.
obitaxonomy -t ncbitaxo.tgz -l | csvlook
| rank |
| ---------------- |
| domain |
| phylum |
| class |
| suborder |
| subcohort |
| superphylum |
| subspecies |
| varietas |
| subgenus |
| parvorder |
| acellular root |
| genotype |
| subtribe |
| subkingdom |
| subfamily |
| kingdom |
| isolate |
| superorder |
| section |
| subvariety |
| genus |
| serogroup |
| tribe |
| forma |
| infraclass |
| superclass |
| serotype |
| no rank |
| family |
| species group |
| subclass |
| infraorder |
| pathogroup |
| realm |
| order |
| biotype |
| species subgroup |
| species |
| strain |
| clade |
| cohort |
| series |
| cellular root |
| morph |
| subphylum |
| forma specialis |
| superfamily |
| subsection |
This allows us to check that the species rank is defined and to filter the five_tags.fasta test file to retain only sequences with information available at the species level.
obigrep -t ncbitaxo.tgz --require-rank species five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
Only two sequences are selected by this command: seqA1, which is annotated at the species level, and seqB1, which is annotated at the subspecies taxonomic rank and therefore allows species level information to be retrieved. seqA2 and seqC1 are discarded as they are annotated at genus and family levels, respectively. seqB2 is discarded as it is not taxonomically annotated and is therefore considered to be annotated at the root of the taxonomy.
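The rank requirement can be pictured as a search for the requested rank along the path from the taxid to the root. The sketch below relies on a hypothetical toy taxonomy in which intermediate ranks of the real NCBI taxonomy are skipped and the root is given taxid 1; the helper taxid_at_rank is an illustration, not an OBITools4 function.

```python
# Hypothetical toy taxonomy: taxid -> (parent taxid, rank).
TAXONOMY = {
    63221: (9606, "subspecies"),
    9606: (9605, "species"),
    9605: (9604, "genus"),
    9604: (1, "family"),
    1: (1, "no rank"),        # root, its own parent
}
ROOT = 1

taxids = {"seqA1": 9606, "seqB1": 63221, "seqA2": 9605, "seqC1": 9604, "seqB2": None}


def taxid_at_rank(taxid, rank):
    """Return the taxid holding `rank` on the path to the root, or None."""
    while True:
        parent, current_rank = TAXONOMY[taxid]
        if current_rank == rank:
            return taxid
        if parent == taxid:   # root reached: the rank is not available
            return None
        taxid = parent


# Records without a taxid are treated as annotated at the root.
with_species = [
    n for n, t in taxids.items() if taxid_at_rank(t or ROOT, "species") is not None
]
print(with_species)
```

A genus-level taxid such as 9605 can still answer a request for the family rank, but not for the species rank, because the walk only goes upwards.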
Keep only sequences annotated with valid taxids #
📄 six_invalid.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctgcatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
tagctagctagctagctagctagctagcta
>seqD1 {"taxid":"taxon:9607"}
gctagctagctgacgatgcatgcgtaggtgcagttgcgta
obigrep -t ncbitaxo.tgz --valid-taxid six_invalid.fasta
WARN[0005] seqD1: Taxid: taxon:9607 is unknown from taxonomy (Taxid taxon:9607 is not part of the taxonomy NCBI Taxonomy)
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctgcatgctagtgctagtcgatga
Selection based on the sequence #
Selection based on the sequence length #
Two options, -l (--min-length) and -L (--max-length), allow selecting sequences based on their length. A sequence is selected if its length is greater than or equal to the --min-length and less than or equal to the --max-length. If only one of these options is used, only the specified limit is applied.
In the five_tags.fasta file, one sequence is 27 base pairs (bp) long, two are 29 bp long and two are 30 bp long.
To select only sequences with a minimum length of 29 bp, the following command can be executed:
obigrep -l 29 five_tags.fasta
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga
To select only sequences with a maximum length of 29 bp, the following command can be executed
obigrep -L 29 five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
Note that in both cases the two 29-bp sequences were selected, because both limits are inclusive.
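The inclusive behaviour of the two length limits can be reproduced with a minimal sketch on the five sequences of five_tags.fasta. The helper keep_by_length is hypothetical and only illustrates the comparison rules.

```python
sequences = {
    "seqA1": "cgatgctgcatgctagtgctagtcgat",
    "seqB1": "tagctagctagctagctagctagctagcta",
    "seqA2": "gtagctagctagctagctagctagctaga",
    "seqC1": "cgatgctccatgctagtgctagtcgatga",
    "seqB2": "cgatggctccatgctagtgctagtcgatga",
}


def keep_by_length(sequence, min_length=None, max_length=None):
    """Apply inclusive lower and upper bounds to the sequence length."""
    if min_length is not None and len(sequence) < min_length:
        return False
    if max_length is not None and len(sequence) > max_length:
        return False
    return True


min_29 = [n for n, s in sequences.items() if keep_by_length(s, min_length=29)]
max_29 = [n for n, s in sequences.items() if keep_by_length(s, max_length=29)]

print(min_29)
print(max_29)
```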
Selection based on the sequence #
Sequence records can be selected on the sequence itself. There are two pattern matching algorithms available, depending on the options used:
- --sequence or -s: The pattern is a regular pattern used to match the sequence records. The pattern is not case-sensitive.
- --approx-pattern: This option uses the same algorithm as obipcr and obimultiplex to locate primers. The description of the pattern follows the same grammar.
While regular patterns allow for more complex expressions in describing the searched sequence, DNA patterns have the advantage of tolerating discrepancies between the pattern and the actual sequence (mismatches and indels). To set the number and the type of allowed errors, use the --pattern-error and --allows-indels options.
In the next example, sequences containing the pattern tgc at least twice, possibly separated by any number of bases (.*), are searched. This can be expressed as the regular pattern tgc.*tgc
obigrep -s 'tgc.*tgc' five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga
If we are interested in sequences matching the pattern gatgctgcat, but want to allow a certain number of errors, we can use the --approx-pattern option. Despite its name, this option does not allow any errors by default, so for simple patterns like the one used here, the --approx-pattern and -s options are equivalent.
obigrep --approx-pattern gatgctgcat \
five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
obigrep -s gatgctgcat \
five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
However, --approx-pattern can be parameterized using the --pattern-error option. The following example allows two errors (differences) between the pattern and the matched sequence. Without a further option, these errors can only be substitutions. Thus, the value defined by --pattern-error is the maximum Hamming distance between the pattern and the matched sequence.
obigrep --approx-pattern gatgctgcat \
--pattern-error 2 \
five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
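Substitution-only matching can be sketched as a sliding window in which the number of mismatching positions is compared with the allowed maximum. This is a naive illustration of the Hamming distance criterion under stated assumptions (forward strand only, plain bases without ambiguity codes); it is not the algorithm obigrep actually uses.

```python
sequences = {
    "seqA1": "cgatgctgcatgctagtgctagtcgat",
    "seqB1": "tagctagctagctagctagctagctagcta",
    "seqA2": "gtagctagctagctagctagctagctaga",
    "seqC1": "cgatgctccatgctagtgctagtcgatga",
    "seqB2": "cgatggctccatgctagtgctagtcgatga",
}


def hamming_match(pattern, sequence, max_errors=0):
    """True if some window of the sequence differs from the pattern
    by at most max_errors substitutions."""
    pattern, sequence = pattern.lower(), sequence.lower()
    width = len(pattern)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = sum(1 for p, s in zip(pattern, window) if p != s)
        if mismatches <= max_errors:
            return True
    return False


exact = [n for n, s in sequences.items() if hamming_match("gatgctgcat", s)]
two_errors = [n for n, s in sequences.items() if hamming_match("gatgctgcat", s, 2)]

print(exact)
print(two_errors)
```

In seqC1 the window gatgctccat differs from the pattern at a single position, so it is accepted as soon as at least one substitution is allowed.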
By adding the --allows-indels option, obigrep will allow indels when matching the pattern. This means that it can match sequences where the differences between the pattern and the matched sequence are insertions or deletions. Insertion or deletion of a symbol is considered one error. Therefore, with --pattern-error 2 and --allows-indels you can allow two mismatches, two insertions or deletions, or one mismatch and one indel. In this case, --pattern-error defines the maximum Levenshtein distance allowed between the pattern and the matched sequence.
obigrep --approx-pattern gatgctgcat \
--pattern-error 2 \
--allows-indels \
five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga
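When indels are allowed, the criterion becomes an edit distance between the pattern and some substring of the sequence. The sketch below uses the classic dynamic programming approach for approximate substring search, in which a match may start anywhere in the sequence at no cost. It is an illustration under stated assumptions (forward strand only, plain bases), not the OBITools4 implementation.

```python
sequences = {
    "seqA1": "cgatgctgcatgctagtgctagtcgat",
    "seqB1": "tagctagctagctagctagctagctagcta",
    "seqA2": "gtagctagctagctagctagctagctaga",
    "seqC1": "cgatgctccatgctagtgctagtcgatga",
    "seqB2": "cgatggctccatgctagtgctagtcgatga",
}


def approx_match(pattern, sequence, max_errors=0):
    """True if some substring of the sequence is within max_errors edits
    (substitutions, insertions, deletions) of the pattern."""
    pattern, sequence = pattern.lower(), sequence.lower()
    # column[i]: fewest edits between pattern[:i] and a substring ending here.
    column = list(range(len(pattern) + 1))
    if column[-1] <= max_errors:
        return True
    for base in sequence:
        previous, column = column, [0]   # a match may start at any position
        for i, p in enumerate(pattern, start=1):
            column.append(min(
                previous[i - 1] + (p != base),  # match or substitution
                previous[i] + 1,                # extra base in the sequence
                column[i - 1] + 1,              # pattern base missing
            ))
        if column[-1] <= max_errors:
            return True
    return False


with_indels = [n for n, s in sequences.items() if approx_match("gatgctgcat", s, 2)]
print(with_indels)
```

seqB2 is now accepted because gatggctccat can be turned into the pattern with one deleted g and one substitution, which is two errors in total.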
Defining your own predicate #
You can define your own predicate to filter your data set. A predicate is an expression that, when evaluated, returns a logical value of true or false. The predicate is defined with the --predicate (-p) option using the OBITools4 expression language. The predicate is evaluated on each sequence in the data set. Sequences that result in a true value are retained in the result, while those that result in a false value are discarded.
As a first example, the following command, which filters out all sequences that have a count annotation lower than 2 or greater than 10:
obigrep -c 2 -C 10 five_tags.fasta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
can be substituted by:
obigrep -p 'sequence.Count() >= 2 && sequence.Count() <= 10' five_tags.fasta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
Working with paired sequence files #
OBITools4 can handle paired sequence files. This means that it will process the paired sequences in the two different files together, and in particular for obigrep it will apply the same filtering to both sequence files. This ensures that in the result files, each sequence is still paired with its correct counterpart sequence.
The most important option for manipulating paired sequence files is the --paired-with option. This option allows you to specify the name of a file containing the sequences to be paired with those in the main sequence file.
Synopsis #
obigrep [--allows-indels] [--approx-pattern <PATTERN>]...
[--attribute|-a <KEY=VALUE>]... [--batch-size <int>] [--compress|-Z]
[--debug] [--definition|-D <PATTERN>]... [--ecopcr] [--embl]
[--fasta] [--fasta-output] [--fastq] [--fastq-output]
[--force-one-cpu] [--genbank] [--has-attribute|-A <KEY>]...
[--help|-h|-?] [--id-list <FILENAME>] [--identifier|-I <PATTERN>]...
[--ignore-taxon|-i <TAXID>]... [--input-OBI-header]
[--input-json-header] [--inverse-match|-v] [--json-output]
[--max-count|-C <COUNT>] [--max-cpu <int>] [--max-length|-L <LENGTH>]
[--min-count|-c <COUNT>] [--min-length|-l <LENGTH>] [--no-order]
[--no-progressbar] [--only-forward] [--out|-o <FILENAME>]
[--output-OBI-header|-O] [--output-json-header]
[--paired-mode <forward|reverse|and|or|andnot|xor>]
[--paired-with <FILENAME>] [--pattern-error <int>] [--pprof]
[--pprof-goroutine <int>] [--pprof-mutex <int>]
[--predicate|-p <EXPRESSION>]... [--require-rank <RANK_NAME>]...
[--restrict-to-taxon|-r <TAXID>]... [--save-discarded <FILENAME>]
[--sequence|-s <PATTERN>]... [--skip-empty] [--solexa]
[--taxonomy|-t <string>] [--version] [<args>]
Options #
Selecting sequence records #
Selection based on the sequence #
Strict matching #
- --sequence | -s <PATTERN>: A regular expression pattern used to match the sequence. Only the entries whose sequence matches the pattern are kept. Regular expression patterns are case-insensitive.
Approximate matching #
- --approx-pattern <PATTERN>: A DNA pattern used to match the sequence. Only the entries whose sequence matches the pattern are kept. DNA patterns are case-insensitive. They can be matched allowing for errors: mismatches or insertions or deletions.
- --allows-indels: allows for indels during DNA pattern matching (see --approx-pattern option).
- --pattern-error <INTEGER>: maximum number of errors allowed when searching for patterns in DNA (default 0, see --approx-pattern option).
Selection based on the sequence identifier #
- --identifier | -I <REGEX>: Regular expression pattern to be tested against the sequence identifier. The pattern is case-insensitive.
- --id-list <FILENAME>: points to a text file containing the list of sequence record identifiers to be selected. The file format consists of a single identifier per line.
Selection based on the sequence definition #
- --definition | -D <REGEX>: Regular expression pattern to be tested against the sequence definition. The pattern is case-insensitive.
Selection based on the sequence properties #
- --min-count | -c <COUNT>: selects the sequence records for which the number of occurrences (i.e. the count attribute) is equal to or greater than the defined minimum count.
- --max-count | -C <COUNT>: selects the sequence records for which the number of occurrences (i.e. the count attribute) is equal to or smaller than the defined maximum count.
- --min-length | -l <LENGTH>: selects the sequence records for which the sequence length is equal to or greater than the defined minimum sequence length.
- --max-length | -L <LENGTH>: selects the sequence records for which the sequence length is equal to or less than the defined maximum sequence length.
Matching the sequence annotations #
Taxonomy based filtering #
If the user specifies a taxonomy when calling OBITools (see --taxonomy option), it is possible to filter the sequences based on taxonomic properties. Each of the following options can be used multiple times if needed to specify multiple taxids or ranks.
- --restrict-to-taxon | -r <TAXID>: Only sequences having a taxid belonging to the provided taxid are conserved.
- --ignore-taxon | -i <TAXID>: Sequences having a taxid belonging to the provided taxid are discarded.
- --require-rank <RANK_NAME>: Only sequences having a taxid able to provide information at the <RANK_NAME> level are conserved. As an example, the NCBI taxid 74635 corresponding to Rosa canina is able to provide information at the species, genus or family level. But taxid 3764 (Rosa genus) is not able to provide information at the species level. Many of the taxids related to environmental samples have a partial classification, and a taxon at the species level is not always connected to a taxon at the genus level as parent. It can sometimes be connected to a taxon at a higher level.
Controlling the input data #
OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step.
The file format options #
- --fasta: indicates that sequence data is in fasta format.
- --fastq: indicates that sequence data is in fastq format.
- --embl: indicates that sequence data is in EMBL-ENA flatfile format.
- --csv: indicates that sequence data is in CSV format.
- --genbank: indicates that sequence data is in GenBank flatfile format.
- --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.
Controlling the way OBITools4 are formatting annotations #
These options only apply to the FASTA and FASTQ formats.
- --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
- --input-json-header: FASTA/FASTQ title line annotations follow the JSON format.
Controlling quality score decoding #
This option only applies to the FASTQ format.
- --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)
Controlling the output data #
- --compress | -Z: output is compressed using gzip. (default: false)
- --no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation.
- --fasta-output: writes sequence data in fasta format (default if quality data is not available).
- --fastq-output: writes sequence data in fastq format (default if quality data is available).
- --json-output: writes sequence data in JSON format.
- --out | -o <FILENAME>: filename used for saving the output (default: "-", the standard output)
- --output-OBI-header | -O: writes output FASTA/FASTQ title line annotations in OBI format (default: JSON).
- --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format).
- --skip-empty: sequences of length equal to zero are removed from the output (default: false).
- --no-progressbar: deactivates progress bar display (default: false).
General options #
- --help | -h | -?: shows this help.
- --version: prints the version and exits.
- --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.
Computation related options #
- --max-cpu <INTEGER>: OBITools can take advantage of your computer's multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
- --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
- --batch-size <INTEGER>: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE)
Debug related options #
- --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
- --pprof: enables pprof server. Look at the log for details. (default: false).
- --pprof-mutex <INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
- --pprof-goroutine <INTEGER>: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)
Examples #
obigrep --help