obigrep: filter a sequence file #
Description #
obigrep is a tool for selecting a subset of sequences based on a set of criteria. Sequences from the input data set that match all criteria are retained and printed in the result, while the other sequences are discarded.
Selection criteria can be based on different aspects of the sequence data, such as
- The sequence identifier (ID)
- The sequence annotations
- The sequence itself
Selection based on sequence identifier (ID) #
There are two ways to select sequences based on their identifier:
- Using a regular expression with option -I
- Using a list of identifiers (IDs) provided in a file with option --id-list
Consider the following sample file containing five sequences:
📄 five_ids.fasta
>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1
tagctagctagctagctagctagctagcta
>seqA2
gtagctagctagctagctagctagctaga
>seqC1
cgatgctgcatgctagtgctagtcgatga
>seqB2
tagctagctagctagctagctagctagcta
To select the sequences with IDs “seqA1” and “seqB1”, you can use the following command:
obigrep -I '^seq[AB]1$' five_ids.fasta
>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1
tagctagctagctagctagctagctagcta
The components of the regular expression ^seq[AB]1$ are:
- ^ at the beginning means that the identifier must start with the pattern that follows
- seq is an exact match for that string
- [AB] means any character in the set {A, B}
- 1 is an exact match for that character
- $ at the end means that the identifier must end with the pattern that precedes it
If the starting ^ had been omitted, the pattern would have matched any sequence ID ending with “seq” followed by a character from the set {A, B} and then “1”; for example, the IDs my_seqA1 or my_seqB1 would have been matched.
If the ending $ had been omitted, the pattern would have matched any sequence ID starting with “seq” followed by a character from the set {A, B} and then “1”; for example, the IDs seqA102 or seqB1023456789 would have been matched.
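The effect of the two anchors can be checked with any regular expression engine. The sketch below uses Python's re module purely as an illustration; it is not how obigrep is implemented. The helper name `matching` and the two extra identifiers are made up for the example, and case is ignored because the identifier pattern is documented as case-insensitive.

```python
import re

# The five sample identifiers plus two made-up ones used to probe the anchors.
ids = ["seqA1", "seqB1", "seqA2", "seqC1", "seqB2", "my_seqA1", "seqA102"]


def matching(pattern, identifiers):
    """Return the identifiers in which the pattern is found (case-insensitive)."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [ident for ident in identifiers if regex.search(ident)]


anchored = matching(r"^seq[AB]1$", ids)  # both anchors
no_start = matching(r"seq[AB]1$", ids)   # leading ^ omitted
no_end = matching(r"^seq[AB]1", ids)     # trailing $ omitted

print(anchored)  # ['seqA1', 'seqB1']
print(no_start)  # ['seqA1', 'seqB1', 'my_seqA1']
print(no_end)    # ['seqA1', 'seqB1', 'seqA102']
```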
Another solution to extract these sequence IDs is to use a text file containing them, one per line, as follows:
📄 seqAB.txt
seqA1
seqB1
This seqAB.txt file can then be used as an index file by obigrep:
obigrep --id-list seqAB.txt five_ids.fasta
>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1
tagctagctagctagctagctagctagcta
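As a rough illustration of what filtering by an identifier list amounts to, the following Python sketch reads the sample FASTA data and keeps the records whose identifier appears in the list. It is a simplified stand-in, not obigrep's implementation; the function names `read_fasta` and `select_by_id_list` are hypothetical, and the list content is assumed to be the two identifiers discussed above.

```python
import io

FIVE_IDS = """\
>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1
tagctagctagctagctagctagctagcta
>seqA2
gtagctagctagctagctagctagctaga
>seqC1
cgatgctgcatgctagtgctagtcgatga
>seqB2
tagctagctagctagctagctagctagcta
"""

# Assumed content of the identifier list: one identifier per line.
ID_LIST = "seqA1\nseqB1\n"


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA stream."""
    ident, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if ident is not None:
                yield ident, "".join(chunks)
            # The identifier is the first word of the title line.
            words = line[1:].split()
            ident, chunks = (words[0] if words else ""), []
        elif line and ident is not None:
            chunks.append(line)
    if ident is not None:
        yield ident, "".join(chunks)


def select_by_id_list(records, id_lines):
    """Keep the records whose identifier is listed, preserving input order."""
    wanted = {line.strip() for line in id_lines if line.strip()}
    return [(ident, seq) for ident, seq in records if ident in wanted]


selected = select_by_id_list(read_fasta(io.StringIO(FIVE_IDS)),
                             io.StringIO(ID_LIST))
for ident, seq in selected:
    print(f">{ident}\n{seq}")
```

Storing the identifiers in a set keeps each lookup cheap, which matters when the list is long.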
Selection based on the annotations #
The obigrep tool can also be used to select sequences based on their annotations. Annotations consist of all the tags and values added to each sequence header in the fasta/fastq file. For instance, consider a sequence file (named five_tags.fasta in the commands below) with the following records:
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctgcatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
tagctagctagctagctagctagctagcta
Selecting sequences having a tag whatever its value #
The -A option selects the sequences having the given attribute, whatever its value. In the following example, all the sequences having the count attribute are selected.
obigrep -A "count" five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctgcatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
tagctagctagctagctagctagctagcta
Only four sequences are retained; the sequence seqB1 is excluded because it does not have the count tag.
Selecting sequences having a tag with a specific value #
The -a option selects the sequences having the given attribute set to a value matching the provided regular expression. In the following example, only the sequence seqA1, whose toto attribute has the value titi, is selected.
obigrep -a toto="titi" five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
As the value is a regular expression, it is possible to be less strict. For example, the next command selects all the sequences whose toto attribute has a value starting with a t (the ^ at the beginning of the expression anchors the match to the start of the value).
obigrep -a toto="^t" five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
Sequence seqC1 is excluded because its toto attribute has the value foo, which does not start with a t, while seqB2 is excluded because it does not have a toto attribute.
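The two annotation filters described in this section can be mimicked in a few lines of Python, which may help when reasoning about which records a pattern will keep. This is only a sketch of the selection logic, not obigrep's code: the function names are hypothetical, and converting tag values to strings before matching is an assumption of the example.

```python
import json
import re

# Title lines of the sample file, without the leading ">".
HEADERS = [
    'seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}',
    'seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}',
    'seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}',
    'seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}',
    'seqB2 {"count":25,"tata":"bar"}',
]


def parse_header(header):
    """Split a title line into (identifier, annotations dictionary)."""
    ident, _, rest = header.partition(" ")
    return ident, (json.loads(rest) if rest.strip() else {})


RECORDS = [parse_header(h) for h in HEADERS]


def has_attribute(records, key):
    """Mimic -A KEY: keep records carrying the tag, whatever its value."""
    return [ident for ident, tags in records if key in tags]


def attribute_matches(records, key, pattern):
    """Mimic -a KEY=PATTERN: keep records whose tag value matches the pattern."""
    regex = re.compile(pattern)
    # str() conversion of the value is an assumption of this sketch.
    return [ident for ident, tags in records
            if key in tags and regex.search(str(tags[key]))]


print(has_attribute(RECORDS, "count"))             # ['seqA1', 'seqA2', 'seqC1', 'seqB2']
print(attribute_matches(RECORDS, "toto", "titi"))  # ['seqA1']
print(attribute_matches(RECORDS, "toto", "^t"))    # ['seqA1', 'seqB1', 'seqA2']
```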
Synopsis #
obigrep [--allows-indels] [--approx-pattern <PATTERN>]...
[--attribute|-a <KEY=VALUE>]... [--batch-size <int>] [--compress|-Z]
[--debug] [--definition|-D <PATTERN>]... [--ecopcr] [--embl]
[--fasta] [--fasta-output] [--fastq] [--fastq-output]
[--force-one-cpu] [--genbank] [--has-attribute|-A <KEY>]...
[--help|-h|-?] [--id-list <FILENAME>] [--identifier|-I <PATTERN>]...
[--ignore-taxon|-i <TAXID>]... [--input-OBI-header]
[--input-json-header] [--inverse-match|-v] [--json-output]
[--max-count|-C <COUNT>] [--max-cpu <int>] [--max-length|-L <LENGTH>]
[--min-count|-c <COUNT>] [--min-length|-l <LENGTH>] [--no-order]
[--no-progressbar] [--only-forward] [--out|-o <FILENAME>]
[--output-OBI-header|-O] [--output-json-header]
[--paired-mode <forward|reverse|and|or|andnot|xor>]
[--paired-with <FILENAME>] [--pattern-error <int>] [--pprof]
[--pprof-goroutine <int>] [--pprof-mutex <int>]
[--predicate|-p <EXPRESSION>]... [--require-rank <RANK_NAME>]...
[--restrict-to-taxon|-r <TAXID>]... [--save-discarded <FILENAME>]
[--sequence|-s <PATTERN>]... [--skip-empty] [--solexa]
[--taxonomy|-t <string>] [--version] [<args>]
Options #
Selecting sequence records #
Selection based on the sequence #
Strict matching #
- --sequence|-s <PATTERN>: a regular expression pattern used to match the sequence. Only the entries whose sequence matches the pattern are kept. Regular expression patterns are case-insensitive.
Approximate matching #
- --approx-pattern <PATTERN>: a DNA pattern used to match the sequence. Only the entries whose sequence matches the pattern are kept. DNA patterns are case-insensitive. They can be matched allowing for errors: mismatches, insertions or deletions.
- --allows-indels: allows for indels during DNA pattern matching (see the --approx-pattern option).
- --pattern-error <INTEGER>: maximum number of errors allowed when searching for patterns in DNA (default 0, see the --approx-pattern option).
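To make the notion of "errors" concrete, here is a deliberately naive Python sketch of approximate matching that tolerates a bounded number of mismatches. It is an assumption-laden illustration rather than the algorithm obigrep uses: it slides the pattern along the sequence, counts substitutions only, and does not handle the insertions and deletions enabled by --allows-indels. The function name `approx_match` is hypothetical.

```python
def approx_match(pattern, sequence, max_errors=0):
    """Return True if the pattern occurs in the sequence with at most
    max_errors mismatches (substitutions only, no indels)."""
    pat, seq = pattern.lower(), sequence.lower()  # case-insensitive comparison
    for start in range(len(seq) - len(pat) + 1):
        window = seq[start:start + len(pat)]
        mismatches = sum(1 for a, b in zip(pat, window) if a != b)
        if mismatches <= max_errors:
            return True
    return False


SEQ = "cgatgctgcatgctagtgctagtcgat"  # seqA1 from the sample file

print(approx_match("GCATGCTAG", SEQ))                # True: exact occurrence
print(approx_match("GCATGGTAG", SEQ))                # False: one base differs
print(approx_match("GCATGGTAG", SEQ, max_errors=1))  # True: one mismatch tolerated
```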
Selection based on the sequence identifier #
- --identifier|-I <REGEX>: regular expression pattern to be tested against the sequence identifier. The pattern is case-insensitive.
- --id-list <FILENAME>: points to a text file containing the list of sequence record identifiers to be selected. The file contains a single identifier per line.
Selection based on the sequence definition #
- --definition|-D <REGEX>: regular expression pattern to be tested against the sequence definition. The pattern is case-insensitive.
Selection based on the sequence properties #
- --min-count|-c <COUNT>: selects the sequence records for which the number of occurrences (i.e. the count attribute) is equal to or greater than the defined minimum count.
- --max-count|-C <COUNT>: selects the sequence records for which the number of occurrences (i.e. the count attribute) is equal to or less than the defined maximum count.
- --min-length|-l <LENGTH>: selects the sequence records for which the sequence length is equal to or greater than the defined minimum sequence length.
- --max-length|-L <LENGTH>: selects the sequence records for which the sequence length is equal to or less than the defined maximum sequence length.
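Because a record is kept only when it satisfies every criterion, the count and length thresholds combine as a logical AND. The Python sketch below illustrates that combination; it is not obigrep's code, the function name `keep` is hypothetical, and treating a record without a count tag as having a count of 1 is an assumption of the example.

```python
def keep(tags, sequence, min_count=None, max_count=None,
         min_length=None, max_length=None):
    """Return True when the record satisfies every threshold that is set."""
    # Assumption of this sketch: a missing count tag is treated as 1.
    count = tags.get("count", 1)
    length = len(sequence)
    checks = [
        min_count is None or count >= min_count,
        max_count is None or count <= max_count,
        min_length is None or length >= min_length,
        max_length is None or length <= max_length,
    ]
    return all(checks)


records = [
    ("seqA1", {"count": 1}, "cgatgctgcatgctagtgctagtcgat"),
    ("seqA2", {"count": 5}, "gtagctagctagctagctagctagctaga"),
    ("seqC1", {"count": 15}, "cgatgctgcatgctagtgctagtcgatga"),
    ("seqB2", {"count": 25}, "tagctagctagctagctagctagctagcta"),
]

# Same idea as combining -c 5 with -C 20: both bounds are inclusive.
kept = [ident for ident, tags, seq in records
        if keep(tags, seq, min_count=5, max_count=20)]
print(kept)  # ['seqA2', 'seqC1']
```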
Matching the sequence annotations #
Taxonomy based filtering #
If the user specifies a taxonomy when calling OBITools (see the --taxonomy option), it is possible to filter the sequences based on taxonomic properties. Each of the following options can be used multiple times to specify several taxids or ranks.
- --restrict-to-taxon|-r <TAXID>: only sequences having a taxid belonging to the provided taxon are kept.
- --ignore-taxon|-i <TAXID>: sequences having a taxid belonging to the provided taxon are discarded.
- --require-rank <RANK_NAME>: only sequences having a taxid able to provide information at the <RANK_NAME> level are kept. As an example, the NCBI taxid 74635, corresponding to Rosa canina, is able to provide information at the species, genus or family level, but taxid 3764 (the genus Rosa) is not able to provide information at the species level. Many of the taxids related to environmental samples have a partial classification: a taxon at the species level does not always have a genus-level taxon as its parent and can instead be attached directly to a taxon at a higher level.
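The logic behind these taxonomic filters can be pictured as a walk up the lineage of a taxon. The Python sketch below uses a toy taxonomy to illustrate the idea; it is not obigrep's implementation. Only the taxids 74635 and 3764 come from the text, while the "FAM" and "ENV" nodes are hypothetical placeholders, as are the function names. Counting a taxon as belonging to itself is also an assumption of the example.

```python
# Toy taxonomy: taxid -> (rank, parent taxid).
TAXONOMY = {
    74635: ("species", 3764),   # Rosa canina (from the text)
    3764: ("genus", "FAM"),     # Rosa (from the text)
    "FAM": ("family", None),    # hypothetical family node
    "ENV": ("species", "FAM"),  # hypothetical environmental taxon without a genus
}


def lineage(taxid):
    """Yield the taxid followed by all of its ancestors."""
    while taxid is not None:
        yield taxid
        taxid = TAXONOMY[taxid][1]


def belongs_to(taxid, ancestor):
    """Idea behind -r and -i: is the ancestor found in the lineage of the taxid?"""
    return ancestor in lineage(taxid)


def has_rank(taxid, rank):
    """Idea behind --require-rank: does the lineage contain a taxon of that rank?"""
    return any(TAXONOMY[t][0] == rank for t in lineage(taxid))


print(has_rank(74635, "species"), has_rank(74635, "family"))  # True True
print(has_rank(3764, "species"))                              # False
print(has_rank("ENV", "genus"), has_rank("ENV", "family"))    # False True
print(belongs_to(74635, 3764), belongs_to(3764, 74635))       # True False
```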
Controlling the input data #
OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. However, some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step.
The file format options #
- --fasta: indicates that sequence data is in fasta format.
- --fastq: indicates that sequence data is in fastq format.
- --embl: indicates that sequence data is in EMBL-ENA flatfile format.
- --csv: indicates that sequence data is in CSV format.
- --genbank: indicates that sequence data is in GenBank flatfile format.
- --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.
Controlling the way OBITools4 are formatting annotations #
These options only apply to the FASTA and FASTQ formats.
- --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
- --input-json-header: FASTA/FASTQ title line annotations follow the JSON format.
Controlling quality score decoding #
This option only applies to the FASTQ format.
- --solexa: decodes the quality string according to the Solexa specification (default: the standard Sanger encoding is used, env: OBISSOLEXA).
Controlling the output data #
- --compress|-Z: output is compressed using gzip (default: false).
- --no-order: by default, the OBITools ensure that the order of the sequences does not change between the input file and the output file, and when multiple files are given they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Processing multiple files in parallel may also require more memory.
- --fasta-output: writes sequence data in fasta format (default if quality data is not available).
- --fastq-output: writes sequence data in fastq format (default if quality data is available).
- --json-output: writes sequence data in JSON format.
- --out|-o <FILENAME>: filename used for saving the output (default: “-”, the standard output).
- --output-OBI-header|-O: writes output FASTA/FASTQ title line annotations in OBI format (default: JSON).
- --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format).
- --skip-empty: sequences of length equal to zero are removed from the output (default: false).
- --no-progressbar: deactivates progress bar display (default: false).
General options #
- --help|-h|-?: shows this help.
- --version: prints the version and exits.
- --silent-warning: tells OBITools to stop displaying warnings. This behaviour can also be controlled by setting the OBIWARNINGS environment variable.
Computation related options #
- --max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory. Reducing the number of CPUs used for a calculation is therefore also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
- --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
- --batch-size <INTEGER>: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE).
Debug related options #
- --debug: enables debug mode by setting the log level to debug (default: false, env: OBIDEBUG).
- --pprof: enables the pprof server. Look at the log for details (default: false).
- --pprof-mutex <INTEGER>: enables profiling of mutex locks (default: 10, env: OBIPPROFMUTEX).
- --pprof-goroutine <INTEGER>: enables the goroutine blocking profile (default: 6060, env: OBIPPROFGOROUTINE).
Examples #
obigrep --help