obigrep: filter a sequence file #

Description #

obigrep is a tool for selecting a subset of sequences based on a set of criteria. Sequences from the input dataset that match all the criteria are retained and printed in the result, while other sequences are discarded.

Selection criteria can be based on different aspects of the sequence data, such as

  • The sequence identifier (ID)
  • The sequence annotations
  • The sequence itself

Selection based on sequence identifier (ID) #

There are two ways of selecting sequences according to their identifier:

  • Using a regular expression pattern with option -I
  • Using a list of identifiers (IDs) provided in a file with option --id-list

On the following five-sequence sample file:

📄 five_ids.fasta
>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1
tagctagctagctagctagctagctagcta
>seqA2
gtagctagctagctagctagctagctaga
>seqC1
cgatgctgcatgctagtgctagtcgatga
>seqB2
tagctagctagctagctagctagctagcta

To select sequences with IDs “seqA1” and “seqB1”, you can use the command

obigrep -I '^seq[AB]1$' five_ids.fasta
>seqA1 
cgatgctgcatgctagtgctagtcgat
>seqB1 
tagctagctagctagctagctagctagcta

The regular pattern ^seq[AB]1$ reads as follows:

  • the ^ at the beginning means that the string must start with that pattern
  • seq is an exact match for that string
  • [AB] means any character in the set {A, B}
  • 1 is an exact match for that character
  • $ at the end of the pattern means that the string must end with that pattern.

If the starting ^ had been omitted, the pattern would have matched any sequence ID containing “seq” followed by a character from the set {A, B} and ending with “1”, for example the IDs my_seqA1 or my_seqB1 would have been selected.

If the ending $ had been omitted, the pattern would have matched any sequence ID starting with “seq”, followed by a character from the set {A, B}, followed by “1”, whatever comes after. For example, the IDs seqA102 or seqB1023456789 would have been selected.
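The effect of the two anchors can be checked with any regular expression engine. The sketch below uses Python's re module purely as an illustration; it is not how obigrep is implemented, the select helper is hypothetical, and the IDs my_seqA1 and seqA102 are the made-up examples from the text above.

```python
import re

# Identifiers from five_ids.fasta plus the two hypothetical IDs discussed above.
ids = ["seqA1", "seqB1", "seqA2", "seqC1", "seqB2", "my_seqA1", "seqA102"]

def select(pattern, identifiers):
    """Return the identifiers in which the pattern is found (case-insensitive)."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [name for name in identifiers if regex.search(name)]

anchored = select(r"^seq[AB]1$", ids)   # both anchors
no_start = select(r"seq[AB]1$", ids)    # leading ^ omitted
no_end = select(r"^seq[AB]1", ids)      # trailing $ omitted

print(anchored)
print(no_start)
print(no_end)
```

Only the fully anchored pattern restricts the selection to seqA1 and seqB1; each relaxed pattern additionally lets one of the two extra IDs through.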

Another solution to extract these sequence IDs would be to use a text file containing them, one per line, as follows

📄 seqAB.txt
seqA1
seqB1

This seqAB.txt file can then be passed to obigrep with the --id-list option:

obigrep --id-list seqAB.txt five_ids.fasta
>seqA1 
cgatgctgcatgctagtgctagtcgat
>seqB1 
tagctagctagctagctagctagctagcta
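Conceptually, selecting by ID list is a set-membership test on the first word of each header. The following Python sketch illustrates that idea under simplifying assumptions (plain FASTA text held in memory, one identifier per line in the list); read_fasta is a hypothetical helper, not part of OBITools4.

```python
FASTA = """>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1
tagctagctagctagctagctagctagcta
>seqA2
gtagctagctagctagctagctagctaga
>seqC1
cgatgctgcatgctagtgctagtcgatga
>seqB2
tagctagctagctagctagctagctagcta
"""

ID_LIST = "seqA1\nseqB1\n"

def read_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                yield header.split()[0], "".join(chunks)
            header, chunks = line[1:], []
        elif line.strip():
            chunks.append(line.strip())
    if header is not None:
        yield header.split()[0], "".join(chunks)

# One identifier per line; blank lines are ignored.
wanted = {line.strip() for line in ID_LIST.splitlines() if line.strip()}
kept = [identifier for identifier, _ in read_fasta(FASTA) if identifier in wanted]
print(kept)
```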

Selection based on sequence definition #

Each sequence record can have a sequence definition describing the sequence. In fasta or fastq format, this definition is found in the header of each sequence record after the first word (the sequence ID), or after the annotations between braces in the OBITools4 extended version of these formats.

📄 three_def.fasta
>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1 my beautiful sequence
tagctagctagctagctagctagctagcta
>seqA2 {"count":10} my pretty sequence
gtagctagctagctagctagctagctaga

In the three_def.fasta example file:

  • seqA1 has no definition
  • seqB1 definition is my beautiful sequence
  • seqA2 definition is my pretty sequence

The -D or --definition option lets you specify a regular pattern to select only those sequences whose definition matches the pattern. The example below selects sequences whose definition contains the word pretty.

obigrep -D pretty three_def.fasta
>seqA2 {"count":10,"definition":"my pretty sequence"}
gtagctagctagctagctagctagctaga

As you can see in the results, OBITools4 commands store the definition present in the original file as a new annotation tag called definition. It is therefore this tag that is tested by the -D option.
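To make the header layout concrete, here is a minimal Python sketch that splits a header into its identifier, its JSON annotations and its definition, then stores the definition as a definition tag as seen in the output above. The parse_header function is hypothetical and only handles the simple cases of three_def.fasta; the real OBITools4 parser may behave differently on other inputs.

```python
import json
import re

def parse_header(header):
    """Split a header line (without '>') into (identifier, annotations)."""
    identifier, _, rest = header.partition(" ")
    rest = rest.strip()
    annotations = {}
    if rest.startswith("{"):
        # raw_decode parses the leading JSON object and reports where it ends.
        annotations, end = json.JSONDecoder().raw_decode(rest)
        rest = rest[end:].strip()
    if rest:
        # Whatever remains is the free-text definition.
        annotations["definition"] = rest
    return identifier, annotations

headers = [
    "seqA1",
    "seqB1 my beautiful sequence",
    'seqA2 {"count":10} my pretty sequence',
]
parsed = [parse_header(h) for h in headers]

# Equivalent of the -D pretty test: case-insensitive search in the definition tag.
selected = [
    identifier
    for identifier, annotations in parsed
    if re.search("pretty", annotations.get("definition", ""), re.IGNORECASE)
]
print(selected)
```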

Selection based on the annotations #

Selection based on any annotation #

The obigrep tool can also be used to select sequences based on their annotations. Annotations consist of all the tags and values added to each sequence header in the fasta / fastq file. For instance, if you have a sequence file with the following headers:

📄 five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga

Selecting sequences having a tag whatever its value #

The -A option allows for selecting sequences having the given attribute whatever its value. In the following example, all the sequences having the count attribute are selected.

obigrep -A "count" five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga

Four sequences are retained; seqB1 is excluded because it does not have the count tag.

Selecting sequences having a tag with a specific value #

The -a option selects sequences in which the given attribute is set to a value matching the provided regular pattern. In the following example, only the sequence seqA1, whose toto attribute contains the value titi, is selected.

obigrep -a toto="titi" five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat

Because the value is a regular pattern, the test can be less strict. For example, the following command selects all sequences whose toto attribute contains a value beginning with t (the ^ at the start of the expression).

obigrep -a toto="^t" five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga

The sequence seqC1 is excluded because its toto attribute contains the value foo, which does not begin with t, while seqB2 is excluded because it does not have a toto attribute.
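The difference between -A (the tag exists) and -a (the tag exists and its value matches a pattern) can be sketched in Python as follows. The dictionary reproduces only the count and toto tags of five_tags.fasta, and both helper functions are hypothetical names used for illustration.

```python
import re

# count and toto tags copied from five_tags.fasta (other tags omitted).
records = {
    "seqA1": {"count": 1, "toto": "titi"},
    "seqB1": {"toto": "tata"},
    "seqA2": {"count": 5, "toto": "tutu"},
    "seqC1": {"count": 15, "toto": "foo"},
    "seqB2": {"count": 25},
}

def has_attribute(annotations, key):
    """Equivalent of -A key: the tag is present, whatever its value."""
    return key in annotations

def attribute_matches(annotations, key, pattern):
    """Equivalent of -a key=pattern: the tag is present and its value matches."""
    return key in annotations and re.search(pattern, str(annotations[key])) is not None

with_count = [name for name, ann in records.items() if has_attribute(ann, "count")]
toto_titi = [name for name, ann in records.items() if attribute_matches(ann, "toto", "titi")]
toto_t = [name for name, ann in records.items() if attribute_matches(ann, "toto", "^t")]
print(with_count, toto_titi, toto_t)
```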

Selection based on the sequence abundances #

In amplicon sequencing experiments, a sequence may be observed many times. The obiuniq command can be used to dereplicate strictly identical sequences. The number of strictly identical sequence reads merged into a single sequence record is stored in the count annotation tag of that sequence record. It is common to filter out sequences that are too rare or too abundant, depending on the purpose of the experiment. There are two ways to select sequence records based on this count tag.

  • The --min-count or -c option, followed by a numeric argument, selects sequence records with a count greater than or equal to that argument.
  • The --max-count or -C option, followed by a numeric argument, selects sequence records with a count less than or equal to that argument.

Note

If the count tag is missing from a data set, it is assumed to be equal to 1.

obigrep -c 2 five_tags.fasta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga

This command removes singleton sequences (sequences observed only once): here seqA1, whose count tag is equal to 1, and seqB1, which has no count tag and is therefore assumed to have a count of 1.

The next command excludes from its results all the sequences occurring more than ten times.

obigrep -C 10 five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga

As usual, both options can be combined:

obigrep -c 2 -C 10 five_tags.fasta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
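The count-based selection, including the default of 1 for a missing count tag and the inclusive bounds, can be summarised by the following Python sketch. The select_by_count function is a hypothetical illustration, not the obigrep implementation.

```python
# count tags copied from five_tags.fasta; seqB1 has no count tag.
records = {
    "seqA1": {"count": 1},
    "seqB1": {},
    "seqA2": {"count": 5},
    "seqC1": {"count": 15},
    "seqB2": {"count": 25},
}

def select_by_count(records, min_count=None, max_count=None):
    """Keep records whose count lies within the inclusive bounds that are set."""
    kept = []
    for name, annotations in records.items():
        count = annotations.get("count", 1)  # a missing count tag is treated as 1
        if min_count is not None and count < min_count:
            continue
        if max_count is not None and count > max_count:
            continue
        kept.append(name)
    return kept

print(select_by_count(records, min_count=2))                # like -c 2
print(select_by_count(records, max_count=10))               # like -C 10
print(select_by_count(records, min_count=2, max_count=10))  # like -c 2 -C 10
```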

Selection based on taxonomic annotation #

Taxonomy-based selection is always performed on the taxid attribute of a sequence, even if the sequence carries other taxonomic information stored in other attributes such as scientific_name or family_taxid. To use taxonomy-based selection with obigrep, it is mandatory to load a taxonomy using the -t or --taxonomy option.

Selecting sequences belonging to a clade #

If you do not have a taxonomy dump already downloaded, you must first download one using the following obitaxonomy command. The taxonomy will be stored in a file named ncbitaxo.tgz. This compressed archive can later be supplied to other OBITools4 commands.

obitaxonomy --download-ncbi --out ncbitaxo.tgz

To select the sequences belonging to the Homo sapiens species, the first step is to extract the taxid corresponding to the species of interest from the downloaded taxonomy using the obitaxonomy command.

  • The -t option indicates the taxonomy to load
  • The --fixed option indicates that the query string is the exact name of the species, not a regular pattern.
  • The --rank species option restricts the search to taxa having the species taxonomic rank.
  • "Homo sapiens" is the query string used to match the taxonomy names.

The csvlook command is used to display the CSV output of obitaxonomy as a readable table.

obitaxonomy -t ncbitaxo.tgz --fixed --rank species "Homo sapiens" | csvlook -I
| taxid                             | parent                  | taxonomic_rank | scientific_name |
| --------------------------------- | ----------------------- | -------------- | --------------- |
| taxon:9606 [Homo sapiens]@species | taxon:9605 [Homo]@genus | species        | Homo sapiens    |

The obigrep option to select sequences belonging to a taxon is -r or --restrict-to-taxon. The option requires as argument the taxid of the clade of interest, here taxon:9606 for Homo sapiens.

obigrep -t ncbitaxo.tgz -r taxon:9606 five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta

Only sequences seqA1 and seqB1 are retained: seqA1 is annotated as Homo sapiens, and seqB1 as one of its subspecies, Homo sapiens neanderthalensis. Sequence seqA2 is not retained because it is annotated at the genus level as Homo and therefore does not belong to the Homo sapiens clade; the same holds for seqC1, annotated at the family level as Hominidae. The last sequence, seqB2, has no taxonomic annotation; it is therefore considered to be annotated at the root of the taxonomy and is not part of the Homo sapiens clade.

Excluding sequences belonging to a clade #

The -i option, or --ignore-taxon in its long form, performs the reverse selection of the -r option presented above. It retains only the sequences that do not belong to the clade whose taxid is passed as an argument.

obigrep -t ncbitaxo.tgz -i taxon:9606 five_tags.fasta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga

Here, only the sequences seqA2, seqC1 and seqB2 are retained, as none of them belongs to the Homo sapiens species.
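Clade membership amounts to walking up the taxonomy from the taxid of a sequence until either the target taxid or the root is reached. The Python sketch below illustrates this with a toy parent table; the table is a deliberate simplification (intermediate taxa of the real NCBI lineage are left out, and taxid 1 is assumed to be the root), and belongs_to is a hypothetical helper rather than an OBITools4 function.

```python
ROOT = 1  # assumed root taxid of the toy taxonomy

# Toy child -> parent table, simplified for illustration.
PARENT = {63221: 9606, 9606: 9605, 9605: 9604, 9604: ROOT}

# taxid of each sequence in five_tags.fasta; seqB2 has none.
TAXIDS = {"seqA1": 9606, "seqB1": 63221, "seqA2": 9605, "seqC1": 9604, "seqB2": None}

def belongs_to(taxid, clade):
    """True if taxid is the clade itself or one of its descendants."""
    current = ROOT if taxid is None else taxid  # no taxid: treated as the root
    while True:
        if current == clade:
            return True
        if current == ROOT:
            return False
        current = PARENT[current]

restricted = [name for name, taxid in TAXIDS.items() if belongs_to(taxid, 9606)]
ignored = [name for name, taxid in TAXIDS.items() if not belongs_to(taxid, 9606)]
print(restricted)  # analogue of -r taxon:9606
print(ignored)     # analogue of -i taxon:9606
```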

Keep only sequences with taxonomic information at a given rank #

A taxid, when associated with a taxonomy, not only provides information at its taxonomic rank, but also makes it possible to retrieve information at any higher rank. For example, from a species taxid, it is expected that by querying the taxonomy, it will be possible to retrieve the corresponding genus or family taxid. obigrep allows you to select sequences annotated by a taxid capable of providing information at a given taxonomic rank using the --require-rank option.

To retrieve all ranks defined by a taxonomy, it is possible to use the obitaxonomy command with the -l option.

obitaxonomy -t ncbitaxo.tgz -l | csvlook
| rank             |
| ---------------- |
| domain           |
| phylum           |
| class            |
| suborder         |
| subcohort        |
| superphylum      |
| subspecies       |
| varietas         |
| subgenus         |
| parvorder        |
| acellular root   |
| genotype         |
| subtribe         |
| subkingdom       |
| subfamily        |
| kingdom          |
| isolate          |
| superorder       |
| section          |
| subvariety       |
| genus            |
| serogroup        |
| tribe            |
| forma            |
| infraclass       |
| superclass       |
| serotype         |
| no rank          |
| family           |
| species group    |
| subclass         |
| infraorder       |
| pathogroup       |
| realm            |
| order            |
| biotype          |
| species subgroup |
| species          |
| strain           |
| clade            |
| cohort           |
| series           |
| cellular root    |
| morph            |
| subphylum        |
| forma specialis  |
| superfamily      |
| subsection       |

This allows us to check that the species rank is defined and to filter the five_tags.fasta test file to retain only sequences with information available at the species level.

obigrep -t ncbitaxo.tgz --require-rank species five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta

Only two sequences are selected by this command, because seqA1 is annotated at the species level, and seqB1 is annotated at the subspecies taxonomic rank, which allows for retrieving species level information.

seqA2 and seqC1 are discarded as they are annotated at genus and family levels, respectively. seqB2 is discarded as it is not taxonomically annotated and is therefore considered to be annotated at the root of the taxonomy.
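The rank requirement can be pictured as a search along the lineage of each taxid for a taxon of the requested rank. The Python sketch below uses a toy taxonomy (simplified parent table, assumed root taxid 1 with no rank) and a hypothetical has_rank helper; it only illustrates the principle described above.

```python
ROOT = 1  # assumed root taxid of the toy taxonomy

# Toy taxonomy: taxid -> (parent taxid, rank). Intermediate taxa are omitted.
TAXONOMY = {
    63221: (9606, "subspecies"),
    9606: (9605, "species"),
    9605: (9604, "genus"),
    9604: (ROOT, "family"),
    ROOT: (ROOT, "no rank"),
}

# taxid of each sequence in five_tags.fasta; seqB2 has none.
TAXIDS = {"seqA1": 9606, "seqB1": 63221, "seqA2": 9605, "seqC1": 9604, "seqB2": None}

def has_rank(taxid, rank):
    """True if the taxid or one of its ancestors has the requested rank."""
    current = ROOT if taxid is None else taxid  # no taxid: treated as the root
    while True:
        parent, current_rank = TAXONOMY[current]
        if current_rank == rank:
            return True
        if current == ROOT:
            return False
        current = parent

with_species = [name for name, taxid in TAXIDS.items() if has_rank(taxid, "species")]
print(with_species)  # analogue of --require-rank species
```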

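Keep only sequences annotated with valid taxids #

The --valid-taxid option retains only the sequences whose taxid attribute is known to the loaded taxonomy. In the six_invalid.fasta example file, seqD1 carries the taxid taxon:9607, which is not part of the NCBI taxonomy: obigrep emits a warning and discards it. seqB2, which has no taxid attribute, is discarded as well.

📄 six_invalid.fasta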
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctgcatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
tagctagctagctagctagctagctagcta
>seqD1 {"taxid":"taxon:9607"}
gctagctagctgacgatgcatgcgtaggtgcagttgcgta

obigrep -t ncbitaxo.tgz --valid-taxid six_invalid.fasta
WARN[0005] seqD1: Taxid: taxon:9607 is unknown from taxonomy (Taxid taxon:9607 is not part of the taxonomy NCBI Taxonomy) 
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctgcatgctagtgctagtcgatga

Selection based on the sequence #

Selection based on the sequence length #

Two options, -l (--min-length) and -L (--max-length), allow you to select sequences based on their length. A sequence is selected if its length is greater than or equal to the --min-length and less than or equal to the --max-length. If only one of these options is used, only the specified limit is applied.

In the five_tags.fasta file, one sequence is 27 base pairs (bp) long, two are 29 bp long and the last two are 30 bp long.

To select only sequences with a minimum length of 29 bp, the following command can be executed

obigrep -l 29 five_tags.fasta
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga

To select only sequences with a maximum length of 29 bp, the following command can be executed

obigrep -L 29 five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga

Because both limits are inclusive, the two 29 bp sequences are selected in both cases.
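The inclusive nature of the two length limits can be verified with a short Python sketch. The sequences are copied from five_tags.fasta; select_by_length is a hypothetical helper written for this illustration.

```python
# Sequences copied from five_tags.fasta.
SEQUENCES = {
    "seqA1": "cgatgctgcatgctagtgctagtcgat",
    "seqB1": "tagctagctagctagctagctagctagcta",
    "seqA2": "gtagctagctagctagctagctagctaga",
    "seqC1": "cgatgctccatgctagtgctagtcgatga",
    "seqB2": "cgatggctccatgctagtgctagtcgatga",
}

def select_by_length(sequences, min_length=None, max_length=None):
    """Keep sequences whose length lies within the inclusive limits that are set."""
    kept = []
    for name, sequence in sequences.items():
        if min_length is not None and len(sequence) < min_length:
            continue
        if max_length is not None and len(sequence) > max_length:
            continue
        kept.append(name)
    return kept

print({name: len(seq) for name, seq in SEQUENCES.items()})
print(select_by_length(SEQUENCES, min_length=29))  # like -l 29
print(select_by_length(SEQUENCES, max_length=29))  # like -L 29
```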

Selection based on the sequence #

Sequence records can be selected on the sequence itself. There are two pattern matching algorithms available, depending on the options used:

  • --sequence or -s : The pattern is a regular pattern used to match the sequence records. The pattern is not case-sensitive.
  • --approx-pattern : This option uses the same algorithm as obipcr and obimultiplex to locate primers. The description of the pattern follows the same grammar.

While regular patterns allow more complex expressions to describe the searched sequence, DNA patterns have the advantage of tolerating discrepancies between the pattern and the actual sequence (mismatches and indels). To set the number and the type of allowed errors, use the --pattern-error and the --allows-indels options.

The next example searches for sequences containing the pattern tgc at least twice, the two occurrences possibly being separated by any number of bases (.*). This can be expressed as the regular pattern: tgc.*tgc

obigrep -s 'tgc.*tgc' five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga

If we are interested in sequences matching the pattern gatgctgcat, but want to allow a certain number of errors, we can use the --approx-pattern option. Despite its name, this option does not allow any errors by default, so for a simple pattern like this one, the --approx-pattern and -s options are equivalent.

obigrep --approx-pattern gatgctgcat \
        five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat

obigrep -s gatgctgcat \
        five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat

However, --approx-pattern can be parameterized using the --pattern-error option. The following example allows two errors (differences) between the pattern and the matched sequence. Without a further option, these errors can only be substitutions. Thus, the value defined by --pattern-error is the maximum Hamming distance between the pattern and the matched sequence.

obigrep --approx-pattern gatgctgcat \
        --pattern-error 2 \
        five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga

By adding the --allows-indels option, obigrep also allows indels when matching the pattern. This means that it can match sequences where the differences between the pattern and the matched sequence are insertions or deletions. The insertion or deletion of a symbol counts as one error. Therefore, with --pattern-error 2 and --allows-indels, you can allow two mismatches, two insertions or deletions, or one mismatch and one indel. In this case, --pattern-error defines the maximum Levenshtein distance allowed between the pattern and the matched sequence.

obigrep --approx-pattern gatgctgcat \
        --pattern-error 2 \
        --allows-indels \
        five_tags.fasta
>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga
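The difference between the two error models can be illustrated with a small Python sketch. It is not the algorithm used by obigrep, obipcr or obimultiplex: it uses a plain sliding window for the Hamming distance and a textbook dynamic-programming search for the Levenshtein distance, it only looks at the strand as written, and it ignores any special symbols of the DNA pattern grammar. Both function names are hypothetical.

```python
def min_hamming_errors(pattern, text):
    """Fewest substitutions needed to match pattern against any window of text."""
    width = len(pattern)
    return min(
        (
            sum(p != t for p, t in zip(pattern, text[start:start + width]))
            for start in range(len(text) - width + 1)
        ),
        default=width,  # text shorter than the pattern: no window to compare
    )

def min_edit_errors(pattern, text):
    """Fewest substitutions, insertions or deletions needed to match pattern
    against any substring of text (approximate substring search)."""
    # previous[i]: cost of matching the first i pattern symbols before reading text.
    previous = list(range(len(pattern) + 1))
    best = previous[-1]
    for symbol in text:
        current = [0]  # a match may start at any position of the text for free
        for i, p in enumerate(pattern, start=1):
            current.append(min(
                previous[i - 1] + (p != symbol),  # match or substitution
                previous[i] + 1,                  # extra symbol in the text
                current[i - 1] + 1,               # pattern symbol missing in the text
            ))
        best = min(best, current[-1])
        previous = current
    return best

PATTERN = "gatgctgcat"
SEQUENCES = {
    "seqA1": "cgatgctgcatgctagtgctagtcgat",
    "seqC1": "cgatgctccatgctagtgctagtcgatga",
    "seqB2": "cgatggctccatgctagtgctagtcgatga",
}

for name, sequence in SEQUENCES.items():
    print(name, min_hamming_errors(PATTERN, sequence), min_edit_errors(PATTERN, sequence))
```

With a limit of two errors, seqA1 and seqC1 pass under both models, whereas seqB2 only passes once indels are allowed, which is consistent with the three obigrep results above.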

Defining your own predicate #

You can define your own predicate to filter your data set. A predicate is an expression that, when evaluated, returns a logical value of true or false. The predicate is defined with the --predicate (-p) option using the OBITools4 expression language. The predicate is evaluated on each sequence in the data set. Sequences that result in a True value are retained in the result, while those that result in a False value are discarded.

As a first example, the following command, which filters out all sequences whose count annotation is less than 2 or greater than 10

obigrep -c 2 -C 10 five_tags.fasta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga

can be substituted by:

obigrep -p 'sequence.Count() >= 2 && sequence.Count() <= 10' five_tags.fasta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga

Working with paired sequence files #

OBITools4 can handle paired sequence files. This means that it will process the paired sequences in the two different files together, and in particular for obigrep it will apply the same filtering to both sequence files. This ensures that in the result files, each sequence is still paired with its correct counterpart sequence.

The most important option for manipulating paired sequence files is the --paired-with option. This option allows you to specify the name of a file containing the sequences to be paired with those in the main sequence file.

Synopsis #

obigrep [--allows-indels] [--approx-pattern <PATTERN>]...
        [--attribute|-a <KEY=VALUE>]... [--batch-size <int>] [--compress|-Z]
        [--debug] [--definition|-D <PATTERN>]... [--ecopcr] [--embl]
        [--fasta] [--fasta-output] [--fastq] [--fastq-output]
        [--force-one-cpu] [--genbank] [--has-attribute|-A <KEY>]...
        [--help|-h|-?] [--id-list <FILENAME>] [--identifier|-I <PATTERN>]...
        [--ignore-taxon|-i <TAXID>]... [--input-OBI-header]
        [--input-json-header] [--inverse-match|-v] [--json-output]
        [--max-count|-C <COUNT>] [--max-cpu <int>] [--max-length|-L <LENGTH>]
        [--min-count|-c <COUNT>] [--min-length|-l <LENGTH>] [--no-order]
        [--no-progressbar] [--only-forward] [--out|-o <FILENAME>]
        [--output-OBI-header|-O] [--output-json-header]
        [--paired-mode <forward|reverse|and|or|andnot|xor>]
        [--paired-with <FILENAME>] [--pattern-error <int>] [--pprof]
        [--pprof-goroutine <int>] [--pprof-mutex <int>]
        [--predicate|-p <EXPRESSION>]... [--require-rank <RANK_NAME>]...
        [--restrict-to-taxon|-r <TAXID>]... [--save-discarded <FILENAME>]
        [--sequence|-s <PATTERN>]... [--skip-empty] [--solexa]
        [--taxonomy|-t <string>] [--version] [<args>]

Options #

Selecting sequence records #

Selection based on the sequence #
Strict matching #
  • --sequence | -s <PATTERN>: A regular expression pattern used to match the sequence. Only the entries whose sequence matches the pattern are kept. Regular expression patterns are case-insensitive.
Approximate matching #
  • --approx-pattern <PATTERN>: A DNA pattern used to match the sequence. Only the entries whose sequence matches the pattern are kept. DNA patterns are case-insensitive. They can be matched allowing for errors: mismatches or insertions or deletions.
  • --allows-indels: allows for indels during DNA pattern matching (see --approx-pattern option).
  • --pattern-error <INTEGER>: maximum number of errors allowed when searching for patterns in DNA (default 0, see --approx-pattern option).
Selection based on the sequence identifier #
  • --identifier | -I <REGEX>: Regular expression pattern to be tested against the sequence identifier. The pattern is case-insensitive.
  • --id-list <FILENAME>: points to a text file containing the list of sequence record identifiers to be selected. The file format consists of a single identifier per line.
Selection based on the sequence definition #
  • --definition | -D <REGEX>: Regular expression pattern to be tested against the sequence definition. The pattern is case-insensitive.
Selection based on the sequence properties #
  • --min-count | -c <COUNT>: selects the sequence records for which the number of occurrences (i.e. the count attribute) is equal to or greater than the defined minimum count.
  • --max-count | -C <COUNT>: selects the sequence records for which the number of occurrences (i.e. the count attribute) is equal to or smaller than the defined maximum count.
  • --min-length | -l <LENGTH>: selects the sequence records for which the sequence length is equal to or greater than the defined minimum sequence length.
  • --max-length | -L <LENGTH>: selects sequence records for which the sequence length is equal to or less than the defined maximum sequence length.

Matching the sequence annotations #

Taxonomy based filtering #

If the user specifies a taxonomy when calling OBITools (see --taxonomy option), it is possible to filter the sequences based on taxonomic properties. Each of the following options can be used multiple times if needed to specify multiple taxids or ranks.

  • --restrict-to-taxon | -r <TAXID>: Only sequences having a taxid belonging to the provided taxid are retained.
  • --ignore-taxon | -i <TAXID>: Sequences having a taxid belonging to the provided taxid are discarded.
  • --require-rank <RANK_NAME>: Only sequences having a taxid able to provide information at the <RANK_NAME> level are retained. As an example, the NCBI taxid 74635 corresponding to Rosa canina is able to provide information at the species, genus or family level. But taxid 3764 (Rosa genus) is not able to provide information at the species level. Many of the taxids related to environmental samples have a partial classification, and a taxon at the species level is not always connected to a taxon at the genus level as parent. It can sometimes be connected to a taxon at a higher level.

Controlling the input data #

OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step.
The file format options #
  • --fasta: indicates that sequence data is in fasta format.
  • --fastq: indicates that sequence data is in fastq format.
  • --embl: indicates that sequence data is in EMBL-ENA flatfile format.
  • --csv: indicates that sequence data is in CSV format.
  • --genbank: indicates that sequence data is in GenBank flatfile format.
  • --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.
Controlling the way OBITools4 formats annotations #
These options only apply to the FASTA and FASTQ formats
  • --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
  • --input-json-header: FASTA/FASTQ title line annotations follow the JSON format.
Controlling quality score decoding #
This option only applies to the FASTQ format
  • --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)

Controlling the output data #

  • --compress | -Z : output is compressed using gzip. (default: false)
  • --no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation.
  • --fasta-output: writes sequence data in fasta format (default if quality data is not available).
  • --fastq-output: writes sequence data in fastq format (default if quality data is available).
  • --json-output: writes sequence data in JSON format.
  • --out | -o <FILENAME>: filename used for saving the output (default: “-”, the standard output)
  • --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON).
  • --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format).
  • --skip-empty: sequences of length equal to zero are removed from the output (default: false).
  • --no-progressbar: deactivates progress bar display (default: false).

General options #

  • --help | -h|-? : shows this help.
  • --version: prints the version and exits.
  • --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.
  • --max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
  • --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
  • --batch-size <INTEGER>: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE)
  • --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
  • --pprof: enables pprof server. Look at the log for details. (default: false).
  • --pprof-mutex <INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
  • --pprof-goroutine <INTEGER>: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)

Examples #

obigrep --help