`obigrep`: filter a sequence file #

Description #

obigrep is a tool for selecting a subset of sequences based on a set of criteria. Sequences from the input dataset that match all the criteria are retained and printed in the result, while other sequences are discarded.

Selection criteria can be based on different aspects of the sequence data, such as

The sequence identifier (ID)
The sequence annotations
The sequence itself

Selection based on sequence identifier (ID) #

There are two ways of selecting sequences according to their identifier:

Using a regular pattern with option -I
Using a list of identifiers (IDs) provided in a file with option --id-list

Consider the following five-sequence sample file:

📄 five_ids.fasta

>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1
tagctagctagctagctagctagctagcta
>seqA2
gtagctagctagctagctagctagctaga
>seqC1
cgatgctgcatgctagtgctagtcgatga
>seqB2
tagctagctagctagctagctagctagcta

To select sequences with IDs “seqA1” and “seqB1”, you can use the command

obigrep -I '^seq[AB]1$' five_ids.fasta

>seqA1 
cgatgctgcatgctagtgctagtcgat
>seqB1 
tagctagctagctagctagctagctagcta

The regular pattern ^seq[AB]1$ breaks down as follows:

^ at the beginning means that the identifier must start with the pattern
seq matches exactly that string
[AB] matches any single character in the set {A, B}
1 matches exactly that character
$ at the end means that the identifier must end with the pattern

If the starting ^ had been omitted, the pattern would have matched any sequence ID containing “seq” followed by a character from the set {A, B} and ending with “1”, for example the IDs my_seqA1 or my_seqB1 would have been selected.

If the ending $ had been omitted, the pattern would have matched any sequence ID starting with “seq”, followed by a character from the set {A, B} and then by “1”, whatever comes after. For example, the IDs seqA102 or seqB1023456789 would have been selected.
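
The effect of the two anchors can be checked outside obigrep. The following is a minimal Python sketch that uses the standard re module as a stand-in. The identifier list and the select_ids helper are illustrative assumptions, and the regular-expression engine used by obigrep may differ in details beyond the simple constructs shown here.

```python
import re

# Hypothetical identifiers, including the counter-examples discussed above.
ids = ["seqA1", "seqB1", "seqA2", "my_seqA1", "seqA102"]


def select_ids(pattern, identifiers):
    """Keep the identifiers in which the pattern is found (unanchored search)."""
    regex = re.compile(pattern)
    return [ident for ident in identifiers if regex.search(ident)]


anchored = select_ids(r"^seq[AB]1$", ids)  # both anchors
no_start = select_ids(r"seq[AB]1$", ids)   # leading ^ omitted
no_end = select_ids(r"^seq[AB]1", ids)     # trailing $ omitted

print(anchored)
print(no_start)
print(no_end)
```

Without the leading ^, my_seqA1 is also selected; without the trailing $, seqA102 is also selected.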

Another solution to extract these sequence IDs would be to use a text file containing them, one per line, as follows

📄 seqAB.txt

seqA1
seqB1

This seqAB.txt file can then be passed to obigrep as the list of identifiers to select:

obigrep --id-list seqAB.txt five_ids.fasta

>seqA1 
cgatgctgcatgctagtgctagtcgat
>seqB1 
tagctagctagctagctagctagctagcta
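
The logic of a selection by identifier list can be sketched in a few lines of Python. This is only an illustration of the principle, not the obigrep implementation. The helper names are hypothetical, and ignoring blank lines in the list is an assumption.

```python
def read_id_list(text):
    """Return the set of identifiers listed one per line.

    Blank lines are skipped (an assumption made for this sketch).
    """
    return {line.strip() for line in text.splitlines() if line.strip()}


def filter_by_ids(records, wanted):
    """Keep the (identifier, sequence) pairs whose identifier is wanted."""
    return [(ident, seq) for ident, seq in records if ident in wanted]


# Content of seqAB.txt and of five_ids.fasta, written inline for the example.
id_list_text = "seqA1\nseqB1\n"
records = [
    ("seqA1", "cgatgctgcatgctagtgctagtcgat"),
    ("seqB1", "tagctagctagctagctagctagctagcta"),
    ("seqA2", "gtagctagctagctagctagctagctaga"),
    ("seqC1", "cgatgctgcatgctagtgctagtcgatga"),
    ("seqB2", "tagctagctagctagctagctagctagcta"),
]

selected = filter_by_ids(records, read_id_list(id_list_text))
for ident, seq in selected:
    print(f">{ident}\n{seq}")
```

Unlike the regular pattern, the list is compared for exact equality, so no anchoring is needed.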

Selection based on sequence definition #

Each sequence record can have a sequence definition describing the sequence. In fasta or fastq format, this definition is found in the header of each sequence record, starting from the second word (the first being the sequence id), or after the annotations between braces in the OBITools4 extended version of these formats.

📄 three_def.fasta

>seqA1
cgatgctgcatgctagtgctagtcgat
>seqB1 my beautiful sequence
tagctagctagctagctagctagctagcta
>seqA2 {"count":10} my pretty sequence
gtagctagctagctagctagctagctaga

In the three_def.fasta example file:

seqA1 has no definition
seqB1 definition is my beautiful sequence
seqA2 definition is my pretty sequence
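
The way a header line splits into identifier, annotations and definition can be illustrated with the following Python sketch. It is a simplified reading of the rule described above, not the OBITools4 parser, and split_header is a hypothetical helper name.

```python
import json


def split_header(header):
    """Split a fasta header line into (identifier, annotations, definition)."""
    body = header.lstrip(">").strip()
    ident, _, rest = body.partition(" ")
    rest = rest.strip()
    annotations = {}
    if rest.startswith("{"):
        # raw_decode parses the leading JSON object and reports where it ends,
        # so whatever follows the closing brace is the definition.
        annotations, end = json.JSONDecoder().raw_decode(rest)
        rest = rest[end:].strip()
    return ident, annotations, rest


for line in (
    ">seqA1",
    ">seqB1 my beautiful sequence",
    '>seqA2 {"count":10} my pretty sequence',
):
    print(split_header(line))
```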

The -D or --definition option lets you specify a regular pattern to select only those sequences whose definition matches the pattern. The example below selects sequences whose definition contains the word pretty.

obigrep -D pretty three_def.fasta

>seqA2 {"count":10,"definition":"my pretty sequence"}
gtagctagctagctagctagctagctaga

As you can see in the results, all the OBITools4 commands store the definition present in the original file in a new annotation tag called definition. It is therefore this tag that the -D option actually tests.

Selection based on the annotations #

Selection based on any annotation #

The obigrep tool can also be used to select sequences based on their annotations. Annotations consist of all the tags and values added to each sequence header in the fasta / fastq file. For instance, consider a sequence file with the following headers:

📄 five_tags.fasta

>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga

Selecting sequences having a tag whatever its value #

The -A option allows for selecting sequences having the given attribute whatever its value. In the following example, all the sequences having the count attribute are selected.

obigrep -A "count" five_tags.fasta

>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga

Only four sequences are retained; seqB1 is excluded because it does not have the count tag.

Selecting sequences having a tag with a specific value #

The -a option selects sequences in which the given attribute has a value matching the provided regular pattern. In the following example, only the sequence seqA1, whose toto attribute contains the value titi, is selected.

obigrep -a toto="titi" five_tags.fasta

>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat

As the value is a regular pattern, the match can be less strict. For example, the following command selects all sequences whose toto attribute has a value beginning with t (the ^ anchors the pattern at the start of the value).

obigrep -a toto="^t" five_tags.fasta

>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga

The sequence seqC1 is excluded because its toto attribute contains the value foo, which does not begin with t, while seqB2 is excluded because it does not have a toto attribute.
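
The two attribute tests can be modelled in Python as follows. This is a sketch under stated assumptions: the annotations of five_tags.fasta are written as dictionaries (the taxid tags are left out for brevity), and has_attribute and attribute_matches are hypothetical helper names, not part of OBITools4.

```python
import re

# Annotations of five_tags.fasta, without the taxid tags.
records = {
    "seqA1": {"count": 1, "tata": "bar", "toto": "titi"},
    "seqB1": {"tata": "bar", "toto": "tata"},
    "seqA2": {"count": 5, "tata": "foo", "toto": "tutu"},
    "seqC1": {"count": 15, "tata": "foo", "toto": "foo"},
    "seqB2": {"count": 25, "tata": "bar"},
}


def has_attribute(annotations, key):
    """Presence test: the attribute exists, whatever its value."""
    return key in annotations


def attribute_matches(annotations, key, pattern):
    """Value test: the attribute exists and its value matches the pattern."""
    if key not in annotations:
        return False
    return re.search(pattern, str(annotations[key])) is not None


with_count = [name for name, a in records.items() if has_attribute(a, "count")]
toto_titi = [name for name, a in records.items() if attribute_matches(a, "toto", "titi")]
toto_t = [name for name, a in records.items() if attribute_matches(a, "toto", "^t")]

print(with_count)
print(toto_titi)
print(toto_t)
```

A record lacking the attribute fails the value test regardless of the pattern, which is why seqB2 never appears in the toto selections.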

Selection based on the sequence abundances #

In amplicon sequencing experiments, a sequence may be observed many times. The obiuniq command can be used to dereplicate strictly identical sequences. The number of strictly identical sequence reads merged into a single sequence record is stored in the count annotation tag of that sequence record. It is common to filter out sequences that are too rare or too abundant, depending on the purpose of the experiment. There are two ways to select sequence records based on this count tag.

The --min-count or -c option, followed by a numeric argument, selects sequence records with a count greater than or equal to that argument.
The --max-count or -C option, followed by a numeric argument, selects sequence records with a count less than or equal to that argument.

Note

If the count tag is missing from a sequence record, its count is assumed to be equal to 1.

obigrep -c 2 five_tags.fasta

>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga

This command removes singleton sequences (sequences observed only once): here, seqA1, whose count tag is equal to 1, and seqB1, which has no count tag.

The next command excludes from its results all the sequences occurring more than ten times.

obigrep -C 10 five_tags.fasta

>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga

As usual, both options can be combined:

obigrep -c 2 -C 10 five_tags.fasta

>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
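
The abundance rules, including the default count of 1 for records without a count tag, can be summarised by this small Python sketch. The function names are hypothetical and the counts are those of five_tags.fasta.

```python
def count_of(annotations):
    """A record without a count tag is treated as observed once."""
    return annotations.get("count", 1)


def keep_by_count(annotations, min_count=None, max_count=None):
    """Both bounds are inclusive; a bound left as None is not applied."""
    n = count_of(annotations)
    if min_count is not None and n < min_count:
        return False
    if max_count is not None and n > max_count:
        return False
    return True


records = {
    "seqA1": {"count": 1},
    "seqB1": {},
    "seqA2": {"count": 5},
    "seqC1": {"count": 15},
    "seqB2": {"count": 25},
}

at_least_2 = [n for n, a in records.items() if keep_by_count(a, min_count=2)]
at_most_10 = [n for n, a in records.items() if keep_by_count(a, max_count=10)]
between = [n for n, a in records.items() if keep_by_count(a, 2, 10)]

print(at_least_2)
print(at_most_10)
print(between)
```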

Selection based on taxonomic annotation #

Taxonomy-based selection is always performed on the taxid attribute of a sequence, even if the sequence carries other taxonomic information stored in other attributes such as scientific_name or family_taxid. To use taxonomy-based selection with obigrep, it is mandatory to load a taxonomy using the -t or --taxonomy option.

Selecting sequences belonging to a clade #

If you do not have a taxonomy dump already downloaded, you must first download one using the following obitaxonomy command. The taxonomy will be stored in a file named ncbitaxo.tgz. This compressed archive can be supplied to other OBITools4 commands at a later date.

obitaxonomy --download-ncbi --out ncbitaxo.tgz

To select the sequences belonging to the Homo sapiens species, the first step is to extract the taxid corresponding to the species of interest from the downloaded taxonomy using the obitaxonomy command.

The -t option indicates the taxonomy to load.
The --fixed option indicates that the query string is the exact name of the species, not a regular pattern.
The --rank species option restricts the search to taxa having the species taxonomic rank.
"Homo sapiens" is the query string matched against the taxon names.

The csvlook command formats the CSV output of obitaxonomy as a readable table.

obitaxonomy -t ncbitaxo.tgz --fixed --rank species "Homo sapiens" | csvlook -I

| taxid                             | parent                  | taxonomic_rank | scientific_name |
| --------------------------------- | ----------------------- | -------------- | --------------- |
| taxon:9606 [Homo sapiens]@species | taxon:9605 [Homo]@genus | species        | Homo sapiens    |

The obigrep option to select sequences belonging to a taxon is -r or --restrict-to-taxon. The option requires as its argument the taxid of the clade of interest, here taxon:9606 for Homo sapiens.

obigrep -t ncbitaxo.tgz -r taxon:9606 five_tags.fasta

>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta

Only sequences seqA1 and seqB1 are retained: they are annotated as belonging to the target clade Homo sapiens or to one of its subspecies, Homo sapiens neanderthalensis. Sequence seqA2 is not retained because it is annotated at genus level as Homo and therefore does not belong to the Homo sapiens clade; the same holds for seqC1, annotated at family level as Hominidae. The last sequence, seqB2, has no taxonomic annotation; it is therefore considered to be annotated at the root of the taxonomy and is not part of the Homo sapiens species clade.
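
Clade membership amounts to walking up the taxonomy from the taxid of the sequence until the target taxid or the root is reached. The Python sketch below illustrates this on a toy parent map. The map is a hypothetical simplification that omits the intermediate ranks of the real NCBI taxonomy, and belongs_to is an illustrative helper, not an OBITools4 function.

```python
ROOT = 1

# Toy parent map (child taxid -> parent taxid), simplified for the example.
PARENT = {
    63221: 9606,  # Homo sapiens neanderthalensis -> Homo sapiens
    9606: 9605,   # Homo sapiens -> Homo
    9605: 9604,   # Homo -> Hominidae (intermediate ranks omitted)
    9604: ROOT,   # Hominidae -> root (intermediate ranks omitted)
}


def belongs_to(taxid, clade):
    """True if clade is the taxid itself or one of its ancestors.

    A missing taxid (None) is treated as the root of the taxonomy.
    """
    current = ROOT if taxid is None else taxid
    while True:
        if current == clade:
            return True
        if current == ROOT:
            return False
        current = PARENT[current]


taxids = {"seqA1": 9606, "seqB1": 63221, "seqA2": 9605, "seqC1": 9604, "seqB2": None}
in_homo_sapiens = [name for name, t in taxids.items() if belongs_to(t, 9606)]
print(in_homo_sapiens)
```

The genus Homo contains Homo sapiens, not the other way round, which is why seqA2 fails the test even though its taxon is closely related.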

Excluding sequences belonging to a clade #

The -i option, or --ignore-taxon in its long form, performs the reverse selection of the -r option presented above. It retains only sequences that do not belong to the target clade whose taxid is passed as an argument.

obigrep -t ncbitaxo.tgz -i taxon:9606 five_tags.fasta

>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga

Here, only the sequences seqA2, seqC1 and seqB2 are retained, as none of them belongs to the Homo sapiens species.

Keep only sequences with taxonomic information at a given rank #

A taxid, when associated with a taxonomy, not only provides information at its taxonomic rank, but also makes it possible to retrieve information at any higher rank. For example, from a species taxid, it is expected that by querying the taxonomy, it will be possible to retrieve the corresponding genus or family taxid. obigrep allows you to select sequences annotated by a taxid capable of providing information at a given taxonomic rank using the --require-rank option.

To retrieve all ranks defined by a taxonomy, it is possible to use the obitaxonomy command with the -l option.

obitaxonomy -t ncbitaxo.tgz -l | csvlook

| rank             |
| ---------------- |
| domain           |
| phylum           |
| class            |
| suborder         |
| subcohort        |
| superphylum      |
| subspecies       |
| varietas         |
| subgenus         |
| parvorder        |
| acellular root   |
| genotype         |
| subtribe         |
| subkingdom       |
| subfamily        |
| kingdom          |
| isolate          |
| superorder       |
| section          |
| subvariety       |
| genus            |
| serogroup        |
| tribe            |
| forma            |
| infraclass       |
| superclass       |
| serotype         |
| no rank          |
| family           |
| species group    |
| subclass         |
| infraorder       |
| pathogroup       |
| realm            |
| order            |
| biotype          |
| species subgroup |
| species          |
| strain           |
| clade            |
| cohort           |
| series           |
| cellular root    |
| morph            |
| subphylum        |
| forma specialis  |
| superfamily      |
| subsection       |

This allows us to check that the species rank is defined and to filter the five_tags.fasta test file to retain only sequences with information available at the species level.

obigrep -t ncbitaxo.tgz --require-rank species five_tags.fasta

>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta

Only two sequences are selected by this command, because seqA1 is annotated at the species level, and seqB1 is annotated at the subspecies taxonomic rank, which allows for retrieving species level information.

seqA2 and seqC1 are discarded as they are annotated at genus and family levels, respectively. seqB2 is discarded as it is not taxonomically annotated and is therefore considered to be annotated at the root of the taxonomy.
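
The rank requirement can be pictured as a search along the lineage of each taxid: the sequence is kept when the taxon itself, or one of its ancestors, has the requested rank. The following Python sketch uses a toy taxonomy that is hypothetical and simplified (intermediate ranks omitted); taxon_at_rank is an illustrative name.

```python
# Toy taxonomy: taxid -> (rank, parent taxid), simplified for the example.
TAXA = {
    1: ("no rank", None),       # root
    9604: ("family", 1),        # Hominidae
    9605: ("genus", 9604),      # Homo
    9606: ("species", 9605),    # Homo sapiens
    63221: ("subspecies", 9606),  # Homo sapiens neanderthalensis
}


def taxon_at_rank(taxid, rank):
    """Return the taxid found at the given rank in the lineage, or None."""
    current = 1 if taxid is None else taxid  # no annotation: start at the root
    while current is not None:
        current_rank, parent = TAXA[current]
        if current_rank == rank:
            return current
        current = parent
    return None


taxids = {"seqA1": 9606, "seqB1": 63221, "seqA2": 9605, "seqC1": 9604, "seqB2": None}
with_species = [n for n, t in taxids.items() if taxon_at_rank(t, "species") is not None]
print(with_species)
```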

Keep only sequences annotated with valid taxids #

📄 six_invalid.fasta

>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctgcatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
tagctagctagctagctagctagctagcta
>seqD1 {"taxid":"taxon:9607"}
gctagctagctgacgatgcatgcgtaggtgcagttgcgta

obigrep -t ncbitaxo.tgz --valid-taxid six_invalid.fasta

WARN[0005] seqD1: Taxid: taxon:9607 is unknown from taxonomy (Taxid taxon:9607 is not part of the taxonomy NCBI Taxonomy)

>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctgcatgctagtgctagtcgatga

In this result, seqD1 is discarded because its taxid taxon:9607 is unknown to the loaded taxonomy, as reported by the warning. seqB2, which has no taxid annotation, is discarded as well.

Selection based on the sequence #

Selection based on the sequence length #

Two options, -l (--min-length) and -L (--max-length), allow you to select sequences based on their length. A sequence is selected if its length is greater than or equal to the --min-length value and less than or equal to the --max-length value. If only one of these options is used, only the specified limit is applied.

In five_tags.fasta, one sequence is 27 base pairs (bp) long, two are 29 bp long, and the remaining two are 30 bp long.

To select only sequences with a minimum length of 29 bp, the following command can be executed

obigrep -l 29 five_tags.fasta

>seqB1 {"tata":"bar","taxid":"taxon:63221 [Homo sapiens neanderthalensis]@subspecies","toto":"tata"}
tagctagctagctagctagctagctagcta
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga

To select only sequences with a maximum length of 29 bp, the following command can be executed

obigrep -L 29 five_tags.fasta

>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga

Note that the two 29 bp sequences are selected in both cases, because both limits are inclusive.
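
The inclusive bounds can be verified with a short Python sketch on the sequences of five_tags.fasta. The function name keep_by_length is hypothetical.

```python
sequences = {
    "seqA1": "cgatgctgcatgctagtgctagtcgat",
    "seqB1": "tagctagctagctagctagctagctagcta",
    "seqA2": "gtagctagctagctagctagctagctaga",
    "seqC1": "cgatgctccatgctagtgctagtcgatga",
    "seqB2": "cgatggctccatgctagtgctagtcgatga",
}


def keep_by_length(sequence, min_length=None, max_length=None):
    """Both limits are inclusive; a limit left as None is not applied."""
    n = len(sequence)
    if min_length is not None and n < min_length:
        return False
    if max_length is not None and n > max_length:
        return False
    return True


lengths = {name: len(seq) for name, seq in sequences.items()}
min_29 = [n for n, s in sequences.items() if keep_by_length(s, min_length=29)]
max_29 = [n for n, s in sequences.items() if keep_by_length(s, max_length=29)]

print(lengths)
print(min_29)
print(max_29)
```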

Selection based on a sequence pattern #

Sequence records can be selected on the sequence itself. There are two pattern matching algorithms available, depending on the options used:

--sequence or -s : The pattern is a regular pattern used to match the sequence records. The pattern is not case-sensitive.
--approx-pattern : This option uses the same algorithm as obipcr and obimultiplex to locate primers. The description of the pattern follows the same grammar.

While regular patterns allow more complex expressions for describing the searched sequence, DNA patterns have the advantage of tolerating discrepancies between the pattern and the actual sequence (mismatches and indels). To set the number and the type of allowed errors, use the --pattern-error and --allows-indels options.

In the next example, we search for sequences containing the pattern tgc at least twice, possibly separated by any number of bases (.*). This can be expressed as the regular pattern tgc.*tgc

obigrep -s 'tgc.*tgc' five_tags.fasta

>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga

If we are interested in sequences matching the pattern gatgctgcat, but want to allow a certain number of errors, we can use the --approx-pattern option. Despite its name, this option does not allow any errors by default, so for a simple pattern like this one, the --approx-pattern and -s options are equivalent.

obigrep --approx-pattern gatgctgcat \
        five_tags.fasta

>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat

obigrep -s gatgctgcat \
        five_tags.fasta

>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat

However, --approx-pattern can be parameterized using the --pattern-error option. The following example allows two errors (differences) between the pattern and the matched sequence. Without a further option, these errors can only be substitutions. Thus, the value defined by --pattern-error is the maximum Hamming distance between the pattern and the matched sequence.

obigrep --approx-pattern gatgctgcat \
        --pattern-error 2 \
        five_tags.fasta

>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga

By adding the --allows-indels option, obigrep also accepts indels: the differences between the pattern and the matched sequence can then be insertions or deletions as well as substitutions. The insertion or deletion of a symbol counts as one error. Therefore, with --pattern-error 2 and --allows-indels, you can allow two mismatches, two insertions or deletions, or one mismatch and one indel. In this case, --pattern-error defines the maximum Levenshtein distance allowed between the pattern and the matched sequence.

obigrep --approx-pattern gatgctgcat \
        --pattern-error 2 \
        --allows-indels \
        five_tags.fasta

>seqA1 {"count":1,"tata":"bar","taxid":"taxon:9606 [Homo sapiens]@species","toto":"titi"}
cgatgctgcatgctagtgctagtcgat
>seqC1 {"count":15,"tata":"foo","taxid":"taxon:9604 [Hominidae]@family","toto":"foo"}
cgatgctccatgctagtgctagtcgatga
>seqB2 {"count":25,"tata":"bar"}
cgatggctccatgctagtgctagtcgatga
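
The difference between the two error models can be illustrated with a brute-force Python sketch. It is not the algorithm used by obigrep, only a plain implementation of the two distances: hamming_match slides a window and counts substitutions, while edit_match uses the classical dynamic programming for approximate substring search, in which a match may start anywhere in the sequence. Both function names are hypothetical.

```python
def hamming_match(pattern, sequence, max_errors):
    """True if some window of the sequence differs from the pattern by at
    most max_errors substitutions (no insertion or deletion)."""
    pattern, sequence = pattern.lower(), sequence.lower()
    width = len(pattern)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if sum(a != b for a, b in zip(pattern, window)) <= max_errors:
            return True
    return False


def edit_match(pattern, sequence, max_errors):
    """True if some substring of the sequence is within max_errors edits
    (substitutions, insertions or deletions) of the pattern."""
    pattern, sequence = pattern.lower(), sequence.lower()
    # previous[i]: fewest edits aligning pattern[:i] with a substring ending
    # just before the current base. The first cell stays 0 for every base,
    # which lets a match start at any position.
    previous = list(range(len(pattern) + 1))
    for base in sequence:
        current = [0]
        for i, symbol in enumerate(pattern, start=1):
            current.append(min(
                previous[i - 1] + (symbol != base),  # match or substitution
                previous[i] + 1,                     # extra base in the sequence
                current[i - 1] + 1,                  # pattern symbol skipped
            ))
        if current[-1] <= max_errors:
            return True
        previous = current
    return False


PATTERN = "gatgctgcat"
SEQ_A1 = "cgatgctgcatgctagtgctagtcgat"
SEQ_C1 = "cgatgctccatgctagtgctagtcgatga"
SEQ_B2 = "cgatggctccatgctagtgctagtcgatga"

print(hamming_match(PATTERN, SEQ_C1, 2))  # one substitution is enough
print(hamming_match(PATTERN, SEQ_B2, 2))  # substitutions alone do not suffice
print(edit_match(PATTERN, SEQ_B2, 2))     # one inserted base plus one substitution
```

In seqB2, the extra g after gatg shifts every following base, so a substitution-only comparison sees many mismatches where an alignment with one indel sees a single error.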

Defining your own predicate #

You can create your own predicate to filter your dataset. A predicate is an expression that returns a logical value of true or false when evaluated. It is defined using the --predicate (-p) option and the OBITools4 expression language. The predicate is evaluated on each sequence in the dataset. Sequences that result in a true value are retained in the result, while those that result in a false value are discarded.

The following command, for example, filters out all sequences with a count annotation of less than 2 or greater than 10.

obigrep -c 2 -C 10 five_tags.fasta

>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga

The same filter can be written as a predicate:

obigrep -p 'sequence.Count() >= 2 && sequence.Count() <= 10' five_tags.fasta

>seqA2 {"count":5,"tata":"foo","taxid":"taxon:9605 [Homo]@genus","toto":"tutu"}
gtagctagctagctagctagctagctaga

The OBITools4 expression language provides min and max functions. These functions extract the minimum and maximum values from a map or vector, respectively.

In the file some_uniq_seq.fasta, the merged_sample tag on each sequence indicates how the corresponding reads are distributed among samples.

📄 some_uniq_seq.fasta

>Seq_1 {"count":2,"merged_sample":{"15a_F730814":1,"29a_F260619":1}}
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
agctyaaaactcaaaggacttggcggtgctttataccctt
>Seq_2 {"count":22,"merged_sample":{"15a_F730814":12,"29a_F260619":10}}
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
atcttaaaactcaaaggacttggcggtgctttataccctt
>Seq_3 {"count":22,"merged_sample":{"15a_F730814":15,"29a_F260619":7}}
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcgat
agcttaaaactcaaaggacttggcggtgctttataccctt

It is possible to extract the contingency table from this file using the obimatrix command. The --transpose option transposes the matrix so that sequences are in rows and samples are in columns.

obimatrix --transpose some_uniq_seq.fasta \
  | csvtomd

id     |  15a_F730814  |  29a_F260619
-------|---------------|-------------
Seq_1  |  1            |  1
Seq_2  |  12           |  10
Seq_3  |  15           |  7

To select sequences that occur at least ten times in a sample, you have to determine the maximum value of the merged_sample tag and compare it to the value ten.

This can be done using a predicate expression:

obigrep -p 'max(annotations.merged_sample) >= 10' some_uniq_seq.fasta

>Seq_2 {"count":22,"merged_sample":{"15a_F730814":12,"29a_F260619":10}}
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
atcttaaaactcaaaggacttggcggtgctttataccctt
>Seq_3 {"count":22,"merged_sample":{"15a_F730814":15,"29a_F260619":7}}
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcgat
agcttaaaactcaaaggacttggcggtgctttataccctt

As you can see from the results, Seq_1 is discarded because it does not occur at least ten times in any sample: its maximum number of occurrences in a single sample is 1.
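
The evaluation of such a predicate can be mimicked in Python to make the logic explicit. This sketch is written in Python, not in the OBITools4 expression language: the annotations are plain dictionaries and the predicate function is a hypothetical equivalent of the expression passed to -p.

```python
# Annotations of some_uniq_seq.fasta.
records = {
    "Seq_1": {"count": 2, "merged_sample": {"15a_F730814": 1, "29a_F260619": 1}},
    "Seq_2": {"count": 22, "merged_sample": {"15a_F730814": 12, "29a_F260619": 10}},
    "Seq_3": {"count": 22, "merged_sample": {"15a_F730814": 15, "29a_F260619": 7}},
}


def predicate(annotations):
    """True when the sequence occurs at least ten times in some sample."""
    return max(annotations["merged_sample"].values()) >= 10


# A record is retained when the predicate evaluates to true.
kept = [name for name, annotations in records.items() if predicate(annotations)]
print(kept)
```

Note that the total count of a sequence is not what is tested here: the maximum is taken over the per-sample values of the merged_sample map.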

Working with paired sequence files #

OBITools4 can handle paired sequence files. This means that it processes the paired sequences in the two files together. In particular, obigrep applies the same filtering to both files. This ensures that each sequence in the result files is paired with its correct counterpart.

The most important option for manipulating paired sequence files is the --paired-with option. This allows you to specify the name of a file containing sequences to be paired with those in the main sequence file. Since an OBITools4 command that processes paired sequences produces two paired result files, the standard output cannot be used to store the results. Instead, you must use the --out option to specify where the results should be written.

Considering the two paired input files:

📄 forward.fastq

@M01334:147:000000000-LBRVD:1:1101:14968:1570 1:N:0:CTCACCAA+CTAGGCAA
TGTTCCACGGGCAATCCTGAGCCAAATCTTTCATTTTGAAAAAATGAGAGATATAATGTATCTCTTATTTATTATAAGAAATAAAATATTTCTTATCTAATATTAAAGTTAGGTGCAGAGACTCAATGGGTGGAACTAGATCGGATGTGCA
+
11>A>@3@A11>ACFFEG110BFB00BAFGHE2DFGG201110/B11111/D1D2222D2FDFDFGDGHHBGG2F222110D11@1D1FGHFHGFF@GE1F2FG22112B220F1@111/0>BF11B210B>//11B1<1BB<///<1122
@M01334:147:000000000-LBRVD:1:1101:15946:1586 1:N:0:CTCACCAA+CTAGGCAA
TCCTAACCCCATTGAGTCTCTGCACCTATCTTTAATATTAGATAAGAAATATTTTATTTCTTATAATAAATAAGAGATATTTTATATCTCTCATTTTTTCAAAATGAAAGATTTGGCTCAGGATTGCCCACGTAACGGAGATCGGAAGAGC
+
1>>A111>>>AFGGB1FFGFGFF3BBF1GGHHH33D2GH2B1D211110D1DGHHBFGGGGG2FA2F221F21A1F0D1DGHH2FAFFGFHFFGHHHHGG22@1BD111@0FFHE11GC1001BGF1B1B/EF00??////BF////<000
@M01334:147:000000000-LBRVD:1:1101:15399:1590 1:N:0:CTCACCAA+CTAGGCAA
TGTTCCACCCATTGAGTCTCTGCACCTATCTTTAATATTAGATAAGAAATATTTTACTTCTTATAATAAATAAGAGTTATTTTATATCTCTCATTTTTTCAAAATGAAAGATTTGGCTCAGGATTGCCCGTGGAACTAGATCGGAAGAGCA
+
11>A>@3B>>1CF111BBFAG3A3AAF1FFGHHF3FBGH221F211110D1DGHH2BBGBFF2F22D221D211111A2DDGG2F2FFFEGD1FFHHHGFD221B111110BFGD11F@1001BF0@@1/EA//1>F1B1FD/////00<1
@M01334:147:000000000-LBRVD:1:1101:13773:1687 1:N:0:CTCACCAA+CTAGGCAA
CTCGGATCACCATTGAGTCTCTGCACCTATCTTTAATATTAGATAAGAAAAAATATTATTTCTTATCTGAAATAAGAAATATTTTATATATTTCTTTTTCTCAAAATGAAAGATTTGGCTCAGGATTGCCCTGATCCGAGGGATAGCACCA
+
3AAAAAADFFFFGGGGFGGGGGHHHHHHFHHHHHHHHGHHHHGHGGHFFHHHCGFHHHHHHHHHHHHHGHHGGFHFFHHHGHHHHBHHHGHHHHHHHHHHHHHFFHHFBDFBCGHHF4BGHFGFFHHBDGFHHEHHFAAEECEGF3FDGFC

📄 reverse.fastq

@M01334:147:000000000-LBRVD:1:1101:14968:1570 2:N:0:CTCACCAA+CTAGGCAA
TTTTCCTCCCTTTTTTTCTCTGCACCTTTCTTTTTTATTAGTTTTTTATTATTTTTTTTCTTTTTTTATTTTATTGATACTTTATATCTCTCTTTTTTTCTTTTTTATTGATTTTTCTCTGGTTTTCCCTTGTTACTTGTTCTTTTTTGCT
+
11>>1131111BB111A0B3B313A0B1BAFGG11E/DG222B22///1D2DDGG1AE>>FG1D1/>/12B221212@21BFD2B2B2B2F11BFGHEEC1111B//1212BBF110@22111@@/2111?01111@111?111111--11
@M01334:147:000000000-LBRVD:1:1101:15946:1586 2:N:0:CTCACCAA+CTAGGCAA
CCGTTACGTGGGCAATCCTGAGCCAATTCTTTCTTTTTGAAAAAATGAGAGATATAAAATATCTCTTATTTATTATAAGAAATAAAATATTTCTTATCTAATATTAATGATAGGTGCAGTGACTCTATGGGGTTAGGTAGTTCGGATGAGC
+
111>>111B111111BA0B1101B001BAGGH22DGGH?01110/B11111/D1D2221D1DBEDGH1GHH2GG2F222110D@111D1DFGEGFBG@GB1B2FG22222B220B11111111B@11B210/?E/00B211B2/////111
@M01334:147:000000000-LBRVD:1:1101:15399:1590 2:N:0:CTCACCAA+CTAGGCAA
TTTTCCTCGGGCTATCCTGAGCCAAATCTTTCCTTTTGAAAAATTTAGAGATATAAAATATCTCTTATTTATTTTATGTAGTATTATATTTCTTATCTAATATTAAATTTAGTTGCTTTTTCTCATTTTGTTTTACTTTTTCTTTTTTGCT
+
11>>1131111111B11B1101A000B1DFF21DDFG1011100B122111D1D2221D1DADAFG1DGH2FG2D212222D2222D2DAF2FG2D@F21B2DE22122B221@11111110B222B222B00021B221B011111//11
@M01334:147:000000000-LBRVD:1:1101:13773:1687 2:N:0:CTCACCAA+CTAGGCAA
TGATAGCAGGGCTATCCTGAGCCAAATCCGTGTTTTGAGAAAACAAGGGGGTTCTCGAACTAGAATACAAAAGAAAAGGATAGGTGCAGAGACTCAATGGTGCTATCCCTCGGATCAGGGCAATCCTTAGCCAAATCTTTCATTTTTTGAA
+
111>13@1111>11B1AF11BABC00B110BAFGGH0000DFAB//0///EEECGFA10AG1111D@@11100/0000/0F110B11@11/0>FC@1B>1B11FEFEC>E>///?<0110/?/FF<G22111@00@<GHHB>FHHH1///1

To keep only sequences starting with a t, use the following command:

obigrep -s '^t' \
        --paired-with reverse.fastq \
        --out start_t.fastq \
        forward.fastq

After running the obigrep command, if you check the directory contents, you will find two new files named start_t_R1.fastq and start_t_R2.fastq, in addition to the two input files, forward.fastq and reverse.fastq. These file names are created by adding the suffixes _R1 and _R2 to the start_t.fastq file name specified in the --out option. The start_t_R1.fastq file (suffix _R1) contains the reads from the main file (forward.fastq), while start_t_R2.fastq (suffix _R2) contains the reads from the file specified by the --paired-with option (reverse.fastq).

% ls -l
total 135568
-rw-r--r--@ 1 coissac  staff      1504 13 mai 18:09 forward.fastq
-rw-r--r--@ 1 coissac  staff      1504 13 mai 18:09 reverse.fastq
-rw-r-----@ 1 coissac  staff      1179 13 mai 18:14 start_t_R1.fastq
-rw-r-----@ 1 coissac  staff      1179 13 mai 18:14 start_t_R2.fastq

Inspecting the file start_t_R1.fastq makes the effect of obigrep clear. Every sequence starts with t.

📄 start_t_R1.fastq

@M01334:147:000000000-LBRVD:1:1101:14968:1570 {"definition":"1:N:0:CTCACCAA+CTAGGCAA"}
tgttccacgggcaatcctgagccaaatctttcattttgaaaaaatgagagatataatgtatctcttatttattataagaaataaaatatttcttatctaatattaaagttaggtgcagagactcaatgggtggaactagatcggatgtgca
+
11>A>@3@A11>ACFFEG110BFB00BAFGHE2DFGG201110/B11111/D1D2222D2FDFDFGDGHHBGG2F222110D11@1D1FGHFHGFF@GE1F2FG22112B220F1@111/0>BF11B210B>//11B1<1BB<///<1122
@M01334:147:000000000-LBRVD:1:1101:15946:1586 {"definition":"1:N:0:CTCACCAA+CTAGGCAA"}
tcctaaccccattgagtctctgcacctatctttaatattagataagaaatattttatttcttataataaataagagatattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccacgtaacggagatcggaagagc
+
1>>A111>>>AFGGB1FFGFGFF3BBF1GGHHH33D2GH2B1D211110D1DGHHBFGGGGG2FA2F221F21A1F0D1DGHH2FAFFGFHFFGHHHHGG22@1BD111@0FFHE11GC1001BGF1B1B/EF00??////BF////<000
@M01334:147:000000000-LBRVD:1:1101:15399:1590 {"definition":"1:N:0:CTCACCAA+CTAGGCAA"}
tgttccacccattgagtctctgcacctatctttaatattagataagaaatattttacttcttataataaataagagttattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccgtggaactagatcggaagagca
+
11>A>@3B>>1CF111BBFAG3A3AAF1FFGHHF3FBGH221F211110D1DGHH2BBGBFF2F22D221D211111A2DDGG2F2FFFEGD1FFHHHGFD221B111110BFGD11F@1001BF0@@1/EA//1>F1B1FD/////00<1

However, when we look at the file start_t_R2.fastq, the second sequence starts with a c. In fact, the obigrep constraint was only applied to the forward.fastq file. The sequences were selected from the reverse.fastq file because they are paired with one of the sequences selected from the forward.fastq file.

📄 start_t_R2.fastq

@M01334:147:000000000-LBRVD:1:1101:14968:1570 {"definition":"2:N:0:CTCACCAA+CTAGGCAA"}
ttttcctccctttttttctctgcacctttcttttttattagttttttattattttttttctttttttattttattgatactttatatctctctttttttcttttttattgatttttctctggttttcccttgttacttgttcttttttgct
+
11>>1131111BB111A0B3B313A0B1BAFGG11E/DG222B22///1D2DDGG1AE>>FG1D1/>/12B221212@21BFD2B2B2B2F11BFGHEEC1111B//1212BBF110@22111@@/2111?01111@111?111111--11
@M01334:147:000000000-LBRVD:1:1101:15946:1586 {"definition":"2:N:0:CTCACCAA+CTAGGCAA"}
ccgttacgtgggcaatcctgagccaattctttctttttgaaaaaatgagagatataaaatatctcttatttattataagaaataaaatatttcttatctaatattaatgataggtgcagtgactctatggggttaggtagttcggatgagc
+
111>>111B111111BA0B1101B001BAGGH22DGGH?01110/B11111/D1D2221D1DBEDGH1GHH2GG2F222110D@111D1DFGEGFBG@GB1B2FG22222B220B11111111B@11B210/?E/00B211B2/////111
@M01334:147:000000000-LBRVD:1:1101:15399:1590 {"definition":"2:N:0:CTCACCAA+CTAGGCAA"}
ttttcctcgggctatcctgagccaaatctttccttttgaaaaatttagagatataaaatatctcttatttattttatgtagtattatatttcttatctaatattaaatttagttgctttttctcattttgttttactttttcttttttgct
+
11>>1131111111B11B1101A000B1DFF21DDFG1011100B122111D1D2221D1DADAFG1DGH2FG2D212222D2222D2DAF2FG2D@F21B2DE22122B221@11111110B222B222B00021B221B011111//11
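
The behaviour observed above can be summarised by the following Python sketch: the pattern is tested on the forward read only, and the reverse read follows its mate. It is an illustration under stated assumptions, not the obigrep implementation. The reads are truncated to their first ten bases, the helper names are hypothetical, and the way output names are derived for unusual file names is a guess based on the single example shown.

```python
import re
from pathlib import Path


def paired_output_names(out_name):
    """Insert the _R1 and _R2 suffixes before the file extension."""
    path = Path(out_name)
    first = path.with_name(f"{path.stem}_R1{path.suffix}")
    second = path.with_name(f"{path.stem}_R2{path.suffix}")
    return str(first), str(second)


def filter_pairs(forward, reverse, pattern):
    """Keep the pairs whose forward sequence matches the pattern.

    The reverse read is never tested: it is kept or dropped with its mate.
    The match is case-insensitive.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    kept = [(f, r) for f, r in zip(forward, reverse) if regex.search(f[1])]
    return [f for f, _ in kept], [r for _, r in kept]


# (read name suffix, first ten bases) taken from forward.fastq and reverse.fastq.
forward = [("14968:1570", "TGTTCCACGG"), ("15946:1586", "TCCTAACCCC"),
           ("15399:1590", "TGTTCCACCC"), ("13773:1687", "CTCGGATCAC")]
reverse = [("14968:1570", "TTTTCCTCCC"), ("15946:1586", "CCGTTACGTG"),
           ("15399:1590", "TTTTCCTCGG"), ("13773:1687", "TGATAGCAGG")]

kept_forward, kept_reverse = filter_pairs(forward, reverse, "^t")
print(paired_output_names("start_t.fastq"))
print([name for name, _ in kept_forward])
print([seq for _, seq in kept_reverse])
```

The second retained reverse read starts with C: it is present only because its forward mate starts with T.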

The --paired-mode option can be used to specify how the obigrep filtering constraints are applied to both files. The option requires an argument that can take four different values:

forward: the selection rules apply only to the forward reads; the reverse reads are selected because they are paired with a selected forward read. This is the default behaviour presented above.
reverse: the selection rules apply only to the reverse reads; the forward reads are selected because they are paired with a selected reverse read.

obigrep -s '^t' \
        --paired-with reverse.fastq \
        --paired-mode reverse \
        --out start_t_rev.fastq \
        forward.fastq

📄 start_t_rev_R1.fastq

@M01334:147:000000000-LBRVD:1:1101:14968:1570 {"definition":"1:N:0:CTCACCAA+CTAGGCAA"}
tgttccacgggcaatcctgagccaaatctttcattttgaaaaaatgagagatataatgtatctcttatttattataagaaataaaatatttcttatctaatattaaagttaggtgcagagactcaatgggtggaactagatcggatgtgca
+
11>A>@3@A11>ACFFEG110BFB00BAFGHE2DFGG201110/B11111/D1D2222D2FDFDFGDGHHBGG2F222110D11@1D1FGHFHGFF@GE1F2FG22112B220F1@111/0>BF11B210B>//11B1<1BB<///<1122
@M01334:147:000000000-LBRVD:1:1101:15399:1590 {"definition":"1:N:0:CTCACCAA+CTAGGCAA"}
tgttccacccattgagtctctgcacctatctttaatattagataagaaatattttacttcttataataaataagagttattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccgtggaactagatcggaagagca
+
11>A>@3B>>1CF111BBFAG3A3AAF1FFGHHF3FBGH221F211110D1DGHH2BBGBFF2F22D221D211111A2DDGG2F2FFFEGD1FFHHHGFD221B111110BFGD11F@1001BF0@@1/EA//1>F1B1FD/////00<1
@M01334:147:000000000-LBRVD:1:1101:13773:1687 {"definition":"1:N:0:CTCACCAA+CTAGGCAA"}
ctcggatcaccattgagtctctgcacctatctttaatattagataagaaaaaatattatttcttatctgaaataagaaatattttatatatttctttttctcaaaatgaaagatttggctcaggattgccctgatccgagggatagcacca
+
3AAAAAADFFFFGGGGFGGGGGHHHHHHFHHHHHHHHGHHHHGHGGHFFHHHCGFHHHHHHHHHHHHHGHHGGFHFFHHHGHHHHBHHHGHHHHHHHHHHHHHFFHHFBDFBCGHHF4BGHFGFFHHBDGFHHEHHFAAEECEGF3FDGFC

📄 start_t_rev_R2.fastq

@M01334:147:000000000-LBRVD:1:1101:14968:1570 {"definition":"2:N:0:CTCACCAA+CTAGGCAA"}
ttttcctccctttttttctctgcacctttcttttttattagttttttattattttttttctttttttattttattgatactttatatctctctttttttcttttttattgatttttctctggttttcccttgttacttgttcttttttgct
+
11>>1131111BB111A0B3B313A0B1BAFGG11E/DG222B22///1D2DDGG1AE>>FG1D1/>/12B221212@21BFD2B2B2B2F11BFGHEEC1111B//1212BBF110@22111@@/2111?01111@111?111111--11
@M01334:147:000000000-LBRVD:1:1101:15399:1590 {"definition":"2:N:0:CTCACCAA+CTAGGCAA"}
ttttcctcgggctatcctgagccaaatctttccttttgaaaaatttagagatataaaatatctcttatttattttatgtagtattatatttcttatctaatattaaatttagttgctttttctcattttgttttactttttcttttttgct
+
11>>1131111111B11B1101A000B1DFF21DDFG1011100B122111D1D2221D1DADAFG1DGH2FG2D212222D2222D2DAF2FG2D@F21B2DE22122B221@11111110B222B222B00021B221B011111//11
@M01334:147:000000000-LBRVD:1:1101:13773:1687 {"definition":"2:N:0:CTCACCAA+CTAGGCAA"}
tgatagcagggctatcctgagccaaatccgtgttttgagaaaacaagggggttctcgaactagaatacaaaagaaaaggataggtgcagagactcaatggtgctatccctcggatcagggcaatccttagccaaatctttcattttttgaa
+
111>13@1111>11B1AF11BABC00B110BAFGGH0000DFAB//0///EEECGFA10AG1111D@@11100/0000/0F110B11@11/0>FC@1B>1B11FEFEC>E>///?<0110/?/FF<G22111@00@<GHHB>FHHH1///1

and: the selection rules must be true for both reads of the pair.

obigrep -s '^t' \
        --paired-with reverse.fastq \
        --paired-mode and \
        --out start_t_and.fastq \
        forward.fastq

📄 start_t_and_R1.fastq

@M01334:147:000000000-LBRVD:1:1101:14968:1570 {"definition":"1:N:0:CTCACCAA+CTAGGCAA"}
tgttccacgggcaatcctgagccaaatctttcattttgaaaaaatgagagatataatgtatctcttatttattataagaaataaaatatttcttatctaatattaaagttaggtgcagagactcaatgggtggaactagatcggatgtgca
+
11>A>@3@A11>ACFFEG110BFB00BAFGHE2DFGG201110/B11111/D1D2222D2FDFDFGDGHHBGG2F222110D11@1D1FGHFHGFF@GE1F2FG22112B220F1@111/0>BF11B210B>//11B1<1BB<///<1122
@M01334:147:000000000-LBRVD:1:1101:15399:1590 {"definition":"1:N:0:CTCACCAA+CTAGGCAA"}
tgttccacccattgagtctctgcacctatctttaatattagataagaaatattttacttcttataataaataagagttattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccgtggaactagatcggaagagca
+
11>A>@3B>>1CF111BBFAG3A3AAF1FFGHHF3FBGH221F211110D1DGHH2BBGBFF2F22D221D211111A2DDGG2F2FFFEGD1FFHHHGFD221B111110BFGD11F@1001BF0@@1/EA//1>F1B1FD/////00<1

📄 start_t_and_R2.fastq

@M01334:147:000000000-LBRVD:1:1101:14968:1570 {"definition":"2:N:0:CTCACCAA+CTAGGCAA"}
ttttcctccctttttttctctgcacctttcttttttattagttttttattattttttttctttttttattttattgatactttatatctctctttttttcttttttattgatttttctctggttttcccttgttacttgttcttttttgct
+
11>>1131111BB111A0B3B313A0B1BAFGG11E/DG222B22///1D2DDGG1AE>>FG1D1/>/12B221212@21BFD2B2B2B2F11BFGHEEC1111B//1212BBF110@22111@@/2111?01111@111?111111--11
@M01334:147:000000000-LBRVD:1:1101:15399:1590 {"definition":"2:N:0:CTCACCAA+CTAGGCAA"}
ttttcctcgggctatcctgagccaaatctttccttttgaaaaatttagagatataaaatatctcttatttattttatgtagtattatatttcttatctaatattaaatttagttgctttttctcattttgttttactttttcttttttgct
+
11>>1131111111B11B1101A000B1DFF21DDFG1011100B122111D1D2221D1DADAFG1DGH2FG2D212222D2222D2DAF2FG2D@F21B2DE22122B221@11111110B222B222B00021B221B011111//11

or: the selection rules must be true for at least one read of the pair. The second read is selected because its counterpart has been selected by the obigrep rules.

obigrep -s '^t' \
        --paired-with reverse.fastq \
        --paired-mode or \
        --out start_t_or.fastq \
        forward.fastq

📄 start_t_or_R1.fastq

@M01334:147:000000000-LBRVD:1:1101:14968:1570 {"definition":"1:N:0:CTCACCAA+CTAGGCAA"}
tgttccacgggcaatcctgagccaaatctttcattttgaaaaaatgagagatataatgtatctcttatttattataagaaataaaatatttcttatctaatattaaagttaggtgcagagactcaatgggtggaactagatcggatgtgca
+
11>A>@3@A11>ACFFEG110BFB00BAFGHE2DFGG201110/B11111/D1D2222D2FDFDFGDGHHBGG2F222110D11@1D1FGHFHGFF@GE1F2FG22112B220F1@111/0>BF11B210B>//11B1<1BB<///<1122
@M01334:147:000000000-LBRVD:1:1101:15946:1586 {"definition":"1:N:0:CTCACCAA+CTAGGCAA"}
tcctaaccccattgagtctctgcacctatctttaatattagataagaaatattttatttcttataataaataagagatattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccacgtaacggagatcggaagagc
+
1>>A111>>>AFGGB1FFGFGFF3BBF1GGHHH33D2GH2B1D211110D1DGHHBFGGGGG2FA2F221F21A1F0D1DGHH2FAFFGFHFFGHHHHGG22@1BD111@0FFHE11GC1001BGF1B1B/EF00??////BF////<000
@M01334:147:000000000-LBRVD:1:1101:15399:1590 {"definition":"1:N:0:CTCACCAA+CTAGGCAA"}
tgttccacccattgagtctctgcacctatctttaatattagataagaaatattttacttcttataataaataagagttattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccgtggaactagatcggaagagca
+
11>A>@3B>>1CF111BBFAG3A3AAF1FFGHHF3FBGH221F211110D1DGHH2BBGBFF2F22D221D211111A2DDGG2F2FFFEGD1FFHHHGFD221B111110BFGD11F@1001BF0@@1/EA//1>F1B1FD/////00<1
@M01334:147:000000000-LBRVD:1:1101:13773:1687 {"definition":"1:N:0:CTCACCAA+CTAGGCAA"}
ctcggatcaccattgagtctctgcacctatctttaatattagataagaaaaaatattatttcttatctgaaataagaaatattttatatatttctttttctcaaaatgaaagatttggctcaggattgccctgatccgagggatagcacca
+
3AAAAAADFFFFGGGGFGGGGGHHHHHHFHHHHHHHHGHHHHGHGGHFFHHHCGFHHHHHHHHHHHHHGHHGGFHFFHHHGHHHHBHHHGHHHHHHHHHHHHHFFHHFBDFBCGHHF4BGHFGFFHHBDGFHHEHHFAAEECEGF3FDGFC

📄 start_t_or_R2.fastq

@M01334:147:000000000-LBRVD:1:1101:14968:1570 {"definition":"2:N:0:CTCACCAA+CTAGGCAA"}
ttttcctccctttttttctctgcacctttcttttttattagttttttattattttttttctttttttattttattgatactttatatctctctttttttcttttttattgatttttctctggttttcccttgttacttgttcttttttgct
+
11>>1131111BB111A0B3B313A0B1BAFGG11E/DG222B22///1D2DDGG1AE>>FG1D1/>/12B221212@21BFD2B2B2B2F11BFGHEEC1111B//1212BBF110@22111@@/2111?01111@111?111111--11
@M01334:147:000000000-LBRVD:1:1101:15946:1586 {"definition":"2:N:0:CTCACCAA+CTAGGCAA"}
ccgttacgtgggcaatcctgagccaattctttctttttgaaaaaatgagagatataaaatatctcttatttattataagaaataaaatatttcttatctaatattaatgataggtgcagtgactctatggggttaggtagttcggatgagc
+
111>>111B111111BA0B1101B001BAGGH22DGGH?01110/B11111/D1D2221D1DBEDGH1GHH2GG2F222110D@111D1DFGEGFBG@GB1B2FG22222B220B11111111B@11B210/?E/00B211B2/////111
@M01334:147:000000000-LBRVD:1:1101:15399:1590 {"definition":"2:N:0:CTCACCAA+CTAGGCAA"}
ttttcctcgggctatcctgagccaaatctttccttttgaaaaatttagagatataaaatatctcttatttattttatgtagtattatatttcttatctaatattaaatttagttgctttttctcattttgttttactttttcttttttgct
+
11>>1131111111B11B1101A000B1DFF21DDFG1011100B122111D1D2221D1DADAFG1DGH2FG2D212222D2222D2DAF2FG2D@F21B2DE22122B221@11111110B222B222B00021B221B011111//11
@M01334:147:000000000-LBRVD:1:1101:13773:1687 {"definition":"2:N:0:CTCACCAA+CTAGGCAA"}
tgatagcagggctatcctgagccaaatccgtgttttgagaaaacaagggggttctcgaactagaatacaaaagaaaaggataggtgcagagactcaatggtgctatccctcggatcagggcaatccttagccaaatctttcattttttgaa
+
111>13@1111>11B1AF11BABC00B110BAFGGH0000DFAB//0///EEECGFA10AG1111D@@11100/0000/0F110B11@11/0>FC@1B>1B11FEFEC>E>///?<0110/?/FF<G22111@00@<GHHB>FHHH1///1

andnot: the selection rules must be true on the forward sequence but not on the reverse one.

obigrep -s '^t' \
        --paired-with reverse.fastq \
        --paired-mode andnot \
        --out start_t_andnot.fastq \
        forward.fastq

📄 start_t_andnot_R1.fastq

@M01334:147:000000000-LBRVD:1:1101:15946:1586 {"definition":"1:N:0:CTCACCAA+CTAGGCAA"}
tcctaaccccattgagtctctgcacctatctttaatattagataagaaatattttatttcttataataaataagagatattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccacgtaacggagatcggaagagc
+
1>>A111>>>AFGGB1FFGFGFF3BBF1GGHHH33D2GH2B1D211110D1DGHHBFGGGGG2FA2F221F21A1F0D1DGHH2FAFFGFHFFGHHHHGG22@1BD111@0FFHE11GC1001BGF1B1B/EF00??////BF////<000

📄 start_t_andnot_R2.fastq

@M01334:147:000000000-LBRVD:1:1101:15946:1586 {"definition":"2:N:0:CTCACCAA+CTAGGCAA"}
ccgttacgtgggcaatcctgagccaattctttctttttgaaaaaatgagagatataaaatatctcttatttattataagaaataaaatatttcttatctaatattaatgataggtgcagtgactctatggggttaggtagttcggatgagc
+
111>>111B111111BA0B1101B001BAGGH22DGGH?01110/B11111/D1D2221D1DBEDGH1GHH2GG2F222110D@111D1DFGEGFBG@GB1B2FG22222B220B11111111B@11B210/?E/00B211B2/////111

xor: the selection rules must be true on only one read of the pair, not on both.

obigrep -s '^t' \
        --paired-with reverse.fastq \
        --paired-mode xor \
        --out start_t_xor.fastq \
        forward.fastq

📄 start_t_xor_R1.fastq

@M01334:147:000000000-LBRVD:1:1101:15946:1586 {"definition":"1:N:0:CTCACCAA+CTAGGCAA"}
tcctaaccccattgagtctctgcacctatctttaatattagataagaaatattttatttcttataataaataagagatattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccacgtaacggagatcggaagagc
+
1>>A111>>>AFGGB1FFGFGFF3BBF1GGHHH33D2GH2B1D211110D1DGHHBFGGGGG2FA2F221F21A1F0D1DGHH2FAFFGFHFFGHHHHGG22@1BD111@0FFHE11GC1001BGF1B1B/EF00??////BF////<000
@M01334:147:000000000-LBRVD:1:1101:13773:1687 {"definition":"1:N:0:CTCACCAA+CTAGGCAA"}
ctcggatcaccattgagtctctgcacctatctttaatattagataagaaaaaatattatttcttatctgaaataagaaatattttatatatttctttttctcaaaatgaaagatttggctcaggattgccctgatccgagggatagcacca
+
3AAAAAADFFFFGGGGFGGGGGHHHHHHFHHHHHHHHGHHHHGHGGHFFHHHCGFHHHHHHHHHHHHHGHHGGFHFFHHHGHHHHBHHHGHHHHHHHHHHHHHFFHHFBDFBCGHHF4BGHFGFFHHBDGFHHEHHFAAEECEGF3FDGFC

📄 start_t_xor_R2.fastq

@M01334:147:000000000-LBRVD:1:1101:15946:1586 {"definition":"2:N:0:CTCACCAA+CTAGGCAA"}
ccgttacgtgggcaatcctgagccaattctttctttttgaaaaaatgagagatataaaatatctcttatttattataagaaataaaatatttcttatctaatattaatgataggtgcagtgactctatggggttaggtagttcggatgagc
+
111>>111B111111BA0B1101B001BAGGH22DGGH?01110/B11111/D1D2221D1DBEDGH1GHH2GG2F222110D@111D1DFGEGFBG@GB1B2FG22222B220B11111111B@11B210/?E/00B211B2/////111
@M01334:147:000000000-LBRVD:1:1101:13773:1687 {"definition":"2:N:0:CTCACCAA+CTAGGCAA"}
tgatagcagggctatcctgagccaaatccgtgttttgagaaaacaagggggttctcgaactagaatacaaaagaaaaggataggtgcagagactcaatggtgctatccctcggatcagggcaatccttagccaaatctttcattttttgaa
+
111>13@1111>11B1AF11BABC00B110BAFGGH0000DFAB//0///EEECGFA10AG1111D@@11100/0000/0F110B11@11/0>FC@1B>1B11FEFEC>E>///?<0110/?/FF<G22111@00@<GHHB>FHHH1///1
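
The six modes can be read as boolean combinations of "the forward read matches" and "the reverse read matches". The following Python sketch is only a model of that reading, not obigrep's implementation: the names PAIRED_MODES, PAIRS and select_pairs are hypothetical, Python's re module stands in for obigrep's pattern engine, and only the first bases of the four read pairs listed above are used.

```python
import re

# One predicate per mode; f and r tell whether the forward / reverse read
# satisfies the selection rules.
PAIRED_MODES = {
    "forward": lambda f, r: f,
    "reverse": lambda f, r: r,
    "and":     lambda f, r: f and r,
    "or":      lambda f, r: f or r,
    "andnot":  lambda f, r: f and not r,
    "xor":     lambda f, r: f != r,
}

# First bases of the four read pairs shown in the listings above:
# end of the read identifier -> (forward read, reverse read).
PAIRS = {
    "14968:1570": ("tgttccac", "ttttcctc"),
    "15946:1586": ("tcctaacc", "ccgttacg"),
    "15399:1590": ("tgttccac", "ttttcctc"),
    "13773:1687": ("ctcggatc", "tgatagca"),
}


def select_pairs(pattern, mode, pairs=PAIRS):
    """Return the identifiers of the pairs kept for a pattern and a mode."""
    rule = re.compile(pattern, re.IGNORECASE)
    keep = PAIRED_MODES[mode]
    return [
        read_id
        for read_id, (fwd, rev) in pairs.items()
        if keep(bool(rule.search(fwd)), bool(rule.search(rev)))
    ]


for mode in PAIRED_MODES:
    print(f"{mode:8s}", select_pairs("^t", mode))
```

With the pattern ^t, this model keeps the same pairs as the output files listed for each mode above.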

Synopsis #

obigrep [--allows-indels] [--approx-pattern <PATTERN>]...
        [--attribute|-a <KEY=VALUE>]... [--batch-size <int>] [--compress|-Z]
        [--csv] [--debug] [--definition|-D <PATTERN>]... [--ecopcr] [--embl]
        [--fail-on-taxonomy] [--fasta] [--fasta-output] [--fastq]
        [--fastq-output] [--force-one-cpu] [--genbank]
        [--has-attribute|-A <KEY>]... [--help|-h|-?] [--id-list <FILENAME>]
        [--identifier|-I <PATTERN>]... [--ignore-taxon|-i <TAXID>]...
        [--input-OBI-header] [--input-json-header] [--inverse-match|-v]
        [--json-output] [--max-count|-C <COUNT>] [--max-cpu <int>]
        [--max-length|-L <LENGTH>] [--min-count|-c <COUNT>]
        [--min-length|-l <LENGTH>] [--no-order] [--no-progressbar]
        [--only-forward] [--out|-o <FILENAME>] [--output-OBI-header|-O]
        [--output-json-header]
        [--paired-mode <forward|reverse|and|or|andnot|xor>]
        [--paired-with <FILENAME>] [--pattern-error <int>] [--pprof]
        [--pprof-goroutine <int>] [--pprof-mutex <int>]
        [--predicate|-p <EXPRESSION>]... [--raw-taxid]
        [--require-rank <RANK_NAME>]... [--restrict-to-taxon|-r <TAXID>]...
        [--save-discarded <FILENAME>] [--sequence|-s <PATTERN>]...
        [--silent-warning] [--skip-empty] [--solexa] [--taxonomy|-t <string>]
        [--u-to-t] [--update-taxid] [--valid-taxid] [--version]
        [--with-leaves] [<args>]

Options #

Selecting sequence records #

Selection based on the sequence #

Strict matching #

--sequence | -s <PATTERN>: a regular expression pattern used to match the sequence. Only the entries whose sequence matches the pattern are kept. Regular expression patterns are case-insensitive.
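
As a rough illustration of case-insensitive matching, the Python sketch below tests a pattern against a sequence. The function name sequence_matches is hypothetical, and Python's re module only stands in for the regular-expression engine used by obigrep, so syntax details may differ.

```python
import re


def sequence_matches(pattern, sequence):
    """True when the pattern is found in the sequence, ignoring case."""
    return re.search(pattern, sequence, flags=re.IGNORECASE) is not None


# The pattern '^t' is anchored at the start of the sequence.
print(sequence_matches("^t", "TGTTCCAC"))  # True: starts with T, case is ignored
print(sequence_matches("^t", "ccgttacg"))  # False: starts with c
```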

Approximate matching #

--approx-pattern <PATTERN>: a DNA pattern used to match the sequence. Only the entries whose sequence matches the pattern are kept. DNA patterns are case-insensitive. They can be matched allowing for errors: mismatches, insertions or deletions.
--allows-indels: allows for indels during DNA pattern matching (see --approx-pattern option).
--pattern-error <INTEGER>: maximum number of errors allowed when searching for patterns in DNA (default 0, see --approx-pattern option).
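
To make the interplay of the three options concrete, here is a minimal Python sketch of approximate matching. It is an illustration under stated assumptions, not obigrep's algorithm: the function approx_match and its parameters are hypothetical, the pattern is treated as plain letters (no ambiguity codes), mismatch-only matching slides a window over the sequence, and matching with indels uses a classic semi-global edit distance.

```python
def approx_match(pattern, sequence, max_errors=0, allow_indels=False):
    """True if pattern occurs in sequence with at most max_errors errors."""
    pat, seq = pattern.lower(), sequence.lower()
    m, n = len(pat), len(seq)

    if not allow_indels:
        # Mismatches only: compare the pattern with every window of its length.
        for start in range(n - m + 1):
            window = seq[start:start + m]
            mismatches = sum(a != b for a, b in zip(pat, window))
            if mismatches <= max_errors:
                return True
        return False

    # Mismatches and indels: semi-global edit distance.
    # previous[j] is the lowest cost of aligning the pattern prefix handled so
    # far with some substring of seq ending at position j. Row 0 is all zeros
    # because a match may start anywhere in the sequence.
    previous = [0] * (n + 1)
    for i in range(1, m + 1):
        current = [i] + [0] * n
        for j in range(1, n + 1):
            substitution = previous[j - 1] + (pat[i - 1] != seq[j - 1])
            current[j] = min(substitution, previous[j] + 1, current[j - 1] + 1)
        previous = current
    return min(previous) <= max_errors


# 'aggt' differs from 'acgt' by one mismatch.
print(approx_match("ACGT", "ttaggttt", max_errors=1))
# 'agt' is 'acgt' with one deleted base: found only when indels are allowed.
print(approx_match("ACGT", "ttagttt", max_errors=1, allow_indels=True))
```

In this model, max_errors plays the role of --pattern-error and allow_indels the role of --allows-indels.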

Selection based on the sequence identifier #

--identifier | -I <REGEX>: Regular expression pattern to be tested against the sequence identifier. The pattern is case-insensitive.
--id-list <FILENAME>: points to a text file containing the list of sequence record identifiers to be selected. The file format consists of a single identifier per line.
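
The identifier-list format can be pictured with the short Python sketch below. The helper names read_id_list and filter_by_id are hypothetical; exact (whole-identifier) comparison and the skipping of blank lines are assumptions made for the illustration.

```python
import io


def read_id_list(handle):
    """Collect one identifier per line, skipping blank lines (assumption)."""
    return {line.strip() for line in handle if line.strip()}


def filter_by_id(records, wanted):
    """Keep the (identifier, sequence) records whose identifier is listed."""
    return [(seq_id, seq) for seq_id, seq in records if seq_id in wanted]


records = [
    ("seqA1", "cgatgctgcatgctagtgctagtcgat"),
    ("seqB1", "tagctagctagctagctagctagctagcta"),
    ("seqA2", "gtagctagctagctagctagctagctaga"),
]
# io.StringIO stands in for the text file given to --id-list.
id_file = io.StringIO("seqA1\nseqB1\n")
selected = filter_by_id(records, read_id_list(id_file))
print([seq_id for seq_id, _ in selected])
```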

Selection based on the sequence definition #

--definition | -D <REGEX>: Regular expression pattern to be tested against the sequence definition. The pattern is case-insensitive.

Selection based on the sequence properties #

--min-count | -c <COUNT>: selects the sequence records for which the number of occurrences (i.e. the count attribute) is equal to or greater than the defined minimum count.
--max-count | -C <COUNT>: selects the sequence records for which the number of occurrences (i.e. the count attribute) is equal to or less than the defined maximum count.
--min-length | -l <LENGTH>: selects the sequence records for which the sequence length is equal to or greater than the defined minimum sequence length.
--max-length | -L <LENGTH>: selects sequence records for which the sequence length is equal to or less than the defined maximum sequence length.
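
All four bounds are inclusive. The Python sketch below models them as a single predicate; the function passes_property_filters is hypothetical, and treating a record without a count attribute as having a count of 1 is an assumption of this example.

```python
def passes_property_filters(sequence, annotations,
                            min_count=None, max_count=None,
                            min_length=None, max_length=None):
    """Apply the four inclusive bounds; a bound left at None is not checked."""
    count = annotations.get("count", 1)  # assumption: a missing count means 1
    length = len(sequence)
    checks = [
        min_count is None or count >= min_count,
        max_count is None or count <= max_count,
        min_length is None or length >= min_length,
        max_length is None or length <= max_length,
    ]
    return all(checks)


record = ("cgatgctgcatgctagtgctagtcgat", {"count": 3})  # 27 bases, seen 3 times
print(passes_property_filters(*record, min_count=3, max_length=27))
print(passes_property_filters(*record, min_length=28))
```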

Matching the sequence annotations #

Taxonomy based filtering #

If the user specifies a taxonomy when calling *OBITools* (see --taxonomy option), it is possible to filter the sequences based on taxonomic properties. Each of the following options can be used multiple times if needed to specify multiple taxids or ranks.

--restrict-to-taxon | -r <TAXID>: Only sequences having a taxid belonging to the provided taxid are kept.
--ignore-taxon | -i <TAXID>: Sequences having a taxid belonging to the provided taxid are discarded.
--require-rank <RANK_NAME>: Only sequences having a taxid able to provide information at the <RANK_NAME> level are kept. As an example, the NCBI taxid 74635 corresponding to Rosa canina is able to provide information at the species, genus or family level. However, taxid 3764 (Rosa genus) is not able to provide information at the species level. Many of the taxids related to environmental samples have a partial classification, and a taxon at the species level does not always have a genus-level taxon as its parent: it is sometimes connected directly to a taxon at a higher level.
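
The three taxonomic filters can be sketched on a toy taxonomy in Python. Everything here is illustrative: the taxids 74635 and 3764 come from the text above, while ROOT and FAMILY are hypothetical placeholders and intermediate ranks are omitted. The way repeated options are combined (any of the --restrict-to-taxon clades, none of the --ignore-taxon clades, all of the --require-rank ranks) is an assumption, not documented behaviour.

```python
# Toy taxonomy: taxid -> (parent taxid, rank).
ROOT, FAMILY = 1, 900001  # hypothetical placeholder taxids
TAXONOMY = {
    ROOT: (None, "no rank"),
    FAMILY: (ROOT, "family"),
    3764: (FAMILY, "genus"),    # Rosa
    74635: (3764, "species"),   # Rosa canina
}


def lineage(taxid):
    """Yield the taxid, then each of its ancestors up to the root."""
    while taxid is not None:
        yield taxid
        taxid = TAXONOMY[taxid][0]


def belongs_to(taxid, clade):
    """True when clade is the taxid itself or one of its ancestors."""
    return clade in lineage(taxid)


def has_rank(taxid, rank):
    """True when the taxid or one of its ancestors has the given rank."""
    return any(TAXONOMY[t][1] == rank for t in lineage(taxid))


def keep(taxid, restrict_to=(), ignore=(), require_ranks=()):
    """Combine the three filters (combination rules are assumptions)."""
    if restrict_to and not any(belongs_to(taxid, c) for c in restrict_to):
        return False
    if any(belongs_to(taxid, c) for c in ignore):
        return False
    return all(has_rank(taxid, r) for r in require_ranks)


print(keep(74635, require_ranks=["species", "genus", "family"]))
print(keep(3764, require_ranks=["species"]))
```

The two calls mirror the Rosa canina and Rosa examples given for --require-rank.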

Controlling the input data #

OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step.

The file format options #

--fasta: indicates that sequence data is in fasta format.
--fastq: indicates that sequence data is in fastq format.
--embl: indicates that sequence data is in EMBL-ENA flatfile format.
--csv: indicates that sequence data is in CSV format.
--genbank: indicates that sequence data is in GenBank flatfile format.
--ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.

Controlling the way OBITools4 are formatting annotations #

These options only apply to the FASTA and FASTQ formats.

--input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
--input-json-header: FASTA/FASTQ title line annotations follow the JSON format.

Controlling quality score decoding #

This option only applies to the FASTQ format.

--solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)
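
For readers unfamiliar with the two encodings, the Python sketch below decodes a quality string both ways. The function names are hypothetical. The Sanger encoding stores a Phred score as an ASCII code minus 33; the old Solexa encoding uses an offset of 64 and a different score scale, converted here to Phred with the usual formula. Whether obigrep performs exactly this conversion is an assumption.

```python
import math


def decode_sanger(quality):
    """Phred scores from the standard Sanger encoding (ASCII code minus 33)."""
    return [ord(ch) - 33 for ch in quality]


def decode_solexa(quality):
    """Phred scores from the old Solexa encoding (offset 64, then converted)."""
    scores = []
    for ch in quality:
        q_solexa = ord(ch) - 64
        # Standard Solexa-to-Phred conversion.
        scores.append(10 * math.log10(10 ** (q_solexa / 10) + 1))
    return scores


# First eight quality characters of one of the reads shown earlier.
print(decode_sanger("11>A>@3B"))  # [16, 16, 29, 32, 29, 31, 18, 33]
```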

Controlling the output data #

--compress | -Z : output is compressed using gzip. (default: false)
--no-order: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation.
--fasta-output: writes sequence data in fasta format (default if quality data is not available).
--fastq-output: writes sequence data in fastq format (default if quality data is available).
--json-output: writes sequence data in JSON format.
--out | -o <FILENAME>: filename used for saving the output (default: “-”, the standard output)
--output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON).
--output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format).
--skip-empty: sequences of length equal to zero are removed from the output (default: false).
--no-progressbar: deactivates progress bar display (default: false).

General options #

--help | -h|-? : shows this help.
--version: prints the version and exits.
--silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.

--max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
--force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
--batch-size <INTEGER>: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE).

--debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
--pprof: enables pprof server. Look at the log for details. (default: false).
--pprof-mutex <INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
--pprof-goroutine <INTEGER>: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)

Examples #

obigrep --help
