General operating principles for OBITools #

OBIToolsare not a metabarcoding data analysis pipeline, but a set of tools for developing customized analyses, while avoiding the black-box effect of a ready-to-use pipeline. A particular effort in the development of OBITools4 has been to use data formats that can be easily interfaced with other software.

OBITools correspond to a set of UNIX commands that are executed from a command line interface, also known as a terminal, to perform various tasks on DNA sequence files. A UNIX command can be considered as a process that takes a set of inputs and produces a set of outputs.

graph LR
  A@{ shape: doc, label: "input 1" }
  B@{ shape: doc, label: "input 2" }
  C[Unix command]
  D@{ shape: doc, label: "output 1" }
  E@{ shape: doc, label: "output 2" }
  F@{ shape: doc, label: "output 3" }
  A --> C
  B --> C:::obitools
  C --> D
  C --> E
  C --> F
  classDef obitools fill:#99d57c

Most OBITools take a single file as input and produce a single file as output. Among the inputs, one has a special status: the standard input (stdin). Symmetrically, there is the standard output (stdout). By default, like any other UNIX command, the OBITools reads its data from stdin and write its results to stdout.

graph LR
  A@{ shape: doc, label: "stdin" }
  C[Unix command]
  D@{ shape: doc, label: "stdout" }
  A --> C:::obitools
  C --> D
  classDef obitools fill:#99d57c

If nothing is specified, the UNIX system connects standard input to the terminal keyboard and standard output to the terminal screen. So, for example, if you enter the obiconvert command in your terminal without any arguments, it will appear to stop and do nothing, when in fact it is waiting for you to type something on the keyboard. To stop it, just press Ctrl+D or Ctrl+C. Ctrl+D ends typing and stops the program. Ctrl+C kills the program.

Specifying the input data #

OBITools are designed to process DNA sequence files. Most of them therefore accept DNA sequence files as input. They can be formatted in the most common sequence file formats, fasta , fastq , EMBL-ENA and GenBank flat files. Data can also be supplied as CSV files. The OBITools generally recognize the file format of input data, but options are provided to force a specific format (i.e. --fasta, --fastq, --genbank, --embl).

The most common way to specify the file containing the DNA sequences to be processed is to specify its name as an argument. Here is an example using obicount to count the number of DNA sequences in a file named my_file.fasta.

obicount my_file.fasta

But it is also possible to pass data using the Unix redirection mechanism (i.e. > and <, more details).

obicount < my_file.fasta

OBITools can also be used to process a set of files. In this case, OBITools will process the files in the order in which they appear on the command line.

obicount my_file1.fasta my_file2.fasta my_file3.fasta

The wildcard character (i.e. *) can be used to specify a set of files to be processed.

obicount my_file*.fasta

If the files are located in a subdirectory, the directory name can be specified, without the need to specify any file name. In that case, OBITools will process all the sequence files present in the subdirectory. Sequence files are searched recursively in the specified directory and all its sub-directories.

obicount my_sub_directory

Files considered to be DNA sequence files are those with the extension .fasta, .fastq, .genbank or .embl, .seq or .dat. Files with the second extension .gz (e.g. .fasta.gz) are also considered to be DNA sequence files and are processed without the need for decompression.

Imagine a folder called Genbank containing a complete copy of the Genbank database organized into subdirectories, one per division. Each division subdirectory contains a set of fasta compressed (.gz) files.

. 📂 Genbank
└── 📂 bct
│   └── 📄 gbbct1.fasta.gz
│   ├── 📄 gbbct2.fasta.gz
│   ├── 📄 gbbct3.fasta.gz
│   └── 📄 ...
└── 📂 inv
│   └── 📄 gbinv1.fasta.gz
│   ├── 📄 gbinv2.fasta.gz
│   ├── 📄 gbinv3.fasta.gz
│   └── 📄...
└── 📂 mam
│   └── 📄 gbmam1.fasta.gz
│   ├── 📄 gbmam2.fasta.gz
│   ├── 📄 gbmam3.fasta.gz
│   └── 📄...
└── 📂...
│

It is possible to count entries in the gbbct1.fasta.gz file with the command

obicount Genbank/bct/gbbct1.fasta.gz

to count the entries in the bct (bacterial) division with the command

obicount Genbank/bct

or to count the entries in the complete Genbank copy with the command

obicount Genbank

Specifying what to do with the output #

By default, OBITools write their output to standard output (stdout), which means that the results of a command are printed out on the terminal screen.

Most OBITools produce sequence files as output. The output sequence file is in fasta or fastq format, depending on whether it contains quality scores ( fastq ) or not ( fasta ). The output format of sequence files can be forced using the --fasta-output or --fastq-output options. If the --fastq-output option is used for a dataset without quality information, a default quality score of 40 will be assigned to each nucleotide. A third option is the --json-output option, which produces data in JSON format.

With the exception of the obisummary command, the OBITools which produce other types of data return them in CSV format. The obisummary command returns its results in JSON or YAML formats.

The obicomplement command computes the reverse-complement of the DNA sequences provided as input.

📄 two_sequences.fasta

>AB061527 {"count":1,"definition":"Sorex unguiculatus mitochondrial NA, complete genome.","family_name":"Soricidae","family_taxid":9376,"genus_name":"Sorex","genus_taxid":9379,"obicleandb_level":"family","obicleandb_trusted":2.2137847111025621e-13,"species_name":"Sorex unguiculatus","species_taxid":62275,"taxid":62275}
ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat
agcttaaaactcaaaggacttggcggtgctttatatccct
>AL355887 {"count":2,"definition":"Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN.","family_name":"Hominidae","family_taxid":9604,"genus_name":"Homo","genus_taxid":9605,"obicleandb_level":"genus","obicleandb_trusted":0,"species_name":"Homo sapiens","species_taxid":9606,"taxid":9606}
ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac
agcttaaaactcaaaggacctggcagttctttatatccct

If the two_sequences.fasta file is processed with the obicomplement command, without indicating the name of the output file, the result is written to the terminal screen.

obicomplement two_sequences.fasta

>AB061527 {"count":1,"definition":"Sorex unguiculatus mitochondrial NA, complete genome.","family_name":"Soricidae","family_taxid":"9376","genus_name":"Sorex","genus_taxid":"9379","obicleandb_level":"family","obicleandb_trusted":2.2137847111025621e-13,"species_name":"Sorex unguiculatus","species_taxid":"62275","taxid":"62275"}
agggatataaagcaccgccaagtcctttgagttttaagctattgctagtagttctctgac
gggtatttttgttagattaaatacctaagtttagggctaa
>AL355887 {"count":2,"definition":"Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN.","family_name":"Hominidae","family_taxid":"9604","genus_name":"Homo","genus_taxid":"9605","obicleandb_level":"genus","obicleandb_trusted":0,"species_name":"Homo sapiens","species_taxid":"9606","taxid":"9606"}
agggatataaagaactgccaggtcctttgagttttaagctgttgctcgtagtattctgac
gaatggttttgttaatgtaactactagagtttagggctaa

There are two options for saving the results to a file. The first is to redirect the output to a file, as in the following example.

obicomplement two_sequences.fasta > two_sequences_comp.fasta

The second option is to use the --out option.

obicomplement two_sequences.fasta --out two_sequences_comp.fasta

Both methods will produce the same result, a file named two_sequences_comp.fasta containing the reverse-complement of the DNA sequences contained in two_sequences.fasta.

Combining OBITools commands using pipes #

Since OBITools are UNIX commands, and their default behaviour is to read their input from stdin and write their output to stdout, it is possible to combine them using the Unix pipe mechanism (i.e. |). For example, you can reverse-complement the file two_sequences.fasta with the command obicomplement , and then count the number of DNA sequences in the resulting file with the command obicount , without saving the intermediate results, by linking the stdout of obicomplement to the stdin of obicount .

obicomplement two_sequences.fasta | obicount

entities,n
variants,2
reads,3
symbols,200

The result of the obicount command is a CSV file. Therefore, it can itself be piped to another command, like csvtomd to reformat the result in a Markdown table.

obicomplement two_sequences.fasta | obicount | csvtomd

entities  |  n
----------|-----
variants  |  2
reads     |  3
symbols   |  200

Or being plotted with the uplot command.

obicomplement two_sequences.fasta | obicount | uplot barplot -H -d,

                               n
            ┌                                        ┐ 
   variants ┤ 2.0                                      
      reads ┤ 3.0                                      
    symbols ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 200.0   
            └                                        ┘

The tagging system of OBITools #

OBITools provide several tools for performing computations on the sequences. The result of such a computation may be the selection of a subset of the input sequences, a modification of the sequences themselves, or it may only lead to the estimation of some sequence properties. In the latter case, OBITools store the estimated properties of the relevant sequence in a fasta or fastq file. To achieve this, OBITools add structured information in the form of a JSON map to the header of each sequence. The JSON map allows calculation results to be stored in key-value pairs. Each OBITools command adds one or more key-value pairs to the JSON map as required to annotate a sequence. Below is an example of a fasta formatted sequence with a JSON map added to its header containing three key-value pairs: count associated with the value 2, is_annotated associated with the value true and xxxx associated with the value yyyy.

>sequence1 {"count": 2, "is_annotated": true, "xxxx": "yyy"}
cgacgtagctgtgatgcagtgcagttatattttacgtgctatgtttcagtttttttt
fdcgacgcagcggag

The key names #

Keys can be any string of characters. Their names are case-sensitive. The keys count, Count and COUNT are all considered to be different keys. Some key names are reserved by OBITools and have special meanings (e.g. count contains, if present, an integer value indicating how many times this sequence has been observed, taxid contains a string corresponding to a taxonomic identifier from a taxonomy).

The tag values #

Values can be strings, integers, floats, or boolean values. Values can also be of composite types but with some limitations compared to the JSON format. In OBITools4 annotations it is not possible to nest composite types. A list cannot contain a list or a map.

A list is an ordered set of values, in this case a set of integer values:

[1,3,2,12]

A map is a set of values indexed by a key, which is a string. As an example, here is a map of integer values:

{"toto":4,"titi":10,"tutu":1}

Maps are notably used by obiuniq to aggregate information collected from the merged sequence records.

>my_seq_O1 {"merged_sample":{"sample1":45,"sample_2":33}}
gctagctagctgtgatgtcgtagttgctgatgctagtgctagtcgtaaaaaat

Using the obiannotate command, it is possible to edit these annotations, adding new ones, deleting others, renaming keys or changing values.

Caution

You are free to add, edit and delete even the OBITools4 reserved keys to mimic the results of an OBITools4 commands. But beware of the impact of these manually modified values. It is best not to modified reserved annotation keys.

OBITools4 and the taxonomic information #

One of the advantages of OBITools is their ability to handle taxonomy annotations. Each sequence in a sequence file can be individually taxonomically annotated by adding a taxid tag. Although several annotation tags can be related to taxonomic information, only the taxid tag really matters.

The tags associated with taxonomic annotations fall into three categories

taxid The main taxonomic annotation
Any tag ending with the _taxid suffix contains secondary taxid annotations, such as family_taxid which contains the taxid at the family level.
Text tags ending with _name, such as scientific_name or family_name, which contain the textual representation corresponding to the taxids.

The last category is intended solely to facilitate the user’s task, to make taxonomic information more comprehensible on a human level. The second category is also intended to help the user, bearing in mind that any taxonomy-based selection implemented by OBITools4 is based solely on the taxid tag.

Taxonomic identifiers, taxid, are short strings that uniquely identify a taxon within a taxonomy. It is important to rely on taxid rather than Latin names to identify taxa, as several taxa share the same Latin name (e.g. Vertebrata is also a genus of red algae).

For example, in the NCBI taxonomy, the species Homo sapiens has the taxid 9606 and belongs to the genus Homo, which has the taxid 9605. Although all NCBI taxids are numeric, the OBITools4 treats them as strings: "9606" and "9605".

The one way to specify a taxid to obitools is to provide this short string: "9606" or "9605".

If the --taxonomy or -t option, which takes a filename as parameter, is used when calling a OBITools command, the corresponding taxonomy will be loaded and every taxid present in a file (taxid and *_taxid tags) will be checked against the taxonomy. To download a copy of the NCBI taxonomy you can use the obitaxonomy command:

obitaxonomy --download-ncbi --out ncbitaxo.tgz

This will create a new ncbitaxo.tgz file containing a local copy of the complete taxonomy.

The first consequence of this check is that all taxa are rewritten in their long form. "9606" becomes "taxon:9606 [Homo sapiens]@species":

taxon: is the taxonomy code (TAXOCOD is taxon for the NCBI taxonomy).
9606: is the taxid
Homo sapiens: is the scientific name
species: is the taxonomic rank

So the long form of a taxid can be written as "TAXOCOD:TAXID [SCIENTIFIC NAME]@RANK".

If you look at the following files, you can see that the taxid tag is set to 62275 and 9606 for the first and second sequences respectively:

📄 two_sequences.fasta

>AB061527 {"count":1,"definition":"Sorex unguiculatus mitochondrial NA, complete genome.","family_name":"Soricidae","family_taxid":9376,"genus_name":"Sorex","genus_taxid":9379,"obicleandb_level":"family","obicleandb_trusted":2.2137847111025621e-13,"species_name":"Sorex unguiculatus","species_taxid":62275,"taxid":62275}
ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat
agcttaaaactcaaaggacttggcggtgctttatatccct
>AL355887 {"count":2,"definition":"Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN.","family_name":"Hominidae","family_taxid":9604,"genus_name":"Homo","genus_taxid":9605,"obicleandb_level":"genus","obicleandb_trusted":0,"species_name":"Homo sapiens","species_taxid":9606,"taxid":9606}
ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac
agcttaaaactcaaaggacctggcagttctttatatccct

If you use obiconvert without specifying a taxonomy, its only action is to convert potential old numeric taxids (62275 and 9606) to their string equivalents ("62275" and "9606").

obiconvert two_sequences.fasta

>AB061527 {"count":1,"definition":"Sorex unguiculatus mitochondrial NA, complete genome.","family_name":"Soricidae","family_taxid":"9376","genus_name":"Sorex","genus_taxid":"9379","obicleandb_level":"family","obicleandb_trusted":2.2137847111025621e-13,"species_name":"Sorex unguiculatus","species_taxid":"62275","taxid":"62275"}
ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat
agcttaaaactcaaaggacttggcggtgctttatatccct
>AL355887 {"count":2,"definition":"Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN.","family_name":"Hominidae","family_taxid":"9604","genus_name":"Homo","genus_taxid":"9605","obicleandb_level":"genus","obicleandb_trusted":0,"species_name":"Homo sapiens","species_taxid":"9606","taxid":"9606"}
ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac
agcttaaaactcaaaggacctggcagttctttatatccct

If the previously downloaded NCBI taxonomy is specified to obiconvert , the output of the command will be as follows. You will notice that, this time, the taxa are given in their long form. The scientific name and taxonomic rank are also given.

obiconvert -t ncbitaxo.tgz two_sequences.fasta

>AB061527 {"count":1,"definition":"Sorex unguiculatus mitochondrial NA, complete genome.","family_name":"Soricidae","family_taxid":"taxon:9376 [Soricidae]@family","genus_name":"Sorex","genus_taxid":"taxon:9379 [Sorex]@genus","obicleandb_level":"family","obicleandb_trusted":2.2137847111025621e-13,"species_name":"Sorex unguiculatus","species_taxid":"taxon:62275 [Sorex unguiculatus]@species","taxid":"taxon:62275 [Sorex unguiculatus]@species"}
ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat
agcttaaaactcaaaggacttggcggtgctttatatccct
>AL355887 {"count":2,"definition":"Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN.","family_name":"Hominidae","family_taxid":"taxon:9604 [Hominidae]@family","genus_name":"Homo","genus_taxid":"taxon:9605 [Homo]@genus","obicleandb_level":"genus","obicleandb_trusted":0,"species_name":"Homo sapiens","species_taxid":"taxon:9606 [Homo sapiens]@species","taxid":"taxon:9606 [Homo sapiens]@species"}
ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac
agcttaaaactcaaaggacctggcagttctttatatccct

If the check reveals that taxid is not present in the taxonomy, a warning is issued by the OBITools4. As example, the obiconvert command applied to the following file:

📄 four_sequences.fasta

>AY189646 {"count":1,"definition":"Homo sapiens clone arCan119 12S ribosomal RNA gene, partial sequence; mitochondrial gene for mitochondrial product.","species_name":"Homo sapiens","taxid":"taxon:9606 [Homo sapiens]@species"}
ttagccctaaacctcaacagttaaatcaacaaaactgctcgccagaacactacgrgccac
agcttaaaactcaaaggacctggcggtgcttcatatccct
>AF023201 {"count":1,"definition":"Snyderichthys copei 12S ribosomal RNA gene, mitochondrial gene for mitochondrial RNA, complete sequence.","species_name":"Snyderichthys copei","taxid":"67561"}
tcagccataaacctagatgtccagctacagttagacatccgcccgggtactacgagcatt
agcttgaaacccaaaggacctgacggtgccttagaccccc
>JN897380 {"count":1,"definition":"Nihonotrypaea thermophila mitochondrion, complete genome.","species_name":"Nihonotrypaea thermophila","taxid":"1114968"}
tagccttaacaaacatactaaaatattaaaagttatggtctctaaatttaaaggatttgg
cggtaatttagtccag
>KC236422 {"count":1,"definition":"Nihonotrypaea japonica mitochondrion, complete genome.","species_name":"Nihonotrypaea japonica","taxid":"5799994"}
cagctttaacaaacatactaaaatattaaaagttatggtctctaaatttaaaggatttgg
cggtaatttagtccag

displays the following warning:

obiconvert -t ncbitaxo.tgz four_sequences.fasta

INFO[0000] Number of workers set 16                     
INFO[0000] Found 1 files to process                     
INFO[0000] four_sequences.fasta mime type: text/fasta   
INFO[0000] On output use JSON headers                   
INFO[0000] Output is done on stdout                     
INFO[0000] Data is writen to stdout                     
INFO[0000] NCBI Taxdump Tar Archive detected: ncbitaxo.tgz 
INFO[0000] Loading Taxonomy nodes                       
INFO[0003] 2653519 Taxonomy nodes read                  
INFO[0003] Loading Taxon names                          
INFO[0005] 2653519 taxon names read                     
INFO[0005] Loading Merged taxa                          
INFO[0005] 88919 merged taxa read                       
WARN[0005] AF023201: Taxid 67561 has to be updated to taxon:305503 [Lepidomeda copei]@species 
WARN[0005] JN897380: Taxid 1114968 has to be updated to taxon:2734678 [Neotrypaea thermophila]@species 
WARN[0005] KC236422: Taxid: 5799994 is unknown from taxonomy (Taxid 5799994 is not part of the taxonomy NCBI Taxonomy)

Of the four sequences, only the first sequence has a taxid known from the NCBI taxonomy. The other three sequences have taxids that are not part of the NCBI taxonomy. In fact, the second and third sequences have taxids that were known in the NCBI taxonomy, but are now transferred to other taxids. The fourth sequence has a taxid that is actually unknown in the NCBI taxonomy.

Since only the first sequence AY189646 has a known taxid in the output, the taxids are rewritten in long form for this sequence only. For the other three sequences, the taxids are left as they were before. Nevertheless, all four sequences are present in the output.

>AY189646 {"count":1,"definition":"Homo sapiens clone arCan119 12S ribosomal RNA gene, partial sequence; mitochondrial gene for mitochondrial product.","species_name":"Homo sapiens","taxid":"taxon:9606 [Homo sapiens]@species"}
ttagccctaaacctcaacagttaaatcaacaaaactgctcgccagaacactacgrgccac
agcttaaaactcaaaggacctggcggtgcttcatatccct
>AF023201 {"count":1,"definition":"Snyderichthys copei 12S ribosomal RNA gene, mitochondrial gene for mitochondrial RNA, complete sequence.","species_name":"Snyderichthys copei","taxid":"67561"}
tcagccataaacctagatgtccagctacagttagacatccgcccgggtactacgagcatt
agcttgaaacccaaaggacctgacggtgccttagaccccc
>JN897380 {"count":1,"definition":"Nihonotrypaea thermophila mitochondrion, complete genome.","species_name":"Nihonotrypaea thermophila","taxid":"1114968"}
tagccttaacaaacatactaaaatattaaaagttatggtctctaaatttaaaggatttgg
cggtaatttagtccag
>KC236422 {"count":1,"definition":"Nihonotrypaea japonica mitochondrion, complete genome.","species_name":"Nihonotrypaea japonica","taxid":"5799994"}
cagctttaacaaacatactaaaatattaaaagttatggtctctaaatttaaaggatttgg
cggtaatttagtccag

If the --update-taxid option is used, the OBITools4 command will update the taxids of sequences that have been transferred to other taxids. When executed on the same sequence file, the same three warnings appear, but the first two warnings announce that the taxids have been updated.

obiconvert -t ncbitaxo.tgz --update-taxid four_sequences.fasta

WARN[0007] AF023201: Taxid: 67561 is updated to taxon:305503 [Lepidomeda copei]@species 
WARN[0007] JN897380: Taxid: 1114968 is updated to taxon:2734678 [Neotrypaea thermophila]@species 
WARN[0007] KC236422: Taxid: 5799994 is unknown from taxonomy (Taxid 5799994 is not part of the taxonomy NCBI Taxonomy)

In the output, the taxids are rewritten in long format for the first sequence as before, but also for the next two sequences, taking into account their updated taxids.

>AY189646 {"count":1,"definition":"Homo sapiens clone arCan119 12S ribosomal RNA gene, partial sequence; mitochondrial gene for mitochondrial product.","species_name":"Homo sapiens","taxid":"taxon:9606 [Homo sapiens]@species"}
ttagccctaaacctcaacagttaaatcaacaaaactgctcgccagaacactacgrgccac
agcttaaaactcaaaggacctggcggtgcttcatatccct
>AF023201 {"count":1,"definition":"Snyderichthys copei 12S ribosomal RNA gene, mitochondrial gene for mitochondrial RNA, complete sequence.","species_name":"Snyderichthys copei","taxid":"taxon:305503 [Lepidomeda copei]@species"}
tcagccataaacctagatgtccagctacagttagacatccgcccgggtactacgagcatt
agcttgaaacccaaaggacctgacggtgccttagaccccc
>JN897380 {"count":1,"definition":"Nihonotrypaea thermophila mitochondrion, complete genome.","species_name":"Nihonotrypaea thermophila","taxid":"taxon:2734678 [Neotrypaea thermophila]@species"}
tagccttaacaaacatactaaaatattaaaagttatggtctctaaatttaaaggatttgg
cggtaatttagtccag
>KC236422 {"count":1,"definition":"Nihonotrypaea japonica mitochondrion, complete genome.","species_name":"Nihonotrypaea japonica","taxid":"5799994"}
cagctttaacaaacatactaaaatattaaaagttatggtctctaaatttaaaggatttgg
cggtaatttagtccag

If the --fail-on-taxonomy option is used, the OBITools4 command will abort if it encounters a taxid that is not in the NCBI taxonomy. If it is run on the same sequence file, you will see the error message that stops the command when reading the last sequence annotated with a taxid that is not in the NCBI taxonomy. If the --update-taxid option was not used, the command would also have been aborted on the sequence AF023201.

obiconvert -t ncbitaxo.tgz --update-taxid --fail-on-taxonomy four_sequences.fasta

WARN[0007] AF023201: Taxid: 67561 is updated to taxon:305503 [Lepidomeda copei]@species 
WARN[0007] JN897380: Taxid: 1114968 is updated to taxon:2734678 [Neotrypaea thermophila]@species 
FATA[0007] KC236422: Taxid: 5799994 is unknown from taxonomy (Taxid 5799994 is not part of the taxonomy NCBI Taxonomy)

To remove invalid taxids from your file, you can use obigrep to keep only sequences with a valid taxid. This is the role of the `–valid-taxid’ option.

obigrep -t ncbitaxo.tgz \
        --update-taxid \
        --valid-taxid four_sequences.fasta

WARN[0006] KC236422: Taxid: 5799994 is unknown from taxonomy (Taxid 5799994 is not part of the taxonomy NCBI Taxonomy) 
WARN[0006] AF023201: Taxid: 67561 is updated to taxon:305503 [Lepidomeda copei]@species 
WARN[0006] JN897380: Taxid: 1114968 is updated to taxon:2734678 [Neotrypaea thermophila]@species

If the same three warnings occur, you will notice that only the first three sequences are preserved in the resulting file.

>AY189646 {"count":1,"definition":"Homo sapiens clone arCan119 12S ribosomal RNA gene, partial sequence; mitochondrial gene for mitochondrial product.","species_name":"Homo sapiens","taxid":"taxon:9606 [Homo sapiens]@species"}
ttagccctaaacctcaacagttaaatcaacaaaactgctcgccagaacactacgrgccac
agcttaaaactcaaaggacctggcggtgcttcatatccct
>AF023201 {"count":1,"definition":"Snyderichthys copei 12S ribosomal RNA gene, mitochondrial gene for mitochondrial RNA, complete sequence.","species_name":"Snyderichthys copei","taxid":"taxon:305503 [Lepidomeda copei]@species"}
tcagccataaacctagatgtccagctacagttagacatccgcccgggtactacgagcatt
agcttgaaacccaaaggacctgacggtgccttagaccccc
>JN897380 {"count":1,"definition":"Nihonotrypaea thermophila mitochondrion, complete genome.","species_name":"Nihonotrypaea thermophila","taxid":"taxon:2734678 [Neotrypaea thermophila]@species"}
tagccttaacaaacatactaaaatattaaaagttatggtctctaaatttaaaggatttgg
cggtaatttagtccag

Manipulating paired sequence files with OBITools4 #

Sequencing machines, particularly Illumina machines, produce paired-read data sets. The two paired reads correspond to two sequencings of the same DNA molecule from either end. They are commonly referred to as ‘forward reads’ and ‘reverse reads’.

Today, these paired reads are provided to the biologist in the form of two fastq files. These files assume that the two reads corresponding to the sequencing of the same DNA molecule are in the same position in the two files. If the data manipulations that delete or insert sequences in these files are not performed symmetrically, it is very likely that they will be out of phase, so that the two sequences will no longer be in the same position.

📄 forward.fastq

@M01334:147:000000000-LBRVD:1:1101:14968:1570 1:N:0:CTCACCAA+CTAGGCAA
TGTTCCACGGGCAATCCTGAGCCAAATCTTTCATTTTGAAAAAATGAGAGATATAATGTATCTCTTATTTATTATAAGAAATAAAATATTTCTTATCTAATATTAAAGTTAGGTGCAGAGACTCAATGGGTGGAACTAGATCGGATGTGCA
+
11>A>@3@A11>ACFFEG110BFB00BAFGHE2DFGG201110/B11111/D1D2222D2FDFDFGDGHHBGG2F222110D11@1D1FGHFHGFF@GE1F2FG22112B220F1@111/0>BF11B210B>//11B1<1BB<///<1122
@M01334:147:000000000-LBRVD:1:1101:15946:1586 1:N:0:CTCACCAA+CTAGGCAA
TCCTAACCCCATTGAGTCTCTGCACCTATCTTTAATATTAGATAAGAAATATTTTATTTCTTATAATAAATAAGAGATATTTTATATCTCTCATTTTTTCAAAATGAAAGATTTGGCTCAGGATTGCCCACGTAACGGAGATCGGAAGAGC
+
1>>A111>>>AFGGB1FFGFGFF3BBF1GGHHH33D2GH2B1D211110D1DGHHBFGGGGG2FA2F221F21A1F0D1DGHH2FAFFGFHFFGHHHHGG22@1BD111@0FFHE11GC1001BGF1B1B/EF00??////BF////<000
@M01334:147:000000000-LBRVD:1:1101:15399:1590 1:N:0:CTCACCAA+CTAGGCAA
TGTTCCACCCATTGAGTCTCTGCACCTATCTTTAATATTAGATAAGAAATATTTTACTTCTTATAATAAATAAGAGTTATTTTATATCTCTCATTTTTTCAAAATGAAAGATTTGGCTCAGGATTGCCCGTGGAACTAGATCGGAAGAGCA
+
11>A>@3B>>1CF111BBFAG3A3AAF1FFGHHF3FBGH221F211110D1DGHH2BBGBFF2F22D221D211111A2DDGG2F2FFFEGD1FFHHHGFD221B111110BFGD11F@1001BF0@@1/EA//1>F1B1FD/////00<1
@M01334:147:000000000-LBRVD:1:1101:13773:1687 1:N:0:CTCACCAA+CTAGGCAA
CTCGGATCACCATTGAGTCTCTGCACCTATCTTTAATATTAGATAAGAAAAAATATTATTTCTTATCTGAAATAAGAAATATTTTATATATTTCTTTTTCTCAAAATGAAAGATTTGGCTCAGGATTGCCCTGATCCGAGGGATAGCACCA
+
3AAAAAADFFFFGGGGFGGGGGHHHHHHFHHHHHHHHGHHHHGHGGHFFHHHCGFHHHHHHHHHHHHHGHHGGFHFFHHHGHHHHBHHHGHHHHHHHHHHHHHFFHHFBDFBCGHHF4BGHFGFFHHBDGFHHEHHFAAEECEGF3FDGFC

📄 reverse.fastq

@M01334:147:000000000-LBRVD:1:1101:14968:1570 2:N:0:CTCACCAA+CTAGGCAA
TTTTCCTCCCTTTTTTTCTCTGCACCTTTCTTTTTTATTAGTTTTTTATTATTTTTTTTCTTTTTTTATTTTATTGATACTTTATATCTCTCTTTTTTTCTTTTTTATTGATTTTTCTCTGGTTTTCCCTTGTTACTTGTTCTTTTTTGCT
+
11>>1131111BB111A0B3B313A0B1BAFGG11E/DG222B22///1D2DDGG1AE>>FG1D1/>/12B221212@21BFD2B2B2B2F11BFGHEEC1111B//1212BBF110@22111@@/2111?01111@111?111111--11
@M01334:147:000000000-LBRVD:1:1101:15946:1586 2:N:0:CTCACCAA+CTAGGCAA
CCGTTACGTGGGCAATCCTGAGCCAATTCTTTCTTTTTGAAAAAATGAGAGATATAAAATATCTCTTATTTATTATAAGAAATAAAATATTTCTTATCTAATATTAATGATAGGTGCAGTGACTCTATGGGGTTAGGTAGTTCGGATGAGC
+
111>>111B111111BA0B1101B001BAGGH22DGGH?01110/B11111/D1D2221D1DBEDGH1GHH2GG2F222110D@111D1DFGEGFBG@GB1B2FG22222B220B11111111B@11B210/?E/00B211B2/////111
@M01334:147:000000000-LBRVD:1:1101:15399:1590 2:N:0:CTCACCAA+CTAGGCAA
TTTTCCTCGGGCTATCCTGAGCCAAATCTTTCCTTTTGAAAAATTTAGAGATATAAAATATCTCTTATTTATTTTATGTAGTATTATATTTCTTATCTAATATTAAATTTAGTTGCTTTTTCTCATTTTGTTTTACTTTTTCTTTTTTGCT
+
11>>1131111111B11B1101A000B1DFF21DDFG1011100B122111D1D2221D1DADAFG1DGH2FG2D212222D2222D2DAF2FG2D@F21B2DE22122B221@11111110B222B222B00021B221B011111//11
@M01334:147:000000000-LBRVD:1:1101:13773:1687 2:N:0:CTCACCAA+CTAGGCAA
TGATAGCAGGGCTATCCTGAGCCAAATCCGTGTTTTGAGAAAACAAGGGGGTTCTCGAACTAGAATACAAAAGAAAAGGATAGGTGCAGAGACTCAATGGTGCTATCCCTCGGATCAGGGCAATCCTTAGCCAAATCTTTCATTTTTTGAA
+
111>13@1111>11B1AF11BABC00B110BAFGGH0000DFAB//0///EEECGFA10AG1111D@@11100/0000/0F110B11@11/0>FC@1B>1B11FEFEC>E>///?<0110/?/FF<G22111@00@<GHHB>FHHH1///1

In the two files above, the first sequence of the forward.fastq file with the ID M01334:147:000000000-LBRVD:1:1101:14968:1570 is paired with the first sequence of the reverse.fastq file with the same ID M01334:147:000000000-LBRVD:1:1101:14968:1570, not because they have the same identifier, but because they are both the first sequence of their respective files.

Some of the OBITools4 commands, such as obiconvert , obigrep or obiannotate offer a --paired-with option. This option takes a filename as a parameter. It tells the OBITools4 command that the file given as an argument is paired with the file being processed. Therefore, the OBITools4 commands will process both the forward and reverse files in parallel.

As the --paired-with option allows the OBITools4 command to process two files, it also produces two result files. As a result, standard output cannot be used to return the results. Therefore, when using the --paired-with option, the --out option must be used. The --out option takes a filename as a parameter and tells the OBITools4 command to write the result to the specified file. As a single filename is given, the OBITools4 command modifies this filename by adding a suffix _R1 or _R2 to create two filenames.

obiconvert --paired-with reverse.fastq \
           --out result.fasta \
           --fasta-output \
           forward.fastq

This command processes the forward.fastq and the reverse.fastq as two paired files. It then converts them into two fasta files named result_R1.fasta and result_R2.fasta for the forward and reverse reads respectively.

ls -l *.fast?

-rw-r--r--@ 1 myself  staff  1504  8 mar 15:09 forward.fastq
-rw-r-----@ 1 myself  staff   964  8 mar 17:36 result_R1.fasta
-rw-r-----@ 1 myself  staff   964  8 mar 17:36 result_R2.fasta
-rw-r--r--@ 1 myself  staff  1504  8 mar 15:09 reverse.fastq

The ls command is used here to see the results of the above obiconvert command, with the two resulting files and their names built by adding the suffixes _R1 or _R2 at the end of the filename just before the extension.