
obitag: performs taxonomic assignment #

Description #

obitag assigns a taxonomic annotation to each input sequence by searching a reference database and computing the Lowest Common Ancestor (LCA) of the best-matching reference sequences. It is typically run after the paired-end merging (obipairing), demultiplexing (obimultiplex), dereplication (obiuniq), and denoising (obiclean) steps.

The taxonomic identification is a four-step process. For each query sequence (\(Q\)), obitag:

  • pre-screens the reference database using a 4-mer contingency table to identify candidate reference sequences;
  • among the pre-screened sequences, identifies the best-hit sequence (\(BH\)) using Longest Common Subsequence (LCS) scoring and records its identity (\(BI\)) with the query sequence;
\[ BI = \text{identity}(Q,BH) \]
  • identifies, in the reference database, the set of sequences \(\mathcal{S}\) such that
\[ \mathcal{S} = \left\{ S_i \mid \text{identity}(S_i, BH) \geq BI \right\} \]
  • computes the LCA in the taxonomy, yielding the most precise taxonomic node consistent with all the sequences of \(\mathcal{S}\) identified at the previous step.
Note

When multiple reference sequences share the same best identity with \(Q\), the set of best hits is defined as:
\[ \mathcal{BH} = \left\{ BH_j \mid \text{identity}(Q, BH_j) = BI \right\} \]
and \(\mathcal{S}\) is then computed as the union over all best hits:
\[ \mathcal{S} = \bigcup_{BH_j \,\in\, \mathcal{BH}} \left\{ S_i \mid \text{identity}(S_i, BH_j) \geq BI \right\} \]
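
The four steps above, including the tie rule from the note, can be sketched in Python on a toy example. This is only an illustration, not the obitag implementation: the pre-screen is reduced to a count of shared 4-mers, and the identity formula (LCS length divided by the length of the longer sequence), the `min_shared` threshold, the reference sequences, and the five-node taxonomy are all assumptions made for the sketch.

```python
from collections import Counter

def kmer_counts(seq, k=4):
    """Count the overlapping k-mers of a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def shared_kmers(a, b, k=4):
    """Number of k-mers common to both sequences (multiset intersection)."""
    return sum((kmer_counts(a, k) & kmer_counts(b, k)).values())

def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def identity(a, b):
    """ASSUMPTION: identity = LCS length / length of the longer sequence."""
    return lcs_length(a, b) / max(len(a), len(b))

def lca(taxids, parent):
    """Deepest taxon shared by the lineages of all taxids."""
    def lineage(t):  # root first; the root is its own parent
        path = [t]
        while parent[path[-1]] != path[-1]:
            path.append(parent[path[-1]])
        return path[::-1]
    common = None
    for nodes in zip(*(lineage(t) for t in taxids)):
        if len(set(nodes)) != 1:
            break
        common = nodes[0]
    return common

def assign(query, refs, parent, root=1, min_shared=1):
    """refs maps a reference id to a (sequence, taxid) pair."""
    # Step 1: pre-screen keeps references sharing enough 4-mers with Q.
    candidates = [r for r, (seq, _) in refs.items()
                  if shared_kmers(query, seq) >= min_shared]
    if not candidates:
        return root  # no match: assign to the root of the taxonomy
    # Step 2: best identity BI and the set of best hits (ties included).
    scores = {r: identity(query, refs[r][0]) for r in candidates}
    bi = max(scores.values())
    best_hits = [r for r, s in scores.items() if s == bi]
    # Step 3: S = references at least as similar to a best hit as Q is.
    selected = {r for r, (seq, _) in refs.items()
                for bh in best_hits if identity(seq, refs[bh][0]) >= bi}
    # Step 4: LCA of the taxids carried by the sequences of S.
    return lca({refs[r][1] for r in selected}, parent)

# Toy taxonomy (hypothetical): 1 = root, 10 = genus, 11 and 12 = its species.
PARENT = {1: 1, 10: 1, 11: 10, 12: 10, 20: 1}
REFS = {
    "r1": ("ACGTACGTAC", 11),
    "r2": ("ACGTACGTAA", 12),
    "r3": ("TTTTGGGGCC", 20),
}

print(assign("ACGTACGTAC", REFS, PARENT))  # exact match to r1 only → 11
print(assign("ACGTACGTAG", REFS, PARENT))  # r1 and r2 tie as best hits → 10
```

In the second call the query is equally close to two species of the same genus, so the assignment falls back to the genus, which is the behaviour the LCA step is meant to provide.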

graph TD
  A@{ shape: doc, label: "wolf_filtered.fasta" }
  C[obitag]
  D@{ shape: doc, label: "wolf_tag.fasta" }
  R@{ shape: doc, label: "db_v05_r117.fasta.gz" }
  T@{ shape: cyl, label: "ncbitaxo.tgz" }
  R --> C
  T --> C
  A --> C:::obitools
  C --> D
  classDef obitools fill:#99d57c

Each output sequence carries the original attributes plus the following obitag-specific annotations:

| Attribute | Description |
|---|---|
| taxid | Assigned taxonomic node, written as TAXCOD:TAXID [name]@rank (e.g. for an NCBI taxon: taxon:9858 [Capreolus capreolus]@species). With the --raw-taxid option set, taxid is written as a plain integer string (e.g. 9858). |
| obitag_bestmatch | Sequence ID of the best-matching reference sequence. |
| obitag_rank | Taxonomic rank of the assigned node (e.g., species, genus, infraorder). |
| obitag_bestid | Sequence identity (ratio 0–1) of the best-matching reference. |
| obitag_match_count | Number of reference sequences used for the LCA computation: \(\lvert\mathcal{S}\rvert\). |
| obitag_similarity_method | Similarity method used: "lcs" for the default alignment-based mode. |

When no confident match is found, the sequence is assigned to the root of the taxonomy (taxid=1).
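
The structured taxid string is easy to split back into its components in downstream scripts. A minimal Python sketch, assuming the layout described above; the helper name `parse_taxid` and the regular expression are illustrative and not part of OBITools4:

```python
import re

# code:taxid [scientific name]@rank, e.g. "taxon:9858 [Capreolus capreolus]@species"
TAXID_PATTERN = re.compile(
    r"^(?P<code>[^:\s]+):(?P<taxid>\S+) \[(?P<name>.*)\]@(?P<rank>.+)$"
)

def parse_taxid(value):
    """Split a taxid annotation into its code, taxid, name and rank."""
    match = TAXID_PATTERN.match(value)
    if match is None:
        # Plain form, as written when --raw-taxid is set.
        return {"code": None, "taxid": value, "name": None, "rank": None}
    return match.groupdict()

print(parse_taxid("taxon:9858 [Capreolus capreolus]@species"))
print(parse_taxid("9858"))
```

The fallback branch keeps the function usable on files produced with --raw-taxid, where only the identifier is available.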

Quality of the reference database

Because taxonomic inference relies on the LCA algorithm, obitag is highly sensitive to errors in the taxonomic annotation of the reference database. In the worst case, a single wrongly annotated sequence in a clade can cause all the sequences belonging to that clade to be assigned to the root of the taxonomy.
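
The effect of a single mislabelled reference can be reproduced on a toy taxonomy. The sketch below is illustrative only: the lineage is heavily simplified (intermediate ranks are omitted) and the `lca` helper is a hypothetical stand-in for the computation done by obitag.

```python
# Simplified child -> parent map; the root is its own parent.
PARENT = {
    "root": "root",
    "Bacteria": "root",
    "Mammalia": "root",
    "Cervidae": "Mammalia",
    "Capreolus capreolus": "Cervidae",
    "Cervus elaphus": "Cervidae",
}

def ancestors(taxon):
    """Lineage of a taxon, from the taxon itself up to the root."""
    path = [taxon]
    while PARENT[path[-1]] != path[-1]:
        path.append(PARENT[path[-1]])
    return path

def lca(taxa):
    """Deepest node present in the lineage of every taxon."""
    paths = [ancestors(t) for t in taxa]
    shared = set(paths[0]).intersection(*paths[1:])
    return next(node for node in paths[0] if node in shared)

clean = ["Capreolus capreolus", "Cervus elaphus"]
print(lca(clean))                 # → Cervidae
print(lca(clean + ["Bacteria"]))  # → root
```

One deer sequence wrongly labelled as a bacterium is enough to move the assignment from the family down to the root, which is why curating the reference database matters.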

Example output #

The 8 MOTUs from the wolf diet tutorial, after assignment against the EMBL vertebrate 12S reference database (db_v05_r117.fasta.gz) using the NCBI reference taxonomy (ncbitaxo.tgz):

📄 wolf_query.fasta
>HELIUM_000100422_612GNAAXX:7:118:3572:14633#0/1_sub[28..126] {"count":10172,"merged_sample":{"26a_F040644":10172}}
ttagccctaaacataaacattcaataaacaagaatgttcgccagagtactactagcaaca
gcctgaaactcaaaggacttggcggtgctttacatccct
>HELIUM_000100422_612GNAAXX:7:99:9351:13090#0/1_sub[28..127] {"count":260,"merged_sample":{"29a_F260619":260}}
ttagccctaaacacaaataattacacaaacaaaattgttcaccagagtactagcggcaac
agcttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:108:10111:9078#0/1_sub[28..127] {"count":7146,"merged_sample":{"13a_F730603":7146}}
ctagccttaaacacaaatagttatgcaaacaaaactattcgccagagtactaccggcaat
agcttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:38:14204:12725#0/1_sub[28..126] {"count":87,"merged_sample":{"26a_F040644":87}}
ttagccctaaacataaacattcaataaacaagaatgttcgccagaggactactagcaata
gcttaaaactcaaaggacttggcggtgctttatatccct
>HELIUM_000100422_612GNAAXX:7:30:9942:4495#0/1_sub[28..126] {"count":95,"merged_sample":{"26a_F040644":11,"29a_F260619":84}}
ttagccctaaacataagctattccataacaaaataattcgccagagaactactagcaaca
gattaaacctcaaaggacttggcagtgctttatacccct
>HELIUM_000100422_612GNAAXX:7:51:16702:19393#0/1_sub[28..127] {"count":12004,"merged_sample":{"15a_F730814":7465,"29a_F260619":4539}}
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
agcttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:84:14502:1617#0/1_sub[28..127] {"count":319,"merged_sample":{"29a_F260619":319}}
ttagccctaaacacaagtaattattataacaaaattattcgccagagtactaccggcaat
agcttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:50:10637:6527#0/1_sub[28..126] {"count":366,"merged_sample":{"13a_F730603":13,"15a_F730814":5,"26a_F040644":347,"29a_F260619":1}}
ttagccctaaacatagataattttacaacaaaataattcgccagaggactactagcaata
gcttaaaactcaaaggacttggcggtgctttatatccct
obitag -t ncbitaxo.tgz \
       -R db_v05_r117.fasta.gz \
       wolf_query.fasta \
       > out_ecotag.fasta
📄 out_ecotag.fasta
>HELIUM_000100422_612GNAAXX:7:118:3572:14633#0/1_sub[28..126] {"count":10172,"merged_sample":{"26a_F040644":10172},"obitag_bestid":0.9797979797979798,"obitag_bestmatch":"AY227529","obitag_match_count":1,"obitag_rank":"genus","obitag_similarity_method":"lcs","taxid":"taxon:9992 [Marmota]@genus"}
ttagccctaaacataaacattcaataaacaagaatgttcgccagagtactactagcaaca
gcctgaaactcaaaggacttggcggtgctttacatccct
>HELIUM_000100422_612GNAAXX:7:99:9351:13090#0/1_sub[28..127] {"count":260,"merged_sample":{"29a_F260619":260},"obitag_bestid":0.9405940594059405,"obitag_bestmatch":"AF154263","obitag_match_count":9,"obitag_rank":"infraorder","obitag_similarity_method":"lcs","taxid":"taxon:35500 [Pecora]@infraorder"}
ttagccctaaacacaaataattacacaaacaaaattgttcaccagagtactagcggcaac
agcttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:108:10111:9078#0/1_sub[28..127] {"count":7146,"merged_sample":{"13a_F730603":7146},"obitag_bestid":1,"obitag_bestmatch":"AB245427","obitag_match_count":1,"obitag_rank":"species","obitag_similarity_method":"lcs","taxid":"taxon:9860 [Cervus elaphus]@species"}
ctagccttaaacacaaatagttatgcaaacaaaactattcgccagagtactaccggcaat
agcttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:38:14204:12725#0/1_sub[28..126] {"count":87,"merged_sample":{"26a_F040644":87},"obitag_bestid":0.9494949494949495,"obitag_bestmatch":"AY227530","obitag_match_count":2,"obitag_rank":"tribe","obitag_similarity_method":"lcs","taxid":"taxon:337730 [Marmotini]@tribe"}
ttagccctaaacataaacattcaataaacaagaatgttcgccagaggactactagcaata
gcttaaaactcaaaggacttggcggtgctttatatccct
>HELIUM_000100422_612GNAAXX:7:30:9942:4495#0/1_sub[28..126] {"count":95,"merged_sample":{"26a_F040644":11,"29a_F260619":84},"obitag_bestid":0.9595959595959596,"obitag_bestmatch":"AC187326","obitag_match_count":1,"obitag_rank":"subspecies","obitag_similarity_method":"lcs","taxid":"taxon:9615 [Canis lupus familiaris]@subspecies"}
ttagccctaaacataagctattccataacaaaataattcgccagagaactactagcaaca
gattaaacctcaaaggacttggcagtgctttatacccct
>HELIUM_000100422_612GNAAXX:7:51:16702:19393#0/1_sub[28..127] {"count":12004,"merged_sample":{"15a_F730814":7465,"29a_F260619":4539},"obitag_bestid":1,"obitag_bestmatch":"AJ885202","obitag_match_count":1,"obitag_rank":"species","obitag_similarity_method":"lcs","taxid":"taxon:9858 [Capreolus capreolus]@species"}
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
agcttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:84:14502:1617#0/1_sub[28..127] {"count":319,"merged_sample":{"29a_F260619":319},"obitag_bestid":1,"obitag_bestmatch":"AJ972683","obitag_match_count":1,"obitag_rank":"species","obitag_similarity_method":"lcs","taxid":"taxon:9858 [Capreolus capreolus]@species"}
ttagccctaaacacaagtaattattataacaaaattattcgccagagtactaccggcaat
agcttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:50:10637:6527#0/1_sub[28..126] {"count":366,"merged_sample":{"13a_F730603":13,"15a_F730814":5,"26a_F040644":347,"29a_F260619":1},"obitag_bestid":1,"obitag_bestmatch":"AB048590","obitag_match_count":1,"obitag_rank":"genus","obitag_similarity_method":"lcs","taxid":"taxon:9611 [Canis]@genus"}
ttagccctaaacatagataattttacaacaaaataattcgccagaggactactagcaata
gcttaaaactcaaaggacttggcggtgctttatatccct
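
Because the annotations are stored as JSON on each title line, the result file is easy to summarize with a few lines of Python. A minimal sketch, assuming the default JSON header format; the two records are copied from the output above, while the function name `read_annotations` is only illustrative:

```python
import json
from collections import Counter

SAMPLE = """\
>HELIUM_000100422_612GNAAXX:7:99:9351:13090#0/1_sub[28..127] {"count":260,"merged_sample":{"29a_F260619":260},"obitag_bestid":0.9405940594059405,"obitag_bestmatch":"AF154263","obitag_match_count":9,"obitag_rank":"infraorder","obitag_similarity_method":"lcs","taxid":"taxon:35500 [Pecora]@infraorder"}
ttagccctaaacacaaataattacacaaacaaaattgttcaccagagtactagcggcaac
agcttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:108:10111:9078#0/1_sub[28..127] {"count":7146,"merged_sample":{"13a_F730603":7146},"obitag_bestid":1,"obitag_bestmatch":"AB245427","obitag_match_count":1,"obitag_rank":"species","obitag_similarity_method":"lcs","taxid":"taxon:9860 [Cervus elaphus]@species"}
ctagccttaaacacaaatagttatgcaaacaaaactattcgccagagtactaccggcaat
agcttaaaactcaaaggacttggcggtgctttataccctt
"""

def read_annotations(lines):
    """Yield (sequence id, annotations) for each FASTA title line."""
    for line in lines:
        if line.startswith(">"):
            # The id ends at the first space; the rest is the JSON header.
            seq_id, _, raw = line[1:].strip().partition(" ")
            yield seq_id, (json.loads(raw) if raw else {})

# Sum the read counts per assigned rank.
reads_per_rank = Counter()
for seq_id, annot in read_annotations(SAMPLE.splitlines()):
    reads_per_rank[annot["obitag_rank"]] += annot["count"]
    print(annot["taxid"], annot["obitag_bestid"])

print(dict(reads_per_rank))
```

To process a real result file, the same generator can be fed with an open file handle instead of `SAMPLE.splitlines()`.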

Synopsis #

obitag --reference-db|-R <FILENAME> [--batch-mem <string>]
       [--batch-size <int>] [--batch-size-max <int>] [--compress|-Z] [--csv]
       [--debug] [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta]
       [--fasta-output] [--fastq] [--fastq-output] [--genbank]
       [--geometric|-G] [--help|-h|-?] [--input-OBI-header]
       [--input-json-header] [--json-output] [--max-cpu <int>] [--no-order]
       [--no-progressbar] [--out|-o <FILENAME>] [--output-OBI-header|-O]
       [--output-json-header] [--pprof] [--pprof-goroutine <int>]
       [--pprof-mutex <int>] [--raw-taxid] [--save-db <FILENAME>]
       [--silent-warning] [--skip-empty] [--solexa] [--taxonomy|-t <string>]
       [--u-to-t] [--update-taxid] [--version] [--with-leaves] [<args>]

Options #

obitag specific options #

  • --reference-db | -R <FILENAME>: The name of the file containing the reference database.
  • --save-db <FILENAME>: The name of the file in which the reference database should be saved after a round of annotation. This new database includes precomputed similarity indexes that can be used to accelerate the annotation process for other datasets. The default setting is to not save this database.
  • --geometric | -G : Activate an experimental geometric similarity heuristic (default: false)
  • --with-leaves: When the taxonomy is extracted from the reference sequence file itself, add the reference sequences as leaf nodes under their respective taxids in the taxonomy tree. Useful when the reference file is the primary source of taxonomic information (default: false).

Taxonomy options #

Check taxids against a taxonomy #

OBITools4 commands can load a taxonomy database while processing sequence data. When a taxonomy is loaded, the command checks the validity of the taxids it encounters. Three cases can occur:
  • The taxon is valid.
  • The taxon is no longer valid, but a new one replaces it.
  • The taxon is no longer valid, and no new taxid exists to replace it.
In the first case, OBITools4 normalizes the taxid to the form:
    TAXCOD:TAXID [SCIENTIFIC NAME]@RANK
For example, with the NCBI taxonomy, the human taxid looks like:
    taxon:9606 [Homo sapiens]@species
This rewriting does not happen if the --raw-taxid option is set. In that case, only the raw taxid is kept:
    9606
In the second case, a warning message is logged to the standard error. If the --update-taxid option is set, the command replaces the expired taxid with its new equivalent, and the rules for valid taxa apply. Otherwise, the old taxid is kept in the result.
In the last case, a warning message is also logged to the standard error, and the invalid taxid is kept as is.
If the --fail-on-taxonomy option is set, the command stops and exits with an error at the first invalid taxid encountered in the input data.
  • --taxonomy | -t <string>: Path to the taxonomic database.
  • --raw-taxid: Displays the raw taxid for each displayed taxon. (default: false)
  • --update-taxid: Makes obitools automatically update taxids that are declared as merged into a newer one (default: false).
  • --fail-on-taxonomy: Makes obitools fail with an error if a taxid in use is not currently valid (default: false).
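
The three cases and their interaction with --update-taxid and --fail-on-taxonomy can be summarized as a small decision function. This Python sketch only mirrors the behaviour described in this section; the `VALID` and `MERGED` tables and the taxid 999001 are invented for the illustration, and the real checks are performed by OBITools4 against the loaded taxonomy.

```python
VALID = {9606}             # taxids currently valid (hypothetical table)
MERGED = {999001: 9606}    # deprecated taxid -> replacement (invented entry)

def check_taxid(taxid, update_taxid=False, fail_on_taxonomy=False):
    """Return (taxid to write, warning message or None)."""
    # Case 1: the taxon is valid.
    if taxid in VALID:
        return taxid, None
    # Case 2: the taxon is deprecated but has a replacement.
    if taxid in MERGED:
        warning = f"taxid {taxid} is deprecated, replaced by {MERGED[taxid]}"
        if update_taxid:
            return MERGED[taxid], warning
        if fail_on_taxonomy:
            raise ValueError(warning)
        return taxid, warning  # old taxid kept in the result
    # Case 3: the taxon is invalid and has no replacement.
    warning = f"taxid {taxid} is not valid"
    if fail_on_taxonomy:
        raise ValueError(warning)
    return taxid, warning

print(check_taxid(9606))
print(check_taxid(999001))
print(check_taxid(999001, update_taxid=True, fail_on_taxonomy=True))
```

The last call shows why the two options combine well: the deprecated taxid is updated before the strict check can reject it.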

Controlling the input data #

OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step.
The file format options #
  • --fasta: indicates that sequence data is in fasta format.
  • --fastq: indicates that sequence data is in fastq format.
  • --embl: indicates that sequence data is in EMBL-ENA flatfile format.
  • --csv: indicates that sequence data is in CSV format.
  • --genbank: indicates that sequence data is in GenBank flatfile format.
  • --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.
Controlling how OBITools4 formats annotations #
These options only apply to the FASTA and FASTQ formats
  • --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
  • --input-json-header: FASTA/FASTQ title line annotations follow the JSON format.
Controlling quality score decoding #
This option only applies to the FASTQ format
  • --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)

Controlling the output data #

  • --compress | -Z : output is compressed using gzip. (default: false)
  • --no-order: by default, OBITools preserve the order of the sequences between the input file and the output file. When multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation.
  • --fasta-output: writes sequence data in fasta format (default if quality data is not available).
  • --fastq-output: writes sequence data in fastq format (default if quality data is available).
  • --json-output: writes sequence data in JSON format.
  • --out | -o <FILENAME>: filename used for saving the output (default: “-”, the standard output)
  • --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON).
  • --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format).
  • --skip-empty: sequences of length equal to zero are removed from the output (default: false).
  • --no-progressbar: deactivates progress bar display (default: false).

General options #

  • --help | -h|-? : shows this help.
  • --version: prints the version and exits.
  • --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.
  • --max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
  • --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
  • --batch-size <INTEGER>: minimum number of sequences per batch for parallel processing (floor, default: 1, env: OBIBATCHSIZE)
  • --batch-size-max <INTEGER>: maximum number of sequences per batch for parallel processing (ceiling, default: 2000, env: OBIBATCHSIZEMAX)
  • --batch-mem <STRING>: maximum memory per batch (e.g. 128K, 64M, 1G; default: 128M; set to 0 to disable, env: OBIBATCHMEM)
  • --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
  • --pprof: enables pprof server. Look at the log for details. (default: false).
  • --pprof-mutex <INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
  • --pprof-goroutine <INTEGER>: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)

Examples #

Save and reuse the indexed reference database #

Building the internal reference index takes time on large databases. Use the --save-db option to persist the index computed during an annotation run, so that it can be reused in subsequent runs:

# First run: assign and save the indexed reference DB
obitag -t ncbitaxo.tgz \
       -R db_v05_r117.fasta.gz \
       --save-db wolf_ref_indexed.fasta \
       wolf_query.fasta \
       > out_basic.fasta

# Subsequent runs: use the pre-built index (significantly faster)
obitag -t ncbitaxo.tgz \
       -R wolf_ref_indexed.fasta \
       wolf_query.fasta \
       > out_basic.fasta
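
The two runs can be wrapped in a small helper that picks the right arguments depending on whether the indexed file already exists. A minimal Python sketch, assuming only the -t, -R and --save-db options documented on this page; the function name `obitag_args` is hypothetical, and the command is built but not executed:

```python
from pathlib import Path

def obitag_args(query, reference, taxonomy, indexed):
    """Build the obitag argument list, reusing the indexed DB when present."""
    indexed = Path(indexed)
    args = ["obitag", "-t", taxonomy]
    if indexed.exists():
        # Later runs: read the reference database saved with its index.
        args += ["-R", str(indexed)]
    else:
        # First run: read the original database and save the indexed copy.
        args += ["-R", reference, "--save-db", str(indexed)]
    return args + [query]

print(obitag_args("wolf_query.fasta", "db_v05_r117.fasta.gz",
                  "ncbitaxo.tgz", "wolf_ref_indexed.fasta"))
```

The returned list can be passed to `subprocess.run`, with `stdout` set to an open output file, to replace the shell redirection.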

Integrate taxonomy to the reference database #

The reference database can be modified to embed its own taxonomy. This is achieved using the obiannotate command.

obiannotate -t ncbitaxo.tgz \
            -Z \
            --taxonomic-path \
            --update-taxid \
            db_v05_r117.fasta.gz \
            > db_v05_r117_taxo.fasta.gz

The modified reference database includes, for each reference sequence, a new taxonomic_path annotation describing the full taxonomic path extracted from the NCBI taxonomy database.

gzcat db_v05_r117_taxo.fasta.gz | head -2
>AY189646 {"count":1,"definition":"Homo sapiens clone arCan119 12S ribosomal RNA gene, partial sequence; mitochondrial gene for mitochondrial product.","species_name":"Homo sapiens","taxid":"taxon:9606 [Homo sapiens]@species","taxonomic_path":["taxon:1 [root]@no rank","taxon:131567 [cellular organisms]@cellular root","taxon:2759 [Eukaryota]@domain","taxon:33154 [Opisthokonta]@clade","taxon:33208 [Metazoa]@kingdom","taxon:6072 [Eumetazoa]@clade","taxon:33213 [Bilateria]@clade","taxon:33511 [Deuterostomia]@clade","taxon:7711 [Chordata]@phylum","taxon:89593 [Craniata]@subphylum","taxon:7742 [Vertebrata]@clade","taxon:7776 [Gnathostomata]@clade","taxon:117570 [Teleostomi]@clade","taxon:117571 [Euteleostomi]@clade","taxon:8287 [Sarcopterygii]@superclass","taxon:1338369 [Dipnotetrapodomorpha]@clade","taxon:32523 [Tetrapoda]@clade","taxon:32524 [Amniota]@clade","taxon:40674 [Mammalia]@class","taxon:32525 [Theria]@clade","taxon:9347 [Eutheria]@clade","taxon:1437010 [Boreoeutheria]@clade","taxon:314146 [Euarchontoglires]@superorder","taxon:9443 [Primates]@order","taxon:376913 [Haplorrhini]@suborder","taxon:314293 [Simiiformes]@infraorder","taxon:9526 [Catarrhini]@parvorder","taxon:314295 [Hominoidea]@superfamily","taxon:9604 [Hominidae]@family","taxon:207598 [Homininae]@subfamily","taxon:9605 [Homo]@genus","taxon:9606 [Homo sapiens]@species"]}
ttagccctaaacctcaacagttaaatcaacaaaactgctcgccagaacactacgrgccac

The new db_v05_r117_taxo.fasta.gz file can now be used by obitag as a self-contained reference database, without requiring an external taxonomy file.

obitag -R db_v05_r117_taxo.fasta.gz \
       wolf_query.fasta \
       > out_basic.fasta
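
The taxonomic_path annotation carries enough information to recover the parent of every taxon on the path, which is why no external taxonomy file is needed. The Python sketch below illustrates the idea on the last entries of the Homo sapiens path shown above; the function `parents_from_paths` is hypothetical and does not describe how obitag actually reads the file.

```python
def parents_from_paths(paths):
    """Build a child -> parent map from root-to-leaf taxonomic paths."""
    parent = {}
    for path in paths:
        # Each entry is the parent of the entry that follows it.
        for above, below in zip(path, path[1:]):
            parent[below] = above
    return parent

# Tail of the taxonomic_path of AY189646 (earlier entries omitted for brevity).
PATH = [
    "taxon:9604 [Hominidae]@family",
    "taxon:207598 [Homininae]@subfamily",
    "taxon:9605 [Homo]@genus",
    "taxon:9606 [Homo sapiens]@species",
]

parent = parents_from_paths([PATH])
print(parent["taxon:9606 [Homo sapiens]@species"])  # → taxon:9605 [Homo]@genus
```

Merging the paths of all reference sequences in this way yields the part of the taxonomy tree that the database actually covers.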

Write plain taxids with automatic deprecated-taxid correction #

Use --raw-taxid to write compact integer taxids, and --update-taxid to automatically replace any deprecated taxids found in the reference database:

obitag -t ncbitaxo.tgz \
       -R db_v05_r117.fasta.gz \
       --update-taxid \
       --raw-taxid \
       wolf_query.fasta \
       > out_raw_taxid.fasta
📄 out_raw_taxid.fasta
>HELIUM_000100422_612GNAAXX:7:118:3572:14633#0/1_sub[28..126] {"count":10172,"merged_sample":{"26a_F040644":10172},"obiclean_head":true,"obiclean_headcount":1,"obiclean_internalcount":0,"obiclean_samplecount":1,"obiclean_singletoncount":0,"obiclean_status":{"26a_F040644":"h"},"obiclean_weight":{"26a_F040644":12205},"obitag_bestid":0.9797979797979798,"obitag_bestmatch":"AY227529","obitag_match_count":1,"obitag_rank":"genus","obitag_similarity_method":"lcs","taxid":"9992"}
ttagccctaaacataaacattcaataaacaagaatgttcgccagagtactactagcaaca
gcctgaaactcaaaggacttggcggtgctttacatccct
>HELIUM_000100422_612GNAAXX:7:99:9351:13090#0/1_sub[28..127] {"count":260,"merged_sample":{"29a_F260619":260},"obiclean_head":true,"obiclean_headcount":1,"obiclean_internalcount":0,"obiclean_samplecount":1,"obiclean_singletoncount":0,"obiclean_status":{"29a_F260619":"h"},"obiclean_weight":{"29a_F260619":337},"obitag_bestid":0.9405940594059405,"obitag_bestmatch":"AF154263","obitag_match_count":9,"obitag_rank":"infraorder","obitag_similarity_method":"lcs","taxid":"35500"}
ttagccctaaacacaaataattacacaaacaaaattgttcaccagagtactagcggcaac
agcttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:108:10111:9078#0/1_sub[28..127] {"count":7146,"merged_sample":{"13a_F730603":7146},"obiclean_head":true,"obiclean_headcount":1,"obiclean_internalcount":0,"obiclean_samplecount":1,"obiclean_singletoncount":0,"obiclean_status":{"13a_F730603":"h"},"obiclean_weight":{"13a_F730603":8039},"obitag_bestid":1,"obitag_bestmatch":"AB245427","obitag_match_count":1,"obitag_rank":"species","obitag_similarity_method":"lcs","taxid":"9860"}
ctagccttaaacacaaatagttatgcaaacaaaactattcgccagagtactaccggcaat
agcttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:38:14204:12725#0/1_sub[28..126] {"count":87,"merged_sample":{"26a_F040644":87},"obiclean_head":true,"obiclean_headcount":1,"obiclean_internalcount":0,"obiclean_samplecount":1,"obiclean_singletoncount":0,"obiclean_status":{"26a_F040644":"h"},"obiclean_weight":{"26a_F040644":202},"obitag_bestid":0.9494949494949495,"obitag_bestmatch":"AY227530","obitag_match_count":2,"obitag_rank":"tribe","obitag_similarity_method":"lcs","taxid":"337730"}
ttagccctaaacataaacattcaataaacaagaatgttcgccagaggactactagcaata
gcttaaaactcaaaggacttggcggtgctttatatccct
>HELIUM_000100422_612GNAAXX:7:30:9942:4495#0/1_sub[28..126] {"count":95,"merged_sample":{"26a_F040644":11,"29a_F260619":84},"obiclean_head":true,"obiclean_headcount":1,"obiclean_internalcount":0,"obiclean_samplecount":2,"obiclean_singletoncount":1,"obiclean_status":{"26a_F040644":"s","29a_F260619":"h"},"obiclean_weight":{"26a_F040644":12,"29a_F260619":105},"obitag_bestid":0.9595959595959596,"obitag_bestmatch":"AC187326","obitag_match_count":1,"obitag_rank":"subspecies","obitag_similarity_method":"lcs","taxid":"9615"}
ttagccctaaacataagctattccataacaaaataattcgccagagaactactagcaaca
gattaaacctcaaaggacttggcagtgctttatacccct
>HELIUM_000100422_612GNAAXX:7:51:16702:19393#0/1_sub[28..127] {"count":12004,"merged_sample":{"15a_F730814":7465,"29a_F260619":4539},"obiclean_head":true,"obiclean_headcount":2,"obiclean_internalcount":0,"obiclean_samplecount":2,"obiclean_singletoncount":0,"obiclean_status":{"15a_F730814":"h","29a_F260619":"h"},"obiclean_weight":{"15a_F730814":8822,"29a_F260619":5789},"obitag_bestid":1,"obitag_bestmatch":"AJ885202","obitag_match_count":1,"obitag_rank":"species","obitag_similarity_method":"lcs","taxid":"9858"}
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
agcttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:84:14502:1617#0/1_sub[28..127] {"count":319,"merged_sample":{"29a_F260619":319},"obiclean_head":true,"obiclean_headcount":1,"obiclean_internalcount":0,"obiclean_samplecount":1,"obiclean_singletoncount":0,"obiclean_status":{"29a_F260619":"h"},"obiclean_weight":{"29a_F260619":376},"obitag_bestid":1,"obitag_bestmatch":"AJ972683","obitag_match_count":1,"obitag_rank":"species","obitag_similarity_method":"lcs","taxid":"9858"}
ttagccctaaacacaagtaattattataacaaaattattcgccagagtactaccggcaat
agcttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:50:10637:6527#0/1_sub[28..126] {"count":366,"merged_sample":{"13a_F730603":13,"15a_F730814":5,"26a_F040644":347,"29a_F260619":1},"obiclean_head":true,"obiclean_headcount":1,"obiclean_internalcount":0,"obiclean_samplecount":4,"obiclean_singletoncount":3,"obiclean_status":{"13a_F730603":"s","15a_F730814":"s","26a_F040644":"h","29a_F260619":"s"},"obiclean_weight":{"13a_F730603":17,"15a_F730814":5,"26a_F040644":468,"29a_F260619":1},"obitag_bestid":1,"obitag_bestmatch":"AB048590","obitag_match_count":1,"obitag_rank":"genus","obitag_similarity_method":"lcs","taxid":"9611"}
ttagccctaaacatagataattttacaacaaaataattcgccagaggactactagcaata
gcttaaaactcaaaggacttggcggtgctttatatccct

Strict taxonomy enforcement #

Combine --update-taxid and --fail-on-taxonomy to first update deprecated taxids, then terminate immediately on any taxid that remains invalid after the update:

obitag -t ncbitaxo.tgz \
       -R db_v05_r117.fasta.gz \
       --update-taxid \
       --fail-on-taxonomy \
       wolf_query.fasta \
       > out_strict_update.fasta
📄 out_strict_update.fasta
>HELIUM_000100422_612GNAAXX:7:118:3572:14633#0/1_sub[28..126] {"count":10172,"merged_sample":{"26a_F040644":10172},"obiclean_head":true,"obiclean_headcount":1,"obiclean_internalcount":0,"obiclean_samplecount":1,"obiclean_singletoncount":0,"obiclean_status":{"26a_F040644":"h"},"obiclean_weight":{"26a_F040644":12205},"obitag_bestid":0.9797979797979798,"obitag_bestmatch":"AY227529","obitag_match_count":1,"obitag_rank":"genus","obitag_similarity_method":"lcs","taxid":"taxon:9992 [Marmota]@genus"}
ttagccctaaacataaacattcaataaacaagaatgttcgccagagtactactagcaaca
gcctgaaactcaaaggacttggcggtgctttacatccct
>HELIUM_000100422_612GNAAXX:7:99:9351:13090#0/1_sub[28..127] {"count":260,"merged_sample":{"29a_F260619":260},"obiclean_head":true,"obiclean_headcount":1,"obiclean_internalcount":0,"obiclean_samplecount":1,"obiclean_singletoncount":0,"obiclean_status":{"29a_F260619":"h"},"obiclean_weight":{"29a_F260619":337},"obitag_bestid":0.9405940594059405,"obitag_bestmatch":"AF154263","obitag_match_count":9,"obitag_rank":"infraorder","obitag_similarity_method":"lcs","taxid":"taxon:35500 [Pecora]@infraorder"}
ttagccctaaacacaaataattacacaaacaaaattgttcaccagagtactagcggcaac
agcttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:108:10111:9078#0/1_sub[28..127] {"count":7146,"merged_sample":{"13a_F730603":7146},"obiclean_head":true,"obiclean_headcount":1,"obiclean_internalcount":0,"obiclean_samplecount":1,"obiclean_singletoncount":0,"obiclean_status":{"13a_F730603":"h"},"obiclean_weight":{"13a_F730603":8039},"obitag_bestid":1,"obitag_bestmatch":"AB245427","obitag_match_count":1,"obitag_rank":"species","obitag_similarity_method":"lcs","taxid":"taxon:9860 [Cervus elaphus]@species"}
ctagccttaaacacaaatagttatgcaaacaaaactattcgccagagtactaccggcaat
agcttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:38:14204:12725#0/1_sub[28..126] {"count":87,"merged_sample":{"26a_F040644":87},"obiclean_head":true,"obiclean_headcount":1,"obiclean_internalcount":0,"obiclean_samplecount":1,"obiclean_singletoncount":0,"obiclean_status":{"26a_F040644":"h"},"obiclean_weight":{"26a_F040644":202},"obitag_bestid":0.9494949494949495,"obitag_bestmatch":"AY227530","obitag_match_count":2,"obitag_rank":"tribe","obitag_similarity_method":"lcs","taxid":"taxon:337730 [Marmotini]@tribe"}
ttagccctaaacataaacattcaataaacaagaatgttcgccagaggactactagcaata
gcttaaaactcaaaggacttggcggtgctttatatccct
>HELIUM_000100422_612GNAAXX:7:30:9942:4495#0/1_sub[28..126] {"count":95,"merged_sample":{"26a_F040644":11,"29a_F260619":84},"obiclean_head":true,"obiclean_headcount":1,"obiclean_internalcount":0,"obiclean_samplecount":2,"obiclean_singletoncount":1,"obiclean_status":{"26a_F040644":"s","29a_F260619":"h"},"obiclean_weight":{"26a_F040644":12,"29a_F260619":105},"obitag_bestid":0.9595959595959596,"obitag_bestmatch":"AC187326","obitag_match_count":1,"obitag_rank":"subspecies","obitag_similarity_method":"lcs","taxid":"taxon:9615 [Canis lupus familiaris]@subspecies"}
ttagccctaaacataagctattccataacaaaataattcgccagagaactactagcaaca
gattaaacctcaaaggacttggcagtgctttatacccct
>HELIUM_000100422_612GNAAXX:7:51:16702:19393#0/1_sub[28..127] {"count":12004,"merged_sample":{"15a_F730814":7465,"29a_F260619":4539},"obiclean_head":true,"obiclean_headcount":2,"obiclean_internalcount":0,"obiclean_samplecount":2,"obiclean_singletoncount":0,"obiclean_status":{"15a_F730814":"h","29a_F260619":"h"},"obiclean_weight":{"15a_F730814":8822,"29a_F260619":5789},"obitag_bestid":1,"obitag_bestmatch":"AJ885202","obitag_match_count":1,"obitag_rank":"species","obitag_similarity_method":"lcs","taxid":"taxon:9858 [Capreolus capreolus]@species"}
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
agcttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:84:14502:1617#0/1_sub[28..127] {"count":319,"merged_sample":{"29a_F260619":319},"obiclean_head":true,"obiclean_headcount":1,"obiclean_internalcount":0,"obiclean_samplecount":1,"obiclean_singletoncount":0,"obiclean_status":{"29a_F260619":"h"},"obiclean_weight":{"29a_F260619":376},"obitag_bestid":1,"obitag_bestmatch":"AJ972683","obitag_match_count":1,"obitag_rank":"species","obitag_similarity_method":"lcs","taxid":"taxon:9858 [Capreolus capreolus]@species"}
ttagccctaaacacaagtaattattataacaaaattattcgccagagtactaccggcaat
agcttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:50:10637:6527#0/1_sub[28..126] {"count":366,"merged_sample":{"13a_F730603":13,"15a_F730814":5,"26a_F040644":347,"29a_F260619":1},"obiclean_head":true,"obiclean_headcount":1,"obiclean_internalcount":0,"obiclean_samplecount":4,"obiclean_singletoncount":3,"obiclean_status":{"13a_F730603":"s","15a_F730814":"s","26a_F040644":"h","29a_F260619":"s"},"obiclean_weight":{"13a_F730603":17,"15a_F730814":5,"26a_F040644":468,"29a_F260619":1},"obitag_bestid":1,"obitag_bestmatch":"AB048590","obitag_match_count":1,"obitag_rank":"genus","obitag_similarity_method":"lcs","taxid":"taxon:9611 [Canis]@genus"}
ttagccctaaacatagataattttacaacaaaataattcgccagaggactactagcaata
gcttaaaactcaaaggacttggcggtgctttatatccct

Note: Using --fail-on-taxonomy alone (without --update-taxid) will cause obitag to exit with a fatal error when it encounters the deprecated taxids that are common in reference databases built from older taxonomy snapshots.

Display help #

obitag --help