Designing new barcodes with ecoPrimers #

ecoPrimers (Riaz, T., Shehzad, W., Viari, A., Pompanon, F., Taberlet, P. & Coissac, E. (2011). ecoPrimers: inference of new DNA barcode markers from whole genome sequence analysis. Nucleic Acids Research, 39(21), e145. https://doi.org/10.1093/nar/gkr732) is a tool for designing new DNA metabarcodes. It can work with a collection of mitochondrial genomes, chloroplast genomes or nuclear rRNA gene clusters. It is an alignment-free method, which is what makes it efficient.

The ecoPrimers program was developed to be used in conjunction with the original OBITools. Therefore, using it with the new OBITools4 requires some special care in data preparation.

In this recipe we will use ecoPrimers to design a new bony fish DNA metabarcode.

Installation of `ecoPrimers` #

ecoPrimers is available from the git repository of the metabarcoding.org site at

https://git.metabarcoding.org/obitools/ecoprimers

Installation can be done by cloning the project:

git clone https://git.metabarcoding.org/obitools/ecoprimers.git

This will create a new ecoprimers directory with a src subdirectory containing the source code. You will need to change your current working directory to this ecoprimers/src directory.

cd ecoprimers/src

It is now possible to compile the ecoPrimers program using the make command:

make

This command will produce a series of messages on your screen similar to the following. You may get some extra warning messages, but no errors should be reported. If compilation is successful, an ecoPrimers executable will be created in the current directory.

gcc -DMAC_OS_X -M  -o ecoprimer.d ecoprimer.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o ecoprimer.o ecoprimer.c
/Library/Developer/CommandLineTools/usr/bin/make -C libecoPCR
gcc -DMAC_OS_X -M  -o econame.d econame.c
gcc -DMAC_OS_X -M  -o ecofilter.d ecofilter.c
gcc -DMAC_OS_X -M  -o ecotax.d ecotax.c
gcc -DMAC_OS_X -M  -o ecoseq.d ecoseq.c
gcc -DMAC_OS_X -M  -o ecorank.d ecorank.c
gcc -DMAC_OS_X -M  -o ecoMalloc.d ecoMalloc.c
gcc -DMAC_OS_X -M  -o ecoIOUtils.d ecoIOUtils.c
gcc -DMAC_OS_X -M  -o ecoError.d ecoError.c
gcc -DMAC_OS_X -M  -o ecodna.d ecodna.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o ecodna.o ecodna.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o ecoError.o ecoError.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o ecoIOUtils.o ecoIOUtils.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o ecoMalloc.o ecoMalloc.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o ecorank.o ecorank.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o ecoseq.o ecoseq.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o ecotax.o ecotax.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o ecofilter.o ecofilter.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o econame.o econame.c
ar -cr libecoPCR.a ecodna.o ecoError.o ecoIOUtils.o ecoMalloc.o ecorank.o ecoseq.o ecotax.o ecofilter.o econame.o
ranlib libecoPCR.a
/Library/Developer/CommandLineTools/usr/bin/make -C libecoprimer
gcc -DMAC_OS_X -M  -o ahocorasick.d ahocorasick.c
gcc -DMAC_OS_X -M  -o PrimerSets.d PrimerSets.c
gcc -DMAC_OS_X -M  -o filtering.d filtering.c
gcc -DMAC_OS_X -M  -o apat_search.d apat_search.c
gcc -DMAC_OS_X -M  -o taxstats.d taxstats.c
gcc -DMAC_OS_X -M  -o pairs.d pairs.c
gcc -DMAC_OS_X -M  -o pairtree.d pairtree.c
gcc -DMAC_OS_X -M  -o sortmatch.d sortmatch.c
gcc -DMAC_OS_X -M  -o libstki.d libstki.c
gcc -DMAC_OS_X -M  -o queue.d queue.c
gcc -DMAC_OS_X -M  -o merge.d merge.c
gcc -DMAC_OS_X -M  -o aproxpattern.d aproxpattern.c
gcc -DMAC_OS_X -M  -o strictprimers.d strictprimers.c
gcc -DMAC_OS_X -M  -o hashsequence.d hashsequence.c
gcc -DMAC_OS_X -M  -o sortword.d sortword.c
gcc -DMAC_OS_X -M  -o smothsort.d smothsort.c
gcc -DMAC_OS_X -M  -o readdnadb.d readdnadb.c
gcc -DMAC_OS_X -M  -o goodtaxon.d goodtaxon.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o goodtaxon.o goodtaxon.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o readdnadb.o readdnadb.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o smothsort.o smothsort.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o sortword.o sortword.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o hashsequence.o hashsequence.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o strictprimers.o strictprimers.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o aproxpattern.o aproxpattern.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o merge.o merge.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o queue.o queue.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o libstki.o libstki.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o sortmatch.o sortmatch.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o pairtree.o pairtree.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o pairs.o pairs.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o taxstats.o taxstats.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o apat_search.o apat_search.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o filtering.o filtering.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o PrimerSets.o PrimerSets.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o ahocorasick.o ahocorasick.c
ar -cr libecoprimer.a goodtaxon.o readdnadb.o smothsort.o sortword.o hashsequence.o strictprimers.o aproxpattern.o merge.o queue.o libstki.o sortmatch.o pairtree.o pairs.o taxstats.o apat_search.o filtering.o PrimerSets.o ahocorasick.o
ranlib libecoprimer.a
/Library/Developer/CommandLineTools/usr/bin/make -C libthermo
gcc -DMAC_OS_X -M  -o thermostats.d thermostats.c
gcc -DMAC_OS_X -M  -o nnparams.d nnparams.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o nnparams.o nnparams.c
gcc -DMAC_OS_X -W -Wall -m64 -g -c -o thermostats.o thermostats.c
ar -cr libthermo.a nnparams.o thermostats.o
ranlib libthermo.a
gcc -g  -O5 -m64 -o ecoPrimers ecoprimer.o -LlibecoPCR -Llibecoprimer -Llibthermo -L/usr/local/lib -lecoprimer -lecoPCR -lthermo -lz -lm

You can now copy the ecoPrimers executable to a directory that is part of your PATH environment variable. The following command lists these directories; an example of its output is shown below it:

for p in $path; do echo $p; done | sort -u

/Users/coissac/bin
/Users/coissac/go/bin
/bin
/opt/X11/bin
/sbin
/usr/bin
/usr/local/bin
/usr/local/go/bin
/usr/sbin
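
The loop above relies on the lowercase `$path` array, which exists in zsh but not in bash or plain sh. As a minimal sketch, assuming a POSIX shell, the same list can be obtained from the standard PATH variable (the function name list_path_dirs is only illustrative):

```shell
# Print each directory of a colon-separated path list on its own line,
# sorted and without duplicates. Works in any POSIX shell.
list_path_dirs() {
    printf '%s\n' "$1" | tr ':' '\n' | sort -u
}

# List the directories of the current PATH.
list_path_dirs "$PATH"
```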

From this list you can choose the directory where you want to install the ecoPrimers executable. Here we choose /Users/coissac/bin: it is located in the user's home directory, so copying the ecoPrimers executable into it does not require root privileges. /usr/local/bin is also a good choice, as it is the usual directory for locally installed software on a UNIX system. Software installed in /usr/local/bin is available to all users of the system, but copying the ecoPrimers executable there requires root privileges.

If we install the software without root privileges:

cp ecoPrimers /Users/coissac/bin

If we install the software for all users on the system, but with root privileges:

sudo cp ecoPrimers /usr/local/bin
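
Whichever directory you chose, it can be useful to check that the shell now finds the program. The following is a small sketch using the standard `command -v` built-in; the helper name is_installed is hypothetical:

```shell
# Succeed when the program named in $1 can be found through PATH.
is_installed() {
    command -v "$1" >/dev/null 2>&1
}

# Report where ecoPrimers was found, or warn if it is not reachable yet.
if is_installed ecoPrimers; then
    echo "ecoPrimers found at: $(command -v ecoPrimers)"
else
    echo "ecoPrimers is not in your PATH yet"
fi
```

If the program is not found right after the copy, opening a new terminal or checking the directory list above is usually the first thing to try.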

Preparing the data #

What do we need? #

To design a new animal DNA metabarcode, we have to download the following data from the NCBI website:

The complete set of whole mitochondrial genomes
The NCBI taxonomy

Downloading the mitochondrial genomes #

The file containing the complete set of mitochondrial genomes can be downloaded using your favourite web browser from the NCBI FTP website.

You will need to download the GenBank flat file format of the data, with extension gbff.gz. This is the only one that contains the link to the NCBI taxonomy for each sequence.

If you need to download the data on a UNIX computer, you may not have access to a web browser on that system. In this case, use the curl command to download the file:

curl 'https://ftp.ncbi.nlm.nih.gov/genomes/refseq/mitochondrion/mitochondrion.1.genomic.gbff.gz' \
     > mito.all.gb.gz
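
The file is large, and an interrupted download leaves a truncated archive. As a hedged sketch, the integrity of the compressed file can be tested with `gzip -t` before going further (check_gzip is an illustrative helper name, not part of any tool used here):

```shell
# Succeed only when the given file is a complete, valid gzip archive.
check_gzip() {
    gzip -t "$1" 2>/dev/null
}

# Test the file downloaded with curl above.
if check_gzip mito.all.gb.gz; then
    echo "mito.all.gb.gz looks complete"
else
    echo "mito.all.gb.gz is missing or truncated: download it again"
fi
```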

Because the file is compressed, you must use the zless command instead of the classic less command to inspect the file without decompressing it first:

zless mito.all.gb.gz

LOCUS       NW_009243181           45189 bp    DNA     linear   CON 06-OCT-2014
DEFINITION  Fonticula alba strain ATCC 38817 mitochondrial scaffold
            supercont2.211, whole genome shotgun sequence.
ACCESSION   NW_009243181 NZ_AROH01000000
VERSION     NW_009243181.1
DBLINK      BioProject: PRJNA262900
            Assembly: GCF_000388065.1
KEYWORDS    WGS; RefSeq.
SOURCE      mitochondrion Fonticula alba
  ORGANISM  Fonticula alba
            Eukaryota; Rotosphaerida; Fonticulaceae; Fonticula.
REFERENCE   1  (bases 1 to 45189)
  AUTHORS   Russ,C., Cuomo,C., Burger,G., Gray,M.W., Holland,P.W.H., King,N.,
            Lang,F.B.F., Roger,A.J., Ruiz-Trillo,I., Brown,M., Walker,B.,
            Young,S., Zeng,Q., Gargeya,S., Fitzgerald,M., Haas,B.,
            Abouelleil,A., Allen,A.W., Alvarado,L., Arachchi,H.M., Berlin,A.M.,
            Chapman,S.B., Gainer-Dewar,J., Goldberg,J., Griggs,A., Gujja,S.,
            Hansen,M., Howarth,C., Imamovic,A., Ireland,A., Larimer,J.,
            McCowan,C., Murphy,C., Pearson,M., Poon,T.W., Priest,M.,
            Roberts,A., Saif,S., Shea,T., Sisk,P., Sykes,S., Wortman,J.,
            Nusbaum,C. and Birren,B.
  CONSRTM   The Broad Institute Genomics Platform
  TITLE     The Genome Sequence of Fonticula alba ATCC 38817
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 45189)
  CONSRTM   NCBI Genome Project
  TITLE     Direct Submission
  JOURNAL   Submitted (06-OCT-2014) National Center for Biotechnology
            Information, NIH, Bethesda, MD 20894, USA
REFERENCE   3  (bases 1 to 45189)
  AUTHORS   Russ,C., Cuomo,C., Burger,G., Gray,M.W., Holland,P.W.H., King,N.,
            Lang,F.B.F., Roger,A.J., Ruiz-Trillo,I., Brown,M., Walker,B.,
            Young,S., Zeng,Q., Gargeya,S., Fitzgerald,M., Haas,B.,
            Abouelleil,A., Allen,A.W., Alvarado,L., Arachchi,H.M., Berlin,A.M.,
            Chapman,S.B., Gainer-Dewar,J., Goldberg,J., Griggs,A., Gujja,S.,
            Hansen,M., Howarth,C., Imamovic,A., Ireland,A., Larimer,J.,
            McCowan,C., Murphy,C., Pearson,M., Poon,T.W., Priest,M.,
            Roberts,A., Saif,S., Shea,T., Sisk,P., Sykes,S., Wortman,J.,
            Nusbaum,C. and Birren,B.
  CONSRTM   The Broad Institute Genomics Platform
  TITLE     Direct Submission
  JOURNAL   Submitted (26-APR-2013) Broad Institute of MIT and Harvard, 7
            Cambridge Center, Cambridge, MA 02142, USA
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence is identical to KB932304.
            
            ##Genome-Assembly-Data-START##
            Assembly Method       :: ALLPATHS v. R44024; Mito ALLPATHS v.
                                     R43919
            Assembly Name         :: Font_alba_ATCC_38817_V2
            Genome Coverage       :: 317.0x; Mito 63.0x
            Sequencing Technology :: Illumina
            ##Genome-Assembly-Data-END##
FEATURES             Location/Qualifiers
     source          1..45189
                     /organism="Fonticula alba"
                     /organelle="mitochondrion"
                     /mol_type="genomic DNA"
                     /strain="ATCC 38817"
                     /isolation_source="dog dung"
                     /culture_collection="ATCC:38817"
                     /db_xref="taxon:691883"
                     /geo_loc_name="USA: Grainfield, Kansas"
                     /collection_date="1960"

At the end of the excerpt shown above, the /db_xref="taxon:691883" field provides the link to the NCBI taxonomy for this first entry in the file.
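
To get a feel for this link, the taxids can be pulled out of the GenBank text with standard tools. This is only a sketch: the helper name extract_taxids is hypothetical, and it assumes that each taxid is written as a `/db_xref="taxon:..."` qualifier, as in the record above.

```shell
# Read GenBank flat-file text on standard input and print the NCBI taxid
# found in each /db_xref="taxon:..." qualifier, one per line.
extract_taxids() {
    sed -n 's|.*/db_xref="taxon:\([0-9][0-9]*\)".*|\1|p'
}

# Example on the qualifier line copied from the record above.
printf '                     /db_xref="taxon:691883"\n' | extract_taxids
```

On the real file, the same function could be fed with `gzip -dc mito.all.gb.gz | extract_taxids | head`.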

Download the full taxonomy #

The NCBI taxonomy is available as a tarball file. It can be downloaded in the same way as the RefSeq mitochondrial database. You can also download the NCBI taxonomy using the obitaxonomy command with the --download-ncbi option.

obitaxonomy --download-ncbi

INFO[0000] Number of workers set 16                     
INFO[0000] Downloading NCBI Taxdump to ncbitaxo_20250211.tgz 
downloading 100% ████████████████████████████████████████| (66/66 MB, 5.1 MB/s)

By default, obitaxonomy downloads the latest version of the NCBI taxonomy available from the NCBI FTP site and saves it to the current directory in a file named ncbitaxo_YYYYMMDD.tgz where YYYY is the year, MM is the month and DD is the day of the download. Here the date is 2025/02/11, so the filename is ncbitaxo_20250211.tgz.

You can also specify the filename of the downloaded file using the --out filename option. For example:

obitaxonomy --download-ncbi --out ncbitaxo.tgz

The archive contains several files #

The NCBI taxonomy dump file contains all the relationships between taxa. This information is stored in two files: nodes.dmp and names.dmp.

The nodes.dmp file:
It contains the taxonomic hierarchy of the NCBI taxonomy. It is a tabular file where the columns are separated by a | character and some whitespace.
- The first column is the taxid of the taxon.
- The second column is the parent taxid of the taxon.
- The third column is the taxonomic rank of the taxon.
The remaining columns are not used by the OBITools.

1  |  1  |  no rank  |		|	8	|	0	| ...
2	|	131567	|	superkingdom	|		|	0	|	0	|
6	|	335928	|	genus	|		|	0	|	1	|
7	|	6	|	species	|	AC	|	0	|	1	|
9	|	32199	|	species	|	BA	|	0	|
10	|	135621	|	genus	|		|	0	|
11	|	1707	|	species	|	CG	|	0	|	1	|
13	|	203488	|	genus	|		|	0	|	1	|
14	|	13	|	species	|	DT	|	0	|	1	|
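
As an illustration of this layout, the parent and rank of a taxon can be looked up with awk. The sketch below assumes the `<tab>|<tab>` field separator used by the NCBI dump files; node_info is a hypothetical helper name.

```shell
# Read nodes.dmp on standard input and print the parent taxid and the rank
# of the taxon whose taxid is given as first argument.
# Fields in the dump are separated by <tab>|<tab>.
node_info() {
    awk -F '\t[|]\t' -v id="$1" '$1 == id { print $2 "\t" $3 }'
}

# Example with one line in the nodes.dmp layout (taxid 7, parent 6, species).
printf '7\t|\t6\t|\tspecies\t|\tAC\t|\n' | node_info 7
```

On an extracted copy of the archive, `node_info 7742 < nodes.dmp` would report the parent and rank of the taxon 7742.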

The names.dmp file:
It contains the scientific names, and a set of alternative names, for all the taxa. It is also a tabular file where the columns are separated by a | character and some whitespace.
- The first column is the taxid of the taxon.
- The second column is the name of the taxon.
- The third column is a unique variant of the name, used when the name alone is ambiguous.
- The fourth column is the class of the name (e.g. scientific name, or blast name…)

1	|	root	|		|	scientific name	|
2	|	Bacteria	|	Bacteria <prokaryote>	|	scientific name	|
2	|	Monera	|	Monera <Bacteria>	|	in-part	|
2	|	Procaryotae	|	Procaryotae <Bacteria>	|	in-part	|
2	|	Prokaryota	|	Prokaryota <Bacteria>	|	in-part	|
2	|	Prokaryotae	|	Prokaryotae <Bacteria>	|	in-part	|
2	|	bacteria	|	bacteria <blast2>	|	blast name	|
2	|	eubacteria	|		|	genbank common name	|
2	|	prokaryote	|	prokaryote <Bacteria>	|	in-part	|
...
10	|	Cellvibrio	|		|	scientific name	|
11	|	[Cellvibrio] gilvus	|		|	scientific name	|
13	|	Dictyoglomus	|		|	scientific name	|
14	|	Dictyoglomus thermophilum	|		|	scientific name	|
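
The scientific name of a taxon can be retrieved from this file in a similar way. This is a sketch under the same assumption about the `<tab>|<tab>` separator; the helper name scientific_name is hypothetical. Because each line ends with a trailing `<tab>|`, the last field is matched by its beginning only.

```shell
# Read names.dmp on standard input and print the scientific name of the
# taxon whose taxid is given as first argument.
# The last field keeps the trailing "<tab>|", hence the prefix match.
scientific_name() {
    awk -F '\t[|]\t' -v id="$1" '$1 == id && $4 ~ /^scientific name/ { print $2 }'
}

# Example with two lines in the names.dmp layout for taxid 2.
printf '2\t|\tBacteria\t|\tBacteria <prokaryote>\t|\tscientific name\t|\n2\t|\tMonera\t|\tMonera <Bacteria>\t|\tin-part\t|\n' \
    | scientific_name 2
```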

A readme.txt file is present in the archive for more information about the NCBI taxonomy dump file.

Preparing the set of complete genomes #

With OBITools, the preferred format for storing sequences is the fasta format. Therefore, we will use the obiconvert tool to convert the GenBank file into fasta format.

obiconvert --skip-empty \
           --update-taxid \
           -t ncbitaxo_20250211.tgz \
           mito.all.gb.gz \
       > mito.all.fasta
head -5 mito.all.fasta

Downloading the fasta-formatted file directly from the NCBI FTP site is not equivalent to downloading the GenBank file and converting it to fasta format with obiconvert. Only the file converted from the GenBank format contains the taxid of each sequence.

Here are the first lines of the mito.all.fasta file:

>NC_072933 {"definition":"Echinosophora koreensis mitochondrion, complete genome.","scientific_name":"mitochondrion Echinosophora koreensis","taxid":228658}
ctttcgggtcggaaatagaagatctggattagatcccttctcgatagctttagtcagagc
tcatccctcgaaaaagggagtagtgagatgagaaaagggtgactagaatacggaaattca
actagtgaagtcagatccgggaattccactattgaagttatccgtcttaggcttcaagca
agctatctttcaaggaagtcagtctaagccctaagccaagatctgctttttgccagtcaa
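
A quick sanity check of the converted file can be done with grep alone. The two helpers below are hypothetical names, and the second one assumes that the annotations are written as JSON with a "taxid" key, as in the header shown above.

```shell
# Count the sequence entries of a fasta file (one header line per entry).
count_fasta() {
    grep -c '^>' "$1"
}

# Count the headers that carry no "taxid" key in their annotations.
count_without_taxid() {
    grep '^>' "$1" | grep -vc '"taxid"'
}
```

For instance, `count_fasta mito.all.fasta` gives the number of converted genomes, and `count_without_taxid mito.all.fasta` should ideally report 0.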

Preparing a database for new barcode inference #

Preparing a database for new barcode inference involves three steps:

Annotate the sequences by their species taxid.
Make sure that no species is represented much more than the others.
Extract only vertebrate genomes.

Searching for the taxid of vertebrates. #

First we will search for the taxid of Vertebrata, as the taxid is the only way to pass taxonomic information to the OBITools. The --fixed option asks for exact matches of the name. The name search is not case-sensitive.

obitaxonomy -t ncbitaxo_20250211.tgz \
              --fixed \
              'vertebrata'

taxid,parent,taxonomic_rank,scientific_name
taxon:1261581 [Vertebrata]@genus,taxon:2008651 [Polysiphonioideae]@subfamily,genus,Vertebrata
taxon:7742 [Vertebrata]@clade,taxon:89593 [Craniata]@subphylum,clade,Vertebrata

The csvlook command displays the result as a more readable table:

obitaxonomy -t ncbitaxo_20250211.tgz \
              --fixed \
              'vertebrata' \
    | csvlook

| taxid                            | parent                                      | taxonomic_rank | scientific_name |
| -------------------------------- | ------------------------------------------- | -------------- | --------------- |
| taxon:1261581 [Vertebrata]@genus | taxon:2008651 [Polysiphonioideae]@subfamily | genus          | Vertebrata      |
| taxon:7742 [Vertebrata]@clade    | taxon:89593 [Craniata]@subphylum            | clade          | Vertebrata      |

Surprisingly, the Latin name Vertebrata is shared by two different taxa. The first is a genus and obviously not the one we are looking for. The second is a clade, and it is the one we are looking for.

Looking for the Vertebrata genus taxid #

Just out of curiosity, we will look at the taxonomic path of the Vertebrata genus, starting from its parent taxon, the subfamily Polysiphonioideae (taxid 2008651).

obitaxonomy -t ncbitaxo_20250211.tgz \
              -p 2008651 \
      | csvlook

| taxid                                       | parent                                      | taxonomic_rank | scientific_name    |
| ------------------------------------------- | ------------------------------------------- | -------------- | ------------------ |
| taxon:2008651 [Polysiphonioideae]@subfamily | taxon:2803 [Rhodomelaceae]@family           | subfamily      | Polysiphonioideae  |
| taxon:2803 [Rhodomelaceae]@family           | taxon:2802 [Ceramiales]@order               | family         | Rhodomelaceae      |
| taxon:2802 [Ceramiales]@order               | taxon:2045261 [Rhodymeniophycidae]@subclass | order          | Ceramiales         |
| taxon:2045261 [Rhodymeniophycidae]@subclass | taxon:2806 [Florideophyceae]@class          | subclass       | Rhodymeniophycidae |
| taxon:2806 [Florideophyceae]@class          | taxon:2763 [Rhodophyta]@phylum              | class          | Florideophyceae    |
| taxon:2763 [Rhodophyta]@phylum              | taxon:2759 [Eukaryota]@superkingdom         | phylum         | Rhodophyta         |
| taxon:2759 [Eukaryota]@superkingdom         | taxon:131567 [cellular organisms]@no rank   | superkingdom   | Eukaryota          |
| taxon:131567 [cellular organisms]@no rank   | taxon:1 [root]@no rank                      | no rank        | cellular organisms |
| taxon:1 [root]@no rank                      | taxon:1 [root]@no rank                      | no rank        | root               |

You can see that the genus Vertebrata belongs to the phylum Rhodophyta, the red algae.

Re-annotation of sequences to species level and selection of genomes #

In order to know how species are represented in the database, and more specifically how many sequences represent each species, we will annotate the sequences with taxonomic information at the species level. We need to do this because some mitochondrial genomes can be annotated at other taxonomic levels, such as subspecies.

obiannotate can perform this task using the --with-taxon-at-rank option. This option requires you to specify the taxonomic rank at which the annotation should be performed. In this example case, we have to use the rank species. The species taxid is stored in the species_taxid tag of the sequence.

In the following command we combine three obiannotate commands with one obiuniq command using the | pipe operator (see the General operating principles section):

obiannotate -t ncbitaxo_20250211.tgz \
            --with-taxon-at-rank species \
            mito.all.fasta | \
  obiannotate -S 'ori_taxid=annotations.taxid' | \
  obiannotate -S 'taxid=annotations.species_taxid' | \
  obiuniq -c taxid > mito.one.fasta

In the input file, the sequence NC_050066 is annotated with taxon 2756270, which corresponds to the subspecies Monochamus alternatus alternatus:

>NC_050066 {"definition":"Monochamus alternatus alternatus mitochondrion, complete genome.","scientific_name":"mitochondrion Monochamus alternatus alternatus","taxid":"taxon:2756270 [Monochamus alternatus alternatus]@subspecies"}
aatgaagtgcctgagcaaagggtaattttgatagaattagtaacgtgaattttcaccttc
attaattatatttaatagaattaaactatttccttagatatcaaaaatctttatacatca
...

The first obiannotate command adds the species_taxid tag to the sequences.

>NC_050066 {"definition":"Monochamus alternatus alternatus mitochondrion, complete genome.","scientific_name":"mitochondrion Monochamus alternatus alternatus","species_name":"Monochamus alternatus","species_taxid":"taxon:192382 [Monochamus alternatus]@species","taxid":"taxon:2756270 [Monochamus alternatus alternatus]@subspecies"}
aatgaagtgcctgagcaaagggtaattttgatagaattagtaacgtgaattttcaccttc
attaattatatttaatagaattaaactatttccttagatatcaaaaatctttatacatca
...

The second obiannotate copies the original taxid tag into a new tag named ori_taxid to preserve the original taxid for possible future use.

>NC_050066 {"definition":"Monochamus alternatus alternatus mitochondrion, complete genome.","ori_taxid":"taxon:2756270 [Monochamus alternatus alternatus]@subspecies","scientific_name":"mitochondrion Monochamus alternatus alternatus","species_name":"Monochamus alternatus","species_taxid":"taxon:192382 [Monochamus alternatus]@species","taxid":"taxon:2756270 [Monochamus alternatus alternatus]@subspecies"}
aatgaagtgcctgagcaaagggtaattttgatagaattagtaacgtgaattttcaccttc
attaattatatttaatagaattaaactatttccttagatatcaaaaatctttatacatca
...

The third obiannotate then copies the species_taxid tag into the main taxid tag. From now on, the OBITools will use the species taxid stored in the taxid tag as the taxonomic annotation for the sequence.

>NC_050066 {"definition":"Monochamus alternatus alternatus mitochondrion, complete genome.","ori_taxid":"taxon:2756270 [Monochamus alternatus alternatus]@subspecies","scientific_name":"mitochondrion Monochamus alternatus alternatus","species_name":"Monochamus alternatus","species_taxid":"taxon:192382 [Monochamus alternatus]@species","taxid":"taxon:192382 [Monochamus alternatus]@species"}
aatgaagtgcctgagcaaagggtaattttgatagaattagtaacgtgaattttcaccttc
attaattatatttaatagaattaaactatttccttagatatcaaaaatctttatacatca
...

Look carefully at this latest version of the sequence. The taxid tag has been updated to the species taxid, the ori_taxid tag contains the original taxid as provided by Genbank, and the species_taxid tag also contains the species taxid.

Finally, obiuniq merges all strictly identical sequences into a single sequence entry. Here, the -c taxid option ensures that only sequences with the same taxid are merged. Therefore, two strictly identical sequences annotated with different taxids are kept as two separate entries.
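
The effect of grouping by taxid can be pictured with plain UNIX tools. The sketch below is only an illustration on toy data, not the way obiuniq is implemented: each line holds a taxid and a sequence, and duplicates are removed only when both columns are identical. The function name is hypothetical.

```shell
# Toy model of dereplication within taxid: input lines are "taxid<TAB>sequence".
# sort -u keeps one copy of each distinct (taxid, sequence) pair, so identical
# sequences carrying different taxids survive as separate lines.
dereplicate_by_taxid() {
    sort -u
}

# Three identical sequences, two of them sharing the same taxid.
printf 'taxon:1\tacgt\ntaxon:1\tacgt\ntaxon:2\tacgt\n' | dereplicate_by_taxid
```

The two copies annotated with taxon:1 collapse into one line, while the copy annotated with taxon:2 is kept.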

Look at the evenness of the species representation #

The goal here is to build, with UNIX commands, a histogram of the number of sequences per species: how many species are represented by one, two, three or more sequences.

The complete command is the following:

obicsv -k taxid mito.one.fasta \
     | tail -n +2 \
     | sort \
     | uniq -c \
     | sort -nk1 \
     | cut -w -f 2 \
     | uplot count

But first, try to understand what is going on.

obicsv converts a sequence file into a CSV file. Here, because of the -k taxid option, the CSV file contains only the taxid tag of each sequence. The head command displays the first ten lines of the result.

obicsv -k taxid mito.one.fasta \
     | head

taxid
taxon:2065826 [Sineleotris saccharae]@species
taxon:2219250 [Ocinara albicollis]@species
taxon:8306 [Ambystoma talpoideum]@species
taxon:80600 [Rhizopogon vinicolor]@species
taxon:270463 [Vanessa indica]@species
taxon:1028098 [Hierodula patellifera]@species
taxon:56258 [Sagittarius serpentarius]@species
taxon:457650 [Myadora brevis]@species
taxon:763200 [Arma chinensis]@species

The tail command is used to remove the header line from the CSV file, to keep only the data part of the file. It is done by extracting the tail, the end of the file, from its second line (option -n +2).

obicsv -k taxid mito.one.fasta \
     | tail -n +2 \
     | head

taxon:2065826 [Sineleotris saccharae]@species
taxon:2219250 [Ocinara albicollis]@species
taxon:8306 [Ambystoma talpoideum]@species
taxon:80600 [Rhizopogon vinicolor]@species
taxon:270463 [Vanessa indica]@species
taxon:1028098 [Hierodula patellifera]@species
taxon:56258 [Sagittarius serpentarius]@species
taxon:457650 [Myadora brevis]@species
taxon:763200 [Arma chinensis]@species
taxon:2060314 [Neotrygon indica]@species

As you can see, the output no longer starts with the taxid column header present in the previous output. In the next command, the sort command sorts the lines so that identical taxid values end up on consecutive lines.

obicsv -k taxid mito.one.fasta \
     | tail -n +2 \
     | sort \
     | head

"taxon:1030158 [Ficus variegata Roding, 1798]@species"
"taxon:244488 [Pillucina pisidium (Dunker, 1860)]@species"
"taxon:352057 [Anopheles albitarsis F Brochero et al., 2007]@species"
"taxon:646521 [Contracaecum rudolphii B Bullini et al., 1986]@species"
"taxon:908352 [Anopheles albitarsis G Krzywinski et al., 2011]@species"
taxon:1000982 [Steindachneridion melanodermatum]@species
taxon:1001283 [Calameuta idolon]@species
taxon:1001291 [Trachelus tabidus]@species
taxon:1001332 [Phylloporia weberiana]@species
taxon:1001553 [Dephomys defua]@species

We can then add the uniq -c command to count the number of times each taxid appears in the file.

obicsv -k taxid mito.one.fasta \
     | tail -n +2 \
     | sort \
     | uniq -c \
     | head

   1 "taxon:1030158 [Ficus variegata Roding, 1798]@species"
   1 "taxon:244488 [Pillucina pisidium (Dunker, 1860)]@species"
   1 "taxon:352057 [Anopheles albitarsis F Brochero et al., 2007]@species"
   1 "taxon:646521 [Contracaecum rudolphii B Bullini et al., 1986]@species"
   1 "taxon:908352 [Anopheles albitarsis G Krzywinski et al., 2011]@species"
   1 taxon:1000982 [Steindachneridion melanodermatum]@species
   1 taxon:1001283 [Calameuta idolon]@species
   1 taxon:1001291 [Trachelus tabidus]@species
   1 taxon:1001332 [Phylloporia weberiana]@species
   1 taxon:1001553 [Dephomys defua]@species

The uniq command added the first column to the output, which is the number of times each taxid appears in the original file.

The next step is to remove the taxid column from the output and keep only the count column. Because the uniq command pads the count with leading spaces, the cut command sees an empty first column and treats the count as the second column, even though it looks like the first one to us.

obicsv -k taxid mito.one.fasta \
     | tail -n +2 \
     | sort \
     | uniq -c \
     | cut -w -f 2 \
     | head

The -w option specifies that columns are separated by whitespace (spaces or tabs).
The -f 2 option selects the second column as the one to keep.
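
The -w option belongs to the BSD version of cut shipped with macOS; GNU cut, the default on most Linux systems, does not appear to provide it. As a hedged alternative, awk can extract the same column on both systems (first_field is an illustrative name):

```shell
# Keep only the count that uniq -c writes at the start of each line.
# awk splits on runs of blanks and ignores leading ones, so the count is field 1.
first_field() {
    awk '{ print $1 }'
}

# Example on a line in the format produced by uniq -c.
printf '   1 taxon:1000982 [Steindachneridion melanodermatum]@species\n' | first_field
```

In the pipelines of this section, `first_field` would take the place of `cut -w -f 2`.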

The last step is to send this output to the uplot command to plot the histogram.

obicsv -k taxid mito.one.fasta \
     | tail -n +2 \
     | sort \
     | uniq -c \
     | sort -nk1 \
     | cut -w -f 2 \
     | uplot count

     ┌                                        ┐ 
   1 ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 17769.0   
   2 ┤ 90.0                                     
   3 ┤ 17.0                                     
   4 ┤ 5.0                                      
   5 ┤ 4.0                                      
   6 ┤ 2.0                                      
   7 ┤ 1.0                                      
     └                                        ┘
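
If the uplot command is not available on your system, the same numbers can be obtained as a plain two-column table. This is a sketch using only sort, uniq and awk; species_histogram is a hypothetical helper that reads one taxid per line.

```shell
# Read one taxid per line and print "sequences_per_species<TAB>number_of_species".
species_histogram() {
    sort \
        | uniq -c \
        | awk '{ print $1 }' \
        | sort -n \
        | uniq -c \
        | awk '{ print $2 "\t" $1 }'
}

# Toy input: species "a" has two sequences, "b" and "c" have one each.
printf 'a\na\nb\nc\n' | species_histogram
```

With the real data, the input would be `obicsv -k taxid mito.one.fasta | tail -n +2 | species_histogram`.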

Very few species are represented by more than one mitochondrial genome, while 17769 species are represented by a single genome. We can therefore assume that the set of mitochondrial genomes is not strongly biased in favour of a particular taxon.

Selection of vertebrate genomes #

The mitochondrial database we have downloaded contains mitochondrial genomes from vertebrates, but also from invertebrates, fungi, plants… Since ecoPrimers requires that every sequence in the learning database can potentially contain the barcode we are looking for, we will restrict the learning database to vertebrate genomes.

The obigrep command will do this for us. We just need to provide the taxid of the Vertebrata taxon with the -r option, and the taxonomy with the -t option.

obigrep -t ncbitaxo_20250211.tgz \
        -r 7742 \
        mito.one.fasta > mito.vert.fasta

Now we can count the number of sequences in the new learning database.

obicount mito.vert.fasta \
   | csvlook

| entities |           n |
| -------- | ----------- |
| variants |       7,822 |
| reads    |       7,823 |
| symbols  | 131,378,756 |

Formatting data for `ecoPrimers` #

As mentioned in the introduction, ecoPrimers was designed to work with the original version of OBITools. We now need to perform three more steps to prepare the data for ecoPrimers.

Unarchiving the taxonomy #

The old OBITools cannot use archived and compressed taxonomies. So we need to:

Create a new directory to store the unarchived taxonomy using the mkdir command.
Change to the new directory using the cd command.
Extract the taxonomy from the compressed file using the tar command.
Return to the original directory using the cd command.

mkdir ncbitaxo_20250211
cd ncbitaxo_20250211
tar zxvf ../ncbitaxo_20250211.tgz 
cd ..
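
Before going further, you may want to confirm that the two dump files described earlier were extracted. A minimal sketch, with a hypothetical helper name:

```shell
# Succeed when the directory $1 holds non-empty nodes.dmp and names.dmp files.
check_taxdump() {
    [ -s "$1/nodes.dmp" ] && [ -s "$1/names.dmp" ]
}

# Check the directory created above.
if check_taxdump ncbitaxo_20250211; then
    echo "taxonomy extracted"
else
    echo "nodes.dmp or names.dmp is missing from ncbitaxo_20250211"
fi
```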

Converting the database to the old obitools format #

Now OBITools4 stores the annotations in JSON format.

>NC_050066 {"definition":"Monochamus alternatus alternatus mitochondrion, complete genome. ","ori_taxid":"taxon:2756270 [Monochamus alternatus alternatus]@subspecies","scientific_name":"mitochondrion Monochamus alternatus alternatus","species_name":"Monochamus alternatus","species_taxid":"taxon:192382 [Monochamus alternatus]@species","taxid":"taxon:192382 [Monochamus alternatus]@species"}
aatgaagtgcctgagcaaagggtaattttgatagaattagtaacgtgaattttcaccttc
attaattatatttaatagaattaaactatttccttagatatcaaaaatctttatacatca
...

The original OBITools stored the annotation in a key=value; format.

>NC_050066 ori_taxid=taxon:2756270 [Monochamus alternatus alternatus]@subspecies; scientific_name=mitochondrion Monochamus alternatus alternatus; species_name=Monochamus alternatus; species_taxid=taxon:192382 [Monochamus alternatus]@species; taxid=taxon:192382 [Monochamus alternatus]@species; count=1;  Monochamus alternatus alternatus mitochondrion, complete genome.
aatgaagtgcctgagcaaagggtaattttgatagaattagtaacgtgaattttcaccttc
attaattatatttaatagaattaaactatttccttagatatcaaaaatctttatacatca
...

When the -O option is added to an OBITools4 command, the old OBITools format is used instead of the new JSON-based format.

obiconvert -O mito.vert.fasta > mito.vert.old.fasta

head -5 mito.vert.old.fasta

>NC_071784 taxid=taxon:2065826 [Sineleotris saccharae]@species; count=1; ori_taxid=taxon:2065826 [Sineleotris saccharae]@species; scientific_name=mitochondrion Sineleotris saccharae; species_name=Sineleotris saccharae; species_taxid=taxon:2065826 [Sineleotris saccharae]@species;  Sineleotris saccharae mitochondrion, complete genome.
gctagcgtagcttaaccaaagcataacactgaagatgttaagatgggccctagaaagccc
cgcaagcacaaaagcttggtcctggctttactatcagcttaggctaaacttacacatgca
agtatccgcatccccgtgagaatgcccttaagctcccaccgctaacaggagtcaaggagc
cggtatcaggcacaaccctgagttagcccacgacaccttgctcagccacacccccaaggg

Indexing the mitochondrial learning database #

The last step in preparing the data for ecoPrimers is to index the learning database. This job was done by the original OBITools, but the new OBITools4 no longer provide this function.

Using the ecoPCRFormat Python script, you can do this indexing without the original OBITools.

Once you have downloaded the ecoPCRFormat Python script, you have to make it executable and copy it to the same directory as the ecoPrimers program.

Here is an example of how to do that. The URL in this example refers to a local copy of the documentation (localhost); replace it with the address from which you download the script.

curl http://localhost:1313/obitools4-doc/docs/cookbook/ecoprimers/ecoPCRFormat > ecoPCRFormat
chmod +x ecoPCRFormat
cp ecoPCRFormat /Users/coissac/bin

You can now run the ecoPCRFormat script to create the index files.

ecoPCRFormat -t ncbitaxo_20250211 \
             -f \
             -n vertebrata \
             mito.vert.old.fasta

The -t option specifies the directory where the taxonomy database is located.
The -f option specifies that the input file is in fasta format.
The -n option specifies the name of the indexed learning database.
The last parameter mito.vert.old.fasta is the name of the input file containing the sequences to be indexed.

This command creates the following index files:

ls -l vertebrata*

-rw-r--r--@ 1 coissac  staff  260899785 Feb 11 11:53 vertebrata.ndx
-rw-r--r--@ 1 coissac  staff        546 Feb 11 11:53 vertebrata.rdx
-rw-r--r--@ 1 coissac  staff  121379751 Feb 11 11:53 vertebrata.tdx
-rw-r--r--@ 1 coissac  staff   40446318 Feb 11 11:54 vertebrata_001.sdx
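A quick way to confirm that the indexing step completed is to test for the four kinds of files shown in the listing above. The function below is a small sketch. The name check_index is hypothetical, and the list of extensions (ndx, rdx, tdx and numbered sdx files) is taken from that listing only, not from a specification of the format.

```shell
# check_index DB: return 0 if the index files for DB look complete.
check_index() {
    db=$1
    missing=0
    # Files expected once per database; -s also rejects empty files.
    for ext in ndx rdx tdx; do
        if [ ! -s "$db.$ext" ]; then
            echo "missing or empty: $db.$ext" >&2
            missing=1
        fi
    done
    # Sequence files are numbered, for example vertebrata_001.sdx.
    if ! ls "${db}"_*.sdx >/dev/null 2>&1; then
        echo "no sequence file matching ${db}_*.sdx" >&2
        missing=1
    fi
    return $missing
}
```

Calling check_index vertebrata in the directory where ecoPCRFormat was run should print nothing and return a zero exit status when all the files are present.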

Selecting the best primer pairs #

Searching the Teleostei `taxid` #

To design a new DNA metabarcode for bony fish, we first have to find the Teleostei taxid.

obitaxonomy -t ncbitaxo_20250211.tgz \
              --fixed \
              'Teleostei' \
    | csvlook

| taxid                              | parent                             | taxonomic_rank | scientific_name |
| ---------------------------------- | ---------------------------------- | -------------- | --------------- |
| taxon:32443 [Teleostei]@infraclass | taxon:41665 [Neopterygii]@subclass | infraclass     | Teleostei       |
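OBITools4 prints the taxid in the form taxon:32443 [Teleostei]@infraclass, while the ecoPrimers command line used in this recipe takes the bare number. As a small sketch, the numeric part can be extracted with sed. The variable names taxon and taxid are only illustrative.

```shell
# Full taxid string as printed by obitaxonomy.
taxon='taxon:32443 [Teleostei]@infraclass'

# Keep only the digits between "taxon:" and the first space.
taxid=$(printf '%s\n' "$taxon" | sed -E 's/^taxon:([0-9]+) .*$/\1/')

echo "$taxid"
```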

Running the `ecoPrimers` program #

The ecoPrimers command is responsible for looking for the priming sites. It is an alignment-free program that can identify conserved regions among a large set of sequences.

ecoPrimers -d vertebrata \
           -e 3 -3 2 \
           -l 30 -L 150 \
           -r 32443 \
           -c > Teleostei.ecoprimers

The -d option allows you to specify the learning database, here the vertebrate mitochondrial genome database indexed above.
The -e option specifies the maximum number of mismatches allowed between the primer and the priming site. The number of mismatches is per primer.
The -3 option, used here with the 2 argument (-3 2), indicates that no mismatches are allowed on the last two nucleotides (3’ end) of the primer.
The -l option specifies the minimum length of the barcode (excluding primers) to search for.
The -L option specifies the maximum length of the barcode (excluding primers) to search for.
The -r option indicates which taxon (here Teleostei) ecoPrimers will focus on.
The -c option indicates that the learning database consists of circular genomes.
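To make the -e and -3 rules concrete, the sketch below applies them to a primer and a candidate priming site of the same length. It is a simplified illustration of the two rules as described above, not the code used by ecoPrimers, and the function name check_site is hypothetical. The two sites tested are forward matches taken from the obipcr output shown later in this recipe.

```shell
# check_site PRIMER SITE MAX_ERRORS STRICT_3PRIME
# Counts mismatches between PRIMER and SITE (case-insensitive) and
# rejects the site if there are more than MAX_ERRORS mismatches or
# any mismatch within the last STRICT_3PRIME bases.
check_site() {
    awk -v p="$1" -v s="$2" -v e="$3" -v t="$4" 'BEGIN {
        p = toupper(p); s = toupper(s); n = length(p)
        total = 0; end3 = 0
        if (length(s) != n) { print "length mismatch"; exit }
        for (i = 1; i <= n; i++) {
            if (substr(p, i, 1) != substr(s, i, 1)) {
                total++
                if (i > n - t) end3++
            }
        }
        verdict = (total <= e && end3 == 0) ? "accepted" : "rejected"
        printf "%d mismatch(es), %d in the last %d bases: %s\n", total, end3, t, verdict
    }'
}

# One mismatch at position 15, away from the 3-prime end.
check_site ACACCGCCCGTCACTCTC acaccgcccgtcaccctc 3 2
# One mismatch at position 17, inside the last two bases.
check_site ACACCGCCCGTCACTCTC acaccgcccgtcactccc 3 2
```

The first site is accepted, while the second is rejected even though it has a single mismatch, because that mismatch falls within the two 3' bases.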

After running for a few minutes and writing progress information to the terminal, ecoPrimers finishes and reports the number of primer pairs it has identified:

Total number of pairs : 9407
Total number of good pairs : 407

We can now have a look at the beginning of the result file.

head -35 Teleostei.ecoprimers

#
# ecoPrimer version 0.5
# Rank level optimisation : species
# max error count by oligonucleotide : 3
#
# Restricted to taxon:
#     32443 : Teleostei (infraclass)
#
# strict primer quorum  : 0.70
# example quorum        : 0.90
# counterexample quorum : 0.10
#
# database : vertebrata
# Database is constituted of  3909 examples        corresponding to  3876 species
#                        and     0 counterexamples corresponding to     0 species
#
# amplifiat length between [30,150] bp
# DB sequences are considered as circular
# Pairs having specificity less than 0.60 will be ignored
#
     0  AGAGTGACGGGCGGTGTG      CGTCAGGTCGAGGTGTAG      62.8    42.4    57.5    34.1    12      11      GG      3864    0       0.988   3832    0       0.989   2731    0.713      134     146     138.22
     1  CGTCAGGTCGAGGTGTAG      GAGTGACGGGCGGTGTGT      57.5    34.1    63.1    42.9    11      12      GG      3863    0       0.988   3831    0       0.988   2730    0.713      133     145     137.22
     2  CGTCAGGTCGAGGTGTAG      GGGAGAGTGACGGGCGGT      57.5    34.1    64.5    37.0    11      13      GG      3811    0       0.975   3779    0       0.975   2689    0.712      137     149     141.22
     3  CGTCAGGTCGAGGTGTAG      GGGGAGAGTGACGGGCGG      57.5    34.1    65.5    38.4    11      14      GG      3804    0       0.973   3772    0       0.973   2682    0.711      138     149     142.22
     4  ACACCGCCCGTCACTCTC      ACCTTCCGGTACACTTAC      62.5    36.8    54.0    16.6    12      9       GG      3850    0       0.985   3818    0       0.985   2658    0.696      46      132     66.51
     5  AACGTCAGGTCGAGGTGT      AGAGTGACGGGCGGTGTG      58.8    28.4    62.8    41.7    10      12      GG      3779    0       0.967   3746    0       0.966   2653    0.708      137     148     140.23
     6  ACACCGCCCGTCACTCTC      CACCTTCCGGTACACTTA      62.5    36.8    54.0    16.6    12      9       GG      3846    0       0.984   3814    0       0.984   2654    0.696      47      133     67.51
     7  AACGTCAGGTCGAGGTGT      GAGTGACGGGCGGTGTGT      58.8    28.4    63.1    42.1    10      12      GG      3778    0       0.966   3745    0       0.966   2652    0.708      136     147     139.23
     8  ACCTTCCGGTACACTTAC      CACACCGCCCGTCACTCT      54.0    16.6    62.8    37.3    9       12      GG      3845    0       0.984   3813    0       0.984   2653    0.696      47      133     67.51
     9  ACACCGCCCGTCACTCTC      TCCGGTACACTTACCATG      62.5    36.8    54.1    18.1    12      9       GG      3851    0       0.985   3819    0       0.985   2651    0.694      42      128     62.51
    10  ACACCGCCCGTCACTCTC      CCGGTACACTTACCATGT      62.5    36.8    54.4    18.6    12      9       GG      3851    0       0.985   3819    0       0.985   2651    0.694      41      127     61.51
    11  ACACCGCCCGTCACTCTC      CCAAGTGCACCTTCCGGT      62.5    36.8    60.7    28.9    12      11      GG      3837    0       0.982   3805    0       0.982   2650    0.696      54      140     74.51
    12  ACACCGCCCGTCACTCTC      GCACCTTCCGGTACACTT      62.5    36.8    57.7    22.5    12      10      GG      3842    0       0.983   3810    0       0.983   2650    0.696      48      134     68.51
    13  ACACCGCCCGTCACTCTC      CGGTACACTTACCATGTT      62.5    36.8    52.4    15.7    12      8       GG      3850    0       0.985   3818    0       0.985   2650    0.694      40      126     60.51
    14  ACACCGCCCGTCACTCTC      CACTTACCATGTTACGAC      62.5    36.8    51.1    27.7    12      8       GG      3850    0       0.985   3817    0       0.985   2649    0.694      35      121     55.51

The result file consists of two parts. The header, consisting of lines starting with the # character, contains all the parameters used by the ecoPrimers algorithm and some statistics about the database and the current search.

The second part is a tabular text describing all potential primer pairs identified. Immediately below this is a detailed description of the information contained in each column.

Table result description :

column 1 : serial number
column 2 : primer1
column 3 : primer2
column 4 : primer1 Tm without mismatch
column 5 : primer1 lowest Tm against example sequences
column 6 : primer2 Tm without mismatch
column 7 : primer2 lowest Tm against example sequences
column 8 : primer1 G+C count
column 9 : primer2 G+C count
column 10 : good/bad
column 11 : amplified example sequence count
column 12 : amplified counterexample sequence count
column 13 : yule
column 14 : amplified example taxa count
column 15 : amplified counterexample taxa count
column 16 : ratio of amplified example taxa versus all example taxa (Bc index)
column 17 : unambiguously identified example taxa count
column 18 : ratio of unambiguously identified example taxa versus amplified example taxa (specificity, Bs index)
column 19 : minimum amplified length
column 20 : maximum amplified length
column 21 : average amplified length
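Because the table is whitespace-separated, the column numbers listed above can be used directly as awk fields. The sketch below keeps the pairs whose two primers have close melting temperatures (columns 4 and 6) and a high Bc index (column 16). The function name select_pairs and the thresholds (3 degrees, 0.98) are arbitrary assumptions chosen for illustration, and sample.ecoprimers holds only four lines copied from the result shown above.

```shell
tmp=$(mktemp -d)

# Four pairs (serial numbers 0, 4, 11 and 14) copied from the result file.
cat > "$tmp/sample.ecoprimers" <<'EOF'
# amplifiat length between [30,150] bp
     0  AGAGTGACGGGCGGTGTG      CGTCAGGTCGAGGTGTAG      62.8    42.4    57.5    34.1    12      11      GG      3864    0       0.988   3832    0       0.989   2731    0.713      134     146     138.22
     4  ACACCGCCCGTCACTCTC      ACCTTCCGGTACACTTAC      62.5    36.8    54.0    16.6    12      9       GG      3850    0       0.985   3818    0       0.985   2658    0.696      46      132     66.51
    11  ACACCGCCCGTCACTCTC      CCAAGTGCACCTTCCGGT      62.5    36.8    60.7    28.9    12      11      GG      3837    0       0.982   3805    0       0.982   2650    0.696      54      140     74.51
    14  ACACCGCCCGTCACTCTC      CACTTACCATGTTACGAC      62.5    36.8    51.1    27.7    12      8       GG      3850    0       0.985   3817    0       0.985   2649    0.694      35      121     55.51
EOF

# select_pairs FILE MAX_TM_DIFF MIN_BC
# Prints: serial, primer1, primer2, Tm difference, Bc, Bs (tab-separated).
select_pairs() {
    awk -v maxdt="$2" -v minbc="$3" '
        /^#/    { next }          # skip the header
        NF < 21 { next }          # skip incomplete lines
        {
            dt = $4 - $6; if (dt < 0) dt = -dt
            if (dt <= maxdt && $16 >= minbc)
                printf "%s\t%s\t%s\t%.1f\t%s\t%s\n", $1, $2, $3, dt, $16, $18
        }' "$1"
}

selected=$(select_pairs "$tmp/sample.ecoprimers" 3 0.98)
echo "$selected"
```

On these four lines, only pair 11 passes the filter, with a difference of 1.8 between the two melting temperatures.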

Suppose we decide to focus on the pair with serial number 11 (the twelfth line of the table, since numbering starts at 0) because it seems to have relatively good properties and, in particular, a relatively balanced melting temperature between the two primers.

Primer ID : 11

Primer	sequence	tm max	tm min	GC count
Forward	ACACCGCCCGTCACTCTC	62.5	36.8	12
Reverse	CCAAGTGCACCTTCCGGT	60.7	28.9	11

amplifying 3837/3909 sequences
identifying 2650/3876 species
Size ranging from 54 bp to 140 bp (mean: 74.51 bp)
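The ratios reported in the result line of pair 11 can be recomputed from its counts: 3837 amplified sequences out of 3909, 3805 amplified species out of 3876, and 2650 unambiguously identified species. The short awk sketch below does the arithmetic. It shows that the reported value of 0.696 in column 18 corresponds to identified species over amplified species (2650/3805), whereas dividing by all 3876 species gives a slightly lower value.

```shell
# Order of the four ratios:
#   amplified sequences / all sequences
#   amplified species   / all species        (Bc, column 16)
#   identified species  / amplified species  (column 18)
#   identified species  / all species
ratios=$(awk 'BEGIN {
    printf "%.3f %.3f %.3f %.3f\n", 3837/3909, 3805/3876, 2650/3805, 2650/3876
}')

echo "$ratios"
# prints: 0.982 0.982 0.696 0.684
```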

Testing the new primer pair #

To better characterise this pair, we can now use the obipcr tool to extract the barcode sequence corresponding to this pair from the learning database.

obipcr --forward ACACCGCCCGTCACTCTC \
       --reverse CCAAGTGCACCTTCCGGT \
       -e 5 \
       -l 30 -L 150 \
       -c \
       mito.vert.fasta \
       > Teleostei_11.fasta

head Teleostei_11.fasta

>NC_022183_sub[925..998] {"count":1,"definition":"Acrossocheilus hemispinus mitochondrion, complete genome.","direction":"forward","forward_error":1,"forward_match":"acaccgcccgtcaccctc","forward_primer":"ACACCGCCCGTCACTCTC","ori_taxid":"taxon:356810 [Acrossocheilus hemispinus]@species","reverse_error":0,"reverse_match":"ccaagtgcaccttccggt","reverse_primer":"CCAAGTGCACCTTCCGGT","scientific_name":"mitochondrion Acrossocheilus hemispinus","species_name":"Acrossocheilus hemispinus","species_taxid":"taxon:356810 [Acrossocheilus hemispinus]@species","taxid":"taxon:356810 [Acrossocheilus hemispinus]@species"}
cccgtcaaaatacaccaaaaatacttaatacaataacactaacaaggggaggcaagtcgt
aacatggtaagtgt
>NC_018560_sub[916..988] {"count":1,"definition":"Astatotilapia calliptera mitochondrion, complete genome.","direction":"forward","forward_error":0,"forward_match":"acaccgcccgtcactctc","forward_primer":"ACACCGCCCGTCACTCTC","ori_taxid":"taxon:8154 [Astatotilapia calliptera]@species","reverse_error":1,"reverse_match":"ccaagtacaccttccggt","reverse_primer":"CCAAGTGCACCTTCCGGT","scientific_name":"mitochondrion Astatotilapia calliptera (eastern happy)","species_name":"Astatotilapia calliptera","species_taxid":"taxon:8154 [Astatotilapia calliptera]@species","taxid":"taxon:8154 [Astatotilapia calliptera]@species"}
cccaagccaacaacatcctataaataatacattttaccggtaaaggggaggcaagtcgta
acatggtaagtgt
>NC_056117_sub[923..997] {"count":1,"definition":"Pseudocrossocheilus tridentis mitochondrion, complete genome.","direction":"forward","forward_error":0,"forward_match":"acaccgcccgtcactctc","forward_primer":"ACACCGCCCGTCACTCTC","ori_taxid":"taxon:887881 [Pseudocrossocheilus tridentis]@species","reverse_error":0,"reverse_match":"ccaagtgcaccttccggt","reverse_primer":"CCAAGTGCACCTTCCGGT","scientific_name":"mitochondrion Pseudocrossocheilus tridentis","species_name":"Pseudocrossocheilus tridentis","species_taxid":"taxon:887881 [Pseudocrossocheilus tridentis]@species","taxid":"taxon:887881 [Pseudocrossocheilus tridentis]@species"}
ccctgtcaaaaagcatcaaatatatataataaattagcaatgacaaggggaggcaagtcg
taacacggtaagtgt
>NC_045904_sub[919..997] {"count":1,"definition":"Eospalax fontanierii mitochondrion, complete genome.","direction":"forward","forward_error":1,"forward_match":"acaccgcccgtcgctctc","forward_primer":"ACACCGCCCGTCACTCTC","ori_taxid":"taxon:146134 [Eospalax fontanierii]@species","reverse_error":4,"reverse_match":"ccaagcacactttccagt","reverse_primer":"CCAAGTGCACCTTCCGGT","scientific_name":"mitochondrion Eospalax fontanierii","species_name":"Eospalax fontanierii","species_taxid":"taxon:146134 [Eospalax fontanierii]@species","taxid":"taxon:146134 [Eospalax fontanierii]@species"}
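The forward_error and reverse_error annotations added by obipcr make it easy to see how well the primers match each amplified sequence. The sketch below tabulates the total number of mismatches per amplicon with grep, sed and awk. It assumes that forward_error appears before reverse_error in the header, as in the output above. The function name error_table is hypothetical, and sample.fasta contains the four records shown above with their annotations abbreviated.

```shell
tmp=$(mktemp -d)

# The four records shown above, with abbreviated annotations.
cat > "$tmp/sample.fasta" <<'EOF'
>NC_022183_sub[925..998] {"count":1,"forward_error":1,"reverse_error":0}
cccgtcaaaatacaccaaaaatacttaatacaataacactaacaaggggaggcaagtcgtaacatggtaagtgt
>NC_018560_sub[916..988] {"count":1,"forward_error":0,"reverse_error":1}
cccaagccaacaacatcctataaataatacattttaccggtaaaggggaggcaagtcgtaacatggtaagtgt
>NC_056117_sub[923..997] {"count":1,"forward_error":0,"reverse_error":0}
ccctgtcaaaaagcatcaaatatatataataaattagcaatgacaaggggaggcaagtcgtaacacggtaagtgt
>NC_045904_sub[919..997] {"count":1,"forward_error":1,"reverse_error":4}
ctcaagtacataaacttggatatattcttaataacccaacaaaaatattagaggagataagtcgtaacaaggtaagcat
EOF

# error_table FILE
# Prints one line per total mismatch count: "<mismatches> <amplicons>".
error_table() {
    grep '^>' "$1" \
      | sed -E 's/.*"forward_error":([0-9]+).*"reverse_error":([0-9]+).*/\1 \2/' \
      | awk '{ n[$1 + $2]++ } END { for (k in n) print k, n[k] }' \
      | sort -n
}

error_table "$tmp/sample.fasta"
```

On this sample, one amplicon has no mismatch, two have a single mismatch and one has five. Running error_table on Teleostei_11.fasta would give the same kind of summary for the whole learning database.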

To be able to process the fasta file with R and produce some statistics describing the conservation of barcodes between taxa and the ability of the barcode to discriminate between taxa, we need to convert the fasta file to CSV format. This can be done with the obicsv command. When run with the --auto option, it automatically identifies all tags present in the annotations of the first few records and creates a CSV file with the corresponding columns.

obicsv --auto -s -i Teleostei_11.fasta > Teleostei_11.csv

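It is now possible to view the first few lines of the generated CSV file using a combination of the head and csvlook commands. Note that csvlook infers column types, so a column containing only 0 and 1 values in these first lines, such as count or forward_error, is displayed as True/False.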

head Teleostei_11.csv | csvlook

| id                        | count | direction | forward_error | forward_match      | forward_primer     | ori_taxid                                            | reverse_error | reverse_match      | reverse_primer     | scientific_name                                        | species_name                  | species_taxid                                        | taxid                                                | sequence                                                                        |
| ------------------------- | ----- | --------- | ------------- | ------------------ | ------------------ | ---------------------------------------------------- | ------------- | ------------------ | ------------------ | ------------------------------------------------------ | ----------------------------- | ---------------------------------------------------- | ---------------------------------------------------- | ------------------------------------------------------------------------------- |
| NC_022183_sub[925..998]   |  True | forward   |          True | acaccgcccgtcaccctc | ACACCGCCCGTCACTCTC | taxon:356810 [Acrossocheilus hemispinus]@species     |             0 | ccaagtgcaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Acrossocheilus hemispinus                | Acrossocheilus hemispinus     | taxon:356810 [Acrossocheilus hemispinus]@species     | taxon:356810 [Acrossocheilus hemispinus]@species     | cccgtcaaaatacaccaaaaatacttaatacaataacactaacaaggggaggcaagtcgtaacatggtaagtgt      |
| NC_018560_sub[916..988]   |  True | forward   |         False | acaccgcccgtcactctc | ACACCGCCCGTCACTCTC | taxon:8154 [Astatotilapia calliptera]@species        |             1 | ccaagtacaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Astatotilapia calliptera (eastern happy) | Astatotilapia calliptera      | taxon:8154 [Astatotilapia calliptera]@species        | taxon:8154 [Astatotilapia calliptera]@species        | cccaagccaacaacatcctataaataatacattttaccggtaaaggggaggcaagtcgtaacatggtaagtgt       |
| NC_056117_sub[923..997]   |  True | forward   |         False | acaccgcccgtcactctc | ACACCGCCCGTCACTCTC | taxon:887881 [Pseudocrossocheilus tridentis]@species |             0 | ccaagtgcaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Pseudocrossocheilus tridentis            | Pseudocrossocheilus tridentis | taxon:887881 [Pseudocrossocheilus tridentis]@species | taxon:887881 [Pseudocrossocheilus tridentis]@species | ccctgtcaaaaagcatcaaatatatataataaattagcaatgacaaggggaggcaagtcgtaacacggtaagtgt     |
| NC_045904_sub[919..997]   |  True | forward   |          True | acaccgcccgtcgctctc | ACACCGCCCGTCACTCTC | taxon:146134 [Eospalax fontanierii]@species          |             4 | ccaagcacactttccagt | CCAAGTGCACCTTCCGGT | mitochondrion Eospalax fontanierii                     | Eospalax fontanierii          | taxon:146134 [Eospalax fontanierii]@species          | taxon:146134 [Eospalax fontanierii]@species          | ctcaagtacataaacttggatatattcttaataacccaacaaaaatattagaggagataagtcgtaacaaggtaagcat |
| NC_018546_sub[916..987]   |  True | forward   |         False | acaccgcccgtcactctc | ACACCGCCCGTCACTCTC | taxon:30732 [Oryzias melastigma]@species             |             0 | ccaagtgcaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Oryzias melastigma (Indian medaka)       | Oryzias melastigma            | taxon:30732 [Oryzias melastigma]@species             | taxon:30732 [Oryzias melastigma]@species             | cccgacccattttaaaaattaaataaaagatttcaggaactaaggggaggcaagtcgtaacatggtaagtgt        |
| NC_044151_sub[922..993]   |  True | forward   |         False | acaccgcccgtcactctc | ACACCGCCCGTCACTCTC | taxon:2597641 [Sicyopterus squamosissimus]@species   |             0 | ccaagtgcaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Sicyopterus squamosissimus (cling goby)  | Sicyopterus squamosissimus    | taxon:2597641 [Sicyopterus squamosissimus]@species   | taxon:2597641 [Sicyopterus squamosissimus]@species   | cccaaaacaaacacacacataaataagaaaaaatgaaaataaaggggaggcaagtcgtaacatggtaagtgt        |
| NC_044152_sub[922..994]   |  True | forward   |         False | acaccgcccgtcactctc | ACACCGCCCGTCACTCTC | taxon:2597642 [Sicyopterus stiphodonoides]@species   |             0 | ccaagtgcaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Sicyopterus stiphodonoides (cling goby)  | Sicyopterus stiphodonoides    | taxon:2597642 [Sicyopterus stiphodonoides]@species   | taxon:2597642 [Sicyopterus stiphodonoides]@species   | cccaaaacaaacacacacataaataagaaaaaantgaaaataaaggggaggcaagtcgtaacatggtaagtgt       |
| NC_026976_sub[1453..1531] |  True | forward   |          True | acaccgcccgtcactccc | ACACCGCCCGTCACTCTC | taxon:9545 [Macaca nemestrina]@species               |             1 | ccaagtgcaccttccagt | CCAAGTGCACCTTCCGGT | mitochondrion Macaca nemestrina (pig-tailed macaque)   | Macaca nemestrina             | taxon:9545 [Macaca nemestrina]@species               | taxon:9545 [Macaca nemestrina]@species               | ctcaaatatatttaaggaacatcttaactaaacgccctaatatttatatagaggggataagtcgtaacatggtaagtgt |
| NC_031553_sub[921..995]   |  True | forward   |         False | acaccgcccgtcactctc | ACACCGCCCGTCACTCTC | taxon:643337 [Puntioplites proctozystron]@species    |             0 | ccaagtgcaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Puntioplites proctozystron               | Puntioplites proctozystron    | taxon:643337 [Puntioplites proctozystron]@species    | taxon:643337 [Puntioplites proctozystron]@species    | ccctgtcaaaacgcactaaaaatatctaatacaaaagcaccgacaaggggaggcaagtcgtaacacggtaagtgt     |

References #

Riaz, Shehzad, Viari, Pompanon, Taberlet & Coissac (2011): Riaz, T., Shehzad, W., Viari, A., Pompanon, F., Taberlet, P. & Coissac, E. (2011). ecoPrimers: inference of new DNA barcode markers from whole genome sequence analysis. Nucleic acids research, 39(21). e145. https://doi.org/10.1093/nar/gkr732