Build a reference database

Build a reference database #

One of the crucial steps in the analysis of environmental DNA data is the taxonomic assignment of sequences, i.e. assigning a species, genus or other taxonomic rank to the sequences present in the collected samples.

Taxonomic assignment requires annotated reference sequences, against which the sequences of interest are compared. These reference sequences form what is known as a reference database, which is a sequence file in fasta format, for a given marker of metabarcoding.

Here is a quick step-by-step guide to creating a reference database, here for assigning sequences from wolf fecal samples to study its diet, a dataset used in the metabarcoding analysis tutorial here.

One way to build a reference database is to use the obipcr program to simulate a PCR and extract all sequences from a general purpose DNA database such as GenBank or EMBL that can be amplified in silico by the two primers used for PCR amplification.

The steps to create a reference database are:

  1. Download sequences from a public database such as GenBank or EMBL
  2. Perform an in silico PCR amplification of these sequences with a given marker with obipcr
  3. Clean up the database by deleting sequences that do not provide sufficient taxonomic information and are redundant

Since Genbank and the taxonomy associated with sequences are constantly evolving, you may not get exactly the same results when using the following commands.

Download the sequences #

In this example, the sequences are downloaded from the GenBank FTP server. Please note that the download takes more than a day and currently occupies around 1.5 TB, so make sure you have the necessary storage capacity before launching it. To have a local copy of GenBank sequences, please go to the Prepare a local copy of GenBank page.

Perform a in silico PCR amplification #

In this example, we amplify the 12S-V5 region [@Riaz2011-gn] with the forward primer TTAGATACCCCACTATGC and the reverse primer TAGAACAGGCTCCTCTAG, with the following command, to study the wolf diet (see the tutorial). Do not forget to update the release number of GenBank in the command line.

obipcr -e 3 -l 50 -L 150 \ 
       --forward TTAGATACCCCACTATGC \
       --reverse TAGAACAGGCTCCTCTAG \
       --no-order \
       genbank/Release_264/fasta/*
       > v05_pcr.fasta

The -l and -L options define the minimum and maximum sizes of sequence fragments to be amplified. Three mismatches with primer sequences are allowed here (-e 3), and we recommend using the --no-order option to speed up the program (see obipcr documentation).

This previous command produces a fasta file, with the computed amplified sequences.

Clean the database #

We choose to apply these different steps of filtering to clean up the sequences obtained with obipcr :

  1. Keep the sequences with a taxid and a taxonomic description to family, genus and species ranks ( obigrep )
  2. Remove redundant sequences (dereplicate)
  3. Ensure that the dereplicated sequences have a taxid (taxon identifier) at the family level
  4. Ensure that sequences each have a unique identification ID with obiannotate
  5. Index the database

Keep annotated sequences #

To use the -t taxonomy option on all OBITools commands, you can either enter the path to the taxonomy if you have downloaded the sequences from the help page here which looks like Release_264/taxonomy, or download the taxdump file online with curl.

curl http://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

The obigrep program allows to filter sequences, to keep only those with a taxid and a sufficient taxonomic description.

obigrep -t taxdump.tar.gz \
        -A taxid \
        --require-rank species \
        --require-rank genus \
        --require-rank family \
        v05_pcr.fasta > v05_clean.fasta

Dereplicate sequences #

The obiuniq program is able to dereplicate the sequences.

obiuniq -c taxid v05_clean.fasta > v05_clean_uniq.fasta

Ensure that the dereplicated sequences have a taxid at the family level #

Some sequences lose taxonomic information at the dereplication stage if certain versions of the sequence did not have this information beforehand. So we apply a second filter of this type.

obigrep -t taxdump.tar.gz --require-rank=family v05_clean_uniq.fasta > v05_clean_uniq.fasta

Ensure that sequences each have a unique identifier #

Index the database #

obirefidx -t taxdump.tar.gz v05_clean_uniq.fasta > v05_clean_uniq_indexed.fasta

The database provided in the tutorial is called wolf_data/db_v05_r117_indexed.fasta.