Metabarcoding analysis tutorial #
Here is a short tutorial for analyzing metabarcoding data, on an Illumina dataset from a wolf diet study, using the OBITools 4 and basic unix commands. It presents the following analysis steps:
- Assembly of forward and reverse reads
- Suppression of unaligned sequence records
- Sequence demultiplexing
- Dereplication of sequences
- Dataset denoising
- Taxonomic assignation of the sequences
- Generate the contingency table of results
- Import the results in R
The dataset to analyze and the reference database #
The data used in this tutorial corresponds to the analysis of four wolf scats
using the protocol published in @Shehzad2012-pn for carnivore diet assessment.
After extraction of DNA from feces, DNA amplification was performed using the
primers TTAGATACCCCACTATGC
and TAGAACAGGCTCCTCTAG
amplifying the 12S-V5
region [@Riaz2011-gn], together with a wolf blocking oligonucleotide.
An archive containing all the files needed for the analysis can be downloaded by clicking here: wolf_diet_dataset
The downloaded archive can be unarchived using the following unix command:
tar zxvf wolf_diet_dataset.tgz
It creates a wolf_data
directory with the following data files:
Two fastq files resulting from the aGA IIx (Illumina) paired-end (2 x 108 bp) sequencing of DNA extracted and amplified from four wolf feces:
wolf_F.fastq
with the forward sequenceswolf_R.fastq
with the reverse sequences
A tabulated file for sample demultiplexing, named
wolf_diet_ngsfilter.txt
, with primer and tag sequences for each sample. The tags correspond to short and specific sequences added to the 5' end of each primer to distinguish the different samples.The reference database in fasta format named
db_v05_r117_indexed.fasta
, which has been extracted from the from EMBL release 117 (see “build a reference database” documentation for details)
We recommend creating a new folder for the results, to separate them from the raw data:
mkdir results
Recover full length sequences from forward and reverse reads #
When using the result of a paired-end sequencing with supposedly overlapping forward and reverse reads, the first step is to recover the assembled sequence.
Forward and reverse reads of the same fragment are located at the same line position in both fastqs files.
Based on these two files, the assembly of the forward and reverse reads is performed using the obipairing
program, which aligns the two reads and returns the reconstructed sequence:
obipairing --min-identity=0.8 \
--min-overlap=10 \
-F wolf_data/wolf_F.fastq \
-R wolf_data/wolf_R.fastq \
> results/wolf.fastq
The --min-identity
and --min-overlap
options allow to discard sequences with low alignment quality:
overlapping reads with less than 10 base pairs, or alignment with less than 80% similarity (identity)
produce a concatenated sequence of the two reads, with an attribute "mode":"join"
.
Other reads that assemble well together produce sequences with the attribute "mode":"alignment"
.
For more information, please refer to the program obipairing
.
Remove unaligned sequence records #
Unaligned sequences, with an attribute "mode":"join"
, cannot be used. They are removed from the dataset:
obigrep -p 'annotations.mode != "join"' \
results/wolf.fastq > results/wolf_assembled.fastq
The -p
requires a go like expression, annotations.mode != "join"
means that
if the value of the mode
annotation of a sequence is different from join
,
the corresponding sequence record will be kept.
The first sequence record of wolf_assembled.fastq
can be obtained by the command:
head -n 4 results/wolf_assembled.fastq
@HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1 {"ali_dir":"left","ali_length":62,"mode":"alignment","pairing_mismatches":{"(T:26)->(G:13)":62,"(T:34)->(G:18)":48},"paring_fast_count":53,"paring_fast_overlap":62,"paring_fast_score":0.898,"score":1826,"score_norm":0.968,"seq_a_single":46,"seq_ab_match":60,"seq_b_single":46}
ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg
+
CCCCCCCBCCCCCCCCCCCCCCCCCCCCCCBCCCCCBCCCCCCC<CcDccbe[`F`accXV=TA\RYU\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCCCCCCACCCCCACCCCCCCCCCCCCCCC
Assign each sequence record to the corresponding sample and marker combination #
Each sequence record is assigned to its corresponding sample and marker using the data provided in the file wolf_diet_ngsfilter.txt
.
This file follows the required format for obimultiplex
program.
It contains one line per sample, with the name of the experiment (several experiments can be included in the same file),
the tags (e.g. aattaac
if a same tag has been used on each extremity of the PCR products, or aattaac:gaagtag
if two different tags were used),
the sequence of the forward primer, the sequence of the reverse primer, the letter T
or F
for sample identification using
the forward primer and tag only or using both primers and both tags, respectively (see obimultiplex
for details).
obimultiplex -s wolf_data/wolf_diet_ngsfilter.txt \
-u results/unidentified.fastq \
results/wolf_assembled.fastq \
> results/wolf_assembled_assigned.fastq
This command creates two files:
unidentified.fastq
with the sequences that failed to be assigned to a sample/marker combinationwolf_assembled_assigned.fastq
with the sequence records that were properly assigned to a sample/marker combination
Note that each sequence record of the wolf_assembled_assigned.fastq
file
contains only the barcode sequence as the sequences of primers and tags
are removed by the obimultiplex
program. Information concerning the
experiment, sample, primers and tags is added as attributes in the
sequence header.
For instance, the first sequence record of wolf_assembled_assigned.fastq
is:
@HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1_sub[28..127] {"ali_dir":"left","ali_length":62,"experiment":"wolf_diet","mode":"alignment","obimultiplex_amplicon_rank":"1/1","obimultiplex_direction":"forward","obimultiplex_forward_error":0,"obimultiplex_forward_match":"ttagataccccactatgc","obimultiplex_forward_matching":"strict","obimultiplex_forward_primer":"ttagataccccactatgc","obimultiplex_forward_proposed_tag":"gcctcct","obimultiplex_forward_tag":"gcctcct","obimultiplex_forward_tag_dist":0,"obimultiplex_reverse_error":0,"obimultiplex_reverse_match":"tagaacaggctcctctag","obimultiplex_reverse_matching":"strict","obimultiplex_reverse_primer":"tagaacaggctcctctag","obimultiplex_reverse_proposed_tag":"gcctcct","obimultiplex_reverse_tag":"gcctcct","obimultiplex_reverse_tag_dist":0,"pairing_mismatches":{"(T:26)->(G:13)":35,"(T:34)->(G:18)":21},"paring_fast_count":53,"paring_fast_overlap":62,"paring_fast_score":0.898,"sample":"29a_F260619","score":1826,"score_norm":0.968,"seq_a_single":46,"seq_ab_match":60,"seq_b_single":46}
ttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttataccctt
+
CCCBCCCCCBCCCCCCC<CcDccbe[`F`accXV=TA\RYU\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCC
Dereplicate the sequences #
A same DNA molecule can be sequenced several times. In order to reduce
both file size and computations time, and to get easier interpretable
results, it is convenient to work with unique sequences instead of
reads. To dereplicate such reads into unique sequences,
we use the obiuniq
command.
All reads in the dataset are compared in pairs, those that are strictly identical are grouped together,
and only one copy of each sequence is kept, with frequency information in the count
attribute.
For dereplication, we use the obiuniq
command with the -m sample
option
to keep the information of the original samples for each unique sequence. It returns a
fasta
file.
obiuniq -m sample \
results/wolf_assembled_assigned.fastq \
> results/wolf_assembled_assigned_uniq.fasta
The first sequence record of wolf_assembled_assigned_uniq.fasta
is:
>HELIUM_000100422_612GNAAXX:7:99:12017:19418#0/1_sub[28..127] {"ali_dir":"left","ali_length":62,"count":1,"experiment":"wolf_diet","merged_sample":{"29a_F260619":1},"mode":"alignment","obimultiplex_amplicon_rank":"1/1","obimultiplex_direction":"forward","obimultiplex_forward_error":0,"obimultiplex_forward_match":"ttagataccccactatgc","obimultiplex_forward_matching":"strict","obimultiplex_forward_primer":"ttagataccccactatgc","obimultiplex_forward_proposed_tag":"gcctcct","obimultiplex_forward_tag":"gcctcct","obimultiplex_forward_tag_dist":0,"obimultiplex_reverse_error":0,"obimultiplex_reverse_match":"tagaacaggctcctctag","obimultiplex_reverse_matching":"strict","obimultiplex_reverse_primer":"tagaacaggctcctctag","obimultiplex_reverse_proposed_tag":"gcctcct","obimultiplex_reverse_tag":"gcctcct","obimultiplex_reverse_tag_dist":0,"pairing_mismatches":{"(A:02)->(C:07)":54,"(A:02)->(G:17)":59,"(C:02)->(G:10)":42},"paring_fast_count":43,"paring_fast_overlap":62,"paring_fast_score":0.729,"sample":"29a_F260619","score":567,"score_norm":0.935,"seq_a_single":46,"seq_ab_match":58,"seq_b_single":46}
ttagccctaaacacaagtaattaatataacaaaattattcggcagagtactaccggcagt
agcttaaaactcaaaggacttggcggtgctttatacccct
The obiuniq
command has added two key:value
entries in the sequences attributes:
"merged_sample":{"29a_F260619":1}
: this sequence have been found once in a single sample called “29a_F260619”"count":1
: the total count for this sequence is 1
To keep only these two attributes, we can use the obiannotate
command:
obiannotate -k count -k merged_sample \
results/wolf_assembled_assigned_uniq.fasta \
> results/wolf_assembled_assigned_simple.fasta
The first five sequence records of wolf_assembled_assigned_simple.fasta
become:
>HELIUM_000100422_612GNAAXX:7:99:12017:19418#0/1_sub[28..127] {"count":1,"merged_sample":{"29a_F260619":1}}
ttagccctaaacacaagtaattaatataacaaaattattcggcagagtactaccggcagt
agcttaaaactcaaaggacttggcggtgctttatacccct
>HELIUM_000100422_612GNAAXX:7:56:19300:10949#0/1_sub[28..127] {"count":37,"merged_sample":{"29a_F260619":37}}
ttagccctaaacacaagtaattaatataacaaaattgttcaccagagtactagcggcaac
agcttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:117:10934:7472#0/1_sub[28..127] {"count":1,"merged_sample":{"29a_F260619":1}}
ttagccctaaacacaagtaattattataacaaaattattcgccagagtactaccggcaat
agcttaaaactcaaaggacttggcggtgctttatacccgt
>HELIUM_000100422_612GNAAXX:7:28:9432:2506#0/1_sub[28..127] {"count":4,"merged_sample":{"13a_F730603":4}}
ccagccttaaacacaaatagttatgcaaacaaaactattcgccagagtactaccggcaat
agcttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:94:11447:14902#0/1_sub[28..127] {"count":1,"merged_sample":{"15a_F730814":1}}
ttagccctaaacacaagtaattagtataacaaaattattccccagagtactaccggcaat
agcttaaaactcaaaggacttggcggtgctttataccctt
Denoise the sequence dataset #
To have a set of sequences assigned to their corresponding samples does not mean that all sequences are biologically meaningful, i.e. some of these sequences can contains PCR and/or sequencing errors, or chimeras.
Tag the sequences for PCR errors (sequence variants) #
The obiclean
program tags sequence variants as potential error generated during
PCR amplification. We ask it to keep the cluster-head sequences (-H
option) that are sequences which
are not variants of another sequence with a count greater than 5% of their own count
(-r 0.05
option). See the obiclean
documentation for details.
obiclean -s sample -r 0.05 -H \
results/wolf_assembled_assigned_simple.fasta \
> results/wolf_assembled_assigned_simple_clean.fasta
One of the sequence records of wolf_assembled_assigned_simple_clean.fasta
is:
>HELIUM_000100422_612GNAAXX:7:56:19300:10949#0/1_sub[28..127] {"count":37,"merged_sample":{"29a_F260619":37},"obiclean_head":true,"obiclean_headcount":1,"obiclean_internalcount":0,"obiclean_samplecount":1,"obiclean_singletoncount":0,"obiclean_status":{"29a_F260619":"h"},"obiclean_weight":{"29a_F260619":43}}
ttagccctaaacacaagtaattaatataacaaaattgttcaccagagtactagcggcaac
agcttaaaactcaaaggacttggcggtgctttataccctt
To remove potential chimeric sequences, or amplification or sequencing errors (artifacts), we discard the rare sequence variants.
Get some statistics about the sequence counts #
obicount results/wolf_assembled_assigned_simple_clean.fasta
entities,n
variants,765
reads,33823
symbols,75668
The dataset contains 765 sequence variants corresponding to 33823 sequence reads. Most of the variants occur only a single time in the complete dataset and are usually named singletons, let us see how many singletons there are:
obigrep -p 'sequence.Count() == 1' \
results/wolf_assembled_assigned_simple_clean.fasta |\
obicount
entities,n
variants,649
reads,649
symbols,64497
In our dataset, there are 649 singletons (or variants).
Using R and the ROBIFastread
package able to read headers of the fasta files produced by OBITools,
we can get more complete statistics on the sqrt-transformed distribution of occurrencies.
# Import libraries
library(ROBIFastread)
library(ggplot2)
# Import the fasta file
seqs <- read_obifasta("results/wolf_assembled_assigned_simple_clean.fasta", keys="count")
# Plot occurencies
ggplot(data=seqs, mapping=aes(x=count)) +
geom_histogram(bins=100) +
scale_y_sqrt() +
scale_x_sqrt() +
geom_vline(xintercept=10, col="red", lty=2) +
xlab("Number of occurrencies of a variant") +
ylab("Count") +
ggtitle("Variants occurencies distribution (sqrt-transformed)") +
theme_minimal()
The red dotted vertical line is placed at the level of variants that appear exactly 10 times (count=10). This represents the threshold at which we wish to retain sequence variants. Below this threshold, we consider that the variant is not abundant enough in the dataset to be considered a true sequence, but a potential artifact because it is too rare. Singletons correspond to variants whose count is 1.
In a similar way, it is also possible to plot the distribution of the sequence length.
ggplot(data=seqs, mapping=aes(x=nchar(sequence))) +
geom_histogram() +
scale_y_log10() +
geom_vline(xintercept = 80, col="red", lty=2) +
xlab("Sequence lengths in base pair") +
ylab("Count") +
ggtitle("Sequence length log-distribution") +
theme_minimal()
The red dotted vertical line represents the threshold at 80 bp, above which we wish to keep the sequences, otherwise considered too short so as potential artefacts.
Keep only the sequences having a count greater or equal to 10 and a length shorter than 80 bp #
Based on the previous observation, first we set a threshold to keep only the sequences that appear at least 10 times (count ≥ 10),
thanks to obigrep
.
The -p
option means that the python expression count>=10
must be evaluated to
True
for each sequence to be kept, and the -l
option allows us to remove sequences
shorter than 80 bp, as we know that the amplified 12S-V5 barcode for vertebrates must have a length around 100 bp.
obigrep -l 80 \
-p 'sequence.Count() >= 10' \
results/wolf_assembled_assigned_simple_clean.fasta \
> results/wolf_assembled_assigned_simple_clean_c10l80.fasta
The first sequence record of results/wolf_assembled_assigned_simple_clean_c10l80.fasta
is:
>HELIUM_000100422_612GNAAXX:7:56:19300:10949#0/1_sub[28..127] {"count":37,"merged_sample":{"29a_F260619":37},"obiclean_head":true,"obiclean_headcount":1,"obiclean_internalcount":0,"obiclean_samplecount":1,"obiclean_singletoncount":0,"obiclean_status":{"29a_F260619":"h"},"obiclean_weight":{"29a_F260619":43}}
ttagccctaaacacaagtaattaatataacaaaattgttcaccagagtactagcggcaac
agcttaaaactcaaaggacttggcggtgctttataccctt
Count the sequences after these filtering steps:
obicount results/wolf_assembled_assigned_simple_clean_c10l80.fasta
entities,n
variants,25
reads,30816
symbols,2486
Taxonomic assignment of sequences #
Once denoising has been done, the next step in diet analysis is to assign the barcodes to the corresponding taxa (species, genus, etc.), in order to get the complete list of the taxa associated to each sample.
The taxonomic assignment of sequences requires a reference database to
detect all possible taxa to be identified in samples, which is provided in
this tutorial as db_v05_r117_indexed.fasta
(see “Build a reference database”
documentation for more information about the reference database).
It is then based on sequence comparison between sample
sequences and reference sequences.
Download the taxonomy #
The current and complete taxonomy from the NCBI is available online, it is possible to download it with the following command:
curl http://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
A copy of this taxdump.tar.gz
file is also available into the tutorial archive.
Note that the NCBI taxonomy is daily updated, but this version is correct to use.
Assign the sequences #
Thanks to the reference database, taxonomic assignment can be
carried out with obitag
:
obitag -t wolf_data/taxdump.tar.gz \
-R wolf_data/db_v05_r117_indexed.fasta \
results/wolf_assembled_assigned_simple_clean_c10l80.fasta \
> results/wolf_assembled_assigned_simple_clean_c10l80_taxo.fasta
The first sequence record of wolf_assembled_assigned_simple_clean_c10l80_taxo.fasta
is:
>HELIUM_000100422_612GNAAXX:7:56:19300:10949#0/1_sub[28..127] {"count":37,"merged_sample":{"29a_F260619":37},"obiclean_head":true,"obiclean_headcount":1,"obiclean_internalcount":0,"obiclean_samplecount":1,"obiclean_singletoncount":0,"obiclean_status":{"29a_F260619":"h"},"obiclean_weight":{"29a_F260619":43},"obitag_bestid":0.96,"obitag_bestmatch":"AJ885202","obitag_match_count":1,"obitag_rank":"NA","obitag_similarity_method":"lcs","taxid":"NA"}
ttagccctaaacacaagtaattaatataacaaaattgttcaccagagtactagcggcaac
agcttaaaactcaaaggacttggcggtgctttataccctt
The obitag
command adds several attributes in the sequence record header, like:
obitag_bestmatch:ACCESSION
where ACCESSION is the id of hte sequence in the reference database that best aligns to the query sequenceobitag_bestid:FLOAT
where FLOAT*100 is the identity percent between the best match sequence and the query sequencetaxid:TAXID
where TAXID is the final taxonomy ID assigned to the sequence byobitag
scientific_name:NAME
where NAME is the scientific name of the assigned taxid
Generate the contingency table of results #
Some useless attributes can be removed at this stage with obiannotate
:
obiannotate --delete-tag=obiclean_head \
--delete-tag=obiclean_headcount \
--delete-tag=obiclean_internalcount \
--delete-tag=obiclean_samplecount \
--delete-tag=obiclean_singletoncount \
results/wolf_assembled_assigned_simple_clean_c10l80_taxo.fasta \
> results/wolf_final.fasta
The first sequence record of wolf_final.fasta
is then:
>HELIUM_000100422_612GNAAXX:7:56:19300:10949#0/1_sub[28..127] {"count":37,"merged_sample":{"29a_F260619":37},"obiclean_status":{"29a_F260619":"h"},"obiclean_weight":{"29a_F260619":43},"obitag_bestid":0.96,"obitag_bestmatch":"AJ885202","obitag_match_count":1,"obitag_rank":"NA","obitag_similarity_method":"lcs","taxid":"NA"}
ttagccctaaacacaagtaattaatataacaaaattgttcaccagagtactagcggcaac
agcttaaaactcaaaggacttggcggtgctttataccctt
Looking at the data in R #
The results can be imported in R.
# Import libraries
library(ROBIFastread)
library(vegan)
library(magrittr)
# Import the results
diet_data <- read_obifasta("results/wolf_final.fasta")
diet_data %<>% extract_features("obitag_bestmatch", "obitag_rank", "scientific_name", "taxid")
# Extract count matrix
diet_tab <- extract_readcount(diet_data,key="obiclean_weight")
diet_tab
There are 25 final annotated sequences. We can make the following assumptions about wolf diet, based on our samples:
- 13a_F730603 sample: Cervus elaphus
- 15a_F730814 sample: Capreolus capreolus
- 26a_F040644 sample: Marmota sp. (according to the location, it is Marmota marmota)
- 29a_F260619 sample: Capreolus capreolus
Note that we also obtained a few wolf sequences although a wolf-blocking oligonucleotide was used.