DNA Patterns
DNA patterns are useful for describing short DNA sequences like oligonucleotides. They are used by several OBITools, such as `obimultiplex`, `obipcr` or `obigrep`. The advantage of using DNA patterns over classical regular expressions is that they can be matched with errors. Allowed errors can be simple mismatches, or mismatches and insertions/deletions.
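To illustrate what matching with errors means, here is a minimal Python sketch that scans a sequence for a plain A/C/G/T pattern while tolerating a fixed number of mismatches. It is not the algorithm used by the OBITools; the function names `count_mismatches` and `find_with_mismatches` and the toy sequence are assumptions made for this example, and insertions/deletions are deliberately left out.

```python
def count_mismatches(pattern: str, window: str) -> int:
    """Count the positions where two equal-length strings differ."""
    return sum(1 for p, w in zip(pattern, window) if p != w)


def find_with_mismatches(pattern: str, sequence: str, max_mismatches: int) -> list[tuple[int, int]]:
    """Return (start, mismatches) for every window within the error budget.

    Hypothetical helper: handles mismatches only, not insertions/deletions.
    """
    pattern, sequence = pattern.upper(), sequence.upper()
    hits = []
    for start in range(len(sequence) - len(pattern) + 1):
        window = sequence[start:start + len(pattern)]
        mismatches = count_mismatches(pattern, window)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


# Toy sequence (invented for the example) with one exact and one inexact site.
sequence = "ccGATTACAgtGATTTCAaa"
print(find_with_mismatches("GATTACA", sequence, 0))  # [(2, 0)]
print(find_with_mismatches("GATTACA", sequence, 1))  # [(2, 0), (11, 1)]
```

With no errors allowed only the exact site is reported; allowing one mismatch also recovers the second site, which an exact regular expression would miss.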
Syntax of a DNA Pattern
- Patterns are limited to sequences of up to 63 bases.
- Like all DNA sequences, they are written from the 5’ end to the 3’ end.
- Each base is represented by a single letter (A, C, G, T).
- IUPAC codes can be used to represent ambiguous bases (M, K, R, Y, S, W, B, D, H, V, N; see table below).
- Ambiguous positions can also be denoted by a range of base characters (i.e. `A`, `T`, `G`, `C`) surrounded by square brackets (`[]`): `[ATC]`.
- A range of bases can be negated by prefixing it with a `!`: `[!AC]`.
- Patterns do not allow for ambiguity in the number of occurrences of a base.
- Positions where errors are not allowed are denoted by a sharp (`#`) symbol after the base.
- Patterns are case-insensitive.
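The syntax rules above can be sketched as a small Python parser that turns a pattern into one (allowed bases, strict) pair per position. This is only an illustration, not the parser used by the OBITools: the names `parse_pattern`, `IUPAC` and `MAX_LENGTH` are hypothetical, and the sketch assumes that bracketed ranges contain only A, C, G and T and that `#` may follow a bracketed range as well as a single letter.

```python
# IUPAC codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}
MAX_LENGTH = 63  # maximum number of positions in a pattern


def parse_pattern(pattern: str) -> list[tuple[frozenset, bool]]:
    """Parse a DNA pattern into (allowed bases, strict) pairs (hypothetical helper)."""
    text = pattern.upper()  # patterns are case-insensitive
    positions = []
    i = 0
    while i < len(text):
        char = text[i]
        if char == "[":
            end = text.find("]", i)
            if end == -1:
                raise ValueError("unclosed '[' in pattern")
            body = text[i + 1:end]
            negate = body.startswith("!")
            if negate:
                body = body[1:]
            # Assumption: ranges list plain bases only, no IUPAC codes.
            if not body or any(b not in "ACGT" for b in body):
                raise ValueError(f"invalid base range: [{text[i + 1:end]}]")
            bases = set("ACGT") - set(body) if negate else set(body)
            if not bases:
                raise ValueError("base range matches no base")
            i = end + 1
        elif char in IUPAC:
            bases = set(IUPAC[char])
            i += 1
        else:
            raise ValueError(f"unexpected character {char!r} at offset {i}")
        # A '#' right after the position forbids errors at that position.
        strict = i < len(text) and text[i] == "#"
        if strict:
            i += 1
        positions.append((frozenset(bases), strict))
    if len(positions) > MAX_LENGTH:
        raise ValueError(f"pattern has {len(positions)} positions, limit is {MAX_LENGTH}")
    return positions


for bases, strict in parse_pattern("ac[!AC]S#"):
    print("".join(sorted(bases)), "strict" if strict else "errors allowed")
```

Because each position is reduced to a set of bases, an IUPAC code and the equivalent bracketed range (for instance `S` and `[CG]`) parse to the same result.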
Example
A DNA pattern corresponding to the forward primer of the Euka02 marker with no errors allowed at the two last bases on the 3’ end:
TTTGTCTGSTTAATTSC#G#
Example
The same pattern, using a base range to express the second S ambiguity:
TTTGTCTGSTTAATT[CG]C#G#
IUPAC Codes for Ambiguous Bases
Symbol | Bases | Origin of designation |
---|---|---|
G | G | Guanine |
A | A | Adenine |
T | T | Thymine |
C | C | Cytosine |
R | G or A | puRine |
Y | T or C | pYrimidine |
M | A or C | aMino |
K | G or T | Keto |
S | G or C | Strong interaction (3 H bonds) |
W | A or T | Weak interaction (2 H bonds) |
H | A or C or T | not-G, H follows G in the alphabet |
B | G or T or C | not-A, B follows A |
V | G or C or A | not-T (not-U), V follows U |
D | G or A or T | not-C, D follows C |
N | G or A or T or C | aNy |
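As an illustration of how the table can be used, the following Python sketch expands an IUPAC sequence into an ordinary regular expression built from character classes. The function name `iupac_to_regex` is an assumption made for this example and is not part of the OBITools. The resulting expression only matches exactly, which is precisely the limitation that DNA patterns are meant to overcome.

```python
import re

# The IUPAC table above, as a mapping from symbol to bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def iupac_to_regex(sequence: str) -> str:
    """Translate an IUPAC sequence into a regular expression (hypothetical helper)."""
    parts = []
    for code in sequence.upper():
        bases = IUPAC[code]  # raises KeyError on a non-IUPAC character
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


regex = iupac_to_regex("TTTGTCTGSTTAATTSCG")
print(regex)  # TTTGTCTG[CG]TTAATT[CG]CG
print(bool(re.fullmatch(regex, "TTTGTCTGCTTAATTGCG")))  # True: both S positions hold C or G
print(bool(re.fullmatch(regex, "TTTGTCTGATTAATTGCG")))  # False: A at the first S position
```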