DNA Patterns

DNA patterns are useful for describing short DNA sequences such as oligonucleotides. They are used by several OBITools, including obimultiplex, obipcr, and obigrep. Their advantage over classical regular expressions is that they can be matched with errors. The allowed errors can be mismatches only, or mismatches together with insertions/deletions.
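To make "matching with errors" concrete, here is a minimal Python sketch of the mismatch-only case. It is an illustration, not the OBITools implementation: the function name `find_with_mismatches` and its parameters are hypothetical, the pattern is restricted to plain A, C, G, T, and insertions/deletions are not handled (those would require an edit-distance alignment instead of a position-by-position comparison).

```python
def find_with_mismatches(pattern: str, sequence: str, max_mismatches: int):
    """Return (start, mismatches) for every window of `sequence` that
    differs from `pattern` at no more than `max_mismatches` positions.

    Hypothetical helper for illustration only; substitutions are the
    only kind of error considered here.
    """
    pattern = pattern.upper()
    sequence = sequence.upper()
    hits = []
    for start in range(len(sequence) - len(pattern) + 1):
        window = sequence[start:start + len(pattern)]
        # Count the positions where pattern and window disagree.
        mismatches = sum(1 for p, s in zip(pattern, window) if p != s)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


# An exact search finds nothing, but allowing one mismatch finds a hit.
exact = find_with_mismatches("ACGT", "TTACGATT", 0)
approximate = find_with_mismatches("ACGT", "TTACGATT", 1)
```

A regular expression could only express the second search by enumerating every one-mismatch variant of the pattern, which is what makes an explicit error budget more convenient.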

Syntax of a DNA Pattern

  • Patterns are limited to sequences up to 63 bases long.
  • Like all DNA sequences, they are written from the 5’ end to the 3’ end.
  • Each base is represented by a single letter (A, C, G, T).
  • IUPAC codes can be used to represent ambiguous bases (M, K, R, Y, S, W, B, D, H, V, N; see table below).
  • Ambiguous positions can also be denoted by a range of bases, that is, a set of base characters (A, C, G, T) enclosed in square brackets: [ATC] matches A, T, or C.
  • A range of bases can be negated by prefixing it with !: [!AC] matches any base except A and C.
  • Patterns do not allow for ambiguity on the number of occurrences of a base.
  • Positions where errors are not allowed are denoted by a sharp (#) symbol after the base.
  • Patterns are case-insensitive.
Example

A DNA pattern corresponding to the forward primer of the Euka02 marker, with no errors allowed at the last two bases at the 3’ end:

TTTGTCTGSTTAATTSC#G#

Example

The same pattern, using a base range to express the second S ambiguity:

TTTGTCTGSTTAATT[CG]C#G#
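The syntax rules above can be sketched as a small Python parser that turns a pattern into one (allowed bases, strict) pair per position. This is only a sketch under stated assumptions: `parse_pattern`, `IUPAC`, and `MAX_LENGTH` are hypothetical names, the error handling is a guess at reasonable behavior, and nothing here reflects how OBITools parses patterns internally.

```python
# IUPAC codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}
MAX_LENGTH = 63  # maximum number of positions in a pattern


def parse_pattern(pattern: str):
    """Return a list of (allowed_bases, strict) tuples, one per position.

    Hypothetical helper: `strict` is True when the position is followed
    by '#', meaning no error is allowed there.
    """
    text = pattern.upper()  # patterns are case-insensitive
    positions = []
    i = 0
    while i < len(text):
        char = text[i]
        if char == "[":
            close = text.find("]", i)
            if close == -1:
                raise ValueError("unclosed '[' in pattern")
            body = text[i + 1:close]
            negated = body.startswith("!")
            if negated:
                body = body[1:]
            if not body or any(base not in "ACGT" for base in body):
                raise ValueError(f"invalid base range: [{text[i + 1:close]}]")
            allowed = set("ACGT") - set(body) if negated else set(body)
            if not allowed:
                raise ValueError("base range matches no base")
            i = close + 1
        elif char in IUPAC:
            allowed = set(IUPAC[char])
            i += 1
        else:
            raise ValueError(f"unexpected character {char!r} at index {i}")
        # A '#' right after the position forbids errors at that position.
        strict = i < len(text) and text[i] == "#"
        if strict:
            i += 1
        positions.append((frozenset(allowed), strict))
    if len(positions) > MAX_LENGTH:
        raise ValueError(f"pattern longer than {MAX_LENGTH} bases")
    return positions


with_code = parse_pattern("TTTGTCTGSTTAATTSC#G#")
with_range = parse_pattern("TTTGTCTGSTTAATT[CG]C#G#")
```

Under these assumptions, both spellings of the Euka02 forward primer parse to the same 18 positions, with only the last two marked as strict.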

IUPAC Codes for Ambiguous Bases

IUPAC DNA ambiguity codes

Symbol   Bases               Origin of designation
G        G                   Guanine
A        A                   Adenine
T        T                   Thymine
C        C                   Cytosine
R        G or A              puRine
Y        T or C              pYrimidine
M        A or C              aMino
K        G or T              Keto
S        G or C              Strong interaction (3 H bonds)
W        A or T              Weak interaction (2 H bonds)
H        A or C or T         not-G, H follows G in the alphabet
B        G or T or C         not-A, B follows A
V        G or C or A         not-T (not-U), V follows U
D        G or A or T         not-C, D follows C
N        G or A or T or C    aNy
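As a quick illustration of how the table is read, the following Python sketch expands a sequence written with IUPAC codes into every concrete sequence it stands for. The names `IUPAC` and `expand` are hypothetical and chosen for this example only. The expansion grows multiplicatively with each ambiguous position, so it is only practical for short sequences with few ambiguity codes.

```python
from itertools import product

# The table above as a lookup: symbol -> bases it represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def expand(sequence: str):
    """List every unambiguous sequence covered by an IUPAC-coded sequence.

    Hypothetical helper for illustration; raises KeyError on a symbol
    that is not in the table.
    """
    choices = [IUPAC[symbol] for symbol in sequence.upper()]
    return ["".join(bases) for bases in product(*choices)]


# The Euka02 forward primer contains two S codes (G or C each),
# so it stands for 2 * 2 concrete sequences.
variants = expand("TTTGTCTGSTTAATTSCG")
```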