The FASTA-like alignment

The FASTA-like first step of alignment #

The FASTA algorithm ( Citation: & , & (). Rapid and sensitive protein similarity searches. Science, 227(4693). 1435–1441. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/2983426 ) can be considered as the ancestor of BLAST ( Citation: , & al., , , , & (). Basic local alignment search tool. Journal of molecular biology, 215(3). 403–410. https://doi.org/10.1006/jmbi.1990.9999 ) . It has the advantage of being easy to implement. It primarily calculates the best shift to apply between the two sequences under consideration to minimize the Hamming distance (number of differences) between them. This alignment algorithm is used in obipairing to determine the position and the size of the overlapping region of paired-end reads. These two criteria will guide the exact alignment method for the subsequent step, and determine the segments of the reads to align.

The FASTA-like algorithm builds a table of 4mers (DNA word of length 4) with their positions for the forward and reverse reads.

To illustrate, let us consider two short reads, A: ACGTTAGCTAGCTAGCTAA and B: CGCTAGCTAGCTAATTTGG, each of 19 nucleotides, with positions indexed from 00 to 18:

The sequences are both composed of \(19 - 4 + 1 = 16\) overlapping 4mers. An illustration of the overlapping 4mers is shown below for sequence A:

0000000000111111111
0123456789012345678

ACGTTAGCTAGCTAGCTAA

ACGT AGCT GCTA CTAA
 CGTT GCTA CTAG 
  GTTA CTAG TAGC
   TTAG TAGC AGCT
    TAGC AGCT GCTA

The 4mer indices of sequences A and B can then be compared as follows:

graph LR
 %%{init: {'flowchart': {'nodeSpacing': 10, 'rankSpacing': 30, 'htmlLabels': true}} }%%
 subgraph Sequence_A
    A_ACGT:::list@{ shape: hex, label: "00"}
    A_AGCT:::list@{ shape: hex, label: "05, 09, 13"}
    A_CGTT:::list@{ shape: hex, label: "01"}
    A_CTAA:::list@{ shape: hex, label: "15"}
    A_CTAG:::list@{ shape: hex, label: "07, 11"}
    A_GCTA:::list@{ shape: hex, label: "06, 12, 14"}
    A_GTTA:::list@{ shape: hex, label: "02"}
    A_TAGC:::list@{ shape: hex, label: "04, 08, 12"}
    A_TTAG:::list@{ shape: hex, label: "03"}
end

 subgraph Sequence_B
    B_AATT:::list@{shape: hex, label: "12"}
    B_AGCT:::list@{shape: hex, label: "04, 08"}
    B_CAGC:::list@{shape: hex, label: "02"}
    B_CGCT:::list@{shape: hex, label: "00"}
    B_CTAA:::list@{shape: hex, label: "10"}
    B_GAGC:::list@{shape: hex, label: "01"}
    B_GCTA:::list@{shape: hex, label: "05, 09"}
    B_GGCT:::list@{shape: hex, label: "18"}
    B_TAAT:::list@{shape: hex, label: "13"}
    B_TAGC:::list@{shape: hex, label: "03"}
    B_TTGG:::list@{shape: hex, label: "16"}
    B_TTTC:::list@{shape: hex, label: "15"}
end

  
    AATT:::red --> B_AATT
    A_ACGT --> ACGT:::red 
    A_AGCT --> AGCT:::green --> B_AGCT
    CAGC:::red --> B_CAGC
    CGCT:::red --> B_CGCT
    A_CGTT --> CGTT:::red 
    A_CTAA --> CTAA:::green --> B_CTAA
    A_CTAG --> CTAG:::red 
    GAGC:::red --> B_GAGC
    A_GCTA --> GCTA:::green --> B_GCTA
    GGCT:::red --> B_GGCT
    A_GTTA --> GTTA:::red 
    TAAT:::red --> B_TAAT
    A_TAGC --> TAGC:::green --> B_TAGC
    A_TTAG --> TTAG:::red 
    TTGG:::red --> B_TTGG
    TTTC:::red --> B_TTTC

    classDef green fill:#9f6, width:80px, height:50px, font-size:14px, align-items:center, text-align:center
    classDef red fill:#FF8080, width:80px, height:50px, font-size:14px, align-items:center
    classDef list stroke:#333, stroke-width:2px, width:80px, height:30px, font-size:14px

The diagram above indicates that the two sequences share four 4mers, highlighted in green.

Next, the algorithm computes \(\Delta = Pos_A - Pos_B\) for each 4mer shared between sequences A and B. If a 4mer occurs more than once, all the combinations of the differences are considered for that 4mer.

4merpositions on Apositions on B\(\Delta\)
AGCT05, 09, 1304, 081, -3, 5, 1, 9, 5
CTAA15105
GCTA06, 12, 1405, 091, -3, 7, 3, 9, 5
TAGC04, 08, 12031, 5, 9

Then, for each \(\Delta\) value, the algorithm computes its frequency, as well its relative score (RelScore), expressed as the frequency of the \(\Delta\) value normalized by the number of 4mers involved in the overlap. This allows to identify the best \(\Delta\) value, i.e. the one having the highest RelScore

\[ RelScore = \frac{Frequency}{length(overlap) - (4-1)} \]
\(\Delta\)FrequencyRelScore = Frequency/(Overlap - 3)
550.455
930.428
140.267
-320.154
710.111
310.077

The table above is the equivalent of a DNA Dot Plot. A \(\Delta\) value corresponds to one diagonal in the dot plot. By default, obipairing considers the diagonal with the highest RelScore as the optimal alignment, i.e. the best shift-only alignment (no insertions or deletions allowed in the overlap).

If the --fasta-absolute option is used, the best \(\Delta\) (or diagonal) is the one exhibiting the highest Frequency.

The scatter plot of the shared 4mers between sequences A & B

Shared 4mer positions:

This plot is a thresholded DNA dot plot of both sequences. Each point corresponds to a shared 4mer and is located at its respective positions on sequences A & B. The red dotted line indicates the diagonal with the greatest number of dots, here representing five overlapping 4mers and corresponding to a positional difference of 5 between sequences.

In this example, the diagonal having the largest RelScore (0.455) corresponds to a \(\Delta\) value of 5, which was observed five times. Thus, the sequence similarity between the sequences A and B is maximized by shifting B of 5 positions relatively to A.

Sequence A: ACGTTAGCTAGCTAGCTAA-----
Sequence B: -----CGCTAGCTAGCTAATTTGG
Overlap   :      .+++++++++++++  

This delimits the overlapping region between the two reads.

References #

Altschul, Gish, Miller, Myers & Lipman (1990)
, , , & (). Basic local alignment search tool. Journal of molecular biology, 215(3). 403–410. https://doi.org/10.1006/jmbi.1990.9999
Lipman & Pearson (1985)
& (). Rapid and sensitive protein similarity searches. Science, 227(4693). 1435–1441. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/2983426