obipairing
: align forward and reverse paired reads
#
Description #
When DNA metabarcoding sequences are generated as paired reads on the Illumina platform, obipairing
aims to align both forward and reverse reads to generate full length amplicon sequences.
Input data #
The obipairing
command requires two input files:
- One file contains the forward reads.
- The second file contains the reverse reads.
Both files must have the same number of sequences, and the sequences must be in the same order. This means that the first sequence of the forward reads file must correspond to the first sequence of the reverse reads file. obipairing
will assume this order and only align sequences that occur in the same rank.
Consider the following example, where the forward reads file is
forward.fastq
and the reverse reads file is
reverse.fastq
and both consist of 4 sequences:
@M01334:147:000000000-LBRVD:1:1101:14968:1570 1:N:0:CTCACCAA+CTAGGCAA
TGTTCCACGGGCAATCCTGAGCCAAATCTTTCATTTTGAAAAAATGAGAGATATAATGTATCTCTTATTTATTATAAGAAATAAAATATTTCTTATCTAATATTAAAGTTAGGTGCAGAGACTCAATGGGTGGAACTAGATCGGATGTGCA
+
11>A>@3@A11>ACFFEG110BFB00BAFGHE2DFGG201110/B11111/D1D2222D2FDFDFGDGHHBGG2F222110D11@1D1FGHFHGFF@GE1F2FG22112B220F1@111/0>BF11B210B>//11B1<1BB<///<1122
@M01334:147:000000000-LBRVD:1:1101:15946:1586 1:N:0:CTCACCAA+CTAGGCAA
TCCTAACCCCATTGAGTCTCTGCACCTATCTTTAATATTAGATAAGAAATATTTTATTTCTTATAATAAATAAGAGATATTTTATATCTCTCATTTTTTCAAAATGAAAGATTTGGCTCAGGATTGCCCACGTAACGGAGATCGGAAGAGC
+
1>>A111>>>AFGGB1FFGFGFF3BBF1GGHHH33D2GH2B1D211110D1DGHHBFGGGGG2FA2F221F21A1F0D1DGHH2FAFFGFHFFGHHHHGG22@1BD111@0FFHE11GC1001BGF1B1B/EF00??////BF////<000
@M01334:147:000000000-LBRVD:1:1101:15399:1590 1:N:0:CTCACCAA+CTAGGCAA
TGTTCCACCCATTGAGTCTCTGCACCTATCTTTAATATTAGATAAGAAATATTTTACTTCTTATAATAAATAAGAGTTATTTTATATCTCTCATTTTTTCAAAATGAAAGATTTGGCTCAGGATTGCCCGTGGAACTAGATCGGAAGAGCA
+
11>A>@3B>>1CF111BBFAG3A3AAF1FFGHHF3FBGH221F211110D1DGHH2BBGBFF2F22D221D211111A2DDGG2F2FFFEGD1FFHHHGFD221B111110BFGD11F@1001BF0@@1/EA//1>F1B1FD/////00<1
@M01334:147:000000000-LBRVD:1:1101:13773:1687 1:N:0:CTCACCAA+CTAGGCAA
CTCGGATCACCATTGAGTCTCTGCACCTATCTTTAATATTAGATAAGAAAAAATATTATTTCTTATCTGAAATAAGAAATATTTTATATATTTCTTTTTCTCAAAATGAAAGATTTGGCTCAGGATTGCCCTGATCCGAGGGATAGCACCA
+
3AAAAAADFFFFGGGGFGGGGGHHHHHHFHHHHHHHHGHHHHGHGGHFFHHHCGFHHHHHHHHHHHHHGHHGGFHFFHHHGHHHHBHHHGHHHHHHHHHHHHHFFHHFBDFBCGHHF4BGHFGFFHHBDGFHHEHHFAAEECEGF3FDGFC
@M01334:147:000000000-LBRVD:1:1101:14968:1570 2:N:0:CTCACCAA+CTAGGCAA
TTTTCCTCCCTTTTTTTCTCTGCACCTTTCTTTTTTATTAGTTTTTTATTATTTTTTTTCTTTTTTTATTTTATTGATACTTTATATCTCTCTTTTTTTCTTTTTTATTGATTTTTCTCTGGTTTTCCCTTGTTACTTGTTCTTTTTTGCT
+
11>>1131111BB111A0B3B313A0B1BAFGG11E/DG222B22///1D2DDGG1AE>>FG1D1/>/12B221212@21BFD2B2B2B2F11BFGHEEC1111B//1212BBF110@22111@@/2111?01111@111?111111--11
@M01334:147:000000000-LBRVD:1:1101:15946:1586 2:N:0:CTCACCAA+CTAGGCAA
CCGTTACGTGGGCAATCCTGAGCCAATTCTTTCTTTTTGAAAAAATGAGAGATATAAAATATCTCTTATTTATTATAAGAAATAAAATATTTCTTATCTAATATTAATGATAGGTGCAGTGACTCTATGGGGTTAGGTAGTTCGGATGAGC
+
111>>111B111111BA0B1101B001BAGGH22DGGH?01110/B11111/D1D2221D1DBEDGH1GHH2GG2F222110D@111D1DFGEGFBG@GB1B2FG22222B220B11111111B@11B210/?E/00B211B2/////111
@M01334:147:000000000-LBRVD:1:1101:15399:1590 2:N:0:CTCACCAA+CTAGGCAA
TTTTCCTCGGGCTATCCTGAGCCAAATCTTTCCTTTTGAAAAATTTAGAGATATAAAATATCTCTTATTTATTTTATGTAGTATTATATTTCTTATCTAATATTAAATTTAGTTGCTTTTTCTCATTTTGTTTTACTTTTTCTTTTTTGCT
+
11>>1131111111B11B1101A000B1DFF21DDFG1011100B122111D1D2221D1DADAFG1DGH2FG2D212222D2222D2DAF2FG2D@F21B2DE22122B221@11111110B222B222B00021B221B011111//11
@M01334:147:000000000-LBRVD:1:1101:13773:1687 2:N:0:CTCACCAA+CTAGGCAA
TGATAGCAGGGCTATCCTGAGCCAAATCCGTGTTTTGAGAAAACAAGGGGGTTCTCGAACTAGAATACAAAAGAAAAGGATAGGTGCAGAGACTCAATGGTGCTATCCCTCGGATCAGGGCAATCCTTAGCCAAATCTTTCATTTTTTGAA
+
111>13@1111>11B1AF11BABC00B110BAFGGH0000DFAB//0///EEECGFA10AG1111D@@11100/0000/0F110B11@11/0>FC@1B>1B11FEFEC>E>///?<0110/?/FF<G22111@00@<GHHB>FHHH1///1
The first sequence of the
forward.fastq
file having the id M01334:147:000000000-LBRVD:1:1101:14968:1570
will be paired with the first sequence of the
reverse.fastq
file having the same id M01334:147:000000000-LBRVD:1:1101:14968:1570
, not because they have the same identifier but because they are both the first sequence of their respective files.
The simplest obipairing command #
The minimal obipairing
command to align the
forward.fastq
and
reverse.fastq
files is:
obipairing -F forward.fastq -R reverse.fastq > paired.fastq
graph TD A@{ shape: doc, label: "forward.fastq" } B@{ shape: doc, label: "reverse.fastq" } C[obipairing] D@{ shape: doc, label: "paired.fastq" } A --> C B --> C:::obitools C --> D classDef obitools fill:#99d57c
it will produce a file named
paired.fastq
with the following content:
@M01334:147:000000000-LBRVD:1:1101:14968:1570 {"ali_length":137,"definition":"1:N:0:CTCACCAA+CTAGGCAA","mode":"join","score":1687,"score_norm":0.679,"seq_ab_match":93}
tgttccacgggcaatcctgagccaaatctttcattttgaaaaaatgagagatataatgtatctcttatttattataagaaataaaatatttcttatctaatattaaagttaggtgcagagactcaatgggtggaactagatcggatgtgca..........agcaaaaaagaacaagtaacaagggaaaaccagagaaaaatcaataaaaaagaaaaaaagagagatataaagtatcaataaaataaaaaaagaaaaaaaataataaaaaactaataaaaaagaaaggtgcagagaaaaaaagggaggaaaa
+
11>A>@3@A11>ACFFEG110BFB00BAFGHE2DFGG201110/B11111/D1D2222D2FDFDFGDGHHBGG2F222110D11@1D1FGHFHGFF@GE1F2FG22112B220F1@111/0>BF11B210B>//11B1<1BB<///<1122!!!!!!!!!!11--111111?111@11110?1112/@@11122@011FBB2121//B1111CEEHGFB11F2B2B2B2DFB12@212122B21/>/1D1GF>>EA1GGDD2D1///22B222GD/E11GGFAB1B0A313B3B0A111BB1111311>>11
@M01334:147:000000000-LBRVD:1:1101:15946:1586 {"ali_dir":"right","ali_length":138,"definition":"1:N:0:CTCACCAA+CTAGGCAA","mode":"alignment","pairing_mismatches":{"(T:16)->(A:33)":14,"(T:33)->(A:17)":118,"(T:37)->(A:16)":125,"(T:38)->(A:16)":32,"(T:39)->(A:17)":44},"paring_fast_count":114,"paring_fast_overlap":138,"paring_fast_score":0.844,"score":5446,"score_norm":0.957,"seq_a_single":13,"seq_ab_match":132,"seq_b_single":13}
gctcatccgaactacctaaccccattgagtctctgcacctatctttaatattagataagaaatattttatttcttataataaataagagatattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccacgtaacggagatcggaagagc
+
111/////2B112CMMOUO?MNObVHfcAVVHVWVVTQSWRXXIYYYXUSWiXaWeWWUWVSTTTWXgeUWWXXXWWgXWYYWVYWdUgSTTTXYYUVdTVWVXVgUWXXXVeYXfTCUXWW`QGUWfA@WSR?PRRWVARAc?UVMMOO?///BF////<000
@M01334:147:000000000-LBRVD:1:1101:15399:1590 {"ali_length":4,"definition":"1:N:0:CTCACCAA+CTAGGCAA","mode":"join","score":126,"score_norm":1,"seq_ab_match":4}
tgttccacccattgagtctctgcacctatctttaatattagataagaaatattttacttcttataataaataagagttattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccgtggaactagatcggaagagca..........agcaaaaaagaaaaagtaaaacaaaatgagaaaaagcaactaaatttaatattagataagaaatataatactacataaaataaataagagatattttatatctctaaatttttcaaaaggaaagatttggctcaggatagcccgaggaaaa
+
11>A>@3B>>1CF111BBFAG3A3AAF1FFGHHF3FBGH221F211110D1DGHH2BBGBFF2F22D221D211111A2DDGG2F2FFFEGD1FFHHHGFD221B111110BFGD11F@1001BF0@@1/EA//1>F1B1FD/////00<1!!!!!!!!!!11//111110B122B12000B222B222B01111111@122B22122ED2B12F@D2GF2FAD2D2222D222212D2GF2HGD1GFADAD1D1222D1D111221B0011101GFDD12FFD1B000A1011B11B1111111311>>11
@M01334:147:000000000-LBRVD:1:1101:13773:1687 {"ali_dir":"left","ali_length":54,"definition":"1:N:0:CTCACCAA+CTAGGCAA","mode":"alignment","pairing_mismatches":{"(C:39)->(A:16)":102,"(C:39)->(A:17)":121,"(T:39)->(A:14)":101},"paring_fast_count":42,"paring_fast_overlap":54,"paring_fast_score":0.824,"score":2888,"score_norm":0.944,"seq_a_single":97,"seq_ab_match":51,"seq_b_single":97}
ctcggatcaccattgagtctctgcacctatctttaatattagataagaaaaaatattatttcttatctgaaataagaaatattttatatatttctttttctcaaaatgaaagatttggctcaggattgccctgatccgagggatagcaccattgagtctctgcacctatccttttcttttgtattctagttcgagaacccccttgttttctcaaaacacggatttggctcaggatagccctgctatca
+
3AAAAAADFFFFGGGGFGGGGGHHHHHHFHHHHHHHHGHHHHGHGGHFFHHHCGFHHHHHHHHHHHHHGHHGGFHFFHHHGHHHHBHHHGHHHHHHHXVVJIommmegikl]bVWgVDRXIlbkkVfPSWVWccVVT^ebggjkkCVeWcd1@CF>0/11@11B011F0/0000/00111@@D1111GA01AFGCEEE///0//BAFD0000HGGFAB011B00CBAB11FA1B11>1111@31>111
The alignment process #
obipairing
will align the reads following a two-step procedure to increase computation speed.
A fast alignment to determine quickly the overlap #
The first step aligns the reads using a
FASTA-derived algorithm.
Based on results of the first step, a second alignment step is on the overlapping region only using an exact dynamic programming algorithm taking into account sequence quality scores present in the
fastq
files. It is possible to disable this first alignment step at the cost of an increase in the computation time by using the --exact-mode
option.
The first fast alignment step adds three tags are added to the FASTQ header for each read to report the results of this first step alignment.
paring_fast_count
: Number of kmer shared on the main diagonal of the fasta dot plot.paring_fast_overlap
: Length of the overlap as detected by this algorithm in nucleotides.paring_fast_score
: The pairing fast score is the number of shared 4mer on the main diagonal of the fasta dot plot (paring_fast_count
) by the number of 4mer involved in the overlapping region of the reads ( \(paring\_fast\_overlap - 3\) ) \[ paring\_fast\_score = \frac{paring\_fast\_count}{paring\_fast\_overlap - 3} \]The
--fasta-exact
option allows changing the best alignment selection from the one with the highestparing_fast_score
(the default behavior) to the one with the highestparing_fast_count
.The
--exact-mode
option tellsobipairing
to bypass this first alignment step and proceed directly to the exact alignment, at the cost of a largest computation time.
The exact alignment of the overlapping regions #
Once the overlap has been quickly identified using the
FASTA-derived algorithm, the overlapping region as detected in this first step is extended by \(\Delta\)
nucleotides at each end (
\(\Delta = 5\)
by default and can be defined with the --delta
option) to be exactly aligned using a
semi-global alignment algorithm taking into account the sequence quality scores present in the
fastq
files. There are two versions of this algorithm,
the left-align and the right-align version. Which one is used, left or right, depends on the length of the amplicon. Amplicons longer than the read length will be aligned with the left version. The shorter ones are aligned with the right version.
When the --exact-mode
option is used, full length reads are aligned twice, once with the left version and once with the right version. The alignment with the highest score is used. This consequently increases the computation time.
The exact alignment step adds the following tags to the FASTQ header for each read to report the quality of the alignment.
ali_dir
: indicates the mode of the used exact alignment left or right.ali_length
: The length of the aligned overlapping region (including gaps).seq_a_single
: The length of the unaligned region on the forward read.seq_ab_match
: The number of matches in the aligned overlapping region.seq_b_single
: The length of the unaligned region on the reverse read.score
: The raw score of the alignment (the sum of the elementary scores for each aligned position).score_norm
:seq_ab_match
divided byali_length
.pairing_mismatches
: A description of the mismatches between the reads (this tag is not added if the--without-stat
is set). It is expressed as a JSON map with keys describing the mismatch and values corresponding to the position of the mismatch in the reconstructed full length amplicon.This example describes the three mismatches found in the overlapping region of the fourth sequence pair:{"(C:39)->(A:16)":102,"(C:39)->(A:17)":121,"(T:40)->(A:14)":101}
- A C with a quality score of 39 on the forward read is aligned to an A with a quality score of 16 on the reverse read at position 102.
- A C with a quality score of 39 on the forward read is aligned to an A with a quality score of 17 on the reverse read at position 121.
- A T with a quality score of 40 on the forward read aligns to an A with a quality score of 14 on the reverse read at position 101.
Building the consensus sequence #
If the overlap length is below a threshold (20 by default, and can be set with the --min-overlap
option), or the score_norm
is below an identity threshold (0.9 by default, and can be set with the --min-identity
option), no consensus is computed for the read pair. Both sequences are only pasted together with a set of .
separating the forward read and the reverse complementary sequence of the reverse read. In this case, the sequence is tagged with a mode
attribute set to join
.
If the overlap is long enough and the identity is sufficient, a consensus sequence is built to maximize the global sequencing quality of the reconstructed amplicon. The non-aligned regions are reported as is. The overlapping regions are transcribed as follows:
- For each match, the base observed on both reads is conserved, and the quality score is increased to reflect the congruence of the two reads. \[Q_{consensus} = Q_F + Q_R\]
- If there is a mismatch, the base with the highest quality score is retained and its quality score is decreased to reflect the discrepancy between the two reads (with \(Q_{max} = max(Q_F, Q_R)\) and \(Q_{min} = min(Q_F, Q_R)\) ). \[Q_{consensus} = \log_{10} \left(10^{-\frac{Q_max}{10}} \cdot \frac{1 - 10^{-\frac{Q_min}{10}}}{4} \right)\]
- In case of an insertion or deletion, the gap will be affected with a quality of 0 and the mismatch rules will be applied. This means that insertions and deletions will always be considered as insertions in the consensus sequence.
A mode
attribute set to alignment
will be added to the consensus sequence annotations.
Synopsis #
obipairing --forward-reads|-F <FILENAME_F> --reverse-reads|-R <FILENAME_R>
[--batch-size <int>] [--compress|-Z] [--debug] [--delta|-D <int>]
[--ecopcr] [--embl] [--exact-mode] [--fast-absolute] [--fasta]
[--fasta-output] [--fastq] [--fastq-output] [--force-one-cpu]
[--gap-penality|-G <float64>] [--genbank] [--help|-h|-?]
[--input-OBI-header] [--input-json-header] [--json-output]
[--max-cpu <int>] [--min-identity|-X <float64>]
[--min-overlap <int>] [--no-order] [--no-progressbar]
[--out|-o <FILENAME>] [--output-OBI-header|-O]
[--output-json-header] [--penality-scale <float64>] [--pprof]
[--pprof-goroutine <int>] [--pprof-mutex <int>] [--skip-empty]
[--solexa] [--version] [--without-stat|-S] [<args>]
Options #
obipairing mandatory options #
--forward-reads
|-F
<FILENAME>: The name of the file containing the forward reads.--reverse-reads
|-R
<FILENAME>: The name of the file containing the reverse reads.
Other obipairing
specific options
#
--delta
|-D
<INTEGER>: Length added to the fast-detected overlap for the exact alignment algorithm (default: 5 nucleotides).--exact-mode
: Do not run fast alignment heuristic. (default: a fast algorithm is run at first to accelerate the final exact alignment).--fast-absolute
: Compute absolute fast score, this option has no effect in exact mode (default: false).--gap-penalty
|-G
<FLOAT64>: Gap penalty expressed as the multiply factor applied to the mismatch score between two nucleotides with a quality of 40 (default 2). (default: 2.000000)--min-identity
|-X
<FLOAT64>: Minimum identity between overlapped regions of the reads to consider the alignment (default: 0.900000).--min-overlap
<INTEGER>: Minimum overlap between both the reads to consider the alignment (default: 20).--penalty-scale
<FLOAT64>: Scale factor applied to the mismatch score and the gap penalty (default 1).--without-stat
|-S
: Remove alignment statistics from the produced consensus sequences (default: false).
Controlling the input data #
OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step.The file format options #
--fasta
: indicates that sequence data is in fasta format.--fastq
: indicates that sequence data is in fastq format.--embl
: indicates that sequence data is in EMBL-ENA flatfile format.--csv
: indicates that sequence data is in CSV format.--genbank
: indicates that sequence data is in GenBank flatfile format.--ecopcr
: indicates that sequence data is in the old ecoPCR tabulated format.
Controlling the way OBITools4 are formatting annotations #
These options only apply to the FASTA and FASTQ formats--input-OBI-header
: FASTA/FASTQ title line annotations follow the old OBI format.--input-json-header
: FASTA/FASTQ title line annotations follow the JSON format.
Controlling quality score decoding #
This option only applies to the FASTQ formats--solexa
: decodes quality string according to the Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)
Controlling the output data #
--compress
|-Z
: output is compressed using gzip. (default: false)--no-order
: the OBITools ensure that the order between the input file and the output file does not change. When multiple files are processed, they are processed one at a time. If the –no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation.--fasta-output
: writes sequence data in fasta format (default if quality data is not available).--fastq-output
: writes sequence data in fastq format (default if quality data is available).--json-output
: writes sequence data in JSON format.--out
|-o
<FILENAME>: filename used for saving the output (default: “-”, the standard output)--output-OBI-header
|-O
: writes output FASTA/FASTQ title line annotations in OBI format (default: JSON).--output-json-header
: writew output FASTA/FASTQ title line annotations in JSON format (the default format).--skip-empty
: sequences of length equal to zero are removed from the output (default: false).--no-progressbar
: deactivates progress bar display (default: false).
General options #
--help
|-h|-?
: shows this help.--version
: prints the version and exits.--silent-warning
: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.
Computation related options #
--max-cpu
<INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.--force-one-cpu
: forces the use of a single CPU core for parallel processing (default: false).--batch-size
<INTEGER>: number of sequence per batch for parallel processing (default: 1000, env: OBIBATCHSIZE)
Debug related options #
--debug
: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)--pprof
: enables pprof server. Look at the log for details. (default: false).--pprof-mutex
<INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)--pprof-goroutine
<INTEGER>: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)
Examples #
Basic example #
Consider the two small fastq files presented above, each containing four sequences and named forward.fastq
and reverse.fastq
. The following command will align them and create a file named paired.fastq
containing the full-length amplicon sequences:
obipairing -F forward.fastq -R reverse.fastq > paired.fastq
A bar graph showing the frequencies of the aligned and joined read pairs can be generated by combining the output of the obicsv
command with the uplot
command:
obicsv -k mode paired.fastq | uplot -H count
mode
┌ ┐
alignment ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 2.0
join ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 2.0
└ ┘
It is possible to use the obidistribute
tool to separate the reads according to their mode
attribute, which is set to join
or alignment
:
obidistribute -p "paired_%s.fastq" -c mode paired.fastq
This command will produce two files named paired_join.fastq
and paired_alignment.fastq
containing the sequences with mode
set to join
and alignment
respectively.
Looking at the contents of the paired_join.fastq
file, we can see that the first pair of reads was not aligned because the score_norm
tag is less than the default identity threshold of 0.9, while the second pair of reads was aligned because the length of the overlap (ali_length
tag) is less than the default minimum overlap of 20.
@M01334:147:000000000-LBRVD:1:1101:14968:1570 {"ali_length":137,"definition":"1:N:0:CTCACCAA+CTAGGCAA","mode":"join","score":1687,"score_norm":0.679,"seq_ab_match":93}
tgttccacgggcaatcctgagccaaatctttcattttgaaaaaatgagagatataatgtatctcttatttattataagaaataaaatatttcttatctaatattaaagttaggtgcagagactcaatgggtggaactagatcggatgtgca..........agcaaaaaagaacaagtaacaagggaaaaccagagaaaaatcaataaaaaagaaaaaaagagagatataaagtatcaataaaataaaaaaagaaaaaaaataataaaaaactaataaaaaagaaaggtgcagagaaaaaaagggaggaaaa
+
11>A>@3@A11>ACFFEG110BFB00BAFGHE2DFGG201110/B11111/D1D2222D2FDFDFGDGHHBGG2F222110D11@1D1FGHFHGFF@GE1F2FG22112B220F1@111/0>BF11B210B>//11B1<1BB<///<1122!!!!!!!!!!11--111111?111@11110?1112/@@11122@011FBB2121//B1111CEEHGFB11F2B2B2B2DFB12@212122B21/>/1D1GF>>EA1GGDD2D1///22B222GD/E11GGFAB1B0A313B3B0A111BB1111311>>11
@M01334:147:000000000-LBRVD:1:1101:15399:1590 {"ali_length":4,"definition":"1:N:0:CTCACCAA+CTAGGCAA","mode":"join","score":126,"score_norm":1,"seq_ab_match":4}
tgttccacccattgagtctctgcacctatctttaatattagataagaaatattttacttcttataataaataagagttattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccgtggaactagatcggaagagca..........agcaaaaaagaaaaagtaaaacaaaatgagaaaaagcaactaaatttaatattagataagaaatataatactacataaaataaataagagatattttatatctctaaatttttcaaaaggaaagatttggctcaggatagcccgaggaaaa
+
11>A>@3B>>1CF111BBFAG3A3AAF1FFGHHF3FBGH221F211110D1DGHH2BBGBFF2F22D221D211111A2DDGG2F2FFFEGD1FFHHHGFD221B111110BFGD11F@1001BF0@@1/EA//1>F1B1FD/////00<1!!!!!!!!!!11//111110B122B12000B222B222B01111111@122B22122ED2B12F@D2GF2FAD2D2222D222212D2GF2HGD1GFADAD1D1222D1D111221B0011101GFDD12FFD1B000A1011B11B1111111311>>11
Looking at the contents of the paired_alignment.fastq
file, we can see (mode
tag) that the first pair of reads was aligned using the right version of the exact alignment algorithm, while the second pair of reads was aligned using the left version.
@M01334:147:000000000-LBRVD:1:1101:15946:1586 {"ali_dir":"right","ali_length":138,"definition":"1:N:0:CTCACCAA+CTAGGCAA","mode":"alignment","pairing_mismatches":{"(T:16)->(A:33)":14,"(T:33)->(A:17)":118,"(T:37)->(A:16)":125,"(T:38)->(A:16)":32,"(T:39)->(A:17)":44},"paring_fast_count":114,"paring_fast_overlap":138,"paring_fast_score":0.844,"score":5446,"score_norm":0.957,"seq_a_single":13,"seq_ab_match":132,"seq_b_single":13}
gctcatccgaactacctaaccccattgagtctctgcacctatctttaatattagataagaaatattttatttcttataataaataagagatattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccacgtaacggagatcggaagagc
+
111/////2B112CMMOUO?MNObVHfcAVVHVWVVTQSWRXXIYYYXUSWiXaWeWWUWVSTTTWXgeUWWXXXWWgXWYYWVYWdUgSTTTXYYUVdTVWVXVgUWXXXVeYXfTCUXWW`QGUWfA@WSR?PRRWVARAc?UVMMOO?///BF////<000
@M01334:147:000000000-LBRVD:1:1101:13773:1687 {"ali_dir":"left","ali_length":54,"definition":"1:N:0:CTCACCAA+CTAGGCAA","mode":"alignment","pairing_mismatches":{"(C:39)->(A:16)":102,"(C:39)->(A:17)":121,"(T:39)->(A:14)":101},"paring_fast_count":42,"paring_fast_overlap":54,"paring_fast_score":0.824,"score":2888,"score_norm":0.944,"seq_a_single":97,"seq_ab_match":51,"seq_b_single":97}
ctcggatcaccattgagtctctgcacctatctttaatattagataagaaaaaatattatttcttatctgaaataagaaatattttatatatttctttttctcaaaatgaaagatttggctcaggattgccctgatccgagggatagcaccattgagtctctgcacctatccttttcttttgtattctagttcgagaacccccttgttttctcaaaacacggatttggctcaggatagccctgctatca
+
3AAAAAADFFFFGGGGFGGGGGHHHHHHFHHHHHHHHGHHHHGHGGHFFHHHCGFHHHHHHHHHHHHHGHHGGFHFFHHHGHHHHBHHHGHHHHHHHXVVJIommmegikl]bVWgVDRXIlbkkVfPSWVWccVVT^ebggjkkCVeWcd1@CF>0/11@11B011F0/0000/00111@@D1111GA01AFGCEEE///0//BAFD0000HGGFAB011B00CBAB11FA1B11>1111@31>111
Pairing the reads in exact mode #
The --exact-mode
option can be used to align the reads in exact mode. This option bypasses the first fast alignment step and aligns the overlapping region of the reads using the exact alignment algorithm. This option increases the calculation time.
obipairing -F forward.fastq -R reverse.fastq \
--exact-mode > paired_exact.fastq
@M01334:147:000000000-LBRVD:1:1101:14968:1570 {"ali_length":137,"definition":"1:N:0:CTCACCAA+CTAGGCAA","mode":"join","score":1687,"score_norm":0.679,"seq_ab_match":93}
tgttccacgggcaatcctgagccaaatctttcattttgaaaaaatgagagatataatgtatctcttatttattataagaaataaaatatttcttatctaatattaaagttaggtgcagagactcaatgggtggaactagatcggatgtgca..........agcaaaaaagaacaagtaacaagggaaaaccagagaaaaatcaataaaaaagaaaaaaagagagatataaagtatcaataaaataaaaaaagaaaaaaaataataaaaaactaataaaaaagaaaggtgcagagaaaaaaagggaggaaaa
+
11>A>@3@A11>ACFFEG110BFB00BAFGHE2DFGG201110/B11111/D1D2222D2FDFDFGDGHHBGG2F222110D11@1D1FGHFHGFF@GE1F2FG22112B220F1@111/0>BF11B210B>//11B1<1BB<///<1122!!!!!!!!!!11--111111?111@11110?1112/@@11122@011FBB2121//B1111CEEHGFB11F2B2B2B2DFB12@212122B21/>/1D1GF>>EA1GGDD2D1///22B222GD/E11GGFAB1B0A313B3B0A111BB1111311>>11
@M01334:147:000000000-LBRVD:1:1101:15946:1586 {"ali_dir":"right","ali_length":138,"definition":"1:N:0:CTCACCAA+CTAGGCAA","mode":"alignment","pairing_mismatches":{"(T:16)->(A:33)":14,"(T:33)->(A:17)":118,"(T:37)->(A:16)":125,"(T:38)->(A:16)":32,"(T:39)->(A:17)":44},"score":5446,"score_norm":0.957,"seq_a_single":13,"seq_ab_match":132,"seq_b_single":13}
gctcatccgaactacctaaccccattgagtctctgcacctatctttaatattagataagaaatattttatttcttataataaataagagatattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccacgtaacggagatcggaagagc
+
111/////2B112CMMOUO?MNObVHfcAVVHVWVVTQSWRXXIYYYXUSWiXaWeWWUWVSTTTWXgeUWWXXXWWgXWYYWVYWdUgSTTTXYYUVdTVWVXVgUWXXXVeYXfTCUXWW`QGUWfA@WSR?PRRWVARAc?UVMMOO?///BF////<000
@M01334:147:000000000-LBRVD:1:1101:15399:1590 {"ali_length":137,"definition":"1:N:0:CTCACCAA+CTAGGCAA","mode":"join","score":3033,"score_norm":0.796,"seq_ab_match":109}
tgttccacccattgagtctctgcacctatctttaatattagataagaaatattttacttcttataataaataagagttattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccgtggaactagatcggaagagca..........agcaaaaaagaaaaagtaaaacaaaatgagaaaaagcaactaaatttaatattagataagaaatataatactacataaaataaataagagatattttatatctctaaatttttcaaaaggaaagatttggctcaggatagcccgaggaaaa
+
11>A>@3B>>1CF111BBFAG3A3AAF1FFGHHF3FBGH221F211110D1DGHH2BBGBFF2F22D221D211111A2DDGG2F2FFFEGD1FFHHHGFD221B111110BFGD11F@1001BF0@@1/EA//1>F1B1FD/////00<1!!!!!!!!!!11//111110B122B12000B222B222B01111111@122B22122ED2B12F@D2GF2FAD2D2222D222212D2GF2HGD1GFADAD1D1222D1D111221B0011101GFDD12FFD1B000A1011B11B1111111311>>11
@M01334:147:000000000-LBRVD:1:1101:13773:1687 {"ali_dir":"left","ali_length":54,"definition":"1:N:0:CTCACCAA+CTAGGCAA","mode":"alignment","pairing_mismatches":{"(C:39)->(A:16)":102,"(C:39)->(A:17)":121,"(T:39)->(A:14)":101},"score":2888,"score_norm":0.944,"seq_a_single":97,"seq_ab_match":51,"seq_b_single":97}
ctcggatcaccattgagtctctgcacctatctttaatattagataagaaaaaatattatttcttatctgaaataagaaatattttatatatttctttttctcaaaatgaaagatttggctcaggattgccctgatccgagggatagcaccattgagtctctgcacctatccttttcttttgtattctagttcgagaacccccttgttttctcaaaacacggatttggctcaggatagccctgctatca
+
3AAAAAADFFFFGGGGFGGGGGHHHHHHFHHHHHHHHGHHHHGHGGHFFHHHCGFHHHHHHHHHHHHHGHHGGFHFFHHHGHHHHBHHHGHHHHHHHXVVJIommmegikl]bVWgVDRXIlbkkVfPSWVWccVVT^ebggjkkCVeWcd1@CF>0/11@11B011F0/0000/00111@@D1111GA01AFGCEEE///0//BAFD0000HGGFAB011B00CBAB11FA1B11>1111@31>111
For this trivial data set, both results, paired.fastq
and paired_exact.fastq
, are identical with respect to the consensus sequence.
But the annotations are different. Using the UNIX diff command, it’s possible to compare the two files:
diff -u paired.fastq paired_exact.fastq
--- paired.fastq 2025-02-23 16:50:12
+++ paired_exact.fastq 2025-02-23 17:24:37
@@ -2,15 +2,15 @@
tgttccacgggcaatcctgagccaaatctttcattttgaaaaaatgagagatataatgtatctcttatttattataagaaataaaatatttcttatctaatattaaagttaggtgcagagactcaatgggtggaactagatcggatgtgca..........agcaaaaaagaacaagtaacaagggaaaaccagagaaaaatcaataaaaaagaaaaaaagagagatataaagtatcaataaaataaaaaaagaaaaaaaataataaaaaactaataaaaaagaaaggtgcagagaaaaaaagggaggaaaa
+
11>A>@3@A11>ACFFEG110BFB00BAFGHE2DFGG201110/B11111/D1D2222D2FDFDFGDGHHBGG2F222110D11@1D1FGHFHGFF@GE1F2FG22112B220F1@111/0>BF11B210B>//11B1<1BB<///<1122!!!!!!!!!!11--111111?111@11110?1112/@@11122@011FBB2121//B1111CEEHGFB11F2B2B2B2DFB12@212122B21/>/1D1GF>>EA1GGDD2D1///22B222GD/E11GGFAB1B0A313B3B0A111BB1111311>>11
-@M01334:147:000000000-LBRVD:1:1101:15946:1586 {"ali_dir":"right","ali_length":138,"definition":"1:N:0:CTCACCAA+CTAGGCAA","mode":"alignment","pairing_mismatches":{"(T:16)->(A:33)":14,"(T:33)->(A:17)":118,"(T:37)->(A:16)":125,"(T:38)->(A:16)":32,"(T:39)->(A:17)":44},"paring_fast_count":114,"paring_fast_overlap":138,"paring_fast_score":0.844,"score":5446,"score_norm":0.957,"seq_a_single":13,"seq_ab_match":132,"seq_b_single":13}
+@M01334:147:000000000-LBRVD:1:1101:15946:1586 {"ali_dir":"right","ali_length":138,"definition":"1:N:0:CTCACCAA+CTAGGCAA","mode":"alignment","pairing_mismatches":{"(T:16)->(A:33)":14,"(T:33)->(A:17)":118,"(T:37)->(A:16)":125,"(T:38)->(A:16)":32,"(T:39)->(A:17)":44},"score":5446,"score_norm":0.957,"seq_a_single":13,"seq_ab_match":132,"seq_b_single":13}
gctcatccgaactacctaaccccattgagtctctgcacctatctttaatattagataagaaatattttatttcttataataaataagagatattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccacgtaacggagatcggaagagc
+
111/////2B112CMMOUO?MNObVHfcAVVHVWVVTQSWRXXIYYYXUSWiXaWeWWUWVSTTTWXgeUWWXXXWWgXWYYWVYWdUgSTTTXYYUVdTVWVXVgUWXXXVeYXfTCUXWW`QGUWfA@WSR?PRRWVARAc?UVMMOO?///BF////<000
-@M01334:147:000000000-LBRVD:1:1101:15399:1590 {"ali_length":4,"definition":"1:N:0:CTCACCAA+CTAGGCAA","mode":"join","score":126,"score_norm":1,"seq_ab_match":4}
+@M01334:147:000000000-LBRVD:1:1101:15399:1590 {"ali_length":137,"definition":"1:N:0:CTCACCAA+CTAGGCAA","mode":"join","score":3033,"score_norm":0.796,"seq_ab_match":109}
tgttccacccattgagtctctgcacctatctttaatattagataagaaatattttacttcttataataaataagagttattttatatctctcattttttcaaaatgaaagatttggctcaggattgcccgtggaactagatcggaagagca..........agcaaaaaagaaaaagtaaaacaaaatgagaaaaagcaactaaatttaatattagataagaaatataatactacataaaataaataagagatattttatatctctaaatttttcaaaaggaaagatttggctcaggatagcccgaggaaaa
+
11>A>@3B>>1CF111BBFAG3A3AAF1FFGHHF3FBGH221F211110D1DGHH2BBGBFF2F22D221D211111A2DDGG2F2FFFEGD1FFHHHGFD221B111110BFGD11F@1001BF0@@1/EA//1>F1B1FD/////00<1!!!!!!!!!!11//111110B122B12000B222B222B01111111@122B22122ED2B12F@D2GF2FAD2D2222D222212D2GF2HGD1GFADAD1D1222D1D111221B0011101GFDD12FFD1B000A1011B11B1111111311>>11
-@M01334:147:000000000-LBRVD:1:1101:13773:1687 {"ali_dir":"left","ali_length":54,"definition":"1:N:0:CTCACCAA+CTAGGCAA","mode":"alignment","pairing_mismatches":{"(C:39)->(A:16)":102,"(C:39)->(A:17)":121,"(T:39)->(A:14)":101},"paring_fast_count":42,"paring_fast_overlap":54,"paring_fast_score":0.824,"score":2888,"score_norm":0.944,"seq_a_single":97,"seq_ab_match":51,"seq_b_single":97}
+@M01334:147:000000000-LBRVD:1:1101:13773:1687 {"ali_dir":"left","ali_length":54,"definition":"1:N:0:CTCACCAA+CTAGGCAA","mode":"alignment","pairing_mismatches":{"(C:39)->(A:16)":102,"(C:39)->(A:17)":121,"(T:39)->(A:14)":101},"score":2888,"score_norm":0.944,"seq_a_single":97,"seq_ab_match":51,"seq_b_single":97}
ctcggatcaccattgagtctctgcacctatctttaatattagataagaaaaaatattatttcttatctgaaataagaaatattttatatatttctttttctcaaaatgaaagatttggctcaggattgccctgatccgagggatagcaccattgagtctctgcacctatccttttcttttgtattctagttcgagaacccccttgttttctcaaaacacggatttggctcaggatagccctgctatca
+
3AAAAAADFFFFGGGGFGGGGGHHHHHHFHHHHHHHHGHHHHGHGGHFFHHHCGFHHHHHHHHHHHHHGHHGGFHFFHHHGHHHHBHHHGHHHHHHHXVVJIommmegikl]bVWgVDRXIlbkkVfPSWVWccVVT^ebggjkkCVeWcd1@CF>0/11@11B011F0/0000/00111@@D1111GA01AFGCEEE///0//BAFD0000HGGFAB011B00CBAB11FA1B11>1111@31>111
You can see that only the description line of the sequences has been changed. They are the only ones that start with a +
or a -
in the first column. The lines starting with -
are from the paired.fastq
file. The lines starting with +
are from the unpaired.fastq
file. Lines starting with are identical in both files.
For the two aligned sequences, the tags describing the fast alignment performed first are missing in the paired_exact.fastq
file because the
FASTA-derived algorithm is not run when the --exact-mode
option is used.
The second joined sequence pair with the --exact-mode
now has a very long overlap of 137 bases, as opposed to 4 bases in the previous command, but the score_norm
value is only 0.796, which is much lower than the threshold of 0.9, leading to a rejection of the alignment.