`obitagpcr`: split paired-end raw reads per sample #

Description #

The obitagpcr command processes paired-end raw reads from amplicon sequencing experiments and assigns each read pair to the correct biological sample/PCR replicate. It relies on two successive operations:

Paired-end assembly — forward (R1) and reverse (R2) reads are merged into a single consensus amplicon using the same overlap-based algorithm as obipairing .
Demultiplexing — each assembled amplicon is matched against a list of known primers and barcodes (tags) to identify the sample of origin, using the same engine as obimultiplex .

However, unlike chaining the obipairing and obimultiplex commands, obitagpcr discards the assembled amplicon and only tags the forward and reverse reads with the deduced sample ID.

graph LR
    R1["Forward reads
(R1.fastq)"] --> OBT{{obitagpcr}}
    R2["Reverse reads
(R2.fastq)"] --> OBT
    CSV["NGSFilter CSV
(--tag-list)"] --> OBT
    OBT --> OUT_R1["result_R1.fastq"]
    OBT --> OUT_R2["result_R2.fastq"]
    OBT -. "--unidentified" .-> UNID["unassigned.fastq"]

obitagpcr is an alternative entry point for Illumina paired-end metabarcoding data when processing is delegated to external tools that require per-sample data files, such as DADA2.

Output files #

Unlike most OBITools4 commands, obitagpcr always produces paired output. The --out option must therefore be used to indicate where to save the results: for example, --out result.fastq produces two files, result_R1.fastq and result_R2.fastq.

The NGSFilter sample description file #

The --tag-list option takes a CSV file that describes all PCR reactions in the library. The exact structure of the file is shared with obimultiplex ; see the obimultiplex page for a complete description of the format, @param configuration options, and tag-matching algorithms.

📄 wolf_diet_ngsfilter.csv

@param,matching,strict
@param,primer_mismatches,2
@param,indels,false
experiment,sample,sample_tag,forward_primer,reverse_primer
wolf_diet,13a_F730603,aattaac,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG
wolf_diet,15a_F730814,gaagtag,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG
wolf_diet,26a_F040644,gaatatc,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG
wolf_diet,29a_F260619,gcctcct,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG

The file has two sections: optional @param lines that configure matching behaviour (primer mismatches, indels, tag-matching algorithm), and a sample table with at minimum five columns: experiment, sample, sample_tag, forward_primer, reverse_primer.
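The two-section layout can be checked before launching a run. The sketch below is not an OBITools4 feature: `check_ngsfilter` is a hypothetical helper written with plain `awk`, and it assumes simple comma-separated fields without quoting. It counts the @param lines, verifies that the five required column names appear in the header, and flags sample rows with fewer than five fields.

```shell
# Hypothetical helper (not part of OBITools4): sanity-check an NGSFilter CSV.
# Assumes plain comma-separated fields without quoting.
check_ngsfilter() {
  awk -F, '
    NF == 0    { next }                        # skip blank lines
    /^@param,/ { nparam++; next }              # configuration section
    !header    {                               # first remaining line: column names
      header = 1
      for (i = 1; i <= NF; i++) seen[$i] = 1
      n = split("experiment,sample,sample_tag,forward_primer,reverse_primer", need, ",")
      for (i = 1; i <= n; i++) if (!(need[i] in seen)) missing++
      next
    }
    { nsample++; if (NF < 5) short++ }         # sample rows
    END {
      printf "params=%d samples=%d short_rows=%d missing_columns=%d\n",
             nparam, nsample, short, missing
    }
  ' "$1"
}
```

Called as `check_ngsfilter wolf_diet_ngsfilter.csv` on the file shown above, it prints `params=3 samples=4 short_rows=0 missing_columns=0`.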

Use obitagpcr --template to print an annotated example of the format.

Annotations added by `obitagpcr` #

Each successfully demultiplexed sequence is annotated with the following attributes:

Sample identification
- experiment: "wolf_diet"
  Experiment name as defined in the NGSFilter file.
- sample: "29a_F260619"
  Sample (PCR) name as defined in the NGSFilter file.
Amplicon orientation
- obimultiplex_direction: "forward"/"reverse"
  Because sequencing is not oriented, some read pairs have the forward read starting with the forward primer, while some others have the forward read starting with the reverse primer. The obimultiplex_direction annotation documents these two cases:
  - "forward" value means that the forward primer was found at the beginning of the forward read,
  - "reverse" value means the reverse primer was found at the beginning of the forward read.
  Adding the --reorientate flag to the command exchanges and reverse-complements both reads of pairs annotated as "reverse". As a result, the reads in the R1 file all match the forward primer at their beginning, while the sequences in the R2 file all end with the reverse primer.
Primer matching
- obimultiplex_forward_match: "ttagataccccactatgc"
  Forward primer sequence as observed in the read.
- obimultiplex_forward_error: 0
  Number of mismatches between the forward primer and the read.
- obimultiplex_reverse_match: "tagaacaggctcctctag"
  Reverse primer sequence as observed in the read.
- obimultiplex_reverse_error: 0
  Number of mismatches between the reverse primer and the read.
Tag identification
- obimultiplex_forward_tag: "gcctcct"
  Barcode sequence observed at the forward end of the read.
- obimultiplex_reverse_tag: "gcctcct"
  Barcode sequence observed at the reverse end of the read.

When paired-end assembly succeeds via overlap alignment, additional attributes from the obipairing step (ali_length, score_norm, identity, mode) are also present in the output, unless suppressed with --without-stat (see obimultiplex for a complete description of the added annotations).
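Because the annotations are stored as JSON in the FASTQ title line, they can be summarised with standard tools. The following sketch tabulates read pairs per sample and orientation. `count_by_sample` is a hypothetical helper, not an OBITools4 command, and it assumes four-line FASTQ records and annotation values that contain no embedded double quotes.

```shell
# Hypothetical helper: tabulate reads per (sample, direction) from JSON headers.
# Assumes 4-line FASTQ records and string values without embedded quotes.
count_by_sample() {
  awk '
    # Return the string value of a JSON key found in the header, or "NA".
    function attr(line, key,    re, s) {
      re = "\"" key "\":\"[^\"]*\""
      if (!match(line, re)) return "NA"
      s = substr(line, RSTART, RLENGTH)        # "key":"value"
      sub(/^"[^"]*":"/, "", s)                 # drop the key part
      sub(/"$/, "", s)                         # drop the closing quote
      return s
    }
    NR % 4 == 1 { n[attr($0, "sample") "\t" attr($0, "obimultiplex_direction")]++ }
    END { for (k in n) print k "\t" n[k] }
  ' "$1" | sort
}
```

Applied to the four-read R1 output shown in the Examples section, `count_by_sample out_basic_R1.fastq` should list three "forward" and one "reverse" read for sample 29a_F260619, one tab-separated line per combination.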

Synopsis #

obitagpcr --forward-reads|-F <FILENAME_F> --reverse-reads|-R <FILENAME_R>
          [--allowed-mismatches|-e <int>] [--batch-mem <string>]
          [--batch-size <int>] [--batch-size-max <int>] [--compress|-Z]
          [--debug] [--delta|-D <int>] [--ecopcr] [--embl] [--exact-mode]
          [--fast-absolute] [--fasta] [--fasta-output] [--fastq]
          [--fastq-output] [--gap-penalty|-G <float64>] [--genbank]
          [--help|-h|-?] [--input-OBI-header] [--input-json-header]
          [--json-output] [--keep-errors] [--max-cpu <int>]
          [--min-identity|-X <float64>] [--min-overlap <int>] [--no-order]
          [--no-progressbar] [--out|-o <FILENAME>] [--output-OBI-header|-O]
          [--output-json-header] [--penalty-scale <float64>] [--pprof]
          [--pprof-goroutine <int>] [--pprof-mutex <int>] [--reorientate]
          [--silent-warning] [--skip-empty] [--solexa]
          [--tag-list|-s <string>] [--template] [--u-to-t]
          [--unidentified|-u <string>] [--version] [--with-indels]
          [--without-stat|-S] [<args>]

Options #

`obitagpcr` specific options #

--forward-reads | -F <FILENAME_F>: The file names containing the forward reads.
--reverse-reads | -R <FILENAME_R>: The file names containing the reverse reads.
--allowed-mismatches | -e <INTEGER>: Used to specify the number of errors allowed for matching primers. (default: -1)
--delta | -D <int>: Length added to the fast-detected overlap for the precise alignment (default: 5)
--exact-mode: Do not run fast alignment heuristic. (default: false)
--fast-absolute: Compute absolute fast score (no action in exact mode). (default: false)
--gap-penalty | -G <float64>: Gap penalty expressed as the multiplying factor applied to the mismatch score between two nucleotides with a quality of 40. (default: 2.000000)
--keep-errors: Prints symbol counts. (default: false)
--min-identity | -X <float64>: Minimum identity between the overlapping regions of the reads to consider the alignment (default: 0.900000)
--min-overlap <int>: Minimum overlap between both reads to consider the alignment (default: 20)
--penalty-scale <float64>: Scale factor applied to the mismatch score and the gap penalty. (default: 1.000000)
--reorientate: Reverse-complement reads if needed so that all sequences are stored in the same orientation with respect to the forward and reverse primers (default: false)
--tag-list | -s <string>: File name of the NGSFilter file describing PCRs.
--template: Print on the standard output an example of CSV configuration file. (default: false)
--unidentified | -u <string>: Filename used to store the sequences unassigned to any sample.
--with-indels: Allows for indels during the primers matching. (default: false)
--without-stat | -S : Remove alignment statistics from the produced consensus sequences. (default: false)

Controlling the input data #

OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step.

The file format options #

--fasta: indicates that sequence data is in fasta format.
--fastq: indicates that sequence data is in fastq format.
--embl: indicates that sequence data is in EMBL-ENA flatfile format.
--csv: indicates that sequence data is in CSV format.
--genbank: indicates that sequence data is in GenBank flatfile format.
--ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.

Controlling the way OBITools4 are formatting annotations #

These options only apply to the FASTA and FASTQ formats

--input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
--input-json-header: FASTA/FASTQ title line annotations follow the JSON format.

Controlling quality score decoding #

This option only applies to the FASTQ formats

--solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)

Controlling the output data #

--compress | -Z : output is compressed using gzip. (default: false)
--no-order: by default, OBITools ensure that the order of sequences is the same in the input and output files, and multiple input files are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Processing multiple files in parallel may also require more memory.
--fasta-output: writes sequence data in fasta format (default if quality data is not available).
--fastq-output: writes sequence data in fastq format (default if quality data is available).
--json-output: writes sequence data in JSON format.
--out | -o <FILENAME>: filename used for saving the output (default: “-”, the standard output)
--output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON).
--output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format).
--skip-empty: sequences of length equal to zero are removed from the output (default: false).
--no-progressbar: deactivates progress bar display (default: false).

General options #

--help | -h|-? : shows this help.
--version: prints the version and exits.
--silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.

--max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
--force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
--batch-size <INTEGER>: minimum number of sequences per batch for parallel processing (floor, default: 1, env: OBIBATCHSIZE)
--batch-size-max <INTEGER>: maximum number of sequences per batch for parallel processing (ceiling, default: 2000, env: OBIBATCHSIZEMAX)
--batch-mem <STRING>: maximum memory per batch (e.g. 128K, 64M, 1G; default: 128M; set to 0 to disable, env: OBIBATCHMEM)

--debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
--pprof: enables pprof server. Look at the log for details. (default: false).
--pprof-mutex <INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
--pprof-goroutine <INTEGER>: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)

Examples #

The examples below use the wolf diet 12S metabarcoding dataset from the Illumina OBITools4 cookbook (4 paired-end reads shown; full 20-read files: wolf_F.fastq and wolf_R.fastq).

Basic demultiplexing of a paired-end library #

Assemble paired reads and assign them to samples using the primer–barcode combinations defined in wolf_diet_ngsfilter.csv. The --out flag creates two files: out_basic_R1.fastq and out_basic_R2.fastq.

📄 wolf_F_4seq.fastq

@HELIUM_000100422_612GNAAXX:7:119:14871:19157#0/1  
ccgcctcctttagataccccactatgcttagccctaaacacaagtaattattataacaaaatcattcgccagagtactaccggcaatagctcaaaactcaaagaactt
+
CCCCCCCCCCCCCCCCCCCCCCBCCCCB@BCCCCCCCCCCCCCB;CCCACCCCCCCAACA29,?<5899+A=A###################################
@HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1  
ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggactt
+
CCCCCCCBCCCCCCCCCCCCCCCCCCCCCCBCCCCCBCCCCCCC<CCCCACC;C?CCCC@A;=,B;93:;CC=C;==??#############################
@HELIUM_000100422_612GNAAXX:7:97:14311:19299#0/1  
ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccgccactagcttaaaactcaaagaactc
+
CCCCCCCCCCCCCCCCCCCCCCCBBCCC?BCCCCCBC?CCCC@@;AAAAA5C@C@CCC@C>>;C@7CC@C93;31::5<<AA<@########################
@HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1  
ccgcctccttagaacaggctcctctagaagggtataaagcaccgccaagtcctttgagttttaagctattgccggtagtactctggcgaataattttgttatattaat
+
CCCCCCCCCBCCCCCCCCBCCCCCCCCCCCA=AAA@CCCCCCCCCCC?CACCC?CC@C@CACC?CA=B?0A;AAA6;>3?AC?C?8AAA3<<-8<BAC@22<6?####

📄 wolf_R_4seq.fastq

@HELIUM_000100422_612GNAAXX:7:119:14871:19157#0/2  
ccgcctccttagaacaggctcctctagaagggtataaagcaccgccaagtcctttgagttttaacctactcccgctacacgtccgccgaataatactgttatcatatt
+
CCCCCCCCCCCAAC@CCBCCCCCCB@C@CCCC@@CBBB6@@CC@AC8CC<C>C@@#####################################################
@HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/2  
ccgcctccttagaacaggctcctctagaagggtataaagcaccgccaagtcctttgagttttaagctattgccggtagtactctggcgaacacttttgttatattact
+
CCCCCCCCCCCCCCCCACCCCCACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCBCCCCCCCCCC=CCCCCCCCCC:=><ACCCCBCCA8;68.69AA?>(>AC@CA3A
@HELIUM_000100422_612GNAAXX:7:97:14311:19299#0/2  
ccgcctccttagaacaggctcctctagaagggtataaagcaccgccaagtcctttgagttttaagctcttgccggtagtactctggcgcacacttttcttatattact
+
CC@CCCCCBC?CBCCCCCCCCC=CC<CCCCCC9@C?;?<+BB@??85<?>?<<6<:<?43???<2?3;??CA@C552(8<5<>:).(//1//,1'6:375=CCCC@?6
@HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/2  
ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaatagcttcacactcaaagaactt
+
CCCDCCCCCCCCCDCCCCCCCCCDCCC@CCACCCCCCCCCCCCDCCCCCCCCCDCCCCCBBBBCC=/AAA===>=C<CCC?B9AA;3??7CC@C6CCC8ACCC+AB8A

obitagpcr \
  --forward-reads wolf_F_4seq.fastq \
  --reverse-reads wolf_R_4seq.fastq \
  --tag-list wolf_diet_ngsfilter.csv \
  --out out_basic.fastq

The R1 output, with demultiplexing annotations added to each sequence header:

📄 out_basic_4seq_R1.fastq

@HELIUM_000100422_612GNAAXX:7:119:14871:19157#0/1 {"experiment":"wolf_diet","obimultiplex_direction":"forward","obimultiplex_forward_match":"ttagataccccactatgc","obimultiplex_forward_mismatches":0,"obimultiplex_forward_tag":"gcctcct","obimultiplex_reverse_match":"tagaacaggctcctctag","obimultiplex_reverse_mismatches":0,"obimultiplex_reverse_tag":"gcctcct","sample":"29a_F260619"}
ccgcctcctttagataccccactatgcttagccctaaacacaagtaattattataacaaaatcattcgccagagtactaccggcaatagctcaaaactcaaagaactt
+
CCCCCCCCCCCCCCCCCCCCCCBCCCCB@BCCCCCCCCCCCCCB;CCCACCCCCCCAACA29,?<5899+A=A###################################
@HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1 {"experiment":"wolf_diet","obimultiplex_direction":"forward","obimultiplex_forward_match":"ttagataccccactatgc","obimultiplex_forward_mismatches":0,"obimultiplex_forward_tag":"gcctcct","obimultiplex_reverse_match":"tagaacaggctcctctag","obimultiplex_reverse_mismatches":0,"obimultiplex_reverse_tag":"gcctcct","sample":"29a_F260619"}
ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggactt
+
CCCCCCCBCCCCCCCCCCCCCCCCCCCCCCBCCCCCBCCCCCCC<CCCCACC;C?CCCC@A;=,B;93:;CC=C;==??#############################
@HELIUM_000100422_612GNAAXX:7:97:14311:19299#0/1 {"experiment":"wolf_diet","obimultiplex_direction":"forward","obimultiplex_forward_match":"ttagataccccactatgc","obimultiplex_forward_mismatches":0,"obimultiplex_forward_tag":"gcctcct","obimultiplex_reverse_match":"tagaacaggctcctctag","obimultiplex_reverse_mismatches":0,"obimultiplex_reverse_tag":"gcctcct","sample":"29a_F260619"}
ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccgccactagcttaaaactcaaagaactc
+
CCCCCCCCCCCCCCCCCCCCCCCBBCCC?BCCCCCBC?CCCC@@;AAAAA5C@C@CCC@C>>;C@7CC@C93;31::5<<AA<@########################
@HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1 {"experiment":"wolf_diet","obimultiplex_direction":"reverse","obimultiplex_forward_match":"ttagataccccactatgc","obimultiplex_forward_mismatches":0,"obimultiplex_forward_tag":"gcctcct","obimultiplex_reverse_match":"tagaacaggctcctctag","obimultiplex_reverse_mismatches":0,"obimultiplex_reverse_tag":"gcctcct","sample":"29a_F260619"}
ccgcctccttagaacaggctcctctagaagggtataaagcaccgccaagtcctttgagttttaagctattgccggtagtactctggcgaataattttgttatattaat
+
CCCCCCCCCBCCCCCCCCBCCCCCCCCCCCA=AAA@CCCCCCCCCCC?CACCC?CC@C@CACC?CA=B?0A;AAA6;>3?AC?C?8AAA3<<-8<BAC@22<6?####

The R2 output (out_basic_4seq_R2.fastq) contains the corresponding reverse reads carrying identical annotations.

📄 out_basic_4seq_R2.fastq

@HELIUM_000100422_612GNAAXX:7:119:14871:19157#0/2 {"experiment":"wolf_diet","obimultiplex_direction":"forward","obimultiplex_forward_match":"ttagataccccactatgc","obimultiplex_forward_mismatches":0,"obimultiplex_forward_tag":"gcctcct","obimultiplex_reverse_match":"tagaacaggctcctctag","obimultiplex_reverse_mismatches":0,"obimultiplex_reverse_tag":"gcctcct","sample":"29a_F260619"}
ccgcctccttagaacaggctcctctagaagggtataaagcaccgccaagtcctttgagttttaacctactcccgctacacgtccgccgaataatactgttatcatatt
+
CCCCCCCCCCCAAC@CCBCCCCCCB@C@CCCC@@CBBB6@@CC@AC8CC<C>C@@#####################################################
@HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/2 {"experiment":"wolf_diet","obimultiplex_direction":"forward","obimultiplex_forward_match":"ttagataccccactatgc","obimultiplex_forward_mismatches":0,"obimultiplex_forward_tag":"gcctcct","obimultiplex_reverse_match":"tagaacaggctcctctag","obimultiplex_reverse_mismatches":0,"obimultiplex_reverse_tag":"gcctcct","sample":"29a_F260619"}
ccgcctccttagaacaggctcctctagaagggtataaagcaccgccaagtcctttgagttttaagctattgccggtagtactctggcgaacacttttgttatattact
+
CCCCCCCCCCCCCCCCACCCCCACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCBCCCCCCCCCC=CCCCCCCCCC:=><ACCCCBCCA8;68.69AA?>(>AC@CA3A
@HELIUM_000100422_612GNAAXX:7:97:14311:19299#0/2 {"experiment":"wolf_diet","obimultiplex_direction":"forward","obimultiplex_forward_match":"ttagataccccactatgc","obimultiplex_forward_mismatches":0,"obimultiplex_forward_tag":"gcctcct","obimultiplex_reverse_match":"tagaacaggctcctctag","obimultiplex_reverse_mismatches":0,"obimultiplex_reverse_tag":"gcctcct","sample":"29a_F260619"}
ccgcctccttagaacaggctcctctagaagggtataaagcaccgccaagtcctttgagttttaagctcttgccggtagtactctggcgcacacttttcttatattact
+
CC@CCCCCBC?CBCCCCCCCCC=CC<CCCCCC9@C?;?<+BB@??85<?>?<<6<:<?43???<2?3;??CA@C552(8<5<>:).(//1//,1'6:375=CCCC@?6
@HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/2 {"experiment":"wolf_diet","obimultiplex_direction":"reverse","obimultiplex_forward_match":"ttagataccccactatgc","obimultiplex_forward_mismatches":0,"obimultiplex_forward_tag":"gcctcct","obimultiplex_reverse_match":"tagaacaggctcctctag","obimultiplex_reverse_mismatches":0,"obimultiplex_reverse_tag":"gcctcct","sample":"29a_F260619"}
ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaatagcttcacactcaaagaactt
+
CCCDCCCCCCCCCDCCCCCCCCCDCCC@CCACCCCCCCCCCCCDCCCCCCCCCDCCCCCBBBBCC=/AAA===>=C<CCC?B9AA;3??7CC@C6CCC8ACCC+AB8A
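Downstream paired-end tools generally expect the two files to list the same read pairs in the same order. A quick way to confirm this is to compare read identifiers record by record. The sketch below assumes identifiers that differ only by a trailing /1 or /2, as in this dataset. `check_pairs` is a hypothetical helper built on `paste` and `awk`, not an OBITools4 command.

```shell
# Hypothetical helper: verify that two FASTQ files list the same read pairs
# in the same order. Assumes identifiers differ only by a trailing /1 or /2.
check_pairs() {
  paste "$1" "$2" | awk -F'\t' '
    NR % 4 == 1 {
      split($1, a, " "); split($2, b, " ")     # keep the identifier only
      sub(/\/[12]$/, "", a[1]); sub(/\/[12]$/, "", b[1])
      if (a[1] != b[1]) bad++
    }
    END {
      print (bad ? bad " mismatched pairs" : "pairs in sync")
      exit (bad > 0)                           # non-zero status on mismatch
    }
  '
}
```

For example, `check_pairs out_basic_R1.fastq out_basic_R2.fastq` prints "pairs in sync" and returns status 0 when every record matches. Because the /1 and /2 suffixes are stripped before comparison, the check also works on files where a pair has been exchanged.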

Reorientate reads for consistent strand direction #

As explained above, Illumina sequencing is not an orientated process, so reads arrive in mixed orientations (obimultiplex_direction: "forward" or "reverse"). The --reorientate flag reverse-complements reads matched in the reverse direction so that all output sequences run from forward primer to reverse primer:

obitagpcr \
  --forward-reads wolf_F_4seq.fastq \
  --reverse-reads wolf_R_4seq.fastq \
  --tag-list wolf_diet_ngsfilter.csv \
  --reorientate \
  --out out_reorientate.fastq

Preview the first two sequences of the result:

head -n 8 out_reorientate_R1.fastq

@HELIUM_000100422_612GNAAXX:7:119:14871:19157#0/1 {"experiment":"wolf_diet","obimultiplex_direction":"forward","obimultiplex_forward_error":0,"obimultiplex_forward_match":"ttagataccccactatgc","obimultiplex_forward_tag":"gcctcct","obimultiplex_reverse_error":0,"obimultiplex_reverse_match":"tagaacaggctcctctag","obimultiplex_reverse_tag":"gcctcct","sample":"29a_F260619"}
ccgcctcctttagataccccactatgcttagccctaaacacaagtaattattataacaaaatcattcgccagagtactaccggcaatagctcaaaactcaaagaactt
+
CCCCCCCCCCCCCCCCCCCCCCBCCCCB@BCCCCCCCCCCCCCB;CCCACCCCCCCAACA29,?<5899+A=A###################################
@HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1 {"experiment":"wolf_diet","obimultiplex_direction":"forward","obimultiplex_forward_error":0,"obimultiplex_forward_match":"ttagataccccactatgc","obimultiplex_forward_tag":"gcctcct","obimultiplex_reverse_error":0,"obimultiplex_reverse_match":"tagaacaggctcctctag","obimultiplex_reverse_tag":"gcctcct","sample":"29a_F260619"}
ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggactt
+
CCCCCCCBCCCCCCCCCCCCCCCCCCCCCCBCCCCCBCCCCCCC<CCCCACC;C?CCCC@A;=,B;93:;CC=C;==??#############################

The fourth read pair, which was matched in the reverse direction, has been exchanged and reverse-complemented. As a result, the read that was originally in the “R2” file is now in the “R1” file, and vice versa:

head -n 16 out_reorientate_R1.fastq | tail -n 4

@HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/2 {"experiment":"wolf_diet","obimultiplex_direction":"reverse","obimultiplex_forward_error":0,"obimultiplex_forward_match":"ttagataccccactatgc","obimultiplex_forward_tag":"gcctcct","obimultiplex_reverse_error":0,"obimultiplex_reverse_match":"tagaacaggctcctctag","obimultiplex_reverse_tag":"gcctcct","sample":"29a_F260619"}
ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaatagcttcacactcaaagaactt
+
CCCDCCCCCCCCCDCCCCCCCCCDCCC@CCACCCCCCCCCCCCDCCCCCCCCCDCCCCCBBBBCC=/AAA===>=C<CCC?B9AA;3??7CC@C6CCC8ACCC+AB8A
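One way to check the effect of --reorientate is to count how many R1 reads carry the forward primer near their 5' end. The sketch below is a rough check only: `primer_at_start` is a hypothetical helper, the 40-base window is an arbitrary assumption, and the search is an exact, case-insensitive substring match that ignores the mismatches and indels obitagpcr itself can tolerate.

```shell
# Hypothetical helper: count reads whose first 40 bases contain the primer.
# Exact, case-insensitive search; mismatches and indels are not handled.
primer_at_start() {
  awk -v primer="$2" '
    BEGIN { primer = tolower(primer) }
    NR % 4 == 2 {                              # sequence line of each record
      total++
      if (index(tolower(substr($0, 1, 40)), primer)) hit++
    }
    END { printf "%d/%d\n", hit, total }
  ' "$1"
}
```

A call such as `primer_at_start out_reorientate_R1.fastq TTAGATACCCCACTATGC` prints the number of matching reads over the total number of reads.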

Capture unassigned reads for quality control #

Reads that fail barcode matching are silently discarded by default. Use --unidentified to redirect them to a separate file for inspection. Unassigned reads are annotated with an obimultiplex_error attribute indicating the rejection reason.

In this dataset all reads are successfully assigned, so the unassigned file is empty:

obitagpcr \
  --forward-reads wolf_F_4seq.fastq \
  --reverse-reads wolf_R_4seq.fastq \
  --tag-list wolf_diet_ngsfilter.csv \
  --reorientate \
  --unidentified out_unassigned.fastq \
  --out out_identified.fastq

# Count unassigned reads (0 in this dataset)
obicount out_unassigned_R1.fastq

entities,n
variants,0
reads,0
symbols,0

For more on diagnosing rejection causes see the obimultiplex page.
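When the unassigned file is not empty, tallying the obimultiplex_error values gives a first overview of why reads were rejected. The sketch below uses only `awk`, `grep`, `sed`, `sort` and `uniq`. `tally_errors` is a hypothetical helper, and the possible reason strings are not listed on this page, so none are assumed here.

```shell
# Hypothetical helper: count each obimultiplex_error value found in the
# title lines of a FASTQ file. Assumes 4-line records and JSON annotations.
tally_errors() {
  awk 'NR % 4 == 1' "$1" |                              # title lines only
    grep -o '"obimultiplex_error":"[^"]*"' |            # isolate the attribute
    sed 's/^"obimultiplex_error":"//; s/"$//' |         # keep the value
    sort | uniq -c                                      # count per reason
}
```

Running `tally_errors out_unassigned_R1.fastq` on the example above prints nothing, since no read was rejected.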

Allow indels in primer and tag matching #

By default only substitutions are accepted as differences. For sequencers that produce indel errors (e.g. Oxford Nanopore), add --with-indels to enable full edit-distance matching of primers. This can also be activated per-primer via @param,indels,true in the NGSFilter file.

obitagpcr \
  --forward-reads wolf_F_4seq.fastq \
  --reverse-reads wolf_R_4seq.fastq \
  --tag-list wolf_diet_ngsfilter.csv \
  --allowed-mismatches 3 \
  --with-indels \
  --reorientate \
  --out out_indels.fastq

The output format is identical to the basic example above. Download a full-size result (20 reads): out_indels_R1.fastq

Split output by sample with `obidistribute` #

Because obitagpcr always writes paired files (_R1 and _R2), it cannot be piped directly into obidistribute . The two steps must be run sequentially: first demultiplex with --out, then distribute each file separately.

obitagpcr \
  --forward-reads wolf_F.fastq \
  --reverse-reads wolf_R.fastq \
  --tag-list wolf_diet_ngsfilter.csv \
  --reorientate \
  --out demux.fastq

obidistribute \
  --classifier sample \
  --pattern "sample_%s_R1.fastq" \
  demux_R1.fastq

obidistribute \
  --classifier sample \
  --pattern "sample_%s_R2.fastq" \
  demux_R2.fastq

This produces one R1/R2 pair per sample:

sample_13a_F730603_R1.fastq / sample_13a_F730603_R2.fastq
sample_15a_F730814_R1.fastq / sample_15a_F730814_R2.fastq
sample_26a_F040644_R1.fastq / sample_26a_F040644_R2.fastq
sample_29a_F260619_R1.fastq / sample_29a_F260619_R2.fastq
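Before handing these files to an external tool, it can be worth confirming that every sample has both files and that they hold the same number of reads. The loop below is a minimal sketch that relies on the file naming pattern used in the commands above and on four-line FASTQ records. `check_sample_pairs` is a hypothetical helper, not an OBITools4 command.

```shell
# Hypothetical helper: list read counts of each sample_<name>_R1/_R2 pair
# found in a directory (default: current directory).
check_sample_pairs() {
  dir=${1:-.}
  for r1 in "$dir"/sample_*_R1.fastq; do
    [ -e "$r1" ] || continue                   # no match: glob left unexpanded
    r2=${r1%_R1.fastq}_R2.fastq
    n1=$(( $(wc -l < "$r1") / 4 ))             # 4 lines per FASTQ record
    if [ -e "$r2" ]; then
      n2=$(( $(wc -l < "$r2") / 4 ))
    else
      n2=missing
    fi
    name=${r1##*/}; name=${name#sample_}; name=${name%_R1.fastq}
    printf '%s\t%s\t%s\n' "$name" "$n1" "$n2"
  done
}
```

Each output line gives the sample name, the R1 read count and the R2 read count (or "missing"), separated by tabs. Any line where the two counts differ points to a problem in the distribution step.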
