obimultiplex

obimultiplex: demultiplex the sequence reads #

Description #

The obimultiplex command demultiplexes sequencing reads by identifying sample-specific tags (barcodes) and PCR primers in the sequences. It assigns each sequence to its corresponding sample based on the tag combinations and primer sequences provided in a sample description file.

The demultiplexing process involves:

  • Identifying forward and reverse PCR primers in the sequences.
  • Detecting sample-specific tags.
  • Assigning sequences to samples based on the tag/primer combinations.
  • Trimming primers and tags from the sequences.
  • Reverse complementing the sequences if needed.
  • Adding comprehensive annotations about the identification process.

The new obimultiplex sample description file format #

Although obimultiplex can still read the old ngsfilter format used by the legacy obitools, the new format is now preferred.

The new format is a CSV file, which can easily be prepared using an export from your favourite spreadsheet program.

# primer matching options
@param,primer_mismatches,2
@param,indels,false
# tag matching options
@param,matching,strict
experiment,sample,sample_tag,forward_primer,reverse_primer
wolf_diet,13a_F730603,aattaac,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG
wolf_diet,15a_F730814,gaagtag,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG

The CSV file is divided into two sections. The first section consists of lines beginning with @param in the first cell. These lines specify the parameters used to match the primers and tags to the sequence. The second section provides a description of all the samples (PCRs) included in the sequencing library. This section begins with a line containing the names of the columns used to describe the samples in the subsequent lines. Only the second section is required.
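
As an illustration of this two-section layout, the following Python sketch separates the @param lines from the sample lines. It is not how obimultiplex itself reads the file: the function name read_sample_description is hypothetical, and treating lines that start with # as comments is an assumption drawn from the example above.

```python
import csv
import io


def read_sample_description(text):
    """Split a sample description into (params, samples).

    params  : dict mapping each @param name to its raw string value
    samples : list of dicts, one per sample (PCR) line
    """
    params = {}
    header = None
    samples = []
    for row in csv.reader(io.StringIO(text)):
        # Assumption: blank lines and lines starting with '#' are ignored.
        if not row or row[0].lstrip().startswith("#"):
            continue
        if row[0] == "@param":
            params[row[1]] = row[2]
        elif header is None:
            header = row                  # first sample-section line: column names
        else:
            samples.append(dict(zip(header, row)))
    return params, samples


EXAMPLE = """\
# primer matching options
@param,primer_mismatches,2
@param,indels,false
# tag matching options
@param,matching,strict
experiment,sample,sample_tag,forward_primer,reverse_primer
wolf_diet,13a_F730603,aattaac,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG
wolf_diet,15a_F730814,gaagtag,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG
"""

params, samples = read_sample_description(EXAMPLE)
print(params)
print([s["sample"] for s in samples])
```

Parameter values are kept as strings here; converting them (for instance "2" to an integer) is left to the caller.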

Basic format and required columns #

Below is an example for the minimal description of the PCRs multiplexed in the sequencing library. In the new version of OBITools4 this file is a CSV file.

The first line is mandatory and must contain at least the five column names presented below:

  • experiment: the name of the experiment that allows for grouping of samples;

  • sample: the sample (PCR) name;

  • sample_tag: the tag identifying the sample:

    Each sample tag must be unique within the library for each pair of primers. Tags can be written in upper or lower case; no distinction is made between the two.

    • It can be a single DNA word, as here (e.g. aattaac). This means that the same tag is used for both the forward and reverse primers.

    • It can be two DNA words separated by a colon, for example aagtag:gaagtag. The first tag is used for the forward primer and the second for the reverse primer.

      The single-word example above, aattaac, is equivalent to aattaac:aattaac.
    • In the two-word syntax, if the forward or reverse primer is not tagged, its tag is replaced by a hyphen, for example aagtag:- or -:aagtag. Consequently, an experiment conducted without primer tags must declare a dummy tag: -:-.

    For a given primer, all the tags must have the same length.
  • forward_primer: the forward primer sequence

  • reverse_primer: the reverse primer sequence
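
The tag syntax described above can be sketched in Python as follows. The helper names split_sample_tag and same_length_per_primer are hypothetical, and the sketch only mirrors the rules stated in this section; it is not the obimultiplex implementation.

```python
def split_sample_tag(sample_tag):
    """Return (forward_tag, reverse_tag); None marks an untagged primer."""
    tag = sample_tag.strip().lower()      # upper and lower case are equivalent
    if ":" in tag:
        forward, reverse = tag.split(":", 1)
    else:
        forward = reverse = tag           # one word: same tag on both primers
    return (None if forward == "-" else forward,
            None if reverse == "-" else reverse)


def same_length_per_primer(sample_tags):
    """Check that, on each side, all declared tags share one length.

    sample_tags is assumed to hold the sample_tag values of the lines
    that use the same pair of primers.
    """
    pairs = [split_sample_tag(t) for t in sample_tags]
    forward_lengths = {len(f) for f, _ in pairs if f is not None}
    reverse_lengths = {len(r) for _, r in pairs if r is not None}
    return len(forward_lengths) <= 1 and len(reverse_lengths) <= 1


print(split_sample_tag("aattaac"))          # same tag on both primers
print(split_sample_tag("aagtag:gaagtag"))   # distinct forward and reverse tags
print(split_sample_tag("-:-"))              # experiment without tags
print(same_length_per_primer(["aattaac", "gaagtag", "gaatatc", "gcctcct"]))
```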

📄 samples_simple.csv
experiment,sample,sample_tag,forward_primer,reverse_primer
wolf_diet,13a_F730603,aattaac,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG
wolf_diet,15a_F730814,gaagtag,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG
wolf_diet,26a_F040644,gaatatc,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG
wolf_diet,29a_F260619,gcctcct,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG
📄 wolf_4seq.fastq
@HELIUM_000100422_612GNAAXX:7:6:9274:14951#0/1 
ccaattaactagaacaggctcgtctagaagggtataaagcaccgccaagtcctttgagttttaagctattgccggtagtactctggcgaatagttttgtttgcataactatttgtgtttaaggctaggcatagtggggtatctaagttaattgg
+
CCCCCCCCCDCCCCCCCCCCCCCCCCCCCCCC=CBCCBCBCCCCCCDEFAEDEEEEBEAEJEJ?D?CD@^aVca\C????CEBC>I?D<>EEDDDEEEEEEEAFEEDECCCCCCCCCCCCCCCBCCCCCCBCCCCCCCCCDCCCCCCCCCCCBC
@HELIUM_000100422_612GNAAXX:7:57:18459:16145#0/1 
ccgaatatcttagataccccactatgcttagccctaaacataaacattcaataaacaagaatgttcgccagagtactactagcaacagcctgaaactcaaaggacttggcggtgctttacatccctctagaggagcctgttctagatattcgg
+
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCacZXceafbd_e_bVb`cb[WZb]aaaaV`ECDDCEDCDKECFFEEEEEDEDEEJEEE@EEJECCCCCBCCCCCCCCCCCCCCCCCCCCDCCCCCCCCCCCCCCCBCC
@HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1 
ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg
+
CCCCCCCBCCCCCCCCCCCCCCCCCCCCCCBCCCCCBCCCCCCC<CcCccbe[`F`accXV<TA\RYU\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCCCCCCACCCCCACCCCCCCCCCCCCCCC
@HELIUM_000100422_612GNAAXX:7:108:6440:4223#0/1 
ccgcctcctttagatcccactatgcttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg
+
CCCCCCCBCCCCCCCCCCCCCCCCCCCCBCCCCCBCCCCCCC<CcCccbe[`F`accXV<TA\RYU\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCCCCCCACCCCCACCCCCCCCCCCCCCCC
obimultiplex -s samples_simple.csv \
             wolf_4seq.fastq \
             > wolf_4seq_simple.fastq
📄 wolf_4seq_simple.fastq
@HELIUM_000100422_612GNAAXX:7:6:9274:14951#0/1_sub[28..127] {"experiment":"wolf_diet","obimultiplex_amplicon_rank":"1/1","obimultiplex_direction":"reverse","obimultiplex_forward_error":0,"obimultiplex_forward_match":"ttagataccccactatgc","obimultiplex_forward_matching":"strict","obimultiplex_forward_primer":"ttagataccccactatgc","obimultiplex_forward_proposed_tag":"aattaac","obimultiplex_forward_tag":"aattaac","obimultiplex_forward_tag_dist":0,"obimultiplex_reverse_error":1,"obimultiplex_reverse_match":"tagaacaggctcgtctag","obimultiplex_reverse_matching":"strict","obimultiplex_reverse_primer":"tagaacaggctcctctag","obimultiplex_reverse_proposed_tag":"aattaac","obimultiplex_reverse_tag":"aattaac","obimultiplex_reverse_tag_dist":0,"sample":"13a_F730603"}
ctagccttaaacacaaatagttatgcaaacaaaactattcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttataccctt
+
CCCBCCCCCCCCCCCCCCCEDEEFAEEEEEEEDDDEE><D?I>CBEC????C\acVa^@DC?D?JEJEAEBEEEEDEAFEDCCCCCCBCBCCBC=CCCCC
@HELIUM_000100422_612GNAAXX:7:57:18459:16145#0/1_sub[28..126] {"experiment":"wolf_diet","obimultiplex_amplicon_rank":"1/1","obimultiplex_direction":"forward","obimultiplex_forward_error":0,"obimultiplex_forward_match":"ttagataccccactatgc","obimultiplex_forward_matching":"strict","obimultiplex_forward_primer":"ttagataccccactatgc","obimultiplex_forward_proposed_tag":"gaatatc","obimultiplex_forward_tag":"gaatatc","obimultiplex_forward_tag_dist":0,"obimultiplex_reverse_error":0,"obimultiplex_reverse_match":"tagaacaggctcctctag","obimultiplex_reverse_matching":"strict","obimultiplex_reverse_primer":"tagaacaggctcctctag","obimultiplex_reverse_proposed_tag":"gaatatc","obimultiplex_reverse_tag":"gaatatc","obimultiplex_reverse_tag_dist":0,"sample":"26a_F040644"}
ttagccctaaacataaacattcaataaacaagaatgttcgccagagtactactagcaacagcctgaaactcaaaggacttggcggtgctttacatccct
+
CCCCCCCCCCCCCCCCCCacZXceafbd_e_bVb`cb[WZb]aaaaV`ECDDCEDCDKECFFEEEEEDEDEEJEEE@EEJECCCCCBCCCCCCCCCCCC
@HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1_sub[28..127] {"experiment":"wolf_diet","obimultiplex_amplicon_rank":"1/1","obimultiplex_direction":"forward","obimultiplex_forward_error":0,"obimultiplex_forward_match":"ttagataccccactatgc","obimultiplex_forward_matching":"strict","obimultiplex_forward_primer":"ttagataccccactatgc","obimultiplex_forward_proposed_tag":"gcctcct","obimultiplex_forward_tag":"gcctcct","obimultiplex_forward_tag_dist":0,"obimultiplex_reverse_error":0,"obimultiplex_reverse_match":"tagaacaggctcctctag","obimultiplex_reverse_matching":"strict","obimultiplex_reverse_primer":"tagaacaggctcctctag","obimultiplex_reverse_proposed_tag":"gcctcct","obimultiplex_reverse_tag":"gcctcct","obimultiplex_reverse_tag_dist":0,"sample":"29a_F260619"}
ttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttataccctt
+
CCCBCCCCCBCCCCCCC<CcCccbe[`F`accXV<TA\RYU\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCC
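
Each title line of the output above carries the sequence identifier followed by a JSON map of annotations. A minimal Python sketch for reading it is given here. The names parse_title_line and amplicon_rank are hypothetical helpers, and the title line used is the first record above with only a few of its annotations kept for brevity.

```python
import json


def parse_title_line(line):
    """Split a FASTQ title line into (identifier, annotations)."""
    identifier, _, rest = line.rstrip("\n").lstrip("@").partition(" ")
    annotations = json.loads(rest) if rest.strip() else {}
    return identifier, annotations


def amplicon_rank(annotations):
    """Turn a rank such as "2/3" into the pair (2, 3)."""
    rank, total = annotations["obimultiplex_amplicon_rank"].split("/")
    return int(rank), int(total)


# First record of wolf_4seq_simple.fastq, annotations abbreviated.
title = (
    '@HELIUM_000100422_612GNAAXX:7:6:9274:14951#0/1_sub[28..127] '
    '{"experiment":"wolf_diet","obimultiplex_amplicon_rank":"1/1",'
    '"obimultiplex_direction":"reverse","obimultiplex_reverse_error":1,'
    '"sample":"13a_F730603"}'
)

identifier, annotations = parse_title_line(title)
print(identifier)
print(annotations["sample"], annotations["obimultiplex_direction"])
print(amplicon_rank(annotations))
```

A title line without annotations, as in the input file, yields an empty dictionary.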
  • Sample description

    • experiment: “wolf_diet”

      The experiment name assigned to the barcode sequence.

    • sample: “13a_F730603”

      The sample (PCR) name assigned to the barcode sequence.

  • Amplicon description

    • obimultiplex_amplicon_rank: “1/1”

      obimultiplex is able to detect concatemers of several amplicons. The position of the amplicon within the read is reported in `obimultiplex_amplicon_rank` as a ratio: here "1/1" means the first amplicon among one in the read. A value of "2/3" would mean the second amplicon detected among three in the read.
    • obimultiplex_direction: “reverse”

      The direction in which the amplicon has been detected:

      • “forward” means that the forward primer was identified first, followed by the reverse complement of the reverse primer.

      • “reverse” means that the reverse primer was identified first, followed by the reverse complement of the forward primer. In that case the barcode sequence is reverse complemented, so that it is always reported oriented from the forward to the reverse primer.

  • Primer matching

    • Forward primer:

      • obimultiplex_forward_primer: “ttagataccccactatgc”

        The true forward primer sequence as provided in the obimultiplex sample description file.

      • obimultiplex_forward_match: “ttagataccccactatgc”

        The primer sequence as detected in the sequence read.

      • obimultiplex_forward_error: 0

        The number of differences between the obimultiplex_forward_primer and obimultiplex_forward_match values. By default, obimultiplex allows up to two mismatches. This threshold can be changed with the --allowed-mismatches option (short form: -e).

    • Reverse primer:

      • obimultiplex_reverse_primer: “tagaacaggctcctctag”

        The true reverse primer sequence as provided in the obimultiplex sample description file.

      • obimultiplex_reverse_match: “tagaacaggctcgtctag”

        The primer sequence as detected in the sequence read.

      • obimultiplex_reverse_error: 1

        Here one mismatch has been detected between the primer sequence and the read sequence match.

  • Tag identification

    • Forward tag:
      • obimultiplex_forward_tag: “aattaac”
      • obimultiplex_forward_proposed_tag: “aattaac”
      • obimultiplex_forward_matching: “strict”
      • obimultiplex_forward_tag_dist: 0
    • Reverse tag:
      • obimultiplex_reverse_tag: “aattaac”
      • obimultiplex_reverse_proposed_tag: “aattaac”
      • obimultiplex_reverse_matching: “strict”
      • obimultiplex_reverse_tag_dist: 0
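
The primer annotations of the first record can be checked by hand. The Python sketch below counts the substitutions between the declared reverse primer and its match, and shows why a read detected in the reverse direction ends with the reverse complement of the forward primer followed by the reverse complement of its tag. The function names are hypothetical, only A, C, G and T are handled (no IUPAC codes), and the sketch does not reproduce the alignment that obimultiplex performs.

```python
_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(sequence):
    """Reverse complement of a DNA word made of A, C, G and T only."""
    return sequence.translate(_COMPLEMENT)[::-1]


def count_mismatches(primer, match):
    """Number of substituted positions between two words of equal length."""
    if len(primer) != len(match):
        raise ValueError("this sketch only compares sequences of equal length")
    return sum(a != b for a, b in zip(primer.lower(), match.lower()))


reverse_primer = "tagaacaggctcctctag"   # obimultiplex_reverse_primer
reverse_match = "tagaacaggctcgtctag"    # obimultiplex_reverse_match
print(count_mismatches(reverse_primer, reverse_match))   # prints 1

forward_primer = "ttagataccccactatgc"
tag = "aattaac"
# Last 27 bases of the first read of wolf_4seq.fastq.
read_end = "gcatagtggggtatctaagttaattgg"
expected = reverse_complement(forward_primer) + reverse_complement(tag)
print(expected)
print(read_end.startswith(expected))
```

The single substitution found here agrees with the obimultiplex_reverse_error value of 1 reported for that record.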

Supplementary columns #

📄 samples_extra.csv
experiment,sample,sample_tag,forward_primer,reverse_primer,sex,age,plate,position
wolf_diet,13a_F730603,aattaac,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG,male,adult,02,A03
wolf_diet,15a_F730814,gaagtag,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG,male,juvenile,02,A01
wolf_diet,26a_F040644,gaatatc,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG,female,adult,01,B08
wolf_diet,29a_F260619,gcctcct,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG,female,adult,01,B12
obimultiplex -s samples_extra.csv \
             wolf_4seq.fastq \
             > wolf_4seq_extra.fastq
📄 wolf_4seq_extra.fastq
@HELIUM_000100422_612GNAAXX:7:6:9274:14951#0/1_sub[28..127] {"age":"adult","experiment":"wolf_diet","obimultiplex_amplicon_rank":"1/1","obimultiplex_direction":"reverse","obimultiplex_forward_error":0,"obimultiplex_forward_match":"ttagataccccactatgc","obimultiplex_forward_matching":"strict","obimultiplex_forward_primer":"ttagataccccactatgc","obimultiplex_forward_proposed_tag":"aattaac","obimultiplex_forward_tag":"aattaac","obimultiplex_forward_tag_dist":0,"obimultiplex_reverse_error":1,"obimultiplex_reverse_match":"tagaacaggctcgtctag","obimultiplex_reverse_matching":"strict","obimultiplex_reverse_primer":"tagaacaggctcctctag","obimultiplex_reverse_proposed_tag":"aattaac","obimultiplex_reverse_tag":"aattaac","obimultiplex_reverse_tag_dist":0,"plate":"02","position":"A03","sample":"13a_F730603","sex":"male"}
ctagccttaaacacaaatagttatgcaaacaaaactattcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttataccctt
+
CCCBCCCCCCCCCCCCCCCEDEEFAEEEEEEEDDDEE><D?I>CBEC????C\acVa^@DC?D?JEJEAEBEEEEDEAFEDCCCCCCBCBCCBC=CCCCC
@HELIUM_000100422_612GNAAXX:7:57:18459:16145#0/1_sub[28..126] {"age":"adult","experiment":"wolf_diet","obimultiplex_amplicon_rank":"1/1","obimultiplex_direction":"forward","obimultiplex_forward_error":0,"obimultiplex_forward_match":"ttagataccccactatgc","obimultiplex_forward_matching":"strict","obimultiplex_forward_primer":"ttagataccccactatgc","obimultiplex_forward_proposed_tag":"gaatatc","obimultiplex_forward_tag":"gaatatc","obimultiplex_forward_tag_dist":0,"obimultiplex_reverse_error":0,"obimultiplex_reverse_match":"tagaacaggctcctctag","obimultiplex_reverse_matching":"strict","obimultiplex_reverse_primer":"tagaacaggctcctctag","obimultiplex_reverse_proposed_tag":"gaatatc","obimultiplex_reverse_tag":"gaatatc","obimultiplex_reverse_tag_dist":0,"plate":"01","position":"B08","sample":"26a_F040644","sex":"female"}
ttagccctaaacataaacattcaataaacaagaatgttcgccagagtactactagcaacagcctgaaactcaaaggacttggcggtgctttacatccct
+
CCCCCCCCCCCCCCCCCCacZXceafbd_e_bVb`cb[WZb]aaaaV`ECDDCEDCDKECFFEEEEEDEDEEJEEE@EEJECCCCCBCCCCCCCCCCCC
@HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1_sub[28..127] {"age":"adult","experiment":"wolf_diet","obimultiplex_amplicon_rank":"1/1","obimultiplex_direction":"forward","obimultiplex_forward_error":0,"obimultiplex_forward_match":"ttagataccccactatgc","obimultiplex_forward_matching":"strict","obimultiplex_forward_primer":"ttagataccccactatgc","obimultiplex_forward_proposed_tag":"gcctcct","obimultiplex_forward_tag":"gcctcct","obimultiplex_forward_tag_dist":0,"obimultiplex_reverse_error":0,"obimultiplex_reverse_match":"tagaacaggctcctctag","obimultiplex_reverse_matching":"strict","obimultiplex_reverse_primer":"tagaacaggctcctctag","obimultiplex_reverse_proposed_tag":"gcctcct","obimultiplex_reverse_tag":"gcctcct","obimultiplex_reverse_tag_dist":0,"plate":"01","position":"B12","sample":"29a_F260619","sex":"female"}
ttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttataccctt
+
CCCBCCCCCBCCCCCCC<CcCccbe[`F`accXV<TA\RYU\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCC
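
In the output above, the values of the extra columns (sex, age, plate, position) appear as annotations of each assigned sequence. The Python sketch below lists which columns of a description file are supplementary, that is, not among the five required ones. The function name supplementary_columns is hypothetical, and skipping lines that start with # or @param is an assumption based on the file examples in this page.

```python
import csv
import io

REQUIRED = ("experiment", "sample", "sample_tag", "forward_primer", "reverse_primer")


def supplementary_columns(csv_text):
    """Return the column names that are not among the five required ones."""
    for row in csv.reader(io.StringIO(csv_text)):
        if not row or row[0].startswith(("#", "@param")):
            continue                      # skip comments and parameter lines
        missing = [name for name in REQUIRED if name not in row]
        if missing:
            raise ValueError(f"missing required columns: {missing}")
        return [name for name in row if name not in REQUIRED]
    raise ValueError("no column header found")


SAMPLES_EXTRA = """\
experiment,sample,sample_tag,forward_primer,reverse_primer,sex,age,plate,position
wolf_diet,13a_F730603,aattaac,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG,male,adult,02,A03
wolf_diet,15a_F730814,gaagtag,TTAGATACCCCACTATGC,TAGAACAGGCTCCTCTAG,male,juvenile,02,A01
"""

print(supplementary_columns(SAMPLES_EXTRA))
```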
obimultiplex -s samples_simple.csv \
             -u wolf_4seq_bad.fastq \
             wolf_4seq.fastq \
             > wolf_4seq_simple.fastq
📄 wolf_4seq_bad.fastq
@HELIUM_000100422_612GNAAXX:7:108:6440:4223#0/1 {"obimultiplex_error":"No barcode identified"}
ccgcctcctttagatcccactatgcttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg
+
CCCCCCCBCCCCCCCCCCCCCCCCCCCCBCCCCCBCCCCCCC<CcCccbe[`F`accXV<TA\RYU\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCCCCCCACCCCCACCCCCCCCCCCCCCCC
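
The command above uses the -u option to store the reads that could not be assigned to a sample; each of them carries an obimultiplex_error annotation. On a real data set it can be useful to count how often each error message occurs. A possible Python sketch follows: error_counts is a hypothetical helper, and it assumes the usual four-line FASTQ layout with JSON annotations on the title line.

```python
import json
from collections import Counter


def error_counts(fastq_lines):
    """Count obimultiplex_error values over the title lines of a FASTQ stream."""
    counts = Counter()
    for index, line in enumerate(fastq_lines):
        if index % 4 != 0:
            continue                      # keep title lines only
        _, _, annotations = line.rstrip("\n").partition(" ")
        fields = json.loads(annotations) if annotations.strip() else {}
        counts[fields.get("obimultiplex_error", "no error annotation")] += 1
    return counts


# The single record of wolf_4seq_bad.fastq shown above.
record = [
    '@HELIUM_000100422_612GNAAXX:7:108:6440:4223#0/1 {"obimultiplex_error":"No barcode identified"}',
    "ccgcctcctttagatcccactatgcttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg",
    "+",
    r"CCCCCCCBCCCCCCCCCCCCCCCCCCCCBCCCCCBCCCCCCC<CcCccbe[`F`accXV<TA\RYU\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCCCCCCACCCCCACCCCCACCCCCCCCCCCCCCCC",
]
print(error_counts(record))

# With a file on disk:
# with open("wolf_4seq_bad.fastq") as handle:
#     print(error_counts(handle))
```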

Synopsis #

obimultiplex [--allowed-mismatches|-e <int>] [--batch-size <int>]
             [--compress|-Z] [--debug] [--ecopcr] [--embl] [--fasta]
             [--fasta-output] [--fastq] [--fastq-output] [--force-one-cpu]
             [--genbank] [--help|-h|-?] [--input-OBI-header]
             [--input-json-header] [--json-output] [--keep-errors]
             [--max-cpu <int>] [--no-order] [--no-progressbar]
             [--out|-o <FILENAME>] [--output-OBI-header|-O]
             [--output-json-header] [--paired-with <FILENAME>] [--pprof]
             [--pprof-goroutine <int>] [--pprof-mutex <int>] [--skip-empty]
             [--solexa] [--tag-list|-s <string>] [--taxonomy|-t <string>]
             [--template] [--unidentified|-u <string>] [--version]
             [--with-indels] [<args>]

Options #

obimultiplex specific options #

  • --allowed-mismatches | -e <INTEGER>: Used to specify the number of errors allowed for matching primers. (default: -1)
  • --keep-errors: Keeps the sequences that could not be assigned to a sample in the main output instead of discarding them. (default: false)
  • --paired-with <FILENAME>: filename containing the paired reads.
  • --tag-list | -s <string>: File name of the NGSFilter file describing PCRs.
  • --template: Print on the standard output an example of CSV configuration file. (default: false)
  • --unidentified | -u <string>: Filename used to store the sequences unassigned to any sample.
  • --with-indels: Allows for indels during the primers matching. (default: false)

Taxonomic options #

  • --taxonomy | -t <string>: Path to the taxonomic database.

Controlling the input data #

OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step.

The file format options #

  • --fasta: indicates that sequence data is in fasta format.
  • --fastq: indicates that sequence data is in fastq format.
  • --embl: indicates that sequence data is in EMBL-ENA flatfile format.
  • --csv: indicates that sequence data is in CSV format.
  • --genbank: indicates that sequence data is in GenBank flatfile format.
  • --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.

Controlling the way OBITools4 formats annotations #

These options only apply to the FASTA and FASTQ formats.

  • --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
  • --input-json-header: FASTA/FASTQ title line annotations follow the JSON format.

Controlling quality score decoding #

This option only applies to the FASTQ format.

  • --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)

Controlling the output data #

  • --compress | -Z : output is compressed using gzip. (default: false)
  • --no-order: by default, OBITools keeps the sequences in the same order in the output file as in the input file, and when multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but the order of the sequences in the output file is no longer guaranteed. Processing multiple files in parallel may also require more memory.
  • --fasta-output: writes sequence data in fasta format (default if quality data is not available).
  • --fastq-output: writes sequence data in fastq format (default if quality data is available).
  • --json-output: writes sequence data in JSON format.
  • --out | -o <FILENAME>: filename used for saving the output (default: “-”, the standard output)
  • --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON).
  • --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format).
  • --skip-empty: sequences of length equal to zero are removed from the output (default: false).
  • --no-progressbar: deactivates progress bar display (default: false).

General options #

  • --help | -h | -?: shows this help.
  • --version: prints the version and exits.
  • --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.
  • --max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
  • --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
  • --batch-size <INTEGER>: number of sequences per batch for parallel processing (default: 1000, env: OBIBATCHSIZE)
  • --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
  • --pprof: enables pprof server. Look at the log for details. (default: false).
  • --pprof-mutex <INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
  • --pprof-goroutine <INTEGER>: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)

Examples #

obimultiplex --help
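
When obimultiplex is driven from a script, the command line can be assembled programmatically. The Python sketch below only builds the argument list, using long options taken from the synopsis above; obimultiplex_command is a hypothetical helper, and actually running the command requires obimultiplex to be installed and on the PATH.

```python
import shlex


def obimultiplex_command(sample_file, reads, output, unidentified=None,
                         allowed_mismatches=None, with_indels=False):
    """Build the argument list of an obimultiplex call (nothing is executed)."""
    command = ["obimultiplex", "--tag-list", sample_file]
    if unidentified is not None:
        command += ["--unidentified", unidentified]
    if allowed_mismatches is not None:
        command += ["--allowed-mismatches", str(allowed_mismatches)]
    if with_indels:
        command.append("--with-indels")
    command += ["--out", output, reads]
    return command


command = obimultiplex_command(
    "samples_simple.csv",
    "wolf_4seq.fastq",
    "wolf_4seq_simple.fastq",
    unidentified="wolf_4seq_bad.fastq",
)
print(shlex.join(command))

# To run it for real:
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting issues with file names that contain spaces.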