obicomplement

obicomplement: reverse complement of sequences #

Description #

obicomplement computes the reverse complement of every sequence in its input. For each sequence, the nucleotides are reversed and each base is replaced by its Watson–Crick complement (A↔T, C↔G), yielding the strand that would pair with the original sequence read in the opposite direction. Ambiguous IUPAC characters are handled correctly and preserved in the output.
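The transformation described above can be sketched in a few lines of Python. This is an illustration only, not the OBITools4 implementation: the `IUPAC_COMPLEMENT` table and the `reverse_complement` function are hypothetical names, and leaving unknown symbols (such as gaps) unchanged is an assumption made for this sketch.

```python
# Complement table for the IUPAC nucleotide alphabet, in lowercase.
# Assumption: symbols outside this table (e.g. gaps) are left unchanged.
IUPAC_COMPLEMENT = {
    "a": "t", "c": "g", "g": "c", "t": "a", "u": "a",
    "r": "y", "y": "r", "s": "s", "w": "w", "k": "m", "m": "k",
    "b": "v", "v": "b", "d": "h", "h": "d", "n": "n",
}


def reverse_complement(seq):
    """Return the lowercase reverse complement of a nucleotide string."""
    # Walk the sequence from its last base to its first, complementing each base.
    return "".join(IUPAC_COMPLEMENT.get(base, base) for base in reversed(seq.lower()))


print(reverse_complement("ATCGGCTATGCATGCTAGCT"))  # agctagcatgcatagccgat
print(reverse_complement("ARN"))                   # nyt
```

The second call shows an ambiguity code being complemented: R (A or G) becomes Y (T or C), while N stays N.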

When quality scores are present (fastq input), they are reversed in lock-step with the sequence so that each quality value remains associated with its corresponding base after transformation. This makes obicomplement safe to use in any pipeline that carries per-base quality information.
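To make the lock-step behaviour concrete, here is a minimal Python sketch. The function name `reverse_complement_record` is hypothetical, and the sketch only covers the four unambiguous bases; it shows the idea, not how OBITools4 stores records internally.

```python
# Translation table for the four unambiguous bases (lowercase).
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement_record(seq, qual):
    """Reverse-complement a sequence and reverse its quality string with it."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have the same length")
    # The sequence is complemented then reversed; the quality string is only
    # reversed, so position i of both outputs still describes the same base.
    return seq.lower().translate(COMPLEMENT)[::-1], qual[::-1]


seq, qual = reverse_complement_record("ACGTT", "I5+#!")
print(seq)   # aacgt
print(qual)  # !#+5I
```

In the input, the first base A carries quality `I`; in the output, its complement `t` is the last base and still carries `I`.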

This operation is commonly needed when sequences were read on the wrong strand, when a primer is designed on the reverse strand, or when preparing data for strand-aware downstream tools such as obipairing or obigrep.

graph TD
  A@{ shape: doc, label: "sequences.fasta" }
  C[obicomplement]
  D@{ shape: doc, label: "out_default.fasta" }
  A --> C:::obitools
  C --> D
  classDef obitools fill:#99d57c

The file sequences.fasta contains five sample fasta sequences:

📄 sequences.fasta
>seq001 basic DNA sequence
ATCGATCGATCGATCGATCG
>seq002 GC-rich sequence
GCGCGCGCGCGCGCGCGCGC
>seq003 AT-rich sequence
ATATATATATATATATATAT
>seq004 palindromic sequence
AATTCCGGAATTCCGGAATT
>seq005 mixed sequence
ATCGGCTATGCATGCTAGCT

To compute the reverse complement of all five sequences:

obicomplement sequences.fasta > out_default.fasta
📄 out_default.fasta
>seq001 {"definition":"basic DNA sequence"}
cgatcgatcgatcgatcgat
>seq002 {"definition":"GC-rich sequence"}
gcgcgcgcgcgcgcgcgcgc
>seq003 {"definition":"AT-rich sequence"}
atatatatatatatatatat
>seq004 {"definition":"palindromic sequence"}
aattccggaattccggaatt
>seq005 {"definition":"mixed sequence"}
agctagcatgcatagccgat

Each sequence header description is wrapped in a JSON annotation block, and the sequence itself is written in lowercase with all bases reverse-complemented.
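For downstream scripts that need to read these title lines, the following Python sketch splits one into an identifier and its annotations. The function name `parse_title_line` is hypothetical, and the sketch assumes the annotation block is a single JSON object occupying the rest of the line, as in the output shown above.

```python
import json


def parse_title_line(line):
    """Split a fasta/fastq title line into (identifier, annotations)."""
    body = line.rstrip("\n")[1:]  # drop the leading '>' or '@'
    identifier, _, rest = body.partition(" ")
    rest = rest.strip()
    if rest.startswith("{"):
        # JSON-formatted annotations, as written by obicomplement by default.
        return identifier, json.loads(rest)
    # Plain free-text description, as in the input file of this example.
    return identifier, ({"definition": rest} if rest else {})


print(parse_title_line('>seq001 {"definition":"basic DNA sequence"}'))
# ('seq001', {'definition': 'basic DNA sequence'})
```

Treating a plain description as a `definition` entry mirrors what the example output shows; other annotation layouts are not handled by this sketch.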

Synopsis #

obicomplement [--batch-mem <string>] [--batch-size <int>]
              [--batch-size-max <int>] [--compress|-Z] [--csv] [--debug]
              [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta]
              [--fasta-output] [--fastq] [--fastq-output] [--genbank]
              [--help|-h|-?] [--input-OBI-header] [--input-json-header]
              [--json-output] [--max-cpu <int>] [--no-order]
              [--no-progressbar] [--out|-o <FILENAME>]
              [--output-OBI-header|-O] [--output-json-header]
              [--paired-with <FILENAME>] [--pprof] [--pprof-goroutine <int>]
              [--pprof-mutex <int>] [--raw-taxid] [--silent-warning]
              [--skip-empty] [--solexa] [--taxonomy|-t <string>] [--u-to-t]
              [--update-taxid] [--version] [--with-leaves] [<args>]

Options #

obicomplement specific options #

  • --paired-with <FILENAME>: filename containing the paired reads.
  • --skip-empty: Sequences of length zero are removed from the output. Useful as a safety guard after upstream processing steps that may produce empty sequences.
  • --u-to-t: Convert Uracil (U) to Thymine (T) before computing the reverse complement. Ensures that RNA sequences are explicitly treated as DNA throughout the pipeline.

Taxonomic options #

  • --taxonomy | -t <string>: Path to the taxonomic database.
  • --fail-on-taxonomy: Exit with an error if a taxid found in the data is not a currently valid node in the loaded taxonomy.
  • --update-taxid: Automatically replace taxids that have been declared merged into a newer node by the taxonomy database.
  • --raw-taxid: Print taxids in output files without appending the taxon name and rank.
  • --with-leaves: When the taxonomy is extracted from a sequence file, attach sequences as leaves of their taxid node.

Controlling the input data #

OBITools4 usually detects the input file format automatically, including whether the file is compressed with gzip. In rare cases a file is misidentified; the following options force the format and bypass the detection step.
The file format options #
  • --fasta: indicates that sequence data is in fasta format.
  • --fastq: indicates that sequence data is in fastq format.
  • --embl: indicates that sequence data is in EMBL-ENA flatfile format.
  • --csv: indicates that sequence data is in CSV format.
  • --genbank: indicates that sequence data is in GenBank flatfile format.
  • --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.
Controlling the way OBITools4 formats annotations #
These options only apply to the FASTA and FASTQ formats.
  • --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
  • --input-json-header: FASTA/FASTQ title line annotations follow the JSON format.
Controlling quality score decoding #
This option only applies to the FASTQ format.
  • --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)
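The difference between the two encodings can be illustrated with a short Python sketch. It uses the commonly documented definitions (Sanger: Phred score at ASCII offset 33; old Solexa: Solexa score at ASCII offset 64); the function names are hypothetical and how OBITools4 performs the decoding internally is not covered here.

```python
import math


def sanger_to_phred(char):
    """Decode one Sanger-encoded quality character (ASCII offset 33)."""
    return ord(char) - 33


def solexa_to_phred(char):
    """Decode one old-Solexa quality character (ASCII offset 64) to a Phred score."""
    q_solexa = ord(char) - 64
    # Solexa scores use odds p/(1-p) rather than the error probability p.
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


print(sanger_to_phred("I"))            # 40
print(round(solexa_to_phred("h"), 2))  # 40.0
print(round(solexa_to_phred("@"), 2))  # 3.01
```

The two scales almost coincide for high-quality bases and diverge for low-quality ones, which is why decoding a Solexa file as Sanger mostly distorts the low end of the quality range.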

Controlling the output data #

  • --compress | -Z: output is compressed using gzip. (default: false)
  • --no-order: by default, OBITools preserve the order of the input in the output, and multiple input files are processed one at a time. With the --no-order option, several input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed but no longer guarantees the order of the sequences in the output file, and processing files in parallel may require more memory.
  • --fasta-output: writes sequence data in fasta format (default if quality data is not available).
  • --fastq-output: writes sequence data in fastq format (default if quality data is available).
  • --json-output: writes sequence data in JSON format.
  • --out | -o <FILENAME>: filename used for saving the output (default: “-”, the standard output)
  • --output-OBI-header | -O: writes output FASTA/FASTQ title line annotations in OBI format (default: JSON).
  • --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format).
  • --skip-empty: sequences of length equal to zero are removed from the output (default: false).
  • --no-progressbar: deactivates progress bar display (default: false).

General options #

  • --help | -h | -?: shows this help.
  • --version: prints the version and exits.
  • --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.
  • --max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
  • --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
  • --batch-size <INTEGER>: minimum number of sequences per batch for parallel processing (floor, default: 1, env: OBIBATCHSIZE)
  • --batch-size-max <INTEGER>: maximum number of sequences per batch for parallel processing (ceiling, default: 2000, env: OBIBATCHSIZEMAX)
  • --batch-mem <STRING>: maximum memory per batch (e.g. 128K, 64M, 1G; default: 128M; set to 0 to disable, env: OBIBATCHMEM)
  • --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
  • --pprof: enables pprof server. Look at the log for details. (default: false).
  • --pprof-mutex <INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
  • --pprof-goroutine <INTEGER>: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)

Examples #

Reverse complement paired-end reads #

For paired-end data, R1.fastq and R2.fastq contain the forward and reverse reads, respectively. obicomplement processes both files together and writes the reverse-complemented results to two separate files whose names are derived from the --out value by adding an _R1 or _R2 suffix (here out_paired_R1.fastq and out_paired_R2.fastq):

📄 R1.fastq
@pair001/1 paired read 1 forward
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@pair002/1 paired read 2 forward
GCGCGCGCGCGCGCGCGCGC
+
IIIIIIIIIIIIIIIIIIII
@pair003/1 paired read 3 forward
AATTCCGGAATTCCGGAATT
+
IIIIIIIIIIIIIIIIIIII
📄 R2.fastq
@pair001/2 paired read 1 reverse
CGATCGATCGATCGATCGAT
+
IIIIIIIIIIIIIIIIIIII
@pair002/2 paired read 2 reverse
CGCGCGCGCGCGCGCGCGCG
+
IIIIIIIIIIIIIIIIIIII
@pair003/2 paired read 3 reverse
TTAATTGGCCAATTGGCCAA
+
IIIIIIIIIIIIIIIIIIII
obicomplement --paired-with R2.fastq \
    --out out_paired.fastq \
    R1.fastq

📄 out_paired_R1.fastq

@pair001/1 {"definition":"paired read 1 forward"}
cgatcgatcgatcgatcgat
+
IIIIIIIIIIIIIIIIIIII
@pair002/1 {"definition":"paired read 2 forward"}
gcgcgcgcgcgcgcgcgcgc
+
IIIIIIIIIIIIIIIIIIII
@pair003/1 {"definition":"paired read 3 forward"}
aattccggaattccggaatt
+
IIIIIIIIIIIIIIIIIIII
📄 out_paired_R2.fastq
@pair001/2 {"definition":"paired read 1 reverse"}
atcgatcgatcgatcgatcg
+
IIIIIIIIIIIIIIIIIIII
@pair002/2 {"definition":"paired read 2 reverse"}
cgcgcgcgcgcgcgcgcgcg
+
IIIIIIIIIIIIIIIIIIII
@pair003/2 {"definition":"paired read 3 reverse"}
ttggccaattggccaattaa
+
IIIIIIIIIIIIIIIIIIII
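When scripting around this command, it helps to predict the two output file names. The Python sketch below reproduces the naming seen in this example (`out_paired.fastq` becomes `out_paired_R1.fastq` and `out_paired_R2.fastq`). The helper `paired_output_names` is hypothetical, and the rule is inferred from this single example; names with several extensions, such as compressed output, are not covered.

```python
from pathlib import Path


def paired_output_names(out):
    """Derive the _R1/_R2 file names from the value passed to --out."""
    path = Path(out)
    # Insert the read tag between the file stem and its extension.
    return tuple(
        str(path.with_name(f"{path.stem}_{tag}{path.suffix}")) for tag in ("R1", "R2")
    )


print(paired_output_names("out_paired.fastq"))
# ('out_paired_R1.fastq', 'out_paired_R2.fastq')
```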

Reverse complement RNA sequences #

obicomplement handles Uracil (U) natively: each U is complemented to Adenine (A) just like Thymine would be, so fasta files containing RNA sequences can be processed directly without any extra flag. The output is a standard DNA file. The file rna_sequences.fasta contains three RNA sequences with U bases:

📄 rna_sequences.fasta
>rna001 mRNA fragment with uracil
AUGCAUGCAUGCAUGCAUGC
>rna002 coding RNA with uracil
GCAUGCAUGCAUGCAUGCAU
>rna003 polyU RNA sequence
UUUUUUUUUUUUUUUUUUUU
obicomplement rna_sequences.fasta > out_rna_rc.fasta
📄 out_rna_rc.fasta
>rna001 {"definition":"mRNA fragment with uracil"}
gcatgcatgcatgcatgcat
>rna002 {"definition":"coding RNA with uracil"}
atgcatgcatgcatgcatgc
>rna003 {"definition":"polyU RNA sequence"}
aaaaaaaaaaaaaaaaaaaa
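The reason no extra flag is needed can be checked with a small Python sketch: complementing U directly to A gives the same result as first converting U to T and then reverse-complementing. The names `COMPLEMENT` and `reverse_complement` are hypothetical, and the sketch covers only A, C, G, T and U.

```python
# U is mapped to A exactly like T, so RNA input yields DNA output.
COMPLEMENT = str.maketrans("acgtu", "tgcaa")


def reverse_complement(seq):
    """Return the lowercase reverse complement, treating U like T."""
    return seq.lower().translate(COMPLEMENT)[::-1]


rna = "AUGCAUGCAUGCAUGCAUGC"
direct = reverse_complement(rna)                         # U handled directly
converted = reverse_complement(rna.replace("U", "T"))    # U converted to T first

print(direct)               # gcatgcatgcatgcatgcat
print(direct == converted)  # True
```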
obicomplement --help