obicomplement: reverse complement of sequences #

Preliminary AI-generated documentation

This page was automatically generated by an AI assistant and has not yet been reviewed or validated by the OBITools4 development team. It may contain inaccuracies or incomplete information. Use with caution and refer to the command’s --help output for authoritative option descriptions.

Description #

obicomplement computes the reverse complement of every sequence in its input. For each sequence, the order of the nucleotides is reversed and each base is replaced by its Watson–Crick complement (A↔T, C↔G), yielding the strand that would pair with the original sequence, read in the opposite direction. Ambiguous IUPAC codes are kept in the output, each one replaced by its complementary code (for example, R↔Y).
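
The transformation can be illustrated with a few lines of Python. This is only a sketch of the general technique, not the code used by obicomplement: the table name IUPAC_COMPLEMENT and the function reverse_complement are hypothetical, and the table follows the standard IUPAC complement pairs.

```python
# Standard IUPAC complement pairs (upper case).
# Assumption: illustrative table, not taken from the OBITools4 sources.
IUPAC_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R",  # purine <-> pyrimidine
    "K": "M", "M": "K",  # keto <-> amino
    "B": "V", "V": "B",  # not A <-> not T
    "D": "H", "H": "D",  # not C <-> not G
    "S": "S", "W": "W",  # strong and weak are their own complements
    "N": "N",
}


def reverse_complement(seq: str) -> str:
    """Read seq backwards and complement each base (hypothetical helper)."""
    return "".join(IUPAC_COMPLEMENT[base] for base in reversed(seq.upper()))


print(reverse_complement("ATCGATCGATCGATCGATCG"))
print(reverse_complement("ARN"))
```

The function returns upper-case text and raises a KeyError on characters outside the table, which is a choice made for this sketch only.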

When quality scores are present (fastq input), they are reversed in lock-step with the sequence so that each quality value remains associated with its corresponding base after transformation. This makes obicomplement safe to use in any pipeline that carries per-base quality information.
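
The lock-step handling of qualities can be sketched as follows. The function name reverse_complement_read is hypothetical and the sketch only handles the four unambiguous bases; it is meant to show why the quality string is reversed but not otherwise changed.

```python
# Translation table for the four unambiguous bases (both cases).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_read(seq: str, qual: str) -> tuple[str, str]:
    """Reverse-complement seq and reverse qual so pairs stay aligned.

    Hypothetical helper: a sketch, not the obicomplement implementation.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality must have the same length")
    # Qualities are only reversed: a score describes a position, not a base.
    return seq.translate(COMPLEMENT)[::-1], qual[::-1]


seq, qual = reverse_complement_read("AACG", "I#5A")
print(seq, qual)
```

In this example the first base A has quality I. After the transformation it becomes the last base T, and I is the last quality character, so the pairing is kept.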

This operation is commonly needed when sequences were read on the wrong strand, when a primer is designed on the reverse strand, or when preparing data for strand-aware downstream tools such as obipairing or obigrep.

graph TD
  A@{ shape: doc, label: "sequences.fasta" }
  C[obicomplement]
  D@{ shape: doc, label: "out_default.fasta" }
  A --> C:::obitools
  C --> D
  classDef obitools fill:#99d57c

The file sequences.fasta contains five sample fasta sequences:

📄 sequences.fasta
>seq001 basic DNA sequence
ATCGATCGATCGATCGATCG
>seq002 GC-rich sequence
GCGCGCGCGCGCGCGCGCGC
>seq003 AT-rich sequence
ATATATATATATATATATAT
>seq004 palindromic sequence
AATTCCGGAATTCCGGAATT
>seq005 mixed sequence
ATCGGCTATGCATGCTAGCT

To compute the reverse complement of all five sequences:

obicomplement sequences.fasta -o out_default.fasta

📄 out_default.fasta
>seq001 {"definition":"basic DNA sequence"}
cgatcgatcgatcgatcgat
>seq002 {"definition":"GC-rich sequence"}
gcgcgcgcgcgcgcgcgcgc
>seq003 {"definition":"AT-rich sequence"}
atatatatatatatatatat
>seq004 {"definition":"palindromic sequence"}
aattccggaattccggaatt
>seq005 {"definition":"mixed sequence"}
agctagcatgcatagccgat

Each sequence header description is wrapped in a JSON annotation block, and the sequence itself is written in lowercase with all bases reverse-complemented.
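
A downstream script can read these title lines with a standard JSON parser. The sketch below assumes the layout visible in the output above (identifier, one space, then a JSON object); parse_header is a hypothetical helper, not part of OBITools4.

```python
import json


def parse_header(line: str) -> tuple[str, dict]:
    """Split a FASTA title line into (identifier, annotations).

    Assumes the title is '>id {json}' as in the sample output; a title
    with no annotation block yields an empty dictionary.
    """
    line = line.rstrip("\n")
    if line.startswith(">"):
        line = line[1:]
    identifier, _, rest = line.partition(" ")
    annotations = json.loads(rest) if rest.strip() else {}
    return identifier, annotations


ident, annot = parse_header('>seq001 {"definition":"basic DNA sequence"}')
print(ident, annot["definition"])
```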

Synopsis #

obicomplement [--batch-mem <string>] [--batch-size <int>]
              [--batch-size-max <int>] [--compress|-Z] [--csv] [--debug]
              [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta]
              [--fasta-output] [--fastq] [--fastq-output] [--genbank]
              [--help|-h|-?] [--input-OBI-header] [--input-json-header]
              [--json-output] [--max-cpu <int>] [--no-order]
              [--no-progressbar] [--out|-o <FILENAME>]
              [--output-OBI-header|-O] [--output-json-header]
              [--paired-with <FILENAME>] [--raw-taxid] [--silent-warning]
              [--skip-empty] [--solexa] [--taxonomy|-t <string>] [--u-to-t]
              [--update-taxid] [--with-leaves] [<args>]

Options #

obicomplement specific options #

  • --paired-with <FILENAME>: filename containing the paired reads.

Taxonomic options #

  • --taxonomy | -t <string>: Path to the taxonomic database.
  • --fail-on-taxonomy: Exit with an error if a taxid found in the data is not a currently valid node in the loaded taxonomy.
  • --update-taxid: Automatically replace taxids that have been declared merged into a newer node by the taxonomy database.
  • --raw-taxid: Print taxids in output files without appending the taxon name and rank.
  • --with-leaves: When the taxonomy is extracted from a sequence file, attach sequences as leaves of their taxid node.

Controlling the input data #

OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step.
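
How OBITools4 performs this identification is not described here. As a rough illustration of the general idea, the sketch below detects GZIP data from its two-byte magic number and guesses fasta or fastq from the first character. The helpers is_gzip and guess_format are hypothetical, and the first-character rule is an assumed heuristic that also shows why a forced format can be needed.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzip(data: bytes) -> bool:
    """Return True if data starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def guess_format(data: bytes) -> str:
    """Guess 'fasta' or 'fastq' from the first character.

    Assumption: a simplistic heuristic for illustration only.
    """
    if is_gzip(data):
        data = gzip.decompress(data)
    text = data.lstrip()
    if text.startswith(b">"):
        return "fasta"
    if text.startswith(b"@"):
        return "fastq"
    return "unknown"


print(guess_format(b">seq001 basic DNA sequence\nATCG\n"))
print(guess_format(gzip.compress(b"@read001\nATCG\n+\nIIII\n")))
```

Input that does not start with one of the expected characters is reported as unknown by this sketch, which is the kind of case the options below address.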

The file format options #

  • --fasta: indicates that sequence data is in fasta format.
  • --fastq: indicates that sequence data is in fastq format.
  • --embl: indicates that sequence data is in EMBL-ENA flatfile format.
  • --csv: indicates that sequence data is in CSV format.
  • --genbank: indicates that sequence data is in GenBank flatfile format.
  • --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.

Controlling the way OBITools4 formats annotations #

These options only apply to the FASTA and FASTQ formats.

  • --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
  • --input-json-header: FASTA/FASTQ title line annotations follow the JSON format.

Controlling quality score decoding #

This option only applies to the FASTQ format.

  • --solexa: decodes the quality string according to the old Solexa specification (default: the standard Sanger encoding is used, env: OBISSOLEXA).
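
The difference between the two encodings can be shown with a short sketch. It uses the commonly documented conventions: Sanger quality characters encode a Phred score with an ASCII offset of 33, while old Solexa characters encode a Solexa score with an offset of 64. Whether OBITools4 converts Solexa scores to Phred scores exactly this way is an assumption, and both function names are hypothetical.

```python
import math


def sanger_to_phred(char: str) -> int:
    """Decode one Sanger-encoded quality character (ASCII offset 33)."""
    return ord(char) - 33


def solexa_to_phred(char: str) -> float:
    """Decode one old Solexa quality character (ASCII offset 64).

    Solexa scores are defined on the odds p / (1 - p), so they are
    converted to the Phred scale with 10 * log10(10^(Q / 10) + 1).
    """
    q_solexa = ord(char) - 64
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


print(sanger_to_phred("I"))
print(round(solexa_to_phred("h"), 2))
```

The character I used in the example files corresponds to a Phred score of 40 under the Sanger encoding. The two scales are nearly identical for high scores and diverge for low ones.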

Controlling the output data #

  • --compress | -Z : output is compressed using gzip. (default: false)
  • --no-order: by default, the OBITools keep sequences in the same order in the output file as in the input file, and when multiple files are given they are processed one at a time. With the --no-order option, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but the order of the sequences in the output file is no longer guaranteed. Processing multiple files in parallel may also require more memory.
  • --fasta-output: writes sequence data in fasta format (default if quality data is not available).
  • --fastq-output: writes sequence data in fastq format (default if quality data is available).
  • --json-output: writes sequence data in JSON format.
  • --out | -o <FILENAME>: filename used for saving the output (default: “-”, the standard output)
  • --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON).
  • --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format).
  • --skip-empty: sequences of length equal to zero are removed from the output (default: false).
  • --no-progressbar: deactivates progress bar display (default: false).
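
The trade-off behind --no-order can be illustrated with a generic sketch that has nothing OBITools-specific in it. Batches are processed by a pool of workers; collecting results in submission order preserves the input order, while collecting them as they finish does not. The names process, run_ordered and run_unordered are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process(batch: list[str]) -> list[str]:
    """Stand-in for real work: reverse every string of the batch."""
    return [s[::-1] for s in batch]


def run_ordered(batches, workers=4):
    """Results come back in input order, even if workers finish out of order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, batches))


def run_unordered(batches, workers=4):
    """Results come back in completion order, which may differ from input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process, b) for b in batches]
        return [f.result() for f in as_completed(futures)]


batches = [["AC", "GT"], ["AAT"], ["CCG", "T"]]
print(run_ordered(batches))
print(sorted(run_unordered(batches)))
```

Both functions return the same set of results; only the ordering guarantee differs.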

General options #

  • --help | -h|-? : shows this help.
  • --version: prints the version and exits.
  • --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.
  • --max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
  • --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
  • --batch-size <INTEGER>: minimum number of sequences per batch for parallel processing (floor, default: 1, env: OBIBATCHSIZE)
  • --batch-size-max <INTEGER>: maximum number of sequences per batch for parallel processing (ceiling, default: 2000, env: OBIBATCHSIZEMAX)
  • --batch-mem <STRING>: maximum memory per batch (e.g. 128K, 64M, 1G; default: 128M; set to 0 to disable, env: OBIBATCHMEM)
  • --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
  • --pprof: enables pprof server. Look at the log for details. (default: false).
  • --pprof-mutex <INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
  • --pprof-goroutine <INTEGER>: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)
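
The interaction between the three batch options is not spelled out above. The sketch below shows one possible reading, stated as an assumption: a batch is closed when it reaches the ceiling, or when its size in bytes reaches the memory limit and it already holds at least the floor number of sequences. The suffixes K, M and G are assumed to be powers of 1024, and parse_size and make_batches are hypothetical helpers.

```python
def parse_size(text: str) -> int:
    """Parse '128K', '64M', '1G' or a plain byte count.

    Assumption: K, M and G are multiples of 1024.
    """
    units = {"K": 1024, "M": 1024**2, "G": 1024**3}
    text = text.strip().upper()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def make_batches(seqs, min_size=1, max_size=2000, max_mem=parse_size("128M")):
    """Group sequences into batches under the assumed rules.

    A max_mem of 0 disables the memory limit, as described for --batch-mem.
    """
    batch, mem = [], 0
    for seq in seqs:
        batch.append(seq)
        mem += len(seq)
        full = len(batch) >= max_size
        over_mem = max_mem > 0 and mem >= max_mem and len(batch) >= min_size
        if full or over_mem:
            yield batch
            batch, mem = [], 0
    if batch:  # flush the last, possibly smaller, batch
        yield batch


print(parse_size("128K"))
print([len(b) for b in make_batches(["ACGT"] * 5, max_size=2)])
```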

Examples #

The file reads.fastq contains five fastq reads with Phred quality scores. obicomplement reverses both the nucleotide sequence and the quality string so that each quality value stays aligned with its base after the transformation:

📄 reads.fastq
@read001 sequencing read 1
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@read002 sequencing read 2
GCGCGCGCGCGCGCGCGCGC
+
IIIIIIIIIIIIIIIIIIII
@read003 sequencing read 3
ATATATATATATATATATAT
+
IIIIIIIIIIIIIIIIIIII
@read004 sequencing read 4
AATTCCGGAATTCCGGAATT
+
IIIIIIIIIIIIIIIIIIII
@read005 sequencing read 5
ATCGGCTATGCATGCTAGCT
+
IIIIIIIIIIIIIIIIIIII

obicomplement reads.fastq --fastq-output -o out_fastq.fastq

📄 out_fastq.fastq
@read001 {"definition":"sequencing read 1"}
cgatcgatcgatcgatcgat
+
IIIIIIIIIIIIIIIIIIII
@read002 {"definition":"sequencing read 2"}
gcgcgcgcgcgcgcgcgcgc
+
IIIIIIIIIIIIIIIIIIII
@read003 {"definition":"sequencing read 3"}
atatatatatatatatatat
+
IIIIIIIIIIIIIIIIIIII
@read004 {"definition":"sequencing read 4"}
aattccggaattccggaatt
+
IIIIIIIIIIIIIIIIIIII
@read005 {"definition":"sequencing read 5"}
agctagcatgcatagccgat
+
IIIIIIIIIIIIIIIIIIII

The file rna_sequences.fasta contains RNA sequences that use Uracil (U) instead of Thymine (T). The --u-to-t flag converts each U to T before computing the reverse complement, producing valid DNA output that can be used in DNA-based downstream analyses:

📄 rna_sequences.fasta
>rna001 mRNA fragment with uracil
AUGCAUGCAUGCAUGCAUGC
>rna002 coding RNA with uracil
GCAUGCAUGCAUGCAUGCAU
>rna003 polyU RNA sequence
UUUUUUUUUUUUUUUUUUUU

obicomplement --u-to-t rna_sequences.fasta -o out_rna_rc.fasta

📄 out_rna_rc.fasta
>rna001 {"definition":"mRNA fragment with uracil"}
gcatgcatgcatgcatgcat
>rna002 {"definition":"coding RNA with uracil"}
atgcatgcatgcatgcatgc
>rna003 {"definition":"polyU RNA sequence"}
aaaaaaaaaaaaaaaaaaaa
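
The effect of the flag on this example can be reproduced with a small sketch: U is first rewritten as T, then the DNA reverse complement is taken. The function rna_reverse_complement is hypothetical, handles only unambiguous bases, and writes lowercase to match the sample output.

```python
# Complement table for unambiguous DNA bases (upper case).
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rna_reverse_complement(seq: str) -> str:
    """Convert U to T, then reverse-complement (sketch, not the tool's code)."""
    dna = seq.upper().replace("U", "T")
    return dna.translate(DNA_COMPLEMENT)[::-1].lower()


for rna in ("AUGCAUGCAUGCAUGCAUGC", "GCAUGCAUGCAUGCAUGCAU", "UUUUUUUUUUUUUUUUUUUU"):
    print(rna_reverse_complement(rna))
```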

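For paired-end data, R1.fastq and R2.fastq contain the forward and reverse mates respectively. obicomplement reads both files and writes its results to separate _R1 and _R2 output files. In the output shown here, the R1 reads are reverse-complemented, whereas the R2 mates are written in lowercase with the same base order as in the input: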

📄 R1.fastq
@pair001/1 paired read 1 forward
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIII
@pair002/1 paired read 2 forward
GCGCGCGCGCGCGCGCGCGC
+
IIIIIIIIIIIIIIIIIIII
@pair003/1 paired read 3 forward
AATTCCGGAATTCCGGAATT
+
IIIIIIIIIIIIIIIIIIII

📄 R2.fastq
@pair001/2 paired read 1 reverse
CGATCGATCGATCGATCGAT
+
IIIIIIIIIIIIIIIIIIII
@pair002/2 paired read 2 reverse
CGCGCGCGCGCGCGCGCGCG
+
IIIIIIIIIIIIIIIIIIII
@pair003/2 paired read 3 reverse
TTAATTGGCCAATTGGCCAA
+
IIIIIIIIIIIIIIIIIIII

obicomplement R1.fastq --paired-with R2.fastq --out out_paired.fastq

📄 out_paired_R1.fastq

@pair001/1 {"definition":"paired read 1 forward"}
cgatcgatcgatcgatcgat
+
IIIIIIIIIIIIIIIIIIII
@pair002/1 {"definition":"paired read 2 forward"}
gcgcgcgcgcgcgcgcgcgc
+
IIIIIIIIIIIIIIIIIIII
@pair003/1 {"definition":"paired read 3 forward"}
aattccggaattccggaatt
+
IIIIIIIIIIIIIIIIIIII

📄 out_paired_R2.fastq
@pair001/2 {"definition":"paired read 1 reverse"}
cgatcgatcgatcgatcgat
+
IIIIIIIIIIIIIIIIIIII
@pair002/2 {"definition":"paired read 2 reverse"}
cgcgcgcgcgcgcgcgcgcg
+
IIIIIIIIIIIIIIIIIIII
@pair003/2 {"definition":"paired read 3 reverse"}
ttaattggccaattggccaa
+
IIIIIIIIIIIIIIIIIIII

obicomplement --help