
obiconvert: convert sequence files between formats

Description

obiconvert converts biological sequence data between several standard bioinformatics formats, reading from one format and writing to another, and is designed to handle large datasets. The tool detects the input format automatically and chooses the output format from the data: fastq when quality scores are present, fasta otherwise. To force a specific output format regardless of input content, use the explicit output flags --fasta-output, --fastq-output, or --json-output.

graph TD
  A@{ shape: doc, label: "input.fastq" }
  C[obiconvert]
  D@{ shape: doc, label: "output.fasta" }
  A --> C:::obitools
  C --> D
  classDef obitools fill:#99d57c

Consider the following fastq input file:

📄 input.fastq
@seq001 DNA sequence with quality scores for FASTQ to FASTA conversion
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@seq002 Second sequence with moderate quality scores
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCT
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
@seq003 Third sequence with high quality scores
TTAACCGGTTAACCGGTTAACCGGTTAACCGGTTAACCG
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@seq004 Fourth sequence with variable quality scores
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACG
+
IIAAIIAAIIAAIIAAIIAAIIAAIIAAIIAAIIAAIIAA

Running obiconvert with --fasta-output converts the fastq file to fasta format, discarding the quality scores:

obiconvert --fastq --fasta-output input.fastq -o output.fasta
📄 output.fasta
>seq001 {"definition":"DNA sequence with quality scores for FASTQ to FASTA conversion"}
atcgatcgatcgatcgatcgatcgatcgatcgatcgatcg
>seq002 {"definition":"Second sequence with moderate quality scores"}
gctagctagctagctagctagctagctagctagctagct
>seq003 {"definition":"Third sequence with high quality scores"}
ttaaccggttaaccggttaaccggttaaccggttaaccg
>seq004 {"definition":"Fourth sequence with variable quality scores"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacg
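
The JSON-annotated title lines shown above can be read back with a few lines of Python. The sketch below is not part of OBITools: read_fasta is a hypothetical helper, and it assumes every title line follows the ">id {json}" layout seen in output.fasta.

```python
import json


def read_fasta(text):
    """Yield (identifier, annotations, sequence) from FASTA text whose
    title lines look like '>id {json}' (hypothetical helper)."""
    ident, annotations, chunks = None, {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if ident is not None:
                yield ident, annotations, "".join(chunks)
            # Split the identifier from the rest of the title line.
            ident, _, rest = line[1:].partition(" ")
            rest = rest.strip()
            # Assumption: the remainder is either a JSON object or empty.
            annotations = json.loads(rest) if rest.startswith("{") else {}
            chunks = []
        else:
            chunks.append(line)
    if ident is not None:
        yield ident, annotations, "".join(chunks)


example = """>seq001 {"definition":"DNA sequence with quality scores for FASTQ to FASTA conversion"}
atcgatcgatcgatcgatcgatcgatcgatcgatcgatcg
>seq002 {"definition":"Second sequence with moderate quality scores"}
gctagctagctagctagctagctagctagctagctagct
"""

records = list(read_fasta(example))
for ident, annotations, sequence in records:
    print(ident, annotations.get("definition"), len(sequence))
```

Sequences that span several lines are joined, so the same helper also works on wrapped fasta output.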

obiconvert can also convert sequences to JSON format, which preserves all annotations in a structured, machine-readable form. Consider the following fasta input:

📄 input.fasta
>seq001 Test DNA sequence for FASTA conversion
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
>seq002 Another test sequence with different nucleotide content
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCT
>seq003 Third sequence for testing output format
TTAACCGGTTAACCGGTTAACCGGTTAACCGGTTAACCG
obiconvert --fasta --json-output input.fasta -o output.json
📄 output.json
[
  {
    "annotations": {
      "definition": "Test DNA sequence for FASTA conversion"
    },
    "id": "seq001",
    "sequence": "atcgatcgatcgatcgatcgatcgatcgatcgatcgatcg"
  },
  {
    "annotations": {
      "definition": "Another test sequence with different nucleotide content"
    },
    "id": "seq002",
    "sequence": "gctagctagctagctagctagctagctagctagctagct"
  },
  {
    "annotations": {
      "definition": "Third sequence for testing output format"
    },
    "id": "seq003",
    "sequence": "ttaaccggttaaccggttaaccggttaaccggttaaccg"
  }
]
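
Because the JSON output is a plain array of records, it can be loaded with any JSON library. The following is a minimal sketch in Python, assuming the three keys visible in the example (annotations, id, sequence); the embedded string is a shortened copy of output.json, and the variable names are illustrative only.

```python
import json

# Shortened copy of the output.json example (first two records).
output_json = """
[
  {
    "annotations": {"definition": "Test DNA sequence for FASTA conversion"},
    "id": "seq001",
    "sequence": "atcgatcgatcgatcgatcgatcgatcgatcgatcgatcg"
  },
  {
    "annotations": {"definition": "Another test sequence with different nucleotide content"},
    "id": "seq002",
    "sequence": "gctagctagctagctagctagctagctagctagctagct"
  }
]
"""

records = json.loads(output_json)

# Index the records by identifier for direct lookup.
by_id = {record["id"]: record for record in records}

for ident, record in by_id.items():
    definition = record["annotations"].get("definition", "")
    print(ident, len(record["sequence"]), definition)
```

For a real file, json.load on an open file handle replaces json.loads on the string.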

When working with paired-end sequencing data, the --paired-with option links two files so that read pairing is preserved across the conversion. The output is automatically split into two files suffixed _R1 and _R2:

📄 forward.fastq
@seq001 Forward read one from paired-end sequencing
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@seq002 Forward read two from paired-end sequencing
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCT
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
@seq003 Forward read three from paired-end sequencing
TTAACCGGTTAACCGGTTAACCGGTTAACCGGTTAACCG
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@seq004 Forward read four from paired-end sequencing
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACG
+
IIAAIIAAIIAAIIAAIIAAIIAAIIAAIIAAIIAAIIAA
📄 reverse.fastq
@seq001 Reverse read one from paired-end sequencing
CGATCGATCGATCGATCGATCGATCGATCGATCGATCGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@seq002 Reverse read two from paired-end sequencing
TAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
@seq003 Reverse read three from paired-end sequencing
CCGGTTAACCGGTTAACCGGTTAACCGGTTAACCGGTTA
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@seq004 Reverse read four from paired-end sequencing
GTACGTACGTACGTACGTACGTACGTACGTACGTACGTA
+
IIAAIIAAIIAAIIAAIIAAIIAAIIAAIIAAIIAAIIAA
obiconvert --fastq --fasta-output forward.fastq \
           --paired-with reverse.fastq \
           -o sequences.fasta
📄 sequences_R1.fasta
>seq001 {"definition":"Forward read one from paired-end sequencing"}
atcgatcgatcgatcgatcgatcgatcgatcgatcgatcg
>seq002 {"definition":"Forward read two from paired-end sequencing"}
gctagctagctagctagctagctagctagctagctagct
>seq003 {"definition":"Forward read three from paired-end sequencing"}
ttaaccggttaaccggttaaccggttaaccggttaaccg
>seq004 {"definition":"Forward read four from paired-end sequencing"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacg
📄 sequences_R2.fasta
>seq001 {"definition":"Reverse read one from paired-end sequencing"}
cgatcgatcgatcgatcgatcgatcgatcgatcgatcga
>seq002 {"definition":"Reverse read two from paired-end sequencing"}
tagctagctagctagctagctagctagctagctagctag
>seq003 {"definition":"Reverse read three from paired-end sequencing"}
ccggttaaccggttaaccggttaaccggttaaccggtta
>seq004 {"definition":"Reverse read four from paired-end sequencing"}
gtacgtacgtacgtacgtacgtacgtacgtacgtacgta
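
A script that consumes the paired output needs to know the two file names in advance. The sketch below derives them from the name given to -o. It is only an inference from the single example above (sequences.fasta becomes sequences_R1.fasta and sequences_R2.fasta); paired_output_names is a hypothetical helper, and the behaviour for other names, such as compressed outputs, is not covered here.

```python
from pathlib import Path


def paired_output_names(out):
    """Return the (R1, R2) file names expected for a paired conversion.

    Assumption: the _R1/_R2 suffix is inserted just before the last
    extension, as in sequences.fasta -> sequences_R1.fasta.
    """
    path = Path(out)
    return (
        str(path.with_name(f"{path.stem}_R1{path.suffix}")),
        str(path.with_name(f"{path.stem}_R2{path.suffix}")),
    )


forward_name, reverse_name = paired_output_names("sequences.fasta")
print(forward_name, reverse_name)
```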

Synopsis

obiconvert [--batch-mem <string>] [--batch-size <int>]
           [--batch-size-max <int>] [--compress|-Z] [--csv] [--debug] [--ecopcr]
           [--embl] [--fail-on-taxonomy] [--fasta] [--fasta-output] [--fastq]
           [--fastq-output] [--genbank] [--help|-h|-?]
           [--input-OBI-header] [--input-json-header] [--json-output]
           [--max-cpu <int>] [--no-order] [--no-progressbar]
           [--out|-o <FILENAME>] [--output-OBI-header|-O]
           [--output-json-header] [--paired-with <FILENAME>] [--pprof]
           [--pprof-goroutine <int>] [--pprof-mutex <int>] [--raw-taxid]
           [--silent-warning] [--skip-empty] [--solexa]
           [--taxonomy|-t <string>] [--u-to-t] [--update-taxid] [--version]
           [--with-leaves] [<args>]

Options

obiconvert specific options

  • --paired-with <FILENAME>: filename containing the paired reads.

Check taxids against a taxonomy

OBITools4 commands can load a taxonomy database while processing sequence data. When a taxonomy is loaded, the command checks the validity of every taxid it encounters. Three cases can occur:
  • The taxon is valid.
  • The taxon is no longer valid, but a new one replaces it.
  • The taxon is no longer valid, and no new taxid exists to replace it.
In the first case, the obitools normalize the taxid to the form:
    TAXCOD:TAXID [SCIENTIFIC NAME]@RANK
For example, with the NCBI taxonomy the human taxid is written:
    taxon:9606 [Homo sapiens]@species
This rewriting does not happen if the --raw-taxid option is set. In that case only the raw taxid is kept:
    9606
In the second case, a warning message is logged to standard error. If the --update-taxid option is set, the command replaces the expired taxid with its new equivalent, and the rules for a valid taxon then apply. Otherwise, the old taxid is kept in the result.
In the last case, a warning message is also written to standard error, and the invalid taxid is kept as is. If the --fail-on-taxonomy option is set, the command stops and exits with an error at the first invalid taxid encountered in the input data.
  • --taxonomy | -t <string>: Path to the taxonomic database.
  • --raw-taxid: Displays the raw taxid for each displayed taxon. (default: false)
  • --update-taxid: makes obitools automatically update taxids that are declared merged into a newer one (default: false).
  • --fail-on-taxonomy: makes obitools fail with an error if a taxid in use is not currently valid (default: false).
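
Downstream scripts often need the numeric taxid back from the normalized form described in this section. The sketch below parses the TAXCOD:TAXID [SCIENTIFIC NAME]@RANK layout with a regular expression and falls back to treating the value as a raw taxid, as produced with --raw-taxid. The pattern and the parse_taxid helper are illustrative assumptions based only on the example string in this section, not code taken from OBITools.

```python
import re

# Pattern for 'TAXCOD:TAXID [SCIENTIFIC NAME]@RANK' (assumed layout).
TAXID_PATTERN = re.compile(
    r"^(?P<code>[^:\s]+):(?P<taxid>\S+) \[(?P<name>.*)\]@(?P<rank>.+)$"
)


def parse_taxid(value):
    """Split a normalized taxid into its parts; raw taxids pass through."""
    value = str(value).strip()
    match = TAXID_PATTERN.match(value)
    if match:
        return match.groupdict()
    # Not in normalized form: keep the value as a raw taxid.
    return {"code": None, "taxid": value, "name": None, "rank": None}


print(parse_taxid("taxon:9606 [Homo sapiens]@species"))
print(parse_taxid("9606"))
```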

Controlling the input data

OBITools4 generally recognizes the input file format, and also whether the input file is compressed with gzip. A few files can nevertheless be misidentified, so the following options let the user force the format and bypass the format identification step.
The file format options
  • --fasta: indicates that sequence data is in fasta format.
  • --fastq: indicates that sequence data is in fastq format.
  • --embl: indicates that sequence data is in EMBL-ENA flatfile format.
  • --csv: indicates that sequence data is in CSV format.
  • --genbank: indicates that sequence data is in GenBank flatfile format.
  • --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.
Controlling the way OBITools4 formats annotations
These options only apply to the FASTA and FASTQ formats.
  • --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
  • --input-json-header: FASTA/FASTQ title line annotations follow the JSON format.
Controlling quality score decoding
This option only applies to the FASTQ format.
  • --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)
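
The difference between the two encodings can be illustrated in a few lines of Python. Sanger-encoded strings store Phred scores with an ASCII offset of 33, whereas the old Solexa encoding uses an offset of 64 and an odds-based score. The sketch below follows those general definitions of the encodings; whether obiconvert performs exactly this conversion internally is an assumption, and both function names are hypothetical.

```python
import math


def decode_sanger(quality):
    """Phred scores from a Sanger-encoded string (ASCII offset 33)."""
    return [ord(char) - 33 for char in quality]


def decode_solexa(quality):
    """Phred-equivalent scores from an old Solexa string (ASCII offset 64).

    Solexa scores are odds based, Q = -10*log10(p / (1 - p)), so each
    one is mapped to the Phred scale with 10*log10(10**(Q/10) + 1).
    """
    scores = []
    for char in quality:
        solexa = ord(char) - 64
        scores.append(10 * math.log10(10 ** (solexa / 10) + 1))
    return scores


# 'I', 'A' and 'H' are the quality characters used in the examples above.
print(decode_sanger("IAH"))
print([round(score, 2) for score in decode_solexa("h@")])
```

The two scales nearly coincide for high scores and diverge for low ones, which is why a misidentified encoding mostly distorts poor-quality positions.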

Controlling the output data

  • --compress | -Z : output is compressed using gzip. (default: false)
  • --no-order: by default, the OBITools keep the sequences in the output file in the same order as in the input file, and when multiple files are processed they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Processing multiple files in parallel may also require more memory.
  • --fasta-output: writes sequence data in fasta format (default if quality data is not available).
  • --fastq-output: writes sequence data in fastq format (default if quality data is available).
  • --json-output: writes sequence data in JSON format.
  • --out | -o <FILENAME>: filename used for saving the output (default: “-”, the standard output)
  • --output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON).
  • --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format).
  • --skip-empty: sequences of length equal to zero are removed from the output (default: false).
  • --no-progressbar: deactivates progress bar display (default: false).

General options

  • --help | -h | -?: shows this help.
  • --version: prints the version and exits.
  • --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.
  • --max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
  • --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
  • --batch-size <INTEGER>: minimum number of sequences per batch for parallel processing (floor, default: 1, env: OBIBATCHSIZE)
  • --batch-size-max <INTEGER>: maximum number of sequences per batch for parallel processing (ceiling, default: 2000, env: OBIBATCHSIZEMAX)
  • --batch-mem <STRING>: maximum memory per batch (e.g. 128K, 64M, 1G; default: 128M; set to 0 to disable, env: OBIBATCHMEM)
  • --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
  • --pprof: enables pprof server. Look at the log for details. (default: false).
  • --pprof-mutex <INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
  • --pprof-goroutine <INTEGER>: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)
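
When obiconvert is driven from a script, the options and the OBIMAXCPU environment variable described above can be assembled programmatically. The sketch below only builds the argument list and the environment; it uses flags documented on this page, while the helper names obiconvert_command and obitools_environment are hypothetical, and it assumes OBIMAXCPU takes a plain integer.

```python
import os
import shlex


def obiconvert_command(source, destination, input_flag=None,
                       output_flag=None, compress=False):
    """Build an obiconvert argument list from documented options."""
    command = ["obiconvert"]
    if input_flag:       # e.g. "--fastq", to bypass format detection
        command.append(input_flag)
    if output_flag:      # e.g. "--fasta-output"
        command.append(output_flag)
    if compress:         # gzip-compressed output
        command.append("--compress")
    command += [source, "-o", destination]
    return command


def obitools_environment(max_cpu):
    """Copy the current environment and cap the CPUs used by OBITools."""
    env = dict(os.environ)
    env["OBIMAXCPU"] = str(max_cpu)  # assumption: an integer value
    return env


command = obiconvert_command("input.fastq", "output.fasta",
                             "--fastq", "--fasta-output")
print(shlex.join(command))

# To actually run it (requires obiconvert on the PATH):
# import subprocess
# subprocess.run(command, env=obitools_environment(4), check=True)
```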

Examples

When working with rRNA metabarcoding data, sequences are often stored with uracil (U) instead of thymine (T). The file input_rna.fasta contains three such fasta sequences. The --u-to-t flag converts them to standard DNA for alignment tools that do not accept RNA notation.

📄 input_rna.fasta
>seq001 {"definition":"18S rRNA fragment from soil sample A","sample":"soil_A","taxid":4932}
AUUGCGGUGGAGCAUGUUUUCUUCAAAGAUUAAAGGUUGGUGCAUGCGAGAGUAGUGCGUGGAAUUCGUGG
>seq002 {"definition":"16S rRNA fragment from water sample B","sample":"water_B","taxid":1760}
GCUGGCGGCAGGCCUAACACAUGCAAGUCGAACGGUGAACAGAGCUUGCUCUUCGGUGUGAGUGGCGGACG
>seq003 {"definition":"ITS1 region from fungal isolate","sample":"soil_A","taxid":5204}
UUCGUGCGAAUUCGUGCAAAUCGCGCCUAAGUGUGCGCAAAGCAAAGCUUCGGCGGUGACCGAGUGCUCGC
obiconvert --fasta --fasta-output --u-to-t input_rna.fasta -o output_dna.fasta
📄 output_dna.fasta
>seq001 {"definition":"18S rRNA fragment from soil sample A","sample":"soil_A","taxid":"4932"}
attgcggtggagcatgttttcttcaaagattaaaggttggtgcatgcgagagtagtgcgt
ggaattcgtgg
>seq002 {"definition":"16S rRNA fragment from water sample B","sample":"water_B","taxid":"1760"}
gctggcggcaggcctaacacatgcaagtcgaacggtgaacagagcttgctcttcggtgtg
agtggcggacg
>seq003 {"definition":"ITS1 region from fungal isolate","sample":"soil_A","taxid":"5204"}
ttcgtgcgaattcgtgcaaatcgcgcctaagtgtgcgcaaagcaaagcttcggcggtgac
cgagtgctcgc
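
The transformation visible in this example can be mimicked in Python to check results independently. The sketch below replaces U with T, lowercases the sequence, and wraps it at 60 characters, which matches what the output file above shows. It describes the observed output only and makes no claim about how obiconvert implements --u-to-t; both helper names are hypothetical.

```python
def rna_to_dna(sequence):
    """Lowercase the sequence and replace uracil with thymine."""
    return sequence.lower().replace("u", "t")


def wrap(sequence, width=60):
    """Split a sequence into lines of at most `width` characters."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


# seq001 from input_rna.fasta
rna = "AUUGCGGUGGAGCAUGUUUUCUUCAAAGAUUAAAGGUUGGUGCAUGCGAGAGUAGUGCGUGGAAUUCGUGG"

for line in wrap(rna_to_dna(rna)):
    print(line)
```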

OBITools stores sequence annotations as JSON objects in the sequence header. Some downstream tools expect headers formatted according to the JSON standard. The file input.fastq illustrates a typical fastq file with annotation fields. Using --output-json-header ensures the headers are written in strict JSON format, regardless of the original header style.

📄 input.fastq
@seq001 DNA sequence with quality scores for FASTQ to FASTA conversion
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@seq002 Second sequence with moderate quality scores
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCT
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
@seq003 Third sequence with high quality scores
TTAACCGGTTAACCGGTTAACCGGTTAACCGGTTAACCG
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@seq004 Fourth sequence with variable quality scores
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACG
+
IIAAIIAAIIAAIIAAIIAAIIAAIIAAIIAAIIAAIIAA
obiconvert --fastq --output-json-header input.fastq -o output_jsonheader.fastq
📄 output_jsonheader.fastq
@seq001 {"definition":"DNA sequence with quality scores for FASTQ to FASTA conversion"}
atcgatcgatcgatcgatcgatcgatcgatcgatcgatcg
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@seq002 {"definition":"Second sequence with moderate quality scores"}
gctagctagctagctagctagctagctagctagctagct
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
@seq003 {"definition":"Third sequence with high quality scores"}
ttaaccggttaaccggttaaccggttaaccggttaaccg
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@seq004 {"definition":"Fourth sequence with variable quality scores"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacg
+
IIAAIIAAIIAAIIAAIIAAIIAAIIAAIIAAIIAAIIAA

The OBITools native header format encodes annotations as key=value pairs rather than JSON. The file input.fasta shown below carries only a free-text definition on each title line. Converting with --output-OBI-header produces a fasta file whose headers follow the OBI format, which is required by some older OBITools-based pipelines.

📄 input.fasta
>seq001 Test DNA sequence for FASTA conversion
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
>seq002 Another test sequence with different nucleotide content
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCT
>seq003 Third sequence for testing output format
TTAACCGGTTAACCGGTTAACCGGTTAACCGGTTAACCG
obiconvert --fasta --output-OBI-header input.fasta -o output_obi.fasta
📄 output_obi.fasta
>seq001  Test DNA sequence for FASTA conversion
atcgatcgatcgatcgatcgatcgatcgatcgatcgatcg
>seq002  Another test sequence with different nucleotide content
gctagctagctagctagctagctagctagctagctagct
>seq003  Third sequence for testing output format
ttaaccggttaaccggttaaccggttaaccggttaaccg
obiconvert --help