`obicsv`: converts sequence files to CSV format #

Preliminary AI-generated documentation

This page was automatically generated by an AI assistant and has not yet been reviewed or validated by the OBITools4 development team. It may contain inaccuracies or incomplete information. Use with caution and refer to the command’s --help output for authoritative option descriptions.

Description #

obicsv converts biological sequence data into CSV format for easy inspection, spreadsheet analysis, or integration with other tools. A biologist might use it to export sequences from OBITools for quality control, taxonomic inspection, or downstream analysis in R or Python.

No columns are output unless explicitly selected with flags such as --ids, --sequence, --quality, --taxon, --auto, or --keep. Multiple flags can be combined to choose the desired columns. The command uses parallel workers to process large datasets efficiently and can write output to stdout or directly to a file.

graph TD
  A@{ shape: doc, label: "sequences.fastq" }
  C[obicsv]
  D@{ shape: doc, label: "output.csv" }
  A --> C:::obitools
  C --> D
  classDef obitools fill:#99d57c

The file sequences.fastq contains three FASTQ records with quality scores:

📄 sequences.fastq

@seq001 Sample sequence for testing
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@seq002 Another test sequence
GGGGAAAATTTTCCCCGGGGAAAATTTTCCCCGGGGAAAATTTTCCCCGGGGAAAATTTT
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@seq003 Third sequence
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+
KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK

When you export with --ids --sequence, you get:

obicsv --ids --sequence sequences.fastq -o output1.csv

id,sequence
seq001,atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc
seq002,ggggaaaattttccccggggaaaattttccccggggaaaattttccccggggaaaatttt
seq003,cccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
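
For scripted pipelines, the same command line can be assembled from Python. The sketch below is illustrative only: `build_obicsv_command` is a hypothetical helper, not part of OBITools4, and it uses only flags documented on this page. It builds the argument list without running anything, so it works even where obicsv is not installed.

```python
import shlex


def build_obicsv_command(input_path, output_path, ids=True, sequence=True,
                         quality=False, keep=()):
    """Assemble an obicsv argument list (hypothetical helper, not an OBITools4 API)."""
    cmd = ["obicsv"]
    if ids:
        cmd.append("--ids")
    if sequence:
        cmd.append("--sequence")
    if quality:
        cmd.append("--quality")
    # --keep may be repeated, once per attribute name.
    for key in keep:
        cmd.extend(["--keep", key])
    cmd.extend([input_path, "-o", output_path])
    return cmd


cmd = build_obicsv_command("sequences.fastq", "output1.csv")
print(shlex.join(cmd))  # obicsv --ids --sequence sequences.fastq -o output1.csv
```

To execute the command, pass the list to `subprocess.run(cmd, check=True)`; this requires obicsv to be available on the PATH.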

Synopsis #

obicsv [--auto] [--batch-mem <string>] [--batch-size <int>]
       [--batch-size-max <int>] [--compress|-Z] [--count] [--csv] [--debug]
       [--definition|-d] [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta]
       [--fastq] [--genbank] [--help|-h|-?] [--ids|-i] [--input-OBI-header]
       [--input-json-header] [--keep|-k <KEY>]... [--max-cpu <int>]
       [--na-value <NAVALUE>] [--no-order] [--no-progressbar] [--obipairing]
       [--out|-o <FILENAME>] [--pprof] [--pprof-goroutine <int>]
       [--pprof-mutex <int>] [--quality|-q] [--raw-taxid] [--sequence|-s]
       [--silent-warning] [--solexa] [--taxon] [--taxonomy|-t <string>]
       [--u-to-t] [--update-taxid] [--version] [--with-leaves] [<args>]

Options #

`obicsv` specific options #

--ids | -i : Include the sequence identifier column in the CSV output. Useful for tracking or linking sequences.
--sequence | -s : Include the nucleotide or amino acid sequence in the CSV output. This is the main biological data column.
--quality | -q : Include quality scores for each position in the CSV output. Essential for quality control and filtering downstream.
--definition | -d : Include the sequence description or definition from the source file in the CSV output.
--count: Include the count attribute, representing how many original reads were collapsed into this sequence (e.g., after dereplication with obiuniq).
--taxon: Include taxonomic information in the CSV output. Outputs both the NCBI taxid and the scientific name. Requires a taxonomy database (see --taxonomy).
--obipairing: Include attributes that were added by the obipairing command (pairing scores, mismatches, etc.).
--auto: Automatically detect which columns to output by examining the first batch of sequences. Outputs all annotation attributes found in the headers. Can be combined with --ids, --sequence, etc. to add those columns on top of the auto-detected ones.
--keep | -k <KEY>: Keep only the specified attribute(s). Can be used multiple times to keep several columns. Useful for extracting specific annotations. If the specified attributes are not present in the input, NA values are output.
--na-value <NAVALUE>: String to use for missing or unavailable values in the CSV. Customize for compatibility with other tools (e.g., empty string, “NA”, “null”). Default: “NA”.
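
When the CSV is read downstream, the missing-value marker has to be mapped back to a real null. A minimal Python sketch follows, assuming the default marker "NA"; the CSV content and the `read_rows` helper are made up for illustration and are not produced by or part of obicsv.

```python
import csv
import io

NA_VALUE = "NA"  # must match the string given to --na-value (default "NA")

# Hypothetical obicsv output in which seq002 lacks the "sample" attribute.
csv_text = """id,sample,taxid
seq001,soil_A,2
seq002,NA,2157
"""


def read_rows(handle, na_value=NA_VALUE):
    """Yield CSV rows as dicts, replacing the missing-value marker with None."""
    for row in csv.DictReader(handle):
        yield {key: (None if value == na_value else value)
               for key, value in row.items()}


rows = list(read_rows(io.StringIO(csv_text)))
print(rows[1]["sample"])  # None
```

If a different string is passed to --na-value, pass the same string as `na_value` so that both sides agree.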

Taxonomic options #

--taxonomy | -t <string>: Path to the taxonomic database.

Controlling the input data #

OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step.

The file format options #

--fasta: indicates that sequence data is in fasta format.
--fastq: indicates that sequence data is in fastq format.
--embl: indicates that sequence data is in EMBL-ENA flatfile format.
--csv: indicates that sequence data is in CSV format.
--genbank: indicates that sequence data is in GenBank flatfile format.
--ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.

Controlling how OBITools4 interprets annotations #

These options apply only to the FASTA and FASTQ formats.

--input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
--input-json-header: FASTA/FASTQ title line annotations follow the JSON format.

Controlling quality score decoding #

This option applies only to the FASTQ format.

--solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)

Controlling the output data #

--compress | -Z : output is compressed using gzip. (default: false)
--no-order: by default, OBITools preserves the order of sequences between the input and the output, and processes multiple input files one at a time. With --no-order, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but the order of the sequences in the output file is no longer guaranteed, and processing several files in parallel may require more memory.
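
Because --no-order does not guarantee row order, two exports of the same data may differ line by line while holding identical records. One way to compare them is to index rows by identifier, as in the Python sketch below. It assumes that both files were exported with --ids and that identifiers are unique; the two CSV strings are made-up stand-ins, and `rows_by_id` is a hypothetical helper.

```python
import csv
import io


def rows_by_id(handle, key="id"):
    """Index CSV rows by identifier so that row order no longer matters."""
    return {row[key]: row for row in csv.DictReader(handle)}


# Hypothetical outputs of the same export, without and with --no-order.
ordered = "id,sample\nseq001,soil_A\nseq002,soil_A\nseq003,soil_B\n"
unordered = "id,sample\nseq003,soil_B\nseq001,soil_A\nseq002,soil_A\n"

same = rows_by_id(io.StringIO(ordered)) == rows_by_id(io.StringIO(unordered))
print(same)  # True
```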
--fasta-output: writes sequence data in fasta format (default if quality data is not available).
--fastq-output: writes sequence data in fastq format (default if quality data is available).
--json-output: writes sequence data in JSON format.
--out | -o <FILENAME>: filename used for saving the output (default: “-”, the standard output)
--output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON).
--output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format).
--skip-empty: sequences of length equal to zero are removed from the output (default: false).
--no-progressbar: deactivates progress bar display (default: false).

General options #

--help | -h|-? : shows this help.
--version: prints the version and exits.
--silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.

--max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
--force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
--batch-size <INTEGER>: minimum number of sequences per batch for parallel processing (floor, default: 1, env: OBIBATCHSIZE)
--batch-size-max <INTEGER>: maximum number of sequences per batch for parallel processing (ceiling, default: 2000, env: OBIBATCHSIZEMAX)
--batch-mem <STRING>: maximum memory per batch (e.g. 128K, 64M, 1G; default: 128M; set to 0 to disable, env: OBIBATCHMEM)

--debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
--pprof: enables pprof server. Look at the log for details. (default: false).
--pprof-mutex <INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
--pprof-goroutine <INTEGER>: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)

Examples #

Export sequences with identifiers and sequence data:

obicsv --ids --sequence sequences.fastq -o output1.csv

📄 output1.csv

id,sequence
seq001,atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc
seq002,ggggaaaattttccccggggaaaattttccccggggaaaattttccccggggaaaatttt
seq003,cccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
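
A file like this can be read with Python's standard csv module. The sketch below computes the length of each sequence; it uses a shortened, made-up stand-in for output1.csv (same header, truncated sequences) so that it runs without any file on disk, and `sequence_lengths` is a hypothetical helper name.

```python
import csv
import io

# Shortened stand-in for output1.csv: same columns, truncated sequences.
csv_text = "id,sequence\nseq001,atgcatgc\nseq002,ggggaaaatttt\n"


def sequence_lengths(handle):
    """Map each sequence identifier to the length of its sequence."""
    return {row["id"]: len(row["sequence"]) for row in csv.DictReader(handle)}


lengths = sequence_lengths(io.StringIO(csv_text))
print(lengths)  # {'seq001': 8, 'seq002': 12}
```

For a real export, replace the `io.StringIO` object with `open("output1.csv", newline="")`.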

Export with quality scores included:

obicsv --ids --sequence --quality sequences.fastq -o output2.csv

📄 output2.csv

id,sequence,qualities
seq001,atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc,IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
seq002,ggggaaaattttccccggggaaaattttccccggggaaaattttccccggggaaaatttt,JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
seq003,cccccccccccccccccccccccccccccccccccccccccccccccccccccccccc,KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK
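
The qualities column shown above holds the quality string as ASCII characters. Assuming the standard Sanger encoding (Phred score = ASCII code minus 33), the scores can be recovered in Python as sketched below. The helper names are hypothetical, and the offset is an assumption that should be checked against how the input was decoded.

```python
def phred_scores(quality_string, offset=33):
    """Convert an ASCII quality string to a list of Phred scores (Sanger offset assumed)."""
    return [ord(char) - offset for char in quality_string]


def mean_quality(quality_string, offset=33):
    """Arithmetic mean of the Phred scores, or 0.0 for an empty string."""
    scores = phred_scores(quality_string, offset)
    return sum(scores) / len(scores) if scores else 0.0


print(phred_scores("IJK"))   # [40, 41, 42]
print(mean_quality("IIII"))  # 40.0
```

Under this assumption, the characters I, J, and K used in the example records correspond to Phred scores 40, 41, and 42.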

Auto-detect annotation columns from sequence headers:

The file sequences.fasta contains annotated FASTA sequences:

📄 sequences.fasta

>seq001 {"sample":"soil_A","taxid":2}
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
>seq002 {"sample":"soil_A","taxid":2157}
GGGGAAAATTTTCCCCGGGGAAAATTTTCCCCGGGGAAAATTTTCCCCGGGGAAAATTTT
>seq003 {"sample":"soil_B","taxid":2759}
CCCGGGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC

obicsv --auto --ids sequences.fasta -o output4.csv

📄 output4.csv

id,sample,taxid
seq001,soil_A,2
seq002,soil_A,2157
seq003,soil_B,2759
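
Once annotations are columns, simple summaries become one-liners. As an illustration, the Python sketch below counts sequences per sample using the rows of output4.csv shown above, embedded as a string so the example is self-contained.

```python
import csv
import io
from collections import Counter

# Contents of output4.csv as shown above.
csv_text = """id,sample,taxid
seq001,soil_A,2
seq002,soil_A,2157
seq003,soil_B,2759
"""

# Count how many sequences belong to each sample.
counts = Counter(row["sample"] for row in csv.DictReader(io.StringIO(csv_text)))
print(dict(counts))  # {'soil_A': 2, 'soil_B': 1}
```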

Extract specific attributes:

obicsv --keep sample --keep taxid sequences.fasta -o output5.csv

📄 output5.csv

taxid,sample
2,soil_A
2157,soil_A
2759,soil_B

Export with gzip compression:

obicsv --ids --sequence -Z sequences.fasta -o output6.csv.gz

The output is compressed and can be decompressed with gunzip -c output6.csv.gz.
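
The compressed file can also be read directly, without decompressing it first. The Python sketch below uses the standard gzip and csv modules; to stay runnable anywhere, it first writes a tiny stand-in file (made-up content, not real obicsv output) into a temporary directory. `read_gzipped_csv` is a hypothetical helper name.

```python
import csv
import gzip
import os
import tempfile


def read_gzipped_csv(path):
    """Read a gzip-compressed CSV file into a list of row dictionaries."""
    with gzip.open(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle))


# Write a small stand-in for output6.csv.gz, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output6.csv.gz")
    with gzip.open(path, "wt", newline="") as handle:
        handle.write("id,sequence\nseq001,atgc\n")
    rows = read_gzipped_csv(path)

print(rows[0]["id"])  # seq001
```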

Display the built-in help:

obicsv --help