`obicsv`: converts sequence files to CSV format #

Description #

obicsv converts biological sequence datasets into CSV (comma-separated values) format. Each row in the output represents one sequence, and the columns are chosen explicitly by the user: the sequence identifier, the nucleotide sequence itself, per-base quality scores, taxonomic annotation, abundance count, sequence definition, and any annotation attributes stored in the sequence headers. This makes obicsv particularly useful when a biologist wants to analyse sequence metadata in a spreadsheet application, load annotations into R or Python, or export data for database ingestion. Rather than parsing OBITools JSON-annotated fasta or fastq headers manually, obicsv extracts the desired fields into a clean tabular format.

No column is included unless the corresponding flag is given. Flags such as --ids, --sequence, --quality, --count, --taxon, and --keep select individual columns; the --auto flag inspects the first batch of sequences and automatically selects all annotation attributes found there, saving the effort of enumerating column names manually. Values absent from a given sequence are represented by NA in the output.

graph TD
  A@{ shape: doc, label: "sequences.fasta" }
  C[obicsv]
  D@{ shape: doc, label: "out_ids_sample.csv" }
  A --> C:::obitools
  C --> D
  classDef obitools fill:#99d57c

Consider the annotated fasta file sequences.fasta, in which each record carries attributes such as sample, location, experiment, count, and optionally taxid:

📄 sequences.fasta

>seq001 {"sample":"S1","location":"Paris","experiment":"run1","count":42,"taxid":2}
ATGCATGCATGCATGCATGC
>seq002 {"sample":"S2","location":"Lyon","experiment":"run1","count":15,"taxid":2157}
GCTAGCTAGCTAGCTAGCTA
>seq003 {"sample":"S1","location":"Paris","experiment":"run2","count":7,"taxid":2759}
TTTTTTTTTTTTTTTTTTTT
>seq004 {"sample":"S3","location":"Grenoble","experiment":"run2","count":3}
AAAAATTTTTCCCCCGGGGG
>seq005 {"sample":"S2","location":"Lyon","experiment":"run3","count":20}
GGGGGAAAAATTTTTCCCCC
>seq006 {"sample":"S4","location":"Bordeaux","experiment":"run3","count":1}
CCCCCCGGGGGTTTTTAAAA

The simplest use of obicsv is to extract the sequence identifier alongside a single annotation attribute. Here --ids adds the identifier column and --keep sample adds the sample attribute:

obicsv --ids --keep sample sequences.fasta > out_ids_sample.csv

📄 out_ids_sample.csv

id,sample
seq001,S1
seq002,S2
seq003,S1
seq004,S3
seq005,S2
seq006,S4

When the list of annotation attributes is not known in advance, --auto inspects the first batch of sequences and includes every attribute it finds. Attributes absent from a given sequence receive NA:

obicsv --auto sequences.fasta > out_auto.csv

📄 out_auto.csv

count,experiment,location,sample,taxid
42,run1,Paris,S1,2
15,run1,Lyon,S2,2157
7,run2,Paris,S1,2759
3,run2,Grenoble,S3,NA
20,run3,Lyon,S2,NA
1,run3,Bordeaux,S4,NA

Note that seq004, seq005, and seq006 lack a taxid annotation; their value in the taxid column is NA. This is expected whenever annotation is incomplete: obicsv never drops a row because an attribute is missing.
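When the CSV is loaded into Python, the NA placeholder has to be mapped back to a missing value. The following is a minimal sketch using only the standard library. It assumes the content of out_auto.csv shown above, inlined as a string so the example is self-contained; the helper name `load_obicsv` is hypothetical and is not part of OBITools.

```python
import csv
import io

# Content of out_auto.csv as shown above, inlined so the sketch is self-contained.
OUT_AUTO = """\
count,experiment,location,sample,taxid
42,run1,Paris,S1,2
15,run1,Lyon,S2,2157
7,run2,Paris,S1,2759
3,run2,Grenoble,S3,NA
20,run3,Lyon,S2,NA
1,run3,Bordeaux,S4,NA
"""


def load_obicsv(handle, na_value="NA"):
    """Read obicsv-style CSV rows as dicts, mapping the placeholder to None.

    Hypothetical helper for illustration only; not part of OBITools.
    """
    rows = []
    for row in csv.DictReader(handle):
        rows.append({k: (None if v == na_value else v) for k, v in row.items()})
    return rows


rows = load_obicsv(io.StringIO(OUT_AUTO))
missing = sum(1 for r in rows if r["taxid"] is None)
print(f"{len(rows)} rows, {missing} without taxid")
```

With a real file, `open("out_auto.csv", newline="")` would replace the `io.StringIO` object. If a different placeholder was requested with --na-value, the same string would be passed as `na_value`.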

Synopsis #

obicsv [--auto] [--batch-mem <string>] [--batch-size <int>]
       [--batch-size-max <int>] [--compress|-Z] [--count] [--csv] [--debug]
       [--definition|-d] [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta]
       [--fastq] [--genbank] [--help|-h|-?] [--ids|-i] [--input-OBI-header]
       [--input-json-header] [--keep|-k <KEY>]... [--max-cpu <int>]
       [--na-value <NAVALUE>] [--no-order] [--no-progressbar] [--obipairing]
       [--out|-o <FILENAME>] [--pprof] [--pprof-goroutine <int>]
       [--pprof-mutex <int>] [--quality|-q] [--raw-taxid] [--sequence|-s]
       [--silent-warning] [--solexa] [--taxon] [--taxonomy|-t <string>]
       [--u-to-t] [--update-taxid] [--version] [--with-leaves] [<args>]

Options #

`obicsv` specific options #

--ids | -i : Print the sequence identifier as the first column of the output.
--sequence | -s : Print the nucleotide (or amino acid) sequence in the output.
--quality | -q : Print the per-base quality scores in the output.
--definition | -d : Print the sequence definition (title line text after the identifier) in the output.
--count: Print the count annotation attribute in the output.
--taxon: Print the taxid attribute as a dedicated column in the output. Note: only the numeric taxid is output; no scientific name column is produced. Sequences without a taxid annotation show NA.
--obipairing: Print the eight attributes added by the obipairing command: mode, seq_a_single, seq_b_single, ali_dir, score, score_norm, seq_ab_match, pairing_mismatches. This is only meaningful for sequences that have been processed by obipairing; for any other sequence, these columns contain NA.
--keep | -k <KEY>: Include the annotation attribute named KEY as an output column. Repeat the flag to include multiple attributes. If an attribute is absent from a sequence, its value appears as NA.
--auto: Inspect the first batch of sequences and automatically select all annotation attributes found there as output columns. Attributes that appear only in later batches will not be included in the header and their values will be treated as missing.
--na-value <NAVALUE>: Intended to customise the placeholder string for missing values in the output. Default: NA.

Taxonomic options #

--taxonomy | -t <string>: Path to the taxonomic database.
--fail-on-taxonomy: Cause obicsv to fail with an error if a taxid encountered is not currently valid in the taxonomy database.
--raw-taxid: Print taxids in the output without supplementary information (taxon name and rank).
--update-taxid: Automatically update taxids declared as merged to a newer one.
--with-leaves: When taxonomy is extracted from a sequence file, add sequences as leaves of their taxid annotation.

Controlling the input data #

OBITools4 generally recognizes the input file format, and also detects whether the input file is compressed with GZIP. In rare cases a file can be misidentified, so the following options allow the user to force the format, bypassing the format identification step.

The file format options #

--fasta: indicates that sequence data is in fasta format.
--fastq: indicates that sequence data is in fastq format.
--embl: indicates that sequence data is in EMBL-ENA flatfile format.
--csv: indicates that sequence data is in CSV format.
--genbank: indicates that sequence data is in GenBank flatfile format.
--ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.

Controlling the way OBITools4 formats annotations #

These options only apply to the FASTA and FASTQ formats.

--input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
--input-json-header: FASTA/FASTQ title line annotations follow the JSON format.

Controlling quality score decoding #

This option only applies to the FASTQ format.

--solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)

Controlling the output data #

--compress | -Z : output is compressed using gzip. (default: false)
--no-order: by default, OBITools preserves the order of sequences between the input and the output, and multiple input files are processed one at a time. With --no-order, several input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but the order of the sequences in the output is no longer guaranteed, and processing files in parallel may require more memory.
--fasta-output: writes sequence data in fasta format (default if quality data is not available).
--fastq-output: writes sequence data in fastq format (default if quality data is available).
--json-output: writes sequence data in JSON format.
--out | -o <FILENAME>: filename used for saving the output (default: “-”, the standard output)
--output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON).
--output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format).
--skip-empty: sequences of length equal to zero are removed from the output (default: false).
--no-progressbar: deactivates progress bar display (default: false).

General options #

--help | -h|-? : shows this help.
--version: prints the version and exits.
--silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.

--max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
--force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
--batch-size <INTEGER>: minimum number of sequences per batch for parallel processing (floor, default: 1, env: OBIBATCHSIZE)
--batch-size-max <INTEGER>: maximum number of sequences per batch for parallel processing (ceiling, default: 2000, env: OBIBATCHSIZEMAX)
--batch-mem <STRING>: maximum memory per batch (e.g. 128K, 64M, 1G; default: 128M; set to 0 to disable, env: OBIBATCHMEM)

--debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
--pprof: enables pprof server. Look at the log for details. (default: false).
--pprof-mutex <INTEGER>: enables profiling of mutex locks. (default: 10, env: OBIPPROFMUTEX)
--pprof-goroutine <INTEGER>: enables the goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)

Examples #

Export sequence content and quality from a fastq file:

The file reads.fastq contains four fastq records with free-text definitions and per-base quality scores. Exporting identifiers, definitions, sequences, and quality strings together gives a complete per-read table, useful for quality control or inspection in a spreadsheet:

📄 reads.fastq

@seq001 Bacteria amplicon read
ATGCATGCATGCATGCATGC
+
IIIIIIIIIIIIIIIIIIII
@seq002 Archaea amplicon read
GCTAGCTAGCTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIIII
@seq003 Eukaryota amplicon read
TTTTTTTTTTTTTTTTTTTT
+
IIIIIIIIIIIIIIIIIIII
@seq004 unknown origin read
AAAAATTTTTCCCCCGGGGG
+
IIIIIIIIIIIIIIIIIIII

obicsv --ids --sequence --quality --definition reads.fastq | csvlook

| id     | definition              | sequence             | qualities            |
| ------ | ----------------------- | -------------------- | -------------------- |
| seq001 | Bacteria amplicon read  | atgcatgcatgcatgcatgc | IIIIIIIIIIIIIIIIIIII |
| seq002 | Archaea amplicon read   | gctagctagctagctagcta | IIIIIIIIIIIIIIIIIIII |
| seq003 | Eukaryota amplicon read | tttttttttttttttttttt | IIIIIIIIIIIIIIIIIIII |
| seq004 | unknown origin read     | aaaaatttttcccccggggg | IIIIIIIIIIIIIIIIIIII |
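The qualities column holds the raw FASTQ quality string, not numeric scores. Assuming the standard Sanger encoding mentioned under --solexa (Phred score equals the ASCII code minus 33), the string can be decoded as sketched below. The function names are illustrative only and are not part of OBITools.

```python
def phred_scores(quality, offset=33):
    """Decode a Sanger-encoded quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality]


def mean_error_probability(quality):
    """Average per-base error probability, using P = 10 ** (-Q / 10)."""
    scores = phred_scores(quality)
    return sum(10 ** (-q / 10) for q in scores) / len(scores)


# qualities value of seq001 in the table above
qual = "IIIIIIIIIIIIIIIIIIII"
scores = phred_scores(qual)
print(scores[0], len(scores))
```

The character `I` has ASCII code 73, so every base in the example reads has a Phred score of 40, which corresponds to an error probability of 1 in 10,000. Files read with --solexa use a different encoding, and this sketch does not apply to them.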

Export abundance counts and taxonomic annotations alongside selected attributes:

After taxonomic assignment, sequences carry a taxid annotation and may carry count values from clustering or demultiplexing. The --count flag exports the abundance count, --taxon exports the numeric taxid, and --keep adds further annotation attributes. Sequences lacking a taxid produce NA in that column:

obicsv --count --taxon --keep location --keep experiment sequences.fasta | csvlook

| count | taxid | location  | experiment |
| ----- | ----- | --------- | ---------- |
|    42 |     2 | Paris     | run1       |
|    15 |  2157 | Lyon      | run1       |
|     7 |  2759 | Paris     | run2       |
|     3 |    NA | Grenoble  | run2       |
|    20 |    NA | Lyon      | run3       |
|     1 |    NA | Bordeaux  | run3       |
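Once exported, the count column can be aggregated by any other attribute. The sketch below sums read counts per location with the Python standard library. It assumes that the raw CSV behind the csvlook table above has the same columns in the same order, and inlines it as a string so the example is self-contained.

```python
import csv
import io
from collections import defaultdict

# Raw CSV assumed to correspond to the csvlook table above.
CSV_TEXT = """\
count,taxid,location,experiment
42,2,Paris,run1
15,2157,Lyon,run1
7,2759,Paris,run2
3,NA,Grenoble,run2
20,NA,Lyon,run3
1,NA,Bordeaux,run3
"""

# Sum the abundance counts for each sampling location.
totals = defaultdict(int)
for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    totals[row["location"]] += int(row["count"])

for location, total in sorted(totals.items()):
    print(location, total)
```

For the six example records this gives 49 reads for Paris, 35 for Lyon, 3 for Grenoble, and 1 for Bordeaux.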

Export identifiers and taxid, illustrating missing-value handling:

When only some sequences carry a taxid annotation, the remaining rows show NA. The --na-value flag is intended to customise this placeholder:

obicsv --ids --keep taxid --na-value MISSING sequences.fasta | csvlook

| id     | taxid   |
| ------ | ------- |
| seq001 | 2       |
| seq002 | 2157    |
| seq003 | 2759    |
| seq004 | MISSING |
| seq005 | MISSING |
| seq006 | MISSING |

Display the command-line help:

obicsv --help
