obijoin

obijoin: merge annotations contained in a file to another file #

Description #

obijoin enriches a primary sequence dataset with annotations from a secondary file by matching records on shared attribute values. For each sequence in the primary input, it finds all records in the secondary file that share the same value for one or more specified keys, then copies their annotation attributes onto the primary sequence. The operation is a left outer join: every primary sequence is preserved in the output; those without a matching partner keep their original annotations unchanged.

A common use case is adding sample metadata (collection site, experimental condition, or sequencing run) to a set of amplicon reads. The secondary file can be in any format that OBITools4 accepts, including fasta, fastq, and plain CSV spreadsheets; the format is auto-detected.

The workflow for the basic case — matching on a sample attribute — looks like this:

graph TD
  A@{ shape: doc, label: "input.fasta" }
  B@{ shape: doc, label: "metadata.csv" }
  C[obijoin]
  D@{ shape: doc, label: "out_basic.fasta" }
  A --> C
  B --> C:::obitools
  C --> D
  classDef obitools fill:#99d57c

The file input.fasta contains six sequences, each annotated with a sample identifier (S1–S4) and a barcode:

📄 input.fasta
>seq001 {"sample":"S1","barcode":"ATGC"}
ATGCATGCATGCATGCATGC
>seq002 {"sample":"S2","barcode":"GCTA"}
GCTAGCTAGCTAGCTAGCTA
>seq003 {"sample":"S3","barcode":"TTTT"}
TTTTTTTTTTTTTTTTTTTT
>seq004 {"sample":"S1","barcode":"ATGC"}
AAAAATTTTTCCCCCGGGGG
>seq005 {"sample":"S2","barcode":"GCTA"}
GGGGGAAAAATTTTTCCCCC
>seq006 {"sample":"S4","barcode":"AAAA"}
CCCCCCGGGGGTTTTTAAAAA

The file metadata.csv is a plain CSV spreadsheet mapping each sample identifier to a geographic location and an experiment name:

📄 metadata.csv
sample,location,experiment
S1,Paris,amplicon_run1
S2,Lyon,amplicon_run2

To merge the CSV metadata into the sequence dataset, matching records where the primary’s sample attribute equals the secondary’s sample column, run:

obijoin --join-with metadata.csv --by sample input.fasta > out_basic.fasta
📄 out_basic.fasta
>seq001 {"barcode":"ATGC","experiment":"amplicon_run1","location":"Paris","sample":"S1"}
atgcatgcatgcatgcatgc
>seq002 {"barcode":"GCTA","experiment":"amplicon_run2","location":"Lyon","sample":"S2"}
gctagctagctagctagcta
>seq003 {"barcode":"TTTT","sample":"S3"}
tttttttttttttttttttt
>seq004 {"barcode":"ATGC","experiment":"amplicon_run1","location":"Paris","sample":"S1"}
aaaaatttttcccccggggg
>seq005 {"barcode":"GCTA","experiment":"amplicon_run2","location":"Lyon","sample":"S2"}
gggggaaaaatttttccccc
>seq006 {"barcode":"AAAA","sample":"S4"}
ccccccgggggtttttaaaaa

Sequences seq001, seq002, seq004, and seq005 (belonging to samples S1 or S2) received the location and experiment attributes from the CSV. Sequences seq003 and seq006 (samples S3 and S4, absent from the CSV) were emitted unchanged with no extra annotations added.
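
The left outer join behaviour described above can be modelled in a few lines of Python. This is only a sketch of the semantics, not the obijoin implementation: the function name `left_join` and the representation of records as plain dictionaries are assumptions made for illustration, and so is the choice that, when several secondary records match, later ones overwrite earlier ones.

```python
def left_join(primary, secondary, key):
    """Left outer join of annotation dictionaries on one shared key (illustrative only)."""
    # Index the secondary records by key value; one value may map to several records.
    index = {}
    for record in secondary:
        if key in record:
            index.setdefault(record[key], []).append(record)

    joined = []
    for annotations in primary:
        merged = dict(annotations)  # copy, so the input is never mutated
        if key in annotations:
            # Assumption: with several partners, later records overwrite earlier ones.
            for partner in index.get(annotations[key], []):
                merged.update(partner)
        joined.append(merged)  # every primary record is kept, matched or not
    return joined


primary = [
    {"id": "seq001", "sample": "S1", "barcode": "ATGC"},
    {"id": "seq003", "sample": "S3", "barcode": "TTTT"},
]
metadata = [
    {"sample": "S1", "location": "Paris", "experiment": "amplicon_run1"},
    {"sample": "S2", "location": "Lyon", "experiment": "amplicon_run2"},
]

result = left_join(primary, metadata, "sample")
for annotations in result:
    print(annotations)
```

In this model seq001 gains location and experiment, while seq003 is returned with its original annotations, mirroring out_basic.fasta.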

Synopsis #

obijoin --join-with|-j <string> [--batch-mem <string>] [--batch-size <int>]
        [--batch-size-max <int>] [--by|-b <string>]... [--compress|-Z]
        [--csv] [--debug] [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta]
        [--fasta-output] [--fastq] [--fastq-output] [--genbank]
        [--help|-h|-?] [--input-OBI-header] [--input-json-header]
        [--json-output] [--max-cpu <int>] [--no-order] [--no-progressbar]
        [--out|-o <FILENAME>] [--output-OBI-header|-O] [--output-json-header]
        [--pprof] [--pprof-goroutine <int>] [--pprof-mutex <int>]
        [--raw-taxid] [--silent-warning] [--skip-empty] [--solexa]
        [--taxonomy|-t <string>] [--u-to-t] [--update-id|-i]
        [--update-quality|-q] [--update-sequence|-s] [--update-taxid]
        [--version] [--with-leaves] [<args>]

Options #

obijoin specific options #

  • --join-with | -j <FILENAME>: Path to the secondary file whose records are joined onto the primary sequences. Required. The file can be in any format accepted by OBITools4 (including fasta, fastq, CSV, EMBL, GenBank, and ecoPCR); the format is auto-detected.
  • --by | -b <string>: Declares a join key, either as a single attribute name or as a primary_attr=secondary_attr mapping (see the first example below). Repeat the flag to require several keys to match simultaneously (all must match for a pair to be considered a hit). When --by is omitted, records are matched on the sequence identifier (id).
  • --update-id | -i: Replace the identifier of each primary sequence with the identifier from its matched partner record. Default: false.
  • --update-sequence | -s: Replace the nucleotide sequence of each primary sequence with the sequence from its matched partner. Default: false.
  • --update-quality | -q: Replace the per-base quality scores of each primary sequence with the quality scores from its matched partner. Relevant only when both datasets carry quality information (fastq). Default: false.
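
The two forms accepted by --by, and the default used when the flag is omitted, can be pictured with a small parser. The sketch below is hypothetical: `parse_by` is not an OBITools4 function, it only restates the rules given in the option description above.

```python
def parse_by(specs):
    """Turn --by style strings into (primary_attr, secondary_attr) pairs (illustrative only)."""
    if not specs:
        # No --by given: match on the sequence identifier, as documented above.
        return [("id", "id")]
    pairs = []
    for spec in specs:
        primary_attr, separator, secondary_attr = spec.partition("=")
        # "sample" means the same attribute name on both sides;
        # "sample=well" maps a primary attribute to a differently named secondary one.
        pairs.append((primary_attr, secondary_attr if separator else primary_attr))
    return pairs


print(parse_by(["sample=well"]))        # [('sample', 'well')]
print(parse_by(["sample", "barcode"]))  # [('sample', 'sample'), ('barcode', 'barcode')]
print(parse_by([]))                     # [('id', 'id')]
```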

Taxonomic options #

  • --taxonomy | -t <string>: Path to the taxonomic database.
  • --fail-on-taxonomy: Cause obijoin to fail with an error if a taxid encountered during processing is not currently valid in the taxonomy database. Default: false.
  • --raw-taxid: Print taxids in output files without supplementary information (taxon name and rank). Default: false.
  • --update-taxid: Automatically update taxids that are declared as merged to a newer one in the taxonomy database. Default: false.
  • --with-leaves: When taxonomy is extracted from a sequence file, add sequences as leaves of their taxid annotation in the taxonomy tree. Default: false.

Controlling the input data #

OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. In rare cases a file is misidentified, so the following options allow the user to force the format, thus bypassing the format identification step.

The file format options #

  • --fasta: indicates that sequence data is in fasta format.
  • --fastq: indicates that sequence data is in fastq format.
  • --embl: indicates that sequence data is in EMBL-ENA flatfile format.
  • --csv: indicates that sequence data is in CSV format.
  • --genbank: indicates that sequence data is in GenBank flatfile format.
  • --ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.

Controlling the way OBITools4 formats annotations #

These options only apply to the FASTA and FASTQ formats.

  • --input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
  • --input-json-header: FASTA/FASTQ title line annotations follow the JSON format.

Controlling quality score decoding #

This option only applies to the FASTQ format.

  • --solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)
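
To give an intuition for what a format identification step can look like, here is a minimal sketch. It is not how OBITools4 detects formats (that logic is not described on this page); the function `sniff` is hypothetical and relies only on well-known conventions: gzip data starts with the bytes 0x1f 0x8b, fasta records start with ">", and fastq records start with "@".

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def sniff(data):
    """Return (is_gzip, guessed_format) for the full content of a small file."""
    compressed = data.startswith(GZIP_MAGIC)
    if compressed:
        data = gzip.decompress(data)
    text = data.lstrip()
    if text.startswith(b">"):
        return compressed, "fasta"
    if text.startswith(b"@"):
        return compressed, "fastq"
    # Anything else (CSV, EMBL, GenBank, ...) is out of scope for this sketch.
    return compressed, "unknown"


print(sniff(b">seq001\nACGT\n"))
print(sniff(gzip.compress(b"@seq001\nACGT\n+\nIIII\n")))
```

A heuristic of this kind can be fooled by unusual files, which is why options that force the format are useful.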

Controlling the output data #

  • --compress | -Z: output is compressed using gzip. (default: false)
  • --no-order: by default, OBITools preserves the order of sequences between the input file and the output file, and multiple input files are processed one at a time. With --no-order, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but the order of the sequences in the output file is no longer guaranteed. Processing multiple files in parallel may also require more memory.
  • --fasta-output: writes sequence data in fasta format (default if quality data is not available).
  • --fastq-output: writes sequence data in fastq format (default if quality data is available).
  • --json-output: writes sequence data in JSON format.
  • --out | -o <FILENAME>: filename used for saving the output (default: “-”, the standard output)
  • --output-OBI-header | -O: writes output FASTA/FASTQ title line annotations in OBI format (default: JSON).
  • --output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format).
  • --skip-empty: sequences of length equal to zero are removed from the output (default: false).
  • --no-progressbar: deactivates progress bar display (default: false).

General options #

  • --help | -h | -?: shows this help.
  • --version: prints the version and exits.
  • --silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.
  • --max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
  • --force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
  • --batch-size <INTEGER>: minimum number of sequences per batch for parallel processing (floor, default: 1, env: OBIBATCHSIZE)
  • --batch-size-max <INTEGER>: maximum number of sequences per batch for parallel processing (ceiling, default: 2000, env: OBIBATCHSIZEMAX)
  • --batch-mem <STRING>: maximum memory per batch (e.g. 128K, 64M, 1G; default: 128M; set to 0 to disable, env: OBIBATCHMEM)
  • --debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
  • --pprof: enables pprof server. Look at the log for details. (default: false).
  • --pprof-mutex <INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
  • --pprof-goroutine <INTEGER>: enables profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)

Examples #

Join on a cross-attribute key:

Sometimes the primary dataset and the secondary annotation file use different column names for the same identifier. The primary_attr=secondary_attr syntax of --by maps the primary attribute to the secondary one. Here the primary sequences have a sample attribute while the annotation CSV uses well:

📄 input.fasta

>seq001 {"sample":"S1","barcode":"ATGC"}
ATGCATGCATGCATGCATGC
>seq002 {"sample":"S2","barcode":"GCTA"}
GCTAGCTAGCTAGCTAGCTA
>seq003 {"sample":"S3","barcode":"TTTT"}
TTTTTTTTTTTTTTTTTTTT
>seq004 {"sample":"S1","barcode":"ATGC"}
AAAAATTTTTCCCCCGGGGG
>seq005 {"sample":"S2","barcode":"GCTA"}
GGGGGAAAAATTTTTCCCCC
>seq006 {"sample":"S4","barcode":"AAAA"}
CCCCCCGGGGGTTTTTAAAAA
📄 well_metadata.csv
well,location,experiment
S1,Paris,amplicon_run1
S2,Lyon,amplicon_run2

obijoin --join-with well_metadata.csv \
        --by sample=well \
        input.fasta > out_crosskey.fasta
📄 out_crosskey.fasta
>seq001 {"barcode":"ATGC","experiment":"amplicon_run1","location":"Paris","sample":"S1","well":"S1"}
atgcatgcatgcatgcatgc
>seq002 {"barcode":"GCTA","experiment":"amplicon_run2","location":"Lyon","sample":"S2","well":"S2"}
gctagctagctagctagcta
>seq003 {"barcode":"TTTT","sample":"S3"}
tttttttttttttttttttt
>seq004 {"barcode":"ATGC","experiment":"amplicon_run1","location":"Paris","sample":"S1","well":"S1"}
aaaaatttttcccccggggg
>seq005 {"barcode":"GCTA","experiment":"amplicon_run2","location":"Lyon","sample":"S2","well":"S2"}
gggggaaaaatttttccccc
>seq006 {"barcode":"AAAA","sample":"S4"}
ccccccgggggtttttaaaaa

The well column value from the CSV is copied onto each matched sequence together with location and experiment. Sequences with no match (S3, S4) are emitted unchanged.


Join on two keys simultaneously, then update sequence identifiers:

The file references.fasta contains two reference sequences each annotated with both sample and barcode. Using --by sample --by barcode requires both attributes to match before a join is made. Adding --update-id replaces the primary sequence’s identifier with the reference identifier, which is useful when sequence IDs need to track which reference was matched.

📄 input.fasta

>seq001 {"sample":"S1","barcode":"ATGC"}
ATGCATGCATGCATGCATGC
>seq002 {"sample":"S2","barcode":"GCTA"}
GCTAGCTAGCTAGCTAGCTA
>seq003 {"sample":"S3","barcode":"TTTT"}
TTTTTTTTTTTTTTTTTTTT
>seq004 {"sample":"S1","barcode":"ATGC"}
AAAAATTTTTCCCCCGGGGG
>seq005 {"sample":"S2","barcode":"GCTA"}
GGGGGAAAAATTTTTCCCCC
>seq006 {"sample":"S4","barcode":"AAAA"}
CCCCCCGGGGGTTTTTAAAAA
📄 references.fasta
>ref001 {"sample":"S1","barcode":"ATGC"}
ATGCATGCATGCATGCATGCATGC
>ref002 {"sample":"S2","barcode":"GCTA"}
GCTAGCTAGCTAGCTAGCTAGCTA

obijoin --join-with references.fasta \
        --by sample --by barcode \
        --update-id \
        input.fasta > out_multikey.fasta
📄 out_multikey.fasta
>ref001 {"barcode":"ATGC","sample":"S1"}
atgcatgcatgcatgcatgc
>ref002 {"barcode":"GCTA","sample":"S2"}
gctagctagctagctagcta
>seq003 {"barcode":"TTTT","sample":"S3"}
tttttttttttttttttttt
>ref001 {"barcode":"ATGC","sample":"S1"}
aaaaatttttcccccggggg
>ref002 {"barcode":"GCTA","sample":"S2"}
gggggaaaaatttttccccc
>seq006 {"barcode":"AAAA","sample":"S4"}
ccccccgggggtttttaaaaa

Sequences seq001 and seq004 now carry the identifier ref001; seq002 and seq005 carry ref002. The two unmatched sequences (seq003, seq006) keep their original IDs.
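
The "all keys must match" rule combined with --update-id can be sketched by indexing the reference records on a tuple of key values. This is an illustrative model only: `join_update_id` is a hypothetical name, records are simplified to (identifier, annotations) pairs, the index keeps a single partner per key combination, and seq007 is an invented record added to show a partial match.

```python
def join_update_id(primary, secondary, keys):
    """Join on several keys at once and take the partner's identifier (illustrative only)."""
    # Index partners by the tuple of all key values; a partner lacking a key is ignored.
    index = {}
    for ref_id, annotations in secondary:
        if all(k in annotations for k in keys):
            index[tuple(annotations[k] for k in keys)] = (ref_id, annotations)

    joined = []
    for seq_id, annotations in primary:
        has_all_keys = all(k in annotations for k in keys)
        composite = tuple(annotations.get(k) for k in keys)
        if has_all_keys and composite in index:
            ref_id, ref_annotations = index[composite]
            joined.append((ref_id, {**annotations, **ref_annotations}))
        else:
            # No partner: identifier and annotations stay as they were.
            joined.append((seq_id, dict(annotations)))
    return joined


reads = [
    ("seq001", {"sample": "S1", "barcode": "ATGC"}),
    ("seq003", {"sample": "S3", "barcode": "TTTT"}),
    ("seq007", {"sample": "S1", "barcode": "GCTA"}),  # hypothetical: only one key matches
]
references = [
    ("ref001", {"sample": "S1", "barcode": "ATGC"}),
    ("ref002", {"sample": "S2", "barcode": "GCTA"}),
]

result = join_update_id(reads, references, ["sample", "barcode"])
print([seq_id for seq_id, _ in result])  # ['ref001', 'seq003', 'seq007']
```

The invented seq007 shares its sample with ref001 and its barcode with ref002, yet it matches neither, because both keys must agree with the same partner.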


Replace sequences and quality scores with corrected values from a FASTQ file:

After error-correction or quality trimming, the corrected reads may be stored in a separate file. obijoin can replace the sequence and quality data of the original reads with the corrected values using --update-sequence and --update-quality. Because --by is omitted in this example, records are matched on the sequence identifier. Sequences absent from the corrected file (here seq003) are kept unchanged.

The file input.fastq is the original dataset:

📄 input.fastq
@seq001 {"sample":"S1"}
ATGCATGCATGCATGCATGC
+
IIIIIIIIIIIIIIIIIIII
@seq002 {"sample":"S2"}
GCTAGCTAGCTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIIII
@seq003 {"sample":"S3"}
TTTTTTTTTTTTTTTTTTTT
+
IIIIIIIIIIIIIIIIIIII

The file corrected.fastq provides updated sequences and qualities for seq001 and seq002:

📄 corrected.fastq
@seq001
CCCCCCCCCCCCCCCCCCCC
+
BBBBBBBBBBBBBBBBBBBB
@seq002
TTTTTTTTTTTTTTTTTTTT
+
BBBBBBBBBBBBBBBBBBBB

obijoin --join-with corrected.fastq \
        --update-sequence --update-quality \
        input.fastq > out_updated.fastq
📄 out_updated.fastq
@seq001 {"sample":"S1"}
cccccccccccccccccccc
+
BBBBBBBBBBBBBBBBBBBB
@seq002 {"sample":"S2"}
tttttttttttttttttttt
+
BBBBBBBBBBBBBBBBBBBB
@seq003 {"sample":"S3"}
tttttttttttttttttttt
+
IIIIIIIIIIIIIIIIIIII
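
The effect of --update-sequence and --update-quality when matching on the identifier can be modelled as follows. The `Read` class and `apply_corrections` function are assumptions made for this sketch, not OBITools4 types, and the sketch does not reproduce details of the real output such as the lower-case sequences.

```python
from dataclasses import dataclass, field, replace


@dataclass
class Read:
    id: str
    sequence: str
    quality: str
    annotations: dict = field(default_factory=dict)


def apply_corrections(originals, corrected):
    """Replace sequence and quality of each read that has a partner with the same id."""
    by_id = {read.id: read for read in corrected}
    updated = []
    for read in originals:
        partner = by_id.get(read.id)
        if partner is None:
            updated.append(read)  # no partner: the read is kept unchanged
        else:
            # Only sequence and quality are replaced; id and annotations are kept.
            updated.append(replace(read, sequence=partner.sequence, quality=partner.quality))
    return updated


originals = [
    Read("seq001", "ATGCATGCATGCATGCATGC", "I" * 20, {"sample": "S1"}),
    Read("seq002", "GCTAGCTAGCTAGCTAGCTA", "I" * 20, {"sample": "S2"}),
    Read("seq003", "TTTTTTTTTTTTTTTTTTTT", "I" * 20, {"sample": "S3"}),
]
corrected = [
    Read("seq001", "C" * 20, "B" * 20),
    Read("seq002", "T" * 20, "B" * 20),
]

updated = apply_corrections(originals, corrected)
for read in updated:
    print(read.id, read.sequence, read.quality, read.annotations)
```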

Use an OBITools CSV file as primary input and write compressed output:

When the primary sequences are stored in OBITools CSV format (e.g., from a previous obicsv export), use --csv to force CSV reading. The secondary annotation file is always auto-detected. Here primary.csv is the primary input:

📄 primary.csv

id,sequence,sample,barcode
seq001,ATGCATGCATGCATGCATGC,S1,ATGC
seq002,GCTAGCTAGCTAGCTAGCTA,S2,GCTA
seq003,TTTTTTTTTTTTTTTTTTTT,S3,TTTT
📄 metadata.csv
sample,location,experiment
S1,Paris,amplicon_run1
S2,Lyon,amplicon_run2

obijoin --join-with metadata.csv --by sample \
        --csv --fasta-output --compress \
        --no-progressbar \
        primary.csv > out_compressed.fasta.gz

This produces a gzipped fasta file, out_compressed.fasta.gz, which can be decompressed to obtain the following fasta file:

gunzip out_compressed.fasta.gz
📄 out_compressed.fasta
>seq001 {"barcode":"ATGC","experiment":"amplicon_run1","location":"Paris","sample":"S1"}
atgcatgcatgcatgcatgc
>seq002 {"barcode":"GCTA","experiment":"amplicon_run2","location":"Lyon","sample":"S2"}
gctagctagctagctagcta
>seq003 {"barcode":"TTTT","sample":"S3"}
tttttttttttttttttttt
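
A compressed result does not have to be decompressed on disk before it is inspected. The sketch below reads a gzipped fasta file with JSON title-line annotations using only the Python standard library. The `read_fasta_gz` helper is hypothetical, and to keep the block self-contained the demonstration writes its own small file (demo.fasta.gz, an invented name) instead of relying on an actual obijoin run.

```python
import gzip
import json
import os
import tempfile


def read_fasta_gz(path):
    """Yield (identifier, annotations, sequence) from a gzipped fasta file with JSON headers."""
    identifier, annotations, chunks = None, {}, []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if identifier is not None:
                    yield identifier, annotations, "".join(chunks)
                # Title line: identifier, then an optional JSON annotation object.
                identifier, _, header = line[1:].partition(" ")
                annotations = json.loads(header) if header.strip() else {}
                chunks = []
            elif line:
                chunks.append(line)
    if identifier is not None:
        yield identifier, annotations, "".join(chunks)


demo = (
    '>seq001 {"barcode":"ATGC","location":"Paris","sample":"S1"}\n'
    "atgcatgcatgcatgcatgc\n"
    '>seq003 {"barcode":"TTTT","sample":"S3"}\n'
    "tttttttttttttttttttt\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.fasta.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(demo)
    records = list(read_fasta_gz(path))

for identifier, annotations, sequence in records:
    print(identifier, annotations.get("location", "(no location)"), len(sequence))
```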

obijoin --help