`obiscript`: executes a Lua script on the input sequences #

Preliminary AI-generated documentation

This page was automatically generated by an AI assistant and has not yet been reviewed or validated by the OBITools4 development team. It may contain inaccuracies or incomplete information. Use with caution and refer to the command’s --help output for authoritative option descriptions.

Description #

obiscript lets you apply fully custom logic to every sequence in a dataset by writing a short Lua script, without recompiling OBITools4 or writing Go code. It is the right tool whenever the built-in commands (obigrep, obiannotate, and the like) are not flexible enough: for instance, to compute a new attribute from the sequence itself, to maintain a running counter across all records, or to merge external data sources into the annotations.
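As an illustration of computing an attribute from the sequence itself, the sketch below stores the GC fraction of each record in a hypothetical gc_content attribute. It assumes that the BioSequence object exposes a sequence() method returning the nucleotides as a string; this accessor is not shown elsewhere on this page, so check it against your OBITools4 version. The helper gc_fraction is plain Lua and is introduced here only for the example.

```lua
-- Hypothetical helper: fraction of G/C characters in a nucleotide string.
local function gc_fraction(nucleotides)
    if #nucleotides == 0 then
        return 0
    end
    -- string.gsub returns the number of substitutions as its second result
    local _, gc = string.gsub(nucleotides, "[GCgc]", "")
    return gc / #nucleotides
end

function worker(sequence)
    -- ASSUMPTION: sequence:sequence() returns the nucleotide string
    local nucleotides = sequence:sequence()
    -- "gc_content" is an arbitrary attribute name chosen for this sketch
    sequence:attribute("gc_content", gc_fraction(nucleotides))
    return sequence
end
```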

The script is structured around three optional Lua functions.

The worker(sequence) function is the core: it is called once per input sequence record, receives the record as a BioSequence object, and must return the (possibly modified) record, a set of sequence records, or nil to drop the current sequence from the output.
The begin() function runs once before any record is processed and is typically used for initialisation.
The finish() function runs once after the last record and is typically used to print summary statistics. A thread-safe key-value table called obicontext is shared across all parallel worker invocations and can be used to accumulate results safely.
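The three functions can be combined as in the following minimal sketch, which drops every record whose identifier does not start with a given prefix and reports how many records were kept and dropped. It uses only calls that appear in the template shown further down this page (sequence:id(), obicontext.item() and obicontext.inc()); the counter names and the sample1_ prefix are arbitrary choices made for the example.

```lua
function begin()
    -- Initialise the shared counters before any record is processed
    obicontext.item("kept", 0)
    obicontext.item("dropped", 0)
end

function worker(sequence)
    -- Keep only records whose identifier starts with "sample1_"
    if string.find(sequence:id(), "^sample1_") then
        obicontext.inc("kept")
        return sequence
    end
    obicontext.inc("dropped")
    return nil -- returning nil removes the record from the output
end

function finish()
    -- Print the totals once the last record has been processed
    print("kept = " .. obicontext.item("kept"))
    print("dropped = " .. obicontext.item("dropped"))
end
```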

An obitools Lua extension is available for interacting with OBITools4 objects from within the script.

The selection options used by obigrep (such as --min-length, --predicate, --sequence, etc.) are also available in obiscript. There is, however, an important behavioural difference: in obiscript, the selection options do not filter sequences out of the output as they do in obigrep; they only select which sequences the worker() function is applied to. Sequences that do not match the selection pass through to the output unchanged, without the script being executed on them.
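One way to observe this behaviour is to have the worker tag every record it receives. In the sketch below, the attribute name script_applied is a hypothetical choice; only the records matching the selection options would carry it in the output, while the others would pass through untagged.

```lua
-- Tag the records the script is actually applied to.
-- "script_applied" is an arbitrary attribute name chosen for this sketch.
function worker(sequence)
    sequence:attribute("script_applied", "yes")
    return sequence
end
```

Saved under a hypothetical file name such as tag.lua and run with --script tag.lua --min-length 100, this script should leave reads shorter than 100 bp in the output without the script_applied attribute.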

To bootstrap a new script, run obiscript with --template. It prints a self-contained Lua skeleton with inline comments:

obiscript --template

📄 my_script.lua

function begin()
    obicontext.item("compteur",0)
end

function worker(sequence)
    samples = sequence:attribute("merged_sample")
    samples["tutu"]=4
    sequence:attribute("merged_sample",samples)
    sequence:attribute("toto",44444)
    nb = obicontext.inc("compteur")
    sequence:id("seq_" .. nb)
    return sequence
end

function finish()
    print("compteur = " .. obicontext.item("compteur"))
end

The skeleton demonstrates all three lifecycle functions and shows how to use obicontext for cross-record aggregation and sequence:id() to rename records.
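The template assigns samples and nb without the local keyword, which makes them global variables in Lua. A slightly more defensive variant of the worker is sketched below. It assumes that sequence:attribute() returns nil when the requested attribute is absent, which should be checked against your OBITools4 version, and that the compteur counter has been initialised in begin() as in the template.

```lua
function worker(sequence)
    -- 'local' keeps the variables private to this call.
    -- ASSUMPTION: attribute() returns nil when "merged_sample" is not set,
    -- in which case an empty table is used instead.
    local samples = sequence:attribute("merged_sample") or {}
    samples["tutu"] = 4
    sequence:attribute("merged_sample", samples)

    -- Rename the record using the shared counter
    local nb = obicontext.inc("compteur")
    sequence:id("seq_" .. nb)
    return sequence
end
```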

Workflow of the command #

graph TD
A@{ shape: doc, label: "sequences.fasta" }
B@{ shape: doc, label: "annotate.lua" }
C[obiscript]
D@{ shape: doc, label: "annotated.fasta" }
A --> C
B --> C:::obitools
C --> D
classDef obitools fill:#99d57c

Synopsis #

obiscript [--script|-S SCRIPT] [--template]
          [--predicate|-p EXPRESSION]... [--sequence|-s PATTERN]...
          [--identifier|-I PATTERN]... [--definition|-D PATTERN]...
          [--approx-pattern PATTERN]... [--pattern-error int] [--allows-indels]
          [--only-forward] [--has-attribute|-A KEY]... [--attribute|-a KEY=VALUE]...
          [--id-list FILENAME] [--min-length|-l LENGTH] [--max-length|-L LENGTH]
          [--min-count|-c COUNT] [--max-count|-C COUNT] [--inverse-match|-v]
          [--taxonomy|-t PATH] [--restrict-to-taxon|-r TAXID]...
          [--ignore-taxon|-i TAXID]... [--require-rank RANK_NAME]...
          [--valid-taxid] [--fail-on-taxonomy] [--update-taxid] [--raw-taxid]
          [--with-leaves] [--paired-mode forward|reverse|and|or|andnot|xor]
          [--fasta] [--fastq] [--embl] [--genbank] [--ecopcr] [--csv]
          [--input-OBI-header] [--input-json-header] [--u-to-t] [--solexa]
          [--skip-empty] [--no-order]
          [--out|-o FILENAME] [--fasta-output] [--fastq-output] [--json-output]
          [--output-OBI-header|-O] [--output-json-header] [--compress|-Z]
          [--max-cpu int] [--batch-size int] [--batch-size-max int]
          [--batch-mem string] [--no-progressbar] [--debug] [--silent-warning]
          [<args>]

Options #

`obiscript` specific options #

--script | -S <SCRIPT>: Path to the Lua script file to execute. The file must exist and be syntactically valid Lua. The script should define a worker(sequence) function, and optionally begin() and finish().
--template: Print a minimal Lua script template to standard output, with stubs for begin(), worker(), and finish() and inline documentation, then exit. Use this to bootstrap a new script.
--paired-mode <forward|reverse|and|or|andnot|xor>: When paired reads are provided, determines how filter conditions are applied to both reads of a pair. Default: forward.

Taxonomic options #

--taxonomy | -t <string>: Path to the taxonomic database.
--restrict-to-taxon | -r <TAXID>: Apply the script only to sequences whose taxid belongs to the specified taxon.
--ignore-taxon | -i <TAXID>: Apply the script only to sequences whose taxid does NOT belong to the specified taxon.
--require-rank <RANK_NAME>: Apply the script only to sequences whose taxon has the specified rank (e.g., species, genus).
--valid-taxid: Apply the script only to sequences that carry a currently valid NCBI taxid. Default: false.
--fail-on-taxonomy: Abort with an error if a taxid used during filtering is not currently valid. Default: false.
--update-taxid: Automatically replace taxids declared as merged with their current equivalent. Default: false.
--raw-taxid: Print taxids in output without supplementary information (taxon name and rank). Default: false.
--with-leaves: When extracting taxonomy from a sequence file, attach sequences as leaves of their taxid annotation. Default: false.

Selecting sequence records #

Selection based on the sequence #

Strict matching #

--sequence | -s <PATTERN>: A regular expression pattern used to match the sequence. Only the entries whose sequence matches the pattern are selected. Regular expression patterns are case-insensitive.

Approximate matching #

--approx-pattern <PATTERN>: A DNA pattern used to match the sequence. Only the entries whose sequence matches the pattern are selected. DNA patterns are case-insensitive. They can be matched allowing for errors: mismatches, insertions or deletions.
--allows-indels: allows for indels during DNA pattern matching (see --approx-pattern option).
--pattern-error <INTEGER>: maximum number of errors allowed when searching for patterns in DNA (default 0, see --approx-pattern option).

Selection based on the sequence identifier #

--identifier | -I <REGEX>: Regular expression pattern to be tested against the sequence identifier. The pattern is case-insensitive.
--id-list <FILENAME>: points to a text file containing the list of sequence record identifiers to be selected. The file contains a single identifier per line.

Selection based on the sequence definition #

--definition | -D <REGEX>: Regular expression pattern to be tested against the sequence definition. The pattern is case-insensitive.

Selection based on the sequence properties #

--min-count | -c <COUNT>: selects the sequence records for which the number of occurrences (i.e. the count attribute) is equal to or greater than the defined minimum count.
--max-count | -C <COUNT>: selects the sequence records for which the number of occurrences (i.e. the count attribute) is equal to or smaller than the defined maximum count.
--min-length | -l <LENGTH>: selects the sequence records for which the sequence length is equal to or greater than the defined minimum sequence length.
--max-length | -L <LENGTH>: selects sequence records for which the sequence length is equal to or less than the defined maximum sequence length.

Controlling the input data #

OBITools4 generally recognizes the input file format. It also recognizes whether the input file is compressed using GZIP. But some rare files can be misidentified, so the following options allow the user to force the format, thus bypassing the format identification step.

The file format options #

--fasta: indicates that sequence data is in fasta format.
--fastq: indicates that sequence data is in fastq format.
--embl: indicates that sequence data is in EMBL-ENA flatfile format.
--csv: indicates that sequence data is in CSV format.
--genbank: indicates that sequence data is in GenBank flatfile format.
--ecopcr: indicates that sequence data is in the old ecoPCR tabulated format.

Controlling the way OBITools4 formats annotations #

These options only apply to the FASTA and FASTQ formats.

--input-OBI-header: FASTA/FASTQ title line annotations follow the old OBI format.
--input-json-header: FASTA/FASTQ title line annotations follow the JSON format.

Controlling quality score decoding #

This option only applies to the FASTQ format.

--solexa: decodes quality string according to the old Solexa specification. (default: the standard Sanger encoding is used, env: OBISSOLEXA)

Controlling the output data #

--compress | -Z : output is compressed using gzip. (default: false)
--no-order: by default, OBITools ensure that the order of the sequences does not change between the input file and the output file. When multiple files are processed, they are processed one at a time. If the --no-order option is added to a command, multiple input files can be opened at the same time and their contents processed in parallel. This usually increases processing speed, but does not guarantee the order of the sequences in the output file. Also, processing multiple files in parallel may require more memory to perform the computation.
--fasta-output: writes sequence data in fasta format (default if quality data is not available).
--fastq-output: writes sequence data in fastq format (default if quality data is available).
--json-output: writes sequence data in JSON format.
--out | -o <FILENAME>: filename used for saving the output (default: “-”, the standard output)
--output-OBI-header | -O : writes output FASTA/FASTQ title line annotations in OBI format (default: JSON).
--output-json-header: writes output FASTA/FASTQ title line annotations in JSON format (the default format).
--skip-empty: sequences of length equal to zero are removed from the output (default: false).
--no-progressbar: deactivates progress bar display (default: false).

General options #

--help | -h | -?: shows this help.
--version: prints the version and exits.
--silent-warning: This option tells obitools to stop displaying warnings. This behaviour can be controlled by setting the OBIWARNINGS environment variable.

--max-cpu <INTEGER>: OBITools can take advantage of your computer’s multi-core architecture by parallelizing the computation across all available CPUs. Computing on more CPUs usually requires more memory to perform the computation. Reducing the number of CPUs used to perform a calculation is also a way to indirectly control the amount of memory used by the process. The number of CPUs used by OBITools can also be controlled by setting the OBIMAXCPU environment variable.
--force-one-cpu: forces the use of a single CPU core for parallel processing (default: false).
--batch-size <INTEGER>: minimum number of sequences per batch for parallel processing (floor, default: 1, env: OBIBATCHSIZE)
--batch-size-max <INTEGER>: maximum number of sequences per batch for parallel processing (ceiling, default: 2000, env: OBIBATCHSIZEMAX)
--batch-mem <STRING>: maximum memory per batch (e.g. 128K, 64M, 1G; default: 128M; set to 0 to disable, env: OBIBATCHMEM)

--debug: enables debug mode, by setting log level to debug (default: false, env: OBIDEBUG)
--pprof: enables pprof server. Look at the log for details. (default: false).
--pprof-mutex <INTEGER>: enables profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
--pprof-goroutine <INTEGER>: enables the goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)

Examples #

Add a sample identifier to every sequence record #

The file sequences.fasta contains six fasta sequences whose identifiers follow the pattern <sample>_<number>. The script annotate.lua extracts the sample prefix and stores it as a new attribute sample on each record using sequence:attribute("sample", value).

📄 sequences.fasta

>sample1_seq001 control sequence for annotation test
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
>sample1_seq002 another control sequence from sample1
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTA
>sample2_seq003 second sample sequence
TTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAA
>sample2_seq004 second sample another sequence
CCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGG
>sample3_seq005 third sample first sequence
AAAATTTTCCCCGGGGAAAATTTTCCCCGGGGAAAATTTTCCCCGGGG
>sample3_seq006 third sample second sequence
TTTTAAAACCCCGGGGTTTTAAAACCCCGGGGTTTTAAAACCCCGGGG

📄 annotate.lua

-- Adds a 'sample' attribute by extracting the prefix before the first underscore
function worker(sequence)
    local id = sequence:id()
    local sample = string.match(id, "^(.-)_")
    if sample then
        sequence:attribute("sample", sample)
    end
    return sequence
end

obiscript --script annotate.lua --fasta-output -o annotated.fasta sequences.fasta

📄 annotated.fasta

>sample1_seq001 {"definition":"control sequence for annotation test","sample":"sample1"}
atcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcg
>sample1_seq002 {"definition":"another control sequence from sample1","sample":"sample1"}
gctagctagctagctagctagctagctagctagctagctagctagcta
>sample2_seq003 {"definition":"second sample sequence","sample":"sample2"}
ttaattaattaattaattaattaattaattaattaattaattaattaa
>sample2_seq004 {"definition":"second sample another sequence","sample":"sample2"}
ccggccggccggccggccggccggccggccggccggccggccggccgg
>sample3_seq005 {"definition":"third sample first sequence","sample":"sample3"}
aaaattttccccggggaaaattttccccggggaaaattttccccgggg
>sample3_seq006 {"definition":"third sample second sequence","sample":"sample3"}
ttttaaaaccccggggttttaaaaccccggggttttaaaaccccgggg
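
The prefix extraction in annotate.lua relies on the Lua pattern ^(.-)_, where .- is the lazy counterpart of .* and therefore stops at the first underscore. The standalone sketch below, which can be run with any Lua interpreter outside obiscript, illustrates this on plain strings; the function name sample_prefix is introduced here for illustration only.

```lua
-- Returns the part of an identifier before the first underscore,
-- or nil when the identifier contains no underscore.
function sample_prefix(id)
    return string.match(id, "^(.-)_")
end

print(sample_prefix("sample1_seq001"))   -- sample1
print(sample_prefix("run_A_seq001"))     -- run (the lazy match stops at the first "_")
print(sample_prefix("noseparator"))      -- nil
```

Because the function returns nil for identifiers without an underscore, the if sample then test in annotate.lua leaves such records without a sample attribute.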

Apply a script selectively to sequences above a length threshold #

The file reads.fastq contains four fastq reads of varying lengths. By combining obiscript with --min-length, the process_pairs.lua script is applied only to reads that are at least 100 bp long; shorter reads pass through to the output without modification.

📄 reads.fastq

@seq001 long sequence passes min-length 100 filter
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@seq002 short sequence fails min-length 100 filter
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@seq003 long sequence passes min-length 100 filter
TTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@seq004 short sequence fails min-length 100 filter
CCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

📄 process_pairs.lua

-- Simple pass-through script: returns each sequence unchanged
function worker(sequence)
    return sequence
end

obiscript --script process_pairs.lua --min-length 100 -o result.fastq reads.fastq

📄 result.fastq

@seq001 {"definition":"long sequence passes min-length 100 filter"}
atcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcg
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@seq002 {"definition":"short sequence fails min-length 100 filter"}
gctagctagctagctagctagctagctagctagctagctagctagctagctagctagctagctagctagctagctagcta
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@seq003 {"definition":"long sequence passes min-length 100 filter"}
ttaattaattaattaattaattaattaattaattaattaattaattaattaattaattaattaattaattaattaattaattaattaattaattaattaattaattaattaa
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@seq004 {"definition":"short sequence fails min-length 100 filter"}
ccggccggccggccggccggccggccggccggccggccggccggccggccgg
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

Enrich FASTQ records with a custom attribute and output as JSON #

The enrich.lua script adds a processed attribute to every sequence. Combining this with --json-output produces a structured JSON array where each record carries the new annotation alongside its sequence and quality data.

📄 sequences.fastq

@seq001 long sequence passes min-length 100 filter
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@seq002 short sequence fails min-length 100 filter
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@seq003 long sequence passes min-length 100 filter
TTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@seq004 short sequence fails min-length 100 filter
CCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGGCCGG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

📄 enrich.lua

-- Marks each sequence as processed by adding a 'processed' attribute
function worker(sequence)
    sequence:attribute("processed", "true")
    return sequence
end

obiscript --script enrich.lua --json-output -o enriched.json sequences.fastq

Display help #

obiscript --help