The FASTA sequence file format
FASTA is the most widely used sequence file format, probably because of its simplicity. It was originally created for the Lipman and Pearson FASTA program (Pearson & Lipman, 1988).
In the FASTA format, a sequence is represented by a title line starting with a > character, followed by the sequence itself, written using the IUPAC code. The sequence is usually split across several lines of the same length (except for the last one). Several sequences can be stored in the same file. The title line of the next sequence also marks the end of the previous one.
>my_sequence this is my pretty sequence
ACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGT
GTGCTGACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTGTTT
AACGACGTTGCAGTACGTTGCAGT
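The layout described above can be read with a few lines of Python. The following is a minimal sketch, not part of OBITools: the names `read_fasta` and `_make_record` are assumptions made for illustration, and blank lines are simply skipped.

```python
from typing import Iterable, Iterator, List, Tuple


def _make_record(header: str, chunks: List[str]) -> Tuple[str, str, str]:
    # The first word of the title line is the identifier,
    # the rest of the line is the description.
    identifier, _, description = header.partition(" ")
    return identifier, description.strip(), "".join(chunks)


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FASTA record."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated back into one string.
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)
```

Passing an open file object, or any iterable of lines, yields one tuple per sequence with the wrapped sequence lines joined together.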
The first word in the title line is the sequence identifier. The rest of the line is a description of the sequence. The OBITools extend this format by adding structured data to the title line. In the previous version of the OBITools, the structured data was stored after the sequence identifier in a key=value; format, as shown below. The sequence definition was stored as free text after the last key=value; pair.
>AB061527 obicleandb_level=family; count=1; family_name=Soricidae; genus_name=Sorex; genus_taxid=9379; obicleandb_trusted=2.2137847111025621e-13; species_name=Sorex unguiculatus; species_taxid=62275; taxid=62275; family_taxid=9376; Sorex unguiculatus mitochondrial NA, complete genome.
ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat
agcttaaaactcaaaggacttggcggtgctttatatccct
>AL355887 species_name=Homo sapiens; family_taxid=9604; genus_name=Homo; obicleandb_trusted=0; genus_taxid=9605; obicleandb_level=genus; species_taxid=9606; taxid=9606; count=2; family_name=Hominidae; Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN.
ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac
agcttaaaactcaaaggacctggcagttctttatatccct
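To make the structure of the old title line concrete, here is a minimal Python sketch that splits it into identifier, annotations and definition. It is not the OBITools parser: the function name `parse_obi2_header` and the regular expression are assumptions, all values are kept as strings, and a definition that itself begins with text shaped like `word=...;` would be misread as an annotation.

```python
import re
from typing import Dict, Tuple

# One "key=value;" pair, optionally preceded by whitespace (assumed key syntax).
_PAIR = re.compile(r"\s*([A-Za-z_]\w*)=([^;]*);")


def parse_obi2_header(header: str) -> Tuple[str, Dict[str, str], str]:
    """Split an old-style title line into (identifier, annotations, definition)."""
    identifier, _, rest = header.lstrip(">").partition(" ")
    annotations: Dict[str, str] = {}
    pos = 0
    while True:
        # Consume key=value; pairs from left to right.
        match = _PAIR.match(rest, pos)
        if match is None:
            break
        annotations[match.group(1)] = match.group(2).strip()
        pos = match.end()
    # Whatever follows the last pair is the free-text definition.
    definition = rest[pos:].strip()
    return identifier, annotations, definition
```

Because pairs are only consumed from the start of the line, a semicolon inside the trailing definition (as in the AL355887 record) does not disturb the result.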
OBITools4 introduces a new format for storing structured data in the title line. The key/value annotation pairs are now formatted as a JSON map object. The definition is stored as an additional key/value pair using the key "definition".
📄 two_sequences_obi4.fasta
>AB061527 {"count":1,"definition":"Sorex unguiculatus mitochondrial NA, complete genome.","family_name":"Soricidae","family_taxid":9376,"genus_name":"Sorex","genus_taxid":9379,"obicleandb_level":"family","obicleandb_trusted":2.2137847111025621e-13,"species_name":"Sorex unguiculatus","species_taxid":62275,"taxid":62275}
ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat
agcttaaaactcaaaggacttggcggtgctttatatccct
>AL355887 {"count":2,"definition":"Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN.","family_name":"Hominidae","family_taxid":9604,"genus_name":"Homo","genus_taxid":9605,"obicleandb_level":"genus","obicleandb_trusted":0,"species_name":"Homo sapiens","species_taxid":9606,"taxid":9606}
ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac
agcttaaaactcaaaggacctggcagttctttatatccct
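Since the annotations are plain JSON, a title line in the new format can be decoded with a standard JSON library. The sketch below uses Python's `json` module. The function name `parse_obi4_header` is an assumption for illustration, and the code assumes that everything after the identifier is a single JSON object.

```python
import json
from typing import Any, Dict, Tuple


def parse_obi4_header(header: str) -> Tuple[str, Dict[str, Any], str]:
    """Split a JSON-style title line into (identifier, annotations, definition)."""
    identifier, _, rest = header.lstrip(">").partition(" ")
    rest = rest.strip()
    # Assumption: the remainder of the line is one JSON map, or empty.
    annotations: Dict[str, Any] = json.loads(rest) if rest else {}
    # The definition travels as an ordinary key/value pair.
    definition = annotations.pop("definition", "")
    return identifier, annotations, definition
```

Unlike the old key=value; syntax, JSON keeps the type of each value, so `count` and `taxid` come back as numbers rather than strings.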
The obiconvert command, like all other OBITools4 commands, has two options, --output-json-header and --output-OBI-header, to choose between the new JSON format and the old OBITools format. The --output-OBI-header option can be abbreviated to -O. Because the new JSON OBITools4 format is the default, in practice only the -O option is needed, when the old format is required for compatibility with other software.
Converting from the new JSON format to the old OBITools format:
obiconvert -O two_sequences_obi4.fasta
>AB061527 obicleandb_level=family; count=1; family_name=Soricidae; genus_name=Sorex; genus_taxid=9379; obicleandb_trusted=2.2137847111025621e-13; species_name=Sorex unguiculatus; species_taxid=62275; taxid=62275; family_taxid=9376; Sorex unguiculatus mitochondrial NA, complete genome.
ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat
agcttaaaactcaaaggacttggcggtgctttatatccct
>AL355887 species_name=Homo sapiens; family_taxid=9604; genus_name=Homo; obicleandb_trusted=0; genus_taxid=9605; obicleandb_level=genus; species_taxid=9606; taxid=9606; count=2; family_name=Hominidae; Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN.
ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac
agcttaaaactcaaaggacctggcagttctttatatccct
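The shape of the old-style title line shown in this output can be reproduced with a short Python sketch. This is only an illustration of the layout, not how obiconvert is implemented: the function name `format_obi2_header` is an assumption, keys are written in the order given, and values containing a semicolon are not handled.

```python
from typing import Any, Mapping


def format_obi2_header(identifier: str,
                       annotations: Mapping[str, Any],
                       definition: str = "") -> str:
    """Render a title line as '>id key=value; ... definition'."""
    # Each annotation becomes a "key=value;" pair preceded by a space.
    pairs = "".join(f" {key}={value};" for key, value in annotations.items())
    # The definition, if any, follows the last pair as free text.
    tail = f" {definition}" if definition else ""
    return f">{identifier}{pairs}{tail}"
```

For example, `format_obi2_header("AB061527", {"count": 1}, "complete genome.")` returns `>AB061527 count=1; complete genome.`.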
Converting from the old OBITools format to the new JSON format:
obiconvert two_sequences_obi2.fasta
>AB061527 {"count":1,"definition":"Sorex unguiculatus mitochondrial NA, complete genome.","family_name":"Soricidae","family_taxid":9376,"genus_name":"Sorex","genus_taxid":9379,"obicleandb_level":"family","obicleandb_trusted":2.2137847111025621e-13,"species_name":"Sorex unguiculatus","species_taxid":62275,"taxid":62275}
ttagccctaaacttaggtatttaatctaacaaaaatacccgtcagagaactactagcaat
agcttaaaactcaaaggacttggcggtgctttatatccct
>AL355887 {"count":2,"definition":"Human chromosome 14 NA sequence BAC R-179O11 of library RPCI-11 from chromosome 14 of Homo sapiens (Human)XXKW HTG.; HTGS_ACTIVFIN.","family_name":"Hominidae","family_taxid":9604,"genus_name":"Homo","genus_taxid":9605,"obicleandb_level":"genus","obicleandb_trusted":0,"species_name":"Homo sapiens","species_taxid":9606,"taxid":9606}
ttagccctaaactctagtagttacattaacaaaaccattcgtcagaatactacgagcaac
agcttaaaactcaaaggacctggcagttctttatatccct
The actual format of the header is automatically detected when OBITools4 commands read a FASTA file.
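One way such a detection could work is to look at the first characters after the sequence identifier. The following Python sketch is a guess at a reasonable heuristic, not the detection logic used by OBITools4. The function name `guess_header_format` and the returned labels are assumptions.

```python
import re

# A leading "key=value;" pair suggests the old OBITools syntax (assumed pattern).
_OBI2_PAIR = re.compile(r"[A-Za-z_]\w*=[^;]*;")


def guess_header_format(header: str) -> str:
    """Return 'json', 'obi' or 'plain' for a FASTA title line."""
    _, _, rest = header.lstrip(">").partition(" ")
    rest = rest.lstrip()
    if rest.startswith("{"):
        return "json"   # annotations stored as a JSON map
    if _OBI2_PAIR.match(rest):
        return "obi"    # annotations stored as key=value; pairs
    return "plain"      # ordinary FASTA description, no structured data
```

A plain description such as `this is my pretty sequence` matches neither test and is reported as `plain`.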
References
- Pearson, W. & Lipman, D. (1988). Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the United States of America, 85(8), 2444–2448. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/3162770