The NCBI taxonomy dump #
The NCBI provides a taxonomy that is used as a reference taxonomy for all molecular data published by NCBI, EBI and DDBJ. This taxonomy is available via a web interface, but can also be downloaded from the NCBI FTP server.
To use the NCBI taxonomy with OBITools4, download the taxdump archive from the NCBI FTP server. The archive is a gzipped tarball that contains, among others, the following files required by OBITools4:
nodes.dmp
: a tab-separated file containing the taxonomic hierarchy
names.dmp
: a tab-separated file containing the scientific names of the organisms
merged.dmp
: a tab-separated file containing the information about reassignment of taxids
delnodes.dmp
: a tab-separated file containing the information about old taxids that have since been deleted from the taxonomy.
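As a rough illustration of the list above, the following sketch checks that an archive contains the four required files. To stay runnable offline it builds a tiny stand-in archive named taxdump_demo.tgz (a hypothetical name) with empty files; with a real download, point `tar -tzf` at that file instead. It assumes the members are stored without a leading `./` in their names.

```bash
# Build a tiny stand-in archive so the sketch runs without network access.
workdir=$(mktemp -d)
( cd "$workdir" \
  && touch nodes.dmp names.dmp merged.dmp delnodes.dmp \
  && tar -czf taxdump_demo.tgz nodes.dmp names.dmp merged.dmp delnodes.dmp )

# List the archive members once, then look for each required file.
# Assumption: member names carry no "./" prefix.
members=$(tar -tzf "$workdir/taxdump_demo.tgz")
missing=""
for f in nodes.dmp names.dmp merged.dmp delnodes.dmp; do
  printf '%s\n' "$members" | grep -qxF "$f" || missing="$missing $f"
done
echo "missing files:${missing:- none}"
```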
Downloading the NCBI taxonomy dump #
The obitaxonomy command provides the --download-ncbi option, which downloads a copy of the NCBI taxonomy dump tarball from the NCBI FTP server. By default, the file is saved in the current directory under the name ncbitaxo_YYYYMMDD.tgz, where YYYY is the year, MM the month and DD the day of the download date.
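For scripting purposes, the default file name can be reconstructed with `date`. This is only a sketch: it assumes the download happens on the same calendar day as the script runs, and the variable name default_name is illustrative.

```bash
# Rebuild the default name used for a download made today,
# following the ncbitaxo_YYYYMMDD.tgz pattern.
default_name="ncbitaxo_$(date +%Y%m%d).tgz"
echo "$default_name"
```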
The filename used to save the tarball can be specified with the --out option, as in the following example:
obitaxonomy --download-ncbi --out ncbitaxo.tgz
OBITools4 does not require the downloaded file to be extracted. The name of the compressed file can be passed directly to any OBITools4 command using the --taxonomy option.
Structure of the NCBI taxonomy directory #
The ncbitaxo.tgz archive can be unpacked using the following bash commands:
mkdir ncbitaxo
cd ncbitaxo
tar -zxvf ../ncbitaxo.tgz
cd ..
The ncbitaxo directory contains all the files provided by NCBI. The readme.txt file describes the content of each file provided.
Only the files used by OBITools4 are described below.
The nodes.dmp file #
The nodes.dmp file is a tab-separated file. Its first columns, which are the ones used by OBITools4, are described below:
Field | Description |
---|---|
tax_id | A unique taxonomic identifier composed only of digits (0-9) |
parent tax_id | The taxid of the parent taxon of the current taxon |
rank | The taxonomic rank of the taxon (e.g. species, genus, family, etc.) |
Here are the first lines of this file:
1 | 1 | no rank | | 8 | 0 | 1 | 0 | 0 | 0 | 0| 0 | |
2 | 131567 | superkingdom | | 0 | 0 | 11 | 0 | 0 | 0 |0 | 0 | |
6 | 335928 | genus | | 0 | 1 | 11 | 1 | 0 | 1 | 0| 0 | |
7 | 6 | species | AC | 0 | 1 | 11 | 1 | 0 | 1 | 1| 0 | |
9 | 32199 | species | BA | 0 | 1 | 11 | 1 | 0 | 1 | 1| 0 | |
10 | 1706371 | genus | | 0 | 1 | 11 | 1 | 0 | 1 | 0| 0 | |
11 | 1707 | species | CG | 0 | 1 | 11 | 1 | 0 | 1 | 1| 0 | effective current name; |
13 | 203488 | genus | | 0 | 1 | 11 | 1 | 0 | 1 | 0| 0 | |
14 | 13 | species | DT | 0 | 1 | 11 | 1 | 0 | 1 | 1| 0 | |
16 | 32011 | genus | | 0 | 1 | 11 | 1 | 0 | 1 | 0| 0 | |
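The three columns described above can be pulled out with awk. The sketch below works on an abbreviated stand-in file that keeps only those three columns, so that it runs without the real dump. It assumes the layout documented in NCBI's readme.txt, where fields are delimited by a tab, a vertical bar and a tab; this is not OBITools4's own parser.

```bash
# Abbreviated stand-in for nodes.dmp: only the first three columns are kept.
nodes=$(mktemp)
printf '1\t|\t1\t|\tno rank\t|\n'           >  "$nodes"
printf '2\t|\t131567\t|\tsuperkingdom\t|\n' >> "$nodes"
printf '6\t|\t335928\t|\tgenus\t|\n'        >> "$nodes"
printf '7\t|\t6\t|\tspecies\t|\n'           >> "$nodes"

# Splitting on single tabs puts the data in the odd-numbered fields
# ($1 = tax_id, $3 = parent tax_id, $5 = rank); even fields hold the "|".
species_parents=$(awk -F'\t' '$5 == "species" { print $1 " -> " $3 }' "$nodes")
echo "$species_parents"
```

Splitting on the tab alone avoids writing a regular expression for the vertical bar, at the cost of using only every other field.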
The names.dmp file #
The names.dmp file is a tab-separated file with the following columns:
Field | Description |
---|---|
tax_id | The node identifier associated with this name |
name_txt | The name itself |
unique name | The unique variant of this name, if the name is not unique |
name class | The type of name (synonym, common name, …) |
Here are the first lines of this file:
1 | all | | synonym |
1 | root | | scientific name |
2 | Bacteria | Bacteria <bacteria> | scientific name |
2 | bacteria | | blast name |
2 | eubacteria | | genbank common name |
2 | Monera | Monera <bacteria> | in-part |
2 | Procaryotae | Procaryotae <bacteria> | in-part |
2 | Prokaryotae | Prokaryotae <bacteria> | in-part |
2 | Prokaryota | Prokaryota <bacteria> | in-part |
2 | prokaryote | prokaryote <bacteria> | in-part |
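Since each taxon has several names, a common need is to keep only the rows whose name class is "scientific name". The following sketch does so on a small stand-in file built from the sample rows above. It assumes the tab, vertical bar, tab delimiter described in NCBI's readme.txt; the `taxid=name` output format is chosen only for illustration.

```bash
# Stand-in for names.dmp, built from a few of the sample rows.
names=$(mktemp)
printf '1\t|\tall\t|\t\t|\tsynonym\t|\n'                                  >  "$names"
printf '1\t|\troot\t|\t\t|\tscientific name\t|\n'                         >> "$names"
printf '2\t|\tBacteria\t|\tBacteria <bacteria>\t|\tscientific name\t|\n' >> "$names"
printf '2\t|\teubacteria\t|\t\t|\tgenbank common name\t|\n'               >> "$names"

# With a single tab as separator: $1 = tax_id, $3 = name_txt,
# $5 = unique name, $7 = name class.
sci_names=$(awk -F'\t' '$7 == "scientific name" { print $1 "=" $3 }' "$names")
echo "$sci_names"
```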
The merged.dmp file #
The merged.dmp file is a tab-separated file with the following columns:
Field | Description |
---|---|
old_tax_id | The identifier of the node that has been merged |
new_tax_id | The identifier of the node that results from the merge |
Here are the first lines of this file:
12 | 74109 |
30 | 29 |
36 | 184914 |
37 | 42 |
46 | 39 |
67 | 32033 |
76 | 155892 |
77 | 74311 |
79 | 74313 |
80 | 155892 |
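The mapping can be used to bring an outdated taxid up to date. The helper below, resolve_taxid, is a hypothetical name and not part of OBITools4. It looks a taxid up in a stand-in merged file and falls back to the input when no merge is recorded. It assumes the tab, vertical bar, tab delimiter described in NCBI's readme.txt.

```bash
# Stand-in for merged.dmp, built from two of the sample rows.
merged=$(mktemp)
printf '12\t|\t74109\t|\n' >  "$merged"
printf '30\t|\t29\t|\n'    >> "$merged"

# resolve_taxid TAXID FILE: print the new taxid if TAXID was merged,
# otherwise print TAXID unchanged. ($1 = old_tax_id, $3 = new_tax_id)
resolve_taxid() {
  awk -F'\t' -v id="$1" '
    $1 == id { print $3; found = 1; exit }
    END      { if (!found) print id }
  ' "$2"
}

resolve_taxid 30 "$merged"
resolve_taxid 7 "$merged"
```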
The delnodes.dmp file #
The delnodes.dmp file is a tab-separated file with a single column:
Field | Description |
---|---|
tax_id | The deleted node ID |
Here are the first lines of this file:
3025011 |
3025010 |
3025009 |
3025008 |
3025007 |
3025006 |
3025005 |
3025004 |
3025003 |
3025002 |
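A simple membership test is enough to find out whether a taxid has been deleted. The function is_deleted below is a hypothetical helper, shown on a stand-in file. It assumes each row ends with a tab followed by a vertical bar, as described in NCBI's readme.txt.

```bash
# Stand-in for delnodes.dmp, built from two of the sample rows.
delnodes=$(mktemp)
printf '3025011\t|\n3025010\t|\n' > "$delnodes"

# is_deleted TAXID FILE: succeed (exit status 0) if TAXID is listed in FILE.
is_deleted() {
  awk -F'\t' -v id="$1" '
    $1 == id { found = 1; exit }
    END      { exit !found }
  ' "$2"
}

if is_deleted 3025010 "$delnodes"; then
  echo "3025010 has been deleted from the taxonomy"
fi
```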