The NCBI taxonomy dump #
The NCBI provides a taxonomy that is used as a reference taxonomy for all molecular data published by NCBI, EBI and DDBJ. This taxonomy is available via a web interface, but can also be downloaded from the NCBI FTP server.
The NCBI taxonomy can be used by OBITools4 by downloading the taxdump from the NCBI FTP server. The file is a gzipped tarball archive containing the following files required by OBITools4:
nodes.dmp: a tab-separated file containing the taxonomic hierarchynames.dmp: a tab-separated file containing the scientific names of the organismsmerged.dmp: a tab-separated file containing the information about reassignment of taxidsdelnodes.dmp: a tab-separated file containing the information about old taxids today deleted from the taxonomy.
Downloading the NCBI taxonomy dump #
The obitaxonomy
command provides the --download-ncbi option, which downloads a copy of the NCBI taxonomy dump tarball from the
NCBI FTP server. By default, the file is downloaded to the current directory with the name ncbitaxo_YYYYMMDD.tgz, where YYYY is the year, MM the month and DD the current date.
The filename used to save the tarball can be specified with the --out option, as in the following example:
obitaxonomy --download-ncbi --out ncbitaxo.tgz
OBITools4 do not require extracting the downloaded file. The name of the compressed file can be passed directly to any OBITools4 command using the --taxonomy option.
Structure of the NCBI taxonomy directory #
The ncbitaxo.tgz archive can be unpacked using the following bash commands:
mkdir ncbitaxo
cd ncbitaxo
tar -zxvf ../ncbitaxo.tgz
cd ..
The ncbitaxo directory contains all the files provided by NCBI. The readme.txt file describes the content of each file provided.
Only the files used by OBITools4 are described below.
The nodes.dmp file
#
The nodes.dmp file is a tab-separated file, here is the description of the first columns used by OBITools4:
| Field | Description |
|---|---|
tax_id | A unique taxonomic identifier composed only of digits (0-9) |
parent tax_id | The taxid of the parent taxon of the current taxon |
rank | The taxonomic rank of the taxon (e.g. species, genus, family, etc.) |
Here are the first lines of this file:
1 | 1 | no rank | | 8 | 0 | 1 | 0 | 0 | 0 | 0| 0 | |
2 | 131567 | superkingdom | | 0 | 0 | 11 | 0 | 0 | 0 |0 | 0 | |
6 | 335928 | genus | | 0 | 1 | 11 | 1 | 0 | 1 | 0| 0 | |
7 | 6 | species | AC | 0 | 1 | 11 | 1 | 0 | 1 | 1| 0 | |
9 | 32199 | species | BA | 0 | 1 | 11 | 1 | 0 | 1 | 1| 0 | |
10 | 1706371 | genus | | 0 | 1 | 11 | 1 | 0 | 1 | 0| 0 | |
11 | 1707 | species | CG | 0 | 1 | 11 | 1 | 0 | 1 | 1| 0 | effective current name; |
13 | 203488 | genus | | 0 | 1 | 11 | 1 | 0 | 1 | 0| 0 | |
14 | 13 | species | DT | 0 | 1 | 11 | 1 | 0 | 1 | 1| 0 | |
16 | 32011 | genus | | 0 | 1 | 11 | 1 | 0 | 1 | 0| 0 | |
The names.dmp file
#
The names.dmp file is a tab-separated file with the following columns:
| Field | Description |
|---|---|
tax_id | The node identifier associated with this name |
name_txt | The name itself |
unique name | The unique variant of this name if name not unique |
name class | (synonym, common name, …) |
Here are the first lines of this file:
1 | all | | synonym |
1 | root | | scientific name |
2 | Bacteria | Bacteria <bacteria> | scientific name |
2 | bacteria | | blast name |
2 | eubacteria | | genbank common name |
2 | Monera | Monera <bacteria> | in-part |
2 | Procaryotae | Procaryotae <bacteria> | in-part |
2 | Prokaryotae | Prokaryotae <bacteria> | in-part |
2 | Prokaryota | Prokaryota <bacteria> | in-part |
2 | prokaryote | prokaryote <bacteria> | in-part |
The merged.dmp file
#
The merged.dmp file is a tab-separated file with the following columns:
| Field | Description |
|---|---|
old_tax_id | The node identifier which has been merged |
new_tax_id | The node identifier which is result of merging |
Here are the first lines of this file:
12 | 74109 |
30 | 29 |
36 | 184914 |
37 | 42 |
46 | 39 |
67 | 32033 |
76 | 155892 |
77 | 74311 |
79 | 74313 |
80 | 155892 |
The delnodes.dmp file
#
The delnodes.dmp file is a tab-separated file with the following columns:
| Field | Description |
|---|---|
tax_id | The deleted node ID |
Here are the first lines of this file:
3025011 |
3025010 |
3025009 |
3025008 |
3025007 |
3025006 |
3025005 |
3025004 |
3025003 |
3025002 |