Prepare a local copy of GenBank #

A local copy of the GenBank database requires a lot of disk space: a complete copy stored as compressed fasta files takes up about 1 TB.
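
Before starting, it may be worth checking that the target file system has enough room. The following is a minimal sketch in POSIX shell. The helper name has_free_space is hypothetical, and the threshold simply restates the 1 TB estimate given above.

```shell
#!/bin/sh
# has_free_space DIR REQUIRED_KB
# Succeeds when the file system holding DIR has at least REQUIRED_KB
# kilobytes available. (Hypothetical helper, not part of the Makefile.)
has_free_space() {
    # df -P prints one POSIX-format line per file system;
    # with -k, column 4 is the available space in 1K blocks.
    df -Pk "$1" | awk -v need="$2" '
        NR == 2 { ok = ($4 + 0 >= need + 0) }
        END     { exit !ok }'
}

# About 1 TB expressed in kilobytes (1024^3).
required_kb=1073741824

if has_free_space . "$required_kb"; then
    echo "Enough space for a full copy of GenBank"
else
    echo "Less than 1 TB available in the current directory"
fi
```

Downloading only a few divisions needs much less space, so the threshold can be lowered accordingly.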

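Three bioinformatics centres distribute all publicly available DNA sequences worldwide. They are:

  • the NCBI (National Center for Biotechnology Information, USA), which maintains GenBank
  • the EMBL-EBI (European Bioinformatics Institute), which maintains the European Nucleotide Archive (ENA)
  • the DDBJ (DNA Data Bank of Japan)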

The three centres are associated in an international agreement, the International Nucleotide Sequence Database Collaboration (INSDC). This agreement allows the three centres to share the sequences submitted by biologists. As a result, all sequences are available in the three databases, where they are identified by the same accession number.

The content of these databases is available via a web interface, but can also be downloaded to build a local copy. The NCBI and the EMBL-EBI follow different strategies for distributing data: the EMBL-EBI distributes fewer, larger files, whereas the NCBI distributes many small files. This is why we choose to download the sequences from GenBank here.

Each of these databases is divided into several taxonomic divisions. The main GenBank divisions useful for metabarcoding are:

  • bct: Bacteria
  • inv: Invertebrates
  • mam: Mammals
  • phg: Phages
  • pln: Plants
  • pri: Primates
  • rod: Rodents
  • vrl: Viruses
  • vrt: Vertebrates

Other divisions exist, but are less useful for metabarcoding (see the NCBI GenBank documentation for more information).
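
GenBank flat files are named after their division, for example gbmam1.seq.gz for the first file of the mam division. As a small sketch, the division code can be recovered from a file name with sed, in the same spirit as the Makefile at the end of this page. The helper name gb_division is an illustrative assumption.

```shell
#!/bin/sh
# gb_division FILE
# Prints the division code of a GenBank file name.
# (Hypothetical helper, not part of the Makefile.)
gb_division() {
    # Drop any directory, then keep the letters between "gb" and the number.
    basename "$1" | sed -E 's/^gb([a-z]+)[0-9]+\..*$/\1/'
}

gb_division gbmam10.seq.gz              # prints: mam
gb_division fasta/rod/gbrod1.fasta.gz   # prints: rod
```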

Download GenBank #

GenBank is distributed in two main formats: fasta and GenBank. The fasta format is more compact because it omits the sequence annotations stored in the GenBank format. For metabarcoding, however, this is a drawback: the fasta files do not contain the taxonomic information of the sequences, which is stored as a taxon identifier (taxid).

To combine the advantages of both formats, you can download the GenBank format and convert it to the fasta format using the obiconvert command. The obiconvert command ensures that taxonomic information is preserved during conversion.
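
For a single file, the conversion can be sketched as follows. The obiconvert options are the ones used by the Makefile at the end of this page (-Z, --fasta-output, --skip-empty). The helper names fasta_name and convert_one are illustrative, and the sketch assumes that obiconvert is available on the PATH.

```shell
#!/bin/sh
# fasta_name FILE
# Prints the name of the fasta file produced from a GenBank flat file,
# e.g. gbmam1.seq.gz -> gbmam1.fasta.gz
fasta_name() {
    echo "${1%.seq.gz}.fasta.gz"
}

# convert_one FILE
# Converts one GenBank flat file to a compressed fasta file next to it.
# (Hypothetical helper; the options are copied from the Makefile.)
convert_one() {
    out=$(fasta_name "$1")
    # Write to a temporary name first so that an interrupted conversion
    # never leaves a truncated file under the final name.
    obiconvert -Z --fasta-output --skip-empty "$1" > "$out.part" \
        && mv "$out.part" "$out" \
        || { rm -f "$out.part"; return 1; }
}

# Usage (requires obiconvert and a downloaded file):
#   convert_one gbmam1.seq.gz
```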

Network interruptions are frequent when downloading this many files, so the download may fail part-way. To cope with this, here is a make script that downloads the GenBank files and converts them into fasta files. Using make allows the process to be restarted from the point of failure.
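
The idea behind the restart can be sketched in plain shell: a step is skipped when a marker (stamp) file records that it already succeeded. This only illustrates the principle. In the actual script, make does this bookkeeping itself, and run_once is a hypothetical helper.

```shell
#!/bin/sh
# run_once STAMP COMMAND [ARGS...]
# Runs COMMAND only when the file STAMP does not exist yet, and creates
# STAMP when COMMAND succeeds. Re-running the script therefore skips the
# steps that already completed, and retries the ones that failed.
run_once() {
    stamp=$1
    shift
    if [ -e "$stamp" ]; then
        return 0
    fi
    "$@" && touch "$stamp"
}

# Demonstration with a harmless command standing in for a download.
workdir=$(mktemp -d)
run_once "$workdir/step1.stamp" echo "step 1 executed"   # runs
run_once "$workdir/step1.stamp" echo "step 1 executed"   # skipped
rm -rf "$workdir"
```

Because the stamp is only created after success, a step interrupted by a network failure is simply executed again on the next run.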

To download GenBank, copy the Makefile file to your local computer in the directory where you want to store the GenBank files.

The Makefile script must be called Makefile without any extension.

Then, execute the following command:

make

By default, the script downloads the divisions of GenBank listed above. To download one or more specific divisions of GenBank, you can use the GBDIV variable. For example, to download only the mam division, enter the following command:

make GBDIV=mam

To download several divisions, such as mam and rod, separate their names with spaces and enclose the list in quotes:

make GBDIV="mam rod"
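
Internally, the Makefile turns the space-separated value of GBDIV into a regular expression that filters the NCBI file listing. A minimal sketch of that step, using a hypothetical helper name division_pattern and a few sample file names:

```shell
#!/bin/sh
# division_pattern "DIV1 DIV2 ..."
# Prints an extended regular expression matching the GenBank file names
# of the listed divisions, e.g. "mam rod" -> gb(mam|rod)[0-9]+
division_pattern() {
    printf 'gb(%s)[0-9]+\n' "$(printf '%s' "$1" | tr ' ' '|')"
}

pattern=$(division_pattern "mam rod")

# Filter a few sample file names: only the mam and rod files are kept.
printf '%s\n' gbbct1.seq.gz gbmam1.seq.gz gbrod12.seq.gz gbvrl3.seq.gz \
    | grep -E "$pattern"
```

This is why the division names must be separated by single spaces: each space becomes one "|" in the pattern.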

If the download fails, restart it by running the make command again, without specifying the GBDIV variable:

make

The Makefile will create a directory called Release_###, where ### is the number of the current release. This directory will contain the following files:

📂 Release_264
├── 📂 depends/
│   ├── 📄 gbfiles.d
│   └── 📄 gbfiles.d.full
├── 📂 fasta/
│   ├── 📂 mam/
│   │   ├── 📄 gbmam1.fasta.gz
│   │   ├── 📄 gbmam10.fasta.gz
│   │   └── 📄 ...
│   └── 📂 rod/
│       ├── 📄 gbrod1.fasta.gz
│       └── 📄 ...
├── 📂 stamp/
│   ├── 📄 gbmam1.seq.gz.stamp
│   ├── 📄 gbmam10.seq.gz.stamp
│   └── 📄 gbrod1.seq.gz.stamp
└── 📂 taxonomy/
    ├── 📄 citations.dmp
    ├── 📄 delnodes.dmp
    ├── 📄 division.dmp
    ├── 📄 gc.prt
    ├── 📄 gencode.dmp
    ├── 📄 images.dmp
    ├── 📄 merged.dmp
    ├── 📄 names.dmp
    ├── 📄 nodes.dmp
    └── 📄 readme.txt

  • The taxonomy directory contains a copy of the NCBI taxonomy database at the time of download.
  • The fasta directory contains the fasta files sorted by taxonomic division in subdirectories, here mam and rod.
  • The stamp directory allows the Makefile script to restart the download process if it fails, without having to download the whole GenBank database again. To free up space, the stamp directory can be deleted at the end of the download process.
  • The depends directory contains a make script, created by the Makefile on its first run, with the instructions for downloading and converting the files of the requested GenBank divisions. To free up space, the depends directory can be deleted at the end of the download process.
  • The tmp directory stores the downloaded GenBank files until they are converted into fasta. It does not normally persist after the download process; if it does, it can be deleted to free up space.
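
Once the download has finished, a quick sanity check is to count the sequences stored in a division. The following sketch relies only on the fact that every fasta record starts with a line beginning with ">". The helper name count_sequences is hypothetical.

```shell
#!/bin/sh
# count_sequences DIR
# Prints the total number of sequences in the *.fasta.gz files under DIR.
# (Hypothetical helper, not part of the Makefile.)
count_sequences() {
    # Decompress every fasta file to stdout and count the header lines.
    find "$1" -name '*.fasta.gz' -exec gzip -dc {} + | grep -c '^>'
}

# Usage, assuming release 264 was downloaded as in the listing above:
#   count_sequences Release_264/fasta/mam
```

A decompression error reported by gzip during the count would point to a truncated file that should be converted again.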

The Makefile script for downloading GenBank #

📄 Makefile

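SHELL := /bin/bash

# NCBI server, and URLs of the GenBank files and of the release number
FTPNCBI=ftp.ncbi.nlm.nih.gov
GBURL=https://$(FTPNCBI)/genbank
GBRELEASE_URL=$(GBURL)/GB_Release_Number

# Archive of the NCBI taxonomy dump
TAXOURL=https://$(FTPNCBI)/pub/taxonomy/taxdump.tar.gz

# Number of the current GenBank release, read from the NCBI server
GBRELEASE:=$(shell curl $(GBRELEASE_URL))

# All the divisions found in the GenBank directory listing
GBDIV_ALL:=$(shell curl -L ${GBURL} \
                  | grep -E 'gb.+\.seq\.gz' \
                  | sed -E 's@^.*<a href="gb([^0-9]+)[0-9]+\.seq.gz.*$$@\1@' \
                  | sort \
                  | uniq)

# Divisions to download (can be overridden on the command line)
GBDIV=bct inv mam phg pln pri rod vrl vrt
DIRECTORIES=fasta fasta_fgs

# Names of the GenBank files belonging to the selected divisions
GBFILE_ALL:=$(shell curl -L ${GBURL} \
                    | grep -E "gb($$(tr ' ' '|' <<< "${GBDIV}"))[0-9]+" \
                    | sed -E 's@^<a href="(gb.+.seq.gz)">.*$$@\1@')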


SUFFIXES += .d
NODEPS:=clean taxonomy
DEPFILES:=$(wildcard Release_$(GBRELEASE)/depends/*.d)

ifeq (0, $(words $(findstring $(MAKECMDGOALS), $(NODEPS))))
    #Chances are, these files don't exist.  GMake will create them and
    #clean up automatically afterwards
    -include $(DEPFILES)
endif


all: depends directories FORCE
	@$(MAKE) downloads

downloads: taxonomy fasta_files 
	@echo Genbank Release number $(GBRELEASE)
	@echo all divisions : $(GBDIV_ALL)

FORCE:
	@sleep 1

.PHONY: all downloads directories depends taxonomy fasta_files FORCE

depends: directories Release_$(GBRELEASE)/depends/gbfiles.d Makefile

division: $(GBDIV)

taxonomy: directories Release_$(GBRELEASE)/taxonomy

directories: Release_$(GBRELEASE)/fasta Release_$(GBRELEASE)/stamp Release_$(GBRELEASE)/tmp 

Release_$(GBRELEASE):
	@mkdir -p $@ 
	@echo Create $@ directory

Release_$(GBRELEASE)/fasta: Release_$(GBRELEASE)
	@mkdir -p $@ 
	@echo Create $@ directory

Release_$(GBRELEASE)/stamp: Release_$(GBRELEASE)
	@mkdir -p $@ 
	@echo Create $@ directory

Release_$(GBRELEASE)/tmp: Release_$(GBRELEASE)
	@mkdir -p $@ 
	@echo Create $@ directory

Release_$(GBRELEASE)/depends/gbfiles.d: Makefile
	@echo Create depends directory
	@mkdir -p Release_$(GBRELEASE)/depends
	@for f in ${GBFILE_ALL} ; do \
	            echo -e "Release_$(GBRELEASE)/stamp/$$f.stamp:" ; \
				echo -e "\t@echo Downloading file : $$f..." ; \
				echo -e "\t@mkdir -p Release_$(GBRELEASE)/tmp" ; \
				echo -e "\t@mkdir -p Release_$(GBRELEASE)/stamp" ; \
				echo -e "\t@curl -L ${GBURL}/$$f > Release_$(GBRELEASE)/tmp/$$f && touch \$$@"  ; \
				echo ; \
				div=$$(sed -E 's@^gb(...).*$$@\1@' <<< $$f) ; \
				fasta="Release_$(GBRELEASE)/fasta/$$div/$${f/.seq.gz/.fasta.gz}" ; \
				fasta_fgs="Release_$(GBRELEASE)/fasta_fgs/$$div/$${f/.seq.gz/.fasta.gz}" ; \
				fasta_files="$$fasta_files $$fasta" ; \
				fasta_fgs_files="$$fasta_fgs_files $$fasta_fgs" ; \
				echo -e "$$fasta: Release_$(GBRELEASE)/stamp/$$f.stamp" ; \
				echo -e "\t@echo converting file : \$$< in fasta" ; \
				echo -e "\t@mkdir -p Release_$(GBRELEASE)/fasta/$$div" ; \
				echo -e "\t@obiconvert -Z --fasta-output --skip-empty  \\" ; \
				echo -e "\t            Release_$(GBRELEASE)/tmp/$$f > Release_$(GBRELEASE)/tmp/$${f/.seq.gz/.fasta.gz} \\" ; \
				echo -e "\t            && mv Release_$(GBRELEASE)/tmp/$${f/.seq.gz/.fasta.gz} \$$@  \\" ; \
				echo -e "\t            && rm -f Release_$(GBRELEASE)/tmp/$$f  \\" ; \
				echo -e "\t            || rm -f \$$@" ; \
				echo -e "\t@echo conversion of \$$@ done." ; \
				echo ; \
				done > $@ ; \
				echo >> $@ ; \
				echo "fasta_files: $$fasta_files" >> $@ ; 

Release_$(GBRELEASE)/taxonomy: 
	mkdir -p $@
	curl -L $(TAXOURL) \
	| tar -C $@ -zxf -