NCBI (GenBank) FTP paths do not match

This is more a note to myself than anything else. Hopefully someone else will find it useful too.

When traversing FTP paths on the GenBank server, especially programmatically, do not be surprised if the directory the server reports differs from the one you asked for. This is most visible when using the raw FTP protocol.

A simple example with a bacterial genome, using the command-line ftp client, shows this:

(cbc) CIS2X1NFGTF1:test_download_genbank aragaven$ ftp ftp://ftp.ncbi.nlm.nih.gov/
Trying 130.14.250.12...
Connected to ftp.wip.ncbi.nlm.nih.gov.
220-
 This warning banner provides privacy and security notices consistent with
 applicable federal laws, directives, and other federal guidance for accessing
 this Government system, which includes all devices/storage media attached to
 this system. This system is provided for Government-authorized use only.
 Unauthorized or improper use of this system is prohibited and may result in
 disciplinary action and/or civil and criminal penalties. At any time, and for
 any lawful Government purpose, the government may monitor, record, and audit
 your system usage and/or intercept, search and seize any communication or data
 transiting or stored on this system. Therefore, you have no reasonable
 expectation of privacy. Any communication or data transiting or stored on this
 system may be disclosed or used for any lawful Government purpose.
220 FTP Server ready.
331 Anonymous login ok, send your complete email address as your password
230 Anonymous access granted, restrictions apply
Remote system type is UNIX.
Using binary mode to transfer files.
200 Type set to I
ftp> cd /genomes/refseq/bacteria/Acaryochloris_marina/latest_assembly_versions/GCF_000018105.1_ASM1810v1
250 CWD command successful
ftp> pwd
Remote directory: /genomes/all/GCF/000/018/105/GCF_000018105.1_ASM1810v1
ftp>

I was puzzled for a little while, but it makes sense for the server to use some form of linking (hard links or symbolic links) to avoid duplicating data.

This should definitely be borne in mind when querying the server programmatically, as I have done here: https://github.com/compbiocore/access_genbank . There I store the download paths, both to access them later in the program and for logging purposes, so the path I requested and the path the server reports may not be the same string.
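The check from the ftp session above can be sketched in Python with the standard library's `ftplib`. This is only a minimal illustration, not the code from the repository linked above: the function names `resolve_remote_dir`, `paths_differ` and `check_remote_dir` are made up for this example, and it assumes anonymous login is allowed, as in the session shown.

```python
from ftplib import FTP


def resolve_remote_dir(ftp, requested_path):
    """Change into requested_path and return (requested, resolved).

    `ftp` is any object with cwd() and pwd() methods, e.g. ftplib.FTP.
    The resolved path is whatever the server reports after the CWD.
    """
    ftp.cwd(requested_path)
    resolved = ftp.pwd()
    return requested_path, resolved


def paths_differ(requested, resolved):
    """True if the two paths differ, ignoring a trailing slash."""
    return requested.rstrip("/") != resolved.rstrip("/")


def check_remote_dir(host, requested_path):
    """Connect anonymously, cd into the path, and report both paths.

    Hypothetical helper for illustration; it needs network access.
    """
    with FTP(host) as ftp:
        ftp.login()  # anonymous login
        requested, resolved = resolve_remote_dir(ftp, requested_path)
    if paths_differ(requested, resolved):
        print("requested:", requested)
        print("resolved: ", resolved)
    return requested, resolved
```

Calling `check_remote_dir("ftp.ncbi.nlm.nih.gov", "/genomes/refseq/bacteria/Acaryochloris_marina/latest_assembly_versions/GCF_000018105.1_ASM1810v1")` would repeat the session above. Storing both strings, rather than only the requested one, keeps the log consistent with what the server actually reports.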
