Downloading the Zinc Database Rapidly with Aria2

Sulstice
3 min readMar 24, 2022

--

There was a time where I needed to download the zinc database rapidly because we made some earlier mistakes in some downstream pipelines and was approaching a deadline. The entire pipeline needed to be restarted and thus from downloading the zinc database.

The Zinc Database has two major versions, 2015 and 2020, so you’ll see the name Zinc15 and Zinc20 respectively. The Zinc Database uses a chemical makeup of building blocks and, I believe, a big emphasis on molecular shape diversity. Essentially, is breaking up a big group of compounds based on whether they look more “spherical-like”, “disc-like”, “rod-like” in that order shown down below.

To download these combinatorial libraries, we have to take a look at the existing infrastructure of the zinc web server.

The data is available here in a series of tranches. Tranches are allocated sections of molecules split up by some basic physical characteristics “Molecular Weight” and “LogP”.

For our particular purpose we needed the data in “3D” and not “2D” space. For stable molecules that I think will minimize any data probably not needed for our purposes:

  • React Clean (Non-Reactive Molecules)
  • Purch Wait Ok (There is a way to but it)
  • pH Reference R(One form at pH 7).
  • Neutral Charge (0)

This gives us around 362 Million Molecules Distributed over the course of 1.1 Tranches.

The Webserver provides you with some options on how to download. Since I only want the RAW URLS and the tranches of their molecule exist and the mol2 format, my options looked like this:

Next I have my URLS:

mkdir -pv AA/AARN && wget http://files.docking.org/3D/AA/AARN/AAAARN.xaa.mol2.gz -O AA/AARN/AAAARN.xaa.mol2.gz

I did a little file magic here and removed the mkdir and wgetcomponent. We won’t be needing that and we should have a list of URLS like so in a file called urls.txt:

http://files.docking.org/3D/CC/ADRN/CCADRN.xaw.mol2.gz
http://files.docking.org/3D/CE/ADRO/CEADRO.xas.mol2.gz
http://files.docking.org/3D/CH/ADRN/CHADRN.xat.mol2.gz
http://files.docking.org/3D/CD/ADRN/CDADRN.xbd.mol2.gz
http://files.docking.org/3D/BE/ADRP/BEADRP.xaa.mol2.gz
http://files.docking.org/3D/CF/ADRO/CFADRO.xag.mol2.gz
http://files.docking.org/3D/CF/ADRN/CFADRN.xcg.mol2.gz
http://files.docking.org/3D/CF/ADRN/CFADRN.xai.mol2.gz
http://files.docking.org/3D/BF/ADRN/BFADRN.xak.mol2.gz

Alright, next we are going to download them in parallel. I chose the software aria2c to download because of it’s features to download rapidly, keep an accurate log of downloads, and ability to restart interrupted downloads in case it happens. Which most likely will when you do massive data transfers. Data loss metrics should always be reported.

It’s pretty easy to setup since it’s conda installable.

conda install -c bioconda aria2

And then activate your environment:

source activate aria2

And then run it:

aria2c -i urls.txt -x 8

This will run the aria2 program over 8 cores fetching the URLS like so:

I’ve been able to download pretty nearly all of it within an hour or so as well as monitor it’s persistence for rapid downloading over here:

https://github.com/Sulstice/Uptime-Cheminformatics

Try not to bring it down, remember that with great power comes great responsibility.

--

--

Sulstice
Sulstice

No responses yet