Commit 22c9f2c

allow searching against BFVD
1 parent 0f1aab2 commit 22c9f2c

3 files changed: 7 additions, 6 deletions

README.md

Lines changed: 3 additions & 2 deletions
@@ -10,7 +10,7 @@ It provides the `progres` Python package that lets you search structures against
 Searching typically takes 1-2 s and is much faster for multiple queries.
 For the AlphaFold database, initial data loading takes around a minute but subsequent searching takes a tenth of a second per query.

-Currently [SCOPe](https://scop.berkeley.edu), [CATH](http://cathdb.info), [ECOD](http://prodata.swmed.edu/ecod), the whole [PDB](https://www.rcsb.org), the [AlphaFold structures for 21 model organisms](https://doi.org/10.1093/nar/gkab1061) and the [AlphaFold database TED domains](https://www.biorxiv.org/content/10.1101/2024.03.18.585509) are provided for searching against.
+Currently [SCOPe](https://scop.berkeley.edu), [CATH](http://cathdb.info), [ECOD](http://prodata.swmed.edu/ecod), the whole [PDB](https://www.rcsb.org), [BFVD](https://bfvd.foldseek.com), the [AlphaFold structures for 21 model organisms](https://doi.org/10.1093/nar/gkab1061) and the [AlphaFold database TED domains](https://www.biorxiv.org/content/10.1101/2024.03.18.585509) are provided for searching against.
 Searching is done by domain but [Chainsaw](https://github.com/JudeWells/chainsaw) can be used to automatically split query structures into domains.

 A [web server](https://progres.mrc-lmb.cam.ac.uk) is available to run Progres.
@@ -27,7 +27,7 @@ conda install pytorch-scatter pyg -c pyg
 conda install kimlab::stride
 ```
 3. Run `pip install progres`, which will also install [Biopython](https://biopython.org), [mmtf-python](https://github.com/rcsb/mmtf-python), [einops](https://github.com/arogozhnikov/einops) and [pydantic](https://github.com/pydantic/pydantic) if they are not already present.
-4. The first time you search with the software the trained model and pre-embedded databases (~660 MB) will be downloaded to the package directory from [Zenodo](https://zenodo.org/record/7782088), which requires an internet connection. This can take a few minutes. You can set the environmental variable `PROGRES_DATA_DIR` to change where this data is stored, for example if you cannot write to the package directory. Remember to keep it set the next time you run Progres.
+4. The first time you search with the software the trained model and pre-embedded databases (~850 MB) will be downloaded to the package directory from [Zenodo](https://zenodo.org/record/7782088), which requires an internet connection. This can take a few minutes. You can set the environmental variable `PROGRES_DATA_DIR` to change where this data is stored, for example if you cannot write to the package directory. Remember to keep it set the next time you run Progres.
 5. The first time you search against the AlphaFold database TED domains the pre-embedded database (~33 GB) will be downloaded similarly. This can take a while. Make sure you have enough disk space!

 Alternatively, a Docker file is available in the `docker` directory.
@@ -93,6 +93,7 @@ The available pre-embedded databases are:
 | `cath40` | S40 non-redundant domains from [CATH](http://cathdb.info) 23/11/22 | 31,884 | 1.38 s | 2.79 s |
 | `ecod70` | F70 representative domains from [ECOD](http://prodata.swmed.edu/ecod) develop287 | 71,635 | 1.46 s | 3.82 s |
 | `pdb100` | All [PDB](https://www.rcsb.org) protein chains as of 02/08/24 split into domains with Chainsaw | 1,177,152 | 2.90 s | 27.3 s |
+| `bfvd` | [Big Fantastic Virus Database (BFVD)](https://bfvd.foldseek.com) structures split into domains with Chainsaw | 446,655 | 2.66 s | 13.3 s |
 | `af21org` | [AlphaFold](https://alphafold.ebi.ac.uk) structures for 21 model organisms split into domains by [CATH-Assign](https://doi.org/10.1038/s42003-023-04488-9) | 338,258 | 2.21 s | 11.0 s |
 | `afted` | [AlphaFold database](https://alphafold.ebi.ac.uk) structures split into domains by [TED](https://www.biorxiv.org/content/10.1101/2024.03.18.585509) and clustered at 50% sequence identity | 53,344,209 | 67.7 s | 73.1 s |
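
Step 4 of the README instructions in this diff says that setting `PROGRES_DATA_DIR` changes where the model and databases are stored. A minimal sketch of that lookup follows. The helper name `resolve_data_dir` is hypothetical and is not taken from the Progres source; the sketch also assumes the variable simply replaces the package directory whenever it is set to a non-empty value.

```python
import os


def resolve_data_dir(package_dir, environ=None):
    """Hypothetical helper: pick the directory used for downloaded data.

    Assumption: PROGRES_DATA_DIR, when set and non-empty, takes precedence
    over the package directory. This mirrors the README text, not the
    actual Progres implementation.
    """
    environ = os.environ if environ is None else environ
    return environ.get("PROGRES_DATA_DIR") or package_dir


if __name__ == "__main__":
    # Falls back to the package directory unless the variable is set.
    print(resolve_data_dir("/path/to/site-packages/progres"))
```

Because the variable is read on every run, it has to stay set in later sessions, which is why the README asks you to keep it set the next time you run Progres.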

bin/progres

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ parser_search.add_argument("-l", "--querylist",
 help="text file with one query file path per line")
 parser_search.add_argument("-t", "--targetdb", required=True,
 help=("pre-embedded database to search against, either \"scope95\", \"scope40\", "
-"\"cath40\", \"ecod70\", \"pdb100\", \"af21org\", \"afted\" or a file path"))
+"\"cath40\", \"ecod70\", \"pdb100\", \"bfvd\", \"af21org\", \"afted\" or a file path"))
 parser_search.add_argument("-f", "--fileformat",
 choices=["guess", "pdb", "mmcif", "mmtf", "coords"], default="guess",
 help="file format of the query structure(s), by default guessed from the file extension")
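
The hunk above changes only the help string, which suggests that `--targetdb` is a free-form argument rather than one restricted by `choices`, since it must also accept a file path. A small standalone sketch of that pattern is below. The parser name `progres-sketch`, the `build_parser` function and the file name `my_db.pt` are illustrative assumptions, and only the `-t`/`--targetdb` option is reproduced.

```python
import argparse


def build_parser():
    """Build a cut-down, hypothetical version of the search sub-command."""
    parser = argparse.ArgumentParser(prog="progres-sketch")
    subparsers = parser.add_subparsers(dest="mode")
    parser_search = subparsers.add_parser("search")
    # No `choices` here: the value may be a database name or a file path.
    parser_search.add_argument("-t", "--targetdb", required=True,
                               help="pre-embedded database name or a file path")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["search", "-t", "bfvd"])
    print(args.targetdb)
```

With this design, adding a database such as `bfvd` needs no change to argument parsing itself; the name is validated later against the list of known databases.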

progres/progres.py

Lines changed: 3 additions & 3 deletions
@@ -31,9 +31,9 @@
 dropout_final = 0.0
 default_minsimilarity = 0.8
 default_maxhits = 100
-pre_embedded_dbs = ["scope95", "scope40", "cath40", "ecod70", "pdb100", "af21org"]
+pre_embedded_dbs = ["scope95", "scope40", "cath40", "ecod70", "pdb100", "bfvd", "af21org"]
 pre_embedded_dbs_faiss = ["afted"]
-zenodo_record = "13365312" # This only needs to change when the trained model or databases change
+zenodo_record = "18245422" # This only needs to change when the trained model or databases change
 trained_model_subdir = "v_0_2_0" # This only needs to change when the trained model changes
 database_subdir = "v_0_2_1" # This only needs to change when the databases change
 progres_dir = os.path.dirname(os.path.realpath(__file__))
@@ -471,7 +471,7 @@ def download_data_if_required(download_afted=False):
 sep="", file=sys.stderr)
 printed = True
 if not printed:
-print("Downloading data as first time setup (~660 MB) to ", data_dir,
+print("Downloading data as first time setup (~850 MB) to ", data_dir,
 ", internet connection required, this can take a few minutes",
 sep="", file=sys.stderr)
 printed = True
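
The two lists edited in `progres/progres.py` appear to decide how a `--targetdb` value is interpreted: a name in `pre_embedded_dbs`, a name in `pre_embedded_dbs_faiss`, or otherwise a file path. The sketch below illustrates that routing under those assumptions. The function `classify_targetdb` and its return labels are hypothetical and do not come from the Progres source; only the two list definitions are copied from the diff.

```python
# Lists as they stand after this commit.
pre_embedded_dbs = ["scope95", "scope40", "cath40", "ecod70", "pdb100", "bfvd", "af21org"]
pre_embedded_dbs_faiss = ["afted"]


def classify_targetdb(targetdb):
    """Hypothetical helper: report how a target database value would be treated."""
    if targetdb in pre_embedded_dbs:
        return "pre-embedded"
    if targetdb in pre_embedded_dbs_faiss:
        return "pre-embedded-faiss"
    # Anything else is assumed to be a path to a user-supplied database file.
    return "file-path"


if __name__ == "__main__":
    for name in ["bfvd", "afted", "my_db.pt"]:
        print(name, "->", classify_targetdb(name))
```

Under this reading, appending `"bfvd"` to `pre_embedded_dbs` is what makes the new name resolve to a bundled database instead of being treated as a file path.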
