Data
Setup
Compound name to SMILES
name2smi
def name2smi(
name
):
Given a compound name, look up its SMILES string in the PubChem database.
Example usage on a DataFrame column (commented out in the source):
# df['SMILES']=df.drug.progress_apply(name2smi)
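One way such a name lookup could be done is through PubChem's PUG REST service. The sketch below is an assumption about the approach, not the library's actual implementation: `pubchem_smiles_url` and `name2smi_sketch` are hypothetical names, and the `CanonicalSMILES` property name may differ between PubChem releases.

```python
from typing import Optional
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_smiles_url(name: str) -> str:
    """Build a PUG REST URL requesting the SMILES of a compound name (hypothetical helper)."""
    # Percent-encode the name so spaces and slashes are safe inside the URL path.
    encoded = quote(name.strip(), safe="")
    return f"{PUG_REST}/compound/name/{encoded}/property/CanonicalSMILES/TXT"


def name2smi_sketch(name: str, timeout: float = 10.0) -> Optional[str]:
    """Return the first SMILES PubChem reports for `name`, or None if the lookup fails."""
    try:
        with urlopen(pubchem_smiles_url(name), timeout=timeout) as resp:
            tokens = resp.read().decode("utf-8").split()
    except OSError:
        # Covers HTTP errors (e.g. unknown name), network failures, and timeouts.
        return None
    # A name can map to several compounds; this sketch keeps only the first hit.
    return tokens[0] if tokens else None
```

Returning `None` instead of raising keeps a column-wise apply from stopping at the first unknown name; failed lookups can then be filtered out afterwards.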
Compound datasets
fetch_csv
def fetch_csv(
url
):
Collins
def Collins(
*args, **kwargs
):
A class for loading compound datasets from the Collins lab.
Collins lab dataset
The publication list is available on the lab page.
Collins.get_antibiotics_2k
def get_antibiotics_2k(
):
Antibiotic screening dataset: 2,560 compounds tested at 50 µM against E. coli K12 BW25113, leaving 2,335 unique compounds after deduplication. Source: Table S1B of "A Deep Learning Approach to Antibiotic Discovery" (Cell, 2020).
Collins.get_antibiotics_2k()

| | name | SMILES | inhibition | activity |
|---|---|---|---|---|
| 0 | CEFPIRAMIDE | Cc1cc(O)c(C(=O)NC(C(=O)NC2C(=O)N3C(C(=O)O)=C(C... | 0.041572 | 1 |
| 1 | GEMIFLOXACIN MESYLATE | CON=C1CN(c2nc3c(cc2F)c(=O)c(C(=O)O)cn3C2CC2)CC... | 0.041876 | 1 |
| ... | ... | ... | ... | ... |
| 2333 | EVANS BLUE | Cc1cc(-c2ccc(N=Nc3ccc4c(S(=O)(=O)[O-])cc(S(=O)... | 2.263200 | 0 |
| 2334 | PROTOPORPHYRINOGEN IX | C=Cc1c(C)c2cc3[nH]c(cc4nc(cc5[nH]c(cc1n2)c(C)c... | 2.627450 | 0 |
2335 rows × 4 columns
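Going from 2,560 screened compounds to 2,335 unique ones implies a deduplication step. How duplicates were merged is not stated here; the sketch below assumes that records sharing an identical SMILES string are collapsed and their inhibition values averaged. The function name and the toy rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def dedup_by_smiles(rows):
    """Collapse rows that share a SMILES string, averaging their inhibition values."""
    grouped = defaultdict(list)
    first_name = {}
    for name, smiles, inhibition in rows:
        grouped[smiles].append(inhibition)
        first_name.setdefault(smiles, name)  # keep the first name seen for each structure
    # Dicts preserve insertion order, so the output follows first appearance.
    return [(first_name[s], s, mean(values)) for s, values in grouped.items()]


# Toy rows (made up for illustration, not taken from the dataset).
toy = [
    ("COMPOUND A", "CCO", 0.25),
    ("COMPOUND A (repeat)", "CCO", 0.75),
    ("COMPOUND B", "c1ccccc1", 1.00),
]
unique = dedup_by_smiles(toy)
```

String matching only catches duplicates written identically; the same structure written as two different SMILES would need canonicalization first.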
Collins.get_antibiotics_39k
def get_antibiotics_39k(
):
Antibiotic screening dataset: 39,128 compounds tested at 50 µM against E. coli K12 BW25113. Source: Supplementary dataset EV1 of "Benchmarking AlphaFold-enabled molecular docking predictions for antibiotic discovery" (Molecular Systems Biology, 2022).
Collins.get_antibiotics_39k()

| | CANONICAL_SMILES | NAME | R1_50uM | R2_50uM | MEAN_50uM | activity |
|---|---|---|---|---|---|---|
| 0 | O=C1NC(=O)C=C1 | BRD-K78206682 | 0.029762 | 0.028344 | 0.029053 | 1 |
| 1 | CC(C)CCCC(=O)N[C@@H](CCNCS(=O)(=O)O)C(=O)N[C@@... | BRD-K01666924 | 0.029695 | 0.030176 | 0.029935 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 39126 | COC(=O)CCc1c(c/2[nH]c1/C=C/1\N=C(/C=C/3\N=C(/C... | BRD-K81849500 | 1.924381 | 1.924847 | 1.924614 | 0 |
| 39127 | Cc1cc(-c2ccc(N=Nc3ccc4c(S(=O)(=O)[O-])cc(S(=O)... | EVANS BLUE | 2.242000 | 2.284400 | 2.263200 | 0 |
39128 rows × 6 columns
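The `MEAN_50uM` column is consistent with the average of the two replicate readings `R1_50uM` and `R2_50uM`: for the first row, (0.029762 + 0.028344) / 2 = 0.029053. The sketch below recomputes that mean and assigns a binary label. The 0.2 cutoff is an assumption for illustration, since the threshold behind `activity` is not stated here, and the function name is hypothetical.

```python
def summarize_replicates(r1, r2, cutoff=0.2):
    """Average two replicate growth readings and flag low growth as active."""
    mean_growth = (r1 + r2) / 2
    # In the rows shown, low values carry activity 1 and high values activity 0.
    # The cutoff itself is an assumed value.
    activity = 1 if mean_growth < cutoff else 0
    return round(mean_growth, 6), activity


first = summarize_replicates(0.029762, 0.028344)  # first row of the table
last = summarize_replicates(2.242000, 2.284400)   # last row of the table (EVANS BLUE)
```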
Collins.get_antibiotics_enzyme
def get_antibiotics_enzyme(
):
Enzymatic inhibition dataset: 218 compounds tested at 100 µM against 12 essential proteins of E. coli K12 BW25113. This is the flattened benchmark dataset (Supplementary EV4) of "Benchmarking AlphaFold-enabled molecular docking predictions for antibiotic discovery" (Molecular Systems Biology, 2022).
Collins.get_antibiotics_enzyme()

| | compound_ID | SMILES | enzyme_uniprot | enzyme | enzyme_type | rep1 | rep2 | both_less_05 |
|---|---|---|---|---|---|---|---|---|
| 0 | BRD-K78206682 | O=C1NC(=O)C=C1 | P0AES4, P0AES6 | gyrA, gyrB | DNA gyrase | 0.610697 | 0.411738 | 0 |
| 1 | BRD-K01666924 | CC(C)CCCC(=O)N[C@@H](CCNCS(=O)(=O)O)C(=O)N[C@@... | P0AES4, P0AES6 | gyrA, gyrB | DNA gyrase | 0.574242 | 0.536372 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2614 | OXYQUINOLINE HEMISULFATE | O=S(=O)(O)O.Oc1cccc2cccnc12 | P11880 | murF | MurF | 0.746858 | 1.154308 | 0 |
| 2615 | CHLOROXINE | Oc1c(Cl)cc(Cl)c2cccnc12 | P11880 | murF | MurF | 0.851553 | 1.108231 | 0 |
2616 rows × 8 columns
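With 218 compounds and 12 proteins, a flattened table has 218 × 12 = 2,616 compound-enzyme rows, which matches the row count shown. The sketch below illustrates that flattening on a small nested mapping. It assumes `both_less_05` means "both replicates below 0.5", which the column name suggests but the text does not confirm; the function name and the second toy entry are hypothetical.

```python
def flatten_screen(wide, cutoff=0.5):
    """Turn {compound: {enzyme_type: (rep1, rep2)}} into one row per compound-enzyme pair."""
    rows = []
    for compound, per_enzyme in wide.items():
        for enzyme_type, (rep1, rep2) in per_enzyme.items():
            rows.append({
                "compound_ID": compound,
                "enzyme_type": enzyme_type,
                "rep1": rep1,
                "rep2": rep2,
                # Assumed meaning of the flag: both replicates fall below the cutoff.
                "both_less_05": int(rep1 < cutoff and rep2 < cutoff),
            })
    return rows


wide = {
    "BRD-K78206682": {
        "DNA gyrase": (0.610697, 0.411738),  # values from the first table row
        "toy enzyme": (0.20, 0.30),          # made-up values to show a positive flag
    },
}
rows = flatten_screen(wide)
```

Under this reading, a single replicate below 0.5 is not enough to set the flag, which agrees with the first table row (0.610697 and 0.411738 give 0).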
Kras datasets
Kras
def Kras(
*args, **kwargs
):
A class for fetching various KRAS datasets.
Kras.get_mirati_g12d_raw
def get_mirati_g12d_raw(
):
Raw G12D dataset from the Mirati paper and patents, without deduplication.
Kras.get_mirati_g12d_raw()

| | ID | SMILES | group | with_3F | racemic_trans | mixture_isomer | trans | Kd | IC50 | erk_IC50 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | US_1 | CN1CCC[C@H]1COc1nc(N2CC3CCC(C2)N3)c2cnc(cc2n1)... | US | 0 | 0 | 0 | 0 | 97.7 | 124.7 | 3159.1 |
| 1 | US_2 | CN1CCC[C@H]1COc1nc(N2CC3CCC(C2)N3)c2cnc(c(F)c2... | US | 1 | 0 | 0 | 0 | 2.4 | 2.7 | 721.4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 720 | paper_37 | FC1=C(C2=C(C(Cl)=CC=C3)C3=CC(O)=C2)N=CC4=C1N=C... | paper | 0 | 0 | 0 | 0 | NaN | 2.0 | 63.0 |
| 721 | paper_38 | FC1=C(C2=C(C(C#C)=CC=C3)C3=CC(O)=C2)N=CC4=C1N=... | paper | 0 | 0 | 0 | 0 | NaN | 2.0 | 14.0 |
722 rows × 10 columns
Kras.get_mirati_g12d
def get_mirati_g12d(
):
Deduplicated G12D dataset from the Mirati paper and patents.
Kras.get_mirati_g12d()

| | ID | SMILES | Kd | IC50 | erk_IC50 |
|---|---|---|---|---|---|
| 0 | US_1 | CN1CCC[C@H]1COc1nc(N2CC3CCC(C2)N3)c2cnc(cc2n1)... | 97.7 | 124.7 | 3159.1 |
| 1 | US_4 | Oc1cc(-c2ncc3c(nc(OCCc4ccccn4)nc3c2F)N2CC3CCC(... | 155.7 | 496.2 | 8530.0 |
| ... | ... | ... | ... | ... | ... |
| 658 | US_56 | OC[C@@H](O)COc1nc(N2CC3CCC(C2)N3)c2cnc(c(F)c2n... | 13805.3 | 6024.0 | NaN |
| 659 | US_66 | Fc1c(ncc2c(nc(OC[C@@]34CCCN3C(CCl)CC4)nc12)N1C... | NaN | 273.8 | 1332.6 |
660 rows × 5 columns
Kras.get_seq
def get_seq(
):
Protein sequence of human KRAS and its mutants G12D and G12C.
Kras.get_seq()

| | ID | WT_sequence | g12d_seq | g12c_seq |
|---|---|---|---|---|
| 0 | kras_human | MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... | MTEYKLVVVGADGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... | MTEYKLVVVGACGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... |
| 1 | kras_human_isoform2b | MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... | MTEYKLVVVGADGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... | MTEYKLVVVGACGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... |
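The mutant columns differ from the wild type at residue 12 only (G to D or G to C). As a minimal sketch of how such a substitution can be derived and checked, the hypothetical helper below applies a point mutation written in the usual `G12D` notation. It is run on the visible prefix of the wild-type sequence from the table, not on the full protein.

```python
def apply_point_mutation(seq, mutation):
    """Apply a substitution such as 'G12D' (1-based position) to a protein sequence."""
    ref, pos, alt = mutation[0], int(mutation[1:-1]), mutation[-1]
    # Guard against applying the mutation to the wrong sequence or numbering.
    if seq[pos - 1] != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {seq[pos - 1]}")
    return seq[:pos - 1] + alt + seq[pos:]


# Visible prefix of WT_sequence as shown in the table (the rest is truncated there).
wt_prefix = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI"
g12d_prefix = apply_point_mutation(wt_prefix, "G12D")
g12c_prefix = apply_point_mutation(wt_prefix, "G12C")
```

The reference-residue check makes an off-by-one in the numbering fail loudly instead of silently mutating the neighboring glycine at position 13.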