# Load data
df = Data.get_aa_rdkit()
df.shape
(25, 106)
remove_hi_corr (df:pandas.core.frame.DataFrame, thr:float=0.98)
Remove highly correlated features in a dataframe given a pearson threshold
| | Type | Default | Details |
|---|---|---|---|
| df | DataFrame | | |
| thr | float | 0.98 | threshold |
remove_hi_corr removes highly correlated features: a feature is dropped when its Pearson correlation with another feature exceeds the threshold thr.
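A minimal sketch of the idea, assuming a scan over the strict upper triangle of the absolute Pearson correlation matrix. The name `drop_correlated` and the toy columns are hypothetical, and the actual remove_hi_corr may choose differently which member of a correlated pair to drop.

```python
import numpy as np
import pandas as pd


def drop_correlated(df: pd.DataFrame, thr: float = 0.98) -> pd.DataFrame:
    """Drop one column from every pair whose |Pearson r| exceeds thr (illustrative)."""
    corr = df.corr(method="pearson").abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # A column is dropped if it is highly correlated with any earlier column
    to_drop = [col for col in upper.columns if (upper[col] > thr).any()]
    return df.drop(columns=to_drop)


rng = np.random.default_rng(0)
x = rng.normal(size=50)
demo = pd.DataFrame({
    "a": x,
    "b": 2 * x + 1,            # linear function of "a", so r = 1
    "c": rng.normal(size=50),  # independent noise
})
reduced = drop_correlated(demo, thr=0.98)
print(list(reduced.columns))
```

In this sketch the earlier column of each correlated pair is kept, so the result depends on column order.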
preprocess (df:pandas.core.frame.DataFrame, thr:float=0.98)
Remove features with no variance, and highly correlated features based on threshold
This function is similar to remove_hi_corr, but additionally removes features with zero variance (e.g., a feature that equals 1 across all samples).
removing columns: {'Chi2n', 'NumAromaticRings', 'SMR_VSA9', 'Chi4n', 'VSA_EState2', 'NumRotatableBonds', 'Chi0n', 'Chi1', 'SlogP_VSA5', 'Chi3v', 'Ipc', 'VSA_EState6', 'Chi2v', 'NumHDonors', 'NumAromaticCarbocycles', 'NumHeteroatoms', 'Chi4v', 'NOCount', 'Chi0v', 'Chi3n', 'Kappa1', 'VSA_EState10', 'fr_NH2', 'SMR_VSA1', 'Chi1v', 'fr_SH', 'fr_Ar_N', 'RingCount'}
(25, 78)
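The zero-variance step can be illustrated on its own. This is a sketch under the assumption that a constant column is detected by counting distinct values; the toy frame and variable names are hypothetical and not taken from the library.

```python
import pandas as pd

toy = pd.DataFrame({
    "ones": [1, 1, 1, 1],        # same value in every sample
    "x": [0.1, 0.4, 0.2, 0.9],
    "y": [3.0, 1.0, 4.0, 1.5],
})

# A constant column has zero standard deviation, so its Pearson
# correlation with any other column is undefined; a correlation
# threshold alone therefore cannot flag it.
n_distinct = toy.nunique()

# Keep only the columns that take more than one distinct value
variable = toy.loc[:, n_distinct > 1]
print(list(variable.columns))
```

Running a check like this before the correlation filter keeps undefined correlations out of the threshold comparison.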
standardize (df)
Standardize features from a df
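As an illustration of what standardizing means here, the sketch below applies a z-score per column with plain pandas. It assumes the population standard deviation (ddof=0), the convention used by scikit-learn's StandardScaler. The helper name `zscore` is hypothetical, and the source does not state how standardize is implemented. The values are the MW and H columns from the amino acid table shown later in this page.

```python
import pandas as pd


def zscore(df: pd.DataFrame) -> pd.DataFrame:
    """Centre each column to mean 0 and scale it to unit variance (illustrative)."""
    # ddof=0: divide by the population standard deviation
    return (df - df.mean()) / df.std(ddof=0)


raw = pd.DataFrame(
    {
        "MW": [89.10, 121.16, 133.11, 147.13, 165.19],
        "H": [0.62, 0.29, -0.90, -0.74, 1.19],
    },
    index=list("ACDEF"),
)
scaled = zscore(raw)
print(scaled.round(3))
```

After scaling, every column has mean 0 and standard deviation 1, so features measured in different units contribute comparably to downstream dimensionality reduction.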
get_rdkit (SMILES)
Extract chemical features from SMILES. Reference: https://greglandrum.github.io/rdkit-blog/posts/2022-12-23-descriptor-tutorial.html
get_rdkit_3d (SMILES)
Extract 3d features from SMILES
get_rdkit_all (SMILES)
Extract chemical features and 3d features from SMILES
get_rdkit_df (df, col, postprocess=True)
Extract rdkit features (including 3d) from SMILES in a df
| | Type | Default | Details |
|---|---|---|---|
| df | | | |
| col | | | column of SMILES |
| postprocess | bool | True | remove redundant columns and standardize features for dimension reduction |
aa | Name | SMILES | MW | pKa1 | pKb2 | pKx3 | pl4 | H | VSC | P1 | P2 | SASA | NCISC | phospho |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | Alanine | C[C@@H](C(=O)O)N | 89.10 | 2.34 | 9.69 | NaN | 6.00 | 0.62 | 27.5 | 8.1 | 0.046 | 1.181 | 0.007187 | 0 |
C | Cysteine | C([C@@H](C(=O)O)N)S | 121.16 | 1.96 | 10.28 | 8.18 | 5.07 | 0.29 | 44.6 | 5.5 | 0.128 | 1.461 | -0.036610 | 0 |
D | Aspartic acid | C([C@@H](C(=O)O)N)C(=O)O | 133.11 | 1.88 | 9.60 | 3.65 | 2.77 | -0.90 | 40.0 | 13.0 | 0.105 | 1.587 | -0.023820 | 0 |
E | Glutamic acid | C(CC(=O)O)[C@@H](C(=O)O)N | 147.13 | 2.19 | 9.67 | 4.25 | 3.22 | -0.74 | 62.0 | 12.3 | 0.151 | 1.862 | 0.006802 | 0 |
F | Phenylalanine | c1ccc(cc1)C[C@@H](C(=O)O)N | 165.19 | 1.83 | 9.13 | NaN | 5.48 | 1.19 | 115.5 | 5.2 | 0.290 | 2.228 | 0.037552 | 0 |
removing columns: {'fr_aniline', 'fr_ester', 'fr_nitrile', 'PMI3', 'fr_phenol', 'Chi0', 'NumAmideBonds', 'HeavyAtomMolWt', 'fr_phos_acid', 'fr_piperzine', 'fr_term_acetylene', 'fr_quatN', 'fr_pyridine', 'SlogP_VSA6', 'fr_COO2', 'SlogP_VSA9', 'fr_lactone', 'Eccentricity', 'fr_amidine', 'NumSaturatedCarbocycles', 'fr_piperdine', 'fr_dihydropyridine', 'fr_isocyan', 'MaxPartialCharge', 'fr_phenol_noOrthoHbond', 'fr_nitroso', 'fr_oxazole', 'fr_furan', 'fr_para_hydroxylation', 'fr_benzene', 'fr_Al_OH_noTert', 'fr_guanido', 'fr_C_S', 'fr_alkyl_halide', 'fr_COO', 'fr_urea', 'fr_azide', 'fr_barbitur', 'fr_morpholine', 'fr_lactam', 'fr_amide', 'NumAliphaticCarbocycles', 'PEOE_VSA13', 'fr_C_O_noCOO', 'fr_azo', 'SMR_VSA2', 'fr_Ar_COO', 'fr_epoxide', 'SlogP_VSA7', 'fr_tetrazole', 'fr_isothiocyan', 'LabuteASA', 'fr_methoxy', 'fr_Ndealkylation2', 'fr_ketone', 'fr_oxime', 'MolMR', 'SMR_VSA8', 'NumAliphaticRings', 'fr_sulfone', 'fr_halogen', 'fr_ArN', 'fr_aryl_methyl', 'Chi1n', 'NumSpiroAtoms', 'NumSaturatedHeterocycles', 'fr_Nhpyrrole', 'fr_Ar_OH', 'fr_Imine', 'fr_thiophene', 'Asphericity', 'HeavyAtomCount', 'fr_diazo', 'fr_Ar_NH', 'NumValenceElectrons', 'EState_VSA11', 'SlogP_VSA10', 'fr_benzodiazepine', 'ExactMolWt', 'SlogP_VSA11', 'fr_sulfonamd', 'fr_HOCCN', 'NumSaturatedRings', 'fr_Ndealkylation1', 'SlogP_VSA12', 'fr_aldehyde', 'BCUT2D_MRHI', 'PEOE_VSA5', 'NumBridgeheadAtoms', 'fr_alkyl_carbamate', 'fr_hdrzone', 'fr_thiazole', 'fr_ether', 'fr_imide', 'fr_N_O', 'fr_prisulfonamd', 'VSA_EState1', 'fr_nitro_arom', 'fr_thiocyan', 'fr_ketone_Topliss', 'fr_bicyclic', 'fr_nitro', 'fr_hdrzine', 'MinAbsPartialCharge', 'fr_allylic_oxid', 'NumRadicalElectrons', 'fr_nitro_arom_nonortho', 'fr_phos_ester', 'MaxEStateIndex'}
aa | MaxAbsEStateIndex | MinAbsEStateIndex | MinEStateIndex | qed | SPS | MolWt | MinPartialCharge | MaxAbsPartialCharge | FpDensityMorgan1 | FpDensityMorgan2 | ... | fr_sulfide | fr_unbrch_alkane | PMI1 | PMI2 | NPR1 | NPR2 | RadiusOfGyration | InertialShapeFactor | SpherocityIndex | PBF |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | -1.653421 | 1.218945 | 0.407753 | -0.383393 | -0.070345 | -1.523488 | 0.218275 | -0.295105 | 1.732180 | 0.163309 | ... | -0.204124 | -0.369274 | -1.294800 | -1.086763 | 1.710287 | 1.057430 | -1.514798 | 1.952520 | 0.622167 | -1.068611 |
C | -1.058215 | -0.588000 | 0.372307 | -0.641865 | -0.138884 | -0.727067 | 0.223839 | -0.298132 | 1.732180 | 1.481227 | ... | -0.204124 | -0.369274 | -0.445607 | -0.885625 | 1.398986 | -2.080741 | -0.970023 | -0.247854 | 0.617683 | -0.493157 |
D | -0.764466 | 0.554854 | 0.126078 | -0.376981 | -0.390192 | -0.430473 | 0.020085 | -0.187297 | -0.934412 | -1.234483 | ... | -0.204124 | -0.369274 | -0.724059 | -0.459223 | -0.441549 | 0.454216 | -0.332705 | 0.426131 | 0.316338 | 0.080717 |
E | -0.283221 | -1.143984 | 0.235448 | -0.051582 | -0.406185 | -0.082096 | 0.010166 | -0.181902 | -1.147739 | -1.178571 | ... | -0.204124 | -0.369274 | -0.504211 | -0.096004 | -0.699329 | 1.605206 | 0.070614 | 0.283217 | -0.128709 | -0.453659 |
F | 0.972596 | 0.063427 | 0.410788 | 1.908107 | -0.430173 | 0.366494 | 0.221306 | -0.296754 | -1.067741 | -0.675366 | ... | -0.204124 | -0.369274 | -0.220806 | 0.390422 | -1.137341 | 0.320498 | 0.643196 | -0.162855 | -1.214938 | -0.971254 |
5 rows × 119 columns
get_morgan (df:pandas.core.frame.DataFrame, col:str='SMILES', radius=3)
Get 2048 morgan fingerprint (binary feature) from smiles in a dataframe
| | Type | Default | Details |
|---|---|---|---|
| df | DataFrame | | a dataframe that contains SMILES |
| col | str | SMILES | column name of the SMILES |
| radius | int | 3 | |
aa | morgan_0 | morgan_1 | morgan_2 | morgan_3 | morgan_4 | morgan_5 | morgan_6 | morgan_7 | morgan_8 | morgan_9 | ... | morgan_2038 | morgan_2039 | morgan_2040 | morgan_2041 | morgan_2042 | morgan_2043 | morgan_2044 | morgan_2045 | morgan_2046 | morgan_2047 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
C | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
D | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
E | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
F | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 2048 columns
onehot_encode (sequences, transform_colname=True, n=20)
| | -20A | -20C | -20D | -20E | -20F | -20G | -20H | -20I | -20K | -20L | ... | -6N | -6P | -6Q | -6R | -6S | -6T | -6V | -6W | -6Y | -6_ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
996 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
997 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
998 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
999 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1000 rows × 297 columns
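The column labels above combine a position and a residue (for example -20A or -6_). A minimal sketch of such a position-wise encoding follows, assuming equal-length sequences and positions numbered relative to the central residue. The function `onehot_by_position` and its `center` argument are hypothetical; the source does not show how onehot_encode builds its columns.

```python
import pandas as pd


def onehot_by_position(sequences, center=None) -> pd.DataFrame:
    """One-hot encode equal-length sequences, one column per (position, residue)."""
    seqs = list(sequences)
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    if center is None:
        center = length // 2
    # One column per position, labelled by its offset from the centre
    chars = pd.DataFrame(
        [list(s) for s in seqs],
        columns=[str(i - center) for i in range(length)],
    )
    # get_dummies adds one indicator column per residue observed at each position
    return pd.get_dummies(chars, prefix_sep="", dtype=float)


encoded = onehot_by_position(["AKS_T", "GRSPT", "AKSPT"])
print(encoded.columns.tolist())
```

Only residues that actually occur at a position receive a column in this sketch, so the column count depends on the input sequences.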
get_esm (df:pandas.core.frame.DataFrame, col:str='sequence', model_name:str='esm2_t33_650M_UR50D')
Extract ESM-2 embeddings from protein sequences in a dataframe
| | Type | Default | Details |
|---|---|---|---|
| df | DataFrame | | a dataframe that contains amino acid sequence |
| col | str | sequence | colname of amino acid sequence |
| model_name | str | esm2_t33_650M_UR50D | Name of the ESM model to use for the embeddings. |
ESM2 models are trained on UniRef sequences. The default model in the function is esm2_t33_650M_UR50D, which is trained on UniRef50.
Uncomment below to use:
get_t5 (df:pandas.core.frame.DataFrame, col:str='sequence')
Extract ProtT5-XL-uniref50 embeddings from protein sequence in a dataframe
The ProtT5-XL-UniRef50 model is a T5-3B model trained on the UniRef50 dataset.
Uncomment below to use:
get_t5_bfd (df:pandas.core.frame.DataFrame, col:str='sequence')
Extract ProtT5-XL-BFD embeddings from protein sequence in a dataframe
The ProtT5-XL-BFD model is a T5-3B model trained on the Big Fantastic Database (BFD).
Uncomment below to use:
reduce_feature (df:pandas.core.frame.DataFrame, method:str='pca', complexity:int=20, n:int=2, load:str=None, save:str=None, seed:int=123, **kwargs)
Reduce the dimensionality given a dataframe of values
| | Type | Default | Details |
|---|---|---|---|
| df | DataFrame | | |
| method | str | pca | dimensionality reduction method; accepts both upper and lower case |
| complexity | int | 20 | not used for PCA; perplexity for TSNE (recommended: 30); n_neighbors for UMAP (recommended: 15) |
| n | int | 2 | n_components |
| load | str | None | load a previous model, e.g. model.pkl |
| save | str | None | pkl file to be saved, e.g. pca_model.pkl |
| seed | int | 123 | seed for random_state |
| kwargs | VAR_KEYWORD | | |
A common way to reduce the number of features is dimensionality reduction. reduce_feature supports three methods: PCA, UMAP and TSNE. PCA is a linear transformation, whereas UMAP and TSNE are non-linear. For plotting, UMAP or TSNE work well with n (n_components) set to 2 for a 2D plot; for building features, PCA is a good choice, with n set to a reasonable value such as 64 or 128.
# Load data
df = Data.get_aa_rdkit()
# Use PCA to reduce dimension; reduce the number of features to 20
reduce_feature(df,'pca',n=20).head()
aa | PCA1 | PCA2 | PCA3 | PCA4 | PCA5 | PCA6 | PCA7 | PCA8 | PCA9 | PCA10 | PCA11 | PCA12 | PCA13 | PCA14 | PCA15 | PCA16 | PCA17 | PCA18 | PCA19 | PCA20 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | -463.014948 | -79.180061 | -8.957621 | -13.455810 | 1.334975 | 0.996915 | -6.228205 | -2.573987 | 0.637178 | 1.904173 | -1.165818 | 5.803809 | -3.519867 | 1.620306 | -3.686674 | -1.070729 | -1.044587 | -2.245493 | -0.023173 | 2.143434 |
C | -446.251885 | -52.851228 | 1.200874 | 0.469406 | -16.721236 | 12.310611 | 5.623647 | 17.543569 | 6.290376 | -4.818617 | 0.871101 | -1.274344 | 3.983329 | -6.019231 | -5.866159 | -0.339787 | -3.342606 | 0.934348 | -0.335121 | -0.045040 |
D | -407.721016 | 9.532878 | 10.375789 | -21.871983 | -3.757091 | -2.804468 | 3.684495 | -8.257556 | 0.885011 | 4.454468 | 4.085862 | 3.059634 | 3.463971 | 2.626308 | 1.260286 | -3.054156 | -1.823627 | 4.300302 | -2.407938 | 1.076088 |
E | -355.786380 | 21.077202 | 11.870110 | -10.861780 | 4.869825 | -3.906521 | -2.281413 | -2.893303 | 8.997722 | 3.828554 | 2.004998 | -1.002484 | 9.471326 | -1.945113 | 3.684237 | 2.119809 | 0.362597 | -1.166347 | -3.604024 | -0.752169 |
F | 69.598210 | 74.375112 | -68.407808 | 2.572185 | 6.659703 | 16.787547 | 10.585299 | -1.588954 | -10.532959 | -2.643261 | -7.012191 | 4.522682 | 4.779671 | 0.908865 | -0.218389 | 1.152286 | 0.560021 | 1.736196 | 0.152801 | -0.567469 |
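To make the linear nature of PCA concrete, the sketch below projects a feature table onto its leading principal axes using a singular value decomposition in NumPy. This is a stand-in for illustration only: `pca_project` is a hypothetical helper, the random input is synthetic, and the source does not say which PCA implementation reduce_feature wraps.

```python
import numpy as np
import pandas as pd


def pca_project(df: pd.DataFrame, n: int = 2) -> pd.DataFrame:
    """Project rows of df onto the first n principal axes (illustrative)."""
    X = df.to_numpy(dtype=float)
    Xc = X - X.mean(axis=0)  # centre each feature
    # Rows of Vt are the principal axes, ordered by decreasing singular value
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n].T   # linear projection onto the top n axes
    cols = [f"PCA{i + 1}" for i in range(n)]
    return pd.DataFrame(scores, index=df.index, columns=cols)


rng = np.random.default_rng(123)
features = pd.DataFrame(rng.normal(size=(25, 10)))  # synthetic stand-in data
projected = pca_project(features, n=3)
print(projected.shape)
```

Because the projection is a single matrix product, the same axes can be reused to transform new samples. A fitted non-linear embedding such as TSNE offers no comparable reuse, which is one reason PCA suits feature building.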