Feature

A collection of tools to extract features from SMILES, proteins, etc.

Setup

Utils



remove_hi_corr

 remove_hi_corr (df:pandas.core.frame.DataFrame, thr:float=0.98)

Remove highly correlated features in a dataframe given a Pearson correlation threshold

Parameter  Type       Default  Details
df         DataFrame
thr        float      0.98     threshold

remove_hi_corr removes highly correlated features: it computes the Pearson correlation between features and drops those whose correlation exceeds the threshold thr.

# Load data
df = Data.get_aa_rdkit()
df.shape
(25, 106)
remove_hi_corr(df,thr=0.9).shape
(25, 78)
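The idea behind this kind of filter can be sketched with pandas alone. The helper below, drop_correlated, is a hypothetical name used for illustration and is not the library's implementation; it assumes that, for each pair of features whose absolute Pearson correlation exceeds thr, the later column is the one dropped.

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, thr: float = 0.98) -> pd.DataFrame:
    """Hypothetical sketch: drop one column from each pair with |Pearson r| > thr."""
    corr = df.corr(method="pearson").abs()
    # Keep only the strict upper triangle so every pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # A column is dropped if it correlates too strongly with any earlier column
    to_drop = [c for c in upper.columns if (upper[c] > thr).any()]
    return df.drop(columns=to_drop)

rng = np.random.default_rng(0)
x = rng.normal(size=50)
toy_corr = pd.DataFrame({
    "a": x,
    "b": 2 * x + 1,            # perfectly correlated with "a"
    "c": rng.normal(size=50),  # independent noise
})
reduced = drop_correlated(toy_corr, thr=0.9)
print(list(reduced.columns))
```

Which member of a correlated pair is kept is a design choice; this sketch keeps the column that appears first in the dataframe.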


preprocess

 preprocess (df:pandas.core.frame.DataFrame, thr:float=0.98)

Remove features with no variance, and highly correlated features based on threshold

This function is similar to remove_hi_corr, but additionally removes zero-variance features (e.g., a feature equal to 1 across all samples).

preprocess(df,thr=0.9).shape
removing columns: {'Chi2n', 'NumAromaticRings', 'SMR_VSA9', 'Chi4n', 'VSA_EState2', 'NumRotatableBonds', 'Chi0n', 'Chi1', 'SlogP_VSA5', 'Chi3v', 'Ipc', 'VSA_EState6', 'Chi2v', 'NumHDonors', 'NumAromaticCarbocycles', 'NumHeteroatoms', 'Chi4v', 'NOCount', 'Chi0v', 'Chi3n', 'Kappa1', 'VSA_EState10', 'fr_NH2', 'SMR_VSA1', 'Chi1v', 'fr_SH', 'fr_Ar_N', 'RingCount'}
(25, 78)
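The zero-variance step can be illustrated on its own. This is a minimal sketch under the assumption that "no variance" means a column holds a single value in every row; drop_constant is a hypothetical helper, not part of the library.

```python
import pandas as pd

def drop_constant(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: drop columns that take a single value across all rows."""
    constant = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    return df.drop(columns=constant)

toy_const = pd.DataFrame({
    "flag": [1, 1, 1, 1],                  # constant, carries no information
    "mass": [89.1, 121.2, 133.1, 147.1],   # varies across samples
})
kept = drop_constant(toy_const)
print(list(kept.columns))
```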


standardize

 standardize (df)

Standardize features from a df
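Standardization usually means rescaling each feature to zero mean and unit variance. The sketch below shows that z-score transform in pandas; the helper name zscore and the use of the population standard deviation (ddof=0) are assumptions, and standardize itself may differ in detail.

```python
import pandas as pd

def zscore(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: column-wise z-score using the population std (ddof=0)."""
    mean = df.mean()
    std = df.std(ddof=0)
    # Note: a constant column has std 0 and would yield NaN here,
    # so zero-variance features should be removed beforehand
    return (df - mean) / std

toy_scale = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 60.0]})
scaled = zscore(toy_scale)
print(scaled.round(3))
```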

Compound features

RDKit descriptors



get_rdkit

 get_rdkit (SMILES)

Extract chemical features from SMILES. Reference: https://greglandrum.github.io/rdkit-blog/posts/2022-12-23-descriptor-tutorial.html



get_rdkit_3d

 get_rdkit_3d (SMILES)

Extract 3d features from SMILES



get_rdkit_all

 get_rdkit_all (SMILES)

Extract chemical features and 3d features from SMILES



get_rdkit_df

 get_rdkit_df (df, col, postprocess=True)

Extract rdkit features (including 3d) from SMILES in a df

Parameter    Type  Default  Details
df
col                         column of SMILES
postprocess  bool  True     remove redundant columns and standardize features for dimension reduction
aa = Data.get_aa_info()
aa.head()
Name SMILES MW pKa1 pKb2 pKx3 pl4 H VSC P1 P2 SASA NCISC phospho
aa
A Alanine C[C@@H](C(=O)O)N 89.10 2.34 9.69 NaN 6.00 0.62 27.5 8.1 0.046 1.181 0.007187 0
C Cysteine C([C@@H](C(=O)O)N)S 121.16 1.96 10.28 8.18 5.07 0.29 44.6 5.5 0.128 1.461 -0.036610 0
D Aspartic acid C([C@@H](C(=O)O)N)C(=O)O 133.11 1.88 9.60 3.65 2.77 -0.90 40.0 13.0 0.105 1.587 -0.023820 0
E Glutamic acid C(CC(=O)O)[C@@H](C(=O)O)N 147.13 2.19 9.67 4.25 3.22 -0.74 62.0 12.3 0.151 1.862 0.006802 0
F Phenylalanine c1ccc(cc1)C[C@@H](C(=O)O)N 165.19 1.83 9.13 NaN 5.48 1.19 115.5 5.2 0.290 2.228 0.037552 0
aa_rdkit = get_rdkit_df(aa, 'SMILES')
aa_rdkit.head()
removing columns: {'fr_aniline', 'fr_ester', 'fr_nitrile', 'PMI3', 'fr_phenol', 'Chi0', 'NumAmideBonds', 'HeavyAtomMolWt', 'fr_phos_acid', 'fr_piperzine', 'fr_term_acetylene', 'fr_quatN', 'fr_pyridine', 'SlogP_VSA6', 'fr_COO2', 'SlogP_VSA9', 'fr_lactone', 'Eccentricity', 'fr_amidine', 'NumSaturatedCarbocycles', 'fr_piperdine', 'fr_dihydropyridine', 'fr_isocyan', 'MaxPartialCharge', 'fr_phenol_noOrthoHbond', 'fr_nitroso', 'fr_oxazole', 'fr_furan', 'fr_para_hydroxylation', 'fr_benzene', 'fr_Al_OH_noTert', 'fr_guanido', 'fr_C_S', 'fr_alkyl_halide', 'fr_COO', 'fr_urea', 'fr_azide', 'fr_barbitur', 'fr_morpholine', 'fr_lactam', 'fr_amide', 'NumAliphaticCarbocycles', 'PEOE_VSA13', 'fr_C_O_noCOO', 'fr_azo', 'SMR_VSA2', 'fr_Ar_COO', 'fr_epoxide', 'SlogP_VSA7', 'fr_tetrazole', 'fr_isothiocyan', 'LabuteASA', 'fr_methoxy', 'fr_Ndealkylation2', 'fr_ketone', 'fr_oxime', 'MolMR', 'SMR_VSA8', 'NumAliphaticRings', 'fr_sulfone', 'fr_halogen', 'fr_ArN', 'fr_aryl_methyl', 'Chi1n', 'NumSpiroAtoms', 'NumSaturatedHeterocycles', 'fr_Nhpyrrole', 'fr_Ar_OH', 'fr_Imine', 'fr_thiophene', 'Asphericity', 'HeavyAtomCount', 'fr_diazo', 'fr_Ar_NH', 'NumValenceElectrons', 'EState_VSA11', 'SlogP_VSA10', 'fr_benzodiazepine', 'ExactMolWt', 'SlogP_VSA11', 'fr_sulfonamd', 'fr_HOCCN', 'NumSaturatedRings', 'fr_Ndealkylation1', 'SlogP_VSA12', 'fr_aldehyde', 'BCUT2D_MRHI', 'PEOE_VSA5', 'NumBridgeheadAtoms', 'fr_alkyl_carbamate', 'fr_hdrzone', 'fr_thiazole', 'fr_ether', 'fr_imide', 'fr_N_O', 'fr_prisulfonamd', 'VSA_EState1', 'fr_nitro_arom', 'fr_thiocyan', 'fr_ketone_Topliss', 'fr_bicyclic', 'fr_nitro', 'fr_hdrzine', 'MinAbsPartialCharge', 'fr_allylic_oxid', 'NumRadicalElectrons', 'fr_nitro_arom_nonortho', 'fr_phos_ester', 'MaxEStateIndex'}
MaxAbsEStateIndex MinAbsEStateIndex MinEStateIndex qed SPS MolWt MinPartialCharge MaxAbsPartialCharge FpDensityMorgan1 FpDensityMorgan2 ... fr_sulfide fr_unbrch_alkane PMI1 PMI2 NPR1 NPR2 RadiusOfGyration InertialShapeFactor SpherocityIndex PBF
aa
A -1.653421 1.218945 0.407753 -0.383393 -0.070345 -1.523488 0.218275 -0.295105 1.732180 0.163309 ... -0.204124 -0.369274 -1.294800 -1.086763 1.710287 1.057430 -1.514798 1.952520 0.622167 -1.068611
C -1.058215 -0.588000 0.372307 -0.641865 -0.138884 -0.727067 0.223839 -0.298132 1.732180 1.481227 ... -0.204124 -0.369274 -0.445607 -0.885625 1.398986 -2.080741 -0.970023 -0.247854 0.617683 -0.493157
D -0.764466 0.554854 0.126078 -0.376981 -0.390192 -0.430473 0.020085 -0.187297 -0.934412 -1.234483 ... -0.204124 -0.369274 -0.724059 -0.459223 -0.441549 0.454216 -0.332705 0.426131 0.316338 0.080717
E -0.283221 -1.143984 0.235448 -0.051582 -0.406185 -0.082096 0.010166 -0.181902 -1.147739 -1.178571 ... -0.204124 -0.369274 -0.504211 -0.096004 -0.699329 1.605206 0.070614 0.283217 -0.128709 -0.453659
F 0.972596 0.063427 0.410788 1.908107 -0.430173 0.366494 0.221306 -0.296754 -1.067741 -0.675366 ... -0.204124 -0.369274 -0.220806 0.390422 -1.137341 0.320498 0.643196 -0.162855 -1.214938 -0.971254

5 rows × 119 columns

Morgan fingerprint



get_morgan

 get_morgan (df:pandas.core.frame.DataFrame, col:str='SMILES', radius=3)

Get 2048-bit Morgan fingerprints (binary features) from SMILES in a dataframe

Parameter  Type       Default  Details
df         DataFrame           a dataframe that contains SMILES
col        str        SMILES   name of the SMILES column
radius     int        3
aa_morgan = get_morgan(aa, 'SMILES')
aa_morgan.head()
morgan_0 morgan_1 morgan_2 morgan_3 morgan_4 morgan_5 morgan_6 morgan_7 morgan_8 morgan_9 ... morgan_2038 morgan_2039 morgan_2040 morgan_2041 morgan_2042 morgan_2043 morgan_2044 morgan_2045 morgan_2046 morgan_2047
aa
A 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
C 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
D 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
E 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
F 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 2048 columns

Protein sequence

Onehot



onehot_encode

 onehot_encode (sequences, transform_colname=True, n=20)
df=Data.get_combine_site_psp_ochoa()
onehot_encode(df['site_seq'].head(1000))
-20A -20C -20D -20E -20F -20G -20H -20I -20K -20L ... -6N -6P -6Q -6R -6S -6T -6V -6W -6Y -6_
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
996 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
997 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
998 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
999 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

1000 rows × 297 columns
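The column names in the output above combine a position relative to the central residue with an amino acid letter (for example -20A), and "_" appears to act as a padding character. The sketch below reproduces that naming scheme with pandas on toy data; onehot_positions is a hypothetical helper, and it assumes every sequence has length 2n+1. It is not the library's implementation of onehot_encode.

```python
import pandas as pd

def onehot_positions(sequences, n: int) -> pd.DataFrame:
    """Hypothetical sketch: one-hot encode sequences of length 2n+1,
    naming columns '<position><residue>'."""
    seqs = pd.Series(list(sequences))
    # One column per sequence position, one row per sequence
    chars = pd.DataFrame([list(s) for s in seqs], index=seqs.index)
    # Label positions relative to the central residue: -n ... 0 ... n
    chars.columns = [str(i) for i in range(-n, n + 1)]
    # prefix_sep="" yields names such as "-1A" or "0S"
    return pd.get_dummies(chars, prefix_sep="", dtype=float)

toy_seqs = ["AS_", "GSA"]  # length 3, so n=1; "_" marks padding
encoded = onehot_positions(toy_seqs, n=1)
print(encoded)
```

Only residue and position combinations that occur in the input produce a column, so the number of columns depends on the data.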

ESM2



get_esm

 get_esm (df:pandas.core.frame.DataFrame, col:str='sequence',
          model_name:str='esm2_t33_650M_UR50D')

Extract ESM2 embeddings from protein sequence in a dataframe

Parameter   Type       Default              Details
df          DataFrame                       a dataframe that contains amino acid sequence
col         str        sequence             colname of amino acid sequence
model_name  str        esm2_t33_650M_UR50D  Name of the ESM model to use for the embeddings.

ESM2 models are trained on UniRef sequences. The default model in this function is esm2_t33_650M_UR50D, which is trained on UniRef50.

Uncomment the code below to use:

# # Examples
# df = Data.get_kinase_info().set_index('kinase')
# sample = df[:5]
# esmfeature = get_esm(sample,'sequence')
# esmfeature.head()

ProtT5



get_t5

 get_t5 (df:pandas.core.frame.DataFrame, col:str='sequence')

Extract ProtT5-XL-uniref50 embeddings from protein sequence in a dataframe

The ProtT5-XL-UniRef50 model is a T5-3B model trained on the UniRef50 dataset.

Uncomment the code below to use:

# t5feature = get_t5(sample,'sequence')
# t5feature.head()


get_t5_bfd

 get_t5_bfd (df:pandas.core.frame.DataFrame, col:str='sequence')

Extract ProtT5-XL-BFD embeddings from protein sequence in a dataframe

The ProtT5-XL-BFD model is a T5-3B model trained on the Big Fantastic Database (BFD).

Uncomment the code below to use:

# t5bfd = get_t5_bfd(sample,'sequence')
# t5bfd.head()

Dimensionality reduction



reduce_feature

 reduce_feature (df:pandas.core.frame.DataFrame, method:str='pca',
                 complexity:int=20, n:int=2, load:str=None, save:str=None,
                 seed:int=123, **kwargs)

Reduce the dimensionality given a dataframe of values

Parameter   Type         Default  Details
df          DataFrame
method      str          pca      dimensionality reduction method; accepts both upper and lower case
complexity  int          20       not applicable to PCA; perplexity for TSNE (recommended: 30); n_neighbors for UMAP (recommended: 15)
n           int          2        n_components
load        str          None     load a previous model, e.g. model.pkl
save        str          None     pkl file to be saved, e.g. pca_model.pkl
seed        int          123      seed for random_state
kwargs      VAR_KEYWORD

Dimensionality reduction is a common way to reduce the number of features. reduce_feature supports three methods: PCA, UMAP, and TSNE. PCA is a linear transformation, while UMAP and TSNE are non-linear. For plotting, UMAP or TSNE works well with n (n_components) set to 2 for a 2D plot; for feature engineering, PCA is a good choice, with n set to a reasonable value such as 64 or 128.

# Load data
df = Data.get_aa_rdkit()

# Use PCA to reduce dimension; reduce the number of features to 20
reduce_feature(df,'pca',n=20).head()
PCA1 PCA2 PCA3 PCA4 PCA5 PCA6 PCA7 PCA8 PCA9 PCA10 PCA11 PCA12 PCA13 PCA14 PCA15 PCA16 PCA17 PCA18 PCA19 PCA20
aa
A -463.014948 -79.180061 -8.957621 -13.455810 1.334975 0.996915 -6.228205 -2.573987 0.637178 1.904173 -1.165818 5.803809 -3.519867 1.620306 -3.686674 -1.070729 -1.044587 -2.245493 -0.023173 2.143434
C -446.251885 -52.851228 1.200874 0.469406 -16.721236 12.310611 5.623647 17.543569 6.290376 -4.818617 0.871101 -1.274344 3.983329 -6.019231 -5.866159 -0.339787 -3.342606 0.934348 -0.335121 -0.045040
D -407.721016 9.532878 10.375789 -21.871983 -3.757091 -2.804468 3.684495 -8.257556 0.885011 4.454468 4.085862 3.059634 3.463971 2.626308 1.260286 -3.054156 -1.823627 4.300302 -2.407938 1.076088
E -355.786380 21.077202 11.870110 -10.861780 4.869825 -3.906521 -2.281413 -2.893303 8.997722 3.828554 2.004998 -1.002484 9.471326 -1.945113 3.684237 2.119809 0.362597 -1.166347 -3.604024 -0.752169
F 69.598210 74.375112 -68.407808 2.572185 6.659703 16.787547 10.585299 -1.588954 -10.532959 -2.643261 -7.012191 4.522682 4.779671 0.908865 -0.218389 1.152286 0.560021 1.736196 0.152801 -0.567469
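To make the linear nature of PCA concrete, the projection can be written in a few lines of NumPy: center the features, take a singular value decomposition, and keep the leading components. This is a minimal sketch on random data; pca_reduce is a hypothetical helper, not how reduce_feature is implemented, and the sign of each component may differ from what other PCA implementations return.

```python
import numpy as np
import pandas as pd

def pca_reduce(df: pd.DataFrame, n: int = 2) -> pd.DataFrame:
    """Hypothetical sketch: project rows onto the first n principal components."""
    X = df.to_numpy(dtype=float)
    # PCA operates on mean-centered features
    Xc = X - X.mean(axis=0)
    # Singular values come back in descending order,
    # so the first n columns capture the most variance
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n] * S[:n]
    cols = [f"PCA{i + 1}" for i in range(n)]
    return pd.DataFrame(scores, index=df.index, columns=cols)

rng = np.random.default_rng(123)
toy_feat = pd.DataFrame(rng.normal(size=(30, 5)),
                        columns=[f"f{i}" for i in range(5)])
embedding = pca_reduce(toy_feat, n=2)
print(embedding.head())
```

Because each component is a fixed linear combination of the input features, the same projection can be applied to new samples, which is one reason PCA suits feature engineering.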

End