Feature

A collection of tools to extract features from SMILES, proteins, etc.

Setup

Utils


remove_hi_corr

 remove_hi_corr (df:pandas.core.frame.DataFrame, thr:float=0.98)

Remove highly correlated features in a dataframe given a Pearson correlation threshold

Type Default Details
df DataFrame
thr float 0.98 threshold

remove_hi_corr removes highly correlated features, i.e. features whose pairwise Pearson correlation exceeds the threshold thr.

# Load data
df = Data.get_aa_rdkit()
df.shape
(25, 106)
remove_hi_corr(df,thr=0.9).shape
(25, 78)
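
The idea behind correlation-based pruning can be sketched with plain pandas. The helper below, `drop_correlated`, is a hypothetical name and not the library's implementation. It assumes the absolute Pearson correlation is used and that, for each highly correlated pair, the later column is the one dropped.

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, thr: float = 0.98) -> pd.DataFrame:
    """Drop columns whose absolute Pearson correlation with an earlier column exceeds thr."""
    corr = df.corr(method="pearson").abs()
    # Keep only the strict upper triangle so each pair is inspected once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # A column is dropped if any earlier column correlates with it above thr
    to_drop = [c for c in upper.columns if (upper[c] > thr).any()]
    return df.drop(columns=to_drop)

demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with "a"
    "c": [1.0, -1.0, 2.0, 0.0],  # uncorrelated with "a" and "b"
})
reduced = drop_correlated(demo, thr=0.9)
print(list(reduced.columns))
```

Lowering `thr` removes more columns, which matches the drop from 106 to 78 features seen above with thr=0.9.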

preprocess

 preprocess (df:pandas.core.frame.DataFrame, thr:float=0.98)

Remove zero-variance features and highly correlated features based on a threshold

This function is similar to remove_hi_corr, but additionally removes features with zero variance (e.g., a feature equal to 1 across all samples).

preprocess(df,thr=0.9).shape
removing columns: {'Chi3v', 'SMR_VSA9', 'SMR_VSA1', 'Chi2v', 'fr_SH', 'NumAromaticCarbocycles', 'fr_NH2', 'Chi1v', 'VSA_EState6', 'NumRotatableBonds', 'Chi0n', 'SlogP_VSA5', 'NumAromaticRings', 'fr_Ar_N', 'Chi2n', 'Chi1', 'Chi4n', 'Chi3n', 'VSA_EState10', 'Chi0v', 'Kappa1', 'NumHDonors', 'RingCount', 'NOCount', 'Chi4v', 'Ipc', 'NumHeteroatoms', 'VSA_EState2'}
(25, 78)
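
The zero-variance step can be illustrated separately. This is a minimal sketch, assuming a constant column is detected by counting distinct values; `drop_constant` is a hypothetical helper, not the function used by preprocess. A correlation filter would then be applied to the result.

```python
import pandas as pd

def drop_constant(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns that take a single value across all rows."""
    constant = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    return df.drop(columns=constant)

demo = pd.DataFrame({
    "x": [0.1, 0.5, 0.9],
    "ones": [1, 1, 1],      # zero variance: same value in every row
    "y": [3, 2, 1],
})
kept = drop_constant(demo)
print(list(kept.columns))
```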

standardize

 standardize (df)

Standardize features from a df
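
The source does not state which scaling standardize applies. A common choice is z-scoring each column, which is assumed in the sketch below; `zscore` is a hypothetical name, and the population standard deviation (ddof=0) is also an assumption.

```python
import pandas as pd

def zscore(df: pd.DataFrame) -> pd.DataFrame:
    """Center each column to mean 0 and scale it to unit population standard deviation."""
    return (df - df.mean()) / df.std(ddof=0)

demo = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 60.0]})
scaled = zscore(demo)
print(scaled.round(3))
```

Columns with zero variance would produce a division by zero here, which is one reason to remove them before standardizing.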

Compound features

RDKit descriptors


get_rdkit

 get_rdkit (SMILES)

Extract chemical features from SMILES. Reference: https://greglandrum.github.io/rdkit-blog/posts/2022-12-23-descriptor-tutorial.html


get_rdkit_3d

 get_rdkit_3d (SMILES)

Extract 3D features from SMILES


get_rdkit_all

 get_rdkit_all (SMILES)

Extract chemical features and 3D features from SMILES


get_rdkit_df

 get_rdkit_df (df, col, postprocess=True)

Extract RDKit features (including 3D) from SMILES in a dataframe

Type Default Details
df
col column of SMILES
postprocess bool True remove redundant columns and standardize features for dimension reduction

aa = Data.get_aa_info()
aa.head()
Name SMILES MW pKa1 pKb2 pKx3 pl4 H VSC P1 P2 SASA NCISC phospho
aa
A Alanine C[C@@H](C(=O)O)N 89.10 2.34 9.69 NaN 6.00 0.62 27.5 8.1 0.046 1.181 0.007187 0
C Cysteine C([C@@H](C(=O)O)N)S 121.16 1.96 10.28 8.18 5.07 0.29 44.6 5.5 0.128 1.461 -0.036610 0
D Aspartic acid C([C@@H](C(=O)O)N)C(=O)O 133.11 1.88 9.60 3.65 2.77 -0.90 40.0 13.0 0.105 1.587 -0.023820 0
E Glutamic acid C(CC(=O)O)[C@@H](C(=O)O)N 147.13 2.19 9.67 4.25 3.22 -0.74 62.0 12.3 0.151 1.862 0.006802 0
F Phenylalanine c1ccc(cc1)C[C@@H](C(=O)O)N 165.19 1.83 9.13 NaN 5.48 1.19 115.5 5.2 0.290 2.228 0.037552 0
aa_rdkit = get_rdkit_df(aa, 'SMILES')
aa_rdkit.head()
removing columns: {'fr_para_hydroxylation', 'MinAbsPartialCharge', 'SlogP_VSA9', 'NumSpiroAtoms', 'fr_sulfone', 'PEOE_VSA5', 'fr_Ndealkylation1', 'fr_halogen', 'MolMR', 'fr_nitro_arom', 'VSA_EState1', 'fr_piperdine', 'fr_lactam', 'fr_imide', 'PMI3', 'fr_methoxy', 'Chi0', 'fr_alkyl_carbamate', 'Asphericity', 'fr_bicyclic', 'PEOE_VSA13', 'fr_sulfonamd', 'Chi1n', 'fr_urea', 'fr_aryl_methyl', 'NumBridgeheadAtoms', 'fr_hdrzine', 'fr_nitro', 'fr_phos_acid', 'fr_pyridine', 'fr_nitro_arom_nonortho', 'fr_amide', 'Eccentricity', 'fr_azo', 'fr_isocyan', 'HeavyAtomCount', 'fr_oxazole', 'SMR_VSA2', 'NumRadicalElectrons', 'fr_epoxide', 'fr_Nhpyrrole', 'fr_term_acetylene', 'fr_hdrzone', 'BCUT2D_MRHI', 'fr_aldehyde', 'fr_Ar_NH', 'fr_ketone_Topliss', 'fr_allylic_oxid', 'NumAmideBonds', 'NumAliphaticCarbocycles', 'NumSaturatedCarbocycles', 'fr_thiazole', 'fr_amidine', 'fr_phenol', 'fr_Ar_OH', 'fr_nitrile', 'fr_phos_ester', 'fr_nitroso', 'fr_benzodiazepine', 'NumAliphaticRings', 'fr_azide', 'fr_COO2', 'fr_Ndealkylation2', 'MaxEStateIndex', 'fr_ketone', 'fr_tetrazole', 'fr_HOCCN', 'fr_oxime', 'fr_C_O_noCOO', 'fr_ether', 'NumSaturatedHeterocycles', 'fr_prisulfonamd', 'fr_isothiocyan', 'fr_quatN', 'fr_furan', 'SlogP_VSA10', 'LabuteASA', 'fr_aniline', 'fr_guanido', 'HeavyAtomMolWt', 'SlogP_VSA7', 'EState_VSA11', 'fr_diazo', 'NumSaturatedRings', 'fr_benzene', 'fr_lactone', 'fr_COO', 'ExactMolWt', 'fr_alkyl_halide', 'fr_Ar_COO', 'MaxPartialCharge', 'fr_C_S', 'fr_thiophene', 'SlogP_VSA12', 'SlogP_VSA6', 'fr_ArN', 'fr_dihydropyridine', 'fr_phenol_noOrthoHbond', 'fr_N_O', 'fr_ester', 'fr_morpholine', 'NumValenceElectrons', 'fr_Al_OH_noTert', 'fr_piperzine', 'fr_barbitur', 'SlogP_VSA11', 'fr_thiocyan', 'fr_Imine', 'SMR_VSA8'}
MaxAbsEStateIndex MinAbsEStateIndex MinEStateIndex qed SPS MolWt MinPartialCharge MaxAbsPartialCharge FpDensityMorgan1 FpDensityMorgan2 ... fr_sulfide fr_unbrch_alkane PMI1 PMI2 NPR1 NPR2 RadiusOfGyration InertialShapeFactor SpherocityIndex PBF
aa
A -1.653421 1.218945 0.407753 -0.383393 -0.070345 -1.523488 0.218275 -0.295105 1.732180 0.163309 ... -0.204124 -0.369274 -1.294800 -1.086763 1.710287 1.057430 -1.514798 1.952520 0.622167 -1.068611
C -1.058215 -0.588000 0.372307 -0.641865 -0.138884 -0.727067 0.223839 -0.298132 1.732180 1.481227 ... -0.204124 -0.369274 -0.445607 -0.885625 1.398986 -2.080741 -0.970023 -0.247854 0.617683 -0.493157
D -0.764466 0.554854 0.126078 -0.376981 -0.390192 -0.430473 0.020085 -0.187297 -0.934412 -1.234483 ... -0.204124 -0.369274 -0.724059 -0.459223 -0.441549 0.454216 -0.332705 0.426131 0.316338 0.080717
E -0.283221 -1.143984 0.235448 -0.051582 -0.406185 -0.082096 0.010166 -0.181902 -1.147739 -1.178571 ... -0.204124 -0.369274 -0.504211 -0.096004 -0.699329 1.605206 0.070614 0.283217 -0.128709 -0.453659
F 0.972596 0.063427 0.410788 1.908107 -0.430173 0.366494 0.221306 -0.296754 -1.067741 -0.675366 ... -0.204124 -0.369274 -0.220806 0.390422 -1.137341 0.320498 0.643196 -0.162855 -1.214938 -0.971254

5 rows × 119 columns

Morgan fingerprint


get_morgan

 get_morgan (df:pandas.core.frame.DataFrame, col:str='SMILES', radius=3)

Get 2048-bit Morgan fingerprints (binary features) from SMILES in a dataframe

Type Default Details
df DataFrame a dataframe that contains SMILES
col str SMILES column name of the SMILES strings
radius int 3

aa_morgan = get_morgan(aa, 'SMILES')
aa_morgan.head()
morgan_0 morgan_1 morgan_2 morgan_3 morgan_4 morgan_5 morgan_6 morgan_7 morgan_8 morgan_9 ... morgan_2038 morgan_2039 morgan_2040 morgan_2041 morgan_2042 morgan_2043 morgan_2044 morgan_2045 morgan_2046 morgan_2047
aa
A 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
C 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
D 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
E 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
F 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 2048 columns

Protein sequence

Onehot


onehot_encode

 onehot_encode (sequences, transform_colname=True, n=20)

df=Data.get_combine_site_psp_ochoa()
onehot = onehot_encode(df['site_seq'].head(1000))
onehot
-20A -20C -20D -20E -20F -20G -20H -20I -20K -20L ... -6N -6P -6Q -6R -6S -6T -6V -6W -6Y -6_
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
996 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
997 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
998 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
999 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

1000 rows × 297 columns
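
The column labels above combine a position relative to the central residue with a one-letter residue code. A minimal sketch of that kind of encoding follows. `onehot_positions` is a hypothetical helper, not the library's onehot_encode, and it only creates columns for residues that actually occur at each position.

```python
import pandas as pd

def onehot_positions(sequences, center=None) -> pd.DataFrame:
    """One-hot encode equal-length sequences, one column per (position, residue) pair."""
    seqs = list(sequences)
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    # Assumption: positions are numbered relative to the middle residue
    center = length // 2 if center is None else center
    chars = pd.DataFrame(
        [list(s) for s in seqs],
        columns=[str(i - center) for i in range(length)],
    )
    # prefix_sep="" yields labels such as "-2A" or "1_"
    return pd.get_dummies(chars, prefix_sep="", dtype=float)

demo = onehot_positions(["AKS_R", "GKSTR", "AESTK"])
print(demo.columns.tolist())
```

Each row contains exactly one 1.0 per position, so the row sum equals the sequence length.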

Elbow method


get_clusters_elbow

 get_clusters_elbow (encoded_data, max_cluster=400, interval=50)
get_clusters_elbow(onehot,5,2)
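
The source gives only the signature of get_clusters_elbow. Assuming it follows the usual elbow heuristic with k-means, the sketch below computes the within-cluster sum of squares for a range of cluster counts using NumPy alone. The names `kmeans_inertia` and `elbow_curve`, the plain Lloyd's algorithm, and the way `interval` steps through k are all assumptions made for illustration.

```python
import numpy as np

def kmeans_inertia(X: np.ndarray, k: int, n_iter: int = 50, seed: int = 0) -> float:
    """Run a basic Lloyd's k-means and return the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct data points
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every point to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # leave an empty cluster's center unchanged
                centers[j] = members.mean(axis=0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

def elbow_curve(X, max_cluster: int = 10, interval: int = 1) -> dict:
    """Inertia for k = 1, 1 + interval, ... up to max_cluster."""
    X = np.asarray(X, dtype=float)
    return {k: kmeans_inertia(X, k) for k in range(1, max_cluster + 1, interval)}

# Three well-separated blobs: inertia should fall sharply between k=1 and k=3
rng = np.random.default_rng(0)
blobs = np.vstack([rng.normal(loc=c, scale=0.1, size=(30, 2)) for c in (0.0, 5.0, 10.0)])
curve = elbow_curve(blobs, max_cluster=5, interval=2)
print(curve)
```

Plotting inertia against k and looking for the point where the curve flattens gives the "elbow", i.e. a reasonable number of clusters.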

ESM2


get_esm

 get_esm (df:pandas.core.frame.DataFrame, col:str='sequence',
          model_name:str='esm2_t33_650M_UR50D')

Extract ESM2 embeddings from protein sequences in a dataframe

Type Default Details
df DataFrame a dataframe that contains amino acid sequence
col str sequence colname of amino acid sequence
model_name str esm2_t33_650M_UR50D Name of the ESM model to use for the embeddings.

The ESM2 models are trained on UniRef sequences. The default model in this function is esm2_t33_650M_UR50D, which is trained on UniRef50.

Uncomment below to use:

# Examples
# df = Data.get_kinase_info().set_index('kinase')
# sample = df[:5]
# esmfeature = get_esm(sample,'sequence')
# esmfeature.head()

ProtT5


get_t5

 get_t5 (df:pandas.core.frame.DataFrame, col:str='sequence')

Extract ProtT5-XL-UniRef50 embeddings from protein sequences in a dataframe

The ProtT5-XL-UniRef50 model is a T5-3B model trained on the UniRef50 dataset.

Uncomment below to use:

# t5feature = get_t5(sample,'sequence')
# t5feature.head()

get_t5_bfd

 get_t5_bfd (df:pandas.core.frame.DataFrame, col:str='sequence')

Extract ProtT5-XL-BFD embeddings from protein sequences in a dataframe

The ProtT5-XL-BFD model is a T5-3B model trained on the Big Fantastic Database (BFD).

Uncomment below to use:

# t5bfd = get_t5_bfd(sample,'sequence')
# t5bfd.head()

End