Feature

A collection of tools to extract features from SMILES, proteins, etc.

Setup

Utils

remove_hi_corr

 remove_hi_corr (df:pandas.core.frame.DataFrame, thr:float=0.98)

Remove highly correlated features from a dataframe, given a Pearson correlation threshold

	Type	Default	Details
df	DataFrame
thr	float	0.98	threshold

remove_hi_corr removes highly correlated features, using a threshold on the Pearson correlation between pairs of features. In the example below, thr=0.9 reduces the dataframe from 106 to 78 columns.

# Load data
df = Data.get_aa_rdkit()
df.shape

(25, 106)

remove_hi_corr(df,thr=0.9).shape

(25, 78)
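
How the filter picks which column of a correlated pair to drop is not stated here, so the following is only a minimal sketch under stated assumptions: it uses the absolute Pearson correlation, inspects the upper triangle of the correlation matrix, and drops the later column of each pair above the threshold. The helper name drop_correlated and the toy data are hypothetical and not part of the library.

```python
import numpy as np
import pandas as pd


def drop_correlated(df: pd.DataFrame, thr: float = 0.98) -> pd.DataFrame:
    """Hypothetical stand-in for remove_hi_corr (assumed behavior, not the library code)."""
    corr = df.corr(method="pearson").abs()
    # Keep only the strict upper triangle so each pair is inspected once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # Drop a column if it correlates above thr with any earlier column
    to_drop = [c for c in upper.columns if (upper[c] > thr).any()]
    return df.drop(columns=to_drop)


# Toy data: 'a_scaled' is a linear function of 'a', 'b' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=50)
toy = pd.DataFrame({"a": a, "a_scaled": 2 * a + 1, "b": rng.normal(size=50)})

reduced = drop_correlated(toy, thr=0.9)
print(reduced.columns.tolist())
```

Here 'a_scaled' is removed because it is perfectly correlated with 'a', while 'b' is kept.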

source

preprocess

 preprocess (df:pandas.core.frame.DataFrame, thr:float=0.98)

Remove features with no variance, and highly correlated features based on threshold

This function is similar to remove_hi_corr, but additionally removes features with zero variance (e.g., a column that is 1 across all samples).

preprocess(df,thr=0.9).shape

removing columns: {'Chi3v', 'SMR_VSA9', 'SMR_VSA1', 'Chi2v', 'fr_SH', 'NumAromaticCarbocycles', 'fr_NH2', 'Chi1v', 'VSA_EState6', 'NumRotatableBonds', 'Chi0n', 'SlogP_VSA5', 'NumAromaticRings', 'fr_Ar_N', 'Chi2n', 'Chi1', 'Chi4n', 'Chi3n', 'VSA_EState10', 'Chi0v', 'Kappa1', 'NumHDonors', 'RingCount', 'NOCount', 'Chi4v', 'Ipc', 'NumHeteroatoms', 'VSA_EState2'}

(25, 78)
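
The zero-variance step can be illustrated with a short sketch. It assumes that a feature counts as zero-variance when it holds a single distinct value. The helper name drop_constant and the toy data are hypothetical, and the library may detect constant columns differently.

```python
import pandas as pd


def drop_constant(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop columns whose values never change."""
    constant = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    return df.drop(columns=constant)


# 'ones' is 1 across all samples, so it carries no information
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0], "ones": [1, 1, 1], "y": [0.5, 0.1, 0.9]})

kept = drop_constant(toy)
print(kept.columns.tolist())
```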

source

standardize

 standardize (df)

Standardize features from a df
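
The exact scaling is not described here. A common reading of "standardize" is z-scoring each column to zero mean and unit variance, which the sketch below shows. It assumes the population standard deviation (ddof=0), the convention used by scikit-learn's StandardScaler. The helper name zscore is hypothetical.

```python
import pandas as pd


def zscore(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for standardize: zero mean, unit variance per column."""
    # ddof=0 is an assumption (population standard deviation)
    return (df - df.mean()) / df.std(ddof=0)


toy = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 60.0]})
scaled = zscore(toy)
print(scaled.round(3))
```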

Compound features

RDKit descriptors

source

get_rdkit

 get_rdkit (SMILES)

Extract chemical features from SMILES.

Reference: https://greglandrum.github.io/rdkit-blog/posts/2022-12-23-descriptor-tutorial.html
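
For orientation, RDKit descriptors can be computed for a single SMILES string roughly as follows. This is a sketch using RDKit's public Descriptors.descList. It is not necessarily how get_rdkit is implemented, and the helper name descriptor_dict is hypothetical.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors


def descriptor_dict(smiles: str) -> dict:
    """Hypothetical helper: compute every descriptor in Descriptors.descList."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    # descList is a list of (name, function) pairs shipped with RDKit
    return {name: fn(mol) for name, fn in Descriptors.descList}


# Alanine, taken from the amino acid table on this page
feats = descriptor_dict("C[C@@H](C(=O)O)N")
print(len(feats), round(feats["MolWt"], 2))
```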

source

get_rdkit_3d

 get_rdkit_3d (SMILES)

Extract 3d features from SMILES

source

get_rdkit_all

 get_rdkit_all (SMILES)

Extract chemical features and 3d features from SMILES

source

get_rdkit_df

 get_rdkit_df (df, col, postprocess=True)

Extract rdkit features (including 3d) from SMILES in a df

	Type	Default	Details
df
col			column of SMILES
postprocess	bool	True	remove redundant columns and standardize features for dimension reduction

aa = Data.get_aa_info()
aa.head()

	Name	SMILES	MW	pKa1	pKb2	pKx3	pl4	H	VSC	P1	P2	SASA	NCISC	phospho
aa
A	Alanine	C[C@@H](C(=O)O)N	89.10	2.34	9.69	NaN	6.00	0.62	27.5	8.1	0.046	1.181	0.007187	0
C	Cysteine	C([C@@H](C(=O)O)N)S	121.16	1.96	10.28	8.18	5.07	0.29	44.6	5.5	0.128	1.461	-0.036610	0
D	Aspartic acid	C([C@@H](C(=O)O)N)C(=O)O	133.11	1.88	9.60	3.65	2.77	-0.90	40.0	13.0	0.105	1.587	-0.023820	0
E	Glutamic acid	C(CC(=O)O)[C@@H](C(=O)O)N	147.13	2.19	9.67	4.25	3.22	-0.74	62.0	12.3	0.151	1.862	0.006802	0
F	Phenylalanine	c1ccc(cc1)C[C@@H](C(=O)O)N	165.19	1.83	9.13	NaN	5.48	1.19	115.5	5.2	0.290	2.228	0.037552	0

aa_rdkit = get_rdkit_df(aa, 'SMILES')
aa_rdkit.head()

removing columns: {'fr_para_hydroxylation', 'MinAbsPartialCharge', 'SlogP_VSA9', 'NumSpiroAtoms', 'fr_sulfone', 'PEOE_VSA5', 'fr_Ndealkylation1', 'fr_halogen', 'MolMR', 'fr_nitro_arom', 'VSA_EState1', 'fr_piperdine', 'fr_lactam', 'fr_imide', 'PMI3', 'fr_methoxy', 'Chi0', 'fr_alkyl_carbamate', 'Asphericity', 'fr_bicyclic', 'PEOE_VSA13', 'fr_sulfonamd', 'Chi1n', 'fr_urea', 'fr_aryl_methyl', 'NumBridgeheadAtoms', 'fr_hdrzine', 'fr_nitro', 'fr_phos_acid', 'fr_pyridine', 'fr_nitro_arom_nonortho', 'fr_amide', 'Eccentricity', 'fr_azo', 'fr_isocyan', 'HeavyAtomCount', 'fr_oxazole', 'SMR_VSA2', 'NumRadicalElectrons', 'fr_epoxide', 'fr_Nhpyrrole', 'fr_term_acetylene', 'fr_hdrzone', 'BCUT2D_MRHI', 'fr_aldehyde', 'fr_Ar_NH', 'fr_ketone_Topliss', 'fr_allylic_oxid', 'NumAmideBonds', 'NumAliphaticCarbocycles', 'NumSaturatedCarbocycles', 'fr_thiazole', 'fr_amidine', 'fr_phenol', 'fr_Ar_OH', 'fr_nitrile', 'fr_phos_ester', 'fr_nitroso', 'fr_benzodiazepine', 'NumAliphaticRings', 'fr_azide', 'fr_COO2', 'fr_Ndealkylation2', 'MaxEStateIndex', 'fr_ketone', 'fr_tetrazole', 'fr_HOCCN', 'fr_oxime', 'fr_C_O_noCOO', 'fr_ether', 'NumSaturatedHeterocycles', 'fr_prisulfonamd', 'fr_isothiocyan', 'fr_quatN', 'fr_furan', 'SlogP_VSA10', 'LabuteASA', 'fr_aniline', 'fr_guanido', 'HeavyAtomMolWt', 'SlogP_VSA7', 'EState_VSA11', 'fr_diazo', 'NumSaturatedRings', 'fr_benzene', 'fr_lactone', 'fr_COO', 'ExactMolWt', 'fr_alkyl_halide', 'fr_Ar_COO', 'MaxPartialCharge', 'fr_C_S', 'fr_thiophene', 'SlogP_VSA12', 'SlogP_VSA6', 'fr_ArN', 'fr_dihydropyridine', 'fr_phenol_noOrthoHbond', 'fr_N_O', 'fr_ester', 'fr_morpholine', 'NumValenceElectrons', 'fr_Al_OH_noTert', 'fr_piperzine', 'fr_barbitur', 'SlogP_VSA11', 'fr_thiocyan', 'fr_Imine', 'SMR_VSA8'}

	MaxAbsEStateIndex	MinAbsEStateIndex	MinEStateIndex	qed	SPS	MolWt	MinPartialCharge	MaxAbsPartialCharge	FpDensityMorgan1	FpDensityMorgan2	...	fr_sulfide	fr_unbrch_alkane	PMI1	PMI2	NPR1	NPR2	RadiusOfGyration	InertialShapeFactor	SpherocityIndex	PBF
aa
A	-1.653421	1.218945	0.407753	-0.383393	-0.070345	-1.523488	0.218275	-0.295105	1.732180	0.163309	...	-0.204124	-0.369274	-1.294800	-1.086763	1.710287	1.057430	-1.514798	1.952520	0.622167	-1.068611
C	-1.058215	-0.588000	0.372307	-0.641865	-0.138884	-0.727067	0.223839	-0.298132	1.732180	1.481227	...	-0.204124	-0.369274	-0.445607	-0.885625	1.398986	-2.080741	-0.970023	-0.247854	0.617683	-0.493157
D	-0.764466	0.554854	0.126078	-0.376981	-0.390192	-0.430473	0.020085	-0.187297	-0.934412	-1.234483	...	-0.204124	-0.369274	-0.724059	-0.459223	-0.441549	0.454216	-0.332705	0.426131	0.316338	0.080717
E	-0.283221	-1.143984	0.235448	-0.051582	-0.406185	-0.082096	0.010166	-0.181902	-1.147739	-1.178571	...	-0.204124	-0.369274	-0.504211	-0.096004	-0.699329	1.605206	0.070614	0.283217	-0.128709	-0.453659
F	0.972596	0.063427	0.410788	1.908107	-0.430173	0.366494	0.221306	-0.296754	-1.067741	-0.675366	...	-0.204124	-0.369274	-0.220806	0.390422	-1.137341	0.320498	0.643196	-0.162855	-1.214938	-0.971254

5 rows × 119 columns

Morgan fingerprint

source

get_morgan

 get_morgan (df:pandas.core.frame.DataFrame, col:str='SMILES', radius=3)

Get 2048-bit Morgan fingerprints (binary features) from SMILES in a dataframe

	Type	Default	Details
df	DataFrame		a dataframe that contains SMILES
col	str	SMILES	column name of the SMILES strings
radius	int	3	radius of the Morgan fingerprint

aa_morgan = get_morgan(aa, 'SMILES')
aa_morgan.head()

	morgan_0	morgan_1	morgan_2	morgan_3	morgan_4	morgan_5	morgan_6	morgan_7	morgan_8	morgan_9	...	morgan_2038	morgan_2039	morgan_2040	morgan_2041	morgan_2042	morgan_2043	morgan_2044	morgan_2045	morgan_2046	morgan_2047
aa
A	0	1	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
C	0	1	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
D	0	1	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
E	0	1	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
F	0	1	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

5 rows × 2048 columns
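
The sketch below shows one way to build such a bit matrix with RDKit's fingerprint generator API. It is an illustration only: the helper name morgan_bits is hypothetical, and get_morgan may call a different RDKit function internally.

```python
import numpy as np
import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator


def morgan_bits(smiles_list, radius=3, n_bits=2048):
    """Hypothetical helper: one row of 0/1 Morgan bits per SMILES string."""
    gen = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=n_bits)
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        fp = gen.GetFingerprint(mol)
        arr = np.zeros((n_bits,), dtype=np.uint8)
        # Copy the bit vector into a NumPy array
        DataStructs.ConvertToNumpyArray(fp, arr)
        rows.append(arr)
    cols = [f"morgan_{i}" for i in range(n_bits)]
    return pd.DataFrame(rows, columns=cols)


# Alanine and cysteine, taken from the amino acid table on this page
bits = morgan_bits(["C[C@@H](C(=O)O)N", "C([C@@H](C(=O)O)N)S"])
print(bits.shape)
```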

Protein sequence

Onehot

source

onehot_encode

 onehot_encode (sequences, transform_colname=True, n=20)

df=Data.get_combine_site_psp_ochoa()

onehot = onehot_encode(df['site_seq'].head(1000))
onehot

	-20A	-20C	-20D	-20E	-20F	-20G	-20H	-20I	-20K	-20L	...	-6N	-6P	-6Q	-6R	-6S	-6T	-6V	-6W	-6Y	-6_
0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	...	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0
1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0
3	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
4	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	...	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
995	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
996	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0
997	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
998	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
999	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

1000 rows × 297 columns
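
The column names above combine a position relative to the center residue with an amino acid letter (for example -20A). A minimal sketch of that encoding with pandas follows. It assumes every sequence has length 2n+1 and that only (position, residue) pairs present in the data get a column. The helper name onehot_positions is hypothetical and the library may be implemented differently.

```python
import pandas as pd


def onehot_positions(sequences, n=2):
    """Hypothetical sketch: one column per (position, residue) pair seen in the data."""
    # Split each sequence into one character per column
    chars = pd.DataFrame([list(s) for s in sequences])
    # Label columns by position relative to the center residue: -n ... 0 ... n
    chars.columns = [str(i) for i in range(-n, n + 1)]
    # prefix_sep="" yields names such as '-2A' or '0S'
    return pd.get_dummies(chars, prefix_sep="", dtype=float)


# Toy 5-mers; '_' pads positions that fall outside the protein
seqs = ["AKSPE", "_RTLG", "GKSPD"]
encoded = onehot_positions(seqs, n=2)
print(encoded.columns.tolist())
```

Each row contains exactly one 1.0 per position, so the row sums equal the sequence length.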

Elbow method

source

get_clusters_elbow

 get_clusters_elbow (encoded_data, max_cluster=400, interval=50)

get_clusters_elbow(onehot,5,2)
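
The elbow method fits a clustering model for several cluster counts and looks for the point where the within-cluster sum of squares stops dropping sharply. The sketch below assumes k-means from scikit-learn and cluster counts stepping from 1 to max_cluster by interval. Both are assumptions about get_clusters_elbow, and the helper name elbow_inertias is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def elbow_inertias(X, max_cluster=10, interval=2):
    """Hypothetical sketch: k-means inertia for k = 1, 1 + interval, ... up to max_cluster."""
    ks = list(range(1, max_cluster + 1, interval))
    inertias = [
        KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in ks
    ]
    return ks, inertias


# Toy data: three well-separated blobs, so the elbow should appear at k=3
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

ks, inertias = elbow_inertias(X, max_cluster=5, interval=1)
for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting inertias against ks would show the curve flattening after the number of true clusters.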

ESM2

source

get_esm

 get_esm (df:pandas.core.frame.DataFrame, col:str='sequence',
          model_name:str='esm2_t33_650M_UR50D')

Extract ESM2 embeddings from protein sequences in a dataframe

	Type	Default	Details
df	DataFrame		a dataframe that contains amino acid sequence
col	str	sequence	colname of amino acid sequence
model_name	str	esm2_t33_650M_UR50D	Name of the ESM model to use for the embeddings.

ESM2 models are trained on UniRef sequences. The default model in this function is esm2_t33_650M_UR50D, which is trained on UniRef50.

Uncomment below to use:

# # Examples
# df = Data.get_kinase_info().set_index('kinase')
# sample = df[:5]
# esmfeature = get_esm(sample,'sequence')
# esmfeature.head()

ProtT5

source

get_t5

 get_t5 (df:pandas.core.frame.DataFrame, col:str='sequence')

Extract ProtT5-XL-uniref50 embeddings from protein sequence in a dataframe

The XL-UniRef50 model is a T5-3B model trained on the UniRef50 dataset.

Uncomment below to use:

# t5feature = get_t5(sample,'sequence')
# t5feature.head()

source

get_t5_bfd

 get_t5_bfd (df:pandas.core.frame.DataFrame, col:str='sequence')

Extract ProtT5-XL-BFD embeddings from protein sequence in a dataframe

The XL-BFD model is a T5-3B model trained on the Big Fantastic Database (BFD).

Uncomment below to use:

# t5bfd = get_t5_bfd(sample,'sequence')
# t5bfd.head()

End