# Load data
df = Data.get_aa_rdkit()
df.shape
Feature
Overview
This module provides tools to extract features from SMILES (chemical compounds) and protein sequences for machine learning applications.
Utility Functions
remove_hi_corr(df, thr) - Removes highly correlated features from a DataFrame based on a Pearson correlation threshold. Useful for reducing multicollinearity before modeling.
df_cleaned = remove_hi_corr(
df=my_features, # DataFrame with features as columns
thr=0.98, # correlation threshold above which to drop columns
)
preprocess(df, thr) - Combines zero-variance removal with correlation filtering. Drops columns with no variance (e.g., constant values) and highly correlated features.
df_processed = preprocess(
df=my_features, # DataFrame with features
thr=0.98, # correlation threshold
)
standardize(df) - Standardizes features to zero mean and unit variance using sklearn’s StandardScaler.
df_scaled = standardize(
df=my_features, # DataFrame to standardize
)
Compound Features (SMILES)
get_rdkit(SMILES) - Extracts ~200 RDKit molecular descriptors from a SMILES string.
features = get_rdkit(
SMILES="CC(=O)O", # SMILES representation of molecule
)
get_rdkit_3d(SMILES) - Extracts 3D molecular descriptors after generating a conformer using ETKDG embedding.
features_3d = get_rdkit_3d(
SMILES="CC(=O)O", # SMILES representation of molecule
)
get_rdkit_df(df, col, postprocess) - Batch extracts RDKit features (2D + 3D) from a DataFrame column containing SMILES. Optionally removes redundant features and standardizes.
rdkit_features = get_rdkit_df(
df=compounds_df, # DataFrame containing SMILES
col='SMILES', # column name with SMILES strings
postprocess=True, # remove redundant columns & standardize
)
get_morgan(df, col, radius) - Generates 2048-bit Morgan fingerprints (circular fingerprints) from SMILES.
morgan_fps = get_morgan(
df=compounds_df, # DataFrame containing SMILES
col='SMILES', # column name with SMILES strings
radius=3, # radius for Morgan fingerprint
)
Protein Sequence Features - One-Hot Encoding
onehot_encode(sequences, transform_colname, n) - Converts amino acid sequences to a one-hot encoded matrix.
encoded = onehot_encode(
sequences=df['site_seq'], # iterable of AA sequences
transform_colname=True, # convert column names to position format
n=20, # number of standard amino acids
)
onehot_encode_df(df, seq_col) - Convenience wrapper for one-hot encoding from a DataFrame.
encoded = onehot_encode_df(
df=my_df, # DataFrame with sequences
seq_col='site_seq', # column name containing sequences
)
filter_range_columns(df, low, high) - Filters one-hot encoded columns to specific sequence positions (e.g., -10 to +10 around a site).
filtered = filter_range_columns(
df=onehot_df, # one-hot encoded DataFrame with position+AA column names
low=-10, # minimum position to include
high=10, # maximum position to include
)
Clustering
run_kmeans(onehot, n, seed) - Performs K-means clustering on encoded data and returns cluster assignments.
clusters = run_kmeans(
onehot=encoded_df, # one-hot or other feature matrix
n=10, # number of clusters
seed=42, # random seed for reproducibility
)
get_clusters_elbow(encoded_data, max_cluster, interval) - Plots the elbow curve (WCSS vs. number of clusters) to help choose the optimal k.
get_clusters_elbow(
encoded_data=onehot_df, # feature matrix for clustering
max_cluster=400, # maximum clusters to test
interval=50, # step size between cluster counts
)
Protein Language Model Embeddings
get_esm(df, col, model_name) - Extracts ESM2 embeddings (mean-pooled) from protein sequences. Requires GPU.
esm_features = get_esm(
df=kinase_df, # DataFrame with protein sequences
col='sequence', # column name with AA sequences
model_name='esm2_t33_650M_UR50D', # ESM2 model variant
)
get_t5(df, col) - Extracts ProtT5-XL-UniRef50 embeddings from protein sequences.
t5_features = get_t5(
df=kinase_df, # DataFrame with protein sequences
col='sequence', # column name with AA sequences
)
get_t5_bfd(df, col) - Extracts ProtT5-XL-BFD embeddings (trained on Big Fantastic Database).
t5bfd_features = get_t5_bfd(
df=kinase_df, # DataFrame with protein sequences
col='sequence', # column name with AA sequences
)
Setup
Utils
remove_hi_corr
def remove_hi_corr(
df:DataFrame, thr:float=0.98, # threshold
):
Remove highly correlated features in a dataframe given a Pearson threshold
remove_hi_corr removes features whose Pearson correlation with another feature exceeds the threshold.
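The following is a minimal sketch of threshold-based correlation filtering, not the library's actual implementation. It assumes absolute Pearson correlation is used and that the later column of each correlated pair is the one dropped; the helper name `drop_correlated_sketch` is hypothetical.

```python
import numpy as np
import pandas as pd

def drop_correlated_sketch(df: pd.DataFrame, thr: float = 0.98) -> pd.DataFrame:
    """Drop the later column of every pair whose |Pearson r| exceeds thr (hypothetical helper)."""
    corr = df.corr(method="pearson").abs()
    # Keep only the strict upper triangle so each pair is inspected once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # A column is dropped if it correlates too strongly with any earlier column
    to_drop = [col for col in upper.columns if (upper[col] > thr).any()]
    return df.drop(columns=to_drop)

corr_demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # exactly 2 * a, so r = 1
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly correlated with a
})
kept = drop_correlated_sketch(corr_demo, thr=0.98)
print(kept.columns.tolist())
```

Which member of a correlated pair survives depends on column order in this sketch, so reordering columns can change the result.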
remove_hi_corr(df,thr=0.9).shape
preprocess
def preprocess(
df:DataFrame, thr:float=0.98
):
Remove features with no variance, and highly correlated features based on threshold
This function is similar to remove_hi_corr, but additionally removes zero-variance features (e.g., a column that equals 1 across all samples).
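As an illustration of the zero-variance step only, here is a small pandas sketch. The helper name `drop_constant_sketch` is hypothetical, and the library may detect constant columns differently (for example via the variance itself).

```python
import pandas as pd

def drop_constant_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only columns that take more than one distinct value (hypothetical helper)."""
    varying = df.nunique(dropna=False) > 1
    return df.loc[:, varying]

const_demo = pd.DataFrame({
    "always_one": [1, 1, 1],      # zero variance, should be removed
    "signal": [0.1, 0.5, 0.9],    # varies, should be kept
})
varying_only = drop_constant_sketch(const_demo)
print(varying_only.columns.tolist())
```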
preprocess(df,thr=0.9).shape
standardize
def standardize(
df
):
Standardize features from a df
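To make the arithmetic concrete, the sketch below reproduces z-score standardization with plain pandas rather than calling StandardScaler. It assumes the population standard deviation (ddof=0), which is what StandardScaler uses; the helper name `standardize_sketch` is hypothetical.

```python
import pandas as pd

def standardize_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Column-wise z-score: subtract the mean, divide by the population std (hypothetical helper)."""
    mean = df.mean()
    std = df.std(ddof=0)
    # Guard against division by zero for constant columns
    std = std.replace(0, 1.0)
    return (df - mean) / std

scale_demo = pd.DataFrame({"h": [1.0, 2.0, 3.0], "w": [10.0, 20.0, 60.0]})
scaled_demo = standardize_sketch(scale_demo)
print(scaled_demo.round(3))
```

Index and column labels are preserved because the arithmetic stays inside pandas.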
Compound features
RDKit descriptors
get_rdkit
def get_rdkit(
SMILES
):
Extract chemical features from SMILES.
Reference: https://greglandrum.github.io/rdkit-blog/posts/2022-12-23-descriptor-tutorial.html
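A minimal sketch of per-molecule descriptor extraction with RDKit, assuming RDKit is installed. It iterates over the descriptor list that ships with RDKit; the helper name `rdkit_descriptors_sketch` is hypothetical and get_rdkit itself may handle errors and missing values differently.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

def rdkit_descriptors_sketch(smiles: str) -> dict:
    """Return {descriptor_name: value} for one SMILES string (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    # Descriptors._descList is a list of (name, function) pairs provided by RDKit
    return {name: fn(mol) for name, fn in Descriptors._descList}

acetic_acid = rdkit_descriptors_sketch("CC(=O)O")
print(len(acetic_acid), round(acetic_acid["MolWt"], 2))
```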
get_rdkit_3d
def get_rdkit_3d(
SMILES
):
Extract 3d features from SMILES
get_rdkit_all
def get_rdkit_all(
SMILES
):
Extract chemical features and 3d features from SMILES
get_rdkit_df
def get_rdkit_df(
df, col, # column of SMILES
postprocess:bool=True, # remove redundant columns and standardize features for dimension reduction
):
Extract rdkit features (including 3d) from SMILES in a df
aa = Data.get_aa_info()
aa.head()
aa_rdkit = get_rdkit_df(aa, 'SMILES')
aa_rdkit.head()
Morgan fingerprint
get_morgan
def get_morgan(
df:DataFrame, # a dataframe that contains smiles
col:str='SMILES', # colname of smile
radius:int=3
):
Get 2048-bit Morgan fingerprints (binary features) from SMILES in a dataframe
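For a single molecule, a Morgan bit vector can be sketched with RDKit's fingerprint generator as below. This assumes a reasonably recent RDKit; the helper name `morgan_bits_sketch` is hypothetical, and get_morgan may build its fingerprints through a different RDKit call.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

def morgan_bits_sketch(smiles: str, radius: int = 3, n_bits: int = 2048) -> np.ndarray:
    """Return a 0/1 vector of length n_bits for one SMILES string (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=n_bits)
    return generator.GetFingerprintAsNumPy(mol)

morgan_bits = morgan_bits_sketch("CC(=O)O")
print(morgan_bits.shape, int(morgan_bits.sum()))
```

A larger radius captures larger atom environments, so more bits tend to be set for the same molecule.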
aa_morgan = get_morgan(aa, 'SMILES')
aa_morgan.head()
Protein sequence
Onehot
onehot_encode
def onehot_encode(
sequences, transform_colname:bool=True, n:int=20
):
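The sketch below illustrates position-wise one-hot encoding of equal-length sequences. The column naming scheme (position relative to the center followed by the residue letter, e.g. "-1S") and the helper name `onehot_sketch` are assumptions made for illustration; the format produced by onehot_encode may differ.

```python
import pandas as pd

AA20 = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids

def onehot_sketch(sequences) -> pd.DataFrame:
    """One column per (position, residue) pair; position 0 is the center (hypothetical helper)."""
    seqs = [str(s) for s in sequences]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    half = length // 2
    data = {}
    for i in range(length):
        pos = i - half  # e.g. -2, -1, 0, 1, 2 for length 5
        for aa in AA20:
            # Characters outside AA20 leave every column at this position as 0
            data[f"{pos}{aa}"] = [int(s[i] == aa) for s in seqs]
    return pd.DataFrame(data)

encoded_demo = onehot_sketch(["AKSPR", "GGTPK"])
print(encoded_demo.shape)
```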
onehot_encode_df
def onehot_encode_df(
df, seq_col:str='site_seq', **kwargs
):
df=Data.get_combine_site_psp_ochoa()
df_k = df.head(1000)
onehot = onehot_encode_df(df_k, seq_col='site_seq')
onehot
Kmeans of onehot
run_kmeans
def run_kmeans(
onehot, n:int=2, seed:int=42
):
Take one-hot encoded data and return the cluster assignments.
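A minimal sketch of seeded K-means with scikit-learn, shown on synthetic points instead of one-hot data. The helper name `kmeans_labels_sketch` and the choice of `n_init=10` are assumptions; run_kmeans may configure KMeans differently.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_labels_sketch(features, n: int = 2, seed: int = 42) -> np.ndarray:
    """Fit K-means and return one integer cluster label per row (hypothetical helper)."""
    model = KMeans(n_clusters=n, random_state=seed, n_init=10)
    return model.fit_predict(features)

rng = np.random.default_rng(0)
# Two well-separated blobs of 20 points each
points = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 2)),
    rng.normal(5.0, 0.1, size=(20, 2)),
])
labels = kmeans_labels_sketch(points, n=2, seed=42)
print(labels)
```

Cluster label values are arbitrary identifiers: fixing the seed makes them reproducible, but they carry no ordering.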
run_kmeans(onehot.head(100),n=10)
onehot
filter_range_columns
def filter_range_columns(
df, # df need to have column names of position + aa
low:int=-10, high:int=10
):
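Position filtering can be sketched by parsing the position out of each column name. The column format assumed here (a signed integer followed by one residue letter, e.g. "-3S") and the helper name `filter_positions_sketch` are hypothetical.

```python
import re
import pandas as pd

# Assumed column format: optional minus sign, digits, then one letter (e.g. "-3S", "0T")
POSITION_COLUMN = re.compile(r"^(-?\d+)([A-Za-z])$")

def filter_positions_sketch(df: pd.DataFrame, low: int = -10, high: int = 10) -> pd.DataFrame:
    """Keep columns whose parsed position lies in [low, high] (hypothetical helper)."""
    keep = []
    for col in df.columns:
        match = POSITION_COLUMN.match(str(col))
        if match and low <= int(match.group(1)) <= high:
            keep.append(col)
    return df[keep]

wide_demo = pd.DataFrame([[1, 0, 1, 0]], columns=["-12A", "-3S", "0T", "11K"])
narrow_demo = filter_positions_sketch(wide_demo, low=-10, high=10)
print(narrow_demo.columns.tolist())
```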
onehot_10 = filter_range_columns(onehot,low=-10,high=10)
onehot_10
Pipeline:
onehot = onehot_encode(df_k.site_seq)
onehot_10 = filter_range_columns(onehot)
df_k['Cluster'] = run_kmeans(onehot_10,n=n,seed=42)
Then plot onehot_10 with hue='Cluster'.
Elbow method
get_clusters_elbow
def get_clusters_elbow(
encoded_data, max_cluster:int=400, interval:int=50
):
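The elbow method compares the within-cluster sum of squares (WCSS) across several values of k. The sketch below computes WCSS with scikit-learn's `inertia_` attribute and leaves plotting out. The helper name `elbow_wcss_sketch` and the exact set of k values tested (1, 1 + interval, ...) are assumptions and may not match get_clusters_elbow.

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_wcss_sketch(features, max_cluster: int = 400, interval: int = 50, seed: int = 42) -> dict:
    """Return {k: WCSS} for k = 1, 1 + interval, ... up to max_cluster (hypothetical helper)."""
    wcss = {}
    for k in range(1, max_cluster + 1, interval):
        model = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(features)
        wcss[k] = float(model.inertia_)  # sum of squared distances to the nearest centroid
    return wcss

rng = np.random.default_rng(1)
# Three well-separated blobs of 15 points each
blobs = np.vstack([rng.normal(center, 0.2, size=(15, 2)) for center in (0.0, 5.0, 10.0)])
wcss_by_k = elbow_wcss_sketch(blobs, max_cluster=5, interval=2)
for k, value in wcss_by_k.items():
    print(k, round(value, 2))
```

Plotting k against WCSS and looking for the point where the curve flattens gives the elbow.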
get_clusters_elbow(onehot,5,2)
ESM2
get_esm
def get_esm(
df:DataFrame, # DataFrame containing protein sequences
col:str, # column with amino acid sequences
model_name:str='esm2_t33_650M_UR50D', batch_size:int=1, # Number of sequences per batch
):
Extract ESM2 embeddings (mean pooled per sequence).
ESM2 models are trained on UniRef sequences. The default model for this function is esm2_t33_650M_UR50D, which is trained on UniRef50.
Uncomment below to use:
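Mean pooling collapses per-residue embeddings into one fixed-length vector per sequence. The NumPy sketch below shows the pooling step only, with random numbers standing in for model output. The array shapes, the mask convention, and the helper name `masked_mean_pool` are assumptions; whether get_esm also excludes special tokens is not stated here.

```python
import numpy as np

def masked_mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average embeddings over real tokens only (hypothetical helper).

    token_embeddings: (batch, length, dim)
    attention_mask:   (batch, length), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts

rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 4, 3))     # 2 sequences, 4 positions, 3-dim embeddings
pad_mask = np.array([[1, 1, 1, 1],
                     [1, 1, 0, 0]])     # second sequence has 2 padded positions
pooled = masked_mean_pool(tokens, pad_mask)
print(pooled.shape)
```

Masking matters once sequences of different lengths are batched together, because padded positions would otherwise dilute the average.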
# # Examples
# df = Data.get_kinase_info().set_index('kinase')
# sample = df[:5]
# esmfeature = get_esm(sample,'sequence')
# esmfeature.head()
ProtT5
get_t5
def get_t5(
df:DataFrame, col:str='sequence'
):
Extract ProtT5-XL-uniref50 embeddings from protein sequence in a dataframe
The XL-UniRef50 model is a T5-3B model trained on the UniRef50 dataset.
Uncomment below to use:
# t5feature = get_t5(sample,'sequence')
# t5feature.head()
get_t5_bfd
def get_t5_bfd(
df:DataFrame, col:str='sequence'
):
Extract ProtT5-XL-BFD embeddings from protein sequence in a dataframe
The XL-BFD model is a T5-3B model trained on the Big Fantastic Database (BFD).
Uncomment below to use:
# t5bfd = get_t5_bfd(sample,'sequence')
# t5bfd.head()