Feature

A collection of tools to extract features from SMILES, proteins, etc.

Overview

This module provides tools to extract features from SMILES (chemical compounds) and protein sequences for machine learning applications.


Utility Functions

remove_hi_corr(df, thr) - Removes highly correlated features from a DataFrame based on a Pearson correlation threshold. Useful for reducing multicollinearity before modeling.

df_cleaned = remove_hi_corr(
    df=my_features,  # DataFrame with features as columns
    thr=0.98,        # correlation threshold above which to drop columns
)

preprocess(df, thr) - Combines zero-variance removal with correlation filtering. Drops columns with no variance (e.g., constant values) and highly correlated features.

df_processed = preprocess(
    df=my_features,  # DataFrame with features
    thr=0.98,        # correlation threshold
)

standardize(df) - Standardizes features to zero mean and unit variance using sklearn’s StandardScaler.

df_scaled = standardize(
    df=my_features,  # DataFrame to standardize
)

Compound Features (SMILES)

get_rdkit(SMILES) - Extracts ~200 RDKit molecular descriptors from a SMILES string.

features = get_rdkit(
    SMILES="CC(=O)O",  # SMILES representation of molecule
)

get_rdkit_3d(SMILES) - Extracts 3D molecular descriptors after generating a conformer using ETKDG embedding.

features_3d = get_rdkit_3d(
    SMILES="CC(=O)O",  # SMILES representation of molecule
)

get_rdkit_df(df, col, postprocess) - Batch extracts RDKit features (2D + 3D) from a DataFrame column containing SMILES. Optionally removes redundant features and standardizes.

rdkit_features = get_rdkit_df(
    df=compounds_df,   # DataFrame containing SMILES
    col='SMILES',      # column name with SMILES strings
    postprocess=True,  # remove redundant columns & standardize
)

get_morgan(df, col, radius) - Generates 2048-bit Morgan fingerprints (circular fingerprints) from SMILES.

morgan_fps = get_morgan(
    df=compounds_df,  # DataFrame containing SMILES
    col='SMILES',     # column name with SMILES strings
    radius=3,         # radius for Morgan fingerprint
)

Protein Sequence Features - One-Hot Encoding

onehot_encode(sequences, transform_colname, n) - Converts amino acid sequences to a one-hot encoded matrix.

encoded = onehot_encode(
    sequences=df['site_seq'],  # iterable of AA sequences
    transform_colname=True,    # convert column names to position format
    n=20,                      # number of standard amino acids
)

onehot_encode_df(df, seq_col) - Convenience wrapper for one-hot encoding from a DataFrame.

encoded = onehot_encode_df(
    df=my_df,            # DataFrame with sequences
    seq_col='site_seq',  # column name containing sequences
)

filter_range_columns(df, low, high) - Filters one-hot encoded columns to specific sequence positions (e.g., -10 to +10 around a site).

filtered = filter_range_columns(
    df=onehot_df,  # one-hot encoded DataFrame with position+AA column names
    low=-10,       # minimum position to include
    high=10,       # maximum position to include
)

Clustering

run_kmeans(onehot, n, seed) - Performs K-means clustering on encoded data and returns cluster assignments.

clusters = run_kmeans(
    onehot=encoded_df,  # one-hot or other feature matrix
    n=10,               # number of clusters
    seed=42,            # random seed for reproducibility
)

get_clusters_elbow(encoded_data, max_cluster, interval) - Plots the elbow curve (WCSS vs. number of clusters) to help choose a suitable k.

get_clusters_elbow(
    encoded_data=onehot_df,  # feature matrix for clustering
    max_cluster=400,         # maximum clusters to test
    interval=50,             # step size between cluster counts
)

Protein Language Model Embeddings

get_esm(df, col, model_name) - Extracts ESM2 embeddings (mean-pooled) from protein sequences. Requires GPU.

esm_features = get_esm(
    df=kinase_df,                      # DataFrame with protein sequences
    col='sequence',                    # column name with AA sequences
    model_name='esm2_t33_650M_UR50D',  # ESM2 model variant
)

get_t5(df, col) - Extracts ProtT5-XL-UniRef50 embeddings from protein sequences.

t5_features = get_t5(
    df=kinase_df,       # DataFrame with protein sequences
    col='sequence',     # column name with AA sequences
)

get_t5_bfd(df, col) - Extracts ProtT5-XL-BFD embeddings (trained on Big Fantastic Database).

t5bfd_features = get_t5_bfd(
    df=kinase_df,       # DataFrame with protein sequences
    col='sequence',     # column name with AA sequences
)

Setup

Utils


remove_hi_corr


def remove_hi_corr(
    df:DataFrame, thr:float=0.98, # threshold
):

Remove highly correlated features from a DataFrame given a Pearson correlation threshold

remove_hi_corr drops features whose Pearson correlation with another feature is above the threshold thr.

# Load data
df = Data.get_aa_rdkit()
df.shape
remove_hi_corr(df,thr=0.9).shape
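
One common way to build this kind of filter is sketched below. It is an illustration under assumptions, not necessarily how remove_hi_corr works internally: the helper name drop_correlated and the toy data are hypothetical, and comparing absolute correlations and dropping the later column of each correlated pair are assumed design choices.

```python
import numpy as np
import pandas as pd


def drop_correlated(df: pd.DataFrame, thr: float = 0.98) -> pd.DataFrame:
    """Hypothetical helper: drop the later column of every pair with |r| > thr."""
    corr = df.corr().abs()  # absolute Pearson correlation matrix
    # Keep only the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [c for c in upper.columns if (upper[c] > thr).any()]
    return df.drop(columns=to_drop)


toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with "a"
    "c": [4.0, 1.0, 3.0, 2.0],  # only weakly correlated with "a" and "b"
})
kept = drop_correlated(toy, thr=0.9)
print(list(kept.columns))
```

In this sketch the first column of a correlated pair is always the one retained, so the result depends on column order.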

preprocess


def preprocess(
    df:DataFrame, thr:float=0.98
):

Remove features with no variance, and highly correlated features based on a threshold

This function is similar to remove_hi_corr, but additionally removes features with zero variance (e.g., a value of 1 across all samples).

preprocess(df,thr=0.9).shape
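
The zero-variance step can be sketched as follows. This is a minimal illustration, not the library's code: drop_constant is a hypothetical helper and the toy columns are made up. Removing constant columns before any correlation filter is useful because the Pearson correlation of a constant column is undefined and shows up as NaN in pandas.

```python
import pandas as pd


def drop_constant(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep only columns with more than one distinct value."""
    return df.loc[:, df.nunique() > 1]


toy = pd.DataFrame({
    "x": [1, 1, 1],        # constant, carries no information
    "y": [0.1, 0.5, 0.9],  # varies across samples
})
filtered = drop_constant(toy)
print(list(filtered.columns))
```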

standardize


def standardize(
    df
):

Standardize the features of a DataFrame to zero mean and unit variance
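
To make the transformation concrete, the sketch below computes the same z-score in plain pandas. It is an assumption-laden illustration rather than the function's implementation: zscore is a hypothetical helper, and ddof=0 is used because StandardScaler scales by the population standard deviation. Unlike StandardScaler, this version yields NaN for constant columns.

```python
import pandas as pd


def zscore(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: centre each column and scale it to unit variance."""
    return (df - df.mean()) / df.std(ddof=0)


toy = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 60.0]})
scaled = zscore(toy)
print(scaled.round(3))
```

The index and column names of the input are preserved, which keeps the scaled features aligned with the original samples.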

Compound features

RDKit descriptors


get_rdkit


def get_rdkit(
    SMILES
):

Extract chemical features from SMILES

Reference: https://greglandrum.github.io/rdkit-blog/posts/2022-12-23-descriptor-tutorial.html


get_rdkit_3d


def get_rdkit_3d(
    SMILES
):

Extract 3D features from SMILES


get_rdkit_all


def get_rdkit_all(
    SMILES
):

Extract chemical features and 3D features from SMILES


get_rdkit_df


def get_rdkit_df(
    df, col, # column of SMILES
    postprocess:bool=True, # remove redundant columns and standardize features for dimension reduction
):

Extract RDKit features (including 3D) from SMILES in a DataFrame

aa = Data.get_aa_info()
aa.head()
aa_rdkit = get_rdkit_df(aa, 'SMILES')
aa_rdkit.head()

Morgan fingerprint


get_morgan


def get_morgan(
    df:DataFrame, # a DataFrame that contains SMILES
    col:str='SMILES', # name of the SMILES column
    radius:int=3
):

Get 2048-bit Morgan fingerprints (binary features) from SMILES in a DataFrame

aa_morgan = get_morgan(aa, 'SMILES')
aa_morgan.head()

Protein sequence

Onehot


onehot_encode


def onehot_encode(
    sequences, transform_colname:bool=True, n:int=20
):

onehot_encode_df


def onehot_encode_df(
    df, seq_col:str='site_seq', **kwargs
):

df = Data.get_combine_site_psp_ochoa()
df_k = df.head(1000)
onehot = onehot_encode_df(df_k, seq_col='site_seq')
onehot
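
The idea behind position-wise one-hot encoding can be sketched with pandas alone. Everything here is an assumption made for illustration: onehot_sketch is a hypothetical helper, the column naming (signed position followed by the residue letter, with position 0 at the centre of the sequence) is a guess at the "position + aa" format, and only residues that actually occur at a position get a column.

```python
import pandas as pd


def onehot_sketch(sequences) -> pd.DataFrame:
    """Hypothetical helper: one indicator column per (position, residue) pair."""
    seqs = [str(s) for s in sequences]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    center = length // 2  # assumed: the site of interest sits in the middle
    positions = [str(i - center) for i in range(length)]
    # One column per position, each cell holding a single residue letter.
    residues = pd.DataFrame([list(s) for s in seqs], columns=positions)
    # Expand every position column into position+residue indicator columns.
    return pd.get_dummies(residues, prefix_sep="", dtype=int)


encoded = onehot_sketch(["ASP", "GSK"])
print(encoded)
```

Each row contains exactly one 1 per position, so the row sum equals the sequence length.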

K-means of onehot


run_kmeans


def run_kmeans(
    onehot, n:int=2, seed:int=42
):

Take a one-hot encoded DataFrame and return the cluster assignment of each row.

run_kmeans(onehot.head(100),n=10)
onehot

filter_range_columns


def filter_range_columns(
    df, # df need to have column names of position + aa
    low:int=-10, high:int=10
):

onehot_10 = filter_range_columns(onehot,low=-10,high=10)
onehot_10
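
A position filter of this kind can be sketched by parsing the leading integer of each column name. This is only an illustration: filter_positions is a hypothetical helper, and the column format (a signed position followed by a residue letter, such as "-3S") is an assumption about what "position + aa" looks like.

```python
import re

import pandas as pd


def filter_positions(df: pd.DataFrame, low: int = -10, high: int = 10) -> pd.DataFrame:
    """Hypothetical helper: keep columns whose leading position lies in [low, high]."""
    keep = []
    for name in df.columns:
        match = re.match(r"-?\d+", str(name))  # signed integer at the start of the name
        if match and low <= int(match.group()) <= high:
            keep.append(name)
    return df[keep]


toy = pd.DataFrame([[1, 0, 1, 0]], columns=["-12A", "-3S", "0T", "11K"])
window = filter_positions(toy, low=-10, high=10)
print(list(window.columns))
```

Columns whose names do not start with an integer are dropped in this sketch; whether the real function keeps or drops them is not stated here.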

Pipeline:

onehot = onehot_encode(df_k.site_seq)
onehot_10 = filter_range_columns(onehot)
# n is the number of clusters, chosen beforehand (e.g., with the elbow method)
df_k['Cluster'] = run_kmeans(onehot_10,n=n,seed=42)

Then plot onehot_10, colored by cluster (hue='Cluster').

Elbow method


get_clusters_elbow


def get_clusters_elbow(
    encoded_data, max_cluster:int=400, interval:int=50
):

get_clusters_elbow(onehot,5,2)
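
The quantity behind an elbow curve is the within-cluster sum of squares (WCSS) for each candidate k. The sketch below computes it with scikit-learn's KMeans on synthetic data; the three-blob dataset, the range of k, and the use of inertia_ as WCSS are choices made for this example and may differ from what get_clusters_elbow does internally. Plotting is left out so the snippet stays dependency-light.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated 2-D blobs of 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit K-means for each k and record the within-cluster sum of squares.
wcss = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss[k] = model.inertia_

for k, value in wcss.items():
    print(k, round(value, 1))
```

WCSS falls steeply until k reaches the number of real groups and flattens afterwards; the bend in that curve is the "elbow".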

ESM2


get_esm


def get_esm(
    df:DataFrame, # DataFrame containing protein sequences
    col:str, # column with amino acid sequences
    model_name:str='esm2_t33_650M_UR50D', batch_size:int=1, # Number of sequences per batch
):

Extract ESM2 embeddings (mean pooled per sequence).

ESM2 models are trained on UniRef sequences. The default model in this function, esm2_t33_650M_UR50D, is trained on UniRef50.

Uncomment below to use:

# # Examples
# df = Data.get_kinase_info().set_index('kinase')
# sample = df[:5]
# esmfeature = get_esm(sample,'sequence')
# esmfeature.head()
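
Mean pooling turns per-residue embeddings into one fixed-length vector per sequence. The NumPy sketch below shows the operation on made-up numbers; mean_pool is a hypothetical helper, and whether get_esm excludes special tokens or padding in exactly this way is an assumption, not something stated above.

```python
import numpy as np


def mean_pool(token_emb: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Hypothetical helper: average embeddings over the positions where mask == 1.

    token_emb has shape (batch, length, dim); mask has shape (batch, length).
    """
    weights = mask[..., None].astype(token_emb.dtype)  # (batch, length, 1)
    summed = (token_emb * weights).sum(axis=1)         # (batch, dim)
    counts = weights.sum(axis=1)                       # (batch, 1)
    return summed / counts


emb = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[0.0, 0.0], [3.0, 3.0], [6.0, 6.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])
pooled = mean_pool(emb, mask)
print(pooled)
```

Masking matters when sequences of different lengths are batched together, since padded positions would otherwise distort the average.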

ProtT5


get_t5


def get_t5(
    df:DataFrame, col:str='sequence'
):

Extract ProtT5-XL-UniRef50 embeddings from protein sequences in a DataFrame

The ProtT5-XL-UniRef50 model is a T5-3B model trained on the UniRef50 dataset.

Uncomment below to use:

# t5feature = get_t5(sample,'sequence')
# t5feature.head()

get_t5_bfd


def get_t5_bfd(
    df:DataFrame, col:str='sequence'
):

Extract ProtT5-XL-BFD embeddings from protein sequences in a DataFrame

The ProtT5-XL-BFD model is a T5-3B model trained on the Big Fantastic Database (BFD).

Uncomment below to use:

# t5bfd = get_t5_bfd(sample,'sequence')
# t5bfd.head()