A collection of utilities for training, evaluating, and deploying scikit-learn models for kinase substrate specificity prediction.
Data Splitting
get_splits - Creates cross-validation splits using stratified, grouped, or stratified-grouped KFold methods. This ensures proper data separation to avoid data leakage (e.g., keeping kinases from the same subfamily in the same fold).
```python
splits = get_splits(
    df=pspa_info,      # DataFrame containing metadata for splitting
    stratified=None,   # column name for stratified sampling (samples from different strata in each fold)
    group='subfamily', # column name for group splitting (train/test never share groups)
    nfold=5,           # number of cross-validation folds
    seed=123,          # random seed for reproducibility
)
```
split_data - Splits a dataframe into train/test features and targets based on a single split tuple from get_splits.
```python
X_train, y_train, X_test, y_test = split_data(
    df=df,                  # full DataFrame with features and targets
    feat_col=feat_col,      # list of feature column names (e.g., T5 embeddings)
    target_col=target_col,  # list of target column names (e.g., PSSM values)
    split=splits[0],        # tuple of (train_indices, test_indices)
)
```
Model Training
train_ml - Fits a single sklearn model on one train/test split and returns predictions on the test set. Optionally saves the trained model.
```python
y_test, y_pred = train_ml(
    df=df,                          # DataFrame with features and targets
    feat_col=feat_col,              # feature column names
    target_col=target_col,          # target column names
    split=splits[0],                # single split tuple (train_idx, test_idx)
    model=LinearRegression(),       # any sklearn-compatible model
    save='models/lr_fold0.joblib',  # path to save model (None to skip)
    params={},                      # extra kwargs passed to model.fit()
)
```
train_ml_cv - Performs full cross-validation across all splits, returning out-of-fold (OOF) predictions for the entire dataset.
```python
oof = train_ml_cv(
    df=df,                   # DataFrame with features and targets
    feat_col=feat_col,       # feature column names
    target_col=target_col,   # target column names
    splits=splits,           # list of split tuples from get_splits
    model=Ridge(alpha=1.0),  # sklearn model (re-instantiated each fold)
    save='ridge',            # base name for saved models (becomes ridge_0.joblib, etc.)
    params={},               # extra kwargs for model.fit()
)
```
Post-Processing
post_process - Cleans raw PSSM predictions by clipping negatives to zero, cleaning position zero, and normalizing each position to sum to 1.
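A rough sketch of those three steps is shown below. Everything here is an assumption layered on the description above: the name `post_process_sketch` is hypothetical, `positions` (a mapping from each position label to its column names) is an invented parameter, and zeroing out the position-0 columns is only one plausible reading of "cleaning position zero" — the real function may treat that position differently.

```python
import numpy as np
import pandas as pd

# Hedged sketch of the described post-processing: clip negatives to zero,
# clean the position-0 columns (here simply zeroed, as a placeholder),
# and renormalize each remaining position so its values sum to 1.
def post_process_sketch(pred: pd.DataFrame, positions: dict) -> pd.DataFrame:
    out = pred.clip(lower=0)  # negative PSSM predictions -> 0
    for pos, cols in positions.items():
        if pos == 0:
            out[cols] = 0.0  # placeholder for the position-0 cleanup
            continue
        total = out[cols].sum(axis=1)
        # divide each row by its per-position sum; rows summing to 0 become 0
        out[cols] = out[cols].div(total.replace(0, np.nan), axis=0).fillna(0)
    return out
```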
```python
def get_splits(
    df: DataFrame,         # DataFrame containing the info used for splitting
    stratified: str=None,  # column name for stratified k-fold; samples from each stratum in every fold
    group: str=None,       # column name for group k-fold; train and test never share groups
    nfold: int=5,          # number of folds
    seed: int=123,         # random seed
):
```
Split samples in a DataFrame using the Stratified, Group, or StratifiedGroup k-fold method.
```python
# Column names of features and targets
# feat_col = df.columns[df.columns.str.startswith('T5_')]
# target_col = df.columns[~df.columns.isin(feat_col)][1:]

# feat_col
# target_col
```
split_data
```python
def split_data(
    df: DataFrame,     # DataFrame of values
    feat_col: list,    # feature columns
    target_col: list,  # target columns
    split: tuple,      # one of the split tuples from get_splits
):
```
Given a split tuple, split the DataFrame into X_train, y_train, X_test, y_test.
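Under the stated contract (positional index arrays in the split tuple), the behavior amounts to the following sketch; `split_data_sketch` is a hypothetical name, and the real implementation may differ.

```python
import pandas as pd

# Hedged sketch of split_data's described behavior: select feature and
# target columns for the train and test rows named by the split tuple.
def split_data_sketch(df, feat_col, target_col, split):
    train_idx, test_idx = split
    X_train = df.iloc[train_idx][feat_col]
    y_train = df.iloc[train_idx][target_col]
    X_test = df.iloc[test_idx][feat_col]
    y_test = df.iloc[test_idx][target_col]
    return X_train, y_train, X_test, y_test
```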
```python
def train_ml(
    df,                 # DataFrame of values
    feat_col,           # feature columns
    target_col,         # target columns
    split,              # one split tuple from splits
    model,              # an sklearn model
    save: str=None,     # .joblib file path to save the model, e.g., 'model.joblib'
    params: dict=None,  # dict of parameters for model.fit
):
```
Fit an sklearn-style model on the train split and return the targets and predictions for the validation (test) split.
```python
model = LinearRegression()

# Uncomment to run with model saving:
# target, pred = train_ml(df, feat_col, target_col, split0, model, 'model.joblib')

# Run without saving the model
target, pred = train_ml(df, feat_col, target_col, split0, model)
pred.head()
```
Cross-Validation
train_ml_cv
```python
def train_ml_cv(
    df,                 # DataFrame of values
    feat_col,           # feature columns
    target_col,         # target columns
    splits,             # list of split tuples
    model,              # sklearn model
    save: str=None,     # base name for saved models, e.g., 'LR'
    params: dict=None,  # kwargs for model.fit
):
```
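The cross-validation loop described above (a fresh model per fold, test-fold predictions stitched into one out-of-fold frame) can be sketched as follows. This is an illustrative sketch with a hypothetical name, `train_ml_cv_sketch`; it omits the `save` and `params` handling of the real function.

```python
import pandas as pd
from sklearn.base import clone

# Hedged sketch of the CV loop train_ml_cv is described as running:
# fit a clone of the model on each fold's train rows and fill the
# fold's test rows in a single out-of-fold (OOF) prediction frame,
# so every row of df is predicted exactly once.
def train_ml_cv_sketch(df, feat_col, target_col, splits, model):
    oof = pd.DataFrame(index=df.index, columns=target_col, dtype=float)
    for train_idx, test_idx in splits:
        m = clone(model)  # re-instantiate so folds don't share fitted state
        m.fit(df.iloc[train_idx][feat_col], df.iloc[train_idx][target_col])
        oof.iloc[test_idx] = m.predict(df.iloc[test_idx][feat_col])
    return oof
```

Cloning (rather than reusing one fitted instance) is what makes each fold's predictions genuinely out-of-fold.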