Train ML

A collection of machine learning tools

Setup

Splitter


source

get_splits

 get_splits (df:pandas.core.frame.DataFrame, stratified:str=None,
             group:str=None, nfold:int=5, seed:int=123)

Split samples in a dataframe using the Stratified, Group, or StratifiedGroup KFold method

Type Default Details
df DataFrame dataframe that contains the info for the split
stratified str None column name for stratified k-fold; each fold samples from all groups
group str None column name for group k-fold; test and train come from different groups
nfold int 5 number of folds
seed int 123 random seed
df=pd.read_parquet('paper/kinase_domain/train/pspa_t5.parquet')
info=Data.get_kinase_info()

info = info[info.pseudo=='0']

info = info[info.kd_ID.notna()]

subfamily_map = info[['kd_ID','subfamily']].drop_duplicates().set_index('kd_ID')['subfamily']

pspa_info = pd.DataFrame(df.index.tolist(),columns=['kinase'])

pspa_info['subfamily'] = pspa_info.kinase.map(subfamily_map)

splits = get_splits(pspa_info, group='subfamily',nfold=5)

split0 = splits[0]
GroupKFold(n_splits=5, random_state=None, shuffle=False)
# subfamily in train set: 120
# subfamily in test set: 29
df=df.reset_index()
df.columns
Index(['index', '-5P', '-4P', '-3P', '-2P', '-1P', '0P', '1P', '2P', '3P',
       ...
       'T5_1014', 'T5_1015', 'T5_1016', 'T5_1017', 'T5_1018', 'T5_1019',
       'T5_1020', 'T5_1021', 'T5_1022', 'T5_1023'],
      dtype='object', length=1255)
# column name of feature and target
feat_col = df.columns[df.columns.str.startswith('T5_')]
target_col = df.columns[~df.columns.isin(feat_col)][1:]
feat_col
Index(['T5_0', 'T5_1', 'T5_2', 'T5_3', 'T5_4', 'T5_5', 'T5_6', 'T5_7', 'T5_8',
       'T5_9',
       ...
       'T5_1014', 'T5_1015', 'T5_1016', 'T5_1017', 'T5_1018', 'T5_1019',
       'T5_1020', 'T5_1021', 'T5_1022', 'T5_1023'],
      dtype='object', length=1024)
target_col
Index(['-5P', '-4P', '-3P', '-2P', '-1P', '0P', '1P', '2P', '3P', '4P',
       ...
       '-5pY', '-4pY', '-3pY', '-2pY', '-1pY', '0pY', '1pY', '2pY', '3pY',
       '4pY'],
      dtype='object', length=230)

source

split_data

 split_data (df:pandas.core.frame.DataFrame, feat_col:list,
             target_col:list, split:tuple)

Given a split tuple, split the dataframe into X_train, y_train, X_test, y_test

Type Details
df DataFrame dataframe of values
feat_col list feature columns
target_col list target columns
split tuple one of the splits returned by get_splits
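A plausible re-implementation, assuming each split is a `(train_idx, test_idx)` pair of positional index arrays as yielded by sklearn's KFold splitters (`split_data_sketch` and the demo dataframe are illustrative, not the library's code):

```python
import numpy as np
import pandas as pd

def split_data_sketch(df, feat_col, target_col, split):
    # Hypothetical re-implementation: slice rows by positional indices,
    # then select feature and target columns.
    train_idx, test_idx = split
    X_train = df.iloc[train_idx][feat_col]
    y_train = df.iloc[train_idx][target_col]
    X_test = df.iloc[test_idx][feat_col]
    y_test = df.iloc[test_idx][target_col]
    return X_train, y_train, X_test, y_test

df_demo = pd.DataFrame(np.arange(20).reshape(5, 4),
                       columns=['f0', 'f1', 't0', 't1'])
X_tr, y_tr, X_te, y_te = split_data_sketch(
    df_demo, ['f0', 'f1'], ['t0', 't1'],
    (np.array([0, 1, 2]), np.array([3, 4])))
```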
X_train, y_train, X_test, y_test = split_data(df,feat_col, target_col, split0)
X_train.shape,y_train.shape,X_test.shape,y_test.shape
((294, 1024), (294, 230), (74, 1024), (74, 230))

Trainer


source

train_ml

 train_ml (df, feat_col, target_col, split, model, save=None, params={})

Fit and predict using the sklearn model API; return the target and prediction of the validation set.

Type Default Details
df dataframe of values
feat_col feature columns
target_col target columns
split one split from splits
model a sklearn model
save NoneType None file (.joblib) to save, e.g. ‘model.joblib’
params dict {} parameters passed to model.fit
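The fit/predict/save flow can be sketched as below; `train_ml_sketch` is a hypothetical re-implementation under the assumptions above (split is a `(train_idx, test_idx)` pair, saving uses joblib), and the demo data is synthetic:

```python
import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def train_ml_sketch(df, feat_col, target_col, split, model, save=None, params={}):
    # Hypothetical re-implementation: fit on the train indices, predict on
    # the validation indices, optionally dump the fitted model via joblib.
    train_idx, test_idx = split
    model.fit(df.iloc[train_idx][feat_col], df.iloc[train_idx][target_col], **params)
    if save is not None:
        joblib.dump(model, save)
    target = df.iloc[test_idx][target_col]
    pred = pd.DataFrame(model.predict(df.iloc[test_idx][feat_col]),
                        index=target.index, columns=target_col)
    return target, pred

rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.normal(size=(20, 5)),
                       columns=['f0', 'f1', 'f2', 't0', 't1'])
split_demo = (np.arange(15), np.arange(15, 20))
target, pred = train_ml_sketch(df_demo, ['f0', 'f1', 'f2'], ['t0', 't1'],
                               split_demo, LinearRegression())
```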
model = LinearRegression()

## Uncomment to run and save the model
# target,pred = train_ml(df, feat_col, target_col, split0, model,'model.joblib')

# Run without saving model
target,pred = train_ml(df, feat_col, target_col, split0, model)

pred.head()
-5P -4P -3P -2P -1P 0P 1P 2P 3P 4P ... -5pY -4pY -3pY -2pY -1pY 0pY 1pY 2pY 3pY 4pY
14 0.025734 0.037940 0.066932 0.019279 0.073621 0.0 -0.076180 0.035905 -0.005546 0.042921 ... 0.075505 0.071823 0.055032 0.080379 -0.017676 -0.038610 0.001892 0.084595 0.052408 0.032714
15 0.029486 0.041007 0.069491 -0.007920 0.059908 0.0 -0.025618 0.023630 0.006274 0.035469 ... 0.095213 0.067924 0.043564 0.083987 0.059089 -0.037290 0.042184 0.076801 0.093041 0.077291
16 0.017894 0.022863 0.046134 -0.027531 0.045264 0.0 -0.022573 0.015241 -0.007165 0.030543 ... 0.111131 0.079995 0.045566 0.087192 0.081287 -0.023617 0.034757 0.067262 0.138138 0.099193
36 0.052927 0.043052 0.084370 -0.064723 0.079333 0.0 0.204311 0.087066 0.150505 0.108832 ... 0.173530 0.151807 0.092447 0.128092 0.316406 -0.061446 0.289082 0.068021 0.257368 0.211838
37 0.045769 0.028035 0.057566 0.091526 0.037958 0.0 0.602056 0.030222 0.024714 0.037466 ... 0.053620 0.041376 0.021975 -0.004069 0.010513 -0.010340 0.046816 -0.002519 0.025250 0.024028

5 rows × 230 columns

Cross-Validation


source

train_ml_cv

 train_ml_cv (df, feat_col, target_col, splits, model, save=None,
              params={})

Run cross-validation over the given splits

Type Default Details
df dataframe of values
feat_col feature columns
target_col target columns
splits list of splits from get_splits
model sklearn model
save NoneType None model name to be saved, e.g., ‘LR’
params dict {} kwargs passed to model.fit
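The out-of-fold (OOF) loop presumably fits a fresh model per fold and stacks each fold's validation predictions, tagging rows with their fold id (the `nfold` column used below). `train_ml_cv_sketch` is an illustrative assumption, not the library's code:

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def train_ml_cv_sketch(df, feat_col, target_col, splits, model, params={}):
    # Hypothetical re-implementation: clone the model for each fold,
    # fit on the fold's train rows, predict its validation rows, and
    # concatenate everything into one out-of-fold frame.
    folds = []
    for i, (train_idx, test_idx) in enumerate(splits):
        m = clone(model)  # fresh, unfitted copy per fold
        m.fit(df.iloc[train_idx][feat_col], df.iloc[train_idx][target_col], **params)
        pred = pd.DataFrame(m.predict(df.iloc[test_idx][feat_col]),
                            index=df.index[test_idx], columns=target_col)
        pred['nfold'] = i
        folds.append(pred)
    return pd.concat(folds).sort_index()

rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.normal(size=(20, 3)), columns=['f0', 'f1', 't0'])
splits_demo = list(KFold(n_splits=5).split(df_demo))
oof_demo = train_ml_cv_sketch(df_demo, ['f0', 'f1'], ['t0'],
                              splits_demo, LinearRegression())
```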
oof = train_ml_cv(df,feat_col,target_col,splits=splits,model=model)

Score


source

post_process

 post_process (pssm_df)

Convert negative values to 0, zero out all but the last three values at position zero, and normalize each position
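The three steps can be sketched as below; `post_process_sketch` is a hypothetical re-implementation assuming the PSSM has positions as columns, residues as rows, and that the last three rows are the ones kept at position 0:

```python
import numpy as np
import pandas as pd

def post_process_sketch(pssm, keep_last=3):
    # Hypothetical re-implementation: clip negatives to 0, zero out all but
    # the last `keep_last` rows at position 0 (assumed to be the
    # phospho-residues), then normalize each position column to sum to 1.
    out = pssm.clip(lower=0)
    out.loc[out.index[:-keep_last], 0] = 0
    return out / out.sum()

# toy PSSM with three positions and six residues
pssm_demo = pd.DataFrame({-1: [0.2, -0.1, 0.3, 0.1, 0.2, 0.4],
                           0: [0.5, 0.5, 0.5, 0.1, 0.2, 0.3],
                           1: [-0.2, 0.4, 0.1, 0.3, 0.2, 0.1]})
pssm_demo.columns.name = 'Position'
processed = post_process_sketch(pssm_demo)
```

After processing, every position column sums to 1, matching the `pssm.sum()` output below.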

pssm = post_process(recover_pssm(oof.iloc[0,:-1].sort_values()))
pssm.sum()
Position
-5    1.0
-4    1.0
-3    1.0
-2    1.0
-1    1.0
 0    1.0
 1    1.0
 2    1.0
 3    1.0
 4    1.0
dtype: float64

source

post_process_oof

 post_process_oof (oof_ml, target_col)
oof = post_process_oof(oof,target_col)

source

get_score

 get_score (target, pred, func)
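`get_score` presumably applies a distance function row by row to paired target/prediction rows; the convenience wrappers `get_score_jsd`/`get_score_kld` used below would then bind `func` to a specific metric. A sketch under that assumption, using scipy's `jensenshannon`:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import jensenshannon

def get_score_sketch(target, pred, func):
    # Hypothetical re-implementation: score each sample by applying `func`
    # to its target row and prediction row.
    return pd.Series([func(t, p) for t, p in zip(target.values, pred.values)],
                     index=target.index)

target_demo = pd.DataFrame([[0.5, 0.5], [0.9, 0.1]])
pred_demo = pd.DataFrame([[0.5, 0.5], [0.5, 0.5]])
jsd_demo = get_score_sketch(target_demo, pred_demo, jensenshannon)
```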
target = df[target_col].copy()
pspa_info['jsd'] = get_score_jsd(target, oof)
pspa_info['kld'] = get_score_kld(target, oof)
pspa_info['jsd']
0      0.096463
1      0.068114
2      0.069720
3      0.021238
4      0.029262
         ...   
363    0.055437
364    0.038325
365    0.034305
366    0.039840
367    0.076918
Name: jsd, Length: 368, dtype: float64
pspa_info['kld']
0      0.611760
1      0.838306
2      0.940138
3      0.199112
4      0.412790
         ...   
363    0.835054
364    0.502747
365    0.228586
366    0.521212
367    0.556203
Name: kld, Length: 368, dtype: float64

source

calculate_ce

 calculate_ce (target_series, pred_series)
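A plausible cross-entropy for two score series, normalizing both to probability distributions first; `calculate_ce_sketch` is an assumption about the formula, not the library's implementation:

```python
import numpy as np

def calculate_ce_sketch(target_series, pred_series, eps=1e-10):
    # Hypothetical cross-entropy H(t, p) = -sum(t_i * log(p_i)), with both
    # series normalized to sum to 1 and eps guarding log(0).
    t = np.asarray(target_series, dtype=float)
    p = np.asarray(pred_series, dtype=float) + eps
    t, p = t / t.sum(), p / p.sum()
    return -(t * np.log(p)).sum()

# identical uniform distributions give H = log(n)
ce_demo = calculate_ce_sketch([1, 1, 1, 1], [1, 1, 1, 1])
```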
pspa_info['ce'] = get_score_ce(target, oof)
pspa_info['ce']
0      3.242851
1      3.633247
2      3.734538
3      2.869370
4      3.104084
         ...   
363    3.572266
364    3.254591
365    2.973783
366    3.278089
367    3.126652
Name: ce, Length: 368, dtype: float64
pspa_info['nfold'] = oof['nfold']
pspa_info.groupby('nfold').jsd.mean()
nfold
0    0.042169
1    0.046005
2    0.050073
3    0.053140
4    0.049842
Name: jsd, dtype: float64

Predictor


source

predict_ml

 predict_ml (df, feat_col, target_col=None, model_pth='model.joblib')

Make predictions with a trained model.

Type Default Details
df Dataframe that contains features
feat_col feature columns
target_col NoneType None
model_pth str model.joblib
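Loading and predicting presumably round-trips through joblib; `predict_ml_sketch` below is an illustrative assumption (load the saved model, predict, and name the output columns with `target_col` when given), shown with a synthetic fit/save/reload demo:

```python
import os
import tempfile
import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def predict_ml_sketch(df, feat_col, target_col=None, model_pth='model.joblib'):
    # Hypothetical re-implementation: load the saved model and return
    # predictions as a DataFrame aligned with the input index.
    model = joblib.load(model_pth)
    pred = model.predict(df[feat_col])
    cols = target_col if target_col is not None else range(pred.shape[1])
    return pd.DataFrame(pred, index=df.index, columns=cols)

# round-trip demo: fit, save, reload, predict
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.normal(size=(10, 3)), columns=['f0', 'f1', 't0'])
model = LinearRegression().fit(df_demo[['f0', 'f1']], df_demo[['t0']])
with tempfile.TemporaryDirectory() as d:
    pth = os.path.join(d, 'model.joblib')
    joblib.dump(model, pth)
    pred_demo = predict_ml_sketch(df_demo, ['f0', 'f1'], ['t0'], model_pth=pth)
```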

Uncomment below to run if you have a saved model at model_pth:

# pred2 = predict_ml(X_test, feat_col, target_col, model_pth='model.joblib')
# pred2.head()
## or
# predict_ml(df.iloc[split0[1]], feat_col, model_pth='model.joblib')

End