Train ML

A collection of machine learning tools

Setup

Splitter



get_splits

 get_splits (df:pandas.core.frame.DataFrame, stratified:str=None,
             group:str=None, nfold:int=5, seed:int=123)

Split the samples in a dataframe using the Stratified, Group, or StratifiedGroup k-fold method.

Parameter   Type       Default  Details
df          DataFrame           dataframe that contains the info used for splitting
stratified  str        None     column name for stratified k-fold; each fold samples across the groups in this column
group       str        None     column name for group k-fold; test and train come from different groups
nfold       int        5        number of folds
seed        int        123      random seed
# df=pd.read_parquet('paper/kinase_domain/train/pspa_t5.parquet')
# info=Data.get_kinase_info()

# info = info[info.pseudo=='0']

# info = info[info.kd_ID.notna()]

# subfamily_map = info[['kd_ID','subfamily']].drop_duplicates().set_index('kd_ID')['subfamily']

# pspa_info = pd.DataFrame(df.index.tolist(),columns=['kinase'])

# pspa_info['subfamily'] = pspa_info.kinase.map(subfamily_map)

# splits = get_splits(pspa_info, group='subfamily',nfold=5)

# split0 = splits[0]
# df=df.reset_index()
# df.columns
# # column name of feature and target
# feat_col = df.columns[df.columns.str.startswith('T5_')]
# target_col = df.columns[~df.columns.isin(feat_col)][1:]
# feat_col
# target_col


split_data

 split_data (df:pandas.core.frame.DataFrame, feat_col:list,
             target_col:list, split:tuple)

Given a split tuple, split the dataframe into X_train, y_train, X_test, and y_test.

Parameter   Type       Details
df          DataFrame  dataframe of values
feat_col    list       feature columns
target_col  list       target columns
split       tuple      one split from splits
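
The sketch below shows one way such a split could be applied. It assumes each split is a `(train_idx, test_idx)` pair of positional indices, as scikit-learn splitters yield; `sketch_split_data` and the toy column names are hypothetical.

```python
import numpy as np
import pandas as pd

def sketch_split_data(df, feat_col, target_col, split):
    """Hypothetical stand-in for split_data using positional indices."""
    train_idx, test_idx = split
    X_train = df.iloc[train_idx][feat_col]
    y_train = df.iloc[train_idx][target_col]
    X_test = df.iloc[test_idx][feat_col]
    y_test = df.iloc[test_idx][target_col]
    return X_train, y_train, X_test, y_test

# Toy frame: two feature columns and one made-up target column
toy = pd.DataFrame({
    "T5_0": [0.1, 0.2, 0.3, 0.4],
    "T5_1": [1.0, 0.9, 0.8, 0.7],
    "y0": [0.5, 0.6, 0.7, 0.8],
})
toy_split = (np.array([0, 1, 2]), np.array([3]))
X_train, y_train, X_test, y_test = sketch_split_data(
    toy, ["T5_0", "T5_1"], ["y0"], toy_split
)
```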
# X_train, y_train, X_test, y_test = split_data(df,feat_col, target_col, split0)
# X_train.shape,y_train.shape,X_test.shape,y_test.shape

Trainer



train_ml

 train_ml (df, feat_col, target_col, split, model, save=None, params={})

Fit a model that follows the sklearn API and predict; return the target and predictions for the validation set.

Parameter   Type      Default  Details
df                             dataframe of values
feat_col                       feature columns
target_col                     target columns
split                          one split from splits
model                          an sklearn model
save        NoneType  None     file (.joblib) to save the model to, e.g. 'model.joblib'
params      dict      {}       parameters for model.fit from sklearn
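
To make the fit, save, and predict flow concrete, here is a minimal sketch under stated assumptions. `sketch_train_ml` is a hypothetical stand-in; the `(train_idx, test_idx)` split layout, the use of `joblib.dump` for saving, and the toy data are assumptions rather than the library's confirmed behavior.

```python
import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def sketch_train_ml(df, feat_col, target_col, split, model, save=None, params=None):
    """Hypothetical stand-in for train_ml: fit on train rows, predict held-out rows."""
    params = params or {}  # avoid sharing one mutable default dict between calls
    train_idx, test_idx = split
    train, test = df.iloc[train_idx], df.iloc[test_idx]

    model.fit(train[feat_col], train[target_col], **params)
    if save is not None:
        joblib.dump(model, save)  # persist the fitted estimator

    pred = pd.DataFrame(model.predict(test[feat_col]),
                        index=test.index, columns=target_col)
    return test[target_col], pred

# Toy data with exactly linear targets, so the held-out row is predicted exactly
toy = pd.DataFrame({"T5_0": np.arange(6.0), "T5_1": np.arange(6.0)[::-1] ** 2})
toy["y0"] = 2 * toy["T5_0"] + 1
toy["y1"] = toy["T5_1"] - 3
toy_split = (np.arange(5), np.array([5]))
target, pred = sketch_train_ml(toy, ["T5_0", "T5_1"], ["y0", "y1"],
                               toy_split, LinearRegression())
```

The sketch uses `params=None` instead of `params={}` only to sidestep Python's shared mutable default; the documented signature keeps `{}`.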
# model = LinearRegression()

# ## Uncomment to run and save the model
# # target,pred = train_ml(df, feat_col, target_col, split0, model,'model.joblib')

# # Run without saving model
# target,pred = train_ml(df, feat_col, target_col, split0, model)

# pred.head()

Cross-Validation



train_ml_cv

 train_ml_cv (df, feat_col, target_col, splits, model, save=None,
              params={})

Run cross-validation over the given splits.

Parameter   Type      Default  Details
df                             dataframe of values
feat_col                       feature columns
target_col                     target columns
splits                         splits
model                          an sklearn model
save        NoneType  None     model name to be saved, e.g. 'LR'
params      dict      {}       keyword arguments passed to model.fit
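
A minimal sketch of an out-of-fold loop over such splits follows. `sketch_train_ml_cv` is hypothetical; cloning the model per fold, the `nfold` column, and the omission of the `save` option are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression

def sketch_train_ml_cv(df, feat_col, target_col, splits, model, params=None):
    """Hypothetical stand-in for train_ml_cv: collect out-of-fold predictions."""
    params = params or {}
    folds = []
    for fold, (train_idx, test_idx) in enumerate(splits):
        fold_model = clone(model)  # fresh, unfitted copy for every fold
        train, test = df.iloc[train_idx], df.iloc[test_idx]
        fold_model.fit(train[feat_col], train[target_col], **params)
        pred = pd.DataFrame(fold_model.predict(test[feat_col]),
                            index=test.index, columns=target_col)
        pred["nfold"] = fold  # remember which fold produced each row
        folds.append(pred)
    return pd.concat(folds).sort_index()  # restore the original row order

# Toy data: one feature, one exactly linear target, three hand-written folds
toy = pd.DataFrame({"T5_0": np.arange(6.0)})
toy["y0"] = 3 * toy["T5_0"] - 1
toy_splits = [
    (np.array([2, 3, 4, 5]), np.array([0, 1])),
    (np.array([0, 1, 4, 5]), np.array([2, 3])),
    (np.array([0, 1, 2, 3]), np.array([4, 5])),
]
oof = sketch_train_ml_cv(toy, ["T5_0"], ["y0"], toy_splits, LinearRegression())
```

Every row is predicted by a model that never saw it during fitting, so the combined frame can be scored against the full target table.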
# oof = train_ml_cv(df,feat_col,target_col,splits=splits,model=model)

Score



post_process

 post_process (pssm_df)

Set negative values to 0, clear all but the last three values at position zero, and normalize each position.
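
The three steps can be sketched as follows. This is an illustration only: `sketch_post_process` is hypothetical, and it assumes the PSSM has residues as rows and positions as columns, that position zero is the column labeled `0`, and that the values to keep there are the last three rows. The row labels in the toy matrix are made up.

```python
import pandas as pd

def sketch_post_process(pssm_df, keep_last=3):
    """Hypothetical stand-in for post_process on a residue-by-position matrix."""
    pssm = pssm_df.clip(lower=0)  # negative predictions become 0
    if 0 in pssm.columns:
        # At position 0, zero every row except the last `keep_last` rows
        pssm.loc[pssm.index[:-keep_last], 0] = 0
    return pssm / pssm.sum()  # each position (column) now sums to 1

# Toy matrix: five made-up residue rows, positions -1, 0, and 1
toy = pd.DataFrame(
    {
        -1: [0.2, -0.1, 0.3, 0.1, 0.5],
        0: [0.1, 0.2, 0.3, 0.3, 0.4],
        1: [0.4, 0.4, -0.2, 0.1, 0.1],
    },
    index=["A", "C", "s", "t", "y"],
)
pssm = sketch_post_process(toy)
```

A position whose values are all zero after clipping would divide by zero in this sketch and produce NaN; how the library handles that case is not stated here.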

# pssm = post_process(recover_pssm(oof.iloc[0,:-1].sort_values()))
# pssm.sum()


post_process_oof

 post_process_oof (oof_ml, target_col)
# oof = post_process_oof(oof,target_col)


get_score

 get_score (target, pred, func)
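
One plausible reading of this signature is that `func` is applied to each matching row of `target` and `pred`. The sketch below follows that reading; `sketch_get_score`, `kl_divergence`, and `js_divergence` are hypothetical helpers, and treating each row as a single probability distribution is an assumption (the library may score per position instead).

```python
import numpy as np
import pandas as pd

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q against their midpoint."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def sketch_get_score(target, pred, func):
    """Hypothetical stand-in for get_score: apply func to each matching row pair."""
    pred = pred.loc[target.index, target.columns]  # align rows and columns first
    scores = [func(target.loc[i], pred.loc[i]) for i in target.index]
    return pd.Series(scores, index=target.index)

# Toy distributions: the first prediction matches its target, the second does not
toy_target = pd.DataFrame([[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]], columns=["a", "b", "c"])
toy_pred = pd.DataFrame([[0.5, 0.3, 0.2], [0.6, 0.3, 0.1]], columns=["a", "b", "c"])
jsd = sketch_get_score(toy_target, toy_pred, js_divergence)
kld = sketch_get_score(toy_target, toy_pred, kl_divergence)
```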
# target = df[target_col].copy()
# pspa_info['jsd'] =get_score_jsd(target,oof)
# pspa_info['kld'] =get_score_kld(target,oof)
# pspa_info['jsd']
# pspa_info['kld']


calculate_ce

 calculate_ce (target_series, pred_series)
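
For reference, the cross-entropy between a target distribution p and a prediction q is H(p, q) = -sum_i p_i log q_i. The sketch below computes that quantity for one pair of series; `sketch_calculate_ce` is hypothetical, and the natural logarithm and the clipping constant `eps` are assumptions, not the library's confirmed choices.

```python
import numpy as np
import pandas as pd

def sketch_calculate_ce(target_series, pred_series, eps=1e-12):
    """Hypothetical stand-in for calculate_ce: -sum(p * log(q)) in nats."""
    p = target_series.to_numpy(dtype=float)
    # Clip predictions away from 0 so log() stays finite
    q = np.clip(pred_series.to_numpy(dtype=float), eps, None)
    return float(-np.sum(p * np.log(q)))

# Toy pair: all target mass on the first entry, which the prediction gives 0.5
toy_target = pd.Series([1.0, 0.0, 0.0])
toy_pred = pd.Series([0.5, 0.25, 0.25])
ce = sketch_calculate_ce(toy_target, toy_pred)  # equals -log(0.5) = log(2)
```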
# pspa_info['ce'] =get_score_ce(target,oof)
# pspa_info['ce']
# pspa_info['nfold'] = oof['nfold']
# pspa_info.groupby('nfold').jsd.mean()

Predictor



predict_ml

 predict_ml (df, feat_col, target_col=None, model_pth='model.joblib')

Make predictions based on trained model.

Parameter   Type      Default       Details
df                                  dataframe that contains features
feat_col                            feature columns
target_col  NoneType  None
model_pth   str       model.joblib  path to the saved model file
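
A minimal sketch of loading a saved model and predicting with it is shown below. `sketch_predict_ml` is hypothetical; it assumes the model was saved with `joblib.dump` and that `target_col`, when given, only labels the output columns. The toy model and temporary file path are made up for the example.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

def sketch_predict_ml(df, feat_col, target_col=None, model_pth="model.joblib"):
    """Hypothetical stand-in for predict_ml: load a saved estimator and predict."""
    model = joblib.load(model_pth)
    pred = model.predict(df[feat_col])
    # Assumption: target_col, when given, only names the output columns
    return pd.DataFrame(pred, index=df.index, columns=target_col)

# Toy model: fit y0 = 2 * T5_0 + 1, save it, then load it back for prediction
toy = pd.DataFrame({"T5_0": [0.0, 1.0, 2.0, 3.0], "y0": [1.0, 3.0, 5.0, 7.0]})
fitted = LinearRegression().fit(toy[["T5_0"]], toy[["y0"]])

with tempfile.TemporaryDirectory() as tmp:
    pth = os.path.join(tmp, "toy_model.joblib")
    joblib.dump(fitted, pth)
    pred = sketch_predict_ml(toy, ["T5_0"], ["y0"], model_pth=pth)
```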

Uncomment the lines below to run if you have a saved model at model_pth:

# pred2 = predict_ml(X_test,feat_col, target_col, model_pth = 'model.joblib')
# pred2.head()
## or
# predict_ml(df.iloc[split0[1]], feat_col, model_pth='model.joblib')

End