# ml


## Setup

## Example Data

``` python
import pandas as pd
import seaborn as sns
```

``` python
df = sns.load_dataset("penguins").dropna(
    subset=["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g", "species"]
).reset_index(drop=True)
feat_col = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
target_df = pd.get_dummies(df["species"], prefix="species", dtype=float)
target_col = target_df.columns.tolist()
df[target_col] = target_df
df.shape
```

    (342, 10)

``` python
df[feat_col + ["species"] + target_col].head()
```

<div>

<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">bill_length_mm</th>
<th data-quarto-table-cell-role="th">bill_depth_mm</th>
<th data-quarto-table-cell-role="th">flipper_length_mm</th>
<th data-quarto-table-cell-role="th">body_mass_g</th>
<th data-quarto-table-cell-role="th">species</th>
<th data-quarto-table-cell-role="th">species_Adelie</th>
<th data-quarto-table-cell-role="th">species_Chinstrap</th>
<th data-quarto-table-cell-role="th">species_Gentoo</th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th">0</td>
<td>39.1</td>
<td>18.7</td>
<td>181.0</td>
<td>3750.0</td>
<td>Adelie</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">1</td>
<td>39.5</td>
<td>17.4</td>
<td>186.0</td>
<td>3800.0</td>
<td>Adelie</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">2</td>
<td>40.3</td>
<td>18.0</td>
<td>195.0</td>
<td>3250.0</td>
<td>Adelie</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">3</td>
<td>36.7</td>
<td>19.3</td>
<td>193.0</td>
<td>3450.0</td>
<td>Adelie</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">4</td>
<td>39.3</td>
<td>20.6</td>
<td>190.0</td>
<td>3650.0</td>
<td>Adelie</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
</tbody>
</table>

</div>

## Splitter

------------------------------------------------------------------------

<a href="https://github.com/sky1ove/kmodel/blob/main/kmodel/ml.py#L25"
target="_blank" style="float:right; font-size:smaller">source</a>

### get_splits

``` python

def get_splits(
    df:DataFrame, stratified:str | None=None, # col used for stratified sampling
    group:str | None=None, # col used to keep grouped rows together
    nfold:int=5, seed:int=123
)->list:

```

*Split samples in a dataframe with stratified, grouped, or
stratified-grouped K-fold logic.*

``` python
splits = get_splits(df, stratified="species", nfold=3)
split0 = splits[0]
len(split0[0]), len(split0[1])
```

    StratifiedKFold(n_splits=3, random_state=123, shuffle=True)
    # species in train set: 3
    # species in test set: 3

    (228, 114)

## Train/Test Split

------------------------------------------------------------------------

<a href="https://github.com/sky1ove/kmodel/blob/main/kmodel/ml.py#L61"
target="_blank" style="float:right; font-size:smaller">source</a>

### split_data

``` python

def split_data(
    df:DataFrame, # dataframe of values
    feat_col:Sequence, # feature columns
    target_col:Sequence, # target columns
    split:tuple
)->tuple:

```

*Given a split tuple, return X_train, y_train, X_test, and y_test.*

``` python
X_train, y_train, X_test, y_test = split_data(df, feat_col, target_col, split0)
X_train.shape, y_train.shape, X_test.shape, y_test.shape
```

    ((228, 4), (228, 3), (114, 4), (114, 3))
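
The shapes above follow directly from indexing the frame with the two halves of the split. A minimal sketch, assuming the split holds positional indices (as the later `df.iloc[split0[1]]` call suggests); `sketch_split_data` is a hypothetical name, not the library function:

``` python
import pandas as pd


def sketch_split_data(df, feat_col, target_col, split):
    """Hypothetical stand-in for split_data, assuming positional indices in `split`."""
    train_idx, test_idx = split
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    X_train, y_train = train[list(feat_col)], train[list(target_col)]
    X_test, y_test = test[list(feat_col)], test[list(target_col)]
    return X_train, y_train, X_test, y_test


# Toy frame with two features and one target; rows 0-3 train, rows 4-5 test.
toy = pd.DataFrame({"x1": range(6), "x2": range(6, 12), "y": [0.0, 1.0] * 3})
X_tr, y_tr, X_te, y_te = sketch_split_data(toy, ["x1", "x2"], ["y"], ([0, 1, 2, 3], [4, 5]))
shapes = (X_tr.shape, y_tr.shape, X_te.shape, y_te.shape)
```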

## Trainer

------------------------------------------------------------------------

<a href="https://github.com/sky1ove/kmodel/blob/main/kmodel/ml.py#L75"
target="_blank" style="float:right; font-size:smaller">source</a>

### train_ml

``` python

def train_ml(
    df:DataFrame, # dataframe of values
    feat_col:Sequence, # feature columns
    target_col:Sequence, # target columns
    split:tuple, model, # sklearn model instance
    save:str | pathlib.Path | None=None, # output path for joblib model
    params:dict | None=None, # kwargs forwarded to model.fit
)->tuple:

```

*Fit and predict with a sklearn model, returning validation targets and
predictions.*

``` python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
target, pred = train_ml(df, feat_col, target_col, split0, model)
pred.head()
```

<div>

<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">species_Adelie</th>
<th data-quarto-table-cell-role="th">species_Chinstrap</th>
<th data-quarto-table-cell-role="th">species_Gentoo</th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th">0</td>
<td>0.993427</td>
<td>0.137000</td>
<td>-0.130427</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">3</td>
<td>1.064457</td>
<td>0.046586</td>
<td>-0.111043</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">9</td>
<td>0.839056</td>
<td>0.118838</td>
<td>0.042105</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">11</td>
<td>0.669557</td>
<td>0.423417</td>
<td>-0.092974</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">14</td>
<td>1.050863</td>
<td>-0.073914</td>
<td>0.023052</td>
</tr>
</tbody>
</table>

</div>
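
Note that the prediction frame keeps the validation rows' original index (0, 3, 9, 11, 14 above). The fit-then-predict step could look roughly like the sketch below; `sketch_train_ml` is a hypothetical name, and saving with `joblib.dump` is an assumption based on the "joblib model" comment in the signature.

``` python
import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def sketch_train_ml(df, feat_col, target_col, split, model, save=None, params=None):
    """Hypothetical stand-in for train_ml: fit on the train rows, predict the test rows."""
    train_idx, test_idx = split
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    model.fit(train[list(feat_col)], train[list(target_col)], **(params or {}))
    if save is not None:
        joblib.dump(model, save)  # assumption: the model is persisted with joblib
    pred = pd.DataFrame(
        model.predict(test[list(feat_col)]), index=test.index, columns=list(target_col)
    )
    return test[list(target_col)], pred


# Toy data with exactly linear targets, so the fit should recover them.
x = np.arange(6.0)
toy = pd.DataFrame({"x": x, "y1": 2 * x + 1, "y2": -x})
toy_target, toy_pred = sketch_train_ml(
    toy, ["x"], ["y1", "y2"], ([0, 1, 2, 3], [4, 5]), LinearRegression()
)
```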

## Cross-Validation

------------------------------------------------------------------------

<a href="https://github.com/sky1ove/kmodel/blob/main/kmodel/ml.py#L94"
target="_blank" style="float:right; font-size:smaller">source</a>

### train_ml_cv

``` python

def train_ml_cv(
    df:DataFrame, # dataframe of values
    feat_col:Sequence, # feature columns
    target_col:Sequence, # target columns
    splits:Sequence, model, # sklearn model instance
    save:str | None=None, # model name prefix for saved folds
    params:dict | None=None, # kwargs forwarded to model.fit
)->DataFrame:

```

*Run cross-validation through the given splits.*

``` python
oof = train_ml_cv(df, feat_col, target_col, splits=splits, model=LinearRegression())
oof.head()
```

<div>

<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">species_Adelie</th>
<th data-quarto-table-cell-role="th">species_Chinstrap</th>
<th data-quarto-table-cell-role="th">species_Gentoo</th>
<th data-quarto-table-cell-role="th">nfold</th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th">0</td>
<td>0.993427</td>
<td>0.137000</td>
<td>-0.130427</td>
<td>0</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">1</td>
<td>0.790344</td>
<td>0.103762</td>
<td>0.105894</td>
<td>1</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">2</td>
<td>0.673088</td>
<td>0.317647</td>
<td>0.009265</td>
<td>2</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">3</td>
<td>1.064457</td>
<td>0.046586</td>
<td>-0.111043</td>
<td>0</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">4</td>
<td>1.122991</td>
<td>0.154406</td>
<td>-0.277398</td>
<td>1</td>
</tr>
</tbody>
</table>

</div>
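
The out-of-fold frame above has one row per sample, ordered by the original index, with `nfold` recording which fold produced each prediction. A minimal sketch of that assembly follows. `sketch_train_ml_cv` is a hypothetical name, and cloning the model for each fold is a choice made here, not something the source confirms.

``` python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression


def sketch_train_ml_cv(df, feat_col, target_col, splits, model):
    """Hypothetical stand-in for train_ml_cv: collect out-of-fold predictions."""
    folds = []
    for fold, (train_idx, test_idx) in enumerate(splits):
        train, test = df.iloc[train_idx], df.iloc[test_idx]
        # Assumption: a fresh, unfitted copy of the model is used per fold.
        fitted = clone(model).fit(train[feat_col], train[target_col])
        pred = pd.DataFrame(fitted.predict(test[feat_col]), index=test.index, columns=target_col)
        pred["nfold"] = fold
        folds.append(pred)
    # Restore the original row order so the result lines up with df.
    return pd.concat(folds).sort_index()


x = np.arange(6.0)
toy = pd.DataFrame({"x": x, "y": 2 * x + 1})
toy_cv_splits = [([1, 2, 4, 5], [0, 3]), ([0, 2, 3, 5], [1, 4]), ([0, 1, 3, 4], [2, 5])]
toy_oof = sketch_train_ml_cv(toy, ["x"], ["y"], toy_cv_splits, LinearRegression())
```

Because every row appears in exactly one test fold, the result covers the whole frame once.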

## Score

------------------------------------------------------------------------

<a href="https://github.com/sky1ove/kmodel/blob/main/kmodel/ml.py#L120"
target="_blank" style="float:right; font-size:smaller">source</a>

### post_process

``` python

def post_process(
    pred_like:pandas.DataFrame | pandas.Series, epsilon:float=1e-08
):

```

*Clip negatives and renormalize probability-like predictions.*

``` python
post_process(pred.head())
```

<div>

<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">species_Adelie</th>
<th data-quarto-table-cell-role="th">species_Chinstrap</th>
<th data-quarto-table-cell-role="th">species_Gentoo</th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th">0</td>
<td>0.878807</td>
<td>1.211930e-01</td>
<td>8.846216e-09</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">3</td>
<td>0.958070</td>
<td>4.192990e-02</td>
<td>9.000554e-09</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">9</td>
<td>0.839056</td>
<td>1.188384e-01</td>
<td>4.210543e-02</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">11</td>
<td>0.612601</td>
<td>3.873988e-01</td>
<td>9.149350e-09</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">14</td>
<td>0.978535</td>
<td>9.311731e-09</td>
<td>2.146502e-02</td>
</tr>
</tbody>
</table>

</div>
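
The output above is consistent with clipping each value at `epsilon` and then dividing by the row sum: for row 0, `0.993427 / (0.993427 + 0.137000 + 1e-08)` is about `0.878807`. A minimal sketch under that assumption, for the DataFrame case only (`sketch_post_process` is a hypothetical name; the real function may differ in detail):

``` python
import pandas as pd


def sketch_post_process(pred, epsilon=1e-8):
    """Hypothetical stand-in for post_process on a DataFrame of raw predictions."""
    clipped = pred.clip(lower=epsilon)  # negatives (and zeros) become epsilon
    return clipped.div(clipped.sum(axis=1), axis=0)  # each row now sums to 1


# Row 0 of the raw predictions shown earlier on this page.
raw = pd.DataFrame(
    {"species_Adelie": [0.993427], "species_Chinstrap": [0.137000], "species_Gentoo": [-0.130427]}
)
probs = sketch_post_process(raw)
```

Rows that are already non-negative and sum to 1 (such as row 9 above) pass through essentially unchanged.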

------------------------------------------------------------------------

<a href="https://github.com/sky1ove/kmodel/blob/main/kmodel/ml.py#L134"
target="_blank" style="float:right; font-size:smaller">source</a>

### post_process_oof

``` python

def post_process_oof(
    oof_ml:DataFrame, target_col:Sequence
)->DataFrame:

```

*Post-process prediction columns in an out-of-fold dataframe.*

``` python
oof = post_process_oof(oof, target_col)
oof[target_col].head()
```

<div>

<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">species_Adelie</th>
<th data-quarto-table-cell-role="th">species_Chinstrap</th>
<th data-quarto-table-cell-role="th">species_Gentoo</th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th">0</td>
<td>0.878807</td>
<td>0.121193</td>
<td>8.846216e-09</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">1</td>
<td>0.790344</td>
<td>0.103762</td>
<td>1.058942e-01</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">2</td>
<td>0.673088</td>
<td>0.317647</td>
<td>9.264531e-03</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">3</td>
<td>0.958070</td>
<td>0.041930</td>
<td>9.000554e-09</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">4</td>
<td>0.879124</td>
<td>0.120876</td>
<td>7.828416e-09</td>
</tr>
</tbody>
</table>

</div>
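
For an out-of-fold frame, the same clip-and-renormalize step presumably applies only to the prediction columns, leaving bookkeeping columns such as `nfold` untouched. A hedged sketch of that idea (`sketch_post_process_oof` is a hypothetical name, and the clip-at-`epsilon` behaviour is inferred from the outputs on this page):

``` python
import pandas as pd


def sketch_post_process_oof(oof, target_col, epsilon=1e-8):
    """Hypothetical stand-in for post_process_oof: renormalize only the target columns."""
    cols = list(target_col)
    out = oof.copy()  # leave the caller's frame untouched
    clipped = out[cols].clip(lower=epsilon)
    out[cols] = clipped.div(clipped.sum(axis=1), axis=0)
    return out


# Row 4 of the raw out-of-fold predictions shown earlier, including its fold id.
cols = ["species_Adelie", "species_Chinstrap", "species_Gentoo"]
raw_oof = pd.DataFrame([[1.122991, 0.154406, -0.277398, 1]], columns=cols + ["nfold"], index=[4])
clean_oof = sketch_post_process_oof(raw_oof, cols)
```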

------------------------------------------------------------------------

<a href="https://github.com/sky1ove/kmodel/blob/main/kmodel/ml.py#L141"
target="_blank" style="float:right; font-size:smaller">source</a>

### get_score

``` python

def get_score(
    target:DataFrame, pred:DataFrame, func:Callable
)->Series:

```

*Apply a row-wise score function to aligned target and prediction
frames.*
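
A row-wise scorer of this shape can be sketched as follows. The alignment rule (restricting `pred` to the index and columns of `target`) is an assumption, as is the name `sketch_get_score`; the toy scoring function is for illustration only.

``` python
import numpy as np
import pandas as pd


def sketch_get_score(target, pred, func):
    """Hypothetical stand-in for get_score: call func(target_row, pred_row) per row."""
    # Assumption: pred is aligned to target's rows and columns before scoring,
    # which also drops extra columns such as `nfold`.
    pred = pred.loc[target.index, target.columns]
    values = [func(target.loc[i], pred.loc[i]) for i in target.index]
    return pd.Series(values, index=target.index)


toy_target = pd.DataFrame({"a": [1.0, 0.0], "b": [0.0, 1.0]}, index=[10, 11])
# Rows deliberately out of order, with an extra bookkeeping column.
toy_pred = pd.DataFrame({"a": [0.5, 0.75], "b": [0.5, 0.25], "nfold": [0, 1]}, index=[11, 10])
abs_err = sketch_get_score(toy_target, toy_pred, lambda t, p: float(np.abs(t - p).sum()))
```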

------------------------------------------------------------------------

<a href="https://github.com/sky1ove/kmodel/blob/main/kmodel/ml.py#L151"
target="_blank" style="float:right; font-size:smaller">source</a>

### js_divergence_flat

``` python

def js_divergence_flat(
    target_series:pandas.Series | numpy.ndarray, pred_series:pandas.Series | numpy.ndarray, epsilon:float=1e-08
)->float:

```

*Compute Jensen-Shannon divergence between two flattened probability
vectors.*
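
For reference, the Jensen-Shannon divergence of distributions $p$ and $q$ is $\tfrac12 \mathrm{KL}(p \,\|\, m) + \tfrac12 \mathrm{KL}(q \,\|\, m)$ with $m = \tfrac12 (p + q)$. The sketch below uses natural logarithms and adds `epsilon` before renormalizing; both choices are assumptions, and `sketch_js_divergence` is a hypothetical name rather than the library function.

``` python
import numpy as np


def sketch_js_divergence(target, pred, epsilon=1e-8):
    """Hypothetical JSD between two flattened probability vectors (natural log)."""
    # Assumption: epsilon smoothing followed by renormalization avoids log(0).
    p = np.asarray(target, dtype=float).ravel() + epsilon
    q = np.asarray(pred, dtype=float).ravel() + epsilon
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log(p / m))
    kl_qm = np.sum(q * np.log(q / m))
    return float(0.5 * kl_pm + 0.5 * kl_qm)


# One-hot target against a post-processed prediction (row 0 on this page).
jsd_row0 = sketch_js_divergence([1.0, 0.0, 0.0], [0.878807, 0.121193, 8.846216e-09])
jsd_same = sketch_js_divergence([0.2, 0.8], [0.2, 0.8])
jsd_disjoint = sketch_js_divergence([1.0, 0.0], [0.0, 1.0])
```

With natural logs the divergence is bounded above by $\ln 2$, which the disjoint case approaches.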

------------------------------------------------------------------------

<a href="https://github.com/sky1ove/kmodel/blob/main/kmodel/ml.py#L166"
target="_blank" style="float:right; font-size:smaller">source</a>

### kl_divergence_flat

``` python

def kl_divergence_flat(
    target_series:pandas.Series | numpy.ndarray, pred_series:pandas.Series | numpy.ndarray, epsilon:float=1e-08
)->float:

```

*Compute KL divergence between two flattened probability vectors.*
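
The KL divergence from target $p$ to prediction $q$ is $\sum_i p_i \log (p_i / q_i)$. A minimal sketch, again assuming natural logarithms and `epsilon` smoothing (the library's exact handling of zeros is not shown in the source; `sketch_kl_divergence` is a hypothetical name):

``` python
import numpy as np


def sketch_kl_divergence(target, pred, epsilon=1e-8):
    """Hypothetical KL(target || pred) between flattened probability vectors."""
    # Assumption: add epsilon, then renormalize, so no term hits log(0) or 0/0.
    p = np.asarray(target, dtype=float).ravel() + epsilon
    q = np.asarray(pred, dtype=float).ravel() + epsilon
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))


# One-hot target against a post-processed prediction (row 0 on this page).
kld_row0 = sketch_kl_divergence([1.0, 0.0, 0.0], [0.878807, 0.121193, 8.846216e-09])
kld_same = sketch_kl_divergence([0.3, 0.7], [0.3, 0.7])
```

Unlike the Jensen-Shannon divergence, KL is asymmetric, so the argument order (target first, prediction second) matters.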

------------------------------------------------------------------------

<a href="https://github.com/sky1ove/kmodel/blob/main/kmodel/ml.py#L180"
target="_blank" style="float:right; font-size:smaller">source</a>

### calculate_ce

``` python

def calculate_ce(
    target_series:pandas.Series | numpy.ndarray, pred_series:pandas.Series | numpy.ndarray, epsilon:float=1e-08
)->float:

```

*Compute cross-entropy between two flattened probability vectors.*

``` python
target = df.loc[oof.index, target_col].copy()
pd.DataFrame({
    "jsd": get_score_jsd(target, oof),
    "kld": get_score_kld(target, oof),
    "ce": get_score_ce(target, oof),
}).head()
```

<div>

<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">jsd</th>
<th data-quarto-table-cell-role="th">kld</th>
<th data-quarto-table-cell-role="th">ce</th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th">0</td>
<td>0.043958</td>
<td>0.129190</td>
<td>0.129190</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">1</td>
<td>0.078813</td>
<td>0.235286</td>
<td>0.235287</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">2</td>
<td>0.129371</td>
<td>0.395879</td>
<td>0.395879</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">3</td>
<td>0.014756</td>
<td>0.042834</td>
<td>0.042835</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">4</td>
<td>0.043837</td>
<td>0.128829</td>
<td>0.128829</td>
</tr>
</tbody>
</table>

</div>
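
The `kld` and `ce` columns nearly coincide because cross-entropy decomposes as $H(p) + \mathrm{KL}(p \,\|\, q)$, and the entropy $H(p)$ of a one-hot target is zero. For a one-hot target, both reduce to $-\log q_k$ for the true class $k$. A minimal sketch of the cross-entropy, assuming natural logs and clipping predictions at `epsilon` (`sketch_cross_entropy` is a hypothetical name, not the library's `calculate_ce`):

``` python
import numpy as np


def sketch_cross_entropy(target, pred, epsilon=1e-8):
    """Hypothetical cross-entropy -sum(p * log(q)) between flattened vectors."""
    p = np.asarray(target, dtype=float).ravel()
    # Assumption: predictions are clipped from below so log() stays finite.
    q = np.clip(np.asarray(pred, dtype=float).ravel(), epsilon, None)
    return float(-np.sum(p * np.log(q)))


# One-hot target: the result is just -log of the probability given to the true class.
ce_row0 = sketch_cross_entropy([1.0, 0.0, 0.0], [0.878807, 0.121193, 8.846216e-09])
# Soft target: cross-entropy now exceeds KL by the target's own entropy.
soft_ce = sketch_cross_entropy([0.5, 0.5], [0.5, 0.5])
```

With soft (non one-hot) targets the two metrics differ by the target entropy, so the columns would no longer match.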

## Predictor

------------------------------------------------------------------------

<a href="https://github.com/sky1ove/kmodel/blob/main/kmodel/ml.py#L194"
target="_blank" style="float:right; font-size:smaller">source</a>

### predict_ml

``` python

def predict_ml(
    df:DataFrame, # dataframe with features
    feat_col:Sequence, # feature columns
    target_col:collections.abc.Sequence[str] | None=None, model_pth:str | pathlib.Path='model.joblib'
)->DataFrame:

```

*Predict from a saved sklearn model.*

``` python
from pathlib import Path

model_path = Path("_tmp/penguins_ml.joblib")
model_path.parent.mkdir(parents=True, exist_ok=True)
_ = train_ml(df, feat_col, target_col, split0, LinearRegression(), save=model_path)
predict_ml(df.iloc[split0[1]], feat_col, target_col, model_pth=model_path).head()
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
&#10;    .dataframe tbody tr th {
        vertical-align: top;
    }
&#10;    .dataframe thead th {
        text-align: right;
    }
</style>

<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">species_Adelie</th>
<th data-quarto-table-cell-role="th">species_Chinstrap</th>
<th data-quarto-table-cell-role="th">species_Gentoo</th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th">0</td>
<td>0.993427</td>
<td>0.137000</td>
<td>-0.130427</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">3</td>
<td>1.064457</td>
<td>0.046586</td>
<td>-0.111043</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">9</td>
<td>0.839056</td>
<td>0.118838</td>
<td>0.042105</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">11</td>
<td>0.669557</td>
<td>0.423417</td>
<td>-0.092974</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">14</td>
<td>1.050863</td>
<td>-0.073914</td>
<td>0.023052</td>
</tr>
</tbody>
</table>

</div>
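
These predictions match the raw `train_ml` output for the same rows, as expected when the same fitted model is reloaded from disk. A minimal sketch of the load-and-predict step, assuming the file was written with `joblib.dump` (`sketch_predict_ml` is a hypothetical name, not the library function):

``` python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def sketch_predict_ml(df, feat_col, target_col=None, model_pth="model.joblib"):
    """Hypothetical stand-in for predict_ml: load a joblib model and predict."""
    model = joblib.load(model_pth)
    pred = model.predict(df[list(feat_col)])
    # Keep the input index so predictions can be joined back onto df.
    return pd.DataFrame(pred, index=df.index, columns=target_col)


# Fit a toy model on exactly linear targets, save it, then reload and predict.
x = np.arange(6.0)
toy = pd.DataFrame({"x": x, "y1": 2 * x + 1, "y2": -x})
toy_model = LinearRegression().fit(toy[["x"]], toy[["y1", "y2"]])
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy.joblib"
    joblib.dump(toy_model, toy_path)
    toy_out = sketch_predict_ml(toy.iloc[[1, 4]], ["x"], ["y1", "y2"], model_pth=toy_path)
```

The output is raw model output; for probability-like targets it would still need the post-processing step described in the Score section.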
