Scoring

Scoring functions that calculate kinase scores from a substrate sequence

Overview

This module provides functions to calculate kinase scores based on substrate sequences using Position-Specific Scoring Matrices (PSSMs) from either PSPA data or kinase-substrate datasets.


Utility Functions

These helper functions prepare and parse phosphorylation site sequences.

cut_seq — Extracts a substring from a sequence based on positions relative to its center (the phosphorylation site).

cut_seq(
    input_string='AAkUuPSFSTtH',  # site sequence (15-mer or similar)
    min_position=-5,              # start position relative to center
    max_position=4,               # end position relative to center
)
# Returns: 'AkUuPSFSTt'

STY2sty — Converts uppercase S/T/Y (unphosphorylated) to lowercase s/t/y (phosphorylated) in a sequence.

STY2sty(
    input_string='AAkUuPSFSTtH',  # sequence with uppercase STY
)
# Returns: 'AAkUuPsFsttH'

get_dict — Parses a site sequence into position-annotated amino acid keys (e.g., -3P, 0s, 1Q) for PSSM lookup.

get_dict(
    input_string='PSVEPPLsQETFSDL',  # 15-mer phosphosite sequence
)
# Returns: ['-7P', '-6S', '-5V', ..., '0s', '1Q', ..., '7L']

Scoring Functions

These functions aggregate PSSM values across positions to produce a single kinase score.

sumup — Simple summation of PSSM values at each position. Used with log-odds (LO) matrices.

sumup(
    values=[0.5, 1.2, -0.3, 0.8],  # PSSM values at each position
    kinase='ATM',                  # kinase name (unused, for API consistency)
)
# Returns: sum of values

multiply — Log-space multiplication of PSSM probabilities, normalized by number of possible amino acids. Used with probability matrices.

multiply(
    values=[0.04, 0.05, 0.03],  # probability values from PSSM
    kinase='ATM',              # kinase name (unused)
    num_aa=23,                 # number of amino acids (23 for CDDM, 20 for uppercase)
)
# Returns: log2(product) + normalization factor

multiply_pspa — Like multiply, but uses a kinase-specific normalization factor from the PSPA random amino acid library.

multiply_pspa(
    values=[1, 2, 3, 4, 5],  # PSSM probability values
    kinase='PDHK1',          # kinase name (determines normalization)
)
# Returns: np.float64(22.907...)

Single Sequence Prediction

predict_kinase — Scores all kinases against a single phosphosite sequence. Returns a sorted Series of kinase scores.

predict_kinase(
    input_string='PSVEPPLsQETFSDL',  # 15-mer site sequence (lowercase = phosphosite)
    ref=Data.get_pspa(),             # reference PSSM DataFrame (kinases × positions)
    func=multiply_pspa,              # aggregation function (sumup or multiply_*)
    to_lower=False,                  # convert STY→sty before scoring
    to_upper=False,                  # convert all to uppercase before scoring
    verbose=True,                    # print which positions were used
)
# Returns: pd.Series of scores indexed by kinase, sorted descending

Batch Prediction (DataFrame)

predict_kinase_df — Scores all kinases against multiple sequences in a DataFrame. Optimized for large-scale scoring.

predict_kinase_df(
    df=Data.get_human_site().head(100),  # DataFrame containing site sequences
    seq_col='site_seq',                   # column name with sequences
    ref=Data.get_pspa(),                  # reference PSSM DataFrame
    func=multiply_pspa,                   # aggregation function
    to_lower=False,                       # convert STY→sty
    to_upper=False,                       # convert all to uppercase
)
# Returns: DataFrame (rows=input sites, cols=kinases) of scores

Percentile Scoring

These functions convert raw scores to percentiles based on a reference distribution.

get_pct — Calculates percentile rank for a single sequence against all kinases.

get_pct(
    site='PSVEPPLsQETFSDL',       # phosphosite sequence
    ref=Data.get_pspa_st(),       # reference PSSM
    func=multiply_pspa,           # scoring function
    pct_ref=Data.get_pspa_st_pct(),  # percentile reference distribution
)
# Returns: DataFrame with columns ['log2(score)', 'percentile']

get_pct_df — Converts a DataFrame of raw scores to percentile ranks (vectorized).

get_pct_df(
    score_df=out,                    # output from predict_kinase_df
    pct_ref=Data.get_pspa_st_pct(),  # reference distribution for percentiles
)
# Returns: DataFrame of percentile ranks (same shape as score_df)

Configuration Presets

Params — Returns pre-configured parameter dictionaries for different scoring modes.

Params(
    name='PSPA',  # preset name: 'CDDM', 'CDDM_upper', 'PSPA_st', 'PSPA_y', 'PSPA'
    load=True,    # whether to load the reference DataFrame immediately
)
# Returns: dict with keys 'ref', 'func', and optionally 'to_upper'/'to_lower'

# List available presets:
Params() # Returns: ['CDDM', 'CDDM_upper', 'PSPA_st', 'PSPA_y', 'PSPA']

| Preset | Reference | Function | Use Case |
|---|---|---|---|
| CDDM | Log-odds matrix | sumup | Known phosphorylated status |
| CDDM_upper | Log-odds (uppercase) | sumup | All uppercase only, unknown phosphorylated status |
| PSPA_st | PSPA S/T kinases | multiply_pspa | Known phosphorylated status; S/T only |
| PSPA_y | PSPA Y kinases | multiply_pspa | Known phosphorylated status; Y only |
| PSPA | All PSPA kinases | multiply_pspa | Known phosphorylated status |
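
Each preset is a dictionary of keyword arguments, which is why the examples on this page pass it as `**Params(name)`. The sketch below illustrates only that unpacking pattern. `params_sketch`, `score_sketch`, `sumup_sketch`, and the placeholder strings standing in for reference DataFrames are hypothetical, and the `to_upper` entry for `CDDM_upper` is an assumption based on the "all uppercase" use case; none of this is the library's implementation.

```python
def sumup_sketch(values, kinase=None):
    """Hypothetical aggregation function: plain sum of the looked-up values."""
    return sum(values)

# Hypothetical preset registry; the strings stand in for reference DataFrames.
PRESETS = {
    "CDDM": {"ref": "<log-odds matrix>", "func": sumup_sketch},
    "CDDM_upper": {
        "ref": "<uppercase log-odds matrix>",
        "func": sumup_sketch,
        "to_upper": True,  # assumption: this preset uppercases the input
    },
}

def params_sketch(name=None):
    """Return the preset names, or a copy of one preset's keyword arguments."""
    if name is None:
        return list(PRESETS)
    return dict(PRESETS[name])

def score_sketch(input_string, ref, func, to_lower=False, to_upper=False):
    """Hypothetical consumer that only reports which options it received."""
    seq = input_string.upper() if to_upper else input_string
    return {"seq": seq, "ref": ref, "func": func.__name__}

# The preset dictionary expands into ref=..., func=..., to_upper=...
result = score_sketch("PSVEPPLsQETFSDL", **params_sketch("CDDM_upper"))
```

Keeping the reference matrix and its matching aggregation function in one dictionary prevents mismatched pairs, such as a probability matrix scored with a plain sum.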

Typical Workflow

# Single sequence scoring
scores = predict_kinase(
    input_string='PSVEPPLsQETFSDL',
    **Params('PSPA'),
)

# Batch scoring with percentiles on S/T sites
df = Data.get_human_site()
score_df = predict_kinase_df(
    df=df,
    seq_col='site_seq',
    **Params('PSPA_st'),
)
pct_df = get_pct_df(
    score_df=score_df,
    pct_ref=Data.get_pspa_st_pct(),
)

Setup

Utils


cut_seq


def cut_seq(
    input_string:str, # site sequence
    min_position:int, # minimum position relative to its center
    max_position:int, # maximum position relative to its center
):

Extract sequence based on a range relative to its center position

cut_seq('AAkUuPSFSTtH',-5,4)
'AkUuPSFSTt'
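
The output above is consistent with treating index `len(input_string) // 2` as the center and including both end positions. A minimal sketch under that assumption follows; `cut_seq_sketch` is a hypothetical name, and the real function may handle edge cases differently.

```python
def cut_seq_sketch(input_string: str, min_position: int, max_position: int) -> str:
    """Slice a site sequence between two center-relative positions (inclusive)."""
    center = len(input_string) // 2          # assumed index of position 0
    start = max(center + min_position, 0)    # clamp so a large negative offset does not wrap
    end = center + max_position + 1          # +1 makes max_position inclusive
    return input_string[start:end]

print(cut_seq_sketch("AAkUuPSFSTtH", -5, 4))      # AkUuPSFSTt
print(cut_seq_sketch("PSVEPPLsQETFSDL", -5, 5))   # VEPPLsQETFS
```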

STY2sty


def STY2sty(
    input_string:str
):

Replace all ‘STY’ with ‘sty’ in a sequence

STY2sty('AAkUuPSFSTtH') # convert all capital STY to sty in a string
'AAkUuPsFsttH'
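
One way to express this replacement is a character translation table, which leaves every other character untouched. This is a sketch of the behavior shown above, not necessarily how the library implements it; `sty2sty_sketch` is a hypothetical name.

```python
# Map only uppercase S, T and Y to their lowercase forms.
_STY_TO_LOWER = str.maketrans("STY", "sty")

def sty2sty_sketch(input_string: str) -> str:
    """Lowercase every S/T/Y; all other characters are left unchanged."""
    return input_string.translate(_STY_TO_LOWER)

print(sty2sty_sketch("AAkUuPSFSTtH"))  # AAkUuPsFsttH
```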

get_dict


def get_dict(
    input_string:str, # phosphorylation site sequence
):

Convert the input string into a list of position-annotated keys for PSSM lookup. The sequence does not need a star marking the center position; make sure its length is 15 or 10.

cols = get_dict("PSVEPPLsQETFSDL")
cols
['-7P',
 '-6S',
 '-5V',
 '-4E',
 '-3P',
 '-2P',
 '-1L',
 '0s',
 '1Q',
 '2E',
 '3T',
 '4F',
 '5S',
 '6D',
 '7L']
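
The keys above can be reproduced by pairing each residue with its offset from the center. The sketch assumes the center sits at index `len(input_string) // 2`, which matches the 15-mer output shown; `get_keys_sketch` is a hypothetical name, and any special handling the real function applies (for example to padding characters) is not modeled.

```python
def get_keys_sketch(input_string: str) -> list:
    """Build '<offset><residue>' keys, with offset 0 at the assumed center."""
    center = len(input_string) // 2
    return [f"{i - center}{aa}" for i, aa in enumerate(input_string)]

keys = get_keys_sketch("PSVEPPLsQETFSDL")
print(keys[0], keys[7], keys[-1])  # -7P 0s 7L
```

These keys match the column labels of the reference matrices, so scoring reduces to a column lookup per kinase.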

Scoring func

Multiply


multiply


def multiply(
    values, # list of values, probabilities of amino acids at certain positions
    kinase:NoneType=None, # kinase name (unused)
    num_aa:int=23, # number of amino acids, 23 for standard CDDM, 20 for all uppercase CDDM
):

Multiply the probabilities of the amino acids at each position in a phosphorylation site

\[ \text{Score} = \log_2 \left( \frac{ \prod P_{\text{KinX}}(\text{AA}, \text{Position}) }{ \left( \frac{1}{\#\text{Random AA}} \right)^{\text{length(Position except 0)}} } \right) \]

The function implements the formula from Johnson et al., Nature, "An atlas of substrate specificities for the human serine/threonine kinome", Supplementary Note 2 (page 160).
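
Taking the logarithm of the ratio turns the product into a sum: the score equals the sum of log2 probabilities plus n·log2(N), where N is the number of random amino acids and n is the number of positions other than 0. The sketch below computes it that way. It assumes `values` includes the center position, so that n = `len(values) - 1`; that reading is inferred from the formula rather than confirmed, and `multiply_sketch` is a hypothetical name.

```python
import math

def multiply_sketch(values, kinase=None, num_aa=23):
    """Log2 of the probability product, normalized by a uniform background."""
    n_positions = len(values) - 1                     # assumption: position 0 is included in values
    log_product = sum(math.log2(v) for v in values)   # summing logs avoids underflow
    return log_product + n_positions * math.log2(num_aa)

score = multiply_sketch([0.04, 0.05, 0.03], num_aa=23)

# Same quantity written as the ratio in the formula above.
direct = math.log2((0.04 * 0.05 * 0.03) / (1 / 23) ** 2)
print(math.isclose(score, direct))
```

Working in log space matters for longer sequences, where a raw product of many small probabilities can underflow.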

Multiply variant that accounts for a kinase-dependent scale factor


multiply_pspa


def multiply_pspa(
    values, kinase, num_aa_dict:NoneType=None
):

Multiply values using a kinase-specific scale factor: the number of random amino acids in that kinase's PSPA library.

multiply_pspa(values=[1,2,3,4,5],kinase='PDHK1')
np.float64(22.906890595608516)
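
The printed value can be checked by hand. log2(1·2·3·4·5) is about 6.907, and the remaining 16 equals 4·log2(16), which is consistent with four non-center positions and a normalization factor of 16 for PDHK1. That factor is inferred from the output above, not read from the library's lookup table, and `multiply_pspa_sketch` and `num_aa_dict_sketch` are hypothetical names.

```python
import math

# Inferred from the printed result above; not taken from the library's data.
num_aa_dict_sketch = {"PDHK1": 16}

def multiply_pspa_sketch(values, kinase, num_aa_dict=None):
    """Log2 product of values plus a kinase-specific normalization term."""
    table = num_aa_dict_sketch if num_aa_dict is None else num_aa_dict
    num_aa = table[kinase]
    log_product = sum(math.log2(v) for v in values)
    return log_product + (len(values) - 1) * math.log2(num_aa)

score = multiply_pspa_sketch([1, 2, 3, 4, 5], "PDHK1")
print(round(score, 3))  # 22.907
```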

Sum


sumup


def sumup(
    values, # list of values (e.g., log-odds) of amino acids at certain positions
    kinase:NoneType=None # kinase name (unused)
):

Sum up the values of the amino acids at each position in a phosphorylation site sequence
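
With a log-odds matrix each entry is already a logarithm, so adding entries corresponds to multiplying the underlying odds ratios. The sketch below shows that equivalence with made-up odds values; `sumup_sketch` is a hypothetical name that mirrors the `(values, kinase)` interface shared by the aggregation functions.

```python
import math

def sumup_sketch(values, kinase=None):
    """Sum per-position values; kinase is accepted only for a uniform interface."""
    return sum(values)

odds = [2.0, 0.5, 4.0]                    # illustrative odds ratios, not real data
log_odds = [math.log2(o) for o in odds]   # what a log-odds matrix would store

total = sumup_sketch(log_odds, kinase="ATM")
print(total == math.log2(math.prod(odds)))  # sum of logs equals log of product
```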

Predict kinase


duplicate_ref_zero


def duplicate_ref_zero(
    df:DataFrame
)->DataFrame:

If ‘0S’, ‘0T’, ‘0Y’ exist with non-zero values, create ‘0s’, ‘0t’, ‘0y’ with same values. If ‘0s’, ‘0t’, ‘0y’ exist with non-zero values, create ‘0S’, ‘0T’, ‘0Y’ with same values.
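
A sketch of the mirroring rule described above, using pandas. The phrase "with non-zero values" is read here as "the column contains at least one non-zero entry", and a column is only created when its counterpart is missing or all zero; both readings are assumptions. `duplicate_ref_zero_sketch` is a hypothetical name.

```python
import pandas as pd

def duplicate_ref_zero_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Mirror center-position columns between upper and lower case."""
    out = df.copy()
    for upper, lower in [("0S", "0s"), ("0T", "0t"), ("0Y", "0y")]:
        for src, dst in [(upper, lower), (lower, upper)]:
            src_has_values = src in out.columns and (out[src] != 0).any()
            dst_is_empty = dst not in out.columns or (out[dst] == 0).all()
            if src_has_values and dst_is_empty:
                out[dst] = out[src]
    return out

toy = pd.DataFrame(
    {"-1P": [0.1, 0.2], "0S": [0.5, 0.2], "0T": [0.0, 0.0]},
    index=["KIN_A", "KIN_B"],  # hypothetical kinase names
)
mirrored = duplicate_ref_zero_sketch(toy)
print(list(mirrored.columns))
```

Mirroring lets the same reference matrix score sequences whether the center residue is written as `S` or `s`.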


preprocess_ref


def preprocess_ref(
    ref
):

Convert pS/T/Y in ref columns to s/t/y if any; mirror 0S/T/Y to 0s/t/y.
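
The renaming half of this step can be sketched on a plain list of column labels. It assumes phosphorylated residues appear as a `pS`, `pT`, or `pY` suffix after the position (for example `-2pS`), which is an assumption about the label format; `normalize_phospho_columns` is a hypothetical helper.

```python
def normalize_phospho_columns(columns):
    """Rewrite labels ending in pS/pT/pY so they end in s/t/y instead."""
    mapping = {"pS": "s", "pT": "t", "pY": "y"}
    renamed = []
    for col in columns:
        for old, new in mapping.items():
            if col.endswith(old):
                col = col[: -len(old)] + new
                break
        renamed.append(col)
    return renamed

print(normalize_phospho_columns(["-3P", "-2pS", "0S", "1pT", "2pY"]))
# ['-3P', '-2s', '0S', '1t', '2y']
```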


predict_kinase


def predict_kinase(
    input_string:str, # site sequence
    ref:DataFrame, # reference dataframe for scoring
    func:Callable, # function to calculate score
    to_lower:bool=False, # convert capital STY to lower case
    to_upper:bool=False, # convert all letters to uppercase
    verbose:bool=True
):

Predict kinase given a phosphorylation site sequence
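
Conceptually, scoring one sequence means turning it into position keys, keeping the keys that exist as columns in the reference, looking up each kinase's values, and aggregating them. The sketch below follows those steps on a toy reference; the kinase names `KIN_A` and `KIN_B`, the values, and `predict_kinase_sketch` itself are hypothetical, and the real function's preprocessing (`to_lower`, `to_upper`, reference cleanup) is omitted.

```python
import pandas as pd

def predict_kinase_sketch(input_string, ref, func):
    """Score every kinase (row of ref) against one site sequence."""
    center = len(input_string) // 2
    keys = [f"{i - center}{aa}" for i, aa in enumerate(input_string)]
    used = [k for k in keys if k in ref.columns]  # positions the reference covers
    scores = {
        kinase: func(row[used].tolist(), kinase)
        for kinase, row in ref.iterrows()
    }
    return pd.Series(scores).sort_values(ascending=False)

toy_ref = pd.DataFrame(
    {"-1L": [1.0, -1.0], "0s": [0.5, 0.5], "1Q": [2.0, 0.25]},
    index=["KIN_A", "KIN_B"],
)
scores = predict_kinase_sketch("PLsQE", toy_ref, lambda values, kinase: sum(values))
print(scores)
```

Keys without a matching column (`-2P` and `2E` here) are skipped, which mirrors how the "considering string" output lists only the positions that were used.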

PSPA scoring:

pspa_ref = Data.get_pspa()
predict_kinase("PSVEPPLsQETFSDL",ref=pspa_ref,func=multiply_pspa)
considering string: ['-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S']
kinase
ATM       5.037
SMG1      4.385
DNAPK     3.818
ATR       3.507
FAM20C    3.170
          ...  
PKN1     -7.275
P70S6K   -7.295
AKT3     -7.375
PKCI     -7.742
NEK3     -8.254
Length: 303, dtype: float64

CDDM scoring, LO + sum

ref=Data.get_cddm_LO() # Data.get_cddm_LO_upper()
predict_kinase("PSVEPPLsQETFSDL",ref=ref,func=sumup)
considering string: ['-7P', '-6S', '-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S', '6D', '7L']
ATR        12.751
ATM        10.960
DNAPK       6.039
SRPK2       2.079
SMMLCK      1.876
           ...   
ROR1      -89.216
CDC7      -91.457
CAMK1B    -91.577
TNNI3K   -118.835
BRAF     -134.851
Length: 328, dtype: float64

CDDM scoring, PSSM + multiply (#23aa)

ref=Data.get_cddm() # Data.get_cddm_upper()
predict_kinase("PSVEPPLsQETFSDL",ref=ref,func=multiply_23)
considering string: ['-7P', '-6S', '-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S', '6D', '7L']
ATR        16.824
ATM        15.033
DNAPK      10.112
SRPK2       6.152
SMMLCK      5.949
           ...   
ROR1      -85.143
CDC7      -87.384
CAMK1B    -87.503
TNNI3K   -114.762
BRAF     -130.778
Length: 328, dtype: float64

CDDM scoring, PSSM + multiply (#20aa)

ref=Data.get_cddm_upper()
predict_kinase("PSVEPPLsQETFSDL",ref=ref,func=multiply_20)
considering string: ['-7P', '-6S', '-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S', '6D', '7L']
ATR        16.587
ATM        14.362
DNAPK      10.430
SRPK2       8.044
CHK2        7.955
           ...   
TTK       -43.375
GAK       -45.159
CAMK1B    -69.395
TNNI3K    -70.993
BRAF     -109.130
Length: 328, dtype: float64

Params

Here we provide different PSSM settings, derived from either PSPA data or kinase-substrate datasets, for kinase prediction:


Params


def Params(
    name:NoneType=None, load:bool=True
):

Params()
['CDDM', 'CDDM_upper', 'PSPA_st', 'PSPA_y', 'PSPA']

for p in ['PSPA', 'CDDM','CDDM_upper']:
    print(predict_kinase("PSVEPPLsQETFSDL",**Params(p)).head())
considering string: ['-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S']
kinase
ATM       5.037
SMG1      4.385
DNAPK     3.818
ATR       3.507
FAM20C    3.170
dtype: float64
considering string: ['-7P', '-6S', '-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S', '6D', '7L']
ATR       12.751
ATM       10.960
DNAPK      6.039
SRPK2      2.079
SMMLCK     1.876
dtype: float64
considering string: ['-7P', '-6S', '-5V', '-4E', '-3P', '-2P', '-1L', '0S', '1Q', '2E', '3T', '4F', '5S', '6D', '7L']
ATR      11.815
ATM       9.590
DNAPK     5.659
SRPK2     3.272
CHK2      3.183
dtype: float64

Predict kinase in df


multiply_generic


def multiply_generic(
    merged_df, kinases, df_index, divide_factor_func
):

Multiply-based log-sum aggregation across kinases.


predict_kinase_df


def predict_kinase_df(
    df, seq_col, ref, func, to_lower:bool=False, to_upper:bool=False
):

Predict kinase scores based on a reference PSSM or weight matrix. Applies preprocessing, merges long-format keys, then aggregates using the given func.
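
The long-format approach can be sketched as follows: expand every sequence into one row per position key, melt the reference into (kinase, key, value) rows, merge on the key, and aggregate per site and kinase. Only sum aggregation is shown; a multiply-type aggregation would presumably sum log2 values and add a normalization term. All names and values here are hypothetical, and this is not the library's optimized implementation.

```python
import pandas as pd

def expand_keys(seq):
    """Position-annotated keys, assuming the center is at len(seq) // 2."""
    center = len(seq) // 2
    return [f"{i - center}{aa}" for i, aa in enumerate(seq)]

def predict_kinase_df_sum_sketch(df, seq_col, ref):
    # Long table of (site, key): one row per residue of every sequence.
    rows = [(site, key) for site, seq in df[seq_col].items() for key in expand_keys(seq)]
    sites_long = pd.DataFrame(rows, columns=["site", "key"])
    # Long table of (kinase, key, value) from the wide reference matrix.
    ref_long = (
        ref.rename_axis("kinase")
        .reset_index()
        .melt(id_vars="kinase", var_name="key", value_name="value")
    )
    merged = sites_long.merge(ref_long, on="key", how="inner")
    scores = merged.pivot_table(index="site", columns="kinase", values="value", aggfunc="sum")
    return scores.reindex(df.index)  # rows in the input order

toy_ref = pd.DataFrame(
    {"-1L": [1.0, -1.0], "0s": [0.5, 0.5], "1Q": [2.0, 0.25],
     "-1P": [0.0, 1.0], "0t": [0.5, 0.5], "1E": [-1.0, 3.0]},
    index=["KIN_A", "KIN_B"],
)
sites = pd.DataFrame({"site_seq": ["LsQ", "PtE"]})
out = predict_kinase_df_sum_sketch(sites, "site_seq", toy_ref)
print(out)
```

A single merge replaces a Python loop over sequences, which is where the batch speedup would come from.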

df=Data.get_human_site()
# for p in ['CDDM', 'CDDM_upper', 'PSPA_st', 'PSPA_y', 'PSPA']:
#     out = predict_kinase_df(df.head(10), seq_col='site_seq', **Params(p))
#     print(out.head())
out = predict_kinase_df(df.head(100), seq_col='site_seq', **Params('PSPA'))
Input dataframe has 100 rows
Preprocessing...
Preprocessing done. Expanding sequences...
Merging reference...
Merge complete.
Computing multiply_generic: 100%|██████████| 396/396 [00:00<00:00, 2108.42it/s]

Percentile scoring


get_pct


def get_pct(
    site, ref, func, pct_ref
):

Replicate the percentile results from The Kinase Library.
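
A percentile here places one score within a reference distribution of scores for the same kinase. The sketch below uses the share of reference scores less than or equal to the query score; that rank convention is an assumption, since the exact definition used by the library is not stated, and `percentile_rank_sketch` and the reference values are hypothetical.

```python
from bisect import bisect_right

def percentile_rank_sketch(score, reference_scores):
    """Percent of reference scores that are <= score (assumed convention)."""
    ordered = sorted(reference_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)

reference = [-3.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0]  # illustrative values only
pct = percentile_rank_sketch(2.5, reference)
print(pct)  # 62.5
```

Percentiles make kinases comparable: the ordering by raw score and by percentile can differ, as the outputs below show.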

st_pct = Data.get_pspa_st_pct()
y_pct = Data.get_pspa_tyr_pct()
out = get_pct('PSVEPPLyQETFSDL',**Params('PSPA_y'), pct_ref=y_pct)
out.sort_values('percentile',ascending=False)
considering string: ['-5V', '-4E', '-3P', '-2P', '-1L', '0Y', '1Q', '2E', '3T', '4F', '5S']
            log2(score)  percentile
ABL2              3.137   96.568694
BMX               2.816   96.117567
BTK               1.956   95.693780
CSK               2.303   95.174299
MERTK             2.509   93.588517
...                 ...         ...
FLT1             -1.919   25.358852
PINK1_TYR        -1.227   21.927546
MUSK             -3.031   21.298701
TNNI3K_TYR       -3.549   11.004785
PKMYT1_TYR       -1.739    4.798360

93 rows × 2 columns

get_pct('PSVEPPLsQETFSDL',**Params('PSPA_st'), pct_ref=st_pct)
considering string: ['-5V', '-4E', '-3P', '-2P', '-1L', '0S', '1Q', '2E', '3T', '4F']
            log2(score)  percentile
ATM               5.037   99.822351
SMG1              4.385   99.831819
DNAPK             3.818   99.205315
ATR               3.507   99.680344
FAM20C            3.170   95.370556
...                 ...         ...
PKN1             -7.275   14.070436
P70S6K           -7.295    4.089816
AKT3             -7.375   11.432995
PKCI             -7.742    8.129511
NEK3             -8.254    4.637240

303 rows × 2 columns


get_pct_df


def get_pct_df(
    score_df, # output from predict_kinase_df
    pct_ref, # a reference df for percentile calculation
):

Replicate the percentile results from The Kinase Library.
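
For a whole score table, the same ranking can be done one kinase column at a time with a sorted reference and a binary search. The sketch assumes `pct_ref` holds one column of reference scores per kinase and uses a "less than or equal" rank; both are assumptions about the data layout and convention, and `get_pct_df_sketch` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def get_pct_df_sketch(score_df, pct_ref):
    """Convert each kinase column of raw scores to percentile ranks."""
    result = {}
    for kinase in score_df.columns:
        ordered = np.sort(pct_ref[kinase].to_numpy())
        # Count of reference scores <= each query score, for the whole column at once.
        ranks = np.searchsorted(ordered, score_df[kinase].to_numpy(), side="right")
        result[kinase] = 100.0 * ranks / len(ordered)
    return pd.DataFrame(result, index=score_df.index)

toy_scores = pd.DataFrame({"KIN_A": [2.5, -5.0]})  # hypothetical kinase and scores
toy_ref = pd.DataFrame({"KIN_A": [-3.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0]})
pct_df = get_pct_df_sketch(toy_scores, toy_ref)
print(pct_df)
```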

# compute the substrate scores first
score_df = predict_kinase_df(df_sty,'site_seq', **Params('PSPA_st'))

# get the percentile reference
pct_ref = Data.get_pspa_st_pct()

# calculate percentile score
pct = get_pct_df(score_df,pct_ref)

End