Scoring

Scoring functions to calculate kinase score based on substrate sequence

Setup

Utils


cut_seq


def cut_seq(
    input_string:str, # site sequence
    min_position:int, # minimum position relative to its center
    max_position:int, # maximum position relative to its center
):

Extract sequence based on a range relative to its center position

cut_seq('AAkUuPSFSTtH',-5,4)
'AkUuPSFSTt'
STY2sty('AAkUuPSFSTtH') # convert all capital STY to sty in a string
'AAkUuPsFsttH'
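
A minimal sketch of how these two helpers could behave, assuming the center is the character at index `len(input_string) // 2`. That assumption reproduces the outputs above, but the library's own implementation may differ. The names `cut_seq_sketch` and `sty_to_lower_sketch` are hypothetical.

```python
def cut_seq_sketch(input_string: str, min_position: int, max_position: int) -> str:
    """Slice a site sequence by positions relative to its center (assumed at len // 2)."""
    center = len(input_string) // 2
    start = max(center + min_position, 0)  # clamp so a negative index never wraps around
    end = center + max_position + 1        # +1 because the upper bound is inclusive
    return input_string[start:end]


def sty_to_lower_sketch(input_string: str) -> str:
    """Lowercase only capital S, T and Y; leave every other character unchanged."""
    return input_string.translate(str.maketrans("STY", "sty"))


print(cut_seq_sketch("AAkUuPSFSTtH", -5, 4))   # AkUuPSFSTt
print(sty_to_lower_sketch("AAkUuPSFSTtH"))     # AAkUuPsFsttH
```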

get_dict


def get_dict(
    input_string:str, # phosphorylation site sequence
):

Convert a site sequence into a list of position-residue keys such as '-7P' or '0s'. The sequence does not need a star marking the central residue, but it must be 15 or 10 characters long.

cols = get_dict("PSVEPPLsQETFSDL")
cols
['-7P',
 '-6S',
 '-5V',
 '-4E',
 '-3P',
 '-2P',
 '-1L',
 '0s',
 '1Q',
 '2E',
 '3T',
 '4F',
 '5S',
 '6D',
 '7L']
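
The key format above can be sketched as follows, assuming position 0 is the character at index `len(input_string) // 2`. The name `position_keys_sketch` is hypothetical, and this is an illustration of the output format rather than the library's code.

```python
def position_keys_sketch(input_string: str) -> list:
    """Label every residue with its offset from the assumed center position."""
    center = len(input_string) // 2
    return [f"{i - center}{aa}" for i, aa in enumerate(input_string)]


keys = position_keys_sketch("PSVEPPLsQETFSDL")
print(keys[0], keys[7], keys[-1])  # -7P 0s 7L
```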

Scoring func

Multiply


multiply


def multiply(
    values, # list of values, probabilities of amino acids at certain positions
    kinase:NoneType=None,
    num_aa:int=23, # number of amino acids, 23 for standard CDDM, 20 for all uppercase CDDM
):

Multiply the probabilities of the amino acids at each position in a phosphorylation site

\[ \text{Score} = \log_2 \left( \frac{ \prod P_{\text{KinX}}(\text{AA}, \text{Position}) }{ \left( \frac{1}{\#\text{Random AA}} \right)^{\text{length(Position except 0)}} } \right) \]

The function implements the formula from Johnson et al., Nature, "An atlas of substrate specificities for the human serine/threonine kinome", Supplementary Note 2 (page 160).
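
As a rough illustration of the formula, the sketch below works in log space: the log of the product becomes a sum of logs, and dividing by the random-chance term becomes adding `n * log2(num_aa)`. It assumes `values` includes the position-0 entry, so the number of flanking positions is `len(values) - 1`, and that all values are positive. The name `multiply_sketch` is hypothetical.

```python
import math


def multiply_sketch(values, num_aa=23):
    """log2( prod(values) / (1/num_aa)**n ), with n = number of non-central positions."""
    n_flanking = len(values) - 1  # assumption: exactly one entry belongs to position 0
    log_product = sum(math.log2(v) for v in values)
    return log_product + n_flanking * math.log2(num_aa)


# Four flanking positions at random-chance frequency plus a central value of 1:
# the score is (numerically close to) zero, i.e. no better than random.
print(multiply_sketch([1 / 23] * 4 + [1.0], num_aa=23))
```

Summing logarithms instead of multiplying raw values avoids underflow when many small probabilities are combined.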

Multiply with a scale factor that depends on the kinase


multiply_pspa


def multiply_pspa(
    values, kinase, num_aa_dict:NoneType=None
):

Multiply values while accounting for a scale factor that varies by kinase: the number of random amino acids in PSPA.

multiply_pspa(values=[1,2,3,4,5],kinase='PDHK1')
np.float64(22.906890595608516)
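
The output above is consistent with a scale factor of 16 applied to four non-central positions: log2(1·2·3·4·5) + 4·log2(16) ≈ 22.9069. The sketch below reproduces that arithmetic. The value 16 for PDHK1 is inferred from this single output, not read from the library, and `NUM_AA_SKETCH` and `multiply_pspa_sketch` are hypothetical names.

```python
import math

# Hypothetical lookup table; 16 is inferred from the documented output only.
NUM_AA_SKETCH = {"PDHK1": 16}


def multiply_pspa_sketch(values, kinase, num_aa_dict=NUM_AA_SKETCH):
    """Same log-space formula as multiply, but the scale factor is looked up per kinase."""
    num_aa = num_aa_dict[kinase]
    n_flanking = len(values) - 1  # assumption: one entry belongs to position 0
    return sum(math.log2(v) for v in values) + n_flanking * math.log2(num_aa)


print(multiply_pspa_sketch([1, 2, 3, 4, 5], "PDHK1"))  # approximately 22.9069
```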

Sum


sumup


def sumup(
    values, # list of per-position values of amino acids, for example log-odds scores
    kinase:NoneType=None
):

Sum up the values of the amino acids at each position in a phosphorylation site sequence
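
A small sketch of why summing works for log-odds input: the sum of per-position log2(p / background) equals the log2 of the product of the ratios. The name `sumup_sketch` and the toy numbers are hypothetical.

```python
import math


def sumup_sketch(values):
    """Add per-position scores; meaningful when the values are already in log space."""
    return sum(values)


probs = [0.2, 0.1, 0.05]          # toy per-position probabilities
background = [0.05, 0.05, 0.05]   # toy background frequencies
log_odds = [math.log2(p / b) for p, b in zip(probs, background)]

# log2(4) + log2(2) + log2(1), which is close to 3
print(sumup_sketch(log_odds))
```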

Predict kinase


duplicate_ref_zero


def duplicate_ref_zero(
    df:DataFrame
)->DataFrame:

If ‘0S’, ‘0T’, ‘0Y’ exist with non-zero values, create ‘0s’, ‘0t’, ‘0y’ with same values. If ‘0s’, ‘0t’, ‘0y’ exist with non-zero values, create ‘0S’, ‘0T’, ‘0Y’ with same values.
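
The mirroring rule can be sketched on a plain dictionary of columns standing in for the DataFrame. This sketch assumes a column is only created when its counterpart is missing; how the library treats the case where both already exist is not stated here. The name `duplicate_ref_zero_sketch` is hypothetical.

```python
def duplicate_ref_zero_sketch(columns: dict) -> dict:
    """Mirror non-zero position-0 columns between upper and lower case (dict stand-in)."""
    out = {name: list(vals) for name, vals in columns.items()}
    for upper, lower in (("0S", "0s"), ("0T", "0t"), ("0Y", "0y")):
        for src, dst in ((upper, lower), (lower, upper)):
            has_signal = src in columns and any(v != 0 for v in columns[src])
            if has_signal and dst not in columns:  # assumption: never overwrite
                out[dst] = list(columns[src])
    return out


toy = {"-1P": [0.1, 0.3], "0S": [0.5, 0.2], "0T": [0.0, 0.0]}
print(sorted(duplicate_ref_zero_sketch(toy)))  # ['-1P', '0S', '0T', '0s']
```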


preprocess_ref


def preprocess_ref(
    ref
):

Convert pS/T/Y in ref columns to s/t/y if any; mirror 0S/T/Y to 0s/t/y.


predict_kinase


def predict_kinase(
    input_string:str, # site sequence
    ref:DataFrame, # reference dataframe for scoring
    func:Callable, # function to calculate score
    to_lower:bool=False, # convert capital STY to lower case
    to_upper:bool=False, # convert all letters to uppercase
    verbose:bool=True
):

Predict kinase given a phosphorylation site sequence
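
The overall flow can be sketched with a toy reference: turn the sequence into position-residue keys, look up each key per kinase, score, and rank. This is an illustration under several assumptions: the reference is a nested dict rather than a DataFrame, keys missing from the reference are skipped, and the score is the log-space multiply formula. `predict_kinase_sketch`, `KIN_A` and `KIN_B` are hypothetical names.

```python
import math


def predict_kinase_sketch(input_string, ref, num_aa=23):
    """Score one site against every kinase in a toy reference and rank the results."""
    center = len(input_string) // 2
    keys = [f"{i - center}{aa}" for i, aa in enumerate(input_string)]
    scores = {}
    for kinase, weights in ref.items():
        # keep only the positions that this reference actually covers
        values = [weights[k] for k in keys if k in weights]
        scores[kinase] = (sum(math.log2(v) for v in values)
                          + (len(values) - 1) * math.log2(num_aa))
    # highest score first
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


toy_ref = {
    "KIN_A": {"-1L": 0.5, "0s": 1.0, "1Q": 0.5},
    "KIN_B": {"-1L": 0.01, "0s": 1.0, "1Q": 0.02},
}
ranking = predict_kinase_sketch("LsQ", toy_ref, num_aa=20)
print(list(ranking))  # ['KIN_A', 'KIN_B']
```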

PSPA scoring:

pspa_ref = Data.pspa()
predict_kinase("PSVEPPLsQETFSDL",ref=pspa_ref,func=multiply_pspa)
considering string: ['-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S']
kinase
ATM       5.037
SMG1      4.385
DNAPK     3.818
ATR       3.507
FAM20C    3.170
          ...  
PKN1     -7.275
P70S6K   -7.295
AKT3     -7.375
PKCI     -7.742
NEK3     -8.254
Length: 303, dtype: float64

CDDM scoring, LO + sum

ref=Data.cddm_LO() # Data.cddm_LO_upper()
predict_kinase("PSVEPPLsQETFSDL",ref=ref,func=sumup)
considering string: ['-7P', '-6S', '-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S', '6D', '7L']
ATR        12.751
ATM        10.960
DNAPK       6.039
SRPK2       2.079
SMMLCK      1.876
           ...   
ROR1      -89.216
CDC7      -91.457
CAMK1B    -91.577
TNNI3K   -118.835
BRAF     -134.851
Length: 328, dtype: float64

CDDM scoring, PSSM + multiply (#23aa)

ref=Data.cddm() # Data.cddm_upper()
# multiply_23 is not defined on this page; presumably multiply with num_aa=23
predict_kinase("PSVEPPLsQETFSDL",ref=ref,func=multiply_23)
considering string: ['-7P', '-6S', '-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S', '6D', '7L']
ATR        16.824
ATM        15.033
DNAPK      10.112
SRPK2       6.152
SMMLCK      5.949
           ...   
ROR1      -85.143
CDC7      -87.384
CAMK1B    -87.503
TNNI3K   -114.762
BRAF     -130.778
Length: 328, dtype: float64

CDDM scoring, PSSM + multiply (#20aa)

ref=Data.cddm_upper()
# multiply_20 is not defined on this page; presumably multiply with num_aa=20
predict_kinase("PSVEPPLsQETFSDL",ref=ref,func=multiply_20)
considering string: ['-7P', '-6S', '-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S', '6D', '7L']
ATR        16.587
ATM        14.362
DNAPK      10.430
SRPK2       8.044
CHK2        7.955
           ...   
TTK       -43.375
GAK       -45.159
CAMK1B    -69.395
TNNI3K    -70.993
BRAF     -109.130
Length: 328, dtype: float64

Params

Here we provide different PSSM settings, derived from either PSPA data or the kinase-substrate dataset, for kinase prediction:


Params


def Params(
    name:NoneType=None, load:bool=True
):

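Called without arguments, Params() returns the list of available setting names; called with a name, it returns the keyword arguments for predict_kinase that belong to that setting.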

Params()
['CDDM', 'CDDM_upper', 'PSPA_st', 'PSPA_y', 'PSPA']
for p in ['PSPA', 'CDDM','CDDM_upper']:
    print(predict_kinase("PSVEPPLsQETFSDL",**Params(p)).head())
considering string: ['-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S']
kinase
ATM       5.037
SMG1      4.385
DNAPK     3.818
ATR       3.507
FAM20C    3.170
dtype: float64
considering string: ['-7P', '-6S', '-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S', '6D', '7L']
ATR       12.751
ATM       10.960
DNAPK      6.039
SRPK2      2.079
SMMLCK     1.876
dtype: float64
considering string: ['-7P', '-6S', '-5V', '-4E', '-3P', '-2P', '-1L', '0S', '1Q', '2E', '3T', '4F', '5S', '6D', '7L']
ATR      11.815
ATM       9.590
DNAPK     5.659
SRPK2     3.272
CHK2      3.183
dtype: float64
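
The pattern used above, a named setting that unpacks into keyword arguments, can be sketched with a small registry. Everything here is hypothetical (`params_sketch`, `_SETTINGS_SKETCH`, the toy entries and `describe`); it only illustrates the calling convention, not the library's actual settings.

```python
def sum_scores(values):
    """Toy scoring function used as the 'func' entry of a setting."""
    return sum(values)


# Hypothetical registry: each name maps to keyword arguments for a predictor.
_SETTINGS_SKETCH = {
    "toy_sum": {"ref": {"KIN_A": {"0s": 1.0}}, "func": sum_scores},
    "toy_sum_upper": {"ref": {"KIN_A": {"0S": 1.0}}, "func": sum_scores, "to_upper": True},
}


def params_sketch(name=None):
    """Return all setting names, or a copy of the kwargs for one setting."""
    if name is None:
        return list(_SETTINGS_SKETCH)
    return dict(_SETTINGS_SKETCH[name])  # copy so callers cannot mutate the registry


def describe(ref, func, to_upper=False):
    """Stand-in for a predictor; it only reports which arguments it received."""
    return f"{len(ref)} kinase(s), func={func.__name__}, to_upper={to_upper}"


print(params_sketch())
print(describe(**params_sketch("toy_sum_upper")))
```

Bundling the reference, the scoring function and the case handling under one name keeps the three from being mixed up across settings.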

Predict kinase in df


multiply_generic


def multiply_generic(
    merged_df, kinases, df_index, divide_factor_func
):

Multiply-based log-sum aggregation across kinases.


predict_kinase_df


def predict_kinase_df(
    df, seq_col, ref, func, to_lower:bool=False, to_upper:bool=False
):

Predict kinase scores based on reference PSSM or weight matrix. Applies preprocessing, merges long format keys, then aggregates using given func.

df=Data.human_site()
# for p in ['CDDM', 'CDDM_upper', 'PSPA_st', 'PSPA_y', 'PSPA']:
#     out = predict_kinase_df2(df.head(10), seq_col='site_seq', **Params(p))
#     print(out.head())
out = predict_kinase_df(df.head(100), seq_col='site_seq', **Params('PSPA'))
Input dataframe has 100 rows
Preprocessing...
Preprocessing done. Expanding sequences...
Merging reference...
Merge complete.

Computing multiply_generic: 100%|██████████| 396/396 [00:00<00:00, 1816.26it/s]

Percentile scoring


get_pct


def get_pct(
    site, ref, func, pct_ref
):

Replicate the percentile results from The Kinase Library.

st_pct = Data.pspa_st_pct()
y_pct = Data.pspa_tyr_pct()
out = get_pct('PSVEPPLyQETFSDL',**Params('PSPA_y'), pct_ref=y_pct)
out.sort_values('percentile',ascending=False)
considering string: ['-5V', '-4E', '-3P', '-2P', '-1L', '0Y', '1Q', '2E', '3T', '4F', '5S']
            log2(score)  percentile
ABL2              3.137   96.568694
BMX               2.816   96.117567
BTK               1.956   95.693780
CSK               2.303   95.174299
MERTK             2.509   93.588517
...                 ...         ...
FLT1             -1.919   25.358852
PINK1_TYR        -1.227   21.927546
MUSK             -3.031   21.298701
TNNI3K_TYR       -3.549   11.004785
PKMYT1_TYR       -1.739    4.798360

93 rows × 2 columns

get_pct('PSVEPPLsQETFSDL',**Params('PSPA_st'), pct_ref=st_pct)
considering string: ['-5V', '-4E', '-3P', '-2P', '-1L', '0S', '1Q', '2E', '3T', '4F']
        log2(score)  percentile
ATM           5.037   99.822351
SMG1          4.385   99.831819
DNAPK         3.818   99.205315
ATR           3.507   99.680344
FAM20C        3.170   95.370556
...             ...         ...
PKN1         -7.275   14.070436
P70S6K       -7.295    4.089816
AKT3         -7.375   11.432995
PKCI         -7.742    8.129511
NEK3         -8.254    4.637240

303 rows × 2 columns
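
As a rough sketch of percentile scoring, the function below assumes the percentile is the share of reference scores that are less than or equal to the query score. How `pct_ref` stores the reference distribution is not shown on this page, so a plain list per kinase is used as a stand-in, and `percentile_sketch` is a hypothetical name.

```python
from bisect import bisect_right


def percentile_sketch(score, reference_scores):
    """Percentage of reference scores that do not exceed the query score."""
    ordered = sorted(reference_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)


# Toy reference distribution for one kinase
reference = [-3.0, -1.5, 0.0, 0.7, 2.1, 4.4, 5.0, 5.5]
print(percentile_sketch(4.4, reference))  # 75.0
```

Because each kinase has its own reference distribution, a lower raw score can still rank at a higher percentile, as with SMG1 and ATM above.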


get_pct_df


def get_pct_df(
    score_df, # output from predict_kinase_df
    pct_ref, # a reference df for percentile calculation
):

Replicate the percentile results from The Kinase Library.

# compute substrate scores first (df_sty is a dataframe with a 'site_seq' column; it is not defined on this page)
score_df = predict_kinase_df(df_sty,'site_seq', **Params('PSPA_st'))

# get the percentile reference
pct_ref = Data.pspa_st_pct()

# calculate percentile scores
pct = get_pct_df(score_df,pct_ref)