In this session, we score substrates for well-known kinase-substrate pairs. We show examples for each of the two methods:
CDDM, computational data-driven method
PSPA, positional scanning peptide array
The diagram above shows how the score is calculated using sum as the aggregation, which is what CDDM uses. For PSPA, the values are multiplied and then log-transformed (which is equivalent to log-transforming first and then summing).
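The equivalence between the two PSPA orderings can be checked numerically. The sketch below is a standalone illustration, not katlas code: the per-position values are invented, and log2 is an assumed choice of base (the identity holds for any base).

```python
import math

# Hypothetical per-position values for one kinase (not real PSSM entries)
values = [1.8, 0.6, 2.5, 1.2]

# CDDM-style aggregation: plain sum of the values
sum_score = sum(values)

# PSPA-style aggregation: multiply, then log-transform ...
log_of_product = math.log2(math.prod(values))

# ... which matches log-transforming each value first, then summing
sum_of_logs = sum(math.log2(v) for v in values)

print(sum_score)
print(math.isclose(log_of_product, sum_of_logs))
```

Summing logs is usually the numerically safer ordering, since a long product of small values can underflow before the log is taken.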
Setup
!pip install python-katlas -Uq
from katlas.core import *
import pandas as pd
Test with CDDM
We provide two parameters for CDDM:
param_CDDM: for sequences where the phosphorylation status of the substrate is known
param_CDDM_upper: for sequences written in all capital letters
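One way to pick between the two parameter sets is to check whether a sequence carries any lowercase s/t/y. The helper below is hypothetical (it is not part of katlas) and assumes that lowercase s/t/y mark phosphorylated residues, as the example sequences in this tutorial suggest. It returns the parameter name as a string so the sketch runs without katlas installed.

```python
def choose_cddm_params(seq: str) -> str:
    """Return the name of the CDDM parameter set suited to `seq` (hypothetical helper)."""
    # Assumption: lowercase s/t/y indicate known phosphorylation status
    has_phospho_marks = any(ch in "sty" for ch in seq)
    return "param_CDDM" if has_phospho_marks else "param_CDDM_upper"

print(choose_cddm_params("GkkAtQAsQEy____"))
print(choose_cddm_params("GKKATQASQEY____"))
```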
Below are substrate sequences from PhosphoSitePlus:
# ATM/ATR/DNAPK --> H2AX
predict_kinase('GkkAtQAsQEy____', **param_CDDM)
considering string: ['-7G', '-6K', '-5K', '-4A', '-3t', '-2Q', '-1A', '0s', '1Q', '2E', '3y']
kinase
ATR 2.321
ATM 2.291
DNAPK 2.013
NIM1 1.663
MARK3 1.658
...
HCK 0.668
SRC 0.656
FYN 0.654
JAK2 0.654
TNK1 0.641
Length: 289, dtype: float64
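The "considering string" line shows how the sequence is split into position-labelled residues around the central site. The sketch below reproduces that labelling as inferred from the printed output only: the center is taken as the middle character, `_` padding is skipped, and every residue except lowercase s/t/y is uppercased. `label_positions` is a hypothetical name, and the actual katlas implementation may differ (for example, the PSPA outputs further down keep only a narrower window of positions).

```python
def label_positions(seq: str, keep_lower: str = "sty") -> list[str]:
    """Label each residue with its offset from the central site (hypothetical helper)."""
    center = len(seq) // 2
    labels = []
    for i, aa in enumerate(seq):
        if aa == "_":  # padding, no residue at this position
            continue
        if aa not in keep_lower:  # e.g. lowercase k is treated as K
            aa = aa.upper()
        labels.append(f"{i - center}{aa}")
    return labels

print(label_positions("GkkAtQAsQEy____"))
```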
# AKT --> mTOR
predict_kinase('RsRtRtDsysAGQsV', **param_CDDM)
considering string: ['-7R', '-6s', '-5R', '-4t', '-3R', '-2t', '-1D', '0s', '1y', '2s', '3A', '4G', '5Q', '6s', '7V']
kinase
AKT1 2.702
SGK1 2.476
P90RSK 2.473
AKT2 2.437
AKT3 2.436
...
FLT3 0.719
LCK 0.717
SRC 0.708
TEC 0.686
FYN 0.685
Length: 289, dtype: float64
# ATM/ATR --> p53, S15
predict_kinase('PsVEPPLsQEtFsDL', **param_CDDM)
considering string: ['-7P', '-6s', '-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3t', '4F', '5s', '6D', '7L']
kinase
ATR 2.998
ATM 2.800
DNAPK 2.340
CK2A1 1.912
CDK8 1.908
...
YES1 0.850
DDR2 0.833
WEE1 0.818
TNK1 0.809
TYK2 0.804
Length: 289, dtype: float64
# ABL --> CRKL, Y207
predict_kinase('IPEPAHAyAQPQttt', **param_CDDM)
considering string: ['-7I', '-6P', '-5E', '-4P', '-3A', '-2H', '-1A', '0y', '1A', '2Q', '3P', '4Q', '5t', '6t', '7t']
kinase
ABL1 1.722
TNK2 1.700
ABL2 1.672
JAK2 1.669
FER 1.652
...
CK1G2 0.560
DCAMKL1 0.551
CK1G1 0.540
GRK7 0.526
CK2A1 0.518
Length: 289, dtype: float64
# EGFR --> EGFR, Y1092
predict_kinase('tFLPVPEyINQsVPk', **param_CDDM)
considering string: ['-7t', '-6F', '-5L', '-4P', '-3V', '-2P', '-1E', '0y', '1I', '2N', '3Q', '4s', '5V', '6P', '7K']
kinase
EGFR 1.774
CSK 1.733
JAK2 1.731
ERBB4 1.725
FLT3 1.719
...
PKACB 0.641
PAK6 0.630
NIM1 0.627
PAK5 0.593
SGK2 0.571
Length: 289, dtype: float64
# JAK2 --> STAT3, Y705
predict_kinase('DPGsAAPyLktKFIC', **param_CDDM)
considering string: ['-7D', '-6P', '-5G', '-4s', '-3A', '-2A', '-1P', '0y', '1L', '2K', '3t', '4K', '5F', '6I', '7C']
kinase
JAK2 1.716
EPHA4 1.709
KIT 1.702
FLT3 1.696
TNK1 1.696
...
CAMK4 0.567
BRSK2 0.558
PAK5 0.555
CAMK1D 0.535
PRKX 0.532
Length: 289, dtype: float64
# LCK --> CD3 zeta, Y83
predict_kinase('NLGRREEyDVLDkRR', **param_CDDM)
considering string: ['-7N', '-6L', '-5G', '-4R', '-3R', '-2E', '-1E', '0y', '1D', '2V', '3L', '4D', '5K', '6R', '7R']
kinase
PTK2 2.155
LCK 2.127
ZAP70 2.117
EPHA2 2.117
BLK 2.105
...
GSK3A 0.801
ERK1 0.778
ATR 0.767
GSK3B 0.758
DNAPK 0.689
Length: 289, dtype: float64
# SYK --> BLNK, Y96
predict_kinase('EENADDSyEPPPVEQ', **param_CDDM)
considering string: ['-7E', '-6E', '-5N', '-4A', '-3D', '-2D', '-1S', '0y', '1E', '2P', '3P', '4P', '5V', '6E', '7Q']
kinase
SYK 2.045
ZAP70 2.038
LCK 1.975
EGFR 1.967
PTK2 1.966
...
PKCA 0.656
PKCB 0.651
PKCD 0.640
PHKG1 0.638
PKCZ 0.631
Length: 289, dtype: float64
# CDK4 --> RB1, S807
predict_kinase('PGGNIyIsPLksPyk', **param_CDDM)
considering string: ['-7P', '-6G', '-5G', '-4N', '-3I', '-2y', '-1I', '0s', '1P', '2L', '3K', '4s', '5P', '6y', '7K']
kinase
CDK2 2.369
CDK4 2.351
CDK1 2.346
CDK3 2.300
CDK5 2.296
...
JAK3 0.700
EPHA4 0.698
ERBB4 0.688
FGFR4 0.687
TNK2 0.665
Length: 289, dtype: float64
# AKT --> TSC2, S939
predict_kinase('sFRARstsLNERPKs', **param_CDDM)
considering string: ['-7s', '-6F', '-5R', '-4A', '-3R', '-2s', '-1t', '0s', '1L', '2N', '3E', '4R', '5P', '6K', '7s']
kinase
AKT1 2.776
SGK1 2.578
AKT3 2.526
AKT2 2.437
P90RSK 2.420
...
EPHA4 0.717
FER 0.714
TNK2 0.705
TEC 0.702
FYN 0.696
Length: 289, dtype: float64
# CK1G1 --> NFkB, RELA S536
predict_kinase('sGDEDFSsIADMDFS', **param_CDDM)
considering string: ['-7s', '-6G', '-5D', '-4E', '-3D', '-2F', '-1S', '0s', '1I', '2A', '3D', '4M', '5D', '6F', '7S']
kinase
CK1G1 2.177
CK1G2 2.100
CK1G3 2.012
CK2A1 1.942
PAK6 1.916
...
DDR2 0.769
TYRO3 0.769
TNK1 0.768
TNK2 0.746
AXL 0.743
Length: 289, dtype: float64
# LKB1 --> AMPK
predict_kinase('sDGEFLRtsCGsPNY', **param_CDDM)
considering string: ['-7s', '-6D', '-5G', '-4E', '-3F', '-2L', '-1R', '0t', '1s', '2C', '3G', '4s', '5P', '6N', '7Y']
kinase
LKB1 1.898
CAMKK2 1.690
CAMKK1 1.684
PBK 1.485
GSK3A 1.403
...
CSK 0.611
DDR2 0.611
KIT 0.607
FGFR4 0.594
TSSK2 0.590
Length: 289, dtype: float64
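For known pairs like these, it is handy to check where the expected kinase lands in the ranking. The sketch below uses a plain dict holding the top five scores copied from the LCK example above; `rank_of` is a hypothetical helper, not a katlas function.

```python
# Top five CDDM scores copied from the LCK --> CD3 zeta example above
top_scores = {"PTK2": 2.155, "LCK": 2.127, "ZAP70": 2.117, "EPHA2": 2.117, "BLK": 2.105}

def rank_of(kinase: str, scores: dict) -> "int | None":
    """1-based rank of `kinase` with scores sorted high to low; None if absent."""
    if kinase not in scores:
        return None
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(kinase) + 1

print(rank_of("LCK", top_scores))
```

The same idea should carry over to the full output, which appears to be a pandas Series (note the dtype: float64 line), by converting it with `.to_dict()` first.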
Test with PSPA
We provide three parameters for PSPA:
param_PSPA_st: for S/T sequences
param_PSPA_y: for Y sequences
param_PSPA: lazy mode, for both S/T and Y sequences; runs slower
PSPA performs best on substrate sequences that include phosphorylation status.
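To avoid the slower lazy mode, the parameter set can be chosen from the central residue. The helper below is a hypothetical sketch, not part of katlas: it assumes the phosphorylation site sits at the middle of the sequence, as in the examples of this tutorial, and returns the parameter name as a string so it runs without katlas installed.

```python
def choose_pspa_params(seq: str) -> str:
    """Pick a PSPA parameter set name from the central residue (hypothetical helper)."""
    center = seq[len(seq) // 2].lower()
    if center in ("s", "t"):
        return "param_PSPA_st"
    if center == "y":
        return "param_PSPA_y"
    # Fall back to the lazy mode when the central residue is not S/T/Y
    return "param_PSPA"

print(choose_pspa_params("GRKGsGDyMPMsPKs"))
print(choose_pspa_params("sGDEDFSsIADMDFS"))
```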
# Insulin Receptor and IRS-1 (Insulin Receptor Substrate 1)
# Kinase: Insulin Receptor
# Substrate: IRS-1 (Y612, Y632, Y662)
predict_kinase('GRKGsGDyMPMsPKs', **param_PSPA)
considering string: ['-5K', '-4G', '-3s', '-2G', '-1D', '0y', '1M', '2P', '3M', '4s', '5P']
kinase
ZAP70 6.625
INSRR 4.442
IGF1R 3.792
FLT1 3.693
ERBB4 3.503
...
YANK2 NaN
YANK3 NaN
YSK1 NaN
YSK4 NaN
ZAK NaN
Length: 396, dtype: float64
We’ll get the same result with param_PSPA_y, which skips the calculation for Ser/Thr kinases (those NaNs) and runs faster.
predict_kinase('GRKGsGDyMPMsPKs', **param_PSPA_y)
considering string: ['-5K', '-4G', '-3s', '-2G', '-1D', '0y', '1M', '2P', '3M', '4s', '5P']
kinase
ZAP70 6.625
INSRR 4.442
IGF1R 3.792
FLT1 3.693
ERBB4 3.503
...
TEC -1.348
TNNI3K_TYR -1.713
LIMK1_TYR -2.112
TNK1 -2.217
BTK -2.622
Length: 93, dtype: float64
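If the lazy param_PSPA has already been run, the NaN entries can also be filtered out afterwards. The sketch below uses a plain dict with a few scores copied from the param_PSPA output above, so it runs without katlas or pandas.

```python
import math

# A few scores copied from the param_PSPA output; NaN marks Ser/Thr kinases
scores = {
    "ZAP70": 6.625,
    "INSRR": 4.442,
    "IGF1R": 3.792,
    "YANK2": float("nan"),
    "ZAK": float("nan"),
}

# Keep only kinases that actually received a score
scored_only = {k: v for k, v in scores.items() if not math.isnan(v)}
print(scored_only)
```

On the pandas Series that predict_kinase appears to return, `.dropna()` should have the same effect.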
Let’s try using param_PSPA_st
# CK1G1 --> NFkB, RELA S536
predict_kinase('sGDEDFSsIADMDFS', **param_PSPA_st)
considering string: ['-5D', '-4E', '-3D', '-2F', '-1S', '0s', '1I', '2A', '3D', '4M']
kinase
IKKA 5.435
CK1G3 4.977
GRK1 4.488
IKKB 4.286
CK1G2 4.184
...
DMPK1 -8.521
MOK -9.204
BUB1 -9.361
CDK10 -10.330
AAK1 -10.342
Length: 303, dtype: float64
param_PSPA_st gives the same result as param_PSPA, but runs faster:
predict_kinase('sGDEDFSsIADMDFS', **param_PSPA)
considering string: ['-5D', '-4E', '-3D', '-2F', '-1S', '0s', '1I', '2A', '3D', '4M', '5D']
kinase
IKKA 5.435
CK1G3 4.977
GRK1 4.488
IKKB 4.286
CK1G2 4.184
...
KDR NaN
FLT4 NaN
WEE1_TYR NaN
YES1 NaN
ZAP70 NaN
Length: 396, dtype: float64
Customize reference PSSM and aggregation function
You can pass your own PSSM dataframe and aggregation function to predict_kinase and predict_kinase_df.
For example, predict_kinase('sGDEDFSsIADMDFS', ref=df, func=some_func)
Here is an example that uses the PSPA canonical TK matrix as ref and sumup as func:
canonical_TK = TK.loc[~TK.index.str.contains('_TYR'), :]
kinase
ABL1  0.0668  0.0689  0.0646  0.0520  0.0564  0.0539  0.0485  0.0448  0.0520  0.0536  ...  0.0613  0.0652  0.0756  0.0526  0.0512  0.0362  0.0339  0.0254  0.0254  0.0337
TNK2  0.0679  0.0818  0.0627  0.0617  0.0529  0.0528  0.0419  0.0463  0.0437  0.0453  ...  0.0499  0.0385  0.0302  0.0531  0.0465  0.0630  0.0572  0.0364  0.0364  0.0572
ALK   0.0675  0.0640  0.0590  0.0511  0.0476  0.0422  0.0455  0.0514  0.0546  0.0543  ...  0.0448  0.0367  0.0489  0.0334  0.0387  0.0245  0.0226  0.0181  0.0181  0.0172
ABL2  0.0687  0.0715  0.0611  0.0448  0.0537  0.0513  0.0467  0.0398  0.0462  0.0505  ...  0.0566  0.0640  0.0779  0.0538  0.0565  0.0378  0.0381  0.0252  0.0252  0.0289
AXL   0.0656  0.0753  0.0535  0.0525  0.0468  0.0467  0.0459  0.0538  0.0507  0.0542  ...  0.0441  0.0506  0.0355  0.0635  0.0696  0.0592  0.0559  0.0413  0.0413  0.0455
5 rows × 236 columns
predict_kinase('GRKGsGDyMPMsPKs', ref=canonical_TK, func=sumup)
considering string: ['-5K', '-4G', '-3s', '-2G', '-1D', '0y', '1M', '2P', '3M', '4s', '5P']
kinase
ZAP70 2.041
INSRR 1.907
FLT1 1.906
PTK2 1.873
SYK 1.842
...
PTK6 1.546
LYN 1.541
PDGFRA 1.539
TEC 1.539
BTK 1.496
Length: 78, dtype: float64
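The ref/func pattern can be pictured with a toy stand-in that does not depend on katlas. Everything in the sketch below is an assumption for illustration: the kinase names and values are invented, the reference is a dict of dicts rather than a dataframe, and `score` and `log_product` are hypothetical helpers whose signatures may not match what predict_kinase expects from func.

```python
import math

# Toy reference: kinase -> {position label: value}. All numbers are invented.
toy_ref = {
    "KIN_A": {"-1D": 0.08, "0y": 0.30, "1M": 0.06},
    "KIN_B": {"-1D": 0.04, "0y": 0.30, "1M": 0.09},
    "KIN_C_TYR": {"-1D": 0.05, "0y": 0.30, "1M": 0.05},
}

# Mirror the canonical_TK filter: drop entries whose name contains '_TYR'
canonical_ref = {k: v for k, v in toy_ref.items() if "_TYR" not in k}

def score(labels, ref, func):
    """Aggregate each kinase's values at the given labels with `func`, sorted high to low."""
    results = {
        kinase: func([row[label] for label in labels if label in row])
        for kinase, row in ref.items()
    }
    return dict(sorted(results.items(), key=lambda kv: kv[1], reverse=True))

def log_product(values):
    """Multiply the values, then log-transform (PSPA-style aggregation)."""
    return math.log2(math.prod(values))

labels = ["-1D", "0y", "1M"]
print(score(labels, canonical_ref, sum))          # sum aggregation
print(score(labels, canonical_ref, log_product))  # multiply-then-log aggregation
```

Keeping the reference matrix and the aggregation function as separate arguments means either one can be swapped without touching the other.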