KATLAS

Predict kinases given a substrate sequence

KATLAS is a repository of Python tools that predict kinases for a given substrate sequence. It also contains datasets of kinase substrate specificities and human phosphoproteomics.

References: Please cite the appropriate papers if KATLAS is helpful to your research.

Reproduce datasets & figures

Follow the instructions in katlas_raw: https://github.com/sky1ove/katlas_raw

Web applications

Users can now run the analysis directly on the web without needing to code.

Check out our latest web platform: kinase-atlas.com

Install

UV:

uv add -U python-katlas

pip:

pip install -U python-katlas

To use the machine-learning modules, install the development version: pip install -U "python-katlas[dev]"

Import

from katlas.common import *

Quick start

We provide two methods to score a substrate sequence:

  • Computational Data-Driven Method (CDDM)
  • Positional Scanning Peptide Array (PSPA)

We consider the input in two formats:

  • a single input string (phosphorylation site)
  • a csv/dataframe that contains a column of phosphorylation sites

Input sequences can follow either of two case conventions:

  • all capital
  • contains lower cases indicating phosphorylation status
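
The examples in this README pair each case convention with a parameter set: Params("CDDM_upper") for all-capital input and Params("CDDM") for input with lowercase letters. As a minimal sketch of that choice, the helper below picks the parameter-set name from the sequence itself. The function choose_cddm_params is hypothetical and not part of katlas.

```python
# Hypothetical helper (not part of katlas): pick a CDDM parameter-set name
# from the letter case of a site sequence.
def choose_cddm_params(site_seq):
    """Return "CDDM" if any residue is lowercase, else "CDDM_upper"."""
    has_lowercase = any(ch.islower() for ch in site_seq)
    return "CDDM" if has_lowercase else "CDDM_upper"


print(choose_cddm_params("AAAAAAASGAGSDN"))   # all capital
print(choose_cddm_params("AAAAAAAsGGAGsDN"))  # lowercase marks phosphorylation status
```

The returned name could then be passed on as Params(name) when calling predict_kinase.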

Site scoring

CDDM, all capital

predict_kinase('AAAAAAASGAGSDN',**Params("CDDM_upper"))
considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0S', '1G', '2A', '3G', '4S', '5D', '6N']
GCN2      4.556
MPSK1     4.425
MEKK2     4.253
WNK3      4.213
WNK1      4.064
          ...  
PDK1    -25.077
PDHK3   -25.346
CLK2    -27.251
ROR2    -27.582
DDR1    -53.581
Length: 328, dtype: float64
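
The "considering string" line in the output labels each residue with its position relative to the center of the input. The sketch below reproduces those labels for illustration. It assumes the center is the residue at index len(seq) // 2, which matches the outputs shown in this README; label_positions is a hypothetical name, and how katlas builds the labels internally is not confirmed here.

```python
# Hypothetical helper (not part of katlas): label residues by their offset
# from the central residue, assumed to sit at index len(seq) // 2.
def label_positions(site_seq):
    center = len(site_seq) // 2
    return [f"{i - center}{aa}" for i, aa in enumerate(site_seq)]


print(label_positions("AAAAAAASGAGSDN"))
```

For a 14-residue input the labels run from -7 to 6, and for an 11-residue input from -5 to 5.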

CDDM, with lower case indicating phosphorylation status

predict_kinase('AAAAAAAsGGAGsDN',**Params("CDDM"))
considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0s', '1G', '2G', '3A', '4G', '5s', '6D', '7N']
ROR1       8.355
WNK1       4.907
WNK2       4.782
ERK5       4.466
RIPK2      4.045
           ...  
DDR1     -29.393
TNNI3K   -29.884
CHAK1    -31.775
VRK1     -45.287
BRAF     -49.403
Length: 328, dtype: float64

PSPA, with lower case indicating phosphorylation status

predict_kinase('AEEKEyHsEGG',**Params("PSPA"))
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2s', '3E', '4G', '5G']
kinase
EGFR          4.013
FGFR4         3.568
ZAP70         3.412
CSK           3.241
SYK           3.209
              ...  
JAK1         -3.837
DDR2         -4.421
TNK2         -4.534
TNNI3K_TYR   -4.651
TNK1         -5.320
Length: 93, dtype: float64

To replicate the results from The Kinase Library (PSPA)

Check this link: The Kinase Library, and rank kinases by log2(score). The results match the output below, with slight differences due to rounding.

out = predict_kinase('AEEKEyHSEGG',**Params("PSPA"))
out
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2S', '3E', '4G', '5G']
kinase
EGFR     3.181
FGFR4    2.390
CSK      2.308
ZAP70    2.068
SYK      1.998
         ...  
EPHA1   -3.501
FES     -3.699
TNK1    -4.269
TNK2    -4.577
DDR2    -4.920
Length: 93, dtype: float64
  • The Kinase Library currently treats all tyrosine sequences as all capital, even when they contain lowercase letters. This is a small bug that should be fixed soon.
  • A kinase name ending in “_TYR” indicates a dual-specificity kinase tested in the PSPA tyrosine setting; these kinases are not yet included in kinase-library.
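
Ranking by log2(score) can be sketched as follows. The raw scores and kinase names here are made up for illustration and are not output from katlas or The Kinase Library. Because log2 is monotonic, the order is the same as ranking by the raw score; only the scale of the reported values changes.

```python
import math

# Made-up raw scores for illustration only (not katlas or Kinase Library output).
raw_scores = {"KINASE_A": 8.0, "KINASE_B": 0.5, "KINASE_C": 2.0}

# Convert to log2 scale, then sort from highest to lowest.
log2_scores = {kinase: math.log2(score) for kinase, score in raw_scores.items()}
ranked = sorted(log2_scores.items(), key=lambda item: item[1], reverse=True)

for kinase, value in ranked:
    print(f"{kinase}\t{value:.3f}")
```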

We can also calculate the percentile score using a reference score sheet.

# Percentile reference sheet
y_pct = Data.get_pspa_tyr_pct()
get_pct('AEEKEyHSEGG',pct_ref = y_pct,**Params("PSPA_y"))
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0Y', '1H', '2S', '3E', '4G', '5G']
       log2(score)  percentile
EGFR         3.181   96.787423
FGFR4        2.390   94.012303
CSK          2.308   95.201640
ZAP70        2.068   88.380041
SYK          1.998   85.522898
...            ...         ...
EPHA1       -3.501   12.139440
FES         -3.699   21.216678
TNK1        -4.269    5.481887
TNK2        -4.577    2.050581
DDR2        -4.920   10.403281

93 rows × 2 columns
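
A percentile places a score within a distribution of reference scores. The sketch below shows one common definition: the share of reference scores at or below the query score. Whether get_pct uses exactly this definition is an assumption, and percentile_of and the reference values are made up for illustration. In the table above a lower log2(score) can carry a higher percentile (FES compared with EPHA1), which suggests that each kinase is compared against its own reference distribution.

```python
from bisect import bisect_right


# Hypothetical helper (not part of katlas): percentile of a score within
# a list of reference scores for one kinase.
def percentile_of(score, reference_scores):
    ordered = sorted(reference_scores)
    n_at_or_below = bisect_right(ordered, score)  # count of values <= score
    return 100.0 * n_at_or_below / len(ordered)


# Made-up reference distribution for a single kinase.
reference = [-4.0, -2.5, -1.0, 0.0, 0.5, 1.5, 2.0, 3.0, 3.5, 4.0]
print(percentile_of(3.181, reference))
```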

Site scoring in a df

Load your csv:

# df = pd.read_csv('your_file.csv')

Or load a demo df

# Load a demo df with phosphorylation sites
df = Data.get_ochoa_site().head()
df.iloc[:,-2:]
          site_seq       gene_site
0  VDDEKGDSNDDYDSA  A0A075B6Q4_S24
1  YDSAGLLSDEDCMSV  A0A075B6Q4_S35
2  IADHLFWSEETKSRF  A0A075B6Q4_S57
3  KSRFTEYSMTSSVMR  A0A075B6Q4_S68
4  FTEYSMTSSVMRRNE  A0A075B6Q4_S71

Set the column name and param to calculate

Here we choose Params("CDDM_upper") because the sequences in the demo df are all capital. You can also choose other parameter sets.

results = predict_kinase_df(df,'site_seq',**Params("CDDM_upper"))
results
input dataframe has a length 5
Preprocessing
Finish preprocessing
Merging reference
Finish merging
SRC EPHA3 FES NTRK3 ALK ABL1 FLT3 EPHA8 EPHB2 EPHB1 ... VRK1 PKMYT1 GRK3 CAMK1B CDC7 SMMLCK ROR1 GAK MAST2 BRAF
0 -2.440640 -0.818753 -1.663990 -0.738991 -2.047628 -3.602344 -3.200998 -0.935176 -1.388444 -1.859450 ... -17.103237 -113.698143 -16.848783 -41.520172 -41.646187 1.284159 -26.566362 -69.165062 -17.706400 -87.763214
1 -3.838486 -2.735969 -2.533986 -2.150399 -3.792498 -4.725527 -5.711791 -4.534240 -3.148449 -2.511518 ... -67.889053 -68.652641 -45.833855 -64.171600 -39.465572 -65.061722 -109.561707 -85.911224 -60.105064 -63.889122
2 -2.610423 -2.370090 -3.235637 -1.508413 -2.571347 -3.740941 -3.025596 -3.373504 -2.776297 -3.060740 ... -15.798462 -45.905319 -61.440742 -67.695694 -55.047962 -42.135216 -38.501572 -62.624382 -56.119389 -107.060989
3 -5.180541 -4.201880 -5.766463 -3.038421 -3.836897 -4.249900 -5.029885 -5.411311 -4.713308 -4.827825 ... -96.978317 -83.419777 -22.559393 -110.611588 -63.283070 -37.240440 -24.497492 -112.878151 -43.538158 -60.348518
4 -2.844254 -3.322700 -3.681745 -1.766435 -2.666579 -3.748774 -4.083619 -3.912834 -3.724181 -3.948160 ... -35.824612 -87.983566 -83.312317 -107.162407 -61.478374 -85.793571 -43.738819 -47.004211 -42.281624 -59.518513

5 rows × 328 columns

results.iloc[0].sort_values(ascending=False)
TLK2        8.264621
GCN2        8.101542
TLK1        7.693897
HRI         6.691402
PLK3        6.579368
             ...    
NIK       -64.605148
SRPK2     -67.300667
GAK       -69.165062
BRAF      -87.763214
PKMYT1   -113.698143
Name: 0, Length: 328, dtype: float32
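
Sorting a row gives the full ranking, but often only the top few kinases per site are needed. The sketch below does this with the standard library on a plain dict that holds a handful of values copied from the output above. The helper top_kinases is hypothetical and not part of katlas.

```python
import heapq

# A few kinase scores copied from the sorted row above (site 0).
row = {
    "TLK2": 8.264621,
    "GCN2": 8.101542,
    "TLK1": 7.693897,
    "BRAF": -87.763214,
    "PKMYT1": -113.698143,
}


# Hypothetical helper (not part of katlas): names of the k highest-scoring kinases.
def top_kinases(scores, k=3):
    return heapq.nlargest(k, scores, key=scores.get)


print(top_kinases(row, k=3))
```

On the pandas result itself, results.iloc[0].nlargest(3) plays the same role.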

Dataset

Besides calculating sequence scores, we also provide several datasets of phosphorylation sites.

CPTAC pan-cancer phosphoproteomics

df = Data.get_cptac_ensembl_site()
df.head(3)
gene site site_seq protein gene_name gene_site protein_site
0 ENSG00000003056.8 S267 DDQLGEESEERDDHL ENSP00000000412.3 M6PR M6PR_S267 ENSP00000000412_S267
1 ENSG00000003056.8 S267 DDQLGEESEERDDHL ENSP00000440488.2 M6PR M6PR_S267 ENSP00000440488_S267
2 ENSG00000048028.11 S1053 PPTIRPNSPYDLCSR ENSP00000003302.4 USP28 USP28_S1053 ENSP00000003302_S1053

Ochoa et al. human phosphoproteome

df = Data.get_ochoa_site()
df.head(3)
uniprot position residue is_disopred disopred_score log10_hotspot_pval_min isHotspot uniprot_position functional_score current_uniprot name gene Sequence is_valid site_seq gene_site
0 A0A075B6Q4 24 S 1.0 0.91 6.839384 1.0 A0A075B6Q4_24 0.149257 A0A075B6Q4 A0A075B6Q4_HUMAN None MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... True VDDEKGDSNDDYDSA A0A075B6Q4_S24
1 A0A075B6Q4 35 S 1.0 0.87 9.192622 0.0 A0A075B6Q4_35 0.136966 A0A075B6Q4 A0A075B6Q4_HUMAN None MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... True YDSAGLLSDEDCMSV A0A075B6Q4_S35
2 A0A075B6Q4 57 S 0.0 0.28 0.818834 0.0 A0A075B6Q4_57 0.125364 A0A075B6Q4 A0A075B6Q4_HUMAN None MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... True IADHLFWSEETKSRF A0A075B6Q4_S57
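
In the rows above, site_seq is the 15-residue window (-7 to +7) centered on position within Sequence. The sketch below shows one way to cut such a window from a protein sequence. The helper extract_window is hypothetical and not part of katlas, and padding with "_" near the termini is an assumption based on the PhosphoSitePlus sequences shown in this README. Only the visible prefix of the truncated Sequence value is used.

```python
# Hypothetical helper (not part of katlas): cut a window of residues
# around a 1-based site position, padding past the termini with "_".
def extract_window(protein_seq, position, flank=7, pad="_"):
    center = position - 1  # convert the 1-based position to a 0-based index
    padded = pad * flank + protein_seq + pad * flank
    # After left-padding by `flank`, the window starts at index `center`.
    return padded[center : center + 2 * flank + 1]


# Visible prefix of the A0A075B6Q4 sequence (the table truncates the rest).
prefix = "MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT"

print(extract_window(prefix, 24))  # site S24
print(extract_window(prefix, 35))  # site S35
```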

PhosphoSitePlus human phosphorylation site

df = Data.get_psp_human_site()
df.head(3)
gene protein uniprot site gene_site SITE_GRP_ID species site_seq LT_LIT MS_LIT MS_CST CST_CAT# Ambiguous_Site
0 YWHAB 14-3-3 beta P31946 T2 YWHAB_T2 15718712 human ______MtMDksELV NaN 3.0 1.0 None 0
1 YWHAB 14-3-3 beta P31946 S6 YWHAB_S6 15718709 human __MtMDksELVQkAk NaN 8.0 NaN None 0
2 YWHAB 14-3-3 beta P31946 Y21 YWHAB_Y21 3426383 human LAEQAERyDDMAAAM NaN NaN 4.0 None 0

Unique sites of combined Ochoa & PhosphoSitePlus

df = Data.get_combine_site_psp_ochoa()
df.head(3)
uniprot gene site site_seq source AM_pathogenicity CDDM_upper CDDM_max_score
0 A0A024R4G9 C19orf48 S20 ITGSRLLSMVPGPAR psp NaN PRKX,AKT1,PKG1,P90RSK,HIPK4,AKT3,HIPK1,PKACB,H... 2.407041
1 A0A075B6Q4 None S24 VDDEKGDSNDDYDSA ochoa NaN CK2A2,CK2A1,GRK7,GRK5,CK1G1,CK1A,IKKA,CK1G2,CA... 2.295654
2 A0A075B6Q4 None S35 YDSAGLLSDEDCMSV ochoa NaN CK2A2,CK2A1,IKKA,ATM,IKKB,CAMK1D,MARK2,GRK7,IK... 2.488683

Phosphorylation site sequence example

All capital - 15 length (-7 to +7)

  • QSEEEKLSPSPTTED
  • TLQHVPDYRQNVYIP
  • TMGLSARYGPQFTLQ

All capital - 10 length (-5 to +4)

  • SRDPHYQDPH
  • LDNPDYQQDF
  • AAAAASGGAG

With lowercase - (-7 to +7)

  • QsEEEKLsPsPTTED
  • TLQHVPDyRQNVYIP
  • TMGLsARyGPQFTLQ

With lowercase - (-5 to +4)

  • sRDPHyQDPH
  • LDNPDyQQDF
  • AAAAAsGGAG
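
Before scoring, it can help to check that the phosphorylation site sits where expected. The sketch below assumes the site is the residue at index len(seq) // 2, which holds for both the (-7 to +7) and (-5 to +4) layouts above, and that the phospho-acceptor is S, T, or Y in either case. Both helpers are hypothetical and not part of katlas.

```python
# Hypothetical helpers (not part of katlas).
def center_residue(site_seq):
    """Residue at position 0, assumed to sit at index len(seq) // 2."""
    return site_seq[len(site_seq) // 2]


def has_phospho_acceptor_center(site_seq):
    """True if the central residue is S, T, or Y (upper or lower case)."""
    return center_residue(site_seq).upper() in {"S", "T", "Y"}


for seq in ["QSEEEKLSPSPTTED", "TMGLsARyGPQFTLQ", "SRDPHYQDPH", "AAAAAsGGAG"]:
    print(seq, center_residue(seq), has_phospho_acceptor_center(seq))
```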