KATLAS

Predict kinases given a substrate sequence

KATLAS is a repository of Python tools for predicting kinases from a substrate sequence. It also contains datasets of kinase substrate specificities and human phosphoproteomics.

References: Please cite the appropriate papers if KATLAS is helpful to your research.

KATLAS was described in the paper Computational Decoding of Human Kinome Substrate Specificities and Functions.
The positional scanning peptide array (PSPA) data are from the papers An atlas of substrate specificities for the human serine/threonine kinome and The intrinsic substrate specificity of the human tyrosine kinome.
The kinase substrate datasets used for generating PSSMs are derived from PhosphoSitePlus and the paper Large-scale Discovery of Substrates of the Human Kinome.
Phosphorylation sites are taken from PhosphoSitePlus, the paper The functional landscape of the human phosphoproteome, and CPTAC / LinkedOmics.

Web applications

You can now run the analysis directly on the web, with no coding required.

Check out our latest web platform: kinase-atlas.com

Install

pip install python-katlas

Import

from katlas.common import *

Quick start

We provide two methods to score a substrate sequence against kinases:

Computational Data-Driven Method (CDDM)
Positional Scanning Peptide Array (PSPA)

Input is accepted in two formats:

a single input string (phosphorylation site)
a csv/dataframe that contains a column of phosphorylation sites

Input sequences can be written in two ways:

all capital letters
lowercase letters marking phosphorylated residues
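
In the examples in this README, all-capital input is scored with Params("CDDM_upper") and input with lowercase phosphorylation marks is scored with Params("CDDM"). The sketch below shows one way to pick between the two names from the string alone. choose_cddm_param is a hypothetical helper written for illustration; it is not part of katlas.

```python
def choose_cddm_param(site_seq):
    """Return the CDDM parameter name that matches the case of `site_seq`.

    Hypothetical helper for illustration only; not part of katlas.
    """
    # Ignore padding characters such as '_' when checking the case.
    letters = [ch for ch in site_seq if ch.isalpha()]
    if not letters:
        raise ValueError("site_seq contains no residues")
    has_lowercase = any(ch.islower() for ch in letters)
    return "CDDM" if has_lowercase else "CDDM_upper"


print(choose_cddm_param("AAAAAAASGAGSDN"))   # all capital input
print(choose_cddm_param("AAAAAAAsGGAGsDN"))  # lowercase marks phosphorylated residues
```

The returned name could then be passed on as Params(name) in the calls shown in the Site scoring section.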

Site scoring

CDDM, all capital

predict_kinase('AAAAAAASGAGSDN',**Params("CDDM_upper"))

considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0S', '1G', '2A', '3G', '4S', '5D', '6N']

GCN2      4.556
MPSK1     4.425
MEKK2     4.253
WNK3      4.213
WNK1      4.064
          ...  
PDK1    -25.077
PDHK3   -25.346
CLK2    -27.251
ROR2    -27.582
DDR1    -53.581
Length: 328, dtype: float64
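
The "considering string" line lists each residue with its position relative to the phosphoacceptor at position 0. The sketch below reproduces that labelling, assuming the phosphoacceptor sits at index len(site_seq) // 2. That assumption is inferred from the outputs printed in this README, not taken from the katlas source, and position_tokens is a hypothetical name.

```python
def position_tokens(site_seq):
    """Label every residue with its offset from the assumed center residue.

    Assumption: the phosphoacceptor is at index len(site_seq) // 2.
    Hypothetical helper for illustration only; not part of katlas.
    """
    center = len(site_seq) // 2
    return [f"{i - center}{aa}" for i, aa in enumerate(site_seq)]


print(position_tokens("AAAAAAASGAGSDN"))
```

Under this assumption a 14-residue input spans -7 to +6, so it is worth checking that the residue labelled 0 is the site you intended to score.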

CDDM, with lower case indicating phosphorylation status

predict_kinase('AAAAAAAsGGAGsDN',**Params("CDDM"))

considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0s', '1G', '2G', '3A', '4G', '5s', '6D', '7N']

ROR1       8.355
WNK1       4.907
WNK2       4.782
ERK5       4.466
RIPK2      4.045
           ...  
DDR1     -29.393
TNNI3K   -29.884
CHAK1    -31.775
VRK1     -45.287
BRAF     -49.403
Length: 328, dtype: float64

PSPA, with lower case indicating phosphorylation status

predict_kinase('AEEKEyHsEGG',**Params("PSPA"))

considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2s', '3E', '4G', '5G']

kinase
EGFR          4.013
FGFR4         3.568
ZAP70         3.412
CSK           3.241
SYK           3.209
              ...  
JAK1         -3.837
DDR2         -4.421
TNK2         -4.534
TNNI3K_TYR   -4.651
TNK1         -5.320
Length: 93, dtype: float64

To replicate the results from The Kinase Library (PSPA)

Look up the same sequence on The Kinase Library and rank the kinases by log2(score). The ranking matches the output below, with slight differences due to rounding.

out = predict_kinase('AEEKEyHSEGG',**Params("PSPA"))
out

considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2S', '3E', '4G', '5G']

kinase
EGFR     3.181
FGFR4    2.390
CSK      2.308
ZAP70    2.068
SYK      1.998
         ...  
EPHA1   -3.501
FES     -3.699
TNK1    -4.269
TNK2    -4.577
DDR2    -4.920
Length: 93, dtype: float64

At present, The Kinase Library treats every tyrosine sequence as all capital, even when it contains lowercase letters. This is a small bug that should be fixed soon.
A kinase with the “_TYR” suffix is a dual-specificity kinase tested in the PSPA tyrosine setting; these kinases have not been included in The Kinase Library yet.

We can also calculate a percentile for each score using a reference score sheet.

# Percentile reference sheet
y_pct = Data.get_pspa_tyr_pct()

get_pct('AEEKEyHSEGG',pct_ref = y_pct,**Params("PSPA_y"))

considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0Y', '1H', '2S', '3E', '4G', '5G']

	log2(score)	percentile
EGFR	3.181	96.787423
FGFR4	2.390	94.012303
CSK	2.308	95.201640
ZAP70	2.068	88.380041
SYK	1.998	85.522898
...	...	...
EPHA1	-3.501	12.139440
FES	-3.699	21.216678
TNK1	-4.269	5.481887
TNK2	-4.577	2.050581
DDR2	-4.920	10.403281

93 rows × 2 columns
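
A percentile places a score within a distribution of reference scores for the same kinase. The sketch below shows one common definition (the share of reference scores less than or equal to the query). It is an assumption about what a percentile against a reference sheet means in general; get_pct may define it differently. percentile_of and the reference values are made up for illustration.

```python
from bisect import bisect_right


def percentile_of(reference_scores, score):
    """Percent of reference scores that are <= `score`.

    Hypothetical helper; katlas's get_pct may use a different definition.
    """
    ordered = sorted(reference_scores)
    if not ordered:
        raise ValueError("reference_scores is empty")
    # bisect_right counts how many reference values are <= score.
    rank = bisect_right(ordered, score)
    return 100.0 * rank / len(ordered)


# Made-up reference distribution for one kinase, for illustration only.
toy_reference = [-4.0, -2.5, -1.0, 0.5, 1.5, 2.0, 2.5, 3.0]
print(percentile_of(toy_reference, 1.8))
```

Because each kinase has its own reference distribution, two kinases with similar log2(score) values can land at different percentiles, as in the table above.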

Site scoring in a df

Load your csv:

# df = pd.read_csv('your_file.csv')

Or load a demo df

# Load a demo df with phosphorylation sites
df = Data.get_ochoa_site().head()
df.iloc[:,-2:]

	site_seq	gene_site
0	VDDEKGDSNDDYDSA	A0A075B6Q4_S24
1	YDSAGLLSDEDCMSV	A0A075B6Q4_S35
2	IADHLFWSEETKSRF	A0A075B6Q4_S57
3	KSRFTEYSMTSSVMR	A0A075B6Q4_S68
4	FTEYSMTSSVMRRNE	A0A075B6Q4_S71

Set the column name and the params to use

Here we choose Params("CDDM_upper") because the sequences in the demo df are all capital. You can also choose other params.

results = predict_kinase_df(df,'site_seq',**Params("CDDM_upper"))
results

input dataframe has a length 5
Preprocessing
Finish preprocessing
Merging reference
Finish merging

	SRC	EPHA3	FES	NTRK3	ALK	ABL1	FLT3	EPHA8	EPHB2	EPHB1	...	VRK1	PKMYT1	GRK3	CAMK1B	CDC7	SMMLCK	ROR1	GAK	MAST2	BRAF
0	-2.440640	-0.818753	-1.663990	-0.738991	-2.047628	-3.602344	-3.200998	-0.935176	-1.388444	-1.859450	...	-17.103237	-113.698143	-16.848783	-41.520172	-41.646187	1.284159	-26.566362	-69.165062	-17.706400	-87.763214
1	-3.838486	-2.735969	-2.533986	-2.150399	-3.792498	-4.725527	-5.711791	-4.534240	-3.148449	-2.511518	...	-67.889053	-68.652641	-45.833855	-64.171600	-39.465572	-65.061722	-109.561707	-85.911224	-60.105064	-63.889122
2	-2.610423	-2.370090	-3.235637	-1.508413	-2.571347	-3.740941	-3.025596	-3.373504	-2.776297	-3.060740	...	-15.798462	-45.905319	-61.440742	-67.695694	-55.047962	-42.135216	-38.501572	-62.624382	-56.119389	-107.060989
3	-5.180541	-4.201880	-5.766463	-3.038421	-3.836897	-4.249900	-5.029885	-5.411311	-4.713308	-4.827825	...	-96.978317	-83.419777	-22.559393	-110.611588	-63.283070	-37.240440	-24.497492	-112.878151	-43.538158	-60.348518
4	-2.844254	-3.322700	-3.681745	-1.766435	-2.666579	-3.748774	-4.083619	-3.912834	-3.724181	-3.948160	...	-35.824612	-87.983566	-83.312317	-107.162407	-61.478374	-85.793571	-43.738819	-47.004211	-42.281624	-59.518513

5 rows × 328 columns

results.iloc[0].sort_values(ascending=False)

TLK2        8.264621
GCN2        8.101542
TLK1        7.693897
HRI         6.691402
PLK3        6.579368
             ...    
NIK       -64.605148
SRPK2     -67.300667
GAK       -69.165062
BRAF      -87.763214
PKMYT1   -113.698143
Name: 0, Length: 328, dtype: float32
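
When only the best-scoring kinases for a site are needed, the sorted row can be cut to the top few entries. The sketch below does this in plain Python on a dictionary holding the ten values printed above for row 0; top_kinases is a hypothetical helper, not part of katlas.

```python
import heapq

# The ten scores printed above for row 0 (the other kinases are omitted here).
row_scores = {
    "TLK2": 8.264621, "GCN2": 8.101542, "TLK1": 7.693897,
    "HRI": 6.691402, "PLK3": 6.579368, "NIK": -64.605148,
    "SRPK2": -67.300667, "GAK": -69.165062, "BRAF": -87.763214,
    "PKMYT1": -113.698143,
}


def top_kinases(scores, n=3):
    """Return the n (kinase, score) pairs with the highest scores."""
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


for kinase, score in top_kinases(row_scores):
    print(kinase, score)
```

With the pandas row itself, results.iloc[0].nlargest(3) should give the same selection.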

Dataset

Besides calculating sequence scores, we also provide multiple datasets of phosphorylation sites.

CPTAC pan-cancer phosphoproteomics

df = Data.get_cptac_ensembl_site()
df.head(3)

	gene	site	site_seq	protein	gene_name	gene_site	protein_site
0	ENSG00000003056.8	S267	DDQLGEESEERDDHL	ENSP00000000412.3	M6PR	M6PR_S267	ENSP00000000412_S267
1	ENSG00000003056.8	S267	DDQLGEESEERDDHL	ENSP00000440488.2	M6PR	M6PR_S267	ENSP00000440488_S267
2	ENSG00000048028.11	S1053	PPTIRPNSPYDLCSR	ENSP00000003302.4	USP28	USP28_S1053	ENSP00000003302_S1053

Ochoa et al. human phosphoproteome

df = Data.get_ochoa_site()
df.head(3)

	uniprot	position	residue	is_disopred	disopred_score	log10_hotspot_pval_min	isHotspot	uniprot_position	functional_score	current_uniprot	name	gene	Sequence	is_valid	site_seq	gene_site
0	A0A075B6Q4	24	S	1.0	0.91	6.839384	1.0	A0A075B6Q4_24	0.149257	A0A075B6Q4	A0A075B6Q4_HUMAN	None	MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT...	True	VDDEKGDSNDDYDSA	A0A075B6Q4_S24
1	A0A075B6Q4	35	S	1.0	0.87	9.192622	0.0	A0A075B6Q4_35	0.136966	A0A075B6Q4	A0A075B6Q4_HUMAN	None	MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT...	True	YDSAGLLSDEDCMSV	A0A075B6Q4_S35
2	A0A075B6Q4	57	S	0.0	0.28	0.818834	0.0	A0A075B6Q4_57	0.125364	A0A075B6Q4	A0A075B6Q4_HUMAN	None	MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT...	True	IADHLFWSEETKSRF	A0A075B6Q4_S57

PhosphoSitePlus human phosphorylation sites

df = Data.get_psp_human_site()
df.head(3)

	gene	protein	uniprot	site	gene_site	SITE_GRP_ID	species	site_seq	LT_LIT	MS_LIT	MS_CST	CST_CAT#
0	YWHAB	14-3-3 beta	P31946	T2	YWHAB_T2	15718712	human	______MtMDksELV	NaN	3.0	1.0	None
1	YWHAB	14-3-3 beta	P31946	S6	YWHAB_S6	15718709	human	__MtMDksELVQkAk	NaN	8.0	NaN	None
2	YWHAB	14-3-3 beta	P31946	Y21	YWHAB_Y21	3426383	human	LAEQAERyDDMAAAM	NaN	NaN	4.0	None

Unique sites of combined Ochoa & PhosphoSitePlus

df = Data.get_combine_site_psp_ochoa()
df.head(3)

	uniprot	gene	site	site_seq	source	AM_pathogenicity	CDDM_upper	CDDM_max_score
0	A0A024R4G9	C19orf48	S20	ITGSRLLSMVPGPAR	psp	NaN	PRKX,AKT1,PKG1,P90RSK,HIPK4,AKT3,HIPK1,PKACB,H...	2.407041
1	A0A075B6Q4	None	S24	VDDEKGDSNDDYDSA	ochoa	NaN	CK2A2,CK2A1,GRK7,GRK5,CK1G1,CK1A,IKKA,CK1G2,CA...	2.295654
2	A0A075B6Q4	None	S35	YDSAGLLSDEDCMSV	ochoa	NaN	CK2A2,CK2A1,IKKA,ATM,IKKB,CAMK1D,MARK2,GRK7,IK...	2.488683

Phosphorylation site sequence example

All capital - 15 length (-7 to +7)

QSEEEKLSPSPTTED
TLQHVPDYRQNVYIP
TMGLSARYGPQFTLQ

All capital - 10 length (-5 to +4)

SRDPHYQDPH
LDNPDYQQDF
AAAAASGGAG

With lowercase - (-7 to +7)

QsEEEKLsPsPTTED
TLQHVPDyRQNVYIP
TMGLsARyGPQFTLQ

With lowercase - (-5 to +4)

sRDPHyQDPH
LDNPDyQQDF
AAAAAsGGAG
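
The windows in this section can be cut from a full protein sequence once the site position is known. The sketch below is a minimal version, assuming 1-based positions and "_" padding past the protein ends (as seen in the PhosphoSitePlus site_seq column). site_window is a hypothetical helper, not part of katlas, and the test sequence is only the visible prefix of the A0A075B6Q4 sequence from the Ochoa table.

```python
def site_window(protein_seq, position, left=7, right=7, pad="_", mark_site=False):
    """Cut the residues from -left to +right around a 1-based site position.

    Hypothetical helper for illustration only; not part of katlas.
    Positions outside the protein are filled with `pad`.
    """
    if not 1 <= position <= len(protein_seq):
        raise ValueError("position is outside the protein sequence")
    center = position - 1  # convert to a 0-based index
    chars = []
    for i in range(center - left, center + right + 1):
        chars.append(protein_seq[i] if 0 <= i < len(protein_seq) else pad)
    if mark_site:
        # Lowercase the central residue to mark it as phosphorylated.
        chars[left] = chars[left].lower()
    return "".join(chars)


# Visible prefix of the A0A075B6Q4 sequence shown in the Ochoa table.
prefix = "MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT"
print(site_window(prefix, 24))                   # 15-residue window, all capital
print(site_window(prefix, 24, mark_site=True))   # same window, site in lowercase
print(site_window(prefix, 24, left=5, right=4))  # 10-residue window (-5 to +4)
```

The first call reproduces the site_seq value listed for A0A075B6Q4_S24 in the demo df.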