Phosphoproteomics scoring

In this session, instead of scoring sequences one by one, we will score an entire phosphoproteomics dataset at once.

Setup

!pip install python-katlas -Uq

import pandas as pd,numpy as np,seaborn as sns
from matplotlib import pyplot as plt
from katlas.core import *

Phosphoproteomics dataset

Three datasets are available:

PhosphoSitePlus (PSP) human sites (January 2024)
Ochoa et al. human phosphoproteome
Combined Ochoa et al. and PSP low-throughput data

psp = Data.get_psp_human_site()
ochoa = Data.get_ochoa_site()
comb = Data.get_combine_site_psp_ochoa()

psp.head()

	gene	protein	uniprot	site	gene_site	SITE_GRP_ID	species	site_seq	LT_LIT	MS_LIT	MS_CST	CST_CAT#
0	YWHAB	14-3-3 beta	P31946	T2	YWHAB_T2	15718712	human	______MtMDksELV	NaN	3.0	1.0	None
1	YWHAB	14-3-3 beta	P31946	S6	YWHAB_S6	15718709	human	__MtMDksELVQkAk	NaN	8.0	NaN	None
2	YWHAB	14-3-3 beta	P31946	Y21	YWHAB_Y21	3426383	human	LAEQAERyDDMAAAM	NaN	NaN	4.0	None
3	YWHAB	14-3-3 beta	P31946	T32	YWHAB_T32	23077803	human	AAAMkAVtEQGHELs	NaN	NaN	1.0	None
4	YWHAB	14-3-3 beta	P31946	S39	YWHAB_S39	27442700	human	tEQGHELsNEERNLL	NaN	4.0	NaN	None

ochoa.head()

	uniprot	position	residue	is_disopred	disopred_score	log10_hotspot_pval_min	isHotspot	uniprot_position	functional_score	current_uniprot	name	gene	Sequence	is_valid	site_seq	gene_site
0	A0A075B6Q4	24	S	1.0	0.91	6.839384	1.0	A0A075B6Q4_24	0.149257	A0A075B6Q4	A0A075B6Q4_HUMAN	None	MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT...	True	VDDEKGDSNDDYDSA	A0A075B6Q4_S24
1	A0A075B6Q4	35	S	1.0	0.87	9.192622	0.0	A0A075B6Q4_35	0.136966	A0A075B6Q4	A0A075B6Q4_HUMAN	None	MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT...	True	YDSAGLLSDEDCMSV	A0A075B6Q4_S35
2	A0A075B6Q4	57	S	0.0	0.28	0.818834	0.0	A0A075B6Q4_57	0.125364	A0A075B6Q4	A0A075B6Q4_HUMAN	None	MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT...	True	IADHLFWSEETKSRF	A0A075B6Q4_S57
3	A0A075B6Q4	68	S	0.0	0.03	0.375986	0.0	A0A075B6Q4_68	0.119811	A0A075B6Q4	A0A075B6Q4_HUMAN	None	MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT...	True	KSRFTEYSMTSSVMR	A0A075B6Q4_S68
4	A0A075B6Q4	71	S	0.0	0.05	0.000000	0.0	A0A075B6Q4_71	0.095193	A0A075B6Q4	A0A075B6Q4_HUMAN	None	MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT...	True	FTEYSMTSSVMRRNE	A0A075B6Q4_S71

comb.head()

	uniprot	gene	site	site_seq	source	AM_pathogenicity	CDDM_upper	CDDM_max_score
0	A0A024R4G9	C19orf48	S20	ITGSRLLSMVPGPAR	psp	NaN	PRKX,AKT1,PKG1,P90RSK,HIPK4,AKT3,HIPK1,PKACB,H...	2.407041
1	A0A075B6Q4	None	S24	VDDEKGDSNDDYDSA	ochoa	NaN	CK2A2,CK2A1,GRK7,GRK5,CK1G1,CK1A,IKKA,CK1G2,CA...	2.295654
2	A0A075B6Q4	None	S35	YDSAGLLSDEDCMSV	ochoa	NaN	CK2A2,CK2A1,IKKA,ATM,IKKB,CAMK1D,MARK2,GRK7,IK...	2.488683
3	A0A075B6Q4	None	S57	IADHLFWSEETKSRF	ochoa	NaN	GRK7,CK2A1,CK2A2,PKN2,GRK1,GRK5,MARK1,MARK2,UL...	1.851894
4	A0A075B6Q4	None	S68	KSRFTEYSMTSSVMR	ochoa	NaN	AKT1,P90RSK,AKT3,SGK1,AKT2,NDR2,RSK2,P70S6K,RS...	2.026384

Scoring on all-capital sequence

comb.head()

	uniprot	gene	site	site_seq	source	AM_pathogenicity	CDDM_upper	CDDM_max_score
0	A0A024R4G9	C19orf48	S20	ITGSRLLSMVPGPAR	psp	NaN	PRKX,AKT1,PKG1,P90RSK,HIPK4,AKT3,HIPK1,PKACB,H...	2.407041
1	A0A075B6Q4	None	S24	VDDEKGDSNDDYDSA	ochoa	NaN	CK2A2,CK2A1,GRK7,GRK5,CK1G1,CK1A,IKKA,CK1G2,CA...	2.295654
2	A0A075B6Q4	None	S35	YDSAGLLSDEDCMSV	ochoa	NaN	CK2A2,CK2A1,IKKA,ATM,IKKB,CAMK1D,MARK2,GRK7,IK...	2.488683
3	A0A075B6Q4	None	S57	IADHLFWSEETKSRF	ochoa	NaN	GRK7,CK2A1,CK2A2,PKN2,GRK1,GRK5,MARK1,MARK2,UL...	1.851894
4	A0A075B6Q4	None	S68	KSRFTEYSMTSSVMR	ochoa	NaN	AKT1,P90RSK,AKT3,SGK1,AKT2,NDR2,RSK2,P70S6K,RS...	2.026384

The sequences appear to be in all-capital format, so we will use param_CDDM_upper.

cddm = predict_kinase_df(comb,'site_seq',**param_CDDM_upper)

input dataframe has a length 121419
Preprocessing
Finish preprocessing
Merging reference
Finish merging
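Rather than judging the letter case from the first five rows alone, the whole column can be checked. The following is a minimal sketch using plain pandas on a toy frame; the `toy` data and the `is_upper` column are illustrative names, not part of katlas, and the sequences are copied from the previews above.

```python
import pandas as pd

# Toy site sequences copied from the previews above (not the real dataset)
toy = pd.DataFrame({
    "site_seq": [
        "ITGSRLLSMVPGPAR",  # all-capital, as in comb
        "VDDEKGDSNDDYDSA",  # all-capital, as in comb
        "______MtMDksELV",  # contains lowercase residues, as in psp
    ]
})

# True where a sequence has no lowercase letters;
# the "_" padding is not a cased character, so str.isupper() ignores it
toy["is_upper"] = toy["site_seq"].str.isupper()

# Fraction of all-capital sequences in the column
frac_upper = toy["is_upper"].mean()

print(toy)
print(f"all-capital fraction: {frac_upper:.2f}")
```

If the fraction is 1.0, the all-capital parameter set is the natural choice; a mixed column suggests the data should be inspected before scoring.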

# PSPA scoring involves more computation, so it takes longer
pspa = predict_kinase_df(comb,'site_seq',**param_PSPA)

input dataframe has a length 121419
Preprocessing
Finish preprocessing
Merging reference
Finish merging

100%|██████████| 396/396 [01:37<00:00,  4.06it/s]

We can also split S/T and Y sites for scoring.

comb_y = comb[comb.site_seq.str[7]=='Y']
comb_st = comb[comb.site_seq.str[7]!='Y']

pspa_y = predict_kinase_df(comb_y,'site_seq',**param_PSPA_y)
pspa_st = predict_kinase_df(comb_st,'site_seq',**param_PSPA_st)

input dataframe has a length 8051
Preprocessing
Finish preprocessing
Merging reference
Finish merging

100%|██████████| 93/93 [00:01<00:00, 72.16it/s]

input dataframe has a length 113368
Preprocessing
Finish preprocessing
Merging reference
Finish merging

100%|██████████| 303/303 [00:38<00:00,  7.79it/s]
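The split above relies on the phospho-acceptor sitting at index 7 of each 15-residue window. The sketch below reproduces that logic on a toy frame with plain pandas; the `CENTER` constant and the `toy` data are illustrative assumptions, and the last two sequences are uppercased versions of rows from the PSP preview.

```python
import pandas as pd

CENTER = 7  # index of the central residue in a 15-residue window

toy = pd.DataFrame({
    "site_seq": [
        "ITGSRLLSMVPGPAR",  # S at the center
        "LAEQAERYDDMAAAM",  # Y at the center
        "AAAMKAVTEQGHELS",  # T at the center
    ]
})

# Central residue of every sequence
center = toy["site_seq"].str[CENTER]

# Tyrosine sites versus serine/threonine sites
toy_y = toy[center == "Y"]
toy_st = toy[center.isin(["S", "T"])]

print(len(toy_y), len(toy_st))
```

Selecting S/T explicitly with `isin` is slightly stricter than `!= 'Y'`, which would also send any unexpected central residue to the S/T subset. For a column that contains only S, T, and Y at the center, the two are equivalent.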

Release RAM for the following tasks

del comb_y, comb_st,pspa_y,pspa_st

Scoring on phosphorylated sequence

psp.head()

	gene	protein	uniprot	site	gene_site	SITE_GRP_ID	species	site_seq	LT_LIT	MS_LIT	MS_CST	CST_CAT#
0	YWHAB	14-3-3 beta	P31946	T2	YWHAB_T2	15718712	human	______MtMDksELV	NaN	3.0	1.0	None
1	YWHAB	14-3-3 beta	P31946	S6	YWHAB_S6	15718709	human	__MtMDksELVQkAk	NaN	8.0	NaN	None
2	YWHAB	14-3-3 beta	P31946	Y21	YWHAB_Y21	3426383	human	LAEQAERyDDMAAAM	NaN	NaN	4.0	None
3	YWHAB	14-3-3 beta	P31946	T32	YWHAB_T32	23077803	human	AAAMkAVtEQGHELs	NaN	NaN	1.0	None
4	YWHAB	14-3-3 beta	P31946	S39	YWHAB_S39	27442700	human	tEQGHELsNEERNLL	NaN	4.0	NaN	None

The sequences appear to be in phosphorylated format (modified residues in lowercase), so we will use param_CDDM.

Running psp directly through predict_kinase_df raises an error, because some site_seq entries do not have S/T/Y at the center. We need to remove those rows before scoring.

psp.site_seq.str[7].value_counts()

site_seq
s    141851
t     58761
y     39367
h        14
k         4
r         3
g         3
p         2
n         1
f         1
l         1
a         1
i         1
d         1
Name: count, dtype: int64

psp = psp[psp.site_seq.str[7].isin(['s','t','y'])]
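The same filter can be tried on a toy frame to see how many rows it removes before applying it to the full dataset. This is a minimal sketch with plain pandas; the `toy` data, `mask`, and `clean` names are illustrative, and the third sequence is a hypothetical row made up to show a non-s/t/y center.

```python
import pandas as pd

toy = pd.DataFrame({
    "site_seq": [
        "______MtMDksELV",  # t at the center: kept
        "LAEQAERyDDMAAAM",  # y at the center: kept
        "AAAMkAVhEQGHELs",  # hypothetical row with h at the center: dropped
    ]
})

# Central residue of each 15-residue window
center = toy["site_seq"].str[7]

# Keep only rows whose central residue is a lowercase s, t, or y
mask = center.isin(["s", "t", "y"])
n_dropped = int((~mask).sum())

# .copy() gives an independent frame rather than a view of toy
clean = toy[mask].copy()

print(f"dropping {n_dropped} of {len(toy)} rows")
```

Reporting the number of dropped rows makes it easy to confirm that the filter removes only the handful of unexpected entries shown in the value counts above.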

cddm = predict_kinase_df(psp,'site_seq',**param_CDDM)

input dataframe has a length 239979
Preprocessing
Finish preprocessing
Merging reference
Finish merging

# PSPA calculation will take longer
pspa = predict_kinase_df(psp,'site_seq',**param_PSPA)

input dataframe has a length 239979
Preprocessing
Finish preprocessing
Merging reference
Finish merging

100%|██████████| 396/396 [03:02<00:00,  2.17it/s]