import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from katlas.core import *
Phosphoproteomics scoring
In this session, instead of scoring sequences one by one, we score an entire phosphoproteomics dataset at once.
Setup
Run the command below to install the package:
!pip install python-katlas
Phosphoproteomics dataset
Three datasets are available:
- PhosphoSitePlus human (2024 Jan)
- Ochoa et al. human phosphoproteome
- Combined Ochoa and PSP low-throughput data
psp = Data.get_psp_human_site()
ochoa = Data.get_ochoa_site()
comb = Data.get_combine_site_psp_ochoa()
psp.head()
| | gene | protein | uniprot | site | gene_site | SITE_GRP_ID | species | site_seq | LT_LIT | MS_LIT | MS_CST | CST_CAT# | Ambiguous_Site |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | YWHAB | 14-3-3 beta | P31946 | T2 | YWHAB_T2 | 15718712 | human | ______MtMDksELV | NaN | 3.0 | 1.0 | None | 0 |
1 | YWHAB | 14-3-3 beta | P31946 | S6 | YWHAB_S6 | 15718709 | human | __MtMDksELVQkAk | NaN | 8.0 | NaN | None | 0 |
2 | YWHAB | 14-3-3 beta | P31946 | Y21 | YWHAB_Y21 | 3426383 | human | LAEQAERyDDMAAAM | NaN | NaN | 4.0 | None | 0 |
3 | YWHAB | 14-3-3 beta | P31946 | T32 | YWHAB_T32 | 23077803 | human | AAAMkAVtEQGHELs | NaN | NaN | 1.0 | None | 0 |
4 | YWHAB | 14-3-3 beta | P31946 | S39 | YWHAB_S39 | 27442700 | human | tEQGHELsNEERNLL | NaN | 4.0 | NaN | None | 0 |
ochoa.head()
| | uniprot | position | residue | is_disopred | disopred_score | log10_hotspot_pval_min | isHotspot | uniprot_position | functional_score | current_uniprot | name | gene | Sequence | is_valid | site_seq | gene_site |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | A0A075B6Q4 | 24 | S | True | 0.91 | 6.839384 | True | A0A075B6Q4_24 | 0.149257 | A0A075B6Q4 | A0A075B6Q4_HUMAN | None | MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... | True | VDDEKGDSNDDYDSA | A0A075B6Q4_S24 |
1 | A0A075B6Q4 | 35 | S | True | 0.87 | 9.192622 | False | A0A075B6Q4_35 | 0.136966 | A0A075B6Q4 | A0A075B6Q4_HUMAN | None | MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... | True | YDSAGLLSDEDCMSV | A0A075B6Q4_S35 |
2 | A0A075B6Q4 | 57 | S | False | 0.28 | 0.818834 | False | A0A075B6Q4_57 | 0.125364 | A0A075B6Q4 | A0A075B6Q4_HUMAN | None | MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... | True | IADHLFWSEETKSRF | A0A075B6Q4_S57 |
3 | A0A075B6Q4 | 68 | S | False | 0.03 | 0.375986 | False | A0A075B6Q4_68 | 0.119811 | A0A075B6Q4 | A0A075B6Q4_HUMAN | None | MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... | True | KSRFTEYSMTSSVMR | A0A075B6Q4_S68 |
4 | A0A075B6Q4 | 71 | S | False | 0.05 | 0.000000 | False | A0A075B6Q4_71 | 0.095193 | A0A075B6Q4 | A0A075B6Q4_HUMAN | None | MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... | True | FTEYSMTSSVMRRNE | A0A075B6Q4_S71 |
comb.head()
| | uniprot | gene | site | site_seq | source | AM_pathogenicity | CDDM_upper | CDDM_max_score |
---|---|---|---|---|---|---|---|---|
0 | A0A024R4G9 | C19orf48 | S20 | ITGSRLLSMVPGPAR | psp | NaN | PRKX,AKT1,PKG1,P90RSK,HIPK4,AKT3,HIPK1,PKACB,H... | 2.407041 |
1 | A0A075B6Q4 | None | S24 | VDDEKGDSNDDYDSA | ochoa | NaN | CK2A2,CK2A1,GRK7,GRK5,CK1G1,CK1A,IKKA,CK1G2,CA... | 2.295654 |
2 | A0A075B6Q4 | None | S35 | YDSAGLLSDEDCMSV | ochoa | NaN | CK2A2,CK2A1,IKKA,ATM,IKKB,CAMK1D,MARK2,GRK7,IK... | 2.488683 |
3 | A0A075B6Q4 | None | S57 | IADHLFWSEETKSRF | ochoa | NaN | GRK7,CK2A1,CK2A2,PKN2,GRK1,GRK5,MARK1,MARK2,UL... | 1.851894 |
4 | A0A075B6Q4 | None | S68 | KSRFTEYSMTSSVMR | ochoa | NaN | AKT1,P90RSK,AKT3,SGK1,AKT2,NDR2,RSK2,P70S6K,RS... | 2.026384 |
Scoring on all-capital sequences
comb.head()
| | uniprot | gene | site | site_seq | source | AM_pathogenicity | CDDM_upper | CDDM_max_score |
---|---|---|---|---|---|---|---|---|
0 | A0A024R4G9 | C19orf48 | S20 | ITGSRLLSMVPGPAR | psp | NaN | PRKX,AKT1,PKG1,P90RSK,HIPK4,AKT3,HIPK1,PKACB,H... | 2.407041 |
1 | A0A075B6Q4 | None | S24 | VDDEKGDSNDDYDSA | ochoa | NaN | CK2A2,CK2A1,GRK7,GRK5,CK1G1,CK1A,IKKA,CK1G2,CA... | 2.295654 |
2 | A0A075B6Q4 | None | S35 | YDSAGLLSDEDCMSV | ochoa | NaN | CK2A2,CK2A1,IKKA,ATM,IKKB,CAMK1D,MARK2,GRK7,IK... | 2.488683 |
3 | A0A075B6Q4 | None | S57 | IADHLFWSEETKSRF | ochoa | NaN | GRK7,CK2A1,CK2A2,PKN2,GRK1,GRK5,MARK1,MARK2,UL... | 1.851894 |
4 | A0A075B6Q4 | None | S68 | KSRFTEYSMTSSVMR | ochoa | NaN | AKT1,P90RSK,AKT3,SGK1,AKT2,NDR2,RSK2,P70S6K,RS... | 2.026384 |
The sequences are in all-capital format, so we will use Params('CDDM_upper').
cddm = predict_kinase_df(comb,'site_seq',**Params('CDDM_upper'))
input dataframe has a length 121419
Preprocessing
Finish preprocessing
Merging reference
Finish merging
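The choice between the two CDDM parameter sets can be made programmatically rather than by eye. The sketch below is a minimal, pure-Python illustration; `suggest_cddm_params` is a hypothetical helper and is not part of katlas. It assumes that any lowercase letter in a site sequence means the dataset uses the phosphorylated (lowercase) notation.

```python
def suggest_cddm_params(seqs):
    """Return the name of the CDDM parameter set that matches the notation.

    Hypothetical helper (not a katlas function). Assumption: any lowercase
    letter means the sequences use the phosphorylated notation.
    """
    has_lowercase = any(ch.islower() for seq in seqs for ch in seq)
    return "CDDM" if has_lowercase else "CDDM_upper"


# Sequences copied from the comb and psp previews shown in this session
upper_examples = ["ITGSRLLSMVPGPAR", "VDDEKGDSNDDYDSA"]
phospho_examples = ["______MtMDksELV", "AAAMkAVtEQGHELs"]

print(suggest_cddm_params(upper_examples))
print(suggest_cddm_params(phospho_examples))
```

For a DataFrame, the same check could be run on a column, for example `suggest_cddm_params(comb.site_seq)`, and the returned name passed to `Params`.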
# PSPA involves more calculation, will take longer
pspa = predict_kinase_df(comb,'site_seq',**Params('PSPA'))
input dataframe has a length 121419
Preprocessing
Finish preprocessing
Merging reference
Finish merging
100%|██████████| 396/396 [00:44<00:00, 8.86it/s]
We can also split S/T and Y sites for scoring.
comb_y = comb[comb.site_seq.str[7]=='Y']
comb_st = comb[comb.site_seq.str[7]!='Y']
pspa_y = predict_kinase_df(comb_y,'site_seq',**Params('PSPA_y'))
pspa_st = predict_kinase_df(comb_st,'site_seq',**Params('PSPA_st'))
input dataframe has a length 8051
Preprocessing
Finish preprocessing
Merging reference
Finish merging
100%|██████████| 93/93 [00:00<00:00, 227.27it/s]
input dataframe has a length 113368
Preprocessing
Finish preprocessing
Merging reference
Finish merging
100%|██████████| 303/303 [00:15<00:00, 19.77it/s]
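The split above relies on the phospho-acceptor sitting at index 7, the middle of each 15-residue window. The following pure-Python sketch shows the same partition on a few sequences; `split_by_center` is a hypothetical helper, and the third input is an uppercase copy of a PSP sequence used here only for illustration.

```python
CENTER = 7  # zero-based index of the middle residue in a 15-residue window


def split_by_center(seqs):
    """Partition sequences into (tyrosine sites, everything else).

    Hypothetical helper mirroring the ==/!= 'Y' filters used on comb.
    """
    tyr, ser_thr = [], []
    for seq in seqs:
        if seq[CENTER] == "Y":
            tyr.append(seq)
        else:
            ser_thr.append(seq)
    return tyr, ser_thr


seqs = ["ITGSRLLSMVPGPAR", "VDDEKGDSNDDYDSA", "LAEQAERYDDMAAAM"]
tyr, ser_thr = split_by_center(seqs)
print(tyr)      # sequences with Y at the center
print(ser_thr)  # remaining sequences, treated as S/T sites
```

As with the DataFrame version, anything that is not Y at the center lands in the S/T group, so this split assumes the input contains only S, T, or Y at index 7.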
Free memory before the following tasks:
del comb_y, comb_st,pspa_y,pspa_st
Scoring on phosphorylated sequences
psp.head()
| | gene | protein | uniprot | site | gene_site | SITE_GRP_ID | species | site_seq | LT_LIT | MS_LIT | MS_CST | CST_CAT# | Ambiguous_Site |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | YWHAB | 14-3-3 beta | P31946 | T2 | YWHAB_T2 | 15718712 | human | ______MtMDksELV | NaN | 3.0 | 1.0 | None | 0 |
1 | YWHAB | 14-3-3 beta | P31946 | S6 | YWHAB_S6 | 15718709 | human | __MtMDksELVQkAk | NaN | 8.0 | NaN | None | 0 |
2 | YWHAB | 14-3-3 beta | P31946 | Y21 | YWHAB_Y21 | 3426383 | human | LAEQAERyDDMAAAM | NaN | NaN | 4.0 | None | 0 |
3 | YWHAB | 14-3-3 beta | P31946 | T32 | YWHAB_T32 | 23077803 | human | AAAMkAVtEQGHELs | NaN | NaN | 1.0 | None | 0 |
4 | YWHAB | 14-3-3 beta | P31946 | S39 | YWHAB_S39 | 27442700 | human | tEQGHELsNEERNLL | NaN | 4.0 | NaN | None | 0 |
The sequences are in phosphorylated (lowercase) notation, so we will use Params('CDDM').
Running psp directly through predict_kinase_df raises an error, because some site_seq entries do not have s/t/y at the center position. We need to remove them before scoring.
psp.site_seq.str[7].value_counts()
site_seq
s 141851
t 58761
y 39367
h 14
k 4
r 3
g 3
p 2
n 1
f 1
l 1
a 1
i 1
d 1
Name: count, dtype: int64
psp = psp[psp.site_seq.str[7].isin(['s','t','y'])]
cddm = predict_kinase_df(psp,'site_seq',**Params('CDDM'))
input dataframe has a length 239979
Preprocessing
Finish preprocessing
Merging reference
Finish merging
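The cleaning step used above (count the center residues, then keep only s/t/y) can be sketched without pandas as follows. The helper names are hypothetical, and the last input sequence is invented purely to show an invalid center residue.

```python
from collections import Counter

CENTER = 7                      # zero-based index of the middle residue
VALID_CENTERS = {"s", "t", "y"}  # lowercase phospho-acceptors


def center_counts(seqs):
    """Count the residue found at the center of each sequence."""
    return Counter(seq[CENTER] for seq in seqs)


def keep_valid_sites(seqs):
    """Keep only sequences whose center residue is s, t, or y."""
    return [seq for seq in seqs if seq[CENTER] in VALID_CENTERS]


seqs = [
    "______MtMDksELV",   # from the psp preview
    "LAEQAERyDDMAAAM",   # from the psp preview
    "AAAMkAVtEQGHELs",   # from the psp preview
    "AAAAAAAhAAAAAAA",   # invented example with an invalid center
]

print(center_counts(seqs))
print(keep_valid_sites(seqs))
```

Checking the counts before and after filtering is a quick sanity test: in the dataset above, the s, t, and y counts (141851 + 58761 + 39367) add up to the 239979 rows reported by predict_kinase_df.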
# PSPA calculation will take longer
pspa = predict_kinase_df(psp,'site_seq',**Params('PSPA'))
input dataframe has a length 239979
Preprocessing
Finish preprocessing
Merging reference
Finish merging
100%|██████████| 396/396 [01:34<00:00, 4.19it/s]