In this session, instead of scoring sequences one by one, we will score a whole phosphoproteomics dataset at once.
Setup
!pip install python-katlas -Uq
import pandas as pd, numpy as np, seaborn as sns
from matplotlib import pyplot as plt
from katlas.core import *
input dataframe has a length 8051
Preprocessing
Finish preprocessing
Merging reference
Finish merging
100%|██████████| 93/93 [00:01<00:00, 72.16it/s]
input dataframe has a length 113368
Preprocessing
Finish preprocessing
Merging reference
Finish merging
100%|██████████| 303/303 [00:38<00:00, 7.79it/s]
Release RAM for the following tasks
del comb_y, comb_st,pspa_y,pspa_st
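As a small illustration of the memory release step, the sketch below deletes a large temporary table and then asks the garbage collector to reclaim anything left behind. The name `tmp` is a hypothetical stand-in for objects such as `comb_y` or `pspa_st`, and the weak reference is only there to show that the object is really gone; it is not part of the original workflow.

```python
import gc
import weakref

import pandas as pd

# Hypothetical stand-in for a large intermediate table.
tmp = pd.DataFrame({"x": range(100_000)})

# A weak reference lets us observe the object without keeping it alive.
ref = weakref.ref(tmp)

# Drop the only strong reference, then collect any leftover reference cycles.
del tmp
gc.collect()

# True once the DataFrame has been freed.
released = ref() is None
```

Note that `del` only removes the name; the memory is returned once no other variable still points at the same object.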
Scoring on phosphorylated sequence
psp.head()
| | gene | protein | uniprot | site | gene_site | SITE_GRP_ID | species | site_seq | LT_LIT | MS_LIT | MS_CST | CST_CAT# | Ambiguous_Site |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | YWHAB | 14-3-3 beta | P31946 | T2 | YWHAB_T2 | 15718712 | human | ______MtMDksELV | NaN | 3.0 | 1.0 | None | 0 |
| 1 | YWHAB | 14-3-3 beta | P31946 | S6 | YWHAB_S6 | 15718709 | human | __MtMDksELVQkAk | NaN | 8.0 | NaN | None | 0 |
| 2 | YWHAB | 14-3-3 beta | P31946 | Y21 | YWHAB_Y21 | 3426383 | human | LAEQAERyDDMAAAM | NaN | NaN | 4.0 | None | 0 |
| 3 | YWHAB | 14-3-3 beta | P31946 | T32 | YWHAB_T32 | 23077803 | human | AAAMkAVtEQGHELs | NaN | NaN | 1.0 | None | 0 |
| 4 | YWHAB | 14-3-3 beta | P31946 | S39 | YWHAB_S39 | 27442700 | human | tEQGHELsNEERNLL | NaN | 4.0 | NaN | None | 0 |
The sequences appear to be in their phosphorylated form, so we will use param_CDDM.
Running predict_kinase_df directly on psp raises an error, because some site_seq entries do not have s/t/y at the center position. We need to filter these out before processing.
psp.site_seq.str[7].value_counts()
site_seq
s 141851
t 58761
y 39367
h 14
k 4
r 3
g 3
p 2
n 1
f 1
l 1
a 1
i 1
d 1
Name: count, dtype: int64
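The counts above show 32 rows whose center residue is not s, t, or y. One way to drop such rows is sketched below on a toy table, assuming the center of each 15-residue window is at index 7. The DataFrame `toy`, the row `FAKE_H8`, and the names `mask`, `toy_clean`, and `n_dropped` are made up for illustration; on the real data the same mask would be built from `psp.site_seq`.

```python
import pandas as pd

# Toy stand-in for the psp table; the last row is a made-up entry
# with a non-s/t/y residue ('h') at the center.
toy = pd.DataFrame({
    "gene_site": ["YWHAB_T2", "YWHAB_S6", "YWHAB_Y21", "FAKE_H8"],
    "site_seq": [
        "______MtMDksELV",
        "__MtMDksELVQkAk",
        "LAEQAERyDDMAAAM",
        "AAAAAAAhAAAAAAA",
    ],
})

# Center residue of each 15-mer (0-based index 7).
center = toy["site_seq"].str[7]

# Keep only rows whose center is a phosphorylated s, t, or y.
mask = center.isin(["s", "t", "y"])
toy_clean = toy[mask].reset_index(drop=True)

# Number of rows removed by the filter.
n_dropped = int((~mask).sum())
```

Resetting the index after filtering keeps row positions contiguous, which can make later joins against a score table easier to follow.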