Checker and Scoring

from katlas.core import *

Checker

It will check if the middle position is s,t,y.
Check all sequences have same length
Convert non-s,t,y to upper case
Convert rare aa to _

df = Data.get_psp_human_site()

df.head()

	gene	protein	uniprot	site	gene_site	SITE_GRP_ID	species	site_seq	LT_LIT	MS_LIT	MS_CST	CST_CAT#
0	YWHAB	14-3-3 beta	P31946	T2	YWHAB_T2	15718712	human	______MtMDksELV	NaN	3.0	1.0	None
1	YWHAB	14-3-3 beta	P31946	S6	YWHAB_S6	15718709	human	__MtMDksELVQkAk	NaN	8.0	NaN	None
2	YWHAB	14-3-3 beta	P31946	Y21	YWHAB_Y21	3426383	human	LAEQAERyDDMAAAM	NaN	NaN	4.0	None
3	YWHAB	14-3-3 beta	P31946	T32	YWHAB_T32	23077803	human	AAAMkAVtEQGHELs	NaN	NaN	1.0	None
4	YWHAB	14-3-3 beta	P31946	S39	YWHAB_S39	27442700	human	tEQGHELsNEERNLL	NaN	4.0	NaN	None

try:
    check_seq_df(df, 'site_seq')
except Exception as e:
    print(f"Error in check_site: {e}")

Error in check_site: SSEVQFGhAGACANQ has h at position 7; need to have one of 's', 't', or 'y' in the center

df['site_seq'].str[7].value_counts()

site_seq
s    141851
t     58761
y     39367
h        14
k         4
r         3
g         3
p         2
n         1
f         1
l         1
a         1
i         1
d         1
Name: count, dtype: int64

Remove non-sty centered sequences:

df = df[df['site_seq'].str[7].isin(list('sty'))].copy()

df['site_seq'] = check_seq_df(df, 'site_seq')

df['site_seq']

0         ______MtMDKsELV
1         __MtMDKsELVQKAK
2         LAEQAERyDDMAAAM
3         AAAMKAVtEQGHELs
4         tEQGHELsNEERNLL
               ...       
240006    PsAKESAsQHITEEE
240007    GLPARPKsPLDPKKD
240008    LEQLLIKyPPEEVEs
240009    yPPEEVEsRRWQKIA
240010    NTAVEDAsDDESIPI
Name: site_seq, Length: 239979, dtype: object

Scoring

Params()

Available parameter sets:

['PSPA_st', 'PSPA_y', 'PSPA', 'CDDM', 'CDDM_upper']

Single sequence:

for param in ['PSPA', 'CDDM','CDDM_upper']:
    print(predict_kinase('PSVEPPLsQETFSDL',**Params(param)).head())

considering string: ['-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S']
kinase
ATM       5.037
SMG1      4.385
DNAPK     3.818
ATR       3.507
FAM20C    3.170
dtype: float64
considering string: ['-7P', '-6S', '-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S', '6D', '7L']
kinase
ATR      3.064
ATM      2.909
DNAPK    2.270
CK2A1    1.873
TSSK1    1.856
dtype: float64
considering string: ['-7P', '-6S', '-5V', '-4E', '-3P', '-2P', '-1L', '0S', '1Q', '2E', '3T', '4F', '5S', '6D', '7L']
kinase
ATR      3.229
ATM      3.038
DNAPK    2.479
CK2A1    2.006
CDK8     1.999
dtype: float64

Multiple sequences scoring

CDDM:

out_cddm = predict_kinase_df(df.head(500),'site_seq', **Params('CDDM'))

input dataframe has a length 500
Preprocessing
Finish preprocessing
Merging reference
Finish merging
CPU times: user 25.6 ms, sys: 4.73 ms, total: 30.3 ms
Wall time: 29.4 ms

PSPA:

out_cddm = predict_kinase_df(df.head(500),'site_seq', **Params('PSPA'))

input dataframe has a length 500
Preprocessing
Finish preprocessing
Merging reference
Finish merging

100%|██████████| 396/396 [00:00<00:00, 423.14it/s]

CPU times: user 973 ms, sys: 12.4 ms, total: 985 ms
Wall time: 998 ms

PSPA percentile score

Single sequence

st_pct = Data.get_pspa_st_pct()
y_pct = Data.get_pspa_tyr_pct()

a = get_pct('PSVEPPLyPETFSDL',**Params('PSPA_y'), pct_ref=y_pct)
a.sort_values('percentile',ascending=False)

considering string: ['-5V', '-4E', '-3P', '-2P', '-1L', '0Y', '1P', '2E', '3T', '4F', '5S']

	log2(score)	percentile
TEK	1.789	93.820916
PDHK4_TYR	0.851	87.204375
DDR2	0.204	80.929597
CSF1R	1.030	80.423787
PDHK3_TYR	0.477	79.384826
...	...	...
FLT1	-3.121	7.723855
TNNI3K_TYR	-3.979	7.682843
PDGFRB	-3.144	7.177033
PTK6	-3.540	4.743677
MUSK	-5.376	3.773069

93 rows × 2 columns

get_pct('PSVEPPLsQETFSDL',**Params('PSPA_st'), pct_ref=st_pct)

considering string: ['-5V', '-4E', '-3P', '-2P', '-1L', '0S', '1Q', '2E', '3T', '4F']

	log2(score)	percentile
ATM	5.037	99.822351
SMG1	4.385	99.831819
DNAPK	3.818	99.205315
ATR	3.507	99.680344
FAM20C	3.170	95.370556
...	...	...
PKN1	-7.275	14.070436
P70S6K	-7.295	4.089816
AKT3	-7.375	11.432995
PKCI	-7.742	8.129511
NEK3	-8.254	4.637240

303 rows × 2 columns

Multiple sequences

score_df = predict_kinase_df(df.head(1000),'site_seq', **Params('PSPA_st'))

input dataframe has a length 1000
Preprocessing
Finish preprocessing
Merging reference
Finish merging

100%|██████████| 303/303 [00:00<00:00, 387.86it/s]

# get percentile reference
pct_ref = Data.get_pspa_st_pct()

pct = get_pct_df(score_df,pct_ref)
pct # the lower the better

100%|██████████| 303/303 [00:00<00:00, 1095.91it/s]

	AAK1	ACVR2A	ACVR2B	AKT1	AKT2	AKT3	ALK2	ALK4	ALPHAK3	AMPKA1	...	VRK1	VRK2	WNK1	WNK3	WNK4	YANK2	YANK3	YSK1	YSK4	ZAK
0	73.188	87.606	84.729	79.877	79.533	81.850	89.768	91.201	98.426	54.554	...	84.632	77.239	82.520	69.104	87.543	99.299	92.838	89.845	95.202	81.707
1	56.703	95.693	95.585	13.314	9.253	7.540	89.098	91.358	96.091	48.544	...	30.292	25.050	17.132	47.388	20.742	80.747	87.061	8.621	53.530	45.247
2	100.000	100.000	100.000	100.000	100.000	100.000	100.000	100.000	100.000	100.000	...	100.000	100.000	100.000	100.000	100.000	100.000	100.000	100.000	100.000	100.000
3	73.105	82.606	85.710	62.441	67.236	58.350	83.195	88.883	72.216	62.304	...	99.196	88.782	42.032	26.203	48.761	48.276	36.724	68.810	65.439	40.623
4	41.466	93.063	95.262	25.830	32.035	22.619	98.084	96.177	53.175	29.332	...	31.234	6.206	19.819	38.389	18.550	79.177	45.058	10.560	42.568	26.996
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
995	78.612	48.108	49.316	70.027	68.095	73.951	54.940	60.196	67.759	27.998	...	80.102	79.886	48.167	46.783	49.481	90.781	52.997	84.096	44.233	84.789
996	15.366	42.891	37.567	42.280	46.271	53.040	31.365	24.402	46.969	56.985	...	65.541	68.537	28.641	52.885	39.271	75.061	76.763	52.028	81.469	74.303
997	37.572	21.732	22.480	83.025	76.807	79.376	14.507	46.188	3.412	88.018	...	78.811	58.363	95.636	96.562	91.573	81.036	77.758	85.756	75.917	82.245
998	100.000	100.000	100.000	100.000	100.000	100.000	100.000	100.000	100.000	100.000	...	100.000	100.000	100.000	100.000	100.000	100.000	100.000	100.000	100.000	100.000
999	94.640	73.473	82.830	7.624	11.623	5.405	53.584	48.375	94.639	25.493	...	87.950	66.768	8.615	0.937	14.440	14.019	3.233	73.357	35.516	42.516

1000 rows × 303 columns