Get reference score to calculate percentile

Setup

from katlas.imports import *
import pickle, pandas as pd, numpy as np, seaborn as sns, matplotlib.pyplot as plt
from numpy import trapz

Scoring

Download Supplementary Table 3 (sheet 2) from the Nature Ser/Thr kinase paper and Supplementary Table 3 (sheet 3, which includes non-canonical kinases) from the Nature Tyr kinase paper.

These two files are too large to upload to this repository; please download them yourself.

# df = pd.read_csv('supp3_ST.csv')
df = pd.read_csv('supp3_tyr.csv')

Check whether the sequences contain lowercase letters at any position other than the central phospho-acceptor:

df['SITE_+/-7_AA'].str[0:7].str.contains('[a-z]').value_counts()

SITE_+/-7_AA
False    7315
Name: count, dtype: int64

df['SITE_+/-7_AA'].str[8:].str.contains('[a-z]').value_counts()

SITE_+/-7_AA
False    7315
Name: count, dtype: int64

Neither flank contains a lowercase letter, so the sequences in supp3 do not seem to include any phosphorylated residues besides the central acceptor.
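The flank check above can be sketched on a toy example. The two 15-mers below are made up for illustration; the only assumptions are that the acceptor sits at 0-based index 7 and that lowercase marks a phosphorylated residue.

```python
import pandas as pd

# Hypothetical 15-mers; index 7 (0-based) is the central phospho-acceptor.
seqs = pd.Series([
    "PGGNIYIyPLKSPYK",  # only the central acceptor is lowercase
    "PGGNIyIsPLksPyk",  # extra lowercase residues in both flanks
])

# Upstream flank: positions -7..-1 (string indices 0..6)
left_has_lower = seqs.str[0:7].str.contains("[a-z]")
# Downstream flank: positions +1..+7 (string indices 8..14)
right_has_lower = seqs.str[8:].str.contains("[a-z]")

# True when a sequence has a lowercase residue outside the centre
has_extra_phospho = left_has_lower | right_has_lower
print(has_extra_phospho.tolist())
```

Slicing around index 7 leaves the acceptor itself out of the test, so a lowercase central residue does not trigger a match.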

Convert column names

cols=df.columns[df.columns.str.contains('_percentile')][1:]

# get the target
target = df[cols]

target.columns = target.columns.str.split('_').str[0]

The following step applies to the Tyr data only: the kinase names need to be converted so that they are consistent with the column names in the scoring output.

kinase_dict = pd.read_csv('raw/lew_tyr_info.csv')

name_dict = kinase_dict.set_index('lew_kinase2')['kinase_tyr']

target.columns = target.columns.map(name_dict)

target.head()

	ABL1	TNK2	ALK	ABL2	AXL	BLK	BMPR2_TYR	PTK6	BTK	CSF1R	CSK	MATK	DDR1	DDR2	EGFR	EPHA1	EPHA2	EPHA3	EPHA4	EPHA5	EPHA6	EPHA7	EPHA8	EPHB1	EPHB2	EPHB3	EPHB4	BMX	PTK2	FER	FES	FGFR1	FGFR2	FGFR3	FGFR4	FGR	FLT3	FRK	FYN	HCK	ERBB2	ERBB4	IGF1R	INSR	INSRR	ITK	JAK1	JAK2	JAK3	KIT	LCK	LIMK1_TYR	LIMK2_TYR	LTK	LYN	MERTK	MET	MAP2K4_TYR	MAP2K6_TYR	MAP2K7_TYR	MST1R	MUSK	PKMYT1_TYR	NEK10_TYR	PDGFRA	PDGFRB	PDHK1_TYR	PDHK3_TYR	PDHK4_TYR	PINK1_TYR	PTK2B	RET	ROS1	SRC	SRMS	SYK	TEC	TESK1_TYR	TEK	TNK1	TNNI3K_TYR	NTRK1	NTRK2	NTRK3	TXK	TYK2	TYRO3	FLT1	KDR	FLT4	WEE1_TYR	YES1	ZAP70
0	36.18	6.22	27.40	46.44	49.64	87.22	78.99	98.40	71.50	26.37	90.22	93.41	12.52	22.11	79.36	18.85	58.28	37.30	69.07	71.50	24.25	27.75	64.70	41.30	76.43	59.62	42.81	61.77	91.84	95.64	79.32	25.83	41.72	63.80	89.69	94.16	38.89	42.42	99.87	90.09	41.93	58.44	86.36	51.67	57.80	61.20	5.36	6.46	8.84	34.86	83.74	2.54	2.34	54.93	83.98	78.71	27.29	41.43	92.16	51.72	5.28	17.51	12.72	7.70	3.17	22.57	96.65	70.10	74.98	77.46	76.65	39.40	7.86	98.82	88.18	86.48	58.13	0.85	16.63	7.79	0.68	82.19	57.28	57.63	73.82	16.04	27.23	51.22	21.43	37.76	47.45	99.67	62.11
1	72.93	31.82	45.68	60.52	41.74	34.16	40.25	69.07	47.87	38.26	74.41	62.25	69.23	43.84	66.23	61.17	76.28	45.94	51.26	40.58	62.25	46.36	46.36	36.40	31.87	28.93	16.52	36.00	27.03	41.94	60.14	40.12	67.21	57.41	68.86	24.32	50.71	37.53	35.76	50.03	66.64	35.72	42.17	47.12	43.38	24.05	34.25	51.28	47.06	65.70	41.34	54.30	70.01	60.52	45.30	43.86	47.49	52.27	23.03	60.08	59.09	48.17	26.40	96.04	53.40	32.85	61.31	74.96	37.58	50.25	57.78	71.04	17.07	49.22	61.17	30.66	56.25	36.92	16.90	47.01	17.29	44.76	45.41	49.90	44.37	44.52	36.18	64.74	33.12	50.84	40.25	46.09	15.02
2	3.11	75.58	2.80	2.85	23.33	0.41	33.75	14.55	10.83	9.41	26.77	22.19	20.42	25.69	2.52	23.51	5.62	16.81	16.13	8.84	41.58	21.67	6.00	3.44	3.52	8.29	9.37	2.98	9.70	0.35	3.26	9.45	18.80	7.05	1.73	1.53	28.63	7.77	1.10	0.24	2.52	2.78	0.57	1.20	1.29	3.20	9.26	9.02	24.05	10.35	0.61	97.24	92.14	4.11	1.36	2.56	12.32	12.02	28.93	94.92	25.12	15.12	84.33	68.74	50.93	25.13	64.67	37.47	75.49	86.50	2.36	20.36	10.02	0.68	3.96	1.57	3.22	96.45	23.03	86.41	68.24	1.36	1.42	0.33	1.95	29.09	11.86	24.97	29.11	35.98	97.79	1.36	1.03
3	81.77	87.26	41.91	83.34	83.72	54.91	24.01	51.59	63.93	29.72	59.95	66.97	17.75	4.20	16.79	85.64	61.11	78.49	58.44	70.59	68.81	66.71	54.00	61.54	69.60	80.30	84.39	67.56	59.77	60.89	28.37	10.64	9.69	8.71	11.78	59.40	57.10	60.93	32.20	51.06	18.87	28.89	12.23	19.22	12.34	62.79	49.11	23.44	17.31	21.84	71.33	90.33	85.16	58.64	51.30	77.52	41.39	95.03	53.82	54.58	67.23	75.29	59.34	71.70	47.63	40.56	35.11	60.54	30.62	86.12	88.82	45.66	39.27	18.91	60.71	45.68	74.22	63.54	59.31	43.42	44.50	28.26	45.13	39.29	81.42	13.09	75.40	27.66	33.97	18.85	87.66	30.93	45.39
4	30.44	30.38	39.00	26.64	60.36	38.28	88.22	62.60	76.95	80.45	64.19	87.20	56.38	67.89	79.05	60.60	64.35	72.51	55.90	70.15	81.44	70.72	73.69	65.46	76.19	72.31	89.08	67.59	61.44	29.11	44.23	75.31	48.02	52.27	61.68	69.12	79.49	55.66	27.51	46.03	64.98	60.38	65.24	83.98	51.35	79.60	85.99	56.95	93.52	68.90	57.17	48.72	44.74	27.71	40.64	40.27	83.12	33.14	46.18	14.64	85.51	94.37	59.55	61.43	92.23	96.87	27.66	33.00	35.46	34.56	36.97	68.35	83.94	48.41	21.97	68.62	60.93	44.93	86.25	46.66	98.80	70.61	94.84	86.87	56.16	70.65	84.24	80.78	72.93	79.01	95.19	39.35	74.46
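The two renaming steps (strip the `_percentile` suffix, then map to consistent kinase names) can be sketched on a toy frame. The column names and the mapping below are hypothetical stand-ins; the real mapping comes from `raw/lew_tyr_info.csv`.

```python
import pandas as pd

# Toy stand-in for the percentile columns; names are hypothetical.
target = pd.DataFrame({
    "ABL_percentile": [36.18, 72.93],
    "ACK_percentile": [6.22, 31.82],
})

# Keep only the part of each column name before the first underscore
target.columns = target.columns.str.split("_").str[0]

# Hypothetical mapping from the paper's name to the name used for scoring
name_dict = pd.Series({"ABL": "ABL1", "ACK": "TNK2"})
target.columns = target.columns.map(name_dict)

# Names missing from the mapping become NaN, so verify before using them
assert not target.columns.isna().any()
print(target.columns.tolist())
```

The NaN check is worth keeping in practice, since `Index.map` silently returns NaN for any name absent from the mapping.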

Scoring

result = predict_kinase_df(df,'SITE_+/-7_AA',**param_PSPA)

input dataframe has a length 7315
Preprocessing
Finish preprocessing
Merging reference
Finish merging

100%|██████████| 396/396 [00:02<00:00, 157.53it/s]

# get the percentile score
percentile = result[target.columns].rank(axis=0,pct=True)

percentile = (percentile*100).round(2)

percentile.head()

	ABL1	TNK2	ALK	ABL2	AXL	BLK	BMPR2_TYR	PTK6	BTK	CSF1R	CSK	MATK	DDR1	DDR2	EGFR	EPHA1	EPHA2	EPHA3	EPHA4	EPHA5	EPHA6	EPHA7	EPHA8	EPHB1	EPHB2	EPHB3	EPHB4	BMX	PTK2	FER	FES	FGFR1	FGFR2	FGFR3	FGFR4	FGR	FLT3	FRK	FYN	HCK	ERBB2	ERBB4	IGF1R	INSR	INSRR	ITK	JAK1	JAK2	JAK3	KIT	LCK	LIMK1_TYR	LIMK2_TYR	LTK	LYN	MERTK	MET	MAP2K4_TYR	MAP2K6_TYR	MAP2K7_TYR	MST1R	MUSK	PKMYT1_TYR	NEK10_TYR	PDGFRA	PDGFRB	PDHK1_TYR	PDHK3_TYR	PDHK4_TYR	PINK1_TYR	PTK2B	RET	ROS1	SRC	SRMS	SYK	TEC	TESK1_TYR	TEK	TNK1	TNNI3K_TYR	NTRK1	NTRK2	NTRK3	TXK	TYK2	TYRO3	FLT1	KDR	FLT4	WEE1_TYR	YES1	ZAP70
0	36.90	6.23	26.32	46.87	49.48	87.50	78.57	98.49	71.71	25.83	89.54	93.08	12.08	21.67	77.75	18.28	57.33	36.49	67.95	70.62	24.37	27.27	63.62	41.39	75.71	59.68	43.03	62.14	91.13	95.52	78.35	25.27	40.98	62.66	88.20	94.21	36.96	41.16	99.84	90.01	40.20	57.33	85.35	50.85	57.27	61.66	5.32	6.25	8.63	34.05	84.16	3.10	2.80	54.25	83.56	78.68	27.61	42.73	92.39	52.09	5.41	16.96	13.69	8.03	2.92	21.61	96.40	70.79	75.19	77.29	76.60	38.91	7.80	98.78	87.28	85.72	57.67	1.11	16.04	8.06	0.77	80.68	55.93	56.02	74.24	15.55	27.09	50.01	20.96	36.08	48.09	99.69	62.03
1	73.94	31.87	45.08	61.11	41.85	34.85	40.27	69.06	47.25	37.45	73.02	61.91	69.23	43.32	64.44	60.55	74.88	44.68	50.25	39.88	61.72	45.71	45.65	36.37	31.93	28.73	17.05	35.85	26.68	41.67	58.79	39.35	66.12	56.15	67.02	24.81	48.69	36.55	36.03	50.15	65.15	35.11	41.13	46.32	42.75	24.33	33.27	49.55	46.72	63.94	41.44	55.50	71.09	60.08	45.00	43.78	47.36	53.40	23.47	59.51	58.15	47.46	27.73	95.82	52.02	31.86	60.87	75.47	37.70	50.42	57.81	70.94	16.81	49.34	61.06	30.30	55.67	38.15	16.22	47.40	18.19	43.98	44.49	48.59	45.42	42.92	36.49	63.37	32.42	49.25	41.07	46.65	15.55
2	3.01	76.08	2.55	2.80	23.14	0.43	33.66	14.39	10.26	9.00	25.57	21.89	19.71	25.13	2.59	22.71	5.43	16.93	16.00	8.59	41.55	21.48	5.87	3.41	3.63	8.22	9.58	3.06	9.70	0.33	3.30	9.23	18.08	6.86	1.88	1.58	27.13	7.29	1.32	0.28	2.32	2.93	0.57	1.21	1.21	3.34	9.22	8.74	23.86	9.67	0.65	96.85	92.00	3.79	1.32	2.48	12.32	13.02	29.02	94.74	25.00	14.46	84.91	68.87	49.34	24.04	64.31	38.23	75.62	86.30	2.55	20.02	9.95	0.73	4.05	1.62	3.06	96.47	22.48	86.21	68.06	1.51	1.35	0.40	2.00	27.79	11.75	23.90	28.22	34.54	97.48	1.47	1.13
3	82.29	87.57	41.46	83.82	83.94	55.37	23.93	51.19	63.75	29.02	58.52	66.27	17.08	4.27	16.53	85.72	59.79	77.56	57.29	69.70	68.59	65.58	53.55	60.72	68.81	79.94	84.15	67.80	59.45	60.17	27.33	10.42	9.26	8.47	11.31	59.98	55.28	59.82	32.71	51.19	17.87	28.43	11.50	18.82	12.15	63.41	47.95	22.69	17.00	21.00	72.05	90.03	85.13	57.96	51.13	77.29	41.63	95.50	54.62	54.94	66.61	74.57	60.31	71.74	46.24	39.64	35.13	61.16	30.64	86.02	88.81	44.81	39.30	19.16	60.64	44.98	74.25	64.19	59.13	43.81	44.76	27.50	44.24	38.65	81.51	12.64	75.78	26.61	33.20	17.67	86.80	31.91	45.39
4	31.05	30.53	38.42	28.13	60.13	38.78	88.18	62.51	77.00	79.15	62.96	86.69	55.73	67.27	77.48	59.87	63.15	71.18	54.73	69.28	81.18	69.69	72.53	64.53	75.33	71.95	88.88	67.85	60.99	28.97	42.93	74.40	47.06	50.79	60.36	69.69	78.28	55.02	27.99	46.10	63.34	59.35	64.19	82.99	50.90	80.06	85.22	54.84	93.25	67.05	57.78	49.97	47.05	27.16	40.51	40.21	82.30	34.11	46.84	14.78	85.42	94.09	60.42	61.47	91.64	96.68	27.44	33.72	35.55	34.89	37.20	68.28	83.66	48.50	22.19	67.38	60.68	46.77	86.17	47.33	98.80	68.70	94.39	85.68	57.07	68.78	84.63	79.69	72.34	77.55	94.72	40.12	73.87
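To make the percentile step concrete, here is a minimal sketch of `rank(axis=0, pct=True)` on made-up scores. The kinase names and values are illustrative only.

```python
import pandas as pd

# Made-up scores for two hypothetical kinases across four sites
scores = pd.DataFrame({
    "KIN_A": [0.5, 2.0, 1.0, 3.0],
    "KIN_B": [4.0, 1.0, 1.0, 2.0],
})

# Rank within each column (axis=0), expressed as a fraction of the column
# length; ties receive the average of their ranks by default.
percentile = (scores.rank(axis=0, pct=True) * 100).round(2)
print(percentile)
```

Each column is ranked independently, so a site's percentile for one kinase only depends on how the other sites score for that same kinase.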

The calculated percentiles are very similar to the target values; the small differences are likely due to rounding in the raw data.

Compare

(target-percentile).abs().max().sort_values()

MERTK        0.45
PTK2B        0.51
AXL          0.53
SRC          0.64
ROS1         0.65
             ... 
NTRK1        2.18
TESK1_TYR    2.19
JAK2         2.30
FLT3         2.31
LIMK2_TYR    2.67
Length: 93, dtype: float64

There is not much difference between the two: the largest absolute difference for any kinase is 2.67 percentile points (LIMK2_TYR).
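The comparison can be sketched on two small frames. The kinase names are placeholders and the values are toy numbers chosen to resemble the tables above.

```python
import pandas as pd

# Toy published percentiles and toy recomputed percentiles
target = pd.DataFrame({"KIN_A": [36.18, 72.93, 3.11],
                       "KIN_B": [6.22, 31.82, 75.58]})
percentile = pd.DataFrame({"KIN_A": [36.90, 73.94, 3.01],
                           "KIN_B": [6.23, 31.87, 76.08]})

# Largest absolute deviation per kinase, smallest first
max_diff = (target - percentile).abs().max().sort_values()
print(max_diff.round(2))

# Optional sanity check against a tolerance chosen for illustration
tolerance = 3.0
print((max_diff < tolerance).all())
```

Subtraction aligns the two frames on both index and column labels, so the column names must already match for this to compare like with like.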

# save the result for reference
# result.round(3).to_parquet('ochoa_pspa_score.parquet')

To use

We’ve saved the reference scores in Data; they can be loaded with the function below:

pct_ref = Data.get_ochoa_score()

score_df = predict_kinase_df(df,'site_seq',**param_PSPA)

input dataframe has a length 35
Preprocessing
Finish preprocessing
Calculating position: [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4]

  0%|          | 0/303 [00:00<?, ?it/s]/usr/local/lib/python3.9/dist-packages/katlas/core.py:575: RuntimeWarning: divide by zero encountered in log2
  log_sum = np.sum(np.log2(values)) + (len(values) - 1) * np.log2(divide)
100%|██████████| 303/303 [00:00<00:00, 1096.01it/s]

pct = get_pct_df(score_df,pct_ref)

site = 'PGGNIyIsPLksPyk'
get_pct(site,pct_ref)
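Conceptually, a percentile lookup places a new score within the distribution of reference scores for the same kinase. The sketch below shows one way to do that with NumPy. It is not the katlas implementation of `get_pct` or `get_pct_df`; the helper name and the "less than or equal" convention are assumptions made for illustration.

```python
import numpy as np

def percentile_against_reference(score, ref_scores):
    """Percentage of reference scores that are <= score (hypothetical helper)."""
    ref_sorted = np.sort(np.asarray(ref_scores, dtype=float))
    # side="right" counts the reference values less than or equal to score
    n_le = np.searchsorted(ref_sorted, score, side="right")
    return 100.0 * n_le / ref_sorted.size

# Made-up reference scores for one kinase
ref = [0.1, 0.4, 0.9, 1.5, 2.2]
print(percentile_against_reference(1.0, ref))  # 3 of 5 reference scores are <= 1.0
```

Sorting the reference once and using a binary search keeps each lookup cheap when many sites are scored against the same reference.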