Get reference score to calculate percentile

Setup

from katlas.imports import *
import pickle, pandas as pd, numpy as np, seaborn as sns, matplotlib.pyplot as plt
from numpy import trapz

Scoring

Download Supplementary Table 3 (sheet 2) from the Nature Ser/Thr kinase paper and Supplementary Table 3 (sheet 3, which includes non-canonical kinases) from the Nature Tyr kinase paper.

These two files are too big to upload to the current repository, so please download them yourself. The code below reads them as supp3_ST.csv and supp3_tyr.csv.

# df = pd.read_csv('supp3_ST.csv')
df = pd.read_csv('supp3_tyr.csv')

Check whether the sequences contain lowercase letters at any position other than the central phospho-acceptor:

df['SITE_+/-7_AA'].str[0:7].str.contains('[a-z]').value_counts()
SITE_+/-7_AA
False    7315
Name: count, dtype: int64
df['SITE_+/-7_AA'].str[8:].str.contains('[a-z]').value_counts()
SITE_+/-7_AA
False    7315
Name: count, dtype: int64

The sequences in supp3 do not appear to contain any phosphorylated residues in the flanking positions: no lowercase letters were found on either side of the phospho-acceptor.
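The check above can be tried on a small toy table. This is a minimal sketch, assuming 15-residue windows with the phospho-acceptor at index 7; the three sequences are made up for illustration and do not come from the supplementary tables.

```python
import pandas as pd

# Toy 15-mers (made up); index 7 is the central phospho-acceptor.
toy = pd.DataFrame({
    "SITE_+/-7_AA": [
        "PGGNIYIsPLKSPYK",  # lowercase only at the center
        "AAAAAAAyAAAAAAA",  # lowercase only at the center
        "AAAsAAAtAAAAAAA",  # extra lowercase residue upstream of the center
    ]
})
seq = toy["SITE_+/-7_AA"]

# Positions -7..-1 are characters 0..6; positions +1..+7 are characters 8..14.
upstream_has_lower = seq.str[0:7].str.contains("[a-z]")
downstream_has_lower = seq.str[8:].str.contains("[a-z]")

# True where a flanking position carries a lowercase letter.
flanks_have_lower = upstream_has_lower | downstream_has_lower
print(flanks_have_lower.value_counts())
```

Combining both flanks into a single boolean column makes it easy to filter out the offending rows if any are found.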

Convert column names

cols=df.columns[df.columns.str.contains('_percentile')][1:]

# get the target
target = df[cols]
target.columns = target.columns.str.split('_').str[0]

The step below applies to the Tyr table only, because its kinase names must be mapped to names that match the columns of the scoring output.

kinase_dict = pd.read_csv('raw/lew_tyr_info.csv')
name_dict = kinase_dict.set_index('lew_kinase2')['kinase_tyr']
target.columns = target.columns.map(name_dict)
target.head()
ABL1 TNK2 ALK ABL2 AXL BLK BMPR2_TYR PTK6 BTK CSF1R CSK MATK DDR1 DDR2 EGFR EPHA1 EPHA2 EPHA3 EPHA4 EPHA5 EPHA6 EPHA7 EPHA8 EPHB1 EPHB2 EPHB3 EPHB4 BMX PTK2 FER FES FGFR1 FGFR2 FGFR3 FGFR4 FGR FLT3 FRK FYN HCK ERBB2 ERBB4 IGF1R INSR INSRR ITK JAK1 JAK2 JAK3 KIT LCK LIMK1_TYR LIMK2_TYR LTK LYN MERTK MET MAP2K4_TYR MAP2K6_TYR MAP2K7_TYR MST1R MUSK PKMYT1_TYR NEK10_TYR PDGFRA PDGFRB PDHK1_TYR PDHK3_TYR PDHK4_TYR PINK1_TYR PTK2B RET ROS1 SRC SRMS SYK TEC TESK1_TYR TEK TNK1 TNNI3K_TYR NTRK1 NTRK2 NTRK3 TXK TYK2 TYRO3 FLT1 KDR FLT4 WEE1_TYR YES1 ZAP70
0 36.18 6.22 27.40 46.44 49.64 87.22 78.99 98.40 71.50 26.37 90.22 93.41 12.52 22.11 79.36 18.85 58.28 37.30 69.07 71.50 24.25 27.75 64.70 41.30 76.43 59.62 42.81 61.77 91.84 95.64 79.32 25.83 41.72 63.80 89.69 94.16 38.89 42.42 99.87 90.09 41.93 58.44 86.36 51.67 57.80 61.20 5.36 6.46 8.84 34.86 83.74 2.54 2.34 54.93 83.98 78.71 27.29 41.43 92.16 51.72 5.28 17.51 12.72 7.70 3.17 22.57 96.65 70.10 74.98 77.46 76.65 39.40 7.86 98.82 88.18 86.48 58.13 0.85 16.63 7.79 0.68 82.19 57.28 57.63 73.82 16.04 27.23 51.22 21.43 37.76 47.45 99.67 62.11
1 72.93 31.82 45.68 60.52 41.74 34.16 40.25 69.07 47.87 38.26 74.41 62.25 69.23 43.84 66.23 61.17 76.28 45.94 51.26 40.58 62.25 46.36 46.36 36.40 31.87 28.93 16.52 36.00 27.03 41.94 60.14 40.12 67.21 57.41 68.86 24.32 50.71 37.53 35.76 50.03 66.64 35.72 42.17 47.12 43.38 24.05 34.25 51.28 47.06 65.70 41.34 54.30 70.01 60.52 45.30 43.86 47.49 52.27 23.03 60.08 59.09 48.17 26.40 96.04 53.40 32.85 61.31 74.96 37.58 50.25 57.78 71.04 17.07 49.22 61.17 30.66 56.25 36.92 16.90 47.01 17.29 44.76 45.41 49.90 44.37 44.52 36.18 64.74 33.12 50.84 40.25 46.09 15.02
2 3.11 75.58 2.80 2.85 23.33 0.41 33.75 14.55 10.83 9.41 26.77 22.19 20.42 25.69 2.52 23.51 5.62 16.81 16.13 8.84 41.58 21.67 6.00 3.44 3.52 8.29 9.37 2.98 9.70 0.35 3.26 9.45 18.80 7.05 1.73 1.53 28.63 7.77 1.10 0.24 2.52 2.78 0.57 1.20 1.29 3.20 9.26 9.02 24.05 10.35 0.61 97.24 92.14 4.11 1.36 2.56 12.32 12.02 28.93 94.92 25.12 15.12 84.33 68.74 50.93 25.13 64.67 37.47 75.49 86.50 2.36 20.36 10.02 0.68 3.96 1.57 3.22 96.45 23.03 86.41 68.24 1.36 1.42 0.33 1.95 29.09 11.86 24.97 29.11 35.98 97.79 1.36 1.03
3 81.77 87.26 41.91 83.34 83.72 54.91 24.01 51.59 63.93 29.72 59.95 66.97 17.75 4.20 16.79 85.64 61.11 78.49 58.44 70.59 68.81 66.71 54.00 61.54 69.60 80.30 84.39 67.56 59.77 60.89 28.37 10.64 9.69 8.71 11.78 59.40 57.10 60.93 32.20 51.06 18.87 28.89 12.23 19.22 12.34 62.79 49.11 23.44 17.31 21.84 71.33 90.33 85.16 58.64 51.30 77.52 41.39 95.03 53.82 54.58 67.23 75.29 59.34 71.70 47.63 40.56 35.11 60.54 30.62 86.12 88.82 45.66 39.27 18.91 60.71 45.68 74.22 63.54 59.31 43.42 44.50 28.26 45.13 39.29 81.42 13.09 75.40 27.66 33.97 18.85 87.66 30.93 45.39
4 30.44 30.38 39.00 26.64 60.36 38.28 88.22 62.60 76.95 80.45 64.19 87.20 56.38 67.89 79.05 60.60 64.35 72.51 55.90 70.15 81.44 70.72 73.69 65.46 76.19 72.31 89.08 67.59 61.44 29.11 44.23 75.31 48.02 52.27 61.68 69.12 79.49 55.66 27.51 46.03 64.98 60.38 65.24 83.98 51.35 79.60 85.99 56.95 93.52 68.90 57.17 48.72 44.74 27.71 40.64 40.27 83.12 33.14 46.18 14.64 85.51 94.37 59.55 61.43 92.23 96.87 27.66 33.00 35.46 34.56 36.97 68.35 83.94 48.41 21.97 68.62 60.93 44.93 86.25 46.66 98.80 70.61 94.84 86.87 56.16 70.65 84.24 80.78 72.93 79.01 95.19 39.35 74.46
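When column names are mapped through a pandas Series, any name missing from the mapping becomes NaN without raising an error. The sketch below shows one way to detect such cases before continuing; the kinase names in it are hypothetical placeholders, not entries from lew_tyr_info.csv.

```python
import pandas as pd

# Hypothetical mapping from source names to target names.
name_dict = pd.Series({"KIN_A_OLD": "KIN_A", "KIN_B_OLD": "KIN_B"})

# Column names to translate; the last one has no entry in the mapping.
cols = pd.Index(["KIN_A_OLD", "KIN_B_OLD", "KIN_C_OLD"])

# Index.map with a Series returns NaN for names that are not found.
mapped = cols.map(name_dict)

# Collect the original names that failed to map.
unmapped = cols[mapped.isna()]
print(list(unmapped))
```

An empty `unmapped` index would indicate that every column was renamed.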

Scoring

result = predict_kinase_df(df,'SITE_+/-7_AA',**param_PSPA)
input dataframe has a length 7315
Preprocessing
Finish preprocessing
Merging reference
Finish merging
100%|██████████| 396/396 [00:02<00:00, 157.53it/s]
# get the percentile score
percentile = result[target.columns].rank(axis=0,pct=True)
percentile = (percentile*100).round(2)
percentile.head()
ABL1 TNK2 ALK ABL2 AXL BLK BMPR2_TYR PTK6 BTK CSF1R CSK MATK DDR1 DDR2 EGFR EPHA1 EPHA2 EPHA3 EPHA4 EPHA5 EPHA6 EPHA7 EPHA8 EPHB1 EPHB2 EPHB3 EPHB4 BMX PTK2 FER FES FGFR1 FGFR2 FGFR3 FGFR4 FGR FLT3 FRK FYN HCK ERBB2 ERBB4 IGF1R INSR INSRR ITK JAK1 JAK2 JAK3 KIT LCK LIMK1_TYR LIMK2_TYR LTK LYN MERTK MET MAP2K4_TYR MAP2K6_TYR MAP2K7_TYR MST1R MUSK PKMYT1_TYR NEK10_TYR PDGFRA PDGFRB PDHK1_TYR PDHK3_TYR PDHK4_TYR PINK1_TYR PTK2B RET ROS1 SRC SRMS SYK TEC TESK1_TYR TEK TNK1 TNNI3K_TYR NTRK1 NTRK2 NTRK3 TXK TYK2 TYRO3 FLT1 KDR FLT4 WEE1_TYR YES1 ZAP70
0 36.90 6.23 26.32 46.87 49.48 87.50 78.57 98.49 71.71 25.83 89.54 93.08 12.08 21.67 77.75 18.28 57.33 36.49 67.95 70.62 24.37 27.27 63.62 41.39 75.71 59.68 43.03 62.14 91.13 95.52 78.35 25.27 40.98 62.66 88.20 94.21 36.96 41.16 99.84 90.01 40.20 57.33 85.35 50.85 57.27 61.66 5.32 6.25 8.63 34.05 84.16 3.10 2.80 54.25 83.56 78.68 27.61 42.73 92.39 52.09 5.41 16.96 13.69 8.03 2.92 21.61 96.40 70.79 75.19 77.29 76.60 38.91 7.80 98.78 87.28 85.72 57.67 1.11 16.04 8.06 0.77 80.68 55.93 56.02 74.24 15.55 27.09 50.01 20.96 36.08 48.09 99.69 62.03
1 73.94 31.87 45.08 61.11 41.85 34.85 40.27 69.06 47.25 37.45 73.02 61.91 69.23 43.32 64.44 60.55 74.88 44.68 50.25 39.88 61.72 45.71 45.65 36.37 31.93 28.73 17.05 35.85 26.68 41.67 58.79 39.35 66.12 56.15 67.02 24.81 48.69 36.55 36.03 50.15 65.15 35.11 41.13 46.32 42.75 24.33 33.27 49.55 46.72 63.94 41.44 55.50 71.09 60.08 45.00 43.78 47.36 53.40 23.47 59.51 58.15 47.46 27.73 95.82 52.02 31.86 60.87 75.47 37.70 50.42 57.81 70.94 16.81 49.34 61.06 30.30 55.67 38.15 16.22 47.40 18.19 43.98 44.49 48.59 45.42 42.92 36.49 63.37 32.42 49.25 41.07 46.65 15.55
2 3.01 76.08 2.55 2.80 23.14 0.43 33.66 14.39 10.26 9.00 25.57 21.89 19.71 25.13 2.59 22.71 5.43 16.93 16.00 8.59 41.55 21.48 5.87 3.41 3.63 8.22 9.58 3.06 9.70 0.33 3.30 9.23 18.08 6.86 1.88 1.58 27.13 7.29 1.32 0.28 2.32 2.93 0.57 1.21 1.21 3.34 9.22 8.74 23.86 9.67 0.65 96.85 92.00 3.79 1.32 2.48 12.32 13.02 29.02 94.74 25.00 14.46 84.91 68.87 49.34 24.04 64.31 38.23 75.62 86.30 2.55 20.02 9.95 0.73 4.05 1.62 3.06 96.47 22.48 86.21 68.06 1.51 1.35 0.40 2.00 27.79 11.75 23.90 28.22 34.54 97.48 1.47 1.13
3 82.29 87.57 41.46 83.82 83.94 55.37 23.93 51.19 63.75 29.02 58.52 66.27 17.08 4.27 16.53 85.72 59.79 77.56 57.29 69.70 68.59 65.58 53.55 60.72 68.81 79.94 84.15 67.80 59.45 60.17 27.33 10.42 9.26 8.47 11.31 59.98 55.28 59.82 32.71 51.19 17.87 28.43 11.50 18.82 12.15 63.41 47.95 22.69 17.00 21.00 72.05 90.03 85.13 57.96 51.13 77.29 41.63 95.50 54.62 54.94 66.61 74.57 60.31 71.74 46.24 39.64 35.13 61.16 30.64 86.02 88.81 44.81 39.30 19.16 60.64 44.98 74.25 64.19 59.13 43.81 44.76 27.50 44.24 38.65 81.51 12.64 75.78 26.61 33.20 17.67 86.80 31.91 45.39
4 31.05 30.53 38.42 28.13 60.13 38.78 88.18 62.51 77.00 79.15 62.96 86.69 55.73 67.27 77.48 59.87 63.15 71.18 54.73 69.28 81.18 69.69 72.53 64.53 75.33 71.95 88.88 67.85 60.99 28.97 42.93 74.40 47.06 50.79 60.36 69.69 78.28 55.02 27.99 46.10 63.34 59.35 64.19 82.99 50.90 80.06 85.22 54.84 93.25 67.05 57.78 49.97 47.05 27.16 40.51 40.21 82.30 34.11 46.84 14.78 85.42 94.09 60.42 61.47 91.64 96.68 27.44 33.72 35.55 34.89 37.20 68.28 83.66 48.50 22.19 67.38 60.68 46.77 86.17 47.33 98.80 68.70 94.39 85.68 57.07 68.78 84.63 79.69 72.34 77.55 94.72 40.12 73.87
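The percentile step ranks each kinase column independently and expresses the rank as a fraction of the number of rows. A minimal sketch on made-up scores (the column names and values are illustrative only) shows how ties are handled under the pandas default, which assigns tied values their average rank:

```python
import pandas as pd

# Made-up scores for two hypothetical kinases across four sites.
scores = pd.DataFrame({
    "KIN_A": [0.5, 2.0, 1.0, 3.0],
    "KIN_B": [4.0, 1.0, 1.0, 2.0],  # two tied values
})

# Rank within each column, convert to a 0-100 percentile, round to 2 decimals.
percentile = (scores.rank(axis=0, pct=True) * 100).round(2)
print(percentile)
```

In `KIN_B`, the two tied scores share ranks 1 and 2, so each receives the average rank 1.5, which is 37.5 after dividing by four rows and scaling to 100.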

Comparing the calculated percentiles with the target shows that they are very similar; the differences are attributed to rounding in the raw data.

Compare

(target-percentile).abs().max().sort_values()
MERTK        0.45
PTK2B        0.51
AXL          0.53
SRC          0.64
ROS1         0.65
             ... 
NTRK1        2.18
TESK1_TYR    2.19
JAK2         2.30
FLT3         2.31
LIMK2_TYR    2.67
Length: 93, dtype: float64

There is not much difference between the two: the largest absolute difference for any kinase is 2.67 percentile points (LIMK2_TYR).
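The comparison relies on pandas aligning the two tables by row index and column name before subtracting. A small sketch with made-up numbers (the kinase names and values are illustrative, not taken from the tables above) reproduces the per-kinase maximum absolute difference:

```python
import pandas as pd

# Made-up published and recomputed percentiles for two hypothetical kinases.
target = pd.DataFrame({"KIN_A": [10.0, 50.0], "KIN_B": [20.0, 80.0]})
percentile = pd.DataFrame({"KIN_A": [10.5, 49.0], "KIN_B": [20.25, 80.0]})

# Largest absolute deviation per kinase, smallest first.
max_abs_diff = (target - percentile).abs().max().sort_values()
print(max_abs_diff)

# A simple threshold check; the 3-point tolerance here is an arbitrary choice.
within_tolerance = bool((max_abs_diff < 3).all())
```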

# save the result for reference
# result.round(3).to_parquet('ochoa_pspa_score.parquet')

To use

We’ve saved the reference scores in Data; they can be loaded with the function below:

pct_ref = Data.get_ochoa_score()
score_df = predict_kinase_df(df,'site_seq',**param_PSPA)
input dataframe has a length 35
Preprocessing
Finish preprocessing
Calculating position: [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4]
  0%|          | 0/303 [00:00<?, ?it/s]
/usr/local/lib/python3.9/dist-packages/katlas/core.py:575: RuntimeWarning: divide by zero encountered in log2
  log_sum = np.sum(np.log2(values)) + (len(values) - 1) * np.log2(divide)
100%|██████████| 303/303 [00:00<00:00, 1096.01it/s]
pct = get_pct_df(score_df,pct_ref)
site = 'PGGNIyIsPLksPyk'
get_pct(site,pct_ref)
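The internals of `get_pct` and `get_pct_df` are not shown here. As a rough illustration of the general idea only, the sketch below computes where a new score falls within a reference distribution. The function name `empirical_percentile`, the "at or below" convention, and the reference values are all assumptions made for this example, not the katlas implementation.

```python
import numpy as np
import pandas as pd


def empirical_percentile(score, ref_scores):
    """Percent of reference scores at or below `score` (hypothetical helper)."""
    ref = np.asarray(ref_scores, dtype=float)
    return 100.0 * np.count_nonzero(ref <= score) / ref.size


# Made-up reference scores for a single hypothetical kinase.
ref = pd.Series([0.1, 0.4, 0.7, 1.2, 2.5])

# Three of the five reference scores are at or below 1.0.
pct_example = empirical_percentile(1.0, ref)
print(pct_example)
```

Other conventions, such as counting only strictly smaller values or averaging ties as `rank(pct=True)` does, give slightly different numbers near tied scores.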