from katlas.imports import *
import pickle, pandas as pd, numpy as np, seaborn as sns, matplotlib.pyplot as plt
from numpy import trapz
Get reference score to calculate percentile
Setup
Scoring
Download Supplementary Table 3 (sheet 2) from the Nature Ser/Thr kinase paper and Supplementary Table 3 (sheet 3, with non-canonical kinases) from the Nature Tyr kinase paper.
These two files are too large to upload to this repository, so please download them yourself.
# df = pd.read_csv('supp3_ST.csv')
df = pd.read_csv('supp3_tyr.csv')
Check whether the sequences contain lowercase letters at positions other than the phospho-acceptor:
df['SITE_+/-7_AA'].str[0:7].str.contains('[a-z]').value_counts()
SITE_+/-7_AA
False    7315
Name: count, dtype: int64
df['SITE_+/-7_AA'].str[8:].str.contains('[a-z]').value_counts()
SITE_+/-7_AA
False    7315
Name: count, dtype: int64
It seems the sequences in Supplementary Table 3 do not contain any phosphorylated (lowercase) residues in the flanking positions.
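The flank check above can be tried on a few toy sequences. This is a minimal sketch: the three 15-mers are made up for illustration and are not taken from the supplementary tables. It assumes the same layout as `SITE_+/-7_AA`, with index 7 being the central phospho-acceptor.

```python
import pandas as pd

# Toy 15-mers (hypothetical, not from the supplementary tables).
# Index 7 is the central phospho-acceptor.
seqs = pd.Series([
    "PGGNIYIYPLKSPYK",  # all-uppercase flanks
    "PGGNIyIYPLKSPYK",  # lowercase residue in the left flank
    "PGGNIYIyPLKsPYK",  # lowercase acceptor plus a lowercase right-flank residue
])

left = seqs.str[0:7]   # positions -7..-1
center = seqs.str[7]   # position 0 (phospho-acceptor)
right = seqs.str[8:]   # positions +1..+7

# True when any flanking position holds a lowercase letter
has_flank_lower = left.str.contains("[a-z]") | right.str.contains("[a-z]")
print(has_flank_lower.tolist())
```

Slicing with `[0:7]` and `[8:]` deliberately skips index 7, so a lowercase phospho-acceptor on its own does not trigger the check.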
Convert column names
cols=df.columns[df.columns.str.contains('_percentile')][1:]
# get the target
target = df[cols]
target.columns = target.columns.str.split('_').str[0]
The step below applies to the Tyr table only, where the kinase names must be mapped to a consistent naming scheme:
kinase_dict = pd.read_csv('raw/lew_tyr_info.csv')
name_dict = kinase_dict.set_index('lew_kinase2')['kinase_tyr']
target.columns = target.columns.map(name_dict)
target.head()
| | ABL1 | TNK2 | ALK | ABL2 | AXL | BLK | BMPR2_TYR | PTK6 | BTK | CSF1R | CSK | MATK | DDR1 | DDR2 | EGFR | EPHA1 | EPHA2 | EPHA3 | EPHA4 | EPHA5 | EPHA6 | EPHA7 | EPHA8 | EPHB1 | EPHB2 | EPHB3 | EPHB4 | BMX | PTK2 | FER | FES | FGFR1 | FGFR2 | FGFR3 | FGFR4 | FGR | FLT3 | FRK | FYN | HCK | ERBB2 | ERBB4 | IGF1R | INSR | INSRR | ITK | JAK1 | JAK2 | JAK3 | KIT | LCK | LIMK1_TYR | LIMK2_TYR | LTK | LYN | MERTK | MET | MAP2K4_TYR | MAP2K6_TYR | MAP2K7_TYR | MST1R | MUSK | PKMYT1_TYR | NEK10_TYR | PDGFRA | PDGFRB | PDHK1_TYR | PDHK3_TYR | PDHK4_TYR | PINK1_TYR | PTK2B | RET | ROS1 | SRC | SRMS | SYK | TEC | TESK1_TYR | TEK | TNK1 | TNNI3K_TYR | NTRK1 | NTRK2 | NTRK3 | TXK | TYK2 | TYRO3 | FLT1 | KDR | FLT4 | WEE1_TYR | YES1 | ZAP70 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36.18 | 6.22 | 27.40 | 46.44 | 49.64 | 87.22 | 78.99 | 98.40 | 71.50 | 26.37 | 90.22 | 93.41 | 12.52 | 22.11 | 79.36 | 18.85 | 58.28 | 37.30 | 69.07 | 71.50 | 24.25 | 27.75 | 64.70 | 41.30 | 76.43 | 59.62 | 42.81 | 61.77 | 91.84 | 95.64 | 79.32 | 25.83 | 41.72 | 63.80 | 89.69 | 94.16 | 38.89 | 42.42 | 99.87 | 90.09 | 41.93 | 58.44 | 86.36 | 51.67 | 57.80 | 61.20 | 5.36 | 6.46 | 8.84 | 34.86 | 83.74 | 2.54 | 2.34 | 54.93 | 83.98 | 78.71 | 27.29 | 41.43 | 92.16 | 51.72 | 5.28 | 17.51 | 12.72 | 7.70 | 3.17 | 22.57 | 96.65 | 70.10 | 74.98 | 77.46 | 76.65 | 39.40 | 7.86 | 98.82 | 88.18 | 86.48 | 58.13 | 0.85 | 16.63 | 7.79 | 0.68 | 82.19 | 57.28 | 57.63 | 73.82 | 16.04 | 27.23 | 51.22 | 21.43 | 37.76 | 47.45 | 99.67 | 62.11 |
| 1 | 72.93 | 31.82 | 45.68 | 60.52 | 41.74 | 34.16 | 40.25 | 69.07 | 47.87 | 38.26 | 74.41 | 62.25 | 69.23 | 43.84 | 66.23 | 61.17 | 76.28 | 45.94 | 51.26 | 40.58 | 62.25 | 46.36 | 46.36 | 36.40 | 31.87 | 28.93 | 16.52 | 36.00 | 27.03 | 41.94 | 60.14 | 40.12 | 67.21 | 57.41 | 68.86 | 24.32 | 50.71 | 37.53 | 35.76 | 50.03 | 66.64 | 35.72 | 42.17 | 47.12 | 43.38 | 24.05 | 34.25 | 51.28 | 47.06 | 65.70 | 41.34 | 54.30 | 70.01 | 60.52 | 45.30 | 43.86 | 47.49 | 52.27 | 23.03 | 60.08 | 59.09 | 48.17 | 26.40 | 96.04 | 53.40 | 32.85 | 61.31 | 74.96 | 37.58 | 50.25 | 57.78 | 71.04 | 17.07 | 49.22 | 61.17 | 30.66 | 56.25 | 36.92 | 16.90 | 47.01 | 17.29 | 44.76 | 45.41 | 49.90 | 44.37 | 44.52 | 36.18 | 64.74 | 33.12 | 50.84 | 40.25 | 46.09 | 15.02 |
| 2 | 3.11 | 75.58 | 2.80 | 2.85 | 23.33 | 0.41 | 33.75 | 14.55 | 10.83 | 9.41 | 26.77 | 22.19 | 20.42 | 25.69 | 2.52 | 23.51 | 5.62 | 16.81 | 16.13 | 8.84 | 41.58 | 21.67 | 6.00 | 3.44 | 3.52 | 8.29 | 9.37 | 2.98 | 9.70 | 0.35 | 3.26 | 9.45 | 18.80 | 7.05 | 1.73 | 1.53 | 28.63 | 7.77 | 1.10 | 0.24 | 2.52 | 2.78 | 0.57 | 1.20 | 1.29 | 3.20 | 9.26 | 9.02 | 24.05 | 10.35 | 0.61 | 97.24 | 92.14 | 4.11 | 1.36 | 2.56 | 12.32 | 12.02 | 28.93 | 94.92 | 25.12 | 15.12 | 84.33 | 68.74 | 50.93 | 25.13 | 64.67 | 37.47 | 75.49 | 86.50 | 2.36 | 20.36 | 10.02 | 0.68 | 3.96 | 1.57 | 3.22 | 96.45 | 23.03 | 86.41 | 68.24 | 1.36 | 1.42 | 0.33 | 1.95 | 29.09 | 11.86 | 24.97 | 29.11 | 35.98 | 97.79 | 1.36 | 1.03 |
| 3 | 81.77 | 87.26 | 41.91 | 83.34 | 83.72 | 54.91 | 24.01 | 51.59 | 63.93 | 29.72 | 59.95 | 66.97 | 17.75 | 4.20 | 16.79 | 85.64 | 61.11 | 78.49 | 58.44 | 70.59 | 68.81 | 66.71 | 54.00 | 61.54 | 69.60 | 80.30 | 84.39 | 67.56 | 59.77 | 60.89 | 28.37 | 10.64 | 9.69 | 8.71 | 11.78 | 59.40 | 57.10 | 60.93 | 32.20 | 51.06 | 18.87 | 28.89 | 12.23 | 19.22 | 12.34 | 62.79 | 49.11 | 23.44 | 17.31 | 21.84 | 71.33 | 90.33 | 85.16 | 58.64 | 51.30 | 77.52 | 41.39 | 95.03 | 53.82 | 54.58 | 67.23 | 75.29 | 59.34 | 71.70 | 47.63 | 40.56 | 35.11 | 60.54 | 30.62 | 86.12 | 88.82 | 45.66 | 39.27 | 18.91 | 60.71 | 45.68 | 74.22 | 63.54 | 59.31 | 43.42 | 44.50 | 28.26 | 45.13 | 39.29 | 81.42 | 13.09 | 75.40 | 27.66 | 33.97 | 18.85 | 87.66 | 30.93 | 45.39 |
| 4 | 30.44 | 30.38 | 39.00 | 26.64 | 60.36 | 38.28 | 88.22 | 62.60 | 76.95 | 80.45 | 64.19 | 87.20 | 56.38 | 67.89 | 79.05 | 60.60 | 64.35 | 72.51 | 55.90 | 70.15 | 81.44 | 70.72 | 73.69 | 65.46 | 76.19 | 72.31 | 89.08 | 67.59 | 61.44 | 29.11 | 44.23 | 75.31 | 48.02 | 52.27 | 61.68 | 69.12 | 79.49 | 55.66 | 27.51 | 46.03 | 64.98 | 60.38 | 65.24 | 83.98 | 51.35 | 79.60 | 85.99 | 56.95 | 93.52 | 68.90 | 57.17 | 48.72 | 44.74 | 27.71 | 40.64 | 40.27 | 83.12 | 33.14 | 46.18 | 14.64 | 85.51 | 94.37 | 59.55 | 61.43 | 92.23 | 96.87 | 27.66 | 33.00 | 35.46 | 34.56 | 36.97 | 68.35 | 83.94 | 48.41 | 21.97 | 68.62 | 60.93 | 44.93 | 86.25 | 46.66 | 98.80 | 70.61 | 94.84 | 86.87 | 56.16 | 70.65 | 84.24 | 80.78 | 72.93 | 79.01 | 95.19 | 39.35 | 74.46 |
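The renaming step can be illustrated on a small frame. This is a sketch under assumptions: the two `*_percentile` column names and the mapping dictionary are hypothetical stand-ins for the contents of `raw/lew_tyr_info.csv`, which is not shown here.

```python
import pandas as pd

# Hypothetical column names and mapping, for illustration only.
df = pd.DataFrame({
    "ABL_percentile": [36.18, 72.93],
    "ACK_percentile": [6.22, 31.82],
})
name_dict = {"ABL": "ABL1", "ACK": "TNK2"}

target = df.copy()
# Strip the suffix: "ABL_percentile" -> "ABL"
target.columns = target.columns.str.split("_").str[0]

# Index.map returns NaN for names missing from the mapping,
# so it is worth listing them before renaming.
unmapped = [c for c in target.columns if c not in name_dict]

target.columns = target.columns.map(name_dict)
print(list(target.columns), unmapped)
```

Checking `unmapped` first guards against silently ending up with `NaN` column labels, which would later break the `result[target.columns]` selection.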
Scoring
result = predict_kinase_df(df,'SITE_+/-7_AA',**param_PSPA)
input dataframe has a length 7315
Preprocessing
Finish preprocessing
Merging reference
Finish merging
100%|██████████| 396/396 [00:02<00:00, 157.53it/s]
# get the percentile score
percentile = result[target.columns].rank(axis=0,pct=True)
percentile = (percentile*100).round(2)
percentile.head()
| | ABL1 | TNK2 | ALK | ABL2 | AXL | BLK | BMPR2_TYR | PTK6 | BTK | CSF1R | CSK | MATK | DDR1 | DDR2 | EGFR | EPHA1 | EPHA2 | EPHA3 | EPHA4 | EPHA5 | EPHA6 | EPHA7 | EPHA8 | EPHB1 | EPHB2 | EPHB3 | EPHB4 | BMX | PTK2 | FER | FES | FGFR1 | FGFR2 | FGFR3 | FGFR4 | FGR | FLT3 | FRK | FYN | HCK | ERBB2 | ERBB4 | IGF1R | INSR | INSRR | ITK | JAK1 | JAK2 | JAK3 | KIT | LCK | LIMK1_TYR | LIMK2_TYR | LTK | LYN | MERTK | MET | MAP2K4_TYR | MAP2K6_TYR | MAP2K7_TYR | MST1R | MUSK | PKMYT1_TYR | NEK10_TYR | PDGFRA | PDGFRB | PDHK1_TYR | PDHK3_TYR | PDHK4_TYR | PINK1_TYR | PTK2B | RET | ROS1 | SRC | SRMS | SYK | TEC | TESK1_TYR | TEK | TNK1 | TNNI3K_TYR | NTRK1 | NTRK2 | NTRK3 | TXK | TYK2 | TYRO3 | FLT1 | KDR | FLT4 | WEE1_TYR | YES1 | ZAP70 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36.90 | 6.23 | 26.32 | 46.87 | 49.48 | 87.50 | 78.57 | 98.49 | 71.71 | 25.83 | 89.54 | 93.08 | 12.08 | 21.67 | 77.75 | 18.28 | 57.33 | 36.49 | 67.95 | 70.62 | 24.37 | 27.27 | 63.62 | 41.39 | 75.71 | 59.68 | 43.03 | 62.14 | 91.13 | 95.52 | 78.35 | 25.27 | 40.98 | 62.66 | 88.20 | 94.21 | 36.96 | 41.16 | 99.84 | 90.01 | 40.20 | 57.33 | 85.35 | 50.85 | 57.27 | 61.66 | 5.32 | 6.25 | 8.63 | 34.05 | 84.16 | 3.10 | 2.80 | 54.25 | 83.56 | 78.68 | 27.61 | 42.73 | 92.39 | 52.09 | 5.41 | 16.96 | 13.69 | 8.03 | 2.92 | 21.61 | 96.40 | 70.79 | 75.19 | 77.29 | 76.60 | 38.91 | 7.80 | 98.78 | 87.28 | 85.72 | 57.67 | 1.11 | 16.04 | 8.06 | 0.77 | 80.68 | 55.93 | 56.02 | 74.24 | 15.55 | 27.09 | 50.01 | 20.96 | 36.08 | 48.09 | 99.69 | 62.03 |
| 1 | 73.94 | 31.87 | 45.08 | 61.11 | 41.85 | 34.85 | 40.27 | 69.06 | 47.25 | 37.45 | 73.02 | 61.91 | 69.23 | 43.32 | 64.44 | 60.55 | 74.88 | 44.68 | 50.25 | 39.88 | 61.72 | 45.71 | 45.65 | 36.37 | 31.93 | 28.73 | 17.05 | 35.85 | 26.68 | 41.67 | 58.79 | 39.35 | 66.12 | 56.15 | 67.02 | 24.81 | 48.69 | 36.55 | 36.03 | 50.15 | 65.15 | 35.11 | 41.13 | 46.32 | 42.75 | 24.33 | 33.27 | 49.55 | 46.72 | 63.94 | 41.44 | 55.50 | 71.09 | 60.08 | 45.00 | 43.78 | 47.36 | 53.40 | 23.47 | 59.51 | 58.15 | 47.46 | 27.73 | 95.82 | 52.02 | 31.86 | 60.87 | 75.47 | 37.70 | 50.42 | 57.81 | 70.94 | 16.81 | 49.34 | 61.06 | 30.30 | 55.67 | 38.15 | 16.22 | 47.40 | 18.19 | 43.98 | 44.49 | 48.59 | 45.42 | 42.92 | 36.49 | 63.37 | 32.42 | 49.25 | 41.07 | 46.65 | 15.55 |
| 2 | 3.01 | 76.08 | 2.55 | 2.80 | 23.14 | 0.43 | 33.66 | 14.39 | 10.26 | 9.00 | 25.57 | 21.89 | 19.71 | 25.13 | 2.59 | 22.71 | 5.43 | 16.93 | 16.00 | 8.59 | 41.55 | 21.48 | 5.87 | 3.41 | 3.63 | 8.22 | 9.58 | 3.06 | 9.70 | 0.33 | 3.30 | 9.23 | 18.08 | 6.86 | 1.88 | 1.58 | 27.13 | 7.29 | 1.32 | 0.28 | 2.32 | 2.93 | 0.57 | 1.21 | 1.21 | 3.34 | 9.22 | 8.74 | 23.86 | 9.67 | 0.65 | 96.85 | 92.00 | 3.79 | 1.32 | 2.48 | 12.32 | 13.02 | 29.02 | 94.74 | 25.00 | 14.46 | 84.91 | 68.87 | 49.34 | 24.04 | 64.31 | 38.23 | 75.62 | 86.30 | 2.55 | 20.02 | 9.95 | 0.73 | 4.05 | 1.62 | 3.06 | 96.47 | 22.48 | 86.21 | 68.06 | 1.51 | 1.35 | 0.40 | 2.00 | 27.79 | 11.75 | 23.90 | 28.22 | 34.54 | 97.48 | 1.47 | 1.13 |
| 3 | 82.29 | 87.57 | 41.46 | 83.82 | 83.94 | 55.37 | 23.93 | 51.19 | 63.75 | 29.02 | 58.52 | 66.27 | 17.08 | 4.27 | 16.53 | 85.72 | 59.79 | 77.56 | 57.29 | 69.70 | 68.59 | 65.58 | 53.55 | 60.72 | 68.81 | 79.94 | 84.15 | 67.80 | 59.45 | 60.17 | 27.33 | 10.42 | 9.26 | 8.47 | 11.31 | 59.98 | 55.28 | 59.82 | 32.71 | 51.19 | 17.87 | 28.43 | 11.50 | 18.82 | 12.15 | 63.41 | 47.95 | 22.69 | 17.00 | 21.00 | 72.05 | 90.03 | 85.13 | 57.96 | 51.13 | 77.29 | 41.63 | 95.50 | 54.62 | 54.94 | 66.61 | 74.57 | 60.31 | 71.74 | 46.24 | 39.64 | 35.13 | 61.16 | 30.64 | 86.02 | 88.81 | 44.81 | 39.30 | 19.16 | 60.64 | 44.98 | 74.25 | 64.19 | 59.13 | 43.81 | 44.76 | 27.50 | 44.24 | 38.65 | 81.51 | 12.64 | 75.78 | 26.61 | 33.20 | 17.67 | 86.80 | 31.91 | 45.39 |
| 4 | 31.05 | 30.53 | 38.42 | 28.13 | 60.13 | 38.78 | 88.18 | 62.51 | 77.00 | 79.15 | 62.96 | 86.69 | 55.73 | 67.27 | 77.48 | 59.87 | 63.15 | 71.18 | 54.73 | 69.28 | 81.18 | 69.69 | 72.53 | 64.53 | 75.33 | 71.95 | 88.88 | 67.85 | 60.99 | 28.97 | 42.93 | 74.40 | 47.06 | 50.79 | 60.36 | 69.69 | 78.28 | 55.02 | 27.99 | 46.10 | 63.34 | 59.35 | 64.19 | 82.99 | 50.90 | 80.06 | 85.22 | 54.84 | 93.25 | 67.05 | 57.78 | 49.97 | 47.05 | 27.16 | 40.51 | 40.21 | 82.30 | 34.11 | 46.84 | 14.78 | 85.42 | 94.09 | 60.42 | 61.47 | 91.64 | 96.68 | 27.44 | 33.72 | 35.55 | 34.89 | 37.20 | 68.28 | 83.66 | 48.50 | 22.19 | 67.38 | 60.68 | 46.77 | 86.17 | 47.33 | 98.80 | 68.70 | 94.39 | 85.68 | 57.07 | 68.78 | 84.63 | 79.69 | 72.34 | 77.55 | 94.72 | 40.12 | 73.87 |
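To make the percentile step concrete, here is a small worked example of `rank(axis=0, pct=True)` on made-up scores. The kinase names `KIN_A` and `KIN_B` and the values are hypothetical.

```python
import pandas as pd

# Hypothetical scores: rows are sites, columns are kinases.
scores = pd.DataFrame({
    "KIN_A": [0.5, 2.0, -1.0, 3.5],
    "KIN_B": [1.0, 1.0, 4.0, 0.0],
})

# Rank each column independently; pct=True divides the rank by the row count.
# Ties receive the average of their ranks (pandas default, method="average").
percentile = (scores.rank(axis=0, pct=True) * 100).round(2)
print(percentile)
```

Each column is ranked on its own, so a site's percentile for one kinase says nothing about its score relative to other kinases. In `KIN_B` the two tied values share rank 2.5 out of 4, which gives 62.5 for both.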
The calculated percentiles are very similar to the target values; the differences are due to rounding in the raw data.
Compare
(target-percentile).abs().max().sort_values()
MERTK 0.45
PTK2B 0.51
AXL 0.53
SRC 0.64
ROS1 0.65
...
NTRK1 2.18
TESK1_TYR 2.19
JAK2 2.30
FLT3 2.31
LIMK2_TYR 2.67
Length: 93, dtype: float64
There is not much difference between the two: the largest per-kinase deviation is 2.67 percentile points (LIMK2_TYR).
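The same comparison can be reproduced on a small subset. The sketch below uses only the ABL1 and TNK2 values from the first three rows of the two tables shown earlier, so the maxima it reports are for that subset and not for the full 7315-row data.

```python
import pandas as pd

# First three rows of ABL1 and TNK2, copied from the two tables above.
target = pd.DataFrame({"ABL1": [36.18, 72.93, 3.11],
                       "TNK2": [6.22, 31.82, 75.58]})
percentile = pd.DataFrame({"ABL1": [36.90, 73.94, 3.01],
                           "TNK2": [6.23, 31.87, 76.08]})

# The subtraction aligns on column labels, so both frames must share them.
assert target.columns.equals(percentile.columns)

# Largest absolute deviation per kinase, smallest first
max_diff = (target - percentile).abs().max().round(2).sort_values()
print(max_diff)
```

Because pandas aligns on labels during subtraction, a mismatch in column names would produce all-NaN columns instead of an error, which is why the column check comes first.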
# save the result for reference
# result.round(3).to_parquet('ochoa_pspa_score.parquet')
To use
We’ve saved the reference scores in Data; they can be loaded with the function below:
pct_ref = Data.get_ochoa_score()
score_df = predict_kinase_df(df,'site_seq',**param_PSPA)
input dataframe has a length 35
Preprocessing
Finish preprocessing
Calculating position: [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4]
0%| | 0/303 [00:00<?, ?it/s]
/usr/local/lib/python3.9/dist-packages/katlas/core.py:575: RuntimeWarning: divide by zero encountered in log2
log_sum = np.sum(np.log2(values)) + (len(values) - 1) * np.log2(divide)
100%|██████████| 303/303 [00:00<00:00, 1096.01it/s]
pct = get_pct_df(score_df,pct_ref)
site = 'PGGNIyIsPLksPyk'
get_pct(site,pct_ref)
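Conceptually, scoring a new site against a reference means asking where its score falls in the reference distribution for that kinase. The sketch below shows one plausible way to do this with NumPy. It is an assumption for illustration and is not taken from the katlas source, so `get_pct` and `get_pct_df` may be implemented differently. The helper name `percentile_against_reference` and the reference values are hypothetical.

```python
import numpy as np

def percentile_against_reference(score, ref_scores):
    """Return the percentage of reference scores that are <= score.

    Hypothetical helper for illustration; not the katlas implementation.
    """
    ref = np.asarray(ref_scores, dtype=float)
    return float((ref <= score).mean() * 100)

# Made-up reference scores for a single kinase
ref = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

print(percentile_against_reference(1.5, ref))
```

With this definition a score above every reference value maps to 100 and a score below every reference value maps to 0.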