from katlas.imports import *
import pickle, pandas as pd, numpy as np, seaborn as sns, matplotlib.pyplot as plt
from numpy import trapz
Get reference score to calculate percentile
Setup
Scoring
Download Supplementary Table 3 (sheet 2) from the Nature Ser/Thr kinase paper and Supplementary Table 3 (sheet 3, which includes the non-canonical kinases) from the Nature Tyr kinase paper.
These two files are too large to upload to the current repository, so please download them yourself.
# df = pd.read_csv('supp3_ST.csv')
df = pd.read_csv('supp3_tyr.csv')
Check whether the sequences contain lowercase letters at any position other than the central phospho-acceptor
df['SITE_+/-7_AA'].str[0:7].str.contains('[a-z]').value_counts()
SITE_+/-7_AA
False 7315
Name: count, dtype: int64
df['SITE_+/-7_AA'].str[8:].str.contains('[a-z]').value_counts()
SITE_+/-7_AA
False 7315
Name: count, dtype: int64
It seems the sequences in supp3 do not contain any phosphorylated (lowercase) residues in the flanking positions.
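The flank check above can be reproduced on a toy example. This is a minimal sketch: the three sequences are made up for illustration, and only the column name and the slicing (positions 0-6 and 8-14 around the central residue at index 7) come from the cells above.

```python
import pandas as pd

# Toy 15-mers (hypothetical); index 7 is the central phospho-acceptor
toy = pd.DataFrame({'SITE_+/-7_AA': [
    'PGGNIAISPLKSPAK',   # no lowercase anywhere
    'PGGNIAIsPLKSPAK',   # lowercase only at the central position
    'PGGNIyISPLKSPAK',   # lowercase in the left flank
]})

seq = toy['SITE_+/-7_AA']
left_has_lower = seq.str[0:7].str.contains('[a-z]')    # positions -7..-1
right_has_lower = seq.str[8:].str.contains('[a-z]')    # positions +1..+7

# True only when a flank (not the centre) carries a lowercase residue
flank_has_lower = left_has_lower | right_has_lower
print(flank_has_lower.tolist())
```

Only the third toy sequence is flagged, because the central position is excluded from both slices.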
Convert column names
cols = df.columns[df.columns.str.contains('_percentile')][1:]
# get the target
target = df[cols]
target.columns = target.columns.str.split('_').str[0]
The step below applies to the Tyr data only, where the kinase names must be mapped so that they are consistent with the names used in the scoring output
kinase_dict = pd.read_csv('raw/lew_tyr_info.csv')
name_dict = kinase_dict.set_index('lew_kinase2')['kinase_tyr']
target.columns = target.columns.map(name_dict)
target.head()
ABL1 | TNK2 | ALK | ABL2 | AXL | BLK | BMPR2_TYR | PTK6 | BTK | CSF1R | CSK | MATK | DDR1 | DDR2 | EGFR | EPHA1 | EPHA2 | EPHA3 | EPHA4 | EPHA5 | EPHA6 | EPHA7 | EPHA8 | EPHB1 | EPHB2 | EPHB3 | EPHB4 | BMX | PTK2 | FER | FES | FGFR1 | FGFR2 | FGFR3 | FGFR4 | FGR | FLT3 | FRK | FYN | HCK | ERBB2 | ERBB4 | IGF1R | INSR | INSRR | ITK | JAK1 | JAK2 | JAK3 | KIT | LCK | LIMK1_TYR | LIMK2_TYR | LTK | LYN | MERTK | MET | MAP2K4_TYR | MAP2K6_TYR | MAP2K7_TYR | MST1R | MUSK | PKMYT1_TYR | NEK10_TYR | PDGFRA | PDGFRB | PDHK1_TYR | PDHK3_TYR | PDHK4_TYR | PINK1_TYR | PTK2B | RET | ROS1 | SRC | SRMS | SYK | TEC | TESK1_TYR | TEK | TNK1 | TNNI3K_TYR | NTRK1 | NTRK2 | NTRK3 | TXK | TYK2 | TYRO3 | FLT1 | KDR | FLT4 | WEE1_TYR | YES1 | ZAP70 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 36.18 | 6.22 | 27.40 | 46.44 | 49.64 | 87.22 | 78.99 | 98.40 | 71.50 | 26.37 | 90.22 | 93.41 | 12.52 | 22.11 | 79.36 | 18.85 | 58.28 | 37.30 | 69.07 | 71.50 | 24.25 | 27.75 | 64.70 | 41.30 | 76.43 | 59.62 | 42.81 | 61.77 | 91.84 | 95.64 | 79.32 | 25.83 | 41.72 | 63.80 | 89.69 | 94.16 | 38.89 | 42.42 | 99.87 | 90.09 | 41.93 | 58.44 | 86.36 | 51.67 | 57.80 | 61.20 | 5.36 | 6.46 | 8.84 | 34.86 | 83.74 | 2.54 | 2.34 | 54.93 | 83.98 | 78.71 | 27.29 | 41.43 | 92.16 | 51.72 | 5.28 | 17.51 | 12.72 | 7.70 | 3.17 | 22.57 | 96.65 | 70.10 | 74.98 | 77.46 | 76.65 | 39.40 | 7.86 | 98.82 | 88.18 | 86.48 | 58.13 | 0.85 | 16.63 | 7.79 | 0.68 | 82.19 | 57.28 | 57.63 | 73.82 | 16.04 | 27.23 | 51.22 | 21.43 | 37.76 | 47.45 | 99.67 | 62.11 |
1 | 72.93 | 31.82 | 45.68 | 60.52 | 41.74 | 34.16 | 40.25 | 69.07 | 47.87 | 38.26 | 74.41 | 62.25 | 69.23 | 43.84 | 66.23 | 61.17 | 76.28 | 45.94 | 51.26 | 40.58 | 62.25 | 46.36 | 46.36 | 36.40 | 31.87 | 28.93 | 16.52 | 36.00 | 27.03 | 41.94 | 60.14 | 40.12 | 67.21 | 57.41 | 68.86 | 24.32 | 50.71 | 37.53 | 35.76 | 50.03 | 66.64 | 35.72 | 42.17 | 47.12 | 43.38 | 24.05 | 34.25 | 51.28 | 47.06 | 65.70 | 41.34 | 54.30 | 70.01 | 60.52 | 45.30 | 43.86 | 47.49 | 52.27 | 23.03 | 60.08 | 59.09 | 48.17 | 26.40 | 96.04 | 53.40 | 32.85 | 61.31 | 74.96 | 37.58 | 50.25 | 57.78 | 71.04 | 17.07 | 49.22 | 61.17 | 30.66 | 56.25 | 36.92 | 16.90 | 47.01 | 17.29 | 44.76 | 45.41 | 49.90 | 44.37 | 44.52 | 36.18 | 64.74 | 33.12 | 50.84 | 40.25 | 46.09 | 15.02 |
2 | 3.11 | 75.58 | 2.80 | 2.85 | 23.33 | 0.41 | 33.75 | 14.55 | 10.83 | 9.41 | 26.77 | 22.19 | 20.42 | 25.69 | 2.52 | 23.51 | 5.62 | 16.81 | 16.13 | 8.84 | 41.58 | 21.67 | 6.00 | 3.44 | 3.52 | 8.29 | 9.37 | 2.98 | 9.70 | 0.35 | 3.26 | 9.45 | 18.80 | 7.05 | 1.73 | 1.53 | 28.63 | 7.77 | 1.10 | 0.24 | 2.52 | 2.78 | 0.57 | 1.20 | 1.29 | 3.20 | 9.26 | 9.02 | 24.05 | 10.35 | 0.61 | 97.24 | 92.14 | 4.11 | 1.36 | 2.56 | 12.32 | 12.02 | 28.93 | 94.92 | 25.12 | 15.12 | 84.33 | 68.74 | 50.93 | 25.13 | 64.67 | 37.47 | 75.49 | 86.50 | 2.36 | 20.36 | 10.02 | 0.68 | 3.96 | 1.57 | 3.22 | 96.45 | 23.03 | 86.41 | 68.24 | 1.36 | 1.42 | 0.33 | 1.95 | 29.09 | 11.86 | 24.97 | 29.11 | 35.98 | 97.79 | 1.36 | 1.03 |
3 | 81.77 | 87.26 | 41.91 | 83.34 | 83.72 | 54.91 | 24.01 | 51.59 | 63.93 | 29.72 | 59.95 | 66.97 | 17.75 | 4.20 | 16.79 | 85.64 | 61.11 | 78.49 | 58.44 | 70.59 | 68.81 | 66.71 | 54.00 | 61.54 | 69.60 | 80.30 | 84.39 | 67.56 | 59.77 | 60.89 | 28.37 | 10.64 | 9.69 | 8.71 | 11.78 | 59.40 | 57.10 | 60.93 | 32.20 | 51.06 | 18.87 | 28.89 | 12.23 | 19.22 | 12.34 | 62.79 | 49.11 | 23.44 | 17.31 | 21.84 | 71.33 | 90.33 | 85.16 | 58.64 | 51.30 | 77.52 | 41.39 | 95.03 | 53.82 | 54.58 | 67.23 | 75.29 | 59.34 | 71.70 | 47.63 | 40.56 | 35.11 | 60.54 | 30.62 | 86.12 | 88.82 | 45.66 | 39.27 | 18.91 | 60.71 | 45.68 | 74.22 | 63.54 | 59.31 | 43.42 | 44.50 | 28.26 | 45.13 | 39.29 | 81.42 | 13.09 | 75.40 | 27.66 | 33.97 | 18.85 | 87.66 | 30.93 | 45.39 |
4 | 30.44 | 30.38 | 39.00 | 26.64 | 60.36 | 38.28 | 88.22 | 62.60 | 76.95 | 80.45 | 64.19 | 87.20 | 56.38 | 67.89 | 79.05 | 60.60 | 64.35 | 72.51 | 55.90 | 70.15 | 81.44 | 70.72 | 73.69 | 65.46 | 76.19 | 72.31 | 89.08 | 67.59 | 61.44 | 29.11 | 44.23 | 75.31 | 48.02 | 52.27 | 61.68 | 69.12 | 79.49 | 55.66 | 27.51 | 46.03 | 64.98 | 60.38 | 65.24 | 83.98 | 51.35 | 79.60 | 85.99 | 56.95 | 93.52 | 68.90 | 57.17 | 48.72 | 44.74 | 27.71 | 40.64 | 40.27 | 83.12 | 33.14 | 46.18 | 14.64 | 85.51 | 94.37 | 59.55 | 61.43 | 92.23 | 96.87 | 27.66 | 33.00 | 35.46 | 34.56 | 36.97 | 68.35 | 83.94 | 48.41 | 21.97 | 68.62 | 60.93 | 44.93 | 86.25 | 46.66 | 98.80 | 70.61 | 94.84 | 86.87 | 56.16 | 70.65 | 84.24 | 80.78 | 72.93 | 79.01 | 95.19 | 39.35 | 74.46 |
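The two renaming steps above (strip the `_percentile` suffix, then map paper names to the names used for scoring) can be sketched on a toy frame. The column names and the mapping rows here are hypothetical stand-ins; the real mapping comes from `raw/lew_tyr_info.csv`. The extra check for unmapped names is an addition, not part of the original cells.

```python
import pandas as pd

# Hypothetical columns in the '<KINASE>_percentile' style
toy = pd.DataFrame([[10.0, 20.0]], columns=['ABL_percentile', 'ACK_percentile'])
toy.columns = toy.columns.str.split('_').str[0]      # keep text before the first '_'

# Hypothetical mapping table standing in for raw/lew_tyr_info.csv
info = pd.DataFrame({'lew_kinase2': ['ABL', 'ACK'],
                     'kinase_tyr': ['ABL1', 'TNK2']})
name_dict = info.set_index('lew_kinase2')['kinase_tyr']

toy.columns = toy.columns.map(name_dict)

# Names missing from the mapping become NaN, so check before using them
assert not toy.columns.isna().any(), 'some kinase names were not mapped'
print(list(toy.columns))
```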
Scoring
result = predict_kinase_df(df,'SITE_+/-7_AA',**param_PSPA)
input dataframe has a length 7315
Preprocessing
Finish preprocessing
Merging reference
Finish merging
100%|██████████| 396/396 [00:02<00:00, 157.53it/s]
# get the percentile score
percentile = result[target.columns].rank(axis=0,pct=True)
percentile = (percentile*100).round(2)
percentile.head()
ABL1 | TNK2 | ALK | ABL2 | AXL | BLK | BMPR2_TYR | PTK6 | BTK | CSF1R | CSK | MATK | DDR1 | DDR2 | EGFR | EPHA1 | EPHA2 | EPHA3 | EPHA4 | EPHA5 | EPHA6 | EPHA7 | EPHA8 | EPHB1 | EPHB2 | EPHB3 | EPHB4 | BMX | PTK2 | FER | FES | FGFR1 | FGFR2 | FGFR3 | FGFR4 | FGR | FLT3 | FRK | FYN | HCK | ERBB2 | ERBB4 | IGF1R | INSR | INSRR | ITK | JAK1 | JAK2 | JAK3 | KIT | LCK | LIMK1_TYR | LIMK2_TYR | LTK | LYN | MERTK | MET | MAP2K4_TYR | MAP2K6_TYR | MAP2K7_TYR | MST1R | MUSK | PKMYT1_TYR | NEK10_TYR | PDGFRA | PDGFRB | PDHK1_TYR | PDHK3_TYR | PDHK4_TYR | PINK1_TYR | PTK2B | RET | ROS1 | SRC | SRMS | SYK | TEC | TESK1_TYR | TEK | TNK1 | TNNI3K_TYR | NTRK1 | NTRK2 | NTRK3 | TXK | TYK2 | TYRO3 | FLT1 | KDR | FLT4 | WEE1_TYR | YES1 | ZAP70 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 36.90 | 6.23 | 26.32 | 46.87 | 49.48 | 87.50 | 78.57 | 98.49 | 71.71 | 25.83 | 89.54 | 93.08 | 12.08 | 21.67 | 77.75 | 18.28 | 57.33 | 36.49 | 67.95 | 70.62 | 24.37 | 27.27 | 63.62 | 41.39 | 75.71 | 59.68 | 43.03 | 62.14 | 91.13 | 95.52 | 78.35 | 25.27 | 40.98 | 62.66 | 88.20 | 94.21 | 36.96 | 41.16 | 99.84 | 90.01 | 40.20 | 57.33 | 85.35 | 50.85 | 57.27 | 61.66 | 5.32 | 6.25 | 8.63 | 34.05 | 84.16 | 3.10 | 2.80 | 54.25 | 83.56 | 78.68 | 27.61 | 42.73 | 92.39 | 52.09 | 5.41 | 16.96 | 13.69 | 8.03 | 2.92 | 21.61 | 96.40 | 70.79 | 75.19 | 77.29 | 76.60 | 38.91 | 7.80 | 98.78 | 87.28 | 85.72 | 57.67 | 1.11 | 16.04 | 8.06 | 0.77 | 80.68 | 55.93 | 56.02 | 74.24 | 15.55 | 27.09 | 50.01 | 20.96 | 36.08 | 48.09 | 99.69 | 62.03 |
1 | 73.94 | 31.87 | 45.08 | 61.11 | 41.85 | 34.85 | 40.27 | 69.06 | 47.25 | 37.45 | 73.02 | 61.91 | 69.23 | 43.32 | 64.44 | 60.55 | 74.88 | 44.68 | 50.25 | 39.88 | 61.72 | 45.71 | 45.65 | 36.37 | 31.93 | 28.73 | 17.05 | 35.85 | 26.68 | 41.67 | 58.79 | 39.35 | 66.12 | 56.15 | 67.02 | 24.81 | 48.69 | 36.55 | 36.03 | 50.15 | 65.15 | 35.11 | 41.13 | 46.32 | 42.75 | 24.33 | 33.27 | 49.55 | 46.72 | 63.94 | 41.44 | 55.50 | 71.09 | 60.08 | 45.00 | 43.78 | 47.36 | 53.40 | 23.47 | 59.51 | 58.15 | 47.46 | 27.73 | 95.82 | 52.02 | 31.86 | 60.87 | 75.47 | 37.70 | 50.42 | 57.81 | 70.94 | 16.81 | 49.34 | 61.06 | 30.30 | 55.67 | 38.15 | 16.22 | 47.40 | 18.19 | 43.98 | 44.49 | 48.59 | 45.42 | 42.92 | 36.49 | 63.37 | 32.42 | 49.25 | 41.07 | 46.65 | 15.55 |
2 | 3.01 | 76.08 | 2.55 | 2.80 | 23.14 | 0.43 | 33.66 | 14.39 | 10.26 | 9.00 | 25.57 | 21.89 | 19.71 | 25.13 | 2.59 | 22.71 | 5.43 | 16.93 | 16.00 | 8.59 | 41.55 | 21.48 | 5.87 | 3.41 | 3.63 | 8.22 | 9.58 | 3.06 | 9.70 | 0.33 | 3.30 | 9.23 | 18.08 | 6.86 | 1.88 | 1.58 | 27.13 | 7.29 | 1.32 | 0.28 | 2.32 | 2.93 | 0.57 | 1.21 | 1.21 | 3.34 | 9.22 | 8.74 | 23.86 | 9.67 | 0.65 | 96.85 | 92.00 | 3.79 | 1.32 | 2.48 | 12.32 | 13.02 | 29.02 | 94.74 | 25.00 | 14.46 | 84.91 | 68.87 | 49.34 | 24.04 | 64.31 | 38.23 | 75.62 | 86.30 | 2.55 | 20.02 | 9.95 | 0.73 | 4.05 | 1.62 | 3.06 | 96.47 | 22.48 | 86.21 | 68.06 | 1.51 | 1.35 | 0.40 | 2.00 | 27.79 | 11.75 | 23.90 | 28.22 | 34.54 | 97.48 | 1.47 | 1.13 |
3 | 82.29 | 87.57 | 41.46 | 83.82 | 83.94 | 55.37 | 23.93 | 51.19 | 63.75 | 29.02 | 58.52 | 66.27 | 17.08 | 4.27 | 16.53 | 85.72 | 59.79 | 77.56 | 57.29 | 69.70 | 68.59 | 65.58 | 53.55 | 60.72 | 68.81 | 79.94 | 84.15 | 67.80 | 59.45 | 60.17 | 27.33 | 10.42 | 9.26 | 8.47 | 11.31 | 59.98 | 55.28 | 59.82 | 32.71 | 51.19 | 17.87 | 28.43 | 11.50 | 18.82 | 12.15 | 63.41 | 47.95 | 22.69 | 17.00 | 21.00 | 72.05 | 90.03 | 85.13 | 57.96 | 51.13 | 77.29 | 41.63 | 95.50 | 54.62 | 54.94 | 66.61 | 74.57 | 60.31 | 71.74 | 46.24 | 39.64 | 35.13 | 61.16 | 30.64 | 86.02 | 88.81 | 44.81 | 39.30 | 19.16 | 60.64 | 44.98 | 74.25 | 64.19 | 59.13 | 43.81 | 44.76 | 27.50 | 44.24 | 38.65 | 81.51 | 12.64 | 75.78 | 26.61 | 33.20 | 17.67 | 86.80 | 31.91 | 45.39 |
4 | 31.05 | 30.53 | 38.42 | 28.13 | 60.13 | 38.78 | 88.18 | 62.51 | 77.00 | 79.15 | 62.96 | 86.69 | 55.73 | 67.27 | 77.48 | 59.87 | 63.15 | 71.18 | 54.73 | 69.28 | 81.18 | 69.69 | 72.53 | 64.53 | 75.33 | 71.95 | 88.88 | 67.85 | 60.99 | 28.97 | 42.93 | 74.40 | 47.06 | 50.79 | 60.36 | 69.69 | 78.28 | 55.02 | 27.99 | 46.10 | 63.34 | 59.35 | 64.19 | 82.99 | 50.90 | 80.06 | 85.22 | 54.84 | 93.25 | 67.05 | 57.78 | 49.97 | 47.05 | 27.16 | 40.51 | 40.21 | 82.30 | 34.11 | 46.84 | 14.78 | 85.42 | 94.09 | 60.42 | 61.47 | 91.64 | 96.68 | 27.44 | 33.72 | 35.55 | 34.89 | 37.20 | 68.28 | 83.66 | 48.50 | 22.19 | 67.38 | 60.68 | 46.77 | 86.17 | 47.33 | 98.80 | 68.70 | 94.39 | 85.68 | 57.07 | 68.78 | 84.63 | 79.69 | 72.34 | 77.55 | 94.72 | 40.12 | 73.87 |
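As a small illustration of the percentile step, `rank(axis=0, pct=True)` ranks each column independently and divides by the number of rows, so the largest score in a column maps to 100. The kinase names and scores below are made up; ties receive the average rank under the pandas default.

```python
import pandas as pd

# Toy scores for two hypothetical kinases across four sites
scores = pd.DataFrame({'KIN_A': [0.5, 2.0, 1.0, 3.0],
                       'KIN_B': [4.0, 1.0, 1.0, 2.0]})

# Column-wise percentile rank, scaled to 0-100 and rounded as in the cell above
pct = (scores.rank(axis=0, pct=True) * 100).round(2)
print(pct)
```

In `KIN_B` the two tied scores share rank 1.5 out of 4, giving 37.5 for both.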
Comparing the calculated percentiles with the target shows that they are very similar. The differences are due to rounding of the raw data.
Compare
(target-percentile).abs().max().sort_values()
MERTK 0.45
PTK2B 0.51
AXL 0.53
SRC 0.64
ROS1 0.65
...
NTRK1 2.18
TESK1_TYR 2.19
JAK2 2.30
FLT3 2.31
LIMK2_TYR 2.67
Length: 93, dtype: float64
There is not much difference between the two: the largest per-kinase deviation is 2.67 percentile points (LIMK2_TYR).
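The comparison above reduces each kinase column to its worst-case absolute deviation. A minimal sketch with made-up numbers (both frames must share the same index and column names for the subtraction to align):

```python
import pandas as pd

# Toy published percentiles vs. recomputed percentiles (hypothetical values)
target = pd.DataFrame({'KIN_A': [10.0, 50.0, 90.0],
                       'KIN_B': [20.0, 40.0, 60.0]})
percentile = pd.DataFrame({'KIN_A': [10.5, 49.0, 90.0],
                           'KIN_B': [20.0, 42.5, 60.0]})

# Largest absolute difference per kinase, smallest first
max_diff = (target - percentile).abs().max().sort_values()
print(max_diff)
```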
# save the result for reference
# result.round(3).to_parquet('ochoa_pspa_score.parquet')
To use
We’ve saved the reference scores in Data; they can be loaded with the function below:
pct_ref = Data.get_ochoa_score()
score_df = predict_kinase_df(df,'site_seq',**param_PSPA)
input dataframe has a length 35
Preprocessing
Finish preprocessing
Calculating position: [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4]
0%| | 0/303 [00:00<?, ?it/s]
/usr/local/lib/python3.9/dist-packages/katlas/core.py:575: RuntimeWarning: divide by zero encountered in log2
log_sum = np.sum(np.log2(values)) + (len(values) - 1) * np.log2(divide)
100%|██████████| 303/303 [00:00<00:00, 1096.01it/s]
pct = get_pct_df(score_df,pct_ref)
site = 'PGGNIyIsPLksPyk'
get_pct(site,pct_ref)
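To make the idea of a reference-based percentile concrete, here is a minimal sketch of one way a new score could be placed within a reference distribution. It is not the katlas implementation of `get_pct` or `get_pct_df`; the helper name `percentile_vs_reference`, the "less than or equal" convention, and the toy reference values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def percentile_vs_reference(score, ref_scores):
    """Percent of reference scores that are <= score (hypothetical helper)."""
    ref = np.asarray(ref_scores, dtype=float)
    return round(float((ref <= score).mean() * 100), 2)

# Toy reference scores for one hypothetical kinase
ref = pd.DataFrame({'KIN_A': [0.0, 1.0, 2.0, 3.0]})

# A new site scoring 1.5 sits above half of the reference scores
print(percentile_vs_reference(1.5, ref['KIN_A']))
```

Other conventions (strictly less than, or the mean of the two) give slightly different values at ties, so the choice should match whatever the reference percentiles were built with.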