import pandas as pd
from katlas.core import *
PSPA raw data normalization
In this notebook, we replicate the normalization process for PSPA Ser/Thr raw data
Download PSPA data
raw = pd.read_csv('https://github.com/sky1ove/katlas_raw/raw/refs/heads/main/nbs/raw/pspa_st_raw.csv').set_index('kinase')
norm = pd.read_csv('https://github.com/sky1ove/katlas_raw/raw/refs/heads/main/nbs/raw/pspa_st_norm.csv').set_index('kinase')
scale = pd.read_csv('https://github.com/sky1ove/katlas_raw/raw/refs/heads/main/nbs/raw/pspa_st_scale.csv').set_index('kinase')
In get_one_kinase, drop_s is set to True because s is a duplicate of t in PSPA.
k = get_one_kinase(raw,'PDHK1')
k.head()
aa | A | C | D | E | F | G | H | I | K | L | ... | P | Q | R | S | T | V | W | Y | t | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
position | |||||||||||||||||||||
-5 | 8742435.33 | 10414182.29 | 8663835.37 | 8096013.86 | 11402696.32 | 10253402.14 | 10105837.98 | 8683931.90 | 7578162.13 | 9660152.81 | ... | 6637930.10 | 6242275.07 | 8735083.42 | 17325761.72 | 10840094.13 | 8430649.60 | 14729350.10 | 11402696.32 | 8575155.23 | 9671765.02 |
-4 | 9382375.57 | 10685938.26 | 8357249.08 | 7761083.75 | 11217909.10 | 10855959.77 | 9079043.40 | 9130790.82 | 7898317.44 | 9322057.05 | ... | 9672268.74 | 8379245.47 | 9377210.68 | 10952415.10 | 9895845.34 | 7886254.77 | 13908900.76 | 11217909.10 | 8025228.06 | 11415154.10 |
-3 | 9566806.27 | 10274228.62 | 7860338.75 | 6664677.78 | 12646646.40 | 9136758.39 | 10619788.43 | 10815274.55 | 7575486.39 | 10510394.47 | ... | 8973502.17 | 8383343.00 | 8378836.06 | 15571737.29 | 10373422.50 | 9253028.96 | 17526458.60 | 12646646.40 | 6558017.14 | 8706611.00 |
-2 | 8874823.78 | 11219554.16 | 7104673.31 | 6607581.65 | 11937469.77 | 13445698.89 | 11887506.94 | 8049058.41 | 6643874.14 | 9617614.67 | ... | 7548109.53 | 8208440.55 | 9307590.91 | 20205849.32 | 13325121.79 | 7839573.90 | 16355323.34 | 11937469.77 | 4944830.56 | 8422409.78 |
-1 | 10110169.52 | 14777201.90 | 12784916.61 | 5507173.44 | 8406884.45 | 8990141.98 | 10109111.77 | 6409587.79 | 5295768.52 | 7469514.59 | ... | 6981606.35 | 6472612.56 | 6069925.70 | 19309187.20 | 22395646.37 | 6650117.78 | 9773567.40 | 8406884.45 | 4625731.13 | 5606047.19 |
5 rows × 22 columns
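To make the reshaping step concrete, here is a minimal sketch of how one kinase's wide row (labels such as `-5P` or `4y`) could be pivoted into a position × amino-acid matrix like the one above. The helper `wide_row_to_matrix` and the toy values are hypothetical and are not the katlas implementation; the sketch assumes every label is an integer position followed by a single residue letter.

```python
import pandas as pd

def wide_row_to_matrix(row, drop_s=True):
    """Hypothetical helper: reshape a wide row into a position x aa matrix."""
    labels = row.index.to_series()
    # Split labels such as '-5P' into a position ('-5') and a residue ('P').
    parts = labels.str.extract(r'^(-?\d+)([A-Za-z])$')
    long = pd.DataFrame({
        'position': parts[0].astype(int).to_numpy(),
        'aa': parts[1].to_numpy(),
        'value': row.to_numpy(),
    })
    mat = long.pivot(index='position', columns='aa', values='value')
    # Optionally drop the 's' column, which the text describes as a duplicate of 't'.
    if drop_s and 's' in mat.columns:
        mat = mat.drop(columns='s')
    return mat

# Toy row with made-up values, only to show the shape of the result.
toy_row = pd.Series(
    {'-1A': 1.0, '-1s': 2.0, '-1t': 2.0, '1A': 3.0, '1s': 4.0, '1t': 4.0},
    name='toy_kinase',
)
toy_matrix = wide_row_to_matrix(toy_row)
print(toy_matrix)
```

Passing `drop_s=False` keeps the `s` column, which can be useful for checking that it really duplicates `t`.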
Normalize PSPA raw data
We’ll implement the normalization method from Johnson et al. Nature: An atlas of substrate specificities for the human serine/threonine kinome
Specifically:
> - matrices were column-normalized at all positions by the sum of the 17 randomized amino acids (excluding serine, threonine and cysteine), to yield PSSMs.
> - PDHK1 and PDHK4 were normalized to the 16 randomized amino acids (excluding serine, threonine, cysteine and additionally tyrosine)
> - The cysteine row was scaled by its median to be 1/17 (1/16 for PDHK1 and PDHK4).
> - The serine and threonine values in each position were set to be the median of that position.
> - The S0/T0 ratio was determined by summing the values of S and T rows in the matrix (SS and ST, respectively), accounting for the different S vs. T composition of the central (1:1) and peripheral (only S or only T) positions (Sctrl and Tctrl, respectively), and then normalizing to the higher value among the two (S0 and T0, respectively, Supplementary Note 1)
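The first four quoted steps can be sketched on a toy matrix as follows. This is an illustrative sketch, not the katlas or the paper's code: the function name `normalize_st_sketch` and the random toy data are hypothetical, the matrix is assumed to have positions as rows and amino acids as columns, lowercase `t` and `y` are assumed to lie outside the randomized set, and "the median of that position" is interpreted as the median over the randomized amino acids. The S0/T0 ratio step is not covered.

```python
import numpy as np
import pandas as pd

def normalize_st_sketch(df, pdhk=False):
    """Hypothetical sketch: rows are positions, columns are amino acids."""
    # Columns assumed NOT to belong to the randomized amino-acid set.
    excluded = ['S', 'T', 'C', 't', 'y']
    if pdhk:
        excluded.append('Y')  # PDHK1/PDHK4 additionally exclude tyrosine
    random_cols = [c for c in df.columns if c not in excluded]
    n_random = len(random_cols)  # 17, or 16 for PDHK, with the full 22 columns

    # Step 1: divide each position by the sum over the randomized amino acids.
    out = df.div(df[random_cols].sum(axis=1), axis=0)
    # Step 2: rescale cysteine so that its median across positions is 1/n_random.
    out['C'] = out['C'] / (out['C'].median() * n_random)
    # Step 3: set S and T to the per-position median of the randomized set.
    med = out[random_cols].median(axis=1)
    out['S'] = med
    out['T'] = med
    return out

# Toy raw intensities (made-up numbers) with the 22 columns seen above.
aas = list('ACDEFGHIKLMNPQRSTVWY') + ['t', 'y']
rng = np.random.default_rng(0)
toy_raw = pd.DataFrame(
    rng.uniform(1e6, 2e7, size=(9, len(aas))),
    index=pd.Index([-5, -4, -3, -2, -1, 1, 2, 3, 4], name='position'),
    columns=aas,
)
toy_res = normalize_st_sketch(toy_raw)
toy_res_pdhk = normalize_st_sketch(toy_raw, pdhk=True)
```

After step 1 the randomized amino acids sum to 1 at every position, so a value of 1/17 (or 1/16) corresponds to no preference.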
In katlas, this normalization is available through get_one_kinase, where normalize is a boolean argument.
Setting normalize to True normalizes the raw data with the method described above.
k_norm = get_one_kinase(raw,'PDHK1',normalize=True)
k_norm
aa | A | C | D | E | F | G | H | I | K | L | ... | P | Q | R | S | T | V | W | Y | t | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
position | |||||||||||||||||||||
-5 | 0.0594 | 0.0625 | 0.0589 | 0.0550 | 0.0775 | 0.0697 | 0.0687 | 0.0590 | 0.0515 | 0.0657 | ... | 0.0451 | 0.0424 | 0.0594 | 0.0594 | 0.0594 | 0.0573 | 0.1001 | 0.0775 | 0.0583 | 0.0658 |
-4 | 0.0618 | 0.0621 | 0.0550 | 0.0511 | 0.0739 | 0.0715 | 0.0598 | 0.0601 | 0.0520 | 0.0614 | ... | 0.0637 | 0.0552 | 0.0617 | 0.0608 | 0.0608 | 0.0519 | 0.0916 | 0.0739 | 0.0528 | 0.0752 |
-3 | 0.0608 | 0.0576 | 0.0499 | 0.0423 | 0.0803 | 0.0580 | 0.0674 | 0.0687 | 0.0481 | 0.0667 | ... | 0.0570 | 0.0532 | 0.0532 | 0.0584 | 0.0584 | 0.0588 | 0.1113 | 0.0803 | 0.0416 | 0.0553 |
-2 | 0.0587 | 0.0655 | 0.0470 | 0.0437 | 0.0790 | 0.0890 | 0.0787 | 0.0533 | 0.0440 | 0.0637 | ... | 0.0500 | 0.0543 | 0.0616 | 0.0565 | 0.0565 | 0.0519 | 0.1082 | 0.0790 | 0.0327 | 0.0557 |
-1 | 0.0782 | 0.1009 | 0.0989 | 0.0426 | 0.0650 | 0.0695 | 0.0782 | 0.0496 | 0.0409 | 0.0578 | ... | 0.0540 | 0.0500 | 0.0469 | 0.0594 | 0.0594 | 0.0514 | 0.0756 | 0.0650 | 0.0358 | 0.0433 |
1 | 0.0400 | 0.0562 | 0.0394 | 0.0355 | 0.0735 | 0.0400 | 0.0502 | 0.1288 | 0.0390 | 0.1439 | ... | 0.0379 | 0.0455 | 0.0455 | 0.0455 | 0.0455 | 0.0797 | 0.0784 | 0.0735 | 0.0336 | 0.0452 |
2 | 0.0496 | 0.0783 | 0.0643 | 0.0555 | 0.0720 | 0.1067 | 0.0684 | 0.0480 | 0.0505 | 0.0555 | ... | 0.0564 | 0.0653 | 0.0695 | 0.0601 | 0.0601 | 0.0508 | 0.0672 | 0.0720 | 0.0414 | 0.0594 |
3 | 0.0486 | 0.0609 | 0.0938 | 0.0684 | 0.1024 | 0.0676 | 0.0544 | 0.0583 | 0.0388 | 0.0552 | ... | 0.0686 | 0.0502 | 0.0561 | 0.0588 | 0.0588 | 0.0593 | 0.0641 | 0.1024 | 0.0539 | 0.0431 |
4 | 0.0565 | 0.0749 | 0.0631 | 0.0535 | 0.0732 | 0.0655 | 0.0664 | 0.0625 | 0.0496 | 0.0552 | ... | 0.0677 | 0.0553 | 0.0604 | 0.0626 | 0.0626 | 0.0579 | 0.0864 | 0.0732 | 0.0548 | 0.0575 |
9 rows × 22 columns
k_norm_official = get_one_kinase(norm,'PDHK1',normalize=False)
k_norm_official
aa | A | C | D | E | F | G | H | I | K | L | ... | P | Q | R | S | T | V | W | Y | t | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
position | |||||||||||||||||||||
-5 | 0.0594 | 0.0625 | 0.0589 | 0.0550 | 0.0775 | 0.0697 | 0.0687 | 0.0590 | 0.0515 | 0.0657 | ... | 0.0451 | 0.0424 | 0.0594 | 0.0594 | 0.0594 | 0.0573 | 0.1001 | 0.0775 | 0.0583 | 0.0658 |
-4 | 0.0618 | 0.0621 | 0.0550 | 0.0511 | 0.0739 | 0.0715 | 0.0598 | 0.0601 | 0.0520 | 0.0614 | ... | 0.0637 | 0.0552 | 0.0617 | 0.0608 | 0.0608 | 0.0519 | 0.0916 | 0.0739 | 0.0528 | 0.0752 |
-3 | 0.0608 | 0.0576 | 0.0499 | 0.0423 | 0.0803 | 0.0580 | 0.0674 | 0.0687 | 0.0481 | 0.0667 | ... | 0.0570 | 0.0532 | 0.0532 | 0.0584 | 0.0584 | 0.0588 | 0.1113 | 0.0803 | 0.0416 | 0.0553 |
-2 | 0.0587 | 0.0655 | 0.0470 | 0.0437 | 0.0790 | 0.0890 | 0.0787 | 0.0533 | 0.0440 | 0.0637 | ... | 0.0500 | 0.0543 | 0.0616 | 0.0565 | 0.0565 | 0.0519 | 0.1082 | 0.0790 | 0.0327 | 0.0557 |
-1 | 0.0782 | 0.1009 | 0.0989 | 0.0426 | 0.0650 | 0.0695 | 0.0782 | 0.0496 | 0.0409 | 0.0578 | ... | 0.0540 | 0.0500 | 0.0469 | 0.0594 | 0.0594 | 0.0514 | 0.0756 | 0.0650 | 0.0358 | 0.0433 |
1 | 0.0400 | 0.0562 | 0.0394 | 0.0355 | 0.0735 | 0.0400 | 0.0502 | 0.1288 | 0.0390 | 0.1439 | ... | 0.0379 | 0.0455 | 0.0455 | 0.0455 | 0.0455 | 0.0797 | 0.0784 | 0.0735 | 0.0336 | 0.0452 |
2 | 0.0496 | 0.0783 | 0.0643 | 0.0555 | 0.0720 | 0.1067 | 0.0684 | 0.0480 | 0.0505 | 0.0555 | ... | 0.0564 | 0.0653 | 0.0695 | 0.0601 | 0.0601 | 0.0508 | 0.0672 | 0.0720 | 0.0414 | 0.0594 |
3 | 0.0486 | 0.0609 | 0.0938 | 0.0684 | 0.1024 | 0.0676 | 0.0544 | 0.0583 | 0.0388 | 0.0552 | ... | 0.0686 | 0.0502 | 0.0561 | 0.0588 | 0.0588 | 0.0593 | 0.0641 | 0.1024 | 0.0539 | 0.0431 |
4 | 0.0565 | 0.0749 | 0.0631 | 0.0535 | 0.0732 | 0.0655 | 0.0664 | 0.0625 | 0.0496 | 0.0552 | ... | 0.0677 | 0.0553 | 0.0604 | 0.0626 | 0.0626 | 0.0579 | 0.0864 | 0.0732 | 0.0548 | 0.0575 |
9 rows × 22 columns
The matrix normalized from the raw data and the officially normalized matrix are identical.
Scale
To further scale the data with the scaling method from Johnson et al. Nature: An atlas of substrate specificities for the human serine/threonine kinome, we multiply all values of a kinase by its number of randomized amino acids (17 for most kinases, and 16 for PDHK1 and PDHK4):
> All kinases are divided by 1/17 (#Random AA); PDHK1 or 4 are divided by 1/16.
num_dict = Data.get_num_dict()
# multiply all values by a scale factor (number of random amino acids)
scale2 = norm.apply(lambda r: r*num_dict.get(r.name), axis=1)
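As a small illustration of this scaling, the sketch below applies per-kinase factors to a toy normalized table. The kinase name `KIN_A`, the toy values and the dictionary `toy_num` are hypothetical stand-ins; it is assumed here that the real dictionary maps each kinase name to its number of randomized amino acids. The vectorized `DataFrame.mul` form is an alternative to the row-wise `apply`.

```python
import pandas as pd

# Toy normalized values (made up); 1/17 and 1/16 represent "no preference".
toy_norm = pd.DataFrame(
    {'-1A': [0.05, 0.06], '-1C': [1 / 17, 1 / 16]},
    index=pd.Index(['KIN_A', 'PDHK1'], name='kinase'),
)
# Hypothetical stand-in for the per-kinase number of randomized amino acids.
toy_num = {'KIN_A': 17, 'PDHK1': 16}

# Row-wise version, mirroring the apply call above.
toy_scale_apply = toy_norm.apply(lambda r: r * toy_num.get(r.name), axis=1)

# Vectorized version: align the factors to the index, then multiply each row.
factors = pd.Series(toy_num).reindex(toy_norm.index)
toy_scale = toy_norm.mul(factors, axis=0)
print(toy_scale)
```

After scaling, a value of 1 corresponds to no preference. With `reindex`, a kinase missing from the dictionary yields NaN in its row instead of raising an error.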
We can compare the calculated table with the scaled table from the paper. They agree, apart from differences of 0.01 in a few displayed entries, which is consistent with rounding.
scale2.round(2)
-5P | -5G | -5A | -5C | -5S | -5T | -5V | -5I | -5L | -5M | ... | 4H | 4K | 4R | 4Q | 4N | 4D | 4E | 4s | 4t | 4y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
kinase | |||||||||||||||||||||
AAK1 | 1.22 | 0.42 | 0.48 | 0.78 | 0.72 | 0.72 | 1.62 | 2.64 | 1.69 | 1.47 | ... | 0.95 | 1.41 | 1.58 | 1.08 | 1.01 | 0.66 | 0.78 | 0.43 | 0.43 | 0.46 |
ACVR2A | 0.71 | 0.82 | 0.99 | 0.83 | 0.98 | 0.98 | 1.02 | 1.06 | 1.01 | 0.89 | ... | 0.97 | 0.90 | 0.83 | 1.05 | 0.95 | 1.09 | 1.09 | 1.20 | 1.20 | 1.00 |
ACVR2B | 0.91 | 0.88 | 0.96 | 1.31 | 0.91 | 0.91 | 0.92 | 0.75 | 0.80 | 0.88 | ... | 0.95 | 0.77 | 0.83 | 0.99 | 0.90 | 1.24 | 1.18 | 1.29 | 1.29 | 1.08 |
AKT1 | 1.03 | 1.01 | 0.94 | 1.03 | 0.88 | 0.88 | 0.73 | 0.74 | 0.79 | 0.86 | ... | 1.13 | 1.95 | 1.84 | 1.25 | 1.10 | 0.75 | 0.53 | 0.67 | 0.67 | 0.45 |
AKT2 | 1.02 | 1.05 | 1.09 | 0.99 | 0.91 | 0.91 | 0.74 | 0.71 | 0.84 | 0.87 | ... | 1.15 | 1.96 | 1.68 | 1.11 | 1.06 | 0.62 | 0.60 | 0.93 | 0.93 | 0.71 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
YANK2 | 0.99 | 1.19 | 1.08 | 1.02 | 0.99 | 0.99 | 0.74 | 0.80 | 0.78 | 0.80 | ... | 1.16 | 1.84 | 1.87 | 0.86 | 0.83 | 0.54 | 0.77 | 1.86 | 1.86 | 10.72 |
YANK3 | 1.06 | 1.32 | 1.10 | 1.02 | 0.93 | 0.93 | 0.85 | 0.91 | 0.95 | 0.92 | ... | 1.02 | 1.19 | 1.25 | 1.07 | 0.99 | 0.87 | 1.47 | 2.05 | 2.05 | 9.82 |
YSK1 | 1.00 | 1.21 | 1.24 | 1.03 | 0.92 | 0.92 | 0.85 | 0.80 | 0.76 | 0.90 | ... | 1.20 | 2.01 | 2.19 | 0.86 | 0.92 | 0.52 | 0.45 | 0.44 | 0.44 | 0.37 |
YSK4 | 1.01 | 1.24 | 1.26 | 1.25 | 1.01 | 1.01 | 0.88 | 0.68 | 0.74 | 0.87 | ... | 1.12 | 1.38 | 1.05 | 1.26 | 1.05 | 0.99 | 0.82 | 1.08 | 1.08 | 0.66 |
ZAK | 1.03 | 1.09 | 1.12 | 1.07 | 1.01 | 1.01 | 0.77 | 0.73 | 0.81 | 0.82 | ... | 1.14 | 2.05 | 1.72 | 1.04 | 0.95 | 0.58 | 0.63 | 0.66 | 0.66 | 0.69 |
303 rows × 207 columns
scale.round(2)
-5P | -5G | -5A | -5C | -5S | -5T | -5V | -5I | -5L | -5M | ... | 4H | 4K | 4R | 4Q | 4N | 4D | 4E | 4s | 4t | 4y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
kinase | |||||||||||||||||||||
AAK1 | 1.22 | 0.42 | 0.48 | 0.78 | 0.72 | 0.72 | 1.62 | 2.64 | 1.69 | 1.47 | ... | 0.95 | 1.41 | 1.58 | 1.08 | 1.01 | 0.66 | 0.78 | 0.43 | 0.43 | 0.46 |
ACVR2A | 0.71 | 0.82 | 0.99 | 0.83 | 0.98 | 0.98 | 1.02 | 1.06 | 1.01 | 0.89 | ... | 0.97 | 0.90 | 0.84 | 1.05 | 0.95 | 1.09 | 1.09 | 1.20 | 1.20 | 1.00 |
ACVR2B | 0.91 | 0.88 | 0.96 | 1.31 | 0.91 | 0.91 | 0.92 | 0.75 | 0.80 | 0.88 | ... | 0.95 | 0.77 | 0.83 | 0.99 | 0.90 | 1.24 | 1.18 | 1.29 | 1.29 | 1.08 |
AKT1 | 1.03 | 1.01 | 0.94 | 1.03 | 0.88 | 0.88 | 0.73 | 0.74 | 0.79 | 0.86 | ... | 1.13 | 1.95 | 1.84 | 1.25 | 1.10 | 0.75 | 0.53 | 0.67 | 0.67 | 0.45 |
AKT2 | 1.02 | 1.05 | 1.09 | 0.99 | 0.91 | 0.91 | 0.74 | 0.71 | 0.84 | 0.87 | ... | 1.15 | 1.96 | 1.68 | 1.11 | 1.06 | 0.61 | 0.59 | 0.93 | 0.93 | 0.71 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
YANK2 | 0.99 | 1.19 | 1.08 | 1.02 | 0.99 | 0.99 | 0.74 | 0.80 | 0.78 | 0.80 | ... | 1.16 | 1.84 | 1.87 | 0.86 | 0.83 | 0.54 | 0.77 | 1.86 | 1.86 | 10.72 |
YANK3 | 1.06 | 1.32 | 1.10 | 1.02 | 0.93 | 0.93 | 0.85 | 0.91 | 0.95 | 0.92 | ... | 1.02 | 1.19 | 1.25 | 1.07 | 0.99 | 0.87 | 1.46 | 2.05 | 2.05 | 9.82 |
YSK1 | 1.00 | 1.21 | 1.24 | 1.03 | 0.92 | 0.92 | 0.85 | 0.80 | 0.76 | 0.90 | ... | 1.20 | 2.01 | 2.19 | 0.86 | 0.92 | 0.51 | 0.45 | 0.44 | 0.44 | 0.37 |
YSK4 | 1.01 | 1.24 | 1.27 | 1.25 | 1.02 | 1.02 | 0.88 | 0.68 | 0.74 | 0.87 | ... | 1.12 | 1.38 | 1.05 | 1.26 | 1.05 | 1.00 | 0.82 | 1.08 | 1.08 | 0.66 |
ZAK | 1.03 | 1.09 | 1.12 | 1.07 | 1.01 | 1.01 | 0.77 | 0.73 | 0.81 | 0.82 | ... | 1.14 | 2.05 | 1.72 | 1.04 | 0.95 | 0.58 | 0.63 | 0.66 | 0.66 | 0.69 |
303 rows × 207 columns
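Rather than comparing two large tables by eye, the agreement can be checked numerically. The following is a minimal sketch with a hypothetical helper `compare_frames` and made-up toy frames; it reports the largest absolute difference, the number of cells that differ after rounding, and whether the frames agree within the displayed precision.

```python
import numpy as np
import pandas as pd

def compare_frames(a, b, decimals=2):
    """Hypothetical helper: summarize how closely two numeric frames agree."""
    # Keep only the rows and columns present in both frames.
    a, b = a.align(b, join='inner')
    abs_diff = (a - b).abs().to_numpy()
    rounded_mismatch = (a.round(decimals) != b.round(decimals)).to_numpy()
    return {
        'max_abs_diff': float(abs_diff.max()),
        'n_rounded_mismatch': int(rounded_mismatch.sum()),
        'allclose': bool(np.allclose(a.to_numpy(), b.to_numpy(), atol=10 ** -decimals)),
    }

# Toy frames (made-up values) that differ slightly in one cell.
toy_a = pd.DataFrame({'x': [0.834, 0.50], 'y': [1.00, 0.61]}, index=['k1', 'k2'])
toy_b = pd.DataFrame({'x': [0.836, 0.50], 'y': [1.00, 0.61]}, index=['k1', 'k2'])
report = compare_frames(toy_a, toy_b)
print(report)
```

In the notebook, the same helper could be called on `scale2` and `scale`. A small nonzero count of rounded mismatches together with a tiny maximum difference points to rounding rather than a real discrepancy.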