ks = Data.get_ks_dataset()
PSSM
Setup
from katlas.pssm import *
PSSM
get_prob
get_prob (df:pandas.core.frame.DataFrame, col:str, aa_order=['P', 'G', 'A', 'C', 'S', 'T', 'V', 'I', 'L', 'M', 'F', 'Y', 'W', 'H', 'K', 'R', 'Q', 'N', 'D', 'E', 's', 't', 'y'])
Get the probability matrix of PSSM from phosphorylation site sequences.
ks_k = ks[ks.kinase_uniprot=='P00519']
pssm_df = get_prob(ks_k,'site_seq')
pssm_df.head()
Position | -20 | -19 | -18 | -17 | -16 | -15 | -14 | -13 | -12 | -11 | -10 | -9 | -8 | -7 | -6 | -5 | -4 | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
aa | |||||||||||||||||||||||||||||||||||||||||
P | 0.050061 | 0.048691 | 0.062349 | 0.055489 | 0.046988 | 0.054753 | 0.064787 | 0.055090 | 0.056683 | 0.048272 | 0.052257 | 0.054599 | 0.053974 | 0.043170 | 0.060839 | 0.067138 | 0.049971 | 0.055817 | 0.076968 | 0.049354 | 0.0 | 0.020024 | 0.049645 | 0.125370 | 0.054997 | 0.056872 | 0.057382 | 0.057588 | 0.062048 | 0.053463 | 0.058104 | 0.052728 | 0.051140 | 0.069436 | 0.063164 | 0.057716 | 0.056639 | 0.051072 | 0.050697 | 0.052163 | 0.060703 |
G | 0.080586 | 0.080341 | 0.069007 | 0.067551 | 0.082530 | 0.070397 | 0.093581 | 0.073054 | 0.077566 | 0.072706 | 0.067102 | 0.077745 | 0.052788 | 0.078060 | 0.070880 | 0.065371 | 0.087008 | 0.073443 | 0.084019 | 0.065217 | 0.0 | 0.091284 | 0.069740 | 0.062685 | 0.070373 | 0.075237 | 0.060371 | 0.072585 | 0.080120 | 0.077157 | 0.072783 | 0.099939 | 0.070856 | 0.071916 | 0.075672 | 0.071518 | 0.064821 | 0.080076 | 0.088720 | 0.062341 | 0.090735 |
A | 0.080586 | 0.080341 | 0.062954 | 0.054282 | 0.075301 | 0.071600 | 0.070186 | 0.070060 | 0.065632 | 0.070322 | 0.077791 | 0.073591 | 0.053381 | 0.065050 | 0.069108 | 0.071849 | 0.062316 | 0.081669 | 0.063455 | 0.060517 | 0.0 | 0.102473 | 0.084515 | 0.075103 | 0.063276 | 0.075237 | 0.083084 | 0.049190 | 0.053614 | 0.073512 | 0.073394 | 0.064378 | 0.077634 | 0.069436 | 0.072545 | 0.063363 | 0.079924 | 0.088272 | 0.087452 | 0.057888 | 0.070927 |
C | 0.017094 | 0.012781 | 0.013317 | 0.019903 | 0.012048 | 0.017449 | 0.007798 | 0.014371 | 0.013126 | 0.012515 | 0.023159 | 0.015430 | 0.012456 | 0.014784 | 0.010632 | 0.012956 | 0.014697 | 0.009988 | 0.010576 | 0.008226 | 0.0 | 0.017079 | 0.017730 | 0.023655 | 0.018924 | 0.010664 | 0.016736 | 0.010798 | 0.008434 | 0.010328 | 0.012232 | 0.007357 | 0.017868 | 0.014879 | 0.012508 | 0.011920 | 0.018880 | 0.019546 | 0.014575 | 0.019084 | 0.014058 |
S | 0.047619 | 0.035910 | 0.046610 | 0.030157 | 0.037349 | 0.042720 | 0.041992 | 0.041916 | 0.034010 | 0.039333 | 0.037411 | 0.031454 | 0.043891 | 0.035482 | 0.024808 | 0.029446 | 0.026455 | 0.016451 | 0.021739 | 0.009401 | 0.0 | 0.019435 | 0.017730 | 0.019515 | 0.031934 | 0.029028 | 0.028691 | 0.034793 | 0.024699 | 0.032199 | 0.029969 | 0.024525 | 0.036352 | 0.047117 | 0.040025 | 0.042033 | 0.040277 | 0.039092 | 0.051965 | 0.041349 | 0.039617 |
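To make the idea behind the table above concrete, here is a minimal pure-Python sketch of a per-position probability matrix built from centre-aligned site sequences. It is not the katlas implementation: the function name `position_probabilities`, the toy sequences, and the padding characters `_` and `-` are assumptions made for illustration only.

```python
from collections import Counter

def position_probabilities(seqs):
    """Column-wise residue frequencies for equal-length, centre-aligned sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    half = length // 2  # index of the phospho-acceptor (position 0)
    probs = {}
    for i in range(length):
        # Count residues in this column, skipping (hypothetical) padding characters.
        counts = Counter(s[i] for s in seqs if s[i] not in "_-")
        total = sum(counts.values())
        probs[i - half] = {aa: n / total for aa, n in counts.items()} if total else {}
    return probs

# Toy input: lowercase s/t/y mark the phosphorylated residue, as in aa_order.
toy = ["AKsPE", "GRsPD", "AKtLE"]
pm = position_probabilities(toy)
print(pm[0])  # frequencies of s and t at the centre position
```

Each column sums to 1 over the residues observed at that position, which is the property the downstream entropy and divergence functions rely on.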
pssm_to_seq
pssm_to_seq (pssm_df, thr=0.4, contain_sty=True)
Represent PSSM in string sequence of amino acids
Parameter | Type | Default | Details |
---|---|---|---|
pssm_df | |||
thr | float | 0.4 | threshold of probability to show in sequence |
contain_sty | bool | True | keep only s,t,y values (last three) in center 0 position |
pssm_to_seq(pssm_df,thr=0.1)
'........K.K.K..E.EEVy*[E/A].[L/P]....K..........L.'
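The string format above (dots for positions without a strong residue, brackets for several, `*` after the centre) can be reproduced with a short sketch. This is an illustration under assumptions, not the library code: `consensus_string` is a hypothetical name, the input is a plain `{position: {aa: prob}}` dict, and the threshold is treated as inclusive (`>=`), which the source does not specify.

```python
def consensus_string(pssm, thr=0.4):
    """pssm: {position: {aa: prob}}; returns a compact motif string."""
    parts = []
    for pos in sorted(pssm):
        # Residues at or above the threshold, most probable first.
        hits = sorted((aa for aa, p in pssm[pos].items() if p >= thr),
                      key=lambda aa: -pssm[pos][aa])
        if not hits:
            token = "."
        elif len(hits) == 1:
            token = hits[0]
        else:
            token = "[" + "/".join(hits) + "]"
        if pos == 0:
            token += "*"  # mark the phospho-acceptor position
        parts.append(token)
    return "".join(parts)

toy = {
    -1: {"E": 0.5, "A": 0.3, "G": 0.2},
    0: {"y": 0.9, "s": 0.1},
    1: {"P": 0.2, "G": 0.2, "A": 0.6},
    2: {"A": 0.25, "G": 0.25, "P": 0.25, "L": 0.25},
}
print(consensus_string(toy, thr=0.3))  # [E/A]y*A.
print(consensus_string(toy, thr=0.4))  # Ey*A.
```

Lowering `thr` reveals weaker preferences at the cost of a noisier string, which is why the example call above uses 0.1 rather than the default 0.4.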
recover_pssm
recover_pssm (flat_pssm:pandas.core.series.Series, aa_order=['P', 'G', 'A', 'C', 'S', 'T', 'V', 'I', 'L', 'M', 'F', 'Y', 'W', 'H', 'K', 'R', 'Q', 'N', 'D', 'E', 's', 't', 'y'])
Recover 2D pssm from flat pssm Series
pspa = Data.get_pspa_all_norm()
flat_pssm = pspa.loc['AAK1'].dropna()
recovered = recover_pssm(flat_pssm)
recovered
Position | -5 | -4 | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|---|---|---|---|---|
aa | ||||||||||
P | 0.0720 | 0.0534 | 0.1084 | 0.0226 | 0.1136 | 0.0 | 0.0463 | 0.0527 | 0.0681 | 0.0628 |
G | 0.0245 | 0.0642 | 0.0512 | 0.0283 | 0.0706 | 0.0 | 0.7216 | 0.0749 | 0.0923 | 0.0702 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
pT | 0.0201 | 0.0332 | 0.0303 | 0.0209 | 0.0121 | 1.0 | 0.0123 | 0.0409 | 0.0335 | 0.0251 |
pY | 0.0611 | 0.0339 | 0.0274 | 0.0486 | 0.0178 | 0.0 | 0.0100 | 0.0410 | 0.0359 | 0.0270 |
23 rows × 10 columns
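As a rough illustration of the recovery step, the sketch below turns a flat mapping back into a two-level structure. It assumes the flat keys are a signed integer position followed by the residue label (for example `-5P` or `0pY`), matching the key style in the pssm2dict output on this page; `unflatten` is a hypothetical helper, not a katlas function.

```python
import re

def unflatten(flat):
    """flat: {'-5P': 0.07, ...} -> {aa: {position: value}}"""
    matrix = {}
    for key, value in flat.items():
        # Leading signed integer = position; the rest = residue label.
        m = re.fullmatch(r"(-?\d+)(\D.*)", key)
        if m is None:
            raise ValueError(f"unexpected key: {key!r}")
        pos, aa = int(m.group(1)), m.group(2)
        matrix.setdefault(aa, {})[pos] = value
    return matrix

flat = {"-1P": 0.1, "0P": 0.0, "1P": 0.3, "-1pY": 0.2, "0pY": 1.0, "1pY": 0.05}
m = unflatten(flat)
print(m["pY"][0])  # 1.0
```

Parsing the position as an integer rather than a string keeps the columns in numeric order when they are sorted later.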
process_pssm
process_pssm (pssm_df)
Keep only s,t,y values in center 0 position; normalize per position
norm_pssm = process_pssm(recovered)
norm_pssm.head()
Position | -5 | -4 | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|---|---|---|---|---|
aa | ||||||||||
P | 0.058446 | 0.041715 | 0.086100 | 0.017935 | 0.096068 | 0.0 | 0.042649 | 0.040482 | 0.052640 | 0.050260 |
G | 0.019888 | 0.050152 | 0.040667 | 0.022459 | 0.059704 | 0.0 | 0.664702 | 0.057536 | 0.071346 | 0.056182 |
A | 0.023054 | 0.055152 | 0.088880 | 0.042695 | 0.032558 | 0.0 | 0.028740 | 0.057613 | 0.044987 | 0.051701 |
C | 0.037016 | 0.043747 | 0.052025 | 0.046663 | 0.026469 | 0.0 | 0.020542 | 0.052543 | 0.057355 | 0.048259 |
S | 0.034500 | 0.048356 | 0.041859 | 0.044044 | 0.046089 | 0.0 | 0.013172 | 0.042403 | 0.044987 | 0.044818 |
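The two operations in the description (zeroing non-acceptor residues at the centre, then normalizing each position) can be sketched as follows. The helper name `keep_sty_and_normalize`, the `{position: {aa: value}}` layout, and the acceptor labels are assumptions for illustration; the real data may label acceptors differently.

```python
def keep_sty_and_normalize(pssm, center=0, acceptors=("s", "t", "y")):
    """Zero non-acceptor rows at the centre, then rescale every column to sum to 1."""
    out = {}
    for pos, col in pssm.items():
        col = dict(col)  # work on a copy so the input is left untouched
        if pos == center:
            col = {aa: (v if aa in acceptors else 0.0) for aa, v in col.items()}
        total = sum(col.values())
        out[pos] = {aa: (v / total if total else 0.0) for aa, v in col.items()}
    return out

toy = {
    -1: {"P": 2.0, "G": 2.0, "s": 0.0},
    0: {"P": 0.5, "s": 1.0, "t": 3.0},
}
norm = keep_sty_and_normalize(toy)
print(norm[0])  # {'P': 0.0, 's': 0.25, 't': 0.75}
```

Normalizing after the masking step is what makes the centre column a distribution over the acceptor residues only.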
pssm2dict
pssm2dict (pssm_df)
Convert pssm dataframe to dict
pssm2dict(pssm_df.iloc[:1,:10])
{'-20P': 0.05006,
'-19P': 0.04869,
'-18P': 0.06235,
'-17P': 0.05549,
'-16P': 0.04699,
'-15P': 0.05475,
'-14P': 0.06479,
'-13P': 0.05509,
'-12P': 0.05668,
'-11P': 0.04827}
JS divergence
js_divergence
js_divergence (p1, p2, mean=True)
p1 and p2 are two arrays (df or np) with index as aa and column as position
Parameter | Type | Default | Details |
---|---|---|---|
p1 | | | pssm |
p2 | | | pssm |
mean | bool | True | |
js_divergence(pssm_df,pssm_df)
1.0000000826903708e-10
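For reference, a textbook Jensen-Shannon divergence computed per position and then averaged looks like the sketch below. It assumes base-2 logarithms and no smoothing, neither of which is confirmed by the source; the tiny non-zero result above for identical inputs would be consistent with a small smoothing constant that this sketch deliberately omits. The function names are hypothetical.

```python
import math

def _kl(p, q):
    # Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_per_position(p1, p2):
    """p1, p2: lists of columns, each column a list of probabilities."""
    out = []
    for a, b in zip(p1, p2):
        m = [(x + y) / 2 for x, y in zip(a, b)]  # mixture distribution
        out.append(0.5 * _kl(a, m) + 0.5 * _kl(b, m))
    return out

def js_mean(p1, p2):
    vals = js_per_position(p1, p2)
    return sum(vals) / len(vals)

same = [[0.5, 0.5], [0.2, 0.8]]
print(js_mean(same, same))                # 0.0
print(js_per_position([[1, 0]], [[0, 1]]))  # [1.0], the maximum in bits
```

With base-2 logarithms the divergence is bounded between 0 (identical columns) and 1 (disjoint columns), which makes the mean easy to compare across PSSMs of different widths.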
js_divergence_flat
js_divergence_flat (p1_flat, p2_flat)
p1 and p2 are two flattened pd.Series with index as aa and column as position
Parameter | Details |
---|---|
p1_flat | pd.Series of flattened pssm |
p2_flat | pd.Series of flattened pssm |
flat_norm_pssm = pd.Series(pssm2dict(norm_pssm))
js_divergence(flat_norm_pssm,flat_norm_pssm)
1.0000050826907844e-09
Entropy & Information Content
entropy
entropy (pssm_df, return_min=False, exclude_zero=False, contain_sty=True)
Calculate entropy per position (max) of a PSSM surrounding 0
Parameter | Type | Default | Details |
---|---|---|---|
pssm_df | | | a dataframe of pssm with index as aa and column as position |
return_min | bool | False | return min entropy as a single value or return all entropy as a series |
exclude_zero | bool | False | exclude the column of 0 (center position) in the entropy calculation |
contain_sty | bool | True | keep only s,t,y values (last three) in center 0 position |
entropy(pssm_df)
Position
-20 4.324109
-19 4.257291
...
19 4.293755
20 4.279981
Length: 41, dtype: float64
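The values above are consistent with Shannon entropy in bits over 23 residue classes, whose maximum is log2(23), about 4.52. A minimal sketch of that calculation follows, assuming base-2 logarithms and a `{position: {aa: prob}}` input; the function names are hypothetical and the `exclude_zero` and `contain_sty` options are not modelled.

```python
import math

def column_entropy(col):
    """Shannon entropy in bits of one column of probabilities."""
    return -sum(p * math.log2(p) for p in col if p > 0)

def entropy_by_position(pssm, return_min=False):
    """Entropy of every position, or only the smallest one if return_min is set."""
    ent = {pos: column_entropy(col.values()) for pos, col in pssm.items()}
    return min(ent.values()) if return_min else ent

toy = {
    -1: {"A": 0.25, "G": 0.25, "P": 0.25, "K": 0.25},  # uniform: 2 bits
    1: {"A": 0.5, "G": 0.5, "P": 0.0, "K": 0.0},       # two choices: 1 bit
}
print(entropy_by_position(toy))                   # {-1: 2.0, 1: 1.0}
print(entropy_by_position(toy, return_min=True))  # 1.0
```

A low entropy marks a selective position, so the minimum over positions is a quick summary of how specific a kinase motif is.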
entropy_flat
entropy_flat (flat_pssm:pandas.core.series.Series, return_min=False, exclude_zero=False, contain_sty=True)
Calculate entropy per position of a flat PSSM surrounding 0
Parameter | Type | Default | Details |
---|---|---|---|
flat_pssm | Series | ||
return_min | bool | False | return min entropy as a single value or return all entropy as a series |
exclude_zero | bool | False | exclude the column of 0 (center position) in the entropy calculation |
contain_sty | bool | True | keep only s,t,y values (last three) in center 0 position |
get_IC_standard
get_IC_standard (pssm_df)
Calculate the standard information content (bits) from frequency matrix, using the same number of residues log2(len(pssm_df)) for all positions
get_IC
get_IC (pssm_df, return_min=False, exclude_zero=False, contain_sty=True)
Calculate the information content (bits) from a frequency matrix, using log2(3) for the middle position and log2(len(pssm_df)) for others.
Parameter | Type | Default | Details |
---|---|---|---|
pssm_df | | | a dataframe of pssm with index as aa and column as position |
return_min | bool | False | return min entropy as a single value or return all entropy as a series |
exclude_zero | bool | False | exclude the column of 0 (center position) in the entropy calculation |
contain_sty | bool | True | keep only s,t,y values (last three) in center 0 position |
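The description above amounts to subtracting each position's entropy from its maximum possible entropy: log2(3) at the centre, where only three acceptor residues are allowed, and log2 of the number of rows elsewhere. The sketch below illustrates that arithmetic on a toy matrix; `information_content` and its parameters are hypothetical names, not the katlas signature.

```python
import math

def information_content(pssm, n_residues, center=0, n_center=3):
    """IC in bits per position: maximum entropy minus observed entropy."""
    ic = {}
    for pos, col in pssm.items():
        h = -sum(p * math.log2(p) for p in col.values() if p > 0)
        # The centre can only hold n_center residues, so its ceiling is lower.
        max_h = math.log2(n_center if pos == center else n_residues)
        ic[pos] = max_h - h
    return ic

toy = {
    -1: {"A": 0.5, "G": 0.5, "P": 0.0, "K": 0.0},
    0: {"s": 1.0, "t": 0.0, "y": 0.0, "A": 0.0},
}
ic = information_content(toy, n_residues=4)
print(ic[-1])  # 1.0 (2 bits maximum minus 1 bit observed)
print(ic[0])   # log2(3), since the centre is fully determined
```

Using a single ceiling of log2(number of rows) for every position, as get_IC_standard describes, would overstate the information at the centre, where most rows are zero by construction.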
get_IC_flat
get_IC_flat (flat_pssm:pandas.core.series.Series, return_min=False, exclude_zero=False, contain_sty=True)
Calculate the information content (bits) from a flattened pssm pd.Series, using log2(3) for the middle position and log2(len(pssm_df)) for others.
Parameter | Type | Default | Details |
---|---|---|---|
flat_pssm | Series | ||
return_min | bool | False | return min entropy as a single value or return all entropy as a series |
exclude_zero | bool | False | exclude the column of 0 (center position) in the entropy calculation |
contain_sty | bool | True | keep only s,t,y values (last three) in center 0 position |
get_scaled_IC
get_scaled_IC (pssm_df)
For plotting purpose, calculate the scaled information content (bits) from a frequency matrix, using log2(3) for the middle position and log2(len(pssm_df)) for others.
PSPA normalization
raw2norm
raw2norm (df:pandas.core.frame.DataFrame, PDHK:bool=False)
Normalize single ST kinase data
Parameter | Type | Default | Details |
---|---|---|---|
df | DataFrame | | single kinase’s df has position as index, and single amino acid as columns |
PDHK | bool | False | whether this kinase belongs to PDHK family |
This function implements the normalization method from Johnson et al., Nature, "An atlas of substrate specificities for the human serine/threonine kinome".
Specifically:
> - matrices were column-normalized at all positions by the sum of the 17 randomized amino acids (excluding serine, threonine and cysteine), to yield PSSMs.
> - PDHK1 and PDHK4 were normalized to the 16 randomized amino acids (excluding serine, threonine, cysteine and additionally tyrosine)
> - The cysteine row was scaled by its median to be 1/17 (1/16 for PDHK1 and PDHK4).
> - The serine and threonine values in each position were set to be the median of that position.
> - The S0/T0 ratio was determined by summing the values of S and T rows in the matrix (SS and ST, respectively), accounting for the different S vs. T composition of the central (1:1) and peripheral (only S or only T) positions (Sctrl and Tctrl, respectively), and then normalizing to the higher value among the two (S0 and T0, respectively, Supplementary Note 1)
This function is usually called through get_one_kinase (below), which exposes it via the boolean normalize argument.
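The first four normalization steps quoted above can be sketched in plain Python as below. This is an interpretation, not the katlas implementation: `normalize_st_kinase` is a hypothetical name, the input holds only the 20 unmodified residues, the "median of that position" is taken over all values at the position other than S and T (an assumption), and the S0/T0 ratio step is omitted.

```python
from statistics import median

AAS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 unmodified residues

def normalize_st_kinase(raw, pdhk=False):
    """raw: {position: {aa: signal}} for one S/T kinase; returns a normalized copy."""
    excluded = {"S", "T", "C"} | ({"Y"} if pdhk else set())
    target = 1 / (16 if pdhk else 17)

    # 1. Scale each position by the summed signal of the randomized residues.
    norm = {}
    for pos, row in raw.items():
        denom = sum(v for aa, v in row.items() if aa not in excluded)
        norm[pos] = {aa: v / denom for aa, v in row.items()}

    # 2. Rescale cysteine so its median across positions equals the target.
    c_scale = target / median(row["C"] for row in norm.values())
    for row in norm.values():
        row["C"] *= c_scale

    # 3. Set S and T to the median of the other values at that position (assumption).
    for row in norm.values():
        med = median(v for aa, v in row.items() if aa not in ("S", "T"))
        row["S"] = row["T"] = med
    return norm

# Toy signal matrix with four positions; the values are arbitrary.
raw = {pos: {aa: 1.0 + (i + pos) % 5 for i, aa in enumerate(AAS)}
       for pos in (-2, -1, 1, 2)}
pssm = normalize_st_kinase(raw)
print(round(sum(v for aa, v in pssm[1].items() if aa not in "STC"), 6))  # 1.0
```

After these steps the randomized residues at each position sum to 1, while cysteine, serine and threonine are filled in separately because they were not randomized in the same way in the peptide library.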
get_one_kinase
get_one_kinase (df:pandas.core.frame.DataFrame, kinase:str, normalize:bool=False, drop_s:bool=True)
Obtain a specific kinase data from stacked dataframe
Parameter | Type | Default | Details |
---|---|---|---|
df | DataFrame | | stacked dataframe (paper’s raw data) |
kinase | str | | a specific kinase |
normalize | bool | False | normalize according to the paper; special for PDHK1/4 |
drop_s | bool | True | drop s, as s is a duplicate of t in PSPA |
Retrieve a single kinase's data from PSPA data formatted with kinase as index and position+amino acid as columns.
data = Data.get_pspa_st_norm()
get_one_kinase(data, 'PDHK1')
aa | A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y | t | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
position | ||||||||||||||||||||||
-5 | 0.0594 | 0.0625 | 0.0589 | 0.0550 | 0.0775 | 0.0697 | 0.0687 | 0.0590 | 0.0515 | 0.0657 | 0.0687 | 0.0613 | 0.0451 | 0.0424 | 0.0594 | 0.0594 | 0.0594 | 0.0573 | 0.1001 | 0.0775 | 0.0583 | 0.0658 |
-4 | 0.0618 | 0.0621 | 0.0550 | 0.0511 | 0.0739 | 0.0715 | 0.0598 | 0.0601 | 0.0520 | 0.0614 | 0.0744 | 0.0549 | 0.0637 | 0.0552 | 0.0617 | 0.0608 | 0.0608 | 0.0519 | 0.0916 | 0.0739 | 0.0528 | 0.0752 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3 | 0.0486 | 0.0609 | 0.0938 | 0.0684 | 0.1024 | 0.0676 | 0.0544 | 0.0583 | 0.0388 | 0.0552 | 0.0637 | 0.0505 | 0.0686 | 0.0502 | 0.0561 | 0.0588 | 0.0588 | 0.0593 | 0.0641 | 0.1024 | 0.0539 | 0.0431 |
4 | 0.0565 | 0.0749 | 0.0631 | 0.0535 | 0.0732 | 0.0655 | 0.0664 | 0.0625 | 0.0496 | 0.0552 | 0.0627 | 0.0640 | 0.0677 | 0.0553 | 0.0604 | 0.0626 | 0.0626 | 0.0579 | 0.0864 | 0.0732 | 0.0548 | 0.0575 |
10 rows × 22 columns