PSSM

Functions related to PSSMs

Setup

from katlas.pssm import *

PSSM



get_prob

 get_prob (df:pandas.core.frame.DataFrame, col:str, aa_order=['P', 'G',
           'A', 'C', 'S', 'T', 'V', 'I', 'L', 'M', 'F', 'Y', 'W', 'H',
           'K', 'R', 'Q', 'N', 'D', 'E', 's', 't', 'y'])

Get the probability matrix of PSSM from phosphorylation site sequences.

ks = Data.get_ks_dataset()
ks_k = ks[ks.kinase_uniprot=='P00519']
pssm_df = get_prob(ks_k,'site_seq')
pssm_df.head()
Position -20 -19 -18 -17 -16 -15 -14 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
aa
P 0.050061 0.048691 0.062349 0.055489 0.046988 0.054753 0.064787 0.055090 0.056683 0.048272 0.052257 0.054599 0.053974 0.043170 0.060839 0.067138 0.049971 0.055817 0.076968 0.049354 0.0 0.020024 0.049645 0.125370 0.054997 0.056872 0.057382 0.057588 0.062048 0.053463 0.058104 0.052728 0.051140 0.069436 0.063164 0.057716 0.056639 0.051072 0.050697 0.052163 0.060703
G 0.080586 0.080341 0.069007 0.067551 0.082530 0.070397 0.093581 0.073054 0.077566 0.072706 0.067102 0.077745 0.052788 0.078060 0.070880 0.065371 0.087008 0.073443 0.084019 0.065217 0.0 0.091284 0.069740 0.062685 0.070373 0.075237 0.060371 0.072585 0.080120 0.077157 0.072783 0.099939 0.070856 0.071916 0.075672 0.071518 0.064821 0.080076 0.088720 0.062341 0.090735
A 0.080586 0.080341 0.062954 0.054282 0.075301 0.071600 0.070186 0.070060 0.065632 0.070322 0.077791 0.073591 0.053381 0.065050 0.069108 0.071849 0.062316 0.081669 0.063455 0.060517 0.0 0.102473 0.084515 0.075103 0.063276 0.075237 0.083084 0.049190 0.053614 0.073512 0.073394 0.064378 0.077634 0.069436 0.072545 0.063363 0.079924 0.088272 0.087452 0.057888 0.070927
C 0.017094 0.012781 0.013317 0.019903 0.012048 0.017449 0.007798 0.014371 0.013126 0.012515 0.023159 0.015430 0.012456 0.014784 0.010632 0.012956 0.014697 0.009988 0.010576 0.008226 0.0 0.017079 0.017730 0.023655 0.018924 0.010664 0.016736 0.010798 0.008434 0.010328 0.012232 0.007357 0.017868 0.014879 0.012508 0.011920 0.018880 0.019546 0.014575 0.019084 0.014058
S 0.047619 0.035910 0.046610 0.030157 0.037349 0.042720 0.041992 0.041916 0.034010 0.039333 0.037411 0.031454 0.043891 0.035482 0.024808 0.029446 0.026455 0.016451 0.021739 0.009401 0.0 0.019435 0.017730 0.019515 0.031934 0.029028 0.028691 0.034793 0.024699 0.032199 0.029969 0.024525 0.036352 0.047117 0.040025 0.042033 0.040277 0.039092 0.051965 0.041349 0.039617
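
As a rough illustration of what a probability matrix of this kind contains, the sketch below counts residues at each aligned position and divides by the column total. It uses only the standard library and is not the katlas implementation: `position_frequencies`, the toy sequences, and the choice to skip characters outside `aa_order` (such as a `_` padding character) are assumptions made for the example.

```python
from collections import Counter

def position_frequencies(seqs, aa_order):
    """Per-position residue frequencies for equal-length, center-aligned sequences."""
    length = len(seqs[0])
    half = length // 2  # the middle index maps to position 0
    freqs = {}
    for i in range(length):
        # Count only residues listed in aa_order, so padding such as '_' is ignored.
        counts = Counter(s[i] for s in seqs if s[i] in aa_order)
        total = sum(counts.values())
        freqs[i - half] = {aa: (counts[aa] / total if total else 0.0) for aa in aa_order}
    return freqs

aa_order = list("PGACSTVILMFYWHKRQNDEsty")
seqs = ["AKsPE", "GRsPD", "AKtLE", "_KyPE"]  # hypothetical site sequences
freqs = position_frequencies(seqs, aa_order)
print(freqs[0]["s"])   # 0.5: two of the four center residues are 's'
print(freqs[-2]["A"])  # 2 of the 3 counted residues at position -2
```

Each column sums to 1 whenever at least one residue was counted, which is the property the table above relies on.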


pssm_to_seq

 pssm_to_seq (pssm_df, thr=0.4, contain_sty=True)

Represent a PSSM as a string sequence of amino acids

Parameter    Type   Default  Details
pssm_df
thr          float  0.4      threshold of probability to show in sequence
contain_sty  bool   True     keep only s,t,y values (last three) in center 0 position

pssm_to_seq(pssm_df,thr=0.1)
'........K.K.K..E.EEVy*[E/A].[L/P]....K..........L.'
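
The string format above can be read as: `.` where no residue passes the threshold, a single letter where exactly one does, `[X/Y]` where several do, and `*` after the center position. The sketch below reproduces that reading on a toy matrix. It is inferred from the example output only, so the strict `>` comparison, the ordering inside brackets, and the name `consensus_string` are all assumptions.

```python
def consensus_string(pssm, thr=0.4):
    """pssm maps position -> {residue: probability}."""
    parts = []
    for pos in sorted(pssm):
        # Residues above the threshold, most probable first.
        hits = sorted((aa for aa, p in pssm[pos].items() if p > thr),
                      key=lambda aa: -pssm[pos][aa])
        if not hits:
            token = "."
        elif len(hits) == 1:
            token = hits[0]
        else:
            token = "[" + "/".join(hits) + "]"
        if pos == 0:
            token += "*"  # mark the phospho-acceptor position
        parts.append(token)
    return "".join(parts)

toy = {
    -2: {"R": 0.5, "K": 0.3, "A": 0.2},
    -1: {"A": 0.4, "G": 0.3, "L": 0.3},
     0: {"s": 0.7, "t": 0.3},
     1: {"P": 0.45, "L": 0.45, "A": 0.10},
}
print(consensus_string(toy))  # R.s*[P/L]
```

Lowering `thr` reveals weaker preferences, which is why the example above passes `thr=0.1` instead of the default.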


recover_pssm

 recover_pssm (flat_pssm:pandas.core.series.Series, aa_order=['P', 'G',
               'A', 'C', 'S', 'T', 'V', 'I', 'L', 'M', 'F', 'Y', 'W', 'H',
               'K', 'R', 'Q', 'N', 'D', 'E', 's', 't', 'y'])

Recover a 2D PSSM from a flat PSSM Series

pspa = Data.get_pspa_all_norm()
flat_pssm = pspa.loc['AAK1'].dropna()
recovered = recover_pssm(flat_pssm)
recovered
Position -5 -4 -3 -2 -1 0 1 2 3 4
aa
P 0.0720 0.0534 0.1084 0.0226 0.1136 0.0 0.0463 0.0527 0.0681 0.0628
G 0.0245 0.0642 0.0512 0.0283 0.0706 0.0 0.7216 0.0749 0.0923 0.0702
... ... ... ... ... ... ... ... ... ... ...
pT 0.0201 0.0332 0.0303 0.0209 0.0121 1.0 0.0123 0.0409 0.0335 0.0251
pY 0.0611 0.0339 0.0274 0.0486 0.0178 0.0 0.0100 0.0410 0.0359 0.0270

23 rows × 10 columns
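
To make the reshaping concrete, here is a minimal standard-library sketch that splits flat keys into a position and a residue label and rebuilds a nested mapping. The key format (an integer position followed by the residue label, such as `-5P` or `0pT`) is an assumption based on the outputs shown on this page, and `unflatten` is a hypothetical helper, not part of katlas.

```python
import re

def unflatten(flat):
    """Turn {'-5P': 0.07, ...} into {residue: {position: value}}."""
    key_pattern = re.compile(r"(-?\d+)(.+)")  # integer position, then residue label
    matrix = {}
    for key, value in flat.items():
        pos, aa = key_pattern.fullmatch(key).groups()
        matrix.setdefault(aa, {})[int(pos)] = value
    return matrix

flat = {"-1P": 0.11, "0P": 0.0, "1P": 0.05,
        "-1pT": 0.01, "0pT": 1.0, "1pT": 0.01}  # made-up values
matrix = unflatten(flat)
print(matrix["pT"][0])      # 1.0
print(sorted(matrix["P"]))  # [-1, 0, 1]
```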



process_pssm

 process_pssm (pssm_df)

Keep only s,t,y values in center 0 position; normalize per position

norm_pssm = process_pssm(recovered)
norm_pssm.head()
Position -5 -4 -3 -2 -1 0 1 2 3 4
aa
P 0.058446 0.041715 0.086100 0.017935 0.096068 0.0 0.042649 0.040482 0.052640 0.050260
G 0.019888 0.050152 0.040667 0.022459 0.059704 0.0 0.664702 0.057536 0.071346 0.056182
A 0.023054 0.055152 0.088880 0.042695 0.032558 0.0 0.028740 0.057613 0.044987 0.051701
C 0.037016 0.043747 0.052025 0.046663 0.026469 0.0 0.020542 0.052543 0.057355 0.048259
S 0.034500 0.048356 0.041859 0.044044 0.046089 0.0 0.013172 0.042403 0.044987 0.044818
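
The two steps named in the description (zero out everything except the phospho-acceptors at position 0, then rescale each position to sum to 1) can be sketched without pandas as follows. The helper name, the toy values, and the use of lowercase `s`, `t`, `y` as the center labels are assumptions for illustration; the library may label or handle these rows differently.

```python
def keep_sty_and_normalize(matrix, center_allowed=("s", "t", "y")):
    """matrix maps position -> {residue: value}; returns a renormalized copy."""
    out = {}
    for pos, col in matrix.items():
        if pos == 0:
            # Only phospho-acceptors may carry weight at the center.
            col = {aa: (v if aa in center_allowed else 0.0) for aa, v in col.items()}
        total = sum(col.values())
        out[pos] = {aa: (v / total if total else 0.0) for aa, v in col.items()}
    return out

toy = {
    -1: {"P": 2.0, "G": 1.0, "s": 1.0},
     0: {"P": 0.5, "s": 1.0, "t": 3.0},
}
normed = keep_sty_and_normalize(toy)
print(normed[-1])  # {'P': 0.5, 'G': 0.25, 's': 0.25}
print(normed[0])   # {'P': 0.0, 's': 0.25, 't': 0.75}
```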


pssm2dict

 pssm2dict (pssm_df)

Convert a PSSM dataframe to a dict keyed by position+amino acid

pssm2dict(pssm_df.iloc[:1,:10])
{'-20P': 0.05006,
 '-19P': 0.04869,
 '-18P': 0.06235,
 '-17P': 0.05549,
 '-16P': 0.04699,
 '-15P': 0.05475,
 '-14P': 0.06479,
 '-13P': 0.05509,
 '-12P': 0.05668,
 '-11P': 0.04827}

JS divergence



js_divergence

 js_divergence (p1, p2, mean=True)

p1 and p2 are two arrays (df or np) with index as aa and column as position

Parameter  Type  Default  Details
p1                        pssm
p2                        pssm
mean       bool  True

js_divergence(pssm_df,pssm_df)
1.0000000826903708e-10
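
For reference, the textbook Jensen-Shannon divergence per position is the average of two KL divergences against the midpoint distribution. The sketch below computes it in bits with the standard library. The log base, the absence of any smoothing constant, and the function names are assumptions; the tiny nonzero value returned above for identical inputs suggests the library's computation differs in some detail (for example, a small constant added for numerical stability).

```python
import math

def kl_bits(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_per_position(p1, p2):
    """p1, p2: lists of columns, each column a probability vector over residues."""
    result = []
    for a, b in zip(p1, p2):
        m = [(x + y) / 2 for x, y in zip(a, b)]  # midpoint distribution
        result.append(0.5 * kl_bits(a, m) + 0.5 * kl_bits(b, m))
    return result

def js_mean(p1, p2):
    """Average the per-position divergences into one number."""
    per_pos = js_per_position(p1, p2)
    return sum(per_pos) / len(per_pos)

same = [[0.5, 0.5], [0.9, 0.1]]
only_first = [[1.0, 0.0]]
only_second = [[0.0, 1.0]]
print(js_mean(same, same))                # 0.0
print(js_mean(only_first, only_second))   # 1.0
```

With base-2 logarithms the divergence is bounded between 0 (identical columns) and 1 (columns with no overlap), which makes the averaged value easy to compare across PSSMs.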


js_divergence_flat

 js_divergence_flat (p1_flat, p2_flat)

p1 and p2 are two flattened pd.Series whose index combines position and aa (for example '-20P')

Parameter  Details
p1_flat    pd.Series of flattened pssm
p2_flat    pd.Series of flattened pssm

flat_norm_pssm = pd.Series(pssm2dict(norm_pssm))
js_divergence(flat_norm_pssm,flat_norm_pssm)
1.0000050826907844e-09

Entropy & Information Content



entropy

 entropy (pssm_df, return_min=False, exclude_zero=False, contain_sty=True)

Calculate entropy per position (max) of a PSSM surrounding 0

Parameter     Type  Default  Details
pssm_df                      a dataframe of pssm with index as aa and column as position
return_min    bool  False    return min entropy as a single value or return all entropy as a series
exclude_zero  bool  False    exclude the column of 0 (center position) in the entropy calculation
contain_sty   bool  True     keep only s,t,y values (last three) in center 0 position

entropy(pssm_df)
Position
-20    4.324109
-19    4.257291
         ...   
 19    4.293755
 20    4.279981
Length: 41, dtype: float64
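
The values above sit just below log2(23) ≈ 4.52, the maximum for 23 residue rows, which is consistent with Shannon entropy measured in bits. A minimal standard-library sketch of that per-position calculation follows; the toy columns and the helper name are assumptions, and this is not the library's code.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability entries are skipped."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

columns = {
    -1: [1 / 23] * 23,            # uniform over 23 residues: maximal entropy
     0: [0.5, 0.5] + [0.0] * 21,  # two equally likely residues: 1 bit
     1: [1.0] + [0.0] * 22,       # fully conserved: 0 bits
}
per_position = {pos: entropy_bits(col) for pos, col in columns.items()}
print(per_position)
# The smallest entropy marks the most selective position,
# which is what a return_min-style option would report.
print(min(per_position.values()))
```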


entropy_flat

 entropy_flat (flat_pssm:pandas.core.series.Series, return_min=False,
               exclude_zero=False, contain_sty=True)

Calculate entropy per position of a flat PSSM surrounding 0

Parameter     Type    Default  Details
flat_pssm     Series
return_min    bool    False    return min entropy as a single value or return all entropy as a series
exclude_zero  bool    False    exclude the column of 0 (center position) in the entropy calculation
contain_sty   bool    True     keep only s,t,y values (last three) in center 0 position


get_IC_standard

 get_IC_standard (pssm_df)

Calculate the standard information content (bits) from frequency matrix, using the same number of residues log2(len(pssm_df)) for all positions



get_IC

 get_IC (pssm_df, return_min=False, exclude_zero=False, contain_sty=True)

Calculate the information content (bits) from a frequency matrix, using log2(3) for the middle position and log2(len(pssm_df)) for others.

Parameter     Type  Default  Details
pssm_df                      a dataframe of pssm with index as aa and column as position
return_min    bool  False    return min entropy as a single value or return all entropy as a series
exclude_zero  bool  False    exclude the column of 0 (center position) in the entropy calculation
contain_sty   bool  True     keep only s,t,y values (last three) in center 0 position
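
Read literally, the description amounts to subtracting each position's entropy from a maximum of log2(3) at the center and log2(number of rows) elsewhere. The sketch below follows that reading with the standard library. The function names and the toy four-residue matrix are assumptions, and the library may treat edge cases differently.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability entries are skipped."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def information_content(pssm, n_center=3):
    """pssm maps position -> list of probabilities over all residue rows."""
    ic = {}
    for pos, probs in pssm.items():
        # Only s, t, y are possible at the center, so its maximum is log2(3).
        n_states = n_center if pos == 0 else len(probs)
        ic[pos] = math.log2(n_states) - entropy_bits(probs)
    return ic

toy = {
    -1: [0.25, 0.25, 0.25, 0.25],  # no preference: 0 bits of information
     0: [0.0, 0.5, 0.5, 0.0],      # two of three acceptors: log2(3) - 1 bits
     1: [1.0, 0.0, 0.0, 0.0],      # fully conserved: log2(4) = 2 bits
}
ic = information_content(toy)
print(ic)
```

Using a smaller maximum at the center keeps a position that can only hold three residues from looking artificially informative.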


get_IC_flat

 get_IC_flat (flat_pssm:pandas.core.series.Series, return_min=False,
              exclude_zero=False, contain_sty=True)

Calculate the information content (bits) from a flattened pssm pd.Series, using log2(3) for the middle position and log2(len(pssm_df)) for others.

Parameter     Type    Default  Details
flat_pssm     Series
return_min    bool    False    return min entropy as a single value or return all entropy as a series
exclude_zero  bool    False    exclude the column of 0 (center position) in the entropy calculation
contain_sty   bool    True     keep only s,t,y values (last three) in center 0 position


get_scaled_IC

 get_scaled_IC (pssm_df)

For plotting purposes, calculate the scaled information content (bits) from a frequency matrix, using log2(3) for the middle position and log2(len(pssm_df)) for others.

PSPA normalization



raw2norm

 raw2norm (df:pandas.core.frame.DataFrame, PDHK:bool=False)

Normalize single ST kinase data

Parameter  Type       Default  Details
df         DataFrame           single kinase’s df, with position as index and single amino acids as columns
PDHK       bool       False    whether this kinase belongs to the PDHK family

This function implements the normalization method from Johnson et al., Nature: "An atlas of substrate specificities for the human serine/threonine kinome".

Specifically:

> - matrices were column-normalized at all positions by the sum of the 17 randomized amino acids (excluding serine, threonine and cysteine), to yield PSSMs.
> - PDHK1 and PDHK4 were normalized to the 16 randomized amino acids (excluding serine, threonine, cysteine and additionally tyrosine)
> - The cysteine row was scaled by its median to be 1/17 (1/16 for PDHK1 and PDHK4).
> - The serine and threonine values in each position were set to be the median of that position.
> - The S0/T0 ratio was determined by summing the values of S and T rows in the matrix (SS and ST, respectively), accounting for the different S vs. T composition of the central (1:1) and peripheral (only S or only T) positions (Sctrl and Tctrl, respectively), and then normalizing to the higher value among the two (S0 and T0, respectively, Supplementary Note 1)
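
The first steps of the quoted procedure can be sketched with the standard library as below. This is an illustration under stated assumptions, not the `raw2norm` code: the data layout (position mapped to a dict of amino-acid signals), the helper name, and in particular the set of values entering the per-position median (here the randomized residues plus cysteine) are guesses, and the S0/T0 step is omitted entirely.

```python
from statistics import median

RANDOMIZED_17 = list("ADEFGHIKLMNPQRVWY")  # 20 amino acids minus S, T, C
ALL_20 = RANDOMIZED_17 + ["C", "S", "T"]

def normalize_positions(raw, pdhk=False):
    """raw maps position -> {amino acid: signal}; returns a normalized copy."""
    # PDHK-family kinases additionally exclude tyrosine (16 residues).
    base = [aa for aa in RANDOMIZED_17 if not (pdhk and aa == "Y")]
    # Step 1: divide each position by the sum over the randomized residues.
    norm = {}
    for pos, row in raw.items():
        denom = sum(row[aa] for aa in base)
        norm[pos] = {aa: v / denom for aa, v in row.items()}
    # Step 2: rescale cysteine so its median across positions is 1/len(base).
    c_median = median(row["C"] for row in norm.values())
    for row in norm.values():
        row["C"] *= (1 / len(base)) / c_median
    # Step 3: set S and T to the median of that position
    # (assumption: median over the randomized residues plus C).
    for row in norm.values():
        m = median(row[aa] for aa in base + ["C"])
        row["S"] = row["T"] = m
    return norm

# Made-up signals for two positions.
raw = {-1: {aa: 2.0 for aa in ALL_20}, 1: {aa: 1.0 for aa in ALL_20}}
raw[1]["P"] = 4.0
norm = normalize_positions(raw)
print(norm[1]["P"])  # 0.2: proline stands out at position +1
```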

This function is usually called through get_one_kinase below, by setting its boolean normalize argument.



get_one_kinase

 get_one_kinase (df:pandas.core.frame.DataFrame, kinase:str,
                 normalize:bool=False, drop_s:bool=True)

Obtain a specific kinase data from stacked dataframe

Parameter  Type       Default  Details
df         DataFrame           stacked dataframe (paper’s raw data)
kinase     str                 a specific kinase
normalize  bool       False    normalize according to the paper; special for PDHK1/4
drop_s     bool       True     drop s, as s is a duplicate of t in PSPA

Retrieve a single kinase's data from PSPA data formatted with kinase as index and position+amino acid as columns.

data = Data.get_pspa_st_norm()
get_one_kinase(data,'PDHK1')
aa A C D E F G H I K L M N P Q R S T V W Y t y
position
-5 0.0594 0.0625 0.0589 0.0550 0.0775 0.0697 0.0687 0.0590 0.0515 0.0657 0.0687 0.0613 0.0451 0.0424 0.0594 0.0594 0.0594 0.0573 0.1001 0.0775 0.0583 0.0658
-4 0.0618 0.0621 0.0550 0.0511 0.0739 0.0715 0.0598 0.0601 0.0520 0.0614 0.0744 0.0549 0.0637 0.0552 0.0617 0.0608 0.0608 0.0519 0.0916 0.0739 0.0528 0.0752
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3 0.0486 0.0609 0.0938 0.0684 0.1024 0.0676 0.0544 0.0583 0.0388 0.0552 0.0637 0.0505 0.0686 0.0502 0.0561 0.0588 0.0588 0.0593 0.0641 0.1024 0.0539 0.0431
4 0.0565 0.0749 0.0631 0.0535 0.0732 0.0655 0.0664 0.0625 0.0496 0.0552 0.0627 0.0640 0.0677 0.0553 0.0604 0.0626 0.0626 0.0579 0.0864 0.0732 0.0548 0.0575

10 rows × 22 columns
