pssm

Functions related with PSSMs

Setup

from katlas.data import Data

PSSM

We need to compute position-specific probability matrix (PSSM) from a list of aligned site sequences.

For each position \(i\) (e.g., from \(-7\) to \(+7\)), the probability of observing amino acid \(x\) is:

\[ P_i(x) = \frac{\text{count of amino acid } x \text{ at position } i}{\text{total counts at position } i} \]

The following 23 amino acids are included:

  • Standard amino acids:
    A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y
  • Modified amino acids:
    s, t, y (often used to denote phosphorylated S, T, Y)

The resulting matrix has: - Rows: Amino acids, - Columns: Sequence positions (centered on the phosphosite), - Values: Probabilities of each amino acid at each position.


get_prob


def get_prob(
    data:Union, # input data, list or df
    seq_col:str='site_seq', # column name if input is df
):

Get the probability matrix of PSSM from phosphorylation site sequences.

data = Data.ks_dataset()
data_k = data[data.kinase_uniprot=='P49841'] # CDK1
get_prob(data_k, seq_col='site_seq').shape
(23, 41)
# or
pssm_df = get_prob(data_k['site_seq'].tolist())
pssm_df.tail()
Position -20 -19 -18 -17 -16 -15 -14 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
aa
D 0.059524 0.065476 0.044577 0.080238 0.051775 0.048744 0.047267 0.054572 0.064611 0.048387 0.052709 0.058565 0.049635 0.049563 0.071325 0.069666 0.058055 0.063768 0.037627 0.062229 0.000000 0.034783 0.046579 0.056934 0.030702 0.084919 0.064516 0.056464 0.051128 0.072289 0.059091 0.054962 0.033846 0.063174 0.053042 0.054859 0.063091 0.055380 0.054054 0.056090 0.053140
E 0.061012 0.066964 0.077266 0.056464 0.075444 0.075332 0.090103 0.094395 0.066079 0.095308 0.086384 0.070278 0.068613 0.083090 0.093159 0.076923 0.076923 0.053623 0.060781 0.069465 0.000000 0.023188 0.081514 0.075912 0.071637 0.080527 0.055718 0.063893 0.081203 0.078313 0.068182 0.087023 0.072308 0.064715 0.098284 0.073668 0.056782 0.072785 0.057234 0.057692 0.074074
s 0.037202 0.041667 0.035661 0.031204 0.038462 0.038405 0.044313 0.041298 0.070485 0.045455 0.049780 0.048316 0.103650 0.056851 0.059680 0.068215 0.223512 0.073913 0.082489 0.076700 0.677279 0.037681 0.081514 0.106569 0.274854 0.057101 0.076246 0.074294 0.129323 0.057229 0.054545 0.041221 0.083077 0.066256 0.045242 0.047022 0.069401 0.033228 0.063593 0.051282 0.045089
t 0.019345 0.010417 0.022288 0.019316 0.019231 0.016248 0.016248 0.011799 0.014684 0.020528 0.033675 0.036603 0.024818 0.017493 0.024745 0.039187 0.065312 0.030435 0.050651 0.026049 0.285094 0.021739 0.029112 0.037956 0.089181 0.033675 0.033724 0.040119 0.039098 0.031627 0.050000 0.025954 0.030769 0.015408 0.020281 0.010972 0.014196 0.015823 0.015898 0.014423 0.014493
y 0.004464 0.008929 0.007429 0.008915 0.008876 0.010340 0.007386 0.002950 0.005874 0.016129 0.010249 0.014641 0.005839 0.013120 0.014556 0.017417 0.007257 0.007246 0.005789 0.010130 0.037627 0.020290 0.007278 0.017518 0.019006 0.004392 0.004399 0.008915 0.006015 0.007530 0.006061 0.012214 0.007692 0.004622 0.012480 0.004702 0.006309 0.006329 0.012719 0.006410 0.011272

PSSM with weight score

For each amino acid \(x\) at position \(i\), the enrichment-weighted PSSM value is:

\[ \mathrm{PSSM}(x, i) = \frac{\sum_{p \in \text{peptides with } x \text{ at } i} \log_2(\text{score}_p)}{\text{count}(x, i)} \]

This represents the mean log2-transformed enrichment score (or selection ratio) for each position-specific amino acid. The enrichment-based PSSM can be further refined through:

  • Shrinkage: Applying empirical Bayes-style regularization to stabilize estimates with low sample counts
  • Masking: Filtering out positions with insufficient support (count threshold)
  • Centering: Optional normalization via position-wise or global median centering to highlight deviations from background

get_pssm_weight


def get_pssm_weight(
    data:DataFrame, seq_col:str='seq', score_col:str='enrichment', # selected/input ratio OR already-log2
    to_log2:bool=True, aa_order:str='PGACSTVILMFYWHKRQNDEsty',
    center:str='pos', # per-position median centering or 'glob' global median centering
    count_thr:int=5, # threshold to filter out if <count_thr
    alpha:str | float='auto', # shrinkage strength: value * count / (count + alpha)
):

Position-specific amino acid enrichment matrix: PSSM(aa,pos) = mean( log2(score) ) over peptides with aa at pos.

Transform PSSM


flatten_pssm


def flatten_pssm(
    pssm_df, column_wise:bool=True, # if True, column major flatten; else row wise flatten (for pytorch training)
):

Flatten PSSM dataframe to dictionary

flat_pssm = pd.Series(flatten_pssm(pssm_df))
flat_pssm
-20P    0.069940
-20G    0.087798
          ...   
20t     0.014493
20y     0.011272
Length: 943, dtype: float64
flat_pssm.reset_index()[0]
0      0.069940
1      0.087798
         ...   
941    0.014493
942    0.011272
Name: 0, Length: 943, dtype: float64

recover_pssm


def recover_pssm(
    flat_pssm:Series
):

Recover 2D PSSM from flattened PSSM Series.

out = recover_pssm(flat_pssm)
out
Position -20 -19 -18 -17 -16 -15 -14 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
aa
P 0.069940 0.077381 0.062407 0.084695 0.082840 0.072378 0.087149 0.081121 0.077827 0.071848 0.101025 0.074671 0.084672 0.106414 0.112082 0.094340 0.079826 0.133333 0.150507 0.131693 0.000000 0.420290 0.141194 0.094891 0.049708 0.187408 0.082111 0.077266 0.064662 0.106928 0.074242 0.077863 0.078462 0.070878 0.095164 0.092476 0.074132 0.090190 0.071542 0.099359 0.078905
G 0.087798 0.087798 0.069837 0.065379 0.076923 0.054653 0.088626 0.095870 0.064611 0.068915 0.070278 0.087848 0.058394 0.065598 0.066958 0.063861 0.068215 0.101449 0.076700 0.105644 0.000000 0.075362 0.065502 0.086131 0.046784 0.054173 0.060117 0.083210 0.084211 0.058735 0.063636 0.083969 0.084615 0.075501 0.076443 0.081505 0.067823 0.060127 0.077901 0.088141 0.066023
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
t 0.019345 0.010417 0.022288 0.019316 0.019231 0.016248 0.016248 0.011799 0.014684 0.020528 0.033675 0.036603 0.024818 0.017493 0.024745 0.039187 0.065312 0.030435 0.050651 0.026049 0.285094 0.021739 0.029112 0.037956 0.089181 0.033675 0.033724 0.040119 0.039098 0.031627 0.050000 0.025954 0.030769 0.015408 0.020281 0.010972 0.014196 0.015823 0.015898 0.014423 0.014493
y 0.004464 0.008929 0.007429 0.008915 0.008876 0.010340 0.007386 0.002950 0.005874 0.016129 0.010249 0.014641 0.005839 0.013120 0.014556 0.017417 0.007257 0.007246 0.005789 0.010130 0.037627 0.020290 0.007278 0.017518 0.019006 0.004392 0.004399 0.008915 0.006015 0.007530 0.006061 0.012214 0.007692 0.004622 0.012480 0.004702 0.006309 0.006329 0.012719 0.006410 0.011272

23 rows × 41 columns

out.equals(pssm_df)
True

Or recover from PSPA data

pspa = Data.pspa()
pssm=recover_pssm(pspa.loc['AAK1'].dropna())
pssm
Position -5 -4 -3 -2 -1 0 1 2 3 4
aa
P 0.0720 0.0534 0.1084 0.0226 0.1136 0.0 0.0463 0.0527 0.0681 0.0628
G 0.0245 0.0642 0.0512 0.0283 0.0706 0.0 0.7216 0.0749 0.0923 0.0702
... ... ... ... ... ... ... ... ... ... ...
t 0.0201 0.0332 0.0303 0.0209 0.0121 1.0 0.0123 0.0409 0.0335 0.0251
y 0.0611 0.0339 0.0274 0.0486 0.0178 0.0 0.0100 0.0410 0.0359 0.0270

23 rows × 10 columns

PSPA is not scaled per position.

So we need to remove the redundant copy in zero position (leave s/t/y only) and scaled to 1 per position.

pssm.index[pssm.index.isin(['s','t','y'])]
Index(['s', 't', 'y'], dtype='str', name='aa')
_clean_zero(pssm)
Position -5 -4 -3 -2 -1 0 1 2 3 4
aa
P 0.0720 0.0534 0.1084 0.0226 0.1136 0.0 0.0463 0.0527 0.0681 0.0628
G 0.0245 0.0642 0.0512 0.0283 0.0706 0.0 0.7216 0.0749 0.0923 0.0702
... ... ... ... ... ... ... ... ... ... ...
t 0.0201 0.0332 0.0303 0.0209 0.0121 1.0 0.0123 0.0409 0.0335 0.0251
y 0.0611 0.0339 0.0274 0.0486 0.0178 0.0 0.0100 0.0410 0.0359 0.0270

23 rows × 10 columns


clean_zero_normalize


def clean_zero_normalize(
    pssm_df
):

Zero out non-last three values in position 0 (keep only s,t,y values at center), and normalize per position

This function applies phosphosite-specific cleaning and normalization to a PSSM.

At the center position (\(i = 0\)), only the last three rows of the matrix — corresponding to phosphorylatable residues s, t, and y — are retained. All other amino acid values at position 0 are set to 0.

After masking, the matrix is column-normalized to ensure the probabilities at each position sum to 1:

\[ P_i(x) = \frac{P_i(x)}{\sum_{x'} P_i(x')} \]

clean_zero_normalize(pssm)
Position -5 -4 -3 -2 -1 0 1 2 3 4
aa
P 0.058446 0.041715 0.086100 0.017935 0.096068 0.000000 0.042649 0.040482 0.052640 0.050260
G 0.019888 0.050152 0.040667 0.022459 0.059704 0.000000 0.664702 0.057536 0.071346 0.056182
... ... ... ... ... ... ... ... ... ... ...
t 0.016316 0.025935 0.024067 0.016586 0.010233 0.908018 0.011330 0.031418 0.025895 0.020088
y 0.049598 0.026482 0.021763 0.038568 0.015053 0.000000 0.009211 0.031495 0.027750 0.021609

23 rows × 10 columns

PSSM to seq string


pssm_to_seq


def pssm_to_seq(
    pssm_df, thr:float=0.2, # threshold of probability to show in sequence
    clean_center:bool=True, # if true, zero out non-last three values in position 0 (keep only s,t,y values at center)
):

Represent PSSM in string sequence of amino acids

pssm_to_seq(pssm_df,thr=0.1) # pssm_df is matrix of pssm, not flattened version
'A....A....P.sPP.s[P/G]P[P/G][s/t]*P[P/A]ssP..sP.R.........'

get_pssm_seq_labels


def get_pssm_seq_labels(
    pssms, count_map:NoneType=None, # df index as key, counts as value
    thr:float=0.3, # threshold of probability to show in sequence
):

Use index of pssms and the pssm to seq to represent pssm.

pssms=Data.pspa_scale()
get_pssm_seq_labels(pssms.head(10))
['AAK1: .....t*G...',
 'ACVR2A: .....[t/s]*....',
 'ACVR2B: .....[t/s]*....',
 'AKT1: ..RR.[s/t]*....',
 'AKT2: ..R..[s/t]*....',
 'AKT3: ..RR.[s/t]*....',
 'ALK2: .....[t/s]*....',
 'ALK4: .....[t/s]*....',
 'ALPHAK3: .....t*....',
 'AMPKA1: .....[s/t]*....']
import random
# get a dict of index and counts
count_dict = {idx:random.randint(1,100) for idx in pssms.head(10).index}
labels= get_pssm_seq_labels(pssms.head(10),count_dict)
labels
['AAK1 (n=46): .....t*G...',
 'ACVR2A (n=75): .....[t/s]*....',
 'ACVR2B (n=71): .....[t/s]*....',
 'AKT1 (n=97): ..RR.[s/t]*....',
 'AKT2 (n=51): ..R..[s/t]*....',
 'AKT3 (n=83): ..RR.[s/t]*....',
 'ALK2 (n=5): .....[t/s]*....',
 'ALK4 (n=34): .....[t/s]*....',
 'ALPHAK3 (n=14): .....t*....',
 'AMPKA1 (n=72): .....[s/t]*....']

PSSMs of clusters


get_cluster_pssms


def get_cluster_pssms(
    df, cluster_col, seq_col:str='site_seq', id_col:str='sub_site',
    count_thr:int=10, # if less than the count threshold, not include in the return
    valid_thr:NoneType=None, # percentage of not-nan values in pssm
    plot:bool=False
):

Extract motifs from clusters in a dataframe

get_cluster_pssms(data,'kinase_group')
100%|██████████| 10/10 [00:00<00:00, 29.62it/s]
-20P -20G -20A -20C -20S -20T -20V -20I -20L -20M -20F -20Y -20W -20H -20K -20R -20Q -20N -20D -20E -20s -20t -20y -19P -19G -19A -19C -19S -19T -19V -19I -19L -19M -19F -19Y -19W -19H -19K -19R -19Q -19N -19D -19E -19s -19t -19y -18P -18G -18A -18C ... 18E 18s 18t 18y 19P 19G 19A 19C 19S 19T 19V 19I 19L 19M 19F 19Y 19W 19H 19K 19R 19Q 19N 19D 19E 19s 19t 19y 20P 20G 20A 20C 20S 20T 20V 20I 20L 20M 20F 20Y 20W 20H 20K 20R 20Q 20N 20D 20E 20s 20t 20y
TK 0.056754 0.069067 0.070474 0.016886 0.045028 0.039400 0.058630 0.049836 0.083255 0.024038 0.032716 0.023335 0.007974 0.021107 0.069184 0.056168 0.045966 0.036585 0.058865 0.081027 0.024508 0.011843 0.017355 0.057915 0.067158 0.073125 0.016731 0.045396 0.037440 0.053001 0.049959 0.087633 0.023751 0.031356 0.018018 0.009594 0.022230 0.074529 0.063999 0.043173 0.041652 0.060840 0.073359 0.025389 0.010881 0.012870 0.058123 0.067810 0.062792 0.014006 ... 0.080097 0.027670 0.012621 0.012743 0.058601 0.065424 0.065302 0.014864 0.045078 0.038986 0.056408 0.048124 0.087354 0.018884 0.042276 0.017178 0.008894 0.018519 0.073830 0.061769 0.047149 0.040327 0.059698 0.078216 0.028509 0.010599 0.014011 0.066015 0.072738 0.065037 0.013692 0.048900 0.037897 0.054523 0.050122 0.077262 0.023227 0.039364 0.018337 0.008435 0.019071 0.070905 0.059413 0.040465 0.038509 0.056357 0.083496 0.025917 0.012836 0.017482
CMGC 0.080589 0.070340 0.083792 0.013709 0.050865 0.035874 0.053812 0.035874 0.074824 0.022037 0.029468 0.015247 0.007559 0.020884 0.069058 0.057143 0.044331 0.032031 0.053299 0.078668 0.042921 0.020115 0.007559 0.079718 0.075496 0.072937 0.012028 0.058733 0.035061 0.053871 0.032885 0.074472 0.021753 0.026104 0.013436 0.007806 0.021241 0.064491 0.063596 0.046449 0.037492 0.055534 0.080614 0.042482 0.018298 0.005502 0.074292 0.068930 0.074419 0.013786 ... 0.084768 0.052185 0.016556 0.005960 0.074212 0.073813 0.074212 0.011970 0.053731 0.034047 0.050805 0.031520 0.074345 0.017423 0.028195 0.015029 0.010374 0.020215 0.070887 0.063173 0.045884 0.033914 0.056124 0.086980 0.048278 0.018619 0.006251 0.082465 0.066961 0.074980 0.012563 0.050789 0.034215 0.050521 0.038626 0.079925 0.019246 0.028869 0.016840 0.006816 0.023924 0.072307 0.060011 0.046111 0.035285 0.057070 0.074044 0.043972 0.018043 0.006415
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
CK1 0.057860 0.076419 0.076965 0.014192 0.039847 0.037118 0.063865 0.052402 0.070961 0.017467 0.027293 0.012555 0.014192 0.024017 0.079694 0.050764 0.033297 0.035480 0.058406 0.086790 0.038210 0.015830 0.016376 0.061069 0.070883 0.069793 0.007634 0.033806 0.032715 0.053435 0.043075 0.087786 0.023991 0.022901 0.016903 0.010905 0.020720 0.076881 0.059978 0.056161 0.037077 0.061069 0.087786 0.041439 0.017448 0.006543 0.050000 0.067935 0.073913 0.010870 ... 0.084629 0.041451 0.015544 0.012090 0.053148 0.068746 0.069324 0.012709 0.036973 0.021953 0.047371 0.042172 0.082034 0.017909 0.030040 0.017909 0.020220 0.019064 0.095321 0.047371 0.041594 0.034084 0.065280 0.098209 0.041594 0.021953 0.015020 0.051865 0.074592 0.088578 0.015152 0.039627 0.034382 0.055361 0.039627 0.074592 0.020979 0.034382 0.020396 0.008741 0.026224 0.079837 0.046620 0.038462 0.043124 0.067599 0.087995 0.027972 0.012238 0.011655
Atypical 0.071895 0.065359 0.063492 0.014939 0.053221 0.045752 0.058824 0.050420 0.085901 0.020542 0.017740 0.020542 0.005602 0.034547 0.049486 0.057890 0.040149 0.034547 0.059757 0.078431 0.041083 0.025210 0.004669 0.042870 0.052190 0.074557 0.013979 0.063374 0.036347 0.054986 0.031687 0.089469 0.027959 0.036347 0.015843 0.010252 0.026095 0.074557 0.056850 0.054054 0.034483 0.063374 0.068966 0.046598 0.018639 0.006524 0.051068 0.063138 0.072423 0.012071 ... 0.072175 0.047483 0.013295 0.015195 0.059829 0.081671 0.066477 0.017094 0.047483 0.044634 0.050332 0.034188 0.063628 0.010446 0.037037 0.016144 0.006648 0.027540 0.047483 0.055081 0.057930 0.043685 0.059829 0.091168 0.039886 0.027540 0.014245 0.077290 0.060115 0.062977 0.015267 0.055344 0.029580 0.065840 0.026718 0.092557 0.017176 0.039122 0.023855 0.009542 0.016221 0.046756 0.054389 0.050573 0.055344 0.062977 0.068702 0.040076 0.020992 0.008588

10 rows × 943 columns

Entropy


get_entropy


def get_entropy(
    pssm_df, # a dataframe of pssm with index as aa and column as position
    return_min:bool=False, # return min entropy as a single value or return all entropy as a pd.series
    exclude_zero:bool=False, # exclude the column of 0 (center position) in the entropy calculation
    clean_zero:bool=True, # if true, zero out non-last three values in position 0 (keep only s,t,y values at center)
):

Calculate entropy per position of a PSSM surrounding 0. The less entropy the more information it contains.

Let \(P_i(x)\) be the probability of amino acid \(x\) at position \(i\) in the PSSM, with \(i \in \{-k, \dots, -1, 0, +1, \dots, +k\}\). The entropy at each position \(i\) is defined as:

\[ H_i = - \sum_{x} P_i(x) \log_2 \left( P_i(x) + \varepsilon \right) \]

where \(\varepsilon = 10^{-8}\) is a small constant added for numerical stability.

If exclude_zero=True, the central position \(i = 0\) is omitted from the entropy calculation.

If clean_zero=True, all values at position \(i = 0\) are zeroed out except for amino acids Serine (S), Threonine (T), and Tyrosine (Y), typically the only possible phospho-acceptors in kinase motif analysis.

If return_min=True, the function returns the minimum entropy across all positions:

\[ H_{\text{spec}} = \min_i H_i \]

Otherwise, the function returns the full vector \(\{H_i\}\) for each position \(i\), reflecting how much information (or uncertainty) is contained at each position in the motif.

# get entropy per position
get_entropy(pssm_df).sort_values()
Position
 0     1.074964
 1     3.346861
         ...   
-9     4.297737
-12    4.302189
Length: 41, dtype: float64
# calculate minimum entropy of surrouding positions
get_entropy(pssm_df,return_min=True,exclude_zero=True)
3.3468606104695913

get_entropy_flat


def get_entropy_flat(
    flat_pssm:Series,
    return_min:bool=False, # return min entropy as a single value or return all entropy as a pd.series
    exclude_zero:bool=False, # exclude the column of 0 (center position) in the entropy calculation
    clean_zero:bool=True, # if true, zero out non-last three values in position 0 (keep only s,t,y values at center)
):

Calculate entropy per position of a flat PSSM surrounding 0

get_entropy_flat(flat_pssm).sort_values()
Position
 0     1.074964
 1     3.346861
         ...   
-9     4.297737
-12    4.302189
Length: 41, dtype: float64
get_entropy_flat(flat_pssm,return_min=True,exclude_zero=True)
3.3468606104695913
get_entropy_flat(flat_pssm).round(5)
Position
-20    4.24258
-19    4.17569
        ...   
 19    4.22666
 20    4.24636
Length: 41, dtype: float64
get_entropy(pssm_df).round(5)
Position
-20    4.24258
-19    4.17569
        ...   
 19    4.22666
 20    4.24636
Length: 41, dtype: float64
# test equal
(get_entropy_flat(flat_pssm).round(5) == get_entropy(pssm_df).round(5)).value_counts()
True    41
Name: count, dtype: int64

Information Content


get_IC


def get_IC(
    pssm_df, return_min:bool=False, # return min entropy as a single value or return all entropy as a pd.series
    exclude_zero:bool=False, # exclude the column of 0 (center position) in the entropy calculation
    clean_zero:bool=True, # if true, zero out non-last three values in position 0 (keep only s,t,y values at center)
):

Calculate the information content (bits) from a frequency matrix, using log2(3) for the middle position and log2(len(pssm_df)) for others. The higher the more information it contains.

Let \(P_i(x)\) be the frequency (probability) of amino acid \(x\) at position \(i\) in the PSSM. The standard information content (IC) at position \(i\) is defined as:

\[ \mathrm{IC}_i = \max H_i - H_i \]

which is:

\[ \mathrm{IC}_i = \log_2(N) - H_i \]

where \(N\) is the number of possible amino acids (i.e., \(N = \text{len}(P_i)\)).

At the center position (\(i = 0\)), only three amino acids (S, T, Y) are relevant, so the maximum entropy at each position is defined as:

\[ \max H_i = \begin{cases} \log_2(3) & \text{if } i = 0 \\ \log_2(N) & \text{otherwise} \end{cases} \]

# the higher the more conserved
get_IC(pssm_df,exclude_zero=True).sort_values()
Position
-12    0.221373
-9     0.225825
         ...   
 4     0.717760
 1     1.176701
Length: 40, dtype: float64

Check all zero cases:

pssm_df2=pssm_df.copy()
pssm_df2[-20]=0
get_entropy(pssm_df2,exclude_zero=True).sort_values()
Position
-20    0.000000
 1     3.346861
         ...   
-9     4.297737
-12    4.302189
Length: 40, dtype: float64

get_IC_flat


def get_IC_flat(
    flat_pssm:Series,
    return_min:bool=False, # return min entropy as a single value or return all entropy as a pd.series
    exclude_zero:bool=False, # exclude the column of 0 (center position) in the entropy calculation
    clean_zero:bool=True, # if true, zero out non-last three values in position 0 (keep only s,t,y values at center)
):

Calculate the information content (bits) from a flattened pssm pd.Series, using log2(3) for the middle position and log2(len(pssm_df)) for others.

get_IC_flat(flat_pssm,exclude_zero=True).sort_values()
Position
-12    0.221373
-9     0.225825
         ...   
 4     0.717760
 1     1.176701
Length: 40, dtype: float64
(get_IC_flat(flat_pssm).round(5) == get_IC(pssm_df).round(5)).value_counts()
True    41
Name: count, dtype: int64

Overall specificity


get_specificity


def get_specificity(
    pssm_df
):

Get specificity score of a pssm, excluding zero position.

We evaluated the overall specificity of a PSSM by combining two metrics: the maximum IC across surrounding positions and the variance of IC values:

\[ \text{Specificity Score} = 2 \times \max(\text{IC}) + \mathrm{Var}(\text{IC}) \]

get_specificity(pssm_df)
2.381609408364424

get_specificity_flat


def get_specificity_flat(
    flat_pssm
):

Get specificity score of a pssm, excluding zero position.

get_specificity_flat(flat_pssm)
2.381609408364424