pssm.core

Functions related with PSSMs

Overview

PSSM Calculation

To calculate a position-specific probability matrix (PSSM) from phosphorylation site sequences:

pssm_df = get_prob(
    data=df,       # input DataFrame, Series, or list of sequences
    col='site_seq',    # column name containing sequences (if DataFrame)
)

Alternatively, use a list of sequences as input

pssm_df = get_prob(
    data=df['site_seq'],       # input DataFrame, Series, or list of sequences
)

To extract flattened PSSMs for each cluster in a dataset:

cluster_pssms = get_cluster_pssms(
    df=data,                    # DataFrame containing sequences and cluster labels
    cluster_col='kinase_group', # column name for cluster identifiers
    seq_col='site_seq',         # column name for site sequences
    id_col='sub_site',          # column for deduplication within clusters (None to skip)
    count_thr=10,               # minimum sequences required per cluster
    valid_thr=None,             # minimum fraction of non-zero PSSM values (None to skip)
    plot=False,                 # whether to plot sequence logos for each cluster
)

PSSM Transformation

To flatten a 2D PSSM into a dictionary (for storage or model input):

flat_dict = flatten_pssm(
    pssm_df=pssm_df,     # 2D PSSM with amino acids as rows, positions as columns
    column_wise=True,    # True for column-major order; False for row-wise (PyTorch)
)

To recover a 2D PSSM from a flattened Series:

pssm_df = recover_pssm(
    flat_pssm=flat_pssm,  # flattened PSSM as pd.Series with position+aa index
)

To clean the center position and normalize each column to sum to 1:

pssm_norm = clean_zero_normalize(
    pssm_df=pssm,  # 2D PSSM (zeros out non-s/t/y at position 0, then normalizes)
)

Entropy Analysis

To calculate Shannon entropy per position (lower entropy = more conserved):

entropy = get_entropy(
    pssm_df=pssm_df,      # 2D PSSM matrix
    return_min=False,     # True to return single minimum value
    exclude_zero=True,    # True to exclude center position from calculation
    clean_zero=True,      # True to zero out non-s/t/y at position 0
)

To calculate entropy from a flattened PSSM:

entropy = get_entropy_flat(
    flat_pssm=flat_pssm,  # flattened PSSM as pd.Series
    return_min=True,      # return minimum entropy across positions
    exclude_zero=True,    # exclude center position
)

Information Content

To calculate information content per position (higher IC = more conserved):

ic = get_IC(
    pssm_df=pssm_df,    # 2D PSSM matrix
    exclude_zero=True,  # exclude center position from calculation
)

To calculate IC from a flattened PSSM:

ic = get_IC_flat(
    flat_pssm=flat_pssm,  # flattened PSSM as pd.Series
    exclude_zero=True,    # exclude center position
)

Overall Specificity

To compute an overall specificity score combining max IC and IC variance:

spec = get_specificity(
    pssm_df=pssm_df,  # 2D PSSM matrix (excludes position 0 automatically)
)

To compute specificity from a flattened PSSM:

spec = get_specificity_flat(
    flat_pssm=flat_pssm,  # flattened PSSM as pd.Series
)

Setup

PSSM

We need to compute position-specific probability matrix (PSSM) from a list of aligned site sequences.

For each position \(i\) (e.g., from \(-7\) to \(+7\)), the probability of observing amino acid \(x\) is:

\[ P_i(x) = \frac{\text{count of amino acid } x \text{ at position } i}{\text{total counts at position } i} \]

The following 23 amino acids are included:

  • Standard amino acids:
    A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y
  • Modified amino acids:
    s, t, y (often used to denote phosphorylated S, T, Y)

The resulting matrix has: - Rows: Amino acids, - Columns: Sequence positions (centered on the phosphosite), - Values: Probabilities of each amino acid at each position.


get_prob


def get_prob(
    data:Union, # input data, list or df
    col:str='site_seq', # column name if input is df
):

Get the probability matrix of PSSM from phosphorylation site sequences.

data = Data.get_ks_dataset()
data_k = data[data.kinase_uniprot=='P49841'] # CDK1
get_prob(data_k, col='site_seq').shape
(23, 41)
# or
pssm_df = get_prob(data_k['site_seq'].tolist())
pssm_df.tail()
Position -20 -19 -18 -17 -16 -15 -14 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
aa
D 0.059524 0.065476 0.044577 0.080238 0.051775 0.048744 0.047267 0.054572 0.064611 0.048387 0.052709 0.058565 0.049635 0.049563 0.071325 0.069666 0.058055 0.063768 0.037627 0.062229 0.000000 0.034783 0.046579 0.056934 0.030702 0.084919 0.064516 0.056464 0.051128 0.072289 0.059091 0.054962 0.033846 0.063174 0.053042 0.054859 0.063091 0.055380 0.054054 0.056090 0.053140
E 0.061012 0.066964 0.077266 0.056464 0.075444 0.075332 0.090103 0.094395 0.066079 0.095308 0.086384 0.070278 0.068613 0.083090 0.093159 0.076923 0.076923 0.053623 0.060781 0.069465 0.000000 0.023188 0.081514 0.075912 0.071637 0.080527 0.055718 0.063893 0.081203 0.078313 0.068182 0.087023 0.072308 0.064715 0.098284 0.073668 0.056782 0.072785 0.057234 0.057692 0.074074
s 0.037202 0.041667 0.035661 0.031204 0.038462 0.038405 0.044313 0.041298 0.070485 0.045455 0.049780 0.048316 0.103650 0.056851 0.059680 0.068215 0.223512 0.073913 0.082489 0.076700 0.677279 0.037681 0.081514 0.106569 0.274854 0.057101 0.076246 0.074294 0.129323 0.057229 0.054545 0.041221 0.083077 0.066256 0.045242 0.047022 0.069401 0.033228 0.063593 0.051282 0.045089
t 0.019345 0.010417 0.022288 0.019316 0.019231 0.016248 0.016248 0.011799 0.014684 0.020528 0.033675 0.036603 0.024818 0.017493 0.024745 0.039187 0.065312 0.030435 0.050651 0.026049 0.285094 0.021739 0.029112 0.037956 0.089181 0.033675 0.033724 0.040119 0.039098 0.031627 0.050000 0.025954 0.030769 0.015408 0.020281 0.010972 0.014196 0.015823 0.015898 0.014423 0.014493
y 0.004464 0.008929 0.007429 0.008915 0.008876 0.010340 0.007386 0.002950 0.005874 0.016129 0.010249 0.014641 0.005839 0.013120 0.014556 0.017417 0.007257 0.007246 0.005789 0.010130 0.037627 0.020290 0.007278 0.017518 0.019006 0.004392 0.004399 0.008915 0.006015 0.007530 0.006061 0.012214 0.007692 0.004622 0.012480 0.004702 0.006309 0.006329 0.012719 0.006410 0.011272

PSSM with weight score


get_pssm_weight


def get_pssm_weight(
    data:DataFrame, seq_col:str='seq', score_col:str='enrichment', # selected/input ratio OR already-log2
    to_log2:bool=True, aa_order:str='PGACSTVILMFYWHKRQNDEsty',
    center:str='pos', # per-position median centering or 'glob' global median centering
    count_thr:int=5, # threshold to filter out if <count_thr
    alpha:str | float='auto', # shrinkage strength: value * count / (count + alpha)
):

Position-specific amino acid enrichment matrix: PSSM(aa,pos) = mean( log2(score) ) over peptides with aa at pos.

Transform PSSM


flatten_pssm


def flatten_pssm(
    pssm_df, column_wise:bool=True, # if True, column major flatten; else row wise flatten (for pytorch training)
):

Flatten PSSM dataframe to dictionary

flat_pssm = pd.Series(flatten_pssm(pssm_df))
flat_pssm
-20P    0.069940
-20G    0.087798
          ...   
20t     0.014493
20y     0.011272
Length: 943, dtype: float64
flat_pssm.reset_index()[0]
0      0.069940
1      0.087798
         ...   
941    0.014493
942    0.011272
Name: 0, Length: 943, dtype: float64

recover_pssm


def recover_pssm(
    flat_pssm:Series
):

Recover 2D PSSM from flattened PSSM Series.

out = recover_pssm(flat_pssm)
out
Position -20 -19 -18 -17 -16 -15 -14 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
aa
P 0.069940 0.077381 0.062407 0.084695 0.082840 0.072378 0.087149 0.081121 0.077827 0.071848 0.101025 0.074671 0.084672 0.106414 0.112082 0.094340 0.079826 0.133333 0.150507 0.131693 0.000000 0.420290 0.141194 0.094891 0.049708 0.187408 0.082111 0.077266 0.064662 0.106928 0.074242 0.077863 0.078462 0.070878 0.095164 0.092476 0.074132 0.090190 0.071542 0.099359 0.078905
G 0.087798 0.087798 0.069837 0.065379 0.076923 0.054653 0.088626 0.095870 0.064611 0.068915 0.070278 0.087848 0.058394 0.065598 0.066958 0.063861 0.068215 0.101449 0.076700 0.105644 0.000000 0.075362 0.065502 0.086131 0.046784 0.054173 0.060117 0.083210 0.084211 0.058735 0.063636 0.083969 0.084615 0.075501 0.076443 0.081505 0.067823 0.060127 0.077901 0.088141 0.066023
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
t 0.019345 0.010417 0.022288 0.019316 0.019231 0.016248 0.016248 0.011799 0.014684 0.020528 0.033675 0.036603 0.024818 0.017493 0.024745 0.039187 0.065312 0.030435 0.050651 0.026049 0.285094 0.021739 0.029112 0.037956 0.089181 0.033675 0.033724 0.040119 0.039098 0.031627 0.050000 0.025954 0.030769 0.015408 0.020281 0.010972 0.014196 0.015823 0.015898 0.014423 0.014493
y 0.004464 0.008929 0.007429 0.008915 0.008876 0.010340 0.007386 0.002950 0.005874 0.016129 0.010249 0.014641 0.005839 0.013120 0.014556 0.017417 0.007257 0.007246 0.005789 0.010130 0.037627 0.020290 0.007278 0.017518 0.019006 0.004392 0.004399 0.008915 0.006015 0.007530 0.006061 0.012214 0.007692 0.004622 0.012480 0.004702 0.006309 0.006329 0.012719 0.006410 0.011272

23 rows × 41 columns

out.equals(pssm_df)
True

Or recover from PSPA data

pspa = Data.get_pspa()
pssm=recover_pssm(pspa.loc['AAK1'].dropna())
pssm
Position -5 -4 -3 -2 -1 0 1 2 3 4
aa
P 0.0720 0.0534 0.1084 0.0226 0.1136 0.0 0.0463 0.0527 0.0681 0.0628
G 0.0245 0.0642 0.0512 0.0283 0.0706 0.0 0.7216 0.0749 0.0923 0.0702
... ... ... ... ... ... ... ... ... ... ...
t 0.0201 0.0332 0.0303 0.0209 0.0121 1.0 0.0123 0.0409 0.0335 0.0251
y 0.0611 0.0339 0.0274 0.0486 0.0178 0.0 0.0100 0.0410 0.0359 0.0270

23 rows × 10 columns

PSPA is not scaled per position.

So we need to remove the redundant copy in zero position (leave s/t/y only) and scaled to 1 per position.

pssm.index[pssm.index.isin(['s','t','y'])]
Index(['s', 't', 'y'], dtype='object', name='aa')
_clean_zero(pssm)
Position -5 -4 -3 -2 -1 0 1 2 3 4
aa
P 0.0720 0.0534 0.1084 0.0226 0.1136 0.0 0.0463 0.0527 0.0681 0.0628
G 0.0245 0.0642 0.0512 0.0283 0.0706 0.0 0.7216 0.0749 0.0923 0.0702
... ... ... ... ... ... ... ... ... ... ...
t 0.0201 0.0332 0.0303 0.0209 0.0121 1.0 0.0123 0.0409 0.0335 0.0251
y 0.0611 0.0339 0.0274 0.0486 0.0178 0.0 0.0100 0.0410 0.0359 0.0270

23 rows × 10 columns


clean_zero_normalize


def clean_zero_normalize(
    pssm_df
):

Zero out non-last three values in position 0 (keep only s,t,y values at center), and normalize per position

This function applies phosphosite-specific cleaning and normalization to a PSSM.

At the center position (\(i = 0\)), only the last three rows of the matrix — corresponding to phosphorylatable residues s, t, and y — are retained. All other amino acid values at position 0 are set to 0.

After masking, the matrix is column-normalized to ensure the probabilities at each position sum to 1:

\[ P_i(x) = \frac{P_i(x)}{\sum_{x'} P_i(x')} \]

clean_zero_normalize(pssm)
Position -5 -4 -3 -2 -1 0 1 2 3 4
aa
P 0.058446 0.041715 0.086100 0.017935 0.096068 0.000000 0.042649 0.040482 0.052640 0.050260
G 0.019888 0.050152 0.040667 0.022459 0.059704 0.000000 0.664702 0.057536 0.071346 0.056182
... ... ... ... ... ... ... ... ... ... ...
t 0.016316 0.025935 0.024067 0.016586 0.010233 0.908018 0.011330 0.031418 0.025895 0.020088
y 0.049598 0.026482 0.021763 0.038568 0.015053 0.000000 0.009211 0.031495 0.027750 0.021609

23 rows × 10 columns

PSSMs of clusters


get_cluster_pssms


def get_cluster_pssms(
    df, cluster_col, seq_col:str='site_seq', id_col:str='sub_site',
    count_thr:int=10, # if less than the count threshold, not include in the return
    valid_thr:NoneType=None, # percentage of not-nan values in pssm
    plot:bool=False
):

Extract motifs from clusters in a dataframe

get_cluster_pssms(data,'kinase_group')

  0%|          | 0/10 [00:00<?, ?it/s]
 10%|█         | 1/10 [00:00<00:01,  5.11it/s]
 20%|██        | 2/10 [00:00<00:01,  5.45it/s]
 30%|███       | 3/10 [00:00<00:01,  5.62it/s]
 40%|████      | 4/10 [00:00<00:00,  6.10it/s]
 50%|█████     | 5/10 [00:00<00:00,  7.04it/s]
 70%|███████   | 7/10 [00:00<00:00,  8.60it/s]
 90%|█████████ | 9/10 [00:01<00:00, 10.54it/s]
100%|██████████| 10/10 [00:01<00:00,  8.72it/s]
-20P -20G -20A -20C -20S -20T -20V -20I -20L -20M -20F -20Y -20W -20H -20K -20R -20Q -20N -20D -20E -20s -20t -20y -19P -19G -19A -19C -19S -19T -19V -19I -19L -19M -19F -19Y -19W -19H -19K -19R -19Q -19N -19D -19E -19s -19t -19y -18P -18G -18A -18C ... 18E 18s 18t 18y 19P 19G 19A 19C 19S 19T 19V 19I 19L 19M 19F 19Y 19W 19H 19K 19R 19Q 19N 19D 19E 19s 19t 19y 20P 20G 20A 20C 20S 20T 20V 20I 20L 20M 20F 20Y 20W 20H 20K 20R 20Q 20N 20D 20E 20s 20t 20y
TK 0.056754 0.069067 0.070474 0.016886 0.045028 0.039400 0.058630 0.049836 0.083255 0.024038 0.032716 0.023335 0.007974 0.021107 0.069184 0.056168 0.045966 0.036585 0.058865 0.081027 0.024508 0.011843 0.017355 0.057915 0.067158 0.073125 0.016731 0.045396 0.037440 0.053001 0.049959 0.087633 0.023751 0.031356 0.018018 0.009594 0.022230 0.074529 0.063999 0.043173 0.041652 0.060840 0.073359 0.025389 0.010881 0.012870 0.058123 0.067810 0.062792 0.014006 ... 0.080097 0.027670 0.012621 0.012743 0.058601 0.065424 0.065302 0.014864 0.045078 0.038986 0.056408 0.048124 0.087354 0.018884 0.042276 0.017178 0.008894 0.018519 0.073830 0.061769 0.047149 0.040327 0.059698 0.078216 0.028509 0.010599 0.014011 0.066015 0.072738 0.065037 0.013692 0.048900 0.037897 0.054523 0.050122 0.077262 0.023227 0.039364 0.018337 0.008435 0.019071 0.070905 0.059413 0.040465 0.038509 0.056357 0.083496 0.025917 0.012836 0.017482
CMGC 0.080589 0.070340 0.083792 0.013709 0.050865 0.035874 0.053812 0.035874 0.074824 0.022037 0.029468 0.015247 0.007559 0.020884 0.069058 0.057143 0.044331 0.032031 0.053299 0.078668 0.042921 0.020115 0.007559 0.079718 0.075496 0.072937 0.012028 0.058733 0.035061 0.053871 0.032885 0.074472 0.021753 0.026104 0.013436 0.007806 0.021241 0.064491 0.063596 0.046449 0.037492 0.055534 0.080614 0.042482 0.018298 0.005502 0.074292 0.068930 0.074419 0.013786 ... 0.084768 0.052185 0.016556 0.005960 0.074212 0.073813 0.074212 0.011970 0.053731 0.034047 0.050805 0.031520 0.074345 0.017423 0.028195 0.015029 0.010374 0.020215 0.070887 0.063173 0.045884 0.033914 0.056124 0.086980 0.048278 0.018619 0.006251 0.082465 0.066961 0.074980 0.012563 0.050789 0.034215 0.050521 0.038626 0.079925 0.019246 0.028869 0.016840 0.006816 0.023924 0.072307 0.060011 0.046111 0.035285 0.057070 0.074044 0.043972 0.018043 0.006415
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
CK1 0.057860 0.076419 0.076965 0.014192 0.039847 0.037118 0.063865 0.052402 0.070961 0.017467 0.027293 0.012555 0.014192 0.024017 0.079694 0.050764 0.033297 0.035480 0.058406 0.086790 0.038210 0.015830 0.016376 0.061069 0.070883 0.069793 0.007634 0.033806 0.032715 0.053435 0.043075 0.087786 0.023991 0.022901 0.016903 0.010905 0.020720 0.076881 0.059978 0.056161 0.037077 0.061069 0.087786 0.041439 0.017448 0.006543 0.050000 0.067935 0.073913 0.010870 ... 0.084629 0.041451 0.015544 0.012090 0.053148 0.068746 0.069324 0.012709 0.036973 0.021953 0.047371 0.042172 0.082034 0.017909 0.030040 0.017909 0.020220 0.019064 0.095321 0.047371 0.041594 0.034084 0.065280 0.098209 0.041594 0.021953 0.015020 0.051865 0.074592 0.088578 0.015152 0.039627 0.034382 0.055361 0.039627 0.074592 0.020979 0.034382 0.020396 0.008741 0.026224 0.079837 0.046620 0.038462 0.043124 0.067599 0.087995 0.027972 0.012238 0.011655
Atypical 0.071895 0.065359 0.063492 0.014939 0.053221 0.045752 0.058824 0.050420 0.085901 0.020542 0.017740 0.020542 0.005602 0.034547 0.049486 0.057890 0.040149 0.034547 0.059757 0.078431 0.041083 0.025210 0.004669 0.042870 0.052190 0.074557 0.013979 0.063374 0.036347 0.054986 0.031687 0.089469 0.027959 0.036347 0.015843 0.010252 0.026095 0.074557 0.056850 0.054054 0.034483 0.063374 0.068966 0.046598 0.018639 0.006524 0.051068 0.063138 0.072423 0.012071 ... 0.072175 0.047483 0.013295 0.015195 0.059829 0.081671 0.066477 0.017094 0.047483 0.044634 0.050332 0.034188 0.063628 0.010446 0.037037 0.016144 0.006648 0.027540 0.047483 0.055081 0.057930 0.043685 0.059829 0.091168 0.039886 0.027540 0.014245 0.077290 0.060115 0.062977 0.015267 0.055344 0.029580 0.065840 0.026718 0.092557 0.017176 0.039122 0.023855 0.009542 0.016221 0.046756 0.054389 0.050573 0.055344 0.062977 0.068702 0.040076 0.020992 0.008588

10 rows × 943 columns

Entropy


get_entropy


def get_entropy(
    pssm_df, # a dataframe of pssm with index as aa and column as position
    return_min:bool=False, # return min entropy as a single value or return all entropy as a pd.series
    exclude_zero:bool=False, # exclude the column of 0 (center position) in the entropy calculation
    clean_zero:bool=True, # if true, zero out non-last three values in position 0 (keep only s,t,y values at center)
):

Calculate entropy per position of a PSSM surrounding 0. The less entropy the more information it contains.

Let \(P_i(x)\) be the probability of amino acid \(x\) at position \(i\) in the PSSM, with \(i \in \{-k, \dots, -1, 0, +1, \dots, +k\}\). The entropy at each position \(i\) is defined as:

\[ H_i = - \sum_{x} P_i(x) \log_2 \left( P_i(x) + \varepsilon \right) \]

where \(\varepsilon = 10^{-8}\) is a small constant added for numerical stability.

If exclude_zero=True, the central position \(i = 0\) is omitted from the entropy calculation.

If clean_zero=True, all values at position \(i = 0\) are zeroed out except for amino acids Serine (S), Threonine (T), and Tyrosine (Y), typically the only possible phospho-acceptors in kinase motif analysis.

If return_min=True, the function returns the minimum entropy across all positions:

\[ H_{\text{spec}} = \min_i H_i \]

Otherwise, the function returns the full vector \(\{H_i\}\) for each position \(i\), reflecting how much information (or uncertainty) is contained at each position in the motif.

# get entropy per position
get_entropy(pssm_df).sort_values()
Position
 0     1.074964
 1     3.346861
         ...   
-9     4.297737
-12    4.302189
Length: 41, dtype: float64
# calculate minimum entropy of surrouding positions
get_entropy(pssm_df,return_min=True,exclude_zero=True)
3.3468606104695913

get_entropy_flat


def get_entropy_flat(
    flat_pssm:Series,
    return_min:bool=False, # return min entropy as a single value or return all entropy as a pd.series
    exclude_zero:bool=False, # exclude the column of 0 (center position) in the entropy calculation
    clean_zero:bool=True, # if true, zero out non-last three values in position 0 (keep only s,t,y values at center)
):

Calculate entropy per position of a flat PSSM surrounding 0

get_entropy_flat(flat_pssm).sort_values()
Position
 0     1.074964
 1     3.346861
         ...   
-9     4.297737
-12    4.302189
Length: 41, dtype: float64
get_entropy_flat(flat_pssm,return_min=True,exclude_zero=True)
3.3468606104695913
# test equal
(get_entropy_flat(flat_pssm).round(5) == get_entropy(pssm_df).round(5)).value_counts()
True    41
Name: count, dtype: int64

Information Content


get_IC


def get_IC(
    pssm_df, return_min:bool=False, # return min entropy as a single value or return all entropy as a pd.series
    exclude_zero:bool=False, # exclude the column of 0 (center position) in the entropy calculation
    clean_zero:bool=True, # if true, zero out non-last three values in position 0 (keep only s,t,y values at center)
):

Calculate the information content (bits) from a frequency matrix, using log2(3) for the middle position and log2(len(pssm_df)) for others. The higher the more information it contains.

Let \(P_i(x)\) be the frequency (probability) of amino acid \(x\) at position \(i\) in the PSSM. The standard information content (IC) at position \(i\) is defined as:

\[ \mathrm{IC}_i = \max H_i - H_i \]

which is:

\[ \mathrm{IC}_i = \log_2(N) - H_i \]

where \(N\) is the number of possible amino acids (i.e., \(N = \text{len}(P_i)\)).

At the center position (\(i = 0\)), only three amino acids (S, T, Y) are relevant, so the maximum entropy at each position is defined as:

\[ \max H_i = \begin{cases} \log_2(3) & \text{if } i = 0 \\ \log_2(N) & \text{otherwise} \end{cases} \]

# the higher the more conserved
get_IC(pssm_df,exclude_zero=True).sort_values()
Position
-12    0.221373
-9     0.225825
         ...   
 4     0.717760
 1     1.176701
Length: 40, dtype: float64

Check all zero cases:

pssm_df2=pssm_df.copy()
pssm_df2[-20]=0
get_entropy(pssm_df2,exclude_zero=True).sort_values()
Position
-20    0.000000
 1     3.346861
         ...   
-9     4.297737
-12    4.302189
Length: 40, dtype: float64

get_IC_flat


def get_IC_flat(
    flat_pssm:Series,
    return_min:bool=False, # return min entropy as a single value or return all entropy as a pd.series
    exclude_zero:bool=False, # exclude the column of 0 (center position) in the entropy calculation
    clean_zero:bool=True, # if true, zero out non-last three values in position 0 (keep only s,t,y values at center)
):

Calculate the information content (bits) from a flattened pssm pd.Series, using log2(3) for the middle position and log2(len(pssm_df)) for others.

get_IC_flat(flat_pssm,exclude_zero=True).sort_values()
Position
-12    0.221373
-9     0.225825
         ...   
 4     0.717760
 1     1.176701
Length: 40, dtype: float64
(get_IC_flat(flat_pssm).round(5) == get_IC(pssm_df).round(5)).value_counts()
True    41
Name: count, dtype: int64

Overall specificity


get_specificity


def get_specificity(
    pssm_df
):

Get specificity score of a pssm, excluding zero position.

We evaluated the overall specificity of a PSSM by combining two metrics: the maximum IC across surrounding positions and the variance of IC values:

\[ \text{Specificity Score} = 2 \times \max(\text{IC}) + \mathrm{Var}(\text{IC}) \]

get_specificity(pssm_df)
2.381609408364424

get_specificity_flat


def get_specificity_flat(
    flat_pssm
):

Get specificity score of a pssm, excluding zero position.

get_specificity_flat(flat_pssm)
2.381609408364424