pssm.core

Functions related with PSSMs

Overview

PSSM Calculation

To calculate a position-specific probability matrix (PSSM) from phosphorylation site sequences:

pssm_df = get_prob(
    data=df,       # input DataFrame, Series, or list of sequences
    col='site_seq',    # column name containing sequences (if DataFrame)
)

Alternatively, use a list of sequences as input

pssm_df = get_prob(
    data=df['site_seq'],       # input DataFrame, Series, or list of sequences
)

To extract flattened PSSMs for each cluster in a dataset:

cluster_pssms = get_cluster_pssms(
    df=data,                    # DataFrame containing sequences and cluster labels
    cluster_col='kinase_group', # column name for cluster identifiers
    seq_col='site_seq',         # column name for site sequences
    id_col='sub_site',          # column for deduplication within clusters (None to skip)
    count_thr=10,               # minimum sequences required per cluster
    valid_thr=None,             # minimum fraction of non-zero PSSM values (None to skip)
    plot=False,                 # whether to plot sequence logos for each cluster
)

PSSM Transformation

To flatten a 2D PSSM into a dictionary (for storage or model input):

flat_dict = flatten_pssm(
    pssm_df=pssm_df,     # 2D PSSM with amino acids as rows, positions as columns
    column_wise=True,    # True for column-major order; False for row-wise (PyTorch)
)

To recover a 2D PSSM from a flattened Series:

pssm_df = recover_pssm(
    flat_pssm=flat_pssm,  # flattened PSSM as pd.Series with position+aa index
)

To clean the center position and normalize each column to sum to 1:

pssm_norm = clean_zero_normalize(
    pssm_df=pssm,  # 2D PSSM (zeros out non-s/t/y at position 0, then normalizes)
)

Entropy Analysis

To calculate Shannon entropy per position (lower entropy = more conserved):

entropy = get_entropy(
    pssm_df=pssm_df,      # 2D PSSM matrix
    return_min=False,     # True to return single minimum value
    exclude_zero=True,    # True to exclude center position from calculation
    clean_zero=True,      # True to zero out non-s/t/y at position 0
)

To calculate entropy from a flattened PSSM:

entropy = get_entropy_flat(
    flat_pssm=flat_pssm,  # flattened PSSM as pd.Series
    return_min=True,      # return minimum entropy across positions
    exclude_zero=True,    # exclude center position
)

Information Content

To calculate information content per position (higher IC = more conserved):

ic = get_IC(
    pssm_df=pssm_df,    # 2D PSSM matrix
    exclude_zero=True,  # exclude center position from calculation
)

To calculate IC from a flattened PSSM:

ic = get_IC_flat(
    flat_pssm=flat_pssm,  # flattened PSSM as pd.Series
    exclude_zero=True,    # exclude center position
)

Overall Specificity

To compute an overall specificity score combining max IC and IC variance:

spec = get_specificity(
    pssm_df=pssm_df,  # 2D PSSM matrix (excludes position 0 automatically)
)

To compute specificity from a flattened PSSM:

spec = get_specificity_flat(
    flat_pssm=flat_pssm,  # flattened PSSM as pd.Series
)

Setup

PSSM

We need to compute position-specific probability matrix (PSSM) from a list of aligned site sequences.

For each position \(i\) (e.g., from \(-7\) to \(+7\)), the probability of observing amino acid \(x\) is:

\[ P_i(x) = \frac{\text{count of amino acid } x \text{ at position } i}{\text{total counts at position } i} \]

The following 23 amino acids are included:

Standard amino acids:
A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y
Modified amino acids:
s, t, y (often used to denote phosphorylated S, T, Y)

The resulting matrix has: - Rows: Amino acids, - Columns: Sequence positions (centered on the phosphosite), - Values: Probabilities of each amino acid at each position.

get_prob


def get_prob(
    data:Union, # input data, list or df
    col:str='site_seq', # column name if input is df
):

Get the probability matrix of PSSM from phosphorylation site sequences.

data = Data.get_ks_dataset()
data_k = data[data.kinase_uniprot=='P49841'] # CDK1

get_prob(data_k, col='site_seq').shape

(23, 41)

# or
pssm_df = get_prob(data_k['site_seq'].tolist())
pssm_df.tail()

Position	-20	-19	-18	-17	-16	-15	-14	-13	-12	-11	-10	-9	-8	-7	-6	-5	-4	-3	-2	-1	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20
aa
D	0.059524	0.065476	0.044577	0.080238	0.051775	0.048744	0.047267	0.054572	0.064611	0.048387	0.052709	0.058565	0.049635	0.049563	0.071325	0.069666	0.058055	0.063768	0.037627	0.062229	0.000000	0.034783	0.046579	0.056934	0.030702	0.084919	0.064516	0.056464	0.051128	0.072289	0.059091	0.054962	0.033846	0.063174	0.053042	0.054859	0.063091	0.055380	0.054054	0.056090	0.053140
E	0.061012	0.066964	0.077266	0.056464	0.075444	0.075332	0.090103	0.094395	0.066079	0.095308	0.086384	0.070278	0.068613	0.083090	0.093159	0.076923	0.076923	0.053623	0.060781	0.069465	0.000000	0.023188	0.081514	0.075912	0.071637	0.080527	0.055718	0.063893	0.081203	0.078313	0.068182	0.087023	0.072308	0.064715	0.098284	0.073668	0.056782	0.072785	0.057234	0.057692	0.074074
s	0.037202	0.041667	0.035661	0.031204	0.038462	0.038405	0.044313	0.041298	0.070485	0.045455	0.049780	0.048316	0.103650	0.056851	0.059680	0.068215	0.223512	0.073913	0.082489	0.076700	0.677279	0.037681	0.081514	0.106569	0.274854	0.057101	0.076246	0.074294	0.129323	0.057229	0.054545	0.041221	0.083077	0.066256	0.045242	0.047022	0.069401	0.033228	0.063593	0.051282	0.045089
t	0.019345	0.010417	0.022288	0.019316	0.019231	0.016248	0.016248	0.011799	0.014684	0.020528	0.033675	0.036603	0.024818	0.017493	0.024745	0.039187	0.065312	0.030435	0.050651	0.026049	0.285094	0.021739	0.029112	0.037956	0.089181	0.033675	0.033724	0.040119	0.039098	0.031627	0.050000	0.025954	0.030769	0.015408	0.020281	0.010972	0.014196	0.015823	0.015898	0.014423	0.014493
y	0.004464	0.008929	0.007429	0.008915	0.008876	0.010340	0.007386	0.002950	0.005874	0.016129	0.010249	0.014641	0.005839	0.013120	0.014556	0.017417	0.007257	0.007246	0.005789	0.010130	0.037627	0.020290	0.007278	0.017518	0.019006	0.004392	0.004399	0.008915	0.006015	0.007530	0.006061	0.012214	0.007692	0.004622	0.012480	0.004702	0.006309	0.006329	0.012719	0.006410	0.011272

PSSM with weight score

get_pssm_weight


def get_pssm_weight(
    data:DataFrame, seq_col:str='seq', score_col:str='enrichment', # selected/input ratio OR already-log2
    to_log2:bool=True, aa_order:str='PGACSTVILMFYWHKRQNDEsty',
    center:str='pos', # per-position median centering or 'glob' global median centering
    count_thr:int=5, # threshold to filter out if <count_thr
    alpha:str | float='auto', # shrinkage strength: value * count / (count + alpha)
):

Position-specific amino acid enrichment matrix: PSSM(aa,pos) = mean( log2(score) ) over peptides with aa at pos.

Transform PSSM

flatten_pssm


def flatten_pssm(
    pssm_df, column_wise:bool=True, # if True, column major flatten; else row wise flatten (for pytorch training)
):

Flatten PSSM dataframe to dictionary

flat_pssm = pd.Series(flatten_pssm(pssm_df))
flat_pssm

-20P    0.069940
-20G    0.087798
          ...   
20t     0.014493
20y     0.011272
Length: 943, dtype: float64

flat_pssm.reset_index()[0]

0      0.069940
1      0.087798
         ...   
941    0.014493
942    0.011272
Name: 0, Length: 943, dtype: float64

recover_pssm


def recover_pssm(
    flat_pssm:Series
):

Recover 2D PSSM from flattened PSSM Series.

out = recover_pssm(flat_pssm)
out

Position	-20	-19	-18	-17	-16	-15	-14	-13	-12	-11	-10	-9	-8	-7	-6	-5	-4	-3	-2	-1	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20
aa
P	0.069940	0.077381	0.062407	0.084695	0.082840	0.072378	0.087149	0.081121	0.077827	0.071848	0.101025	0.074671	0.084672	0.106414	0.112082	0.094340	0.079826	0.133333	0.150507	0.131693	0.000000	0.420290	0.141194	0.094891	0.049708	0.187408	0.082111	0.077266	0.064662	0.106928	0.074242	0.077863	0.078462	0.070878	0.095164	0.092476	0.074132	0.090190	0.071542	0.099359	0.078905
G	0.087798	0.087798	0.069837	0.065379	0.076923	0.054653	0.088626	0.095870	0.064611	0.068915	0.070278	0.087848	0.058394	0.065598	0.066958	0.063861	0.068215	0.101449	0.076700	0.105644	0.000000	0.075362	0.065502	0.086131	0.046784	0.054173	0.060117	0.083210	0.084211	0.058735	0.063636	0.083969	0.084615	0.075501	0.076443	0.081505	0.067823	0.060127	0.077901	0.088141	0.066023
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
t	0.019345	0.010417	0.022288	0.019316	0.019231	0.016248	0.016248	0.011799	0.014684	0.020528	0.033675	0.036603	0.024818	0.017493	0.024745	0.039187	0.065312	0.030435	0.050651	0.026049	0.285094	0.021739	0.029112	0.037956	0.089181	0.033675	0.033724	0.040119	0.039098	0.031627	0.050000	0.025954	0.030769	0.015408	0.020281	0.010972	0.014196	0.015823	0.015898	0.014423	0.014493
y	0.004464	0.008929	0.007429	0.008915	0.008876	0.010340	0.007386	0.002950	0.005874	0.016129	0.010249	0.014641	0.005839	0.013120	0.014556	0.017417	0.007257	0.007246	0.005789	0.010130	0.037627	0.020290	0.007278	0.017518	0.019006	0.004392	0.004399	0.008915	0.006015	0.007530	0.006061	0.012214	0.007692	0.004622	0.012480	0.004702	0.006309	0.006329	0.012719	0.006410	0.011272

23 rows × 41 columns

out.equals(pssm_df)

True

Or recover from PSPA data

pspa = Data.get_pspa()

pssm=recover_pssm(pspa.loc['AAK1'].dropna())
pssm

Position	-5	-4	-3	-2	-1	0	1	2	3	4
aa
P	0.0720	0.0534	0.1084	0.0226	0.1136	0.0	0.0463	0.0527	0.0681	0.0628
G	0.0245	0.0642	0.0512	0.0283	0.0706	0.0	0.7216	0.0749	0.0923	0.0702
...	...	...	...	...	...	...	...	...	...	...
t	0.0201	0.0332	0.0303	0.0209	0.0121	1.0	0.0123	0.0409	0.0335	0.0251
y	0.0611	0.0339	0.0274	0.0486	0.0178	0.0	0.0100	0.0410	0.0359	0.0270

23 rows × 10 columns

PSPA is not scaled per position.

So we need to remove the redundant copy in zero position (leave s/t/y only) and scaled to 1 per position.

pssm.index[pssm.index.isin(['s','t','y'])]

Index(['s', 't', 'y'], dtype='object', name='aa')

_clean_zero(pssm)

Position	-5	-4	-3	-2	-1	0	1	2	3	4
aa
P	0.0720	0.0534	0.1084	0.0226	0.1136	0.0	0.0463	0.0527	0.0681	0.0628
G	0.0245	0.0642	0.0512	0.0283	0.0706	0.0	0.7216	0.0749	0.0923	0.0702
...	...	...	...	...	...	...	...	...	...	...
t	0.0201	0.0332	0.0303	0.0209	0.0121	1.0	0.0123	0.0409	0.0335	0.0251
y	0.0611	0.0339	0.0274	0.0486	0.0178	0.0	0.0100	0.0410	0.0359	0.0270

23 rows × 10 columns

clean_zero_normalize


def clean_zero_normalize(
    pssm_df
):

Zero out non-last three values in position 0 (keep only s,t,y values at center), and normalize per position

This function applies phosphosite-specific cleaning and normalization to a PSSM.

At the center position (\(i = 0\)), only the last three rows of the matrix — corresponding to phosphorylatable residues s, t, and y — are retained. All other amino acid values at position 0 are set to 0.

After masking, the matrix is column-normalized to ensure the probabilities at each position sum to 1:

\[ P_i(x) = \frac{P_i(x)}{\sum_{x'} P_i(x')} \]

clean_zero_normalize(pssm)

Position	-5	-4	-3	-2	-1	0	1	2	3	4
aa
P	0.058446	0.041715	0.086100	0.017935	0.096068	0.000000	0.042649	0.040482	0.052640	0.050260
G	0.019888	0.050152	0.040667	0.022459	0.059704	0.000000	0.664702	0.057536	0.071346	0.056182
...	...	...	...	...	...	...	...	...	...	...
t	0.016316	0.025935	0.024067	0.016586	0.010233	0.908018	0.011330	0.031418	0.025895	0.020088
y	0.049598	0.026482	0.021763	0.038568	0.015053	0.000000	0.009211	0.031495	0.027750	0.021609

23 rows × 10 columns

PSSMs of clusters

get_cluster_pssms


def get_cluster_pssms(
    df, cluster_col, seq_col:str='site_seq', id_col:str='sub_site',
    count_thr:int=10, # if less than the count threshold, not include in the return
    valid_thr:NoneType=None, # percentage of not-nan values in pssm
    plot:bool=False
):

Extract motifs from clusters in a dataframe

get_cluster_pssms(data,'kinase_group')


  0%|          | 0/10 [00:00<?, ?it/s]
 10%|█         | 1/10 [00:00<00:01,  5.11it/s]
 20%|██        | 2/10 [00:00<00:01,  5.45it/s]
 30%|███       | 3/10 [00:00<00:01,  5.62it/s]
 40%|████      | 4/10 [00:00<00:00,  6.10it/s]
 50%|█████     | 5/10 [00:00<00:00,  7.04it/s]
 70%|███████   | 7/10 [00:00<00:00,  8.60it/s]
 90%|█████████ | 9/10 [00:01<00:00, 10.54it/s]
100%|██████████| 10/10 [00:01<00:00,  8.72it/s]

	-20P	-20G	-20A	-20C	-20S	-20T	-20V	-20I	-20L	-20M	-20F	-20Y	-20W	-20H	-20K	-20R	-20Q	-20N	-20D	-20E	-20s	-20t	-20y	-19P	-19G	-19A	-19C	-19S	-19T	-19V	-19I	-19L	-19M	-19F	-19Y	-19W	-19H	-19K	-19R	-19Q	-19N	-19D	-19E	-19s	-19t	-19y	-18P	-18G	-18A	-18C	...	18E	18s	18t	18y	19P	19G	19A	19C	19S	19T	19V	19I	19L	19M	19F	19Y	19W	19H	19K	19R	19Q	19N	19D	19E	19s	19t	19y	20P	20G	20A	20C	20S	20T	20V	20I	20L	20M	20F	20Y	20W	20H	20K	20R	20Q	20N	20D	20E	20s	20t	20y
TK	0.056754	0.069067	0.070474	0.016886	0.045028	0.039400	0.058630	0.049836	0.083255	0.024038	0.032716	0.023335	0.007974	0.021107	0.069184	0.056168	0.045966	0.036585	0.058865	0.081027	0.024508	0.011843	0.017355	0.057915	0.067158	0.073125	0.016731	0.045396	0.037440	0.053001	0.049959	0.087633	0.023751	0.031356	0.018018	0.009594	0.022230	0.074529	0.063999	0.043173	0.041652	0.060840	0.073359	0.025389	0.010881	0.012870	0.058123	0.067810	0.062792	0.014006	...	0.080097	0.027670	0.012621	0.012743	0.058601	0.065424	0.065302	0.014864	0.045078	0.038986	0.056408	0.048124	0.087354	0.018884	0.042276	0.017178	0.008894	0.018519	0.073830	0.061769	0.047149	0.040327	0.059698	0.078216	0.028509	0.010599	0.014011	0.066015	0.072738	0.065037	0.013692	0.048900	0.037897	0.054523	0.050122	0.077262	0.023227	0.039364	0.018337	0.008435	0.019071	0.070905	0.059413	0.040465	0.038509	0.056357	0.083496	0.025917	0.012836	0.017482
CMGC	0.080589	0.070340	0.083792	0.013709	0.050865	0.035874	0.053812	0.035874	0.074824	0.022037	0.029468	0.015247	0.007559	0.020884	0.069058	0.057143	0.044331	0.032031	0.053299	0.078668	0.042921	0.020115	0.007559	0.079718	0.075496	0.072937	0.012028	0.058733	0.035061	0.053871	0.032885	0.074472	0.021753	0.026104	0.013436	0.007806	0.021241	0.064491	0.063596	0.046449	0.037492	0.055534	0.080614	0.042482	0.018298	0.005502	0.074292	0.068930	0.074419	0.013786	...	0.084768	0.052185	0.016556	0.005960	0.074212	0.073813	0.074212	0.011970	0.053731	0.034047	0.050805	0.031520	0.074345	0.017423	0.028195	0.015029	0.010374	0.020215	0.070887	0.063173	0.045884	0.033914	0.056124	0.086980	0.048278	0.018619	0.006251	0.082465	0.066961	0.074980	0.012563	0.050789	0.034215	0.050521	0.038626	0.079925	0.019246	0.028869	0.016840	0.006816	0.023924	0.072307	0.060011	0.046111	0.035285	0.057070	0.074044	0.043972	0.018043	0.006415
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
CK1	0.057860	0.076419	0.076965	0.014192	0.039847	0.037118	0.063865	0.052402	0.070961	0.017467	0.027293	0.012555	0.014192	0.024017	0.079694	0.050764	0.033297	0.035480	0.058406	0.086790	0.038210	0.015830	0.016376	0.061069	0.070883	0.069793	0.007634	0.033806	0.032715	0.053435	0.043075	0.087786	0.023991	0.022901	0.016903	0.010905	0.020720	0.076881	0.059978	0.056161	0.037077	0.061069	0.087786	0.041439	0.017448	0.006543	0.050000	0.067935	0.073913	0.010870	...	0.084629	0.041451	0.015544	0.012090	0.053148	0.068746	0.069324	0.012709	0.036973	0.021953	0.047371	0.042172	0.082034	0.017909	0.030040	0.017909	0.020220	0.019064	0.095321	0.047371	0.041594	0.034084	0.065280	0.098209	0.041594	0.021953	0.015020	0.051865	0.074592	0.088578	0.015152	0.039627	0.034382	0.055361	0.039627	0.074592	0.020979	0.034382	0.020396	0.008741	0.026224	0.079837	0.046620	0.038462	0.043124	0.067599	0.087995	0.027972	0.012238	0.011655
Atypical	0.071895	0.065359	0.063492	0.014939	0.053221	0.045752	0.058824	0.050420	0.085901	0.020542	0.017740	0.020542	0.005602	0.034547	0.049486	0.057890	0.040149	0.034547	0.059757	0.078431	0.041083	0.025210	0.004669	0.042870	0.052190	0.074557	0.013979	0.063374	0.036347	0.054986	0.031687	0.089469	0.027959	0.036347	0.015843	0.010252	0.026095	0.074557	0.056850	0.054054	0.034483	0.063374	0.068966	0.046598	0.018639	0.006524	0.051068	0.063138	0.072423	0.012071	...	0.072175	0.047483	0.013295	0.015195	0.059829	0.081671	0.066477	0.017094	0.047483	0.044634	0.050332	0.034188	0.063628	0.010446	0.037037	0.016144	0.006648	0.027540	0.047483	0.055081	0.057930	0.043685	0.059829	0.091168	0.039886	0.027540	0.014245	0.077290	0.060115	0.062977	0.015267	0.055344	0.029580	0.065840	0.026718	0.092557	0.017176	0.039122	0.023855	0.009542	0.016221	0.046756	0.054389	0.050573	0.055344	0.062977	0.068702	0.040076	0.020992	0.008588

10 rows × 943 columns

Entropy

get_entropy


def get_entropy(
    pssm_df, # a dataframe of pssm with index as aa and column as position
    return_min:bool=False, # return min entropy as a single value or return all entropy as a pd.series
    exclude_zero:bool=False, # exclude the column of 0 (center position) in the entropy calculation
    clean_zero:bool=True, # if true, zero out non-last three values in position 0 (keep only s,t,y values at center)
):

Calculate entropy per position of a PSSM surrounding 0. The less entropy the more information it contains.

Let \(P_i(x)\) be the probability of amino acid \(x\) at position \(i\) in the PSSM, with \(i \in \{-k, \dots, -1, 0, +1, \dots, +k\}\). The entropy at each position \(i\) is defined as:

\[ H_i = - \sum_{x} P_i(x) \log_2 \left( P_i(x) + \varepsilon \right) \]

where \(\varepsilon = 10^{-8}\) is a small constant added for numerical stability.

If exclude_zero=True, the central position \(i = 0\) is omitted from the entropy calculation.

If clean_zero=True, all values at position \(i = 0\) are zeroed out except for amino acids Serine (S), Threonine (T), and Tyrosine (Y), typically the only possible phospho-acceptors in kinase motif analysis.

If return_min=True, the function returns the minimum entropy across all positions:

\[ H_{\text{spec}} = \min_i H_i \]

Otherwise, the function returns the full vector \(\{H_i\}\) for each position \(i\), reflecting how much information (or uncertainty) is contained at each position in the motif.

# get entropy per position
get_entropy(pssm_df).sort_values()

Position
 0     1.074964
 1     3.346861
         ...   
-9     4.297737
-12    4.302189
Length: 41, dtype: float64

# calculate minimum entropy of surrouding positions
get_entropy(pssm_df,return_min=True,exclude_zero=True)

3.3468606104695913

get_entropy_flat


def get_entropy_flat(
    flat_pssm:Series,
    return_min:bool=False, # return min entropy as a single value or return all entropy as a pd.series
    exclude_zero:bool=False, # exclude the column of 0 (center position) in the entropy calculation
    clean_zero:bool=True, # if true, zero out non-last three values in position 0 (keep only s,t,y values at center)
):

Calculate entropy per position of a flat PSSM surrounding 0

get_entropy_flat(flat_pssm).sort_values()

Position
 0     1.074964
 1     3.346861
         ...   
-9     4.297737
-12    4.302189
Length: 41, dtype: float64

get_entropy_flat(flat_pssm,return_min=True,exclude_zero=True)

3.3468606104695913

# test equal
(get_entropy_flat(flat_pssm).round(5) == get_entropy(pssm_df).round(5)).value_counts()

True    41
Name: count, dtype: int64

Information Content

get_IC


def get_IC(
    pssm_df, return_min:bool=False, # return min entropy as a single value or return all entropy as a pd.series
    exclude_zero:bool=False, # exclude the column of 0 (center position) in the entropy calculation
    clean_zero:bool=True, # if true, zero out non-last three values in position 0 (keep only s,t,y values at center)
):

Calculate the information content (bits) from a frequency matrix, using log2(3) for the middle position and log2(len(pssm_df)) for others. The higher the more information it contains.

Let \(P_i(x)\) be the frequency (probability) of amino acid \(x\) at position \(i\) in the PSSM. The standard information content (IC) at position \(i\) is defined as:

\[ \mathrm{IC}_i = \max H_i - H_i \]

which is:

\[ \mathrm{IC}_i = \log_2(N) - H_i \]

where \(N\) is the number of possible amino acids (i.e., \(N = \text{len}(P_i)\)).

At the center position (\(i = 0\)), only three amino acids (S, T, Y) are relevant, so the maximum entropy at each position is defined as:

\[ \max H_i = \begin{cases} \log_2(3) & \text{if } i = 0 \\ \log_2(N) & \text{otherwise} \end{cases} \]

# the higher the more conserved
get_IC(pssm_df,exclude_zero=True).sort_values()

Position
-12    0.221373
-9     0.225825
         ...   
 4     0.717760
 1     1.176701
Length: 40, dtype: float64

Check all zero cases:

pssm_df2=pssm_df.copy()

pssm_df2[-20]=0

get_entropy(pssm_df2,exclude_zero=True).sort_values()

Position
-20    0.000000
 1     3.346861
         ...   
-9     4.297737
-12    4.302189
Length: 40, dtype: float64

get_IC_flat


def get_IC_flat(
    flat_pssm:Series,
    return_min:bool=False, # return min entropy as a single value or return all entropy as a pd.series
    exclude_zero:bool=False, # exclude the column of 0 (center position) in the entropy calculation
    clean_zero:bool=True, # if true, zero out non-last three values in position 0 (keep only s,t,y values at center)
):

Calculate the information content (bits) from a flattened pssm pd.Series, using log2(3) for the middle position and log2(len(pssm_df)) for others.

get_IC_flat(flat_pssm,exclude_zero=True).sort_values()

Position
-12    0.221373
-9     0.225825
         ...   
 4     0.717760
 1     1.176701
Length: 40, dtype: float64

(get_IC_flat(flat_pssm).round(5) == get_IC(pssm_df).round(5)).value_counts()

True    41
Name: count, dtype: int64

Overall specificity

get_specificity


def get_specificity(
    pssm_df
):

Get specificity score of a pssm, excluding zero position.

We evaluated the overall specificity of a PSSM by combining two metrics: the maximum IC across surrounding positions and the variance of IC values:

\[ \text{Specificity Score} = 2 \times \max(\text{IC}) + \mathrm{Var}(\text{IC}) \]

get_specificity(pssm_df)

2.381609408364424

get_specificity_flat


def get_specificity_flat(
    flat_pssm
):

Get specificity score of a pssm, excluding zero position.

get_specificity_flat(flat_pssm)

2.381609408364424