PSSM

Functions related to PSSMs

Setup

from katlas.pssm import *

PSSM

We first compute a position-specific probability matrix (PSSM) from a list of aligned site sequences.

For each position $i$ (e.g., from $-7$ to $+7$), the probability of observing amino acid $x$ is:

\[ P_i(x) = \frac{\text{count of amino acid } x \text{ at position } i}{\text{total counts at position } i} \]

The following 23 amino acids are included:

Standard amino acids:
A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y
Modified amino acids:
s, t, y (often used to denote phosphorylated S, T, Y)

The resulting matrix has:

Rows: amino acids
Columns: sequence positions (centered on the phosphosite)
Values: probabilities of each amino acid at each position
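The counting rule above can be sketched in plain Python. This is a minimal illustration, not the implementation of get_prob: the helper name prob_matrix, the toy sequences, and the choice to skip characters outside the 23-letter alphabet (such as a padding character `_`) are assumptions made for the example. The row order follows the column labels that appear in the flattened outputs on this page (-20P, -20G, ...).

```python
from collections import Counter

# 20 standard residues plus phosphorylated s, t, y (23 rows in total)
AA = list("PGACSTVILMFYWHKRQNDE") + ["s", "t", "y"]

def prob_matrix(seqs):
    """Return {position: {aa: probability}} for equal-length aligned sequences."""
    length = len(seqs[0])
    half = length // 2  # the middle residue is mapped to position 0
    matrix = {}
    for idx in range(length):
        # Assumption: characters outside AA (e.g. padding) are not counted
        counts = Counter(s[idx] for s in seqs if s[idx] in AA)
        total = sum(counts.values())
        matrix[idx - half] = {
            aa: (counts[aa] / total if total else 0.0) for aa in AA
        }
    return matrix

toy = ["APsPK", "GPsPR", "AEtPK", "AP_PK"]  # hypothetical 5-mers
pm = prob_matrix(toy)
print(sorted(pm))                 # positions -2 .. 2
print(pm[0]["s"], pm[0]["t"])     # the '_' at the center is ignored
```

Each column sums to 1 as long as at least one recognised residue is observed at that position.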

source

get_prob

 get_prob (data:Union[pandas.core.frame.DataFrame,
           pandas.core.series.Series,Sequence[str]], col:str='site_seq')

Get the probability matrix of PSSM from phosphorylation site sequences.

	Type	Default	Details
data	Union		input data, list or df
col	str	site_seq	column name if input is df

data = Data.get_ks_dataset()
data_k = data[data.kinase_uniprot=='P49841'] # GSK3B

get_prob(data_k, col='site_seq').shape

(23, 41)

# or
pssm_df = get_prob(data_k['site_seq'].tolist())
pssm_df.tail()

Position	-20	-19	-18	-17	-16	-15	-14	-13	-12	-11	-10	-9	-8	-7	-6	-5	-4	-3	-2	-1	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20
aa
D	0.059524	0.065476	0.044577	0.080238	0.051775	0.048744	0.047267	0.054572	0.064611	0.048387	0.052709	0.058565	0.049635	0.049563	0.071325	0.069666	0.058055	0.063768	0.037627	0.062229	0.000000	0.034783	0.046579	0.056934	0.030702	0.084919	0.064516	0.056464	0.051128	0.072289	0.059091	0.054962	0.033846	0.063174	0.053042	0.054859	0.063091	0.055380	0.054054	0.056090	0.053140
E	0.061012	0.066964	0.077266	0.056464	0.075444	0.075332	0.090103	0.094395	0.066079	0.095308	0.086384	0.070278	0.068613	0.083090	0.093159	0.076923	0.076923	0.053623	0.060781	0.069465	0.000000	0.023188	0.081514	0.075912	0.071637	0.080527	0.055718	0.063893	0.081203	0.078313	0.068182	0.087023	0.072308	0.064715	0.098284	0.073668	0.056782	0.072785	0.057234	0.057692	0.074074
s	0.037202	0.041667	0.035661	0.031204	0.038462	0.038405	0.044313	0.041298	0.070485	0.045455	0.049780	0.048316	0.103650	0.056851	0.059680	0.068215	0.223512	0.073913	0.082489	0.076700	0.677279	0.037681	0.081514	0.106569	0.274854	0.057101	0.076246	0.074294	0.129323	0.057229	0.054545	0.041221	0.083077	0.066256	0.045242	0.047022	0.069401	0.033228	0.063593	0.051282	0.045089
t	0.019345	0.010417	0.022288	0.019316	0.019231	0.016248	0.016248	0.011799	0.014684	0.020528	0.033675	0.036603	0.024818	0.017493	0.024745	0.039187	0.065312	0.030435	0.050651	0.026049	0.285094	0.021739	0.029112	0.037956	0.089181	0.033675	0.033724	0.040119	0.039098	0.031627	0.050000	0.025954	0.030769	0.015408	0.020281	0.010972	0.014196	0.015823	0.015898	0.014423	0.014493
y	0.004464	0.008929	0.007429	0.008915	0.008876	0.010340	0.007386	0.002950	0.005874	0.016129	0.010249	0.014641	0.005839	0.013120	0.014556	0.017417	0.007257	0.007246	0.005789	0.010130	0.037627	0.020290	0.007278	0.017518	0.019006	0.004392	0.004399	0.008915	0.006015	0.007530	0.006061	0.012214	0.007692	0.004622	0.012480	0.004702	0.006309	0.006329	0.012719	0.006410	0.011272

Transform PSSM

source

flatten_pssm

 flatten_pssm (pssm_df, column_wise=True)

Flatten PSSM dataframe to dictionary

	Type	Default	Details
pssm_df
column_wise	bool	True	if True, column major flatten; else row wise flatten (for pytorch training)

flat_pssm = pd.Series(flatten_pssm(pssm_df))
flat_pssm

-20P    0.069940
-20G    0.087798
          ...   
20t     0.014493
20y     0.011272
Length: 943, dtype: float64

flat_pssm.reset_index()[0]

0      0.069940
1      0.087798
         ...   
941    0.014493
942    0.011272
Name: 0, Length: 943, dtype: float64

source

recover_pssm

 recover_pssm (flat_pssm:pandas.core.series.Series)

Recover 2D PSSM from flattened PSSM Series.

out = recover_pssm(flat_pssm)
out

Position	-20	-19	-18	-17	-16	-15	-14	-13	-12	-11	-10	-9	-8	-7	-6	-5	-4	-3	-2	-1	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20
aa
P	0.069940	0.077381	0.062407	0.084695	0.082840	0.072378	0.087149	0.081121	0.077827	0.071848	0.101025	0.074671	0.084672	0.106414	0.112082	0.094340	0.079826	0.133333	0.150507	0.131693	0.000000	0.420290	0.141194	0.094891	0.049708	0.187408	0.082111	0.077266	0.064662	0.106928	0.074242	0.077863	0.078462	0.070878	0.095164	0.092476	0.074132	0.090190	0.071542	0.099359	0.078905
G	0.087798	0.087798	0.069837	0.065379	0.076923	0.054653	0.088626	0.095870	0.064611	0.068915	0.070278	0.087848	0.058394	0.065598	0.066958	0.063861	0.068215	0.101449	0.076700	0.105644	0.000000	0.075362	0.065502	0.086131	0.046784	0.054173	0.060117	0.083210	0.084211	0.058735	0.063636	0.083969	0.084615	0.075501	0.076443	0.081505	0.067823	0.060127	0.077901	0.088141	0.066023
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
t	0.019345	0.010417	0.022288	0.019316	0.019231	0.016248	0.016248	0.011799	0.014684	0.020528	0.033675	0.036603	0.024818	0.017493	0.024745	0.039187	0.065312	0.030435	0.050651	0.026049	0.285094	0.021739	0.029112	0.037956	0.089181	0.033675	0.033724	0.040119	0.039098	0.031627	0.050000	0.025954	0.030769	0.015408	0.020281	0.010972	0.014196	0.015823	0.015898	0.014423	0.014493
y	0.004464	0.008929	0.007429	0.008915	0.008876	0.010340	0.007386	0.002950	0.005874	0.016129	0.010249	0.014641	0.005839	0.013120	0.014556	0.017417	0.007257	0.007246	0.005789	0.010130	0.037627	0.020290	0.007278	0.017518	0.019006	0.004392	0.004399	0.008915	0.006015	0.007530	0.006061	0.012214	0.007692	0.004622	0.012480	0.004702	0.006309	0.006329	0.012719	0.006410	0.011272

23 rows × 41 columns

out.equals(pssm_df)

True
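The round trip shown above relies on keys such as `-20P` that combine the position and the residue. A minimal sketch of that idea on nested dictionaries, assuming a column-major order and single-character residue labels (the helper names flatten and recover are hypothetical and do not reproduce the library code):

```python
def flatten(matrix, aa_order):
    """Column-major flatten: {pos: {aa: value}} -> {'<pos><aa>': value}."""
    return {
        f"{pos}{aa}": matrix[pos][aa]
        for pos in sorted(matrix)
        for aa in aa_order
    }

def recover(flat):
    """Rebuild {pos: {aa: value}} from keys such as '-20P' or '3s'."""
    matrix = {}
    for key, value in flat.items():
        pos, aa = int(key[:-1]), key[-1]  # last character is the residue
        matrix.setdefault(pos, {})[aa] = value
    return matrix

toy = {
    -1: {"P": 0.7, "s": 0.3},
     0: {"P": 0.0, "s": 1.0},
     1: {"P": 0.4, "s": 0.6},
}
flat = flatten(toy, ["P", "s"])
print(list(flat))
print(recover(flat) == toy)
```

Because the residue is always the last character of the key, negative positions parse without any special handling.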

Or recover from PSPA data

pspa = Data.get_pspa()

pssm=recover_pssm(pspa.loc['AAK1'].dropna())
pssm

Position	-5	-4	-3	-2	-1	0	1	2	3	4
aa
P	0.0720	0.0534	0.1084	0.0226	0.1136	0.0	0.0463	0.0527	0.0681	0.0628
G	0.0245	0.0642	0.0512	0.0283	0.0706	0.0	0.7216	0.0749	0.0923	0.0702
...	...	...	...	...	...	...	...	...	...	...
t	0.0201	0.0332	0.0303	0.0209	0.0121	1.0	0.0123	0.0409	0.0335	0.0251
y	0.0611	0.0339	0.0274	0.0486	0.0178	0.0	0.0100	0.0410	0.0359	0.0270

23 rows × 10 columns

PSPA is not scaled per position.

So we need to remove the redundant copy at the zero position (keeping only s/t/y) and rescale each position to sum to 1.

pssm.index[pssm.index.isin(['s','t','y'])]

Index(['s', 't', 'y'], dtype='object', name='aa')

_clean_zero(pssm)

Position	-5	-4	-3	-2	-1	0	1	2	3	4
aa
P	0.0720	0.0534	0.1084	0.0226	0.1136	0.0	0.0463	0.0527	0.0681	0.0628
G	0.0245	0.0642	0.0512	0.0283	0.0706	0.0	0.7216	0.0749	0.0923	0.0702
...	...	...	...	...	...	...	...	...	...	...
t	0.0201	0.0332	0.0303	0.0209	0.0121	1.0	0.0123	0.0409	0.0335	0.0251
y	0.0611	0.0339	0.0274	0.0486	0.0178	0.0	0.0100	0.0410	0.0359	0.0270

23 rows × 10 columns

source

clean_zero_normalize

 clean_zero_normalize (pssm_df)

Zero out non-last three values in position 0 (keep only s,t,y values at center), and normalize per position

This function applies phosphosite-specific cleaning and normalization to a PSSM.

At the center position ($i = 0$), only the last three rows of the matrix — corresponding to phosphorylatable residues s, t, and y — are retained. All other amino acid values at position 0 are set to 0.

After masking, the matrix is column-normalized to ensure the probabilities at each position sum to 1:

\[ P_i(x) \leftarrow \frac{P_i(x)}{\sum_{x'} P_i(x')} \]

clean_zero_normalize(pssm)

Position	-5	-4	-3	-2	-1	0	1	2	3	4
aa
P	0.058446	0.041715	0.086100	0.017935	0.096068	0.000000	0.042649	0.040482	0.052640	0.050260
G	0.019888	0.050152	0.040667	0.022459	0.059704	0.000000	0.664702	0.057536	0.071346	0.056182
...	...	...	...	...	...	...	...	...	...	...
t	0.016316	0.025935	0.024067	0.016586	0.010233	0.908018	0.011330	0.031418	0.025895	0.020088
y	0.049598	0.026482	0.021763	0.038568	0.015053	0.000000	0.009211	0.031495	0.027750	0.021609

23 rows × 10 columns
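The masking and per-position normalization described for clean_zero_normalize can be sketched as follows. This is an illustration under stated assumptions: the matrix is a nested dictionary, the phospho-acceptor rows are named s, t and y, and the function name clean_zero_normalize_sketch is hypothetical.

```python
PHOSPHO = ("s", "t", "y")

def clean_zero_normalize_sketch(matrix):
    """matrix: {pos: {aa: value}}; returns a new column-normalized matrix."""
    out = {}
    for pos, col in matrix.items():
        if pos == 0:
            # keep only the phospho-acceptor rows at the center position
            col = {aa: (v if aa in PHOSPHO else 0.0) for aa, v in col.items()}
        total = sum(col.values())
        # divide by the column sum so each position sums to 1
        out[pos] = {aa: (v / total if total else 0.0) for aa, v in col.items()}
    return out

toy = {
    -1: {"P": 0.6, "S": 0.6, "s": 0.0, "t": 0.0, "y": 0.0},
     0: {"P": 0.0, "S": 0.1, "s": 0.1, "t": 1.0, "y": 0.0},
}
res = clean_zero_normalize_sketch(toy)
print(res[0])   # S is removed; s and t are rescaled to sum to 1
print(res[-1])  # flanking column is only rescaled
```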

PSSM of Log odds

source

get_pssm_LO

 get_pssm_LO (pssm_df, site_type)

Get log odds PSSM: log2 (freq pssm/background pssm).

	Details
pssm_df
site_type	S, T, Y, ST, or STY

Let $P_i(x)$ be the frequency of amino acid $x$ at position $i$ in the input PSSM, and let $B_i(x)$ be the background frequency of amino acid $x$ at the same position, derived from a background model corresponding to the specified site type (S, T, Y, ST, or STY).

The log-odds score at each position $i$ for amino acid $x$ is computed as:

\[ \mathrm{LO}_i(x) = \log_2 \left( \frac{P_i(x) + \varepsilon}{B_i(x) + \varepsilon} \right) \]

where $\varepsilon = 10^{-8}$ is a small constant added for numerical stability and to avoid division by zero.

This results in a matrix where:

Positive values indicate enrichment over background,
Negative values indicate depletion relative to background,
Zero indicates no difference from the expected background.
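A small numerical sketch of the formula may help when reading log-odds tables. It only evaluates the expression above for single values; the function name log_odds and the example frequencies are made up for illustration.

```python
import math

EPS = 1e-8  # the stabilizing constant from the formula above

def log_odds(p, b, eps=EPS):
    """log2((p + eps) / (b + eps)) for one amino acid at one position."""
    return math.log2((p + eps) / (b + eps))

enriched = log_odds(0.42, 0.07)   # six-fold enrichment, about log2(6)
absent = log_odds(0.0, 0.07)      # never observed, background 7%
neutral = log_odds(1.0, 1.0)      # identical frequencies
print(round(enriched, 3), round(absent, 3), neutral)
```

An amino acid that is never observed does not yield minus infinity but a large negative number (roughly -22 to -23 for typical background frequencies), because the numerator collapses to $\varepsilon$.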

data_y = data_k[data_k.site.str[0]=='Y']

pssm_y = get_prob(data_y,'site_seq')

pssm_LO = get_pssm_LO(pssm_y,'Y')
pssm_LO.head()

Position	-20	-19	-18	-17	-16	-15	-14	-13	-12	-11	-10	-9	-8	-7	-6	-5	-4	-3	-2	-1	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20
aa
P	-22.421662	-0.512361	-22.483757	-0.441300	-22.330630	1.127181	0.556571	-22.472709	-22.395309	-22.472370	-22.494346	-0.498220	0.362731	-22.282520	-22.393568	-0.658704	0.401814	-22.316020	0.711607	-22.159026	0.0	-21.219306	-22.470595	-22.994372	-22.573771	2.500636	0.457348	1.192256	-22.627742	1.636288	-0.497208	-22.614291	-0.295385	-0.334330	-22.595771	0.856785	0.132517	-22.489922	-22.515220	-22.349670	-22.682968
G	2.143585	2.259460	0.275900	0.699035	-22.708532	-0.749012	0.962707	-0.786810	-0.692558	-22.752478	0.175936	0.227255	0.291871	1.651462	0.762874	-0.765860	-1.010588	-22.841111	-0.008750	-0.610321	0.0	1.045771	1.222666	-22.301926	2.115713	0.180116	-0.599896	-22.557503	-0.735659	-22.691406	-22.606718	1.180075	0.444924	-22.716477	1.039133	-0.027003	-0.045322	-0.093640	-22.828378	0.077856	-0.148372
A	-0.821010	-22.787101	-22.593992	-22.501932	-22.687021	2.118529	-0.733631	-0.709477	0.848204	1.176168	-0.930771	-0.820148	0.359434	-0.644471	-22.685439	0.194020	-22.568254	0.155705	-22.719076	-22.414048	0.0	-0.833935	1.577583	-22.495567	-22.470923	-22.548438	-0.825586	1.019478	0.525050	-0.821217	-22.715532	-22.607643	-0.783671	1.415181	1.795060	1.573972	0.886847	0.885076	0.998036	2.420744	1.066110
C	-20.604884	-20.821590	-20.610481	2.820191	1.226090	-20.613911	-20.674559	-20.555965	-20.395309	-20.465246	-20.723212	-20.683076	-20.273082	-20.562865	3.365323	-20.637713	-20.490617	-20.197935	-20.073742	-20.197731	0.0	-20.723779	2.562935	-20.857363	1.456954	-20.380203	-20.784654	-20.540430	1.726683	1.964277	-20.474525	-20.292364	4.183644	-20.827394	-20.497208	-20.411330	-20.843591	-20.777410	-20.951953	-20.735729	-20.464922
S	0.846942	-22.150318	2.316723	-0.128025	-22.210853	-22.220374	2.079367	0.768343	2.167026	-21.971941	-21.907636	-21.850281	-22.021350	0.783742	-21.770936	-21.831141	2.757737	-21.627747	-21.621229	-21.092120	0.0	-21.500400	-21.519269	-21.615861	-22.015041	-22.077133	-21.879565	-21.929472	-21.962609	-21.994476	-21.856395	-22.144041	-22.079396	-22.071062	0.543296	-22.129558	-22.149269	1.464015	-22.361597	-22.234534	1.394090

pssm_y[0][pssm_y[0]==1].index

Index(['y'], dtype='object', name='aa')

pssm_LO[0].sort_values() # log-odds is zero at the center position for a single-site-type log-odds PSSM

aa
P    0.0
G    0.0
    ... 
t    0.0
y    0.0
Name: 0, Length: 23, dtype: float64

source

get_pssm_LO_flat

 get_pssm_LO_flat (flat_pssm, site_type)

	Details
flat_pssm
site_type	S, T, Y, ST, or STY

pssm_LO = get_pssm_LO_flat(flat_pssm,'STY')
pssm_LO

Position	-20	-19	-18	-17	-16	-15	-14	-13	-12	-11	-10	-9	-8	-7	-6	-5	-4	-3	-2	-1	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20
aa
P	0.059908	0.281140	-0.026749	0.355318	0.346468	0.178779	0.463147	0.308758	0.239997	0.179283	0.608034	0.192985	0.329370	0.721201	0.715760	0.588674	0.320993	1.112040	0.973481	1.055403	0.000000	1.539081	1.109428	0.434700	-0.301588	1.379883	0.314032	0.243357	-0.060158	0.644231	0.061865	0.161917	0.308782	0.140834	0.521194	0.373708	0.158627	0.462854	0.138734	0.658643	0.216764
G	0.355425	0.331928	0.015762	-0.090070	0.155057	-0.318908	0.274539	0.486357	-0.081333	-0.044640	-0.023817	0.297183	-0.167014	-0.096944	-0.092276	-0.064319	-0.075365	0.473975	0.243292	0.358158	0.000000	0.235499	-0.025131	0.284594	-0.457652	-0.327199	-0.123033	0.303309	0.315783	-0.190655	-0.067465	0.298568	0.361469	0.170286	0.215167	0.258789	-0.054255	-0.185829	0.200256	0.388134	-0.066157
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
t	0.338037	-0.508472	0.536544	0.409825	0.546294	0.066828	0.190809	-0.298146	0.023876	0.295761	0.995227	1.015112	0.246666	-0.332277	0.195250	0.680780	1.210743	0.166721	0.546415	-0.157283	0.329833	-0.420698	-0.286888	0.507349	1.536109	0.349480	0.508819	1.032751	0.944610	0.697806	1.386345	0.723589	0.986439	-0.036165	0.368048	-0.516571	-0.021008	0.119857	0.161114	-0.047991	-0.097686
y	-1.109821	0.265760	-0.184739	0.075387	0.307806	0.041911	-0.180464	-1.440588	-0.516052	0.508038	-0.011805	0.496492	-0.749242	0.261563	0.139186	0.233321	-1.138437	-1.195610	-1.747661	-1.166771	-2.655631	-0.065791	-1.535247	-0.045814	0.191191	-1.732180	-1.613038	-0.465142	-0.886647	-0.336320	-0.712011	0.069406	-0.161659	-0.829713	0.474154	-0.851437	-0.438485	-0.267581	0.642920	-0.356680	0.230591

23 rows × 41 columns

PSSMs of clusters

source

get_cluster_pssms

 get_cluster_pssms (df, cluster_col, seq_col='site_seq',
                    id_col='sub_site', count_thr=10, valid_thr=None,
                    IC_thr=None, plot=False)

Extract motifs from clusters in a dataframe

	Type	Default	Details
df
cluster_col
seq_col	str	site_seq
id_col	str	sub_site
count_thr	int	10	if less than the count threshold, not include in the return
valid_thr	NoneType	None	percentage of not-nan values in pssm
IC_thr	NoneType	None
plot	bool	False

get_cluster_pssms(data,'kinase_group')

	-20P	-20G	-20A	-20C	-20S	-20T	-20V	-20I	-20L	-20M	-20F	-20Y	-20W	-20H	-20K	-20R	-20Q	-20N	-20D	-20E	-20s	-20t	-20y	-19P	-19G	-19A	-19C	-19S	-19T	-19V	-19I	-19L	-19M	-19F	-19Y	-19W	-19H	-19K	-19R	-19Q	-19N	-19D	-19E	-19s	-19t	-19y	-18P	-18G	-18A	-18C	...	18E	18s	18t	18y	19P	19G	19A	19C	19S	19T	19V	19I	19L	19M	19F	19Y	19W	19H	19K	19R	19Q	19N	19D	19E	19s	19t	19y	20P	20G	20A	20C	20S	20T	20V	20I	20L	20M	20F	20Y	20W	20H	20K	20R	20Q	20N	20D	20E	20s	20t	20y
TK	0.056754	0.069067	0.070474	0.016886	0.045028	0.039400	0.058630	0.049836	0.083255	0.024038	0.032716	0.023335	0.007974	0.021107	0.069184	0.056168	0.045966	0.036585	0.058865	0.081027	0.024508	0.011843	0.017355	0.057915	0.067158	0.073125	0.016731	0.045396	0.037440	0.053001	0.049959	0.087633	0.023751	0.031356	0.018018	0.009594	0.022230	0.074529	0.063999	0.043173	0.041652	0.060840	0.073359	0.025389	0.010881	0.012870	0.058123	0.067810	0.062792	0.014006	...	0.080097	0.027670	0.012621	0.012743	0.058601	0.065424	0.065302	0.014864	0.045078	0.038986	0.056408	0.048124	0.087354	0.018884	0.042276	0.017178	0.008894	0.018519	0.073830	0.061769	0.047149	0.040327	0.059698	0.078216	0.028509	0.010599	0.014011	0.066015	0.072738	0.065037	0.013692	0.048900	0.037897	0.054523	0.050122	0.077262	0.023227	0.039364	0.018337	0.008435	0.019071	0.070905	0.059413	0.040465	0.038509	0.056357	0.083496	0.025917	0.012836	0.017482
CMGC	0.080589	0.070340	0.083792	0.013709	0.050865	0.035874	0.053812	0.035874	0.074824	0.022037	0.029468	0.015247	0.007559	0.020884	0.069058	0.057143	0.044331	0.032031	0.053299	0.078668	0.042921	0.020115	0.007559	0.079718	0.075496	0.072937	0.012028	0.058733	0.035061	0.053871	0.032885	0.074472	0.021753	0.026104	0.013436	0.007806	0.021241	0.064491	0.063596	0.046449	0.037492	0.055534	0.080614	0.042482	0.018298	0.005502	0.074292	0.068930	0.074419	0.013786	...	0.084768	0.052185	0.016556	0.005960	0.074212	0.073813	0.074212	0.011970	0.053731	0.034047	0.050805	0.031520	0.074345	0.017423	0.028195	0.015029	0.010374	0.020215	0.070887	0.063173	0.045884	0.033914	0.056124	0.086980	0.048278	0.018619	0.006251	0.082465	0.066961	0.074980	0.012563	0.050789	0.034215	0.050521	0.038626	0.079925	0.019246	0.028869	0.016840	0.006816	0.023924	0.072307	0.060011	0.046111	0.035285	0.057070	0.074044	0.043972	0.018043	0.006415
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
CK1	0.057860	0.076419	0.076965	0.014192	0.039847	0.037118	0.063865	0.052402	0.070961	0.017467	0.027293	0.012555	0.014192	0.024017	0.079694	0.050764	0.033297	0.035480	0.058406	0.086790	0.038210	0.015830	0.016376	0.061069	0.070883	0.069793	0.007634	0.033806	0.032715	0.053435	0.043075	0.087786	0.023991	0.022901	0.016903	0.010905	0.020720	0.076881	0.059978	0.056161	0.037077	0.061069	0.087786	0.041439	0.017448	0.006543	0.050000	0.067935	0.073913	0.010870	...	0.084629	0.041451	0.015544	0.012090	0.053148	0.068746	0.069324	0.012709	0.036973	0.021953	0.047371	0.042172	0.082034	0.017909	0.030040	0.017909	0.020220	0.019064	0.095321	0.047371	0.041594	0.034084	0.065280	0.098209	0.041594	0.021953	0.015020	0.051865	0.074592	0.088578	0.015152	0.039627	0.034382	0.055361	0.039627	0.074592	0.020979	0.034382	0.020396	0.008741	0.026224	0.079837	0.046620	0.038462	0.043124	0.067599	0.087995	0.027972	0.012238	0.011655
Atypical	0.071895	0.065359	0.063492	0.014939	0.053221	0.045752	0.058824	0.050420	0.085901	0.020542	0.017740	0.020542	0.005602	0.034547	0.049486	0.057890	0.040149	0.034547	0.059757	0.078431	0.041083	0.025210	0.004669	0.042870	0.052190	0.074557	0.013979	0.063374	0.036347	0.054986	0.031687	0.089469	0.027959	0.036347	0.015843	0.010252	0.026095	0.074557	0.056850	0.054054	0.034483	0.063374	0.068966	0.046598	0.018639	0.006524	0.051068	0.063138	0.072423	0.012071	...	0.072175	0.047483	0.013295	0.015195	0.059829	0.081671	0.066477	0.017094	0.047483	0.044634	0.050332	0.034188	0.063628	0.010446	0.037037	0.016144	0.006648	0.027540	0.047483	0.055081	0.057930	0.043685	0.059829	0.091168	0.039886	0.027540	0.014245	0.077290	0.060115	0.062977	0.015267	0.055344	0.029580	0.065840	0.026718	0.092557	0.017176	0.039122	0.023855	0.009542	0.016221	0.046756	0.054389	0.050573	0.055344	0.062977	0.068702	0.040076	0.020992	0.008588

10 rows × 943 columns

Entropy

source

get_entropy

 get_entropy (pssm_df, return_min=False, exclude_zero=False,
              clean_zero=True)

Calculate entropy per position of a PSSM surrounding 0. The less entropy the more information it contains.

	Type	Default	Details
pssm_df			a dataframe of pssm with index as aa and column as position
return_min	bool	False	return min entropy as a single value or return all entropy as a pd.series
exclude_zero	bool	False	exclude the column of 0 (center position) in the entropy calculation
clean_zero	bool	True	if true, zero out non-last three values in position 0 (keep only s,t,y values at center)

Let $P_i(x)$ be the probability of amino acid $x$ at position $i$ in the PSSM, with $i \in \{-k, \dots, -1, 0, +1, \dots, +k\}$. The entropy at each position $i$ is defined as:

\[ H_i = - \sum_{x} P_i(x) \log_2 \left( P_i(x) + \varepsilon \right) \]

where $\varepsilon = 10^{-8}$ is a small constant added for numerical stability.

If exclude_zero=True, the central position $i = 0$ is omitted from the entropy calculation.

If clean_zero=True, all values at position $i = 0$ are zeroed out except for the last three rows s, t, and y (phosphorylated serine, threonine, and tyrosine), typically the only possible phospho-acceptors in kinase motif analysis.

If return_min=True, the function returns the minimum entropy across all positions:

\[ H_{\text{spec}} = \min_i H_i \]

Otherwise, the function returns the full vector $\{H_i\}$ for each position $i$, reflecting how much information (or uncertainty) is contained at each position in the motif.
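The entropy definition above can be checked on a toy matrix. This is a minimal sketch with hypothetical helper names (column_entropy, entropy_per_position) and a four-letter alphabet; it omits the clean_zero step.

```python
import math

EPS = 1e-8

def column_entropy(col, eps=EPS):
    """Shannon entropy in bits for one position: -sum p * log2(p + eps)."""
    return -sum(p * math.log2(p + eps) for p in col.values())

def entropy_per_position(matrix, exclude_zero=False):
    """Entropy for every position, optionally skipping the center."""
    return {
        pos: column_entropy(col)
        for pos, col in matrix.items()
        if not (exclude_zero and pos == 0)
    }

toy = {
    -1: {"A": 0.25, "P": 0.25, "K": 0.25, "R": 0.25},  # uniform
     0: {"A": 0.0, "P": 0.0, "K": 0.0, "R": 1.0},      # fully conserved
     1: {"A": 0.5, "P": 0.5, "K": 0.0, "R": 0.0},      # two equal options
}
h = entropy_per_position(toy, exclude_zero=True)
print({pos: round(v, 3) for pos, v in h.items()})
print(round(min(h.values()), 3))  # analogous to return_min=True
```

A uniform column over four letters gives about 2 bits, two equally likely letters give about 1 bit, and a fully conserved column gives about 0 bits (it can be marginally negative because of $\varepsilon$).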

# get entropy per position
get_entropy(pssm_df).sort_values()

Position
 0     1.074964
 1     3.346861
         ...   
-9     4.297737
-12    4.302189
Length: 41, dtype: float64

# calculate minimum entropy of surrounding positions
get_entropy(pssm_df,return_min=True,exclude_zero=True)

3.3468606104695913

source

get_entropy_flat

 get_entropy_flat (flat_pssm:pandas.core.series.Series, return_min=False,
                   exclude_zero=False, clean_zero=True)

Calculate entropy per position of a flat PSSM surrounding 0

	Type	Default	Details
flat_pssm	Series
return_min	bool	False	return min entropy as a single value or return all entropy as a pd.series
exclude_zero	bool	False	exclude the column of 0 (center position) in the entropy calculation
clean_zero	bool	True	if true, zero out non-last three values in position 0 (keep only s,t,y values at center)

get_entropy_flat(flat_pssm).sort_values()

Position
 0     1.074964
 1     3.346861
         ...   
-9     4.297737
-12    4.302189
Length: 41, dtype: float64

get_entropy_flat(flat_pssm,return_min=True,exclude_zero=True)

3.3468606104695913

# test equal
(get_entropy_flat(flat_pssm).round(5) == get_entropy(pssm_df).round(5)).value_counts()

True    41
Name: count, dtype: int64

Information Content

source

get_IC

 get_IC (pssm_df, return_min=False, exclude_zero=False, clean_zero=True)

Calculate the information content (bits) from a frequency matrix, using log2(3) for the middle position and log2(len(pssm_df)) for others. The higher the more information it contains.

	Type	Default	Details
pssm_df			a dataframe of pssm with index as aa and column as position
return_min	bool	False	return min entropy as a single value or return all entropy as a pd.series
exclude_zero	bool	False	exclude the column of 0 (center position) in the entropy calculation
clean_zero	bool	True	if true, zero out non-last three values in position 0 (keep only s,t,y values at center)

Let $P_i(x)$ be the frequency (probability) of amino acid $x$ at position $i$ in the PSSM. The standard information content (IC) at position $i$ is defined as:

\[ \mathrm{IC}_i = \max H_i - H_i \]

which is:

\[ \mathrm{IC}_i = \log_2(N) - H_i \]

where $N$ is the number of possible amino acids (i.e., $N = \text{len}(P_i)$).

At the center position ($i = 0$), only three amino acids (S, T, Y) are relevant, so the maximum entropy at each position is defined as:

\[ \max H_i = \begin{cases} \log_2(3) & \text{if } i = 0 \\ \log_2(N) & \text{otherwise} \end{cases} \]
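As a quick consistency check of this definition, the entropy of 4.302189 and the IC of 0.221373 reported on this page for position -12 add up to $\log_2(23) \approx 4.5236$, the maximum entropy for 23 rows. The sketch below evaluates the same formula on toy columns; the function name information_content and its is_center flag are hypothetical.

```python
import math

def information_content(col, is_center=False, eps=1e-8):
    """IC in bits: log2(3) at the center, log2(number of rows) elsewhere."""
    n_max = 3 if is_center else len(col)
    entropy = -sum(p * math.log2(p + eps) for p in col.values())
    return math.log2(n_max) - entropy

uniform = {"A": 0.25, "P": 0.25, "K": 0.25, "R": 0.25}
two_way = {"A": 0.5, "P": 0.5, "K": 0.0, "R": 0.0}
center = {"A": 0.0, "s": 0.5, "t": 0.5, "y": 0.0}

print(round(information_content(uniform), 3))                 # no information
print(round(information_content(two_way), 3))                 # one bit gained
print(round(information_content(center, is_center=True), 3))  # log2(3) - 1
```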

# the higher the more conserved
get_IC(pssm_df,exclude_zero=True).sort_values()

Position
-12    0.221373
-9     0.225825
         ...   
 4     0.717760
 1     1.176701
Length: 40, dtype: float64

Check all zero cases:

pssm_df2=pssm_df.copy()

pssm_df2[-20]=0

get_entropy(pssm_df2,exclude_zero=True).sort_values()

Position
-20    0.000000
 1     3.346861
         ...   
-9     4.297737
-12    4.302189
Length: 40, dtype: float64

source

get_IC_flat

 get_IC_flat (flat_pssm:pandas.core.series.Series, return_min=False,
              exclude_zero=False, clean_zero=True)

Calculate the information content (bits) from a flattened pssm pd.Series, using log2(3) for the middle position and log2(len(pssm_df)) for others.

	Type	Default	Details
flat_pssm	Series
return_min	bool	False	return min entropy as a single value or return all entropy as a pd.series
exclude_zero	bool	False	exclude the column of 0 (center position) in the entropy calculation
clean_zero	bool	True	if true, zero out non-last three values in position 0 (keep only s,t,y values at center)

get_IC_flat(flat_pssm,exclude_zero=True).sort_values()

Position
-12    0.221373
-9     0.225825
         ...   
 4     0.717760
 1     1.176701
Length: 40, dtype: float64

(get_IC_flat(flat_pssm).round(5) == get_IC(pssm_df).round(5)).value_counts()

True    41
Name: count, dtype: int64

Overall specificity

source

get_specificity

 get_specificity (pssm_df)

Get specificity score of a pssm, excluding zero position.

We evaluated the overall specificity of a PSSM by combining two metrics: the maximum IC across surrounding positions and the variance of IC values:

\[ \text{Specificity Score} = 2 \times \max(\text{IC}) + \mathrm{Var}(\text{IC}) \]
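The score can be reproduced by hand from a list of IC values. The sketch below assumes the sample variance (divisor n - 1), which is what pandas uses by default; whether get_specificity uses the sample or the population variance is not stated here, so treat that choice, the function name specificity_score, and the IC values as assumptions.

```python
import statistics

def specificity_score(ic_values):
    """2 * max(IC) + Var(IC) over the flanking positions (center excluded)."""
    # Assumption: sample variance (n - 1 in the denominator)
    return 2 * max(ic_values) + statistics.variance(ic_values)

ic = [0.2, 0.3, 1.2, 0.7]  # hypothetical IC values for four flanking positions
score = specificity_score(ic)
print(round(score, 4))
```

The first term rewards a single highly conserved position, and the variance term adds a smaller bonus when conservation is concentrated in a few positions rather than spread evenly.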

get_specificity(pssm_df)

2.381609408364424

source

get_specificity_flat

 get_specificity_flat (flat_pssm)

Get specificity score of a pssm, excluding zero position.

get_specificity_flat(flat_pssm)

2.381609408364424

Plot

Heatmap

source

plot_heatmap_simple

 plot_heatmap_simple (matrix, title:str='heatmap', figsize:tuple=(6, 7),
                      cmap:str='binary', vmin=None, vmax=None,
                      center=None, robust=False, annot=None, fmt='.2g',
                      annot_kws=None, linewidths=0, linecolor='white',
                      cbar=True, cbar_kws=None, cbar_ax=None,
                      square=False, xticklabels='auto',
                      yticklabels='auto', mask=None, ax=None)

Plot heatmap based on a matrix of values

	Type	Default	Details
matrix			a matrix of values
title	str	heatmap	title of the heatmap
figsize	tuple	(6, 7)	(width, height)
cmap	str	binary	color map, default is black and white
vmin	NoneType	None
vmax	NoneType	None
center	NoneType	None	The value at which to center the colormap when plotting divergent data. Using this parameter will change the default `cmap` if none is specified.
robust	bool	False	If True and `vmin` or `vmax` are absent, the colormap range is computed with robust quantiles instead of the extreme values.
annot	NoneType	None	If True, write the data value in each cell. If an array-like with the same shape as `data`, then use this to annotate the heatmap instead of the data. Note that DataFrames will match on position, not index.
fmt	str	.2g	String formatting code to use when adding annotations.
annot_kws	NoneType	None	Keyword arguments for :meth:`matplotlib.axes.Axes.text` when `annot` is True.
linewidths	int	0	Width of the lines that will divide each cell.
linecolor	str	white	Color of the lines that will divide each cell.
cbar	bool	True	Whether to draw a colorbar.
cbar_kws	NoneType	None	Keyword arguments for :meth:`matplotlib.figure.Figure.colorbar`.
cbar_ax	NoneType	None	Axes in which to draw the colorbar, otherwise take space from the main Axes.
square	bool	False	If True, set the Axes aspect to “equal” so each cell will be square-shaped.
xticklabels	str	auto
yticklabels	str	auto
mask	NoneType	None	If passed, data will not be shown in cells where `mask` is True. Cells with missing values are automatically masked.
ax	NoneType	None	Axes in which to draw the plot, otherwise use the currently-active Axes.
Returns	matplotlib Axes		Axes object with the heatmap.

# plot_heatmap_simple(pssm_df,'kinase',figsize=(10,7))

source

plot_heatmap

 plot_heatmap (heatmap_df, ax=None, position_label=True, figsize=(5, 6),
               include_zero=True, scale_pos_neg=False,
               colorbar_title='Prob.')

Plot a heatmap of pssm.

This function visualizes a PSSM or log-odds matrix as a heatmap with diverging color scales centered at 0.

Color scale behavior:

By default (scale_pos_neg=False), the colormap is centered at 0, but the full data range determines the color intensity:

\[ \text{color range} = [\min(\text{data}), \max(\text{data})], \quad \text{with center at } 0 \]

This is useful when you want to emphasize whether values are above or below zero, but without enforcing symmetry.

If scale_pos_neg=True, the function uses a balanced diverging scale via TwoSlopeNorm, such that:

\[ \text{min color} = \min(\text{data}), \quad \text{center} = 0, \quad \text{max color} = \max(\text{data}) \]

The positive and negative ranges are scaled separately, ensuring that both ends of the heatmap have equal visual weight — especially helpful for symmetric data like log-odds matrices.

Additional visual features:

The center position ($i = 0$) can be masked out using include_zero=False.

plot_heatmap(pssm_df-0.3,scale_pos_neg=False,figsize=(20, 6));

plot_heatmap(pssm_df-0.3,scale_pos_neg=True,figsize=(20, 6));

plt.close('all')

source

plot_two_heatmaps

 plot_two_heatmaps (pssm1, pssm2, kinase_name='Kinase', title1='CDDM',
                    title2='PSPA', figsize=(4, 4.5), cbar=True,
                    scale_01=False, **kwargs)

Plot two side-by-side heatmaps with black rectangle borders, titles on top, shared kinase label below, and only left plot showing y-axis labels.

pssm1 = recover_pssm(pspa.loc['AKT1'].dropna())
pssm2 = recover_pssm(pspa.loc['AKT2'].dropna())

plot_two_heatmaps(pssm1,pssm2,'AKT','AKT1','AKT2')

Logo motif

source

plot_logo_raw

 plot_logo_raw (pssm_df, ax=None, title='Motif', ytitle='Bits',
                figsize=(10, 2))

Plot logo motif using Logomaker.

plot_logo_raw(pssm_df)

Notice that the center residues are shown in lowercase, so we need to rename them.

source

change_center_name

 change_center_name (df)

Transfer the middle s,t,y to S,T,Y for plot if s,t,y have values; otherwise keep the original.

Now instead of s,t,y, the center name becomes S, T and Y:

change_center_name(pssm_df)[0]

aa
P    0.0
G    0.0
    ... 
t    0.0
y    0.0
Name: 0, Length: 23, dtype: float64

source

get_pos_min_max

 get_pos_min_max (pssm_df)

Get min and max value of sum of positive and negative values across each position.

source

scale_zero_position

 scale_zero_position (pssm_df)

Scale position 0 so that:

Positive values match the max positive column sum of other positions
Negative values match the min (most negative) column sum of other positions

This function rescales position 0 in a log-odds PSSM so that its total positive and negative stack heights match those of the most extreme positions on either side.

This ensures the central position visually matches the dynamic range of surrounding positions in log-odds logo plots.
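One way to read this rescaling is sketched below on a nested dictionary. It is an interpretation of the description, not the library code: the helper names are hypothetical, and zero-valued entries are left unchanged.

```python
def pos_sum(col):
    """Sum of the positive entries of one position."""
    return sum(v for v in col.values() if v > 0)

def neg_sum(col):
    """Sum of the negative entries of one position."""
    return sum(v for v in col.values() if v < 0)

def scale_zero_position_sketch(matrix):
    """matrix: {pos: {aa: log-odds}}; returns a copy with position 0 rescaled."""
    others = [col for pos, col in matrix.items() if pos != 0]
    max_pos = max(pos_sum(c) for c in others)  # tallest positive stack
    min_neg = min(neg_sum(c) for c in others)  # deepest negative stack
    p0, n0 = pos_sum(matrix[0]), neg_sum(matrix[0])
    scaled = {}
    for aa, v in matrix[0].items():
        if v > 0 and p0 > 0:
            scaled[aa] = v * max_pos / p0
        elif v < 0 and n0 < 0:
            scaled[aa] = v * min_neg / n0
        else:
            scaled[aa] = v
    out = {pos: dict(col) for pos, col in matrix.items()}
    out[0] = scaled
    return out

toy = {
    -1: {"P": 2.0, "G": -1.0, "s": 0.5},
     0: {"P": -4.0, "G": -4.0, "s": 1.0},
     1: {"P": 1.0, "G": -3.0, "s": 0.0},
}
out = scale_zero_position_sketch(toy)
print(out[0])  # positive stack now sums to 2.5, negative stack to -3.0
```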

source

scale_pos_neg_values

 scale_pos_neg_values (pssm_df)

Globally scale all positive values by max positive column sum, and negative values by min negative column sum (preserving sign).
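A possible reading of this global scaling, written as a sketch: positive values are divided by the largest positive column sum, and negative values by the magnitude of the most negative column sum, so the tallest stack in each direction has height 1. The function name scale_pos_neg_sketch and the toy values are assumptions for illustration.

```python
def scale_pos_neg_sketch(matrix):
    """matrix: {pos: {aa: value}}; returns a globally rescaled copy."""
    max_pos = max(sum(v for v in col.values() if v > 0) for col in matrix.values())
    min_neg = min(sum(v for v in col.values() if v < 0) for col in matrix.values())

    def scale(v):
        if v > 0 and max_pos > 0:
            return v / max_pos
        if v < 0 and min_neg < 0:
            return v / abs(min_neg)  # dividing by a magnitude keeps the sign
        return v

    return {pos: {aa: scale(v) for aa, v in col.items()} for pos, col in matrix.items()}

toy = {
    -1: {"P": 2.0, "G": -1.0, "s": 0.5},
     1: {"P": 1.0, "G": -3.0, "s": -1.0},
}
scaled = scale_pos_neg_sketch(toy)
print(scaled)
```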

source

convert_logo_df

 convert_logo_df (pssm_df, scale_zero=True, scale_pos_neg=False)

Change center name from s,t,y to S, T, Y in a pssm and scale the zero position to the max of its neighbors.

source

get_logo_IC

 get_logo_IC (pssm_df)

For plotting purpose, calculate the scaled information content (bits) from a frequency matrix, using log2(3) for the middle position and log2(len(pssm_df)) for others.

To visualize the motif using Logomaker, the scaled PSSM is computed by weighting each amino acid’s frequency at position $i$ by the position’s information content:

\[ \text{PSSM\_scaled}_i(x) = P_i(x) \cdot \mathrm{IC}_i \]

This results in a matrix where the total stack height at each position equals the information content, and each letter’s height is proportional to its contribution. This is the standard format used by Logomaker to generate sequence logos.
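The weighting above can be verified on a toy matrix: after scaling, the letter heights at each position sum to that position's information content. The sketch assumes the center column has already been cleaned so that only s, t and y carry probability; the function name logo_heights is hypothetical.

```python
import math

def logo_heights(matrix, eps=1e-8):
    """Scale each probability by its position's IC: height = P_i(x) * IC_i."""
    out = {}
    for pos, col in matrix.items():
        n_max = 3 if pos == 0 else len(col)  # log2(3) at the center
        entropy = -sum(p * math.log2(p + eps) for p in col.values())
        ic = math.log2(n_max) - entropy
        out[pos] = {aa: p * ic for aa, p in col.items()}
    return out

toy = {
    0: {"A": 0.0, "P": 0.0, "s": 0.5, "t": 0.5},
    1: {"A": 0.5, "P": 0.5, "s": 0.0, "t": 0.0},
}
heights = logo_heights(toy)
print({pos: round(sum(col.values()), 3) for pos, col in heights.items()})
```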

get_logo_IC(pssm_df)

Position	-20	-19	-18	-17	-16	-15	-14	-13	-12	-11	-10	-9	-8	-7	-6	-5	-4	-3	-2	-1	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20
aa
P	0.019652	0.026918	0.015665	0.023951	0.020766	0.018778	0.025329	0.023845	0.017229	0.019292	0.026572	0.016862	0.022977	0.032447	0.030862	0.022572	0.043840	0.056881	0.058556	0.051823	0.000000	0.494556	0.054836	0.026297	0.035678	0.097961	0.024077	0.020674	0.023837	0.026058	0.020890	0.020991	0.020345	0.019011	0.022280	0.027848	0.021000	0.027223	0.017312	0.029500	0.021873
G	0.024669	0.030542	0.017530	0.018489	0.019283	0.014179	0.025758	0.028180	0.014303	0.018504	0.018485	0.019838	0.015846	0.020002	0.018437	0.015279	0.037463	0.043279	0.029841	0.041572	0.000000	0.088679	0.025439	0.023870	0.033579	0.028317	0.017628	0.022265	0.031043	0.014314	0.017906	0.022637	0.021941	0.020251	0.017897	0.024544	0.019213	0.018149	0.018850	0.026169	0.018302
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
t	0.005436	0.003624	0.005595	0.005463	0.004821	0.004215	0.004722	0.003468	0.003251	0.005512	0.008857	0.008266	0.006735	0.005334	0.006814	0.009376	0.035869	0.012984	0.019706	0.010251	0.145398	0.025580	0.011306	0.010519	0.064011	0.017602	0.009889	0.010735	0.014413	0.007707	0.014069	0.006997	0.007979	0.004133	0.004748	0.003304	0.004021	0.004776	0.003847	0.004282	0.004017
y	0.001254	0.003106	0.001865	0.002521	0.002225	0.002683	0.002147	0.000867	0.001300	0.004331	0.002696	0.003306	0.001585	0.004000	0.004008	0.004167	0.003985	0.003091	0.002252	0.003986	0.019190	0.023875	0.002827	0.004855	0.013642	0.002296	0.001290	0.002385	0.002217	0.001835	0.001705	0.003293	0.001995	0.001240	0.002922	0.001416	0.001787	0.001910	0.003078	0.001903	0.003125

23 rows × 41 columns

source

plot_logo

 plot_logo (pssm_df, title='Motif', scale_zero=True, ax=None, figsize=(10,
            1))

Plot logo of information content given a frequency PSSM.

# plot_logo(pssm_df,scale_zero=False,figsize=(10,1))

Keeping scale_zero at its default (True) makes the flanking amino acids easier to see:

plot_logo(pssm_df,figsize=(10,1))

plt.close('all')

Logo motif of log-odds

source

plot_logo_LO

 plot_logo_LO (pssm_LO, title='Motif', acceptor=None, scale_zero=True,
               scale_pos_neg=True, ax=None, figsize=(10, 1))

Plot logo of log-odds given a log-odds PSSM.

To ensure the phosphorylated residue is visible at the center of a log-odds motif (position 0), two mechanisms are used:

- Acceptor override: If the center column is entirely zero (e.g., masked), the user can specify an acceptor ('S', 'T', 'Y', or 'STY'). The function then assigns a small nonzero value (e.g., 0.1) to the corresponding phospho-residue row (pS, pT, pY) at position 0. This ensures the central letter appears in the logo plot, even when real log-odds values are absent.
- Stack height rescaling: To maintain visual consistency with surrounding columns, position 0 is rescaled so that its total positive and negative stack heights match the most extreme values observed elsewhere.

Together, these adjustments ensure that:

- The phospho-acceptor appears explicitly at the center,
- The visual scale remains consistent with neighboring positions,
- The resulting logo can faithfully reflect both biological relevance and statistical signal.
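
The two mechanisms can be sketched on a toy log-odds matrix. This is a sketch under stated assumptions, not the code behind `plot_logo_LO`: the helper `center_adjust_sketch`, the lowercase row labels `s`, `t`, `y`, and the order of the two steps are all illustrative choices.

```python
import numpy as np

def center_adjust_sketch(lo, rows, center, acceptor=None, fill=0.1):
    """Hypothetical helper: acceptor override followed by stack-height rescaling."""
    lo = np.array(lo, dtype=float)  # work on a copy
    # 1) Acceptor override: only when the center column is entirely zero
    if acceptor and not lo[:, center].any():
        for aa in acceptor:  # 'STY' -> 'S', 'T', 'Y'
            lo[rows.index(aa.lower()), center] = fill
    # 2) Stack-height rescaling against the most extreme non-center columns
    others = np.delete(lo, center, axis=1)
    max_pos = np.clip(others, 0, None).sum(axis=0).max()
    min_neg = np.clip(others, None, 0).sum(axis=0).min()
    col = lo[:, center]  # view into lo, so edits below modify lo
    pos_sum = col[col > 0].sum()
    neg_sum = col[col < 0].sum()
    if pos_sum > 0:
        col[col > 0] *= max_pos / pos_sum
    if neg_sum < 0:
        col[col < 0] *= min_neg / neg_sum
    return lo

rows = ['A', 's', 't', 'y']            # assumed row labels
lo = [[ 1.0, 0.0, -0.5],
      [ 0.5, 0.0, -1.5],
      [-0.2, 0.0,  0.3],
      [ 0.0, 0.0,  0.2]]               # center column (index 1) is all zero
adjusted = center_adjust_sketch(lo, rows, center=1, acceptor='Y')
print(adjusted[:, 1])
```

In this toy case the `y` row at the center is first set to 0.1 and then stretched to 1.5, the tallest positive stack among the other positions, while the non-center columns are left untouched.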

pssm_LO = get_pssm_LO(pssm_df,'STY')
# plot_logo_LO(pssm_LO,scale_zero=False,scale_pos_neg=False)

## with zero position scaled to the max
# plot_logo_LO(pssm_LO,scale_zero=True,scale_pos_neg=False)

# scale positive and negative values for better visualization
plot_logo_LO(pssm_LO,scale_zero=True,scale_pos_neg=True)

# for a specific site type (S, T or Y), show the acceptor in the middle instead of leaving it empty
pssm_LO = get_pssm_LO(pssm_y,'Y')
plot_logo_LO(pssm_LO,acceptor='Y')

plt.close('all')

Multiple logos

As multiple figures:

source

plot_logos_idx

 plot_logos_idx (pssms_df, *idxs, figsize=(14, 1))

Plot logos of a dataframe with flattened PSSMs with index as IDs.

pssms=Data.get_cddm()

plot_logos_idx(pssms,'AKT1','AKT2')

In one figure:

source

plot_logos

 plot_logos (pssms_df, count_dict=None, path=None, prefix='Motif',
             figsize=(14, 1))

Plot all logos from a dataframe of flattened PSSMs as subplots in a single figure.

	Type	Default	Details
pssms_df
count_dict	NoneType	None	used to display n in motif title
path	NoneType	None
prefix	str	Motif
figsize	tuple	(14, 1)

plot_logos(pssms.head(2),prefix=None)

plt.close('all')

Logo motif + Heatmap

source

plot_logo_heatmap

 plot_logo_heatmap (pssm_df, title='Motif', figsize=(17, 10),
                    include_zero=False)

Plot logo and heatmap vertically

	Type	Default	Details
pssm_df			column is position, index is aa
title	str	Motif
figsize	tuple	(17, 10)
include_zero	bool	False

plot_logo_heatmap(pssm_df,'Kinase',(17,10))

source

plot_logo_heatmap_LO

 plot_logo_heatmap_LO (pssm_LO, title='Motif', acceptor=None, figsize=(17,
                       10), include_zero=False, scale_pos_neg=True)

Plot logo and heatmap of enrichment bits vertically

	Type	Default	Details
pssm_LO			pssm of log-odds
title	str	Motif
acceptor	NoneType	None
figsize	tuple	(17, 10)
include_zero	bool	False
scale_pos_neg	bool	True

# plot_logo_heatmap_LO(pssm_LO,acceptor='Y')

pssm_LO = get_pssm_LO(pssm_df,'STY')
plot_logo_heatmap_LO(pssm_LO,scale_pos_neg=False) # normal color scale

plt.close('all')

PSPA

Plot

source

preprocess_pspa

 preprocess_pspa (pssm)

row = pspa.loc['GSK3B']
pssm = recover_pssm(row.dropna())
pssm = preprocess_pspa(pssm)
pssm

Position	-5	-4	-3	-2	-1	0	1	2	3	4
aa
P	0.128793	0.103768	0.327105	0.377614	0.200697	0.0	0.857330	-0.156606	-0.022523	0.020985
G	0.267518	0.232140	0.709128	0.215152	0.186051	0.0	-0.070893	-0.132404	-0.032647	-0.138561
...	...	...	...	...	...	...	...	...	...	...
pS/pT	-0.112011	0.206940	0.355120	0.106174	-0.114481	0.0	0.055519	-0.004874	1.836501	6.163857
pY	-0.106727	0.196308	0.158811	0.259140	0.699151	0.0	0.222213	0.047855	1.191558	1.496894

22 rows × 10 columns

source

plot_logo_pspa

 plot_logo_pspa (row, title='Motif', figsize=(5, 2))

plot_logo_pspa(pspa.loc['GSK3B'],title='GSK3B')

source

plot_logo_heatmap_pspa

 plot_logo_heatmap_pspa (row, title='Motif', figsize=(6, 10),
                         include_zero=False)

Plot logo and heatmap vertically

	Type	Default	Details
row			row of Data.get_pspa()
title	str	Motif
figsize	tuple	(6, 10)
include_zero	bool	False

plot_logo_heatmap_pspa(pspa.loc['GSK3B'],title='GSK3B')

Calculations

source

raw2norm

 raw2norm (df:pandas.core.frame.DataFrame, PDHK:bool=False)

Normalize single ST kinase data

	Type	Default	Details
df	DataFrame		single kinase’s df has position as index, and single amino acid as columns
PDHK	bool	False	whether this kinase belongs to PDHK family

This function implements the normalization method from Johnson et al., Nature: An atlas of substrate specificities for the human serine/threonine kinome.

Specifically:

> - matrices were column-normalized at all positions by the sum of the 17 randomized amino acids (excluding serine, threonine and cysteine), to yield PSSMs.
> - PDHK1 and PDHK4 were normalized to the 16 randomized amino acids (excluding serine, threonine, cysteine and additionally tyrosine)
> - The cysteine row was scaled by its median to be 1/17 (1/16 for PDHK1 and PDHK4).
> - The serine and threonine values in each position were set to be the median of that position.
> - The S0/T0 ratio was determined by summing the values of S and T rows in the matrix (SS and ST, respectively), accounting for the different S vs. T composition of the central (1:1) and peripheral (only S or only T) positions (Sctrl and Tctrl, respectively), and then normalizing to the higher value among the two (S0 and T0, respectively, Supplementary Note 1)
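
The first four quoted steps can be sketched as below. This is a simplified illustration, not the body of `raw2norm`: the helper name `raw2norm_sketch` is hypothetical, the input is assumed to be a positions x 20 array with columns in the order of `AA` (the phospho-residue columns are left out), the per-position median is assumed to be taken over the randomized amino acids, and the S0/T0 step is omitted.

```python
import numpy as np

AA = list("ACDEFGHIKLMNPQRSTVWY")  # assumed column order

def raw2norm_sketch(raw, pdhk=False):
    """Hypothetical helper covering the first four normalization steps."""
    raw = np.asarray(raw, dtype=float)  # shape: (positions, 20)
    # Randomized amino acids: all except S, T, C (and Y for PDHK1/PDHK4)
    excluded = set("STC") | ({"Y"} if pdhk else set())
    rand_idx = [i for i, a in enumerate(AA) if a not in excluded]
    n_rand = len(rand_idx)  # 17, or 16 for PDHK
    # Normalize each position by the sum over the randomized amino acids
    norm = raw / raw[:, rand_idx].sum(axis=1, keepdims=True)
    # Scale cysteine so that its median across positions is 1/n_rand
    c = AA.index("C")
    norm[:, c] *= (1.0 / n_rand) / np.median(norm[:, c])
    # Set S and T at each position to that position's median (assumed: over randomized aa)
    pos_median = np.median(norm[:, rand_idx], axis=1)
    norm[:, AA.index("S")] = pos_median
    norm[:, AA.index("T")] = pos_median
    return norm

rng = np.random.default_rng(0)
raw = rng.uniform(1.0, 2.0, size=(10, 20))  # synthetic stand-in for raw signal
norm = raw2norm_sketch(raw)
print(norm.shape)
```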

This function is usually called through the function below, get_one_kinase, which exposes it via the boolean normalize argument.

source

get_one_kinase

 get_one_kinase (df:pandas.core.frame.DataFrame, kinase:str,
                 normalize:bool=False, drop_s:bool=True)

Obtain a specific kinase data from stacked dataframe

	Type	Default	Details
df	DataFrame		stacked dataframe (paper’s raw data)
kinase	str		a specific kinase
normalize	bool	False	normalize according to the paper; special for PDHK1/4
drop_s	bool	True	drop s, as s is a duplicate of t in PSPA

Retrieve a single kinase's data from PSPA data that is formatted with kinase as index and position + amino acid as columns.

data = Data.get_pspa_st()

get_one_kinase(data,'PDHK1')

aa	A	C	D	E	F	G	H	I	K	L	M	N	P	Q	R	S	T	V	W	Y	t	y
position
-5	0.0594	0.0625	0.0589	0.0550	0.0775	0.0697	0.0687	0.0590	0.0515	0.0657	0.0687	0.0613	0.0451	0.0424	0.0594	0.0594	0.0594	0.0573	0.1001	0.0775	0.0583	0.0658
-4	0.0618	0.0621	0.0550	0.0511	0.0739	0.0715	0.0598	0.0601	0.0520	0.0614	0.0744	0.0549	0.0637	0.0552	0.0617	0.0608	0.0608	0.0519	0.0916	0.0739	0.0528	0.0752
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3	0.0486	0.0609	0.0938	0.0684	0.1024	0.0676	0.0544	0.0583	0.0388	0.0552	0.0637	0.0505	0.0686	0.0502	0.0561	0.0588	0.0588	0.0593	0.0641	0.1024	0.0539	0.0431
4	0.0565	0.0749	0.0631	0.0535	0.0732	0.0655	0.0664	0.0625	0.0496	0.0552	0.0627	0.0640	0.0677	0.0553	0.0604	0.0626	0.0626	0.0579	0.0864	0.0732	0.0548	0.0575

10 rows × 22 columns

Plot PSPA logo motif (old)

source

get_logo

 get_logo (df:pandas.core.frame.DataFrame, kinase:str)

Given stacked df (index as kinase, columns as substrates), get a specific kinase’s logo

	Type	Details
df	DataFrame	stacked Dataframe with kinase as index, substrates as columns
kinase	str	a specific kinase name in index

This function replicates the motif logos from Johnson et al., Nature: An atlas of substrate specificities for the human serine/threonine kinome. Given raw PSPA data, it outputs a motif logo.

# load raw PSPA data
# df = pd.read_csv('https://github.com/sky1ove/katlas_raw/raw/refs/heads/main/nbs/raw/pspa_st_raw.csv').set_index('kinase')
# df.head()
# get_logo(df, 'AAK1')

Compare PSSM

pssms = Data.get_pspa_scale()

# one example
pssm_df = recover_pssm(pssms.iloc[1])
pssm_df2 = recover_pssm(pssms.iloc[0])

KL divergence

source

kl_divergence

 kl_divergence (p1, p2)

KL divergence D_KL(p1 || p2) over positions.

p1 and p2 are arrays (df or np) with index as aa and column as position. Returns the divergence per position; average the result (e.g., with .mean()) for a single score.

	Details
p1	target pssm p (array-like, shape: (AA, positions))
p2	pred pssm q (array-like, same shape as p1)

The Kullback–Leibler (KL) divergence between two probability distributions $P$ and $Q$ is defined as:

\[ \mathrm{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x)}{Q(x)} \right) \]

This measures the information lost when $Q$ is used to approximate $P$. It is not symmetric, i.e., in general

\[ \mathrm{KL}(P \| Q) \ne \mathrm{KL}(Q \| P) \]

and it is non-negative, meaning:

\[ \mathrm{KL}(P \| Q) \ge 0 \]

with equality if and only if $P = Q$ almost everywhere.

In practical computation, to avoid numerical instability when $P(x) = 0$ or $Q(x) = 0$, we often add a small constant $\varepsilon$:

\[ \mathrm{KL}_\varepsilon(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x) + \varepsilon}{Q(x) + \varepsilon} \right) \]
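
The smoothed formula can be applied column by column, as in this small sketch. The helper name `kl_per_position` and the default `eps=1e-10` are illustrative choices, not values taken from `kl_divergence`.

```python
import numpy as np

def kl_per_position(p, q, eps=1e-10):
    """Hypothetical helper: smoothed KL(P || Q) for each column (position)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Sum over amino acids (rows) separately for every position (column)
    return (p * np.log((p + eps) / (q + eps))).sum(axis=0)

# Two toy PSSMs: 2 "amino acids" x 2 positions, columns sum to 1
p = np.array([[0.5, 1.0],
              [0.5, 0.0]])
q = np.array([[0.5, 0.5],
              [0.5, 0.5]])

kl_pq = kl_per_position(p, q)
kl_qp = kl_per_position(q, p)
print(kl_pq)  # first position is identical, second is not
print(kl_qp)  # differs from kl_pq: KL is not symmetric
```

The second position also shows why smoothing matters: without $\varepsilon$, the zero entry in `p` would make $\mathrm{KL}(Q \| P)$ infinite.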

kl_divergence(pssm_df,pssm_df2)

array([0.29182172, 0.11138481, 0.24590698, 0.46021635, 0.36874823,
       0.53858511, 1.51571614, 0.02905442, 0.08530757, 0.07753394])

kl_divergence(pssm_df,pssm_df2).mean(),kl_divergence(pssm_df,pssm_df2).max()

(np.float64(0.37242752573216287), np.float64(1.5157161422110503))

source

kl_divergence_flat

 kl_divergence_flat (p1_flat, p2_flat)

p1 and p2 are two flattened pd.Series with index as aa and column as position

	Details
p1_flat	pd.Series of target flattened pssm p
p2_flat	pd.Series of pred flattened pssm q

kl_divergence_flat(pssms.iloc[1],pssms.iloc[0])

CPU times: user 1.38 ms, sys: 30 μs, total: 1.41 ms
Wall time: 1.39 ms

0.37242752573216287

JS divergence

source

js_divergence

 js_divergence (p1, p2, index=True)

p1 and p2 are two arrays (df or np) with index as aa and column as position

	Type	Default	Details
p1			pssm
p2			pssm
index	bool	True

The Jensen-Shannon divergence between two probability distributions $P$ and $Q$ is defined as:

\[ \mathrm{JS}(P \| Q) = \frac{1}{2} \, \mathrm{KL}(P \| M) + \frac{1}{2} \, \mathrm{KL}(Q \| M) \]

where $M = \frac{1}{2}(P + Q)$ is the average (mixture) distribution, and $\mathrm{KL}$ denotes the Kullback–Leibler divergence:

\[ \mathrm{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x)}{Q(x)} \right) \]

Therefore, with a small constant $\varepsilon$ added for numerical stability,

\[ \mathrm{JS}_\varepsilon(P \| Q) = \frac{1}{2} \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x) + \varepsilon}{M(x) + \varepsilon} \right) + \frac{1}{2} \sum_{x \in \mathcal{X}} Q(x) \log \left( \frac{Q(x) + \varepsilon}{M(x) + \varepsilon} \right) \]
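
A compact sketch of this formula, computed per position. The helper name `js_per_position` and the default `eps=1e-10` are assumptions made for illustration and are not taken from `js_divergence`.

```python
import numpy as np

def js_per_position(p, q, eps=1e-10):
    """Hypothetical helper: smoothed Jensen-Shannon divergence per column."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture distribution M
    kl_pm = (p * np.log((p + eps) / (m + eps))).sum(axis=0)
    kl_qm = (q * np.log((q + eps) / (m + eps))).sum(axis=0)
    return 0.5 * kl_pm + 0.5 * kl_qm

# Toy PSSMs: identical, disjoint, and partially overlapping positions
p = np.array([[0.5, 1.0, 0.7],
              [0.5, 0.0, 0.3]])
q = np.array([[0.5, 0.0, 0.2],
              [0.5, 1.0, 0.8]])

js_pq = js_per_position(p, q)
js_qp = js_per_position(q, p)
print(js_pq)
```

Unlike the KL divergence, the result is symmetric in its arguments and, with the natural logarithm, bounded above by $\log 2$.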

js_divergence(pssm_df,pssm_df2)

Position
-5    0.065539
-4    0.025712
        ...   
 3    0.020949
 4    0.018206
Length: 10, dtype: float64

js_divergence(pssm_df,pssm_df2).max(),js_divergence(pssm_df,pssm_df2).mean()

(np.float64(0.34404931056288773), np.float64(0.08286124552178498))

source

js_divergence_flat

 js_divergence_flat (p1_flat, p2_flat)

p1 and p2 are two flattened pd.Series with index as aa and column as position

	Details
p1_flat	pd.Series of flattened pssm
p2_flat	pd.Series of flattened pssm

js_divergence_flat(pssms.iloc[1],pssms.iloc[0])

CPU times: user 0 ns, sys: 1.72 ms, total: 1.72 ms
Wall time: 1.7 ms

0.08286124552178498

JS similarity

To convert the Jensen–Shannon divergence into a similarity measure, we first normalize it to bits by dividing by $\log 2$, ensuring that the divergence lies within the range [0, 1]:

\[ \mathrm{JS}_{\text{bits}}(P \| Q) = \frac{\mathrm{JS}(P \| Q)}{\log 2} \]

The similarity is then defined as one minus this normalized divergence:

\[ \mathrm{Sim}_{\mathrm{JS}}(P, Q) = 1 - \mathrm{JS}_{\text{bits}}(P \| Q) \]

Thus, $\mathrm{Sim}_{\mathrm{JS}}$ ranges from 0 (completely dissimilar) to 1 (identical distributions).
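
The conversion can be sketched in a few lines. The helper names and the smoothing constant are hypothetical and only mirror the formulas above; they are not the implementation of `js_similarity`.

```python
import numpy as np

def js_per_position(p, q, eps=1e-10):
    """Hypothetical helper: smoothed JS divergence (natural log) per column."""
    m = 0.5 * (p + q)
    kl_pm = (p * np.log((p + eps) / (m + eps))).sum(axis=0)
    kl_qm = (q * np.log((q + eps) / (m + eps))).sum(axis=0)
    return 0.5 * kl_pm + 0.5 * kl_qm

def js_similarity_sketch(p, q):
    """Hypothetical helper: 1 - JS in bits, per position."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    js_bits = js_per_position(p, q) / np.log(2)  # nats -> bits, now in [0, 1]
    return 1.0 - js_bits

p = np.array([[0.5, 1.0],
              [0.5, 0.0]])
q = np.array([[0.5, 0.0],
              [0.5, 1.0]])
sim = js_similarity_sketch(p, q)
print(sim)  # identical position vs. completely disjoint position
```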

source

js_similarity

 js_similarity (pssm1, pssm2)

Convert JSD to bits to be in range (0,1) then 1-JSD.

js_similarity(pssm_df,pssm_df2).mean()

np.float64(0.880456492003838)

source

js_similarity_flat

 js_similarity_flat (p1_flat, p2_flat)

Convert JSD to bits to be in range (0,1) then 1-JSD.

js_similarity_flat(pssms.iloc[1],pssms.iloc[0])

np.float64(0.880456492003838)

Cosine similarity

source

cosine_similarity

 cosine_similarity (pssm1:pandas.core.frame.DataFrame,
                    pssm2:pandas.core.frame.DataFrame)

Compute cosine similarity per position (column) between two PSSMs.

The cosine similarity between two vectors $P$ and $Q$ (e.g., two PSSM columns representing amino acid probability distributions) is defined as:

\[ \mathrm{cos}(P, Q) = \frac{P \cdot Q}{\|P\| \, \|Q\|} \]

where $P \cdot Q = \sum_{i=1}^{n} P_i Q_i$ is the dot product between $P$ and $Q$, and $\|P\| = \sqrt{\sum_{i=1}^{n} P_i^2}$ is the Euclidean norm of $P$.

Since all entries of $P$ and $Q$ are nonnegative probabilities (i.e., $P_i, Q_i \ge 0$), the cosine similarity lies within the range:

\[ 0 \leq \mathrm{cos}(P, Q) \leq 1 \]

A value of 1 means the two columns have identical amino acid preferences up to scale, and a value of 0 means they share no amino acid with non-zero probability.
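
Both the per-position and the overall (flattened) variants can be sketched with NumPy. The helper names below are hypothetical and operate on plain arrays; they illustrate the formula and are not the code of `cosine_similarity` or `cosine_overall_flat`.

```python
import numpy as np

def cosine_per_position(p, q):
    """Hypothetical helper: cosine similarity between matching columns."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    dot = (p * q).sum(axis=0)
    norms = np.linalg.norm(p, axis=0) * np.linalg.norm(q, axis=0)
    return dot / norms

def cosine_overall(p, q):
    """Hypothetical helper: cosine similarity between the flattened matrices."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    return p.dot(q) / (np.linalg.norm(p) * np.linalg.norm(q))

p = np.array([[0.5, 1.0],
              [0.5, 0.0]])
q = np.array([[0.5, 0.0],
              [0.5, 1.0]])

per_pos = cosine_per_position(p, q)
overall = cosine_overall(p, q)
print(per_pos)   # identical column, then orthogonal column
print(overall)
```

The per-position version treats every position equally, while the flattened version lets positions with larger vector norms (more concentrated distributions) contribute more to the single score.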

cosine_similarity(pssm_df,pssm_df2).sort_values()

 1    0.130818
-2    0.606234
        ...   
 4    0.934967
 2    0.971066
Length: 10, dtype: float64

cosine_similarity(pssm_df,pssm_df2).mean()

np.float64(0.754148470457778)

source

cosine_overall_flat

 cosine_overall_flat (pssm1_flat, pssm2_flat)

Compute overall cosine similarity between two PSSMs (flattened).

cosine_overall_flat(pssms.iloc[0],pssms.iloc[0])

np.float64(1.0000000000000004)

cosine_overall_flat(pssms.iloc[0],pssms.iloc[1])

np.float64(0.6614783212500965)
