from katlas.data import *
from katlas.pssm.core import *
pssm.compare
Overview
Functions for comparing Position-Specific Scoring Matrices (PSSMs) using various divergence and similarity metrics.
KL Divergence
kl_divergence computes the Kullback-Leibler divergence \(D_{KL}(P \| Q)\) between two PSSMs, measuring information lost when one distribution approximates another. Returns per-position divergence values (non-negative, 0 means identical). Note: KL divergence is asymmetric.
kl_divergence(
p1=pssm_df, # target PSSM P (DataFrame, shape: AA × positions)
p2=pssm_df2, # predicted PSSM Q (DataFrame, same shape as p1)
)
kl_divergence_flat computes mean KL divergence from flattened PSSM Series (e.g., rows from a PSSM DataFrame).
kl_divergence_flat(
p1_flat=pssms.iloc[1], # target flattened PSSM (pd.Series with 'pos_AA' index)
p2_flat=pssms.iloc[0], # predicted flattened PSSM (pd.Series, same structure)
)
JS Divergence
js_divergence computes the Jensen-Shannon divergence, a symmetric version of KL divergence using the mixture \(M = \frac{1}{2}(P + Q)\). Returns per-position divergence as a Series.
js_divergence(
p1=pssm_df, # first PSSM (DataFrame, shape: AA × positions)
p2=pssm_df2, # second PSSM (DataFrame, same shape)
index=True, # whether to return Series with position index
)
js_divergence_flat computes mean JS divergence from flattened PSSM Series.
js_divergence_flat(
p1_flat=pssms.iloc[1], # first flattened PSSM (pd.Series)
p2_flat=pssms.iloc[0], # second flattened PSSM (pd.Series)
)
JS Similarity
js_similarity converts JS divergence to a similarity score in \([0, 1]\) by normalizing to bits (dividing by \(\log 2\)) then computing \(1 - JS_{bits}\). Returns per-position similarity.
js_similarity(
pssm1=pssm_df, # first PSSM (DataFrame)
pssm2=pssm_df2, # second PSSM (DataFrame)
).mean() # average across positions
js_similarity_flat computes mean JS similarity from flattened PSSM Series.
js_similarity_flat(
p1_flat=pssms.iloc[1], # first flattened PSSM (pd.Series)
p2_flat=pssms.iloc[0], # second flattened PSSM (pd.Series)
)
Cosine Similarity
cosine_similarity computes cosine similarity \(\frac{P \cdot Q}{\|P\| \|Q\|}\) per position between two PSSMs. Since PSSM values are probabilities in \([0,1]\), similarity is bounded in \([0, 1]\).
cosine_similarity(
pssm1=pssm_df, # first PSSM (DataFrame, shape: AA × positions)
pssm2=pssm_df2, # second PSSM (DataFrame, same shape)
)
cosine_overall_flat computes a single overall cosine similarity across all positions from flattened PSSMs.
cosine_overall_flat(
pssm1_flat=pssms.iloc[0], # first flattened PSSM (pd.Series)
pssm2_flat=pssms.iloc[1], # second flattened PSSM (pd.Series)
)
Setup
Compare PSSM
pssms = Data.get_pspa_scale()
# one example
pssm_df = recover_pssm(pssms.iloc[1])
pssm_df2 = recover_pssm(pssms.iloc[0])
KL divergence
kl_divergence
def kl_divergence(
p1, # target pssm p (array-like, shape: (AA, positions))
p2, # pred pssm q (array-like, same shape as p1)
):
KL divergence D_KL(p1 || p2) over positions.
p1 and p2 are arrays (DataFrame or NumPy) with amino acids as index and positions as columns. Returns per-position divergence values.
The Kullback–Leibler (KL) divergence between two probability distributions \(P\) and \(Q\) is defined as:
\[ \mathrm{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x)}{Q(x)} \right) \]
This measures the information lost when \(Q\) is used to approximate \(P\). It is not symmetric, i.e.,
\[ \mathrm{KL}(P \| Q) \ne \mathrm{KL}(Q \| P) \]
and it is non-negative, meaning:
\[ \mathrm{KL}(P \| Q) \ge 0 \]
with equality if and only if \(P = Q\) almost everywhere.
In practical computation, to avoid numerical instability when \(P(x) = 0\) or \(Q(x) = 0\), we often add a small constant \(\varepsilon\):
\[ \mathrm{KL}_\varepsilon(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x) + \varepsilon}{Q(x) + \varepsilon} \right) \]
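The smoothed formula above can be sketched in a few lines of NumPy (this is independent of the katlas implementation; `kl_per_position` and the toy matrices are hypothetical illustrations):

```python
import numpy as np

def kl_per_position(p, q, eps=1e-9):
    """Epsilon-smoothed KL divergence computed per position (column).

    p, q: arrays of shape (n_amino_acids, n_positions) whose columns
    are probability distributions; eps guards against log(0).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log((p + eps) / (q + eps)), axis=0)

# Toy 3-residue PSSMs with 2 positions; the second column is identical
p = np.array([[0.5, 0.2],
              [0.3, 0.5],
              [0.2, 0.3]])
q = np.array([[0.4, 0.2],
              [0.4, 0.5],
              [0.2, 0.3]])
kl = kl_per_position(p, q)  # second entry ~0 since those columns match
```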
kl_divergence(pssm_df,pssm_df2)
array([0.29182172, 0.11138481, 0.24590698, 0.46021635, 0.36874823,
0.53858511, 1.51571614, 0.02905442, 0.08530757, 0.07753394])
kl_divergence(pssm_df,pssm_df2).mean(), kl_divergence(pssm_df,pssm_df2).max()
(np.float64(0.37242752573216287), np.float64(1.5157161422110503))
kl_divergence_flat
def kl_divergence_flat(
p1_flat, # pd.Series of target flattened pssm p
p2_flat, # pd.Series of pred flattened pssm q
):
p1_flat and p2_flat are flattened pd.Series whose index combines position and amino acid (e.g., 'pos_AA'). Returns the mean KL divergence across positions.
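A minimal sketch of computing a mean per-position KL directly from 'pos_AA'-indexed Series (independent of katlas; `kl_flat_mean` and the toy Series are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical flat PSSMs with a 'pos_AA' index (two positions, two residues)
idx = ['0_A', '0_C', '1_A', '1_C']
p_flat = pd.Series([0.6, 0.4, 0.3, 0.7], index=idx)
q_flat = pd.Series([0.5, 0.5, 0.3, 0.7], index=idx)

def kl_flat_mean(p, q, eps=1e-9):
    """Mean per-position KL divergence from two 'pos_AA'-indexed Series."""
    pos = p.index.str.split('_').str[0]        # position label for each entry
    terms = p * np.log((p + eps) / (q + eps))  # element-wise KL terms
    return terms.groupby(pos).sum().mean()     # sum per position, then average
```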
kl_divergence_flat(pssms.iloc[1],pssms.iloc[0])
CPU times: user 2.01 ms, sys: 0 ns, total: 2.01 ms
Wall time: 2 ms
0.37242752573216287
JS divergence
js_divergence
def js_divergence(
p1, # pssm
p2, # pssm
index:bool=True
):
p1 and p2 are arrays (DataFrame or NumPy) with amino acids as index and positions as columns
The Jensen-Shannon divergence between two probability distributions \(P\) and \(Q\) is defined as:
\[ \mathrm{JS}(P \| Q) = \frac{1}{2} \, \mathrm{KL}(P \| M) + \frac{1}{2} \, \mathrm{KL}(Q \| M) \]
where \(M = \frac{1}{2}(P + Q)\) is the average (mixture) distribution, and \(\mathrm{KL}\) denotes the Kullback–Leibler divergence:
\[ \mathrm{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x)}{Q(x)} \right) \]
Therefore,
\[ \mathrm{JS}_\varepsilon(P \| Q) = \frac{1}{2} \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x) + \varepsilon}{M(x) + \varepsilon} \right) + \frac{1}{2} \sum_{x \in \mathcal{X}} Q(x) \log \left( \frac{Q(x) + \varepsilon}{M(x) + \varepsilon} \right) \]
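The smoothed JS formula can be sketched directly from the KL building block (independent of the katlas implementation; `js_per_position` and the toy matrices are hypothetical):

```python
import numpy as np

def js_per_position(p, q, eps=1e-9):
    """Per-position (column-wise) Jensen-Shannon divergence, epsilon-smoothed."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture distribution per column

    def kl(a, b):
        return np.sum(a * np.log((a + eps) / (b + eps)), axis=0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy 3-residue PSSMs with 2 positions
p = np.array([[0.5, 0.2],
              [0.3, 0.5],
              [0.2, 0.3]])
q = np.array([[0.4, 0.2],
              [0.4, 0.5],
              [0.2, 0.3]])
js = js_per_position(p, q)  # symmetric: same result with arguments swapped
```

Unlike KL, JS is symmetric and (with natural log) bounded above by \(\log 2\).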
js_divergence(pssm_df,pssm_df2)
Position
-5 0.065539
-4 0.025712
-3 0.054799
-2 0.103192
-1 0.083377
0 0.105490
1 0.344049
2 0.007299
3 0.020949
4 0.018206
dtype: float64
js_divergence(pssm_df,pssm_df2).max(), js_divergence(pssm_df,pssm_df2).mean()
(np.float64(0.34404931056288773), np.float64(0.08286124552178498))
js_divergence_flat
def js_divergence_flat(
p1_flat, # pd.Series of flattened pssm
p2_flat, # pd.Series of flattened pssm
):
p1_flat and p2_flat are flattened pd.Series whose index combines position and amino acid. Returns the mean JS divergence across positions.
js_divergence_flat(pssms.iloc[1],pssms.iloc[0])
CPU times: user 2.02 ms, sys: 0 ns, total: 2.02 ms
Wall time: 2.01 ms
0.08286124552178498
JS similarity
To convert the Jensen–Shannon divergence into a similarity measure, we first normalize it to bits by dividing by log(2), ensuring that the divergence lies within the range [0, 1]. \[ \mathrm{JS}_{\text{bits}}(P \| Q) = \frac{\mathrm{JS}(P \| Q)}{\log 2} \]
The similarity is then defined as one minus this normalized divergence: \[ \mathrm{Sim}_{\mathrm{JS}}(P, Q) = 1 - \mathrm{JS}_{\text{bits}}(P \| Q) \]
Thus, \(\mathrm{Sim}_{\mathrm{JS}}\) ranges from 0 (completely dissimilar) to 1 (identical distributions).
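This conversion can be sketched as follows (independent of the katlas implementation; `js_similarity_sketch` and the toy columns are hypothetical):

```python
import numpy as np

def js_similarity_sketch(p, q, eps=1e-9):
    """Per-position 1 - JS/log(2): 1 for identical columns, ~0 for disjoint ones."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)), axis=0)
    js = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return 1 - js / np.log(2)  # normalize to bits, then flip into a similarity

same = np.array([[0.7], [0.3]])
left = np.array([[1.0], [0.0]])   # all mass on the first residue
right = np.array([[0.0], [1.0]])  # all mass on the second residue
```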
js_similarity
def js_similarity(
pssm1, pssm2
):
Convert JSD to bits so it lies in [0, 1], then return 1 - JSD.
js_similarity(pssm_df,pssm_df2).mean()
np.float64(0.880456492003838)
js_similarity_flat
def js_similarity_flat(
p1_flat, p2_flat
):
Convert JSD to bits so it lies in [0, 1], then return 1 - JSD.
js_similarity_flat(pssms.iloc[1],pssms.iloc[0])
np.float64(0.880456492003838)
Cosine similarity
cosine_similarity
def cosine_similarity(
pssm1:DataFrame, pssm2:DataFrame
)->Series:
Compute cosine similarity per position (column) between two PSSMs.
The cosine similarity between two vectors \(P\) and \(Q\) (e.g., two PSSM columns representing amino acid probability distributions) is defined as:
\[ \mathrm{cos}(P, Q) = \frac{P \cdot Q}{\|P\| \, \|Q\|} \]
where \(P \cdot Q = \sum_{i=1}^{n} P_i Q_i\) is the dot product between \(P\) and \(Q\), and \(\|P\| = \sqrt{\sum_{i=1}^{n} P_i^2}\) is the Euclidean norm of \(P\).
Since all entries of \(P\) and \(Q\) are nonnegative probabilities (i.e., \(P_i, Q_i \ge 0\)), the cosine similarity lies within the range:
\[ 0 \leq \mathrm{cos}(P, Q) \leq 1 \]
Because PSSM entries are probabilities in \([0, 1]\), every per-position similarity falls in \([0, 1]\).
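The column-wise computation can be sketched in NumPy (independent of the katlas implementation; `cosine_per_position` and the toy matrices are hypothetical):

```python
import numpy as np

def cosine_per_position(p, q):
    """Column-wise cosine similarity between two equally-shaped matrices."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    num = np.sum(p * q, axis=0)                                    # dot product per column
    denom = np.linalg.norm(p, axis=0) * np.linalg.norm(q, axis=0)  # norms per column
    return num / denom

# Toy matrices: first columns are parallel, second columns are orthogonal
a = np.array([[1.0, 0.0],
              [0.0, 1.0]])
b = np.array([[1.0, 1.0],
              [0.0, 0.0]])
cosine_per_position(a, b)  # → [1.0, 0.0]
```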
cosine_similarity(pssm_df,pssm_df2).sort_values()
 1 0.130818
-2 0.606234
-1 0.731466
0 0.780066
-5 0.780504
-3 0.786276
-4 0.901395
3 0.918692
4 0.934967
2 0.971066
dtype: float64
cosine_similarity(pssm_df,pssm_df2).mean()
np.float64(0.754148470457778)
cosine_overall_flat
def cosine_overall_flat(
pssm1_flat, pssm2_flat
):
Compute overall cosine similarity between two PSSMs (flattened).
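As a rough sketch of what an overall (flattened) cosine similarity computes (independent of the katlas implementation; `cosine_overall` is a hypothetical name):

```python
import numpy as np

def cosine_overall(u, v):
    """Single cosine similarity over two full flattened PSSM vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Note that cosine similarity is scale-invariant: a vector compared with a scaled copy of itself still scores 1.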
cosine_overall_flat(pssms.iloc[0],pssms.iloc[0])
np.float64(0.9999999999999999)
cosine_overall_flat(pssms.iloc[0],pssms.iloc[1])
np.float64(0.6614783212500968)