from katlas.data import *
from katlas.pssm import *
Setup
Compare PSSM
pssms = Data.pspa_scale()

# one example
pssm_df = recover_pssm(pssms.iloc[1])
pssm_df2 = recover_pssm(pssms.iloc[0])

KL divergence
kl_divergence
def kl_divergence(
p1, # target pssm p (array-like, shape: (AA, positions))
p2, # pred pssm q (array-like, same shape as p1)
):
Compute the KL divergence D_KL(p1 || p2) per position.
p1 and p2 are arrays (DataFrame or ndarray) with amino acids as the index and positions as the columns. Returns the average divergence across positions if mean=True, else the per-position values.
The Kullback–Leibler (KL) divergence between two probability distributions \(P\) and \(Q\) is defined as:
\[ \mathrm{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x)}{Q(x)} \right) \]
This measures the information lost when \(Q\) is used to approximate \(P\). It is not symmetric, i.e.,
\[ \mathrm{KL}(P \| Q) \ne \mathrm{KL}(Q \| P) \]
and it is non-negative, meaning:
\[ \mathrm{KL}(P \| Q) \ge 0 \]
with equality if and only if \(P = Q\) almost everywhere.
In practical computation, to avoid numerical instability when \(P(x) = 0\) or \(Q(x) = 0\), we often add a small constant \(\varepsilon\):
\[ \mathrm{KL}_\varepsilon(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x) + \varepsilon}{Q(x) + \varepsilon} \right) \]
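The smoothed, per-position form above can be sketched in a few lines of NumPy. This is an illustrative helper with a hypothetical name, not the library's `kl_divergence` itself:

```python
import numpy as np

def kl_per_position(p, q, eps=1e-9):
    # Smoothed KL divergence D_KL(p || q), computed column-wise.
    # p, q: arrays of shape (n_aa, n_positions) whose columns are
    # probability distributions over amino acids.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log((p + eps) / (q + eps)), axis=0)

# two toy 3-residue x 2-position PSSMs; the second columns are identical
p = np.array([[0.5, 0.2],
              [0.3, 0.5],
              [0.2, 0.3]])
q = np.array([[0.4, 0.2],
              [0.4, 0.5],
              [0.2, 0.3]])
kl_per_position(p, q)  # second position diverges by ~0
```

Note the asymmetry: swapping `p` and `q` generally changes the result.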
kl_divergence(pssm_df,pssm_df2)

array([0.29182172, 0.11138481, 0.24590698, 0.46021635, 0.36874823,
       0.53858511, 1.51571614, 0.02905442, 0.08530757, 0.07753394])
kl_divergence(pssm_df,pssm_df2).mean(),kl_divergence(pssm_df,pssm_df2).max()

(np.float64(0.37242752573216287), np.float64(1.5157161422110503))
kl_divergence_flat
def kl_divergence_flat(
p1_flat, # pd.Series of target flattened pssm p
p2_flat, # pd.Series of pred flattened pssm q
):
p1_flat and p2_flat are two flattened pd.Series whose index encodes both amino acid and position; returns the mean divergence across positions.
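A flattened PSSM just concatenates the per-position columns, so summing the smoothed KL terms over all (amino acid, position) entries equals the sum of the per-position divergences, and dividing by the number of positions recovers their mean. A small NumPy check of that identity (toy values, not the library function):

```python
import numpy as np

p = np.array([[0.5, 0.2],
              [0.3, 0.5],
              [0.2, 0.3]])
q = np.array([[0.4, 0.3],
              [0.4, 0.4],
              [0.2, 0.3]])
eps = 1e-9

# per-position KL: one value per column
per_pos = np.sum(p * np.log((p + eps) / (q + eps)), axis=0)

# flat KL: a single sum over all (aa, position) entries
flat = np.sum(p.ravel() * np.log((p.ravel() + eps) / (q.ravel() + eps)))

assert np.isclose(flat, per_pos.sum())                 # sums agree
assert np.isclose(flat / p.shape[1], per_pos.mean())   # mean recovered
```

This is why the flat result below matches the `.mean()` of the per-position call above.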
kl_divergence_flat(pssms.iloc[1],pssms.iloc[0])

CPU times: user 568 μs, sys: 164 μs, total: 732 μs
Wall time: 602 μs
0.37242752573216287
JS divergence
js_divergence
def js_divergence(
p1, # pssm
p2, # pssm
index:bool=True # if True, return a pd.Series indexed by position
):
p1 and p2 are two arrays (DataFrame or ndarray) with amino acids as the index and positions as the columns.
The Jensen-Shannon divergence between two probability distributions \(P\) and \(Q\) is defined as:
\[ \mathrm{JS}(P \| Q) = \frac{1}{2} \, \mathrm{KL}(P \| M) + \frac{1}{2} \, \mathrm{KL}(Q \| M) \]
where \( M = \frac{1}{2}(P + Q) \) is the average (mixture) distribution, and \(\mathrm{KL}\) denotes the Kullback–Leibler divergence:
\[ \mathrm{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x)}{Q(x)} \right) \]
Therefore,
\[ \mathrm{JS}_\varepsilon(P \| Q) = \frac{1}{2} \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x) + \varepsilon}{M(x) + \varepsilon} \right) + \frac{1}{2} \sum_{x \in \mathcal{X}} Q(x) \log \left( \frac{Q(x) + \varepsilon}{M(x) + \varepsilon} \right) \]
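The mixture form above translates directly into NumPy. A minimal sketch with hypothetical names (the library's `js_divergence` additionally labels the result by position):

```python
import numpy as np

def js_per_position(p, q, eps=1e-9):
    # Jensen-Shannon divergence per column via the mixture M = (P + Q) / 2.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    def kl(a, b):
        return np.sum(a * np.log((a + eps) / (b + eps)), axis=0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([[0.5, 0.2],
              [0.3, 0.5],
              [0.2, 0.3]])
q = np.array([[0.1, 0.2],
              [0.7, 0.5],
              [0.2, 0.3]])
js_per_position(p, q)  # same result as js_per_position(q, p)
```

Unlike KL, JS is symmetric and (with natural log) bounded above by \(\log 2\), which is what makes the normalization in the similarity section possible.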
js_divergence(pssm_df,pssm_df2)

Position
-5 0.065539
-4 0.025712
-3 0.054799
-2 0.103192
-1 0.083377
0 0.105490
1 0.344049
2 0.007299
3 0.020949
4 0.018206
dtype: float64
js_divergence(pssm_df,pssm_df2).max(),js_divergence(pssm_df,pssm_df2).mean()

(np.float64(0.3440493105628877), np.float64(0.08286124552178498))
js_divergence_flat
def js_divergence_flat(
p1_flat, # pd.Series of flattened pssm
p2_flat, # pd.Series of flattened pssm
):
p1_flat and p2_flat are two flattened pd.Series whose index encodes both amino acid and position; returns the mean divergence across positions.
js_divergence_flat(pssms.iloc[1],pssms.iloc[0])

CPU times: user 577 μs, sys: 173 μs, total: 750 μs
Wall time: 604 μs
0.08286124552178498
JS similarity
To convert the Jensen–Shannon divergence into a similarity measure, we first normalize it to bits by dividing by log(2), ensuring that the divergence lies within the range [0, 1]. \[ \mathrm{JS}_{\text{bits}}(P \| Q) = \frac{\mathrm{JS}(P \| Q)}{\log 2} \]
The similarity is then defined as one minus this normalized divergence: \[ \mathrm{Sim}_{\mathrm{JS}}(P, Q) = 1 - \mathrm{JS}_{\text{bits}}(P \| Q) \]
Thus, \(\mathrm{Sim}_{\mathrm{JS}}\) ranges from 0 (completely dissimilar) to 1 (identical distributions).
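A minimal sketch of this conversion, assuming the natural-log JSD defined above (hypothetical helper name):

```python
import numpy as np

def js_similarity_sketch(p, q, eps=1e-9):
    # 1 - JSD/log(2): 1 for identical columns, ~0 for disjoint support.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    def kl(a, b):
        return np.sum(a * np.log((a + eps) / (b + eps)), axis=0)
    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return 1 - jsd / np.log(2)

identical = np.array([[0.5], [0.5]])
disjoint_p = np.array([[1.0], [0.0]])
disjoint_q = np.array([[0.0], [1.0]])
js_similarity_sketch(identical, identical)    # similarity of 1
js_similarity_sketch(disjoint_p, disjoint_q)  # similarity close to 0
```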
js_similarity
def js_similarity(
pssm1, pssm2
):
Normalize the JSD by log 2 so it lies in [0, 1], then return 1 - JSD.
js_similarity(pssm_df,pssm_df2).mean()

np.float64(0.880456492003838)
js_similarity_flat
def js_similarity_flat(
p1_flat, p2_flat
):
Normalize the JSD by log 2 so it lies in [0, 1], then return 1 - JSD.
js_similarity_flat(pssms.iloc[1],pssms.iloc[0])

np.float64(0.880456492003838)
Cosine similarity
cosine_similarity
def cosine_similarity(
pssm1:DataFrame, pssm2:DataFrame
)->Series:
Compute cosine similarity per position (column) between two PSSMs.
The cosine similarity between two vectors ( P ) and ( Q ) (e.g., two PSSM columns representing amino acid probability distributions) is defined as:
\[ \mathrm{cos}(P, Q) = \frac{P \cdot Q}{\|P\| \, \|Q\|} \]
where \( P \cdot Q = \sum_{i=1}^{n} P_i Q_i \) is the dot product between \(P\) and \(Q\), and \( \|P\| = \sqrt{\sum_{i=1}^{n} P_i^2} \) is the Euclidean norm of \(P\).
Since all entries of \(P\) and \(Q\) are nonnegative probabilities (i.e., \( P_i, Q_i \ge 0 \)), the cosine similarity lies within the range:
\[ 0 \leq \mathrm{cos}(P, Q) \leq 1 \]
Because a PSSM's entries are probabilities between 0 and 1, the per-position cosine similarity is therefore bounded in [0, 1].
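The column-wise computation can be sketched with NumPy broadcasting (illustrative helper; the library's `cosine_similarity` additionally returns a Series labeled by position):

```python
import numpy as np

def cosine_per_position(p, q):
    # Cosine similarity between matching columns of two PSSMs.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    num = np.sum(p * q, axis=0)                              # per-column dot products
    den = np.linalg.norm(p, axis=0) * np.linalg.norm(q, axis=0)
    return num / den

p = np.array([[0.5, 0.2],
              [0.3, 0.5],
              [0.2, 0.3]])
cosine_per_position(p, p)  # identical PSSMs -> 1.0 at every position
```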
cosine_similarity(pssm_df,pssm_df2).sort_values()

 1 0.130818
-2 0.606234
-1 0.731466
0 0.780066
-5 0.780504
-3 0.786276
-4 0.901395
3 0.918692
4 0.934967
2 0.971066
dtype: float64
cosine_similarity(pssm_df,pssm_df2).mean()

np.float64(0.754148470457778)
cosine_overall_flat
def cosine_overall_flat(
pssm1_flat, pssm2_flat
):
Compute overall cosine similarity between two PSSMs (flattened).
cosine_overall_flat(pssms.iloc[0],pssms.iloc[0])

1.0000000000000002

cosine_overall_flat(pssms.iloc[0],pssms.iloc[1])

0.6614783212500965