from katlas.data import *
from katlas.pssm import *
Setup
Compare PSSM
pssms = Data.pspa_scale()

# one example
pssm_df = recover_pssm(pssms.iloc[1])
pssm_df2 = recover_pssm(pssms.iloc[0])

KL divergence
kl_divergence
def kl_divergence(
p1, # target pssm p (array-like, shape: (AA, positions))
p2, # pred pssm q (array-like, same shape as p1)
):
Compute the KL divergence D_KL(p1 || p2) per position.
p1 and p2 are arrays (DataFrame or ndarray) with amino acids as the index and positions as the columns. Returns the average divergence across positions if mean=True, else the per-position values.
The Kullback–Leibler (KL) divergence between two probability distributions \(P\) and \(Q\) is defined as:
\[ \mathrm{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x)}{Q(x)} \right) \]
This measures the information lost when \(Q\) is used to approximate \(P\). It is not symmetric, i.e.,
\[ \mathrm{KL}(P \| Q) \ne \mathrm{KL}(Q \| P) \]
and it is non-negative, meaning:
\[ \mathrm{KL}(P \| Q) \ge 0 \]
with equality if and only if \(P = Q\) almost everywhere.
In practical computation, to avoid numerical instability when \(P(x) = 0\) or \(Q(x) = 0\), we often add a small constant \(\varepsilon\):
\[ \mathrm{KL}_\varepsilon(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x) + \varepsilon}{Q(x) + \varepsilon} \right) \]
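The smoothed, per-position form above can be sketched in a few lines of NumPy. This is an illustrative helper with a hypothetical name, not the library's `kl_divergence` itself:

```python
import numpy as np

def kl_per_position(p, q, eps=1e-9):
    # Smoothed KL divergence D_KL(p || q), computed column-wise.
    # p, q: arrays of shape (n_aa, n_positions) whose columns are
    # probability distributions over amino acids.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log((p + eps) / (q + eps)), axis=0)

# two toy 3-residue x 2-position PSSMs; the second columns are identical
p = np.array([[0.5, 0.2],
              [0.3, 0.5],
              [0.2, 0.3]])
q = np.array([[0.4, 0.2],
              [0.4, 0.5],
              [0.2, 0.3]])
kl_per_position(p, q)  # second position diverges by ~0
```

Note the asymmetry: swapping `p` and `q` generally changes the result.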
kl_divergence(pssm_df,pssm_df2)

array([0.29182172, 0.11138481, 0.24590698, 0.46021635, 0.36874823,
       0.53858511, 1.51571614, 0.02905442, 0.08530757, 0.07753394])
kl_divergence(pssm_df,pssm_df2).mean(),kl_divergence(pssm_df,pssm_df2).max()

(np.float64(0.37242752573216287), np.float64(1.5157161422110503))
kl_divergence_flat
def kl_divergence_flat(
p1_flat, # pd.Series of target flattened pssm p
p2_flat, # pd.Series of pred flattened pssm q
):
p1_flat and p2_flat are two flattened pd.Series whose index encodes both amino acid and position; returns the mean divergence across positions.
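A flattened PSSM just concatenates the per-position columns, so summing the smoothed KL terms over all (amino acid, position) entries equals the sum of the per-position divergences, and dividing by the number of positions recovers their mean. A small NumPy check of that identity (toy values, not the library function):

```python
import numpy as np

p = np.array([[0.5, 0.2],
              [0.3, 0.5],
              [0.2, 0.3]])
q = np.array([[0.4, 0.3],
              [0.4, 0.4],
              [0.2, 0.3]])
eps = 1e-9

# per-position KL: one value per column
per_pos = np.sum(p * np.log((p + eps) / (q + eps)), axis=0)

# flat KL: a single sum over all (aa, position) entries
flat = np.sum(p.ravel() * np.log((p.ravel() + eps) / (q.ravel() + eps)))

assert np.isclose(flat, per_pos.sum())                 # sums agree
assert np.isclose(flat / p.shape[1], per_pos.mean())   # mean recovered
```

This is why the flat result below matches the `.mean()` of the per-position call above.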
kl_divergence_flat(pssms.iloc[1],pssms.iloc[0])

CPU times: user 568 μs, sys: 164 μs, total: 732 μs
Wall time: 602 μs
0.37242752573216287
JS divergence
js_divergence
def js_divergence(
p1, # pssm
p2, # pssm
index:bool=True # if True, return a pd.Series indexed by position
):
p1 and p2 are two arrays (DataFrame or ndarray) with amino acids as the index and positions as the columns.
The Jensen-Shannon divergence between two probability distributions \(P\) and \(Q\) is defined as:
\[ \mathrm{JS}(P \| Q) = \frac{1}{2} \, \mathrm{KL}(P \| M) + \frac{1}{2} \, \mathrm{KL}(Q \| M) \]
where \( M = \frac{1}{2}(P + Q) \) is the average (mixture) distribution, and \(\mathrm{KL}\) denotes the Kullback–Leibler divergence:
\[ \mathrm{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x)}{Q(x)} \right) \]
Therefore,
\[ \mathrm{JS}_\varepsilon(P \| Q) = \frac{1}{2} \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x) + \varepsilon}{M(x) + \varepsilon} \right) + \frac{1}{2} \sum_{x \in \mathcal{X}} Q(x) \log \left( \frac{Q(x) + \varepsilon}{M(x) + \varepsilon} \right) \]
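The mixture form above translates directly into NumPy. A minimal sketch with hypothetical names (the library's `js_divergence` additionally labels the result by position):

```python
import numpy as np

def js_per_position(p, q, eps=1e-9):
    # Jensen-Shannon divergence per column via the mixture M = (P + Q) / 2.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    def kl(a, b):
        return np.sum(a * np.log((a + eps) / (b + eps)), axis=0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([[0.5, 0.2],
              [0.3, 0.5],
              [0.2, 0.3]])
q = np.array([[0.1, 0.2],
              [0.7, 0.5],
              [0.2, 0.3]])
js_per_position(p, q)  # same result as js_per_position(q, p)
```

Unlike KL, JS is symmetric and (with natural log) bounded above by \(\log 2\), which is what makes the normalization in the similarity section possible.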
js_divergence(pssm_df,pssm_df2)

Position
-5 0.065539
-4 0.025712
-3 0.054799
-2 0.103192
-1 0.083377
0 0.105490
1 0.344049
2 0.007299
3 0.020949
4 0.018206
dtype: float64
js_divergence(pssm_df,pssm_df2).max(),js_divergence(pssm_df,pssm_df2).mean()

(np.float64(0.3440493105628877), np.float64(0.08286124552178498))
js_divergence_flat
def js_divergence_flat(
p1_flat, # pd.Series of flattened pssm
p2_flat, # pd.Series of flattened pssm
):
p1_flat and p2_flat are two flattened pd.Series whose index encodes both amino acid and position; returns the mean divergence across positions.
js_divergence_flat(pssms.iloc[1],pssms.iloc[0])

CPU times: user 577 μs, sys: 173 μs, total: 750 μs
Wall time: 604 μs
0.08286124552178498
JS similarity
To convert the Jensen–Shannon divergence into a similarity measure, we first normalize it to bits by dividing by log(2), ensuring that the divergence lies within the range [0, 1]. \[ \mathrm{JS}_{\text{bits}}(P \| Q) = \frac{\mathrm{JS}(P \| Q)}{\log 2} \]
The similarity is then defined as one minus this normalized divergence: \[ \mathrm{Sim}_{\mathrm{JS}}(P, Q) = 1 - \mathrm{JS}_{\text{bits}}(P \| Q) \]
Thus, \(\mathrm{Sim}_{\mathrm{JS}}\) ranges from 0 (completely dissimilar) to 1 (identical distributions).
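A minimal sketch of this conversion, assuming the natural-log JSD defined above (hypothetical helper name):

```python
import numpy as np

def js_similarity_sketch(p, q, eps=1e-9):
    # 1 - JSD/log(2): 1 for identical columns, ~0 for disjoint support.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    def kl(a, b):
        return np.sum(a * np.log((a + eps) / (b + eps)), axis=0)
    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return 1 - jsd / np.log(2)

identical = np.array([[0.5], [0.5]])
disjoint_p = np.array([[1.0], [0.0]])
disjoint_q = np.array([[0.0], [1.0]])
js_similarity_sketch(identical, identical)    # similarity of 1
js_similarity_sketch(disjoint_p, disjoint_q)  # similarity close to 0
```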
js_similarity
def js_similarity(
pssm1, pssm2
):
Normalize the JSD by log 2 so it lies in [0, 1], then return 1 - JSD.
js_similarity(pssm_df,pssm_df2).mean()

np.float64(0.880456492003838)
js_similarity_flat
def js_similarity_flat(
p1_flat, p2_flat
):
Normalize the JSD by log 2 so it lies in [0, 1], then return 1 - JSD.
js_similarity_flat(pssms.iloc[1],pssms.iloc[0])

np.float64(0.880456492003838)
Cosine similarity
cosine_similarity
def cosine_similarity(
pssm1:DataFrame, pssm2:DataFrame
)->Series:
Compute cosine similarity per position (column) between two PSSMs.
The cosine similarity between two vectors ( P ) and ( Q ) (e.g., two PSSM columns representing amino acid probability distributions) is defined as:
\[ \mathrm{cos}(P, Q) = \frac{P \cdot Q}{\|P\| \, \|Q\|} \]
where \( P \cdot Q = \sum_{i=1}^{n} P_i Q_i \) is the dot product between \(P\) and \(Q\), and \( \|P\| = \sqrt{\sum_{i=1}^{n} P_i^2} \) is the Euclidean norm of \(P\).
Since all entries of \(P\) and \(Q\) are nonnegative probabilities (i.e., \( P_i, Q_i \ge 0 \)), the cosine similarity lies within the range:
\[ 0 \leq \mathrm{cos}(P, Q) \leq 1 \]
Because a PSSM's entries are probabilities between 0 and 1, the per-position cosine similarity is therefore bounded in [0, 1].
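The column-wise computation can be sketched with NumPy broadcasting (illustrative helper; the library's `cosine_similarity` additionally returns a Series labeled by position):

```python
import numpy as np

def cosine_per_position(p, q):
    # Cosine similarity between matching columns of two PSSMs.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    num = np.sum(p * q, axis=0)                              # per-column dot products
    den = np.linalg.norm(p, axis=0) * np.linalg.norm(q, axis=0)
    return num / den

p = np.array([[0.5, 0.2],
              [0.3, 0.5],
              [0.2, 0.3]])
cosine_per_position(p, p)  # identical PSSMs -> 1.0 at every position
```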
cosine_similarity(pssm_df,pssm_df2).sort_values()

 1 0.130818
-2 0.606234
-1 0.731466
0 0.780066
-5 0.780504
-3 0.786276
-4 0.901395
3 0.918692
4 0.934967
2 0.971066
dtype: float64
cosine_similarity(pssm_df,pssm_df2).mean()

np.float64(0.754148470457778)
cosine_overall_flat
def cosine_overall_flat(
pssm1_flat, pssm2_flat
):
Compute overall cosine similarity between two PSSMs (flattened).
cosine_overall_flat(pssms.iloc[0],pssms.iloc[0])

1.0000000000000002

cosine_overall_flat(pssms.iloc[0],pssms.iloc[1])

0.6614783212500965