pssm.compare

Algorithms for comparing two PSSMs.

Overview

Functions for comparing Position-Specific Scoring Matrices (PSSMs) using various divergence and similarity metrics.


KL Divergence

kl_divergence computes the Kullback-Leibler divergence \(D_{KL}(P \| Q)\) between two PSSMs, measuring information lost when one distribution approximates another. Returns per-position divergence values (non-negative, 0 means identical). Note: KL divergence is asymmetric.

kl_divergence(
    p1=pssm_df,   # target PSSM P (DataFrame, shape: AA × positions)
    p2=pssm_df2,  # predicted PSSM Q (DataFrame, same shape as p1)
)

kl_divergence_flat computes mean KL divergence from flattened PSSM Series (e.g., rows from a PSSM DataFrame).

kl_divergence_flat(
    p1_flat=pssms.iloc[1],  # target flattened PSSM (pd.Series with 'pos_AA' index)
    p2_flat=pssms.iloc[0],  # predicted flattened PSSM (pd.Series, same structure)
)

JS Divergence

js_divergence computes the Jensen-Shannon divergence, a symmetric version of KL divergence using the mixture \(M = \frac{1}{2}(P + Q)\). Returns per-position divergence as a Series.

js_divergence(
    p1=pssm_df,    # first PSSM (DataFrame, shape: AA × positions)
    p2=pssm_df2,   # second PSSM (DataFrame, same shape)
    index=True,    # whether to return Series with position index
)

js_divergence_flat computes mean JS divergence from flattened PSSM Series.

js_divergence_flat(
    p1_flat=pssms.iloc[1],  # first flattened PSSM (pd.Series)
    p2_flat=pssms.iloc[0],  # second flattened PSSM (pd.Series)
)

JS Similarity

js_similarity converts JS divergence to a similarity score in \([0, 1]\) by normalizing to bits (dividing by \(\log 2\)) then computing \(1 - JS_{bits}\). Returns per-position similarity.

js_similarity(
    pssm1=pssm_df,   # first PSSM (DataFrame)
    pssm2=pssm_df2,  # second PSSM (DataFrame)
).mean()  # average across positions

js_similarity_flat computes mean JS similarity from flattened PSSM Series.

js_similarity_flat(
    p1_flat=pssms.iloc[1],  # first flattened PSSM (pd.Series)
    p2_flat=pssms.iloc[0],  # second flattened PSSM (pd.Series)
)

Cosine Similarity

cosine_similarity computes cosine similarity \(\frac{P \cdot Q}{\|P\| \|Q\|}\) per position between two PSSMs. Since PSSM values are probabilities in \([0,1]\), similarity is bounded in \([0, 1]\).

cosine_similarity(
    pssm1=pssm_df,   # first PSSM (DataFrame, shape: AA × positions)
    pssm2=pssm_df2,  # second PSSM (DataFrame, same shape)
)

cosine_overall_flat computes a single overall cosine similarity across all positions from flattened PSSMs.

cosine_overall_flat(
    pssm1_flat=pssms.iloc[0],  # first flattened PSSM (pd.Series)
    pssm2_flat=pssms.iloc[1],  # second flattened PSSM (pd.Series)
)

Setup

Compare PSSM

from katlas.data import *
from katlas.pssm.core import *
pssms = Data.get_pspa_scale()
# one example
pssm_df = recover_pssm(pssms.iloc[1])
pssm_df2 = recover_pssm(pssms.iloc[0])

KL divergence


kl_divergence


def kl_divergence(
    p1, # target pssm p (array-like, shape: (AA, positions))
    p2, # pred pssm q (array-like, same shape as p1)
):

Per-position KL divergence D_KL(p1 || p2).

p1 and p2 are arrays (DataFrame or NumPy) with amino acids as rows and positions as columns. Returns one non-negative divergence value per position.

The Kullback–Leibler (KL) divergence between two probability distributions \(P\) and \(Q\) is defined as:

\[ \mathrm{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x)}{Q(x)} \right) \]

This measures the information lost when \(Q\) is used to approximate \(P\). It is not symmetric, i.e.,

\[ \mathrm{KL}(P \| Q) \ne \mathrm{KL}(Q \| P) \]

and it is non-negative, meaning:

\[ \mathrm{KL}(P \| Q) \ge 0 \]

with equality if and only if \(P = Q\) almost everywhere.

In practical computation, to avoid numerical instability when \(P(x) = 0\) or \(Q(x) = 0\), we often add a small constant \(\varepsilon\):

\[ \mathrm{KL}_\varepsilon(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x) + \varepsilon}{Q(x) + \varepsilon} \right) \]

kl_divergence(pssm_df,pssm_df2)
array([0.29182172, 0.11138481, 0.24590698, 0.46021635, 0.36874823,
       0.53858511, 1.51571614, 0.02905442, 0.08530757, 0.07753394])
kl_divergence(pssm_df,pssm_df2).mean(),kl_divergence(pssm_df,pssm_df2).max()
(np.float64(0.37242752573216287), np.float64(1.5157161422110503))
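The \(\varepsilon\)-stabilized formula can be sketched directly in NumPy. This is an illustrative sketch, not the library implementation; `kl_per_position` and its `eps` smoothing constant are assumed names:

```python
import numpy as np

def kl_per_position(p, q, eps=1e-10):
    """Per-position KL divergence D_KL(p || q) for arrays of shape (AA, positions)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # sum over amino acids (rows), leaving one value per position (column)
    return np.sum(p * np.log((p + eps) / (q + eps)), axis=0)

# two toy 2-residue x 3-position PSSMs whose columns each sum to 1
p = np.array([[0.9, 0.5, 0.2],
              [0.1, 0.5, 0.8]])
q = np.array([[0.8, 0.5, 0.3],
              [0.2, 0.5, 0.7]])
print(kl_per_position(p, q))  # three non-negative values; 0 where columns match
```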

kl_divergence_flat


def kl_divergence_flat(
    p1_flat, # pd.Series of target flattened pssm p
    p2_flat, # pd.Series of pred flattened pssm q
):

p1_flat and p2_flat are flattened pd.Series whose index combines position and amino acid (e.g. 'pos_AA'); returns the mean KL divergence across positions.

kl_divergence_flat(pssms.iloc[1],pssms.iloc[0])
0.37242752573216287
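As a rough illustration of how the flat variant could work, here is a sketch under the assumption that each index key begins with the position (e.g. '-5P'); the actual library implementation may differ:

```python
import re
import numpy as np
import pandas as pd

def kl_flat_sketch(p_flat, q_flat, eps=1e-10):
    """Mean per-position KL divergence from two flattened PSSM Series (sketch)."""
    q_flat = q_flat.reindex(p_flat.index)  # align entries on the shared index
    per_entry = p_flat * np.log((p_flat + eps) / (q_flat + eps))
    # group entries by their position prefix, sum within each position, average over positions
    positions = [re.match(r'-?\d+', key).group() for key in p_flat.index]
    return per_entry.groupby(positions).sum().mean()

# toy example: two positions ('0' and '1'), two residues ('A' and 'B')
p = pd.Series([1.0, 0.0, 0.5, 0.5], index=['0A', '0B', '1A', '1B'])
q = pd.Series([0.5, 0.5, 0.5, 0.5], index=['0A', '0B', '1A', '1B'])
print(kl_flat_sketch(p, q))  # ~log(2)/2: log(2) at position 0, 0 at position 1
```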

JS divergence


js_divergence


def js_divergence(
    p1, # first pssm (df or np, shape: AA × positions)
    p2, # second pssm (same shape)
    index:bool=True # if True, return a Series indexed by position
):

p1 and p2 are arrays (DataFrame or NumPy) with amino acids as rows and positions as columns.

The Jensen-Shannon divergence between two probability distributions \(P\) and \(Q\) is defined as:

\[ \mathrm{JS}(P \| Q) = \frac{1}{2} \, \mathrm{KL}(P \| M) + \frac{1}{2} \, \mathrm{KL}(Q \| M) \]

where \(M = \frac{1}{2}(P + Q)\) is the average (mixture) distribution, and \(\mathrm{KL}(\cdot \| \cdot)\) denotes the Kullback–Leibler divergence:

\[ \mathrm{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x)}{Q(x)} \right) \]

Therefore,

\[ \mathrm{JS}_\varepsilon(P \| Q) = \frac{1}{2} \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x) + \varepsilon}{M(x) + \varepsilon} \right) + \frac{1}{2} \sum_{x \in \mathcal{X}} Q(x) \log \left( \frac{Q(x) + \varepsilon}{M(x) + \varepsilon} \right) \]

js_divergence(pssm_df,pssm_df2)
Position
-5    0.065539
-4    0.025712
-3    0.054799
-2    0.103192
-1    0.083377
 0    0.105490
 1    0.344049
 2    0.007299
 3    0.020949
 4    0.018206
dtype: float64
js_divergence(pssm_df,pssm_df2).max(),js_divergence(pssm_df,pssm_df2).mean()
(np.float64(0.34404931056288773), np.float64(0.08286124552178498))
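The mixture formula above can be sketched in NumPy. A minimal illustration (not the library code; `js_per_position` is an assumed name):

```python
import numpy as np

def js_per_position(p, q, eps=1e-10):
    """Per-position Jensen-Shannon divergence for arrays of shape (AA, positions)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture distribution
    kl_pm = np.sum(p * np.log((p + eps) / (m + eps)), axis=0)
    kl_qm = np.sum(q * np.log((q + eps) / (m + eps)), axis=0)
    return 0.5 * kl_pm + 0.5 * kl_qm

# first columns have disjoint support (JS = log 2), second columns are identical (JS = 0)
p = np.array([[1.0, 0.5], [0.0, 0.5]])
q = np.array([[0.0, 0.5], [1.0, 0.5]])
print(js_per_position(p, q))
```

Unlike KL, this is symmetric (swapping p and q gives the same result), and with natural logs each value is bounded above by \(\log 2\).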

js_divergence_flat


def js_divergence_flat(
    p1_flat, # pd.Series of flattened pssm
    p2_flat, # pd.Series of flattened pssm
):

p1_flat and p2_flat are flattened pd.Series whose index combines position and amino acid; returns the mean JS divergence across positions.

js_divergence_flat(pssms.iloc[1],pssms.iloc[0])
0.08286124552178498

JS similarity

To convert the Jensen–Shannon divergence into a similarity measure, we first normalize it to bits by dividing by log(2), ensuring that the divergence lies within the range [0, 1]. \[ \mathrm{JS}_{\text{bits}}(P \| Q) = \frac{\mathrm{JS}(P \| Q)}{\log 2} \]

The similarity is then defined as one minus this normalized divergence: \[ \mathrm{Sim}_{\mathrm{JS}}(P, Q) = 1 - \mathrm{JS}_{\text{bits}}(P \| Q) \]

Thus, \(\mathrm{Sim}_{\mathrm{JS}}\) ranges from 0 (completely dissimilar) to 1 (identical distributions).


js_similarity


def js_similarity(
    pssm1, pssm2
):

Convert the JS divergence to bits (divide by \(\log 2\)) so it lies in \([0, 1]\), then return \(1 - \mathrm{JS}_{\text{bits}}\) per position.

js_similarity(pssm_df,pssm_df2).mean()
np.float64(0.880456492003838)
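The divergence-to-similarity conversion can be sketched end to end in NumPy (an illustrative sketch assuming natural-log JS divergence; `js_similarity_sketch` is an assumed name):

```python
import numpy as np

def js_similarity_sketch(p, q, eps=1e-10):
    """1 - JS/log(2): maps per-position JS divergence into a [0, 1] similarity."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture distribution
    js = 0.5 * np.sum(p * np.log((p + eps) / (m + eps)), axis=0) \
       + 0.5 * np.sum(q * np.log((q + eps) / (m + eps)), axis=0)
    return 1 - js / np.log(2)  # identical columns -> 1, disjoint support -> 0

# first columns are disjoint (similarity ~0), second columns identical (similarity 1)
p = np.array([[1.0, 0.5], [0.0, 0.5]])
q = np.array([[0.0, 0.5], [1.0, 0.5]])
print(js_similarity_sketch(p, q))
```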

js_similarity_flat


def js_similarity_flat(
    p1_flat, p2_flat
):

Convert the JS divergence to bits so it lies in \([0, 1]\), then return the mean of \(1 - \mathrm{JS}_{\text{bits}}\) across positions.

js_similarity_flat(pssms.iloc[1],pssms.iloc[0])
np.float64(0.880456492003838)

Cosine similarity


cosine_similarity


def cosine_similarity(
    pssm1:DataFrame, pssm2:DataFrame
)->Series:

Compute cosine similarity per position (column) between two PSSMs.

The cosine similarity between two vectors \(P\) and \(Q\) (e.g., two PSSM columns representing amino acid probability distributions) is defined as:

\[ \mathrm{cos}(P, Q) = \frac{P \cdot Q}{\|P\| \, \|Q\|} \]

where \(P \cdot Q = \sum_{i=1}^{n} P_i Q_i\) is the dot product between \(P\) and \(Q\), and \(\|P\| = \sqrt{\sum_{i=1}^{n} P_i^2}\) is the Euclidean norm of \(P\).

Since all entries of \(P\) and \(Q\) are nonnegative probabilities (i.e., \(P_i, Q_i \geq 0\)), the cosine similarity lies within the range:

\[ 0 \leq \mathrm{cos}(P, Q) \leq 1 \]

Since PSSM entries are probabilities in \([0, 1]\), the per-position cosine similarity is bounded in \([0, 1]\).

cosine_similarity(pssm_df,pssm_df2).sort_values()
 1    0.130818
-2    0.606234
-1    0.731466
 0    0.780066
-5    0.780504
-3    0.786276
-4    0.901395
 3    0.918692
 4    0.934967
 2    0.971066
dtype: float64
cosine_similarity(pssm_df,pssm_df2).mean()
np.float64(0.754148470457778)
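Per-position cosine similarity reduces to column-wise dot products divided by column norms. A minimal NumPy sketch (not the library implementation; `cosine_per_position` is an assumed name):

```python
import numpy as np

def cosine_per_position(p, q):
    """Cosine similarity between matching columns of two arrays (AA x positions)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    dots = np.sum(p * q, axis=0)                                # per-column dot products
    norms = np.linalg.norm(p, axis=0) * np.linalg.norm(q, axis=0)
    return dots / norms

# first columns are orthogonal (similarity 0), second columns identical (similarity 1)
p = np.array([[1.0, 0.9], [0.0, 0.1]])
q = np.array([[0.0, 0.9], [1.0, 0.1]])
print(cosine_per_position(p, q))
```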

cosine_overall_flat


def cosine_overall_flat(
    pssm1_flat, pssm2_flat
):

Compute overall cosine similarity between two PSSMs (flattened).

cosine_overall_flat(pssms.iloc[0],pssms.iloc[0])
np.float64(0.9999999999999999)
cosine_overall_flat(pssms.iloc[0],pssms.iloc[1])
np.float64(0.6614783212500968)