ranking

Ranking and AUCDF helpers.
df = sns.load_dataset('tips')
df.shape
(244, 7)
df.head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

Ranking Plots


plot_rank


def plot_rank(
    sorted_df:DataFrame, # dataframe already sorted by the ranking value
    x:str, # label column used for annotations
    y:str, # numeric ranking column
    n_hi:int | None=10, # number of items to annotate at the head
    n_lo:int | None=10, # number of items to annotate at the tail
    figsize:tuple=(10, 8), # figure size in inches
    data:NoneType=None, hue:NoneType=None, size:NoneType=None, style:NoneType=None, palette:NoneType=None,
    hue_order:NoneType=None, hue_norm:NoneType=None, sizes:NoneType=None, size_order:NoneType=None,
    size_norm:NoneType=None, markers:bool=True, style_order:NoneType=None, legend:str='auto', ax:NoneType=None
):

Plot a ranked scatter and annotate the highest and lowest entries.

sort_df=df.sort_values('total_bill').copy()
sort_df['id'] = sort_df.index.astype(str)
plot_rank(sort_df, x='id', y='total_bill', n_hi=10, n_lo=10)

Rank Summary Metrics, AUCDF

We compute the area under the empirical cumulative distribution function (CDF) as a function of kinase rank using the trapezoidal rule.
Let $ r_{(1)} < r_{(2)} < < r_{(n)} $ be the sorted rank values (e.g., \(1,2,\dots,n\)), and define the empirical CDF values as:

\[ F(r_{(i)}) = \frac{i}{n} \]

The normalized area under this CDF-vs-rank curve (AUCDF) is then computed via the trapezoidal rule:

\[ \text{AUC}_{\text{CDF}} = \frac{1}{r_{\max} - r_{\min}} \sum_{i=1}^{n-1} \frac{F(r_{(i)}) + F(r_{(i+1)})}{2} \cdot (r_{(i+1)} - r_{(i)}) \]

where $ r_{} = r_{(1)} $, typically 1; $ r_{} = r_{(n)} $, typically \(n\).

This measures how quickly the cumulative mass increases across the ranked kinases. If better kinases (lower rank) tend to appear earlier in the CDF, the AUCDF will be higher.


get_AUCDF


def get_AUCDF(
    df:DataFrame, # dataframe containing the ranking column
    col:str, # numeric ranking column
    reverse:bool=False, # flip the empirical CDF direction
    plot:bool=True, # whether to draw the histogram and CDF panels
    xlabel:str='Rank of kinase', # x-axis label for the histogram
    ylabel:str='Substrates', # y-axis label for the histogram
)->float:

Compute the normalized area under an empirical CDF over rank values.

get_AUCDF(df, 'total_bill', plot=True)

0.6519265042202643