cut_seq('AAkUuPSFSTtH', -5, 4)
# Returns: 'AkUuPSFSTt'
This module provides functions to calculate kinase scores based on substrate sequences using Position-Specific Scoring Matrices (PSSMs) from either PSPA data or kinase-substrate datasets.
Utility Functions
These helper functions prepare and parse phosphorylation site sequences.
cut_seq — Extracts a substring from a sequence based on positions relative to its center (the phosphorylation site).
STY2sty — Converts uppercase S/T/Y (unphosphorylated) to lowercase s/t/y (phosphorylated) in a sequence.
get_dict — Parses a site sequence into position-annotated amino acid keys (e.g., -3P, 0s, 1Q) for PSSM lookup.
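The three helpers above can be sketched as follows. This is a minimal illustration, not the module's source: the `_sketch` names are hypothetical, and the center is assumed to sit at index `len(seq) // 2`, which is consistent with the `cut_seq` example and the `considering string` outputs on this page.

```python
def cut_seq_sketch(seq: str, min_position: int, max_position: int) -> str:
    """Return the substring spanning [min_position, max_position] around the center."""
    center = len(seq) // 2  # assumed convention: the phosphosite sits at len // 2
    start = max(center + min_position, 0)  # clamp so a negative index cannot wrap around
    end = center + max_position + 1  # +1 because the upper bound is inclusive
    return seq[start:end]


def sty_to_lower_sketch(seq: str) -> str:
    """Lowercase every S, T, and Y; leave all other characters untouched."""
    return seq.translate(str.maketrans("STY", "sty"))


def get_dict_sketch(seq: str) -> list[str]:
    """Annotate each residue with its position relative to the center, e.g. '-3P', '0s'."""
    center = len(seq) // 2
    return [f"{i - center}{aa}" for i, aa in enumerate(seq)]


print(cut_seq_sketch("AAkUuPSFSTtH", -5, 4))       # AkUuPSFSTt
print(get_dict_sketch("PSVEPPLsQETFSDL")[6:9])     # ['-1L', '0s', '1Q']
```

The position-annotated keys are what make the PSSM lookup a plain column lookup: each key names one column of the reference matrix.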
Scoring Functions
These functions aggregate PSSM values across positions to produce a single kinase score.
sumup — Simple summation of PSSM values at each position. Used with log-odds (LO) matrices.
multiply — Log-space multiplication of PSSM probabilities, normalized by number of possible amino acids. Used with probability matrices.
multiply_pspa — Like multiply, but uses a kinase-specific normalization factor from the PSPA random amino acid library.
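To make the difference between summation and log-space multiplication concrete, here is a toy sketch for a single kinase. The function names, the matrix values, and the default of 23 amino acids are illustrative assumptions, not the module's implementation.

```python
import math

# Toy single-kinase matrices keyed by position-annotated residues (hypothetical values).
log_odds = {"-1L": 0.5, "0s": 1.0, "1Q": 2.0}
probs = {"-1L": 0.10, "0s": 0.50, "1Q": 0.20}


def sumup_sketch(keys, pssm):
    """Add the matrix values for every key that the matrix covers."""
    return sum(pssm[k] for k in keys if k in pssm)


def multiply_sketch(keys, pssm, num_aa=23):
    """Log2 of the probability product relative to a uniform 1/num_aa background.

    The background exponent counts every used position except position 0.
    """
    used = [k for k in keys if k in pssm]
    n_flanking = sum(1 for k in used if not k.startswith("0"))
    log_prod = sum(math.log2(pssm[k]) for k in used)
    return log_prod + n_flanking * math.log2(num_aa)


keys = ["-1L", "0s", "1Q"]
print(sumup_sketch(keys, log_odds))  # 3.5
print(multiply_sketch(keys, probs))
```

Working in log space avoids underflow when many small probabilities are multiplied, and it turns the normalization into a simple additive term.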
Single Sequence Prediction
predict_kinase — Scores all kinases against a single phosphosite sequence. Returns a sorted Series of kinase scores.
predict_kinase(
input_string='PSVEPPLsQETFSDL', # 15-mer site sequence (lowercase = phosphosite)
ref=Data.get_pspa(), # reference PSSM DataFrame (kinases × positions)
func=multiply_pspa, # aggregation function (sumup or multiply_*)
to_lower=False, # convert STY→sty before scoring
to_upper=False, # convert all to uppercase before scoring
verbose=True, # print which positions were used
)
# Returns: pd.Series of scores indexed by kinase, sorted descending
Batch Prediction (DataFrame)
predict_kinase_df — Scores all kinases against multiple sequences in a DataFrame. Optimized for large-scale scoring.
predict_kinase_df(
df=Data.get_human_site().head(100), # DataFrame containing site sequences
seq_col='site_seq', # column name with sequences
ref=Data.get_pspa(), # reference PSSM DataFrame
func=multiply_pspa, # aggregation function
to_lower=False, # convert STY→sty
to_upper=False, # convert all to uppercase
)
# Returns: DataFrame (rows=input sites, cols=kinases) of scores
Percentile Scoring
These functions convert raw scores to percentiles based on a reference distribution.
get_pct — Calculates percentile rank for a single sequence against all kinases.
get_pct_df — Converts a DataFrame of raw scores to percentile ranks (vectorized).
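As a rough illustration of what a percentile conversion involves, the sketch below ranks one raw score against a reference distribution of scores for the same kinase. The exact percentile definition and the layout of the reference data used by `get_pct` and `get_pct_df` are not shown on this page, so the "percent of reference scores less than or equal to the score" rule and the function name are assumptions.

```python
import numpy as np


def percentile_rank_sketch(score: float, ref_scores) -> float:
    """Percent of reference scores that are less than or equal to `score`."""
    ref_sorted = np.sort(np.asarray(ref_scores, dtype=float))
    # side="right" returns the count of reference values <= score
    n_le = np.searchsorted(ref_sorted, score, side="right")
    return 100.0 * n_le / ref_sorted.size


ref = [-3.0, -1.0, 0.0, 2.0, 4.0]  # hypothetical reference scores for one kinase
print(percentile_rank_sketch(2.0, ref))  # 80.0
```

Because `np.searchsorted` also accepts an array of scores, the same idea extends to a whole column at once, which is the kind of vectorization a DataFrame-level function would rely on.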
Configuration Presets
Params — Returns pre-configured parameter dictionaries for different scoring modes.
Params(
name='PSPA', # preset name: 'CDDM', 'CDDM_upper', 'PSPA_st', 'PSPA_y', 'PSPA'
load=True, # whether to load the reference DataFrame immediately
)
# Returns: dict with keys 'ref', 'func', and optionally 'to_upper'/'to_lower'
# List available presets:
Params() # Returns: ['CDDM', 'CDDM_upper', 'PSPA_st', 'PSPA_y', 'PSPA']
| Preset | Reference | Function | Use Case |
|---|---|---|---|
| CDDM | Log-odds matrix | sumup | Known phosphorylated status |
| CDDM_upper | Log-odds (uppercase) | sumup | All uppercase only, unknown phosphorylated status |
| PSPA_st | PSPA S/T kinases | multiply_pspa | Known phosphorylated status; S/T only |
| PSPA_y | PSPA Y kinases | multiply_pspa | Known phosphorylated status; Y only |
| PSPA | All PSPA kinases | multiply_pspa | Known phosphorylated status |
Typical Workflow
# Single sequence scoring
scores = predict_kinase(
input_string='PSVEPPLsQETFSDL',
**Params('PSPA'),
)
# Batch scoring with percentiles on S/T sites
df = Data.get_human_site()
score_df = predict_kinase_df(
df=df,
seq_col='site_seq',
**Params('PSPA_st'),
)
pct_df = get_pct_df(
score_df=score_df,
pct_ref=Data.get_pspa_st_pct(),
)
Extract sequence based on a range relative to its center position
Replace all ‘STY’ with ‘sty’ in a sequence
Get a dictionary of the input string. The star marking the center is not needed; make sure the sequence is 15 or 10 residues long.
Multiply the probabilities of the amino acids at each position in a phosphorylation site
\[ \text{Score} = \log_2 \left( \frac{ \prod P_{\text{KinX}}(\text{AA}, \text{Position}) }{ \left( \frac{1}{\#\text{Random AA}} \right)^{\text{length(Position except 0)}} } \right) \]
The function implements the formula from Johnson et al., Nature: An atlas of substrate specificities for the human serine/threonine kinome, Supplementary Note 2 (page 160).
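Expanding the logarithm shows why the score can be computed as a sum in log space. Here \(N\) stands for \(\#\text{Random AA}\) and \(n\) for the number of positions excluding 0; both symbols are shorthand introduced for this step only.

\[ \text{Score} = \sum_{\text{Position}} \log_2 P_{\text{KinX}}(\text{AA}, \text{Position}) - n \log_2 \frac{1}{N} = \sum_{\text{Position}} \log_2 P_{\text{KinX}}(\text{AA}, \text{Position}) + n \log_2 N \]

The second term depends only on the number of scored positions and the size of the random library, so for a fixed sequence length it shifts every kinase's score by the same amount.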
Multiply class that accounts for a dynamic scale factor
Multiply values using a dynamic scale factor, which is the number of random amino acids in the PSPA library.
Sum up the PSSM values of the amino acids at each position in a phosphorylation site sequence
If ‘0S’, ‘0T’, ‘0Y’ exist with non-zero values, create ‘0s’, ‘0t’, ‘0y’ with same values. If ‘0s’, ‘0t’, ‘0y’ exist with non-zero values, create ‘0S’, ‘0T’, ‘0Y’ with same values.
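The mirroring rule can be sketched with pandas as below. The function name is hypothetical, "exist with non-zero values" is interpreted as "the column is present and has at least one non-zero entry", and a pair where both cases are already populated is left untouched; all three points are assumptions about behavior the description leaves open.

```python
import pandas as pd


def mirror_zero_position_sketch(ref: pd.DataFrame) -> pd.DataFrame:
    """Copy position-0 columns between uppercase and lowercase S/T/Y."""
    out = ref.copy()
    for upper, lower in (("0S", "0s"), ("0T", "0t"), ("0Y", "0y")):
        has_upper = upper in ref.columns and bool((ref[upper] != 0).any())
        has_lower = lower in ref.columns and bool((ref[lower] != 0).any())
        if has_upper and not has_lower:
            out[lower] = ref[upper]  # create the lowercase twin
        elif has_lower and not has_upper:
            out[upper] = ref[lower]  # create the uppercase twin
    return out


# Hypothetical reference with mixed-case position-0 columns
ref = pd.DataFrame(
    {"0S": [0.6, 0.4], "0T": [0.4, 0.6], "0y": [1.0, 0.5], "1Q": [0.1, 0.2]},
    index=["KIN_A", "KIN_B"],
)
mirrored = mirror_zero_position_sketch(ref)
print(sorted(mirrored.columns))
```

With both cases present, a sequence scores the same at position 0 whether the center residue is written as phosphorylated or not.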
Convert pS/T/Y in ref columns to s/t/y if any; mirror 0S/T/Y to 0s/t/y.
Predict kinase given a phosphorylation site sequence
PSPA scoring:
considering string: ['-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S']
kinase
ATM 5.037
SMG1 4.385
DNAPK 3.818
ATR 3.507
FAM20C 3.170
...
PKN1 -7.275
P70S6K -7.295
AKT3 -7.375
PKCI -7.742
NEK3 -8.254
Length: 303, dtype: float64
CDDM scoring, LO + sum
considering string: ['-7P', '-6S', '-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S', '6D', '7L']
ATR 12.751
ATM 10.960
DNAPK 6.039
SRPK2 2.079
SMMLCK 1.876
...
ROR1 -89.216
CDC7 -91.457
CAMK1B -91.577
TNNI3K -118.835
BRAF -134.851
Length: 328, dtype: float64
CDDM scoring, PSSM + multiply (#23aa)
considering string: ['-7P', '-6S', '-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S', '6D', '7L']
ATR 16.824
ATM 15.033
DNAPK 10.112
SRPK2 6.152
SMMLCK 5.949
...
ROR1 -85.143
CDC7 -87.384
CAMK1B -87.503
TNNI3K -114.762
BRAF -130.778
Length: 328, dtype: float64
CDDM scoring, PSSM + multiply (#20aa)
considering string: ['-7P', '-6S', '-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S', '6D', '7L']
ATR 16.587
ATM 14.362
DNAPK 10.430
SRPK2 8.044
CHK2 7.955
...
TTK -43.375
GAK -45.159
CAMK1B -69.395
TNNI3K -70.993
BRAF -109.130
Length: 328, dtype: float64
Here we provide different PSSM settings, derived from either PSPA data or kinase-substrate datasets, for kinase prediction:
considering string: ['-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S']
kinase
ATM 5.037
SMG1 4.385
DNAPK 3.818
ATR 3.507
FAM20C 3.170
dtype: float64
considering string: ['-7P', '-6S', '-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S', '6D', '7L']
ATR 12.751
ATM 10.960
DNAPK 6.039
SRPK2 2.079
SMMLCK 1.876
dtype: float64
considering string: ['-7P', '-6S', '-5V', '-4E', '-3P', '-2P', '-1L', '0S', '1Q', '2E', '3T', '4F', '5S', '6D', '7L']
ATR 11.815
ATM 9.590
DNAPK 5.659
SRPK2 3.272
CHK2 3.183
dtype: float64
Multiply-based log-sum aggregation across kinases.
Predict kinase scores based on reference PSSM or weight matrix. Applies preprocessing, merges long format keys, then aggregates using given func.
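The "long format, merge, aggregate" pattern described above can be sketched on toy data as follows. This is an illustration of the general technique under assumed inputs, not the module's code: the function name, the tiny reference matrix, and the 3-residue sequences are hypothetical, and summation is used as the aggregation, which suits a log-odds reference.

```python
import pandas as pd

# Toy log-odds reference (hypothetical values): rows are kinases,
# columns are position-annotated residues.
ref = pd.DataFrame(
    {"-1L": [0.5, -1.0], "-1A": [0.0, 0.3], "0s": [1.0, 1.0],
     "1Q": [2.0, -0.5], "1P": [-1.0, 1.5]},
    index=["KIN_A", "KIN_B"],
)
sites = pd.DataFrame({"site_seq": ["LsQ", "AsP"]})


def score_sites_sketch(df, seq_col, ref):
    """Score every site against every kinase by summing matched reference values."""
    # 1) One row per (site, position-annotated residue)
    rows = []
    for site, seq in zip(df.index, df[seq_col]):
        center = len(seq) // 2
        rows.extend((site, f"{i - center}{aa}") for i, aa in enumerate(seq))
    site_long = pd.DataFrame(rows, columns=["site", "key"])

    # 2) One row per (key, kinase, value) from the reference
    ref_long = (
        ref.T.rename_axis("key")
        .reset_index()
        .melt(id_vars="key", var_name="kinase", value_name="value")
    )

    # 3) Merge on the key; keys missing from the reference drop out
    merged = site_long.merge(ref_long, on="key", how="inner")

    # 4) Aggregate per site and kinase
    return merged.groupby(["site", "kinase"])["value"].sum().unstack("kinase")


scores = score_sites_sketch(sites, "site_seq", ref)
print(scores)
```

Doing the lookup as one merge instead of a per-sequence loop over kinases is what makes this style of scoring scale to large site tables.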
Replicate the percentile results from The Kinase Library.
considering string: ['-5V', '-4E', '-3P', '-2P', '-1L', '0Y', '1Q', '2E', '3T', '4F', '5S']
| | log2(score) | percentile |
|---|---|---|
| ABL2 | 3.137 | 96.568694 |
| BMX | 2.816 | 96.117567 |
| BTK | 1.956 | 95.693780 |
| CSK | 2.303 | 95.174299 |
| MERTK | 2.509 | 93.588517 |
| ... | ... | ... |
| FLT1 | -1.919 | 25.358852 |
| PINK1_TYR | -1.227 | 21.927546 |
| MUSK | -3.031 | 21.298701 |
| TNNI3K_TYR | -3.549 | 11.004785 |
| PKMYT1_TYR | -1.739 | 4.798360 |
93 rows × 2 columns
considering string: ['-5V', '-4E', '-3P', '-2P', '-1L', '0S', '1Q', '2E', '3T', '4F']
| | log2(score) | percentile |
|---|---|---|
| ATM | 5.037 | 99.822351 |
| SMG1 | 4.385 | 99.831819 |
| DNAPK | 3.818 | 99.205315 |
| ATR | 3.507 | 99.680344 |
| FAM20C | 3.170 | 95.370556 |
| ... | ... | ... |
| PKN1 | -7.275 | 14.070436 |
| P70S6K | -7.295 | 4.089816 |
| AKT3 | -7.375 | 11.432995 |
| PKCI | -7.742 | 8.129511 |
| NEK3 | -8.254 | 4.637240 |
303 rows × 2 columns
Replicate the percentile results from The Kinase Library.