from katlas.common import *
KATLAS
KATLAS is a repository of Python tools for predicting kinases from a substrate sequence. It also contains datasets of kinase substrate specificities and human phosphoproteomics.
References
Please cite the appropriate papers if KATLAS is helpful to your research.
- KATLAS is described in the paper Computational Decoding of Human Kinome Substrate Specificities and Functions.
- The positional scanning peptide array (PSPA) data are from the papers An atlas of substrate specificities for the human serine/threonine kinome and The intrinsic substrate specificity of the human tyrosine kinome.
- The kinase substrate datasets used for generating PSSMs are derived from PhosphoSitePlus and the paper Large-scale Discovery of Substrates of the Human Kinome.
- Phosphorylation sites are taken from PhosphoSitePlus, the paper The functional landscape of the human phosphoproteome, and CPTAC / LinkedOmics.
Reproduce datasets & figures
Follow the instructions in katlas_raw: https://github.com/sky1ove/katlas_raw
Web applications
Users can now run the analysis directly on the web without needing to code.
Check out our latest web platform: kinase-atlas.com
Install
uv:
uv add -U python-katlas
pip:
pip install -U python-katlas
If you use the machine-learning modules, install the development version: pip install -U "python-katlas[dev]"
Import
from katlas.common import *
Quick start
We provide two methods to score a substrate sequence against kinases:
- Computational Data-Driven Method (CDDM)
- Positional Scanning Peptide Array (PSPA)
Input can take two forms:
- a single string (one phosphorylation site)
- a CSV file or DataFrame that contains a column of phosphorylation sites
Input sequences can be written in two ways:
- all capital
- with lowercase letters indicating phosphorylation status
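The scoring calls in this README print a "considering string" list in which each residue is tagged with its offset from the central site. The following is a minimal sketch of that labeling, assuming the phospho-acceptor sits at the middle index of the input; `label_positions` is a hypothetical helper written for illustration and is not part of the katlas API.

```python
def label_positions(site_seq: str) -> list[str]:
    """Tag each residue with its offset from the assumed central acceptor."""
    # Assumption: the acceptor is at the middle index (len // 2),
    # inferred from the printed examples in this README.
    center = len(site_seq) // 2
    return [f"{i - center}{aa}" for i, aa in enumerate(site_seq)]


print(label_positions("AEEKEyHsEGG"))
# ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2s', '3E', '4G', '5G']
```

Lowercase letters are kept as-is here, so the phosphorylation status survives the labeling step.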
Site scoring
CDDM, all capital
predict_kinase('AAAAAAASGAGSDN',**Params("CDDM_upper"))
considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0S', '1G', '2A', '3G', '4S', '5D', '6N']
GCN2 4.556
MPSK1 4.425
MEKK2 4.253
WNK3 4.213
WNK1 4.064
...
PDK1 -25.077
PDHK3 -25.346
CLK2 -27.251
ROR2 -27.582
DDR1 -53.581
Length: 328, dtype: float64
CDDM, with lower case indicating phosphorylation status
predict_kinase('AAAAAAAsGGAGsDN',**Params("CDDM"))
considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0s', '1G', '2G', '3A', '4G', '5s', '6D', '7N']
ROR1 8.355
WNK1 4.907
WNK2 4.782
ERK5 4.466
RIPK2 4.045
...
DDR1 -29.393
TNNI3K -29.884
CHAK1 -31.775
VRK1 -45.287
BRAF -49.403
Length: 328, dtype: float64
PSPA, with lower case indicating phosphorylation status
predict_kinase('AEEKEyHsEGG',**Params("PSPA"))
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2s', '3E', '4G', '5G']
kinase
EGFR 4.013
FGFR4 3.568
ZAP70 3.412
CSK 3.241
SYK 3.209
...
JAK1 -3.837
DDR2 -4.421
TNK2 -4.534
TNNI3K_TYR -4.651
TNK1 -5.320
Length: 93, dtype: float64
To replicate the results from The Kinase Library (PSPA)
Check The Kinase Library and rank by log2(score); it gives the same results as the output below, with slight differences due to rounding.
out = predict_kinase('AEEKEyHSEGG',**Params("PSPA"))
out
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2S', '3E', '4G', '5G']
kinase
EGFR 3.181
FGFR4 2.390
CSK 2.308
ZAP70 2.068
SYK 1.998
...
EPHA1 -3.501
FES -3.699
TNK1 -4.269
TNK2 -4.577
DDR2 -4.920
Length: 93, dtype: float64
- The Kinase Library currently treats all tyrosine sequences as all capital, even when they contain lowercase letters; this is a small bug and should be fixed soon.
- A kinase name ending in “_TYR” denotes a dual-specificity kinase tested in the PSPA tyrosine setting; these kinases have not been included in The Kinase Library yet.
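To make the log2(score) ranking concrete, here is a minimal sketch of position-specific matrix scoring. It assumes the score is the product of per-position matrix values, so that log2(score) is the sum of their logs; the source does not spell out the formula. The matrix `toy_pssm` and the function `log2_score` are hypothetical, and the numbers are made up, not real PSPA data.

```python
import math

# Hypothetical toy matrix: offset -> {residue: normalized value}.
# These numbers are invented for illustration only.
toy_pssm = {
    -1: {"E": 2.0, "A": 0.5},
    0: {"y": 1.0},
    1: {"H": 4.0, "G": 0.25},
}


def log2_score(site_seq: str, pssm: dict[int, dict[str, float]]) -> float:
    """Sum log2 of the matrix value at each offset (log2 of the product)."""
    center = len(site_seq) // 2  # assumed: acceptor at the middle index
    total = 0.0
    for i, aa in enumerate(site_seq):
        value = pssm.get(i - center, {}).get(aa)
        if value is not None:  # skip offsets/residues the toy matrix lacks
            total += math.log2(value)
    return total


print(log2_score("EyH", toy_pssm))  # log2(2.0 * 1.0 * 4.0) = 3.0
```

Summing logs rather than multiplying raw values avoids underflow on long sequences and gives the same ranking.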
We can also calculate a percentile for each score using a reference score sheet.
# Percentile reference sheet
y_pct = Data.get_pspa_tyr_pct()
get_pct('AEEKEyHSEGG',pct_ref = y_pct,**Params("PSPA_y"))
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0Y', '1H', '2S', '3E', '4G', '5G']
| | log2(score) | percentile |
---|---|---|
EGFR | 3.181 | 96.787423 |
FGFR4 | 2.390 | 94.012303 |
CSK | 2.308 | 95.201640 |
ZAP70 | 2.068 | 88.380041 |
SYK | 1.998 | 85.522898 |
... | ... | ... |
EPHA1 | -3.501 | 12.139440 |
FES | -3.699 | 21.216678 |
TNK1 | -4.269 | 5.481887 |
TNK2 | -4.577 | 2.050581 |
DDR2 | -4.920 | 10.403281 |
93 rows × 2 columns
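The percentile column can be read as the share of reference scores that a site's score meets or exceeds for a given kinase. Below is a minimal sketch of that idea, assuming a "fraction of reference scores less than or equal to the score" definition; the exact definition used by `get_pct` may differ. `percentile_of` and the `reference` list are hypothetical and the values are made up.

```python
from bisect import bisect_right


def percentile_of(score: float, reference_scores: list[float]) -> float:
    """Percent of reference scores that are <= the given score."""
    ordered = sorted(reference_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)


# Made-up reference distribution for one kinase (illustration only).
reference = [-4.0, -2.5, -1.0, 0.0, 0.5, 1.5, 2.0, 3.0, 3.5, 4.0]
print(percentile_of(3.181, reference))  # 80.0
```

Because each kinase would have its own reference distribution under this reading, a higher log2(score) for one kinase need not mean a higher percentile than another kinase, which is consistent with the table above.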
Site scoring in a df
Load your csv:
# df = pd.read_csv('your_file.csv')
Or load a demo df
# Load a demo df with phosphorylation sites
df = Data.get_ochoa_site().head()
df.iloc[:,-2:]
| | site_seq | gene_site |
---|---|---|
0 | VDDEKGDSNDDYDSA | A0A075B6Q4_S24 |
1 | YDSAGLLSDEDCMSV | A0A075B6Q4_S35 |
2 | IADHLFWSEETKSRF | A0A075B6Q4_S57 |
3 | KSRFTEYSMTSSVMR | A0A075B6Q4_S68 |
4 | FTEYSMTSSVMRRNE | A0A075B6Q4_S71 |
Set the column name and param to calculate
Here we choose Params("CDDM_upper"), as the sequences in the demo df are all capital. You can also choose other params.
results = predict_kinase_df(df,'site_seq',**Params("CDDM_upper"))
results
input dataframe has a length 5
Preprocessing
Finish preprocessing
Merging reference
Finish merging
| | SRC | EPHA3 | FES | NTRK3 | ALK | ABL1 | FLT3 | EPHA8 | EPHB2 | EPHB1 | ... | VRK1 | PKMYT1 | GRK3 | CAMK1B | CDC7 | SMMLCK | ROR1 | GAK | MAST2 | BRAF |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -2.440640 | -0.818753 | -1.663990 | -0.738991 | -2.047628 | -3.602344 | -3.200998 | -0.935176 | -1.388444 | -1.859450 | ... | -17.103237 | -113.698143 | -16.848783 | -41.520172 | -41.646187 | 1.284159 | -26.566362 | -69.165062 | -17.706400 | -87.763214 |
1 | -3.838486 | -2.735969 | -2.533986 | -2.150399 | -3.792498 | -4.725527 | -5.711791 | -4.534240 | -3.148449 | -2.511518 | ... | -67.889053 | -68.652641 | -45.833855 | -64.171600 | -39.465572 | -65.061722 | -109.561707 | -85.911224 | -60.105064 | -63.889122 |
2 | -2.610423 | -2.370090 | -3.235637 | -1.508413 | -2.571347 | -3.740941 | -3.025596 | -3.373504 | -2.776297 | -3.060740 | ... | -15.798462 | -45.905319 | -61.440742 | -67.695694 | -55.047962 | -42.135216 | -38.501572 | -62.624382 | -56.119389 | -107.060989 |
3 | -5.180541 | -4.201880 | -5.766463 | -3.038421 | -3.836897 | -4.249900 | -5.029885 | -5.411311 | -4.713308 | -4.827825 | ... | -96.978317 | -83.419777 | -22.559393 | -110.611588 | -63.283070 | -37.240440 | -24.497492 | -112.878151 | -43.538158 | -60.348518 |
4 | -2.844254 | -3.322700 | -3.681745 | -1.766435 | -2.666579 | -3.748774 | -4.083619 | -3.912834 | -3.724181 | -3.948160 | ... | -35.824612 | -87.983566 | -83.312317 | -107.162407 | -61.478374 | -85.793571 | -43.738819 | -47.004211 | -42.281624 | -59.518513 |
5 rows × 328 columns
results.iloc[0].sort_values(ascending=False)
TLK2 8.264621
GCN2 8.101542
TLK1 7.693897
HRI 6.691402
PLK3 6.579368
...
NIK -64.605148
SRPK2 -67.300667
GAK -69.165062
BRAF -87.763214
PKMYT1 -113.698143
Name: 0, Length: 328, dtype: float32
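When only the best few kinases per site are needed, a full sort is not required. The sketch below picks the top-n entries from one row of scores using the standard library; `top_kinases` is a hypothetical helper, and the `row` dictionary reuses a handful of values from the output above rather than the full 328-kinase row.

```python
import heapq


def top_kinases(scores: dict[str, float], n: int = 3) -> list[tuple[str, float]]:
    """Return the n highest-scoring (kinase, score) pairs, best first."""
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


# A few values copied from the first row of the results shown above.
row = {
    "TLK2": 8.264621,
    "GCN2": 8.101542,
    "TLK1": 7.693897,
    "BRAF": -87.763214,
    "PKMYT1": -113.698143,
}
print(top_kinases(row, n=3))
# [('TLK2', 8.264621), ('GCN2', 8.101542), ('TLK1', 7.693897)]
```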
Dataset
Besides calculating sequence scores, we also provide multiple datasets of phosphorylation sites.
CPTAC pan-cancer phosphoproteomics
df = Data.get_cptac_ensembl_site()
df.head(3)
| | gene | site | site_seq | protein | gene_name | gene_site | protein_site |
---|---|---|---|---|---|---|---|
0 | ENSG00000003056.8 | S267 | DDQLGEESEERDDHL | ENSP00000000412.3 | M6PR | M6PR_S267 | ENSP00000000412_S267 |
1 | ENSG00000003056.8 | S267 | DDQLGEESEERDDHL | ENSP00000440488.2 | M6PR | M6PR_S267 | ENSP00000440488_S267 |
2 | ENSG00000048028.11 | S1053 | PPTIRPNSPYDLCSR | ENSP00000003302.4 | USP28 | USP28_S1053 | ENSP00000003302_S1053 |
Ochoa et al. human phosphoproteome
df = Data.get_ochoa_site()
df.head(3)
| | uniprot | position | residue | is_disopred | disopred_score | log10_hotspot_pval_min | isHotspot | uniprot_position | functional_score | current_uniprot | name | gene | Sequence | is_valid | site_seq | gene_site |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | A0A075B6Q4 | 24 | S | 1.0 | 0.91 | 6.839384 | 1.0 | A0A075B6Q4_24 | 0.149257 | A0A075B6Q4 | A0A075B6Q4_HUMAN | None | MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... | True | VDDEKGDSNDDYDSA | A0A075B6Q4_S24 |
1 | A0A075B6Q4 | 35 | S | 1.0 | 0.87 | 9.192622 | 0.0 | A0A075B6Q4_35 | 0.136966 | A0A075B6Q4 | A0A075B6Q4_HUMAN | None | MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... | True | YDSAGLLSDEDCMSV | A0A075B6Q4_S35 |
2 | A0A075B6Q4 | 57 | S | 0.0 | 0.28 | 0.818834 | 0.0 | A0A075B6Q4_57 | 0.125364 | A0A075B6Q4 | A0A075B6Q4_HUMAN | None | MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... | True | IADHLFWSEETKSRF | A0A075B6Q4_S57 |
PhosphoSitePlus human phosphorylation sites
df = Data.get_psp_human_site()
df.head(3)
| | gene | protein | uniprot | site | gene_site | SITE_GRP_ID | species | site_seq | LT_LIT | MS_LIT | MS_CST | CST_CAT# | Ambiguous_Site |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | YWHAB | 14-3-3 beta | P31946 | T2 | YWHAB_T2 | 15718712 | human | ______MtMDksELV | NaN | 3.0 | 1.0 | None | 0 |
1 | YWHAB | 14-3-3 beta | P31946 | S6 | YWHAB_S6 | 15718709 | human | __MtMDksELVQkAk | NaN | 8.0 | NaN | None | 0 |
2 | YWHAB | 14-3-3 beta | P31946 | Y21 | YWHAB_Y21 | 3426383 | human | LAEQAERyDDMAAAM | NaN | NaN | 4.0 | None | 0 |
Unique sites of combined Ochoa & PhosphoSitePlus
df = Data.get_combine_site_psp_ochoa()
df.head(3)
| | uniprot | gene | site | site_seq | source | AM_pathogenicity | CDDM_upper | CDDM_max_score |
---|---|---|---|---|---|---|---|---|
0 | A0A024R4G9 | C19orf48 | S20 | ITGSRLLSMVPGPAR | psp | NaN | PRKX,AKT1,PKG1,P90RSK,HIPK4,AKT3,HIPK1,PKACB,H... | 2.407041 |
1 | A0A075B6Q4 | None | S24 | VDDEKGDSNDDYDSA | ochoa | NaN | CK2A2,CK2A1,GRK7,GRK5,CK1G1,CK1A,IKKA,CK1G2,CA... | 2.295654 |
2 | A0A075B6Q4 | None | S35 | YDSAGLLSDEDCMSV | ochoa | NaN | CK2A2,CK2A1,IKKA,ATM,IKKB,CAMK1D,MARK2,GRK7,IK... | 2.488683 |
Phosphorylation site sequence examples
All capital, length 15 (-7 to +7)
- QSEEEKLSPSPTTED
- TLQHVPDYRQNVYIP
- TMGLSARYGPQFTLQ
All capital, length 10 (-5 to +4)
- SRDPHYQDPH
- LDNPDYQQDF
- AAAAASGGAG
With lowercase, length 15 (-7 to +7)
- QsEEEKLsPsPTTED
- TLQHVPDyRQNVYIP
- TMGLsARyGPQFTLQ
With lowercase, length 10 (-5 to +4)
- sRDPHyQDPH
- LDNPDyQQDF
- AAAAAsGGAG
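The two notations can be related with a few lines of Python. This is a minimal sketch, assuming lowercase s, t, and y mark phosphorylated residues and that the central acceptor sits at the middle index; `phospho_offsets` is a hypothetical helper, not a katlas function.

```python
def phospho_offsets(site_seq: str) -> list[int]:
    """Offsets (relative to the assumed center) of lowercase s/t/y residues."""
    center = len(site_seq) // 2  # assumed: acceptor at the middle index
    return [i - center for i, aa in enumerate(site_seq) if aa in "sty"]


seq = "QsEEEKLsPsPTTED"
print(seq.upper())           # QSEEEKLSPSPTTED (the all-capital form)
print(phospho_offsets(seq))  # [-6, 0, 2]
```

Converting with `str.upper()` discards the phosphorylation status of the neighboring residues, so the lowercase form carries strictly more information.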