KATLAS
KATLAS is a repository of Python tools that predict kinases from a substrate sequence. It also contains datasets of kinase substrate specificities and human phosphoproteomics.
References
Please cite the appropriate papers if KATLAS is helpful to your research.
- KATLAS is described in the paper "Computational Decoding of Human Kinome Substrate Specificities and Functions".
- The positional scanning peptide array (PSPA) data are from the papers "An atlas of substrate specificities for the human serine/threonine kinome" and "The intrinsic substrate specificity of the human tyrosine kinome".
- The kinase substrate datasets used for generating PSSMs are derived from PhosphoSitePlus and the paper "Large-scale Discovery of Substrates of the Human Kinome".
- Phosphorylation sites are taken from PhosphoSitePlus, the paper "The functional landscape of the human phosphoproteome", and CPTAC / LinkedOmics.
Reproduce datasets & figures
Follow the instructions in katlas_raw: https://github.com/sky1ove/katlas_raw
This requires installing the package with the dev extras: pip install 'python-katlas[dev]' -U
Web applications
Users can now run the analysis directly on the web without needing to code.
Check out our latest web platform: kinase-atlas.com
Tutorials on Colab
Install
pip install python-katlas -U
To use modules other than the core, install the dev extras: pip install 'python-katlas[dev]' -U
Import
from katlas.core import *
Quick start
We provide two methods to score a substrate sequence against kinases:
- Computational Data-Driven Method (CDDM)
- Positional Scanning Peptide Array (PSPA)
Input can be given in two formats:
- a single input string (phosphorylation site)
- a csv/dataframe that contains a column of phosphorylation sites
Input sequences can follow either of two conventions:
- all uppercase
- lowercase letters marking phosphorylated residues
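As a rough illustration of these input conventions, the sketch below labels each residue with its position relative to the central site (the same labelling that appears in the "considering string" output) and checks whether a sequence carries lowercase phosphorylation marks. The helpers tokenize_site and has_phospho_marks are hypothetical names written for this example; they are not part of katlas and may not match its internal preprocessing.

```python
from __future__ import annotations


def tokenize_site(seq: str) -> list[str]:
    """Label each residue with its offset from the central phosphosite."""
    # Assumption: the site sits at index len(seq) // 2
    # (index 7 for a 15-mer, index 5 for a 10-mer or 11-mer).
    center = len(seq) // 2
    return [f"{i - center}{aa}" for i, aa in enumerate(seq)]


def has_phospho_marks(seq: str) -> bool:
    """Return True if any residue is a lowercase s, t, or y."""
    return any(aa in "sty" for aa in seq)


print(tokenize_site("AEEKEyHsEGG"))
print(has_phospho_marks("AAAAAAASGGAGSDN"))  # all uppercase
print(has_phospho_marks("AAAAAAAsGGAGsDN"))  # lowercase marks present
```

A check like has_phospho_marks could help decide between an uppercase parameter set such as "CDDM_upper" and one that uses phosphorylation status such as "CDDM".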
Single sequence as input
CDDM, all capital
predict_kinase('AAAAAAASGGAGSDN',**Params("CDDM_upper"))
considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0S', '1G', '2G', '3A', '4G', '5S', '6D', '7N']
kinase
PAK6 2.032
ULK3 2.032
PRKX 2.012
ATR 1.991
PRKD1 1.988
...
DDR2 0.928
EPHA4 0.928
TEK 0.921
KIT 0.915
FGFR3 0.910
Length: 289, dtype: float64
CDDM, with lower case indicating phosphorylation status
predict_kinase('AAAAAAAsGGAGsDN',**Params("CDDM"))
considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0s', '1G', '2G', '3A', '4G', '5s', '6D', '7N']
kinase
ULK3 1.987
PAK6 1.981
PRKD1 1.946
PIM3 1.944
PRKX 1.939
...
EPHA4 0.905
EGFR 0.900
TEK 0.898
FGFR3 0.894
KIT 0.882
Length: 289, dtype: float64
PSPA, with lower case indicating phosphorylation status
predict_kinase('AEEKEyHsEGG',**Params("PSPA")).head()
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2s', '3E', '4G', '5G']
kinase
EGFR 4.013
FGFR4 3.568
ZAP70 3.412
CSK 3.241
SYK 3.209
dtype: float64
To replicate the results from The Kinase Library (PSPA)
Score the same sequence on The Kinase Library website and rank by log2(score). The ranking matches the output below, with slight differences due to rounding.
predict_kinase('AEEKEyHSEGG',**Params("PSPA")).head(10)
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2S', '3E', '4G', '5G']
kinase
EGFR 3.181
FGFR4 2.390
CSK 2.308
ZAP70 2.068
SYK 1.998
PDHK1_TYR 1.922
RET 1.732
MATK 1.688
FLT1 1.627
BMPR2_TYR 1.456
dtype: float64
- The Kinase Library currently treats all tyrosine sequences as uppercase, regardless of whether they contain lowercase letters. This is a small bug and should be fixed soon.
- A kinase name ending in "_TYR" indicates a dual-specificity kinase tested in the PSPA tyrosine setting; these kinases are not yet included in The Kinase Library.
We can also calculate a percentile score using a reference score sheet.
# Percentile reference sheet
y_pct = Data.get_pspa_tyr_pct()
get_pct('AEEKEyHSEGG',**Params("PSPA_y"), pct_ref = y_pct)
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0Y', '1H', '2S', '3E', '4G', '5G']
| | log2(score) | percentile |
---|---|---|
EGFR | 3.181 | 96.787423 |
FGFR4 | 2.390 | 94.012303 |
CSK | 2.308 | 95.201640 |
ZAP70 | 2.068 | 88.380041 |
SYK | 1.998 | 85.522898 |
... | ... | ... |
EPHA1 | -3.501 | 12.139440 |
FES | -3.699 | 21.216678 |
TNK1 | -4.269 | 5.481887 |
TNK2 | -4.577 | 2.050581 |
DDR2 | -4.920 | 10.403281 |
93 rows × 2 columns
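To illustrate the idea of a percentile against a reference distribution, here is a minimal sketch using one common definition: the percentage of reference scores that are less than or equal to the query score. The function percentile_of and the reference numbers are made up for this example; the actual reference sheet format and the calculation inside get_pct may differ.

```python
from __future__ import annotations

from bisect import bisect_right


def percentile_of(score: float, reference: list[float]) -> float:
    """Percentage of reference scores that are <= score."""
    ordered = sorted(reference)
    return 100.0 * bisect_right(ordered, score) / len(ordered)


# Hypothetical reference log2(score) values for a single kinase (made-up numbers)
reference_scores = [-4.0, -2.5, -1.0, 0.0, 0.5, 1.2, 2.0, 2.8, 3.5, 4.1]

print(percentile_of(3.181, reference_scores))  # 80.0
```

If each kinase is compared against its own reference distribution, a lower log2(score) can still yield a higher percentile, which would be consistent with the table above, where CSK scores below FGFR4 but has the higher percentile.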
High-throughput substrate scoring on a dataframe
Load your csv
# df = pd.read_csv('your_file.csv')
Load a demo df
# Load a demo df with phosphorylation sites
df = Data.get_ochoa_site().head()
df.iloc[:, -2:]
| | site_seq | gene_site |
---|---|---|
0 | VDDEKGDSNDDYDSA | A0A075B6Q4_S24 |
1 | YDSAGLLSDEDCMSV | A0A075B6Q4_S35 |
2 | IADHLFWSEETKSRF | A0A075B6Q4_S57 |
3 | KSRFTEYSMTSSVMR | A0A075B6Q4_S68 |
4 | FTEYSMTSSVMRRNE | A0A075B6Q4_S71 |
Set the column name and param to calculate
Here we choose Params("CDDM_upper") because the sequences in the demo df are all uppercase. You can also choose other params.
results = predict_kinase_df(df,'site_seq',**Params("CDDM_upper"))
results
input dataframe has a length 5
Preprocessing
Finish preprocessing
Merging reference
Finish merging
| | SRC | EPHA3 | FES | NTRK3 | ALK | EPHA8 | ABL1 | FLT3 | EPHB2 | FYN | ... | MEK5 | PKN2 | MAP2K7 | MRCKB | HIPK3 | CDK8 | BUB1 | MEKK3 | MAP2K3 | GRK1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.991760 | 1.093712 | 1.051750 | 1.067134 | 1.013682 | 1.097520 | 0.966379 | 0.982464 | 1.054986 | 1.055910 | ... | 1.314859 | 1.635470 | 1.652251 | 1.622672 | 1.362972 | 1.797155 | 1.305198 | 1.423618 | 1.504941 | 1.872020 |
1 | 0.910262 | 0.953743 | 0.942327 | 0.950601 | 0.872693 | 0.932586 | 0.846898 | 0.826662 | 0.915020 | 0.942713 | ... | 1.175454 | 1.402006 | 1.430392 | 1.215826 | 1.569373 | 1.716455 | 1.270999 | 1.195081 | 1.223082 | 1.793290 |
2 | 0.849866 | 0.899910 | 0.848895 | 0.879652 | 0.874959 | 0.899414 | 0.839200 | 0.836523 | 0.858040 | 0.867269 | ... | 1.408003 | 1.813738 | 1.454786 | 1.084522 | 1.352556 | 1.524663 | 1.377839 | 1.173830 | 1.305691 | 1.811849 |
3 | 0.803826 | 0.836527 | 0.800759 | 0.894570 | 0.839905 | 0.781001 | 0.847847 | 0.807039 | 0.805877 | 0.801401 | ... | 1.110307 | 1.703637 | 1.795092 | 1.469653 | 1.549935 | 1.491344 | 1.446922 | 1.055452 | 1.534895 | 1.741090 |
4 | 0.822793 | 0.796532 | 0.792343 | 0.839882 | 0.810122 | 0.781420 | 0.805251 | 0.795022 | 0.790380 | 0.864538 | ... | 1.062617 | 1.357689 | 1.485945 | 1.249266 | 1.456078 | 1.422782 | 1.376471 | 1.089629 | 1.121309 | 1.697524 |
5 rows × 289 columns
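Each row of the result holds one score per kinase, so a natural next step is to pick the highest-scoring kinases for a site. The sketch below does this in plain Python for a few values copied from row 0 of the table above. It covers only a subset of the displayed columns, so the ranking is illustrative and not the true top hits across all 289 kinases. The helper top_kinases is a hypothetical name for this example.

```python
from __future__ import annotations

# A few scores copied from row 0 of the table above (subset of columns only)
row0 = {
    "SRC": 0.991760,
    "EPHA3": 1.093712,
    "PKN2": 1.635470,
    "MAP2K7": 1.652251,
    "MRCKB": 1.622672,
    "CDK8": 1.797155,
    "GRK1": 1.872020,
}


def top_kinases(scores: dict[str, float], k: int = 3) -> list[str]:
    """Return the k kinase names with the highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


print(top_kinases(row0))  # ['GRK1', 'CDK8', 'MAP2K7']
```

On the full dataframe, assuming results is a pandas DataFrame with one column per kinase, results.iloc[0].nlargest(3) would give the equivalent ranking for the first site.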
Phosphorylation sites
Besides calculating sequence scores, we also provide multiple datasets of phosphorylation sites.
CPTAC pan-cancer phosphoproteomics
df = Data.get_cptac_ensembl_site()
df.head(3)
| | gene | site | site_seq | protein | gene_name | gene_site | protein_site |
---|---|---|---|---|---|---|---|
0 | ENSG00000003056.8 | S267 | DDQLGEESEERDDHL | ENSP00000000412.3 | M6PR | M6PR_S267 | ENSP00000000412_S267 |
1 | ENSG00000003056.8 | S267 | DDQLGEESEERDDHL | ENSP00000440488.2 | M6PR | M6PR_S267 | ENSP00000440488_S267 |
2 | ENSG00000048028.11 | S1053 | PPTIRPNSPYDLCSR | ENSP00000003302.4 | USP28 | USP28_S1053 | ENSP00000003302_S1053 |
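Judging from the rows above, the gene_site and protein_site columns look like the gene name, or the Ensembl protein ID without its version suffix, joined to the site with an underscore. The sketch below reproduces that pattern. It is inferred from the displayed rows only, and make_site_ids is a hypothetical helper, not a katlas function.

```python
from __future__ import annotations


def make_site_ids(gene_name: str, protein: str, site: str) -> tuple[str, str]:
    """Build gene_site and protein_site style identifiers for one site."""
    # Assumption: the version suffix after the dot is dropped from the protein ID
    protein_base = protein.split(".")[0]
    return f"{gene_name}_{site}", f"{protein_base}_{site}"


print(make_site_ids("M6PR", "ENSP00000000412.3", "S267"))
# ('M6PR_S267', 'ENSP00000000412_S267')
```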
Ochoa et al. human phosphoproteome
df = Data.get_ochoa_site()
df.head(3)
| | uniprot | position | residue | is_disopred | disopred_score | log10_hotspot_pval_min | isHotspot | uniprot_position | functional_score | current_uniprot | name | gene | Sequence | is_valid | site_seq | gene_site |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | A0A075B6Q4 | 24 | S | True | 0.91 | 6.839384 | True | A0A075B6Q4_24 | 0.149257 | A0A075B6Q4 | A0A075B6Q4_HUMAN | None | MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... | True | VDDEKGDSNDDYDSA | A0A075B6Q4_S24 |
1 | A0A075B6Q4 | 35 | S | True | 0.87 | 9.192622 | False | A0A075B6Q4_35 | 0.136966 | A0A075B6Q4 | A0A075B6Q4_HUMAN | None | MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... | True | YDSAGLLSDEDCMSV | A0A075B6Q4_S35 |
2 | A0A075B6Q4 | 57 | S | False | 0.28 | 0.818834 | False | A0A075B6Q4_57 | 0.125364 | A0A075B6Q4 | A0A075B6Q4_HUMAN | None | MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... | True | IADHLFWSEETKSRF | A0A075B6Q4_S57 |
PhosphoSitePlus human phosphorylation sites
df = Data.get_psp_human_site()
df.head(3)
| | gene | protein | uniprot | site | gene_site | SITE_GRP_ID | species | site_seq | LT_LIT | MS_LIT | MS_CST | CST_CAT# | Ambiguous_Site |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | YWHAB | 14-3-3 beta | P31946 | T2 | YWHAB_T2 | 15718712 | human | ______MtMDksELV | NaN | 3.0 | 1.0 | None | 0 |
1 | YWHAB | 14-3-3 beta | P31946 | S6 | YWHAB_S6 | 15718709 | human | __MtMDksELVQkAk | NaN | 8.0 | NaN | None | 0 |
2 | YWHAB | 14-3-3 beta | P31946 | Y21 | YWHAB_Y21 | 3426383 | human | LAEQAERyDDMAAAM | NaN | NaN | 4.0 | None | 0 |
Unique sites of combined Ochoa & PhosphoSitePlus
df = Data.get_combine_site_psp_ochoa()
df.head(3)
| | uniprot | gene | site | site_seq | source | AM_pathogenicity | CDDM_upper | CDDM_max_score |
---|---|---|---|---|---|---|---|---|
0 | A0A024R4G9 | C19orf48 | S20 | ITGSRLLSMVPGPAR | psp | NaN | PRKX,AKT1,PKG1,P90RSK,HIPK4,AKT3,HIPK1,PKACB,H... | 2.407041 |
1 | A0A075B6Q4 | None | S24 | VDDEKGDSNDDYDSA | ochoa | NaN | CK2A2,CK2A1,GRK7,GRK5,CK1G1,CK1A,IKKA,CK1G2,CA... | 2.295654 |
2 | A0A075B6Q4 | None | S35 | YDSAGLLSDEDCMSV | ochoa | NaN | CK2A2,CK2A1,IKKA,ATM,IKKB,CAMK1D,MARK2,GRK7,IK... | 2.488683 |
Phosphorylation site sequence examples
All uppercase, length 15 (-7 to +7)
- QSEEEKLSPSPTTED
- TLQHVPDYRQNVYIP
- TMGLSARYGPQFTLQ
All uppercase, length 10 (-5 to +4)
- SRDPHYQDPH
- LDNPDYQQDF
- AAAAASGGAG
With lowercase, length 15 (-7 to +7)
- QsEEEKLsPsPTTED
- TLQHVPDyRQNVYIP
- TMGLsARyGPQFTLQ
With lowercase, length 10 (-5 to +4)
- sRDPHyQDPH
- LDNPDyQQDF
- AAAAAsGGAG
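If you start from a full protein sequence and a 1-based site position, a window in one of these formats can be cut out as sketched below. Padding with "_" at the protein termini follows the convention visible in the PhosphoSitePlus site_seq column. The function extract_window is a hypothetical helper for this example, and the test sequence is only the visible prefix of the truncated Sequence value shown in the Ochoa table.

```python
def extract_window(protein_seq: str, position: int,
                   before: int = 7, after: int = 7, pad: str = "_") -> str:
    """Return the residues from -before to +after around a 1-based position."""
    idx = position - 1  # convert to 0-based index
    chars = []
    for i in range(idx - before, idx + after + 1):
        # Pad positions that fall outside the protein
        chars.append(protein_seq[i] if 0 <= i < len(protein_seq) else pad)
    return "".join(chars)


# Visible prefix of the Sequence column for A0A075B6Q4 (the rest is elided in the table)
prefix = "MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT"

print(extract_window(prefix, 24))                     # VDDEKGDSNDDYDSA
print(extract_window(prefix, 24, before=5, after=4))  # DEKGDSNDDY
print(extract_window(prefix, 6))                      # __MDIQKSENEDDSE
```

The first call reproduces the site_seq value listed for A0A075B6Q4_S24 in the Ochoa table.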