from katlas.core import *
KATLAS
KATLAS is a repository containing python tools to predict kinases given a substrate sequence. It also contains datasets of kinase substrate specificities and human phosphoproteomics.
References: Please cite the appropriate papers if KATLAS is helpful to your research.
KATLAS was described in the paper [Computational Decoding of Human Kinome Substrate Specificities and Functions]
The positional scanning peptide array (PSPA) data is from paper An atlas of substrate specificities for the human serine/threonine kinome and paper The intrinsic substrate specificity of the human tyrosine kinome
The kinase substrate datasets used for generating PSSMs are derived from PhosphoSitePlus and paper Large-scale Discovery of Substrates of the Human Kinome
Phosphorylation sites are acquired from PhosphoSitePlus, paper The functional landscape of the human phosphoproteome, and CPTAC / LinkedOmics
Reproduce datasets & figures
Follow the instructions in katlas_raw: https://github.com/sky1ove/katlas_raw
Need to install the package via: pip install 'python-katlas[dev]' -U
Web applications
Users can now run the analysis directly on the web without needing to code.
Check out our latest web platform: kinase-atlas.com
Tutorials on Colab
Install
pip install python-katlas -U
To use other modules besides the core, do pip install 'python-katlas[dev]' -U
Import
Quick start
We provide two methods to calculate substrate sequence:
- Computational Data-Driven Method (CDDM)
- Positional Scanning Peptide Array (PSPA)
We consider the input in two formats:
- a single input string (phosphorylation site)
- a csv/dataframe that contains a column of phosphorylation sites
For input sequences, we also consider it in two conditions:
- all capital
- contains lower cases indicating phosphorylation status
Single sequence as input
CDDM, all capital
'AAAAAAASGGAGSDN',**param_CDDM_upper) predict_kinase(
considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0S', '1G', '2G', '3A', '4G', '5S', '6D', '7N']
kinase
PAK6 2.032
ULK3 2.032
PRKX 2.012
ATR 1.991
PRKD1 1.988
...
DDR2 0.928
EPHA4 0.928
TEK 0.921
KIT 0.915
FGFR3 0.910
Length: 289, dtype: float64
CDDM, with lower case indicating phosphorylation status
'AAAAAAAsGGAGsDN',**param_CDDM) predict_kinase(
considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0s', '1G', '2G', '3A', '4G', '5s', '6D', '7N']
kinase
ULK3 1.987
PAK6 1.981
PRKD1 1.946
PIM3 1.944
PRKX 1.939
...
EPHA4 0.905
EGFR 0.900
TEK 0.898
FGFR3 0.894
KIT 0.882
Length: 289, dtype: float64
PSPA, with lower case indicating phosphorylation status
'AEEKEyHsEGG',**param_PSPA).head() predict_kinase(
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2s', '3E', '4G', '5G']
kinase
EGFR 4.013
FGFR4 3.568
ZAP70 3.412
CSK 3.241
SYK 3.209
dtype: float64
To replicate the results from The Kinase Library (PSPA)
Check this link: The Kinase Library, and use log2(score) to rank, it shows same results with the below (with slight differences due to rounding).
'AEEKEyHSEGG',**param_PSPA).head(10) predict_kinase(
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2S', '3E', '4G', '5G']
kinase
EGFR 3.181
FGFR4 2.390
CSK 2.308
ZAP70 2.068
SYK 1.998
PDHK1_TYR 1.922
RET 1.732
MATK 1.688
FLT1 1.627
BMPR2_TYR 1.456
dtype: float64
- So far The kinase Library considers all tyr sequences in capital regardless of whether or not they contain lower cases, which is a small bug and should be fixed soon.
- Kinase with “_TYR” indicates it is a dual specificity kinase tested in PSPA tyrosine setting, which has not been included in kinase-library yet.
We can also calculate the percentile score using a referenced score sheet.
# Percentile reference sheet
= Data.get_pspa_tyr_pct()
y_pct
'AEEKEyHSEGG',**param_PSPA_y, pct_ref = y_pct) get_pct(
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0Y', '1H', '2S', '3E', '4G', '5G']
log2(score) | percentile | |
---|---|---|
EGFR | 3.181 | 96.787423 |
FGFR4 | 2.390 | 94.012303 |
CSK | 2.308 | 95.201640 |
ZAP70 | 2.068 | 88.380041 |
SYK | 1.998 | 85.522898 |
... | ... | ... |
EPHA1 | -3.501 | 12.139440 |
FES | -3.699 | 21.216678 |
TNK1 | -4.269 | 5.481887 |
TNK2 | -4.577 | 2.050581 |
DDR2 | -4.920 | 10.403281 |
93 rows × 2 columns
High-throughput substrate scoring on a dataframe
Load your csv
# df = pd.read_csv('your_file.csv')
Load a demo df
# Load a demo df with phosphorylation sites
= Data.get_ochoa_site().head()
df -2:] df.iloc[:,
site_seq | gene_site | |
---|---|---|
0 | VDDEKGDSNDDYDSA | A0A075B6Q4_S24 |
1 | YDSAGLLSDEDCMSV | A0A075B6Q4_S35 |
2 | IADHLFWSEETKSRF | A0A075B6Q4_S57 |
3 | KSRFTEYSMTSSVMR | A0A075B6Q4_S68 |
4 | FTEYSMTSSVMRRNE | A0A075B6Q4_S71 |
Set the column name and param to calculate
Here we choose param_CDDM_upper, as the sequences in the demo df are all in capital. You can also choose other params.
= predict_kinase_df(df,'site_seq',**param_CDDM_upper)
results results
input dataframe has a length 5
Preprocessing
Finish preprocessing
Calculating position: [-7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7]
100%|██████████| 289/289 [00:05<00:00, 56.64it/s]
kinase | SRC | EPHA3 | FES | NTRK3 | ALK | EPHA8 | ABL1 | FLT3 | EPHB2 | FYN | ... | MEK5 | PKN2 | MAP2K7 | MRCKB | HIPK3 | CDK8 | BUB1 | MEKK3 | MAP2K3 | GRK1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.991760 | 1.093712 | 1.051750 | 1.067134 | 1.013682 | 1.097519 | 0.966379 | 0.982464 | 1.054986 | 1.055910 | ... | 1.314859 | 1.635470 | 1.652251 | 1.622672 | 1.362973 | 1.797155 | 1.305198 | 1.423618 | 1.504941 | 1.872020 |
1 | 0.910262 | 0.953743 | 0.942327 | 0.950601 | 0.872694 | 0.932586 | 0.846899 | 0.826662 | 0.915020 | 0.942713 | ... | 1.175454 | 1.402006 | 1.430392 | 1.215826 | 1.569373 | 1.716455 | 1.270999 | 1.195081 | 1.223082 | 1.793290 |
2 | 0.849866 | 0.899910 | 0.848895 | 0.879652 | 0.874959 | 0.899414 | 0.839200 | 0.836523 | 0.858040 | 0.867269 | ... | 1.408003 | 1.813739 | 1.454786 | 1.084522 | 1.352556 | 1.524663 | 1.377839 | 1.173830 | 1.305691 | 1.811849 |
3 | 0.803826 | 0.836527 | 0.800759 | 0.894570 | 0.839905 | 0.781001 | 0.847847 | 0.807040 | 0.805877 | 0.801402 | ... | 1.110307 | 1.703637 | 1.795092 | 1.469653 | 1.549936 | 1.491344 | 1.446922 | 1.055452 | 1.534895 | 1.741090 |
4 | 0.822793 | 0.796532 | 0.792343 | 0.839882 | 0.810122 | 0.781420 | 0.805251 | 0.795022 | 0.790380 | 0.864538 | ... | 1.062617 | 1.357689 | 1.485945 | 1.249266 | 1.456078 | 1.422782 | 1.376471 | 1.089629 | 1.121309 | 1.697524 |
5 rows × 289 columns
Phosphorylation sites
Besides calculating sequence scores, we also provides multiple datasets of phosphorylation sites.
CPTAC pan-cancer phosphoproteomics
= Data.get_cptac_ensembl_site()
df 3) df.head(
gene | site | site_seq | protein | gene_name | gene_site | protein_site | |
---|---|---|---|---|---|---|---|
0 | ENSG00000003056.8 | S267 | DDQLGEESEERDDHL | ENSP00000000412.3 | M6PR | M6PR_S267 | ENSP00000000412_S267 |
1 | ENSG00000003056.8 | S267 | DDQLGEESEERDDHL | ENSP00000440488.2 | M6PR | M6PR_S267 | ENSP00000440488_S267 |
2 | ENSG00000048028.11 | S1053 | PPTIRPNSPYDLCSR | ENSP00000003302.4 | USP28 | USP28_S1053 | ENSP00000003302_S1053 |
Ochoa et al. human phosphoproteome
= Data.get_ochoa_site()
df 3) df.head(
uniprot | position | residue | is_disopred | disopred_score | log10_hotspot_pval_min | isHotspot | uniprot_position | functional_score | current_uniprot | name | gene | Sequence | is_valid | site_seq | gene_site | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | A0A075B6Q4 | 24 | S | True | 0.91 | 6.839384 | True | A0A075B6Q4_24 | 0.149257 | A0A075B6Q4 | A0A075B6Q4_HUMAN | None | MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... | True | VDDEKGDSNDDYDSA | A0A075B6Q4_S24 |
1 | A0A075B6Q4 | 35 | S | True | 0.87 | 9.192622 | False | A0A075B6Q4_35 | 0.136966 | A0A075B6Q4 | A0A075B6Q4_HUMAN | None | MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... | True | YDSAGLLSDEDCMSV | A0A075B6Q4_S35 |
2 | A0A075B6Q4 | 57 | S | False | 0.28 | 0.818834 | False | A0A075B6Q4_57 | 0.125364 | A0A075B6Q4 | A0A075B6Q4_HUMAN | None | MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... | True | IADHLFWSEETKSRF | A0A075B6Q4_S57 |
PhosphoSitePlus human phosphorylation site
= Data.get_psp_human_site()
df 3) df.head(
gene | protein | uniprot | site | gene_site | SITE_GRP_ID | species | site_seq | LT_LIT | MS_LIT | MS_CST | CST_CAT# | Ambiguous_Site | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | YWHAB | 14-3-3 beta | P31946 | T2 | YWHAB_T2 | 15718712 | human | ______MtMDksELV | NaN | 3.0 | 1.0 | None | 0 |
1 | YWHAB | 14-3-3 beta | P31946 | S6 | YWHAB_S6 | 15718709 | human | __MtMDksELVQkAk | NaN | 8.0 | NaN | None | 0 |
2 | YWHAB | 14-3-3 beta | P31946 | Y21 | YWHAB_Y21 | 3426383 | human | LAEQAERyDDMAAAM | NaN | NaN | 4.0 | None | 0 |
Unique sites of combined Ochoa & PhosphoSitePlus
= Data.get_combine_site_psp_ochoa()
df 3) df.head(
site_seq | gene_site | gene | source | num_site | acceptor | -7 | -6 | -5 | -4 | ... | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AAAAAAASGGAGSDN | PBX1_S136 | PBX1 | ochoa | 1 | S | A | A | A | A | ... | A | A | S | G | G | A | G | S | D | N |
1 | AAAAAAASGGGVSPD | PBX2_S146 | PBX2 | ochoa | 1 | S | A | A | A | A | ... | A | A | S | G | G | G | V | S | P | D |
2 | AAAAAAASGVTTGKP | CLASR_S349 | CLASR | ochoa | 1 | S | A | A | A | A | ... | A | A | S | G | V | T | T | G | K | P |
3 rows × 21 columns
Phosphorylation site sequence example
All capital - 15 length (-7 to +7)
- QSEEEKLSPSPTTED
- TLQHVPDYRQNVYIP
- TMGLSARyGPQFTLQ
All capital - 10 length (-5 to +4)
- SRDPHYQDPH
- LDNPDyQQDF
- AAAAAsGGAG
With lowercase - (-7 to +7)
- QsEEEKLsPsPTTED
- TLQHVPDyRQNVYIP
- TMGLsARyGPQFTLQ
With lowercase - (-5 to +4)
- sRDPHyQDPH
- LDNPDyQQDF
- AAAAAsGGAG