from katlas.core import *
KATLAS
KATLAS is a repository of Python tools for predicting kinases from a substrate sequence. It also contains datasets of kinase substrate specificities and human phosphoproteomics.
References: Please cite the appropriate papers if KATLAS is helpful to your research.
- KATLAS is described in the paper "Computational Decoding of Human Kinome Substrate Specificities and Functions".
- The positional scanning peptide array (PSPA) data come from the papers "An atlas of substrate specificities for the human serine/threonine kinome" and "The intrinsic substrate specificity of the human tyrosine kinome".
- The kinase substrate datasets used to generate position-specific scoring matrices (PSSMs) are derived from PhosphoSitePlus and the paper "Large-scale Discovery of Substrates of the Human Kinome".
- Phosphorylation sites are taken from PhosphoSitePlus, the paper "The functional landscape of the human phosphoproteome", and CPTAC / LinkedOmics.
Reproduce datasets & figures
Follow the instructions in katlas_raw: https://github.com/sky1ove/katlas_raw
This requires installing the package with the dev extras: pip install 'python-katlas[dev]' -U
Web applications
Users can now run the analysis directly on the web without needing to code.
Check out our latest web platform: kinase-atlas.com
Tutorials on Colab
Install
Install the latest release from PyPI:
pip install python-katlas -U
Or install the latest version from GitHub:
pip install git+https://github.com/sky1ove/katlas.git
To use modules beyond the core, install the dev extras, from PyPI or from GitHub:
pip install 'python-katlas[dev]' -U
pip install "git+https://github.com/sky1ove/katlas.git@main#egg=python-katlas[dev]"
Import
from katlas.core import *
Quick start
We provide two methods to score a substrate sequence:
- Computational Data-Driven Method (CDDM)
- Positional Scanning Peptide Array (PSPA)
Input can take two forms:
- a single string (one phosphorylation site sequence)
- a CSV file or DataFrame with a column of phosphorylation site sequences
Input sequences can be written in two ways:
- all capital letters
- with lowercase letters indicating phosphorylation status
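In the CDDM examples in this README, all-capital input is scored with the "CDDM_upper" preset and input with lowercase phosphorylation marks with the "CDDM" preset. A minimal sketch of making that choice automatically follows; `pick_cddm_preset` is a hypothetical helper and not part of katlas.

```python
def pick_cddm_preset(site_seq: str) -> str:
    """Hypothetical helper (not part of katlas): pick a CDDM preset name by letter case."""
    # Any lowercase letter is read as a phosphorylation mark.
    has_lowercase = any(ch.islower() for ch in site_seq)
    return "CDDM" if has_lowercase else "CDDM_upper"


print(pick_cddm_preset("AAAAAAASGGAGSDN"))  # CDDM_upper
print(pick_cddm_preset("AAAAAAAsGGAGsDN"))  # CDDM
```

The returned name could then be passed on in the same way as in the examples, for instance predict_kinase(seq, **Params(pick_cddm_preset(seq))).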
Single sequence as input
CDDM, all capital
predict_kinase('AAAAAAASGGAGSDN',**Params("CDDM_upper"))
considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0S', '1G', '2G', '3A', '4G', '5S', '6D', '7N']
kinase
PAK6 2.032
ULK3 2.032
PRKX 2.012
ATR 1.991
PRKD1 1.988
...
DDR2 0.928
EPHA4 0.928
TEK 0.921
KIT 0.915
FGFR3 0.910
Length: 289, dtype: float64
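The "considering string" line in the output labels each residue with its position relative to the central site. The sketch below reproduces that labelling for illustration only; `position_labels` is a hypothetical helper, and it assumes the phospho-acceptor sits at index len(site_seq) // 2, which holds for the examples in this README but is not taken from the katlas source.

```python
def position_labels(site_seq: str) -> list:
    """Label each residue with its offset from the central (position 0) residue.

    Assumption: the phospho-acceptor is at index len(site_seq) // 2.
    """
    center = len(site_seq) // 2
    return [f"{i - center}{aa}" for i, aa in enumerate(site_seq)]


labels = position_labels("AAAAAAASGGAGSDN")
print(labels)
```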
CDDM, with lower case indicating phosphorylation status
predict_kinase('AAAAAAAsGGAGsDN',**Params("CDDM"))
considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0s', '1G', '2G', '3A', '4G', '5s', '6D', '7N']
kinase
ULK3 1.987
PAK6 1.981
PRKD1 1.946
PIM3 1.944
PRKX 1.939
...
EPHA4 0.905
EGFR 0.900
TEK 0.898
FGFR3 0.894
KIT 0.882
Length: 289, dtype: float64
PSPA, with lower case indicating phosphorylation status
predict_kinase('AEEKEyHsEGG',**Params("PSPA")).head()
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2s', '3E', '4G', '5G']
kinase
EGFR 4.013
FGFR4 3.568
ZAP70 3.412
CSK 3.241
SYK 3.209
dtype: float64
To replicate the results from The Kinase Library (PSPA)
See The Kinase Library and rank by log2(score); it gives the same results as shown below, with slight differences due to rounding.
predict_kinase('AEEKEyHSEGG',**Params("PSPA")).head(10)
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2S', '3E', '4G', '5G']
kinase
EGFR 3.181
FGFR4 2.390
CSK 2.308
ZAP70 2.068
SYK 1.998
PDHK1_TYR 1.922
RET 1.732
MATK 1.688
FLT1 1.627
BMPR2_TYR 1.456
dtype: float64
- The Kinase Library currently treats all tyrosine sequences as all capital, even when they contain lowercase letters; this is a small bug that should be fixed soon.
- A kinase name ending in "_TYR" denotes a dual-specificity kinase tested in the PSPA tyrosine setting; these kinases are not yet included in The Kinase Library.
We can also calculate a percentile score using a reference score sheet.
# Percentile reference sheet
y_pct = Data.get_pspa_tyr_pct()
get_pct('AEEKEyHSEGG',pct_ref = y_pct,**Params("PSPA_y"))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[9], line 1
----> 1 get_pct('AEEKEyHSEGG',pct_ref = y_pct,**Params("PSPA_y"))

TypeError: get_pct() got an unexpected keyword argument 'pct_ref'
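To illustrate what a percentile score means, the sketch below ranks one score against a list of reference scores. It is not the katlas get_pct implementation: `percentile_of` is a hypothetical helper and the reference values are made up.

```python
from bisect import bisect_right


def percentile_of(score: float, reference: list) -> float:
    """Percentage of reference scores that are less than or equal to `score`."""
    ordered = sorted(reference)
    return 100.0 * bisect_right(ordered, score) / len(ordered)


# Made-up reference scores for one kinase (illustrative only).
reference_scores = [-2.0, -1.0, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0]
print(percentile_of(3.181, reference_scores))  # 90.0
```

Here nine of the ten reference scores are at or below 3.181, so the score falls at the 90th percentile of this toy reference.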
High-throughput substrate scoring on a dataframe
Load your csv
# df = pd.read_csv('your_file.csv')
Load a demo df
# Load a demo df with phosphorylation sites
df = Data.get_ochoa_site().head()
df.iloc[:, -2:]
Set the column name and param to calculate
Here we choose Params("CDDM_upper"), as the sequences in the demo df are all in capital letters. Other params can be chosen as well.
results = predict_kinase_df(df,'site_seq',**Params("CDDM_upper"))
results
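Before scoring your own file, it may help to confirm that the chosen column holds site sequences of one consistent length. The sketch below does this with the standard library only; `check_site_column` is a hypothetical helper, and the CSV contents (including the `gene` column) are an illustrative stand-in for 'your_file.csv', not katlas data.

```python
import csv
import io


def check_site_column(csv_text: str, column: str) -> list:
    """Return the site sequences in `column`, raising if their lengths differ."""
    rows = csv.DictReader(io.StringIO(csv_text))
    sites = [row[column].strip() for row in rows]
    lengths = {len(s) for s in sites}
    if len(lengths) > 1:
        raise ValueError(f"mixed sequence lengths in {column!r}: {sorted(lengths)}")
    return sites


# Stand-in for the contents of 'your_file.csv' (made-up rows).
demo_csv = "gene,site_seq\nA,QSEEEKLSPSPTTED\nB,TLQHVPDYRQNVYIP\n"
sites = check_site_column(demo_csv, "site_seq")
print(sites)
```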
Phosphorylation sites
Besides calculating sequence scores, we also provide multiple datasets of phosphorylation sites.
CPTAC pan-cancer phosphoproteomics
df = Data.get_cptac_ensembl_site()
df.head(3)
Ochoa et al. human phosphoproteome
df = Data.get_ochoa_site()
df.head(3)
PhosphoSitePlus human phosphorylation sites
df = Data.get_psp_human_site()
df.head(3)
Unique sites of combined Ochoa & PhosphoSitePlus
df = Data.get_combine_site_psp_ochoa()
df.head(3)
Phosphorylation site sequence example
All capital - length 15 (-7 to +7)
- QSEEEKLSPSPTTED
- TLQHVPDYRQNVYIP
- TMGLSARYGPQFTLQ
All capital - length 10 (-5 to +4)
- SRDPHYQDPH
- LDNPDYQQDF
- AAAAASGGAG
With lowercase - length 15 (-7 to +7)
- QsEEEKLsPsPTTED
- TLQHVPDyRQNVYIP
- TMGLsARyGPQFTLQ
With lowercase - length 10 (-5 to +4)
- sRDPHyQDPH
- LDNPDyQQDF
- AAAAAsGGAG
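Site sequences like these are windows centred on the phosphorylated residue. A minimal sketch of cutting a -7 to +7 window out of a longer protein sequence follows; `site_window` is a hypothetical helper, the protein string is made up, and padding with "_" at the protein ends is an assumption rather than documented katlas behaviour.

```python
def site_window(protein: str, site_index: int, flank: int = 7, pad: str = "_") -> str:
    """Return residues from -flank to +flank around `site_index` (0-based)."""
    if not 0 <= site_index < len(protein):
        raise IndexError("site_index is outside the protein sequence")
    padded = pad * flank + protein + pad * flank
    # After padding, the window starts at the original site_index.
    return padded[site_index : site_index + 2 * flank + 1]


protein = "MAAAQSEEEKLSPSPTTEDKK"  # made-up sequence
window = site_window(protein, site_index=11)
print(window)  # QSEEEKLSPSPTTED
```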