KATLAS

Predict kinases given a substrate sequence

KATLAS is a repository of Python tools for predicting kinases from a substrate sequence. It also contains datasets of kinase substrate specificities and human phosphoproteomics.

References: Please cite the appropriate papers if KATLAS is helpful to your research.

KATLAS is described in the paper "Computational Decoding of Human Kinome Substrate Specificities and Functions".
The positional scanning peptide array (PSPA) data come from the papers "An atlas of substrate specificities for the human serine/threonine kinome" and "The intrinsic substrate specificity of the human tyrosine kinome".
The kinase-substrate datasets used to generate PSSMs are derived from PhosphoSitePlus and the paper "Large-scale Discovery of Substrates of the Human Kinome".
Phosphorylation sites are taken from PhosphoSitePlus, the paper "The functional landscape of the human phosphoproteome", and CPTAC / LinkedOmics.

Reproduce datasets & figures

Follow the instructions in katlas_raw: https://github.com/sky1ove/katlas_raw

First install the package with the dev extras: pip install 'python-katlas[dev]' -U

Web applications

Users can now run the analysis directly on the web without needing to code.

Check out our latest web platform: kinase-atlas.com

Tutorials on Colab

Install

pip install python-katlas -U
pip install git+https://github.com/sky1ove/katlas.git
pip install "git+https://github.com/sky1ove/katlas.git@main#egg=python-katlas[dev]"

To use modules other than the core, install the dev extras: pip install 'python-katlas[dev]' -U

Import

from katlas.core import *

Quick start

We provide two methods for scoring a substrate sequence against kinases:

Computational Data-Driven Method (CDDM)
Positional Scanning Peptide Array (PSPA)

We consider the input in two formats:

a single input string (phosphorylation site)
a csv/dataframe that contains a column of phosphorylation sites

Input sequences can be written in two ways:

all capital
contains lower cases indicating phosphorylation status
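
The examples in this README pair all-capital input with Params("CDDM_upper") and lowercase-annotated input with Params("CDDM"). A minimal sketch of how that choice could be automated is shown here; pick_cddm_param is a hypothetical helper written for illustration and is not part of katlas.

```python
def pick_cddm_param(seq):
    """Return the CDDM parameter-set name matching the casing of `seq`.

    Hypothetical helper, not part of katlas. The two names returned are
    the ones used in this README's examples.
    """
    # Any lowercase letter is taken to mean the sequence carries
    # phosphorylation status (e.g. lowercase s, t, or y).
    has_lower = any(ch.islower() for ch in seq)
    return "CDDM" if has_lower else "CDDM_upper"


print(pick_cddm_param("AAAAAAASGGAGSDN"))  # CDDM_upper
print(pick_cddm_param("AAAAAAAsGGAGsDN"))  # CDDM
```

The returned name could then be passed on as Params(pick_cddm_param(seq)).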

Single sequence as input

CDDM, all capital

predict_kinase('AAAAAAASGGAGSDN',**Params("CDDM_upper"))

considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0S', '1G', '2G', '3A', '4G', '5S', '6D', '7N']

kinase
PAK6     2.032
ULK3     2.032
PRKX     2.012
ATR      1.991
PRKD1    1.988
         ...  
DDR2     0.928
EPHA4    0.928
TEK      0.921
KIT      0.915
FGFR3    0.910
Length: 289, dtype: float64
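
The "considering string" line in the output labels each residue with its offset from the central phospho-acceptor (position 0). The sketch below reproduces those labels, assuming the acceptor sits at index len(seq) // 2. That assumption is consistent with the 15-, 11-, and 10-residue examples in this README, but it is not a confirmed detail of how katlas parses input; label_positions is a hypothetical name.

```python
def label_positions(seq):
    """Label each residue with its offset from the central residue.

    Hypothetical helper, not part of katlas. Assumes the phospho-acceptor
    is at index len(seq) // 2.
    """
    center = len(seq) // 2
    # Offsets left of the center are negative; the center itself is 0.
    return [f"{i - center}{aa}" for i, aa in enumerate(seq)]


print(label_positions("AEEKEyHsEGG"))
# ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2s', '3E', '4G', '5G']
```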

CDDM, with lower case indicating phosphorylation status

predict_kinase('AAAAAAAsGGAGsDN',**Params("CDDM"))

considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0s', '1G', '2G', '3A', '4G', '5s', '6D', '7N']

kinase
ULK3     1.987
PAK6     1.981
PRKD1    1.946
PIM3     1.944
PRKX     1.939
         ...  
EPHA4    0.905
EGFR     0.900
TEK      0.898
FGFR3    0.894
KIT      0.882
Length: 289, dtype: float64

PSPA, with lower case indicating phosphorylation status

predict_kinase('AEEKEyHsEGG',**Params("PSPA")).head()

considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2s', '3E', '4G', '5G']

kinase
EGFR     4.013
FGFR4    3.568
ZAP70    3.412
CSK      3.241
SYK      3.209
dtype: float64

To replicate the results from The Kinase Library (PSPA)

Open The Kinase Library and rank by log2(score); the results match those below, apart from slight differences due to rounding.

predict_kinase('AEEKEyHSEGG',**Params("PSPA")).head(10)

considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2S', '3E', '4G', '5G']

kinase
EGFR         3.181
FGFR4        2.390
CSK          2.308
ZAP70        2.068
SYK          1.998
PDHK1_TYR    1.922
RET          1.732
MATK         1.688
FLT1         1.627
BMPR2_TYR    1.456
dtype: float64

At present, The Kinase Library treats all tyrosine sequences as all capital, even when they contain lowercase letters; this is a small bug and should be fixed soon.
A kinase with the suffix "_TYR" is a dual-specificity kinase tested in the PSPA tyrosine setting; these kinases are not yet included in kinase-library.

We can also calculate a percentile score using a reference score sheet.

# Percentile reference sheet
y_pct = Data.get_pspa_tyr_pct()

get_pct('AEEKEyHSEGG',pct_ref = y_pct,**Params("PSPA_y"))

In the version used to build this README, the call above fails because get_pct() does not accept a pct_ref keyword:

TypeError: get_pct() got an unexpected keyword argument 'pct_ref'

Run help(get_pct) to see the parameter names in your installed version.
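
To make the idea of a percentile score concrete, here is a generic sketch that ranks one score against a reference distribution. It assumes the percentile is the share of reference scores less than or equal to the query score. This is not katlas's get_pct implementation, percentile_of_score is a hypothetical name, and the reference values are made up for illustration.

```python
from bisect import bisect_right


def percentile_of_score(score, reference):
    """Percent of reference scores that are <= `score`.

    Generic illustration only; katlas may define percentiles differently.
    """
    ref = sorted(reference)
    if not ref:
        raise ValueError("reference must not be empty")
    # bisect_right counts how many sorted reference values are <= score.
    return 100.0 * bisect_right(ref, score) / len(ref)


# Made-up reference scores, for illustration only.
reference_scores = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
print(percentile_of_score(3.181, reference_scores))  # 75.0
```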

High-throughput substrate scoring on a dataframe

Load your csv

# import pandas as pd
# df = pd.read_csv('your_file.csv')

Load a demo df

# Load a demo df with phosphorylation sites
df = Data.get_ochoa_site().head()
df.iloc[:,-2:]

Set the column name and param to calculate

Here we choose Params("CDDM_upper") because the sequences in the demo df are all capital. You can also choose other params.

results = predict_kinase_df(df,'site_seq',**Params("CDDM_upper"))
results

Phosphorylation sites

Besides calculating sequence scores, we also provide multiple datasets of phosphorylation sites.

CPTAC pan-cancer phosphoproteomics

df = Data.get_cptac_ensembl_site()
df.head(3)

Ochoa et al. human phosphoproteome

df = Data.get_ochoa_site()
df.head(3)

PhosphoSitePlus human phosphorylation sites

df = Data.get_psp_human_site()
df.head(3)

Unique sites of combined Ochoa & PhosphoSitePlus

df = Data.get_combine_site_psp_ochoa()
df.head(3)

Phosphorylation site sequence example

All capital - 15 length (-7 to +7)

QSEEEKLSPSPTTED
TLQHVPDYRQNVYIP
TMGLSARYGPQFTLQ

All capital - 10 length (-5 to +4)

SRDPHYQDPH
LDNPDYQQDF
AAAAASGGAG

With lowercase - (-7 to +7)

QsEEEKLsPsPTTED
TLQHVPDyRQNVYIP
TMGLsARyGPQFTLQ

With lowercase - (-5 to +4)

sRDPHyQDPH
LDNPDyQQDF
AAAAAsGGAG
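
In every example above, the residue at index len(seq) // 2 is S, T, or Y (in either case). A minimal sketch of a sanity check built on that observation is shown here; has_central_acceptor is a hypothetical helper, not a katlas function, and the centering rule is an assumption inferred from these examples.

```python
PHOSPHO_ACCEPTORS = {"S", "T", "Y"}


def has_central_acceptor(seq):
    """Return True if the central residue is S, T, or Y (any case).

    Hypothetical check, not part of katlas. Assumes the phospho-acceptor
    is at index len(seq) // 2, as in the examples in this README.
    """
    if not seq:
        return False
    center = seq[len(seq) // 2]
    return center.upper() in PHOSPHO_ACCEPTORS


for s in ["QSEEEKLSPSPTTED", "sRDPHyQDPH", "AAAAAAAAAAAAAAA"]:
    print(s, has_central_acceptor(s))
```

A check like this could be run over a column of site sequences before scoring, to catch rows whose window is misaligned.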