Core

Core functions in the Katlas library

Setup

Data

We will go through how to load kinase information data and phosphorylation sites data.



Data

 Data ()

A class for fetching various datasets.

Datasets used in this study can be accessed through Data:

Data                          Description
get_kinase_info               full information on 523 kinases
get_pspa_tyr_norm             PSPA Tyr kinase normalized data; with additional 0 position information
get_pspa_st_norm              PSPA ST kinase normalized data; with additional 0 position information
get_pspa_all_norm             PSPA ST and Tyr kinase normalized data; with additional 0 position information
get_pspa_st_pct               PSPA ST kinase scoring on ST phosphoproteome; reference for percentile calculation
get_pspa_tyr_pct              PSPA Tyr kinase scoring on Tyr phosphoproteome; reference for percentile calculation
get_num_dict                  dictionary of the number of random amino acids for PSPA
get_ks_dataset                kinase-substrate dataset
get_cddm                      CDDM PSSM
get_cddm_upper                CDDM PSSM for upper case
get_cddm_others               CDDM PSSM of mutated kinases
get_cddm_others_info          information on mutated kinases
get_combine                   combined CDDM and PSPA data
get_aa_info                   amino acid information
get_aa_rdkit                  chemical properties extracted from amino acids' SMILES through RDKit
get_aa_morgan                 Morgan fingerprints of amino acids' SMILES
get_cptac_ensembl_site        CPTAC unique Ensembl ID + site
get_cptac_unique_site         CPTAC unique site sequence
get_cptac_gene_site           CPTAC unique gene + site; fewer cases than Ensembl ID + site
get_psp_human_site            PhosphoSitePlus, human dataset, gene + site
get_ochoa_site                Ochoa et al. human phosphoproteome dataset
get_combine_site_psp_ochoa    combined Ochoa and PSP low-throughput data

To load kinase information data:

kinase = Data.get_kinase_info()
kinase.head()
kinase ID_coral uniprot ID_HGNC group family subfamily_coral subfamily in_ST_paper in_Tyr_paper ... cytosol cytoskeleton plasma membrane mitochondrion Golgi apparatus endoplasmic reticulum vesicle centrosome aggresome main_location
0 AAK1 AAK1 Q2M2I8 AAK1 Other NAK None NAK 1 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN None
1 ABL1 ABL1 P00519 ABL1 TK Abl None Abl 0 1 ... 6.0 NaN 4.0 NaN NaN NaN NaN NaN NaN cytosol
2 ABL2 ABL2 P42684 ABL2 TK Abl None Abl 0 1 ... 4.0 6.0 NaN NaN NaN NaN NaN NaN NaN cytoskeleton
3 TNK2 ACK Q07912 TNK2 TK Ack None Ack 0 1 ... NaN NaN NaN NaN NaN NaN 8.0 NaN 2.0 vesicle
4 ACVR2A ACTR2 P27037 ACVR2A TKL STKR STKR2 STKR2 1 0 ... 5.0 NaN NaN NaN NaN 5.0 NaN NaN NaN cytosol

5 rows × 30 columns



CPTAC

 CPTAC ()

A class for fetching CPTAC phosphoproteomics data.

To check available cancer types, use CPTAC.list_cancer()

CPTAC.list_cancer()
['HNSCC', 'GBM', 'COAD', 'CCRCC', 'LSCC', 'BRCA', 'UCEC', 'LUAD', 'PDAC', 'OV']

To load CPTAC phosphorylation site information, use CPTAC.get_id(). Fold change of various conditions can be acquired through LinkedOmics or LinkedOmicsKB. Use is_KB to indicate whether the phosphorylation site information is for LinkedOmics or LinkedOmicsKB.

# Example of getting phosphorylation site information
tumor = CPTAC.get_id('CCRCC',is_KB=True)
normal = CPTAC.get_id('CCRCC',is_KB=True, is_Tumor=False)
tumor.head()
the CCRCC dataset length is: 54238
after id mapping, the length is 213737
0 sites does not have a mapped gene name
after removing duplicates of protein_site, the length is 212814
the CCRCC dataset length is: 53152
after id mapping, the length is 209188
0 sites does not have a mapped gene name
after removing duplicates of protein_site, the length is 208298
gene site site_seq protein gene_name gene_site protein_site
0 ENSG00000003056.8 S267 DDQLGEESEERDDHL ENSP00000000412.3 M6PR M6PR_S267 ENSP00000000412_S267
1 ENSG00000003056.8 S267 DDQLGEESEERDDHL ENSP00000440488.2 M6PR M6PR_S267 ENSP00000440488_S267
2 ENSG00000048028.11 S1053 PPTIRPNSPYDLCSR ENSP00000003302.4 USP28 USP28_S1053 ENSP00000003302_S1053
3 ENSG00000048028.11 S1053 PPTIRPNSPYDLCSR ENSP00000445743.1 USP28 USP28_S1053 ENSP00000445743_S1053
4 ENSG00000048028.11 S1053 PPTIRPNSPYDLCSR ENSP00000442431.1 USP28 USP28_S1053 ENSP00000442431_S1053

Substrate scoring

Utils



convert_string

 convert_string (input_string:str)

Convert lowercase amino acids other than s, t, y to uppercase; convert rare amino acids to _

In many phosphorylation datasets, the site sequence contains lowercase amino acids that are not s/t/y. Uncommon amino acids such as U or O can also appear in the sequence. The sequence string therefore needs to be converted before kinase ranking.

# example
convert_string('AAkUuPRFstTH')
'AAK__PRFstTH'
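
The conversion rule above can be sketched in a few lines. This is a hypothetical re-implementation for illustration only, not the library's source; the name convert_string_sketch and the choice of the 20 standard amino acids as the "common" set are assumptions.

```python
# Illustrative sketch; convert_string_sketch is a hypothetical name.
# Assumption: "rare" means anything outside the 20 standard amino acids.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def convert_string_sketch(seq: str) -> str:
    out = []
    for ch in seq:
        if ch in "sty":
            out.append(ch)            # keep lowercase s/t/y untouched
        elif ch.upper() in STANDARD_AA:
            out.append(ch.upper())    # other lowercase letters become uppercase
        else:
            out.append("_")           # rare or unknown residues such as U or O
    return "".join(out)

print(convert_string_sketch("AAkUuPRFstTH"))  # AAK__PRFstTH
```

On the example input this reproduces the output shown above; behavior on other edge cases may differ from convert_string.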


STY2sty

 STY2sty (input_string:str)

Replace ‘STY’ with ‘sty’



checker

 checker (input_string)

Check if the input string contains non-s/t/y at the middle position

checker('AAkUuPSFstTH') # if the center amino acid does not belong to sty/STY, will raise an error

STY2sty('AAkUuPSFSTtH') # convert all capital STY to sty in a string
'AAkUuPsFsttH'
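
As a rough sketch of what these two helpers do, assuming the center is taken at index len // 2 and that the error is a ValueError (both are assumptions, as are the function names):

```python
# Illustrative sketches; names, error type and center rule are assumptions.
def sty2sty_sketch(seq: str) -> str:
    # Replace every capital S, T, Y with its lowercase form
    return seq.translate(str.maketrans("STY", "sty"))

def checker_sketch(seq: str) -> None:
    # Raise if the central residue is not a phospho-acceptor
    center = seq[len(seq) // 2]
    if center not in "stySTY":
        raise ValueError(f"center residue {center!r} is not s/t/y")

print(sty2sty_sketch("AAkUuPSFSTtH"))  # AAkUuPsFsttH
checker_sketch("AAkUuPSFstTH")         # passes silently: the center is 'S'
```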


cut_seq

 cut_seq (input_string:str, min_position:int, max_position:int)

Extract sequence based on a range relative to its center position

Type Details
input_string str site sequence
min_position int minimum position relative to its center
max_position int maximum position relative to its center
cut_seq('AAkUuPSFSTtH',-5,4)
'AkUuPSFSTt'
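
The example output is consistent with taking the center at index len // 2 and slicing inclusively on both ends. A minimal sketch under that assumption (cut_seq_sketch is a hypothetical name, not the library function):

```python
# Illustrative sketch; assumes the center residue sits at index len(seq) // 2.
def cut_seq_sketch(seq: str, min_position: int, max_position: int) -> str:
    center = len(seq) // 2
    start = max(0, center + min_position)   # clamp so the slice never wraps around
    end = center + max_position + 1         # +1 makes max_position inclusive
    return seq[start:end]

print(cut_seq_sketch("AAkUuPSFSTtH", -5, 4))  # AkUuPSFSTt
```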


get_dict

 get_dict (input_string:str)

Get the position-labeled amino acids of an input string (for example -7P or 0s); the phospho-acceptor does not need to be marked with a star; make sure the string has a length of 15 or 10

Type Details
input_string str phosphorylation site sequence
cols = get_dict("PSVEPPLsQETFSDL")
cols
['-7P',
 '-6S',
 '-5V',
 '-4E',
 '-3P',
 '-2P',
 '-1L',
 '0s',
 '1Q',
 '2E',
 '3T',
 '4F',
 '5S',
 '6D',
 '7L']
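
The labels shown above can be reproduced by pairing each residue with its offset from the center. The sketch below assumes the center is at index len // 2; position_labels is a hypothetical name used only for illustration.

```python
# Illustrative sketch; assumes the phospho-acceptor is at index len(seq) // 2.
def position_labels(seq: str) -> list[str]:
    center = len(seq) // 2
    # Each label is "<offset from center><residue>", e.g. "-7P" or "0s"
    return [f"{i - center}{aa}" for i, aa in enumerate(seq)]

labels = position_labels("PSVEPPLsQETFSDL")
print(labels[0], labels[7], labels[-1])  # -7P 0s 7L
```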

Function1 - multiply



multiply_func

 multiply_func (values, factor=17)

Multiply the probabilities of the amino acids at each position in a phosphorylation site

         Type   Default   Details
values                    list of values, probabilities of amino acids at certain positions
factor   int    17        scale factor


This function implements the formula from Johnson et al., Nature: An atlas of substrate specificities for the human serine/threonine kinome, Supplementary Note 2 (page 160).

Multiply with a kinase-specific scale factor



multiply

 multiply (values, kinase, num_dict={'SYK': 18, 'PTK2': 18, 'ZAP70': 18,
           'ERBB2': 18, 'CSK': 18, 'FGFR4': 18, 'EGFR': 18, 'ERBB4': 18,
           'EPHA8': 18, 'EPHA7': 18, 'EPHA5': 18, 'EPHA2': 18, 'EPHB2':
           18, 'EPHB1': 18, 'EPHB3': 18, 'EPHB4': 18, 'EPHA4': 18,
           'EPHA3': 18, 'EPHA6': 18, 'FRK': 18, 'EPHA1': 18, 'TEC': 18,
           'BTK': 18, 'ITK': 18, 'BMX': 18, 'TXK': 16, 'ABL2': 18, 'ABL1':
           18, 'SRMS': 18, 'PTK2B': 18, 'FER': 18, 'MERTK': 18, 'AXL': 18,
           'FES': 18, 'PTK6': 18, 'YES1': 18, 'FGR': 18, 'SRC': 18, 'FYN':
           18, 'LCK': 18, 'BLK': 18, 'LYN': 18, 'HCK': 18, 'PDGFRB': 18,
           'PDGFRA': 18, 'FLT3': 18, 'TYRO3': 18, 'ROS1': 18, 'TEK': 18,
           'LTK': 18, 'ALK': 18, 'MUSK': 18, 'KIT': 18, 'CSF1R': 18,
           'MET': 18, 'KDR': 18, 'RET': 18, 'MST1R': 16, 'JAK3': 16,
           'FLT1': 16, 'MATK': 18, 'FGFR3': 18, 'FGFR2': 18, 'FGFR1': 18,
           'FLT4': 18, 'INSR': 18, 'IGF1R': 18, 'INSRR': 16, 'NTRK3': 18,
           'NTRK1': 18, 'NTRK2': 18, 'TNK1': 18, 'TNK2': 18, 'DDR2': 18,
           'DDR1': 18, 'TYK2': 18, 'JAK2': 18, 'JAK1': 18, 'TNNI3K_TYR':
           18, 'NEK10_TYR': 16, 'PINK1_TYR': 16, 'MAP2K7_TYR': 16,
           'PKMYT1_TYR': 16, 'TESK1_TYR': 16, 'LIMK1_TYR': 16,
           'LIMK2_TYR': 16, 'WEE1_TYR': 18, 'MAP2K6_TYR': 16,
           'MAP2K4_TYR': 16, 'PDHK1_TYR': 16, 'BMPR2_TYR': 16,
           'PDHK4_TYR': 16, 'PDHK3_TYR': 16, 'AAK1': 17, 'ACVR2A': 17,
           'ACVR2B': 17, 'AKT1': 17, 'AKT2': 17, 'AKT3': 17, 'ALK2': 17,
           'ALK4': 17, 'ALPHAK3': 17, 'AMPKA1': 17, 'AMPKA2': 17,
           'ANKRD3': 17, 'ATM': 17, 'ATR': 17, 'AURA': 17, 'AURB': 17,
           'AURC': 17, 'GRK2': 17, 'GRK3': 17, 'BCKDK': 17, 'BIKE': 17,
           'BMPR1A': 17, 'BMPR1B': 17, 'BMPR2': 17, 'BRAF': 17, 'BRSK1':
           17, 'BRSK2': 17, 'BUB1': 17, 'CAMK1A': 17, 'CAMK1B': 17,
           'CAMK1D': 17, 'CAMK1G': 17, 'CAMK2A': 17, 'CAMK2B': 17,
           'CAMK2D': 17, 'CAMK2G': 17, 'CAMK4': 17, 'CAMKK1': 17,
           'CAMKK2': 17, 'CAMLCK': 17, 'CDK1': 17, 'CDC7': 17, 'CDK10':
           17, 'CDK19': 17, 'CDK2': 17, 'CDK3': 17, 'CDK4': 17, 'CDK5':
           17, 'CDK6': 17, 'CDK7': 17, 'CDK8': 17, 'CDK9': 17, 'CDKL1':
           17, 'CDKL5': 17, 'CHAK1': 17, 'CHAK2': 17, 'CDK13': 17, 'CHK1':
           17, 'CHK2': 17, 'CK1A': 17, 'CK1A2': 17, 'CK1D': 17, 'CK1E':
           17, 'CK1G1': 17, 'CK1G2': 17, 'CK1G3': 17, 'CK2A1': 17,
           'CK2A2': 17, 'CLK1': 17, 'CLK2': 17, 'CLK3': 17, 'CLK4': 17,
           'COT': 17, 'CRIK': 17, 'CDK12': 17, 'DAPK1': 17, 'DAPK2': 17,
           'DAPK3': 17, 'DCAMKL1': 17, 'DCAMKL2': 17, 'DLK': 17, 'DMPK1':
           17, 'DNAPK': 17, 'DRAK1': 17, 'DYRK1A': 17, 'DYRK1B': 17,
           'DYRK2': 17, 'DYRK3': 17, 'DYRK4': 17, 'ERK1': 17, 'ERK2': 17,
           'ERK5': 17, 'ERK7': 17, 'MTOR': 17, 'GAK': 17, 'GCK': 17,
           'GCN2': 17, 'GRK4': 17, 'GRK5': 17, 'GRK6': 17, 'GRK7': 17,
           'GSK3A': 17, 'GSK3B': 17, 'HASPIN': 17, 'HGK': 17, 'HIPK1': 17,
           'HIPK2': 17, 'HIPK3': 17, 'HIPK4': 17, 'HPK1': 17, 'HRI': 17,
           'HUNK': 17, 'ICK': 17, 'IKKA': 17, 'IKKB': 17, 'IKKE': 17,
           'IRAK1': 17, 'IRAK4': 17, 'IRE1': 17, 'IRE2': 17, 'JNK1': 17,
           'JNK2': 17, 'JNK3': 17, 'KHS1': 17, 'KHS2': 17, 'KIS': 17,
           'LATS1': 17, 'LATS2': 17, 'LKB1': 17, 'LOK': 17, 'LRRK2': 17,
           'MAK': 17, 'MEK1': 17, 'MEK2': 17, 'MEK5': 17, 'MEKK1': 17,
           'YSK4': 17, 'MEKK2': 17, 'MEKK3': 17, 'ASK1': 17, 'MEKK6': 17,
           'MAP3K15': 17, 'MAPKAPK2': 17, 'MAPKAPK3': 17, 'MAPKAPK5': 17,
           'MARK1': 17, 'MARK2': 17, 'MARK3': 17, 'MARK4': 17, 'MASTL':
           17, 'MELK': 17, 'MINK': 17, 'MLK1': 17, 'MLK2': 17, 'MLK3': 17,
           'MLK4': 17, 'MNK1': 17, 'MNK2': 17, 'MOK': 17, 'MOS': 17,
           'MPSK1': 17, 'MRCKA': 17, 'MRCKB': 17, 'MSK1': 17, 'MSK2': 17,
           'SRPK3': 17, 'MST1': 17, 'MST2': 17, 'MST3': 17, 'MST4': 17,
           'MYO3A': 17, 'MYO3B': 17, 'NDR1': 17, 'NDR2': 17, 'NEK1': 17,
           'NEK11': 17, 'NEK2': 17, 'NEK3': 17, 'NEK4': 17, 'NEK5': 17,
           'NEK6': 17, 'NEK7': 17, 'NEK8': 17, 'NEK9': 17, 'NIK': 17,
           'NIM1': 17, 'NLK': 17, 'NUAK1': 17, 'NUAK2': 17, 'OSR1': 17,
           'P38A': 17, 'P38B': 17, 'P38D': 17, 'P38G': 17, 'P70S6K': 17,
           'P70S6KB': 17, 'PAK1': 17, 'PAK2': 17, 'PAK3': 17, 'PAK4': 17,
           'PAK5': 17, 'PAK6': 17, 'PASK': 17, 'PBK': 17, 'CDK16': 17,
           'CDK17': 17, 'CDK18': 17, 'PDHK1': 16, 'PDHK4': 16, 'PDK1': 17,
           'PERK': 17, 'CDK14': 17, 'PHKG1': 17, 'PHKG2': 17, 'PIM1': 17,
           'PIM2': 17, 'PIM3': 17, 'PINK1': 17, 'PKACA': 17, 'PKACB': 17,
           'PKACG': 17, 'PKCA': 17, 'PKCB': 17, 'PKCD': 17, 'PKCE': 17,
           'PKCG': 17, 'PKCH': 17, 'PKCI': 17, 'PKCT': 17, 'PKCZ': 17,
           'PRKD1': 17, 'PRKD2': 17, 'PRKD3': 17, 'PKG1': 17, 'PKG2': 17,
           'PKN1': 17, 'PKN2': 17, 'PKN3': 17, 'PKR': 17, 'PLK1': 17,
           'PLK2': 17, 'PLK3': 17, 'PLK4': 17, 'PRKX': 17, 'PRP4': 17,
           'PRPK': 17, 'QIK': 17, 'QSK': 17, 'RAF1': 17, 'GRK1': 17,
           'RIPK1': 17, 'RIPK2': 17, 'RIPK3': 17, 'ROCK1': 17, 'ROCK2':
           17, 'P90RSK': 17, 'RSK2': 17, 'RSK3': 17, 'RSK4': 17, 'SBK':
           17, 'MYLK4': 17, 'SGK1': 17, 'SGK3': 17, 'DSTYK': 17, 'SIK':
           17, 'SKMLCK': 17, 'SLK': 17, 'SMG1': 17, 'SMMLCK': 17, 'SNRK':
           17, 'SRPK1': 17, 'SRPK2': 17, 'SSTK': 17, 'STK33': 17, 'STLK3':
           17, 'TAK1': 17, 'TAO1': 17, 'TAO2': 17, 'TAO3': 17, 'TBK1': 17,
           'TGFBR1': 17, 'TGFBR2': 17, 'TLK1': 17, 'TLK2': 17, 'TNIK': 17,
           'TSSK1': 17, 'TSSK2': 17, 'TTBK1': 17, 'TTBK2': 17, 'TTK': 17,
           'ULK1': 17, 'ULK2': 17, 'VRK1': 17, 'VRK2': 17, 'WNK1': 17,
           'WNK3': 17, 'WNK4': 17, 'YANK2': 17, 'YANK3': 17, 'YSK1': 17,
           'ZAK': 17, 'EEF2K': 17, 'FAM20C': 17})

Multiply values using a kinase-specific scale factor, which is the number of random amino acids used in PSPA for that kinase.

multiply(values=[1,2,3,4,5],kinase='PDHK1')
22.906890595608516
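
The example output is consistent with a log2-scale product, log2(prod(values) * factor^(n-1)), where n is the number of values and factor is 16 for PDHK1. The sketch below is inferred from that output rather than taken from the library source; the function names and the three-entry NUM_DICT_SUBSET (values copied from the default num_dict above) are for illustration only.

```python
import math

# Illustrative sketch inferred from the documented output; not the library source.
def multiply_func_sketch(values, factor=17):
    # Sum of log2 values equals log2 of the product, which avoids underflow
    log_sum = sum(math.log2(v) for v in values)
    # Each of the n-1 multiplications is rescaled by the factor
    return log_sum + (len(values) - 1) * math.log2(factor)

# Small subset of the default num_dict shown above
NUM_DICT_SUBSET = {"PDHK1": 16, "AKT1": 17, "SRC": 18}

def multiply_sketch(values, kinase, num_dict=NUM_DICT_SUBSET):
    # Look up the kinase-specific number of random amino acids
    return multiply_func_sketch(values, factor=num_dict[kinase])

score = multiply_sketch([1, 2, 3, 4, 5], "PDHK1")
print(round(score, 4))  # 22.9069
```

Working in log space keeps the score numerically stable when many small probabilities are multiplied.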

Function2 - sum up



sumup

 sumup (values, kinase=None)

Sum up the probabilities of the amino acids at each position in a phosphorylation site sequence

         Type       Default   Details
values                        list of values, probabilities of amino acids at certain positions
kinase   NoneType   None


Substrate scoring for one input string



predict_kinase

 predict_kinase (input_string:str, ref:pandas.core.frame.DataFrame,
                 func:Callable, to_lower:bool=False, to_upper:bool=False,
                 verbose=True)

Predict kinase given a phosphorylation site sequence

Type Default Details
input_string str site sequence
ref DataFrame reference dataframe for scoring
func Callable function to calculate score
to_lower bool False convert capital STY to lower case
to_upper bool False convert all letters to uppercase
verbose bool True

Here we provide different PSSM settings, derived from either PSPA data or the kinase-substrate dataset, for kinase prediction:

predict_kinase("GAEEKEyHAEGGA", **param_PSPA).dropna()
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2A', '3E', '4G', '5G']
kinase
EGFR     2.881
CSK      2.148
ZAP70    1.997
FGFR4    1.918
MATK     1.778
         ...  
EPHA1   -3.486
TNK1    -4.211
FES     -4.315
TNK2    -4.595
DDR2    -5.467
Length: 93, dtype: float64
for param in [param_PSPA, param_CDDM,param_CDDM_upper]:
    print(predict_kinase("PSVEPPLsQETFSDL",**param).head())
considering string: ['-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S']
kinase
ATM       5.037
SMG1      4.385
DNAPK     3.818
ATR       3.507
FAM20C    3.170
dtype: float64
considering string: ['-7P', '-6S', '-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S', '6D', '7L']
kinase
ATR      3.064
ATM      2.909
DNAPK    2.270
CK2A1    1.873
TSSK1    1.856
dtype: float64
considering string: ['-7P', '-6S', '-5V', '-4E', '-3P', '-2P', '-1L', '0S', '1Q', '2E', '3T', '4F', '5S', '6D', '7L']
kinase
ATR      3.229
ATM      3.038
DNAPK    2.479
CK2A1    2.006
CDK8     1.999
dtype: float64

Substrate scoring for a column of input strings



predict_kinase_df

 predict_kinase_df (df, seq_col, ref, func, to_lower=False,
                    to_upper=False)
df = Data.get_psp_human_site()
df_sty = df[df['site_seq'].str[7].isin(list('sty'))]
out = predict_kinase_df(df_sty.head(20_000),'site_seq', **param_PSPA)
input dataframe has a length 20000
Preprocessing
Finish preprocessing
Merging reference
Finish merging
100%|██████████| 396/396 [00:17<00:00, 23.17it/s]
CPU times: user 18.8 s, sys: 76.4 ms, total: 18.9 s
Wall time: 19.1 s
out_cddm = predict_kinase_df(df_sty.head(20_000),'site_seq', **param_CDDM)
input dataframe has a length 20000
Preprocessing
Finish preprocessing
Merging reference
Finish merging
CPU times: user 2.26 s, sys: 16 ms, total: 2.27 s
Wall time: 2.27 s

Other examples:

data = Data.get_ochoa_site().head()

for param in [param_PSPA,param_PSPA_st,param_PSPA_y, param_CDDM,param_CDDM_upper]:
    display(predict_kinase_df(data,'site_seq', **param))
input dataframe has a length 5
Preprocessing
Finish preprocessing
100%|██████████| 396/396 [00:01<00:00, 278.67it/s]
AAK1 ACVR2A ACVR2B AKT1 AKT2 AKT3 ALK2 ALK4 ALPHAK3 AMPKA1 ... NTRK3 TXK TYK2 TYRO3 FLT1 KDR FLT4 WEE1_TYR YES1 ZAP70
0 -10.959900 -0.580989 0.328868 -3.890850 -3.590503 -5.312027 0.813786 -0.559382 -0.932736 -2.607226 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 -6.787549 -0.166112 0.306884 -5.885999 -4.786083 -6.575957 1.561436 -0.865154 -3.399169 -3.260798 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 -9.030938 1.231686 1.774661 -6.164209 -5.446345 -8.329813 0.777783 -1.355417 -0.928937 -4.998190 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 -4.849113 2.271636 2.057240 -2.886034 -2.379830 -3.634907 1.547144 2.735341 -2.825794 -1.696886 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 -6.596842 -1.387696 -0.956218 -2.834231 -3.794276 -4.968521 -1.862002 -1.717226 -2.653170 -3.514512 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 396 columns

input dataframe has a length 5
Preprocessing
Finish preprocessing
100%|██████████| 303/303 [00:01<00:00, 282.00it/s]
AAK1 ACVR2A ACVR2B AKT1 AKT2 AKT3 ALK2 ALK4 ALPHAK3 AMPKA1 ... VRK1 VRK2 WNK1 WNK3 WNK4 YANK2 YANK3 YSK1 YSK4 ZAK
0 -10.959900 -0.580989 0.328868 -3.890850 -3.590503 -5.312027 0.813786 -0.559382 -0.932736 -2.607226 ... -4.682274 -2.854275 -1.668666 -1.526785 -2.964575 -2.877200 -1.792244 -6.282651 -1.715336 -3.204281
1 -6.787549 -0.166112 0.306884 -5.885999 -4.786083 -6.575957 1.561436 -0.865154 -3.399169 -3.260798 ... -5.669906 -2.817085 -4.071192 -3.393741 -5.096516 -1.874000 -1.479976 -8.708810 -3.708147 -6.092628
2 -9.030938 1.231686 1.774661 -6.164209 -5.446345 -8.329813 0.777783 -1.355417 -0.928937 -4.998190 ... -5.832088 -3.243278 -4.249323 -2.749652 -5.053019 0.581384 -0.502630 -6.448245 -1.897494 -2.846533
3 -4.849113 2.271636 2.057240 -2.886034 -2.379830 -3.634907 1.547144 2.735341 -2.825794 -1.696886 ... -2.757951 -1.699232 -1.725384 -0.091196 -0.672545 0.313278 -0.207212 -2.315848 -0.053572 -1.117657
4 -6.596842 -1.387696 -0.956218 -2.834231 -3.794276 -4.968521 -1.862002 -1.717226 -2.653170 -3.514512 ... -1.546328 -1.457323 -1.277532 0.510635 -1.045845 -0.314193 -1.023331 -2.482345 -2.227114 -1.592725

5 rows × 303 columns

input dataframe has a length 5
Preprocessing
Finish preprocessing
100%|██████████| 93/93 [00:00<00:00, 282.04it/s]
ABL1 TNK2 ALK ABL2 AXL BLK BMPR2_TYR PTK6 BTK CSF1R ... NTRK3 TXK TYK2 TYRO3 FLT1 KDR FLT4 WEE1_TYR YES1 ZAP70
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 93 columns

input dataframe has a length 5
Preprocessing
Finish preprocessing
SRC EPHA3 FES NTRK3 ALK EPHA8 ABL1 FLT3 EPHB2 FYN ... MEK5 PKN2 MAP2K7 MRCKB HIPK3 CDK8 BUB1 MEKK3 MAP2K3 GRK1
0 0.929569 1.023548 0.989176 0.993497 0.949303 1.035864 0.911981 0.931400 0.980941 0.999884 ... 1.280109 1.579914 1.585706 1.521548 1.331723 1.705593 1.281669 1.387180 1.419866 1.808546
1 0.831463 0.876808 0.861939 0.868801 0.794678 0.858953 0.772543 0.754167 0.822913 0.863020 ... 1.072389 1.323723 1.319648 1.136658 1.507195 1.556711 1.224350 1.111727 1.115820 1.707075
2 0.793592 0.837832 0.783147 0.809343 0.808309 0.834936 0.777472 0.777312 0.788872 0.816794 ... 1.234520 1.758183 1.355396 0.983398 1.290056 1.422390 1.260192 1.113435 1.221784 1.724631
3 0.622633 0.652045 0.602943 0.666497 0.638019 0.616664 0.661643 0.618955 0.607654 0.626510 ... 0.915427 1.447707 1.443802 1.233698 1.290700 1.162054 1.060876 0.852917 1.219678 1.351096
4 0.653840 0.619648 0.617121 0.645628 0.635812 0.618403 0.635859 0.625696 0.605016 0.683491 ... 0.835344 1.212995 1.333055 1.103199 1.280820 1.219769 1.190424 0.901394 0.952642 1.504753

5 rows × 289 columns

input dataframe has a length 5
Preprocessing
Finish preprocessing
SRC EPHA3 FES NTRK3 ALK EPHA8 ABL1 FLT3 EPHB2 FYN ... MEK5 PKN2 MAP2K7 MRCKB HIPK3 CDK8 BUB1 MEKK3 MAP2K3 GRK1
0 0.991760 1.093712 1.051750 1.067134 1.013682 1.097519 0.966379 0.982464 1.054986 1.055910 ... 1.314859 1.635470 1.652251 1.622672 1.362973 1.797155 1.305198 1.423618 1.504941 1.872020
1 0.910262 0.953743 0.942327 0.950601 0.872694 0.932586 0.846899 0.826662 0.915020 0.942713 ... 1.175454 1.402006 1.430392 1.215826 1.569373 1.716455 1.270999 1.195081 1.223082 1.793290
2 0.849866 0.899910 0.848895 0.879652 0.874959 0.899414 0.839200 0.836523 0.858040 0.867269 ... 1.408003 1.813739 1.454786 1.084522 1.352556 1.524663 1.377839 1.173830 1.305691 1.811849
3 0.803826 0.836527 0.800759 0.894570 0.839905 0.781001 0.847847 0.807040 0.805877 0.801402 ... 1.110307 1.703637 1.795092 1.469653 1.549936 1.491344 1.446922 1.055452 1.534895 1.741090
4 0.822793 0.796532 0.792343 0.839882 0.810122 0.781420 0.805251 0.795022 0.790380 0.864538 ... 1.062617 1.357689 1.485945 1.249266 1.456078 1.422782 1.376471 1.089629 1.121309 1.697524

5 rows × 289 columns

Percentile scoring for one input string



get_pct

 get_pct (site, ref, func, pct_ref)

Replicate the percentile results from The Kinase Library.

st_pct = Data.get_pspa_st_pct()
y_pct = Data.get_pspa_tyr_pct()
a = get_pct('PSVEPPLyQETFSDL',**param_PSPA_y, pct_ref=y_pct)
considering string: ['-5V', '-4E', '-3P', '-2P', '-1L', '0Y', '1Q', '2E', '3T', '4F', '5S']
a.sort_values('percentile',ascending=False)
log2(score) percentile
ABL2 3.137 96.568694
BMX 2.816 96.117567
BTK 1.956 95.693780
CSK 2.303 95.174299
MERTK 2.509 93.588517
... ... ...
FLT1 -1.919 25.358852
PINK1_TYR -1.227 21.927546
MUSK -3.031 21.298701
TNNI3K_TYR -3.549 11.004785
PKMYT1_TYR -1.739 4.798360

93 rows × 2 columns

get_pct('PSVEPPLsQETFSDL',**param_PSPA_st, pct_ref=st_pct)
considering string: ['-5V', '-4E', '-3P', '-2P', '-1L', '0S', '1Q', '2E', '3T', '4F']
log2(score) percentile
ATM 5.037 99.822351
SMG1 4.385 99.831819
DNAPK 3.818 99.205315
ATR 3.507 99.680344
FAM20C 3.170 95.370556
... ... ...
PKN1 -7.275 14.070436
P70S6K -7.295 4.089816
AKT3 -7.375 11.432995
PKCI -7.742 8.129511
NEK3 -8.254 4.637240

303 rows × 2 columns
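
Conceptually, a percentile places one site's score for a kinase within the distribution of that kinase's scores over the reference phosphoproteome. The sketch below shows one common definition (the share of reference scores less than or equal to the query score). Which ranking convention the library uses is not stated here, so treat the definition, the function name and the toy reference values as assumptions.

```python
from bisect import bisect_right

# Illustrative sketch; the "<=" convention is an assumption.
def percentile_of_score(reference_scores, score):
    ordered = sorted(reference_scores)
    # bisect_right counts how many reference scores are <= score
    return 100.0 * bisect_right(ordered, score) / len(ordered)

# Toy reference distribution for a single hypothetical kinase
toy_ref = [-3.0, -1.5, 0.0, 0.5, 1.2, 2.0, 2.8, 3.1, 4.0, 5.5]
print(percentile_of_score(toy_ref, 2.9))  # 70.0
```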

Percentile scoring for a column of input strings



get_pct_df

 get_pct_df (score_df, pct_ref)

Replicate the percentile results from The Kinase Library.

Details
score_df output from predict_kinase_df
pct_ref a reference df for percentile calculation
# substrate score first
score_df = predict_kinase_df(data,'site_seq', **param_PSPA_st)

# get percentile reference
pct_ref = Data.get_pspa_st_pct()
input dataframe has a length 5
Preprocessing
Finish preprocessing
100%|██████████| 303/303 [00:01<00:00, 269.52it/s]
pct = get_pct_df(score_df,pct_ref)
pct
100%|██████████| 303/303 [00:02<00:00, 142.57it/s]
kinase AAK1 ACVR2A ACVR2B AKT1 AKT2 AKT3 ALK2 ALK4 ALPHAK3 AMPKA1 ... VRK1 VRK2 WNK1 WNK3 WNK4 YANK2 YANK3 YSK1 YSK4 ZAK
0 0.458 74.671 84.053 40.272 33.211 33.567 88.644 72.171 87.964 39.757 ... 29.815 7.807 51.817 61.251 39.677 32.182 31.327 15.722 64.522 37.746
1 22.403 79.735 83.860 14.659 14.462 18.451 92.567 62.538 55.851 28.697 ... 12.757 8.305 15.029 25.499 11.296 53.721 39.987 2.279 14.715 0.783
2 4.115 90.708 93.359 12.239 8.160 5.969 88.406 43.246 88.003 8.176 ... 10.769 3.742 13.282 36.930 11.678 93.213 66.309 14.134 60.500 46.700
3 52.441 95.324 94.544 55.475 55.510 56.197 92.514 97.997 65.543 55.549 ... 70.543 39.667 50.790 84.612 77.572 90.827 72.860 75.441 90.201 81.427
4 24.954 60.679 69.408 56.285 29.554 38.174 54.628 28.577 68.272 24.793 ... 87.502 49.994 58.722 91.131 72.087 83.217 52.870 73.150 52.402 73.957

5 rows × 303 columns

Data collection/processing



get_unique_site

 get_unique_site (df:pandas.core.frame.DataFrame=None,
                  seq_col:str='site_seq', id_col:str='gene_site')

Remove duplicates among phosphorylation sites; return df with new columns of acceptor and number of duplicates

Type Default Details
df DataFrame None dataframe that contains phosphorylation sites
seq_col str site_seq column name of site sequence
id_col str gene_site column name of site id

Because the dataset contains many duplicated phosphorylation site sequences, it can be helpful to remove them.

Use get_unique_site to get unique phosphorylation sites; specify the names of the sequence column and the ID column.

df = Data.get_ochoa_site()
unique = get_unique_site(df,seq_col='site_seq',id_col='gene_site')
unique.sort_values('num_site',ascending=False).head()
site_seq gene_site num_site acceptor
59397 PDYRQNVYIPGSNAT PCDGC_Y896|PCDGK_Y898|PCDGG_Y887|PCDGM_Y908|PC... 21 Y
96321 TMGLSARYGPQFTLQ PCDGC_Y879|PCDGK_Y881|PCDGG_Y870|PCDGM_Y891|PC... 21 Y
96147 TLQHVPDYRQNVYIP PCDGC_Y891|PCDGK_Y893|PCDGG_Y882|PCDGM_Y903|PC... 21 Y
11223 DKFIIPGSPAIISIR PCDC1_S906|PCDA7_S880|PCDA6_S893|PCDAC_S884|PC... 14 S
18666 ELAKHAVSEGTKAVT H2B1K_S113|H2BFS_S113|H2B1D_S113|H2B1C_S113|H2... 12 S
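
The output above suggests that sites sharing a sequence are merged, their IDs joined with "|", and the center residue reported as the acceptor. A dependency-free sketch of that grouping follows; the function name, the plain list-of-tuples input, and counting distinct IDs as num_site are assumptions, and the library itself operates on a DataFrame.

```python
# Illustrative sketch; unique_sites is a hypothetical name.
def unique_sites(records):
    """records: iterable of (site_seq, gene_site) pairs."""
    grouped = {}
    for site_seq, gene_site in records:
        ids = grouped.setdefault(site_seq, [])
        if gene_site not in ids:          # keep each ID once, in first-seen order
            ids.append(gene_site)
    return [
        {
            "site_seq": seq,
            "gene_site": "|".join(ids),
            "num_site": len(ids),
            "acceptor": seq[len(seq) // 2],  # center residue of the sequence
        }
        for seq, ids in grouped.items()
    ]

rows = unique_sites([
    ("ELAKHAVSEGTKAVT", "H2B1K_S113"),
    ("ELAKHAVSEGTKAVT", "H2B1D_S113"),
    ("VDDEKGDSNDDYDSA", "A0A075B6Q4_S24"),
])
print(rows[0]["gene_site"], rows[0]["num_site"], rows[0]["acceptor"])
# H2B1K_S113|H2B1D_S113 2 S
```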


extract_site_seq

 extract_site_seq (df:pandas.core.frame.DataFrame, seq_col:str,
                   position_col:str)

Extract -7 to +7 site sequence from protein sequence

Type Details
df DataFrame dataframe that contains protein sequence
seq_col str column name of protein sequence
position_col str column name of position 0

Some datasets contain only protein information and the positions of phosphorylation sites, but not the phosphorylation site sequences. In that case, we can retrieve the protein sequence and use this function to get the -7 to +7 phosphorylation site sequence (returned as a numpy array).

Remember to validate the phospho-acceptor at position 0 before extracting the site sequence, as mismatches can occur when the protein sequence database is updated.

df = Data.get_ochoa_site().head()
df
uniprot position residue is_disopred disopred_score log10_hotspot_pval_min isHotspot uniprot_position functional_score current_uniprot name gene Sequence is_valid site_seq gene_site
0 A0A075B6Q4 24 S True 0.91 6.839384 True A0A075B6Q4_24 0.149257 A0A075B6Q4 A0A075B6Q4_HUMAN None MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... True VDDEKGDSNDDYDSA A0A075B6Q4_S24
1 A0A075B6Q4 35 S True 0.87 9.192622 False A0A075B6Q4_35 0.136966 A0A075B6Q4 A0A075B6Q4_HUMAN None MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... True YDSAGLLSDEDCMSV A0A075B6Q4_S35
2 A0A075B6Q4 57 S False 0.28 0.818834 False A0A075B6Q4_57 0.125364 A0A075B6Q4 A0A075B6Q4_HUMAN None MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... True IADHLFWSEETKSRF A0A075B6Q4_S57
3 A0A075B6Q4 68 S False 0.03 0.375986 False A0A075B6Q4_68 0.119811 A0A075B6Q4 A0A075B6Q4_HUMAN None MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... True KSRFTEYSMTSSVMR A0A075B6Q4_S68
4 A0A075B6Q4 71 S False 0.05 0.000000 False A0A075B6Q4_71 0.095193 A0A075B6Q4 A0A075B6Q4_HUMAN None MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... True FTEYSMTSSVMRRNE A0A075B6Q4_S71
extract_site_seq(df,seq_col='Sequence',position_col='position')
100%|██████████| 5/5 [00:00<00:00, 3412.22it/s]
array(['VDDEKGDSNDDYDSA', 'YDSAGLLSDEDCMSV', 'IADHLFWSEETKSRF',
       'KSRFTEYSMTSSVMR', 'FTEYSMTSSVMRRNE'], dtype='<U15')
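
The windowing step can be sketched as follows, assuming 1-based positions and "_" padding when the window runs past either end of the protein (the padding character and the function name are assumptions). The sequence used here is only the visible prefix of the truncated Sequence value shown above.

```python
# Illustrative sketch; extract_window is a hypothetical name.
def extract_window(protein_seq: str, position: int, flank: int = 7, pad: str = "_") -> str:
    # Pad both ends so sites near the termini still yield a full-length window
    padded = pad * flank + protein_seq + pad * flank
    center = position - 1 + flank          # position is 1-based in the protein
    return padded[center - flank : center + flank + 1]

# Visible prefix of the protein sequence in the table above (the rest is truncated there)
prefix = "MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT"

# Validate the acceptor before extracting, as recommended above
assert prefix[24 - 1] == "S"

print(extract_window(prefix, 24))  # VDDEKGDSNDDYDSA
print(extract_window(prefix, 35))  # YDSAGLLSDEDCMSV
```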

Get amino acid frequency at each position



get_freq

 get_freq (df_k:pandas.core.frame.DataFrame, aa_order=['P', 'G', 'A', 'C',
           'S', 'T', 'V', 'I', 'L', 'M', 'F', 'Y', 'W', 'H', 'K', 'R',
           'Q', 'N', 'D', 'E', 's', 't', 'y'], aa_order_paper=['P', 'G',
           'A', 'C', 'S', 'T', 'V', 'I', 'L', 'M', 'F', 'Y', 'W', 'H',
           'K', 'R', 'Q', 'N', 'D', 'E', 's', 't', 'y'], position=[-7, -6,
           -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7],
           position_paper=[-5, -4, -3, -2, -1, 1, 2, 3, 4])

Get frequency matrix given a dataframe of phosphorylation sites for a single kinase

Type Default Details
df_k DataFrame a dataframe for a single kinase that contains phosphorylation sequences split by position
aa_order list [‘P’, ‘G’, ‘A’, ‘C’, ‘S’, ‘T’, ‘V’, ‘I’, ‘L’, ‘M’, ‘F’, ‘Y’, ‘W’, ‘H’, ‘K’, ‘R’, ‘Q’, ‘N’, ‘D’, ‘E’, ‘s’, ‘t’, ‘y’] amino acid to include in the full matrix
aa_order_paper list [‘P’, ‘G’, ‘A’, ‘C’, ‘S’, ‘T’, ‘V’, ‘I’, ‘L’, ‘M’, ‘F’, ‘Y’, ‘W’, ‘H’, ‘K’, ‘R’, ‘Q’, ‘N’, ‘D’, ‘E’, ‘s’, ‘t’, ‘y’] amino acid to include in the partial matrix
position list [-7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7] position to include in the full matrix
position_paper list [-5, -4, -3, -2, -1, 1, 2, 3, 4] position to include in the partial matrix
# Get data of a certain kinase
df = Data.get_ks_dataset()
df_k = df.query('kinase=="DYRK2"')
df_k.head() # note that the dataframe contains columns -7 to 7
Kinase substrate kinase_uniprot kinase_paper source -7 -6 -5 -4 -3 ... 0 1 2 3 4 5 6 7 kinase on_tree
0 DYRK2 AEGLRPAsPLGLTQE Q92630 DYRK2 pplus A E G L R ... s P L G L T Q E DYRK2 1
1 DYRK2 GGGAGPVsPQHHELT Q92630 DYRK2 pplus G G G A G ... s P Q H H E L T DYRK2 1
2 DYRK2 LRGNVVPsPLPtRRt Q92630 DYRK2 pplus L R G N V ... s P L P t R R t DYRK2 1
3 DYRK2 GPMRRSKsPADSANG Q92630 DYRK2 pplus G P M R R ... s P A D S A N G DYRK2 1
4 DYRK2 PERsQEEsPPGSTKR Q92630 DYRK2 pplus P E R s Q ... s P P G S T K R DYRK2 1

5 rows × 22 columns

# get frequency matrix
paper_format, full = get_freq(df_k)
paper_format.head()
Position -5 -4 -3 -2 -1 1 2 3 4
aa
P 0.060639 0.066152 0.074972 0.110254 0.110254 0.386313 0.057459 0.135105 0.062361
G 0.076075 0.074972 0.126792 0.061742 0.087100 0.046358 0.068508 0.101883 0.067929
A 0.091510 0.083793 0.061742 0.142227 0.100331 0.089404 0.108287 0.071982 0.080178
C 0.011025 0.006615 0.011025 0.030871 0.017641 0.012141 0.023204 0.018826 0.006682
S 0.036384 0.049614 0.024256 0.036384 0.023153 0.027594 0.028729 0.035437 0.038976

Statistical analysis



get_pvalue

 get_pvalue (df, columns1, columns2, test_method='mann_whitney',
             FC_method='median')

Performs statistical tests and calculates difference between the median or mean of two groups of columns.

Type Default Details
df
columns1 list of column names for group1
columns2 list of column names for group2
test_method str mann_whitney ‘student_t’, ‘mann_whitney’, ‘wilcoxon’
FC_method str median or mean


get_metaP

 get_metaP (p_values)

Use Fisher’s method to calculate a combined p value given a list of p values; this function also allows negative p values (negative correlation)

p_values = [0.001,-0.5,0.002]

get_metaP(p_values)
0.0003626876953231754
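
The example output can be reproduced by a signed variant of Fisher's method in which a negative p value contributes its log with the opposite sign, and the sign of the final statistic is carried to the result. The sketch below is inferred from that output and is not the library source. It uses the closed-form chi-square survival function for even degrees of freedom to stay within the standard library; the function names are hypothetical.

```python
import math

def chi2_sf_even_dof(x: float, dof: int) -> float:
    # Closed form valid only for even dof = 2k:
    # sf(x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    k = dof // 2
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total

# Illustrative sketch of a signed Fisher combination (an inferred reading)
def signed_fisher(p_values):
    logs = [math.log(abs(p)) * (-1 if p < 0 else 1) for p in p_values]
    stat = -2.0 * sum(logs)
    sf = chi2_sf_even_dof(abs(stat), 2 * len(p_values))
    return -sf if stat < 0 else sf

combined = signed_fisher([0.001, -0.5, 0.002])
print(f"{combined:.6e}")  # 3.626877e-04
```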

PSPA normalization



raw2norm

 raw2norm (df:pandas.core.frame.DataFrame, PDHK:bool=False)

Normalize single ST kinase data

Type Default Details
df DataFrame single kinase’s df has position as index, and single amino acid as columns
PDHK bool False whether this kinase belongs to PDHK family

This function implements the normalization method from Johnson et al., Nature: An atlas of substrate specificities for the human serine/threonine kinome.

Specifically:

- matrices were column-normalized at all positions by the sum of the 17 randomized amino acids (excluding serine, threonine and cysteine), to yield PSSMs.
- PDHK1 and PDHK4 were normalized to the 16 randomized amino acids (excluding serine, threonine, cysteine and additionally tyrosine)
- The cysteine row was scaled by its median to be 1/17 (1/16 for PDHK1 and PDHK4).
- The serine and threonine values in each position were set to be the median of that position.
- The S0/T0 ratio was determined by summing the values of S and T rows in the matrix (SS and ST, respectively), accounting for the different S vs. T composition of the central (1:1) and peripheral (only S or only T) positions (Sctrl and Tctrl, respectively), and then normalizing to the higher value among the two (S0 and T0, respectively, Supplementary Note 1)

This function is usually called through get_one_kinase below, which takes normalize as a bool argument.



get_one_kinase

 get_one_kinase (df:pandas.core.frame.DataFrame, kinase:str,
                 normalize:bool=False, drop_s:bool=True)

Obtain the data for a specific kinase from a stacked dataframe

Type Default Details
df DataFrame stacked dataframe (paper’s raw data)
kinase str a specific kinase
normalize bool False normalize according to the paper; special for PDHK1/4
drop_s bool True drop s, as s is a duplicate of t in PSPA

Retrieve the data for a single kinase from PSPA data that has kinases as the index and position + amino acid as columns.

To load raw and normalized PSPA data:

import pandas as pd
raw = pd.read_csv('https://github.com/sky1ove/katlas_raw/raw/refs/heads/main/nbs/raw/pspa_st_raw.csv').set_index('kinase')
norm = pd.read_csv('https://github.com/sky1ove/katlas_raw/raw/refs/heads/main/nbs/raw/pspa_st_norm.csv').set_index('kinase')
scale = pd.read_csv('https://github.com/sky1ove/katlas_raw/raw/refs/heads/main/nbs/raw/pspa_st_scale.csv').set_index('kinase')
raw
-5P -5G -5A -5C -5S -5T -5V -5I -5L -5M ... 4H 4K 4R 4Q 4N 4D 4E 4s 4t 4y
kinase
AAK1 7.614134e+06 2.590563e+06 3.001315e+06 4.696631e+06 4.944312e+06 8.315838e+06 1.005654e+07 1.643306e+07 1.049974e+07 9.133578e+06 ... 6.020663e+06 8.938081e+06 9.983402e+06 6.833482e+06 6.364453e+06 4.189046e+06 4.921596e+06 2.705054e+06 2.705054e+06 2.909280e+06
ACVR2A 4.991039e+06 5.783856e+06 7.015771e+06 8.367603e+06 7.072052e+06 7.601400e+06 7.188292e+06 7.513916e+06 7.159895e+06 6.266123e+06 ... 6.039473e+06 5.556301e+06 5.178735e+06 6.490098e+06 5.862481e+06 6.742906e+06 6.750653e+06 7.414220e+06 7.414220e+06 6.209577e+06
ACVR2B 2.648033e+07 2.568969e+07 2.813730e+07 4.517591e+07 3.287672e+07 3.351696e+07 2.701119e+07 2.199626e+07 2.341299e+07 2.567058e+07 ... 2.798420e+07 2.249692e+07 2.423690e+07 2.913286e+07 2.652739e+07 3.638873e+07 3.472932e+07 3.790608e+07 3.790608e+07 3.176142e+07
AKT1 1.839951e+07 1.810468e+07 1.683184e+07 1.724774e+07 2.264728e+07 1.780129e+07 1.303757e+07 1.327190e+07 1.415649e+07 1.540976e+07 ... 2.951154e+07 5.094266e+07 4.815292e+07 3.269388e+07 2.889660e+07 1.970135e+07 1.388746e+07 1.748307e+07 1.748307e+07 1.169683e+07
AKT2 5.439238e+06 5.569477e+06 5.805463e+06 6.301076e+06 5.004932e+06 4.812023e+06 3.906822e+06 3.776845e+06 4.450345e+06 4.629320e+06 ... 6.812202e+06 1.159068e+07 9.932526e+06 6.544477e+06 6.252361e+06 3.629092e+06 3.510048e+06 5.499662e+06 5.499662e+06 4.188621e+06
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
YANK2 3.458024e+06 4.161241e+06 3.791822e+06 4.204140e+06 5.512662e+06 6.506984e+06 2.579471e+06 2.797506e+06 2.734876e+06 2.793789e+06 ... 4.210220e+06 6.679453e+06 6.769908e+06 3.128978e+06 3.000055e+06 1.953885e+06 2.785169e+06 6.746347e+06 6.746347e+06 3.883819e+07
YANK3 6.199852e+06 7.696729e+06 6.418478e+06 7.520856e+06 8.734398e+06 9.017247e+06 4.982414e+06 5.323761e+06 5.566637e+06 5.384766e+06 ... 5.905824e+06 6.872153e+06 7.202598e+06 6.212529e+06 5.734786e+06 5.006788e+06 8.467383e+06 1.183013e+07 1.183013e+07 5.676307e+07
YSK1 2.011151e+08 2.429346e+08 2.492902e+08 2.329794e+08 2.793240e+08 5.318319e+08 1.699564e+08 1.605611e+08 1.520427e+08 1.801505e+08 ... 2.603125e+08 4.353054e+08 4.744941e+08 1.853705e+08 1.990460e+08 1.114695e+08 9.829561e+07 9.425377e+07 9.425377e+07 8.056884e+07
YSK4 8.476214e+07 1.041055e+08 1.064002e+08 1.206393e+08 1.541274e+08 1.674547e+08 7.388398e+07 5.712895e+07 6.185101e+07 7.312922e+07 ... 1.133369e+08 1.403252e+08 1.065588e+08 1.276831e+08 1.069086e+08 1.009037e+08 8.336549e+07 1.092130e+08 1.092130e+08 6.710453e+07
ZAK 1.775228e+08 1.882077e+08 1.937538e+08 1.992441e+08 2.961481e+08 3.950234e+08 1.334480e+08 1.267392e+08 1.402677e+08 1.420544e+08 ... 1.855718e+08 3.330092e+08 2.793030e+08 1.693537e+08 1.535532e+08 9.440832e+07 1.021720e+08 1.076703e+08 1.076703e+08 1.127176e+08

303 rows × 207 columns

one_kinase = get_one_kinase(raw,'PDHK1')
one_kinase.head()
aa A C D E F G H I K L ... P Q R S T V W Y t y
position
-5 8742435.33 10414182.29 8663835.37 8096013.86 11402696.32 10253402.14 10105837.98 8683931.90 7578162.13 9660152.81 ... 6637930.10 6242275.07 8735083.42 17325761.72 10840094.13 8430649.60 14729350.10 11402696.32 8575155.23 9671765.02
-4 9382375.57 10685938.26 8357249.08 7761083.75 11217909.10 10855959.77 9079043.40 9130790.82 7898317.44 9322057.05 ... 9672268.74 8379245.47 9377210.68 10952415.10 9895845.34 7886254.77 13908900.76 11217909.10 8025228.06 11415154.10
-3 9566806.27 10274228.62 7860338.75 6664677.78 12646646.40 9136758.39 10619788.43 10815274.55 7575486.39 10510394.47 ... 8973502.17 8383343.00 8378836.06 15571737.29 10373422.50 9253028.96 17526458.60 12646646.40 6558017.14 8706611.00
-2 8874823.78 11219554.16 7104673.31 6607581.65 11937469.77 13445698.89 11887506.94 8049058.41 6643874.14 9617614.67 ... 7548109.53 8208440.55 9307590.91 20205849.32 13325121.79 7839573.90 16355323.34 11937469.77 4944830.56 8422409.78
-1 10110169.52 14777201.90 12784916.61 5507173.44 8406884.45 8990141.98 10109111.77 6409587.79 5295768.52 7469514.59 ... 6981606.35 6472612.56 6069925.70 19309187.20 22395646.37 6650117.78 9773567.40 8406884.45 4625731.13 5606047.19

5 rows × 22 columns

Setting normalize to True normalizes the data using the normalization method described above.

one_normalized = get_one_kinase(raw,'PDHK1',normalize=True)
one_normalized.head()
aa A C D E F G H I K L ... P Q R S T V W Y t y
position
-5 0.0594 0.0625 0.0589 0.0550 0.0775 0.0697 0.0687 0.0590 0.0515 0.0657 ... 0.0451 0.0424 0.0594 0.0594 0.0594 0.0573 0.1001 0.0775 0.0583 0.0658
-4 0.0618 0.0621 0.0550 0.0511 0.0739 0.0715 0.0598 0.0601 0.0520 0.0614 ... 0.0637 0.0552 0.0617 0.0608 0.0608 0.0519 0.0916 0.0739 0.0528 0.0752
-3 0.0608 0.0576 0.0499 0.0423 0.0803 0.0580 0.0674 0.0687 0.0481 0.0667 ... 0.0570 0.0532 0.0532 0.0584 0.0584 0.0588 0.1113 0.0803 0.0416 0.0553
-2 0.0587 0.0655 0.0470 0.0437 0.0790 0.0890 0.0787 0.0533 0.0440 0.0637 ... 0.0500 0.0543 0.0616 0.0565 0.0565 0.0519 0.1082 0.0790 0.0327 0.0557
-1 0.0782 0.1009 0.0989 0.0426 0.0650 0.0695 0.0782 0.0496 0.0409 0.0578 ... 0.0540 0.0500 0.0469 0.0594 0.0594 0.0514 0.0756 0.0650 0.0358 0.0433

5 rows × 22 columns

To further scale the data following the scaling method from Johnson et al., Nature: An atlas of substrate specificities for the human serine/threonine kinome, we multiply all values by the number of randomized amino acids: 17 for most kinases, and 16 for PDHK1 and PDHK4.

Equivalently, values for most kinases are divided by 1/17 (1 / number of random amino acids), and values for PDHK1 and PDHK4 are divided by 1/16.

num_dict = Data.get_num_dict()
# multiply all values by a scale factor (number of random amino acids)
scale2 = norm.apply(lambda r: r*num_dict.get(r.name), axis=1)
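
As a small illustration of this row-wise scaling, the sketch below applies the same pattern to a toy frame and compares it with a vectorized alternative using `DataFrame.mul`. The toy values and the names `toy_norm` and `toy_num` are assumptions; the factors 17 and 16 follow the description above.

```python
import numpy as np
import pandas as pd

# Toy normalized values and scale factors (hypothetical, for illustration)
toy_norm = pd.DataFrame({'-1P': [0.05, 0.06], '1G': [0.07, 0.08]},
                        index=pd.Index(['AAK1', 'PDHK1'], name='kinase'))
toy_num = {'AAK1': 17, 'PDHK1': 16}

# Row-wise version: look up each kinase's factor by the row name
by_apply = toy_norm.apply(lambda r: r * toy_num.get(r.name), axis=1)

# Vectorized alternative: align the factors on the kinase index
by_mul = toy_norm.mul(pd.Series(toy_num), axis=0)
```

Both approaches give the same values on this toy input; the vectorized form avoids a Python-level loop over rows.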

We can compare the computed values (scale2) with the original scaled values from the paper (scale). They agree, apart from small rounding differences.

scale.head(2).round(2)
-5P -5G -5A -5C -5S -5T -5V -5I -5L -5M ... 4H 4K 4R 4Q 4N 4D 4E 4s 4t 4y
kinase
AAK1 1.22 0.42 0.48 0.78 0.72 0.72 1.62 2.64 1.69 1.47 ... 0.95 1.41 1.58 1.08 1.01 0.66 0.78 0.43 0.43 0.46
ACVR2A 0.71 0.82 0.99 0.83 0.98 0.98 1.02 1.06 1.01 0.89 ... 0.97 0.90 0.84 1.05 0.95 1.09 1.09 1.20 1.20 1.00

2 rows × 207 columns

scale2.head(2).round(2)
-5P -5G -5A -5C -5S -5T -5V -5I -5L -5M ... 4H 4K 4R 4Q 4N 4D 4E 4s 4t 4y
kinase
AAK1 1.22 0.42 0.48 0.78 0.72 0.72 1.62 2.64 1.69 1.47 ... 0.95 1.41 1.58 1.08 1.01 0.66 0.78 0.43 0.43 0.46
ACVR2A 0.71 0.82 0.99 0.83 0.98 0.98 1.02 1.06 1.01 0.89 ... 0.97 0.90 0.83 1.05 0.95 1.09 1.09 1.20 1.20 1.00

2 rows × 207 columns
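
A numerical check can complement the visual comparison. The sketch below uses a hypothetical helper, `max_abs_diff`, on two small stand-in frames whose values are copied from the rounded output above, so the difference it reports reflects display rounding rather than the full-precision data.

```python
import pandas as pd

def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two aligned frames."""
    return (a - b).abs().max().max()

# Stand-in frames built from a few rounded values shown above
paper = pd.DataFrame({'4R': [1.58, 0.84], '4Q': [1.08, 1.05]},
                     index=['AAK1', 'ACVR2A'])
ours = pd.DataFrame({'4R': [1.58, 0.83], '4Q': [1.08, 1.05]},
                    index=['AAK1', 'ACVR2A'])
diff = max_abs_diff(paper, ours)
```

On the real data, the same helper could be called as `max_abs_diff(scale, scale2)`, assuming both frames share the same index and columns.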

End