Core

Core functions in Katlas library

Setup

Data

This section shows how to load the kinase information data and the phosphorylation site data.


source

Data

 Data ()

A class for fetching various datasets.

Datasets used in this study can be accessed through Data.

To load kinase information data:

kinase = Data.get_kinase_info()
kinase.head()
kinase ID_coral uniprot ID_HGNC group family subfamily_coral subfamily in_ST_paper in_Tyr_paper ... cytosol cytoskeleton plasma membrane mitochondrion Golgi apparatus endoplasmic reticulum vesicle centrosome aggresome main_location
0 AAK1 AAK1 Q2M2I8 AAK1 Other NAK NaN NAK 1 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 ABL1 ABL1 P00519 ABL1 TK Abl NaN Abl 0 1 ... 6.0 NaN 4.0 NaN NaN NaN NaN NaN NaN cytosol
2 ABL2 ABL2 P42684 ABL2 TK Abl NaN Abl 0 1 ... 4.0 6.0 NaN NaN NaN NaN NaN NaN NaN cytoskeleton
3 TNK2 ACK Q07912 TNK2 TK Ack NaN Ack 0 1 ... NaN NaN NaN NaN NaN NaN 8.0 NaN 2.0 vesicle
4 ACVR2A ACTR2 P27037 ACVR2A TKL STKR STKR2 STKR2 1 0 ... 5.0 NaN NaN NaN NaN 5.0 NaN NaN NaN cytosol

5 rows × 30 columns

CPTAC


source

CPTAC

 CPTAC ()

A class for fetching CPTAC phosphoproteomics data.

To check available cancer types, use CPTAC.list_cancer()

CPTAC.list_cancer()
['HNSCC', 'GBM', 'COAD', 'CCRCC', 'LSCC', 'BRCA', 'UCEC', 'LUAD', 'PDAC', 'OV']

To load CPTAC phosphorylation site information, use CPTAC.get_id(). Fold change of various conditions can be acquired through LinkedOmics or LinkedOmicsKB. Use is_KB to indicate whether the phosphorylation site information is for LinkedOmics or LinkedOmicsKB.

# Example of getting phosphorylation site information
normal = CPTAC.get_id('CCRCC',is_KB=True, is_Tumor=False)
normal.head()
the CCRCC dataset length is: 53152
after id mapping, the length is 209188
0 sites does not have a mapped gene name
after removing duplicates of protein_site, the length is 208298
gene site site_seq protein gene_name gene_site protein_site
0 ENSG00000003056.8 S267 DDQLGEESEERDDHL ENSP00000000412.3 M6PR M6PR_S267 ENSP00000000412_S267
1 ENSG00000003056.8 S267 DDQLGEESEERDDHL ENSP00000440488.2 M6PR M6PR_S267 ENSP00000440488_S267
2 ENSG00000048028.11 S1053 PPTIRPNSPYDLCSR ENSP00000003302.4 USP28 USP28_S1053 ENSP00000003302_S1053
3 ENSG00000048028.11 S1053 PPTIRPNSPYDLCSR ENSP00000445743.1 USP28 USP28_S1053 ENSP00000445743_S1053
4 ENSG00000048028.11 S1053 PPTIRPNSPYDLCSR ENSP00000442431.1 USP28 USP28_S1053 ENSP00000442431_S1053

Checker

In many phosphorylation datasets, some amino acids in the site sequence are in lower case but are not s/t/y. Uncommon amino acids such as U or O can also appear in the sequence. It is therefore essential to convert the sequence string before kinase ranking.


source

check_seq

 check_seq (seq)

Convert non-s/t/y characters to uppercase and replace disallowed characters with underscores.

try:
    check_seq('aaadaaa')
except Exception as e:
    print(e)
aaadaaa has d at position 3; need to have one of 's', 't', or 'y' in the center
check_seq('AAkUuPSFstTH') # raises an error if the center amino acid is not one of s/t/y or S/T/Y
'AAK__PSFstTH'
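
The conversion described above can be sketched in plain Python. This is a minimal illustration, not the Katlas implementation: the helper name check_seq_sketch and the allowed character set (the 20 standard amino acids, lowercase s/t/y, and the underscore) are assumptions chosen to reproduce the example output.

```python
# Hypothetical helper; the allowed set below is an assumption, not taken from Katlas.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWY") | set("sty") | {"_"}

def check_seq_sketch(seq: str) -> str:
    """Uppercase everything except s/t/y, then mask disallowed characters with '_'."""
    center = len(seq) // 2
    if seq[center].lower() not in "sty":
        raise ValueError(
            f"{seq} has {seq[center]} at position {center}; "
            "need to have one of 's', 't', or 'y' in the center"
        )
    out = []
    for ch in seq:
        ch = ch if ch in "sty" else ch.upper()   # keep phospho-acceptors in lower case
        out.append(ch if ch in ALLOWED else "_")  # e.g. U or O become '_'
    return "".join(out)

print(check_seq_sketch("AAkUuPSFstTH"))  # AAK__PSFstTH
```

Checking the center first means malformed inputs fail before any characters are rewritten.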

source

check_seq_df

 check_seq_df (df, col)

Convert non-s/t/y to upper case & replace with underscore if the character is not in the allowed set

df=Data.get_human_site()
df.head()
substrate_uniprot substrate_genes site source AM_pathogenicity substrate_sequence substrate_species sub_site substrate_phosphoseq position site_seq
0 A0A024R4G9 C19orf48 MGC13170 hCG_2008493 S20 psp NaN MTVLEAVLEIQAITGSRLLSMVPGPARPPGSCWDPTQCTRTWLLSH... Homo sapiens (Human) A0A024R4G9_S20 MTVLEAVLEIQAITGSRLLsMVPGPARPPGSCWDPTQCTRTWLLSH... 20 _MTVLEAVLEIQAITGSRLLsMVPGPARPPGSCWDPTQCTR
1 A0A075B6Q4 None S24 ochoa NaN MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... Homo sapiens (Human) A0A075B6Q4_S24 MDIQKSENEDDSEWEDVDDEKGDsNDDYDSAGLLsDEDCMSVPGKT... 24 QKSENEDDSEWEDVDDEKGDsNDDYDSAGLLsDEDCMSVPG
2 A0A075B6Q4 None S35 ochoa NaN MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... Homo sapiens (Human) A0A075B6Q4_S35 MDIQKSENEDDSEWEDVDDEKGDsNDDYDSAGLLsDEDCMSVPGKT... 35 EDVDDEKGDsNDDYDSAGLLsDEDCMSVPGKTHRAIADHLF
3 A0A075B6Q4 None S57 ochoa NaN MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... Homo sapiens (Human) A0A075B6Q4_S57 MDIQKSENEDDSEWEDVDDEKGDsNDDYDSAGLLsDEDCMSVPGKT... 57 EDCMSVPGKTHRAIADHLFWsEETKSRFTEYsMTssVMRRN
4 A0A075B6Q4 None S68 ochoa NaN MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... Homo sapiens (Human) A0A075B6Q4_S68 MDIQKSENEDDSEWEDVDDEKGDsNDDYDSAGLLsDEDCMSVPGKT... 68 RAIADHLFWsEETKSRFTEYsMTssVMRRNEQLTLHDERFE
check_seq_df(df.head(),'site_seq')
0    _MTVLEAVLEIQAITGSRLLsMVPGPARPPGSCWDPTQCTR
1    QKSENEDDSEWEDVDDEKGDsNDDYDSAGLLsDEDCMSVPG
2    EDVDDEKGDsNDDYDSAGLLsDEDCMSVPGKTHRAIADHLF
3    EDCMSVPGKTHRAIADHLFWsEETKSRFTEYsMTssVMRRN
4    RAIADHLFWsEETKSRFTEYsMTssVMRRNEQLTLHDERFE
Name: site_seq, dtype: object

source

validate_site

 validate_site (site_info, seq)

Validate that the residue at the site position matches the residue given in the site information.

site='S610'
seq = 'MSVPSSLSQSAINANSHGGPALSLPLPLHAAHNQLLNAKLQATAVGPKDLRSAMGEGGGPEPGPANAKWLKEGQNQLRRAATAHRDQNRNVTLTLAEEASQEPEMAPLGPKGLIHLYSELELSAHNAANRGLRGPGLIISTQEQGPDEGEEKAAGEAEEEEEDDDDEEEEEDLSSPPGLPEPLESVEAPPRPQALTDGPREHSKSASLLFGMRNSAASDEDSSWATLSQGSPSYGSPEDTDSFWNPNAFETDSDLPAGWMRVQDTSGTYYWHIPTGTTQWEPPGRASPSQGSSPQEESQLTWTGFAHGEGFEDGEFWKDEPSDEAPMELGLKEPEEGTLTFPAQSLSPEPLPQEEEKLPPRNTNPGIKCFAVRSLGWVEMTEEELAPGRSSVAVNNCIRQLSYHKNNLHDPMSGGWGEGKDLLLQLEDETLKLVEPQSQALLHAQPIISIRVWGVGRDSGRERDFAYVARDKLTQMLKCHVFRCEAPAKNIATSLHEICSKIMAERRNARCLVNGLSLDHSKLVDVPFQVEFPAPKNELVQKFQVYYLGNVPVAKPVGVDVINGALESVLSSSSREQWTPSHVSVAPATLTILHQQTEAVLGECRVRFLSFLAVGRDVHTFAFIMAAGPASFCCHMFWCEPNAASLSEAVQAACMLRYQKCLDARSQASTSCLPAPPAESVARRVGWTVRRGVQSLWGSLKPKRLGAHTP'
validate_site(site,seq)
1
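
As a rough sketch of what this check involves, assuming site strings of the form residue letter plus 1-based position (as in 'S610'): the helper name validate_site_sketch is hypothetical, and returning 0 for out-of-range positions is an assumption about how such cases might be handled.

```python
def validate_site_sketch(site_info: str, seq: str) -> int:
    """Return 1 if seq carries the residue named in site_info at that position, else 0."""
    residue, position = site_info[0], int(site_info[1:])
    if position < 1 or position > len(seq):
        return 0  # assumption: an out-of-range site counts as a mismatch
    return int(seq[position - 1].upper() == residue.upper())

protein = "MSKSESPKEPEQ"  # short illustrative sequence
print(validate_site_sketch("S6", protein))  # 1
print(validate_site_sketch("T6", protein))  # 0
```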

source

validate_site_df

 validate_site_df (df, site_info_col, protein_seq_col)

Validate that the residue at the site position matches the residue given in the site information, for each row of a dataframe.

validate_site_df(df.head(),'site','substrate_sequence')
0    1
1    1
2    1
3    1
4    1
dtype: int64

Onehot


source

onehot_encode

 onehot_encode (sequences)
df=Data.get_combine_site_psp_ochoa()
onehot_encode(df['site_seq'].head(1000))
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

Scoring for single sequence

Algorithms

Function1 - multiply


source

multiply_func

 multiply_func (values, factor=17)

Multiply the possibilities of the amino acids at each position in a phosphorylation site

Type Default Details
values list of values, possibilities of amino acids at certain positions
factor int 17 scale factor

This function implements the formula from Johnson et al., Nature, "An atlas of substrate specificities for the human serine/threonine kinome", Supplementary Note 2 (page 160).
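
As a sketch, one expression that reproduces the multiply example on this page (values 1 to 5 with a scale factor of 16 give 22.9069) is the log2 of the product of the values, rescaled by the factor once per additional position. The symbols v_i (value at position i), n (number of positions) and f (scale factor) are notation introduced here, and the exact form used in the library is an assumption:

```latex
\log_2(\text{score}) = \sum_{i=1}^{n} \log_2 v_i + (n-1)\,\log_2 f
```

For v = (1, 2, 3, 4, 5) and f = 16 this gives log2(120) + 4 × 4 ≈ 6.9069 + 16 = 22.9069.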

Multiply function with a kinase-specific scale factor


source

multiply

 multiply (values, kinase, num_dict={'SYK': 18, 'PTK2': 18, 'ZAP70': 18,
           'ERBB2': 18, 'CSK': 18, 'FGFR4': 18, 'EGFR': 18, 'ERBB4': 18,
           'EPHA8': 18, 'EPHA7': 18, 'EPHA5': 18, 'EPHA2': 18, 'EPHB2':
           18, 'EPHB1': 18, 'EPHB3': 18, 'EPHB4': 18, 'EPHA4': 18,
           'EPHA3': 18, 'EPHA6': 18, 'FRK': 18, 'EPHA1': 18, 'TEC': 18,
           'BTK': 18, 'ITK': 18, 'BMX': 18, 'TXK': 16, 'ABL2': 18, 'ABL1':
           18, 'SRMS': 18, 'PTK2B': 18, 'FER': 18, 'MERTK': 18, 'AXL': 18,
           'FES': 18, 'PTK6': 18, 'YES1': 18, 'FGR': 18, 'SRC': 18, 'FYN':
           18, 'LCK': 18, 'BLK': 18, 'LYN': 18, 'HCK': 18, 'PDGFRB': 18,
           'PDGFRA': 18, 'FLT3': 18, 'TYRO3': 18, 'ROS1': 18, 'TEK': 18,
           'LTK': 18, 'ALK': 18, 'MUSK': 18, 'KIT': 18, 'CSF1R': 18,
           'MET': 18, 'KDR': 18, 'RET': 18, 'MST1R': 16, 'JAK3': 16,
           'FLT1': 16, 'MATK': 18, 'FGFR3': 18, 'FGFR2': 18, 'FGFR1': 18,
           'FLT4': 18, 'INSR': 18, 'IGF1R': 18, 'INSRR': 16, 'NTRK3': 18,
           'NTRK1': 18, 'NTRK2': 18, 'TNK1': 18, 'TNK2': 18, 'DDR2': 18,
           'DDR1': 18, 'TYK2': 18, 'JAK2': 18, 'JAK1': 18, 'TNNI3K_TYR':
           18, 'NEK10_TYR': 16, 'PINK1_TYR': 16, 'MAP2K7_TYR': 16,
           'PKMYT1_TYR': 16, 'TESK1_TYR': 16, 'LIMK1_TYR': 16,
           'LIMK2_TYR': 16, 'WEE1_TYR': 18, 'MAP2K6_TYR': 16,
           'MAP2K4_TYR': 16, 'PDHK1_TYR': 16, 'BMPR2_TYR': 16,
           'PDHK4_TYR': 16, 'PDHK3_TYR': 16, 'AAK1': 17, 'ACVR2A': 17,
           'ACVR2B': 17, 'AKT1': 17, 'AKT2': 17, 'AKT3': 17, 'ALK2': 17,
           'ALK4': 17, 'ALPHAK3': 17, 'AMPKA1': 17, 'AMPKA2': 17,
           'ANKRD3': 17, 'ATM': 17, 'ATR': 17, 'AURA': 17, 'AURB': 17,
           'AURC': 17, 'GRK2': 17, 'GRK3': 17, 'BCKDK': 17, 'BIKE': 17,
           'BMPR1A': 17, 'BMPR1B': 17, 'BMPR2': 17, 'BRAF': 17, 'BRSK1':
           17, 'BRSK2': 17, 'BUB1': 17, 'CAMK1A': 17, 'CAMK1B': 17,
           'CAMK1D': 17, 'CAMK1G': 17, 'CAMK2A': 17, 'CAMK2B': 17,
           'CAMK2D': 17, 'CAMK2G': 17, 'CAMK4': 17, 'CAMKK1': 17,
           'CAMKK2': 17, 'CAMLCK': 17, 'CDK1': 17, 'CDC7': 17, 'CDK10':
           17, 'CDK19': 17, 'CDK2': 17, 'CDK3': 17, 'CDK4': 17, 'CDK5':
           17, 'CDK6': 17, 'CDK7': 17, 'CDK8': 17, 'CDK9': 17, 'CDKL1':
           17, 'CDKL5': 17, 'CHAK1': 17, 'CHAK2': 17, 'CDK13': 17, 'CHK1':
           17, 'CHK2': 17, 'CK1A': 17, 'CK1A2': 17, 'CK1D': 17, 'CK1E':
           17, 'CK1G1': 17, 'CK1G2': 17, 'CK1G3': 17, 'CK2A1': 17,
           'CK2A2': 17, 'CLK1': 17, 'CLK2': 17, 'CLK3': 17, 'CLK4': 17,
           'COT': 17, 'CRIK': 17, 'CDK12': 17, 'DAPK1': 17, 'DAPK2': 17,
           'DAPK3': 17, 'DCAMKL1': 17, 'DCAMKL2': 17, 'DLK': 17, 'DMPK1':
           17, 'DNAPK': 17, 'DRAK1': 17, 'DYRK1A': 17, 'DYRK1B': 17,
           'DYRK2': 17, 'DYRK3': 17, 'DYRK4': 17, 'ERK1': 17, 'ERK2': 17,
           'ERK5': 17, 'ERK7': 17, 'MTOR': 17, 'GAK': 17, 'GCK': 17,
           'GCN2': 17, 'GRK4': 17, 'GRK5': 17, 'GRK6': 17, 'GRK7': 17,
           'GSK3A': 17, 'GSK3B': 17, 'HASPIN': 17, 'HGK': 17, 'HIPK1': 17,
           'HIPK2': 17, 'HIPK3': 17, 'HIPK4': 17, 'HPK1': 17, 'HRI': 17,
           'HUNK': 17, 'ICK': 17, 'IKKA': 17, 'IKKB': 17, 'IKKE': 17,
           'IRAK1': 17, 'IRAK4': 17, 'IRE1': 17, 'IRE2': 17, 'JNK1': 17,
           'JNK2': 17, 'JNK3': 17, 'KHS1': 17, 'KHS2': 17, 'KIS': 17,
           'LATS1': 17, 'LATS2': 17, 'LKB1': 17, 'LOK': 17, 'LRRK2': 17,
           'MAK': 17, 'MEK1': 17, 'MEK2': 17, 'MEK5': 17, 'MEKK1': 17,
           'YSK4': 17, 'MEKK2': 17, 'MEKK3': 17, 'ASK1': 17, 'MEKK6': 17,
           'MAP3K15': 17, 'MAPKAPK2': 17, 'MAPKAPK3': 17, 'MAPKAPK5': 17,
           'MARK1': 17, 'MARK2': 17, 'MARK3': 17, 'MARK4': 17, 'MASTL':
           17, 'MELK': 17, 'MINK': 17, 'MLK1': 17, 'MLK2': 17, 'MLK3': 17,
           'MLK4': 17, 'MNK1': 17, 'MNK2': 17, 'MOK': 17, 'MOS': 17,
           'MPSK1': 17, 'MRCKA': 17, 'MRCKB': 17, 'MSK1': 17, 'MSK2': 17,
           'SRPK3': 17, 'MST1': 17, 'MST2': 17, 'MST3': 17, 'MST4': 17,
           'MYO3A': 17, 'MYO3B': 17, 'NDR1': 17, 'NDR2': 17, 'NEK1': 17,
           'NEK11': 17, 'NEK2': 17, 'NEK3': 17, 'NEK4': 17, 'NEK5': 17,
           'NEK6': 17, 'NEK7': 17, 'NEK8': 17, 'NEK9': 17, 'NIK': 17,
           'NIM1': 17, 'NLK': 17, 'NUAK1': 17, 'NUAK2': 17, 'OSR1': 17,
           'P38A': 17, 'P38B': 17, 'P38D': 17, 'P38G': 17, 'P70S6K': 17,
           'P70S6KB': 17, 'PAK1': 17, 'PAK2': 17, 'PAK3': 17, 'PAK4': 17,
           'PAK5': 17, 'PAK6': 17, 'PASK': 17, 'PBK': 17, 'CDK16': 17,
           'CDK17': 17, 'CDK18': 17, 'PDHK1': 16, 'PDHK4': 16, 'PDK1': 17,
           'PERK': 17, 'CDK14': 17, 'PHKG1': 17, 'PHKG2': 17, 'PIM1': 17,
           'PIM2': 17, 'PIM3': 17, 'PINK1': 17, 'PKACA': 17, 'PKACB': 17,
           'PKACG': 17, 'PKCA': 17, 'PKCB': 17, 'PKCD': 17, 'PKCE': 17,
           'PKCG': 17, 'PKCH': 17, 'PKCI': 17, 'PKCT': 17, 'PKCZ': 17,
           'PRKD1': 17, 'PRKD2': 17, 'PRKD3': 17, 'PKG1': 17, 'PKG2': 17,
           'PKN1': 17, 'PKN2': 17, 'PKN3': 17, 'PKR': 17, 'PLK1': 17,
           'PLK2': 17, 'PLK3': 17, 'PLK4': 17, 'PRKX': 17, 'PRP4': 17,
           'PRPK': 17, 'QIK': 17, 'QSK': 17, 'RAF1': 17, 'GRK1': 17,
           'RIPK1': 17, 'RIPK2': 17, 'RIPK3': 17, 'ROCK1': 17, 'ROCK2':
           17, 'P90RSK': 17, 'RSK2': 17, 'RSK3': 17, 'RSK4': 17, 'SBK':
           17, 'MYLK4': 17, 'SGK1': 17, 'SGK3': 17, 'DSTYK': 17, 'SIK':
           17, 'SKMLCK': 17, 'SLK': 17, 'SMG1': 17, 'SMMLCK': 17, 'SNRK':
           17, 'SRPK1': 17, 'SRPK2': 17, 'SSTK': 17, 'STK33': 17, 'STLK3':
           17, 'TAK1': 17, 'TAO1': 17, 'TAO2': 17, 'TAO3': 17, 'TBK1': 17,
           'TGFBR1': 17, 'TGFBR2': 17, 'TLK1': 17, 'TLK2': 17, 'TNIK': 17,
           'TSSK1': 17, 'TSSK2': 17, 'TTBK1': 17, 'TTBK2': 17, 'TTK': 17,
           'ULK1': 17, 'ULK2': 17, 'VRK1': 17, 'VRK2': 17, 'WNK1': 17,
           'WNK3': 17, 'WNK4': 17, 'YANK2': 17, 'YANK3': 17, 'YSK1': 17,
           'ZAK': 17, 'EEF2K': 17, 'FAM20C': 17})

Multiply values using a kinase-specific scale factor, which is the number of random amino acids in that kinase's PSPA.

multiply(values=[1,2,3,4,5],kinase='PDHK1')
22.906890595608516
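
The following sketch shows how a kinase-specific factor could be looked up and applied. It is not the library code: multiply_sketch and NUM_RANDOM_AA are hypothetical names, the dictionary is a three-entry subset of the num_dict in the signature above, and the scoring expression is an assumption that happens to reproduce the PDHK1 output.

```python
import math

# Subset of the num_dict shown in the signature above; all other kinases are omitted.
NUM_RANDOM_AA = {"PDHK1": 16, "AAK1": 17, "SRC": 18}

def multiply_sketch(values, kinase, num_dict=NUM_RANDOM_AA):
    """log2 of the product of values, rescaled by the kinase's scale factor."""
    factor = num_dict[kinase]
    log_product = sum(math.log2(v) for v in values)
    return log_product + (len(values) - 1) * math.log2(factor)

print(round(multiply_sketch([1, 2, 3, 4, 5], "PDHK1"), 4))  # 22.9069
```

Working in log space avoids underflow when many small values are multiplied.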

Function2 - sum up


source

sumup

 sumup (values, kinase=None)

Sum up the possibilities of the amino acids at each position in a phosphorylation site sequence

Type Default Details
values list of values, possibilities of amino acids at certain positions
kinase NoneType None

Utils


source

STY2sty

 STY2sty (input_string:str)

Replace all ‘STY’ with ‘sty’ in a sequence

STY2sty('AAkUuPSFSTtH') # convert all capital STY to sty in a string
'AAkUuPsFsttH'
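
A minimal sketch of the same replacement using a translation table; the name sty_to_lower_sketch is hypothetical and this is not necessarily how STY2sty is written.

```python
# Map only the three phospho-acceptor letters; every other character is untouched.
STY_TO_LOWER = str.maketrans("STY", "sty")

def sty_to_lower_sketch(seq: str) -> str:
    """Lowercase every S, T and Y in the sequence."""
    return seq.translate(STY_TO_LOWER)

print(sty_to_lower_sketch("AAkUuPSFSTtH"))  # AAkUuPsFsttH
```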

source

get_dict

 get_dict (input_string:str)

Get a dictionary of input string; no need for the star in the middle; make sure it is 15 or 10 length

Type Details
input_string str phosphorylation site sequence
cols = get_dict("PSVEPPLsQETFSDL")
cols
['-7P',
 '-6S',
 '-5V',
 '-4E',
 '-3P',
 '-2P',
 '-1L',
 '0s',
 '1Q',
 '2E',
 '3T',
 '4F',
 '5S',
 '6D',
 '7L']
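
The position labels above can be reproduced with a short sketch. It assumes the phospho-acceptor sits at index len(seq) // 2, which holds for 15-mers (-7 to +7) and for 10-mers (-5 to +4); position_keys_sketch is a hypothetical name.

```python
def position_keys_sketch(seq: str) -> list[str]:
    """Label each residue with its position relative to the central phospho-acceptor."""
    center = len(seq) // 2
    return [f"{i - center}{aa}" for i, aa in enumerate(seq)]

keys = position_keys_sketch("PSVEPPLsQETFSDL")
print(keys[0], keys[7], keys[-1])  # -7P 0s 7L
```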

Scoring func


source

predict_kinase

 predict_kinase (input_string:str, ref:pandas.core.frame.DataFrame,
                 func:Callable, to_lower:bool=False, to_upper:bool=False,
                 verbose=True)

Predict kinase given a phosphorylation site sequence

Type Default Details
input_string str site sequence
ref DataFrame reference dataframe for scoring
func Callable function to calculate score
to_lower bool False convert capital STY to lower case
to_upper bool False convert all letters to uppercase
verbose bool True

Params

Here we provide different PSSM settings from either PSPA data or kinase-substrate dataset for kinase prediction:


source

Params

 Params (name=None)
Params()
Available parameter sets:
['PSPA_st', 'PSPA_y', 'PSPA', 'CDDM', 'CDDM_upper']
for p in ['PSPA', 'CDDM','CDDM_upper']:
    print(predict_kinase("PSVEPPLsQETFSDL",**Params(p)).head())
considering string: ['-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S']
kinase
ATM       5.037
SMG1      4.385
DNAPK     3.818
ATR       3.507
FAM20C    3.170
dtype: float64
considering string: ['-7P', '-6S', '-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S', '6D', '7L']
kinase
ATR      3.064
ATM      2.909
DNAPK    2.270
CK2A1    1.873
TSSK1    1.856
dtype: float64
considering string: ['-7P', '-6S', '-5V', '-4E', '-3P', '-2P', '-1L', '0S', '1Q', '2E', '3T', '4F', '5S', '6D', '7L']
kinase
ATR      3.229
ATM      3.038
DNAPK    2.479
CK2A1    2.006
CDK8     1.999
dtype: float64

Scoring for sequences in df

Utils


source

cut_seq

 cut_seq (input_string:str, min_position:int, max_position:int)

Extract sequence based on a range relative to its center position

Type Details
input_string str site sequence
min_position int minimum position relative to its center
max_position int maximum position relative to its center
cut_seq('AAkUuPSFSTtH',-5,4)
'AkUuPSFSTt'
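
A sketch of the slicing involved, assuming the center is at index len(seq) // 2 and that both bounds are inclusive (which matches the example output); cut_seq_sketch is a hypothetical name.

```python
def cut_seq_sketch(seq: str, min_position: int, max_position: int) -> str:
    """Keep residues from min_position to max_position (inclusive) around the center."""
    center = len(seq) // 2
    return seq[center + min_position : center + max_position + 1]

print(cut_seq_sketch("AAkUuPSFSTtH", -5, 4))  # AkUuPSFSTt
```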

Scoring func


source

predict_kinase_df

 predict_kinase_df (df, seq_col, ref, func, to_lower=False,
                    to_upper=False)
df = Data.get_psp_human_site()
df_sty = df[df['site_seq'].str[7].isin(list('sty'))]
out_cddm = predict_kinase_df(df_sty.head(500),'site_seq', **Params('CDDM'))
input dataframe has a length 500
Preprocessing
Finish preprocessing
Merging reference
Finish merging
CPU times: user 24.4 ms, sys: 4.12 ms, total: 28.5 ms
Wall time: 27.9 ms

Percentile scoring

Single sequence


source

get_pct

 get_pct (site, ref, func, pct_ref)

Replicate the percentile results from The Kinase Library.

st_pct = Data.get_pspa_st_pct()
y_pct = Data.get_pspa_tyr_pct()
a = get_pct('PSVEPPLyQETFSDL',**Params('PSPA_y'), pct_ref=y_pct)
considering string: ['-5V', '-4E', '-3P', '-2P', '-1L', '0Y', '1Q', '2E', '3T', '4F', '5S']
a.sort_values('percentile',ascending=False)
log2(score) percentile
ABL2 3.137 96.568694
BMX 2.816 96.117567
BTK 1.956 95.693780
CSK 2.303 95.174299
MERTK 2.509 93.588517
... ... ...
FLT1 -1.919 25.358852
PINK1_TYR -1.227 21.927546
MUSK -3.031 21.298701
TNNI3K_TYR -3.549 11.004785
PKMYT1_TYR -1.739 4.798360

93 rows × 2 columns

get_pct('PSVEPPLsQETFSDL',**Params('PSPA_st'), pct_ref=st_pct)
considering string: ['-5V', '-4E', '-3P', '-2P', '-1L', '0S', '1Q', '2E', '3T', '4F']
log2(score) percentile
ATM 5.037 99.822351
SMG1 4.385 99.831819
DNAPK 3.818 99.205315
ATR 3.507 99.680344
FAM20C 3.170 95.370556
... ... ...
PKN1 -7.275 14.070436
P70S6K -7.295 4.089816
AKT3 -7.375 11.432995
PKCI -7.742 8.129511
NEK3 -8.254 4.637240

303 rows × 2 columns
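
For intuition, a percentile of this kind can be obtained by ranking a score within a reference distribution of scores for the same kinase. The sketch below shows one common definition (the share of reference scores at or below the query). Whether get_pct or The Kinase Library uses exactly this definition is an assumption, and the reference values are made up for illustration.

```python
from bisect import bisect_right

def percentile_sketch(score: float, reference_scores: list[float]) -> float:
    """Percentage of reference scores that are less than or equal to score."""
    ordered = sorted(reference_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)

reference = [-3.0, -1.0, 0.0, 2.0]  # hypothetical log2 scores for one kinase
print(percentile_sketch(0.5, reference))  # 75.0
```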

Sequence in df


source

get_pct_df

 get_pct_df (score_df, pct_ref)

Replicate the percentile results from The Kinase Library.

Details
score_df output from predict_kinase_df
pct_ref a reference df for percentile calculation
# substrate score first
# score_df = predict_kinase_df(df_sty,'site_seq', **Params('PSPA_st'))

# get percentile reference
# pct_ref = Data.get_pspa_st_pct()
# pct = get_pct_df(score_df,pct_ref)

Phosphorylate protein seq


source

phosphorylate_seq

 phosphorylate_seq (r, site_info_col='site',
                    sub_seq_col='substrate_sequence')

Phosphorylate whole sequence based on phosphosites

ks = Data.get_ks_dataset()
site_df = ks.groupby('substrate_uniprot').agg({'site':lambda x: x.unique(),
                                  'substrate_sequence': 'first',
                                  }).reset_index()
site_df.head()
substrate_uniprot site substrate_sequence
0 A0A2R8Y4L2 [S95, S22, T25, S6, S158] MSKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVM...
1 A0AVT1 [Y553, S697, Y1046, Y49, S1048, S732] MEGSEPVAAHQGEEASCSSWGTGSTNKNLPIMSTASVEIDDALYSR...
2 A0FGR8 [Y824, S821] MTANRDAALSSHRHPGCAQRPRTPTFASSSQRRSAFGFDDGNFPGL...
3 A0JLT2 [S239] MENFTALFGAQADPPPPPTALGFGPGKPPPPPPPPAGGGPGTAPPP...
4 A0MZ66 [S473, T455, S444, S464, S467, S492, S494, S53... MNSSDEEKQLQLITSLKEQAIGEYEDLRAENQKTKEKCDKIRQERD...
phosphorylate_seq(site_df.iloc[0],'site','substrate_sequence')
'MSKSEsPKEPEQLRKLFIGGLsFEtTDESLRSHFEQWGTLTDCVVMRDPNTKRSRGFGFVTYATVEEVDAAMNARPHKVDGRVVEPKRAVSREDsQRPDAHLTVKKIFVGGIKEDTEEHHLRDYFEQYGKIEVIEIMTDRGSGKKRGFAFVTFDDHDsVDKIVIQKYHTVNGHNCEVRKALSKQEMASASSSQRGRSGSGNFGGGRGGGFGGNDNFGRGGNFSGRGGFGGSRGGGGYGGSGDGYNGFGNDGSNFGGGGSYNDFGNYNNQSSNFGPMKGGNFEGRSSGPHGGGGQYFAKPRNQGGYGGSSSSSSYGSGRRF'
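
The idea can be sketched as lowercasing the residue at each listed site. The helper name phosphorylate_sketch is hypothetical, and raising an error on a residue mismatch is an assumption rather than documented behavior.

```python
def phosphorylate_sketch(seq: str, sites: list[str]) -> str:
    """Lowercase the residue at each site such as 'S6' (1-based position)."""
    residues = list(seq)
    for site in sites:
        aa, pos = site[0], int(site[1:])
        if residues[pos - 1].upper() != aa:
            raise ValueError(f"{site} does not match residue {residues[pos - 1]}")
        residues[pos - 1] = aa.lower()
    return "".join(residues)

# First 30 residues of the A0A2R8Y4L2 sequence used in the example above.
print(phosphorylate_sketch("MSKSESPKEPEQLRKLFIGGLSFETTDESL", ["S6", "S22", "T25"]))
# MSKSEsPKEPEQLRKLFIGGLsFEtTDESL
```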

The function below incorporates the groupby step shown above:


source

phosphorylate_seq_df

 phosphorylate_seq_df (df, id_col='substrate_uniprot',
                       site_info_col='site',
                       sub_seq_col='substrate_sequence')

Phosphorylate whole sequence based on phosphosites in a dataframe

phosphorylate_seq_df(ks.head(),'substrate_uniprot','site','substrate_sequence')
substrate_uniprot site substrate_sequence substrate_phosphoseq
0 A4FU28 [S140] MEEPGATPQPYLGLVLEELGRVVAALPESMRPDENPYGFPSELVVC... MEEPGATPQPYLGLVLEELGRVVAALPESMRPDENPYGFPSELVVC...
1 O00141 [S252, S255, S397, S404] MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS... MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS...

Extract site seq


source

extract_site_seq

 extract_site_seq (df:pandas.core.frame.DataFrame, seq_col:str,
                   site_info_col:str, n=7)

Extract -n to +n site sequence from protein sequence

Type Default Details
df DataFrame dataframe that contains protein sequence
seq_col str column name of protein sequence
site_info_col str column name of site information (e.g., S10)
n int 7 length of surrounding sequence (default -7 to +7)

Some datasets contain only protein information and the positions of phosphorylation sites, not the site sequences themselves. In that case, we can retrieve the protein sequence and use this function to get the -7 to +7 phosphorylation site sequence (returned as a numpy array).

Remember to validate the phospho-acceptor at position 0 before extracting the site sequence, as mismatches can occur when the protein sequence database is updated.

df = Data.get_human_site()
df.head()
substrate_uniprot substrate_genes site source AM_pathogenicity substrate_sequence substrate_species sub_site substrate_phosphoseq position site_seq
0 A0A024R4G9 C19orf48 MGC13170 hCG_2008493 S20 psp NaN MTVLEAVLEIQAITGSRLLSMVPGPARPPGSCWDPTQCTRTWLLSH... Homo sapiens (Human) A0A024R4G9_S20 MTVLEAVLEIQAITGSRLLsMVPGPARPPGSCWDPTQCTRTWLLSH... 20 _MTVLEAVLEIQAITGSRLLsMVPGPARPPGSCWDPTQCTR
1 A0A075B6Q4 None S24 ochoa NaN MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... Homo sapiens (Human) A0A075B6Q4_S24 MDIQKSENEDDSEWEDVDDEKGDsNDDYDSAGLLsDEDCMSVPGKT... 24 QKSENEDDSEWEDVDDEKGDsNDDYDSAGLLsDEDCMSVPG
2 A0A075B6Q4 None S35 ochoa NaN MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... Homo sapiens (Human) A0A075B6Q4_S35 MDIQKSENEDDSEWEDVDDEKGDsNDDYDSAGLLsDEDCMSVPGKT... 35 EDVDDEKGDsNDDYDSAGLLsDEDCMSVPGKTHRAIADHLF
3 A0A075B6Q4 None S57 ochoa NaN MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... Homo sapiens (Human) A0A075B6Q4_S57 MDIQKSENEDDSEWEDVDDEKGDsNDDYDSAGLLsDEDCMSVPGKT... 57 EDCMSVPGKTHRAIADHLFWsEETKSRFTEYsMTssVMRRN
4 A0A075B6Q4 None S68 ochoa NaN MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... Homo sapiens (Human) A0A075B6Q4_S68 MDIQKSENEDDSEWEDVDDEKGDsNDDYDSAGLLsDEDCMSVPGKT... 68 RAIADHLFWsEETKSRFTEYsMTssVMRRNEQLTLHDERFE
extract_site_seq(df.head(),
                 seq_col='substrate_sequence',
                 site_info_col='site',
                 n=30
                 )
100%|██████████| 5/5 [00:00<00:00, 7169.75it/s]
array(['___________MTVLEAVLEIQAITGSRLLSMVPGPARPPGSCWDPTQCTRTWLLSHTPRR',
       '_______MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKTHRAIADHL',
       'KSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKTHRAIADHLFWSEETKSRFT',
       'DYDSAGLLSDEDCMSVPGKTHRAIADHLFWSEETKSRFTEYSMTSSVMRRNEQLTLHDERF',
       'DCMSVPGKTHRAIADHLFWSEETKSRFTEYSMTSSVMRRNEQLTLHDERFEKFYEQYDDDE'],
      dtype='<U61')
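
The underscore padding visible in the output above can be sketched as follows: pad the protein with n underscores on both sides, then slice a window of 2n + 1 residues around the site. The name extract_window_sketch is hypothetical and the sketch handles a single site rather than a dataframe.

```python
def extract_window_sketch(protein_seq: str, site: str, n: int = 7) -> str:
    """Return the -n to +n window around a site such as 'S20', padded with '_'."""
    position = int(site[1:])             # 1-based position in the protein
    padded = "_" * n + protein_seq + "_" * n
    center = position - 1 + n            # index of the site in the padded string
    return padded[center - n : center + n + 1]

print(extract_window_sketch("MSKSESPK", "S2", n=3))  # __MSKSE
print(extract_window_sketch("MSKSESPK", "K8", n=3))  # ESPK___
```

Padding first keeps the phospho-acceptor at the center of the window even for sites near either terminus.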

PSSM


source

get_prob

 get_prob (df:pandas.core.frame.DataFrame, col:str, aa_order=['P', 'G',
           'A', 'C', 'S', 'T', 'V', 'I', 'L', 'M', 'F', 'Y', 'W', 'H',
           'K', 'R', 'Q', 'N', 'D', 'E', 's', 't', 'y'])

Get the probability matrix of PSSM from phosphorylation site sequences.

ks = Data.get_ks_dataset()
ks_k = ks[ks.kinase_uniprot=='P00519']
pssm_df = get_prob(ks_k,'site_seq')
pssm_df.head()
Position -20 -19 -18 -17 -16 -15 -14 -13 -12 -11 ... 11 12 13 14 15 16 17 18 19 20
aa
P 0.050061 0.048691 0.062349 0.055489 0.046988 0.054753 0.064787 0.055090 0.056683 0.048272 ... 0.052728 0.051140 0.069436 0.063164 0.057716 0.056639 0.051072 0.050697 0.052163 0.060703
G 0.080586 0.080341 0.069007 0.067551 0.082530 0.070397 0.093581 0.073054 0.077566 0.072706 ... 0.099939 0.070856 0.071916 0.075672 0.071518 0.064821 0.080076 0.088720 0.062341 0.090735
A 0.080586 0.080341 0.062954 0.054282 0.075301 0.071600 0.070186 0.070060 0.065632 0.070322 ... 0.064378 0.077634 0.069436 0.072545 0.063363 0.079924 0.088272 0.087452 0.057888 0.070927
C 0.017094 0.012781 0.013317 0.019903 0.012048 0.017449 0.007798 0.014371 0.013126 0.012515 ... 0.007357 0.017868 0.014879 0.012508 0.011920 0.018880 0.019546 0.014575 0.019084 0.014058
S 0.047619 0.035910 0.046610 0.030157 0.037349 0.042720 0.041992 0.041916 0.034010 0.039333 ... 0.024525 0.036352 0.047117 0.040025 0.042033 0.040277 0.039092 0.051965 0.041349 0.039617

5 rows × 41 columns
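
To illustrate what a probability matrix of this kind contains, the sketch below counts residues per position across aligned site sequences and converts the counts to frequencies. It is a simplified stand-in for get_prob: the name position_frequencies_sketch and the four input sequences are hypothetical, all sequences are assumed to have the same length, and no fixed amino acid order is imposed.

```python
from collections import Counter

def position_frequencies_sketch(site_seqs: list[str]) -> dict[int, dict[str, float]]:
    """Map each relative position to the frequency of every residue observed there."""
    length = len(site_seqs[0])
    center = length // 2
    freqs = {}
    for i in range(length):
        counts = Counter(seq[i] for seq in site_seqs)
        total = sum(counts.values())
        freqs[i - center] = {aa: n / total for aa, n in counts.items()}
    return freqs

sites = ["AAsPK", "GAsPR", "AKtPK", "AAyEK"]  # hypothetical aligned 5-mers
freqs = position_frequencies_sketch(sites)
print(freqs[0])  # {'s': 0.5, 't': 0.25, 'y': 0.25}
```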


source

pssm_to_seq

 pssm_to_seq (pssm_df, thr=0.4, contain_sty=True)

Represent PSSM in string sequence of amino acids

Type Default Details
pssm_df
thr float 0.4 threshold of probability to show in sequence
contain_sty bool True keep only s,t,y values (last three) in center 0 position
pssm_to_seq(pssm_df,thr=0.1)
'........K.K.K..E.EEVy*[E/A].[L/P]....K..........L.'

source

recover_pssm

 recover_pssm (flat_pssm:pandas.core.series.Series)

Recover 2D pssm from flat pssm Series

pspa = Data.get_pspa_all_norm()
flat_pssm = pspa.loc['AAK1'].dropna()
recovered = recover_pssm(flat_pssm)
recovered
Position -5 -4 -3 -2 -1 0 1 2 3 4
aa
A 0.0284 0.0706 0.1119 0.0538 0.0385 0.0000 0.0312 0.0750 0.0582 0.0646
C 0.0456 0.0560 0.0655 0.0588 0.0313 0.0000 0.0223 0.0684 0.0742 0.0603
D 0.0160 0.0332 0.0272 0.0805 0.0161 0.0000 0.0103 0.0552 0.0405 0.0389
E 0.0153 0.0560 0.0370 0.0636 0.0146 0.0000 0.0098 0.0524 0.0371 0.0457
F 0.0425 0.0520 0.0487 0.0336 0.0545 0.0000 0.0129 0.0486 0.0430 0.0523
G 0.0245 0.0642 0.0512 0.0283 0.0706 0.0000 0.7216 0.0749 0.0923 0.0702
H 0.0331 0.0514 0.0397 0.0567 0.0688 0.0000 0.0160 0.0620 0.0674 0.0560
I 0.1554 0.0621 0.0599 0.0378 0.0435 0.0000 0.0102 0.0370 0.0388 0.0415
K 0.0262 0.0809 0.0517 0.0587 0.1229 0.0000 0.0194 0.0577 0.0739 0.0831
L 0.0993 0.0742 0.0527 0.0569 0.0654 0.0000 0.0131 0.0414 0.0489 0.0461
M 0.0864 0.0693 0.0782 0.1166 0.0554 0.0000 0.0124 0.0481 0.0437 0.0464
N 0.0275 0.0429 0.0582 0.0808 0.0434 0.0000 0.0170 0.0834 0.0735 0.0592
P 0.0720 0.0534 0.1084 0.0226 0.1136 0.0000 0.0463 0.0527 0.0681 0.0628
Q 0.0560 0.0627 0.0695 0.1517 0.0397 0.0000 0.0138 0.0771 0.0623 0.0635
R 0.0956 0.0715 0.0581 0.0555 0.0923 0.0000 0.0249 0.0774 0.0901 0.0928
S 0.0425 0.0619 0.0527 0.0555 0.0545 0.1013 0.0143 0.0552 0.0582 0.0560
T 0.0425 0.0619 0.0527 0.0555 0.0545 1.0000 0.0143 0.0552 0.0582 0.0560
V 0.0951 0.0619 0.0641 0.0340 0.0392 0.0000 0.0107 0.0542 0.0610 0.0422
W 0.0315 0.0403 0.0425 0.0289 0.0440 0.0000 0.0162 0.0572 0.0481 0.0826
Y 0.0952 0.0534 0.0411 0.0399 0.0777 0.0000 0.0143 0.0459 0.0533 0.0521
pS 0.0201 0.0332 0.0303 0.0209 0.0121 0.1013 0.0123 0.0409 0.0335 0.0251
pT 0.0201 0.0332 0.0303 0.0209 0.0121 1.0000 0.0123 0.0409 0.0335 0.0251
pY 0.0611 0.0339 0.0274 0.0486 0.0178 0.0000 0.0100 0.0410 0.0359 0.0270

source

process_pssm

 process_pssm (pssm_df)

Keep only s,t,y values in center 0 position; normalize per position

norm_pssm = process_pssm(recovered)
norm_pssm.head()
Position -5 -4 -3 -2 -1 0 1 2 3 4
aa
A 0.023054 0.055152 0.088880 0.042695 0.032558 0.0 0.028740 0.057613 0.044987 0.051701
C 0.037016 0.043747 0.052025 0.046663 0.026469 0.0 0.020542 0.052543 0.057355 0.048259
D 0.012988 0.025935 0.021604 0.063884 0.013615 0.0 0.009488 0.042403 0.031306 0.031132
E 0.012420 0.043747 0.029388 0.050472 0.012347 0.0 0.009027 0.040252 0.028677 0.036575
F 0.034500 0.040622 0.038681 0.026665 0.046089 0.0 0.011883 0.037333 0.033238 0.041857

source

pssm2dict

 pssm2dict (pssm_df)

Convert pssm dataframe to dict

pssm2dict(pssm_df.iloc[:1,:10])
{'-20P': 0.05006,
 '-19P': 0.04869,
 '-18P': 0.06235,
 '-17P': 0.05549,
 '-16P': 0.04699,
 '-15P': 0.05475,
 '-14P': 0.06479,
 '-13P': 0.05509,
 '-12P': 0.05668,
 '-11P': 0.04827}

JS divergence


source

js_divergence

 js_divergence (p1, p2, mean=True)

p1 and p2 are two arrays (df or np) with index as aa and column as position

Type Default Details
p1 pssm
p2 pssm
mean bool True
js_divergence(pssm_df,pssm_df)
1.0000000826903708e-10
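
For reference, the Jensen-Shannon divergence between two distributions at a single position can be sketched as below. This is the textbook definition in bits, not the Katlas code. The example above returns about 1e-10 rather than exactly 0 for identical inputs, which suggests the library adds a small constant for numerical stability; that detail, and the log base, are assumptions not reproduced here.

```python
import math

def js_divergence_sketch(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence (bits) between two discrete distributions."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(js_divergence_sketch([0.5, 0.5], [0.5, 0.5]))  # 0.0
print(js_divergence_sketch([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

For a full PSSM, the same calculation would be applied column by column and then averaged or reported per position.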

source

js_divergence_flat

 js_divergence_flat (p1_flat, p2_flat)

p1 and p2 are two flattened pd.Series with index as aa and column as position

Details
p1_flat pd.Series of flattened pssm
p2_flat pd.Series of flattened pssm
flat_norm_pssm = pd.Series(pssm2dict(norm_pssm))
js_divergence(flat_norm_pssm,flat_norm_pssm)
1.0000050826907844e-09

Entropy & Information Content


source

entropy

 entropy (pssm_df, return_min=False, exclude_zero=False, contain_sty=True)

Calculate entropy per position (max) of a PSSM surrounding 0

Type Default Details
pssm_df a dataframe of pssm with index as aa and column as position
return_min bool False return min entropy as a single value or return all entropy as a series
exclude_zero bool False exclude the column of 0 (center position) in the entropy calculation
contain_sty bool True keep only s,t,y values (last three) in center 0 position
entropy(pssm_df)
Position
-20    4.324109
-19    4.257291
-18    4.284732
-17    4.267692
-16    4.270273
-15    4.291726
-14    4.276993
-13    4.269681
-12    4.222083
-11    4.324761
-10    4.282922
-9     4.261746
-8     4.296601
-7     4.263406
-6     4.236002
-5     4.245160
-4     4.238012
-3     4.149155
-2     4.183371
-1     4.148746
 0     0.608195
 1     4.153184
 2     4.294678
 3     4.200320
 4     4.312974
 5     4.246091
 6     4.258706
 7     4.303146
 8     4.214329
 9     4.264253
 10    4.262009
 11    4.261383
 12    4.272229
 13    4.313546
 14    4.275907
 15    4.299622
 16    4.317631
 17    4.279185
 18    4.283658
 19    4.293755
 20    4.279981
dtype: float64
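
The per-position values above are consistent with Shannon entropy in bits: with 23 residue types the maximum is log2(23) ≈ 4.52, and the flanking positions sit slightly below that. A minimal sketch for a single column, with the hypothetical name column_entropy_sketch and an assumed renormalization step, looks like this:

```python
import math

def column_entropy_sketch(probs: list[float]) -> float:
    """Shannon entropy (bits) of one PSSM column; zero entries are skipped."""
    total = sum(probs)  # renormalize in case the column does not sum to 1
    return -sum((p / total) * math.log2(p / total) for p in probs if p > 0)

print(column_entropy_sketch([0.25, 0.25, 0.25, 0.25]))  # 2.0
```

A low value, as at position 0 in the output above, indicates that a few residues dominate that position.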

source

entropy_flat

 entropy_flat (flat_pssm:pandas.core.series.Series, return_min=False,
               exclude_zero=False, contain_sty=True)

Calculate entropy per position of a flat PSSM surrounding 0

Type Default Details
flat_pssm Series
return_min bool False return min entropy as a single value or return all entropy as a series
exclude_zero bool False exclude the column of 0 (center position) in the entropy calculation
contain_sty bool True keep only s,t,y values (last three) in center 0 position

source

get_IC_standard

 get_IC_standard (pssm_df)

Calculate the standard information content (bits) from frequency matrix, using the same number of residues log2(len(pssm_df)) for all positions


source

get_IC

 get_IC (pssm_df, return_min=False, exclude_zero=False, contain_sty=True)

Calculate the information content (bits) from a frequency matrix, using log2(3) for the middle position and log2(len(pssm_df)) for others.

Type Default Details
pssm_df a dataframe of pssm with index as aa and column as position
return_min bool False return min entropy as a single value or return all entropy as a series
exclude_zero bool False exclude the column of 0 (center position) in the entropy calculation
contain_sty bool True keep only s,t,y values (last three) in center 0 position
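
Information content is commonly computed as the maximum possible entropy minus the observed entropy. The sketch below follows the description above by letting the caller pass the number of possible states: 3 for the center position (s, t, y) and the number of rows of the PSSM elsewhere. The function name information_content_sketch is hypothetical and the code is an illustration, not the get_IC implementation.

```python
import math

def information_content_sketch(column: list[float], n_states: int) -> float:
    """Information content (bits) of one column: log2(n_states) minus its entropy."""
    total = sum(column)
    entropy = -sum((p / total) * math.log2(p / total) for p in column if p > 0)
    return math.log2(n_states) - entropy

# A flanking position with 4 possible residues, fully conserved: 2 bits.
print(information_content_sketch([1.0, 0.0, 0.0, 0.0], n_states=4))  # 2.0
# A center position split evenly between s and t: log2(3) - 1 bits.
print(round(information_content_sketch([0.5, 0.5, 0.0], n_states=3), 3))  # 0.585
```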

source

get_IC_flat

 get_IC_flat (flat_pssm:pandas.core.series.Series, return_min=False,
              exclude_zero=False, contain_sty=True)

Calculate the information content (bits) from a flattened pssm pd.Series, using log2(3) for the middle position and log2(len(pssm_df)) for others.

Type Default Details
flat_pssm Series
return_min bool False return min entropy as a single value or return all entropy as a series
exclude_zero bool False exclude the column of 0 (center position) in the entropy calculation
contain_sty bool True keep only s,t,y values (last three) in center 0 position

source

get_scaled_IC

 get_scaled_IC (pssm_df)

For plotting purposes, calculate the scaled information content (bits) from a frequency matrix, using log2(3) for the middle position and log2(len(pssm_df)) for others.

P values


source

get_pvalue

 get_pvalue (df, columns1, columns2, test_method='mann_whitney',
             FC_method='median')

Performs statistical tests and calculates difference between the median or mean of two groups of columns.

Type Default Details
df
columns1 list of column names for group1
columns2 list of column names for group2
test_method str mann_whitney ‘student_t’, ‘mann_whitney’, ‘wilcoxon’
FC_method str median or mean

source

get_metaP

 get_metaP (p_values)

Use Fisher’s method to calculate a combined p value given a list of p values; this function also allows negative p values (negative correlation)

p_values = [0.001,-0.5,0.002]

get_metaP(p_values)
0.0003626876953231754
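
One reading of the signed variant that reproduces the value above is sketched here: a negative p value contributes its log with the opposite sign, and the sign of the final statistic is carried over to the combined p value. This interpretation is an assumption, and signed_fisher_sketch is a hypothetical name. The chi-square survival function is written in closed form, which is valid because the degrees of freedom (2k for k p values) are always even.

```python
import math

def signed_fisher_sketch(p_values: list[float]) -> float:
    """Combine p values with Fisher's method; negative inputs mark negative direction."""
    signed_logs = [-math.log(-p) if p < 0 else math.log(p) for p in p_values]
    statistic = -2 * sum(signed_logs)
    k = len(p_values)                      # chi-square with 2k degrees of freedom
    half = abs(statistic) / 2
    # Survival function of chi-square with even df: exp(-x/2) * sum_{i<k} (x/2)^i / i!
    survival = math.exp(-half) * sum(half**i / math.factorial(i) for i in range(k))
    return -survival if statistic < 0 else survival

print(signed_fisher_sketch([0.001, -0.5, 0.002]))  # about 0.000363
```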

PSPA normalization


source

raw2norm

 raw2norm (df:pandas.core.frame.DataFrame, PDHK:bool=False)

Normalize single ST kinase data

Type Default Details
df DataFrame single kinase’s df has position as index, and single amino acid as columns
PDHK bool False whether this kinase belongs to PDHK family

This function implements the normalization method from Johnson et al., Nature: An atlas of substrate specificities for the human serine/threonine kinome.

Specifically:

- Matrices were column-normalized at all positions by the sum of the 17 randomized amino acids (excluding serine, threonine and cysteine), to yield PSSMs.
- PDHK1 and PDHK4 were normalized to the 16 randomized amino acids (excluding serine, threonine, cysteine and additionally tyrosine).
- The cysteine row was scaled by its median to be 1/17 (1/16 for PDHK1 and PDHK4).
- The serine and threonine values in each position were set to be the median of that position.
- The S0/T0 ratio was determined by summing the values of S and T rows in the matrix (SS and ST, respectively), accounting for the different S vs. T composition of the central (1:1) and peripheral (only S or only T) positions (Sctrl and Tctrl, respectively), and then normalizing to the higher value among the two (S0 and T0, respectively, Supplementary Note 1).

raw2norm is usually applied through get_one_kinase (documented below) by setting its normalize argument to True.
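
The first four normalization steps quoted above can be sketched with plain dictionaries as follows. This is an illustration under stated assumptions, not the raw2norm code. It assumes the input maps each position to the 20 uppercase amino acids, takes the serine/threonine median over the randomized amino acids only, and omits the S0/T0 ratio step. The function name and the toy values are hypothetical.

```python
from statistics import median

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

def normalize_positions(raw, pdhk=False):
    """raw maps position -> {amino acid: raw signal}; returns normalized values."""
    excluded = {"S", "T", "C"} | ({"Y"} if pdhk else set())
    randomized = [aa for aa in AMINO_ACIDS if aa not in excluded]
    n = len(randomized)  # 17, or 16 for PDHK1/PDHK4

    # Step 1: divide each position by the sum of its randomized amino acids.
    norm = {}
    for pos, row in raw.items():
        total = sum(row[aa] for aa in randomized)
        norm[pos] = {aa: value / total for aa, value in row.items()}

    # Step 2: rescale cysteine so that its median across positions is 1/n.
    c_median = median(row["C"] for row in norm.values())
    for row in norm.values():
        row["C"] = row["C"] / (c_median * n)

    # Step 3: set S and T to the median of the position
    # (assumption: the median is taken over the randomized amino acids).
    for row in norm.values():
        position_median = median(row[aa] for aa in randomized)
        row["S"] = position_median
        row["T"] = position_median
    return norm

# Made-up raw signals for two positions, for illustration only.
raw = {
    -1: {aa: float(i + 1) for i, aa in enumerate(AMINO_ACIDS)},
    1: {aa: float(20 - i) for i, aa in enumerate(AMINO_ACIDS)},
}
norm = normalize_positions(raw)
print(norm[-1]["C"], norm[-1]["S"])
```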


source

get_one_kinase

 get_one_kinase (df:pandas.core.frame.DataFrame, kinase:str,
                 normalize:bool=False, drop_s:bool=True)

Obtain the data for a specific kinase from a stacked dataframe

Type Default Details
df DataFrame stacked dataframe (paper’s raw data)
kinase str a specific kinase
normalize bool False normalize according to the paper; special for PDHK1/4
drop_s bool True drop s, as s is a duplicate of t in PSPA

Retrieve a single kinase's data from PSPA data formatted with kinases as the index and position+amino acid as columns.

data = Data.get_pspa_st_norm()
get_one_kinase(data,'PDHK1')
aa A C D E F G H I K L ... P Q R S T V W Y t y
position
-5 0.0594 0.0625 0.0589 0.0550 0.0775 0.0697 0.0687 0.0590 0.0515 0.0657 ... 0.0451 0.0424 0.0594 0.0594 0.0594 0.0573 0.1001 0.0775 0.0583 0.0658
-4 0.0618 0.0621 0.0550 0.0511 0.0739 0.0715 0.0598 0.0601 0.0520 0.0614 ... 0.0637 0.0552 0.0617 0.0608 0.0608 0.0519 0.0916 0.0739 0.0528 0.0752
-3 0.0608 0.0576 0.0499 0.0423 0.0803 0.0580 0.0674 0.0687 0.0481 0.0667 ... 0.0570 0.0532 0.0532 0.0584 0.0584 0.0588 0.1113 0.0803 0.0416 0.0553
-2 0.0587 0.0655 0.0470 0.0437 0.0790 0.0890 0.0787 0.0533 0.0440 0.0637 ... 0.0500 0.0543 0.0616 0.0565 0.0565 0.0519 0.1082 0.0790 0.0327 0.0557
-1 0.0782 0.1009 0.0989 0.0426 0.0650 0.0695 0.0782 0.0496 0.0409 0.0578 ... 0.0540 0.0500 0.0469 0.0594 0.0594 0.0514 0.0756 0.0650 0.0358 0.0433
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 1.0000 0.4886 NaN NaN 0.0000 0.4886 0.0000
1 0.0400 0.0562 0.0394 0.0355 0.0735 0.0400 0.0502 0.1288 0.0390 0.1439 ... 0.0379 0.0455 0.0455 0.0455 0.0455 0.0797 0.0784 0.0735 0.0336 0.0452
2 0.0496 0.0783 0.0643 0.0555 0.0720 0.1067 0.0684 0.0480 0.0505 0.0555 ... 0.0564 0.0653 0.0695 0.0601 0.0601 0.0508 0.0672 0.0720 0.0414 0.0594
3 0.0486 0.0609 0.0938 0.0684 0.1024 0.0676 0.0544 0.0583 0.0388 0.0552 ... 0.0686 0.0502 0.0561 0.0588 0.0588 0.0593 0.0641 0.1024 0.0539 0.0431
4 0.0565 0.0749 0.0631 0.0535 0.0732 0.0655 0.0664 0.0625 0.0496 0.0552 ... 0.0677 0.0553 0.0604 0.0626 0.0626 0.0579 0.0864 0.0732 0.0548 0.0575

10 rows × 22 columns

End

# def get_freq(df_k: pd.DataFrame, # a dataframe for a single kinase that contains phosphorylation sequences split by position
#              aa_order = [i for i in 'PGACSTVILMFYWHKRQNDEsty'], # amino acid to include in the full matrix 
#              aa_order_paper = [i for i in 'PGACSTVILMFYWHKRQNDEsty'], # amino acid to include in the partial matrix
#              position = [i for i in range(-7,8)], # position to include in the full matrix
#              position_paper = [-5,-4,-3,-2,-1,1,2,3,4] # position to include in the partial matrix
#              ):
    
#     "Get frequency matrix given a dataframe of phosphorylation sites for a single kinase"
    

#     #Count frequency for each amino acid at each position
#     melted_k = df_k.melt(
#                     value_vars=[i for i in range(-7, 8)],
#                     var_name='Position', 
#                     value_name='aa')
    
#     # Group by Position and Amino Acid and count occurrences
#     grouped = melted_k.groupby(['Position', 'aa']).size().reset_index(name='Count')
    

#     # Remove nonstandard amino acids
#     aa_include = [i for i in 'PGACSTVILMFYWHKRQNDEsty']
#     grouped = grouped[grouped.aa.isin(aa_include)].reset_index(drop=True)
    
#     # get pivot table
#     pivot_k = grouped.pivot(index='aa', columns='Position', values='Count').fillna(0)
    
#     # Get frequency by dividing the sum of each column
#     freq_k = pivot_k/pivot_k.sum()

    
#     # data from the kinase-substrate dataset, and format is Lew's paper's format
#     paper = freq_k.reindex(index=aa_order_paper,columns=position_paper,fill_value=0)

#     # full pivot data from kinase-substrate dataset
#     full = freq_k.reindex(index=aa_order,columns=position, fill_value=0)

    
#     return paper,full

# # get frequency matrix
# paper_format, full = get_freq(ks_k)
# paper_format.head()
# def get_unique_site(df:pd.DataFrame = None,# dataframe that contains phosphorylation sites
#                     seq_col: str='site_seq', # column name of site sequence
#                     id_col: str='gene_site' # column name of site id
#                    ):
#     "Remove duplicates among phosphorylation sites; return df with new columns of acceptor and number of duplicates"
    
#     unique = df.groupby(seq_col).agg(
#         {id_col: lambda r: '|'.join(r.unique())} )
#     unique['num_site'] = unique[id_col].str.split('|').apply(len) 
#     unique = unique.reset_index()
#     position = len(unique[seq_col][0])//2
#     unique['acceptor'] = unique[seq_col].str[position]
    
#     return unique

As there are lots of duplicates of the phosphorylation site sequence in the dataset, it could be helpful to remove the duplicated sequences.

Use get_unique_site to get unique phosphorylation sites. You need to specify the sequence column and the ID column.

# df = Data.get_ochoa_site()
# unique = get_unique_site(df,seq_col='site_seq',id_col='gene_site')
# unique.sort_values('num_site',ascending=False).head()