Core

Core functions in Katlas library

Setup

Data

This section shows how to load the kinase information data and the phosphorylation site data.

Data

 Data ()

A class for fetching various datasets.

Datasets used in this study can be accessed through the Data class.

To load kinase information data:

kinase = Data.get_kinase_info()
kinase.head()

	kinase	ID_coral	uniprot	ID_HGNC	group	family	subfamily_coral	subfamily	in_ST_paper	in_Tyr_paper	...	cytosol	cytoskeleton	plasma membrane	mitochondrion	Golgi apparatus	endoplasmic reticulum	vesicle	centrosome	aggresome	main_location
0	AAK1	AAK1	Q2M2I8	AAK1	Other	NAK	NaN	NAK	1	0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	ABL1	ABL1	P00519	ABL1	TK	Abl	NaN	Abl	0	1	...	6.0	NaN	4.0	NaN	NaN	NaN	NaN	NaN	NaN	cytosol
2	ABL2	ABL2	P42684	ABL2	TK	Abl	NaN	Abl	0	1	...	4.0	6.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	cytoskeleton
3	TNK2	ACK	Q07912	TNK2	TK	Ack	NaN	Ack	0	1	...	NaN	NaN	NaN	NaN	NaN	NaN	8.0	NaN	2.0	vesicle
4	ACVR2A	ACTR2	P27037	ACVR2A	TKL	STKR	STKR2	STKR2	1	0	...	5.0	NaN	NaN	NaN	NaN	5.0	NaN	NaN	NaN	cytosol

5 rows × 30 columns

CPTAC

source

CPTAC

 CPTAC ()

A class for fetching CPTAC phosphoproteomics data.

To check available cancer types, use CPTAC.list_cancer()

CPTAC.list_cancer()

['HNSCC', 'GBM', 'COAD', 'CCRCC', 'LSCC', 'BRCA', 'UCEC', 'LUAD', 'PDAC', 'OV']

To load CPTAC phosphorylation site information, use CPTAC.get_id(). Fold change of various conditions can be acquired through LinkedOmics or LinkedOmicsKB. Use is_KB to indicate whether the phosphorylation site information is for LinkedOmics or LinkedOmicsKB.

# Example of getting phosphorylation site information
normal = CPTAC.get_id('CCRCC',is_KB=True, is_Tumor=False)
normal.head()

the CCRCC dataset length is: 53152
after id mapping, the length is 209188
0 sites does not have a mapped gene name
after removing duplicates of protein_site, the length is 208298

	gene	site	site_seq	protein	gene_name	gene_site	protein_site
0	ENSG00000003056.8	S267	DDQLGEESEERDDHL	ENSP00000000412.3	M6PR	M6PR_S267	ENSP00000000412_S267
1	ENSG00000003056.8	S267	DDQLGEESEERDDHL	ENSP00000440488.2	M6PR	M6PR_S267	ENSP00000440488_S267
2	ENSG00000048028.11	S1053	PPTIRPNSPYDLCSR	ENSP00000003302.4	USP28	USP28_S1053	ENSP00000003302_S1053
3	ENSG00000048028.11	S1053	PPTIRPNSPYDLCSR	ENSP00000445743.1	USP28	USP28_S1053	ENSP00000445743_S1053
4	ENSG00000048028.11	S1053	PPTIRPNSPYDLCSR	ENSP00000442431.1	USP28	USP28_S1053	ENSP00000442431_S1053

Checker

In many phosphorylation datasets, the site sequence contains lowercase amino acids that are not s/t/y. Uncommon amino acids such as U or O can also appear in the sequence. Therefore, it is essential to convert the sequence string before kinase ranking.

source

check_seq

 check_seq (seq)

Convert non-s/t/y characters to uppercase and replace disallowed characters with underscores.

try:
    check_seq('aaadaaa')
except Exception as e:
    print(e)

aaadaaa has d at position 3; need to have one of 's', 't', or 'y' in the center

check_seq('AAkUuPSFstTH') # if the center amino acid does not belong to sty/STY, will raise an error

'AAK__PSFstTH'
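The conversion rules can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the name check_seq_sketch and the allowed alphabet (the 20 standard amino acids plus lowercase s/t/y) are assumptions inferred from the examples above.

```python
# Hypothetical re-implementation for illustration; not the Katlas source.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWY") | set("sty")  # assumed allowed alphabet


def check_seq_sketch(seq: str) -> str:
    center = len(seq) // 2
    if seq[center].lower() not in "sty":
        raise ValueError(
            f"{seq} has {seq[center]} at position {center}; "
            "need to have one of 's', 't', or 'y' in the center"
        )
    # Uppercase every character except lowercase s/t/y (phosphorylated residues)
    upper = "".join(c if c in "sty" else c.upper() for c in seq)
    # Replace anything outside the allowed alphabet with an underscore
    return "".join(c if c in ALLOWED else "_" for c in upper)


print(check_seq_sketch("AAkUuPSFstTH"))  # AAK__PSFstTH
```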

source

check_seq_df

 check_seq_df (df, col)

Convert non-s/t/y to upper case & replace with underscore if the character is not in the allowed set

df=Data.get_human_site()
df.head()

	substrate_uniprot	substrate_genes	site	source	AM_pathogenicity	substrate_sequence	substrate_species	sub_site	substrate_phosphoseq	position	site_seq
0	A0A024R4G9	C19orf48 MGC13170 hCG_2008493	S20	psp	NaN	MTVLEAVLEIQAITGSRLLSMVPGPARPPGSCWDPTQCTRTWLLSH...	Homo sapiens (Human)	A0A024R4G9_S20	MTVLEAVLEIQAITGSRLLsMVPGPARPPGSCWDPTQCTRTWLLSH...	20	_MTVLEAVLEIQAITGSRLLsMVPGPARPPGSCWDPTQCTR
1	A0A075B6Q4	None	S24	ochoa	NaN	MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT...	Homo sapiens (Human)	A0A075B6Q4_S24	MDIQKSENEDDSEWEDVDDEKGDsNDDYDSAGLLsDEDCMSVPGKT...	24	QKSENEDDSEWEDVDDEKGDsNDDYDSAGLLsDEDCMSVPG
2	A0A075B6Q4	None	S35	ochoa	NaN	MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT...	Homo sapiens (Human)	A0A075B6Q4_S35	MDIQKSENEDDSEWEDVDDEKGDsNDDYDSAGLLsDEDCMSVPGKT...	35	EDVDDEKGDsNDDYDSAGLLsDEDCMSVPGKTHRAIADHLF
3	A0A075B6Q4	None	S57	ochoa	NaN	MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT...	Homo sapiens (Human)	A0A075B6Q4_S57	MDIQKSENEDDSEWEDVDDEKGDsNDDYDSAGLLsDEDCMSVPGKT...	57	EDCMSVPGKTHRAIADHLFWsEETKSRFTEYsMTssVMRRN
4	A0A075B6Q4	None	S68	ochoa	NaN	MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT...	Homo sapiens (Human)	A0A075B6Q4_S68	MDIQKSENEDDSEWEDVDDEKGDsNDDYDSAGLLsDEDCMSVPGKT...	68	RAIADHLFWsEETKSRFTEYsMTssVMRRNEQLTLHDERFE

check_seq_df(df.head(),'site_seq')

0    _MTVLEAVLEIQAITGSRLLsMVPGPARPPGSCWDPTQCTR
1    QKSENEDDSEWEDVDDEKGDsNDDYDSAGLLsDEDCMSVPG
2    EDVDDEKGDsNDDYDSAGLLsDEDCMSVPGKTHRAIADHLF
3    EDCMSVPGKTHRAIADHLFWsEETKSRFTEYsMTssVMRRN
4    RAIADHLFWsEETKSRFTEYsMTssVMRRNEQLTLHDERFE
Name: site_seq, dtype: object

source

validate_site

 validate_site (site_info, seq)

Validate that the residue at the site position in the sequence matches the residue given in the site information.

site='S610'
seq = 'MSVPSSLSQSAINANSHGGPALSLPLPLHAAHNQLLNAKLQATAVGPKDLRSAMGEGGGPEPGPANAKWLKEGQNQLRRAATAHRDQNRNVTLTLAEEASQEPEMAPLGPKGLIHLYSELELSAHNAANRGLRGPGLIISTQEQGPDEGEEKAAGEAEEEEEDDDDEEEEEDLSSPPGLPEPLESVEAPPRPQALTDGPREHSKSASLLFGMRNSAASDEDSSWATLSQGSPSYGSPEDTDSFWNPNAFETDSDLPAGWMRVQDTSGTYYWHIPTGTTQWEPPGRASPSQGSSPQEESQLTWTGFAHGEGFEDGEFWKDEPSDEAPMELGLKEPEEGTLTFPAQSLSPEPLPQEEEKLPPRNTNPGIKCFAVRSLGWVEMTEEELAPGRSSVAVNNCIRQLSYHKNNLHDPMSGGWGEGKDLLLQLEDETLKLVEPQSQALLHAQPIISIRVWGVGRDSGRERDFAYVARDKLTQMLKCHVFRCEAPAKNIATSLHEICSKIMAERRNARCLVNGLSLDHSKLVDVPFQVEFPAPKNELVQKFQVYYLGNVPVAKPVGVDVINGALESVLSSSSREQWTPSHVSVAPATLTILHQQTEAVLGECRVRFLSFLAVGRDVHTFAFIMAAGPASFCCHMFWCEPNAASLSEAVQAACMLRYQKCLDARSQASTSCLPAPPAESVARRVGWTVRRGVQSLWGSLKPKRLGAHTP'

validate_site(site,seq)
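As a rough sketch of what such a check involves, the site string can be split into a residue letter and a 1-based position, which is then compared against the protein sequence. The function name validate_site_sketch and the integer return value (1 for a match, 0 otherwise) are assumptions for illustration, not the library's code.

```python
import re


def validate_site_sketch(site_info: str, seq: str) -> int:
    """Return 1 if seq has the residue named in site_info at that 1-based position, else 0."""
    match = re.fullmatch(r"([A-Za-z])(\d+)", site_info)
    if match is None:
        raise ValueError(f"unexpected site format: {site_info}")
    residue, position = match.group(1).upper(), int(match.group(2))
    if position < 1 or position > len(seq):
        return 0  # position falls outside the sequence
    return int(seq[position - 1].upper() == residue)


print(validate_site_sketch("S4", "MKTSAY"))  # 1
print(validate_site_sketch("T4", "MKTSAY"))  # 0
```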

source

validate_site_df

 validate_site_df (df, site_info_col, protein_seq_col)

Validate, for each row of a dataframe, that the residue at the site position in the sequence matches the residue given in the site information.

validate_site_df(df.head(),'site','substrate_sequence')

0    1
1    1
2    1
3    1
4    1
dtype: int64

Onehot

source

onehot_encode

 onehot_encode (sequences)

df=Data.get_combine_site_psp_ochoa()

onehot_encode(df['site_seq'].head(1000))

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

Scoring for single sequence

Algorithms

Function1 - multiply

source

multiply_func

 multiply_func (values, factor=17)

Multiply the possibilities of the amino acids at each position in a phosphorylation site

	Type	Default	Details
values			list of values, possibilities of amino acids at certain positions
factor	int	17	scale factor

This function implements the formula from Johnson et al., Nature: An atlas of substrate specificities for the human serine/threonine kinome, Supplementary Note 2 (page 160).

The multiply function below takes the kinase-specific scale factor into account.

source

multiply

 multiply (values, kinase, num_dict={'SYK': 18, 'PTK2': 18, 'ZAP70': 18,
           'ERBB2': 18, 'CSK': 18, 'FGFR4': 18, 'EGFR': 18, 'ERBB4': 18,
           'EPHA8': 18, 'EPHA7': 18, 'EPHA5': 18, 'EPHA2': 18, 'EPHB2':
           18, 'EPHB1': 18, 'EPHB3': 18, 'EPHB4': 18, 'EPHA4': 18,
           'EPHA3': 18, 'EPHA6': 18, 'FRK': 18, 'EPHA1': 18, 'TEC': 18,
           'BTK': 18, 'ITK': 18, 'BMX': 18, 'TXK': 16, 'ABL2': 18, 'ABL1':
           18, 'SRMS': 18, 'PTK2B': 18, 'FER': 18, 'MERTK': 18, 'AXL': 18,
           'FES': 18, 'PTK6': 18, 'YES1': 18, 'FGR': 18, 'SRC': 18, 'FYN':
           18, 'LCK': 18, 'BLK': 18, 'LYN': 18, 'HCK': 18, 'PDGFRB': 18,
           'PDGFRA': 18, 'FLT3': 18, 'TYRO3': 18, 'ROS1': 18, 'TEK': 18,
           'LTK': 18, 'ALK': 18, 'MUSK': 18, 'KIT': 18, 'CSF1R': 18,
           'MET': 18, 'KDR': 18, 'RET': 18, 'MST1R': 16, 'JAK3': 16,
           'FLT1': 16, 'MATK': 18, 'FGFR3': 18, 'FGFR2': 18, 'FGFR1': 18,
           'FLT4': 18, 'INSR': 18, 'IGF1R': 18, 'INSRR': 16, 'NTRK3': 18,
           'NTRK1': 18, 'NTRK2': 18, 'TNK1': 18, 'TNK2': 18, 'DDR2': 18,
           'DDR1': 18, 'TYK2': 18, 'JAK2': 18, 'JAK1': 18, 'TNNI3K_TYR':
           18, 'NEK10_TYR': 16, 'PINK1_TYR': 16, 'MAP2K7_TYR': 16,
           'PKMYT1_TYR': 16, 'TESK1_TYR': 16, 'LIMK1_TYR': 16,
           'LIMK2_TYR': 16, 'WEE1_TYR': 18, 'MAP2K6_TYR': 16,
           'MAP2K4_TYR': 16, 'PDHK1_TYR': 16, 'BMPR2_TYR': 16,
           'PDHK4_TYR': 16, 'PDHK3_TYR': 16, 'AAK1': 17, 'ACVR2A': 17,
           'ACVR2B': 17, 'AKT1': 17, 'AKT2': 17, 'AKT3': 17, 'ALK2': 17,
           'ALK4': 17, 'ALPHAK3': 17, 'AMPKA1': 17, 'AMPKA2': 17,
           'ANKRD3': 17, 'ATM': 17, 'ATR': 17, 'AURA': 17, 'AURB': 17,
           'AURC': 17, 'GRK2': 17, 'GRK3': 17, 'BCKDK': 17, 'BIKE': 17,
           'BMPR1A': 17, 'BMPR1B': 17, 'BMPR2': 17, 'BRAF': 17, 'BRSK1':
           17, 'BRSK2': 17, 'BUB1': 17, 'CAMK1A': 17, 'CAMK1B': 17,
           'CAMK1D': 17, 'CAMK1G': 17, 'CAMK2A': 17, 'CAMK2B': 17,
           'CAMK2D': 17, 'CAMK2G': 17, 'CAMK4': 17, 'CAMKK1': 17,
           'CAMKK2': 17, 'CAMLCK': 17, 'CDK1': 17, 'CDC7': 17, 'CDK10':
           17, 'CDK19': 17, 'CDK2': 17, 'CDK3': 17, 'CDK4': 17, 'CDK5':
           17, 'CDK6': 17, 'CDK7': 17, 'CDK8': 17, 'CDK9': 17, 'CDKL1':
           17, 'CDKL5': 17, 'CHAK1': 17, 'CHAK2': 17, 'CDK13': 17, 'CHK1':
           17, 'CHK2': 17, 'CK1A': 17, 'CK1A2': 17, 'CK1D': 17, 'CK1E':
           17, 'CK1G1': 17, 'CK1G2': 17, 'CK1G3': 17, 'CK2A1': 17,
           'CK2A2': 17, 'CLK1': 17, 'CLK2': 17, 'CLK3': 17, 'CLK4': 17,
           'COT': 17, 'CRIK': 17, 'CDK12': 17, 'DAPK1': 17, 'DAPK2': 17,
           'DAPK3': 17, 'DCAMKL1': 17, 'DCAMKL2': 17, 'DLK': 17, 'DMPK1':
           17, 'DNAPK': 17, 'DRAK1': 17, 'DYRK1A': 17, 'DYRK1B': 17,
           'DYRK2': 17, 'DYRK3': 17, 'DYRK4': 17, 'ERK1': 17, 'ERK2': 17,
           'ERK5': 17, 'ERK7': 17, 'MTOR': 17, 'GAK': 17, 'GCK': 17,
           'GCN2': 17, 'GRK4': 17, 'GRK5': 17, 'GRK6': 17, 'GRK7': 17,
           'GSK3A': 17, 'GSK3B': 17, 'HASPIN': 17, 'HGK': 17, 'HIPK1': 17,
           'HIPK2': 17, 'HIPK3': 17, 'HIPK4': 17, 'HPK1': 17, 'HRI': 17,
           'HUNK': 17, 'ICK': 17, 'IKKA': 17, 'IKKB': 17, 'IKKE': 17,
           'IRAK1': 17, 'IRAK4': 17, 'IRE1': 17, 'IRE2': 17, 'JNK1': 17,
           'JNK2': 17, 'JNK3': 17, 'KHS1': 17, 'KHS2': 17, 'KIS': 17,
           'LATS1': 17, 'LATS2': 17, 'LKB1': 17, 'LOK': 17, 'LRRK2': 17,
           'MAK': 17, 'MEK1': 17, 'MEK2': 17, 'MEK5': 17, 'MEKK1': 17,
           'YSK4': 17, 'MEKK2': 17, 'MEKK3': 17, 'ASK1': 17, 'MEKK6': 17,
           'MAP3K15': 17, 'MAPKAPK2': 17, 'MAPKAPK3': 17, 'MAPKAPK5': 17,
           'MARK1': 17, 'MARK2': 17, 'MARK3': 17, 'MARK4': 17, 'MASTL':
           17, 'MELK': 17, 'MINK': 17, 'MLK1': 17, 'MLK2': 17, 'MLK3': 17,
           'MLK4': 17, 'MNK1': 17, 'MNK2': 17, 'MOK': 17, 'MOS': 17,
           'MPSK1': 17, 'MRCKA': 17, 'MRCKB': 17, 'MSK1': 17, 'MSK2': 17,
           'SRPK3': 17, 'MST1': 17, 'MST2': 17, 'MST3': 17, 'MST4': 17,
           'MYO3A': 17, 'MYO3B': 17, 'NDR1': 17, 'NDR2': 17, 'NEK1': 17,
           'NEK11': 17, 'NEK2': 17, 'NEK3': 17, 'NEK4': 17, 'NEK5': 17,
           'NEK6': 17, 'NEK7': 17, 'NEK8': 17, 'NEK9': 17, 'NIK': 17,
           'NIM1': 17, 'NLK': 17, 'NUAK1': 17, 'NUAK2': 17, 'OSR1': 17,
           'P38A': 17, 'P38B': 17, 'P38D': 17, 'P38G': 17, 'P70S6K': 17,
           'P70S6KB': 17, 'PAK1': 17, 'PAK2': 17, 'PAK3': 17, 'PAK4': 17,
           'PAK5': 17, 'PAK6': 17, 'PASK': 17, 'PBK': 17, 'CDK16': 17,
           'CDK17': 17, 'CDK18': 17, 'PDHK1': 16, 'PDHK4': 16, 'PDK1': 17,
           'PERK': 17, 'CDK14': 17, 'PHKG1': 17, 'PHKG2': 17, 'PIM1': 17,
           'PIM2': 17, 'PIM3': 17, 'PINK1': 17, 'PKACA': 17, 'PKACB': 17,
           'PKACG': 17, 'PKCA': 17, 'PKCB': 17, 'PKCD': 17, 'PKCE': 17,
           'PKCG': 17, 'PKCH': 17, 'PKCI': 17, 'PKCT': 17, 'PKCZ': 17,
           'PRKD1': 17, 'PRKD2': 17, 'PRKD3': 17, 'PKG1': 17, 'PKG2': 17,
           'PKN1': 17, 'PKN2': 17, 'PKN3': 17, 'PKR': 17, 'PLK1': 17,
           'PLK2': 17, 'PLK3': 17, 'PLK4': 17, 'PRKX': 17, 'PRP4': 17,
           'PRPK': 17, 'QIK': 17, 'QSK': 17, 'RAF1': 17, 'GRK1': 17,
           'RIPK1': 17, 'RIPK2': 17, 'RIPK3': 17, 'ROCK1': 17, 'ROCK2':
           17, 'P90RSK': 17, 'RSK2': 17, 'RSK3': 17, 'RSK4': 17, 'SBK':
           17, 'MYLK4': 17, 'SGK1': 17, 'SGK3': 17, 'DSTYK': 17, 'SIK':
           17, 'SKMLCK': 17, 'SLK': 17, 'SMG1': 17, 'SMMLCK': 17, 'SNRK':
           17, 'SRPK1': 17, 'SRPK2': 17, 'SSTK': 17, 'STK33': 17, 'STLK3':
           17, 'TAK1': 17, 'TAO1': 17, 'TAO2': 17, 'TAO3': 17, 'TBK1': 17,
           'TGFBR1': 17, 'TGFBR2': 17, 'TLK1': 17, 'TLK2': 17, 'TNIK': 17,
           'TSSK1': 17, 'TSSK2': 17, 'TTBK1': 17, 'TTBK2': 17, 'TTK': 17,
           'ULK1': 17, 'ULK2': 17, 'VRK1': 17, 'VRK2': 17, 'WNK1': 17,
           'WNK3': 17, 'WNK4': 17, 'YANK2': 17, 'YANK3': 17, 'YSK1': 17,
           'ZAK': 17, 'EEF2K': 17, 'FAM20C': 17})

Multiply values using a kinase-specific scale factor, which is the number of randomized amino acids in PSPA for that kinase.

multiply(values=[1,2,3,4,5],kinase='PDHK1')

22.906890595608516
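The output above is consistent with a log2 score of the form log2(product of values) + (n - 1) * log2(factor), where n is the number of values. The sketch below reproduces the number under that assumption; the formula is inferred from this single example, and multiply_sketch is a hypothetical name.

```python
import math


def multiply_sketch(values, factor=17):
    # log2 of the product of the values, plus (n - 1) * log2(factor) as the scale term
    log_product = sum(math.log2(v) for v in values)
    return log_product + (len(values) - 1) * math.log2(factor)


# PDHK1 is listed with a scale factor of 16 in num_dict
score = multiply_sketch([1, 2, 3, 4, 5], factor=16)
print(round(score, 6))  # 22.906891
```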

Function2 - sum up

source

sumup

 sumup (values, kinase=None)

Sum up the possibilities of the amino acids at each position in a phosphorylation site sequence

	Type	Default	Details
values			list of values, possibilities of amino acids at certain positions
kinase	NoneType	None

Utils

source

STY2sty

 STY2sty (input_string:str)

Replace all ‘STY’ with ‘sty’ in a sequence

STY2sty('AAkUuPSFSTtH') # convert all capital STY to sty in a string

'AAkUuPsFsttH'
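A character translation table is one simple way to get this behavior. The sketch below is an illustration with a hypothetical name, not necessarily how the library does it.

```python
# Map uppercase S, T, Y to lowercase; all other characters are left untouched
_STY_TABLE = str.maketrans("STY", "sty")


def sty_lower_sketch(input_string: str) -> str:
    return input_string.translate(_STY_TABLE)


print(sty_lower_sketch("AAkUuPSFSTtH"))  # AAkUuPsFsttH
```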

source

get_dict

 get_dict (input_string:str)

Convert an input site sequence into a list of position+residue keys (e.g., '-7P', '0s'); no star is needed in the middle; make sure the sequence has length 15 or 10

	Type	Details
input_string	str	phosphorylation site sequence

cols = get_dict("PSVEPPLsQETFSDL")
cols

['-7P',
 '-6S',
 '-5V',
 '-4E',
 '-3P',
 '-2P',
 '-1L',
 '0s',
 '1Q',
 '2E',
 '3T',
 '4F',
 '5S',
 '6D',
 '7L']
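The keys above can be reproduced by numbering residues relative to the center of the string. This is a minimal sketch with a hypothetical function name; taking the center as len // 2 is an assumption that matches the 15-residue example (positions -7 to 7) and would give -5 to 4 for a 10-residue input.

```python
def position_keys_sketch(seq: str) -> list:
    center = len(seq) // 2  # index of the phospho-acceptor
    # Each key is the position relative to the center followed by the residue
    return [f"{i - center}{aa}" for i, aa in enumerate(seq)]


keys = position_keys_sketch("PSVEPPLsQETFSDL")
print(keys[0], keys[7], keys[-1])  # -7P 0s 7L
```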

Scoring func

source

predict_kinase

 predict_kinase (input_string:str, ref:pandas.core.frame.DataFrame,
                 func:Callable, to_lower:bool=False, to_upper:bool=False,
                 verbose=True)

Predict kinase given a phosphorylation site sequence

	Type	Default	Details
input_string	str		site sequence
ref	DataFrame		reference dataframe for scoring
func	Callable		function to calculate score
to_lower	bool	False	convert capital STY to lower case
to_upper	bool	False	convert all letters to uppercase
verbose	bool	True

Params

Here we provide different PSSM settings from either PSPA data or kinase-substrate dataset for kinase prediction:

source

Params

 Params (name=None)

Params()

Available parameter sets:

['PSPA_st', 'PSPA_y', 'PSPA', 'CDDM', 'CDDM_upper']

for p in ['PSPA', 'CDDM','CDDM_upper']:
    print(predict_kinase("PSVEPPLsQETFSDL",**Params(p)).head())

considering string: ['-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S']
kinase
ATM       5.037
SMG1      4.385
DNAPK     3.818
ATR       3.507
FAM20C    3.170
dtype: float64
considering string: ['-7P', '-6S', '-5V', '-4E', '-3P', '-2P', '-1L', '0s', '1Q', '2E', '3T', '4F', '5S', '6D', '7L']
kinase
ATR      3.064
ATM      2.909
DNAPK    2.270
CK2A1    1.873
TSSK1    1.856
dtype: float64
considering string: ['-7P', '-6S', '-5V', '-4E', '-3P', '-2P', '-1L', '0S', '1Q', '2E', '3T', '4F', '5S', '6D', '7L']
kinase
ATR      3.229
ATM      3.038
DNAPK    2.479
CK2A1    2.006
CDK8     1.999
dtype: float64

Scoring for sequences in df

Utils

source

cut_seq

 cut_seq (input_string:str, min_position:int, max_position:int)

Extract sequence based on a range relative to its center position

	Type	Details
input_string	str	site sequence
min_position	int	minimum position relative to its center
max_position	int	maximum position relative to its center

cut_seq('AAkUuPSFSTtH',-5,4)

'AkUuPSFSTt'
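The same result can be obtained with plain slicing around the center index. The sketch below assumes the center is at len // 2 and that both bounds are inclusive, which matches the example above; cut_seq_sketch is a hypothetical name.

```python
def cut_seq_sketch(input_string: str, min_position: int, max_position: int) -> str:
    center = len(input_string) // 2
    # Both ends are inclusive, hence the + 1 on the upper bound
    return input_string[center + min_position : center + max_position + 1]


print(cut_seq_sketch("AAkUuPSFSTtH", -5, 4))  # AkUuPSFSTt
```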

Scoring func

source

predict_kinase_df

 predict_kinase_df (df, seq_col, ref, func, to_lower=False,
                    to_upper=False)

df = Data.get_psp_human_site()
df_sty = df[df['site_seq'].str[7].isin(list('sty'))]

out_cddm = predict_kinase_df(df_sty.head(500),'site_seq', **Params('CDDM'))

input dataframe has a length 500
Preprocessing
Finish preprocessing
Merging reference
Finish merging
CPU times: user 24.4 ms, sys: 4.12 ms, total: 28.5 ms
Wall time: 27.9 ms

Percentile scoring

Single sequence

source

get_pct

 get_pct (site, ref, func, pct_ref)

Replicate the percentile results from The Kinase Library.

st_pct = Data.get_pspa_st_pct()
y_pct = Data.get_pspa_tyr_pct()

a = get_pct('PSVEPPLyQETFSDL',**Params('PSPA_y'), pct_ref=y_pct)

considering string: ['-5V', '-4E', '-3P', '-2P', '-1L', '0Y', '1Q', '2E', '3T', '4F', '5S']

a.sort_values('percentile',ascending=False)

	log2(score)	percentile
ABL2	3.137	96.568694
BMX	2.816	96.117567
BTK	1.956	95.693780
CSK	2.303	95.174299
MERTK	2.509	93.588517
...	...	...
FLT1	-1.919	25.358852
PINK1_TYR	-1.227	21.927546
MUSK	-3.031	21.298701
TNNI3K_TYR	-3.549	11.004785
PKMYT1_TYR	-1.739	4.798360

93 rows × 2 columns

get_pct('PSVEPPLsQETFSDL',**Params('PSPA_st'), pct_ref=st_pct)

considering string: ['-5V', '-4E', '-3P', '-2P', '-1L', '0S', '1Q', '2E', '3T', '4F']

	log2(score)	percentile
ATM	5.037	99.822351
SMG1	4.385	99.831819
DNAPK	3.818	99.205315
ATR	3.507	99.680344
FAM20C	3.170	95.370556
...	...	...
PKN1	-7.275	14.070436
P70S6K	-7.295	4.089816
AKT3	-7.375	11.432995
PKCI	-7.742	8.129511
NEK3	-8.254	4.637240

303 rows × 2 columns

Sequence in df

source

get_pct_df

 get_pct_df (score_df, pct_ref)

Replicate the percentile results from The Kinase Library.

	Details
score_df	output from predict_kinase_df
pct_ref	a reference df for percentile calculation

# substrate score first
# score_df = predict_kinase_df(df_sty,'site_seq', **Params('PSPA_st'))

# get percentile reference
# pct_ref = Data.get_pspa_st_pct()

# pct = get_pct_df(score_df,pct_ref)

Phosphorylate protein seq

source

phosphorylate_seq

 phosphorylate_seq (r, site_info_col='site',
                    sub_seq_col='substrate_sequence')

Phosphorylate whole sequence based on phosphosites

ks = Data.get_ks_dataset()

site_df = ks.groupby('substrate_uniprot').agg({'site':lambda x: x.unique(),
                                  'substrate_sequence': 'first',
                                  }).reset_index()
site_df.head()

	substrate_uniprot	site	substrate_sequence
0	A0A2R8Y4L2	[S95, S22, T25, S6, S158]	MSKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVM...
1	A0AVT1	[Y553, S697, Y1046, Y49, S1048, S732]	MEGSEPVAAHQGEEASCSSWGTGSTNKNLPIMSTASVEIDDALYSR...
2	A0FGR8	[Y824, S821]	MTANRDAALSSHRHPGCAQRPRTPTFASSSQRRSAFGFDDGNFPGL...
3	A0JLT2	[S239]	MENFTALFGAQADPPPPPTALGFGPGKPPPPPPPPAGGGPGTAPPP...
4	A0MZ66	[S473, T455, S444, S464, S467, S492, S494, S53...	MNSSDEEKQLQLITSLKEQAIGEYEDLRAENQKTKEKCDKIRQERD...

phosphorylate_seq(site_df.iloc[0],'site','substrate_sequence')

'MSKSEsPKEPEQLRKLFIGGLsFEtTDESLRSHFEQWGTLTDCVVMRDPNTKRSRGFGFVTYATVEEVDAAMNARPHKVDGRVVEPKRAVSREDsQRPDAHLTVKKIFVGGIKEDTEEHHLRDYFEQYGKIEVIEIMTDRGSGKKRGFAFVTFDDHDsVDKIVIQKYHTVNGHNCEVRKALSKQEMASASSSQRGRSGSGNFGGGRGGGFGGNDNFGRGGNFSGRGGFGGSRGGGGYGGSGDGYNGFGNDGSNFGGGGSYNDFGNYNNQSSNFGPMKGGNFEGRSSGPHGGGGQYFAKPRNQGGYGGSSSSSSYGSGRRF'
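Conceptually, the phosphorylated sequence is the protein sequence with each listed site converted to lowercase. A minimal sketch, assuming 1-based site positions and raising an error on a residue mismatch (the mismatch handling and the function name are assumptions):

```python
def phosphorylate_sketch(seq: str, sites) -> str:
    residues = list(seq)
    for site in sites:
        aa, position = site[0], int(site[1:])
        idx = position - 1  # site positions are 1-based
        if residues[idx].upper() != aa.upper():
            raise ValueError(f"{site} does not match residue {residues[idx]}")
        residues[idx] = residues[idx].lower()
    return "".join(residues)


print(phosphorylate_sketch("MSKSESPKEPEQLRK", ["S6", "S2"]))  # MsKSEsPKEPEQLRK
```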

The groupby step above is incorporated into the function below:

source

phosphorylate_seq_df

 phosphorylate_seq_df (df, id_col='substrate_uniprot',
                       site_info_col='site',
                       sub_seq_col='substrate_sequence')

Phosphorylate whole sequence based on phosphosites in a dataframe

phosphorylate_seq_df(ks.head(),'substrate_uniprot','site','substrate_sequence')

	substrate_uniprot	site	substrate_sequence	substrate_phosphoseq
0	A4FU28	[S140]	MEEPGATPQPYLGLVLEELGRVVAALPESMRPDENPYGFPSELVVC...	MEEPGATPQPYLGLVLEELGRVVAALPESMRPDENPYGFPSELVVC...
1	O00141	[S252, S255, S397, S404]	MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS...	MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS...

Extract site seq

source

extract_site_seq

 extract_site_seq (df:pandas.core.frame.DataFrame, seq_col:str,
                   site_info_col:str, n=7)

Extract -n to +n site sequence from protein sequence

	Type	Default	Details
df	DataFrame		dataframe that contains protein sequence
seq_col	str		column name of protein sequence
site_info_col	str		column name of site information (e.g., S10)
n	int	7	length of surrounding sequence (default -7 to +7)

As some datasets contain only protein information and the positions of phosphorylation sites, but not the phosphorylation site sequences, we can retrieve the protein sequence and use this function to get the -7 to +7 phosphorylation site sequence (as a numpy array).

Remember to validate the phospho-acceptor at position 0 before extracting the site sequence, as there could be mismatches due to protein sequence database updates.

df = Data.get_human_site()
df.head()

	substrate_uniprot	substrate_genes	site	source	AM_pathogenicity	substrate_sequence	substrate_species	sub_site	substrate_phosphoseq	position	site_seq
0	A0A024R4G9	C19orf48 MGC13170 hCG_2008493	S20	psp	NaN	MTVLEAVLEIQAITGSRLLSMVPGPARPPGSCWDPTQCTRTWLLSH...	Homo sapiens (Human)	A0A024R4G9_S20	MTVLEAVLEIQAITGSRLLsMVPGPARPPGSCWDPTQCTRTWLLSH...	20	_MTVLEAVLEIQAITGSRLLsMVPGPARPPGSCWDPTQCTR
1	A0A075B6Q4	None	S24	ochoa	NaN	MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT...	Homo sapiens (Human)	A0A075B6Q4_S24	MDIQKSENEDDSEWEDVDDEKGDsNDDYDSAGLLsDEDCMSVPGKT...	24	QKSENEDDSEWEDVDDEKGDsNDDYDSAGLLsDEDCMSVPG
2	A0A075B6Q4	None	S35	ochoa	NaN	MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT...	Homo sapiens (Human)	A0A075B6Q4_S35	MDIQKSENEDDSEWEDVDDEKGDsNDDYDSAGLLsDEDCMSVPGKT...	35	EDVDDEKGDsNDDYDSAGLLsDEDCMSVPGKTHRAIADHLF
3	A0A075B6Q4	None	S57	ochoa	NaN	MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT...	Homo sapiens (Human)	A0A075B6Q4_S57	MDIQKSENEDDSEWEDVDDEKGDsNDDYDSAGLLsDEDCMSVPGKT...	57	EDCMSVPGKTHRAIADHLFWsEETKSRFTEYsMTssVMRRN
4	A0A075B6Q4	None	S68	ochoa	NaN	MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT...	Homo sapiens (Human)	A0A075B6Q4_S68	MDIQKSENEDDSEWEDVDDEKGDsNDDYDSAGLLsDEDCMSVPGKT...	68	RAIADHLFWsEETKSRFTEYsMTssVMRRNEQLTLHDERFE

extract_site_seq(df.head(),
                 seq_col='substrate_sequence',
                 site_info_col='site',
                 n=30
                 )

100%|██████████| 5/5 [00:00<00:00, 7169.75it/s]

array(['___________MTVLEAVLEIQAITGSRLLSMVPGPARPPGSCWDPTQCTRTWLLSHTPRR',
       '_______MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKTHRAIADHL',
       'KSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKTHRAIADHLFWSEETKSRFT',
       'DYDSAGLLSDEDCMSVPGKTHRAIADHLFWSEETKSRFTEYSMTSSVMRRNEQLTLHDERF',
       'DCMSVPGKTHRAIADHLFWSEETKSRFTEYSMTSSVMRRNEQLTLHDERFEKFYEQYDDDE'],
      dtype='<U61')
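The underscore padding seen in the first two rows can be reproduced by padding the protein sequence on both sides before slicing. This is a sketch for a single site under that assumption, with a hypothetical function name; the library function operates on a whole dataframe.

```python
def extract_window_sketch(protein_seq: str, position: int, n: int = 7) -> str:
    # Pad both ends so that sites near the termini still yield 2n + 1 characters
    padded = "_" * n + protein_seq + "_" * n
    center = position - 1 + n  # 1-based position shifted by the left padding
    return padded[center - n : center + n + 1]


seq = "MTVLEAVLEIQAITGSRLLSMVPGPARPPG"
print(extract_window_sketch(seq, 20, n=7))  # ITGSRLLSMVPGPAR
print(extract_window_sketch(seq, 2, n=3))   # __MTVLE
```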

PSSM

source

get_prob

 get_prob (df:pandas.core.frame.DataFrame, col:str, aa_order=['P', 'G',
           'A', 'C', 'S', 'T', 'V', 'I', 'L', 'M', 'F', 'Y', 'W', 'H',
           'K', 'R', 'Q', 'N', 'D', 'E', 's', 't', 'y'])

Get the probability matrix of PSSM from phosphorylation site sequences.

ks = Data.get_ks_dataset()

ks_k = ks[ks.kinase_uniprot=='P00519']

pssm_df = get_prob(ks_k,'site_seq')
pssm_df.head()

Position	-20	-19	-18	-17	-16	-15	-14	-13	-12	-11	...	11	12	13	14	15	16	17	18	19	20
aa
P	0.050061	0.048691	0.062349	0.055489	0.046988	0.054753	0.064787	0.055090	0.056683	0.048272	...	0.052728	0.051140	0.069436	0.063164	0.057716	0.056639	0.051072	0.050697	0.052163	0.060703
G	0.080586	0.080341	0.069007	0.067551	0.082530	0.070397	0.093581	0.073054	0.077566	0.072706	...	0.099939	0.070856	0.071916	0.075672	0.071518	0.064821	0.080076	0.088720	0.062341	0.090735
A	0.080586	0.080341	0.062954	0.054282	0.075301	0.071600	0.070186	0.070060	0.065632	0.070322	...	0.064378	0.077634	0.069436	0.072545	0.063363	0.079924	0.088272	0.087452	0.057888	0.070927
C	0.017094	0.012781	0.013317	0.019903	0.012048	0.017449	0.007798	0.014371	0.013126	0.012515	...	0.007357	0.017868	0.014879	0.012508	0.011920	0.018880	0.019546	0.014575	0.019084	0.014058
S	0.047619	0.035910	0.046610	0.030157	0.037349	0.042720	0.041992	0.041916	0.034010	0.039333	...	0.024525	0.036352	0.047117	0.040025	0.042033	0.040277	0.039092	0.051965	0.041349	0.039617

5 rows × 41 columns
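The idea behind such a probability matrix is to count how often each residue occurs at each position and divide by the number of sequences observed at that position. The sketch below uses only the standard library and a hypothetical function name; excluding the padding character "_" from the counts is an assumption, and the library's handling may differ.

```python
from collections import Counter


def position_frequencies_sketch(site_seqs):
    length = len(site_seqs[0])
    center = length // 2
    freqs = {}
    for i in range(length):
        # Count residues at this column, ignoring padding (assumption)
        counts = Counter(seq[i] for seq in site_seqs if seq[i] != "_")
        total = sum(counts.values())
        freqs[i - center] = {aa: c / total for aa, c in counts.items()} if total else {}
    return freqs


freqs = position_frequencies_sketch(["AAsPK", "AGsPR", "_AtPK"])
print(freqs[-2])  # {'A': 1.0}
print(sorted(freqs[0]))  # ['s', 't']
```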

source

pssm_to_seq

 pssm_to_seq (pssm_df, thr=0.4, contain_sty=True)

Represent PSSM in string sequence of amino acids

	Type	Default	Details
pssm_df
thr	float	0.4	threshold of probability to show in sequence
contain_sty	bool	True	keep only s,t,y values (last three) in center 0 position

pssm_to_seq(pssm_df,thr=0.1)

'........K.K.K..E.EEVy*[E/A].[L/P]....K..........L.'

source

recover_pssm

 recover_pssm (flat_pssm:pandas.core.series.Series)

Recover 2D pssm from flat pssm Series

pspa = Data.get_pspa_all_norm()

flat_pssm = pspa.loc['AAK1'].dropna()

recovered = recover_pssm(flat_pssm)
recovered

Position	-5	-4	-3	-2	-1	0	1	2	3	4
aa
A	0.0284	0.0706	0.1119	0.0538	0.0385	0.0000	0.0312	0.0750	0.0582	0.0646
C	0.0456	0.0560	0.0655	0.0588	0.0313	0.0000	0.0223	0.0684	0.0742	0.0603
D	0.0160	0.0332	0.0272	0.0805	0.0161	0.0000	0.0103	0.0552	0.0405	0.0389
E	0.0153	0.0560	0.0370	0.0636	0.0146	0.0000	0.0098	0.0524	0.0371	0.0457
F	0.0425	0.0520	0.0487	0.0336	0.0545	0.0000	0.0129	0.0486	0.0430	0.0523
G	0.0245	0.0642	0.0512	0.0283	0.0706	0.0000	0.7216	0.0749	0.0923	0.0702
H	0.0331	0.0514	0.0397	0.0567	0.0688	0.0000	0.0160	0.0620	0.0674	0.0560
I	0.1554	0.0621	0.0599	0.0378	0.0435	0.0000	0.0102	0.0370	0.0388	0.0415
K	0.0262	0.0809	0.0517	0.0587	0.1229	0.0000	0.0194	0.0577	0.0739	0.0831
L	0.0993	0.0742	0.0527	0.0569	0.0654	0.0000	0.0131	0.0414	0.0489	0.0461
M	0.0864	0.0693	0.0782	0.1166	0.0554	0.0000	0.0124	0.0481	0.0437	0.0464
N	0.0275	0.0429	0.0582	0.0808	0.0434	0.0000	0.0170	0.0834	0.0735	0.0592
P	0.0720	0.0534	0.1084	0.0226	0.1136	0.0000	0.0463	0.0527	0.0681	0.0628
Q	0.0560	0.0627	0.0695	0.1517	0.0397	0.0000	0.0138	0.0771	0.0623	0.0635
R	0.0956	0.0715	0.0581	0.0555	0.0923	0.0000	0.0249	0.0774	0.0901	0.0928
S	0.0425	0.0619	0.0527	0.0555	0.0545	0.1013	0.0143	0.0552	0.0582	0.0560
T	0.0425	0.0619	0.0527	0.0555	0.0545	1.0000	0.0143	0.0552	0.0582	0.0560
V	0.0951	0.0619	0.0641	0.0340	0.0392	0.0000	0.0107	0.0542	0.0610	0.0422
W	0.0315	0.0403	0.0425	0.0289	0.0440	0.0000	0.0162	0.0572	0.0481	0.0826
Y	0.0952	0.0534	0.0411	0.0399	0.0777	0.0000	0.0143	0.0459	0.0533	0.0521
pS	0.0201	0.0332	0.0303	0.0209	0.0121	0.1013	0.0123	0.0409	0.0335	0.0251
pT	0.0201	0.0332	0.0303	0.0209	0.0121	1.0000	0.0123	0.0409	0.0335	0.0251
pY	0.0611	0.0339	0.0274	0.0486	0.0178	0.0000	0.0100	0.0410	0.0359	0.0270

source

process_pssm

 process_pssm (pssm_df)

Keep only s,t,y values in center 0 position; normalize per position

norm_pssm = process_pssm(recovered)
norm_pssm.head()

Position	-5	-4	-3	-2	-1	0	1	2	3	4
aa
A	0.023054	0.055152	0.088880	0.042695	0.032558	0.0	0.028740	0.057613	0.044987	0.051701
C	0.037016	0.043747	0.052025	0.046663	0.026469	0.0	0.020542	0.052543	0.057355	0.048259
D	0.012988	0.025935	0.021604	0.063884	0.013615	0.0	0.009488	0.042403	0.031306	0.031132
E	0.012420	0.043747	0.029388	0.050472	0.012347	0.0	0.009027	0.040252	0.028677	0.036575
F	0.034500	0.040622	0.038681	0.026665	0.046089	0.0	0.011883	0.037333	0.033238	0.041857

source

pssm2dict

 pssm2dict (pssm_df)

Convert pssm dataframe to dict

pssm2dict(pssm_df.iloc[:1,:10])

{'-20P': 0.05006,
 '-19P': 0.04869,
 '-18P': 0.06235,
 '-17P': 0.05549,
 '-16P': 0.04699,
 '-15P': 0.05475,
 '-14P': 0.06479,
 '-13P': 0.05509,
 '-12P': 0.05668,
 '-11P': 0.04827}

JS divergence

source

js_divergence

 js_divergence (p1, p2, mean=True)

p1 and p2 are two arrays (df or np) with index as aa and column as position

	Type	Default	Details
p1			pssm
p2			pssm
mean	bool	True

js_divergence(pssm_df,pssm_df)

1.0000000826903708e-10
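For reference, the textbook Jensen-Shannon divergence for one position (one column of a PSSM) can be sketched as follows. This is the standard definition with log base 2, which is an assumption here; the small nonzero value above suggests the library adds a small constant for numerical stability, and its mean option presumably averages over positions.

```python
import math


def js_divergence_sketch(p, q):
    """Jensen-Shannon divergence (bits) between two discrete distributions."""

    def kl(a, b):
        # Terms with a == 0 contribute nothing to the KL divergence
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


print(js_divergence_sketch([0.5, 0.5], [0.5, 0.5]))  # 0.0
print(js_divergence_sketch([1.0, 0.0], [0.0, 1.0]))  # 1.0
```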

source

js_divergence_flat

 js_divergence_flat (p1_flat, p2_flat)

p1 and p2 are two flattened pd.Series with index as aa and column as position

	Details
p1_flat	pd.Series of flattened pssm
p2_flat	pd.Series of flattened pssm

flat_norm_pssm = pd.Series(pssm2dict(norm_pssm))

js_divergence(flat_norm_pssm,flat_norm_pssm)

1.0000050826907844e-09

Entropy & Information Content

source

entropy

 entropy (pssm_df, return_min=False, exclude_zero=False, contain_sty=True)

Calculate entropy per position (max) of a PSSM surrounding 0

	Type	Default	Details
pssm_df			a dataframe of pssm with index as aa and column as position
return_min	bool	False	return min entropy as a single value or return all entropy as a series
exclude_zero	bool	False	exclude the column of 0 (center position) in the entropy calculation
contain_sty	bool	True	keep only s,t,y values (last three) in center 0 position

entropy(pssm_df)

Position
-20    4.324109
-19    4.257291
-18    4.284732
-17    4.267692
-16    4.270273
-15    4.291726
-14    4.276993
-13    4.269681
-12    4.222083
-11    4.324761
-10    4.282922
-9     4.261746
-8     4.296601
-7     4.263406
-6     4.236002
-5     4.245160
-4     4.238012
-3     4.149155
-2     4.183371
-1     4.148746
 0     0.608195
 1     4.153184
 2     4.294678
 3     4.200320
 4     4.312974
 5     4.246091
 6     4.258706
 7     4.303146
 8     4.214329
 9     4.264253
 10    4.262009
 11    4.261383
 12    4.272229
 13    4.313546
 14    4.275907
 15    4.299622
 16    4.317631
 17    4.279185
 18    4.283658
 19    4.293755
 20    4.279981
dtype: float64
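The values above look like Shannon entropy in bits: flanking positions are close to, but below, log2(23) (about 4.52), the maximum for 23 residue types. A minimal sketch for one position, assuming log base 2 and a hypothetical function name:

```python
import math


def column_entropy_sketch(probs):
    # Shannon entropy in bits; zero-probability entries contribute nothing
    return -sum(p * math.log2(p) for p in probs if p > 0)


print(column_entropy_sketch([0.25, 0.25, 0.25, 0.25]))  # 2.0
print(column_entropy_sketch([0.5, 0.5]))  # 1.0
```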

source

entropy_flat

 entropy_flat (flat_pssm:pandas.core.series.Series, return_min=False,
               exclude_zero=False, contain_sty=True)

Calculate entropy per position of a flat PSSM surrounding 0

	Type	Default	Details
flat_pssm	Series
return_min	bool	False	return min entropy as a single value or return all entropy as a series
exclude_zero	bool	False	exclude the column of 0 (center position) in the entropy calculation
contain_sty	bool	True	keep only s,t,y values (last three) in center 0 position

source

get_IC_standard

 get_IC_standard (pssm_df)

Calculate the standard information content (bits) from frequency matrix, using the same number of residues log2(len(pssm_df)) for all positions

source

get_IC

 get_IC (pssm_df, return_min=False, exclude_zero=False, contain_sty=True)

Calculate the information content (bits) from a frequency matrix, using log2(3) for the middle position and log2(len(pssm_df)) for others.

	Type	Default	Details
pssm_df			a dataframe of pssm with index as aa and column as position
return_min	bool	False	return min entropy as a single value or return all entropy as a series
exclude_zero	bool	False	exclude the column of 0 (center position) in the entropy calculation
contain_sty	bool	True	keep only s,t,y values (last three) in center 0 position
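Information content is commonly defined as the maximum possible entropy minus the observed entropy. The sketch below follows that definition for a single position, with the number of possible states passed explicitly (3 for the center, the number of PSSM rows elsewhere, as described above); the function name is hypothetical and the library's exact computation may differ.

```python
import math


def information_content_sketch(probs, n_states):
    # Observed Shannon entropy of this position, in bits
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    # Information content: maximum entropy for n_states minus observed entropy
    return math.log2(n_states) - entropy


# Center position with s and t equally likely and no y: log2(3) - 1 bits
center_ic = information_content_sketch([0.5, 0.5, 0.0], n_states=3)
# A uniform flanking position over 23 residues carries no information
flank_ic = information_content_sketch([1 / 23] * 23, n_states=23)
print(round(center_ic, 4), round(abs(flank_ic), 4))
```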

source

get_IC_flat

 get_IC_flat (flat_pssm:pandas.core.series.Series, return_min=False,
              exclude_zero=False, contain_sty=True)

Calculate the information content (bits) from a flattened pssm pd.Series, using log2(3) for the middle position and log2(len(pssm_df)) for others.

	Type	Default	Details
flat_pssm	Series
return_min	bool	False	return min entropy as a single value or return all entropy as a series
exclude_zero	bool	False	exclude the column of 0 (center position) in the entropy calculation
contain_sty	bool	True	keep only s,t,y values (last three) in center 0 position

source

get_scaled_IC

 get_scaled_IC (pssm_df)

For plotting purposes, calculate the scaled information content (bits) from a frequency matrix, using log2(3) for the middle position and log2(len(pssm_df)) for others.

P values

source

get_pvalue

 get_pvalue (df, columns1, columns2, test_method='mann_whitney',
             FC_method='median')

Performs statistical tests and calculates difference between the median or mean of two groups of columns.

	Type	Default	Details
df
columns1			list of column names for group1
columns2			list of column names for group2
test_method	str	mann_whitney	‘student_t’, ‘mann_whitney’, ‘wilcoxon’
FC_method	str	median	or mean

source

get_metaP

 get_metaP (p_values)

Use Fisher’s method to calculate a combined p value given a list of p values; this function also allows negative p values (negative correlation)

p_values = [0.001,-0.5,0.002]

get_metaP(p_values)

0.0003626876953231754
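The result above is consistent with a signed variant of Fisher's method: each p value contributes -2 * ln|p| with its sign, and the combined statistic is compared against a chi-square distribution with 2k degrees of freedom. The sketch below reproduces the number using only the standard library. The formula is inferred from this one example, and returning a negative p value when the combined statistic is negative is an assumption.

```python
import math


def signed_fisher_sketch(p_values):
    k = len(p_values)
    # Signed statistic: negative p values subtract evidence instead of adding it
    stat = -2 * sum(math.copysign(1, p) * math.log(abs(p)) for p in p_values)
    half = abs(stat) / 2
    # Chi-square survival function with 2k degrees of freedom (closed form for even df)
    sf = math.exp(-half) * sum(half**i / math.factorial(i) for i in range(k))
    return math.copysign(sf, stat)  # assumption: the sign is carried to the result


combined = signed_fisher_sketch([0.001, -0.5, 0.002])
print(round(combined, 9))  # 0.000362688
```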

PSPA normalization

source

raw2norm

 raw2norm (df:pandas.core.frame.DataFrame, PDHK:bool=False)

Normalize single ST kinase data

	Type	Default	Details
df	DataFrame		single kinase’s df has position as index, and single amino acid as columns
PDHK	bool	False	whether this kinase belongs to PDHK family

This function implements the normalization method from Johnson et al., Nature: "An atlas of substrate specificities for the human serine/threonine kinome".

Specifically:

- Matrices were column-normalized at all positions by the sum of the 17 randomized amino acids (excluding serine, threonine and cysteine), to yield PSSMs.
- PDHK1 and PDHK4 were normalized to the 16 randomized amino acids (excluding serine, threonine, cysteine and additionally tyrosine).
- The cysteine row was scaled by its median to be 1/17 (1/16 for PDHK1 and PDHK4).
- The serine and threonine values in each position were set to be the median of that position.
- The S0/T0 ratio was determined by summing the values of S and T rows in the matrix (SS and ST, respectively), accounting for the different S vs. T composition of the central (1:1) and peripheral (only S or only T) positions (Sctrl and Tctrl, respectively), and then normalizing to the higher value among the two (S0 and T0, respectively, Supplementary Note 1).
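The first steps of this procedure can be sketched as below, for a matrix with positions as the index and amino acids as columns. This is an illustration under assumptions, not the body of raw2norm: `normalize_sketch` is a hypothetical name, the per-position median is taken over the randomized amino acids only, and the S0/T0 ratio step is left out.

```python
import pandas as pd

def normalize_sketch(df: pd.DataFrame, pdhk: bool = False) -> pd.DataFrame:
    """Normalize one kinase matrix (index: position, columns: amino acids)."""
    excluded = ["S", "T", "C"] + (["Y"] if pdhk else [])
    randomized = [aa for aa in df.columns if aa not in excluded]   # 17, or 16 for PDHK
    # Divide every position by the sum of its randomized amino acids
    out = df.div(df[randomized].sum(axis=1), axis=0)
    # Scale cysteine so that its median across positions is 1/17 (or 1/16)
    out["C"] = out["C"] / out["C"].median() / len(randomized)
    # Assumption: S and T take the median of the randomized amino acids at each position
    position_median = out[randomized].median(axis=1)
    out["S"] = position_median
    out["T"] = position_median
    return out

# Toy raw matrix with made-up positive values
aas = list("ACDEFGHIKLMNPQRSTVWY")
raw = pd.DataFrame(
    [[1.0 + (3 * i + j) % 7 for j in range(len(aas))] for i in range(3)],
    index=[-1, 1, 2], columns=aas,
)
norm = normalize_sketch(raw)
print(norm.round(4))
```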

This function is usually called through get_one_kinase below, by setting its bool argument normalize to True.

source

get_one_kinase

 get_one_kinase (df:pandas.core.frame.DataFrame, kinase:str,
                 normalize:bool=False, drop_s:bool=True)

Obtain the data of a specific kinase from a stacked dataframe

	Type	Default	Details
df	DataFrame		stacked dataframe (paper’s raw data)
kinase	str		a specific kinase
normalize	bool	False	normalize according to the paper; special for PDHK1/4
drop_s	bool	True	drop s, as s is a duplicate of t in PSPA

Retrieve the data of a single kinase from PSPA data that has kinases as the index and position + amino acid as the columns.

data = Data.get_pspa_st_norm()

get_one_kinase(data,'PDHK1')

aa	A	C	D	E	F	G	H	I	K	L	...	P	Q	R	S	T	V	W	Y	t	y
position
-5	0.0594	0.0625	0.0589	0.0550	0.0775	0.0697	0.0687	0.0590	0.0515	0.0657	...	0.0451	0.0424	0.0594	0.0594	0.0594	0.0573	0.1001	0.0775	0.0583	0.0658
-4	0.0618	0.0621	0.0550	0.0511	0.0739	0.0715	0.0598	0.0601	0.0520	0.0614	...	0.0637	0.0552	0.0617	0.0608	0.0608	0.0519	0.0916	0.0739	0.0528	0.0752
-3	0.0608	0.0576	0.0499	0.0423	0.0803	0.0580	0.0674	0.0687	0.0481	0.0667	...	0.0570	0.0532	0.0532	0.0584	0.0584	0.0588	0.1113	0.0803	0.0416	0.0553
-2	0.0587	0.0655	0.0470	0.0437	0.0790	0.0890	0.0787	0.0533	0.0440	0.0637	...	0.0500	0.0543	0.0616	0.0565	0.0565	0.0519	0.1082	0.0790	0.0327	0.0557
-1	0.0782	0.1009	0.0989	0.0426	0.0650	0.0695	0.0782	0.0496	0.0409	0.0578	...	0.0540	0.0500	0.0469	0.0594	0.0594	0.0514	0.0756	0.0650	0.0358	0.0433
0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	1.0000	0.4886	NaN	NaN	0.0000	0.4886	0.0000
1	0.0400	0.0562	0.0394	0.0355	0.0735	0.0400	0.0502	0.1288	0.0390	0.1439	...	0.0379	0.0455	0.0455	0.0455	0.0455	0.0797	0.0784	0.0735	0.0336	0.0452
2	0.0496	0.0783	0.0643	0.0555	0.0720	0.1067	0.0684	0.0480	0.0505	0.0555	...	0.0564	0.0653	0.0695	0.0601	0.0601	0.0508	0.0672	0.0720	0.0414	0.0594
3	0.0486	0.0609	0.0938	0.0684	0.1024	0.0676	0.0544	0.0583	0.0388	0.0552	...	0.0686	0.0502	0.0561	0.0588	0.0588	0.0593	0.0641	0.1024	0.0539	0.0431
4	0.0565	0.0749	0.0631	0.0535	0.0732	0.0655	0.0664	0.0625	0.0496	0.0552	...	0.0677	0.0553	0.0604	0.0626	0.0626	0.0579	0.0864	0.0732	0.0548	0.0575

10 rows × 22 columns
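The reshaping that get_one_kinase performs, from one row of the stacked data to a position-by-amino-acid matrix, can be sketched as follows. The column label format used here (position followed by the amino acid, for example "-5P") and the helper name `unstack_kinase_row` are assumptions for illustration only.

```python
import pandas as pd

def unstack_kinase_row(row: pd.Series) -> pd.DataFrame:
    """Turn one stacked row (labels like '-5P') into a position x amino acid matrix."""
    # Split each label into a signed integer position and an amino acid code
    parts = row.index.str.extract(r"^(-?\d+)(.+)$")
    tidy = pd.DataFrame({
        "position": parts[0].astype(int).values,
        "aa": parts[1].values,
        "value": row.values,
    })
    return tidy.pivot(index="position", columns="aa", values="value")

# Toy stacked row for a single hypothetical kinase
stacked_row = pd.Series({"-1P": 0.1, "-1G": 0.2, "0S": 1.0, "1P": 0.3, "1G": 0.4})
matrix = unstack_kinase_row(stacked_row)
print(matrix)
```

Combinations that are absent from the stacked row, such as P at position 0 in the toy example, appear as NaN, in the same way as the center row of the output above.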

End

# def get_freq(df_k: pd.DataFrame, # a dataframe for a single kinase that contains phosphorylation sequences split by position
#              aa_order = [i for i in 'PGACSTVILMFYWHKRQNDEsty'], # amino acid to include in the full matrix 
#              aa_order_paper = [i for i in 'PGACSTVILMFYWHKRQNDEsty'], # amino acid to include in the partial matrix
#              position = [i for i in range(-7,8)], # position to include in the full matrix
#              position_paper = [-5,-4,-3,-2,-1,1,2,3,4] # position to include in the partial matrix
#              ):
    
#     "Get frequency matrix given a dataframe of phosphorylation sites for a single kinase"
    

#     #Count frequency for each amino acid at each position
#     melted_k = df_k.melt(
#                     value_vars=[i for i in range(-7, 8)],
#                     var_name='Position', 
#                     value_name='aa')
    
#     # Group by Position and Amino Acid and count occurrences
#     grouped = melted_k.groupby(['Position', 'aa']).size().reset_index(name='Count')
    

#     # Remove weird amino acids
#     aa_include = [i for i in 'PGACSTVILMFYWHKRQNDEsty']
#     grouped = grouped[grouped.aa.isin(aa_include)].reset_index(drop=True)
    
#     # get pivot table
#     pivot_k = grouped.pivot(index='aa', columns='Position', values='Count').fillna(0)
    
#     # Get frequency by dividing the sum of each column
#     freq_k = pivot_k/pivot_k.sum()

    
#     # data from the kinase-substrate dataset, and format is Lew's paper's format
#     paper = freq_k.reindex(index=aa_order_paper,columns=position_paper,fill_value=0)

#     # full pivot data from kinase-substrate dataset
#     full = freq_k.reindex(index=aa_order,columns=position, fill_value=0)

    
#     return paper,full

# # get frequency matrix
# paper_format, full = get_freq(ks_k)
# paper_format.head()

# def get_unique_site(df:pd.DataFrame = None,# dataframe that contains phosphorylation sites
#                     seq_col: str='site_seq', # column name of site sequence
#                     id_col: str='gene_site' # column name of site id
#                    ):
#     "Remove duplicates among phosphorylation sites; return df with new columns of acceptor and number of duplicates"
    
#     unique = df.groupby(seq_col).agg(
#         {id_col: lambda r: '|'.join(r.unique())} )
#     unique['num_site'] = unique[id_col].str.split('|').apply(len) 
#     unique = unique.reset_index()
#     position = len(unique[seq_col][0])//2
#     unique['acceptor'] = unique[seq_col].str[position]
    
#     return unique

Because many phosphorylation site sequences are duplicated in the dataset, it can be helpful to remove the duplicates.

Use get_unique_site to get unique phosphorylation sites. The names of the sequence column and the id column must be provided.

# df = Data.get_ochoa_site()
# unique = get_unique_site(df,seq_col='site_seq',id_col='gene_site')
# unique.sort_values('num_site',ascending=False).head()