Data processing

from katlas.core import *
import pandas as pd

Phosphorylate entire protein sequence

df_group=Data.get_ks_dataset()
df_group.head()
|  | kin_sub_site | kinase_uniprot | substrate_uniprot | site | source | substrate_genes | substrate_phosphoseq | position | site_seq | sub_site | substrate_sequence |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | O00141_A4FU28_S140 | O00141 | A4FU28 | S140 | Sugiyama | CTAGE9 | MEEPGATPQPYLGLVLEELGRVVAALPESMRPDENPYGFPSELVVC... | 140 | AAAEEARSLEATCEKLSRsNsELEDEILCLEKDLKEEKSKH | A4FU28_S140 | MEEPGATPQPYLGLVLEELGRVVAALPESMRPDENPYGFPSELVVC... |
| 1 | O00141_O00141_S252 | O00141 | O00141 | S252 | Sugiyama | SGK1 SGK | MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS... | 252 | SQGHIVLTDFGLCKENIEHNsTtstFCGtPEyLAPEVLHKQ | O00141_S252 | MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS... |
| 2 | O00141_O00141_S255 | O00141 | O00141 | S255 | Sugiyama | SGK1 SGK | MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS... | 255 | HIVLTDFGLCKENIEHNsTtstFCGtPEyLAPEVLHKQPYD | O00141_S255 | MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS... |
| 3 | O00141_O00141_S397 | O00141 | O00141 | S397 | Sugiyama | SGK1 SGK | MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS... | 397 | sGPNDLRHFDPEFTEEPVPNsIGKsPDsVLVTAsVKEAAEA | O00141_S397 | MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS... |
| 4 | O00141_O00141_S404 | O00141 | O00141 | S404 | Sugiyama | SGK1 SGK | MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS... | 404 | HFDPEFTEEPVPNsIGKsPDsVLVTAsVKEAAEAFLGFsYA | O00141_S404 | MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS... |
phosphorylate_seq_df?
Signature:
phosphorylate_seq_df(
    df,
    id_col='substrate_uniprot',
    site_info_col='site',
    sub_seq_col='substrate_sequence',
)
Docstring: Phosphorylate whole sequence based on phosphosites in a dataframe
File:      ~/katlas/katlas/core.py
Type:      function
seq = phosphorylate_seq_df(df_group)
seq.head(1)
|  | substrate_uniprot | site | substrate_sequence | substrate_phosphoseq |
|---|---|---|---|---|
| 0 | A0A2R8Y4L2 | [S95, S22, T25, S6, S158] | MSKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVM... | MSKSEsPKEPEQLRKLFIGGLsFEtTDESLRSHFEQWGTLTDCVVM... |
seq_map = seq.set_index('substrate_uniprot')['substrate_phosphoseq']

df_group['substrate_phosphoseq'] = df_group.substrate_uniprot.map(seq_map)
df_group.head()
|  | kin_sub_site | kinase_uniprot | substrate_uniprot | site | source | substrate_genes | substrate_phosphoseq | position | site_seq | sub_site | substrate_sequence |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | O00141_A4FU28_S140 | O00141 | A4FU28 | S140 | Sugiyama | CTAGE9 | MEEPGATPQPYLGLVLEELGRVVAALPESMRPDENPYGFPSELVVC... | 140 | AAAEEARSLEATCEKLSRsNsELEDEILCLEKDLKEEKSKH | A4FU28_S140 | MEEPGATPQPYLGLVLEELGRVVAALPESMRPDENPYGFPSELVVC... |
| 1 | O00141_O00141_S252 | O00141 | O00141 | S252 | Sugiyama | SGK1 SGK | MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS... | 252 | SQGHIVLTDFGLCKENIEHNsTtstFCGtPEyLAPEVLHKQ | O00141_S252 | MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS... |
| 2 | O00141_O00141_S255 | O00141 | O00141 | S255 | Sugiyama | SGK1 SGK | MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS... | 255 | HIVLTDFGLCKENIEHNsTtstFCGtPEyLAPEVLHKQPYD | O00141_S255 | MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS... |
| 3 | O00141_O00141_S397 | O00141 | O00141 | S397 | Sugiyama | SGK1 SGK | MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS... | 397 | sGPNDLRHFDPEFTEEPVPNsIGKsPDsVLVTAsVKEAAEA | O00141_S397 | MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS... |
| 4 | O00141_O00141_S404 | O00141 | O00141 | S404 | Sugiyama | SGK1 SGK | MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS... | 404 | HFDPEFTEEPVPNsIGKsPDsVLVTAsVKEAAEAFLGFsYA | O00141_S404 | MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS... |
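
As a rough illustration of what phosphorylating a whole sequence means here, the sketch below lowercases the residue at each listed site. This is the convention visible in the substrate_phosphoseq output above, where S6 turns MSKSES... into MSKSEs.... The helper name phosphorylate_sketch is hypothetical and is not the katlas implementation; it assumes sites are strings such as S22 with 1-indexed positions.

```python
def phosphorylate_sketch(seq, sites):
    """Lowercase the residue at each site (hypothetical helper, not katlas code).

    Assumes each site looks like 'S22': one residue letter followed by a
    1-indexed position in `seq`.
    """
    residues = list(seq)
    for site in sites:
        aa, pos = site[0], int(site[1:])
        idx = pos - 1  # convert the 1-indexed position to a 0-indexed offset
        if not 0 <= idx < len(residues):
            raise ValueError(f"{site} is outside a sequence of length {len(seq)}")
        if residues[idx].upper() != aa.upper():
            raise ValueError(f"{site} does not match residue {residues[idx]!r}")
        residues[idx] = residues[idx].lower()
    return "".join(residues)


# Visible prefix of the A0A2R8Y4L2 sequence shown above; only sites that
# fall inside this prefix are used.
prefix = "MSKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVM"
print(phosphorylate_sketch(prefix, ["S22", "T25", "S6"]))
# MSKSEsPKEPEQLRKLFIGGLsFEtTDESLRSHFEQWGTLTDCVVM
```

The residue check is a design choice of this sketch: it fails loudly when a site annotation and the sequence disagree instead of silently lowercasing the wrong residue.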

Extract site sequence

df = Data.get_human_site().head().copy()

extract_site_seq extracts the sequence window from -n to +n around each site (2n+1 residues in total), given the protein sequence and the site position.

extract_site_seq?
Signature:
extract_site_seq(
    df: pandas.core.frame.DataFrame,
    seq_col: str,
    site_info_col: str,
    n=7,
)
Docstring: Extract -n to +n site sequence from protein sequence
File:      ~/katlas/katlas/core.py
Type:      function
N=20
df['site_seq'] = extract_site_seq(df,
                                  seq_col='substrate_sequence',
                                  site_info_col='site',
                                  n=N)
100%|██████████| 5/5 [00:00<00:00, 7741.42it/s]
df.site_seq
0    _MTVLEAVLEIQAITGSRLLSMVPGPARPPGSCWDPTQCTR
1    QKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPG
2    EDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKTHRAIADHLF
3    EDCMSVPGKTHRAIADHLFWSEETKSRFTEYSMTSSVMRRN
4    RAIADHLFWSEETKSRFTEYSMTSSVMRRNEQLTLHDERFE
Name: site_seq, dtype: object
df.site_seq.str[N].value_counts()
site_seq
S    5
Name: count, dtype: int64
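
The window logic can be sketched for a single sequence as follows. This is a minimal illustration, not the katlas implementation: the helper name extract_window is hypothetical, and padding with '_' is an assumption inferred from row 0 of the output above, where the window runs past the start of the protein.

```python
def extract_window(seq, site, n=7, pad="_"):
    """Return the -n..+n window around a site such as 'S20' (hypothetical helper).

    Assumes a 1-indexed site position and pads with `pad` wherever the
    window extends beyond either end of the sequence.
    """
    center = int(site[1:]) - 1  # 0-indexed position of the site residue
    if not 0 <= center < len(seq):
        raise ValueError(f"{site} is outside a sequence of length {len(seq)}")
    left_pad = max(0, n - center)
    right_pad = max(0, center + n - (len(seq) - 1))
    start = max(0, center - n)
    end = min(len(seq), center + n + 1)
    return pad * left_pad + seq[start:end] + pad * right_pad


window = extract_window("ABCDEFG", "B2", n=3)
print(window)          # __ABCDE
print(window[3])       # B, the site residue sits at index n
print(len(window))     # 7, i.e. 2*n + 1
```

Because every window has length 2n+1, the site residue always sits at index n, which is why df.site_seq.str[N] above returns the phosphorylated residue for every row.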

Cut sequence

To trim a site sequence to a narrower window around its center residue, pass cut_seq the start and end offsets relative to the center:

cut_seq('AAkUuPSFSTtH',-5,4)
'AkUuPSFSTt'
df.site_seq.apply(lambda x: cut_seq(x,-5,4))
0    GSRLLSMVPG
1    DEKGDSNDDY
2    SAGLLSDEDC
3    DHLFWSEETK
4    RFTEYSMTSS
Name: site_seq, dtype: object
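
The outputs above are consistent with taking the center at index len(seq) // 2 and slicing inclusively between the two offsets. The sketch below reproduces that behavior under this assumption; cut_window is a hypothetical name and is not the katlas function.

```python
def cut_window(seq, min_position, max_position):
    """Slice `seq` around its center residue (hypothetical helper).

    Assumes the center is at index len(seq) // 2 and that both offsets are
    inclusive, e.g. -5 and 4 keep five residues before the center, the
    center itself, and four residues after it.
    """
    center = len(seq) // 2
    start = center + min_position
    end = center + max_position + 1  # +1 makes the upper offset inclusive
    if start < 0 or end > len(seq):
        raise ValueError("requested offsets fall outside the sequence")
    return seq[start:end]


print(cut_window("AAkUuPSFSTtH", -5, 4))
# AkUuPSFSTt
print(cut_window("_MTVLEAVLEIQAITGSRLLSMVPGPARPPGSCWDPTQCTR", -5, 4))
# GSRLLSMVPG
```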

Check site

df['site_seq'] = check_seq_df(df,'site_seq')