Protein

Setup

Uniprot sequence


get_uniprot_seq


def get_uniprot_seq(
    uniprot_id
):

Queries the UniProt database to retrieve the protein sequence for a given UniProt ID.

get_uniprot_seq('P04626')
'MELAALCRWGLLLALLPPGAASTQVCTGTDMKLRLPASPETHLDMLRHLYQGCQVVQGNLELTYLPTNASLSFLQDIQEVQGYVLIAHNQVRQVPLQRLRIVRGTQLFEDNYALAVLDNGDPLNNTTPVTGASPGGLRELQLRSLTEILKGGVLIQRNPQLCYQDTILWKDIFHKNNQLALTLIDTNRSRACHPCSPMCKGSRCWGESSEDCQSLTRTVCAGGCARCKGPLPTDCCHEQCAAGCTGPKHSDCLACLHFNHSGICELHCPALVTYNTDTFESMPNPEGRYTFGASCVTACPYNYLSTDVGSCTLVCPLHNQEVTAEDGTQRCEKCSKPCARVCYGLGMEHLREVRAVTSANIQEFAGCKKIFGSLAFLPESFDGDPASNTAPLQPEQLQVFETLEEITGYLYISAWPDSLPDLSVFQNLQVIRGRILHNGAYSLTLQGLGISWLGLRSLRELGSGLALIHHNTHLCFVHTVPWDQLFRNPHQALLHTANRPEDECVGEGLACHQLCARGHCWGPGPTQCVNCSQFLRGQECVEECRVLQGLPREYVNARHCLPCHPECQPQNGSVTCFGPEADQCVACAHYKDPPFCVARCPSGVKPDLSYMPIWKFPDEEGACQPCPINCTHSCVDLDDKGCPAEQRASPLTSIISAVVGILLVVVLGVVFGILIKRRQQKIRKYTMRRLLQETELVEPLTPSGAMPNQAQMRILKETELRKVKVLGSGAFGTVYKGIWIPDGENVKIPVAIKVLRENTSPKANKEILDEAYVMAGVGSPYVSRLLGICLTSTVQLVTQLMPYGCLLDHVRENRGRLGSQDLLNWCMQIAKGMSYLEDVRLVHRDLAARNVLVKSPNHVKITDFGLARLLDIDETEYHADGGKVPIKWMALESILRRRFTHQSDVWSYGVTVWELMTFGAKPYDGIPAREIPDLLEKGERLPQPPICTIDVYMIMVKCWMIDSECRPRFRELVSEFSRMARDPQRFVVIQNEDLGPASPLDSTFYRSLLEDDDMGDLVDAEEYLVPQQGFFCPDPAPGAGGMVHHRHRSSSTRSGGGDLTLGLEPSEEEAPRSPLAPSEGAGSDVFDGDLGMGAAKGLQSLPTHDPSPLQRYSEDPTVPLPSETDGYVAPLTCSPQPEYVNQPDVRPQPPSPREGPLPAARPAGATLERPKTLSPGKNGVVKDVFAFGGAVENPEYLTPQGGAAPQPHPPPAFSPAFDNLYYWDQDPPERGAPPSTFKGTPTAENPEYLGLDVPV'
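For illustration, a lookup like this can be sketched against UniProt's public REST endpoint, which serves FASTA text at `https://rest.uniprot.org/uniprotkb/<accession>.fasta`. This is a minimal sketch and not necessarily how `get_uniprot_seq` is implemented; the helper names `parse_fasta` and `fetch_uniprot_fasta` are hypothetical.

```python
from urllib.request import urlopen

# Public UniProt REST endpoint that returns one entry as FASTA text.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{}.fasta"


def parse_fasta(text: str) -> str:
    """Return the sequence of the first record in a FASTA string."""
    seq_lines = []
    seen_header = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header:
                break  # a second header starts the next record; stop here
            seen_header = True
            continue
        seq_lines.append(line)
    return "".join(seq_lines)


def fetch_uniprot_fasta(uniprot_id: str, timeout: float = 10.0) -> str:
    """Hypothetical helper: download one entry and return its sequence (needs network)."""
    with urlopen(UNIPROT_FASTA_URL.format(uniprot_id), timeout=timeout) as resp:
        return parse_fasta(resp.read().decode("utf-8"))
```

Keeping the parser separate from the download makes it possible to check the parsing logic offline.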

get_uniprot_features


def get_uniprot_features(
    uniprot_id
):

Given a UniProt ID, retrieve the entry's protein name, gene name, and annotated sequence features.

get_uniprot_features('P04626').keys()
dict_keys(['uniprot_id', 'protein_name', 'gene_name', 'features'])
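A dictionary with these four keys could be assembled from a UniProt JSON entry roughly as follows. This sketch assumes the entry layout served by the UniProt REST API (`primaryAccession`, `proteinDescription.recommendedName.fullName.value`, `genes[0].geneName.value`, and `features[i].location.start.value` / `.end.value`); the function name `summarize_entry` is hypothetical, and the real `get_uniprot_features` may differ.

```python
def summarize_entry(entry: dict) -> dict:
    """Flatten a UniProt-style JSON entry (assumed layout) into a small summary dict."""
    protein_name = (
        entry.get("proteinDescription", {})
        .get("recommendedName", {})
        .get("fullName", {})
        .get("value", "")
    )
    genes = entry.get("genes") or [{}]
    gene_name = genes[0].get("geneName", {}).get("value", "")

    features = []
    for feat in entry.get("features", []):
        location = feat.get("location", {})
        features.append(
            {
                "type": feat.get("type", ""),
                # Positions are 1-based and inclusive.
                "start": location.get("start", {}).get("value"),
                "end": location.get("end", {}).get("value"),
                "description": feat.get("description", ""),
            }
        )

    return {
        "uniprot_id": entry.get("primaryAccession", ""),
        "protein_name": protein_name,
        "gene_name": gene_name,
        "features": features,
    }
```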

get_uniprot_kd


def get_uniprot_kd(
    uniprot_id
):

Look up the ‘Domain: Protein kinase’ feature(s) for a UniProt ID and return each domain's start, end, and sequence.

get_uniprot_kd('P04626')
[{'uniprot_id': 'P04626',
  'protein_name': 'Receptor tyrosine-protein kinase erbB-2',
  'gene_name': 'ERBB2',
  'start': 720,
  'end': 987,
  'description': 'Protein kinase',
  'sequence': 'LRKVKVLGSGAFGTVYKGIWIPDGENVKIPVAIKVLRENTSPKANKEILDEAYVMAGVGSPYVSRLLGICLTSTVQLVTQLMPYGCLLDHVRENRGRLGSQDLLNWCMQIAKGMSYLEDVRLVHRDLAARNVLVKSPNHVKITDFGLARLLDIDETEYHADGGKVPIKWMALESILRRRFTHQSDVWSYGVTVWELMTFGAKPYDGIPAREIPDLLEKGERLPQPPICTIDVYMIMVKCWMIDSECRPRFRELVSEFSRMARDPQRFV'}]

get_uniprot_type


def get_uniprot_type(
    uniprot_id, type_:str='Signal'
):

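Return the sequence regions annotated with a given feature type (e.g., 'Signal' or 'Transmembrane') for a UniProt ID. If the type is not present, the available feature types are listed instead.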

get_uniprot_type('P04626','S') # not a valid feature type: the available types are listed instead
No feature of type 'S' found for P04626.
Available feature types: Active site, Alternative sequence, Beta strand, Binding site, Chain, Compositional bias, Disulfide bond, Domain, Glycosylation, Helix, Modified residue, Motif, Mutagenesis, Natural variant, Region, Signal, Topological domain, Transmembrane, Turn
['Active site',
 'Alternative sequence',
 'Beta strand',
 'Binding site',
 'Chain',
 'Compositional bias',
 'Disulfide bond',
 'Domain',
 'Glycosylation',
 'Helix',
 'Modified residue',
 'Motif',
 'Mutagenesis',
 'Natural variant',
 'Region',
 'Signal',
 'Topological domain',
 'Transmembrane',
 'Turn']
get_uniprot_type('P04626','Signal') # signal peptide
[{'uniprot_id': 'P04626',
  'type': 'Signal',
  'protein_name': 'Receptor tyrosine-protein kinase erbB-2',
  'gene_name': 'ERBB2',
  'start': 1,
  'end': 22,
  'description': '',
  'sequence': 'MELAALCRWGLLLALLPPGAAS'}]
get_uniprot_type('P04626','Transmembrane') # tm domain
[{'uniprot_id': 'P04626',
  'type': 'Transmembrane',
  'protein_name': 'Receptor tyrosine-protein kinase erbB-2',
  'gene_name': 'ERBB2',
  'start': 653,
  'end': 675,
  'description': 'Helical',
  'sequence': 'SIISAVVGILLVVVLGVVFGILI'}]
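The behaviour shown above (filter features by type, slice the sequence with 1-based inclusive coordinates, and fall back to listing the available types) can be sketched as follows. The function name `regions_of_type` and the flat feature dictionaries are assumptions made for this example, not the library's internals.

```python
def regions_of_type(seq: str, features: list, type_: str = "Signal") -> list:
    """Return features of one type with their sequence, or the available types if none match."""
    hits = [f for f in features if f["type"] == type_]
    if not hits:
        available = sorted({f["type"] for f in features})
        print(f"No feature of type {type_!r} found.")
        print("Available feature types: " + ", ".join(available))
        return available
    # start/end are 1-based and inclusive, so the Python slice is [start-1:end].
    return [{**f, "sequence": seq[f["start"] - 1 : f["end"]]} for f in hits]


# First 26 residues of P04626, copied from the sequence shown earlier.
seq = "MELAALCRWGLLLALLPPGAASTQVC"
features = [
    {"type": "Signal", "start": 1, "end": 22, "description": ""},
    {"type": "Chain", "start": 23, "end": 26, "description": "toy chain for this example"},
]
print(regions_of_type(seq, features, "Signal")[0]["sequence"])
```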

Mutate sequence


apply_mut_single


def apply_mut_single(
    seq, # protein sequence
    mutations:VAR_POSITIONAL, # e.g., E709A
    start_pos:int=1, # if the protein sequence does not start from index 1, indicate the start index to match the mutations
):

Apply mutations to a protein sequence.

seq = get_uniprot_seq('P04626')
mut_seq = apply_mut_single(seq,'M1A','E2S')
mut_seq
Converted: M1A
Converted: E2S
'ASLAALCRWGLLLALLPPGAASTQVCTGTDMKLRLPASPETHLDMLRHLYQGCQVVQGNLELTYLPTNASLSFLQDIQEVQGYVLIAHNQVRQVPLQRLRIVRGTQLFEDNYALAVLDNGDPLNNTTPVTGASPGGLRELQLRSLTEILKGGVLIQRNPQLCYQDTILWKDIFHKNNQLALTLIDTNRSRACHPCSPMCKGSRCWGESSEDCQSLTRTVCAGGCARCKGPLPTDCCHEQCAAGCTGPKHSDCLACLHFNHSGICELHCPALVTYNTDTFESMPNPEGRYTFGASCVTACPYNYLSTDVGSCTLVCPLHNQEVTAEDGTQRCEKCSKPCARVCYGLGMEHLREVRAVTSANIQEFAGCKKIFGSLAFLPESFDGDPASNTAPLQPEQLQVFETLEEITGYLYISAWPDSLPDLSVFQNLQVIRGRILHNGAYSLTLQGLGISWLGLRSLRELGSGLALIHHNTHLCFVHTVPWDQLFRNPHQALLHTANRPEDECVGEGLACHQLCARGHCWGPGPTQCVNCSQFLRGQECVEECRVLQGLPREYVNARHCLPCHPECQPQNGSVTCFGPEADQCVACAHYKDPPFCVARCPSGVKPDLSYMPIWKFPDEEGACQPCPINCTHSCVDLDDKGCPAEQRASPLTSIISAVVGILLVVVLGVVFGILIKRRQQKIRKYTMRRLLQETELVEPLTPSGAMPNQAQMRILKETELRKVKVLGSGAFGTVYKGIWIPDGENVKIPVAIKVLRENTSPKANKEILDEAYVMAGVGSPYVSRLLGICLTSTVQLVTQLMPYGCLLDHVRENRGRLGSQDLLNWCMQIAKGMSYLEDVRLVHRDLAARNVLVKSPNHVKITDFGLARLLDIDETEYHADGGKVPIKWMALESILRRRFTHQSDVWSYGVTVWELMTFGAKPYDGIPAREIPDLLEKGERLPQPPICTIDVYMIMVKCWMIDSECRPRFRELVSEFSRMARDPQRFVVIQNEDLGPASPLDSTFYRSLLEDDDMGDLVDAEEYLVPQQGFFCPDPAPGAGGMVHHRHRSSSTRSGGGDLTLGLEPSEEEAPRSPLAPSEGAGSDVFDGDLGMGAAKGLQSLPTHDPSPLQRYSEDPTVPLPSETDGYVAPLTCSPQPEYVNQPDVRPQPPSPREGPLPAARPAGATLERPKTLSPGKNGVVKDVFAFGGAVENPEYLTPQGGAAPQPHPPPAFSPAFDNLYYWDQDPPERGAPPSTFKGTPTAENPEYLGLDVPV'
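As a rough sketch of what applying point mutations involves: parse each mutation into reference residue, position, and replacement; convert the position to a 0-based index using `start_pos`; check the reference residue; then substitute. The function name `apply_point_mutations` and the choice to raise `ValueError` on a mismatch are assumptions of this example and may not match `apply_mut_single`.

```python
import re

# Matches point substitutions such as "E709A": reference residue, position, new residue.
_POINT = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_point_mutations(seq: str, *mutations: str, start_pos: int = 1) -> str:
    """Apply point substitutions to seq, where seq[0] is residue number start_pos."""
    residues = list(seq)
    for mut in mutations:
        match = _POINT.match(mut)
        if match is None:
            raise ValueError(f"Not a point substitution: {mut!r}")
        ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
        idx = pos - start_pos  # convert residue number to a 0-based index
        if not 0 <= idx < len(residues):
            raise ValueError(f"Position {pos} is outside the sequence")
        if residues[idx] != ref:
            raise ValueError(f"Expected {ref} at {pos}, found {residues[idx]}")
        residues[idx] = alt
    return "".join(residues)


print(apply_point_mutations("MELAAL", "M1A", "E2S"))
```

Checking the reference residue before substituting catches numbering mistakes early, which matters most when `seq` is a truncated fragment and `start_pos` is easy to get wrong.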

apply_mut_complex


def apply_mut_complex(
    seq, # protein sequence
    mut, # mutation (e.g., G776delinsVC/S783C, G778dupGSP)
    start_pos:int=1, # residue number of seq[0]; set this when seq is a truncated protein so mutation positions line up
):

Apply a composite mutation like ‘G776delinsVC/S783C’ to seq, assuming seq[0] corresponds to residue number start_pos.

  • At most one delins or dup is allowed.
  • Point substitutions are executed first; the indel/dup is done last.
her2_seq = 'LRKVKVLGSGAFGTVYKGIWIPDGENVKIPVAIKVLRENTSPKANKEILDEAYVMAGVGSPYVSRLLGICLTSTVQLVTQLMPYGCLLDHVRENRGRLGSQDLLNWCMQIAKGMSYLEDVRLVHRDLAARNVLVKSPNHVKITDFGLARLLDIDETEYHADGGKVPIKWMALESILRRRFTHQSDVWSYGVTVWELMTFGAKPYDGIPAREIPDLLEKGERLPQPPICTIDVYMIMVKCWMIDSECRPRFRELVSEFSRMARDPQRFV'
mut_seq = apply_mut_complex(her2_seq,'G776delinsVC/S783C',720)
mut_seq
'LRKVKVLGSGAFGTVYKGIWIPDGENVKIPVAIKVLRENTSPKANKEILDEAYVMAVCVGSPYVCRLLGICLTSTVQLVTQLMPYGCLLDHVRENRGRLGSQDLLNWCMQIAKGMSYLEDVRLVHRDLAARNVLVKSPNHVKITDFGLARLLDIDETEYHADGGKVPIKWMALESILRRRFTHQSDVWSYGVTVWELMTFGAKPYDGIPAREIPDLLEKGERLPQPPICTIDVYMIMVKCWMIDSECRPRFRELVSEFSRMARDPQRFV'
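The ordering rule above exists because a delins or dup changes the sequence length and shifts every downstream index, whereas point substitutions do not. A minimal sketch of that logic follows. It assumes single-residue `delins` (e.g. `G776delinsVC`) and reads `G778dupGSP` as "the residues GSP starting at 778 are duplicated in place"; both the function name `apply_composite` and that reading of the dup notation are assumptions of this example.

```python
import re

_SUB = re.compile(r"^([A-Z])(\d+)([A-Z])$")
_DELINS = re.compile(r"^([A-Z])(\d+)delins([A-Z]+)$")
_DUP = re.compile(r"^([A-Z])(\d+)dup([A-Z]+)$")


def apply_composite(seq: str, mut: str, start_pos: int = 1) -> str:
    """Apply a '/'-separated composite mutation; seq[0] is residue number start_pos."""
    parts = [p.strip() for p in mut.split("/") if p.strip()]
    subs = [p for p in parts if _SUB.match(p)]
    indels = [p for p in parts if _DELINS.match(p) or _DUP.match(p)]
    unknown = [p for p in parts if p not in subs and p not in indels]
    if unknown:
        raise ValueError(f"Unsupported mutation(s): {unknown}")
    if len(indels) > 1:
        raise ValueError("At most one delins or dup is allowed")

    def checked_index(ref: str, pos: int, current: str) -> int:
        idx = pos - start_pos
        if not 0 <= idx < len(current) or current[idx] != ref:
            raise ValueError(f"Reference residue {ref}{pos} does not match the sequence")
        return idx

    # 1) Point substitutions first: they keep the length, so numbering stays valid.
    for p in subs:
        ref, pos, alt = _SUB.match(p).groups()
        idx = checked_index(ref, int(pos), seq)
        seq = seq[:idx] + alt + seq[idx + 1:]

    # 2) The length-changing edit last.
    for p in indels:
        m = _DELINS.match(p)
        if m:
            ref, pos, ins = m.groups()
            idx = checked_index(ref, int(pos), seq)
            seq = seq[:idx] + ins + seq[idx + 1:]
        else:
            ref, pos, dup = _DUP.match(p).groups()
            idx = checked_index(ref, int(pos), seq)
            if seq[idx:idx + len(dup)] != dup:
                raise ValueError(f"Duplicated segment {dup} not found at {pos}")
            end = idx + len(dup)
            seq = seq[:end] + dup + seq[end:]
    return seq


# Residues 770-786 of the HER2 kinase domain shown above.
fragment = "EAYVMAGVGSPYVSRLL"
print(apply_composite(fragment, "G776delinsVC/S783C", start_pos=770))
```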

compare_seq


def compare_seq(
    seq1:str, seq2:str, start_pos:int=1, label1:str='Original', label2:str='Mutant', visualize:bool=True,
    return_text:bool=False
):

Align two protein sequences and summarise their differences. By default the formatted alignment is displayed; pass return_text=True to get it back as a string.

compare_seq(her2_seq,mut_seq)
Original       1-79   : LRKVKVLGSGAFGTVYKGIWIPDGENVKIPVAIKVLRENTSPKANKEILDEAYVMA-GVGSPYVSRLLGICLTSTVQLVT
Mutant                : LRKVKVLGSGAFGTVYKGIWIPDGENVKIPVAIKVLRENTSPKANKEILDEAYVMAVCVGSPYVCRLLGICLTSTVQLVT
                                                                                ^^      ^               

Original      80-159  : QLMPYGCLLDHVRENRGRLGSQDLLNWCMQIAKGMSYLEDVRLVHRDLAARNVLVKSPNHVKITDFGLARLLDIDETEYH
Mutant                : QLMPYGCLLDHVRENRGRLGSQDLLNWCMQIAKGMSYLEDVRLVHRDLAARNVLVKSPNHVKITDFGLARLLDIDETEYH
                                                                                                        

Original     160-239  : ADGGKVPIKWMALESILRRRFTHQSDVWSYGVTVWELMTFGAKPYDGIPAREIPDLLEKGERLPQPPICTIDVYMIMVKC
Mutant                : ADGGKVPIKWMALESILRRRFTHQSDVWSYGVTVWELMTFGAKPYDGIPAREIPDLLEKGERLPQPPICTIDVYMIMVKC
                                                                                                        

Original     240-268  : WMIDSECRPRFRELVSEFSRMARDPQRFV
Mutant                : WMIDSECRPRFRELVSEFSRMARDPQRFV
                                                     

Differences:
  insertion    at   57: - → V
  substitution at   57: G → C
  substitution at   64: S → C
print(compare_seq(her2_seq,mut_seq,return_text=True))
Original       1-79   : LRKVKVLGSGAFGTVYKGIWIPDGENVKIPVAIKVLRENTSPKANKEILDEAYVMA-GVGSPYVSRLLGICLTSTVQLVT
Mutant                : LRKVKVLGSGAFGTVYKGIWIPDGENVKIPVAIKVLRENTSPKANKEILDEAYVMAVCVGSPYVCRLLGICLTSTVQLVT
                                                                                ^^      ^               

Original      80-159  : QLMPYGCLLDHVRENRGRLGSQDLLNWCMQIAKGMSYLEDVRLVHRDLAARNVLVKSPNHVKITDFGLARLLDIDETEYH
Mutant                : QLMPYGCLLDHVRENRGRLGSQDLLNWCMQIAKGMSYLEDVRLVHRDLAARNVLVKSPNHVKITDFGLARLLDIDETEYH
                                                                                                        

Original     160-239  : ADGGKVPIKWMALESILRRRFTHQSDVWSYGVTVWELMTFGAKPYDGIPAREIPDLLEKGERLPQPPICTIDVYMIMVKC
Mutant                : ADGGKVPIKWMALESILRRRFTHQSDVWSYGVTVWELMTFGAKPYDGIPAREIPDLLEKGERLPQPPICTIDVYMIMVKC
                                                                                                        

Original     240-268  : WMIDSECRPRFRELVSEFSRMARDPQRFV
Mutant                : WMIDSECRPRFRELVSEFSRMARDPQRFV
                                                     

Differences:
  insertion    at   57: - → V
  substitution at   57: G → C
  substitution at   64: S → C
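For a quick programmatic list of differences without the formatted layout, Python's standard `difflib.SequenceMatcher` can serve as a stand-in. This is not a biological alignment algorithm and not necessarily what `compare_seq` uses; the function name `list_differences` is hypothetical. Unlike the report above, it groups a delins into a single `replace` entry rather than an insertion plus a substitution.

```python
from difflib import SequenceMatcher


def list_differences(seq1: str, seq2: str, start_pos: int = 1) -> list:
    """Return (tag, position, original, mutant) tuples for every non-matching block.

    position is the residue number in seq1, given that seq1[0] is residue start_pos.
    tag is one of difflib's opcodes: 'replace', 'delete', or 'insert'.
    """
    matcher = SequenceMatcher(None, seq1, seq2, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        diffs.append((tag, start_pos + i1, seq1[i1:i2], seq2[j1:j2]))
    return diffs


# Residues 770-783 of the HER2 kinase domain, before and after G776delinsVC/S783C.
original = "EAYVMAGVGSPYVS"
mutant = "EAYVMAVCVGSPYVC"
for diff in list_differences(original, mutant, start_pos=770):
    print(diff)
```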

End