# Protein pairs

## Setup

from kdock.data.core import *
from kdock.data.protein_pairs import *
import pandas as pd
from tqdm import tqdm

Install ColabFold, which provides `colabfold_batch` for the MSA step:

pip install "colabfold[alphafold] @ git+https://github.com/sokrypton/ColabFold"

## MSA

The MSA step can be run on a CPU-only server.

Prepare a CSV file whose first column is `id` and whose second column, `sequence`, holds the amino acid sequence.
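
As a minimal sketch of that layout, the snippet below writes a two-column CSV with the standard library. The ids and sequences are hypothetical placeholders, and `write_input_csv` is a helper made up for this example, not part of kdock.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical ids and sequences, for illustration only.
records = [
    ("proA", "MKTAYIAKQRQISFVKSHFSRQ"),
    ("proB", "MSDNGTLEELKKQAEELLKKAG"),
]

def write_input_csv(records, path):
    """Write (id, sequence) tuples to a CSV with the header `id,sequence`."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "sequence"])
        writer.writerows(records)

# Written to a temporary directory here; in practice this would be e.g. a.csv.
csv_path = Path(tempfile.mkdtemp()) / "a.csv"
write_input_csv(records, csv_path)
print(csv_path.read_text())
```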

::: {#0bea2a4a-965d-4107-ad6a-21b5c12a0ad8 .cell}
``` {.python .cell-code}
project_name='sdf'
```
:::

get_colabfold_cmd('a.csv',project_name)

Run the command below in a terminal:

    colabfold_batch a.csv msa_sdf --msa-only

After it finishes, copy the a3m files to a machine with a GPU.

copy_a3m(a3m_dir=f'/teamspace/studios/alphfold3/msa_{project_name}',
         dest_dir=f'af_input/{project_name}/msa')
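
Before moving on, it can help to confirm that every protein has an MSA file. The check below is a sketch that assumes the a3m files are named `<id>.a3m`; `missing_a3m` is a hypothetical helper, not a kdock function, and the demo uses a temporary directory with made-up ids.

```python
import tempfile
from pathlib import Path

def missing_a3m(ids, a3m_dir):
    """Return the ids that have no `<id>.a3m` file in a3m_dir (assumed naming)."""
    a3m_dir = Path(a3m_dir)
    return [i for i in ids if not (a3m_dir / f"{i}.a3m").is_file()]

# Demo with a temporary directory and hypothetical ids.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "proA.a3m").write_text(">proA\nMKT\n")
print(missing_a3m(["proA", "proB"], demo_dir))  # ['proB']
```

An empty list means every id has a matching file.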

## JSON file

Read the CSV file that contains the ids and sequences.

df = pd.read_csv('file.csv')
protein_list = df['id'].tolist()  # the `id` column of the CSV described above
df = generate_pair_df(protein_list)
for idx, row in tqdm(df.iterrows(),total=len(df)):
    # a3m_dir is the folder the a3m files were copied to with copy_a3m
    json_data = get_multi_protein_json([row['Gene1'], row['Gene2']],
                             a3m_dir=f'af_input/{project_name}/msa',
                             save_folder=f'af_input/{project_name}')

This writes the JSON files, one per protein pair, to `save_folder`.
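
To estimate how many JSON files to expect, the sketch below assumes that pairs are enumerated as every unordered combination of two different proteins. Whether `generate_pair_df` also includes self-pairs or both orderings is not stated here, so treat the count as an approximation. The protein names are hypothetical.

```python
from itertools import combinations

def count_pairs(n):
    """Number of unordered pairs of n distinct proteins: n * (n - 1) / 2."""
    return n * (n - 1) // 2

# Hypothetical protein ids, for illustration only.
protein_list = ["proA", "proB", "proC", "proD"]

# Assumption: one pair per unordered combination, no self-pairs.
pairs = list(combinations(protein_list, 2))
print(len(pairs))        # 6
print(count_pairs(100))  # 4950
```

The count grows quadratically with the number of proteins, which is why splitting the work across GPUs matters for larger lists.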

When multiple GPUs are available, distribute the JSON files across `n` subfolders so that they can be run in parallel.

split_nfolder(f'af_input/{project_name}',n=4) # default n is 4
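
For intuition, a round-robin split could look like the sketch below. This is not kdock's implementation of `split_nfolder`. The function name `split_round_robin` is made up, and the `folder_{i}` naming is taken from the Docker step in this document.

```python
import shutil
import tempfile
from pathlib import Path

def split_round_robin(folder, n=4):
    """Move the JSON files in `folder` into folder_0 ... folder_{n-1}, round-robin."""
    folder = Path(folder)
    files = sorted(folder.glob("*.json"))
    for i in range(n):
        (folder / f"folder_{i}").mkdir(exist_ok=True)
    for k, f in enumerate(files):
        shutil.move(str(f), str(folder / f"folder_{k % n}" / f.name))
    # Report how many files ended up in each subfolder.
    return [len(list((folder / f"folder_{i}").glob("*.json"))) for i in range(n)]

# Demo on a temporary directory with 10 empty placeholder JSON files.
demo = Path(tempfile.mkdtemp())
for k in range(10):
    (demo / f"pair_{k}.json").write_text("{}")
counts = split_round_robin(demo, n=4)
print(counts)  # [3, 3, 2, 2]
```

Round-robin assignment keeps the subfolder sizes within one file of each other, so each GPU gets a similar number of jobs.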

## Docker Command

docker pull sky1ove/alphafold3

for i in range(4):  # one subfolder per GPU, matching n=4 above
    docker_multi_full(input_dir=f"af_input/{project_name}/folder_{i}",
                      output_dir=f"af_output/{project_name}",
                      gpus=i)

Run the printed commands in your terminal.

## Report for protein pairs

df_sum, top_genes = get_report(f"af_output/{project_name}",
                               save_dir=f'af_report/{project_name}')

df_sum.sort_values('iptm_ptm_rnk_add').head(10)

A 3D plot is generated with x=`iptm`, y=`ptm`, z=`chain_pair_pae_min_add`.

Top genes are:

- Smallest 30 from `iptm_ptm_rnk_add`, `chain_pair_pae_min_add`, `chain_pair_pae_min_0_1`, `chain_pair_pae_min_1_0`, `iptm_pae_add_rnk`
- Largest 30 from `ranking_score`, `iptm`, `iptm_ptm_add`

`df_sum` contains the score for each metric.
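
The selection rule can be illustrated with a small sketch. It assumes the top genes are the union of the per-metric selections, which may differ from how `get_report` combines them. The `top_ids` helper, the `id` key, and all scores below are hypothetical.

```python
import heapq

def top_ids(rows, smallest_cols, largest_cols, k=30, id_key="id"):
    """Union of ids among the k smallest rows of each column in
    smallest_cols and the k largest rows of each column in largest_cols."""
    selected = set()
    for col in smallest_cols:
        best = heapq.nsmallest(k, rows, key=lambda r: r[col])
        selected.update(r[id_key] for r in best)
    for col in largest_cols:
        best = heapq.nlargest(k, rows, key=lambda r: r[col])
        selected.update(r[id_key] for r in best)
    return sorted(selected)

# Made-up scores for three hypothetical pairs.
rows = [
    {"id": "proA_proB", "iptm": 0.82, "chain_pair_pae_min_add": 3.1},
    {"id": "proA_proC", "iptm": 0.35, "chain_pair_pae_min_add": 18.4},
    {"id": "proB_proC", "iptm": 0.61, "chain_pair_pae_min_add": 2.5},
]
print(top_ids(rows, ["chain_pair_pae_min_add"], ["iptm"], k=1))
# ['proA_proB', 'proB_proC']
```

With `k=1`, the pair with the lowest `chain_pair_pae_min_add` and the pair with the highest `iptm` are both kept, even though neither wins on both metrics.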

## Copy top protein structures to a local folder

from fastcore.utils import L
copy_file('proA_proB',source_dir='af_output/proA',dest_dir='af_top')

# Or copy every pair in top_genes at once
L(top_genes).map(copy_file,source_dir='af_output/proA',dest_dir='af_top')

## Embeddings

To do