from kdock.data.core import *
from kdock.data.protein_pairs import *
import pandas as pd
from tqdm import tqdm
# Protein pairs
## Setup
pip install "colabfold[alphafold] @ git+https://github.com/sokrypton/ColabFold"
## MSA
The MSA step can be run on a CPU-only server.
Prepare a CSV file whose first column is `id` and whose second column is `sequence` (the amino acid sequence).
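A minimal sketch of writing such a file with Python's standard `csv` module. The file name `a.csv` matches the command used later in this section; the two IDs and sequences are made-up placeholders, not real proteins.

```python
import csv

# Placeholder records: (id, amino acid sequence). Replace with real data.
records = [
    ("proA", "MKTAYIAKQR"),
    ("proB", "MSDNELKQAL"),
]

with open("a.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "sequence"])  # header: first column id, second column sequence
    writer.writerows(records)
```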
::: {#0bea2a4a-965d-4107-ad6a-21b5c12a0ad8 .cell}
``` {.python .cell-code}
project_name='sdf'
```
:::
get_colabfold_cmd('a.csv', project_name)
Run the command below in a terminal:
colabfold_batch a.csv msa_sdf --msa-only
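For reference, the command above follows a simple pattern: the input CSV, an output directory named `msa_<project_name>`, and the `--msa-only` flag. The helper below, `build_msa_cmd`, is hypothetical and only reproduces that pattern; it is not kdock's `get_colabfold_cmd` implementation.

```python
def build_msa_cmd(csv_path: str, project_name: str) -> str:
    """Compose a colabfold_batch command that stops after the MSA step."""
    # The output directory naming (msa_<project_name>) is inferred from the
    # example command above; it is an assumption, not a documented rule.
    return f"colabfold_batch {csv_path} msa_{project_name} --msa-only"

print(build_msa_cmd("a.csv", "sdf"))
# colabfold_batch a.csv msa_sdf --msa-only
```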
After it finishes, copy the a3m files to a machine where a GPU is available:
copy_a3m(a3m_dir=f'/teamspace/studios/alphfold3/msa_{project_name}',
         dest_dir=f'af_input/{project_name}/msa')
## JSON file
Read the file that contains the IDs and sequences:
df = pd.read_csv('file.csv')
protein_list = df['gene_id'].tolist()
df = generate_pair_df(protein_list)
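As a rough illustration of what pairing involves, the sketch below enumerates every unordered pair from a list of IDs using `itertools.combinations`. It assumes each pair appears once and that self-pairs are excluded; the actual behavior of `generate_pair_df` may differ. The keys `Gene1` and `Gene2` mirror the column names used in this section, `make_pairs` is a hypothetical helper, and the IDs are placeholders.

```python
from itertools import combinations

def make_pairs(ids):
    """Return one dict per unordered pair, keyed Gene1/Gene2."""
    return [{"Gene1": a, "Gene2": b} for a, b in combinations(ids, 2)]

pairs = make_pairs(["proA", "proB", "proC"])
for p in pairs:
    print(p["Gene1"], p["Gene2"])
# proA proB
# proA proC
# proB proC
```

Under this assumption, n IDs yield n(n-1)/2 pairs, so the number of prediction jobs grows quadratically with the size of the protein list.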
for idx, row in tqdm(df.iterrows(),total=len(df)):
    json_data = get_multi_protein_json([row['Gene1'], row['Gene2']],
                                       a3m_dir=f'af_input/{project_name}/a3m',
                                       save_folder=f'af_input/{project_name}')
This writes a set of JSON files to `save_folder`.
When multiple GPUs are available, distribute the files across n subfolders so that they can be processed in parallel.
split_nfolder(f'af_input/{project_name}',n=4) # default n is 4
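One way to picture the split is round-robin assignment: file k goes to subfolder k mod n. The sketch below is a stand-alone illustration under that assumption, using a hypothetical `split_round_robin` helper and a temporary directory of empty placeholder files; `split_nfolder` itself may distribute files differently. The `folder_{i}` naming follows the Docker step in this section.

```python
import shutil
import tempfile
from pathlib import Path

def split_round_robin(folder, n=4):
    """Move the JSON files in `folder` into folder_0 .. folder_{n-1}."""
    folder = Path(folder)
    files = sorted(folder.glob("*.json"))  # sorted for a deterministic assignment
    for k, path in enumerate(files):
        dest = folder / f"folder_{k % n}"
        dest.mkdir(exist_ok=True)
        shutil.move(str(path), str(dest / path.name))

# Demo on a temporary directory with ten empty placeholder files.
tmp = Path(tempfile.mkdtemp())
for k in range(10):
    (tmp / f"pair_{k}.json").touch()

split_round_robin(tmp, n=4)
counts = {d.name: len(list(d.glob("*.json"))) for d in sorted(tmp.iterdir())}
print(counts)
# {'folder_0': 3, 'folder_1': 3, 'folder_2': 2, 'folder_3': 2}
```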
## Docker Command
docker pull sky1ove/alphafold3
for i in range(4):
    docker_multi_full(input_dir=f"af_input/{project_name}/folder_{i}",
                      output_dir=f"af_output/{project_name}",
                      gpus=i)
Run the printed command in your terminal
## Report for protein pairs
df_sum, top_genes = get_report(f"af_output/{project_name}",
                               save_dir=f'af_report/{project_name}')
df_sum.sort_values('iptm_ptm_rnk_add').head(10)
A 3D plot is generated with x=`iptm`, y=`ptm`, z=`chain_pair_pae_min_add`.

Top genes are:

- Smallest 30 from `iptm_ptm_rnk_add`, `chain_pair_pae_min_add`, `chain_pair_pae_min_0_1`, `chain_pair_pae_min_1_0`, `iptm_pae_add_rnk`
- Largest 30 from `ranking_score`, `iptm`, `iptm_ptm_add`

`df_sum` contains the score for each metric.
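The selection rule for top genes can be sketched in plain Python: take the k smallest rows for metrics where lower is better, the k largest for metrics where higher is better, and pool the results. The `top_ids` helper, the row layout, and all scores below are made up for illustration (with k=1 instead of 30 and only two metrics); `get_report` may combine the lists differently.

```python
def top_ids(rows, smallest=(), largest=(), k=30):
    """Pool the IDs that rank in the k best rows of any listed metric."""
    picked = set()
    for metric in smallest:  # lower is better
        ranked = sorted(rows, key=lambda r: r[metric])
        picked.update(r["id"] for r in ranked[:k])
    for metric in largest:   # higher is better
        ranked = sorted(rows, key=lambda r: r[metric], reverse=True)
        picked.update(r["id"] for r in ranked[:k])
    return sorted(picked)

# Made-up scores for three hypothetical pairs.
rows = [
    {"id": "proA_proB", "iptm": 0.82, "chain_pair_pae_min_add": 4.1},
    {"id": "proA_proC", "iptm": 0.35, "chain_pair_pae_min_add": 2.9},
    {"id": "proB_proC", "iptm": 0.51, "chain_pair_pae_min_add": 9.7},
]

print(top_ids(rows, smallest=["chain_pair_pae_min_add"], largest=["iptm"], k=1))
# ['proA_proB', 'proA_proC']
```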
Copy top protein structures to a local folder
from fastcore.utils import L
copy_file('proA_proB',source_dir='af_output/proA',dest_dir='af_top')
# Or
L(top_genes).map(copy_file,source_dir='af_output/proA',dest_dir='af_top')
## Embeddings
To do