``` {.python .cell-code}
from kdock.data.core import *
from kdock.data.protein_pairs import *
import pandas as pd
from tqdm import tqdm
```

# Protein pairs
## Setup

```bash
pip install "colabfold[alphafold] @ git+https://github.com/sokrypton/ColabFold"
```
## MSA
The MSA step can run on a CPU-only server.

Prepare a CSV file whose first column is `id` and whose second column is `sequence`, holding the amino acid sequence.
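A minimal sketch of building such a file with pandas. The ids and sequences here (`protA`, `protB`) are made up for illustration, and `a.csv` is only an example file name.

```python
import pandas as pd

# Hypothetical ids and sequences, for illustration only
records = [
    {"id": "protA", "sequence": "MKTAYIAKQR"},
    {"id": "protB", "sequence": "MSDNELLQKA"},
]

# Keep the column order: `id` first, `sequence` second
msa_input = pd.DataFrame(records, columns=["id", "sequence"])
msa_input.to_csv("a.csv", index=False)
```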
::: {#0bea2a4a-965d-4107-ad6a-21b5c12a0ad8 .cell}
``` {.python .cell-code}
project_name='sdf'
```
:::

``` {.python .cell-code}
get_colabfold_cmd('a.csv',project_name)
```

Run below in terminal:

```bash
colabfold_batch a.csv msa_sdf --msa-only
```
After it finishes, copy the a3m files to a machine with a GPU:

``` {.python .cell-code}
copy_a3m(a3m_dir=f'/teamspace/studios/alphfold3/msa_{project_name}',
         dest_dir=f'af_input/{project_name}/msa')
```

## JSON file
Read the file that contains the ids and sequences:

``` {.python .cell-code}
df = pd.read_csv('file.csv')
protein_list = df['gene_id'].tolist()
df = generate_pair_df(protein_list)
for idx, row in tqdm(df.iterrows(),total=len(df)):
    json_data = get_multi_protein_json([row['Gene1'], row['Gene2']],
                                       a3m_dir=f'af_input/{project_name}/a3m',
                                       save_folder=f'af_input/{project_name}')
```

This will generate a number of JSON files in `save_folder`.
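To get a feel for how many JSON files the loop above produces, the pairing step can be sketched as follows. This is an assumption about what `generate_pair_df` does, not its actual implementation: `sketch_pair_df` is a hypothetical helper that enumerates every unordered pair of distinct proteins and reuses the `Gene1`/`Gene2` column names from the loop.

```python
from itertools import combinations

import pandas as pd


def sketch_pair_df(proteins):
    """Enumerate every unordered pair of distinct proteins (hypothetical helper)."""
    pairs = list(combinations(proteins, 2))
    return pd.DataFrame(pairs, columns=["Gene1", "Gene2"])


pair_df = sketch_pair_df(["A", "B", "C", "D"])
print(len(pair_df))  # 4 proteins -> 6 pairs
```

Under this assumption the number of pairs grows as n(n-1)/2, so 100 proteins would already give 4950 pairs.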
When multiple GPUs are available, distribute the JSON files across `n` folders so that each GPU can process one folder in parallel.
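As a rough illustration of what such a split involves, the sketch below moves JSON files round-robin into `folder_0` ... `folder_{n-1}`. `sketch_split` is a hypothetical helper written for this example; whether kdock's own function moves or copies files, and in what order it assigns them, is an assumption.

```python
import shutil
import tempfile
from pathlib import Path


def sketch_split(json_dir, n=4):
    """Move the JSON files in json_dir into folder_0 ... folder_{n-1}, round-robin."""
    json_dir = Path(json_dir)
    files = sorted(json_dir.glob("*.json"))  # top-level files only
    for i in range(n):
        (json_dir / f"folder_{i}").mkdir(exist_ok=True)
    for idx, path in enumerate(files):
        target = json_dir / f"folder_{idx % n}" / path.name
        shutil.move(str(path), str(target))
    return [json_dir / f"folder_{i}" for i in range(n)]


# Demo on a temporary directory with 10 empty JSON files
with tempfile.TemporaryDirectory() as tmp:
    for k in range(10):
        (Path(tmp) / f"pair_{k}.json").write_text("{}")
    folders = sketch_split(tmp, n=4)
    counts = [len(list(f.glob("*.json"))) for f in folders]

print(counts)  # [3, 3, 2, 2]
```

A round-robin assignment keeps the folder sizes within one file of each other, so no GPU sits idle much longer than the rest.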
``` {.python .cell-code}
split_nfolder(f'af_input/{project_name}',n=4) # default n is 4
```

## Docker Command

```bash
docker pull sky1ove/alphafold3
```

``` {.python .cell-code}
for i in range(4):
    docker_multi_full(input_dir=f"af_input/{project_name}/folder_{i}",
                      output_dir=f"af_output/{project_name}",
                      gpus=i)
```

Run the printed command in your terminal.
## Report for protein pairs

``` {.python .cell-code}
df_sum, top_genes = get_report(f"af_output/{project_name}",
                               save_dir=f'af_report/{project_name}')
df_sum.sort_values('iptm_ptm_rnk_add').head(10)
```

A 3D plot is generated with x=`iptm`, y=`ptm`, z=`chain_pair_pae_min_add`.

Top genes are:

- Smallest 30 from `iptm_ptm_rnk_add`, `chain_pair_pae_min_add`, `chain_pair_pae_min_0_1`, `chain_pair_pae_min_1_0`, `iptm_pae_add_rnk`
- Largest 30 from `ranking_score`, `iptm`, `iptm_ptm_add`

`df_sum` contains the score for each metric.
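The selection rule above can be sketched with pandas `nsmallest` and `nlargest`. This is an illustration under assumptions, not the code behind `get_report`: `sketch_top_pairs` is a hypothetical helper, the results are assumed to be combined as a union, only a subset of the metrics is used, and the toy scores are made up.

```python
import pandas as pd

# Subset of the metrics listed above, grouped by which direction is better
SMALLER_IS_BETTER = ["iptm_ptm_rnk_add", "chain_pair_pae_min_add"]
LARGER_IS_BETTER = ["ranking_score", "iptm"]


def sketch_top_pairs(scores, k=30):
    """Union of the k best rows per metric (hypothetical helper)."""
    selected = set()
    for col in SMALLER_IS_BETTER:
        selected.update(scores.nsmallest(k, col).index)
    for col in LARGER_IS_BETTER:
        selected.update(scores.nlargest(k, col).index)
    return sorted(selected)


# Made-up scores for four pairs, indexed by a hypothetical pair name
toy = pd.DataFrame(
    {
        "iptm_ptm_rnk_add": [2, 4, 6, 8],
        "chain_pair_pae_min_add": [5.0, 3.0, 9.0, 12.0],
        "ranking_score": [0.7, 0.6, 0.9, 0.3],
        "iptm": [0.8, 0.5, 0.4, 0.2],
    },
    index=["A_B", "A_C", "B_C", "C_D"],
)

top = sketch_top_pairs(toy, k=1)
print(top)  # ['A_B', 'A_C', 'B_C']
```

Taking a union means a pair only needs to rank well on one metric to be kept, which is why `C_D`, the worst on every metric here, is the only pair left out.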
Copy top protein structures to a local folder:

``` {.python .cell-code}
from fastcore.utils import L
copy_file('proA_proB',source_dir='af_output/proA',dest_dir='af_top')
# Or
L(top_genes).map(copy_file,source_dir='af_output/proA',dest_dir='af_top')
```

## Embeddings
To do