project_name='sdf'
Protein pairs & ColabFold pipeline
We use the ColabFold MSA step for the protein-pair pipeline because it is faster than the default AF3 MSA pipeline.
Setup
docker pull sky1ove/alphafold3
Protein pairs
Protein-protein screening involves many proteins, so the default AF3 MSA pipeline takes a long time. We therefore use the ColabFold MSA pipeline instead.
get_colabfold_cmd
get_colabfold_cmd (csv_path, project_name)
get_colabfold_cmd('sdf.csv',project_name)
Run the command below in a terminal:
colabfold_batch sdf.csv msa_sdf --msa-only
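The printed command follows a simple pattern. The sketch below is only an assumption about how such a string could be assembled, not the package's implementation: `build_colabfold_cmd` is a hypothetical name, and the `msa_<project_name>` output folder is inferred from the example command above.

```python
def build_colabfold_cmd(csv_path: str, project_name: str) -> str:
    """Hypothetical helper that mirrors the command shown above."""
    # The output folder name msa_<project_name> is inferred from the example,
    # and --msa-only is the flag used in that example.
    return f"colabfold_batch {csv_path} msa_{project_name} --msa-only"


print(build_colabfold_cmd("sdf.csv", "sdf"))
```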
MSA
copy_a3m
copy_a3m (a3m_dir:str, dest_dir:str)
Copies all .a3m files from the source directory to the destination directory.
| | Type | Details |
|---|---|---|
| a3m_dir | str | Path to the source directory containing .a3m files. |
| dest_dir | str | Path to the destination directory where files will be copied. |
copy_a3m(a3m_dir='data',dest_dir='af_input')
Copying files: 100%|██████████| 1/1 [00:00<00:00, 637.53file/s]
Copied 1 a3m files from data to af_input
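For readers who want to see what this step amounts to, here is a minimal sketch using only the standard library. It is an assumption about the behaviour, not the actual `copy_a3m` source: the name `copy_a3m_sketch` is hypothetical, it only looks at the top level of the source directory, and it omits the progress bar.

```python
import shutil
from pathlib import Path


def copy_a3m_sketch(a3m_dir: str, dest_dir: str) -> int:
    """Copy every .a3m file from a3m_dir into dest_dir and return the count."""
    src, dest = Path(a3m_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the destination if missing
    files = sorted(src.glob("*.a3m"))        # top-level .a3m files only
    for f in files:
        shutil.copy2(f, dest / f.name)       # copy2 also preserves timestamps
    print(f"Copied {len(files)} a3m files from {src} to {dest}")
    return len(files)
```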
Protein-protein input
Important
Make sure the a3m files are under af_input; otherwise they will not be detected.
a3m_to_seq
a3m_to_seq (file_path:pathlib.Path)
Get protein sequence from a3m file
a3m_to_seq(Path(f'af_input/{project_name}/a3m/CD8A.a3m'))
'SQFRVSPLDRTWNLGETVELKCQVLLSNPTSGCSWLFQPRGAAASPTFLLYLSQNKPKAAEGLDTQRFSGKRLGDTFVLTLSDFRRENEGYYFCSALSNSIMYFSHFVPVFLPAKPTTTPAPRPPTPAPTIASQPLSLRPEACRPAAGGAVHTRGLDFACD'
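A rough sketch of how the query sequence can be read from an a3m file is shown below. It assumes the query is the first record in the file and that any leading lines starting with `#` are metadata to skip. `first_sequence_from_a3m` is a hypothetical name, and the real `a3m_to_seq` may handle more cases.

```python
from pathlib import Path


def first_sequence_from_a3m(file_path: Path) -> str:
    """Return the sequence of the first record in an a3m file."""
    seq_lines = []
    in_first_record = False
    with open(file_path) as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue              # skip blank lines and metadata lines
            if line.startswith(">"):
                if in_first_record:
                    break             # second header reached: query is complete
                in_first_record = True
                continue
            if in_first_record:
                seq_lines.append(line)
    return "".join(seq_lines)
```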
get_protein_subjson
get_protein_subjson (gene_name, a3m_dir='.', idx='A', run_template=True)
Get the protein sub-JSON, including the path to the ColabFold unpaired MSA (.a3m) file
sub_json = get_protein_subjson('CD8A',a3m_dir=f'af_input/{project_name}/a3m')
sub_json
{'id': 'A',
'sequence': 'SQFRVSPLDRTWNLGETVELKCQVLLSNPTSGCSWLFQPRGAAASPTFLLYLSQNKPKAAEGLDTQRFSGKRLGDTFVLTLSDFRRENEGYYFCSALSNSIMYFSHFVPVFLPAKPTTTPAPRPPTPAPTIASQPLSLRPEACRPAAGGAVHTRGLDFACD',
'modifications': [],
'unpairedMsaPath': '/root/af_input/sdf/a3m/CD8A.a3m',
'pairedMsa': '',
'templates': None}
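The dictionary above can be reproduced with a few lines of Python. The sketch below is an assumption, not the package source: `protein_subjson_sketch` and its `container_root` parameter are hypothetical, the `/root` prefix is inferred from the example output (it looks like a path inside the Docker container), and the mapping of `run_template=False` to an empty `templates` list follows my reading of the AlphaFold 3 input format, where an unset value requests a template search and an empty list skips it.

```python
from pathlib import PurePosixPath


def protein_subjson_sketch(gene_name, sequence, a3m_dir=".", idx="A",
                           run_template=True, container_root="/root"):
    """Build the protein part of an AF3 input entry (illustrative only)."""
    # Assumption: the MSA path is resolved relative to a container root.
    msa_path = PurePosixPath(container_root) / a3m_dir / f"{gene_name}.a3m"
    return {
        "id": idx,
        "sequence": sequence,
        "modifications": [],
        "unpairedMsaPath": str(msa_path),
        "pairedMsa": "",  # empty string, as in the example output
        # Assumption: None lets AF3 search templates, [] disables the search.
        "templates": None if run_template else [],
    }


entry = protein_subjson_sketch("CD8A", "SQFRV", a3m_dir="af_input/sdf/a3m")
print(entry["unpairedMsaPath"])
```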
dump_json_folder
dump_json_folder (json_data, folder)
Save json under a folder
get_multi_protein_json
get_multi_protein_json (gene_list, a3m_dir, run_template=True, save_folder=None)
Get json of multiple proteins, with unpaired MSA path indicated (from colabfold MSA)
AF_input = get_multi_protein_json(['CD8A','CD8A'],
                                  a3m_dir=f'af_input/{project_name}/a3m',
                                  save_folder=f'af_input/{project_name}')
You can generate a list of json files under a folder.
AF_input.keys(), len(AF_input['sequences'])
(dict_keys(['name', 'modelSeeds', 'sequences', 'bondedAtomPairs', 'dialect', 'version']),
2)
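To make the top-level structure concrete, here is a minimal sketch that wraps protein entries into a dictionary with the same six keys. It is an assumption rather than the package source: `multi_protein_json_sketch` is a hypothetical name, and the seed, the empty `bondedAtomPairs`, and the `dialect` and `version` values are illustrative choices based on my understanding of the AlphaFold 3 input format.

```python
import string


def multi_protein_json_sketch(name, protein_entries, seeds=(1,)):
    """Wrap protein sub-entries into an AF3-style top-level dict (illustrative)."""
    sequences = []
    for chain_id, entry in zip(string.ascii_uppercase, protein_entries):
        # Give each chain its own id (A, B, ...) without mutating the input.
        sequences.append({"protein": {**entry, "id": chain_id}})
    return {
        "name": name,
        "modelSeeds": list(seeds),
        "sequences": sequences,
        "bondedAtomPairs": [],    # assumption: no bonded atom pairs
        "dialect": "alphafold3",  # assumption: AF3 dialect string
        "version": 2,             # assumption: a version that accepts MSA paths
    }


entries = [{"id": "A", "sequence": "SQFRV"}, {"id": "A", "sequence": "SQFRV"}]
job = multi_protein_json_sketch("CD8A_CD8A", entries)
print(list(job.keys()), len(job["sequences"]))
```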
generate_pair_df
generate_pair_df (gene_list, self_pair=True)
Generate unique gene pairs from a gene list
generate_pair_df(list('ABC'))
| | Gene1 | Gene2 |
|---|---|---|
| 0 | A | B |
| 1 | A | C |
| 2 | B | C |
| 3 | A | A |
| 4 | B | B |
| 5 | C | C |
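The ordering in the table (cross pairs first, then self pairs) is what `itertools.combinations` followed by a list of `(g, g)` tuples would produce. The sketch below illustrates that idea with plain tuples instead of a DataFrame. `unique_pairs` is a hypothetical name and this is not necessarily how `generate_pair_df` is written.

```python
from itertools import combinations


def unique_pairs(gene_list, self_pair=True):
    """Return unordered gene pairs, optionally followed by self pairs."""
    genes = list(dict.fromkeys(gene_list))   # drop duplicates, keep order
    pairs = list(combinations(genes, 2))     # (A, B), (A, C), (B, C), ...
    if self_pair:
        pairs += [(g, g) for g in genes]     # (A, A), (B, B), ...
    return pairs


for gene1, gene2 in unique_pairs(list("ABC")):
    print(gene1, gene2)
```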
df = generate_pair_df(['CD8A'])
df
| | Gene1 | Gene2 |
|---|---|---|
| 0 | CD8A | CD8A |
Generate the json files first:
for idx, row in tqdm(df.iterrows(),total=len(df)):
    json_data = get_multi_protein_json([row['Gene1'], row['Gene2']],
                                       a3m_dir=f'af_input/{project_name}/a3m',
                                       save_folder=f'af_input/{project_name}')
100%|██████████| 1/1 [00:00<00:00, 147.81it/s]
Split them into subfolders:
split_nfolder(f'af_input/{project_name}')
Distributed 1 files into 4 folders.
Docker
Todo: Pair proteins
for i in range(4):
    get_docker_command(input_dir=f"af_input/{project_name}/folder_{i}",
                       output_dir=f"af_output/{project_name}",
                       gpus=i)
End
Utils
# #| export
# def split_files_into_subfolders(input_folder: str, nfolder: int = 4):
#     "Splits `.a3m` files in a folder into subfolders (folder_0, folder_1, ..., folder_N)."
#     input_path = Path(input_folder)
#     if not input_path.is_dir():
#         raise ValueError(f"Input folder {input_folder} does not exist or is not a directory.")
#     # List all `.a3m` files
#     a3m_files = sorted(input_path.glob("*.a3m"))
#     if not a3m_files:
#         print("No `.a3m` files found in the input folder.")
#         return
#     # Create the subfolders
#     subfolders = [input_path / f"folder_{i}" for i in range(nfolder)]
#     for folder in subfolders:
#         folder.mkdir(exist_ok=True)
#     # Distribute the files into the subfolders
#     for idx, file in enumerate(a3m_files):
#         target_folder = subfolders[idx % nfolder]
#         shutil.move(str(file), target_folder / file.name)
#     print(f"Distributed {len(a3m_files)} files into {nfolder} folders.")