Protein pairs & ColabFold pipeline
We use the ColabFold MSA pipeline for protein pairs because it takes less time than the default AF3 MSA pipeline.
project_name='sdf'
Setup
docker pull sky1ove/alphafold3
Protein pairs
Protein-protein screening involves many proteins, so the default AF3 MSA pipeline takes a long time. We therefore use the ColabFold MSA pipeline instead.
get_colabfold_cmd
get_colabfold_cmd (csv_path, project_name)
get_colabfold_cmd('sdf.csv', project_name)
Run the command below in a terminal:
colabfold_batch sdf.csv msa_sdf --msa-only
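The command above has a simple shape: the CSV path, an output directory, and the `--msa-only` flag. A minimal sketch of how such a string could be assembled follows. `build_colabfold_cmd` is a hypothetical stand-in, not the actual implementation of `get_colabfold_cmd`, and the `msa_<project_name>` directory naming is an assumption inferred from the example output.

```python
def build_colabfold_cmd(csv_path: str, project_name: str) -> str:
    """Hypothetical helper: build an MSA-only colabfold_batch command string."""
    # Assumption: the output directory is named msa_<project_name>,
    # as suggested by the example command in the text.
    return f"colabfold_batch {csv_path} msa_{project_name} --msa-only"


cmd = build_colabfold_cmd("sdf.csv", "sdf")
print(cmd)
```

Returning the command as a string, rather than executing it, keeps the MSA step a manual terminal action, which matches how the page describes it.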
MSA
copy_a3m
copy_a3m (a3m_dir:str, dest_dir:str)
Copies all .a3m files from the source directory to the destination directory.
| | Type | Details |
|---|---|---|
| a3m_dir | str | Path to the source directory containing .a3m files. |
| dest_dir | str | Path to the destination directory where files will be copied. |
copy_a3m(a3m_dir='data', dest_dir='af_input')
Copying files: 100%|██████████| 1/1 [00:00<00:00, 637.53file/s]
Copied 1 a3m files from data to af_input
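For readers who want to see what the copy step amounts to, here is a minimal sketch using only the standard library. `copy_a3m_sketch` is a hypothetical name, and the sketch omits the progress bar shown in the output above; it is not the library's own `copy_a3m`.

```python
import shutil
import tempfile
from pathlib import Path


def copy_a3m_sketch(a3m_dir, dest_dir) -> int:
    """Copy every .a3m file in a3m_dir into dest_dir; return how many were copied."""
    src, dst = Path(a3m_dir), Path(dest_dir)
    dst.mkdir(parents=True, exist_ok=True)  # create the destination if missing
    files = sorted(src.glob("*.a3m"))       # only .a3m files, non-recursive
    for f in files:
        shutil.copy(f, dst / f.name)
    return len(files)


# Usage with throwaway directories
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "data"
    src.mkdir()
    (src / "CD8A.a3m").write_text(">101\nSQFRV\n")
    (src / "notes.txt").write_text("ignored")  # not an .a3m file, so not copied
    n_copied = copy_a3m_sketch(src, Path(tmp) / "af_input")
    copied_names = sorted(p.name for p in (Path(tmp) / "af_input").iterdir())
```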
Protein-protein input
Important
Make sure the a3m files are under af_input; otherwise they will not be detected.
a3m_to_seq
a3m_to_seq (file_path:pathlib.Path)
Get protein sequence from a3m file
a3m_to_seq(Path(f'af_input/{project_name}/a3m/CD8A.a3m'))
'SQFRVSPLDRTWNLGETVELKCQVLLSNPTSGCSWLFQPRGAAASPTFLLYLSQNKPKAAEGLDTQRFSGKRLGDTFVLTLSDFRRENEGYYFCSALSNSIMYFSHFVPVFLPAKPTTTPAPRPPTPAPTIASQPLSLRPEACRPAAGGAVHTRGLDFACD'
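Reading the query sequence back from an a3m file can be sketched as below. This assumes the layout commonly produced by ColabFold: an optional leading `#` line, then FASTA-like records whose first record is the query. `first_sequence_from_a3m` is a hypothetical name and may differ from how `a3m_to_seq` is actually written.

```python
import tempfile
from pathlib import Path


def first_sequence_from_a3m(file_path) -> str:
    """Return the sequence of the first record (assumed to be the query)."""
    seq_lines = []
    in_first_record = False
    for line in Path(file_path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the leading '#' line
        if line.startswith(">"):
            if in_first_record:
                break  # second header reached, so the query is complete
            in_first_record = True
            continue
        if in_first_record:
            seq_lines.append(line)  # sequences may wrap over several lines
    return "".join(seq_lines)


# Usage with a toy file
with tempfile.TemporaryDirectory() as tmp:
    a3m = Path(tmp) / "toy.a3m"
    a3m.write_text("#5\t1\n>101\nSQF\nRV\n>hit_1\nSQ-RV\n")
    query = first_sequence_from_a3m(a3m)
```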
get_protein_subjson
get_protein_subjson (gene_name, a3m_dir='.', idx='A', run_template=True)
Get subjson (protein part) with colabfold unpairedMSA .a3m path
sub_json = get_protein_subjson('CD8A', a3m_dir=f'af_input/{project_name}/a3m')
sub_json
{'id': 'A',
'sequence': 'SQFRVSPLDRTWNLGETVELKCQVLLSNPTSGCSWLFQPRGAAASPTFLLYLSQNKPKAAEGLDTQRFSGKRLGDTFVLTLSDFRRENEGYYFCSALSNSIMYFSHFVPVFLPAKPTTTPAPRPPTPAPTIASQPLSLRPEACRPAAGGAVHTRGLDFACD',
'modifications': [],
'unpairedMsaPath': '/root/af_input/sdf/a3m/CD8A.a3m',
'pairedMsa': '',
'templates': None}
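The dictionary above can be reproduced with a small builder. The sketch below is illustrative only: `protein_entry` is a hypothetical name, and the mapping of `run_template` to `None` versus an empty list is an assumption based on the output shown (where `run_template=True` gives `'templates': None`).

```python
def protein_entry(sequence: str, msa_path: str, idx: str = "A",
                  run_template: bool = True) -> dict:
    """Hypothetical builder for the protein part of an AF3 input."""
    return {
        "id": idx,
        "sequence": sequence,
        "modifications": [],
        "unpairedMsaPath": msa_path,  # path to the ColabFold .a3m file
        "pairedMsa": "",
        # Assumption: None leaves template search to AF3, [] turns templates off
        "templates": None if run_template else [],
    }


entry = protein_entry("SQFRV", "/root/af_input/sdf/a3m/CD8A.a3m")
entry_no_template = protein_entry("SQFRV", "/root/af_input/sdf/a3m/CD8A.a3m",
                                  idx="B", run_template=False)
```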
dump_json_folder
dump_json_folder (json_data, folder)
Save json under a folder
get_multi_protein_json
get_multi_protein_json (gene_list, a3m_dir, run_template=True, save_folder=None)
Get json of multiple proteins, with unpaired MSA path indicated (from colabfold MSA)
AF_input = get_multi_protein_json(['CD8A','CD8A'],
                                  a3m_dir=f'af_input/{project_name}/a3m',
                                  save_folder=f'af_input/{project_name}')
You can generate a list of json files under a folder.
AF_input.keys(), len(AF_input['sequences'])
(dict_keys(['name', 'modelSeeds', 'sequences', 'bondedAtomPairs', 'dialect', 'version']),
 2)
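To make the top-level keys listed above concrete, here is a rough sketch of how a two-protein input could be assembled. Everything beyond the key names is an assumption: the job name built by joining gene names, the seed value, the empty `bondedAtomPairs`, the `version` number, and the helper name `multi_protein_input` are all hypothetical and may not match what `get_multi_protein_json` produces.

```python
import json
from string import ascii_uppercase


def multi_protein_input(gene_list, sequences, a3m_dir, seeds=(1,)):
    """Hypothetical assembler for a multi-protein AF3 input dict."""
    entries = []
    for chain_id, gene, seq in zip(ascii_uppercase, gene_list, sequences):
        entries.append({"protein": {
            "id": chain_id,                            # chains labelled A, B, ...
            "sequence": seq,
            "modifications": [],
            "unpairedMsaPath": f"{a3m_dir}/{gene}.a3m",
            "pairedMsa": "",
            "templates": None,
        }})
    return {
        "name": "_".join(gene_list),  # assumed naming scheme
        "modelSeeds": list(seeds),    # assumed seed
        "sequences": entries,
        "bondedAtomPairs": [],        # assumed: no bonded atom pairs
        "dialect": "alphafold3",
        "version": 2,                 # assumed input format version
    }


af_input = multi_protein_input(["CD8A", "CD8A"], ["SQFRV", "SQFRV"],
                               a3m_dir="/root/af_input/sdf/a3m")
as_text = json.dumps(af_input, indent=2)  # what would be written to disk
```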
generate_pair_df
generate_pair_df (gene_list, self_pair=True)
Generate unique gene pairs from a gene list
generate_pair_df(list('ABC'))
| | Gene1 | Gene2 |
|---|---|---|
| 0 | A | B |
| 1 | A | C |
| 2 | B | C |
| 3 | A | A |
| 4 | B | B |
| 5 | C | C |
df = generate_pair_df(['CD8A'])
df
| | Gene1 | Gene2 |
|---|---|---|
| 0 | CD8A | CD8A |
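The pairing shown in the tables above (all unordered pairs first, then self-pairs) can be sketched with `itertools.combinations`. `unique_pairs` is a hypothetical helper that returns plain tuples rather than a DataFrame; the real `generate_pair_df` may be implemented differently.

```python
from itertools import combinations


def unique_pairs(gene_list, self_pair=True):
    """Return unordered gene pairs, optionally followed by self-pairs."""
    genes = list(dict.fromkeys(gene_list))  # drop duplicates, keep order
    pairs = list(combinations(genes, 2))    # (A, B), (A, C), (B, C), ...
    if self_pair:
        pairs += [(g, g) for g in genes]    # (A, A), (B, B), ...
    return pairs


pairs_abc = unique_pairs(list("ABC"))
pairs_single = unique_pairs(["CD8A"])
```

For n distinct genes this yields n(n-1)/2 cross pairs plus n self-pairs, so the number of AF3 jobs grows quadratically with the gene list.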
Generate json files first:
for idx, row in tqdm(df.iterrows(), total=len(df)):
    json_data = get_multi_protein_json([row['Gene1'], row['Gene2']],
                                       a3m_dir=f'af_input/{project_name}/a3m',
                                       save_folder=f'af_input/{project_name}')
100%|██████████| 1/1 [00:00<00:00, 147.81it/s]
Split them into subfolders:
split_nfolder(f'af_input/{project_name}')
Distributed 1 files into 4 folders.
Docker
Todo: Pair proteins
for i in range(4):
    get_docker_command(input_dir=f"af_input/{project_name}/folder_{i}",
                       output_dir=f"af_output/{project_name}",
                       gpus=i)
End
Utils
# #| export
# def split_files_into_subfolders(input_folder: str, nfolder: int = 4):
#     "Splits `.a3m` files in a folder into subfolders (folder_0, folder_1, ..., folder_N)."
#     input_path = Path(input_folder)
#     if not input_path.is_dir():
#         raise ValueError(f"Input folder {input_folder} does not exist or is not a directory.")
#     # List all `.a3m` files
#     a3m_files = sorted(input_path.glob("*.a3m"))
#     if not a3m_files:
#         print("No `.a3m` files found in the input folder.")
#         return
#     # Create the subfolders
#     subfolders = [input_path / f"folder_{i}" for i in range(nfolder)]
#     for folder in subfolders:
#         folder.mkdir(exist_ok=True)
#     # Distribute the files into the subfolders
#     for idx, file in enumerate(a3m_files):
#         target_folder = subfolders[idx % nfolder]
#         shutil.move(str(file), target_folder / file.name)
#     print(f"Distributed {len(a3m_files)} files into {nfolder} folders.")