Protein pairs & ColabFold pipeline

We use the ColabFold MSA pipeline for protein pairs because it takes less time.

Setup

docker pull sky1ove/alphafold3

Protein pairs

Protein-protein screening involves many proteins, and the default AF3 MSA pipeline takes a long time at that scale, so we use the ColabFold MSA pipeline instead.

get_colabfold_cmd

 get_colabfold_cmd (csv_path, project_name)

project_name='sdf'

get_colabfold_cmd('sdf.csv',project_name)

Run the command below in a terminal:

 colabfold_batch sdf.csv msa_sdf --msa-only
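The source of `get_colabfold_cmd` is not shown here. As a minimal sketch, assuming the helper only formats the command printed above, a stand-in could look like this. `build_colabfold_cmd` is a hypothetical name, and the `msa_<project_name>` output folder is inferred from the example output rather than confirmed.

```python
import shlex


def build_colabfold_cmd(csv_path: str, project_name: str) -> str:
    """Return a colabfold_batch command that only computes MSAs (hypothetical stand-in)."""
    # Assumption: the output folder is named msa_<project_name>, as in the example above.
    out_dir = f"msa_{project_name}"
    # shlex.quote keeps paths containing spaces safe when pasted into a shell.
    return f"colabfold_batch {shlex.quote(csv_path)} {shlex.quote(out_dir)} --msa-only"


print(build_colabfold_cmd("sdf.csv", "sdf"))
# prints: colabfold_batch sdf.csv msa_sdf --msa-only
```

Returning a string instead of running the command keeps the MSA step as a separate terminal job, which matches the workflow described here.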

MSA

copy_a3m

 copy_a3m (a3m_dir:str, dest_dir:str)

Copies all .a3m files from the source directory to the destination directory.

Parameter	Type	Details
a3m_dir	str	Path to the source directory containing .a3m files.
dest_dir	str	Path to the destination directory where files will be copied.

copy_a3m(a3m_dir='data',dest_dir='af_input')

Copying files: 100%|██████████| 1/1 [00:00<00:00, 637.53file/s]

Copied 1 a3m files from data to af_input
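For readers who want to see the idea without the library, the copy step can be sketched with the standard library alone. This is an illustration, not the actual `copy_a3m` implementation: `copy_a3m_sketch` is a hypothetical name, it omits the progress bar, and it assumes only top-level `.a3m` files are copied (no recursion).

```python
import shutil
from pathlib import Path


def copy_a3m_sketch(a3m_dir: str, dest_dir: str) -> int:
    """Copy every .a3m file from a3m_dir into dest_dir and return how many were copied."""
    src, dest = Path(a3m_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the destination if it is missing
    files = sorted(src.glob("*.a3m"))        # assumption: top-level files only
    for f in files:
        shutil.copy2(f, dest / f.name)       # copy2 also preserves file timestamps
    print(f"Copied {len(files)} a3m files from {src} to {dest}")
    return len(files)
```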

Protein-protein input

Important

Make sure the a3m files are under af_input; otherwise they will not be detected.

a3m_to_seq

 a3m_to_seq (file_path:pathlib.Path)

Get the protein sequence from an a3m file

a3m_to_seq(Path(f'af_input/{project_name}/a3m/CD8A.a3m'))

'SQFRVSPLDRTWNLGETVELKCQVLLSNPTSGCSWLFQPRGAAASPTFLLYLSQNKPKAAEGLDTQRFSGKRLGDTFVLTLSDFRRENEGYYFCSALSNSIMYFSHFVPVFLPAKPTTTPAPRPPTPAPTIASQPLSLRPEACRPAAGGAVHTRGLDFACD'
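A rough sketch of how the query sequence can be read from an a3m file, assuming the query is the first record and that any leading lines starting with `#` are comments to skip. `read_query_sequence` is a hypothetical name and may differ from what `a3m_to_seq` actually does.

```python
from pathlib import Path


def read_query_sequence(file_path: Path) -> str:
    """Return the sequence of the first record in an a3m file (hypothetical stand-in)."""
    seq_lines = []
    in_first_record = False
    for line in Path(file_path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and '#' comment lines
        if line.startswith(">"):
            if in_first_record:
                break  # second header reached, so the query record is complete
            in_first_record = True
            continue
        if in_first_record:
            seq_lines.append(line)  # a record may wrap over several lines
    return "".join(seq_lines)
```

Stopping at the second header avoids reading the whole alignment into the result, which matters because only the query is needed for the input JSON.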

get_protein_subjson

 get_protein_subjson (gene_name, a3m_dir='.', idx='A', run_template=True)

Get the protein sub-JSON, with the path to the ColabFold unpaired MSA (.a3m) file

sub_json = get_protein_subjson('CD8A',a3m_dir=f'af_input/{project_name}/a3m')

sub_json

{'id': 'A',
 'sequence': 'SQFRVSPLDRTWNLGETVELKCQVLLSNPTSGCSWLFQPRGAAASPTFLLYLSQNKPKAAEGLDTQRFSGKRLGDTFVLTLSDFRRENEGYYFCSALSNSIMYFSHFVPVFLPAKPTTTPAPRPPTPAPTIASQPLSLRPEACRPAAGGAVHTRGLDFACD',
 'modifications': [],
 'unpairedMsaPath': '/root/af_input/sdf/a3m/CD8A.a3m',
 'pairedMsa': '',
 'templates': None}

dump_json_folder

 dump_json_folder (json_data, folder)

Save json under a folder

get_multi_protein_json

 get_multi_protein_json (gene_list, a3m_dir, run_template=True,
                         save_folder=None)

Get json of multiple proteins, with unpaired MSA path indicated (from colabfold MSA)

AF_input = get_multi_protein_json(['CD8A','CD8A'],
                        a3m_dir=f'af_input/{project_name}/a3m',
                        save_folder=f'af_input/{project_name}')

You can generate a list of json files under a folder.

AF_input.keys(), len(AF_input['sequences'])

(dict_keys(['name', 'modelSeeds', 'sequences', 'bondedAtomPairs', 'dialect', 'version']),
 2)
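To make the structure of this dictionary concrete, the sketch below assembles one with the same top-level keys from (sequence, a3m path) tuples. It is illustrative only: `build_multi_protein_input` is a hypothetical name, and the seed, the empty `bondedAtomPairs`, the `version` number, and the `"protein"` wrapper around each chain are assumptions based on the AlphaFold 3 input format, not values read from this library.

```python
import json
import string


def build_multi_protein_input(name, proteins, seeds=(1,)):
    """Assemble an AF3-style input dict from (sequence, a3m_path) tuples (illustrative only)."""
    sequences = []
    # Chains get IDs A, B, C, ...; zip stops at 26 chains, enough for pair screening.
    for chain_id, (seq, a3m_path) in zip(string.ascii_uppercase, proteins):
        sequences.append({
            "protein": {
                "id": chain_id,
                "sequence": seq,
                "modifications": [],
                "unpairedMsaPath": a3m_path,  # path to the ColabFold .a3m file
                "pairedMsa": "",
                "templates": None,            # mirrors the sub-JSON shown earlier
            }
        })
    return {
        "name": name,
        "modelSeeds": list(seeds),   # assumption: a single seed by default
        "sequences": sequences,
        "bondedAtomPairs": [],       # assumption: no bonded atom pairs
        "dialect": "alphafold3",
        "version": 2,                # assumption: a version that accepts MSA paths
    }


example = build_multi_protein_input(
    "CD8A_CD8A",
    [("SQFRVSPLDRTWNLG", "/root/af_input/sdf/a3m/CD8A.a3m")] * 2,
)
print(json.dumps(example, indent=2)[:200])
```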

generate_pair_df

 generate_pair_df (gene_list, self_pair=True)

Generate all unique gene pairs from a gene list

generate_pair_df(list('ABC'))

	Gene1	Gene2
0	A	B
1	A	C
2	B	C
3	A	A
4	B	B
5	C	C
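The ordering in the table above (cross pairs first, then self pairs) can be reproduced with `itertools.combinations`. This is a sketch of the idea, not the library's code: `unique_pairs` is a hypothetical name, it returns plain tuples instead of a DataFrame, and dropping duplicate gene names is an assumption.

```python
from itertools import combinations


def unique_pairs(gene_list, self_pair=True):
    """Return unordered gene pairs, optionally followed by self pairs."""
    genes = list(dict.fromkeys(gene_list))  # assumption: drop duplicates, keep order
    pairs = list(combinations(genes, 2))    # each unordered pair appears exactly once
    if self_pair:
        pairs += [(g, g) for g in genes]    # homodimer pairs such as (A, A)
    return pairs


print(unique_pairs(list("ABC")))
# prints: [('A', 'B'), ('A', 'C'), ('B', 'C'), ('A', 'A'), ('B', 'B'), ('C', 'C')]
```

For n distinct genes this gives n(n-1)/2 cross pairs plus n self pairs, so the number of AlphaFold jobs grows quadratically with the size of the gene list.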

df = generate_pair_df(['CD8A'])
df

	Gene1	Gene2
0	CD8A	CD8A

Generate the JSON files first:

for idx, row in tqdm(df.iterrows(),total=len(df)):
    json_data = get_multi_protein_json([row['Gene1'], row['Gene2']], 
                             a3m_dir=f'af_input/{project_name}/a3m', 
                             save_folder=f'af_input/{project_name}')

100%|██████████| 1/1 [00:00<00:00, 147.81it/s]

Split them into subfolders:

split_nfolder(f'af_input/{project_name}')

Distributed 1 files into 4 folders.

Docker

Todo: Pair proteins

for i in range(4):
    get_docker_command(input_dir=f"af_input/{project_name}/folder_{i}",
                       output_dir=f"af_output/{project_name}",
                       gpus=i)

End

Utils

# #| export
# def split_files_into_subfolders(input_folder: str, nfolder: int = 4):
    
#     "Splits `.a3m` files in a folder into subfolders (folder_0, folder_1, ..., folder_N)."
    
#     input_path = Path(input_folder)
#     if not input_path.is_dir():
#         raise ValueError(f"Input folder {input_folder} does not exist or is not a directory.")

#     # List all `.a3m` files
#     a3m_files = sorted(input_path.glob("*.a3m"))
#     if not a3m_files:
#         print("No `.a3m` files found in the input folder.")
#         return

#     # Create the subfolders
#     subfolders = [input_path / f"folder_{i}" for i in range(nfolder)]
#     for folder in subfolders:
#         folder.mkdir(exist_ok=True)

#     # Distribute the files into the subfolders
#     for idx, file in enumerate(a3m_files):
#         target_folder = subfolders[idx % nfolder]
#         shutil.move(str(file), target_folder / file.name)

#     print(f"Distributed {len(a3m_files)} files into {nfolder} folders.")