0. Preface
Evolution is the process by which biological populations accumulate adaptive change over time under the combined action of heritable variation, natural selection, genetic drift, and other forces. Macroscopically, its outcome appears as phenotypic differences between organisms and their ancestors; microscopically, as shifts in gene frequencies at the molecular level. Evolution in nature depends on a few key conditions: 1) diverse variation; 2) variation that is heritable; 3) moderate selection pressure.
Directed evolution, by contrast, artificially mimics natural selection: the selection can act on a species' phenotype or on one specific molecule (a protein). We first construct a large mutant library from the original sequence (also called the wild type), then measure each mutant's phenotype (plant height, fruit size, or a protein's stability, solubility, enzymatic activity, and so on). The mutant with the best phenotype becomes the starting sequence for the next round of mutagenesis, and this process is iterated. You can picture it as climbing a hill: the mutant with the ideal phenotype sits at the summit, and each iteration moves one step closer to it.
The mid-20th-century Green Revolution in wheat is a macroscopic example of directed evolution. Suppose we want a short-stalked, lodging-resistant cultivar, and we start with a wheat population that has otherwise excellent traits but grows rather tall. We deliberately pick the shorter plants in the population for mutagenesis, observe the heights of the offspring, select the shorter offspring for another round of mutagenesis, and iterate a few times until the population's overall height meets the requirement. Directed evolution of a molecule follows much the same process.
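The select-and-iterate loop above can be sketched as a greedy hill climb. This is only a toy illustration: `mutate` and `measure_phenotype` are hypothetical stand-ins for real mutagenesis and a real assay.

```python
import random

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def mutate(seq, n_variants=50):
    """Hypothetical helper: a library of random single-point mutants."""
    variants = []
    for _ in range(n_variants):
        pos = random.randrange(len(seq))
        variants.append(seq[:pos] + random.choice(AA) + seq[pos + 1:])
    return variants

def measure_phenotype(seq):
    """Hypothetical stand-in for a wet-lab assay: similarity to an arbitrary 'ideal' sequence."""
    ideal = "MKT" * (len(seq) // 3 + 1)
    return sum(a == b for a, b in zip(seq, ideal)) / len(seq)

def directed_evolution(wild_type, rounds=5):
    best = wild_type
    for _ in range(rounds):              # each round is one step up the hill
        library = mutate(best) + [best]  # keep the parent so fitness never drops
        best = max(library, key=measure_phenotype)
    return best

evolved = directed_evolution("MAKEDNIEMQGT")
```

Because the parent is kept in each round's pool, the measured phenotype is monotonically non-decreasing across rounds, which is exactly the hill-climbing picture above.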
This preface is only a brief sketch of the background; interested readers can dig deeper on their own. It is worth thinking about why the directed evolution of enzymes and the phage display of peptides and antibodies together won the 2018 Nobel Prize in Chemistry.
1. Installing the EvolvePro Environment
Software that needs to be installed in advance, including:
conda
CUDA driver
Slurm (job submission system)
## 1. Install EvolvePro
git clone https://github.com/mat10d/EvolvePro.git
cd EvolvePro
conda env create -f environment.yml
conda activate evolvepro
## 2. Install dependencies
### probably not strictly necessary
### for well-known reasons, this may fail from mainland China; just retry a few times
sh setup_plm.sh
conda activate plm
# switch back to evolvepro
conda activate evolvepro
## 3. Download Pretrained models
# wget https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt
cd /data/chaofan/projects/08.Directed_evolution/00.EVOLVEpro # working directory
wget https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt
wget https://dl.fbaipublicfiles.com/fair-esm/regression/esm1b_t33_650M_UR50S-contact-regression.pt
# wget https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t33_650M_UR50D.pt
# wget https://dl.fbaipublicfiles.com/fair-esm/regression/esm2_t33_650M_UR50D-contact-regression.pt
### esm2 performs a bit better than esm1b on downstream tasks
There are two routes: one iterates over multiple experimental rounds, folding the newly measured phenotype values into each round; the other starts from deep mutational scanning data.
2. Code Implementation (Experimental Workflow)
2.1 Build the single-amino-acid mutant library
from evolvepro.src.process import generate_wt, generate_single_aa_mutants
generate_wt('MAKEDNIEMQGTVLETLPNTMFRVELENGHVVTAHISGKMRKNYIRILTGDKVTVELTPYDLSKGRIVFRSR', output_file='./kelsic_WT.fasta')
generate_single_aa_mutants('./kelsic_WT.fasta', output_file='./kelsic.fasta')
Number of mutants: 1369
from evolvepro.src.process import suggest_initial_mutants
suggest_initial_mutants('./kelsic.fasta', num_mutants=12, random_seed=42)
Suggested 12 mutants for testing:
1. R23K
2. T58E
3. I36D
4. V31C
5. I7A
6. K3F
7. Q10P
8. G38E
9. E4M
10. D61W
11. E4Y
12. R23N
We randomly pick these 12 mutant sequences from the single-mutant library for the model's initial training; they correspond to the Round1.xlsx file used later.
2.2 Extract the embedding matrix from a protein language model
BERT-style language models are very good at turning discrete text sequences into numeric feature matrices and are widely used in text classification.
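Conceptually, the `--include mean` flag used below mean-pools the per-residue representations into one fixed-length vector per sequence. A minimal sketch of that pooling step, with a random matrix standing in for the real ESM output:

```python
import numpy as np

seq_len, hidden_dim = 72, 1280  # kelsic WT length, ESM-1b hidden size
rng = np.random.default_rng(0)

# random matrix standing in for ESM's per-residue representations
token_embeddings = rng.normal(size=(seq_len, hidden_dim))

# mean-pool over residues -> one fixed-length feature vector per sequence
sequence_embedding = token_embeddings.mean(axis=0)
assert sequence_embedding.shape == (hidden_dim,)
```

This is why every sequence, regardless of length, ends up as a 1280-dimensional row in the concatenated CSV.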
!mkdir -p Single_mutation_concatenate
# If the pretrained model was downloaded ahead of time, a filename ending in .pt loads the weights locally; without the .pt suffix, the weights are re-downloaded by default.
!python /data/chaofan/software/EvolvePro/evolvepro/plm/esm/extract.py esm1b_t33_650M_UR50S.pt ./kelsic.fasta ./kelsic_esm1b_t33_650M_UR50S --toks_per_batch 512 --include mean --concatenate_dir ./Single_mutation_concatenate
Transferred model to GPU
Read kelsic.fasta with 1369 sequences
Processing 1 of 196 batches (7 sequences)
Device: cuda:0
Processing 2 of 196 batches (7 sequences)
Device: cuda:0
...(batches 3-195 omitted; each prints the same two lines)...
Processing 196 of 196 batches (4 sequences)
Device: cuda:0
Saved representations to kelsic_esm1b_t33_650M_UR50S
Shape of concatenated DataFrame: (1369, 1280)
Saved concatenated representations to Single_mutation_concatenate/kelsic_esm1b_t33_650M_UR50S.pt.csv
Because injection attacks have become frequent, PyTorch now forces you to state whether you trust the source when loading weights. Here you need to edit /data/chaofan/miniconda3/envs/evolvepro/lib/python3.11/site-packages/esm/pretrained.py and add weights_only=False to the torch.load call at line 70.
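The failure mode can be reproduced and the fix demonstrated with a small checkpoint that contains an `argparse.Namespace`, just like the ESM `.pt` files (the `demo.pt` filename is an example; alternatively, `torch.serialization.add_safe_globals([argparse.Namespace])` keeps `weights_only=True`, as the error message suggests):

```python
import argparse
import torch

# a checkpoint containing an argparse.Namespace, like the ESM .pt files
ckpt = {"args": argparse.Namespace(arch="esm1b"), "model": {}}
torch.save(ckpt, "demo.pt")

# under PyTorch >= 2.6 the default weights_only=True rejects Namespace;
# passing weights_only=False (trusted sources only!) mirrors the fix above
loaded = torch.load("demo.pt", map_location="cpu", weights_only=False)
```

Only do this for checkpoints whose origin you trust, since unpickling arbitrary files can execute arbitrary code.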
Number of mutants: 1369
Traceback (most recent call last):
File "/data/chaofan/software/EvolvePro/evolvepro/plm/esm/extract.py", line 193, in <module>
main()
File "/data/chaofan/software/EvolvePro/evolvepro/plm/esm/extract.py", line 181, in main
run(args)
File "/data/chaofan/software/EvolvePro/evolvepro/plm/esm/extract.py", line 73, in run
model, alphabet = pretrained.load_model_and_alphabet(args.model_location)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/chaofan/miniconda3/envs/evolvepro/lib/python3.11/site-packages/esm/pretrained.py", line 26, in load_model_and_alphabet
return load_model_and_alphabet_local(model_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/chaofan/miniconda3/envs/evolvepro/lib/python3.11/site-packages/esm/pretrained.py", line 70, in load_model_and_alphabet_local
model_data = torch.load(str(model_location), map_location="cpu")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/chaofan/miniconda3/envs/evolvepro/lib/python3.11/site-packages/torch/serialization.py", line 1470, in load
raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
(1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
WeightsUnpickler error: Unsupported global: GLOBAL argparse.Namespace was not an allowed global by default. Please use `torch.serialization.add_safe_globals([Namespace])` or the `torch.serialization.safe_globals([Namespace])` context manager to allowlist this global if you trust this class/function.
Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
After the steps above, each amino-acid sequence (as text) is replaced by a numeric vector of length 1280. The table contains 1369 sequences in total: one wild type plus 1368 (19 × the wild-type sequence length) single-amino-acid variants.
2.3 Multi-round few-shot training of the model
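A quick sanity check of the library size (19 alternative amino acids at each of the 72 wild-type positions, plus the wild type itself):

```python
# kelsic wild-type sequence from generate_wt above
wt = "MAKEDNIEMQGTVLETLPNTMFRVELENGHVVTAHISGKMRKNYIRILTGDKVTVELTPYDLSKGRIVFRSR"
assert len(wt) == 72

n_single_mutants = len(wt) * 19   # 19 alternative amino acids per position
total = 1 + n_single_mutants      # wild type + all single mutants
assert total == 1369              # matches "Number of mutants: 1369" above
```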
from evolvepro.src.evolve import evolve_experimental
protein_name = 'kelsic'
embeddings_base_path = 'Single_mutation_concatenate'
embeddings_file_name = 'kelsic_esm1b_t33_650M_UR50S.pt.csv'
round_base_path = '/data/chaofan/software/EvolvePro/colab/rounds_data'
wt_fasta_path = "./kelsic_WT.fasta"
number_of_variants = 12
output_dir = './'
rename_WT = False
2.3.1 Round1
round_name = 'Round1'
round_file_names = ['kelsic_Round1.xlsx']
this_round_variants, df_test, df_sorted_all = evolve_experimental(
protein_name,
round_name,
embeddings_base_path,
embeddings_file_name,
round_base_path,
round_file_names,
wt_fasta_path,
rename_WT,
number_of_variants,
output_dir
)
Processing kelsic - Round1
Embeddings loaded: (1369, 1280)
Loaded experimental data for kelsic_Round1.xlsx: (12, 3)
iteration shape: (12, 2)
Labels shape: (1369, 5)
Embeddings and labels are aligned
(1357,)
Tested variants in this round: 12
43 K3F
67 E4M
76 E4Y
115 I7A
184 Q10P
427 R23K
430 R23N
572 V31C
668 I36D
707 G38E
1087 T58E
1158 D61W
Name: variant, dtype: object
Top 12 variants predicted by the model:
variant y_pred y_actual y_actual_scaled y_actual_binary dist_metric \
74 E4V 0.954787 NaN NaN NaN None
225 T12S 0.952057 NaN NaN NaN None
5 M1F 0.951310 NaN NaN NaN None
63 E4H 0.950540 NaN NaN NaN None
15 M1S 0.947988 NaN NaN NaN None
19 M1Y 0.944838 NaN NaN NaN None
18 M1W 0.944590 NaN NaN NaN None
249 L14C 0.942195 NaN NaN NaN None
61 E4F 0.942045 NaN NaN NaN None
10 M1L 0.941873 NaN NaN NaN None
142 E8L 0.941533 NaN NaN NaN None
135 E8C 0.941173 NaN NaN NaN None
std_predictions
74 0.0
225 0.0
5 0.0
63 0.0
15 0.0
19 0.0
18 0.0
249 0.0
61 0.0
10 0.0
142 0.0
135 0.0
Data saved to ./kelsic/Round1
/data/chaofan/software/EvolvePro/evolvepro/src/model.py:200: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
df_all = pd.concat([df_train, df_test])
The model design is very simple: a pretrained BERT-style model extracts the features, and a traditional machine-learning method (random forest, XGBoost, etc.) performs regression on them. The first round trains on 12 random data points, the second round retrains on 12 × 2, and so on: round N trains on 12 × N points. Only the traditional ML model on top is trained each round.
After training, the model sweeps the whole mutant library once, scoring and ranking the predicted phenotype of every variant; the 12 variants with the best predicted phenotypes (and no measured phenotype yet) are then validated experimentally. The newly measured phenotype values are added to the next round's training, so the training set grows by 12 after each round. In principle, the data in Round2.xlsx should be the measured phenotypes of the previous round's predicted top 12; because of differences between software versions, the results shown here may deviate somewhat from the paper's, which is normal.
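A minimal sketch of what one such round does, with synthetic data standing in for the real embeddings and assay values (this is not EvolvePro's actual internals, just the train/score/select pattern described above):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n_variants, dim = 1369, 1280
embeddings = rng.normal(size=(n_variants, dim))  # stand-in for the ESM features
activity = rng.normal(size=n_variants)           # stand-in for measured phenotypes

# Round 1: 12 randomly chosen variants have been assayed
measured = rng.choice(n_variants, size=12, replace=False)

# train the top regressor only on the measured variants
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(embeddings[measured], activity[measured])

# score the whole library, then keep the top 12 *unmeasured* variants
y_pred = model.predict(embeddings)
unmeasured = np.setdiff1d(np.arange(n_variants), measured)
top12 = unmeasured[np.argsort(y_pred[unmeasured])[::-1][:12]]
# these 12 go to the bench; their measured values join the next round's training set
```

Each round simply appends the new measurements and refits this top model; the frozen language-model embeddings are computed once and reused.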
2.3.2 Round2
round_name = 'Round2'
round_file_names = ['kelsic_Round1.xlsx', 'kelsic_Round2.xlsx']
this_round_variants, df_test, df_sorted_all = evolve_experimental(
protein_name,
round_name,
embeddings_base_path,
embeddings_file_name,
round_base_path,
round_file_names,
wt_fasta_path,
rename_WT,
number_of_variants,
output_dir
)
Processing kelsic - Round2
Embeddings loaded: (1369, 1280)
Loaded experimental data for kelsic_Round1.xlsx: (12, 3)
Loaded experimental data for kelsic_Round2.xlsx: (12, 3)
iteration shape: (24, 2)
Labels shape: (1369, 5)
Embeddings and labels are aligned
(1345,)
Tested variants in this round: 24
43 K3F
61 E4F
63 E4H
66 E4L
67 E4M
68 E4N
70 E4Q
74 E4V
76 E4Y
86 D5M
115 I7A
135 E8C
146 E8Q
184 Q10P
211 T12C
220 T12M
427 R23K
430 R23N
572 V31C
668 I36D
707 G38E
1087 T58E
1158 D61W
1332 S71C
Name: variant, dtype: object
Top 12 variants predicted by the model:
variant y_pred y_actual y_actual_scaled y_actual_binary dist_metric \
80 D5F 1.000372 NaN NaN NaN None
75 E4W 0.990040 NaN NaN NaN None
64 E4I 0.982857 NaN NaN NaN None
83 D5I 0.977915 NaN NaN NaN None
95 D5Y 0.974122 NaN NaN NaN None
85 D5L 0.971400 NaN NaN NaN None
93 D5V 0.970972 NaN NaN NaN None
62 E4G 0.970870 NaN NaN NaN None
469 E25Q 0.969735 NaN NaN NaN None
7 M1H 0.967748 NaN NaN NaN None
94 D5W 0.967325 NaN NaN NaN None
578 V31I 0.967287 NaN NaN NaN None
std_predictions
80 0.0
75 0.0
64 0.0
83 0.0
95 0.0
85 0.0
93 0.0
62 0.0
469 0.0
7 0.0
94 0.0
578 0.0
Data saved to ./kelsic/Round2
/data/chaofan/software/EvolvePro/evolvepro/src/model.py:200: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
df_all = pd.concat([df_train, df_test])
2.3.3 Round3
round_name = 'Round3'
round_file_names = ['kelsic_Round1.xlsx', 'kelsic_Round2.xlsx', 'kelsic_Round3.xlsx']
this_round_variants, df_test, df_sorted_all = evolve_experimental(
protein_name,
round_name,
embeddings_base_path,
embeddings_file_name,
round_base_path,
round_file_names,
wt_fasta_path,
rename_WT,
number_of_variants,
output_dir
)
Processing kelsic - Round3
Embeddings loaded: (1369, 1280)
Loaded experimental data for kelsic_Round1.xlsx: (12, 3)
Loaded experimental data for kelsic_Round2.xlsx: (12, 3)
Loaded experimental data for kelsic_Round3.xlsx: (12, 3)
iteration shape: (36, 2)
Labels shape: (1369, 5)
Embeddings and labels are aligned
(1333,)
Tested variants in this round: 36
18 M1W
43 K3F
59 E4C
61 E4F
63 E4H
64 E4I
66 E4L
67 E4M
68 E4N
70 E4Q
74 E4V
75 E4W
76 E4Y
78 D5C
80 D5F
83 D5I
85 D5L
86 D5M
87 D5N
93 D5V
95 D5Y
115 I7A
135 E8C
146 E8Q
184 Q10P
211 T12C
220 T12M
427 R23K
430 R23N
469 E25Q
572 V31C
668 I36D
707 G38E
1087 T58E
1158 D61W
1332 S71C
Name: variant, dtype: object
Top 12 variants predicted by the model:
variant y_pred y_actual y_actual_scaled y_actual_binary dist_metric \
82 D5H 0.985537 NaN NaN NaN None
77 D5A 0.974578 NaN NaN NaN None
91 D5S 0.973428 NaN NaN NaN None
89 D5Q 0.972064 NaN NaN NaN None
504 E27M 0.964911 NaN NaN NaN None
496 E27C 0.959024 NaN NaN NaN None
500 E27H 0.958966 NaN NaN NaN None
71 E4R 0.957292 NaN NaN NaN None
92 D5T 0.956897 NaN NaN NaN None
137 E8F 0.953024 NaN NaN NaN None
505 E27N 0.952960 NaN NaN NaN None
143 E8M 0.952932 NaN NaN NaN None
std_predictions
82 0.0
77 0.0
91 0.0
89 0.0
504 0.0
496 0.0
500 0.0
71 0.0
92 0.0
137 0.0
505 0.0
143 0.0
Data saved to ./kelsic/Round3
/data/chaofan/software/EvolvePro/evolvepro/src/model.py:200: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
df_all = pd.concat([df_train, df_test])
2.3.4 Round4
round_name = 'Round4'
round_file_names = ['kelsic_Round1.xlsx', 'kelsic_Round2.xlsx', 'kelsic_Round3.xlsx', 'kelsic_Round4.xlsx']
this_round_variants, df_test, df_sorted_all = evolve_experimental(
protein_name,
round_name,
embeddings_base_path,
embeddings_file_name,
round_base_path,
round_file_names,
wt_fasta_path,
rename_WT,
number_of_variants,
output_dir
)
Processing kelsic - Round4
Embeddings loaded: (1369, 1280)
Loaded experimental data for kelsic_Round1.xlsx: (12, 3)
Loaded experimental data for kelsic_Round2.xlsx: (12, 3)
Loaded experimental data for kelsic_Round3.xlsx: (12, 3)
Loaded experimental data for kelsic_Round4.xlsx: (12, 3)
iteration shape: (48, 2)
Labels shape: (1369, 5)
Embeddings and labels are aligned
(1321,)
Tested variants in this round: 48
18 M1W
43 K3F
59 E4C
61 E4F
63 E4H
64 E4I
66 E4L
67 E4M
68 E4N
70 E4Q
73 E4T
74 E4V
75 E4W
76 E4Y
77 D5A
78 D5C
80 D5F
82 D5H
83 D5I
85 D5L
86 D5M
87 D5N
89 D5Q
91 D5S
92 D5T
93 D5V
95 D5Y
115 I7A
135 E8C
146 E8Q
184 Q10P
211 T12C
220 T12M
390 M21L
427 R23K
430 R23N
445 V24I
469 E25Q
496 E27C
503 E27L
504 E27M
507 E27Q
572 V31C
668 I36D
707 G38E
1087 T58E
1158 D61W
1332 S71C
Name: variant, dtype: object
Top 12 variants predicted by the model:
variant y_pred y_actual y_actual_scaled y_actual_binary \
72 E4S 0.978078 NaN NaN NaN
94 D5W 0.977183 NaN NaN NaN
69 E4P 0.973912 NaN NaN NaN
1281 V68I 0.962484 NaN NaN NaN
90 D5R 0.951873 NaN NaN NaN
71 E4R 0.948736 NaN NaN NaN
88 D5P 0.947673 NaN NaN NaN
144 E8N 0.947493 NaN NaN NaN
141 E8K 0.947032 NaN NaN NaN
1034 V55I 0.945509 NaN NaN NaN
616 T33I 0.944996 NaN NaN NaN
470 E25R 0.944632 NaN NaN NaN
dist_metric std_predictions
72 None 0.0
94 None 0.0
69 None 0.0
1281 None 0.0
90 None 0.0
71 None 0.0
88 None 0.0
144 None 0.0
141 None 0.0
1034 None 0.0
616 None 0.0
470 None 0.0
Data saved to ./kelsic/Round4
/data/chaofan/software/EvolvePro/evolvepro/src/model.py:200: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
df_all = pd.concat([df_train, df_test])
2.3.5 Round5
round_name = 'Round5'
round_file_names = ['kelsic_Round1.xlsx', 'kelsic_Round2.xlsx', 'kelsic_Round3.xlsx', 'kelsic_Round4.xlsx', 'kelsic_Round5.xlsx']
this_round_variants, df_test, df_sorted_all = evolve_experimental(
protein_name,
round_name,
embeddings_base_path,
embeddings_file_name,
round_base_path,
round_file_names,
wt_fasta_path,
rename_WT,
number_of_variants,
output_dir
)
Processing kelsic - Round5
Embeddings loaded: (1369, 1280)
Loaded experimental data for kelsic_Round1.xlsx: (12, 3)
Loaded experimental data for kelsic_Round2.xlsx: (12, 3)
Loaded experimental data for kelsic_Round3.xlsx: (12, 3)
Loaded experimental data for kelsic_Round4.xlsx: (12, 3)
Loaded experimental data for kelsic_Round5.xlsx: (12, 3)
iteration shape: (60, 2)
Labels shape: (1369, 5)
Embeddings and labels are aligned
(1309,)
Tested variants in this round: 60
18 M1W
43 K3F
58 E4A
59 E4C
61 E4F
63 E4H
64 E4I
66 E4L
67 E4M
68 E4N
69 E4P
70 E4Q
72 E4S
73 E4T
74 E4V
75 E4W
76 E4Y
77 D5A
78 D5C
80 D5F
82 D5H
83 D5I
85 D5L
86 D5M
87 D5N
88 D5P
89 D5Q
91 D5S
92 D5T
93 D5V
94 D5W
95 D5Y
115 I7A
135 E8C
137 E8F
139 E8H
141 E8K
146 E8Q
178 Q10H
184 Q10P
211 T12C
220 T12M
368 T20H
390 M21L
427 R23K
430 R23N
445 V24I
469 E25Q
496 E27C
503 E27L
504 E27M
507 E27Q
572 V31C
668 I36D
707 G38E
1034 V55I
1087 T58E
1158 D61W
1332 S71C
1339 S71K
Name: variant, dtype: object
Top 12 variants predicted by the model:
variant y_pred y_actual y_actual_scaled y_actual_binary \
114 N6Y 0.954242 NaN NaN NaN
373 T20N 0.950471 NaN NaN NaN
65 E4K 0.950415 NaN NaN NaN
226 T12V 0.949447 NaN NaN NaN
1281 V68I 0.947393 NaN NaN NaN
464 E25K 0.947118 NaN NaN NaN
71 E4R 0.946143 NaN NaN NaN
144 E8N 0.940163 NaN NaN NaN
628 A34C 0.937827 NaN NaN NaN
1349 S71Y 0.936767 NaN NaN NaN
1169 L62M 0.934671 NaN NaN NaN
370 T20K 0.934475 NaN NaN NaN
dist_metric std_predictions
114 None 0.0
373 None 0.0
65 None 0.0
226 None 0.0
1281 None 0.0
464 None 0.0
71 None 0.0
144 None 0.0
628 None 0.0
1349 None 0.0
1169 None 0.0
370 None 0.0
Data saved to ./kelsic/Round5
/data/chaofan/software/EvolvePro/evolvepro/src/model.py:200: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
df_all = pd.concat([df_train, df_test])
2.3.6 plot
from evolvepro.src.plot import read_exp_data, plot_variants_by_iteration
round_base_path = '/data/chaofan/software/EvolvePro/colab/rounds_data'
round_file_names = ['kelsic_Round1.xlsx', 'kelsic_Round2.xlsx', 'kelsic_Round3.xlsx', 'kelsic_Round4.xlsx', 'kelsic_Round5.xlsx']
wt_fasta_path = "./kelsic_WT.fasta"
df = read_exp_data(round_base_path, round_file_names, wt_fasta_path)
plot_variants_by_iteration(df, activity_column='activity', output_dir=output_dir, output_file="kelsic")
3. Code Implementation (Deep Mutational Scanning Workflow)
A DMS dataset contains specific mutation sites and phenotype data; we need to generate the mutant library's fasta file and embedding feature matrix before the downstream iterative training can run.
3.1 Generate the mutant sequences and the matching phenotype table
!cp -r /data/chaofan/software/EvolvePro/data .
Modify project_root in scripts/process/dms_process.py before running:
!python /data/chaofan/software/EvolvePro/scripts/process/dms_process.py
3.2 Extract the embedding feature matrix for the amino-acid sequences
!sbatch esm2_650M_dms.sh
# uses roughly 3 GB of GPU memory while running
# this step takes a very long time
esm2_650M_dms.sh

#!/bin/bash
# Configuration values for SLURM job submission.
# One leading hash ahead of the word SBATCH is not a comment, but two are.
#SBATCH --job-name=esm
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=1
#SBATCH --mem=200gb
#SBATCH --output out/esm-%j.out

study_names=("brenan" "jones" "stiffler" "haddox" "doud" "giacomelli" "kelsic" "lee" "markin" "cas12f" "cov2_S" "zikv_E")
model_names=("esm2_t33_650M_UR50D")
fasta_path="output/dms/"
results_path="output/plm/esm/"
repr_layers=33
toks_per_batch=2000

mkdir -p ${results_path}

for model_name in "${model_names[@]}"; do
  for study in "${study_names[@]}"; do
    command="python3 /data/chaofan/software/EvolvePro/evolvepro/plm/esm/extract.py ${model_name}.pt ${fasta_path}${study}.fasta ${results_path}${study}/${model_name} --toks_per_batch ${toks_per_batch} --include mean --concatenate_dir ${results_path}"
    echo "Running command: ${command}"
    eval "${command}"
  done
done
3.3 Validating directed evolution with DMS data
Because the input we pass is a ${model_name}.pt file, every output file ends up named ${dataset_name}_esm2_t33_650M_UR50D.pt.csv, while the next step expects ${dataset_name}_esm2_t33_650M_UR50D.csv as input, so the files must be batch-renamed before continuing.
!ls output/plm/esm/*.csv | while read line; do re_name=`echo $line|cut -d "/" -f4|cut -d "." -f1`; mv ${line} output/plm/esm/${re_name}.csv; done
!sbatch esm2_650M.sh
esm2_650M.sh

#!/bin/bash
# Configuration values for SLURM job submission.
# One leading hash ahead of the word SBATCH is not a comment, but two are.
#SBATCH --time=12:00:00
#SBATCH --job-name=esm2_650M_optimal
#SBATCH -n 12
#SBATCH -N 1
#SBATCH --cpus-per-task=5
#SBATCH --mem=20gb
#SBATCH --output out/esm2_650M_optimal-%j.out

source ~/.bashrc
conda activate evolvepro
# module load openmind8/gnu-parallel/20240222

datasets=("brenan" "stiffler" "doud" "haddox" "giacomelli" "jones" "kelsic" "lee" "markin" "zikv_E" "cas12f" "cov2_S")

# Function to run dms_main for a given dataset
run_dms_main() {
  dataset_name=$1
  output_file="out/${dataset_name}-esm2_650M_optimal.out"
  echo "Running ${dataset_name} dataset:" > ${output_file}
  python3 -u /data/chaofan/software/EvolvePro/scripts/dms/dms_main.py \
    --dataset_name ${dataset_name} \
    --experiment_name "esm2_650M_optimal" \
    --model_name "esm2_t33_650M_UR50D" \
    --embeddings_path "./output/plm/esm" \
    --labels_path "./output/dms" \
    --num_simulations 10 \
    --num_iterations 10 \
    --measured_var "activity" \
    --learning_strategies "topn" \
    --num_mutants_per_round 16 \
    --num_final_round_mutants 16 \
    --first_round_strategies "random" \
    --embedding_types "embeddings" \
    --regression_types "randomforest" \
    --embeddings_file_type "csv" \
    --output_dir "./output/dms_results" \
    >> ${output_file} 2>&1
  echo "Done running ${dataset_name} dataset:" >> ${output_file}
}

# Export the function so it's available to GNU Parallel
export -f run_dms_main

# Use GNU Parallel to run the dms_main function in parallel for each dataset
parallel -j12 run_dms_main ::: "${datasets[@]}"
Here is one of the resulting tables; it is quite readable. The paper mainly uses the large numbers of mutants and matching phenotype values in DMS datasets to validate the method's feasibility, but one could of course also build on the abundant DMS data to explore a much wider mutational space.
That's it: you have now learned the hottest current approach to rational protein design, deep-learning-assisted directed evolution.