0. Preface
Evolution is the process by which biological populations accumulate adaptive change over time under the combined action of heritable variation, natural selection, genetic drift, and other forces. Macroscopically, its outcome appears as phenotypic differences between organisms and their ancestors; microscopically, as shifts in gene frequencies at the molecular level. Evolution in nature depends on a few key conditions: 1) diverse variation; 2) variation that is heritable; 3) moderate selection pressure.
Directed evolution, by contrast, artificially mimics natural selection: the selection can act on a species' phenotype or on one specific molecule (a protein). We first construct a large mutant library from the original sequence (also called the wild type), then measure each mutant's phenotype (plant height, fruit size, or a protein's stability, solubility, enzymatic activity, and so on). The mutant with the best phenotype becomes the starting sequence for the next round of mutagenesis, and this process is iterated. You can picture it as climbing a hill: the mutant with the ideal phenotype sits at the summit, and each iteration moves one step closer to it.
The mid-20th-century Green Revolution in wheat is a macroscopic example of directed evolution. Suppose we want a short-stalked, lodging-resistant cultivar, and we start with a wheat population that has otherwise excellent traits but grows rather tall. We deliberately pick the shorter plants in the population for mutagenesis, observe the heights of the offspring, select the shorter offspring for another round of mutagenesis, and iterate a few times until the population's overall height meets the requirement. Directed evolution of a molecule follows much the same process.
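The select-and-iterate loop above can be sketched as a greedy hill climb. This is only a toy illustration: `mutate` and `measure_phenotype` are hypothetical stand-ins for real mutagenesis and a real assay.

```python
import random

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def mutate(seq, n_variants=50):
    """Hypothetical helper: a library of random single-point mutants."""
    variants = []
    for _ in range(n_variants):
        pos = random.randrange(len(seq))
        variants.append(seq[:pos] + random.choice(AA) + seq[pos + 1:])
    return variants

def measure_phenotype(seq):
    """Hypothetical stand-in for a wet-lab assay: similarity to an arbitrary 'ideal' sequence."""
    ideal = "MKT" * (len(seq) // 3 + 1)
    return sum(a == b for a, b in zip(seq, ideal)) / len(seq)

def directed_evolution(wild_type, rounds=5):
    best = wild_type
    for _ in range(rounds):              # each round is one step up the hill
        library = mutate(best) + [best]  # keep the parent so fitness never drops
        best = max(library, key=measure_phenotype)
    return best

evolved = directed_evolution("MAKEDNIEMQGT")
```

Because the parent is kept in each round's pool, the measured phenotype is monotonically non-decreasing across rounds, which is exactly the hill-climbing picture above.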
This preface is only a brief sketch of the background; interested readers can dig deeper on their own. It is worth thinking about why the directed evolution of enzymes and the phage display of peptides and antibodies together won the 2018 Nobel Prize in Chemistry.
1. Installing the EvolvePro Environment
Software that needs to be installed in advance, including:
conda
CUDA driver
Slurm (job submission system)
## 1. Install EvolvePro
git clone https://github.com/mat10d/EvolvePro.git
cd EvolvePro
conda env create -f environment.yml
conda activate evolvepro
## 2. Install dependencies
### probably not strictly necessary
### for well-known reasons, this may fail from mainland China; just retry a few times
sh setup_plm.sh
conda activate plm
# switch back to evolvepro
conda activate evolvepro
## 3. Download Pretrained models
# wget https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt
cd /data/chaofan/projects/08.Directed_evolution/00.EVOLVEpro # working directory
wget https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt
wget https://dl.fbaipublicfiles.com/fair-esm/regression/esm1b_t33_650M_UR50S-contact-regression.pt
# wget https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t33_650M_UR50D.pt
# wget https://dl.fbaipublicfiles.com/fair-esm/regression/esm2_t33_650M_UR50D-contact-regression.pt
### esm2 performs a bit better than esm1b on downstream tasks
There are two routes: one iterates over multiple experimental rounds, folding the newly measured phenotype values into each round; the other starts from deep mutational scanning data.
2. Code Implementation (Experimental Workflow)
2.1 Build the single-amino-acid mutant library
from evolvepro.src.process import generate_wt, generate_single_aa_mutants
generate_wt('MAKEDNIEMQGTVLETLPNTMFRVELENGHVVTAHISGKMRKNYIRILTGDKVTVELTPYDLSKGRIVFRSR', output_file='./kelsic_WT.fasta')
generate_single_aa_mutants('./kelsic_WT.fasta', output_file='./kelsic.fasta')
Number of mutants: 1369
from evolvepro.src.process import suggest_initial_mutants
suggest_initial_mutants('./kelsic.fasta', num_mutants=12, random_seed=42)
Suggested 12 mutants for testing:
1. R23K
2. T58E
3. I36D
4. V31C
5. I7A
6. K3F
7. Q10P
8. G38E
9. E4M
10. D61W
11. E4Y
12. R23N
We randomly pick these 12 mutant sequences from the single-mutant library for the model's initial training; they correspond to the Round1.xlsx file used later.
2.2 Extract the embedding matrix from a protein language model
BERT-style language models are very good at turning discrete text sequences into numeric feature matrices and are widely used in text classification.
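Conceptually, the `--include mean` flag used below mean-pools the per-residue representations into one fixed-length vector per sequence. A minimal sketch of that pooling step, with a random matrix standing in for the real ESM output:

```python
import numpy as np

seq_len, hidden_dim = 72, 1280  # kelsic WT length, ESM-1b hidden size
rng = np.random.default_rng(0)

# random matrix standing in for ESM's per-residue representations
token_embeddings = rng.normal(size=(seq_len, hidden_dim))

# mean-pool over residues -> one fixed-length feature vector per sequence
sequence_embedding = token_embeddings.mean(axis=0)
assert sequence_embedding.shape == (hidden_dim,)
```

This is why every sequence, regardless of length, ends up as a 1280-dimensional row in the concatenated CSV.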
!mkdir -p Single_mutation_concatenate
# If the pretrained model was downloaded ahead of time, a filename ending in .pt loads the weights locally; without the .pt suffix, the weights are re-downloaded by default.
!python /data/chaofan/software/EvolvePro/evolvepro/plm/esm/extract.py esm1b_t33_650M_UR50S.pt ./kelsic.fasta ./kelsic_esm1b_t33_650M_UR50S --toks_per_batch 512 --include mean --concatenate_dir ./Single_mutation_concatenate
Transferred model to GPU
Read kelsic.fasta with 1369 sequences
Processing 1 of 196 batches (7 sequences)
Device: cuda:0
Processing 2 of 196 batches (7 sequences)
Device: cuda:0
...(batches 3-195 omitted; each prints the same two lines)...
Processing 196 of 196 batches (4 sequences)
Device: cuda:0
Saved representations to kelsic_esm1b_t33_650M_UR50S
Shape of concatenated DataFrame: (1369, 1280)
Saved concatenated representations to Single_mutation_concatenate/kelsic_esm1b_t33_650M_UR50S.pt.csv
Because injection attacks have become frequent, PyTorch now forces you to state whether you trust the source when loading weights. Here you need to edit /data/chaofan/miniconda3/envs/evolvepro/lib/python3.11/site-packages/esm/pretrained.py and add weights_only=False to the torch.load call at line 70.
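The failure mode can be reproduced and the fix demonstrated with a small checkpoint that contains an `argparse.Namespace`, just like the ESM `.pt` files (the `demo.pt` filename is an example; alternatively, `torch.serialization.add_safe_globals([argparse.Namespace])` keeps `weights_only=True`, as the error message suggests):

```python
import argparse
import torch

# a checkpoint containing an argparse.Namespace, like the ESM .pt files
ckpt = {"args": argparse.Namespace(arch="esm1b"), "model": {}}
torch.save(ckpt, "demo.pt")

# under PyTorch >= 2.6 the default weights_only=True rejects Namespace;
# passing weights_only=False (trusted sources only!) mirrors the fix above
loaded = torch.load("demo.pt", map_location="cpu", weights_only=False)
```

Only do this for checkpoints whose origin you trust, since unpickling arbitrary files can execute arbitrary code.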
Number of mutants: 1369
Traceback (most recent call last):
File "/data/chaofan/software/EvolvePro/evolvepro/plm/esm/extract.py", line 193, in <module>
main()
File "/data/chaofan/software/EvolvePro/evolvepro/plm/esm/extract.py", line 181, in main
run(args)
File "/data/chaofan/software/EvolvePro/evolvepro/plm/esm/extract.py", line 73, in run
model, alphabet = pretrained.load_model_and_alphabet(args.model_location)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/chaofan/miniconda3/envs/evolvepro/lib/python3.11/site-packages/esm/pretrained.py", line 26, in load_model_and_alphabet
return load_model_and_alphabet_local(model_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/chaofan/miniconda3/envs/evolvepro/lib/python3.11/site-packages/esm/pretrained.py", line 70, in load_model_and_alphabet_local
model_data = torch.load(str(model_location), map_location="cpu")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/chaofan/miniconda3/envs/evolvepro/lib/python3.11/site-packages/torch/serialization.py", line 1470, in load
raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
(1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
WeightsUnpickler error: Unsupported global: GLOBAL argparse.Namespace was not an allowed global by default. Please use `torch.serialization.add_safe_globals([Namespace])` or the `torch.serialization.safe_globals([Namespace])` context manager to allowlist this global if you trust this class/function.
Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
After the steps above, each amino-acid sequence (as text) is replaced by a numeric vector of length 1280. The table contains 1369 sequences in total: one wild type plus 1368 (19 × the wild-type sequence length) single-amino-acid variants.
2.3 Multi-round few-shot training of the model
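A quick sanity check of the library size (19 alternative amino acids at each of the 72 wild-type positions, plus the wild type itself):

```python
# kelsic wild-type sequence from generate_wt above
wt = "MAKEDNIEMQGTVLETLPNTMFRVELENGHVVTAHISGKMRKNYIRILTGDKVTVELTPYDLSKGRIVFRSR"
assert len(wt) == 72

n_single_mutants = len(wt) * 19   # 19 alternative amino acids per position
total = 1 + n_single_mutants      # wild type + all single mutants
assert total == 1369              # matches "Number of mutants: 1369" above
```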
from evolvepro.src.evolve import evolve_experimental
protein_name = 'kelsic'
embeddings_base_path = 'Single_mutation_concatenate'
embeddings_file_name = 'kelsic_esm1b_t33_650M_UR50S.pt.csv'
round_base_path = '/data/chaofan/software/EvolvePro/colab/rounds_data'
wt_fasta_path = "./kelsic_WT.fasta"
number_of_variants = 12
output_dir = './'
rename_WT = False
2.3.1 Round1
round_name = 'Round1'
round_file_names = ['kelsic_Round1.xlsx']
this_round_variants, df_test, df_sorted_all = evolve_experimental(
protein_name,
round_name,
embeddings_base_path,
embeddings_file_name,
round_base_path,
round_file_names,
wt_fasta_path,
rename_WT,
number_of_variants,
output_dir
)
Processing kelsic - Round1
Embeddings loaded: (1369, 1280)
Loaded experimental data for kelsic_Round1.xlsx: (12, 3)
iteration shape: (12, 2)
Labels shape: (1369, 5)
Embeddings and labels are aligned
(1357,)
Tested variants in this round: 12
43 K3F
67 E4M
76 E4Y
115 I7A
184 Q10P
427 R23K
430 R23N
572 V31C
668 I36D
707 G38E
1087 T58E
1158 D61W
Name: variant, dtype: object
Top 12 variants predicted by the model:
variant y_pred y_actual y_actual_scaled y_actual_binary dist_metric \
74 E4V 0.954787 NaN NaN NaN None
225 T12S 0.952057 NaN NaN NaN None
5 M1F 0.951310 NaN NaN NaN None
63 E4H 0.950540 NaN NaN NaN None
15 M1S 0.947988 NaN NaN NaN None
19 M1Y 0.944838 NaN NaN NaN None
18 M1W 0.944590 NaN NaN NaN None
249 L14C 0.942195 NaN NaN NaN None
61 E4F 0.942045 NaN NaN NaN None
10 M1L 0.941873 NaN NaN NaN None
142 E8L 0.941533 NaN NaN NaN None
135 E8C 0.941173 NaN NaN NaN None
std_predictions
74 0.0
225 0.0
5 0.0
63 0.0
15 0.0
19 0.0
18 0.0
249 0.0
61 0.0
10 0.0
142 0.0
135 0.0
Data saved to ./kelsic/Round1
/data/chaofan/software/EvolvePro/evolvepro/src/model.py:200: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
df_all = pd.concat([df_train, df_test])
The model design is very simple: a pretrained BERT-style model extracts the features, and a traditional machine-learning method (random forest, XGBoost, etc.) performs regression on them. The first round trains on 12 random data points, the second round retrains on 12 × 2, and so on: round N trains on 12 × N points. Only the traditional ML model on top is trained each round.
After training, the model sweeps the whole mutant library once, scoring and ranking the predicted phenotype of every variant; the 12 variants with the best predicted phenotypes (and no measured phenotype yet) are then validated experimentally. The newly measured phenotype values are added to the next round's training, so the training set grows by 12 after each round. In principle, the data in Round2.xlsx should be the measured phenotypes of the previous round's predicted top 12; because of differences between software versions, the results shown here may deviate somewhat from the paper's, which is normal.
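A minimal sketch of what one such round does, with synthetic data standing in for the real embeddings and assay values (this is not EvolvePro's actual internals, just the train/score/select pattern described above):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n_variants, dim = 1369, 1280
embeddings = rng.normal(size=(n_variants, dim))  # stand-in for the ESM features
activity = rng.normal(size=n_variants)           # stand-in for measured phenotypes

# Round 1: 12 randomly chosen variants have been assayed
measured = rng.choice(n_variants, size=12, replace=False)

# train the top regressor only on the measured variants
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(embeddings[measured], activity[measured])

# score the whole library, then keep the top 12 *unmeasured* variants
y_pred = model.predict(embeddings)
unmeasured = np.setdiff1d(np.arange(n_variants), measured)
top12 = unmeasured[np.argsort(y_pred[unmeasured])[::-1][:12]]
# these 12 go to the bench; their measured values join the next round's training set
```

Each round simply appends the new measurements and refits this top model; the frozen language-model embeddings are computed once and reused.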
2.3.2 Round2
round_name = 'Round2'
round_file_names = ['kelsic_Round1.xlsx', 'kelsic_Round2.xlsx']
this_round_variants, df_test, df_sorted_all = evolve_experimental(
protein_name,
round_name,
embeddings_base_path,
embeddings_file_name,
round_base_path,
round_file_names,
wt_fasta_path,
rename_WT,
number_of_variants,
output_dir
)
Processing kelsic - Round2
Embeddings loaded: (1369, 1280)
Loaded experimental data for kelsic_Round1.xlsx: (12, 3)
Loaded experimental data for kelsic_Round2.xlsx: (12, 3)
iteration shape: (24, 2)
Labels shape: (1369, 5)
Embeddings and labels are aligned
(1345,)
Tested variants in this round: 24
43 K3F
61 E4F
63 E4H
66 E4L
67 E4M
68 E4N
70 E4Q
74 E4V
76 E4Y
86 D5M
115 I7A
135 E8C
146 E8Q
184 Q10P
211 T12C
220 T12M
427 R23K
430 R23N
572 V31C
668 I36D
707 G38E
1087 T58E
1158 D61W
1332 S71C
Name: variant, dtype: object
Top 12 variants predicted by the model:
variant y_pred y_actual y_actual_scaled y_actual_binary dist_metric \
80 D5F 1.000372 NaN NaN NaN None
75 E4W 0.990040 NaN NaN NaN None
64 E4I 0.982857 NaN NaN NaN None
83 D5I 0.977915 NaN NaN NaN None
95 D5Y 0.974122 NaN NaN NaN None
85 D5L 0.971400 NaN NaN NaN None
93 D5V 0.970972 NaN NaN NaN None
62 E4G 0.970870 NaN NaN NaN None
469 E25Q 0.969735 NaN NaN NaN None
7 M1H 0.967748 NaN NaN NaN None
94 D5W 0.967325 NaN NaN NaN None
578 V31I 0.967287 NaN NaN NaN None
std_predictions
80 0.0
75 0.0
64 0.0
83 0.0
95 0.0
85 0.0
93 0.0
62 0.0
469 0.0
7 0.0
94 0.0
578 0.0
Data saved to ./kelsic/Round2
/data/chaofan/software/EvolvePro/evolvepro/src/model.py:200: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
df_all = pd.concat([df_train, df_test])
2.3.3 Round3
round_name = 'Round3'
round_file_names = ['kelsic_Round1.xlsx', 'kelsic_Round2.xlsx', 'kelsic_Round3.xlsx']
this_round_variants, df_test, df_sorted_all = evolve_experimental(
protein_name,
round_name,
embeddings_base_path,
embeddings_file_name,
round_base_path,
round_file_names,
wt_fasta_path,
rename_WT,
number_of_variants,
output_dir
)
Processing kelsic - Round3
Embeddings loaded: (1369, 1280)
Loaded experimental data for kelsic_Round1.xlsx: (12, 3)
Loaded experimental data for kelsic_Round2.xlsx: (12, 3)
Loaded experimental data for kelsic_Round3.xlsx: (12, 3)
iteration shape: (36, 2)
Labels shape: (1369, 5)
Embeddings and labels are aligned
(1333,)
Tested variants in this round: 36
18 M1W
43 K3F
59 E4C
61 E4F
63 E4H
64 E4I
66 E4L
67 E4M
68 E4N
70 E4Q
74 E4V
75 E4W
76 E4Y
78 D5C
80 D5F
83 D5I
85 D5L
86 D5M
87 D5N
93 D5V
95 D5Y
115 I7A
135 E8C
146 E8Q
184 Q10P
211 T12C
220 T12M
427 R23K
430 R23N
469 E25Q
572 V31C
668 I36D
707 G38E
1087 T58E
1158 D61W
1332 S71C
Name: variant, dtype: object
Top 12 variants predicted by the model:
variant y_pred y_actual y_actual_scaled y_actual_binary dist_metric \
82 D5H 0.985537 NaN NaN NaN None
77 D5A 0.974578 NaN NaN NaN None
91 D5S 0.973428 NaN NaN NaN None
89 D5Q 0.972064 NaN NaN NaN None
504 E27M 0.964911 NaN NaN NaN None
496 E27C 0.959024 NaN NaN NaN None
500 E27H 0.958966 NaN NaN NaN None
71 E4R 0.957292 NaN NaN NaN None
92 D5T 0.956897 NaN NaN NaN None
137 E8F 0.953024 NaN NaN NaN None
505 E27N 0.952960 NaN NaN NaN None
143 E8M 0.952932 NaN NaN NaN None
std_predictions
82 0.0
77 0.0
91 0.0
89 0.0
504 0.0
496 0.0
500 0.0
71 0.0
92 0.0
137 0.0
505 0.0
143 0.0
Data saved to ./kelsic/Round3
/data/chaofan/software/EvolvePro/evolvepro/src/model.py:200: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
df_all = pd.concat([df_train, df_test])
2.3.4 Round4
round_name = 'Round4'
round_file_names = ['kelsic_Round1.xlsx', 'kelsic_Round2.xlsx', 'kelsic_Round3.xlsx', 'kelsic_Round4.xlsx']
this_round_variants, df_test, df_sorted_all = evolve_experimental(
protein_name,
round_name,
embeddings_base_path,
embeddings_file_name,
round_base_path,
round_file_names,
wt_fasta_path,
rename_WT,
number_of_variants,
output_dir
)
Processing kelsic - Round4
Embeddings loaded: (1369, 1280)
Loaded experimental data for kelsic_Round1.xlsx: (12, 3)
Loaded experimental data for kelsic_Round2.xlsx: (12, 3)
Loaded experimental data for kelsic_Round3.xlsx: (12, 3)
Loaded experimental data for kelsic_Round4.xlsx: (12, 3)
iteration shape: (48, 2)
Labels shape: (1369, 5)
Embeddings and labels are aligned
(1321,)
Tested variants in this round: 48
18 M1W
43 K3F
59 E4C
61 E4F
63 E4H
64 E4I
66 E4L
67 E4M
68 E4N
70 E4Q
73 E4T
74 E4V
75 E4W
76 E4Y
77 D5A
78 D5C
80 D5F
82 D5H
83 D5I
85 D5L
86 D5M
87 D5N
89 D5Q
91 D5S
92 D5T
93 D5V
95 D5Y
115 I7A
135 E8C
146 E8Q
184 Q10P
211 T12C
220 T12M
390 M21L
427 R23K
430 R23N
445 V24I
469 E25Q
496 E27C
503 E27L
504 E27M
507 E27Q
572 V31C
668 I36D
707 G38E
1087 T58E
1158 D61W
1332 S71C
Name: variant, dtype: object
Top 12 variants predicted by the model:
variant y_pred y_actual y_actual_scaled y_actual_binary \
72 E4S 0.978078 NaN NaN NaN
94 D5W 0.977183 NaN NaN NaN
69 E4P 0.973912 NaN NaN NaN
1281 V68I 0.962484 NaN NaN NaN
90 D5R 0.951873 NaN NaN NaN
71 E4R 0.948736 NaN NaN NaN
88 D5P 0.947673 NaN NaN NaN
144 E8N 0.947493 NaN NaN NaN
141 E8K 0.947032 NaN NaN NaN
1034 V55I 0.945509 NaN NaN NaN
616 T33I 0.944996 NaN NaN NaN
470 E25R 0.944632 NaN NaN NaN
dist_metric std_predictions
72 None 0.0
94 None 0.0
69 None 0.0
1281 None 0.0
90 None 0.0
71 None 0.0
88 None 0.0
144 None 0.0
141 None 0.0
1034 None 0.0
616 None 0.0
470 None 0.0
Data saved to ./kelsic/Round4
/data/chaofan/software/EvolvePro/evolvepro/src/model.py:200: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
df_all = pd.concat([df_train, df_test])
2.3.5 Round5
round_name = 'Round5'
round_file_names = ['kelsic_Round1.xlsx', 'kelsic_Round2.xlsx', 'kelsic_Round3.xlsx', 'kelsic_Round4.xlsx', 'kelsic_Round5.xlsx']
this_round_variants, df_test, df_sorted_all = evolve_experimental(
protein_name,
round_name,
embeddings_base_path,
embeddings_file_name,
round_base_path,
round_file_names,
wt_fasta_path,
rename_WT,
number_of_variants,
output_dir
)
Processing kelsic - Round5
Embeddings loaded: (1369, 1280)
Loaded experimental data for kelsic_Round1.xlsx: (12, 3)
Loaded experimental data for kelsic_Round2.xlsx: (12, 3)
Loaded experimental data for kelsic_Round3.xlsx: (12, 3)
Loaded experimental data for kelsic_Round4.xlsx: (12, 3)
Loaded experimental data for kelsic_Round5.xlsx: (12, 3)
iteration shape: (60, 2)
Labels shape: (1369, 5)
Embeddings and labels are aligned
(1309,)
Tested variants in this round: 60
18 M1W
43 K3F
58 E4A
59 E4C
61 E4F
63 E4H
64 E4I
66 E4L
67 E4M
68 E4N
69 E4P
70 E4Q
72 E4S
73 E4T
74 E4V
75 E4W
76 E4Y
77 D5A
78 D5C
80 D5F
82 D5H
83 D5I
85 D5L
86 D5M
87 D5N
88 D5P
89 D5Q
91 D5S
92 D5T
93 D5V
94 D5W
95 D5Y
115 I7A
135 E8C
137 E8F
139 E8H
141 E8K
146 E8Q
178 Q10H
184 Q10P
211 T12C
220 T12M
368 T20H
390 M21L
427 R23K
430 R23N
445 V24I
469 E25Q
496 E27C
503 E27L
504 E27M
507 E27Q
572 V31C
668 I36D
707 G38E
1034 V55I
1087 T58E
1158 D61W
1332 S71C
1339 S71K
Name: variant, dtype: object
Top 12 variants predicted by the model:
variant y_pred y_actual y_actual_scaled y_actual_binary \
114 N6Y 0.954242 NaN NaN NaN
373 T20N 0.950471 NaN NaN NaN
65 E4K 0.950415 NaN NaN NaN
226 T12V 0.949447 NaN NaN NaN
1281 V68I 0.947393 NaN NaN NaN
464 E25K 0.947118 NaN NaN NaN
71 E4R 0.946143 NaN NaN NaN
144 E8N 0.940163 NaN NaN NaN
628 A34C 0.937827 NaN NaN NaN
1349 S71Y 0.936767 NaN NaN NaN
1169 L62M 0.934671 NaN NaN NaN
370 T20K 0.934475 NaN NaN NaN
dist_metric std_predictions
114 None 0.0
373 None 0.0
65 None 0.0
226 None 0.0
1281 None 0.0
464 None 0.0
71 None 0.0
144 None 0.0
628 None 0.0
1349 None 0.0
1169 None 0.0
370 None 0.0
Data saved to ./kelsic/Round5
/data/chaofan/software/EvolvePro/evolvepro/src/model.py:200: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
df_all = pd.concat([df_train, df_test])
2.3.6 plot
from evolvepro.src.plot import read_exp_data, plot_variants_by_iteration
round_base_path = '/data/chaofan/software/EvolvePro/colab/rounds_data'
round_file_names = ['kelsic_Round1.xlsx', 'kelsic_Round2.xlsx', 'kelsic_Round3.xlsx', 'kelsic_Round4.xlsx', 'kelsic_Round5.xlsx']
wt_fasta_path = "./kelsic_WT.fasta"
df = read_exp_data(round_base_path, round_file_names, wt_fasta_path)
plot_variants_by_iteration(df, activity_column='activity', output_dir=output_dir, output_file="kelsic")
3. Code Implementation (Deep Mutational Scanning Workflow)
A DMS dataset contains specific mutation sites and phenotype data; we need to generate the mutant library's fasta file and embedding feature matrix before the downstream iterative training can run.
3.1 Generate the mutant sequences and the matching phenotype table
!cp -r /data/chaofan/software/EvolvePro/data .
Modify project_root in scripts/process/dms_process.py before running:
!python /data/chaofan/software/EvolvePro/scripts/process/dms_process.py
3.2 Extract the embedding feature matrix for the amino-acid sequences
!sbatch esm2_650M_dms.sh
# uses roughly 3 GB of GPU memory while running
# this step takes a very long time
esm2_650M_dms.sh

#!/bin/bash
# Configuration values for SLURM job submission.
# One leading hash ahead of the word SBATCH is not a comment, but two are.
#SBATCH --job-name=esm
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=1
#SBATCH --mem=200gb
#SBATCH --output out/esm-%j.out

study_names=("brenan" "jones" "stiffler" "haddox" "doud" "giacomelli" "kelsic" "lee" "markin" "cas12f" "cov2_S" "zikv_E")
model_names=("esm2_t33_650M_UR50D")
fasta_path="output/dms/"
results_path="output/plm/esm/"
repr_layers=33
toks_per_batch=2000

mkdir -p ${results_path}

for model_name in "${model_names[@]}"; do
  for study in "${study_names[@]}"; do
    command="python3 /data/chaofan/software/EvolvePro/evolvepro/plm/esm/extract.py ${model_name}.pt ${fasta_path}${study}.fasta ${results_path}${study}/${model_name} --toks_per_batch ${toks_per_batch} --include mean --concatenate_dir ${results_path}"
    echo "Running command: ${command}"
    eval "${command}"
  done
done
3.3 Validating directed evolution with DMS data
Because the input we pass is a ${model_name}.pt file, every output file ends up named ${dataset_name}_esm2_t33_650M_UR50D.pt.csv, while the next step expects ${dataset_name}_esm2_t33_650M_UR50D.csv as input, so the files must be batch-renamed before continuing.
!ls output/plm/esm/*.csv | while read line; do re_name=`echo $line|cut -d "/" -f4|cut -d "." -f1`; mv ${line} output/plm/esm/${re_name}.csv; done
!sbatch esm2_650M.sh
esm2_650M.sh

#!/bin/bash
# Configuration values for SLURM job submission.
# One leading hash ahead of the word SBATCH is not a comment, but two are.
#SBATCH --time=12:00:00
#SBATCH --job-name=esm2_650M_optimal
#SBATCH -n 12
#SBATCH -N 1
#SBATCH --cpus-per-task=5
#SBATCH --mem=20gb
#SBATCH --output out/esm2_650M_optimal-%j.out

source ~/.bashrc
conda activate evolvepro
# module load openmind8/gnu-parallel/20240222

datasets=("brenan" "stiffler" "doud" "haddox" "giacomelli" "jones" "kelsic" "lee" "markin" "zikv_E" "cas12f" "cov2_S")

# Function to run dms_main for a given dataset
run_dms_main() {
  dataset_name=$1
  output_file="out/${dataset_name}-esm2_650M_optimal.out"
  echo "Running ${dataset_name} dataset:" > ${output_file}
  python3 -u /data/chaofan/software/EvolvePro/scripts/dms/dms_main.py \
    --dataset_name ${dataset_name} \
    --experiment_name "esm2_650M_optimal" \
    --model_name "esm2_t33_650M_UR50D" \
    --embeddings_path "./output/plm/esm" \
    --labels_path "./output/dms" \
    --num_simulations 10 \
    --num_iterations 10 \
    --measured_var "activity" \
    --learning_strategies "topn" \
    --num_mutants_per_round 16 \
    --num_final_round_mutants 16 \
    --first_round_strategies "random" \
    --embedding_types "embeddings" \
    --regression_types "randomforest" \
    --embeddings_file_type "csv" \
    --output_dir "./output/dms_results" \
    >> ${output_file} 2>&1
  echo "Done running ${dataset_name} dataset:" >> ${output_file}
}

# Export the function so it's available to GNU Parallel
export -f run_dms_main

# Use GNU Parallel to run the dms_main function in parallel for each dataset
parallel -j12 run_dms_main ::: "${datasets[@]}"
Here is one of the resulting tables; it is quite readable. The paper mainly uses the large numbers of mutants and matching phenotype values in DMS datasets to validate the method's feasibility, but one could of course also build on the abundant DMS data to explore a much wider mutational space.
That's it: you have now learned the hottest current approach to rational protein design, deep-learning-assisted directed evolution.