Genomic Evolutionary Rate Profiling(GERP)

发布日期: 2023-06-11

更新日期: 2024-12-30

文章字数: 969

阅读时长: 4 分

用来简单记录下GERP的过程，方便后续再次使用。

1. 环境准备

# 下载cactus
## 一般在自己本地的soft文件夹下进行
wget https://github.com/ComparativeGenomicsToolkit/cactus/releases/download/v2.5.2/cactus-bin-v2.5.2.tar.gz

tar -xzf cactus-bin-v2.5.2.tar.gz
cd cactus-bin-v2.5.2

pip install virtualenv
virtualenv -p python3 cactus_env
echo "export PATH=$(pwd)/bin:\$PATH" >> cactus_env/bin/activate
echo "export PYTHONPATH=$(pwd)/lib:\$PYTHONPATH" >> cactus_env/bin/activate
source cactus_env/bin/activate
python3 -m pip install -U setuptools pip
python3 -m pip install -U .
python3 -m pip install -U -r ./toil-requirement.txt

cd bin && for i in wigToBigWig faToTwoBit bedToBigBed bigBedToBed axtChain pslPosTarget bedSort hgGcPercent mafToBigMaf hgLoadMafSummary; do wget -q http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/${i}; chmod +x ${i}; done


## 按照上面的流程下载 cactus一般是不太会有其他问题的

# 下载 mashtree
git clone https://github.com/lskatz/mashtree.git

# 下载UCSC工具包
## 把这个页面http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/上所有的文件下下来，后面有些要用

2. 获得物种树

mashtree --mindepth 0 --numcpus 12 *.fa > mashtree.dnd

3. 全基因组比对

这个主要使用Progressive Cactus做。

# 构建 cactus的输入文件
## species.cacuts.txt第一行则为上一步构建的mashtree。
## 后面的行分为两列，用空格隔开，第一列为基因组名称(可以是物种名或自己取的名字)
## 第二列则基因组序列地址(基因组序列必须是softmasked的) 
# 这里一定一定要注意 你的基因组文件名称必须和物种树的名称一致，比如物种树里是 Camellia_sinensis， 那么最好基因组文件名称是 Camellia_sinensis.fa
cactus ./js species.cacuts.txt species.hal

# hal2maf
## refname则是你选择的一个参考基因组，名称必须与 species.cacuts.txt 内的一致
$refname="refname"
cactus-hal2maf ./js species.hal species.maf.gz --refGenome $refname --chunkSize 1000000 --noAncestors --dupeMode single

# 可以过滤一下，我们这里是cactus的结果，先不过滤
## filter mafs so all blocks have refs and are at least 20 bp long
gzip -d species.maf.gz
mafFilter -minCol=20 -needComp="$refname" species.maf > species.filter.maf
## maf文件格式: https://www.jianshu.com/p/963e3c5dcc80

3.1 获得MSA(全基因组范围)文件

# 生成一个空的dummy.bed文件
touch dummy.bed 
# 将maf文件按染色体拆分
mafSplit -byTarget dummy.bed ./ species.maf -useFullSequenceName

python maf2fasta.mutil.py -i $each_chr_maf --order species.order -r refs.genome.fa -o $ouf_prefix -c cup_nums

3.2 提取4DTV位点

# 提取refs 4DTV位点
iTools Gfftools getCdsPep -Ref ref.genome.fa -Gff ref.gff -OutPut ouf_name -4DSite

这里我们有两种处理顺序，一种是利用上面的whole genome alignment[maf -> fasta])，再根据这个比对结果提取4DTV位点(如果MSA文件是非必须的，推荐第二种方法)。另外一种更为直接，直接从maf结果中提取4DTV位点序列。
我们直接用后面的方法。脚本写了一上午，检查了一下午，就不放出来了(也可以使用玉米NAM群体³的R脚本)。

gzip -d ouf_name.4Dsite.gz

# maf24DTV 这里生成各染色体文件
./maf24DTV.py -i species.maf --order species.order -s ouf_name.4Dsite -o ouf_prefix -r ref.genome.fa

# 将上一步提取各染色体的4DTV位点合并到一起
./fasta_merge_chr.py -i ouf_prefix* -o species.merge.4DTV.fa

# 构建中性进化树
iqtree -s species.merge.4DTV.fa -st DNA -T 60 -mem 200G  -bb 1000 --prefix species.merge.4DTV.fa

4. 计算GERP数值

GERP的范围和你进化树的枝长与比较的物种数量都是有关系的，注意。

# 这里msa的header和 ${REF}都要和系统发育树中的ID 对应，这里我分染色体跑
gerpcol -t species.merge.4DTV.fa.treefile -f ${msa_fasta} -a -e ${REF} -j -v -z -x .gerp.rates

gerpelem -f ${msa_fasta}.gerp.rates

References:

https://doi.org/10.1016/j.cell.2023.04.008

https://github.com/ComparativeGenomicsToolkit/cactus

https://github.com/HuffordLab/NAM-genomes/tree/master/gerp

zhangchaofan

https://zhangchaofan01.gitee.io/2023/06/11/genomic-evolutionary-rate-profiling-gerp/