manual_genome | zhangchaofan's blog

发布日期: 2023-08-25

更新日期: 2023-08-25

文章字数: 860

阅读时长: 4 分

1.引言

我们在基因组的组装过程中可能需要手动检查的情况，直接对初始的组装结果进行手动矫正，而这个脚本就是用来做这个的。
这些都是在原序列的基础上操作的，我检查了几个例子是没问题的，但是用的时候还是要注意隐藏的 BUG。

Usage: python removeDNA_fromBED.py [raw_contig] [dispose.bed] [disposed_contig]

2.输入输出

输入的 bed 文件根据操作的不同，有不同的列数，每个操作都是一行。
一共有 4 大类:

‘D’: Deletion 删除;
‘RC’: Reverse_Complement 反向互补;
‘I’: Insertion 插入序列，这个操作行有五列，除 le 和D和RC类似的 4 列外，
多了第五列用来存放要插入的序列，而且这一行的第二列和第三列应该是一样的。
你是插入一个点，比如在 11 位置插入，就是在前 10 个碱基后插入一段序列。
‘M’: Move 挪动序列，这个操作行有 7 列: Chr start1 end1 M start2 end2 [+|-]
你是移动，就会有两个位置嘛，如果第七列为 + 符号，代表把 start1 到 end1 的序列
挪动到 start2 和 end2 的位置，这种情况下 start2 是等于 end2 的。这个操作等价与在
start1 到 end 的一个删除和在 start2 的一个插入。如果第七列为 - 符号，则方向
相反，把 start2-end2 的序列插入到 start1(=end1)的位置。

输入 bed 文件示例：

contig1	1	10	RC
contig1	11	11	I	ZCF

输入 fasta文件：

>contig1
AAAAAAAAAATTTTTTTTTTCCCCCCCCCCGGGGGGGGGG

输出结果：

>contig1
TTTTTTTTTTZCFTTTTTTTTTTCCCCCCCCCCGGGGGGGGGG

3.脚本

manual_genome.py
from sys import argv, exit

def reverse_complement(sequence):
    complement_dict = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C','a': 't', 't': 'a', 'c':'g', 'g':'c', 'N':'N', 'n': 'n'}
    reverse_complement_sequence = ''.join([complement_dict[base] for base in sequence[::-1]])
    return reverse_complement_sequence

def read_fasta_file(filename):
    fasta_dict = {}
    current_sequence = ""
    current_id = None

    with open(filename, 'r') as file:
        for line in file:
            line = line.strip()
            if line.startswith('>'):  # 处理标识符行
                if current_id is not None:
                    fasta_dict[current_id] = current_sequence
                current_id = line[1:]
                current_sequence = ""
            else:  # 处理序列行
                current_sequence += line
        if current_id is not None:
            fasta_dict[current_id] = current_sequence

    return fasta_dict

def read_tsv_file(filename, fa_dic):
    tsv_dict = {}

    with open(filename, 'r') as file:
        for line in file:
            line = line.strip()
            if line:  # 跳过空行
                columns = line.split('\t')
                key = columns[0]
                if key not in tsv_dict:
                    tsv_dict[key] = []

                if columns[3] != "M":
                    tsv_dict[key].append(columns[1:])
                else:
                    # move order
                    # M = insertion + deletion
                    if columns[6] == '+':
                        tsv_dict[key].append([columns[1], columns[2], 'D'])
                        tsv_dict[key].append([columns[4], columns[5], 'I', \
                            fa_dic[key][int(columns[1])-1:int(columns[2])]])
                    else:
                        tsv_dict[key].append([columns[4], columns[5], 'D'])
                        tsv_dict[key].append([columns[1], columns[2], 'I', \
                            fa_dic[key][int(columns[4])-1:int(columns[5])]])
    #
    for ctg in tsv_dict:
        tsv_dict[ctg] = sorted(tsv_dict[ctg], key=lambda x: (int(x[0]), int(x[1])))

    return tsv_dict

if __name__ == "__main__":
    if len(argv) != 4:
        exit("Usage: python manual_genome.py [raw_contig] [dispose.bed] [disposed_contig]\n\n\n" + \
            "Like BED format [1_index]: \nChr\tstart\tEND\tkeyword\nkeyword: 'D': deletion; 'RC' : reverse_complement; " + \
            "'M': move seqs from one pos to another pos(in same contig)." + "\n" + \
            "If keyword = 'I', the line with 5 columns: Chr start end I [insertion_seq].\n"+ \
            "If keyword = 'M', the line with 7 columns: Chr start1 end1 M start2 end2 [+|-].\n"+ \
            "If 7thcol in M lines = '+', means move seqs from start1 to start2, else move seqs from start2 to start1\n")


    fa_dic = read_fasta_file(argv[1])
    dispose_dic = read_tsv_file(argv[2], fa_dic)
    ouf_w = open(argv[3], "w")

    for contig in fa_dic:
        raw_Seq = fa_dic[contig]
        gap_bp = 0

        for items in dispose_dic[contig]:
            # Deletion
            if items[2] == "D":
                raw_Seq = raw_Seq[:int(items[0])-1-gap_bp] + raw_Seq[int(items[1])-gap_bp:]
                gap_bp +=  int(items[1]) - int(items[0]) + 1
                print(gap_bp)
            # reverse_complement
            if items[2] == "RC":
                raw_Seq = raw_Seq[:int(items[0])-1-gap_bp] + \
                    reverse_complement(raw_Seq[int(items[0])-1-gap_bp:int(items[1])-gap_bp]) + \
                    raw_Seq[int(items[1])-gap_bp:]

            if items[2] == "I":
                raw_Seq = raw_Seq[:int(items[0])-1-gap_bp] + \
                    items[3] + \
                    raw_Seq[int(items[1])-1-gap_bp:]

                gap_bp -= len(items[3])



        ouf_w.write(">"+contig+"\n"+raw_Seq+"\n")

    ouf_w.close()

如发现任何问题，恳请联系我纠正。

zhangchaofan

https://zhangchaofan01.gitee.io/2023/08/25/manual-genome/

本博客所有文章除特別声明外，均采用 CC BY 4.0 许可协议。转载请注明来源 zhangchaofan !

Genome python

ncRNA-annotation

2023-10-02 zhangchaofan

Genome ncRNA

Install R-packages from github

2023-08-14 zhangchaofan