1.引言
我们在基因组的组装过程中可能需要手动检查的情况,直接对初始的组装结果进行手动矫正,而这个脚本就是用来做这个的。
这些都是在原序列的基础上操作的,我检查了几个例子是没问题的,但是用的时候还是要注意隐藏的 BUG。
Usage: python removeDNA_fromBED.py [raw_contig] [dispose.bed] [disposed_contig]
2.输入输出
输入的 bed 文件根据操作的不同,有不同的列数,每个操作都是一行。
一共有 4 大类:
- ‘D’: Deletion 删除;
- ‘RC’: Reverse_Complement 反向互补;
- ‘I’: Insertion 插入序列,这个操作行有五列,除 le 和
D
和RC
类似的 4 列外,
多了第五列用来存放要插入的序列,而且这一行的第二列和第三列应该是一样的。
你是插入一个点,比如在 11 位置插入,就是在前 10 个碱基后插入一段序列。 - ‘M’: Move 挪动序列,这个操作行有 7 列: Chr start1 end1 M start2 end2 [+|-]
你是移动,就会有两个位置嘛,如果第七列为 + 符号,代表把 start1 到 end1 的序列
挪动到 start2 和 end2 的位置,这种情况下 start2 是等于 end2 的。这个操作等价与在
start1 到 end 的一个删除和在 start2 的一个插入。 如果第七列为 - 符号,则方向
相反,把 start2-end2 的序列插入到 start1(=end1)的位置。
输入 bed
文件示例:
contig1 1 10 RC
contig1 11 11 I ZCF
输入 fasta
文件:
>contig1
AAAAAAAAAATTTTTTTTTTCCCCCCCCCCGGGGGGGGGG
输出结果:
>contig1
TTTTTTTTTTZCFTTTTTTTTTTCCCCCCCCCCGGGGGGGGGG
3.脚本
manual_genome.pyfrom sys import argv, exit def reverse_complement(sequence): complement_dict = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C','a': 't', 't': 'a', 'c':'g', 'g':'c', 'N':'N', 'n': 'n'} reverse_complement_sequence = ''.join([complement_dict[base] for base in sequence[::-1]]) return reverse_complement_sequence def read_fasta_file(filename): fasta_dict = {} current_sequence = "" current_id = None with open(filename, 'r') as file: for line in file: line = line.strip() if line.startswith('>'): # 处理标识符行 if current_id is not None: fasta_dict[current_id] = current_sequence current_id = line[1:] current_sequence = "" else: # 处理序列行 current_sequence += line if current_id is not None: fasta_dict[current_id] = current_sequence return fasta_dict def read_tsv_file(filename, fa_dic): tsv_dict = {} with open(filename, 'r') as file: for line in file: line = line.strip() if line: # 跳过空行 columns = line.split('\t') key = columns[0] if key not in tsv_dict: tsv_dict[key] = [] if columns[3] != "M": tsv_dict[key].append(columns[1:]) else: # move order # M = insertion + deletion if columns[6] == '+': tsv_dict[key].append([columns[1], columns[2], 'D']) tsv_dict[key].append([columns[4], columns[5], 'I', \ fa_dic[key][int(columns[1])-1:int(columns[2])]]) else: tsv_dict[key].append([columns[4], columns[5], 'D']) tsv_dict[key].append([columns[1], columns[2], 'I', \ fa_dic[key][int(columns[4])-1:int(columns[5])]]) # for ctg in tsv_dict: tsv_dict[ctg] = sorted(tsv_dict[ctg], key=lambda x: (int(x[0]), int(x[1]))) return tsv_dict if __name__ == "__main__": if len(argv) != 4: exit("Usage: python manual_genome.py [raw_contig] [dispose.bed] [disposed_contig]\n\n\n" + \ "Like BED format [1_index]: \nChr\tstart\tEND\tkeyword\nkeyword: 'D': deletion; 'RC' : reverse_complement; " + \ "'M': move seqs from one pos to another pos(in same contig)." + "\n" + \ "If keyword = 'I', the line with 5 columns: Chr start end I [insertion_seq].\n"+ \ "If keyword = 'M', the line with 7 columns: Chr start1 end1 M start2 end2 [+|-].\n"+ \ "If 7thcol in M lines = '+', means move seqs from start1 to start2, else move seqs from start2 to start1\n") fa_dic = read_fasta_file(argv[1]) dispose_dic = read_tsv_file(argv[2], fa_dic) ouf_w = open(argv[3], "w") for contig in fa_dic: raw_Seq = fa_dic[contig] gap_bp = 0 for items in dispose_dic[contig]: # Deletion if items[2] == "D": raw_Seq = raw_Seq[:int(items[0])-1-gap_bp] + raw_Seq[int(items[1])-gap_bp:] gap_bp += int(items[1]) - int(items[0]) + 1 print(gap_bp) # reverse_complement if items[2] == "RC": raw_Seq = raw_Seq[:int(items[0])-1-gap_bp] + \ reverse_complement(raw_Seq[int(items[0])-1-gap_bp:int(items[1])-gap_bp]) + \ raw_Seq[int(items[1])-gap_bp:] if items[2] == "I": raw_Seq = raw_Seq[:int(items[0])-1-gap_bp] + \ items[3] + \ raw_Seq[int(items[1])-1-gap_bp:] gap_bp -= len(items[3]) ouf_w.write(">"+contig+"\n"+raw_Seq+"\n") ouf_w.close()
如发现任何问题,恳请联系我纠正。