Read Fasta File


  我们在平时的工作中经常会遇到对生物序列进行提取或修改,不管是基因组DNA序列还是编码蛋白序列,它们都是fasta格式。一般都是将fasta序列存储为dict格式再进行操作,下面就介绍下我常用或遇到的一些处理方式:

1.比较原始的逐行处理

#read a fasta file and return a dictionary, the key is entry id and the value is the sequence in upcase
def readFasta(fastaFile):
    import gzip
    f1 = gzip.open(fastaFile, 'rt') if fastaFile.endswith('gz') else open(fastaFile)
    line = f1.readline()
    sequence = ""
    fasta_dict = {}
    header = ""
    while True:
        if line:
            if line[0] == '>':
                if len(sequence) > 0:
                    fasta_dict[header] = sequence
                    sequence = ""
                header = line.strip().replace('>', '').split()[0]
            else:
                sequence += line.strip()
        else:
            break 
        line = f1.readline()
    f1.close()
    if header and sequence:
        fasta_dict[header] = sequence

    return fasta_dict

2.比较优雅的借助Biopython API

def load_fasta(path, subset_chroms=None):
    from Bio import SeqIO
    import gzip
    with gzip.open(path, "rt") if path.endswith(".gz") else open(path) as handle:
        genome = pd.Series(
            {
                rec.id: str(rec.seq)
                for rec in SeqIO.parse(handle, "fasta")
                if subset_chroms is None or rec.id in subset_chroms
            }
        )
    return genome

(未完待续)


文章作者: zhangchaofan
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 zhangchaofan !
评论
  目录