从 FASTA 文件中提取特定 ID 的序列

一、说明

在日常的工作中，我们经常会遇到：从某个物种总的 cds 序列或者 pep 序列中提取我们感兴趣特定 ID 序列这样的问题。比如我知道某个基因家族所有序列的 ID 名称，现在我想提取这些 ID 的序列，方便下一步的分析。这个时候就可以使用我今天写的东西了。

二、具体细节

下图这是大豆 pep 文件中的一部分，我们看下大概的格式就行了，具体的细节自己找个类似的文件去看下。每条序列由两部分组成，一部分是>开头的一行，包括这条序列的 ID 和相应的描述性东西，另一部分则是由对应的序列组成(DNA 序列或氨基酸序列，这个由你自己的文件的类别决定)。

大体的思路是这样的，常规的操作是先读取总的 FASTA 文件，将相应的信息提取出来，然后我们根据待提取的基因 ID 再来操作。在读取总的 FASTA 文件的时候，我们可以根据 FASTA 格式开头的>，来判断是否读取到新的序列。

脚本的输入是两个，一个是总的 FASTA 文件，一个是待提取的 ID 列表文件，输出的话就是我们想要 ID

对应的序列了。

#! Usage: python get_seq_from_lst.py [fasta_file] [seq_lst] [ouf_file]
from sys import argv

def read_fa(fa_file):
    tmp_dic = []
    for line in open(fa_file):
        if line[0] == '\n':
            continue
        if line[0] == '>':
            tmp_dic.append([line[1:].strip(), ''])
        else:
            tmp_dic[-1][1] += line.strip()
    tmp_dic = {t_lst[0]:t_lst[1] for t_lst in tmp_dic}

    return tmp_dic

def main():
    seq_ids = [line.strip() for line in open(argv[2])]
    fa_dic = read_fa(argv[1])
    ouf_w = open(argv[3], 'w')
    for seq_id in seq_ids:
        if seq_id in fa_dic:
            ouf_w.write('>'+seq_id+"\n"+fa_dic[seq_id]+"\n")
        else:
            print("%s not in %s file" % (seq_id, argv[1]))
    ouf_w.close()

main()

条条大路通罗马，为了实现某一个目的，可以有非常多的方式，特别是 python 和 perl 这些脚本语言。我们可以用不同的数据结构去实现，然而就是使用相同的数据结构，每个人的思考方式也是不同的。最好的代码就是还没写出来的代码！我们这里并没有使用字典这样非常直观的数据结构，只是用了字符串和列表的方式，下面的是使用哈希表实现的 C++代码，C++还没学多久，所以看起来不是很妙。

# include <iostream>
# include <fstream>
# include <string>
# include <cstdlib>
# include <map>

using namespace std;

int main(int argc, char * argv[]){
    if (argc != 4){
        cerr << "Usage: " << argv[0] << " fasta_file[f] gene_lst[l] ouf_[o] \n";
        exit(EXIT_FAILURE);
    }

    // get fa content
    ifstream fin;
    map<string, string> fa_dict;
    fin.open(argv[1]);
    string line;
    string header;
    string seqence;
    seqence = "";

    while (getline(fin, line)){
        if(line[0] == '>'){
            if (seqence.length() > 0){
                fa_dict[header] = seqence;
                seqence = "";
            }
            header = line;
        }else{
            seqence += line;
            seqence += '\n';
        }
    }
    // last one
    fa_dict[header] = seqence;
    fin.clear();
    fin.close();
    // ok
    // for(auto it : fa_dict){
        //     cout << it.first <<" "<< it.second <<endl;
    // }

    // get gene lst
    fin.open(argv[2]);
    ofstream ouf(argv[3]);

    while(getline(fin, line))
    {
        for(auto it : fa_dict){
            if(it == line){
                ouf << it.first << "\n" << it.second;
            }
        }
    }
    fin.clear();
    fin.close();
    ouf.clear();
    ouf.close();

    return 0;
}