最近在跑RNA-seq碰到一个自己挖的坑,samtools将sam文件转为二进制的bam文件并排序,默认是按照pos进行排序的。htseq-count默认的输入sam|bam是按照name排序的,这样就导致了一个问题,htseq-count会把不在相邻行的paried-end reads当成一条序列:
Warning: 16607642 reads with missing mate encountered.
后果是导致虚假的输出: Notice that your BAM file is probably sorted by position, but you're not using the -r pos option. That will most likely result in bogus output.
我们需要设置htseq-count
的-r pos
参数,或者在对bam文件进行排序时,保留按name
进行排序。
Usage: samtools sort [options...] [in.bam]
Options:
-l INT Set compression level, from 0 (uncompressed) to 9 (best)
-u Output uncompressed data (equivalent to -l 0)
-m INT Set maximum memory per thread; suffix K/M/G recognized [768M]
-M Use minimiser for clustering unaligned/unplaced reads
-K INT Kmer size to use for minimiser [20]
-n Sort by read name (not compatible with samtools index command)
-t TAG Sort by value of TAG. Uses position as secondary index (or read name if -n is set)
-o FILE Write final output to FILE rather than standard output
-T PREFIX Write temporary files to PREFIX.nnnn.bam
--no-PG do not add a PG line
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
-O, --output-fmt FORMAT[,OPT[=VAL]]...
Specify output format (SAM, BAM, CRAM)
--output-fmt-option OPT[=VAL]
Specify a single output file format option in the form
of OPTION or OPTION=VALUE
--reference FILE
Reference sequence FASTA FILE [null]
-@, --threads INT
Number of additional threads to use [0]
--write-index
Automatically index the output files [off]
--verbosity INT
Set level of verbosity