Tuesday 27 November 2012

Using samtools to manipulate SAM and BAM files

I've been using samtools to manipulate bam and sam files during the last few days, and decided to write a few notes so that I won't forget. Here are the commands that I find most useful:

samtools faidx
samtools faidx my.fasta 
   make an index file for input fasta file "my.fasta", called "my.fasta.fai"

samtools index
samtools index in.bam in.bam.bai
   make an index file for input fasta file "in.bam", called "in.bam.bai"

samtools sort
samtools sort in.bam in_sorted
    sorts a bam file "in.bam", to make sorted bam file "in_sorted.bam"

samtools view
samtools view -h in.bam > out.sam
    convert a bam file "in.bam" to sam format file "out.sam"
samtools view -h in.bam scaffold1:155-200 > out.sam
    take reads mapping to positions 155-200 on scaffold1, and convert to sam format file "out.sam". To do this, you need index file "in.bam.bai" (see above)
samtools view -h -o out.sam in.bam scaffold1:155-200
    take reads mapped to 155-200 on scaffold1, and make "out.sam", including a header. To do this, you need index file "in.bam.bai" (see above)
samtools view -S -bt my.fasta.fai in.sam > out.bam
    convert a sam file "in.sam" to a bam file "out.bam". To do this, you need index file "my.fasta.fai" (see above)
samtools view -bT my.fasta in.sam > out.bam
    convert a sam file "in.sam" to a bam file "out.bam". To do this, you need fasta file "my.fasta".
samtools view -b in.bam scaffold1:155-200 > out.bam
     take reads mapping to positions 155-200 on scaffold1, and put in an output bam file "out.bam".
samtools view -b -s 0.60 input.bam > out.bam
     randomly samples 60% of reads in the input.bam, and puts them into output bam file "out.bam".
samtools view -H in.bam
    view only the header for bam file "in.bam"
samtools view -H -S in.sam
    view only the header for input sam file "in.sam"
samtools view -L my.bed in.bam
    get alignments that overlap positions in bed-format file my.bed, for bam file "in.bam"
samtools view -c in.bam
    count the number of read alignments in bam file "in.bam"
samtools view -f 2 -q10 in.bam > out.sam
    take just the read-pairs that are 'proper pairs' (-f 2: gives 'proper pairs'), with minimum mapping quality of 10
samtools view -f 0x400 in.bam
    take just the reads that are marked as duplicates in your bam
samtools view -F 0x400 in.bam
    take just the reads that are not marked as duplicates in your bam  
(samtools view -H in.bam; samtools view in.bam | grep -w NH:i:1) | samtools view -bS - > out.bam
take just the uniquely mapping reads, that map with MAPQ score of >=30 
bwa mem genome.fasta reads_fastq.gz mates_fastq.gz | samtools view -b -S -F 4 - > output.bam
   run 'bwa mem' to align reads and their mates to a genome, and pipe the output through samtools to just take the reads that align (using -F4) and discarding reads that don't align (to save disk space), and put the output in output.bam

samtools mpileup
samtools mpileup in.bam
    get mpileup output for bam file "in.bam"
samtools mpileup -l my.bed in.bam
    get mpileup output for reads that overlap positions in bed-format file my.bed, for bam file "in.bam"

samtools merge
samtools merge out.bam in1.bam in2.bam
    merge two BAM files, in1.bam and in2.bam, to create out.bam  

samtools idxstats 
samtools idxstats in.bam
    gives the length, number of mapped reads, and number of unmapped reads for each scaffold/chromosome for which there is data in in.bam, etc.

samtools flagstat
samtools flagstat in.bam
    gives statistics about your bam file, such as the total number of reads, number of mapped reads, number of properly paired reads, number of reads marked as duplicates.

Converting BAM to Fastq and vice versa
Apparently, you can use Picard's SamToFastq to convert bam/sam files to fastq format. 

No comments: