Friday 2 December 2016

Using samtools mpileup to find SNPs, and vcf-tools to filter SNPs

My colleagues have recently been talking about SNP-finding using 'samtools mpileup', and ways to filter the set of SNPs called by 'samtools mpileup' to throw away the more dodgy SNP calls.

Calling SNPs (and reference bases) using 'samtools mpileup':
You can call SNPs (and reference bases) using 'samtools mpileup' (see the samtools mpileup webpage for details) by typing:
% samtools mpileup in.bam
where 'in.bam' is your input bam file.
The output is in VCF (Variant Call Format).  

The output may look like this:
1       2131    N       2       gG      FF
1       2132    N       2       tT      FF
1       2133    N       2       aA      FF
1       2134    N       2       gG      BF
1       2135    N       2       cC      BF
1       2136    N       2       tT      BF
1       2137    N       2       cC      BF
1       2138    N       2       g$G$    BF
1       2147    N       1       ^]g     F
1       2148    N       1       g       F
The columns are: chromosome, 1-based coordinate, reference base, number of reads covering the site, read bases, and base qualities.

In the example output above we don't see the base in the reference. To see it we must type:
% samtools faidx genome.fa
% samtools mpileup -Q 15 -D -f genome.fa in.bam
where genome.fa is the assembly file.
Then we see something like this:
1       5131    C       2       .,      FE
1       5132    G       2       .,      FF
1       5133    A       2       .,      FF
1       5134    G       2       .,      FF
1       5135    G       2       .,      FF
1       5136    A       2       .,      FF
A dot '.' stands for the reference base on the forward strand, and ',' for the reverse strand.

When you are doing this, you can tell 'samtools mpileup' to only take bases with base alignment quality scores (BAQ scores: these are adjusted base quality scores, which have been reduced for base positions near indels, to help rule out false positive SNP calls due to alignment artefacts near small indels) of 15 or higher by typing:
% samtools mpileup -Q 15 in.bam
To only take cases with reads with mapping quality (MAPQ) of >=30:
% samtools mpileup -q 30 -Q 15 -D -f genome.fa in.bam

You can vary the contents of the output VCF file that you get using the following options:
-D : gives you per-sample read depth at each base position
-S : gives you a P-value for a statistical test for per-sample strand bias at each base position (strand bias occurs if a particular SNP is being called from reads that aligned to one strand but not reads that align to the complementary strand).
-u : gives you a text BCF output (a compressed VCF format)
-g : gives you binary BCF output (a compressed binary version of BCF)

To avoid excessive usage of memory (RAM), we can tell samtools to only use a depth of 1000 reads at each position, using the -d option:
% samtools mpileup -d 1000 in.bam

To combine several options in one command, you could type, for example:
% samtools mpileup -d 1000 -DSug in.bam  
or
% samtools mpileup -Q 15 -D in.bam 

Filtering SNPs using bcftools:
To filter the output of samtools mpileup to just have variant bases (not reference bases), we need to filter the output using bcftools, for example:

% samtools mpileup -u -q 30 -Q 15 -D -f genome.fa in.bam | bcftools view -bvcg - > var.raw.bcf
where the samtools mpileup -u option produces uncompressed BCF output;
bcftools view -b outputs BCf; -v output potential variant sites only; -c calls SNPs; -g calls genotypes at variant sites.
This calls SNPs and small indels.

We can then filter the output file using:
% bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf 
where -D sets the maximum read depth to call a SNP.

Filtering SNPs using vcftools-annotate:
There is also a way to filter SNPs using vcftools-annotate, for example on MinDP (read depth) and RMS MQ (mapping quality). I need to look into this.

A protocol for calling and filtering SNPs
My colleague Diogo Ribeiro previously identified some SNP sites from Illumina sequencing data for a nematode species. Here is a brief description of what he did:

- For each isolate, sequencing reads were mapped to the reference assembly using the SMALT mapper, to produce bam files.

- SNPs and small indels relative to the reference assembly were identified separately for each isolate, using samtools and vcftools, based on a maximum of 1000 reads at each base position ('samtools mpileup -d 1000' option).

- The resulting VCF files (one per isolate) were filtered to remove less reliable variant sites, by removing those that were:
(i) in repetitive regions of the reference assembly, as identified by gem-mappability;
(ii) were in gaps in scaffolds, or within 100 bp of a gap of >=8 bp;
(iii) were within 100 bp of a scaffold start or end;
(iv) had a SNP quality score of less than 50;
(v) had a mapping quality of less than 30;
(vi) had a read depth (including reads from both strands) of less than 20, or had a read depth of less than 10 from a single strand.
Some of this is possible using Diogo Ribeiro's script:
eg.
% custom_filter_vcf_files.py --DP 20 --DP_STRAND 10 --QUAL 50 --MQ 30 vcf vcf_filtered2
applies the filters (iv), (v), (vi) to VCF files in input directory 'vcf', and puts the output in an output directory 'vcf_filtered2' (must exist already).

- The variant density, for all variants and for just homozygous variants, was estimated in non-overlapping 1-kb windows of each scaffold of the reference assembly. When calculating variant density, sites with more than one alternative allele were excluded, as these are probably erroneous variant calls.

Scripts for SNPs (note to self)
My colleagues Diogo Ribeiro and Bhavana Harsha wrote some nice scripts for SNP calling and SNP filtering, see:
(i) vcf-annotate_custom_filters.pl -> can filter a VCF file based on median read depth, and other filters that the user species (uses vcftools-annotate),
(ii) median_dp_filter.pl -> can filter a VCF file based on median read depth for a chromosome,
(iii) snp_calling_improv.pl -> can filter a VCF file based on using gem-mappability to find repetitive regions of a reference assembly,
(iv) custom_filter_vcf_files.py -> can filter a VCF file based on minimum read depth, read depth per strand, SNP quality, or mapping quality,

(v) plotsAlongChr.py -> plots SNP data (from VCF files) along chromosomes,
(vi) GeneSNPAnalyses.py -> analyses distribution of SNPs per gene, and SNPs per kb of gene.

Acknowledgements
Thanks very much to my colleagues Bhavana Harsha and Diogo Ribeiro who figured out lots of these things.

No comments: