Friday 6 May 2016

Assessing read depth variation based on a bam file

I have a BAM file of sequencing reads mapped to an assembly (an assembly that was itself made from those sequencing reads), and would like to know whether the read depth varies along the scaffolds of the assembly. How can this be done?

My former colleague, the great Bernardo Foth, investigated this, and came up with some nice scripts to do it. He said that the best way is to use 'bedtools genomecov': it clearly shows the best agreement with manual counting of reads in a BAM file, much better than 'samtools mpileup' or 'samtools depth'. He found that 'samtools depth' appears to have a maximum reported depth of ~8070, and that 'samtools mpileup' also underestimates the maximum depth (by a factor of up to 10).
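As a quick sanity check on any of these tools, you can manually count the reads covering a single position with 'samtools view -c' (the scaffold name and position here are hypothetical; note this counts all overlapping reads without any filtering, so it may differ slightly from a depth tool's filtered count):
% samtools view -c mybam_sorted.bam scaffold1:5000-5000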

Bernardo wrote a couple of nice wrapper scripts to use 'bedtools genomecov' for this purpose:

Bernardo's first script : using 'bedtools genomecov' to calculate read depth
(Note to self: this is /nfs/users/nfs_b/bf3/bin/bedtools_genomecov.quick.forDepthOnly.sh).
Bernardo's first script is:

file=$1
# report the per-base depth (-d) at every position, using the fasta index as the genome file
bedtools genomecov -d -ibam $file -g ref.fa.fai > $file.bedtools_genomecov.forDepth.txt
# compress with bgzip so that tabix can index the output
bgzip $file.bedtools_genomecov.forDepth.txt
# index by sequence name (column 1) and position (column 2)
tabix -s 1 -b 2 -e 2 -f $file.bedtools_genomecov.forDepth.txt.gz



This first script takes a BAM file as its input, eg. mybam.bam.
The input BAM must be sorted by position; a 'samtools sort' should suffice, eg.:
% samtools sort -o mybam_sorted.bam mybam.bam
This will produce a file mybam_sorted.bam.
(Note to self: the BAM files from the path-dev team's mapping pipeline are coordinate-sorted already. If a BAM file is coordinate-sorted, it should have 'SO:coordinate' in the header, which you can see using 'samtools view -H'.)
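For example, for a (hypothetically named) coordinate-sorted BAM, the first header line should look something like this:
% samtools view -H mybam_sorted.bam | head -1
@HD	VN:1.4	SO:coordinate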

The script also expects the genome assembly index file to be present and called ref.fa.fai. You can make this from your assembly fasta file ref.fa using 'samtools faidx ref.fa'.
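With the sorted BAM and ref.fa.fai in place, the first script can be run like this (a sketch, assuming you have saved the script as bedtools_genomecov.quick.forDepthOnly.sh, and using the hypothetical BAM file name from above):
% samtools faidx ref.fa
% bash bedtools_genomecov.quick.forDepthOnly.sh mybam_sorted.bam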

The 'bedtools genomecov' options used are:
-d: Report the depth at each genome position,
-ibam: the input BAM file,
-g: the genome assembly file.
The output produced from 'bedtools genomecov' will have a name ending in 'bedtools_genomecov.forDepth.txt', eg. mybam_sorted.bam.bedtools_genomecov.forDepth.txt. It looks like this (tab-delimited, with 3 columns):
scaffold scaffold_position read_depth_at_scaffold_position
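For example, the first few lines might look like this (hypothetical scaffold name and depth values):
scaffold1	1	35
scaffold1	2	36
scaffold1	3	36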

We want to run tabix on this, to make a tabix index file.
To run tabix, we must first compress the data using 'bgzip', which makes a file ending in '.gz' (here, '.txt.gz').
The tabix options used are:
-s 1: says the sequence name is in column 1,
-b 2: says the start column is column 2,
-e 2: says the end column is column 2,
-f: forces it to overwrite the output index file (if already present). 
Tabix will make an output index file ending in '.txt.gz.tbi', eg. mybam_sorted.bam.bedtools_genomecov.forDepth.txt.gz.tbi.
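Once the index has been made, tabix can quickly pull out the depth for any region without reading the whole file, eg. (hypothetical scaffold name):
% tabix mybam_sorted.bam.bedtools_genomecov.forDepth.txt.gz scaffold1:10000-10010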

Bernardo said that for a 15-Gbyte BAM file, this script takes ~30 minutes to run and requires 150 Mbyte of RAM.

I'm not sure why it took so long to run for me: for a 210-Gbyte BAM file, it took 2.45 hours to run, and required about 50 Mbyte of RAM. Maybe the run time depends not just on the BAM file size, but also on the assembly size(?)
(Note to self: to find the size of a BAM file if you have a soft-link to it, use 'du -hD'.)

Bernardo's second script : calculating the read-depth in 10-kb bins (windows) along scaffolds
(Note to self: this is /nfs/users/nfs_b/bf3/bin/bedtools_genomecov.quickDPanalysis.perBin_02.sh).

file=$1
binsize=$2
if (( $# < 2 ))
   then
   echo
   echo "Usage:  $0  <bedtools genomecov output file>  <binsize>"
   echo
   echo "file: e.g. pre.realigned.bam.bedtools_genomecov.forDepth.txt.gz (bgzipped and tabix-indexed)"
   echo "bin size: e.g. 10000"
   echo
   echo "bsub this script with 500MB memory"
   echo
   exit
fi
if (( $binsize > 1 ))
     then echo binsize = $binsize provided, ok
     else binsize=10000
fi

IN=$file
OUT=$IN.$binsize.txt
rm -f $OUT
echo -e "#seq\tstart\tmedianDP\tmeanDP\tbinsize" > $OUT
# loop over the scaffolds ('tabix -l' lists the sequence names in the index)
for sc in `tabix -l $IN`; do
    rm -f TEMPTEMPTEMP.*
    # split the per-base depth lines for this scaffold into chunks of
    # $binsize lines, i.e. one temporary file per bin
    tabix -h $IN $sc | split -l $binsize -a 5 -d - TEMPTEMPTEMP.
    for chunk in TEMPTEMPTEMP.*; do
        len=`cat $chunk | wc -l`          # number of positions in this bin
        start=`head -1 $chunk | cut -f 2` # start position of this bin
        # median and mean of the depth column (column 3)
        depth=`cut -f 3 $chunk | perl ~bf3/bin/statsTuples_02.pl median mean | sed 's/median=//' | sed 's/mean=//' | sed 's/,/\t/'`
        echo -e "$sc\t$start\t$depth\t$len" >> $OUT
    done
done
rm -f TEMPTEMPTEMP.*


This second script takes as its input a 'bedtools genomecov' file that has been bgzipped and tabix-indexed (ending in '.txt.gz', eg. mybam_sorted.bam.bedtools_genomecov.forDepth.txt.gz; the corresponding '.txt.gz.tbi' index file must be present alongside it). It also requires that the user specifies a binsize, eg. 10,000 bp.
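For example, it could be run like this (a sketch, using the hypothetical file names from above; as its usage message says, it is best submitted with bsub and 500 MB of memory):
% bash bedtools_genomecov.quickDPanalysis.perBin_02.sh mybam_sorted.bam.bedtools_genomecov.forDepth.txt.gz 10000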

For a 15-Gbyte BAM file, Bernardo said that this script takes <10 minutes to run, and requires 10 Mbyte of RAM.

Note that the statsTuples_02.pl script used here requires the Perl module Discrete.pm, so the directory containing that module must be on your PERL5LIB (PERL5LIB entries are directories, not individual .pm files):
% export PERL5LIB=$PERL5LIB:/nfs/users/nfs_a/alc/Perl/library

This makes an output file named after the input plus the bin size (eg. mybam_sorted.bam.bedtools_genomecov.forDepth.txt.gz.10000.txt; you could rename it to something ending in '.DPperBin.txt', eg. mybam_sorted.bam.bedtools_genomecov.forDepth.txt.gz.DPperBin.txt, as used below), in which the columns are:
seq : scaffold name in your assembly
start : start of the bin region in the scaffold
medianDP : median read depth in the bin region
meanDP : mean read depth in the bin region
binsize  : binsize used
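For example, the first few lines might look like this (hypothetical values):
#seq	start	medianDP	meanDP	binsize
scaffold1	1	35	36.2	10000
scaffold1	10001	33	34.8	10000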

My notes on this script:
- 'tabix -l' lists the scaffold names. This script loops through the scaffolds.
- 'tabix -h $IN $sc' retrieves all the lines (one per genomic position) for scaffold $sc; the -h option also prints any header lines.
- The 'split -l $binsize -a 5 -d' command splits those lines into chunks of $binsize lines each, writing each chunk to a temporary file (TEMPTEMPTEMP.00000, TEMPTEMPTEMP.00001, ...; -d gives numeric suffixes, and -a 5 makes them 5 characters long). This effectively makes one temporary file for each bin along a scaffold.
- For each bin along a scaffold, it gets the bin's length (using 'cat $chunk | wc -l'), the start point of the bin (using 'head -1 $chunk | cut -f 2'), and the median and mean depth (using 'cut -f 3 $chunk | perl ~bf3/bin/statsTuples_02.pl median mean', with sed commands to strip the 'median=' and 'mean=' labels and turn the comma into a tab).
- The output information for each bin is written into the overall output file.

Bernardo's third script : calculating the mean and median read depth for each scaffold
(Note to self: /lustre/scratch108/parasites/bf3/strongyloides/boysGirls/sr.may2014/smalt.y0.75xr0.L2/bedtools_genomecov.quick.forDepthOnly.mean_median.sh)

This script is: 
zcat bam.sorted.markdup.bam.bedtools_genomecov.forDepth.txt.gz | groupBy -g 1 -c 3,3 -o mean,median > bam.sorted.markdup.bam.bedtools_genomecov.forDepth.mean_median.txt

It assumes your input file is called 'bam.sorted.markdup.bam.bedtools_genomecov.forDepth.txt.gz'. It uses 'bedtools groupBy' with these options:
-g 1 : use column 1 (scaffold name) for grouping
-c 3,3 : summarise based on column 3 (the read depth at each position)
-o mean,median : write out the mean and median for each scaffold
This makes an output file 'bam.sorted.markdup.bam.bedtools_genomecov.forDepth.mean_median.txt'.
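The output has one line per scaffold, with the columns scaffold name, mean depth, median depth (in the order given to -o), eg. (hypothetical values):
scaffold1	36.21	35
scaffold2	12.04	11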

Calculating the median read depth across all scaffolds
To calculate the median read depth across all scaffolds, Bernardo did this:
% cat bam.sorted.markdup.bam.bedtools_genomecov.forDepth.txt.gz.DPperBin.txt | cut -f3 | perl  ~/bin/statsTuples_02.pl median
This takes the median depths for all the bins, and calculates their median (ie. a median of per-bin medians, which approximates the overall median read depth).
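Given the 'median=' label that the second script strips off with sed, the output of this command should be a single line looking something like this (hypothetical value):
median=35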

Acknowledgements
A big thank you to Bernardo Foth, who worked all this out and left great notes behind for us to follow!
