Friday 9 November 2012

Estimating the mean insert size in a BAM file

If you have a very large BAM file of read-pairs mapped to a genome, it can often be interesting to have an estimate of the mean insert size for the read-pairs. To do this, you can use my script find_bam_insertsize.pl which estimates the mean, median, standard deviation and interquartile range of insert sizes in a given BAM file.

To do this, the script takes the first one million read-pairs in the BAM file, and then takes the subset of those that are 'proper' read-pairs (are mapped in the correct orientation and within a defined insert size, so have flags of 99, 147, 83, or 163 in the BAM file, as described in this nice blogpost). The mean, median, standard deviation, and interquartile range of the insert sizes of these proper read-pairs is calculated.

Using Picard CollectInsertSizeMetrics.jar
Another way to do the same thing is to use Picard CollectInsertSizeMetrics.jar
[Note to self: I need to run this on pcs4, by first typing 'ssh -Y pcs4', as it seems to need to write a graphical output to the screen.]
% /software/bin/java -Xmx128M -jar /software/varinf/releases/PicardTools/picard-tools-1.77/CollectInsertSizeMetrics.jar INPUT=input.bam OUTPUT=out.metrics HISTOGRAM_FILE=out.hist
where input.bam is the input bam file, out.metrics is the output file with insert size metrics, out.hist is the output file with a histogram of insert sizes.

The output file 'out.metrics' has the insert size metrics, which we can pull out:
% cut -f1,5,6,13,22 out.metrics | more
MEDIAN_INSERT_SIZE    MEAN_INSERT_SIZE    STANDARD_DEVIATION    WIDTH_OF_50_PERCENT
399    395.536998    72.923077    79
This tells us that the median insert size is 399 bp, the mean is 395.536998, the standard deviation is 72.923077 and the IQR is 79 bp.

No comments: