Thursday 20 December 2012

Cigar alignments and insert sizes in SAM/BAM format

The alignments in a SAM/BAM file are stored in 'cigar' format, in which 'M' means a match or mismatch, 'I' means an insertion in the read relative to the reference genome, and 'D' means a deletion in the read relative to the reference genome.

For example, the alignment:
ACTGATTTGG- (genome)
|||||   ||
AATGA---AGT (read)
has cigar string 5M3D2M1I.

Bases at the start or end of a read may be 'soft-clipped' (usually because they are low quality sequence). This can be represented as 'S' in the cigar string. For example, the alignment:
ACTGATTTGG- (genome)
   ||   ||
AATGA---AGT (read)
in which the read bases have been soft-clipped, has cigar string 3S2MD2M1I.

Column 4 of a SAM/BAM file is the mapped position of a read. This contains the position of the reference genome that the first non-softclipped base of the read is mapped to (according to the SAM format specification). That is, it ignores soft-clipped bases at the start of the read. Therefore, if you soft-clip more bases at the start, you must change column 4.
       For example, the following alignment has cigar '3S2M3D2M1I' and start position 100:
ACTGATTTGG- (genome)
   ||   ||
AATGA---AGT (read)
If we soft-clip another base at the start of the alignment, we get cigar '4S1M3D2M1I' and start position 101:
ACTGATTTGG- (genome)
    |   ||
AATGA---AGT (read)

Column 8 of a SAM/BAM file is the mapped position of the mate. Therefore, if we changed the mapped position of a read x (by soft-clipping) as above, we must then change column 8 of the SAM/BAM for the mate of read x.

Column 9 of a SAM/BAM file has the inferred insert size of a read-pair. According to the SAM format specification this is defined by the first position of mapped position of the first read and the last mapped position of the second read (ignoring soft-clipped bases) of a read-pair. Therefore, if we changed the mapped position of a read x (by soft-clipping) as above, we must change column 9 of the SAM/BAM for read x and for the mate of read x.

For example, if you have:
read x: start-position: 144558, cigar: '5S95M'
mate of read x: start-position: 147480, cigar '2S98M'
The left-most mapped base of read x is base 144558 in the reference.
The right-most mapped base of the mate of read x is 147480 + (sum of M/D/N operations in the cigar for the mate of read x) - 1 = 147480 + 98 - 1 = 147577 in the reference.
Therefore, the insert size = 147577 - 144558 + 1 = 3020 bp.

Thanks to my colleague Martin Hunt for discussion. 

Extra note: here is a really useful page I found that explains the bit flags in SAM/BAM files.

No comments: