Monday, 15 April 2013

Making fast genome-scale DNA alignments using nucmer

The nucmer software can be used to make fast DNA alignments of large genome-size sequences. It is a fast alternative to BLAST for searching for matches between large fasta files, for example, for comparing two related genomes or two assemblies for the same species.

The program was described in a paper from Steven Salzberg's group by Delcher et al. (2002), published in Nucleic Acids Res. The nucmer program makes use of MUMmer, an algorithm by the same group, which finds all "maximal unique matches" (MUMs) between two genomes. That is, it finds all subsequences of at least k bp (k = 20 by default) that are identical between two genomes. Thus, a MUM can be 20 bp or longer.

The paper says that nucmer works by:
1) first running MUMmer that are identical between two genomes) to find all exact matches between two genomes.
2) then running a clustering algorithm for all the MUMs along each scaffold. MUMs are clustered together if they are separated by no more than a user-specified distance.
3) then running a modified Smith-Waterman algorithm to align the sequences between the MUMs in a cluster. The algorithm permits only limited mismatches in the gaps between MUMs. The user can specify the amount of mismatch permitted.

Running nucmer
To run nucmer you can type:
% nucmer -p <prefix> <database> <query>
where <prefix> is the prefix that you want to give to the nucmer output files (by default this is 'out'),
<database> is the fasta file for the genome assembly that you want to search against,
<query> is the fasta file for the genome assembly that you want to use as your query.
Nucmer will then produce an output file called <prefix>.delta. For example, if you used 'out' as <prefix>, then the output file will be called out.delta.

Running show-coords
Two other useful programs that come with nucmer are called delta-filter and show-coords.

You can convert a delta-format file (either from nucmer or delta-filter, see below) to a tab-delimited file of matching regions by using the show-coords program, eg.
% show-coords -dTlro out.delta.filter > out.coords
where
-d:  displays the alignment direction in an additional column called 'FRM',
-T: makes the output file tab-delimited,
-l: includes sequence length information in the output,
-o: annotates maximal alignments between two sequences, i.e. overlaps between database and query sequences (these are marked as '[CONTAINED]' in the output),
-r: sorts the output lines by database IDs and coordinates.

show-coords will give an output that looks something like this:
NUCMER
    [S1]     [E1] |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  | [TAGS]
=============================================================
       5      104  |      100        5   |      100       96   |    96.00   | seq3db    seq1query
      29     128  |        1      100   |      100      100  |   100.00  | seq3db    seq1query
       1      125  |       19      142  |      125      124  |    99.20   | seq2db    seq1query

This means that:
(i) there is a match from positions 1-100 of the query sequence 'seq1query', that matches positions 29-128 of the database sequence 'seq3db',
(ii) there is a match from positions 5-100 of the query sequence 'seq1query', where the negative strand of this region matches positions 5-104 of 'seq3db',
(iii) there is a match from positions 19-142 of 'sequ1query', to 1-125 of 'seq2db'.

If you had used the -o option, you would see:
    [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  | [TAGS]
=============================================================

        5      104  |      100        5   |      100       96   |    96.00   | seq3db    seq1query    [BEGIN]
      29      128  |        1      100   |      100      100  |   100.00  | seq3db    seq1query    [END]
       1      125   |       19      142  |      125      124  |    99.20   | seq2db    seq1query    [CONTAINED]

The '[CONTAINED]' tag shows that the last match, at position 19-142 in the query 'seq1query', overlaps with another match in this query. 

If you had used the -r option, you would see:
    [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  | [TAGS]
=============================================================

        1      125  |       19      142  |      125      124  |    99.20   | seq2db    seq1query
       5      104   |      100        5   |      100       96   |    96.00   | seq3db    seq1query
      29      128  |        1      100   |      100      100  |   100.00  | seq3db    seq1query

This has the hits sorted by database sequence (so hits to 'seq2db' first, then hits to 'seq3db'), and then with respect to coordinate within the database sequence (so a hit to 5-104 in seq3db, then a hit to 29-128 in seq3db).

Running delta-filter
After running nucmer (and before running show-coords, if you like), you can use delta-filter to filter the delta file, to exclude poorly-matching regions. You can run it by typing:
% delta-filter -i <min_aln_id> -l <min_aln_len> -q
where <min_aln_id> is the percent identity for alignments to be kept,
<min_aln_len> is the minimum length for alignments to be kept,
and the -q option means each position of each query is mapped to its best hit in the database, allowing for overlaps in the database (but discarding poorer overlapping alignments with respect to the query).

For example:
% delta-filter -i 95 -l 200 -q out.delta > out.delta.filter
This takes hits with at least 95% identity, that are at least 200 bp long, and takes the best non-overlapping hits with respect to the query. 

The out.delta.filter file is in the same format as the out.delta file, but should have less matches, as some have been filtered out.

The -q option can be used to find the best non-overlapping alignments with respect to the query. For example, say nucmer gave the following alignments (which we can view with show-coords):
    [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  | [TAGS]
=============================================================

        5      104  |      100        5   |      100       96   |    96.00   | seq3db    seq1query    [BEGIN]
      29      128  |        1      100   |      100      100  |   100.00  | seq3db    seq1query    [END]
       1      125   |       19      142  |      125      124  |    99.20   | seq2db    seq1query    [CONTAINED]

We could filter the out.delta file using delta-filter with the -q option, and then view the output of delta-filter using show-coords, and we would get: 
    [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  | [TAGS]
=============================================================

       29      128  |        1      100   |      100      100  |   100.00  | seq3db    seq1query
       1      125    |       19      142  |      125      124  |    99.20   | seq2db    seq1query

The match from 100-5 (of 96% identity) in seq1query has been discarded, as it overlaps with the hit from 1-100 (of 100% identity) in seq1query, which had higher percent identity and was longer.

Maximum input sequence length for nucmer
If you use a sequence that is too long as a nucmer input sequence, you will get an error message like this:
'ERROR: mummer and/or mgaps returned non-zero, please file a bug report'
It turns out that this happens when the input sequence is longer than 536,870,908 bp, as I found out from this mummer sourceforge page.
So it seems that nucmer can only handle input sequences of less than about 535 Mb.

Comparison to BLAST
In practice, I have heard that nucmer tends to join together nearby matching regions into one long alignment, where BLAST would report them as several separate alignments. The obvious advantage of nucmer over BLAST is that nucmer is faster. A disadvantage however is that nucmer may miss some short matches of low percent identity (since all MUMs must contain at least 20 identical bases in a row). Another disadvantage is that, unlike BLAST, nucmer does not give an e-value for a match.

I tried comparing the run-time, output file size, and RAM used for nucmer and BLAST for the same query assembly and target assembly. As an example, for a  92 Mb query assembly (89 Mbyte size), and a 87 Mb target assembly (83 Mbyte size), BLAST took 1.5 hours to run, used 150 Mbyte of RAM, and made a 538 Mbyte (m8 format) output file. In contrast, nucmer took ~10 minutes to run, used 1440 Mbyte of RAM, and made a 4.1 Mbyte output (nucmer delta) file. So nucmer took ~10% as long to run, and the output file is about 1% of the size, but it needed ~10 times more RAM to run.

Thanks to Martin Hunt for advice on nucmer.

No comments: