Monday 25 April 2022

Using mash to compare genome assemblies

I wanted to compare a set of 390 bacterial genome assemblies (for the bacterium Vibrio cholerae) to a set of 1664 genome assemblies, to see if there are any assemblies that are identical (or almost identical) between the two sets.

My colleagues in the Thomson group at Sanger mentioned the software Mash to me for this, which was published by Ondov et al 2016

Mash reduces large sequences and sequences sets (e.g. a genome assembly) to small, representative 'sketches', from which global mutation distances can be rapidly estimated. To create a sketch, each k-mer in a sequence is hashed. When 'mash sketch' is run, it automatically assesses the best k-mer size to use (see here for details). Sketches of different sizes can be compared using 'mash dist'.

There is a nice documentation for mash online.

Loading mash
To run mash, first I loaded the mash software (only necessary at Sanger):
% module avail -t | grep mash
mash/2.1.1--he518ae8_0
% module load mash/2.1.1--he518ae8_0
 
Creating sketches
Then,  I created 'sketches' for the set of 1664 genome assemblies (which all ended in *.fas), using a shell script like this:
#!/bin/sh
for i in `ls *.fas`
do
    echo "$i"
    mash sketch $i
done
This ran fine, and took about 35 minutes to run for the 1664 bacterial genome assemblies. This created a .msh (sketch) file for each of the assembly (.fas) files.
 
I then ran a similar script to create sketch files for the set of 390 genome assemblies.

Looking at the information on a sketch file
You can look at the information on a sketch file by typing something like:
% mash info THSTI_V12.contigs_spades.fasta.msh
You will see something like:
 Header:
  Hash function (seed):          MurmurHash3_x64_128 (42)
  K-mer size:                    21 (64-bit hashes)
  Alphabet:                      ACGT (canonical)
  Target min-hashes per sketch:  1000
  Sketches:                      1

Sketches:
  [Hashes]  [Length]  [ID]                            [Comment]

  1000      4042874   THSTI_V12.contigs_spades.fasta  [43 seqs] .THSTI_V12.1  [...]

Comparing genome assemblies using mash
Next, I compared pairs of genome assemblies (one from the set of 1664 assemblies versus one from the set of 390 assemblies), using mash, e.g.
% mash dist W2_T6.fasta.msh W1_T1.fasta.msh
W2_T6.fasta     W1_T1.fasta     0.000830728     0       966/1000
The results are tab delimited lists of Reference-ID, Query-ID, Mash-distance, P-value, and Matching-hashes. So, in the example above, the mash distance is 0.000830728.
 
Combining lots of sketch files using 'mash paste'
If you have a lot of sketch files, you may want to combine them using 'mash paste' into one large sketch file. You can do this as long as they have the same k-mer (you can find out their k-mer using 'mash info'; see above). 
 
For example, I had 1664 *msh files with a kmer size of 21. I first made a file with the list of all these .msh files:
% ls *msh > 1664_msh_files
Then I combined them into one large sketch file called 'combined.msh' by typing:
% mash paste combined -l 1664_msh_files
 
I wanted to compare these 1664 msh files to another set of 390 msh files. So I made a combined msh file using 'mash paste' for the 390 msh files. Then I can compare the combined msh file for the set of 1664 assemblies, to the combined msh file for the set of 390 assemblies, by running 'mash dist' on the two combined sketch files. Note that this is MUCH FASTER than using 'mash dist' to compare each of the 390 msh files for the first set of assemblies, to each of the 1664 msh files for the second set of assemblies!
 

 


No comments: