Thursday 3 April 2014

Using Reapr to assess genome assembly quality

I've been learning to use my colleague Martin Hunt's fantastic Reapr software to assess quality of genome assemblies.

Reapr is described in a recent paper by Martin.

Running Reapr

There are detailed instructions on how to run Reapr in the Reapr manual. I've written a few notes here to remind myself of some key points:

1. Reapr facheck
To check whether the scaffold/contig names in your assembly will be acceptable to Reapr, type:
% reapr facheck assembly.fa
where assembly.fa is your assembly file.

2. Mapping your large-insert reads
You will need a bam file of your mapped large-insert read-pairs. The bam file should be sorted by coordinate, indexed, and have duplicates either marked or removed [note to self: this is true for the bam files made by Daria for the 50 HG QC steps]. It's recommended to use the smalt mapper with -x -r options [note to self: this is true for the bam files made by Daria for the 50 HG QC steps].
     Reads in a pair should be pointed towards each other (should be 'innies'): if you're not sure if this is the case, you can run 'bamcheck input.bam', and this will tell you the number of 'inward oriented pairs' and 'outward oriented pairs'. You should have all or nearly all inward oriented pairs. If your reads are mainly outties (which may be the case for large-insert reads), you will need to get their reverse complement, and map them again, to make sure they are innies for Reapr (you could use Reapr to do the mapping step). 
     Reapr can map your reads for you, using smalt, if you don't have a bam file already (see the the Reapr manual). 
     Note: if you have bam files from several libraries, you can combine them into one bam, as long as the libraries have approximately the same insert size distribution. If you need to choose from multiple long-insert libraries, the Reapr FAQ says to choose the longest with enough coverage.
     Note: the Reapr website recommends to use version 0.7.0.1 of SMALT with the -f samsoft option [note to self: the bam files made by Daria for the 50 HG QC steps were made using -f samsoft, but with smalt 0.7.4].

3. Reads for calling error-free bases
Reapr can take fastq files of short-insert read-pairs, to call error-free bases in the assembly (actually usually short-insert reads are used for this, but large-insert reads could be too if you don't have any short-insert reads). Like the large-insert read-pairs (see above), these read-pairs should be 'innies'.
     You don't map these reads for you, Reapr maps them itself in the 'reapr perfectmap' step (see below).
     Note: if you have fastq files from several libraries, you can combine them (into one for forward reads, one for reverse), as long as the libraries have the same insert size distribution.

4. Calling error-free bases in the assembly
If you want to call error-free bases in the assembly, run:
% reapr perfectmap assembly.fa short_1.fq short_2.fq i_size perfect_prefix
where assembly.fa is your assembly file,
short_1.fq, short_2.fq are your fastq files for your short-insert read-pairs,
i_size is the insert-size for your short-insert read-pair library,
perfect_prefix is a prefix that is given to the output files.
The files short_1.fq and short_2.fq can be zipped files, eg. short_1.fq.gz and short_2.fq.gz.
Note: it's important that all the reads are the same length.
Note 2: for very large genomes (over a few 100 Mbase), you might want to use the 'perfectbam' step instead (see the Reapr manual for details).

5. Running the Reapr pipeline
To run the main Reapr pipeline, you type:
% reapr pipeline assembly.fa long_mapped.bam outdir perfect_prefix
where assembly.fa is your assembly file,
long_mapped.bam is your bam file of mapped long-insert read-pairs (from step 2 above),
outdir is the name that Reapr will give to the output directory,
perfect_prefix is the prefix given to the output files from step 4 (see above), if step 4 was run (this parameter is optional).
    Note that the stages of the 'reapr pipeline' command are 'preprocess', 'stats', 'score' and 'break'. These will be run with one call to 'reapr pipeline'. Alternatively,  you can run them separately (by running 'reapr preprocess', 'reapr stats', etc.).
    Note: the Reapr manual says that if you don't have any short-insert reads for running step 4, you might want to skip step 4, and so skip the 'perfect_prefix' option in 'reapr pipeline', as long-insert reads probably won't give you very accurate error-free bases in step 4.


6. Viewing the output files
The output files will be in the directory 'outdir' created by step 5 above. The most important are:
- 03.score.errors.gff : a report of the errors found,
- 04.break.broken_assembly.fa : a new version of the assembly, with scaffolds broken based on the errors found,
- 05.summary.report.txt : a summary of the errors found in the assembly, plus contiguity statistics (N50, etc.) of the original and broken assemblies.

The 4 types of errors in 05.summary.report.txt are:
1. FCD errors within a contig
2. FCD error over a gap
3. Low fragment coverage within a contig
4. Low fragment coverage over a gap

7. Sanity checks
The Reapr manual recommends several sanity checks:
- look at file 00.Sample/gc.vs.cov.lowess.pdf and check the GC/coverage bias looks ok,
- look at file 00.Sample/insert.in.pdf and check it shows the insert size distribution for your inserts,

- look at file 00.Sample/insert.stats.txt and check the numbers look ok,

Running Reapr using Martin's reapr_wrapper script (Sanger users only)

Sanger users could run Reapr using the reapr_wrapper script, following these steps:
1. Make a template config file called config_file by typing:
% ~mh12/git/python3/reapr_wrapper.py -t config_file

2. Edit the config file for your data. The main things to change are:
- the lane id. for your short-insert library, eg.
  short_insert_ID 6937_6
- the insert size for your short-insert library, eg. (for mean insert size 510)
  short_insert_isize 510 [note to self: if you already have a bam, you can get this using bamcheck]
- the lane id. for your large-insert library, eg.
  large_insert_ID 6980_8
- the insert size (-i option) to use in smalt for your large-insert library, eg. (for mean insert size 2498, with sd of 813, we might use 2498+3*813=4937=approx 5000)
  large_insert_map_options -x -r 1 -i 5000 -y 0.5
  (-i specifies the maximum insert size in smalt)
- if the read-pairs from your large-insert library are 'outties', since Reapr needs 'innies', we need to put:
   large_insert_revcomp 1
- the genome fasta file, eg.
  genome_fasta /nfs/helminths02/analysis/50HGP/Parastrongyloides.trichosuri/ASSEMBLY/PTRK.v2.QC.fa_v2
- the output directory (must not exist already):
  output_directory output_directory /lustre/scratch108/parasites/alc/StrongyloidesReapr/p_trichosuri/reapr_output/

3. Run reapr by typing:
% reapr_wrapper.py config_file