Friday 2 December 2016

HISAT2 aligner for RNAseq data

Previously I used the TopHat aligner for RNAseq data (see my TopHat blogpost) but recently it seems to have been replaced by HISAT2 from the same team. HISAT2 was published by Kim et al 2015, and there is a user manual available. Other people in my team also use STAR but say HISAT2 is slightly more user friendly, but similar in accuracy and speed.

Creating a HISAT2 index
The first step in using HISAT2 is to create index files for your genome assembly:
% hisat2-build  assembly.fa myindex
where assembly.fa is the assembly file, and myindex is the prefix you want to give to the index file names.
(note: Adam has hisat2 installed in /nfs/users/nfs_a/ar11/hisat2-2.0.0-beta/ )

Running HISAT2
You can now run HISAT2 to map your reads to your assembly. I do it using a wrapper script adapted by one from Adam Reid:
% ~alc/Documents/000_FUGI_Hymenolepis/pathfind2hisat2_alc2.pl -f lanelist_reads -i myindex -int 40000
where lanelist_reads is a file containing a list of the fastq files (full paths) for the sample (including all fastq files for all replicates for a condition),
myindex is the name of the HISAT2 index for the genome,
-int 40000 tells HISAT2 to only allow introns up to 40 kb.
This submits the jobs to the Sanger farm to run.
Note: My colleague Michaela noticed that when we ran HISAT2 output through HTSeqCount, it gave some warning messages, and these seem to be due to duplicate alignments in the HISAT2 bam file from HISAT2 2.0.0-beta version. The HISAT2 webpage says that a problem creating duplicate alignments was fixed in a recent release, so I'm going to try that new release to see if it gets rid of the error messages from HTSeqCount.

The actual HISAT2 command that this wrapper runs is:
% hisat2 --max-intronlen 40000 -p 12 myindex -1 seqs1_1.fastq -2 seqs1_2.fastq -1 seqs2_1.fastq -2 seqs2_2.fastq -S myoutput.sam
where -p 12 says to run HISAT2 with 12 threads,
seqs1_1.fastq and seqs1_2.fastq are reads and their mates from one sequencing run, and seqs2_1.fastq and seqs2_1.fastq are reads and their mates from another run (lane) for the same sample (same replicate),
-S myoutput.sam specifies the output sam file name.
Note that in this example a replicate was run on two sequencing lanes, this is sometimes done if we want to ensure a good amount of sequence data.
The pathfind2hisat2_alc2.pl script may have been given the fastq files for several lanes (in the lanelist_reads file) but it will identify which lanes belong to the same replicate, and will merge them into the final bam file produced.
So for example, if we had three larval replicates, and each were run on 2 lanes, then the two lanes for each replicate will be merged, so that we will end up with three bam files.

Notes:
- This needs to be run in the same directory that contains the HISAT2 index files.
- The data is assumed to be unstranded here.
- This uses the default -k option in HISAT2, -k 5, which means that for multiply mapping reads, HISAT2 maps them to up to 5 places, where those 5 places are not guaranteed to be their best alignments out of all possible aligned places (eg. if they could align to 10 places, we are not guaranteed to get the best 5 reported).
- Once HISAT2 has finished running, you should see bam files in your directory. They will have been sorted by their mapped positions along each chromosome/scaffold (by default HISAT2 gives SAM output and it is unsorted, but the pathfind2hisat2_alc2.pl script converts to bam, and sorts the bam).

(Note to self: only run this for one condition (ie. all replicates for that condition at once) at a time, as it needs a lot of disk space (several hundred Gbyte) for temporary files!)

HISAT2 output
As well as the bam file produced by HISAT2, it also produces mapping statistics that will look something like this:

23590680 reads; of these:
  23590680 (100.00%) were paired; of these:
    3217972 (13.64%) aligned concordantly 0 times
    11223995 (47.58%) aligned concordantly exactly 1 time
    9148713 (38.78%) aligned concordantly >1 times
    ----
    3217972 pairs aligned concordantly 0 times; of these:
      130791 (4.06%) aligned discordantly 1 time
    ----
    3087181 pairs aligned 0 times concordantly or discordantly; of these:
      6174362 mates make up the pairs; of these:
        4542683 (73.57%) aligned 0 times
        799882 (12.95%) aligned exactly 1 time
        831797 (13.47%) aligned >1 times
90.37% overall alignment rate

The first line says '23590680 reads', but in fact my input was 23,590,680 read-pairs, ie. 47,181,360 reads.
Here it says 90.37% of reads were mapped by HISAT2. This is 831797 + 799882 + 11223995*2 + 9148713*2 + 130791*2 = 42,638,677 reads.
47.58% of read-pairs were mapped uniquely and concordantly, where 'concordantly' means the reads of each pair were mapped about the expected distance apart in the right orientation relative to their mate.

What about multiply mapping reads?
The commands above used the default option for HISAT2, ie. '-k 5', which means that if a read can map equally to 20 places in the genome, HISAT2 will just report 5 of those (not necessarily the best 5 alignments out of the 20). In this case, we were intending to use HISAT2 output for differential expression analysis, so does it matter that we have multiply mapping reads?

After discussing with Adam Reid, we came to the conclusion that it doesn't matter, as these multiply mapping reads will have a very low mapping quality score (as they map to multiple places), and at a later stage in our pipeline, when we are estimating read counts per gene, we can discard these reads using the program HTSeqCount, based on their low mapping quality score (you can tell HTSeqCount to use a mapping quality threshold).

There is some variation between different people's protocols in how to treat multiply mapping reads, but it seems that people often keep just one (or a few) reads of a set of multiply mapping reads if they are mapping to the genome assembly, as we were. On the other hand, if you are mapping to a transcriptome assembly, you may want to keep multiply mapping reads. This is because you are mapping to a smaller fasta file, so it is feasible to map or multiply mapping reads (which can be slow for a whole genome). An advantage of this is that there are clever tools such as ExPress that can figure out how many of the multiply mapping reads to give to each member of a multi-gene family, based on unique regions in those genes, and this can then give you better estimates of expression level (reads or read-pairs per gene).

You can see a nice discussion of this and other issues regarding RNAseq analysis in Conesa et al 2016's "A survey of best practices for RNA-seq data analysis.

Acknowledgements
Thanks to Adam Reid and Michaela Herz for discussing HISAT2, and Adam for his wrapper script.

2 comments:

Unknown said...

Hi,

Thanks for the post,

My HISAT2 run of paired reads resulted in the following mapping quality for the reads

samtools view IS22330_WW_3_.sorted.bam | cut -f5 | sort | uniq -c
4035592 0
3469864 1
42400194 60

My alignment stats report from Hisat2 output shows:

[samopen] SAM header is present: 10 sequences.
23901435 reads; of these:
23901435 (100.00%) were paired; of these:
3178146 (13.30%) aligned concordantly 0 times
17341946 (72.56%) aligned concordantly exactly 1 time
3381343 (14.15%) aligned concordantly >1 times
----
3178146 pairs aligned concordantly 0 times; of these:
279629 (8.80%) aligned discordantly 1 time
----
2898517 pairs aligned 0 times concordantly or discordantly; of these:
5797034 mates make up the pairs; of these:
3785318 (65.30%) aligned 0 times
1543852 (26.63%) aligned exactly 1 time
467864 (8.07%) aligned >1 times
92.08% overall alignment rate
[bam_sort_core] merging from 21 files...

I want to know how many of my reads are uniquely mapped, multi-mapped and unmapped. Based on the MAPQ score above, a score of 60 would represent uniquely mapped reads? But the total mapped reads (sum of MAPQ score 60, 0 and 1 ) > total number of reads in the sample 47,802,870. Are there reads that are duplicated in the output sam file. How do I remove them and find uniquely mapped, multi-mapped and unmapped.

Thanks

Kapeel

Avril Coghlan said...

I think you should be able to find answers to this using samtools, see http://avrilomics.blogspot.com/2012/11/using-samtools-to-manipulate-sam-and.html