Creating a HISAT2 index
The first step in using HISAT2 is to create index files for your genome assembly:
% hisat2-build assembly.fa myindex
where assembly.fa is the assembly file, and myindex is the prefix you want to give to the index file names.
(note: Adam has hisat2 installed in /nfs/users/nfs_a/ar11/hisat2-2.0.0-beta/ )
Running HISAT2
You can now run HISAT2 to map your reads to your assembly. I do it using a wrapper script adapted by one from Adam Reid:
% ~alc/Documents/000_FUGI_Hymenolepis/pathfind2hisat2_alc2.pl -f lanelist_reads -i myindex -int 40000
where lanelist_reads is a file containing a list of the fastq files (full paths) for the sample (including all fastq files for all replicates for a condition),
myindex is the name of the HISAT2 index for the genome,
-int 40000 tells HISAT2 to only allow introns up to 40 kb.
This submits the jobs to the Sanger farm to run.
Note: My colleague Michaela noticed that when we ran HISAT2 output through HTSeqCount, it gave some warning messages, and these seem to be due to duplicate alignments in the HISAT2 bam file from HISAT2 2.0.0-beta version. The HISAT2 webpage says that a problem creating duplicate alignments was fixed in a recent release, so I'm going to try that new release to see if it gets rid of the error messages from HTSeqCount.
The actual HISAT2 command that this wrapper runs is:
% hisat2 --max-intronlen 40000 -p 12 myindex -1 seqs1_1.fastq -2 seqs1_2.fastq -1 seqs2_1.fastq -2 seqs2_2.fastq -S myoutput.sam
where -p 12 says to run HISAT2 with 12 threads,
seqs1_1.fastq and seqs1_2.fastq are reads and their mates from one sequencing run, and seqs2_1.fastq and seqs2_1.fastq are reads and their mates from another run (lane) for the same sample (same replicate),
-S myoutput.sam specifies the output sam file name.
Note that in this example a replicate was run on two sequencing lanes, this is sometimes done if we want to ensure a good amount of sequence data.
The pathfind2hisat2_alc2.pl script may have been given the fastq files for several lanes (in the lanelist_reads file) but it will identify which lanes belong to the same replicate, and will merge them into the final bam file produced.
So for example, if we had three larval replicates, and each were run on 2 lanes, then the two lanes for each replicate will be merged, so that we will end up with three bam files.
Notes:
- This needs to be run in the same directory that contains the HISAT2 index files.
- The data is assumed to be unstranded here.
- This uses the default -k option in HISAT2, -k 5, which means that for multiply mapping reads, HISAT2 maps them to up to 5 places, where those 5 places are not guaranteed to be their best alignments out of all possible aligned places (eg. if they could align to 10 places, we are not guaranteed to get the best 5 reported).
- Once HISAT2 has finished running, you should see bam files in your directory. They will have been sorted by their mapped positions along each chromosome/scaffold (by default HISAT2 gives SAM output and it is unsorted, but the pathfind2hisat2_alc2.pl script converts to bam, and sorts the bam).
(Note to self: only run this for one condition (ie. all replicates for that condition at once) at a time, as it needs a lot of disk space (several hundred Gbyte) for temporary files!)
HISAT2 output
As well as the bam file produced by HISAT2, it also produces mapping statistics that will look something like this:
The first line says '23590680 reads', but in fact my input was 23,590,680 read-pairs, ie. 47,181,360 reads.
Here it says 90.37% of reads were mapped by HISAT2. This is 831797 + 799882 + 11223995*2 + 9148713*2 + 130791*2 = 42,638,677 reads.
47.58% of read-pairs were mapped uniquely and concordantly, where 'concordantly' means the reads of each pair were mapped about the expected distance apart in the right orientation relative to their mate.
What about multiply mapping reads?
The commands above used the default option for HISAT2, ie. '-k 5', which means that if a read can map equally to 20 places in the genome, HISAT2 will just report 5 of those (not necessarily the best 5 alignments out of the 20). In this case, we were intending to use HISAT2 output for differential expression analysis, so does it matter that we have multiply mapping reads?
After discussing with Adam Reid, we came to the conclusion that it doesn't matter, as these multiply mapping reads will have a very low mapping quality score (as they map to multiple places), and at a later stage in our pipeline, when we are estimating read counts per gene, we can discard these reads using the program HTSeqCount, based on their low mapping quality score (you can tell HTSeqCount to use a mapping quality threshold).
There is some variation between different people's protocols in how to treat multiply mapping reads, but it seems that people often keep just one (or a few) reads of a set of multiply mapping reads if they are mapping to the genome assembly, as we were. On the other hand, if you are mapping to a transcriptome assembly, you may want to keep multiply mapping reads. This is because you are mapping to a smaller fasta file, so it is feasible to map or multiply mapping reads (which can be slow for a whole genome). An advantage of this is that there are clever tools such as ExPress that can figure out how many of the multiply mapping reads to give to each member of a multi-gene family, based on unique regions in those genes, and this can then give you better estimates of expression level (reads or read-pairs per gene).
You can see a nice discussion of this and other issues regarding RNAseq analysis in Conesa et al 2016's "A survey of best practices for RNA-seq data analysis.
Acknowledgements
Thanks to Adam Reid and Michaela Herz for discussing HISAT2, and Adam for his wrapper script.