avrilomics: Using Tophat for mapping RNA-Seq data

The TopHat software can be used to map RNA-Seq data to a genome, and tries to be splice-site aware without being told about known splice sites. Short-read alignment algorithms such as Bowtie, BWA or Maq do not allow alignments between a read and the genome to contain large gaps, and so cannot align reads that span introns. TopHat was created to address this limitation: it can align reads that span large introns. This means that in the output BAM file made by TopHat you may have read-to-genome alignments that span huge introns.

The software is available for download from the TopHat website. The latest version (as of April 2013) is TopHat 2.0.8.

How TopHat works
Based on my reading of the TopHat paper (Trapnell et al 2009, Bioinformatics), this is my understanding of how it works. TopHat first maps all reads to your genome using Bowtie. All reads that do not map to the genome are set aside as 'initially unmapped reads' (IUM reads). TopHat then assembles the mapped reads (ignoring the IUM reads for the present) using the assembly module in Maq, and extracts the resulting 'islands' of contiguous sequence; these are assumed to be putative exons.

Sometimes the mRNA for a gene was sequenced at low coverage, and as a result the exons in the gene have gaps in coverage, so one exon is picked up as two nearby 'islands'. TopHat has a parameter that controls when two distinct but nearby 'islands' should be merged into a single 'island', as they probably correspond to the same exon. By default, TopHat considers two 'islands' to be close enough to merge like this if they are 6 bp or less apart.
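As a rough illustration of the merging rule described above (not TopHat's own code), here is a small shell/awk sketch. The function name merge_islands is made up for this example, and it assumes the islands are on a single chromosome, given one per line as "start end" and sorted by start, with "apart" taken to mean the number of uncovered bases between two islands:

```shell
merge_islands() {
    # $1 = maximum gap (bp) between two islands that should still be merged
    # stdin: one island per line as "start end", sorted by start (assumption)
    awk -v max_gap="$1" '
        NR == 1 { s = $1; e = $2; next }
        # uncovered bases between the current island and the next one
        $1 - e - 1 <= max_gap { if ($2 > e) e = $2; next }
        { print s, e; s = $1; e = $2 }
        END { if (NR > 0) print s, e }
    '
}

# Toy coordinates: the first two islands are 4 bp apart, so they are merged
printf '100 200\n205 300\n320 400\n' | merge_islands 6
```

With the default of 6 bp, the first two toy islands are joined into 100-300, while the third island (19 bp further on) stays separate.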

To map reads to splice junctions, TopHat enumerates all canonical donor and acceptor sites (ie. for GT-AG introns) within the 'islands', and infers what would be the sequence of the corresponding splice junction for a particular donor and acceptor site pair. Usually the donor and the acceptor are in different 'islands', but TopHat will consider a donor and acceptor pair in the same island if the island is very deeply sequenced, because in that case, there may be alternative splice-forms, and one splice-form may have the potential intron for that donor-acceptor pair and the other splice-form lack it. Then TopHat checks whether any of the 'initially unmapped reads' (IUM reads) matches any of those putative splice junctions. By default, TopHat considers potential introns of 70 to 20,000 bp.
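To make the donor/acceptor pairing idea concrete, here is a toy shell/awk sketch. It is only an illustration under simplifying assumptions, not what TopHat actually does: it scans a single forward-strand sequence rather than separate 'islands', and the function name list_introns and the example sequence are invented. It lists every GT...AG pair whose implied intron length falls within a given range:

```shell
list_introns() {
    # $1 = sequence, $2 = minimum intron length, $3 = maximum intron length
    # Prints "donor_start acceptor_end intron_length" for each GT...AG pair
    printf '%s\n' "$1" | awk -v min="$2" -v max="$3" '{
        n = length($0)
        for (p = 1; p < n; p++) {
            if (substr($0, p, 2) != "GT") continue      # candidate donor
            for (q = p + 2; q < n; q++) {
                if (substr($0, q, 2) != "AG") continue  # candidate acceptor
                len = q + 1 - p + 1                     # GT through AG, inclusive
                if (len >= min && len <= max) print p, q + 1, len
            }
        }
    }'
}

list_introns AAGTCCCCAGTT 4 10   # prints: 3 10 8
```

For a real genome you would use something like 70 and 20000 as the limits, matching the defaults mentioned above.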

A second TopHat paper (Trapnell et al 2012, Nature Protocols) describes how the latest version of TopHat breaks up reads that Bowtie cannot align (the IUM reads) into smaller pieces called 'segments'. When several 'segments' from the same read align to positions on the genome that are ~100 bp to several kb apart, TopHat assumes that the read spans a splice junction and estimates where the splice sites are. As far as I can see, the original TopHat paper (Trapnell et al 2009, Bioinformatics) did not mention the 'segments' approach, so I think this is a feature of more recent versions of TopHat.

Performance of TopHat on mammalian genomes
In the TopHat paper (Trapnell et al 2009, Bioinformatics), they mapped RNA-Seq reads from a mammalian RNA-Seq experiment and recovered >72% of known splice sites for that species. Sensitivity suffers for genes sequenced at less than 5-fold coverage. Junctions spanning very long introns, or introns with non-canonical donor and acceptor sites, will also be missed. Happily, they find that TopHat reports few false positive splice junctions. They say that TopHat maps reads to a mammalian genome at a rate of ~2.2M reads per CPU hour, taking about 22 hours to run and using <4 Gbyte of RAM on a single processor.

Running TopHat
Before running TopHat, you first need to make a Bowtie2 index for your genome. You can do this by typing:
% bowtie2-build SSTP.fa SSTP
where SSTP.fa is your assembly,
SSTP is an abbreviation used to refer to your species.
This makes index files with the extension .bt2 (SSTP.1.bt2, SSTP.2.bt2, etc.).
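Before starting a long TopHat job it can be worth checking that the index was built completely. Here is a minimal sketch; check_bt2_index is a made-up helper (not part of Bowtie2), and it assumes a standard 'small' Bowtie2 index, which consists of six .bt2 files (very large genomes get .bt2l files instead, which this does not check for):

```shell
check_bt2_index() {
    # $1 = index base name, eg. SSTP or /path/to/dir/SSTP
    for suffix in 1.bt2 2.bt2 3.bt2 4.bt2 rev.1.bt2 rev.2.bt2; do
        # -s is true only if the file exists and is not empty
        if [ ! -s "$1.$suffix" ]; then
            echo "missing: $1.$suffix" >&2
            return 1
        fi
    done
    echo "index $1 looks complete"
}
```

You could then run 'check_bt2_index SSTP' in the directory where you ran bowtie2-build.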

Here is the basic command to run TopHat:
% tophat2 path_to_index reads1.fastq reads2.fastq
where path_to_index is the path to the Bowtie2 index, including the index base name (eg. path_to_directory/SSTP),
reads1.fastq and reads2.fastq are the fastq files of reads to be mapped (for left-end and right-end reads of read-pairs, assuming you have paired-end reads). The two files are separated by a space; commas are only used to list several files for the same end.

Note that TopHat doesn't mind if the fastq files are gzipped, ie. you could have:
% tophat2 path_to_index reads1.fastq.gz reads2.fastq.gz

Note that, if you like, you can count how many reads are in your fastq file using:
% wc -l reads1.fastq
This gives the number of lines in the fastq file. The number of reads will be the number of lines divided by 4. There should be the same number of reads in reads1.fastq and reads2.fastq.
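The same check can be wrapped up so that it also works on gzipped files and compares the two files of a pair. This is just a sketch: count_reads and check_pairs are made-up helper names, and it assumes each read takes exactly 4 lines (ie. the sequence and quality lines are not wrapped):

```shell
count_reads() {
    # $1 = fastq file, optionally gzipped (name ending in .gz)
    case "$1" in
        *.gz) lines=$(gzip -dc "$1" | wc -l) ;;
        *)    lines=$(wc -l < "$1") ;;
    esac
    echo $((lines / 4))
}

check_pairs() {
    # $1, $2 = fastq files for the left-end and right-end reads
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    if [ "$n1" = "$n2" ]; then
        echo "OK: $n1 read-pairs"
    else
        echo "mismatch: $1 has $n1 reads, $2 has $n2 reads" >&2
        return 1
    fi
}
```

For example, 'check_pairs reads1.fastq.gz reads2.fastq.gz' should report the number of read-pairs if the two files agree.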

The TopHat manual details all the possible options of TopHat. Here are the options I am using for a nematode genome:
% tophat2 --mate-std-dev 50 -a 6 -i 10 -I 20000 --microexon-search --min-segment-intron 10 --max-segment-intron 20000 -r -30 --num-threads 8 /lustre/scratch108/parasites/alc/StrongyloidesTophat/SSTP/SSTP FL_Female_20110928_1.fastq FL_Female_20110928_2.fastq
where --mate-std-dev is the standard deviation for the inner distance between mate-pairs,
-a is 'anchor length', the minimum length that the read must cover on either side of a splice junction,
-i is the minimum intron length,
-I is the maximum intron length,
--microexon-search makes TopHat try to look for 'microexons' shorter than the read length,
--min-segment-intron is the minimum intron length allowed during 'split segment' search,
--max-segment-intron is the maximum intron length allowed during 'split segment' search,
-r is the expected mean inner distance between mate pairs. You can set this to (median insert size of your library) - (2 * read-length). It could be a negative number,
--num-threads is the number of processors to use for the TopHat job.
The last three arguments '/lustre/scratch108/parasites/alc/StrongyloidesTophat/SSTP/SSTP FL_Female_20110928_1.fastq FL_Female_20110928_2.fastq' are the path_to_index, reads1.fastq and reads2.fastq.
Note that the path_to_index argument should include the base name of the Bowtie2 index (the name you gave to bowtie2-build), in this case <path_to_directory>/SSTP, and not just the directory.
The output directory is called 'tophat_out' by default but can be specified using --output-dir.
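The arithmetic for the -r option can be written as a one-line helper. In this sketch the function name inner_distance is made up, and the numbers 170 and 100 are purely hypothetical values (not necessarily those of my library), chosen to show how a negative value like -30 can arise when the two reads of a pair overlap:

```shell
inner_distance() {
    # $1 = median insert (fragment) size in bp, $2 = read length in bp
    echo $(( $1 - 2 * $2 ))
}

inner_distance 170 100   # prints -30: the two 100-bp reads overlap by 30 bp
```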

I submitted the job to the Sanger farm requesting 6 Gbyte of memory (the TopHat paper said they needed <4 Gbyte of memory for a mammalian genome, but I needed 6 Gbyte, perhaps due to having more reads?). The job can be submitted in a shell script using a command like this:

% bsub -o myscript.o -e myscript.e -n 8 -R "select[mem > 6000] rusage[mem=6000] span[hosts=1]" -M6000000 myscript
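where myscript is a shell script containing the tophat2 command above,
-R "select[mem > 6000] rusage[mem=6000]" requests 6 Gbyte (6000 Mbyte) of memory,
-M6000000 sets the memory limit for the job (here given in Kbyte, so also 6 Gbyte),
-n 8 requests 8 processors for the job (since we used --num-threads 8 in tophat),
span[hosts=1] asks for all the processors to be on the same host, which is required when you use the -n option for a multi-threaded job like this.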
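Because the same amount of memory appears three times in that command (twice in Mbyte, once in Kbyte), it is easy to get them out of step. Here is a small sketch that builds the command line from one number; bsub_cmd is a made-up helper, it only prints the command rather than submitting it, and it assumes (as on the farm I used) that -M is given in Kbyte, which depends on how LSF is configured at your site:

```shell
bsub_cmd() {
    # $1 = memory in Mbyte, $2 = number of threads, $3 = script to run
    mem_mb=$1
    mem_kb=$(( mem_mb * 1000 ))   # assumption: -M takes Kbyte at this site
    echo "bsub -o $3.o -e $3.e -n $2 -R \"select[mem > $mem_mb] rusage[mem=$mem_mb] span[hosts=1]\" -M$mem_kb $3"
}

bsub_cmd 6000 8 myscript
```

This prints the same bsub command as above, which you can check by eye before pasting it into the terminal.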

My initial fastq files were 16 Gbyte each (32 Gbyte total), and my assembly file 41 Mbyte. I found that TopHat took 4.5 hours to do the mapping, and the output files took about 8.7 Gbyte of disk space. The main output file is called accepted_hits.bam (the mapped reads). Together with the ~32 Gbyte of input files, about 40 Gbyte of disk space was needed to run the job.

Other TopHat options
-G genes.gtf : map to the known transcripts whose positions are given in a GTF/GFF file,
--library-type <string> (fr-unstranded, fr-firststrand, fr-secondstrand) : if your RNA-seq data is strand-specific, specify the strand-specific protocol used to generate the reads. For example, for strand-specific RNA-seq samples from our lab, we use --library-type fr-firststrand.
--no-novel-juncs : tell TopHat to only map the reads for each sample to known transcripts, with novel splice discovery disabled (you must also use -G genes.gtf with this option),
-g 1 (or --max-multihits 1) : only map each read to one place in the genome (probably a good idea if you are going to use your RNA-seq data for differential expression analyses).

Other aligners - STAR
Another aligner being used in my team is STAR (Dobin et al 2013, Bioinformatics), from Thomas Gingeras' group. The authors say 'STAR outperforms other aligners by more than a factor of 50 in mapping speed [...] while at the same time improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full length RNA sequences.'

Some notes on using STAR:

- There is a user manual for STAR available here: http://chagall.med.cornell.edu/RNASEQcourse/STARmanual_2.4.2a.pdf

- To use STAR, make a subdirectory for the BAM files of aligned reads that we are going to create using STAR (eg. mkdir bams), and change to that subdirectory (cd bams).

- The first step in using STAR is to use it to create index files for your genome assemblies. To do this you will need to type something like this (on the Sanger farm):

bsub.py 10 --threads emu_star_index ~ar11/STAR-STAR_2.4.2a/bin/Linux_x86_64/STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /lustre/scratch108/parasites/name/bams --genomeFastaFiles /lustre/scratch108/parasites/name/assemblies/emultilocularis.fa --sjdbGTFfile /lustre/scratch108/parasites/name/annotation/emultilocularis.gff

where

--genomeDir specifies the full path to your ‘bams’ directory, eg. /lustre/scratch108/parasites/name/bams,

/lustre/scratch108/parasites/name/assemblies/emultilocularis.fa is where your assembly file is,

/lustre/scratch108/parasites/name/annotation/emultilocularis.gff is where your genome annotation file (GFF file) is,

"bsub.py 10 --threads" submits this job to the pcs5 compute farm, requesting 10 Gbyte of memory for the job.

- We need to know the length of the reads for the --sjdbOverhang option.

- For GFF3 format files, you need to use --sjdbGTFtagExonParentTranscript Parent.

- To run STAR, you will then need to type:
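A minimal sketch of what these last steps typically look like is given below. It is a generic form based on standard STAR options, not the exact command used for these data: the helper names star_overhang and star_align_cmd, the thread count, and all file names and paths are hypothetical. The STAR manual suggests read length minus 1 as the ideal --sjdbOverhang value, so the first helper works that out from the first read of an uncompressed fastq file, and the second just prints (does not run) a paired-end alignment command:

```shell
star_overhang() {
    # $1 = uncompressed fastq file; prints (length of the first read) - 1
    awk 'NR == 2 { print length($0) - 1; exit }' "$1"
}

star_align_cmd() {
    # $1 = genome index directory, $2 and $3 = paired fastq files,
    # $4 = prefix for the output files. Dry run: only prints the command.
    echo "STAR --runThreadN 8 --genomeDir $1 --readFilesIn $2 $3 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix $4"
}
```

For example, 'star_overhang reads1.fastq' gives 99 for 100-bp reads, which would go into the genomeGenerate step as --sjdbOverhang 99.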

Other aligners - Hisat2
Adam Reid in my group has been using Hisat2 and finding it useful.
(Note to self: Adam has installed it here: /nfs/users/nfs_a/ar11/hisat2-2.0.0-beta/hisat2).

Thanks
Thanks to Jason Tsai and Bernardo Foth for recommending parameters. Thanks to Eleanor Stanley for helping run the TopHat job. Thanks to Adam Reid and Alan Tracey for info on STAR.

Friday, 26 April 2013
