Friday 2 December 2016

HISAT2 aligner for RNAseq data

Previously I used the TopHat aligner for RNAseq data (see my TopHat blogpost) but recently it seems to have been replaced by HISAT2 from the same team. HISAT2 was published by Kim et al 2015, and there is a user manual available. Other people in my team also use STAR but say HISAT2 is slightly more user friendly, but similar in accuracy and speed.

Creating a HISAT2 index
The first step in using HISAT2 is to create index files for your genome assembly:
% hisat2-build  assembly.fa myindex
where assembly.fa is the assembly file, and myindex is the prefix you want to give to the index file names.
(note: Adam has hisat2 installed in /nfs/users/nfs_a/ar11/hisat2-2.0.0-beta/ )

Running HISAT2
You can now run HISAT2 to map your reads to your assembly. I do it using a wrapper script adapted by one from Adam Reid:
% ~alc/Documents/000_FUGI_Hymenolepis/pathfind2hisat2_alc2.pl -f lanelist_reads -i myindex -int 40000
where lanelist_reads is a file containing a list of the fastq files (full paths) for the sample (including all fastq files for all replicates for a condition),
myindex is the name of the HISAT2 index for the genome,
-int 40000 tells HISAT2 to only allow introns up to 40 kb.
This submits the jobs to the Sanger farm to run.
Note: My colleague Michaela noticed that when we ran HISAT2 output through HTSeqCount, it gave some warning messages, and these seem to be due to duplicate alignments in the HISAT2 bam file from HISAT2 2.0.0-beta version. The HISAT2 webpage says that a problem creating duplicate alignments was fixed in a recent release, so I'm going to try that new release to see if it gets rid of the error messages from HTSeqCount.

The actual HISAT2 command that this wrapper runs is:
% hisat2 --max-intronlen 40000 -p 12 myindex -1 seqs1_1.fastq -2 seqs1_2.fastq -1 seqs2_1.fastq -2 seqs2_2.fastq -S myoutput.sam
where -p 12 says to run HISAT2 with 12 threads,
seqs1_1.fastq and seqs1_2.fastq are reads and their mates from one sequencing run, and seqs2_1.fastq and seqs2_1.fastq are reads and their mates from another run (lane) for the same sample (same replicate),
-S myoutput.sam specifies the output sam file name.
Note that in this example a replicate was run on two sequencing lanes, this is sometimes done if we want to ensure a good amount of sequence data.
The pathfind2hisat2_alc2.pl script may have been given the fastq files for several lanes (in the lanelist_reads file) but it will identify which lanes belong to the same replicate, and will merge them into the final bam file produced.
So for example, if we had three larval replicates, and each were run on 2 lanes, then the two lanes for each replicate will be merged, so that we will end up with three bam files.

Notes:
- This needs to be run in the same directory that contains the HISAT2 index files.
- The data is assumed to be unstranded here.
- This uses the default -k option in HISAT2, -k 5, which means that for multiply mapping reads, HISAT2 maps them to up to 5 places, where those 5 places are not guaranteed to be their best alignments out of all possible aligned places (eg. if they could align to 10 places, we are not guaranteed to get the best 5 reported).
- Once HISAT2 has finished running, you should see bam files in your directory. They will have been sorted by their mapped positions along each chromosome/scaffold (by default HISAT2 gives SAM output and it is unsorted, but the pathfind2hisat2_alc2.pl script converts to bam, and sorts the bam).

(Note to self: only run this for one condition (ie. all replicates for that condition at once) at a time, as it needs a lot of disk space (several hundred Gbyte) for temporary files!)

HISAT2 output
As well as the bam file produced by HISAT2, it also produces mapping statistics that will look something like this:

23590680 reads; of these:
  23590680 (100.00%) were paired; of these:
    3217972 (13.64%) aligned concordantly 0 times
    11223995 (47.58%) aligned concordantly exactly 1 time
    9148713 (38.78%) aligned concordantly >1 times
    ----
    3217972 pairs aligned concordantly 0 times; of these:
      130791 (4.06%) aligned discordantly 1 time
    ----
    3087181 pairs aligned 0 times concordantly or discordantly; of these:
      6174362 mates make up the pairs; of these:
        4542683 (73.57%) aligned 0 times
        799882 (12.95%) aligned exactly 1 time
        831797 (13.47%) aligned >1 times
90.37% overall alignment rate

The first line says '23590680 reads', but in fact my input was 23,590,680 read-pairs, ie. 47,181,360 reads.
Here it says 90.37% of reads were mapped by HISAT2. This is 831797 + 799882 + 11223995*2 + 9148713*2 + 130791*2 = 42,638,677 reads.
47.58% of read-pairs were mapped uniquely and concordantly, where 'concordantly' means the reads of each pair were mapped about the expected distance apart in the right orientation relative to their mate.

What about multiply mapping reads?
The commands above used the default option for HISAT2, ie. '-k 5', which means that if a read can map equally to 20 places in the genome, HISAT2 will just report 5 of those (not necessarily the best 5 alignments out of the 20). In this case, we were intending to use HISAT2 output for differential expression analysis, so does it matter that we have multiply mapping reads?

After discussing with Adam Reid, we came to the conclusion that it doesn't matter, as these multiply mapping reads will have a very low mapping quality score (as they map to multiple places), and at a later stage in our pipeline, when we are estimating read counts per gene, we can discard these reads using the program HTSeqCount, based on their low mapping quality score (you can tell HTSeqCount to use a mapping quality threshold).

There is some variation between different people's protocols in how to treat multiply mapping reads, but it seems that people often keep just one (or a few) reads of a set of multiply mapping reads if they are mapping to the genome assembly, as we were. On the other hand, if you are mapping to a transcriptome assembly, you may want to keep multiply mapping reads. This is because you are mapping to a smaller fasta file, so it is feasible to map or multiply mapping reads (which can be slow for a whole genome). An advantage of this is that there are clever tools such as ExPress that can figure out how many of the multiply mapping reads to give to each member of a multi-gene family, based on unique regions in those genes, and this can then give you better estimates of expression level (reads or read-pairs per gene).

You can see a nice discussion of this and other issues regarding RNAseq analysis in Conesa et al 2016's "A survey of best practices for RNA-seq data analysis.

Acknowledgements
Thanks to Adam Reid and Michaela Herz for discussing HISAT2, and Adam for his wrapper script.

Using samtools mpileup to find SNPs, and vcf-tools to filter SNPs

My colleagues have recently been talking about SNP-finding using 'samtools mpileup', and ways to filter the set of SNPs called by 'samtools mpileup' to throw away the more dodgy SNP calls.

Calling SNPs (and reference bases) using 'samtools mpileup':
You can call SNPs (and reference bases) using 'samtools mpileup' (see the samtools mpileup webpage for details) by typing:
% samtools mpileup in.bam
where 'in.bam' is your input bam file.
The output is in VCF (Variant Call Format).  

The output may look like this:
1       2131    N       2       gG      FF
1       2132    N       2       tT      FF
1       2133    N       2       aA      FF
1       2134    N       2       gG      BF
1       2135    N       2       cC      BF
1       2136    N       2       tT      BF
1       2137    N       2       cC      BF
1       2138    N       2       g$G$    BF
1       2147    N       1       ^]g     F
1       2148    N       1       g       F
The columns are: chromosome, 1-based coordinate, reference base, number of reads covering the site, read bases, and base qualities.

In the example output above we don't see the base in the reference. To see it we must type:
% samtools faidx genome.fa
% samtools mpileup -Q 15 -D -f genome.fa in.bam
where genome.fa is the assembly file.
Then we see something like this:
1       5131    C       2       .,      FE
1       5132    G       2       .,      FF
1       5133    A       2       .,      FF
1       5134    G       2       .,      FF
1       5135    G       2       .,      FF
1       5136    A       2       .,      FF
A dot '.' stands for the reference base on the forward strand, and ',' for the reverse strand.

When you are doing this, you can tell 'samtools mpileup' to only take bases with base alignment quality scores (BAQ scores: these are adjusted base quality scores, which have been reduced for base positions near indels, to help rule out false positive SNP calls due to alignment artefacts near small indels) of 15 or higher by typing:
% samtools mpileup -Q 15 in.bam
To only take cases with reads with mapping quality (MAPQ) of >=30:
% samtools mpileup -q 30 -Q 15 -D -f genome.fa in.bam

You can vary the contents of the output VCF file that you get using the following options:
-D : gives you per-sample read depth at each base position
-S : gives you a P-value for a statistical test for per-sample strand bias at each base position (strand bias occurs if a particular SNP is being called from reads that aligned to one strand but not reads that align to the complementary strand).
-u : gives you a text BCF output (a compressed VCF format)
-g : gives you binary BCF output (a compressed binary version of BCF)

To avoid excessive usage of memory (RAM), we can tell samtools to only use a depth of 1000 reads at each position, using the -d option:
% samtools mpileup -d 1000 in.bam

To combine several options in one command, you could type, for example:
% samtools mpileup -d 1000 -DSug in.bam  
or
% samtools mpileup -Q 15 -D in.bam 

Filtering SNPs using bcftools:
To filter the output of samtools mpileup to just have variant bases (not reference bases), we need to filter the output using bcftools, for example:

% samtools mpileup -u -q 30 -Q 15 -D -f genome.fa in.bam | bcftools view -bvcg - > var.raw.bcf
where the samtools mpileup -u option produces uncompressed BCF output;
bcftools view -b outputs BCf; -v output potential variant sites only; -c calls SNPs; -g calls genotypes at variant sites.
This calls SNPs and small indels.

We can then filter the output file using:
% bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf 
where -D sets the maximum read depth to call a SNP.

Filtering SNPs using vcftools-annotate:
There is also a way to filter SNPs using vcftools-annotate, for example on MinDP (read depth) and RMS MQ (mapping quality). I need to look into this.

A protocol for calling and filtering SNPs
My colleague Diogo Ribeiro previously identified some SNP sites from Illumina sequencing data for a nematode species. Here is a brief description of what he did:

- For each isolate, sequencing reads were mapped to the reference assembly using the SMALT mapper, to produce bam files.

- SNPs and small indels relative to the reference assembly were identified separately for each isolate, using samtools and vcftools, based on a maximum of 1000 reads at each base position ('samtools mpileup -d 1000' option).

- The resulting VCF files (one per isolate) were filtered to remove less reliable variant sites, by removing those that were:
(i) in repetitive regions of the reference assembly, as identified by gem-mappability;
(ii) were in gaps in scaffolds, or within 100 bp of a gap of >=8 bp;
(iii) were within 100 bp of a scaffold start or end;
(iv) had a SNP quality score of less than 50;
(v) had a mapping quality of less than 30;
(vi) had a read depth (including reads from both strands) of less than 20, or had a read depth of less than 10 from a single strand.
Some of this is possible using Diogo Ribeiro's script:
eg.
% custom_filter_vcf_files.py --DP 20 --DP_STRAND 10 --QUAL 50 --MQ 30 vcf vcf_filtered2
applies the filters (iv), (v), (vi) to VCF files in input directory 'vcf', and puts the output in an output directory 'vcf_filtered2' (must exist already).

- The variant density, for all variants and for just homozygous variants, was estimated in non-overlapping 1-kb windows of each scaffold of the reference assembly. When calculating variant density, sites with more than one alternative allele were excluded, as these are probably erroneous variant calls.

Scripts for SNPs (note to self)
My colleagues Diogo Ribeiro and Bhavana Harsha wrote some nice scripts for SNP calling and SNP filtering, see:
(i) vcf-annotate_custom_filters.pl -> can filter a VCF file based on median read depth, and other filters that the user species (uses vcftools-annotate),
(ii) median_dp_filter.pl -> can filter a VCF file based on median read depth for a chromosome,
(iii) snp_calling_improv.pl -> can filter a VCF file based on using gem-mappability to find repetitive regions of a reference assembly,
(iv) custom_filter_vcf_files.py -> can filter a VCF file based on minimum read depth, read depth per strand, SNP quality, or mapping quality,

(v) plotsAlongChr.py -> plots SNP data (from VCF files) along chromosomes,
(vi) GeneSNPAnalyses.py -> analyses distribution of SNPs per gene, and SNPs per kb of gene.

Acknowledgements
Thanks very much to my colleagues Bhavana Harsha and Diogo Ribeiro who figured out lots of these things.

GTF versus GFF

The GFF format is described here.
The entry for one gene is something like this:
# Gene gene:HmN_000672600
pathogens_HYM_scaffold_0001     WormBase_imported       gene    599     742     .       +       .       ID=gene:HmN_000672600;Name=HmN_000672600;biotype=protein_coding
pathogens_HYM_scaffold_0001     WormBase_imported       mRNA    599     742     .       +       .       ID=transcript:HmN_000672600.1;Parent=gene:HmN_000672600;Name=HmN_000672600.1
pathogens_HYM_scaffold_0001     WormBase_imported       exon    599     742     .       +       .       ID=exon:HmN_000672600.1.1;Parent=transcript:HmN_000672600.1
pathogens_HYM_scaffold_0001     WormBase_imported       CDS     599     742     .       +       0       ID=cds:HmN_000672600.1;Parent=transcript:HmN_000672600.1

You can see that the transcript, exon and CDS lines have a 'Parent=' tag at the end, to say which transcript/gene they belong to. Also, there are 'gene', 'mRNA', 'exon', and 'CDS' lines.

The GTF format is slightly different. An example is: (for the same gene as above)
pathogens_HYM_scaffold_0001     WormBase_imported       transcript      599     742     .       +       .       gene_id "gene:HmN_000672600"; transcript_id "transcript:HmN_000672600.1"
pathogens_HYM_scaffold_0001     WormBase_imported       exon    599     742     .       +       .       gene_id "HmN_000672600.1"; transcript_id "transcript:HmN_000672600.1"; exon_number "1"

Here we just have 'transcript' and 'exon' lines, and the tag at the end gives the gene_id and transcript_id, so there is no need for 'Parent=' tags.
Here is a description of GTF: here.

Converting GFF to GTF:
My colleague Adam Reid has written a nice Perl script for convering GFF to GTF, eg.:
% perl /nfs/users/nfs_a/ar11/scripts/gff3wormbase2gtf.pl  emultilocularis.gff3 > emultilocularis.gtf

Mapping Illumina reads using the smalt aligner

Martin Hunt has written the map_splitter.py script for aligning Illumina reads to a genome assembly.

This can use the SMALT aligner (on the Sanger compute farm) as follows:

% map_splitter.py --split 500000 -k 11 -s 2 -o " -x -r 0 -y 0.7 -i 500" smalt assembly.fa seqs_1.fastq.gz seqs_2.fastq.gz

where --split 500000 tells it to split the input genome fasta file into smaller files of 500,000 sequences each;
-k 11 is the kmer for the SMALT index;
-s 2 is the step length for the SMALT index;
-o gies the mapping options for SMALT;
-x is the SMALT option for SMALT to do an exhaustive search for alignments (at cost of speed);
-r 0 tells SMALT to randomly assign multiply mapping reads to one place;
-y 0.7 tells SMALT to only take reads with alignment identity of >=70%;
-i 500 tells SMALT to allow a maximum insert size of 500 bp;
assembly.fa is the input assembly file;
seqs_1.fastq.gz and seqs_2.fastq.gz are the fastq files of reads and their mates, respectively.

Thursday 1 December 2016

Searching the NCBI trace archive

To search the NCBI Trace Archive for the capillary read sequence data used for the Echinococcus multilocularis tapeworm genome paper (Tsai et al), you can search for:

CENTER_NAME = 'SC' and SEQ_LIB_ID = '98488' and SPECIES_CODE = 'ECHINOCOCCUS MULTILOCULARIS'