avrilomics: September 2020

Tuesday 29 September 2020

Running BLAST to align genome sequences

I'm interested in finding conserved non-coding sequences between two related species of worms.

First I took the introns, UTRs, and intergenic regions from the first species, and tried comparing them to the genome of the second species using exonerate, but that was very slow. I then tried BLAT, which was a little faster. Then I tried BLASTN, which was nice and speedy!

It's been a while since I ran BLAST on the Sanger farm so I needed to remind myself how to run it, even though I have written previous posts on that ages ago (e.g. on farm_blast and on speeding up blast jobs).

This is what I did now:

Find the BLAST module on the farm, and load it: (only applicable to Sanger users)

% module avail -t | grep -i blast
blast/2.7.1=h96bfa4b_5

% module load blast/2.7.1=h96bfa4b_5

Make a blast database:

% makeblastdb -in genome2.fa -dbtype nucl

Run blast:

% blastn -db genome2.fa -query genome1_intronsandutrsandintergenic.fa -out myoutput.blast -outfmt 6

One thing I always always forget is what are the columns in the BLAST m8 format, so I have to look at this nice webpage.

Note that by default the blastn command runs Megablast, which looks for matches of high percent identity, and is a fast algorithm. I'm interested in high percent identity matches, so I used this.

Alternatives to BLAST:

An alternative to BLAST is nucmer, part of the mummer package, which I wrote a post on ages ago (see here). Note to self: nucmer is part of the mummer module on the Sanger farm.

I asked my colleagues what they are using nowadays for whole genome alignements, and they mentioned a couple of other software:

- my colleague Eerik Aunin mentioned the software SibeliaZ, which is tailored for aligning highly similar genomes, eg. strains of the same species,

- my colleague Faye Rodgers mentioned Cactus, which can be used to make alignments of 1000s of vertebrate genomes,

- my colleague Ana Protasio mentioned Satsuma

Regarding finding conserved noncoding regions, my colleague James Cotton mentioned PhastCons.

Using powsimR and SCOPIT for power calculations for single-cell data analysis

powsimR

I'm trying to do a power calculation to ask how many biological replicates are needed to detect the majority of differentially expressed genes between two treatments, in a particular cell population (cluster) in single cell data. I found the nice tool powsimR from the group of Dr Ines Hellmann. They say in their paper that for single cell data, the biological replicate is a cell, so their software tells you how many cells you need to sequence to detect the majority of differentially expressed genes (e.g. with >2-fold differential gene expression) between two treatments (e.g. male versus female worms).

There are detailed instructions on how to install powsimR on their powsimR github page. I followed these on a Windows laptop, after updating to the latest version of R, and found that it all installed fine but I had to leave out the 'build_vignettes=TRUE' when running the final devtools::install_github("bvieth/powsimR", dependences=FALSE) command. The powsimR github page says that several users have this issue. They say that this means the vignettes are not built, but can be viewed at powsimR vignette instead.

PowsimR lets you take into account batch effects (e.g. biological replicates performed in different months), depth of sequencing based on prior data, "dropout" effets where a gene is only detected in a subset of cells, different methods for testing differential expression (e.g. MAST, scde, sDD, monocle). You can specify a particular power and false discovery rate (FDR).

PowsimR can be used to help design experiments in advance, and also can be used for posterior analysis. For example in their paper they say "For example for the Kolodziejczyk data, 384 single cells for each condition would be sufficient to detect >80% of the DE genes with a well controlled FDR of 5%. Given the lower sample sizes actually used in Kolodziejczyk et al (2015), our power analysis suggests that only 60% of all DE genes could be detected".

In practice, I found that PowsimR looks quite complex to use as there are a lot of parameters to specify, and first you need to understand them. I ran out of time to do this properly, but hopefully can return to it one day, as it looks very useful...

SCOPIT

Another nice tool is SCOPIT, which lets you ask the question: how many cells do we need to squence from an adult worm, to capture a rare cell type?

SCOPIT is very simple to use through the website above.

Acknowledgements

Thanks to my colleague Faye Rodgers for telling me about SCOPIT.