avrilomics: 2020

Thursday 12 November 2020

Fastqc for checking read quality

I wanted to check whether the bases in a read have low quality. The read looks like this in my fastq file:

@M04241:499:GW19092930:1:2112:8740:16257 1:N:0:TGACCT
CGATGTGAGATCCCTCAGACCCTTTTAGTCAGTGGTCCCTTAAGCGGAGGCCCTATAGTGAGTCGTATTACAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAATATATACAACCAAATAAAGATGAAAAAAGAAACTCAATCCAAAAGAAATTGTAGAAGCAAGCTACACAGATTGACCGTAATATTTTCAAACTCAGAGCATGAC
+
CCCCCFFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHGGGGGGGGHHHHHHHHHHHGGHHHHHHHHHHHGGGGHHHGGGGGHHHHHHGHHHHHGGGGGGGGF/GGHGGHHHHGGGHGHHHHHHHHHHHHGGGGG<;-;.000000099.:..09CF0000;;;0;;...;/;0:0;;/0;:.;;;B0009000:9FF//.0;0;/9.000;00:-9:A.0:00000009000000/90900

I put these lines in a file 'tmp.fastq'.

I ran fastqc using this command:

% module avail -t | grep -i fastqc [to find the fastqc module on the Sanger farm]

% module load fastqc/0.11.8-c2 [to load the fastqc module on the Sanger farm]

% fastqc tmp.fastq

% unzip tmp_fastqc.zip

% more tmp_fastqc/fastqc_data.txt

This shows me the average base quality drops off a lot after position 145 approximately:

#Base   Mean    Median Lower Quartile Upper Quartile 10th Percentile 90th Percentile
1       34.0    NaN     NaN     NaN     NaN     NaN
2       34.0    NaN     NaN     NaN     NaN     NaN
3       34.0    NaN     NaN     NaN     NaN     NaN
4       34.0    NaN     NaN     NaN     NaN     NaN
5       34.0    NaN     NaN     NaN     NaN     NaN
6       37.0    NaN     NaN     NaN     NaN     NaN
7       37.0    NaN     NaN     NaN     NaN     NaN
8       37.0    NaN     NaN     NaN     NaN     NaN
9       37.0    NaN     NaN     NaN     NaN     NaN
10-14   37.4    NaN     NaN     NaN     NaN     NaN
15-19   38.0    NaN     NaN     NaN     NaN     NaN
20-24   38.4    NaN     NaN     NaN     NaN     NaN
25-29   39.0    NaN     NaN     NaN     NaN     NaN
30-34   39.0    NaN     NaN     NaN     NaN     NaN
35-39   39.0    NaN     NaN     NaN     NaN     NaN
40-44   39.0    NaN     NaN     NaN     NaN     NaN
45-49   38.2    NaN     NaN     NaN     NaN     NaN
50-54   38.2    NaN     NaN     NaN     NaN     NaN
55-59   39.0    NaN     NaN     NaN     NaN     NaN
60-64   39.0    NaN     NaN     NaN     NaN     NaN
65-69   38.6    NaN     NaN     NaN     NaN     NaN
70-74   39.0    NaN     NaN     NaN     NaN     NaN
75-79   38.6    NaN     NaN     NaN     NaN     NaN
80-84   38.6    NaN     NaN     NaN     NaN     NaN
85-89   38.0    NaN     NaN     NaN     NaN     NaN
90-94   39.0    NaN     NaN     NaN     NaN     NaN
95-99   38.8    NaN     NaN     NaN     NaN     NaN
100-104 38.4    NaN     NaN     NaN     NaN     NaN
105-109 38.0    NaN     NaN     NaN     NaN     NaN
110-114 33.2    NaN     NaN     NaN     NaN     NaN

115-119 38.6    NaN     NaN     NaN     NaN     NaN
120-124 38.4    NaN     NaN     NaN     NaN     NaN
125-129 38.8    NaN     NaN     NaN     NaN     NaN
130-134 39.0    NaN     NaN     NaN     NaN     NaN
135-139 38.6    NaN     NaN     NaN     NaN     NaN
140-144 33.4    NaN     NaN     NaN     NaN     NaN
145-149 16.2    NaN     NaN     NaN     NaN     NaN
150-154 15.0    NaN     NaN     NaN     NaN     NaN
155-159 19.8    NaN     NaN     NaN     NaN     NaN
160-164 24.6    NaN     NaN     NaN     NaN     NaN
165-169 17.2    NaN     NaN     NaN     NaN     NaN
170-174 23.8    NaN     NaN     NaN     NaN     NaN
175-179 15.8    NaN     NaN     NaN     NaN     NaN
180-184 21.4    NaN     NaN     NaN     NaN     NaN
185-189 21.2    NaN     NaN     NaN     NaN     NaN
190-194 24.8    NaN     NaN     NaN     NaN     NaN
195-199 16.8    NaN     NaN     NaN     NaN     NaN
200-204 23.2    NaN     NaN     NaN     NaN     NaN
205-209 18.6    NaN     NaN     NaN     NaN     NaN
210-214 21.0    NaN     NaN     NaN     NaN     NaN
215-219 16.8    NaN     NaN     NaN     NaN     NaN
220-224 18.2    NaN     NaN     NaN     NaN     NaN
225-229 22.0    NaN     NaN     NaN     NaN     NaN
230-234 15.0    NaN     NaN     NaN     NaN     NaN
235-239 16.8    NaN     NaN     NaN     NaN     NaN
240-244 14.8    NaN     NaN     NaN     NaN     NaN
245-249 18.6    NaN     NaN     NaN     NaN     NaN

Tuesday 29 September 2020

Running BLAST to align genome sequences

I'm interested in finding conserved non-coding sequences between two related species of worms.

First I took the introns, UTRs, and intergenic regions from the first species, and tried comparing them to the genome of the second species using exonerate, but that was very slow. I then tried BLAT, which was a little faster. Then I tried BLASTN, which was nice and speedy!

It's been a while since I ran BLAST on the Sanger farm so I needed to remind myself how to run it, even though I have written previous posts on that ages ago (e.g. on farm_blast and on speeding up blast jobs).

This is what I did now:

Find the BLAST module on the farm, and load it: (only applicable to Sanger users)

% module avail -t | grep -i blast
blast/2.7.1=h96bfa4b_5

% module load blast/2.7.1=h96bfa4b_5

Make a blast database:

% makeblastdb -in genome2.fa -dbtype nucl

Run blast:

% blastn -db genome2.fa -query genome1_intronsandutrsandintergenic.fa -out myoutput.blast -outfmt 6

One thing I always always forget is what are the columns in the BLAST m8 format, so I have to look at this nice webpage.

Note that by default the blastn command runs Megablast, which looks for matches of high percent identity, and is a fast algorithm. I'm interested in high percent identity matches, so I used this.

Alternatives to BLAST:

An alternative to BLAST is nucmer, part of the mummer package, which I wrote a post on ages ago (see here). Note to self: nucmer is part of the mummer module on the Sanger farm.

I asked my colleagues what they are using nowadays for whole genome alignements, and they mentioned a couple of other software:

- my colleague Eerik Aunin mentioned the software SibeliaZ, which is tailored for aligning highly similar genomes, eg. strains of the same species,

- my colleague Faye Rodgers mentioned Cactus, which can be used to make alignments of 1000s of vertebrate genomes,

- my colleague Ana Protasio mentioned Satsuma

Regarding finding conserved noncoding regions, my colleague James Cotton mentioned PhastCons.

Using powsimR and SCOPIT for power calculations for single-cell data analysis

powsimR

I'm trying to do a power calculation to ask how many biological replicates are needed to detect the majority of differentially expressed genes between two treatments, in a particular cell population (cluster) in single cell data. I found the nice tool powsimR from the group of Dr Ines Hellmann. They say in their paper that for single cell data, the biological replicate is a cell, so their software tells you how many cells you need to sequence to detect the majority of differentially expressed genes (e.g. with >2-fold differential gene expression) between two treatments (e.g. male versus female worms).

There are detailed instructions on how to install powsimR on their powsimR github page. I followed these on a Windows laptop, after updating to the latest version of R, and found that it all installed fine but I had to leave out the 'build_vignettes=TRUE' when running the final devtools::install_github("bvieth/powsimR", dependences=FALSE) command. The powsimR github page says that several users have this issue. They say that this means the vignettes are not built, but can be viewed at powsimR vignette instead.

PowsimR lets you take into account batch effects (e.g. biological replicates performed in different months), depth of sequencing based on prior data, "dropout" effets where a gene is only detected in a subset of cells, different methods for testing differential expression (e.g. MAST, scde, sDD, monocle). You can specify a particular power and false discovery rate (FDR).

PowsimR can be used to help design experiments in advance, and also can be used for posterior analysis. For example in their paper they say "For example for the Kolodziejczyk data, 384 single cells for each condition would be sufficient to detect >80% of the DE genes with a well controlled FDR of 5%. Given the lower sample sizes actually used in Kolodziejczyk et al (2015), our power analysis suggests that only 60% of all DE genes could be detected".

In practice, I found that PowsimR looks quite complex to use as there are a lot of parameters to specify, and first you need to understand them. I ran out of time to do this properly, but hopefully can return to it one day, as it looks very useful...

SCOPIT

Another nice tool is SCOPIT, which lets you ask the question: how many cells do we need to squence from an adult worm, to capture a rare cell type?

SCOPIT is very simple to use through the website above.

Acknowledgements

Thanks to my colleague Faye Rodgers for telling me about SCOPIT.

Monday 3 August 2020

Power calculation to test a drug for anthelmintic activity in mice

I wanted to find out how to do a power calculation to estimate the number of mice needed per treatment group, for testing a drug for anthelmintic activity in mice.

Luckily, I found a nice paper by Marriott et al 2018, who estimated the number of mice or gerbils needed per treatment group in such drug tests, to achieve a drug efficacy of 70% or higher, with a statistical power of >75% and an alpha set at 0.05.

Their methods say:

'Power analysis was undertaken using sample means and standard deviations of untreated/vehicle control gerbil or SCID mouse worm burdens combined from 2-3 independent infection or implantation experiments. With the assumption of proportional variation, sample size was calculated for drug efficacy effect sizes of 70% or 90% with a statistical power (1-Beta) of >75 < 90% with alpha set at 0.05 using a two-sample T test (Russ Lenth PiFace Applet)'.

To do this, they used a great Java program called PiFace by the statistician Russ Lenth. This uses a two-sided t-test.

For example, for SCID mice, Marriott et al estimated based on some previous data that the worm burden (number of worms) per mouse is 15.3 (mean) with a standard deviation of 8.7. Achieving an efficacy of a drug of 70% would mean that the mean in the drug-treated mice would be (15.3-0.7*15.3)=4.59. That is, the difference between the mean worm burden in the control and drug-treated mice would be 15.3-4.59=10.71.

Marriott et al say in their methods that they used the 'assumption of proportional variation'. I think this means that they assumed the standard deviation in the drug-treated mice would be smaller than in the control mice, as the mean is smaller in the control mice. Under this assumption, if you get a standard deviation of 8.7 for a mean of 15.3 in the control mice, you would expect a standard deviation of (8.7/15.3)*4.59=2.61 for a mean of 4.59 in the drug-treated mice.

We can put the numbers sigma1=8.7, sigma2=2.6, 'True difference of means=10', into the PiFace Java program, and power=0.75, and choose 'two-sided test' for a two-sided t-test, and we estimate that we need a group size of 8, ie. 8 control mice and 8 drug-treated mice. Note that the picture shows the power to be 0.782, as if we put in power=0.75, it calculates the minimum sample sizes to give that power and then their corresponding power.

Note also that here we set the 'True difference of means=10', which was rounding down the difference of 10.71 which I calculated above. I find when I set this 'True difference of means' to 10, I get the value of 8 for the sample size, which is what Marriott et al report in their Table 4 of their paper. When I set the 'True difference of means' to 10.71, I get a slightly different sample size. I think they must have rounded down 10.71 to 10, to do this calculation (?).

By the way, I love the warning given by Russ Lenth in the instructions for his Java program PiFace: "Folks, just because you can plug numbers into a program doesn’t change the fact that if you don’t know what you’re doing, you’re almost guaranteed to get meaningless results – if not dangerously misleading ones. Statistics really is like rocket science; it isn’t easy, even to us who have studied it for a long time."

GPower software

My colleague Maria Duque also told me about another free software called GPower.

GPower is also very easy to use, and as well as power calculations for t-tests, it can perform power calculations for Mann-Whitney tests, chi-squared tests, ANOVA, etc.

There is a nice talk about GPower with many examples, available here.

Acknowledgements

Thank you to Russ Lenth for making his lovely Java program PiFace available. Thank you also to my colleague Maria Duque for telling me about GPower.

Monday 8 June 2020

Searching for software on the Sanger farm

This is only useful to Sanger users...

To find software on the farm, you can use:
% /software/bin/locate_software

e.g. to find the blat software:
% /software/bin/locate_software blat

Using FLASH to merge paired Illumina reads

I have a fastq file for the forward reads from an Illumina sequence library, and a fastq file for the reverse reads, and want to see if I can merge these, as I know that in many cases the insert size is small so the forward and reverse reads should overlap.

To try to merge them, I'm going to try the FLASH software, described in a paper by Magoc & Salzberg (2011).

On the command line, it seems I can run it by typing:
% /nfs/team87/farm3_lims2_vms/software/crispresso_dependencies/bin/flash
This tells me the input should be:
Usage: flash [OPTIONS] MATES_1.FASTQ MATES_2.FASTQ

Let's try it on my data: (where I have 20,593 read-pairs)
% /nfs/team87/farm3_lims2_vms/software/crispresso_dependencies/bin/flash my-R1.fastq.gz my-R2.fastq.gz

This gives me output:
[FLASH] Starting FLASH v1.2.11
[FLASH] Fast Length Adjustment of SHort reads
[FLASH]
[FLASH] Input files:
[FLASH]     my-R1.fastq.gz
[FLASH]     my-R2.fastq.gz
[FLASH]
[FLASH] Output files:
[FLASH]     ./out.extendedFrags.fastq
[FLASH]     ./out.notCombined_1.fastq
[FLASH]     ./out.notCombined_2.fastq
[FLASH]     ./out.hist
[FLASH]     ./out.histogram
[FLASH]
[FLASH] Parameters:
[FLASH]     Min overlap:           10
[FLASH]     Max overlap:           65
[FLASH]     Max mismatch density: 0.250000
[FLASH]     Allow "outie" pairs:   false
[FLASH]     Cap mismatch quals:    false
[FLASH]     Combiner threads:      16
[FLASH]     Input format:          FASTQ, phred_offset=33
[FLASH]     Output format:         FASTQ, phred_offset=33
[FLASH]
[FLASH] Starting reader and writer threads
[FLASH] Starting 16 combiner threads
[FLASH] Processed 20593 read pairs
[FLASH]
[FLASH] Read combination statistics:
[FLASH]     Total pairs:      20593
[FLASH]     Combined pairs:   137
[FLASH]     Uncombined pairs: 20456
[FLASH]     Percent combined: 0.67%
[FLASH]
[FLASH] Writing histogram files.
[FLASH] WARNING: An unexpectedly high proportion of combined pairs (41.61%)
overlapped by more than 65 bp, the --max-overlap (-M) parameter. Consider
increasing this parameter. (As-is, FLASH is penalizing overlaps longer than
65 bp when considering them for possible combining!)
[FLASH]
[FLASH] FLASH v1.2.11 complete!
[FLASH] 1.031 seconds elapsed
[FLASH] Finished with 1 warning (see above)

This gave me a warning that I should increase the max_overlap parameter, so next I tried:
% /nfs/team87/farm3_lims2_vms/software/crispresso_dependencies/bin/flash -M 200 my-R1.fastq.gz my-R2.fastq.gz
[FLASH] Starting FLASH v1.2.11
[FLASH] Fast Length Adjustment of SHort reads
[FLASH]
[FLASH] Input files:
[FLASH]     /nfs/repository/working_area/SHISTO/V7/HIV_integrations/wannaporn/new_2020_data/Sm-control_i1-R1.fastq.gz
[FLASH]     /nfs/repository/working_area/SHISTO/V7/HIV_integrations/wannaporn/new_2020_data/Sm-control_i1-R2.fastq.gz
[FLASH]
[FLASH] Output files:
[FLASH]     ./out.extendedFrags.fastq
[FLASH]     ./out.notCombined_1.fastq
[FLASH]     ./out.notCombined_2.fastq
[FLASH]     ./out.hist
[FLASH]     ./out.histogram
[FLASH]
[FLASH] Parameters:
[FLASH]     Min overlap:           10
[FLASH]     Max overlap:           200
[FLASH]     Max mismatch density: 0.250000
[FLASH]     Allow "outie" pairs:   false
[FLASH]     Cap mismatch quals:    false
[FLASH]     Combiner threads:      16
[FLASH]     Input format:          FASTQ, phred_offset=33
[FLASH]     Output format:         FASTQ, phred_offset=33
[FLASH]
[FLASH] Starting reader and writer threads
[FLASH] Starting 16 combiner threads
[FLASH] Processed 20593 read pairs
[FLASH]
[FLASH] Read combination statistics:
[FLASH]     Total pairs:      20593
[FLASH]     Combined pairs:   140
[FLASH]     Uncombined pairs: 20453
[FLASH]     Percent combined: 0.68%
[FLASH]
[FLASH] Writing histogram files.
[FLASH]
[FLASH] FLASH v1.2.11 complete!
[FLASH] 0.273 seconds elapsed

Seems to work fine. However, only a small percent of my read-pairs are combined (0.68%), oh well!

Monday 23 March 2020

Comparing SMILES strings using OpenBabel

I have three different SMILES strings that should be for the same compound, from different sources.

It's a little difficult to be sure they are definitely the same thing, because the compound has stereochemistry and it's hard to be sure all those wedge bonds are the same just by looking at them.

In this case, when I paste the SMILES into CDK Depict, they look a little bit different from each other, but I think that's just due to rotation about a single bond.
One looks like this:

and the other looks like this:

If you are beginning to squint and roll your head sideways, then perhaps we need an easier way to compare them...

So here's a way to do it using OpenBabel (which happily, I find I installed on my Mac a while ago, see my blog post on that):

- first you can write a file with the three SMILES strings, something like this: (which I called 'digitoxin.smi', as they should all be the compound digitoxin):
O1CC(=CC1=O)[C@@H]2[C@@]3([C@@]([C@H]4[C@@H]([C@@]5([C@@H](C[C@H](CC5)O[C@@H]6O[C@@H]([C@H]([C@H](C6)O)O[C@@H]7O[C@@H]([C@H]([C@H](C7)O)O[C@@H]8O[C@@H]([C@H]([C@H](C8)O)O)C)C)C)CC4)C)CC3)(CC2)O)C    myspreadsheet
C[C@H]1O[C@@H](O[C@H]2[C@@H](O)C[C@H](O[C@H]3[C@@H](O)C[C@H](O[C@H]4CC[C@]5(C)[C@H]6CC[C@]7(C)[C@@H](C8=CC(=O)OC8)CC[C@]7(O)[C@@H]6CC[C@@H]5C4)O[C@@H]3C)O[C@@H]2C)C[C@H](O)[C@@H]1O    chembl
O1CC(=CC1=O)[C@@H]2[C@@]3([C@@]([C@H]4[C@@H]([C@@]5([C@@H](C[C@H](CC5)O[C@@H]6O[C@@H]([C@H]([C@H](C6)O)O[C@@H]7O[C@@H]([C@H]([C@H](C7)O)O[C@@H]8O[C@@H]([C@H]([C@H](C8)O)O)C)C)C)CC4)C)CC3)(CC2)O)C    sigmaspreadsheet

You can see all the '@' symbols which convey the stereochemistry info. in the SMILES strings. Stereochemistry info. in SMILES strings can be also conveyed by '/' or '\' symbols too.

Then use OpenBabel to convert these to Inchi keys:
% obabel digitoxin.smi -o inchikey
This gives me:
WDJUZGPOPHTGOT-XUDUSOBPSA-N
WDJUZGPOPHTGOT-XUDUSOBPSA-N
WDJUZGPOPHTGOT-XUDUSOBPSA-N
Apparently, the first block (before the first '-') says whether the atoms are connected in the same way, and the second block (after the first '-' and before the second '-') tells about the stereochemistry. Looks like the three compounds are the same, hurray!

Thanks!
Thanks to Noel O'Blog for help.

Thursday 27 February 2020

Seurat for single cell RNAseq data analysis

I'm learning how to use the Seurat package for analysis of single cell RNAseq data.

Here's some notes on how I installed Seurat and made some plots, using someone else's RNAseq data set.

Installing Seurat (on a Windows laptop)
- I opened Rstudio.
- I checked the R version by going to Tools > Global options, which told me I had R 3.6.1 (Seurat requires R 3.4 or greater).
- I followed the Seurat installation instructions but with a small workaround suggested on github:
> if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
> BiocManager::install("multtest")
> install.packages('Seurat')
> library(Seurat)
- This installed Seurat 3.1.0.

Reading in and plotting some single cell data
I had someone else's single cell RNAseq data as an .rds file. I read it into R as a Seurat object using:
> dat <- readRDS('file.rds')
Then I made a UMAP plot of the data using:
> DimPlot(dat, label=TRUE)
Then to make a plot showing a particular gene's expression on the UMAP, e.g. gene Smp_089320 (see examples here):
> FeaturePlot(dat, features = c("Smp-089320"))

I also learnt from a a tutorial on visualisation using Seurat that you can make a violin plot for a gene using:
> VlnPlot(dat, features = c("Smp-089320"))
You should also be able to make a plot visualising gene expression in each cluster using:
> RidgePlot(dat, features = c("Smp-089320"))
but for some reason that gave me an error message.

Friday 24 January 2020

RStudio for R

My colleague Adam Reid suggested that I try RStudio for using R.

Installing and running RStudio
I installed it on my Mac laptop by going to the downloads page and downloading the latest Mac installer (dmg file). This brought up an icon for RStudio, which I double-clicked on. Then the RStudio window opened (hurray!):