This is only useful to Sanger users...
To find software on the farm, you can use:
% /software/bin/locate_software
e.g. to find the blat software:
% /software/bin/locate_software blat
Monday, 8 June 2020
Using FLASH to merge paired Illumina reads
I have a fastq file for the forward reads from an Illumina sequence library, and a fastq file for the reverse reads, and want to see if I can merge these, as I know that in many cases the insert size is small so the forward and reverse reads should overlap.
To try to merge them, I'm going to try the FLASH software, described in a paper by Magoc & Salzberg (2011).
On the command line, it seems I can run it by typing:
% /nfs/team87/farm3_lims2_vms/software/crispresso_dependencies/bin/flash
This tells me the input should be:
Usage: flash [OPTIONS] MATES_1.FASTQ MATES_2.FASTQ
Let's try it on my data: (where I have 20,593 read-pairs)
% /nfs/team87/farm3_lims2_vms/software/crispresso_dependencies/bin/flash my-R1.fastq.gz my-R2.fastq.gz
This gives me output:
[FLASH] Starting FLASH v1.2.11
[FLASH] Fast Length Adjustment of SHort reads
[FLASH]
[FLASH] Input files:
[FLASH] my-R1.fastq.gz
[FLASH] my-R2.fastq.gz
[FLASH]
[FLASH] Output files:
[FLASH] ./out.extendedFrags.fastq
[FLASH] ./out.notCombined_1.fastq
[FLASH] ./out.notCombined_2.fastq
[FLASH] ./out.hist
[FLASH] ./out.histogram
[FLASH]
[FLASH] Parameters:
[FLASH] Min overlap: 10
[FLASH] Max overlap: 65
[FLASH] Max mismatch density: 0.250000
[FLASH] Allow "outie" pairs: false
[FLASH] Cap mismatch quals: false
[FLASH] Combiner threads: 16
[FLASH] Input format: FASTQ, phred_offset=33
[FLASH] Output format: FASTQ, phred_offset=33
[FLASH]
[FLASH] Starting reader and writer threads
[FLASH] Starting 16 combiner threads
[FLASH] Processed 20593 read pairs
[FLASH]
[FLASH] Read combination statistics:
[FLASH] Total pairs: 20593
[FLASH] Combined pairs: 137
[FLASH] Uncombined pairs: 20456
[FLASH] Percent combined: 0.67%
[FLASH]
[FLASH] Writing histogram files.
[FLASH] WARNING: An unexpectedly high proportion of combined pairs (41.61%)
overlapped by more than 65 bp, the --max-overlap (-M) parameter. Consider
increasing this parameter. (As-is, FLASH is penalizing overlaps longer than
65 bp when considering them for possible combining!)
[FLASH]
[FLASH] FLASH v1.2.11 complete!
[FLASH] 1.031 seconds elapsed
[FLASH] Finished with 1 warning (see above)
This gave me a warning that I should increase the max_overlap parameter, so next I tried:
% /nfs/team87/farm3_lims2_vms/software/crispresso_dependencies/bin/flash -M 200 my-R1.fastq.gz my-R2.fastq.gz
[FLASH] Starting FLASH v1.2.11
[FLASH] Fast Length Adjustment of SHort reads
[FLASH]
[FLASH] Input files:
[FLASH] /nfs/repository/working_area/SHISTO/V7/HIV_integrations/wannaporn/new_2020_data/Sm-control_i1-R1.fastq.gz
[FLASH] /nfs/repository/working_area/SHISTO/V7/HIV_integrations/wannaporn/new_2020_data/Sm-control_i1-R2.fastq.gz
[FLASH]
[FLASH] Output files:
[FLASH] ./out.extendedFrags.fastq
[FLASH] ./out.notCombined_1.fastq
[FLASH] ./out.notCombined_2.fastq
[FLASH] ./out.hist
[FLASH] ./out.histogram
[FLASH]
[FLASH] Parameters:
[FLASH] Min overlap: 10
[FLASH] Max overlap: 200
[FLASH] Max mismatch density: 0.250000
[FLASH] Allow "outie" pairs: false
[FLASH] Cap mismatch quals: false
[FLASH] Combiner threads: 16
[FLASH] Input format: FASTQ, phred_offset=33
[FLASH] Output format: FASTQ, phred_offset=33
[FLASH]
[FLASH] Starting reader and writer threads
[FLASH] Starting 16 combiner threads
[FLASH] Processed 20593 read pairs
[FLASH]
[FLASH] Read combination statistics:
[FLASH] Total pairs: 20593
[FLASH] Combined pairs: 140
[FLASH] Uncombined pairs: 20453
[FLASH] Percent combined: 0.68%
[FLASH]
[FLASH] Writing histogram files.
[FLASH]
[FLASH] FLASH v1.2.11 complete!
[FLASH] 0.273 seconds elapsed
Seems to work fine. However, only a small percent of my read-pairs are combined (0.68%), oh well!
To try to merge them, I'm going to try the FLASH software, described in a paper by Magoc & Salzberg (2011).
On the command line, it seems I can run it by typing:
% /nfs/team87/farm3_lims2_vms/software/crispresso_dependencies/bin/flash
This tells me the input should be:
Usage: flash [OPTIONS] MATES_1.FASTQ MATES_2.FASTQ
Let's try it on my data: (where I have 20,593 read-pairs)
% /nfs/team87/farm3_lims2_vms/software/crispresso_dependencies/bin/flash my-R1.fastq.gz my-R2.fastq.gz
This gives me output:
[FLASH] Starting FLASH v1.2.11
[FLASH] Fast Length Adjustment of SHort reads
[FLASH]
[FLASH] Input files:
[FLASH] my-R1.fastq.gz
[FLASH] my-R2.fastq.gz
[FLASH]
[FLASH] Output files:
[FLASH] ./out.extendedFrags.fastq
[FLASH] ./out.notCombined_1.fastq
[FLASH] ./out.notCombined_2.fastq
[FLASH] ./out.hist
[FLASH] ./out.histogram
[FLASH]
[FLASH] Parameters:
[FLASH] Min overlap: 10
[FLASH] Max overlap: 65
[FLASH] Max mismatch density: 0.250000
[FLASH] Allow "outie" pairs: false
[FLASH] Cap mismatch quals: false
[FLASH] Combiner threads: 16
[FLASH] Input format: FASTQ, phred_offset=33
[FLASH] Output format: FASTQ, phred_offset=33
[FLASH]
[FLASH] Starting reader and writer threads
[FLASH] Starting 16 combiner threads
[FLASH] Processed 20593 read pairs
[FLASH]
[FLASH] Read combination statistics:
[FLASH] Total pairs: 20593
[FLASH] Combined pairs: 137
[FLASH] Uncombined pairs: 20456
[FLASH] Percent combined: 0.67%
[FLASH]
[FLASH] Writing histogram files.
[FLASH] WARNING: An unexpectedly high proportion of combined pairs (41.61%)
overlapped by more than 65 bp, the --max-overlap (-M) parameter. Consider
increasing this parameter. (As-is, FLASH is penalizing overlaps longer than
65 bp when considering them for possible combining!)
[FLASH]
[FLASH] FLASH v1.2.11 complete!
[FLASH] 1.031 seconds elapsed
[FLASH] Finished with 1 warning (see above)
This gave me a warning that I should increase the max_overlap parameter, so next I tried:
% /nfs/team87/farm3_lims2_vms/software/crispresso_dependencies/bin/flash -M 200 my-R1.fastq.gz my-R2.fastq.gz
[FLASH] Starting FLASH v1.2.11
[FLASH] Fast Length Adjustment of SHort reads
[FLASH]
[FLASH] Input files:
[FLASH] /nfs/repository/working_area/SHISTO/V7/HIV_integrations/wannaporn/new_2020_data/Sm-control_i1-R1.fastq.gz
[FLASH] /nfs/repository/working_area/SHISTO/V7/HIV_integrations/wannaporn/new_2020_data/Sm-control_i1-R2.fastq.gz
[FLASH]
[FLASH] Output files:
[FLASH] ./out.extendedFrags.fastq
[FLASH] ./out.notCombined_1.fastq
[FLASH] ./out.notCombined_2.fastq
[FLASH] ./out.hist
[FLASH] ./out.histogram
[FLASH]
[FLASH] Parameters:
[FLASH] Min overlap: 10
[FLASH] Max overlap: 200
[FLASH] Max mismatch density: 0.250000
[FLASH] Allow "outie" pairs: false
[FLASH] Cap mismatch quals: false
[FLASH] Combiner threads: 16
[FLASH] Input format: FASTQ, phred_offset=33
[FLASH] Output format: FASTQ, phred_offset=33
[FLASH]
[FLASH] Starting reader and writer threads
[FLASH] Starting 16 combiner threads
[FLASH] Processed 20593 read pairs
[FLASH]
[FLASH] Read combination statistics:
[FLASH] Total pairs: 20593
[FLASH] Combined pairs: 140
[FLASH] Uncombined pairs: 20453
[FLASH] Percent combined: 0.68%
[FLASH]
[FLASH] Writing histogram files.
[FLASH]
[FLASH] FLASH v1.2.11 complete!
[FLASH] 0.273 seconds elapsed
Seems to work fine. However, only a small percent of my read-pairs are combined (0.68%), oh well!
Monday, 23 March 2020
Comparing SMILES strings using OpenBabel
I have three different SMILES strings that should be for the same compound, from different sources.
It's a little difficult to be sure they are definitely the same thing, because the compound has stereochemistry and it's hard to be sure all those wedge bonds are the same just by looking at them.
In this case, when I paste the SMILES into CDK Depict, they look a little bit different from each other, but I think that's just due to rotation about a single bond.
One looks like this:
C%5BC%40H%5D(O%5BC%40H%5D3%5BC%40%40H%5D(O)C%5BC%40H%5D(O%5BC%40H%5D4CC%5BC%40%5D5(C)%5BC%40H%5D6CC%5BC%40%5D7(C)%5BC%40%40H%5D(C8%3DCC(%3DO)OC8)CC%5BC%40%5D7(O)%5BC%40%40H%5D6CC%5BC%40%40H%5D5C4)O%5BC%40%40H%5D3C)O%5BC%40%40H%5D2C)C%5BC%40H%5D(O)%5BC%40%40H%5D1O&w=80&h=50&abbr=off&hdisp=bridgehead&showtitle=false&zoom=1.6&annotate=none)
and the other looks like this:
%5BC%40%40H%5D2%5BC%40%40%5D3(%5BC%40%40%5D(%5BC%40H%5D4%5BC%40%40H%5D(%5BC%40%40%5D5(%5BC%40%40H%5D(C%5BC%40H%5D(CC5)O%5BC%40%40H%5D6O%5BC%40%40H%5D(%5BC%40H%5D(%5BC%40H%5D(C6)O)O%5BC%40%40H%5D7O%5BC%40%40H%5D(%5BC%40H%5D(%5BC%40H%5D(C7)O)O%5BC%40%40H%5D8O%5BC%40%40H%5D(%5BC%40H%5D(%5BC%40H%5D(C8)O)O)C)C)C)CC4)C)CC3)(CC2)O)C&w=80&h=50&abbr=off&hdisp=bridgehead&showtitle=false&zoom=1.6&annotate=none)
If you are beginning to squint and roll your head sideways, then perhaps we need an easier way to compare them...
So here's a way to do it using OpenBabel (which happily, I find I installed on my Mac a while ago, see my blog post on that):
- first you can write a file with the three SMILES strings, something like this: (which I called 'digitoxin.smi', as they should all be the compound digitoxin):
O1CC(=CC1=O)[C@@H]2[C@@]3([C@@]([C@H]4[C@@H]([C@@]5([C@@H](C[C@H](CC5)O[C@@H]6O[C@@H]([C@H]([C@H](C6)O)O[C@@H]7O[C@@H]([C@H]([C@H](C7)O)O[C@@H]8O[C@@H]([C@H]([C@H](C8)O)O)C)C)C)CC4)C)CC3)(CC2)O)C myspreadsheet
C[C@H]1O[C@@H](O[C@H]2[C@@H](O)C[C@H](O[C@H]3[C@@H](O)C[C@H](O[C@H]4CC[C@]5(C)[C@H]6CC[C@]7(C)[C@@H](C8=CC(=O)OC8)CC[C@]7(O)[C@@H]6CC[C@@H]5C4)O[C@@H]3C)O[C@@H]2C)C[C@H](O)[C@@H]1O chembl
O1CC(=CC1=O)[C@@H]2[C@@]3([C@@]([C@H]4[C@@H]([C@@]5([C@@H](C[C@H](CC5)O[C@@H]6O[C@@H]([C@H]([C@H](C6)O)O[C@@H]7O[C@@H]([C@H]([C@H](C7)O)O[C@@H]8O[C@@H]([C@H]([C@H](C8)O)O)C)C)C)CC4)C)CC3)(CC2)O)C sigmaspreadsheet
You can see all the '@' symbols which convey the stereochemistry info. in the SMILES strings. Stereochemistry info. in SMILES strings can be also conveyed by '/' or '\' symbols too.
Then use OpenBabel to convert these to Inchi keys:
% obabel digitoxin.smi -o inchikey
This gives me:
WDJUZGPOPHTGOT-XUDUSOBPSA-N
WDJUZGPOPHTGOT-XUDUSOBPSA-N
WDJUZGPOPHTGOT-XUDUSOBPSA-N
Apparently, the first block (before the first '-') says whether the atoms are connected in the same way, and the second block (after the first '-' and before the second '-') tells about the stereochemistry. Looks like the three compounds are the same, hurray!
Thanks!
Thanks to Noel O'Blog for help.
It's a little difficult to be sure they are definitely the same thing, because the compound has stereochemistry and it's hard to be sure all those wedge bonds are the same just by looking at them.
In this case, when I paste the SMILES into CDK Depict, they look a little bit different from each other, but I think that's just due to rotation about a single bond.
One looks like this:
and the other looks like this:
If you are beginning to squint and roll your head sideways, then perhaps we need an easier way to compare them...
So here's a way to do it using OpenBabel (which happily, I find I installed on my Mac a while ago, see my blog post on that):
- first you can write a file with the three SMILES strings, something like this: (which I called 'digitoxin.smi', as they should all be the compound digitoxin):
O1CC(=CC1=O)[C@@H]2[C@@]3([C@@]([C@H]4[C@@H]([C@@]5([C@@H](C[C@H](CC5)O[C@@H]6O[C@@H]([C@H]([C@H](C6)O)O[C@@H]7O[C@@H]([C@H]([C@H](C7)O)O[C@@H]8O[C@@H]([C@H]([C@H](C8)O)O)C)C)C)CC4)C)CC3)(CC2)O)C myspreadsheet
C[C@H]1O[C@@H](O[C@H]2[C@@H](O)C[C@H](O[C@H]3[C@@H](O)C[C@H](O[C@H]4CC[C@]5(C)[C@H]6CC[C@]7(C)[C@@H](C8=CC(=O)OC8)CC[C@]7(O)[C@@H]6CC[C@@H]5C4)O[C@@H]3C)O[C@@H]2C)C[C@H](O)[C@@H]1O chembl
O1CC(=CC1=O)[C@@H]2[C@@]3([C@@]([C@H]4[C@@H]([C@@]5([C@@H](C[C@H](CC5)O[C@@H]6O[C@@H]([C@H]([C@H](C6)O)O[C@@H]7O[C@@H]([C@H]([C@H](C7)O)O[C@@H]8O[C@@H]([C@H]([C@H](C8)O)O)C)C)C)CC4)C)CC3)(CC2)O)C sigmaspreadsheet
You can see all the '@' symbols which convey the stereochemistry info. in the SMILES strings. Stereochemistry info. in SMILES strings can be also conveyed by '/' or '\' symbols too.
Then use OpenBabel to convert these to Inchi keys:
% obabel digitoxin.smi -o inchikey
This gives me:
WDJUZGPOPHTGOT-XUDUSOBPSA-N
WDJUZGPOPHTGOT-XUDUSOBPSA-N
WDJUZGPOPHTGOT-XUDUSOBPSA-N
Apparently, the first block (before the first '-') says whether the atoms are connected in the same way, and the second block (after the first '-' and before the second '-') tells about the stereochemistry. Looks like the three compounds are the same, hurray!
Thanks!
Thanks to Noel O'Blog for help.
Thursday, 27 February 2020
Seurat for single cell RNAseq data analysis
I'm learning how to use the Seurat package for analysis of single cell RNAseq data.
Here's some notes on how I installed Seurat and made some plots, using someone else's RNAseq data set.
Installing Seurat (on a Windows laptop)
- I opened Rstudio.
- I checked the R version by going to Tools > Global options, which told me I had R 3.6.1 (Seurat requires R 3.4 or greater).
- I followed the Seurat installation instructions but with a small workaround suggested on github:
> if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
> BiocManager::install("multtest")
> install.packages('Seurat')
> library(Seurat)
- This installed Seurat 3.1.0.
Reading in and plotting some single cell data
I had someone else's single cell RNAseq data as an .rds file. I read it into R as a Seurat object using:
> dat <- readRDS('file.rds')
Then I made a UMAP plot of the data using:
> DimPlot(dat, label=TRUE)
Then to make a plot showing a particular gene's expression on the UMAP, e.g. gene Smp_089320 (see examples here):
> FeaturePlot(dat, features = c("Smp-089320"))
I also learnt from a a tutorial on visualisation using Seurat that you can make a violin plot for a gene using:
> VlnPlot(dat, features = c("Smp-089320"))
You should also be able to make a plot visualising gene expression in each cluster using:
> RidgePlot(dat, features = c("Smp-089320"))
but for some reason that gave me an error message.
Here's some notes on how I installed Seurat and made some plots, using someone else's RNAseq data set.
Installing Seurat (on a Windows laptop)
- I opened Rstudio.
- I checked the R version by going to Tools > Global options, which told me I had R 3.6.1 (Seurat requires R 3.4 or greater).
- I followed the Seurat installation instructions but with a small workaround suggested on github:
> if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
> BiocManager::install("multtest")
> install.packages('Seurat')
> library(Seurat)
- This installed Seurat 3.1.0.
Reading in and plotting some single cell data
I had someone else's single cell RNAseq data as an .rds file. I read it into R as a Seurat object using:
> dat <- readRDS('file.rds')
Then I made a UMAP plot of the data using:
> DimPlot(dat, label=TRUE)
Then to make a plot showing a particular gene's expression on the UMAP, e.g. gene Smp_089320 (see examples here):
> FeaturePlot(dat, features = c("Smp-089320"))
I also learnt from a a tutorial on visualisation using Seurat that you can make a violin plot for a gene using:
> VlnPlot(dat, features = c("Smp-089320"))
You should also be able to make a plot visualising gene expression in each cluster using:
> RidgePlot(dat, features = c("Smp-089320"))
but for some reason that gave me an error message.
Friday, 24 January 2020
RStudio for R
My colleague Adam Reid suggested that I try RStudio for using R.
Installing and running RStudio
I installed it on my Mac laptop by going to the downloads page and downloading the latest Mac installer (dmg file). This brought up an icon for RStudio, which I double-clicked on. Then the RStudio window opened (hurray!):
Installing and running RStudio
I installed it on my Mac laptop by going to the downloads page and downloading the latest Mac installer (dmg file). This brought up an icon for RStudio, which I double-clicked on. Then the RStudio window opened (hurray!):
Wednesday, 11 December 2019
Statistical rethinking: grid approximation for estimating a posterior
My colleague James Cotton recommended that a good way to expand my stats knowledge is to read the book Statistcal Rethinking by Richard McElreath, which so far I'm enjoying a lot!
A nice aspect of the book is that he uses many examples in R.
Grid approximation for estimating a posterior distribution
In chapter 2, he shows how to estimate a posterior distribution using a 'grid approximation' approach. I've wrapped up his lines of code into a little function that lets you try out different number of grid points.
In this case, we are interested in calculating the posterior distribution for p, the probability of success for a binomial variable.
We can specifiy different prior distributions for p.
Let's assume that we have carried out 9 Bernoulli trials and observed 6 successes so far.
Let's try a uniform prior:
plot_posterior_using_uniform_prior <- function(num_points_in_grid)
{
{
p_grid <- seq(from = 0, to=1, length.out=num_points_in_grid)
prior <- ifelse ( p_grid < 0.5, 0, 1)
Chapter 2 of Statistical Rethinking also explains that the grid approximation approach doesn't scale well and is slow, and that a faster, more scaleable approach is quadratic approximation. McElreath shows how it can be used to estimate the posterior distribution for the same parameter p, as above: (see his book for how to install the 'rethinking' R package):
A nice aspect of the book is that he uses many examples in R.
Grid approximation for estimating a posterior distribution
In chapter 2, he shows how to estimate a posterior distribution using a 'grid approximation' approach. I've wrapped up his lines of code into a little function that lets you try out different number of grid points.
In this case, we are interested in calculating the posterior distribution for p, the probability of success for a binomial variable.
We can specifiy different prior distributions for p.
Let's assume that we have carried out 9 Bernoulli trials and observed 6 successes so far.
Let's try a uniform prior:
plot_posterior_using_uniform_prior <- function(num_points_in_grid)
{
p_grid <- seq(from = 0, to=1, length.out=num_points_in_grid)
prior <- rep(1,num_points_in_grid)
likelihood <- dbinom(6, size=9, prob=p_grid)
ustd.posterior <- likelihood * prior
posterior <- ustd.posterior/sum(ustd.posterior)
plot(p_grid, posterior, type="b", xlab="probability of success", ylab="posterior probability for p")
}
Do a plot with a grid with 30 values for p:
plot_posterior_using_uniform_prior(num_points_in_grid=30)
Let's try 10 values for p: plot_posterior_using_uniform_prior(num_points_in_grid=10)
This doesn't give us such a good estimate of the posterior distribution for p:
What if we had a different prior, that was a step function, assigning 0 probability to all values of p less than 0.5:
plot_posterior_using_step_prior <- function(num_points_in_grid){
p_grid <- seq(from = 0, to=1, length.out=num_points_in_grid)
prior <- ifelse ( p_grid < 0.5, 0, 1)
likelihood <- dbinom(6, size=9, prob=p_grid)
ustd.posterior <- likelihood * prior
posterior <- ustd.posterior/sum(ustd.posterior)
plot(p_grid, posterior, type="b", xlab="probability of success", ylab="posterior probability for p")
}plot_posterior_using_step_prior(num_points_in_grid=30)
Quadratic approximation for estimating a posterior distribution Chapter 2 of Statistical Rethinking also explains that the grid approximation approach doesn't scale well and is slow, and that a faster, more scaleable approach is quadratic approximation. McElreath shows how it can be used to estimate the posterior distribution for the same parameter p, as above: (see his book for how to install the 'rethinking' R package):
library(rethinking)
globe.qa <- quap(alist(W ~ dbinom(W+L,p), p ~ dunif(0,1)), data=list(W=6,L=3) + )
precis(globe.qa)
mean sd 5.5% 94.5%
p 0.67 0.16 0.42 0.92
This tells us that, using a uniform prior and assuming the posterior for p is Gaussian (the key assumption of the quadratic approximation approach), its maximum (peak) is at 0.67 and its standard deviation is 0.16.
This is similar to the peak of the posterior distribution for p that we estimated using the grid approximation approach above (when using a large number of grid points such as 30).
Friday, 15 November 2019
Drawing a molecule using CDK Depict
Before I wrote a blogpost on how to draw a molecule using OpenBabel.
Another lovely way to draw a molecule is using CDK Depict.
I can paste in a SMILES string, eg. for the drug tegaserod 'CCCCCNC(=N)N\N=C\c1c[nH]c2ccc(OC)cc12', and get a picture:
I prefer to have the picture in colour on a white background, and no abbreviations:
N%5CN%3DC%5Cc1c%5BnH%5Dc2ccc(OC)cc12&w=80&h=50&abbr=off&hdisp=bridgehead&showtitle=false&zoom=1.6&annotate=none)
Cool!
Another lovely way to draw a molecule is using CDK Depict.
I can paste in a SMILES string, eg. for the drug tegaserod 'CCCCCNC(=N)N\N=C\c1c[nH]c2ccc(OC)cc12', and get a picture:
I prefer to have the picture in colour on a white background, and no abbreviations:
Cool!
Subscribe to:
Comments (Atom)



