Thursday 20 December 2018
CRAM to BAM conversion
To convert a CRAM file to BAM, you can use the SCRAMBLE conversion tool.
For example:
% /software/sciops/pkg/staden_io_bin/1.14.9/bin/scramble -I cram -O bam 27541_1#9.cram > 27541_1#9.bam
where /software/sciops/pkg/staden_io_bin/1.14.9/bin/scramble is where SCRAMBLE is installed,
27541_1#9.cram is the input CRAM file,
27541_1#9.bam is the output BAM file.
Monday 26 November 2018
Dashed line in pysvg
Now I wanted to figure out how to make a dashed line using pysvg.
The library seems to be missing a dedicated function for this, but I was able to do it with:
# define the dashed line style
linestyle_dict = {'stroke-width':2, 'stroke':'black','stroke-dasharray': "5, 5"}
linestyle = pysvg.builders.StyleBuilder(linestyle_dict)
myline = shape_builder.createLine(0,0,300,300)
myline.set_style(linestyle.getStyle())
svg_document.addElement(myline)
If you want to change the lengths of the dashes you can change for example 'stroke-dasharray': "5, 5" to 'stroke-dasharray': "10, 10".
BWA aligner
BWA mem
On the BWA website, it explains the the 'bwa mem' algorithm is the best one for '70bp or longer Illumina, 454, Ion Torrent and Sanger reads, assembly contigs and BAC sequences.'
I've previously used it something like this:
bwa mem genome.fasta reads_fastq.gz mates_fastq.gz | samtools view -b -S -F 4 - > output.bam
run 'bwa mem' to align reads and their mates to a genome, and pipe the output through samtools to just take the reads that align (using -F4) and discarding reads that don't align (to save disk space), and put the output in output.bam
Other options
Other options available in BWA include:
bwa sampe: generated paired ended alignments (This takes an option -a <max_insert_size>)
Friday 23 November 2018
Adjusting text style in pysvg
Now I want to do something a bit more tricky, to adjust the font styles.
I found a tutorial online, but the commands in it don't seem to work with the current pysvg that's installed on the Sanger farm.
However, I found another file on github that gave me some hints...
Adjusting font style and alignment with pysvg
You need something like this in the Python:
# define the text style
style = pysvg.builders.StyleBuilder()
style.setFontFamily(fontfamily="Arial")
style.setFontSize('1em')
style.setFilling("blue")
style.setTextAnchor("start")
style.style_dict["alignment-baseline"] = "middle" # there doesn't seem to be a setAlignmentBaseline function
Then you can add text using something like this:
# add some text to our svg picture for this chromosome:
t=pysvg.text.Text(mychr, x = mychrlen*xscale + 30, y = chrcnt*yscale + 30)
t.set_style(style.getStyle())
svg_document.addElement(t)
Wednesday 21 November 2018
Jupyter notebooks using Python
The first step is to install Jupyter.
(Note to self: I found it is installed on the pcs5 at Sanger, so I can start it up by 'ssh -Y pcs5', and then..)
The second step is to start up a Jupyter notebook session:
(Note to self: I did this in /nfs/users/nfs_a/alc/Documents/git/Python)
% jupyter notebook
(Note to self: jupyter is installed with python2 software on the Sanger farm, so seems to assume Python2.)
This opens up a web browser showing the files in the directory where I started jupyter.
You can then start a new notebook by going to the 'New' menu on the right, and choosing 'python2'.
You can give your notebook a title by typing in the 'Untitled' box at the top, for example, you could rename it 'plotting_chrs', and then it would be saved as file plotting_chrs.ipynb
Then you can start typing in the prompt line.
To get a line to be executed, you can press Shift+Return.
Here I made a jupiter notebook to make a SVG plot of chromosomes (see my previous blogpost), and display the SVG inline:
(Note to self: the whole of the svg image from pysvg doesn't seem to be displayed inline, I'm not sure why this is - I tried another svg and it worked fine.)
To shut down the notebook, you can click on 'Close and Halt' in the File menu. To start again you can type
% jupyter notebook plotting_chrs.ipynb
to open this notebook again.
Later note (May 2019):
- I wanted to install jupyter notebook on a laptop running Windows. To do this, I first installed the latest Python (v 3.7.3) using a Window installer for Python from the Python website. Then I used 'cmd' in Windows to open a command prompt and within the command prompt I typed:
> py -m pip install --upgrade pip
> py -m pip install jupyter
This installed Jupyter notebook [Note: to install another module, such as 'pandas', you can type 'py -m pip install pandas].
To then run a certain Jupyter notebook such as my.notebook, you can type:
> jupyter notebook C:\\mypath\Docs\my.notebook
This should open the notebook my.notebook in a web browser.
You can get a certain command to run in the Jupyter notebook by pressing shift+return.
- I wanted to install jupyter notebook on a Mac laptop. I already had Anaconda installed, so typed:
Plotting chromosomes using Python pysvg
I found some nice examples for using pysvg here.
Here is another old example (seems to use an old version of pysvg, that doesn't work now) by Noel O'Blog.
It can be run using python2 (not python3):
% python /nfs/users/nfs_a/alc/Documents/git/Python/plot_emu_chrs.py emu_chrs.svg
% convert emu_chrs.svg emu_chrs.png
The output looks like (as png):
You can edit it manually in the SVG editor Inkscape (Note to self: do 'ssh -Y pcs5' and then 'inkscape emu_chrs.svg').
The code is:
import sys
import os
import pysvg.structure
import pysvg.builders
import pysvg.text
# Note: this is to make a svg plot with the E. multilocularis chromosomes,
# plotted one under the other, scaled according to their lengths, and
# with a different colour for each chromosome, and a label for each
# chromosome.
# Note 2: this script must be run with python2 (not python3).
#====================================================================#
def define_input_data():
# define the input data, as a list of E. multilocularis chromosomes, and
# a dictionary with lengths of E. multilocularis chromosomes, and a dictionary
# with the colours to plot E. multilocularis chromosomes:
chrlist = ["chr1", "chr2", "chr3"]
chrlengths = { "chr1": 2000000, "chr2": 5000000, "chr3": 4500000}
chrcols = { "chr1": "rgb(255,0,153)", "chr2": "blue", "chr3": "purple"}
# Note: there is a list of RGB values for colours here: http://access-excel.tips/vba-excel-rgb-property-and-get-rgb-color/
return(chrlist, chrlengths, chrcols)
#====================================================================#
# make the plot of the input data:
def plot_input_data(chrlist, chrlengths, chrcols, output_svg_file):
# initialise our svg picture
svg_document = pysvg.structure.Svg()
shape_builder = pysvg.builders.ShapeBuilder()
# plot each chromosome, one under the other:
chrcnt = 0
yscale = 100
xscale = 0.00004
for mychr in chrlist:
# find its length:
assert(mychr in chrlist)
mychrlen = chrlengths[mychr]
# find the colour to plot it:
assert(mychr in chrcols)
mychrcol = chrcols[mychr]
# add a rectangle to our svg picture for this chromosome:
# createRect takes x, y, width, height
myrectwidth = "%dpx" % (mychrlen*xscale + 20)
svg_document.addElement(shape_builder.createRect(0, chrcnt*yscale, myrectwidth, "50px", strokewidth = 1, stroke = "black", fill = mychrcol))
# add some text to our svg picture for this chromosome:
svg_document.addElement(pysvg.text.Text(mychr, x = mychrlen*xscale + 30, y = chrcnt*yscale + 30))
# add one to the chromosome count:
chrcnt += 1
# save the svg picture to an output file
svg_document.save(output_svg_file)
return()
#====================================================================#
def main():
# check the command-line arguments:
if len(sys.argv) != 2:
print("Usage: %s output_svg_file" % sys.argv[0])
sys.exit(1)
output_svg_file = sys.argv[1]
# define the input data: (note: you could read this in from an input file)
(chrlist, chrlengths, chrcols) = define_input_data()
# make the plot of the input data:
plot_input_data(chrlist, chrlengths, chrcols, output_svg_file)
#====================================================================#
if __name__=="__main__":
main()
#====================================================================#
Wednesday 14 November 2018
I-TASSER for protein structure and function prediction
Notes to self:
- there is a script runI-TASSER.pl on the Sanger farm for running it
- I-TASSER and its scripts currently live here:
/software/pathogen/external/apps/usr/local/itasser-5.0
- Users need the library location:
/lustre/scratch118/infgen/pathogen/pathpipe/itasser
- Sample command:
runI-TASSER.pl -pkgdir /software/pathogen/external/apps/usr/local/itasser-5.0 -libdir /lustre/scratch118/infgen/pathogen/pathpipe/itasser -java_home /usr -seqname example -datadir example -outdir itasser_out
- If users run it with the -light option that will limit the simulation time to 5-6 hours which means they can get a smallish protein (<100aa) through in approx 8 hours.
Thanks to
Victoria Offord
Arporn Wangwiwatsin
Tuesday 13 November 2018
HHpred for finding remote homologues
Monday 29 October 2018
Phred quality scores
I always forget how they are calculated.
From wikipedia, the Phred base quality score for a particular base is:
Q = -10 log_10 (P)
where P = the probability that a particular base was called incorrectly.
Working out the probability a particular base was called incorrectly from the Phred score?
If you have Q, you can work out P as:
P = 10^(-Q /10)
So a base quality score of 20 means P = 10^(-2) = 0.010.
A base quality score of 25 means P = 0.003 approx.
A base quality score of 30 means P = 0.001.
Working out the Phred quality score threshold needed to achieve a particular threshold of P
If we want to use a threshold of P = 0.05, then we would need to use a threshold of Q of:
Q = -10 log_10(0.05) = 13.0103.
Working back to get P gives:
P = 10^(-13.0103/10) = 0.05.
Likewise, for a threshold of P = 0.005:
Q = -10 log_10(0.005) = 23.0103.
Working back to get P gives:
P = 10^(-23.0103/10) = 0.005.
Friday 26 October 2018
Running CRISPResso on a ton of sequencing data
Here's I'll talk about how to run it on loads of data! In this case I had ~20 million read-pairs for a single sample, while previously I had only run CRISPResso on 1-2 million read-pairs (even that took a couple of days!)
Here is a little plot I made of CRISPResso run-time in hours versus number of input read-pairs:
You can see that as the number of read-pairs gets above about 1 million, the run-time is 0.5-1 day roughly. I guess the runtime may also depend on things like the base quality of the reads (if they are low quality, I'm guessing CRISPResso will have to make lots of alignments with indels/substitutions, and may infer lots of indel/substitution mutations, which will take time).
To run CRISPResso on ~20 million read-pairs from a single sample, here's what I did:
1. Split up the input fastq files into smaller files of 1 million reads each
I have paired end read-pairs, so the input data is in a file called sample1_1.fastq.gz (forward reads) and sample1_2.fastq.gz (reverse reads). To split this up, I ran:
% python3 ~alc/Documents/git/Python/split_up_fastq.py sample1_1.fastq.gz 1000000 sample1_1_sub
% python3 ~alc/Documents/git/Python/split_up_fastq.py sample1_1.fastq.gz 1000000 sample1_1_sub
You can see my Python script on gist here.
Note to self: I submitted farm jobs to do this, as it's slow.
2. Use Trimmomatic to discard low quality read-pairs.
I used the Trimmomatic software to discard read-pairs with low quality bases, only taking read-pairs where both reads of a pair had average base quality of >=20. To do this, I ran:
% python3 ~alc/Documents/git/Python/filter_fastq_files_using_trimmomatic.py sample1 17
where here sample1 is the name of the sample in step 1 above, and 17 is the number of pairs of smaller fastq files that step 1 above made.
This script submits Trimmomatic jobs to the Sanger compute farm.
You can see my Python script on gist here.
3. Run CRISPResso on each smaller subset of read-pairs.
I then ran CRISPResso on each smaller subset of read-pairs:
% python3 ~alc/Documents/git/Python/submit_crispresso_jobs_for_subsetsoffastq.py sample1 17
where here sample1 is the name of the sample in step 1 above, and 17 is the number of pairs of smaller fastq files that step 1 above made.
This script submits CRISPResso jobs to the Sanger compute farm.
You can see my Python script on gist here, you would need to edit it at the 'xxx' positions, to put in your amplicon sequence, gRNA sequence and coding sequence.
Spiking sequencing libraries with PhiX
"introduction of sequencing diversity in low-complexity libraries (diversity is needed to discriminate clusters and create signal thresholds for base-calling). As the software has improved, the recommended amount of phiX spike-in has decreased."
Another comment says:
"Older basecalling software on the MiSeq used to require a 50% phiX spike for low-diversity (amplicon samples). More recently, the software has been updated and only requires a 5-10% spike."
Reading and writing gzipped files in Python
The same module can also write out zipped (.gz) files.
For example, here is a little script that reads in a zipped (.gz) input fastq file of sequence reads, and splits it into smaller files that have seqs_per_output_file sequences each. The output files are written out as zipped files (.gz).
I've highlighted the lines of the file that use the gzip module, and read and write from the input and output files.
Very handy!
import sys
import os
import gzip
from collections import defaultdict
#====================================================================#
# now read in the input fastq and split it up:
def read_fastq_file_and_split(input_fastq_file, seqs_per_output_file, output_file_prefix):
# open an output file:
output_file_cnt = 1
output_file = "%s_%d.fastq.gz" % (output_file_prefix, output_file_cnt)
outputfileObj = gzip.open(output_file, "wb") # write out the output file in gzipped format
print("Opening",output_file,"...")
# read in the input file:
fileObj = gzip.open(input_fastq_file, "rt") # this opens a gzipped file in text mode
seqcnt = 0
linecnt = 0
for line in fileObj:
line = line.rstrip()
if line.startswith('@') and (linecnt == 0 or linecnt == 4):
linecnt = 0
seqcnt += 1
if seqcnt == (seqs_per_output_file + 1):
outputfileObj.close()
output_file_cnt += 1
output_file = "%s_%d.fastq" % (output_file_prefix, output_file_cnt)
outputfileObj = gzip.open(output_file, "wb") # write out the output file in gzipped format
print("Opening",output_file,"...")
seqcnt = 1
outputline = "%s\n" % line
outputfileObj.write(outputline.encode()) # need to write bytes (not a string) to the output file, see https://stackoverflow.com/questions/49286135/write-strings-to-gzip-file
linecnt += 1
fileObj.close()
outputfileObj.close()
# the fastq file looks like this:
# @M03558:259:000000000-BH588:1:1101:15455:1333 2:N:0:NTTGTA
# NCAAGCATCTCATTTTGTGCATATACCTGGTCTTTCGTCTTCTGGCGTGAAGTCGCCGACTGAATGCCAGCAATCTCTTTTTGAGTCTCATTTTGCATCTCGGCAATCTCTTTCTGATTGTCCAGTTGCATTTTAGTAAGCTCTTTTTGATTCTCAAATCCGGCGTCAACCATACCAGCAGAGGAAGCATCAGCACCAGCACGCTCCCAAGCATTAAGCTCAGGAAATGCAGCAGCAAGATAATCACGAGT
# +
# #>>ABBFFFFFFGGGGGGGGGGHHHHHHHHHHHGHHGH2FFHHHHHGGGGGHHHGGGGGGGGHHHGHHHHGHHHHHHHHHHHGGGHHHHHHHHHHHHHHHHGGGGGHGFHHHHHHHHHHHHHGHHHHGHFHHHHHEFFFHGFFHHHHHGGHHHHHFGHHHHGGGCGGGFHHGGHGHHHHHHHHAGFFHHHHHGGGAEFHHHFDGDDGGGGEEBFGGGGGFGGGGEFGFFFGGGGGGFFFFFFBBFFFDFAA
return
#====================================================================#
def main():
# check the command-line arguments:
if len(sys.argv) != 4 or os.path.exists(sys.argv[1]) == False:
print("Usage: %s input_fastq_file seqs_per_output_file output_file_prefix" % sys.argv[0])
sys.exit(1)
input_fastq_file = sys.argv[1] # input fastq file
seqs_per_output_file = int(sys.argv[2]) # number of sequences to put into each output file
output_file_prefix = sys.argv[3] # prefix to use for the output file names
# now read in the input fastq and split it up:
read_fastq_file_and_split(input_fastq_file, seqs_per_output_file, output_file_prefix)
#====================================================================#
if __name__=="__main__":
main()
#====================================================================#
Thursday 25 October 2018
Randomly sample from a fastq file
Filter read-pairs by average base quality using Trimmomatic
From a discussion on seqanswers about filtering, I found that people suggested Trimmomatic.
Running Trimmomatic
I tried running Trimmomatic on some smaller files of 250,000 (0.25 million) read-pairs, using this command-line: (note to self: on the Sanger farm)
% java -Xmx128M -jar /software/pathogen/external/apps/usr/local/Trimmomatic-0.33/trimmomatic-0.33.jar PE -threads 1 -phred33 sample1topsmall_1.fastq.gz sample1topsmall_2.fastq.gz sample1topsmallout_1_paired.fastq.gz sample1topsmallout_1_unpaired.fastq.gz sample1topsmallout_2_paired.fastq.gz sample1topsmallout_2_unpaired.fastq.gz SLIDINGWINDOW:150:20
where:
PE means my reads are paired ends,
-threads 1 tells Trimmomatic to just use one thread (one CPU),
-phred33 tells Trimmomatic that the phred quality scores in my file use ASCII_BASE=33 (see here for an explanation), (note to self: I knew this from CRISPResso output)
sample1topsmall_1.fastq.gz sample1topsmall_2.fastq.gz are my input fastq files,
sample1topsmallout_1_paired.fastq.gz, sample1topsmallout_1_unpaired.fastq.gz, sample1topsmallout_2_paired.fastq.gz, sample1topsmallout_2_unpaired.fastq.gz are output files produced by Trimmomatic (two with reads where both reads of a pair pass the filter, two files with reads where just one read of a pair passes),
SLIDINGWINDOW:150:20 tells Trimmomatic to use a sliding window of 150 bp and take reads that have average base quality of >=20 in this window (note to self: my reads are 150 bp long).
Output from Trimmomatic
Trimmomatic ran really quickly! It gave some output like this:
Input Read Pairs: 250000 Both Surviving: 1619 (0.65%) Forward Only Surviving: 228797 (91.52%) Reverse Only Surviving: 604 (0.24%) Dropped: 18980 (7.59%)
TrimmomaticPE: Completed successfully
You can see here that my forward reads were much higher quality than my reverse reads, so few read-pairs survived where both reads of the pair passed the quality filter. The output files sample1topsmallout_1_paired.fastq.gz and sample1topsmallout_2_paired.fastq.gz have 1619 reads each as 1619 read-pairs passed.
Run-time and memory (RAM) required by Trimmomatic
I found Trimmomatic very fast. For 2.5 million read-pairs, it took about 2 minutes to run on the Sanger compute farm (using one CPU), and needed just ~80 Mbyte of memory (RAM).
Thursday 18 October 2018
Submitting an assembly plus annotations to the ENA
To do this, I followed these steps:
Step 1: Upload my EMBL file (with genome assembly and annotation) to the ENA submission website
Note (later): I think that actually Step 1 is not necessary for submitting your EMBL file to the ENA, so you could skip directly to Step 2 here (if you have your EMBL file already; if you need to make your EMBL file see here, and if you need to register your sample and study with the ENA see here).
Once I had created an EMBL file with my genome assembly and annotation (see here), and had checked my study and sample had already been registered with the ENA (see here), I followed these steps:
1) I went to the ENA submission website, and signed in using the login and password given to me by my team (Note to self: asked Pathogen Informatics team for this).
2) I clicked on the 'New Submissions' tab at the top.
3) Then I selected 'Submit genome assemblies'.
4) At the bottom of the screen, it says the first step is to upload the data files into the "Webin upload area". I clicked on the 'upload data files' link.
5) This brought me to a page that says to look at http://ena-docs.readthedocs.io/en/latest/upload_01.html
6) I created a md5sum for my file by typing:
% md5sum my.embl.gz > my.embl.gz.md5
7) Following the instructions on the http://ena-docs.readthedocs.io/en/latest/upload_01.html page, I did the following to transfer my files using ftp, to the ENA's "Webin upload area":
% ftp webin.ebi.ac.uk [enter username and password]
> bin
> put my.embl
> put my.embl.md5
> bye
The files that I transferred were zipped embl files (e.g. enterobius_vermicularis_new2.embl.gz and enterobius_vermicularis_new2.embl.md5.gz for a species Enterobius vermicularis).
Step 2: Make a "manifest" file for the EMBL file
To submit an EMBL file for example for a species Enterobius vermicularis, you need to make a 'manifest file' that contains the information on the assembly, e.g. enterobius_vermicularis.manifest. The manifest files are described here. For example, for my species Enterobius vermicularis, the manifest file looked like this: (tab separated)
STUDY PRJEB503
SAMPLE ERS076738
ASSEMBLYNAME E_vermicularis_Canary_Islands_0011_upd
COVERAGE 53
PROGRAM SGA, Velvet, SSpace, IMAGE, GapFiller, REAPR
PLATFORM Illumina
MINGAPLENGTH 1
MOLECULETYPE genomic DNA
FLATFILE enterobius_vermicularis_new2.embl.gz
Note: here the assembly name has had '_upd' appended to it, as it is an update of a previous one that was in the ENA.
Note: If your assembly is in chromosomes you also need a chromosome list file (see here) as well as a manifest file; this wasn't the case for me as I just had scaffolds in my assembly.
Note: to see the manifest file information for an assembly that has already been submitted to the ENA, you can log into the ENA submission website (see Step 1 above), and then type the analysis accession number (e.g. ERZ021218) in the search box 'Accession/Unique name' at the top right, and you will see the analysis pop up, then click 'Edit' on the right' and you will see a page with 'Edit details' pop up, and then click on 'Edit XML' on the top right and this should bring you to a page with the details that were in the manifest, which will look something like this (I've highlighted in red the info that was used in the manifest file above):
<?xml version="1.0" encoding="UTF-8"?><ANALYSIS_SET>
<ANALYSIS accession="ERZ021254" alias="E_vermicularis_Canary_Islands_0011" center_name="WTSI">
<IDENTIFIERS>
<PRIMARY_ID>ERZ021254</PRIMARY_ID>
<SUBMITTER_ID namespace="WTSI">E_vermicularis_Canary_Islands_0011</SUBMITTER_ID>
</IDENTIFIERS>
<TITLE>Genome sequence of Enterobius vermicularis</TITLE>
<DESCRIPTION>We have sequenced the genome of Enterobius vermicularis</DESCRIPTION>
<STUDY_REF accession="ERP006298">
<IDENTIFIERS>
<PRIMARY_ID>ERP006298</PRIMARY_ID>
<SECONDARY_ID>PRJEB503</SECONDARY_ID>
</IDENTIFIERS>
</STUDY_REF>
<SAMPLE_REF accession="ERS076738">
<IDENTIFIERS>
<PRIMARY_ID>ERS076738</PRIMARY_ID>
</IDENTIFIERS>
</SAMPLE_REF>
<ANALYSIS_TYPE>
<SEQUENCE_ASSEMBLY>
<NAME>E_vermicularis_Canary_Islands</NAME>
<PARTIAL>false</PARTIAL>
<COVERAGE>53</COVERAGE>
<PROGRAM>SGA, Velvet, SSpace, IMAGE, GapFiller, REAPR</PROGRAM>
<PLATFORM>Illumina</PLATFORM>
<MIN_GAP_LENGTH>1</MIN_GAP_LENGTH>
</SEQUENCE_ASSEMBLY>
</ANALYSIS_TYPE>
<FILES>
<FILE checksum="4bb219b0585c07e91ac57332288236b0" checksum_method="MD5" filename="ERZ021/ERZ021254/EVEC.v1.QC.fa_v4_submit.gz" filetype="fasta"/>
</FILES>
</ANALYSIS>
</ANALYSIS_SET>
Step 3: Use the webin java client to submit the EMBL file to the ENA
1) Download the latest version of the webin command line tool, as described here. This is called something like webin-cli-root-1.4.2.jar.
2) If Java is not installed on your laptop, you need to install it (see here ).
3) You should now run the webin client in the directory that has your manifest file and chromosome list file (if you have a chromosome list file; see Step 2 above), using the -validate and -test options. You need to run the webin client by typing something like this:
% java -jar webin-cli-root-1.4.2.jar
Here are the options you need: (something like this: you will need to change it for your directories, password, username, etc.)
% /software/bin/java -jar webin-cli-root-1.4.2.jar -context genome -username xxx -password xxx -manifest enterobius_vermicularis.manifest -inputdir . -outputdir ./evermicularis -validate -test
where -username and -password give your username and password for the ENA (used in Step 1),
-manifest gives the name of your manifest file (e.g. enterobius_vermicularis.manifest),
-inputdir gives the name of your input directory,
-outputdir gives the name of the directory where you want output to be put (Note: you seem to need this option for the webin client to run),
-validate runs the EMBL file validator,
-test means that this is just a test run that runs the validator and the files are not actually submitted.
It's a good idea to do this before you submit.
Note: you need to have the EMBL file mentioned in your manifest file in the input directory (inputdir), e.g. file enterobius_vermicularis_new2.embl.gz. Webin does not seem to look for it in the "Webin upload area" (see Step 1) - this is what I had initially assumed it does, but it seems that's not the case.
Note: On the Sanger farm head node, this java program needs more memory (RAM) than the default required so I needed to reserve some RAM by typing:
% /software/bin/java -Xmx128M -jar webin-cli-root-1.4.2.jar
If your EMBL file is very large, the Java program may need more memory (RAM) than is allowed on the Sanger farm login node, so you will need to submit this as a farm job.
I wasn't able to get webin running on the Sanger farm (either farm head node or other nodes), as I kept getting an error :
'ERROR: A server error occurred when checking application version. Connect to wwwdev.ebi.ac.uk:443 [wwwdev.ebi.ac.uk/193.62.197.11] failed: Connection timed out'
I tried a few different things: (i) using -Dhttps.proxyHost and -Dhttps.proxyPort options in webin (see here) when running webin, and (ii) setting the environmental variable http_proxy on the farm. However, this still didn't run for me. However, I found it ran on my Sanger Mac laptop, so that was fine.
4) Once the webin command in (3) above has run, it produces an output report file in a subdirectory 'validate', e.g. mcorti/genome/M_corti_Specht_Voge_0011_upd/validate/mesocestoides_corti_new2.embl.gz.report
If the 'validate' output report file is empty, then it has run fine with no errors.
There is also an output 'submit' directory, e.g. mcorti/genome/M_corti_Specht_Voge_0011_upd/submit that contains a file analysis.XML. That is, the 'submit' directory should have your analysis XML file, which is an XML version of your manifest (see Step 2 above).
If it doesn't pass the 'validate' step, then look in the 'genome' directory to see what error messages you got.
5) If (4) was all ok, now run the webin client with the -submit option instead of -validate, and without -test:
% /software/bin/java -jar webin-cli-root-1.4.2.jar -context genome -username xxx -password xxx -manifest enterobius_vermicularis.manifest -inputdir . -outputdir ./evermicularis -submit
6) Once your assembly has been submitted, it will go into a queue in the ENA to be imported into their database. Once this has been done (may take a little while if they have a lot of sequence data in the queue already), you will be sent the new ENA accession number for it.
Checking what happened to your assembly
If you don't hear back from the ENA about your assembly, you can log into https://www.ebi.ac.uk/ena/submit/webin/ with your webin account. Then, in the "Analysis process" tab, you can search for the accessions to see if they have failed. Usually if there is a failure, the error messages will be emailed to the email address associated with your webin account (Note to self: this is the pathogen informatics team for the webin account I used.)
Acknowledgements
A big thank you to
Dr Ana M. Cerdeño-Tárraga at the ENA for her help, and also to the Pathogen Informatics team (Dr Jacqui Keane, Sara Sjunnebo) at Sanger for some advice. Installing Java on my laptop
On my mac laptop, to install Java, I had to:
1) Go to the 'Applications' folder, and click on 'Managed Software Center'
2) Select 'Oracle Java JDK 8', and click install.
3) Click on 'Install requested' at top right. Then a 'checking for updates' screen appears. This seems to install Java.
4) In a xterm window on the Mac, I can then type a java command, e.g.: (to run a Java program called webin-cli-root-1.4.2.jar)
% java -jar webin-cli-root-1.4.2.jar
Runs fine!
Monday 15 October 2018
Getting raw sequence data at Sanger
1) Request an irods account from the service desk, as explained on the irods wiki page.
2) Make sure your ..softwarerc has irods in, so you can use irods commands.
3) (Steve told me): first move to a directory where you want to transfer the cram files, and then run the following:
(e.g. for run 27104, lane 8)
kinit
< input password > # you need to input your irods password here
icd /seq/27104
ils | grep "27104_8.*cram"$ | grep -v "phix" | while read -r list; do iget /seq/27104/$list . ; done &
# once you cram files have downloaded, convert crams to fastq
bsub.py 1 cram2fq ~sd21/bash_scripts/run_cram2fastq
Note the script ~sd21/bash_scripts/run_cram2fastq says:
for i in *cram; do samtools view -ub --threads 4 ${i} | samtools sort -n - | samtools fastq -1 ${i%.cram}_1.fastq.gz -2 ${i%.cram}_2.fastq.gz - ; done
Note: to get this script to run for me, I had to make a copy of it and change the path for samtools to be /software/pathogen/external/apps/usr/local/samtools-1.6/samtools, which was the one in Steve's .cshrc
More info about irods
A lot of initial questions are covered in the FAQs located at
https://gitlab.internal.sanger.ac.uk/kdj/npg_doc/blob/master/irods_in_10_minutes.adoc
http://mediawiki.internal.sanger.ac.uk/index.php/IRODS_for_Sequencing_Users
Acknowledgements
Thanks very much to Steve Doyle for help with this.
Thursday 11 October 2018
Analysing a S. mansoni gene list
1. STITCH (protein-protein interactions) http://stitch.embl.de/
If there is no S. mansoni in the STITCH database, one could try C. elegans orthologs, and/or human orthologs (see 2 below). You could try with ortholog_one2one orthologs, and separately with ortholog_one2many orthologs; one-to-one orthologs might be a cleaner set, but adding one-to-many orthologs may give you a much larger sample, so allow you to pick up more.
- go to wormbase parasite https://parasite.wormbase.org/index.html
- click on biomart at top
- click species, tick genome, choose S. mansoni
- click on 'Gene', then in the 'id list' box, paste in your S. mansoni Smp id list
- This gives it your list of S. mansoni genes
- To tell it you want eg. C. elegans orthologs, click on 'output attributes' on the left (blue link)
- Click on 'orthologs', then under C. elegans orthologs choose for example gene id, gene name, protein stable id, % id, homology type
Note: STITCH might want gene id (WBGene) or the protein id (e.g. AC5.3)
- click on 'Count' at top of page to get id of number of results, then 'Results' at top of page to get Results. Save as tab separated file
3. Another thing to try is tissue/phenotype/GO enrichment using the C. elegans wormbase page, and your C. elegans orthologs: https://wormbase.org/tools/enrichment/tea/tea.cgi
Could try with ortholog_one2one orthologs, and separately with ortholog_one2many.
4. Can try GO enrichment for S. mansoni (or your C.elegans/human orthologs) using TopGO (an R package).
First you need your S. mansoni GO terms for S. mansoni genes from WormBase parasite Biomart :
- go to wormbase parasite https://parasite.wormbase.org/index.html
- click on biomart at top
- click species, tick genome, choose S. mansoni
- To tell it you want GO terms, click on 'output attributes' on the left (blue link)
- Select 'Gene ontology' and tick all the boxes beside it
- Click on "results' at top of the page
- Then run TopGO. See my notes here on TopGO: http://avrilomics.blogspot.com/2015/07/using-topgo-to-test-for-go-term.html
Note: 4-Feb-2019:
Another thing to try could be https://biit.cs.ut.ee/gprofiler/gost for GO enrichment, can take WormBase ParaSite gene lists.
Note: 3-July-2019:
Other things to try:
- find Drosophila melanogaster orthologs, paste into the FlyMine website
- find Drosophila melanogaster orthologs, and test for enrichment of phenotypes downloaded from FlyBase
- try the GSEA Gene Set Enrichment software
Tuesday 25 September 2018
How to do a substructure search of ChEMBL (2)
I've been doing some more and found some funny/interesting things.
1) Strange things with aromatic rings!
I wanted to do a search that would hit both menadione (a napthaquinone, image from wikipedia):
and also hit anthraquinones like diacerein (image from wikipedia):
I first tried this with this substructure but it only hit menadione and not the anthraquinones:
Then after discussing with Noel (thanks Noel!) I found out that I need to specify S/A (single/aromatic) and D/A (double/aromatic) bonds, like this:
This found both the anthraquinones and naphthaquinones, hurray!
Note: I found that the ability to specify S/A and D/A bonds is only available in the current ChEMBL website, not the new ChEMBL beta website, which seems to use a different sketcher from ChemAxon.
Note 2 (1-Oct-2018): I heard back from the ChEMBL website that you can specify S/A bond on the ChEMBL beta website by selecting the bond and pressing 14, and D/A by pressing 24 (and 'any' bond by pressing 0). Thanks to David Mendez from ChEMBL!
2) Specifying any atom with 'A'
I wanted to do a search which would pick up both
and
I found I could do this with the 'A' (any atom), like this:
Note: on the ChEMBL beta website, I see '*', but if I select this and press 'A' it turns to A.
3) Wierd things I can't explain
I found that when I do this attached substructure search
it gets a hit to dipananone https://www.ebi.ac.uk/
4) Starting from a particular structure
If you have a problem drawing in your structure, you can save a related compound as its mol file from the ChEMBL website, and then click on the 'open file' button within the sketcher on the ChEMBL homepage, and open that mol file to be your starting point for editing.
Acknowledgements
Thanks to Noel O'Blog and John Mayfield for help!
And also a big thanks to David Mendez from ChEMBL!