Thursday, 20 December 2018

CRAM to BAM conversion

CRAM is a compressed version of the BAM/SAM formats.
To convert a CRAM file to BAM, you can use the SCRAMBLE conversion tool.
For example:
% /software/sciops/pkg/staden_io_bin/1.14.9/bin/scramble -I cram -O bam 27541_1#9.cram > 27541_1#9.bam
where /software/sciops/pkg/staden_io_bin/1.14.9/bin/scramble is where SCRAMBLE is installed,
27541_1#9.cram is the input CRAM file,
27541_1#9.bam is the output BAM file.

Monday, 26 November 2018

Dashed line in pysvg

I recently wrote a blog post on using pysvg to make a picture of chromsomes.

Now I wanted to figure out how to make a dashed line using pysvg.

The library seems to be missing a dedicated function for this, but I was able to do it with:

# define the dashed line style
    linestyle_dict = {'stroke-width':2, 'stroke':'black','stroke-dasharray': "5, 5"}
    linestyle = pysvg.builders.StyleBuilder(linestyle_dict)
    myline = shape_builder.createLine(0,0,300,300)
    myline.set_style(linestyle.getStyle())
    svg_document.addElement(myline)


If you want to change the lengths of the dashes you can change for example 'stroke-dasharray': "5, 5" to 'stroke-dasharray': "10, 10".



BWA aligner

The BWA aligner for short read alignment is by Heng Li and Richard Durbin.

BWA mem
On the BWA website, it explains the the 'bwa mem' algorithm is the best one for '70bp or longer Illumina, 454, Ion Torrent and Sanger reads, assembly contigs and BAC sequences.'

I've previously used it something like this:
bwa mem genome.fasta reads_fastq.gz mates_fastq.gz | samtools view -b -S -F 4 - > output.bam
   run 'bwa mem' to align reads and their mates to a genome, and pipe the output through samtools to just take the reads that align (using -F4) and discarding reads that don't align (to save disk space), and put the output in output.bam

Other options
Other options available in BWA include:
bwa sampe: generated paired ended alignments (This takes an option -a <max_insert_size>)

Friday, 23 November 2018

Adjusting text style in pysvg

I wrote a previous blogpost on making a plot using pysvg.

Now I want to do something a bit more tricky, to adjust the font styles.
I found a tutorial online, but the commands in it don't seem to work with the current pysvg that's installed on the Sanger farm.
However, I found another file on github that gave me some hints...

Adjusting font style and alignment with pysvg
You need something like this in the Python:

# define the text style
    style = pysvg.builders.StyleBuilder()
    style.setFontFamily(fontfamily="Arial")
    style.setFontSize('1em')
    style.setFilling("blue")
    style.setTextAnchor("start")
    style.style_dict["alignment-baseline"] = "middle" # there doesn't seem to be a setAlignmentBaseline function


Then you can add text using something like this:
# add some text to our svg picture for this chromosome:
        t=pysvg.text.Text(mychr, x = mychrlen*xscale + 30, y = chrcnt*yscale + 30)
        t.set_style(style.getStyle())
        svg_document.addElement(t)

Wednesday, 21 November 2018

Jupyter notebooks using Python

I wanted to figure out how to run Python in a Jupyter notebook, which is like the newer version of iPython notebooks. I found quite a nice tutorial on getting started on Python Jupyter notebooks here.

The first step is to install Jupyter.
(Note to self: I found it is installed on the pcs5 at Sanger, so I can start it up by 'ssh -Y pcs5', and then..)

The second step is to start up a Jupyter notebook session:
(Note to self: I did this in /nfs/users/nfs_a/alc/Documents/git/Python)
% jupyter notebook
(Note to self: jupyter is installed with python2 software on the Sanger farm, so seems to assume Python2.)
This opens up a web browser showing the files in the directory where I started jupyter.
You can then start a new notebook by going to the 'New' menu on the right, and choosing 'python2'.

You can give your notebook a title by typing in the 'Untitled' box at the top, for example, you could rename it 'plotting_chrs', and then it would be saved as file plotting_chrs.ipynb
Then you can start typing in the prompt line.
To get a line to be executed, you can press Shift+Return.

Here I made a jupiter notebook to make a SVG plot of chromosomes (see my previous blogpost), and display the SVG inline:



(Note to self: the whole of the svg image from pysvg doesn't seem to be displayed inline, I'm not sure why this is - I tried another svg and it worked fine.)

To shut down the notebook, you can click on 'Close and Halt' in the File menu. To start again you can type
% jupyter notebook plotting_chrs.ipynb 
to open this notebook again.

Later note (May 2019):
- I wanted to install jupyter notebook on a laptop running Windows. To do this, I first installed the latest Python (v 3.7.3) using a Window installer for Python from the Python website. Then I used 'cmd' in Windows to open a command prompt and within the command prompt I typed:
> py -m pip install --upgrade pip
> py -m pip install jupyter
This installed Jupyter notebook [Note: to install another module, such as 'pandas', you can type 'py -m pip install pandas].
To then run a certain Jupyter notebook such as my.notebook, you can type:
> jupyter notebook C:\\mypath\Docs\my.notebook
This should open the notebook my.notebook in a web browser.
You can get a certain command to run in the Jupyter notebook by pressing shift+return.

Later note (Oct 2021):
- I wanted to install jupyter notebook on a Mac laptop. I already had Anaconda installed, so typed:
% conda install -c conda-forge jupyterlab
To run Jupyter notebook, I then typed:
% jupyter notebook
This opened Jupyter notebook in Firefox.
Brilliant!


Plotting chromosomes using Python pysvg

To draw an svg image of chromosomes, here is a little example I made using pysvg (below).

I found some nice examples for using pysvg here.
Here is another old example (seems to use an old version of pysvg, that doesn't work now) by Noel O'Blog.

It can be run using python2 (not python3):
% python /nfs/users/nfs_a/alc/Documents/git/Python/plot_emu_chrs.py emu_chrs.svg
% convert emu_chrs.svg emu_chrs.png

The output looks like (as png):


You can edit it manually in the SVG editor Inkscape (Note to self: do 'ssh -Y pcs5' and then 'inkscape emu_chrs.svg').

The code is:

import sys
import os
import pysvg.structure
import pysvg.builders
import pysvg.text


# Note: this is to make a svg plot with the E. multilocularis chromosomes,
# plotted one under the other, scaled according to their lengths, and
# with a different colour for each chromosome, and a label for each
# chromosome.
# Note 2: this script must be run with python2 (not python3).

#====================================================================#

def define_input_data():
    # define the input data, as a list of E. multilocularis chromosomes, and
    # a dictionary with lengths of E. multilocularis chromosomes, and a dictionary
    # with the colours to plot E. multilocularis chromosomes:

    chrlist = ["chr1", "chr2", "chr3"]
    chrlengths = { "chr1": 2000000, "chr2": 5000000, "chr3": 4500000}
    chrcols = { "chr1": "rgb(255,0,153)", "chr2": "blue", "chr3": "purple"}
    # Note: there is a list of RGB values for colours here: http://access-excel.tips/vba-excel-rgb-property-and-get-rgb-color/

    return(chrlist, chrlengths, chrcols)

#====================================================================#

# make the plot of the input data:

def plot_input_data(chrlist, chrlengths, chrcols, output_svg_file):

    # initialise our svg picture  
    svg_document = pysvg.structure.Svg()
    shape_builder = pysvg.builders.ShapeBuilder()

    # plot each chromosome, one under the other:
    chrcnt = 0
    yscale = 100
    xscale = 0.00004
    for mychr in chrlist:
        # find its length:

        assert(mychr in chrlist)
        mychrlen = chrlengths[mychr]
        # find the colour to plot it:
        assert(mychr in chrcols)
        mychrcol = chrcols[mychr]
        # add a rectangle to our svg picture for this chromosome:
        # createRect takes x, y, width, height
        myrectwidth = "%dpx" % (mychrlen*xscale + 20)
        svg_document.addElement(shape_builder.createRect(0, chrcnt*yscale, myrectwidth, "50px", strokewidth = 1, stroke = "black", fill = mychrcol))
        # add some text to our svg picture for this chromosome:
        svg_document.addElement(pysvg.text.Text(mychr, x = mychrlen*xscale + 30, y = chrcnt*yscale + 30))
        # add one to the chromosome count:
        chrcnt += 1

    # save the svg picture to an output file
    svg_document.save(output_svg_file)

    return()

#====================================================================#

def main():
       
    # check the command-line arguments:
    if len(sys.argv) != 2:
        print("Usage: %s output_svg_file" % sys.argv[0])
        sys.exit(1)
    output_svg_file = sys.argv[1]

    # define the input data: (note: you could read this in from an input file)
    (chrlist, chrlengths, chrcols) = define_input_data()

    # make the plot of the input data:
    plot_input_data(chrlist, chrlengths, chrcols, output_svg_file)

#====================================================================#

if __name__=="__main__":
    main()

#====================================================================#
 



Wednesday, 14 November 2018

I-TASSER for protein structure and function prediction

I've recently heard about the I-TASSER program for protein structure and function prediction.

Notes to self:
- there is a script runI-TASSER.pl on the Sanger farm for running it
-  I-TASSER and its scripts currently live here:
/software/pathogen/external/apps/usr/local/itasser-5.0
- Users need the library location:
/lustre/scratch118/infgen/pathogen/pathpipe/itasser
- Sample command:
runI-TASSER.pl -pkgdir /software/pathogen/external/apps/usr/local/itasser-5.0 -libdir /lustre/scratch118/infgen/pathogen/pathpipe/itasser -java_home /usr -seqname example -datadir example  -outdir itasser_out
- If users run it with the -light option that will limit the simulation time to 5-6 hours which means they can get a smallish protein (<100aa) through in approx 8 hours.
 

Thanks to 
Victoria Offord
Arporn Wangwiwatsin
 

Tuesday, 13 November 2018

HHpred for finding remote homologues

I recently came across HHpred (described in Soding et al 2005), a program for looking for remote homologues by using HMMs. A nice use case of this is in the paper by Valentim et al 2013, where they used HHpred to predict that the Schistosoma mansoni protein that is involved in response to the drug oxamniquine has structural similarity to sulfotransferases.

Monday, 29 October 2018

Phred quality scores

Phred quality scores are used as a measure of quality for DNA sequencing base quality.
I always forget how they are calculated.

From wikipedia, the Phred base quality score for a particular base is:
Q = -10 log_10 (P)
where P = the probability that a particular base was called incorrectly.

Working out the probability a particular base was called incorrectly from the Phred score?
If you have Q, you can work out P as:
P = 10^(-Q /10)
So a base quality score of 20 means P = 10^(-2) = 0.010.
A base quality score of 25 means P = 0.003 approx.
A base quality score of 30 means P = 0.001.

Working out the Phred quality score threshold needed to achieve a particular threshold of P
If we want to use a threshold of P = 0.05, then we would need to use a threshold of Q of:
Q = -10 log_10(0.05) = 13.0103.
Working back to get P gives:
P = 10^(-13.0103/10) = 0.05.

Likewise, for a threshold of P = 0.005:
Q = -10 log_10(0.005) = 23.0103.
Working back to get P gives:
P = 10^(-23.0103/10) =  0.005.



Friday, 26 October 2018

Running CRISPResso on a ton of sequencing data

I have a ton of sequencing data from an Illumina HiSeq lane from a PCR-amplicon sequence, and I'd like to run CRISPResso on it. In a previous blogpost, I talked about how to run CRISPResso.

Here's I'll talk about how to run it on loads of data! In this case I had ~20 million read-pairs for a single sample, while previously I had only run CRISPResso on 1-2 million read-pairs (even that took a couple of days!)

Here is a little plot I made of CRISPResso run-time in hours versus number of input read-pairs:

















You can see that as the number of read-pairs gets above about 1 million, the run-time is 0.5-1 day roughly. I guess the runtime may also depend on things like the base quality of the reads (if they are low quality, I'm guessing CRISPResso will have to make lots of alignments with indels/substitutions, and may infer lots of indel/substitution mutations, which will take time).

To run CRISPResso on ~20 million read-pairs from a single sample, here's what I did:

1. Split up the input fastq files into smaller files of 1 million reads each
I have paired end read-pairs, so the input data is in a file called sample1_1.fastq.gz (forward reads) and sample1_2.fastq.gz (reverse reads). To split this up, I ran:
% python3 ~alc/Documents/git/Python/split_up_fastq.py sample1_1.fastq.gz 1000000 sample1_1_sub
% python3 ~alc/Documents/git/Python/split_up_fastq.py sample1_1.fastq.gz 1000000 sample1_1_sub
You can see my Python script on gist here.
Note to self: I submitted farm jobs to do this, as it's slow.

2. Use Trimmomatic to discard low quality read-pairs.
I used the Trimmomatic software to discard read-pairs with low quality bases, only taking read-pairs where both reads of a pair had average base quality of >=20. To do this, I ran:
python3 ~alc/Documents/git/Python/filter_fastq_files_using_trimmomatic.py sample1 17
where here sample1 is the name of the sample in step 1 above, and 17 is the number of pairs of smaller fastq files that step 1 above made. 
This script submits Trimmomatic jobs to the Sanger compute farm.
You can see my Python script on gist here.

3.  Run CRISPResso on each smaller subset of read-pairs.
I then ran CRISPResso on each smaller subset of read-pairs:
% python3 ~alc/Documents/git/Python/submit_crispresso_jobs_for_subsetsoffastq.py sample1 17
where here sample1 is the name of the sample in step 1 above, and 17 is the number of pairs of smaller fastq files that step 1 above made. 
This script submits CRISPResso jobs to the Sanger compute farm.
You can see my Python script on gist here, you would need to edit it at the 'xxx' positions, to put in your amplicon sequence, gRNA sequence and coding sequence. 






Spiking sequencing libraries with PhiX

If you are sequencing a low complexity library (e.g. a PCR-amplicon library) using Illumina sequencers, it's a good idea to spike in some PhiX to improve base quality, as explained here and here on the Illumina support pages. On the seqanswers forum they say that PhiX is used for:
"introduction of sequencing diversity in low-complexity libraries (diversity is needed to discriminate clusters and create signal thresholds for base-calling). As the software has improved, the recommended amount of phiX spike-in has decreased." 
Another comment says:
"Older basecalling software on the MiSeq used to require a 50% phiX spike for low-diversity (amplicon samples). More recently, the software has been updated and only requires a 5-10% spike."

Reading and writing gzipped files in Python

I've been reading in zipped (.gz) files (in my case, zipped fastq files of sequence data) in Python using a lovely Python module called gzip.py.
The same module can also write out zipped (.gz) files.

For example, here is a little script that reads in a zipped (.gz) input fastq file of sequence reads, and splits it into smaller files that have seqs_per_output_file sequences each. The output files are written out as zipped files (.gz).

I've highlighted the lines of the file that use the gzip module, and read and write from the input and output files.

Very handy!

import sys
import os
import gzip
from collections import defaultdict

#====================================================================#

# now read in the input fastq and split it up:    

def read_fastq_file_and_split(input_fastq_file, seqs_per_output_file, output_file_prefix):

    # open an output file:
    output_file_cnt = 1
    output_file = "%s_%d.fastq.gz" % (output_file_prefix, output_file_cnt)
    outputfileObj = gzip.open(output_file, "wb") # write out the output file in gzipped format
    print("Opening",output_file,"...")

    # read in the input file:
    fileObj = gzip.open(input_fastq_file, "rt") # this opens a gzipped file in text mode
    seqcnt = 0
    linecnt = 0
    for line in fileObj:
        line = line.rstrip()
        if line.startswith('@') and (linecnt == 0 or linecnt == 4):
            linecnt = 0
            seqcnt += 1
            if seqcnt == (seqs_per_output_file + 1):
                outputfileObj.close()
                output_file_cnt += 1
                output_file = "%s_%d.fastq" % (output_file_prefix, output_file_cnt)
                outputfileObj = gzip.open(output_file, "wb") # write out the output file in gzipped format
                print("Opening",output_file,"...")
                seqcnt = 1
        outputline = "%s\n" % line
        outputfileObj.write(outputline.encode()) # need to write bytes (not a string) to the output file, see https://stackoverflow.com/questions/49286135/write-strings-to-gzip-file
        linecnt += 1
    fileObj.close()
    outputfileObj.close()

    # the fastq file looks like this:
    # @M03558:259:000000000-BH588:1:1101:15455:1333 2:N:0:NTTGTA
    # NCAAGCATCTCATTTTGTGCATATACCTGGTCTTTCGTCTTCTGGCGTGAAGTCGCCGACTGAATGCCAGCAATCTCTTTTTGAGTCTCATTTTGCATCTCGGCAATCTCTTTCTGATTGTCCAGTTGCATTTTAGTAAGCTCTTTTTGATTCTCAAATCCGGCGTCAACCATACCAGCAGAGGAAGCATCAGCACCAGCACGCTCCCAAGCATTAAGCTCAGGAAATGCAGCAGCAAGATAATCACGAGT
    # +
    # #>>ABBFFFFFFGGGGGGGGGGHHHHHHHHHHHGHHGH2FFHHHHHGGGGGHHHGGGGGGGGHHHGHHHHGHHHHHHHHHHHGGGHHHHHHHHHHHHHHHHGGGGGHGFHHHHHHHHHHHHHGHHHHGHFHHHHHEFFFHGFFHHHHHGGHHHHHFGHHHHGGGCGGGFHHGGHGHHHHHHHHAGFFHHHHHGGGAEFHHHFDGDDGGGGEEBFGGGGGFGGGGEFGFFFGGGGGGFFFFFFBBFFFDFAA

    return

#====================================================================#


def main():
   
    # check the command-line arguments:
    if len(sys.argv) != 4 or os.path.exists(sys.argv[1]) == False:
        print("Usage: %s input_fastq_file seqs_per_output_file output_file_prefix" % sys.argv[0])
        sys.exit(1)
    input_fastq_file = sys.argv[1] # input fastq file                 
    seqs_per_output_file = int(sys.argv[2]) # number of sequences to put into each output file
    output_file_prefix = sys.argv[3] # prefix to use for the output file names

    # now read in the input fastq and split it up:    
    read_fastq_file_and_split(input_fastq_file, seqs_per_output_file, output_file_prefix)
   
#====================================================================#

if __name__=="__main__":
    main()

#====================================================================#


Thursday, 25 October 2018

Randomly sample from a fastq file

Sometimes I want to take a sneak peak at some sequencing data, which is given to me hot from the sequencing machines in the form of a pair of fastq files (paired end read-pairs: one fastq file for the forward reads and one for the reverse reads). I used to just take the first n (e.g. 250,000) read-pairs from the top of the fastq file, to have a look. However, as discussed on seqanswers here it's often the case that:
SOLiD or Illumina output files have a pile of rubbish at the start of the file from bad sequencing reads (the edges of chips / cells seem to be more prone to error), which heavily influences the results gleaned from the first few reads.”
To get around this, they suggest you subsample randomly using HTSeq, and also point to a discussion here.
 

Filter read-pairs by average base quality using Trimmomatic

I have a large amount of Illumina sequencing data (~17 million read-pairs) and, before running further analysis, want to filter the data by the average base quality, for example by discarding all read pairs that have average base quality of <=20. The format of my data is two fastq files for my paired Illumina read-pairs, one for forward reads and one for reverse reads.

From a discussion on seqanswers about filtering, I found that people suggested Trimmomatic.

Running Trimmomatic
I tried running Trimmomatic on some smaller files of 250,000 (0.25 million) read-pairs, using this command-line: (note to self: on the Sanger farm)
java -Xmx128M  -jar /software/pathogen/external/apps/usr/local/Trimmomatic-0.33/trimmomatic-0.33.jar PE -threads 1 -phred33 sample1topsmall_1.fastq.gz sample1topsmall_2.fastq.gz sample1topsmallout_1_paired.fastq.gz sample1topsmallout_1_unpaired.fastq.gz sample1topsmallout_2_paired.fastq.gz sample1topsmallout_2_unpaired.fastq.gz SLIDINGWINDOW:150:20 
where:
PE means my reads are paired ends,
-threads 1 tells Trimmomatic to just use one thread (one CPU),
-phred33 tells Trimmomatic that the phred quality scores in my file use ASCII_BASE=33 (see here for an explanation), (note to self: I knew this from CRISPResso output)
sample1topsmall_1.fastq.gz sample1topsmall_2.fastq.gz are my input fastq files,
sample1topsmallout_1_paired.fastq.gz, sample1topsmallout_1_unpaired.fastq.gz, sample1topsmallout_2_paired.fastq.gz, sample1topsmallout_2_unpaired.fastq.gz are output files produced by Trimmomatic (two with reads where both reads of a pair pass the filter, two files with reads where just one read of a pair passes),
SLIDINGWINDOW:150:20 tells Trimmomatic to use a sliding window of 150 bp and take reads that have average base quality of >=20 in this window (note to self: my reads are 150 bp long).

Output from Trimmomatic
Trimmomatic ran really quickly! It gave some output like this:
Input Read Pairs: 250000 Both Surviving: 1619 (0.65%) Forward Only Surviving: 228797 (91.52%) Reverse Only Surviving: 604 (0.24%) Dropped: 18980 (7.59%)
TrimmomaticPE: Completed successfully

You can see here that my forward reads were much higher quality than my reverse reads, so few read-pairs survived where both reads of the pair passed the quality filter. The output files sample1topsmallout_1_paired.fastq.gz and sample1topsmallout_2_paired.fastq.gz have 1619 reads each as 1619 read-pairs passed.

Run-time and memory (RAM) required by Trimmomatic
I found Trimmomatic very fast. For 2.5 million read-pairs, it took about 2 minutes to run on the Sanger compute farm (using one CPU), and needed just ~80 Mbyte of memory (RAM).


Thursday, 18 October 2018

Submitting an assembly plus annotations to the ENA

I need to submit an assembly plus annotations for a parasitic worm (a multicellular eukaryote) to the ENA. Note: see my previous blogpost on preparing an EMBL file for the ENA.

To do this, I followed these steps:

Step 1: Upload my EMBL file (with genome assembly and annotation) to the ENA submission website
Note (later): I think that actually Step 1 is not necessary for submitting your EMBL file to the ENA, so you could skip directly to Step 2 here (if you have your EMBL file already; if you need to make your EMBL file see here, and if you need to register your sample and study with the ENA see here).

Once I had created an EMBL file with my genome assembly and annotation (see here), and had checked my study and sample had already been registered with the ENA (see here), I followed these steps:

1) I went to the ENA submission website, and signed in using the login and password given to me by my team (Note to self: asked Pathogen Informatics team for this).

2) I clicked on the 'New Submissions' tab at the top.

3) Then I selected 'Submit genome assemblies'.

4) At the bottom of the screen, it says the first step is to upload the data files into the "Webin upload area". I clicked on the 'upload data files' link.

5) This brought me to a page that says to look at http://ena-docs.readthedocs.io/en/latest/upload_01.html

6) I created a md5sum for my file by typing:
% md5sum my.embl.gz > my.embl.gz.md5

7) Following the instructions on the http://ena-docs.readthedocs.io/en/latest/upload_01.html page, I did the following to transfer my files using ftp, to the ENA's "Webin upload area":
% ftp webin.ebi.ac.uk [enter username and password]
> bin
> put my.embl
> put my.embl.md5
> bye

The files that I transferred were zipped embl files (e.g. enterobius_vermicularis_new2.embl.gz and  enterobius_vermicularis_new2.embl.md5.gz for a species Enterobius vermicularis).

Step 2: Make a "manifest" file for the EMBL file
To submit an EMBL file for example for a species Enterobius vermicularis, you need to make a 'manifest file' that contains the information on the assembly, e.g. enterobius_vermicularis.manifest. The manifest files are described here. For example, for my species Enterobius vermicularis, the manifest file looked like this: (tab separated)
STUDY   PRJEB503
SAMPLE    ERS076738
ASSEMBLYNAME   E_vermicularis_Canary_Islands_0011_upd
COVERAGE   53
PROGRAM   SGA, Velvet, SSpace, IMAGE, GapFiller, REAPR
PLATFORM   Illumina
MINGAPLENGTH   1
MOLECULETYPE   genomic DNA
FLATFILE   enterobius_vermicularis_new2.embl.gz

Note: here the assembly name has had '_upd' appended to it, as it is an update of a previous one that was in the ENA.

Note: If your assembly is in chromosomes you also need a chromosome list file (see here) as well as a manifest file; this wasn't the case for me as I just had scaffolds in my assembly.

Note: to see the manifest file information for an assembly that has already been submitted to the ENA, you can log into the ENA submission website (see Step 1 above), and then type the analysis accession number (e.g. ERZ021218) in the search box 'Accession/Unique name' at the top right, and you will see the analysis pop up, then click 'Edit' on the right' and you will see a page with 'Edit details' pop up, and then click on 'Edit XML' on the top right and this should bring you to a page with the details that were in the manifest, which will look something like this (I've highlighted in red the info that was used in the manifest file above):
<?xml version="1.0" encoding="UTF-8"?><ANALYSIS_SET>
   <ANALYSIS accession="ERZ021254" alias="E_vermicularis_Canary_Islands_0011" center_name="WTSI">
      <IDENTIFIERS>
         <PRIMARY_ID>ERZ021254</PRIMARY_ID>
         <SUBMITTER_ID namespace="WTSI">E_vermicularis_Canary_Islands_0011</SUBMITTER_ID>
      </IDENTIFIERS>
      <TITLE>Genome sequence of Enterobius vermicularis</TITLE>
      <DESCRIPTION>We have sequenced the genome of Enterobius vermicularis</DESCRIPTION>
      <STUDY_REF accession="ERP006298">
         <IDENTIFIERS>
            <PRIMARY_ID>ERP006298</PRIMARY_ID>
            <SECONDARY_ID>PRJEB503</SECONDARY_ID>
         </IDENTIFIERS>
      </STUDY_REF>
      <SAMPLE_REF accession="ERS076738">
         <IDENTIFIERS>
            <PRIMARY_ID>ERS076738</PRIMARY_ID>
         </IDENTIFIERS>
      </SAMPLE_REF>
      <ANALYSIS_TYPE>
         <SEQUENCE_ASSEMBLY>
            <NAME>E_vermicularis_Canary_Islands</NAME>
            <PARTIAL>false</PARTIAL>
            <COVERAGE>53</COVERAGE>
            <PROGRAM>SGA, Velvet, SSpace, IMAGE, GapFiller, REAPR</PROGRAM>
            <PLATFORM>Illumina</PLATFORM>
            <MIN_GAP_LENGTH>1</MIN_GAP_LENGTH>
         </SEQUENCE_ASSEMBLY>
      </ANALYSIS_TYPE>
      <FILES>
         <FILE checksum="4bb219b0585c07e91ac57332288236b0" checksum_method="MD5" filename="ERZ021/ERZ021254/EVEC.v1.QC.fa_v4_submit.gz" filetype="fasta"/>
      </FILES>
   </ANALYSIS>
</ANALYSIS_SET>


Step 3: Use the webin java client to submit the EMBL file to the ENA

1) Download the latest version of the webin command line tool, as described here. This is called something like webin-cli-root-1.4.2.jar.

2) If Java is not installed on your laptop, you need to install it (see here ).

3) You should now run the webin client in the directory that has your manifest file and chromosome list file (if you have a chromosome list file; see Step 2 above), using the -validate and -test options. You need to run the webin client by typing something like this:
% java -jar webin-cli-root-1.4.2.jar 

Here are the options you need: (something like this: you will need to change it for your directories, password, username, etc.)
/software/bin/java -jar webin-cli-root-1.4.2.jar -context genome -username xxx -password xxx -manifest enterobius_vermicularis.manifest -inputdir . -outputdir ./evermicularis -validate -test
where -username and -password give your username and password for the ENA (used in Step 1),
-manifest gives the name of your manifest file (e.g. enterobius_vermicularis.manifest),
-inputdir gives the name of your input directory,
-outputdir gives the name of the directory where you want output to be put (Note: you seem to need this option for the webin client to run),
-validate runs the EMBL file validator,
-test means that this is just a test run that runs the validator and the files are not actually submitted.
It's a good idea to do this before you submit.
Note: you need to have the EMBL file mentioned in your manifest file in the input directory (inputdir), e.g. file  enterobius_vermicularis_new2.embl.gz. Webin does not seem to look for it in the  "Webin upload area" (see Step 1) - this is what I had initially assumed it does, but it seems that's not the case.

Note: On the Sanger farm head node, this java program needs more memory (RAM) than the default required so I needed to reserve some RAM by typing:
% /software/bin/java -Xmx128M -jar webin-cli-root-1.4.2.jar
If your EMBL file is very large, the Java program may need more memory (RAM) than is allowed on the Sanger farm login node, so you will need to submit this as a farm job.
I wasn't able to get webin running on the Sanger farm (either farm head node or other nodes), as I kept getting an error :
'ERROR: A server error occurred when checking application version. Connect to wwwdev.ebi.ac.uk:443 [wwwdev.ebi.ac.uk/193.62.197.11] failed: Connection timed out'
I tried a few different things: (i) using -Dhttps.proxyHost and -Dhttps.proxyPort options in webin (see here) when running webin, and (ii) setting the environmental variable http_proxy on the farm. However, this still didn't run for me. However, I found it ran on my Sanger Mac laptop, so that was fine.

4) Once the webin command in (3) above has run, it produces an output report file in a subdirectory 'validate', e.g. mcorti/genome/M_corti_Specht_Voge_0011_upd/validate/mesocestoides_corti_new2.embl.gz.report
If the 'validate' output report file is empty, then it has run fine with no errors.

There is also an output 'submit' directory, e.g. mcorti/genome/M_corti_Specht_Voge_0011_upd/submit that contains a file analysis.XML.  That is, the 'submit' directory should have your analysis XML file, which is an XML version of your manifest (see Step 2 above).

If it doesn't pass the 'validate' step, then look in the 'genome' directory to see what error messages you got.

5) If (4) was all ok, now run the webin client with the -submit option instead of -validate, and without -test:
% /software/bin/java -jar webin-cli-root-1.4.2.jar -context genome -username xxx -password xxx -manifest enterobius_vermicularis.manifest -inputdir . -outputdir ./evermicularis -submit

6) Once your assembly has been submitted, it will go into a queue in the ENA to be imported into their database. Once this has been done (may take a little while if they have a lot of sequence data in the queue already), you will be sent the new ENA accession number for it.

Checking what happened to your assembly
If you don't hear back from the ENA about your assembly, you can log into https://www.ebi.ac.uk/ena/submit/webin/ with your webin account. Then, in the "Analysis process" tab, you can search for the accessions to see if they have failed. Usually if there is a failure, the error messages will be emailed to the email address associated with your webin account (Note to self: this is the pathogen informatics team for the webin account I used.)

Acknowledgements
A big thank you to Dr Ana M. Cerdeño-Tárraga at the ENA for her help, and also to the Pathogen Informatics team (Dr Jacqui Keane, Sara Sjunnebo) at Sanger for some advice.

Installing Java on my laptop

This is only really useful to Sanger users...

On my mac laptop, to install Java, I had to:
1) Go to the 'Applications' folder, and click on 'Managed Software Center'
2) Select 'Oracle Java JDK 8', and click install.
3) Click on 'Install requested' at top right. Then a 'checking for updates' screen appears. This seems to install Java.
4) In a xterm window on the Mac, I can then type a java command, e.g.: (to run a Java program called webin-cli-root-1.4.2.jar)
% java -jar webin-cli-root-1.4.2.jar
Runs fine!

Monday, 15 October 2018

Getting raw sequence data at Sanger

This is useful for Sanger people only: how to get some data off our Sanger irods system.

1) Request an irods account from the service desk, as explained on the irods wiki page.

2) Make sure your ..softwarerc has irods in, so you can use irods commands. 

3) (Steve told me): first move to a directory where you want to transfer the cram files, and then run the following:
(e.g. for run 27104, lane 8)

kinit
< input password >   # you need to input your irods password here
icd /seq/27104
ils | grep "27104_8.*cram"$ | grep -v "phix" | while read -r list; do iget /seq/27104/$list . ;  done &

# once you cram files have downloaded, convert crams to fastq
bsub.py 1 cram2fq ~sd21/bash_scripts/run_cram2fastq


Note the script ~sd21/bash_scripts/run_cram2fastq says:
for i in *cram; do samtools view -ub --threads 4 ${i} | samtools sort -n - | samtools fastq -1 ${i%.cram}_1.fastq.gz -2 ${i%.cram}_2.fastq.gz - ; done

Note: to get this script to run for me, I had to make a copy of it and change the path for samtools to be /software/pathogen/external/apps/usr/local/samtools-1.6/samtools, which was the one in Steve's .cshrc

More info about irods
A lot of initial questions are covered in the FAQs located at
https://gitlab.internal.sanger.ac.uk/kdj/npg_doc/blob/master/irods_in_10_minutes.adoc
http://mediawiki.internal.sanger.ac.uk/index.php/IRODS_for_Sequencing_Users

Acknowledgements
Thanks very much to Steve Doyle for help with this.

Thursday, 11 October 2018

Analysing a S. mansoni gene list

To analyse a list of Schistosoma mansoni genes with a particular expression pattern, to see if they have some unusual features, here are some things you can try:

1. STITCH (protein-protein interactions)  http://stitch.embl.de/
If there is no S. mansoni in the STITCH database, one could try C. elegans orthologs, and/or human orthologs (see 2 below). You could try with ortholog_one2one orthologs, and separately with ortholog_one2many orthologs; one-to-one orthologs might be a cleaner set, but adding one-to-many orthologs may give you a much larger sample, so allow you to pick up more.
2. To get C. elegans (or human) orthologs
  - go to wormbase parasite https://parasite.wormbase.org/index.html
  - click on biomart at top
  - click species, tick genome, choose S. mansoni
  - click on 'Gene', then in the 'id list' box, paste in your S. mansoni Smp id list
  - This gives it your list of S. mansoni genes
  - To tell it you want eg. C. elegans orthologs, click on 'output attributes' on the left (blue link)
  - Click on 'orthologs', then under C. elegans orthologs choose for example gene id, gene name, protein stable id, % id, homology type
   Note: STITCH might want gene id (WBGene) or the protein id (e.g. AC5.3)
  - click on 'Count' at top of page to get id of number of results, then 'Results' at top of page to get Results. Save as tab separated file

3. Another thing to try is tissue/phenotype/GO enrichment using the C. elegans wormbase page, and your C. elegans orthologs: https://wormbase.org/tools/enrichment/tea/tea.cgi
  Could try with ortholog_one2one orthologs, and separately with ortholog_one2many.

4. Can try GO enrichment for S. mansoni (or your C.elegans/human orthologs) using TopGO (an R package).
  First you need your S. mansoni GO terms for S. mansoni genes from WormBase parasite Biomart :
  - go to wormbase parasite https://parasite.wormbase.org/index.html
  - click on biomart at top
  - click species, tick genome, choose S. mansoni
  - To tell it you want GO terms, click on 'output attributes' on the left (blue link)
  - Select 'Gene ontology' and tick all the boxes beside it
  - Click on "results' at top of the page
  - Then run TopGO. See  my notes here on TopGO: http://avrilomics.blogspot.com/2015/07/using-topgo-to-test-for-go-term.html

Note: 4-Feb-2019:
Another thing to try could be https://biit.cs.ut.ee/gprofiler/gost for GO enrichment, can take WormBase ParaSite gene lists.

Note: 3-July-2019:
Other things to try:
- find Drosophila melanogaster orthologs, paste into the FlyMine website
- find Drosophila melanogaster orthologs, and test for enrichment of phenotypes downloaded from FlyBase
- try the GSEA Gene Set Enrichment software

Tuesday, 25 September 2018

How to do a substructure search of ChEMBL (2)

I previously wrote some notes on how to do a substructure search of ChEMBL.

I've been doing some more and found some funny/interesting things.

1) Strange things with aromatic rings!
I wanted to do a search that would hit both menadione (a napthaquinone, image from wikipedia):

Skeltal formula
and also hit anthraquinones like diacerein (image from wikipedia):
Structural formula of diacerein
I first tried this with this substructure but it only hit menadione and not the anthraquinones:










Then after discussing with Noel (thanks Noel!) I found out that I need to specify S/A (single/aromatic) and D/A (double/aromatic) bonds, like this:






This found both the anthraquinones and naphthaquinones, hurray!

Note: I found that the ability to specify S/A and D/A bonds is only available in the current ChEMBL website, not the new ChEMBL beta website, which seems to use a different sketcher from ChemAxon.
Note 2 (1-Oct-2018): I heard back from the ChEMBL website that you can specify S/A bond on the ChEMBL beta website by selecting the bond and pressing 14, and D/A by pressing 24 (and 'any' bond by pressing 0). Thanks to David Mendez from ChEMBL! 

2) Specifying any atom with 'A'
I wanted to do a search which would pick up both

and




I found I could do this with the 'A' (any atom), like this:




Note: on the ChEMBL beta website, I see '*', but if I select this and press 'A'  it turns to A.

3) Wierd things I can't explain

I found that when I do this attached substructure search

 it gets a hit to dipananone https://www.ebi.ac.uk/chembldb/index.php/compound/inspect/CHEMBL2111157 when I do the search using the ChEBML beta interface, but not when I do the exact same substructure search using the normal CHEMBL interface. I'm not sure why this is? I sent an email to the ChEMBL helpdesk, waiting for a reply...

4) Starting from a particular structure
If you have a problem drawing in your structure, you can save a related compound as its mol file from the ChEMBL website, and then click on the 'open file' button within the sketcher on the ChEMBL homepage, and open that mol file to be your starting point for editing.

Acknowledgements
Thanks to Noel O'Blog and John Mayfield for help!
And also a big thanks to David Mendez from ChEMBL!