Monday 29 October 2018

Phred quality scores

Phred quality scores are used as a measure of quality for DNA sequencing base quality.
I always forget how they are calculated.

From wikipedia, the Phred base quality score for a particular base is:
Q = -10 log_10 (P)
where P = the probability that a particular base was called incorrectly.

Working out the probability a particular base was called incorrectly from the Phred score?
If you have Q, you can work out P as:
P = 10^(-Q /10)
So a base quality score of 20 means P = 10^(-2) = 0.010.
A base quality score of 25 means P = 0.003 approx.
A base quality score of 30 means P = 0.001.

Working out the Phred quality score threshold needed to achieve a particular threshold of P
If we want to use a threshold of P = 0.05, then we would need to use a threshold of Q of:
Q = -10 log_10(0.05) = 13.0103.
Working back to get P gives:
P = 10^(-13.0103/10) = 0.05.

Likewise, for a threshold of P = 0.005:
Q = -10 log_10(0.005) = 23.0103.
Working back to get P gives:
P = 10^(-23.0103/10) =  0.005.

Friday 26 October 2018

Running CRISPResso on a ton of sequencing data

I have a ton of sequencing data from an Illumina HiSeq lane from a PCR-amplicon sequence, and I'd like to run CRISPResso on it. In a previous blogpost, I talked about how to run CRISPResso.

Here's I'll talk about how to run it on loads of data! In this case I had ~20 million read-pairs for a single sample, while previously I had only run CRISPResso on 1-2 million read-pairs (even that took a couple of days!)

Here is a little plot I made of CRISPResso run-time in hours versus number of input read-pairs:

You can see that as the number of read-pairs gets above about 1 million, the run-time is 0.5-1 day roughly. I guess the runtime may also depend on things like the base quality of the reads (if they are low quality, I'm guessing CRISPResso will have to make lots of alignments with indels/substitutions, and may infer lots of indel/substitution mutations, which will take time).

To run CRISPResso on ~20 million read-pairs from a single sample, here's what I did:

1. Split up the input fastq files into smaller files of 1 million reads each
I have paired end read-pairs, so the input data is in a file called sample1_1.fastq.gz (forward reads) and sample1_2.fastq.gz (reverse reads). To split this up, I ran:
% python3 ~alc/Documents/git/Python/ sample1_1.fastq.gz 1000000 sample1_1_sub
% python3 ~alc/Documents/git/Python/ sample1_1.fastq.gz 1000000 sample1_1_sub
You can see my Python script on gist here.
Note to self: I submitted farm jobs to do this, as it's slow.

2. Use Trimmomatic to discard low quality read-pairs.
I used the Trimmomatic software to discard read-pairs with low quality bases, only taking read-pairs where both reads of a pair had average base quality of >=20. To do this, I ran:
python3 ~alc/Documents/git/Python/ sample1 17
where here sample1 is the name of the sample in step 1 above, and 17 is the number of pairs of smaller fastq files that step 1 above made. 
This script submits Trimmomatic jobs to the Sanger compute farm.
You can see my Python script on gist here.

3.  Run CRISPResso on each smaller subset of read-pairs.
I then ran CRISPResso on each smaller subset of read-pairs:
% python3 ~alc/Documents/git/Python/ sample1 17
where here sample1 is the name of the sample in step 1 above, and 17 is the number of pairs of smaller fastq files that step 1 above made. 
This script submits CRISPResso jobs to the Sanger compute farm.
You can see my Python script on gist here, you would need to edit it at the 'xxx' positions, to put in your amplicon sequence, gRNA sequence and coding sequence. 

Spiking sequencing libraries with PhiX

If you are sequencing a low complexity library (e.g. a PCR-amplicon library) using Illumina sequencers, it's a good idea to spike in some PhiX to improve base quality, as explained here and here on the Illumina support pages. On the seqanswers forum they say that PhiX is used for:
"introduction of sequencing diversity in low-complexity libraries (diversity is needed to discriminate clusters and create signal thresholds for base-calling). As the software has improved, the recommended amount of phiX spike-in has decreased." 
Another comment says:
"Older basecalling software on the MiSeq used to require a 50% phiX spike for low-diversity (amplicon samples). More recently, the software has been updated and only requires a 5-10% spike."

Reading and writing gzipped files in Python

I've been reading in zipped (.gz) files (in my case, zipped fastq files of sequence data) in Python using a lovely Python module called
The same module can also write out zipped (.gz) files.

For example, here is a little script that reads in a zipped (.gz) input fastq file of sequence reads, and splits it into smaller files that have seqs_per_output_file sequences each. The output files are written out as zipped files (.gz).

I've highlighted the lines of the file that use the gzip module, and read and write from the input and output files.

Very handy!

import sys
import os
import gzip
from collections import defaultdict


# now read in the input fastq and split it up:    

def read_fastq_file_and_split(input_fastq_file, seqs_per_output_file, output_file_prefix):

    # open an output file:
    output_file_cnt = 1
    output_file = "%s_%d.fastq.gz" % (output_file_prefix, output_file_cnt)
    outputfileObj =, "wb") # write out the output file in gzipped format

    # read in the input file:
    fileObj =, "rt") # this opens a gzipped file in text mode
    seqcnt = 0
    linecnt = 0
    for line in fileObj:
        line = line.rstrip()
        if line.startswith('@') and (linecnt == 0 or linecnt == 4):
            linecnt = 0
            seqcnt += 1
            if seqcnt == (seqs_per_output_file + 1):
                output_file_cnt += 1
                output_file = "%s_%d.fastq" % (output_file_prefix, output_file_cnt)
                outputfileObj =, "wb") # write out the output file in gzipped format
                seqcnt = 1
        outputline = "%s\n" % line
        outputfileObj.write(outputline.encode()) # need to write bytes (not a string) to the output file, see
        linecnt += 1

    # the fastq file looks like this:
    # @M03558:259:000000000-BH588:1:1101:15455:1333 2:N:0:NTTGTA
    # +



def main():
    # check the command-line arguments:
    if len(sys.argv) != 4 or os.path.exists(sys.argv[1]) == False:
        print("Usage: %s input_fastq_file seqs_per_output_file output_file_prefix" % sys.argv[0])
    input_fastq_file = sys.argv[1] # input fastq file                 
    seqs_per_output_file = int(sys.argv[2]) # number of sequences to put into each output file
    output_file_prefix = sys.argv[3] # prefix to use for the output file names

    # now read in the input fastq and split it up:    
    read_fastq_file_and_split(input_fastq_file, seqs_per_output_file, output_file_prefix)

if __name__=="__main__":


Thursday 25 October 2018

Randomly sample from a fastq file

Sometimes I want to take a sneak peak at some sequencing data, which is given to me hot from the sequencing machines in the form of a pair of fastq files (paired end read-pairs: one fastq file for the forward reads and one for the reverse reads). I used to just take the first n (e.g. 250,000) read-pairs from the top of the fastq file, to have a look. However, as discussed on seqanswers here it's often the case that:
SOLiD or Illumina output files have a pile of rubbish at the start of the file from bad sequencing reads (the edges of chips / cells seem to be more prone to error), which heavily influences the results gleaned from the first few reads.”
To get around this, they suggest you subsample randomly using HTSeq, and also point to a discussion here.

Filter read-pairs by average base quality using Trimmomatic

I have a large amount of Illumina sequencing data (~17 million read-pairs) and, before running further analysis, want to filter the data by the average base quality, for example by discarding all read pairs that have average base quality of <=20. The format of my data is two fastq files for my paired Illumina read-pairs, one for forward reads and one for reverse reads.

From a discussion on seqanswers about filtering, I found that people suggested Trimmomatic.

Running Trimmomatic
I tried running Trimmomatic on some smaller files of 250,000 (0.25 million) read-pairs, using this command-line: (note to self: on the Sanger farm)
java -Xmx128M  -jar /software/pathogen/external/apps/usr/local/Trimmomatic-0.33/trimmomatic-0.33.jar PE -threads 1 -phred33 sample1topsmall_1.fastq.gz sample1topsmall_2.fastq.gz sample1topsmallout_1_paired.fastq.gz sample1topsmallout_1_unpaired.fastq.gz sample1topsmallout_2_paired.fastq.gz sample1topsmallout_2_unpaired.fastq.gz SLIDINGWINDOW:150:20 
PE means my reads are paired ends,
-threads 1 tells Trimmomatic to just use one thread (one CPU),
-phred33 tells Trimmomatic that the phred quality scores in my file use ASCII_BASE=33 (see here for an explanation), (note to self: I knew this from CRISPResso output)
sample1topsmall_1.fastq.gz sample1topsmall_2.fastq.gz are my input fastq files,
sample1topsmallout_1_paired.fastq.gz, sample1topsmallout_1_unpaired.fastq.gz, sample1topsmallout_2_paired.fastq.gz, sample1topsmallout_2_unpaired.fastq.gz are output files produced by Trimmomatic (two with reads where both reads of a pair pass the filter, two files with reads where just one read of a pair passes),
SLIDINGWINDOW:150:20 tells Trimmomatic to use a sliding window of 150 bp and take reads that have average base quality of >=20 in this window (note to self: my reads are 150 bp long).

Output from Trimmomatic
Trimmomatic ran really quickly! It gave some output like this:
Input Read Pairs: 250000 Both Surviving: 1619 (0.65%) Forward Only Surviving: 228797 (91.52%) Reverse Only Surviving: 604 (0.24%) Dropped: 18980 (7.59%)
TrimmomaticPE: Completed successfully

You can see here that my forward reads were much higher quality than my reverse reads, so few read-pairs survived where both reads of the pair passed the quality filter. The output files sample1topsmallout_1_paired.fastq.gz and sample1topsmallout_2_paired.fastq.gz have 1619 reads each as 1619 read-pairs passed.

Run-time and memory (RAM) required by Trimmomatic
I found Trimmomatic very fast. For 2.5 million read-pairs, it took about 2 minutes to run on the Sanger compute farm (using one CPU), and needed just ~80 Mbyte of memory (RAM).

Thursday 18 October 2018

Submitting an assembly plus annotations to the ENA

I need to submit an assembly plus annotations for a parasitic worm (a multicellular eukaryote) to the ENA. Note: see my previous blogpost on preparing an EMBL file for the ENA.

To do this, I followed these steps:

Step 1: Upload my EMBL file (with genome assembly and annotation) to the ENA submission website
Note (later): I think that actually Step 1 is not necessary for submitting your EMBL file to the ENA, so you could skip directly to Step 2 here (if you have your EMBL file already; if you need to make your EMBL file see here, and if you need to register your sample and study with the ENA see here).

Once I had created an EMBL file with my genome assembly and annotation (see here), and had checked my study and sample had already been registered with the ENA (see here), I followed these steps:

1) I went to the ENA submission website, and signed in using the login and password given to me by my team (Note to self: asked Pathogen Informatics team for this).

2) I clicked on the 'New Submissions' tab at the top.

3) Then I selected 'Submit genome assemblies'.

4) At the bottom of the screen, it says the first step is to upload the data files into the "Webin upload area". I clicked on the 'upload data files' link.

5) This brought me to a page that says to look at

6) I created a md5sum for my file by typing:
% md5sum my.embl.gz > my.embl.gz.md5

7) Following the instructions on the page, I did the following to transfer my files using ftp, to the ENA's "Webin upload area":
% ftp [enter username and password]
> bin
> put my.embl
> put my.embl.md5
> bye

The files that I transferred were zipped embl files (e.g. enterobius_vermicularis_new2.embl.gz and  enterobius_vermicularis_new2.embl.md5.gz for a species Enterobius vermicularis).

Step 2: Make a "manifest" file for the EMBL file
To submit an EMBL file for example for a species Enterobius vermicularis, you need to make a 'manifest file' that contains the information on the assembly, e.g. enterobius_vermicularis.manifest. The manifest files are described here. For example, for my species Enterobius vermicularis, the manifest file looked like this: (tab separated)
SAMPLE    ERS076738
ASSEMBLYNAME   E_vermicularis_Canary_Islands_0011_upd
PROGRAM   SGA, Velvet, SSpace, IMAGE, GapFiller, REAPR
PLATFORM   Illumina
FLATFILE   enterobius_vermicularis_new2.embl.gz

Note: here the assembly name has had '_upd' appended to it, as it is an update of a previous one that was in the ENA.

Note: If your assembly is in chromosomes you also need a chromosome list file (see here) as well as a manifest file; this wasn't the case for me as I just had scaffolds in my assembly.

Note: to see the manifest file information for an assembly that has already been submitted to the ENA, you can log into the ENA submission website (see Step 1 above), and then type the analysis accession number (e.g. ERZ021218) in the search box 'Accession/Unique name' at the top right, and you will see the analysis pop up, then click 'Edit' on the right' and you will see a page with 'Edit details' pop up, and then click on 'Edit XML' on the top right and this should bring you to a page with the details that were in the manifest, which will look something like this (I've highlighted in red the info that was used in the manifest file above):
<?xml version="1.0" encoding="UTF-8"?><ANALYSIS_SET>
   <ANALYSIS accession="ERZ021254" alias="E_vermicularis_Canary_Islands_0011" center_name="WTSI">
         <SUBMITTER_ID namespace="WTSI">E_vermicularis_Canary_Islands_0011</SUBMITTER_ID>
      <TITLE>Genome sequence of Enterobius vermicularis</TITLE>
      <DESCRIPTION>We have sequenced the genome of Enterobius vermicularis</DESCRIPTION>
      <STUDY_REF accession="ERP006298">
      <SAMPLE_REF accession="ERS076738">
            <PROGRAM>SGA, Velvet, SSpace, IMAGE, GapFiller, REAPR</PROGRAM>
         <FILE checksum="4bb219b0585c07e91ac57332288236b0" checksum_method="MD5" filename="ERZ021/ERZ021254/EVEC.v1.QC.fa_v4_submit.gz" filetype="fasta"/>

Step 3: Use the webin java client to submit the EMBL file to the ENA

1) Download the latest version of the webin command line tool, as described here. This is called something like webin-cli-root-1.4.2.jar.

2) If Java is not installed on your laptop, you need to install it (see here ).

3) You should now run the webin client in the directory that has your manifest file and chromosome list file (if you have a chromosome list file; see Step 2 above), using the -validate and -test options. You need to run the webin client by typing something like this:
% java -jar webin-cli-root-1.4.2.jar 

Here are the options you need: (something like this: you will need to change it for your directories, password, username, etc.)
/software/bin/java -jar webin-cli-root-1.4.2.jar -context genome -username xxx -password xxx -manifest enterobius_vermicularis.manifest -inputdir . -outputdir ./evermicularis -validate -test
where -username and -password give your username and password for the ENA (used in Step 1),
-manifest gives the name of your manifest file (e.g. enterobius_vermicularis.manifest),
-inputdir gives the name of your input directory,
-outputdir gives the name of the directory where you want output to be put (Note: you seem to need this option for the webin client to run),
-validate runs the EMBL file validator,
-test means that this is just a test run that runs the validator and the files are not actually submitted.
It's a good idea to do this before you submit.
Note: you need to have the EMBL file mentioned in your manifest file in the input directory (inputdir), e.g. file  enterobius_vermicularis_new2.embl.gz. Webin does not seem to look for it in the  "Webin upload area" (see Step 1) - this is what I had initially assumed it does, but it seems that's not the case.

Note: On the Sanger farm head node, this java program needs more memory (RAM) than the default required so I needed to reserve some RAM by typing:
% /software/bin/java -Xmx128M -jar webin-cli-root-1.4.2.jar
If your EMBL file is very large, the Java program may need more memory (RAM) than is allowed on the Sanger farm login node, so you will need to submit this as a farm job.
I wasn't able to get webin running on the Sanger farm (either farm head node or other nodes), as I kept getting an error :
'ERROR: A server error occurred when checking application version. Connect to [] failed: Connection timed out'
I tried a few different things: (i) using -Dhttps.proxyHost and -Dhttps.proxyPort options in webin (see here) when running webin, and (ii) setting the environmental variable http_proxy on the farm. However, this still didn't run for me. However, I found it ran on my Sanger Mac laptop, so that was fine.

4) Once the webin command in (3) above has run, it produces an output report file in a subdirectory 'validate', e.g. mcorti/genome/M_corti_Specht_Voge_0011_upd/validate/
If the 'validate' output report file is empty, then it has run fine with no errors.

There is also an output 'submit' directory, e.g. mcorti/genome/M_corti_Specht_Voge_0011_upd/submit that contains a file analysis.XML.  That is, the 'submit' directory should have your analysis XML file, which is an XML version of your manifest (see Step 2 above).

If it doesn't pass the 'validate' step, then look in the 'genome' directory to see what error messages you got.

5) If (4) was all ok, now run the webin client with the -submit option instead of -validate, and without -test:
% /software/bin/java -jar webin-cli-root-1.4.2.jar -context genome -username xxx -password xxx -manifest enterobius_vermicularis.manifest -inputdir . -outputdir ./evermicularis -submit

6) Once your assembly has been submitted, it will go into a queue in the ENA to be imported into their database. Once this has been done (may take a little while if they have a lot of sequence data in the queue already), you will be sent the new ENA accession number for it.

Checking what happened to your assembly
If you don't hear back from the ENA about your assembly, you can log into with your webin account. Then, in the "Analysis process" tab, you can search for the accessions to see if they have failed. Usually if there is a failure, the error messages will be emailed to the email address associated with your webin account (Note to self: this is the pathogen informatics team for the webin account I used.)

A big thank you to Dr Ana M. Cerdeño-Tárraga at the ENA for her help, and also to the Pathogen Informatics team (Dr Jacqui Keane, Sara Sjunnebo) at Sanger for some advice.

Installing Java on my laptop

This is only really useful to Sanger users...

On my mac laptop, to install Java, I had to:
1) Go to the 'Applications' folder, and click on 'Managed Software Center'
2) Select 'Oracle Java JDK 8', and click install.
3) Click on 'Install requested' at top right. Then a 'checking for updates' screen appears. This seems to install Java.
4) In a xterm window on the Mac, I can then type a java command, e.g.: (to run a Java program called webin-cli-root-1.4.2.jar)
% java -jar webin-cli-root-1.4.2.jar
Runs fine!

Monday 15 October 2018

Getting raw sequence data at Sanger

This is useful for Sanger people only: how to get some data off our Sanger irods system.

1) Request an irods account from the service desk, as explained on the irods wiki page.

2) Make sure your ..softwarerc has irods in, so you can use irods commands. 

3) (Steve told me): first move to a directory where you want to transfer the cram files, and then run the following:
(e.g. for run 27104, lane 8)

< input password >   # you need to input your irods password here
icd /seq/27104
ils | grep "27104_8.*cram"$ | grep -v "phix" | while read -r list; do iget /seq/27104/$list . ;  done &

# once you cram files have downloaded, convert crams to fastq 1 cram2fq ~sd21/bash_scripts/run_cram2fastq

Note the script ~sd21/bash_scripts/run_cram2fastq says:
for i in *cram; do samtools view -ub --threads 4 ${i} | samtools sort -n - | samtools fastq -1 ${i%.cram}_1.fastq.gz -2 ${i%.cram}_2.fastq.gz - ; done

Note: to get this script to run for me, I had to make a copy of it and change the path for samtools to be /software/pathogen/external/apps/usr/local/samtools-1.6/samtools, which was the one in Steve's .cshrc

More info about irods
A lot of initial questions are covered in the FAQs located at

Thanks very much to Steve Doyle for help with this.

Thursday 11 October 2018

Analysing a S. mansoni gene list

To analyse a list of Schistosoma mansoni genes with a particular expression pattern, to see if they have some unusual features, here are some things you can try:

1. STITCH (protein-protein interactions)
If there is no S. mansoni in the STITCH database, one could try C. elegans orthologs, and/or human orthologs (see 2 below). You could try with ortholog_one2one orthologs, and separately with ortholog_one2many orthologs; one-to-one orthologs might be a cleaner set, but adding one-to-many orthologs may give you a much larger sample, so allow you to pick up more.
2. To get C. elegans (or human) orthologs
  - go to wormbase parasite
  - click on biomart at top
  - click species, tick genome, choose S. mansoni
  - click on 'Gene', then in the 'id list' box, paste in your S. mansoni Smp id list
  - This gives it your list of S. mansoni genes
  - To tell it you want eg. C. elegans orthologs, click on 'output attributes' on the left (blue link)
  - Click on 'orthologs', then under C. elegans orthologs choose for example gene id, gene name, protein stable id, % id, homology type
   Note: STITCH might want gene id (WBGene) or the protein id (e.g. AC5.3)
  - click on 'Count' at top of page to get id of number of results, then 'Results' at top of page to get Results. Save as tab separated file

3. Another thing to try is tissue/phenotype/GO enrichment using the C. elegans wormbase page, and your C. elegans orthologs:
  Could try with ortholog_one2one orthologs, and separately with ortholog_one2many.

4. Can try GO enrichment for S. mansoni (or your C.elegans/human orthologs) using TopGO (an R package).
  First you need your S. mansoni GO terms for S. mansoni genes from WormBase parasite Biomart :
  - go to wormbase parasite
  - click on biomart at top
  - click species, tick genome, choose S. mansoni
  - To tell it you want GO terms, click on 'output attributes' on the left (blue link)
  - Select 'Gene ontology' and tick all the boxes beside it
  - Click on "results' at top of the page
  - Then run TopGO. See  my notes here on TopGO:

Note: 4-Feb-2019:
Another thing to try could be for GO enrichment, can take WormBase ParaSite gene lists.

Note: 3-July-2019:
Other things to try:
- find Drosophila melanogaster orthologs, paste into the FlyMine website
- find Drosophila melanogaster orthologs, and test for enrichment of phenotypes downloaded from FlyBase
- try the GSEA Gene Set Enrichment software