Thursday, 5 February 2026

Using enadownloader to download fastqs from the ENA

I wanted to download fastq files for a long list of SRR accessions from the ENA today.

I realised I could use the enadownloader tool that I previously wrote a blogpost about a while ago.

Here's how I used enadownloader to download the fastq files, on the Sanger compute farm:

First I checked which is the latest version of the enadownloader tool on the farm:

% module avail -t | grep -i ena 

Then I loaded the module:

% module load enadownloader/v2.3.5-4ac05c8f

Then I made a file of all the SRR accessions, called 'srr_accessions' like this:

SRR31024208
SRR31024304
SRR31024307

...

Then I made an output directory 'srr_accessions_fastqs' to put the fastqs in:

mkdir srr_accessions_fastqs 

Then I used enadownloader to download the fastqs for all these accessions:

% enadownloader -t run -i srr_accessions -d -o srr_accessions_fastqs

where -t run means the type of data is sequence runs, -i srr_accessions means the input file is srr_accesions, -d means that I want to download data, -o srr_accessions_fastqs means the output directory is srr_accessions_fastqs.

Nice and easy!

 

 

Making assemblies for Oxford Nanopore sequence data using Dragonflye

I've been making genome assemblies for some Oxford Nanopore Technology (ONT) sequencing data using the Dragonflye package by Robert A. Petit III. 

It was super easy to run!

Here's how I ran it on the Sanger compute farm:

First I found the version of Dragonflye on the farm:

% module avail -t | grep -i dragon

Then I loaded it:

% module load dragonflye/1.2.1

Then I assembled sequence reads for a Vibrio cholerae isolate into an assembly using Dragonflye: 

dragonflye --reads SRR31024125_1.fastq.gz --outdir SRR31024125_1.fastq_dragonflye --gsize 4000000

where  SRR31024125_1.fastq.gz was my input fastq file of ONT reads,

SRR31024125_1.fastq_dragonflye was the name that I wanted to give to the output directory,

 --gsize 4000000 specifies that the Vibrio cholerae genome is about 4.0 Mbase.

The output file was called SRR31024125_1.fastq_dragonflye/contigs.fa.  

It took about 20 minutes to make the assembly. The input file of ONT reads was about 93 Megabytes (SRR31024125_1.fastq.gz).

 

 

  

Monday, 17 November 2025

Making a map of the locations where bacterial isolates were collected

 I wanted to make a map of the locations in the world were some bacterial isolates were collected. I found a nice website that gave me some useful location on plotting maps in R.

Here's what I found worked for me.

First I got a map of the world: 

> library("ggplot2")
> theme_set(theme_bw())
> library("sf")
> library("rnaturalearth")
> library("rnaturalearthdata")
> world <- ne_countries(scale = "medium", returnclass = "sf")

Then I made a text file in which I had the number of bacterial isolates collected in each country, and the abbreviation for countries using the ISO_A3 three-letter codes:
Isolates    Country    "Country Code"
1    Bahrain    BHR
1    Ecuador    ECU
1    El Salvador    SLV
1    Gabon    GAB
1    Guatemala    GTM
...
I read in this file:
MyCountryCountData <- read.table("countrycount_data.txt",header=TRUE, sep="\t", stringsAsFactors=FALSE)
Then I stored the data on the number of counts from each country, from this file:
numcountries <- length(world$iso_a3)
isolatecounts <- numeric()
for (i in 1:numcountries)
{
    mycountry <- world$iso_a3[i]
    print(paste("Calculating for country=",mycountry," row i=",i, "mylength=",length(isolatecounts)))
    myindex <- which(mycountry == MyCountryCountData$Country.Code)
    if (length(myindex) > 0)
    {
       myisolates <- MyCountryCountData$Isolates[myindex]
       isolatecounts <- append(isolatecounts, myisolates, after=length(isolatecounts))
    }
    else
    {
       isolatecounts <- append(isolatecounts, 0, after=length(isolatecounts))
    }
}
 
Then I made a plot showing a map of the world, with the countries coloured in according to their number of isolates:
 
Make a 300 dpi tiff file:
> tiff("countries_histogram.tiff", units="in", width=5, height=5, res=300)
> ggplot(data = world) + geom_sf(aes(fill = isolatecounts[1:242])) + scale_fill_viridis_c(option = "plasma", trans="sqrt")
> dev.off()
 
 The plot looks something like this:

 
 
 
 
 
 
 
 
 
 

Friday, 14 November 2025

Submitting V. cholerae and MLST sequence types to PubMLST

 I've been submitting novel Vibrio cholerae MLST alleles and sequence types to the PubMLST database.

This is the website for submitting novel Vibrio cholerae alleles, genomes or MLST sequence types to PubMLST. 

 Some little things I found out along the way:

- to submit a novel MLST sequence type based on a genome sequence, several things are necessary:

(i) first submit the novel allele(s) to PubMLST. This requires that you know some things, such as gene for the allele (e.g. pntA), sequencing technology used (e.g. Illumina), assembly type (e.g. de novo), assembly software (e.g. CLC genomics v. 7), sequencing coverage (e.g. >100x), read length (e.g. 200-299 bp). I usually just submit one allele at a time for a genome.

 Note that if a gene is missing from a particular genome, that is not counted as a novel allele or MLST sequence type. 

 PubMLST will then email you back with the new identifier for this allele, e.g.  allele 87 for pntA (a made-up example).

(ii) then you need to submit the genomes that have the novel allele/MLST to PubMLST. This requires you submit some information about the genome, in this format (below). You can submit one or more genomes at a time, e.g.

isolate    references    assembly_filename    sequence_method    country    year    species    serogroup    biosample_accession    run_accession    NCBI_assembly_accession
1223-93    35930328    1223-93.fasta.gz    Illumina    Indonesia    1993    Vibrio cholerae    O180    SAMD00180560    DRR213438    GCA_023164185.1
1003-93    35930328    1003-93.fasta.gz    Illumina    Indonesia    1993    Vibrio cholerae    O161    SAMD00180540    DRR213418    GCA_023163825.1

Note that if you don't have the NCBI accession, you can just leave that column empty. If you don't know the biosample accession or run accession you can put 'null' in those columns.

PubMLST will then email you back with the new PubMLST identifiers for these genomes, e.g. identifier 4188 and 4189 (a made-up example).

(iii)  then you can submit the novel MLST sequence type, giving an example of a genome in PubMLST in which the novel MLST sequence type is found. You need to submit some information in this format, e.g. for a genome with PubMLST identifier 4184:

id    adk    gyrB    mdh    metE    pntA    purM    pyrC
4184    13    40    14    46    3    9    39

Acknowledgements

Thank you to Sophie Octavia for helpful advice.

 

 

 

 

 

 

 

 

Friday, 10 January 2025

Using snippy to find SNPs in bacterial genomes

I have been learning how to use snippy by Torsten Seemann to identify SNPs in bacterial genomes.

Running snippy

To run snippy on the Sanger computer farm, I first had to type:

% module load snippy/4.6.0

Then I wanted to run snippy for an assembly "14.fasta", by comparing it to a reference genome "ref.fasta". I told snippy to infer SNPs by simulating fake 250-bp reads from the assembly "14.fasta", and comparing those to the reference genome:

% snippy --cpus 16 --outdir mysnps_test --ref ref.fa --ctgs 14.fasta

where the output files were put into directory mysnps_test, and the --cpus 16 means that 16 CPUs are used.

It took 8 minutes to run on that assembly.

 Output files from snippy

 The main output file from snippy is called snps.tab and looks something like this:

% head -10 mysnps_test/snps.tab
CHROM POS TYPE REF ALT EVIDENCE FTYPE STRAND NT_POS AA_POS EFFECT LOCUS_TAG GENE PRODUCT
AE003852 5414 snp G A A:20 G:0
AE003852 42082 snp A C C:20 A:0
AE003852 137105 del TAACAGAAACAGA T T:14 TAACAGAAACAGA:0
AE003852 144569 snp G A A:20 G:0
AE003852 167663 snp T C C:14 T:0
AE003852 167678 snp G A A:14 G:0
AE003852 167684 snp C T T:14 C:0
AE003852 167697 snp A G G:14 A:0
AE003852 182735 snp C T T:20 C:0
  

Acknowledgements

Thanks to my colleagues Lia Bote and Vignesh Shetty for help running snippy and understanding it.

 

 


 

 



Thursday, 22 August 2024

Using an image that has a Creative Commons license

 I'm writing some online training material, and want to include some images that have Creative Commons licenses, so need to figure out how to correctly cite the license information.

 A nice article by Creative Commons expert Cory Doctorow points out that there is a very particular way that you need to cite the license information correctly.

As Cory Doctorow says, all CC licenses (save for the Public Domain dedication) require that users:

  • Name the creator (either as identified on the work, or as noted in instructions to downstream users)
  • Provide a URL for the work (either as identified on the work, or as noted in instructions to downstream users)
  • Name the license
  • Provide a URL for the license
  • Note whether the work has been modified

The Creative Commons Wiki page gives recommended practices for attribution, including many examples.

Monday, 17 June 2024

Making documentation on readthedocs

I've found the Readthedocs website really useful for making documentation, for example, for the Little Books of R and for the Vibriowatch Tutorial.

 

One very nice thing is that you can keep the versions of your underlying text in github, and when you update the text/figures in github, it triggers readthedocs to update the formatted version on the readthedocs website.

There's a nice tutorial on how you can do this: here.

 

The steps are (as described in the tutorial here):

1. Preparing your repository on github

2. Create an account in readthedocs you don't have one already.

3. Import the new project into readthedocs

4. Checking the first build

5. Configuring the project

6. Triggering builds from pull requests

7. Adding a configuration file

8. Versioning documentation

9. Getting project insights