avrilomics: 2026

Tuesday, 17 March 2026

Making a bubble plot to show frequencies

I've been using ggplot2 in R to make a bubble plot to show frequencies, as an alternative to a histogram. Here's a little example:

I have a file year_data_file_example.txt with columns YEAR, LINEAGE, NUMBER, where NUMBER is the number of bacterial isolates of each lineage in each year:

YEAR LINEAGE NUMBER
1990 lineage1 30
1990 lineage2 25
1990 lineage3 5
1991 lineage1 25
1991 lineage2 27
1991 lineage3 8
1992 lineage1 20
1992 lineage2 28
1992 lineage3 9
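The NUMBER column holds raw counts. If you'd rather plot the fraction of isolates in each year, here's a minimal awk sketch that adds a per-year fraction column. The function name year_fractions and the FRACTION column are just my illustrative choices, and it assumes the file is whitespace-delimited with a single header line:

```shell
# Hypothetical helper: append a per-year FRACTION column to a
# YEAR LINEAGE NUMBER table (whitespace-delimited, header on line 1).
year_fractions() {
    # The file is read twice: pass 1 (NR == FNR) sums NUMBER per YEAR,
    # pass 2 prints each row followed by NUMBER / (total for that YEAR).
    awk 'NF < 3    { next }                                   # skip blank or short lines
         NR == FNR { if (FNR > 1) total[$1] += $3; next }     # pass 1: per-year totals
         FNR == 1  { print $1, $2, $3, "FRACTION"; next }     # pass 2: header
                   { printf "%s %s %s %.3f\n", $1, $2, $3, $3 / total[$1] }' "$1" "$1"
}
```

For example, `year_fractions year_data_file_example.txt > year_data_with_fractions.txt` would write a new file in which FRACTION could be used for the bubble size instead of NUMBER.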

I made a bubble plot using R by typing:

> library("ggplot2")

> MyData <- read.table("year_data_file_example.txt",header=TRUE)
> ggplot(MyData, aes(x=YEAR, y=LINEAGE, size=NUMBER)) + geom_point(color = 'blue')

Acknowledgements

Thanks to my colleague Amber Barton for advice.

Friday, 20 February 2026

Using MOB-suite to predict plasmids in bacterial genome assemblies

Today I wanted to predict plasmids in a bacterial genome assembly, and used the MOB-suite tool.

Here's how I ran it on the Sanger compute farm:

% mob_recon --infile genome.fasta --outdir genome_plasmid

where genome.fasta is the fasta file name for my genome, and genome_plasmid is the name I wanted to give to the output directory. I needed to request 1000 Mbyte of RAM to run this on a 4.5 Mbyte bacterial genome.
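Here's a rough sketch of what the memory request could look like, assuming the farm's job scheduler is LSF and that it is configured to take memory in Mbyte (check your local farm documentation). The helper only prints a bsub command so you can inspect it before running it; the function name and the log file names are hypothetical:

```shell
# Hypothetical helper: print (not run) an LSF bsub command that requests
# a given amount of memory, in Mbyte, for a mob_recon job.
# Assumes an LSF scheduler that reads -M and rusage[mem=...] in Mbyte.
mob_recon_bsub_cmd() {
    local infile=$1 outdir=$2 mem_mb=${3:-1000}   # default to 1000 Mbyte
    printf 'bsub -M %s -R "select[mem>%s] rusage[mem=%s]" -o %s.o -e %s.e ' \
        "$mem_mb" "$mem_mb" "$mem_mb" "$outdir" "$outdir"
    printf 'mob_recon --infile %s --outdir %s\n' "$infile" "$outdir"
}
```

Running `mob_recon_bsub_cmd genome.fasta genome_plasmid 1000` prints the command line, which can then be pasted into the terminal if it looks right.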

The output file will be genome_plasmid/mobtyper_results.txt.

Some useful columns in the output file are:

column 15: mash_nearest_neighbor
column 16: mash_neighbor_distance
column 17: mash_neighbor_identification

The output 'mobtyper_results.txt' file looks something like this:

sample_id num_contigs size gc md5 rep_type(s) rep_type_accession(s) relaxase_type(s) relaxase_type_accession(s) mpf_type mpf_type_accession(s) orit_type(s) orit_accession(s) predicted_mobility mash_nearest_neighbor mash_neighbor_distance mash_neighbor_identification primary_cluster_id secondary_cluster_id predicted_host_range_overall_rank predicted_host_range_overall_name observed_host_range_ncbi_rank observed_host_range_ncbi_name reported_host_range_lit_rank reported_host_range_lit_name associated_pmid(s)
CCBT0329:AA860 1 153481 0.5174686451848661 8c072d1914bfa50eb379d2673416d2b0 IncC 000092__CP025470 MOBH,MOBH NC_012690_00071,NC_012885_00072 MPF_F NC_023291_00077,NC_012885_00091,NC_016974_00085,NC_012885_00083,NC_014170_00023,NC_009140_00071,NC_012885_00167,NC_012885_00088 MOBH JQ319772 conjugative CP015394 0.000143503 Klebsiella pneumoniae AA860 AJ278 phylum Pseudomonadota class Gammaproteobacteria phylum Pseudomonadota 23800906; 20138094; 19482926; 24567731; 28842132; 20851899; 22290972; 19949054
CCBT0329:AC804 1 3981 0.46897764380808843 cab608a1a227ef9028aa1b8d80e819b9 rep_cluster_159 000964__AF052650 - - - - - - non-mobilizable AF052650 0.00759618 Vibrio cholerae AC804 AM145 genus Vibrio genus Vibrio - - -

In this example, two plasmids are predicted in the genome. The first one is an IncC plasmid of about 153 kb, which has its closest sequence match to NCBI accession CP015394, a Klebsiella pneumoniae plasmid. The second one is a small plasmid of about 4 kb, which has its closest sequence match to NCBI accession AF052650, a Vibrio cholerae plasmid. If you look up AF052650 on the NCBI website, you'll find it is V. cholerae plasmid pTLC.
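Because the rows are so wide, it can be easier to pull out just the columns of interest. Here's a minimal sketch that looks columns up by header name rather than by number, assuming the file is tab-delimited with one header line. The function name mobtyper_summary and the choice of columns are mine:

```shell
# Hypothetical helper: print a few columns of mobtyper_results.txt,
# found by header name. Assumes a tab-delimited file with one header line.
# Note: if a header name is missing, awk falls back to printing the whole line
# for that field, so check the header row of the output.
mobtyper_summary() {
    awk -F '\t' -v OFS='\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i   # map header name -> column index
        }
        {
            print $col["sample_id"], $col["size"], $col["rep_type(s)"],
                  $col["predicted_mobility"], $col["mash_nearest_neighbor"],
                  $col["mash_neighbor_distance"], $col["mash_neighbor_identification"]
        }' "$1"
}
```

For example, `mobtyper_summary genome_plasmid/mobtyper_results.txt` prints one line per predicted plasmid, with the header line first.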

Thursday, 5 February 2026

Using enadownloader to download fastqs from the ENA

I wanted to download fastq files for a long list of SRR accessions from the ENA today.

I realised I could use the enadownloader tool, which I wrote a blog post about a while ago.

Here's how I used enadownloader to download the fastq files, on the Sanger compute farm:

First I checked which is the latest version of the enadownloader tool on the farm:

% module avail -t | grep -i ena

Then I loaded the module:

% module load enadownloader/v2.3.5-4ac05c8f

Then I made a file called 'srr_accessions' listing all the SRR accessions, one per line, like this:

SRR31024208
SRR31024304
SRR31024307

...
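With a long list it's easy to end up with a typo or a duplicated line, so here's a small sketch for checking the accession file before downloading. It assumes each line should be a run accession of the form SRR, ERR or DRR followed by digits; the function name check_run_accessions is hypothetical:

```shell
# Hypothetical helper: report malformed and duplicated lines in a file
# of run accessions (one per line). Prints nothing if the file looks clean.
check_run_accessions() {
    # Lines that are neither blank nor SRR/ERR/DRR followed only by digits
    grep -v -E '^([SED]RR[0-9]+)?$' "$1" | sed 's/^/malformed: /'
    # Well-formed accessions that appear more than once
    grep -E '^[SED]RR[0-9]+$' "$1" | sort | uniq -d | sed 's/^/duplicate: /'
}
```

For example, `check_run_accessions srr_accessions` should print nothing if every line is a well-formed, unique accession.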

Then I made an output directory 'srr_accessions_fastqs' to put the fastqs in:

% mkdir srr_accessions_fastqs

Then I used enadownloader to download the fastqs for all these accessions:

% enadownloader -t run -i srr_accessions -d -o srr_accessions_fastqs

where -t run means the type of data is sequence runs, -i srr_accessions means the input file is srr_accessions, -d means that I want to download data, and -o srr_accessions_fastqs means the output directory is srr_accessions_fastqs.
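After a big download it's worth checking that every accession actually got a fastq file. Here's a minimal sketch, assuming the downloaded files sit directly in the output directory and are named after the run accession in the usual ENA style (for example SRR31024208_1.fastq.gz or SRR31024208.fastq.gz). I haven't confirmed that naming for enadownloader's output, and the function name missing_fastqs is hypothetical:

```shell
# Hypothetical helper: print accessions from the list that have no
# matching fastq.gz file in the output directory.
# Assumes files are named <accession>.fastq.gz or <accession>_<n>.fastq.gz.
missing_fastqs() {
    local acc_file=$1 outdir=$2 acc f found
    while IFS= read -r acc || [ -n "$acc" ]; do
        [ -z "$acc" ] && continue                # skip blank lines
        found=0
        # An unmatched glob stays as a literal string, so -e is false for it
        for f in "$outdir/$acc".fastq.gz "$outdir/$acc"_*.fastq.gz; do
            [ -e "$f" ] && found=1
        done
        [ "$found" -eq 1 ] || printf '%s\n' "$acc"
    done < "$acc_file"
}
```

For example, `missing_fastqs srr_accessions srr_accessions_fastqs > srr_accessions_missing` would give a file of accessions to retry.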

Nice and easy!

Making assemblies for Oxford Nanopore sequence data using Dragonflye

I've been making genome assemblies for some Oxford Nanopore Technologies (ONT) sequencing data using the Dragonflye package by Robert A. Petit III.

It was super easy to run!

Here's how I ran it on the Sanger compute farm:

First I found the version of Dragonflye on the farm:

% module avail -t | grep -i dragon

Then I loaded it:

% module load dragonflye/1.2.1

Then I assembled the sequence reads for a Vibrio cholerae isolate using Dragonflye:

% dragonflye --reads SRR31024125_1.fastq.gz --outdir SRR31024125_1.fastq_dragonflye --gsize 4000000

where SRR31024125_1.fastq.gz was my input fastq file of ONT reads,

SRR31024125_1.fastq_dragonflye was the name that I wanted to give to the output directory,

--gsize 4000000 specifies that the Vibrio cholerae genome is about 4.0 Mbase.

The output file was called SRR31024125_1.fastq_dragonflye/contigs.fa.

It took about 20 minutes to make the assembly. The input file of ONT reads was about 93 Megabytes (SRR31024125_1.fastq.gz).
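A quick sanity check on the assembly is to count the contigs and add up their lengths, to see whether the total is near the expected 4.0 Mbase. Here's a minimal awk sketch that works on any FASTA file; the function name fasta_stats is my own:

```shell
# Hypothetical helper: report the number of contigs and the total number
# of bases in a FASTA file (sequences may be wrapped over several lines).
fasta_stats() {
    awk '/^>/ { n++; next }                                    # header line: one more contig
         { gsub(/[[:space:]]/, ""); total += length($0) }      # sequence line: add its length
         END { printf "contigs=%d total_bases=%d\n", n, total }' "$1"
}
```

For example, `fasta_stats SRR31024125_1.fastq_dragonflye/contigs.fa` prints a single line with the two numbers.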