Wednesday, 6 May 2026

The Perceptron - classifying items into two classes based on numerical input variables

I've been reading Sebastian Raschka's excellent book Machine Learning with PyTorch and Scikit-Learn about machine learning/AI using Python.

The Perceptron algorithm
In the book Raschka talks about the Perceptron, a historically important AI algorithm for classifying individuals into two classes based on an arbitrary number of numerical input variables. The Perceptron is essentially an optimisation algorithm that iteratively adjusts weights in order to minimise classification errors for the individuals.  

The Perceptron is a simple, single-layer neural network. It can also be thought of as a linear classifier, as it calculates a linear combination of input features and weights to make predictions of one of two classes, that is, for binary classification.

Raschka taught me how to use the Perceptron to classify individual Iris flowers into two species, Iris setosa and Iris versicolor, based on their sepal length (in cm) and petal length (in cm):

1. First, I installed the necessary Python packages (numpy, scipy, scikit-learn, matplotlib, and pandas): see https://avrilomics.blogspot.com/2026/04/installing-python-libraries-within.html

2. Then I downloaded Raschka's Python scripts https://github.com/rasbt/machine-learning-book/blob/main/python_environment_check.py and https://github.com/rasbt/machine-learning-book/blob/main/ch02/ch02.py

3. Then I started up python3 and loaded Raschka's Python scripts, plus some other necessary Python modules:
% python3
>>> import python_environment_check
>>> import ch02
>>> import os
>>> import pandas as pd 
>>> import matplotlib.pyplot as plt
>>> import numpy as np

4. Then I loaded the Iris data set into a data frame 'df', and took the subset of the data corresponding to the 100 individual flowers belonging to either the Iris setosa or the Iris versicolor species, putting the species labels for the individuals into variable y (0=setosa, 1=versicolor) and their sepal lengths and petal lengths (in cm) into variable X:

>>> s = 'https://archive.ics.uci.edu/ml/'\
... 'machine-learning-databases/iris/iris.data'
>>> df = pd.read_csv(s, header=None, encoding='utf-8')
>>> y = df.iloc[0:100, 4].values # These are the species names
>>> y = np.where(y == 'Iris-setosa', 0, 1)
>>> X = df.iloc[0:100, [0, 2]].values

5. Then I trained the Perceptron on this subset of the Iris data set, using Raschka's code for the Perceptron algorithm (in Raschka's ch02.py Python script, downloaded above):
>>> ppn = ch02.Perceptron(eta=0.1, n_iter=10)
>>> ppn.fit(X, y)

6. Now I can use the trained Perceptron to classify a totally new flower just collected from the wild (which we assume is either Iris setosa or Iris versicolor) based on its sepal length and petal length. Say for example we have a flower with a sepal length of 5.0 cm and a petal length of 1.0 cm; let's classify it using the trained Perceptron:
>>> ppn.predict((5.0, 1.0))
array(0)
The Perceptron has classified it as class 0, that is, as Iris setosa.
Say we have a second flower with a sepal length of 5.0 cm and a petal length of 3.0 cm; let's classify it too:
>>> ppn.predict((5.0, 3.0))
array(1)
The Perceptron has classified it as class 1, that is, as Iris versicolor.

Yay! Thank you Dr Raschka.

Some caveats to know about the Perceptron algorithm are:
- it only works for two classes, not multiple classes, so for example it can't predict a third species of Iris.
- its training algorithm is only guaranteed to converge if the two classes (Iris setosa and Iris versicolor here) can be separated by a linear hyperplane, i.e. if they are linearly separable.

The Adaline (ADAptive LInear NEuron) Algorithm

In the book Raschka also talks about a second type of single-layer neural network, called the Adaline (ADAptive LInear NEuron) algorithm. Like the Perceptron, it is a linear classifier, as it calculates a linear combination of input features and weights to make predictions of one of two classes, that is, for binary classification.

The key difference between the Perceptron and Adaline algorithms is in how they iteratively adjust the weights in order to minimise classification errors for the individuals.

The Perceptron algorithm iterates over the individuals in the training data set one by one, and only if the prediction for the current individual is incorrect are the weights for the input variables (x) updated. That is, the weight w_j for the jth input variable x_j is adjusted by adding the prediction error for that individual, multiplied by the input variable's value (x_j) and the learning rate (eta).

The Adaline algorithm (in the batch gradient-descent form described in the book) instead updates the weights once per pass over the whole training set, and the weights for all the input variables are updated regardless of whether the individual predictions are correct. The weight w_j for the jth input variable x_j is moved in the direction of the negative gradient of the total prediction error over all individuals (i.e. the partial derivative of the total squared error with respect to the weight w_j), scaled by the learning rate (eta).
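The two update rules can be sketched in Python. This is a minimal sketch using the notation above (w_j, eta), not Raschka's actual Perceptron/Adaline classes from ch02.py:

```python
import numpy as np

def perceptron_epoch(X, y, w, b, eta=0.1):
    """One pass of the Perceptron rule: weights change only on misclassified points."""
    for xi, target in zip(X, y):
        prediction = int(np.dot(xi, w) + b >= 0.0)
        error = target - prediction       # 0 if the prediction was correct
        w = w + eta * error * xi          # w_j += eta * error * x_j
        b = b + eta * error
    return w, b

def adaline_epoch(X, y, w, b, eta=0.01):
    """One pass of batch Adaline: a single update from the gradient of the total error."""
    output = X.dot(w) + b                 # linear activation, no thresholding
    errors = y - output
    w = w + eta * 2.0 * X.T.dot(errors) / X.shape[0]  # i.e. -eta * dMSE/dw
    b = b + eta * 2.0 * errors.mean()
    return w, b
```

Repeatedly calling perceptron_epoch on linearly separable data (like the setosa/versicolor subset) eventually classifies every training point correctly, while adaline_epoch slides the mean squared error downhill whether or not the thresholded predictions are already correct.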
  



Tuesday, 28 April 2026

Installing Python libraries within virtualenv

 Today I learnt how to install Python libraries in a virtual environment (in my case, on the Sanger compute farm), using virtualenv.

 Here is how I created a virtual environment called 'raschka' and installed particular versions of the numpy, scipy, scikit-learn, matplotlib and pandas libraries inside it. First I created the virtual environment:

% virtualenv -p /usr/bin/python3 raschka 

Then I activated it:

% source raschka/bin/activate 

Then I installed Python libraries within it:

% pip install numpy==1.21.2

% pip install scipy==1.7.2

% pip install scikit-learn==1.1

% pip install matplotlib==3.4.3

% pip install pandas==1.3.2

To start up Python and load in these libraries I can type:

% python3

>>> import numpy

>>> import scipy

>>> import sklearn

>>> import matplotlib

>>> import pandas 
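To confirm that the virtualenv picked up the pinned versions installed above, a quick check (a minimal sketch; the package list is the one from this post):

```python
import importlib

def installed_versions(names):
    """Return a dict of package name -> __version__ (or None if not importable)."""
    versions = {}
    for name in names:
        try:
            versions[name] = importlib.import_module(name).__version__
        except ImportError:
            versions[name] = None
    return versions

print(installed_versions(["numpy", "scipy", "sklearn", "matplotlib", "pandas"]))
```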

Note that I can exit the virtual environment by typing:
% deactivate

and then later I can enter it again by typing:

% source raschka/bin/activate 

Cool!

 

Tuesday, 17 March 2026

Making a bubble plot to show frequencies

I've been using ggplot2 in R to make a bubble plot, to show frequencies. This is an alternative to a histogram. Here's a little example:

 I have a file year_data_file_example.txt with columns YEAR, LINEAGE and NUMBER, giving the number of bacterial isolates of each particular lineage in each year:

YEAR LINEAGE NUMBER
1990 lineage1 30
1990 lineage2 25
1990 lineage3 5
1991 lineage1 25
1991 lineage2 27
1991 lineage3 8
1992 lineage1 20
1992 lineage2 28
1992 lineage3 9 
 
I made the bubble plot in R by typing: 
> library("ggplot2")
> MyData <- read.table("year_data_file_example.txt",header=TRUE)
> ggplot(MyData, aes(x=YEAR, y=LINEAGE, size=NUMBER)) + geom_point(color = 'blue')
 
Acknowledgements
Thanks to my colleague Amber Barton for advice.
 
 



 

Friday, 20 February 2026

Using MOB-suite to predict plasmids in bacterial genome assemblies

Today I wanted to predict plasmids in a bacterial genome assembly, and used the MOB-suite tool.

Here's how I ran it on the Sanger compute farm:

% mob_recon --infile genome.fasta --outdir genome_plasmid 

where genome.fasta is the fasta file name for my genome, and genome_plasmid is the name I wanted to give to the output directory. I needed to request 1000 Mbyte of RAM to run this on a 4.5 Mbase bacterial genome.

The output file will be genome_plasmid/mobtyper_results.txt.

Some useful columns in the output file are:

column 15: mash_nearest_neighbor
column 16: mash_neighbor_distance
column 17: mash_neighbor_identification

The output 'mobtyper_results.txt' file looks something like this: 

sample_id  num_contigs  size  gc  md5  rep_type(s)  rep_type_accession(s)  relaxase_type(s)  relaxase_type_accession(s)  mpf_type  mpf_type_accession(s)  orit_type(s)  orit_accession(s)  predicted_mobility  mash_nearest_neighbor  mash_neighbor_distance  mash_neighbor_identification  primary_cluster_id  secondary_cluster_id  predicted_host_range_overall_rank  predicted_host_range_overall_name  observed_host_range_ncbi_rank  observed_host_range_ncbi_name  reported_host_range_lit_rank  reported_host_range_lit_name  associated_pmid(s)
CCBT0329:AA860  1  153481  0.5174686451848661  8c072d1914bfa50eb379d2673416d2b0  IncC  000092__CP025470  MOBH,MOBH  NC_012690_00071,NC_012885_00072  MPF_F  NC_023291_00077,NC_012885_00091,NC_016974_00085,NC_012885_00083,NC_014170_00023,NC_009140_00071,NC_012885_00167,NC_012885_00088  MOBH  JQ319772  conjugative  CP015394  0.000143503  Klebsiella pneumoniae  AA860  AJ278  phylum  Pseudomonadota  class  Gammaproteobacteria  phylum  Pseudomonadota  23800906; 20138094; 19482926; 24567731; 28842132; 20851899; 22290972; 19949054
CCBT0329:AC804  1  3981  0.46897764380808843  cab608a1a227ef9028aa1b8d80e819b9  rep_cluster_159  000964__AF052650  -  -  -  -  -  -  non-mobilizable  AF052650  0.00759618  Vibrio cholerae  AC804  AM145  genus  Vibrio  genus  Vibrio  -  -  -

In this example, two plasmids are predicted in the genome. The first one is an IncC plasmid of size 153 kb, and has its closest sequence match to NCBI accession CP015394, which is a Klebsiella pneumoniae plasmid. The second one is a small plasmid of about 4 kb, which has its closest sequence match to NCBI accession AF052650, which is a Vibrio cholerae plasmid. If you look up AF052650 on the NCBI website, you'll find it is V. cholerae plasmid pTLC.


 

Thursday, 5 February 2026

Using enadownloader to download fastqs from the ENA

I wanted to download fastq files for a long list of SRR accessions from the ENA today.

I realised I could use the enadownloader tool that I wrote about in a previous blogpost.

Here's how I used enadownloader to download the fastq files, on the Sanger compute farm:

First I checked which is the latest version of the enadownloader tool on the farm:

% module avail -t | grep -i ena 

Then I loaded the module:

% module load enadownloader/v2.3.5-4ac05c8f

Then I made a file of all the SRR accessions, called 'srr_accessions' like this:

SRR31024208
SRR31024304
SRR31024307

...

Then I made an output directory 'srr_accessions_fastqs' to put the fastqs in:

% mkdir srr_accessions_fastqs 

Then I used enadownloader to download the fastqs for all these accessions:

% enadownloader -t run -i srr_accessions -d -o srr_accessions_fastqs

where '-t run' means the type of data is sequence runs, '-i srr_accessions' means the input file is srr_accessions, '-d' means that I want to download the data, and '-o srr_accessions_fastqs' means the output directory is srr_accessions_fastqs.

Nice and easy!

 

 

Making assemblies for Oxford Nanopore sequence data using Dragonflye

I've been making genome assemblies for some Oxford Nanopore Technology (ONT) sequencing data using the Dragonflye package by Robert A. Petit III. 

It was super easy to run!

Here's how I ran it on the Sanger compute farm:

First I found the version of Dragonflye on the farm:

% module avail -t | grep -i dragon

Then I loaded it:

% module load dragonflye/1.2.1

Then I assembled sequence reads for a Vibrio cholerae isolate into an assembly using Dragonflye: 

% dragonflye --reads SRR31024125_1.fastq.gz --outdir SRR31024125_1.fastq_dragonflye --gsize 4000000

where  SRR31024125_1.fastq.gz was my input fastq file of ONT reads,

SRR31024125_1.fastq_dragonflye was the name that I wanted to give to the output directory,

 --gsize 4000000 specifies that the Vibrio cholerae genome is about 4.0 Mbase.

The output file was called SRR31024125_1.fastq_dragonflye/contigs.fa.  

It took about 20 minutes to make the assembly. The input file of ONT reads was about 93 Megabytes (SRR31024125_1.fastq.gz).

 

 

  

Monday, 17 November 2025

Making a map of the locations where bacterial isolates were collected

 I wanted to make a map of the locations in the world where some bacterial isolates were collected. I found a nice website that gave me some useful information on plotting maps in R.

Here's what I found worked for me.

First I got a map of the world: 

> library("ggplot2")
> theme_set(theme_bw())
> library("sf")
> library("rnaturalearth")
> library("rnaturalearthdata")
> world <- ne_countries(scale = "medium", returnclass = "sf")

Then I made a text file in which I had the number of bacterial isolates collected in each country, and the abbreviation for countries using the ISO_A3 three-letter codes:
Isolates    Country    "Country Code"
1    Bahrain    BHR
1    Ecuador    ECU
1    El Salvador    SLV
1    Gabon    GAB
1    Guatemala    GTM
...
I read in this file:
MyCountryCountData <- read.table("countrycount_data.txt",header=TRUE, sep="\t", stringsAsFactors=FALSE)
Then, for each country in the map data, I looked up its isolate count from this file (using 0 for countries not in the file):
numcountries <- length(world$iso_a3)
isolatecounts <- numeric()
for (i in 1:numcountries)
{
    mycountry <- world$iso_a3[i]
    print(paste("Calculating for country=",mycountry," row i=",i, "mylength=",length(isolatecounts)))
    myindex <- which(mycountry == MyCountryCountData$Country.Code)
    if (length(myindex) > 0)
    {
       myisolates <- MyCountryCountData$Isolates[myindex]
       isolatecounts <- append(isolatecounts, myisolates, after=length(isolatecounts))
    }
    else
    {
       isolatecounts <- append(isolatecounts, 0, after=length(isolatecounts))
    }
}
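The country-matching loop above boils down to a lookup with a default of 0; as a point of comparison, the same logic is a one-liner in Python with a dict (a sketch of just the matching step, not the map plotting; the example codes follow the table above):

```python
def isolate_counts(world_iso_a3, counts_by_code):
    """For each ISO_A3 code in the map data, look up its isolate count (0 if absent)."""
    return [counts_by_code.get(code, 0) for code in world_iso_a3]

# Example, mirroring a few rows of countrycount_data.txt
# ("XXX" stands in for a country with no isolates):
counts_by_code = {"BHR": 1, "ECU": 1, "SLV": 1, "GAB": 1, "GTM": 1}
print(isolate_counts(["BHR", "XXX", "GTM"], counts_by_code))   # [1, 0, 1]
```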
 
Then I made a plot showing a map of the world, with the countries coloured in according to their number of isolates:
 
Make a 300 dpi tiff file:
> tiff("countries_histogram.tiff", units="in", width=5, height=5, res=300)
> ggplot(data = world) + geom_sf(aes(fill = isolatecounts)) + scale_fill_viridis_c(option = "plasma", trans="sqrt")
> dev.off()
 
 The plot looks something like this: