Thursday 28 October 2021

Using Weblogo for sequence logos

A very nice tool for creating sequence logos is the Weblogo website.
You can paste in a multiple alignment like this:

AATGGAAGTGGAAAATCTGTTAGCA
TTATATTAGGAAAATCGTTATAGCA
ATTATGAGTGGAAAATCATGTAGCA
GAAATCAATTGATAGAATATGAGCA


and get back a sequence logo, lovely!

[sequence logo image generated by WebLogo from the alignment above]
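Incidentally, WebLogo can also be run locally: it is pip-installable as a Python package that provides a weblogo command. A minimal sketch (the flags below are from my memory of WebLogo 3, so do check weblogo --help):

% pip install weblogo
% weblogo -f alignment.fa -D fasta -F png -o logo.png

where alignment.fa would be a FASTA file containing the aligned sequences above.
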
Monday 25 October 2021

Using Ontologizer for GO enrichment analysis

I previously used the Ontologizer tool for GO enrichment analysis, but that was a few years ago (see Woods et al PMID:23675306). I remember it having a nice algorithm that takes the relationship between parent and child GO terms into account. I decided to use it again, this time to test for GO enrichment among a list of Schistosoma mansoni genes.  

Installing Ontologizer

I downloaded Ontologizer onto the Sanger compute farm using:
% wget http://ontologizer.de/cmdline/Ontologizer.jar 
This is the command-line version of Ontologizer.
You can see instructions for installing and running it here.

Note: I remember there was also a beautiful GUI for Ontologizer, with instructions for installing it here. Unfortunately, I wasn't able to figure out how to install it this time (I tried on my Mac laptop, and found it needs a file called swt.jar, which I wasn't able to find on the web; the links to it seem to be broken now). Oh well!

Preparing the inputs for Ontologizer 

First I downloaded the Gene ontology hierarchy file using:
% wget http://purl.obolibrary.org/obo/go/go-basic.obo
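As a quick sanity check that the download worked, the file should begin with a format-version header:
% head -2 go-basic.obo
This should print 'format-version: 1.2' followed by a data-version line giving the release date.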

Next I wanted to get the GO annotations for Schistosoma mansoni. There are no curated annotations for this species on the GO downloads page, so I used BioMart in WormBase ParaSite to get a set of predicted GO annotations:
- For the Query, I chose the genome to be Schistosoma mansoni
- I selected the Output attributes to be GO: GO term accession, GO term name, GO term definition, GO term evidence code, GO domain
- I clicked on 'Results' at the top of the webpage
- I clicked on 'Export as CSV' and got the file 'mart_export.txt', which I renamed as 'smansoni_biomart.txt'
 
For Ontologizer, the annotations need to be in a format called GAF format. I wrote a little perl script make_gaf.pl  to convert the file 'smansoni_biomart.txt' to GAF format:
% perl -w make_gaf.pl smansoni_biomart.txt  > smansoni.gaf
Here is my perl script make_gaf.pl:
 
#!/usr/local/bin/perl

# read in the GO data for S. mansoni from WormBase ParaSite Biomart:
$go_data = $ARGV[0]; # smansoni_biomart.txt

# Genome project,Gene stable ID,GO term accession,GO term name,GO term definition,GO term evidence code,GO domain
# schistosoma_mansoni_prjea36577,Smp_000020,,,,,
# schistosoma_mansoni_prjea36577,Smp_000030,GO:0000502,proteasome complex,"A large multisubunit complex which catalyzes protein degradation, found in eukaryotes, archaea and some bacteria. In eukaryotes, this complex consists of the
barrel shaped proteasome core complex and one or two associated proteins or complexes that act in regulating entry into or exit from the core.",IEA,cellular_component
# schistosoma_mansoni_prjea36577,Smp_000030,GO:0042176,regulation of protein catabolic process,"Any process that modulates the frequency, rate or extent of the chemical reactions and pathways resulting in the breakdown of a protein b
y the destruction of the native, active configuration, with or without the hydrolysis of peptide bonds.",IEA,biological_process
# schistosoma_mansoni_prjea36577,Smp_000030,GO:0030234,enzyme regulator activity,Binds to and modulates the activity of an enzyme.,IEA,molecular_function
# schistosoma_mansoni_prjea36577,Smp_000030,GO:0050790,regulation of catalytic activity,Any process that modulates the activity of an enzyme.,IEA,biological_process

print "!gaf-version: 2.1\n";
open(GO,"$go_data") || die "ERROR: cannot open $go_data\n";
while(<GO>)
{
    $line = $_;
    chomp $line;
    if (substr($line,0,3) eq 'sch' && $line =~ /GO:/)
    {
        @temp = split(/\,/,$line);
        $gene = $temp[1]; # e.g. Smp_000030
        $term = $temp[2]; # e.g. GO:0000502
        $evid = $temp[$#temp-1]; # e.g. IEA
        $type = $temp[$#temp]; # e.g. biological_process
        if    ($type eq 'biological_process') { $type = 'P';}
        elsif ($type eq 'molecular_function') { $type = 'F';}
        elsif ($type eq 'cellular_component') { $type = 'C';}
        # Note: S. mansoni has taxon_id 6183 in the NCBI Taxonomy: https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=6183
        # Column 1: DB                 :  required=1   e.g. UniProtKB or WormBase ParaSite
        # Column 2: DB Object ID       :  required=1   e.g. P12345
        # Column 3: DB Object Symbol   :  required=1   e.g. PHO3
        # Column 4: DB Object Qualifier:  optional     e.g. NOT
        # Column 5: GO ID              :  required=1   e.g. GO:0003993
        # Column 6: DB:Reference       :  required=1   e.g. PMID:2676709
        # Column 7: Evidence code      :  required=1   e.g. IMP
        # Column 8: With/From          :  optional     e.g. GO:0000346
        # Column 9: Aspect             :  required=1   e.g. F
        # Column10: DB Object name     :  optional     e.g. Toll-like receptor 4
        # Column11: DB Object synonym  :  optional     e.g. hToll
        # Column12: DB Object type     :  required=1   e.g. protein
        # Column13: Taxon              :  required=1   e.g. taxon:9606
        # Column14: Date               :  required=1   e.g. 20090118
        # Column15: Assigned By        :  required=1   e.g. SGD
        # Column16: Annotation extension: optional     e.g. part_of(CL:0000576)
        # Column17: Gene Product Form ID: optional     e.g. UniProtKB:P12345-2
        $gene_symbol = $gene."_symbol";
        $gene_alias = $gene."_alias";
        print "WB\t$gene\t$gene_symbol\tinvolved_in\t$term\tpubmed\t$evid\t\t$type\t\t$gene_alias\tgene\ttaxon:6183\t20211022\tWB\t\t\n";
    }
}
close(GO);

print STDERR "FINISHED\n";
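
For example, the GO:0000502 annotation for Smp_000030 shown in the comments above comes out as the following tab-separated GAF line (aspect 'C' for cellular_component):

WB	Smp_000030	Smp_000030_symbol	involved_in	GO:0000502	pubmed	IEA		C		Smp_000030_alias	gene	taxon:6183	20211022	WB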

The other things you need for Ontologizer are a list of your genes of interest, and a list of all genes (a background set). I had a list of all genes in the S. mansoni gene set as the background set (called 'all_schisto_v7'). The file with my list of genes of interest was called cluster0_genesB.
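
Both of these are just plain text files with one gene identifier per line, using the same IDs as in column 2 of the GAF file; the first few lines look something like this (illustrative IDs):

Smp_000020
Smp_000030
Smp_000040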

Running Ontologizer

I found that I was not able to run Ontologizer on the head node of the Sanger farm, so submitted it as a farm job. Here is the command I used:
% bsub -o cluster0_genes.o -e cluster0_genes.e -R "select[mem>1000] rusage[mem=1000]" -M1000 "/usr/bin/java -Xmx1G -jar Ontologizer.jar -g go-basic.obo -a smansoni.gaf -p all_schisto_v7 -s cluster0_genesB"
That is, the actual Ontologizer command was:
% java -jar Ontologizer.jar -g go-basic.obo -a smansoni.gaf -p all_schisto_v7 -s cluster0_genesB
 
This made an output file: table-cluster0_genesB-Parent-Child-Union-None.txt
The output file looks a bit like this:
 
ID      Pop.total       Pop.term        Study.total     Study.term      Pop.family      Study.family    nparents        is.trivial      p       p.adjusted      p.min   name
GO:0031224      10129   2072    1464    228     4152    290     2       false   1.797278481560433E-25   1.797278481560433E-25   0.0     "intrinsic component of membrane"
GO:0016020      10129   2345    1464    234     4152    290     1       false   2.0730394290778032E-19  2.0730394290778032E-19  0.0     "membrane"
GO:0017171      10129   60      1464    25      886     75      1       false   1.5850273023050352E-13  1.5850273023050352E-13  9.159281107768625E-95   "serine hydrolase activity"
GO:0006508      10129   346     1464    51      1247    82      1       false   1.471129482398609E-11   1.471129482398609E-11   6.0266E-319     "proteolysis"
GO:0008233      10129   250     1464    42      1321    89      2       false   2.535618270865147E-10   2.535618270865147E-10   1.6965153175154988E-277 "peptidase activity"
GO:0008289      10129   72      1464    16      3678    154     1       false   2.2267642693677877E-8   2.2267642693677877E-8   2.3266792066569023E-153 "lipid binding"
GO:0008236      10129   60      1464    25      250     42      2       false   4.6040476340954584E-8   4.6040476340954584E-8   2.491599802996095E-59   "serine-type peptidase activity"
GO:0004252      10129   52      1464    25      162     37      2       false   4.0367746142102156E-7   4.0367746142102156E-7   1.041762945213498E-43   "serine-type endopeptidase activity"
GO:1901564      10129   1438    1464    93      2479    118     2       false   9.21145381068937E-7     9.21145381068937E-7     0.0     "organonitrogen compound metabolic process"
...
 
By default, Ontologizer does not use any correction for multiple testing, so I wanted to correct for the fact that I was testing so many GO terms at once. I decided to use the Bonferroni correction:
% bsub -o cluster0_genes.o -e cluster0_genes.e -R "select[mem>1000] rusage[mem=1000]" -M1000 "/usr/bin/java -Xmx1G -jar Ontologizer.jar -g go-basic.obo -a smansoni.gaf -p all_schisto_v7 -s cluster0_genesB -m Bonferroni"
 
% head table-cluster0_genesB-Parent-Child-Union-Bonferroni.txt
ID      Pop.total       Pop.term        Study.total     Study.term      Pop.family      Study.family    nparents        is.trivial      p       p.adjusted      p.min   name
GO:0031224      10129   2072    1464    228     4152    290     2       false   1.797278481560433E-25   1.7595356334476639E-22  0.0     "intrinsic component of membrane"
GO:0016020      10129   2345    1464    234     4152    290     1       false   2.0730394290778032E-19  2.0295056010671693E-16  0.0     "membrane"
GO:0017171      10129   60      1464    25      886     75      1       false   1.5850273023050352E-13  1.5517417289566294E-10  9.159281107768625E-95   "serine hydrolase activity"
GO:0006508      10129   346     1464    51      1247    82      1       false   1.471129482398609E-11   1.4402357632682383E-8   6.0266E-319     "proteolysis"
GO:0008233      10129   250     1464    42      1321    89      2       false   2.535618270865147E-10   2.482370287176979E-7    1.6965153175154988E-277 "peptidase activity"
GO:0008289      10129   72      1464    16      3678    154     1       false   2.2267642693677877E-8   2.180002219711064E-5    2.3266792066569023E-153 "lipid binding"
GO:0008236      10129   60      1464    25      250     42      2       false   4.6040476340954584E-8   4.5073626337794536E-5   2.491599802996095E-59   "serine-type peptidase activity"
GO:0004252      10129   52      1464    25      162     37      2       false   4.0367746142102156E-7   3.952002347311801E-4    1.041762945213498E-43   "serine-type endopeptidase activity"
GO:1901564      10129   1438    1464    93      2479    118     2       false   9.21145381068937E-7     9.018013280664893E-4    0.0     "organonitrogen compound metabolic process"
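
To pull out just the significantly enriched terms from this table, a quick pandas snippet works; a minimal sketch, assuming the tab-separated layout shown above and a cutoff of 0.05:

import pandas as pd

# read Ontologizer's tab-separated results table:
results = pd.read_csv("table-cluster0_genesB-Parent-Child-Union-Bonferroni.txt", sep="\t")

# keep GO terms whose Bonferroni-adjusted p-value is below 0.05, most significant first:
significant = results[results["p.adjusted"] < 0.05].sort_values("p.adjusted")
print(significant[["ID", "name", "p.adjusted"]])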
 
More information on Ontologizer
It is also possible to produce some graphical output from the command-line version of Ontologizer; you can see details of that here.

Tuesday 19 October 2021

Read data from an Excel file into Python using pandas

The Python pandas package can be used to read data from an Excel file into Python.

For example, I had an Excel file SimpleData.xlsx with three columns (showing the first few rows below):

[screenshot of SimpleData.xlsx: columns Cmpd, MW and LogP]
To read it into Python using pandas, I first installed Pandas using Anaconda (which I had already installed on my computer, a Mac laptop):
% conda install -c conda-forge pandas
I also found that I needed a package called openpyxl to be able to read Excel using pandas: 
% conda install -c conda-forge openpyxl
Then I opened Python using:
% python3
and within the Python prompt typed:
>>> import pandas as pd
Now make a dataframe in pandas:
>>> mydata = pd.read_excel("SimpleData.xlsx")
Now print out the dataframe 'mydata':
>>> mydata
   Cmpd       MW  LogP
0    C1  277.330  3.29
1    C2  374.521  3.60
2    C3  357.360  3.56
3    C4  509.040  5.48
4    C5  424.480  3.03
..  ...      ...   ...
76  C77  954.660  0.00
77  C78  348.358  2.08
78  C79  501.070  3.65
79  C80  470.461  3.63
80  C81  302.780  4.91

[81 rows x 3 columns]
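
A couple of read_excel extras are often handy too; a minimal sketch against the same file (the column name 'MW' comes from the file above):

>>> mydata = pd.read_excel("SimpleData.xlsx", sheet_name=0)  # sheet_name picks which sheet to read (0 = first)
>>> mydata.head()        # show the first five rows
>>> mydata["MW"].mean()  # mean of the MW column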

Hurray!

Thursday 16 September 2021

A Python script to submit lots of jobs to the farm

I often need to split up an input file because it's huge, and submit lots of jobs to the Sanger compute farm on all the little chunks. 

 For example, I had a file of BLAT results, and wanted to run a script on these results, but the file was too big. 

Anyway, the BLAT file was enormous, so I split it up into smaller files of 10,000 lines each, using:

% split -l 10000 enormous_blat.txt fblat 

This made files fblataa, fblatab... (47 files) 

I wanted to run my script (called 'strip_off_adaptors.py') on each of these small chunks, i.e.:

% python3 strip_off_adaptors.py fblataa 

% python3 strip_off_adaptors.py fblatab

etc. 

But that was going to take me ages to submit 47 jobs to the farm, typing all those 'bsub' commands. Well at least 10 minutes! 

 So I decided to write a Python script to submit the jobs (see my script below). 

It takes a file with a list of the fblat* files as its input. 
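
(I made the list file itself with a quick shell one-liner, something like % ls fblat?? > fblat_file_list.)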

Then it makes a subdirectory for each fblat* file, e.g. fblataadir for fblataa.

Then it submits the job for fblataa in the directory fblataadir. And so on, for fblatab, fblatac, etc. 

 It can be run using: 

% python3 submit_water_jobs.py fblat_file_list lib_all_R1_001.fa linker.fa 

(where lib_all_R1_001.fa and linker.fa are just some other input files required by my script 'strip_off_adaptors.py'.) 

Easy-peasy! 

 

Here's my script submit_water_jobs.py; you can alter it to submit jobs for lots of chunks of any type of file to a compute farm using bsub:

 

import os
import sys

#====================================================================#

def read_input_file_list(input_file):
    """read in the input file with the list of input BLAT files"""
   
    # define a list to contain the names of the input BLAT files:
    input_file_list = list()
  
    # read in the input file:
    fileObj = open(input_file, "r")
    for line in fileObj:
        line = line.rstrip()
        temp = line.split()
        input_file_name = temp[0]
        input_file_list.append(input_file_name)
    fileObj.close()

    return input_file_list     

#====================================================================#

def main():

    # check the command-line arguments:         
    if len(sys.argv) != 4 or os.path.exists(sys.argv[1]) == False or os.path.exists(sys.argv[2]) == False or os.path.exists(sys.argv[3]) == False:
        print("Usage: %s input_list_file input_reads_fasta input_linker_fasta" % sys.argv[0])
        sys.exit(1)
    input_file = sys.argv[1] # input file with list of input BLAT files  
    input_reads_fasta = sys.argv[2] # input fasta file of reads
    input_linker_fasta = sys.argv[3] # input fasta file with the linker sequence

    # read in the input file with list of input BLAT files
    input_file_list = read_input_file_list(input_file)

    # get the current directory:
    current_dir = os.getcwd()

    # for each input BLAT file, submit the 'water' job:
 
    for blat_file in input_file_list:
        # make a directory for running this job
        newdir = '%sdir' % blat_file # e.g. fblataadir
        newdir2 = os.path.join(current_dir,newdir)
        os.mkdir(newdir2)
        os.chdir(newdir2)
        # make a soft-link to the input BLAT file:
        blat_file2 = os.path.join(current_dir,blat_file)
        blat_file3 = os.path.join(newdir2,blat_file)
        command0 = "ln -s %s %s" % (blat_file2, blat_file3) # blat_file3 is in the new directory
        os.system(command0) 

        # make a soft-link to the input fasta file of reads:
        input_reads_fasta2 = os.path.join(current_dir,input_reads_fasta)
        input_reads_fasta3 = os.path.join(newdir2, input_reads_fasta)
        command1 = "ln -s %s %s" % (input_reads_fasta2, input_reads_fasta3) # input_reads_fasta3 is in the new directory
        os.system(command1)
        # make a soft-link to the input file with the linker sequence:
        input_linker_fasta2 = os.path.join(current_dir, input_linker_fasta)
        input_linker_fasta3 = os.path.join(newdir2, input_linker_fasta)
        command2 = "ln -s %s %s" % (input_linker_fasta2, input_linker_fasta3) # input_linker_fasta3 is in the new directory
        os.system(command2)
        # define the name of the output file:
        output_file = "%s2" % blat_file3 # output_file is in the new directory
        # submit the job to run 'water' between the reads and the linker:
        command3 = "python3 ~alc/Documents/git/Python/strip_off_adaptors.py %s %s %s %s 0.5" % (blat_file3, input_reads_fasta3, input_linker_fasta3, output_file)
        # specify the bsub output and error file names:
        bsub_out = "myscript.o"
        bsub_err = "myscript.e"
        bsub_out2 = os.path.join(newdir2,bsub_out) # bsub_out2 is in the new directory
        bsub_err2 = os.path.join(newdir2,bsub_err) # bsub_err2 is in the new directory
        # submit farm job:
        jobname = "%s" % blat_file
        # request 5000 Mbyte of RAM for the job:
        command4 = 'bsub -o %s -e %s -R "select[mem>5000] rusage[mem=5000]" -M5000 -J%s "%s"' % (bsub_out2, bsub_err2, jobname, command3)
        print(command4)
        os.system(command4)
        os.chdir(current_dir)

#====================================================================#

if __name__=="__main__":
    main()

#====================================================================#
 

 

Tuesday 22 June 2021

Retrieving the SMILES for a list of ChEMBL identifiers

Today I needed to retrieve the SMILES for a list of ChEMBL identifiers.

I had to refresh my memory on how to retrieve data from ChEMBL using their web services.
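
As a quick reminder of what the ChEMBL 'molecule' web service returns, you can fetch a single record directly; a minimal sketch (CHEMBL608855 is taken from my list below, and molecule_structures/canonical_smiles are the same fields my script extracts):

>>> import requests
>>> record = requests.get("https://www.ebi.ac.uk/chembl/api/data/molecule/CHEMBL608855.json").json()
>>> record["molecule_structures"]["canonical_smiles"]
'CC(C)(C)c1ccc(C2CC3=Nc4ccccc4N(C(=O)c4ccccc4Cl)C(c4ccc(F)cc4)C3=C(O)C2)cc1'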

I wrote a little Python script (see below) that takes a file with a list of ChEMBL ids as input, e.g.:

CHEMBL608855
CHEMBL609156
CHEMBL592105
CHEMBL592123
CHEMBL592125
CHEMBL592332
CHEMBL592344
CHEMBL1197993
CHEMBL596643
CHEMBL596852

To run it you can type e.g.:

% python3 retrieve_smiles_from_chembl_for_compoundlist.py input_list output_file

It then makes an output file with the SMILES for those ids (see below). Note that the rows do not come back in the same order as the input list, and in this example CHEMBL1197993 got no row in the output:

molecule_chembl_id      canonical_smiles
CHEMBL596643    O=c1nc(C=Cc2ccc(Cl)cc2)oc2ccccc12
CHEMBL596852    COc1ccc(-c2nc3cc(Cc4ccc5[nH]c(-c6ccc(OC)cc6)nc5c4)ccc3[nH]2)cc1
CHEMBL608855    CC(C)(C)c1ccc(C2CC3=Nc4ccccc4N(C(=O)c4ccccc4Cl)C(c4ccc(F)cc4)C3=C(O)C2)cc1
CHEMBL609156    CCOC(=O)c1c[nH]c2c(CC)cccc2c1=O
CHEMBL592105    CN(C)c1ccc(C(O)(c2ccc(N(C)C)cc2)c2ccc(N(C)C)cc2)cc1
CHEMBL592344    CCOc1ccccc1CNC(=O)C1c2ccccc2C(=O)N(CC(C)C)C1c1cccs1
CHEMBL592332    CCOc1ccc2c(c1)CN(Cc1ccc(Cl)cc1)CO2
CHEMBL592123    CCCOCc1cc(CN2CCN(c3cccc(Cl)c3)CC2)c(O)c2ncccc12
CHEMBL592125    O=C(Cc1ccccc1)NC(c1ccc(Cl)cc1)c1c(O)ccc2ccccc12

My Python script

import os
import sys
import pandas as pd # uses pandas python module to view and analyse data
import requests # this is used to access json files

#====================================================================#

# call the 'molecule' API to find the molecular properties of our list of compounds:

def find_properties_of_compounds(cmpd_chembl_ids):

    #For the identified compounds, extract their molecular properties and other information from the 'molecule' ChEMBL API
    #Specify the input parameters:
    cmpd_chembl_ids = ",".join(cmpd_chembl_ids) #Join the compound IDs into one comma-separated string, as the API call expects
    limit = 100 #Limit the number of records pulled back for each url call

    # Set up the call to the ChEMBL 'molecule' API
    # Remember that there is a limit to the number of records returned in any one API call (default is 20 records, maximum is 1000 records)
    # So need to iterate over several pages of records to gather all relevant information together!
    url_stem = "https://www.ebi.ac.uk" #This is the stem of the url
    url_full_string = url_stem + "/chembl/api/data/molecule.json?molecule_chembl_id__in={}&limit={}".format(cmpd_chembl_ids, limit) #This is the full url with the specified input parameters
    url_full = requests.get( url_full_string ).json() #This calls the information back from the API using the 'requests' module, and converts it to json format
    url_molecules = url_full['molecules'] #This is a list of the results for activities

    # This 'while' loop iterates over several pages of records (if required), and collates the list of results
    while url_full['page_meta']['next']:
        url_full = requests.get(url_stem + url_full['page_meta']['next']).json()
        url_molecules = url_molecules + url_full['molecules'] #Add result (as a list) to previous list of results

    #Convert the list of results into a Pandas dataframe:
    mol_df = pd.DataFrame(url_molecules)

    #Print out some useful information:
    #print("This is the url string that calls the 'Molecule' API with the specified query\n{}".format(url_full_string) )
    #print("\nThese are the available columns for the Molecule API:\n{}".format(mol_df.columns))

    # Select only relevant columns:
    mol_df = mol_df[[ 'molecule_chembl_id','molecule_structures']]

    # And convert cells containing a dictionary to individual columns in the dataframe so that it is easier to filter!
    # Molecule hierarchy:
    # mol_df['parent_chembl_id'] = mol_df['molecule_hierarchy'].apply(lambda x: x['parent_chembl_id'])
    # Note that the above line gives an error message for some compounds, e.g. CHEMBL1088885, that do not seem to have a parent stored, so it is left commented out; the raw molecule_hierarchy column would still be returned by the API anyway.

    #Physicochemical properties (only report if cells are not null)
    mol_df['canonical_smiles'] = mol_df.loc[ mol_df['molecule_structures'].notnull(), 'molecule_structures'].apply(lambda x: x['canonical_smiles'])
    mol_df = mol_df[[ 'molecule_chembl_id', 'canonical_smiles']]

    return mol_df

#====================================================================#

def read_input_list_of_compounds(input_compoundlist_file, output_file):

    cnt = 0
    # open the output file:
    with open(output_file, 'w') as f:

        # read in the list of compounds:
        compounds = list() # create an empty list to store the compounds in
        inputfileObj = open(input_compoundlist_file, "r")
        compound_set_count = 0 # we will retrieve data for 10 compounds at a time
        for line in inputfileObj:
            line = line.rstrip()
            temp = line.split()
            # CHEMBL10
            compound = temp[0] # e.g. CHEMBL10  
            cnt += 1
            compounds.append(compound)  
            # if the list of compounds has 10 compounds, find the compound info. for these compounds:       
            if len(compounds) == 10:
                compound_set_count += 1
                # using a list of known compounds, find compound info. for those compounds:       
                print(cnt,"Finding compound info. for compounds",compounds)
                mol_df = find_properties_of_compounds(compounds)

                #Export the data frame to a csv file:
        #Followed examples from https://stackoverflow.com/questions/37357727/pandas-write-tab-separated-dataframe-with-literal-tabs-with-no-quotes
                # and https://datatofish.com/export-dataframe-to-csv and https://stackoverflow.com/questions/17530542/how-to-add-pandas-data-to-an-existing-csv-file
                if compound_set_count == 1:
                    mol_df.to_csv(f, sep="\t", index=None, header=True) # only write a header for the first set of 10 compounds
                else:
                    mol_df.to_csv(f, sep="\t", index=None, header=False)
                # empty the list of compounds:
                compounds.clear() # from https://www.geeksforgeeks.org/different-ways-to-clear-a-list-in-python/
        inputfileObj.close()
        # if there are some compounds left in the compound list, find their properties:
        if len(compounds) > 0:
            # find the compound info. for these compounds:
            print(cnt,"Finding compound info. for compounds",compounds)
            mol_df = find_properties_of_compounds(compounds)
            mol_df.to_csv(f, sep="\t", index=None, header=False)

#====================================================================#

def main():

    # check the command-line arguments:
    if len(sys.argv) != 3 or os.path.exists(sys.argv[1]) == False:
        print("Usage: %s input_compoundlist_file output_file" % sys.argv[0])
        sys.exit(1)
    input_compoundlist_file = sys.argv[1] # input file with a list of ChEMBL compounds of interest
    output_file = sys.argv[2]

    # read in the input list of compounds of interest:
    print("Reading in compound list...")
    read_input_list_of_compounds(input_compoundlist_file, output_file)

    print("FINISHED\n")

#====================================================================#

if __name__=="__main__":
    main()

#====================================================================#


Friday 28 May 2021

Choosing distinct random colours using R

 I wanted to choose 110 distinct random colours for a plot in R. I found I could do this using the randomcoloR package:

> install.packages("randomcoloR")

> library(randomcoloR)

> colours <- distinctColorPalette(k=110)
> colours

  [1] "#C0AAEE" "#D14FA7" "#5E4B73" "#46A0C6" "#BCEF7D" "#E1A6ED" "#B4F5A2" "#C8EABE" "#D492EE" "#4560D7" "#F0F0E3" "#457172" "#6D9BCA" "#C46AA6" "#ECF030" "#E2EFD0" "#EB2F85" "#8FF0EA" "#83C7F1" "#B3A4A9" "#D86C40"
 [22] "#45ECEB" "#9BAF69" "#9B7EEB" "#93EBB6" "#E99F79" "#BC24EA" "#BBA7C1" "#C6CB95" "#F33BE7" "#6B25AA" "#F5A2E1" "#A7C6A1" "#DBA497" "#BAEDF2" "#7B5FF2" "#6C3283" "#A8A3CF" "#BA465C" "#BAF43C" "#D1E9E9" "#77CEC3"
 [43] "#70769C" "#939DEA" "#E2B8E2" "#91EF77" "#D14DD9" "#9FAD9E" "#B68851" "#E236B8" "#8BD33B" "#78D7ED" "#F5B1D3" "#F1F0B6" "#50ED85" "#2C4B24" "#5BA8F1" "#65F239" "#ED3E52" "#52E059" "#EE6F96" "#62EED2" "#CAAEA1"
 [64] "#EFC5BE" "#D6EFA0" "#E27666" "#E785AF" "#A57EC4" "#966C5B" "#CBCDB3" "#B781AD" "#F0C068" "#F09935" "#B5CDE9" "#D4C874" "#91496E" "#EA79EF" "#7BA32F" "#869175" "#EEC896" "#BB67D5" "#B9EADA" "#C9C6C7" "#B78490"
 [85] "#C9D87A" "#91B5BB" "#F0C843" "#DEDCF1" "#55EDB4" "#5580D7" "#EFA3AB" "#4FB0B9" "#ADB9F0" "#E2EC5C" "#B09836" "#5631E9" "#EA7FCF" "#96CE8F" "#6CC161" "#D8CAF5" "#4BA784" "#50C284" "#EDE2E3" "#F0EC80" "#E6878A"
[106] "#B49D78" "#A5F1D1" "#A44FEF" "#C2C52C" "#F1CDE0"


Now make a plot with these colours in it:

> barplot(1:110, col=colours)

[barplot of 1:110, with each bar drawn in one of the 110 random colours]

Thursday 27 May 2021

Making a riverplot to show overlaps between two clusterings

 I had created two different clusterings, and my colleague Adam Reid suggested that I create a 'riverplot' (see here for a nice description) to show the overlaps between the clusters in clustering RUN1, and clustering RUN2 (made from two different runs of the same clustering software, with slightly different inputs).

To do this, I used the riverplot R package.  

For my clusterings RUN1 and RUN2, I had found overlaps between the clusters in set RUN1, and the clusters in set RUN2, as follows, where (x, y) gives the number of overlaps between a cluster x in set RUN1 and a cluster y in set RUN2:

- pair (5, 4) :  15005
- pair (6, 5) :  5923
- pair (4, 4) :  4118
- pair (0, 3) :  9591
- pair (4, 5) :  3290
- pair (5, 5) :  17
- pair (1, 0) :  13890
- pair (3, 2) :  4131
- pair (2, 3) :  504
- pair (2, 1) :  16480
- pair (0, 0) :  1
- pair (0, 1) :  4
- pair (1, 2) :  62
- pair (4, 0) :  6
- pair (3, 3) :  135
- pair (2, 4) :  113
- pair (3, 1) :  43
- pair (1, 1) :  17
- pair (3, 4) :  64
- pair (1, 4) :  6
- pair (4, 3) :  148
- pair (0, 5) :  38
- pair (1, 3) :  16
- pair (2, 2) :  12
- pair (0, 2) :  2
- pair (5, 3) :  15
- pair (6, 4) :  40
- pair (0, 4) :  14
- pair (6, 3) :  3
- pair (1, 5) :  2
- pair (4, 1) :  5
- pair (4, 2) :  1
- pair (2, 0) :  2
- pair (5, 0) :  2

You could, I guess, show this as a weighted graph, i.e. with the RUN1 clusters as nodes on the left and the RUN2 clusters as nodes on the right, and edges between them, each labelled with its weight.

 Another nice way is a riverplot. I made this using the riverplot package as follows:

> install.packages("riverplot")
> library("riverplot")
> nodes <- c("RUN1_0", "RUN1_1", "RUN1_2", "RUN1_3", "RUN1_4", "RUN1_5", "RUN1_6", "RUN2_0", "RUN2_1", "RUN2_2", "RUN2_3", "RUN2_4", "RUN2_5")
> edges <- list( RUN1_0 = list( RUN2_0=1, RUN2_1=4, RUN2_2=2, RUN2_3=9591, RUN2_4=14, RUN2_5=38),
+ RUN1_1 = list( RUN2_0=13890, RUN2_1=17, RUN2_2=62, RUN2_3=16, RUN2_4=6, RUN2_5=2),
+ RUN1_2 = list( RUN2_0=2, RUN2_1=16480, RUN2_2=12, RUN2_3=504, RUN2_4=113, RUN2_5=0),
+ RUN1_3 = list( RUN2_0=0, RUN2_1=43, RUN2_2=4131, RUN2_3=135, RUN2_4=64, RUN2_5=0),
+ RUN1_4 = list( RUN2_0=6, RUN2_1=5, RUN2_2=1, RUN2_3=148, RUN2_4=4118, RUN2_5=3290),
+ RUN1_5 = list( RUN2_0=2, RUN2_1=0, RUN2_2=0, RUN2_3=15, RUN2_4=15005, RUN2_5=17),
+ RUN1_6 = list( RUN2_0=0, RUN2_1=0, RUN2_2=0, RUN2_3=3, RUN2_4=40, RUN2_5=5923))
> r <- makeRiver( nodes, edges, node_xpos = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2), node_labels=c(RUN1_0 = "0", RUN1_1 = "1", RUN1_2 = "2", RUN1_3 = "3", RUN1_4 = "4", RUN1_5 = "5", RUN1_6 = "6", RUN2_0 = "0", RUN2_1 = "1", RUN2_2 = "2", RUN2_3 = "3", RUN2_4 = "4", RUN2_5= "5"), node_styles= list(RUN1_0 = list(col="yellow"), RUN1_1 = list(col="orange"), RUN1_2=list(col="red"), RUN1_3=list(col="green"), RUN1_4=list(col="blue"), RUN1_5=list(col="pink"), RUN1_6=list(col="purple")))
> plot(r)

[riverplot showing the flows between the RUN1 clusters (left) and the RUN2 clusters (right)]

Here we see cluster 0 in RUN1 mostly corresponds to cluster 3 in RUN2.

Cluster 1 in RUN1 mostly corresponds to cluster 0 in RUN2.

Cluster 2 in RUN1 mostly corresponds to cluster 1 in RUN2.

Cluster 3 in RUN1 mostly corresponds to cluster 2 in RUN2.

Clusters 4,5,6 in RUN1 correspond to clusters 4 and 5 in RUN2: cluster 4 in RUN2 maps to clusters 5 and 4 in RUN1, and cluster 5 in RUN2 maps to clusters 6 and 4 in RUN1.

Note that in some cases you might have a lot of clusters; you can then reduce the label size for the clusters as described here, i.e.:

> custom.style <- riverplot::default.style()
> custom.style$textcex <- 0.1
> plot(r, default_style=custom.style)


Acknowledgements

Thanks to Adam Reid for introducing me to riverplots!