Thursday, 16 September 2021

A Python script to submit lots of jobs to the farm

I often need to split up an input file because it's huge, and submit lots of jobs to the Sanger compute farm on all the little chunks. 

For example, I had a file of BLAT results and wanted to run a script on them, but the file was too big to process in one go. 

Anyway, the BLAT file was enormous, so I split it up into smaller files of 10,000 lines each, using:

% split -l 10000 enormous_blat.txt fblat 

This made files fblataa, fblatab... (47 files) 
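(As an aside, if you wanted to do the chunking in Python rather than with `split`, something like the sketch below would work; the `fblat` prefix and 10,000-line chunk size match the example above, but the zero-padded numeric suffixes are just my choice, not what `split` produces.)

```python
import itertools

def split_file(path, lines_per_chunk=10000, prefix="fblat"):
    """Split a big file into chunks of at most lines_per_chunk lines,
    named e.g. fblat000, fblat001, ... Returns the chunk filenames."""
    chunk_names = []
    with open(path) as infile:
        # repeatedly slice off the next lines_per_chunk lines until EOF:
        for i, chunk in enumerate(iter(lambda: list(itertools.islice(infile, lines_per_chunk)), [])):
            name = "%s%03d" % (prefix, i)
            with open(name, "w") as out:
                out.writelines(chunk)
            chunk_names.append(name)
    return chunk_names
```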

I then wanted to run my script (which is called '') on each of these small chunks, i.e.:

% python3 fblataa

% python3 fblatab


But submitting 47 jobs to the farm by typing out all those 'bsub' commands was going to take me ages. Well, at least 10 minutes! 

 So I decided to write a Python script to submit the jobs (see my script below). 

It takes a file with a list of the fblat* files as its input. 

Then it makes a subdirectory for each fblat* file, e.g. fblataadir for fblataa. 

Then it submits the job for fblataa in the directory fblataadir. And so on, for fblatab, fblatac, etc. 

 It can be run using: 

% python3 fblat_file_list lib_all_R1_001.fa linker.fa 

(where lib_all_R1_001.fa and linker.fa are just some other input files required by my script ''.) 



Here's my script, you can alter it to submit jobs for lots of chunks of any type of file to a compute farm using bsub: 


import os
import sys


def read_input_file_list(input_file):
    """read in the input file with the list of input BLAT files"""
    # define a list to contain the names of the input BLAT files:
    input_file_list = list()
    # read in the input file:
    with open(input_file, "r") as fileObj:
        for line in fileObj:
            line = line.rstrip()
            temp = line.split()
            input_file_name = temp[0]
            input_file_list.append(input_file_name)

    return input_file_list


def main():

    # check the command-line arguments:         
    if len(sys.argv) != 4 or not os.path.exists(sys.argv[1]) or not os.path.exists(sys.argv[2]) or not os.path.exists(sys.argv[3]):
        print("Usage: %s input_list_file input_reads_fasta input_linker_fasta" % sys.argv[0])
        sys.exit(1)
    input_file = sys.argv[1] # input file with list of input BLAT files  
    input_reads_fasta = sys.argv[2] # input fasta file of reads
    input_linker_fasta = sys.argv[3] # input fasta file with the linker sequence

    # read in the input file with list of input BLAT files
    input_file_list = read_input_file_list(input_file)

    # get the current directory:
    current_dir = os.getcwd()

    # for each input BLAT file, submit the 'water' job:
    for blat_file in input_file_list:
        # make a directory for running this job
        newdir = '%sdir' % blat_file # e.g. fblataadir
        newdir2 = os.path.join(current_dir, newdir)
        os.mkdir(newdir2)
        # make a soft-link to the input BLAT file:
        blat_file2 = os.path.join(current_dir, blat_file)
        blat_file3 = os.path.join(newdir2, blat_file)
        command0 = "ln -s %s %s" % (blat_file2, blat_file3) # blat_file3 is in the new directory
        os.system(command0)

        # make a soft-link to the input fasta file of reads:
        input_reads_fasta2 = os.path.join(current_dir, input_reads_fasta)
        input_reads_fasta3 = os.path.join(newdir2, input_reads_fasta)
        command1 = "ln -s %s %s" % (input_reads_fasta2, input_reads_fasta3) # input_reads_fasta3 is in the new directory
        os.system(command1)
        # make a soft-link to the input file with the linker sequence:
        input_linker_fasta2 = os.path.join(current_dir, input_linker_fasta)
        input_linker_fasta3 = os.path.join(newdir2, input_linker_fasta)
        command2 = "ln -s %s %s" % (input_linker_fasta2, input_linker_fasta3) # input_linker_fasta3 is in the new directory
        os.system(command2)
        # define the name of the output file:
        output_file = "%s2" % blat_file3 # output_file is in the new directory
        # the command to run 'water' between the reads and the linker:
        command3 = "python3 ~alc/Documents/git/Python/ %s %s %s %s 0.5" % (blat_file3, input_reads_fasta3, input_linker_fasta3, output_file)
        # specify the bsub output and error file names:
        bsub_out = "myscript.o"
        bsub_err = "myscript.e"
        bsub_out2 = os.path.join(newdir2, bsub_out) # bsub_out2 is in the new directory
        bsub_err2 = os.path.join(newdir2, bsub_err) # bsub_err2 is in the new directory
        # submit the farm job:
        jobname = "%s" % blat_file
        # request 5000 Mbyte of RAM for the job:
        command4 = 'bsub -o %s -e %s -R "select[mem>5000] rusage[mem=5000]" -M5000 -J%s "%s"' % (bsub_out2, bsub_err2, jobname, command3)
        os.system(command4)


if __name__ == "__main__":
    main()


Tuesday, 22 June 2021

Retrieving the SMILES for a list of ChEMBL identifiers

Today I needed to retrieve the SMILES for a list of ChEMBL identifiers.

I had to refresh my memory on how to retrieve data from ChEMBL using their web interface.

I wrote a little Python script (see below) that takes a file with a list of ChEMBL ids as input, one id per line (e.g. CHEMBL10).


To run it you can type e.g.:

% python3 input_list output_file

It then makes an output file with the SMILES for those ids (see below), e.g.

molecule_chembl_id      canonical_smiles
CHEMBL596643    O=c1nc(C=Cc2ccc(Cl)cc2)oc2ccccc12
CHEMBL596852    COc1ccc(-c2nc3cc(Cc4ccc5[nH]c(-c6ccc(OC)cc6)nc5c4)ccc3[nH]2)cc1
CHEMBL608855    CC(C)(C)c1ccc(C2CC3=Nc4ccccc4N(C(=O)c4ccccc4Cl)C(c4ccc(F)cc4)C3=C(O)C2)cc1
CHEMBL609156    CCOC(=O)c1c[nH]c2c(CC)cccc2c1=O
CHEMBL592105    CN(C)c1ccc(C(O)(c2ccc(N(C)C)cc2)c2ccc(N(C)C)cc2)cc1
CHEMBL592344    CCOc1ccccc1CNC(=O)C1c2ccccc2C(=O)N(CC(C)C)C1c1cccs1
CHEMBL592332    CCOc1ccc2c(c1)CN(Cc1ccc(Cl)cc1)CO2
CHEMBL592123    CCCOCc1cc(CN2CCN(c3cccc(Cl)c3)CC2)c(O)c2ncccc12
CHEMBL592125    O=C(Cc1ccccc1)NC(c1ccc(Cl)cc1)c1c(O)ccc2ccccc12
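
Since the output is tab-separated, it is easy to read back into pandas for further analysis. A minimal sketch (the filename smiles.tsv and the helper name load_smiles are just illustrative):

```python
import pandas as pd

def load_smiles(path):
    """Read the tab-separated output file back in and return a
    dictionary mapping ChEMBL id -> canonical SMILES."""
    df = pd.read_csv(path, sep="\t")
    return dict(zip(df["molecule_chembl_id"], df["canonical_smiles"]))
```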

My Python script

import os
import sys
import pandas as pd # uses pandas python module to view and analyse data
import requests # this is used to access json files


# call the 'molecule' API to find the molecular properties of our list of compounds:

def find_properties_of_compounds(cmpd_chembl_ids):

    #For the identified compounds, extract their molecular properties and other information from the 'molecule' ChEMBL API
    #Specify the input parameters:
    cmpd_chembl_ids = ",".join(cmpd_chembl_ids[0:]) #Amend the format of the text string of compounds so that it is suitable for the API call
    limit = 100 #Limit the number of records pulled back for each url call

    # Set up the call to the ChEMBL 'molecule' API
    # Remember that there is a limit to the number of records returned in any one API call (default is 20 records, maximum is 1000 records)
    # So need to iterate over several pages of records to gather all relevant information together!
    url_stem = "https://www.ebi.ac.uk" #This is the stem of the url (the base of the ChEMBL web services)
    url_full_string = url_stem + "/chembl/api/data/molecule.json?molecule_chembl_id__in={}&limit={}".format(cmpd_chembl_ids, limit) #This is the full url with the specified input parameters
    url_full = requests.get( url_full_string ).json() #This calls the information back from the API using the 'requests' module, and converts it to json format
    url_molecules = url_full['molecules'] #This is a list of the results for activities

    # This 'while' loop iterates over several pages of records (if required), and collates the list of results
    while url_full['page_meta']['next']:
        url_full = requests.get(url_stem + url_full['page_meta']['next']).json()
        url_molecules = url_molecules + url_full['molecules'] #Add result (as a list) to previous list of results

    #Convert the list of results into a Pandas dataframe:
    mol_df = pd.DataFrame(url_molecules)

    #Print out some useful information:
    #print("This is the url string that calls the 'Molecule' API with the specified query\n{}".format(url_full_string) )
    #print("\nThese are the available columns for the Molecule API:\n{}".format(mol_df.columns))

    # Select only relevant columns:
    mol_df = mol_df[[ 'molecule_chembl_id','molecule_structures']]

    # And convert cells containing a dictionary to individual columns in the dataframe so that is it easier to filter!
    # Molecule hierarchy:
    # mol_df['parent_chembl_id'] = mol_df['molecule_hierarchy'].apply(lambda x: x['parent_chembl_id'])
    # Note that the above line gives an error message for some compounds e.g. CHEMBL1088885 that do not seem to have parent stored. However it should get printed anyway with molecule_hierarchy.

    #Physicochemical properties (only report if cells are not null)
    mol_df['canonical_smiles'] = mol_df.loc[ mol_df['molecule_structures'].notnull(), 'molecule_structures'].apply(lambda x: x['canonical_smiles'])
    mol_df = mol_df[[ 'molecule_chembl_id', 'canonical_smiles']]

    return mol_df


def read_input_list_of_compounds(input_compoundlist_file, output_file):

    cnt = 0
    # open the output file:
    with open(output_file, 'w') as f:

        # read in the list of compounds:
        compounds = list() # create an empty list to store the compounds in
        compound_set_count = 0 # we will retrieve data for 10 compounds at a time
        with open(input_compoundlist_file, "r") as inputfileObj:
            for line in inputfileObj:
                line = line.rstrip()
                temp = line.split()
                compound = temp[0] # e.g. CHEMBL10
                compounds.append(compound)
                cnt += 1
                # if the list of compounds has 10 compounds, find the compound info. for these compounds:
                if len(compounds) == 10:
                    compound_set_count += 1
                    # using a list of known compounds, find compound info. for those compounds:
                    print(cnt, "Finding compound info. for compounds", compounds)
                    mol_df = find_properties_of_compounds(compounds)

                    # export the data frame to the output file:
                    if compound_set_count == 1:
                        mol_df.to_csv(f, sep="\t", index=None, header=True) # only write a header for the first set of 10 compounds
                    else:
                        mol_df.to_csv(f, sep="\t", index=None, header=False)
                    # empty the list of compounds:
                    compounds.clear()
        # if there are some compounds left in the compound list, find their properties:
        if len(compounds) > 0:
            # find the compound info. for these compounds:
            print(cnt, "Finding compound info. for compounds", compounds)
            mol_df = find_properties_of_compounds(compounds)
            mol_df.to_csv(f, sep="\t", index=None, header=(compound_set_count == 0))


def main():

    # check the command-line arguments:
    if len(sys.argv) != 3 or not os.path.exists(sys.argv[1]):
        print("Usage: %s input_compoundlist_file output_file" % sys.argv[0])
        sys.exit(1)
    input_compoundlist_file = sys.argv[1] # input file with a list of ChEMBL compounds of interest
    output_file = sys.argv[2]

    # read in the input list of compounds of interest:
    print("Reading in compound list...")
    read_input_list_of_compounds(input_compoundlist_file, output_file)



if __name__ == "__main__":
    main()

Friday, 28 May 2021

Choosing distinct random colours using R

 I wanted to choose 110 distinct random colours for a plot in R. I found I could do this using the randomcoloR package:

> install.packages("randomcoloR")

> library(randomcoloR)

> colours <- distinctColorPalette(k=110)

> colours

  [1] "#C0AAEE" "#D14FA7" "#5E4B73" "#46A0C6" "#BCEF7D" "#E1A6ED" "#B4F5A2" "#C8EABE" "#D492EE" "#4560D7" "#F0F0E3" "#457172" "#6D9BCA" "#C46AA6" "#ECF030" "#E2EFD0" "#EB2F85" "#8FF0EA" "#83C7F1" "#B3A4A9" "#D86C40"
 [22] "#45ECEB" "#9BAF69" "#9B7EEB" "#93EBB6" "#E99F79" "#BC24EA" "#BBA7C1" "#C6CB95" "#F33BE7" "#6B25AA" "#F5A2E1" "#A7C6A1" "#DBA497" "#BAEDF2" "#7B5FF2" "#6C3283" "#A8A3CF" "#BA465C" "#BAF43C" "#D1E9E9" "#77CEC3"
 [43] "#70769C" "#939DEA" "#E2B8E2" "#91EF77" "#D14DD9" "#9FAD9E" "#B68851" "#E236B8" "#8BD33B" "#78D7ED" "#F5B1D3" "#F1F0B6" "#50ED85" "#2C4B24" "#5BA8F1" "#65F239" "#ED3E52" "#52E059" "#EE6F96" "#62EED2" "#CAAEA1"
 [64] "#EFC5BE" "#D6EFA0" "#E27666" "#E785AF" "#A57EC4" "#966C5B" "#CBCDB3" "#B781AD" "#F0C068" "#F09935" "#B5CDE9" "#D4C874" "#91496E" "#EA79EF" "#7BA32F" "#869175" "#EEC896" "#BB67D5" "#B9EADA" "#C9C6C7" "#B78490"
 [85] "#C9D87A" "#91B5BB" "#F0C843" "#DEDCF1" "#55EDB4" "#5580D7" "#EFA3AB" "#4FB0B9" "#ADB9F0" "#E2EC5C" "#B09836" "#5631E9" "#EA7FCF" "#96CE8F" "#6CC161" "#D8CAF5" "#4BA784" "#50C284" "#EDE2E3" "#F0EC80" "#E6878A"
[106] "#B49D78" "#A5F1D1" "#A44FEF" "#C2C52C" "#F1CDE0"

Now make a plot with these colours in it:

> barplot(1:110, col=colours)

Thursday, 27 May 2021

Making a riverplot to show overlaps between two clusterings

 I had created two different clusterings, and my colleague Adam Reid suggested that I create a 'riverplot' (see here for a nice description) to show the overlaps between the clusters in clustering RUN1, and clustering RUN2 (made from two different runs of the same clustering software, with slightly different inputs).

To do this, I used the riverplot R package.  

For my clusterings RUN1 and RUN2, I had found overlaps between the clusters in set RUN1, and the clusters in set RUN2, as follows, where (x, y) gives the number of overlaps between a cluster x in set RUN1 and a cluster y in set RUN2:

- pair (5, 4) :  15005
- pair (6, 5) :  5923
- pair (4, 4) :  4118
- pair (0, 3) :  9591
- pair (4, 5) :  3290
- pair (5, 5) :  17
- pair (1, 0) :  13890
- pair (3, 2) :  4131
- pair (2, 3) :  504
- pair (2, 1) :  16480
- pair (0, 0) :  1
- pair (0, 1) :  4
- pair (1, 2) :  62
- pair (4, 0) :  6
- pair (3, 3) :  135
- pair (2, 4) :  113
- pair (3, 1) :  43
- pair (1, 1) :  17
- pair (3, 4) :  64
- pair (1, 4) :  6
- pair (4, 3) :  148
- pair (0, 5) :  38
- pair (1, 3) :  16
- pair (2, 2) :  12
- pair (0, 2) :  2
- pair (5, 3) :  15
- pair (6, 4) :  40
- pair (0, 4) :  14
- pair (6, 3) :  3
- pair (1, 5) :  2
- pair (4, 1) :  5
- pair (4, 2) :  1
- pair (2, 0) :  2
- pair (5, 0) :  2
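
(For the record, pair counts like these can be computed from two lists of cluster labels, one label per item in each run, using collections.Counter; a minimal sketch, where the label lists are made up for illustration:)

```python
from collections import Counter

def overlap_counts(labels1, labels2):
    """Count, for each (cluster-in-RUN1, cluster-in-RUN2) pair, how many
    items are assigned to that pair of clusters. labels1 and labels2 give
    the cluster label of each item in RUN1 and RUN2 respectively."""
    return Counter(zip(labels1, labels2))

# toy example: 6 items, clustered two different ways
run1 = [0, 0, 1, 1, 1, 2]
run2 = [3, 3, 0, 0, 1, 1]
counts = overlap_counts(run1, run2)
```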

I guess you could show this as a weighted graph, i.e. with nodes for the RUN1 clusters on the left and nodes for the RUN2 clusters on the right, and edges between them, with each edge labelled with its overlap count. 

 Another nice way is a riverplot. I made this using the riverplot package as follows:

> install.packages("riverplot")
> library("riverplot")
> nodes <- c("RUN1_0", "RUN1_1", "RUN1_2", "RUN1_3", "RUN1_4", "RUN1_5", "RUN1_6", "RUN2_0", "RUN2_1", "RUN2_2", "RUN2_3", "RUN2_4", "RUN2_5")
> edges <- list( RUN1_0 = list( RUN2_0=1, RUN2_1=4, RUN2_2=2, RUN2_3=9591, RUN2_4=14, RUN2_5=38),
+ RUN1_1 = list( RUN2_0=13890, RUN2_1=17, RUN2_2=62, RUN2_3=16, RUN2_4=6, RUN2_5=2),
+ RUN1_2 = list( RUN2_0=2, RUN2_1=16480, RUN2_2=12, RUN2_3=504, RUN2_4=113, RUN2_5=0),
+ RUN1_3 = list( RUN2_0=0, RUN2_1=43, RUN2_2=4131, RUN2_3=135, RUN2_4=64, RUN2_5=0),
+ RUN1_4 = list( RUN2_0=6, RUN2_1=5, RUN2_2=1, RUN2_3=148, RUN2_4=4118, RUN2_5=3290),
+ RUN1_5 = list( RUN2_0=2, RUN2_1=0, RUN2_2=0, RUN2_3=15, RUN2_4=15005, RUN2_5=17),
+ RUN1_6 = list( RUN2_0=0, RUN2_1=0, RUN2_2=0, RUN2_3=3, RUN2_4=40, RUN2_5=5923))
> r <- makeRiver( nodes, edges, node_xpos = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2), node_labels=c(RUN1_0 = "0", RUN1_1 = "1", RUN1_2 = "2", RUN1_3 = "3", RUN1_4 = "4", RUN1_5 = "5", RUN1_6 = "6", RUN2_0 = "0", RUN2_1 = "1", RUN2_2 = "2", RUN2_3 = "3", RUN2_4 = "4", RUN2_5= "5"), node_styles= list(RUN1_0 = list(col="yellow"), RUN1_1 = list(col="orange"), RUN1_2=list(col="red"), RUN1_3=list(col="green"), RUN1_4=list(col="blue"), RUN1_5=list(col="pink"), RUN1_6=list(col="purple")))
> plot(r)

Here we see cluster 0 in RUN1 mostly corresponds to cluster 3 in RUN2.

Cluster 1 in RUN1 mostly corresponds to cluster 0 in RUN2.

Cluster 2 in RUN1 mostly corresponds to cluster 1 in RUN2.

Cluster 3 in RUN1 mostly corresponds to cluster 2 in RUN2.

Clusters 4,5,6 in RUN1 correspond to clusters 4 and 5 in RUN2: cluster 4 in RUN2 maps to clusters 5 and 4 in RUN1, and cluster 5 in RUN2 maps to clusters 6 and 4 in RUN1.

 Note that in some cases you might have a lot of clusters; you can then reduce the label size for the clusters as described here, i.e. (using the riverplot package's default.style(); 'ds' is just my name for the style object):

> ds <- default.style()
> ds$textcex <- 0.1
> plot(r, default_style=ds)


Thanks to Adam Reid for introducing me to riverplots!