Thursday, 16 September 2021

A Python script to submit lots of jobs to the farm

I often need to split up a huge input file and submit lots of jobs to the Sanger compute farm, one job per chunk.

For example, I had an enormous file of BLAT results that I wanted to run a script on, but the file was too big to process in one go. So I split it up into smaller files of 10,000 lines each, using:

% split -l 10000 enormous_blat.txt fblat 

This made files fblataa, fblatab... (47 files) 

On each of these small chunks I wanted to run my script (which is called 'strip_off_adaptors.py'), i.e.:

% python3 strip_off_adaptors.py fblataa

% python3 strip_off_adaptors.py fblatab

etc.

But submitting those 47 jobs to the farm, typing out all the 'bsub' commands by hand, was going to take me ages. Well, at least 10 minutes!

 So I decided to write a Python script to submit the jobs (see my script below). 

It takes a file with a list of the fblat* files as its input. 
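
You can make that list file with something like this (adjust the wildcard to your chunk names):

% ls fblat?? > fblat_file_list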

Then it makes a subdirectory for each fblat* file: e.g. fblataadir for fblataa.

Then it submits the job for fblataa in the directory fblataadir. And so on, for fblatab, fblatac, etc. 

 It can be run using: 

% python3 submit_water_jobs.py fblat_file_list lib_all_R1_001.fa linker.fa 

(where lib_all_R1_001.fa and linker.fa are just some other input files required by my script 'strip_off_adaptors.py'.) 

Easy-peasy! 

 

Here's my script submit_water_jobs.py; you can alter it to submit bsub jobs to a compute farm for lots of chunks of any type of file:

 

import os
import sys

#====================================================================#

def read_input_file_list(input_file):
    """read in the input file with the list of input BLAT files"""
   
    # define a list to contain the names of the input BLAT files:
    input_file_list = list()
  
    # read in the input file:
    fileObj = open(input_file, "r")
    for line in fileObj:
        line = line.rstrip()
        temp = line.split()
        input_file_name = temp[0]
        input_file_list.append(input_file_name)
    fileObj.close()

    return input_file_list     

#====================================================================#

def main():

    # check the command-line arguments:         
    if len(sys.argv) != 4 or not os.path.exists(sys.argv[1]) or not os.path.exists(sys.argv[2]) or not os.path.exists(sys.argv[3]):
        print("Usage: %s input_list_file input_reads_fasta input_linker_fasta" % sys.argv[0])
        sys.exit(1)
    input_file = sys.argv[1] # input file with list of input BLAT files  
    input_reads_fasta = sys.argv[2] # input fasta file of reads
    input_linker_fasta = sys.argv[3] # input fasta file with the linker sequence

    # read in the input file with list of input BLAT files
    input_file_list = read_input_file_list(input_file)

    # get the current directory:
    current_dir = os.getcwd()

    # for each input BLAT file, submit the 'water' job:
 
    for blat_file in input_file_list:
        # make a directory for running this job
        newdir = '%sdir' % blat_file # e.g. fblataadir
        newdir2 = os.path.join(current_dir,newdir)
        os.mkdir(newdir2)
        os.chdir(newdir2)
        # make a soft-link to the input BLAT file:
        blat_file2 = os.path.join(current_dir,blat_file)
        blat_file3 = os.path.join(newdir2,blat_file)
        command0 = "ln -s %s %s" % (blat_file2, blat_file3) # blat_file3 is in the new directory
        os.system(command0) 

        # make a soft-link to the input fasta file of reads:
        input_reads_fasta2 = os.path.join(current_dir,input_reads_fasta)
        input_reads_fasta3 = os.path.join(newdir2, input_reads_fasta)
        command1 = "ln -s %s %s" % (input_reads_fasta2, input_reads_fasta3) # input_reads_fasta3 is in the new directory
        os.system(command1)
        # make a soft-link to the input file with the linker sequence:
        input_linker_fasta2 = os.path.join(current_dir, input_linker_fasta)
        input_linker_fasta3 = os.path.join(newdir2, input_linker_fasta)
        command2 = "ln -s %s %s" % (input_linker_fasta2, input_linker_fasta3) # input_linker_fasta3 is in the new directory
        os.system(command2)
        # define the name of the output file:
        output_file = "%s2" % blat_file3 # output_file is in the new directory
        # submit the job to run 'water' between the reads and the linker:
        command3 = "python3 ~alc/Documents/git/Python/strip_off_adaptors.py %s %s %s %s 0.5" % (blat_file3, input_reads_fasta3, input_linker_fasta3, output_file)
        # specify the bsub output and error file names:
        bsub_out = "myscript.o"
        bsub_err = "myscript.e"
        bsub_out2 = os.path.join(newdir2,bsub_out) # bsub_out2 is in the new directory
        bsub_err2 = os.path.join(newdir2,bsub_err) # bsub_err2 is in the new directory
        # submit farm job:
        jobname = "%s" % blat_file
        # request 5000 Mbyte of RAM for the job:
        command4 = 'bsub -o %s -e %s -R "select[mem>5000] rusage[mem=5000]" -M5000 -J%s "%s"' % (bsub_out2, bsub_err2, jobname, command3)
        print(command4)
        os.system(command4)
        os.chdir(current_dir)

#====================================================================#

if __name__=="__main__":
    main()

#====================================================================#
 

 

Tuesday, 22 June 2021

Retrieving the SMILES for a list of ChEMBL identifiers

Today I needed to retrieve the SMILES for a list of ChEMBL identifiers.

I had to refresh my memory on how to retrieve data from ChEMBL using their web interface.

I wrote a little Python script (see below) that takes a file with a list of ChEMBL ids as input, e.g.:

CHEMBL608855
CHEMBL609156
CHEMBL592105
CHEMBL592123
CHEMBL592125
CHEMBL592332
CHEMBL592344
CHEMBL1197993
CHEMBL596643
CHEMBL596852

To run it you can type e.g.:

% python3 retrieve_smiles_from_chembl_for_compoundlist.py input_list output_file

It then makes an output file with the SMILES for those ids (see below), e.g.

molecule_chembl_id      canonical_smiles
CHEMBL596643    O=c1nc(C=Cc2ccc(Cl)cc2)oc2ccccc12
CHEMBL596852    COc1ccc(-c2nc3cc(Cc4ccc5[nH]c(-c6ccc(OC)cc6)nc5c4)ccc3[nH]2)cc1
CHEMBL608855    CC(C)(C)c1ccc(C2CC3=Nc4ccccc4N(C(=O)c4ccccc4Cl)C(c4ccc(F)cc4)C3=C(O)C2)cc1
CHEMBL609156    CCOC(=O)c1c[nH]c2c(CC)cccc2c1=O
CHEMBL592105    CN(C)c1ccc(C(O)(c2ccc(N(C)C)cc2)c2ccc(N(C)C)cc2)cc1
CHEMBL592344    CCOc1ccccc1CNC(=O)C1c2ccccc2C(=O)N(CC(C)C)C1c1cccs1
CHEMBL592332    CCOc1ccc2c(c1)CN(Cc1ccc(Cl)cc1)CO2
CHEMBL592123    CCCOCc1cc(CN2CCN(c3cccc(Cl)c3)CC2)c(O)c2ncccc12
CHEMBL592125    O=C(Cc1ccccc1)NC(c1ccc(Cl)cc1)c1c(O)ccc2ccccc12
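
As an aside, you can test the underlying ChEMBL 'molecule' API call directly (this is the same URL pattern that my script below constructs), e.g.:

% curl "https://www.ebi.ac.uk/chembl/api/data/molecule.json?molecule_chembl_id__in=CHEMBL608855,CHEMBL609156&limit=100"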

My Python script

import os
import sys
import pandas as pd # uses pandas python module to view and analyse data
import requests # this is used to access json files

#====================================================================#

# call the 'molecule' API to find the molecular properties of our list of compounds:

def find_properties_of_compounds(cmpd_chembl_ids):

    #For the identified compounds, extract their molecular properties and other information from the 'molecule' ChEMBL API
    #Specify the input parameters:
    cmpd_chembl_ids = ",".join(cmpd_chembl_ids[0:]) #Amend the format of the text string of compounds so that it is suitable for the API call
    limit = 100 #Limit the number of records pulled back for each url call

    # Set up the call to the ChEMBL 'molecule' API
    # Remember that there is a limit to the number of records returned in any one API call (default is 20 records, maximum is 1000 records)
    # So need to iterate over several pages of records to gather all relevant information together!
    url_stem = "https://www.ebi.ac.uk" #This is the stem of the url
    url_full_string = url_stem + "/chembl/api/data/molecule.json?molecule_chembl_id__in={}&limit={}".format(cmpd_chembl_ids, limit) #This is the full url with the specified input parameters
    url_full = requests.get( url_full_string ).json() #This calls the information back from the API using the 'requests' module, and converts it to json format
    url_molecules = url_full['molecules'] #This is the list of results for molecules

    # This 'while' loop iterates over several pages of records (if required), and collates the list of results
    while url_full['page_meta']['next']:
        url_full = requests.get(url_stem + url_full['page_meta']['next']).json()
        url_molecules = url_molecules + url_full['molecules'] #Add result (as a list) to previous list of results

    #Convert the list of results into a Pandas dataframe:
    mol_df = pd.DataFrame(url_molecules)

    #Print out some useful information:
    #print("This is the url string that calls the 'Molecule' API with the specified query\n{}".format(url_full_string) )
    #print("\nThese are the available columns for the Molecule API:\n{}".format(mol_df.columns))

    # Select only relevant columns:
    mol_df = mol_df[[ 'molecule_chembl_id','molecule_structures']]

    # And convert cells containing a dictionary to individual columns in the dataframe so that it is easier to filter!
    # Molecule hierarchy:
    # mol_df['parent_chembl_id'] = mol_df['molecule_hierarchy'].apply(lambda x: x['parent_chembl_id'])
    # Note that the above line gives an error message for some compounds e.g. CHEMBL1088885 that do not seem to have a parent stored; however, it should get printed anyway with molecule_hierarchy.

    #Physicochemical properties (only report if cells are not null)
    mol_df['canonical_smiles'] = mol_df.loc[ mol_df['molecule_structures'].notnull(), 'molecule_structures'].apply(lambda x: x['canonical_smiles'])
    mol_df = mol_df[[ 'molecule_chembl_id', 'canonical_smiles']]

    return mol_df

#====================================================================#

def read_input_list_of_compounds(input_compoundlist_file, output_file):

    cnt = 0
    # open the output file:
    with open(output_file, 'w') as f:

        # read in the list of compounds:
        compounds = list() # create an empty list to store the compounds in
        inputfileObj = open(input_compoundlist_file, "r")
        compound_set_count = 0 # we will retrieve data for 10 compounds at a time
        for line in inputfileObj:
            line = line.rstrip()
            temp = line.split()
            # CHEMBL10
            compound = temp[0] # e.g. CHEMBL10  
            cnt += 1
            compounds.append(compound)  
            # if the list of compounds has 10 compounds, find the compound info. for these compounds:       
            if len(compounds) == 10:
                compound_set_count += 1
                # using a list of known compounds, find compound info. for those compounds:       
                print(cnt,"Finding compound info. for compounds",compounds)
                mol_df = find_properties_of_compounds(compounds)

                #Export the data frame to a csv file:
                #Followed examples from https://stackoverflow.com/questions/37357727/pandas-write-tab-separated-dataframe-with-literal-tabs-with-no-quotes
                # and https://datatofish.com/export-dataframe-to-csv and https://stackoverflow.com/questions/17530542/how-to-add-pandas-data-to-an-existing-csv-file
                if compound_set_count == 1:
                    mol_df.to_csv(f, sep="\t", index=None, header=True) # only write a header for the first set of 10 compounds
                else:
                    mol_df.to_csv(f, sep="\t", index=None, header=False)
                # empty the list of compounds:
                compounds.clear() # from https://www.geeksforgeeks.org/different-ways-to-clear-a-list-in-python/
        inputfileObj.close()
        # if there are some compounds left in the compound list, find their properties:
        if len(compounds) > 0:
            # find the compound info. for these compounds:
            print(cnt,"Finding compound info. for compounds",compounds)
            mol_df = find_properties_of_compounds(compounds)
            mol_df.to_csv(f, sep="\t", index=None, header=False)

#====================================================================#

def main():

    # check the command-line arguments:
    if len(sys.argv) != 3 or os.path.exists(sys.argv[1]) == False:
        print("Usage: %s input_compoundlist_file output_file" % sys.argv[0])
        sys.exit(1)
    input_compoundlist_file = sys.argv[1] # input file with a list of ChEMBL compounds of interest
    output_file = sys.argv[2]

    # read in the input list of compounds of interest:
    print("Reading in compound list...")
    read_input_list_of_compounds(input_compoundlist_file, output_file)

    print("FINISHED\n")

#====================================================================#

if __name__=="__main__":
    main()

#====================================================================#


Friday, 28 May 2021

Choosing distinct random colours using R

 I wanted to choose 110 distinct random colours for a plot in R. I found I could do this using the randomcoloR package:

> install.packages("randomcoloR")

> library(randomcoloR)

> distinctColorPalette(k=110)

  [1] "#C0AAEE" "#D14FA7" "#5E4B73" "#46A0C6" "#BCEF7D" "#E1A6ED" "#B4F5A2" "#C8EABE" "#D492EE" "#4560D7" "#F0F0E3" "#457172" "#6D9BCA" "#C46AA6" "#ECF030" "#E2EFD0" "#EB2F85" "#8FF0EA" "#83C7F1" "#B3A4A9" "#D86C40"
 [22] "#45ECEB" "#9BAF69" "#9B7EEB" "#93EBB6" "#E99F79" "#BC24EA" "#BBA7C1" "#C6CB95" "#F33BE7" "#6B25AA" "#F5A2E1" "#A7C6A1" "#DBA497" "#BAEDF2" "#7B5FF2" "#6C3283" "#A8A3CF" "#BA465C" "#BAF43C" "#D1E9E9" "#77CEC3"
 [43] "#70769C" "#939DEA" "#E2B8E2" "#91EF77" "#D14DD9" "#9FAD9E" "#B68851" "#E236B8" "#8BD33B" "#78D7ED" "#F5B1D3" "#F1F0B6" "#50ED85" "#2C4B24" "#5BA8F1" "#65F239" "#ED3E52" "#52E059" "#EE6F96" "#62EED2" "#CAAEA1"
 [64] "#EFC5BE" "#D6EFA0" "#E27666" "#E785AF" "#A57EC4" "#966C5B" "#CBCDB3" "#B781AD" "#F0C068" "#F09935" "#B5CDE9" "#D4C874" "#91496E" "#EA79EF" "#7BA32F" "#869175" "#EEC896" "#BB67D5" "#B9EADA" "#C9C6C7" "#B78490"
 [85] "#C9D87A" "#91B5BB" "#F0C843" "#DEDCF1" "#55EDB4" "#5580D7" "#EFA3AB" "#4FB0B9" "#ADB9F0" "#E2EC5C" "#B09836" "#5631E9" "#EA7FCF" "#96CE8F" "#6CC161" "#D8CAF5" "#4BA784" "#50C284" "#EDE2E3" "#F0EC80" "#E6878A"
[106] "#B49D78" "#A5F1D1" "#A44FEF" "#C2C52C" "#F1CDE0"


Now save the palette in a variable, and make a plot with these colours in it:

> colours <- distinctColorPalette(k=110)

> barplot(1:110, col=colours)





Thursday, 27 May 2021

Making a riverplot to show overlaps between two clusterings

I had created two different clusterings, and my colleague Adam Reid suggested that I create a 'riverplot' (see here for a nice description) to show the overlaps between the clusters in clustering RUN1 and the clusters in clustering RUN2 (the two clusterings were made from two different runs of the same clustering software, with slightly different inputs).

To do this, I used the riverplot R package.  

For my clusterings RUN1 and RUN2, I had found the overlaps between the clusters in set RUN1 and the clusters in set RUN2, as follows, where (x, y) gives the number of overlaps between a cluster x in set RUN1 and a cluster y in set RUN2 (a sketch of how such a table can be computed is given after the list):

- pair (5, 4) :  15005
- pair (6, 5) :  5923
- pair (4, 4) :  4118
- pair (0, 3) :  9591
- pair (4, 5) :  3290
- pair (5, 5) :  17
- pair (1, 0) :  13890
- pair (3, 2) :  4131
- pair (2, 3) :  504
- pair (2, 1) :  16480
- pair (0, 0) :  1
- pair (0, 1) :  4
- pair (1, 2) :  62
- pair (4, 0) :  6
- pair (3, 3) :  135
- pair (2, 4) :  113
- pair (3, 1) :  43
- pair (1, 1) :  17
- pair (3, 4) :  64
- pair (1, 4) :  6
- pair (4, 3) :  148
- pair (0, 5) :  38
- pair (1, 3) :  16
- pair (2, 2) :  12
- pair (0, 2) :  2
- pair (5, 3) :  15
- pair (6, 4) :  40
- pair (0, 4) :  14
- pair (6, 3) :  3
- pair (1, 5) :  2
- pair (4, 1) :  5
- pair (4, 2) :  1
- pair (2, 0) :  2
- pair (5, 0) :  2
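
Here is a minimal Python sketch of how such an overlap table can be computed, assuming each clustering is just a list of cluster labels indexed by item (the label vectors below are made-up toy data, not my real clusterings):

from collections import Counter

# run1[i] and run2[i] are the cluster labels assigned to item i
# by the two clusterings (toy example data):
run1 = [0, 0, 1, 1, 2, 2, 2]
run2 = [3, 3, 0, 0, 1, 1, 3]

# count how many items each (RUN1 cluster, RUN2 cluster) pair shares:
overlaps = Counter(zip(run1, run2))
for (x, y), n in sorted(overlaps.items()):
    print("- pair (%d, %d) :  %d" % (x, y, n))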

You could, I guess, show this as a weighted bipartite graph, i.e. with the RUN1 clusters as nodes on the left and the RUN2 clusters as nodes on the right, and edges between them, with the weight for each edge written on it.

 Another nice way is a riverplot. I made this using the riverplot package as follows:

> install.packages("riverplot")
> library("riverplot")
> nodes <- c("RUN1_0", "RUN1_1", "RUN1_2", "RUN1_3", "RUN1_4", "RUN1_5", "RUN1_6", "RUN2_0", "RUN2_1", "RUN2_2", "RUN2_3", "RUN2_4", "RUN2_5")
> edges <- list( RUN1_0 = list( RUN2_0=1, RUN2_1=4, RUN2_2=2, RUN2_3=9591, RUN2_4=14, RUN2_5=38),
+ RUN1_1 = list( RUN2_0=13890, RUN2_1=17, RUN2_2=62, RUN2_3=16, RUN2_4=6, RUN2_5=2),
+ RUN1_2 = list( RUN2_0=2, RUN2_1=16480, RUN2_2=12, RUN2_3=504, RUN2_4=113, RUN2_5=0),
+ RUN1_3 = list( RUN2_0=0, RUN2_1=43, RUN2_2=4131, RUN2_3=135, RUN2_4=64, RUN2_5=0),
+ RUN1_4 = list( RUN2_0=6, RUN2_1=5, RUN2_2=1, RUN2_3=148, RUN2_4=4118, RUN2_5=3290),
+ RUN1_5 = list( RUN2_0=2, RUN2_1=0, RUN2_2=0, RUN2_3=15, RUN2_4=15005, RUN2_5=17),
+ RUN1_6 = list( RUN2_0=0, RUN2_1=0, RUN2_2=0, RUN2_3=3, RUN2_4=40, RUN2_5=5923))
> r <- makeRiver( nodes, edges,
+ node_xpos = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
+ node_labels = c(RUN1_0 = "0", RUN1_1 = "1", RUN1_2 = "2", RUN1_3 = "3", RUN1_4 = "4", RUN1_5 = "5", RUN1_6 = "6", RUN2_0 = "0", RUN2_1 = "1", RUN2_2 = "2", RUN2_3 = "3", RUN2_4 = "4", RUN2_5 = "5"),
+ node_styles = list(RUN1_0 = list(col="yellow"), RUN1_1 = list(col="orange"), RUN1_2 = list(col="red"), RUN1_3 = list(col="green"), RUN1_4 = list(col="blue"), RUN1_5 = list(col="pink"), RUN1_6 = list(col="purple")))
> plot(r)






Here we see cluster 0 in RUN1 mostly corresponds to cluster 3 in RUN2.

Cluster 1 in RUN1 mostly corresponds to cluster 0 in RUN2.

Cluster 2 in RUN1 mostly corresponds to cluster 1 in RUN2.

Cluster 3 in RUN1 mostly corresponds to cluster 2 in RUN2.

Clusters 4,5,6 in RUN1 correspond to clusters 4 and 5 in RUN2: cluster 4 in RUN2 maps to clusters 5 and 4 in RUN1, and cluster 5 in RUN2 maps to clusters 6 and 4 in RUN1.

Note that in some cases you might have a lot of clusters; then you can reduce the label size for the clusters, as described here, i.e.:

> custom.style <- riverplot::default.style()
> custom.style$textcex <- 0.1
> plot(r, default_style=custom.style)


Acknowledgements

Thanks to Adam Reid for introducing me to riverplots!



Thursday, 12 November 2020

Fastqc for checking read quality

 I wanted to check whether the bases in a read have low quality. The read looks like this in my fastq file:

@M04241:499:GW19092930:1:2112:8740:16257 1:N:0:TGACCT
CGATGTGAGATCCCTCAGACCCTTTTAGTCAGTGGTCCCTTAAGCGGAGGCCCTATAGTGAGTCGTATTACAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAATATATACAACCAAATAAAGATGAAAAAAGAAACTCAATCCAAAAGAAATTGTAGAAGCAAGCTACACAGATTGACCGTAATATTTTCAAACTCAGAGCATGAC
+
CCCCCFFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHGGGGGGGGHHHHHHHHHHHGGHHHHHHHHHHHGGGGHHHGGGGGHHHHHHGHHHHHGGGGGGGGF/GGHGGHHHHGGGHGHHHHHHHHHHHHGGGGG<;-;.000000099.:..09CF0000;;;0;;...;/;0:0;;/0;:.;;;B0009000:9FF//.0;0;/9.000;00:-9:A.0:00000009000000/90900

 I put these lines in a file 'tmp.fastq'.
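
As an aside, the quality line is Phred+33 encoded, so you can also decode it by hand; here is a minimal Python sketch, using just the first few characters of the quality line above:

qual = "CCCCCFFFFFFFGGGG"  # start of the quality line above
scores = [ord(c) - 33 for c in qual]  # Phred+33 decoding
print(scores)  # [34, 34, 34, 34, 34, 37, ...]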

I ran fastqc using this command:

% module avail -t | grep -i fastqc  [to find the fastqc module on the Sanger farm]

% module load fastqc/0.11.8-c2 [to load the fastqc module on the Sanger farm]

% fastqc tmp.fastq

% unzip tmp_fastqc.zip 

% more tmp_fastqc/fastqc_data.txt 
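
(To pull out just the per-base quality table, something like this sed one-liner should work, based on the '>>' module markers that fastqc_data.txt uses:)

% sed -n '/>>Per base sequence quality/,/>>END_MODULE/p' tmp_fastqc/fastqc_data.txt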

This shows me that the average base quality drops off a lot after approximately position 145:

#Base   Mean    Median  Lower Quartile  Upper Quartile  10th Percentile 90th Percentile
1       34.0    NaN     NaN     NaN     NaN     NaN
2       34.0    NaN     NaN     NaN     NaN     NaN
3       34.0    NaN     NaN     NaN     NaN     NaN
4       34.0    NaN     NaN     NaN     NaN     NaN
5       34.0    NaN     NaN     NaN     NaN     NaN
6       37.0    NaN     NaN     NaN     NaN     NaN
7       37.0    NaN     NaN     NaN     NaN     NaN
8       37.0    NaN     NaN     NaN     NaN     NaN
9       37.0    NaN     NaN     NaN     NaN     NaN
10-14   37.4    NaN     NaN     NaN     NaN     NaN
15-19   38.0    NaN     NaN     NaN     NaN     NaN
20-24   38.4    NaN     NaN     NaN     NaN     NaN
25-29   39.0    NaN     NaN     NaN     NaN     NaN
30-34   39.0    NaN     NaN     NaN     NaN     NaN
35-39   39.0    NaN     NaN     NaN     NaN     NaN
40-44   39.0    NaN     NaN     NaN     NaN     NaN
45-49   38.2    NaN     NaN     NaN     NaN     NaN
50-54   38.2    NaN     NaN     NaN     NaN     NaN
55-59   39.0    NaN     NaN     NaN     NaN     NaN
60-64   39.0    NaN     NaN     NaN     NaN     NaN
65-69   38.6    NaN     NaN     NaN     NaN     NaN
70-74   39.0    NaN     NaN     NaN     NaN     NaN
75-79   38.6    NaN     NaN     NaN     NaN     NaN
80-84   38.6    NaN     NaN     NaN     NaN     NaN
85-89   38.0    NaN     NaN     NaN     NaN     NaN
90-94   39.0    NaN     NaN     NaN     NaN     NaN
95-99   38.8    NaN     NaN     NaN     NaN     NaN
100-104 38.4    NaN     NaN     NaN     NaN     NaN
105-109 38.0    NaN     NaN     NaN     NaN     NaN
110-114 33.2    NaN     NaN     NaN     NaN     NaN
115-119 38.6    NaN     NaN     NaN     NaN     NaN
120-124 38.4    NaN     NaN     NaN     NaN     NaN
125-129 38.8    NaN     NaN     NaN     NaN     NaN
130-134 39.0    NaN     NaN     NaN     NaN     NaN
135-139 38.6    NaN     NaN     NaN     NaN     NaN
140-144 33.4    NaN     NaN     NaN     NaN     NaN
145-149 16.2    NaN     NaN     NaN     NaN     NaN
150-154 15.0    NaN     NaN     NaN     NaN     NaN
155-159 19.8    NaN     NaN     NaN     NaN     NaN
160-164 24.6    NaN     NaN     NaN     NaN     NaN
165-169 17.2    NaN     NaN     NaN     NaN     NaN
170-174 23.8    NaN     NaN     NaN     NaN     NaN
175-179 15.8    NaN     NaN     NaN     NaN     NaN
180-184 21.4    NaN     NaN     NaN     NaN     NaN
185-189 21.2    NaN     NaN     NaN     NaN     NaN
190-194 24.8    NaN     NaN     NaN     NaN     NaN
195-199 16.8    NaN     NaN     NaN     NaN     NaN
200-204 23.2    NaN     NaN     NaN     NaN     NaN
205-209 18.6    NaN     NaN     NaN     NaN     NaN
210-214 21.0    NaN     NaN     NaN     NaN     NaN
215-219 16.8    NaN     NaN     NaN     NaN     NaN
220-224 18.2    NaN     NaN     NaN     NaN     NaN
225-229 22.0    NaN     NaN     NaN     NaN     NaN
230-234 15.0    NaN     NaN     NaN     NaN     NaN
235-239 16.8    NaN     NaN     NaN     NaN     NaN
240-244 14.8    NaN     NaN     NaN     NaN     NaN
245-249 18.6    NaN     NaN     NaN     NaN     NaN



Tuesday, 29 September 2020

Running BLAST to align genome sequences

I'm interested in finding conserved non-coding sequences between two related species of worms. 

First I took the introns, UTRs, and intergenic regions from the first species, and tried comparing them to the genome of the second species using exonerate, but that was very slow. I then tried BLAT, which was a little faster. Then I tried BLASTN, which was nice and speedy!

It's been a while since I ran BLAST on the Sanger farm so I needed to remind myself how to run it, even though I have written previous posts on that ages ago (e.g. on farm_blast and on speeding up blast jobs).

This is what I did now:

Find the BLAST module on the farm, and load it: (only applicable to Sanger users)

% module avail -t | grep -i blast
blast/2.7.1=h96bfa4b_5  

% module load blast/2.7.1=h96bfa4b_5

Make a blast database:

% makeblastdb -in genome2.fa -dbtype nucl

Run blast:

% blastn -db genome2.fa -query genome1_intronsandutrsandintergenic.fa -out myoutput.blast -outfmt 6  

One thing I always forget is what the columns in the BLAST m8 format are, so I have to look at this nice webpage.
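
(For the record, the 12 default columns of -outfmt 6 output are: query id, subject id, percent identity, alignment length, number of mismatches, number of gap opens, query start, query end, subject start, subject end, e-value, and bit score.)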

Note that by default the blastn command runs Megablast, which looks for matches of high percent identity, and is a fast algorithm. I'm interested in high percent identity matches, so I used this.

Alternatives to BLAST:

An alternative to BLAST is nucmer, part of the mummer package, which I wrote a post on ages ago (see here). Note to self: nucmer is part of the mummer module on the Sanger farm.

I asked my colleagues what they are using nowadays for whole-genome alignments, and they mentioned a couple of other tools:

- my colleague Eerik Aunin mentioned the software SibeliaZ, which is tailored for aligning highly similar genomes, e.g. strains of the same species

- my colleague Faye Rodgers mentioned Cactus, which can be used to make alignments of 1000s of vertebrate genomes

- my colleague Ana Protasio mentioned Satsuma

Regarding finding conserved noncoding regions, my colleague James Cotton mentioned PhastCons.


 


Using powsimR and SCOPIT for power calculations for single-cell data analysis

powsimR

I'm trying to do a power calculation to ask how many biological replicates are needed to detect the majority of differentially expressed genes between two treatments, in a particular cell population (cluster) in single cell data. I found the nice tool powsimR from the group of Dr Ines Hellmann. They say in their paper that for single cell data, the biological replicate is a cell, so their software tells you how many cells you need to sequence to detect the majority of differentially expressed genes (e.g. with >2-fold differential gene expression) between two treatments (e.g. male versus female worms).

There are detailed instructions on how to install powsimR on their powsimR github page. I followed these on a Windows laptop, after updating to the latest version of R, and found that it all installed fine, but I had to leave out the 'build_vignettes=TRUE' when running the final devtools::install_github("bvieth/powsimR", dependencies=FALSE) command. The powsimR github page says that several users have this issue. They say that this means the vignettes are not built, but they can be viewed at the powsimR vignette instead.
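
That is, the final install command that worked was simply:

> devtools::install_github("bvieth/powsimR", dependencies = FALSE)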

PowsimR lets you take into account batch effects (e.g. biological replicates performed in different months), depth of sequencing based on prior data, "dropout" effects where a gene is only detected in a subset of cells, and different methods for testing differential expression (e.g. MAST, scde, scDD, monocle). You can specify a particular power and false discovery rate (FDR).

PowsimR can be used to help design experiments in advance, and also can be used for posterior analysis. For example in their paper they say "For example for the Kolodziejczyk data, 384 single cells for each condition would be sufficient to detect >80% of the DE genes with a well controlled FDR of 5%. Given the lower sample sizes actually used in Kolodziejczyk et al (2015), our power analysis suggests that only 60% of all DE genes could be detected". 

In practice, I found that PowsimR looks quite complex to use as there are a lot of parameters to specify, and first you need to understand them. I ran out of time to do this properly, but hopefully can return to it one day, as it looks very useful...

SCOPIT

Another nice tool is SCOPIT, which lets you ask the question: how many cells do we need to sequence from an adult worm to capture a rare cell type?

SCOPIT is very simple to use through the website above.

Acknowledgements

Thanks to my colleague Faye Rodgers for telling me about SCOPIT.