Thursday, 17 December 2015

Orthofinder for orthologs

Today I heard of a new software for finding orthologs and paralogs, called Orthofinder (published by Emms and Kelly 2015), which my former colleague Jason Tsai has recommended. Sounds nice..

Thursday, 3 December 2015

Selecting distinct random colours in Python

I wanted to make a plot of Pfam domains in proteins, and colour them using different colours. The problem is that I needed the domains to have distinct colours, but didn't know in advance how many domains a particular gene family would have. I found a nice Python script by adewes on github for this (thank you!)
[Note: there's a small typo in this script, 'colors.append(generate_new_color(colors),pastel_factor = 0.9)' should be 'colors.append(generate_new_color(colors,pastel_factor = 0.9))']

An example script to choose 20 distinct random colours and plot them
I made a small example script that chooses 20 distinct random colours, and then makes a plot showing the colours and their hex values. The script on github (mentioned above) gives RGB codes for the colours, and I convert these to hex values for my script. (The script also worked fine when I just plotted the RGB values directly, but I wanted to use the hex values as labels for the colours in the picture produced.)

"""
Visualization of random colors.

Simple plot example with the 20 random colors and its visual representation.
"""
# Based on http://matplotlib.org/examples/color/named_colors.html and https://gist.github.com/adewes/5884820

from __future__ import (absolute_import, division, print_function,
                        unicode_literals)

import numpy as np
import matplotlib.pyplot as plt
import random
import sys
from matplotlib import colors

#====================================================================#

# function to create a random colour, copied from https://gist.github.com/adewes/5884820

def get_random_color(pastel_factor = 0.5):
    return [(x+pastel_factor)/(1.0+pastel_factor) for x in [random.uniform(0,1.0) for i in [1,2,3]]]

#====================================================================#

# function to create a random colour, copied from https://gist.github.com/adewes/5884820

def color_distance(c1,c2):
    return sum([abs(x[0]-x[1]) for x in zip(c1,c2)])

#====================================================================#

# function to create a random colour, copied from https://gist.github.com/adewes/5884820

def generate_new_color(existing_colors,pastel_factor = 0.5):
    max_distance = None
    best_color = None
    for i in range(0,100):
        color = get_random_color(pastel_factor = pastel_factor)
        if not existing_colors:
            return color
        best_distance = min([color_distance(color,c) for c in existing_colors])
        if not max_distance or best_distance > max_distance:
            max_distance = best_distance
            best_color = color
    return best_color

#====================================================================#

# choose 20 random colours:
hex_ = []
mycolours = []
names = []
for i in range(1,21):
    mycolour = generate_new_color(mycolours,pastel_factor = 0.9)
    mycolours.append(mycolour)
    myhex = colors.rgb2hex(mycolour)
    hex_.append(myhex)
    names.append(myhex)

n = len(hex_)
ncols = 4
nrows = int(np.ceil(1. * n / ncols))

fig, ax = plt.subplots()

X, Y = fig.get_dpi() * fig.get_size_inches()

# row height
h = Y / (nrows + 1)
# col width
w = X / ncols

for (i, color) in enumerate(hex_):
    name = names[i]
    col = i % ncols
    row = int(i / ncols)
    y = Y - (row * h) - h

    xi_line = w * (col + 0.05)
    xf_line = w * (col + 0.25)
    xi_text = w * (col + 0.3)

    ax.text(xi_text, y, name, fontsize=(h * 0.075),
            horizontalalignment='left',
            verticalalignment='center')

    # Add extra black line a little bit thicker to make
    # clear colors more visible.
    ax.hlines(y, xi_line, xf_line, color='black', linewidth=(h * 0.7))
    ax.hlines(y + h * 0.1, xi_line, xf_line, color=color, linewidth=(h * 0.6))

ax.set_xlim(0, X)
ax.set_ylim(0, Y)
ax.set_axis_off()

fig.subplots_adjust(left=0, right=1,
                    top=1, bottom=0,
                    hspace=0, wspace=0)
plt.savefig("random_colour_key.png")


Here is the picture it makes, called 'random_colour_key.png':

Friday, 27 November 2015

Plotting a phylogenetic tree with the alignment, using ete2 / ete3

I've been learning to use the ETE2 python library, to plot a phylogenetic tree, showing the alignment to the right of the tree, with the Pfam domains and intron positions plotted on it.

I noticed that TreeFam shows a picture like this for a family, eg. for BRCA2 family, but the gap positions are not taken into account when plotting the domain positions:

Also, the positions of introns are not given on the TreeFam picture. Gap positions are not shown on this picture either.

On the other hand, http://www.ensembl.org/gives a picture of a family alignment that shows the gap positions (as pale grey regions) and intron positions (as vertical black lines), but doesn't show Pfam domains, eg. for BRCA2 family:

So I want to make a picture that shows (i) the tree, (ii) the alignment, with gaps, (iii) Pfam domains on the alignment, taking alignment gap positions into account, (iv) intron positions on the alignment, taking gap positions into account. My boss also asked me to colour the branches according to which species group they belong to (where we had previously decided on species groups). A big to do list! And that is when ete2 came to my rescue...

Input phylogenetic tree and alignment
Here's an example phylogenetic tree and alignment that I used:

My tree, in Newick format, file 1113736.nw:
((((((RSKR_0000859440.1:0.01358,RSKR_0000844950.1:0.017319):0.093293,RSKR_0000469800.1:0.092903):0,RSKR_0000043800.1:0.303926):0.519822,RSKR_0000470050.1:1.97151):0.063083,RSKR_0000070650.1:1.08953):0,(((SSTP_0000209200.1:0.082369,SRAE_2000414400.1:0.108955):0.013672,SVE_2022200.1:0.39581):0.336291,PTRK_0000902350.1:0.568681):0.177623):0;

My alignment, in fasta format, file 1113736.aln2:
>RSKR_0000070650.1
------------------------------------------------------------
------------------------------------------------------------
-----------------------MMSLVLEKSFTCCLTFGVIFAFIPGSYHAVSPAGFVL
TNNEYTNYKSIATFLDKPIVDVGSLSIWTLVSINGATSILTLMVNVLTL-IYMKTTKKYN
YSKYDTRLAMYSLSVFISQQTYNF-LFYLAVFG--PTEDKSIFYLTKKATTNDFGVICAL
AKVNVKLF-----NLINNSKRRIISE---------NDKFA--PNMGYLSVNLNKDEVGFI
KFNDYRHEVSTIEVFGDFRNEKVKDWLDEMVALTLHLHLDIDP
>RSKR_0000043800.1
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
--------MPICTFTNVLPKGWEQVPTRNSVLLLACLSGATLIMNLMTFAAIFYRIRGHH
VKKTEIKMAVYSIFLFLSQQSYTC-VFYTGYFGG-SLPDLSINYFAKVIKPWAYVIFCLT
PAVSLICLSKNVRSIV--------REAIRGQMIGSQNKVG-STMIRTVKE----------
--------------TVLPVPKKLNRV-----------------
>RSKR_0000844950.1
-----MNETEEILILRIQKFFPFVTAYLTIASLVYHFLIAIIMCPLYAVIIIGIFKRRHT
SEFNSVFFKTAIISGVIDLLFFLDHFITGTCALYPPLASLIAPTTPTIWLGTVFFFGYWC
AYTLDLLALYISVNRFFVILSPIRGNIIFDRTFYYVMVFLACLAAAPNWFHIVSPAFFTI
QCNNYTNNIPIATFTNIPIEGYEKTATNNSVMIFVTLSSTTLVMNLMTLGAIIYRIRGQH
IIKTEIKLAIYSIFLFVSQQSYTC-LFYTGYFGG-NLPDMSISYFAKVIKPWAYVILCLT
PAISLMCLSNNVRSIV--------IEFVKEGKIESINKIGTTTQIKTIKQ----------
--------------IATASPRRLNVV-----------------
>SVE_2022200.1
MNDPATINTDDILLYKLQKYLPWFYVSIDTVSLIIHSIIAIFSYIFYGSIIVLLIKNRKN
KLFGTFFYYCVILNGFVNIFFHLDNFFTSVLLSYEPFLEKVAPTEPNYWYAPLYYIGFFF
THFLYYNIFIISVNRMVMLTWPLKVNHIFENKKKLFTIFGILIAGIPHLYLLVTPAYFSI
SSSIYTNNKNVGKFINVMWNGVEELPLKITAINCGIFTIATFLVNLFNIYLLMKKGHKK-
-----IKVVVVVIVHYVNKNKQPQRTVHSSAVGESSLP----------------------
------------------------------------------------------------
-------------------------------------------
>RSKR_0000469800.1
------------------------------------------------------------
-----------MISGVIDLIFFFDHFVTGTCALYSPYARMIAPTTPTLWLGGVYFFGFWC
AYSLDLLALYISINRFFMILFPIKAKLIFEQTFYYTMAFMACLAAAPAWFHIISLAYFTL
ISNEYTNNIPIAVFTNVPIKGYEKTATNNSVIILVVVSSATFAMNLMTFGGIMYKIRGHT
IKKTEIKMAIYSVFLFLSQQSYTC-LFYTGYFGG-SFPDLSISYFAKVIKPWAYVIFCLT
PAVSLMCLSKNVRSIV--------LEFVKGRKIGS-SQMG-TTQIKTIKE----------
--------------TIIAPSRRLNQV-----------------
>PTRK_0000902350.1
------------------------------------------------------------
------------------------------------------------------------
--------------------------------------------------MLGTPAYFKG
SKNEYTNFNSIATFSNVPFTGWKTVGYTTLAIYCGIISISTLLINCLIIYILISHKKGNL
-KSTDRKLAMYSILLFGSQLIYIV-MFYFDLFCDRNANDYSLLYFSAVIRGWNFNIFCLF
APYSLLIF--------NRDIRSLVLN------------------------LLFNTPV---
-----DVFVPSTKSNIQQNKKIIIKS-----------------
>SSTP_0000209200.1
------------------------------------------------------------
--------------------------------------MHVAPTQPSYWYSFLYYAGFFF
AQFLYYNTFIISINRVTILLSPLKATNIFEKNIKLLTTLGILISGIPHIYLLFTPAYFSV
TSSIYTNNKKIGTFANVMWKGYETMPLNIIAIYCGFFTIATFLVNLFIIYLLVKRSNKKN
-INSDTKMTIYGILLFISQMTYTS-MLYVGYFTNMNLEDKTIIYIFKIMRPWCFNILCLF
APYSLFVFNTNIRNMI--------FKKIH------------SFSIKKN------------
-------------------------------------------
>SRAE_2000414400.1
--MNLSSNTNDILLFKIQEYFPSFTVSLSLTSLLIHTIIAIISYTLYIFIIAFLIKNRKN
SIFNSFFYYSIILNGFCDISFHIDNFFTSNLLLYEPYLNKFAPTEPSFWFSPLFYIGYFF
VDFLFYNSFIISINRVVVIISPIKAKNIFETNIKLLTAFGLLLSGLPHLYLLTRPAYFSV
ESSKYTNDKKIGKFIILMWKGHETMPINIVAIYCGIFTVATFLVNLFIIYLLIKKRNKKN
-KNYDTKMTIYVILHFISQITYIS-ILYTGYFVDMNLEDQTIIYILKIVRPWCFNILCLF
APYSLIAFNTNIRKMI--------YEKIT------------FLFIQRLVT----------
--------------IFKNLHSCINIF-----------------
>RSKR_0000470050.1
----------------------MKWKYLFTASVIIHIVILIVSYTLYVIILYVLIKARKV
KNFDTCFYKSVIVLGIVDIFFHFDNLCTGLLPLWPPFAKAFFPTEPNYVLGVIYAIGFFL
SYFKFAICFYISLNR-----------------------FGILML----------------
------------------NEKYKIVNFKISQLFLPRFLADGKVSQIFDAFLTISLNKNKR
VI-----------------TKFNC-----------NRHEHIYHFLAT-------------
--------FDNVYSLI----------DIN-------------------------------
-------------------------------------------
>RSKR_0000859440.1
-----MNETEEILILRIQKFFPFVTAYLTIASLVYHFLIAIIMCPLYAVIIIGIFKRRHT
SEFNSVFFKTAIISGVIDLLFFLDHFITGTCALYPPLASLIAPTTPTIWLGTVFFFGYWC
AYTLDLLALYISVNRFFVILSPIRGNIIFDRTFYYVMVFLACLAAAPNWFHIVSPAFFTI
QCNNYTNNIPIATFTNIPIEGYEKTATNNSVMILVALSSTTLVMNLMTFGAIIYRIRGQH
IIKMEIKLAIYSIFLFVSQQSYTC-LFYTGYFGG-NLPDMSISYFAKVIKPWAYVIFCLT
PAISLMCLSKNVRSIV--------IEFVKEGKIESINKIGTTTQIKTIKQ----------
--------------IATASPRRLNVV-----------------


[Note to self: to get this type:
% get_tree_for_50HG_family_v75.pl 1113736 es9_compara_homology_final2_75 newick > 1113736.nw
% perl /nfs/users/nfs_a/alc/Documents/git/helminth_scripts/scripts/compara_50HG/get_aln_for_50HG_family_v75_fasta.pl 1113736 es9_compara_homology_final2_75 > 1113736.aln
% perl -w rename_seqs_in_50HG_aln.pl 1113736.aln > 1113736.aln2 ]

Simple example using ete2
An  example comes with ETE2 for plotting a tree with the multiple alignment, and showing domains on the alignment. I adjusted the example to give the following code. This works with ete2 version 2.2.1072, but not the latest version (2.3.10) for some reason. It generates a random tree with 10 leaves, and then randomly selects three motifs (domains) to plot on the alignment for each leaf. The region between motifs is in 'compactseq' format, meaning that the sequence positions are given colours by amino acid, and gaps shown as a black line. The plot it makes is shown below.

import sys
sys.path = ['/nfs/users/nfs_a/alc/Documents/git/Python/ete_egg/ete2-2.2.1072-py2.7.egg'] + sys.path # prepend to pythonpath, version 2.2.1072
from ete2 import Tree, SeqMotifFace, TreeStyle, add_face_to_node
import random

seq = "-----------------------------------------------AQAKIEKIKGSKKAIKVFSSAKYPAPERLQEYGSIFTDAQDPGLQRRPRHRIQSKQRPLDERALQEKLKDFPVCVSTKPEPEDDAEEGLGGLPSNISSVSSLLLFNTTENLYKKYVFLDPLAGAVTKTHVMLGAETEEKLFDAPLSISKREQLEQQVPENYFYVPDLGQVPEIDV
PSYLPDLPGIANDLMYIADLGPGIAPSAPGTIPELPTFHTEVAEPLKVGELGSGMGAGPGTPAHTPSSLDTPHFVFQTYKMGAPPLPPSTAAPVGQGARQDDSSSSASPSVQGAPREVVDPSGGWATLLESIRQAGGIGKAKLRSMKERKLEKQQQKEQEQVRATSQGGHLMSDLFNKLVMRRKGISGKGPGAGDGPGGAFARVSDSIPPLPPPQQPQAEDEDDWES"
motifs = [
    # seq.start, seq.end, shape, width, height, fgcolor, bgcolor
    [10, 100, "[]", None, 10, "black", "rgradient:blue", "arial|4|white|My domain"],
    [101, 150, "o", None, 10, "blue", "pink", None],
    [155, 180, "()", None, 10, "blue", "rgradient:purple", None],
    [160, 190, "^", None, 14, "black", "yellow", None],
    [172, 180, "v", None, 12, "black", "rgradient:orange", None],
    [185, 190, "o", None, 12, "black", "brown", None],
    [198, 200, "<>", None, 15, "black", "rgradient:gold", None],
    [310, 420, "<>", None, 10, "black", "rgradient:black", None],
    ]

def get_example_tree():
    # Create a random tree and add to each leaf a random set of motifs
    # from the original set
    t = Tree()
    t.populate(10)
    for l in t.iter_leaves():
         seq_motifs = random.sample(motifs,3)
         seqFace = SeqMotifFace(seq, seq_motifs, intermotif_format="compactseq", seqtail_format="compactseq", scale_factor=1)
         seqFace.margin_bottom = 4
         f = l.add_face(seqFace, 0, "aligned")

    return t

if __name__ == '__main__':
    t = get_example_tree()
    t.render("motifs.png", w=1200, dpi=300)

 









Plotting my own tree and alignment in ete2
I read the instructions on how to plot the tree and alignment in the ETE manual, and also on the ETE website. It took me a little fiddling, but I found I can do it using something like these commands in ete2 (where I use my own code to find the positions of Pfam domains and introns with respect to the alignment, taking gaps into account):

# Create my own layout function.
def layout(node):
    node_leaf_names = node.get_leaf_names()
    node_species_names = [species[x] for x in node_leaf_names]
    node_species_colours = [leafcolour[x] for x in node_species_names]
    node_species_colours_set = set(node_species_colours)
    if len(node_species_colours_set) == 1:
        colour = list(node_species_colours_set)[0]
        style2 = NodeStyle()
        style2["fgcolor"] = "black"
        style2["size"] = 3
        style2["vt_line_color"] = colour
        style2["hz_line_color"] = colour
        style2["vt_line_width"] = 2
        style2["hz_line_width"] = 2
        style2["vt_line_type"] = 0 # 0 solid, 1 dashed, 2 dotted
        style2["hz_line_type"] = 0
        node.img_style = style2


# load a tree from a file. The root node is put into 't'
t = Tree(newick_file, format=1) # format=1 means internal node names are loaded
 

# read in the alignment file: 
aln_seq = AvrilFastaUtils.read_fastas([fasta_alignment_file])
 

# read in the positions of Pfam domains in the genes in the tree:
gene_list = t.get_leaf_names()
 

# read in the colours to give to leaves:
global leafcolour # global so the layout function can see it
leafcolour = defaultdict()
fileObj = open(species_colours_file, "r")
for line in fileObj:
    line = line.rstrip()
    temp = line.split("\t")
    assert(len(temp)==4)
    species_name = temp[0]
    colour = temp[3]
    assert(species_name not in leafcolour)
    leafcolour[species_name] = colour
 

# read in the id. files to find the mapping between transcript ids. and the ids. used in interproscan output: 
global species # global so the layout function can see it
sys.stderr.write('Reading in the id. mapping files...\n')
(iprid, transcriptid, species) = FiftyHG_Pfam.read_id_files_for_all_species(id_dir) 
# the 'species' hash has the species for a transcript id. or ipr. id.
 

# get the ids. used in the interproscan output:
for gene in gene_list:
   if gene not in iprid:
        print("ERROR: gene is not in iprid:",gene)
        sys.exit(1)
    assert(gene in iprid)
    assert(gene in species)

gene_list2 = [iprid[gene] for gene in gene_list]

# get the ids. used in the gff files:
for gene in gene_list:
    if gene not in transcriptid:
        print("ERROR: gene is not in transcriptid:",gene)
        sys.exit(1)
    assert(gene in transcriptid)
    assert(gene in species)

 gene_list3 = [transcriptid[gene] for gene in gene_list]

# read in the positions of Pfam domains in the proteins, from the Interproscan output: 
species_set = set([species[gene] for gene in gene_list])
if pfam_or_introns == "pfam":
    sys.stderr.write('Reading in the Pfam domain positions...\n')
    (pfam, specieslist, pfampos) = FiftyHG_Pfam.read_pfam_files_for_all_species(pfam_dir, 'interproscan', gene_list2, species_set) # we want Pfam results from the interproscan files
 

# read in the positions of introns in genes in the tree:
if pfam_or_introns == "introns":
    sys.stderr.write('Reading in the intron positions...\n')
    intronpos = FiftyHG_Pfam.read_intron_positions_for_all_species(gff_dir, gene_list3, species_set) # don't use gene_list2, as different ids. are used in the interproscan results (gene_list2) and gff files
 

# define a list of colours for plotting domains:
mycolours = ["black","brown","pink","orange","yellow","magenta","cyan","red","green","blue"]
 

# now show the domains beside the tree. The code below is based on examples 
# on http://etetoolkit.org/docs/2.3/tutorial/tutorial_drawing.html#phylogenetic-trees-and-sequence-domains
domaincolour = defaultdict()
for l in t.iter_leaves():
    # get the alignment for this leaf:
    if l.name not in aln_seq:
        print("ERROR: gene is not in aln_seq:",l.name)
        sys.exit(1)
    assert(l.name in aln_seq)
    seq = aln_seq[l.name]
    aln_len = len(seq)

    # get the positions of gap regions for this leaf:
    # Note that before I tried to show gaps as yellow regions using 'seq_motifs = FiftyHG_Pfam.define_motifs_for_gaps(gaps)', but this isn't
    # necessary, as by default gaps are shown as a black line.
    gaps = FiftyHG_Pfam.find_gap_regions_for_gene(seq)


    # get the positions of Pfam domains for this leaf:
    seq_motifs = []
    assert(l.name in iprid)
    assert(l.name in species)
    id_name = iprid[l.name] # name used for this transcript in the interproscan output
    species_name = species[l.name]
    if pfam_or_introns == "pfam":
        if id_name in pfam[species_name]: # id_name is 'NA' if there are no Pfam domains
            domains = pfam[species_name][id_name]
            for domain in domains:
                domainname = "%s=%s=%s" % (species_name,id_name,domain)
                assert(domainname in pfampos)
                domainpos = pfampos[domainname]
                for pos in domainpos: # this Pfam domain may occur more than once in the sequence, so get all its positions
                    (start, end) = pos # this is the position of the Pfam domain with respect to the sequence
                    # get the position of the Pfam domain with respect to the alignment (ie. taking gaps into account):
                    (start2, end2) = FiftyHG_Pfam.get_domain_pos_with_respect_to_aln(start, end, gaps, aln_len)
                    if domain in domaincolour:
                        mycolour = domaincolour[domain]
                    else:
                        assert(len(mycolours)>1)
                        mycolour = mycolours.pop()
                        domaincolour[domain] =  mycolour
                    motifname = "arial|2|black|%s" % domain
                    motif = [start2, end2, "()", None, 20, mycolour, mycolour, motifname]
                    # Note: I tried to put the motifname in as '"arial|8|white|PF000124" as an extra argument, but this seems to fail
                    seq_motifs.append(motif)


    # now show the introns on the alignment:
    assert(l.name in transcriptid)
    id_name = '%s=%s' % (species_name, transcriptid[l.name])
    if pfam_or_introns == "introns":
        if id_name in intronpos: # not all genes have introns
            introns = intronpos[id_name]
            for intronstart in introns:
               # get the position of the intron with respect to the alignment (ie. taking gaps into account):
               (intronstart2, intronend2) = FiftyHG_Pfam.get_domain_pos_with_respect_to_aln(intronstart, intronstart, gaps, aln_len)
                intronend2 = intronstart2 + 3 # ete2 seems to require that the motif is at least 2 letters long to plot it (otherwise I get an error). I find it's hard to see unless I make it a few letters long.
                motif = [intronstart2, intronend2, "()", None, 30, "black", "yellow", None]
                seq_motifs.append(motif)

   # add all the domains for this leaf to the plot:
    seqFace = SeqMotifFace(seq, seq_motifs, intermotif_format="compactseq", seqtail_format="compactseq", scale_factor=1.0)    

    # see SeqMotifFace in https://pythonhosted.org/ete2/_downloads/ETE.pdf
    # 'seq' is the aa sequence, seq_motifs are motifs in the original sequence, intermotif_format says how spaces
    # between motifs should be filled.
    # Note that the ete manual says there should be a gapcolor parameter, but it doesn't seem to work for me. The same for height, fgcolor, bgcolor.
    # They seem to be set to default values of height=10, fgcolor=slategrey, bgcolor=slategrey, gapcolor=black, and also default value of
    # seq_format=blockseq. This means that gaps are coloured in black by default, and the rest of the alignment is coloured as 'blockseq' by
    # default, which I think is the same as compactseq, ie. shows a colour for each alignment position.
        seqFace.margin_bottom = 4
    f = l.add_face(seqFace, 0, position="aligned")
 

# Set the right margin of the tree. This is done in Class TreeStyle, attribute margin_right (pixels), default is 0. 
# See section 2.2.4 ("Tree Style") of the ETE manual for more detail.
ts = TreeStyle()
ts.layout_fn = layout
ts.margin_right = 30
ts.margin_left = 30
 

 write the picture to a file:
 t.render(output_file, w=2000, dpi=500, tree_style=ts)



I found that when ete2 plots the Pfam domains, the introns that lie within Pfam domains aren't plotted then, as it doesn't seem to be able to plot overlapping domains. So I put an option in my script to plot either the intron positions, or the Pfam domains, giving these two plots:

Introns:











Pfam domains:








 


Plotting my own tree and alignment in ete3
I then found out that the latest version of ete2, called ete3 (which works with recent Python2 versions and with Python3) can plot overlapping domains, so can be used to plot the intron positions (which I plot as narrow yellow domains) and Pfam domains. There is a nice tutorial for ete3.

I had to change this line in my script for it to work, and also told it to plot gap regions as blanks (a new feature of ete3) and the remaining sequence as a black line (using the 'compactseq' display looks too busy if you have both introns and Pfam domains plotted):
seqFace = SeqMotifFace(seq, motifs=seq_motifs, seq_format="line", gap_format="blank")













This shows the tree and alignment, with gaps; showing Pfam domains and introns on the alignment, taking gaps into account; and colours the branches by species group (here there is just one species group, so the branches are all the same colour).

Hurray! Thanks ete3!

Acknowledgements
Thanks to Jaime Huerta-Cepas on the ete mailing list for answering my questions.

How to run my final version of my script (notes for my forgetful brain)
This is for a gene family from our 50HG compara database:
1. Log into pcs5:
% ssh -Y pcs5
% cd /nfs/users/nfs_a/alc/Documents/git/Python

2.  Get the tree for a family from our mysql database of compara families:
eg. for family 622941:
% get_tree_for_50HG_family_v75.pl 622941 es9_compara_homology_final2_75 newick > 622941.nw

3. Get the alignment for a family in fasta format:
eg. for family 622941:
% perl /nfs/users/nfs_a/alc/Documents/git/helminth_scripts/scripts/compara_50HG/get_aln_for_50HG_family_v75_fasta.pl 622941 es9_compara_homology_final2_75 > 622941.aln   
  
 4. Rename the sequences in the alignment file to match those in the tree:
eg. for family 622941:
% perl -w /nfs/users/nfs_a/alc/Documents/000_50HG_InterestingFamilies/rename_seqs_in_50HG_aln.pl 622941.aln > 622941.aln2
  
5. Run the script to make the plot of the tree with domains and introns:
eg. for family 622941:
% python2 plot_tree_with_domains_with_ETE_ete3.py 622941.nw 622941.aln2  /nfs/helminths02/analysis/50HGP/00ANALYSES/final_interpro_domains_NEW /nfs/helminths02/analysis/50HGP/00ANALYSES/final_id_mapping /lustre/scratch108/parasites/bh4/Compara/50HG_Compara_75_Final/Ensembl_dumps/ /nfs/helminths02/analysis/50HGP/00ANALYSES/final_species_list/species_clades_08Oct.txt_colours  622941.svg

This makes file 622941.svg with the picture.

Making a key for colours in Python using matplotlib

I have different colours assigned to categories in a big plot, but don't have room in that plot to show a key for the colours. So I decided to make a separate key for the colours using matplotlib in Python.

I found a nice example of plotting colours in matplotlib called named_colors.py, and used that as my starting point.

My input: a list of colours, and labels for them
Then I edited the code a bit. I had a list of colours I'm interested in, and a list of labels for them:
hex_ =  [ u'#FFFF00', u'#006400',      u'#00ff7f',    u'#7cfc00',       u'#00F5FF', u'#0000ff',  u'#FFA500',     u'#8B8682',      u'#A78D84',           u'#A52A2A',         u'#ff1493',                     u'#ff69b4',         u'#FF0000' , u'#660000', u'#000000' ]
names = [ 'I' ,     'III-Ascaridida','III-Oxyurida','III-Spirurida',      'IVa',    'IVb',      'V-AS',      'V-Free-Living', 'V-Strongylid-Lungworm', 'V-Strongylid-Other', 'Trematodes-Schistosomatids', 'Trematodes-Other', 'Cestodes', 'Flatworms-Other', 'Outgroup']


My code
Here's the code I found to work:
"""
Visualization of named colors.

Simple plot example with the named colors and its visual representation.
"""
# Based on http://matplotlib.org/examples/color/named_colors.html

from __future__ import (absolute_import, division, print_function,
                        unicode_literals)

import numpy as np
import matplotlib.pyplot as plt


# Take my colors of interest:
hex_ =  [ u'#FFFF00', u'#006400',      u'#00ff7f',    u'#7cfc00',       u'#00F5FF', u'#0000ff',  u'#FFA500',     u'#8B8682',      u'#A78D84',           u'#A52A2A',         u'#ff1493',                     u'#ff69b4',         u'#FF0000' , u'#660000', u'#000000' ]
names = [ 'I' ,     'III-Ascaridida','III-Oxyurida','III-Spirurida',      'IVa',    'IVb',      'V-AS',      'V-Free-Living', 'V-Strongylid-Lungworm', 'V-Strongylid-Other', 'Trematodes-Schistosomatids', 'Trematodes-Other', 'Cestodes', 'Flatworms-Other', 'Outgroup']

n = len(hex_)
ncols = 4
nrows = int(np.ceil(1. * n / ncols))

fig, ax = plt.subplots()

X, Y = fig.get_dpi() * fig.get_size_inches()

# row height
h = Y / (nrows + 1)
# col width
w = X / ncols

for (i, color) in enumerate(hex_):
    name = names[i]
    col = i % ncols
    row = int(i / ncols)
    y = Y - (row * h) - h

    xi_line = w * (col + 0.05)
    xf_line = w * (col + 0.25)
    xi_text = w * (col + 0.3)

    ax.text(xi_text, y, name, fontsize=(h * 0.075),
            horizontalalignment='left',
            verticalalignment='center')

    # Add extra black line a little bit thicker to make
    # clear colors more visible.
    ax.hlines(y, xi_line, xf_line, color='black', linewidth=(h * 0.7))
    ax.hlines(y + h * 0.1, xi_line, xf_line, color=color, linewidth=(h * 0.6))

ax.set_xlim(0, X)
ax.set_ylim(0, Y)
ax.set_axis_off()

fig.subplots_adjust(left=0, right=1,
                    top=1, bottom=0,
                    hspace=0, wspace=0)
plt.savefig("colour_key.png")


Output of my code:
This makes this key for my colours:

















Other Notes:
- Something useful I found is a list of all the named colours in matplotlib, and their hex values: here
- Our collaborator Bruce Rosa pointed out the nice website http://paletton.com for choosing colour schemes for figures. Another nice one is colorbrewer

Monday, 23 November 2015

Import a Python module from an egg file

To import a Python module directly from an egg file (eg. to over-ride a globally installed version), you can do this: (eg. for the ete2 module)

sys.path = ['/nfs/users/nfs_a/alc/Documents/git/Python/ete_egg/ete2-2.2.1072-py2.7.egg'] + sys.path # prepend to pythonpath, version 2.2.1072
 

from ete2 import Tree, SeqMotifFace, TreeStyle, add_face_to_node


Using Pickle in Python

If you want to run a Python script many times, and each time it is going to read in the same values into a dictionary (hash table), and it takes a long time to create the dictionary (eg. because it's reading in a lot of data from files), then you can use a Python Pickle!

Here's a quick example:

if os.path.exists("myid.p"):
        myid = pickle.load(open("myid.p", "rb"))
    else:
        myid = read_id_files(id_dir)
        pickle.dump(myid, open("myid.p", "wb"))

Friday, 20 November 2015

Finding the version of a Python module

To find the version of a Python module, eg. the ete2 module, that is installed:
% python2
>>> import pkg_resources
>>> pkg_resources.get_distribution("ete2").version
'2.2.1072'
This means we have ete2 version 2.2.1072.

Find out where the Python module is loaded from:
>>> import ete2
>>> print ete2.__file__
/software/pathogen/external/lib/python2.7/site-packages/ete2-2.3.10-py2.7.egg/ete2/__init__.pyc
>>> import ete3
>>> print ete3.__file__
/software/pathogen/external/lib/python2.7/site-packages/ete3-3.0.0b7-py2.7.egg/ete3/__init__.pyc

Friday, 30 October 2015

Excel and Google spreadsheet tips

I intend to make some notes on Excel here, as I'm not very good at it so far!

Summarise a column with a categorical variable
If you have a column with names (eg. 'new', 'old', 'ancient') in Excel (eg. column A1:A20) and then want to summarise how many you have of each value, you can get the number of 'new' values by typing in another cell: =SUMIF(A1:A20, 'new')

Count number of cells in a column that contain certain text
eg. count cells in a column R1:R37017 that contain the text 'GO::0006508':
=COUNTIF(R1:R37017,"*GO:0006508*")

Google spreadsheet tips

Fill down
Select the original cell and cells below it, then press CTRL+d.

If statement
This has format: IF(test, value to take if test is true, value to take if test is false)
For example:
=IF(D240>F240,C240,E240)

Colour cells in a column
You can use conditional formatting rules. Select the cells you want to format, then go to Format menu->Conditional formatting, and type in the rule. For example, I wanted to colour the cells in a column (values in H3:H255) with a green background if they were equal to the values in cells D3:D255. I used the formula '=D3:D55' (with 'is equal to') for this.

Adding a new column with ranks of another column
To add a column with ranks of values in another column, in the 2nd column put for example in cell B3: =RANK(A3,$A$3:$A$55) to give the rank of cell A3 in column A3:A55.

Sorting columns by values in a particular column
Select the columns that you want to sort, including headers. Then go to Data->Sort range, and click the box to say your columns have headers, and choose the column that you want to sort by. [I did a little check, and hidden columns are sorted too.]

Friday, 23 October 2015

Filtering a repeat library for matches to proteins/RNAs

My colleague Eleanor Stanley came up with a nice protocol for filtering a repeat library for matches to proteins/RNAs (thanks Eleanor!).

Get known protein-coding regions and RNAs:
The first step is to download C. elegans protein-coding regions and RNAs from WormBase, and flatworm proteins from GeneDB:

% ln -s ~wormpub/BUILD/WORMPEP/wormpep$WB/wormpep.dna$WB
% ln -s ~wormpub/BUILD/WORMRNA/wormrna$WB/wormrna$WB.rna
% chado_dump_transcripts_coding -o Smansoni > Sma.fa

% chado_dump_transcripts_coding -o Emultilocularis > Emu.fa
% cat Sma.fa Emu.fa > flatworm_transcripts.fa

Make blast databases:
% formatdb -p F -i flatworm_transcripts.fa
% formatdb -p F -i wormpep.dna235
% formatdb -p F -i wormrna235.rna

Run blast against the known protein-coding regions and RNAs
Next discard 'unknown' repeats (ie. classified by RepeatClassifier as 'unknown') from your library that have blast matches of e-value <= 0.001 to the known protein-coding regions and RNAs.
This can be done using a shell script ~alc/Documents/000_50HG_Repeats/filter_lib.sh below, by running:
% ~alc/Documents/000_50HG_Repeats/filter_lib.sh my_repeat.lib
where my_repeat.lib is the repeat library.
This makes a file my_repeat.lib.filtered with the repeats left after this filtering step, and a file my_repeat.lib.notwanted of the repeats discarded by this step.

#!/bin/sh

# Filters consensi.fa against roundworm and flatworm ncRNAs and ORFs to remove
# repeats that hit genes.
# taking lines from ~es9/pipeline/repmod_filter.sh:

# $1 is merged library
export flatworm_REP=/lustre/scratch108/parasites/alc/000_50HG_Repeats/merged_libraries_NEW/flatworm_transcripts.fa
export roundworm_dna_REP=/lustre/scratch108/parasites/alc/000_50HG_Repeats/merged_libraries_NEW/wormpep.dna235
export roundworm_rna_REP=//lustre/scratch108/parasites/alc/000_50HG_Repeats/merged_libraries_NEW/wormrna235.rna
blastall -a 1 -b 10 -v 10 -p blastn -e 0.001 -m 8 -d $flatworm_REP -i $1 -o flatworm.blast
blastall -a 1 -b 10 -v 10 -p blastn -e 0.001 -m 8 -d $roundworm_dna_REP -i $1 -o roundworm_dna.blast
blastall -a 1 -b 10 -v 10 -p blastn -e 0.001 -m 8 -d $roundworm_rna_REP -i $1 -o roundworm_rna.blast
awk '{print $1}' flatworm.blast | sort -u > hit_all
awk '{print $1}' roundworm_dna.blast | sort -u >> hit_all
awk '{print $1}' roundworm_rna.blast | sort -u >> hit_all
sort -u hit_all > hits
# just take hits to unknown repeats:
cat hits | grep -i unknown > hits2
rm -f hit_all

# Remove all hits from $1

/nfs/users/nfs_j/jit/repository/pathogen/user/jit/converter//fasta2single.pl $1 > $1.v2
perl ~es9/pipeline/fasta_retrieve_subsets.pl $1.v2 hits2
echo ""
echo "Filtered species specific repeats are now available:"
echo "Number of repeats in $1.v2.filtered: "
grep -c '^>' $1.v2.filtered
echo "Number of repeats in $1.v2.notwanted: "
grep -c '^>' $1.v2.notwanted
echo "This matches number of repeats in the hits file: "
wc -l hits2


Thursday, 22 October 2015

Getting a p-value for a T test in R (cumulative distribution functions)

Here's how to get a P-value for a t-test in R (I always forget how to do this!) This uses the cumulative distribution function.

Cumulative distribution function for a t-distributed variable
If your t-value is 0.1558572 and degrees of freedom is 22:
> 2*pt(-abs(t),df=n-1)
[1] 0.8776341
It doesn't matter if your t-value is negative.

Cumultative distribution function for a Normally distributed variable
We can also calculate it for a Normally distributed variable, eg. P(Z <= 0.9), where  Z is a standard Normal variable is:
> pnorm(0.9)
[1] 0.8159399

Cumulative distribution function for a Poisson distributed variable
eg. P(X > 11), where X is a Poisson(8) variable:
> ppois(11, lower.tail=FALSE)
[1] 0.111924

Filtering multiple alignments

Li Heng's TreeBeST software can be used to filter multiple alignments to get just the conserved columns.  This is described in the TreeFam paper:
'The alignment is then filtered to retain only conserved regions, by using CLUSTALX (25) with the BLOSUM62 scoring matrix (26) to calculate a score for each alignment column. The scores are scaled to be in the range 0 to 100, and columns having scores of <15 are removed.'

Input alignment in clustal format
For example, you might start with an alignment in CLUSTAL format.
(Note to self: get 50HG alignment for family 1272479, in clustal format:
% get_aln_for_50HG_family_v75.pl 1272479 es9_compara_homology_final2_75 > family1272479.aln

Convert input alignment to fasta format
Then convert clustal format alignment to fasta format:
% seqret -sequence family1272479.aln -outseq family1272479.aln.fasta

Filter alignment using treebest
% /nfs/users/nfs_a/alc/Documents/bin/treebest/treebest filter family1272479.aln.fasta > family1272479.aln.filt
Note: my colleague Diogo Ribeiro noticed that if a sequence has "*", TreeBest simply ignores those sequences and reports: "[ma_add_to_ma] variable length in multialignment! skip!"

Find percent of alignment filtered
Get length of original alignment:
% get_sequence_lengths.pl family1272479.aln.fa . | cut -d" " -f5 | sort | uniq
eg. 1057 in this case
Get length of filtered alignment:
% get_sequence_lengths.pl family1272479.aln.filt . | cut -d" " -f5 | sort | uniq
eg. 275 in this case

View the filtered alignment
You can view the filtered alignment if you like using Mview Mview (at EBI):
You can see here that the filtered alignment doesn't look very conserved, but seems to have some conservation. 275/1057 = 26% of columns in the original alignment were kept by treebest, which isn't many.
Check if the remaining filtered alignment is mostly low complexity sequence
In another example, there seems to be mostly low complexity regions:
 If we use SEG to find regions of low complexity, it turns out that most of the columns (left after treebest filtering) are low complexity :
% segmasker -in family1264954.aln.filt -out family1264954.aln.filt.seg gives us output file family1264954.aln.filt.seg :
>Bm14703_6279/1-203
71 - 80
141 - 153
>BPAG_0000876701-mRNA-1_6280/1-367
65 - 80
123 - 138
141 - 153
>BTMF_0001217401-mRNA-1_42155/1-166
71 - 80
>TELCIR_15305_45464/1-161
59 - 83
76 - 102
113 - 168
>nRc.2.0.1.t36918-RA_13658/1-578
69 - 80
126 - 160

This tells us for example that, of the 169 alignment columns, for TELCIR_15305_45464 columns 59-83,76-102,113-168 are low complexity (some overlap between regions).

Monday, 5 October 2015

online Latex equation editor

For making a nice picture of an equation (eg. to put into Powerpoint), I like to use Roger's online equation editor. You can type in the Latex for your equation, and it gives you a lovely picture. For example, I typed in:

C_{v} = \frac{s}{\bar{x}}

and got:

Monday, 14 September 2015

LTRharvest and LTRdigest

The LTRharvest software can be used to find LTR retrotransposons in a genome assembly. One of the authors is my colleague Sascha Steinbiss. The other authors are David Ellinghaus, Stefan Kurtz, and Ute Willhoeft. The software is described in a paper by Ellinghaus et al (2008).

There is a nice manual for the software.

To run LTRharvest
LTRharvest runs on a suffix array index. First create a suffix array index for your genome assembly, by typing:
% gt suffixerator -db genome.fa -indexname genome.fa -tis -suf -lcp -des -ssp -sds -dna
where genome.fa is your assembly fasta file.
Note that this sometimes needs a lot of memory (RAM), eg. needed 10Gbyte for a particular tricky assembly (1259 Mbase in 482,000 pieces!)

Then you can run LTRharvest:
% gt ltrharvest -index genome.fa 
This is very fast to run, only seconds! 

A useful option is the -gff3 option, which makes an output file in gff3 format:
% gt ltrharvest -index genome.fa -gff3 genome.fa.gff

The -out option makes an output fasta file of the predicted LTR retrotransposon sequences:
% gt ltrharvest -index genome.fa -gff3 genome.fa.gff -out genome.fa.ltr.fa

Note: Sascha suggested an even better way to run LTRharvest is to run:
% gt ltrharvest -index genome.fa -seqids yes -tabout no > ltrharvest.out
This makes an output file with the regular sequence identifiers in the genome.fa input fasta file for the assembly.

The LTRharvest program can work on gzipped fasta files.

For detailed help:
% gt ltrharvest -help

Output from LTRharvest
The output file looks like this:
# args=-index strongyloides_ratti.fa -gff3 strongyloides_ratti.ltr.gff
# predictions are reported in the following way
# s(ret) e(ret) l(ret) s(lLTR) e(lLTR) l(lLTR) s(rLTR) e(rLTR) l(rLTR) sim(LTRs) seq-nr
# where:
# s = starting position
# e = ending position
# l = length
# ret = LTR-retrotransposon
# lLTR = left LTR
# rLTR = right LTR
# sim = similarity
# seq-nr = sequence number
14138  15733  1596  14138  14422  285  15452  15733  282  86.32  0
81529  83510  1982  81529  81874  346  83161  83510  350  94.29  0
206851  211458  4608  206851  207044  194  211263  211458  196  94.90  0
246562  248980  2419  246562  247311  750  248221  248980  760  87.11  0
 

...
Each non-comment line reports a LTR retrotransposon prediction, with the start and end positions of the whole LTR retrotransposon, the start and end positions of the left LTR instance, and the start and end of the right LTR instance. For each of these elements, the corresponding element length is reported, as well as a %similarity of the two LTRs. The last integer of the line denotes the number of the input sequence, where the input sequence numbers are counted from 0.
    That is, the columns are:
Retrotransposon_start Retrotransposon_end Retrotransposon_length Left_LTR_start Left_LTR_end Left_LTR_length Right_LTR_start Right_LTR_end Right_LTR_length Percent_identity Sequence_number

The gff output file will look like this, with the position of the whole LTR retrotransposon plus target site duplications ('repeat_region'), of the LTR retrotransposon itself ('LTR_retrotransposon'), of target site duplications ('target_site_duplication'), and of the LTRs ('long_terminal_repeat'):
seq0    LTRharvest      repeat_region   14134   15737   .       ?       .       ID=repeat_region1
seq0    LTRharvest      target_site_duplication 14134   14137   .       ?       .       Parent=repeat_region1
seq0    LTRharvest      LTR_retrotransposon     14138   15733   .       ?       .       ID=LTR_retrotransposon1;Parent=repeat_region1;ltr_similarity=86.32;seq_number=0
seq0    LTRharvest      long_terminal_repeat    14138   14422   .       ?       .       Parent=LTR_retrotransposon1
seq0    LTRharvest      long_terminal_repeat    15452   15733   .       ?       .       Parent=LTR_retrotransposon1
seq0    LTRharvest      target_site_duplication 15734   15737   .       ?       .       Parent=repeat_region1


LTRdigest
The LTRdigest software can be used to post-process LTRharvest, to weed out potential false positives. LTRharvest just gives you the positions of LTR retrotransposons, but LTRdigest annotates internal features of LTR retrotransposons. The software is described in a paper by Steinbiss et al (2009). There is a user manual.

For help:
% gt ltrdigest -help

The basic command line is:
% gt ltrdigest input_gff  indexname > gff_result_file
where input_gff is an input gff file of LTR retrotransposon regions (eg. the ones found by LTRharvest),
and indexname is the name of the suffix array index for the assembly (eg. 'genome.fa'),
gff_result_file is a gff file of results.
The input gff file must be sorted by position, which can be done using:
% gt gff3 -sort unsorted_input_gff > sorted_input_gff


Other useful options are:
-hmms : specify a list of pHMM files in HMMER2 format for domain search, eg. pHMMs from Pfam (eg. HMM1.hmm, HMM2.hmm, HMM3.hmm). Shell globbing can be used here to specify a large number of files, eg. by using wildcards, eg. '-hmms HMM*.hmm'. For example, the LTRdigest paper by Steinbiss et al (2009) gives a list of LTR retrotransposon Pfam domains in tables B1 and B2. For example, PF03732 = Retrotrans_gag; we can download the HMM from Pfam using:
% wget http://pfam.xfam.org/family/PF03732/hmm
and convert it to HMMER2 format (which ltrdigest needs) using:
% hmmconvert -2 hmm > newhmm
% mv newhmm PF03732.hmm
I also downloaded all the HMMs from GyDB.

-outfileprefix mygenome-ltrs : write all output, such as sequences, to files beginning with "mygenome-ltrs"

A little LTRdigest protocol
Here are some steps that Sascha kindly suggested to me, for filtering the LTRharvest results using LTRdigest, in order to get a library (fasta file) of full-length (or mostly full-length) LTR retrotransposon sequences for my genome assembly:

1. Run LTRharvest (to get output files unsorted_input_gff and genome.fa.ltr.fa). [or gff file ltrharvest.out]

2. Run LTRdigest on the LTRharvest results, with selected protein HMMs from Pfam and GyDB. The command would be something like:
% gt gff3 -sort unsorted_input_gff > sorted_input_gff
% gt ltrdigest -hmms hmmdir/*hmm -outfileprefix myspecies_ltrdigest sorted_input_gff indexname > myspecies_ltrdigest_output_gff
[Or if you got file ltrharvest.out from step 1, then:
% gt gff3 -sort ltrharvest.out > sorted_input_gff
% gt ltrdigest -hmms hmmdir/*hmm -outfileprefix myspecies_ltrdigest -encseq indexname -matchdescstart < sorted_input_gff > myspecies_ltrdigest_output_gff  ]

3. To remove all LTR retrotransposon candidates that don't have any domain hit at all (to help get rid of things that might not be LTR retrotransposon insertions):
% gt select -rule_files ~ss34/filters/filter_protein_match.lua -- < myspecies_ltrdigest_output_gff > m myspecies_ltrdigest_output_gff2
[Note: to filter all candidates that don't have a *full* set of domain hits, use ~ss34/filters/filter_full_domain_set.lua]
Note: 26-Oct-2018: the script filter_protein_match.lua is: (thanks to Steve Doyle for finding Sascha's script)
name        = "Protein Domain Filter"
author      = "Sascha Kastens"
version     = "1.0"
email       = "mail@skastens.de"
short_descr = "Filters out candidates without protein domains"
description = "Filters out a candidate if it does not contain at " ..
              "least one node of type 'protein_match'."

function filter(gn)
  gfi = gt.feature_node_iterator_new(gn)
  node = gfi:next()
  while not (node == nil) do
    if (node:get_type() == "protein_match") then
      return false
    end
    node = gfi:next()
  end
  return true
end


4. Extra the DNA sequences for the remaining full-length elements:
% perl -w ~alc/Documents/000_50HG_Repeats/get_ltr_retrotransposon_seqs.pl genome.fa.ltr.fa myspecies_ltrdigest_output_gff2 > myspecies_ltrdigest_output_gff2.fa
Note I've put the get_ltr_retrotransposon_seqs.pl on gist following a request.
[Or if you got file ltrharvest.out from step 1, then:
% gt extractfeat -type LTR_retrotransposon -encseq genome.fa myspecies_ltrdigest_output_gff2 > myspecies_ltridgest_output_gff2.fa ]

[Note: I tried to run these steps in parallel for many species on the Sanger compute farm, but had some issues with the HMMER runs, which I've reported to Sascha, so ended up running them one-species-at-a-time.]

Acknowledgements
Thanks Sascha for help and advice!

Friday, 11 September 2015

Installing a Python module locally

I decided to try to install a Python module in my home directory. I decided to start with the pygraphviz module.

Attempt 1: using 'pip install'
To install a Python module in my home directory using 'pip install', I think I should be able to type for example, on the Sanger farm:
(in /nfs/users/nfs_a/alc/Documents/PythonModules)
% pip install --install-option="--prefix=/nfs/users/nfs_a/alc/Documents/PythonModules" pygraphviz
This made a subdirectory /nfs/users/nfs_a/alc/Documents/PythonModules/share/doc/pygraphviz-1.3.1
However, this doesn't seem to have installed the module properly, I'm not sure why..

What did Beckett say? 'No matter. Try again. Fail again. Fail better'.

Attempt 2: using 'setup.py'
As a second attempt, I went to the pygraphfiz download page, and downloaded the source for the module:
% wget https://pypi.python.org/packages/source/p/pygraphviz/pygraphviz-1.3.1.tar.gz#md5=7f690295dfe77edaa9e552d09d98d279
Then:
% tar -xvf pygraphviz-1.3.1.tar.gz
Put the installed version of the module in ~alc/Documents/PythonModulesInstall, and tell the installer where to find graphviz (this is an extra requirement for pygraphviz, it didn't work when I let it try to find graphviz by itself):
% python3 setup.py install --home=/nfs/users/nfs_a/alc/Documents/PythonModulesInstall --include-path=/usr/include/graphviz/ --library-path=/usr/lib/graphviz
This seemed to run ok. It made a directory ~alc/Documents/PythonModulesInstall/lib/python/pygraphviz.

Now try running the tests:
% python3 setup_egg.py nosetests
This didn't work for me, as there didn't seem to be any setup_egg.py file. I must be missing something obvious. Oh well!

Now try importing the module:
% cd ~alc/Documents/PythonModulesInstall/lib/python/
% python3
>> import pygraphviz
>> dir(pygraphviz)
['AGraph', 'Attribute', 'DotError', 'Edge', 'ItemAttribute', 'Node', '__all__', '__author__', '__builtins__', '__cached__', '__date__', '__doc__', '__file__', '__initializing__', '__license__', '__loader__', '__name__', '__package__', '__path__', '__revision__', '__version__', 'absolute_import', 'agraph', 'division', 'graphviz', 'print_function', 'release', 'test', 'tests', 'version']
>> print(pygraphviz.__version__)
1.3.1

It's working, yay!

21-Nov-2018: Another example: installing pysvg:
This time in my Python directory (/nfs/users/nfs_a/alc/Documents/git/Python) I tried:
% pip install --install-option="--prefix=/nfs/users/nfs_a/alc/Documents/PythonModules" pysvg

Then I can see a directory pysvg in /nfs/users/nfs_a/alc/Documents/PythonModules/lib/python3.6/site-packages/

9-Mar-2022: using 'conda install'
Today I wanted to install scipy on the Sanger farm. In /nfs/users/nfs_a/alc/Documents/git/Python I typed:
% conda install scipy 
and it told me:
environment location: /lustre/scratch118/infgen/team133/alc/000_FUGI_CGrunau_Chromatin/RunningGDA/miniconda
The following packages will be downloaded: 
scipy-1.7.3
I can now see a directory: /lustre/scratch118/infgen/team133/alc/000_FUGI_CGrunau_Chromatin/RunningGDA/miniconda/pkgs/scipy-1.7.3-py39hc147768_0
When I typed:
% python3
and then 
>> import scipy
it seemed to work fine!
 
6-Dec-2022: using a virtual environment
I found a nice website at Sanger on the Sanger internal wiki about using a virtual environment to install Python modules. I was able to use this to install the PyPDF2 module:
(in my directory /nfs/users/nfs_a/alc/Documents/git/Python):
First I set up a virtual environment called 'vibriowatch_pypdf':
% virtualenv -p /usr/bin/python3 vibriowatch_pypdf
Then I activated it:
% source vibriowatch_pypdf/bin/activate
Then I installed PyPDF2 in that virtual environment:
% pip install PyPDF2
Then I can use the PyPDF2 module in a script that imports that Python module, e.g.
% python3 extract_data_from_pdf_file.py Monir2022_SuppTable1.pdf
If I need to run PyPDF2 from another directory on the Sanger farm, I seem to have to activate the 'vibriowatch_pypdf' environment first:
% source /nfs/users/nfs_a/alc/Documents/git/Python/vibriowatch_pypdf/bin/activate
When I want to finish using the virtual environment, I need to type:
% deactivate
That was easy, hurray!
 
Another example (30-Mar-2023): 
I wanted to run the VISTA tool developed by Corin Yeats at Pathogenwatch. First I downloaded it:
% git clone https://github.com/pathogenwatch-oss/vista/
It was put in a local directory VISTA.
I also downloaded the data files references.fasta and metadata.toml from https://github.com/pathogenwatch-oss/vista/tree/main/data
Format the database for VISTA:
% cp vista/data/references.fasta vista/data/references
% makeblastdb -in vista/data/references -input_type fasta -dbtype nucl
Then I tried to run it:
% python3 vista/vista.py 2010EL_1786.fasta vista/data > result.json
where 2010EL_1786.fasta is an example input fasta file for a genome.
It told me that it needed a Python module 'toml' and 'Bio':
To install this, in my directory /nfs/users/nfs_a/alc/Documents/git/Python, I set up a virtual environment called 'vibriowatch_vista':
% virtualenv -p /usr/bin/python3 vibriowatch_vista
Then I activated it:
% source vibriowatch_vista/bin/activate     
Then I installed toml:
% pip install toml         
Then I installed Bio:
% pip install Bio
Then back in the directory where I had downloaded VISTA from git, I activated the virtual environment:
% source /nfs/users/nfs_a/alc/Documents/git/Python/vibriowatch_vista/bin/activate
Then I was able to run VISTA ok:
% python3 vista/vista.py 2010EL_1786.fasta vista/data > result.json
This gave a result.json file with the output. Hurray! I have installed the Python libraries toml and Bio and run VISTA!

Finding the maximum flow in a network using Python

To find a maximum flow in a network, we can use a maximum flow algorithm, using the Python networkx library.

Applications of the maximum flow algorithm
Example applications of this are:
(i) finding how much gas can flow through a given pipeline network (eg. in a day),
(ii) finding how many vehicles can pass through a town (in a given period of time).

Bioinformatics examples include:
(i) identifying putative disease genes, by finding the maximum 'information' flow in a network (Chen et al 2011) that takes into account phenotype similarities and protein-protein interactions. The vertices in the network are diseases and genes; and the edges are phenotype similarities between pairs of diseases, and protein-protein interactions between pairs of genes, and all known associations between diseases and genes. For each disease node, it finds the gene node that has the maximum 'information' flow to that disease node.
(ii) partitioning a protein sequence into putative protein domains (Yu et al (2000). The vertices in the network are residues of a protein, and the edges are residue-residue contacts. The protein is split into two domains by finding the 'maximum flow' along edges, where the flow capacities of edges are set in a way that means that residues (nodes) in the same domain have greater flow, based on training data (as far as I understand). To split the protein into multiple domains, the algorithm can be run again and again.

The maximum flow algorithm in Python
At the moment, networkx finds the maximum flow in a network using the highest-label variant of the preflow-push algorithm. As input, we need to know the network structure (nodes and directed edges) and also know flow capacities (maximum flow values) for each edge. We also need to specify a source (start) node for the flow, and a sink (end) node for the flow.

The max-flow min-cut theorem says that the maximum flow must be equal to the value of a minimum cut, ie. to the minimum possible value for a set of a cut, that is, for a set of edges that, when removed from the network, causes the situation that no flow can pass from the source node to the sink node.

Download and install the latest version of the library from the download page (v1.10 as I speak).  (Thanks Martin Aslett for this!) (Note to self: you can install Python packages using 'pip install' or setup.py).

Enter the network data
First we need to enter our network data:
import networkx 
print(networkx.__version__) # check the versionnn
    1.10 
DG=networkx.DiGraph() # make a directed graph (digraph)
DG.add_nodes_from(["S","A","B","C","D","E","T"]) # add nodes
DG.add_edges_from([("S","A"),("S","B"),("S","C"),("A","B"),("A","D"),("B","D"),("B","C"),
    ("B","E"),("C","E"),("D","T"),("E","T")])  
# specify the capacity values for the edges:
networkx.set_edge_attributes(DG, 'capacity', {('S','A'): 5.0, ('S','B'): 6.0, ('S','C'): 8.0, 
    ('A','B'): 4.0, ('A','D'): 10.0, ('B','D'): 3.0, ('B','C'): 2.0, ('B','E'): 11.0, 
    ('C','E'): 6.0, ('D','T'): 9.0, ('E','T'): 4.0})

Plot our  network (just for fun!)
For this I've used the pygraphviz module, as I found it makes nicer pictures of digraphs than networkx:
import pygraphviz
G=pygraphviz.AGraph(strict=False,directed=True)
nodelist = ['A', 'B', 'C', 'D', 'E', 'S', 'T']
G.add_nodes_from(nodelist)
G.add_edge('S','A',label='5')
G.add_edge('S','B',label='6')
G.add_edge('S','C',label='8')
G.add_edge('A','B',label='4')
G.add_edge('A','D',label='10')
G.add_edge('B','D',label='3')
G.add_edge('B','E',label='11')
G.add_edge('B','C',label='2')
G.add_edge('C','E',label='6')
G.add_edge('D','T',label='9')
G.add_edge('E','T',label='4')
G.layout()
G.draw('file.png')














Notes:
- sometimes G.layout(prog='dot') gives a nicer picture.
- to make coloured nodes using pygraphviz:
  G.node_attr['style'] = 'filled'
  n = G.get_node('T')
  n.attr['fillcolor'] = 'red'
  n = G.get_node('D')
  n.attr['fillcolor'] = 'red'

Find the maximum flow
Now find the maximum flow, specifying node "S" as the source node for the flow, node "T" as the sink node for the flow,
networkx.maximum_flow_value(DG, s="S", t="T")
 12.0
networkx.maximum_flow(DG, s="S", t="T")
(12.0, {'S': {'C': 4.0, 'B': 3.0, 'A': 5.0}, 'C': {'E': 4.0}, 'B': {'C': 0, 'E': 0, 'D': 3.0}, 
    'A': {'B': 0, 'D': 5.0}, 'D': {'T': 8.0}, 'E': {'T': 4.0}, 'T': {}})
networkx.minimum_cut(DG, s="S", t="T")
(12.0, ({'C', 'B', 'S', 'E'}, {'A', 'T', 'D'}))
networkx.minimum_cut_value(DG, s="S", t="T")
12.0


This tells us that the value of a minimum cut (and therefore of the maximum flow) is 12.0. The minimum cut given separates nodes A, T, and D from the other nodes, so includes edges SA, AB, BD, and ET. The capacity of this cut is 5 (from SA) + 3 (from BD) + 4 (from ET) = 12.0. So the maximum flow is 12.0!

(Note that the arc AB is not included in this calculation since it is directed from a vertex in the set {A, D, T} to a vertex in the set {S, B, C, E}.)

Example 2 of finding the maximum flow
import networkx
DG=networkx.DiGraph() # make a directed graph (digraph)
DG.add_nodes_from(["S","A","B","C","D","T"]) # add nodes
DG.add_edges_from([("S","A"),("S","B"),("S","C"),("A","B"),("A","C"),("B","D"),("C","D"),("C","T"),("D","T")])
# specify the capacity values for the edges:
networkx.set_edge_attributes(DG, 'capacity', {('S','A'): 2.0, ('S','B'): 4.0, ('S', 'C'): 5.0, ('A', 'B'): 3.0, ('A', 'C'): 3.0, ('B', 'D'): 3.0, ('C', 'D'): 2.0, ('C', 'T'): 4.0, ('D', 'T'): 6.0})
networkx.maximum_flow_value(DG, s="S", t="T")
9.0
networkx.maximum_flow(DG, s="S", t="T")
(9.0, {'S': {'C': 4.0, 'B': 3.0, 'A': 2.0}, 'C': {'D': 2.0, 'T': 4.0}, 'B': {'D': 3.0}, 'A': {'C': 2.0, 'B': 0}, 'D': {'T': 5.0}, 'T': {}})
networkx.minimum_cut(DG, s="S", t="T")
(9.0, ({'C', 'B', 'A', 'S'}, {'T', 'D'}))
networkx.minimum_cut_value(DG, s="S", t="T")
9.0