Friday 27 November 2015

Plotting a phylogenetic tree with the alignment, using ete2 / ete3

I've been learning to use the ETE2 python library, to plot a phylogenetic tree, showing the alignment to the right of the tree, with the Pfam domains and intron positions plotted on it.

I noticed that TreeFam shows a picture like this for a family, eg. for BRCA2 family, but the gap positions are not taken into account when plotting the domain positions:

Also, the positions of introns are not given on the TreeFam picture. Gap positions are not shown on this picture either.

On the other hand, http://www.ensembl.org/gives a picture of a family alignment that shows the gap positions (as pale grey regions) and intron positions (as vertical black lines), but doesn't show Pfam domains, eg. for BRCA2 family:

So I want to make a picture that shows (i) the tree, (ii) the alignment, with gaps, (iii) Pfam domains on the alignment, taking alignment gap positions into account, (iv) intron positions on the alignment, taking gap positions into account. My boss also asked me to colour the branches according to which species group they belong to (where we had previously decided on species groups). A big to do list! And that is when ete2 came to my rescue...

Input phylogenetic tree and alignment
Here's an example phylogenetic tree and alignment that I used:

My tree, in Newick format, file 1113736.nw:
((((((RSKR_0000859440.1:0.01358,RSKR_0000844950.1:0.017319):0.093293,RSKR_0000469800.1:0.092903):0,RSKR_0000043800.1:0.303926):0.519822,RSKR_0000470050.1:1.97151):0.063083,RSKR_0000070650.1:1.08953):0,(((SSTP_0000209200.1:0.082369,SRAE_2000414400.1:0.108955):0.013672,SVE_2022200.1:0.39581):0.336291,PTRK_0000902350.1:0.568681):0.177623):0;

My alignment, in fasta format, file 1113736.aln2:
>RSKR_0000070650.1
------------------------------------------------------------
------------------------------------------------------------
-----------------------MMSLVLEKSFTCCLTFGVIFAFIPGSYHAVSPAGFVL
TNNEYTNYKSIATFLDKPIVDVGSLSIWTLVSINGATSILTLMVNVLTL-IYMKTTKKYN
YSKYDTRLAMYSLSVFISQQTYNF-LFYLAVFG--PTEDKSIFYLTKKATTNDFGVICAL
AKVNVKLF-----NLINNSKRRIISE---------NDKFA--PNMGYLSVNLNKDEVGFI
KFNDYRHEVSTIEVFGDFRNEKVKDWLDEMVALTLHLHLDIDP
>RSKR_0000043800.1
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
--------MPICTFTNVLPKGWEQVPTRNSVLLLACLSGATLIMNLMTFAAIFYRIRGHH
VKKTEIKMAVYSIFLFLSQQSYTC-VFYTGYFGG-SLPDLSINYFAKVIKPWAYVIFCLT
PAVSLICLSKNVRSIV--------REAIRGQMIGSQNKVG-STMIRTVKE----------
--------------TVLPVPKKLNRV-----------------
>RSKR_0000844950.1
-----MNETEEILILRIQKFFPFVTAYLTIASLVYHFLIAIIMCPLYAVIIIGIFKRRHT
SEFNSVFFKTAIISGVIDLLFFLDHFITGTCALYPPLASLIAPTTPTIWLGTVFFFGYWC
AYTLDLLALYISVNRFFVILSPIRGNIIFDRTFYYVMVFLACLAAAPNWFHIVSPAFFTI
QCNNYTNNIPIATFTNIPIEGYEKTATNNSVMIFVTLSSTTLVMNLMTLGAIIYRIRGQH
IIKTEIKLAIYSIFLFVSQQSYTC-LFYTGYFGG-NLPDMSISYFAKVIKPWAYVILCLT
PAISLMCLSNNVRSIV--------IEFVKEGKIESINKIGTTTQIKTIKQ----------
--------------IATASPRRLNVV-----------------
>SVE_2022200.1
MNDPATINTDDILLYKLQKYLPWFYVSIDTVSLIIHSIIAIFSYIFYGSIIVLLIKNRKN
KLFGTFFYYCVILNGFVNIFFHLDNFFTSVLLSYEPFLEKVAPTEPNYWYAPLYYIGFFF
THFLYYNIFIISVNRMVMLTWPLKVNHIFENKKKLFTIFGILIAGIPHLYLLVTPAYFSI
SSSIYTNNKNVGKFINVMWNGVEELPLKITAINCGIFTIATFLVNLFNIYLLMKKGHKK-
-----IKVVVVVIVHYVNKNKQPQRTVHSSAVGESSLP----------------------
------------------------------------------------------------
-------------------------------------------
>RSKR_0000469800.1
------------------------------------------------------------
-----------MISGVIDLIFFFDHFVTGTCALYSPYARMIAPTTPTLWLGGVYFFGFWC
AYSLDLLALYISINRFFMILFPIKAKLIFEQTFYYTMAFMACLAAAPAWFHIISLAYFTL
ISNEYTNNIPIAVFTNVPIKGYEKTATNNSVIILVVVSSATFAMNLMTFGGIMYKIRGHT
IKKTEIKMAIYSVFLFLSQQSYTC-LFYTGYFGG-SFPDLSISYFAKVIKPWAYVIFCLT
PAVSLMCLSKNVRSIV--------LEFVKGRKIGS-SQMG-TTQIKTIKE----------
--------------TIIAPSRRLNQV-----------------
>PTRK_0000902350.1
------------------------------------------------------------
------------------------------------------------------------
--------------------------------------------------MLGTPAYFKG
SKNEYTNFNSIATFSNVPFTGWKTVGYTTLAIYCGIISISTLLINCLIIYILISHKKGNL
-KSTDRKLAMYSILLFGSQLIYIV-MFYFDLFCDRNANDYSLLYFSAVIRGWNFNIFCLF
APYSLLIF--------NRDIRSLVLN------------------------LLFNTPV---
-----DVFVPSTKSNIQQNKKIIIKS-----------------
>SSTP_0000209200.1
------------------------------------------------------------
--------------------------------------MHVAPTQPSYWYSFLYYAGFFF
AQFLYYNTFIISINRVTILLSPLKATNIFEKNIKLLTTLGILISGIPHIYLLFTPAYFSV
TSSIYTNNKKIGTFANVMWKGYETMPLNIIAIYCGFFTIATFLVNLFIIYLLVKRSNKKN
-INSDTKMTIYGILLFISQMTYTS-MLYVGYFTNMNLEDKTIIYIFKIMRPWCFNILCLF
APYSLFVFNTNIRNMI--------FKKIH------------SFSIKKN------------
-------------------------------------------
>SRAE_2000414400.1
--MNLSSNTNDILLFKIQEYFPSFTVSLSLTSLLIHTIIAIISYTLYIFIIAFLIKNRKN
SIFNSFFYYSIILNGFCDISFHIDNFFTSNLLLYEPYLNKFAPTEPSFWFSPLFYIGYFF
VDFLFYNSFIISINRVVVIISPIKAKNIFETNIKLLTAFGLLLSGLPHLYLLTRPAYFSV
ESSKYTNDKKIGKFIILMWKGHETMPINIVAIYCGIFTVATFLVNLFIIYLLIKKRNKKN
-KNYDTKMTIYVILHFISQITYIS-ILYTGYFVDMNLEDQTIIYILKIVRPWCFNILCLF
APYSLIAFNTNIRKMI--------YEKIT------------FLFIQRLVT----------
--------------IFKNLHSCINIF-----------------
>RSKR_0000470050.1
----------------------MKWKYLFTASVIIHIVILIVSYTLYVIILYVLIKARKV
KNFDTCFYKSVIVLGIVDIFFHFDNLCTGLLPLWPPFAKAFFPTEPNYVLGVIYAIGFFL
SYFKFAICFYISLNR-----------------------FGILML----------------
------------------NEKYKIVNFKISQLFLPRFLADGKVSQIFDAFLTISLNKNKR
VI-----------------TKFNC-----------NRHEHIYHFLAT-------------
--------FDNVYSLI----------DIN-------------------------------
-------------------------------------------
>RSKR_0000859440.1
-----MNETEEILILRIQKFFPFVTAYLTIASLVYHFLIAIIMCPLYAVIIIGIFKRRHT
SEFNSVFFKTAIISGVIDLLFFLDHFITGTCALYPPLASLIAPTTPTIWLGTVFFFGYWC
AYTLDLLALYISVNRFFVILSPIRGNIIFDRTFYYVMVFLACLAAAPNWFHIVSPAFFTI
QCNNYTNNIPIATFTNIPIEGYEKTATNNSVMILVALSSTTLVMNLMTFGAIIYRIRGQH
IIKMEIKLAIYSIFLFVSQQSYTC-LFYTGYFGG-NLPDMSISYFAKVIKPWAYVIFCLT
PAISLMCLSKNVRSIV--------IEFVKEGKIESINKIGTTTQIKTIKQ----------
--------------IATASPRRLNVV-----------------


[Note to self: to get this type:
% get_tree_for_50HG_family_v75.pl 1113736 es9_compara_homology_final2_75 newick > 1113736.nw
% perl /nfs/users/nfs_a/alc/Documents/git/helminth_scripts/scripts/compara_50HG/get_aln_for_50HG_family_v75_fasta.pl 1113736 es9_compara_homology_final2_75 > 1113736.aln
% perl -w rename_seqs_in_50HG_aln.pl 1113736.aln > 1113736.aln2 ]

Simple example using ete2
An  example comes with ETE2 for plotting a tree with the multiple alignment, and showing domains on the alignment. I adjusted the example to give the following code. This works with ete2 version 2.2.1072, but not the latest version (2.3.10) for some reason. It generates a random tree with 10 leaves, and then randomly selects three motifs (domains) to plot on the alignment for each leaf. The region between motifs is in 'compactseq' format, meaning that the sequence positions are given colours by amino acid, and gaps shown as a black line. The plot it makes is shown below.

import sys
sys.path = ['/nfs/users/nfs_a/alc/Documents/git/Python/ete_egg/ete2-2.2.1072-py2.7.egg'] + sys.path # prepend to pythonpath, version 2.2.1072
from ete2 import Tree, SeqMotifFace, TreeStyle, add_face_to_node
import random

seq = "-----------------------------------------------AQAKIEKIKGSKKAIKVFSSAKYPAPERLQEYGSIFTDAQDPGLQRRPRHRIQSKQRPLDERALQEKLKDFPVCVSTKPEPEDDAEEGLGGLPSNISSVSSLLLFNTTENLYKKYVFLDPLAGAVTKTHVMLGAETEEKLFDAPLSISKREQLEQQVPENYFYVPDLGQVPEIDV
PSYLPDLPGIANDLMYIADLGPGIAPSAPGTIPELPTFHTEVAEPLKVGELGSGMGAGPGTPAHTPSSLDTPHFVFQTYKMGAPPLPPSTAAPVGQGARQDDSSSSASPSVQGAPREVVDPSGGWATLLESIRQAGGIGKAKLRSMKERKLEKQQQKEQEQVRATSQGGHLMSDLFNKLVMRRKGISGKGPGAGDGPGGAFARVSDSIPPLPPPQQPQAEDEDDWES"
motifs = [
    # seq.start, seq.end, shape, width, height, fgcolor, bgcolor
    [10, 100, "[]", None, 10, "black", "rgradient:blue", "arial|4|white|My domain"],
    [101, 150, "o", None, 10, "blue", "pink", None],
    [155, 180, "()", None, 10, "blue", "rgradient:purple", None],
    [160, 190, "^", None, 14, "black", "yellow", None],
    [172, 180, "v", None, 12, "black", "rgradient:orange", None],
    [185, 190, "o", None, 12, "black", "brown", None],
    [198, 200, "<>", None, 15, "black", "rgradient:gold", None],
    [310, 420, "<>", None, 10, "black", "rgradient:black", None],
    ]

def get_example_tree():
    # Create a random tree and add to each leaf a random set of motifs
    # from the original set
    t = Tree()
    t.populate(10)
    for l in t.iter_leaves():
         seq_motifs = random.sample(motifs,3)
         seqFace = SeqMotifFace(seq, seq_motifs, intermotif_format="compactseq", seqtail_format="compactseq", scale_factor=1)
         seqFace.margin_bottom = 4
         f = l.add_face(seqFace, 0, "aligned")

    return t

if __name__ == '__main__':
    t = get_example_tree()
    t.render("motifs.png", w=1200, dpi=300)

 









Plotting my own tree and alignment in ete2
I read the instructions on how to plot the tree and alignment in the ETE manual, and also on the ETE website. It took me a little fiddling, but I found I can do it using something like these commands in ete2 (where I use my own code to find the positions of Pfam domains and introns with respect to the alignment, taking gaps into account):

# Create my own layout function.
def layout(node):
    node_leaf_names = node.get_leaf_names()
    node_species_names = [species[x] for x in node_leaf_names]
    node_species_colours = [leafcolour[x] for x in node_species_names]
    node_species_colours_set = set(node_species_colours)
    if len(node_species_colours_set) == 1:
        colour = list(node_species_colours_set)[0]
        style2 = NodeStyle()
        style2["fgcolor"] = "black"
        style2["size"] = 3
        style2["vt_line_color"] = colour
        style2["hz_line_color"] = colour
        style2["vt_line_width"] = 2
        style2["hz_line_width"] = 2
        style2["vt_line_type"] = 0 # 0 solid, 1 dashed, 2 dotted
        style2["hz_line_type"] = 0
        node.img_style = style2


# load a tree from a file. The root node is put into 't'
t = Tree(newick_file, format=1) # format=1 means internal node names are loaded
 

# read in the alignment file: 
aln_seq = AvrilFastaUtils.read_fastas([fasta_alignment_file])
 

# read in the positions of Pfam domains in the genes in the tree:
gene_list = t.get_leaf_names()
 

# read in the colours to give to leaves:
global leafcolour # global so the layout function can see it
leafcolour = defaultdict()
fileObj = open(species_colours_file, "r")
for line in fileObj:
    line = line.rstrip()
    temp = line.split("\t")
    assert(len(temp)==4)
    species_name = temp[0]
    colour = temp[3]
    assert(species_name not in leafcolour)
    leafcolour[species_name] = colour
 

# read in the id. files to find the mapping between transcript ids. and the ids. used in interproscan output: 
global species # global so the layout function can see it
sys.stderr.write('Reading in the id. mapping files...\n')
(iprid, transcriptid, species) = FiftyHG_Pfam.read_id_files_for_all_species(id_dir) 
# the 'species' hash has the species for a transcript id. or ipr. id.
 

# get the ids. used in the interproscan output:
for gene in gene_list:
   if gene not in iprid:
        print("ERROR: gene is not in iprid:",gene)
        sys.exit(1)
    assert(gene in iprid)
    assert(gene in species)

gene_list2 = [iprid[gene] for gene in gene_list]

# get the ids. used in the gff files:
for gene in gene_list:
    if gene not in transcriptid:
        print("ERROR: gene is not in transcriptid:",gene)
        sys.exit(1)
    assert(gene in transcriptid)
    assert(gene in species)

 gene_list3 = [transcriptid[gene] for gene in gene_list]

# read in the positions of Pfam domains in the proteins, from the Interproscan output: 
species_set = set([species[gene] for gene in gene_list])
if pfam_or_introns == "pfam":
    sys.stderr.write('Reading in the Pfam domain positions...\n')
    (pfam, specieslist, pfampos) = FiftyHG_Pfam.read_pfam_files_for_all_species(pfam_dir, 'interproscan', gene_list2, species_set) # we want Pfam results from the interproscan files
 

# read in the positions of introns in genes in the tree:
if pfam_or_introns == "introns":
    sys.stderr.write('Reading in the intron positions...\n')
    intronpos = FiftyHG_Pfam.read_intron_positions_for_all_species(gff_dir, gene_list3, species_set) # don't use gene_list2, as different ids. are used in the interproscan results (gene_list2) and gff files
 

# define a list of colours for plotting domains:
mycolours = ["black","brown","pink","orange","yellow","magenta","cyan","red","green","blue"]
 

# now show the domains beside the tree. The code below is based on examples 
# on http://etetoolkit.org/docs/2.3/tutorial/tutorial_drawing.html#phylogenetic-trees-and-sequence-domains
domaincolour = defaultdict()
for l in t.iter_leaves():
    # get the alignment for this leaf:
    if l.name not in aln_seq:
        print("ERROR: gene is not in aln_seq:",l.name)
        sys.exit(1)
    assert(l.name in aln_seq)
    seq = aln_seq[l.name]
    aln_len = len(seq)

    # get the positions of gap regions for this leaf:
    # Note that before I tried to show gaps as yellow regions using 'seq_motifs = FiftyHG_Pfam.define_motifs_for_gaps(gaps)', but this isn't
    # necessary, as by default gaps are shown as a black line.
    gaps = FiftyHG_Pfam.find_gap_regions_for_gene(seq)


    # get the positions of Pfam domains for this leaf:
    seq_motifs = []
    assert(l.name in iprid)
    assert(l.name in species)
    id_name = iprid[l.name] # name used for this transcript in the interproscan output
    species_name = species[l.name]
    if pfam_or_introns == "pfam":
        if id_name in pfam[species_name]: # id_name is 'NA' if there are no Pfam domains
            domains = pfam[species_name][id_name]
            for domain in domains:
                domainname = "%s=%s=%s" % (species_name,id_name,domain)
                assert(domainname in pfampos)
                domainpos = pfampos[domainname]
                for pos in domainpos: # this Pfam domain may occur more than once in the sequence, so get all its positions
                    (start, end) = pos # this is the position of the Pfam domain with respect to the sequence
                    # get the position of the Pfam domain with respect to the alignment (ie. taking gaps into account):
                    (start2, end2) = FiftyHG_Pfam.get_domain_pos_with_respect_to_aln(start, end, gaps, aln_len)
                    if domain in domaincolour:
                        mycolour = domaincolour[domain]
                    else:
                        assert(len(mycolours)>1)
                        mycolour = mycolours.pop()
                        domaincolour[domain] =  mycolour
                    motifname = "arial|2|black|%s" % domain
                    motif = [start2, end2, "()", None, 20, mycolour, mycolour, motifname]
                    # Note: I tried to put the motifname in as '"arial|8|white|PF000124" as an extra argument, but this seems to fail
                    seq_motifs.append(motif)


    # now show the introns on the alignment:
    assert(l.name in transcriptid)
    id_name = '%s=%s' % (species_name, transcriptid[l.name])
    if pfam_or_introns == "introns":
        if id_name in intronpos: # not all genes have introns
            introns = intronpos[id_name]
            for intronstart in introns:
               # get the position of the intron with respect to the alignment (ie. taking gaps into account):
               (intronstart2, intronend2) = FiftyHG_Pfam.get_domain_pos_with_respect_to_aln(intronstart, intronstart, gaps, aln_len)
                intronend2 = intronstart2 + 3 # ete2 seems to require that the motif is at least 2 letters long to plot it (otherwise I get an error). I find it's hard to see unless I make it a few letters long.
                motif = [intronstart2, intronend2, "()", None, 30, "black", "yellow", None]
                seq_motifs.append(motif)

   # add all the domains for this leaf to the plot:
    seqFace = SeqMotifFace(seq, seq_motifs, intermotif_format="compactseq", seqtail_format="compactseq", scale_factor=1.0)    

    # see SeqMotifFace in https://pythonhosted.org/ete2/_downloads/ETE.pdf
    # 'seq' is the aa sequence, seq_motifs are motifs in the original sequence, intermotif_format says how spaces
    # between motifs should be filled.
    # Note that the ete manual says there should be a gapcolor parameter, but it doesn't seem to work for me. The same for height, fgcolor, bgcolor.
    # They seem to be set to default values of height=10, fgcolor=slategrey, bgcolor=slategrey, gapcolor=black, and also default value of
    # seq_format=blockseq. This means that gaps are coloured in black by default, and the rest of the alignment is coloured as 'blockseq' by
    # default, which I think is the same as compactseq, ie. shows a colour for each alignment position.
        seqFace.margin_bottom = 4
    f = l.add_face(seqFace, 0, position="aligned")
 

# Set the right margin of the tree. This is done in Class TreeStyle, attribute margin_right (pixels), default is 0. 
# See section 2.2.4 ("Tree Style") of the ETE manual for more detail.
ts = TreeStyle()
ts.layout_fn = layout
ts.margin_right = 30
ts.margin_left = 30
 

 write the picture to a file:
 t.render(output_file, w=2000, dpi=500, tree_style=ts)



I found that when ete2 plots the Pfam domains, the introns that lie within Pfam domains aren't plotted then, as it doesn't seem to be able to plot overlapping domains. So I put an option in my script to plot either the intron positions, or the Pfam domains, giving these two plots:

Introns:











Pfam domains:








 


Plotting my own tree and alignment in ete3
I then found out that the latest version of ete2, called ete3 (which works with recent Python2 versions and with Python3) can plot overlapping domains, so can be used to plot the intron positions (which I plot as narrow yellow domains) and Pfam domains. There is a nice tutorial for ete3.

I had to change this line in my script for it to work, and also told it to plot gap regions as blanks (a new feature of ete3) and the remaining sequence as a black line (using the 'compactseq' display looks too busy if you have both introns and Pfam domains plotted):
seqFace = SeqMotifFace(seq, motifs=seq_motifs, seq_format="line", gap_format="blank")













This shows the tree and alignment, with gaps; showing Pfam domains and introns on the alignment, taking gaps into account; and colours the branches by species group (here there is just one species group, so the branches are all the same colour).

Hurray! Thanks ete3!

Acknowledgements
Thanks to Jaime Huerta-Cepas on the ete mailing list for answering my questions.

How to run my final version of my script (notes for my forgetful brain)
This is for a gene family from our 50HG compara database:
1. Log into pcs5:
% ssh -Y pcs5
% cd /nfs/users/nfs_a/alc/Documents/git/Python

2.  Get the tree for a family from our mysql database of compara families:
eg. for family 622941:
% get_tree_for_50HG_family_v75.pl 622941 es9_compara_homology_final2_75 newick > 622941.nw

3. Get the alignment for a family in fasta format:
eg. for family 622941:
% perl /nfs/users/nfs_a/alc/Documents/git/helminth_scripts/scripts/compara_50HG/get_aln_for_50HG_family_v75_fasta.pl 622941 es9_compara_homology_final2_75 > 622941.aln   
  
 4. Rename the sequences in the alignment file to match those in the tree:
eg. for family 622941:
% perl -w /nfs/users/nfs_a/alc/Documents/000_50HG_InterestingFamilies/rename_seqs_in_50HG_aln.pl 622941.aln > 622941.aln2
  
5. Run the script to make the plot of the tree with domains and introns:
eg. for family 622941:
% python2 plot_tree_with_domains_with_ETE_ete3.py 622941.nw 622941.aln2  /nfs/helminths02/analysis/50HGP/00ANALYSES/final_interpro_domains_NEW /nfs/helminths02/analysis/50HGP/00ANALYSES/final_id_mapping /lustre/scratch108/parasites/bh4/Compara/50HG_Compara_75_Final/Ensembl_dumps/ /nfs/helminths02/analysis/50HGP/00ANALYSES/final_species_list/species_clades_08Oct.txt_colours  622941.svg

This makes file 622941.svg with the picture.

No comments: