avrilomics: 2014

Tuesday 3 June 2014

Using FigTree for plotting phylogenetic trees

The FigTree software is used a lot in my group for plotting phylogenetic trees, and makes lovely pictures.

Running FigTree on the Sanger farm: [of interest to Sanger users only]
% ssh -Y farm3-login
% bsub -o o -e e -R "select[mem>1000] rusage[mem=1000]" -M1000 "/software/bin/java -Xmx1G -jar ~bh4/apps/FigTree_v1.4.2/lib/figtree.jar"
(This has to be bsubbed, for some reason it won't run on the head node)

Once FigTree has opened, load the tree using File->Open.

Thanks to my colleague Bhavana Harsha for help with this.

Thursday 3 April 2014

Using Reapr to assess genome assembly quality

I've been learning to use my colleague Martin Hunt's fantastic Reapr software to assess quality of genome assemblies.

Reapr is described in a recent paper by Martin.

Running Reapr

There are detailed instructions on how to run Reapr in the Reapr manual. I've written a few notes here to remind myself of some key points:

1. Reapr facheck
To check whether the scaffold/contig names in your assembly will be acceptable to Reapr, type:
% reapr facheck assembly.fa
where assembly.fa is your assembly file.

2. Mapping your large-insert reads
You will need a bam file of your mapped large-insert read-pairs. The bam file should be sorted by coordinate, indexed, and have duplicates either marked or removed [note to self: this is true for the bam files made by Daria for the 50 HG QC steps]. It's recommended to use the smalt mapper with -x -r options [note to self: this is true for the bam files made by Daria for the 50 HG QC steps].
     Reads in a pair should be pointed towards each other (should be 'innies'): if you're not sure if this is the case, you can run 'bamcheck input.bam', and this will tell you the number of 'inward oriented pairs' and 'outward oriented pairs'. You should have all or nearly all inward oriented pairs. If your reads are mainly outties (which may be the case for large-insert reads), you will need to get their reverse complement, and map them again, to make sure they are innies for Reapr (you could use Reapr to do the mapping step).
     Reapr can map your reads for you, using smalt, if you don't have a bam file already (see the the Reapr manual ).
   Note: if you have bam files from several libraries, you can combine them into one bam, as long as the libraries have approximately the same insert size distribution. If you need to choose from multiple long-insert libraries, the Reapr FAQ says to choose the longest with enough coverage.
     Note: the Reapr website recommends to use version 0.7.0.1 of SMALT with the -f samsoft option [note to self: the bam files made by Daria for the 50 HG QC steps were made using -f samsoft, but with smalt 0.7.4].

3. Reads for calling error-free bases
Reapr can take fastq files of short-insert read-pairs, to call error-free bases in the assembly (actually usually short-insert reads are used for this, but large-insert reads could be too if you don't have any short-insert reads). Like the large-insert read-pairs (see above), these read-pairs should be 'innies'.
   You don't map these reads for you, Reapr maps them itself in the 'reapr perfectmap' step (see below).
   Note: if you have fastq files from several libraries, you can combine them (into one for forward reads, one for reverse), as long as the libraries have the same insert size distribution.

4. Calling error-free bases in the assembly
If you want to call error-free bases in the assembly, run:
% reapr perfectmap assembly.fa short_1.fq short_2.fq i_size perfect_prefix
where assembly.fa is your assembly file,
short_1.fq, short_2.fq are your fastq files for your short-insert read-pairs,
i_size is the insert-size for your short-insert read-pair library,
perfect_prefix is a prefix that is given to the output files.
The files short_1.fq and short_2.fq can be zipped files, eg. short_1.fq.gz and short_2.fq.gz.
Note: it's important that all the reads are the same length.
Note 2: for very large genomes (over a few 100 Mbase), you might want to use the 'perfectbam' step instead (see the Reapr manual for details).

5. Running the Reapr pipeline
To run the main Reapr pipeline, you type:
% reapr pipeline assembly.fa long_mapped.bam outdir perfect_prefix
where assembly.fa is your assembly file,
long_mapped.bam is your bam file of mapped long-insert read-pairs (from step 2 above),
outdir is the name that Reapr will give to the output directory,
perfect_prefix is the prefix given to the output files from step 4 (see above), if step 4 was run (this parameter is optional).
    Note that the stages of the 'reapr pipeline' command are 'preprocess', 'stats', 'score' and 'break'. These will be run with one call to 'reapr pipeline'. Alternatively, you can run them separately (by running 'reapr preprocess', 'reapr stats', etc.).
    Note: the Reapr manual says that if you don't have any short-insert reads for running step 4, you might want to skip step 4, and so skip the 'perfect_prefix' option in 'reapr pipeline', as long-insert reads probably won't give you very accurate error-free bases in step 4.

6. Viewing the output files
The output files will be in the directory 'outdir' created by step 5 above. The most important are:
- 03.score.errors.gff : a report of the errors found,
- 04.break.broken_assembly.fa : a new version of the assembly, with scaffolds broken based on the errors found,
- 05.summary.report.txt : a summary of the errors found in the assembly, plus contiguity statistics (N50, etc.) of the original and broken assemblies.

The 4 types of errors in 05.summary.report.txt are:
1. FCD errors within a contig
2. FCD error over a gap
3. Low fragment coverage within a contig
4. Low fragment coverage over a gap

7. Sanity checks
The Reapr manual recommends several sanity checks:
- look at file 00.Sample/gc.vs.cov.lowess.pdf and check the GC/coverage bias looks ok,
- look at file 00.Sample/insert.in.pdf and check it shows the insert size distribution for your inserts,

- look at file 00.Sample/insert.stats.txt and check the numbers look ok,

Running Reapr using Martin's reapr_wrapper script (Sanger users only)

Sanger users could run Reapr using the reapr_wrapper script, following these steps:
1. Make a template config file called config_file by typing:
% ~mh12/git/python3/reapr_wrapper.py -t config_file

2. Edit the config file for your data. The main things to change are:
- the lane id. for your short-insert library, eg.
short_insert_ID 6937_6
- the insert size for your short-insert library, eg. (for mean insert size 510)
short_insert_isize 510 [note to self: if you already have a bam, you can get this using bamcheck]
- the lane id. for your large-insert library, eg.
large_insert_ID 6980_8
- the insert size (-i option) to use in smalt for your large-insert library, eg. (for mean insert size 2498, with sd of 813, we might use 2498+3*813=4937=approx 5000)
large_insert_map_options -x -r 1 -i 5000 -y 0.5
(-i specifies the maximum insert size in smalt)
- if the read-pairs from your large-insert library are 'outties', since Reapr needs 'innies', we need to put:
   large_insert_revcomp 1
- the genome fasta file, eg.
genome_fasta /nfs/helminths02/analysis/50HGP/Parastrongyloides.trichosuri/ASSEMBLY/PTRK.v2.QC.fa_v2
- the output directory (must not exist already):
output_directory output_directory /lustre/scratch108/parasites/alc/StrongyloidesReapr/p_trichosuri/reapr_output/

3. Run reapr by typing:
% reapr_wrapper.py config_file

Tuesday 18 March 2014

Adding filters in Apple Mail

I use the Apple 'Mail' program to read my email, and a lot of emails that aren't (from mailing lists, etc.) To set up filters in 'Mail', you can do the following:
1. Go to the 'Mail' menu -> choose 'Preferences', click on the 'Rules' tab.
2. Set up a new rule for filtering messages.

Tuesday 4 March 2014

Querying the chado database

The chado database lies behind Genedb. To carry out queries, you can log into chado directly by typing (from within Sanger):
> ssh pcs5
> chado [then type your chado password]
Then within chado, you can type queries, and put the output in a file.
For example, to get a list of all the Schistosoma mansoni genes that have a note containing the word 'manual' (to find all manually curated genes), and save them in a file 'smansoni_curated', we can type:

\o smansoni_curated

select gene.uniquename as gene
, prop.value as note
from feature gene
join featureprop prop on gene.feature_id = prop.feature_id
join cvterm prop_type on prop.type_id = prop_type.cvterm_id
join cv prop_type_cv on prop_type.cv_id = prop_type_cv.cv_id
join organism on gene.organism_id = organism.organism_id
where prop_type_cv.name = 'feature_property' and prop_type.name = 'comment'
and organism.genus = 'Schistosoma' and organism.species = 'mansoni'
and prop.value like '%manual%'
;

I got this example from the Sample_Chado_queries website.
There are also more sample chado queries on the Useful_chado_queries website.
A third useful webpage is the Extracting_data_from_a_Chado_database website.

Thanks to my colleagues Magdalena, Matt and Anna for help.

Notes:
- to exit chado, I seem to have to type CTRL+D

Monday 3 March 2014

Retrieving annotations from chado (genedb)

The database behind Genedb is called Chado.

Getting a gff file for Schistosoma mansoni from chado
To extract annotations for a species (eg. Schistosoma mansoni) from Chado, you can use a shell script like this on the Sanger farm (for Sanger users only):

#!/bin/bash

export WRITEDB_ENTRIES_PASSWORD=xxx
export output="/lustre/scratch108/parasites/alc/50HGI_FuncAnnotn/Smansoni_chado_dump"
rm -rf $output;
mkdir -p $output;
bsub -o $output/bsub.o -e $output/bsub.e -q long \
-M2500 -R "select[mem>2500] rusage[mem=2500]" \
writedb_entries.py -t -o Smansoni -i -d pgsrv1:5432/pathogens?genedb -x $output

I've replaced the password information with 'xxx', to keep the password secret!

Getting all flatworm transcripts from chado
This is from my colleague Eleanor Stanley (thanks Eleanor).
% chado_dump_transcripts_coding -o Smansoni > Sma.fa
% chado_dump_transcripts_coding -o Emultilocularis > Emu.fa
% cat Sma.fa Emu.fa > flatworm_transcripts.fa

Thursday 20 February 2014

Retrieving the protein sequences for a species from an Ensembl Compara database

Today I wanted to retrieve the protein sequences for Schistosoma mansoni from an Ensembl Compara database that my colleagues Alessandra Traini and Eleanor Stanley built for a project in our group.

I knew that the S. mansoni genome had been loaded into the database as an assembly ('schistosoma_mansoni') and a gff file of gene predictions.

It took me a while to figure out how to retrieve all the sequences of S. mansoni proteins from the database, ie. the S. mansoni proteome. Here goes (using the Ensembl 74 Compara Perl API):
(I have replaced the details of our local database with 'xxx's, to keep it private)

BEGIN {
        unshift (@INC, qw(/nfs/users/nfs_b/bh4/apps/ensembl-74/src/ensembl/modules /nfs/users/nfs_b/bh4/apps/ensembl-74/src/ensembl-compara/modules));
}

use strict;
use warnings;
use Bio::EnsEMBL::Compara::DBSQL::DBAdaptor;

my $num_args               = $#ARGV + 1;
if ($num_args != 1)
{
    print "Usage of get_proteome_for_cestode_species.pl\n\n";
    print "perl get_proteome_for_cestode_species.pl <species>\n";
    print "where <species> is the species name (eg. schistosoma_mansoni)\n";
    print "For example:\n";
    print "perl -w get_proteome_for_cestode_species.pl schistosoma_mansoni\n";
    exit;
}

# FIND THE NAME OF THE SPECIES OF INTEREST:

my $species                = $ARGV[0];

#------------------------------------------------------------------#

my $comparadb = new Bio::EnsEMBL::Compara::DBSQL::DBAdaptor (-host => 'xxx', -port => 3378, -user => 'xxx', -pass => 'xxx', -dbname => 'xxx');

# first get the taxon_id for species $species (eg. schistosoma_mansoni)
my $species_taxon_id = 'none';
my $gda = $comparadb->get_adaptor("GenomeDB");
my @genomes = @{$gda->fetch_all()};
foreach my $genome (@genomes){
   my $genome_name = $genome->name(); # eg. homo_sapiens
   my $taxon = $genome->taxon()->name(); # eg. Homo sapiens
   my $taxon_id = $genome->taxon_id();
   if ($genome_name eq $species)
   {
      if ($species_taxon_id ne 'none') { print STDERR "ERROR: already have species_taxon_id $species_taxon_id\n"; exit;}
      $species_taxon_id = $taxon_id;
   }
}
if ($species_taxon_id eq 'none') { print STDERR "ERROR: did not find species_taxon_id for $species\n"; exit;}

# get the Bio::EnsEMBL::Compara::SeqMember objects for the species $species
my $pma = $comparadb->get_adaptor("SeqMember");
my @proteins = @{$pma->fetch_all_by_source_taxon("Ensemblpep", $species_taxon_id)};

foreach my $protein (@proteins) {
    my $stable_id = $protein->stable_id();
    if (defined($protein->sequence()))
    {
       my $sequence = $protein->sequence();
       print ">$stable_id\n";
       print "$sequence\n";
    }
    else { print STDERR "ERROR: do not know sequence for $stable_id\n"; exit;}
}

Getting all protein sequences, versus the 'canonical' sequences
In Compara, if there are multiple splice-forms for a gene, one is chosen randomly as the 'canonical' protein to represent that gene. To get all the protein sequences for all splice-forms for all genes, you need to use (see above):
my $pma = $comparadb->get_adaptor("SeqMember");
my @proteins = @{$pma->fetch_all_by_source_taxon("Ensemblpep", $species_taxon_id)};

foreach my $protein (@proteins) {
    my $stable_id = $protein->stable_id();
    if (defined($protein->sequence()))
    {
       my $sequence = $protein->sequence();
       print ">$stable_id\n";
       print "$sequence\n";
    }
    else { print STDERR "ERROR: do not know sequence for $stable_id\n"; exit;}
}
To just get the protein sequences for the canonical splice-forms for all genes (one per gene), you need to use:
my $gma = $comparadb->get_adaptor("GeneMember");
my @genes = @ { $gma->fetch_all_by_source_taxon("ENSEMBLGENE", $species_taxon_id) } ;

foreach my $gene (@genes) {
   my $stable_id = $gene->stable_id();
   my $canonical = $gene->get_canonical_SeqMember();
   if (defined($canonical->sequence()))
   {
      my $sequence = $canonical->sequence();
      print ">$stable_id\n";
      print "$sequence\n";
   }
   else { print "ERROR: no sequence for gene $stable_id\n"; exit;}
}
Note that the second version gets a 'GeneMember' object rather than a 'SeqMember' object (this is because a GeneMember object has a 'canonical Seqmember' attached to it, but a SeqMember object doesn't.

Friday 31 January 2014

Using Python to find common ancestors (parents) of GO terms

I need a script to find the last common ancestor(s) of two Gene Ontology (GO) terms. Two GO terms may have more than one last common ancestor in the GO hierarchy, via different routes. For example, if the nodes N1...N15 in the diagram below represent GO terms in the GO hierarchy:

the last common ancestors of N1 and N2 are the set {N7, N5}, as they are the last common ancestors via different routes from N1 and N2. Likewise, the last common ancestor of N10 and N11 is N13.
The function find_lcas_of_pair of_GO_terms in my Python script find_lca_of_go_terms.py finds the last common ancestor(s) of two GO terms in this way. (Thanks to my husband Noel for help with this!) For example, it finds the last common ancestor of the terms GO:0001578 (microtubule bundle formation) and GO:0030036 (actin cytoskeleton organization) to be GO:0007010 (cytoskeleton organization). The same example was given in a biostars discussion, which gave a solution to the problem in Java.

If we want to find the last common ancestors (LCAs) of the GO terms of two genes, where gene 1 has GO terms N1, N10 and gene 2 has GO terms N2, N11, then we can do the following:
(i) find the LCAs for N1 and N2
(ii) find the LCAs for N1 and N11
(iii) find the LCAs for N10 and N2
(iv) find the LCAs for N10 and N11
(v) find the union of the LCAs found in (i), (ii), (iii) and (iv)
(vi) remove any GO term from the set of LCA in step (v), if it is an ancestor (in the GO hierarchy) of another GO term in that set.
This function find_lcas_of_GO_terms_for_two_genes in my Python script does this. For this example, it finds the LCAs of the GO terms to be the set {N5, N7, 13}. Taking another example, if your first gene has GO terms GO:0001578, GO:0004104, and GO:0004835 and your second gene has GO terms GO:0030036, GO:0003990, it finds the LCAs of the GO terms for the two genes to be the set {GO:0004104, GO:0007010}.

What about if you have multiple genes? The function find_lcas_of_GO_terms_for_many_genes deals with this case. It finds the set of LCAs for the first pair of genes (giving set 1). Then it finds the set of LCAs between the GO terms in set 1 and the GO terms for the second gene (giving set 2). Then it finds the set of LCAs for the GO terms in set 2 and the GO terms for the third gene (giving set 3), and so on, until you have gone through all the genes.

Wednesday 29 January 2014

Merging optical map XML files using Python

My colleagues asked me to merge 341 optical map XML files (for 341 scaffolds) into one large XML file. I was able to write a Python script merge_optical_map_xml_files.py by using Python's nice ElementTree library for parsing XML. I thought it would be a tricky job but with Python it took me only about half an hour. Yay!

Postscript:
My colleagues Alan Tracey and Karen Brooks found that the MapSolver software cannot read in the merged XML file produced by my script. We figured out that the problem is with the header. The header produced by my script is:
<?xml version='1.0' encoding='iso-8859-1'?>
<aligned_maps_document><experiment>
...

In order to get MapSolver to read in the XML file, I had to manually edit the header of the XML file produced by my script so that it is:
<?xml version="1.0" encoding="iso-8859-1"?>



<aligned_maps_document version="0.5">

<experiment>
...
I've now updated my Python script to make it produce this type of header in the output XML file.

Bin assembly pipeline

[Of interest to Sanger users only]

My colleague Daria Gordon wrote a pipeline to create a 'bin assembly' out of the reads that didn't make it into the main genome assembly for a species.

Here are the steps (to be run in a 'screen' session):

1) Create a directory for your work in your /lustre directory, eg.:
% mkdir /lustre/scratch108/parasites/alc/bin_assembly_test
% cd /lustre/scratch108/parasites/alc/bin_assembly_test

2) Create soft-links to your assembly fasta file, zipped fastq files (find out where they are using 'pathfind'), and bam files (if any).

3) Set PERL5LIB to find the necessary perl modules:
% export PERL5LIB=$PERL5LIB:/software/pathogen/internal/pathdev/vr-codebase/modules
[Note: 17-Dec-2015: the pipeline was moved to:
/software/parasites/projects/helminth_scripts/modules/HelminthGenomeAnalysis/, so need to type:
% export PERL5LIB=$PERL5LIB:/software/parasites/projects/helminth_scripts/modules]
Note, probably you will also need:
% export PERL5LIB=$PERL5LIB:/software/parasites/internal/prod/lib

4) Make a tab-delimited file with information on the lanes (eg. 'lanes.txt'), with columns: lane_id, fastq_file1, fastq_file2, insert size. Your fastq files may be g-zipped:

6956_8  6956_8_1.fastq.gz  6956_8_2.fastq.gz  3000
7623_6  7623_6_1.fastq.gz  7623_6_2.fastq.gz  450

Note: it's important that this file doesn't have any extra blank lines.
Also, if the file names contain '#' characters, you might need to put the file names in quotes (or rename them to avoid these characters), to avoid problems.

5) Start the pipeline:
% init_pipeline.pl HelminthGenomeAnalysis::PipeConfig::BinAssembly_conf -species_name <species> -assembly_file <assembly> -input_dir <input_dir> -pass 50hgi --lanes_file <lanes_file>
where <species> is your species name, eg. haemonchus,
<assembly> is your assembly file, eg. haemonchus.fasta,
<input_dir> is the directory with your input files, eg. /lustre/scratch108/parasites/alc/bin_assembly_test,
<lanes_file> is your lanes file, eg. lanes.txt

6) Paste in the beekeeper commands that you are told (as the output from step 5), eg.:
% beekeeper.pl -url mysql://wormpipe_admin:50hgi@mcs10:3388/wormpipe_alc_bin_assembly_haemonchus -sync
% beekeeper.pl -url mysql://wormpipe_admin:50hgi@mcs10:3388/wormpipe_alc_bin_assembly_haemonchus -loop

Tuesday 28 January 2014

Find the number of steps from a GO term to the top of the hierarchy

I've been analysing the Gene Ontology (GO) hierarchy, and wanted to find a way to calculate the number of steps from a particular GO term to the top of the hierarchy.

I first came up with the idea to use Dijkstra's algorithm to do this, and wrote a Python script calc_dists_to_top_of_GO.py for this. This worked ok, but was quite slow (took about 15 minutes to run).

My husband Noel came up with the idea of using a breadth-first search for the same purpose. Here's a Python script that he helped me to write for this (thanks Noel!): calc_dists_to_top_of_GO_using_bfs.py.
This was much faster, and only took seconds to run!

Both scripts use the GO ontology file (which contains the GO hierarchy), in this case 'gene_ontology.WS238.obo' (which I downloaded from WormBase) as input.

Happily the two scripts give the same result : )

For example, for the GO term 'single strand break repair' (GO:0000012), they both find that it is 6 steps from a term at the top of the hierarchy (in this case, from 'biological process', GO:0008150):

Monday 27 January 2014

Depth-first search and breadth-first search in Python

I recently had a tree structure like this (actually from the Gene Ontology hierarchy), and wanted to traverse the tree:

If you start at node N1, there are two ways to traverse the tree:

(i) a depth-first search: this uses a 'stack', and focuses on the last item that you put onto the stack. Starting at node N1, we first put N1 on the stack. The stack is the list [N1] now.
Then we take the last item off the stack: N1 in this case. We take the parent nodes of this item (N2, N3, N4) and add them to the stack. The stack contains list [N2, N3, N4] now.
Then we take the last item off the stack: N4. We take the parent nodes of this item (N3). It is already in the stack. The stack contains list [N2, N3] now.
We take the last item off the stack: N3. We take the parent nodes of this item (N6, N7), and add them to the stack. The stack contains list [N2, N6, N7] now.
We take the last item off the stack: N7. This item has no parents. The stack is [N2, N6] now.
We take the last item off the stack: N6. We add its parents (N13) to the stack, so the stack is [N2, N13].
We take the last item off the stack: N13. It has no parents. The stack is [N2] now.
We take the last item off the stack: N2. It has no parents. The stack is empty now, so we've finished.
That is, a depth-first traversal starting at node N1 takes this route:

(ii) a breadth-first search: this uses a 'queue', and focuses on the first item that you put on the queue.
Starting at node N1, we first put N1 on the queue. The queue is list [N1] now.
Then we take the first item off the queue: N1 in this case. We take the parent nodes of this item (N2, N3, N4) and add them to the queue. The queue is list [N2, N3, N4] now.
We then take the first item off the queue: N2 in this case. N2 has no parents. The queue is [N3, N4] now.
We take the first item off the queue: N3 in this case. Its parents (N6, N7) are added to the queue, so the queue is [N4, N6, N7] now.
We take the first item off the queue: N4. We have already seen this parent (it was in the queue previously), so we ignore it. The queue is [N6, N7] now.
We take the first item off the queue: N6. We add its parents (N13) to the queue, so the queue is [N7, N13] now.
We take the first item off the queue: N7. It has no parents. The queue is [N13] now.
We take the first item off the queue: N13. It has no parents. The queue is empty now.

That is, a breadth-first search starting at node N1 takes this route:

Python code for depth-first and breadth-first searches:
My python script tree_traversal.py has code for depth-first and breadth-first traversal of this tree (thanks to my husband Noel for help with this!) If you run it, you should get output showing the stack at each point in the depth-first search, and the queue at each point in the breadth-first search:

% python3 tree_traversal.py
Depth-first search:
('stack=', [('N1', 0)])
('stack=', [('N2', 1), ('N3', 1), ('N4', 1)])
('stack=', [('N2', 1), ('N3', 1)])
('stack=', [('N2', 1), ('N6', 2), ('N7', 2)])
('stack=', [('N2', 1), ('N6', 2)])
('stack=', [('N2', 1), ('N13', 3)])
('stack=', [('N2', 1)])
{'N13': 3, 'N1': 0, 'N2': 1, 'N3': 1, 'N4': 1, 'N6': 2, 'N7': 2}

Breadth-first search:
('queue=', [('N1', 0)])
('queue=', [('N2', 1), ('N3', 1), ('N4', 1)])
('queue=', [('N3', 1), ('N4', 1)])
('queue=', [('N4', 1), ('N6', 2), ('N7', 2)])
('queue=', [('N6', 2), ('N7', 2)])
('queue=', [('N7', 2), ('N13', 3)])
('queue=', [('N13', 3)])
{'N13': 3, 'N1': 0, 'N2': 1, 'N3': 1, 'N4': 1, 'N6': 2, 'N7': 2}

Notes on the script: pop() takes the item from the end of a list, removes it and gives it to you.

Friday 24 January 2014

Useful vi / vim short-cuts

Some useful vi / vim short-cuts that I'm always forgetting:

Copy and paste a block of text
5Y             Copy five lines of text
P            Paste the copied text
See also here for how to copy and paste a block of text.

Show hidden characters in a file
:set list       Show hidden characters
:set unlist   Hide the hidden characters

Moving around in the file
0                 Go to the start of a line
$                 Go to the end of a line
1G              Go to the start of the file
G                Go to the end of the file

Insert 50 '-'s in a row
'.' repeats the last text-changing command. You can type a number before a command in vi for it to be done a certain number of times. For example, if you want to make a line of 50 '-'s in a row, you can type '-', then press ESC, and type '50.' to get 50 '-'s.

Toggle auto-indenting for paste

If you want to paste in code from a python script, but don't want the intenting to be messed up, you can type:

:set paste

You can then turn this off later using:

:set nopaste

There is a nice VI cheat sheet here.

Wednesday 22 January 2014

The Python 'zip' function

The Python 'zip' function returns a list of tuples, where the i-th tuple contains the i-th element from the each of the argument sequences. I learnt about 'zip' in the Python course in Cambridge, and my colleague Bhavana Harsha recently reminded me about it.

Here are some examples of using 'zip':

letters = ['A', 'B', 'C', 'D', 'E']
numbers = [1, 2, 3, 4, 5]
for x, y in zip(letters, numbers):
print(x, y)
A 1
B 2
C 3
D 4
E 5

mystring = 'gene1 species1 gene2 species2 gene3 species3 gene4 species4'
temp = mystring.split()
pairs = zip(temp[0::2], temp[1::2])
for x, y in pairs:
print(x, y)
gene1 species1
gene2 species2
gene3 species3
gene4 species4
Note: here temp[0::2] gives the 1st, 3rd, 5th... element in the list 'temp', and temp[1::2] gives the 2nd, 4th, 6th... element in the list 'temp'.

seq1 = 'ACGTAT'
seq2 = 'ACTTTT'
bases = zip(seq1,seq2)
for x, y in bases:
print(x, y)
A A
C C
G T
T T
A T
T T

Note that once you have expanded a 'zip' once (eg. printing it out), then it no longer holds the data. For example, if you try and print out the zip 'bases' in the last example a second time, then you won't get any result.

Thanks to Bhavana Harsha for the last example.

Calculating a similarity measure for two text strings in Python

Calculating a similarity measure for two text strings

I recently wanted to calculate a similarity measure for two text strings (actually two different short functional descriptions for the same C. elegans gene). That is, a simple measure of textual (rather than semantic) similarity.

I found a mention on stackoverflow of the 'difflib' module in the Python standard library, and found that I could do it using difflib:

import difflib

fn1 = 'protein tyrosine phosphatase activity'
fn2 = 'protein-tyrosine phosphatase'
score = difflib.SequenceMatcher(None,fn1.lower(),fn2.lower()).ratio()
print(score)

This gives a score of 0.8307692307692308 in this case.

Nice!

Finding the longest matching substring of two strings
Another thing that I wanted to do was to find the longest matching substring of two strings. Again, we can do this using difflib:

import difflib

fn1 = 'protein tyrosine phosphatase activity'
fn2 = 'protein-tyrosine phosphatase'
s = difflib.SequenceMatcher(None,fn1.lower(),fn2.lower())
s.find_longest_match(0,len(fn1),0,len(fn2))

This gives output:

Match(a=8, b=8, size=20)

This tells us that the longest match starts at position 8 in fn1 (ie. at the 't' of 'tyrosine') and at position 8 in fn2 (the 't' of 'tyrosine') and continues for 20 letters (until the end of 'phosphatase').

Thursday 16 January 2014

Using CEGMA to assess gene sets

The CEGMA software can be used to assess assemblies. It can also be used to assess the gene set that you have made for an assembly. What you can do is to take your set of predicted proteins (ie. a protein fasta file), and run hmmer (hmmpfam from HMMER2) against the CEGMA file of HMMs for KOGs:
eg.
% bsub -q yesterday -J CEG.TMUE1 -G team133 -R "select[mem > 500] rusage[mem=500]" -M500 -o TMUE1.cegma.output -e TMUE1.cegma.err hmmpfam2 KOG.hmm TMUE1.pep
where KOG.hmm is the file of HMMs for KOGs, and TMUE1.pep is your fasta file of proteins, TMUE1.cegma.output will be the hmmpfam output file.

Then you can parse the hmmpfam output file to see what percent of the CEGMA HMMs had significant hits:
% parse_hmmpfam_output.pl TMUE1.cegma.output TMUE1.pep 1e-05 458
where 1e-05 is the e-value cutoff to use for the hmmpfam hits, 458 is the number of KOGs in the KOG.hmm file, and the script parse_hmmpfam_output.pl can be found here.

This gives you an idea of how complete your gene set is. If your gene set is really complete, you'd expect that there will be hits to nearly 100% of the CEGMA HMMs.

Comment 28-April-2016: If CEGMA was used as input in your gene-finding pipeline (eg. to train/guide your gene-finder), it's probably not fair to also use it to assess the completeness of your gene set, as it probably did well at finding the CEGMA genes.