Friday 28 June 2013

Scanning an assembly for contaminant scaffolds

My colleague Daria Gordon has written a pipeline for scanning an assembly for contaminant scaffolds. At present it scans an invertebrate assembly for contaminant scaffolds from other taxonomic divisions, such as bacteria, vertebrates, plants, fungi, viruses, etc.

Briefly, the pipeline works by taking each scaffold (or 50-kb chunks of longer scaffolds), and running blast against the nr protein database. If a scaffold has stronger blast hits to non-invertebrate proteins (vertebrates, bacteria, viruses, plants, fungi) than to invertebrate proteins, then it is considered to be a 'contaminant' scaffold.
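The per-scaffold decision described above can be sketched roughly like this (a minimal sketch under my own assumptions about the pipeline's internals; the function and the (taxon_group, bitscore) representation are hypothetical, as the real pipeline works from full blast output):

```python
# Sketch of the contaminant-classification rule described above.
# Hypothetical representation: each blast hit for one scaffold (or 50-kb
# chunk) is reduced to a (taxon_group, bitscore) pair.

NON_INVERTEBRATE = {"bacteria", "vertebrates", "plants", "fungi", "viruses"}

def is_contaminant(hits):
    """hits: list of (taxon_group, bitscore) tuples for one scaffold.
    The scaffold is called a contaminant if its strongest hit to a
    non-invertebrate protein beats its strongest hit to an invertebrate one."""
    best_inv = max((s for g, s in hits if g == "invertebrates"), default=0.0)
    best_other = max((s for g, s in hits if g in NON_INVERTEBRATE), default=0.0)
    return best_other > best_inv

# A scaffold with a stronger bacterial hit is flagged:
print(is_contaminant([("bacteria", 250.0), ("invertebrates", 90.0)]))   # True
# A scaffold whose best hit is to an invertebrate protein is kept:
print(is_contaminant([("invertebrates", 300.0), ("bacteria", 120.0)]))  # False
```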

How to run Daria's contamination scanning pipeline:
This is only available to Sanger users. Daria gave me these instructions for running her pipeline:

1) You need to be in a screen session, so that the pipeline keeps running when the terminal window is closed (whether you go home or close it by mistake). Make sure you choose one specific farm2-login node and stick with it; that way you can always reconnect to your running session. For example:

% ssh farm2-head3
% screen -RD
 [Note: you need to have export LSB_DEFAULTGROUP=team133 in your .bashrc to run the pipeline]

2) Create a directory for the assemblies on /lustre/scratch108
% mkdir  /lustre/scratch108/parasites/alc/50HGI_blast
% cd  /lustre/scratch108/parasites/alc/50HGI_blast

3) Create a working directory (a separate one for each assembly) in it.
% mkdir  Schistosoma.curassoni
% cd  Schistosoma.curassoni

4) Copy your assembly:
% cp /nfs/helminths/analysis/50HGP/Schistosoma.curassoni/ASSEMBLY/v1.0.pipeline/SCUD.v1.fa .
Note: the pipeline expects the assembly fasta file to have a version name in it, eg. SCUD.v1.fa, not just SCUD.fa.

Daria suggested that if the assembly is very big (eg. >500 Mb), it could be a good idea to split it into smaller files of ~500 Mb each, and run the pipeline separately on each smaller file. You can split an assembly file of 300,000 sequences into 3 equal-sized files using my perl script split_up_fasta.pl:
% split_up_fasta.pl assembly_file 100000 smaller_file .
This will make three smaller files called smaller_file1, smaller_file2, smaller_file3.
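The effect of the splitting step above can be sketched as follows (a hypothetical re-implementation, not the actual split_up_fasta.pl script; it writes sequences into numbered output files, with at most a given number of sequences per file):

```python
# Rough sketch of splitting a fasta file into smaller numbered files with at
# most seqs_per_file sequences each (hypothetical re-implementation of
# split_up_fasta.pl; not the real script).

def split_fasta(in_path, seqs_per_file, out_prefix):
    """Write sequences from in_path into out_prefix1, out_prefix2, ...
    with at most seqs_per_file sequences per output file.
    Returns the number of output files created."""
    out = None
    n_seqs = 0    # sequences written to the current output file
    n_files = 0
    with open(in_path) as fh:
        for line in fh:
            if line.startswith(">"):
                # start a new output file if needed
                if out is None or n_seqs == seqs_per_file:
                    if out:
                        out.close()
                    n_files += 1
                    out = open("%s%d" % (out_prefix, n_files), "w")
                    n_seqs = 0
                n_seqs += 1
            if out:
                out.write(line)
    if out:
        out.close()
    return n_files
```

With seqs_per_file=100000, a 300,000-sequence assembly would give three files, matching the smaller_file1, smaller_file2, smaller_file3 naming above.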

5) Create the pipeline database:
% init_pipeline.pl WormAssemblyQC::PipeConfig::ContaminationScreening_conf -assembly_prefix SCUD -pipeline_name contam_screen_SCUD_nr  -pipeline_dir /lustre/scratch108/parasites/alc/50HGI_blast/Schistosoma.curassoni/ -pass 50hgi

Note: the option '-farm_memory_suffix 000' is necessary on farm2, but not on farm3.

Note that -assembly_prefix must be the start of the name of the assembly, eg. if the assembly is called SCUD.v1.fa then use '-assembly_prefix SCUD'. The assembly must have a number! (eg. SCUD.v1.fa rather than just SCUD.fa)

If you have a different version than v1, use the -assembly_version option, eg. if your file is SCUD.v2.fa then use '-assembly_prefix SCUD -assembly_version v2'.

Note: at the moment the pipeline uses these files by default:
(i) non-invertebrate proteins database: /lustre/scratch108/parasites/alc/ContaminationPipeline/new_files/smallnr.fa
(ii) invertebrate proteins database: /lustre/scratch108/parasites/alc/ContaminationPipeline/new_files/seq_update.fasta
You can, however, specify an invertebrate protein database on the command line using the -inv_db <string> option (the /lustre/scratch108/parasites/alc/ContaminationPipeline/new_files/seq_update.fasta file is a copy of the same file, so could be used, for example).

6) At the end of the output of this command you will find a mysql connection line for the newly created pipeline database.
% mysql --host=mcs10 --port=3388 --user="wormpipe_admin" --pass="50hgi" wormpipe_alc_contam_screen_SCUD_nr
It is convenient to have a mysql session open (in another window) to monitor the progress of the pipeline.
For example:
> select * from progress;

7) To actually run the pipeline you will need to copy and paste the beekeeper.pl commands to your screen-protected window. First with -sync (takes seconds) and then with -loop (runs until the pipeline either completes or fails):
% beekeeper.pl -url mysql://wormpipe_admin:50hgi@mcs10:3388/wormpipe_alc_contam_screen_SCUD_nr -sync                 
% beekeeper.pl -url mysql://wormpipe_admin:50hgi@mcs10:3388/wormpipe_alc_contam_screen_SCUD_nr  -loop

Troubleshooting:
Sometimes the pipeline fails for some reason. Here are some things to do if this happens:

Finding failed jobs:
Find the job id. for failed jobs in the mysql database:
> select job_id from job where status='FAILED';
See the message for a failed job:
> select * from msg where job_id = 7;

Restarting failed jobs:
You can restart this particular job:
% beekeeper.pl -url <your_database> -reset_job_id 138
Then as usual
% beekeeper.pl -url <your_database> -sync
% beekeeper.pl -url <your_database> -loop

Note that occasionally a blast job can fail because the query has some sequence that causes blast to crash; for example, the contig below causes blast to crash:
>scaffold1325.size71668:offset::0:
ACATACATACATACATACATACATACATACATACATACATACATACATACATACATACAT
GCATACATACATACATACATACATACATACATACATACATACATACATAC
The solution that Daria suggested here was to delete this contig (from the chunk file for the blast job), and then do 'reset_job_id' as above and restart the pipeline.

Note that if you need to delete a whole chunk file (eg. because it's all Ns), then you need to create an empty file with that name (eg. using 'touch'), or else the pipeline will fail.
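Deleting one bad contig from a chunk file (before resetting the job as above) can be done with a few lines of code; here is a minimal sketch, with hypothetical file and contig names, assuming a simple fasta layout:

```python
# Minimal sketch: rewrite a fasta chunk file with one named contig removed.
# The file name and contig name in the example call are hypothetical.

def remove_contig(path, bad_name):
    """Rewrite the fasta file at 'path' in place, dropping the record whose
    header line is '>' + bad_name, together with its sequence lines."""
    kept = []
    skipping = False
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                # start/stop skipping at each new record
                skipping = (line[1:].rstrip("\n") == bad_name)
            if not skipping:
                kept.append(line)
    with open(path, "w") as fh:
        fh.writelines(kept)

# e.g. remove_contig("chunk_17.fa", "scaffold1325.size71668:offset::0:")
```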

Restarting a stalled pipeline:
Sometimes the whole beekeeper will die with some error message about running out of memory or something. To restart it, try:
% beekeeper.pl -url <your_database> -sync
% beekeeper.pl -url <your_database> -loop

Deleting the whole mysql database:
Delete the whole database for a species from the mysql table:
> drop database <name_of_database>;
eg.
> drop database wormpipe_alc_contam_screen_SCUD_nr;

Restarting a failed analysis:
eg. if the phylum_finder analysis failed:
% beekeeper.pl -url <your_database> -reset_failed_jobs_for_analysis phylum_finder
Clean up dead jobs for resubmission:
% beekeeper.pl -url <your_database> -dead
Then try '-sync' and '-loop' as above.


Documentation:
Again, for Sanger users, you can see Daria's documentation by typing:
% perldoc WormAssemblyQC::PipeConfig::ContaminationScreening_conf

Other Notes:
Sometimes you can get the message:
Can't read from '/ensembl-hive/hive_config.json' at /software/pathogen/external/apps/usr/local/ensembl-hive/modules/Bio/EnsEMBL/Hive/Utils/Config.pm line 51.
However, this doesn't matter; just ignore it.

Other tools for scanning assemblies for contamination:
DeconSeq: described in a paper by Schmieder and Edwards (2011), and available on the DeconSeq website. I haven't tried this, but it sounds interesting.

My proteome-based contamination scanning pipeline
If you have a gene set for a species, then after running Daria's contamination scan pipeline you can run my proteome-based contamination scanning pipeline, which tends to pick up some extra contaminant scaffolds missed by Daria's pipeline (and, very rarely, re-classifies a scaffold that Daria's pipeline classified as 'contaminant' as 'non-contaminant'). There are two steps:
(i) Run blastp:
% python3 ~alc/Documents/git/Python/run_blastp_for_contamination_scans.py >& running_blastp
(ii) Classify scaffolds as contaminant/non-contaminant:
% python3 ~alc/Documents/git/Python/run_contamination_scans_using_blastp.py

I then have a further step that can detect even more contaminant scaffolds (eg. by detecting flatworm contamination in a nematode assembly):
(i) Run blastp: [Note to self: make sure the protein fasta being used in this step is the right one and hasn't had contaminant scaffolds already removed, if you're re-running it]

% python3 /nfs/users/nfs_a/alc/Documents/git/Python/run_blastp_for_contamination_scans_round2.py
(ii) Classify scaffolds as contaminant/non-contaminant:  [Note to self: make sure that the script will be able to read the blast outputs from step (i) above, eg. that the protein names in the blast results match what is in the gff]

% python3 ~alc/Documents/git/Python/run_contamination_scans_using_blastp_round2.py

Notes to self:
Diogo has put the Taxonomy files for Daria's pipeline here:
/lustre/scratch108/parasites/dr7/DATA/taxonomy/ncbi_dumps
These were downloaded from the NCBI ftp site:  ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/

Difficult cases for my contamination scan: 
Daria's and my contamination scan pipelines (including the blastp steps) have problems with certain very difficult cases:
- Because it is based on finding hits to protein coding regions (via blastx or blastp), it doesn't detect contaminant scaffolds that lack any genes.
- The blastp steps use a database of non-invertebrate proteins that contains representative proteomes, eg. E. coli K12, a B. subtilis strain, etc. This does not include every possible E. coli strain, so it's possible to have contamination from an E. coli strain that has many proteins missing from E. coli K12; if a scaffold carries only one or two such proteins, it may not be detected as contamination.
- The second blastp step (aimed at finding flatworm contamination in a nematode assembly) can misclassify some scaffolds as contaminants if you are looking at a very diverged clade I nematode that has genes closer to a flatworm's.
- The database of invertebrate proteins that I used was downloaded a while ago, so in some cases misses recent proteomes that are closely related to a particular query (eg. nematode) species.
Ouch!

Wednesday 19 June 2013

Testing bioinformatics perl scripts: calculating GC content

A perl module that calculates GC content of a DNA sequence, using Carp::Assert and throws_ok():
Let's write a perl module GC.pm that contains a subroutine to calculate the GC content of a DNA sequence, and test it.

package GC;

use strict;
use warnings;

use Math::Round; # has the nearest() function
use Carp::Assert; # has the assert() function
use Scalar::Util qw(looks_like_number);
use Error; # has Error::Simple, used to throw exceptions below

use base 'Exporter';
our @EXPORT_OK = qw( gc_content_one_seq );

sub gc_content_one_seq
{

   my $seq = $_[0];
   my $g_or_c = 0;
   my $gc;

   # throw an exception if the sequence is uninitialised (undefined), or an empty string:
   throw Error::Simple("sequence not defined") if (!(defined($seq)));
   throw Error::Simple("sequence does not exist") if ($seq eq '');

   # calculate the GC content:

   $seq = uc($seq);  # convert the sequence to uppercase
   $g_or_c = ($seq =~ tr/GC//);  # counts the number of Gs and Cs in the sequence
   $gc = $g_or_c*100/length($seq);
   $gc = nearest(0.01, $gc); # round to 0.01 precision

   # die if the GC content is not between 0 and 100:
   assert($gc >= 0.00 && $gc <= 100.00); # this should never happen, program will die if it does

   # die if the GC content is not numeric:
   assert(looks_like_number($gc)); # this should never happen, program will die if it does

   return $gc;
}

1;


Testing the perl module using ok(), use_ok(), can_ok(), and throws_ok():
Then you can use the testing script GC.t to test the subroutines in module GC.pm:

#!perl

use strict;
use warnings;


use Test::More tests => 7;
use Error; # has Error::Simple
use Test::Exception; # has throws_ok()

# Specify the subroutines to import:
my @subs = qw ( gc_content_one_seq );

# Check we can import the subroutines:
use_ok( 'GC', @subs);
can_ok( __PACKAGE__, 'gc_content_one_seq');

# Test the gc_content_one_seq() subroutine:

my $seq = "AAAAAAAAAAGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGAAAAAAAAAA";
my $gc_seq = GC::gc_content_one_seq($seq);
ok ($gc_seq == 33.33, "Test 3: check gc_content_one_seq() correctly gives GC=33.33");


$seq = "AAAAAAAAAAGGGGGGGGGGAAAAAAAAAAAA";
$gc_seq = GC::gc_content_one_seq($seq);
ok ($gc_seq == 31.25, "Test 4: check gc_content_one_seq() correctly gives GC=31.25");

$seq = "AAAAAAAAA";
$gc_seq = GC::gc_content_one_seq($seq);
ok ($gc_seq == 0.00, "Test 5: check gc_content_one_seq() correctly gives GC=0.00");

$seq = ""; # Check an error is thrown if the sequence is an empty string:
throws_ok { $gc_seq = GC::gc_content_one_seq($seq) } 'Error::Simple', 'Test 6: empty sequence throws an error';

my $seq2; # Check an error is thrown if the sequence is undefined:
throws_ok { $gc_seq = GC::gc_content_one_seq($seq2) } 'Error::Simple', 'Test 7: undefined sequence throws an error';

When you run the tests you see:
% prove GC.t
GC.t .. ok  
All tests successful.
Files=1, Tests=7,  0 wallclock secs ( 0.04 usr  0.01 sys +  0.04 cusr  0.02 csys =  0.11 CPU)
Result: PASS

% prove -v GC.t
GC.t ..
1..7
ok 1 - use GC;
ok 2 - main->can('gc_content_one_seq')
ok 3 - Test 3: check gc_content_one_seq() correctly gives GC=33.33
ok 4 - Test 4: check gc_content_one_seq() correctly gives GC=31.25
ok 5 - Test 5: check gc_content_one_seq() correctly gives GC=0.00
ok 6 - Test 6: empty sequence throws an error
ok 7 - Test 7: undefined sequence throws an error
ok
All tests successful.
Files=1, Tests=7,  0 wallclock secs ( 0.04 usr  0.01 sys +  0.02 cusr  0.02 csys =  0.09 CPU)
Result: PASS


Using the perl module in a perl script:

We can use the subroutine gc_content_one_seq() in a perl script gc.pl like this:

#!/usr/bin/perl

use strict;
use warnings;
use GC;   

my $gc = GC::gc_content_one_seq("ACGT");

print "GC = $gc\n";


% perl -w gc.pl
GC = 50 

Thanks to my colleagues Daria Gordon and Bhavana Harsha for lots of helpful discussion about this.

Testing perl scripts

With the help of my colleagues Daria Gordon and Bhavana Harsha and the book 'Perl Testing: A Developer's Notebook', I've been learning how to test perl scripts.

Testing a simple 'Hello world!' perl subroutine using ok() from Test::Simple
Here is a very simple testing script hello.t with subroutines to print out 'Hello world!' and 'Hello universe!', and a test of each subroutine:

#!perl

use strict;
use warnings;


use Test::Simple tests => 2;

sub hello_world
{
   return "Hello world!";
}


sub hello_universe
{
   return "Hello universe!";
}

ok( hello_world() eq "Hello world!", 'Test 1: check hello_world() output is ok');

ok( hello_universe() eq "Hello universe!", 'Test 2: check hello_universe() output is ok');

Here the subroutine and the test are all in one file.
The test can be run by typing:
% prove hello.t
hello.t .. ok 
All tests successful.
Files=1, Tests=2,  0 wallclock secs ( 0.02 usr  0.00 sys +  0.01 cusr  0.00 csys =  0.03 CPU)
Result: PASS

You can get more information on each of the tests run by typing:
% prove -v hello.t
hello.t ..
1..2
ok 1 - Test 1: check hello_world() output is ok
ok 2 - Test 2: check hello_universe() output is ok
ok
All tests successful.
Files=1, Tests=2,  0 wallclock secs ( 0.02 usr  0.01 sys +  0.01 cusr  0.00 csys =  0.04 CPU)
Result: PASS


A perl module for the 'Hello world!' subroutine
A more normal way to do things is to put the subroutines in a separate module file, eg. Hello.pm:

package Hello;

use strict;
use warnings;

use base 'Exporter';
our @EXPORT_OK = qw( hello_world hello_universe );

sub hello_world
{
   return "Hello world!";
}

sub hello_universe
{
   return "Hello universe!";
}

1;


Testing the 'Hello world!' perl module using ok(), use_ok() and can_ok()
Then you can use this testing script hello2.t to test the subroutines in the module:

#!perl

use strict;
use warnings;

use Test::More tests => 5;

# Specify the subroutines to import:
my @subs = qw (hello_world hello_universe);

# Check we can import the subroutines:
use_ok( 'Hello', @subs);
can_ok( __PACKAGE__, 'hello_world');
can_ok( __PACKAGE__, 'hello_universe');

# Test the subroutines:
ok( hello_world() eq "Hello world!", 'Test 4: check hello_world() output is ok');
ok( hello_universe() eq "Hello universe!", 'Test 5: check hello_universe() output is ok');


We run the testing code by typing:
% prove hello2.t
hello2.t .. ok  
All tests successful.
Files=1, Tests=5,  0 wallclock secs ( 0.02 usr  0.01 sys +  0.02 cusr  0.00 csys =  0.05 CPU)
Result: PASS

For more details, we type:
% prove -v hello2.t
hello2.t ..
1..5
ok 1 - use Hello;
ok 2 - main->can('hello_world')
ok 3 - main->can('hello_universe')
ok 4 - Test 4: check hello_world() output is ok
ok 5 - Test 5: check hello_universe() output is ok
ok
All tests successful.
Files=1, Tests=5,  0 wallclock secs ( 0.02 usr  0.00 sys +  0.02 cusr  0.00 csys =  0.04 CPU)
Result: PASS


Using the 'Hello world!' perl module in a perl script
We can use the perl module in a perl script, eg. hello.pl:

#!/usr/bin/perl

use strict;
use warnings;
use Hello;

my $hello1 = Hello::hello_world();
print "$hello1\n";

my $hello2 = Hello::hello_universe();
print "$hello2\n";


When we run the perl script, we see:
% perl -w hello.pl
Hello world!
Hello universe!


Thanks to my colleagues Daria Gordon and Bhavana Harsha for lots of helpful discussion about this.

Tuesday 18 June 2013

Checking if a perl module is installed, and installing a perl module locally

Checking a perl module is installed
To check if a perl module is installed, for example, the Test::Simple module, you can type:
% perl -MTest::Simple -e 1
If there is no output, then it means the perl module is installed.

Finding out where a perl module is installed
To find out where a perl module, eg. Test::Simple, is installed, you can type:
% perldoc -l Test::Simple

To see the raw source of an installed perl module
To see the raw source of an installed perl module, eg. Test::Simple, type:
% perldoc Test::Simple

Installing a perl module locally
To install a perl module locally, eg. Carp::Assert, you can type:
% wget search.cpan.org/CPAN/authors/id/M/MS/MSCHWERN/Carp-Assert-0.20.tar.gz
% gunzip Carp-Assert-0.20.tar.gz
% tar -xvf Carp-Assert-0.20.tar
% cd Carp-Assert-0.20
% perl Makefile.PL PREFIX=~alc/Documents/PerlModules/
% make test
% make install