Friday, 28 June 2013

Scanning an assembly for contaminant scaffolds

My colleague Daria Gordon has written a pipeline for scanning an assembly for contaminant scaffolds. At present it scans an invertebrate assembly for contaminant scaffolds from other taxonomic divisions, such as bacteria, vertebrates, plants, fungi, viruses, etc.

Briefly, the pipeline works by taking each scaffold (or 50-kb chunks of longer scaffolds), and running blast against the nr protein database. If a scaffold has stronger blast hits to non-invertebrate proteins (vertebrates, bacteria, viruses, plants, fungi) than to invertebrate proteins, then it is considered to be a 'contaminant' scaffold.

How to run Daria's contamination scanning pipeline:
This is only available to Sanger users. Daria gave me these instructions for running her pipeline:

1) You need to be in a screen session, so when the terminal window is closed (either you go home or by mistake) the pipeline will keep running.  Make sure you choose one specific farm2-login node and stick with it. This way you can always reconnect to your running session. For example:
% ssh farm2-head3
% screen -RD
 [Note: you need to have export LSB_DEFAULTGROUP=team133 in your .bashrc to run the pipeline]

2) Create a directory for the assemblies on /lustre/scratch108
% mkdir  /lustre/scratch108/parasites/alc/50HGI_blast
% cd  /lustre/scratch108/parasites/alc/50HGI_blast

3) Create a working directory (a separate one for each assembly) in it.
% mkdir  Schistosoma.curassoni
% cd  Schistosoma.curassoni

4) Copy your assembly:
% cp /nfs/helminths/analysis/50HGP/Schistosoma.curassoni/ASSEMBLY/v1.0.pipeline/SCUD.v1.fa .
Note the pipeline expects that the assembly fasta file will have a version name in it eg. SCUD.v1.fa, not just SCUD.fa.

Daria suggested that if the assembly is very big (eg. >500 Mb), it could be a good idea to split it into smaller files of ~500 Mb each, and run the pipeline separartely on each smaller file. You can split an assembly file of 300,000 sequences into 3 equal size files using my perl script
% assembly_file 100000 smaller_file .
This will make three smaller files called smaller_file1, smaller_file2, smaller_file3.

5) Create the pipeline database:
% WormAssemblyQC::PipeConfig::ContaminationScreening_conf -assembly_prefix SCUD -pipeline_name contam_screen_SCUD_nr  -pipeline_dir /lustre/scratch108/parasites/alc/50HGI_blast/Schistosoma.curassoni/ -pass 50hgi

-farm_memory_suffix 000 is necessary on farm2, not on farm3

Note that -assembly_prefix must be the start of the name of the assembly, eg. if the assembly is called SCUD.v1.fa then use '-assembly_prefix SCUD'. The assembly must have a number! (eg. SCUD.v1.fa rather than just SCUD.fa)

If you have a different version than v1, use the -assembly_version option, eg. if your file is SCUD.v2.fa then use '-assembly_prefix SCUD -assembly_version v2'.

Note: at the moment the pipeline uses these files by default:
(i) non-invertebrate proteins database: /lustre/scratch108/parasites/alc/ContaminationPipeline/new_files/smallnr.fa
(ii) invertebrate proteins database: /lustre/scratch108/parasites/alc/ContaminationPipeline/new_files/seq_update.fasta
You can however specify an invertebrates protein database on the command-line using the -inv_db <string> option (the /lustre/scratch108/parasites/alc/ContaminationPipeline/new_files/seq_update.fasta file is a copy of the same file, so could be used for example).

6) In the end of the output of this command you will find a mysql connection line specifically to the newly created pipeline database.
% mysql --host=mcs10 --port=3388 --user="wormpipe_admin" --pass="50hgi" wormpipe_alc_contam_screen_SCUD_nr
It is convenient to have a mysql session open(in another window) to monitor the progress of the pipeline.
For example:
> select * from progress;

7) To actually run the pipeline you will need to copy and paste the commands to your screen-protected window. First with -sync (takes seconds) and then with -loop (runs until the pipeline either completes or fails):
% -url mysql://wormpipe_admin:50hgi@mcs10:3388/wormpipe_alc_contam_screen_SCUD_nr -sync                 
% -url mysql://wormpipe_admin:50hgi@mcs10:3388/wormpipe_alc_contam_screen_SCUD_nr  -loop

Sometimes the pipeline fails for some reason. Here are some things to do if this happens:

Finding failed jobs:
Find the job id. for failed jobs in the mysql database:
> select job_id from job where status='FAILED';
See the message for a failed job:
> select * from msg where job_id = 7;

Restarting failed jobs:
You can restart this particular job:
% -url <your_database> -reset_job_id 138
Then as usual
% -url <your_database> -sync
% -url <your_database> -loop

Note that occassionally a blast job can fail because the query has some sequence that causes blast to crash, for example, the contig below causes blast to crash:
The solution that Daria suggested here was to delete this contig (from the chunk file for the blast job), and then do 'reset_job_id' as above and restart the pipeline.

Note that if you need to delete a whole chunk file (eg. because it's all Ns), then you need to create an empty file with that name (eg. using 'touch'), or else the pipeline will fail.

Restarting a stalled pipeline:
Sometimes the whole beekeeper will die with some error message about running out of memory or something. To restart it, try:
% -url <your_database> -sync
% -url <your_database> -loop

Deleting the whole mysql database:
Delete the whole database for a species from the mysql table:
> drop database <name_of_database>;
> drop database wormpipe_alc_contam_screen_SCUD_nr;

Restarting a failed analysis:
eg. if the phylum_finder analysis failed:
> -url <name_of_database> -reset_failed_jobs_for_analysis phylum_finder
Clean up dead jobs for resubmission:
> -url <name_of_database> -dead
Then try '-sync' and '-loop' as above.

Again, for Sanger users, you can see Daria's documentation by typing:
% perldoc WormAssemblyQC::PipeConfig::ContaminationScreening_conf

Other Notes:
Sometimes you can get the message:
Can't read from '/ensembl-hive/hive_config.json' at /software/pathogen/external/apps/usr/local/ensembl-hive/modules/Bio/EnsEMBL/Hive/Utils/ line 51.
However, this doesn't matter, just ignore this.

Other tools for scanning assemblies for contamination:
DecconSeq: described in a paper by Schmeider and Edwards (2011), available on the DeconSeq website. I haven't tried this, but it sounds interesting.

My proteome-based contamination scanning pipeline
If you have a gene set for a species, subsequently to running Daria's contamination scan pipeline, you can run my proteome-based contamination scanning pipeline, which tends to pick up some extra contaminant scaffolds missed by Daria's pipeline (and, very rarely, re-classifies a scaffold that Daria's pipeline classified as 'contaminant' as 'non-contaminant'). There are two steps:
(i) Run blastp:
% python3 ~alc/Documents/git/Python/ >& running_blastp
(ii) Classify scaffolds as contaminant/non-contaminant:
% python3 ~alc/Documents/git/Python/

I then have a further step that can detect even more contaminant scaffolds (eg. by detecting flatworm contamination in a nematode assembly):
(i) Run blastp: [Note to self: make sure the protein fasta being used in this step is the right one and hasn't had contaminant scaffolds already removed, if you're re-running it]

% python3 /nfs/users/nfs_a/alc/Documents/git/Python/
(ii) Classify scaffolds as contaminant/non-contaminant:  [Note to self: Make sure that the script will be able to read the blast outputs from the first step (i) above, eg. that the protein names in the blast results matches what is in the gff]

% python3 ~alc/Documents/git/Python/

Notes to self:
Diogo has put the Taxonomy files for Daria's pipeline here:
These were downloaded from the NCBI ftp site:

Difficult cases for my contamination scan: 
Daria and my contamination scan pipeline (including the blastp steps) has problems with certain very difficult cases:
- Because it is based on finding hits to protein coding regions (via blastx or blastp), it doesn't detect contaminant scaffolds that lack any genes.
- The blastp steps use a database of non-invertebrate proteins that contain representative proteomes eg. E. coli K12, a B. subtilis strain, etc. This does not have every possible E. coli strain, so it's possible to have contamination with an E. coli strain that has many proteins missing from E. coli K12, and if a scaffold has 1 or 2 of such proteins, they may not be detected as contamination. 
- The second blastp step (aimed at finding flatworm contamination in a nematode assembly) can classify some scaffolds as contaminant if you are looking at a very diverged nematode in clade I that has genes that are closer to a flatworm's.
- The database of invertebrate proteins that I used was downloaded a while ago, so in some cases misses recent proteomes that are closely related to a particular query (eg. nematode) species.

No comments: