Wednesday, 29 January 2014

Bin assembly pipeline

[Of interest to Sanger users only]

My colleague Daria Gordon wrote a pipeline to create a 'bin assembly' out of the reads that didn't make it into the main genome assembly for a species.

Here are the steps (to be run in a 'screen' session):

1) Create a directory for your work in your /lustre directory, eg.:
% mkdir /lustre/scratch108/parasites/alc/bin_assembly_test
% cd /lustre/scratch108/parasites/alc/bin_assembly_test  

2) Create soft-links to your assembly fasta file, zipped fastq files (find out where they are using 'pathfind'), and bam files (if any).
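The soft-linking in step 2 could be scripted along these lines. This is only a sketch: link_inputs is a hypothetical helper name (not part of the pipeline), and the paths you pass it are whatever 'pathfind' reports for your own data.

```shell
# link_inputs DEST FILE...
# Soft-link each FILE into directory DEST under its own basename.
# Hypothetical helper for illustration; not part of the bin assembly pipeline.
# Pass absolute source paths so the links resolve from inside DEST.
link_inputs() {
    dest=$1
    shift
    for src in "$@"; do
        # Refuse to create a dangling link: the source must exist.
        if [ ! -e "$src" ]; then
            echo "missing input: $src" >&2
            return 1
        fi
        ln -s "$src" "$dest/$(basename "$src")"
    done
}
```

For example, from inside your work directory: link_inputs . /path/to/haemonchus.fasta /path/to/6956_8_1.fastq.gz (both paths are placeholders).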

3) Set PERL5LIB to find the necessary perl modules:
% export PERL5LIB=$PERL5LIB:/software/pathogen/internal/pathdev/vr-codebase/modules
[Note, 17-Dec-2015: the pipeline has since moved to
/software/parasites/projects/helminth_scripts/modules/HelminthGenomeAnalysis/, so you now need to type:
% export PERL5LIB=$PERL5LIB:/software/parasites/projects/helminth_scripts/modules]
You will probably also need:
% export PERL5LIB=$PERL5LIB:/software/parasites/internal/prod/lib
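Since the module directories have moved at least once, it may be worth checking that every entry on PERL5LIB still exists before starting. A minimal sketch, where check_perl5lib is a hypothetical helper name:

```shell
# check_perl5lib
# Report every entry of $PERL5LIB that is not an existing directory.
# Hypothetical helper for illustration; returns non-zero if any entry is stale.
check_perl5lib() {
    status=0
    old_ifs=$IFS
    IFS=:
    for dir in $PERL5LIB; do
        # An empty entry appears when PERL5LIB was unset before the export.
        [ -z "$dir" ] && continue
        if [ ! -d "$dir" ]; then
            echo "not a directory: $dir" >&2
            status=1
        fi
    done
    IFS=$old_ifs
    return $status
}
```

This only checks that the directories exist, not that the pipeline's modules actually load from them.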

4) Make a tab-delimited file with information on the lanes (eg. 'lanes.txt'), with four columns: lane id, first fastq file, second fastq file, insert size. The fastq files may be gzipped:

6956_8  6956_8_1.fastq.gz  6956_8_2.fastq.gz  3000
7623_6  7623_6_1.fastq.gz  7623_6_2.fastq.gz  450

Note: it's important that this file has no extra blank lines.
Also, if the file names contain '#' characters, you might need to put them in quotes (or rename the files) to avoid problems.
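A quick way to catch blank lines or a wrong column count before starting the pipeline is an awk check like the one below. This is a sketch, not part of the pipeline: check_lanes_file is a hypothetical helper name, and the rule that the insert size must be a whole number is my assumption based on the example above.

```shell
# check_lanes_file FILE
# Check that every line of FILE has exactly 4 tab-separated columns,
# that there are no blank lines, and that column 4 is a whole number.
# Hypothetical helper for illustration; returns non-zero on any problem.
check_lanes_file() {
    awk -F'\t' '
        /^[[:space:]]*$/ { printf "line %d: blank line\n", NR; bad = 1; next }
        NF != 4          { printf "line %d: expected 4 tab-separated columns, found %d\n", NR, NF; bad = 1; next }
        $4 !~ /^[0-9]+$/ { printf "line %d: insert size \"%s\" is not a whole number\n", NR, $4; bad = 1 }
        END              { exit bad }
    ' "$1"
}
```

Running check_lanes_file lanes.txt prints one message per problem line and prints nothing if the file looks fine.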

5) Start the pipeline:
% init_pipeline.pl HelminthGenomeAnalysis::PipeConfig::BinAssembly_conf -species_name <species> -assembly_file <assembly> -input_dir <input_dir> -pass 50hgi --lanes_file <lanes_file>
where <species> is your species name, eg. haemonchus,
<assembly> is your assembly file, eg. haemonchus.fasta,
<input_dir> is the directory with your input files, eg. /lustre/scratch108/parasites/alc/bin_assembly_test,
<lanes_file> is your lanes file, eg. lanes.txt
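Before starting the pipeline, you may want to confirm that every fastq file named in the lanes file is actually present in the input directory. The sketch below assumes the lanes file gives file names relative to <input_dir> (as in the example in step 4); check_lane_files is a hypothetical helper name.

```shell
# check_lane_files LANES_FILE INPUT_DIR
# Check that both fastq files of every lane exist inside INPUT_DIR.
# Hypothetical helper for illustration; assumes file names in LANES_FILE
# are relative to INPUT_DIR. Returns non-zero if any file is missing.
check_lane_files() {
    status=0
    tab=$(printf '\t')
    # The '|| [ -n "$lane" ]' also handles a last line without a newline.
    while IFS=$tab read -r lane fq1 fq2 insert || [ -n "$lane" ]; do
        for fq in "$fq1" "$fq2"; do
            if [ ! -e "$2/$fq" ]; then
                echo "lane $lane: cannot find $2/$fq" >&2
                status=1
            fi
        done
    done < "$1"
    return $status
}
```

For example: check_lane_files lanes.txt /lustre/scratch108/parasites/alc/bin_assembly_test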

6) Run the beekeeper commands that are printed at the end of the output from step 5, eg.:
% beekeeper.pl -url mysql://wormpipe_admin:50hgi@mcs10:3388/wormpipe_alc_bin_assembly_haemonchus -sync
% beekeeper.pl -url mysql://wormpipe_admin:50hgi@mcs10:3388/wormpipe_alc_bin_assembly_haemonchus -loop
