Monday 8 June 2020

Searching for software on the Sanger farm

This is only useful to Sanger users...

To find software on the farm, you can use:
% /software/bin/locate_software

e.g. to find the blat software:
% /software/bin/locate_software blat 


Using FLASH to merge paired Illumina reads

I have two fastq files from an Illumina sequencing library, one with the forward reads and one with the reverse reads. I want to see if I can merge the read-pairs, as I know that in many cases the insert size is small, so the forward and reverse reads should overlap.
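
Before merging, it can be worth checking that the two files contain the same number of reads, since each forward read needs a matching reverse read. A minimal sketch, assuming gzip-compressed files with the usual four lines per fastq record (no wrapped sequence lines); the function name is just for illustration:

```shell
# Count the records in a gzip-compressed fastq file.
# Assumes exactly four lines per record; a non-integer result
# suggests a truncated or malformed file.
count_fastq_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}
```

Running count_fastq_reads on my-R1.fastq.gz and on my-R2.fastq.gz should then print the same number for both files.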

To merge them, I'm going to try the FLASH software, described in a paper by Magoč & Salzberg (2011).

On the command line, it seems I can run it by typing:
% /nfs/team87/farm3_lims2_vms/software/crispresso_dependencies/bin/flash 
This tells me the input should be:
Usage: flash [OPTIONS] MATES_1.FASTQ MATES_2.FASTQ

Let's try it on my data (where I have 20,593 read-pairs):
% /nfs/team87/farm3_lims2_vms/software/crispresso_dependencies/bin/flash my-R1.fastq.gz my-R2.fastq.gz

This gives me output:
[FLASH] Starting FLASH v1.2.11
[FLASH] Fast Length Adjustment of SHort reads
[FLASH]
[FLASH] Input files:
[FLASH]     my-R1.fastq.gz
[FLASH]     my-R2.fastq.gz
[FLASH]
[FLASH] Output files:
[FLASH]     ./out.extendedFrags.fastq
[FLASH]     ./out.notCombined_1.fastq
[FLASH]     ./out.notCombined_2.fastq
[FLASH]     ./out.hist
[FLASH]     ./out.histogram
[FLASH]
[FLASH] Parameters:
[FLASH]     Min overlap:           10
[FLASH]     Max overlap:           65
[FLASH]     Max mismatch density:  0.250000
[FLASH]     Allow "outie" pairs:   false
[FLASH]     Cap mismatch quals:    false
[FLASH]     Combiner threads:      16
[FLASH]     Input format:          FASTQ, phred_offset=33
[FLASH]     Output format:         FASTQ, phred_offset=33
[FLASH]
[FLASH] Starting reader and writer threads
[FLASH] Starting 16 combiner threads
[FLASH] Processed 20593 read pairs
[FLASH]
[FLASH] Read combination statistics:
[FLASH]     Total pairs:      20593
[FLASH]     Combined pairs:   137
[FLASH]     Uncombined pairs: 20456
[FLASH]     Percent combined: 0.67%
[FLASH]
[FLASH] Writing histogram files.
[FLASH] WARNING: An unexpectedly high proportion of combined pairs (41.61%)
overlapped by more than 65 bp, the --max-overlap (-M) parameter.  Consider
increasing this parameter.  (As-is, FLASH is penalizing overlaps longer than
65 bp when considering them for possible combining!)
[FLASH]
[FLASH] FLASH v1.2.11 complete!
[FLASH] 1.031 seconds elapsed
[FLASH] Finished with 1 warning (see above)
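
To put the warning in context, the percentage can be turned back into a number of read-pairs. A quick sketch of the arithmetic using awk; the function name is just for illustration, and the two figures are copied from the output above:

```shell
# Convert "P percent of N pairs" back into a rounded number of pairs.
pairs_from_percent() {
    awk -v n="$1" -v p="$2" 'BEGIN { printf "%.0f\n", n * p / 100 }'
}

# 137 combined pairs, of which 41.61% overlapped by more than 65 bp.
pairs_from_percent 137 41.61
```

This prints 57, so the warning concerns about 57 of the 137 combined pairs.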


The first run gave me a warning that I should increase the --max-overlap (-M) parameter, so next I tried:
% /nfs/team87/farm3_lims2_vms/software/crispresso_dependencies/bin/flash -M 200 my-R1.fastq.gz my-R2.fastq.gz 
[FLASH] Starting FLASH v1.2.11
[FLASH] Fast Length Adjustment of SHort reads
[FLASH] 
[FLASH] Input files:
[FLASH]     my-R1.fastq.gz
[FLASH]     my-R2.fastq.gz
[FLASH] 
[FLASH] Output files:
[FLASH]     ./out.extendedFrags.fastq
[FLASH]     ./out.notCombined_1.fastq
[FLASH]     ./out.notCombined_2.fastq
[FLASH]     ./out.hist
[FLASH]     ./out.histogram
[FLASH] 
[FLASH] Parameters:
[FLASH]     Min overlap:           10
[FLASH]     Max overlap:           200
[FLASH]     Max mismatch density:  0.250000
[FLASH]     Allow "outie" pairs:   false
[FLASH]     Cap mismatch quals:    false
[FLASH]     Combiner threads:      16
[FLASH]     Input format:          FASTQ, phred_offset=33
[FLASH]     Output format:         FASTQ, phred_offset=33
[FLASH] 
[FLASH] Starting reader and writer threads
[FLASH] Starting 16 combiner threads
[FLASH] Processed 20593 read pairs
[FLASH] 
[FLASH] Read combination statistics:
[FLASH]     Total pairs:      20593
[FLASH]     Combined pairs:   140
[FLASH]     Uncombined pairs: 20453
[FLASH]     Percent combined: 0.68%
[FLASH] 
[FLASH] Writing histogram files.
[FLASH] 
[FLASH] FLASH v1.2.11 complete!
[FLASH] 0.273 seconds elapsed


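FLASH seems to work fine. However, only a small percentage of my read-pairs were combined (140 of 20,593, or 0.68%), and increasing the max overlap from 65 to 200 only added three more combined pairs. Oh well!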
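
With so few merged pairs, it can be informative to look at how long the merged fragments are. A minimal sketch, assuming the merged reads are in ./out.extendedFrags.fastq (as listed in the output above) with four lines per fastq record; the function name is just for illustration:

```shell
# Tabulate sequence lengths in an uncompressed fastq file.
# Prints one "length count" pair per line, sorted by length.
# Assumes four lines per record, with the sequence on line 2 of each.
merged_length_counts() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (len in n) print len, n[len] }' "$1" | sort -n
}
```

For example:
% merged_length_counts ./out.extendedFrags.fastq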