I have a large amount of Illumina sequencing data (~17 million read-pairs) and, before running further analysis, want to filter the data by average base quality, for example by discarding all read-pairs that have an average base quality of <=20. My data are in two fastq files of paired Illumina reads, one for forward reads and one for reverse reads.
From a discussion on seqanswers about filtering, I found that people suggested Trimmomatic.
I tried running Trimmomatic on some smaller files of 250,000 (0.25 million) read-pairs, using this command line (note to self: on the Sanger farm):
% java -Xmx128M -jar /software/pathogen/external/apps/usr/local/Trimmomatic-0.33/trimmomatic-0.33.jar PE -threads 1 -phred33 sample1topsmall_1.fastq.gz sample1topsmall_2.fastq.gz sample1topsmallout_1_paired.fastq.gz sample1topsmallout_1_unpaired.fastq.gz sample1topsmallout_2_paired.fastq.gz sample1topsmallout_2_unpaired.fastq.gz SLIDINGWINDOW:150:20
PE means my reads are paired ends,
-threads 1 tells Trimmomatic to just use one thread (one CPU),
-phred33 tells Trimmomatic that the phred quality scores in my file use ASCII_BASE=33 (see here for an explanation), (note to self: I knew this from CRISPResso output)
sample1topsmall_1.fastq.gz sample1topsmall_2.fastq.gz are my input fastq files,
sample1topsmallout_1_paired.fastq.gz, sample1topsmallout_1_unpaired.fastq.gz, sample1topsmallout_2_paired.fastq.gz, sample1topsmallout_2_unpaired.fastq.gz are the output files produced by Trimmomatic (the two 'paired' files hold reads where both reads of a pair pass the filter, the two 'unpaired' files hold reads where just one read of a pair passes),
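SLIDINGWINDOW:150:20 tells Trimmomatic to use a sliding window of 150 bp and to require an average base quality of >=20 within the window. Because the window is as long as my reads, this effectively tests the average quality of the whole read (note to self: my reads are 150 bp long).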
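To make the -phred33 and "average base quality" ideas concrete, here is a minimal Python sketch that decodes a quality string with ASCII_BASE=33 and checks whether the read's mean quality is at least 20. It is not part of Trimmomatic, and the function names are made up for illustration. It only tests a whole-read average, whereas Trimmomatic's SLIDINGWINDOW step is a trimming step, so the two are not guaranteed to agree on every read.

```python
def phred33_scores(quality_string):
    """Decode a FASTQ quality string (ASCII_BASE=33) into integer Phred scores."""
    return [ord(ch) - 33 for ch in quality_string]


def mean_quality(quality_string):
    """Arithmetic mean of the Phred scores of one read (0.0 for an empty read)."""
    scores = phred33_scores(quality_string)
    return sum(scores) / len(scores) if scores else 0.0


def passes_filter(quality_string, min_mean=20):
    """True if the read's mean base quality is at least min_mean."""
    return mean_quality(quality_string) >= min_mean


if __name__ == "__main__":
    good = "IIIIIIIIII"  # 'I' is ASCII 73, i.e. Phred 40
    poor = "II########"  # '#' is ASCII 35, i.e. Phred 2
    print(mean_quality(good), passes_filter(good))
    print(mean_quality(poor), passes_filter(poor))
```

The second read averages well below 20 because eight of its ten bases have Phred score 2, so it fails the check even though its first two bases are high quality.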
Output from Trimmomatic
Trimmomatic ran really quickly! It gave some output like this:
Input Read Pairs: 250000 Both Surviving: 1619 (0.65%) Forward Only Surviving: 228797 (91.52%) Reverse Only Surviving: 604 (0.24%) Dropped: 18980 (7.59%)
TrimmomaticPE: Completed successfully
You can see here that my forward reads were of much higher quality than my reverse reads, so very few read-pairs survived with both reads of the pair passing the quality filter. The output files sample1topsmallout_1_paired.fastq.gz and sample1topsmallout_2_paired.fastq.gz have 1619 reads each, as 1619 read-pairs passed.
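The four categories in the summary line can be pictured with a small Python sketch. It assumes each read has already been marked as passing or failing the filter, and its function names are hypothetical, not Trimmomatic's. It also rechecks that the reported counts add up to the 250000 input pairs and reproduce the reported percentages.

```python
from collections import Counter


def classify_pair(forward_ok, reverse_ok):
    """Name the summary category for one read pair."""
    if forward_ok and reverse_ok:
        return "Both Surviving"
    if forward_ok:
        return "Forward Only Surviving"
    if reverse_ok:
        return "Reverse Only Surviving"
    return "Dropped"


def summarise(pair_flags):
    """Count categories over an iterable of (forward_ok, reverse_ok) tuples."""
    return Counter(classify_pair(f, r) for f, r in pair_flags)


def percentages(counts):
    """Convert category counts to percentages of the total, to 2 decimals."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 2) for name, n in counts.items()}


if __name__ == "__main__":
    # Counts copied from the Trimmomatic summary line above.
    reported = {
        "Both Surviving": 1619,
        "Forward Only Surviving": 228797,
        "Reverse Only Surviving": 604,
        "Dropped": 18980,
    }
    assert sum(reported.values()) == 250000
    print(percentages(reported))
```

Only the "Both Surviving" pairs end up in the two paired output files; reads from the "Forward Only" and "Reverse Only" categories go to the corresponding unpaired files.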
Run-time and memory (RAM) required by Trimmomatic
I found Trimmomatic very fast. For 2.5 million read-pairs, it took about 2 minutes to run on the Sanger compute farm (using one CPU), and needed just ~80 Mbyte of memory (RAM).
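As a rough guide for the full dataset of ~17 million read-pairs, the timing above can be scaled up. This sketch assumes run-time grows linearly with the number of read-pairs, which I have not measured, so treat the result as a ballpark figure only. The function name is made up for illustration.

```python
def estimate_minutes(n_pairs, ref_pairs=2_500_000, ref_minutes=2.0):
    """Scale a reference timing linearly to n_pairs read-pairs (an assumption)."""
    return ref_minutes * n_pairs / ref_pairs


if __name__ == "__main__":
    # About 2 minutes for 2.5 million read-pairs, scaled to 17 million.
    print(round(estimate_minutes(17_000_000), 1), "minutes")
```

Under that assumption the full run would take roughly a quarter of an hour on one CPU.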