Friday 26 October 2018

Running CRISPResso on a ton of sequencing data

I have a ton of sequencing data from an Illumina HiSeq lane of a PCR amplicon, and I'd like to run CRISPResso on it. In a previous blogpost, I talked about how to run CRISPResso.

Here I'll talk about how to run it on loads of data! In this case I had ~20 million read-pairs for a single sample, while previously I had only run CRISPResso on 1-2 million read-pairs (and even that took a couple of days!).

Here is a little plot I made of CRISPResso run-time in hours versus number of input read-pairs:

You can see that once the number of read-pairs gets above about 1 million, the run-time is roughly 0.5-1 day. I guess the run-time may also depend on things like the base quality of the reads: if they are low quality, I'm guessing CRISPResso will have to make lots of alignments with indels/substitutions, and may infer lots of indel/substitution mutations, which will take time.

To run CRISPResso on ~20 million read-pairs from a single sample, here's what I did:

1. Split up the input fastq files into smaller files of 1 million reads each
I have paired-end reads, so the input data is in a file called sample1_1.fastq.gz (forward reads) and a file called sample1_2.fastq.gz (reverse reads). To split these up, I ran:
% python3 ~alc/Documents/git/Python/split_up_fastq.py sample1_1.fastq.gz 1000000 sample1_1_sub
% python3 ~alc/Documents/git/Python/split_up_fastq.py sample1_2.fastq.gz 1000000 sample1_2_sub
You can see my Python script on gist here.
Note to self: I submitted farm jobs to do this, as it's slow.
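
In case the gist link ever goes away, here's a minimal sketch of how a splitting script like this could work (this isn't my exact gist script; the chunk file naming scheme below is just for illustration):

import gzip
import sys

def split_fastq(infile, reads_per_file, outprefix):
    # Split a gzipped fastq into gzipped chunks of 'reads_per_file' reads each,
    # named outprefix.1.fastq.gz, outprefix.2.fastq.gz, ...
    out = None
    reads_in_chunk = 0
    chunk_count = 0
    with gzip.open(infile, "rt") as f:
        while True:
            record = [f.readline() for _ in range(4)]  # header, sequence, '+', qualities
            if not record[0]:
                break  # reached the end of the file
            if out is None or reads_in_chunk == reads_per_file:
                if out is not None:
                    out.close()
                chunk_count += 1
                out = gzip.open("%s.%d.fastq.gz" % (outprefix, chunk_count), "wt")
                reads_in_chunk = 0
            out.writelines(record)
            reads_in_chunk += 1
    if out is not None:
        out.close()

if __name__ == "__main__":
    # e.g. python3 split_fastq_sketch.py sample1_1.fastq.gz 1000000 sample1_1_sub
    split_fastq(sys.argv[1], int(sys.argv[2]), sys.argv[3])

The important thing is that the forward and reverse files are split in exactly the same way, so that the i-th forward chunk still pairs up with the i-th reverse chunk.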

2. Use Trimmomatic to discard low quality read-pairs.
I used the Trimmomatic software to discard read-pairs with low quality bases, only keeping read-pairs where both reads of the pair had an average base quality of >=20. To do this, I ran:
% python3 ~alc/Documents/git/Python/filter_fastq_files_using_trimmomatic.py sample1 17
where sample1 is the name of the sample in step 1 above, and 17 is the number of pairs of smaller fastq files that step 1 above made.
This script submits Trimmomatic jobs to the Sanger compute farm.
You can see my Python script on gist here.
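
For reference, the core of what gets submitted for each pair of smaller fastq files is a Trimmomatic paired-end command with an AVGQUAL:20 step, which drops any read whose average base quality is below 20 (Trimmomatic's PE mode then only keeps pairs where both reads survive). Here's a rough sketch, assuming LSF's bsub, with the jar path and file naming scheme made up for illustration:

import subprocess

def submit_trimmomatic_jobs(sample, num_subsets):
    # Submit one Trimmomatic PE job per pair of smaller fastq files to the farm.
    for i in range(1, num_subsets + 1):
        r1 = "%s_1_sub.%d.fastq.gz" % (sample, i)
        r2 = "%s_2_sub.%d.fastq.gz" % (sample, i)
        trim_cmd = ("java -jar trimmomatic-0.38.jar PE %s %s "
                    "%s_1_sub.%d.paired.fastq.gz %s_1_sub.%d.unpaired.fastq.gz "
                    "%s_2_sub.%d.paired.fastq.gz %s_2_sub.%d.unpaired.fastq.gz "
                    "AVGQUAL:20"
                    % (r1, r2, sample, i, sample, i, sample, i, sample, i))
        # bsub runs the command on the compute farm; -o/-e capture the job's output
        subprocess.check_call(["bsub", "-o", "trim.%d.out" % i,
                               "-e", "trim.%d.err" % i, trim_cmd])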

3. Run CRISPResso on each smaller subset of read-pairs.
I then ran CRISPResso on each smaller subset of read-pairs:
% python3 ~alc/Documents/git/Python/submit_crispresso_jobs_for_subsetsoffastq.py sample1 17
where sample1 is the name of the sample in step 1 above, and 17 is the number of pairs of smaller fastq files that step 1 above made.
This script submits CRISPResso jobs to the Sanger compute farm.
You can see my Python script on gist here; you would need to edit it at the 'xxx' positions to put in your amplicon sequence, gRNA sequence and coding sequence.
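
For completeness, here's roughly what the per-subset CRISPResso command looks like in that script (CRISPResso version 1 option names; the 'xxx' sequences and the file/folder names below are placeholders you'd need to fill in):

import subprocess

AMPLICON_SEQ = "xxx"   # your PCR amplicon sequence
GUIDE_SEQ = "xxx"      # your gRNA sequence
CODING_SEQ = "xxx"     # the coding (exon) subsequence of the amplicon

def submit_crispresso_jobs(sample, num_subsets):
    # Submit one CRISPResso job per pair of trimmed fastq subsets to the farm,
    # each writing into its own output folder.
    for i in range(1, num_subsets + 1):
        crispresso_cmd = ("CRISPResso -r1 %s_1_sub.%d.paired.fastq.gz "
                          "-r2 %s_2_sub.%d.paired.fastq.gz "
                          "-a %s -g %s -c %s -o crispresso_%s_sub%d"
                          % (sample, i, sample, i,
                             AMPLICON_SEQ, GUIDE_SEQ, CODING_SEQ, sample, i))
        subprocess.check_call(["bsub", "-o", "crispresso.%d.out" % i,
                               "-e", "crispresso.%d.err" % i, crispresso_cmd])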
