Thursday 25 October 2018

Randomly sample from a fastq file

Sometimes I want to take a sneak peek at some sequencing data, which is given to me hot from the sequencing machines in the form of a pair of fastq files (paired-end reads: one fastq file for the forward reads and one for the reverse reads). I used to just take the first n (e.g. 250,000) read-pairs from the top of each fastq file, to have a look. However, as discussed on seqanswers here, it's often the case that:
“SOLiD or Illumina output files have a pile of rubbish at the start of the file from bad sequencing reads (the edges of chips / cells seem to be more prone to error), which heavily influences the results gleaned from the first few reads.”
To get around this, they suggest you subsample randomly using HTSeq, and also point to a discussion here.
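
For a rough idea of what random subsampling of a pair of fastq files involves, here is a minimal sketch in plain Python. It is not the HTSeq approach suggested on seqanswers; it uses reservoir sampling from the standard library instead, so that the files only need to be read once and only n read-pairs are held in memory. It assumes uncompressed fastq files with exactly four lines per record (no wrapped sequence lines) and the same read order in both files. The function names and the file names in the usage comment are made up for illustration.

```python
import random
from itertools import islice, zip_longest


def read_fastq(handle):
    """Yield one record (a tuple of 4 lines) at a time from an open fastq file.

    Assumes exactly 4 lines per record (no wrapped sequence lines).
    """
    while True:
        record = tuple(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated fastq record")
        yield record


def sample_pairs(fwd_handle, rev_handle, n, seed=None):
    """Reservoir-sample up to n read-pairs, keeping forward and reverse reads matched."""
    rng = random.Random(seed)
    reservoir = []
    pairs = zip_longest(read_fastq(fwd_handle), read_fastq(rev_handle))
    for i, (fwd, rev) in enumerate(pairs):
        if fwd is None or rev is None:
            raise ValueError("the two files have different numbers of reads")
        if i < n:
            # Fill the reservoir with the first n pairs.
            reservoir.append((fwd, rev))
        else:
            # Replace a random slot with probability n / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < n:
                reservoir[j] = (fwd, rev)
    return reservoir


def write_pairs(pairs, out_fwd, out_rev):
    """Write the sampled pairs to two open output files."""
    for fwd, rev in pairs:
        out_fwd.writelines(fwd)
        out_rev.writelines(rev)


# Example usage (hypothetical file names):
# with open("reads_1.fastq") as f1, open("reads_2.fastq") as f2:
#     picked = sample_pairs(f1, f2, 250000, seed=42)
# with open("sub_1.fastq", "w") as o1, open("sub_2.fastq", "w") as o2:
#     write_pairs(picked, o1, o2)
```

Note that the sampled pairs do not come out in their original file order, which should not matter for a quick look at the data. Each pair is picked from the same position in both files, so the forward and reverse reads stay matched.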
 
