Sometimes I want to take a sneak peek at some sequencing data, which is given to me hot from the sequencing machines in the form of a pair of fastq files (paired-end reads: one fastq file for the forward reads and one for the reverse reads). I used to just take the first
n (e.g. 250,000) read-pairs from the top of each fastq file to have a look. However, as discussed on
seqanswers here, it's often the case that:
“SOLiD or Illumina output files have a pile of
rubbish at the start of the file from bad sequencing reads (the edges of chips
/ cells seem to be more prone to error), which heavily influences the results
gleaned from the first few reads.”
To get around this, they suggest you subsample randomly using
HTSeq, and also point to a discussion
here.
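To make the idea concrete, here is a minimal sketch of random subsampling for a pair of fastq files. It uses only the Python standard library rather than HTSeq, and it uses reservoir sampling, which is my choice for illustration and not necessarily what the seqanswers thread does. The function and parameter names (subsample_pairs, open_fastq, read_records, seed) are made up for this example. It assumes every record is exactly four lines (no wrapped sequences) and that the two files list their reads in the same order.

```python
import gzip
import itertools
import random


def open_fastq(path):
    """Open a plain or gzip-compressed fastq file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def read_records(handle):
    """Yield fastq records as 4-line strings (assumes unwrapped sequences)."""
    while True:
        record = list(itertools.islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated fastq record")
        yield "".join(record)


def subsample_pairs(fwd_path, rev_path, out_fwd, out_rev, n, seed=None):
    """Write up to n randomly chosen read-pairs; return the number written.

    Forward and reverse reads are sampled together, so mates stay paired.
    Assumption: both files hold the same reads in the same order. If one
    file is shorter, the extra records in the other are silently ignored.
    """
    rng = random.Random(seed)
    reservoir = []  # holds at most n (forward, reverse) record pairs
    with open_fastq(fwd_path) as fwd, open_fastq(rev_path) as rev:
        for i, pair in enumerate(zip(read_records(fwd), read_records(rev))):
            if i < n:
                # Fill the reservoir with the first n pairs.
                reservoir.append(pair)
            else:
                # Replace a random slot with probability n / (i + 1).
                j = rng.randint(0, i)
                if j < n:
                    reservoir[j] = pair
    with open(out_fwd, "w") as out1, open(out_rev, "w") as out2:
        for rec1, rec2 in reservoir:
            out1.write(rec1)
            out2.write(rec2)
    return len(reservoir)
```

Calling something like subsample_pairs("reads_1.fastq", "reads_2.fastq", "sub_1.fastq", "sub_2.fastq", 250000) (file names hypothetical) reads each input once and keeps only n read-pairs in memory, so the whole run can contribute to the sample instead of just the top of the file. The output pairs are not in their original file order, but the forward and reverse files stay in sync with each other.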