Monday, 27 November 2017

Fastq format

I'm looking at a fastq file that has header lines like these, and I want to figure out what they mean:

@M03558:259:000000000-BH588:1:1101:16110:1341 2:N:0:NTTGTA
@M03558:259:000000000-BH588:1:1101:16089:1342 2:N:0:NTTGTA

@M03558:259:000000000-BH588:1:1101:15471:1344 2:N:0:NTTGTA
@M03558:259:000000000-BH588:1:1101:15455:1333 2:N:0:NTTGTA

@M03558:259:000000000-BH588:1:1101:14580:1411 2:N:0:CTTGTA
...


Luckily I found a nice Wikipedia description of the FASTQ format, which tells me that the fields probably mean:
M03558 = the unique instrument name
259 = the run id
000000000-BH588 = the flow cell id.
1 = the flow cell lane
1101 = the tile number within the flow cell lane (I get 28 different values for this)
16110 = x-coordinate of the cluster within the tile
1341 = y-coordinate of the cluster within the tile
2 = member of a pair, 1 or 2 (paired-end reads only). (I only see '2' here, so I think my fastq holds the second reads (R2) of a paired-end run, with the first reads in a separate file; single-end reads would show '1')
N = the read is not filtered, i.e. it passed the filter; 'Y' would mean it was filtered out (I only see 'N' here)
0 = this will be '0' when none of the control bits are on (I only see '0' here)
NTTGTA = this is the index sequence. I see several different index sequences, with different frequencies.
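
To check counts like the 28 tile values or the index frequencies, the header can be split into these fields with a few lines of Python. This is only a rough sketch: the names parse_header, read_headers and count_indexes are my own, and it assumes every header follows exactly the layout above and that each read takes exactly four lines in the file.

```python
from collections import Counter, namedtuple

# Field names follow the list above; they are my own labels, not an official API.
IlluminaHeader = namedtuple(
    "IlluminaHeader",
    "instrument run_id flowcell lane tile x y read is_filtered control index",
)


def parse_header(line):
    """Split one '@...' header line into its fields."""
    line = line.strip()
    if not line.startswith("@"):
        raise ValueError("not a FASTQ header line: %r" % line)
    # The part before the space identifies the cluster, the part after it the read.
    name, comment = line[1:].split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x, y = name.split(":")
    read, filtered, control, index = comment.split(":")
    return IlluminaHeader(
        instrument, int(run_id), flowcell, int(lane), int(tile),
        int(x), int(y), int(read), filtered == "Y", int(control), index,
    )


def read_headers(path):
    """Yield header lines, assuming exactly four lines per read (no wrapped sequences)."""
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 0:
                yield line


def count_indexes(header_lines):
    """Count how often each index sequence occurs."""
    return Counter(parse_header(line).index for line in header_lines)


if __name__ == "__main__":
    example = "@M03558:259:000000000-BH588:1:1101:16110:1341 2:N:0:NTTGTA"
    print(parse_header(example))
```

For a real file, something like count_indexes(read_headers("my_reads.fastq")) would give the index frequencies, where my_reads.fastq is just a placeholder file name.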
 
Thanks, Wikipedia!

 
