I'm trying to get my hands on some RNAseq gene expression data, and need to remind myself: what do RPKM and FPKM mean again?
RPKM: Reads per kilobase of transcript, per million mapped reads. This is a measure of a transcript's expression level, normalised for the length of the transcript and for the total number of mapped reads in the data set.
FPKM: Fragments per kilobase of transcript, per million mapped fragments (often loosely written as "per million mapped reads"). In RNAseq, the relative expression of a transcript is proportional to the number of cDNA fragments that originate from it. That is, if your data is paired-end RNAseq data and you see two reads from the same read-pair, they are counted as only one fragment when calculating FPKM (but as two reads when calculating RPKM).
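To make the difference between the two measures concrete, here is a minimal Python sketch of the arithmetic. The function names and all the numbers are made up for illustration, and it assumes the FPKM denominator counts mapped fragments (as in the Cufflinks definition) rather than mapped reads. It is not how Cufflinks itself estimates abundances, which involves a statistical model on top of this basic normalisation.

```python
def per_kilobase_per_million(count, transcript_length_bp, total_mapped):
    """Normalise a count by transcript length (in kb) and library size (in millions)."""
    length_kb = transcript_length_bp / 1_000
    mapped_millions = total_mapped / 1_000_000
    return count / (length_kb * mapped_millions)


def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript, per million mapped reads."""
    return per_kilobase_per_million(read_count, transcript_length_bp, total_mapped_reads)


def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript, per million mapped fragments (assumed denominator)."""
    return per_kilobase_per_million(fragment_count, transcript_length_bp, total_mapped_fragments)


# Hypothetical paired-end example for one transcript of 2,000 bp.
transcript_length_bp = 2_000
pairs_both_mates_mapped = 300   # each pair = 2 reads, but only 1 fragment
pairs_one_mate_mapped = 100     # each pair = 1 read, and 1 fragment

reads_on_transcript = 2 * pairs_both_mates_mapped + pairs_one_mate_mapped   # 700 reads
fragments_on_transcript = pairs_both_mates_mapped + pairs_one_mate_mapped   # 400 fragments

# Hypothetical library-wide totals.
total_mapped_reads = 20_000_000
total_mapped_fragments = 12_000_000

print("RPKM:", rpkm(reads_on_transcript, transcript_length_bp, total_mapped_reads))
print("FPKM:", fpkm(fragments_on_transcript, transcript_length_bp, total_mapped_fragments))
```

For single-end data every read is its own fragment, so the two functions give the same value when passed the same counts.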
How do you get RPKM or FPKM values?
1. First you need to map your raw reads to the genome assembly, e.g. using TopHat. If you already have a gff/gtf file of known genes, you can supply it with the -G option so that reads are mapped to the known genes.
2. You can then calculate the RPKM or FPKM using Cufflinks.
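As a rough sketch of what these two steps look like on the command line, the Python below only builds the TopHat and Cufflinks commands and prints them; it does not run anything. All file and directory names (genome_index, reads_1.fq, reads_2.fq, genes.gtf, tophat_out, cufflinks_out) are hypothetical, and it assumes a TopHat version that writes its alignments to accepted_hits.bam in the output directory. The helper functions are my own, not part of either tool. Check the manuals for the versions you have installed before relying on the exact options.

```python
import shlex


def tophat_command(index_prefix, reads, out_dir, gtf=None):
    """Build a TopHat command line: tophat [options] <index> <reads...>."""
    cmd = ["tophat", "-o", out_dir]
    if gtf is not None:
        cmd += ["-G", gtf]          # known gene models (gff/gtf)
    cmd.append(index_prefix)        # prefix of the Bowtie index for the assembly
    cmd += list(reads)              # one file for single-end, two for paired-end
    return cmd


def cufflinks_command(bam, out_dir, gtf=None):
    """Build a Cufflinks command line: cufflinks [options] <aligned reads>."""
    cmd = ["cufflinks", "-o", out_dir]
    if gtf is not None:
        cmd += ["-G", gtf]          # quantify against the known annotation only
    cmd.append(bam)
    return cmd


# Hypothetical file names, for illustration only.
step1 = tophat_command("genome_index", ["reads_1.fq", "reads_2.fq"],
                       "tophat_out", gtf="genes.gtf")
step2 = cufflinks_command("tophat_out/accepted_hits.bam",
                          "cufflinks_out", gtf="genes.gtf")

print(shlex.join(step1))
print(shlex.join(step2))
```

To actually run a step you could pass the list to subprocess.run(cmd, check=True). With Cufflinks, the per-gene and per-transcript FPKM values should then appear in the genes.fpkm_tracking and isoforms.fpkm_tracking files in the output directory.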
RNAseq data files
Just to remind myself, there are also different types of RNAseq data files:
BAM files: contain the reads mapped (aligned) to an assembly, in a compressed binary format.
BigWig files: useful for displaying RNAseq data, as they are in an indexed binary format.
There is some discussion of the FPKM values calculated by Cufflinks here. Apparently TopHat used to calculate RPKM itself, but that feature is now deprecated and using Cufflinks is recommended instead (see here).