Tuesday 8 January 2013

Using RNA-SeqQC to assess RNA-seq data quality

The RNA-SeQC program from the Broad Institute is a java program that helps you to assess the quality of an RNA-seq library by calculating lots of useful metrics.

I've been running RNA-SeqQC on some RNA-seq data from Plasmodium knowlesi, and founded that I needed to carry out various steps to get the program to run:

1) Preparing the fasta file of scaffolds:
(i) fasta file of scaffolds: RNA-SeQC requires a fasta file of scaffolds as input. I downloaded 15 embl files for P. knowlesi from the Sanger ftp site and then used my Perl script embl_to_fasta.pl to convert the 15 embl files to 15 fasta files. I then concatenated the 15 fasta files for the different scaffolds into one fasta file with all the scaffolds' sequences 'Pk_strainH_scaffolds.fa'.
(ii) samtools index (fai) file: 
RNA-SeqQC requires that you make a samtools index file for the fasta file of scaffolds, which I did by typing:
% samtools faidx Pk_strainH_scaffolds.fa
This made file 'Pk_strainH_scaffolds.fa.fai'.
(iii) dict file:
 RNA-SeqQC also requires that you make a dict (dictionary) file for the fasta file of scaffolds. I did this using Picard 'CreateSequenceDictionary.jar' by typing:
% /software/bin/java -Xmx128M -jar /software/varinf/releases/PicardTools/picard-tools-1.77/CreateSequenceDictionary.jar R=Pk_strainH_scaffolds.fa O=Pk_strainH_scaffolds.dict
This made a dictionary file called Pk_strainH_scaffolds.dict.
[Note that '-Xmx128M' is required so that java will not run out of memory on the farm head node.]

2) Preparing the gtf file of annotations: RNA-SeQC requires a gtf format file of annotations for protein-coding and rRNA genes as input. I downloaded 15 gff files for P. knowlesi scaffolds from the Sanger ftp site and then used my Perl script convert_chado_gff_to_gtf.pl to convert the 15 gff files to 15 gtf files. I then concatenated the 15 gtf files for the different scaffolds into one gtf file with the annotation for all scaffolds, called 'Pk_strainH_scaffolds.gtf'.
     Note that this takes rRNA and protein-coding genes from the gff file and puts them into the gtf file. It's necessary to include the rRNA genes, as RNA-SeqQC uses these. 

3) Preparing the bam file:  
(i) adding read-group information to the bam: If there is not already read-group information in the bam (ie. if the bam header does not a line starting with '@RG'), then you will need to add read-group information to the bam file. You can do this using the AddOrReplaceReadGroups.jar program from Picard, by typing, for example:
% /software/farm/java/bin/java -Xmx128M -jar /software/varinf/releases/PicardTools/picard-tools-1.77/AddOrReplaceReadGroups.jar INPUT=/lustre/scratch108/parasites/alc/RNA-SeQC/Pknowlesi/tp1.bam OUTPUT=/lustre/scratch108/parasites/alc/RNA-SeQC/Pknowlesi/tp1_new2.bam RGID="8824_1#1" RGLB=Pk_TP1_1_dUTP_RNA RGPL=ILLUMINA RGPU=6081750 RGSM=Pk_TP1_1_dUTP_RNA RGCN=SC RGDS="Plasmodium knowlesi transcriptome sequencing (lc5)"
      In this example, I gave 'tp1.bam' as the input bam, and got 'tp1_new2.bam' as the output bam. I set the following read-group information:
RGID="8824_1#1" ---> Read Group ID
RGLB=Pk_TP1_1_dUTP_RNA  ---> Read Group library
RGPL=ILLUMINA ---> Read Group platform
RGPU=6081750 ---> Read Group platform unit
RGSM=Pk_TP1_1_dUTP_RNA ---> Read Group sample name
RGCN=SC ---> Read Group sequencing centre name
RGDS="Plasmodium knowlesi transcriptome sequencing (lc5)" ---> read group description
(ii) changing the header of the bam file: RNA-SeqQC requires that the order of the scaffolds in the bam file's header is the same as the order of the scaffolds in the fasta file of scaffolds. If this is not the case, you will have to change the order of the scaffolds in the header of the bam file.
      To change the header of the bam file, I first extracted the header of my bam file in sam format:
% samtools view -H tp1_new2.bam > header
I then edited the file 'header' to change the order of the scaffolds so that they are the same as in the fasta file of scaffolds. I then converted my bam file to a sam file:
% samtools view tp1_new2.bam > tp1_new2.sam
I then put the new header and the sam file into one long sam file:
% cat header tp1_new2.sam > tp1_new4.sam
I then changed the sam file back into a bam file:
% samtools view -S -bt Pk_strainH_scaffolds.fa.fai tp1_new4.sam > tp1_new4.bam
I then sorted the bam file (this is required in order to index the bam file, which is the next step below):
% samtools sort tp1_new4.bam tp1_new_sorted4
This made a bam file 'tp1_new_sorted4.bam'. 
      Note that this is rather a long way of changing the header in a bam file. However, I found that when I tried to use 'samtools reheader' to change the header (using 'samtools reheader header tp1_new2.bam > tp1_new3.bam'), this seemed to mess up the output bam file.
(iii) samtools index (bai file):
RNA-SeqQC needs you to make a samtools index file for the bam file, which you can do, for example for bam 'tp1.bam', by typing:
% samtools index tp1_new_sorted4.bam tp1_new_sorted4.bam.bai

4) Running RNA-SeqQC:
To run RNA-SeqQC, I typed:
% /software/bin/java -Xmx2028M -jar /nfs/users/nfs_a/alc/Documents/RNASeqQC/RNA-SeQC/RNA-SeQC_v1.1.7.jar -n 1000 -s "TestId|tp1_new_sorted4.bam|TestDesc" -t Pk_strainH_scaffolds.gtf -r Pk_strainH_scaffolds.fa -o test1
This tells RNA-SeqQC that the input bam file is 'tp1_new_sorted4.bam', that the input gtf file is Pk_strainH_scaffolds.gtf, that the input fasta file of scaffolds is Pk_strainH_scaffolds.fa and that directory for the output files is 'test1'.
      Note that -Xmx5140M reserves 5 Gbyte of RAM for the job (for a ~3 Gbyte bam file). I submitted the job to a compute farm, also requesting 5 Gbyte of RAM (via the LSF job submission system).
      The -n 1000 option means that 1000 transcripts are used to calculate the RNA-seq quality information.
      This took about half an hour to run for my ~3 Gbyte bam file.

No comments: