Thursday 31 January 2013

Using CNVnator to find copy number variation

I've been exploring using the CNVnator program from Mark Gerstein's lab. to identify copy number variations (CNVs) in a genome. 

This program analyses the read depth in a bam file of short read sequence data mapped to a genome assembly, and tries to identify regions that have unusual read depth and so might correspond to copy number variations.

To use CNVnator, you must specify a 'bin size', or window size. CNVnator then divides the genome assembly into nonoverlapping bins (windows) of this size, and uses the count of mapped reads in each window as the read depth for that window. In the CNVnator paper, they use 100-bp bins (windows) to analyse data from a trio from the 1000 Genomes Project.

In the case of the 1000 Genomes Project, the reads are from an individual sequenced as part of the 1000 Genomes Project, and are mapped to the human reference genome. In my case, I am using CNVnator to look at reads from a particular species, mapped to the assembly for that species, where we suspect that there are uncollapsed haplotypes in the genome assembly, which should be detected by CNVnator as putative 'deletions' (regions of read depth < 1). It can also find potential 'duplications' (regions of read depth > 1), which could be due to collapsed repeats. The CNVnator paper says that it is harder to find potential duplications than potential deletions, and I find that in practice it tends to identify fewer potential duplications than deletions.

In the CNVnator paper, they say that CNVnator can identify CNVs from a few 100 bases to megabases in length. Furthermore, the precision is good: 200 bp for 90% of the breakpoints in a test case studied in the CNVnator paper (using a bin size of 100 bp). The higher the coverage you have, the smaller the bin size you can use, which will give you greater precision. They recommend to use ~100-bp bins for 20-30x coverage, ~500-bp bins for 4-6x coverage, and ~30-bp bins for 100x coverage. However, they say that the bin size used shouldn't be shorter than the read length in your data.

Running CNVnator

1) Download the program 
You can download CNVnator from the CNVnator download page.

2) Set environmental variables
Before running CNVnator you need to set the ROOTSYS variable to the directory where you unpacked the root distribution, eg.:
> setenv ROOTSYS "/nfs/users/nfs_b/bf3/bin/root/"
You then need to set the LD_LIBRARY_PATH variable:

3) Preparing input data & preprocessing the data
The input files required for CNVnator are:
(i) a fasta file of scaffolds/chromosomes,
(ii) a bam file,
(iii) individual fasta files for each scaffold/chromosome. 
I find that CNVnator needs you to have individual fasta files for each scaffold/chromosome in the current directory. As well as this, CNVnator seems to stall if the fasta files have a lot of characters per line of sequence (eg. 1000s of characters), but this was solved by reformatting my fasta files so that they had just 60 characters per line of sequence (using my script

To run CNVnator, you must run a pre-processing step to pull the reads out of the bam file:
> /nfs/users/nfs_b/bf3/bin/CNVnator_v0.2.7/src/cnvnator -root out.root -genome ref.fa -chrom SVE.scaffold.00029.709875_77520_542822 -tree out.sorted.markdup.bam
where /nfs/users/nfs_b/bf3/bin/CNVnator_v0.2.7/src/ is the path to where you have installed CNVnator. 
The other arguments are:
-root: the name of a file that CNVnator will make to store a pre-processed form of your data,
-genome: your genome assembly file, 
-chrom: the chromosome/scaffold that you are interested in,
-tree: your bam file. 
This step seemed to require quite a lot of memory (RAM).  For a 419 Mbyte RAM file (6,138,830 100-bp reads), this needed 1.5 Gbyte of RAM. It took about 25 minutes to run.

4) Making a histogram 
The next step is to make a histogram (presumably a histogram of the read depth, for all the windows in your genome assembly). This is not memory consuming. To run this you need to have fasta files in your current directory with the sequences of all your scaffolds/chromosomes. For example, if you have scaffolds scaffold1, scaffold2, etc., you will need files scaffold1.fa, scaffold2.fa, etc. in your current directory. You can then run:
>  /nfs/users/nfs_b/bf3/bin/CNVnator_v0.2.7/src/cnvnator -root out.root -genome ref.fa -chrom SVE.scaffold.00029.709875_77520_542822 -his 100
The '100' at the end is the bin size (window size) that you want to use.
This step takes less than 1 minute to run.

5) Calculating statistics
I think this step is to calculate statistical significances (p-values) for the windows that have unusual read depth.
% /nfs/users/nfs_b/bf3/bin/CNVnator_v0.2.7/src/cnvnator -root out.root -genome ref.fa -chrom SVE.scaffold.00029.709875_77520_542822 -stat 100
Here the '100' at the end is the bin size (window size).
This step only takes a few seconds.

6) Partitioning
This step partitions the chromosomes/scaffolds into long regions (each one of which could be longer than the window size) that have similar read depth, and so presumably similar copy number.
> /nfs/users/nfs_b/bf3/bin/CNVnator_v0.2.7/src/cnvnator -root out.root -genome ref.fa -chrom SVE.scaffold.00029.709875_77520_542822 -partition 100
Again, the '100' at the end is the bin size (window size).
The README file for the CNVnator program says that this is the most time-consuming step.
 It took just a few seconds to run in my case (on one scaffold).

7) Identifying CNVs
This step finds potential CNVs.
> /nfs/users/nfs_b/bf3/bin/CNVnator_v0.2.7/src/cnvnator -root out.root -genome ref.fa -chrom SVE.scaffold.00029.709875_77520_542822 -call 100 > out.100.cnvnator
Here the '100' is the window size. I have redirected the output to the file 'out.100.cnvnator'.
Again, this just took a few seconds to run.

According to the CNVnator README file, the columns of the output file are:
CNV_type coordinates CNV_size normalized_RD p-val1 p-val2 p-val3 p-val4 q0
normalized_RD: normalized to 1.
p-val1: is calculated using t-test statistics.
p-val2: is from probability of RD values within the region to be in the tails of gaussian distribution describing frequencies of RD values in bins.
p-val3: same as p-val1 but for the middle of CNV
p-val4: same as p-val2 but for the middle of CNV
q0: fraction of reads mapped with q0 quality

Strangely, I find that some of the p-values in the output file are greater than 1, which they obviously shouldn't be if they are really p-values. I'm not sure why this is..

8) Visualising a CNV region
You can visualise one of the CNV regions by typing (you will need to have logged in using 'ssh -X' if you are doing this on the Sanger farm):
[Note to self: I found this worked for me on pcs4, but not on the farm head nodes]
> /nfs/users/nfs_b/bf3/bin/CNVnator_v0.2.7/src/cnvnator -root out.root -genome ref.fa -chrom SVE.scaffold.00029.709875_77520_542822 -view 100
This will bring up a prompt '>'. Type the region you are interested in on the prompt, eg.:
> SVE.scaffold.00029.709875_77520_542822:1-10000
This will bring up a lovely picture of your region:

Here the green line goes up suddenly at position ~4700, suggesting that the read depth (and so copy number) increases quite a lot at that point in the scaffold.

Thanks to Bernardo Foth for a lot of help and instructions for running CNVnator (most of the above was worked out by Bernardo), and to Alan Tracey for example data.

A wrapper script for running CNVnator 
It takes a bit of time to paste in all the above commands, so I've written a simple wrapper script to run CNVnator on an assembly,

Running CNVnator on the Sanger compute farm
This will be useful to Sanger people only: if you want to run CNVnator on an assembly relatively quickly, it is useful to divide up your assembly into smaller subsets, and run each subset on a separate farm node. I have a script to do this, It submits the jobs as a LSF 'job group', so that only 100 jobs run at any one time. For each job, it requests 3.5 Gbyte of RAM.
[Note to self: Latest cnvnator : /nfs/users/nfs_b/bf3/bin/CNVnator_v0.3/src/ ]  

Tuesday 29 January 2013

Solving equations using SAGE

It is very easy to solve equations using the open source SAGE maths software. For example, to solve the equation:
we can type in SAGE:
> solve(3*x^2 + 16*x - 28 == 0, x)
[x == -2/3*sqrt(37) - 8/3, x == 2/3*sqrt(37) - 8/3]
We can round the answer to three decimal places by typing:
> round(-2/3*sqrt(37) - 8/3, 3)
> round(2/3*sqrt(37) - 8/3, 3)    
This tells us that the two solutions are 1.389 and -6.722, rounded to three decimal places.
Easy peasy!

Another example
In the Little Book of R for Biomedical Statistics I have an equation:
n = numerator/denominator
numerator =  2 * ((qalpha + qgamma)^2) * pi0 * (1 - p0)
denominator = (piT - piC)^2
pi0 = (piT + piC)/2

Let's use SAGE to rearrange the equation in terms of piT: (Here I type 'a' for qalpha, 'g' for qgamma, 'T' for piT and 'C' for piC):
> a, g, T, C, n = var('a g T C n')
> numerator = 2 * ((a+g)^2) * ((T+C)/2) * (1 - ((T+C)/2))
> denominator = (T - C)^2
> solve(n == numerator/denominator, T)
[T == -((a^2 + 2*a*g + g^2 - 2*n)*C - a^2 - 2*a*g - g^2 + sqrt(-8*C^2*n + a^2 + 
2*a*g + g^2 + 8*C*n)*(a + g))/(a^2 + 2*a*g + g^2 + 2*n), T == -((a^2 + 2*a*g + 
g^2 - 2*n)*C - a^2 - 2*a*g - g^2 - sqrt(-8*C^2*n + a^2 + 2*a*g + g^2 + 
8*C*n)*(a + g))/(a^2 + 2*a*g + g^2 + 2*n)]

Wednesday 23 January 2013

Inferring transcript sequences from a gff file

Have you ever been given a gff file of gene predictions, but wanted to infer the transcript sequences (ie. the DNA sequences that would be translated into protein sequences) for the genes? I've written a perl script that does this, based on an input gff file of gene predictions, and an input fasta file of chromosome/scaffold sequences.

Monday 21 January 2013

Finding the highest-scoring non-overlapping gene predictions

If you have used GeneWise to make gene predictions for a species using different HMMs as input (for example, by using my perl script, GeneWise may make several overlapping gene predictions in each region of a chromosome/scaffold.

In this case, we are probably just interested in the most convincing gene predictions, ie. those that have been assigned the highest scores by GeneWise.

I've just written a script that will take a gff file of overlapping gene predictions (all with scores from the same source, eg. from GeneWise), and will make you an output gff file with just the set of highest-scoring non-overlapping gene predictions.

To do this, my script uses a greedy algorithm:
(i) it sorts the genes along each scaffold in order of decreasing score
(ii) it takes the highest-scoring gene for the scaffold, and removes all genes that overlap that gene
(iii) it takes the next highest-scoring gene in the scaffold, and removes all genes that overlap that gene
and so on, until all the genes in the scaffold have been considered.

This script could be used to filter out many gene predictions from a gff file, and could be used for a gff file of gene predictions from any source (not just GeneWise), as long as the gene scores in the file are all on a comparable scale.

Friday 18 January 2013

Using git with github

I have several git repositories at on the github website, including the source of my webpage on using R for time series analysis, "A Little Book of R for Time Series".

Whenever I need to update the webpage, I need to remember how to update it using git, I need to use the following commands (which I find quite hard to remember!):

% cd LittleBookofRTimeSeries    
[go into the directory containing my repository, on my laptop]

% git add src/timeseries.rst           
[tell git that I have edited timeseries.rst and want to update it] 

% git commit -m 'updated plotForecastErrors' 
[give git a message describing the update I made]

% git push origin master             
[send the change to github]

Additional changes can be made by just using 'git add', 'git commit', and 'git push origin master'.
Not so complicated after all!

Wednesday 9 January 2013

Using BEDTools to analyse gff and bed files

I've just found out about the BEDTools software, which is a nice software, for manipulating bed files.

Using BEDTools to find overlaps between gff file features
BEDTools can find overlaps between features in a gff format file, which is very useful. To do this, you just type:
% /nfs/users/nfs_a/alc/Documents/bin/bedtools-2.17.0/bin/intersectBed -loj -a gff1 -b gff2
where /nfs/users/nfs_a/alc/Documents/bin/bedtools-2.17.0/ is the directory where you installed BEDTools, 'gff1' is your first gff file, and 'gff2' is your second gff file. The -loj option means that for each feature in gff1, BEDTools will report each overlapping feature in gff2.
Very handy!

Using BEDTools to find non-overlapping features between gff files
If you want to find features in a first gff file that do not overlap any feature in the second gff file, you can type:
%  /nfs/users/nfs_a/alc/Documents/bin/bedtools-2.17.0/bin/intersectBed -v -a gff1 -b gff2
where /nfs/users/nfs_a/alc/Documents/bin/bedtools-2.17.0/ is the directory where you installed BEDTools, 'gff1' is your first gff file, and 'gff2' is your second gff file. This will report features in gff1 that have no overlap with features in gff2.

Using BEDTools to get a fasta file based on coordinates in a gff file
If you have a gff file of the coordinates of some features that you are interested in (eg. introns), and a fasta file of your genome, you can use BEDTools to make a fasta file of those features (eg. intron sequences):
% /nfs/users/nfs_a/alc/Documents/bin/bedtools-2.17.0/bin/bedtools getfasta -fi genome.fa -bed introns.gff -fo introns.fa -s
where /nfs/users/nfs_a/alc/Documents/bin/bedtools-2.17.0/ is the directory where you installed BEDTools, 'genome.fa' is the fasta file of your genome, 'introns.gff' is the input gff file, 'introns.fa' is the output fasta file of intron sequences, and the '-s' option means that introns on the negative strand will be reverse-complemented.

A note about bed format
An important thing to remember about bed format is that the start coordinate of a region is given as start-1, rather than start; that is, bed is a 'zero-based, half open format'.

Tuesday 8 January 2013

Using RNA-SeqQC to assess RNA-seq data quality

The RNA-SeQC program from the Broad Institute is a java program that helps you to assess the quality of an RNA-seq library by calculating lots of useful metrics.

I've been running RNA-SeqQC on some RNA-seq data from Plasmodium knowlesi, and founded that I needed to carry out various steps to get the program to run:

1) Preparing the fasta file of scaffolds:
(i) fasta file of scaffolds: RNA-SeQC requires a fasta file of scaffolds as input. I downloaded 15 embl files for P. knowlesi from the Sanger ftp site and then used my Perl script to convert the 15 embl files to 15 fasta files. I then concatenated the 15 fasta files for the different scaffolds into one fasta file with all the scaffolds' sequences 'Pk_strainH_scaffolds.fa'.
(ii) samtools index (fai) file: 
RNA-SeqQC requires that you make a samtools index file for the fasta file of scaffolds, which I did by typing:
% samtools faidx Pk_strainH_scaffolds.fa
This made file 'Pk_strainH_scaffolds.fa.fai'.
(iii) dict file:
 RNA-SeqQC also requires that you make a dict (dictionary) file for the fasta file of scaffolds. I did this using Picard 'CreateSequenceDictionary.jar' by typing:
% /software/bin/java -Xmx128M -jar /software/varinf/releases/PicardTools/picard-tools-1.77/CreateSequenceDictionary.jar R=Pk_strainH_scaffolds.fa O=Pk_strainH_scaffolds.dict
This made a dictionary file called Pk_strainH_scaffolds.dict.
[Note that '-Xmx128M' is required so that java will not run out of memory on the farm head node.]

2) Preparing the gtf file of annotations: RNA-SeQC requires a gtf format file of annotations for protein-coding and rRNA genes as input. I downloaded 15 gff files for P. knowlesi scaffolds from the Sanger ftp site and then used my Perl script to convert the 15 gff files to 15 gtf files. I then concatenated the 15 gtf files for the different scaffolds into one gtf file with the annotation for all scaffolds, called 'Pk_strainH_scaffolds.gtf'.
     Note that this takes rRNA and protein-coding genes from the gff file and puts them into the gtf file. It's necessary to include the rRNA genes, as RNA-SeqQC uses these. 

3) Preparing the bam file:  
(i) adding read-group information to the bam: If there is not already read-group information in the bam (ie. if the bam header does not a line starting with '@RG'), then you will need to add read-group information to the bam file. You can do this using the AddOrReplaceReadGroups.jar program from Picard, by typing, for example:
% /software/farm/java/bin/java -Xmx128M -jar /software/varinf/releases/PicardTools/picard-tools-1.77/AddOrReplaceReadGroups.jar INPUT=/lustre/scratch108/parasites/alc/RNA-SeQC/Pknowlesi/tp1.bam OUTPUT=/lustre/scratch108/parasites/alc/RNA-SeQC/Pknowlesi/tp1_new2.bam RGID="8824_1#1" RGLB=Pk_TP1_1_dUTP_RNA RGPL=ILLUMINA RGPU=6081750 RGSM=Pk_TP1_1_dUTP_RNA RGCN=SC RGDS="Plasmodium knowlesi transcriptome sequencing (lc5)"
      In this example, I gave 'tp1.bam' as the input bam, and got 'tp1_new2.bam' as the output bam. I set the following read-group information:
RGID="8824_1#1" ---> Read Group ID
RGLB=Pk_TP1_1_dUTP_RNA  ---> Read Group library
RGPL=ILLUMINA ---> Read Group platform
RGPU=6081750 ---> Read Group platform unit
RGSM=Pk_TP1_1_dUTP_RNA ---> Read Group sample name
RGCN=SC ---> Read Group sequencing centre name
RGDS="Plasmodium knowlesi transcriptome sequencing (lc5)" ---> read group description
(ii) changing the header of the bam file: RNA-SeqQC requires that the order of the scaffolds in the bam file's header is the same as the order of the scaffolds in the fasta file of scaffolds. If this is not the case, you will have to change the order of the scaffolds in the header of the bam file.
      To change the header of the bam file, I first extracted the header of my bam file in sam format:
% samtools view -H tp1_new2.bam > header
I then edited the file 'header' to change the order of the scaffolds so that they are the same as in the fasta file of scaffolds. I then converted my bam file to a sam file:
% samtools view tp1_new2.bam > tp1_new2.sam
I then put the new header and the sam file into one long sam file:
% cat header tp1_new2.sam > tp1_new4.sam
I then changed the sam file back into a bam file:
% samtools view -S -bt Pk_strainH_scaffolds.fa.fai tp1_new4.sam > tp1_new4.bam
I then sorted the bam file (this is required in order to index the bam file, which is the next step below):
% samtools sort tp1_new4.bam tp1_new_sorted4
This made a bam file 'tp1_new_sorted4.bam'. 
      Note that this is rather a long way of changing the header in a bam file. However, I found that when I tried to use 'samtools reheader' to change the header (using 'samtools reheader header tp1_new2.bam > tp1_new3.bam'), this seemed to mess up the output bam file.
(iii) samtools index (bai file):
RNA-SeqQC needs you to make a samtools index file for the bam file, which you can do, for example for bam 'tp1.bam', by typing:
% samtools index tp1_new_sorted4.bam tp1_new_sorted4.bam.bai

4) Running RNA-SeqQC:
To run RNA-SeqQC, I typed:
% /software/bin/java -Xmx2028M -jar /nfs/users/nfs_a/alc/Documents/RNASeqQC/RNA-SeQC/RNA-SeQC_v1.1.7.jar -n 1000 -s "TestId|tp1_new_sorted4.bam|TestDesc" -t Pk_strainH_scaffolds.gtf -r Pk_strainH_scaffolds.fa -o test1
This tells RNA-SeqQC that the input bam file is 'tp1_new_sorted4.bam', that the input gtf file is Pk_strainH_scaffolds.gtf, that the input fasta file of scaffolds is Pk_strainH_scaffolds.fa and that directory for the output files is 'test1'.
      Note that -Xmx5140M reserves 5 Gbyte of RAM for the job (for a ~3 Gbyte bam file). I submitted the job to a compute farm, also requesting 5 Gbyte of RAM (via the LSF job submission system).
      The -n 1000 option means that 1000 transcripts are used to calculate the RNA-seq quality information.
      This took about half an hour to run for my ~3 Gbyte bam file.

Monday 7 January 2013

Converting embl to fasta format

I recently wanted to download the genome sequence for Plasmodium knowlesi from the Sanger ftp site and found that the sequence data is available in embl format. This meant that I needed to write a script to convert the embl format files to fasta format files. My perl script takes an embl format file as input, and makes a fasta format file as output.