Wednesday, 9 January 2013

Using BEDTools to analyse gff and bed files

I've just found out about BEDTools, a nice software package for manipulating bed files (it also accepts gff files).

Using BEDTools to find overlaps between gff file features
BEDTools can find overlaps between features in a gff format file, which is very useful. To do this, you just type:
% /nfs/users/nfs_a/alc/Documents/bin/bedtools-2.17.0/bin/intersectBed -loj -a gff1 -b gff2
where /nfs/users/nfs_a/alc/Documents/bin/bedtools-2.17.0/ is the directory where you installed BEDTools, 'gff1' is your first gff file, and 'gff2' is your second gff file. The -loj option ('left outer join') means that for each feature in gff1, BEDTools will report each overlapping feature in gff2. Features in gff1 that have no overlap are still reported, paired with a null (empty) gff2 feature.
Very handy!
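To make the left outer join idea concrete, here is a minimal Python sketch of the logic. It is only an illustration, not how BEDTools is implemented: the function names and the (sequence, start, end) tuples are made up for this example, coordinates are taken to be one-based and inclusive as in gff, and strand is ignored.

```python
def overlaps(a, b):
    """True if two (seqname, start, end) features share at least one base.

    Coordinates are assumed to be 1-based and inclusive, as in gff.
    """
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def left_outer_join(features_a, features_b):
    """Pair every feature in A with each overlapping feature in B.

    A features with no overlap are kept and paired with None,
    which stands in for the null feature here.
    """
    rows = []
    for a in features_a:
        hits = [b for b in features_b if overlaps(a, b)]
        if hits:
            rows.extend((a, b) for b in hits)
        else:
            rows.append((a, None))
    return rows


# Hypothetical features, just for illustration
gff1 = [("chr1", 100, 200), ("chr1", 500, 600)]
gff2 = [("chr1", 150, 250), ("chr1", 190, 300), ("chr2", 100, 200)]

for row in left_outer_join(gff1, gff2):
    print(row)
```

The first gff1 feature is listed twice (once per overlapping gff2 feature), and the second is listed once with None, because nothing in gff2 overlaps it.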

Using BEDTools to find non-overlapping features between gff files
If you want to find features in the first gff file that do not overlap any feature in the second gff file, you can type:
% /nfs/users/nfs_a/alc/Documents/bin/bedtools-2.17.0/bin/intersectBed -v -a gff1 -b gff2
where /nfs/users/nfs_a/alc/Documents/bin/bedtools-2.17.0/ is the directory where you installed BEDTools, 'gff1' is your first gff file, and 'gff2' is your second gff file. This will report features in gff1 that have no overlap with features in gff2.
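The effect of -v can be sketched in a few lines of Python. Again, this is just an illustration of the idea under my own assumptions (made-up function names, features as (sequence, start, end) tuples with one-based inclusive gff coordinates, strand ignored), not BEDTools' actual code.

```python
def overlaps(a, b):
    """True if two (seqname, start, end) features share at least one base.

    Coordinates are assumed to be 1-based and inclusive, as in gff.
    """
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def non_overlapping(features_a, features_b):
    """Keep only the A features that overlap no feature in B."""
    return [a for a in features_a
            if not any(overlaps(a, b) for b in features_b)]


# Hypothetical features, just for illustration
gff1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 10, 50)]
gff2 = [("chr1", 150, 250), ("chr2", 60, 90)]

for feature in non_overlapping(gff1, gff2):
    print(feature)
```

Only the chr1 500-600 and chr2 10-50 features are printed, since the chr1 100-200 feature overlaps a feature in gff2.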

Using BEDTools to get a fasta file based on coordinates in a gff file
If you have a gff file with the coordinates of some features that you are interested in (e.g. introns), and a fasta file of your genome, you can use BEDTools to make a fasta file of the sequences of those features (e.g. intron sequences):
% /nfs/users/nfs_a/alc/Documents/bin/bedtools-2.17.0/bin/bedtools getfasta -fi genome.fa -bed introns.gff -fo introns.fa -s
where /nfs/users/nfs_a/alc/Documents/bin/bedtools-2.17.0/ is the directory where you installed BEDTools, 'genome.fa' is the fasta file of your genome, 'introns.gff' is the input gff file, 'introns.fa' is the output fasta file of intron sequences, and the '-s' option means that introns on the negative strand will be reverse-complemented.
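To show what is going on for a single feature, here is a small Python sketch of pulling out a sequence from gff coordinates and reverse-complementing it if it is on the negative strand. The function name and the toy sequence are made up for illustration, and this is a sketch of the idea rather than what BEDTools does internally.

```python
# Translation table for complementing DNA (upper and lower case, N left as N)
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def feature_sequence(chrom_seq, start, end, strand):
    """Return the sequence of a feature given gff-style coordinates.

    start and end are assumed to be 1-based and inclusive, so the
    Python slice runs from start-1 up to (but not including) end.
    """
    seq = chrom_seq[start - 1:end]
    if strand == "-":
        # Reverse-complement features on the negative strand
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq


# Toy 12-base 'chromosome', just for illustration
chrom = "AAACCCGGGTTT"
print(feature_sequence(chrom, 2, 5, "+"))
print(feature_sequence(chrom, 2, 5, "-"))
```

Note the start-1 in the slice: Python slices, like bed files, count from zero and exclude the end position.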

A note about bed format
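An important thing to remember about bed format is that it is a 'zero-based, half-open' format: the start coordinate of a region is given as start-1 relative to the one-based coordinates used in gff, while the end coordinate is the same in both. For example, the first 100 bases of a chromosome are 1-100 in a gff file but 0-100 in a bed file.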
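The conversion between the two coordinate systems is simple enough to write down as a short Python sketch (the function names here are my own, just for illustration):

```python
def gff_to_bed(start, end):
    """Convert 1-based inclusive (gff) coordinates to 0-based half-open (bed)."""
    return start - 1, end


def bed_to_gff(start, end):
    """Convert 0-based half-open (bed) coordinates to 1-based inclusive (gff)."""
    return start + 1, end


# The first 100 bases of a chromosome
bed_start, bed_end = gff_to_bed(1, 100)
print(bed_start, bed_end)

# In bed coordinates the length of a region is simply end - start
print(bed_end - bed_start)
```

One nice consequence of the half-open convention is that the length of a region is just end minus start, whereas in gff coordinates it is end - start + 1.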
