avrilomics: Filtering a repeat library for matches to proteins/RNAs

Friday, 23 October 2015

Filtering a repeat library for matches to proteins/RNAs

My colleague Eleanor Stanley came up with a nice protocol for filtering a repeat library for matches to proteins/RNAs (thanks Eleanor!).

Get known protein-coding regions and RNAs:
The first step is to download C. elegans protein-coding regions and RNAs from WormBase, and flatworm proteins from GeneDB:

% ln -s ~wormpub/BUILD/WORMPEP/wormpep$WB/wormpep.dna$WB
% ln -s ~wormpub/BUILD/WORMRNA/wormrna$WB/wormrna$WB.rna
% chado_dump_transcripts_coding -o Smansoni > Sma.fa
% chado_dump_transcripts_coding -o Emultilocularis > Emu.fa
% cat Sma.fa Emu.fa > flatworm_transcripts.fa

Make blast databases:
% formatdb -p F -i flatworm_transcripts.fa
% formatdb -p F -i wormpep.dna235
% formatdb -p F -i wormrna235.rna

Run blast against the known protein-coding regions and RNAs
Next discard 'unknown' repeats (ie. classified by RepeatClassifier as 'unknown') from your library that have blast matches of e-value <= 0.001 to the known protein-coding regions and RNAs.
This can be done using a shell script ~alc/Documents/000_50HG_Repeats/filter_lib.sh below, by running:
% ~alc/Documents/000_50HG_Repeats/filter_lib.sh my_repeat.lib
where my_repeat.lib is the repeat library.
This makes a file my_repeat.lib.filtered with the repeats left after this filtering step, and a file my_repeat.lib.notwanted of the repeats discarded by this step.

#!/bin/sh

# Filters consensi.fa against roundworm and flatworm ncRNAs and ORFs to remove
# repeats that hit genes.
# taking lines from ~es9/pipeline/repmod_filter.sh:

# $1 is merged library
export flatworm_REP=/lustre/scratch108/parasites/alc/000_50HG_Repeats/merged_libraries_NEW/flatworm_transcripts.fa
export roundworm_dna_REP=/lustre/scratch108/parasites/alc/000_50HG_Repeats/merged_libraries_NEW/wormpep.dna235
export roundworm_rna_REP=//lustre/scratch108/parasites/alc/000_50HG_Repeats/merged_libraries_NEW/wormrna235.rna
blastall -a 1 -b 10 -v 10 -p blastn -e 0.001 -m 8 -d $flatworm_REP -i $1 -o flatworm.blast
blastall -a 1 -b 10 -v 10 -p blastn -e 0.001 -m 8 -d $roundworm_dna_REP -i $1 -o roundworm_dna.blast
blastall -a 1 -b 10 -v 10 -p blastn -e 0.001 -m 8 -d $roundworm_rna_REP -i $1 -o roundworm_rna.blast
awk '{print $1}' flatworm.blast | sort -u > hit_all
awk '{print $1}' roundworm_dna.blast | sort -u >> hit_all
awk '{print $1}' roundworm_rna.blast | sort -u >> hit_all
sort -u hit_all > hits
# just take hits to unknown repeats:
cat hits | grep -i unknown > hits2
rm -f hit_all

# Remove all hits from $1
/nfs/users/nfs_j/jit/repository/pathogen/user/jit/converter//fasta2single.pl $1 > $1.v2
perl ~es9/pipeline/fasta_retrieve_subsets.pl $1.v2 hits2
echo ""
echo "Filtered species specific repeats are now available:"
echo "Number of repeats in $1.v2.filtered: "
grep -c '^>' $1.v2.filtered
echo "Number of repeats in $1.v2.notwanted: "
grep -c '^>' $1.v2.notwanted
echo "This matches number of repeats in the hits file: "
wc -l hits2

4 comments:

Unknown said...: possible to share 'fasta_retrieve_subsets.pl' script?; 26 October 2015 at 07:17
Avril Coghlan said...: The script is by my colleague Eleanor Stanley. It does something like this:

my $fasta = shift @ARGV;
my $bad = shift @ARGV;
my %reads = () ;

open (IN, "$bad") or die "I can't open $bad\n";
open (IN2, "$fasta") or die "I can't open $fasta\n";
open (OUT, ">$fasta.notwanted") or die "I can't open $fasta.notwanted\n";
open (OUT2, ">$fasta.filtered") or die "I can't open $fasta.filtered\n";

while () {
chomp;
my @line = split /\s+/ , $_;
$reads{$line[0]}++;
#print "Line:$line[0]:\n";
}
close (IN);

while () {
chomp;
if (/^>(\S+)/) {
my $seq_name = $1;
$seq_name=~s/>//;
my $seq = ;
chomp($seq);
#print "SEQname:$seq_name:\n";
if ($reads{$seq_name} ) {
print OUT ">$seq_name\n";
print OUT "$seq\n";
}
else {
print OUT2 ">$seq_name\n";
print OUT2 "$seq\n";
}
}
}

close (IN2);
close (OUT);
close (OUT2);; 2 November 2015 at 06:26
Avril Coghlan said...: Note: the script above assumes that there is one line per sequence in the fasta file (ie. the sequence is not spread over several lines in the fasta file, it is all on one line).; 2 November 2015 at 06:38
Unknown said...: Thanks for sharing these notes: they're very helpful!; 30 November 2015 at 08:08