Wednesday, 8 May 2013

Sanger scripts for finding reference genomes, gff files, bam files, fastq files, etc.

This is only useful to Sanger people, but extremely useful!

Finding assemblies and gff files
Find genome assemblies for Schistosoma mansoni:
% reffind -s 'Schistosoma mansoni' -f fa
% reffind -sp mansoni -f fa
Find gff files for Schistosoma mansoni:
% reffind -s 'Schistosoma mansoni' -f gff
Find embl files for Schistosoma mansoni:
% reffind -s 'Schistosoma mansoni' -f embl

Finding fastq and bam files
Find bam files and fastq files for lane 9342_8#6:
% pathfind --t lane --id 9342_8#6
Make a file of all the fastq files for all the lanes listed in an input file "lanelist":
% pathfind --t file --id lanelist --f fastq > lanelist_reads
This makes a file lanelist_reads with the locations of all the fastq files.
Note: in '9342_8#6' the run is 9342, the lane is 8, and 6 is the tag (several samples were multiplexed together in one lane, and given different tags).
Note: 26-Apr-2018: use 'pf' instead of 'pathfind' as 'pf' will always be the latest version of the script , e.g.
% pf data --id 27541_1 --type lane
(thanks to Victoria Offord for this)

Getting a UniProt entry
Get the UniProt entry H2L008.1:
% mfetch H2L008.1

Getting statistics about a sequencing lane (read length, etc.)
% pathfind -t lane -id  9342_8#6 -stats
This makes a file 9342_8#6.pathfind_stats.csv in your directory, with statistics eg. read length, number of bases sequenced, mean insert size, depth of coverage, etc.

Thanks to Victoria Offord for advice on pathfind.

No comments: