Wednesday 8 May 2013

Sanger scripts for finding reference genomes, gff files, bam files, fastq files, etc.

Pathfind:

These scripts are only useful to people at Sanger, but for them they are extremely useful!

Update note Sept 2020: To use pathfind, you now first need to load the pathfind module. You can search for it using:

% module avail -t | grep -i pf

For example, this may tell you something like:

pf/1.0.0

You can then load the pathfind module using:

% module load pf/1.0.0

You can then use the module by typing:

% pf ... (as below)


Finding assemblies and gff files
Find genome assemblies for Schistosoma mansoni:
% reffind -s 'Schistosoma mansoni' -f fa
or
% reffind -sp mansoni -f fa
Find gff files for Schistosoma mansoni:
% reffind -s 'Schistosoma mansoni' -f gff
Find embl files for Schistosoma mansoni:
% reffind -s 'Schistosoma mansoni' -f embl

Note: 27-Apr-2022: in the latest version, you should use 'pf' instead of 'pathfind'.
For example, to find the assembly for lane 5174_7#5:
% pf assembly -t lane --id 5174_7#5

To find the reference genome 'Vibrio_cholerae_O1_biovar_eltor_str_N16961_v2':
% pf ref --id 'Vibrio_cholerae_O1_biovar_eltor_str_N16961_v2'

Finding fastq and bam files
Find bam files and fastq files for lane 9342_8#6:
% pathfind --t lane --id 9342_8#6
Make a file of all the fastq files for all the lanes listed in an input file "lanelist":
% pathfind --t file --id lanelist --f fastq > lanelist_reads
This makes a file lanelist_reads with the locations of all the fastq files.
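Before passing lanelist_reads on to other tools, it can be worth checking that every listed file is still on disk. The following is only a rough sketch: it assumes that lanelist_reads holds one path per line, and the function name count_missing is made up for illustration (it is not part of pathfind).

```shell
# Sketch only: count the paths in a list file that do not exist on disk.
# Assumption: the list file has one path per line (blank lines are skipped).
# count_missing is a hypothetical helper name, not a pathfind command.
count_missing() {
    list="$1"
    missing=0
    # The '|| [ -n "$path" ]' part handles a last line with no trailing newline
    while IFS= read -r path || [ -n "$path" ]; do
        [ -z "$path" ] && continue
        if [ ! -e "$path" ]; then
            echo "missing: $path" >&2      # report each missing path on stderr
            missing=$((missing + 1))
        fi
    done < "$list"
    echo "$missing"                        # print the number of missing paths
}

# Example use (after making lanelist_reads as above):
#   count_missing lanelist_reads
```

The count is printed on standard output and the missing paths go to standard error, so the count can be captured in a variable without the messages getting mixed in.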
Note: in '9342_8#6' the run is 9342, the lane is 8, and 6 is the tag (several samples were multiplexed together in one lane, and given different tags).
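If you need the run, lane and tag separately (for example to group lanes by run), the identifier can be split with shell parameter expansion. This is a minimal sketch that assumes identifiers always look like RUN_LANE#TAG or RUN_LANE; the function name parse_lane_id is invented for this example.

```shell
# Sketch only: split a lane identifier such as 9342_8#6 into run, lane and tag.
# Assumption: identifiers have the form RUN_LANE#TAG, or RUN_LANE with no tag.
# parse_lane_id is a hypothetical helper name, not a pathfind command.
parse_lane_id() {
    id="$1"
    run="${id%%_*}"            # everything before the first underscore
    rest="${id#*_}"            # everything after the first underscore
    lane="${rest%%\#*}"        # everything before the '#', if there is one
    case "$rest" in
        *"#"*) tag="${rest#*\#}" ;;   # everything after the '#'
        *)     tag="" ;;              # no tag in this identifier
    esac
    printf 'run=%s lane=%s tag=%s\n' "$run" "$lane" "$tag"
}

parse_lane_id '9342_8#6'
```

For '9342_8#6' this prints run=9342 lane=8 tag=6. For an identifier with no tag, such as '27541_1', the tag is left empty.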
Note: 26-Apr-2018: use 'pf' instead of 'pathfind', as 'pf' will always be the latest version of the script, e.g.
% pf data --id 27541_1 --type lane
(thanks to Victoria Offord for this)

Getting a UniProt entry
Get the UniProt entry H2L008.1:
% mfetch H2L008.1

Getting statistics about a sequencing lane (read length, etc.)
% pathfind -t lane -id 9342_8#6 -stats
This makes a file 9342_8#6.pathfind_stats.csv in your directory, with statistics e.g. read length, number of bases sequenced, mean insert size, depth of coverage, etc.

Finding information about a sequencing lane (e.g. strain names, ENA accessions, etc.):

% pf supplementary -t lane --id '24880_5#192'

Finding QC information about a sequencing lane (e.g. Kraken report):

% pf qc -t lane --id '24880_5#188'

Acknowledgements
Thanks to Victoria Offord for advice on pathfind.
