Thursday 11 August 2016

Setting up vim for Python

I wanted vim to automatically indent by 4 spaces when I'm editing a Python file. Following the instructions on wiki.python.org, I put this in my ~/.vimrc file:
syntax on
filetype indent plugin on
set modeline
set tabstop=8
set expandtab
set shiftwidth=4
set softtabstop=4


Works perfectly!

Monday 8 August 2016

Classifying my BLAST hits by taxonomic group

I'm running blast against the NCBI nr database (note to self: this is in /data/blastdb/Supported/NR/nr) and want to classify my BLAST hits by taxonomic group.

I found a long discussion about this on biostars and it seems you can do it in various ways.

I tried out a couple of ways:

Option 1 for getting taxids (a bit slow but probably good if you have LOTS of sequence ids)
NCBI provides files converting accessions into taxonomy ids, and I downloaded a file 'prot.accession2taxid.gz' with the conversion between protein accessions and taxonomy ids from ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/
This file has format: accession<TAB>accession.version<TAB>taxid<TAB>gi
Note that NCBI is phasing out GI numbers and recommend to use accession.version instead so I decided to do that.
I need to write a script to read in my BLAST output and add a column with the taxonomy id. for each accession, but that should be relatively easy.

Voila! 

Option 2 for getting taxids (much faster but probably not so good if you have lots of sequence ids)
On biostars Pierre Lindenbaum tells about how to use Entrez's E-Utilities for this. This seems to be pretty fast. For example if you want to get the taxid for the protein with accession KIH48240.1 you can type in your web browser:
eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=KIH48240.1&rettype=fasta&retmode=xml
This gives you back an XML file that has the taxid as 51022:
<?xml version="1.0"?>
 <!DOCTYPE TSeqSet PUBLIC "-//NCBI//NCBI TSeq/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd">
 <TSeqSet>
<TSeq>
  <TSeq_seqtype value="protein"/>
  <TSeq_gi>748345068</TSeq_gi>
  <TSeq_accver>KIH48240.1</TSeq_accver>
  <TSeq_sid>gnl|WGS:JHON|ANCDUO_21692</TSeq_sid>
  <TSeq_taxid>51022</TSeq_taxid>
  <TSeq_orgname>Ancylostoma duodenale</TSeq_orgname>
  <TSeq_defline>hypothetical protein ANCDUO_21692, partial [Ancylostoma duodenale]</TSeq_defline>
  <TSeq_length>381</TSeq_length>
  <TSeq_sequence>KCISISSKTLSELGLHGPSRTGAELVQSEVLRERNYNTEELERFVQGNEPLLVPDQRLAYEAIMDMIRNGNGGLFFLDAPGGTGKTFLINLLLAEIRRQNEIAVAVASSGIAATLLDGGRTAHSALKLPLDLSRSENPVCNISKGSGKAQVLKMCKVIVWDECTMAHKRALEALDRTLQDIRENNRLMGGAVLVLAGDFRQTLPVIPRATPADELNACLKASYLWRHVRKMTLTTNMRVHLQGDLSAQSFAQQLLRLGDGDFPVDADTDLISFPSDFCNLTESPEELIVKVFPNITNNFRNHQWLCDRAILAPMNDGVNKINTEIQNQLPGPAATYESIDTVVDREQATGIYGISMMESISSSRMLPESLPIHYYPKWCSQ</TSeq_sequence>
</TSeq>

</TSeqSet>


Or you can type on a linux machine:
wget 'eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=KIH48240.1&rettype=fasta&retmode=xml' -O my.xml
This saves the xml output (above) into a local file 'my.xml'.

Or even better, just to get the taxid:
curl -s 'eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=KIH48240.1&rettype=fasta&retmode=xml' | grep taxid | cut -d">" -f2 | cut -d"<" -f1 
This gives you: (note: 'curl -s' means silent mode, so it curl doesn't give you progress info)
51022

Thanks Pierre!

Option 1 for getting taxonomy info for taxids: use Perl:
It turns out you can use Jason Stajich's Taxonomy.pm in Perl to take a taxonomy id. for a sequence, and convert this to the taxonomic group (eg. species, genus, order, etc.) There is a nice example of doing this on www.biostars.org (search for biostars62911.pl in this page). 


Option 2 for getting taxonomy info for taxids: use Python:
If you use Biopython's Entrez module you can also take the taxonomy id. and get info. such as species name, NCBI division for the sequence (eg. "Invertebrates") and the phylum of the species. This must be in Python2, since Biopython is for Python2, not Python3, at present.

Here's an example bit of code:

from Bio import Entrez


#====================================================================#

# get the species, phylum etc. for a taxid:

def get_taxonomy_data(taxid):
    """get the species, phylum etc. for a taxid
    >>> get_taxonomy_data('51022')
    ('Ancylostoma duodenale', 'Invertebrates', 'Nematoda')
    """

    # Copied examples from http://biopython.org/DIST/docs/api/Bio.Entrez-module.html
    Entrez.email = "alc@sanger.ac.uk"
    handle = Entrez.efetch(db="taxonomy", id=taxid, mode="text", rettype="xml")
    records = Entrez.read(handle)
    handle.close()
    # record is now a Python list
    # print(records[0].keys()) gives:
    # [u'Lineage', u'Division', u'ParentTaxId', u'PubDate', u'LineageEx', u'CreateDate', u'TaxId', u'Rank', u'GeneticCode', u'ScientificName', u'MitoGeneticCode', u'UpdateDate']

    # get the species name:
    scientific_name = records[0]["ScientificName"] # eg. Ancylostoma duodenale

    # get the division:
    division = records[0]["Division"] # eg. Invertebrates

    # get the phylum:
    lineage = records[0]["LineageEx"]
    # eg. [{u'ScientificName': 'cellular organisms', u'TaxId': '131567', u'Rank': 'no rank'}, {u'ScientificName': 'Eukaryota', u'TaxId': '2759', u'Rank': 'superkingdom'}, {u'ScientificName': 'Opisthokonta', u'TaxId': '33154', u'Rank': 'no rank'}, {u'ScientificName': 'Metazoa', u'TaxId': '33208', u'Rank': 'kingdom'}, {u'ScientificName': 'Eumetazoa', u'TaxId': '6072', u'Rank': 'no rank'}, {u'ScientificName': 'Bilateria', u'TaxId': '33213', u'Rank': 'no rank'}, {u'ScientificName': 'Protostomia', u'TaxId': '33317', u'Rank': 'no rank'}, {u'ScientificName': 'Ecdysozoa', u'TaxId': '1206794', u'Rank': 'no rank'}, {u'ScientificName': 'Nematoda', u'TaxId': '6231', u'Rank': 'phylum'}, {u'ScientificName': 'Chromadorea', u'TaxId': '119089', u'Rank': 'class'}, {u'ScientificName': 'Rhabditida', u'TaxId': '6236', u'Rank': 'order'}, {u'ScientificName': 'Strongylida', u'TaxId': '6308', u'Rank': 'suborder'}, {u'ScientificName': 'Ancylostomatoidea', u'TaxId': '33277', u'Rank': 'superfamily'}, {u'ScientificName': 'Ancylostomatidae', u'TaxId': '33278', u'Rank': 'family'}, {u'ScientificName': 'Ancylostomatinae', u'TaxId': '53469', u'Rank': 'subfamily'}, {u'ScientificName': 'Ancylostoma', u'TaxId': '29169', u'Rank': 'genus'}]
    found_phylum = False
    phylum = None
    for taxon in lineage:
        name = taxon["ScientificName"]
        rank = taxon["Rank"]
        if rank == "phylum":
            found_phylum = True
            phylum = name
    assert(found_phylum == True and phylum != None)

    return(scientific_name, division, phylum)

#====================================================================#



# get a taxid from somewhere (see options for getting the taxid for a protein sequence above)

# get the species, phylum etc. for this taxid:         
(scientific_name, division, phylum) = get_taxonomy_data(taxid)

Other ways of doing the same thing:
There are some discussions of different ways to do this on biostars.org, and also biostars (another post).

Using md5sum

Often I see a md5sum file when I download a large file. How can I check if I've downloaded the large file correctly?
(i) first, paste the md5sum information into a file called 'MD5SUM':
eg.
a477b9935efe786a16e7ccb13527b68d  prot.accession2taxid.gz
(ii) then, run:
% md5sum -c MD5SUMS
Hopefully, this will say all is fine:
prot.accession2taxid.gz: OK