Friday 2 December 2016

HISAT2 aligner for RNAseq data

Previously I used the TopHat aligner for RNAseq data (see my TopHat blogpost) but recently it seems to have been replaced by HISAT2 from the same team. HISAT2 was published by Kim et al 2015, and there is a user manual available. Other people in my team also use STAR but say HISAT2 is slightly more user friendly, but similar in accuracy and speed.

Creating a HISAT2 index
The first step in using HISAT2 is to create index files for your genome assembly:
% hisat2-build  assembly.fa myindex
where assembly.fa is the assembly file, and myindex is the prefix you want to give to the index file names.
(note: Adam has hisat2 installed in /nfs/users/nfs_a/ar11/hisat2-2.0.0-beta/ )

Running HISAT2
You can now run HISAT2 to map your reads to your assembly. I do it using a wrapper script adapted by one from Adam Reid:
% ~alc/Documents/000_FUGI_Hymenolepis/ -f lanelist_reads -i myindex -int 40000
where lanelist_reads is a file containing a list of the fastq files (full paths) for the sample (including all fastq files for all replicates for a condition),
myindex is the name of the HISAT2 index for the genome,
-int 40000 tells HISAT2 to only allow introns up to 40 kb.
This submits the jobs to the Sanger farm to run.
Note: My colleague Michaela noticed that when we ran HISAT2 output through HTSeqCount, it gave some warning messages, and these seem to be due to duplicate alignments in the HISAT2 bam file from HISAT2 2.0.0-beta version. The HISAT2 webpage says that a problem creating duplicate alignments was fixed in a recent release, so I'm going to try that new release to see if it gets rid of the error messages from HTSeqCount.

The actual HISAT2 command that this wrapper runs is:
% hisat2 --max-intronlen 40000 -p 12 myindex -1 seqs1_1.fastq -2 seqs1_2.fastq -1 seqs2_1.fastq -2 seqs2_2.fastq -S myoutput.sam
where -p 12 says to run HISAT2 with 12 threads,
seqs1_1.fastq and seqs1_2.fastq are reads and their mates from one sequencing run, and seqs2_1.fastq and seqs2_1.fastq are reads and their mates from another run (lane) for the same sample (same replicate),
-S myoutput.sam specifies the output sam file name.
Note that in this example a replicate was run on two sequencing lanes, this is sometimes done if we want to ensure a good amount of sequence data.
The script may have been given the fastq files for several lanes (in the lanelist_reads file) but it will identify which lanes belong to the same replicate, and will merge them into the final bam file produced.
So for example, if we had three larval replicates, and each were run on 2 lanes, then the two lanes for each replicate will be merged, so that we will end up with three bam files.

- This needs to be run in the same directory that contains the HISAT2 index files.
- The data is assumed to be unstranded here.
- This uses the default -k option in HISAT2, -k 5, which means that for multiply mapping reads, HISAT2 maps them to up to 5 places, where those 5 places are not guaranteed to be their best alignments out of all possible aligned places (eg. if they could align to 10 places, we are not guaranteed to get the best 5 reported).
- Once HISAT2 has finished running, you should see bam files in your directory. They will have been sorted by their mapped positions along each chromosome/scaffold (by default HISAT2 gives SAM output and it is unsorted, but the script converts to bam, and sorts the bam).

(Note to self: only run this for one condition (ie. all replicates for that condition at once) at a time, as it needs a lot of disk space (several hundred Gbyte) for temporary files!)

HISAT2 output
As well as the bam file produced by HISAT2, it also produces mapping statistics that will look something like this:

23590680 reads; of these:
  23590680 (100.00%) were paired; of these:
    3217972 (13.64%) aligned concordantly 0 times
    11223995 (47.58%) aligned concordantly exactly 1 time
    9148713 (38.78%) aligned concordantly >1 times
    3217972 pairs aligned concordantly 0 times; of these:
      130791 (4.06%) aligned discordantly 1 time
    3087181 pairs aligned 0 times concordantly or discordantly; of these:
      6174362 mates make up the pairs; of these:
        4542683 (73.57%) aligned 0 times
        799882 (12.95%) aligned exactly 1 time
        831797 (13.47%) aligned >1 times
90.37% overall alignment rate

The first line says '23590680 reads', but in fact my input was 23,590,680 read-pairs, ie. 47,181,360 reads.
Here it says 90.37% of reads were mapped by HISAT2. This is 831797 + 799882 + 11223995*2 + 9148713*2 + 130791*2 = 42,638,677 reads.
47.58% of read-pairs were mapped uniquely and concordantly, where 'concordantly' means the reads of each pair were mapped about the expected distance apart in the right orientation relative to their mate.

What about multiply mapping reads?
The commands above used the default option for HISAT2, ie. '-k 5', which means that if a read can map equally to 20 places in the genome, HISAT2 will just report 5 of those (not necessarily the best 5 alignments out of the 20). In this case, we were intending to use HISAT2 output for differential expression analysis, so does it matter that we have multiply mapping reads?

After discussing with Adam Reid, we came to the conclusion that it doesn't matter, as these multiply mapping reads will have a very low mapping quality score (as they map to multiple places), and at a later stage in our pipeline, when we are estimating read counts per gene, we can discard these reads using the program HTSeqCount, based on their low mapping quality score (you can tell HTSeqCount to use a mapping quality threshold).

There is some variation between different people's protocols in how to treat multiply mapping reads, but it seems that people often keep just one (or a few) reads of a set of multiply mapping reads if they are mapping to the genome assembly, as we were. On the other hand, if you are mapping to a transcriptome assembly, you may want to keep multiply mapping reads. This is because you are mapping to a smaller fasta file, so it is feasible to map or multiply mapping reads (which can be slow for a whole genome). An advantage of this is that there are clever tools such as ExPress that can figure out how many of the multiply mapping reads to give to each member of a multi-gene family, based on unique regions in those genes, and this can then give you better estimates of expression level (reads or read-pairs per gene).

You can see a nice discussion of this and other issues regarding RNAseq analysis in Conesa et al 2016's "A survey of best practices for RNA-seq data analysis.

Thanks to Adam Reid and Michaela Herz for discussing HISAT2, and Adam for his wrapper script.

Using samtools mpileup to find SNPs, and vcf-tools to filter SNPs

My colleagues have recently been talking about SNP-finding using 'samtools mpileup', and ways to filter the set of SNPs called by 'samtools mpileup' to throw away the more dodgy SNP calls.

Calling SNPs (and reference bases) using 'samtools mpileup':
You can call SNPs (and reference bases) using 'samtools mpileup' (see the samtools mpileup webpage for details) by typing:
% samtools mpileup in.bam
where 'in.bam' is your input bam file.
The output is in VCF (Variant Call Format).  

The output may look like this:
1       2131    N       2       gG      FF
1       2132    N       2       tT      FF
1       2133    N       2       aA      FF
1       2134    N       2       gG      BF
1       2135    N       2       cC      BF
1       2136    N       2       tT      BF
1       2137    N       2       cC      BF
1       2138    N       2       g$G$    BF
1       2147    N       1       ^]g     F
1       2148    N       1       g       F
The columns are: chromosome, 1-based coordinate, reference base, number of reads covering the site, read bases, and base qualities.

In the example output above we don't see the base in the reference. To see it we must type:
% samtools faidx genome.fa
% samtools mpileup -Q 15 -D -f genome.fa in.bam
where genome.fa is the assembly file.
Then we see something like this:
1       5131    C       2       .,      FE
1       5132    G       2       .,      FF
1       5133    A       2       .,      FF
1       5134    G       2       .,      FF
1       5135    G       2       .,      FF
1       5136    A       2       .,      FF
A dot '.' stands for the reference base on the forward strand, and ',' for the reverse strand.

When you are doing this, you can tell 'samtools mpileup' to only take bases with base alignment quality scores (BAQ scores: these are adjusted base quality scores, which have been reduced for base positions near indels, to help rule out false positive SNP calls due to alignment artefacts near small indels) of 15 or higher by typing:
% samtools mpileup -Q 15 in.bam
To only take cases with reads with mapping quality (MAPQ) of >=30:
% samtools mpileup -q 30 -Q 15 -D -f genome.fa in.bam

You can vary the contents of the output VCF file that you get using the following options:
-D : gives you per-sample read depth at each base position
-S : gives you a P-value for a statistical test for per-sample strand bias at each base position (strand bias occurs if a particular SNP is being called from reads that aligned to one strand but not reads that align to the complementary strand).
-u : gives you a text BCF output (a compressed VCF format)
-g : gives you binary BCF output (a compressed binary version of BCF)

To avoid excessive usage of memory (RAM), we can tell samtools to only use a depth of 1000 reads at each position, using the -d option:
% samtools mpileup -d 1000 in.bam

To combine several options in one command, you could type, for example:
% samtools mpileup -d 1000 -DSug in.bam  
% samtools mpileup -Q 15 -D in.bam 

Filtering SNPs using bcftools:
To filter the output of samtools mpileup to just have variant bases (not reference bases), we need to filter the output using bcftools, for example:

% samtools mpileup -u -q 30 -Q 15 -D -f genome.fa in.bam | bcftools view -bvcg - > var.raw.bcf
where the samtools mpileup -u option produces uncompressed BCF output;
bcftools view -b outputs BCf; -v output potential variant sites only; -c calls SNPs; -g calls genotypes at variant sites.
This calls SNPs and small indels.

We can then filter the output file using:
% bcftools view var.raw.bcf | varFilter -D100 > var.flt.vcf 
where -D sets the maximum read depth to call a SNP.

Filtering SNPs using vcftools-annotate:
There is also a way to filter SNPs using vcftools-annotate, for example on MinDP (read depth) and RMS MQ (mapping quality). I need to look into this.

A protocol for calling and filtering SNPs
My colleague Diogo Ribeiro previously identified some SNP sites from Illumina sequencing data for a nematode species. Here is a brief description of what he did:

- For each isolate, sequencing reads were mapped to the reference assembly using the SMALT mapper, to produce bam files.

- SNPs and small indels relative to the reference assembly were identified separately for each isolate, using samtools and vcftools, based on a maximum of 1000 reads at each base position ('samtools mpileup -d 1000' option).

- The resulting VCF files (one per isolate) were filtered to remove less reliable variant sites, by removing those that were:
(i) in repetitive regions of the reference assembly, as identified by gem-mappability;
(ii) were in gaps in scaffolds, or within 100 bp of a gap of >=8 bp;
(iii) were within 100 bp of a scaffold start or end;
(iv) had a SNP quality score of less than 50;
(v) had a mapping quality of less than 30;
(vi) had a read depth (including reads from both strands) of less than 20, or had a read depth of less than 10 from a single strand.
Some of this is possible using Diogo Ribeiro's script:
% --DP 20 --DP_STRAND 10 --QUAL 50 --MQ 30 vcf vcf_filtered2
applies the filters (iv), (v), (vi) to VCF files in input directory 'vcf', and puts the output in an output directory 'vcf_filtered2' (must exist already).

- The variant density, for all variants and for just homozygous variants, was estimated in non-overlapping 1-kb windows of each scaffold of the reference assembly. When calculating variant density, sites with more than one alternative allele were excluded, as these are probably erroneous variant calls.

Scripts for SNPs (note to self)
My colleagues Diogo Ribeiro and Bhavana Harsha wrote some nice scripts for SNP calling and SNP filtering, see:
(i) -> can filter a VCF file based on median read depth, and other filters that the user species (uses vcftools-annotate),
(ii) -> can filter a VCF file based on median read depth for a chromosome,
(iii) -> can filter a VCF file based on using gem-mappability to find repetitive regions of a reference assembly,
(iv) -> can filter a VCF file based on minimum read depth, read depth per strand, SNP quality, or mapping quality,

(v) -> plots SNP data (from VCF files) along chromosomes,
(vi) -> analyses distribution of SNPs per gene, and SNPs per kb of gene.

Thanks very much to my colleagues Bhavana Harsha and Diogo Ribeiro who figured out lots of these things.

GTF versus GFF

The GFF format is described here.
The entry for one gene is something like this:
# Gene gene:HmN_000672600
pathogens_HYM_scaffold_0001     WormBase_imported       gene    599     742     .       +       .       ID=gene:HmN_000672600;Name=HmN_000672600;biotype=protein_coding
pathogens_HYM_scaffold_0001     WormBase_imported       mRNA    599     742     .       +       .       ID=transcript:HmN_000672600.1;Parent=gene:HmN_000672600;Name=HmN_000672600.1
pathogens_HYM_scaffold_0001     WormBase_imported       exon    599     742     .       +       .       ID=exon:HmN_000672600.1.1;Parent=transcript:HmN_000672600.1
pathogens_HYM_scaffold_0001     WormBase_imported       CDS     599     742     .       +       0       ID=cds:HmN_000672600.1;Parent=transcript:HmN_000672600.1

You can see that the transcript, exon and CDS lines have a 'Parent=' tag at the end, to say which transcript/gene they belong to. Also, there are 'gene', 'mRNA', 'exon', and 'CDS' lines.

The GTF format is slightly different. An example is: (for the same gene as above)
pathogens_HYM_scaffold_0001     WormBase_imported       transcript      599     742     .       +       .       gene_id "gene:HmN_000672600"; transcript_id "transcript:HmN_000672600.1"
pathogens_HYM_scaffold_0001     WormBase_imported       exon    599     742     .       +       .       gene_id "HmN_000672600.1"; transcript_id "transcript:HmN_000672600.1"; exon_number "1"

Here we just have 'transcript' and 'exon' lines, and the tag at the end gives the gene_id and transcript_id, so there is no need for 'Parent=' tags.
Here is a description of GTF: here.

Converting GFF to GTF:
My colleague Adam Reid has written a nice Perl script for convering GFF to GTF, eg.:
% perl /nfs/users/nfs_a/ar11/scripts/  emultilocularis.gff3 > emultilocularis.gtf

Mapping Illumina reads using the smalt aligner

Martin Hunt has written the script for aligning Illumina reads to a genome assembly.

This can use the SMALT aligner (on the Sanger compute farm) as follows:

% --split 500000 -k 11 -s 2 -o " -x -r 0 -y 0.7 -i 500" smalt assembly.fa seqs_1.fastq.gz seqs_2.fastq.gz

where --split 500000 tells it to split the input genome fasta file into smaller files of 500,000 sequences each;
-k 11 is the kmer for the SMALT index;
-s 2 is the step length for the SMALT index;
-o gies the mapping options for SMALT;
-x is the SMALT option for SMALT to do an exhaustive search for alignments (at cost of speed);
-r 0 tells SMALT to randomly assign multiply mapping reads to one place;
-y 0.7 tells SMALT to only take reads with alignment identity of >=70%;
-i 500 tells SMALT to allow a maximum insert size of 500 bp;
assembly.fa is the input assembly file;
seqs_1.fastq.gz and seqs_2.fastq.gz are the fastq files of reads and their mates, respectively.

Thursday 1 December 2016

Searching the NCBI trace archive

To search the NCBI Trace Archive for the capillary read sequence data used for the Echinococcus multilocularis tapeworm genome paper (Tsai et al), you can search for:


Thursday 10 November 2016

A plot of gene counts across a species tree

My colleague James Cotton has written a nice Python script to plot gene counts in a gene family, across the species tree. Here's notes on how to run it (as otherwise I will forget..):

1. Get the gene counts in the family:
% grep 'family 132785 :' complete_families.txt | tr ')' '\n' | cut -d" " -f3 | sort | uniq -c

2. Format in the form:

3. Run the script
% ssh -Y pcs5
% cd /lustre/scratch108/parasites/alc/000_50HG_InterestingFamilies/final_list_files/genecountplots
% python 'ancylostoma_caninum(23),ancylostoma_ceylanicum(22),ancylostoma_duodenale(22),angiostrongylus_cantonensis(2),angiostrongylus_costaricensis(3),caenorhabditis_elegans(1),cylicostephanus_goldi(10),haemonchus_contortus(4),haemonchus_placei(10),heligmosomoides_bakeri(46),necator_americanus(10),nippostrongylus_brasiliensis(33),oesophagostomum_dentatum(7),panagrellus_redivivus(3),pristionchus_pacificus(1),strongylus_vulgaris(1),teladorsagia_circumcincta(10)' family132785.png

This makes a nice plot like this:

Update: Mar 2017:
James wrote:
I've updated by family count data around a tree -
python ~/bin/ 'ixodes_scapularis(1),taenia_asiatica(3),echinococcus_multilocularis(4),angiostrongylus_cantonensis(1),hymenolepis_nana(1),protopolystoma_xenopodis(4),schistocephalus_solidus(6),oesophagostomum_dentatum(1),echinococcus_granulosus(4),nippostrongylus_brasiliensis(1),clonorchis_sinensis(2),haemonchus_placei(1),schistosoma_curassoni(1),echinostoma_caproni(4),taenia_solium(4),schistosoma_margrebowiei(1),mesocestoides_corti(4),hydatigera_taeniaeformis(3),schistosoma_mansoni(1),schistosoma_mattheei(1),hymenolepis_diminuta(1),trichoplax_adhaerens(1),diphyllobothrium_latum(4),crassostrea_gigas(1),drosophila_melanogaster(2),schistosoma_haematobium(1),teladorsagia_circumcincta(1),hymenolepis_microstoma(1),strongylus_vulgaris(1),pristionchus_pacificus(1),meloidogyne_hapla(1),schistosoma_rodhaini(1),fasciola_hepatica(2),spirometra_erinaceieuropaei(9),schmidtea_mediterranea(4),schistosoma_japonicum(1),trichobilharzia_regenti(3),angiostrongylus_costaricensis(1)' blah.png

Thursday 8 September 2016

ChEMBL activity data

I'm interested in ChEMBL ligands for the target P0A6C1 (ChEMBL target CHEMBL1293285). I looked on the ChEMBL target page for this protein: Then I clicked on the piechart under 'Target Associated Bioactivities', and see a list of activities for ligands, including CHEMBL407 (flumazenil), CHEMBL16 (phenytoin), CHEMBL22 (trimethoprim), ...

Thursday 11 August 2016

Setting up vim for Python

I wanted vim to automatically indent by 4 spaces when I'm editing a Python file. Following the instructions on, I put this in my ~/.vimrc file:
syntax on
filetype indent plugin on
set modeline
set tabstop=8
set expandtab
set shiftwidth=4
set softtabstop=4

Works perfectly!

Monday 8 August 2016

Classifying my BLAST hits by taxonomic group

I'm running blast against the NCBI nr database (note to self: this is in /data/blastdb/Supported/NR/nr) and want to classify my BLAST hits by taxonomic group.

I found a long discussion about this on biostars and it seems you can do it in various ways.

I tried out a couple of ways:

Option 1 for getting taxids (a bit slow but probably good if you have LOTS of sequence ids)
NCBI provides files converting accessions into taxonomy ids, and I downloaded a file 'prot.accession2taxid.gz' with the conversion between protein accessions and taxonomy ids from
This file has format: accession<TAB>accession.version<TAB>taxid<TAB>gi
Note that NCBI is phasing out GI numbers and recommend to use accession.version instead so I decided to do that.
I need to write a script to read in my BLAST output and add a column with the taxonomy id. for each accession, but that should be relatively easy.


Option 2 for getting taxids (much faster but probably not so good if you have lots of sequence ids)
On biostars Pierre Lindenbaum tells about how to use Entrez's E-Utilities for this. This seems to be pretty fast. For example if you want to get the taxid for the protein with accession KIH48240.1 you can type in your web browser:
This gives you back an XML file that has the taxid as 51022:
<?xml version="1.0"?>
  <TSeq_seqtype value="protein"/>
  <TSeq_orgname>Ancylostoma duodenale</TSeq_orgname>
  <TSeq_defline>hypothetical protein ANCDUO_21692, partial [Ancylostoma duodenale]</TSeq_defline>


Or you can type on a linux machine:
wget '' -O my.xml
This saves the xml output (above) into a local file 'my.xml'.

Or even better, just to get the taxid:
curl -s '' | grep taxid | cut -d">" -f2 | cut -d"<" -f1 
This gives you: (note: 'curl -s' means silent mode, so it curl doesn't give you progress info)

Thanks Pierre!

Option 1 for getting taxonomy info for taxids: use Perl:
It turns out you can use Jason Stajich's in Perl to take a taxonomy id. for a sequence, and convert this to the taxonomic group (eg. species, genus, order, etc.) There is a nice example of doing this on (search for in this page). 

Option 2 for getting taxonomy info for taxids: use Python:
If you use Biopython's Entrez module you can also take the taxonomy id. and get info. such as species name, NCBI division for the sequence (eg. "Invertebrates") and the phylum of the species. This must be in Python2, since Biopython is for Python2, not Python3, at present.

Here's an example bit of code:

from Bio import Entrez


# get the species, phylum etc. for a taxid:

def get_taxonomy_data(taxid):
    """get the species, phylum etc. for a taxid
    >>> get_taxonomy_data('51022')
    ('Ancylostoma duodenale', 'Invertebrates', 'Nematoda')

    # Copied examples from = ""
    handle = Entrez.efetch(db="taxonomy", id=taxid, mode="text", rettype="xml")
    records =
    # record is now a Python list
    # print(records[0].keys()) gives:
    # [u'Lineage', u'Division', u'ParentTaxId', u'PubDate', u'LineageEx', u'CreateDate', u'TaxId', u'Rank', u'GeneticCode', u'ScientificName', u'MitoGeneticCode', u'UpdateDate']

    # get the species name:
    scientific_name = records[0]["ScientificName"] # eg. Ancylostoma duodenale

    # get the division:
    division = records[0]["Division"] # eg. Invertebrates

    # get the phylum:
    lineage = records[0]["LineageEx"]
    # eg. [{u'ScientificName': 'cellular organisms', u'TaxId': '131567', u'Rank': 'no rank'}, {u'ScientificName': 'Eukaryota', u'TaxId': '2759', u'Rank': 'superkingdom'}, {u'ScientificName': 'Opisthokonta', u'TaxId': '33154', u'Rank': 'no rank'}, {u'ScientificName': 'Metazoa', u'TaxId': '33208', u'Rank': 'kingdom'}, {u'ScientificName': 'Eumetazoa', u'TaxId': '6072', u'Rank': 'no rank'}, {u'ScientificName': 'Bilateria', u'TaxId': '33213', u'Rank': 'no rank'}, {u'ScientificName': 'Protostomia', u'TaxId': '33317', u'Rank': 'no rank'}, {u'ScientificName': 'Ecdysozoa', u'TaxId': '1206794', u'Rank': 'no rank'}, {u'ScientificName': 'Nematoda', u'TaxId': '6231', u'Rank': 'phylum'}, {u'ScientificName': 'Chromadorea', u'TaxId': '119089', u'Rank': 'class'}, {u'ScientificName': 'Rhabditida', u'TaxId': '6236', u'Rank': 'order'}, {u'ScientificName': 'Strongylida', u'TaxId': '6308', u'Rank': 'suborder'}, {u'ScientificName': 'Ancylostomatoidea', u'TaxId': '33277', u'Rank': 'superfamily'}, {u'ScientificName': 'Ancylostomatidae', u'TaxId': '33278', u'Rank': 'family'}, {u'ScientificName': 'Ancylostomatinae', u'TaxId': '53469', u'Rank': 'subfamily'}, {u'ScientificName': 'Ancylostoma', u'TaxId': '29169', u'Rank': 'genus'}]
    found_phylum = False
    phylum = None
    for taxon in lineage:
        name = taxon["ScientificName"]
        rank = taxon["Rank"]
        if rank == "phylum":
            found_phylum = True
            phylum = name
    assert(found_phylum == True and phylum != None)

    return(scientific_name, division, phylum)


# get a taxid from somewhere (see options for getting the taxid for a protein sequence above)

# get the species, phylum etc. for this taxid:         
(scientific_name, division, phylum) = get_taxonomy_data(taxid)

Other ways of doing the same thing:
There are some discussions of different ways to do this on, and also biostars (another post).

Using md5sum

Often I see a md5sum file when I download a large file. How can I check if I've downloaded the large file correctly?
(i) first, paste the md5sum information into a file called 'MD5SUM':
a477b9935efe786a16e7ccb13527b68d  prot.accession2taxid.gz
(ii) then, run:
% md5sum -c MD5SUMS
Hopefully, this will say all is fine:
prot.accession2taxid.gz: OK

Monday 18 July 2016

Transferring GO terms across species

I've written a pipeline to transfer GO terms to their orthologs in other species. Here are some notes to remind myself how to use it:

To run it for a new species (eg. T. regenti):
- Make a directory eg. /lustre/scratch108/parasites/alc/000_50HG_GO/tregenti

- Make a link to the file with the orthologs in the reference species:
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/trichobilharzia_regenti_orths.txt

- Make a link to the file with a protein list for this species:
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/trichobilharzia_regenti_protein_list.txt

- Make a file species_list.txt with just the name of the species of interest, eg. trichobilharzia_regenti

- Make a link to the file with locus tag prefixes:
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/species_locustagprefix

 - Make links to the GO annotation and ontology files:
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/gene_association.fb
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/gene_association.goa_human
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/gene_association.WS243.wb.c_elegans
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/gene_association.zfin
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/go-basic.obo
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/go_terms_files

- Make links to files with mappings between identifiers:
% ln -s  /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/ensembl_zfish_ids
% ln -s  /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/uniprot_human_ids2
% ln -s /lustre/scratch108/parasites/bh4/Compara/50HG_Compara_75_Final/final_id_mapping/trichobilharzia_regenti_id_mapping.txt 
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/other_id_files

- Now run the script to assign GO terms to genes:
% /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/ species_list.txt > bsub_go_terms_jobs   
This does:
% bsub -q small -E 'test -e /nfs/users/nfs_a/alc' -R "select[mem>200] rusage[mem=200]" -M200 -o go_terms.o -e go_terms.e -J GO_terms_trichobilharzia_regenti trichobilharzia_regenti trichobilharzia_regenti_protein_list.txt trichobilharzia_regenti_orths.txt go_terms_files other_id_files go-basic.obo /nfs/helminths02/analysis/50HGP/01INTERPRO/TRE.protein.fa.gz.fas.ipr.gz trichobilharzia_regenti_GO_terms.txt trichobilharzia_regenti_id_mapping.txt
This makes an output file called trichobilharzia_regenti_GO_terms.txt. 

Summarising the results
Count the number of genes with GO terms:
% grep TRE trichobilharzia_regenti_GO_terms.txt | grep 'GO:' | cut -f1 | sort | uniq | wc
8719 [compared to 6215 for S. mansoni]
Count the number of genes with GO terms that came from curations in other species:
% grep TRE trichobilharzia_regenti_GO_terms.txt | grep 'GO:' | grep -v InterPro | cut -f1 | sort | uniq | wc

Count the number of genes with GO terms that came from InterPro domains:
% grep TRE trichobilharzia_regenti_GO_terms.txt | grep 'GO:' | grep InterPro | cut -f1 | sort | uniq | wc

Count the number of distinct GO terms in the file:
%  grep TRE trichobilharzia_regenti_GO_terms.txt | grep 'GO:' | cut -f4 | tr ',' '\n' | sort | uniq | wc 
2510 [compared to 2673 for S. mansoni] 
Count the number of transfers from each reference species:
% grep TRE trichobilharzia_regenti_GO_terms.txt | grep 'GO:' | cut -f2 | grep -v InterPro | tr ',' '\n' | cut -d"(" -f2 | sort  | uniq -c
   2171 caenorhabditis_elegans)
   1979 danio_rerio)
   2735 drosophila_melanogaster)
   4560 homo_sapiens)

- Bhavana's files are all in /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/
- Bhavana's notes on how to run the pipeline are in /nfs/helminths02/analysis/50HGP/00ANALYSES/final_GO_terms/00README

Thank you:
Thank you Bhavana Harsha for making nice notes on how to run the pipeline.

Friday 15 July 2016

Using the linux grep, sort, and awk commands

The linux grep, sort and awk commands are really useful, here are some of the nice features: (I haven't added much here yet, but plan to add as I go along..)

grep examples
% grep -f  file1 file2
Finds patterns specified in file1, in file2, eg. file1 could have a list of words, and this would search for them in file2.

Find lines of a file that have some alphabetical character:
% grep -E '[A-Za-z]' file1

Get the last 5 characters of every line of a file:
% grep -o '.....$' file1

sort examples
Sort using a tab as delimiter:
% sort -k15,15nr -t$'\t' file1

awk examples
Filter a tab-deliminted file to just take lines in which column 17 has value >=50:
% awk -F"\t" '($17 >= 50)'  file1

Filter a tab-delminited file to just take lines in which column 2 has value '727':
% awk -F"\t" '$2 == 727' file1

Filter a tab-deliminted file to just take lines in which column 2 has string 'snow':
% awk -F"\t" '($2=="snow") {print}' file1

tr examples
Get rid of the character " from all lines:
% cat file1 | tr -d '"'

find examples
Find files that have pysvg* in their name:
% find . -name 'pysvg*'
% find /nfs/users/nfs_a/alc/Documents/PythonModules/ -name 'pysvg*'
Find files with pysvg in their name:
% find . -name 'pysvg'

Finding all descendants of a GO term

I wanted to find all descendants of a GO term in the GO hierarchy. First I downloaded the hierarchy from the GO website:
% wget

Then I wrote a little Python script to find all the descendants of a GO term:

from collections import defaultdict
import sys
import os


# define a function to record the children of each GO term in the GO hierarchy:

def read_go_children(input_go_obo_file):
    """record the children of each GO term in the GO hierarchy"""

    # first read in the input GO file, and make a list of all the GO terms, and the
    # terms below them in the GO hierarchy:
    # eg.
    # [Term]
    # id: GO:0004835
    children = defaultdict(list) # children of a particular GO term, in the hierarchy
    take = 0

    fileObj = open(input_go_obo_file, "r")
    for line in fileObj:
        line = line.rstrip()
        temp = line.split()
        if len(temp) == 1:
           if temp[0] == '[Term]':
               take = 1
        elif len(temp) >= 2 and take == 1:
            if temp[0] == 'id:':
                go = temp[1]
            elif temp[0] == 'is_a:': # eg. is_a: GO:0048308 ! organelle inheritance
                parent = temp[1]
                children[parent].append(go) # record that a child of 'parent' is term 'go'
        elif len(temp) == 0:
            take = 0

    return children


# define a function to find all descendants of a GO term in the GO hierarchy:

def find_all_descendants(input_go_term, children):

    """find all the descendants of a GO term in the GO hierarchy
    >> find_all_descendants('GO1', {'GO1': ['GO2', 'GO3'], 'GO2': ['GO4']})
    ['GO1', 'GO2', 'GO3', 'GO4']

    descendants = set()
    queue = []
    while queue:
        node = queue.pop(0)
        if node in children and node not in descendants: # don't visit a second time
            node_children = children[node]

    return descendants 


def main():
    # check the command-line arguments:
    if len(sys.argv) != 3 or os.path.exists(sys.argv[1]) == False:
        print("Usage: %s input_go_obo_file input_go_term" % sys.argv[0])
    input_go_obo_file = sys.argv[1] # the input gene ontology file eg. gene_ontology.WS238.obo
    input_go_term = sys.argv[2] # the input GO term of interest

    # read in the children of each GO term in the GO hierachy:
    children = read_go_children(input_go_obo_file)

    # find all the descendants of the term 'input_go_term':
    descendants = find_all_descendants(input_go_term, children)
    descendantlist = ",".join(descendants)


if __name__=="__main__":


To run it I type, for example, to find descendants of the term GO:0008291 (acetylcholine metabolic process):
% python3 go-basic.obo 'GO:0008291'
This gives me:
DESCENDANTS: GO:0001507,GO:0006581,GO:0008291,GO:0008292

The descendants are:
GO:0006581 acetylcholine catabolic process 
GO:0001507 acetylcholine catabolic process in synaptic cleft (descendant of acetylcholine catabolic process)
GO:0008292 acetylcholine biosynthetic process

Hurray for Python!

- I see a discussion on biostars about doing something similar in R. I didn't check if that gives the same result.
- It turns out that you can also do this using WormMine (I don't know if this is for just C. elegans-related GO terms though)

Thursday 14 July 2016

Viewing a multiple alignment

I was looking for a quick way to view a multiple alignment, and found Mview at the EBI. Nice and easy!

Friday 8 July 2016

Calculating a conservation score for a multiple alignment

I wanted to calculate a conservation score for a multiple alignment in clustalw format, so investigated how to do this.

I found a nice biostars discussion on different ways to do this, which mentioned a program by Capra & Singh (Bioinformatics 2007). It can calculate a score using 'Jensen-Shannon divergence'. This is the one I used in the end (see below).

I also heard from Daniel Caffrey about the Pfaat software for viewing and editing multiple alignments, which calculates conservation scores based on Von Neumann entropy, Shannon entropy etc.

Calculating a conservation score using Capra & Singh's program

1. First I downloaded their program from their website. It's a Python (Python2) script, so easy to install!
(note to self: I put it in /nfs/users/nfs_a/alc/Documents/bin/conservationscore/conservation_code/

2. I had an input clustalw file called
(note to self: to get an alignment for a family from our 50HG database:
% perl -w 191613 es9_compara_homology_final2_75 >
Note: I found that the script by Capra & Singh can't handle '*' characters in the protein alignment, so I changed these to 'X' characters first (using 'tr '\*' 'X'). It also can't handle ':' characters in the sequence names. 

3. To calculate a score using their program:
% python
It seems to use Python2, also it needs the input alignment file to be in the same directory as the script is installed in. By default, this scores an alignment using Jensen-Shannon divergence, and a window size of 3 (on either side of the residue), and the blosum62 background distribution.

4. The output file looks like this:
# /nfs/users/nfs_a/alc/Documents/git/helminth_scripts/scripts/compara_50HG/ -- js_divergence - window_size: 3 - background: blosum62 - seq. weighting: True - gap penalty: 1 - normalized: False
# align_column_number   score   column

0       -1000.00000     --------------------------------------------------------------------------------------------------------------------M----------------------------------------
1       -1000.00000     --------------------------------------------------------------------------------------------------------------------S----------------------------------------


Each column is a sequence, and each row is an alignment position. It seems some alignment positions have score of -1000. Probably the alignment positions with score of -1000 should be ignored. To get an overall alignment score you could get an alignment score that was the median of the other alignment positions' conservation scores.


Thursday 23 June 2016

QED score for compounds

The QED score (Quantitative estimate of druglikeness) score for compounds was published by Bickerton et al 2012. To calculate this, a series of desirability functions (d) are derived, each corresponding to a different molecular descriptor. These are combined to give the QED by taking the geometric mean of the individual functions. The individual functions used are 8 widely-used molecular properties that are thought to be important for determining druglikeness:
1) molecular weight (MW)
2) octanol-water partition coefficient (ALOGP)
3) number of hydrogen bond donors (HBD)
4) number of hydrogen bond acceptors (HBA)
5) molecular polar surface area (PSA)
6) number of rotatable bonds (ROTB)
7) the number of aromatic rings (AROM)
8) number of structural alerts (ALERTS).

RNASeq gene expression

I'm trying to get my hands on some RNAseq gene expression data, and have to remind myself what do RPKM and FPKM mean again?

RPKM: Reads per kilobase of transcript, per million mapped reads. This is a measure of a transcript's expression level, normalised for the length of the transcript, and for the total amount of reads in the data set.

FPKM: Fragments per kilobase of transcript, per million mapped reads. In RNAseq, the relative expression of a transcript is proportional to the number of cDNA fragments that originate from it. That is, if your data is paired-end RNAseq data, and you see two reads from the same read-pair, then these are only counted as one fragment when calculating the FPKM (but counted as two reads when calculating RPKM).

How do you get RPKM or FPKM values?
1. First you need to map your raw reads to the genome assembly eg. using TopHat. If you already have a gff/gtf file of known genes, you can use the -G option to map to the known genes.
2. You can then calculate the RPKM or FPKM using Cufflinks.

RNAseq data files
Just to remind myself, there are also different types of RNAseq data files:
bam files: have the reads mapped to an assembly
BigWig files: useful for displaying RNAseq data, as they are in an indexed binary format.

There is some discussion on the FPKM values calculated by Cufflinks here. Apparently TopHat used to calculate RPKM, but it's now deprecated and recommended to use Cufflinks (see here).

Classes of anthelmintic drugs

The MeSH browser can be used to study the MeSH (medical subject headings) classification system. Drugs and compounds in the MeSH pharmalogical classification can be browsed here. I'm interested in anthelmintic drugs. For these the MeSH classes are:

   - Antinematodal Agents
       - Filaricides
   - Antiplatyhelmintic Agents  
       - Anticestodal Agents
       - Schistosomicides

But anthelmintic drugs also go by other names..
Other names to look for are: anti-helminthic, antischistosomal, antifilarial, antitrematodal, filaricidal, fasciolicidal, fasciolicide, microfilaricidal, nematicidal, nematocidal, taeniacide.


Friday 17 June 2016

Scripts for gene statistics based on a gff

I've written some scripts to calculate gene statistics based on a gff file, here's how they work (to remind myself)..

(1) Given a protein fasta file, find the mean and median protein length:
First you get the longest transcript for each gene:

This needs a list of the transcript names:
% grep mRNA augustus3.c.gff3 | grep 'Parent=g' | tr ':' ' ' | tr ';' ' ' | cut -f9 | tr "=" ' ' | awk '{ print $2"\t"$4 }' > augustus3.transcripts 
16,767 transcripts

[script uses python2 because of biopython version]
% python2 /nfs/users/nfs_a/alc/Documents/git/Python/ aug.c3.aa augustus3.transcripts augustus3.longest.pep
12,405 transcripts

[script uses python2 because of biopython version]
% python2 ~alc/Documents/git/Python/ augustus3.longest.pep
Mean length=510.306812, median length=336.000000

(2) Given the gff file, calculate the intron statistics:
First make a gff file with the longest transcript:
% python2 /nfs/users/nfs_a/alc/Documents/git/Python/ augustus3.longest.pep augustus3.c.gff3 augustus3.c.longest.gff3

Then calculate the total length of intron DNA, number of introns, mean and median intron length: % python3 ~alc/Documents/git/Python/ augustus3.c.longest.gff3
Num_cds_regions (coding exons)=74902
Total intron bases=156226973
Number of introns=62497
Median length of intron regions=1610.000000 bp
Mean length of intron regions=2501.492744 bp

(3) Given the gff file, calculate the coding exon statistics (total amount of bases in coding exons, number of coding exons, median and mean length of coding exons, mean and median coding exons per gene):

% python3 ~alc/Documents/git/Python/ augustus3.c.longest.gff3
Num_cds_regions (coding exons)=74902
Total coding exon (CDS) bases=19024352
Number of CDS regions=74902
Median length of CDS regions=165.000000 bp
Mean length of CDS regions=254.041868 bp
Mean number of CDS per gene=6.038049
Median number of CDS per gene=4.000000

Monday 6 June 2016

A list of known anthelmintic drugs

I wanted to make a list of known anthelmintic (anti-helminth) drugs. It sounded so simple, but took a while. In the end I compiled info from various sources:

WHO ATC codes
- The WHO list of anthelmintic drugs for veterinary use (and here too), and list WHO list of anthelmintic drugs for human use
- the WHO also has some nice documents describing nomenclature for drugs here, where they list drugs in different classes, including anthelmintics

- Anthelminitic drugs listed in wikipedia

KEGG Drugs
- KEGG drugs (although I don't have a subscription, so just could search for one name at a time, to confirm those I'd found)

- a ChEMBL keyword search for 'anthelmintic' (by our collaborator Prudence Mutowo at ChEMBL - thanks Prudence)

Merck Manuals
- the Merck Veterinary Manual (online) proved to be a treasure trove
- the Merck Medical Manual (online)

- all the anthelmintics in DrugBank : these can be found by searching for those with MeSH term 'anthelmintic' : here (see DrugBank category browser)

- 215 PubChem terms manually annotated with MeSH term 'antelmintics' here
(how I got that link: looked up one anthelmintic drug eg. carbendazole, then under Pharmalogical Properties, there are the MeSH terms such as 'anthelminitics', and a link to 'all PubChem compounds annotated with this term', so click on that ===> (later) actually I asked PubChem and they said I can just search for "anthelmintics"[PharmAction] AND "chembl"[SourceName] to get all PubChem anthelmintics that have a ChEMBL id.)
- You can find a list of antifilarial compounds in PubChem using this link
- In the end I found I got the most compounds in PubChem by searching for "anthelmintics OR anthelmintics OR nematocide OR nematicide".

- a wormbook chapter on 'Anthelmintic drugs'

- a ChEBI search for 'anthelmintic'

- The book 'Approaches to design and synthesis of antiparasitic drugs' edited by Anand & Sharma, which has a table of anthelmintic drugs.
Gave me a list of about 160 anthelmintic compounds, many no longer in use but useful to have such a list..

Some things I learnt along the way:

Searching ChEMBL
- Sometimes a compound is in ChEMBL but has not been named. If you can get the SMILES string from another source (eg. from wikipedia or PubChem), you can search for it in ChEMBL using the SMILES string. To do this, go to the ChEMBL homepage, and paste the SMILES into the 'List Search' box on the right hand side after clicking on the SMILES Search radio button.
- When I search ChEMBL using the SMILES, sometimes it has matches that have the same structure if stereochemistry (around chiral centres) is ignored, but a different structure when stereochemistry is considered! I find the PubChem search with SMILES (PubChem structure search here) does respect stereochemistry, so sometimes it's best to search PubChem with the SMILES, and see if the PubChem match has a ChEMBL link in its entry (rather than searching ChEMBL directly).

Searching PubChem
- You can search PubChem using a SMILES here.
- Often there are several PubChem entries for a drug but they have different numbers of HCl or water molecules in them, for example for bisbenzimide.
- To search PubChem for a long IUPAC name such as "N-(4-bromophenyl)-2,6-dihydroxybenzamide", you need to include quotes on either end, or the search doesn't work.

Searching ChEBI
- You can do a search for the term 'anthelmintic' on the ChEBI homepage, but that only finds about 10 records.
- A far better way is to go to the ChEBI homepage, click on 'Advanced Search', scroll down the bottom and in 'Ontology term' select 'has role' and type 'anthelmintic'. This picks up compounds annotated as having a role 'antiplatyhelmintic', 'antinematodal' or 'schistosomicide' too.

Convering a picture of a molecule to a SMILES
- In some cases I couldn't find the molecule in PubChem or ChEMBL, but could see a structure in a paper. In this case I had to draw in the molecule using the molecule drawing tool at PubChem, and then this gave me the SMILES, which I could use to search ChEMBL.

Converting IUPAC names to SMILES
- In some cases, I found a list of IUPAC names in a textbook, and wanted to know if they are in PubChem. However, I've heard that the IUPAC names in PubChem are not always correct. I then heard I can convert IUPAC names to SMILES using the OPSIN software by Daniel Lowe and colleagues. I can then search PubChem using the SMILES on their structure search page. Great!

SMILES in PubChem
- PubChem gives the 'isomeric SMILES' and the 'canonical SMILES' for a molecule, but the one to use is the 'isomeric SMILES' as this gives the stereochemistry (unless there is no 'isomeric SMILES' given, in which case use the 'canonical SMILES').

Metal-containing compounds
- ChEMBL does not give SMILES for metal-containing compounds because SMILES does not have a good way to represent the coordinate bonds nor a recognised way to
describe stereo around a metal centre. This can make it awkward to find the structure though, eg.
for 'gallium nitrate', I found it in ChEMBL by searching for the name, but here is no PubChem link in ChEMBL and the ChEMBL entry has no SMILES but has has formula GaN3O9 and synonym ganite. I then searched for 'ganite' in PubChem and found 3 entries, only one is GaN3O, and this gives a SMILES.
- Another interesting example is motexafin gadolinium: this has two PubChem entries with the same SMILES:
158385: CCC1=C(C2=CC3=NC(=CN=C4C=C(C(=
However, these are different molecules, if you look at the pictures in PubChem you'll see that the two structures in PubChem have different coordinate bonds.
Noel O'Blog tells me that SMILES for metal-containing compounds don't really work properly and so this is why we get these problems.

Chemical classes of compounds
- To find out the chemical class of a compound, it's useful to use ChEBI, which has classified compounds using a chemical ontology. If your compound isn't in ChEBI, you can use the SMILES to search for similar compounds: go to 'Advanced Search', then click on the 'File Open' symbol in the toolbar around the square at the left, then paste in your SMILES, and click OK, then click 'Find compounds which resemble this structure' on the right, and click 'Go'.

Thanks to Noel O'Blog for helping me when I got stuck with the cheminformatics.  Thanks also to the ChEMBL helpdesk for answering my questions.

Friday 3 June 2016

Matrix multiplication in R

It's quite easy to multiply matrices in R. For example if you have a row matrix:
> a <- matrix(c(0,0,0.5,0.5),1,4)
> a
     [,1] [,2] [,3] [,4]
[1,]    0    0  0.5  0.5

and a 4x4 square matrix:
> P <- matrix(c(0,1/3,1/3,0,1/3,0,0,1,1/3,0,0,0,1/3,2/3,2/3,0),4,4)
> P
          [,1]      [,2]      [,3]      [,4]
[1,] 0.0000000 0.3333333 0.3333333 0.3333333
[2,] 0.3333333 0.0000000 0.0000000 0.6666667
[3,] 0.3333333 0.0000000 0.0000000 0.6666667
[4,] 0.0000000 1.0000000 0.0000000 0.0000000

Now multiply them:
> a %*% P
          [,1] [,2] [,3]      [,4]
[1,] 0.1666667  0.5    0 0.3333333

Handy for Markov chains!

Friday 13 May 2016

Plotting in R - some ggplot2 tricks

I love to use ggplot2 for plotting in R, but am a beginner, and trying to master it. Here's some things I've learnt recently.

How to plot multiple figures on one plot
I got some hints from stackoverflow. Here's a rough example of a multi-plot with 6 plots, in 2 columns and 3 rows:
# make 6 plots and save as myplot_1, ... myplot_6
# make final plot:

How to change the angle of x-labels for the boxes in a boxplot
Something like this: (angle changes the angle, size changes the size of the text)
myplot <- myplot + geom_boxplot(outlier.size = 2, fill="red") + ylab("% Repeat") + xlab("Clade") + theme(axis.text.x=element_text(angle=-25,size=7))

Add a title to the plot:
Something like this: (use 'ggtitle')
myplot <- myplot + geom_boxplot(outlier.size = 2, fill="red") + ylab("% Repeat") + xlab("Clade") + theme(axis.text.x=element_text(angle=-25,size=7)) + ggtitle("Unknown repeats")

Change colour of boxes in boxplot:
Something like this: (note: put aes(fill=myxorder) in the ggplot command rather than the geom_boxplot command)
myplot <- ggplot(data = mydata_b, aes(myxorder, myvalues_b, fill=myxorder))
myplot <- myplot + geom_boxplot(outlier.size = 2) + ylab("% Repeat") + xlab("Clade") + scale_fill_manual(values=c("red","pink","green","blue","yellow","orange","purple"))
You can remove the legend made using:
..."orange","purple")) + guides(fill=FALSE)

Put the x-axis title closer to the plot:
Something like this: (use theme=axis.title.x)
myplot <- ggplot(data = mydata_b, aes(myxorder, myvalues_b))
myplot <- myplot + geom_boxplot(outlier.size = 2, fill="red") + ylab("% Repeat") + xlab("Clade") + theme(axis.title.x=element_text(vjust=2.0))