Thursday 28 February 2013

Using Artemis to view annotations

Starting Artemis
I am using Artemis on a server called pcs4 at Sanger. To run Artemis, I first need to log into pcs4:
% ssh -Y pcs4

 Then start Artemis:
% art
This brings up the Artemis menu:

[screenshot: the Artemis launch window]

Opening a fasta file in Artemis
To open a file in Artemis, go to the 'File' menu, choose 'Open', and select the fasta or embl file for your sequence, for example a fasta or embl file for an assembly that you want to look at.

   Alternatively, you can start Artemis by telling it the name of your fasta or embl file on the command-line, eg.
% art PTRK.v1.fa
where 'PTRK.v1.fa' is a fasta file for the assembly that I want to look at.

   This brings up a window with one scaffold from my assembly displayed, and a list of all the other scaffolds below:

[screenshot: Artemis showing one scaffold, with the other scaffolds listed below]

Opening an embl file in Artemis
If you have gene predictions stored in an embl file, you can bring it up by typing:
% art PTRK.contig.00392.62747.embl
where PTRK.contig.00392.62747.embl is the name of the embl file.

This brings up a nice Artemis window:

[screenshot: Artemis showing the gene predictions from the embl file]

Loading annotations from a gff file in Artemis
You can also load an embl file and a gff file at once, eg.
% art PTRK.contig.00392.62747.embl + PTRK.contig.00392.62747.features.gff
where PTRK.contig.00392.62747.features.gff has some additional features that you want to display.
Note that the gff file has to be for the same scaffold/chromosome as the embl file (it can't contain multiple scaffolds).
     If you load a gff file into Artemis, it might not show the introns between the separate exons in a gene, so it can be hard to see which exons belong to which gene. As a result, it's a better idea to convert your gff files to embl files (eg. using my script gff_to_embl.pl), and load those into Artemis.

Viewing mapped RNA-seq reads in Artemis
I have a bam file of mapped RNA-seq reads, mapped to my genome using TopHat. I can load the bam file into Artemis by going to the Artemis 'File' menu and selecting 'Read BAM/VCF' and selecting the bam file. Note that the bam file needs to be sorted and indexed using Samtools (ie. you need to have created a .bai file for the bam file, as described here). The bam file can contain more scaffolds than are present in the embl file that you opened (ie. contain the scaffold in the embl file, plus additional scaffolds).
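
If you still need to sort and index your bam file, here is a minimal sketch using samtools (the filenames are placeholders; note that newer versions of samtools use 'samtools sort -o mapped.sorted.bam mapped.bam' instead of an output prefix):
% samtools sort mapped.bam mapped.sorted    # makes mapped.sorted.bam
% samtools index mapped.sorted.bam          # makes the index file mapped.sorted.bam.bai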

    The default view of the mapped reads is 'stack' view:

[screenshot: mapped RNA-seq reads in 'stack' view]

Note that reads shown in blue are unique reads, whereas reads in green are 'duplicated' reads that have been mapped to exactly the same position on the reference sequence. To save space, if there are duplicated reads, only one is shown by Artemis.

If you want to filter the reads that are shown from the bam file, you can right-click, choose 'Filter reads', and then choose to only keep reads above a certain mapping quality.

You can change this to a sliding-window plot of coverage, by right-clicking on the display of the mapped reads, and selecting 'Views' -> 'Coverage':

[screenshot: sliding-window coverage plot of the mapped reads]

Monday 25 February 2013

Using RATT to transfer gene predictions between species

The RATT software by Thomas D. Otto (described in a paper by Otto et al. 2010) can be used to transfer gene predictions (or other annotations) from one species to another, from one version of an assembly to a different assembly for the same species, or from one strain of a species to a different strain. It is easy to run.

How does RATT work?
RATT transfers annotations based upon conserved synteny between assemblies (of one strain), strains or species. First the two sequences (reference assembly with annotations, and the new assembly that you want to transfer annotations to) are compared using 'nucmer' from the MUMmer package, to identify syntenic regions. To be included, synteny blocks must have at least 40% sequence identity at the nucleotide level. Based on the nucmer alignments, RATT identifies a syntenic region in the new assembly for each region in the reference assembly (the one with annotations). It then recalibrates the alignment coordinates within syntenic regions by using SNPs (identified using show-snp from the MUMmer package) as unambiguous anchor points within synteny blocks.
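
To get a feel for this comparison step, you can run the MUMmer commands yourself (a sketch only; RATT runs these internally with its own settings, and the filenames here are placeholders):
% nucmer -p ref_vs_new reference.fa new_assembly.fa    # aligns the two assemblies, producing ref_vs_new.delta
% show-snps -C ref_vs_new.delta > ref_vs_new.snps      # lists SNPs from unambiguously aligned regions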

RATT next proceeds to the annotation-mapping step, where it tries to transfer each gene prediction (or other annotation) from the reference assembly to the new assembly. A feature is not mapped if it spans a synteny break, if its coordinates match different chromosomes or different strands in the new assembly, or if the new mapped distance of its coordinates has increased by more than 20 kb. After transferring the annotations to the new assembly/genome, RATT refines the transferred gene predictions to make sure that the predicted start and stop codons look alright, to avoid internal stop codons, and to fix incorrect splice sites.

The RATT paper reported that RATT has good accuracy (shown to be better than the Glimmer gene prediction software), but occasionally makes errors in determining borders of features, as well as in transferring small features, gene families, genes in regions of limited synteny, or very divergent genes.

Running RATT
Say you want to transfer gene predictions from one species (species 1) to another (species 2). 
First, you need embl files of the original gene predictions that you want to transfer, with coordinates with respect to the genome assembly for species 1. If your original gene predictions are in a gff file, you can convert them to embl files (one per chromosome/scaffold) using my perl script gff_to_embl.pl. Note that this script expects that (i) for each gene, the features in the input gff file for that gene are in the order gene line, mRNA line, CDS line(s); (ii) the scaffold/chromosome names used are the same between the input gff file and input genome fasta file; (iii) the columns in the gff file are separated by tabs and not spaces.
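
For instance, the features for one gene in the input gff file should look something like this (a sketch with made-up coordinates and IDs; the columns must be separated by tabs, not the spaces shown here):
scaffold1   source  gene    1000    3000    .       +       .       ID=gene1
scaffold1   source  mRNA    1000    3000    .       +       .       ID=mRNA1;Parent=gene1
scaffold1   source  CDS     1000    1500    .       +       0       Parent=mRNA1
scaffold1   source  CDS     2000    3000    .       +       0       Parent=mRNA1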

Secondly, you need an assembly fasta file for the genome of the species that you want to transfer gene predictions to (species 2).

The command to use RATT is:
% setenv RATT_HOME /nfs/users/nfs_t/tdo/Bin/ratt/
% $RATT_HOME/start.ratt.sh ratt_input_embl species2.fa Transfer1 Species
where '/nfs/users/nfs_t/tdo/Bin/ratt/' is the directory where you have installed RATT,
'ratt_input_embl' is the directory with the embl files for species 1,
'species2.fa' is the fasta file for the genome assembly for species 2,
'Transfer1' is the prefix that will be given to the results file,
'Species' tells RATT that you are transferring annotations between species (if you are transferring between different assemblies for the same species, you can put 'Assembly' here).

I find that RATT requires quite a lot of memory (RAM). For example, to transfer gene predictions to a 135 Mbyte genome, I needed to request 5 Gbyte of memory when I was running it on the Sanger compute farm.
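
On an LSF-based compute farm, the submission might look something like this (a sketch; the queue name and the units expected by -M depend on the local LSF configuration, so check yours before using it):
% bsub -q normal -R "select[mem>5000] rusage[mem=5000]" -M 5000000 -o ratt.o -e ratt.e "$RATT_HOME/start.ratt.sh ratt_input_embl species2.fa Transfer1 Species"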

Interpreting RATT's output
In the RATT standard output (goes to the .o file if you are running RATT on the Sanger farm), you get a summary of what RATT has done, looking something like this:
Overview of transfer of annotation elements:
385     elements found.
385     Elements were transfered.
0       Elements could be transfered partially.

2       Elements split.
0       Parts of elements (i.e.exons tRNA) not transferred.
0       Elements couldn't be transferred.

CDS:
385     Gene models to transfer.
385     Gene models transferred correctly.
0       Gene models partially transferred.
0       Exons not transferred from partial CDS matches.
0       Gene models not transferred.

803 Sequences where generated


For each chromosome/scaffold, RATT makes an output file in embl format called [...]final.embl (eg. Transfer1.chr1.final.embl), with the gene predictions in embl format in your target species (species 2). It is possible to convert these to gff format, using my Perl script embl_to_gff.pl. You can do this using a shell script like this:
#!/usr/bin/env tcsh
foreach file (*final.embl)
      echo "$file"
      perl -w embl_to_gff.pl $file $file.gff . yes
end


For each chromosome/scaffold, RATT makes an output file called [...]report.gff (eg. Transfer1.chr1.report.gff), which records the differences between the reference annotation and the transferred annotation, including any changes made by RATT in the 'refinement' step, and any problems that RATT thinks are remaining in the final transferred gene predictions (eg. if a gene prediction is missing a start codon).

RATT will also make [...]Report.txt files that tell you whether each gene prediction was transferred cleanly, or whether there are potential errors in the transferred gene prediction (eg. the start codon or stop codon of the transferred gene prediction looks wrong). I found that some errors were reported for my transferred gene predictions in these report files. Thomas Otto suggested that at least 80% of the CDS should be transferred for a transferred gene prediction to be used with any confidence.

Problems transferring genes using RATT
There may be a few genes that RATT said it could not transfer. Sometimes I have found that it is possible to transfer these manually using exonerate:
% exonerate -t assembly.fa -q protein.fa --model protein2genome --bestn 1 --showtargetgff > output.exonerate
where assembly.fa is your new assembly, protein.fa is the translation of the gene you want to transfer, output.exonerate is the exonerate output file.

Another thing that I find is that in a few cases RATT transfers (or partially transfers) a gene, but the transferred gene model is not at all similar to the original one. It is possible to weed out cases like this by aligning the translation of the original gene model to that of the RATT-transferred one, using ggsearch (for global alignment) or ssearch (for local alignment) from the FASTA package.
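
For example, using ggsearch36 from the FASTA package (a sketch; the program name varies between FASTA versions, and the filenames are placeholders):
% ggsearch36 original_protein.fa transferred_protein.fa > original_vs_transferred.ggsearch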

Viewing RATT's output in Artemis/ACT
You can view transferred annotations, as well as the information in the report.gff file for a chromosome/scaffold, in Artemis, by typing:
[Note to self: I first need to type: 'ssh -Y pcs4']
% art chr1.fa + Transfer1.chr1.final.embl + Transfer1.chr1.report.gff

You will see something like this - this example shows a gff tag where the report.gff has highlighted the fact that a gene prediction is missing a start codon (the translation starts with 'I'):

[screenshot: Artemis showing the transferred annotation and the report.gff tag]

Monday 18 February 2013

Using exonerate to align ESTs/cDNAs/proteins to a genome

Aligning ESTs/cDNAs to a genome using exonerate
To align ESTs (or EST clusters) or cDNAs to a genome sequence, the program exonerate can be used. This can be run using the command:
% exonerate --model est2genome --bestn 10 est.fa genome.fa
where 'est.fa' is your fasta file of ESTs, and 'genome.fa' is your fasta file of scaffolds/chromosomes for your genome. The '--bestn 10' option finds the top 10 matches of each EST in the genome.

Other useful options are:
-t genome.fa : the fasta file 'genome.fa' of scaffolds/chromosomes for your genome,
-q est.fa : the fasta file 'est.fa' of ESTs,
--showtargetgff : give the output in gff format,
--showalignment no : do not show the alignment between the EST and genome in the output,
--refine full : this performs a more accurate alignment between the EST and genome, but will be slower (I found this very slow),
--model coding2genome : the exonerate man page says that this is similar to '--model est2genome', but that the query sequence is translated during comparison, allowing a more sensitive comparison.
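
Putting some of these options together, a typical command might look like this (a sketch; the filenames are placeholders):
% exonerate -t genome.fa -q est.fa --model est2genome --bestn 10 --showtargetgff --showalignment no > est_vs_genome.out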

In practice, I found that it is fastest to run exonerate by putting each EST cluster into a separate fasta file, and then running exonerate for each EST cluster against the genome assembly. For example, to align Parastrongyloides trichosuri EST clusters to the genome assembly, I put the EST sequences in files seq1, seq2, seq3, etc. and ran the following shell script:

#!/usr/bin/env tcsh
foreach file (seq*)
      echo "$file"
      exonerate -t genome.fa -q $file --model coding2genome --bestn 1 --showtargetgff > $file.out
end

I found that this required quite a lot of memory (RAM), so I requested 500 Mbyte of RAM when I submitted it to the Sanger compute farm.
This produced files seq1.out, seq2.out, seq3.out, etc. with the exonerate results for EST clusters seq1, seq2, seq3, etc.

Each exonerate gene prediction is given a score. I find that gene predictions with scores of at least 1000 look pretty good.
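
For example, to keep only the exonerate 'gene' GFF lines with a score of at least 1000 (a sketch; this assumes you used --showtargetgff, which puts the score in column 6 of the GFF lines):
% awk -F'\t' '$3 == "gene" && $6 >= 1000' seq1.out > seq1.filtered.gff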

Getting the predicted proteins in ESTs from exonerate
If you use the --model coding2genome option, exonerate will give you a protein alignment for the predicted gene and the EST that you align it to. If you wanted to get the predicted protein from the EST that is in that alignment, you could use this extra option:
exonerate --ryo ">%qi (%qab - %qae)\n%qcs\n"
(see here and here).

Aligning proteins to a genome using exonerate
To align proteins to a genome, you need to use '--model protein2genome' instead of '--model coding2genome', eg.:
% exonerate -t genome.fa -q protein.fa --model protein2genome --bestn 1 --showtargetgff

Running BLAST first, and then running exonerate for the genome regions with BLAST hits
Exonerate can be a bit slow to run, so if you want to speed things up (with some loss of sensitivity), you can run BLAST first to find the regions of a genome assembly with hits to your query EST or protein, and then run exonerate to align the query EST/protein to the region of the BLAST hit.

You can use my script run_exonerate_after_blast.pl for this, eg.
% run_exonerate_after_blast.pl assembly.fa proteins.fa exonerate_output . 0.05 25000 /software/pubseq/bin/ncbi_blast+/
where assembly.fa is your input assembly file,
proteins.fa is the file of proteins you want to align using exonerate,
exonerate_output is the exonerate output file, processed to just have some of the GFF lines,
0.05 is the BLAST e-value cutoff to use, 
25000 means that we run exonerate on the BLAST hit region plus 25,000 bp on either side,
/software/pubseq/bin/ncbi_blast+/ is the path to the directory with BLAST executables.
[Note to self: you will probably need to submit this to the 'long' queue (48 hour time-limit) on the Sanger farm, with 1000 Mbyte of RAM.]

If the job runs out of run-time on the Sanger farm before finishing, the output so far will be in a large file called tmp0.xxx... where xxx... are digits. You can convert this to the format that the output should be in by using my script convert_exonerate_gff_to_std_gff.pl.

Aligning intron regions of one genome to another genome

In this example, I had a file of intron sequences from one species, and wanted to align them to the full genome of a second species. I decided to use the 'ryo' (roll-your-own) output format, which allows you to specify your own output format:

% exonerate --model affine:local --revcomp --ryo "%qi %qab %qae %ti %tab %tae %s %pi" --showalignment no --bestn 1 introns.fa genome2.fa

where --model affine:local will give local alignments,
--revcomp looks on both strands for matches,
--ryo "%qi %qab %qae %ti %tab %tae %s %pi" prints out the query id., start and end of the alignment in the query, target id., start and end of the alignment in the target, alignment score, and percent identity of the alignment,
--showalignment no means the alignment isn't shown,
--bestn 1 just gives the best hit for each query.

Other useful exonerate options 
--model protein2genome:bestfit : this includes the entire protein in the alignment.
--exhaustive yes :  makes more accurate alignments (but is slower).

Wednesday 13 February 2013

Bioinformatics lecture notes and practicals

I've decided to put up on the web the Powerpoint slides for the bioinformatics lectures that I used to give, so that others can use them. Here are those I have uploaded so far:

Genomes, DNA and sequencing
Introduction to genomes (lecture): slideshare

Sequence comparison and alignment
Homologs, orthologs and paralogs (lecture): slideshare
Dotplots for bioinformatics (lecture): slideshare
Pairwise sequence alignment (lecture): slideshare
The Needleman-Wunsch algorithm (lecture): slideshare
Alignment scoring functions (lecture): slideshare
The Smith-Waterman algorithm (lecture): slideshare
Statistical significance of alignments (lecture): slideshare
Multiple alignment (lecture): slideshare
BLAST (lecture): slideshare
Dotplots & pairwise sequence alignment (R practical): Little Book of R for Bioinformatics (chap. 4)
Multiple alignment (R practical): Little Book of R for Bioinformatics (chap. 5)

Hidden Markov Models (HMMs)
HMMs for bioinformatics (lecture): slideshare
HMMs for bioinformatics (R practical): Little Book of R for Bioinformatics (chap. 10)

Tuesday 12 February 2013

Introduction to HMMs

I was recently talking to a colleague about HMMs (Hidden Markov models), which got me thinking about their beauty!

When I got home, I decided to put the 'Introduction to HMMs' lecture that I used to give to bioinformatics students on Slideshare. You can also see an R tutorial on HMMs that I wrote.

Calculating confidence intervals using R

Let's have a blitz on calculating confidence intervals using R!

An approximate large-sample confidence interval for the population mean
If we have a large random sample of size n from a population, we can calculate an approximate confidence interval for the population mean using the following R function:
> largeSampleConfidenceInterval <- function(sampleMean, sampleSd, n, confidenceLevel)
   {
         alpha <- 1 - (confidenceLevel/100)
         error <- qnorm(1-(alpha/2))*(sampleSd)/sqrt(n) 
         upperLimit <- sampleMean + error
         lowerLimit <- sampleMean - error
         print(paste(confidenceLevel,"% confidence interval: (",lowerLimit,",",upperLimit,")"))
   }
For example, say we want a 95% confidence interval for the mean, when our sample mean is 2.45, our sample standard deviation is 2.03, and our sample size is 621:
> largeSampleConfidenceInterval(2.45, 2.03, 621, 95)
[1] "95 % confidence interval: ( 2.29033918977288 , 2.60966081022712 )"
This tells us that an approximate 95% confidence interval for the mean is (2.29, 2.61).
Note that this approach of calculating an approximate confidence interval is only valid when the sample size is large.

An approximate large-sample confidence interval for a proportion
If we have observed x successes in n independent data points (Bernoulli trials), for example, if we observe 20% of 110 schoolchildren to be red-haired, we can calculate an approximate confidence interval for the true proportion of successes using this R function:
> largeSampleConfidenceIntervalForProportion <- function(observedP, trials, confidenceLevel)
   {
         alpha <- 1 - (confidenceLevel/100) 
         error <- qnorm(1-(alpha/2))*sqrt(observedP*(1-observedP)/trials)    
         upperLimit <- observedP + error
         lowerLimit <- observedP - error
         print(paste(confidenceLevel,"% confidence interval: (",lowerLimit,",",upperLimit,")"))
   }
For example, if we observe 20% of 110 schoolchildren to be red-haired, we can calculate a 90% confidence interval for the true proportion of children who are red-haired by typing:
> largeSampleConfidenceIntervalForProportion(0.20, 110, 90)
[1] "90 % confidence interval: ( 0.137267744076674 , 0.262732255923326 )"
This gives an approximate 90% confidence interval of (0.14, 0.26).
Again, this approach is only valid for large samples.

Note that you can also use the R prop.test() function to calculate a confidence interval for a proportion, but it will give a slightly different confidence interval, as it uses a slightly different method:
> prop.test(x=0.20*110,n=110,conf.level=0.90)
90 percent confidence interval:
 0.1408916 0.2745358
In fact, there are many ways to calculate confidence intervals for proportions; some of these are implemented in the R package PropCIs.

An approximate large-sample confidence interval for the difference between two proportions
If we observed a proportion p1 in a sample of size n1, and a proportion p2 in a sample of size n2, we can calculate an approximate confidence interval for the difference between the two proportions using this R function:
> largeSampleConfidenceIntervalForDifferenceInProportions <- function(p1, n1, p2, n2, confidenceLevel)
   {
         alpha <- 1 - (confidenceLevel/100) 
         error <- qnorm(1-(alpha/2))*sqrt((p1*(1-p1)/n1) + (p2*(1-p2)/n2))    
         upperLimit <- (p1 - p2) + error
         lowerLimit <- (p1 - p2) - error
         print(paste(confidenceLevel,"% confidence interval: (",lowerLimit,",",upperLimit,")"))
   }
For example, if we observe that 20% of a sample of 110 schoolchildren in Dublin are red-haired, and that 25% of a sample of 104 schoolchildren in Cork are red-haired, we can calculate a 95% confidence interval for the difference in proportions using:
> largeSampleConfidenceIntervalForDifferenceInProportions(0.20, 110, 0.25, 104, 95)
[1] "95 % confidence interval: ( -0.161862788606968 , 0.0618627886069677 )"
This gives us an approximate 95% confidence interval of (-0.16, 0.06).
Again, this approach is only valid for large samples.

We can actually do the same thing in R using prop.test():
> prop.test(x=c(0.20*110,0.25*104),n=c(110,104))
95 percent confidence interval:
 -0.17121594  0.07121594
It doesn't give the same answer as our largeSampleConfidenceIntervalForDifferenceInProportions() function, because it is applying Yates' continuity correction. It will give the same answer if we turn off Yates' continuity correction (it is usually advisable to use Yates' correction, but just to show you):
> prop.test(x=c(0.20*110,0.25*104),n=c(110,104),correct=FALSE)
95 percent confidence interval:
 -0.16186279  0.06186279 

An exact confidence interval for a Normal mean
If we know that our data comes from a Normally distributed variable, we can calculate an exact confidence interval for the population mean using this R function:
> exactConfidenceIntervalForNormalMean <- function(sampleMean, sampleSd, n, confidenceLevel)
   {
         alpha <- 1 - (confidenceLevel/100)
         error <- qt(1-(alpha/2),df=n-1)*(sampleSd)/sqrt(n) 
         upperLimit <- sampleMean + error
         lowerLimit <- sampleMean - error
         print(paste(confidenceLevel,"% confidence interval: (",lowerLimit,",",upperLimit,")"))
   }
For example, if we observed a mean of 294.81 and a standard deviation of 1.77 based on a sample of 10 data points, we can calculate an exact 90% confidence interval for the population mean by typing:
> exactConfidenceIntervalForNormalMean(294.81, 1.77, 10, 90)
[1] "90 % confidence interval: ( 293.783964262636 , 295.836035737364 )"
This gives us an exact 90% confidence interval of (293.78, 295.84).

An exact confidence interval for the difference between two Normal means
If we have collected two samples for a variable that we know to be Normally distributed, and we know that the samples have the same population variance, then we can calculate an exact confidence interval for the difference between the two population means by using this R function:
> exactConfidenceIntervalForDifferenceInNormalMeans <- function(sampleMean1, sampleSd1, n1, sampleMean2, sampleSd2, n2, confidenceLevel)
   {
         alpha <- 1 - (confidenceLevel/100)
         pooledVariance <- ((n1-1)*(sampleSd1^2) + (n2-1)*(sampleSd2^2))/(n1 + n2 - 2)
         pooledSd <- sqrt(pooledVariance)
         error <- qt(1-(alpha/2),df=n1+n2-2)*pooledSd*sqrt((1/n1)+(1/n2))
         upperLimit <- (sampleMean1 - sampleMean2) + error
         lowerLimit <- (sampleMean1 - sampleMean2) - error
         print(paste(confidenceLevel,"% confidence interval: (",lowerLimit,",",upperLimit,")"))
   }
For example, say our first sample had a sample mean of 1.692 and sample standard deviation of 0.518 and sample size of 27. And say our second sample had a sample mean of 2.307 and standard deviation of 0.665 and sample size of 23. Then we can calculate an exact 95% confidence interval for the difference between the means by typing:
> exactConfidenceIntervalForDifferenceInNormalMeans(1.692, 0.518, 27, 2.307, 0.665, 23, 95)
[1] "95 % confidence interval: ( -0.951573462687863 , -0.278426537312137 )"
That is, we get an exact 95% confidence interval of (-0.952, -0.278).

An exact confidence interval for a proportion
If the number of successes or failures is small, then we need to use exact methods to calculate a confidence interval for a proportion. An exact confidence interval for a proportion can be calculated based on the binomial distribution, using the exactci() function in the PropCIs R package. For example, if the observed proportion is 0/1000, we can calculate an exact 95% confidence interval using: 
> library("PropCIs")
> exactci(0, 1000, 0.95)
95 percent confidence interval:
 0.000000000 0.003682084
This tells us that a 95% confidence interval for the proportion is (0.0000, 0.0037), rounded to four decimal places.
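
Base R's binom.test() function should give the same interval, since it also uses the Clopper-Pearson method, so you can cross-check without installing PropCIs:
> binom.test(0, 1000)$conf.int
[1] 0.000000000 0.003682084
attr(,"conf.level")
[1] 0.95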

A confidence interval for the variance of a Normal distribution
See my blog post on this topic.

An exact confidence interval for a Poisson mean (lambda parameter)
For example, if you observed 187 events over 365 days, the observed mean is 187/365 = approx. 0.512. You can calculate an exact 90% confidence interval for this Poisson mean by typing:
> library("exactci")
> poisson.exact(187, 365, conf.level=0.9)
    Exact two-sided Poisson test (central method)
90 percent confidence interval:
 0.4523004 0.5783757
That is, our confidence interval is (0.452, 0.578).

An approximate large sample confidence interval for a Poisson mean (lambda parameter)
> largeSampleConfidenceIntervalForPoissonMean <- function(observedMean, sampleSize, confidenceLevel)
   {
         alpha <- 1 - (confidenceLevel/100)
         error <- qnorm(1-(alpha/2))*sqrt(observedMean/sampleSize)
         upperLimit <- observedMean + error
         lowerLimit <- observedMean - error
         print(paste(confidenceLevel,"% confidence interval: (",lowerLimit,",",upperLimit,")"))
   }
For example, if you observed 187 events over 365 days, the observed mean is 187/365 = approx. 0.512. You can calculate an approximate 90% confidence interval by pasting the function above into R and typing:
> largeSampleConfidenceIntervalForPoissonMean(187/365, 365, 90)
[1] "90 % confidence interval: ( 0.450704013552185 , 0.57395352069439 )"
That is, our approximate 90% confidence interval is (0.451, 0.574).

Friday 8 February 2013

Using git for personal use

Using git with just a local repository
My colleague Martin Hunt gave a nice presentation yesterday on using git to keep track of different versions of your scripts. I've decided to use it straight away, so in my scripts directory I typed:
% git init
This made a subdirectory .git.
You never have to type 'git init' again from now on.

If I then type:
% git status
it tells me that all the files currently in my scripts directory are untracked files.

I can add the latest perl script that I wrote by typing:
% git add my.pl
If I type 'git status' again, it tells me that there has been one file added ('my.pl').

I can commit the file by typing:
% git commit -m "Nice perl script to do some things"
If I type 'git status' again, then 'my.pl' is no longer listed as a file that has just been added.

If I edit my.pl, I can commit the changed version by typing:
% git add my.pl
% git commit -m "Added nice subroutine"
This will commit the changed version of my.pl.
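
To look back over the history of commits that touch a file, you can then type:
% git log my.pl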

Using git with a local and a remote repository
My colleagues and I share a remote repository at Sanger for a project that we work on together. In this case, I have a local repository in my directory for that project (note to self: it is /nfs/users/nfs_a/alc/Documents/git/helminth_scripts/), and I then keep copies of my scripts in the shared remote repository too.

Anyone at Sanger can make a copy of the remote repository by 'cloning it':
% mkdir git
% cd git
% git clone git+ssh://git.internal.sanger.ac.uk/repos/git/pathogens/helminth_scripts
This will create a 'helminth_scripts' folder, with all the subdirectories.
From then on, you don't need to type 'git clone' anymore; the local repository has been created.

I can add my latest perl script to the local repository by typing:
% git add my.pl
% git commit -m "Nice perl script to do some things"

When I think the script is working and ready for others at Sanger to use, I can then push it to the remote repository. Before doing so, I need to first pull down all new versions of scripts from the remote repository, by typing:
% git pull origin master
I can then push my script my.pl to the remote repository:
% git push origin master
To make it possible for others at Sanger to use, I need to follow the steps in 'Deploying changes at Sanger' below.

Deploying changes at Sanger
This is mostly a 'note to self' to remind me how to do this. 
When I am happy that a Perl script my.pl that I have written is ready for other users to use, I can 'deploy' it at Sanger, which means that it is put in a special remote git repository for 'stable releases' of scripts, and then this latest stable version is put in the path of other Sanger users. 

To do this:
% ssh lenny-dev64
% newgrp pathdev
% cd /software/pathogen/projects/helminth_scripts
% git pull origin master # pulls the latest changes from git.internal.sanger.ac.uk
% chmod -R g+w .         # some files won't be changed as they belong to other users
% ln -s /software/pathogen/projects/helminth_scripts/scripts/my.pl /software/pathogen/internal/prod/bin/my.pl               # make /software/pathogen/internal/prod/bin/my.pl point to /software/pathogen/projects/helminth_scripts/scripts/my.pl
% ln -s  /software/pathogen/projects/helminth_scripts/modules/HelminthGenomeAnalysis/Mymodule.pm /software/pathogen/internal/prod/lib/Mymodule.pm
% exit
% exit
Note: I need to type 'exit' twice to exit lenny-dev64, because 'newgrp' opens a subshell.

For Farm3, things have changed a little:
% ssh farm3-login
% cd /software/pathogen/projects/helminth_scripts
% git pull origin master # pulls the latest changes from git.internal.sanger.ac.uk
% chmod -R g+w .         # some files won't be changed as they belong to other users
% ln -s /software/pathogen/projects/helminth_scripts/scripts/my.pl /software/pathogen/internal/prod/bin/my.pl               # make /software/pathogen/internal/prod/bin/my.pl point to /software/pathogen/projects/helminth_scripts/scripts/my.pl
% ln -s  /software/pathogen/projects/helminth_scripts/modules/HelminthGenomeAnalysis/Mymodule.pm /software/pathogen/internal/prod/lib/Mymodule.pm
% exit
Note: only one 'exit' is needed here, since 'newgrp' was not used.

Other useful git commands

To stop git from tracking a script, without deleting it from the directory:
% git rm --cached script.pl

To make git exclude some files when you are running 'git status' (eg. .swp files), you can add them to .git/info/exclude (in your main git directory). See this stackoverflow question.
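
For example, to ignore vim swap files:
% echo '*.swp' >> .git/info/exclude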

To see the version of a file as it was one commit ago (use HEAD: instead of HEAD~1: for the most recently committed version), eg. for the file plot_tree_with_domains_with_ETE.py:
% git show HEAD~1:plot_tree_with_domains_with_ETE.py


Thanks to Martin Hunt & Daria Gordon for explaining git.