Thursday 26 October 2017

CRISPresso software for analysing CRISPR-Cas9 data

I am learning to use the CRISPResso software for analysing CRISPR-Cas9 data, which was published by Pinello et al (2016). The supporting information for that paper contains a lovely user guide for the software.

I am new to CRISPR-Cas9, so here is a summary to remind me!

The two key parts of this system are:
- the Cas9 enzyme, which cuts double-stranded DNA at a specific location.
- a guide RNA (gRNA), which is about 20 bp, and is located within a longer RNA scaffold. The scaffold binds to DNA and pre-designed gRNA guides Cas9 (Cas9 binds to the gRNA) to the right part of the genome. The gRNA has bases complementary to the target DNA sequence in the genome.

The PAM (protospacer adjacent motif) is a 2-6 bp DNA sequence immediately following the DNA sequence where the Cas9 cuts the DNA (ie. immediately following the gRNA sequence). Cas9 will not cleave the DNA sequence unless it is followed by the PAM sequence. The canonical PAM is 5'-NGG-3'.

Details of the CRISPresso pipeline
The CRISPresso pipeline is described in the supporting information for the paper as including these steps:

1. Quality filtering
This allows the user to try to avoid false positives due to sequencing errors. CRISPResso allows reads to be filtered based (i) on a read's 'average quality score', defined as the average of the read's single-base quality scores, (ii) on the lowest base quality for all bases in a read. By default, the reads are not filtered.

2. Trimming
If the user provides the adaptor sequences used, this will trim them using Trimmomatic. By default, the reads are not trimmed.

3. Merging paired end reads
When paired end reads are used, it is possible to merge the reads since the amplicon sequence is usually shorter than twice the read length. This step mean that the sequence for the overlapping region is obtained with greater confidence. This step is performed using FLASH.

4. Alignment
To align the filtered reads to the reference amplicon, CRISPResso uses needle from the EMBOSS package, which performs Needleman-Wunsch alignment.

5. Quantification of mutation
HDR events: The sequence identity scores for alignment to the reference amplicon and expected HDR amplicon (if any) are considered. If the read aligns better to the expected HDR amplicon, the read is classified as HDR or mixed HDR-NHEJ. If the sequence identity is above 98%, the read is classified as HDR, otherwise as mixed HDR-NHEJ.
Unmodified versus NHEJ: Reads that align better to the reference amplicon than expected HDR amplicon are considered unmodified if they have 100% sequence identity with it. Otherwise, they are classified as NHEJ if they contain mutations in a window of w bp (w/2 bp on each side) about the gRNA's cleavage site.
Output plots: For all plots, all mutations on reads with a given classification are shown by default, even the ones outside the w bp window in the case of NHEJ that do not contribute to the classification.

Note that if there are deletion and substitutions in a read that are at the end of the reference amplicon,  CRISPResso may ignore these (for quantification and in the Alleles_frequency_table.txt output file). This is because by default CRISPResso excludes 15bp from each side of the reads for quantification of indels, and only considers an indel if it overlaps the window around the cleavage site (ie. the window of size w). The value of 15 bp can be changed by the -exclude_bp_from_left and --exclude_bp_from_right options.

Input files
CRISPResso requires as input:
1. Paired-end reads in .fastq (or fastq.gz) format
2. A reference amplicon

Other optional inputs:
1. One or more gRNA sequences can be provided to compare the predicted cleavage position(s) to the position of the observed mutations.
2. For HDR quantification, the expected amplicon sequence after HDR must also be provided.

First on the Sanger farm, I need to set some environmental variables to run CRISPresso:
% export PYTHONPATH=/nfs/team87/farm3_lims2_vms/software/python_local/lib/python/
% export PATH=/nfs/team87/farm3_lims2_vms/software/crispresso_dependencies/bin:$PATH

A simple job to run CRISPresso:
% /nfs/team87/farm3_lims2_vms/software/python_local/bin/CRISPResso -w 50 --hide_mutations_outside_window_NHEJ -o S1_expA -r1 Homo-sapiens_S1_L001_R1_001.fastq -a GAAAGTCCGGAAAGACAAGGAAGgaacacctccacttacaaaagaagataagacagttgtcagacaaagccctcgaaggattaagccagttaggattattccttcttcaaaaaggacagatgcaaccattgctaagcaactcttacagag -g CTCGAAGGATTAAGCCAGTT
/nfs/team87/farm3_lims2_vms/software/python_local/bin/CRISPResso is where CRISPResso is installed,
-w specifies the window(s) in bp around each sgRNA to quantify the indels. The window is centred on the predicted cleavage site specified by each sgRNA. By default this is 1 bp on the CRISPresso website,
--hide_mutations_outside_window_NHEJ allows the user to visualise only the mutations overlapping the window around the cleavage site and used to classify a read as NHEJ. By default this is set to False, and all mutations are visualised including those that do not overlap the window, even though they are not used to classify a read as NHEJ.
-o S1_expA specifies the output folder to use for the analysis (by default this is the current folder),
-r1 Homo-sapiens_S1_L001_R1_001.fastq specifies the first fastq file,
-a GAAAGTCCGGAAAGACAAGGAAGgaacacctccacttacaaaagaagataagacagttgtcagacaaagccctcgaaggattaagccagttaggattattccttcttcaaaaaggacagatgcaaccattgctaagcaactcttacagag specifies the amplicon sequence,
-g  CTCGAAGGATTAAGCCAGTT specifies the sgRNA sequence. If more than one sequence is entered, they can be separated by commas.

Memory (RAM) and run-time requirements
On the Sanger compute farm,  a colleague recommended to use 2000 Mbyte of RAM (in bsub: "-M2000 -R"select[mem>2000] rusage[mem=2000]")
I found that this ran fine for me, it actually used a maximum memory of ~300 Mbyte in my case, where I had about 550,000 forward reads and 550,000 reverse reads.  It took 8 minutes to run.

Script to submit CRISPResso jobs to the Sanger farm
I also wrote a Python script to submit CRISPresso jobs to the Sanger farm, for lots of jobs. To run it I typed:
% export PYTHONPATH=/nfs/team87/farm3_lims2_vms/software/python_local/lib/python/% export PATH=/nfs/team87/farm3_lims2_vms/software/crispresso_dependencies/bin:$PATH
% python3 amplicon_files guide_rna_files fastq_files/ 23528_1

Output files:
This gives the information on the CRISPresso run.

This looks like this:
Aligned_Sequence        NHEJ    UNMODIFIED      HDR     n_deleted       n_inserted      n_mutated       #Reads  %Reads

This is my interpretation: 
- the first column 'Aligned_sequence' gives an aligned sequence present in some reads, 
- the second column 'NHEJ' says whether reads with that sequence were classified as NHEJ reads, 
- the third column 'UNMODIFIED' says whether those reads had any change (substitution or indel), 
- the fourth column 'HDR' says whether they were classified as HDR reads, 
- the fifth column 'n_deleted' - I think this is the number of deleted bases in reads with this sequence (???), 
- the sixth column 'n_inserted' - I think this is the number of inserted bases in reads with this sequence (???), 
- the seventh column 'n_mutated' - I think this is the number of mutated (substituted) bases in reads with this sequence (???), 
- the eighth column '#Reads' says how many reads had that sequence, 
- and the ninth column '%Reads' says what % of reads had that sequence.

Note: by default, this file only includes substitutions within the window of size w, and deletions that overlap that window of size w and indels that are at least 15 bp from the end of the reference amplicon.
I think this is just a temporary file produced by the program (???).

This seems to give the data for a histogram of deletion sizes, I'm guessing the first column is the deletion size in bp, and the second size is the %reads with a deletion of that size (???):
del_size        fq
0       53175
-1      4
-2      3
-3      0
-4      0
-5      0
-6      0
-7      0
-8      0
-9      0
-10     0
-11     0
-12     0
-13     0

The supporting information for the paper says it is the data used to generate the deletion histogram in figure 3 in the output report.

This looks like this:
# amplicon position     effect
1       4.964085592869768027e-01
2       2.519649505471776019e-01
3       2.450077093753525670e+00
4       1.147004625625211577e-01
5       2.425632732879545728e-01

According to the supporting information for the paper, this gives the location of mutations (deletions, insertions, substitutions) with respect to the reference amplicon. I'm guessing that the first column is the position in the reference amplicon, and the second column is the frequency of mutations at that position.

In an example I looked at, column 1 goes from 1...172, which I guess is the size of the amplicon (PCR product).

According to the supporting information for the paper, this gives the location of deletions.

According to the supporting information for the paper, this gives the location of insertions.

According to the supporting information for the paper, this gives the location of substitutions.

The supporting information for the paper says it is the data used to generate figure 1 in the output report.

The supporting information for the paper says it is the data used to generate the deletion histogram in figure 3 in the output report.

This seems to have the mapping information for a particular sample, eg.

Note that in our case we had several different experiments (gRNA and amplicon combinations) sequenced in the same sequencing sample, so only some of the reads aligned (here 53182) to one of the amplicons.
The CRISPResso github page says the 'READS AFTER PREPROCESSING' is the number of reads left after merging reads (from paired-end read-pairs) and/or filtering based on base quality.

This looks like this:
# amplicon position     effect
1       0.000000000000000000e+00
2       0.000000000000000000e+00
3       0.000000000000000000e+00
4       0.000000000000000000e+00

The supporting information for the paper says it has the average length of the deletions for each position.

The supporting information for the paper says it has the average length of the insertions for each position.

This looks like this:
Quantification of editing frequency:
        - Unmodified:1641 reads
        - NHEJ:51541 reads (4 reads with insertions, 7 reads with deletions, 51540 reads with substitutions)
        - HDR:0 reads (0 reads with insertions, 0 reads with deletions, 0 reads with substitutions)
        - Mixed HDR-NHEJ:0 reads (0 reads with insertions, 0 reads with deletions, 0 reads with substitutions)

Total Aligned:53182 reads

I think this is just a temporary file produced by the program (???).

The supporting information for the paper says it has the information used to make figure 3 in the output report. It looks like this:
sub_size        fq
0       1642
1       47881
2       3408
3       189
4       43
5       10
6       5
7       2
8       0
9       0
10      1
11      1
12      0
13      0

This seems to have the substitution size (column 1) and frequency (column 2).
In Figure 3, it shows the x-axis of the right hand graph to be 'number of positions substituted', so I think this must be what is called the 'substitution size' (sub_size) in this file (???).

This is a histogram, which looks something like this:

This is another histogram, which shows the y-axis as the % of sequences with a certain indel size, rather than the number of sequences (as in 1a above). It looks like this:

This is a pie chart, which shows the %reads with a modification (insertion, or deletion, or substitution):

This looks like this:

This looks like this:





Note: this picture does not seem to just include indels that overlap the window of size w around the Cas9 cut site. However, I think that this plot is excluding indels that occur in the last 15 bp of the reference amplicon.

Thank you for help
Patrick Driguez
Peter Keen

Viewing pdfs on linux

I wanted to find a pdf viewer that I can use on the Sanger compute farm, and found a nice discussion about pdf viewers for linux. In the end the one that worked for me was evince:
% evince file.pdf

Friday 6 October 2017

OpenBabel for beginners

The software Open Babel lets you convert between lots of chemistry file formats, and do lots of other cool things like drawing pictures of compounds.

Here are some simple things I've learnt:

Drawing a picture of a molecule:
% ~jc17/software/openbabel/bin/obabel -isdf Structure2D_CID_9812710.sdf -xu -xd -xC -m -O ivermectin.svg
where Open Babel is installed in ~jc17/software/openbabel/bin/obabel
For some reason this makes you an output file called ivermectin1.svg


Converting an SDF format file to SMILES format:
% ~jc17/software/openbabel/bin/obabel tmp.sdf -O tmp.smi
To get the ids in the output file:
% obabel NNN.sdf -O NNN.smi --append IDNUMBER

To kekulize a SMILES string:
Run Open Babel version 3.0 with "-osmi -xk".

Thanks to:
Noel O'Boyle
James Cotton

Note afterwards: Noel said normally molecules are drawn without showing all the hydrogens.

Thursday 5 October 2017

Some basic immunology

I'm trying to learn some basic immunology, so here are some notes to help me remember... (Thanks to wikipedia for helping me understand the things.)

Basophils: a type of granulocyte (a type of white blood cell). They are responsible for inflammatory reactions during immune response.

Classical complement pathway: one of three complement pathways that can activate the complement system, which clears damaged cells and microbes from an organism by promoting inflammation.

Cytokines: substances such as IFN_gamma (interferon gamma), TNF, IL-4, IL-10 that are secreted by certain cells of the immune system and have an effect on other cells (eg. on macrophages).

Danger-associated molecular pattern (DAMP): host molecules that can initiate and perpetuate a noninfectious inflammatory response. Examples are purine metabolites eg. ATP, adenosine. ATP and adenosine are released in high concentration after severe disruption of the cell.

Dendritic cells: antigen-presenting cells of the mammalian immune system. They process antigen material and present it on their cell surface to the T cells of the immune system. Dendritic cells can produce cytokines that send T cells towards a Th1 or Th2 response. Note that monocytes can differentiate into dendritic cells.

Histamine: an organic nitrogenous compound involved in local immune response; as part of an immune response to foreign pathogens, histamine is produced by basophils and by mast cells found in nearby connective tissues. Histamine increases permeabililty of capillaries to white blood cells.

Interferon gamma: a soluble cytokine critical for innate and adaptive immunity against viral, some bacterial and protozoal infections. An important activator of macrophages.

Lectins (also known as lectin receptors): carbohydrate-binding proteins. Within the innate immune system, lectins such as MBL (mannose binding lectin) help mediate defense against invading microbes. Other lectins likely are involved in modulating inflammatory processes. Here is a review I came across on the role of a type of lectins called galectins in the animal immune system.

Lectin pathway: a type of cascade reaction in the complement system, that after activation, produces activated complement proteins further down the cascade. In contrast to the classical complement pathway, the lectin pathway does not recognise an antibody bound to its target, but rather starts with MBL (mannose binding protein) or ficulin bound to certain sugars.

Lymphocytes: a subtype of white blood cell. There are many types including CD4+ T cells, CD8+ T cells, NK cells, NKT cells, B cells, etc.

Macrophages: a type of white blood cell that engulfs and digests foreign substances, cancer cells, microbes, etc. Macrophages break down these substances and present the smaller proteins to the T lymphocytes.

Mast cells: a type of granulocyte (a type of white blood cell). Contains many granules rich in histamine and heparin. Involved in wound healing, response to pathogens, etc.

Monoclonal antibodies: antibodies made by identical immune cells that are all clones of the same parent cell, and that all bind to the same epitope of the antigen.

Mononuclear phagocyte system (MGS) (also known as macrophage system): part of the immune system that consists of phagocytic cells located in reticular connective tissue. The cells are mostly macrophages and monocytes and they accumulate in the lymph nodes and in the spleen.

Spleen: Removes old red blood cells and holds a reserve of blood. Synthesises antibodies in its white pulp. Also has role in cell-mediated immunity.

T cell (T lymphocyte): a type of lymphocyte that plays a key role in cell-mediated immunity. They hunt down and destroy cells that are infected with microbes or that have become cancerous.

Th1 response: Th1 T cells tend to generate immune responses against intracellular parasites such as bacteria and viruses.

Th2 response: Th2 T cells generate immune responses against helminths and other extracellular parasites. Dendritic cells (which sense helminth antigens) are thought to play a dominant role in initiation of Th2 response. To do this they are thought to somehow cause granulocytes (like basophils, eosinophils, mast cells) to produce Th2-associated cytokines like IL4. The interleukin 4 (IL4) is a cytokine that induces differentiation of naive helper T cells (Th0 cells) to Th2 cells.

Introduction to HMMs in Bioinformatics

I was looking recently for my slides on Introduction to HMMs in Bioinformatics, and after some searching, found I'd put them on slideshare. Isn't the web great for woolly memory.