Friday 22 December 2017

Finding all occurrences of a subsequence in a sequence

I wanted to find all the occurrences of a subsequence in a genome assembly. To do this, I first tried using BLAT but it didn't find them for me (not sure why).

So I instead wrote a little Python function to print out all the positions of a subsequence in a sequence:

#====================================================================#

# find the positions of a subsequence in a sequence:

def find_positions_of_subsequence(seq, subsequence, seqname):
    """Print the (1-based) start positions of all occurrences of 'subsequence' in 'seq'."""

    still_searching = True
    start = 0
    end = len(seq) # note: str.find() treats 'end' as exclusive, so use len(seq) rather than len(seq) - 1
    while still_searching:
        position = seq.find(subsequence, start, end)
        if position == -1:
            still_searching = False
        else:
            actual_position = position + 1 # convert from Python's 0-based position to a 1-based position
            format_string = "Found at %d in %s" % (actual_position, seqname)
            print(format_string)
            start = position + 1 # continue searching just after this match, so overlapping matches are also found

    return

#====================================================================#
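
For example, calling it on a made-up sequence (hypothetical names, just to show the usage):

myseq = "ACGTACGTACGT"
find_positions_of_subsequence(myseq, "ACGT", "myseq")
# Found at 1 in myseq
# Found at 5 in myseq
# Found at 9 in myseq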


Python saves the day!

Monday 4 December 2017

Trimming adapter sequences

I wanted to trim adapter sequences, specifically the sequence 'GTTTTAGGTC'

Attempt 1: Trimmomatic
I first tried Trimmomatic, giving it a fasta file with the forward and reverse complements of my adapter sequences, and the 'ILLUMINACLIP:/lustre/scratch118/infgen/team133/alc/000_FUGI_PatrickCRISPR/OmegaGenewhiz/adaptor.fa:2:40:0' option. However, I found that this just removes the reads containing adapter; it doesn't trim them. This is also described here.

Attempt 2: Cutadapt
I then tried cutadapt. With the following fasta file:
>seq1
GTTTTAGGTCGTTATCGTGTA
>seq2
TACACGATAACGACCTAAAAC 


I was able to trim adapters using:
% cutadapt -g GTTTTAGGTC my.fasta -a GACCTAAAAC
where -g cuts sequences off the 5' end, and -a off the 3' end. 
By default it cuts off the adapter allowing at most 10% errors (in single-end read mode).

This gave me:
>seq1
GTTATCGTGTA
>seq2
TACACGATAAC
Hurray!

Something strange I noticed about cutadapt:
I have been using cutadapt to trim some adaptors, and noticed that it totally removed some of my sequences.

For example, with this fastq input: (in tmp.fastq)
@M03558:259:000000000-BH588:1:1101:4913:6565 2:N:0:TTTGTA
ACTGACCCTCAGCAATCTTAAACTTCTTAGACGAATCACCAGAACGGAAAACATCCTTCATAGAAATTTCACGCGGCGGCAAGTTGCCATACAAAACAGGGTCGCCAGCAATATCGGTATAAGTCAAAGCACCTTTAGCGTTAAGGTACTGAATCTCTTTAGTCGCAGTAGGCGGAAAACGAACAAGCGCAAGAGTAAACATAGTGCCATGCTCAGGAACAAAGAAACGCGGCACAGAATGTTTATAGGTC
+
AAAAAFFFFCFCGGGGGGGGGGHHHHHHGFHHGEEGEHHFHHFFHGEGFGHFHHHHHHHHHHGHFHHHHHHHEHGCFGGGGGGHHHHHHHHHHHHHGHGHGHHGGGGGGHHGFHEGGCEFFHHGGGHHHEFGHEHHHGGGAE?EHHGDHHFGHHHHHEHHHH:C.A@DCGHBGA-?BGGGCGEGCFFFFFFFFEF;FFFEF/B/:BFEF;/:;BB;BFFFFFF/BFFFAAFA-;-B.FFBF;/BFBF/9/:


I then ran cutadapt like this:
% cutadapt -g GTTTTAGGTC -a GACCTAAAAC tmp.fastq
and it removed the whole input sequence.
I can't see any matches to the adaptors in that input though. When I ran blastn between that read and the NCBI database, the sequence was found to be PhiX174. Is that why cutadapt removed it, or was there some other reason?

Update: 7-Dec-2017: I sent an email about this to the cutadapt developer, Marcel Martin, and received a very helpful reply:


In short: The 5' adapter is found at the very end (3') of the read with one 
error.

From the statistics report that cutadapt prints, you can see that it finds 
the 5' adapter with one error. And if you run the above command without 
specifying the 3' adapter with -a GACCTAAAAC, you get the same result, so 
it must be the 5' adapter. When cutadapt finds a 5' adapter, it removes 
the adapter sequence itself and everything preceding it. And in your case, 
the read ends with 'GTTTATAGGTC', which is the 5' adapter sequence with an 
extra 'A' inserted between one of the 'T' nucleotides.
Although cutadapt doesn’t print it, the alignment likely looks like this:

...GTTTATAGGTC (read end)
   GTTT-TAGGTC (adapter)

This is possibly not the result you were expecting, but cutadapt is doing 
what it is supposed to.

You have a few options:
- Accept this result
- If you think allowing insertions is the problem, use --no-indels to
disallow insertions and deletions.
- If you want to prevent the 5' adapter from occurring within the read at 
all, you can specify the adapter as -g XXGTTTTAGGTC. The 'X' is always 
counted as a mismatch, so if you have more X than allowed mismatches, 
the adapter is forced to occur at the 5' end (overlapping it).
 

Monday 27 November 2017

Fastq format

I'm looking at a fastq file that has header lines like this and want to figure out what they mean:

@M03558:259:000000000-BH588:1:1101:16110:1341 2:N:0:NTTGTA
@M03558:259:000000000-BH588:1:1101:16089:1342 2:N:0:NTTGTA

@M03558:259:000000000-BH588:1:1101:15471:1344 2:N:0:NTTGTA
@M03558:259:000000000-BH588:1:1101:15455:1333 2:N:0:NTTGTA

@M03558:259:000000000-BH588:1:1101:14580:1411 2:N:0:CTTGTA
...


Luckily I found a nice wikipedia description of FASTQ, which tells me that probably:
M03558 = the unique instrument name
259 = the run id
000000000-BH588 = the flow cell id.
1 = the flow cell lane
1101 = the tile number within the flow cell lane (I get 28 different values for this)
16110 = x-coordinate of the cluster within the tile
1341 = y-coordinate of the cluster within the tile
2 = member of a pair, ie. 1 or 2 (for paired-end reads). (I only see '2' here, so I think this fastq file contains the second read of each read-pair)
N = means the read is not filtered (I only see 'N' here)
0 = this will be '0' when none of the control bits are on (I only see '0' here)
NTTGTA = this is the index sequence. I see several different index sequences, with different frequencies.
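
Based on these field meanings, here is a little sketch of my own (not from wikipedia) of pulling the fields out of one of the header lines above in Python:

header = "@M03558:259:000000000-BH588:1:1101:16110:1341 2:N:0:NTTGTA"
left, right = header.lstrip("@").split(" ")
instrument, run_id, flowcell_id, lane, tile, x, y = left.split(":")
member_of_pair, is_filtered, control_bits, index_seq = right.split(":")
print(instrument, run_id, flowcell_id, lane, tile, x, y)
print(member_of_pair, is_filtered, control_bits, index_seq)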
 
Thanks wikipedia!

 

Friday 3 November 2017

A stacked barplot in R

I wanted to make a stacked barplot in R. My input data looked like this:

plate barcode reads
1 1 3232
1 2 32232
1 4 23232
2 1 23322
2 2 2323
2 3 4343
2 4 23432

I wanted to make a barplot showing the 96 barcodes adjacent to each other, and for each barcode, a stack showing the number of reads for plate 1, 2, 3.

Getting the data into R (painful!)
The problem was getting my data into R. The input data did not have values for every plate-barcode combination, but I wanted to assume a value of 0 for combinations that were not in the input file. In the end I had to write some code to squeeze the data into R:

# input data has columns plate, barcode, number of input reads
MyData <- read.table("reads_in_inputs",header=TRUE)

plate <- MyData$plate
barcode <- MyData$barcode
reads <- MyData$reads

# put the input data in a matrix, for use in the barplot() command.
# The matrix will have three rows (plate 1,2,3) and 96 columns (barcode 1..96):
mymatrix <- matrix(, nrow=3, ncol=96)

for (platenum in 1:3)
{
   for (barcodenum in 1:96)
   {
      # find the index (if any) in vector 'reads' for plate 'platenum' and barcode 'barcodenum'.
      value <- intersect(which(plate==platenum),which(barcode==barcodenum))
      if (length(value) > 0)
      {
         # get the number of reads for plate 'platenum' and barcode 'barcodenum' from vector reads:
         mymatrix[platenum,barcodenum] <- reads[value] / 1e+3 # in thousands of reads
      }
      else
      {
         mymatrix[platenum,barcodenum] <- 0
      }
   }
}


Plotting the data in R (ok!)
Plotting the data was not so hard. I used the example from http://www.r-graph-gallery.com/ to make a stacked barplot:
colnames <- seq(1,96)
rownames <- seq(1,3)
colnames(mymatrix) <- colnames
rownames(mymatrix) <- rownames

barplot(mymatrix, col=colors()[c(23,89,12)], border="white", space=0.04, font.axis=2, xlab="barcode", ylab="thousands of input reads", legend=rownames(mymatrix)) 

A little bit of the plot: (stacked barplot image not reproduced here)

Some other little tricks I learnt:
To put some space around the plot I can type before the 'barplot' command:
par( mar=c(8, 4.7, 2.3, 0)) # last value is space on RHS, second last value is space at top, 2nd value is space on LHS, 1st value is space below  

In the barplot command itself:
border="white" : use white for the border of the bars
space=0.04 : leaves space before each bar
cex.names=0.5 : makes the x-axis labels smaller
las=3 : makes the labels perpendicular to the axis

Multivariate hypergeometric distribution in R

A hypergeometric distribution can be used where you are sampling coloured balls from an urn without replacement.

A univariate hypergeometric distribution can be used when there are two colours of balls in the urn, and a multivariate hypergeometric distribution can be used when there are more than two colours of balls.

The multivariate hypergeometric distribution can be used to ask questions such as: what is the probability that, if I have 80 distinct colours of balls in the urn, and sample 100 balls from the urn without replacement, I will have at least one ball of each colour?

I recently had a similar problem in bioinformatics: if there are 6329 distinct copies of a gene in our Eppendorf tube, and (after some PCRs) we sequence 10,000 distinct reads, what is the probability that we sequenced at least one read from each of the 6329 distinct copies of the gene? In other words, if there are 6329 distinct colours of balls in an urn, and we pick out 10,000 balls (without replacement), what is the probability that we chose at least one ball of each colour?

In R the dhyper function could be used in the case where there are two colours of ball in an urn, that is a univariate hypergeometric distribution. But what about where there are 6329 colours of ball, that is, a multivariate hypergeometric distribution? Then I came across the extraDistr R package, which has a multivariate hypergeometric distribution. The function is called "MultiHypergeometric".
I installed and loaded this library in R using:
> install.packages("extraDistr")
> library("extraDistr")

If two cards are drawn from a deck of 52 cards, what is the probability that one is a spade and one is a heart? 
I found a nice example of using the MultiHypergeometric distribution here by Jonathan Fivelsdal, thank you! He asked the question: if two cards are drawn from a deck of 52 cards, what is the probability that one is a spade and one is a heart? In R:
> dmvhyper(x = c('heart' = 1, 'spade' = 1, 'other' = 0), n = c('heart' = 13, 'spade' = 13, 'other' = 26), k=2)
[1] 0.127451
Here x is a vector that has the frequency of hearts and spades and other cards in our sample of interest,
n is a vector that has the frequency of hearts and spades and other cards in the deck of cards,
k is the number of cards in our sample of interest. 
[Note: the manual for this package says k is 'the number of balls drawn from the urn'.]
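
As a sanity check (my own, not from the original post): there are 13 × 13 = 169 ways to choose one heart and one spade, out of choose(52,2) = 1326 possible two-card hands, and 169/1326 = 0.127451, which matches the dmvhyper answer above.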

Given a huge bag containing scrabble letters, if we know the number of each letter tile in the bag, what is the probability of drawing a particular word of length N if we select N letter tiles without replacement?

This is another nice example that I found here by Herb Susmann.
The answer is that we can do something like this:
> dmvhyper(x = letterfreqs, n = freqs, k = sum(letterfreqs))
where x = letterfreqs is a vector of length 26, with the frequency of each letter in the word of interest,
n is a vector of length 26 with the frequency of each letter tile in the bag,
k is the number of letters in the word of interest (of length N), ie. it is N.

If I have 6329*2=12,658 balls of 6329 distinct colours in an urn (two of each colour), and pick out 10,000 balls (without replacement), what is the probability that I chose at least one ball of each colour?
> dmvhyper(x = rep(1, 6329), n = rep(2, 6329), k = 10000, log = FALSE)
[1] 0
x is an m-column matrix of quantiles. From the example above by Jonathan Fivelsdal using a deck of cards, I think that this gives the counts of each of the colours of ball in the sample that you're asking about. In our case we are asking about having at least one ball of each colour.
The vector n is an m-length vector and has the number of balls in each of the m different colours. In our case m is 6329, so it is a 6329-length vector. Here we just assume we have 2 balls of each colour in the urn.
In this case k is the number of balls drawn from the urn, which is 10,000 in our case.

Notes to myself:
- dmvhyper gives the probability of drawing exactly the counts in x, and x should sum to k; here x sums to 6329 rather than 10,000, which is probably why the answer above is 0, so that call doesn't really answer the 'at least one ball of each colour' question.
- in the case of our sequencing problem, we need to take into account the number of sequences in the urn, ie. the total number of sequences after the PCR.
- is sequencing molecules a sampling without replacement problem? Need to check...

Thursday 2 November 2017

'Learn Python Visually' book

I am doing some Python study, and found the book Learn Python Visually by Ivelin Demirov in my work library.

It has the subtitle 'An Accelerated Learning Method Which Uses Science and Creativity to Teach Right-Brained Non-Coders', which is intriguing. Also, the development of the book was funded by Kickstarter, which is cool to hear!

I'm thinking to make here some notes on the new things that I learnt from this book:

p. 14. Numbers are immutable. This page explains about numbers being immutable in Python. I always find this concept really confusing. The book gave me a little more information on it, but then I also found a nice article on it on the web.
   There it is explained that when an object is created, it is assigned a unique object id. The state of the object can be changed later if the object is of a mutable type. That is, the state of a mutable object can be changed after it is created, but the state of an immutable object can't be.
   The book also tells about the id() function, which gives the identity of an object as an integer, and the 'is' operator, which compares the identities of two objects.
   Example:
>>> b = 11
>>> a = b
>>> c = 12 
>>> d = 11
>>> id(a)
9084448
>>> id(b)
9084448
>>> id(c)
9084480
>>> id(d)
9084448
>>> a is b
True
>>> a is c
False
>>> a is d
True
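
To contrast, here is a small example of my own (not from the book) showing that a mutable object such as a list keeps the same identity when its state changes, whereas 'changing' a number really rebinds the name to a different object:

>>> mylist = [1, 2, 3]
>>> id_before = id(mylist)
>>> mylist.append(4)        # changes the state of the same list object
>>> id(mylist) == id_before
True
>>> n = 11
>>> id_before = id(n)
>>> n += 1                  # rebinds 'n' to a different int object
>>> id(n) == id_before
False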

p. 15 Assigning the same value to multiple variables at once. I didn't know you can assign a value to multiple variables at once:
>>> a = b = c = d = "Hello"

p. 15 Special words that can't be used as variable names. The book gives a list of special words that can't be used as variable names, I also found them on the web here.

p. 19 Using multi-line strings to store large text blocks, by enclosing them with triple quotes:
An example is:
>>> huidobro = """Al horitaña de la montazonte
La violondrina y el goloncelo
Descolgada esta mañana de la lunala
Se acerca a todo galope

""" 

p. 21 Short forms of arithmetic operators.
I knew that it was possible to do:
>>> a += 1
but it turns out that you can also do:
>>> a -= 1
>>> a *= 1
>>> a /= 2 
etc.

p. 28 Precedence of arithmetic operators. 
Something I didn't know was that exponentiation has higher precedence than unary minus, which means that we get different answers for:
>>> abs(-2 ** 2 - 20)
24
which first finds 2**2 = 4, and then finds abs(-4 - 20), and
>>> abs((-2) ** 2 - 20)
16
which first finds (-2)**2 = 4, and then finds abs(4 - 20).

Thursday 26 October 2017

CRISPresso software for analysing CRISPR-Cas9 data

I am learning to use the CRISPResso software for analysing CRISPR-Cas9 data, which was published by Pinello et al (2016). The supporting information for that paper contains a lovely user guide for the software.

CRISPR-Cas9
I am new to CRISPR-Cas9, so here is a summary to remind me!

The two key parts of this system are:
- the Cas9 enzyme, which cuts double-stranded DNA at a specific location.
- a guide RNA (gRNA), which is about 20 bp long and sits within a longer RNA scaffold. The scaffold binds to Cas9, and the pre-designed gRNA guides Cas9 to the right part of the genome, because the gRNA has bases complementary to the target DNA sequence in the genome.

The PAM (protospacer adjacent motif) is a 2-6 bp DNA sequence immediately following the DNA sequence where the Cas9 cuts the DNA (ie. immediately following the gRNA sequence). Cas9 will not cleave the DNA sequence unless it is followed by the PAM sequence. The canonical PAM is 5'-NGG-3'.

Details of the CRISPresso pipeline
The CRISPresso pipeline is described in the supporting information for the paper as including these steps:

1. Quality filtering
This allows the user to try to avoid false positives due to sequencing errors. CRISPResso allows reads to be filtered based on (i) a read's 'average quality score', defined as the average of the read's single-base quality scores, or (ii) the lowest base quality across all bases in a read. By default, the reads are not filtered.


2. Trimming
If the user provides the adaptor sequences used, this will trim them using Trimmomatic. By default, the reads are not trimmed.

3. Merging paired end reads
When paired-end reads are used, it is possible to merge the reads, since the amplicon sequence is usually shorter than twice the read length. This step means that the sequence for the overlapping region is obtained with greater confidence. This step is performed using FLASH.

4. Alignment
To align the filtered reads to the reference amplicon, CRISPResso uses needle from the EMBOSS package, which performs Needleman-Wunsch alignment.

5. Quantification of mutation
HDR events: The sequence identity scores for alignment to the reference amplicon and expected HDR amplicon (if any) are considered. If the read aligns better to the expected HDR amplicon, the read is classified as HDR or mixed HDR-NHEJ. If the sequence identity is above 98%, the read is classified as HDR, otherwise as mixed HDR-NHEJ.
Unmodified versus NHEJ: Reads that align better to the reference amplicon than expected HDR amplicon are considered unmodified if they have 100% sequence identity with it. Otherwise, they are classified as NHEJ if they contain mutations in a window of w bp (w/2 bp on each side) about the gRNA's cleavage site.
Output plots: For all plots, all mutations on reads with a given classification are shown by default, even the ones outside the w bp window in the case of NHEJ that do not contribute to the classification.

Note that if there are deletions and substitutions in a read that are at the end of the reference amplicon, CRISPResso may ignore these (for quantification and in the Alleles_frequency_table.txt output file). This is because by default CRISPResso excludes 15 bp from each side of the reads for quantification of indels, and only considers an indel if it overlaps the window around the cleavage site (ie. the window of size w). The value of 15 bp can be changed with the --exclude_bp_from_left and --exclude_bp_from_right options.
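
As a note to myself, here is a toy Python sketch of the read classification logic as I understand it from the description above (this is just my reading of the paper, not CRISPResso's actual code, and the inputs are simplified):

def classify_read(identity_to_ref, identity_to_hdr, has_mutation_in_window):
    # identity_to_ref / identity_to_hdr: % sequence identity of the aligned read to the
    # reference amplicon and to the expected HDR amplicon (None if no HDR amplicon given).
    # has_mutation_in_window: True if the read has a mutation within the window of w bp
    # around the predicted cleavage site.
    if identity_to_hdr is not None and identity_to_hdr > identity_to_ref:
        return "HDR" if identity_to_hdr >= 98.0 else "Mixed HDR-NHEJ"
    if identity_to_ref == 100.0:
        return "Unmodified"
    # mutations that fall only outside the window are ignored for classification (see note above):
    return "NHEJ" if has_mutation_in_window else "Unmodified"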

Input files
CRISPResso requires as input:
1. Paired-end reads in .fastq (or fastq.gz) format
2. A reference amplicon

Other optional inputs:
1. One or more gRNA sequences can be provided to compare the predicted cleavage position(s) to the position of the observed mutations.
2. For HDR quantification, the expected amplicon sequence after HDR must also be provided.

Command-line
First on the Sanger farm, I need to set some environmental variables to run CRISPresso:
% export PYTHONPATH=/nfs/team87/farm3_lims2_vms/software/python_local/lib/python/
% export PATH=/nfs/team87/farm3_lims2_vms/software/crispresso_dependencies/bin:$PATH

A simple job to run CRISPresso:
% /nfs/team87/farm3_lims2_vms/software/python_local/bin/CRISPResso -w 50 --hide_mutations_outside_window_NHEJ -o S1_expA -r1 Homo-sapiens_S1_L001_R1_001.fastq -a GAAAGTCCGGAAAGACAAGGAAGgaacacctccacttacaaaagaagataagacagttgtcagacaaagccctcgaaggattaagccagttaggattattccttcttcaaaaaggacagatgcaaccattgctaagcaactcttacagag -g CTCGAAGGATTAAGCCAGTT
where
/nfs/team87/farm3_lims2_vms/software/python_local/bin/CRISPResso is where CRISPResso is installed,
-w specifies the window (in bp) around each sgRNA in which to quantify the indels. The window is centred on the predicted cleavage site specified by each sgRNA. By default this is 1 bp, according to the CRISPResso documentation,
--hide_mutations_outside_window_NHEJ allows the user to visualise only the mutations overlapping the window around the cleavage site and used to classify a read as NHEJ. By default this is set to False, and all mutations are visualised including those that do not overlap the window, even though they are not used to classify a read as NHEJ.
-o S1_expA specifies the output folder to use for the analysis (by default this is the current folder),
-r1 Homo-sapiens_S1_L001_R1_001.fastq specifies the first fastq file,
-a GAAAGTCCGGAAAGACAAGGAAGgaacacctccacttacaaaagaagataagacagttgtcagacaaagccctcgaaggattaagccagttaggattattccttcttcaaaaaggacagatgcaaccattgctaagcaactcttacagag specifies the amplicon sequence,
-g  CTCGAAGGATTAAGCCAGTT specifies the sgRNA sequence. If more than one sequence is entered, they can be separated by commas.

Memory (RAM) and run-time requirements
On the Sanger compute farm, a colleague recommended using 2000 Mbyte of RAM (in bsub: -M2000 -R "select[mem>2000] rusage[mem=2000]").
I found that this ran fine for me: it actually used a maximum of ~300 Mbyte of memory in my case, where I had about 550,000 forward reads and 550,000 reverse reads, and it took 8 minutes to run.

Script to submit CRISPResso jobs to the Sanger farm
I also wrote a Python script to submit CRISPresso jobs to the Sanger farm, for lots of jobs. To run it I typed:
% export PYTHONPATH=/nfs/team87/farm3_lims2_vms/software/python_local/lib/python/
% export PATH=/nfs/team87/farm3_lims2_vms/software/crispresso_dependencies/bin:$PATH
% python3 submit_crispresso_jobs.py amplicon_files guide_rna_files fastq_files/ 23528_1
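
My actual script is specific to our directory layout, but a minimal sketch of the idea (hypothetical sample table, abbreviated amplicon, using the bsub memory options mentioned above) is something like this:

import subprocess

CRISPRESSO = "/nfs/team87/farm3_lims2_vms/software/python_local/bin/CRISPResso"

# hypothetical list of samples: (name, fastq file, amplicon sequence, guide RNA sequence)
samples = [
    ("S1_expA", "Homo-sapiens_S1_L001_R1_001.fastq", "GAAAGTCCGGAAAGACAAGG", "CTCGAAGGATTAAGCCAGTT"),
]

for name, fastq, amplicon, guide in samples:
    crispresso_cmd = "%s -w 50 -o %s -r1 %s -a %s -g %s" % (CRISPRESSO, name, fastq, amplicon, guide)
    bsub_cmd = ["bsub", "-o", name + ".o", "-e", name + ".e",
                "-M2000", "-R", "select[mem>2000] rusage[mem=2000]"] + crispresso_cmd.split()
    subprocess.check_call(bsub_cmd) # submit one CRISPResso job per sample to the farm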

Output files:
CRISPResso_RUNNING_LOG.txt  
This gives the information on the CRISPresso run.

Alleles_frequency_table.txt
This looks like this:
Aligned_Sequence        NHEJ    UNMODIFIED      HDR     n_deleted       n_inserted      n_mutated       #Reads  %Reads
TCGTCAATCGACTAATTCATCTCAACAACAAACGGAAAAGGAAGCAATGTCAGCTAATTCGATGTTTCTTATTGCCGTATTGTCATACACATTGATAAGTCAATTGGGGATAACTACATCGGATTCATGCAAATATTGTCTACAATTGTACGATGAAACGTATGAGAGGGGT    True    False   False   0.0     0.0     1       29716      55.876048287
TCGTCAATCGACTAATTCATCTCAACAACAAACGGAAAAAGAAGGAATGTCAGCTAATTCGATGTTTATTATTGCCGTATTGTCATACACATTGATAAGTCAATTGGGGATAACTACATCGGATTCATGCAAATATTGTCTACAATTGTACGATGAAACGTATGAGAGGGGT    True    False   False   0.0     0.0     1       2110       3.96750780339

...
This is my interpretation: 
- the first column 'Aligned_sequence' gives an aligned sequence present in some reads, 
- the second column 'NHEJ' says whether reads with that sequence were classified as NHEJ reads, 
- the third column 'UNMODIFIED' says whether those reads were classified as unmodified (ie. had no substitutions or indels), 
- the fourth column 'HDR' says whether they were classified as HDR reads, 
- the fifth column 'n_deleted' - I think this is the number of deleted bases in reads with this sequence (???), 
- the sixth column 'n_inserted' - I think this is the number of inserted bases in reads with this sequence (???), 
- the seventh column 'n_mutated' - I think this is the number of mutated (substituted) bases in reads with this sequence (???), 
- the eighth column '#Reads' says how many reads had that sequence, 
- and the ninth column '%Reads' says what % of reads had that sequence.


Note: by default, this file only includes substitutions within the window of size w, and deletions that overlap that window of size w and indels that are at least 15 bp from the end of the reference amplicon.
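
As an aside, a quick way to poke at this table is with pandas (a sketch of my own; the column names are as shown above):

import pandas as pd

# read the tab-separated alleles table produced by CRISPResso:
alleles = pd.read_csv("Alleles_frequency_table.txt", sep="\t")

# total % of aligned reads classified as NHEJ:
print(alleles.loc[alleles["NHEJ"] == True, "%Reads"].sum())

# the ten most abundant allele sequences:
print(alleles.sort_values("#Reads", ascending=False).head(10))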
 
cut_points.pickle
I think this is just a temporary file produced by the program (???).


deletion_histogram.txt
This seems to give the data for a histogram of deletion sizes: I'm guessing the first column is the deletion size in bp, and the second column is the number of reads with a deletion of that size (???):
del_size        fq
0       53175
-1      4
-2      3
-3      0
-4      0
-5      0
-6      0
-7      0
-8      0
-9      0
-10     0
-11     0
-12     0
-13     0

The supporting information for the paper says it is the data used to generate the deletion histogram in figure 3 in the output report.

effect_vector_combined.txt
This looks like this:
# amplicon position     effect
1       4.964085592869768027e-01
2       2.519649505471776019e-01
3       2.450077093753525670e+00
4       1.147004625625211577e-01
5       2.425632732879545728e-01

...
According to the supporting information for the paper, this gives the location of mutations (deletions, insertions, substitutions) with respect to the reference amplicon. I'm guessing that the first column is the position in the reference amplicon, and the second column is the frequency of mutations at that position.

In an example I looked at, column 1 goes from 1...172, which I guess is the size of the amplicon (PCR product).

effect_vector_deletion_NHEJ.txt
According to the supporting information for the paper, this gives the location of deletions.

effect_vector_insertion_NHEJ.txt
According to the supporting information for the paper, this gives the location of insertions.

effect_vector_substitution_NHEJ.txt
According to the supporting information for the paper, this gives the location of substitutions.

indel_histogram.txt
The supporting information for the paper says it is the data used to generate figure 1 in the output report.

insertion_histogram.txt
The supporting information for the paper says it is the data used to generate the insertion histogram in figure 3 in the output report.

Mapping_statistics.txt
This seems to have the mapping information for a particular sample, eg.
   READS IN INPUTS:135418
   READS AFTER PREPROCESSING:117968
   READS ALIGNED:53182

Note that in our case we had several different experiments (gRNA and amplicon combinations) sequenced in the same sequencing sample, so only some of the reads aligned (here 53182) to one of the amplicons.
The CRISPResso github page says the 'READS AFTER PREPROCESSING' is the number of reads left after merging reads (from paired-end read-pairs) and/or filtering based on base quality.

position_dependent_vector_avg_deletion_size.txt
This looks like this:
# amplicon position     effect
1       0.000000000000000000e+00
2       0.000000000000000000e+00
3       0.000000000000000000e+00
4       0.000000000000000000e+00

...
The supporting information for the paper says it has the average length of the deletions for each position.

position_dependent_vector_avg_insertion_size.txt
The supporting information for the paper says it has the average length of the insertions for each position.

Quantification_of_editing_frequency.txt
This looks like this:
Quantification of editing frequency:
        - Unmodified:1641 reads
        - NHEJ:51541 reads (4 reads with insertions, 7 reads with deletions, 51540 reads with substitutions)
        - HDR:0 reads (0 reads with insertions, 0 reads with deletions, 0 reads with substitutions)
        - Mixed HDR-NHEJ:0 reads (0 reads with insertions, 0 reads with deletions, 0 reads with substitutions)

Total Aligned:53182 reads

sgRNA_intervals.pickle
I think this is just a temporary file produced by the program (???).

substitution_histogram.txt
The supporting information for the paper says it has the information used to make figure 3 in the output report. It looks like this:
sub_size        fq
0       1642
1       47881
2       3408
3       189
4       43
5       10
6       5
7       2
8       0
9       0
10      1
11      1
12      0
13      0

This seems to have the substitution size (column 1) and frequency (column 2).
In Figure 3, it shows the x-axis of the right hand graph to be 'number of positions substituted', so I think this must be what is called the 'substitution size' (sub_size) in this file (???).


Figures:
1a.Indel_size_distribution_n_sequences.pdf 
This is a histogram of the number of sequences with each indel size. (Image not reproduced here.)

1b.Indel_size_distribution_percentage.pdf  
This is another histogram, which shows the y-axis as the % of sequences with a certain indel size, rather than the number of sequences (as in 1a above). (Image not reproduced here.)

2.Unmodified_NHEJ_pie_chart.pdf 
This is a pie chart, which shows the % of reads with a modification (insertion, deletion, or substitution). (Image not reproduced here.)

3.Insertion_Deletion_Substitutions_size_hist.pdf                 
This shows histograms of insertion, deletion and substitution sizes. (Image not reproduced here.)

4a.Combined_Insertion_Deletion_Substitution_Locations.pdf  
This shows the combined locations of insertions, deletions and substitutions along the amplicon. (Image not reproduced here.)

4b.Insertion_Deletion_Substitution_Locations_NHEJ.pdf
This shows the locations of insertions, deletions and substitutions for reads classified as NHEJ. (Image not reproduced here.)

4e.Position_dependent_average_indel_size.pdf
This shows the average indel size at each position along the amplicon. (Image not reproduced here.)

Note: this picture does not seem to just include indels that overlap the window of size w around the Cas9 cut site. However, I think that this plot is excluding indels that occur in the last 15 bp of the reference amplicon.

Thank you for help
Patrick Driguez
Peter Keen


Viewing pdfs on linux

I wanted to find a pdf viewer that I can use on the Sanger compute farm, and found a nice discussion about pdf viewers for linux. In the end the one that worked for me was evince:
% evince file.pdf
Great!

Friday 6 October 2017

OpenBabel for beginners

The software Open Babel lets you convert between lots of chemistry file formats, and do lots of other cool things like drawing pictures of compounds.

Here are some simple things I've learnt:

Drawing a picture of a molecule:
% ~jc17/software/openbabel/bin/obabel -isdf Structure2D_CID_9812710.sdf -xu -xd -xC -m -O ivermectin.svg
where Open Babel is installed in ~jc17/software/openbabel/bin/obabel
This makes an output file called ivermectin1.svg rather than ivermectin.svg; I think that is because the -m option tells Open Babel to write separate output files (one per molecule), so it numbers them. (Image of the ivermectin structure not reproduced here.)

Yippee!

Converting an SDF format file to SMILES format:
% ~jc17/software/openbabel/bin/obabel tmp.sdf -O tmp.smi
To get the ids in the output file:
% obabel NNN.sdf -O NNN.smi --append IDNUMBER

To kekulize a SMILES string:
Run Open Babel version 3.0 with "-osmi -xk".

Thanks to:
Noel O'Boyle
James Cotton

Note afterwards: Noel said normally molecules are drawn without showing all the hydrogens.

Thursday 5 October 2017

Some basic immunology

I'm trying to learn some basic immunology, so here are some notes to help me remember... (Thanks to wikipedia for helping me understand the things.)


Basophils: a type of granulocyte (a type of white blood cell). They are responsible for inflammatory reactions during immune response.

Classical complement pathway: one of three complement pathways that can activate the complement system, which clears damaged cells and microbes from an organism by promoting inflammation.

Cytokines: substances such as IFN_gamma (interferon gamma), TNF, IL-4, IL-10 that are secreted by certain cells of the immune system and have an effect on other cells (eg. on macrophages).

Danger-associated molecular pattern (DAMP): host molecules that can initiate and perpetuate a noninfectious inflammatory response. Examples are purine metabolites eg. ATP, adenosine. ATP and adenosine are released in high concentration after severe disruption of the cell.

Dendritic cells: antigen-presenting cells of the mammalian immune system. They process antigen material and present it on their cell surface to the T cells of the immune system. Dendritic cells can produce cytokines that send T cells towards a Th1 or Th2 response. Note that monocytes can differentiate into dendritic cells.

Histamine: an organic nitrogenous compound involved in local immune response; as part of an immune response to foreign pathogens, histamine is produced by basophils and by mast cells found in nearby connective tissues. Histamine increases permeability of capillaries to white blood cells.

Interferon gamma: a soluble cytokine critical for innate and adaptive immunity against viral, some bacterial and protozoal infections. An important activator of macrophages.

Lectins (also known as lectin receptors): carbohydrate-binding proteins. Within the innate immune system, lectins such as MBL (mannose binding lectin) help mediate defense against invading microbes. Other lectins likely are involved in modulating inflammatory processes. Here is a review I came across on the role of a type of lectins called galectins in the animal immune system.

Lectin pathway: a type of cascade reaction in the complement system that, after activation, produces activated complement proteins further down the cascade. In contrast to the classical complement pathway, the lectin pathway does not recognise an antibody bound to its target, but rather starts with MBL (mannose binding lectin) or ficolin bound to certain sugars.

Lymphocytes: a subtype of white blood cell. There are many types including CD4+ T cells, CD8+ T cells, NK cells, NKT cells, B cells, etc.

Macrophages: a type of white blood cell that engulfs and digests foreign substances, cancer cells, microbes, etc. Macrophages break down these substances and present the smaller proteins to the T lymphocytes.

Mast cells: a type of granulocyte (a type of white blood cell). Contains many granules rich in histamine and heparin. Involved in wound healing, response to pathogens, etc.

Monoclonal antibodies: antibodies made by identical immune cells that are all clones of the same parent cell, and that all bind to the same epitope of the antigen.


Mononuclear phagocyte system (MPS) (also known as macrophage system): part of the immune system that consists of phagocytic cells located in reticular connective tissue. The cells are mostly macrophages and monocytes, and they accumulate in the lymph nodes and in the spleen.

Spleen: Removes old red blood cells and holds a reserve of blood. Synthesises antibodies in its white pulp. Also has role in cell-mediated immunity.

T cell (T lymphocyte): a type of lymphocyte that plays a key role in cell-mediated immunity. They hunt down and destroy cells that are infected with microbes or that have become cancerous.

Th1 response: Th1 T cells tend to generate immune responses against intracellular parasites such as bacteria and viruses.

Th2 response: Th2 T cells generate immune responses against helminths and other extracellular parasites. Dendritic cells (which sense helminth antigens) are thought to play a dominant role in initiation of Th2 response. To do this they are thought to somehow cause granulocytes (like basophils, eosinophils, mast cells) to produce Th2-associated cytokines like IL4. The interleukin 4 (IL4) is a cytokine that induces differentiation of naive helper T cells (Th0 cells) to Th2 cells.

Introduction to HMMs in Bioinformatics

I was looking recently for my slides on Introduction to HMMs in Bioinformatics, and after some searching, found I'd put them on slideshare. Isn't the web great for woolly memory.

Monday 25 September 2017

How to do a substructure search of ChEMBL

I wanted to do a search of ChEMBL to see if it contains any phase III or phase IV approved drugs that are carbamate benzimidazoles (like albendazole).

So I went to the ChEMBL homepage, where there is a 'Marvin JS' chemical drawing tool by the company ChemAxon. Here was my picture (drawing not reproduced here):


The 'A' here means any atom. This 'A' doesn't seem to include H though, so I had to do another search using '-OH' instead of the '-OA'.
[Later note: actually it's not necessary to do -OA here, you can just do '-O' and the software shows a -OH but this substructure search will hit things with -O-A unless you specifically draw in a -O-H (ie. H instead of A)].

Then I clicked on the 'Substructure search - Fetch compounds' button under the 'Marvin JS' chemical drawing tool, on the ChEMBL webpage. This brought back 248 hits.

Then I clicked on the arrow at the top of the 'Max phase' column to sort by phase. This found a few old friends at the top, albendazole and mebendazole, plus a couple of others. The fourth was a molecule I hadn't seen before. (Compound images not reproduced here.)

Some things I noticed:

- For some reason the 'atom toolbar' seemed to disappear when I was using Safari. I tried using Firefox instead and it was there again. Not sure why this is..

- When I searched ChEMBL for compounds containing a -SO3 substructure, it got hits that at first glance didn't seem to contain this substructure. However, when I looked at the ChEMBL pages for those hits, they said 'Alternative forms of this compound in ChEMBL', and those alternative forms did have -SO3 substructures. So I guess it was searching the 'Alternative forms' as well.

- The 'A' atom seems to mean any atom except H.

- I found that when I search for a benzene ring with -OH attached, it also finds hits that have benzene attached to -O-(something else) so that it doesn't seem to enforce that it's -OH.

- Something is wrong with the ChEMBL results page. When I searched for a benzene ring with two -OH groups attached, the default display showed 25 hits per page, and when I clicked through the pages, I saw some compounds repeated on different hit pages, but didn't find hexylresorcinol, which should be a hit. However, when I asked for 100 hits per page, it showed hexylresorcinol! I emailed the ChEMBL helpdesk about this but no reply yet...

Friday 2 June 2017

Retrieve E.C. numbers for Drosophila melanogaster proteins from UniProt

I wanted to do a query of UniProt, to retrieve all the E.C. numbers that it has for D. melanogaster proteins. I found that you can use a query language called SPARQL for this, at http://sparql.uniprot.org/. This is the SPARQL query to get all the E.C. numbers for D. melanogaster proteins:

PREFIX up:<http://purl.uniprot.org/core/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
SELECT ?protein ?enzyme
WHERE
{
    ?protein a up:Protein .
    ?protein up:organism taxon:7227 .
    ?protein up:enzyme ?enzyme .
}
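
To run the same query programmatically rather than through the web form, something like this should work (a sketch using the SPARQLWrapper Python package; I haven't double-checked the exact endpoint URL, so treat that as an assumption):

from SPARQLWrapper import SPARQLWrapper, JSON

query = """
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX taxon:<http://purl.uniprot.org/taxonomy/>
SELECT ?protein ?enzyme
WHERE
{
    ?protein a up:Protein .
    ?protein up:organism taxon:7227 .
    ?protein up:enzyme ?enzyme .
}
"""

sparql = SPARQLWrapper("https://sparql.uniprot.org/sparql") # assumed endpoint URL
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for row in results["results"]["bindings"]:
    print(row["protein"]["value"], row["enzyme"]["value"])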


Friday 24 March 2017

Finding the list of bioactivities for a ChEMBL target

Here's how to find the list of bioactivities for a ChEMBL target:

From the ChEMBL webpage:
a) paste the target accession (eg. 'O60760') in the search bar and click on targets
b) on the page that you land on, click on the right hand side where it says 'Please select'
c) download the bioactivities; you can then search for the ChEMBL_Id of the molecule you are interested in, and this will give you the links to the publication where the bioactivities were reported

Thanks to Prudence from ChEMBL for this!

Hierarchical clustering a big data set in R

A while ago I wrote some notes on hierarchical clustering using hclust in R. Now I needed to do hierarchical clustering on a big distance matrix, 26104 x 26104 in size. Will it run? It turns out that hierarchical clustering is O(n^3) and so is slow to run and needs lots of memory for big data sets. I found some discussions about this on the web at  http://stackoverflow.com/. However, I did get it to run, by requesting 25000 Mbyte of RAM! Hurray! And it only took 15 mins to run!
   Note however I started with an even bigger data set and found it too big to run hierarchical clustering on, but managed to break down my clusters into a smaller size first by finding clusters in Python using a community detection algorithm (see finding-communities-in-graph-using.html). Then I was able to run the hierarchical clustering on each of the clusters found by Python (the biggest of which was the 26104-element case above).

Monday 20 March 2017

Finding communities in a graph using Python

Before, I wrote some notes on finding communities (clusters) in a graph using R. I've now been finding out how to do this in Python.

Here's some options I explored

- I found that the networkx Python module for graphs has a function to find k-clique communities in a graph. However, these communities seemed to be very small ones; it split up my network too much!

- I found a nice discussion of community detection in Python on stackoverflow.

- I read that another option would be igraph but it turns out that installing it isn't very easy, so I decided to try something else first..

- I then found that there is a Python module called community that contains some nice functions for finding communities. It can find the best partition of a network into communities, here is my code for this (below). By the way, I found that the 'community' module seems to expect that the input to its 'best_partition' function will be a connected component (ie. each node can be reached from every other node), so I needed to first split my original graph into connected components, and then find communities in each connected component:


import sys
import networkx as nx # following example at https://networkx.github.io/examples.html
sys.path.append("/nfs/users/nfs_a/alc/Documents/git/Python/taynaud-python-louvain-6a3696fdce57/") # this is where I downloaded the 'community' module
import community


# first read the graph into networkx, it is in a file with pairs of chemical compounds that I want to be the nodes, and I want an edge between each pair:

G = nx.Graph() # create a graph in networkx
nodeset = set() # set of nodes in our graph already
# read in the input file:

fileObj = open(input_pairs_file, "r")
for line in fileObj:
        line = line.rstrip()
        temp = line.split("\t")
        compound1 = int(temp[0])
        compound2 = int(temp[1])

        similarity = float(temp[2])
        if compound1 not in nodeset:
            G.add_node(compound1)
        if compound2 not in nodeset:
            G.add_node(compound2)

        G.add_edge(compound1,compound2, weight=similarity)
fileObj.close()
 

# first split the graph into connected components:
subgraphs = list(nx.connected_component_subgraphs(G))
for subgraph in subgraphs:
        # find the best partition of the data:
        partition = community.best_partition(subgraph) # we can now print out this partition if we want
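
The partition returned by best_partition() is a dictionary mapping each node to a community number, so to actually list the members of each community you can do something like this (a small sketch of my own; 'partition' is the dictionary returned inside the loop above):

from collections import defaultdict

# group nodes by the community number that best_partition() assigned to them:
members_of_community = defaultdict(list)
for node, community_number in partition.items():
    members_of_community[community_number].append(node)

for community_number, members in members_of_community.items():
    print(community_number, len(members), members)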

Clustering and heatmaps in Python

Not long ago, I wrote a blog post about clustering and heatmaps in R. I'd like now to learn how to do this in Python. This is just a note to myself, that I came across a nice blog post on clustering and dendrograms in Python.
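
As a pointer to future me, here is a minimal sketch of hierarchical clustering and a dendrogram in Python using scipy (made-up random data, just to show the calls):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# made-up data: 10 observations with 4 features each
data = np.random.rand(10, 4)

# hierarchical clustering (average linkage on Euclidean distances):
Z = linkage(data, method="average", metric="euclidean")

# draw the dendrogram:
dendrogram(Z, labels=["obs%d" % i for i in range(10)])
plt.show()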

Saturday 11 March 2017

Fast way to find connected components

I have a big graph, where the nodes are chemicals, and there are edges between pairs of chemicals. My question is: can I find the connected components?

My data is stored in a file called "my_input" like this, with distances between chemicals in the 3rd column:
chemical1 chemical2 0.32
chemical1 chemical3 0.55
etc.
I have 116,506,602 pairs of compounds so it's a lot!

Attempt 1: using R to find connected components (slow!)
This turned out to be very slow. I tried using an R script below:

# Function to make a graph
makegraph <- function(myfile)
{
   library("graph")
   mytable <- read.table(file(myfile),header=FALSE,sep="\t") # Store the data in a data frame
   proteins1 <- mytable$V1
   proteins2 <- mytable$V2
   protnames <- c(levels(proteins1),levels(proteins2))
   # Find out how many pairs of proteins there are
   numpairs <- length(proteins1)
   # Find the unique protein names:
   uniquenames <-  unique(protnames)
   # Make a graph for these proteins with no edges:
   mygraph <- new("graphNEL", nodes = uniquenames)
   # Add edges to the graph:
   # See http://rss.acs.unt.edu/Rdoc/library/graph/doc/graph.pdf for more examples
   weights <- rep(1,numpairs)
   mygraph2 <- addEdge(as.vector(proteins1),as.vector(proteins2),mygraph,weights)
                           
   return(mygraph2)    
}

library("RBGL")
mygraph <- makegraph("my_input") # takes in my input file
# First find the connected components in the graph:
myconnectedcomponents <- connectedComp(mygraph)
# Find number of connected components:
numcomponents <- length(myconnectedcomponents)
# write the members to an output file:
sink('connected_components.out')
for (i in 1:numcomponents) {
   component <- myconnectedcomponents[[i]] # Store the connected component in a vector "component"
   print(paste("Component",i,":",component))
}
sink()
print(paste("FINISHED"))


Attempt 2: using Python Networkx to find connected components
In the end I wrote my own Python script to find connected components (see Attempt 3 below), but it was pointed out to me that I should have investigated the Python NetworkX library.

I later did this, and found it's pretty easy and fast using NetworkX to find connected components.
This will let you read in your graph as 'G', then find connected components, and return them as subgraphs:

import networkx as nx # following example at https://networkx.github.io/examples.html
G = nx.Graph() # create a graph in networkx
nodeset = set() # set of nodes in our graph already 

# read in the input file with the pairs of nodes:
fileObj = open(output_pairs_file, "r")
for line in fileObj:
        line = line.rstrip()
        temp = line.split("\t")
        compound1 = int(temp[0])
        compound2 = int(temp[1])
        if compound1 not in nodeset:
            G.add_node(compound1)
        if compound2 not in nodeset:
            G.add_node(compound2)
        G.add_edge(compound1,compound2)

fileObj.close()
subgraphs = list(nx.connected_component_subgraphs(G))
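
Then, for example, to see how many components there are and how big each one is (a small follow-on sketch):

# print the number of connected components and the size of each one:
print("number of components:", len(subgraphs))
for subgraph in subgraphs:
    print(subgraph.number_of_nodes(), "nodes:", list(subgraph.nodes()))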


Attempt 3: using Python to find connected components (fast!)
This attempt was much faster than the first one. The key was to store as much as possible in arrays and temporary files rather than hashes. It took 28 seconds to run on my big data set! : )
I guess the reason is that R is trying to create an object for each node, and probably takes loads of memory to do this. I try to store all the info in small arrays, using the node number as the index, so that's faster..

import sys
import os
from collections import defaultdict

#====================================================================#

# read in the original compound name for each compound:

def read_original_compound_names(output_file_node_labels):

    # define a list for storing the original compound names:
    original_compound_name = list()
    fileObj = open(output_file_node_labels,"r")
    for line in fileObj:
        line = line.rstrip()
        temp = line.split("\t")
        original_name = temp[0]
        original_compound_name.append(original_name)
    fileObj.close()

    return original_compound_name

#====================================================================#

# The input file has the connected nodes, in format node1\tnode2\tdist
# We want to relabel the nodes as 1,2,3...

def relabel_nodes(input_pairs_file, output_pairs_file, output_file_node_labels):

    # define a dictionary to store the node number for an original compound name:
    node_number = defaultdict()

    # open an output file for putting the nodes relabelled as 1,2,3...
    outputfileObj = open(output_pairs_file, "w")

    # open an output file for putting the original node names corresponding to the node labels 1,2,3...
    outputfileObj2 = open(output_file_node_labels, "w")

    # read in the input file:
    num_compounds = 0
    fileObj = open(input_pairs_file, "r")
    for line in fileObj:
        line = line.rstrip()
        temp = line.split("\t")

        compound1 = temp[0]
        compound2 = temp[1]
        for compound in [compound1, compound2]:
            if compound not in node_number:
                num_compounds += 1
                node_number[compound] = num_compounds
                line2 = "%s\t%s\n" % (compound, num_compounds)
                outputfileObj2.write(line2)
        line = "%s\t%s\n" % (node_number[compound1], node_number[compound2])
        outputfileObj.write(line)
    fileObj.close()
    outputfileObj.close()
    outputfileObj2.close()

    return num_compounds

#====================================================================#

# read in the new version of the input file (output_pairs_file), and store the cluster number
# for each compound in a list 'compound_cluster_number':

def store_compounds_cluster_number(output_pairs_file, num_compounds):

    # define a list for storing the cluster number of compounds: (a list rather than a dictionary, to save space)
    # we can initialise the list to contain zeroes
    compound_cluster_number = [0] * num_compounds

    # define a list of sets, for storing the clusters that a particular cluster should be merged to:
    clusters_to_merge_with = list()

    # define a list of sets, for storing the members of each cluster:
    members_of_cluster = list()

    # read in the input file:
    num_clusters = 0
    fileObj = open(output_pairs_file, "r")
    for line in fileObj:
        line = line.rstrip()
        temp = line.split("\t")
        compound1 = int(temp[0])
        compound2 = int(temp[1])
        (compound_cluster_number,clusters_to_merge_with,members_of_cluster,num_clusters) = store_cluster_numbers_for_two_compounds(compound1,compound2,compound_cluster_number,clusters_to_merge_with,members_of_cluster,num_clusters)
    fileObj.close()
    return(compound_cluster_number, members_of_cluster, clusters_to_merge_with, num_clusters)

#====================================================================#

# take a pair of compounds, and store the cluster numbers for them:

def store_cluster_numbers_for_two_compounds(compound1,compound2,compound_cluster_number,clusters_to_merge_with,members_of_cluster,num_clusters):
    """store cluster numbers for two compounds
    >>> store_cluster_numbers_for_two_compounds(3,4,[1,1,0,0],[set()],[{1,2}], 1) # example where compound1 and compound2 (nodes 3,4) are not in any cluster yet, and we have one cluster so far with nodes 1,2   
    ([1, 1, 2, 2], [set(), set()], [{1, 2}, {3, 4}], 2)
    >>> store_cluster_numbers_for_two_compounds(2,3,[1,1,0],[set()],[{1,2}], 1) # example where compound1 (node 2) is in a cluster already, but compound2 (node 3) is not, and we have one cluster so far with nodes 1,2
    ([1, 1, 1], [set()], [{1, 2, 3}], 1)
    >>> store_cluster_numbers_for_two_compounds(2,3,[1,0,1],[set()],[{1,3}], 1) # example where compound2 (node 3) is in a cluster already, but compound1 (node 2) is not, and we have one cluster so far with nodes 1,3
    ([1, 1, 1], [set()], [{1, 2, 3}], 1)
    >>> store_cluster_numbers_for_two_compounds(1,3,[1,1,2,2],[set(),set()],[{1,2},{3,4}], 2) # example where we have two clusters already, with nodes 1,2 and 3,4, and compound1 (node 1) and compound2 (node 3) belong to different ones
    ([1, 1, 2, 2], [{2}, {1}], [{1, 2}, {3, 4}], 2)
    """
    # tested by typing: python3 -m doctest find_connected_components.py

    if compound_cluster_number[compound1-1] != 0 and compound_cluster_number[compound2-1] == 0: # compound1 is in a cluster but compound2 is not
        cluster_number = compound_cluster_number[compound1-1]
        compound_cluster_number[compound2-1] = cluster_number      
        members_of_cluster[cluster_number-1].add(compound2)
    elif compound_cluster_number[compound1-1] == 0 and compound_cluster_number[compound2-1] != 0: # compound1 is not in a cluster, but compound2 is
        cluster_number = compound_cluster_number[compound2-1]
        compound_cluster_number[compound1-1] = cluster_number      
        members_of_cluster[cluster_number-1].add(compound1)
    elif compound_cluster_number[compound1-1] == 0 and compound_cluster_number[compound2-1] == 0: # compound1 and compound2 are not in any cluster yet
        num_clusters += 1
        clusters_to_merge_with.append(set())
        members_of_cluster.append(set([compound1,compound2]))
        compound_cluster_number[compound1-1] = num_clusters
        compound_cluster_number[compound2-1] = num_clusters
    elif compound_cluster_number[compound1-1] != 0 and compound_cluster_number[compound2-1] != 0: # compound1 and compound2 are in clusters already
        if compound_cluster_number[compound1-1] != compound_cluster_number[compound2-1]: # they are not in the same cluster
            # we need to merge these clusters:
            cluster_number1 = compound_cluster_number[compound1-1]
            cluster_number2 = compound_cluster_number[compound2-1]
            clusters_to_merge_with[cluster_number1-1].add(cluster_number2)
            clusters_to_merge_with[cluster_number2-1].add(cluster_number1)
    else:
        print("ERROR: should not arrive here")
        sys.exit(1)
    return (compound_cluster_number,clusters_to_merge_with,members_of_cluster, num_clusters)

#====================================================================#

# merge all clusters that need merging:

def merge_clusters(compound_cluster_number, members_of_cluster, clusters_to_merge_with):
    """merge all clusters that need merging
    >>> merge_clusters([1, 1, 2, 2], [{1, 2}, {3, 4}], [{2}, {1}]) # there are two clusters, which should be merged into one
    [{1, 2, 3, 4}, set()]
    >>> merge_clusters([1, 1, 2, 2], [{1, 2}, {3, 4}], [set(), set()]) # there are two clusters, which should not be merged
    [{1, 2}, {3, 4}]
    >>> merge_clusters([1, 1, 2, 2, 3], [{1, 2}, {3, 4}, {5}], [{2}, {1, 3}, {2}]) # there are three clusters, which should be merged
    [{1, 2, 3, 4, 5}, set(), set()]
    >>> merge_clusters([1, 1, 2, 2, 3], [{1, 2}, {3, 4}, {5}], [set(), {3}, {2}]) # there are three clusters, the last 2 should be merged
    [{1, 2}, {3, 4, 5}, set()]
    >>> merge_clusters([1, 1, 2, 2, 3], [{1, 2}, {3, 4}, {5}], [{3}, set(), {1}]) # there are three clusters, the first and third should be merged
    [{1, 2, 5}, {3, 4}, set()]
    >>> merge_clusters([1, 1, 2, 2, 3, 4], [{1, 2}, {3, 4}, {5}, {6}], [{3}, set(), {1,4}, {3}]) # there are 4 clusters, the 1st, 3rd and 4th should be merged
    [{1, 2, 5, 6}, {3, 4}, set(), set()]
    """
    # tested using: python3 -m doctest find_connected_components.py

    for cluster_index, this_cluster_set_to_merge_with in enumerate(clusters_to_merge_with):
        if len(this_cluster_set_to_merge_with) > 0: # if this cluster should be merged with one or more other clusters
            # find the full set of clusters that this cluster should be merged with:
            full_set_to_merge_with = find_full_set_to_merge_with(this_cluster_set_to_merge_with, clusters_to_merge_with, cluster_index + 1)
            for cluster_number_to_merge_with in full_set_to_merge_with:
                assert(len(members_of_cluster[cluster_number_to_merge_with-1]) > 0)
                for compound in members_of_cluster[cluster_number_to_merge_with-1]:
                    # don't need this: compound_cluster_number[compound-1] = cluster_index
                    members_of_cluster[cluster_index].add(compound)
                clusters_to_merge_with[cluster_number_to_merge_with-1] = set()
                members_of_cluster[cluster_number_to_merge_with-1] = set()

    return members_of_cluster

#====================================================================#

# find the full set of clusters that this cluster should be merged with:

def find_full_set_to_merge_with(this_cluster_set_to_merge_with, clusters_to_merge_with, this_cluster_number):
    """find the full set of clusters that this cluster should be merged with
    >>> find_full_set_to_merge_with({2}, [{2}, {1}], 1) # this cluster is cluster 1, it needs to be merged with cluster 2
    {2}
    >>> find_full_set_to_merge_with({1}, [{2}, {1}], 2) # this cluster is cluster 1, it needs to be merged with cluster 1
    {1}
    >>> find_full_set_to_merge_with({2}, [{2}, {1, 3}, {2}], 1) # this cluster is cluster 1, it needs to be merged with clusters 2,3
    {2, 3}
    >>> find_full_set_to_merge_with({2}, [{2}, {1, 3}, {2, 4}, {3}], 1) # this cluster is cluster 1, it needs to be merged with clusters 2,3,4
    {2, 3, 4}
    >>> find_full_set_to_merge_with({2}, [{2}, {1, 3}, {2}, set()], 1) # this cluster is cluster 1, it needs to be merged with clusters 2,3
    {2, 3}
    """

    full_set_to_merge_with = set()
    found_some_new = True

    while found_some_new == True:
        new_to_merge_with = set()
        for cluster_number_to_merge_with in this_cluster_set_to_merge_with:
            if cluster_number_to_merge_with != this_cluster_number and cluster_number_to_merge_with not in full_set_to_merge_with:
                full_set_to_merge_with.add(cluster_number_to_merge_with)
                if len(clusters_to_merge_with[cluster_number_to_merge_with-1]) > 0:
                    new_to_merge_with.update(clusters_to_merge_with[cluster_number_to_merge_with-1])
        if len(new_to_merge_with) == 0:
            found_some_new = False
        else:
            this_cluster_set_to_merge_with.update(new_to_merge_with)

    return full_set_to_merge_with

#====================================================================#

# print out the compounds in each cluster: 

def print_out_clusters(members_of_cluster, original_compound_name, output_file):

    outputfileObj = open(output_file, "w")

    cluster_num = 0
    for cluster_index, this_cluster_members in enumerate(members_of_cluster):
        if len(this_cluster_members) > 0:
            cluster_num += 1
            for compound in this_cluster_members:
                compound_name = original_compound_name[compound-1]
                line = "%s\t%s\n" % (cluster_num, compound_name)
                outputfileObj.write(line)
    outputfileObj.close()

    return

#====================================================================#

def main():
   
    # check the command-line arguments:
    if len(sys.argv) != 5 or os.path.exists(sys.argv[1]) == False:
        print("Usage: %s input_pairs_file output_pairs_file output_file_node_labels, output_file" % sys.argv[0])
        sys.exit(1)
    input_pairs_file = sys.argv[1] # input file of connected nodes. This has format node1\tnode2\tdist. eg. reformatted_data_anthelminticdists.filt2b_cutoff0.65
    output_pairs_file = sys.argv[2] # output file of connected nodes, with the nodes labelled as 1,2,3...
    output_file_node_labels = sys.argv[3] # output file with the original node names corresponding to node labels 1,2,3...
    output_file = sys.argv[4] # output file for writing out the members of each connected component

    # relabel the  nodes as 1,2,3...
    num_compounds = relabel_nodes(input_pairs_file, output_pairs_file, output_file_node_labels)

    # read in the new version of the input file (output_pairs_file), and store the cluster number
    # for each compound in a list 'compound_cluster_number':
    (compound_cluster_number, members_of_cluster, clusters_to_merge_with, num_clusters) = store_compounds_cluster_number(output_pairs_file, num_compounds)

    # now merge all clusters that need merging:
    members_of_cluster = merge_clusters(compound_cluster_number, members_of_cluster, clusters_to_merge_with)

    # now read in the original compound name for each compound:
    original_compound_name = read_original_compound_names(output_file_node_labels)

    # now print out the compounds in each cluster: 
    print_out_clusters(members_of_cluster, original_compound_name, output_file)

#====================================================================#

if __name__=="__main__":
    main()

#====================================================================#