## Monday, 27 November 2017

### Fastq format

I'm looking at a fastq file that has header lines like this and want to figure out what they mean:

@M03558:259:000000000-BH588:1:1101:16110:1341 2:N:0:NTTGTA
@M03558:259:000000000-BH588:1:1101:16089:1342 2:N:0:NTTGTA

@M03558:259:000000000-BH588:1:1101:15471:1344 2:N:0:NTTGTA
@M03558:259:000000000-BH588:1:1101:15455:1333 2:N:0:NTTGTA

@M03558:259:000000000-BH588:1:1101:14580:1411 2:N:0:CTTGTA
...

Luckily I found a nice wikipedia description of FASTQ, which tells me that probably:
M03558 = the unique instrument name
259 = the run id
000000000-BH588 = the flow cell id.
1 = the flow cell lane
1101 = the tile number within the flow cell lane (I get 28 different values for this)
16110 = x-coordinate of the cluster within the tile
1341 = y-coordinate of the cluster within the tile
2 = member of a pair (paired end reads only). (I only see '2' here so I think the reads in my fastq are single-end reads)
N = means the read is not filtered (I only see 'N' here)
0 = this will be '0' when none of the control bits are on (I only see '0' here)
NTTGTA = this is the index sequence. I see several different index sequences, with different frequencies.

Thanks wikipedia!

## Friday, 3 November 2017

### A stacked barplot in R

I wanted to make a stacked barplot in R. My input data looked like this:

1 1 3232
1 2 32232
1 4 23232
2 1 23322
2 2 2323
2 3 4343
2 4 23432

I wanted to make a barplot showing the 96 barcodes adjacent to each other, and for each barcode, a stack showing the number of reads for plate 1, 2, 3.

Getting the data into R (painful!)
The problem was getting my data into R. The input data did not have values for every plate-barcode combination, but I wanted to assume a value of 0 for combinations that were not in the input file. In the end I had to write some code to squeeze the data into R:

# input data has columns plate, barcode, number of input reads

plate <- MyData\$plate
barcode <- MyData\$barcode

# put the input data in a matrix, for use in the barplot() command.
# The matrix will have three rows (plate 1,2,3) and 96 columns (barcode 1..96):
mymatrix <- matrix(, nrow=3, ncol=96)

for (platenum in 1:3)

for (barcodenum in 1:96)
{
# find the index (if any) in vector 'reads' for plate 'platenum' and barcode 'barcodenum'.
value <- intersect(which(plate==platenum),which(barcode==barcodenum))
if (length(value > 0))
{
# get the number of reads for plate 'platenum' and barcode 'barcodenum' from vector reads:
}
else
{
mymatrix[platenum,barcodenum] <- 0
}
}
}

Plotting the data in R (ok!)
Plotting the data was not so hard. I used the example from http://www.r-graph-gallery.com/ to make a stacked barplot:
colnames <- seq(1,96)
rownames <- seq(1,3)
colnames(mymatrix) <- colnames
rownames(mymatrix) <- rownames

barplot(mymatrix, col=colors()[c(23,89,12)], border="white", space=0.04, font.axis=2, xlab="barcode", ylab="thousands of input reads", legend=rownames(mymatrix))

A little bit of the plot:

Some other little tricks I learnt:
To put some space around the plot I can type before the 'barplot' command:
par( mar=c(8, 4.7, 2.3, 0)) # last value is space on RHS, second last value is space at top, 2nd value is space on LHS, 1st value is space below

In the barplot command itself:
border="white": use white for the border of the bars
space=0.04 : leaves space before each bar. cex.names=0.5
makes the x-axis labels smallerlas=3 makes the labels perpendicular to the axis

### Multivariate hypergeometric distribution in R

A hypergeometric distribution can be used where you are sampling coloured balls from an urn without replacement.

A univariate hypergeometric distribution can be used when there are two colours of balls in the urn, and a multivariate hypergeometric distribution can be used when there are more than two colours of balls.

The multivariate hypergeometric distribution can be used to ask questions such as: what is the probability that, if I have 80 distinct colours of balls in the urn, and sample 100 balls from the urn with replacement, that I will have at least one ball of each colour?

I recently have a similar problem in bioinformatics: if there are 6329 distinct copies of a gene in our eppendorph, and (after some PCRs) we sequence 10,000 distinct reads, what is the probability that we sequenced at least one read from each of the 6329 distinct copies of the gene? In other words, if there are 6329 distinct colours of balls in an urn, and we pick out 10,000 balls (without replacement), what is the probability that we chose at least one ball of each colour?

In R the dhyper function could be used in the case where there are two colours of ball in an urn, that is a univariate hypergeometric distribution. But what about where there are 6329 colours of ball, that is, a multivariate hypergeometric distribution? Then I came across the extraDistr R package, which has a multivariate hypergeometric distribution. The function is called "MultiHypergeometric".
I installed and load this library in R using:

If two cards are drawn from a deck of 52 cards, what is the probability that one is a spade and one is a heart?
I found a nice example of using the MultiHypergeometric distribution here by Jonathan Fivelsdal, thank you! He asked the question: if two cards are drawn from a deck of 52 cards, what is the probability that one is a spade and one is a heart? In R:
> dmvhyper(x = c('heart' = 1, 'spade' = 1, 'other' = 0), n = c('heart' = 13, 'spade' = 13, 'other' = 26), k=2)
 0.127451
Here x is a vector that has the frequency of hearts and spades and other cards in our sample of interest,
n is a vector that has the frequency of hearts and spades and other cards in the deck of cards,
k is the number of cards in our sample of interest.
[Note: the manual for this package says k is 'the number of balls drawn from the urn'.]

Given a huge bag containing scrabble letters, if know the number of each letter tile in the bag, what is the probability of drawing a particular word of length N if we select out N letter tiles without replacement?

This is another nice example that I found here by Herb Susmann.
The answer is that we can do something like this:
> dmvhyper(x = letterfreqs, n = freqs, k = sum(letterfreqs))
where x = letterfreqs is a vector of length 26, with the frequency of each letter in the word of interest,
n is a vector of length 26 with the frequency of each letter tile in the bag,
k is the number of letters in the word of interest (of length N), ie. it is N.

If I have 6329*2=12,658 balls of 6329 distinct colours in an urn (two of each colour), and pick out 10,000 balls (without replacement), what is the probability that I chose at least one ball of each colour?
> dmvhyper(x = rep(1, 6329), n = rep(2, 6329), k = 10000, log = FALSE)
 0
x is an m-column matrix of quantiles. From the example above by Jonathan Fivelsdal using a deck of cards, I think that this gives the counts of each of the colours of ball in the sample that you're asking about. In our case we are asking about having at least one ball of each colour.
The vector n is an m-length vector and has the number of balls in each of the m different colours. In our case m is 6329, so it is a 6329-length vector. Here we just assume we have 2 balls of each colour in the urn.
In this case k is the number of balls drawn from the urn, which is 10,000 in our case.

Notes to myself:
- in the case of our sequencing problem, we need to take into account the number of sequences in the urn, ie. the total number of sequences after the PCR.
- is sequencing molecules a sampling without replacement problem? Need to check...

## Thursday, 2 November 2017

### 'Learn Python Visually' book

I am doing some Python study, and found the book Learn Python Visually by Ivelin Demirov in my work library.

It has the subtitle 'An Accelerated Learning Method Which Uses Science and Creativity to Teach Right-Brained Non-Coders', which is intriguing. Also, the development of the book was funded by Kickstarter, which is cool to hear!

I'm thinking to make here some notes on the new things that I learnt from this book:

p. 14. Numbers are immutable. This page explains about numbers being immutable in Python. I always find this concept really confusing. The book gave me a little more information on it, but then I also found a nice article on it on the web.
There it is explained that when an object is initiated, it is assigned a unique object id. The state of the object can be changed later if the object is a mutable-type object. That is, the state of a mutable object can be changed after it is created, but the state of an immutable object can't be.
The book also tells about the id() function that tells you about the identity of an object as an integer, and the is() function that compares the identity of two objects.
Example:
>>> b = 11
>>> a = b
>>> c = 12
>>> d = 11
>>> id(a)
9084448
>>> id(b)
9084448
>>> id(c)
9084480
>>> id(d)
9084448
>>> a is b
True
>>> a is c
False
>>> a is d
True

p. 15 Assigning the same value to multiple variables at once. I didn't know you can assign a value to multiple variables at once:
>>> a = b = c = d = "Hello"

p. 15 Special words that can't be used as variable names. The book gives a list of special words that can't be used as variable names, I also found them on the web here.

p. 19 Using multi-line strings to store large text blocks, by enclosing them with triple quotes:
An example is:
>>> huidobro = """Al horitaña de la montazonte
La violondrina y el goloncelo
Descolgada esta mañana de la lunala
Se acerca a todo galope

"""

p. 21 Short forms of arithmetic operators.
I knew that it was possible to do:
>>> a += 1
but it turns out that you can also do:
>>> a -= 1
>>> a *= 1
>>> a /= 2
etc.

p. 28 Precedence of arithmetic operators.
Something I didn't know was that exponents have precedence before subtraction, which means that we get different answers for:
>>> abs(-2 ** 2 - 20)
24
which first finds 2**2 = 4, and then finds abs(-4 - 20), and
>>> abs((-2) ** 2 - 20)
16
which first finds (-2)**2 = 4, and then finds abs(4 - 20).