Friday 22 December 2017

Finding all occurrences of a subsequence in a sequence

I wanted to find all the occurrences of a subsequence in a genome assembly. To do this, I first tried using BLAT but it didn't find them for me (not sure why).

So I instead wrote a little Python function to print out all the positions of a subsequence in a sequence:

#====================================================================#

# find the positions of a subsequence in a sequence:

def find_positions_of_subsequence(seq, subsequence, seqname):

    still_searching = True
    start = 0
    end = len(seq) - 1
    while (still_searching == True):
        position = seq.find(subsequence, start, end)
        if position == -1:
            still_searching = False
        else:
            actual_position = position + 1
            format_string = "Found at %d in %s" % (actual_position, seqname)
            print(format_string)
            start = position + 1

    return

#====================================================================#


Python saves the day!

Monday 4 December 2017

Trimming adapter sequences

I wanted to trim adapter sequences, specifically the sequence 'GTTTTAGGTC'

Attempt 1: Trimmomatric
I first tried Trimmomatic, with a fasta file with the forward and reverse complement of my adapter sequences, and the 'ILLUMINACLIP:/lustre/scratch118/infgen/team133/alc/000_FUGI_PatrickCRISPR/OmegaGenewhiz/adaptor.fa:2:40:0' option. However, I found that this just removes the sequences with adapter, it doesn't trim them. This is also described here.

Attempt 2: Cutadapt
I then tried cutadapt. With the following fasta file:
>seq1
GTTTTAGGTCGTTATCGTGTA
>seq2
TACACGATAACGACCTAAAAC 


I was able to trim adapters using:
% cutadapt -g GTTTTAGGTC my.fasta -a GACCTAAAAC
where -g cuts sequences off the 5' end, and -a off the 3' end. 
It cuts off adapter with at most 10% errors in single-end read mode.

This gave me:
>seq1
GTTATCGTGTA
>seq2

TACACGATAAC
Hurray!

Something strange I noticed about cutadapt:
I have been using cutadapt to trim some adaptors, and noticed that it totally removed some of my sequences.

For example, with this fastq input: (in tmp.fastq)
@M03558:259:000000000-BH588:1:1101:4913:6565 2:N:0:TTTGTA
ACTGACCCTCAGCAATCTTAAACTTCTTAGACGAATCACCAGAACGGAAAACATCCTTCATAGAAATTTCACGCGGCGGCAAGTTGCCATACAAAACAGGGTCGCCAGCAATATCGGTATAAGTCAAAGCACCTTTAGCGTTAAGGTACTGAATCTCTTTAGTCGCAGTAGGCGGAAAACGAACAAGCGCAAGAGTAAACATAGTGCCATGCTCAGGAACAAAGAAACGCGGCACAGAATGTTTATAGGTC
+
AAAAAFFFFCFCGGGGGGGGGGHHHHHHGFHHGEEGEHHFHHFFHGEGFGHFHHHHHHHHHHGHFHHHHHHHEHGCFGGGGGGHHHHHHHHHHHHHGHGHGHHGGGGGGHHGFHEGGCEFFHHGGGHHHEFGHEHHHGGGAE?EHHGDHHFGHHHHHEHHHH:C.A@DCGHBGA-?BGGGCGEGCFFFFFFFFEF;FFFEF/B/:BFEF;/:;BB;BFFFFFF/BFFFAAFA-;-B.FFBF;/BFBF/9/:


I then ran cutadapt like this:
% cutadapt -g GTTTTAGGTC -a GACCTAAAAC tmp.fastq
and it removed the whole input sequence.
I can't see any matches to the adaptors in that input though. When I ran blastn between that read and the NCBI database, the sequence was found to be PhiX174. Is that why cutadapt removed it, or was there some other reason?

Update: 7-Dec-2017: I sent an email about this to the cutadapt developer, Marcel Martin, and received a very helpful reply:


In short: The 5' adapter is found at the very end (3') of the read with one 
error.

From the statistics report that cutadapt prints, you can see that it finds 
the 5' adapter with one error. And if you run the above command without 
specifying the 3' adapter with -a GACCTAAAAC, you get the same result, so 
it must be the 5' adapter. When cutadapt finds a 5' adapter, it removes 
the adapter sequnce itself and everything preceding it. And in your case, 
the read ends with 'GTTTATAGGTC', which is the 5' adapter sequence with an 
extra 'A' inserted between one of the 'T' nucleotides.
Although cutadapt doesn’t print it, the alignment likely looks like this:

...GTTTATAGGTC (read end)
   GTTT-TAGGTC (adapter)

This is possibly not the result you were expecting, but cutadapt is doing 
what it is supposed to.

You have a few options:
- Accept this result
- If you think allowing insertions is the problem, use --no-indels to
disallow insertions and deletions.
- If you want to prevent the 5' adapter from occurring within the read at 
all, you can specify the adapter as -g XXGTTTTAGGTC. The 'X' is always 
counted as a mismatch, so if you have more X than allowed mismatches, 
the adapter is forced to occur at the 5' end (overlapping it).