Monday 4 December 2017

Trimming adapter sequences

I wanted to trim adapter sequences, specifically the sequence 'GTTTTAGGTC'

Attempt 1: Trimmomatric
I first tried Trimmomatic, with a fasta file with the forward and reverse complement of my adapter sequences, and the 'ILLUMINACLIP:/lustre/scratch118/infgen/team133/alc/000_FUGI_PatrickCRISPR/OmegaGenewhiz/adaptor.fa:2:40:0' option. However, I found that this just removes the sequences with adapter, it doesn't trim them. This is also described here.

Attempt 2: Cutadapt
I then tried cutadapt. With the following fasta file:
>seq1
GTTTTAGGTCGTTATCGTGTA
>seq2
TACACGATAACGACCTAAAAC 


I was able to trim adapters using:
% cutadapt -g GTTTTAGGTC my.fasta -a GACCTAAAAC
where -g cuts sequences off the 5' end, and -a off the 3' end. 
It cuts off adapter with at most 10% errors in single-end read mode.

This gave me:
>seq1
GTTATCGTGTA
>seq2

TACACGATAAC
Hurray!

Something strange I noticed about cutadapt:
I have been using cutadapt to trim some adaptors, and noticed that it totally removed some of my sequences.

For example, with this fastq input: (in tmp.fastq)
@M03558:259:000000000-BH588:1:1101:4913:6565 2:N:0:TTTGTA
ACTGACCCTCAGCAATCTTAAACTTCTTAGACGAATCACCAGAACGGAAAACATCCTTCATAGAAATTTCACGCGGCGGCAAGTTGCCATACAAAACAGGGTCGCCAGCAATATCGGTATAAGTCAAAGCACCTTTAGCGTTAAGGTACTGAATCTCTTTAGTCGCAGTAGGCGGAAAACGAACAAGCGCAAGAGTAAACATAGTGCCATGCTCAGGAACAAAGAAACGCGGCACAGAATGTTTATAGGTC
+
AAAAAFFFFCFCGGGGGGGGGGHHHHHHGFHHGEEGEHHFHHFFHGEGFGHFHHHHHHHHHHGHFHHHHHHHEHGCFGGGGGGHHHHHHHHHHHHHGHGHGHHGGGGGGHHGFHEGGCEFFHHGGGHHHEFGHEHHHGGGAE?EHHGDHHFGHHHHHEHHHH:C.A@DCGHBGA-?BGGGCGEGCFFFFFFFFEF;FFFEF/B/:BFEF;/:;BB;BFFFFFF/BFFFAAFA-;-B.FFBF;/BFBF/9/:


I then ran cutadapt like this:
% cutadapt -g GTTTTAGGTC -a GACCTAAAAC tmp.fastq
and it removed the whole input sequence.
I can't see any matches to the adaptors in that input though. When I ran blastn between that read and the NCBI database, the sequence was found to be PhiX174. Is that why cutadapt removed it, or was there some other reason?

Update: 7-Dec-2017: I sent an email about this to the cutadapt developer, Marcel Martin, and received a very helpful reply:


In short: The 5' adapter is found at the very end (3') of the read with one 
error.

From the statistics report that cutadapt prints, you can see that it finds 
the 5' adapter with one error. And if you run the above command without 
specifying the 3' adapter with -a GACCTAAAAC, you get the same result, so 
it must be the 5' adapter. When cutadapt finds a 5' adapter, it removes 
the adapter sequnce itself and everything preceding it. And in your case, 
the read ends with 'GTTTATAGGTC', which is the 5' adapter sequence with an 
extra 'A' inserted between one of the 'T' nucleotides.
Although cutadapt doesn’t print it, the alignment likely looks like this:

...GTTTATAGGTC (read end)
   GTTT-TAGGTC (adapter)

This is possibly not the result you were expecting, but cutadapt is doing 
what it is supposed to.

You have a few options:
- Accept this result
- If you think allowing insertions is the problem, use --no-indels to
disallow insertions and deletions.
- If you want to prevent the 5' adapter from occurring within the read at 
all, you can specify the adapter as -g XXGTTTTAGGTC. The 'X' is always 
counted as a mismatch, so if you have more X than allowed mismatches, 
the adapter is forced to occur at the 5' end (overlapping it).
 

2 comments:

Unknown said...

Have you tried AdapterRemoval? I'm using it a lot recently because it provides several tools to process raw reads files. It combines several tools into a single efficient software.

https://github.com/MikkelSchubert/adapterremoval

Avril Coghlan said...

No, I didn't know about that, thanks!