I'm learning how to use the ARIBA software to search for genes and variants in a genome for which I have Illumina read-pair data as fastq files.
Given fastq files for a genome that you have sequenced, ARIBA tries to makes a local assembly for the gene that you are interested in.
(Note that if you had a genome assembly rather than just fastq files for your genome, you could search for that gene using BLAST.)
I'm interested in looking for variants (alleles) of a gene called ctxB in Vibrio cholerae.
Start ARIBA
First I need to load ARIBA on the Sanger farm (for Sanger users only):
% module avail -t | grep -i ariba
ariba/release-v2.14.6
% module load ariba/release-v2.14.6
Input files
The input files that I have are a fasta file 'ctxB_sequences_rev.fa.txt' of the sequences for ctxB variants:
e.g.
>ctxB1
ATGATTAAATTAAAATTTGGTGTTTTTTTTACAGTTTTACTATCTTCAGCATATGCACATGGAACACCTCAAAATATTACTGATTTGTGTGCAGAATACCACAACACACAAATACATACGCTAAATGATAAGATATTTTCGTATACAGAATCTCTAGCTGGAAAAAGAGAGATGGCTATCATTACTTTTAAGAATGGTGCAACTTTTCAAGTAGAAGTAC
CAGGTAGTCAACATATAGATTCACAAAAAAAAGCGATTGAAAGGATGAAGGATACCCTGAGGATTGCATATCTTACTGAAGCTAAAGTCGAAAAGTTATGTGTATGGAATAATAAAACGCCTCATGCGATTGCCGCAATTAGTATGGCAAATTAA
>ctxB3/B3b
ATGATTAAATTAAAATTTGGTGTTTTTTTTACAGTTTTACTATCTTCAGCATATGCACATGGAACACCTCAAAATATTACTGATTTGTGTGCAGAATACCACAACACACAAATATATACGCTAAATGATAAGATATTTTCGTATACAGAATCTCTAGCTGGAAAAAGAGAGATGGCTATCATTACTTTTAAGAATGGTGCAATTTTTCAAGTAGAAGTAC
CAGGTAGTCAACATATAGATTCACAAAAAAAAGCGATTGAAAGGATGAAGGATACCCTGAGGATTGCATATCTTACTGAAGCTAAAGTCGAAAAGTTATGTGTATGGAATAATAAAACGCCTCATGCGATTGCCGCAATTAGTATGGCAAATTAA
Note that ARIBA doesn't like spaces or new line characters, so the sequence should all be on one line with no spaces. Also, these should be DNA sequences, not protein sequences.
A second input file 'ctxB_desc.tsv' looks like this, tab-separated, with one line per variant:
ctxB1 1 0 . . ctxB1
ctxB3/B3b 1 0 . . ctxB3/B3b
Note that this needs to be tab-separated. To insert tabs when you're using 'vi', press CTRL-V then tab.
Run ARIBA
I used these commands to run ARIBA:
% ariba prepareref -f ctxB_sequences_rev.fa.txt -m ctxB_desc.tsv out.ctxB.prepareref
where ctxB_sequences_rev.fa.txt and ctxB_desc.tsv are my input files (see above) and out.ctxB.prepareref is the name that I want to give to and output directory.
This is preparing to run the ARIBA pipeline.
% ariba run out.ctxB.prepareref 1.fastq.gz 2.fastq.gz out.ctxB.mygenome
where 1.fastq.gz and 2.fastq.gz are the fastq files for my genome of interest, and out.ctxB.mygenome is the name I want to give to the output directory.
This is running the ARIBA local assembly pipeline.
% ariba summary --preset all_no_filter out.summary_ctxB out.ctxB.*/report.tsv
where out.summary_ctxB is the name I want to give the output file.
This summarises multiple reports made by 'ariba run' above. In my case I actually made only one report for ctxB.
Output file
The output file out.summary_ctxB.csv looks like this:
name,cluster.assembled,cluster.match,cluster.ref_seq,cluster.pct_id,cluster.ctg_cov,cluster.known_var,cluster.novel_var
out.ctxB.VC006AtopC/report.tsv,yes,yes,ctxB7,100.0,56.4,no,no
The description of the columns is here.
That is, the report tells me that it did find a match to the ctxB7 gene, with 100% identical, with mean read depth 56.4 across the contig with the match.
Acknowledgements
Thanks to my colleague Matt Dorman for help.
No comments:
Post a Comment