avrilomics: October 2022

Tuesday 25 October 2022

Using abricate to search for antimicrobial resistance genes

I'm learning to use the abricate software for identifying antimicrobial resistance (AMR) genes in bacterial genomes.

Start abricate

First I need to load abricate on the Sanger compute farm (for Sanger users only):

% module avail -t | grep -i abricate

abricate/1.0.1

% module load abricate/1.0.1

Run abricate

I have a bacterial genome in a file mygenome.fa and want to search for AMR genes in it, using abricate and the NCBI AMR database. Luckily, the NCBI AMR database is on the Sanger farm in the directory /lustre/scratch118/infgen/pathogen/pathpipe/abricate/db, so I can type:

% abricate --datadir /lustre/scratch118/infgen/pathogen/pathpipe/abricate/db --db ncbi mygenome.fa

Say you have lots of files that you want to run abricate on. If you make a file fofn.txt with a single column with a list of the fasta files that you want to run abricate on, you can run abricate on multiple files:

% abricate --datadir /lustre/scratch118/infgen/pathogen/pathpipe/abricate/db --db ncbi --fofn fofn.txt

Note that abricate can also be used to find virulence genes, e.g. using the vfdb (virulence factor database):

% abricate --datadir /data/pam/software/abricate/db --db vfdb mygenome.fa

Output for abricate

The output looks like this:

#FILE SEQUENCE START END STRAND GENE COVERAGE COVERAGE_MAP GAPS %COVERAGE %IDENTITY DATABASE ACCESSION PRODUCT RESISTANCE
mygenome.fasta NODE_11_length_118020_cov_55.534451 70862 71335 + dfrA1 1-474/474 =============== 0/0 100.00 100.00 ncbi A7J11_00830 trimethoprim-resistant dihydrofolate reductase DfrA1

The columns in the output file are described here.

Note that abricate finds genes that cause antimicrobial resistance, but does not find SNPs that find antimicrobial resistance.

Acknowledgements

Thanks to my colleague Sam Dougan for advice about abricate.

Using ARIBA to search for genes and alleles

I'm learning how to use the ARIBA software to search for genes and variants in a genome for which I have Illumina read-pair data as fastq files.

Given fastq files for a genome that you have sequenced, ARIBA tries to makes a local assembly for the gene that you are interested in.

(Note that if you had a genome assembly rather than just fastq files for your genome, you could search for that gene using BLAST.)

I'm interested in looking for variants (alleles) of a gene called ctxB in Vibrio cholerae.

Start ARIBA

First I need to load ARIBA on the Sanger farm (for Sanger users only):

% module avail -t | grep -i ariba
ariba/release-v2.14.6

% module load ariba/release-v2.14.6

Input files

The input files that I have are a fasta file 'ctxB_sequences_rev.fa.txt' of the sequences for ctxB variants:

e.g.

>ctxB1
ATGATTAAATTAAAATTTGGTGTTTTTTTTACAGTTTTACTATCTTCAGCATATGCACATGGAACACCTCAAAATATTACTGATTTGTGTGCAGAATACCACAACACACAAATACATACGCTAAATGATAAGATATTTTCGTATACAGAATCTCTAGCTGGAAAAAGAGAGATGGCTATCATTACTTTTAAGAATGGTGCAACTTTTCAAGTAGAAGTAC
CAGGTAGTCAACATATAGATTCACAAAAAAAAGCGATTGAAAGGATGAAGGATACCCTGAGGATTGCATATCTTACTGAAGCTAAAGTCGAAAAGTTATGTGTATGGAATAATAAAACGCCTCATGCGATTGCCGCAATTAGTATGGCAAATTAA
>ctxB3/B3b
ATGATTAAATTAAAATTTGGTGTTTTTTTTACAGTTTTACTATCTTCAGCATATGCACATGGAACACCTCAAAATATTACTGATTTGTGTGCAGAATACCACAACACACAAATATATACGCTAAATGATAAGATATTTTCGTATACAGAATCTCTAGCTGGAAAAAGAGAGATGGCTATCATTACTTTTAAGAATGGTGCAATTTTTCAAGTAGAAGTAC
CAGGTAGTCAACATATAGATTCACAAAAAAAAGCGATTGAAAGGATGAAGGATACCCTGAGGATTGCATATCTTACTGAAGCTAAAGTCGAAAAGTTATGTGTATGGAATAATAAAACGCCTCATGCGATTGCCGCAATTAGTATGGCAAATTAA

Note that ARIBA doesn't like spaces or new line characters, so the sequence should all be on one line with no spaces. Also, these should be DNA sequences, not protein sequences.

A second input file 'ctxB_desc.tsv' looks like this, tab-separated, with one line per variant:

ctxB1 1 0 . . ctxB1
ctxB3/B3b 1 0 . . ctxB3/B3b

Note that this needs to be tab-separated. To insert tabs when you're using 'vi', press CTRL-V then tab.

Run ARIBA

I used these commands to run ARIBA:

% ariba prepareref -f ctxB_sequences_rev.fa.txt -m ctxB_desc.tsv out.ctxB.prepareref

where ctxB_sequences_rev.fa.txt and ctxB_desc.tsv are my input files (see above) and out.ctxB.prepareref is the name that I want to give to and output directory.

This is preparing to run the ARIBA pipeline.

% ariba run out.ctxB.prepareref 1.fastq.gz 2.fastq.gz out.ctxB.mygenome

where 1.fastq.gz and 2.fastq.gz are the fastq files for my genome of interest, and out.ctxB.mygenome is the name I want to give to the output directory.

This is running the ARIBA local assembly pipeline.

% ariba summary --preset all_no_filter out.summary_ctxB out.ctxB.*/report.tsv

where out.summary_ctxB is the name I want to give the output file.

This summarises multiple reports made by 'ariba run' above. In my case I actually made only one report for ctxB.

Output file

The output file out.summary_ctxB.csv looks like this:

name,cluster.assembled,cluster.match,cluster.ref_seq,cluster.pct_id,cluster.ctg_cov,cluster.known_var,cluster.novel_var
out.ctxB.VC006AtopC/report.tsv,yes,yes,ctxB7,100.0,56.4,no,no

The description of the columns is here.

That is, the report tells me that it did find a match to the ctxB7 gene, with 100% identical, with mean read depth 56.4 across the contig with the match.

Acknowledgements

Thanks to my colleague Matt Dorman for help.