Friday, 4 September 2015

usearch for sequence clustering

I've been looking at Robert Edgar's USEARCH software for fast sequence similarity search and clustering algorithms. Here are some uses: (note: I've used version 5 here as it is installed on our computer system at work, but there is a more recent version available, version 7. The user guide for v5 is here and the user guide for v7 is here.)

Sorting sequences by length:
% /nfs/users/nfs_j/jit/bin//usearch_v5.0.151 --sort input.fa --output sorted_input.fa --log usearch.log
where /nfs/users/nfs_j/jit/bin//usearch_v5.0.151 is an example path to a USEARCH installation,
input.fa is your fasta file of input sequences,
sorted_input.fa is the output file of sorted sequences,
usearch.log is a log file produced.

% usearch -sortbylength input.fa --output sorted_input.fa

Clustering sequences by similarity (using global alignment):
% /nfs/users/nfs_j/jit/bin//usearch_v5.0.151 --cluster sorted_input.fa --id 0.7 --seedsout my_seeds.fa --uc result.uc --log search.log -consout my_consensus.fa
where /nfs/users/nfs_j/jit/bin//usearch_v5.0.151 is an example path to a USEARCH installation,
sorted_input.fa is your fasta file of input sequences,
--id 0.7 says to cluster sequences if they are >=70% identical,
--seedsout my_seeds.fa prints out the 'seed' sequences for clusters to an output fasta file my_seeds.fa; there is one seed sequence from the input fasta file sorted_input.fa per cluster,
--uc result.uc prints out the clusters found (in UCLUST format) to an output file result.uc,
--log search.log creates a log file search.log,
-consout my_consensus.fa creates an output file with the consensus sequence for each cluster.
Note that the input sequences in sorted_input.fa should be sorted by length (see Sorting sequences by length above).
Note also that my_consensus.fa is a consensus based on aligning the sequences in each cluster. However, as far as I understand, you can create a more accurate consensus by first creating a more accurate alignment for each cluster, using staralign (see Create an alignment below).

% usearch -cluster_fast sorted_input.fa --id 0.7 --centroids my_centroids.fa --uc result.uc -consout my_consensus.fa -msaout aligned.fasta
where the extra -msaout option produces a file aligned.fasta that contains an alignment file for each cluster. The --centroids my_centroids.fa option produces a file with the centroids found in the clustering step.

Create an alignment for the sequences in each cluster, and create a consensus sequence for the sequences in each cluster based on the alignment:
In USEARCH v5, the clustering step above creates a consensus sequence file using the -consout option, but you can create a more accurate consensus sequence by re-aligning the sequences in clusters.
First get a list of the sequences in clusters (in UCLUST format), and put them in a fasta file: 
% grep "^[ SH]" result.uc > sh.uc
% /nfs/users/nfs_j/jit/bin//usearch_v5.0.151  --uc2fastax sh.uc --input sorted_input.fa --output sh.fasta
The file sh.fasta has all the sequences in each cluster.

Now create a multiple alignment of the sequences in each cluster, inserting additional gaps in the fasta file sh.fasta: 
% /nfs/users/nfs_j/jit/bin//usearch_v5.0.151 --staralign sh.fasta --output aligned.fasta
aligned.fasta has all the sequences in each cluster, aligned to each other.
Lastly, create a consensus sequence for each cluster, based on the alignment file:
% /nfs/users/nfs_j/jit/repository/pathogen/user/jit/repeat_scripts/ my_seeds.fa aligned.fasta

The my_seeds.fa contains the unaligned 'seed' sequences used for each cluster (one input sequence per cluster), and aligned.fasta contains an alignment of all the sequences in each cluster. This step makes file aligned.fasta.consensus.fa, which has one consensus sequence per cluster. 
[This is using a script by my colleague Jason Tsai, which tries to find the most common base at each position in the alignment of a cluster. Steps in this perl script:
(i) read in the sequences in my_seeds.fa, and store them in hash %original
(ii) read in the file aligned.fasta, and count the number of sequences in each cluster, and store in hash %cluster_size
(iii) if there is just one sequence in a cluster, the script prints out the sequence in %original for that cluster,
(iv) if there are multiple sequences in a cluster, the script reads in the bases found in all the sequences in the cluster, at each position of the alignment, and stores this in hash %consensus (with index $cluster and $i, where $cluster is cluster number and $i is alignment position). Then for each alignment position, it ignores '-', 'N' or '.' letters in the alignment column, and takes the consensus base to be the most common base found at that alignment position.]

 In USEARCH v7, you don't need the -uc2fastax and -staralign steps, as the -cluster command (see Clustering sequences above) has a -msaout option to make an alignment for each cluster, and the -consout option gives a consensus sequence based on that multiple alignment (see the USEARCH manual). Therefore, the -consout consensus sequence file produced by USEARCH v7 can be used. 

I've learnt about USEARCH by studying notes and scripts made by Jason Tsai - thanks Jason!

No comments: