Thursday, 24 February 2022

Calculating assembly statistics

 [Note: this is useful to Sanger users only.]

There is a nice program called 'assembly-stats' for calculating assembly statistics on the Sanger farm. 

Find the latest version of it:

% module avail -t | grep -i stats
assembly-stats/1.0.1

Load the module:
% module load assembly-stats/1.0.1

Now run it on an assembly file '2038_EDC_717.fas':

% assembly-stats -t 2038_EDC_717.fas
filename        total_length    number  mean_length     longest shortest        N_count Gaps    N50     N50n    N70     N70n    N90     N90n
2038_EDC_717.fas        3816803 82      46546.38        362749  1020    2       1       163759  8       89245   15      24898   30

If you have a whole directory of assembly files all ending in '.fas', you can make a bourne-shell script to run assembly_stats on them, with a for loop:

#!/bin/sh

# see https://alvinalexander.com/blog/post/linux-unix/bourne-shell-script-for-loop-edit-files/
for i in `ls *.fas`
do
    echo "$i"
    assembly-stats -t $i > $i.stats
done

This makes a file .stats for each assembly file (e.g. 2038_EDC_717.fas.stats for assembly file 2038_EDC_717.fas).



No comments: