avrilomics: February 2022

[Note: this is useful to Sanger users only.]

There is a nice program called 'assembly-stats' for calculating assembly statistics on the Sanger farm.

Find the latest version of it:

% module avail -t | grep -i stats
assembly-stats/1.0.1

Load the module:
% module load assembly-stats/1.0.1

Now run it on an assembly file '2038_EDC_717.fas':

% assembly-stats -t 2038_EDC_717.fas
filename total_length number mean_length longest shortest N_count Gaps N50 N50n N70 N70n N90 N90n
2038_EDC_717.fas 3816803 82 46546.38 362749 1020 2 1 163759 8 89245 15 24898 30

If you have a whole directory of assembly files all ending in '.fas', you can make a bourne-shell script to run assembly_stats on them, with a for loop:

#!/bin/sh

# see https://alvinalexander.com/blog/post/linux-unix/bourne-shell-script-for-loop-edit-files/
for i in `ls *.fas`
do
echo "$i"
assembly-stats -t $i > $i.stats
done

This makes a file .stats for each assembly file (e.g. 2038_EDC_717.fas.stats for assembly file 2038_EDC_717.fas).

I have been using the Genome Decomposition Analysis (GDA) software by Eerik Aunin and Adam Reid to analyse the genome of the flatworm Schistosoma mansoni.

GDA is a new tool that is described in a paper by Aunin, Berriman and Reid (see here).

GDA extracts genomic features (e.g. gene density, repeat density, histone modification peaks, etc.) from sliding windows across chromosomes, and then clusters the genomic windows by similarity using HDBSCAN within GDA.

It is very useful for exploring trends across a genome.

I've included some instructions here on how to install and run GDA. However, the latest instructions and many more details can be obtained from the github page for GDA by Eerik Aunin and Adam Reid at Sanger: see https://github.com/eeaunin/gda.

Installing GDA

[Note to self: I did this on the Sanger farm.]

I installed GDA using the following steps:

First I cloned the GDA git repository: [Note that I used the Sanger git repository; you probably need to use the git repository https://github.com/eeaunin/gda.]

% git clone https://gitlab.internal.sanger.ac.uk/ar11/gda.git

Then I ran the conda installation script:

% python gda/create_gda_conda_env.py gda_env gda_downloads gda

Then I activated the conda environment:

% conda activate gda_env

Running GDA

Here is how I ran GDA for the test data set which comes with it, which is for Plasmodium falciparum:

First I ran the feature extraction pipeline:

% bsub -n12 -R"span[hosts=1]" -M10000 -R 'select[mem>10000] rusage[mem=10000]' -o gda_test.o -e gda_test.e "gda extract_genomic_features --threads 12 --pipeline_run_folder gda_pipeline_run gda/test_data/PlasmoDB-49_Pfalciparum3D7_Genome.fasta"

The output results were in the folder gda_pipeline_run.

Next I clustered the genome windows and analysed clusters:

% bsub -n1 -R"span[hosts=1]" -M10000 -R 'select[mem>10000] rusage[mem=10000]' -o gda_clustering_test.o -e gda_clustering_test.e "gda clustering -c 100 -n 5 gda_pipeline_run/merged_bedgraph_table/PlasmoDB-49_Pfalciparum3D7_Genome_merged_bedgraph.tsv"

The clustering output is in the folder gda_out. This is the output file that I can then use as input into the GDA Shiny app or IGV (see below).

Using the GDA Shiny App

[Note to self: I did this on my Mac laptop rather than on the Sanger farm.]

There is a lovely Shiny App for viewing the GDA results.

To install the Shiny App, I first downloaded the GDA code using:

% git clone https://gitlab.internal.sanger.ac.uk/ar11/gda.git

To install the Shiny App in R, I typed (in R):

> source("gda/gda_shiny/install_gda_shiny_dependencies_without_conda.R")

Then I can start the Shiny App using:

% python3 gda/gda_shiny/gda_shiny.py gda_out_mydata_1kb

where gda_out_mydata_1kb is my output directory from running GDA.

This starts the Shiny App in my browser and I get lovely pictures like this UMAP plot showing the GDA clusters:

The Shiny App also gives many other nice outputs, for example a heatmap showing input variables for the GDA clusters; a plot showing distribution of GDA clusters across the chromosomes; and a table showing the variables that are significantly different for each particular GDA cluster compared to the other clusters.

Viewing GDA results in the IGV genome browser:

[Note to self: I did this on my Mac laptop rather than on the Sanger farm.]

To view the results from GDA in the IGV genome browser, you first need to install the IGV software by following the instructions on the IGV website here.

To load the GDA results into IGV, as well as the bedgraph files of features that GDA used as input, you need to run something like this:

% gda/gda_make_igv_session_file.py -g schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gff3 gda_out_mydata_1kb/cluster_heatmap.csv gda_out_mydata_1kb/schisto_v7/clusters.bed schistosoma_mansoni.PRJEA36577.WBPS14.genomic.fa bedgraph_output_mydata

where schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gff3 is the file with the annotations of genes, mRNAs, etc. for your genome;

gda_out_mydata_1kb is the folder containing the output from your GDA run;

bedgraph_output_mydata is the folder with input bedgraph files used as input for GDA.

This will make a file igv_session_gda.xml.

Then start up IGV [Note to self: I have the IGV icon on my desktop on my laptop.]

Then if you start up IGV you can go load this file into IGV by going to File->Open session, and then choose 'igv_session_gda.xml' as the session file.

It may be a little slow to load all the data into IGV, but you can look at the bottom right of the IGV screen to see it is loading data (it will say things like '1317M of 2359M', etc.).

Once it has loaded, you can view the GDA clusters along the bottom of the screen, as well as all the inputs that were used for GDA above that (e.g. GC content, genes, UTRs, etc.):

Acknowledgements

A big thank you to Eerik Aunin and Adam Reid for helping me with running GDA.

avrilomics

Thursday, 24 February 2022

Calculating assembly statistics

Friday, 18 February 2022

Genome Decomposition Analysis (GDA)

Tuesday, 1 February 2022

Exporting protein sequences from Artemis

About Me