Friday, 18 February 2022

Genome Decomposition Analysis (GDA)

 I have been using the Genome Decomposition Analysis (GDA) software by Eerik Aunin and Adam Reid to analyse the genome of the flatworm Schistosoma mansoni. 

 GDA is a new tool that is described in a paper by Aunin, Berriman and Reid (see here).

GDA extracts genomic features (e.g. gene density, repeat density, histone modification peaks) from sliding windows across chromosomes, and then clusters the windows by similarity using HDBSCAN.
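To make the idea of "features from windows" concrete, here is a minimal Python sketch that computes one feature (GC content) in fixed-size windows along a sequence and prints it in a bedgraph-like layout. This is only an illustration and not GDA's own code: the function name gc_content_windows, the window size and the toy sequence are all made up for this example.

```python
def gc_content_windows(sequence, window_size=5000, step=None):
    """Return a list of (start, end, gc_fraction) tuples, one per window.

    Illustration only (not GDA's implementation). If step is None the
    windows do not overlap; a smaller step gives overlapping windows.
    """
    if step is None:
        step = window_size
    seq = sequence.upper()
    windows = []
    for start in range(0, len(seq), step):
        chunk = seq[start:start + window_size]
        # Count only unambiguous bases, so runs of N do not dilute the value.
        acgt = sum(chunk.count(base) for base in "ACGT")
        gc = chunk.count("G") + chunk.count("C")
        fraction = gc / acgt if acgt else 0.0
        windows.append((start, start + len(chunk), fraction))
    return windows


# Toy 30-bp "chromosome" (made up): AT-rich, then GC-rich, then mixed.
toy = "ATATATATAT" + "GCGCGCGCGC" + "ATGCATGCAT"
for start, end, gc in gc_content_windows(toy, window_size=10):
    print(f"toy_chr\t{start}\t{end}\t{gc:.2f}")
```

Each window ends up as one row of numbers (here just one number per window); clustering then groups together the windows whose rows look alike.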

It is very useful for exploring trends across a genome.

I've included some instructions here on how to install and run GDA. However, the latest instructions and many more details can be obtained from the GitHub page for GDA by Eerik Aunin and Adam Reid at Sanger: see https://github.com/eeaunin/gda.

Installing GDA

[Note to self: I did this on the Sanger farm.] 

I installed GDA using the following steps:

First I cloned the GDA git repository. [Note that I used the internal Sanger git repository; outside Sanger you will probably need to use the git repository https://github.com/eeaunin/gda instead.]

% git clone https://gitlab.internal.sanger.ac.uk/ar11/gda.git

Then I ran the conda installation script:

% python gda/create_gda_conda_env.py gda_env gda_downloads gda

Then I activated the conda environment:

% conda activate gda_env

Running GDA

Here is how I ran GDA for the test data set which comes with it, which is for Plasmodium falciparum:

First I ran the feature extraction pipeline:

% bsub -n12 -R"span[hosts=1]" -M10000 -R 'select[mem>10000] rusage[mem=10000]' -o gda_test.o -e gda_test.e "gda extract_genomic_features --threads 12 --pipeline_run_folder gda_pipeline_run gda/test_data/PlasmoDB-49_Pfalciparum3D7_Genome.fasta"

The output results were in the folder gda_pipeline_run.

Next I clustered the genome windows and analysed clusters:

% bsub -n1 -R"span[hosts=1]" -M10000 -R 'select[mem>10000] rusage[mem=10000]' -o gda_clustering_test.o -e gda_clustering_test.e "gda clustering -c 100 -n 5 gda_pipeline_run/merged_bedgraph_table/PlasmoDB-49_Pfalciparum3D7_Genome_merged_bedgraph.tsv"

The clustering output was in the folder gda_out. This is the output folder that I can then use as input for the GDA Shiny app or IGV (see below).
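Before opening the results in a viewer, a quick tally of how many windows fell into each cluster can be a useful sanity check. Here is a minimal Python sketch for that. It assumes the clusters are available as a tab-separated BED file (such as the clusters.bed file that is passed to the IGV script later in this post) with the cluster label in the fourth column; I have not checked GDA's exact column layout, so treat that as an assumption. The function name count_windows_per_cluster and the example lines are made up.

```python
from collections import Counter


def count_windows_per_cluster(bed_lines):
    """Count BED records per label, taking the label from column 4.

    Assumption: tab-separated BED lines of the form chrom, start, end, label.
    Header lines and lines with fewer than four columns are skipped.
    """
    counts = Counter()
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            continue
        counts[fields[3]] += 1
    return counts


# Made-up example records; the labels "0", "1" and "-1" are hypothetical.
example_lines = [
    "chr1\t0\t1000\t0",
    "chr1\t1000\t2000\t1",
    "chr1\t2000\t3000\t0",
    "chr2\t0\t1000\t-1",
]
for label, n in sorted(count_windows_per_cluster(example_lines).items()):
    print(f"cluster {label}: {n} windows")
```

For a real run, the same function can be given an open file handle instead of a list, since it only needs something that yields lines.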

Using the GDA Shiny App

[Note to self: I did this on my Mac laptop rather than on the Sanger farm.]

There is a lovely Shiny App for viewing the GDA results.

To install the Shiny App, I first downloaded the GDA code using:

% git clone https://gitlab.internal.sanger.ac.uk/ar11/gda.git

To install the Shiny App in R, I typed (in R):

> source("gda/gda_shiny/install_gda_shiny_dependencies_without_conda.R")

Then I can start the Shiny App using:

% python3 gda/gda_shiny/gda_shiny.py gda_out_mydata_1kb

where gda_out_mydata_1kb is my output directory from running GDA.

This starts the Shiny App in my browser and I get lovely pictures, such as a UMAP plot showing the GDA clusters.

[Figure: UMAP plot showing the GDA clusters.]

The Shiny App also gives many other nice outputs, for example a heatmap showing input variables for the GDA clusters; a plot showing distribution of GDA clusters across the chromosomes; and a table showing the variables that are significantly different for each particular GDA cluster compared to the other clusters.

Viewing GDA results in the IGV genome browser

[Note to self: I did this on my Mac laptop rather than on the Sanger farm.]

To view the results from GDA in the IGV genome browser, you first need to install the IGV software by following the instructions on the IGV website here.

To load the GDA results into IGV, as well as the bedgraph files of features that GDA used as input, you need to run something like this:

% gda/gda_make_igv_session_file.py -g schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gff3 gda_out_mydata_1kb/cluster_heatmap.csv gda_out_mydata_1kb/schisto_v7/clusters.bed schistosoma_mansoni.PRJEA36577.WBPS14.genomic.fa bedgraph_output_mydata

where schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gff3 is the file with the annotations of genes, mRNAs, etc. for your genome;

gda_out_mydata_1kb is the folder containing the output from your GDA run;

schistosoma_mansoni.PRJEA36577.WBPS14.genomic.fa is the FASTA file of your genome assembly;

bedgraph_output_mydata is the folder with the bedgraph files that were used as input for GDA.

This will make a file igv_session_gda.xml.

Then start up IGV. [Note to self: I have the IGV icon on my desktop on my laptop.]

You can then load this file into IGV by going to File->Open session and choosing 'igv_session_gda.xml' as the session file.

It may be a little slow to load all the data into IGV, but you can look at the bottom right of the IGV screen to see it is loading data (it will say things like '1317M of 2359M', etc.).

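Once it has loaded, you can view the GDA clusters along the bottom of the screen, as well as all the inputs that were used for GDA above that (e.g. GC content, genes, UTRs, etc.).

[Figure: IGV screenshot showing the GDA clusters along the bottom, with the input feature tracks above.]
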
Acknowledgements

A big thank you to Eerik Aunin and Adam Reid for helping me with running GDA.




 

 
