Thursday, 24 February 2022

Calculating assembly statistics

 [Note: this is useful to Sanger users only.]

There is a nice program called 'assembly-stats' for calculating assembly statistics on the Sanger farm. 

Find the latest version of it:

% module avail -t | grep -i stats
assembly-stats/1.0.1

Load the module:
% module load assembly-stats/1.0.1

Now run it on an assembly file '2038_EDC_717.fas':

% assembly-stats -t 2038_EDC_717.fas
filename        total_length    number  mean_length     longest shortest        N_count Gaps    N50     N50n    N70     N70n    N90     N90n
2038_EDC_717.fas        3816803 82      46546.38        362749  1020    2       1       163759  8       89245   15      24898   30

If you have a whole directory of assembly files all ending in '.fas', you can make a bourne-shell script to run assembly_stats on them, with a for loop:

#!/bin/sh

# see https://alvinalexander.com/blog/post/linux-unix/bourne-shell-script-for-loop-edit-files/
for i in `ls *.fas`
do
    echo "$i"
    assembly-stats -t $i > $i.stats
done

This makes a file .stats for each assembly file (e.g. 2038_EDC_717.fas.stats for assembly file 2038_EDC_717.fas).



Friday, 18 February 2022

Genome Decomposition Analysis (GDA)

 I have been using the Genome Decomposition Analysis (GDA) software by Eerik Aunin and Adam Reid to analyse the genome of the flatworm Schistosoma mansoni. 

 GDA is a new tool that is described in a paper by Aunin, Berriman and Reid (see here).

GDA extracts genomic features (e.g. gene density, repeat density, histone modification peaks, etc.) from sliding windows across chromosomes, and then clusters the genomic windows by similarity using HDBSCAN within GDA.

It is very useful for exploring trends across a genome.

I've included some instructions here on how to install and run GDA. However, the latest instructions and many more details can be obtained from the github page for GDA by Eerik Aunin and Adam Reid at Sanger: see https://github.com/eeaunin/gda.

Installing GDA

[Note to self: I did this on the Sanger farm.] 

I installed GDA using the following steps:

First I cloned the GDA git repository: [Note that I used the Sanger git repository; you probably need to use the git repository https://github.com/eeaunin/gda.]

% git clone https://gitlab.internal.sanger.ac.uk/ar11/gda.git

Then I ran the conda installation script:

% python gda/create_gda_conda_env.py gda_env gda_downloads gda

Then I activated the conda environment:

 % conda activate gda_env

Running GDA

Here is how I ran GDA for the test data set which comes with it, which is for Plasmodium falciparum:

First I ran the feature extraction pipeline:

% bsub -n12 -R"span[hosts=1]" -M10000 -R 'select[mem>10000] rusage[mem=10000]' -o gda_test.o -e gda_test.e "gda extract_genomic_features --threads 12 --pipeline_run_folder gda_pipeline_run gda/test_data/PlasmoDB-49_Pfalciparum3D7_Genome.fasta"

The output results were in the folder gda_pipeline_run.

Next I clustered the genome windows and analysed clusters:

% bsub -n1 -R"span[hosts=1]" -M10000 -R 'select[mem>10000] rusage[mem=10000]' -o gda_clustering_test.o -e gda_clustering_test.e "gda clustering -c 100 -n 5 gda_pipeline_run/merged_bedgraph_table/PlasmoDB-49_Pfalciparum3D7_Genome_merged_bedgraph.tsv"

The clustering output is in the folder gda_out. This is the output file that I can then use as input into the GDA Shiny app or IGV (see below). 

Using the GDA Shiny App

[Note to self: I did this on my Mac laptop rather than on the Sanger farm.]

There is a lovely Shiny App for viewing the GDA results.

To install the Shiny App, I first downloaded the GDA code using:

% git clone https://gitlab.internal.sanger.ac.uk/ar11/gda.git

To install the Shiny App in R, I typed (in R):

> source("gda/gda_shiny/install_gda_shiny_dependencies_without_conda.R")

Then I can start the Shiny App using:

% python3 gda/gda_shiny/gda_shiny.py gda_out_mydata_1kb

where gda_out_mydata_1kb is my output directory from running GDA.

This starts the Shiny App in my browser and I get lovely pictures like this UMAP plot showing the GDA clusters:



 










 

The Shiny App also gives many other nice outputs, for example a heatmap showing input variables for the GDA clusters; a plot showing distribution of GDA clusters across the chromosomes; and a table showing the variables that are significantly different for each particular GDA cluster compared to the other clusters.

Viewing GDA results in the IGV genome browser:

[Note to self: I did this on my Mac laptop rather than on the Sanger farm.]

To view the results from GDA in the IGV genome browser, you first need to install the IGV software by following the instructions on the IGV website here.

To load the GDA results into IGV,  as well as the bedgraph files of features that GDA used as input, you need to run something like this:

% gda/gda_make_igv_session_file.py -g schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gff3 gda_out_mydata_1kb/cluster_heatmap.csv gda_out_mydata_1kb/schisto_v7/clusters.bed schistosoma_mansoni.PRJEA36577.WBPS14.genomic.fa bedgraph_output_mydata

where schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gff3 is the file with the annotations of genes, mRNAs, etc. for your genome;

gda_out_mydata_1kb is the folder containing the output from your GDA run;

bedgraph_output_mydata is the folder with input bedgraph files used as input for GDA.

This will make a file igv_session_gda.xml.  

Then start up IGV [Note to self: I have the IGV icon on my desktop on my laptop.]

Then if you start up IGV you can go load this file into IGV by going to File->Open session, and then choose 'igv_session_gda.xml' as the session file. 

It may be a little slow to load all the data into IGV, but you can look at the bottom right of the IGV screen to see it is loading data (it will say things like '1317M of 2359M', etc.).

Once it has loaded, you can view the GDA clusters along the bottom of the screen, as well as all the inputs that were used for GDA above that (e.g. GC content, genes, UTRs, etc.):



 


 

 




Acknowledgements

A big thank you to Eerik Aunin and Adam Reid for helping me with running GDA.




 

 

Tuesday, 1 February 2022

Exporting protein sequences from Artemis

 I had a gff with the annotation and genome sequences for some contigs, and wanted to export the protein sequences from Artemis. I wrote previously on how to run Artemis here but that was ages ago, so had to remind myself!

To log into the Sanger compute cluster and run Artemis:

% ssh -Y pcs6

% module avail -t | grep -i art

% module load artemis/18.1.0

% art

This then opened Artemis, and to load my gff file, I used the 'File' menu. 

Then to select my genes of interest, I went to the 'View' menu, and choose 'CDS genes and products', and then went to the 'Select' menu and chose 'All CDS features', and chose my genes of interest from the list. Then I went to the 'Write' menu and chose 'Amino acids of selected features to file', and this wrote a file with the protein sequences for my genes of interest. Great!