Atavide: atavistic metagenome analysis with a vision


What is it?

Atavide is a snakemake pipeline that integrates many different tools to annotate and analyze your (random) metagenomics data. The idea is to get you quickly and easily to the hard part: thinking about what your data really means. We have abstracted away all the intermediate analyses so you don't have to do them yourself.

Who did it?

Atavide is written by Rob and Mike with some helpful input from other friends.

How do I cite it?

Please see the citation file for the current citation.

Using atavide

This guide is currently written for using atavide on the Flinders University HPC, deepthought. You will probably need to modify a few things to make it work on another HPC; we are working on streamlining those points and making installation and running atavide easier.

Prerequisites

For the Flinders HPC, you will need a few things set up. If you already have some of these, feel free to skip those steps.
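
As a rough sketch of what those prerequisites usually amount to, you will need conda (or mamba) and snakemake available; the environment name and channels below are just examples, and deepthought may already provide modules for these.

# a minimal sketch: install snakemake into its own conda environment
# (the environment name "atavide" is just an example)
conda create -n atavide -c conda-forge -c bioconda snakemake
conda activate atavide
# the run command later in this guide uses --profile slurm, which assumes
# a snakemake slurm profile is configured for your cluster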

Get atavide

Currently, you will need to get a copy of atavide using git:

# start off in your home directory
cd
# get a copy of atavide
git clone https://github.com/linsalrob/atavide.git

This will download all the files that you need to run atavide. Everything else will be downloaded when you start processing your data!

Your sequence data

You will need some sequence data. Atavide works with compressed or uncompressed fastq files. Put them in a directory called fastq inside the directory where you will work.

e.g. this set of commands will make the directories for you

mkdir --parents sequence_data/fastq
cd sequence_data/fastq
## copy your sequences here
cd ..

You will now be in a directory called sequence_data that contains a single directory, fastq, with your sequence data inside it.
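
For example, if you have paired-end, gzip-compressed reads, copying them in from inside sequence_data might look like this (the file names and paths are placeholders):

# copy paired-end reads into the fastq directory
# (sample_R1.fastq.gz and sample_R2.fastq.gz are example names only)
cp /path/to/sample_R1.fastq.gz /path/to/sample_R2.fastq.gz fastq/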

Make a copy of the settings file

It is good practice to copy the configuration/settings file into your working directory. Then, the next time you come back to this project, you will know what was done.

Assuming you are in the directory with the fastq directory inside it, e.g. you are in sequence_data in the above example, this code will copy the configuration file for you.

cp $HOME/atavide/config/atavide.yaml .

Please note: You do not need to change any of the basic configuration in this file, and hopefully our defaults make sense.

Cleaning out host sequences

You may not need to do this step! (In which case, skip to the next section). You only need this step if you want to remove potential host contamination (e.g. human genomes) from your data.

Atavide includes the option to remove host genome sequences by adding two additional directives to the atavide.yaml file. If you want to use this, you will need a host genome database. Add the following two directives to atavide.yaml:

To the directories section, add a new option called host_dbpath that contains the path to the directory holding the host database.

To the options section (add it if it is not already there), add a new option called host_dbname with the name of the actual host database you are going to use.

For example, a modified atavide.yaml file might look like:

# These are the directories where we read the data and store the output
# feel free to change these names
directories:
        Reads: "fastq"
        round1_assembly_output: "assembly.1"
        round1_contig_read_mapping: "reads.contigs.1"
        round2_unassembled_reads: "unassembled_reads"
        round2_assembly_output: "reassembled_reads"
        reads_vs_final_assemblies: "reads_vs_final_assemblies"
        prinseq: "QC"
        statistics: "statistics"
        combined_contig_merging: "final.combined_contigs"
        read_based_annotations: "ReadAnnotations"
        host_dbpath: '/home/edwa0468/hecatomb/databases/human_masked'
options:
        host_dbname: 'human_virus_masked'
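
Host contamination is removed with bowtie2 (see the step list below), so the host database is presumably a bowtie2 index. If you only have the host genome as a fasta file, a hypothetical way to build one yourself looks like this; host.fasta, host_db and host_genome are placeholder names, not atavide defaults.

# hypothetical example: build a bowtie2 index from a host genome fasta file
mkdir --parents host_db
bowtie2-build host.fasta host_db/host_genome
# then point host_dbpath at the host_db directory and set host_dbname to host_genome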

Running atavide

To run the pipeline, use the command:

snakemake -s $HOME/atavide/workflow/atavide.snakefile --configfile atavide.yaml --profile slurm
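
Before launching a full run, it can be worth doing a dry run to check that snakemake finds your reads and configuration. The --profile slurm option assumes a snakemake slurm profile is set up on your cluster; outside an HPC you can instead run locally with --cores (the core count below is just an example).

# dry run: show what would be executed without actually running anything
snakemake -s $HOME/atavide/workflow/atavide.snakefile --configfile atavide.yaml --profile slurm -n

# example local run without slurm (adjust the number of cores to your machine)
snakemake -s $HOME/atavide/workflow/atavide.snakefile --configfile atavide.yaml --cores 16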

What does it do?

Atavide runs the following pieces of code for you!

  1. Clean the data with prinseq++
  1a. (Optional) Remove host contamination with bowtie2

To bin the metagenomes into MAGs, we take the following steps (a sketch of where the outputs end up follows the list):

  1. Assemble the reads from each sample separately with megahit
  2. Merge those contigs, and map all the reads from each sample back to the contigs
  3. Find reads that have not been mapped
  4. Assemble all of the unmapped reads together in one big pile with megahit
  5. Merge those contigs with the original contigs using flye
  6. Map all the reads back to the final contigs and generate a table of read coverages
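
Assuming the default directory names from atavide.yaml shown above, the outputs of these steps land in your working directory, so you can keep an eye on a run with something like:

# check on the QC, assembly and mapping outputs
# (directory names are the defaults from atavide.yaml above)
ls QC assembly.1 reads.contigs.1 unassembled_reads reassembled_reads
ls final.combined_contigs reads_vs_final_assemblies statistics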

To calculate read-based statistics, we take the following steps (see the note after the list for where the outputs go):

  1. Run focus
  2. Run super-focus
  3. Generate a super-focus based taxonomy, provided you add an SQL taxonomy file
  4. Run kraken2
  5. Run singlem
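
With the default settings above, the outputs from these read-based tools end up in the ReadAnnotations directory (the read_based_annotations setting in atavide.yaml):

# the read-based annotation outputs, using the default directory name
ls ReadAnnotations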