Atavide: atavistic metagenome analysis with a vision
What is it?
Atavide is a pipeline that integrates many different tools to annotate and analyze your random (shotgun) metagenomics data. The idea is to get you quickly and easily to the hard part: thinking about what your data really means. We have abstracted away all the intermediate analyses so you don’t have to do them.
Who did it?
Atavide is written by Rob and Mike with some helpful input from other friends.
How do I cite it?
Please see the citation file for the current citation.
Using atavide
This guide is currently written for using atavide on the Flinders University HPC, deepthought. You will probably need to modify a few things to make it work on another HPC, and we’re working on streamlining some of these points and making the installation and running easier.
Prerequisites
For the Flinders HPC, you will need a few things set up. If you already have some of these, feel free to skip those steps.
- Access to the HPC
- A VPN connection — but you only need this if you are not at Flinders
- conda and snakemake installed in your account (if you need them, see the example below)
- A snakemake profile — just copy Mike’s — it’s what we all do.
- If you need pointers, here are some helpful Linux commands
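If you do not yet have conda and snakemake installed, one common way to set them up (a sketch only, assuming you already have conda via miniconda or similar; your HPC may have its own preferred modules or channels) is to create a dedicated environment:
# create an environment with snakemake from the conda-forge/bioconda channels
# (the environment name "atavide" is just a suggestion)
conda create -n atavide -c conda-forge -c bioconda snakemake
conda activate atavide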
Get atavide
Currently, you will need to get a copy of atavide using git:
# start off in your home directory
cd
# get a copy of atavide
git clone https://github.com/linsalrob/atavide.git
This will download all the files that you need to run atavide. Everything else will be downloaded when you start processing your data!
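If you want to update atavide later, you can pull the latest changes into the same directory (assuming you cloned into your home directory as above):
# update an existing copy of atavide
cd $HOME/atavide
git pull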
Your sequence data
You will need some sequence data. Atavide works with compressed or uncompressed fastq files. Put them in a directory called fastq inside the directory where you will work.
For example, this set of commands will make the directories for you:
mkdir --parents sequence_data/fastq
cd sequence_data/fastq
## copy your sequences here
cd ..
and now you end up in the directory called sequence_data, which has one directory called fastq with your sequence data inside it.
Make a copy of the settings file
It is good practice to copy the configuration/settings file to your working location. Then, the next time you come back to this project, you will know what was done.
Assuming you are in the directory with the fastq directory inside it (e.g. sequence_data in the above example), this command will copy the configuration file for you:
cp $HOME/atavide/config/atavide.yaml .
Please note: You do not need to change any of the basic configuration in this file, and hopefully our defaults make sense.
Cleaning out host sequences
You may not need to do this step! You only need it if you want to remove potential host contamination (e.g. human genomes) from your data; otherwise, skip to the next section.
Atavide includes the option to remove host genome sequences by adding two additional directives to the atavide.yaml file. If you are interested in this, you will need a host genome file. Add the following two directives to atavide.yaml:
- To the directories section, add a new option called host_dbpath that contains the path to the host database.
- To the options section, add a new option called host_dbname with the name of the actual host database you are going to use.
For example, a modified atavide.yaml file might look like:
# These are the directories where we read the data and store the output
# feel free to change these names
directories:
  Reads: "fastq"
  round1_assembly_output: "assembly.1"
  round1_contig_read_mapping: "reads.contigs.1"
  round2_unassembled_reads: "unassembled_reads"
  round2_assembly_output: "reassembled_reads"
  reads_vs_final_assemblies: "reads_vs_final_assemblies"
  prinseq: "QC"
  statistics: "statistics"
  combined_contig_merging: "final.combined_contigs"
  read_based_annotations: "ReadAnnotations"
  host_dbpath: '/home/edwa0468/hecatomb/databases/human_masked'
options:
  host_dbname: 'human_virus_masked'
Running atavide
To run the pipeline, use the command:
snakemake -s $HOME/atavide/workflow/atavide.snakefile --configfile atavide.yaml --profile slurm
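If you want to check that everything is set up before committing to the full run, snakemake's -n (dry run) flag will print the jobs that would be executed without actually running anything:
# dry run: show what would be done, but do not run anything
snakemake -s $HOME/atavide/workflow/atavide.snakefile --configfile atavide.yaml --profile slurm -n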
What does it do?
Atavide runs the following pieces of code for you!
To bin the metagenomes into MAGs (metagenome-assembled genomes), we:
- Assemble the reads from each sample separately with megahit
- Merge those contigs, and map all the reads from each sample back to the contigs
- Find reads that have not been mapped (see the illustrative sketch after this list)
- Assemble all of the unmapped reads together in one big pile with megahit
- Merge those contigs with the original contigs using flye
- Map all the reads back to the final contigs and generate a table of read coverages
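As a rough illustration of the "find reads that have not been mapped" step (atavide does this for you, and the exact tools, flags, and file names it uses may differ), extracting unmapped read pairs after mapping to the round 1 contigs could look something like:
# illustrative only: map reads to contigs, keep reads that did not map (-f 4)
# file names here are placeholders, not atavide's actual outputs
minimap2 -ax sr contigs.fasta reads_R1.fastq.gz reads_R2.fastq.gz | \
    samtools fastq -f 4 -1 unmapped_R1.fastq -2 unmapped_R2.fastq -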
To calculate read-based statistics, we:
- Run focus
- Run super-focus
- Generate a super-focus based taxonomy, provided you add an SQL taxonomy file (see the hypothetical config sketch after this list)
- Run kraken2
- Run singlem
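The exact configuration key for that SQL taxonomy file is not shown in the example above; purely as a hypothetical sketch (check the comments in atavide.yaml for the real option name), it would be another entry in the options block, for example:
options:
  host_dbname: 'human_virus_masked'
  # hypothetical key name -- check atavide's own atavide.yaml for the real one
  taxonomy_sql: '/path/to/taxonomy.sqlite'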