What is it?
Atavide is a pipeline that integrates a lot of different tools to annotate and analyze your (random) metagenomics data. The idea is to quickly and easily get you to the hard part: thinking about what your data really means. We have abstracted out all the intermediate analyses so you don’t have to do them.
Who did it?
Atavide is written by Rob and Mike with some helpful input from other friends.
How do I cite it?
Please see the citation file for the current citation.
This guide is currently written for using atavide on the Flinders University HPC, deepthought. You will probably need to modify a few things to make it work on another HPC, and we’re working on streamlining some of these points and making the installation and running easier.
For the Flinders HPC, you will need a few things set up. If you have some of these, feel free to skip that step.
- Access to the HPC
- A VPN connection — but you only need this if you are not at Flinders
- conda and snakemake installed in your account
- A snakemake profile — just copy Mike’s — it’s what we all do.
- If you need pointers, here are some helpful Linux commands
Currently, you will need to get a copy of atavide using git:
# start off in your home directory
cd
# get a copy of atavide
git clone https://github.com/linsalrob/atavide.git
This will download all the files that you need to run atavide. Everything else will be downloaded when you start processing your data!
Your sequence data
You will need some sequence data. Atavide works with compressed or uncompressed fastq files. Put them in a directory called fastq inside the directory where you will work.
e.g. this set of commands will make the directories for you
mkdir --parents sequence_data/fastq
cd sequence_data/fastq
## copy your sequences here
cd ..
and now you end up in the directory called sequence_data, which has one directory called fastq containing your sequence data.
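Putting the steps above together, here is a minimal sketch that builds and then verifies the expected layout. The sample file names are hypothetical stand-ins for your own reads:

```shell
# Build the layout atavide expects, then list it to confirm.
# sample1_R1.fastq.gz / sample1_R2.fastq.gz are hypothetical stand-ins
# for your own (optionally gzip-compressed) fastq files.
mkdir --parents sequence_data/fastq
touch sequence_data/fastq/sample1_R1.fastq.gz sequence_data/fastq/sample1_R2.fastq.gz
ls sequence_data/fastq
```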
Make a copy of the settings file
It is good practice to copy the configuration/settings file into your working directory. Then, the next time you come back to this project, you will know exactly what was done.
Assuming you are in the directory with the fastq directory inside it (e.g. sequence_data in the above example), this code will copy the configuration file for you.
cp $HOME/atavide/config/atavide.yaml .
Please note: You do not need to change any of the basic configuration in this file, and hopefully our defaults make sense.
Cleaning out host sequences
You may not need to do this step! (In which case, skip to the next section). You only need this step if you want to remove potential host contamination (e.g. human genomes) from your data.
Atavide includes the option to remove host genome sequences; to use it, add two additional directives to the atavide.yaml file. You will also need a host genome file.

To the directories set, add a new option called host_dbpath that contains the path to the host database.

To the options set, add a new option called host_dbname with the name of the actual host database you are going to use.

For example, a modified atavide.yaml file might look like:
# These are the directories where we read the data and store the output
# feel free to change these names
directories:
  Reads: "fastq"
  round1_assembly_output: "assembly.1"
  round1_contig_read_mapping: "reads.contigs.1"
  round2_unassembled_reads: "unassembled_reads"
  round2_assembly_output: "reassembled_reads"
  reads_vs_final_assemblies: "reads_vs_final_assemblies"
  prinseq: "QC"
  statistics: "statistics"
  combined_contig_merging: "final.combined_contigs"
  read_based_annotations: "ReadAnnotations"
  host_dbpath: '/home/edwa0468/hecatomb/databases/human_masked'

options:
  host_dbname: 'human_virus_masked'
To run the pipeline, use the command:
snakemake -s $HOME/atavide/workflow/atavide.snakefile --configfile atavide.yaml --profile slurm
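If you expect to rerun the pipeline, one option (a sketch, not part of atavide itself; run_atavide.sh is a hypothetical name) is to wrap the command in a small launcher script so the exact invocation is recorded alongside your data. Snakemake's -n flag does a dry run that lists the planned jobs without submitting anything:

```shell
# Write a small launcher so the exact command is kept with the project.
# run_atavide.sh is a hypothetical name; the snakemake command inside it
# is the one from this guide. The quoted 'EOF' keeps $HOME unexpanded
# until the script actually runs.
cat > run_atavide.sh <<'EOF'
#!/bin/bash
snakemake -s $HOME/atavide/workflow/atavide.snakefile \
          --configfile atavide.yaml --profile slurm "$@"
EOF
chmod +x run_atavide.sh
# ./run_atavide.sh -n   # dry run: list the planned jobs only
# ./run_atavide.sh      # the real run
```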
What does it do?
Atavide runs the following pieces of code for you!
To bin the metagenomes into MAGs (metagenome-assembled genomes) we:
- Assemble the reads from each sample separately with megahit
- Merge those contigs, and map all the reads from each sample back to the contigs
- Find reads that have not been mapped
- Assemble all of the unmapped reads together in one big pile with megahit
- Merge those contigs with the original contigs using flye
- Map all the reads back to the final contigs and generate a table of read coverages
To calculate read based statistics we: