How to install and use Kraken2 on deepthought

Kraken2

Kraken2 is a tool to identify the taxonomy of the things in your environmental sample. It does not identify what those things are doing, but what is there.

Kraken2 uses k-mers to identify the taxonomy of the microbes in your sample. In essence, they have taken all complete genomes, and then identified all k-mers that are unique to each taxonomic level. Through some nifty computing, and special data structures, they have figured out how to search this very efficiently.

To run Kraken2, you need two things:

  1. Your data, probably in fastq or fasta format
  2. A database of things we know about. Note that this is a databse in Kraken2 format, not in another format (not even Kraken1)!

Luckily for you, there are a wide range of pre-built kraken databases that you can download, so you do not need to go to the effort of building them yourself. I would very strongly recommend that you use one of the prebuilt databases unless you know what you are doing! They will make your bioinformatics easier, and when you come to write up the paper you do not need to worry about explaining what you’ve done: you can say taht you used the prebuilt databases!

Installing Kraken2 on deepthought

To install Kraken2 on deepthought, we are going to use conda. Once you have conda installed, [here are more install instructions] you can just type:

conda create -y -n kraken2 -c bioconda kraken2

This will figure out all the things that need to be installed, and then install them for you. It should not take too long for the installer to complete.

Once it has completed, you will need to activate the conda environment:

conda activate kraken2

Next, we need to get the databases. You can either download them from the link above so that you have your own copy, or you can make a link to Rob’s copy and use that. If you are in a class, just link to Rob’s copy using this code:

export KRAKEN2_DEFAULT_DB=~edwa0468/kraken2/latest

We are going to make a variable name with your file:

Important 1: CHANGE barcode01.fastq to your filename!

export FQFILE=barcode01.fastq

Important 2: CHECK can you access your file. Try this command

ls $FQFILE
~~~

If it says `File Not Found` you do not have `FQFILE` set correctly! If your fastq file is in the `fastq` directory, try this: `export FQFILE=fastq/barcode01.fastq` but change the file name as appropriate!



and now we'll run that on the cluster again!

```bash
cp -f ~edwa0468/kraken.slurm ~/
sbatch ~/kraken.slurm

You should now have the results!

Running directly

kraken2 --threads 8 --quick --output kraken_output --report kraken_report barcode01.fastq

Or we can run on the cluster

Finally, we are going to set the default location of the path.

You only need to do this one, and it will set up the setting permanently, but be careful because you could break things!

Use nano to edit the file called ~/.bashrc

nano ~/.bashrc

In the first line, enter:

export KRAKEN2_DEFAULT_DB=$HOME/kraken2

Once you have copied that (note there are no spaces around the =, there is a dollar sign before HOME, and HOME is capitals while kraken2 is not) press the Ctrl key and x to exit nano.

It will ask you if you want to save the changes:

Save modified buffer (ANSWERING "No" WILL DESTROY CHANGES) ?                                                                                                                                                                                  Y Yes
 N No           ^C Cancel

Press y to save the changes and it will exit the program.

As a note, nano is a simple text editor you can use to look at files like sequence files, etc.

The installation is complete and now you can use it to explore your metagenomes.

Using kraken2 on deepthought

A single fastq file

If you have one fastq file (e.g. from a Oxford Nanopore MinION run) you can just use a simple Kraken command

kraken2 --threads 8 --quick --output kraken_output --report kraken_report barcode01.fastq

But remember! this is a cluster, so we really want to use slurm to submit our jobs

We will use nano again to make a script we can run on the cluster:

nano kraken.slurm

and copy these 5 lines into that file:

#!/bin/bash

#SBATCH --ntasks=8

kraken2 --threads 8 --quick --output kraken_out.txt --report kraken_report.txt barcode01.fastq

Again, press Ctrl-x to exit nano, and y to save your file.

There are two important things here:

  1. No blank line at the start of the file
  2. No spaces at the beginnings of the lines

Now you can run that on the cluster:

sbatch kraken.slurm

and you can monitor the progress with

squeue

or

squeue -u <FAN>

where <FAN> is your FAN!

Notice that Kraken2 will also output a lot of information in a file that will be called something like slurm-1709843.out (but the number will be totally different). That tells you whether the command has worked or if there was some kind of error.

Paired end reads

If you have paired end reads (e.g. from Illumina data) you can modify that kraken command by (1) adding the flag --paired so that kraken2 knows the sequences are paired and (2) providing two fastq files.

Your command will look something like:

kraken2 --paired --threads 8 --report kraken_taxonomy.txt --output kraken_output.txt fastq/reads_1.fastq fastq/reads_2.fastq

If you want to run it on the cluster, you can copy that line into kraken.slurm instead of the last line, so that it looks like

#!/bin/bash

#SBATCH --ntasks=8

kraken2 --paired --threads 8 --report kraken_taxonomy.txt --output kraken_output.txt fastq/reads_1.fastq fastq/reads_2.fastq

Kraken2 outputs

The commands that we have run ask for both the --output and --report outputs from kraken2:

This will output two files:

  • kraken_output.txt contains the standard kraken output:
    • A code (C or U) indicating whether the read was classified or not
    • The read ID from the fastq file
    • The taxonomy ID assigned to the read if it is classified, or 0 if it is not classified
    • The length of the sequence in base pairs. Because we are using paired end reads, there are two lengths (R1|R2)
    • A space-separated list of the lowest common ancestor for each sequence that indicates how many kmers map to which taxonomic IDs. Because we have paired end information, there is a |:| separator between the R1 and R2 information
  • kraken_taxonomy.txt contains the standard kraken report:
    • Percent of fragments at that taxonomic level
    • Number of fragments at that taxonomic level (the sum of fragments at this level and all those below this level)
    • Number of fragments exactly at that taxonomic level
    • A taxonomic level code: Unclassified, Root, Domain, Kingdom, Phylum, Class, Order, Family, Genus, or Species. If the taxonomy is not one of these the number indicates the levels between this node and the appropriate node. See the docs for more information.
    • NCBI Taxonomic name
    • Scientific name

For more information about Kraken2, see the wiki page