How to install and use Kraken2 on deepthought
Kraken2
Kraken2 is a tool to identify the taxonomy of the things in your environmental sample. It does not tell you what those things are doing, only what is there.
Kraken2 uses k-mers to identify the taxonomy of the microbes in your sample. In essence, the developers have taken all complete genomes and identified all the k-mers that are unique to each taxonomic level. Through some nifty computing and special data structures, they have figured out how to search this very efficiently.
To run Kraken2, you need two things:
- Your data, probably in fastq or fasta format
- A database of things we know about. Note that this is a database in Kraken2 format, not in another format (not even Kraken1)!
Luckily for you, there is a wide range of pre-built kraken databases that you can download, so you do not need to go to the effort of building them yourself. I would very strongly recommend that you use one of the pre-built databases unless you know what you are doing! They will make your bioinformatics easier, and when you come to write up the paper you do not need to worry about explaining what you've done: you can just say that you used the pre-built databases!
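As a rough sketch, downloading and unpacking one of the pre-built databases looks something like this. The URL and tarball name below are placeholders, so substitute the real name of whichever database release you pick, and unpack it into the directory you will later point KRAKEN2_DEFAULT_DB at:
# make a directory for the database and unpack the pre-built tarball into it
mkdir -p ~/kraken2
cd ~/kraken2
wget https://example.com/k2_standard.tar.gz   # placeholder URL for a pre-built Kraken2 database
tar -xzf k2_standard.tar.gz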
Installing Kraken2 on deepthought
You only need to do this step once! When you come back to deepthought you can just run conda activate bioinformatics and you will have kraken2 ready to go!
To install Kraken2 on deepthought, we are going to use conda. If you do not already have conda set up, [here are more install instructions].
We start by activating our conda environment:
conda activate bioinformatics
(Note: if you get the warning Could not find conda environment: bioinformatics, then you need to run conda create -n bioinformatics first to create the environment.)
and now we can install new software:
conda install -y -c bioconda kraken2
This will figure out all the things that need to be installed, and then install them for you. It should not take too long for the installer to complete.
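You can quickly check that the install worked by asking kraken2 for its version:
kraken2 --version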
Next, we need to get the databases. You can either download them from the link above so that you have your own copy, or you can point Kraken2 at Rob's copy and use that. If you are in a class, just use Rob's copy by setting this environment variable:
conda env config vars set KRAKEN2_DEFAULT_DB=~edwa0468/kraken2/latest
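Note that variables set with conda env config vars only take effect the next time you activate the environment, so reactivate it and check that the variable points at the database:
conda deactivate
conda activate bioinformatics
echo $KRAKEN2_DEFAULT_DB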
This is the end of the installation part; next time you can skip straight to the next section!
Using kraken2
We are going to make a variable that holds the name of your fastq file:
Important 1: CHANGE barcode01.fastq to your filename!
export FQFILE=barcode01.fastq
Important 2: CHECK that you can access your file. Try this command:
ls $FQFILE
If you get an error like No such file or directory, you do not have FQFILE set correctly! If your fastq file is in the fastq directory, try this (but change the file name as appropriate!):
export FQFILE=fastq/barcode01.fastq
Running directly
kraken2 --threads 8 --quick --output kraken_output --report kraken_report $FQFILE
Or we can run on the cluster using slurm, as described below.
Finally, we are going to set the default location of the Kraken2 database.
You only need to do this once, and it will make the setting permanent, but be careful because you could break things!
Use nano to edit the file called ~/.bashrc:
nano ~/.bashrc
In the first line, enter:
export KRAKEN2_DEFAULT_DB=$HOME/kraken2
Once you have copied that (note there are no spaces around the =, there is a dollar sign before HOME, and HOME is in capitals while kraken2 is not), press the Ctrl key and x to exit nano.
It will ask you if you want to save the changes:
Save modified buffer (ANSWERING "No" WILL DESTROY CHANGES) ?
Y Yes
N No ^C Cancel
Press y to save the changes and it will exit the program.
As a note, nano is a simple text editor you can use to look at files like sequence files, etc.
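That setting will be picked up by every new shell you open. To apply it to your current shell, and to check that it worked, you can run:
source ~/.bashrc
echo $KRAKEN2_DEFAULT_DB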
The installation is complete and now you can use it to explore your metagenomes.
Using kraken2 on deepthought
A single fastq file
If you have one fastq file (e.g. from an Oxford Nanopore MinION run) you can just use a simple Kraken2 command:
kraken2 --threads 8 --quick --output kraken_output --report kraken_report barcode01.fastq
But remember! This is a cluster, so we really want to use slurm to submit our jobs.
We have created a file for you:
cp ~edwa0468/kraken.slurm .
Or, you can create your own file. We will use nano again to make a script we can run on the cluster:
nano kraken.slurm
and copy these lines into that file:
#!/bin/bash
#SBATCH --ntasks=8
kraken2 --threads 8 --quick --output kraken_out.txt --report kraken_report.txt barcode01.fastq
Again, press Ctrl-x to exit nano, and y to save your file.
There are two important things here:
- No blank line at the start of the file
- No spaces at the beginnings of the lines
Now you can run that on the cluster:
sbatch kraken.slurm
and you can monitor the progress with
squeue
or
squeue -u <FAN>
where <FAN> is your FAN!
Notice that Kraken2 will also output a lot of information in a file that will be called something like slurm-1709843.out (but the number will be totally different). That tells you whether the command has worked or if there was some kind of error.
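If you want to watch that information appear while the job is running, one option is to follow the file with tail (substitute the name of your own slurm output file):
tail -f slurm-1709843.out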
Paired end reads
If you have paired end reads (e.g. from Illumina data) you can modify that kraken command by (1) adding the flag --paired
so that kraken2 knows the sequences are paired and (2) providing two fastq files.
Your command will look something like:
kraken2 --paired --threads 8 --report kraken_taxonomy.txt --output kraken_output.txt fastq/reads_1.fastq fastq/reads_2.fastq
If you want to run it on the cluster, you can copy that line into kraken.slurm instead of the last line, so that it looks like this:
#!/bin/bash
#SBATCH --ntasks=8
kraken2 --paired --threads 8 --report kraken_taxonomy.txt --output kraken_output.txt fastq/reads_1.fastq fastq/reads_2.fastq
Kraken2 outputs
The commands that we have run ask for both the --output and --report outputs from kraken2, so they will produce two files:
kraken_output.txt contains the standard kraken output, one line per read (or read pair), with tab-separated columns:
- A code (C or U) indicating whether the read was classified or not
- The read ID from the fastq file
- The taxonomy ID assigned to the read if it is classified, or 0 if it is not classified
- The length of the sequence in base pairs. Because we are using paired end reads, there are two lengths (R1|R2)
- A space-separated list showing the lowest common ancestor (LCA) mapping for the read, i.e. how many k-mers map to which taxonomic IDs. Because we have paired end information, there is a |:| separator between the R1 and R2 information
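A quick way to summarise that file is to count the classification codes in the first column; this just uses standard command-line tools rather than anything Kraken2-specific:
# count how many reads were classified (C) and unclassified (U)
cut -f 1 kraken_output.txt | sort | uniq -c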
kraken_taxonomy.txt contains the standard kraken report:
- Percent of fragments at that taxonomic level
- Number of fragments at that taxonomic level (the sum of fragments at this level and all those below this level)
- Number of fragments exactly at that taxonomic level
- A taxonomic level code: U (unclassified), R (root), D (domain), K (kingdom), P (phylum), C (class), O (order), F (family), G (genus), or S (species). If the taxon is not at one of these ranks, a number after the code indicates how many levels it sits below the nearest of those ranks. See the docs for more information.
- NCBI taxonomy ID
- Scientific name
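As an example of pulling something useful out of the report, here is one way to list the most abundant species-level lines, assuming the tab-separated columns described above (rank code in column 4, percentage in column 1):
# keep only species-level (S) lines, then sort by the percentage at the start of each line
awk -F"\t" '$4 == "S"' kraken_taxonomy.txt | sort -nr | head -n 10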
For more information about Kraken2, see the wiki page.