Part 3: Using snakemake on deepthought

snakemake is a pipelining tool that allows you to recreate reusable pipelines to analyse data, and helps you to run things on the cluster.

In the previous section we installed conda/mamba, prinseq++, and snakemake. In this tutorial, we are going to write a small script to run prinseq++ on a data set.

What is prinseq++

When you sequence something you get the raw DNA sequence and it has not been cleaned or trimmed. prinseq++ is a fast tool for taking fastq files (e.g. from BGI, Illumina, or Nanopore) and quality trimming them. You can read more at the prinseq++ website

The command that we are going to run is

        prinseq++ -min_len 60 -min_qual_mean 25 -ns_max_n 1 -derep 1 \
            -out_format 0 -trim_tail_left 5 -trim_tail_right 5 \
            -trim_qual_type min -trim_qual_left 30 \
            -trim_qual_right 30 -trim_qual_window 10 \
            -out_good r1.fastq -out_single r1.single -out_bad r1.bad \
            -out_good2 r2.fastq -out_single2 r2.single -out_bad2 r2.bad \
            -fastq r1.fastq \
            -fastq2 r2.fastq;

Lets break these parameters down and take a look.

Filter on the length and overall quality

-min_len 60 – the minimum length of sequences to include. Assuming Illumina sequencing, you probably have read lengths between 150 and 300bp. Anything shorter will be a bad read. -min_qual_mean 25 – minimum quality score for the mean of the read should be >25. (1 in 10^2.5 error ~ 1:300) which is about 1 bp error per read for 300 bp reads

Filter N’s

-ns_max_n 1 Filter sequences with more than a single N in the sequence. The more N’s we have the less sure we are of the read.

Remove exact duplicate sequences

For (non-16S samples) we would expect exactly duplicated sequences to be diminishingly rare, and they are most likely an artefact of the sequencing reaction. Some sequencing technologies (those that involve amplification) are much more prone to this than others.

-derep 1

Poly A/T Trimming

You can remove poly-A and poly-T sequences at the end of the read with at least 5 A’s. These can be an artefact of mRNA sequencing.

-trim_tail_left 5
-trim_tail_right 5 

Quality based trimming of the reads

You can trim based on min, mean, max, or sum quality scores. In this case, trim using the min score. We take a window of size trim_qual_window bp (so in this case 10 bp) and remove it if the score is less than stated. So in this case, we trim 10 bp off either end if the score is less than 30. We repeat that until we have a 10 bp window whose min score >= 30.

-trim_qual_type min
-trim_qual_left 30 
-trim_qual_right 30 
-trim_qual_window 10

Output options

-out_format 0 - ouput either fastq (0) or fasta (1) format sequences.

The file names of reads 1 (R1) and reads 2 (R2) for paired (R1/2), single (unpaired reads, e.g. if the other read fails quality) (S1/2), or bad (b1/2) reads

-out_good {output.r1} -out_single {output.s1} -out_bad {output.b1}
-out_good2 {output.r2} -out_single2 {output.s2} -out_bad2 {output.b2}

Input reads

Assuming left (R1) and right (R2) reads.

-fastq {input.r1} -fastq2 {input.r2}

But we are going to use some snakemake magic to make the input/output file names automatically.

What is snakemake?

snakemake is a way to write reproducible code. There are lots of tutorials about snakemake and I recommend that you take a look at them. Here we are going to do something very simple: use snakemake to run prinseq++ on a data set.

The dataset

You can use any dataset you like for this step. There are some test data in /home/edwa0468/Tutorials/01snakemake/fastq that you can copy like so:


mkdir example_snakemake
cd example_snakemake
cp -r /home/edwa0468/Tutorials/01snakemake/fastq .
ls fastq/

This will make a directory called fastq with two files, 16714_19_ar.qc_R1.fastq.gz and 16714_19_ar.qc_R2.fastq.gz.

Writing the snakefile

This is coming. For the moment, use the one I wrote for you!

Running snakemake

For now, the example snakefile is in /home/edwa0468/Tutorials/01snakemake/prinseq.snakefile, and you can copy that and run with this command:

snakemake -s prinseq.snakefile --cores 1

This will run on the head node and create a directory called prinseq with all the output files.

$ ls -l prinseq/
total 401536
-rw-r--r-- 1 edwa0468 staff  93655245 Sep 18 15:18 16714_19_ar.qc_R1.bad.fastq
-rw-r--r-- 1 edwa0468 staff 104756189 Sep 18 15:18 16714_19_ar.qc_R1.good.fastq
-rw-r--r-- 1 edwa0468 staff   7983090 Sep 18 15:18 16714_19_ar.qc_R1.single.fastq
-rw-r--r-- 1 edwa0468 staff  64568138 Sep 18 15:18 16714_19_ar.qc_R2.bad.fastq
-rw-r--r-- 1 edwa0468 staff 103598005 Sep 18 15:18 16714_19_ar.qc_R2.good.fastq
-rw-r--r-- 1 edwa0468 staff  36600191 Sep 18 15:18 16714_19_ar.qc_R2.single.fastq

If you try and run that snakefile again, snakemake will say that there is nothing to do.

Now lets run it on the cluster:

snakemake -s prinseq.snakefile --cluster 'sbatch -t 60 --mem=2g -c 1' -j 10 --latency-wait 60

Let’s break this command down:

  • The first part snakemake -s prinseq.snakefile means run the snakefile we wrote above.
  • --cluster 'sbatch -t 60 --mem=2g -c 1' means run this on the cluster using:
    • -t 60 means a amximum time of 60 minutes. In fact, the process finishes much faster.
    • --mem=2g means request 2GB of RAM
    • -c 1 means run on a single CPU.
  • -j 10 means at most 10 jobs. Actually, at the moment we only have a single job!
  • --latency-wait 60 means to wait for the output files for 60 seconds. It takes a while to transfer the files around the system, and we need to ensure that we wait for all the output files before we declare success … or failure!

Up next

Making a snakemake profile