
DADA2 pipeline for processing raw 16S sequencing data

Overview 📖

Here we will run through the process of generating an amplicon sequence variant table from raw FASTQ files generated by paired-end Illumina sequencing. For reproducibility, we also provide an Rmarkdown file that covers all the steps required to produce the outputs needed for assembling a phyloseq container object for your microbiome data.

Output                Description
seqtab_nochim.rds     The clean counts table matrix
taxonomy_species.rds  The SILVA-assigned taxonomy table
tree.rds              The de novo phylogenetic tree

We have a Nextflow pipeline for that!

If you are familiar with the bash command line interface, you may also want to try the Nextflow pipeline we have built, which automates the processes described in this guide.

Preparation 🏗️

Conda environment 🐍

Since we require illumina-utils for demultiplexing our raw reads, we recommend creating a conda environment to keep package versions separate from the global base environment. For faster environment creation, we will use the Miniforge3 distribution, which ships the mamba package manager.

How can I install Miniforge3?

Linux
  • Open a new terminal window, and download the appropriate installer for your computer's architecture using the wget command.
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
  • Run the installation script:
bash Miniforge3-$(uname)-$(uname -m).sh
  • The interactive installation will prompt you to initialize conda with your shell.

macOS
  • Ensure you have the Homebrew package manager installed.
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
  • Install miniforge:
brew install --cask miniforge

Windows
  • Download and run the Windows installer.
  • Follow the prompts, taking note to:
    • Create start menu shortcuts
    • Add Miniforge3 to my PATH environment variable (unselected by default)

Info

The latter option above is not selected by default due to potential conflicts with other software. Without Miniforge3 on the PATH, the most convenient way to use the installed software (such as the conda and mamba commands) will be via the "Miniforge Prompt" installed to the start menu.

Warning

There are known issues with the usage of special characters and spaces in the installation location, see for example #484 on the Miniforge3 GitHub. We recommend users install in a directory without any such characters in the name.

Let's create a new environment that houses the tools we need to run the pipeline.

Create a new environment for the DADA2 pipeline
# Create a new conda environment named dada2
mamba create -n dada2 python=3.8 parallel git -y

# Activate the environment
mamba activate dada2

# Install illumina-utils
pip install illumina-utils

Sequencing files 🧬🗒️

Before starting, make sure you have the four files listed below for each run you want to process:

File                          Description
R1.fastq.gz                   FASTQ file for the forward read
R2.fastq.gz                   FASTQ file for the reverse read
Index.fastq.gz                FASTQ file for the index read
barcode_to_sample_[runNN].txt A text file (no column names) that maps index barcodes to samples

Things to pay attention to:

  • FASTQ files must use the Phred+33 quality score format, and sequence headers (@ lines) must fit the standard format of the CASAVA 1.8 output:
@EAS139:136:FC706VJ:2:2104:15343:197393 1:N:0:0
  • The barcode_to_sample_[runNN].txt file must contain two tab-delimited columns: the first for sample names and the second for sample barcodes. Avoid special characters, and ensure you do not include column names. A quick way to check the header and quality formats is sketched below.
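
As a sanity check on the points above, a sketch like the following can peek at the first record of a run's forward reads. This is an illustration only, assuming the Bioconductor ShortRead package is installed; adapt the path to your own data.

Inspect the first FASTQ record
# Stream a single record from the forward reads of run01
library(ShortRead)
fq  <- FastqStreamer(here::here('input', '01_dada2', 'run_data', 'run01', 'R1.fastq.gz'), n = 1)
rec <- yield(fq)
close(fq)

as.character(id(rec))  # header should match the CASAVA 1.8 format shown above
class(quality(rec))    # 'FastqQuality' indicates Phred+33 encoding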

An example directory layout is shown below, and is the format that the Rmarkdown file will be expecting. This layout future-proofs the directory by allowing other inputs, such as metadata or extra omics, to be stored in the same input folder without everything becoming too cluttered.

ProjectDirectory/
├── DADA2_pipeline.Rmd
└── input/
    └── 01_dada2/
        └── run_data/
            ├── run01/
            │   ├── R1.fastq.gz
            │   ├── R2.fastq.gz
            │   ├── Index.fastq.gz
            │   └── barcode_to_sample_run01.txt
            ├── run02/ # (if present)
            │   ├── R1.fastq.gz
            │   ├── R2.fastq.gz
            │   ├── Index.fastq.gz
            │   └── barcode_to_sample_run02.txt
            └── ...

SILVA database 🦠🗃️

You will also require a DADA2-formatted reference database for assigning taxonomy and species-level annotation. The available options are listed on the taxonomic reference data page of the DADA2 website.

Make a note of the paths to each of these files, as you will need to enter these in the Rmarkdown file. We advise storing them in a dedicated databases folder rather than in your individual project directory, but that's up to you!

Download the DADA2-formatted SILVA database

We recommend the DADA2-formatted SILVA database, version 138.2.

Install RAxML 🌳

RAxML (Randomized Axelerated Maximum Likelihood) is a widely used program for phylogenetic tree inference, and can be easily installed, assuming you have root user access.

How do I install RAxML?

Linux
Install multi-threaded RAxML
# Install a C compiler and make if they are not already available
sudo apt update
sudo apt install build-essential

# Download RAxML
cd ~/Downloads
git clone https://github.com/stamatak/standard-RAxML.git
cd standard-RAxML

# Compile the multi-threaded version
make -f Makefile.PTHREADS.gcc

# Move the executable to a system directory
sudo mv raxmlHPC-PTHREADS /usr/bin/
sudo chmod +x /usr/bin/raxmlHPC-PTHREADS

# Check that the executable is correctly installed and accessible
raxmlHPC-PTHREADS -v # should show the version
macOS
  • Ensure you have the Homebrew package manager installed.
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
  • Then you can install RAxML as follows:
Install multi-threaded RAxML
# Install a C compiler and make if they are not already available
brew install gcc make

# Download RAxML
cd ~/Downloads
git clone https://github.com/stamatak/standard-RAxML.git
cd standard-RAxML

# Compile the multi-threaded version
make -f Makefile.PTHREADS.mac_gcc

# Move the executable to a system directory
sudo mv raxmlHPC-PTHREADS /usr/local/bin
sudo chmod +x /usr/local/bin/raxmlHPC-PTHREADS

# If /usr/local/bin is not in your PATH variable, add it now
echo 'export PATH="$PATH:/usr/local/bin"' >> ~/.zshrc
source ~/.zshrc

# Verify installation
raxmlHPC-PTHREADS -v # should show the version

Usage 🏃

Open the Rmarkdown file in RStudio and follow the instructions in the text and comments; the process should be largely self-explanatory.

At the end of the pipeline, your outputs will be in a handy location for further downstream processing.

Output                              Description
input/01_dada2/seqtab_nochim.rds    The clean counts table matrix
input/01_dada2/taxonomy_species.rds The SILVA-assigned taxonomy table
input/01_dada2/tree.rds             The de novo phylogenetic tree
What if my 16S data is already demultiplexed?

This is no problem! Organise everything according to the directory layout above. Then, inside each run, create a folder called demultiplexed and store your files there. Everything else should work as planned, and you can skip the demultiplexing steps at the beginning of the pipeline.
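
If your per-sample FASTQ files follow a <sample>-R1.fastq / <sample>-R2.fastq naming scheme (an assumption; adjust the pattern to your own file names), a small sketch like this can stage them into place:

Stage pre-demultiplexed FASTQs into the expected layout
# Move per-sample FASTQ files into a 'demultiplexed' folder inside each run directory
runs.dirs <- list.dirs(here::here('input', '01_dada2', 'run_data'), recursive = FALSE)
for (d in runs.dirs) {
    demux <- file.path(d, 'demultiplexed')
    if (!dir.exists(demux)) dir.create(demux)
    fqs <- list.files(d, pattern = '-R[12]\\.fastq', full.names = TRUE)
    file.rename(fqs, file.path(demux, basename(fqs)))
}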

Workflow overview 📖

In this section, we will walk through the steps that are performed throughout the pipeline for reference. Most of this information is present in the Rmarkdown file as well.

Demultiplexing 🗒️🪓

Demultiplexing is performed using the iu-demultiplex command from the illumina-utils FASTQ file processing toolbox. If multiple runs have to be processed, this is done in parallel.

The -j argument in the parallel command specifies the number of computing cores to use. You may edit it to your needs (considering both available CPUs and memory).

File naming

Please ensure that the file names are consistent in each run. The four mandatory files are R1.fastq.gz, R2.fastq.gz, Index.fastq.gz, and barcode_to_sample_[runNN].txt.

Demultiplexing code
  • First, we want to check that we don't have any duplicated sample names in our barcode_to_sample_[runNN].txt mapping files.
Check repeating samples
# Load required packages (ldply is from plyr, read_tsv from readr)
library(plyr)
library(readr)

# Retrieve the barcode-to-sample mapping file paths
run_data_fp <- here::here('input', '01_dada2', 'run_data')
barcode_files <- list.files(run_data_fp, pattern = 'barcode', recursive = TRUE,
                            full.names = TRUE)

# Read each of the mapping files into a single data.frame
sample_names_dir <- ldply(barcode_files, function(f) {
    dat <- read_tsv(f, col_names = c('sample_id', 'barcode'), col_types = 'cc')
    return(dat)
})

# Check for duplicated sample names
duplicate_sample_names_dir <- sample_names_dir[duplicated(sample_names_dir[,1]) |
                                               duplicated(sample_names_dir[,1], fromLast = TRUE), ]
duplicate_sample_names_dir
  • If there are any names returned above, change them in your mapping file, and re-run the code above.
  • Then we can move on to the demultiplexing step itself, which is handled in bash and assumes a Unix command-line interface.
Demultiplex the runs
# Activate conda environment if available
if command -v conda >/dev/null 2>&1; then
conda activate dada2
fi

ls -d input/01_dada2/run_data/* | parallel -j -2 '
run_dir="{}"
outputdir="demultiplexed"

# Create the output directory if it does not exist
if [[ ! -d "${run_dir}/${outputdir}" ]]; then
    mkdir "${run_dir}/${outputdir}"
fi

# Enable nullglob so that the for loop does not run if no .fastq.gz files are found
shopt -s nullglob
for f in "${run_dir}"/*.fastq.gz; do
    # Only gunzip if the uncompressed file does not exist
    if [ ! -f "${f%.gz}" ]; then
        gunzip "$f"
    fi
done

# Run demultiplexing; note that the glob is unquoted so it can expand properly
iu-demultiplex -s ${run_dir}/barcode_to_sample* \
                --r1 "${run_dir}/R1.fastq" \
                --r2 "${run_dir}/R2.fastq" \
                -i "${run_dir}/Index.fastq" \
                -x \
                -o "${run_dir}/${outputdir}"
'

Quality check 📉🔍

The DADA2 plotQualityProfile function plots a visual summary of the distribution of quality scores as a function of sequence position for the input FASTQ file(s).

This can take minutes to hours depending on the number of samples and the resources available.
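
The R chunks below rely on a core count nc and on a registered parallel backend for foreach's %dopar% loops; both are assumed to be set near the top of the Rmarkdown file. A minimal setup sketch, assuming the doParallel package, looks like this:

Set up parallel processing
# Choose a core count once and reuse it throughout the pipeline
library(parallel)
library(doParallel)

nc <- max(1, detectCores() - 2)  # leave some headroom for the OS
registerDoParallel(cores = nc)   # enables the %dopar% loops used below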

Quality check code
Run `plotQualityProfile` for each run
# Identify the run directories
runs.dirs <- list.dirs(here::here('input', '01_dada2', 'run_data'), recursive = FALSE)
runs <- basename(runs.dirs)

# Generate the quality profile plots (forward and reverse) for each run
plots <- foreach(i = 1:length(runs), .packages = c('dada2', 'ggplot2')) %dopar% {
    p <- list()
    p[[1]] <- plotQualityProfile(list.files(file.path(runs.dirs[i], 'demultiplexed'),
                                            pattern = 'R1.fastq', full.names = TRUE),
                                 n = 1e+06, aggregate = TRUE) +
        ggtitle(paste('Forward reads |', runs[i]))
    p[[2]] <- plotQualityProfile(list.files(file.path(runs.dirs[i], 'demultiplexed'),
                                            pattern = 'R2.fastq', full.names = TRUE),
                                 n = 1e+06, aggregate = TRUE) +
        ggtitle(paste('Reverse reads |', runs[i]))
    p
}

plots

# Store the quality profiles in each run directory
for (i in 1:length(runs)) {
    saveRDS(plots[[i]], file.path(runs.dirs[i], 'quality_score.pdf.rds'))
    pdf(file.path(runs.dirs[i], 'quality_score.pdf'))
    invisible(lapply(plots[[i]], print))
    invisible(dev.off())
}
Combine plots for all runs
# Recover variables for next chunk
runs.dirs <- list.dirs(here::here('input', '01_dada2', 'run_data'), recursive = FALSE)
runs <- basename(runs.dirs)
plots <- foreach(i = 1:length(runs)) %dopar% {
    readRDS(file.path(runs.dirs[i], 'quality_score.pdf.rds'))
}

nplot.pp <- 4  # number of plots per page
ncol.pp <- 2   # number of columns per page
fig <- foreach(i = seq(1, length(unlist(plots, recursive = FALSE)), by = nplot.pp),
               .packages = c('ggpubr')) %dopar% {
    ggarrange(plotlist = unlist(plots, recursive = FALSE)[i:(i + nplot.pp - 1)],
              ncol = ncol.pp, nrow = nplot.pp / ncol.pp)
}
invisible(lapply(fig, print))
  • In gray-scale is a heat map of the frequency of each quality score at each base position. The median quality score at each position is shown by the green line, and the quartiles of the quality score distribution by the orange lines. The reverse reads are usually of worse quality, especially at the end, which is common in Illumina sequencing.
Save outputs to PDF
# Store the quality profile summary
if (!dir.exists(here::here('input', '01_dada2', 'figures'))) {
    dir.create(here::here('input', '01_dada2', 'figures'))
}
pdf(here::here('input', '01_dada2', 'figures', 'quality_score.pdf'), paper = 'a4')
invisible(lapply(fig, print))
invisible(dev.off())

Quality filtering and trimming 🧬✂️

The DADA2 filterAndTrim function trims sequences to a specified length, removes sequences shorter than that length, and filters based on the number of ambiguous bases, a minimum quality score, and the expected errors in a read. Based on the quality profiles, adjust the trimming parameters for each run. Your reads must still overlap after truncation in order to merge them later: the basic rule is that truncLen must be large enough to maintain 20 + biological.length.variation nucleotides of overlap between them. A worked example of this rule is sketched below.
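
As a worked example of the overlap rule, here is a sketch assuming a hypothetical ~460 bp amplicon with about 10 nucleotides of biological length variation; substitute your own values.

Check the overlap rule
# Hypothetical amplicon geometry -- adjust to your primer set
amplicon_len <- 460   # assumed primer-to-primer amplicon length
biol_var     <- 10    # assumed biological length variation
truncLen     <- c(240, 240)

# Reads are truncated at truncLen (measured from the read start), so a merged
# pair spans at most sum(truncLen) bases of the amplicon
overlap <- sum(truncLen) - (amplicon_len + biol_var)
overlap >= 20  # FALSE here (overlap = 10), so truncLen would need to be raised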

Quality filtering and trimming code
Set up the required parameters
# Recover variables for next chunk
runs.dirs <- list.dirs(here::here('input', '01_dada2', 'run_data'), recursive = F)
runs <- basename(runs.dirs)

# Set up parameters for filtering and trimming (first value applies to R1, second to R2)
truncLen <- c(240, 240)
maxEE <- c(4, 4)
trimLeft <- c(54, 54)
truncQ <- c(2, 2)
maxN <- c(0, 0)
rm.phix <- TRUE

Tunable parameters

Parameter  Description
truncLen   Truncates reads to the specified length (R1, R2). Ensures uniform length but may discard short reads.
maxEE      Filters out reads with more than the specified number of expected errors. Lower values improve accuracy but reduce read count.
trimLeft   Removes the specified number of bases from the start of each read. Useful for removing primers and non-informative regions.
truncQ     Truncates reads at the first base with a quality score ≤ truncQ. Helps remove poor-quality tails.
maxN       Discards reads containing more than the specified number of ambiguous bases (N). Ensures high-quality sequences.
rm.phix    Removes reads matching the PhiX genome, commonly used as a sequencing control, to reduce contamination.
Perform filtering and trimming
# For each run, store the filtered sequences in a new directory named 'filtered'
# (`nc`, the number of cores, is assumed to be defined earlier; see the setup sketch above)
filterAndTrim.out <- vector('list', length(runs))
for (i in 1:length(runs)) {
    fwd.fn <- sort(list.files(file.path(runs.dirs[i], 'demultiplexed'), pattern = 'R1.fastq'))
    rev.fn <- sort(list.files(file.path(runs.dirs[i], 'demultiplexed'), pattern = 'R2.fastq'))
    filterAndTrim.out[[i]] <- filterAndTrim(
        fwd = file.path(runs.dirs[i], 'demultiplexed', fwd.fn),
        filt = file.path(runs.dirs[i], 'filtered', fwd.fn),
        rev = file.path(runs.dirs[i], 'demultiplexed', rev.fn),
        filt.rev = file.path(runs.dirs[i], 'filtered', rev.fn),
        truncLen = truncLen,
        trimLeft = trimLeft,
        maxEE = maxEE,
        truncQ = truncQ,
        maxN = maxN,
        rm.phix = rm.phix,
        compress = TRUE,
        verbose = TRUE,
        multithread = nc)
}

# Store the filtering report in each run directory
filt.plots <- foreach(i = 1:length(runs), .packages = c('ggplot2', 'reshape2')) %do% {
    saveRDS(filterAndTrim.out[[i]], file.path(runs.dirs[i], 'filtering_report.rds'))
    data <- as.data.frame(filterAndTrim.out[[i]])
    row.names(data) <- gsub('_R1_001.fastq|-R1.fastq', '', row.names(data))
    # Convert reads.in to the number of reads *removed* so the bars stack to the input total
    data$reads.in <- data$reads.in - data$reads.out
    p <- ggplot(melt(as.matrix(data)), aes(x = Var1, y = value, fill = Var2)) +
        geom_bar(stat = 'identity') +
        theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
        labs(title = runs[i], x = 'Samples', y = 'Reads', fill = NULL) +
        theme(axis.title.x = element_blank())
    saveRDS(p, file.path(runs.dirs[i], 'filtering_report.pdf.rds'))
    pdf(file.path(runs.dirs[i], 'filtering_report.pdf'))
    print(p)
    invisible(dev.off())
    p
}
pdf(here::here('input', '01_dada2', 'figures', 'filtering_report.pdf'), width = 12, height = 8)
invisible(lapply(filt.plots, print))
invisible(dev.off())
Save the filtering report to PDF
# Recover variables for next chunk
runs.dirs <- list.dirs(here::here('input', '01_dada2', 'run_data'), recursive = FALSE)
runs <- basename(runs.dirs)

filt.plots <- foreach(i = 1:length(runs.dirs)) %dopar% {
    readRDS(file.path(runs.dirs[i], 'filtering_report.pdf.rds'))
}

invisible(lapply(filt.plots, print))

If too few reads are passing the filter, consider whether you can relax the maxEE parameter, or perhaps reduce the truncLen to remove the lower quality tail ends of reads.
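
A quick way to see how harsh the filter was is to compute the fraction of reads retained per sample from the filterAndTrim output, e.g. for the first run:

Check the fraction of reads retained
# filterAndTrim returns a matrix with reads.in and reads.out per sample
ft <- as.data.frame(filterAndTrim.out[[1]])
ft$fraction_retained <- ft$reads.out / ft$reads.in
summary(ft$fraction_retained)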

Sequencing error model generation 🔬📉

The DADA2 algorithm makes use of a parametric error model, and every amplicon dataset has a different set of error rates. The learnErrors method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution.

Sequencing error model code
Generate the sequencing error models
# Recover variables
runs.dirs <- list.dirs(here::here('input', '01_dada2', 'run_data'), recursive = FALSE)
runs <- basename(runs.dirs)

# Learn the error model from the filtered forward and reverse reads of each run
err.model <- foreach(i = 1:length(runs), .packages = c('dada2', 'ggplot2')) %dopar% {
    fwd.fn <- sort(list.files(file.path(runs.dirs[i], 'filtered'), pattern = 'R1.fastq'))
    rev.fn <- sort(list.files(file.path(runs.dirs[i], 'filtered'), pattern = 'R2.fastq'))
    err <- list()
    err[[1]] <- learnErrors(file.path(runs.dirs[i], 'filtered', fwd.fn), nbases = 1e8, multithread = nc)
    err[[2]] <- learnErrors(file.path(runs.dirs[i], 'filtered', rev.fn), nbases = 1e8, multithread = nc)
    err
}

# Plot the error models
err.plots <- foreach(i = 1:length(runs), .packages = c('dada2', 'ggplot2')) %do% {
    p <- list()
    p[[1]] <- plotErrors(err.model[[i]][[1]], nominalQ = TRUE) +
        ggtitle(paste(runs[i], '| forward reads'))
    p[[2]] <- plotErrors(err.model[[i]][[2]], nominalQ = TRUE) +
        ggtitle(paste(runs[i], '| reverse reads'))
    p
}

# Store the error models in each run directory
for (i in 1:length(runs)) {
    saveRDS(err.model[[i]], file.path(runs.dirs[i], 'error_model.rds'))
    saveRDS(err.plots[[i]], file.path(runs.dirs[i], 'error_model.pdf.rds'))
    pdf(file.path(runs.dirs[i], 'error_model.pdf'))
    invisible(lapply(err.plots[[i]], print))
    invisible(dev.off())
}
Combine the error models into a condensed summary
# Recover variables
runs.dirs <- list.dirs(here::here('input', '01_dada2', 'run_data'), recursive = FALSE)
runs <- basename(runs.dirs)
err.plots <- foreach(i = 1:length(runs)) %dopar% {
    readRDS(file.path(runs.dirs[i], 'error_model.pdf.rds'))
}

nplot.pp <- 2  # number of plots per page
ncol.pp <- 2   # number of columns per page
fig <- foreach(i = seq(1, length(unlist(err.plots, recursive = FALSE)), by = nplot.pp),
               .packages = c('ggpubr')) %dopar% {
    ggarrange(plotlist = unlist(err.plots, recursive = FALSE)[i:(i + nplot.pp - 1)],
              ncol = ncol.pp, nrow = nplot.pp / ncol.pp)
}
invisible(lapply(fig, print))

Transitions (A→C, A→G, …) are shown. Points are the observed error rates for each consensus quality score. The black line shows the estimated error rates after convergence of the machine-learning algorithm. The red line shows the error rates expected under the nominal definition of the Q-score. Here the estimated error rates (black line) are a good fit to the observed rates (points), and the error rates drop with increased quality as expected.

Save the error model summary to PDF
# Store the error model summary
pdf(here::here('input', '01_dada2', 'figures', 'error_model.pdf'), width = 12, height = 6)
invisible(lapply(fig, print))
invisible(dev.off())

Count table generation 🧮

A table of amplicon sequence variants is constructed. To avoid overloading memory, runs and samples are processed sequentially. The process starts with sequence dereplication, then goes through amplicon sequence variant (ASV) inference, and ends with paired-end (PE) merging.

  • Dereplication combines all identical sequencing reads into “unique sequences”, each with an “abundance” equal to the number of reads with that unique sequence. Dereplication in the DADA2 pipeline has one crucial addition over other pipelines: DADA2 retains a summary of the quality information associated with each unique sequence. The consensus quality profile of a unique sequence is the average of the positional qualities from the dereplicated reads. These consensus scores are then used by the error model of the dada function.
  • Amplicon sequence variant (ASV) inference is the core stage of the DADA2 package (the dada function). It assigns all reads to error-corrected sequences using the error-rate models from the previous step.
  • Paired-end (PE) merging performs a global ends-free alignment between paired forward and reverse reads and merges them if they exactly overlap. It requires that the input forward and reverse reads are in the same order. Note that merging in the DADA2 pipeline happens after denoising, hence the strict requirement of exact overlap: nearly all substitution errors are expected to have already been removed.
Count table generation code
Generate the counts table
# Recover variables
runs.dirs <- list.dirs(here::here('input', '01_dada2', 'run_data'), recursive = FALSE)
runs <- basename(runs.dirs)
err.model <- foreach(i = 1:length(runs)) %dopar% {
    readRDS(file.path(runs.dirs[i], 'error_model.rds'))
}

for (i in 1:length(runs)) {

    fwd.fn <- sort(list.files(file.path(runs.dirs[i], 'filtered'), pattern = 'R1.fastq'))
    rev.fn <- sort(list.files(file.path(runs.dirs[i], 'filtered'), pattern = 'R2.fastq'))
    sample.names <- sapply(strsplit(basename(fwd.fn), 'R1.fastq'), `[`, 1)
    sample.names.rev <- sapply(strsplit(basename(rev.fn), 'R2.fastq'), `[`, 1)
    if (!identical(sample.names, sample.names.rev)) stop('Forward and reverse files do not match.')
    names(fwd.fn) <- sample.names
    names(rev.fn) <- sample.names

    # Process samples one at a time to limit memory usage
    merged <- vector('list', length(sample.names))
    names(merged) <- sample.names
    for (j in 1:length(sample.names)) {
        derep <- vector('list', 2)
        derep[[1]] <- derepFastq(file.path(runs.dirs[i], 'filtered', fwd.fn[j]))
        derep[[2]] <- derepFastq(file.path(runs.dirs[i], 'filtered', rev.fn[j]))
        asv <- vector('list', 2)
        asv[[1]] <- dada(derep[[1]], err = err.model[[i]][[1]], pool = TRUE, multithread = nc)
        asv[[2]] <- dada(derep[[2]], err = err.model[[i]][[2]], pool = TRUE, multithread = nc)
        merged[[sample.names[j]]] <- mergePairs(asv[[1]], derep[[1]], asv[[2]], derep[[2]])
    }

    st <- makeSequenceTable(merged)
    saveRDS(st, file.path(runs.dirs[i], 'seqtab.rds'))
}

Info

Most of your reads should successfully merge. If that is not the case, upstream parameters may need to be revisited.
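
A rough spot check for a single sample, reusing the asv and merged objects left in memory from the last loop iteration above (an illustration; adapt it to inspect any sample of interest):

Spot-check the merge rate
# Fraction of denoised forward reads that merged for the last processed sample
last <- length(merged)
sum(merged[[last]]$abundance) / sum(getUniques(asv[[1]]))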

Merge runs 🤝🧪

At this point, all of the individual per-run seqtab tables are merged into a single global seqtab.

Merge runs code
Merge runs
# Recover variables
runs.dirs <- list.dirs(here::here('input', '01_dada2', 'run_data'), recursive = FALSE)

seqtab.fps <- file.path(runs.dirs, 'seqtab.rds')
if (length(seqtab.fps) == 1) {
    seqtab <- readRDS(seqtab.fps[[1]])
} else {
    seqtab <- mergeSequenceTables(tables = seqtab.fps)
}

# Save the merged table to the dada2 input folder
saveRDS(seqtab, here::here('input', '01_dada2', 'seqtab.rds'))

Chimera screening 🦄

The dada algorithm models and removes substitution errors, but chimeras are another important source of spurious sequences in amplicon sequencing.

What are chimeras?

Chimeras are artifacts generated during PCR amplification when an incomplete amplicon acts as a primer for a subsequent extension step. This occurs when a partially synthesized DNA fragment anneals to a different template in later cycles, leading to the formation of a hybrid sequence. As a result, the final amplicon consists of segments from two distinct parental sequences, producing a misleading chimeric read that does not represent a true biological variant.

Chimera screening code
Screen and remove chimeric reads
# Recover variables (fwrite is from the data.table package)
library(data.table)
seqtab <- readRDS(here::here('input', '01_dada2', 'seqtab.rds'))

# Remove chimeric reads
seqtab_nochim <- removeBimeraDenovo(seqtab, method = 'per-sample', multithread = nc, verbose = TRUE)
saveRDS(seqtab_nochim, here::here('input', '01_dada2', 'seqtab_nochim.rds'))
fwrite(as.data.frame(seqtab_nochim), here::here('input', '01_dada2', 'seqtab_nochim.txt'), quote = FALSE, sep = '\t')

# Inspect the distribution of sequence lengths after chimera removal
distrib <- table(nchar(getSequences(seqtab_nochim)))
distrib_plot <- function() {
    plot(distrib, xlab = 'Read length', ylab = 'Number of ASVs')
}
saveRDS(distrib, here::here('input', '01_dada2', 'length_distribution.rds'))
pdf(here::here('input', '01_dada2', 'figures', 'length_distribution.pdf'))
distrib_plot()
invisible(dev.off())
Check the post-chimera removal dimensions and inspect length distributions
# Recover variables
seqtab <- readRDS(here::here('input', '01_dada2', 'seqtab.rds'))
seqtab_nochim <- readRDS(here::here('input', '01_dada2', 'seqtab_nochim.rds'))
distrib <- readRDS(here::here('input', '01_dada2', 'length_distribution.rds'))
distrib_plot <- function() {
    plot(distrib, xlab = 'Read length', ylab = 'Number of ASVs')
}

# Check the dimensions of the table before chimera removal
dim(seqtab)

# Check the dimensions of the table after chimera removal
dim(seqtab_nochim)

# Check the distribution of sequence lengths
distrib_plot()
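
Chimeras can account for a large fraction of unique ASVs, but usually only a small fraction of total reads. A quick check of the proportion of reads retained:

Check the fraction of reads surviving chimera removal
# Should typically be close to 1; much lower values suggest unremoved primers
sum(seqtab_nochim) / sum(seqtab)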

Reads tracking 📊🛤️

As a final check of our progress, we look at the number of reads that made it through each step in the pipeline.

Double check the retention of reads

Outside of filtering (first step) there should be no step in which a majority of reads are lost. If a majority of reads failed to merge, you may need to revisit the truncLen parameter used in the filtering step and make sure that the truncated reads span your amplicon. If a majority of reads were removed as chimeric, you may need to revisit the removal of primers, as the ambiguous nucleotides in unremoved primers interfere with chimera identification.

Reads tracking code
Generate reads tracking plot and save to PDF
# Recover variables
runs.dirs <- list.dirs(here::here('input', '01_dada2', 'run_data'), recursive = FALSE)
runs <- basename(runs.dirs)
seqtab <- readRDS(here::here('input', '01_dada2', 'seqtab.rds'))
seqtab_nochim <- readRDS(here::here('input', '01_dada2', 'seqtab_nochim.rds'))

track.plots <- foreach(i = 1:length(runs), .packages = c('ggplot2', 'reshape2')) %do% {
    filtering <- readRDS(file.path(runs.dirs[i], 'filtering_report.rds'))
    row.names(filtering) <- gsub('_R1_001.fastq|-R1.fastq', '', row.names(filtering))
    track <- cbind(filtering[row.names(filtering) %in% row.names(seqtab), ],
                   rowSums(seqtab[row.names(seqtab) %in% row.names(filtering), ]),
                   rowSums(seqtab_nochim[row.names(seqtab_nochim) %in% row.names(filtering), ]))
    colnames(track) <- c('Input', 'Filtered', 'Merged', 'Non chimeric')
    # Convert cumulative counts to per-step increments so the bars stack to the input total
    for (j in (ncol(track) - 1):1) {
        for (k in (j + 1):ncol(track)) {
            track[, j] <- track[, j] - track[, k]
        }
    }
    p <- ggplot(melt(as.matrix(track)), aes(x = Var1, y = value, fill = Var2)) +
        geom_bar(stat = 'identity') +
        theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 4)) +
        labs(title = runs[i], x = 'Samples', y = 'Reads', fill = NULL)
    saveRDS(p, file.path(runs.dirs[i], 'read_tracking_report.pdf.rds'))
    pdf(file.path(runs.dirs[i], 'read_tracking_report.pdf'))
    print(p)
    invisible(dev.off())
    p
}
pdf(here::here('input', '01_dada2', 'figures', 'read_tracking_report.pdf'), width = 8, height = 6)
invisible(lapply(track.plots, print))
invisible(dev.off())
View the reads tracking plot
# Recover variables
runs.dirs <- list.dirs(here::here('input', '01_dada2', 'run_data'), recursive = FALSE)
track.plots <- foreach(i = 1:length(runs.dirs)) %dopar% {
    readRDS(file.path(runs.dirs[i], 'read_tracking_report.pdf.rds'))
}

invisible(lapply(track.plots, print))

Taxonomy Assignment 🧬🏷️

The DADA2 package provides a native implementation of the naive Bayesian classifier method for this purpose. The assignTaxonomy function takes as input a set of sequences to be classified and a training set of reference sequences with known taxonomy, and outputs taxonomic assignments with at least minBoot bootstrap confidence.

Update your SILVA database version

As of SILVA version 138.2, the DADA2-formatted version can assign to the species level in a single step. Please update your database file if you are using an older version.

Taxonomy assignment code
Assign taxonomy to the species level
# Recover variables
seqtab_nochim <- readRDS(here::here('input', '01_dada2', 'seqtab_nochim.rds'))

# Path to the DADA2-formatted reference database
# Replace the existing path with the path to your SILVA training set
db_fp <- here::here('..', '..', '20_Databases', 'Silva', 'silva_nr99_v138.2_toSpecies_trainset.fa.gz')

taxonomy <- assignTaxonomy(seqtab_nochim, db_fp, minBoot = 50, tryRC = TRUE, multithread=nc)
saveRDS(taxonomy, here::here('input', '01_dada2', 'taxonomy_species.rds'))
fwrite(as.data.frame(taxonomy), here::here('input', '01_dada2', 'taxonomy_species.txt'), quote = F, sep = '\t')

Trying the reverse complement

The tryRC = TRUE option ensures that both the original sequences and their reverse complements are considered when matching against the reference database. This is useful when the sequencing orientation is unknown or inconsistent, helping to improve taxonomic assignment accuracy by allowing for potential reverse-complement matches.
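
To eyeball the assignments without the unwieldy ASV-sequence row names, a small sketch in the spirit of the DADA2 tutorial:

Inspect the taxonomy assignments
# Print the first few assignments with sequence rownames removed for display
taxa_print <- taxonomy
rownames(taxa_print) <- NULL
head(taxa_print)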

Export in Qiime classic OTU table-like format 💾🔠

The DADA2 pipeline provides results as a count table of ASVs per sample and a taxonomic classification of each ASV, in two separate files. As detailed in the DADA2 tutorial, these two objects can easily be used with the phyloseq R package for subsequent data analysis.

  • For compatibility with other data analysis tools, a count table in a tab-delimited text format matching the Qiime classic OTU table format is also created.
  • The table contains samples in columns and ASVs in rows. The taxonomy at the species level is added as an extra 'taxonomy' column, along with a 'sequence' column. The first column contains mock OTU IDs.
Qiime-style OTU table export code
Export OTU table in classic Qiime format
# Recover variables
seqtab_nochim <- readRDS(here::here('input', '01_dada2', 'seqtab_nochim.rds'))
taxonomy_species <- readRDS(here::here('input', '01_dada2', 'taxonomy_species.rds'))

# Define a function to create the classic OTU table
dada2otu <- function(seqtab = NULL, taxonomy = NULL) {
    out <- as.data.frame(cbind(c(1:nrow(taxonomy)), t(as.data.frame(seqtab)),
                               apply(as.data.frame(taxonomy), 1, paste, collapse = '; '),
                               colnames(seqtab)))
    row.names(out) <- c(1:nrow(out))
    names(out) <- c('#OTU ID', row.names(as.data.frame(seqtab)), 'taxonomy', 'sequence')
    return(out)
}

# Export OTU table
fwrite(dada2otu(seqtab_nochim, taxonomy_species), here::here('input', '01_dada2', 'otu_table.txt'), quote = FALSE, sep = '\t', buffMB = 100)

# Export total count table
total_counts <- as.data.frame(cbind(row.names(as.data.frame(seqtab_nochim)), rowSums(seqtab_nochim)))
names(total_counts) <- c('SampleID', 'Total count')
fwrite(total_counts, here::here('input', '01_dada2', 'total_counts.txt'), quote = FALSE, sep = '\t')

Create a phylogenetic tree 🌳🧬

The DADA2 sequence inference method is reference-free, so we must construct the phylogenetic tree relating the inferred sequence variants de novo. We begin by performing a multiple sequence alignment using the DECIPHER R package; the phangorn and ape packages are then used to convert the alignment, and a GTR+G+I (generalized time-reversible with Gamma rate variation and invariant sites) maximum likelihood tree is inferred with RAxML. Adapted from the Callahan et al. pipeline.

De novo phylogenetic tree code
Create phylogenetic tree using RAxML
# Load required packages (the raxml() wrapper comes from the ips package)
library(DECIPHER)
library(phangorn)
library(ape)
library(ips)
library(dplyr)

# Recover variables for next chunk
seqtab_nochim <- readRDS(here::here('input', '01_dada2', 'seqtab_nochim.rds'))
taxonomy_species <- readRDS(here::here('input', '01_dada2', 'taxonomy_species.rds'))

# Extract ASV sequences
seqs <- colnames(seqtab_nochim)
names(seqs) <- seqs

# Align sequences
alignment <- AlignSeqs(DNAStringSet(seqs), anchor = NA)
phang_align <- phyDat(as(alignment, 'matrix'), type = 'DNA')

# Convert phyDat class to DNAbin format
alignment_dnabin <- as.DNAbin(phang_align)

# Prepare RAxML-safe names (short, unique, no dots)
seq_names_tidy <- data.frame(original = rownames(alignment_dnabin)) %>%
    mutate(raxml_names = 'seq') %>%
    mutate(raxml_names = gsub('\\.', '_', make.unique(raxml_names)))
rownames(alignment_dnabin) <- seq_names_tidy$raxml_names

# Run RAxML using the raxml function
# Set threads to the number of **physical** cores you have
fitGTR <- raxml(DNAbin = alignment_dnabin,
                m = 'GTRGAMMAI',   # GTR + Gamma rate variation + invariant sites
                f = 'a',           # rapid bootstrap analysis and best-scoring ML tree
                N = 100,           # number of bootstrap replicates
                p = 12345,         # parsimony random seed
                x = 12345,         # bootstrap random seed
                threads = 24,
                exec = '/usr/bin/raxmlHPC-PTHREADS',
                file = 'dada2')

# Move the best tree from the project directory to your dada2 input folder,
# remove the other intermediate files, and read the resulting tree back in
file.copy(from = here::here('RAxML_bestTree.dada2'),
          to = here::here('input', '01_dada2', 'RAxML_bestTree.dada2'))
file.remove(list.files(pattern = 'RAxML|dada2.phy'))
tree <- read.tree(here::here('input', '01_dada2', 'RAxML_bestTree.dada2'))

# Restore the original tip labels (the ASV sequences)
tip_labels <- data.frame(raxml_names = tree$tip.label) %>%
    left_join(seq_names_tidy, by = 'raxml_names')
tree$tip.label <- tip_labels$original

# Force dichotomy (resolve any multifurcations)
tree_dich <- ape::multi2di(tree)

# Re-root the tree explicitly using an outgroup (i.e. the first tip)
tree_rooted <- root(tree_dich,
                    outgroup = tree_dich$tip.label[1],
                    resolve.root = TRUE)

# Save to disk
saveRDS(tree_rooted, here::here('input', '01_dada2', 'tree.rds'))
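
The pipeline itself ends here, but as an illustration of the downstream hand-off mentioned in the overview, here is a minimal sketch (assuming the phyloseq package is installed) of how the three saved outputs assemble into a phyloseq object:

Assemble a phyloseq object (downstream illustration)
library(phyloseq)

# Load the three pipeline outputs
seqtab_nochim    <- readRDS(here::here('input', '01_dada2', 'seqtab_nochim.rds'))
taxonomy_species <- readRDS(here::here('input', '01_dada2', 'taxonomy_species.rds'))
tree             <- readRDS(here::here('input', '01_dada2', 'tree.rds'))

# Combine into a single container (sample metadata can be added via sample_data())
ps <- phyloseq(otu_table(seqtab_nochim, taxa_are_rows = FALSE),
               tax_table(taxonomy_species),
               phy_tree(tree))
ps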

Rights

  • Copyright ©️ 2025 Mucosal Immunology Lab, Monash University, Melbourne, Australia and Service de Pneumologie, Centre Hospitalier Universitaire Vaudois (CHUV), Switzerland.
  • Licence: This pipeline is provided under the MIT license.
  • Pipeline authors: A. Rapin, C. Pattaroni, A. Butler, B.J. Marsland, M. Macowan
  • This document: M. Macowan