Analysing 16S data: Part 2
Session Overview
During this session we will cover the fundamentals of amplicon-based microbiome analysis.
Details of the individual session components are included below:
1. Using RStudio
2. Loading microbiome data into R
3. Normalizing count data
4. Loading data into phyloseq
5. Plotting figures
6. Analysis of alpha diversity
7. Analysis of beta diversity
Session One: Part 2
1. Using RStudio
In the first part of this session we used the Linux command line to process our 16S data and we used the text editor Nano to document each step in this process. Having generated an OTU count matrix and an associated taxonomy, we will now use the statistical programming language R to further analyse these data.
R can also be run from the command line (try typing ‘R’ on the Linux command line). However, for this session we will use RStudio, which is a freely available Integrated Development Environment (IDE) for R.
First, from the Linux command line create a new working directory called Session1.2, then link the output from the first part of Session 1 to this directory.
Once you have set up your working directory, return to the homepage for these practical sessions
and click on the link to open RStudio.
This will open a Graphical User Interface (GUI) in your browser, which will look
something like the image below. If you do not see the panel on the top left, on the control bar
select “File” > “New File” > “RScript”. Brielfy, there should now be four open tabs: the top left is
an R script, which is analagous to the Nano script we created during the first part of this session;
the bottom left is the R console, which contains a command prompt similar to the one on the Linux
command line; the tab on the top right provide information on your current R session; and the tab on
the bottom right provides additional useful information, such as help pages, or a window for viewing
graphical plots.
We will be writing most of our code in the script window (top left), and running it in the R console.
Start by typing an R command in the script window, then pressing Ctrl + Return
to execute it in
the console.
In the first half of this session, we were running tools (e.g. USEARCH) from within the working
directory: ~/MCA/16s/Session1
. We have just created a new working directory for the second half
of the session: ~/MCA/16s/Session1.2
. We now need to make sure that RStudio is running within
our current working directory.
Please ask if you have any further questions about working in RStudio.
2. Loading microbiome data into R
The first step is to load data into the R session. In addition to the OTU count matrix and taxonomy table we will also load an example file containing metadata for each sample. These tables will appear as data frames.
3. Normalizing count data
There are many reasons why sampling of 16S rDNA may not be consistent across samples. For example, the amount of material used for genomic DNA extraction may be variable. The exact number of reads generated by sequencing platforms is also variable between runs. Furthermore, when samples are multiplexed, technical error during pooling may mean that different volumes of each sample are loaded onto the flow cell.
Try plotting a histogram of the number of sequences per OTU.
Together, these factors mean that i) the number of sequence reads generated per sample is likely to be highly variable, and ii) that estimating the exact number of 16S gene sequences in the original sample is generally not possible. Estimates of the abundance of different taxa within a sample are therefore relative, and it is necessary to normalize counts before comparing a single taxon between samples.
Data normalization (adjusting for differences in sequencing depth) can be performed on the count
matrix once it has been loaded into R. There are several different normalization methods one of
which is to rarefy (i.e. randomly sample without replacement) data so that an equal number of
sequence counts are present for each sample. Rarefying requires loading an R package called
metseqR
.
For discussion of the issues surrounding rarefying data (including disambiguation of the terms rarefying and rarefaction) see this PLoS article.
4. Loading data into phyloseq
The package metaseqR
contains functions specifically developed for the analysis of sequence data.
A similar package is phyloseq
, which has been developed exclusively for 16S analysis. It provides
many convenient functions for data visualization and exploration.
To use phyloseq
it’s necessary to combine our count, taxonomy, and metadata data frames into a
single phyloseq
object.
For further information on using phyloseq
have a look at the
online documentation.
3.b. Normalizing data within phyloseq
In addition to storing data, phyloseq provides convenient functions that allow you to manipulate in a flexible manner. For example, it is possible to normalize data. We will normalize the count data so that the columns for each sample sum the median number of counts in the un-normalized count matrix.
Have a look at other data normalization methods available within the
phyloseq
package. For example see the functionnormalize.deseq()
5. Plotting figures
Phyloseq also provides convenient functions for generating summary plot of your data.
Once your data are contained within a phyloseq
object, it is easy to genreate sophisticated plots
with relatively little effort.
For further information on plotting with phyloseq try looking at the help documentation for each plot function. Additionally, have a look at some of the online phyloseq tutorials
Try plotting a stacked bar chart in which counts are coloured by genus.
What data transformations are implicitly made when plotting a heatmap?
6. Analysis of alpha diversity
While individual taxa are of interest, many biologically relevant changes in the microbiome (for example dysbiosis) are reported at the community level. Statistics that summarize changes in microbiome community composition are therefore of interest.
Alpha diversity metrics represent measurements of microbiome diversity within an individual. For comparative purposes, it is possible to compare the alpha diversity of individual A with that of individual B.
There are multiple different statistical methods for measuring alpha diversity (although, see here for disambiguation of the terms ‘richness’ and ‘diversity’, both of which are considered measurements of ‘alpha diversity’). Phyloseq provides convenient access to many of these methods.
Which alpha diversity measures alter between the normalized and un-normalized data sets?
Which data set returns the correct alpha diversity metrics?
7. Analysis of beta diversity
In contrast to alpha diversity, beta diversity metrics represent measurments of the extent to which the microbiome changes across individuals. For a discussion of the differences between alpha and beta diversity see Jost 2007.
To plot beta diversity, it is first necessary to generate a metric that summarizes how much the microbiome varies between every pair of individuals in a study. There are many such measures of similarity/dissimilarity. (For a more-or-less comprehensive summary see Chapter 7 of Legendre & Legendre 2012).
Pairwise comparison of similarity/dissimilarity between all sampels in a study can be used to generaet resemblance matrices, which can in turn, be used to plot visual summaries of beta diversity.
What does each cell in the distance matrix
dist_norm
represent?
Try colouring your beta diversity plot by one of the other variables in the medata.
For a summary of all the dissimilarity measures offered by phyloseq
, try typing
?distanceMethodList
into your R console.
Conclusions
Reading data into R provides access to a wide variety of R libraries that are written specifically for the analysis of 16S sequence data. Using packages provides high-level access to the power of R as a statistical programming language, thereby dramatically increasing the ease of analysis. However, this comes at an increased risk of making invalid statistical assumptions about your data.
Fortunately, a central tenet of the R language is that it should be robust as well as convenient.
The BioConductor initiative extends these principles to genomic analysis;
it contains several packages that are useful for microbiome analysis. This session has introduced
phyloseq
and metaseqR
,
but see also metagenomeSeq
and Metagenome
.