We start again with the dataset from Cho et al. Nature. 2012;488(7413):621-6..

load("STAT2.RData")
library(phyloseq)

1 Preparing the data

The microbiomes are assessed for fecal and cecal contents. The Location variable identifies, where the specimen comes from. subset_samples(...) can be used to filter the data to only those observations that satisfy the requirements.

fecal.phy = subset_samples(phy2, Location=='fecal')

head(sample_data(fecal.phy))
##           X.SampleID BarcodeSequence LinkerPrimerSequence Treatment
## fecal_C1    fecal_C1              NA                   NA         C
## fecal_C10  fecal_C10              NA                   NA         C
## fecal_C2    fecal_C2              NA                   NA         C
## fecal_C3    fecal_C3              NA                   NA         C
## fecal_C4    fecal_C4              NA                   NA         C
## fecal_C5    fecal_C5              NA                   NA         C
##           Location         Description   GIP Insulin Leptin IGF1    BMD
## fecal_C1     fecal missing_description  8.61       1   2320 1250 0.0492
## fecal_C10    fecal missing_description 27.70     516   2680 3720 0.0479
## fecal_C2     fecal missing_description 28.60     137   1040 2180 0.0487
## fecal_C3     fecal missing_description 19.20     543   1670 2160 0.0510
## fecal_C4     fecal missing_description 38.30     606    972 2070 0.0465
## fecal_C5     fecal missing_description 13.20       1   2210 1240 0.0547
##           mFat pFat
## fecal_C1   4.0 19.7
## fecal_C10  4.0 19.8
## fecal_C2   3.7 18.6
## fecal_C3   3.6 19.5
## fecal_C4   4.0 21.3
## fecal_C5   4.1 21.6

2 Computing distances

Phyloseq provides are wrapper function distance() to compute a range of different distances for microbiome data. We will use Bray-Curtis distance here.

bray.fecal = phyloseq::distance(fecal.phy, method="bray")
class(bray.fecal)
## [1] "dist"

Examine the distance matrix jsd.fecal. What R class is it of? To convert to matrix use as.matrix function. This way you can examine the values of the distance matrix more easily, e.g.

as.matrix(bray.fecal)[1:4, 1:4]
##            fecal_C1 fecal_C10  fecal_C2  fecal_C3
## fecal_C1  0.0000000 0.3571727 0.1497128 0.2015329
## fecal_C10 0.3571727 0.0000000 0.3309047 0.2893686
## fecal_C2  0.1497128 0.3309047 0.0000000 0.1851358
## fecal_C3  0.2015329 0.2893686 0.1851358 0.0000000

Why are the diagonal values 0?

3 vegan::meandist

Sometimes it is meaningful to describe the variability within a set of related samples (intra-group variability) and how samples of different identity relate to each other (inter-group variability). These can be computed using meandist() function from the vegan package. For example if we wanted to see an average distance within the treatment groups (Treatment variable), we can run:

library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-2
md = meandist(bray.fecal, sample_data(fecal.phy)$Treatment)
md
##            C         P         T         V        VP
## C  0.2817158 0.3245986 0.4622712 0.3138259 0.3764969
## P  0.3245986 0.3129034 0.4574114 0.3172899 0.3946473
## T  0.4622712 0.4574114 0.4383582 0.4649090 0.4638802
## V  0.3138259 0.3172899 0.4649090 0.2919245 0.3968708
## VP 0.3764969 0.3946473 0.4638802 0.3968708 0.3864409
## attr(,"class")
## [1] "meandist" "matrix"  
## attr(,"n")
## grouping
##  C  P  T  V VP 
## 10  9  9 10  8

The diagonal entries give the intra-group distances.

barplot(diag(md))

We can see that the variability within some treatment levels is higher.

4 Principal Coordinates Analysis

Phyloseq provides a method ordinate() for convenient computation of multivariate projections. For the sake of learning we will use one of the underlying functions from ade4 package to compute PCoA. We always start by examining the screeplot for the projection. This is the plot of variability along each of the axes.

library(ade4)
fecal.pco = dudi.pco(bray.fecal, 
                     scannf=F, nf=2, full=T)
## Warning in dudi.pco(bray.fecal, scannf = F, nf = 2, full = T): Non
## euclidean distance
screeplot(fecal.pco)

Notice the warning message that says that the distances are not Euclidean. This happens when dissimilarities are used, and thus the triangle inequality is violated. This can be corrected for by adding a constant to all non-zero distances; the caillez() method computes the minimum constant.

fecal.pco = dudi.pco(cailliez(bray.fecal), 
                     scannf=F, nf=2)

Calculate % inertia explained

fecal.pco$eig[1:2]/sum(fecal.pco$eig)
## [1] 0.2382008 0.1033638

These numbers indicate the fraction of the entire variability that is explained by each axis.

s.class() function gives the ability to group point by a factor variable.

s.class(fecal.pco$li, sample_data(fecal.phy)$Treatment)

s.value() plots represents a numeric variable.

s.value(fecal.pco$li, sample_data(fecal.phy)$Insulin, sub="Insulin")

We can see how well the phenotypic variables correlate with the first two major axis of variability.

cor.mat = with(sample_data(fecal.phy),
               cor(cbind(PCo=fecal.pco$li,
                         GIP, Insulin, Leptin,
                         IGF1, BMD, mFat, pFat)))

round(cor.mat[,1:2], digits=2)
##         PCo.A1 PCo.A2
## PCo.A1    1.00   0.00
## PCo.A2    0.00   1.00
## GIP      -0.05   0.20
## Insulin   0.08   0.23
## Leptin    0.20   0.20
## IGF1      0.10   0.08
## BMD      -0.26  -0.04
## mFat     -0.08  -0.08
## pFat     -0.11  -0.07

This chunk produces a heatmap plot of these correlations.

library(ggplot2)
library(reshape)
mcor = melt(as.matrix(cor.mat[,1:2]))
mcor$X1 = factor(mcor$X1, levels=colnames(cor.mat))
mcor$X2 = factor(mcor$X2, levels=colnames(cor.mat))
qplot(x=X1, y=X2, 
      data=mcor, 
      fill=value, geom="tile")+
  scale_fill_gradientn(limits = c(-1,1), 
                       colours=c("red", "white", "green")) + 
  geom_text(aes(label=round(value, digits=2)))

5 Removing the effect of a variable in a multivariate analysis.

We first compute the PCoA for all of the data.

bray.all = phyloseq::distance(phy2, method="bray")
bray.pco = dudi.pco(cailliez(bray.all), scannf=F, nf=2)

Let’s examine how the Treatment and Location variables are related to the major axes of variability.

s.class(bray.pco$li, 
        sample_data(phy2)$Location)

s.class(bray.pco$li, 
        sample_data(phy2)$Treatment)

s.class(bray.pco$li, 
        with(sample_data(phy2), 
             interaction(Treatment, Location)))

Note that the Location difference is so large that it essentially obscures the effect of the Treatment variable. We can perform a within class analysis (wca) by centering the data within the factor levels of the Location variable. For each location we essential take the center of mass and move it to the origin of the axes.

bray.wca = wca(bray.pco, 
              sample_data(phy2)$Location, scannf=F, nf=2)

s.class(bray.wca$li, sample_data(phy2)$Location)

s.class(bray.wca$li, sample_data(phy2)$Treatment)

Now the effect of the Treatment variable is more apparent and data no longer vary according to the Location variable.

6 Ordinations with phyloseq

Examine the following for some ideas about how to produce and present PCoA plots with phyloseq.

library(ggplot2)
phy.ord = ordinate(phy2, method="MDS", distance = "bray")

plot_ordination(phy2, phy.ord, color="Location", shape="Treatment")

plot_ordination(phy2, phy.ord, color="Location", shape="Treatment") + facet_wrap(facets = "Treatment")

plot_ordination(phy2, phy.ord, color="Treatment") + facet_wrap(facets = "Location")

plot_ordination(phy2, phy.ord, color="Treatment") + facet_wrap(facets = "Location", scales = "free")

Phyloseq ordinate method does not implement WCA. If you wanted to plot the results of WCA, add the WCA axes to the sample data and do a manual plot. Here’s an example.

sample_data(phy2)$WCA1 = bray.wca$li[,1]
sample_data(phy2)$WCA2 = bray.wca$li[,2]
ggplot(sample_data(phy2), aes(x=WCA1, y = WCA2)) + geom_point(aes(color=Treatment)) + facet_wrap("Location")

7 Phyloseq distance tutorial

Follow Distance Tutorial