We start again with the dataset from Cho et al. Nature. 2012;488(7413):621-6..
load("STAT2.RData")
library(phyloseq)
The microbiomes are assessed for fecal and cecal contents. The Location variable identifies, where the specimen comes from. subset_samples(...)
can be used to filter the data to only those observations that satisfy the requirements.
fecal.phy = subset_samples(phy2, Location=='fecal')
head(sample_data(fecal.phy))
## X.SampleID BarcodeSequence LinkerPrimerSequence Treatment
## fecal_C1 fecal_C1 NA NA C
## fecal_C10 fecal_C10 NA NA C
## fecal_C2 fecal_C2 NA NA C
## fecal_C3 fecal_C3 NA NA C
## fecal_C4 fecal_C4 NA NA C
## fecal_C5 fecal_C5 NA NA C
## Location Description GIP Insulin Leptin IGF1 BMD
## fecal_C1 fecal missing_description 8.61 1 2320 1250 0.0492
## fecal_C10 fecal missing_description 27.70 516 2680 3720 0.0479
## fecal_C2 fecal missing_description 28.60 137 1040 2180 0.0487
## fecal_C3 fecal missing_description 19.20 543 1670 2160 0.0510
## fecal_C4 fecal missing_description 38.30 606 972 2070 0.0465
## fecal_C5 fecal missing_description 13.20 1 2210 1240 0.0547
## mFat pFat
## fecal_C1 4.0 19.7
## fecal_C10 4.0 19.8
## fecal_C2 3.7 18.6
## fecal_C3 3.6 19.5
## fecal_C4 4.0 21.3
## fecal_C5 4.1 21.6
Phyloseq provides are wrapper function distance() to compute a range of different distances for microbiome data. We will use Bray-Curtis distance here.
bray.fecal = phyloseq::distance(fecal.phy, method="bray")
class(bray.fecal)
## [1] "dist"
Examine the distance matrix jsd.fecal
. What R class is it of? To convert to matrix use as.matrix
function. This way you can examine the values of the distance matrix more easily, e.g.
as.matrix(bray.fecal)[1:4, 1:4]
## fecal_C1 fecal_C10 fecal_C2 fecal_C3
## fecal_C1 0.0000000 0.3571727 0.1497128 0.2015329
## fecal_C10 0.3571727 0.0000000 0.3309047 0.2893686
## fecal_C2 0.1497128 0.3309047 0.0000000 0.1851358
## fecal_C3 0.2015329 0.2893686 0.1851358 0.0000000
Why are the diagonal values 0?
vegan::meandist
Sometimes it is meaningful to describe the variability within a set of related samples (intra-group variability) and how samples of different identity relate to each other (inter-group variability). These can be computed using meandist() function from the vegan package. For example if we wanted to see an average distance within the treatment groups (Treatment variable), we can run:
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-2
md = meandist(bray.fecal, sample_data(fecal.phy)$Treatment)
md
## C P T V VP
## C 0.2817158 0.3245986 0.4622712 0.3138259 0.3764969
## P 0.3245986 0.3129034 0.4574114 0.3172899 0.3946473
## T 0.4622712 0.4574114 0.4383582 0.4649090 0.4638802
## V 0.3138259 0.3172899 0.4649090 0.2919245 0.3968708
## VP 0.3764969 0.3946473 0.4638802 0.3968708 0.3864409
## attr(,"class")
## [1] "meandist" "matrix"
## attr(,"n")
## grouping
## C P T V VP
## 10 9 9 10 8
The diagonal entries give the intra-group distances.
barplot(diag(md))
We can see that the variability within some treatment levels is higher.
Phyloseq provides a method ordinate() for convenient computation of multivariate projections. For the sake of learning we will use one of the underlying functions from ade4 package to compute PCoA. We always start by examining the screeplot for the projection. This is the plot of variability along each of the axes.
library(ade4)
fecal.pco = dudi.pco(bray.fecal,
scannf=F, nf=2, full=T)
## Warning in dudi.pco(bray.fecal, scannf = F, nf = 2, full = T): Non
## euclidean distance
screeplot(fecal.pco)
Notice the warning message that says that the distances are not Euclidean. This happens when dissimilarities are used, and thus the triangle inequality is violated. This can be corrected for by adding a constant to all non-zero distances; the caillez() method computes the minimum constant.
fecal.pco = dudi.pco(cailliez(bray.fecal),
scannf=F, nf=2)
Calculate % inertia explained
fecal.pco$eig[1:2]/sum(fecal.pco$eig)
## [1] 0.2382008 0.1033638
These numbers indicate the fraction of the entire variability that is explained by each axis.
s.class() function gives the ability to group point by a factor variable.
s.class(fecal.pco$li, sample_data(fecal.phy)$Treatment)
s.value() plots represents a numeric variable.
s.value(fecal.pco$li, sample_data(fecal.phy)$Insulin, sub="Insulin")
We can see how well the phenotypic variables correlate with the first two major axis of variability.
cor.mat = with(sample_data(fecal.phy),
cor(cbind(PCo=fecal.pco$li,
GIP, Insulin, Leptin,
IGF1, BMD, mFat, pFat)))
round(cor.mat[,1:2], digits=2)
## PCo.A1 PCo.A2
## PCo.A1 1.00 0.00
## PCo.A2 0.00 1.00
## GIP -0.05 0.20
## Insulin 0.08 0.23
## Leptin 0.20 0.20
## IGF1 0.10 0.08
## BMD -0.26 -0.04
## mFat -0.08 -0.08
## pFat -0.11 -0.07
This chunk produces a heatmap plot of these correlations.
library(ggplot2)
library(reshape)
mcor = melt(as.matrix(cor.mat[,1:2]))
mcor$X1 = factor(mcor$X1, levels=colnames(cor.mat))
mcor$X2 = factor(mcor$X2, levels=colnames(cor.mat))
qplot(x=X1, y=X2,
data=mcor,
fill=value, geom="tile")+
scale_fill_gradientn(limits = c(-1,1),
colours=c("red", "white", "green")) +
geom_text(aes(label=round(value, digits=2)))
We first compute the PCoA for all of the data.
bray.all = phyloseq::distance(phy2, method="bray")
bray.pco = dudi.pco(cailliez(bray.all), scannf=F, nf=2)
Let’s examine how the Treatment and Location variables are related to the major axes of variability.
s.class(bray.pco$li,
sample_data(phy2)$Location)
s.class(bray.pco$li,
sample_data(phy2)$Treatment)
s.class(bray.pco$li,
with(sample_data(phy2),
interaction(Treatment, Location)))
Note that the Location difference is so large that it essentially obscures the effect of the Treatment variable. We can perform a within class analysis (wca) by centering the data within the factor levels of the Location variable. For each location we essential take the center of mass and move it to the origin of the axes.
bray.wca = wca(bray.pco,
sample_data(phy2)$Location, scannf=F, nf=2)
s.class(bray.wca$li, sample_data(phy2)$Location)
s.class(bray.wca$li, sample_data(phy2)$Treatment)
Now the effect of the Treatment variable is more apparent and data no longer vary according to the Location variable.
Examine the following for some ideas about how to produce and present PCoA plots with phyloseq.
library(ggplot2)
phy.ord = ordinate(phy2, method="MDS", distance = "bray")
plot_ordination(phy2, phy.ord, color="Location", shape="Treatment")
plot_ordination(phy2, phy.ord, color="Location", shape="Treatment") + facet_wrap(facets = "Treatment")
plot_ordination(phy2, phy.ord, color="Treatment") + facet_wrap(facets = "Location")
plot_ordination(phy2, phy.ord, color="Treatment") + facet_wrap(facets = "Location", scales = "free")
Phyloseq ordinate
method does not implement WCA. If you wanted to plot the results of WCA, add the WCA axes to the sample data and do a manual plot. Here’s an example.
sample_data(phy2)$WCA1 = bray.wca$li[,1]
sample_data(phy2)$WCA2 = bray.wca$li[,2]
ggplot(sample_data(phy2), aes(x=WCA1, y = WCA2)) + geom_point(aes(color=Treatment)) + facet_wrap("Location")
Follow Distance Tutorial