Expected Time to Completion: 45 minutes
The data here is real, but based on OTU-clustering and old sequencing technology There is not a known truth for any of the examples, and as you’ll see, different methods give different answers even for the rank of most significant results.
In the following examples we will use publicly available data from a study on colorectal cancer (Kostic et al (2012) Genomic analysis identifies association of Fusobacterium with colorectal carcinoma. Genome research 22:292–298). This data consists of 454 FLX Titanium sequencing counts of OTUs from PCR amplicons of the V3-V5 variable region of the 16S rRNA gene. In the original study 190 separate samples were obtained in the form of samples from 95 patient biopsy pairs from either tumor or non-tumor tissue. For simplicity of illustration, this data has been truncated to include only samples from patients with a Stage-II tumor.
packageVersion("phyloseq")
## [1] '1.25.2'
packageVersion("edgeR")
## [1] '3.22.3'
packageVersion("DESeq2")
## [1] '1.20.0'
packageVersion("metagenomeSeq")
## [1] '1.22.0'
If something is missing, use the Bioconductor installer:
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("edgeR")
Time-permitting, I recommend that you review the included shot noise simulation tutorial demonstrating the effect of sample size on species proportion estimates in the simple Poisson case (no overdispersion).
This should be a quick read, and a reminder why we need to include the Library Size(s) in microbiome analyses, as the proportion value itself omits / masks the implicit uncertainty of the sampling process.
The remainder of this lab is concerned with the more complicated case in which we use a mixture model in order to account for additional sources of uncertainty.
library("phyloseq"); packageVersion("phyloseq")
## [1] '1.25.2'
library("ggplot2"); packageVersion("ggplot2")
## [1] '3.0.0'
theme_set(theme_bw())
library("magrittr"); packageVersion("magrittr")
## [1] '1.5'
closedps <- readRDS("Kostic2012StageII.RDS")
closedps %>% nsamples
## [1] 26
closedps %>% ntaxa
## [1] 634
closedps %>% sample_variables()
## [1] "X.SampleID" "BarcodeSequence"
## [3] "LinkerPrimerSequence" "NECROSIS_PERCENT"
## [5] "TARGET_SUBFRAGMENT" "ASSIGNED_FROM_GEO"
## [7] "EXPERIMENT_CENTER" "TITLE"
## [9] "RUN_PREFIX" "AGE"
## [11] "NORMAL_EQUIVALENT_PERCENT" "FIBROBLAST_AND_VESSEL_PERCENT"
## [13] "DEPTH" "TREATMENT"
## [15] "AGE_AT_DIAGNOSIS" "COMMON_NAME"
## [17] "HOST_COMMON_NAME" "BODY_SITE"
## [19] "ELEVATION" "REPORTS_RECEIVED"
## [21] "CEA" "PCR_PRIMERS"
## [23] "COLLECTION_DATE" "ALTITUDE"
## [25] "ENV_BIOME" "SEX"
## [27] "PLATFORM" "RACE"
## [29] "BSP_DIAGNOSIS" "STUDY_CENTER"
## [31] "COUNTRY" "CHEMOTHERAPY"
## [33] "YEAR_OF_DEATH" "ETHNICITY"
## [35] "ANONYMIZED_NAME" "TAXON_ID"
## [37] "SAMPLE_CENTER" "SAMP_SIZE"
## [39] "YEAR_OF_BIRTH" "ORIGINAL_DIAGNOSIS"
## [41] "AGE_UNIT" "STUDY_ID"
## [43] "EXPERIMENT_DESIGN_DESCRIPTION" "Description_duplicate"
## [45] "DIAGNOSIS" "BODY_HABITAT"
## [47] "SEQUENCING_METH" "RUN_DATE"
## [49] "HISTOLOGIC_GRADE" "LONGITUDE"
## [51] "ENV_MATTER" "TARGET_GENE"
## [53] "ENV_FEATURE" "KEY_SEQ"
## [55] "BODY_PRODUCT" "TUMOR_PERCENT"
## [57] "LIBRARY_CONSTRUCTION_PROTOCOL" "REGION"
## [59] "RUN_CENTER" "TUMOR_TYPE"
## [61] "BSP_NOTES" "RADIATION_THERAPY"
## [63] "INFLAMMATION_PERCENT" "HOST_SUBJECT_ID"
## [65] "PC3" "LATITUDE"
## [67] "OSH_DIAGNOSIS" "STAGE"
## [69] "PRIMARY_DISEASE" "HOST_TAXID"
## [71] "Description"
DESeq2 has an official extension within the phyloseq package and a vignette within phyloseq.
Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biology 11: R106
The function phyloseq_to_deseq2
converts your phyloseq-format microbiome data into a DESeqDataSet
with dispersions estimated, using the experimental design formula, also shown (the ~DIAGNOSIS
term).
dds = phyloseq_to_deseq2(closedps, ~DIAGNOSIS)
## Loading required namespace: DESeq2
## converting counts to integer mode
The DESeq
function does the rest of the modelling/testing. In this case we’re using its default testing framework, but you can use alternatives via glm
interface. The default multiple-inference correction is FDR, and occurs within the DESeq
function.
library("DESeq2"); packageVersion("DESeq2")
## [1] '1.20.0'
# A custom geometric mean function, with zero/NA tolerance.
gm_mean = function(x, na.rm=TRUE){
exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
}
geoMeans = apply(counts(dds), 1, gm_mean)
dds = estimateSizeFactors(dds, geoMeans = geoMeans)
dds = DESeq(dds, test="Wald", fitType = "parametric")
## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 137 genes
## -- DESeq argument 'minReplicatesForReplace' = 7
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
results
extracts a table of the test results from dds
. Very fast. The following will order by FDR, and remove NA
values.
fdrAlpha = 0.05
resultsDESeq2 = results(dds)
# Add taxonomy to the results table
resultsDESeq2 <-
cbind(
resultsDESeq2 %>% as("data.frame"),
tax_table(closedps)[rownames(resultsDESeq2), ] %>% as("matrix")) %>%
set_rownames(rownames(resultsDESeq2))
resultsDESeq2$OTU <- rownames(resultsDESeq2)
# Set default/no-information to failed tests
# e.g. NA `padj` values to max, 1.0
resultsDESeq2[is.na(resultsDESeq2$log2FoldChange), "log2FoldChange"] <- 0.0
resultsDESeq2[is.na(resultsDESeq2$padj), "padj"] <- 1.0
resultsDESeq2[is.na(resultsDESeq2$pvalue), "pvalue"] <- 1.0
resultsDESeq2 %>%
dplyr::filter(padj < fdrAlpha) %>%
knitr::kable()
baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | Kingdom | Phylum | Class | Order | Family | Genus | Species | OTU |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6.66580 | 21.73314 | 2.942975 | 7.384753 | 0.00e+00 | 0.0000000 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Porphyromonas | endodontalis | 4423790 |
19.99757 | -5.78066 | 1.384756 | -4.174497 | 2.99e-05 | 0.0063462 | Bacteria | Fusobacteria | Fusobacteriia | Fusobacteriales | Fusobacteriaceae | Fusobacterium | NA | 4472949 |
Explore OTUs that were significantly different between the two classes. The following makes a nice ggplot2 summary of the results.
To make the plot look nice, we need to order things for ggplot2, and it typically does this by the order of elements in the factor. This looks a bit complicated, but don’t worry to much about it. We are not changing the data at all here, just the order of unique categories in each factor.
library("ggplot2")
theme_set(theme_bw())
pdsd <-
resultsDESeq2 %>%
ggplot(
mapping = aes(
x = log2FoldChange,
y = -log10(pvalue),
color = Phylum,
shape = padj < fdrAlpha)
) +
geom_vline(xintercept = 0.0, linetype = 2) +
geom_point() +
# Highlight the rejected hypotheses at FDR < alpha
geom_point(
size = 6,
alpha = 0.5,
data = resultsDESeq2 %>% dplyr::filter(padj < fdrAlpha)) +
geom_text(
size = 2,
color = "black",
mapping = aes(label = OTU),
data = resultsDESeq2 %>% dplyr::filter(padj < fdrAlpha)) +
scale_size_continuous(range = c(3, 7)) +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5)) +
ggtitle("Volcano Plot", "DESeq2 Result")
print(pdsd)
DESeq2 provides a variance stabilizing transformation (VST) from the fitted dispersion-mean relation(s). The count data is transformed and normalized by division by the library size factor, yielding a matrix of values which are now approximately homoskedastic (having constant variance along the range of mean values). The rlogTransformation
is less sensitive to size factors, which can be an issue when size factors vary widely. This transformation is useful when checking for outliers or as input for machine learning techniques such as clustering or linear discriminant analysis.
Relevant functions
varianceStabilizingTransformation
getVarianceStabilizedData
rlogTransformation
/ rlog
Note that neither rlog transformation nor the VST are used by the differential expression estimation in DESeq, which always occurs on the raw count data, through generalized linear modeling which incorporates knowledge of the variance-mean dependence. The rlog transformation and VST are offered as separate functionality which can be used for visualization, clustering or other machine learning tasks. See the transformation section of the vignette for more details.
Perform VST, store the transformed values in a new table.
vsd = getVarianceStabilizedData(dds)
dim(vsd)
## [1] 634 26
cpsvsd = closedps
otu_table(cpsvsd) <- otu_table(vsd, taxa_are_rows = TRUE)
Compute and display ordination
pca = ordinate(cpsvsd, "RDA")
plot_ordination(cpsvsd, pca, type = "split", color="Phylum", shape="DIAGNOSIS")
We can do the same thing with regularized log.
The transformed values, rlog(K)
, are equal to rlog(K_ij) = log2(q_ij) = x_j. * beta_i
, with formula terms defined in DESeq
function documentation.
rld = rlog(dds, blind = FALSE)
cpsrld = closedps
otu_table(cpsrld) <- otu_table(assay(rld), taxa_are_rows = TRUE)
pca = ordinate(cpsrld, "RDA")
plot_ordination(cpsrld, pca, type = "split",
color="Phylum", shape="DIAGNOSIS")
Robinson MD, McCarthy DJ, Smyth GK (2009) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics (Oxford, England) 26: 139–140
Another popular negative binomial method.
source("http://bioconductor.org/biocLite.R")
biocLite("edgeR")
library("edgeR"); packageVersion("edgeR")
## [1] '3.22.3'
Define a function for converting phyloseq data into edgeR DGE format.
################################################################################
#' Convert phyloseq OTU count data into DGEList for edgeR package
#'
#' Further details.
#'
#' @param physeq (Required). A \code{\link{phyloseq-class}} or
#' an \code{\link{otu_table-class}} object.
#' The latter is only appropriate if \code{group} argument is also a
#' vector or factor with length equal to \code{nsamples(physeq)}.
#'
#' @param group (Required). A character vector or factor giving the experimental
#' group/condition for each sample/library. Alternatively, you may provide
#' the name of a sample variable. This name should be among the output of
#' \code{sample_variables(physeq)}, in which case
#' \code{get_variable(physeq, group)} would return either a character vector or factor.
#' This is passed on to \code{\link[edgeR]{DGEList}},
#' and you may find further details or examples in its documentation.
#'
#' @param method (Optional). The label of the edgeR-implemented normalization to use.
#' See \code{\link[edgeR]{calcNormFactors}} for supported options and details.
#' The default option is \code{"RLE"}, which is a scaling factor method
#' proposed by Anders and Huber (2010).
#' At time of writing, the \link[edgeR]{edgeR} package supported
#' the following options to the \code{method} argument:
#'
#' \code{c("TMM", "RLE", "upperquartile", "none")}.
#'
#' @param ... Additional arguments passed on to \code{\link[edgeR]{DGEList}}
#'
#' @examples
#'
phyloseq_to_edgeR = function(physeq, group, method="RLE", ...){
require("edgeR")
require("phyloseq")
# Enforce orientation.
if( !taxa_are_rows(physeq) ){ physeq <- t(physeq) }
x = as(otu_table(physeq), "matrix")
# Add one to protect against overflow, log(0) issues.
x = x + 1
# Check `group` argument
if( identical(all.equal(length(group), 1), TRUE) & nsamples(physeq) > 1 ){
# Assume that group was a sample variable name (must be categorical)
group = get_variable(physeq, group)
}
# Define gene annotations (`genes`) as tax_table
taxonomy = tax_table(physeq, errorIfNULL=FALSE)
if( !is.null(taxonomy) ){
taxonomy = data.frame(as(taxonomy, "matrix"))
}
# Now turn into a DGEList
y = DGEList(counts=x, group=group, genes=taxonomy, remove.zeros = TRUE, ...)
# Calculate the normalization factors
z = edgeR::calcNormFactors(y, method=method)
# Check for division by zero inside `calcNormFactors`
if( !all(is.finite(z$samples$norm.factors)) ){
stop("Something wrong with edgeR::calcNormFactors on this data,
non-finite $norm.factors, consider changing `method` argument")
}
# Estimate dispersions
return(edgeR::estimateTagwiseDisp(edgeR::estimateCommonDisp(z)))
}
################################################################################
What kind of filter is this?
cpsf = filter_taxa(closedps,
function(x){sum(x>=1) > (length(x)/4)},
prune = TRUE)
Now let’s use our newly-defined function to convert the phyloseq data object into an edgeR “DGE” data object, called dge
.
dge = phyloseq_to_edgeR(cpsf, group="DIAGNOSIS")
Perform two-class test
et = exactTest(dge)
Extract values and plot
tt <-
topTags(
object = et,
n = nrow(dge$table),
adjust.method="BH",
sort.by="PValue")
resER <- tt@.Data[[1]] %>% as("data.frame")
resER %>%
dplyr::filter(FDR < 0.05) %>%
knitr::kable()
Kingdom | Phylum | Class | Order | Family | Genus | Species | logFC | logCPM | PValue | FDR |
---|---|---|---|---|---|---|---|---|---|---|
Bacteria | Firmicutes | Clostridia | Clostridiales | Veillonellaceae | Succiniclasticum | NA | -3.147567 | 11.40684 | 0.0002191 | 0.0289242 |
Differential Abundance
resER$OTU <- rownames(resER)
pedgeR <-
ggplot(
data = resER,
mapping = aes(
x = logFC,
y = -log10(PValue),
color = Phylum,
shape = FDR < fdrAlpha)) +
geom_vline(xintercept = 0) +
geom_point() +
scale_size_continuous(range = c(3, 7)) +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0.5)) +
ggtitle("edgeR - Volcano Plot")
pedgeR +
geom_text(
size = 2,
mapping = aes(label = OTU),
color = "black",
data = resER %>% dplyr::filter(FDR < fdrAlpha))
Zhou, X., Lindsay, H., & Robinson, M. D. (2014). Robustly detecting differential expression in RNA sequencing data using observation weights. Nucleic Acids Research.
estimateGLMRobustDisp
- Empirical Robust Bayes Tagwise Dispersions for Negative Binomial GLMs using Observation Weights.
“At times, because of the moderation of dispersion estimates towards a trended values, features (typically, genes) can be sensitive to outliers and causing false positives. That is, since the dispersion estimates are moderated downwards toward the trend and because the regression parameter estimates may be affected by the outliers, genes are deemed significantly differential expressed”
# Define the design matrix for the full model
design <-
model.matrix(
object = ~DIAGNOSIS,
data = cpsf %>% sample_data %>% data.frame)
design
## (Intercept) DIAGNOSISTumor
## 32I9UAPQ.518112 1 1
## 32I9UNA9.518098 1 0
## 41E1KAMA.517991 1 1
## 41E1KNBP.517985 1 0
## 59S9WAIH.518013 1 1
## 59S9WNC6.518055 1 0
## 82S3MAZ4.518145 1 1
## 82S3MNBY.518050 1 0
## BFJMKAKB.518159 1 1
## BFJMKNMP.518102 1 0
## G3UBQAJU.518060 1 1
## G3UBQNDP.518121 1 0
## KIXFRARY.518160 1 1
## KIXFRNL2.518129 1 0
## LRQ9WAHA.518036 1 1
## LRQ9WN6C.518048 1 0
## MQETMAZC.518171 1 1
## MQETMNL4.518037 1 0
## QFHRSAG2.518005 1 1
## QFHRSNIH.518116 1 0
## TV28IANZ.518070 1 1
## TV28INUO.518004 1 0
## UZ65XAS5.518119 1 1
## UZ65XN27.518028 1 0
## XBS5MA8H.517990 1 1
## XBS5MNEH.518074 1 0
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$DIAGNOSIS
## [1] "contr.treatment"
rer = estimateGLMRobustDisp(dge, design)
rertab <-
topTags(
object = exactTest(rer),
n = nrow(dge$table),
adjust.method = "BH",
sort.by = "PValue")@.Data[[1]] %>%
as("data.frame")
# Show the results with FDR < 0.05
rertab %>%
dplyr::filter(FDR < 0.05) %>%
knitr::kable()
Kingdom | Phylum | Class | Order | Family | Genus | Species | logFC | logCPM | PValue | FDR |
---|---|---|---|---|---|---|---|---|---|---|
Bacteria | Firmicutes | Clostridia | Clostridiales | Veillonellaceae | Succiniclasticum | NA | -3.314996 | 10.09857 | 0.0000000 | 0.0000001 |
Bacteria | Proteobacteria | Gammaproteobacteria | Aeromonadales | Aeromonadaceae | NA | NA | -2.727805 | 10.20933 | 0.0000006 | 0.0000367 |
Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | NA | NA | -2.528247 | 10.37310 | 0.0000008 | 0.0000367 |
Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Actinomycetaceae | Actinomyces | NA | 1.563126 | 10.07439 | 0.0000297 | 0.0009789 |
Bacteria | Firmicutes | Erysipelotrichi | Erysipelotrichales | Erysipelotrichaceae | [Eubacterium] | dolichum | -1.850542 | 10.43365 | 0.0000663 | 0.0017511 |
Bacteria | Proteobacteria | Betaproteobacteria | Burkholderiales | Alcaligenaceae | Sutterella | NA | -1.938259 | 10.95418 | 0.0008726 | 0.0191981 |
Bacteria | Firmicutes | Erysipelotrichi | Erysipelotrichales | Erysipelotrichaceae | NA | NA | -1.716481 | 10.67694 | 0.0010679 | 0.0201376 |
Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | [Ruminococcus] | gnavus | -2.189227 | 11.28974 | 0.0019370 | 0.0316542 |
Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | [Ruminococcus] | NA | -1.901929 | 11.46815 | 0.0021582 | 0.0316542 |
Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | NA | NA | 1.457871 | 10.22377 | 0.0037352 | 0.0482130 |
Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | NA | -1.392119 | 10.95531 | 0.0040727 | 0.0482130 |
Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Bacteroidaceae | Bacteroides | NA | -1.482239 | 10.24958 | 0.0043830 | 0.0482130 |
We already defined the plot for edgeR earlier. Let’s copy that plot, and then replace the data. Magic!
prer = pedgeR
rertab$OTU <- rownames(rertab)
prer$data <- rertab
prer <-
prer +
geom_text(
size = 2,
mapping = aes(label = OTU),
color = "black",
data = rertab %>% dplyr::filter(FDR < fdrAlpha)) +
ggtitle("edgeR-robust - Volcano Plot")
print(prer)
We’re going to move on to other packages. Let’s “unload” edgeR to avoid conflicts, namespace issues.
unloadNamespace("edgeR")
source("http://bioconductor.org/biocLite.R")
biocLite("metagenomeSeq")
metagenomeSeq is an official Bioconductor package, with the explicit goal of detecting differential abundance in microbiome experiments with an explicit design.
First, we need functions to
MGS
objectDefine a second function to based on our simple two-class experimental design.
Function to perform relevant test (fitZig
) - MGS
- A metagenomeSeq object, most-likely produced using the conversion tool make_metagenomeSeq() - variable
- the variable among the sample_data (aka “phenoData” in metagenomeSeq) that you want to test - physeq
- optional phyloseq data object that has the relevant tax_table
. phyloseq
or taxonomyTable
is fine.
library("metagenomeSeq"); packageVersion("metagenomeSeq")
## Loading required package: glmnet
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:S4Vectors':
##
## expand
## Loading required package: foreach
## Loaded glmnet 2.0-16
## Loading required package: RColorBrewer
## [1] '1.22.0'
# Function to convert form phyloseq object to metagenomeSeq object
phyloseq_to_metagenomeSeq = function(physeq){
require("metagenomeSeq")
require("phyloseq")
# Enforce orientation
if(!taxa_are_rows(physeq)){physeq <- t(physeq)}
OTU = as(otu_table(physeq), "matrix")
# Convert sample_data to AnnotatedDataFrame
ADF = AnnotatedDataFrame(data.frame(sample_data(physeq)))
# define dummy "feature" data for OTUs, using their name
# Helps with extraction and relating to taxonomy later on.
TDF = AnnotatedDataFrame(
data = data.frame(
OTUname = taxa_names(physeq),
row.names=taxa_names(physeq)))
# Create the metagenomeSeq object
MGS = newMRexperiment(counts=OTU, phenoData=ADF, featureData=TDF)
# Trigger metagenomeSeq to calculate its Cumulative Sum scaling factor.
MGS = cumNorm(MGS)
return(MGS)
}
# Function to perform relevant test (`fitZig`)
# - `MGS` - A metagenomeSeq object, most-likely produced using the conversion tool make_metagenomeSeq()
# - `variable` - the variable among the sample_data (aka "phenoData" in metagenomeSeq) that you want to test
# - `physeq` - optional phyloseq data object that has the relevant `tax_table`.
# `phyloseq` or `taxonomyTable` is fine.
test_metagenomeSeq = function(MGS, variable, physeq=NULL){
require("metagenomeSeq")
require("phyloseq")
# Create the `mod` variable used in the fitZig test.
if( inherits(variable, "factor") ){
# If variable is already a factor, use directly in model.matrix
mod = model.matrix( ~variable )
} else if( inherits(variable, "matrix") ){
# If it is a matrix, assume that model.matrix() has been used already
} else if( inherits(variable, "character") ){
# If it is a character that specifies a variable in phenoData,
# use the corresponding variable from MGS
if( variable %in% colnames(phenoData(MGS)@data) ){
mod = model.matrix(~phenoData(MGS)@data[, variable])
} else {
stop("The phenoData variable name you specified is not present in `phenoData(MGS)`")
}
} else {
stop("Improper specification of the experimental design variable for testing. See `variable` argument")
}
# Wrapper to run the Expectation-maximization algorithm
# and estimate $f_count$ fits with the zero-inflated Guassian (z.i.g.)
fit = fitZig(MGS, mod)
# You need to specify all OTUs to get the full table from MRfulltable.
x = MRfulltable(fit, number=nrow(assayData(MGS)$counts))
# if any OTUs left out, rm those from x. Detected by NA rownames.
x = x[!is.na(rownames(x)), ]
# Modify this data.frame by adding the OTUnames. Clip the ":1" added to the OTU names
rownames(x) <- gsub(":1", "", x=rownames(x), fixed=TRUE)
x$OTUnames <- as.character(rownames(x))
if( !is.null(tax_table(physeq, errorIfNULL=FALSE)) ){
# Attach the bacterial taxonomy to the table, if available
TAX = data.frame(tax_table(physeq))
TAX$OTUnames <- as.character(rownames(TAX))
y = merge(x, TAX, by="OTUnames")
} else {
y = x
}
# Sort and return
y = y[order(y$adjPvalue), ]
return(y)
}
Perform metagenomeSeq differential abundance detection. The default method for multiple-inference adjustment in metagenomeSeq::MRfulltable
is Benjamini-Hochberg (FDR).
mgs = phyloseq_to_metagenomeSeq(cpsf)
## Default value being used.
mgsres = test_metagenomeSeq(mgs, "DIAGNOSIS", cpsf)
## it= 0, nll=41.04, log10(eps+1)=Inf, stillActive=132
## it= 1, nll=44.23, log10(eps+1)=0.04, stillActive=14
## it= 2, nll=44.24, log10(eps+1)=0.03, stillActive=4
## it= 3, nll=44.31, log10(eps+1)=0.01, stillActive=1
## it= 4, nll=44.33, log10(eps+1)=0.03, stillActive=1
## it= 5, nll=44.38, log10(eps+1)=0.00, stillActive=0
mgsres$OTU <- mgsres$OTUnames
head(mgsres)
## OTUnames +samples in group 0 +samples in group 1 counts in group 0
## 128 806497 5 3 221
## 37 4382457 7 6 117
## 73 4453773 3 4 37
## 38 4383924 6 3 131
## 68 4447950 5 2 111
## 122 4483337 8 7 267
## counts in group 1 oddsRatio lower upper fisherP fisherAdjP
## 128 4 2.0247229 0.28658818 17.249447 0.6727689 1
## 37 24 1.3449818 0.22723452 8.260005 1.0000000 1
## 73 9 0.6853112 0.07781577 5.338345 1.0000000 1
## 38 17 2.7406820 0.40859215 23.066726 0.4109840 1
## 68 17 3.2757687 0.40402819 42.843261 0.3782609 1
## 122 23 1.3547860 0.22311713 8.588112 1.0000000 1
## (Intercept) phenoData(MGS)@data[, variable]Tumor scalingFactor
## 128 4.845874 -3.931532 -2.7012098
## 37 1.828084 -1.663561 7.3726635
## 73 1.617782 -1.912030 6.5184995
## 38 3.840960 -2.156651 -1.6724050
## 68 1.479270 -2.006814 8.4916702
## 122 3.206714 -1.923928 0.5316862
## pvalues adjPvalues Kingdom Phylum Class
## 128 2.733165e-05 0.003607777 Bacteria Firmicutes Clostridia
## 37 3.526532e-03 0.155167415 Bacteria Firmicutes Erysipelotrichi
## 73 3.130797e-03 0.155167415 Bacteria Proteobacteria Deltaproteobacteria
## 38 9.034026e-03 0.238498275 Bacteria Firmicutes Clostridia
## 68 8.181027e-03 0.238498275 Bacteria Bacteroidetes Bacteroidia
## 122 1.226664e-02 0.269865991 Bacteria Firmicutes Clostridia
## Order Family Genus Species
## 128 Clostridiales Veillonellaceae Succiniclasticum <NA>
## 37 Erysipelotrichales Erysipelotrichaceae [Eubacterium] dolichum
## 73 Desulfovibrionales Desulfovibrionaceae Desulfovibrio <NA>
## 38 Clostridiales <NA> <NA> <NA>
## 68 Bacteroidales Bacteroidaceae Bacteroides <NA>
## 122 Clostridiales Lachnospiraceae <NA> <NA>
## OTU
## 128 806497
## 37 4382457
## 73 4453773
## 38 4383924
## 68 4447950
## 122 4483337
pmgs <-
ggplot(
data = mgsres,
mapping = aes(`(Intercept)`, -log10(pvalues), color = Phylum)) +
geom_vline(xintercept = 0) +
geom_point() +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0.5)) +
ggtitle("metagenomeSeq - Volcano Plot")
pmgs <-
pmgs +
geom_text(
size = 2,
mapping = aes(label = OTU),
color = "black",
data = mgsres %>% dplyr::filter(adjPvalues < fdrAlpha))
print(pmgs)
We’re going to move on to other packages. Let’s “unload” metagenomeSeq to avoid conflicts, namespace issues.
unloadNamespace("metagenomeSeq")
The gridExtra package adds some helpful utilities for dealing with graphic objects created using the low-level R graphics package called grid, which is included as part of the base R distribution. See gridBase for details about combining elements of grid graphics with base R graphics.
We will use the grid.arrange
function from gridExtra
library("gridExtra"); packageVersion("gridExtra")
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:Biobase':
##
## combine
## The following object is masked from 'package:BiocGenerics':
##
## combine
## [1] '2.3'
grid.arrange(nrow=2, pdsd, pedgeR, prer, pmgs)
Generate a “tidy” table of the abundances, by “melting” the phyloseq object with psmelt()
function.
Count the number of hits that overlap between methods
if(!require("UpSetR")){install.packages("UpSetR")}
## Loading required package: UpSetR
hitList <-
list(
edgeR = resER %>%
dplyr::filter(FDR < fdrAlpha) %>%
extract2("OTU") %>%
unique,
robustER = rertab %>%
dplyr::filter(FDR < fdrAlpha) %>%
extract2("OTU") %>%
unique,
DESeq2 = resultsDESeq2 %>%
dplyr::filter(padj < fdrAlpha) %>%
extract2("OTU") %>%
unique,
metagenomeSeq = mgsres %>%
dplyr::filter(adjPvalues < fdrAlpha) %>%
extract2("OTU") %>%
unique
)
hitList %>%
fromList() %>%
upset(order.by = "freq")
Which OTU overlaps the most methods?
hitAgrees <-
intersect(
hitList$edgeR,
hitList$metagenomeSeq) %>%
intersect(hitList$robustER)
library("data.table")
##
## Attaching package: 'data.table'
## The following object is masked from 'package:SummarizedExperiment':
##
## shift
## The following object is masked from 'package:GenomicRanges':
##
## shift
## The following object is masked from 'package:IRanges':
##
## shift
## The following objects are masked from 'package:S4Vectors':
##
## first, second
library("magrittr")
longtab = psmelt(closedps) %>% data.table
longtab[, RelativeAbundance := Abundance/sum(Abundance), by = Sample]
I like violin plots …
# Subset to an example "hit(s)" that agree
justExampleHits = longtab[(OTU %in% hitAgrees)]
ggplot(
data = justExampleHits,
mapping = aes(
x = DIAGNOSIS,
y = RelativeAbundance,
color = Genus,
shape = Phylum)) +
geom_violin(color = "gray") +
geom_jitter(size = 3, width = 0.1, alpha = 0.5) +
scale_y_sqrt() +
facet_wrap(~OTU) +
ggtitle("'Groundtruth' plot for OTU hit(s)",
"Each point is an OTU in a sample. Is this convincing? Which direction is the result?")