Expected Time to Completion: 45 minutes

 


1 The data

The data here is real, but based on OTU-clustering and old sequencing technology There is not a known truth for any of the examples, and as you’ll see, different methods give different answers even for the rank of most significant results.

In the following examples we will use publicly available data from a study on colorectal cancer (Kostic et al (2012) Genomic analysis identifies association of Fusobacterium with colorectal carcinoma. Genome research 22:292–298). This data consists of 454 FLX Titanium sequencing counts of OTUs from PCR amplicons of the V3-V5 variable region of the 16S rRNA gene. In the original study 190 separate samples were obtained in the form of samples from 95 patient biopsy pairs from either tumor or non-tumor tissue. For simplicity of illustration, this data has been truncated to include only samples from patients with a Stage-II tumor.

2 Check packages are installed

packageVersion("phyloseq")
## [1] '1.25.2'
packageVersion("edgeR")
## [1] '3.22.3'
packageVersion("DESeq2")
## [1] '1.20.0'
packageVersion("metagenomeSeq")
## [1] '1.22.0'

If something is missing, use the Bioconductor installer:

## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("edgeR")

 

3 The Poisson case

3.1 “Shot noise” simulation

Time-permitting, I recommend that you review the included shot noise simulation tutorial demonstrating the effect of sample size on species proportion estimates in the simple Poisson case (no overdispersion).

This should be a quick read, and a reminder why we need to include the Library Size(s) in microbiome analyses, as the proportion value itself omits / masks the implicit uncertainty of the sampling process.

The remainder of this lab is concerned with the more complicated case in which we use a mixture model in order to account for additional sources of uncertainty.


 

4 Startup R session

4.1 Load phyloseq, ggplot2

library("phyloseq"); packageVersion("phyloseq")
## [1] '1.25.2'
library("ggplot2"); packageVersion("ggplot2")
## [1] '3.0.0'
theme_set(theme_bw())
library("magrittr"); packageVersion("magrittr")
## [1] '1.5'

4.2 Load provided example data

closedps <- readRDS("Kostic2012StageII.RDS")
closedps %>% nsamples
## [1] 26
closedps %>% ntaxa
## [1] 634
closedps %>% sample_variables()
##  [1] "X.SampleID"                    "BarcodeSequence"              
##  [3] "LinkerPrimerSequence"          "NECROSIS_PERCENT"             
##  [5] "TARGET_SUBFRAGMENT"            "ASSIGNED_FROM_GEO"            
##  [7] "EXPERIMENT_CENTER"             "TITLE"                        
##  [9] "RUN_PREFIX"                    "AGE"                          
## [11] "NORMAL_EQUIVALENT_PERCENT"     "FIBROBLAST_AND_VESSEL_PERCENT"
## [13] "DEPTH"                         "TREATMENT"                    
## [15] "AGE_AT_DIAGNOSIS"              "COMMON_NAME"                  
## [17] "HOST_COMMON_NAME"              "BODY_SITE"                    
## [19] "ELEVATION"                     "REPORTS_RECEIVED"             
## [21] "CEA"                           "PCR_PRIMERS"                  
## [23] "COLLECTION_DATE"               "ALTITUDE"                     
## [25] "ENV_BIOME"                     "SEX"                          
## [27] "PLATFORM"                      "RACE"                         
## [29] "BSP_DIAGNOSIS"                 "STUDY_CENTER"                 
## [31] "COUNTRY"                       "CHEMOTHERAPY"                 
## [33] "YEAR_OF_DEATH"                 "ETHNICITY"                    
## [35] "ANONYMIZED_NAME"               "TAXON_ID"                     
## [37] "SAMPLE_CENTER"                 "SAMP_SIZE"                    
## [39] "YEAR_OF_BIRTH"                 "ORIGINAL_DIAGNOSIS"           
## [41] "AGE_UNIT"                      "STUDY_ID"                     
## [43] "EXPERIMENT_DESIGN_DESCRIPTION" "Description_duplicate"        
## [45] "DIAGNOSIS"                     "BODY_HABITAT"                 
## [47] "SEQUENCING_METH"               "RUN_DATE"                     
## [49] "HISTOLOGIC_GRADE"              "LONGITUDE"                    
## [51] "ENV_MATTER"                    "TARGET_GENE"                  
## [53] "ENV_FEATURE"                   "KEY_SEQ"                      
## [55] "BODY_PRODUCT"                  "TUMOR_PERCENT"                
## [57] "LIBRARY_CONSTRUCTION_PROTOCOL" "REGION"                       
## [59] "RUN_CENTER"                    "TUMOR_TYPE"                   
## [61] "BSP_NOTES"                     "RADIATION_THERAPY"            
## [63] "INFLAMMATION_PERCENT"          "HOST_SUBJECT_ID"              
## [65] "PC3"                           "LATITUDE"                     
## [67] "OSH_DIAGNOSIS"                 "STAGE"                        
## [69] "PRIMARY_DISEASE"               "HOST_TAXID"                   
## [71] "Description"

 


5 DESeq2

5.1 DESeq2 with phyloseq

DESeq2 has an official extension within the phyloseq package and a vignette within phyloseq.

Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biology 11: R106

5.2 DESeq2 conversion

The function phyloseq_to_deseq2 converts your phyloseq-format microbiome data into a DESeqDataSet with dispersions estimated, using the experimental design formula, also shown (the ~DIAGNOSIS term).

dds = phyloseq_to_deseq2(closedps, ~DIAGNOSIS)
## Loading required namespace: DESeq2
## converting counts to integer mode

5.3 DESeq2’s DESeq function

The DESeq function does the rest of the modelling/testing. In this case we’re using its default testing framework, but you can use alternatives via glm interface. The default multiple-inference correction is FDR, and occurs within the DESeq function.

library("DESeq2"); packageVersion("DESeq2")
## [1] '1.20.0'
# A custom geometric mean function, with zero/NA tolerance.
gm_mean = function(x, na.rm=TRUE){
  exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
}
geoMeans = apply(counts(dds), 1, gm_mean)
dds = estimateSizeFactors(dds, geoMeans = geoMeans)
dds = DESeq(dds, test="Wald", fitType = "parametric")
## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 137 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing

5.4 DESeq2 - explore results

results extracts a table of the test results from dds. Very fast. The following will order by FDR, and remove NA values.

fdrAlpha = 0.05
resultsDESeq2 = results(dds)
# Add taxonomy to the results table
resultsDESeq2 <-
  cbind(
    resultsDESeq2 %>% as("data.frame"),
    tax_table(closedps)[rownames(resultsDESeq2), ] %>% as("matrix")) %>% 
  set_rownames(rownames(resultsDESeq2))
resultsDESeq2$OTU <- rownames(resultsDESeq2)
# Set default/no-information to failed tests
# e.g. NA `padj` values to max, 1.0
resultsDESeq2[is.na(resultsDESeq2$log2FoldChange), "log2FoldChange"] <- 0.0
resultsDESeq2[is.na(resultsDESeq2$padj), "padj"] <- 1.0
resultsDESeq2[is.na(resultsDESeq2$pvalue), "pvalue"] <- 1.0
resultsDESeq2 %>% 
  dplyr::filter(padj < fdrAlpha) %>% 
  knitr::kable()
baseMean log2FoldChange lfcSE stat pvalue padj Kingdom Phylum Class Order Family Genus Species OTU
6.66580 21.73314 2.942975 7.384753 0.00e+00 0.0000000 Bacteria Bacteroidetes Bacteroidia Bacteroidales Porphyromonadaceae Porphyromonas endodontalis 4423790
19.99757 -5.78066 1.384756 -4.174497 2.99e-05 0.0063462 Bacteria Fusobacteria Fusobacteriia Fusobacteriales Fusobacteriaceae Fusobacterium NA 4472949

Explore OTUs that were significantly different between the two classes. The following makes a nice ggplot2 summary of the results.

To make the plot look nice, we need to order things for ggplot2, and it typically does this by the order of elements in the factor. This looks a bit complicated, but don’t worry to much about it. We are not changing the data at all here, just the order of unique categories in each factor.

library("ggplot2")
theme_set(theme_bw())
pdsd <-
  resultsDESeq2 %>% 
  ggplot(
    mapping = aes(
      x = log2FoldChange,
      y = -log10(pvalue),
      color = Phylum,
      shape = padj < fdrAlpha)
  ) + 
  geom_vline(xintercept = 0.0, linetype = 2) +
  geom_point() +
  # Highlight the rejected hypotheses at FDR < alpha
  geom_point(
    size = 6, 
    alpha = 0.5,
    data = resultsDESeq2 %>% dplyr::filter(padj < fdrAlpha)) +
  geom_text(
    size = 2, 
    color = "black",
    mapping = aes(label = OTU),
    data = resultsDESeq2 %>% dplyr::filter(padj < fdrAlpha)) +
  scale_size_continuous(range = c(3, 7)) +
  theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5)) +
  ggtitle("Volcano Plot", "DESeq2 Result")
print(pdsd)

5.5 DESeq2 - Variance Stabilization

DESeq2 provides a variance stabilizing transformation (VST) from the fitted dispersion-mean relation(s). The count data is transformed and normalized by division by the library size factor, yielding a matrix of values which are now approximately homoskedastic (having constant variance along the range of mean values). The rlogTransformation is less sensitive to size factors, which can be an issue when size factors vary widely. This transformation is useful when checking for outliers or as input for machine learning techniques such as clustering or linear discriminant analysis.

Relevant functions

  • varianceStabilizingTransformation
  • getVarianceStabilizedData
  • rlogTransformation / rlog

Note that neither rlog transformation nor the VST are used by the differential expression estimation in DESeq, which always occurs on the raw count data, through generalized linear modeling which incorporates knowledge of the variance-mean dependence. The rlog transformation and VST are offered as separate functionality which can be used for visualization, clustering or other machine learning tasks. See the transformation section of the vignette for more details.

5.6 DESeq2 - VST

Perform VST, store the transformed values in a new table.

vsd = getVarianceStabilizedData(dds)
dim(vsd)
## [1] 634  26
cpsvsd = closedps
otu_table(cpsvsd) <- otu_table(vsd, taxa_are_rows = TRUE)

Compute and display ordination

pca = ordinate(cpsvsd, "RDA")
plot_ordination(cpsvsd, pca, type = "split", color="Phylum", shape="DIAGNOSIS")

5.7 DESeq2 - Regularized Log

We can do the same thing with regularized log.

The transformed values, rlog(K), are equal to rlog(K_ij) = log2(q_ij) = x_j. * beta_i, with formula terms defined in DESeq function documentation.

rld = rlog(dds, blind = FALSE)
cpsrld = closedps
otu_table(cpsrld) <- otu_table(assay(rld), taxa_are_rows = TRUE)
pca = ordinate(cpsrld, "RDA")
plot_ordination(cpsrld, pca, type = "split",
                color="Phylum", shape="DIAGNOSIS")

 


6 edgeR

6.1 edgeR Background

Robinson MD, McCarthy DJ, Smyth GK (2009) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics (Oxford, England) 26: 139–140

Another popular negative binomial method.

6.2 Install edgeR

source("http://bioconductor.org/biocLite.R")
biocLite("edgeR")

6.3 Load package and function

library("edgeR"); packageVersion("edgeR")
## [1] '3.22.3'

Define a function for converting phyloseq data into edgeR DGE format.

################################################################################
#' Convert phyloseq OTU count data into DGEList for edgeR package
#' 
#' Further details.
#' 
#' @param physeq (Required).  A \code{\link{phyloseq-class}} or
#'  an \code{\link{otu_table-class}} object. 
#'  The latter is only appropriate if \code{group} argument is also a 
#'  vector or factor with length equal to \code{nsamples(physeq)}.
#'  
#' @param group (Required). A character vector or factor giving the experimental
#'  group/condition for each sample/library. Alternatively, you may provide
#'  the name of a sample variable. This name should be among the output of
#'  \code{sample_variables(physeq)}, in which case
#'  \code{get_variable(physeq, group)} would return either a character vector or factor.
#'  This is passed on to \code{\link[edgeR]{DGEList}},
#'  and you may find further details or examples in its documentation.
#'  
#' @param method (Optional). The label of the edgeR-implemented normalization to use.
#'  See \code{\link[edgeR]{calcNormFactors}} for supported options and details. 
#'  The default option is \code{"RLE"}, which is a scaling factor method 
#'  proposed by Anders and Huber (2010).
#'  At time of writing, the \link[edgeR]{edgeR} package supported 
#'  the following options to the \code{method} argument:
#'  
#'  \code{c("TMM", "RLE", "upperquartile", "none")}.
#'
#' @param ... Additional arguments passed on to \code{\link[edgeR]{DGEList}}
#' 
#' @examples
#' 
phyloseq_to_edgeR = function(physeq, group, method="RLE", ...){
  require("edgeR")
  require("phyloseq")
  # Enforce orientation.
  if( !taxa_are_rows(physeq) ){ physeq <- t(physeq) }
  x = as(otu_table(physeq), "matrix")
  # Add one to protect against overflow, log(0) issues.
  x = x + 1
  # Check `group` argument
  if( identical(all.equal(length(group), 1), TRUE) & nsamples(physeq) > 1 ){
    # Assume that group was a sample variable name (must be categorical)
    group = get_variable(physeq, group)
  }
  # Define gene annotations (`genes`) as tax_table
  taxonomy = tax_table(physeq, errorIfNULL=FALSE)
  if( !is.null(taxonomy) ){
    taxonomy = data.frame(as(taxonomy, "matrix"))
  } 
  # Now turn into a DGEList
  y = DGEList(counts=x, group=group, genes=taxonomy, remove.zeros = TRUE, ...)
  # Calculate the normalization factors
  z = edgeR::calcNormFactors(y, method=method)
  # Check for division by zero inside `calcNormFactors`
  if( !all(is.finite(z$samples$norm.factors)) ){
    stop("Something wrong with edgeR::calcNormFactors on this data,
         non-finite $norm.factors, consider changing `method` argument")
  }
  # Estimate dispersions
  return(edgeR::estimateTagwiseDisp(edgeR::estimateCommonDisp(z)))
}
################################################################################

6.4 edgeR - Quick Filtering

What kind of filter is this?

cpsf = filter_taxa(closedps, 
                   function(x){sum(x>=1) > (length(x)/4)},
                   prune = TRUE)

6.5 edgeR - Convert for edgeR

Now let’s use our newly-defined function to convert the phyloseq data object into an edgeR “DGE” data object, called dge.

dge = phyloseq_to_edgeR(cpsf, group="DIAGNOSIS")

6.6 edgeR - Differential Abundance

Perform two-class test

et = exactTest(dge)

6.7 edgeR - Compile Results

Extract values and plot

tt <-
  topTags(
    object = et, 
    n = nrow(dge$table), 
    adjust.method="BH", 
    sort.by="PValue")
resER <- tt@.Data[[1]] %>% as("data.frame")
resER %>% 
  dplyr::filter(FDR < 0.05) %>% 
  knitr::kable()
Kingdom Phylum Class Order Family Genus Species logFC logCPM PValue FDR
Bacteria Firmicutes Clostridia Clostridiales Veillonellaceae Succiniclasticum NA -3.147567 11.40684 0.0002191 0.0289242

6.8 edgeR - Plot

Differential Abundance

resER$OTU <- rownames(resER)
pedgeR <-
  ggplot(
    data = resER, 
    mapping = aes(
      x = logFC, 
      y = -log10(PValue),
      color = Phylum, 
      shape = FDR < fdrAlpha)) + 
  geom_vline(xintercept = 0) +
  geom_point() + 
  scale_size_continuous(range = c(3, 7)) +
  theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0.5)) +
  ggtitle("edgeR - Volcano Plot")
pedgeR +
  geom_text(
    size = 2,
    mapping = aes(label = OTU),
    color = "black",
    data = resER %>% dplyr::filter(FDR < fdrAlpha))

6.9 edgeR-robust

Zhou, X., Lindsay, H., & Robinson, M. D. (2014). Robustly detecting differential expression in RNA sequencing data using observation weights. Nucleic Acids Research.

estimateGLMRobustDisp - Empirical Robust Bayes Tagwise Dispersions for Negative Binomial GLMs using Observation Weights.

“At times, because of the moderation of dispersion estimates towards a trended values, features (typically, genes) can be sensitive to outliers and causing false positives. That is, since the dispersion estimates are moderated downwards toward the trend and because the regression parameter estimates may be affected by the outliers, genes are deemed significantly differential expressed”

# Define the design matrix for the full model
design <- 
  model.matrix(
    object = ~DIAGNOSIS, 
    data = cpsf %>% sample_data %>% data.frame) 
design
##                 (Intercept) DIAGNOSISTumor
## 32I9UAPQ.518112           1              1
## 32I9UNA9.518098           1              0
## 41E1KAMA.517991           1              1
## 41E1KNBP.517985           1              0
## 59S9WAIH.518013           1              1
## 59S9WNC6.518055           1              0
## 82S3MAZ4.518145           1              1
## 82S3MNBY.518050           1              0
## BFJMKAKB.518159           1              1
## BFJMKNMP.518102           1              0
## G3UBQAJU.518060           1              1
## G3UBQNDP.518121           1              0
## KIXFRARY.518160           1              1
## KIXFRNL2.518129           1              0
## LRQ9WAHA.518036           1              1
## LRQ9WN6C.518048           1              0
## MQETMAZC.518171           1              1
## MQETMNL4.518037           1              0
## QFHRSAG2.518005           1              1
## QFHRSNIH.518116           1              0
## TV28IANZ.518070           1              1
## TV28INUO.518004           1              0
## UZ65XAS5.518119           1              1
## UZ65XN27.518028           1              0
## XBS5MA8H.517990           1              1
## XBS5MNEH.518074           1              0
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$DIAGNOSIS
## [1] "contr.treatment"
rer = estimateGLMRobustDisp(dge, design)
rertab <- 
  topTags(
    object = exactTest(rer),
    n = nrow(dge$table), 
    adjust.method = "BH",
    sort.by = "PValue")@.Data[[1]] %>% 
  as("data.frame")
# Show the results with FDR < 0.05
rertab %>% 
  dplyr::filter(FDR < 0.05) %>% 
  knitr::kable()
Kingdom Phylum Class Order Family Genus Species logFC logCPM PValue FDR
Bacteria Firmicutes Clostridia Clostridiales Veillonellaceae Succiniclasticum NA -3.314996 10.09857 0.0000000 0.0000001
Bacteria Proteobacteria Gammaproteobacteria Aeromonadales Aeromonadaceae NA NA -2.727805 10.20933 0.0000006 0.0000367
Bacteria Firmicutes Clostridia Clostridiales Lachnospiraceae NA NA -2.528247 10.37310 0.0000008 0.0000367
Bacteria Actinobacteria Actinobacteria Actinomycetales Actinomycetaceae Actinomyces NA 1.563126 10.07439 0.0000297 0.0009789
Bacteria Firmicutes Erysipelotrichi Erysipelotrichales Erysipelotrichaceae [Eubacterium] dolichum -1.850542 10.43365 0.0000663 0.0017511
Bacteria Proteobacteria Betaproteobacteria Burkholderiales Alcaligenaceae Sutterella NA -1.938259 10.95418 0.0008726 0.0191981
Bacteria Firmicutes Erysipelotrichi Erysipelotrichales Erysipelotrichaceae NA NA -1.716481 10.67694 0.0010679 0.0201376
Bacteria Firmicutes Clostridia Clostridiales Lachnospiraceae [Ruminococcus] gnavus -2.189227 11.28974 0.0019370 0.0316542
Bacteria Firmicutes Clostridia Clostridiales Lachnospiraceae [Ruminococcus] NA -1.901929 11.46815 0.0021582 0.0316542
Bacteria Firmicutes Clostridia Clostridiales Lachnospiraceae NA NA 1.457871 10.22377 0.0037352 0.0482130
Bacteria Firmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus NA -1.392119 10.95531 0.0040727 0.0482130
Bacteria Bacteroidetes Bacteroidia Bacteroidales Bacteroidaceae Bacteroides NA -1.482239 10.24958 0.0043830 0.0482130

6.10 Plot (Robust)

We already defined the plot for edgeR earlier. Let’s copy that plot, and then replace the data. Magic!

prer = pedgeR
rertab$OTU <- rownames(rertab)
prer$data <- rertab
prer <-
  prer +
  geom_text(
    size = 2,
    mapping = aes(label = OTU),
    color = "black",
    data = rertab %>% dplyr::filter(FDR < fdrAlpha)) +
  ggtitle("edgeR-robust - Volcano Plot")
print(prer)

6.11 Detach edgeR package

We’re going to move on to other packages. Let’s “unload” edgeR to avoid conflicts, namespace issues.

unloadNamespace("edgeR")

 


7 metagenomeSeq

7.1 metagenomeSeq - install

source("http://bioconductor.org/biocLite.R")
biocLite("metagenomeSeq")

7.3 metagenomeSeq functions

Define a second function to based on our simple two-class experimental design.

Function to perform relevant test (fitZig) - MGS - A metagenomeSeq object, most-likely produced using the conversion tool make_metagenomeSeq() - variable - the variable among the sample_data (aka “phenoData” in metagenomeSeq) that you want to test - physeq - optional phyloseq data object that has the relevant tax_table. phyloseq or taxonomyTable is fine.

library("metagenomeSeq"); packageVersion("metagenomeSeq")
## Loading required package: glmnet
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:S4Vectors':
## 
##     expand
## Loading required package: foreach
## Loaded glmnet 2.0-16
## Loading required package: RColorBrewer
## [1] '1.22.0'
# Function to convert form phyloseq object to metagenomeSeq object
phyloseq_to_metagenomeSeq = function(physeq){
  require("metagenomeSeq")
  require("phyloseq")
  # Enforce orientation
  if(!taxa_are_rows(physeq)){physeq <- t(physeq)}
  OTU = as(otu_table(physeq), "matrix")
  # Convert sample_data to AnnotatedDataFrame
  ADF = AnnotatedDataFrame(data.frame(sample_data(physeq)))
  # define dummy "feature" data for OTUs, using their name
  # Helps with extraction and relating to taxonomy later on.
  TDF = AnnotatedDataFrame(
    data = data.frame(
      OTUname = taxa_names(physeq), 
      row.names=taxa_names(physeq)))
  # Create the metagenomeSeq object
  MGS = newMRexperiment(counts=OTU, phenoData=ADF, featureData=TDF)
  # Trigger metagenomeSeq to calculate its Cumulative Sum scaling factor.
  MGS = cumNorm(MGS)
  return(MGS)
}
# Function to perform relevant test (`fitZig`)
# - `MGS` - A metagenomeSeq object, most-likely produced using the conversion tool make_metagenomeSeq()
# - `variable` - the variable among the sample_data (aka "phenoData" in metagenomeSeq) that you want to test
# - `physeq` - optional phyloseq data object that has the relevant `tax_table`.
# `phyloseq` or `taxonomyTable` is fine.
test_metagenomeSeq = function(MGS, variable, physeq=NULL){
  require("metagenomeSeq")
  require("phyloseq")
  # Create the `mod` variable used in the fitZig test.
  if( inherits(variable, "factor") ){
    # If variable is already a factor, use directly in model.matrix
    mod = model.matrix( ~variable )
  } else if( inherits(variable, "matrix") ){
    # If it is a matrix, assume that model.matrix() has been used already
  } else if( inherits(variable, "character") ){ 
    # If it is a character that specifies a variable in phenoData, 
    # use the corresponding variable from MGS
    if( variable %in% colnames(phenoData(MGS)@data) ){
      mod = model.matrix(~phenoData(MGS)@data[, variable])
    } else {
      stop("The phenoData variable name you specified is not present in `phenoData(MGS)`")
    }
  } else {
    stop("Improper specification of the experimental design variable for testing. See `variable` argument")
  }
  # Wrapper to run the Expectation-maximization algorithm 
  # and estimate $f_count$ fits with the zero-inflated Guassian (z.i.g.)
  fit = fitZig(MGS, mod)
  # You need to specify all OTUs to get the full table from MRfulltable. 
  x = MRfulltable(fit, number=nrow(assayData(MGS)$counts))
  # if any OTUs left out, rm those from x. Detected by NA rownames.
  x = x[!is.na(rownames(x)), ]
  # Modify this data.frame by adding the OTUnames. Clip the ":1" added to the OTU names
  rownames(x) <- gsub(":1", "", x=rownames(x), fixed=TRUE)
  x$OTUnames <- as.character(rownames(x))
  if( !is.null(tax_table(physeq, errorIfNULL=FALSE)) ){
    # Attach the bacterial taxonomy to the table, if available
    TAX = data.frame(tax_table(physeq))
    TAX$OTUnames <- as.character(rownames(TAX))    
    y = merge(x, TAX, by="OTUnames") 
  } else {
    y = x
  }
  # Sort and return
  y = y[order(y$adjPvalue), ]
  return(y)
}

7.4 metagenomeSeq - test

Perform metagenomeSeq differential abundance detection. The default method for multiple-inference adjustment in metagenomeSeq::MRfulltable is Benjamini-Hochberg (FDR).

mgs = phyloseq_to_metagenomeSeq(cpsf)
## Default value being used.
mgsres = test_metagenomeSeq(mgs, "DIAGNOSIS", cpsf)
## it= 0, nll=41.04, log10(eps+1)=Inf, stillActive=132
## it= 1, nll=44.23, log10(eps+1)=0.04, stillActive=14
## it= 2, nll=44.24, log10(eps+1)=0.03, stillActive=4
## it= 3, nll=44.31, log10(eps+1)=0.01, stillActive=1
## it= 4, nll=44.33, log10(eps+1)=0.03, stillActive=1
## it= 5, nll=44.38, log10(eps+1)=0.00, stillActive=0
mgsres$OTU <- mgsres$OTUnames

7.5 metagenomeSeq - plot

head(mgsres)
##     OTUnames +samples in group 0 +samples in group 1 counts in group 0
## 128   806497                   5                   3               221
## 37   4382457                   7                   6               117
## 73   4453773                   3                   4                37
## 38   4383924                   6                   3               131
## 68   4447950                   5                   2               111
## 122  4483337                   8                   7               267
##     counts in group 1 oddsRatio      lower     upper   fisherP fisherAdjP
## 128                 4 2.0247229 0.28658818 17.249447 0.6727689          1
## 37                 24 1.3449818 0.22723452  8.260005 1.0000000          1
## 73                  9 0.6853112 0.07781577  5.338345 1.0000000          1
## 38                 17 2.7406820 0.40859215 23.066726 0.4109840          1
## 68                 17 3.2757687 0.40402819 42.843261 0.3782609          1
## 122                23 1.3547860 0.22311713  8.588112 1.0000000          1
##     (Intercept) phenoData(MGS)@data[, variable]Tumor scalingFactor
## 128    4.845874                            -3.931532    -2.7012098
## 37     1.828084                            -1.663561     7.3726635
## 73     1.617782                            -1.912030     6.5184995
## 38     3.840960                            -2.156651    -1.6724050
## 68     1.479270                            -2.006814     8.4916702
## 122    3.206714                            -1.923928     0.5316862
##          pvalues  adjPvalues  Kingdom         Phylum               Class
## 128 2.733165e-05 0.003607777 Bacteria     Firmicutes          Clostridia
## 37  3.526532e-03 0.155167415 Bacteria     Firmicutes     Erysipelotrichi
## 73  3.130797e-03 0.155167415 Bacteria Proteobacteria Deltaproteobacteria
## 38  9.034026e-03 0.238498275 Bacteria     Firmicutes          Clostridia
## 68  8.181027e-03 0.238498275 Bacteria  Bacteroidetes         Bacteroidia
## 122 1.226664e-02 0.269865991 Bacteria     Firmicutes          Clostridia
##                  Order              Family            Genus  Species
## 128      Clostridiales     Veillonellaceae Succiniclasticum     <NA>
## 37  Erysipelotrichales Erysipelotrichaceae    [Eubacterium] dolichum
## 73  Desulfovibrionales Desulfovibrionaceae    Desulfovibrio     <NA>
## 38       Clostridiales                <NA>             <NA>     <NA>
## 68       Bacteroidales      Bacteroidaceae      Bacteroides     <NA>
## 122      Clostridiales     Lachnospiraceae             <NA>     <NA>
##         OTU
## 128  806497
## 37  4382457
## 73  4453773
## 38  4383924
## 68  4447950
## 122 4483337
pmgs <- 
  ggplot(
    data = mgsres,
    mapping = aes(`(Intercept)`, -log10(pvalues), color = Phylum)) + 
  geom_vline(xintercept = 0) +
  geom_point() + 
  theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0.5)) + 
  ggtitle("metagenomeSeq - Volcano Plot")
pmgs <-
  pmgs +
  geom_text(
    size = 2,
    mapping = aes(label = OTU),
    color = "black",
    data = mgsres %>% dplyr::filter(adjPvalues < fdrAlpha))
print(pmgs)

7.6 metagenomeSeq - Detach package

We’re going to move on to other packages. Let’s “unload” metagenomeSeq to avoid conflicts, namespace issues.

unloadNamespace("metagenomeSeq")

 

8 Compare Results

8.1 Combining ggplot2 figures

The gridExtra package adds some helpful utilities for dealing with graphic objects created using the low-level R graphics package called grid, which is included as part of the base R distribution. See gridBase for details about combining elements of grid graphics with base R graphics.

8.2 Combining ggplot2 figures

We will use the grid.arrange function from gridExtra

library("gridExtra"); packageVersion("gridExtra")
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:Biobase':
## 
##     combine
## The following object is masked from 'package:BiocGenerics':
## 
##     combine
## [1] '2.3'

8.3 Combining ggplot2 figures BONUS!

grid.arrange(nrow=2, pdsd, pedgeR, prer, pmgs)

9 Diligence

Generate a “tidy” table of the abundances, by “melting” the phyloseq object with psmelt() function.

9.1 Consistent hits between methods?

Count the number of hits that overlap between methods

if(!require("UpSetR")){install.packages("UpSetR")}
## Loading required package: UpSetR
hitList <-
  list(
    edgeR = resER %>% 
      dplyr::filter(FDR < fdrAlpha) %>% 
      extract2("OTU") %>% 
      unique,
    robustER = rertab %>% 
      dplyr::filter(FDR < fdrAlpha) %>% 
      extract2("OTU") %>% 
      unique,
    DESeq2 = resultsDESeq2 %>% 
      dplyr::filter(padj < fdrAlpha) %>% 
      extract2("OTU") %>% 
      unique,
    metagenomeSeq = mgsres %>% 
      dplyr::filter(adjPvalues < fdrAlpha) %>% 
      extract2("OTU") %>% 
      unique
  ) 
hitList %>% 
  fromList() %>% 
  upset(order.by = "freq")

Which OTU overlaps the most methods?

hitAgrees <-
  intersect(
    hitList$edgeR,
    hitList$metagenomeSeq) %>% 
  intersect(hitList$robustER)

9.2 Example “groundtruth” plot

library("data.table")
## 
## Attaching package: 'data.table'
## The following object is masked from 'package:SummarizedExperiment':
## 
##     shift
## The following object is masked from 'package:GenomicRanges':
## 
##     shift
## The following object is masked from 'package:IRanges':
## 
##     shift
## The following objects are masked from 'package:S4Vectors':
## 
##     first, second
library("magrittr")
longtab = psmelt(closedps) %>% data.table
longtab[, RelativeAbundance := Abundance/sum(Abundance), by = Sample]

I like violin plots …

# Subset to an example "hit(s)" that agree
justExampleHits = longtab[(OTU %in% hitAgrees)]
ggplot(
  data = justExampleHits, 
  mapping = aes(
    x = DIAGNOSIS, 
    y = RelativeAbundance, 
    color = Genus, 
    shape = Phylum)) + 
  geom_violin(color = "gray") +
  geom_jitter(size = 3, width = 0.1, alpha = 0.5) +
  scale_y_sqrt() +
  facet_wrap(~OTU) +
  ggtitle("'Groundtruth' plot for OTU hit(s)", 
          "Each point is an OTU in a sample. Is this convincing? Which direction is the result?")