Expected Time to Completion: 45 minutes
The data here is real, but based on OTU-clustering and old sequencing technology. There is not a known truth for any of the examples, and as you’ll see, different methods give different answers even for the rank of most significant results.
In the following examples we will use publicly available data from a study on colorectal cancer (Kostic et al (2012) Genomic analysis identifies association of Fusobacterium with colorectal carcinoma. Genome research 22:292–298). This data consists of 454 FLX Titanium sequencing counts of OTUs from PCR amplicons of the V3-V5 variable region of the 16S rRNA gene. In the original study 190 separate samples were obtained in the form of samples from 95 patient biopsy pairs from either tumor or non-tumor tissue. For simplicity of illustration, this data has been truncated to include only samples from patients with a Stage-II tumor.
packageVersion("phyloseq")
## [1] '1.28.0'
packageVersion("edgeR")
## [1] '3.26.5'
packageVersion("DESeq2")
## [1] '1.24.0'
packageVersion("metagenomeSeq")
## [1] '1.26.2'
If something is missing, use the Bioconductor installer:
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("edgeR")
Time-permitting, I recommend that you review the included shot noise simulation tutorial demonstrating the effect of sample size on species proportion estimates in the simple Poisson case (no overdispersion).
This should be a quick read, and a reminder why we need to include the Library Size(s) in microbiome analyses, as the proportion value itself omits / masks the implicit uncertainty of the sampling process.
The remainder of this lab is concerned with the more complicated case in which we use a mixture model in order to account for additional sources of uncertainty.
library("phyloseq"); packageVersion("phyloseq")
## [1] '1.28.0'
library("ggplot2"); packageVersion("ggplot2")
## [1] '3.2.0'
theme_set(theme_bw())
library("magrittr"); packageVersion("magrittr")
## [1] '1.5'
closedps <- readRDS("Kostic2012StageII.RDS")
closedps %>% nsamples
## [1] 26
closedps %>% ntaxa
## [1] 634
closedps %>% sample_variables()
## [1] "X.SampleID" "BarcodeSequence"
## [3] "LinkerPrimerSequence" "NECROSIS_PERCENT"
## [5] "TARGET_SUBFRAGMENT" "ASSIGNED_FROM_GEO"
## [7] "EXPERIMENT_CENTER" "TITLE"
## [9] "RUN_PREFIX" "AGE"
## [11] "NORMAL_EQUIVALENT_PERCENT" "FIBROBLAST_AND_VESSEL_PERCENT"
## [13] "DEPTH" "TREATMENT"
## [15] "AGE_AT_DIAGNOSIS" "COMMON_NAME"
## [17] "HOST_COMMON_NAME" "BODY_SITE"
## [19] "ELEVATION" "REPORTS_RECEIVED"
## [21] "CEA" "PCR_PRIMERS"
## [23] "COLLECTION_DATE" "ALTITUDE"
## [25] "ENV_BIOME" "SEX"
## [27] "PLATFORM" "RACE"
## [29] "BSP_DIAGNOSIS" "STUDY_CENTER"
## [31] "COUNTRY" "CHEMOTHERAPY"
## [33] "YEAR_OF_DEATH" "ETHNICITY"
## [35] "ANONYMIZED_NAME" "TAXON_ID"
## [37] "SAMPLE_CENTER" "SAMP_SIZE"
## [39] "YEAR_OF_BIRTH" "ORIGINAL_DIAGNOSIS"
## [41] "AGE_UNIT" "STUDY_ID"
## [43] "EXPERIMENT_DESIGN_DESCRIPTION" "Description_duplicate"
## [45] "DIAGNOSIS" "BODY_HABITAT"
## [47] "SEQUENCING_METH" "RUN_DATE"
## [49] "HISTOLOGIC_GRADE" "LONGITUDE"
## [51] "ENV_MATTER" "TARGET_GENE"
## [53] "ENV_FEATURE" "KEY_SEQ"
## [55] "BODY_PRODUCT" "TUMOR_PERCENT"
## [57] "LIBRARY_CONSTRUCTION_PROTOCOL" "REGION"
## [59] "RUN_CENTER" "TUMOR_TYPE"
## [61] "BSP_NOTES" "RADIATION_THERAPY"
## [63] "INFLAMMATION_PERCENT" "HOST_SUBJECT_ID"
## [65] "PC3" "LATITUDE"
## [67] "OSH_DIAGNOSIS" "STAGE"
## [69] "PRIMARY_DISEASE" "HOST_TAXID"
## [71] "Description"
closedps %>% get_variable("ANONYMIZED_NAME") %>% table()
## .
## 32I9U 41E1K 59S9W 82S3M BFJMK G3UBQ KIXFR LRQ9W MQETM QFHRS TV28I UZ65X
## 2 2 2 2 2 2 2 2 2 2 2 2
## XBS5M
## 2
closedps %>% get_variable("DIAGNOSIS") %>% table()
## .
## Healthy Tumor
## 13 13
DESeq2 has an official extension within the phyloseq package and a vignette within phyloseq.
Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biology 11: R106
The function phyloseq_to_deseq2
converts your phyloseq-format microbiome data into a DESeqDataSet
with dispersions estimated, using the experimental design formula, also shown (the ~DIAGNOSIS
term).
dds = phyloseq_to_deseq2(closedps, ~DIAGNOSIS)
## converting counts to integer mode
The DESeq
function does the rest of the modelling/testing. In this case we’re using its default testing framework, but you can use alternatives via glm
interface. The default multiple-inference correction is FDR, and occurs within the DESeq
function.
library("DESeq2"); packageVersion("DESeq2")
## [1] '1.24.0'
# A custom geometric mean function, with zero/NA tolerance.
gm_mean = function(x, na.rm=TRUE){
exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
}
geoMeans = apply(counts(dds), 1, gm_mean)
dds = estimateSizeFactors(dds, geoMeans = geoMeans)
dds = DESeq(dds, test="Wald", fitType = "parametric")
## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 137 genes
## -- DESeq argument 'minReplicatesForReplace' = 7
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
results
extracts a table of the test results from dds
. Very fast. The following will order by FDR, and remove NA
values.
fdrAlpha = 0.05
resultsDESeq2 = results(dds)
# Add taxonomy to the results table
resultsDESeq2 <-
cbind(
resultsDESeq2 %>% as("data.frame"),
tax_table(closedps)[rownames(resultsDESeq2), ] %>% as("matrix")) %>%
set_rownames(rownames(resultsDESeq2))
resultsDESeq2$OTU <- rownames(resultsDESeq2)
# Set default/no-information to failed tests
# e.g. NA `padj` values to max, 1.0
resultsDESeq2[is.na(resultsDESeq2$log2FoldChange), "log2FoldChange"] <- 0.0
resultsDESeq2[is.na(resultsDESeq2$padj), "padj"] <- 1.0
resultsDESeq2[is.na(resultsDESeq2$pvalue), "pvalue"] <- 1.0
resultsDESeq2 %>%
dplyr::filter(padj < fdrAlpha) %>%
knitr::kable()
baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | Kingdom | Phylum | Class | Order | Family | Genus | Species | OTU |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6.66580 | 21.73314 | 2.942975 | 7.384753 | 0.00e+00 | 0.0000000 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Porphyromonas | endodontalis | 4423790 |
19.99757 | -5.78066 | 1.384756 | -4.174497 | 2.99e-05 | 0.0063462 | Bacteria | Fusobacteria | Fusobacteriia | Fusobacteriales | Fusobacteriaceae | Fusobacterium | NA | 4472949 |
Explore OTUs that were significantly different between the two classes. The following makes a nice ggplot2 summary of the results.
To make the plot look nice, we need to order things for ggplot2, and it typically does this by the order of elements in the factor. This looks a bit complicated, but don’t worry to much about it. We are not changing the data at all here, just the order of unique categories in each factor.
library("ggplot2")
theme_set(theme_bw())
pdsd <-
resultsDESeq2 %>%
ggplot(
mapping = aes(
x = log2FoldChange,
y = -log10(pvalue),
color = Phylum,
shape = padj < fdrAlpha)
) +
geom_vline(xintercept = 0.0, linetype = 2) +
geom_point() +
# Highlight the rejected hypotheses at FDR < alpha
geom_point(
size = 6,
alpha = 0.5,
data = resultsDESeq2 %>% dplyr::filter(padj < fdrAlpha)) +
geom_text(
size = 2,
color = "black",
mapping = aes(label = OTU),
data = resultsDESeq2 %>% dplyr::filter(padj < fdrAlpha)) +
scale_size_continuous(range = c(3, 7)) +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5)) +
ggtitle("Volcano Plot", "DESeq2 Result")
print(pdsd)
DESeq2 provides a variance stabilizing transformation (VST) from the fitted dispersion-mean relation(s). The count data is transformed and normalized by division by the library size factor, yielding a matrix of values which are now approximately homoskedastic (having constant variance along the range of mean values). The rlogTransformation
is less sensitive to size factors, which can be an issue when size factors vary widely. This transformation is useful when checking for outliers or as input for machine learning techniques such as clustering or linear discriminant analysis.
Relevant functions
varianceStabilizingTransformation
getVarianceStabilizedData
rlogTransformation
/ rlog
Note that neither rlog transformation nor the VST are used by the differential expression estimation in DESeq, which always occurs on the raw count data, through generalized linear modeling which incorporates knowledge of the variance-mean dependence. The rlog transformation and VST are offered as separate functionality which can be used for visualization, clustering or other machine learning tasks. See the transformation section of the vignette for more details.
Perform VST, store the transformed values in a new table.
vsd = getVarianceStabilizedData(dds)
dim(vsd)
## [1] 634 26
cpsvsd = closedps
otu_table(cpsvsd) <- otu_table(vsd, taxa_are_rows = TRUE)
Compute and display ordination
pca = ordinate(cpsvsd, "RDA")
plot_ordination(cpsvsd, pca, type = "split", color="Phylum", shape="DIAGNOSIS")
We can do the same thing with regularized log.
The transformed values, rlog(K)
, are equal to rlog(K_ij) = log2(q_ij) = x_j. * beta_i
, with formula terms defined in DESeq
function documentation.
rld = rlog(dds, blind = FALSE)
cpsrld = closedps
otu_table(cpsrld) <- otu_table(assay(rld), taxa_are_rows = TRUE)
pca = ordinate(cpsrld, "RDA")
plot_ordination(cpsrld, pca, type = "split",
color="Phylum", shape="DIAGNOSIS")
Robinson MD, McCarthy DJ, Smyth GK (2009) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics (Oxford, England) 26: 139–140
Another popular negative binomial method.
source("http://bioconductor.org/biocLite.R")
biocLite("edgeR")
library("edgeR"); packageVersion("edgeR")
## [1] '3.26.5'
Define a function for converting phyloseq data into edgeR DGE format.
################################################################################
#' Convert phyloseq OTU count data into DGEList for edgeR package
#'
#' Further details.
#'
#' @param physeq (Required). A \code{\link{phyloseq-class}} or
#' an \code{\link{otu_table-class}} object.
#' The latter is only appropriate if \code{group} argument is also a
#' vector or factor with length equal to \code{nsamples(physeq)}.
#'
#' @param group (Required). A character vector or factor giving the experimental
#' group/condition for each sample/library. Alternatively, you may provide
#' the name of a sample variable. This name should be among the output of
#' \code{sample_variables(physeq)}, in which case
#' \code{get_variable(physeq, group)} would return either a character vector or factor.
#' This is passed on to \code{\link[edgeR]{DGEList}},
#' and you may find further details or examples in its documentation.
#'
#' @param method (Optional). The label of the edgeR-implemented normalization to use.
#' See \code{\link[edgeR]{calcNormFactors}} for supported options and details.
#' The default option is \code{"RLE"}, which is a scaling factor method
#' proposed by Anders and Huber (2010).
#' At time of writing, the \link[edgeR]{edgeR} package supported
#' the following options to the \code{method} argument:
#'
#' \code{c("TMM", "RLE", "upperquartile", "none")}.
#'
#' @param ... Additional arguments passed on to \code{\link[edgeR]{DGEList}}
#'
#' @examples
#'
phyloseq_to_edgeR = function(physeq, group, method="RLE", ...){
require("edgeR")
require("phyloseq")
# Enforce orientation.
if( !taxa_are_rows(physeq) ){ physeq <- t(physeq) }
x = as(otu_table(physeq), "matrix")
# Add one to protect against overflow, log(0) issues.
x = x + 1
# Check `group` argument
if( identical(all.equal(length(group), 1), TRUE) & nsamples(physeq) > 1 ){
# Assume that group was a sample variable name (must be categorical)
group = get_variable(physeq, group)
}
# Define gene annotations (`genes`) as tax_table
taxonomy = tax_table(physeq, errorIfNULL=FALSE)
if( !is.null(taxonomy) ){
taxonomy = data.frame(as(taxonomy, "matrix"))
}
# Now turn into a DGEList
y = DGEList(counts=x, group=group, genes=taxonomy, remove.zeros = TRUE, ...)
# Calculate the normalization factors
z = edgeR::calcNormFactors(y, method=method)
# Check for division by zero inside `calcNormFactors`
if( !all(is.finite(z$samples$norm.factors)) ){
stop("Something wrong with edgeR::calcNormFactors on this data,
non-finite $norm.factors, consider changing `method` argument")
}
# Estimate dispersions
return(edgeR::estimateTagwiseDisp(edgeR::estimateCommonDisp(z)))
}
################################################################################
What kind of filter is this?
cpsf <-
filter_taxa(
physeq = closedps,
flist = function(x){sum(x>=1) > (length(x)/4)},
prune = TRUE)
Now let’s use our newly-defined function to convert the phyloseq data object into an edgeR “DGE” data object, called dge
.
dge = phyloseq_to_edgeR(cpsf, group="DIAGNOSIS")
Perform two-class test
et = exactTest(dge)
Extract values and plot
tt <-
topTags(
object = et,
n = nrow(dge$table),
adjust.method="BH",
sort.by="PValue")
resER <- tt@.Data[[1]] %>% as("data.frame")
resER %>%
dplyr::filter(FDR < fdrAlpha) %>%
knitr::kable()
Kingdom | Phylum | Class | Order | Family | Genus | Species | logFC | logCPM | PValue | FDR |
---|---|---|---|---|---|---|---|---|---|---|
Bacteria | Firmicutes | Clostridia | Clostridiales | Veillonellaceae | Succiniclasticum | NA | -3.147567 | 11.40684 | 0.0002191 | 0.0289242 |
Differential Abundance
resER$OTU <- rownames(resER)
pedgeR <-
ggplot(
data = resER,
mapping = aes(
x = logFC,
y = -log10(PValue),
color = Phylum,
shape = FDR < fdrAlpha)) +
geom_vline(xintercept = 0) +
geom_point() +
scale_size_continuous(range = c(3, 7)) +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0.5)) +
ggtitle("edgeR - Volcano Plot")
pedgeR +
geom_text(
size = 2,
mapping = aes(label = OTU),
color = "black",
data = resER %>% dplyr::filter(FDR < fdrAlpha))
Zhou, X., Lindsay, H., & Robinson, M. D. (2014). Robustly detecting differential expression in RNA sequencing data using observation weights. Nucleic Acids Research.
estimateGLMRobustDisp
- Empirical Robust Bayes Tagwise Dispersions for Negative Binomial GLMs using Observation Weights.
“At times, because of the moderation of dispersion estimates towards a trended values, features (typically, genes) can be sensitive to outliers and causing false positives. That is, since the dispersion estimates are moderated downwards toward the trend and because the regression parameter estimates may be affected by the outliers, genes are deemed significantly differential expressed”
# Define the design matrix for the full model
design <-
model.matrix(
object = ~DIAGNOSIS,
data = cpsf %>% sample_data %>% data.frame)
design
## (Intercept) DIAGNOSISTumor
## 32I9UAPQ.518112 1 1
## 32I9UNA9.518098 1 0
## 41E1KAMA.517991 1 1
## 41E1KNBP.517985 1 0
## 59S9WAIH.518013 1 1
## 59S9WNC6.518055 1 0
## 82S3MAZ4.518145 1 1
## 82S3MNBY.518050 1 0
## BFJMKAKB.518159 1 1
## BFJMKNMP.518102 1 0
## G3UBQAJU.518060 1 1
## G3UBQNDP.518121 1 0
## KIXFRARY.518160 1 1
## KIXFRNL2.518129 1 0
## LRQ9WAHA.518036 1 1
## LRQ9WN6C.518048 1 0
## MQETMAZC.518171 1 1
## MQETMNL4.518037 1 0
## QFHRSAG2.518005 1 1
## QFHRSNIH.518116 1 0
## TV28IANZ.518070 1 1
## TV28INUO.518004 1 0
## UZ65XAS5.518119 1 1
## UZ65XN27.518028 1 0
## XBS5MA8H.517990 1 1
## XBS5MNEH.518074 1 0
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$DIAGNOSIS
## [1] "contr.treatment"
rer = estimateGLMRobustDisp(y = dge, design = design)
rertab <-
topTags(
object = exactTest(rer),
n = nrow(dge$table),
adjust.method = "BH",
sort.by = "PValue")@.Data[[1]] %>%
as("data.frame")
# Show the results with FDR < 0.05
rertab %>%
dplyr::filter(FDR < fdrAlpha) %>%
knitr::kable()
Kingdom | Phylum | Class | Order | Family | Genus | Species | logFC | logCPM | PValue | FDR |
---|---|---|---|---|---|---|---|---|---|---|
Bacteria | Firmicutes | Clostridia | Clostridiales | Veillonellaceae | Succiniclasticum | NA | -3.314996 | 10.09857 | 0.0000000 | 0.0000001 |
Bacteria | Proteobacteria | Gammaproteobacteria | Aeromonadales | Aeromonadaceae | NA | NA | -2.727805 | 10.20933 | 0.0000006 | 0.0000367 |
Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | NA | NA | -2.528247 | 10.37310 | 0.0000008 | 0.0000367 |
Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Actinomycetaceae | Actinomyces | NA | 1.563126 | 10.07439 | 0.0000297 | 0.0009789 |
Bacteria | Firmicutes | Erysipelotrichi | Erysipelotrichales | Erysipelotrichaceae | [Eubacterium] | dolichum | -1.850542 | 10.43365 | 0.0000663 | 0.0017511 |
Bacteria | Proteobacteria | Betaproteobacteria | Burkholderiales | Alcaligenaceae | Sutterella | NA | -1.938259 | 10.95418 | 0.0008726 | 0.0191981 |
Bacteria | Firmicutes | Erysipelotrichi | Erysipelotrichales | Erysipelotrichaceae | NA | NA | -1.716481 | 10.67694 | 0.0010679 | 0.0201376 |
Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | [Ruminococcus] | gnavus | -2.189227 | 11.28974 | 0.0019370 | 0.0316542 |
Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | [Ruminococcus] | NA | -1.901929 | 11.46815 | 0.0021582 | 0.0316542 |
Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | NA | NA | 1.457871 | 10.22377 | 0.0037352 | 0.0482130 |
Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | NA | -1.392119 | 10.95531 | 0.0040727 | 0.0482130 |
Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Bacteroidaceae | Bacteroides | NA | -1.482239 | 10.24958 | 0.0043830 | 0.0482130 |
We already defined the plot for edgeR earlier. Let’s copy that plot, and then replace the data. Magic!
prer = pedgeR
rertab$OTU <- rownames(rertab)
prer$data <- rertab
prer <-
prer +
geom_text(
size = 2,
mapping = aes(label = OTU),
color = "black",
data = rertab %>% dplyr::filter(FDR < fdrAlpha)) +
ggtitle("edgeR-robust - Volcano Plot")
print(prer)
We’re going to move on to other packages. Let’s “unload” edgeR to avoid conflicts, namespace issues.
unloadNamespace("edgeR")
source("http://bioconductor.org/biocLite.R")
biocLite("metagenomeSeq")
metagenomeSeq is an official Bioconductor package, with the explicit goal of detecting differential abundance in microbiome experiments with an explicit design.
First, we need functions to
MGS
objectDefine a second function to based on our simple two-class experimental design.
Function to perform relevant test (fitZig
) - MGS
- A metagenomeSeq object, most-likely produced using the conversion tool make_metagenomeSeq() - variable
- the variable among the sample_data (aka “phenoData” in metagenomeSeq) that you want to test - physeq
- optional phyloseq data object that has the relevant tax_table
. phyloseq
or taxonomyTable
is fine.
library("metagenomeSeq"); packageVersion("metagenomeSeq")
## [1] '1.26.2'
# Function to convert form phyloseq object to metagenomeSeq object
phyloseq_to_metagenomeSeq = function(physeq){
require("metagenomeSeq")
require("phyloseq")
# Enforce orientation
if(!taxa_are_rows(physeq)){physeq <- t(physeq)}
OTU = as(otu_table(physeq), "matrix")
# Convert sample_data to AnnotatedDataFrame
ADF = AnnotatedDataFrame(data.frame(sample_data(physeq)))
# define dummy "feature" data for OTUs, using their name
# Helps with extraction and relating to taxonomy later on.
TDF = AnnotatedDataFrame(
data = data.frame(
OTUname = taxa_names(physeq),
row.names=taxa_names(physeq)))
# Create the metagenomeSeq object
MGS = newMRexperiment(counts=OTU, phenoData=ADF, featureData=TDF)
# Trigger metagenomeSeq to calculate its Cumulative Sum scaling factor.
MGS = cumNorm(MGS)
return(MGS)
}
Perform metagenomeSeq differential abundance detection. The default method for multiple-inference adjustment in metagenomeSeq::MRfulltable
is Benjamini-Hochberg (FDR).
source("mgs-helper-funs.R")
mgs = phyloseq_to_metagenomeSeq(cpsf)
## Default value being used.
mgsres = test_metagenomeSeq(mgs, variable = "DIAGNOSIS", physeq = cpsf)
mgsres$OTU <- mgsres$OTUnames
library("data.table")
mgsres <- data.table(mgsres)
setorder(mgsres, adjPvalues, pvalues)
pmgs <-
ggplot(
data = mgsres,
mapping = aes(logFC, -log10(pvalues), color = Phylum)) +
geom_vline(xintercept = 0) +
geom_point() +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0.5)) +
ggtitle("metagenomeSeq - Volcano Plot")
pmgs <-
pmgs +
geom_text(
size = 2,
mapping = aes(label = OTU),
color = "black",
data = mgsres %>% dplyr::filter(adjPvalues < fdrAlpha))
print(pmgs)
We’re going to move on to other packages. Let’s “unload” metagenomeSeq to avoid conflicts, namespace issues.
unloadNamespace("metagenomeSeq")
The gridExtra package adds some helpful utilities for dealing with graphic objects created using the low-level R graphics package called grid, which is included as part of the base R distribution. See gridBase for details about combining elements of grid graphics with base R graphics.
We will use the grid.arrange
function from gridExtra
library("gridExtra"); packageVersion("gridExtra")
## [1] '2.3'
grid.arrange(nrow=2, pdsd, pedgeR, prer, pmgs)
Generate a “tidy” table of the abundances, by “melting” the phyloseq object with psmelt()
function.
Count the number of hits that overlap between methods
if(!require("UpSetR")){install.packages("UpSetR")}
hitList <-
list(
edgeR = resER %>%
dplyr::filter(FDR < fdrAlpha) %>%
extract2("OTU") %>%
unique,
robustER = rertab %>%
dplyr::filter(FDR < fdrAlpha) %>%
extract2("OTU") %>%
unique,
DESeq2 = resultsDESeq2 %>%
dplyr::filter(padj < fdrAlpha) %>%
extract2("OTU") %>%
unique,
metagenomeSeq = mgsres %>%
dplyr::filter(adjPvalues < fdrAlpha) %>%
extract2("OTU") %>%
unique
)
hitList %>%
fromList() %>%
upset(order.by = "freq")
Example counting repeat hits…
library("data.table")
data.table(method = names(hitList)) %>%
.[, data.table(OTU = hitList[[method]]), by = method] %>%
.[, .N, by = OTU]
## OTU N
## 1: 806497 2
## 2: 4483895 1
## 3: 4483337 1
## 4: 955102 1
## 5: 4382457 1
## 6: 4391009 1
## 7: 4465473 1
## 8: 4476604 1
## 9: 4472174 1
## 10: 4447605 1
## 11: 4479373 1
## 12: 4447950 1
## 13: 4423790 1
## 14: 4472949 1
Most-significant OTU from each method.
# Organize, and standardize p-value column name
bestHits <-
# Organize, and standardize p-value column name
list(
edgeR = resER %>%
data.table(keep.rownames = TRUE) %>%
setnames("PValue", "p.value") %>%
.[, .(rn, p.value)],
robustER = rertab %>%
data.table(keep.rownames = TRUE) %>%
setnames("PValue", "p.value") %>%
.[, .(rn, p.value)],
DESeq2 = resultsDESeq2 %>%
data.table(keep.rownames = TRUE) %>%
setnames("pvalue", "p.value") %>%
.[, .(rn, p.value)],
metagenomeSeq = mgsres %>%
data.table(keep.rownames = TRUE) %>%
setnames("OTUnames", "rn") %>%
setnames("pvalues", "p.value") %>%
.[, .(rn, p.value)]
) %>%
rbindlist(use.names = TRUE, idcol = "Method") %>%
setnames("rn", "OTU") %>%
# Sort by method, p-value, take the most-significant from each method
setorder(Method, p.value) %>%
.[, .SD[1], by = Method]
bestHits %>% knitr::kable(digits = 12)
Method | OTU | p.value |
---|---|---|
DESeq2 | 4423790 | 0.000000e+00 |
edgeR | 806497 | 2.191227e-04 |
metagenomeSeq | 4382457 | 6.903036e-02 |
robustER | 806497 | 9.990000e-10 |
library("data.table")
library("magrittr")
longtab = psmelt(closedps) %>% data.table
longtab[, RelativeAbundance := Abundance/sum(Abundance), by = Sample]
I like violin plots …
# Subset to an example "hit(s)" that agree
justExampleHits = longtab[(OTU %in% bestHits$OTU)]
ggplot(
data = justExampleHits,
mapping = aes(
x = DIAGNOSIS,
y = RelativeAbundance,
color = Genus,
shape = Phylum)) +
geom_violin(color = "gray") +
geom_jitter(size = 3, width = 0.1, alpha = 0.5) +
scale_y_sqrt() +
facet_wrap(~OTU) +
ggtitle("'Groundtruth' plot for OTU hit(s)",
"Each point is an OTU in a sample. Is this convincing? Which direction is the result?")