0.1 Outline

Expected Time to Completion: 45 minutes

  • Trees (in R)
  • Tree-based distances
  • Network Manipulation
  • Network Graphics

 


1 Startup R session

1.1 Load phyloseq

The ape package provides lots of useful, efficient tools for dealing with trees in R. We load it right away.

library("phyloseq"); packageVersion("phyloseq")
## [1] '1.24.0'
library("ape"); packageVersion("ape")
## [1] '5.1'
library("ggplot2"); packageVersion("ggplot2")
## [1] '3.0.0'
theme_set(theme_bw())

1.2 Load favorite example dataset :)

load("example-data.RData")
ls()
##  [1] "alpha"                     "bigdt"                    
##  [3] "bigdt2"                    "closedps"                 
##  [5] "cpsf"                      "cpsp"                     
##  [7] "cpsrld"                    "cpsvsd"                   
##  [9] "dadaFs"                    "dadaRs"                   
## [11] "day"                       "dds"                      
## [13] "derepFs"                   "derepRs"                  
## [15] "deseq2Hits"                "design"                   
## [17] "dge"                       "dirFiltSeqs"              
## [19] "dpcoa"                     "duf"                      
## [21] "dufPCoA"                   "enterotype"               
## [23] "et"                        "fastqs"                   
## [25] "filtFs"                    "filtRs"                   
## [27] "fnFs"                      "fnRs"                     
## [29] "fns"                       "g"                        
## [31] "gender"                    "i"                        
## [33] "JustSigsDeseq"             "LibrarySizes"             
## [35] "longtab"                   "match.ref"                
## [37] "mergers"                   "mgs"                      
## [39] "mgsres"                    "mockRef"                  
## [41] "N"                         "newEnt"                   
## [43] "newRoot"                   "ord.nmds.bray"            
## [45] "OTUproportions"            "p"                        
## [47] "path"                      "pca"                      
## [49] "pdsd"                      "pedgeR"                   
## [51] "phyloseq_to_edgeR"         "phyloseq_to_metagenomeSeq"
## [53] "plot_sigtab"               "pmgs"                     
## [55] "prer"                      "ps"                       
## [57] "ps.top20"                  "ps0"                      
## [59] "qiimedata"                 "rer"                      
## [61] "rertab"                    "resER"                    
## [63] "resultsDESeq2"             "rld"                      
## [65] "samdf"                     "sample.names"             
## [67] "samples.out"               "seqtab"                   
## [69] "seqtab.nochim"             "sim1"                     
## [71] "sim1df"                    "simulate_community"       
## [73] "spdf"                      "subject"                  
## [75] "taxMat"                    "test_metagenomeSeq"       
## [77] "title"                     "top20"                    
## [79] "tree"                      "tt"                       
## [81] "unqs.mock"                 "vsd"

 


2 Trees (in R)

2.1 Trees

  • Components of “phylo” object
  • Rooting
  • Descendants
  • Pruning
  • Random Tree
  • Plotting

Parts of “phylo” object. It is actually a list

tree = phy_tree(closedps)
class(tree)
## [1] "phylo"
names(tree)
## [1] "edge"        "Nnode"       "tip.label"   "edge.length" "node.label"

The edge matrix

head(tree$edge)
##      [,1] [,2]
## [1,]   74    1
## [2,]   74   75
## [3,]   75    2
## [4,]   75   76
## [5,]   76   77
## [6,]   77   78

The edge matrix

tree$Nnode
## [1] 72
tree$tip.label[1:5]
## [1] "169901"  "673925"  "4470518" "1107945" "4346374"
taxa_names(tree)[1:5]
## [1] "169901"  "673925"  "4470518" "1107945" "4346374"
tree$edge.length[1:5]
## [1] 0.39755 0.08322 0.38007 0.02085 0.39352
tree$node.label[1:5]
## [1] "" "" "" "" ""

2.2 Trees - assignment

The node labels were empty. Let’s assign simulated bootstrap values to each node label using standard R list assignment semantics.

set.seed(711)
tree$node.label <- round(runif(tree$Nnode, 0 , 1), digits = 3)

All other aspects of a “phylo” object are just as easy to modify. This is both convenient and dangerous. Why?

What about assigning to a tree in a phyloseq object?

Easy! Just use the same list assignment semantics.

phy_tree(closedps)$node.label[1:5]
## [1] "" "" "" "" ""
phy_tree(closedps)$node.label <- tree$node.label
phy_tree(closedps)$node.label[1:5]
## [1] 0.290 0.426 0.325 0.630 0.445

You can even replace the entire tree.

Caution: this may side-step the automatic index checks, like taxa names, that are built-in to the phyloseq function.

phy_tree(closedps) <- tree
phy_tree(closedps)
## 
## Phylogenetic tree with 73 tips and 72 internal nodes.
## 
## Tip labels:
##  169901, 673925, 4470518, 1107945, 4346374, 4414420, ...
## Node labels:
##  0.29, 0.426, 0.325, 0.63, 0.445, 0.123, ...
## 
## Rooted; includes branch lengths.

2.3 Trees - The root

Let’s check which taxa is considered root in this tree.

is.rooted(tree)
## [1] TRUE
# random root
newRoot = taxa_names(tree)[sample(ntaxa(tree), 1)]
tree <- unroot(tree)
tree <- root(tree, newRoot, resolve.root = TRUE)

2.4 Trees - Ape Plots

Add the node IDs on a “native R” ape tree plot

plot(tree); nodelabels()

Plot the pretend bootstrap values instead of the node ID.

plot(tree); nodelabels(tree$node.label)

2.5 Trees - plot_tree

phyloseq defaults are a bit more legible.

plot_tree(tree, ladderize = "left", label.tips = "OTU")

Can add symbols to simplify bootstrap values

plot_tree(tree,
          nodelabf = nodeplotboot(80),
          ladderize = "left",
          label.tips = "OTU")
## Warning in asMethod(object): NAs introduced by coercion

Can add symbols representing samples

plot_tree(closedps, nodelabf = nodeplotboot(80), 
          ladderize = "left", label.tips = "OTU",
          color = "Treatment", justify = "left")

 


3 Tree-based distances

  • UniFrac
  • DPCoA

Convert to relative abundance, then calculate distance.

cpsp = transform_sample_counts(closedps, function(x){x/sum(x)})
duf = phyloseq::distance(cpsp, "wunifrac")

3.1 UniFrac

dufPCoA = ordinate(physeq = closedps, 
                   method = "PCoA",
                   distance = duf)
plot_scree(dufPCoA)

Make the w-UniFrac PCoA graphic

plot_ordination(cpsp, dufPCoA,
                color = "Treatment", 
                title="w-UniFrac PCoA") + 
  geom_point(size=7)

plot_ordination(cpsp, dufPCoA, 
                type = "split", 
                shape="Treatment", 
                color="Phylum", 
                title="wUF-MDS Split Plot")

What about alternative ordination method, like NMDS?

How would you encode that?

3.2 DPCoA

dpcoa = ordinate(cpsp, "DPCoA")
plot_scree(dpcoa)

plot_ordination(cpsp, dpcoa,
                color = "Treatment", title="DPCoA") + geom_point(size=7)

plot_ordination(cpsp, dpcoa, type = "split", 
                shape="Treatment", color="Phylum", title="DPCoA Split Plot")

 


4 Networks (in R)

4.1 Networks and Graphical Models in R

Lots on this subject. See CRAN graphical models task view.

phyloseq uses the igraph package for its internal network methods.

See plot_network tutorial.

4.2 Distance Threshold Network

There are lots of types of networks.

Here we are discussing a very simple network that represents (dis)similarity values. Its main purpose in our case is for data exploration.

Some small modification to the sample data… (this will save us the hassle of some pesky warning messages, but everything still works; the offending samples are anyway omitted).

data("enterotype")
newEnt = as.character(sample_data(enterotype)$Enterotype)
newEnt[is.na(newEnt)] <- "Unassigned"
sample_data(enterotype)$Enterotype <- factor(newEnt)

4.3 Generate a Distance Threshold Network in phyloseq

This will create an igraph object the main class for representing a graph object in igraph.

g = make_network(enterotype, distance = "bray", max.dist = 0.4)

This returned an igraph object, but to use igraph tools on it we actually need to load igraph. Lots of stuff one can do with igraph. The following is an example of extracting information about the first 10 vertices.

library("igraph")
# cliques(g)
V(g)[1:10]
## + 10/276 vertices, named, from 3197368:
##  [1] AM.AD.1   AM.AD.2   AM.F10.T1 AM.F10.T2 DA.AD.1   DA.AD.1T  DA.AD.2  
##  [8] DA.AD.3   DA.AD.3T  DA.AD.4

We can also plot this object in base R using igraph’s extension to plot.

plot(g, vertex.label=NA)

But easier to use ggplot2-based graphics provided in phyloseq (next)

4.4 Network Graphics in phyloseq

There are two network plot functions in phyloseq:

  • plot_net
  • plot_network (legacy, igraph manipulation)

plot_net

  • faster and easier to use than the original, plot_network
  • Better defaults, maps distance values to line width

4.5 Network Graphics - example

There is a random aspect to some of the network layout methods. For complete reproducibility of the images produced later in this tutorial, it is possible to set the random number generator seed explicitly:

set.seed(711L)

The newer plot_net function does not require a separate make_network function call, or a separate igraph object. For examples running the older plot_network function, which may provide some added flexibility with igraph objects, see the plot_network section later.

Try plot_net with the default settings.

# library("phyloseq")
# data(enterotype)
plot_net(enterotype, 
         maxdist = 0.35, 
         point_label = "Sample_ID")

The previous graphic displayed some interesting structure, with one or two major subgraphs comprising a majority of samples. Furthermore, there seemed to be a correlation in the sample naming scheme and position within the network.

Instead of trying to read all of the sample names to understand the pattern, let’s map some of the sample variables onto this graphic as color and shape:

plot_net(enterotype, 
         maxdist = 0.3, 
         color = "SeqTech",
         shape="Enterotype")

4.6 Network Graphics - Bonus Questions

In the previous examples, the choice of maximum-distance and distance method were informed, but arbitrary.

  • What happens when maxdist value is decreased?? (hint: this will usually decrease the number of edges in the network).
  • What about other distances?
  • What can you learn from an ordination method instead?
  • What’s different about these two forms of exploratory graphics?