25th Summer Institute in Statistical Genetics (SISG)

Module 15: MCMC for Genetics

Mon, July 27 to Wed, July 29

Module dates/times: Monday, July 27; Tuesday, July 28; and Wednesday, July 29. Live sessions will start no earlier than 8 a.m. Pacific and end no later than 2:30 p.m. Pacific, except on Wednesdays. For modules that end on a Wednesday, live sessions will end by 11 a.m. Pacific. For modules that start on a Wednesday, live sessions will begin no earlier than 11:30 a.m. Pacific.

This module introduces Markov chain Monte Carlo (MCMC) methods, using genetic examples to motivate the material, in particular the problem of estimating population structure from genotype data. It assumes a solid foundation in basic statistics and the concept of likelihood, and a basic familiarity with the R statistical package.

The course will provide an introduction to likelihood, Bayesian statistics, Monte Carlo, Markov chains, mixture models and MCMC methods, including both Metropolis-Hastings and Gibbs sampling. Some mathematical detail is given; however, the emphasis is on concepts and practical issues arising in applications. Mathematical ideas are illustrated with simple examples and reinforced with computer practicals using the R statistical language. Software used: R.

Suggested pairing: Modules 7 and 12.

Access 2019 course materials.

Learning Objectives: After attending this module, participants will be able to:

  1. Derive the (analytic) posterior distribution for a Binomial proportion given a conjugate (Beta) prior.
  2. Implement a Metropolis-Hastings algorithm to sample from this posterior distribution and check that it matches the analytic form.
  3. Derive the posterior distribution for cluster memberships given a prior on clusters and a likelihood for each cluster.
  4. Implement a Gibbs sampler to sample from cluster memberships given data from a mixture of product-Bernoulli distributions.
  5. Apply the structure software to a real dataset and interpret the output.
  6. Apply the PHASE software to a real dataset and interpret the output.
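
As an illustration of objectives 1 and 2, here is a minimal sketch of a random-walk Metropolis-Hastings sampler for a Binomial proportion under a Beta(a, b) prior. With k successes in n trials, the conjugate posterior is Beta(a + k, b + n - k), so the sampled mean can be compared with the analytic mean (a + k) / (a + b + n). The module's practicals use R; this sketch is written in Python, and the function names, the data (7 successes in 20 trials), the step size, and the burn-in length are all assumptions chosen for illustration, not taken from the course materials.

```python
import math
import random


def log_posterior(p, successes, trials, a, b):
    """Unnormalized log posterior: Binomial likelihood times Beta(a, b) prior."""
    if p <= 0.0 or p >= 1.0:
        return -math.inf  # zero posterior density outside (0, 1)
    return ((successes + a - 1.0) * math.log(p)
            + (trials - successes + b - 1.0) * math.log(1.0 - p))


def metropolis_hastings(successes, trials, a=1.0, b=1.0,
                        n_iter=20000, step=0.2, seed=1):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    p = 0.5  # arbitrary starting value inside (0, 1)
    current = log_posterior(p, successes, trials, a, b)
    samples = []
    for _ in range(n_iter):
        proposal = p + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, successes, trials, a, b)
        # The proposal is symmetric, so the acceptance probability
        # reduces to the posterior ratio, capped at 1.
        accept_prob = math.exp(min(0.0, candidate - current))
        if rng.random() < accept_prob:
            p, current = proposal, candidate
        samples.append(p)
    return samples


# Hypothetical data: 7 successes in 20 trials, uniform Beta(1, 1) prior.
k, n, a, b = 7, 20, 1.0, 1.0
draws = metropolis_hastings(k, n, a, b)[2000:]  # discard burn-in (assumed length)
mcmc_mean = sum(draws) / len(draws)
analytic_mean = (a + k) / (a + b + n)  # mean of Beta(a + k, b + n - k)
print("MCMC mean:", round(mcmc_mean, 3), "analytic mean:", round(analytic_mean, 3))
```

The two means should agree closely; a histogram of the draws against the Beta(a + k, b + n - k) density gives a fuller check. Proposals outside (0, 1) are always rejected here, which is one simple way to respect the parameter's range.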