25th Summer Institute in Statistical Genetics (SISG)


Module 17: Computational Pipeline for WGS Data

Wed, July 29 to Fri, July 31

Module dates/times: Wednesday, July 29; Thursday, July 30; and Friday, July 31. Live sessions will start no earlier than 8 a.m. Pacific and end no later than 2:30 p.m. Pacific, except on Wednesdays: for modules that end on Wednesday, live sessions will end by 11 a.m. Pacific, and for modules that start on Wednesday, live sessions will begin no earlier than 11:30 a.m. Pacific.

This module is designed to follow on from Module 14. It will be a hands-on introduction to whole genome sequence analysis pipelines, informed by the instructors' experience with the TOPMed project (www.nhlbiwgs.org), and in particular by that project's focus on pooled-data analysis to study the role of rare variants in disease outcomes.

It will begin with an overview of data formats (BAM, VCF, GDS), and then cover the effects of population structure and relatedness on association mapping; phenotype harmonization; association testing (single-variant, burden, and SKAT); variant annotation; WGS variant analysis pipelines, focusing on tools used in the TOPMed Analysis pipeline; and the role of cloud computing.

Suggested pairing: Modules 12 and 13.

Access 2019 course materials.

Learning Objectives: After attending this module, participants will be able to:

  1. Understand the structure of data generated from whole genome sequence variant calling.
  2. Define a set of data transformations needed to harmonize phenotype data from different sources.
  3. Implement a mixed-model analysis in R to investigate the relationship between a trait of interest and genotypes in a population with both recent relatedness and diverse ancestry, using both single-variant and region-based tests.
  4. Use variant annotation to boost power in association tests that involve aggregating rare variants.
  5. Explain the pros and cons of using cloud computing platforms for computationally intensive analyses.
  6. Understand and recognize when standard methods may in practice give miscalibrated inference (e.g., because of cryptic relatedness, heteroscedasticity, or inappropriate use of asymptotic approximations), and suggest some solutions to these problems.