26th Summer Institute in Statistical Genetics (SISG)


This module is currently full. Registrations are closed at this time.

Module 16: Computational Pipeline for WGS Data

Wed, July 21 to Fri, July 23
Registration for this module closes July 14.

Live session timeframe (the exact schedule of live sessions will be posted by the module instructors before the module starts): Wednesday, 11:30 a.m. to 2:30 p.m. Pacific (2:30 to 5:30 p.m. Eastern); Thursday, 8 a.m. to 2:30 p.m. Pacific (11 a.m. to 5:30 p.m. Eastern); Friday, 8 a.m. to 2:30 p.m. Pacific (11 a.m. to 5:30 p.m. Eastern).

This module is designed to follow on from Module 14. It will be a hands-on introduction to whole genome sequence analysis pipelines, informed by the instructors' experience with the TOPMed project (www.nhlbiwgs.org), and in particular by that project's focus on pooled-data analysis for studying the role of rare variants in disease outcomes.

It will begin with an overview of data formats (BAM, VCF, GDS), and then cover the effects of population structure and relatedness on association mapping, phenotype harmonization, association testing (single-variant, burden, and SKAT), variant annotation, and WGS variant analysis pipelines, focusing on tools used in the TOPMed analysis pipeline and on the role of cloud computing.

Suggested pairing: Modules 9 and 14.

Learning Objectives: After attending this module, participants will be able to:

  1. Understand the structure of data generated from whole genome sequence variant calling.
  2. Define a set of data transformations needed to harmonize phenotype data from different sources.
  3. Implement a mixed-model analysis in R to investigate the relationship between a trait of interest and genotypes in a population with both recent relatedness and diverse ancestry, using both single-variant and region-based tests.
  4. Use variant annotation to boost power in association tests that involve aggregating rare variants.
  5. Explain the pros and cons of using cloud computing platforms for computationally intensive analyses.
  6. Recognize when standard methods may in practice give miscalibrated inference (e.g., because of cryptic relatedness, heteroscedasticity, or inappropriate use of asymptotic approximations) and suggest some solutions to these problems.