6th Summer Institute in Statistics for Big Data (SISBID)


This module is currently full. Registrations are closed at this time.

Module 1: Data Wrangling with R

Mon, July 13 to Wed, July 15

Module dates/times: Monday, July 13; Tuesday, July 14; and Wednesday, July 15. Live sessions will start no earlier than 8 a.m. Pacific and end no later than 2:30 p.m. Pacific, except on Wednesdays: modules that end on a Wednesday will finish by 11 a.m. Pacific, and modules that start on a Wednesday will begin no earlier than 11:30 a.m. Pacific.

Participants will learn how to obtain data and process it for visualization and statistical analysis. Our approach focuses on creating “tidy data”, i.e. data organized into readable and distributable files. In this module, we will:

  • Use hands-on examples from published studies and cover concepts on data retrieval, manipulation, and formatting.
  • Touch on reproducible research using R Markdown and collaborative code sharing using GitHub.
  • Briefly introduce some of the most popular public data repositories in genomics (e.g. GEO, SRA), and demonstrate how to access them using tools in R and Bioconductor (e.g. the recount and GEOquery packages).

Principles will be illustrated using data from microarray and next-generation sequencing technologies.

This module assumes some familiarity with R (see the previous year’s course materials for reference).

Recommended Reading: Cookbook for R, by Winston Chang, available at www.cookbook-r.com.