4th Annual Summer Institute in Statistics for Big Data (SISBID)

Module 1: Data Wrangling with R

Wed, July 11 to Fri, July 13
Instructor(s):

Module dates/times: Wednesday, July 11, 1:30-5 p.m.; Thursday, July 12, 8:30 a.m.-5 p.m.; and Friday, July 13, 8:30 a.m.-5 p.m.

Participants will learn how to obtain data and process it for visualization and statistical analysis. Our approach focuses on the concept of creating “tidy data”, i.e. data that is organized into readable and distributable files. In this module, we will:

  • Use hands-on examples from published studies and cover concepts on data retrieval, manipulation, and formatting.
  • Touch on reproducible research using R Markdown and collaborative code sharing using GitHub.
  • Briefly introduce some of the most popular public data repositories in genomics (e.g. GEO, SRA), and demonstrate how to access these repositories using tools in R and Bioconductor (e.g. using the recount and GEOquery packages).

Principles will be illustrated using data from microarray and next-generation sequencing technologies.

This module assumes some familiarity with R.

Recommended Reading: Cookbook for R, by Winston Chang, available at www.cookbook-r.com.

Course materials from previous years: https://github.com/SISBID