--- title: "Exercises" author: "Thomas Lumley" date: "24/07/2019" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ## Session 1 **(A)** The PISA educational survey is stratified by country and then by various factors within country. A sample of schools is taken in each stratum, and students are sampled from each school. `nzmaths.csv` is the New Zealand subset of some variables related to mathematics performance. `nzmaths.pdf` is documentation. Declare a survey design object for these data. Use `svytotal` to estimate the total number of male and female school students at the survey age in New Zealand. **(B)** The data sets `esophlong.csv` is the famous oesophageal cancer case-control study of Tuyns and co-workers, used by Breslow and Day. A case-control study is a stratified sample, stratified on case status. Based on the population size for the region being studied, approximately 1/440 controls were sampled. Declare a survey design for these data **(C)** The file `pfoa-nhanes.rda` contains a set of data frames (also in the files `pfoa00.dta` to `pfoa10.dta`). Read in `pfoa00.dta`. Declare a survey design using the strata (`SDMVSTRA`), the sampling units (`SDMVPSU`) and the weights for the medical and clinical examination (`WTMEC2YR`). Estimate the population total number of men and women (`RIAGENDR`) and each race/ethnicity group (`RIDRETH1`) Now repeat this using the replicate weights `WTMREP01` to `WTMREP52` instead of strata and cluster information. ## Session 2 With the `pfoa00` data set, estimate the mean age, and the mean and quartiles of `SPFOA` (perfluorooctanoic acid in blood, ng/ml) With the PISA maths data, draw a histogram of `PCGIRLS`, the proportion of girls at the school. Draw a scatterplot against `PV1MATH` (maths score) and against `MATHEFF` (maths self-efficacy score) Fit an unweighted logistic regression model to the `esophlong` data set with linear effects of alcohol and tobacco and age. Compare to a model treating it as a survey sample, fitted using `svyglm` or `svy: logistic` ## Session 3 We are interested in whether exposure to perfluorooctanoic acid (PFOA) is associated with cardiovascular events (CVD) or with peripheral vascular disease (PAD) In R, the lines of code ``` nhanes00<-svydesign(id=~SDMVPSU,strata=~SDMVSTRA,weights=~WTMEC2YR,data=alldata00,nest=TRUE) nhanes00<-update(nhanes00, hadcvd=(MCQ160C==1) | (MCQ160E==1) | (MCQ160F==1), abi=(LEXRABPI+LEXLABPI)/2) nhanes00<-update(nhanes00, haspad=ifelse(abi<0.9,1,ifelse(abi>1.5,NA,0))) svyquantile(~SPFOA,nhanes00,quantiles=c(0.25,0.5,0.75)) nhanes00<-update(nhanes00, pfoa4=cut(SPFOA,c(0,3.7,5,6.8,Inf))) nhanes00<-update(nhanes00, smoking=ifelse(SMQ020==2,0,ifelse(SMQ040 %in% c(1,2),1,2))) ``` set up the survey design. In Stata, the variable declarations can just be done as usual with `gen` or `replace` after `svyset`. Try logistic models for `hadcvd` or `haspad`, with `pfoa4` as a predictor. In addition to gender and race/ethnicity (as in session 1) consider adjustment variables: | name | variable | | ------- |----- | | BPXSAR | systolic blood pressure | | BPXDAR | diastolic blood pressure | | BMXBMI | BMI | | LBXTC | total cholesterol| | LBXGH | % glycosylated hemoglobin | | smoking | smoking | | DMDEDUC | education | | RIDAGEYR | age | If you have time, try `pfoa4` ``` nhanes04<-svydesign(id=~SDMVPSU,strata=~SDMVSTRA,weights=~WTSA2YR,data=alldata04,nest=TRUE) nhanes04<-update(nhanes04, hadcvd=(MCQ160C==1) | (MCQ160E==1) | (MCQ160F==1), abi=(LEXRABPI+LEXLABPI)/2) nhanes04<-update(nhanes04, haspad=ifelse(abi<0.85,1,ifelse(abi>1.5,NA,0))) svyquantile(~LBXPFOA,nhanes04,quantiles=c(0.25,0.5,0.75),na.rm=TRUE) nhanes04<-update(nhanes04, pfoa4=cut(LBXPFOA,c(0,2.8,4.2,6,Inf))) nhanes04<-update(nhanes04, smoking=ifelse(SMQ020==2,0,ifelse(SMQ040 %in% c(1,2),1,2))) ``` or `pfoa10` ``` nhanes10<-svydesign(id=~SDMVPSU,strata=~SDMVSTRA,weights=~WTSC2YR,data=alldata10,nest=TRUE) nhanes10<-update(nhanes10, hadcvd=(MCQ160C==1) | (MCQ160E==1) | (MCQ160F==1)) svyquantile(~LBXPFOA,nhanes10,quantiles=c(0.25,0.5,0.75),na.rm=TRUE) nhanes10<-update(nhanes10, pfoa4=cut(LBXPFOA,c(0,2.6,4.2,6.2,Inf))) nhanes10<-update(nhanes10, smoking=ifelse(SMQ020==2,0,ifelse(SMQ040 %in% c(1,2),1,2))) ``` with adjustment variables | name | variable | | ------- |----- | | BPXSY1 | systolic blood pressure | | BPXDI1 | diastolic blood pressure | | BMXBMI | BMI | | LBXTC | total cholesterol| | LBXGH | % glycosylated hemoglobin | | smoking | smoking | | DMDEDUC2 | education | | RIDAGEEX | age |