--- title: 'Getting Data and Linear Models' author: "James M. Flegal" output: ioslides_presentation: smaller: yes --- ## Agenda - Getting data into and out of R - Using data frames for statistical purposes - Introduction to linear models ## Reading Data from R - You can load and save R objects + R has its own format for this, which is shared across operating systems + It's an open, documented format if you really want to pry into it - `save(thing, file="name")` saves `thing` in a file called `name` (conventional extension: `rda` or `Rda`) - `load("name")` loads the object or objects stored in the file called `name`, _with their old names_ ## ```{r} gmp <- read.table("http://faculty.ucr.edu/~jflegal/206/gmp.dat") gmp$pop <- round(gmp$gmp/gmp$pcgmp) save(gmp,file="gmp.Rda") rm(gmp) exists("gmp") not_gmp <- load(file="gmp.Rda") colnames(gmp) not_gmp ``` ## - We can load or save more than one object at once; this is how RStudio will load your whole workspace when you're starting, and offer to save it when you're done - Many packages come with saved data objects; there's the convenience function `data()` to load them ```{r} data(cats,package="MASS") summary(cats) ``` ## Non-R Data Tables - Tables full of data, just not in the R file format - Main function: `read.table()` + Presumes space-separated fields, one line per row + Main argument is the file name or URL + Returns a dataframe + Lots of options for things like field separator, column names, forcing or guessing column types, skipping lines at the start of the file... - `read.csv()` is a short-cut to set the options for reading comma-separated value (CSV) files + Spreadsheets will usually read and write CSV ## Writing Dataframes - Counterpart functions `write.table()`, `write.csv()` write a dataframe into a file - Drawback: takes a lot more disk space than what you get from `load` or `save` - Advantage: can communicate with other programs, or even edit manually ## Less Friendly Data Formats - The `foreign` package on CRAN has tools for reading data files from lots of non-R statistical software - Spreadsheets are special + Full of ugly irregularities + Values or formulas? + Headers, footers, side-comments, notes + Columns change meaning half-way down ## Spreadsheets, If You Have To - Save the spreadsheet as a CSV; `read.csv()` - Save the spreadsheet as a CSV; edit in a text editor; `read.csv()` - Use `read.xls()` from the `gdata` package + Tries very hard to work like `read.csv()`, can take a URL or filename + Can skip down to the first line that matches some pattern, select different sheets, etc. + You may still need to do a lot of tidying up after ## So You've Got A Data Frame What can we do with it? - Plot it: examine multiple variables and distributions - Test it: compare groups of individuals to each other - Check it: does it conform to what we'd like for our needs ## Test Case: Birth weight data ```{r} library(MASS) data(birthwt) summary(birthwt) ``` ## From R help Go to R help for more info, because someone documented this data ``` help(birthwt) ``` ## Make it Readable ```{r} colnames(birthwt) colnames(birthwt) <- c("birthwt.below.2500", "mother.age", "mother.weight", "race", "mother.smokes", "previous.prem.labor", "hypertension", "uterine.irr", "physician.visits", "birthwt.grams") ``` ## Make it Readable Can make all the factors more descriptive. ```{r} birthwt$race <- factor(c("white", "black", "other")[birthwt$race]) birthwt$mother.smokes <- factor(c("No", "Yes")[birthwt$mother.smokes + 1]) birthwt$uterine.irr <- factor(c("No", "Yes")[birthwt$uterine.irr + 1]) birthwt$hypertension <- factor(c("No", "Yes")[birthwt$hypertension + 1]) ``` ## Make it Readable ```{r} summary(birthwt) ``` ## Explore It ```{r} plot (birthwt$race) title (main = "Count of Mother's Race in Springfield MA, 1986") ``` ## Explore It ```{r} plot (birthwt$mother.age) title (main = "Mother's Ages in Springfield MA, 1986", ylab="Mother's Age") ``` ## Explore It ```{r} plot (sort(birthwt$mother.age)) title (main = "(Sorted) Mother's Ages in Springfield MA, 1986", ylab="Mother's Age") ``` ## Explore It ```{r} plot (birthwt$mother.age, birthwt$birthwt.grams) title (main = "Birth Weight by Mother's Age in Springfield MA, 1986", xlab="Mother's Age", ylab="Birth Weight (g)") ``` ## Basic statistical testing Let's fit some models to the data pertaining to our outcome(s) of interest. ```{r} plot (birthwt$mother.smokes, birthwt$birthwt.grams, main="Birth Weight by Mother's Smoking Habit", ylab = "Birth Weight (g)", xlab="Mother Smokes") ``` ## Basic statistical testing Tough to tell! Simple two-sample t-test: ```{r} t.test (birthwt$birthwt.grams[birthwt$mother.smokes == "Yes"], birthwt$birthwt.grams[birthwt$mother.smokes == "No"]) ``` ## Basic statistical testing Does this difference match the linear model? ```{r} linear.model.1 <- lm (birthwt.grams ~ mother.smokes, data=birthwt) linear.model.1 ``` ## Basic statistical testing Does this difference match the linear model? ```{r} summary(linear.model.1) ``` ## Basic statistical testing Does this difference match the linear model? ```{r} linear.model.2 <- lm (birthwt.grams ~ mother.age, data=birthwt) linear.model.2 ``` ## Basic statistical testing ```{r} summary(linear.model.2) ``` ## Basic statistical testing R tries to make diagnostics easy as possible. Try in R console. ```{r} plot(linear.model.2) ``` ## Detecting Outliers Note the oldest mother and her heaviest child are greatly skewing this analysis. ```{r} birthwt.noout <- birthwt[birthwt$mother.age <= 40,] linear.model.3 <- lm (birthwt.grams ~ mother.age, data=birthwt.noout) linear.model.3 ``` ## Detecting Outliers ```{r} summary(linear.model.3) ``` ## More complex models Add in smoking behavior: ```{r} linear.model.3a <- lm (birthwt.grams ~ + mother.smokes + mother.age, data=birthwt.noout) summary(linear.model.3a) ``` ## More complex models ```{r} plot(linear.model.3a) ``` ## More complex models Add in race: ```{r} linear.model.3b <- lm (birthwt.grams ~ mother.age + mother.smokes*race, data=birthwt.noout) summary(linear.model.3b) ``` ## More complex models ```{r} plot(linear.model.3b) ``` ## Including everything Let's include everything on this new data set: ```{r} linear.model.4 <- lm (birthwt.grams ~ ., data=birthwt.noout) linear.model.4 ``` ## Including everything Be careful! One of those variables `birthwt.below.2500` is a function of the outcome. ```{r} linear.model.4a <- lm (birthwt.grams ~ . - birthwt.below.2500, data=birthwt.noout) summary(linear.model.4a) ``` ## Including everything ```{r} plot(linear.model.4a) ``` ## Generalized Linear Models Maybe a linear increase in birth weight is less important than if it's below a threshold like 2500 grams (5.5 pounds). Let's fit a generalized linear model instead: ```{r} glm.0 <- glm (birthwt.below.2500 ~ . - birthwt.grams, data=birthwt.noout) plot(glm.0) ``` ## Generalized Linear Models The default value is a Gaussian model (a standard linear model). Change this: ```{r} glm.1 <- glm (birthwt.below.2500 ~ . - birthwt.grams, data=birthwt.noout, family=binomial(link=logit)) ``` ## Generalized Linear Models ```{r} summary(glm.1) ``` ## Generalized Linear Models ```{r} plot(glm.1) ``` ## Why? Let's take a subset of this data to do predictions. ```{r} odds <- seq(1, nrow(birthwt.noout), by=2) birthwt.in <- birthwt.noout[odds,] birthwt.out <- birthwt.noout[-odds,] linear.model.half <- lm (birthwt.grams ~ . - birthwt.below.2500, data=birthwt.in) ``` ## Why? ```{r} summary (linear.model.half) ``` ## Prediction of Training Data ```{r} birthwt.predict <- predict (linear.model.half) cor (birthwt.in$birthwt.grams, birthwt.predict) ``` ## Prediction of Training Data ```{r} plot (birthwt.in$birthwt.grams, birthwt.predict) ``` ## Prediction of Test Data ```{r} birthwt.predict.out <- predict (linear.model.half, birthwt.out) cor (birthwt.out$birthwt.grams, birthwt.predict.out) ``` ## Prediction of Test Data ```{r} plot (birthwt.out$birthwt.grams, birthwt.predict.out) ``` ## Summary - Loading and saving R objects is very easy - Reading and writing dataframes is pretty easy - Linear models are very easy via `lm()` - Generalized linear models are pretty easy via `glm()` - Generalized linear mixed models via `lme4()` and `glmm()`