---
title: 'Getting Data and Linear Models'
author: "James M. Flegal"
output:
ioslides_presentation:
smaller: yes
---
## Agenda
- Getting data into and out of R
- Using data frames for statistical purposes
- Introduction to linear models
## Reading Data from R
- You can load and save R objects
+ R has its own format for this, which is shared across operating systems
+ It's an open, documented format if you really want to pry into it
- `save(thing, file="name")` saves `thing` in a file called `name` (conventional extension: `rda` or `Rda`)
- `load("name")` loads the object or objects stored in the file called `name`, _with their old names_
##
```{r}
gmp <- read.table("http://faculty.ucr.edu/~jflegal/206/gmp.dat")
gmp$pop <- round(gmp$gmp/gmp$pcgmp)
save(gmp,file="gmp.Rda")
rm(gmp)
exists("gmp")
not_gmp <- load(file="gmp.Rda")
colnames(gmp)
not_gmp
```
##
- We can load or save more than one object at once; this is how RStudio will load your whole workspace when you're starting, and offer to save it when you're done
- Many packages come with saved data objects; there's the convenience function `data()` to load them
```{r}
data(cats,package="MASS")
summary(cats)
```
## Non-R Data Tables
- Tables full of data, just not in the R file format
- Main function: `read.table()`
+ Presumes space-separated fields, one line per row
+ Main argument is the file name or URL
+ Returns a dataframe
+ Lots of options for things like field separator, column names, forcing or guessing column types, skipping lines at the start of the file...
- `read.csv()` is a short-cut to set the options for reading comma-separated value (CSV) files
+ Spreadsheets will usually read and write CSV
## Writing Dataframes
- Counterpart functions `write.table()`, `write.csv()` write a dataframe into a file
- Drawback: takes a lot more disk space than what you get from `load` or `save`
- Advantage: can communicate with other programs, or even edit manually
## Less Friendly Data Formats
- The `foreign` package on CRAN has tools for reading data files from lots of non-R statistical software
- Spreadsheets are special
+ Full of ugly irregularities
+ Values or formulas?
+ Headers, footers, side-comments, notes
+ Columns change meaning half-way down
## Spreadsheets, If You Have To
- Save the spreadsheet as a CSV; `read.csv()`
- Save the spreadsheet as a CSV; edit in a text editor; `read.csv()`
- Use `read.xls()` from the `gdata` package
+ Tries very hard to work like `read.csv()`, can take a URL or filename
+ Can skip down to the first line that matches some pattern, select different sheets, etc.
+ You may still need to do a lot of tidying up after
## So You've Got A Data Frame
What can we do with it?
- Plot it: examine multiple variables and distributions
- Test it: compare groups of individuals to each other
- Check it: does it conform to what we'd like for our needs
## Test Case: Birth weight data
```{r}
library(MASS)
data(birthwt)
summary(birthwt)
```
## From R help
Go to R help for more info, because someone documented this data
```
help(birthwt)
```
## Make it Readable
```{r}
colnames(birthwt)
colnames(birthwt) <- c("birthwt.below.2500", "mother.age",
"mother.weight", "race",
"mother.smokes", "previous.prem.labor",
"hypertension", "uterine.irr",
"physician.visits", "birthwt.grams")
```
## Make it Readable
Can make all the factors more descriptive.
```{r}
birthwt$race <- factor(c("white", "black", "other")[birthwt$race])
birthwt$mother.smokes <- factor(c("No", "Yes")[birthwt$mother.smokes + 1])
birthwt$uterine.irr <- factor(c("No", "Yes")[birthwt$uterine.irr + 1])
birthwt$hypertension <- factor(c("No", "Yes")[birthwt$hypertension + 1])
```
## Make it Readable
```{r}
summary(birthwt)
```
## Explore It
```{r}
plot (birthwt$race)
title (main = "Count of Mother's Race in
Springfield MA, 1986")
```
## Explore It
```{r}
plot (birthwt$mother.age)
title (main = "Mother's Ages in Springfield MA, 1986", ylab="Mother's Age")
```
## Explore It
```{r}
plot (sort(birthwt$mother.age))
title (main = "(Sorted) Mother's Ages in Springfield MA, 1986", ylab="Mother's Age")
```
## Explore It
```{r}
plot (birthwt$mother.age, birthwt$birthwt.grams)
title (main = "Birth Weight by Mother's Age in Springfield MA, 1986",
xlab="Mother's Age", ylab="Birth Weight (g)")
```
## Basic statistical testing
Let's fit some models to the data pertaining to our outcome(s) of interest.
```{r}
plot (birthwt$mother.smokes, birthwt$birthwt.grams, main="Birth Weight
by Mother's Smoking Habit", ylab = "Birth Weight (g)", xlab="Mother Smokes")
```
## Basic statistical testing
Tough to tell! Simple two-sample t-test:
```{r}
t.test (birthwt$birthwt.grams[birthwt$mother.smokes == "Yes"],
birthwt$birthwt.grams[birthwt$mother.smokes == "No"])
```
## Basic statistical testing
Does this difference match the linear model?
```{r}
linear.model.1 <- lm (birthwt.grams ~ mother.smokes, data=birthwt)
linear.model.1
```
## Basic statistical testing
Does this difference match the linear model?
```{r}
summary(linear.model.1)
```
## Basic statistical testing
Does this difference match the linear model?
```{r}
linear.model.2 <- lm (birthwt.grams ~ mother.age, data=birthwt)
linear.model.2
```
## Basic statistical testing
```{r}
summary(linear.model.2)
```
## Basic statistical testing
R tries to make diagnostics easy as possible. Try in R console.
```{r}
plot(linear.model.2)
```
## Detecting Outliers
Note the oldest mother and her heaviest child are greatly skewing this analysis.
```{r}
birthwt.noout <- birthwt[birthwt$mother.age <= 40,]
linear.model.3 <- lm (birthwt.grams ~ mother.age, data=birthwt.noout)
linear.model.3
```
## Detecting Outliers
```{r}
summary(linear.model.3)
```
## More complex models
Add in smoking behavior:
```{r}
linear.model.3a <- lm (birthwt.grams ~ + mother.smokes + mother.age, data=birthwt.noout)
summary(linear.model.3a)
```
## More complex models
```{r}
plot(linear.model.3a)
```
## More complex models
Add in race:
```{r}
linear.model.3b <- lm (birthwt.grams ~ mother.age + mother.smokes*race, data=birthwt.noout)
summary(linear.model.3b)
```
## More complex models
```{r}
plot(linear.model.3b)
```
## Including everything
Let's include everything on this new data set:
```{r}
linear.model.4 <- lm (birthwt.grams ~ ., data=birthwt.noout)
linear.model.4
```
## Including everything
Be careful! One of those variables `birthwt.below.2500` is a function of the outcome.
```{r}
linear.model.4a <- lm (birthwt.grams ~ . - birthwt.below.2500, data=birthwt.noout)
summary(linear.model.4a)
```
## Including everything
```{r}
plot(linear.model.4a)
```
## Generalized Linear Models
Maybe a linear increase in birth weight is less important than if it's below a threshold like 2500 grams (5.5 pounds). Let's fit a generalized linear model instead:
```{r}
glm.0 <- glm (birthwt.below.2500 ~ . - birthwt.grams, data=birthwt.noout)
plot(glm.0)
```
## Generalized Linear Models
The default value is a Gaussian model (a standard linear model). Change this:
```{r}
glm.1 <- glm (birthwt.below.2500 ~ . - birthwt.grams, data=birthwt.noout, family=binomial(link=logit))
```
## Generalized Linear Models
```{r}
summary(glm.1)
```
## Generalized Linear Models
```{r}
plot(glm.1)
```
## Why?
Let's take a subset of this data to do predictions.
```{r}
odds <- seq(1, nrow(birthwt.noout), by=2)
birthwt.in <- birthwt.noout[odds,]
birthwt.out <- birthwt.noout[-odds,]
linear.model.half <-
lm (birthwt.grams ~
. - birthwt.below.2500, data=birthwt.in)
```
## Why?
```{r}
summary (linear.model.half)
```
## Prediction of Training Data
```{r}
birthwt.predict <- predict (linear.model.half)
cor (birthwt.in$birthwt.grams, birthwt.predict)
```
## Prediction of Training Data
```{r}
plot (birthwt.in$birthwt.grams, birthwt.predict)
```
## Prediction of Test Data
```{r}
birthwt.predict.out <- predict (linear.model.half, birthwt.out)
cor (birthwt.out$birthwt.grams, birthwt.predict.out)
```
## Prediction of Test Data
```{r}
plot (birthwt.out$birthwt.grams, birthwt.predict.out)
```
## Summary
- Loading and saving R objects is very easy
- Reading and writing dataframes is pretty easy
- Linear models are very easy via `lm()`
- Generalized linear models are pretty easy via `glm()`
- Generalized linear mixed models via `lme4()` and `glmm()`