---
title: 'Getting Data and Linear Models'
author: "James M. Flegal"
output:
  ioslides_presentation:
    smaller: yes
---

## Agenda

- Getting data into and out of R
- Using data frames for statistical purposes
- Introduction to linear models

## Reading Data from R

- You can load and save R objects
    + R has its own format for this, which is shared across operating systems
    + It's an open, documented format if you really want to pry into it
- `save(thing, file="name")` saves `thing` in a file called `name` (conventional extension: `rda` or `Rda`)
- `load("name")` loads the object or objects stored in the file called `name`, _with their old names_

##
```{r}
gmp <- read.table("http://faculty.ucr.edu/~jflegal/206/gmp.dat")
gmp$pop <- round(gmp$gmp/gmp$pcgmp)
save(gmp,file="gmp.Rda")
rm(gmp)
exists("gmp")
not_gmp <- load(file="gmp.Rda")
colnames(gmp)
not_gmp
```

##

- We can load or save more than one object at once; this is how RStudio will load your whole workspace when you're starting, and offer to save it when you're done
- Many packages come with saved data objects; there's the convenience function `data()` to load them
```{r}
 data(cats,package="MASS")
summary(cats)
```

## Non-R Data Tables

- Tables full of data, just not in the R file format
- Main function: `read.table()`
    + Presumes space-separated fields, one line per row
    + Main argument is the file name or URL
    + Returns a dataframe
    + Lots of options for things like field separator, column names, forcing or guessing column types, skipping lines at the start of the file...
- `read.csv()` is a short-cut to set the options for reading comma-separated value (CSV) files
    + Spreadsheets will usually read and write CSV

## Writing Dataframes

- Counterpart functions `write.table()`, `write.csv()` write a dataframe into a file
- Drawback: takes a lot more disk space than what you get from `load` or `save`
- Advantage: can communicate with other programs, or even  edit manually

## Less Friendly Data Formats

- The `foreign` package on CRAN has tools for reading data files from lots of non-R statistical software
- Spreadsheets are special
  + Full of ugly irregularities
  + Values or formulas?
  + Headers, footers, side-comments, notes
  + Columns change meaning half-way down

## Spreadsheets, If You Have To

- Save the spreadsheet as a CSV; `read.csv()`
- Save the spreadsheet as a CSV; edit in a text editor; `read.csv()`
- Use `read.xls()` from the `gdata` package
  + Tries very hard to work like `read.csv()`, can take a URL or filename
  + Can skip down to the first line that matches some pattern, select different sheets, etc.
  + You may still need to do a lot of tidying up after

## So You've Got A Data Frame

What can we do with it?

- Plot it: examine multiple variables and distributions
- Test it: compare groups of individuals to each other
- Check it: does it conform to what we'd like for our needs

## Test Case: Birth weight data

```{r}
library(MASS)
data(birthwt)
summary(birthwt)
```

## From R help

Go to R help for more info, because someone documented this data

```
help(birthwt)
```

## Make it Readable

```{r}
colnames(birthwt)
colnames(birthwt) <- c("birthwt.below.2500", "mother.age", 
                       "mother.weight", "race",
                       "mother.smokes", "previous.prem.labor", 
                       "hypertension", "uterine.irr",
                       "physician.visits", "birthwt.grams")
```

## Make it Readable

Can make all the factors more descriptive.

```{r}
birthwt$race <- factor(c("white", "black", "other")[birthwt$race])
birthwt$mother.smokes <- factor(c("No", "Yes")[birthwt$mother.smokes + 1])
birthwt$uterine.irr <- factor(c("No", "Yes")[birthwt$uterine.irr + 1])
birthwt$hypertension <- factor(c("No", "Yes")[birthwt$hypertension + 1])
```

## Make it Readable

```{r}
summary(birthwt)
```

## Explore It

```{r}
plot (birthwt$race)
title (main = "Count of Mother's Race in 
       Springfield MA, 1986")
```

## Explore It

```{r}
plot (birthwt$mother.age)
title (main = "Mother's Ages in Springfield MA, 1986", ylab="Mother's Age")
```

## Explore It

```{r}
plot (sort(birthwt$mother.age))
title (main = "(Sorted) Mother's Ages in Springfield MA, 1986", ylab="Mother's Age")
```


## Explore It

```{r}
plot (birthwt$mother.age, birthwt$birthwt.grams)
title (main = "Birth Weight by Mother's Age in Springfield MA, 1986",
       xlab="Mother's Age", ylab="Birth Weight (g)")
```


## Basic statistical testing

Let's fit some models to the data pertaining to our outcome(s) of interest. 

```{r}
plot (birthwt$mother.smokes, birthwt$birthwt.grams, main="Birth Weight 
      by Mother's Smoking Habit", ylab = "Birth Weight (g)", xlab="Mother Smokes")
```

## Basic statistical testing

Tough to tell! Simple two-sample t-test:

```{r}
t.test (birthwt$birthwt.grams[birthwt$mother.smokes == "Yes"], 
        birthwt$birthwt.grams[birthwt$mother.smokes == "No"])
```

## Basic statistical testing

Does this difference match the linear model?

```{r}
linear.model.1 <- lm (birthwt.grams ~ mother.smokes, data=birthwt)
linear.model.1
```

## Basic statistical testing

Does this difference match the linear model?

```{r}
summary(linear.model.1)
```

## Basic statistical testing

Does this difference match the linear model?

```{r}
linear.model.2 <- lm (birthwt.grams ~ mother.age, data=birthwt)
linear.model.2
```

## Basic statistical testing

```{r}
summary(linear.model.2)
```

## Basic statistical testing

R tries to make diagnostics easy as possible. Try in R console.

```{r}
plot(linear.model.2)
```

## Detecting Outliers

Note the oldest mother and her heaviest child are greatly skewing this analysis. 

```{r}
birthwt.noout <- birthwt[birthwt$mother.age <= 40,]
linear.model.3 <- lm (birthwt.grams ~ mother.age, data=birthwt.noout)
linear.model.3
```

## Detecting Outliers

```{r}
summary(linear.model.3)
```

## More complex models

Add in smoking behavior:

```{r}
linear.model.3a <- lm (birthwt.grams ~ + mother.smokes + mother.age, data=birthwt.noout)
summary(linear.model.3a)
```

## More complex models

```{r}
plot(linear.model.3a)
```

## More complex models

Add in race:

```{r}
linear.model.3b <- lm (birthwt.grams ~ mother.age + mother.smokes*race, data=birthwt.noout)
summary(linear.model.3b)
```

## More complex models

```{r}
plot(linear.model.3b)
```

## Including everything

Let's include everything on this new data set:

```{r}
linear.model.4 <- lm (birthwt.grams ~ ., data=birthwt.noout)
linear.model.4
```


## Including everything

Be careful! One of those variables `birthwt.below.2500` is a function of the outcome.

```{r}
linear.model.4a <- lm (birthwt.grams ~ . - birthwt.below.2500, data=birthwt.noout)
summary(linear.model.4a)
```

## Including everything

```{r}
plot(linear.model.4a)
```


## Generalized Linear Models

Maybe a linear increase in birth weight is less important than if it's below a threshold like 2500 grams (5.5 pounds). Let's fit a generalized linear model instead:

```{r}
glm.0 <- glm (birthwt.below.2500 ~ . - birthwt.grams, data=birthwt.noout)
plot(glm.0)
```

## Generalized Linear Models

The default value is a Gaussian model (a standard linear model). Change this:
 
```{r}
glm.1 <- glm (birthwt.below.2500 ~ . - birthwt.grams, data=birthwt.noout, family=binomial(link=logit))
```

## Generalized Linear Models

```{r}
summary(glm.1)
```


## Generalized Linear Models

```{r}
plot(glm.1)
```


## Why?

Let's take a subset of this data to do predictions.

```{r}
odds <- seq(1, nrow(birthwt.noout), by=2)
birthwt.in <- birthwt.noout[odds,]
birthwt.out <- birthwt.noout[-odds,]
linear.model.half <- 
  lm (birthwt.grams ~ 
      . - birthwt.below.2500, data=birthwt.in)
```

## Why?

```{r}
summary (linear.model.half)
```

## Prediction of Training Data

```{r}
birthwt.predict <- predict (linear.model.half)
cor (birthwt.in$birthwt.grams, birthwt.predict)
```

## Prediction of Training Data

```{r}
plot (birthwt.in$birthwt.grams, birthwt.predict)
```

## Prediction of Test Data

```{r}
birthwt.predict.out <- predict (linear.model.half, birthwt.out)
cor (birthwt.out$birthwt.grams, birthwt.predict.out)
```

## Prediction of Test Data

```{r}
plot (birthwt.out$birthwt.grams, birthwt.predict.out)
```

## Summary

- Loading and saving R objects is very easy
- Reading and writing dataframes is pretty easy
- Linear models are very easy via `lm()`
- Generalized linear models are pretty easy via `glm()`
- Generalized linear mixed models via `lme4()` and `glmm()`