Getting Data and Linear Models

Agenda

Getting data into and out of R
Using data frames for statistical purposes
Introduction to linear models

Reading Data from R

You can load and save R objects
- R has its own format for this, which is shared across operating systems
- It's an open, documented format if you really want to pry into it
save(thing, file="name") saves thing in a file called name (conventional extension: rda or Rda)
load("name") loads the object or objects stored in the file called name, with their old names

gmp <- read.table("http://faculty.ucr.edu/~jflegal/206/gmp.dat")
gmp$pop <- round(gmp$gmp/gmp$pcgmp)
save(gmp,file="gmp.Rda")
rm(gmp)
exists("gmp")

## [1] FALSE

not_gmp <- load(file="gmp.Rda")
colnames(gmp)

## [1] "MSA"   "gmp"   "pcgmp" "pop"

not_gmp

## [1] "gmp"

We can load or save more than one object at once; this is how RStudio will load your whole workspace when you're starting, and offer to save it when you're done
Many packages come with saved data objects; there's the convenience function data() to load them

 data(cats,package="MASS")
summary(cats)

##  Sex         Bwt             Hwt       
##  F:47   Min.   :2.000   Min.   : 6.30  
##  M:97   1st Qu.:2.300   1st Qu.: 8.95  
##         Median :2.700   Median :10.10  
##         Mean   :2.724   Mean   :10.63  
##         3rd Qu.:3.025   3rd Qu.:12.12  
##         Max.   :3.900   Max.   :20.50

Non-R Data Tables

Tables full of data, just not in the R file format
Main function: read.table()
- Presumes space-separated fields, one line per row
- Main argument is the file name or URL
- Returns a dataframe
- Lots of options for things like field separator, column names, forcing or guessing column types, skipping lines at the start of the file…
read.csv() is a short-cut to set the options for reading comma-separated value (CSV) files
- Spreadsheets will usually read and write CSV

Writing Dataframes

Counterpart functions write.table(), write.csv() write a dataframe into a file
Drawback: takes a lot more disk space than what you get from load or save
Advantage: can communicate with other programs, or even edit manually

Less Friendly Data Formats

The foreign package on CRAN has tools for reading data files from lots of non-R statistical software
Spreadsheets are special
Full of ugly irregularities
Values or formulas?
Headers, footers, side-comments, notes
Columns change meaning half-way down

Spreadsheets, If You Have To

Save the spreadsheet as a CSV; read.csv()
Save the spreadsheet as a CSV; edit in a text editor; read.csv()
Use read.xls() from the gdata package
Tries very hard to work like read.csv(), can take a URL or filename
Can skip down to the first line that matches some pattern, select different sheets, etc.
You may still need to do a lot of tidying up after

So You've Got A Data Frame

What can we do with it?

Plot it: examine multiple variables and distributions
Test it: compare groups of individuals to each other
Check it: does it conform to what we'd like for our needs

Test Case: Birth weight data

library(MASS)
data(birthwt)
summary(birthwt)

##       low              age             lwt             race      
##  Min.   :0.0000   Min.   :14.00   Min.   : 80.0   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:19.00   1st Qu.:110.0   1st Qu.:1.000  
##  Median :0.0000   Median :23.00   Median :121.0   Median :1.000  
##  Mean   :0.3122   Mean   :23.24   Mean   :129.8   Mean   :1.847  
##  3rd Qu.:1.0000   3rd Qu.:26.00   3rd Qu.:140.0   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :45.00   Max.   :250.0   Max.   :3.000  
##      smoke             ptl               ht                ui        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.3915   Mean   :0.1958   Mean   :0.06349   Mean   :0.1481  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :3.0000   Max.   :1.00000   Max.   :1.0000  
##       ftv              bwt      
##  Min.   :0.0000   Min.   : 709  
##  1st Qu.:0.0000   1st Qu.:2414  
##  Median :0.0000   Median :2977  
##  Mean   :0.7937   Mean   :2945  
##  3rd Qu.:1.0000   3rd Qu.:3487  
##  Max.   :6.0000   Max.   :4990

From R help

Go to R help for more info, because someone documented this data

help(birthwt)

Make it Readable

colnames(birthwt)

##  [1] "low"   "age"   "lwt"   "race"  "smoke" "ptl"   "ht"    "ui"   
##  [9] "ftv"   "bwt"

colnames(birthwt) <- c("birthwt.below.2500", "mother.age", 
                       "mother.weight", "race",
                       "mother.smokes", "previous.prem.labor", 
                       "hypertension", "uterine.irr",
                       "physician.visits", "birthwt.grams")

Make it Readable

Can make all the factors more descriptive.

birthwt$race <- factor(c("white", "black", "other")[birthwt$race])
birthwt$mother.smokes <- factor(c("No", "Yes")[birthwt$mother.smokes + 1])
birthwt$uterine.irr <- factor(c("No", "Yes")[birthwt$uterine.irr + 1])
birthwt$hypertension <- factor(c("No", "Yes")[birthwt$hypertension + 1])

Make it Readable

summary(birthwt)

##  birthwt.below.2500   mother.age    mother.weight      race   
##  Min.   :0.0000     Min.   :14.00   Min.   : 80.0   black:26  
##  1st Qu.:0.0000     1st Qu.:19.00   1st Qu.:110.0   other:67  
##  Median :0.0000     Median :23.00   Median :121.0   white:96  
##  Mean   :0.3122     Mean   :23.24   Mean   :129.8             
##  3rd Qu.:1.0000     3rd Qu.:26.00   3rd Qu.:140.0             
##  Max.   :1.0000     Max.   :45.00   Max.   :250.0             
##  mother.smokes previous.prem.labor hypertension uterine.irr
##  No :115       Min.   :0.0000      No :177      No :161    
##  Yes: 74       1st Qu.:0.0000      Yes: 12      Yes: 28    
##                Median :0.0000                              
##                Mean   :0.1958                              
##                3rd Qu.:0.0000                              
##                Max.   :3.0000                              
##  physician.visits birthwt.grams 
##  Min.   :0.0000   Min.   : 709  
##  1st Qu.:0.0000   1st Qu.:2414  
##  Median :0.0000   Median :2977  
##  Mean   :0.7937   Mean   :2945  
##  3rd Qu.:1.0000   3rd Qu.:3487  
##  Max.   :6.0000   Max.   :4990

Explore It

plot (birthwt$race)
title (main = "Count of Mother's Race in 
       Springfield MA, 1986")

Explore It

plot (birthwt$mother.age)
title (main = "Mother's Ages in Springfield MA, 1986", ylab="Mother's Age")

Explore It

plot (sort(birthwt$mother.age))
title (main = "(Sorted) Mother's Ages in Springfield MA, 1986", ylab="Mother's Age")

Explore It

plot (birthwt$mother.age, birthwt$birthwt.grams)
title (main = "Birth Weight by Mother's Age in Springfield MA, 1986",
       xlab="Mother's Age", ylab="Birth Weight (g)")

Basic statistical testing

Let's fit some models to the data pertaining to our outcome(s) of interest.

plot (birthwt$mother.smokes, birthwt$birthwt.grams, main="Birth Weight 
      by Mother's Smoking Habit", ylab = "Birth Weight (g)", xlab="Mother Smokes")

Basic statistical testing

Tough to tell! Simple two-sample t-test:

t.test (birthwt$birthwt.grams[birthwt$mother.smokes == "Yes"], 
        birthwt$birthwt.grams[birthwt$mother.smokes == "No"])

## 
##  Welch Two Sample t-test
## 
## data:  birthwt$birthwt.grams[birthwt$mother.smokes == "Yes"] and birthwt$birthwt.grams[birthwt$mother.smokes == "No"]
## t = -2.7299, df = 170.1, p-value = 0.007003
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -488.97860  -78.57486
## sample estimates:
## mean of x mean of y 
##  2771.919  3055.696

Basic statistical testing

Does this difference match the linear model?

linear.model.1 <- lm (birthwt.grams ~ mother.smokes, data=birthwt)
linear.model.1

## 
## Call:
## lm(formula = birthwt.grams ~ mother.smokes, data = birthwt)
## 
## Coefficients:
##      (Intercept)  mother.smokesYes  
##           3055.7            -283.8

Basic statistical testing

Does this difference match the linear model?

summary(linear.model.1)

## 
## Call:
## lm(formula = birthwt.grams ~ mother.smokes, data = birthwt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2062.9  -475.9    34.3   545.1  1934.3 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3055.70      66.93  45.653  < 2e-16 ***
## mother.smokesYes  -283.78     106.97  -2.653  0.00867 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 717.8 on 187 degrees of freedom
## Multiple R-squared:  0.03627,    Adjusted R-squared:  0.03112 
## F-statistic: 7.038 on 1 and 187 DF,  p-value: 0.008667

Basic statistical testing

Does this difference match the linear model?

linear.model.2 <- lm (birthwt.grams ~ mother.age, data=birthwt)
linear.model.2

## 
## Call:
## lm(formula = birthwt.grams ~ mother.age, data = birthwt)
## 
## Coefficients:
## (Intercept)   mother.age  
##     2655.74        12.43

Basic statistical testing

summary(linear.model.2)

## 
## Call:
## lm(formula = birthwt.grams ~ mother.age, data = birthwt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2294.78  -517.63    10.51   530.80  1774.92 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2655.74     238.86   11.12   <2e-16 ***
## mother.age     12.43      10.02    1.24    0.216    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 728.2 on 187 degrees of freedom
## Multiple R-squared:  0.008157,   Adjusted R-squared:  0.002853 
## F-statistic: 1.538 on 1 and 187 DF,  p-value: 0.2165

Basic statistical testing

R tries to make diagnostics easy as possible. Try in R console.

plot(linear.model.2)

Detecting Outliers

Note the oldest mother and her heaviest child are greatly skewing this analysis.

birthwt.noout <- birthwt[birthwt$mother.age <= 40,]
linear.model.3 <- lm (birthwt.grams ~ mother.age, data=birthwt.noout)
linear.model.3

## 
## Call:
## lm(formula = birthwt.grams ~ mother.age, data = birthwt.noout)
## 
## Coefficients:
## (Intercept)   mother.age  
##    2833.273        4.344

Detecting Outliers

summary(linear.model.3)

## 
## Call:
## lm(formula = birthwt.grams ~ mother.age, data = birthwt.noout)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2245.89  -511.24    26.45   540.09  1655.48 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2833.273    244.954   11.57   <2e-16 ***
## mother.age     4.344     10.349    0.42    0.675    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 717.2 on 186 degrees of freedom
## Multiple R-squared:  0.0009461,  Adjusted R-squared:  -0.004425 
## F-statistic: 0.1761 on 1 and 186 DF,  p-value: 0.6752

More complex models

Add in smoking behavior:

linear.model.3a <- lm (birthwt.grams ~ + mother.smokes + mother.age, data=birthwt.noout)
summary(linear.model.3a)

## 
## Call:
## lm(formula = birthwt.grams ~ +mother.smokes + mother.age, data = birthwt.noout)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2081.22  -459.82    43.56   548.22  1551.51 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2954.582    246.280  11.997   <2e-16 ***
## mother.smokesYes -265.756    105.605  -2.517   0.0127 *  
## mother.age          3.621     10.208   0.355   0.7232    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 707.1 on 185 degrees of freedom
## Multiple R-squared:  0.03401,    Adjusted R-squared:  0.02357 
## F-statistic: 3.257 on 2 and 185 DF,  p-value: 0.04072

More complex models

plot(linear.model.3a)

More complex models

Add in race:

linear.model.3b <- lm (birthwt.grams ~ mother.age + mother.smokes*race, data=birthwt.noout)
summary(linear.model.3b)

## 
## Call:
## lm(formula = birthwt.grams ~ mother.age + mother.smokes * race, 
##     data = birthwt.noout)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2343.52  -413.66    39.91   480.36  1379.90 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                3017.352    265.606  11.360  < 2e-16 ***
## mother.age                   -8.168     10.276  -0.795  0.42772    
## mother.smokesYes           -316.500    275.896  -1.147  0.25282    
## raceother                   -18.901    193.665  -0.098  0.92236    
## racewhite                   584.042    206.320   2.831  0.00517 ** 
## mother.smokesYes:raceother  258.999    349.871   0.740  0.46010    
## mother.smokesYes:racewhite -271.594    314.268  -0.864  0.38862    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 676.1 on 181 degrees of freedom
## Multiple R-squared:  0.1359, Adjusted R-squared:  0.1073 
## F-statistic: 4.746 on 6 and 181 DF,  p-value: 0.0001625

More complex models

plot(linear.model.3b)

Including everything

Let's include everything on this new data set:

linear.model.4 <- lm (birthwt.grams ~ ., data=birthwt.noout)
linear.model.4

## 
## Call:
## lm(formula = birthwt.grams ~ ., data = birthwt.noout)
## 
## Coefficients:
##         (Intercept)   birthwt.below.2500           mother.age  
##           3360.5163           -1116.3933             -16.0321  
##       mother.weight            raceother            racewhite  
##              1.9317              68.8145             247.0241  
##    mother.smokesYes  previous.prem.labor      hypertensionYes  
##           -157.7041              95.9825            -185.2778  
##      uterine.irrYes     physician.visits  
##           -340.0918              -0.3519

Including everything

Be careful! One of those variables birthwt.below.2500 is a function of the outcome.

linear.model.4a <- lm (birthwt.grams ~ . - birthwt.below.2500, data=birthwt.noout)
summary(linear.model.4a)

## 
## Call:
## lm(formula = birthwt.grams ~ . - birthwt.below.2500, data = birthwt.noout)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1761.10  -454.81    46.43   459.78  1394.13 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2545.584    323.204   7.876 3.21e-13 ***
## mother.age           -12.111      9.909  -1.222 0.223243    
## mother.weight          4.789      1.710   2.801 0.005656 ** 
## raceother            155.605    156.564   0.994 0.321634    
## racewhite            494.545    147.153   3.361 0.000951 ***
## mother.smokesYes    -335.793    104.613  -3.210 0.001576 ** 
## previous.prem.labor  -32.922    100.185  -0.329 0.742838    
## hypertensionYes     -594.324    198.480  -2.994 0.003142 ** 
## uterine.irrYes      -514.842    136.249  -3.779 0.000215 ***
## physician.visits      -7.247     45.649  -0.159 0.874036    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 638 on 178 degrees of freedom
## Multiple R-squared:  0.2435, Adjusted R-squared:  0.2052 
## F-statistic: 6.365 on 9 and 178 DF,  p-value: 8.255e-08

Including everything

plot(linear.model.4a)

Generalized Linear Models

Maybe a linear increase in birth weight is less important than if it's below a threshold like 2500 grams (5.5 pounds). Let's fit a generalized linear model instead:

glm.0 <- glm (birthwt.below.2500 ~ . - birthwt.grams, data=birthwt.noout)
plot(glm.0)

Generalized Linear Models

The default value is a Gaussian model (a standard linear model). Change this:

glm.1 <- glm (birthwt.below.2500 ~ . - birthwt.grams, data=birthwt.noout, family=binomial(link=logit))

Generalized Linear Models

summary(glm.1)

## 
## Call:
## glm(formula = birthwt.below.2500 ~ . - birthwt.grams, family = binomial(link = logit), 
##     data = birthwt.noout)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8938  -0.8222  -0.5363   0.9848   2.2069  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)   
## (Intercept)          1.721830   1.258897   1.368  0.17140   
## mother.age          -0.027537   0.037718  -0.730  0.46534   
## mother.weight       -0.015474   0.006919  -2.237  0.02532 * 
## raceother           -0.395505   0.537685  -0.736  0.46199   
## racewhite           -1.269006   0.527180  -2.407  0.01608 * 
## mother.smokesYes     0.931733   0.402359   2.316  0.02058 * 
## previous.prem.labor  0.539549   0.345413   1.562  0.11828   
## hypertensionYes      1.860521   0.697502   2.667  0.00764 **
## uterine.irrYes       0.766517   0.458951   1.670  0.09489 . 
## physician.visits     0.063402   0.172431   0.368  0.71310   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 233.92  on 187  degrees of freedom
## Residual deviance: 201.15  on 178  degrees of freedom
## AIC: 221.15
## 
## Number of Fisher Scoring iterations: 4

Generalized Linear Models

plot(glm.1)

Why?

Let's take a subset of this data to do predictions.

odds <- seq(1, nrow(birthwt.noout), by=2)
birthwt.in <- birthwt.noout[odds,]
birthwt.out <- birthwt.noout[-odds,]
linear.model.half <- 
  lm (birthwt.grams ~ 
      . - birthwt.below.2500, data=birthwt.in)

Why?

summary (linear.model.half)

## 
## Call:
## lm(formula = birthwt.grams ~ . - birthwt.below.2500, data = birthwt.in)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1705.17  -303.11    26.48   427.18  1261.57 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2514.891    450.245   5.586 2.81e-07 ***
## mother.age             7.052     14.935   0.472  0.63801    
## mother.weight          2.683      2.885   0.930  0.35501    
## raceother            113.948    224.519   0.508  0.61312    
## racewhite            466.219    204.967   2.275  0.02548 *  
## mother.smokesYes    -217.218    154.521  -1.406  0.16349    
## previous.prem.labor -206.093    143.726  -1.434  0.15530    
## hypertensionYes     -653.594    281.795  -2.319  0.02280 *  
## uterine.irrYes      -547.884    193.386  -2.833  0.00577 ** 
## physician.visits    -130.202     81.400  -1.600  0.11346    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 643.7 on 84 degrees of freedom
## Multiple R-squared:  0.2585, Adjusted R-squared:  0.1791 
## F-statistic: 3.254 on 9 and 84 DF,  p-value: 0.001942

Prediction of Training Data

birthwt.predict <- predict (linear.model.half)
cor (birthwt.in$birthwt.grams, birthwt.predict)

## [1] 0.508442

Prediction of Training Data

plot (birthwt.in$birthwt.grams, birthwt.predict)

Prediction of Test Data

birthwt.predict.out <- predict (linear.model.half, birthwt.out)
cor (birthwt.out$birthwt.grams, birthwt.predict.out)

## [1] 0.3749431

Prediction of Test Data

plot (birthwt.out$birthwt.grams, birthwt.predict.out)

Summary

Loading and saving R objects is very easy
Reading and writing dataframes is pretty easy
Linear models are very easy via lm()
Generalized linear models are pretty easy via glm()
Generalized linear mixed models via lme4() and glmm()