Agenda

  • Getting data into and out of R
  • Using data frames for statistical purposes
  • Introduction to linear models

Reading Data from R

  • You can load and save R objects
    • R has its own format for this, which is shared across operating systems
    • It's an open, documented format if you really want to pry into it
  • save(thing, file="name") saves thing in a file called name (conventional extension: rda or Rda)
  • load("name") loads the object or objects stored in the file called name, with their old names

gmp <- read.table("http://faculty.ucr.edu/~jflegal/206/gmp.dat")
gmp$pop <- round(gmp$gmp/gmp$pcgmp)
save(gmp,file="gmp.Rda")
rm(gmp)
exists("gmp")
## [1] FALSE
not_gmp <- load(file="gmp.Rda")
colnames(gmp)
## [1] "MSA"   "gmp"   "pcgmp" "pop"
not_gmp
## [1] "gmp"

  • We can load or save more than one object at once; this is how RStudio will load your whole workspace when you're starting, and offer to save it when you're done
  • Many packages come with saved data objects; there's the convenience function data() to load them
data(cats,package="MASS")
summary(cats)
##  Sex         Bwt             Hwt       
##  F:47   Min.   :2.000   Min.   : 6.30  
##  M:97   1st Qu.:2.300   1st Qu.: 8.95  
##         Median :2.700   Median :10.10  
##         Mean   :2.724   Mean   :10.63  
##         3rd Qu.:3.025   3rd Qu.:12.12  
##         Max.   :3.900   Max.   :20.50
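As a minimal sketch of saving and loading more than one object at once: the objects x and y and the temporary file are made up for illustration and are not part of the gmp example.

```r
# Hypothetical objects, used only for illustration
x <- 1:5
y <- c(a = "alpha", b = "beta")

# Save both objects into a single file in a temporary location
rda_file <- tempfile(fileext = ".Rda")
save(x, y, file = rda_file)

# Remove them, then restore both with one call to load()
rm(x, y)
restored <- load(rda_file)
restored                      # names of the objects that load() put back
exists("x") && exists("y")    # both are available again
```

load() returns the names of the restored objects, which is a handy way to find out what a file contained.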

Non-R Data Tables

  • Tables full of data, just not in the R file format
  • Main function: read.table()
    • Presumes space-separated fields, one line per row
    • Main argument is the file name or URL
    • Returns a dataframe
    • Lots of options for things like field separator, column names, forcing or guessing column types, skipping lines at the start of the file…
  • read.csv() is a short-cut to set the options for reading comma-separated value (CSV) files
    • Spreadsheets will usually read and write CSV
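A short sketch of the options mentioned above. The table is made up and passed in through the text argument, so the example needs no file; with real data you would give a file name or URL instead.

```r
# A small made-up table, held in a string so no file is needed
raw <- "city;population;area\nSpringfield;153000;85.7\nShelbyville;98000;60.2"

# sep sets the field separator, header says the first line holds column names,
# and colClasses forces the column types instead of letting R guess them
towns <- read.table(text = raw, sep = ";", header = TRUE,
                    colClasses = c("character", "numeric", "numeric"))
str(towns)

# read.csv() is read.table() with sep = "," and header = TRUE as defaults
towns_csv <- read.csv(text = gsub(";", ",", raw))
names(towns_csv)
```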

Writing Dataframes

  • Counterpart functions write.table(), write.csv() write a dataframe into a file
  • Drawback: the text files take a lot more disk space than the files written by save()
  • Advantage: can communicate with other programs, or even edit manually
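A minimal round trip, assuming a small made-up data frame and a temporary file:

```r
# Small made-up data frame for illustration
df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.25))

csv_file <- tempfile(fileext = ".csv")
# row.names = FALSE keeps R's row labels from becoming an extra unnamed column
write.csv(df, file = csv_file, row.names = FALSE)

# The file is plain text, so other programs (or a text editor) can open it
readLines(csv_file)

# Reading it back recovers the same values
df_back <- read.csv(csv_file)
all.equal(df, df_back)
```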

Less Friendly Data Formats

  • The foreign package on CRAN has tools for reading data files from lots of non-R statistical software
  • Spreadsheets are special
    • Full of ugly irregularities
    • Values or formulas?
    • Headers, footers, side-comments, notes
    • Columns change meaning half-way down

Spreadsheets, If You Have To

  • Save the spreadsheet as a CSV; read.csv()
  • Save the spreadsheet as a CSV; edit in a text editor; read.csv()
  • Use read.xls() from the gdata package
    • Tries very hard to work like read.csv(), can take a URL or filename
    • Can skip down to the first line that matches some pattern, select different sheets, etc.
    • You may still need to do a lot of tidying up after
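Sometimes the CSV route works without hand editing. The sketch below assumes a made-up export with two note lines on top and a totals line at the bottom, and uses the skip and nrows arguments of read.csv() to step around them.

```r
# Made-up CSV export: two note lines, a header, two data rows, one footer
messy <- c("Quarterly report",
           "Exported by hand",
           "region,sales",
           "north,10",
           "south,12",
           "Total,22")

# skip drops the leading notes; nrows stops reading before the footer line
clean <- read.csv(text = paste(messy, collapse = "\n"), skip = 2, nrows = 2)
clean
```

This only helps when you know where the junk sits; irregular sheets still need manual tidying.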

So You've Got A Data Frame

What can we do with it?

  • Plot it: examine multiple variables and distributions
  • Test it: compare groups of individuals to each other
  • Check it: does it conform to what we'd like for our needs
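A rough sketch of the "check it" step, using the birthwt data frame from the MASS package. The particular checks are illustrative choices, not a fixed recipe.

```r
library(MASS)   # provides the birthwt data frame

# Each condition must hold, otherwise stopifnot() stops with an error
stopifnot(
  !anyNA(birthwt),                  # no missing values
  all(birthwt$low %in% c(0, 1)),    # the low-weight indicator is really 0/1
  all(birthwt$race %in% 1:3),       # race uses three numeric codes
  all(birthwt$bwt > 0)              # birth weights are positive
)

# Cross-tabulate the indicator against the weight it is supposed to summarise
table(low = birthwt$low, under2500 = birthwt$bwt < 2500)
```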

Test Case: Birth weight data

library(MASS)
data(birthwt)
summary(birthwt)
##       low              age             lwt             race      
##  Min.   :0.0000   Min.   :14.00   Min.   : 80.0   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:19.00   1st Qu.:110.0   1st Qu.:1.000  
##  Median :0.0000   Median :23.00   Median :121.0   Median :1.000  
##  Mean   :0.3122   Mean   :23.24   Mean   :129.8   Mean   :1.847  
##  3rd Qu.:1.0000   3rd Qu.:26.00   3rd Qu.:140.0   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :45.00   Max.   :250.0   Max.   :3.000  
##      smoke             ptl               ht                ui        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.3915   Mean   :0.1958   Mean   :0.06349   Mean   :0.1481  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :3.0000   Max.   :1.00000   Max.   :1.0000  
##       ftv              bwt      
##  Min.   :0.0000   Min.   : 709  
##  1st Qu.:0.0000   1st Qu.:2414  
##  Median :0.0000   Median :2977  
##  Mean   :0.7937   Mean   :2945  
##  3rd Qu.:1.0000   3rd Qu.:3487  
##  Max.   :6.0000   Max.   :4990

From R help

Go to R help for more info, because someone documented this data

help(birthwt)

Make it Readable

colnames(birthwt)
##  [1] "low"   "age"   "lwt"   "race"  "smoke" "ptl"   "ht"    "ui"   
##  [9] "ftv"   "bwt"
colnames(birthwt) <- c("birthwt.below.2500", "mother.age", 
                       "mother.weight", "race",
                       "mother.smokes", "previous.prem.labor", 
                       "hypertension", "uterine.irr",
                       "physician.visits", "birthwt.grams")

Make it Readable

We can make all the factors more descriptive.

birthwt$race <- factor(c("white", "black", "other")[birthwt$race])
birthwt$mother.smokes <- factor(c("No", "Yes")[birthwt$mother.smokes + 1])
birthwt$uterine.irr <- factor(c("No", "Yes")[birthwt$uterine.irr + 1])
birthwt$hypertension <- factor(c("No", "Yes")[birthwt$hypertension + 1])
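An alternative sketch for the same recoding: factor() can take the codes and their labels directly. The copy bw and the names by_index and by_label are introduced here only for illustration.

```r
library(MASS)
bw <- birthwt   # work on a copy so the original stays untouched

# Indexing a character vector, as above: levels come out in alphabetical order
by_index <- factor(c("white", "black", "other")[bw$race])
levels(by_index)

# Stating codes and labels explicitly: level order follows the codes 1, 2, 3
by_label <- factor(bw$race, levels = 1:3, labels = c("white", "black", "other"))
levels(by_label)

# Same values either way; only the order of the levels differs
all(as.character(by_index) == as.character(by_label))
```

Level order matters later: the first level becomes the baseline category in a linear model.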

Make it Readable

summary(birthwt)
##  birthwt.below.2500   mother.age    mother.weight      race   
##  Min.   :0.0000     Min.   :14.00   Min.   : 80.0   black:26  
##  1st Qu.:0.0000     1st Qu.:19.00   1st Qu.:110.0   other:67  
##  Median :0.0000     Median :23.00   Median :121.0   white:96  
##  Mean   :0.3122     Mean   :23.24   Mean   :129.8             
##  3rd Qu.:1.0000     3rd Qu.:26.00   3rd Qu.:140.0             
##  Max.   :1.0000     Max.   :45.00   Max.   :250.0             
##  mother.smokes previous.prem.labor hypertension uterine.irr
##  No :115       Min.   :0.0000      No :177      No :161    
##  Yes: 74       1st Qu.:0.0000      Yes: 12      Yes: 28    
##                Median :0.0000                              
##                Mean   :0.1958                              
##                3rd Qu.:0.0000                              
##                Max.   :3.0000                              
##  physician.visits birthwt.grams 
##  Min.   :0.0000   Min.   : 709  
##  1st Qu.:0.0000   1st Qu.:2414  
##  Median :0.0000   Median :2977  
##  Mean   :0.7937   Mean   :2945  
##  3rd Qu.:1.0000   3rd Qu.:3487  
##  Max.   :6.0000   Max.   :4990

Explore It

plot (birthwt$race)
title (main = "Count of Mother's Race in Springfield MA, 1986")

Explore It

plot (birthwt$mother.age, main = "Mother's Ages in Springfield MA, 1986",
      ylab = "Mother's Age")

Explore It

plot (sort(birthwt$mother.age), main = "(Sorted) Mother's Ages in Springfield MA, 1986",
      ylab = "Mother's Age")

Explore It

plot (birthwt$mother.age, birthwt$birthwt.grams,
      main = "Birth Weight by Mother's Age in Springfield MA, 1986",
      xlab = "Mother's Age", ylab = "Birth Weight (g)")

Basic statistical testing

Let's fit some models to the data pertaining to our outcome(s) of interest.

plot (birthwt$mother.smokes, birthwt$birthwt.grams,
      main = "Birth Weight by Mother's Smoking Habit",
      ylab = "Birth Weight (g)", xlab = "Mother Smokes")

Basic statistical testing

Tough to tell! Simple two-sample t-test:

t.test (birthwt$birthwt.grams[birthwt$mother.smokes == "Yes"], 
        birthwt$birthwt.grams[birthwt$mother.smokes == "No"])
## 
##  Welch Two Sample t-test
## 
## data:  birthwt$birthwt.grams[birthwt$mother.smokes == "Yes"] and birthwt$birthwt.grams[birthwt$mother.smokes == "No"]
## t = -2.7299, df = 170.1, p-value = 0.007003
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -488.97860  -78.57486
## sample estimates:
## mean of x mean of y 
##  2771.919  3055.696
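The same Welch test can be written with the formula interface of t.test(). This sketch uses bwt and smoke, the original column names in MASS::birthwt, so that it runs on its own without the renaming step.

```r
library(MASS)

# bwt is birth weight in grams, smoke is the 0/1 smoking indicator
tt <- t.test(bwt ~ smoke, data = birthwt)

tt$estimate   # group means, ordered by group label (0 first, then 1)
tt$p.value    # same p-value as the two-vector call above
```

Because the groups are now ordered non-smokers first, the sign of the t statistic flips relative to the call above; the p-value is unchanged.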

Basic statistical testing

Does this difference match the linear model?

linear.model.1 <- lm (birthwt.grams ~ mother.smokes, data=birthwt)
linear.model.1
## 
## Call:
## lm(formula = birthwt.grams ~ mother.smokes, data = birthwt)
## 
## Coefficients:
##      (Intercept)  mother.smokesYes  
##           3055.7            -283.8

Basic statistical testing

Does this difference match the linear model?

summary(linear.model.1)
## 
## Call:
## lm(formula = birthwt.grams ~ mother.smokes, data = birthwt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2062.9  -475.9    34.3   545.1  1934.3 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3055.70      66.93  45.653  < 2e-16 ***
## mother.smokesYes  -283.78     106.97  -2.653  0.00867 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 717.8 on 187 degrees of freedom
## Multiple R-squared:  0.03627,    Adjusted R-squared:  0.03112 
## F-statistic: 7.038 on 1 and 187 DF,  p-value: 0.008667
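The linear model's t value (-2.653) is close to, but not the same as, the Welch statistic (-2.7299), because lm() assumes equal variance in both groups. A sketch of the connection, again using the original column names bwt and smoke so the block is self-contained:

```r
library(MASS)

fit <- lm(bwt ~ smoke, data = birthwt)

# The intercept is the mean for non-smokers; the slope is the
# smoker-minus-non-smoker difference in means
group_means <- tapply(birthwt$bwt, birthwt$smoke, mean)
coef(fit)
group_means

# A pooled-variance t-test makes the same equal-variance assumption as lm()
pooled <- t.test(bwt ~ smoke, data = birthwt, var.equal = TRUE)
c(lm     = summary(fit)$coefficients["smoke", "Pr(>|t|)"],
  t.test = pooled$p.value)
```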

Basic statistical testing

Does mother's age predict birth weight?

linear.model.2 <- lm (birthwt.grams ~ mother.age, data=birthwt)
linear.model.2
## 
## Call:
## lm(formula = birthwt.grams ~ mother.age, data = birthwt)
## 
## Coefficients:
## (Intercept)   mother.age  
##     2655.74        12.43

Basic statistical testing

summary(linear.model.2)
## 
## Call:
## lm(formula = birthwt.grams ~ mother.age, data = birthwt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2294.78  -517.63    10.51   530.80  1774.92 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2655.74     238.86   11.12   <2e-16 ***
## mother.age     12.43      10.02    1.24    0.216    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 728.2 on 187 degrees of freedom
## Multiple R-squared:  0.008157,   Adjusted R-squared:  0.002853 
## F-statistic: 1.538 on 1 and 187 DF,  p-value: 0.2165
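Two common follow-ups on a fitted model, sketched with the original column names bwt and age. The name fit_age and the ages 20 and 30 are arbitrary choices for illustration.

```r
library(MASS)

fit_age <- lm(bwt ~ age, data = birthwt)

# 95% confidence intervals for the coefficients; the interval for age
# spans zero, in line with the p-value of about 0.22
confint(fit_age)

# Fitted mean birth weight (grams) at two illustrative ages
predict(fit_age, newdata = data.frame(age = c(20, 30)))
```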

Basic statistical testing

R tries to make diagnostics as easy as possible. Try this in the R console.

plot(linear.model.2)
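By default plot() on an lm object draws four diagnostic plots one after another. A small sketch that puts them on one page instead, refitting the model under the hypothetical name fit_age with the original column names:

```r
library(MASS)

fit_age <- lm(bwt ~ age, data = birthwt)

# Arrange a 2 x 2 grid so the four default diagnostic plots share one page
old_par <- par(mfrow = c(2, 2))
plot(fit_age)
par(old_par)   # restore the previous graphics settings
```

The four panels are residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage.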