- Getting data into and out of R
- Using data frames for statistical purposes
- Introduction to linear models

- You can load and save R objects
- R has its own format for this, which is shared across operating systems
- It's an open, documented format if you really want to pry into it

- `save(thing, file="name")` saves `thing` in a file called `name` (conventional extension: `rda` or `Rda`)
- `load("name")` loads the object or objects stored in the file called `name`, *with their old names*

gmp <- read.table("http://faculty.ucr.edu/~jflegal/206/gmp.dat")
gmp$pop <- round(gmp$gmp/gmp$pcgmp)
save(gmp,file="gmp.Rda")
rm(gmp)
exists("gmp")

## [1] FALSE

# load() restores gmp under its old name and returns that name
not_gmp <- load(file="gmp.Rda")
colnames(gmp)

## [1] "MSA" "gmp" "pcgmp" "pop"

not_gmp

## [1] "gmp"
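Because `load()` restores objects under their old names, it can overwrite anything in the workspace that happens to share a name. A minimal sketch, using made-up objects `a` and `b` and a temporary file (none of these come from the course data), of saving several objects at once and loading them into a separate environment:

```r
# Made-up objects, for illustration only
a <- 1:5
b <- c("one", "two")

# Save both objects into a single temporary file
path <- tempfile(fileext = ".Rda")
save(a, b, file = path)

# Load into a fresh environment so nothing in the workspace is overwritten
e <- new.env()
loaded <- load(path, envir = e)

loaded              # character vector naming the restored objects
identical(e$a, a)   # the copy in e matches the original
```

Loading into `e` is a design choice rather than a requirement: it keeps the restored objects apart until you have looked at them.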

- We can load or save more than one object at once; this is how RStudio will load your whole workspace when you're starting, and offer to save it when you're done
- Many packages come with saved data objects; there's the convenience function `data()` to load them

data(cats,package="MASS")
summary(cats)

##  Sex         Bwt             Hwt
##  F:47   Min.   :2.000   Min.   : 6.30
##  M:97   1st Qu.:2.300   1st Qu.: 8.95
##         Median :2.700   Median :10.10
##         Mean   :2.724   Mean   :10.63
##         3rd Qu.:3.025   3rd Qu.:12.12
##         Max.   :3.900   Max.   :20.50

- Tables full of data, just not in the R file format
- Main function: `read.table()`
    - Presumes whitespace-separated fields (spaces or tabs), one line per row
    - Main argument is the file name or URL
    - Returns a dataframe
    - Lots of options for things like field separator, column names, forcing or guessing column types, skipping lines at the start of the file…
- `read.csv()` is a short-cut to set the options for reading comma-separated value (CSV) files
    - Spreadsheets will usually read and write CSV
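The options listed above can be sketched on a small inline table. The data and the column names `city`, `pop`, and `area` are made up for illustration; the `text=` argument stands in for a file name or URL so the example is self-contained:

```r
# Made-up data: a title line, then semicolon-separated fields
raw <- "Exported from a hypothetical system
Alpha;1200;3.5
Beta;800;2.25"

cities <- read.table(text = raw,
                     sep = ";",       # field separator
                     skip = 1,        # skip the title line at the start
                     col.names = c("city", "pop", "area"),
                     colClasses = c("character", "integer", "numeric"))
str(cities)

# read.csv() presets sep = "," and header = TRUE
small <- read.csv(text = "city,pop\nAlpha,1200\nBeta,800")
small
```

Leaving out `colClasses` lets R guess the column types instead of forcing them.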

- Counterpart functions `write.table()`, `write.csv()` write a dataframe into a file
    - Drawback: takes a lot more disk space than what you get from `load` or `save`
    - Advantage: can communicate with other programs, or even edit manually
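A rough sketch of the trade-off, assuming a made-up dataframe `df` and temporary files. It writes the same data both ways, reports the file sizes, and reads the CSV back; the sizes you see will depend on the data, so none are quoted here:

```r
# Made-up dataframe, for illustration only
df <- data.frame(id = 1:1000, value = sqrt(1:1000))

csv_path <- tempfile(fileext = ".csv")
rda_path <- tempfile(fileext = ".Rda")

write.csv(df, csv_path, row.names = FALSE)  # plain text, readable anywhere
save(df, file = rda_path)                   # R's own binary format

# File sizes in bytes
sizes <- file.size(c(csv = csv_path, rda = rda_path))
sizes

# Round trip through the CSV file
back <- read.csv(csv_path)
all.equal(df, back)
```

`row.names = FALSE` keeps `write.csv()` from adding an extra column of row labels, which would otherwise come back as a new variable.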

- The `foreign` package on CRAN has tools for reading data files from lots of non-R statistical software
- Spreadsheets are special
    - Full of ugly irregularities
    - Values or formulas?
    - Headers, footers, side-comments, notes
    - Columns change meaning half-way down

- Save the spreadsheet as a CSV; `read.csv()`
- Save the spreadsheet as a CSV; edit in a text editor; `read.csv()`
- Use `read.xls()` from the `gdata` package
    - Tries very hard to work like `read.csv()`, can take a URL or filename
    - Can skip down to the first line that matches some pattern, select different sheets, etc.
- You may still need to do a lot of tidying up after
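As a sketch of the first route, here is a made-up CSV export with the kinds of irregularities listed earlier: a title line above the header, a source note below the data, and `n/a` in place of missing values. The content and column names are invented for illustration:

```r
# Made-up spreadsheet export, for illustration only
messy <- "Quarterly report (hypothetical)
region,sales,returns
North,100,5
South,n/a,2
East,80,n/a
Source: made up for this example"

tidy <- read.csv(text = messy,
                 skip = 1,            # drop the title line
                 nrows = 3,           # stop before the footer note
                 na.strings = "n/a")  # treat n/a as missing
tidy
colSums(is.na(tidy))                  # how many values are missing per column
```

Knowing `nrows` in advance is the weak point of this approach; for a real export you would usually have to open the file and look first.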

What can we do with it?

- Plot it: examine multiple variables and distributions
- Test it: compare groups of individuals to each other
- Check it: does it conform to what we need?

library(MASS)
data(birthwt)
summary(birthwt)

##       low              age             lwt             race
##  Min.   :0.0000   Min.   :14.00   Min.   : 80.0   Min.   :1.000
##  1st Qu.:0.0000   1st Qu.:19.00   1st Qu.:110.0   1st Qu.:1.000
##  Median :0.0000   Median :23.00   Median :121.0   Median :1.000
##  Mean   :0.3122   Mean   :23.24   Mean   :129.8   Mean   :1.847
##  3rd Qu.:1.0000   3rd Qu.:26.00   3rd Qu.:140.0   3rd Qu.:3.000
##  Max.   :1.0000   Max.   :45.00   Max.   :250.0   Max.   :3.000
##      smoke             ptl               ht                ui
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000
##  Mean   :0.3915   Mean   :0.1958   Mean   :0.06349   Mean   :0.1481
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000
##  Max.   :1.0000   Max.   :3.0000   Max.   :1.00000   Max.   :1.0000
##       ftv              bwt
##  Min.   :0.0000   Min.   : 709
##  1st Qu.:0.0000   1st Qu.:2414
##  Median :0.0000   Median :2977
##  Mean   :0.7937   Mean   :2945
##  3rd Qu.:1.0000   3rd Qu.:3487
##  Max.   :6.0000   Max.   :4990

Go to R help for more info, because someone documented this data

help(birthwt)

colnames(birthwt)

##  [1] "low"   "age"   "lwt"   "race"  "smoke" "ptl"   "ht"    "ui"
##  [9] "ftv"   "bwt"

colnames(birthwt) <- c("birthwt.below.2500", "mother.age", "mother.weight",
                       "race", "mother.smokes", "previous.prem.labor",
                       "hypertension", "uterine.irr", "physician.visits",
                       "birthwt.grams")

Can make all the factors more descriptive.

# race is coded 1/2/3, so it indexes the label vector directly;
# the 0/1 indicators need + 1 to become valid indices
birthwt$race <- factor(c("white", "black", "other")[birthwt$race])
birthwt$mother.smokes <- factor(c("No", "Yes")[birthwt$mother.smokes + 1])
birthwt$uterine.irr <- factor(c("No", "Yes")[birthwt$uterine.irr + 1])
birthwt$hypertension <- factor(c("No", "Yes")[birthwt$hypertension + 1])
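The recoding above works by indexing a vector of labels with the numeric codes. A small sketch on a made-up 0/1 vector `smoke` (not the course data) compares that idiom with the `levels=`/`labels=` arguments of `factor()`, which is an alternative route rather than what the slides use:

```r
# Made-up 0/1 codes, for illustration only
smoke <- c(0, 1, 1, 0)

# Indexing idiom: codes 0/1 become positions 1/2 in the label vector
by_index <- factor(c("No", "Yes")[smoke + 1])

# Alternative: state the codes and their labels explicitly
by_labels <- factor(smoke, levels = c(0, 1), labels = c("No", "Yes"))

levels(by_index)
levels(by_labels)
all(as.character(by_index) == as.character(by_labels))
```

One difference to keep in mind: the indexing idiom sorts the levels alphabetically, while `levels=`/`labels=` keeps the order you give. For `race` that would mean `white`, `black`, `other` instead of the alphabetical order.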

summary(birthwt)

##  birthwt.below.2500   mother.age    mother.weight      race
##  Min.   :0.0000     Min.   :14.00   Min.   : 80.0   black:26
##  1st Qu.:0.0000     1st Qu.:19.00   1st Qu.:110.0   other:67
##  Median :0.0000     Median :23.00   Median :121.0   white:96
##  Mean   :0.3122     Mean   :23.24   Mean   :129.8
##  3rd Qu.:1.0000     3rd Qu.:26.00   3rd Qu.:140.0
##  Max.   :1.0000     Max.   :45.00   Max.   :250.0
##  mother.smokes previous.prem.labor hypertension uterine.irr
##  No :115       Min.   :0.0000      No :177      No :161
##  Yes: 74       1st Qu.:0.0000      Yes: 12      Yes: 28
##                Median :0.0000
##                Mean   :0.1958
##                3rd Qu.:0.0000
##                Max.   :3.0000
##  physician.visits birthwt.grams
##  Min.   :0.0000   Min.   : 709
##  1st Qu.:0.0000   1st Qu.:2414
##  Median :0.0000   Median :2977
##  Mean   :0.7937   Mean   :2945
##  3rd Qu.:1.0000   3rd Qu.:3487
##  Max.   :6.0000   Max.   :4990
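One way to "check it" is to test a few things the documentation leads us to expect. The sketch below works on a fresh copy called `raw` (a name chosen here) with the original column names, so it does not disturb the renamed `birthwt`; the particular checks are suggestions, not a complete audit:

```r
raw <- MASS::birthwt   # fresh copy with the original column names

# Any missing values?
anyNA(raw)

# Does the low-weight indicator agree with the recorded weight in grams?
table(low = raw$low, under_2500g = raw$bwt < 2500)

# Are the indicator columns really coded 0/1?
is_binary <- sapply(raw[c("low", "smoke", "ht", "ui")],
                    function(x) all(x %in% c(0, 1)))
is_binary
```

If a check like the cross-tabulation shows disagreement, that is a cue to go back to `help(birthwt)` before modelling anything.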

plot (birthwt$race)
title (main = "Count of Mother's Race in Springfield MA, 1986")

plot (birthwt$mother.age)
title (main = "Mother's Ages in Springfield MA, 1986", ylab="Mother's Age")

plot (sort(birthwt$mother.age))
title (main = "(Sorted) Mother's Ages in Springfield MA, 1986", ylab="Mother's Age")

plot (birthwt$mother.age, birthwt$birthwt.grams)
title (main = "Birth Weight by Mother's Age in Springfield MA, 1986", xlab="Mother's Age", ylab="Birth Weight (g)")

Let's fit some models to the data pertaining to our outcome(s) of interest.

plot (birthwt$mother.smokes, birthwt$birthwt.grams, main="Birth Weight by Mother's Smoking Habit", ylab = "Birth Weight (g)", xlab="Mother Smokes")

Tough to tell! Simple two-sample t-test:

t.test (birthwt$birthwt.grams[birthwt$mother.smokes == "Yes"], birthwt$birthwt.grams[birthwt$mother.smokes == "No"])

##
##  Welch Two Sample t-test
##
## data:  birthwt$birthwt.grams[birthwt$mother.smokes == "Yes"] and birthwt$birthwt.grams[birthwt$mother.smokes == "No"]
## t = -2.7299, df = 170.1, p-value = 0.007003
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -488.97860  -78.57486
## sample estimates:
## mean of x mean of y
##  2771.919  3055.696

Does this difference match the linear model?

linear.model.1 <- lm (birthwt.grams ~ mother.smokes, data=birthwt)
linear.model.1

##
## Call:
## lm(formula = birthwt.grams ~ mother.smokes, data = birthwt)
##
## Coefficients:
##      (Intercept)  mother.smokesYes
##           3055.7            -283.8

summary(linear.model.1)

##
## Call:
## lm(formula = birthwt.grams ~ mother.smokes, data = birthwt)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2062.9  -475.9    34.3   545.1  1934.3
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)       3055.70      66.93  45.653  < 2e-16 ***
## mother.smokesYes  -283.78     106.97  -2.653  0.00867 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 717.8 on 187 degrees of freedom
## Multiple R-squared:  0.03627,    Adjusted R-squared:  0.03112
## F-statistic: 7.038 on 1 and 187 DF,  p-value: 0.008667
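The estimate for `mother.smokesYes` (-283.78) is the difference between the two group means reported by the t-test (2771.919 - 3055.696). The t statistics differ slightly (-2.653 versus -2.7299) because `t.test()` defaults to the Welch test, which does not assume equal variances, while `lm()` pools the variance. A sketch that checks this, using a fresh copy `bw` with the original column names and a helper column `smokes` (both names chosen here):

```r
bw <- MASS::birthwt
bw$smokes <- factor(c("No", "Yes")[bw$smoke + 1])

fit <- lm(bwt ~ smokes, data = bw)

# The slope is the difference in group means (smokers minus non-smokers)
group_means <- tapply(bw$bwt, bw$smokes, mean)
slope <- unname(coef(fit)["smokesYes"])
mean_diff <- unname(group_means["Yes"] - group_means["No"])
c(slope = slope, mean_diff = mean_diff)

# A pooled-variance t-test gives the same t statistic as lm()
pooled <- t.test(bw$bwt[bw$smokes == "Yes"],
                 bw$bwt[bw$smokes == "No"],
                 var.equal = TRUE)
t_pooled <- unname(pooled$statistic)
t_lm <- coef(summary(fit))["smokesYes", "t value"]
c(t_pooled = t_pooled, t_lm = t_lm)
```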

Now try a continuous predictor: does mother's age explain birth weight?

linear.model.2 <- lm (birthwt.grams ~ mother.age, data=birthwt)
linear.model.2

##
## Call:
## lm(formula = birthwt.grams ~ mother.age, data = birthwt)
##
## Coefficients:
## (Intercept)   mother.age
##     2655.74        12.43

summary(linear.model.2)

##
## Call:
## lm(formula = birthwt.grams ~ mother.age, data = birthwt)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2294.78  -517.63    10.51   530.80  1774.92
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2655.74     238.86   11.12   <2e-16 ***
## mother.age     12.43      10.02    1.24    0.216
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 728.2 on 187 degrees of freedom
## Multiple R-squared:  0.008157,   Adjusted R-squared:  0.002853
## F-statistic: 1.538 on 1 and 187 DF,  p-value: 0.2165
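To read the coefficients: the fitted line is roughly 2655.74 + 12.43 × age, so about 2966 g for a 25-year-old mother. A sketch of getting that from the model object, refitting on a fresh copy of the data with the original column names; `fit_age` and `new_mother` are names chosen here, and age 25 is an arbitrary example value:

```r
fit_age <- lm(bwt ~ age, data = MASS::birthwt)

# Predicted birth weight (g) for a hypothetical 25-year-old mother
new_mother <- data.frame(age = 25)
pred <- predict(fit_age, newdata = new_mother)
pred

# 95% confidence intervals for the intercept and slope
ci <- confint(fit_age)
ci
```

The interval for `age` spans zero, which is the same message as the p-value of 0.216 in the summary: this data gives little evidence of a linear age effect.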

R tries to make diagnostics as easy as possible. Try this in the R console:

plot(linear.model.2)
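Called on an `lm` object, `plot()` steps through several diagnostic plots one after another. A small sketch, refitting the age model as `fit_age` (a name chosen here) on a fresh copy of the data, that puts them on a single page instead:

```r
fit_age <- lm(bwt ~ age, data = MASS::birthwt)

# Arrange the diagnostic plots in a 2 x 2 grid, then restore the old layout
old_par <- par(mfrow = c(2, 2))
plot(fit_age)
par(old_par)
```

Saving and restoring `par()` keeps later plots from inheriting the 2 x 2 layout.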