Graphics

Agenda

High-level graphics
Custom graphics
Layered graphics in ggplot2

Functions for graphics

The functions hist(), boxplot(), plot(), points(), lines(), text(), mtext(), axis(), etc. form a suite that plot graphs and add features to the graph
Each of these functions have various options, to learn more about them, use the help
par() can be used to set or query graphical parameters

Univariate data: Histogram

x = state.x77[ , 2]                   # 50 average state incomes in 1977
hist(x)

Univariate data: Histogram

hist(x, breaks = 8, xlab="Income", main="Histogram of State Income in 1977")

Univariate data: Histogram

y = quakes$depth                      # 1000 earthquake depths
hist(y, seq(0, 700, by = 70), xlab="Earthquake Depth", main="Histogram of Earthquake Depths")

Empirical CDF

Function ecdf() provides data for empirical cdf

plot.ecdf(x)

Empirical CDF

Can add vertical lines and remove dots

plot.ecdf(x, verticals = T, pch = "", xlab="Income", main="ECDF of State Income in 1977")

Empirical CDF

plot.ecdf(y, verticals = T, pch = "", xlab="Earthquake Depth", 
          main="ECDF of Earthquake Depths")

`qqnorm()` and `qqplot()`

`qqnorm() plots the quantiles of a data set against the quantiles of a Normal distribution
`qqplot() plots the quantiles of a first data set against the quantiles of a second data set

`qqnorm()` and `qqplot()`

qqnorm(x)                             # qq plot for the earthquake depths
qqline(x, col = "red")                # red reference line

`qqnorm()` and `qqplot()`

qqnorm(y)                             # qq plot for the earthquake depths
qqline(y, col = "red")                # red reference line

Box plots

boxplot(count ~ spray, data = InsectSprays)

Scatterplots: `plot(x, y)`

plot(quakes$long, quakes$lat, xlab="Latitude", ylab="Longitude", 
     main="Location of Earthquake Epicenters")

Scatterplots: `plot(x, y)`

symbols(quakes$long, quakes$lat, circles = 10 ^ quakes$mag, xlab="Latitude",
        ylab="Longitude", main="Location of Earthquake Epicenters")

Three-dimensional data: `pairs(x)`

pairs(trees)

Three dimensional plots

contour(crimtab, main="Contour Plot of Criminal Data")

Three dimensional plots

image(crimtab, main="Image Plot of Criminal Data")

Three dimensional plots

persp(crimtab, theta=30, main="Perspective Plot of Criminal Data")

Categorical data: Pie charts

pie.sales = c(0.12, 0.30, 0.26, 0.16, 0.04, 0.12)
names(pie.sales) = c("Blueberry", "Cherry", "Apple", "Boston Creme",
                        "Other", "Vanilla Creme")
pie(pie.sales, col = c("blue", "red", "green", "wheat", "orange", "white"))

Categorical data: Pie charts

dotchart() and barplot() also available

barplot(VADeaths, beside = T, legend = T,
        main = "Virginia Death Rates per 1000 in 1940")

Time series plots

ts.plot(AirPassengers, xlab="Date", ylab="Passengers (in thousands)"
        , main="International Airline Passengers")

Time series plots

ts.plot(presidents, xlab="Date", ylab="Approval Rating"
        , main="Presidential Approval Ratings")

Custom graphics

par() can be used to set or query graphical parameters
We've already used some of these
adj: text justification
bg: background color
col, col.axis, col.lab, …: color specification
lty: line type, e.g. dashed, dotted, solid (default), longdash, …
lwd: line width (helpful to increase for presentation plots)
mfcol and mfrow: subsequent figures will be drawn in an nr-by-nc array on the device
pch: point types
xlog: plots to log scale if TRUE
…

Binomial distribution

Plot of binomial distribution with \(n=5\) and \(p=.4\)

x = 0:5                     
y = dbinom(x, 5, 2 / 5)    
plot(x, y, type = "h", main="Binomial Distribution", xlab="Value", ylab="Probability")

Normal distribution

Probability density function for the standard Normal distribution from -3 to 3

x = seq(-3, 3, by = 0.01)   
y = dnorm(x)                
plot(x, y, type = "l", main="Normal Distribution", ylab="f(x)")

Two empirical cdfs: `Puromycin` dataset

x = Puromycin$rate[Puromycin$state == "treated"]
y = Puromycin$rate[Puromycin$state == "untreated"]

Two empirical cdfs: `Puromycin` dataset

plot.ecdf(x, verticals = TRUE, pch = "", xlim = c(60, 200), main="Treated versus Untreated")
lines(ecdf(y), verticals = TRUE, pch = "", xlim = c(60, 200), col="blue")
legend("bottomright", c("Treated", "Untreated"), pch = "", col=c("black", "blue"), lwd = 1)

Saving a plot to a file

Begin with functions postscript(), pdf(), tiff(), jpeg(), …
… put all your plotting commands here …
Finish with dev.off()

pdf("2cdfs.pdf", width=6, height=4)
plot.ecdf(x, verticals = TRUE, pch = "", xlim = c(60, 200), main="Treated versus Untreated")
lines(ecdf(y), verticals = TRUE, pch = "", xlim = c(60, 200), col="blue")
legend("bottomright", c("Treated", "Untreated"), pch = "", col=c("black", "blue"), lwd = 1)
dev.off()

## quartz_off_screen 
##                 2

Multiple plots on one set of axes

x = seq(0, 2 * pi, length = 100)
sine = sin(x)
cosine = cos(x)
matplot(x, cbind(sine, cosine), col = c(1, 1), type = "l")

Multiple frame plots

par(mfrow = c(2, 2))
boxplot(precip)
hist(precip)
plot.ecdf(precip)
qqnorm(precip)

par(mfrow = c(1, 1))

Plot using statistical model

plot(rate ~ conc, data = Puromycin, pch = 15 * (state == "treated") + 1) 
legend("bottomright", legend = c("Untreated", "Treated"), pch = c(1, 16))

Plot using `persp()` for wire mesh

x = seq(-8, 8, length = 100)
y = x
f = function(x, y) sin(sqrt(x ^ 2 + y ^ 2)) / (sqrt (x ^ 2 + y ^ 2)) 
z = outer(x, y, f)
persp(x, y, z, xlab = "", ylab = "", zlab = "", axes = F, box = F)

Custom plot: Leemis Chapter 21

`ggplot2`

Everything so far has been part of base R
The ggplot2 package is a popular by Hadley Wickham
Based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts
http://ggplot2.org

Forced Expiratory Volume (FEV) Data

Explore a data on the relationship between smoking and pulmonary function from Rosner (1999) using layered graphics created with ggplot2. The data consists of a sample of 654 youths, aged 3 to 19, in the area of East Boston during middle to late 1970’s. Our main interest is in the relationship between smoking and FEV. Use the following commands to load the data and ggplot2.

load(url("http://www.faculty.ucr.edu/~jflegal/fev.RData"))
library(ggplot2)
str(fevdata)

## 'data.frame':    654 obs. of  5 variables:
##  $ age   : int  9 8 7 9 9 8 6 6 8 9 ...
##  $ fev   : num  1.71 1.72 1.72 1.56 1.9 ...
##  $ height: num  57 67.5 54.5 53 57 61 58 56 58.5 60 ...
##  $ sex   : Factor w/ 2 levels "female","male": 1 1 1 2 2 1 1 1 1 1 ...
##  $ smoke : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...

Layered graphics in `ggplot2`

ggplot2 allows you to construct multi-layered graphics. A plot in ggplot2 consists of several components:

Defaults
Layers
Scales
Coordinate system

Layers consist of:

Data
Mapping
Geom
Stat
Position

ggplot2 uses the + operator to build up a plot from these components. The basic plot definition looks like this:

ggplot(data, mapping) + 
layer(
  stat = "",
  geom = "",
  position = "", 
  geom_parms = list(), 
  stat_params = list(),
)

We usually won’t write out the full specification of layer, but use shortcuts like:

geom_point()
stat_summary()

Every geom has a default stat and every stat has a default geom.

Usually, data and mappings are the same for all layers and so they can be stored as defaults:

ggplot(data, mapping = aes(x = x, y = y))

All layers use the default values of data and mapping unless explicitly you override them explicitly. The aes() function describes the mapping that will be used for each layer. You must specify a default, but you can also specify per layer mappings and data:

ggplot(data, mapping(aes(x = x, y = y))) 
  + geom_point(aes(color = z)) 
  + geom_line(data = another_data)

Scatterplot of age versus fev:

s <- ggplot(fevdata, aes(x = age, y = fev)) 
s + geom_point()

What do you see?
Try coloring the points according to sex or smoker at home. Are there any apparent differences?

Layering

You can add additional layers to a plot with the + operator. Let’s try adding a line that shows the average value of fev for each age. One way to do this is to construct an additional data frame with columns corresponding to age and average value of fev and then add a layer with this data. We will do this with the dplyr package. Don’t worry about how it works for now. We will learn more about it later. First, you will need to install dplyr if it is not already installed:

install.packages("dplyr")

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

fev_mean <- summarize(group_by(fevdata, age), fev = mean(fev))
s + geom_point() + geom_line(data = fev_mean)

Smoothing

Similarly, we can add a smoother to the scatterplot by first computing the smooth and storing it in data frame. Then add a layer with that data. Since smoothers are so useful, this operation is available in ggplot2 as a stat.
stat_smooth() provides a smoothing transformation. It creates a new data frame with the values of the smooth and by default uses geom="ribbon" so that both the smooth curve and error bands are shown.

s + geom_point() + stat_smooth()

The default smoother is loess. This is the name given to the locally weighted quadratic regression smoother with tricubic weight function. Its bandwidth can be specified indirectly with the span parameter.

s + geom_point() + stat_smooth(span = 1)

s + geom_point() + stat_smooth(span = 1/2)

We could also use linear regression by specifying lm:

s + geom_point() + stat_smooth(method = 'lm')

Grouping

Clearly, we can see age and fev are highly correlated. What else is correlated with age and fev?
How can we compare the relationship between age and fev among smokers and non-smokers? One way is to use two separates smoothers: one for smokers and one for non-smokers.
Can do this using the group aesthetic. It specifies that we wish to group the data according by some variable before layering.

p <- ggplot(fevdata, aes(x = age, y = fev, group = smoke))
p + geom_point() + stat_smooth()

A problem with this plot is that we can’t tell which group each curve corresponds to.

p + geom_point() + stat_smooth(aes(color = smoke))

Or for points and smooths

p <- ggplot(fevdata, aes(x = age, y = fev, group = smoke, color = smoke))
p + geom_point() + stat_smooth()

What do these plots suggest?
What can’t be seen in these plots?
Could there be lurking confounders?

How is the following plot different from the others in terms of the conclusions you might draw about the relation between smoke and fev?

p + geom_point() + stat_smooth(method = 'lm')

Faceting

Faceting is a technique for constructing multiple subplots that show different subsets of data. ggplot2 has two ways to facet: facet_wrap and facet_grid. Grid faceting allows you use specify up to two factor variables: one for rows and one for columns of the grid. Wrap faceting allows you to specify one factor variable. We can use faceting to examine the relationship between fev and height for different combinations of age and sex. This will allow us to view four variables simultaneously.

p <- ggplot(fevdata, aes(x = height, y = fev)) + facet_grid(sex ~ age)
p + geom_point()

One problem with the previous plot is that there are too many ages and relatively few observations at each age. We can instead try dividing age into a smaller number of groups using the cut() function. cut() creates a new factor variable by cutting its input. Here we cut age into 5 intervals of equal length:

fevdata <- transform(fevdata, age_group = cut(age, breaks = 5))

Then make new plots:

p <- ggplot(fevdata, aes(x = height, y = fev)) + facet_grid(sex ~ age_group)
p + geom_point(aes(color = smoke))

Summary

R has strong graphic capabilities
Graphing is an iterative process; Don’t rely on the default options
Avoid gimmicks, use the minimum amount of ink to get your point across
A small table can be better than a large graph
Carefully consider the size and shape of your graph, bigger is not always better

Agenda

Functions for graphics

Univariate data: Histogram

Univariate data: Histogram

Univariate data: Histogram

Empirical CDF

Empirical CDF

Empirical CDF

qqnorm() and qqplot()

qqnorm() and qqplot()

qqnorm() and qqplot()

Box plots

Scatterplots: plot(x, y)

Scatterplots: plot(x, y)

Three-dimensional data: pairs(x)

Three dimensional plots

Three dimensional plots

Three dimensional plots

Categorical data: Pie charts

Categorical data: Pie charts

Time series plots

Time series plots

Custom graphics

Binomial distribution

Normal distribution

Two empirical cdfs: Puromycin dataset

Two empirical cdfs: Puromycin dataset

Saving a plot to a file

Multiple plots on one set of axes

Multiple frame plots

Plot using statistical model

Plot using persp() for wire mesh

Custom plot: Leemis Chapter 21

ggplot2

Forced Expiratory Volume (FEV) Data

Layered graphics in ggplot2

Layering

Smoothing

Grouping

Faceting

Summary

`qqnorm()` and `qqplot()`

`qqnorm()` and `qqplot()`

`qqnorm()` and `qqplot()`

Scatterplots: `plot(x, y)`

Scatterplots: `plot(x, y)`

Three-dimensional data: `pairs(x)`

Two empirical cdfs: `Puromycin` dataset

Two empirical cdfs: `Puromycin` dataset

Plot using `persp()` for wire mesh

`ggplot2`

Layered graphics in `ggplot2`