--- title: 'More Data Structures' author: "James M. Flegal" output: ioslides_presentation: smaller: yes beamer_presentation: default --- ## Agenda - Arrays - Matrices - Lists - Dataframes - Structures of structures ## Arrays Many data structures in R are made by adding bells and whistles to vectors, so "vector structures" Most useful: **arrays** {r} x <- c(7, 8, 10, 45) x.arr <- array(x,dim=c(2,2)) x.arr  dim says how many rows and columns; filled by columns Can have $3, 4, \ldots n$ dimensional arrays; dim is a length-$n$ vector ## Some properties of the array: {r} dim(x.arr) is.vector(x.arr) is.array(x.arr)  ## {r} typeof(x.arr) str(x.arr) attributes(x.arr)  typeof() returns the type of the _elements_ str() gives the **structure**: here, a numeric array, with two dimensions, both indexed 1--2, and then the actual numbers Exercise: try all these with x ## Accessing and operating on arrays Can access a 2-D array either by pairs of indices or by the underlying vector: {r} x.arr[1,2] x.arr  ## Omitting an index means "all of it": {r} x.arr[c(1:2),2] x.arr[,2]  ## Functions on arrays Using a vector-style function on a vector structure will go down to the underlying vector, _unless_ the function is set up to handle arrays specially: {r} which(x.arr > 9)  ## Many functions _do_ preserve array structure: {r} y <- -x y.arr <- array(y,dim=c(2,2)) y.arr + x.arr  Others specifically act on each row or column of the array separately: {r} rowSums(x.arr)  We will see a lot more of this idea ## Example: Price of houses in PA Census data for California and Pennsylvania on housing prices, by Census "tract" {r} calif_penn <- read.csv("http://www.stat.cmu.edu/~cshalizi/uADA/13/hw/01/calif_penn_2011.csv") penn <- calif_penn[calif_penn[,"STATEFP"]==42,] coefficients(lm(Median_house_value ~ Median_household_income, data=penn))  Fit a simple linear model, predicting median house price from median household income ## Census tracts 24--425 are Allegheny county Tract 24 has a median income of \$14,719; actual median house value is \$34,100 --- is that above or below what's? {r} 34100 < -26206.564 + 3.651*14719  Tract 25 has income \$48,102 and house price \$155,900 {r} 155900 < -26206.564 + 3.651*48102  What about tract 26? ## We _could_ just keep plugging in numbers like this, but that's - boring and repetitive - error-prone (what if I forget to change the median income, or drop a minus sign from the intercept?) - obscure if we come back to our work later (what _are_ these numbers?) ## Use variables and names {r} penn.coefs <- coefficients(lm(Median_house_value ~ Median_household_income, data=penn)) penn.coefs  {r} allegheny.rows <- 24:425 allegheny.medinc <- penn[allegheny.rows,"Median_household_income"] allegheny.values <- penn[allegheny.rows,"Median_house_value"] allegheny.fitted <- penn.coefs["(Intercept)"]+penn.coefs["Median_household_income"]*allegheny.medinc  ## {r} plot(x=allegheny.fitted, y=allegheny.values, xlab="Model-predicted median house values", ylab="Actual median house values", xlim=c(0,5e5),ylim=c(0,5e5)) abline(a=0,b=1,col="grey")  ## Running example: resource allocation ("mathematical programming") Factory makes cars and trucks, using labor and steel - a car takes 40 hours of labor and 1 ton of steel - a truck takes 60 hours and 3 tons of steel - resources: 1600 hours of labor and 70 tons of steel each week ## Matrices In R, a matrix is a specialization of a 2D array {r} factory <- matrix(c(40,1,60,3),nrow=2) is.array(factory) is.matrix(factory)  could also specify ncol, and/or byrow=TRUE to fill by rows. Element-wise operations with the usual arithmetic and comparison operators (e.g., factory/3) Compare whole matrices with identical() or all.equal() ## Matrix multiplication Gets a special operator {r} six.sevens <- matrix(rep(7,6),ncol=3) six.sevens factory %*% six.sevens # [2x2] * [2x3]  What happens if you try six.sevens %*% factory? ## Multiplying matrices and vectors Numeric vectors can act like proper vectors: {r} output <- c(10,20) factory %*% output output %*% factory  R silently casts the vector as either a row or a column matrix ## Matrix operators Transpose: {r} t(factory)  Determinant: {r} det(factory)  ## The diagonal The diag() function can extract the diagonal entries of a matrix: {r} diag(factory)  It can also _change_ the diagonal: {r} diag(factory) <- c(35,4) factory  Re-set it for later: {r} diag(factory) <- c(40,3)  ## Creating a diagonal or identity matrix {r} diag(c(3,4)) diag(2)  ## Inverting a matrix {r} solve(factory) factory %*% solve(factory)  ## Why's it called "solve"" anyway? Solving the linear system $\mathbf{A}\vec{x} = \vec{b}$ for $\vec{x}$: {r} available <- c(1600,70) solve(factory,available) factory %*% solve(factory,available)  ## Names in matrices We can name either rows or columns or both, with rownames() and colnames() These are just character vectors, and we use the same function to get and to set their values Names help us understand what we're working with Names can be used to coordinate different objects ## {r} rownames(factory) <- c("labor","steel") colnames(factory) <- c("cars","trucks") factory available <- c(1600,70) names(available) <- c("labor","steel")  ## {r} output <- c(20,10) names(output) <- c("trucks","cars") factory %*% output # But we've got cars and trucks mixed up! factory %*% output[colnames(factory)] all(factory %*% output[colnames(factory)] <= available[rownames(factory)])  Notice: Last lines don't have to change if we add motorcycles as output or rubber and glass as inputs (abstraction again) ## Doing the same thing to each row or column Take the mean: rowMeans(), colMeans(): input is matrix, output is vector. Also rowSums(), etc. summary(): vector-style summary of column {r} colMeans(factory) summary(factory)  ## apply(), takes 3 arguments: the array or matrix, then 1 for rows and 2 for columns, then name of the function to apply to each {r} rowMeans(factory) apply(factory,1,mean)  What would apply(factory,1,sd) do? ## Lists Sequence of values, _not_ necessarily all of the same type {r} my.distribution <- list("exponential",7,FALSE) my.distribution  Most of what you can do with vectors you can also do with lists ## Accessing pieces of lists Can use [ ] as with vectors or use [[ ]], but only with a single index [[ ]] drops names and structures, [ ] does not {r} is.character(my.distribution) is.character(my.distribution[]) my.distribution[]^2  What happens if you try my.distribution^2? What happens if you try [[ ]] on a vector? ## Expanding and contracting lists Add to lists with c() (also works with vectors): {r} my.distribution <- c(my.distribution,7) my.distribution  ## Chop off the end of a list by setting the length to something smaller (also works with vectors): {r} length(my.distribution) length(my.distribution) <- 3 my.distribution  ## Naming list elements We can name some or all of the elements of a list {r} names(my.distribution) <- c("family","mean","is.symmetric") my.distribution my.distribution[["family"]] my.distribution["family"]  ## Lists have a special short-cut way of using names, $ (which removes names and structures): {r} my.distribution[["family"]] my.distribution$family  ## Names in lists Creating a list with names: {r} another.distribution <- list(family="gaussian",mean=7,sd=1,is.symmetric=TRUE)  Adding named elements: {r} my.distribution$was.estimated <- FALSE my.distribution[["last.updated"]] <- "2011-08-30"  Removing a named list element, by assigning it the value NULL: {r} my.distribution$was.estimated <- NULL  ## Key-Value pairs Lists give us a way to store and look up data by _name_, rather than by _position_ A really useful programming concept with many names: **key-value pairs**, **dictionaries**, **associative arrays**, **hashes** If all our distributions have components named family, we can look that up by name, without caring where it is in the list ## Dataframes Dataframe = the classic data table, $n$ rows for cases, $p$ columns for variables Lots of the really-statistical parts of R presume data frames penn from last time was really a dataframe Not just a matrix because *columns can have different types* Many matrix functions also work for dataframes (rowSums(), summary(), apply()) but no matrix multiplying dataframes, even if all columns are numeric ## {r} a.matrix <- matrix(c(35,8,10,4),nrow=2) colnames(a.matrix) <- c("v1","v2") a.matrix a.matrix[,"v1"] # Try a.matrix$v1 and see what happens  ## {r} a.data.frame <- data.frame(a.matrix,logicals=c(TRUE,FALSE)) a.data.frame a.data.frame$v1 a.data.frame[,"v1"] a.data.frame[1,] colMeans(a.data.frame)  ## Adding rows and columns We can add rows or columns to an array or data-frame with rbind() and cbind(), but be careful about forced type conversions {r} rbind(a.data.frame,list(v1=-3,v2=-5,logicals=TRUE)) rbind(a.data.frame,c(3,4,6))  ## Structures of Structures So far, every list element has been a single data value List elements can be other data structures, e.g., vectors and matrices: {r} plan <- list(factory=factory, available=available, output=output) plan$output  Internally, a dataframe is basically a list of vectors ## Structures of Structures List elements can even be other lists which may contain other data structures including other lists which may contain other data structures... This **recursion** lets us build arbitrarily complicated data structures from the basic ones Most complicated objects are (usually) lists of data structures ## Example: Eigenstuff eigen() finds eigenvalues and eigenvectors of a matrix Returns a list of a vector (the eigenvalues) and a matrix (the eigenvectors) {r} eigen(factory) class(eigen(factory))  ## With complicated objects, you can access parts of parts (of parts...) {r} factory %*% eigen(factory)$vectors[,2] eigen(factory)$values * eigen(factory)$vectors[,2] eigen(factory)$values eigen(factory)[][] # NOT [[1,2]]  ## Creating an example dataframe {r} library(datasets) states <- data.frame(state.x77, abb=state.abb, region=state.region, division=state.division)  data.frame() is combining here a pre-existing matrix (state.x77), a vector of characters (state.abb), and two vectors of qualitative categorical variables (**factors**; state.region, state.division) Column names are preserved or guessed if not explicitly set ## {r} colnames(states) states[1,]  ## Dataframe access - By row and column index {r} states[49,3]  - By row and column names {r} states["Wisconsin","Illiteracy"]  ## Dataframe access - All of a row: {r} states["Wisconsin",]  Exercise: what class is states["Wisconsin",]? ## Dataframe access - All of a column: {r} head(states[,3]) head(states[,"Illiteracy"]) head(states$Illiteracy)  ## Dataframe access - Rows matching a condition: {r} states[states$division=="New England", "Illiteracy"] states[states$region=="South", "Illiteracy"]  ## Dataframe access Parts or all of the dataframe can be assigned to: {r} summary(states$HS.Grad) states$HS.Grad <- states$HS.Grad/100 summary(states$HS.Grad) states$HS.Grad <- 100*states$HS.Grad  ## with() What percentage of literate adults graduated HS? {r} head(100*(states$HS.Grad/(100-states$Illiteracy)))  with() takes a data frame and evaluates an expression "inside" it: {r} with(states, head(100*(HS.Grad/(100-Illiteracy))))  ## Data arguments Lots of functions take data arguments, and look variables up in that data frame: {r} plot(Illiteracy~Frost, data=states)  $R^2 =0.45$, $p \approx {10}^{-7}$ ## Summary - Arrays add multi-dimensional structure to vectors - Matrices act like you'd hope they would - Lists let us combine different types of data - Dataframes are hybrids of matrices and lists, for classic tabular data - Recursion lets us build complicated data structures out of the simpler ones