R & BioConductor Manual
Index
- R Basics
- Introduction
- Finding Help
- Basics on Functions and Packages
- System commands under Linux
- Reading and Writing External Data
- R Objects
- Some Great R Functions
- Graphical Procedures
- Missing Values
- Writing Your Own Functions
- R Web Applications
- R Exercises
- Programming in R
- BioConductor
- Introduction
- Finding Help
- Affy Packages
- Analysis of Differentially Expressed Genes
- Dual Color Array Packages
- Chromosome maps
- Gene Ontologies
- KEGG Pathway Analysis
- Motif Identification in Promoter Regions
- Phylogenetic Analysis
- Mining Drug-Like Compounds
- Protein Structure Analysis
- BioC Exercises
- Clustering and Data Mining in R
- Introduction
- Data Preprocessing
- Hierarchical Clustering (HC)
- Bootstrap Analysis in Hierarchical Clustering
- QT Clustering
- K-Means & PAM
- Self-Organizing Map (SOM)
- Principal Component Analysis (PCA)
- Multi-Dimensional Scaling (MDS)
- Support Vector Machines (SVM)
- Clustering Exercises
- Administration
- R Basics
- Introduction
- R (http://cran.at.r-project.org) is a complete statistical environment and programming language for professional data analysis and graphical display. The fully integrated BioConductor project provides many additional R packages for statistical data analysis in different life science areas, such as tools for microarray, sequence and genome analysis.
This manual provides a condensed introduction into the basic R and BioConductor functions. A not always very easy to read, but practical copy&paste format has been chosen throughout this manual. In this format all commands are represented in bold font followed by a short description of their usage in non-bold font. Often several commands are concatenated on one line and separated with a semicolon ';'. All explanations start with the standard comment sign '#' to prevent them from being interpreted by R as commands. This way several commands can be pasted with their comment text into the R console to demo the different functions and analysis steps. Commands starting with a '$' sign need to be executed from a Unix or Linux shell. Windows users can simply ignore them. Information in red color is considered essential knowledge and commands in green color are important for someone interested in a quick start with R and BioConductor. The remaining commands in black color provide often more in-depth information about a certain topic.
- Installation of the R Software and R Packages
- Basic R Usage
$ R # Starts the R console under Unix/Linux. The R GUI versions under Windows and Mac OS X can be started by double-clicking their icons.
object <- function(arguments) or object = function(arguments) # This general R command syntax uses the assignment operator '<-' (or '=') to assign data generated by command to its right to object on its left. A more recently introduced assignment operator is '='. Both of them work the same way and in both directions. For consistency reasons one should use only one of them.
assign("x", function(arguments)) # Has the same effect as above, but uses the assignment function instead of the assignment operator.
source("my_script") # Command to execute an R script, here 'my_script'. For example, generate a text file 'my_script' with the command 'print(1:100)', then execute it with the source function.
x <- edit(data.frame()) # Starts empty GUI spreadsheet editor for manual data entry.
x <- edit(x) # Opens existing data frame (table) 'x' in GUI spreadsheet editor.
x <- scan(w="c") # Lets you enter values from the keyboard or by copy&paste and assigns them to vector 'x'. For pure numeric values one can use 'x <- scan()'.
read.delim("clipboard") # Command to paste tables from Excel or other programs into R.
read.delim(pipe("pbpaste")) # Command to paste tables on Mac OS X systems into R.
x <- vi(x) # Allows editing of data in vi, emacs and other text editors. To save changes, object needs to be reassigned.
q() # Quits R console.
- R Startup Behavior
The R environment is controlled by hidden files in the startup directory: .RData, .Rhistory and .Rprofile (optional)
- Finding Help
- Various online manuals are available on the R project site. Very useful manuals for beginners are: An Introduction to R, the manual simpleR - Using R for Introductory Statistics, R for Beginners and Peter Dalgaard's book Introductory Statistics with R. To find basic functions and syntax structures quickly, one can consult The R Reference Card. Paul Murrell's book is a complete reference on R graphics. More manuals and documentation can be found on this R search site. References on R programming are listed in the 'Programming in R' chapter of this manual. Documentation within R can be found with the following commands.
-
?function # or 'help(function)' opens documentation on a function
apropos(function) # Finds all functions containing a given term.
example(heatmap) # executes examples for function 'heatmap'
help.search("topic") # searches help system for documentation
library() # shows available libraries on a system.
library(help=mypackage) or help(package=mypackage) # lists all functions/objects of a library.
help.start() # Starts local HTML interface. The link 'Packages' provides a list of all installed packages. After initiating 'start.help()' in a session the '?function' commands will open as HTML pages!
sessionInfo() # Prints version information about R and all loaded packages. The generated output should be provided when sending questions or bug reports to the R and BioC mailing lists.
$ R -h # or 'R --help'; provides help on R environment, more detailed information on page 90 of 'An Introduction to R'
- Basics on Functions and Packages
- R contains most arithmetic functions like mean, median, sum, prod, sqrt, length, log, etc. An extensive list of R functions can be found on the function and variable index page. Many R functions and datasets are stored in separate packages, which are only available after loading them into an R session. Information about installing new packages can be found in the administrative section of this manual.
-
library(my_package) # Loads a particular package.
library(help=mypackage) # Lists all functions/objects of a library.
search() # Lists which packages are currently loaded.
- Information and management of objects
-
ls() or objects() # Lists R objects created during session, they are stored in file '.RData' when exiting R and the workspace is saved.
rm(my_object1, my_object2, ...) # Removes objects.
rm(list = ls()) # Removes all objects without warning!
str(object) # Displays object types and structure of an R object.
ls.str(pattern="") # Lists object type info on all objects in a session.
print(ls.str(), max.level=0) # If a session contains very long list objects then one can simplify the output with this command.
lsf.str(pattern="") # Lists object type info on all functions in a session.
class(object) # Prints the object type.
mode(object) # Prints the storage mode of an object.
summary(object) # Generic summary info for all kinds of objects.
attributes(object) # Returns an object's attribute list.
gc() # Causes the garbage collection to take place. This is sometimes useful to clean up memory usage after deleting large objects.
length(object) # Provides length of object.
.Last.value # Prints thevalue of the last evaluated expression.
- Reading and changing directories
-
dir() # Reads content of current working directory.
getwd() # Returns current working directory.
setwd("/home/user") # Changes current working directory to tht specified directory.
- System commands under Linux
- R IN/OUTPUT & BATCH Mode
-
- One can redirect R input and output with '|', '>' and '<' from the Shell command line.
$ R --slave < my_infile > my_outfile # Argument '--slave' makes R run as 'quietly' as possible. This option is intended to support programs which use R to compute results for them. For example, if my_infile contains 'x <- c(1:100); x;' result for this R expression will be printed to 'my_outfile' (or STDOUT).
$ R CMD BATCH [options] my_script.R [outfile] # Sytax for running R programs in BATCH mode from the command-line. The output file lists the commands from the script file and their outputs. If no outfile is specified, the name used is that of 'infile' and '.Rout' is appended to outfile. To stop all the usual R command line information from being written to the outfile, add this as first line to my_script.R file: 'options(echo=FALSE)'. If the command is run like this 'R CMD BATCH --no-save my_script.R', then nothing will be saved in the .Rdata file which can get often very large. More on this can be found on the help pages: '$ R CMD BATCH --help' or '> ?BATCH'.
source("my_script") # Standard command to execute an R script on all OSs.
$ echo 'sqrt(3)' | R --slave # calculates sqrt of 3 in R and prints it to STDOUT.
- Executing Shell & Perl commands from R with 'system("")' function.
-
- Remember, single escapes (e.g. '\n') need to be double escaped in R (e.g. '\\n')
system("ls -al") # Prints content of current working directory.
system("perl -ne 'print if (/my_pattern1/ ? ($c=1) : (--$c > 0)); print if (/my_pattern2/ ? ($d = 1) : (--$d > 0));' my_infile.txt > my_outfile.txt") # Runs Perl one-liner that extracts two patterns from external text file and writes result into new file.
- Reading and Writing Data from/to Files
- Import
-
read.delim("clipboard", header=T) # Command to copy&paste tables from Excel or other programs into R. If the 'header' argument is set to FALSE, then the first line of the data set will not be used as column titles.
read.delim(pipe("pbpaste")) # Command to copy&paste on Mac OS X systems.
file.show("my_file") # prints content of file to screen, allows scrolling.
scan("my_file") # reads vector/array into vector from file or keyboard.
my_frame <- read.table(file="my_table") # reads in table and assigns it to data frame.
my_frame <- read.table(file="my_table", header=TRUE, sep="\t") # Same as above, but with info on column headers and field separators. If you want to import the data in character mode, then include this argument: colClasses = "character".
my_frame <- read.delim("ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/affy_ATH1_array_elements-2008-5-29.txt", na.strings = "", fill=TRUE, header=T, sep="\t") # The function read.delim() is often more flexible for importing tables with empty fields and long character strings (e.g. gene descriptions).
cat(month.name, file="zzz.txt", sep="\n"); x <- readLines("zzz.txt"); x <- x[c(grep("^J", as.character(x), perl = TRUE))]; t(as.data.frame(strsplit(x,"u"))) # A somewhat more advanced example for retrieving specific lines from an external file with a regular expression. In this example an external file is created with the 'cat' function, all lines of this file are imported into a vector with 'readLines', the specific elements (lines) are then retieved with the 'grep' function, and the resulting lines are split into sub-fields with 'strsplit'.
- Export
-
write.table(iris, "clipboard", sep="\t", col.names=NA, quote=F) # Command to copy&paste from R into Excel or other programs. It writes the data of an R data frame object into the clipbroard from where it can be pasted into other applications.
zz <- pipe('pbcopy', 'w'); write.table(iris, zz, sep="\t", col.names=NA, quote=F); close(zz) # Command to copy&paste from R into Excel or other programs on Mac OS X systems.
write.table(my_frame, file="my_file", sep="\t", col.names = NA) # Writes data frame to a tab-delimited text file. The argument 'col.names = NA' makes sure that the titles align with columns when row/index names are exported (default).
save(x, file="my_file.txt"); load(file="file.txt") # Commands to save R object to an external file and to read it in again from this file.
files <- list.files(pattern=".txt$"); for(i in files) { x <- read.table(i, header=TRUE, row.names=1, comment.char = "A", sep="\t"); assign(print(i, quote=FALSE), x); write.table(x, paste(i, c(".out"), sep=""), quote=FALSE, sep="\t", col.names = NA) } # Batch import and export of many files. First, the *.txt file names in the current directory are assigned to list ($ sign is used to anchor string '*.txt' to end of names). Second, the files are imported one-by-one using a for loop where the original names are assigned to the generated data frames with the 'assign' function. Read ?read.table to understand arguments 'row.names=1' and 'comment.char = "A"'. Third, the data frames are exported using their names for file naming and appending '*.out'.
HTML(my_frame, file = "my_table.html") # Writes data frame to HTML table. Subsequent exports to the same file will arrange several tables in one HTML document. In order to access this function one needs to load the library 'R2HTML' first. This library is usually not installed by default.
write(x, file="my_file") # Writes matrix data to a file.
sink("My_R_Output") # redirects all subsequent R output to a file 'My_R_Output' without showing it in the R console anymore.
sink() # restores normal R output behavior.
- R Objects
- Data and Object Types
- Data Types
- Numeric data: 1, 2, 3
x <- c(1, 2, 3); x; is.numeric(x); as.character(x) # Creates a numeric vector, checks for the data type and converts it into a character vector.
- Character data: "a", "b" , "c"
x <- c("1", "2", "3"); x; is.character(x); as.numeric(x) # Creates a character vector, checks for the data type and converts it into a numeric vector.
- Complex data: 1, b, 3
- Logical data: TRUE, FALSE, TRUE
1:10 < 5 # Returns TRUE where x is < 5.
- Object Types in R
- vectors: ordered collection of numeric, character, complex and logical values.
- factors: special type vectors with grouping information of its components
- data frames: two dimensional structures with different data types
- matrices: two dimensional structures with data of same type
- arrays: multidimensional arrays of vectors
- lists: general form of vectors with different types of elements
- functions: piece of code
- Naming Rules
- Object, row and column names should not start with a number.
- Avoid spaces in object, row and column names.
- Avoid special characters like '#'.
- General Subsetting Rules
- Subsetting syntax:
my_object[row] # Subsetting of one dimensional objects, like vectors and factors.
my_object[row, col] # Subsetting of two dimensional objects, like matrices and data frames.
my_object[row, col, dim] # Subsetting of three dimensional objects, like arrays.
- There are three possibilities to subset data objects.
- Subsetting by index numbers
my_object <- 1:26; names(my_object) <- LETTERS # Creates a vector sample with named elements.
my_object[1:4] # Returns the elements 1-4.
- Subsetting by same length logical vectors
my_logical <- my_object > 10 # Generates a logical vector as example.
my_object[my_logical] # Returns the elements where my_logical contains TRUE values.
- Subsetting by field names
my_object[c("B", "K", "M")] # Returns the elements with element titles: B, K and M.
- Calling a single column or list component by its name with the '$' sign.
iris$Species # Returns the 'Species' column in the sample data frame 'iris'.
iris[,c("Species")] # Has the same effect as the previous step.
- Basic Operators and Calculations
- Comparison operators
- equal: ==
- not equal: !=
- greater/less than: > <
- greater/less than or equal: >= <=
Example:
- Logical operators
- AND: &
x <- 1:10; y <- 10:1 # Creates the sample vectors 'x' and 'y'.
x > y & x > 5 # Returns TRUE where both comparisons return TRUE.
- OR: |
x == y | x != y # Returns TRUE where at least one comparison returns TRUE.
- NOT: !
!x > y # The '!' sign returns the negation (opposite) of a logical vector.
- Calculations
- Four basic arithmetic functions: addition, subtraction, multiplication and division
1 + 1; 1 - 1; 1 * 1; 1 / 1 # Returns the results of these calculations.
- Calculations on vectors
x <- 1:10; sum(x); mean(x), sd(x); sqrt(x) # Calculates for the vector x its sum, mean, standard deviation and square root. A list of the basic R functions can be found on the function and variable index page.
x <- 1:10; y <- 1:10; x + y # Calculates the sum for each element in the vectors x and y.
- Iterative calculations
apply(iris[,1:3], 1, mean) # Calculates the mean values for the columns 1-3 in the sample data frame 'iris'. With the argument setting '1', row-wise iterations are performed and with '2' column-wise iterations.
tapply(iris[,4], iris$Species, mean) # Calculates the mean values for the 4th column based on the grouping information in the 'Species' column in the 'iris' data frame.
sapply(x, sqrt) # Calculates the square root for each element in the vector x. Generates the same result as 'sqrt(x)'.
- Vectors
- General information
- Vectors are ordered collection of 'atomic' (same data type) components or modes of the following four types: numeric, character, complex and logical. Missing values are indicated by 'NA'. R inserts them automatically in blank fields.
-
x <- c(2.3, 3.3, 4.4) # Example for creating a numeric vector with ordered collection of numbers using 'c' (combine values in vector) function.
z <- scan(file="my_file") # Reads data from file and assigns them to the vector 'z'.
vector[rows] # Syntax to access vector sections.
z <- 1:10; z; as.character(z) # The function 'as.character' changes the data mode from numeric to character.
as.numeric(character) # The function 'as.numeric' changes the data mode from character to numeric.
d <- as.integer(x); d # Transforms numeric data into integers.
x <- 1:100; sample(x, 5) # Selects a random sample of size of 5 from a vector.
x <- as.integer(runif(100, min=1, max=5)); sort(x); rev(sort(x)); order(x); x[order(x)] # Generates random integers from 1 to 4. The sort() function sorts the items by size. The rev() function reverses the order. The order() function returns the sorted indices. The latter function is most flexible for sorting vectors, data frames, lists and other objects.
x <- rep(1:10, times=2); x; unique(x) # The unique() function removes the duplicated entries in a vector.
x <- as.integer(runif(20, min=1, max=5)); x[x==1] <- "A"; x[x==2] <- "T"; x[x==3] <- "G"; x[x==4] <- "C"; paste(x, sep = "", collapse ="") # Generates random DNA sequence of specified length (here 20 b). Alternatively, one can achieve a similar result with the above 'sample' function.
fx <- function(test){x <- as.integer(runif(20, min=1, max=5)); x[x==1] <- "A"; x[x==2] <- "T"; x[x==3] <- "G"; x[x==4] <- "C"; paste(x, sep = "", collapse ="")} ; z1 <- c(); for(i in 1:50) z1 <- c(fx(i), z1); z1 # Possibility to create as many random sequences as needed by using the above command as a function in a for loop.
- Sequences
- R has several facilities to create sequences of numbers:
-
1:30 # Generates a sequence of integers from 1 to 30.
letters[1:24]; LETTERS[1:24]; month.name[1:12]; month.abb[1:12] # Generates lower case letters, capital letters, month names and abbreviated month names, respectively.
2*1:10 # Creates a sequence of even numbers.
seq(1, 30, by=0.5) # Same as before, but with 0.5 increments.
seq(length=100, from=20, by=0.5) # Creates number sequence with specified start and length.
rep(LETTERS[1:8], times=5) # Replicates given sequence or vector x times.
- Character Vectors
paste(LETTERS[1:8], 1:12, sep="") # The command 'paste' merges vectors after converting to characters.
x <- paste(rep("A", times=12), 1:12, sep=""); y <- paste(rep("B", times=12), 1:12, sep=""); append(x,y) # possibility to build plate location vector in R (better example under 'arrays').
- Subsetting Vectors
x <- 1:100; x[2:23] # Values in square brackets select vector range.
x <- 1:100; x[-(2:23)] # Prints everything except the values in square brackets.
x[5] <- 99 # Replaces value at position 5 with '99'.
x <- 1:10; y <- c(x[1:5],99,x[6:10]); y # Inserts new value at defined position of vector.
letters=="c" # Returns logical vector of "FALSE" and "TRUE" strings.
which(rep(letters,2)=="c") # Returns index numbers where "c" occurs in the 'letters' vector. For retrieving indices of several strings provided by query vector, use the following 'match' function.
match(c("c","g"), letters) # Returns index numbers for "c" and "g" in the 'letters' vector. If the query vector (here 'c("c","g")') contains entries that are duplicated in the target vector, then this syntax returns only the first occurence(s) for each duplicate. To retrieve the indices for all duplicates, use the following '%in%' function.
x <- rep(1:10, 2); y <- c(2,4,6); x %in% y # The function '%in%' returns a logical vector. This syntax allows the subsetting of vectors and data frames with a query vector ('y') containing entries that are duplicated in the target vector ('x'). The resulting logical vector can be used for the actual subsetting step of vectors and data frames.
- Finding Identical and Non-Identical Entries between Vectors
intersect(month.name[1:4],month.name[3:7]) # Returns identical entries between vectors.
month.name[month.name %in% month.name[3:7]] # Returns identical entries between vectors.
setdiff(month.name[1:4],month.name[3:7]) # Returns non-identical entries between vectors.
union(month.name[1:4],month.name[3:7]) # Joins two vectors without duplicating identical entries.
x <- c(month.name[1:4], month.name[3:7]); x[duplicated(x)] # Prints duplicated values or entries.
- Other Vector Types
- see page 10 of R Introduction
- Factors
- Factors are vector objects that contain grouping (clasification) information of its components.
animalf <- factor(animal <- c("dog", "cat", "mouse", "dog", "dog", "cat")) # Creates factor 'animalf' from vector 'animal'.
animalf # Prints out factor 'animalf', this lists first all components and then the different levels (unique entries); alternatively one can print only levels with 'levels(animalf)'.
animalfr <- table(animalf); animalfr # Creates frequency table for levels.
- Function 'tapply' applies calculation on all members (replicates) of a level.
-
weight <- c(102, 50, 5, 101, 103, 52) # Creates new vector with weight values for 'animalf' (both need to have same length).
mean <- tapply(weight, animalf, mean) # Applies function (length, mean, median, sum, sterr, etc) to all level values; 'length' provides the number of entries (replicates) in each level.
- Function 'cut' divides a numeric vector into size intervals.
-
y <- 1:200; interval <- cut(y, right=F, breaks=c(1, 2, 6, 11, 21, 51, 101, length(y)+1), labels=c("1","2-5","6-10", "11-20", "21-50", "51-100", ">=101")); table(interval) # Prints the counts for the specified size intervals (beaks) in the numeric vector: 1:200.
plot(interval, ylim=c(0,110), xlab="Intervals", ylab="Count", col="green"); text(labels=as.character(table(interval)), x=seq(0.7, 8, by=1.2), y=as.vector(table(interval))+2) # Plots the size interval counts as bar diagram.
- Matrices and Arrays
- Arrays and matrices are subscripted and mutidimensional collections of data entries of the same type. Matrices have two dimensions and arrays can have one, two or more dimensions.
x <- matrix(1:30, 3, 10, byrow = T) # Lays out vector (1:30) in 3 by 10 matrix. The argument 'byrow' defines whether the matrix is filled by row or columns.
dim(x) <- c(3,5,2) # transforms above matrix into multidimensional array.
x <- array(1:25, dim=c(5,5)) # creates 5 by 5 array ('x') and fills it with values 1-25.
y <- c(x) # writes array into vector.
x[c(1:5),3] # writes values from 3rd column into vector structure.
mean(x[c(1:5),3]) # calculates mean of 3rd column.
- Subsetting matrices and arrays
-
array[rows,columns] # Syntax to access columns and rows of matrices and two-dimensional arrays.
as.matrix(iris) # Many functions in R require matrices as input. If something doesn't work then try to convert the object into a matrix with the as.matrix() function.
i <- array(c(1:5,5:1),dim=c(3,2)) # Creates 5 by 2 index array ('i') and fills it with the values 1-5, 5-1.
x[i] # Extracts the corresponding elements from 'x' that are indexed by 'i'.
x[i] <- 0 # replaces those elements by zeros.
array1 <- array(scan(file="my_array_file", sep="\t"), c(4,3)) # Reads data from 'my_array_file' and writes it into 4 by 3 array.
- Subsetting arrays with more than two dimensions
-
array[rows,columns,dimensions] # Syntax to access columns, rows and dimensions in arrays with more than two dimensions.
x <- array(1:250, dim=c(10,5,5)); x[2:5,3,] # Example to generate 10x5x5 array and how to retrieve slices from all sub-arrays..
- Merging arrays: example for building plate location table:
-
Z <- array(1:12, dim=c(12,8)); X <- array(c("A","B","C","D","E","F","G","H"), dim=c(8,12)); Y <- paste(t(X), Z, sep=""); Y # Creates array with 12 Ax-Hx columns.
M <- array(Y, c(96,1)) # Writes 12 Ax-Hx columns into one.
- Script for mapping 24/48/96 to 384 well plates
-
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/384_96_48_24conversion.txt") # Prints plate maps.
- Appending arrays
-
cbind(array1,array2) # Appends columns of arrays with same number of rows.
rbind(array1,array2) # Appends rows of arrays with same number of columns.
c(array1, array2) # Clears dimension attributes from arrays and concatenates them in vector format. The function 'as.vector(my_array1)' achieves similar result.
- Calculations between arrays
-
Z <- array(1:12, dim=c(12,8)); X <- array(12:1, dim=c(12,8)) # Creates arrays Z and X with same dimension.
calarray <- Z/X # Divides Z/X elements and assigns result to array 'calarray'.
t(my_array) # Transposes 'my_array'; a more flexible transpose function is 'aperm(my_array, perm)'.
- Information about arrays
-
nrow(X); ncol(X) # returns number of rows and columns of array 'X'.
- Lists
- Lists are ordered collections of objects that can be of differnt modes (e.g. numeric vector, array, etc.). A name can be assigned to each list component. These names do not need to be unique.
-
my_list <- list(name="Fred", wife="Mary", no.children=3, child.ages=c(4,7,9)) # Example of how to create a list.
attributes(my_list); names(my_list) # Prints attributes and names of all list components.
my_list[2]; my_list[[2]] # Prints component 2 of list; [[..]] is operator to call single element, while '[..]' is general subscripting operator.
my_list$wife # Named components in lists can also be called with their name. Command returns same component as 'my_list[[2]]'.
my_list[[4]][2] # Returns from the fourth list component the second entry.
length(my_list[[4]]) # Prints the length of the fourth list component.
my_list$wife <- 1:12 # Replaces the content of the existing component with new content.
my_list$wife <- NULL # Deletes list component 'wife'.
my_list <- c(my_list, list(my_title2=month.name[1:12])) # Appends new list component to existing list.
my_list <- c(my_name1=my_list1, my_name2=my_list2, my_name3=my_list3) # Concatenates lists.
my_list <- c(my_title1=my_list[[1]], list(my_title2=month.name[1:12])) # Concatenates existing and new components into a new list.
data.frame(unlist(my_list)); matrix(unlist(my_list)); data.frame(my_list) # Possible commands to convert lists into data frames or matrices.
my_frame <- data.frame(y1=rnorm(12),y2=rnorm(12),y3=rnorm(12),y4=rnorm(12)); my_list <- apply(my_frame, 1, list); my_list <- lapply(my_list, unlist); my_list # Converts data frame rows into vector list components.
z <- list(n1=letters[1:10], n2=letters[10:1], n3=letters[15:6]); y <- list(n1=1:10, n2=10:1, n3=12:3); result <- lapply(names(z), function(x) replace(z[[x]], which(y[[x]]<5), "N")); names(result) <- names(z); result # Operations on several list objects can be accomplished by looping over the component names of several lists that have identical structures and component names. In the given example, the vectors in list 'z' are modified by vectors of the same structure contained in list 'y'. In molecular biology, this can be used to flag/replace residues in sequences that have too low Phred scores.
mylist <- list(a=letters[1:10], b=letters[10:1], c=letters[1:3]); lapply(names(mylist), function(x) c(x, mylist[[x]])) # Syntax to access list component names with lapply. In this example the list component names are prepended to the corresponding vectors.
- Data Frames
- Data frames can be regarded as matrices with titled columns possibly of differnt modes. Their structure is most similar to tables in biosciences.
-
- Constructing data frames
my_frame <- data.frame(y1=rnorm(12),y2=rnorm(12),y3=rnorm(12),y4=rnorm(12)) # Creates data frame with vectors 1-12 and 12-1.
rownames(my_frame) <- month.name[1:12] # Assigns row (index) names. These indices need to be unique.
names(my_frame) <- c("y4", "y3", "y2", "y1") # Assigns new column titles.
names(my_frame)[c(1,2)] <- c("y3", "y4") # Changes titles of specific columns.
my_frame <- data.frame(IND=row.names(my_frame), my_frame) # Generates new column with title "IND" containing the row names.
my_frame[,2:5]; my_frame[,-1] # Different possibilities to remove column(s) from a data frame.
- Accessing and slicing data frame sections
my_frame[rows, columns] # Generic syntax to access columns and rows in data frames.
dim(my_frame) # Gives dimensions of data frame.
length(my_frame); length(my_frame$y1) # Provides number of columns or rows of data frame, respectively.
colnames(my_frame); rownames(my_frame) # Gives column and row names of data frame.
row.names(my_frame) # Prints row names or indexing column of data frame.
my_frame[order(my_frame$y2, decreasing=TRUE)), ] # Sorts the rows of a data frame by the specified columns, here 'y2'; for increasing order use 'decreasing=FALSE'. In addition to the 'order()' function, there are: 'sort(x)' for vectors, 'rev(x)' for vector in decreasing order and 'sort.list(x)' for vector sequences.
my_frame[order(my_frame[,4], -my_frame[,3]),] # Subsequent sub-sorts can be performed by changing sign of the argument.
my_frame$y1 # Notation to print entire column of a data frame as vector or factor.
my_frame$y1[2:4] # Notation to access column element(s) of a data frame.
v <-my_frame[,4]; v[3] # Notation for returning the value of an individual cell. In this example the corresponding column is first assigned to a vector and then the desired field is accessed by its index number.
my_frame[1:5,] # Notation to view only the first five rows of all columns.
my_frame[,1:2] # Notation to view all rows of the first two columns.
my_frame[,c(1,3)] # Notation to view all rows of the specified columns.
my_frame[1:5,1:2] # Notation to view only the first five rows of the columns 1-2.
my_frame["August",] # Notation to retrieve row values by their index name (here "August").
x <- data.frame(row.names=LETTERS[1:10], letter=letters[1:10],Month=month.name[1:10]); x; match(c("c","g"), x[,1]) # Returns matching index numbers of data frame or vector using 'match' function. This syntax returns for duplicates only the index of their first occurence. To return all, use the following syntax.
x[x[,2] %in% month.name[3:7],] # Subsets a data frame with a query vector using the '%in%' function. This returns all occurences of duplicates.
as.vector(as.matrix(my_frame[1,])) # Prints the row entry of a data frame in vector format.
as.data.frame(my_list) # Prints a list object as data frame.
- Calculations on data frames
summary(my_frame) # Prints a handy summary of the data.
mean(my_frame) # Syntax to calculate the mean for all columns.
mean(my_frame$y1) # Syntax to calculate the mean of one column.
data.frame(my_frame, mean=apply(my_frame[,2:5], 1, mean), ratio=(my_frame[,2]/my_frame[,3])) # Syntax to calculate the mean and ratios on each row and append the result to the same data frame. The arguments in the 'apply' function allow to restrict a calculation to specific rows or columns.
cor(my_frame[,2:4]); cor(t(my_frame[,2:4])) # Syntax to calculate a correlation matrix based on all-against-all rows and all-against-all columns.
x <- data.frame(y1=rnorm(10),y2=rnorm(10),y3=rnorm(10),y4=rnorm(10)); rownames(x) <- month.name[1:10]; print("Initial table x"); x; y <- data.frame(apply(as.matrix(x), 1, cor, as.vector(as.matrix(x["August",])))); print("My correlation coefficient column"); y; names(y)[1] <- c("correl"); print("My correlation result"); y <- cbind(x,y); y[order(-y[,5]), ] # Commands to perform a correlation search and ranking across all rows in a data frame. First, an example data frame 'x' is created. Second, the correlation of the values in the first row (vector format required!) are calculated across all table rows (matrix format required!). Finally, the resulting vector with the correlation values is merged with the original data frame 'x' and sorted decreasingly by the correlation column.
my_frame2 <- data.frame(my_frame$y2, my_frame$y3*10) # This syntax allows one to join columns in any order and to perform operations on them in same command.
merge(frame1, frame2, by.x = "frame1col_name", by.y = "frame2col_name", all = TRUE) # Merges two data frames (tables) by common field entries. To obtain only the common rows, change 'all = TRUE' to 'all = FALSE'. To merge on row names (indices), refer to it with "row.names" or '0', e.g.: 'by.x = "row.names", by.y = "row.names"'.
my_frame1 <- data.frame(title1=month.name[1:8], title2=1:8); my_frame2 <- data.frame(title1=month.name[4:12], title2=4:12); merge(my_frame1, my_frame2, by.x = "title1", by.y = "title1", all = TRUE) # Example of merging two frames by common field.
- Regular expressions
-
sub('At(\\d)g\\d\\d\\d\\d\\d', '\\1', as.character(my_frame[,1]), perl = TRUE) # Example of how to use Perl regular expressions to substitute a pattern by another one using back references. Remember: single escapes '\' need to be double escaped '\\' in R.
- Conditional selections
-
my_frame[!duplicated(my_frame[,2]),] # Removes rows with duplicated values in selected column.
my_frame[my_frame$y2 > my_frame$y3,] # Prints all rows of data frame where values of col1 > col2. Comparison operators are: == (equal), != (not equal), >= (greater than or equal), etc. Logical operators are & (and), | (or) and ! (not).
x <- 0.5:10; x[x<1.0] <- -1/x[x<1.0] # Replaces all values in vector or data frame that are below 1 with their reciprocal value.
x <-data.frame(month=month.abb[1:12], AB=LETTERS[1:2], no1=1:48, no2=1:24); x[x$month == "Apr" & (x$no1 == x$no2 | x$no1 > x$no2),] # Prints all records of frame 'x' that contain 'Apr' AND have equal values in columns 'no1' and 'no2' OR have greater values in column 'no1'.
x[x[,1] %in% c("Jun", "Aug"),] # Retrieves rows with column matches specified in a query vector. The '%in%' is short form of match function.
x[c(grep("\\d{2}", as.character(x$no1), perl = TRUE)),] # Possibility to print out all rows of a data frame where a regular expression matches (here all double digit values in col 'no1').
x[c(grep("\\d{2}", as.character(for(i in 1:4){x[,i]}), perl = TRUE)),] # Same as above, but searches all columns (1-4) using a for loop (see below).
z <- data.frame(chip1=letters[1:25], chip2=letters[25:1], chip3=letters[1:25]); z; y <- apply(z, 1, FUN <- function(x) sum(x == "m") > 2); z[y,] # Identifies in a data frame ('z') all those rows that contain a certain number of identical fields (here 'm' > 2).
z <- data.frame(chip1=1:25, chip2=25:1, chip3=1:25); c <- data.frame(z, count=apply(z[,1:3], 1, FUN <- function(x) sum(x >= 5))); c # Counts in each row of a data frame the number of fields that are above or below a specified value and appends this information to the data frame. By default rows with "NA" values will be ignored. To work around this limitation, one can replace the NA fields with a value that doesn't affect the result, e.g.: x[is.na(x)] <- 1.
x <- data.frame(matrix(rep(c("P","A","M"),20),10,5)); x; index <- x == "P"; cbind(x, Pcount=rowSums(index)); x[rowSums(index)>=2,] # Example of how one can count occurances of strings across rows. In this example the occurances of "P" in a data frame of "PMA" values are counted by converting to a data frame of logical values and then counting the 'TRUE' occurences with the 'rowSums' function, which does in this example the same as: 'cbind(x, count=apply(x=="P",1,sum))'.
- Simple loops to apply function to defined data sections
-
sapply(my_frame, mean, na.rm=T) # Retrieves mean of each numeric vector (column of data frame). Function 'lapply' returns same result in list structure. Argument 'na.rm' removes missing 'NA' values.
x <- data.frame(col1=1:12, col2=1:6, col3=1:12, col4=1:6); apply(x, 1, min) # 'apply' allows to apply a function to all fields of rows (1) or columns (2) of data.frame or matrix.
x <- data.frame(col1=month.abb[1:4], col2=1:8); tapply(x$col2, x$col1, mean) # allows to apply calculation (here mean) on 'col2' based on subgrouping/levels/categories/replicates in 'col1'
for(i in 1:2){print(mean(my_frame[,i]))} # Calculate mean for columns 1-2.
if (1==0) { print(1); } else { print(2); } # Syntax of if-else statement.
x <- 1:10; for(i in x) if (i < 5) print("<5") else print(">5") # If-else applied to vector.
l <- c(); for(i in x) if (i< 5) l <- c(l,i) else l <- c(l,"NA"); as.numeric(l) # Same as above but writes values to vector.
files <- list.files(pattern=".txt$"); for(i in files) { x <- read.table(i, header=TRUE, row.names=1, comment.char = "A", sep="\t"); assign(print(i, quote=FALSE), x); write.table(x, paste(i, c(".out"), sep=""), quote=FALSE, sep="\t", col.names = NA) } # Batch import and export of many files. First, the *.txt file names in the current directory are assigned to list ($ sign is used to anchor string '*.txt' to end of names). Second, the files are imported one-by-one using a for loop where the original names are assigned to the generated data frames with the 'assign' function. Read ?read.table to understand arguments 'row.names=1' and 'comment.char = "A"'. Third, the data frames are exported using their names for file naming and appending '*.out'.
files <- list.files(pattern=".txt$"); apply(as.matrix(files),1, function(i) { x <- read.table(i, header=TRUE, row.names=1, comment.char = "A", sep="\t"); assign(print(i, quote=FALSE), x); write.table(x, paste(i, c(".out"), sep=""), quote=FALSE, sep="\t", col.names = NA) }) # Same as above, but with apply loop structure.
x <- data.frame(v1=seq(1,2.8, by=0.2),v2=1:10/10,v3=-1:8,v4=1:10/10); l <- c(); for(i in as.vector(as.matrix(x))) { if (i<1.5) l <- c(l,i/2) else l <- c(l,i*2) }; l <- data.frame(array(l, dim=c(length(x[,1]), length(x)))); names(l) <- names(x); l # Possibility to apply if else statement on all fields of data frame. In this example the data frame is temporarily transformed into a vector where the if else operation is applyed in a for loop. Afterwards the vector is transformed back into a data frame.
- Some Great R Functions
- This sections contains a small collection of extremely useful R functions. These functions alone are reason enough to learn R.
- The unique() function makes vector entries unique:
unique(iris$Sepal.Length); length(unique(iris$Sepal.Length)) # The first step returns the unique entries for the Sepal.Length column in the iris data set and the second step counts the number of its unique entries.
- The table() function counts the occurrence of entries in a vector. It is the most basic "cluster function":
my_counts <- table(iris$Sepal.Length, exclude=NULL)[iris$Sepal.Length]; cbind(iris, CLSZ=my_counts)[1:4,] # Counts identical entries in the Sepal.Length column of the iris data set and aligns the counts with the original data frame. Note: to count NAs, set the 'exclude' argument to NULL.
myvec <- c("a", "a", "b", "c", NA, NA); table(factor(myvec, levels=c(unique(myvec), "z"), exclude=NULL)) # Additional count levels can be specified by turning the test vector into a factor and specifying them with the 'levels' argument.
- The %in% function returns the intersect between two vectors. In a subsetting context with '[ ]', it can be used to intersect matrices, data frames and lists:
iris[iris$Sepal.Length %in% iris$Sepal.Width, ] # Returns the row in the iris data frame where Sepal.Length and Sepal.Width contain identical entries.
- The merge() function joins data frames based on a common key column:
frame1 <- iris[sample(1:length(iris[,1]), 30), ]; my_result <- merge(frame1, iris, by.x = 0, by.y = 0, all = TRUE); dim(my_result) # Merges two data frames (tables) by common field entries, here row names (by.x=0). To obtain only the common rows, change 'all = TRUE' to 'all = FALSE'. To merge on specific columns, refer to them by their position numbers or their column names (e.g. "Species").
- Graphical Procedures
- R provides comprehensive graphics utilities for visualizing scientific data. The following section provides a short outline of the basic graphics functions in R. To use these functions efficiently, it is extremely important to read the help documentation on the basic plotting functions: '?plot' and '?par'. For more advanced graphics procedures there are additional libraries available (e.g.: grid and lattice). A very useful overview of graphics utilities in R is available on the Graphics Task Page from CRAN. The R Graph Gallery and the R Graphical Manual contain many illustrated R graphics code examples that can be downloaded and modified. Another useful reference for graphics procedures is Paul Murrell's book R Graphics. Interactive graphics in can be generated with rggobi (GGobi) and iplots.
- Overview of Plotting Types
-
- The following list provides an overview of some very useful plotting functions in R. To get familiar with them, one should demo their interactice sample code with 'example(myfct)' and read their documentation by typing '?myfct'.
-
- plot: generic x-y plotting
- barplot: bar plots
- boxplot: box-and-whisker plot
- hist: histograms
- pie: pie charts
- dotchart: cleveland dot plots
- image, heatmap, contour, persp: functions to generate image-like plots
- qqnorm, qqline, qqplot: distribution comparison plots
- pairs, coplot: display of multivariant data
- Scatter Plots
-
y <- as.data.frame(matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))) # Generates a sample data set.
plot(y[,1], y[,2]) # Plots two vectors (columns) in form of a scatter plot against each other.
plot(y[,1], y[,2], type="n", main="Plot of Labels"); text(y[,1], y[,2], rownames(y)) # Prints instead of symbols the row names.
plot(y[,1], y[,2], pch=20, col="red", main="Plot of Symbols and Labels"); text(y[,1]+0.03, y[,2], rownames(y)) # Plots both, symbols plus their labels.
op <- par(mar=c(8,8,8,8), bg="lightblue"); plot(y[,1], y[,2], type="p", col="red", cex.lab=1.2, cex.axis=1.2, cex.main=1.2, cex.sub=1, lwd=4, pch=20, xlab="x label", ylab="y label", main="My Main", sub="My Sub"); par(op) # This command demonstrates the usage of the most important plotting parameters. The 'mar' argument specifies the margin sizes around the plotting area in this order: c(bottom, left, top, right). The color of the plotted symbols can be controled with the 'col' argument. The plotting symbols can be selected with the 'pch' argument, while their size is controlled by the 'lwd' argument. A selection palette for 'pch' plotting symbols can be opened with the command 'example(points)'. As alternative, one can plot any character string by passing it on to 'pch', e.g.: pch=".". The font sizes of the different text components can be changed with the 'cex.*' arguments. Please consult the '?par' documentation for more details on fine tuning R's plotting behavior.
plot(y[,1], y[,2]); myline <- lm(y[,2]~y[,1], data=y[,1:2]); abline(myline, lwd=2) # Adds a regression line to the plot.
summary(myline) # Prints summary about regression line.
plot(y[,1], y[,2], log="xy") # Generates the same plot as above, but on a log scale.
plot(y[,1], y[,2]); text(y[1,1], y[1,2], expression(sum(frac(1,sqrt(x^2*pi)))), cex=1.3) # Adds a mathematical formula to the plot.
plot(y) # Produces all possible scatter plots for all-against-all columns in a matrix or a data frame. The column headers of the matrix or data frame are used as axis titles.
pairs(y) # Alternative way to produce all possible scatter plots for all-against-all columns in a matrix or a data frame.
library(scatterplot3d); scatterplot3d(y[,1:3], pch=20, color="red") # Plots a 3D scatter plot for first three columns in y.
library(geneplotter); smoothScatter(y[,1], y[,2]) # Same as above, but generates a smooth scatter plot that shows the density of the data points.
- Line Plots
-
y <- as.data.frame(matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))) # Generates a sample data set.
plot(y[,1], type="l", lwd=2, col="blue") # Plots a line graph instead.
split.screen(c(1,1)); plot(y[,1], ylim=c(0,1), xlab="Measurement", ylab="Intensity", type="l", lwd=2, col=1); for(i in 2:length(y[1,])) { screen(1, new=FALSE); plot(y[,i], ylim=c(0,1), type="l", lwd=2, col=i, xaxt="n", yaxt="n", ylab="", xlab="", main="", bty="n") }; close.screen(all=TRUE) # Plots a line graph for all columns in data frame 'y'. The 'split.screen()' function is used in this example in a for loop to overlay several line graphs in the same plot. More details on this topic are provided in the 'Arranging Plots' section.
A very nice line plot function for time series data is available in the Mfuzz library.
- Bar Plots
-
y <- as.data.frame(matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))) # Generates a sample data set.
barplot(as.matrix(y[1:4,]), ylim=c(0,max(y[1:4,])+0.1), beside=T) # Generates a bar diagram for the first four rows in y. The barplot() function expects as input format a matrix, and it uses by default the column titles for labeling the bars.
text(labels=round(as.vector(as.matrix(y[1:4,])),2), x=seq(1.5, 13, by=1)+sort(rep(c(0,1,2), 4)), y=as.vector(as.matrix(y[1:4,]))+0.02) # Adds corresponding values on top of each bar.
ysub <- as.matrix(y[1:4,]); myN <- length(ysub[,1]); mycol1 <- gray(1:(myN+1)/(myN+1))[-(myN+1)]; mycol2 <- sample(colors(),myN); barplot(ysub, beside=T, ylim=c(0,max(ysub)*1.2), col=mycol2, main="Bar Plot", sub="data: ysub"); legend("topright", legend=row.names(ysub), cex=1.3, bty="n", pch=15, pt.cex=1.8, col=mycol2, ncol=myN) # Generates a bar plot with a legend for the first four rows of the input matrix 'y'. To plot the diagram in black and white, set the 'col' argument to 'col=col1". The argument 'ncol' controls the number of columns that are used for printing the legend.
par(mar=c(10.1, 4.1, 4.1, 2.1)); par(xpd=TRUE); barplot(ysub, beside=T, ylim=c(0,max(ysub)*1.2), col=mycol2, main="Bar Plot"); legend(x=4.5, y=-0.3, legend=row.names(ysub), cex=1.3, bty="n", pch=15, pt.cex=1.8, col=mycol2, ncol=myN) # Same as above, but places the legend below the bar plot. The arguments 'x' and 'y' control the placement of the legend and the 'mar' argument specifies the margin sizes around the plotting area in this order: c(bottom, left, top, right).
bar <- barplot(x <- abs(rnorm(10,2,1)), names.arg = letters[1:10], col="red", ylim=c(0,5)); stdev <- x/5; arrows(bar, x, bar, x + stdev, length=0.15, angle = 90); arrows(bar, x, bar, x + -(stdev), length=0.15, angle = 90) # Creates bar plot with standard deviation bars. The example in the 'barplot' documentation provides a very useful outline of this function.
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/mortgage.R") # Imports a function that plots a loan amortization table as bar plot.
- Pie Charts
-
y <- table(rep(c("cat", "mouse", "dog", "bird", "fly"), c(1,3,3,4,2))) # Creates a sample data set.
pie(y, col=rainbow(length(y), start=0.1, end=0.8), main="Pie Chart", clockwise=T) # Plots a simple pie chart.
pie(y, col=rainbow(length(y), start=0.1, end=0.8), labels=NA, main="Pie Chart", clockwise=T); legend("topright", legend=row.names(y), cex=1.3, bty="n", pch=15, pt.cex=1.8, col=rainbow(length(y), start=0.1, end=0.8), ncol=1) # Same pie chart as above, but with legend.
- Venn Diagrams: 2-way, 3-way or pseudo 4-way (online alternatives: Chris Seidel or Venny)
-
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/vennDia.R") # Imports the required venn diagram function vennDia.R that generates 2-, 3- and pseudo 4-sample venn diagrams. It uses R's symbols function for drawing the circles. Geometrically correct 4-way venn diagrams can only be generated with ellipses or rectangles. A nice example for this is available on Matt Settles' site (pdf, code).
x <- sample(letters, 18); y <- sample(letters, 16); z <- sample(letters, 20); w <- sample(letters, 22); x11(height=9, width=9); par(mfrow=c(2,2)) # Creates a sample data set and defines the plotting layout. To copy & paste columns from Excel into R, one can use 'read.delim("clipboard")' or 'scan(w="c")'.
qlist <- venndiagram(x=x, y=y, unique=T, title="2-Way Venn Diagram", labels=c("Sample1", "Sample2"), plot=T, lines=c(2,3), lcol=c(2,3), tcol=c(2,1,1), lwd=3, cex=1.3, type="2") # Plots a non-proportional 2-way venn diagram for the provided vectors x and y. In additon, the function returns the result keys for each individual overlap query in form of a list object, which is named here 'qlist'. With the setting 'plot=F' one can turn off the plotting behavior to obtain only this overlap list object. This is often useful for large-scale analyses. If the argument 'type' is set to "2map", then the corresponding mapping diagram will be printed instead. When the 'unique argument is set to TRUE, duplicated entries and empty fields (NAs) will be removed within each provided vector.
qlist <- venndiagram(x=x, y=y, z=z, unique=T, title="3-Way Venn Diagram", labels=c("Sample1", "Sample2", "Sample3"), plot=T, lines=c(2,3,4), lcol=c(2,3,4), tcol=c(2,1,1,1,1,1,1), lwd=3, cex=1.3, type="3") # Plots a non-proportional 3-way venn diagram for the provided vectors x, y and z. To map the different query types to the diagram, set the argument 'type' to "3map".
qlist <- venndiagram(x=x, y=y, z=z, w=w, unique=T, title="4-Way Venn Diagram", labels=c("Sample1", "Sample2", "Sample3", "Sample4"), plot=T, lines=c(2,3,4,6), lcol=c(2,3,4,6), tcol=1, lwd=3, cex=1, type="4") # Plots a pseudo 4-way venn diagram for the provided vectors x, y, z and w. Note that the diagonal overlap counts, "only in Sample1 & Sample4" and "only in Sample2 & Sample3", are provided below the plot. In this representation, the total number of entries in a given sample is equal to the sum of the values in a circle plus its corresponding diagonal count below the plot.
venndiagram(x=x, y=y, z=z, w=w, unique=T, labels=c("Sample1", "Sample2", "Sample3", "Sample4"), lines=c(2,3,4,6), lcol=c(2,3,4,6), tcol=1, lwd=3, cex=1, type="4map") # Plots the 4-way mapping venn diagram that allows mapping of the different query types to the previous diagram.
olReport(qlist=qlist, missing=".", type=1); count <- olReport(qlist=qlist, type=2); count # The function 'olReport()' returns the overlap results (here stored in qlist) in different formats: 'type=1' returns a data frame containing the overlap keys and 'type=2' the overlap counts.
x11(); barplot(as.matrix(t(count)), beside=T, col=rainbow(length(count[,1]))) # Generates a bar plot for the overlap counts.
- Histograms
-
x <- rnorm(100); hist(x, freq=F); curve(dnorm(x), add=T) # Plots a random distribution as histogram and overlays curve ('freq=F': ensures density histogram; 'add=T': enables 'overplotting').
plot(x<-1:50, dbinom(x,size=50,prob=.33), type="h") # Creates pin diagram for binomial distribution with n=50 and p=30.
- Density Plots
-
plot(density(rnorm(10)), xlim=c(-2,2), ylim=c(0,1), col="red") # Plots a random distribution in form of a density plot.
split.screen(c(1,1)); plot(density(rnorm(10)), xlim=c(-2,2), ylim=c(0,1), col="red"); screen(1, new=FALSE); plot(density(rnorm(10)), xlim=c(-2,2), ylim=c(0,1), col="green", xaxt="n", yaxt="n", ylab="", xlab="", main="",bty="n"); close.screen(all = TRUE) # Several density plots can be overlayed with the 'split.screen' function. In ths example two density plots are plotted at the same scale ('xlim/ylim') and overlayed on the same graphics device using 'screen(1,new=FALSE))'.
- Box Plots
-
y <- as.data.frame(matrix(runif(300), ncol=10, dimnames=list(1:30, LETTERS[1:10]))) # Generates a sample data set.
boxplot(y, col=1:length(y)) # Creates a box plot. A box plot (also known as a box-and-whisker diagram) is a graphical representation of a five-number summary, which consists of the smallest observation, lower quartile, median, upper quartile and largest observation.
- Feature Maps for DNA/Protein Sequences
-
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/featureMap.txt") # The featureMap.R script plots feature maps of biological sequences based on provided mapping coordinates.
plot(x <- rnorm(40,2e+07,sd=1e+07), y <- rep(1,times=40), type="h", col="blue", xaxt="n", yaxt="n", bty="n"); abline(h=0.78, col="green", lwd=12); lines(a <- rnorm(5,2e+07,sd=1e+07), b <- rep(1,times=5), type="h", col="red", lwd=2) # Alternative approach for plotting simple chromosome maps.
- Miscellaeneous Plotting Utilities
-
plot(1:10); lines(1:10, lty=1) # Superimposes lines onto existing plots. The usage of plotted values will connect the data points. Different line types can be specified with argument "lty = 2". More on this can be found in the documentation for 'par'.
plot(x <- 1:10, y <- 1:10); abline(-1,1, col="green"); abline(1,1, col="red"); abline(v=5, col="blue"); abline(h=5, col="brown") # Function 'abline' adds lines in different colors to x-y-plot.
- Arranging Plots
-
x11(height=6, width=12, pointsize=12) # Defines the size of the plotting device.
par(mfrow=c(2,3)); for(i in 1:6) { plot(1:10) } # With the argument 'par(mfrow=c(nrow,ncol))' one can define how several plots are arranged next to each other on the same plotting device.
nf <- layout(matrix(c(1,2,3,3), 2, 2, byrow=TRUE), c(3,7), c(5,5), respect=TRUE); layout.show(nf); for(i in 1:3) { barplot(1:10) } # The 'layout()' function allows to devide the plotting device into variable numbers of rows and columns with the column-widths and the row-heights specified in the respective arguments.
split.screen(c(1,1)); plot(density(rnorm(10)), xlim=c(-2,2), ylim=c(0,1), col="red"); screen(1, new=FALSE); plot(density(rnorm(10)), xlim=c(-2,2), ylim=c(0,1), col="green", xaxt="n", yaxt="n", ylab="", xlab="", main="",bty="n"); close.screen(all = TRUE) # Possibility to overlay independent plots with 'split.screen' function. In ths example two density plots are plotted at the same scale ('xlim/ylim') and overlayed on the same graphics device using 'screen(1,new=FALSE))'. Split.screen is also very useful for generating multiple plots next to each other on a single device by setting the layout setting to several screens, e.g.: split.screen(c(2,2))
- Customize X-Y Axes
-
op <- par(mar=c(6,6,6,8)); barplot(1:10, names.arg=letters[1:10], cex.names=1.5, col=3, yaxt = "n") # Creates bar plot without y-axis and wider margins around plot.
axis(2, las = 1, at=seq(0,10,0.5), lwd=2, cex.axis=1.5); axis(3, las = 1, lwd=2, cex.axis=1.5); axis(4, las = 1, at=seq(0,10,2), labels=month.name[1:6], lwd=2, cex.axis=1.5); par(op) # Adds three custom axes.
curve(sin(x), -pi, pi, axes=F, ylab="", xlab=""); axis(1, pos=0); axis(2, pos=0) # Plots a function in Cartesian-style coordinate system.
- Color Selection Utilities
-
y <- as.data.frame(matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))) # Generates a sample data set.
cat("palette():\n"); palette(); palette(rainbow(20, start=0.1, end=0.2)); cat("palette(rainbow(20, start=0.1, end=0.2)):\n"); palette(); palette("default") # Prints and modifies the color palette which is used when the argument 'col=' has a numeric index. The last step sets the palette back to its default setting.
par(mfrow=c(1,2)); barplot(as.matrix(y[1:8,]), beside=T, col=1:8); palette(rainbow(8, start=0.1, end=0.8)); barplot(as.matrix(y[1:8,]), beside=T, col=1:8); palette("default") # Example to call the default color palette with a numeric vector in the 'col=' argument'. In the second plot a modified palette is called the same way.
example(rainbow) # The function rainbow() allows to select various predefined color schemes.
barplot(as.matrix(y[1:10,]), beside=T, col=rainbow(10, start=0.1, end=0.2)) # Example to select 10 colors with the rainbow() function. The start and end values need to be between 0 and 1. The wider their distance the more diverse are the resulting colors.
barplot(as.matrix(y[1:10,]), beside=T, col=gray(c(1:10)/10)) # The gray() function allows to select any type of gray shades by providing values from 0 to 1.
colors()[1:10]; col2rgb(colors()[1:10]) # The colors() function allows to select colors by their name. The col2rgb() can translates them into the RGB color code.
library(RColorBrewer); example(brewer.pal) # Shows how to select color schemes with the RColorBrewer library.
- Adding Text
-
y <- as.data.frame(matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))) # Generates a sample data set.
plot(y[,1], y[,2]); text(0.2,0.2,"My Text") # Adds text at defined coordinates of plot.
plot(y[,1], y[,2]); mtext("My Text", 4) # Adds text to one of the four margin areas (side: 1-4) of the plot.
plot(y[,1], y[,2], pch=20, col="red", main="Plot of Symbols and Labels"); text(y[,1]+0.03, y[,2], rownames(y)) # Plots a scatter plot with the corresponding labels (here row names) next to the data points.
- Interactive Functions
-
locator() # Allows user to click on existing graph with left mouse button. System returns the corresponding x-y-coordinates after clicking on right mouse button.
text(locator(1), "My_Outlier", adj=0) # Adds text "My_Outlier" to graph at position where left mouse button is clicked.
- Saving Graphics to Files
-
jpeg("test.jpeg"); plot(1:10, 1:10); dev.off() # After the 'jpeg("test.jpeg")' command all graphs are redirected to the file "test.jpeg" in JPEG format. To export images with the highest quality, the default setting "quality = 75" needs to be changed to 100%. The actual image data are not written to the file until the 'dev.off()' command is executed!
pdf("test.pdf"); plot(1:10, 1:10); dev.off() # Same as above, but for pdf format. The pdf and svg formats provide often the best image quality, since they scale to any size without pixelation.
png("test.png"); plot(1:10, 1:10); dev.off() # Same as above, but for png format.
postscript("test.ps"); plot(1:10, 1:10); dev.off() # Same as above, but for PostScript format.
library("RSvgDevice"); devSVG("test.svg"); plot(1:10, 1:10); dev.off() # Generates Scalable Vector Graphics (SVG) files that can be edited in vector graphics programs, such as InkScape.
- Importing Graphics into R
-
library(pixmap); x <- read.pnm(file="my_image.pnm"); plot(x); print(x) # Images need to be in pnm format. This format can be obtained in ImageMagick like this: 'convert my_image.jpg my_image.pnm'.
- Missing Values
- Missing values are represented in R data objects by the missing value place holder 'NA'. The default behavior for many R functions on data objects with missing values is 'na.fail' which returns the value 'NA'. Several 'na.action' options are available to change this behavior.
-
x <- 1:10; x <- x[1:12]; z <- data.frame(x,y=12:1) # creates sample data.frame with 'NA' values
is.na(x) # Finds out which elements of x contain "NA" fields.
na.omit(z) # Removes rows with 'NA' fields in numeric data objects.
x <- letters[1:10]; print(x); x <- x[1:12]; print(x); x[!is.na(x)] # A similar result can be achieved on character data objects with the function 'is.na()'.
apply(z, 1, mean, na.rm=T) # uses available values to perform calculation while ignoring the 'NA' values; the default 'na.rm=F' returns 'NA' for rows with missing values
z[is.na(z),1] <- 0 # replaces 'NA' values with zeros. To replace the special <NA> sign in character columns of data frames, convert the column first to a character vector with 'as.character(my_column)', then perform the substitution on this vector and add it afterwards back to the data frame.
- Writing Your Own Functions
- The R language allows users to create their own function objects. The basic syntax for defining new functions is:
-
name <- function(arg_1, arg_2, ...) expression # Basic syntax for functions.
- A much more detailed introduction into writing functions in R is available in the Programming in R section of this manual.
- R Web Applications
- Various web-based R services and R web development kits are available:
-
- R Web Services
- R Web Development Kits
- R Exercises
- Slide Show
- The following exercises demonstrate several very useful R utilities that may convince beginners to perform their data analyses in R.
- Import from Excel
- Download the following molecular weight and subcelluar targeting tables from the TAIR site, import the files into Excel and save them as tab delimited text files.
- Import the tables into R
my_mw <- read.delim(file="MolecularWeight_tair7.xls", header=T, sep="\t") # Imports molecular weight table.
my_target <- read.delim(file="TargetP_analysis_tair7.xls", header=T, sep="\t") # Imports subcelluar targeting table.
- Online import
my_mw <- read.delim(file="http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/MolecularWeight_tair7.xls", header=T, sep="\t") # Online import of molecular weight table from TAIR.
my_target <- read.delim(file="http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/TargetP_analysis_tair7.xls", header=T, sep="\t") # Online import of subcelluar targeting table from TAIR.
- Check tables and rename gene ID columns
my_mw[1:4,]; my_target[1:4,]; colnames(my_target)[1] <- "ID"; colnames(my_mw)[1] <- "ID" # Prints out the first four rows of the two data frames and renames their gene ID columns.
- Merge the two tables based on common ID field
my_mw_target <- merge(my_mw, my_target, by.x="ID", by.y="ID", all.x=T)
- Shorten one table before the merge and then remove the non-matching rows (NAs) in the merged file
my_mw_target2 <- merge(my_mw, my_target[1:40,], by.x="ID", by.y="ID", all.x=T) # To remove non-matching rows, use the argument setting 'all=F'.
na.omit(my_mw_target2) # Removes rows containing "NAs" (non-matching rows).
- Apply several combinatorial queries (filters) on the data set. For example:
my_mw_target[my_mw_target[,2]>100000 & my_mw_target[,4]=="C",] # Retrieves all records with MW greater than 100000 and targeting to chloroplasts.
dim(my_mw_target[my_mw_target[,2]>100000 & my_mw_target[,4]=="C", ]) # Determines the number of matches for the above query
- Use a regular expression in a substitute function to generate a separate ID column that lacks gene model extensions
my_mw_target3 <- data.frame(loci=gsub("\\..*", "", as.character(my_mw_target[,1]), perl = TRUE), my_mw_target)
my_mw_target3 <- data.frame(loci=gsub("(^.*)\\.\\d{1,}", "\\1", as.character(my_mw_target[,1]), perl = TRUE), my_mw_target) # Same as above but with backreference
- Calculate the number of gene models per loci with the table() function
my_mw_target4 <- cbind(my_mw_target3, Freq=table(my_mw_target3[,1])[my_mw_target3[,1]]) # The table() function counts the number of duplicates for each loci and cbind() appends the information to the data frame.
my_mw_target4[order(-my_mw_target4$Freq),][1:30,] # Sorts by count information.
length(unique(my_mw_target4[,1])) # Removes duplicates and prints the number of unique loci.
sum(!duplicated(my_mw_target4[,1])) # Gives the same result as previous command; remember: the function 'duplicated' returns a logical vector.
length(my_mw_target4[,2])-length(unique(my_mw_target4[,1])) # Calculates the number of duplicated entries.
sum(duplicated(my_mw_target4[,1])) # Generates the same result as the previous command.
- Perform a variety calculations across columns
data.frame(my_mw_target4, avg_AA_WT=(my_mw_target4[,3]/my_mw_target4[,4]))[1:40,] # Calculates the average AA weight per protein.
data.frame(my_mw_target4, mean=apply(my_mw_target4[,6:9], 1, mean))[1:40,] # Calculates the mean across several fields of each row.
data.frame(my_mw_target4, mean=apply(my_mw_target4[,6:9], 1, mean), stdev=apply(my_mw_target4[,6:9], 1, sd, na.rm=T))[1:40,] # Calculates the mean and standard deviation across several fields of each row.
mean(my_mw_target4[,-c(1,2,5)], na.rm=T) # Calculates the mean for all numeric columns.
data.frame(my_mw_target4[1:40,], ttest=apply(my_mw_target4[1:40,6:9], 1, function(x) {t.test(x[1:2], x[3:4])$p.value} )) # Calculates a two-sample t-test for two slices in each row of the data frame and records the corresponding p-values. To calculate a paired t-test, add the argument 'paired=T'.
- Export data frame to Excel
write.table(my_mw_target4, file="my_file.xls", quote=F, sep="\t", col.names = NA) # Writes the data frame 'my_mw_target4' into a tab-delimited text file that can be opened with Excel.
- Plotting samples
boxplot(my_mw_target4[,3], my_mw_target4[,4], names=c("MW","CC"), col=c("blue","red"), main="Box Plot") # Generates box plots for MW and AA counts.
plot(density(my_mw_target4[,3]), col="blue", main="Density Plot") # Plots density distribution plots for MW.
barplot(my_mw_target4[1:5,3], beside = F, col = c("red", "green", "blue", "magenta", "black"), names.arg=as.vector(my_mw_target4[1:5,1]), main="MW Bar Plot", ylim=c(0,230000)) # Bar plot of MW for first five loci.
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/vennDia.R") # Imports the venn diagram function.
x <- sample(my_mw_target4[1:1000,1], 200); y <- sample(my_mw_target4[1:1000,1], 200); z <- sample(my_mw_target4[1:1000,1], 200) # Generates 3 sets of 200 random IDs.
qlist <- venndiagram(x=x, y=y, z=z, unique=T, title="3-Way Venn Diagram", labels=c("Sample1", "Sample2", "Sample3"), plot=T, lines=c(2,3,4), lcol=c(2,3,4), tcol=c(2,1,1,1,1,1,1), lwd=3, cex=1.3, type="3") # Plots a 3-way venn diagram for random ID sets.
olReport(qlist=qlist, missing=".", type=1) # Creates for the 3-way venn diagram the corresponding overlap data frame.
pdf("my_plot.pdf"); venndiagram(x=x, y=y, z=z); dev.off() # Saves the plot image to a PDF file.
- Write all commands into an R script (exerciseRbasics.R) and execute it with the source() function
source("exerciseRbasics.R") # Executes all of the above commands and generates the corresponding output files in the current working directory.
- Programming in R
- This section of the manual is available on the Programming in R site.
- BioConductor
- Introduction
- BioConductor is an open source and open development software project to provide tools for the analysis of SNP and transcriptional profiling data (SAGE, microarrays or Affymetrix chips) and the integration of genomic meta data.
- Finding Help
- The documentation on BioConductor packages can be found on the Vignette Pages. A brief listing of their content is available through the package table. Annotation libraries can be found on the Meta Data page. If you can't find the answer to a quesion, try the BioC Mail GMANE Archive. A BioConductor Book is now also available. The basic R help functions provide additional information:
-
library(affy) # Loads a particular package (here affy package).
library(help=affy) # Lists all functions/objects of a package (here affy package).
library() # Lists all libraries/packages that are available on a system.
openVignette() # Provides documentation on packages.
sessionInfo() # Prints version information about R and all loaded packages. The generated output should be provided when sending questions or bug reports to the R and BioC mailing lists.
?my_topic # Opens documentation on a function.
help.search("topic") # Searches help system for documentation.
search() # Shows loaded packages and attached data frame in current search path.
system.file(package="affy") # Shows location of an installed package.
library("tools"); vigSrc <- list.files(pattern="Rnw$", system.file("doc",package="GOstats"), full.names=TRUE); vigSrc; for (v in vigSrc) Stangle(v) # Extracts R code from a vignette and saves it to file(s) in your current working directory.
- Affy Packages
- BioConductor provides extensive resources for the analysis of Affymetrix data. The most important are introduced here.
- Affy
- The Affy package provides basic methods for analyzing affymetrix oligonucleotide arrays. The following steps outline the usage of the Affy package and associated packages.
- Obtaining scaled expression values with 5 different methods (MAS5, RMA, GCRMA, Plier & dChip). A comparison of three of these methods including references is available in this slide show.
Move your CEL files into your working directory and make sure that the corresponding CDF library for your chips is available on your system.
library(affy) # Loads the affy package.
mydata <- ReadAffy() # Reads all *.CEL (*.cel) files in your current working directory and stores them into the AffyBatch object 'mydata'.
mydata <- ReadAffy(widget=TRUE) # Opens file browser to select specific CEL files.
eset <- rma(mydata) # Creates normalized and background corrected expression values using the RMA method. The generated data are stored as ExpressionSet class in the 'eset' object. For large data sets use the more memory efficient justRMA() function.
eset <- mas5(mydata) # Uses expresso (MAS 5.0 method) module instead of RMA, which is much slower than RMA!
eset_gcrma <- gcrma(mydata) # Use this command to aquire gcrma data. The 'library(gcrma)' needs to be loaded first.
eset_plier <- justPlier(mydata) # Use this command to aquire plier data. The 'library(plier)' needs to be loaded first.
eset_PMA <- mas5calls(mydata) # Generates MAS 5.0 P/M/A calls. The command creates ExpressionSet with P/M/A calls in the 'exprs' slot and the wilcoxon p-values in the 'se.exprs' slot. To access them see below.
eset <- expresso(mydata, normalize.method="invariantset", bg.correct=FALSE, pmcorrect.method="pmonly", summary.method="liwong") # Generates expression calls similar to dChip (MBEI) method from Li and Wong.
library(affycoretools); affystart(plot=T, express="mas5") # Handy function to normalize all cel files in current working directory, perform qc plots and export normalized data to file. Works for mas5, rma and gcrma.
- Exporting data from 'ExpressionSet' objects
write.exprs(eset, file="mydata.txt") # Writes expression values to text file in working directory.
exprs2excel(eset, file="mydata.csv") # Writes expression values to csv excel file in working directory.
x <- data.frame(exprs(eset), exprs(eset_PMA), assayDataElement(eset_PMA, "se.exprs")); x <- x[,sort(names(x))]; write.table(x, file="mydata_PMA.xls", quote=F, col.names = NA, sep="\t") # Writes expression values, PMA values and wilcoxon p-values to text file in working directory. Remember: the old command for accessing the wilcoxon p-values in BioC versions <2.0 was 'se.exprs(eset_PMA))'.
- Obtaining single probe-level data from 'affybatch' objects (see also documentation '?ReadAffy')
mypm <- pm(mydata) # retrieves PM intensity values for single probes
mymm <- mm(mydata) # retrieves MM intensity values for single probes
myaffyids <- probeNames(mydata) # retrieves Affy IDs
result <- data.frame(myaffyids, mypm, mymm) # combines the above information in data frame
- Working with 'ExpressionSet' objects (see Biobase Manual)
eset; pData(eset) # Provides summary information of ExpressionSet object 'eset' and lists the analyzed file names.
exprs(eset)[1:2,1:4]; exprs(eset)[c("244901_at","244902_at"),1:4] # Retrieves specific rows and fields of ExpressionSet object. To learn more about this format class, consult ExpressionSet manual with command '?ExpressionSet'.
test <- as.data.frame(exprs(eset)); eset2 <-new("ExpressionSet", exprs = as.matrix(test), annotation="ath1121501"); eset2 # Example for creating an ExpressionSet object from a data frame. To create the object from an external file, use the read.delim() function first and then convert it accordingly.
data.frame(eset) # Prints content of 'eset' as data frame to STDOUT.
exprs(eset_PMA)[1:2,1:2]; assayDataElement(eset_PMA, "se.exprs")[1:2,1:2] # Prints from ExpressionSet of 'mas5calls(mydata)' the PMA values from its 'exprs' slot and the p-values from its 'se.exprs' slot.
- Retrieving annotation data for Affy IDs (see Annotation Package Manual)
library(ath1121501) # Opens library with annotation data.
library(help=ath1121501) # Shows availability and syntax for annotation data.
ath1121501() # Provides a summary about the available annotation data sets of an anntation library.
library(ath1121501cdf); ls(ath1121501cdf) # Retrieves all Affy IDs for a chip in vector format.
x <- c("245265_at", "260744_at", "259561_at", "254759_at", "267181_at") # Generates sample data set of Affy ID numbers.
mget(x, ath1121501ACCNUM, ifnotfound=NA) # Retrieves locus ID numbers for Affy IDs.
mget(x, ath1121501CHR); mget(x, ath1121501CHRLOC) # Retrieves chromosome numbers and locations of Affy IDs.
data.frame(AffyID=x, AGI=as.vector(unlist(mget(x, ath1121501ACCNUM, ifnotfound=NA))), Desc=as.vector(unlist(lapply(mget(x, ath1121501GENENAME, ifnotfound=NA), function(x) paste(x, collapse=", ")))), row.names=NULL) # Possibility to get it all into a data frame structure.
mget(x, ath1121501GO) # Retrieves GO information for Affy IDs.
- Accessing probe sequence data (see Matchprobes Manual)
library(ath1121501probe) # Opens library with probe sequence data.
print.data.frame(ath1121501probe[1:22,]) # Prints probe sequences and their positions for first two Affy IDs.
pm <- ath1121501probe$sequence # Assigns sequence component of list object 'ath1121501probe' to vector 'pm'.
mm <- complementSeq(ath1121501probe$sequence, start = 13, stop = 13) # The mismatch sequences are not stored in the probe packages, but can be created with the complementSeq function by flipping the middle base at position 13.
cat(pm[1], mm[1], sep = "\n") # The generic 'cat' function produces output in user-defined format. Here: pm aligned above mm.
reverseSeq(complementSeq(pm[1])) # Command to generate the reverse and complement of a sequence.
- Visualization and quality controls
- Additional information on this topic can be found in the affyPLM QC Manual, the arrayQualityMetrics package and on the WEHI Affy Lab site.
library(affyQCReport); QCReport(mydata, file="ExampleQC.pdf") # Generates a comprehensive QC report for the AffyBatch object 'mydata' in PDF format. See affyQCReport for details.
deg <- AffyRNAdeg(mydata); summaryAffyRNAdeg(deg); plotAffyRNAdeg(deg) # Performs RNA degradation analysis. It averages on each chip the probes relative to the 5'/3' position on the target genes. A summary list and a plot are returned.
image(mydata[ ,1]) # Reconstructs image with log intensities of first chip.
hist(mydata[ ,1:2]) # Plots histogram of PM intensities for 1st and 2nd array.
hist(log2(pm(mydata[,1])), breaks=100, col="blue") # Plos bar histogram of the PM ('pm') or MM ('mm') log intensities of 1st array.
boxplot(mydata,col="red") # Generates a box plot of un-normalized log intensity values.
boxplot(data.frame(exprs(eset)),col="blue", main="Normalized Data") # Generates a box plot of normalized log intensity values.
mva.pairs(pm(mydata)[,c(1,4)]) # Creates MA-plot for un-normalized data. A MA-plot is a plot of log-intensity ratios (M-values) versus log-intensity averages (A-values) between selected chips (here '[1,4]').
mva.pairs(exprs(eset)[,c(1,4)]) # Creates MA-plot for normalized data.
- Simpleaffy
- The simpleaffy package automates many repetitive high-level analysis steps.
- Creating expression values from CEL files
Move CEL files into working directory.
Create white-space delimited table file "covdesc.txt": first column no title then CEL file names; second column title "treatment", then treatment names.
library(simpleaffy) # Loads "affy" and "simpleaffy" packages.
raw.data <- read.affy("covdesc.txt") # Reads data in working directory that are specified in "covdesc.txt" file; for help on this function type "?read.affy".
x.rma <- call.exprs(raw.data, "rma") # Creates expression values using RMA method. The function 'call.exprs' returns always log2 expression values; for help on this function type "?call.exprs". One can also use here the "gcrma" method after loading it with the command 'library(gcrma)'.
x.mas <- call.exprs(raw.data, "mas5") # Computes expression values using MAS 5.0 method.
write.exprs(x.rma, file="mydata.txt") # Writes expression values to file "mydata.txt".
- Quality control steps
The "qc" function provides several qc stats for chips. To use it, one follows the following steps:
x <- read.affy("covdesc.txt") # Reads data in working directory.
x.mas5 <- call.exprs(x,"mas5") # Calculates expression values with MAS 5.0 method which is required for the next step!
qc <- qc(x,x.mas5) # Calls "qc" function which generates object containing scale factors, GAPDH-Actin_3'/5' ratios, target intensity, % present, average, min, max, mean background int and more; for more details type "?qc".
x.mas5@description@preprocessing # Creates QC output like this.
getGapdh3(cleancdfname(cdfName(x))) # To find out which GAPDH/Actin probe sets are used on a given chip; for others use getGapdhM, getGapdh5, getBioB, getBioC...
plot(qc) # Creates summary plot of QC data. See description on page 5 of vignette "simpleaffy".
- Filtering by expression values
get.array.subset(x.rma, "treatment",c("CMP1","Ctrl")) # When R loaded *.CEL files it also reads experiment layout from covdesc.txt file (Ctrl and CMP1). The "get.array.subset" function makes it easy to select subsets of arrays.
results <- pairwise.comparison(x.rma, "treatment", c("1", "2"), raw.data) # computes mean intensities of replicates from two treatments (means), calculates log2-fold changes between means (fc), performs t-test (tt) and writes out PMA calls (calls). Type "?pairwise.comparison" for more help on this function.
write.table(data.frame(means(results), fc(results), tt(results), calls(results)), file="my_comp.txt", sep="\t") # Command to export all 'pairwise.comparison' data into one table.
sort(abs(fc(results)),decreasing=TRUE)[1:100] # Prints out 100 highest fold-changes (fc), for means of intensities (means), for ttest (tt), for PMA (calls).
significant <- pairwise.filter(results, min.exp=log2(10), min.exp.no=2, fc=log2(8), min.present.no=4, tt= 0.001, present.by.group=FALSE) # Function 'pairwise.filter' takes the output from 'pairwise.comparison' and filters for significant changes. Filter Arguments:
- min.exp: minimum expression cut off
- min.exp.no: occurence of 'min.exp' in at least this number of chips
- min.present.no: present calls on at least this number of chips
- present.by.group: If true, then present count is restricted to replicate groups and not all chips of an experiment!
- fc: A gene must show a log2 fold change greater than this to be called significant. A 2.5-fold change can be specified like this: 'fc=log2(2.5)'
- tt: A gene must be changing with a p-score less than this to be called significant
Type "?pairwise.filter" for more info on the different arguments of this function.
write.table(data.frame(means(significant), fc(significant), tt(significant), calls(significant)), file="my_comp.txt", col.names = NA, sep="\t") # Exports all 'pairwise.filter' data into one table.
- Plotting results
plot(significant, type="scatter") # Plots significant changes from above as scatter plot. Alternative plotting types: type="ma" or type="volcano".
plot(results,significant, type="scatter") # Plots means of the two replicate groups as scatter plot and high-lights all genes that meet criteria of 'significant' filter. Meaning of colors: red - all present, orange - all present in one group or the other, yellow - all that remain.
plot(results,type="scatter") # Plots means of the two replicate groups as scatter plot.
png(file="figure1.png"); plot(significant,type="scatter"); dev.off() # Writes scatter plot to image file.
- Exercise simpleaffy (Command Summary)
- Analysis of Differentially Expressed Genes
- Various packages are available for analyzing pre-processed data from dual-color and Affymetrix arrays.
- LIMMA, limmaGUI & affylmGUI
- Limma is a software package for the analysis of gene expression microarray data, especially the use of linear models for analysing designed experiments and the assessment of differential expression. The package includes pre-processing capabilities for two-color spotted arrays. The differential expression methods apply to all array platforms and treat Affymetrix, single channel and two channel experiments in a unified way. The methods are described in Smyth 2004 and in the limma manual. An illustrated introduction for the GUI packages can be found at WEHI.
- Tcl/Tk Requirements
- Basic usage
library(limma) # Loads command-line limma package.
limmaUsersGuide() # Opens pdf manual for limma.
The cDNA and affy sample data sets of the manual can be downloaded from the limmaGUI and affylmGUI pages.
library(affylmGUI); affylmGUI() # Opens affylmGUI for affy data. For a quick start, follow the instructions for the Estrogen data set.
library(limmaGUI); limmaGUI() # Opens limmaGUI for cDNA array data. For a quick start, follow the instructions for the Swirl Zebrafish data set.
- Data objects in limma
There are four main data objects created and used by limma:
- RGList: for cDNA data created by function 'read.maimages()'
- MAList: for cDNA data created by functions MA.RG() or 'normalizeWithinArrays()'
- MArrayLM: created by function 'lmFit()'
- TestResults: created by function 'decideTests()'
More details on this can be found in paragraph 13 of the limma PDF manual ('limmaUsersGuide()').
- Preprocessing of cDNA array data
- Reading cDNA array data
To make the following commands work, save and extra