Programming in R


Workshops ]   R_&_BioC ]   BioC-Seq ]   R_Programming ]   EMBOSS ]   Linux ]   Cluster ]

Author: Thomas Girke, UC Riverside

Index

  1. Introduction
  2. R_Basics
  3. Finding Help
  4. Control Structures
  5. Functions
  6. Useful Utilities
  7. Running R Programs
  8. Object-Oriented Programming
  9. Exercises

  1. Introduction

  2. One of the main attractions of using the R environment is the ease with which users can write their own programs and functions. The R programming syntax is extremely easy to learn, even for users with no previous programming experience. Once the basic R programming control structures are understood, users can use the R language as a powerful environment to perform complex custom analyses of almost any type of data.

  3. R Basics

  4. The R & BioConductor manual provides an introduction on the usage of the R environment and its basic command syntax.

  5. Finding Help

  6. Reference list on R programming (selection)

  7. Control Structures
    1. Conditional Executions

    2. Comparison operators: == (equal), != (not equal), >= (greater than or equal), etc. Logical operators: & (and), | (or) and ! (not).

      1. If Statement operates on length-one logical vectors

      2. Syntax
                if (cond1=true) { cmd1 } else { cmd2 }
        
        Example
                if (1==0) { print(1) } else { print(2) }
        
        Avoid using newlines between '} else'. To apply if statement on many values of vector or data frame, use 'for' or 'apply' loops (see below).

      3. Ifelse Statement: operates on vectors

      4. Syntax
                ifelse(test, true_value, false_value)
        
        Example
                x <- 1:10 
                ifelse(x<5 | x>8, x, 0)
        
    3. Loops
    4. The most commonly used loop structures in R are 'for', 'while' and 'apply' loops. Less common are 'repeat' loops. The 'break' function is used to break out of loops, and 'next' halts the processing of the current iteration and advances the looping index.

      1. For Loop: flexible, but slow when looping over large number of fields (e.g. thousands of rows or columns)

      2. Syntax
                for(variable in sequence) { 
                        statements 
                }
        
        Example: mean
                mydf <- iris
                myve <- NULL
                for(i in seq(along=mydf[,1])) {
                        myve <- c(myve, mean(as.numeric(mydf[i,1:3])))
                }
        
        Example: condition*
                x <- 1:10; z <- NULL
                for(i in seq(along=x)) { 
                        if (x[i]<5) { z <- c(z,x[i]-1)  } else { z <- c(z,x[i]/x[i])  } 
                }
        
        Example: stop on condition and print error message
                x <- 1:10; z <- NULL
                for (i in seq(along=x)) { 
        		if (x[i]<5) { z <- c(z,x[i]-1)  } else { stop("values need to be <5") }
                }
        
      3. While Loop: similar to for loop, but the iterations are controlled by a conditional statement.

      4. Syntax
                while(condition) statements
        
        Loop continues until condition returns FALSE.

        Example
                z <- 0
                while(z<5) { 
                        z <- z+2
                        print(z)  
                }
        
      5. The 'apply' Function Family

      6. Syntax
                apply(X, MARGIN, FUN, ARGs)
        
        X: array, matrix or data.frame; MARGIN: 1 for rows, 2 for columns, c(1,2) for both; FUN: one or more functions; ARGs: possible arguments for function
        Example
                apply(my_matrix, 1, function(i) { my_commands } )
        
        Example for single operation
                apply(mydf[,1:3], 1, mean)
        
        Two-step approach: 1st define function, 2nd use function in apply loop (does the same as above 'for loop'*)
                x <- 1:10; z <- NULL
                test <- function(x) { if (x<5) { x-1 } else { x/x } }
                apply(as.matrix(x), 1, test)
        
        One-step approach: does the same as above, but function defined in apply loop
                apply(as.matrix(x), 1, function(x) { if (x<5) { x-1 } else { x/x } })
        
      7. Other 'apply' Functions: tapply, sapply and lapply

        • tapply

        • Applies a function to array categories of variable lengths (ragged array). Grouping is defined by vector.
          Syntax
                  tapply(vector, factor, FUN)
          
          Example
                  for(i in 1:4) {
                          print(tapply(as.vector(iris[,i]), factor(iris[,5]), mean)) 
                  }
          
          Note: the aggregate() function provides related utilities:
                 aggregate(iris[,1:4], list(iris$Species), mean) 
          
        • lapply and sapply

        • Both apply a function on vector or list objects. The function lapply returns a list, while sapply returns a more readable vector or matrix structure.
          Example for vector objects
                  z <- seq(1,10, by=2); my_matrix <- matrix(runif(100), ncol=10)
          	lapply(z, function(x) mean(my_matrix[x,]))
                  sapply(z, function(x) mean(my_matrix[x,]))
          

          Example for list objects
                  x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE,FALSE,FALSE,TRUE))
                  lapply(x, mean)
                  sapply(x, mean)
          
      8. Other Loops

        • Repeat Loop

        • Syntax
                  repeat statement
          
          Loop is repeated until a break is specified. This means there needs to be a second statement to test whether or not to break from the loop.

          Example
                  repeat { z <- 2; z <- z+2; print(z); z <- z+6; print(z); break }
          
      9. Improving Speed Performance by Avoiding Loops

        Looping over very large data sets can become slow in R. However, this limitation can be overcome by avoiding loops over the data intensive dimension in an object altogether. This can be achieved by performing mainly vector-to-vecor or matrix-to-matrix computations which run often over 100 times faster than the corresponding for() or apply() loops in R. For this purpose, one can make use of the existing speed-optimized R functions (e.g.: rowSums, rowMeans, table, tabulate) or one can design custom functions that avoid expensive R loops by using vector- or matrix-based approaches. Alternatively, one can write programs that will perform all time consuming computations on the C-level.

        • Speed comparison of apply loop versus rowMeans for computing the mean for each row in a large matrix:
        •         myMA <- matrix(rnorm(1000000), 100000, 10, dimnames=list(1:100000, paste("C", 1:10, sep="")))
          	system.time(myMAmean <- apply(myMA, 1, mean)); myMAmean[1:4]
                  system.time(myMAmean <- rowMeans(myMA)); myMAmean[1:4]
          
          The rowMeans approach is over 200 times faster than the apply loop.

        • Speed comparison of apply loop versus vector-based approach for computing the standard deviation of each row:
        •         system.time(myMAsd <- apply(myMA, 1, sd)); myMAsd[1:4]
                  system.time(myMAsd <- sd(t(myMA))); myMAsd[1:4]
                  system.time(myMAsd <- sqrt((rowSums((myMA-rowMeans(myMA))^2)) / (length(myMA[1,])-1))); myMAsd[1:4]
          
          The vector-based approach in the last step is over 200 times faster than the apply loop or the sd(t(myMA)) command.

        • Example for computing the mean for any custom selection of columns without making compromises on the speed performance:
        • 	myList <- tapply(colnames(myMA), c(1,1,1,2,2,2,3,3,4,4), list)
          	myMAmean <- sapply(myList, function(x) rowMeans(myMA[,x]))
          	colnames(myMAmean) <- sapply(myList, paste, collapse="_"); myMAmean[1:4,]
          
          The colums are named according to the selection stored in myList.

        • Alternative to achieve the same result with similar performance, but in a much less elegant way:
        • 	myselect <- c(1,1,1,2,2,2,3,3,4,4)
          	myList <- tapply(seq(along=myMA[1,]), myselect, function(x) paste("myMA[ ,", x, "]", sep="")) 
          	myList <- sapply(myList, function(x) paste("(", paste(x, collapse=" + "),")/", length(x)))
          	myMAmean <- sapply(myList, function(x) eval(parse(text=x)))
          	colnames(myMAmean) <- tapply(colnames(myMA), myselect, paste, collapse="_"); myMAmean[1:4,]
          
          The colums are named according to the selection stored in myselect.

  8. Functions

  9. A very useful feature of the R environment is the possibility to expand existing functions and to easily write custom functions. In fact, most of the R software can be viewed as a series of R functions.
  10. Useful Utilities

    1. Debugging utilities

    2. Several debugging utilities are available for R. The most important utilities are: traceback(), browser(), options(error=recover), options(error=NULL) and debug(). The Debugging in R page provides an overview of the available resources.

    3. How to interpret a character string as expression or command?

    4. Example
              mylist <- ls() # generates vector of object names in session.
              mylist[1] # prints name of 1st entry in vector but does not execute it as expression that returns values of 10th object
              get(mylist[1]) # uses 1st entry name in vector and executes it as expression. 
              eval(parse(text=mylist[1])) # alternative approach to obtain the same result. 
      
    5. Replacement, split and paste functions for strings

    6. Example
              x <- gsub("(a)","\\1_", month.name[1], perl=T) # performs substitution with back reference which inserts in this example a '_' character
              strsplit(x,"_") # splits string on inserted character from above
              paste(rev(unlist(strsplit(x,NULL))), collapse="") # reverses a character string by splitting first all characters into vector fields 
      
    7. Time and date

    8. Example
              system.time(my_expr) # returns CPU (and other) times that 'my_expr' used
              date() # returns the current system date and time
      
    9. Pause a process for a defined time period

    10. Example
              Sys.sleep(1) # Pause execution of R expressions for a given number of seconds (e.g. in loop)
      
    11. Import of specific file lines with regular expression

    12. The following example demonstrates the retrieval of specific lines from an external file with a regular expression. First, an external file is created with the 'cat' function, all lines of this file are imported into a vector with 'readLines', the specific elements (lines) are then retieved with the 'grep' function, and the resulting lines are split into vector fields with 'strsplit'.
              cat(month.name, file="zzz.txt", sep="\n")
              x <- readLines("zzz.txt") 
              x <- x[c(grep("^J", as.character(x), perl = TRUE))]
              t(as.data.frame(strsplit(x,"u")))
      
    13. Batch import and export of many files

    14. Example
      In the following example the *.txt file names in the current directory are first assigned to a list (the '$' sign is used to anchor the match to the end of a string). Second, the files are imported one-by-one using a for loop where the original names are assigned to the generated data frames with the 'assign' function. Read ?read.table to understand arguments 'row.names=1' and 'comment.char = "A"'. Third, the data frames are exported using their names for file naming and appending '*.out'.
              files <- list.files(pattern=".txt$") 
              for(i in files) { 
                      x <- read.table(i, header=TRUE, comment.char = "A", sep="\t")
                      assign(i, x)
                      print(i)
                      write.table(x, paste(i, c(".out"), sep=""), quote=FALSE, sep="\t", col.names = NA) 
              }
      
    15. System commands

    16. Windows
              x <- shell("dir", intern=T) # reads current working directory and assigns to file 
              shell.exec("C:/Documents and Settings/Administrator/Desktop/my_file.txt") # opens file with associated program
      
      Unix
              system("...")
      
    17. Running Web Applications (basics on designing web client/crawling/scraping scripts in R)

    18. Example for obtaining MW values for peptide sequences from the EXPASY's pI/MW Tool web page.
              myentries <- c("MKWVTFISLLFLFSSAYS", "MWVTFISLL", "MFISLLFLFSSAYS")                                                                                                                                         
              myresult <- NULL                                                                                                                                                                                               
              for(i in myentries) {                                                                                                                                                                                          
                      myurl <- paste("http://ca.expasy.org/cgi-bin/pi_tool?protein=", i, "&resolution=monoisotopic", sep="")                                                                                                                                                 
                      x <- url(myurl)                                                                                                                                                                                        
                      res <- readLines(x)                                                                                                                                                                                    
                      close(x)                                                                                                                                                                                               
                      mylines <- res[grep("Theoretical pI/Mw:", res)]                                                                                                                                                         
                      myresult <- c(myresult, as.numeric(gsub(".*/ ", "", mylines)))                                                                                                                                          
                      print(myresult)                                                                                                                                                                                        
                      Sys.sleep(1) # halts process for one sec to give web service a break                                                                                                                                      
              }                                                                                                                                                                                                              
              final <- data.frame(Pep=myentries, MW=myresult)                                                                                                                                                                
              cat("\n The MW values for my peptides are:\n")                                                                                                                                                                 
              print(final)   
      
  11. Running R Programs

    1. Executing an R script from the R console
    2.         source("my_script.R")
      

    3. Syntax for running R programs in BATCH mode from the command-line. All commands starting with a '$' sign need to be executed from a Unix or Linux shell.
    4.         $ R CMD BATCH [options] my_script.R [outfile]
      

      The output file lists the commands from the script file and their outputs. If no outfile is specified, the name used is that of 'infile' and '.Rout' is appended to outfile. To stop all the usual R command line information from being written to the outfile, add this as first line to my_script.R file: 'options(echo=FALSE)'. If the command is run like this 'R CMD BATCH --no-save my_script.R', then nothing will be saved in the .Rdata file which can get often very large. More on this can be found on the help pages: '$ R CMD BATCH --help' or '> ?BATCH'.

    5. Syntax for running R programs as silently as possible
    6.         $ R --slave < my_infile > my_outfile
      

      Argument '--slave' makes R run as 'quietly' as possible. This option is intended to support programs which use R to compute results for them.

    7. Submitting R script to a Linux cluster via Torque

    8. Create the following shell script 'my_script.sh'
              ##################################
              #!/bin/sh
              cd $PBS_O_WORKDIR
              R CMD BATCH --no-save my_script.R
              ##################################
      
      This script doesn't need to have executable permissons. Use the following 'qsub' command to send this shell script to the Linux cluster from the directory where the R script 'my_script.R' is located. To utilize several CPUs on the Linux cluster, one can divide the input data into several smaller subsets and execute for each subset a separate process from a dedicated directory.
              $ qsub my_script.sh
      
      Here is a short R script that generates the required files and directories automatically and submits the jobs to the nodes: submit2cluster.R.

    9. See also this 'Tutorial on Parallel Programming in R' by Hanna Sevcikova.

  12. Object-Oriented Programming

  13. Excercises

    1. R Programming Exercises

    2. Slide Show

      Download the R Programming Exercises, then start editing this document with a programming text editor, such as Vim, Emacs or one of the R GUI text editors.

    3. Operations on many files: import, export, calculations, renaming, etc.

    4. (1) Start R from an empty test directory
      (2) Create some files as sample data
              for(i in month.name) {
                      mydf <- data.frame(Month=month.name, Rain=runif(12, min=10, max=100), Evap=runif(12, min=1000, max=2000))
                      write.table(mydf, file=paste(i , ".infile", sep=""), quote=F, row.names=F, sep="\t")
              }   
      
      (3) Import created files, perform calculations and export to renamed files
              files <- list.files(pattern=".infile$")
      	for(i in seq(along=files)) { # start for loop with numeric or character vector; numeric vector is often more flexible
                      x <- read.table(files[i], header=TRUE, row.names=1, comment.char = "A", sep="\t")
      		x <- data.frame(x, sum=apply(x, 1, sum), mean=apply(x, 1, mean)) # calculates sum and mean for each data frame
      		assign(files[i], x) # generates data frame object and names it after content in variable 'i'
      		print(files[i], quote=F) # prints loop iteration to screen to check its status
                      write.table(x, paste(files[i], c(".out"), sep=""), quote=FALSE, sep="\t", col.names = NA) 
              }
      
      (4) Same as above, but file naming by index data frame. This way one can organize file names by external table.
              name_df <- data.frame(Old_name=sort(files), New_name=sort(month.abb))
              for(i in seq(along=name_df[,1])) { 
                      x <- read.table(as.vector(name_df[i,1]), header=TRUE, row.names=1, comment.char = "A", sep="\t")
                      x <- data.frame(x, sum=apply(x, 1, sum), mean=apply(x, 1, mean))
                      assign(as.vector(name_df[i,2]), x) # generates data frame object and names it after 'i' entry in column 2
                      print(as.vector(name_df[i,1]), quote=F)
                      write.table(x, paste(as.vector(name_df[i,2]), c(".out"), sep=""), quote=FALSE, sep="\t", col.names = NA)
              }
      
      (5) Append content of all input files to one file.
              files <- list.files(pattern=".infile$")
              all_files <- data.frame(files=NULL, Month=NULL, Gain=NULL , Loss=NULL, sum=NULL, mean=NULL) # creates empty data frame container
              for(i in seq(along=files)) {
                      x <- read.table(files[i], header=TRUE, row.names=1, comment.char = "A", sep="\t")
                      x <- data.frame(x, sum=apply(x, 1, sum), mean=apply(x, 1, mean)) 
      		x <- data.frame(file=rep(files[i], length(x[,1])), x) # adds file tracking column to x
      		all_files <- rbind(all_files, x) # appends data from all files to data frame 'all_files'
                      write.table(all_files, file="all_files.xls", quote=FALSE, sep="\t", col.names = NA) 
              }
      
      (6) Write the above code into a text file and execute it with the commands 'source' and 'BATCH'.
              source("my_script.R") # execute from R console				
              $ R CMD BATCH my_script.R # execute from shell			
      
    5. Large-scale Array Analysis

    6. Sample script to perform large-scale expression array analysis with complex queries: lsArray.R.
      To demo what the script does, run it like this:
      	source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/lsArray.R")
      
    7. Graphical Procedures: Feature Map Example

    8. Script to plot feature maps of genes or chromosomes: featureMap.R.
      To demo what the script does, run it like this:
      	source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/featureMap.txt")
      
    9. Sequence Analysis Utilities: Sequence Batch Import, Sub-setting, Pattern Matching, AA Composition, NEEDLE, PHYLIP, etc.

    10. The script 'sequenceAnalysis.R' demonstrates how R can be used as a powerful tool for managing and analyzing large sets of biological sequences. This example also shows how easy it is to integrate R with the EMBOSS project or other external programs.
      The script provides the following functionality:


      To demonstrate the utilities of the script, users can simply execute it from R with the following source command:
              source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/sequenceAnalysis.txt")
      


    11. Pattern Matching and Positional Parsing of Sequences

    12. Functions for importing sequences into R, retrieving reverse and complement of nucleotide sequences, pattern searching, positional parsing and exporting search results in HTML format: patternSearch.R.
      To demo what the script does, run it like this:
      	source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/patternSearch.R")
      
    13. Identify Over-Represented Strings in Sequence Sets

    14. Functions for finding over-represented words in sets of DNA, RNA or protein sequences: wordFinder.R.
      To demo what the script does, run it like this:
      	source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/wordFinder.R")
      
    15. Translate DNA into Protein

    16. Script 'translateDNA.R' for translating NT sequences into AA sequences (required codon table).
      To demo what the script does, run it like this:
      	source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/translateDNA.R")
      
    17. Subsetting of Structure Definition Files (SDF)

    18. Script for importing and subsetting SDF files: sdfSubset.R:
      To demo what the script does, run it like this:
      	source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/sdfSubset.R")
      
    19. Managing Latex BibTeX Databases

    20. Script for importing BibTeX databases into R, retrieving the individual references with a full-text search function and viewing the results in R or in HubMed: BibTex.R.
      To demo what the script does, run it like this:
      	source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/BibTex.R")
      
    21. Loan Payments and Amortization Tables

    22. This script calculates monthly and annual mortgage or loan payments, generates amortization tables and plots the results: mortgage.R.
      To demo what the script does, run it like this:
      	source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/mortgage.R")
      
    23. Course Assignment: GC Content, Reverse & Complement

    24. Apply the above information to write a function that calculates for a set of DNA sequences their GC content and generates their reverse and complement. Here are some useful commands that can be incorporated in this function:
This site was accessed times (detailed access stats).