Programming in R


Home ]   R_&_BioConductor ]   R_Programming ]   EMBOSS ]   Linux ]   Cluster ]   More_Manuals ]

Author: Thomas Girke, UC Riverside

Index

  1. Introduction
  2. R_Basics
  3. Finding Help
  4. Control Structures
  5. Functions
  6. Useful Utilities
  7. Running R Programs
  8. Object-Oriented Programming
  9. Exercises

  1. Introduction

  2. One of the main attractions of using the R environment is the ease with which users can write their own programs and functions. The R programming syntax is extremely easy to learn, even for users with no previous programming experience. Once the basic R programming control structures are understood, users can use the language as a powerful calculation machine to perform complex analyses of almost any type of quantitative data.

  3. R Basics

  4. The R & BioConductor manual provides an introduction on the usage of the R environment and its basic command syntax.

  5. Finding Help

  6. Reference list on R programming (selection)

  7. Control Structures
    1. Conditional Executions

    2. Comparison operators: == (equal), != (not equal), >= (greater than or equal), etc. Logical operators: & (and), | (or) and ! (not).

      1. If Statement operates on length-one logical vectors

      2. Syntax
                if (cond1=true) { cmd1 } else { cmd2 }
        
        Example
                if (1==0) { print(1) } else { print(2) }
        
        Avoid using newlines between '} else'. To apply if statement on many values of vector or data frame, use 'for' or 'apply' loops (see below).

      3. Ifelse Statement: operates on vectors

      4. Syntax
                ifelse(test, true_value, false_value)
        
        Example
                x <- 1:10 
                ifelse(x<5 | x>8, x, 0)
        
    3. Loops
    4. The most commonly used loop structures in R are 'for', 'while' and 'apply' loops. Less common are 'repeat' loops. The 'break' function is used to break out of loops, and 'next' halts the processing of the current iteration and advances the looping index.

      1. For Loop: flexible, but slow when looping over large number of fields (e.g. thousands of rows or columns)

      2. Syntax
                for(variable in sequence) { 
                        statements 
                }
        
        Example: mean
                mydf <- iris
                myve <- NULL
                for(i in 1:length(mydf[,1])) {
                        myve <- c(myve, mean(as.vector(as.matrix(mydf[i,1:3]))))
                }
        
        Example: condition*
                x <- 1:10; z <- NULL
                for (i in 1:length(x)) { 
                        if (x[i]<5) { z <- c(z,x[i]-1)  } else { z <- c(z,x[i]/x[i])  } 
                }
        
        Example: stop on condition and print error message
                x <- 1:10; z <- NULL
                for (i in 1:length(x)) { 
        		if (x[i]<5) { z <- c(z,x[i]-1)  } else { stop("values need to be <5") }
                }
        
      3. While Loop: similar to for loop, but the iterations are controlled by a conditional statement.

      4. Syntax
                while(condition) statements
        
        Loop continues until condition returns FALSE.

        Example
                z <- 0
                while(z<5) { 
                        z <- z+2
                        print(z)  
                }
        
      5. The 'apply' Function Family

      6. Syntax
                apply(X, MARGIN, FUN, ARGs)
        
        X: array, matrix or data.frame; MARGIN: 1 for rows, 2 for columns, c(1,2) for both; FUN: one or more functions; ARGs: possible arguments for function
        Example
                apply(my_matrix, 1, function(i) { my_commands } )
        
        Example for single operation
                apply(mydf[,1:3], 1, mean)
        
        Two-step approach: 1st define function, 2nd use function in apply loop (does the same as above 'for loop'*)
                x <- 1:10; z <- NULL
                test <- function(x) { if (x<5) { x-1 } else { x/x } }
                apply(as.matrix(x), 1, test)
        
        One-step approach: does the same as above, but function defined in apply loop
                apply(as.matrix(x), 1, function(x) { if (x<5) { x-1 } else { x/x } })
        
      7. Other 'apply' Functions: tapply, sapply and lapply

        • tapply

        • Applies a function to array categories of variable lengths (ragged array). Grouping is defined by vector.
          Syntax
                  tapply(vector, factor, FUN)
          
          Example
                  for(i in 1:4) {
                          print(tapply(as.vector(iris[,i]), factor(iris[,5]), mean)) 
                  }
          
        • lapply and sapply

        • Both apply a function on vector or list objects. The function lapply returns a list, while sapply returns a more readable vector or matrix structure.
          Example for vector objects
                  z <- seq(1,10, by=2); my_matrix <- matrix(runif(100), ncol=10)
          	lapply(z, function(x) mean(my_matrix[x,]))
                  sapply(z, function(x) mean(my_matrix[x,]))
          

          Example for list objects
                  x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE,FALSE,FALSE,TRUE))
                  lapply(x, mean)
                  sapply(x, mean)
          
      8. Other Loops

        • Repeat Loop

        • Syntax
                  repeat statement
          
          Loop is repeated until a break is specified. This means there needs to be a second statement to test whether or not to break from the loop.

          Example
                  repeat { z <- 2; z <- z+2; print(z); z <- z+6; print(z); break }
          
  8. Functions

  9. A very useful feature of the R environment is the possibility to expand existing functions and to easily write custom functions. In fact, most of the R software can be viewed as a series of R functions.
  10. Useful Utilities

    1. How to interpret a character string as expression or command

    2. Example
              mylist <- ls() # generates vector of object names in session.
              mylist[1] # prints name of 1st entry in vector but does not execute it as expression that returns values of 10th object
              get(mylist[1]) # uses 1st entry name in vector and executes it as expression. 
              eval(parse(text=mylist[1])) # alternative approach to obtain the same result. 
      
    3. Replacement, split and paste functions for strings

    4. Example
              x <- gsub("(a)","\\1_", month.name[1], perl=T) # performs substitution with back reference which inserts in this example a '_' character
              strsplit(x,"_") # splits string on inserted character from above
              paste(rev(unlist(strsplit(x,NULL))), collapse="") # reverses a character string by splitting first all characters into vector fields 
      
    5. Time and date

    6. Example
              system.time(my_expr) # returns CPU (and other) times that 'my_expr' used
              date() # returns the current system date and time
      
    7. Pause a process for a defined time period

    8. Example
              Sys.sleep(1) # Pause execution of R expressions for a given number of seconds (e.g. in loop)
      
    9. Import of specific file lines with regular expression

    10. The following example demonstrates the retrieval of specific lines from an external file with a regular expression. First, an external file is created with the 'cat' function, all lines of this file are imported into a vector with 'readLines', the specific elements (lines) are then retieved with the 'grep' function, and the resulting lines are split into vector fields with 'strsplit'.
              cat(month.name, file="zzz.txt", sep="\n")
              x <- readLines("zzz.txt") 
              x <- x[c(grep("^J", as.character(x), perl = TRUE))]
              t(as.data.frame(strsplit(x,"u")))
      
    11. Batch import and export of many files

    12. Example
      In the following example the *.txt file names in the current directory are first assigned to a list (the '$' sign is used to anchor the match to the end of a string). Second, the files are imported one-by-one using a for loop where the original names are assigned to the generated data frames with the 'assign' function. Read ?read.table to understand arguments 'row.names=1' and 'comment.char = "A"'. Third, the data frames are exported using their names for file naming and appending '*.out'.
              files <- list.files(pattern=".txt$") 
              for(i in files) { 
                      x <- read.table(i, header=TRUE, comment.char = "A", sep="\t")
                      assign(i, x)
                      print(i)
                      write.table(x, paste(i, c(".out"), sep=""), quote=FALSE, sep="\t", col.names = NA) 
              }
      
    13. System commands

    14. Windows
              x <- shell("dir", intern=T) # reads current working directory and assigns to file 
              shell.exec("C:/Documents and Settings/Administrator/Desktop/my_file.txt") # opens file with associated program
      
      Unix
              system("...")
      
    15. Running Web Applications (basics on designing web client/crawling/scraping scripts in R)

    16. Example for obtaining MW values for peptide sequences from the EXPASY's pI/MW Tool web page.
              myentries <- c("MKWVTFISLLFLFSSAYS", "MWVTFISLL", "MFISLLFLFSSAYS")                                                                                                                                         
              myresult <- NULL                                                                                                                                                                                               
              for(i in myentries) {                                                                                                                                                                                          
                      myurl <- paste("http://ca.expasy.org/cgi-bin/pi_tool?protein=", i, "&resolution=monoisotopic", sep="")                                                                                                                                                 
                      x <- url(myurl)                                                                                                                                                                                        
                      res <- readLines(x)                                                                                                                                                                                    
                      close(x)                                                                                                                                                                                               
                      mylines <- res[grep("Theoretical pI/Mw:", res)]                                                                                                                                                         
                      myresult <- c(myresult, as.numeric(gsub(".*/ ", "", mylines)))                                                                                                                                          
                      print(myresult)                                                                                                                                                                                        
                      Sys.sleep(1) # halts process for one sec to give web service a break                                                                                                                                      
              }                                                                                                                                                                                                              
              final <- data.frame(Pep=myentries, MW=myresult)                                                                                                                                                                
              cat("\n The MW values for my peptides are:\n")                                                                                                                                                                 
              print(final)   
      
  11. Running R Programs

    1. Syntax for running R programs in BATCH mode from the command-line
    2.         $ R CMD BATCH [options] my_script.R [outfile]
      

      The output file lists the commands from the script file and their outputs. If no outfile is specified, the name used is that of 'infile' and '.Rout' is appended to outfile. To stop all the usual R command line information from being written to the outfile, add this as first line to my_script.R file: 'options(echo=FALSE)'. If the command is run like this 'R CMD BATCH --no-save my_script.R', then nothing will be saved in the .Rdata file which can get often very large. More on this can be found on the help pages: '$ R CMD BATCH --help' or '> ?BATCH'.

    3. Syntax for running R programs as silently as possible
    4.         $ R --slave < my_infile > my_outfile
      

      Argument '--slave' makes R run as 'quietly' as possible. This option is intended to support programs which use R to compute results for them.

    5. Submitting R script to a Linux cluster via Torque

    6. Create the following shell script 'my_script.sh'
              ##################################
              #!/bin/sh
              cd $PBS_O_WORKDIR
              R CMD BATCH --no-save my_script.R
              ##################################
      
      This script doesn't need to have executable permissons. Use the following 'qsub' command to send this shell script to the Linux cluster from the directory where the R script 'my_script.R' is located. To utilize several CPUs on the Linux cluster, one can divide the input data into several smaller subsets and execute for each subset a separate process from a dedicated directory.
              $ qsub my_script.sh
      
      Here is a short R script that generates the required files and directories automatically and submits the jobs to the nodes: submit2cluster.R.

    7. See also this 'Tutorial on Parallel Programming in R' by Hanna Sevcikova.

  12. Object-Oriented Programming

  13. Excercises

    1. Operations on many files: import, export, calculations, renaming, etc.

    2. (1) Start R from an empty test directory
      (2) Create some files as sample data
              for(i in month.name) {
                      mydf <- data.frame(Month=month.name, Rain=runif(12, min=10, max=100), Evap=runif(12, min=1000, max=2000))
                      write.table(mydf, file=paste(i , ".infile", sep=""), quote=F, row.names=F, sep="\t")
              }   
      
      (3) Import created files, perform calculations and export to renamed files
              files <- list.files(pattern=".infile$")
      	for(i in 1:length(files)) { # start for loop with numeric or character vector; numeric vector is often more flexible
                      x <- read.table(files[i], header=TRUE, row.names=1, comment.char = "A", sep="\t")
      		x <- data.frame(x, sum=apply(x, 1, sum), mean=apply(x, 1, mean)) # calculates sum and mean for each data frame
      		assign(files[i], x) # generates data frame object and names it after content in variable 'i'
      		print(files[i], quote=F) # prints loop iteration to screen to check its status
                      write.table(x, paste(files[i], c(".out"), sep=""), quote=FALSE, sep="\t", col.names = NA) 
              }
      
      (4) Same as above, but file naming by index data frame. This way one can organize file names by external table.
              name_df <- data.frame(Old_name=sort(files), New_name=sort(month.abb))
              for(i in 1:length(name_df[,1])) { 
                      x <- read.table(as.vector(name_df[i,1]), header=TRUE, row.names=1, comment.char = "A", sep="\t")
                      x <- data.frame(x, sum=apply(x, 1, sum), mean=apply(x, 1, mean))
                      assign(as.vector(name_df[i,2]), x) # generates data frame object and names it after 'i' entry in column 2
                      print(as.vector(name_df[i,1]), quote=F)
                      write.table(x, paste(as.vector(name_df[i,2]), c(".out"), sep=""), quote=FALSE, sep="\t", col.names = NA)
              }
      
      (5) Append content of all input files to one file.
              files <- list.files(pattern=".infile$")
              all_files <- data.frame(files=NULL, Month=NULL, Gain=NULL , Loss=NULL, sum=NULL, mean=NULL) # creates empty data frame container
              for(i in 1:length(files)) {
                      x <- read.table(files[i], header=TRUE, row.names=1, comment.char = "A", sep="\t")
                      x <- data.frame(x, sum=apply(x, 1, sum), mean=apply(x, 1, mean)) 
      		x <- data.frame(file=rep(files[i], length(x[,1])), x) # adds file tracking column to x
      		all_files <- rbind(all_files, x) # appends data from all files to data frame 'all_files'
                      write.table(all_files, file="all_files.xls", quote=FALSE, sep="\t", col.names = NA) 
              }
      
      (6) Write the above code into a text file and execute it with the commands 'source' and 'BATCH'.
              source("my_script.R") # execute from R console				
              $ R CMD BATCH my_script.R # execute from shell			
      
    3. Large-scale Array Analysis

    4. Sample script to perform large-scale expression array analysis with complex queries: lsArray.R.
      To demo what the script does, run it like this:
      	source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/lsArray.R")
      
    5. Graphical Procedures: Feature Map Example

    6. Script to plot feature maps of genes or chromosomes: featureMap.R.
      To demo what the script does, run it like this:
      	source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/featureMap.txt")
      
    7. Sequence Analysis Utilities: Sequence Batch Import, Sub-setting, Pattern Matching, AA Composition, NEEDLE, PHYLIP, etc.

    8. The script 'sequenceAnalysis.R' demonstrates how R can be used as a powerful tool for managing and analyzing large sets of biological sequences. This example also shows how easy it is to integrate R with the EMBOSS project or other external programs.
      The script provides the following functionality:


      To demonstrate the utilities of the script, users can simply execute it from R with the following source command:
              source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/sequenceAnalysis.txt")
      


    9. Pattern Matching and Positional Parsing of Sequences

    10. Functions for importing sequences into R, retrieving reverse and complement of nucleotide sequences, pattern searching, positional parsing and exporting search results in HTML format: patternSearch.R.
      To demo what the script does, run it like this:
      	source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/patternSearch.R")
      
    11. Identify Over-Represented Strings in Sequence Sets

    12. Functions for finding over-represented words in sets of DNA, RNA or protein sequences: wordFinder.R.
      To demo what the script does, run it like this:
      	source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/wordFinder.R")
      
    13. Translate DNA into Protein

    14. Script 'translateDNA.R' for translating NT sequences into AA sequences (required codon table).
      To demo what the script does, run it like this:
      	source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/translateDNA.R")
      
    15. Vector and Phred Score Trimming of Sequences

    16. Sample script for vector and Phred score trimming of sequences: VectorPhredTrim.R.

    17. Gene Ontology Analysis

    18. Write a script to perform hypergeometric distribution tests against all GO nodes including goSlim analysis.
      To demo what the script does, run it like this:
              source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/GOHyperGAll.txt")
      
    19. KEGG Pathway Analysis

    20. Write a script that performs KEGG pathway analysis. This is a sample script that maps Arabidopsis locus IDs to KEGG pathways: KEGG Pathway Script. It can be easily adjusted to any organism.
      To demo what the script does, run it like this:
      	source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/KEGG_ATH.txt")
      
    21. Subsetting of Structure Definition Files (SDF)

    22. Script for importing and subsetting SDF files: sdfSubset.R:
      To demo what the script does, run it like this:
      	source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/sdfSubset.R")
      
    23. Managing Latex BibTeX Databases

    24. Script for importing BibTeX databases into R, retrieving the individual references with a full-text search function and viewing the results in R or in HubMed: BibTex.R.
      To demo what the script does, run it like this:
      	source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/BibTex.R")
      
    25. Loan Payments and Amortization Tables

    26. This script calculates monthly and annual mortgage or loan payments, generates amortization tables and plots the results: mortgage.R.
      To demo what the script does, run it like this:
      	source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/mortgage.R")
      
    27. Course Assignment: GC Content, Reverse & Complement

    28. Apply the above information to write a function that calculates for a set of DNA sequences their GC content and generates their reverse and complement. Here are some useful commands that can be incorporated in this function: