- A very useful feature of the R environment is the possibility to expand existing functions and to easily write custom functions. In fact, most of the R software can be viewed as a series of R functions.
Syntax to define functions
myfct <- function(arg1, arg2, ...) { function_body }
The value returned by a function is the value of the function body, which is usually an unassigned final expression.
Syntax to call functions
myfct(arg1=..., arg2=...)
- Syntax rules
- General
Functions are defined by (1) assignment with the keyword 'function', (2) the declaration of arguments/variables (arg1, arg2, ...) and (3) the definition of operations (function_body) that perform computations on the provided arguments. A function name ('myfct') needs to be assigned to call the function (see below).
- Naming
Function names can be almost anything. However, the usage of names of existing functions should be avoided.
- Arguments
It is often useful to provide default values for arguments (e.g.: 'arg1=1:10'). This way they don't need to be provided in a function call. The argument list can also be left empty ('myfct <- function() { fct_body }') when a function is expected to return always the same value(s). The argument '...' can be used to allow one function to pass on argument settings to another.
- Function body
The actual expressions (commands/operations) are defined in the function body which should be enclosed by braces. The individual commands are separated by semicolons or new lines (preferred).
- Calling functionsFunctions are called by their name followed by parentheses containing possible argument names. Empty parenthesis after the function name will result in an error message when a function requires certain arguments to be provided by the user. The function name alone will print the definition of a function.
- Scope
Variables created inside a function exist only for the life time of a function. Thus, they are not accessible outside of the function. To force variables in functions to exist globally, one can use this special assignment operator: '<<-'. If a global variable is used in a function, then the global variable will be masked only within the function.
Example: Function basics
myfct <- function(x1, x2=5) {
z1 <- x1/x1; z2 <- x2*x2
cat("my function returns:", "\n")
print(z1); print(z2)
}
myfct # prints definition of function
myfct(x1=2, x2=5) # applies function to values 2 and 5
myfct(2, 5) # the argument names are not necessary, but then the order of the specified values becomes important
myfct(x1=2) # does the same as before, but the default value '5' is used in this case
Example: Function with optional arguments
myfct2 <- function(x1=5, opt_arg) {
if(missing(opt_arg)) { # 'missing()' is used to test whether a value was specified as an argument
z1 <- 1:10
} else {
z1 <- opt_arg }
cat("my function returns:", "\n")
print(z1/x1)
}
myfct2(x1=5) # performs calculation on default vector (z1) that is defined in the function body
myfct2(x1=5, opt_arg=30:20) # a custom vector is used instead when the optional argument (opt_arg) is specified
Control utilities for functions: return, warning and stop
- Return
The evaluation flow of a function may be terminated at any stage with the 'return()' function. This is often used in combination with conditional evaluations.
- Stop
To stop the action of a function and print an error message, one can use the stop() function.
- Warning
To print a warning message in unexpected situations without aborting the evaluation flow of a function, one can use the function 'warning("...")'.
myfct <- function(x1) {
if (x1>=0) print(x1) else stop("This function did not finish, because x1 < 0")
warning("Value needs to be > 0")
}
myfct(x1=2); myfct(x1=-2)
- How to interpret a character string as expression or command
Example
mylist <- ls() # generates vector of object names in session.
mylist[1] # prints name of 1st entry in vector but does not execute it as expression that returns values of 10th object
get(mylist[1]) # uses 1st entry name in vector and executes it as expression.
eval(parse(text=mylist[1])) # alternative approach to obtain the same result.
- Replacement, split and paste functions for strings
Example
x <- gsub("(a)","\\1_", month.name[1], perl=T) # performs substitution with back reference which inserts in this example a '_' character
strsplit(x,"_") # splits string on inserted character from above
paste(rev(unlist(strsplit(x,NULL))), collapse="") # reverses a character string by splitting first all characters into vector fields
- Time and date
Example
system.time(my_expr) # returns CPU (and other) times that 'my_expr' used
date() # returns the current system date and time
- Pause a process for a defined time period
Example
Sys.sleep(1) # Pause execution of R expressions for a given number of seconds (e.g. in loop)
- Import of specific file lines with regular expression
The following example demonstrates the retrieval of specific lines from an external file with a regular expression. First, an external file is created with the 'cat' function, all lines of this file are imported into a vector with 'readLines', the specific elements (lines) are then retieved with the 'grep' function, and the resulting lines are split into vector fields with 'strsplit'.
cat(month.name, file="zzz.txt", sep="\n")
x <- readLines("zzz.txt")
x <- x[c(grep("^J", as.character(x), perl = TRUE))]
t(as.data.frame(strsplit(x,"u")))
- Batch import and export of many files
Example
In the following example the *.txt file names in the current directory are first assigned to a list (the '$' sign is used to anchor the match to the end of a string). Second, the files are imported one-by-one using a for loop where the original names are assigned to the generated data frames with the 'assign' function. Read ?read.table to understand arguments 'row.names=1' and 'comment.char = "A"'. Third, the data frames are exported using their names for file naming and appending '*.out'.
files <- list.files(pattern=".txt$")
for(i in files) {
x <- read.table(i, header=TRUE, comment.char = "A", sep="\t")
assign(i, x)
print(i)
write.table(x, paste(i, c(".out"), sep=""), quote=FALSE, sep="\t", col.names = NA)
}
- System commands
Windows
x <- shell("dir", intern=T) # reads current working directory and assigns to file
shell.exec("C:/Documents and Settings/Administrator/Desktop/my_file.txt") # opens file with associated program
Unix
system("...")
- Running Web Applications (basics on designing web client/crawling/scraping scripts in R)
Example for obtaining MW values for peptide sequences from the EXPASY's pI/MW Tool web page.
myentries <- c("MKWVTFISLLFLFSSAYS", "MWVTFISLL", "MFISLLFLFSSAYS")
myresult <- NULL
for(i in myentries) {
myurl <- paste("http://ca.expasy.org/cgi-bin/pi_tool?protein=", i, "&resolution=monoisotopic", sep="")
x <- url(myurl)
res <- readLines(x)
close(x)
mylines <- res[grep("Theoretical pI/Mw:", res)]
myresult <- c(myresult, as.numeric(gsub(".*/ ", "", mylines)))
print(myresult)
Sys.sleep(1) # halts process for one sec to give web service a break
}
final <- data.frame(Pep=myentries, MW=myresult)
cat("\n The MW values for my peptides are:\n")
print(final)
- Operations on many files: import, export, calculations, renaming, etc.
(1) Start R from an empty test directory
(2) Create some files as sample data
for(i in month.name) {
mydf <- data.frame(Month=month.name, Rain=runif(12, min=10, max=100), Evap=runif(12, min=1000, max=2000))
write.table(mydf, file=paste(i , ".infile", sep=""), quote=F, row.names=F, sep="\t")
}
(3) Import created files, perform calculations and export to renamed files
files <- list.files(pattern=".infile$")
for(i in 1:length(files)) { # start for loop with numeric or character vector; numeric vector is often more flexible
x <- read.table(files[i], header=TRUE, row.names=1, comment.char = "A", sep="\t")
x <- data.frame(x, sum=apply(x, 1, sum), mean=apply(x, 1, mean)) # calculates sum and mean for each data frame
assign(files[i], x) # generates data frame object and names it after content in variable 'i'
print(files[i], quote=F) # prints loop iteration to screen to check its status
write.table(x, paste(files[i], c(".out"), sep=""), quote=FALSE, sep="\t", col.names = NA)
}
(4) Same as above, but file naming by index data frame. This way one can organize file names by external table.
name_df <- data.frame(Old_name=sort(files), New_name=sort(month.abb))
for(i in 1:length(name_df[,1])) {
x <- read.table(as.vector(name_df[i,1]), header=TRUE, row.names=1, comment.char = "A", sep="\t")
x <- data.frame(x, sum=apply(x, 1, sum), mean=apply(x, 1, mean))
assign(as.vector(name_df[i,2]), x) # generates data frame object and names it after 'i' entry in column 2
print(as.vector(name_df[i,1]), quote=F)
write.table(x, paste(as.vector(name_df[i,2]), c(".out"), sep=""), quote=FALSE, sep="\t", col.names = NA)
}
(5) Append content of all input files to one file.
files <- list.files(pattern=".infile$")
all_files <- data.frame(files=NULL, Month=NULL, Gain=NULL , Loss=NULL, sum=NULL, mean=NULL) # creates empty data frame container
for(i in 1:length(files)) {
x <- read.table(files[i], header=TRUE, row.names=1, comment.char = "A", sep="\t")
x <- data.frame(x, sum=apply(x, 1, sum), mean=apply(x, 1, mean))
x <- data.frame(file=rep(files[i], length(x[,1])), x) # adds file tracking column to x
all_files <- rbind(all_files, x) # appends data from all files to data frame 'all_files'
write.table(all_files, file="all_files.xls", quote=FALSE, sep="\t", col.names = NA)
}
(6) Write the above code into a text file and execute it with the commands 'source' and 'BATCH'.
source("my_script.R") # execute from R console
$ R CMD BATCH my_script.R # execute from shell
- Large-scale Array Analysis
Sample script to perform large-scale expression array analysis with complex queries: lsArray.R.
To demo what the script does, run it like this:
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/lsArray.R")
- Graphical Procedures: Feature Map Example
Script to plot feature maps of genes or chromosomes: featureMap.R.
To demo what the script does, run it like this:
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/featureMap.txt")
- Sequence Analysis Utilities: Sequence Batch Import, Sub-setting, Pattern Matching, AA Composition, NEEDLE, PHYLIP, etc.
The script 'sequenceAnalysis.R' demonstrates how R can be used as a powerful tool for managing and analyzing large sets of biological sequences. This example also shows how easy it is to integrate R with the EMBOSS project or other external programs.
The script provides the following functionality:
- Batch sequence import into R data frame
- Motif searching with hit statistics
- Analysis of sequence composition
- All-against-all sequence comparisons
- Generation of phylogenetic trees
To demonstrate the utilities of the script, users can simply execute it from R with the following source command:
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/sequenceAnalysis.txt")
- Pattern Matching and Positional Parsing of Sequences
Functions for importing sequences into R, retrieving reverse and complement of nucleotide sequences, pattern searching, positional parsing and exporting search results in HTML format: patternSearch.R.
To demo what the script does, run it like this:
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/patternSearch.R")
- Identify Over-Represented Strings in Sequence Sets
Functions for finding over-represented words in sets of DNA, RNA or protein sequences: wordFinder.R.
To demo what the script does, run it like this:
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/wordFinder.R")
- Translate DNA into Protein
Script 'translateDNA.R' for translating NT sequences into AA sequences (required codon table).
To demo what the script does, run it like this:
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/translateDNA.R")
- Vector and Phred Score Trimming of Sequences
Sample script for vector and Phred score trimming of sequences: VectorPhredTrim.R.
- Gene Ontology Analysis
Write a script to perform hypergeometric distribution tests against all GO nodes including goSlim analysis.
To demo what the script does, run it like this:
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/GOHyperGAll.txt")
- KEGG Pathway Analysis
Write a script that performs KEGG pathway analysis. This is a sample script that maps Arabidopsis locus IDs to KEGG pathways: KEGG Pathway Script. It can be easily adjusted to any organism.
To demo what the script does, run it like this:
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/KEGG_ATH.txt")
- Subsetting of Structure Definition Files (SDF)
Script for importing and subsetting SDF files: sdfSubset.R:
To demo what the script does, run it like this:
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/sdfSubset.R")
- Managing Latex BibTeX Databases
Script for importing BibTeX databases into R, retrieving the individual references with a full-text search function and viewing the results in R or in HubMed: BibTex.R.
To demo what the script does, run it like this:
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/BibTex.R")
- Loan Payments and Amortization Tables
This script calculates monthly and annual mortgage or loan payments, generates amortization tables and plots the results: mortgage.R.
To demo what the script does, run it like this:
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/mortgage.R")
- Course Assignment: GC Content, Reverse & Complement
Apply the above information to write a function that calculates for a set of DNA sequences their GC content and generates their reverse and complement. Here are some useful commands that can be incorporated in this function:
- Generate an example data frame with ID numbers and DNA sequences
fx <- function(test) {
x <- as.integer(runif(20, min=1, max=5))
x[x==1] <- "A"; x[x==2] <- "T"; x[x==3] <- "G"; x[x==4] <- "C"
paste(x, sep = "", collapse ="")
}
z1 <- c()
for(i in 1:50) {
z1 <- c(fx(i), z1)
}
z1 <- data.frame(ID=1:length(z1), Seq=z1)
print(z1)
- Write each character of sequence into separate vector field and reverse its order
my_split <- strsplit(as.character(z1[1,2]),"")
my_rev <- rev(my_split[[1]])
paste(my_rev, collapse="")
- Generate the sequence complement by replacing G|C|A|T by C|G|T|A
- Use 'apply' or 'for loop' to apply the above operations to all sequences in sample data frame 'z1'
- Calculate in the same loop the GC content for each sequence using the following command
table(my_split[[1]])/length(my_split[[1]])