The Data Science Lab

Program-Defined Functions in R

The three most common open source technologies for writing data science programs are Python, SciLab, and R. Here's how to write program-defined functions in R.

There's no clear definition of the term data science. I think of data science as the process of programmatically analyzing data using classical statistics techniques or making predictions using machine learning techniques. Among my developer colleagues, the three most common ways to perform data science tasks with open source tools are using the R language, using the Python language, and using the SciLab (or roughly equivalent Octave) integrated system. In this article I present a short tutorial on writing program-defined functions in the R language.

Whenever I'm learning about program-defined functions in a new language, I want to know seven things: What's the basic syntax and return mechanism? Are parameters passed by value, by reference or both? Does the language support default parameter values? Does the language support function overloading? Does the language support a variable number of arguments? Does the language support recursion? Does the language support nested definitions?

The demo program shown running in Figure 1 illustrates each of these seven topics and gives you an idea of where this article is headed. As you'll see:

  1. Basic R function syntax resembles C# and uses the "function" and "return" keywords.
  2. R function parameters are passed by value, not by reference.
  3. R supports default parameter values using the "=" assignment operator.
  4. R does not support C# style function name overloading.
  5. R supports a variable number of function arguments using the "..." token.
  6. R supports recursive function definitions.
  7. R supports nested function definitions with the "<<-" assignment operator.
Figure 1. Program-Defined Functions in R Demo

Installing R
If you're new to R and want to try out the language, the good news is that installing (and uninstalling) R is simple. You have several alternatives, including the recently released Microsoft R Server, but for simplicity I recommend using the base R system. Search the Web for "install R" and you'll find a link to https://cran.r-project.org/bin/windows/base/. Navigate to that page and click on the Download link at the top of the page to launch a self-extracting executable installer (see Figure 2).

Figure 2. Installing R

You can accept all the installation defaults. After installation finishes, go to C:\Program Files\R-3.x.x\bin\x64 and then double-click on the Rgui.exe file to launch an R Console shell like the one shown on the left side of Figure 1.

On the top menu bar, click File | New Script. That action will launch an R Editor window like the one on the right side of Figure 1. You write your R program (technically a script because R is interpreted) in the Editor window. You call the program by issuing a "source" command in the Console window. Program output is displayed in the Console window.

The Demo Program
The entire R demo program is presented in Listing 1. To run the program, copy and paste the code into the Editor window. With focus set to the Editor window, on the Console window menu, click File | Save As, then navigate to any convenient directory (I used the rather wordy C:\ProgramDefinedFunctionsWithR) and save the script there as functions.R.

Listing 1: Program-Defined Functions in R Demo Code
# functions.R
# program-defined functions examples

# basic syntax
my.sum = function(x, y) {
  result <- x + y
  return(result)
}

# return value is last expression
my.sumterse = function(x, y) {
  x + y
}

# 'void' function
my.printvec = function(vec, dec) {
  cat("[ ")
  for (i in 1:length(vec)) {
    x <- formatC(vec[i], digits = dec, format="f")
    cat(x, " ")
  }
  cat("]\n")
}
  
# C#-style overloading not supported
# my.sum = function(x, y, z) { . . }
# would silently replace the existing my.sum()

# arguments are val not ref --
# my.inc = function(arr) {
#   for (i in 1:length(arr)) {
#     arr[i] <- arr[i] + 1
#   }
# }
# does not work

# default parameter value
my.prod = function(x, y, z=10) {
  result <- x * y * z
  return(result)
}

# missing parameter value
my.prod2 = function(x, y, z) {
  if (missing(z))
    return(x * y * 10)
  else
    return(x * y * z)
}

# return two values as an array
my.sumdiff = function(x, y) {
  res1 <- x + y
  res2 <- x - y
  # result = c(res1, res2) # vector
  result <- array(0.0, 2)
  result[1] <- res1; result[2] <- res2
  return(result)
}

# return two values as a list
my.divide = function(x, y) {
  if (y == 0) {
    res = list("result" = NULL, "msg" = "error")
  }
  else {
    res = list("result" = x/y, "msg" = "success")
  }
  return(res)
}

# variable number parameters
my.multiprod = function(...) {
  vals <- list(...)
  result <- 1
  for (key in names(vals)) {
    result <- result * vals[[key]]
  }
  return(result)
}

# recursion
my.qsort = function(arr) {
  n <- length(arr)
  if (n > 1) {
    pv <- arr[n %/% 2]
    left <- my.qsort(arr[arr < pv])
    mid <- arr[arr == pv]
    right <-  my.qsort(arr[arr > pv])
    return(c(left, mid, right))
  }
  else return(arr)
}

# nested definition
my.bsort = function(arr) {
  # -----
  my.swap = function(ii, jj) {
    tmp <- arr[ii]  # local temp; only arr needs "<<-"
    arr[ii] <<- arr[jj]
    arr[jj] <<- tmp
  }
  # -----
  n <- length(arr)
  repeat {
    swapped <- FALSE
    for (i in 1:(n-1)) {
      if (arr[i] > arr[i+1]) {
        my.swap(i, i+1)
        swapped <- TRUE
      }
    }
    if (swapped == FALSE) break
  }
  return(arr)
}

# ========

cat("\nBegin program-defined functions demo \n\n")

x <- 5.1
y <- 3
z <- 2.0

cat("x, y, z = ", x, ",", y, ",", z, "\n\n")

sum <- my.sum(x, y)
cat("Result of my.sum(x,y) = ", sum, "\n\n")

vec <- c(3.14, 2/3, 1.2345)
cat("Vector vec = ", vec, "\n")
cat("Result of my.printvec(vec, 3) : ", "\n")
my.printvec(vec, 3)
# my.printvec(vec, dec=3)
cat("\n")

prod <- my.prod(x, y) # missing z
cat("Result of my.prod(x,y) = ", prod, "\n\n")

sumdiff <- my.sumdiff(x, y)
cat("Result of my.sumdiff(x,y)= ",
  sumdiff, "\n\n")

myd <- my.divide(x, y)
cat("Result of my.divide(x,y) = ", myd[[1]],
  myd[[2]], "\n\n")
# cat("Result of my.divide(x,y) = ", myd$result,
#    myd$msg, "\n\n")
myd <- my.divide(x, 0)
cat("Result of my.divide(x,0) = ", myd[[1]],
  myd[[2]], "\n\n")

mymp <- my.multiprod(a=3, b=5, c=7)
cat("Result of my.multiprod(a=3, b=5, c=7) = ",
  mymp, "\n\n")
 
vec <- c(4.4, 9.9, 2.2, 3.3, 0.0, 5.5, 8.8,
  1.1, 7.7, 6.6)
cat("Vector vec = \n")
cat(vec, "\n")
svec <- my.qsort(vec)
cat("Result of my.qsort(vec) : \n")
cat(svec, "\n\n")

vec <- c(4.4, 9.9, 2.2, 3.3, 0.0, 5.5, 8.8,
  1.1, 7.7, 6.6)
cat("Vector vec = \n")
cat(vec, "\n")
svec <- my.bsort(vec)
cat("Result of my.bsort(vec) : \n")
cat(svec, "\n")

cat("\nEnd R functions demo \n\n") 

After saving the demo script, give focus to the Console window. Enter the command setwd("C:\\ProgramDefinedFunctionsWithR") to point the working directory to the location of your script. Then enter the command source("functions.R") to execute the program.

Basic Function Syntax
The demo program begins by defining a simple R function that returns the sum of two numeric values in order to illustrate basic syntax:

my.sum = function(x, y) {
  result <- x + y
  return(result)
}

I named the function my.sum rather than just sum because R allows you to overwrite built-in functions. In other words, if I had named my function sum, it would have masked the built-in sum function that adds up the values in an array or vector. Because R has many hundreds of built-in functions, you should try to make your program-defined function names different from built-in function names. Prepending program-defined function names with "my." is my personal preference, but is not a standard convention.
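As a small sketch of what happens when a program-defined function reuses a built-in name, the code below defines a deliberately wrong sum function. It is purely illustrative and is wrapped in local() so the definition does not leak into your session. The built-in version remains reachable through the base:: prefix.

```r
# Illustrative only: reuse the name "sum" inside a local scope
res <- local({
  sum <- function(x, y) { x - y }   # deliberately wrong, to make the masking visible
  masked <- sum(5, 3)               # calls the local definition
  builtin <- base::sum(5, 3)        # the built-in is still reachable by namespace
  c(masked = masked, builtin = builtin)
})
print(res)
```

Outside the local() block, the name sum still refers to the built-in function.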

In R it's common to use the dot character in function and variable names to make them more readable (most languages, including C#, use the underscore character for better readability). In R the "=" and "<-" assignment operators are usually interchangeable; I prefer to use "=" in the function signature and "<-" in the function body.

In this example I use the "return" keyword. Interestingly, by default R functions return the last expression in a function. Therefore, the function could've been written as:

my.sumterse = function(x, y) {
  x + y
}

If I'm defining a function interactively on-the-fly I'll sometimes omit the return keyword, but I think code is more readable and less error-prone with the return keyword. The code that calls the function is:

x <- 5.1
y <- 3
sum <- my.sum(x, y)
cat("Result of my.sum(x,y) = ", sum, "\n\n")

The R calling mechanism is straightforward and closely resembles that of other C-family languages.

It's possible to write R functions that are called only for their side effects rather than for a return value. The demo program has such a "void" function, which prints the values in a vector using a specified number of decimals:

my.printvec = function(vec, dec) {
  cat("[ ")
  for (i in 1:length(vec)) {
    x <- formatC(vec[i], digits = dec, format="f")
    cat(x, " ")
  }
  cat("]\n")
}

The demo code that calls function my.printvec is:

vec <- c(3.14, 2/3, 1.2345)
cat("Vector vec = ", vec, "\n")
cat("Result of my.printvec(vec, 3) : ", "\n")
my.printvec(vec, 3)

R supports named parameter calls, so the function could have been called as:

my.printvec(vec, dec=3)

Although using a named parameter call in this example doesn't improve readability much, many built-in R functions have a large number of parameters and using named parameters can greatly improve code readability.
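A brief sketch of named parameter calls, using the demo's my.printvec function (repeated here so the snippet is self-contained; the vector v and the particular formatC arguments are just example choices). Named arguments can appear in any order, and positional arguments fill whatever parameters remain.

```r
my.printvec = function(vec, dec) {
  cat("[ ")
  for (i in seq_along(vec)) {
    cat(formatC(vec[i], digits = dec, format = "f"), " ")
  }
  cat("]\n")
}

v <- c(3.14, 2/3, 1.2345)
my.printvec(dec = 2, vec = v)   # both named, order reversed
my.printvec(v, dec = 2)         # positional first argument, named second

# A built-in with many parameters reads more clearly with names
s <- formatC(2/3, width = 8, digits = 3, format = "f")
cat("s = [", s, "]\n", sep = "")
```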

Function Overloading and Argument Pass by Value
R doesn't support C#-style function name overloading. For example, because the demo program defines a function my.sum(x, y), defining a second function my.sum(x, y, z) doesn't create an overload. Instead, the new definition silently replaces the old one, so a later call that passes only two arguments will typically fail with a missing-argument error. Although R doesn't support explicit function overloading, you can get similar behavior by using the default parameters and the variable number of parameters mechanisms described in this article.
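One possible way to approximate an overloaded sum with a default parameter value is sketched below. The name my.sum3 is hypothetical and is not part of the demo program.

```r
# Hypothetical helper: one function covers the two- and three-argument cases
my.sum3 = function(x, y, z = 0) {
  return(x + y + z)
}

two <- my.sum3(5.1, 3)        # z falls back to its default of 0
three <- my.sum3(5.1, 3, 2)   # z supplied explicitly
cat("two-arg sum =", two, " three-arg sum =", three, "\n")
```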

In R, parameters are passed by value, not by reference. For example, consider this (incorrect) function definition that attempts to add 1.0 to each value of an R array:

my.inc = function(arr) {
  for (i in 1:length(arr)) {
    arr[i] <- arr[i] + 1
  }
}

Then a call like this:

a <- array(0.0, 3)
a[1] <- 1.1; a[2] <- 5.5; a[3] <- 7.7
my.inc(a)

would leave array a unchanged. One way to simulate the desired behavior is to make the function return a value like so:

my.inc = function(arr) {
  for (i in 1:length(arr)) {
    arr[i] <- arr[i] + 1
  }
  return(arr)
} 

And then assign the return value by calling the function like so:

a <- my.inc(a)

Another consequence of pass by value is that R does not have C#-style out or ref parameters. You can simulate out and ref parameters by returning multiple values in an array, vector or list, as I'll demonstrate shortly.
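The pass-by-value behavior can be checked directly. In this sketch the function returns its modified local copy, and the caller's vector is compared before and after the call (the variable names a and b are arbitrary).

```r
my.inc = function(arr) {
  for (i in seq_along(arr)) {
    arr[i] <- arr[i] + 1   # changes the function's local copy only
  }
  return(arr)
}

a <- c(1.1, 5.5, 7.7)
b <- my.inc(a)     # a is untouched; b holds the incremented copy
cat("a =", a, "\n")
cat("b =", b, "\n")
```

To update a itself, assign the result back with a <- my.inc(a).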

Default Parameter Values and Missing Parameters
The demo program illustrates R default function parameter values by defining this function:

my.prod = function(x, y, z=10) {
  result <- x * y * z
  return(result)
}

Function my.prod returns the product of three numeric values. If only the first two arguments are passed, parameter z takes its default value of 10. For example, the call:

val <- my.prod(3.0, 4.0, 2.0)

returns 3.0 * 4.0 * 2.0 = 24.0. But the call:

val <- my.prod(3.0, 4.0)

returns 3.0 * 4.0 * 10 = 120.0.

Many of the built-in R functions have a large number of parameters, and the parameters often have default values. This design approach allows you to make simplified calls to the functions. For example, the built-in formatC function has 14 parameters. Only the first parameter, the value to format, is required, and the remaining 13 parameters have default values. This allows you to write code like:

x <- formatC(3.14, width=6)
cat("x = ", x, "\n")

As a very general rule of thumb, if you're writing an R function for one-time use in a program, there's little advantage to generalizing the heck out of the function by adding lots of unnecessary parameters with default values. Default parameter values are most useful when you're writing library functions and you're not sure how the functions might be called.

The R language allows you to deal with missing parameter values. For example, the demo program defines a function my.prod2 like so:

my.prod2 = function(x, y, z) {
  if (missing(z))
    return(x * y * 10)
  else
    return(x * y * z)
}

Here, the built-in missing function returns TRUE if there's no argument corresponding to parameter z, FALSE otherwise. Using the built-in missing function gives you more flexibility than using a default parameter value, at the expense of a slight increase in complexity. In the programming scenarios I work with, I don't use the R missing parameter mechanism very often.
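One example of the extra flexibility: with missing(), a function can tell whether the caller omitted an argument or explicitly passed a value that happens to equal the fallback. The sketch below uses a hypothetical function my.prod3 (not part of the demo program) that reports where z came from.

```r
# Hypothetical example: record whether z was supplied by the caller
my.prod3 = function(x, y, z) {
  if (missing(z)) {
    z <- 10
    src <- "default"
  } else {
    src <- "caller"
  }
  return(list(value = x * y * z, source = src))
}

r1 <- my.prod3(3, 4)       # z omitted
r2 <- my.prod3(3, 4, 10)   # z passed explicitly, same numeric result
cat(r1$value, r1$source, "\n")
cat(r2$value, r2$source, "\n")
```

A plain default value (z=10) cannot distinguish these two calls.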

Returning Multiple Values
Because R parameters are passed by value, you can't write an R function that has C#-style out-parameters. When you want a function to return multiple values, you can return them in an array, a vector, or a list. For example, the demo program defines a function my.sumdiff (see Listing 1) that returns the sum and the difference of its two arguments in a two-element array.
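As a rough sketch of the idea, the vector version below mirrors the commented-out c(res1, res2) alternative in Listing 1. The named-list variant, my.sumdiff2, is a hypothetical addition for comparison and is not part of the demo program.

```r
# Return two values in a numeric vector (accessed by position)
my.sumdiff = function(x, y) {
  return(c(x + y, x - y))
}

# Hypothetical variant: a named list, handy when the values differ in type
my.sumdiff2 = function(x, y) {
  return(list(sum = x + y, diff = x - y))
}

res.vec <- my.sumdiff(5.1, 3)
res.list <- my.sumdiff2(5.1, 3)
cat("sum =", res.vec[1], " diff =", res.vec[2], "\n")
cat("sum =", res.list$sum, " diff =", res.list$diff, "\n")
```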

