
GRAPHICS

R has a very rich set of graphics facilities. The R home page (http://www.r-project.org/) has a few colorful examples, but to really appreciate R's graphical power, browse through the R Graph Gallery at http://addictedtor.free.fr/graphiques. In this chapter, we cover the basics of using R's base, or traditional, graphics package. This will give you enough foundation to start working with graphics in R.

Creating Graphs

To begin, we’ll look at the foundational function for creating graphs: plot(). Then we’ll explore how to build a graph, from adding lines and points to attaching a legend.

The Workhorse of R Base Graphics: The plot() Function

The plot() function forms the foundation for much of R's base graphing operations, serving as the vehicle for producing many different kinds of graphs. As mentioned earlier, plot() is a generic function, or a placeholder for a family of functions. The function that is actually called depends on the class of the object on which it is called. Let's see what happens when we call plot() with an X vector and a Y vector, which are interpreted as a set of pairs in the (x,y) plane.

> plot(c(1,2,3), c(1,2,4))

This will cause a window to pop up, plotting the points (1,1), (2,2), and (3,4), as shown in Figure 5-1. As you can see, this is a very plain-Jane graph. We’ll discuss adding some of the fancy bells and whistles later in the chapter.

Figure 5-1: Simple point plot

NOTE The points in the graph in Figure 5-1 are denoted by empty circles. If you want to use a different character type, specify a value for the named argument pch (for point character).
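For instance, this slight variation on the earlier call (a quick sketch; pch=19 is the code for solid circles) fills in the points:

> plot(c(1,2,3), c(1,2,4), pch=19)  # same plot, but with solid circles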

The plot() function works in stages, which means you can build up a graph in stages by issuing a series of commands. For example, as a base, we might first draw an empty graph, with only axes, like this:

> plot(c(-3,3), c(-1,5), type="n", xlab="x", ylab="y")

This draws axes labeled x and y. The horizontal (x) axis ranges from -3 to 3. The vertical (y) axis ranges from -1 to 5. The argument type="n" means that there is nothing in the graph itself.

Adding Lines: The abline() Function

We now have an empty graph, ready for the next stage, which is adding a line:

> x <- c(1,2,3)

> y <- c(1,3,8)

> plot(x,y)

> lmout <- lm(y ~ x)

> abline(lmout)

After the call to plot(), the graph will simply show the three points, along with the x- and y-axes with hash marks. The call to abline() then adds a line to the current graph. Now, which line is this? The result of the call to the linear-regression function lm() is a class instance containing the slope and intercept of the fitted line, as well as various other quantities that don't concern us here. We've assigned that class instance to lmout. The slope and intercept will now be in lmout$coefficients. So, what happens when we call abline()? This function simply draws a straight line, with the function's arguments treated as the intercept and slope of the line. For instance, the call abline(c(2,1)) draws this line on whatever graph you've built up so far:

y = 2 + 1 · x

But abline() is written to take special action if it is called on a regression object (though, surprisingly, it is not a generic function). Thus, it will pick up the slope and intercept it needs from lmout$coefficients and plot that line. It superimposes this line onto the current graph, the one that graphs the three points. In other words, the new graph will show both the points and the line, as in Figure 5-2.

Figure 5-2: Using abline()

You can add more lines using the lines() function. Though there are many options, the two basic arguments to lines() are a vector of x-values and a vector of y-values. These are interpreted as (x,y) pairs representing points to be added to the current graph, with lines connecting the points. For instance, if X and Y are the vectors (1.5,2.5) and (3,3), you could use this call to add a line from (1.5,3) to (2.5,3) to the present graph:

> lines(c(1.5,2.5),c(3,3))

If you want the lines to "connect the dots," but don't want the dots themselves, include type="l" in your call to lines() or to plot(), as follows:

> plot(x,y,type="l")

You can use the lty parameter in plot() to specify the type of line, such as solid or dashed. To see the types available and their codes, enter this command:

> help(par)
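For example, lty=2 requests a dashed line. A small sketch, reusing the x and y vectors from above:

> plot(x, y, type="l", lty=2)  # connect the points with a dashed line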

Starting a New Graph While Keeping the Old Ones

Each time you call plot(), directly or indirectly, the current graph window will be replaced by the new one. If you don’t want that to happen, use the command for your operating system:

  • On Linux systems, call X11().
  • On a Mac, call quartz().
  • On Windows, call windows().

For instance, suppose you wish to plot two histograms of vectors X and Y and view them side by side. On a Linux system, you would type the following:

> hist(x)

> x11()

> hist(y)

 Extended Example: Two Density Estimates on the Same Graph

Let’s plot nonparametric density estimates (these are basically smoothed histograms) for two sets of examination scores in the same graph. We use the function density() to generate the estimates. Here are the commands we issue:

> d1 = density(testscores$Exam1,from=0,to=100)

> d2 = density(testscores$Exam2,from=0,to=100)

> plot(d1,main="",xlab="")

> lines(d2)

First, we compute nonparametric density estimates from the two variables, saving them in objects d1 and d2 for later use. We then call plot() to draw the curve for exam 1, at which point the plot looks like Figure 5-3. We then call lines() to add exam 2’s curve to the graph, producing Figure 5-4.

Figure 5-3: Plot of first density

Figure 5-4: Addition of second density

   

Note that we asked R to use blank labels for the figure as a whole and for the x-axis. Otherwise, R would have gotten such labels from d1, which would have been specific to exam 1. Also note that we needed to plot exam 1 first. The scores there were less diverse, so the density estimate was narrower and taller. Had we plotted exam 2, with its shorter curve, first, exam 1’s curve would have been too tall for the plot window. Here, we first ran the two plots separately to see which was taller, but let’s consider a more general situation.

Say we wish to write a broadly usable function that will plot several density estimates on the same graph. For this, we would need to automate the process of determining which density estimate is tallest. To do so, we would use the fact that the estimated density values are contained in the y component of the return value from the call to density(). We would then call max() on each density estimate and use which.max() to determine which density estimate is the tallest. The call to plot() both initiates the plot and draws the first curve. (Without specifying type="l", only the points would have been plotted.) The call to lines() then adds the second curve.
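Here is a minimal sketch of such a function. The name plotdensities() and the choice to pass the estimates as a list are ours, not part of the original example:

# plots several density estimates, supplied in the list dlist, on one graph
plotdensities <- function(dlist) {
   peaks <- sapply(dlist, function(d) max(d$y))  # peak height of each estimate
   tallest <- which.max(peaks)
   # draw the tallest curve first, so the others fit inside its plot window
   plot(dlist[[tallest]], main="", xlab="")
   for (i in seq_along(dlist))
      if (i != tallest) lines(dlist[[i]])
}

For the exam data above, plotdensities(list(d1,d2)) would produce the graph in Figure 5-4 regardless of which exam happened to have the taller curve.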

Extended Example: More on the Polynomial Regression Example

We defined a class “polyreg” that facilitates fitting polynomial regression models. Our code there included an implementation of the generic print() function. Let’s now add one for the generic plot() function:

# polyfit(y,x,maxdeg) fits all polynomials up to degree maxdeg; y is
# vector for response variable, x for predictor; creates an object of
# class "polyreg", consisting of outputs from the various regression
# models, plus the original data
polyfit <- function(y,x,maxdeg) {
   pwrs <- powers(x,maxdeg)  # form powers of predictor variable
   lmout <- list()  # start to build class
   class(lmout) <- "polyreg"  # create a new class
   for (i in 1:maxdeg) {
      lmo <- lm(y ~ pwrs[,1:i])
      # extend the lm class here, with the cross-validated predictions
      lmo$fitted.xvvalues <- lvoneout(y,pwrs[,1:i,drop=F])
      lmout[[i]] <- lmo
   }
   lmout$x <- x
   lmout$y <- y
   return(lmout)
}

# generic print() for an object fits of class "polyreg":  print
# cross-validated mean-squared prediction errors
print.polyreg <- function(fits) {
   maxdeg <- length(fits) - 2  # count lm() outputs only, not $x and $y
   n <- length(fits$y)
   tbl <- matrix(nrow=maxdeg,ncol=1)
   cat("mean squared prediction errors, by degree\n")
   colnames(tbl) <- "MSPE"
   for (i in 1:maxdeg) {
      fi <- fits[[i]]
      errs <- fits$y - fi$fitted.xvvalues
      spe <- sum(errs^2)
      tbl[i,1] <- spe/n
   }
   print(tbl)
}

# generic plot(); plots fits against raw data
plot.polyreg <- function(fits) {
   plot(fits$x,fits$y,xlab="X",ylab="Y")  # plot data points as background
   maxdg <- length(fits) - 2
   cols <- c("red","green","blue")
   dg <- curvecount <- 1
   while (dg < maxdg) {
      prompt <- paste("RETURN for XV fit for degree",dg,"or type degree",
         "or q for quit ")
      rl <- readline(prompt)
      dg <- if (rl == "") dg else if (rl != "q") as.integer(rl) else break
      lines(fits$x,fits[[dg]]$fitted.values,col=cols[curvecount%%3 + 1])
      dg <- dg + 1
      curvecount <- curvecount + 1
   }
}

# forms matrix of powers of the vector x, through degree dg
powers <- function(x,dg) {
   pw <- matrix(x,nrow=length(x))
   prod <- x
   for (i in 2:dg) {
      prod <- prod * x
      pw <- cbind(pw,prod)
   }
   return(pw)
}

# finds cross-validated predicted values; could be made much faster via
# matrix-update methods
lvoneout <- function(y,xmat) {
   n <- length(y)
   predy <- vector(length=n)
   for (i in 1:n) {
      # regress, leaving out ith observation
      lmo <- lm(y[-i] ~ xmat[-i,])
      betahat <- as.vector(lmo$coef)
      # the 1 accommodates the constant term
      predy[i] <- betahat %*% c(1,xmat[i,])
   }
   return(predy)
}

# polynomial function of x, coefficients cfs
poly <- function(x,cfs) {
   val <- cfs[1]
   prod <- 1
   dg <- length(cfs) - 1
   for (i in 1:dg) {
      prod <- prod * x
      val <- val + cfs[i+1] * prod
   }
   return(val)  # return the evaluated polynomial
}

As noted, the only new code is plot.polyreg(). For convenience, the code is reproduced here:

# generic plot(); plots fits against raw data
plot.polyreg <- function(fits) {
   plot(fits$x,fits$y,xlab="X",ylab="Y")  # plot data points as background
   maxdg <- length(fits) - 2
   cols <- c("red","green","blue")
   dg <- curvecount <- 1
   while (dg < maxdg) {
      prompt <- paste("RETURN for XV fit for degree",dg,"or type degree",
         "or q for quit ")
      rl <- readline(prompt)
      dg <- if (rl == "") dg else if (rl != "q") as.integer(rl) else break
      lines(fits$x,fits[[dg]]$fitted.values,col=cols[curvecount%%3 + 1])
      dg <- dg + 1
      curvecount <- curvecount + 1
   }
}

DEBUGGING

Programmers often find that they spend more time debugging a program than actually writing it. Good debugging skills are invaluable. In this chapter, we’ll discuss debugging in R.

Fundamental Principles of Debugging

Though debugging is an art rather than a science, it involves some fundamental principles. Here, we'll look at some debugging best practices.

The Essence of Debugging: The Principle of Confirmation

The principle of confirmation is the essence of debugging. Fixing a buggy program is a process of confirming, one by one, that the many things you believe to be true about the code actually are true. When you find that one of your assumptions is not true, you have found a clue to the location (if not the exact nature) of a bug. Another way of saying this is, “Surprises are good!” For example, say you have the following code:

x <- y^2 + 3*g(z,2)

w <- 28

if (w+q > 0) u <- 1 else v <- 10

Do you think the value of your variable x should be 3 after x is assigned? Confirm it! Do you think the else will be executed, not the if on that third line? Confirm it! Eventually, one of these assertions that you are so sure of will turn out to not confirm. Then you will have pinpointed the likely location of the error, thus enabling you to focus on the nature of the error.

Start Small

At least at the beginning of the debugging process, stick to small, simple test cases. Working with large data objects may make it harder to think about the problem. Of course, you should eventually test your code on large, complicated cases, but start small.

Debug in a Modular, Top-Down Manner

Most good software developers agree that code should be written in a modular manner. Your first-level code should not be longer than, say, a dozen lines, with much of it consisting of function calls. And those functions should not be too lengthy and should call other functions if necessary. This makes the code easier to organize during the writing stage and easier for others to understand when it comes time for the code to be extended. You should debug in a top-down manner, too. Suppose that you have set the debug status of your function f() (that is, you have called debug(f), to be explained shortly) and f() contains this line:

y <- g(x,8)

You should take an “innocent until proven guilty” approach to g(). Do not call debug(g) yet. Execute that line and see if g() returns the value you expect. If it does, then you’ve just avoided the time-consuming process of single-stepping through g(). If g() returns the wrong value, then now is the time to call debug(g).

Antibugging

You may adopt some “antibugging” strategies as well. Suppose you have a section of code in which a variable x should be positive. You could insert this line:

stopifnot(x > 0)

If there is a bug earlier in the code that renders x equal to, say, -12, the call to stopifnot() will bring things to a halt right there, with an error message like this:

Error: x > 0 is not TRUE

(C programmers may notice the similarity to C's assert statement.) After fixing a bug and testing the new code, you might want to keep that code handy so you can check later that the bug did not somehow reappear.

Why Use a Debugging Tool?

In the old days, programmers would perform the debugging confirmation process by temporarily inserting print statements into their code and rerunning the program to see what printed out. For example, to confirm that x = 3 in our previous code, we would insert into our code a statement that printed the value of x and do something similar for the if-else, like this:

x <- y^2 + 3*g(z,2)
cat("x =",x,"\n")
w <- 28
if (w+q > 0) {
   u <- 1
   print("the 'if' was done")
} else {
   v <- 10
   print("the 'else' was done")
}

We would rerun the program and inspect the feedback printed out. We would then remove the print statements and put in new ones to track down the next bug. This manual process is fine for one or two cycles, but it gets really tedious during a long debugging session. And worse, all that editing work distracts your attention, making it harder to concentrate on finding the bug.

So, debugging by inserting print statements into your code is slow, cumbersome, and distracting. If you are serious about programming in any particular language, you should seek a good debugging tool for that language. Using a debugging tool will make it much easier to query the values of variables, check whether the if or the else gets executed, and so on. Moreover, if your bug causes an execution error, debugging tools can analyze it for you, possibly providing major clues as to the source of the error. All of this will increase your productivity substantially.

Using R Debugging Facilities

The R base package includes a number of debugging facilities, and more functional debugging packages are also available. We’ll discuss both the base facilities and other packages, and our extended example will present a fully detailed debugging session.

Single-Stepping with the debug() and browser() Functions

The core of R’s debugging facility consists of the browser. It allows you to single-step through your code, line by line, taking a look around as you go. You can invoke the browser through a call to either the debug() or browser() function. R’s debugging facility is specific to individual functions. If you believe there is a bug in your function f(), you can make the call debug(f) to set the debug status for the function f(). This means that from that point onward, each time you call the function, you will automatically enter the browser at the beginning of the function. Calling undebug(f) will unset the debug status of the function so that entry to the function will no longer invoke the browser.

On the other hand, if you place a call to browser() at some line within f(), the browser will be invoked only when execution reaches that line. You then can single-step through your code until you exit the function. If you believe the bug's location is not near the beginning of the function, you probably don't want to be single-stepping from the beginning, so this approach is more direct. Readers who have used C debuggers such as GDB (the GNU debugger) will find similarity here, but some aspects will come as a surprise. As noted, for instance, debug() is called on the function level, not on the overall program level. If you believe you have bugs in several of your functions, you'll need to call debug() on each one.

It can become tedious to call debug(f) and then undebug(f) when you just want to go through one debugging session for f(). Starting with R 2.10, one can now call debugonce() instead; calling debugonce(f) puts f() into debugging status the first time you execute it, but that status is reversed immediately upon exit from the function.
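For instance, assuming a suspect function f():

> debugonce(f)  # the browser will be entered on the next call to f() only
> f(5)          # single-step through this call; later calls run normally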

Using Browser Commands

While you are in the browser, the prompt changes from > to Browse[d]>. (Here, d is the depth of the call chain.) You may submit any of the following commands at that prompt:

  • n (for next): Tells R to execute the next line and then pause again. Hitting ENTER causes this action, too.
  • c (for continue): This is like n, except that several lines of code may be executed before the next pause. If you are currently in a loop, this command will result in the remainder of the loop being executed and then pausing upon exit from the loop. If you are in a function but not in a loop, the remainder of the function will be executed before the next pause.
  • Any R command: While in the browser, you are still in R's interactive mode and thus can query the value of, say, x by simply typing x. Of course, if you have a variable with the same name as a browser command, you must explicitly call something like print(), as in print(n).
  • where: This prints a stack trace. It displays what sequence of function calls led execution to the current location.
  • Q: This quits the browser, bringing you back to R’s main interactive mode.

Setting Breakpoints

Calling debug(f) places a call to browser() at the beginning of f(). However, this may be too coarse a tool in some cases. If you suspect that the bug is in the middle of the function, it's wasteful to trudge through all the intervening code. The solution is to set breakpoints at certain key locations of your code—places where you want execution to be paused. How can this be done in R? You can call browser directly or use the setBreakpoint() function (with R version 2.10 and later).

Calling browser() Directly

You can set a breakpoint by simply inserting calls to browser() at the places of interest in your code. This has the effect, essentially, of setting breakpoints there. You can make invoking the browser conditional so that it is entered only in specified situations. Use the expr argument to define those situations. For instance, suppose you suspect that your bug arises only when a certain variable s is larger than 1. You could use this code:

browser(s > 1)

The browser will be invoked only if s is larger than 1. The following would have the same effect:

if (s > 1) browser()

Calling the browser directly, rather than entering the debugger via debug(), is very useful in situations in which you have a loop with many iterations and the bug surfaces only after, say, the 50th iteration. If the loop index is i, then you could write this:

if (i > 49) browser()

That way, you would avoid the tedium of stepping through the first 49 iterations!

Using the setBreakpoint() Function

Starting with R 2.10, you can use setBreakpoint() in the format

setBreakpoint(filename,linenumber)

This will result in browser() being called at line linenumber of the source file filename. This is especially useful when you are in the midst of using the debugger, single-stepping through code. Say you are currently at line 12 of your source file x.R and want to have a breakpoint at line 28. Instead of exiting the debugger, adding a call to browser() at line 28, and then re-entering the function, you could simply type this:

> setBreakpoint("x.R",28)

You could then resume execution within the debugger, say by issuing the c command. The setBreakpoint() function works by calling the trace() function, discussed in the next section. Thus, to cancel the breakpoint, you cancel the trace. For instance, if we had called setBreakpoint() at a line in the function g(), we would cancel the breakpoint by typing the following:

> untrace(g)

You can call setBreakpoint() whether or not you are currently in the debugger. If you are not currently running the debugger and you execute the affected function and hit the breakpoint during that execution, you will be put into the browser automatically. This is similar to the case of browser(), but using this approach, you save yourself the trouble of changing your code via your text editor.

Tracking with the trace() Function

The trace() function is flexible and powerful, though it takes some initial effort to learn. We will discuss some of the simpler usage forms here, beginning with the following:

> trace(f,t)

This call instructs R to call the function t() every time we enter the function f(). For instance, say we wish to set a breakpoint at the beginning of the function gy(). We could use this command:

> trace(gy,browser)

This has the same effect as placing the command browser() in our source code for gy(), but it’s quicker and more convenient than inserting such a line, saving the file, and rerunning source() to load in the new version of the file. Calling trace() does not change your source file, though it does change a temporary version of your file maintained by R. It would also be quicker and more convenient to undo, by simply running untrace:

> untrace(gy)

You can turn tracing on or off globally by calling tracingState(), using the argument TRUE to turn it on or FALSE to turn it off.
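For example:

> tracingState(FALSE)  # temporarily suspend all tracing
> tracingState(TRUE)   # turn tracing back on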

Performing Checks After a Crash with the traceback() and debugger() Functions

Say your R code crashes when you are not running the debugger. There is still a debugging tool available to you after the fact. You can do a “post-mortem” by simply calling traceback(). It will tell you in which function the problem occurred and the call chain that led to that function. You can get a lot more information if you set up R to dump frames in the event of a crash:

> options(error=dump.frames)

If you’ve done this, then after a crash, run this command:

> debugger()

You will then be presented with a choice of levels of function calls to view. For each one that you choose, you can take a look at the values of the variables there. After browsing through one level, you can return to the debugger() main menu by hitting N. You can arrange to automatically enter the debugger by writing this code:

> options(error=recover)

Note, though, that if you do choose this automatic route, it will whisk you into the debugger, even if you simply have a syntax error (not a useful time to enter the debugger). To turn off any of this behavior, type the following:

> options(error=NULL)

You’ll see a demonstration of this approach in the next section.

Extended Example: Two Full Debugging Sessions

Now that we’ve looked at R’s debugging tools, let’s try using them to find and fix code problems. We’ll begin with a simple example and then move on to a more complicated one.

Debugging Finding Runs of Ones

First recall our extended example of finding runs of 1s. Here is a buggy version of the code:

findruns <- function(x,k) {

n <- length(x)

runs <- NULL

for (i in 1:(n-k)) {

if (all(x[i:i+k-1]==1)) runs <- c(runs,i)

}

return(runs)

}

Let’s try it on a small test case:

> source("findruns.R")

> findruns(c(1,0,0,1,1,0,1,1,1),2)

[1] 3 4 6 7

The function was supposed to report runs at indices 4, 7, and 8, but it found some indices that it shouldn't have and missed some as well. Something is wrong. Let's enter the debugger and take a look around.

> debug(findruns)

> findruns(c(1,0,0,1,1,0,1,1,1),2)

debugging in: findruns(c(1, 0, 0, 1, 1, 0, 1, 1, 1), 2)

debug at findruns.R#1: {

n <- length(x)

runs <- NULL

for (i in 1:(n - k)) {

if (all(x[i:i + k - 1] == 1))

runs <- c(runs, i)

}

return(runs)

}

attr(,"srcfile")

findruns.R

So, according to the principle of confirmation, let’s first make sure our test vector was received correctly:

Browse[2]> x

[1] 1 0 0 1 1 0 1 1 1

So far, so good. Let’s step through the code a bit. We hit n a couple of times to single-step through the code.

Browse[2]> n

debug at findruns.R#2: n <- length(x)

Browse[2]> n

debug at findruns.R#3: runs <- NULL

Browse[2]> print(n)

[1] 9

Note that after each single step, R tells us which statement would be the next one to execute. In other words, at the time we executed print(n), we had not yet executed the assignment of NULL to runs. Note, too, that although normally you can print out the value of a variable by simply typing its name, we could not do so here for our variable n, because n is also the abbreviation for the debugger's next command. Thus, we needed print(). At any rate, we found that the length of our test vector was 9, confirming what we knew. Now, let's single-step some more, getting into the loop.

Browse[2]> n

debug at findruns.R#4: for (i in 1:(n - k)) {

if (all(x[i:i + k - 1] == 1))

runs <- c(runs, i)

}

Browse[2]> n

debug at findruns.R#4: i

Browse[2]> n

debug at findruns.R#5: if (all(x[i:i + k - 1] == 1)) runs <- c(runs, i)

Since k is 2—that is, we are checking for runs of length 2— the if() statement should be checking the first two elements of x, which are (1,0). Let’s confirm:

Browse[2]> x[i:i + k - 1]

[1] 0

So, it did not confirm. Let’s check that we have the correct subscript range, which should be 1:2. Is it?

Browse[2]> i:i + k - 1

[1] 2

Also wrong. Well, how about i and k? They should be 1 and 2, respec-tively. Are they?

Browse[2]> i

[1] 1

Browse[2]> k

[1] 2

Well, those do confirm. Thus, our problem must be with the expression i:i + k - 1. After some thought, we realize there is an operator precedence problem there, and we correct it to i:(i + k - 1). Is it okay now?

> source("findruns.R")

> findruns(c(1,0,0,1,1,0,1,1,1),2)

[1] 4 7

No, as mentioned, it should be (4,7,8). Let’s set a breakpoint inside the loop and take a closer look.

> setBreakpoint("findruns.R",5)

/home/nm/findruns.R#5:

findruns step 4,4,2 in <environment: R_GlobalEnv>

> findruns(c(1,0,0,1,1,0,1,1,1),2)

findruns.R#5

Called from: eval(expr, envir, enclos)

Browse[1]> x[i:(i+k-1)]

[1] 1 0

Good, we’re dealing with the first two elements of the vector, so our bug fix is working so far. Let’s look at the second iteration of the loop.

Browse[1]> c

findruns.R#5

Called from: eval(expr, envir, enclos)

Browse[1]> i

[1] 2

Browse[1]> x[i:(i+k-1)]

[1] 0 0

That’s right, too. We could go another iteration, but instead, let’s look at the last iteration, a place where bugs frequently arise in loops. So, let’s add a conditional breakpoint, as follows:

findruns <- function(x,k) {

n <- length(x)

runs <- NULL

for (i in 1:(n-k)) {

if (all(x[i:(i+k-1)]==1)) runs <- c(runs,i)

if (i == n-k) browser() # break in last iteration of loop

}

return(runs)

}

And now run it again.

> source(“findruns.R”)

> findruns(c(1,0,0,1,1,0,1,1,1),2)

Called from: findruns(c(1, 0, 0, 1, 1, 0, 1, 1, 1), 2)

Browse[1]> i

[1] 7

This shows the last iteration was for i = 7. But the vector is nine elements long, and k = 2, so our last iteration should be i = 8. Some thought then reveals that the range in the loop should have been written as follows:

for (i in 1:(n-k+1)) {

By the way, note that the breakpoint that we set using setBreakpoint() is no longer valid, now that we've replaced the old version of the object findruns. Subsequent testing (not shown here) indicates the code now works. Let's move on to a more complex example.

Debugging Finding City Pairs

Recall our code, which found the pair of cities with the closest distance between them. Here is a buggy version of that code:

# returns the minimum value of d[i,j], i != j, and the row/col attaining
# that minimum, for square symmetric matrix d; no special policy on ties;
# motivated by distance matrices

mind <- function(d) {

   n <- nrow(d)

    # add a column to identify row number for apply()

   dd <- cbind(d,1:n)

   wmins <- apply(dd[-n,],1,imin)

    # wmins will be 2xn, 1st row being indices and 2nd being values

   i <- which.min(wmins[1,])

   j <- wmins[2,i]

   return(c(d[i,j],i,j))

}

# finds the location, value of the minimum in a row x

imin <- function(x) {

   n <- length(x)

   i <- x[n]

   j <- which.min(x[(i+1):(n-1)])

   return(c(j,x[j]))

}

Let’s use R’s debugging tools to find and fix the problems. We’ll run it first on a small test case:

> source("cities.R")
> m <- rbind(c(0,12,5),c(12,0,8),c(5,8,0))
> m

     [,1] [,2] [,3]

[1,]    0   12    5

[2,]   12    0    8

[3,]    5    8    0

> mind(m)

Error in mind(m) : subscript out of bounds

Not an auspicious start! Unfortunately, the error message doesn’t tell us where the code blew up. But the debugger will give us that information:

> options(error=recover)

> mind(m)

Error in mind(m) : subscript out of bounds

Enter a frame number, or 0 to exit

1: mind(m)

Selection: 1

Called from: eval(expr, envir, enclos)

Browse[1]> where

where 1: eval(expr, envir, enclos)

where 2: eval(quote(browser()), envir = sys.frame(which))

where 3 at cities.R#13: function ()

{

if (.isMethodsDispatchOn()) {

tState <- tracingState(FALSE)

Okay, so the problem occurred in mind() rather than imin() and in particular at line 13. It still could be the fault of imin(), but for now, let's deal with the former.

Since the error occurred with d[i,j], let’s look at those variables:

      Browse[1]> d

               [,1] [,2] [,3]

          [1,]    0   12    5

          [2,]   12    0    8

          [3,]    5    8    0

          Browse[1]> i

          [1] 2

          Browse[1]> j

          [1] 12

This is indeed a problem—d only has three columns, yet j, a column subscript, is 12. Let’s look at the variable from which we gleaned j, wmins:

Browse[1]> wmins

     [,1] [,2]
[1,]    2    1
[2,]   12   12

If you recall how the code was designed, column k of wmins is supposed to contain information about the minimum value in row k of d. So here wmins is saying that in the first row (k = 1) of d, (0,12,5), the minimum value is 12, occurring at index 2. But it should be 5 at index 3. So, something went wrong with this line:

wmins <- apply(dd[-n, ], 1, imin)

There are several possibilities here. But since ultimately imin() is called, we can check them all from within that function. So, let's set the debug status of imin(), quit the debugger, and rerun the code.

Browse[1]> Q

> debug(imin)

> mind(m)

debugging in: FUN(newX[, i], …)

debug at cities.R#17: {

n <- length(x)

i <- x[n]

j <- which.min(x[(i + 1):(n - 1)])

return(c(j, x[j]))

}

So, we’re in imin(). Let’s see if it properly received the first row of dd, which should be (0,12,5,1).

Browse[4]> x

[1]  0 12  5  1

It’s confirmed. This seems to indicate that the first two arguments to apply() were correct and that the problem is instead within imin(), though that remains to be seen. Let’s single-step through, occasionally typing confirmational queries:

Browse[2]> n

debug at cities.r#17: n <- length(x)

Browse[2]> n

debug at cities.r#18: i <- x[n]

Browse[2]> n

debug at cities.r#19: j <- which.min(x[(i + 1):(n - 1)])

Browse[2]> n

debug at cities.r#20: return(c(j, x[j]))

Browse[2]> print(n)

[1] 4

Browse[2]> i

[1] 1

Browse[2]> j

[1] 2

Recall that we designed our call which.min(x[(i + 1):(n - 1)]) to look only at the above-diagonal portion of this row. This is because the matrix is symmetric and because we don't want to consider the distance between a city and itself. But the value j = 2 does not confirm. The minimum value in (0,12,5) is 5, which occurs at index 3 of that vector, not index 2. Thus, the problem is in this line:

j <- which.min(x[(i + 1):(n - 1)])

What could be wrong? After taking a break, we realize that although the minimum value of (0,12,5) occurs at index 3 of that vector, that is not what we asked which.min() to find for us. Instead, that i + 1 term means we asked for the index of the minimum in (12,5), which is 2. We did ask which.min() for the correct information, but we failed to use it correctly, because we do want the index of the minimum in (0,12,5). We need to adjust the output of which.min() accordingly, as follows:

j <- which.min(x[(i+1):(n-1)])

k <- i + j

return(c(k,x[k]))
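Folding that fix into the function, imin() would now read as follows:

imin <- function(x) {
   n <- length(x)
   i <- x[n]
   j <- which.min(x[(i+1):(n-1)])
   k <- i + j       # convert to an index within the full row
   return(c(k,x[k]))
}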

We make the fix and try again.

> mind(m)

Error in mind(m) : subscript out of bounds

Enter a frame number, or 0 to exit

1: mind(m)

Selection:

Oh no, another bounds error! To see where the blowup occurred this time, we issue the where command as before, and we find it was at line 13 again. What about i and j now?

Browse[1]> i

[1] 1

Browse[1]> j

[1] 5

The value of j is still wrong; it cannot be larger than 3, as we have only three columns in this matrix. On the other hand, i is correct. The overall minimum value in dd is 5, occurring in row 1, column 3. So, let’s check the source of j again, the matrix wmins:

Browse[1]> wmins

     [,1] [,2]
[1,]    3    3
[2,]    5    8

Well, there are the 3 and 5 in column 1, just as should be the case. Remember, column 1 here contains the information for row 1 in d, so wmins is saying that the minimum value in row 1 is 5, occurring at index 3 of that row, which is correct. After taking another break, though, we realize that while wmins is correct, our use of it isn’t. We have the rows and columns of that matrix mixed up. This code:

i <- which.min(wmins[1,])

j <- wmins[2,i]

should be like this:

i <- which.min(wmins[2,])

j <- wmins[1,i]

After making that change and resourcing our file, we try it out.

> mind(m)

 [1] 5 1 3

This is correct, and subsequent tests with larger matrices worked, too.

Moving Up in the World: More Convenient Debugging Tools

As just seen, R's debugging tools are effective. However, they're not very convenient. Fortunately, there are various tools that make the process easier. In approximate chronological order of development, they are as follows:

  • The debug package by Mark Bravington
  • My edtdbg package, which works with the Vim and Emacs text editors
  • Vitalie Spinu’s ess-tracebug, which runs under Emacs (with the same goals as edtdbg but with more Emacs-specific features)
  • REvolution Analytics’ Integrated Development Environment (IDE)

All of these tools are cross-platform, working on Linux, Windows, and Mac systems, with the exception of the REvolution Analytics product. That IDE is available only on Windows systems with Microsoft Visual Studio. All of the tools are open source or free, again with the exception of the REvolution Analytics product. So, what do these packages have to offer? To me, one of the biggest problems with R’s built-in debugging tools is the lack of a window that shows the big picture—a window displaying your R code with a cursor that moves through the code as you single-step through it. For example, consider this excerpt from our previous browser output:

Browse[2]> n

debug at cities.r#17: n <- length(x)

Browse[2]> n

debug at cities.r#18: i <- x[n]

This is nice, but where are these lines in our code? Most GUI debuggers for other languages feature a window showing the user's source code, with a symbol indicating the next line to be executed. All of the R tools listed at the start of this section remedy this lack in core R. The Bravington debug package creates a separate window for exactly this purpose. The other tools have your text editor double as that window, thus saving space on your screen compared to using debug.

In addition, these tools allow you to set breakpoints and handle other debugging operations without moving your screen’s cursor from the editor window to your R execution window. This is convenient and saves typing as well, greatly enhancing your ability to concentrate on the real task at hand: finding your bugs. Let’s consider the cities example again. I opened the GVim text editor on my source file in conjunction with edtdbg, did some startup for edtdbg, and then hit the [ (left bracket) key twice to single-step twice through the code. The resulting GVim window is shown in Figure 5-1.

Figure 5-1: Source window in edtdbg

First, note that the editor’s cursor is now on this line:

wmins <- apply(dd[-n, ], 1, imin)

This shows the line to be executed next.

Whenever I want to single-step a line, I simply hit the [ key while I am in the editor window. The editor then tells the browser to execute its n command, without my needing to move the mouse to the R execution window, and then the editor moves its cursor to the next line. I can also hit ] for the browser's c command. Each time I execute a line or lines in this manner, the editor's cursor moves right along. Whenever I make a change to my source code, typing ,src (the comma is part of the command) into my GVim window tells R to call source() on it. Each time I want to rerun my code, I hit ,dt. I rarely, if ever, need to move my mouse away from the editor window to the R window and back. In essence, the editor has become my debugger in addition to providing its editing operations.

Ensuring Consistency in Debugging Simulation Code

If you’re doing anything with random numbers, you’ll need to be able to reproduce the same stream of numbers each time you run your program during the debugging session. Without this, your bugs may not be reproducible, making them even harder to fix. The set.seed() function controls this by reinitializing the random number sequence to a given value. Consider this example:

> set.seed(8888)
> runif(3)
[1] 0.5775979 0.4588383 0.8354707
> runif(3)
[1] 0.4155105 0.4852900 0.6591892
> runif(3)
[1] 0.8811480 0.2853269 0.5864738
> set.seed(8888)
> runif(3)
[1] 0.5775979 0.4588383 0.8354707
> set.seed(8888)
> runif(3)
[1] 0.5775979 0.4588383 0.8354707

The call runif(3) generates three random numbers from the uniform distribution on the interval (0,1). Each time we make that call, we will get a different set of three numbers. But with set.seed(), we can start over and get the same sequence of numbers.
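In a debugging session, then, the pattern might look like the following sketch, where sim1() stands in for whatever simulation function you are debugging:

> set.seed(8888)  # fix the random number stream before each run
> debug(sim1)
> sim1(100000)    # same stream every run, so the bug is reproducible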

Syntax and Runtime Errors

The most common syntax errors will be lack of matching parentheses, brackets, braces, or quotation marks. When you encounter a syntax error, this is the first thing you should check and double-check. I highly recommend that you use a text editor that does parentheses matching and syntax coloring for R, such as Vim or Emacs. Be aware that often when you get a message saying there is a syntax error on a certain line, the error may actually be in a much earlier line. This can occur with any language, but R seems especially prone to it.

If it just isn't obvious to you where your syntax error is, I recommend selectively commenting out some of your code, better enabling you to pinpoint the location of the syntax problem. Generally, it helps to follow a binary search approach: Comment out half of your code (being careful to maintain syntax integrity) and see if the same error arises. If it does, it's in the remaining half; otherwise, it's in the half you deleted. Then cut that half in half, and so on. You may sometimes get messages like the following:

There were 50 or more warnings (use warnings() to see the first 50)

These should be heeded—run warnings() as suggested. The problem could range from nonconvergence of an algorithm to misspecification of a matrix argument to a function. In many cases, the program output may be invalid, though it may well be fine, too, say with this message:

Fitted probabilities numerically 0 or 1 occurred in: glm…

In some cases, you may find it useful to issue this command:

> options(warn=2)

This instructs R to turn warnings into actual errors and makes the locations of the warnings easier to find.

Running GDB on R Itself

This section may be of interest to you even if you are not trying to fix a bug in R. For example, you may have written some C code to interface to R and found it to be buggy. In order to run GDB on that C function, you must first run R itself through GDB. Or, you may be interested in the internals of R, say to determine how you can write efficient R code, and wish to explore the internals by stepping through the R source code with a debugging tool such as GDB.

Although you can invoke R through GDB from a shell command line, for our purposes here, we suggest using separate windows for R and GDB. Here’s the procedure:

1. Start R in one window, as usual.

2. In another window, determine the ID number of your R process. In UNIX family systems, for instance, this is obtained by something like ps -a.

3. In that second window, submit GDB's attach command with the R process number.

4. Submit the continue command to GDB.

You can set breakpoints in the R source code either before continuing or by interrupting GDB later with CTRL-C. If, on the other hand, you wish to use GDB to explore the R source code, note the following. The R source code is dominated by S expression pointers (SEXPs), which are pointers to C structs that contain an R variable’s value, type, and so on. You can use the R internal function Rf_PrintValue(s) to inspect SEXP values. For example, if the SEXP is named s, then in GDB, type this:

call Rf_PrintValue(s)

This prints the value.

PERFORMANCE ENHANCEMENT: SPEED AND MEMORY

In computer science curricula, a common theme is the trade-off between time and space. In order to have a fast-running program, you may need to use more memory space. On the other hand, in order to conserve memory space, you might need to settle for slower code. In the R language, this trade-off is of particular interest for the following reasons:

  • R is an interpreted language. Many of the commands are written in C and thus do run in fast machine code. But other commands, and your own R code, are pure R and thus interpreted. So, there is a risk that your R application may run more slowly than you would like.
  • All objects in an R session are stored in memory. More precisely, all objects are stored in R's memory address space. R places a limit of 2^31 - 1 bytes on the size of any object, even on 64-bit machines and even if you have a lot of RAM. Yet some applications do encounter larger objects.

This chapter will suggest ways that you can enhance the performance of your R code, taking into account the time/space trade-off.

Writing Fast R Code

What can be done to make R code faster? Here are the main tools available to you:

  • Optimize your R code through vectorization, use of byte-code compilation, and other approaches.
  • Write the key, CPU-intensive parts of your code in a compiled language such as C/C++.
  • Write your code in some form of parallel R.

To optimize your R code, you need to understand R’s functional programming nature and the way R uses memory.

The Dreaded for Loop

The r-help discussion listserv for R often has questions about how to accomplish various tasks without for loops. There seems to be a feeling that programmers should avoid these loops at all costs. Those who pose the queries usually have the goal of speeding up their code. It's important to understand that simply rewriting code to avoid loops will not necessarily make the code faster. However, in some cases, dramatic speedup may be attained, usually through vectorization.

Vectorization for Speedup

Sometimes, you can use vectorization instead of looping. For example, if x and y are vectors of equal lengths, you can write this:

z <- x + y

This is not only more compact, but even more important, it is faster than using this loop:

for (i in 1:length(x)) z[i] <- x[i] + y[i]

Let’s do a quick timing comparison:

> x <- runif(1000000)

> y <- runif(1000000)

> z <- vector(length=1000000)

> system.time(z <- x + y)

   user  system elapsed

  0.052   0.016   0.068

> system.time(for (i in 1:length(x)) z[i] <- x[i] + y[i])

   user  system elapsed

  8.088   0.044   8.175

What a difference! The version without a loop was more than 120 times faster in elapsed time. While timings may vary from one run to another (a second run of the loop version had elapsed time of 22.958), in some cases, “delooping” R code can really pay off.

It's worth discussing some of the sources of slowdown in the loop version. What may not be obvious to programmers coming to R from other languages is that numerous function calls are involved in the loop version of the previous code:

  • Though syntactically the loop looks innocuous, for() is, in fact, a function.
  • The colon : looks even more innocuous, but it's a function too. For instance, 1:10 is actually the : function called on the arguments 1 and 10:

> ":"(1,10)

[1] 1 2 3 4 5 6 7 8 9 10

  • Each vector subscript operation represents a function call, with calls to [ for the two reads and to [<- in the case of the write.
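For example, here are the subscripting functions called explicitly (the bracket names must be quoted):

> x <- c(5,12,13)
> "["(x, 2)                   # what x[2] actually calls
[1] 12
> x <- "[<-"(x, 2, value=99)  # what x[2] <- 99 actually calls
> x
[1]  5 99 13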

Function calls can be time-consuming, as they involve setting up stack frames and the like. Suffering that time penalty at every iteration of the loop adds up to a big slowdown. By contrast, if we were to write this in C, there would be no function calls. Indeed, that is essentially what happens in our first code snippet. There are function calls there as well, namely a call to + and one to <-, but each is called only once, not 1,000,000 times, as in the loop version. Thus, the first version of the code is much faster. One type of vectorization is vector filtering. For instance, let's rewrite our function oddcount():

oddcount <- function(x) return(sum(x%%2==1))

There is no explicit loop here, and even though R will internally loop through the array, this will be done in native machine code. Again, the anticipated speedup does occur.

> x <- sample(1:1000000,100000,replace=T)

> system.time(oddcount(x))

user system elapsed

0.012 0.000 0.015

> system.time(

+ {

+   c <- 0

+    for (i in 1:length(x))

+              if (x[i] %% 2 == 1) c <- c+1

+    return(c)

+   }

+ )

user system elapsed

0.308 0.000 0.310

You might wonder whether it matters in this case, since even the loop version of the code took less than a second to run. But if this code had been part of an enclosing loop, with many iterations, the difference could be important indeed. Examples of other vectorized functions that may speed up your code are ifelse(), which(), any(), all(), cumsum(), and cumprod(). In the matrix case, you can use rowSums(), colSums(), and so on. In "all possible combinations" types of settings, combn(), outer(), lower.tri(), upper.tri(), or expand.grid() may be just what you need. Though apply() eliminates an explicit loop, it is actually implemented in R rather than C and thus will usually not speed up your code. However, the other apply functions, such as lapply(), can be very helpful in speeding up your code.
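For instance, a small taste of two of them:

> x <- c(5,2,-3,8)
> ifelse(x > 0, 1, -1)          # vectorized if-else
[1]  1  1 -1  1
> rowSums(matrix(1:6, nrow=2))  # row sums without an explicit loop
[1]  9 12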

Extended Example: Achieving Better Speed in a Monte Carlo Simulation

In some applications, simulation code can run for hours, days, or even months, so speedup methods are of high interest. Here, we'll look at two simulation examples. To begin, let's consider the following code:

sum <- 0

nreps <- 100000

for (i in 1:nreps) {

xy <- rnorm(2) # generate 2 N(0,1)s

sum <- sum + max(xy)

}

print(sum/nreps)

Here’s a revision (hopefully faster):

nreps <- 100000

xymat <- matrix(rnorm(2*nreps),ncol=2)

maxs <- pmax(xymat[,1],xymat[,2])

print(mean(maxs))

In this code, we generate all the random variates at once, storing them in a matrix xymat, with one (X,Y) pair per row:

xymat <- matrix(rnorm(2*nreps),ncol=2)

Next, we find all the max(X,Y) values, storing those values in maxs, and then simply call mean(). It's easier to program, and we believe it will be faster.
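The workhorse here is pmax(). Unlike max(), which returns a single overall maximum, pmax() computes element-wise maxima across its arguments:

> max(c(1,5), c(6,2))   # one grand maximum over everything
[1] 6
> pmax(c(1,5), c(6,2))  # pairwise maxima
[1] 6 5

Let's check that. We had the original code in the file MaxNorm.R and the improved version in MaxNorm2.R.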

> system.time(source("MaxNorm.R"))

[1] 0.5667599

   user  system elapsed

  1.700   0.004   1.722

> system.time(source("MaxNorm2.R"))

[1] 0.5649281

   user  system elapsed

  0.132   0.008   0.143

The speedup is dramatic, once again. We attained an excellent speedup in this example, but it was misleadingly easy. Let's look at a slightly more complicated example. Our next example is a classic exercise from elementary probability courses. Urn 1 contains ten blue marbles and eight yellow ones. In urn 2, the mixture is six blue and six yellow. We draw a marble at random from urn 1, transfer it to urn 2, and then draw a marble at random from urn 2. What is the probability that that second marble is blue? This is easy to find analytically, but we'll use simulation. Here is the straightforward way:

# perform nreps repetitions of the marble experiment, to estimate

# P(pick blue from Urn 2)

sim1 <- function(nreps)  {

nb1 <- 10  # 10 blue marbles in Urn 1
n1 <- 18 # number of marbles in Urn 1 at 1st pick
n2 <- 13 # number of marbles in Urn 2 at 2nd pick
count <- 0 # number of repetitions in which get blue from Urn 2

for (i in 1:nreps) {

nb2 <- 6  # 6 blue marbles orig. in Urn 2

 # pick from Urn 1 and put in Urn 2; is it blue?

if (runif(1) < nb1/n1) nb2 <- nb2 + 1

# pick from Urn 2; is it blue?

if (runif(1) < nb2/n2) count <- count + 1

}

 return(count/nreps)  # est. P(pick blue from Urn 2)

}

Here is how we can do it without loops, using apply():

sim2 <- function(nreps)  {

nb1 <- 10

nb2 <- 6

n1 <- 18

n2 <- 13

# pre-generate all our random numbers, one row per repetition

u <- matrix(c(runif(2*nreps)),nrow=nreps,ncol=2)

# define simfun for use in apply(); simulates one repetition

simfun <- function(rw) {

# rw (“row”) is a pair of random numbers

# choose from Urn 1

if (rw[1] < nb1/n1) nb2 <- nb2 + 1

# choose from Urn 2, and return boolean on choosing blue

return (rw[2] < nb2/n2)

}

z <- apply(u,1,simfun)

# z is a vector of booleans but they can be treated as 1s, 0s

return(mean(z))

}

Here, we set up a matrix u with two columns of U(0,1) random variates. The first column is used for our simulation of drawing from urn 1, and the second for drawing from urn 2. This way, we generate all our random numbers at once, which might save a bit of time, but the main point is to set up for using apply(). Toward that goal, our function simfun() works on one repetition of the experiment—that is, one row of u. We set up the call to apply() to go through all of the nreps repetitions.

Note that since the function simfun() is declared within sim2(), the locals of sim2() (n1, n2, nb1, and nb2) are available as globals of simfun(). Also, since a Boolean vector will automatically be changed by R to 1s and 0s, we can find the fraction of TRUE values in the vector by simply calling mean().
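The Boolean-to-numbers point can be seen in isolation:

> z <- c(TRUE, FALSE, TRUE, TRUE)
> mean(z)  # TRUE treated as 1, FALSE as 0
[1] 0.75

Now, let's compare performance.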

> system.time(print(sim1(100000)))

[1] 0.5086

   user  system elapsed

  2.465   0.028   2.586

> system.time(print(sim2(100000)))

[1] 0.5031

   user  system elapsed

  2.936   0.004   3.027

In spite of the many benefits of functional programming, this approach using apply() didn’t help. Instead, things got worse. Since this could be simply due to random sampling variation, we ran the code several times again, with similar results. So, let’s look at vectorizing this simulation.

sim3 <- function(nreps) {

nb1 <- 10

nb2 <- 6

n1 <- 18

n2 <- 13

u <- matrix(c(runif(2*nreps)),nrow=nreps,ncol=2)

# set up the condition vector

cndtn <- u[,1] <= nb1/n1 & u[,2] <= (nb2+1)/n2 |

u[,1] > nb1/n1 & u[,2] <= nb2/n2

return(mean(cndtn))

}

The main work is done in this statement:

cndtn <- u[,1] <= nb1/n1 & u[,2] <= (nb2+1)/n2 |

u[,1] > nb1/n1 & u[,2] <= nb2/n2

To get that, we reasoned out which conditions would lead to choosing a blue marble on the second pick, coded them, and then assigned them to cndtn. Remember that <= and & are functions; in fact, they are vector functions, so they should be fast. Sure enough, this brings quite an improvement:

    > system.time(print(sim3(10000)))

    [1] 0.4987

       user  system elapsed

      0.060   0.016   0.076

In principle, the approach we took to speed up the code here could be applied to many other Monte Carlo simulations. However, it’s clear that the analog of the statement that computes cndtn would quickly become quite complex, even for seemingly simple applications. Moreover, the approach would not work in “infinite-stage” situations, meaning an unlimited number of time steps. Here, we are considering the marble example as being two-stage, with two columns to the matrix u.

Extended Example: Generating a Powers Matrix

We needed to generate a matrix of powers of our predictor variable. We used the following code:

# forms matrix of powers of the vector x, through degree dg

powers1 <- function(x,dg) {

pw <- matrix(x,nrow=length(x))

prod <- x  # current product

for (i in 2:dg) {

prod <- prod * x

pw <- cbind(pw,prod)

}

return(pw)

}

One glaring problem is that cbind() is used to build up the output matrix, column by column. This is very costly in terms of memory-allocation time. It's much better to allocate the full matrix at the beginning, even though it will be empty, as this will mean incurring the cost of only one memory-allocation operation.

# forms matrix of powers of the vector x, through degree dg

powers2 <- function(x,dg) {

pw <- matrix(nrow=length(x),ncol=dg)

prod <- x  # current product

pw[,1] <- prod

for (i in 2:dg) {

prod <- prod * x

pw[,i] <- prod

}

return(pw)

}

And indeed, powers2() is a lot faster.

> x <- runif(1000000)

> system.time(powers1(x,8))

   user  system elapsed

  0.776   0.356   1.334

> system.time(powers2(x,8))

   user  system elapsed

  0.388   0.204   0.593

And yet, powers2() still contains a loop. Can we do better? It would seem that this setting is perfect for outer(), whose call form is

outer(X,Y,FUN)

This call applies the function FUN() to all possible pairs of elements of X and elements of Y. The default value of FUN is multiplication. Here, we can write the following:

powers3 <- function(x,dg) return(outer(x,1:dg,"^"))

For each combination of element of x and element of 1:dg (resulting in length(x) × dg combinations in all), outer() calls the exponentiation function ^ on that combination, placing the results in a length(x) × dg matrix. This is exactly what we need, and as a bonus, the code is quite compact. But is the code faster?

> system.time(powers3(x,8))

user system elapsed

1.336 0.204 1.747

What a disappointment! Here, we’re using a fancy R function, with very compact code, but getting the worst performance of the three functions. And it gets even worse. Here’s what happens if we try to make use of cumprod():

> powers4

function(x,dg) {

   repx <- matrix(rep(x,dg),nrow=length(x))

   return(t(apply(repx,1,cumprod)))

}

> system.time(powers4(x,8))

   user  system elapsed

28.106   1.120  83.255

In this example, we made multiple copies of x, since the powers of a number n are simply cumprod(c(1,n,n,n…)). But in spite of dutifully using two C-level R functions, the performance was disastrous. The moral of the story is that performance issues can be unpredictable. All you can do is be armed with an understanding of the basic issues, vectorization, and the memory aspects explained next and then try various approaches.

Functional Programming and Memory Issues

Most R objects are immutable, or unchangeable. Thus, R operations are implemented as functions that reassign to the given object, a trait that can have performance implications.

Vector Assignment Issues

As an example of some of the issues that can arise, consider this simple-looking statement:

z[3] <- 8

This assignment is more complex than it seems. It is actually implemented via the replacement function "[<-" through this call and assignment:

z <- “[<-“(z,3,value=8)

An internal copy of z is made, element 3 of the copy is changed to 8, and then the resulting vector is reassigned to z. And recall that the latter simply means that z is pointed to the copy. In other words, even though we are ostensibly changing just one element of the vector, the semantics say that the entire vector is recomputed. For a long vector, this would slow down the program considerably. The same would be true for a shorter vector if it were assigned from within a loop of our code. In some situations, R does take some measures to mitigate this impact, but it is a key point to consider when aiming for fast code. You should be mindful of this when working with vectors (including arrays). If your code seems to be running unexpectedly slowly, assignment of vectors should be a prime area of suspicion.

Copy-on-Change Issues

A related issue is that R (usually) follows a copy-on-change policy. For instance, if we execute the following in the previous setting:

> y <- z

then initially y shares the same memory area with z. But if either of them changes, a copy is made in a different area of memory, and the changed variable occupies the new area. This copying happens only on the first change; once the variables occupy separate memory areas, there are no longer any sharing issues. The function tracemem() will report such memory relocations. Though R usually adheres to copy-on-change semantics, there are exceptions. For example, R doesn't exhibit the location-change behavior in the following setting:

> z <- runif(10)

> tracemem(z)

[1] "<0x88c3258>"

> z[3] <- 8

> tracemem(z)

[1] "<0x88c3258>"

The location of z didn’t change; it was at memory address 0x88c3258 both before and after the assignment to z[3] was executed. Thus, although you should be vigilant about location change, you also can’t assume it. Let’s look at the times involved.

> z <- 1:10000000

> system.time(z[3] <- 8)

   user  system elapsed

  0.180   0.084   0.265

> system.time(z[33] <- 88)

   user  system elapsed
      0       0       0

In any event, if copying is done, the vehicle is R's internal function duplicate(). (The function is called duplicate1() in recent versions of R.) If you're familiar with the GDB debugging tool and your R build includes debugging information, you can explore the circumstances under which a copy is performed. Start up R with GDB, step through the R code from within GDB, and place a breakpoint at duplicate1(). Each time you break at that function, submit this GDB command:

call Rf_PrintValue(s)

This will print the value of s (or whatever variable is of interest).

Extended Example: Avoiding Memory Copy

This example, though artificial, will demonstrate the memory-copy issues discussed in the previous section. Suppose we have a large number of unrelated vectors and, among other things, we wish to set the third element of each to 8. We could store the vectors in a matrix, one vector per row. But since they are unrelated, maybe even of different lengths, we may consider storing them in a list. But things can get very subtle when it comes to R performance issues, so let’s try it out.

> m  <- 5000

> n <- 1000

> z <- list()

> for (i in 1:m) z[[i]] <- sample(1:10,n,replace=T)

> system.time(for (i in 1:m) z[[i]][3] <- 8)

   user  system elapsed
  0.288   0.024   0.321

> z <- matrix(sample(1:10,m*n,replace=T),nrow=m)

> system.time(z[,3] <- 8)

   user  system elapsed
  0.008   0.044   0.052

Except for system time (again), the matrix formulation did better. One of the reasons is that in the list version, we encounter the memory-copy problem in each iteration of the loop. But in the matrix version, we encounter it only once. And, of course, the matrix version is vectorized. But what about using lapply() on the list version?

> set3 <- function(lv) {

+  lv[3] <- 8

+    return(lv)

+ }

> z <- list()

> for (i in 1:m) z[[i]] <- sample(1:10,n,replace=T)

> system.time(lapply(z,set3))

   user  system elapsed
  0.100   0.012   0.112

It’s hard to beat vectorized code.

Using Rprof() to Find Slow Spots in Your Code

If you think your R code is running unnecessarily slowly, a handy tool for finding the culprit is Rprof(), which gives you a report of (approximately) how much time your code is spending in each of the functions it calls. This is important, as it may not be wise to optimize every section of your program. Optimization may come at the expense of coding time and code clarity, so it’s of value to know where optimization would really help.

Monitoring with Rprof()

Let’s demonstrate using Rprof() with our three versions of code to find a powers matrix from the previous extended example. We’ll call Rprof() to get the monitor started, run our code, and then call Rprof() with a NULL argument to stop the monitoring. Finally, we’ll call summaryRprof() to see the results.

> x <- runif(1000000)

> Rprof()

> invisible(powers1(x,8))

> Rprof(NULL)

> summaryRprof()

$by.self

             self.time self.pct total.time total.pct
"cbind"           0.74     86.0       0.74      86.0
"*"               0.10     11.6       0.10      11.6
"matrix"          0.02      2.3       0.02       2.3
"powers1"         0.00      0.0       0.86     100.0

$by.total
             total.time total.pct self.time self.pct
"powers1"          0.86     100.0      0.00       0.0
"cbind"            0.74      86.0      0.74      86.0
"*"                0.10      11.6      0.10      11.6
"matrix"           0.02       2.3      0.02       2.3

$sampling.time

[1] 0.86

We see immediately that the runtime of our code is dominated by calls to cbind(), which as we noted in the extended example is indeed slowing things down. By the way, the call to invisible() in this example is used to suppress output. We certainly don't want to see the 1,000,000-row matrix returned by powers1() here! Profiling powers2() does not show any obvious bottlenecks.

> Rprof()

> invisible(powers2(x,8))

> Rprof(NULL)

> summaryRprof()

$by.self

             self.time self.pct total.time total.pct
"powers2"         0.38     67.9       0.56     100.0
"matrix"          0.14     25.0       0.14      25.0
"*"               0.04      7.1       0.04       7.1

$by.total
             total.time total.pct self.time self.pct
"powers2"          0.56     100.0      0.38      67.9
"matrix"           0.14      25.0      0.14      25.0
"*"                0.04       7.1      0.04       7.1

$sampling.time

[1] 0.56

What about powers3(), the promising approach that didn’t pan out?

> Rprof()

> invisible(powers3(x,8))

> Rprof(NULL)

> summaryRprof()

$by.self

             self.time self.pct total.time total.pct
"FUN"             0.94     56.6       0.94      56.6
"outer"           0.72     43.4       1.66     100.0
"powers3"         0.00      0.0       1.66     100.0

$by.total
             total.time total.pct self.time self.pct
"outer"            1.66     100.0      0.72      43.4
"powers3"          1.66     100.0      0.00       0.0
"FUN"              0.94      56.6      0.94      56.6

$sampling.time

[1] 1.66

The function logging the largest amount of time was FUN(), which in our call to outer() is the exponentiation operator ^. For each combination of an element of x and an element of 1:dg, the operation is applied to just one pair of scalars at a time. In other words, no vectorization! No wonder it was slow.

How Rprof() Works

Let’s explore in a bit more detail what Rprof() does. Every 0.02 seconds (the default value), R inspects the call stack to determine which function calls are in effect at that time. It writes the result of each inspection to a file, by default Rprof.out. Here is an excerpt of that file from our run of powers3():

"outer" "powers3"
"outer" "powers3"
"outer" "powers3"
"FUN" "outer" "powers3"
"FUN" "outer" "powers3"
"FUN" "outer" "powers3"
"FUN" "outer" "powers3"

So, Rprof() often found that at inspection time, powers3() had called outer(), which in turn had called FUN(), the latter being the currently executing function. The function summaryRprof() conveniently summarizes all those lines in the file, but you may find that looking at the file itself reveals more insights in some cases. Note, too, that Rprof() is no panacea. If the code you're profiling produces many function calls (including indirect calls, triggered when your code calls some function that then calls another within R), the profiling output may be hard to decipher. This is arguably the case for the output from powers4():

$by.self

             self.time self.pct total.time total.pct
"apply"          19.46     67.5      27.56      95.6
"lapply"          4.02     13.9       5.68      19.7
"FUN"             2.56      8.9       2.56       8.9
"as.vector"       0.82      2.8       0.82       2.8
"t.default"       0.54      1.9       0.54       1.9
"unlist"          0.40      1.4       6.08      21.1
"!"               0.34      1.2       0.34       1.2
"is.null"         0.32      1.1       0.32       1.1
"aperm"           0.22      0.8       0.22       0.8
"matrix"          0.14      0.5       0.74       2.6
"!="              0.02      0.1       0.02       0.1
"powers4"         0.00      0.0      28.84     100.0
"t"               0.00      0.0      28.10      97.4
"array"           0.00      0.0       0.22       0.8

$by.total
             total.time total.pct self.time self.pct
"powers4"         28.84     100.0      0.00       0.0
"t"               28.10      97.4      0.00       0.0
"apply"           27.56      95.6     19.46      67.5
"unlist"           6.08      21.1      0.40       1.4
"lapply"           5.68      19.7      4.02      13.9
"FUN"              2.56       8.9      2.56       8.9
"as.vector"        0.82       2.8      0.82       2.8
"matrix"           0.74       2.6      0.14       0.5
"t.default"        0.54       1.9      0.54       1.9
"!"                0.34       1.2      0.34       1.2
"is.null"          0.32       1.1      0.32       1.1
"aperm"            0.22       0.8      0.22       0.8
"array"            0.22       0.8      0.00       0.0
"!="               0.02       0.1      0.02       0.1

$sampling.time

[1] 28.84

Byte Code Compilation

Starting with version 2.13, R has included a byte code compiler, which you can use to try to speed up your code. As a trivial example, recall that we showed earlier that

z <- x + y

was much faster than

for (i in 1:length(x)) z[i] <- x[i] + y[i]

Again, that was obvious, but just to get an idea of how byte code compilation works, let’s give it a try:

> library(compiler)

> f <- function() for (i in 1:length(x)) z[i] <<- x[i] + y[i]

> cf <- cmpfun(f)

> system.time(cf())

   user  system elapsed
  0.845   0.003   0.848

We created a new function, cf(), from the original f(). The new code’s run time was 0.848 seconds, much faster than the 8.175 seconds the non-compiled version took. Granted, it still wasn’t as fast as the straightforward vectorized code, but it is clear that byte code compilation has potential. You should try it whenever you need faster code.
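If you have more than a function or two to compile, the compiler package can also compile an entire source file. Here is a minimal sketch of that usage; the filename is hypothetical and not part of the example above:

> library(compiler)
> cmpfile("myscript.R")    # writes byte code to myscript.Rc
> loadcmp("myscript.Rc")   # loads and runs the compiled code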

Oh No, the Data Doesn’t Fit into Memory!

As mentioned earlier, all objects in an R session are stored in memory. R places a limit of 2^31 - 1 bytes on the size of any object, regardless of word size (32-bit versus 64-bit) and the amount of RAM in your machine. However, you really should not consider this an obstacle. With a little extra care, applications that have large memory requirements can indeed be handled well in R. Some common approaches are chunking and using R packages for memory management.

Chunking

One option involving no extra R packages at all is to read in your data from a disk file one chunk at a time. For example, suppose that our goal is to find means or proportions of some variables. We can use the skip argument in read.table(). Say our data set has 1,000,000 records and we divide them into 10 chunks (or more, whatever is needed to cut the data down to a size that fits in memory). Then we set skip = 0 on our first read, skip = 100000 the second time, and so on. Each time we read in a chunk, we calculate the counts or totals for that chunk and record them. After reading all the chunks, we add up all the counts or totals in order to calculate our grand means or proportions.

As another example, suppose we are performing a statistical operation, say calculating principal components, in which we have a huge number of rows (that is, a huge number of observations) but the number of variables is manageable. Again, chunking could be the solution. We apply the statistical operation to each chunk and then average the results over all the chunks. My mathematical research shows that the resulting estimators are statistically efficient in a wide class of statistical methods.
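As a concrete illustration of the first approach, here is a minimal sketch of computing a grand mean by chunking. The filename, the number of chunks, and the chunk size are hypothetical, and the data is assumed to have no header row:

# grand mean of column 1 of a large file, read in nchunks pieces
chunkmean <- function(fname, nchunks=10, chunksize=100000) {
   tot <- 0   # running total
   nobs <- 0  # running observation count
   for (i in 1:nchunks) {
      chunk <- read.table(fname, skip=(i-1)*chunksize, nrows=chunksize)
      tot <- tot + sum(chunk[,1])
      nobs <- nobs + nrow(chunk)
   }
   tot / nobs
}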

Using R Packages for Memory Management

Moving up a bit in sophistication, there are alternatives for accommodating large memory requirements in the form of some specialized R packages. One such package is RMySQL, an R interface to SQL databases. Using it requires some database expertise, but this package provides a much more efficient and convenient way to handle large data sets. The idea is to have SQL do its variable/case selection operations for you back at the database end and then read the resulting selected data as it is produced by SQL. Since the latter will typically be much smaller than the overall data set, you will likely be able to circumvent R's memory restriction.
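A minimal sketch of that workflow follows; the database, table, and column names here are all hypothetical:

library(RMySQL)
con <- dbConnect(MySQL(), dbname="studydb")
# let SQL do the case selection; only the (much smaller) result comes into R
women <- dbGetQuery(con, "SELECT age, weight FROM patients WHERE sex = 'F'")
mean(women$age)
dbDisconnect(con)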

Another useful package is biglm, which does regression and generalized linear-model analysis on very large data sets. It also uses chunking but in a different manner: Each chunk is used to update the running totals of sums needed for the regression analysis and then discarded. Finally, some packages do their own storage management independently of R and thus can deal with very large data sets. The two most commonly used today are ff and bigmemory. The former sidesteps memory constraints by storing data on disk instead of memory, essentially transparently to the programmer. The highly versatile bigmemory package does the same, but it can store data not only on disk but also in the machine’s main memory, which is ideal for multicore machines.
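As an illustration of the biglm approach mentioned above, here is a minimal sketch of its chunked updating, assuming a file bigfile.txt with columns y, x1, and x2 split into exactly 10 equal-sized chunks; all of these specifics are hypothetical:

library(biglm)
chunksize <- 100000
cnames <- c("y","x1","x2")
# fit the model on the first chunk
chunk <- read.table("bigfile.txt", nrows=chunksize, col.names=cnames)
fit <- biglm(y ~ x1 + x2, data=chunk)
# update the running sums with each remaining chunk, then discard it
for (i in 1:9) {
   chunk <- read.table("bigfile.txt", skip=i*chunksize, nrows=chunksize,
                       col.names=cnames)
   fit <- update(fit, chunk)
}
summary(fit)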

INTERFACING R TO OTHER LANGUAGES

R is a great language, but it can't do everything well. Thus, it is sometimes desirable to call code written in other languages from R. Conversely, when working in other languages, you may encounter tasks that could be done better in R. R interfaces have been developed for a number of other languages, from ubiquitous languages like C to esoteric ones like the Yacas computer algebra system. This chapter will cover two interfaces: one for calling C/C++ from R and the other for calling R from Python.

Writing C/C++ Functions to Be Called from R

You may wish to write your own C/C++ functions to be called from R. Typically, the goal is performance enhancement, since C/C++ code may run much faster than R, even if you use vectorization and other R optimization techniques to speed things up. Another possible goal in dropping down to the C/C++ level is specialized I/O. For example, R uses the TCP protocol in layer 3 of the standard Internet communication system, but UDP can be faster in some settings. To work in UDP, you need C/C++, which requires an interface to R for those languages. R actually offers two C/C++ interfaces via the functions .C() and .Call(). The latter is more versatile but requires some knowledge of R’s internal structures, so we’ll stick with .C() here.

Some R-to-C/C++ Preliminaries

In C, two-dimensional arrays are stored in row-major order, in contrast to R's column-major order. For instance, if you have a 3-by-4 array, the element in the second row and second column is element number 5 of the array when viewed linearly, since there are three elements in the first column and this is the second element in the second column. Also keep in mind that C subscripts begin at 0, rather than at 1, as with R. All the arguments passed from R to C are received by C as pointers. Note that the C function itself must return void. Values that you would ordinarily return must be communicated through the function's arguments, such as result in the following example.
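To see the storage order that the C code will receive, it may help to flatten a small R matrix. For a 3-by-4 matrix, the (2,2) element is indeed the fifth element in the linear, column-major layout:

> m <- matrix(1:12, nrow=3)
> m[2,2]
[1] 5
> as.vector(m)   # the linear order seen by the C code
 [1]  1  2  3  4  5  6  7  8  9 10 11 12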

Example: Extracting Subdiagonals from a Square Matrix

Here, we will write C code to extract subdiagonals from a square matrix. (Thanks to my former graduate assistant, Min-Yu Huang, who wrote an earlier version of this function.) Here's the code for the file sd.c:

#include <R.h>  // required

// arguments:
//    m:  a square matrix
//    n:  number of rows/columns of m
//    k:  the subdiagonal index--0 for main diagonal, 1 for first
//        subdiagonal, 2 for the second, etc.
//    result:  space for the requested subdiagonal, returned here
void subdiag(double *m, int *n, int *k, double *result)
{
   int nval = *n, kval = *k;
   int stride = nval + 1;
   for (int i = 0, j = kval; i < nval-kval; ++i, j += stride)
      result[i] = m[j];
}

The variable stride alludes to a concept from the parallel-processing community. Say we have a matrix with 1,000 columns and our C code is looping through all the elements in a given column, from top to bottom. Again, since C uses row-major order, consecutive elements in the column are 1,000 elements apart from each other if the matrix is viewed as one long vector. Here, we would say that we are traversing that long vector with a stride of 1,000, that is, accessing every thousandth element.

Compiling and Running Code

You compile your code using R. For example, in a Linux terminal window, we could compile our file like this:

% R CMD SHLIB sd.c

          gcc -std=gnu99 -I/usr/share/R/include      -fpic  -g -O2 -c sd.c -o sd.o

          gcc -std=gnu99 -shared  -o sd.so sd.o   -L/usr/lib/R/lib -lR

This would produce the dynamic shared library file sd.so.

Note that R has reported how it invoked GCC in the output of the example. You can also run these commands by hand if you have special requirements, such as special libraries to be linked in. Also note that the locations of the include and lib directories may be system-dependent. We can then load our library into R and call our C function like this:

> dyn.load("sd.so")

> m <- rbind(1:5, 6:10, 11:15, 16:20, 21:25)

> k <- 2

> .C("subdiag", as.double(m), as.integer(dim(m)[1]), as.integer(k), result=double(dim(m)[1]-k))

[[1]]

           [1]  1  6 11 16 21  2  7 12 17 22  3  8 13 18 23  4  9 14 19 24  5 10 15 20 25

[[2]]

[1]  5

[[3]]

[1] 2

$result

[1]  11  17  23

For convenience here, we’ve given the name result to both the formal argument (in the C code) and the actual argument (in the R code). Note that we needed to allocate space for result in our R code. As you can see from the example, the return value takes on the form of a list consisting of the arguments in the R call. In this case, the call had four arguments (in addition to the function name), so the returned list has four components. Typically, some of the arguments will be changed during execution of the C code, as was the case here with result.
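In practice, we usually want only the result component, so it is convenient to wrap the call in a short R function. Here is a sketch based on the call above; the wrapper name subdiag1 is ours and not part of the C code:

subdiag1 <- function(m, k) {
   n <- dim(m)[1]
   .C("subdiag", as.double(m), as.integer(n), as.integer(k),
      result=double(n-k))$result
}

With the shared library loaded, subdiag1(m,2) then returns just the vector 11 17 23 from the example above.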

Debugging R/C Code

We discussed a number of tools and methods for debugging R code. However, the R/C interface presents an extra challenge. The problem in using a debugging tool such as GDB here is that you must first apply it to R itself. The following is a walk-through of the R/C debugging steps using GDB on our previous sd.c code as the example.

$ R -d gdb

GNU gdb 6.8-debian

(gdb) run

Starting program: /usr/lib/R/bin/exec/R

>  dyn.load("sd.so")

>   # hit ctrl-c here

Program received signal SIGINT, Interrupt.

0xb7ffa430 in __kernel_vsyscall ()

(gdb) b subdiag

Breakpoint 1 at 0xb77683f3: file sd.c, line 3.

(gdb) continue

Continuing.

Breakpoint 1, subdiag (m=0x92b9480, n=0x9482328, k=0x9482348, result=0x9817148)

         at sd.c:3

3          int nval = *n, kval = *k;

(gdb)

So, what happened in this debugging session?

1. We launched the debugger, GDB, with R loaded into it, from a command line in a terminal window:

R -d gdb

2. We told GDB to run R:

(gdb) run

3. We loaded our compiled C code into R as usual:

dyn.load("sd.so")

4. We hit the CTRL-C interrupt key pair to pause R and put us back at the GDB prompt.

5. We set a breakpoint at the entry to subdiag():

(gdb) b subdiag

6. We told GDB to resume executing R (we needed to hit the ENTER key a second time in order to get the R prompt):

(gdb) continue

We then executed our C code:

> m <- rbind(1:5, 6:10, 11:15, 16:20, 21:25)

> k <- 2

> .C("subdiag", as.double(m), as.integer(dim(m)[1]), as.integer(k),
+   result=double(dim(m)[1]-k))

Breakpoint 1, subdiag (m=0x942f270, n=0x96c3328, k=0x96c3348, result=0x9a58148)

at subdiag.c:46

46 if (*n < 1) error("n < 1\n");

At this point, we can use GDB to debug as usual. If you’re not familiar with GDB, you may want to try one of the many quick tutorials on the Web. Table 15-1 lists some of the most useful commands.

Table 15-1: Common GDB Commands

Command     Description
l           List code lines
b           Set breakpoint
r           Run/rerun
n           Step to next statement
s           Step into function call
p           Print variable or expression
c           Continue
h           Help
q           Quit

Extended Example: Prediction of Discrete-Valued Time Series

Recall our earlier example, in which we observed 0- and 1-valued data, one per time period, and attempted to predict the value in any period from the previous k values, using majority rule. We developed two competing functions for the job, preda() and predb(), as follows:

# prediction in discrete time series; 0s and 1s; use k consecutive
# observations to predict the next, using majority rule; calculate the
# error rate
preda <- function(x,k) {
   n <- length(x)
   k2 <- k/2
   # the vector pred will contain our predicted values
   pred <- vector(length=n-k)
   for (i in 1:(n-k)) {
      if (sum(x[i:(i+(k-1))]) >= k2) pred[i] <- 1 else pred[i] <- 0
   }
   return(mean(abs(pred-x[(k+1):n])))
}

predb <- function(x,k) {
   n <- length(x)
   k2 <- k/2
   pred <- vector(length=n-k)
   sm <- sum(x[1:k])
   if (sm >= k2) pred[1] <- 1 else pred[1] <- 0
   if (n-k >= 2) {
      for (i in 2:(n-k)) {
         sm <- sm + x[i+k-1] - x[i-1]
         if (sm >= k2) pred[i] <- 1 else pred[i] <- 0
      }
   }
   return(mean(abs(pred-x[(k+1):n])))
}

Since the latter avoids duplicate computation, we speculated it would be faster. Now is the time to check that.

> y <- sample(0:1,100000,replace=T)

> system.time(preda(y,1000))

   user  system elapsed

  3.816   0.016   3.873

> system.time(predb(y,1000))

   user  system elapsed

  1.392   0.008   1.427

Hey, not bad! That's quite an improvement. However, you should always ask whether R already has a fine-tuned function that will suit your needs. Since we're basically computing a moving average, we might try the filter() function, with a constant coefficient vector, as follows:

predc <- function(x,k) {
   n <- length(x)
   f <- filter(x,rep(1,k),sides=1)[k:(n-1)]
   k2 <- k/2
   pred <- as.integer(f >= k2)
   return(mean(abs(pred-x[(k+1):n])))
}

That’s even more compact than our first version. But it’s a lot harder to read, and for reasons we will explore soon, it may not be so fast. Let’s check.

> system.time(predc(y,1000))

   user  system elapsed

  3.872   0.016   3.945

Well, our second version remains the champion so far. This actually should be expected, as a look at the source code shows. Typing the following shows the source for that function:

> filter

This reveals (not shown here) that filter1() is called. The latter is written in C, which should give us some speedup, but it still suffers from the duplicate computation problem, hence the slowness. So, let's write our own C code.

#include <R.h>

void predd(int *x, int *n, int *k, double *errrate)
{
   int nval = *n, kval = *k, nk = nval - kval, i;
   int sm = 0;    // moving sum
   int errs = 0;  // error count
   int pred;      // predicted value
   double k2 = kval/2.0;
   // initialize by computing the initial window
   for (i = 0; i < kval; i++) sm += x[i];
   if (sm >= k2) pred = 1; else pred = 0;
   errs = abs(pred-x[kval]);
   for (i = 1; i < nk; i++) {
      sm = sm + x[i+kval-1] - x[i-1];
      if (sm >= k2) pred = 1; else pred = 0;
      errs += abs(pred-x[i+kval]);
   }
   *errrate = (double) errs / nk;
}

This is basically predb() from before, "hand translated" into C. Let's see if it will outdo predb().

> system.time(.C("predd",as.integer(y),as.integer(length(y)),as.integer(1000),
+   errrate=double(1)))

   user  system elapsed
  0.004   0.000   0.003

The speedup is breathtaking. You can see that writing certain functions in C can be worth the effort. This is especially true for functions that involve iteration, as R’s own iteration constructs, such as for(), are slow.

Using R from Python

Python is an elegant and powerful language, but it lacks built-in facilities for statistical and data manipulation, two areas in which R excels. This section demonstrates how to call R from Python, using RPy, one of the most popular interfaces between the two languages.

Installing RPy

RPy is a Python module that allows access to R from Python. For extra efficiency, it can be used in conjunction with NumPy. You can build the module from the source, available from http://rpy.sourceforge.net, or download a prebuilt version. If you are running Ubuntu, simply type this:

sudo apt-get install python-rpy

To load RPy from Python (whether in Python interactive mode or from code), execute the following:

from rpy import *

This will load a variable r, which is a Python class instance.

RPy Syntax

Running R from Python is in principle quite simple. Here is an example of a command you might run from the >>> Python prompt:

>>> r.hist(r.rnorm(100))

This will call the R function rnorm() to produce 100 standard normal variates and then input those values into R's histogram function, hist(). As you can see, R names are prefixed by r., reflecting the fact that Python wrappers for R functions are members of the class instance r. The preceding code will, if not refined, produce ugly output, with your (possibly voluminous!) data appearing as the graph title and the x-axis label. You can avoid this by supplying a title and label, as in this example:

>>> r.hist(r.rnorm(100),main='',xlab='')

RPy syntax is sometimes less simple than these examples would lead you to believe. The problem is that R and Python syntax may clash. For instance, consider a call to the R linear model function lm(). In our example, we will predict b from a.

>>>  a = [5,12,13]

>>>  b = [10,28,30]

>>> lmout = r.lm('v2 ~ v1',data=r.data_frame(v1=a,v2=b))

This is somewhat more complex than it would have been if done directly in R. What are the issues here? First, since Python syntax does not include the tilde character, we needed to specify the model formula via a string. Since this is done in R anyway, this is not a major departure.

Second, we needed a data frame to contain our data. We created one using R's data.frame() function. Since a Python identifier cannot contain a period, RPy maps periods in R function names to underscores on the Python end; thus we called r.data_frame(). Note that in this call, we named the columns of our data frame v1 and v2 and then used these in our model formula. The output object is a Python dictionary (analog of R's list type), as you can see here (in part):

>>> lmout

{'qr': {'pivot': [1, 2], 'qr': array([[ -1.73205081, -17.32050808],
[  0.57735027,  -6.164414  ],
[  0.57735027,   0.78355007]]), 'qraux':

You should recognize the various attributes of lm() objects here. For example, the coefficients of the fitted regression line, which would be contained in lmout$coefficients if this were done in R, are here in Python as lmout['coefficients']. So, you can access those coefficients accordingly, for example like this:

>>> lmout['coefficients']

{'v1': 2.5263157894736841, '(Intercept)': -2.5964912280701729}

>>> lmout['coefficients']['v1']

2.5263157894736841

You can also submit R commands to work on variables in R's namespace, using the function r(). This is convenient if there are many syntax clashes. Here is how we could run the wireframe() example in RPy:

>>> r.library('lattice')

>>> r.assign('a',a)

>>> r.assign('b',b)

>>> r('g <- expand.grid(a,b)')

>>> r('g$Var3 <- g$Var1^2 + g$Var1 * g$Var2')

>>> r('wireframe(Var3 ~ Var1+Var2,g)')

>>> r('plot(wireframe(Var3 ~ Var1+Var2,g))')

First, we used r.assign() to copy a variable from Python's namespace to R's. We then ran expand.grid() (with a period in the name instead of an underscore, since we are running in R's namespace), assigning the result to g. Again, the latter is in R's namespace. Note that the call to wireframe() did not automatically display the plot, so we needed to call plot(). The official documentation for RPy is at http://rpy.sourceforge.net/rpy/doc/rpy.pdf. Also, you can find a useful presentation, "RPy—R from Python," at http://www.daimi.au.dk/~besen/TBiB2007/lecture-notes/rpy.html.

PARALLEL R

Since many R users have very large computational needs, various tools for some kind of parallel operation of R have been devised. This chapter is devoted to parallel R. Many a novice in parallel processing has, with great anticipation, written parallel code for some application, only to find that the parallel version actually ran more slowly than the serial one. For reasons to be discussed in this chapter, this problem is especially acute with R. Accordingly, understanding the nature of parallel-processing hardware and software is crucial to success in the parallel world. These issues will be discussed here in the context of common platforms for parallel R. We'll start with a few code examples and then move to general performance issues.

The Mutual Outlinks Problem

Consider a network graph of some kind, such as web links or links in a social network. Let A be the adjacency matrix of the graph, meaning that, say, A[3,8] is 1 or 0, depending on whether there is a link from node 3 to node 8. For any two vertices, say any two websites, we might be interested in mutual outlinks, that is, outbound links that are common to the two sites. Suppose that we want to find the mean number of mutual outlinks, averaged over all pairs of websites in our data set. This mean can be found using the following outline, for an n-by-n matrix:

sum = 0
for i = 0...n-1
   for j = i+1...n-1
      for k = 0...n-1
         sum = sum + a[i][k]*a[j][k]
mean = sum / (n*(n-1)/2)

Given that our graph could contain thousands, even millions, of websites, our task could entail quite large amounts of computation. A common approach to dealing with this problem is to divide the computation into smaller chunks and then process each of the chunks simultaneously, say on separate computers. Let's say that we have two computers at our disposal. We might have one computer handle all the odd values of i in the for i loop in line 2 and have the second computer handle the even values. Or, since dual-core computers are fairly standard these days, we could take this same approach on a single computer. This may sound simple, but a number of major issues can arise, as you'll learn in this chapter.

Introducing the snow Package

Luke Tierney’s snow (Simple Network of Workstations) package, available from the CRAN R code repository, is arguably the simplest, easiest-to-use form of parallel R and one of the most popular. To see how snow works, here’s code for the mutual outlinks problem described in the previous section:

# snow version of mutual links problem

mtl <- function(ichunk,m) {
   n <- ncol(m)
   matches <- 0
   for (i in ichunk) {
      if (i < n) {
         rowi <- m[i,]
         matches <- matches + sum(m[(i+1):n,] %*% rowi)
      }
   }
   matches
}

mutlinks <- function(cls,m) {
   n <- nrow(m)
   nc <- length(cls)
   # determine which worker gets which chunk of i
   options(warn=-1)
   ichunks <- split(1:n,1:nc)
   options(warn=0)
   counts <- clusterApply(cls,ichunks,mtl,m)
   do.call(sum,counts) / (n*(n-1)/2)
}

Suppose we have this code in the file SnowMutLinks.R. Let’s first discuss how to run it.

Running snow Code

Running the above snow code involves the following steps:

1. Load the code.

2. Load the snow library.

3. Form a snow cluster.

4. Set up the adjacency matrix of interest.

5. Run your code on that matrix on the cluster you formed.

Assuming we are running on a dual-core machine, we issue the following commands to R:

> source("SnowMutLinks.R")

> library(snow)

> cl <- makeCluster(type="SOCK",c("localhost","localhost"))

> testm <- matrix(sample(0:1,16,replace=T),nrow=4)

> mutlinks(cl,testm)

[1] 0.6666667

Here, we are instructing snow to start two new R processes on our machine (localhost is a standard network name for the local machine), which I will refer to here as workers. I’ll refer to the original R process—the one in which we type the preceding commands—as the manager. So, at this point, three instances of R will be running on the machine (visible by running the ps command if you are in a Linux environment, for example). The workers form a cluster in snow parlance, which we have named cl. The snow package uses what is known in the parallel-processing world as a scatter/gather paradigm, which works as follows:

1. The manager partitions the data into chunks and parcels them out to the workers (scatter phase).

2. The workers process their chunks.

3. The manager collects the results from the workers (gather phase) and combines them as appropriate to the application.

We have specified that communication between the manager and workers will be via network sockets. Here's a test matrix to check the code:

> testm

     [,1] [,2] [,3] [,4]

[1,]    1    0    0    1

[2,]    0    0    0    0

[3,]    1    0    1    1

[4,]    0    1    0    1

Row 1 has zero outlinks in common with row 2, two in common with row 3, and one in common with row 4. Row 2 has zero outlinks in common with the rest, but row 3 has one in common with row 4. That is a total of four mutual outlinks out of 4 × 3/2 = 6 pairs—hence, the mean value of 4/6 = 0.6666667, as you saw earlier. You can make clusters of any size, as long as you have the machines. In my department, for instance, I have machines whose network names are pc28, pc29, and pc30. Each machine is dual core, so I could create a six-worker cluster as follows:

> cl6 <- makeCluster(type="SOCK",c("pc28","pc28","pc29","pc29","pc30","pc30"))
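When you are done with a cluster, it is good practice to shut the worker processes down, which snow does via its stopCluster() function, for example:

> stopCluster(cl6)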

Analyzing the snow Code

Now let's see how the mutlinks() function works. First, in the opening lines of that function, we find how many rows the matrix m has and how many workers are in our cluster. Next, we need to determine which worker will handle which values of i in the for i loop in our outline code. R's split() function is well suited for this. For instance, in the case of a 4-row matrix and a 2-worker cluster, that call produces the following:

> split(1:4,1:2)

$`1`

[1]  1  3

$`2`

[1]  2  4

An R list is returned whose first element is the vector (1,3) and the second is (2,4). This will set up having one R process work on the odd values of i and the other work on the even values, as we discussed earlier. We ward off the warnings that split() would give us ("data length is not a multiple of split variable") by calling options(). The real work is done in the call to the snow function clusterApply(). This function initiates a call to the same specified function (mtl() here), with some arguments specific to each worker and some optional arguments common to all. So, here's what that call does:

1. Worker 1 will be directed to call the function mtl() with the arguments ichunks[[1]] and m.

2. Worker 2 will call mtl() with the arguments ichunks[[2]] and m, and so on for all workers.

3. Each worker will perform its assigned task and then return the result to the manager.

4. The manager will collect all such results into an R list, which we have assigned here to counts.

At this point, we merely need to sum all the elements of counts. Well, I shouldn't say "merely," because there is a little wrinkle to iron out in the final line of mutlinks(). R's sum() function is capable of acting on several vector arguments, like this:

> sum(1:2,c(4,10))

[1] 17

But here, counts is an R list, not a (numeric) vector. So we rely on do.call() to extract the vectors from counts, and then we call sum() on them. Note also the computation of matches within mtl(). As you know, in R, we try to vectorize our computation wherever possible for better performance. By casting things in matrix-times-vector terms, we replace the for j and for k loops in the outline by a single vector-based expression.
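As a tiny illustration of the do.call()/sum() combination, with hypothetical per-worker totals:

> counts <- list(3, 1, 0)
> do.call(sum, counts)
[1] 4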

How Much Speedup Can Be Attained?

I tried this code on a 1000-by-1000 matrix m1000. I first ran it on a 4-worker cluster and then on a 12-worker cluster. In principle, I should have had speedups of 4 and 12, respectively. But the actual elapsed times were 6.2 seconds and 5.0 seconds. Compare these figures to the 16.9-second runtime in nonparallel form. (The latter consisted of the call mtl(1:1000,m1000).) So, I attained a speedup of about 2.7 instead of a theoretical 4.0 for a 4-worker cluster and 3.4 rather than 12.0 on the 12-node system. (Note that some timing variation occurs from run to run.) What went wrong?

In almost any parallel-processing application, you encounter overhead, or "wasted" time spent on noncomputational activity. In our example, there is overhead in the form of the time needed to send our matrix from the manager to the workers. We also encountered a bit of overhead in sending the function mtl() itself to the workers. And when the workers finish their tasks, returning their results to the manager causes some overhead, too. We'll discuss this in detail when we talk about general performance considerations later in this chapter.

Extended Example: K-Means Clustering

To learn more about the capabilities of snow, we'll look at another example, this one involving k-means clustering (KMC). KMC is a technique for exploratory data analysis. In looking at scatter plots of your data, you may have the perception that the observations tend to cluster into groups, and KMC is a method for finding such groups. The output consists of the centroids of the groups. The following is an outline of the algorithm:

for iter = 1,2,…,niters

   set vector and count totals to 0

   for i = 1,…,nrow(m)

      set j = index of the closest group center to m[i,]

      add m[i,] to the vector total for group j, v[j]

      add 1 to the count total for group j, c[j]

   for j = 1,…,ngrps

      set new center of group j = v[j] / c[j]

Here, we specify niters iterations, with initcenters as our initial guesses for the centers of the groups. Our data is in the matrix m, and there are ngrps groups.  The following is the snow code to compute KMC in parallel:

  1. # snow version of k-means clustering problem
  2.
  3. library(snow)
  4.
  5. # returns distances from x to each vector in y;
  6. # here x is a single vector and y is a bunch of them;
  7. # define distance between 2 points to be the sum of the absolute values
  8. # of their componentwise differences; e.g., distance between (5,4.2) and
  9. # (3,5.6) is 2 + 1.4 = 3.4
 10. dst <- function(x,y) {
 11.    tmpmat <- matrix(abs(x-y),byrow=T,ncol=length(x))  # note recycling
 12.    rowSums(tmpmat)
 13. }
 14.
 15. # will check this worker's mchunk matrix against currctrs, the current
 16. # centers of the groups, returning a matrix; row j of the matrix will
 17. # consist of the vector sum of the points in mchunk closest to the jth
 18. # current center, and the count of such points
 19. findnewgrps <- function(currctrs) {
 20.    ngrps <- nrow(currctrs)
 21.    spacedim <- ncol(currctrs)  # what dimension space are we in?
 22.    # set up the return matrix
 23.    sumcounts <- matrix(rep(0,ngrps*(spacedim+1)),nrow=ngrps)
 24.    for (i in 1:nrow(mchunk)) {
 25.       dsts <- dst(mchunk[i,],t(currctrs))
 26.       j <- which.min(dsts)
 27.       sumcounts[j,] <- sumcounts[j,] + c(mchunk[i,],1)
 28.    }
 29.    sumcounts
 30. }
 31.
 32. parkm <- function(cls,m,niters,initcenters) {
 33.    n <- nrow(m)
 34.    spacedim <- ncol(m)  # what dimension space are we in?
 35.    # determine which worker gets which chunk of rows of m
 36.    options(warn=-1)
 37.    ichunks <- split(1:n,1:length(cls))
 38.    options(warn=0)
 39.    # form row chunks
 40.    mchunks <- lapply(ichunks,function(ichunk) m[ichunk,])
 41.    mcf <- function(mchunk) mchunk <<- mchunk
 42.    # send row chunks to workers; each chunk will be a global variable at
 43.    # the worker, named mchunk
 44.    invisible(clusterApply(cls,mchunks,mcf))
 45.    # send dst() to workers
 46.    clusterExport(cls,"dst")
 47.    # start iterations
 48.    centers <- initcenters
 49.    for (i in 1:niters) {
 50.       sumcounts <- clusterCall(cls,findnewgrps,centers)
 51.       tmp <- Reduce("+",sumcounts)
 52.       centers <- tmp[,1:spacedim] / tmp[,spacedim+1]
 53.       # if a group is empty, let's set its center to 0s
 54.       centers[is.nan(centers)] <- 0
 55.    }
 56.    centers
 57. }

The code here is largely similar to our earlier mutual outlinks example. However, there are a couple of new snow calls and a different kind of usage of an old call. Let’s start with lines 39 through 44. Since our matrix m does not change from one iteration to the next, we definitely do not want to resend it to the workers repeatedly, exacerbating the overhead problem. Thus, first we need to send each worker its assigned chunk of m, just once. This is done in line 44 via snow’s clusterApply() function, which we used earlier but need to get creative with here. In line 41, we define the function mcf(), which will, running on a worker, accept the worker’s chunk from the manager and then keep it as a global variable mchunk on the worker.

Line 46 makes use of a new snow function, clusterExport(), whose job it is to make copies of the manager’s global variables at the workers. The variable in question here is actually a function, dst(). Here is why we need to send it separately: The call in line 50 will send the function findnewgrps() to the workers, but although that function calls dst(), snow will not know to send the latter as well. Therefore we send it ourselves. Line 50 itself uses another new snow call, clusterCall(). This instructs each worker to call findnewgrps(), with centers as argument.

Recall that each worker has a different matrix chunk, so this call will work on different data for each worker. This once again brings up the controversy regarding the use of global variables. Some software developers may be troubled by the use of a hidden argument in findnewgrps(). On the other hand, as mentioned earlier, using mchunk as an argument would mean sending it to the workers repeatedly, compromising performance.

Finally, take a look at line 51. The result of the clusterCall() call in line 50 is an R list, assigned here to sumcounts, each element of which is a matrix. We need to sum the matrices, producing a totals matrix. Using R's sum() function wouldn't work, as it would total all the elements of the matrices into a single number. Matrix addition is what we need. Calling R's Reduce() function will do the matrix addition. Recall that any arithmetic operation in R is implemented as a function; in this case, it is implemented as the function "+". The call to Reduce() then successively applies "+" to the elements of the list sumcounts. Of course, we could just write a loop to do this, but using Reduce() may give us a small performance boost.
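As a tiny illustration of that use of Reduce(), with hypothetical 2-by-2 worker results:

> sumcounts <- list(matrix(1:4,nrow=2), matrix(5:8,nrow=2))
> Reduce("+", sumcounts)
     [,1] [,2]
[1,]    6   10
[2,]    8   12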

Resorting to C

As you've seen, using parallel R may greatly speed up your R code. This allows you to retain the convenience and expressive power of R, while still ameliorating large runtimes in big applications. If the parallelized R gives you sufficiently good performance, then all is well. Nevertheless, parallel R is still R and thus still subject to the performance issues covered earlier. Recall that one solution offered there was to write a performance-critical portion of your code in C and then call that code from your main R program. (The references to C here mean C or C++.) We will explore this from a parallel-processing viewpoint. Here, instead of writing parallel R, we write ordinary R code that calls parallel C.

Using Multicore Machines

The C code covered here runs only on multicore systems, so we must discuss the nature of such systems. You are probably familiar with dual-core machines. Any computer includes a CPU, which is the part that actually runs your program. In essence, a dual-core machine has two CPUs, a quad-core system has four, and so on. With multiple cores, you can do parallel computation!

This parallel computation is done with threads, which are analogous to snow's workers. In computationally intensive applications, you generally set up as many threads as there are cores, for example, two threads on a dual-core machine. Ideally, these threads run simultaneously, though overhead issues do arise, as will be explained when we look at general performance issues. If your machine has multiple cores, it is structured as a shared-memory system. All cores access the same RAM. The shared nature of the memory makes communication between the cores easy to program. If a thread writes to a memory location, the change is visible to the other threads, without the programmer needing to insert code to make that happen.

Extended Example: Mutual Outlinks Problem in OpenMP

OpenMP is a very popular package for programming on multicore machines. To see how it works, here is the mutual outlinks example again, this time in R-callable OpenMP code:

  1. #include <omp.h>
  2. #include <R.h>
  3.
  4. int tot;  // grand total of matches, over all threads
  5.
  6. // processes row pairs (i,i+1), (i,i+2), ...
  7. int procpairs(int i, int *m, int n)
  8. {  int j,k,sum=0;
  9.    for (j = i+1; j < n; j++) {
 10.       for (k = 0; k < n; k++)
 11.          // find m[i][k]*m[j][k] but remember R uses col-major order
 12.          sum += m[n*k+i] * m[n*k+j];
 13.    }
 14.    return sum;
 15. }
 16.
 17. void mutlinks(int *m, int *n, double *mlmean)
 18. {  int nval = *n;
 19.    tot = 0;
 20.    #pragma omp parallel
 21.    {  int i,mysum=0,
 22.          me = omp_get_thread_num(),
 23.          nth = omp_get_num_threads();
 24.       // in checking all (i,j) pairs, partition the work according to i;
 25.       // this thread me will handle all i that equal me mod nth
 26.       for (i = me; i < nval; i += nth) {
 27.          mysum += procpairs(i,m,nval);
 28.       }
 29.       #pragma omp atomic
 30.       tot += mysum;
 31.    }
 32.    int divisor = nval * (nval-1) / 2;
 33.    *mlmean = ((float) tot)/divisor;
 34. }

Running the OpenMP Code

We compile as before, but we also need to link in the OpenMP library, by using the -fopenmp and -lgomp options. Suppose our source file is romp.c. Then we use the following commands to build the shared library romp.so:

gcc -std=gnu99 -fopenmp -I/usr/share/R/include -fpic -g -O2 -c romp.c -o romp.o
gcc -std=gnu99 -shared -o romp.so romp.o -L/usr/lib/R/lib -lR -lgomp

Here’s an R test:

> dyn.load("romp.so")

> Sys.setenv(OMP_NUM_THREADS=4)

> n <- 1000

> m <- matrix(sample(0:1,n^2,replace=T),nrow=n)

> system.time(z <- .C("mutlinks",as.integer(m),as.integer(n),result=double(1)))

   user  system elapsed
  0.830   0.000   0.218

> z$result

[1] 249.9471

The typical way to specify the number of threads in OpenMP is through an operating system environment variable, OMP_NUM_THREADS. R is capable of setting operating system environment variables with the Sys.setenv() function. Here, we set the number of threads to 4, because we were running on a quad-core machine. Note the runtime—only 0.2 seconds! This compares to the 5.0-second time we saw earlier for a 12-node snow system. This might be surprising to some readers, as our code in the snow version was vectorized to a fair degree, as mentioned earlier. Vectorizing is good, but again, R has many hidden sources of overhead, so C might do even better. Thus, if you are willing to write part of your code in parallel C, dramatic speedups may be possible.

OpenMP Code Analysis

OpenMP code is C, with the addition of pragmas that instruct the compiler to insert some library code to perform OpenMP operations. Look at line 20, for instance. When execution reaches this point, the threads will be activated. Each thread then executes the block that follows—lines 21 through 31—in parallel. A key point is variable scope. All the variables within the block starting on line 21 are local to their specific threads. For example, we’ve named the total variable in line 21 mysum because each thread will maintain its own sum. By contrast, the global variable tot on line 4 is held in common by all the threads. Each thread makes its contribution to that grand total on line 30.

But even the variable nval on line 18 is held in common by all the threads (during the execution of mutlinks()), as it is declared outside the block beginning on line 21. So, even though it is a local variable in terms of C scope, it is global to all the threads. Indeed, we could have declared tot on that line, too. It needs to be shared by all the threads, but since it's not used outside mutlinks(), it could have been declared on line 18. Line 29 contains another pragma, atomic. This one applies only to the single line following it (line 30, in this case) rather than to a whole block. The purpose of the atomic pragma is to avoid what is called a race condition in parallel-processing circles. This term describes a situation in which two threads are updating a variable at the same time, which may produce incorrect results. The atomic pragma ensures that line 30 will be executed by only one thread at a time. Note that this implies that in this section of the code, our parallel program becomes temporarily serial, which is a potential source of slowdown.

Where is the manager’s role in all of this? Actually, the manager is the original thread, and it executes lines 18 and 19, as well as .C(), the R function that makes the call to mutlinks(). When the worker threads are activated in line 21, the manager goes dormant. The worker threads become dormant once they finish line 31. At that point, the manager resumes execution. Due to the dormancy of the manager while the workers are executing, we do want to have as many workers as our machine has cores. The function procpairs() is straightforward, but note the manner in which the matrix m is being accessed. Recall from the discussion on interfacing R to C that the two languages store matrices differently: column by column in R and row-wise in C. We need to be aware of that difference here. In addition, we have treated the matrix m as a one-dimensional array, as is common in parallel C code. In other words, if n is, say, 4, then we treat m as a vector of 16 elements. Due to the column-major nature of R matrix storage, the vector will consist first of the four elements of column 1, then the four of column 2, and so on. To further complicate matters, we must keep in mind that array indices in C start at 0, instead of starting at 1 as in R.

Putting all of this together yields the multiplication in line 12. The factors here are the (k,i) and (k,j) elements of the version of m in the C code, which are the (i+1,k+1) and (j+1,k+1) elements back in the R code.

Other OpenMP Pragmas

OpenMP includes a wide variety of possible operations—far too many to list here. This section provides an overview of some OpenMP pragmas that we consider especially useful.

The omp barrier Pragma

The parallel-processing term barrier refers to a line of code at which the threads rendezvous. The syntax for the omp barrier pragma is simple:

#pragma omp barrier

When a thread reaches a barrier, its execution is suspended until all other threads have reached that line. This is very useful for iterative algorithms; threads wait at a barrier at the end of every iteration. Note that in addition to this explicit barrier invocation, some other pragmas place an implicit barrier following their blocks. These include single and parallel. There is an implied barrier immediately following line 31 in the previous listing, for example, which is why the manager stays dormant until all worker threads finish.

The omp critical Pragma

The block that follows this pragma is a critical section, meaning one in which only one thread is allowed to execute at a time. The omp critical pragma essentially serves the same purpose as the atomic pragma discussed earlier, except that the latter is limited to a single statement. Here is the omp critical syntax:

#pragma omp critical

{

// place one or more statements here

}

The omp single Pragma

The block that follows this pragma is to be executed by only one of the threads. Here is the syntax for the omp single pragma:

#pragma omp single

{

// place one or more statements here

}

This is useful for initializing sum variables that are shared by the threads, for instance. As noted earlier, an automatic barrier is placed after the block. This should make sense to you. If one thread is initializing a sum, you wouldn’t want other threads that make use of this variable to continue execution until the sum has been properly set.

GPU Programming

Another type of shared-memory parallel hardware consists of graphics processing units (GPUs). If you have a sophisticated graphics card in your machine, say for playing games, you may not realize that it is also a very powerful computational device, so powerful that the slogan "A supercomputer on your desk!" is often used to refer to PCs equipped with high-end GPUs.

As with OpenMP, the idea here is that instead of writing parallel R, you write R code interfaced to parallel C. (Similar to the OpenMP case, C here means a slightly augmented version of the C language.) The technical details become rather complex, so I won’t show any code examples, but an overview of the platform is worthwhile. As mentioned, GPUs do follow the shared-memory/threads model, but on a much larger scale. They have dozens, or even hundreds, of cores (depending on how you define core). One major difference is that several threads can be run together in a block, which can produce certain efficiencies.

Programs that access GPUs begin their run on your machine’s CPU, referred to as the host. They then start code running on the GPU, or device. This means that your data must be transferred from the host to the device, and after the device finishes its computation, the results must be transferred back to the host. As of this writing, GPU has not yet become common among R users. The most common usage is probably through the CRAN package gputools, which consists of some matrix algebra and statistical routines callable from R. For instance, consider matrix inversion. R provides the function solve() for this, but a parallel alternative is available in gputools with the name gpuSolve().

General Performance Considerations

This section discusses some issues that you may find generally useful in parallelizing R applications. We'll present some material on the main sources of overhead and then discuss a couple of algorithmic issues.

Sources of Overhead

Having at least a rough idea of the physical causes of overhead is essential to successful parallel programming. Let’s take a look at these in the contexts of the two main platforms, shared-memory and networked computers.

Shared-Memory Machines

As noted earlier, the memory sharing in multicore machines makes for easier programming. However, the sharing also produces overhead, since the two cores will bump into each other if they both try to access memory at the same time. This means that one of them will need to wait, causing overhead. That overhead is typically in the range of hundreds of nanoseconds (billionths of seconds). This sounds really small, but keep in mind that the CPU is working at subnanosecond speed, so memory access often becomes a bottleneck.

Each core may also have a cache, in which it keeps a local copy of some of the shared memory. It’s intended to reduce contention for memory among the cores, but it produces its own overhead, involving time spent in keeping the caches consistent with each other. Recall that GPUs are special types of multicore machines. As such, they suffer from the problems I’ve described, and more. First, the latency, which is the time delay before the first bit arrives at the GPU from its memory after a memory read request, is quite long in GPUs.

There is also the overhead incurred in transferring data between the host and the device. The latency here is on the order of microseconds (millionths of seconds), an eternity compared to the nanosecond scale of the CPU and GPU. GPUs have great performance potential for certain classes of applications, but overhead can be a major issue. The authors of gputools note that their matrix operations start achieving a speedup only at matrix sizes of 1000 by 1000. We wrote a GPU version of our mutual outlinks application, which turned out to have a runtime of 3.0 seconds—about half of the snow version but still far slower than the OpenMP implementation. Again, there are ways of ameliorating these problems, but they require very careful, creative programming and a sophisticated knowledge of the physical GPU structure.

Networked Systems of Computers

As you saw earlier, another way to achieve parallel computation is through networked systems of computers. You still have multiple CPUs, but in this case, they are in entirely separate computers, each with its own memory. As pointed out earlier, network data transfer causes overhead. Its latency is again on the order of microseconds. Thus, even accessing a small amount of data across the network incurs a major delay.

Also note that snow has additional overhead, as it changes numeric objects such as vectors and matrices to character form before sending them, say from the manager to the workers. Not only does this entail time for the conversion (both in changing from numeric to character form and in changing back to numeric at the receiver), but the character form tends to make for much longer messages, thus longer network transfer time.

Shared-memory systems can be networked together, which, in fact, is what we did in the previous example. We had a hybrid situation in which we formed snow clusters from several networked dual-core computers.

Embarrassingly Parallel Applications and Those That Aren’t

It’s no shame to be poor, but it’s no great honor either.
—Tevye, Fiddler on the Roof

Man is the only animal that blushes, or needs to.

—Mark Twain

The term embarrassingly parallel is heard often in talk about parallel R (and in the parallel processing field in general). The word embarrassing alludes to the fact that the problems are so easy to parallelize that there is no intellectual challenge involved; they are embarrassingly easy. Both of the example applications we’ve looked at here would be considered embarrassingly parallel. Parallelizing the for i loop for the mutual outlinks problem was pretty obvious. Partitioning the work in the KMC example was also natural and easy.

By contrast, most parallel sorting algorithms require a great deal of interaction. For instance, consider merge sort, a common method of sorting numbers. It breaks the vector to be sorted into two (or more) independent parts, say the left half and right half, which are then sorted in parallel by two processes. So far, this is embarrassingly parallel, at least after the vector is divided in half. But then the two sorted halves must be merged to produce the sorted version of the original vector, and that process is not embarrassingly parallel. It can be parallelized but in a more complex manner.
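
To make the distinction concrete, here is a minimal sketch in R, using the fork-based mclapply() from the parallel package (so the parallel phase actually runs in parallel only on Unix-family systems; on Windows, set mc.cores = 1). Sorting the two halves is the embarrassingly parallel part; the merge that follows is plain serial code.

library(parallel)

mergesort2 <- function(x) {
   h <- length(x) %/% 2
   halves <- list(x[1:h], x[(h+1):length(x)])
   sorted <- mclapply(halves, sort, mc.cores = 2)   # parallel phase
   # serial merge phase
   a <- sorted[[1]]; b <- sorted[[2]]
   out <- numeric(length(x)); i <- 1; j <- 1
   for (k in seq_along(out)) {
      if (j > length(b) || (i <= length(a) && a[i] <= b[j])) {
         out[k] <- a[i]; i <- i + 1
      } else {
         out[k] <- b[j]; j <- j + 1
      }
   }
   out
}

mergesort2(c(5, 2, 9, 1, 7, 3))   # returns 1 2 3 5 7 9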

Of course, to paraphrase Tevye, it’s no shame to have an embarrassingly parallel problem! It may not exactly be an honor, but it is a cause for celebration, as it is easy to program. More important, embarrassingly parallel problems tend to have low communication overhead, which is crucial to performance, as discussed earlier. In fact, when most people refer to embarrassingly parallel applications, they have this low overhead in mind.

But what about nonembarrassingly parallel applications? Unfortunately, parallel R code is simply not suitable for many of them for a very basic reason: the functional programming nature of R. As discussed, a statement like this:

x[3] <- 8

is deceptively simple, because it can cause the entire vector x to be rewritten. This really compounds communication traffic problems. Accordingly, if your application is not embarrassingly parallel, your best strategy is probably to write the computationally intensive parts of the code in C, say using OpenMP or GPU programming. Also, note carefully that even being embarrassingly parallel does not make an algorithm efficient. Some such algorithms can still have significant communication traffic, thus compromising performance.
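
You can watch this copy-on-modify behavior yourself with base R’s tracemem(), provided your R build has memory profiling enabled (the standard binary distributions do). The sizes below are only illustrative.

x <- runif(1e6)
tracemem(x)    # report whenever this object gets duplicated
y <- x         # no copy yet; x and y share the same memory
y[3] <- 8      # tracemem prints a message: the whole vector is duplicated
untracemem(x)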

Consider the KMC problem, run under snow. Suppose we were to set up a large enough number of workers so that each worker had relatively little work to do. In that case, the communication with the manager after each iteration would become a significant portion of run time. In this situation, we would say that the granularity is too fine, and then probably switch to using fewer workers. We would then have larger tasks for each worker, thus a coarser granularity.

Static Versus Dynamic Task Assignment

Look again at the loop beginning on line 26 of our OpenMP example, reproduced here for convenience:

for (i = me; i < nval; i += nth) {
   mysum += procpairs(i,m,nval);
}

The variable me here was the thread number, so the effect of this code was that the various threads would work on nonoverlapping sets of values of i. We do want the values to be nonoverlapping, to avoid duplicate work and an incorrect count of the total number of links, so the code was fine. But the point now is that we were, in effect, preassigning the tasks that each thread would handle. This is called static assignment. An alternative approach is to revise the for loop to look something like this:

int nexti = 0;  // global variable: index of the next unclaimed value of i

for ( ; myi < n; ) {  // revised "for" loop; myi is a thread-local variable assumed declared earlier
   #pragma omp critical
   {
      nexti += 1;
      myi = nexti;
   }
   if (myi < n) {
      mysum += procpairs(myi,m,nval);
   }
}

This is dynamic task assignment, in which it is not determined ahead of time which threads handle which values of i. Task assignment is done during execution. At first glance, dynamic assignment seems to have the potential for better performance. Suppose, for instance, that in a static assignment setting, one thread finishes its last value of i early, while another thread still has two values of i left to do. This would mean our program would finish somewhat later than it could. In parallel-processing parlance, we would have a load balance problem. With dynamic assignment, the thread that finished when there were two values of i left to handle could have taken up one of those values itself. We would have better balance and theoretically less overall runtime.
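
Incidentally, the same static-versus-dynamic choice surfaces at the R level in snow: clusterApply() preassigns tasks to the workers in round-robin fashion, while clusterApplyLB() hands each new task to whichever worker becomes free first. Here is a toy comparison, a sketch only; the task is just a variable-length sleep, and timings will vary. Note, though, that the load-balancing version pays for its flexibility with extra per-task communication, a point taken up next.

library(snow)

cls <- makeCluster(type = "SOCK", rep("localhost", 4))
tasktimes <- runif(20, 0, 0.5)   # simulated, uneven task lengths

system.time(clusterApply(cls, tasktimes, Sys.sleep))     # static assignment
system.time(clusterApplyLB(cls, tasktimes, Sys.sleep))   # dynamic assignment

stopCluster(cls)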

But don’t jump to conclusions. As always, we have the overhead issue to reckon with. Recall that a critical pragma, used in the dynamic version of the code above, has the effect of temporarily rendering the program serial rather than parallel, thus causing a slowdown. In addition, for reasons too technical to discuss here, these pragmas may cause considerable cache activity overhead. So in the end, the dynamic code could actually be substantially slower than the static version.

Various solutions to this problem have been developed, such as an OpenMP construct named guided. But rather than present these, the point I wish to make is that they are unnecessary. In most situations, static assignment is just fine. Why is this the case? You may recall that the standard deviation of the sum of independent, identically distributed random variables, divided by the mean of that sum, goes to zero as the number of terms goes to infinity. In other words, sums are approximately constant. This has a direct implication for our load-balancing concerns: Since the total work time for a thread in static assignment is the sum of its individual task times, that total work time will be approximately constant; there will be very little variation from thread to thread. Thus, they will all finish at pretty close to the same time, and we do not need to worry about load imbalance. Dynamic scheduling will not be necessary.
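
A small simulation may make the argument more tangible. The numbers are purely illustrative: we draw random task times, sum them within each of four hypothetical threads, and watch the spread of the per-thread totals shrink relative to their mean as the number of tasks per thread grows.

set.seed(1)
for (ntasks in c(10, 100, 10000)) {
   totals <- replicate(4, sum(runif(ntasks)))   # one total work time per thread
   cat(ntasks, "tasks per thread: relative spread =",
       sd(totals) / mean(totals), "\n")
}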

This reasoning does depend on a statistical assumption, but in practice, the assumption will typically be met sufficiently well for the outcome: Static scheduling does as well as dynamic in terms of uniformity of total work times across threads. And since static scheduling doesn’t have the overhead problems of the dynamic kind, in most cases the static approach will give better performance. There is one more aspect of this to discuss. To illustrate the issue, consider again the mutual outlinks example. Let’s review the outline of the algorithm:

sum = 0
for i = 0…n-1
   for j = i+1…n-1
      for k = 0…n-1
         sum = sum + a[i][k]*a[j][k]
mean = sum / (n*(n-1)/2)

Say n is 10000 and we have four threads, and consider ways to partition the for i loop. Naively, we might at first decide to have thread 0 handle the i values 0 through 2499, thread 1 handle 2500 through 4999, and so on. However, this would produce a severe load imbalance, since the thread that handles a given value of i does an amount of work proportional to n-i. That, in fact, is why we staggered the values of i in our actual code: Thread 0 handled the i values 0, 4, 8 …, thread 1 worked on 1, 5, 9, …, and so on, yielding good load balance. The point then is that static assignment might require a bit more planning. One general approach to this is to randomly assign tasks (i values, in our case here) to threads (still doing so at the outset, before work begins). With a bit of forethought such as this, static assignment should work well in most applications.
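
Here is a quick numerical check of that load-balance argument, a sketch only, with the per-i work taken to be proportional to n-i. The thread labels are 1-based here, but the assignment patterns match the contiguous and staggered schemes just described.

n <- 10000; nth <- 4
work <- n - (0:(n-1))                       # work for value i, up to a constant

contig <- rep(1:nth, each = n/nth)          # thread 1 gets i = 0..2499, etc.
stagger <- rep(1:nth, length.out = n)       # thread 1 gets i = 0, 4, 8, ...

tapply(work, contig, sum)    # very unequal per-thread totals
tapply(work, stagger, sum)   # nearly identical per-thread totals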

Software Alchemy: Turning General Problems into Embarrassingly Parallel Ones

As discussed earlier, it’s difficult to attain good performance from non-embarrassingly parallel algorithms. Fortunately, for statistical applications, there is a way to turn nonembarrassingly parallel problems into embarrassingly parallel ones. The key is to exploit some statistical properties. To demonstrate the method, let’s once again turn to our mutual outlinks problem. The method, applied with w workers on a links matrix m, consists of the following:

1. Break the rows of m into w chunks.

2. Have each worker find the mean number of mutual outlinks for pairs of vertices in its chunk.

3. Average the results returned by the workers.

It can be shown mathematically that for large problems (the only ones you would need parallel computing for anyway), this chunked approach gives estimators of the same statistical accuracy as the nonchunked method. But meanwhile, we’ve turned a nonparallel problem into not just a parallel one but an embarrassingly parallel one! The workers in the preceding outline compute entirely independently of each other. This method should not be confused with the usual chunk-based approaches in parallel processing. In those, such as the merge-sort example discussed on page 347, the chunking is embarrassingly parallel, but the combining of results is not. By contrast, here the combining of results consists of simple averaging, thanks to the mathematical theory.
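
To make the three steps concrete, here is a minimal sketch under snow, assuming a local 4-worker SOCK cluster. The helper functions chunkmean() and chunkedmut() are hypothetical names introduced for illustration; chunkmean() computes the mean mutual-outlink count over the vertex pairs within one chunk of rows, and the matrix used at the end is just random test data.

library(snow)

chunkmean <- function(mchunk) {
   nr <- nrow(mchunk); tot <- 0
   for (i in 1:(nr-1))
      for (j in (i+1):nr)
         tot <- tot + sum(mchunk[i, ] * mchunk[j, ])
   tot / (nr * (nr-1) / 2)   # mean over pairs within this chunk
}

chunkedmut <- function(cls, m) {
   w <- length(cls)
   chunkids <- cut(seq_len(nrow(m)), w, labels = FALSE)   # step 1: w row chunks
   chunks <- split.data.frame(m, chunkids)
   means <- clusterApply(cls, chunks, chunkmean)          # step 2: per-chunk means
   mean(unlist(means))                                    # step 3: average the results
}

cls <- makeCluster(type = "SOCK", rep("localhost", 4))
m <- matrix(sample(0:1, 400 * 400, replace = TRUE), nrow = 400)
chunkedmut(cls, m)
stopCluster(cls)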

I tried this approach on the mutual outlinks problem in a 4-worker snow cluster. This reduced the runtime to 1.5 seconds. This is far better than the serial time of about 16 seconds, double the speedup obtained by the GPU and approaching comparability to the OpenMP time. And the theory showing that the two methods give the same statistical accuracy was confirmed as well. The chunked method found the mean number of mutual outlinks to be 249.2881, compared to 249.2993 for the original estimator.

Debugging Parallel R Code

Parallel R packages such as Rmpi, snow, foreach, and so on do not set up a terminal window for each process, thus making it impossible to use R’s debugger on the workers. (My Rdsm package, which adds a threads capability to R, is an exception to this.) What then can you do to debug apps for those packages? Let’s consider snow for a concrete example. First, you should debug the underlying single-worker function, such as mtl(). Here, we would set up some artificial values of the arguments and then use R’s ordinary debugging facilities.

Debugging the underlying function may be sufficient. However, the bug may be in the arguments themselves or in the way we set them up. Then things get more difficult. It’s even hard to print out trace information, such as values of variables, since print() won’t work in the worker processes. The message() function may work for some of these packages; if not, you may need to resort to using cat() to write to a file.
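
For example, here is one way such a cat()-based trace might look under snow. The helper tracecat() and its file-naming scheme are just illustrative choices; each worker appends its messages to its own file, named after its process ID, since print() output from workers is not visible at the manager.

tracecat <- function(...) {
   fname <- paste0("trace_", Sys.getpid(), ".txt")
   cat(..., "\n", file = fname, append = TRUE)
}

# inside your worker function, e.g.:
#    tracecat("processing chunk starting at", ichunk[1])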
