R CHAPTER 3 2018-08-16T06:21:47+00:00

# R PROGRAMMING CHAPTER 3

## FACTORS AND TABLES

Factors form the basis for many of R’s powerful operations, including many
of those performed on tabular data. The motivation for factors comes from the notion of nominal, or categorical, variables in statistics. These values are nonnumerical in nature, corresponding to categories such as Democrat, Republican, and Unaffiliated, although they may be coded using numbers.

In this chapter, we’ll begin by looking at the extra information contained in factors and then focus on the functions used with factors. We’ll also explore tables and common table operations.

## Factors and Levels

An R factor might be viewed simply as a vector with a bit more information added (though, as seen below, it’s different from this internally). That extra information consists of a record of the distinct values in that vector, called levels. Here’s an example:

> x <- c(5,12,13,12)
> xf <- factor(x)
> xf
[1] 5  12 13 12
Levels: 5 12 13

The distinct values in xf—5, 12, and 13—are the levels here. Let’s take a look inside:

> str(xf)
Factor w/ 3 levels "5","12","13": 1 2 3 2
> unclass(xf)
[1] 1 2 3 2
attr(,"levels")
[1] "5"  "12" "13"

This is revealing. The core of xf here is not (5,12,13,12) but rather (1,2,3,2). The latter means that our data consists first of a level-1 value, then level-2 and level-3 values, and finally another level-2 value. So the data has been recoded by level. The levels themselves are recorded too, of course, though as characters such as "5" rather than 5.

The length of a factor is still defined in terms of the length of the data rather than, say, being a count of the number of levels:

> length(xf)
[1] 4
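One practical consequence of this internal coding is worth a quick illustration (this example is ours, not from the original text): calling as.numeric() on a factor returns the internal level codes, not the original values. To recover the values, convert to character first.

```r
x <- c(5,12,13,12)
xf <- factor(x)
# as.numeric() on a factor yields the level codes
as.numeric(xf)                # 1 2 3 2
# converting to character first recovers the original values
as.numeric(as.character(xf))  # 5 12 13 12
```

This is a classic pitfall when a numeric variable has been stored as a factor, such as after reading a file with stray nonnumeric entries.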

We can anticipate future new levels, as seen here:

> x <- c(5,12,13,12)
> xff <- factor(x,levels=c(5,12,13,88))
> xff
[1] 5  12 13 12
Levels: 5 12 13 88
> xff[2] <- 88
> xff
[1] 5  88 13 12
Levels: 5 12 13 88

Originally, xff did not contain the value 88, but in defining it, we allowed for that future possibility. Later, we did indeed add the value. By the same token, you cannot sneak in an “illegal” level. Here’s what happens when you try:

> xff[2] <- 28
Warning message:
In `[<-.factor`(`*tmp*`, 2, value = 28) :
  invalid factor level, NAs generated
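If you do later need to admit a value that was not among the original levels, one approach (a sketch we are adding here, not code from the original text) is to extend the levels attribute before assigning:

```r
xf <- factor(c(5,12,13,12))
# extend the set of legal levels, then the assignment succeeds
levels(xf) <- c(levels(xf),"28")
xf[2] <- 28
xf
# [1] 5  28 13 12
# Levels: 5 12 13 28
```

Without the levels() step, the assignment would generate an NA, as in the warning above.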

## Common Functions Used with Factors

With factors, we have yet another member of the family of apply functions, tapply. We’ll look at that function, as well as two other functions commonly used with factors: split() and by().

#### The tapply() Function

As motivation, suppose we have a vector x of ages of voters and a factor f showing some nonnumeric trait of those voters, such as party affiliation (Democrat, Republican, Unaffiliated). We might wish to find the mean ages in x within each of the party groups.

In typical usage, the call tapply(x,f,g) has x as a vector, f as a factor or list of factors, and g as a function. The function g() in our little example above would be R’s built-in mean() function. If we wanted to group by both party and another factor, say gender, we would need f to consist of the two factors, party and gender.

Each factor in f must have the same length as x. This makes sense in light of the voter example above; we should have as many party affiliations as ages. If a component of f is a vector, it will be coerced into a factor by applying as.factor() to it. The operation performed by tapply() is to (temporarily) split x into groups, each group corresponding to a level of the factor (or a combination of levels of the factors in the case of multiple factors), and then apply g() to the resulting subvectors of x. Here’s a little example:

> ages <- c(25,26,55,37,21,42)
> affils <- c("R","D","D","R","U","D")
> tapply(ages,affils,mean)
 D  R  U
41 31 21

Let’s look at what happened. The function tapply() treated the vector ("R","D","D","R","U","D") as a factor with levels "D", "R", and "U". It noted that "D" occurred in indices 2, 3, and 6; "R" occurred in indices 1 and 4; and "U" occurred in index 5. For convenience, let’s refer to the three index vectors (2,3,6), (1,4), and (5) as x, y, and z, respectively. Then tapply() computed mean(ages[x]), mean(ages[y]), and mean(ages[z]) and returned those means in a three-element vector. And that vector’s element names are "D", "R", and "U", reflecting the factor levels that were used by tapply().

What if we have two or more factors? Then each factor yields a set of groups, as in the preceding example, and the groups are ANDed together. As an example, suppose that we have an economic data set that includes variables for gender, age, and income. Here, the call tapply(x,f,g) might have x as income and f as a pair of factors: one for gender and the other coding whether the person is older or younger than 25. We may be interested in finding mean income, broken down by gender and age. If we set g() to be mean(), tapply() will return the mean incomes in each of four subgroups:
1. Male and under 25 years old

2. Female and under 25 years old

3. Male and over 25 years old

4. Female and over 25 years old

Here’s a toy example of that setting:

> d <- data.frame(list(gender=c("M","M","F","M","F","F"),
+ age=c(47,59,21,32,33,24),income=c(55000,88000,32450,76500,123000,45650)))
> d
  gender age income
1      M  47  55000
2      M  59  88000
3      F  21  32450
4      M  32  76500
5      F  33 123000
6      F  24  45650
> d$over25 <- ifelse(d$age > 25,1,0)
> d
  gender age income over25
1      M  47  55000      1
2      M  59  88000      1
3      F  21  32450      0
4      M  32  76500      1
5      F  33 123000      1
6      F  24  45650      0
> tapply(d$income,list(d$gender,d$over25),mean)
      0         1
F 39050 123000.00
M    NA  73166.67

We specified two factors, gender and an indicator variable for age over or under 25. Since each of these factors has two levels, tapply() partitioned the income data into four groups, one for each combination of gender and age, and then applied the mean() function to each group.

#### The split() Function

In contrast to tapply(), which splits a vector into groups and then applies a specified function on each group, split() stops at that first stage, just forming the groups. The basic form, without bells and whistles, is split(x,f), with x and f playing roles similar to those in the call tapply(x,f,g); that is, x being a vector or data frame and f being a factor or a list of factors. The action is to split x into groups, which are returned in a list. (Note that x is allowed to be a data frame with split() but not with tapply().) Let’s try it out with our earlier example.

> d
  gender age income over25
1      M  47  55000      1
2      M  59  88000      1
3      F  21  32450      0
4      M  32  76500      1
5      F  33 123000      1
6      F  24  45650      0
> split(d$income,list(d$gender,d$over25))
$F.0
[1] 32450 45650

$M.0
numeric(0)

$F.1
[1] 123000

$M.1
[1] 55000 88000 76500

The output of split() is a list, and recall that list components are denoted by dollar signs. So the last vector, for example, was named "M.1" to indicate that it was the result of combining "M" in the first factor and 1 in the second.

As another illustration, suppose we wanted to determine the indices of the vector elements corresponding to male, female, and infant. The data in that little example consists of the seven-observation vector ("M","F","F","I","M","M","F"), assigned to g. We can do this in a flash with split().

> g <- c("M","F","F","I","M","M","F")
> split(1:7,g)
$F
[1] 2 3 7

$I
[1] 4

$M
[1] 1 5 6

The results show the female cases are in records 2, 3, and 7; the infant case is in record 4; and the male cases are in records 1, 5, and 6.

Let’s dissect this step-by-step. The vector g, taken as a factor, has three levels: "M", "F", and "I". The indices corresponding to the first level are 1, 5, and 6, which means that g[1], g[5], and g[6] all have the value "M". So, R sets the M component of the output to elements 1, 5, and 6 of 1:7, which is the vector (1,5,6).

We can take a similar approach to simplify the code in our text concordance example. There, we wished to input a text file, determine which words were in the text, and then output a list giving the words and their locations within the text. We can use split() to make short work of writing the code, as follows:

findwords <- function(tf) {
   # read in the words from the file, into a vector of mode character
   txt <- scan(tf,"")
   words <- split(1:length(txt),txt)
   return(words)
}

The call to scan() returns a character vector txt of the words read in from the file tf. So, txt[1] will contain the first word input from the file, txt[2] will contain the second word, and so on; length(txt) will thus be the total number of words read. Suppose for concreteness that that number is 220.

Meanwhile, txt itself, as the second argument in split() above, will be taken as a factor. The levels of that factor will be the various words in the file. If, for instance, the file contains the word world 6 times and climate was there 10 times, then "world" and "climate" will be two of the levels of txt. The call to split() will then determine where these and the other words appear in txt.

#### The by() Function

Suppose in the abalone example we wish to do regression analyses of diameter against length separately for each gender code: males, females, and infants. At first, this seems like something tailor-made for tapply(), but the first argument of that function must be a vector, not a matrix or a data frame.
The function to be applied can be multivariate—for example, range()—but the input must be a vector. Yet the input for regression is a matrix (or data frame) with at least two columns: one for the predicted variable and one or more for predictor variables. In our abalone data application, the matrix would consist of a column for the diameter data and a column for length.

The by() function can be used here. It works like tapply() (which it calls internally, in fact), but it is applied to objects rather than vectors. Here’s how to use it for the desired regression analyses:

> aba <- read.csv("abalone.data",header=TRUE)
> by(aba,aba$Gender,function(m) lm(m[,2] ~ m[,3]))

aba$Gender: F

Call:
lm(formula = m[, 2] ~ m[, 3])

Coefficients:
(Intercept)           m[, 3]

0.04288             1.17918

————————————————————————
aba$Gender: I

Call:

lm(formula = m[, 2] ~ m[, 3])

Coefficients:

(Intercept)           m[, 3]

0.02997             1.21833

————————————————————

aba$Gender: M

Call:
lm(formula = m[, 2] ~ m[, 3])

Coefficients:
(Intercept)           m[, 3]

0.03653             1.19480

Calls to by() look very similar to calls to tapply(), with the first argument specifying our data, the second the grouping factor, and the third the function to be applied to each group.

Just as tapply() forms groups of indices of a vector according to levels of a factor, this by() call finds groups of row numbers of the data frame aba. That creates three subdata frames: one for each gender level of M, F, and I. The anonymous function we defined regresses the second column of its matrix argument m against the third column. This function will be called three times—once for each of the three subdata frames created earlier—thus producing the three regression analyses.

## Working with Tables

To begin exploring R tables, consider this example:

> u <- c(22,8,33,6,8,29,-2)
> fl <- list(c(5,12,13,12,13,5,13),c("a","bc","a","a","bc","a","a"))
> tapply(u,fl,length)
   a  bc
5  2  NA
12 1   1
13 2   1

Here, tapply() again temporarily breaks u into subvectors, as you saw earlier, and then applies the length() function to each subvector. (Note that this is independent of what’s in u. Our focus now is purely on the factors.) Those subvector lengths are the counts of the occurrences of each of the 3 × 2 = 6 combinations of the two factors. For instance, 5 occurred twice with "a" and not at all with "bc"; hence the entries 2 and NA in the first row of the output. In statistics, this is called a contingency table.

There is one problem in this example: the NA value. It really should be 0, meaning that in no cases did the first factor have level 5 and the second have level "bc". The table() function creates contingency tables correctly.

> table(fl)
    fl.2
fl.1 a bc
  5  2  0
  12 1  1
  13 2  1

The first argument in a call to table() is either a factor or a list of factors. The two factors here were (5,12,13,12,13,5,13) and ("a","bc","a","a","bc","a","a").
In this case, an object that is interpretable as a factor is counted as one.

Typically a data frame serves as the table() data argument. Suppose for instance the file ct.dat consists of election-polling data, in which candidate X is running for reelection. The ct.dat file looks like this:

"Vote for X" "Voted for X Last Time"
"Yes" "Yes"
"Yes" "No"
"No" "No"
"Not Sure" "Yes"
"No" "No"

In the usual statistical fashion, each row in this file represents one subject under study. In this case, we have asked five people the following two questions:

1. Do you plan to vote for candidate X?

2. Did you vote for X in the last election?

This gives us five rows in the data file. Let’s read in the file:

> ct <- read.table("ct.dat",header=T)
> ct
  Vote.for.X Voted.for.X.Last.Time
1        Yes                   Yes
2        Yes                    No
3         No                    No
4   Not Sure                   Yes
5         No                    No

We can use the table() function to compute the contingency table for this data:

> cttab <- table(ct)
> cttab
          Voted.for.X.Last.Time
Vote.for.X No Yes
  No        2   0
  Not Sure  0   1
  Yes       1   1

The 2 in the upper-left corner of the table shows that we had, for example, two people who said "no" to the first and second questions. The 1 in the middle-right indicates that one person answered "not sure" to the first question and "yes" to the second question.

We can also get one-dimensional counts, which are counts on a single factor, as follows:

> table(c(5,12,13,12,8,5))
 5  8 12 13
 2  1  2  1

Here’s an example of a three-dimensional table, involving voters’ genders, race (white, black, Asian, and other), and political views (liberal or conservative):

> v  # the data frame
  gender race pol
1      M    W   L
2      M    W   L
3      F    A   C
4      M    O   L
5      F    B   L
6      F    B   C
> vt <- table(v)
> vt
, , pol = C

      race
gender A B O W
     F 1 1 0 0
     M 0 0 0 0

, , pol = L

      race
gender A B O W
     F 0 1 0 0
     M 0 0 1 2

R prints out a three-dimensional table as a series of two-dimensional tables. In this case, it generates a table of gender and race for conservatives and then a corresponding table for liberals.
For example, the second two-dimensional table says that there were two white male liberals.

#### Matrix/Array-Like Operations on Tables

Just as most (nonmathematical) matrix/array operations can be used on data frames, they can be applied to tables, too. (This is not surprising, given that the cell counts portion of a table object is an array.)

For example, we can access the table cell counts using matrix notation. Let’s apply this to our voting example from the previous section.

> class(cttab)
[1] "table"
> cttab[1,1]
[1] 2
> cttab[1,]
 No Yes
  2   0

In the second command, even though the first command had shown that cttab had class "table", we treated it as a matrix and printed out its "[1,1] element." Continuing this idea, the third command printed the first row of this "matrix."

We can multiply the matrix by a scalar. For instance, here’s how to change cell counts to proportions:

> cttab/5
          Voted.for.X.Last.Time
Vote.for.X  No Yes
  No       0.4 0.0
  Not Sure 0.0 0.2
  Yes      0.2 0.2

In statistics, the marginal values of a variable are those obtained when this variable is held constant while others are summed. In the voting example, the marginal values of the Vote.for.X variable are 2 + 0 = 2, 0 + 1 = 1, and 1 + 1 = 2. We can of course obtain these via the matrix apply() function:

> apply(cttab,1,sum)
      No Not Sure      Yes
       2        1        2

Note that the labels here, such as No, came from the row names of the matrix, which table() produced.

But R supplies a function addmargins() for this purpose—that is, to find marginal totals. Here’s an example:

> addmargins(cttab)
          Voted.for.X.Last.Time
Vote.for.X No Yes Sum
  No        2   0   2
  Not Sure  0   1   1
  Yes       1   1   2
  Sum       3   2   5

Here, we got the marginal data for both dimensions at once, conveniently superimposed onto the original table.

We can get the names of the dimensions and levels through dimnames(), as follows:

> dimnames(cttab)
$Vote.for.X

[1] "No"       "Not Sure" "Yes"

$Voted.for.X.Last.Time
[1] "No"  "Yes"

#### Extended Example: Extracting a Subtable

Let’s continue working with our voting example:

> cttab
          Voted.for.X.Last.Time
Vote.for.X No Yes
  No        2   0
  Not Sure  0   1
  Yes       1   1

Suppose we wish to present this data at a meeting, concentrating on those respondents who know they will vote for X in the current election. In other words, we wish to eliminate the Not Sure entries and present a subtable that looks like this:

          Voted.for.X.Last.Time
Vote.for.X No Yes
  No        2   0
  Yes       1   1

The function subtable() below performs subtable extraction. It has two arguments:

• tbl: The table of interest, of class "table".

• subnames: A list specifying the desired subtable extraction. Each component of this list is named after some dimension of tbl, and the value of that component is a vector of the names of the desired levels.

So, let’s review what we have in this example before looking at the code. The argument cttab will be a two-dimensional table, with dimension names Vote.for.X and Voted.for.X.Last.Time. Within those two dimensions, the level names are No, Not Sure, and Yes in the first dimension, and No and Yes in the second. Of those, we wish to exclude the Not Sure cases, so our actual argument corresponding to the formal argument subnames is as follows:

list(Vote.for.X=c("No","Yes"),Voted.for.X.Last.Time=c("No","Yes"))

We can now call the function.

> subtable(cttab,list(Vote.for.X=c("No","Yes"),
+    Voted.for.X.Last.Time=c("No","Yes")))
          Voted.for.X.Last.Time
Vote.for.X No Yes
  No        2   0
  Yes       1   1

Now that we have a feel for what the function does, let’s take a look at its innards.
subtable <- function(tbl,subnames) {
   # get array of cell counts in tbl
   tblarray <- unclass(tbl)
   # we'll get the subarray of cell counts corresponding to subnames by
   # calling do.call() on the "[" function; we need to build up a list
   # of arguments first
   dcargs <- list(tblarray)
   ndims <- length(subnames)  # number of dimensions
   for (i in 1:ndims) {
      dcargs[[i+1]] <- subnames[[i]]
   }
   subarray <- do.call("[",dcargs)
   # now we'll build the new table, consisting of the subarray, the
   # numbers of levels in each dimension, and the dimnames() value, plus
   # the "table" class attribute
   dims <- lapply(subnames,length)
   subtbl <- array(subarray,dims,dimnames=subnames)
   class(subtbl) <- "table"
   return(subtbl)
}

So, what’s happening here? To prepare for writing this code, I first did a little detective work to determine the structure of objects of class "table". Looking through the code of the function table(), I found that at its core, an object of class "table" consists of an array whose elements are the cell counts. So the strategy is to extract the desired subarray, then add names to the dimensions of the subarray, and then bestow "table" class status to the result.

For the code here, then, the first task is to form the subarray corresponding to the user’s desired subtable, and this constitutes most of the code. To this end, in line 3, we first extract the full cell counts array, storing it in tblarray. The question is how to use that to find the desired subarray. In principle, this is easy. In practice, that’s not always the case. To get the desired subarray, I needed to form a subsetting expression on the array tblarray—something like this:

tblarray[some index ranges here]

In our voting example, the expression is as follows:

tblarray[c("No","Yes"),c("No","Yes")]

This is simple in concept but difficult to do directly, since tblarray could be of different dimensions (two, three, or anything else). Recall that R’s array subscripting is actually done via a function named "["().
This function takes a variable number of arguments: two for two-dimensional arrays, three for three-dimensional arrays, and so on. This problem is solved by using R’s do.call(). This function has the following basic form:

do.call(f,argslist)

where f is a function and argslist is a list of arguments on which to call f(). In other words, the preceding code basically does this:

f(argslist[[1]],argslist[[2]],...)

This makes it easy to call a function with a variable number of arguments. For our example, we need to form a list consisting first of tblarray and then the user’s desired levels for each dimension. Our list looks like this:

list(tblarray,Vote.for.X=c("No","Yes"),Voted.for.X.Last.Time=c("No","Yes"))

Lines 7 through 11 build up this list for the general case. That’s our subarray. Then we need to attach the names and set the class to "table". The former operation can be done via R’s array() function, which has the following arguments:

• data: The data to be placed into the new array. In our case, this is subarray.

• dim: The dimension lengths (number of rows, number of columns, number of layers, and so on). In our case, this is the value dims, computed in line 16.

• dimnames: The dimension names and the names of their levels, already given to us by the user as the argument subnames.

This was a somewhat conceptually complex function to write, but it gets easier once you’ve mastered the inner structures of the "table" class.

#### Extended Example: Finding the Largest Cells in a Table

It can be difficult to view a table that is very big, with a large number of rows or dimensions. One approach might be to focus on the cells with the largest frequencies. That’s the purpose of the tabdom() function developed below—it reports the dominant frequencies in a table. Here’s a simple call:

tabdom(tbl,k)

This reports the cells in the table tbl that have the k largest frequencies.
Here’s an example:

> d <- c(5,12,13,4,3,28,12,12,9,5,5,13,5,4,12)
> dtab <- table(d)
> tabdom(dtab,3)
   d Freq
3  5    4
5 12    4
2  4    2

The function tells us that the values 5 and 12 were the most frequent in d, with four instances each, and the next most frequent value was 4, with two instances. (The 3, 5, and 2 on the left are actually extraneous information; see the following discussion regarding converting a table to a data frame.)

As another example, consider our table cttab in the examples in the preceding sections:

> tabdom(cttab,2)
  Vote.for.X Voted.for.X.Last.Time Freq
1         No                    No    2
3        Yes                    No    1

So the combination No-No was most frequent, with two instances, with the second most frequent being Yes-No, with one instance.

Well, how is this accomplished? It looks fairly complicated, but actually the work is made pretty easy by a trick, exploiting the fact that you can present tables in data frame format. Let’s use our cttab table again.

> as.data.frame(cttab)
  Vote.for.X Voted.for.X.Last.Time Freq
1         No                    No    2
2   Not Sure                    No    0
3        Yes                    No    1
4         No                   Yes    0
5   Not Sure                   Yes    1
6        Yes                   Yes    1

Note that this is not the original data frame ct from which the table cttab was constructed. It is simply a different presentation of the table itself. There is one row for each combination of the factors, with a Freq column added to show the number of instances of each combination. This latter feature makes our task quite easy.

# finds the cells in table tbl with the k highest frequencies; handling
# of ties is unrefined
tabdom <- function(tbl,k) {
   # create a data frame representation of tbl, adding a Freq column
   tbldf <- as.data.frame(tbl)
   # determine the proper positions of the frequencies in a sorted order
   freqord <- order(tbldf$Freq,decreasing=TRUE)
   # rearrange the data frame in that order, and take the first k rows
   dom <- tbldf[freqord,][1:k,]
   return(dom)
}

The comments should make the code self-explanatory. (But didn’t the Not Sure-Yes and Yes-Yes combinations also have one instance, and thus tie with Yes-No for second place? Yes, definitely. My code is cavalier regarding ties, and the reader is encouraged to refine it in that direction.)

The sorting approach in line 7, which makes use of order(), is the standard way to sort a data frame (worth remembering, since the situation arises rather frequently).
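As a minimal illustration of that idiom (with a made-up data frame, not one from the text): order() returns the permutation of row indices that sorts the column, and indexing the data frame by that permutation rearranges its rows.

```r
df <- data.frame(x=c("a","b","c"),Freq=c(1,3,2))
# order() gives the row indices in sorted position, not the sorted values
ord <- order(df$Freq,decreasing=TRUE)
ord       # 2 3 1
# indexing by that permutation sorts the whole data frame by Freq
df[ord,]
```

The key point is that order() operates on one column but the resulting index vector rearranges every column together, keeping rows intact.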

## Other Factor- and Table-Related Functions

R includes a number of other functions that are handy for working with tables and factors. We’ll discuss two of them here: aggregate() and cut().

NOTE Hadley Wickham’s reshape package “lets you flexibly restructure and aggregate data using just two functions: melt and cast.” This package may take a while to learn, but it is extremely powerful. His plyr package is also quite versatile. You can download both packages from R’s CRAN repository. See Appendix B for more details about downloading and installing packages.

#### The aggregate() Function

The aggregate() function calls tapply() once for each variable in a group. For example, in the abalone data, we could find the median of each variable, broken down by gender, as follows:

> aggregate(aba[,-1],list(aba$Gender),median)
  Group.1 Length Diameter Height WholeWt ShuckedWt ViscWt ShellWt Rings
1       F  0.590    0.465  0.160 1.03850   0.44050 0.2240   0.295    10
2       I  0.435    0.335  0.110 0.38400   0.16975 0.0805   0.113     8
3       M  0.580    0.455  0.155 0.97575   0.42175 0.2100   0.276    10

The first argument, aba[,-1], is the entire data frame except for the first column, which is Gender itself. The second argument, which must be a list, is our Gender factor as before. Finally, the third argument tells R to compute the median on each column in each of the data frames generated by the subgrouping corresponding to our factors. There are three such subgroups in our example here and thus three rows in the output of aggregate().

#### The cut() Function

A common way to generate factors, especially for tables, is the cut() function. You give it a data vector x and a set of bins defined by a vector b. The function then determines which bin each of the elements of x falls into. The following is the form of the call we’ll use here:

y <- cut(x,b,labels=FALSE)

where the bins are defined to be the semi-open intervals (b[1],b[2]], (b[2],b[3]], and so on. Here’s an example:

> z
[1] 0.88114802 0.28532689 0.58647376 0.42851862 0.46881514 0.24226859 0.05289197
[8] 0.88035617
> seq(from=0.0,to=1.0,by=0.1)
 [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> binmarks <- seq(from=0.0,to=1.0,by=0.1)
> cut(z,binmarks,labels=F)
[1] 9 3 6 5 5 3 1 9

This says that z[1], 0.88114802, fell into bin 9, which was (0.8,0.9]; z[2], 0.28532689, fell into bin 3; and so on.

This returns a vector, as seen in the example’s result. But we can convert it into a factor and possibly then use it to build a table. For instance, you can imagine using this to write your own specialized histogram function. (The R function findInterval() would be useful for this, too.)

## R PROGRAMMING STRUCTURES

R is a block-structured language in the manner of the ALGOL-descendant family, such as C, C++, Python, Perl, and so on.
As you’ve seen, blocks are delineated by braces, though braces are optional if the block consists of just a single statement. Statements are separated by newline characters or, optionally, by semicolons.

Here, we cover the basic structures of R as a programming language. We’ll review some more details on loops and the like and then head straight into the topic of functions, which will occupy most of the chapter. In particular, issues of variable scope will play a major role. As with many scripting languages, you do not “declare” variables in R. Programmers who have a background in, say, the C language, will find similarities in R at first but then will see that R has a richer scoping structure.

## Control Statements

Control statements in R look very similar to those of the ALGOL-descendant family languages mentioned above. Here, we’ll look at loops and if-else statements.

#### Loops

We defined the oddcount() function. In that function, the following line should have been instantly recognized by Python programmers:

for (n in x) {

It means that there will be one iteration of the loop for each component of the vector x, with n taking on the values of those components—in the first iteration, n = x[1]; in the second iteration, n = x[2]; and so on. For example, the following code uses this structure to output the square of every element in a vector:

> x <- c(5,12,13)
> for (n in x) print(n^2)
[1] 25
[1] 144
[1] 169

C-style looping with while and repeat is also available, complete with break, a statement that causes control to leave the loop. Here is an example that uses all three:

> i <- 1
> while (i <= 10) i <- i+4
> i
[1] 13
>
> i <- 1
> while(TRUE) {  # similar loop to above
+    i <- i+4
+    if (i > 10) break
+ }
> i
[1] 13
>
> i <- 1
> repeat {  # again similar
+    i <- i+4
+    if (i > 10) break
+ }
> i
[1] 13

In the first code snippet, the variable i took on the values 1, 5, 9, and 13 as the loop went through its iterations.
In that last case, the condition i <= 10 failed, so the break took hold and we left the loop.

This code shows three different ways of accomplishing the same thing, with break playing a key role in the second and third ways. Note that repeat has no Boolean exit condition. You must use break (or something like return()). Of course, break can be used with for loops, too.

Another useful statement is next, which instructs the interpreter to skip the remainder of the current iteration of the loop and proceed directly to the next one. This provides a way to avoid using complexly nested if-then-else constructs, which can make the code confusing. Let’s take a look at an example that uses next.

sim <- function(nreps) {
   commdata <- list()
   commdata$countabsamecomm <- 0
   for (rep in 1:nreps) {
      commdata$whosleft <- 1:20
      commdata$numabchosen <- 0
      commdata <- choosecomm(commdata,5)
      if (commdata$numabchosen > 0) next
      commdata <- choosecomm(commdata,4)
      if (commdata$numabchosen > 0) next
      commdata <- choosecomm(commdata,3)
   }

   print(commdata$countabsamecomm/nreps)
}

There are next statements in lines 8 and 10. Let’s see how they work and how they improve on the alternatives. The two next statements occur within the loop that starts at line 4. Thus, when the if condition holds in line 8, lines 9 through 11 will be skipped, and control will transfer to line 4. The situation in line 10 is similar.

Without using next, we would need to resort to nested if statements, something like these:

sim <- function(nreps) {
   commdata <- list()
   commdata$countabsamecomm <- 0
   for (rep in 1:nreps) {
      commdata$whosleft <- 1:20
      commdata$numabchosen <- 0
      commdata <- choosecomm(commdata,5)
      if (commdata$numabchosen == 0) {
         commdata <- choosecomm(commdata,4)
         if (commdata$numabchosen == 0)
            commdata <- choosecomm(commdata,3)
      }
   }

   print(commdata$countabsamecomm/nreps)
}

Because this simple example has just two levels, it’s not too bad. However, nested if statements can become confusing when you have more levels.

The for construct works on any vector, regardless of mode. You can loop over a vector of filenames, for instance. Say we have a file named file1 with the following contents:

1 2 3 4 5 6

We also have a file named file2 with these contents:

5 12 13

The following loop reads and prints each of these files. We use the scan() function here to read in a file of numbers and store those values in a vector.

> for (fn in c("file1","file2")) print(scan(fn))
Read 6 items
[1] 1 2 3 4 5 6
Read 3 items
[1]  5 12 13

So, fn is first set to file1, and the file of that name is read in and printed out. Then the same thing happens for file2.

#### Looping Over Nonvector Sets

R does not directly support iteration over nonvector sets, but there are a couple of indirect yet easy ways to accomplish it:

• Use lapply(), assuming that the iterations of the loop are independent of each other, thus allowing them to be performed in any order.

• Use get(). As its name implies, this function takes as an argument a character string representing the name of some object and returns the object of that name. It sounds simple, but get() is a very powerful function.

Let’s look at an example of using get(). Say we have two matrices, u and v, containing statistical data, and we wish to apply R’s linear regression function lm() to each of them.

> u
     [,1] [,2]
[1,]    1    1
[2,]    2    2
[3,]    3    4
> v
     [,1] [,2]
[1,]    8   15
[2,]   12   10
[3,]   20    2
> for (m in c("u","v")) {
+    z <- get(m)
+    print(lm(z[,2] ~ z[,1]))
+ }

Call:
lm(formula = z[, 2] ~ z[, 1])

Coefficients:
(Intercept)       z[, 1]
    -0.6667       1.5000

Call:
lm(formula = z[, 2] ~ z[, 1])

Coefficients:
(Intercept)       z[, 1]
     23.286       -1.071

Here, m was first set to u. Then these lines assign the matrix u to z, which allows the call to lm() on u:

z <- get(m)
print(lm(z[,2] ~ z[,1]))

The same then occurs with v.
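For completeness, the lapply() route mentioned in the first bullet can be sketched as follows (this variant is ours, not from the original text), applied to the same u and v data:

```r
u <- matrix(c(1,2,3,1,2,4),nrow=3)
v <- matrix(c(8,12,20,15,10,2),nrow=3)
# put the matrices in a named list and regress column 2 on column 1 of
# each; since the iterations are independent, their order doesn't matter
results <- lapply(list(u=u,v=v),function(z) lm(z[,2] ~ z[,1]))
results$u  # the same fit the get() loop produced for u
```

Unlike the get() loop, this version never relies on variable names as strings; the objects themselves travel in the list, and the fits come back in a list with matching names.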
#### if-else

The syntax for if-else looks like this:

if (r == 4) {
   x <- 1
} else {
   x <- 3
   y <- 4
}

It looks simple, but there is an important subtlety here. The if section consists of just a single statement:

x <- 1

So, you might guess that the braces around that statement are not necessary. However, they are indeed needed. The right brace before the else is used by the R parser to deduce that this is an if-else rather than just an if. In interactive mode, without braces, the parser would mistakenly think the latter and act accordingly, which is not what we want.

An if-else statement works like a function call, and as such, it returns the value of the last statement executed:

v <- if (cond) expression1 else expression2

This will set v to the result of expression1 or expression2, depending on whether cond is true. You can use this fact to compact your code. Here's a simple example:

> x <- 2
> y <- if(x == 2) x else x+1
> y
[1] 2
> x <- 3
> y <- if(x == 2) x else x+1
> y
[1] 4

Without taking this tack, the code

y <- if(x == 2) x else x+1

would instead consist of the somewhat more cluttered

if(x == 2) y <- x else y <- x+1

In more complex examples, expression1 and/or expression2 could be function calls. On the other hand, you probably should not let compactness take priority over clarity.

When working with vectors, use the ifelse() function, as it will likely produce faster code.

#### Arithmetic and Boolean Operators and Values

Table 7-1 lists the basic operators.
Table 7-1: Basic R Operators

| Operation | Description |
|-----------|-------------|
| x + y | Addition |
| x - y | Subtraction |
| x * y | Multiplication |
| x / y | Division |
| x ^ y | Exponentiation |
| x %% y | Modular arithmetic |
| x %/% y | Integer division |
| x == y | Test for equality |
| x <= y | Test for less than or equal to |
| x >= y | Test for greater than or equal to |
| x && y | Boolean AND for scalars |
| x \|\| y | Boolean OR for scalars |
| x & y | Boolean AND for vectors (vector x, y, result) |
| x \| y | Boolean OR for vectors (vector x, y, result) |
| !x | Boolean negation |

Though R ostensibly has no scalar types, with scalars being treated as one-element vectors, we see the exception in Table 7-1: There are different Boolean operators for the scalar and vector cases. This may seem odd, but a simple example will demonstrate the need for such a distinction.

> x
[1]  TRUE FALSE  TRUE
> y
[1]  TRUE  TRUE FALSE
> x & y
[1]  TRUE FALSE FALSE
> x[1] && y[1]
[1] TRUE
> x && y   # looks at just the first elements of each vector
[1] TRUE
> if (x[1] && y[1]) print("both TRUE")
[1] "both TRUE"
> if (x & y) print("both TRUE")
[1] "both TRUE"
Warning message:
In if (x & y) print("both TRUE") :
  the condition has length > 1 and only the first element will be used

The central point is that in evaluating an if, we need a single Boolean, not a vector of Booleans, hence the warning seen in the preceding example, as well as the need for having both the & and && operators.

The Boolean values TRUE and FALSE can be abbreviated as T and F (both must be capitalized). These values change to 1 and 0 in arithmetic expressions:

> 1 < 2
[1] TRUE
> (1 < 2) * (3 < 4)
[1] 1
> (1 < 2) * (3 < 4) * (5 < 1)
[1] 0
> (1 < 2) == TRUE
[1] TRUE
> (1 < 2) == 1
[1] TRUE

In the second computation, for instance, the comparison 1 < 2 returns TRUE, and 3 < 4 yields TRUE as well. Both values are treated as 1 values, so the product is 1.

On the surface, R functions look similar to those of C, Java, and so on. However, they have much more of a functional programming flavor, which has direct implications for the R programmer.
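When an if condition is naturally vector-valued, as in the warning above, the usual remedy is to collapse it to the single Boolean that if requires, using base R's any() or all(); a minimal sketch:

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

# all() asks whether every element of the vector condition is TRUE
if (all(x & y)) print("every pair TRUE") else print("some pair FALSE")

# any() asks whether at least one element is TRUE
if (any(x & y)) print("at least one pair TRUE")
```

Both functions return a length-1 logical, so no warning is issued.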
## Default Values for Arguments

We read in a data set from a file named exams:

> testscores <- read.table("exams",header=TRUE)

The argument header=TRUE tells R that we have a header line, so R should not count that first line in the file as data. This is an example of the use of named arguments. Here are the first few lines of the function:

> read.table
function (file, header = FALSE, sep = "", quote = "\"'", dec = ".",
    row.names, col.names, as.is = !stringsAsFactors, na.strings = "NA",
    colClasses = NA, nrows = -1, skip = 0, check.names = TRUE,
    fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE,
    comment.char = "#", allowEscapes = FALSE, flush = FALSE,
    stringsAsFactors = default.stringsAsFactors(), encoding = "unknown")
{
    if (is.character(file)) {
        file <- file(file, "r")
        on.exit(close(file))

The second formal argument is named header. The = FALSE field means that this argument is optional, and if we don't specify it, the default value will be FALSE. If we don't want the default value, we must name the argument in our call:

> testscores <- read.table("exams",header=TRUE)

Hence the terminology named argument. Note, though, that because R uses lazy evaluation—it does not evaluate an expression until/unless it needs to—the named argument may not actually be used.

## Return Values

The return value of a function can be any R object. Although the return value is often a list, it could even be another function.

You can transmit a value back to the caller by explicitly calling return(). Without this call, the value of the last executed statement will be returned by default. For instance, consider the oddcount() example:

> oddcount
function(x) {
   k <- 0  # assign 0 to k
   for (n in x) {
      if (n %% 2 == 1) k <- k+1  # %% is the modulo operator
   }
   return(k)
}

This function returns the count of odd numbers in the argument. We could slightly simplify the code by eliminating the call to return().
To do this, we evaluate the expression to be returned, k, as our last statement in the code:

oddcount <- function(x) {
   k <- 0
   for (n in x) {
      if (n %% 2 == 1) k <- k+1
   }
   k
}

On the other hand, consider this code:

oddcount <- function(x) {
   k <- 0
   for (n in x) {
      if (n %% 2 == 1) k <- k+1
   }
}

It wouldn't work, for a rather subtle reason: The last executed statement here is the call to for(), which returns the value NULL (and does so, in R parlance, invisibly, meaning that it is discarded if not stored by assignment). Thus, there would be no return value at all.

#### Deciding Whether to Explicitly Call return()

The prevailing R idiom is to avoid explicit calls to return(). One of the reasons cited for this approach is that calling that function lengthens execution time. However, unless the function is very short, the time saved is negligible, so this might not be the most compelling reason to refrain from using return(). But it usually isn't needed nonetheless.

Consider our second example from the preceding section:

oddcount <- function(x) {
   k <- 0
   for (n in x) {
      if (n %% 2 == 1) k <- k+1
   }
   k
}

Here, we simply ended with a statement listing the expression to be returned—in this case, k. A call to return() wasn't necessary.

Code in this book usually does include a call to return(), for clarity for beginners, but it is customary to omit it. Good software design, however, should mean that you can glance through a function's code and immediately spot the various points at which control is returned to the caller. The easiest way to accomplish this is to use an explicit return() call in all lines in the middle of the code that cause a return. (You can still omit a return() call at the end of the function if you wish.)

#### Returning Complex Objects

Since the return value can be any R object, you can return complex objects.
Here is an example of a function being returned:

> g
function() {
   t <- function(x) return(x^2)
   return(t)
}
> g()
function(x) return(x^2)

If your function has multiple return values, place them in a list or other container.

## Functions Are Objects

R functions are first-class objects (of the class "function", of course), meaning that they can be used for the most part just like other objects. This is seen in the syntax of function creation:

> g <- function(x) {
+    return(x+1)
+ }

Here, function() is a built-in R function whose job is to create functions! On the right-hand side, there are really two arguments to function(): The first is the formal argument list for the function we're creating—here, just x—and the second is the body of that function—here, just the single statement return(x+1). That second argument must be of class "expression". So, the point is that the right-hand side creates a function object, which is then assigned to g.

By the way, even the "{" is a function, as you can verify by typing this:

> ?"{"

Its job is to make a single unit of what could be several statements.

These two arguments to function() can later be accessed via the R functions formals() and body(), as follows:

> formals(g)
$x

> body(g)

{

return(x + 1)

}

Recall that when using R in interactive mode, simply typing the name of an object results in printing that object to the screen. Functions are no exception, since they are objects just like anything else.

> g

function(x) {

return(x+1)

}

This is handy if you're using a function that you wrote but which you've forgotten the details of. Printing out a function is also useful if you are not quite sure what an R library function does. By looking at the code, you may understand it better. For example, if you are not sure as to the exact behavior of the graphics function abline(), you could browse through its code to better understand how to use it.

> abline
function (a = NULL, b = NULL, h = NULL, v = NULL, reg = NULL,
    coef = NULL, untf = FALSE, ...)
{
    int_abline <- function(a, b, h, v, untf, col = par("col"),
        lty = par("lty"), lwd = par("lwd"), ...) .Internal(abline(a,
        b, h, v, untf, col, lty, lwd, ...))
    if (!is.null(reg)) {
        if (!is.null(a))
            warning("'a' is overridden by 'reg'")
        a <- reg
    }
    if (is.object(a) || is.list(a)) {
        p <- length(coefa <- as.vector(coef(a)))

If you wish to view a lengthy function in this way, run it through page():

> page(abline)

An alternative is to edit it using the edit() function. Note, though, that some of R’s most fundamental built-in functions are written directly in C, and thus they are not viewable in this manner. Here’s an example:

> sum

function (..., na.rm = FALSE)  .Primitive("sum")

Since functions are objects, you can also assign them, use them as arguments to other functions, and so on.

> f1 <- function(a,b) return(a+b)

> f2 <- function(a,b) return(a-b)

> f <- f1

> f(3,2)

[1] 5

> f <- f2

> f(3,2)

[1] 1

> g <- function(h,a,b) h(a,b)

> g(f1,3,2)

[1] 5

> g(f2,3,2)

[1] 1

And since functions are objects, you can loop through a list consisting of several functions. This would be useful, for instance, if you wished to write a loop to plot a number of functions on the same graph, as follows:

> g1 <- function(x) return(sin(x))

> g2 <- function(x) return(sqrt(x^2+1))

> g3 <- function(x) return(2*x-1)

> plot(c(0,1),c(-1,1.5))  # prepare the graph, specifying X and Y ranges

> for (f in c(g1,g2,g3)) plot(f,0,1,add=T)  # add plot to existing graph

The functions formals() and body() can even be used as replacement functions. We’ll discuss replacement functions later, but for now, consider how you could change the body of a function by assignment:

> g <- function(h,a,b) h(a,b)

> body(g) <- quote(2*x + 3)

> g
function (x)
2*x+3

> g(3)

[1] 9

The reason quote() was needed is that technically, the body of a function has the class "call", which is the class produced by quote(). Without the call to quote(), R would try to evaluate the quantity 2*x+3. So if x had been defined and equal to 3, for example, we would assign 9 to the body of g(), certainly not what we want. By the way, since * and + are functions, as a language object, 2*x+3 is indeed a call—in fact, it is one function call nested within another.
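These language-object mechanics can be checked directly; here is a small sketch, starting from a fresh definition of g():

```r
g <- function(x) return(x+1)
class(body(g))         # the body is an unevaluated language object of class "call"
body(g) <- quote(x^2)  # replace the body without redefining g from scratch
g(4)                   # now computes 4^2, i.e. 16
```

Note that quote() hands body() an unevaluated expression, which is exactly what the "call" class represents.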

## Environment and Scope Issues

A function—formally referred to as a closure in the R documentation— consists not only of its arguments and body but also of its environment. The latter is made up of the collection of objects present at the time the function is created. An understanding of how environments work in R is essential for writing effective R functions.

#### The Top-Level Environment

Consider this example:

> w <- 12

> f <- function(y) {
+    d <- 8
+    h <- function() {
+       return(d*(w+y))
+    }
+    return(h())
+ }

> environment(f)

<environment: R_GlobalEnv>

Here, the function f() is created at the top level—that is, at the interpreter command prompt—and thus has the top-level environment, which in R output is referred to as R_GlobalEnv but which confusingly you refer to in R code as .GlobalEnv. If you run an R program as a batch file, that is considered top level, too. The function ls() lists the objects of an environment. If you call it at the top level, you get the top-level environment. Let’s try it with our example code:

> ls()

[1] “f” “w”

As you can see, the top-level environment here includes the variable w, which is actually used within f(). Note that f() is here too, as functions are indeed objects and we did create it at the top level. At levels other than the top, ls() works a little differently. You get a bit more information from ls.str():

> ls.str()

f : function (y)

w : num 12

Next, we’ll look at how w and other variables come into play within f().

#### The Scope Hierarchy

Let’s first get an intuitive overview of how scope works in R and then relate it to environments. If we were working with the C language (as usual, background in C is not assumed), we would say that the variable w in the previous section is global to f(), while d is local to f(). Things are similar in R, but R is more hierarchical. In C, we would not have functions defined within functions, as we have with h() inside f() in our example. Yet, since functions are objects, it is possible—and sometimes desirable from the point of view of the encapsulation goal of object-oriented programming—to define a function within a function; we are simply creating an object, which we can do anywhere.

Here, we have h() being local to f(), just like d. In such a situation, it makes sense for scope to be hierarchical. Thus, R is set up so that d, which is local to f(), is in turn global to h(). The same is true for y, as arguments are considered locals in R. Similarly, the hierarchical nature of scope implies that since w is global to f(), it is global to h() as well. Indeed, we do use w within h().

In terms of environments, then, h()'s environment consists of whatever objects are defined at the time h() comes into existence; that is, at the time that this assignment is executed:

h <- function() {

return(d*(w+y))

}

(If f() is called multiple times, h() will come into existence multiple times, going out of existence each time f() returns.)

What, then, will be in h()'s environment? Well, at the time h() is created, there are the objects d and y created within f(), plus f()'s environment (w). In other words, if one function is defined within another, then that inner function's environment consists of the environment of the outer one, plus whatever locals have been created so far within the outer one. With multiple nesting of functions, you have a nested sequence of larger and larger environments, with the "root" consisting of the top-level objects. Let's try out the code:

> f(2)

[1] 112

What happened? The call f(2) resulted in setting the local variable d to 8, followed by the call h(). The latter evaluated d*(w+y)—that is, 8*(12+2)— giving us 112. Note carefully the role of w. The R interpreter found that there was no local variable of that name, so it ascended to the next higher level—in this case, the top level—where it found a variable w with value 12. Keep in mind that h() is local to f() and invisible at the top level.

> h
Error: object "h" not found

It's possible (though not desirable) to deliberately allow name conflicts in this hierarchy. In our example, for instance, we could have a local variable d within h(), conflicting with the one in f(). In such a situation, the innermost environment is used first. In this case, a reference to d within h() would refer to h()'s d, not f()'s.

Environments created by inheritance in this manner are generally referred to by their memory locations. Here is what happened after adding a print statement to f() (using edit(), not shown here) and then running the code:

> f
function(y) {

d <- 8

h <- function() {

return(d*(w+y))

}

print(environment(h))

return(h())

}

> f(2)

<environment: 0x875753c>

[1] 112

Compare all this to the situation in which the functions are not nested:

> f
function(y) {

d <- 8

return(h())

}

> h
function() {

return(d*(w+y))

}

The result is as follows:

> f(5)
Error in h() : object "d" not found

This does not work, as d is no longer in the environment of h(), because h() is defined at the top level. Thus, an error is generated. Worse, if by happenstance there had been some unrelated variable d in the top-level environment, we would not get an error message but instead would have incorrect results.

You might wonder why R didn't complain about the lack of y in the alternate definition of h() in the preceding example. As mentioned earlier, R doesn't evaluate a variable until it needs it, under a policy called lazy evaluation. In this case, R had already encountered an error with d and thus never got to the point where it would try to evaluate y. The fix is to pass d and y as arguments:

> f

function(y) {

d <- 8

return(h(d,y))

}

> h

function(dee,yyy) {

return(dee*(w+yyy))

}

> f(2)

[1] 88

Okay, let’s look at one last variation:

> f

function(y,ftn) {

d <- 8

print(environment(ftn))

return(ftn(d,y))

}
> h

function(dee,yyy) {

return(dee*(w+yyy))

}

> w <- 12

> f(3,h)

<environment: R_GlobalEnv>

[1] 120

When f() executed, the formal argument ftn was matched by the actual argument h. Since arguments are treated as locals, you might guess that ftn could have a different environment than top level. But as discussed, a closure includes environment, and thus ftn has h's environment.

Note carefully that all the examples so far involving nonlocal variables are for reads, not writes.

#### More on ls()

Without arguments, a call to ls() from within a function returns the names of the current local variables (including arguments). With the envir argument, it will print the names of the locals of any frame in the call chain. Here’s an example:

> f

function(y) {

d <- 8

return(h(d,y))

}

> h

function(dee,yyy) {

print(ls())

print(ls(envir=parent.frame(n=1)))

return(dee*(w+yyy))

}

> f(2)

[1] “dee” “yyy”

[1] “d” “y”

[1] 112

With parent.frame(), the argument n specifies how many frames to go up in the call chain. Here, we were in the midst of executing h(), which had been called from f(), so specifying n = 1 gives us f()’s frame, and thus we get its locals.

#### Functions Have (Almost) No Side Effects

Yet another influence of the functional programming philosophy is that functions do not change nonlocal variables; that is, generally, there are no side effects. Roughly speaking, the code in a function has read access to its nonlocal variables, but it does not have write access to them. Our code can appear to reassign those variables, but the action will affect only copies, not the variables themselves. Let’s demonstrate this by adding some more code to our previous example.

> w <- 12

> f

function(y) {

d <- 8

w <- w + 1

y <- y - 2

print(w)

h <- function() {

return(d*(w+y))

}

return(h())

}

> t <- 4

> f(t)

[1] 13

[1] 120

> w

[1] 12

> t

[1] 4

So, w at the top level did not change, even though it appeared to change within f(). Only a local copy of w, within f(), changed. Similarly, the top-level variable t didn’t change, even though its associated formal argument y did change.

NOTE More precisely, references to the local w actually go to the same memory location as the global one, until the value of the local changes. In that case, a new memory location is used.
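This copy-on-change behavior can be observed with base R's tracemem(), which reports when an object is duplicated (a sketch; it requires an R build with memory profiling enabled, which is the default for CRAN binaries, and the addresses printed will differ from run to run):

```r
x <- c(1, 2, 3)
tracemem(x)        # start tracing duplications of x
f <- function(v) {
  v[1] <- 99       # tracemem reports a copy here: the shared value is duplicated
  v
}
f(x)               # returns c(99, 2, 3)
x                  # the caller's x is unchanged: c(1, 2, 3)
untracemem(x)
```

The "tracemem" line printed during the call marks the exact moment the local copy is made.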

An important exception to this read-only nature of globals arises with the superassignment operator.

#### Extended Example: A Function to Display the Contents of a Call Frame

In single-stepping through your code in a debugging setting, you often want to know the values of the local variables in your current function. You may also want to know the values of the locals in the parent function—that is, the one from which the current function was called. Here, we will develop code to display these values, thereby further demonstrating access to the environment hierarchy. (The code is adapted from my edtdbg debugging tool in R's CRAN code repository.)

For example, consider the following code:

f <- function() {

a <- 1

return(g(a)+a)

}

g <- function(aa) {

b <- 2

aab <- h(aa+b)

return(aab)

}

h <- function(aaa) {

c <- 3

return(aaa+c)

}

When we call f(), it in turn calls g(), which then calls h(). In the debugging setting, say we are currently about to execute the return() within g(). We want to know the values of the local variables of the current function, say the variables aa, b, and aab. And while we're in g(), we also wish to know the values of the locals in f() at the time of the call to g(), as well as the values of the global variables. Our function showframe() will do all this.

The showframe() function has one argument, upn, which is the number of frames to go up the call stack. A negative value of the argument signals that we want to view the globals—the top-level variables. Here's the code:

# shows the values of the local variables (including arguments) of the

# frame upn frames above the one from which showframe() is called; if

# upn < 0, the globals are shown; function objects are not shown

showframe <- function(upn) {

# determine the proper environment

if (upn < 0) {

env <- .GlobalEnv

} else {

env <- parent.frame(n=upn+1)

}

# get the list of variable names

vars <- ls(envir=env)

# for each variable name, print its value

for (vr in vars) {

vrg <- get(vr,envir=env)

if (!is.function(vrg)) {

cat(vr,":\n",sep="")

print(vrg)

}

}

}

Let’s try it out. Insert some calls into g():

> g

function(aa) {

b <- 2

showframe(0)

showframe(1)

aab <- h(aa+b)

return(aab)

}

Now run it:

> f()

aa:

[1] 1

b:

[1] 2

a:

[1] 1

To see how this works, we’ll first look at the get() function, one of the most useful utilities in R. Its job is quite simple: Given the name of an object, it fetches the object itself. Here’s an example:

> m <- rbind(1:3,20:22)

> m

[,1] [,2] [,3]

[1,]    1    2    3

[2,]   20   21   22

> get(“m”)

[,1] [,2] [,3]

[1,]    1    2    3

[2,]   20   21   22

This example with m involves the current call frame, but in our showframe() function, we deal with various levels in the environment hierarchy. So, we need to specify the level via the envir argument of get():

vrg <- get(vr,envir=env)

The level itself is determined largely by calling parent.frame():

if (upn < 0) {

env <- .GlobalEnv

} else {

env <- parent.frame(n=upn+1)

}

Note that ls() can also be called in the context of a particular level, thus enabling you to determine which variables exist at the level of interest and then inspect them. Here’s an example:

vars <- ls(envir=env)

for (vr in vars) {

This code picks up the names of all the local variables in the given frame and then loops through them, setting things up for get() to do its work.
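As an aside, base R's mget() is a plural version of get(): it fetches several objects from an environment in one call, which could shorten the ls()-plus-loop pattern above. A small sketch with a hypothetical environment e:

```r
e <- new.env()
e$a <- 1
e$b <- "two"

# fetch every object in e at once, as a named list
vals <- mget(ls(envir = e), envir = e)
vals$a   # 1
vals$b   # "two"
```

The result is a named list, so the values can then be filtered (say, with Filter() to drop functions) or printed in a single pass.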

## No Pointers in R

R does not have variables corresponding to pointers or references like those of, say, the C language. This can make programming more difficult in some cases. (As of this writing, the current version of R has an experimental feature called reference classes, which may reduce the difficulty.)

For example, you cannot write a function that directly changes its arguments. In Python, for instance, you can do this:

>>> x = [13,5,12]

>>> x.sort()

>>> x

[5, 12, 13]

Here, the value of x, the argument to sort(), changed. By contrast, here’s how it works in R:

> x <- c(13,5,12)

> sort(x)

[1]  5 12 13

> x

[1] 13  5 12

The argument to sort() does not change. If we do want x to change in this R code, the solution is to reassign the arguments:

> x <- sort(x)

> x

[1]  5 12 13

What if our function has several variables of output? A solution is to gather them together into a list, call the function with this list as an argument, have the function return the list, and then reassign to the original list. An example is the following function, which determines the indices of odd and even numbers in a vector of integers:

> oddsevens

function(v){

odds <- which(v %% 2 == 1)

evens <- which(v %% 2 == 0)

list(o=odds,e=evens)

}

In general, our function f() changes variables x and y. We might store them in a list lxy, which would then be our argument to f(). The code, both called and calling, might have a pattern like this:

f <- function(lxxyy) {

lxxyy$x <- ...
lxxyy$y <- ...

return(lxxyy)

}

# set x and y

lxy$x <- ...
lxy$y <- ...

lxy <- f(lxy)

# use new x and y

... <- lxy$x
... <- lxy$y

However, this may become unwieldy if your function will change many variables. It can be especially awkward if your variables, say x and y in the example, are themselves lists, resulting in a return value consisting of lists within a list. This can still be handled, but it makes the code more syntactically complex and harder to read. Alternatives include the use of global variables and the new R reference classes mentioned earlier.

Another class of applications in which lack of pointers causes difficulties is that of treelike data structures. C code normally makes heavy use of pointers for these kinds of structures. One solution for R is to revert to what was done in the "good old days" before C, when programmers formed their own "pointers" as vector indices.
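As a sketch of that index-as-pointer idea, a binary tree can be stored in a matrix, with two columns holding the row numbers of the left and right children (0 meaning no child); the layout here is hypothetical, not a standard R structure:

```r
# each row: node value, row index of left child, row index of right child
tree <- rbind(c(8,  2, 3),   # row 1: root, value 8
              c(5,  0, 0),   # row 2: left child of root
              c(20, 4, 0),   # row 3: right child of root
              c(12, 0, 0))   # row 4: left child of node in row 3

# walk down the leftmost branch from the root, collecting values
node <- 1
path <- c()
while (node != 0) {
  path <- c(path, tree[node, 1])
  node <- tree[node, 2]   # follow the "left child" index, our pointer
}
path   # 8 5
```

Insertion and deletion then amount to rewriting index entries, much as one would rewire pointers in C.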

## Writing Upstairs

As mentioned earlier, code that exists at a certain level of the environment hierarchy has at least read access to all the variables at the levels above it. On the other hand, direct write access to variables at higher levels via the standard <- operator is not possible. If you wish to write to a global variable—or more generally, to any variable higher in the environment hierarchy than the level at which your write statement exists—you can use the superassignment operator, <<-, or the assign() function. Let’s discuss the superassignment operator first.

#### Writing to Nonlocals with the Superassignment Operator

Consider the following code:

> two <- function(u) {
+    u <<- 2*u
+    z <- 2*z
+ }
> x <- 1
> z <- 3
> u
Error: object "u" not found
> two(x)
> x
[1] 1
> z
[1] 3
> u
[1] 2

Let’s look at the impact (or not) on the three top-level variables x, z, and u:

• x: Even though x was the actual argument to two() in the example, it retained the value 1 after the call. This is because its value 1 was copied to the formal argument u, which is treated as a local variable within the function. Thus, when u changed, x did not change with it.
• z: The two z values are entirely unrelated to each other—one is top level, and the other is local to two(). The change in the local variable has no effect on the global variable. Of course, having two variables with the same name is probably not good programming practice.
• u: The u value did not even exist as a top-level variable prior to our calling two(), hence the "not found" error message. However, it was created as a top-level variable by the superassignment operator within two(), as confirmed after the call.

Though <<- is typically used to write to top-level variables, as in our example, technically, it does something a bit different. Use of this operator to write to a variable w will result in a search up the environment hierarchy, stopping at the first level at which a variable of that name is encountered. If none is found, then the level selected will be global. Look what happens in this little example:

> f
function() {
   inc <- function() {x <<- x + 1}
   x <- 3
   inc()
   return(x)
}
> f()
[1] 4
> x
Error: object "x" not found

Here, inc() is defined within f(). When inc() is executing, and the R interpreter sees a superassignment to x, it starts going up the hierarchy. At the first level up—the environment within f()—it does find an x, and so that x is the one that is written to, not x at the top level.
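This stop-at-the-first-match behavior is what makes closures with private state work; here is a minimal sketch (the makecounter() name is hypothetical, not from the text):

```r
# makecounter() returns a function; the superassignment inside it
# finds count one level up, in makecounter()'s frame, not at top level
makecounter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # updates the enclosing frame's count
    count
  }
}
ctr <- makecounter()
ctr()   # 1
ctr()   # 2
```

Each call to makecounter() creates a fresh frame, so separate counters do not interfere with one another or with any global variable named count.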

#### Writing to Nonlocals with assign()

You can also use the assign() function to write to upper-level variables. Here’s an altered version of the previous example:

> two

function(u) {

assign("u",2*u,pos=.GlobalEnv)

z <- 2*z

}

> two(x)

> x

[1] 1

> u

[1] 2

Here, we replaced the superassignment operator with a call to assign(). That call instructs R to assign the value 2*u (this is the local u) to a variable u further up the call stack, specifically in the top-level environment. In this case, that environment is only one call level higher, but if we had a chain of calls, it could be much further up.

The fact that you reference variables using character strings in assign() can come in handy. Recall the example concerning analysis of hiring patterns of various large corporations. We wanted to form a subdata frame for each firm, extracted from the overall data frame, all2006. For instance, consider this call:

makecorpdfs(c("MICROSOFT CORPORATION","ms","INTEL CORPORATION","intel","

This would first extract all Microsoft records from the overall data frame, naming the resulting subdata frame ms2006. It would then create intel2006 for Intel, and so on. Here is the code (changed to function form, for clarity):

makecorpdfs <- function(corplist) {

for (i in 1:(length(corplist)/2)) {

corp <- corplist[2*i-1]

newdtf <- paste(corplist[2*i],"2006",sep="")

assign(newdtf,makecorp(corp),pos=.GlobalEnv)

}

}

In the iteration i = 1, the code uses paste() to splice together the strings "ms" and "2006", resulting in "ms2006", the desired name.
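An alternative that avoids creating variables in the global environment is to accumulate the subdata frames in a named list instead of calling assign(); this sketch uses a hypothetical stand-in for the makecorp() function referred to in the text:

```r
# hypothetical stand-in for the text's makecorp(), for illustration only
makecorp <- function(corp) data.frame(firm = corp)

makecorpdfs <- function(corplist) {
  dfs <- list()
  for (i in 1:(length(corplist)/2)) {
    corp <- corplist[2*i-1]
    nm <- paste(corplist[2*i], "2006", sep = "")
    dfs[[nm]] <- makecorp(corp)   # store under the computed name
  }
  dfs
}

res <- makecorpdfs(c("MICROSOFT CORPORATION", "ms",
                     "INTEL CORPORATION", "intel"))
names(res)   # "ms2006" "intel2006"
```

The computed names become list component names, so the data frames remain accessible as res$ms2006 and so on, without any side effects on the global environment.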

#### Extended Example: Discrete-Event Simulation in R

Discrete-event simulation (DES) is widely used in business, industry, and government. The term discrete event refers to the fact that the state of the system changes only in discrete quantities, rather than changing continuously. A typical example would involve a queuing system, say people lining up to use an ATM. Let's define the state of our system at time t to be the number of people in the queue at that time. The state changes only by +1, when someone arrives, or by -1, when a person finishes an ATM transaction. This is in contrast to, for instance, a simulation of weather, in which temperature, barometric pressure, and so on change continuously.

This will be one of the longer, more involved examples. But it exemplifies a number of important issues in R, especially concerning global variables, and will serve as an example when we discuss appropriate use of global variables in the next section. Your patience will turn out to be a good investment of time. (It is not assumed here that the reader has any prior background in DES.)

Central to DES operation is maintenance of the event list, which is simply a list of scheduled events. This is a general DES term, so the word list here does not refer to the R data type. In fact, we'll represent the event list by a data frame. In the ATM example, for instance, the event list might at some point in the simulation look like this:

customer 1 arrives at time 23.12

customer 2 arrives at time 25.88

customer 3 arrives at time 25.97

customer 1 finishes service at time 26.02

Since the earliest event must always be handled next, the simplest form of coding the event list is to store it in time order, as in the example. (Readers with computer science background might notice that a more efficient approach might be to use some kind of binary tree for storage.) Here, we will implement it as a data frame, with the first row containing the earliest scheduled event, the second row containing the second earliest, and so on.

The main loop of the simulation repeatedly iterates. Each iteration pulls the earliest event off of the event list, updates the simulated time to reflect the occurrence of that event, and reacts to this event. The latter action will typically result in the creation of new events. For example, if a customer arrival occurs when the queue is empty, that customer’s service will begin—one event triggers setting up another. Our code must determine the customer’s service time, and then it will know the time at which service will be finished, which is another event that must be added to the event list.

One of the oldest approaches to writing DES code is the event-oriented paradigm. Here, the code to handle the occurrence of one event directly sets up another event, reflecting our preceding discussion. As an example to guide your thinking, consider the ATM situation. At time 0, the queue is empty. The simulation code randomly generates the time of the first arrival, say 2.3. At this point, the event list is simply (2.3,“arrival”). This event is pulled off the list, simulated time is updated to 2.3, and we react to the arrival event as follows:

• The queue for the ATM is empty, so we start the service by randomly generating the service time—say it is 1.2 time units. Then the completion of service will occur at simulated time 2.3 + 1.2 = 3.5.
• We add the completion-of-service event to the event list, which will now consist of (3.5,"service done").
• We also generate the time to the next arrival, say 0.6, which means the arrival will occur at time 2.9. Now the event list consists of (2.9,"arrival") and (3.5,"service done").
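To make the bookkeeping concrete, here is a small sketch (not part of the DES library itself) that builds the event list from this walk-through as a data frame and keeps it sorted by time; the variable names `evlist`, `evnttime`, and `evnttype` mirror the library's conventions:

```r
# toy event list kept as a data frame sorted on event time;
# the times 2.3, 3.5, and 2.9 are those from the walk-through above
evlist <- data.frame(evnttime=2.3, evnttype="arrival")               # first arrival
evlist <- rbind(evlist, data.frame(evnttime=3.5, evnttype="srvdone"))
evlist <- rbind(evlist, data.frame(evnttime=2.9, evnttype="arrival"))
evlist <- evlist[order(evlist$evnttime),]  # restore time order
evlist$evnttime  # 2.3 2.9 3.5
```

The library below does the same job more carefully, inserting each new event at the right spot rather than re-sorting the whole frame.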

The code consists of a generally applicable library, plus an example application simulating an M/M/1 queue: a single-server queue in which both interarrival time and service time are exponentially distributed.
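Both kinds of times are generated with R's rexp(), whose draws have mean 1/rate. A quick sanity check of the two rates used in the example (this snippet is illustrative and not part of the simulation code):

```r
set.seed(1)  # for reproducibility of this check
ia <- rexp(100000, rate=0.5)   # interarrival times, mean 1/0.5 = 2
sv <- rexp(100000, rate=1.0)   # service times, mean 1/1.0 = 1
mean(ia)  # close to 2
mean(sv)  # close to 1
```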

NOTE The code in this example is hardly optimal, and the reader is invited to improve it, especially by rewriting some portions in C. This example does, however, serve to illustrate a number of the issues we have discussed in this chapter.

Here is a summary of the library functions:

• schedevnt(): Inserts a newly created event into the event list.
• getnextevnt(): Pulls the earliest event off the event list.
• dosim(): Includes the core loop of the simulation. Repeatedly calls getnextevnt() to get the earliest of the pending events; updates the current simulated time, sim$currtime, to reflect the occurrence of that event; and calls the application-specific function reactevnt() to process this newly occurred event.

The code uses the following application-specific functions:

• initglbls(): Initializes the application-specific global variables.
• reactevnt(): Takes the proper actions when an event occurs, typically generating new events as a result.
• prntrslts(): Prints the application-specific results of the simulation.

Note that initglbls(), reactevnt(), and prntrslts() are written by the application programmer and then passed to dosim() as arguments. In the M/M/1 queue example included here, these functions are named mm1initglbls(), mm1reactevnt(), and mm1prntrslts(), respectively. Thus, in correspondence with the definition of dosim(),

```
dosim <- function(initglbls,reactevnt,prntrslts,maxsimtime,apppars=NULL,
      dbg=FALSE)
```

our call is as follows:

```
dosim(mm1initglbls,mm1reactevnt,mm1prntrslts,10000.0,
   list(arrvrate=0.5,srvrate=1.0))
```

Here's the library code (line numbers are given for reference):

```
 1   # DES.R:  R routines for discrete-event simulation (DES)
 2
 3   # each event will be represented by a data frame row consisting of the
 4   # following components:  evnttime, the time the event is to occur;
 5   # evnttype, a character string for the programmer-defined event type;
 6   # optional application-specific components, e.g.
 7   # the job's arrival time in a queuing app
 8
 9   # a global list named "sim" holds the events data frame, evnts, and
10   # current simulated time, currtime; there is also a component dbg, which
11   # indicates debugging mode
12
13   # forms a row for an event of type evntty that will occur at time
14   # evnttm; see comments in schedevnt() regarding appin
15   evntrow <- function(evnttm,evntty,appin=NULL) {
16      rw <- c(list(evnttime=evnttm,evnttype=evntty),appin)
17      return(as.data.frame(rw))
18   }
19
20   # insert event with time evnttm and type evntty into event list;
21   # appin is an optional set of application-specific traits of this event,
22   # specified in the form a list with named components
23   schedevnt <- function(evnttm,evntty,appin=NULL) {
24      newevnt <- evntrow(evnttm,evntty,appin)
25      # if the event list is empty, set it to consist of evnt and return
26      if (is.null(sim$evnts)) {
27         sim$evnts <<- newevnt
28         return()
29      }
30      # otherwise, find insertion point
31      inspt <- binsearch((sim$evnts)$evnttime,evnttm)
32      # now "insert," by reconstructing the data frame; we find what
33      # portion of the current matrix should come before the new event and
34      # what portion should come after it, then string everything together
35      before <-
36         if (inspt == 1) NULL else sim$evnts[1:(inspt-1),]
37      nr <- nrow(sim$evnts)
38      after <- if (inspt <= nr) sim$evnts[inspt:nr,] else NULL
39      sim$evnts <<- rbind(before,newevnt,after)
40   }
41
42   # binary search of insertion point of y in the sorted vector x; returns
43   # the position in x before which y should be inserted, with the value
44   # length(x)+1 if y is larger than x[length(x)]; could be changed to C
45   # code for efficiency
46   binsearch <- function(x,y) {
47      n <- length(x)
48      lo <- 1
49      hi <- n
50      while(lo+1 < hi) {
51         mid <- floor((lo+hi)/2)
52         if (y == x[mid]) return(mid)
53         if (y < x[mid]) hi <- mid else lo <- mid
54      }
55      if (y <= x[lo]) return(lo)
56      if (y < x[hi]) return(hi)
57      return(hi+1)
58   }
59
60   # start to process next event (second half done by application
61   # programmer via call to reactevnt())
62   getnextevnt <- function() {
63      head <- sim$evnts[1,]
64      # delete head
65      if (nrow(sim$evnts) == 1) {
66         sim$evnts <<- NULL
67      } else sim$evnts <<- sim$evnts[-1,]
68      return(head)
69   }
70
71   # simulation body
72   # arguments:
73   #    initglbls:  application-specific initialization function; inits
74   #       globals to statistical totals for the app, etc.; records apppars
75   #       in globals; schedules the first event
76   #    reactevnt:  application-specific event handling function, coding the
77   #       proper action for each type of event
78   #    prntrslts:  prints application-specific results, e.g. mean queue
79   #       wait
80   #    apppars:  list of application-specific parameters, e.g.
81   #       number of servers in a queuing app
82   #    maxsimtime:  simulation will be run until this simulated time
83   #    dbg:  debug flag; if TRUE, sim will be printed after each event
84   dosim <- function(initglbls,reactevnt,prntrslts,maxsimtime,apppars=NULL,
85         dbg=FALSE) {
86      sim <<- list()
87      sim$currtime <<- 0.0  # current simulated time
88      sim$evnts <<- NULL    # events data frame
89      sim$dbg <<- dbg
90      initglbls(apppars)
91      while(sim$currtime < maxsimtime) {
92         head <- getnextevnt()
93         sim$currtime <<- head$evnttime  # update current simulated time
94         reactevnt(head)  # process this event
95         if (dbg) print(sim)
96      }
97      prntrslts()
98   }
```
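To see binsearch()'s insertion-point convention in isolation, here it is again as a standalone snippet (reproduced from the listing, without line numbers), with a couple of sample calls using the event times from the earlier event-list example:

```r
# binary search for the insertion point of y in the sorted vector x;
# returns the position before which y should be inserted, or length(x)+1
# if y is larger than every element
binsearch <- function(x,y) {
   n <- length(x)
   lo <- 1
   hi <- n
   while(lo+1 < hi) {
      mid <- floor((lo+hi)/2)
      if (y == x[mid]) return(mid)
      if (y < x[mid]) hi <- mid else lo <- mid
   }
   if (y <= x[lo]) return(lo)
   if (y < x[hi]) return(hi)
   return(hi+1)
}

binsearch(c(23.12,25.88,25.97), 26.02)  # 4: goes after the last event
binsearch(c(23.12,25.88,25.97), 25.00)  # 2: goes before the second event
```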

The following is an example application of the code. Again, the simulation models an M/M/1 queue, which is a single-server queuing system in which service times and times between job arrivals are exponentially distributed. (For this queue, with arrival rate 0.5 and service rate 1.0, queuing theory gives a mean residence time of 1/(1.0 - 0.5) = 2.0, which is the value the simulation should approximate.)

```r
# DES application:  M/M/1 queue, arrival rate 0.5, service rate 1.0

# the call
#    dosim(mm1initglbls,mm1reactevnt,mm1prntrslts,10000.0,
#       list(arrvrate=0.5,srvrate=1.0))
# should return a value of about 2 (may take a while)

# initializes global variables specific to this app
mm1initglbls <- function(apppars) {
   mm1glbls <<- list()
   # simulation parameters
   mm1glbls$arrvrate <<- apppars$arrvrate
   mm1glbls$srvrate <<- apppars$srvrate
   # server queue, consisting of arrival times of queued jobs
   mm1glbls$srvq <<- vector(length=0)
   # statistics
   mm1glbls$njobsdone <<- 0   # jobs done so far
   mm1glbls$totwait <<- 0.0   # total wait time so far
   # set up first event, an arrival; the application-specific data for
   # each event will consist of its arrival time, which we need to
   # record in order to later calculate the job's residence time in the
   # system
   arrvtime <- rexp(1,mm1glbls$arrvrate)
   schedevnt(arrvtime,"arrv",list(arrvtime=arrvtime))
}

# application-specific event processing function called by dosim()
# in the general DES library
mm1reactevnt <- function(head) {
   if (head$evnttype == "arrv") {  # arrival
      # if server free, start service, else add to queue (added to queue
      # even if empty, for convenience)
      if (length(mm1glbls$srvq) == 0) {
         mm1glbls$srvq <<- head$arrvtime
         srvdonetime <- sim$currtime + rexp(1,mm1glbls$srvrate)
         schedevnt(srvdonetime,"srvdone",list(arrvtime=head$arrvtime))
      } else mm1glbls$srvq <<- c(mm1glbls$srvq,head$arrvtime)
      # generate next arrival
      arrvtime <- sim$currtime + rexp(1,mm1glbls$arrvrate)
      schedevnt(arrvtime,"arrv",list(arrvtime=arrvtime))
   } else {  # service done
      # process job that just finished
      # do accounting
      mm1glbls$njobsdone <<- mm1glbls$njobsdone + 1
      mm1glbls$totwait <<-
         mm1glbls$totwait + sim$currtime - head$arrvtime
      # remove from queue
      mm1glbls$srvq <<- mm1glbls$srvq[-1]
      # more still in the queue?
      if (length(mm1glbls$srvq) > 0) {
         # schedule new service
         srvdonetime <- sim$currtime + rexp(1,mm1glbls$srvrate)
         schedevnt(srvdonetime,"srvdone",list(arrvtime=mm1glbls$srvq[1]))
      }
   }
}

mm1prntrslts <- function() {
   print("mean wait:")
   print(mm1glbls$totwait/mm1glbls$njobsdone)
}
```

To see how all this works, take a look at the M/M/1 application code. There, we have set up a global variable, mm1glbls, which contains variables relevant to the M/M/1 code, such as mm1glbls$totwait, the running total of the wait time of all jobs simulated so far. As you can see, the superassignment operator is used to write to such variables, as in this statement:

```
mm1glbls$srvq <<- mm1glbls$srvq[-1]
```

Let's look at mm1reactevnt() to see how the simulation works, focusing on the portion of the code in which a "service done" event is handled:

```
} else {  # service done
   # process job that just finished
   # do accounting
   mm1glbls$njobsdone <<- mm1glbls$njobsdone + 1
   mm1glbls$totwait <<-
      mm1glbls$totwait + sim$currtime - head$arrvtime
   # remove this job from queue
   mm1glbls$srvq <<- mm1glbls$srvq[-1]
   # more still in the queue?
   if (length(mm1glbls$srvq) > 0) {
      # schedule new service
      srvdonetime <- sim$currtime + rexp(1,mm1glbls$srvrate)
      schedevnt(srvdonetime,"srvdone",list(arrvtime=mm1glbls$srvq[1]))
   }
}
```

First, this code does some bookkeeping, updating the totals of number of jobs completed and wait time. It then removes this newly completed job from the server queue. Finally, it checks if there are still jobs in the queue and, if so, calls schedevnt() to arrange for the service of the one at the head.

What about the DES library code itself? First note that the simulation state, consisting of the current simulated time and the event list, has been placed in an R list structure, sim. This was done in order to encapsulate all the main information into one package, which in R typically means using a list. The sim list has been made a global variable.

As mentioned, a key issue in writing a DES library is the event list. This code implements it as a data frame, sim$evnts. Each row of the data frame corresponds to one scheduled event, with information about the event time, a character string representing the event type (say arrival or service completion), and any application-specific data the programmer wishes to add. Since each row consists of both numeric and character data, it was natural to choose a data frame for representing this event list. The rows of the data frame are in ascending order of event time, which is contained in the first column.

The main loop of the simulation is in dosim() of the DES library code, beginning at line 91:

```
while(sim$currtime < maxsimtime) {
   head <- getnextevnt()
   sim$currtime <<- head$evnttime  # update current simulated time
   reactevnt(head)  # process this event
   if (dbg) print(sim)
}
```

First, getnextevnt() is called to remove the head (the earliest event) from the event list. (Note the side effect: the event list changes.) Then the current simulated time is updated according to the scheduled time in the head event. Finally, the programmer-supplied function reactevnt() is called to process the event (as seen in the M/M/1 code discussed earlier).

The main potential advantage of using a data frame as our structure here is that it enables us to maintain the event list in ascending order by time via a binary search operation on event time. This is done in line 31 within schedevnt(), the function that inserts a newly created event into the event list:

```
inspt <- binsearch((sim$evnts)$evnttime,evnttm)
```

Here, we wish to insert a newly created event into the event list, and the fact that we are working with a vector enables the use of a fast binary search. (As noted in the comments in the code, though, this really should be implemented in C for good performance.)

A later line in schedevnt() is a good example of the use of rbind():

```
sim$evnts <<- rbind(before,newevnt,after)
```

Now, we have extracted the events in the event list whose times are earlier than that of newevnt and stored them in before. We also constructed a similar set in after for the events whose times are later than that of newevnt. We then use rbind() to put all these together in the proper order.
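The before/newevnt/after idea can be seen on a tiny scale; the data here is made up for illustration, with the insertion point hardcoded rather than found by binary search:

```r
# insert a row into a small data frame, keeping it sorted on evnttime
evnts <- data.frame(evnttime=c(23.12,25.88,25.97),
                    evnttype=c("arrv","arrv","arrv"))
newevnt <- data.frame(evnttime=25.90, evnttype="srvdone")
inspt <- 3  # the row before which the new event belongs
before <- evnts[1:(inspt-1),]
after <- evnts[inspt:nrow(evnts),]
evnts <- rbind(before,newevnt,after)
evnts$evnttime  # 23.12 25.88 25.90 25.97
```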

#### When Should You Use Global Variables?

Use of global variables is a subject of controversy in the programming community. Obviously, the question raised by the title of this section cannot be answered in any formulaic way, as it is a matter of personal taste and style. Nevertheless, most programmers would probably consider the outright banning of global variables, which is encouraged by many teachers of programming, to be overly rigid. In this section, we will explore the possible value of globals in the context of the structures of R. Here, the term global variable, or just global, will be used to include any variable located higher in the environment hierarchy than the level of the given code of interest.

The use of global variables in R is more common than you may have guessed. You might be surprised to learn that R itself makes very substantial use of globals internally, both in its C code and in its R routines. The superassignment operator <<-, for instance, is used in many of the R library functions (albeit typically in writing to a variable just one level up in the environment hierarchy). Threaded code and GPU code, which are used for writing fast programs, tend to make heavy use of global variables, which provide the main avenue of communication between parallel actors. Now, to make our discussion concrete, let's return to the earlier example:

```
f <- function(lxxyy) {  # lxxyy is a list containing x and y
   lxxyy$x <- ...
   lxxyy$y <- ...
   return(lxxyy)
}

# set x and y
lxy$x <- ...
lxy$y <- ...
lxy <- f(lxy)
# use new x and y
... <- lxy$x
... <- lxy$y
```

As noted earlier, this code might be a bit unwieldy, especially if x and y are themselves lists. By contrast, here is an alternate pattern that uses globals:

```
f <- function() {
   x <<- ...
   y <<- ...
}

# set x and y
x <- ...
y <- ...
f()  # x and y are changed in here
# use new x and y
... <- x
... <- y
```
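Here is a concrete, runnable toy instance of this globals pattern; the particular assignments (x + 1, y * 2) are made-up stand-ins for the elided computations:

```r
# f() writes to x and y one level up via superassignment
f <- function() {
   x <<- x + 1
   y <<- y * 2
}

# set x and y
x <- 10
y <- 3
f()  # x and y are changed in here
# use new x and y
x  # now 11
y  # now 6
```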

Arguably, this second version is much cleaner, being less cluttered and not requiring manipulation of lists. Cleaner code is usually easier to write, debug, and maintain. It is for these reasons—avoiding clutter and simplifying the code—that we chose to use globals, rather than to return lists, in the DES code earlier in this chapter. Let's explore that example further. We had two global variables (both lists, encapsulating various information): sim, associated with the library code, and mm1glbls, associated with our M/M/1 application code. Let's consider sim first.

Even many programmers who have reservations about using globals agree that such variables may be justified if they are truly global, in the sense that they are used broadly in the program. This is the case for sim in our DES example. It is used both in the library code (in schedevnt(), getnextevnt(), and dosim()) and in our M/M/1 application code (in mm1reactevnt()). The latter access to sim is on a read-only basis in this particular instance, but it could involve writes in some applications. A common example of such writes is when an event needs to be canceled. This might arise in modeling a "whichever comes first" situation; two events are scheduled, and when one of them occurs, the other must be canceled.

So, using sim as a global seems justified. Nevertheless, if we were bound and determined to avoid using globals, we could have placed sim as a local within dosim(). This function would then pass sim as an argument to all of the functions mentioned in the previous paragraph (schedevnt(), getnextevnt(), and so on), and each of these functions would return the modified sim. Line 94, for example, would change from this:

```
reactevnt(head)
```

to a form in which sim is passed as an argument and the modified copy is captured from the return value. We would then need to add a line like the following to our application-specific function mm1reactevnt():

```
return(sim)
```

We could do something similar with mm1glbls, placing a variable called, say, appvars as a local within dosim(). However, if we did this with sim as well, we would need to place them together in a list so that both would be returned, as in our earlier example function f(). We would then have the lists-within-lists clutter described earlier—well, lists within lists within lists in this case.

On the other hand, critics of the use of global variables counter that the simplicity of the code comes at a price. They worry that it may be difficult during debugging to track down locations at which a global variable changes value, since such a change could occur anywhere in the program. This seems to be less of a concern in view of our modern text editors and integrated development tools (the original article calling for avoiding use of globals was published in 1970!), which can be used to find all instances of a variable. However, it should be taken into consideration.

Another concern raised by critics involves situations in which a function is called in several unrelated parts of the overall program using different values. For example, consider using our example f() function in different parts of our program, each call with its own values of x and y, rather than just a single value of each, as assumed earlier. This could be solved by setting up vectors of x and y values, with one element for each instance of f() in your program. You would lose some of the simplicity of using globals, though.

The above issues apply generally, not just to R. However, for R there is an additional concern for globals at the top level, as the user will typically have lots of variables there. The danger is that code that uses globals may accidentally overwrite an unrelated variable with the same name. This can be avoided easily, of course, by simply choosing long, very application-specific names for globals in your code. But a compromise is also available in the form of environments, such as the following for the DES example above.

Within dosim(), the line

sim <<- list()

would be replaced by

assign("simenv",new.env(),envir=.GlobalEnv)

This would create a new environment, pointed to by simenv at the top level. It would serve as a package in which to encapsulate our globals. We would access them via get() and assign(). For instance, the lines

```
if (is.null(sim$evnts)) {
   sim$evnts <<- newevnt
```

in schedevnt() would become

```
if (is.null(get("evnts",envir=simenv))) {
   assign("evnts",newevnt,envir=simenv)
```

Yes, this is cluttered too, but at least it is not complex like lists of lists of lists. And it does protect against unwittingly writing to an unrelated variable the user has at the top level. Using the superassignment operator still yields the least cluttered code, but this compromise is worth considering. As usual, there is no single style of programming that produces the best solution in all applications. The globals approach is another option to consider for your programming tool kit.
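Here is a self-contained sketch of that compromise, using a toy variable rather than the full event list; the helper addevnt() is a hypothetical function introduced only for this illustration:

```r
# encapsulate "globals" in their own environment at the top level
assign("simenv", new.env(), envir=.GlobalEnv)
assign("evnts", NULL, envir=simenv)   # start with an empty event list

# any function can now read and update the encapsulated state
# via get() and assign()
addevnt <- function(e) {
   cur <- get("evnts", envir=simenv)
   assign("evnts", c(cur, e), envir=simenv)
}

addevnt(2.3)
addevnt(2.9)
get("evnts", envir=simenv)  # 2.3 2.9
```

Note that no ordinary top-level variable named evnts is created or touched; the state lives only inside simenv.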

#### Closures

Recall that an R closure consists of a function's arguments and body together with its environment at the time of the call. The fact that the environment is included is exploited in a type of programming that uses a feature also known (in a slight overloading of terminology) as a closure. A closure consists of a function that sets up a local variable and then creates another function that accesses that variable. This is a very abstract description, so let's go right to an example.

```
> counter
function () {
   ctr <- 0
   f <- function() {
      ctr <<- ctr + 1
      cat("this count currently has value",ctr,"\n")
   }
   return(f)
}
```

Let’s try this out before going into the internal details:

```
> c1 <- counter()
> c2 <- counter()
> c1
function() {
   ctr <<- ctr + 1
   cat("this count currently has value",ctr,"\n")
}
<environment: 0x8d445c0>
> c2
function() {
   ctr <<- ctr + 1
   cat("this count currently has value",ctr,"\n")
}
<environment: 0x8d447d4>
```

```
> c1()
this count currently has value 1
> c1()
this count currently has value 2
> c2()
this count currently has value 1
> c2()
this count currently has value 2
> c2()
this count currently has value 3
> c1()
this count currently has value 3
```

Here, we called counter() twice, assigning the results to c1 and c2. As expected, those two variables consist of functions, specifically copies of f(). However, f() accesses a variable ctr through the superassignment operator, and that variable will be the one of that name that is local to counter(), as it is the first one up the environment hierarchy. It is part of the environment of f() and, as such, is packaged in what is returned to the caller of counter().

The key point is that each time counter() is called, the variable ctr will be in a different environment (in the example, the environments were at memory addresses 0x8d445c0 and 0x8d447d4). In other words, different calls to counter() will produce physically different ctrs. The result, then, is that our functions c1() and c2() serve as completely independent counters, as seen in the example, where we invoke each of them a few times.
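The same mechanism works for state other than a counter. As another small sketch (the function name makeacc() is made up for this illustration), here is a closure that maintains a running total:

```r
# each call to makeacc() creates a fresh environment holding tot
makeacc <- function() {
   tot <- 0
   function(amt) {
      tot <<- tot + amt  # writes to this closure's own tot
      tot
   }
}

a1 <- makeacc()
a2 <- makeacc()
a1(5)   # 5
a1(2)   # 7
a2(10)  # 10; a2 has its own independent tot
```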

## Recursion

Once a mathematics PhD student whom I knew to be quite bright, but who had little programming background, sought my advice on how to write a certain function. I quickly said, "You don't even need to tell me what the function is supposed to do. The answer is to use recursion." Startled, he asked what recursion is. I advised him to read about the famous Towers of Hanoi problem. Sure enough, he returned the next day, reporting that he was able to solve his problem in just a few lines of code, using recursion. Obviously, recursion can be a powerful tool. Well then, what is it? A recursive function calls itself. If you have not encountered this concept before, it may sound odd, but the idea is actually simple. In rough terms, the idea is this:

To solve a problem of type X by writing a recursive function f():

1. Break the original problem of type X into one or more smaller problems of type X.
2. Within f(), call f() on each of the smaller problems.
3. Within f(), piece together the results of step 2 to solve the original problem.
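As a minimal illustration of these three steps, consider summing a vector recursively (a toy function, of course; R's built-in sum() already does this): step 1 splits off the first element, step 2 calls the function on the rest, and step 3 combines the two results.

```r
rsum <- function(x) {
   if (length(x) == 0) return(0)   # termination condition
   return(x[1] + rsum(x[-1]))      # combine head with the sum of the rest
}

rsum(c(5,12,13))  # 30
```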

#### A Quicksort Implementation

A classic example is Quicksort, an algorithm used to sort a vector of numbers from smallest to largest. For instance, suppose we wish to sort the vector (5,4,12,13,3,8,88). We first compare everything to the first element, 5, to form two subvectors: one consisting of the elements less than 5 and the other consisting of the elements greater than or equal to 5. That gives us subvectors (4,3) and (12,13,8,88). We then call the function on the subvectors, returning (3,4) and (8,12,13,88). We string those together with the 5, yielding (3,4,5,8,12,13,88), as desired. R's vector-filtering capability and its c() function make implementation of Quicksort quite easy.

NOTE This example is for the purpose of demonstrating recursion. R’s own sort function, sort(), is much faster, as it is written in C.

```
qs <- function(x) {
   if (length(x) <= 1) return(x)
   pivot <- x[1]
   therest <- x[-1]
   sv1 <- therest[therest < pivot]
   sv2 <- therest[therest >= pivot]
   sv1 <- qs(sv1)
   sv2 <- qs(sv2)
   return(c(sv1,pivot,sv2))
}
```
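A quick check on the vector from the walk-through, with qs() repeated so the snippet stands alone:

```r
# qs() as defined above
qs <- function(x) {
   if (length(x) <= 1) return(x)
   pivot <- x[1]
   therest <- x[-1]
   sv1 <- therest[therest < pivot]
   sv2 <- therest[therest >= pivot]
   sv1 <- qs(sv1)
   sv2 <- qs(sv2)
   return(c(sv1,pivot,sv2))
}

qs(c(5,4,12,13,3,8,88))  # 3 4 5 8 12 13 88
```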

Note carefully the termination condition:

if (length(x) <= 1) return(x)

Without this, the function would keep calling itself repeatedly on empty vectors, executing forever. (Actually, the R interpreter would eventually refuse to go any further, but you get the idea.) Sounds like magic? Recursion certainly is an elegant way to solve many problems. But recursion has two potential drawbacks:

• It's fairly abstract. I knew that the graduate student, as a fine mathematician, would take to recursion like a fish to water, because recursion is really just the inverse of proof by mathematical induction. But many programmers find it tough.
• Recursion is very lavish in its use of memory, which may be an issue in R if applied to large problems.

#### Extended Example: A Binary Search Tree

Treelike data structures are common in both computer science and statistics. In R, for example, the rpart library for a recursive partitioning approach to regression and classification is very popular. Trees obviously have applications in genealogy, and more generally, graphs form the basis of analysis of social networks.

However, there are real issues with tree structures in R, many of them related to the fact that R does not have pointer-style references. Indeed, for this reason and for performance purposes, a better option is often to write the core code in C with an R wrapper. Yet trees can be implemented in R itself, and if performance is not an issue, using this approach may be more convenient.

For the sake of simplicity, our example here will be a binary search tree, a classic computer science data structure that has the following property: in each node of the tree, the value at the left link, if any, is less than or equal to that of the parent, while the value at the right link, if any, is greater than that of the parent. Here is an example:

We've stored 8 in the root—that is, the head—of the tree. Its two child nodes contain 5 and 20, and the former itself has two child nodes, which store 2 and 6. Note that the nature of binary search trees implies that at any node, all of the elements in the node's left subtree are less than or equal to the value stored in this node, while the right subtree stores the elements that are larger than the value in this node. In our example tree, where the root node contains 8, all of the values in the left subtree—5, 2, and 6—are less than 8, while 20 is greater than 8.

If implemented in C, a tree node would be represented by a C struct, similar to an R list, whose contents are the stored value, a pointer to the left child, and a pointer to the right child. But since R lacks pointer variables, what can we do? Our solution is to go back to the basics. In the old prepointer days in FORTRAN, linked data structures were implemented in long arrays. A pointer, which in C is a memory address, was an array index instead.

Specifically, we’ll represent each node by a row in a three-column matrix. The node’s stored value will be in the third element of that row, while the first and second elements will be the left and right links. For instance, if the first element in a row is 29, it means that this node’s left link points to the node stored in row 29 of the matrix. Remember that allocating space for a matrix in R is a time-consuming activity. In an effort to amortize the memory-allocation time, we allocate new space for a tree’s matrix several rows at a time, instead of row by row. The number of rows allocated each time will be given in the variable inc. As is common with tree traversal, we implement our algorithm with recursion.

NOTE This example is for the purpose of demonstrating recursion; if you anticipate that the matrix will become quite large, you may wish to double its size at each allocation, rather than grow it linearly as we have here. This would further reduce the number of time-consuming disruptions.

Before discussing the code, let’s run through a quick session of tree building using its routines.

```
> x <- newtree(8,3)
> x
$mat
     [,1] [,2] [,3]
[1,]   NA   NA    8
[2,]   NA   NA   NA
[3,]   NA   NA   NA

$nxt
[1] 2

$inc
[1] 3

> x <- ins(1,x,5)
> x
$mat
     [,1] [,2] [,3]
[1,]    2   NA    8
[2,]   NA   NA    5
[3,]   NA   NA   NA

$nxt
[1] 3

$inc
[1] 3

> x <- ins(1,x,6)
> x
$mat
     [,1] [,2] [,3]
[1,]    2   NA    8
[2,]   NA    3    5
[3,]   NA   NA    6

$nxt
[1] 4

$inc
[1] 3

> x <- ins(1,x,2)
> x
$mat
     [,1] [,2] [,3]
[1,]    2   NA    8
[2,]    4    3    5
[3,]   NA   NA    6
[4,]   NA   NA    2
[5,]   NA   NA   NA
[6,]   NA   NA   NA

$nxt
[1] 5

$inc
[1] 3

> x <- ins(1,x,20)
> x
$mat
     [,1] [,2] [,3]
[1,]    2    5    8
[2,]    4    3    5
[3,]   NA   NA    6
[4,]   NA   NA    2
[5,]   NA   NA   20
[6,]   NA   NA   NA

$nxt
[1] 6

$inc
[1] 3
```

What happened here? First, the command containing our call newtree(8,3) creates a new tree, assigned to x, storing the number 8. The argument 3 specifies that we allocate storage room three rows at a time. The result is that the matrix component of the list x is now as follows:

```
     [,1] [,2] [,3]
[1,]   NA   NA    8
[2,]   NA   NA   NA
[3,]   NA   NA   NA
```

Three rows of storage are indeed allocated, and our data now consists just of the number 8. The two NA values in that first row indicate that this node of the tree currently has no children.

We then make the call ins(1,x,5) to insert a second value, 5, into the tree x. The argument 1 specifies the root. In other words, the call says, "Insert 5 in the subtree of x whose root is in row 1." Note that we need to reassign the return value of this call back to x. Again, this is due to the lack of pointer variables in R. The matrix now looks like this:

```
     [,1] [,2] [,3]
[1,]    2   NA    8
[2,]   NA   NA    5
[3,]   NA   NA   NA
```

The element 2 means that the left link out of the node containing 8 is meant to point to row 2, where our new element 5 is stored. The session continues in this manner. Note that when our initial allotment of three rows is full, ins() allocates three new rows, for a total of six. In the end, the matrix is as follows:

```
     [,1] [,2] [,3]
[1,]    2    5    8
[2,]    4    3    5
[3,]   NA   NA    6
[4,]   NA   NA    2
[5,]   NA   NA   20
[6,]   NA   NA   NA
```

This represents the tree we graphed for this example.

The code follows. Note that it includes only routines to insert new items and to traverse the tree. The code for deleting a node is somewhat more complex, but it follows a similar pattern.
```
 1   # routines to create trees and insert items into them are included
 2   # below; a deletion routine is left to the reader as an exercise
 3
 4   # storage is in a matrix, say m, one row per node of the tree; if row
 5   # i contains (u,v,w), then node i stores the value w, and has left and
 6   # right links to rows u and v; null links have the value NA
 7
 8   # the tree is represented as a list (mat,nxt,inc), where mat is the
 9   # matrix, nxt is the next empty row to be used, and inc is the number of
10   # rows of expansion to be allocated whenever the matrix becomes full
11
12   # print sorted tree via in-order traversal
13   printtree <- function(hdidx,tr) {
14      left <- tr$mat[hdidx,1]
15      if (!is.na(left)) printtree(left,tr)
16      print(tr$mat[hdidx,3])  # print root
17      right <- tr$mat[hdidx,2]
18      if (!is.na(right)) printtree(right,tr)
19   }
20
21   # initializes a storage matrix, with initial stored value firstval
22   newtree <- function(firstval,inc) {
23      m <- matrix(rep(NA,inc*3),nrow=inc,ncol=3)
24      m[1,3] <- firstval
25      return(list(mat=m,nxt=2,inc=inc))
26   }
27
28   # inserts newval into the subtree of tr, with the subtree's root being
29   # at index hdidx; note that return value must be reassigned to tr by the
30   # caller (including ins() itself, due to recursion)
31   ins <- function(hdidx,tr,newval) {
32      # which direction will this new node go, left or right?
33      dir <- if (newval <= tr$mat[hdidx,3]) 1 else 2
34      # if null link in that direction, place the new node here, otherwise
35      # recurse
36      if (is.na(tr$mat[hdidx,dir])) {
37         newidx <- tr$nxt  # where new node goes
38         # check for room to add a new element
39         if (tr$nxt == nrow(tr$mat) + 1) {
40            tr$mat <-
41               rbind(tr$mat, matrix(rep(NA,tr$inc*3),nrow=tr$inc,ncol=3))
42         }
43         # insert new tree node
44         tr$mat[newidx,3] <- newval
45         # link to the new node
46         tr$mat[hdidx,dir] <- newidx
47         tr$nxt <- tr$nxt + 1  # ready for next insert
48         return(tr)
49      } else tr <- ins(tr$mat[hdidx,dir],tr,newval)
50   }
```

There is recursion in both printtree() and ins(). The former is definitely the easier of the two, so let’s look at that first. It prints out the tree, in sorted order. Recall our description of a recursive function f() that solves a problem of category X: We have f() split the original X problem into one or more smaller X problems, call f() on them, and combine the results. In this case, our problem’s category X is to print a tree, which could be a subtree of a larger one. The role of the function on line 13 is to print the given tree, which it does by calling itself in lines 15 and 18. There, it prints first the left subtree and then the right subtree, pausing in between to print the root.

This thinking (print the left subtree, then the root, then the right subtree) forms the intuition in writing the code, but again we must make sure to have a proper termination mechanism. That mechanism is seen in the if() statements in lines 15 and 18: when we come to a null link, we do not continue to recurse.

The recursion in ins() follows the same principles but is considerably more delicate. Here, our "category X" is an insertion of a value into a subtree. We start at the root of a tree, determine whether our new value must go into the left or right subtree (line 33), and then call the function again on that subtree. Again, this is not hard in principle, but a number of details must be attended to, including the expansion of the matrix if we run out of room (lines 40–41).

One difference between the recursive code in printtree() and ins() is that the former includes two calls to itself, while the latter has only one. This implies that it may not be difficult to write the latter in a nonrecursive form.
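As a rough sketch of that last point (our own code, not part of the listing above), an iterative version of the insertion logic might walk down the links in a loop until it finds a null one, using the same (mat, nxt, inc) list representation:

```r
# iterative insertion sketch: same (mat, nxt, inc) representation as
# above; follows links in a loop instead of recursing
insiter <- function(tr,newval) {
   hdidx <- 1
   repeat {
      # go left for values <= the current node's value, right otherwise
      dir <- if (newval <= tr$mat[hdidx,3]) 1 else 2
      nextidx <- tr$mat[hdidx,dir]
      if (is.na(nextidx)) break  # null link found; insert here
      hdidx <- nextidx
   }
   # expand the matrix by inc rows if it is full
   if (tr$nxt == nrow(tr$mat) + 1)
      tr$mat <- rbind(tr$mat,matrix(rep(NA,tr$inc*3),nrow=tr$inc,ncol=3))
   newidx <- tr$nxt
   tr$mat[newidx,3] <- newval   # store the new value
   tr$mat[hdidx,dir] <- newidx  # link the parent to it
   tr$nxt <- tr$nxt + 1
   return(tr)
}
```

As with ins(), the caller must reassign the return value, since R has no pointer semantics.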

## Replacement Functions

Recall the following example:

> x <- c(1,2,4)

> names(x)

NULL

> names(x) <- c("a","b","ab")

> names(x)

[1] "a"  "b"  "ab"

> x

 a  b ab
 1  2  4

Consider one line in particular:

> names(x) <- c("a","b","ab")

Looks totally innocuous, eh? Well, no. In fact, it's outrageous! How on Earth can we possibly assign a value to the result of a function call? The resolution to this odd state of affairs lies in the R notion of replacement functions. The preceding line of R code actually is the result of executing the following:

x <- "names<-"(x,value=c("a","b","ab"))

No, this isn't a typo. The call here is indeed to a function named names<-(). (We need to insert the quotation marks due to the special characters involved.)

#### What’s Considered a Replacement Function?

Any assignment statement in which the left side is not just an identifier (meaning a variable name) is considered a replacement function. When encountering this:

g(u) <- v

R will try to execute this:

u <- "g<-"(u,value=v)

Note the “try” in the preceding sentence. The statement will fail if you have not previously defined g<-(). Note that the replacement function has one more argument than the original function g(), a named argument value, for reasons explained in this section. In earlier chapters, you’ve seen this innocent-looking statement:

x[3] <- 8

The left side is not a variable name, so it must be a replacement function, and indeed it is, as follows. Subscripting operations are functions: the function "["() is for reading vector elements, and "[<-"() is used to write. Here's an example:

> x <- c(8,88,5,12,13)

> x

[1]  8  88  5  12  13

> x[3]
[1] 5

> "["(x,3)
[1] 5
> x <- "[<-"(x,2:3,value=99:100)

> x
[1]   8  99 100  12  13

Again, that complicated call in this line:

> x <- "[<-"(x,2:3,value=99:100)

is simply performing what happens behind the scenes when we execute this:

x[2:3] <- 99:100

We can easily verify what’s occurring like so:

> x <- c(8,88,5,12,13)

> x[2:3] <- 99:100

> x

[1] 8  99 100  12  13
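Before moving on, here is a small illustrative replacement function of our own invention (first<- is not part of base R). Once it is defined, assignments of the form first(x,k) <- v work automatically; no separate reading function first() is needed:

```r
# hypothetical replacement function: writing to first(x,k) replaces
# the first k elements of x (the name is our own, not base R)
"first<-" <- function(x,k,value) {
   x[1:k] <- value  # value is recycled if shorter than k
   return(x)
}

x <- c(8,88,5,12,13)
first(x,2) <- 99  # R runs: x <- "first<-"(x,2,value=99)
```

After the assignment, x is c(99,99,5,12,13).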

#### Extended Example: A Self-Bookkeeping Vector Class

Suppose we have vectors on which we need to keep track of writes. In other words, when we execute the following:

x[2] <- 8

we would like not only to change the value in x[2] to 8 but also increment a count of the number of times x[2] has been written to. We can do this by writing class-specific versions of the generic replacement functions for vector subscripting.

NOTE This code uses classes, which we’ll discuss in detail later. For now, all you need to know is that S3 classes are constructed by creating a list and then anointing it as a class by calling the class() function.

```
1   # class "bookvec" of vectors that count writes of their elements
2
3   # each instance of the class consists of a list whose components are the
4   # vector values and a vector of counts
5
6   # construct a new object of class bookvec
7   newbookvec <- function(x) {
8      tmp <- list()
9      tmp$vec <- x  # the vector itself
10     tmp$wrts <- rep(0,length(x))  # counts of the writes, one for each element
11     class(tmp) <- "bookvec"
12     return(tmp)
13  }
14
15  # function to read
16  "[.bookvec" <- function(bv,subs) {
17     return(bv$vec[subs])
18  }
19
20  # function to write
21  "[<-.bookvec" <- function(bv,subs,value) {
22     bv$wrts[subs] <- bv$wrts[subs] + 1  # note the recycling
23     bv$vec[subs] <- value
24     return(bv)
25  }
```

Let's test it.

```
> b <- newbookvec(c(3,4,5,5,12,13))
> b
$vec
[1]  3  4  5  5 12 13

$wrts
[1] 0 0 0 0 0 0

attr(,"class")
[1] "bookvec"
> b[2]
[1] 4
> b[2] <- 88  # try writing
> b[2]  # worked?
[1] 88
> b$wrts  # write count incremented?
[1] 0 1 0 0 0 0
```

We have named our class "bookvec", because these vectors will do their own bookkeeping, that is, keep track of write counts. So, the subscripting functions will be [.bookvec() and [<-.bookvec().

Our function newbookvec() (line 7) does the construction for this class. In it, you can see the structure of the class: an object will consist of the vector itself, vec (line 9), and a vector of write counts, wrts (line 10). By the way, note in line 11 that the function class() itself is a replacement function!

The functions [.bookvec() and [<-.bookvec() are fairly straightforward. Just remember to return the entire object in the latter.
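The "note the recycling" comment in line 22 deserves a quick demonstration: because subs can be a vector of indices, a single assignment bumps several write counts at once. (The definitions here are condensed from the listing above so the snippet is self-contained.)

```r
# condensed from the listing above
newbookvec <- function(x) {
   tmp <- list(vec=x, wrts=rep(0,length(x)))
   class(tmp) <- "bookvec"
   return(tmp)
}
"[<-.bookvec" <- function(bv,subs,value) {
   bv$wrts[subs] <- bv$wrts[subs] + 1  # "+ 1" recycled over all of subs
   bv$vec[subs] <- value
   return(bv)
}

b <- newbookvec(c(3,4,5,5,12,13))
b[2:4] <- 0   # one assignment, three elements written
b$wrts        # counts for elements 2 through 4 each incremented once
```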

## Tools for Composing Function Code

If you are writing a short function that’s needed only temporarily, a quick-and-dirty way to do this is to write it on the spot, right there in your inter-active terminal session. Here’s an example:

> g <- function(x) {

+     return(x+1)

+  }

#### Text Editors and Integrated Development Environments

You can use a text editor such as Vim, Emacs, or even Notepad, or an editor within an integrated development environment (IDE) to write your code in a file and then read it into R from the file. To do the latter, you can use R's source() function. For instance, suppose we have functions f() and g() in a file xyz.R. In R, we give this command:

> source("xyz.R")

This reads f() and g() into R as if we had typed them using the quick-and-dirty way shown at the beginning of this section. If you don't have much code, you can cut and paste from your editor window to your R window. Some general-purpose editors have special plug-ins available for R, such as ESS for Emacs and Vim-R for Vim. There are also IDEs for R, such as the commercial one by Revolution Analytics, and open source products such as StatET, JGR, Rcmdr, and RStudio.

#### The edit() Function

A nice implication of the fact that functions are objects is that you can edit functions from within R's interactive mode. Most R programmers do their code editing with a text editor in a separate window, but for a small, quick change, the edit() function can be handy. For instance, we could edit the function f1() by typing this:

> f1 <- edit(f1)

This opens the default editor on the code for f1(), which we could then edit and assign back to f1. Or, we might be interested in having a function f2() very similar to f1() and thus could execute the following:

> f2 <- edit(f1)

This gives us a copy of f1() to start from. We would do a little editing and then save to f2(), as seen in the preceding command. The editor involved will depend on R’s internal options variable editor. In UNIX-class systems, R will set this from your shell’s EDITOR or VISUAL environment variable, or you can set it yourself, as follows:

> options(editor="/usr/bin/vim")

For more details on using options, see the online documentation by typing the following:

> ?options

You can use edit() to edit data structures, too.

## Writing Your Own Binary Operations

You can invent your own operations! Just write a function whose name begins and ends with %, with two arguments of a certain type, and a return value of that type. For example, here's a binary operation that adds double the second operand to the first:

> "%a2b%" <- function(a,b) return(a+2*b)

> 3 %a2b% 5

[1] 13
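Since %a2b% is just an ordinary function of its two operands, it inherits R's elementwise arithmetic for free, so the operator works on whole vectors as well:

```r
# our operator is just a function, so elementwise arithmetic in its
# body makes the operator itself vectorized
"%a2b%" <- function(a,b) return(a+2*b)
c(1,2,3) %a2b% c(10,20,30)  # elementwise: 21 42 63
```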

## Anonymous Functions

As remarked at several points, the purpose of the R function function() is to create functions. For instance, consider this code:

inc <- function(x) return(x+1)

It instructs R to create a function that adds 1 to its argument and then assigns that function to inc. However, that last step, the assignment, is not always taken. We can simply use the function object created by our call to function() without naming that object. Functions used in that context are called anonymous, since they have no name. (That is somewhat misleading, since even nonanonymous functions only have a name in the sense that a variable is pointing to them.) Anonymous functions can be convenient if they are short one-liners and are called by another function. Let's go back to our example of using apply():

> z
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

> f <- function(x) x/c(2,8)

> y <- apply(z,1,f)

> y

[,1] [,2] [,3]

[1,] 0.5 1.000 1.50

[2,] 0.5 0.625 0.75

Let’s bypass the middleman—that is, skip the assignment to f—by using an anonymous function within our call to apply(), as follows:

> y <- apply(z,1,function(x) x/c(2,8))

> y

[,1] [,2] [,3]

[1,] 0.5 1.000 1.50

[2,] 0.5 0.625 0.75

What really happened here? The third formal argument to apply() must be a function, which is exactly what we supplied here, since the return value of function() is a function! Doing things this way is often clearer than defining the function externally. Of course, if the function is more complicated, that clarity is not attained.
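The same idiom works throughout the apply family; for instance, a one-liner with sapply():

```r
# square each element without bothering to name the helper function
sapply(1:4, function(x) x^2)  # 1 4 9 16
```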
