R PROGRAMMING CHAPTER 4
DOING MATH AND SIMULATIONS IN R
R contains builtin functions for your favorite math operations and, of course, for statistical distributions. This chapter provides an overview of using these functions. Given the mathematical nature of this chapter, the examples assume a slightly higherlevel knowledge than those in other chapters. You should be familiar with calculus and linear algebra to get the most out of these examples.
Math Functions
R includes an extensive set of builtin math functions. Here is a partial list:
 exp(): Exponential function, base e
 log(): Natural logarithm
 log10(): Logarithm base 10
 sqrt(): Square root
 abs(): Absolute value
 sin(), cos(), and so on: Trig functions
 min() and max(): Minimum value and maximum value within a vector
 which.min() and which.max(): Index of the minimal element and maximal element of a vector
 pmin() and pmax(): Elementwise minima and maxima of several vectors
 sum() and prod(): Sum and product of the elements of a vector
 cumsum() and cumprod(): Cumulative sum and product of the elements of a vector
 round(), floor(), and ceiling(): Round to the closest integer, to the closest integer below, and to the closest integer above
 factorial(): Factorial function
Extended Example: Calculating a Probability
As our first example, we’ll work through calculating a probability using the prod() function. Suppose we have n independent events, and the i^{th} event has the probability p_{i} of occurring. What is the probability of exactly one of these events occurring? Suppose first that n = 3 and our events are named A, B, and C. Then we break down the computation as follows:
P(exactly one event occurs) =
P(AandnotBandnotC) +
P(notAandBandnotC) +
P(not A and not B and C)
P(A and not B and not C) would be p_{A}(1 − p_{B} )(1 − p_{C} ), and so on.For general n, that is calculated as follows:
n
∑ p_{i}(1 − p_{1})…(1 − p_{i−}_{1})(1 − p_{i}_{+1})…(1 p_{n})
i=1
(The i^{th} term inside the sum is the probability that event i occurs and all the others do not occur.) Here’s code to compute this, with our probabilities p_{i} contained in the vector p:
exactlyone < function(p) {
notp < 1 – p
tot < 0.0
for (i in 1:length(p))
tot < tot + p[i] _{*} prod(notp[i])
return(tot)
}
How does it work? Well, the assignment
notp < 1 – p
creates a vector of all the “not occur” probabilities 1 − p_{j} , using recycling. The expression notp[i] computes the product of all the elements of notp, except the i^{th}—exactly what we need.
Cumulative Sums and Products
As mentioned, the functions cumsum() and cumprod() return cumulative sums and products.
> x < c(12,5,13)
> cumsum(x)
[1] 12 17 30
> cumprod(x)
[1] 12 60 780
In x, the sum of the first element is 12, the sum of the first two elements is 17, and the sum of the first three elements is 30. The function cumprod() works the same way as cumsum(), but with the product instead of the sum.
Minima and Maxima
There is quite a difference between min() and pmin(). The former simply combines all its arguments into one long vector and returns the minimum value in that vector. In contrast, if pmin() is applied to two or more vectors, it returns a vector of the pairwise minima, hence the name pmin. Here’s an example:
>z[,1] [,2]
[1,] 1 2
[2,] 5 3
[3,] 6 2
> min(z[,1],z[,2])
[1] 1
> pmin(z[,1],z[,2])
[1] 1 3 2
In the first case, min() computed the smallest value in (1,5,6,2,3,2). But the call to pmin() computed the smaller of 1 and 2, yielding 1; then the smaller of 5 and 3, which is 3; then finally the minimum of 6 and 2, giving 2. Thus, the call returned the vector (1,3,2). You can use more than two arguments in pmin(), like this:
> pmin(z[1,],z[2,],z[3,])
[1] 1 2
The 1 in the output is the minimum of 1, 5, and 6, with a similar computation leading to the 2. The max() and pmax() functions act analogously to min() and pmin(). Function minimization/maximization can be done via nlm() and optim(). For example, let’s find the smallest value of f (x) = x^{2} − sin(x).
> nlm(function(x) return(x^2sin(x)),8) $minimum
[1] 0.2324656
$estimate
[1] 0.4501831
$gradient
[1] 4.024558e09
$code
[1] 1
$iterations
[1] 5
Here, the minimum value was found to be approximately −0.23, occurring at x = 0.45. A NewtonRaphson method (a technique from numerical analysis for approximating roots) is used, running five iterations in this case. The second argument specifies the initial guess, which we set to be 8. (This second argument was picked pretty arbitrarily here, but in some problems, you may need to experiment to find a value that will lead to convergence.)
Calculus
R also has some calculus capabilities, including symbolic differentiation and numerical integration, as you can see in the following example.
> D(expression(exp(x^2)),”x”) # derivative
exp(x^2) _{*} (2 _{*} x)
> integrate(function(x) x^2,0,1)
0.3333333 with absolute error < 3.7e15
Here, R reported
You can find R packages for differential equations (odesolve), for interfacing R with the Yacas symbolic math system (ryacas), and for other calculus operations. These packages, and thousands of others, are available from the Comprehensive R Archive Network (CRAN).
Functions for Statistical Distributions
R has functions available for most of the famous statistical distributions.
Prefix the name as follows:
 With d for the density or probability mass function (pmf)
 With p for the cumulative distribution function (cdf)
 With q for quantiles
 With r for random number generation
The rest of the name indicates the distribution. Table 81 lists some common statistical distribution functions.
Table 81: Common R Statistical Distribution Functions
Distribution Density/pms cdf Quantiles Random Numbers
Normal dnorm() pnorm() qnorm() rnorm()
Chi square dchisq() pchisq() qchisq() rchisq()
Binomial dbinom() pbinom() qbinom() rbinom()
As an example, let’s simulate 1,000 chisquare variates with 2 degrees of freedom and find their mean.
> mean(rchisq(1000,df=2))
[1] 1.938179
The r in rchisq specifies that we wish to generate random numbers— in this case, from the chisquare distribution. As seen in this example, the first argument in the rseries functions is the number of random variates to generate. These functions also have arguments specific to the given distribution families. In our example, we use the df argument for the chisquare family, indicating the number of degrees of freedom.
NOTE Consult R’s online help for details on the arguments for the statistical distribution functions. For instance, to find our more about the chisquare function for quantiles, type ?qchisq at the command prompt.
Let’s also compute the 95^{th} percentile of the chisquare distribution with two degrees of freedom:
> qchisq(0.95,2)
[1] 5.991465
Here, we used q to indicate quantile—in this case, the 0.95 quantile, or the 95^{th} percentile. The first argument in the d, p, and q series is actually a vector so that we can evaluate the density/pmf, cdf, or quantile function at multiple points. Let’s find both the 50^{th} and 95^{th} percentiles of the chisquare distribution with 2 degrees of freedom.
qchisq(c(0.5,0.95),df=2)
[1] 1.386294 5.991465
Sorting
Ordinary numerical sorting of a vector can be done with the sort() function, as in this example:
x < c(13,5,12,5)
sort(x)
[1] 5 5 12 13
> x
[1] 13 5 12 5
Note that x itself did not change, in keeping with R’s functional language philosophy. If you want the indices of the sorted values in the original vector, use the order() function. Here’s an example:
> order(x)
[1] 2 4 3 1
This means that x[2] is the smallest value in x, x[4] is the second smallest, x[3] is the third smallest, and so on. You can use order(), together with indexing, to sort data frames, like this:
>y
V1 V2
1 def 2
2 ab 5
3 zzzz 1
> r < order(y$V2)
>r
[1] 3 1 2
> z < y[r,]
>z
V1 V2
3 zzzz 1
1 def 2
2 ab 5
What happened here? We called order() on the second column of y, yielding a vector r, telling us where numbers should go if we want to sort them. The 3 in this vector tells us that x[3,2] is the smallest number in x[,2]; the 1 tells us that x[1,2] is the second smallest; and the 2 tells us that x[2,2] is the third smallest. We then use indexing to produce the frame sorted by column 2, storing it in z. You can use order() to sort according to character variables as well as numeric ones, as follows:
>d
kids ages
1 Jack 12
2 Jill 10
3 Billy 13
> d[order(d$kids),]
kids ages
3 Billy 13
1 Jack 12
2 Jill 10
> d[order(d$ages),]
kids ages
2 Jill 10
1 Jack 12
3 Billy 13
A related function is rank(), which reports the rank of each element of a vector.
> x < c(13,5,12,5)
> rank(x)
[1] 4. 0 1.5 3.0 1.5
This says that 13 had rank 4 in x; that is, it is the fourth smallest. The value 5 appears twice in x, with those two being the first and second smallest, so the rank 1.5 is assigned to both. Optionally, other methods of handling ties can be specified.
Linear Algebra Operations on Vectors and Matrices
Multiplying a vector by a scalar works directly, as you saw earlier. Here’s another example:
> y
[1] 1 3 4 10
> 2_{*}y
[1] 2 6 8 20
If you wish to compute the inner product (or dot product) of two vectors, use crossprod(), like this:
> crossprod(1:3,c(5,12,13))
[,1]
[1,] 68
The function computed 1 · 5 + 2 · 12 + 3 · 13 = 68.
Note that the name crossprod() is a misnomer, as the function does not compute the vector cross product. We’ll develop a function to compute real cross products before. For matrix multiplication in the mathematical sense, the operator to use is %_{*}%, not _{*}. For instance, here we compute the matrix product:
Here’s the code:
> a
[,1] [,2]
[1,] 1 2
[2,] 3 4
> b
[,1] [,2]
[1,] 1 1
[2,] 0 1
> a %*% b
[,1] [,2]
[1,] 1 1
[2,] 3 1
The function solve() will solve systems of linear equations and even find matrix inverses. For example, let’s solve this system:
x_{1} + x_{2} = 2
−x_{1} + x_{2} = 4
Its matrix form is as follows:
Here’s the code:
> a < matrix(c(1,1,1,1),nrow=2,ncol=2)
> b < c(2,4)
> solve(a,b)
[1] 3 1
> solve(a)
[,1] [,2]
[1,] 0.5 0.5
[2,] 0.5 0.5
In that second call to solve(), the lack of a second argument signifies that we simply wish to compute the inverse of the matrix. Here are a few other linear algebra functions:
 t(): Matrix transpose
 qr(): QR decomposition
 chol(): Cholesky decomposition
 det(): Determinant
 eigen(): Eigenvalues/eigenvectors
 diag(): Extracts the diagonal of a square matrix (useful for obtaining variances from a covariance matrix and for constructing a diagonal matrix).
 sweep(): Numerical analysis sweep operations
Note the versatile nature of diag(): If its argument is a matrix, it returns a vector, and vice versa. Also, if the argument is a scalar, the function returns the identity matrix of the specified size.
>m[,1] [,2]
[1,] 1 2
[2,] 7 8
> dm < diag(m)
> dm
[1] 1 8
> diag(dm)
[,1] [,2]
[1,] 1 0
[2,] 0 8
> diag(3)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
The sweep() function is capable of fairly complex operations. As a simple example, let’s take a 3by3 matrix and add 1 to row 1, 4 to row 2, and 7 to row 3.
>m[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
> sweep(m,1,c(1,4,7),”+”)
[,1] [,2] [,3]
[1,] 2 3 4
[2,] 8 9 10
[3,] 14 15 16
The first two arguments to sweep() are like those of apply(): the array and the margin, which is 1 for rows in this case. The fourth argument is a function to be applied, and the third is an argument to that function (to the “+” function).
Extended Example: Vector Cross Product
Let’s consider the issue of vector cross products. The definition is very simple: The cross product of vectors (x_{1}, x_{2}, x_{3}) and (y_{1}, y_{2}, y_{3}) in threedimensional space is a new threedimensional vector, as shown in equation 8.1.
(x2y3 − x3y2, −x1y3 + x3y1, x1y2 − x2y1) (8.1)
This can be expressed compactly as the expansion along the top row of the determinant, as shown in Equation 8.2.
(8.2)
Here, the elements in the top row are merely placeholders.Don’t worry about this bit of pseudomath. The point is that the cross product vector can be computed as a sum of subdeterminants. For instance, the first component in Equation 8.1, x_{2}y_{3} − x_{3}y_{2}, is easily seen to be the determinant of the submatrix obtained by deleting the first row and first column in Equation 8.2, as shown in Equation 8.3.
(8.3)
Our need to calculate subdeterminants—that is determinants of submatrices—fits perfectly with R, which excels at specifying submatrices. This suggests calling det() on the proper submatrices, as follows:
xprod < function(x,y) {
m < rbind(rep(NA,3),x,y)
xp < vector(length=3)
for (i in 1:3)
xp[i] < (1)^i _{*} det(m[2:3,i])
return(xp)
}
Note that even R’s ability to specify values as NA came into play here to deal with the “placeholders” mentioned above. All this may seem like overkill. After all, it wouldn’t have been hard to code Equation 8.1 directly, without resorting to use of submatrices and determinants. But while that may be true in the threedimensional case, the approach shown here is quite fruitful in the nary case, in ndimensional space. The cross product there is defined as an nbyn determinant of the form shown in Equation 8.1, and thus the preceding code generalizes perfectly.
Extended Example: Finding Stationary Distributions of Markov Chains
A Markov chain is a random process in which we move among various states, in a “memoryless” fashion, whose definition need not concern us here. The state could be the number of jobs in a queue, the number of items stored in inventory, and so on. We will assume the number of states to be finite. As a simple example, consider a game in which we toss a coin repeatedly and win a dollar whenever we accumulate three consecutive heads. Our state at any time i will the number of consecutive heads we have so far, so our state can be 0, 1, or 2. (When we get three heads in a row, our state reverts to 0.)
The central interest in Markov modeling is usually the longrun state distribution, meaning the longrun proportions of the time we are in each state. In our cointoss game, we can use the code we’ll develop here to calculate that distribution, which turns out to have us at states 0, 1, and 2 in proportions 57.1%, 28.6%, and 14.3% of the time. Note that we win our dollar if we are in state 2 and toss a head, so 0.143 × 0.5 = 0.071 of our tosses will result in wins. Since R vector and matrix indices start at 1 rather than 0, it will be convenient to relabel our states here as 1, 2, and 3 rather than 0, 1, and 2. For example, state 3 now means that we currently have two consecutive heads.
Let p_{ij} denote the transition probability of moving from state i to state j during a time step. In the game example, for instance, p_{23} = 0.5, reflecting the fact that with probability 1/2, we will toss a head and thus move from having one consecutive head to two. On the other hand, if we toss a tail while we are in state 2, we go to state 1, meaning 0 consecutive heads; thus p_{21} = 0.5. We are interested in calculating the vector π = (π_{1}, …, π_{s}), where π_{i} is the longrun proportion of time spent at state i, over all states i. Let P denote the transition probability matrix whose i^{th} row, j^{th} column element is p_{ij} . Then it can be shown that π must satisfy Equation 8.4,
π = πP (8.4)
which is equivalent to Equation 8.5:
(I − P T )π = 0 (8.5)
Here I is the identity matrix and P ^{T} denotes the transpose of P. Any single one of the equations in the system of Equation 8.5 is redundant. We thus eliminate one of them, by removing the last row of I −P in Equation 8.5. That also means removing the last 0 in the 0 vector on the righthand side of Equation 8.5. But note that there is also the constraint shown in Equation 8.6.
∑πi = 1 (8.6)
In matrix terms, this is as follows:
1^{T}_{n} π = 1
where 1_{n} is a vector of n 1s.
So, in the modified version of Equation 8.5, we replace the removed row with a row of all 1s and, on the righthand side, replace the removed 0 with a 1. We can then solve the system. All this can be computed with R’s solve() function, as follows:
findpi1 < function(p) {
n < nrow(p)
imp < diag(n) – t(p)
imp[n,] < rep(1,n)
rhs < c(rep(0,n1),1)
pivec < solve(imp,rhs)
return(pivec)
}
Here are the main steps:
 Calculate I − P ^{T} in line 3. Note again that diag(), when called with a scalar argument, returns the identity matrix of the size given by that argument.
 Replace the last row of P with 1 values in line 4.
 Set up the righthand side vector in line 5.
 Solve for π in line 6.
Another approach, using more advanced knowledge, is based on eigenvalues. Note from Equation 8.4 that π is a left eigenvector of P with eigenvalue 1. This suggests using R’s eigen() function, selecting the eigenvector corresponding to that eigenvalue. (A result from mathematics, the PerronFrobenius theorem, can be used to carefully justify this.) Since π is a left eigenvector, the argument in the call to eigen() must be P transpose rather than P. In addition, since an eigenvector is unique only up to scalar multiplication, we must deal with two issues regarding the eigenvector returned to us by eigen():
 It may have negative components. If so, we multiply by −1.
 It may not satisfy Equation 8.6. We remedy this by dividing by the length of the returned vector.
Here is the code:
findpi2 < function(p) {
n < nrow(p)
# find first eigenvector of P transpose
pivec < eigen(t(p))$vectors[,1]
# guaranteed to be real, but could be negative
if (pivec[1] < 0) pivec < pivec
# normalize to sum to 1
pivec < pivec / sum(pivec)
return(pivec)
}
The return value of eigen() is a list. One of the list’s components is a matrix named vectors. These are the eigenvectors, with the i^{th} column being the eigenvector corresponding to the i^{th} eigenvalue. Thus, we take column 1 here.
Set Operations
R includes some handy set operations, including these:
 union(x,y): Union of the sets x and y
 intersect(x,y): Intersection of the sets x and y
 setdiff(x,y): Set difference between x and y, consisting of all elements of x that are not in y
 setequal(x,y): Test for equality between x and y
 c %in% y: Membership, testing whether c is an element of the set y
 choose(n,k): Number of possible subsets of size k chosen from a set of size n
Here are some simple examples of using these functions:
> x < c(1,2,5)
> y < c(5,1,8,9)
> union(x,y)
[1] 1 2 5 8 9
> intersect(x,y)
[1] 1 5
> setdiff(x,y)
[1] 2
> setdiff(y,x)
[1] 8 9
> setequal(x,y)
[1] FALSE
> setequal(x,c(1,2,5))
[1] TRUE
> 2 %in% x
[1] TRUE
> 2 %in% y
[1] FALSE
> choose(5,2)
[1] 10
Recall that you can write your own binary operations. For instance, consider coding the symmetric difference between two sets— that is, all the elements belonging to exactly one of the two operand sets.Because the symmetric difference between sets x and y consists exactly of those elements in x but not y and vice versa, the code consists of easy calls to setdiff() and union(), as follows:
> symdiff
function(a,b) {
sdfxy < setdiff(x,y)
sdfyx < setdiff(y,x)
return(union(sdfxy,sdfyx))
}
Let’s try it.
> x
[1] 1 2 5
y
[1] 5 1 8 9
symdiff(x,y)
[1] 2 8 9
Here’s another example: a binary operand for determining whether one set u is a subset of another set v. A bit of thought shows that this property
is equivalent to the intersection of u and v being equal to u. Hence we have another easily coded function:
> “%subsetof%” < function(u,v) {
+ return(setequal(intersect(u,v),u))
+ }
> c(3,8) %subsetof% 1:10
[1] TRUE
> c(3,8) %subsetof% 5:10
[1] FALSE
The function combn() generates combinations. Let’s find the subsets of {1,2,3} of size 2.
> c32 < combn(1:3,2)
> c32
[,1] [,2] [,3]
[1,] 1 1 2
[2,] 2 3 3
> class(c32)
[1] “matrix”
The results are in the columns of the output. We see that the subsets of {1,2,3} of size 2 are (1,2), (1,3), and (2,3). The function also allows you to specify a function to be called by combn() on each combination. For example, we can find the sum of the numbers in each subset, like this:
> combn(1:3,2,sum)
[1] 3 4 5
The first subset, {1,2}, has a sum of 2, and so on.
Simulation Programming in R
One of the most common uses of R is simulation. Let’s see what kinds of tools R has available for this application.
BuiltIn Random Variate Generators
As mentioned, R has functions to generate variates from a number of different distributions. For example, rbinom() generates binomial or Bernoulli random variates.Let’s say we want to find the probability of getting at least four heads out of five tosses of a coin (easy to find analytically, but a handy example). Here’s how we can do this:
> x < rbinom(100000,5,0.5)
> mean(x >= 4)
[1] 0.18829
First, we generate 100,000 variates from a binomial distribution with five trials and a success probability of 0.5. We then determine which of them has a value 4 or 5, resulting in a Boolean vector of the same length as x. The TRUE and FALSE values in that vector are treated as 1s and 0s by mean(), giving us our estimated probability (since the average of a bunch of 1s and 0s is the proportion of 1s).
Other functions include rnorm() for the normal distribution, rexp() for the exponential, runif() for the uniform, rgamma() for the gamma, rpois() for the Poisson, and so on. Here is another simple example, which finds E[max(X, Y )], the expected value of the maximum of independent N(0,1) random variables X and Y:
sum < 0
nreps < 100000
for (i in 1:nreps) {
xy < rnorm(2) # generate 2 N(0,1)s
sum < sum + max(xy)
}
print(sum/nreps)
We generated 100,000 pairs, found the maximum for each, and averaged those maxima to obtain our estimated expected value. The preceding code, with an explicit loop, may be clearer, but as before, if we are willing to use some more memory, we can do this more compactly.
> emax
function(nreps) {
x < rnorm(2_{*}nreps)
maxxy < pmax(x[1:nreps],x[(nreps+1):(2_{*}nreps)])
return(mean(maxxy))
}
Here, we generated double nreps values. The first nreps value simulates X, and the remaining nreps value represents Y. The pmax() call then computes the pairwise maxima that we need. Again, note the contrast here between max() and pmax(), the latter producing pairwise maxima.
Obtaining the Same Random Stream in Repeated Runs
According to the R documentation, all randomnumber generators use 32bit integers for seed values. Thus, other than roundoff error, the same initial seed should generate the same stream of numbers.By default, R will generate a different random number stream from run to run of a program. If you want the same stream each time—important in debugging, for instance—call set.seed(), like this:
> set.seed(8888) # or your favorite number as an argument
Extended Example: A Combinatorial Simulation
Consider the following probability problem:
Three committees, of sizes 3, 4 and 5, are chosen from 20 people. What is the probability that persons A and B are chosen for the same committee?
This problem is not hard to solve analytically, but we may wish to check our solution using simulation, and in any case, writing the code will demonstrate how R’s set operations can come in handy in combinatorial settings. Here is the code:
sim < function(nreps) {
commdata < list() # will store all our info about the 3 committees
commdata$countabsamecomm < 0
for (rep in 1:nreps) {
commdata$whosleft < 1:20 # who’s left to choose from
commdata$numabchosen < 0 # number among A, B chosen so far
# choose committee 1, and check for A,B serving together
commdata < choosecomm(commdata,5)
# if A or B already chosen, no need to look at the other comms.
if (commdata$numabchosen > 0) next
# choose committee 2 and check
commdata < choosecomm(commdata,4)
if (commdata$numabchosen > 0) next
# choose committee 3 and check
commdata < choosecomm(commdata,3)
}
print(commdata$countabsamecomm/nreps)
}
choosecomm < function(comdat,comsize) {
# choose committee
committee < sample(comdat$whosleft,comsize)
# count how many of A and B were chosen
comdat$numabchosen < length(intersect(1:2,committee))
if (comdat$numabchosen == 2)
comdat$countabsamecomm < comdat$countabsamecomm + 1
# delete chosen committee from the set of people we now have to choose from
comdat$whosleft < setdiff(comdat$whosleft,committee)
return(comdat)
}
We number the potential committee members from 1 to 20, with persons A and B having ID 1 and 2. Recalling that R lists are often used to store several related variables in one basket, we se up a list comdat. Its components include the following:
 comdat$whosleft: We simulate the random selection of the committees by randomly choosing from this vector. Each time we choose a committee, we remove the committee members’ IDs. It is initialized to 1:20, indicating that no one has been selected yet.
 comdat$numabchosen: This is a count of how many among the people A and B have been chosen so far. If we choose a committee and find this to be positive, we can skip choosing the remaining committees for the following reason: If this number is 2, we know definitely that A and B are on the same committee; if it is 1, we know definitely that A and B are not on the same committee.
 comdat$countabsamecomm: Here, we store a count of the number of times A and B are on the same committee.
Since committee selection involves subsets, it’s not surprising that a couple of R’s set operations—intersect() and setdiff()—come in handy here. Note, too, the use of R’s next statement, which tells R to skip the rest of this iteration of the loop.
OBJECTORIENTED PROGRAMMING
Many programmers believe that objectoriented programming (OOP) makes for clearer, more reusable code. Though very different from the familiar OOP languages like C++, Java, and Python, R is very much OOP in outlook.
The following themes are key to R:
 Everything you touch in R—ranging from numbers to character strings to matrices—is an object.
 R promotes encapsulation, which is packaging separate but related data items into one class instance. Encapsulation helps you keep track of related variables, enhancing clarity.
 R classes are polymorphic, which means that the same function call leads to different operations for objects of different classes. For instance, a call to print() on an object of a certain class triggers a call to a print function tailored to that class. Polymorphism promotes reusability.
 R allows inheritance, which allows extending a given class to a more specialized class.
This chapter covers OOP in R. We’ll discuss programming in the two types of classes, S3 and S4, and then present a few useful OOPrelated R utilities.
S3 Classes
The original R structure for classes, known as S3, is still the dominant class paradigm in R use today. Indeed, most of R’s own builtin classes are of the S3 type. An S3 class consists of a list, with a class name attribute and dispatch capability added. The latter enables the use of generic functions. S4 classes were developed later, with goal of adding safety, meaning that you cannot accidentally access a class component that is not already in existence.
S3 Generic Functions
As mentioned, R is polymorphic, in the sense that the same function can lead to different operations for different classes. You can apply plot(), for example, to many different types of objects, getting a different type of plot for each. The same is true for print(), summary(), and many other functions. In this manner, we get a uniform interface to different classes. For example, if you are writing code that includes plot operations, polymorphism may allow you to write your program without worrying about the various types of objects that might be plotted.
In addition, polymorphism certainly makes things easier to remember for the user and makes it fun and convenient to explore new library functions and associated classes. If a function is new to you, just try running plot() on the function’s output; it will likely work. From a programmer’s viewpoint, polymorphism allows writing fairly general code, without worrying about what type of object is being manipulated, because the underlying class mechanisms take care of that.The functions that work with polymorphism, such as plot() and print(), are known as generic functions. When a generic function is called, R will then dispatch the call to the proper class method, meaning that it will reroute the call to a function defined for the object’s class.
Example: OOP in the lm() Linear Model Function
As an example, let’s look at a simple regression analysis run via R’s lm() function. First, let’s see what lm() does:
> ?lm
The output of this help query will tell you, among other things, that this function returns an object of class “lm”. Let’s try creating an instance of this object and then printing it:
> x < c(1,2,3)
> y < c(1,3,8)
> lmout < lm(y ~ x)
> class(lmout)
[1] “lm”
> lmout
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
3.0 3.5
Here, we printed out the object lmout. (Remember that by simply typing the name of an object in interactive mode, the object is printed.) The R interpreter then saw that lmout was an object of class “lm” and thus called print.lm(), a special print method for the “lm” class. In R terminology, the call to the generic function print() was dispatched to the method print.lm() associated with the class “lm”. Let’s take a look at the generic function and the class method in this case:
function(x, …) UseMethod(“print”)
< environment: namespace:base >
> print.lm
function (x, digits = max(3, getOption(“digits”) – 3), …)
{
cat(“\nCall:\n”, deparse(x$call), “\n\n”, sep = “”)
if (length(coef(x))) {
cat(“Coefficients:\n”)
print.default(format(coef(x), digits = digits), print.gap = 2,
quote = FALSE)
}
else cat(“No coefficients\n”)
cat(“\n”)
invisible(x)
}
< environment: namespace:stats >
You may be surprised to see that print() consists solely of a call to UseMethod(). But this is actually the dispatcher function, so in view of print()’s role as a generic function, you should not be surprised after all. Don’t worry about the details of print.lm(). The main point is that the printing depends on context, with a special print function called for the “lm” class. Now let’s see what happens when we print this object with its class attribute removed:
> unclass(lmout)
$coefficients
(Intercept) x
3.0 3.5
$residuals
1 2 3
0.5 1.0. 0.5
$effects
(Intercept) x
6.928203 4.949747 1.224745
$rank
 2
…
I’ve shown only the first few lines here—there’s a lot more. (Try running this on your own!) But you can see that the author of lm() decided to make print.lm() much more concise, limiting it to printing a few key quantities.
Finding the Implementations of Generic Methods
You can find all the implementations of a given generic method by calling methods(), like this:
> methods(print)

[1] print.acf* [2] print.anova [3] print.aov* [4] print.aovlist* [5] print.ar*
[6] print.Arima* [7] print.arima0* [8] print.AsIs [9] print.aspell*
[10] print.Bibtex* [11] print.browseVignettes* [12] print.by [13] print.check_code_usage_in_package* [14] print.check_demo_index* [15] print.checkDocFiles*
[16] print.checkDocStyle* [17] print.check_dotInternal* [18] print.checkFF* [19] print.check_make_vars* [20] print.check_package_code_syntax*
…
Asterisks denote nonvisible functions, meaning ones that are not in the default namespaces. You can find these functions via getAnywhere() and then access them by using a namespace qualifier. An example is print.aspell(). The aspell() function itself does a spellcheck on the file specified in its argument. For example, suppose the file wrds consists of this line:
Which word is mispelled?
In this case, this function will catch the misspelled word, as follows:
aspell(“wrds”)
mispelled
wrds:1:15
The output says that there is the indicated spelling error in line 1, character 15 of the input file. But what concerns us here is the mechanism by which that output was printed. The aspell() function returns an object of class “aspell”, which does have its own generic print function, print.aspell(). In fact, that function was invoked in our example, after the call to aspell(), and the return value was printed out. At that time, R called UseMethod() on the object of class “aspell”. But if we call that print method directly, R won’t recognize it:
> aspout < aspell(“wrds”)
> print.aspell(aspout)
Error: could not find function “print.aspell”
However, we can find it by calling getAnywhere():
> getAnywhere(print.aspell)
A single object matching ‘print.aspell’ was found
It was found in the following places
registered S3 method for print from namespace utils
namespace:utils
with value
function (x, sort = TRUE, verbose = FALSE, indent = 2L, …)
{
if (!(nr < nrow(x)))
…
So, the function is in the utils namespace, and we can execute it by adding such a qualifier:
> utils:::print.aspell(aspout)
mispelled
wrds:1:15
You can see all the generic methods this way:
> methods(class=”default”)
…
Writing S3 Classes
S3 classes have a rather cobbledtogether structure. A class instance is created by forming a list, with the components of the list being the member variables of the class. (Readers who know Perl may recognize this ad hoc nature in Perl’s own OOP system.) The “class” attribute is set by hand by using the attr() or class() function, and then various implementations of generic functions are defined. We can see this in the case of lm() by inspecting the function:
> lm
…
z < list(coefficients = if (is.matrix(y))
matrix(,0,3) else numeric(0L), residuals = y,
fitted.values = 0 _{*} y, weights = w, rank = 0L,
df.residual = if (is.matrix(y)) nrow(y) else length(y))
}
…
class(z) < c(if(is.matrix(y)) “mlm”, “lm”)
…
Again, don’t mind the details; the basic process is there. A list was created and assigned to z, which will serve as the framework for the “lm” class instance (and which will eventually be the value returned by the function). Some components of that list, such as residuals, were already assigned when the list was created. In addition, the class attribute was set to “lm” (and possibly to “mlm”, as will be explained in the next section). As an example of how to write an S3 class, let’s switch to something simpler. Continuing our employee example from Section 4.1, we could write this:
> j < list(name=”Joe”, salary=55000, union=T)
> class(j) < “employee”
> attributes(j) # let’s check
$names
[1] “name” “salary” “union”
$class
[1] “employee”
Before we write a print method for this class, let’s see what happens when we call the default print():
> j
$name
[1] “Joe”
$salary
[1] 55000
$union
[1] TRUE
attr(,”class”)
[1] “employee”
Essentially, j was treated as a list for printing purposes. Now let’s write our own print method:
print.employee < function(wrkr) {
cat(wrkr$name,”\n”)
cat(“salary”,wrkr$salary,”\n”)
cat(“union member”,wrkr$union,”\n”)
}
So, any call to print() on an object of class “employee” should now be referred to print.employee(). We can check that formally:
> methods(,”employee”)
[1] print.employee
Or, of course, we can simply try it out:
>j
Joe
salary 55000
union member TRUE
Using Inheritance
The idea of inheritance is to form new classes as specialized versions of old ones. In our previous employee example, for instance, we could form a new class devoted to hourly employees, “hrlyemployee”, as a subclass of “employee”, as follows:
k < list(name=”Kate”, salary= 68000, union=F, hrsthismonth= 2)
class(k) < c(“hrlyemployee”,”employee”)
Our new class has one extra variable: hrsthismonth. The name of the new class consists of two character strings, representing the new class and the old class. Our new class inherits the methods of the old one. For instance, print.employee() still works on the new class:
>k
Kate
salary 68000
union member FALSE
Given the goals of inheritance, that is not surprising. However, it’s important to understand exactly what transpired here. Once again, simply typing k resulted in the call print(k). In turn, that caused UseMethod() to search for a print method on the first of k’s two class names, “hrlyemployee”. That search failed, so UseMethod() tried the other class name, “employee”, and found print.employee(). It executed the latter. Recall that in inspecting the code for “lm”, you saw this line:
class(z) < c(if(is.matrix(y)) “mlm”, “lm”)
You can now see that “mlm” is a subclass of “lm” for vectorvalued response variables.
Extended Example: A Class for Storing UpperTriangular Matrices
Now it’s time for a more involved example, in which we will write an R class “ut” for uppertriangular matrices. These are square matrices whose elements below the diagonal are zeros, such as shown in Equation 9.1.
Our motivation here is to save storage space (though at the expense of a little extra access time) by storing only the nonzero portion of the matrix.
NOTE The R class “dist” also uses such storage, though in a more focused context and without the class functions we have here.
The component mat of this class will store the matrix. As mentioned, to save on storage space, only the diagonal and abovediagonal elements will be stored, in columnmajor order. Storage for the matrix (9.1), for instance, consists of the vector (1,5,6,12,9,2), and the component mat has that value. We will include a component ix in this class, to show where in mat the various columns begin. For the preceding case, ix is c(1,2,4), meaning that column 1 begins at mat[1], column 2 begins at mat[2], and column 3 begins at mat[4]. This allows for handy access to individual elements or columns of the matrix.
The following is the code for our class.
 # class “ut”, compact storage of uppertriangular matrices
 # utility function, returns 1+…+i
 sum1toi < function(i) return(i_{*}(i+1)/2)
 # create an object of class “ut” from the full matrix inmat (0s included)
 ut < function(inmat) {
 n < nrow(inmat)
 rtrn < list() # start to build the object
 class(rtrn) < “ut”
 rtrn$mat < vector(length=sum1toi(n))
 rtrn$ix < sum1toi(0:(n1)) + 1
 for (i in 1:n) {
 # store column i
 ixi < rtrn$ix[i]
 rtrn$mat[ixi:(ixi+i1)] < inmat[1:i,i]
 }
 return(rtrn)
 }
 # uncompress utmat to a full matrix
 expandut < function(utmat) {
 n < length(utmat$ix) # numbers of rows and cols of matrix
 fullmat < matrix(nrow=n,ncol=n)
 for (j in 1:n) {
 # fill jth column
 start < utmat$ix[j]
 fin < start + j – 1
 abovediagj < utmat$mat[start:fin] # abovediag part of col j
 fullmat[,j] < c(abovediagj,rep(0,nj))
 }
 return(fullmat)
 }
 # print matrix
 print.ut < function(utmat)
 print(expandut(utmat))
 # multiply one ut matrix by another, returning another ut instance;
 # implement as a binary operation
 “%mut%” < function(utmat1,utmat2) {
 n < length(utmat1$ix) # numbers of rows and cols of matrix
 utprod < ut(matrix(0,nrow=n,ncol=n))
 for (i in 1:n) { # compute col i of product
 # let a[j] and bj denote columns j of utmat1 and utmat2, respectively,
 # so that, e.g. b2[1] means element 1 of column 2 of utmat2
 # then column i of product is equal to
 # bi[1]*a[1] + … + bi[i]*a[i]
 #find index of start of column i in utmat2
 startbi < utmat2$ix[i]
 # initialize vector that will become bi[1]*a[1] + … + bi[i]*a[i]
 prodcoli < rep(0,i)
 for (j in 1:i) { # find bi[j]*a[j], add to prodcoli
 startaj < utmat1$ix[j]
 bielement < utmat2$mat[startbi+j1]
 prodcoli[1:j] < prodcoli[1:j] +
 bielement * utmat1$mat[startaj:(startaj+j1)]
 }
 # now need to tack on the lower 0s
 startprodcoli < sum1toi(i1)+1
 utprod$mat[startbi:(startbi+i1)] < prodcoli
 }
 return(utprod)
 }
Let’s test it.
> test
function() {
utm1 < ut(rbind(1:2,c(0,2)))
utm2 < ut(rbind(3:2,c(0,1)))
utp < utm1 %mut% utm2
print(utm1)
print(utm2)
print(utp)
utm1 < ut(rbind(1:3,0:2,c(0,0,5)))
utm2 < ut(rbind(4:2,0:2,c(0,0,1)))
utp < utm1 %mut% utm2
print(utm1)
print(utm2)
print(utp)
}
> test()
[,1] [,2]
[1,] 1 2
[2,] 0 2
[,1] [,2]
[1,] 3 2
[2,] 0 1
[,1] [,2]
[1,] 3 4
[2,] 0 2
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 0 1 2
[3,] 0 0 5
[,1] [,2] [,3]
[1,] 4 3 2
[2,] 0 1 2
[3,] 0 0 1
[,1] [,2] [,3]
[1,] 4 5 9
[2,] 0 1 4
[3,] 0 0 5
Throughout the code, we take into account the fact that the matrices involved have a lot of zeros. For example, we avoid multiplying by zeros simply by not adding terms to sums when the terms include a 0 factor. The ut() function is fairly straightforward. This function is a constructor, which is a function whose job it is to create an instance of the given class, eventually returning that instance. So in line 9, we create a list that will serve as the body of the class object, naming it rtrn as a reminder that this will be the class instance to be constructed and returned.
As noted earlier, the main member variables of our class will be mat and idx, implemented as components of the list. Memory for these two components is allocated in lines 11 and 12. The loop that follows then fills in rtrn$mat column by column and assigns rtrn$idx element by element. A slicker way to do this for loop would be to use the rather obscure row() and col() functions. The row() function takes a matrix input and returns a new matrix of the same size, but with each element replaced by its row number. Here’s an example:
> m
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> row(m)
[,1] [,2]
[1,] 1 1
[2,] 2 2
[3,] 3 3
The col() function works similarly. Using this idea, we could replace the for loop in ut() with a oneliner:
rtrn$mat < inmat[row(inmat) <= col(inmat)]
Whenever possible, we should exploit vectorization. Take a look at line 12, for example:
rtrn$ix < sum1toi(0:(n1)) + 1
Since sum1toi() (which we defined on line 4) is based only on the vectorized functions “_{*}“() and “+”(), sum1toi() itself is also vectorized. This allows us to apply sum1toi() to a vector above. Note that we used recycling as well. We want our “ut” class to include some methods, not just variables. To this end, we have included three methods:
 The expandut() function converts from a compressed matrix to an ordinary one. In expandut(), the key lines are 27 and 28, where we use rtrn$ix to determine where in utmat$mat the j^{th} column of our matrix is stored. That data is then copied to the j^{th} column of fullmat in line 30. Note the use of rep() to generate the zeros in the lower portion of this column.
 The print.ut() function is for printing. This function is quick and easy, using expandut(). Recall that any call to print() on an object of type “ut” will be dispatched to print.ut(), as in our test cases earlier.
 The “%mut%”() function is for multiplying two compressed matrices (without uncompressing them). This function starts in line 39. Since this is a binary operation, we take advantage of the fact that R accommodates userdefined binary operations, as described in Section 7.12, and implement our matrixmultiply function as %mut%.
Let’s look at the details of the “%mut%”() function. First, in line 43, we allocate space for the product matrix. Note the use of recycling in an unusual context. The first argument of matrix() is required to be a vector of a length compatible with the number of specified rows and columns, so the 0 we provide is recycled to a vector of length n^{2}. Of course, rep() could be used instead, but exploiting recycling makes for a bit shorter, more elegant code. For both clarity and fast execution, the code here has been written around the fact that R stores matrices in columnmajor order. As mentioned in the comments, our code then makes use of the fact that column i of the product can be expressed as a linear combination of the columns of the first factor. It will help to see a specific example of this property, shown in Equation 9.2.
The comments say that, for instance, column 3 of the product is equal to the following:
Inspection of Equation 9.2 confirms the relation. Couching the multiplication problem in terms of columns of the two input matrices enables us to compact the code and to likely increase the speed. The latter again stems from vectorization. This approach is used in the loop beginning at line 53. (Arguably, in this case, the increase in speed comes at the expense of readability of the code.)
Extended Example: A Procedure for Polynomial Regression
As another example, consider a statistical regression setting with one predictor variable. Since any statistical model is merely an approximation, in principle, you can get better and better models by fitting polynomials of higher and higher degrees. However, at some point, this becomes overfitting, so that the prediction of new, future data actually deteriorates for degrees higher than some value. The class “polyreg” aims to deal with this issue. It fits polynomials of various degrees but assesses fits via crossvalidation to reduce the risk of overfitting. In this form of crossvalidation, known as the leavingoneout method, for each point we fit the regression to all the data except this observation, and then we predict that observation from the fit. An object of this class consists of outputs from the various regression models, plus the original data. The following is the code for the “polyreg” class.
 # “polyreg,” S3 class for polynomial regression in one predictor variable
 # polyfit(y,x,maxdeg) fits all polynomials up to degree maxdeg; y is
 # vector for response variable, x for predictor; creates an object of
 # class “polyreg”
 polyfit < function(y,x,maxdeg) {
 # form powers of predictor variable, ith power in ith column
 pwrs < powers(x,maxdeg) # could use orthog polys for greater accuracy
 lmout < list() # start to build class
 class(lmout) < “polyreg” # create a new class
 for (i in 1:maxdeg) {
 lmo < lm(y ~ pwrs[,1:i])
 # extend the lm class here, with the crossvalidated predictions
 lmo$fitted.cvvalues < lvoneout(y,pwrs[,1:i,drop=F])
 lmout[[i]] < lmo
 }
 lmout$x < x
 lmout$y < y
 return(lmout)
 }
 # print() for an object fits of class “polyreg”: print
 # crossvalidated meansquared prediction errors
 print.polyreg < function(fits) {
 maxdeg < length(fits) – 2
 n < length(fits$y)
 tbl < matrix(nrow=maxdeg,ncol=1)
 colnames(tbl) < “MSPE”
 for (i in 1:maxdeg) {
 fi < fits[[i]]
 errs < fits$y – fi$fitted.cvvalues
 spe < crossprod(errs,errs) # sum of squared prediction errors
 tbl[i,1] < spe/n
 }
 cat(“mean squared prediction errors, by degree\n”)
 print(tbl)
 }
 # forms matrix of powers of the vector x, through degree dg
 powers < function(x,dg) {
 pw < matrix(x,nrow=length(x))
 prod < x
 for (i in 2:dg) {
 prod < prod * x
 pw < cbind(pw,prod)
 }
 return(pw)
 }
 # finds crossvalidated predicted values; could be made much faster via
 # matrixupdate methods
 lvoneout < function(y,xmat) {
 n < length(y)
 predy < vector(length=n)
 for (i in 1:n) {
 # regress, leaving out ith observation
 lmo < lm(y[i] ~ xmat[i,])
 betahat < as.vector(lmo$coef)
 # the 1 accommodates the constant term
 predy[i] < betahat %_{*}% c(1,xmat[i,])
 }
 return(predy)
 }
 # polynomial function of x, coefficients cfs
 poly < function(x,cfs) {
 val < cfs[1]
 prod < 1
 dg < length(cfs) – 1
 for (i in 1:dg) {
 prod < prod _{*} x
 val < val + cfs[i+1] _{*} prod
 }
 }
As you can see, “polyreg” consists of polyfit(), the constructor function, and print.polyreg(), a print function tailored to this class. It also contains several utility functions to evaluate powers and polynomials and to perform crossvalidation. (Note that in some cases here, efficiency has been sacrificed for clarity.) As an example of using the class, we’ll generate some artificial data and create an object of class “polyreg” from it, printing out the results.
> n < 60
> x < (1:n)/n
>y < vector(length=n)
> for (i in 1:n) y[i] < sin((3_{*}pi/2)_{*}x[i]) + x[i]^2 + rnorm(1,mean=0,sd=0.5)
> dg < 15
> (lmo < polyfit(y,x,dg))
mean squared prediction errors, by degree
MSPE
[1,] 0.4200127
[2,] 0.3212241
[3,] 0.2977433
[4,] 0.2998716
[5,] 0.3102032
[6,] 0.3247325
[7,] 0.3120066
[8,] 0.3246087
[9,] 0.3463628
[10,] 0.4502341
[11,] 0.6089814
[12,] 0.4499055
[13,] NA
[14,] NA
[15,] NA
Note first that we used a common R trick in this command:
> (lmo < polyfit(y,x,dg))
By surrounding the entire assignment statement in parentheses, we get the printout and form lmo at the same time, in case we need the latter for other things. The function polyfit() fits polynomial models up through a specified degree, in this case 15, calculating the crossvalidated mean squared prediction error for each model. The last few values in the output were NA, because roundoff error considerations led R to refuse to fit polynomials of degrees that high.
So, how is it all done? The main work is handled by the function polyfit(), which creates an object of class “polyreg”. That object consists mainly of the objects returned by the R regression fitter lm() for each degree. In forming those objects, note line 14:
lmo$fitted.cvvalues < lvoneout(y,pwrs[,1:i,drop=F])
Here, lmo is an object returned by lm(), but we are adding an extra component to it: fitted.cvvalues. Since we can add a new component to a list at any time, and since S3 classes are lists, this is possible. We also have a method for the generic function print(), print.polyreg() in line 24. In Section 12.1.5, we will add a method for the plot() generic function, plot.polyreg(). In computing prediction errors, we used crossvalidation, or the leavingoneout method, in a form that predicts each observation from all the others. To implement this, we take advantage of R’s use of negative subscripts in line 57:
lmo < lm(y[i] ~ xmat[i,])
So, we are fitting the model with the i^{th} observation deleted from our data set.
NOTE As mentioned in the comment in the code, we could make a much faster implementation by using a matrixinverse update method, known as the ShermanMorrisonWoodbury formula. For more information, see J. H. Venter and J. L. J. Snyman, “A Note on the Generalised CrossValidation Criterion in Linear Model Selection,” Biometrika, Vol. 82, no. 1, pp. 215–219.
Classes
Some programmers feel that S3 does not provide the safety normally associated with OOP. For example, consider our earlier employee database example, where our class “employee” had three fields: name, salary, and union. Here are some possible mishaps:
 We forget to enter the union status.
 We misspell union as onion.
 We create an object of some class other than “employee” but accidentally set its class attribute to “employee”.
In each of these cases, R will not complain. The goal of S4 is to elicit a complaint and prevent such accidents. S4 structures are considerably richer than S3 structures, but here we present just the basics. Table 91 shows an overview of the differences between the two classes.
Table 91: Basic R Operators
Operation  S3  S4 

Define class  Implicit in constructor code  setClass() 
Create object  Build list, set class attr  new() 
Create object  Build list, set class attr  new() 
Reference member variable  $  @ 
Implement generic f()  Define f.classname()  setMethod() 
Declare generic  UseMethod()  setGeneric() 
Writing S4 Classes
You define an S4 class by calling setClass(). Continuing our employee example, we could write the following:
> setClass(“employee”,
+ representation(
+ name=”character”,
+ salary=”numeric”,
+ union=”logical”)
+ )
[1] “employee”
This defines a new class, “employee”, with three member variables of the specified types. Now let’s create an instance of this class, for Joe, using new(), a builtin constructor function for S4 classes:
> joe < new(“employee”,name=”Joe”,salary=55000,union=T)
> joe
An object of class “employee”
Slot “name”:
[1] “Joe”
Slot “salary”:
[1] 55000
Slot “union”:
[1] TRUE
Note that the member variables are called slots, referenced via the @ symbol. Here’s an example:
> joe@salary
[1] 55000
We can also use the slot() function, say, as another way to query Joe’s salary:
> slot(joe,”salary”)
[1] 55000
We can assign components similarly. Let’s give Joe a raise:
> joe@salary < 65000
> joe
An object of class “employee”
Slot “name”:
[1] “Joe”
Slot “salary”:
[1] 65000
Slot “union”:
[1] TRUE
Nah, he deserves a bigger raise that that:
> slot(joe,”salary”) < 88000
> joe
An object of class “employee”
Slot “name”:
[1] “Joe”
Slot “salary”:
[1] 88000
Slot “union”:
[1] TRUE
As noted, an advantage of using S4 is safety. To illustrate this, suppose we were to accidentally spell salary as salry, like this:
> joe@salry < 48000
Error in checkSlotAssignment(object, name, value) :
“salry” is not a slot in class “employee”
By contrast, in S3 there would be no error message. S3 classes are just lists, and you are allowed to add a new component (deliberately or not) at any time.
Implementing a Generic Function on an S4 Class
To define an implementation of a generic function on an S4 class, use setMethod(). Let’s do that for our class “employee” here. We’ll implement the show() function, which is the S4 analog of S3’s generic “print”. As you know, in R, when you type the name of a variable while in interactive mode, the value of the variable is printed out:
> joe
An object of class “employee”
Slot “name”:
[1] “Joe”
Slot “salary”:
[1] 88000
Slot “union”:
[1] TRUE
Since joe is an S4 object, the action here is that show() is called. In fact, we would get the same output by typing this:
> show(joe)
Let’s override that, with the following code:
setMethod(“show”, “employee”,
function(object) {
inorout < ifelse(object@union,”is”,”is not”)
cat(object@name,”has a salary of”,object@salary,
“and”,inorout, “in the union”, “\n”)
}
)
The first argument gives the name of the generic function for which we will define a classspecific method, and the second argument gives the class name. We then define the new function. Let’s try it out:
> joe
Joe has a salary of 55000 and is in the union
S3 Versus S4
The type of class to use is the subject of some controversy among R programmers. In essence, your view here will likely depend on your personal choice of which you value more—the convenience of S3 or the safety of S4. John Chambers, the creator of the S language and one of the central developers of R, recommends S4 over S3 in his book Software for Data Analysis (Springer, 2008). He argues that S4 is needed in order to write “clear and reliable software.” On the other hand, he notes that S3 remains quite popular.
Google’s R Style Guide, which you can find at http://googlestyleguide .googlecode.com/svn/trunk/googlerstyle.html, is interesting in this regard. Google comes down squarely on the S3 side, stating “avoid S4 objects and methods when possible.” (Of course, it’s also interesting that Google even has an R style guide in the first place!)
NOTE A nice, concrete comparison of the two methods is given in Thomas Lumley’s “Programmer’s Niche: A Simple Class, in S3 and S4,” R News, April 1, 2004, pp. 33–36.
Managing Your Objects
As a typical R session progresses, you tend to accumulate a large number of objects. Various tools are available to manage them. Here, we’ll look at the following:
 The ls() function
 The rm() function
 The save() function
 Several functions that tell you more about the structure of an object, such as class() and mode()
 The exists() function
Listing Your Objects with the ls() Function
The ls() command will list all of your current objects. A useful named argument for this function is pattern, which enables wildcards. Here, you tell ls() to list only the objects whose names include a specified pattern. The following is an example.
> ls()
[1] “acc” “acc05” “binomci” “cmeans” “divorg” “dv”
[7] “fit” “g” “genxc” “genxnt” “j” “lo”
[13] “out1″ “out1.100” “out1.25” “out1.50” “out1.75” “out2”
[19] “out2.100” “out2.25” “out2.50” “out2.75” “par.set” “prpdf”
[25] “ratbootci” “simonn” “vecprod” “x” “zout” “zout.100”
[31] “zout.125” “zout3” “zout5” “zout.50” “zout.75”
> ls(pattern=”ut”)
[1] “out1” “out1.100” “out1.25” “out1.50” “out1.75” “out2”
[7] “out2.100” “out2.25” “out2.50” “out2.75” “zout” “zout.100”
[13] “zout.125” “zout3” “zout5” “zout.50” “zout.75”
In the second case, we asked for a list of all objects whose names include the string “ut”.
Removing Specific Objects with the rm() Function
To remove objects you no longer need, use rm(). Here’s an example:
> rm(a,b,x,y,z,uuu)
This code removes the six specified objects (a, b, and so on). One of the named arguments of rm() is list, which makes it easier to remove multiple objects. This code assigns all of our objects to list, thus removing everything:
> rm(list = ls())
Using ls()’s pattern argument, this tool becomes even more powerful. Here’s an example:
> ls()
[1] “doexpt” “notebookline” “nreps”
[5] “numnotebooklines” “numrules” “observationpt”
[9] “r” “rad” “radius” “rep”
[13] “s” “s2” “sim” “waits”
[17] “wbar” “x” “y” “z”
> ls(pattern=”notebook”)
[1] “notebookline” “numnotebooklines”
> rm(list=ls(pattern=”notebook”))
> ls()
[1] “doexpt” “nreps” “numcorrectcis” “numrules”
[5] “observationpt” “prop” “r” “rad”
[9] “radius” “rep” “s” “s2”
[13] “sim” “waits” “wbar” “x”
[17] “y” “z”
Here, we found two objects whose names include the string “notebook” and then asked to remove them, which was confirmed by the second call to ls().
NOTE You may find the function browseEnv() helpful. It will show in your web browser your globals (or objects in a different specified environment), with some details on each.
Saving a Collection of Objects with the save() Function
Calling save() on a collection of objects will write them to disk for later retrieval by load(). Here’s a quick example:
> z < rnorm(100000)
> hz < hist(z)
> save(hz,”hzfile”)
> ls()
[1] “hz” “z”
> rm(hz)
> ls()
[1] “z”
> load(“hzfile”)
> ls()
[1] “hz” “z”
> plot(hz) # graph window pops up
Here, we generate some data and then draw a histogram of it. But we also save the output of hist() in a variable, hz. That variable is an object (of class “histogram”, of course). Anticipating that we will want to reuse this object in a later R session, we use the save() function to save the object to the file hzfile. It can be reloaded in that future session via load(). To demonstrate this, we deliberately removed the hz object, then called load() to reload it, and then called ls() to show that it had indeed been reloaded. I once needed to read in a very large data file, each record of which required processing. I then used save() to keep the R object version of the processed file for future R sessions.
“What Is This?”
Developers often need to know the exact structure of the object returned by a library function. If the documentation does not give sufficient details, what can we do? The following R functions may be helpful:
 class(), mode()
 names(), attributes()
 unclass(), str()
 edit()
Let’s go through an example. R includes facilities for constructing contingency tables, which we discussed in Section 6.4. An example in that section involved an election survey in which five respondents are asked whether they intend to vote for candidate X and whether they voted for X in the last election. Here is the resulting table:
> cttab < table(ct)
> cttab
Voted.for.X.Last.Time
Vote.for.X No Yes
No 2 0
NotSure 0 1
Yes 1 1
For instance, two respondents answered no to both questions. The object cttab was returned by the function table and thus is likely of class “table”. A check of the documentation (?table) confirms this. But what is in the class? Let’s explore the structure of that object cttab of class “table”.
> ctu < unclass(cttab)
> ctu
Votes.for.X.Last.Time
Vote.for.X No Yes
No 2 0
NotSure 0 1
Yes 1 1
> class(ctu)
[1] “matrix”
So, the counts portion of the object is a matrix. (If the data had involved three or more questions, rather than just two, this would have been a higherdimensional array.) Note that the names of the dimensions and of the individual rows and columns are there, too; they are associated with the matrix.
The unclass() function is quite useful as a first step. If you simply print an object, you are at the mercy of the version of print() associated with that class, which may in the name of succinctness hide or distort some valuable information. Printing the result of calling unclass() allows you to work around this problem, though there was no difference in this example. (You saw an instance in which it did make a difference in the section about S3 generic functions in Section 9.1.1 earlier.) The function str() serves the same purpose, in a more compact manner.
Note, though, applying unclass() to an object still results in an object with some basic class. Here, cttab had the class “table”, but unclass(cttab) still had the class “matrix”.Let’s try looking at the code for table(), the library function that produced cttab. We could simply type table, but since this is a somewhat longish function, a lot of the function would zoom by on the screen too fast for us to absorb it. We could use page() to solve this problem, but I prefer edit():
> edit(table)
This allows you to browse through the code with your text editor. In doing so, you’ll find this code at the end:
y < array(tabulate(bin, pd), dims, dimnames = dn)
class(y) < “table”
y
Ah, interesting. This shows that table() is, to some extent, a wrapper for another function, tabulate(). But what might be more important here is that the structure of a “table” object is really pretty simple: It consists of an array created from the counts, with the class attribute tacked on. So, it’s essentially just an array. The function names() shows the components in an object, and attributes() gives you this and a bit more, notably the class name.
The exists() Function
The function exists() returns TRUE or FALSE, depending on whether the argument exists. Be sure to put the argument in quotation marks. For example, the following code shows that the acc object exists:
> exists(“acc”)
[1] TRUE
Why would this function be useful? Don’t we always know whether or not we’ve created an object and whether it’s still there? Not necessarily. If you are writing generalpurpose code, say to be made available to the world in R’s CRAN code repository, your code may need to check whether a certain object exists, and if it doesn’t, then your code must create it. For example, as you learned in Section 9.4.3, you can save objects to disk files using save() and then later restore them to R’s memory space by calling load(). You might write generalpurpose code that makes the latter call if the object is not already present, a condition you could check by calling exists().
INPUT/OUTPUT
One of the most underemphasized topics in many university programming courses is input/output (I/O). I/O plays a central role in most realworld applications of computers. Just consider an ATM cash machine, which uses multiple I/O operations for both input—reading your card and reading your typedin cash request—and output—printing instructions on the screen, printing your receipt, and most important, controlling the machine to output your money!
R is not the tool you would choose for running an ATM, but it features a highly versatile array of I/O capabilities, as you will learn in this chapter. We’ll start with the basics of access to the keyboard and monitor, and then go into considerable detail on reading and writing files, including the navigation of file directories. Finally, we discuss R’s facilities for accessing the Internet.
Accessing the Keyboard and Monitor
R provides several functions for accesssing the keyboard and monitor. Here, we’ll look at the scan(), readline(), print(), and cat() functions.
Using the scan() Function
You can use scan() to read in a vector, whether numeric or character, from a file or the keyboard. With a little extra work, you can even read in data to form a list. Suppose we have files named z1.txt, z2.txt, z3.txt, and z4.txt. The z1.txt file contains the following:
123
4 . 5
6
The z2.txt file contents are as follows:
123
4.2 5
6
The z3.txt file contains this:
abc
de f
g
And finally, the z4.txt file has these contents:
abc
123 6
y
Let’s see what we can do with these files using the scan() function.
> scan(“z1.txt”)
Read 4 items
[1] 123 4 5 6
> scan(“z2.txt”)
Read 4 items
[1] 123.0 4.2 5.0 6.0
> scan(“z3.txt”)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected ‘a real’, got ‘abc’
> scan(“z3.txt”,what=””)
Read 4 items
[1] “abc” “de” “f” “g”
> scan(“z4.txt”,what=””)
Read 4 items
[1] “abc” “123” “6” “y”
In the first call, we got a vector of four integers (though the mode is numeric). The second time, since one number was nonintegral, the others were shown as floatingpoint numbers, too. In the third case, we got an error. The scan() function has an optional argument named what, which specifies mode, defaulting to double mode. So, the nonnumeric contents of the file z3 produced an error. But we then tried again, with what=””. This assigns a character string to what, indicating that we want character mode. (We could have set what to any character string.) The last call worked the same way. The first item was a character string, so it treated all the items that followed as strings too. Of course, in typical usage, we would assign the return value of scan() to a variable. Here’s an example:
> v < scan(“z1.txt”)
By default, scan() assumes that the items of the vector are separated by whitespace, which includes blanks, carriage return/line feeds, and horizontal tabs. You can use the optional sep argument for other situations. As example, we can set sep to the newline character to read in each line as a string, as follows:
> x1 < scan(“z3.txt”,what=””)
Read 4 items
> x2 < scan(“z3.txt”,what=””,sep=”\n”)
Read 3 items
> x1
[1] “abc” “de” “f” “g”
> x2
[1] “abc” “de f” “g”
> x1[2]
[1] “de”
> x2[2]
[1] “de f”
In the first case, the strings “de” and “f” were assigned to separate elements of x1. But in the second case, we specified that elements of x2 were to be delineated by endofline characters, not spaces. Since “de” and “f” are on the same line, they are assigned together to x[2]. More sophisticated methods for reading files will be presented later in this chapter, such as methods to read in a file one line at a time. But if you want to read the entire file at once, scan() provides a quick solution. You can use scan() to read from the keyboard by specifying an empty string for the filename:
> v < scan(“”)
1: 12 5 13
4: 3 4 5
7: 8
8:
Read 7 items
>v[1] 12 5 13 3 4 5 8
Note that we are prompted with the index of the next item to be input, and we signal the end of input with an empty line. If you do not wish scan() to announce the number of items it has read, include the argument quiet=TRUE.
Using the readline() Function
If you want to read in a single line from the keyboard, readline() is very handy.
> w < readline()
abc de f
>w
[1] “abc de f”
Typically, readline() is called with its optional prompt, as follows:
> inits < readline(“type your initials: “)
type your initials: NM
> inits
[1] “NM”
Printing to the Screen
At the top level of interactive mode, you can print the value of a variable or expression by simply typing the variable name or expression. This won’t work if you need to print from within the body of a function. In that case, you can use the print() function, like this:
> x < 1:3
> print(x^2)
[1] 1 4 9
Recall that print() is a generic function, so the actual function called will depend on the class of the object that is printed. If, for example, the argument is of class “table”, then the print.table() function will be called. It’s a little better to use cat() instead of print(), as the latter can print only one expression and its output is numbered, which may be a nuisance. Compare the results of the functions:
> print(“abc”)
[1] “abc”
> cat(“abc\n”)
abc
Note that we needed to supply our own endofline character, “\n”, in the call to cat(). Without it, our next call would continue to write to the same line. The arguments to cat() will be printed out with intervening spaces:
> x
[1] 1 2 3
> cat(x,”abc”,”de\n”)
1 2 3 abc de
If you don’t want the spaces, set sep to the empty string “”, as follows:
> cat(x,”abc”,”de\n”,sep=””)
123abcde
Any string can be used for sep. Here, we use the newline character:
> cat(x,”abc”,”de\n”,sep=”\n”)
1
2
3
abc
de
You can even set sep to be a vector of strings, like this:
> x < c(5,12,13,8,88)
> cat(x,sep=c(“.”,”.”,”.”,”\n”,”\n”))
5.12.13.8
88
Reading and Writing Files
Now that we’ve covered the basics of I/O, let’s get to some more practical applications of reading and writing files. The following sections discuss reading data frames or matrices from files, working with text files, accessing files on remote machines, and getting file and directory information.
Reading a Data Frame or Matrix from a File
We discussed the use of the function read.table() to read in a data frame. As a quick review, suppose the file z looks like this:
name age
John 25
Mary 28
Jim 19
The first line contains an optional header, specifying column names. We could read the file this way:
> z < read.table(“z”,header=TRUE)
> z
name age
1 John 25
2 Mary 28
3 Jim 19
Note that scan() would not work here, because our file has a mixture of numeric and character data (and a header). There appears to be no direct way of reading in a matrix from a file, but it can be done easily with other tools. A simple, quick way is to use scan() to read in the matrix row by row. You use the byrow option in the function matrix() to indicate that you are defining the elements of the matrix in a rowwise, rather than columnwise, manner. For instance, say the file x contains a 5by3 matrix, stored rowwise:
1 0 1
1 1 1
1 1 0
1 1 0
0 0 1
We can read it into a matrix this way:
> x < matrix(scan(“x”),nrow=5,byrow=TRUE)
This is fine for quick, onetime operations, but for generality, you can use read.table(), which returns a data frame, and then convert via as.matrix(). Here is a general method:
read.matrix < function(filename) {
as.matrix(read.table(filename))
}
Reading Text Files
In computer literature, there is often a distinction made between text files and binary files. That distinction is somewhat misleading—every file is binary in the sense that it consists of 0s and 1s. Let’s take the term text file to mean a file that consists mainly of ASCII characters or coding for some other human language (such as GB for Chinese) and that uses newline characters to give humans the perception of lines. The latter aspect will turn out to be central here. Nontext files, such as JPEG images or executable program files, are generally called binary files. You can use readLines() to read in a text file, either one line at a time or in a single operation. For example, suppose we have a file z1 with the following contents:
John 25
Mary 28
Jim 19
We can read the file all at once, like this:
> z1 < readLines(“z1”)
> z1
[1] “John 25” “Mary 28” “Jim 19”
Since each line is treated as a string, the return value here is a vector of strings—that is, a vector of character mode. There is one vector element for each line read, thus three elements here. Alternatively, we can read it in one line at a time. For this, we first need to create a connection, as described next.
Introduction to Connections
Connection is R’s term for a fundamental mechanism used in various kinds of I/O operations. Here, it will be used for file access. The connection is created by calling file(), url(), or one of several other R functions. To see a list of those functions, type this:
> ?connection
So, we can now read in the z1 file line by line, as follows:
> c < file(“z1″,”r”)
> readLines(c,n=1)
[1] “John 25”
> readLines(c,n=1)
[1] “Mary 28”
> readLines(c,n=1)
[1] “Jim 19”
>readLines(c,n=1)
character(0)
We opened the connection, assigned the result to c, and then read the file one line at a time, as specified by the argument n=1. When R encountered the end of file (EOF), it returned an empty result. We needed to set up a connection so that R could keep track of our position in the file as we read through it. We can detect EOF in our code:
> c < file(“z”,”r”)
> while(TRUE) {
+ rl < readLines(c,n=1)
+ if (length(rl) == 0) {
+ print(“reached the end”)
+ break
+ } else print(rl)
+ }
[1] “John 25”
[1] “Mary 28”
[1] “Jim 19”
“reached the end”
If we wish to “rewind”—to start again at the beginning of the file—we can use seek():
> c < file(“z1″,”r”)
> readLines(c,n=2)
[1] “John 25” “Mary 28”
> seek(con=c,where=0)
[1] 16
> readLines(c,n=1)
[1] “John 25”
The argument where=0 in our call to seek() means that we wish to position the file pointer zero characters from the start of the file—in other words, directly at the beginning. The call returns 16, meaning that the file pointer was at position 16 before we made the call. That makes sense. The first line consists of “John 25” plus the endofline character, for a total of eight characters, and the same is true for the second line. So, after reading the first two lines, we were at position 16.
You can close a connection by calling—what else?—close(). You would use this to let the system know that the file you have been writing is complete and should now be officially written to disk. As another example, in a client/server relationship over the Internet, a client would use close() to indicate to the server that the client is signing off.
Extended Example: Reading PUMS Census Files
The U.S. Census Bureau makes census data available in the form of Public Use Microdata Samples (PUMS). The term microdata here means that we are dealing with raw data and each record is for a real person, as opposed to statistical summaries. Data on many, many variables are included. The data is organized by household. For each unit, there is first a Household record, describing the various characteristics of that household, followed by one Person record for each person in the household. Character positions 106 and 107 (with numbering starting at 1) in the Household record state the number of Person records for that household. (The number can be very large, since some institutions count as households.)
To enhance the integrity of the data, character position 1 contains H or P to confirm that this is a Household or Person record. So, if you read an H record, and it tells you there are three people in the household, then the following three records should be P records, followed by another H record; if not, you’ve encountered an error. As our test file, we’ll take the first 1,000 records of the year 2000 1 percent sample. The first few records look like this:
H000019510649 06010 99979997 70 631973 15758 59967658436650000012000000 0000000000000 0 0 0
0 0 0 0 0 0 0 0000 0 0 0 0 0 00000000000000000000000000000 00000000000000000000000000 P00001950100010923000420190010110000010147050600206011099999904200000 0040010000 00300280 28600 70 9997 9997202020202020220000040000000000000006000000
00000 00 0000 00000000000000000132241057904MS 47604120311010310
07000049010000000000900100000100000100000100000010000001000139010000490000
H000040710649 06010 99979997 70 631973
15758 599676584365300800200000300106060503010101010102010 01200006000000100001
00600020 0 0 0 0 0000 0 0 0 0 0 02000102010102200000000010750
02321125100004000000040000
P00004070100005301000010380010110000010147030400100009005199901200000 0006010000
00100000 00000 00 0000 0000202020202020220000040000000000000001000060
06010 70 9997 99970101004900100000001018703221 77005110111010500
40004000000000000000000000000000000000000000000000000000004000000040000349
P00004070200005303011010140010110000010147050000204004005199901200000 0006010000
00100000 00000 00 0000 000020202020 0 0200000000000000000000000050000
00000 00 0000 00000000000000000000000000000000000000000000000000000
000 0 0 0 0 0 0 0 0 00000000349
H000061010649 06010 99979997 70 631973
15758 599676584360801190100000200204030502010101010102010 00770004800064000001
1 0 030 0 0 0 0340 00660000000170 0 06010000000004410039601000000
00021100000004940000000000
The records are very wide and thus wrap around. Each one occupies four lines on the page here. We’ll create a function called extractpums() to read in a PUMS file and create a data frame from its Person records. The user specifies the filename and lists fields to extract and names to assign to those fields. We also want to retain the household serial number. This is good to have because data for persons in the same household may be correlated and we may want to add that aspect to our statistical model. Also, the household data may provide important covariates. (In the latter case, we would want to retain the covariate data as well.)
Before looking at the function code, let’s see what the function does. In this data set, gender is in column 23 and age in columns 25 and 26. In the example, our filename is pumsa. The following call creates a data frame consisting of those two variables.
pumsdf < extractpums(“pumsa”,list(Gender=c(23,23),Age=c(25,26)))
Note that we are stating here the names we want the columns to have in the resulting data frame. We can use any names we want—say Sex and Ancientness. Here is the first part of that data frame:
> head(pumsdf)
serno Gender Age
2 195 2 19
3 407 1 38
4 407 1 14
5 610 2 65
6 1609 1 50
7 1609 24 9
The following is the code for the extractpums() function.
# reads in PUMS file pf, extracting the Person records, returning a data
# frame; each row of the output will consist of the Household serial
# number and the fields specified in the list flds; the columns of
# the data frame will have the names of the indices in flds
extractpums < function(pf,flds) {
dtf < data.frame() # data frame to be built
con < file(pf,”r”) # connection
# process the input file
repeat {
hrec < readLines(con,1) # read Household record
if (length(hrec) == 0) break # end of file, leave loop
# get household serial number
serno < intextract(hrec,c(2,8))
# how many Person records?
npr < intextract(hrec,c(106,107))
if (npr > 0)
for (i in 1:npr) {
prec < readLines(con,1) # get Person record
# make this person’s row for the data frame
person < makerow(serno,prec,flds)
# add it to the data frame
dtf < rbind(dtf,person)
} }
return(dtf)
}
# set up this person’s row for the data frame
makerow < function(srn,pr,fl) {
l < list()
l[[“serno”]] < srn
for (nm in names(fl)) {
l[[nm]] < intextract(pr,fl[[nm]])
}
return(l) }
# extracts an integer field in the string s, in character positions
# rng[1] through rng[2]
intextract < function(s,rng) {
fld < substr(s,rng[1],rng[2])
return(as.integer(fld))
}
Let’s see how this works. At the beginning of extractpums(), we create an empty data frame and set up the connection for the PUMS file read.
dtf < data.frame() # data frame to be built
con < file(pf,”r”) # connection
The main body of the code then consists of a repeat loop.
repeat {
hrec < readLines(con,1) # read Household record
if (length(hrec) == 0) break # end of file, leave loop
# get household serial number
serno < intextract(hrec,c(2,8))
# how many Person records?
npr < intextract(hrec,c(106,107))
if (npr > 0)
for (i in 1:npr) {
…
}
}
This loop iterates until the end of the input file is reached. The latter condition will be sensed by encountering a zerolength Household record, as seen in the preceding code. Within the repeat loop, we alternate reading a Household record and reading the associated Person records. The number of Person records for the current Household record is extracted from columns 106 and 107 of that record, storing this number in npr. That extraction is done by a call to our function intextract(). The for loop then reads in the Person records one by one, in each case forming the desired row for the output data frame and then attaching it to the latter via rbind():
for (i in 1:npr) {
prec < readLines(con,1) # get Person record
# make this person’s row for the data frame
person < makerow(serno,prec,flds)
# add it to the data frame
dtf < rbind(dtf,person)
}
Note how makerow() creates the row to be added for a given person. Here the formal arguments are srn for the household serial number, pr for the given Person record, and fl for the list of variable names and column fields.
makerow < function(srn,pr,fl) {
l < list()
l[[“serno”]] < srn
for (nm in names(fl)) {
l[[nm]] < intextract(pr,fl[[nm]])
}
return(l)
}
For instance, consider our sample call:
pumsdf < extractpums(“pumsa”,list(Gender=c(23,23),Age=c(25,26)))
When makerow() executes, fl will be a list with two elements, named Gender and Age. The string pr, the current Person record, will have Gender in column 23 and Age in columns 25 and 26. We call intextract() to pull out the desired numbers. The intextract() function itself is a straightforward conversion of characters to numbers, such as converting the string “12” to the number 12. Note that, if not for the presence of Household records, we could do all of this much more easily with a handy builtin R function: read.fwf(). The name of this function is an abbreviation for “read fixedwidth formatted,” alluding to the fact that each variable is stored in given character positions of a record. In essence, this function alleviates the need to write a function like intextract().
Accessing Files on Remote Machines via URLs
Certain I/O functions, such as read.table() and scan(), accept web URLs as arguments. (Check R’s online help facility to see if your favorite function allows this.) As an example, we’ll read some data from the University of California, Irvine archive at http://archive.ics.uci.edu/ml/datasets.html, using the Echocardiogram data set. After navigating the links, we find the location of that file and then read it from R, as follows:
> uci < “http://archive.ics.uci.edu/ml/machinelearningdatabases/”
> uci < paste(uci,”echocardiogram/echocardiogram.data”,sep=””)
> ecc < read.csv(uci)
(We’ve built up the URL in stages here to fit the page.)Let’s take a look at what we downloaded:
We could then do our analyses. For example, the third column is age, so we could find its mean or perform other calculations on that data. See the echocardiogram.names page at http://archive.ics.uci.edu/ml/machinelearningdatabases/echocardiogram/echocardiogram.names for descriptions of all of the variables.
Writing to a File
Given the statistical basis of R, file reads are probably much more common than writes. But writes are sometimes necessary, and this section will present methods for writing to files. The function write.table() works very much like read.table(), except that it writes a data frame instead of reading one. For instance, let’s take the little Jack and Jill example :
> kids < c(“Jack”,”Jill”)
> ages < c(12,10)
> d < data.frame(kids,ages,stringsAsFactors=FALSE)
> d
kids ages
1 Jack 12
2 Jill 10
> write.table(d,”kds”)
The file kds will now have these contents:
“kids” “ages”
“1” “Jack” 12
“2” “Jill” 10
In the case of writing a matrix to a file, just state that you do not want row or column names, as follows:
> write.table(xc,”xcnew”,row.names=FALSE,col.names=FALSE)
The function cat() can also be used to write to a file, one part at a time. Here’s an example:
> cat(“abc\n”,file=”u”)
> cat(“de\n”,file=”u”,append=TRUE)
The first call to cat() creates the file u, consisting of one line with contents “abc”. The second call appends a second line. Unlike the case of using the writeLines() function (which we’ll discuss next), the file is automatically saved after each operation. For instance, after the previous calls, the file will look like this:
abc
de
You can write multiple fields as well. So:
> cat(file=”v”,1,2,”xyz\n”)
would produce a file v consisting of a single line:
1 2 xyz
You can also use writeLines(), the counterpart of readLines(). If you use a connection, you must specify “w” to indicate you are writing to the file, not reading from it:
> c < file(“www”,”w”)
> writeLines(c(“abc”,”de”,”f”),c)
> close(c)
The file www will be created with these contents:
abc
de
f
Note the need to proactively close the file.
Getting File and Directory Information
R has a variety of functions for getting information about directories and files, setting file access permissions, and the like. The following are a few examples:
 file.info(): Gives file size, creation time, directoryversusordinary file status, and so on for each file whose name is in the argument, a character vector.
 dir(): Returns a character vector listing the names of all the files in the directory specified in its first argument. If the optional argument recursive=TRUE is specified, the result will show the entire directory tree rooted at the first argument.
 file.exists(): Returns a Boolean vector indicating whether the given file exists for each name in the first argument, a character vector.
 getwd() and setwd(): Used to determine or change the current working directory.
To see all the file and directoryrelated functions, type the following:
> ?files
Some of these options will be demonstrated in the next example.
Extended Example: Sum the Contents of Many Files
Here, we’ll develop a function to find the sum of the contents (assumed numeric) in all files in a directory tree. In our example, a directory dir1 contains the files filea and fileb, as well as a subdirectory dir2, which holds the file filec. The contents of the files are as follows:
 filea: 5, 12, 13
 fileb: 3, 4, 5
 filec: 24, 25, 7
If dir1 is in our current directory, the call sumtree(“dir1”) will yield the sum of those nine numbers, 98. Otherwise, we need to specify the full pathname of dir1, such as sumtree(“/home/nm/dir1”). Here is the code:
sumtree < function(drtr) {
tot < 0
# get names of all files in the tree
fls < dir(drtr,recursive=TRUE)
for (f in fls) {
# is f a directory?
f < file.path(drtr,f)
if (!file.info(f)$isdir) {
tot < tot + sum(scan(f,quiet=TRUE))
}
}
return(tot)
}
Note that this problem is a natural for recursion. But here, R has done the recursion for us by allowing it as an option in dir(). Thus, in line 4, we set recursive=TRUE in order to find the files throughout the various levels of the directory tree. To call file.info(), we need to account for the fact that the current filename f is relative to drtr, so our file filea would be referred to as dir1/filea. In order to form that pathname, we need to concatenate drtr, a slash, and filea. We could use the R string concatenation function paste() for this, but we would need a separate case for Windows, which uses a backslash instead of a slash. But file.path() does all that for us.
Some commentary pertaining to line 8 is in order. The function file.info() returns information about f as a data frame, one of whose columns is isdir, with one row for each file and with row names being the filenames. That column consists of Boolean values indicating whether each file is a directory. In line 8, then, we can detect whether the current file f is a directory. If f is an ordinary file, we go ahead and add its contents to our running total.
Accessing the Internet
R’s socket facilities give the programmer access to the Internet’s TCP/IP protocol. For readers who are not familiar with this protocol, we begin with an overview of TCP/IP.
Overview of TCP/IP
TCP/IP is quite complex, so the overview here will be something of an oversimplification, but we’ll cover enough for you to understand what R’s socket functions are doing.For our purposes here, the term network refers to a set of computers connected together locally, without going through the Internet. This typically consists of all the computers in a home, all the computers in a smaller business, and so on. The physical medium between them is usually an Ethernet connection of some form.
The Internet, as its name implies, connects networks. A network in the Internet is connected to one or more other networks via routers, which are specialpurpose computers that connect two or more networks together. Every computer on the Internet has an Internet Protocol (IP) address. This is numeric, but it can be stated in characters, as in www.google.com, which is then translated into the numeric address by the Domain Name Service.
However, the IP address is not enough. When A sends a message to B, there may be several applications at computer B that are receiving Internet messages, such as web browsing, email service, and so on. How does the operating system at B know to which of these to send the message from A? The answer is that A will specify a port number in addition to the IP address. The port number indicates which program running at B is intended as the recipient. And A will also have a port number so that the response from B reaches the correct application at A.
When A wishes to send something to B, it writes to a software entity called a socket, using a system call syntactically similar to the one for writing to a file. In the call, A specifies B’s IP address and the port number to which A wishes to send a message. B has a socket, too, and it writes its responses to A in that socket. We say there is a connection between A and B via those sockets, but that doesn’t mean anything physical—it’s just an agreement between A and B to exchange data. Applications follow a client/server model. Say a web server is running at B, at the standard port for the Web, port 80. The server at B is listening at port 80. Again, this term should not be taken literally; it just means that the server program has made a function call that notifies the operating system that the server program is willing to have connections at port 80. When network node A requests such a connection, the function call at the server returns, and the connection is set up. If you are a nonprivileged user and write some kind of server program— say in R!—you must assign a port number above 1024.
NOTE If a server program is taken down or crashes, there may be a few seconds’ delay before the same port is reusable again.
Sockets in R
A very important point to keep in mind is that all the bytes sent by A to B during the time the connection between them exists are collectively considered one big message. Say A sends one line of text of 8 characters and then another of 20 characters. From A’s point of view, that’s two lines, but to TCP/IP, it’s just 28 characters of a yet incomplete message. Splitting that long message back into lines can take a bit of doing. R provides various functions for this purpose, including the following:
 readLines() and writeLines(): These allow you to program as if TCP/IP were sending messages line by line, even though this is not actually the case. If your application is naturally viewed in terms of lines, these two functions can be quite handy.
 serialize() and unserialize(): You can use these to send R objects, such as a matrix or the complex output of a call to a statistical function. The object is converted to character string form by the sender and then converted back to the original object form at the receiver.
 readBin() and writeBin(): These are for sending data in binary form. (Recall the comment on terminology)
Each of these functions operates on R connections, as you’ll see in the next example. It’s important to choose the right function for each job. If you have a long vector, for example, using serialize() and unserialize() may be more convenient but far more timeconsuming. This is not only because numbers must be converted to and from their character representations but also because the character representation is typically much longer, which means greater transmission time. Here are two other R socket functions:
 socketConnection(): This establishes an R connection via sockets. You specify the port number in the argument port, and state whether a server or client is to be created, by setting the argument server to TRUE or FALSE, respectively. In the client case, you must also supply the server’s IP address in the argument host.
 socketSelect(): This is useful when a server is connected to multiple clients. Its main argument, socklist, is a list of connections, and its return value is the sublist of connections that have data ready for the server to read.
Extended Example: Implementing Parallel R
Some statistical analyses have very long runtimes, so there naturally has been quite a bit of interest in “parallel R,” in which several R processes cooperate on a given task. Another possible reason to “go parallel” is memory limitations. If one machine does not have enough memory for the task at hand, it may help to pool the memories of several machines in some way.Sockets play a key role in many parallel R packages. The cooperating R processes could be either on the same machine or on separate machines. In the latter case (and even the former), a natural approach to implementing parallelism is to use R sockets. This is one of the choices in the snow package and in my Rdsm package (both available on CRAN, R’s code repository), as follows:
 In snow, the server sends out work tasks to the clients. The clients perform their tasks and send the results back to the server, which assembles them into the final result. Communication is done with serialize() and unserialize(), and the server uses socketSelect() to determine which client results are ready.
 Rdsm implements a virtual sharedmemory paradigm, and the server is used to store the shared variables. The clients contact the server whenever they need to read or write a shared variable. To optimize speed, communication between server and clients is done with readBin() and writebin(), instead of serialize() and unserialize().
Let’s look at some of the socketrelated details of Rdsm. First, here is the server code in which connections with the clients are set up, storing them in a list cons (there are ncon clients):
# set up socket connections with clients
#
cons << vector(mode=”list”,length=ncon) # list of connections
# prevent connection from dying during debug or long compute spell
options(“timeout”=10000)
for (i in 1:ncon) {
cons[[i]] <<
socketConnection(port=port,server=TRUE,blocking=TRUE,open=”a+b”)
# wait to hear from client i
checkin < unserialize(cons[[i]])
}
# send ACKs
for (i in 1:ncon) {
send the client its ID number, and the group size
serialize(c(i,ncon),cons[[i]])
}
Since the client messages and server acknowledgments are short messages, serialize() and unserialize() are good enough for the purpose here. The first part of the main loop of the server finds a ready client and reads from it.
repeat {
# any clients still there?
if (remainingclients == 0) break
# wait for service request, then read it
# find all the pending client requests
rdy < which(socketSelect(cons))
# choose one
j < sample(1:length(rdy),1)
con < cons[[rdy[j]]]
# read client request
req < unserialize(con)
Again serialize() and unserialize() are good enough here to read the short message from the client indicating what kind of operation—typically reading a shared variable or writing one—it’s requesting. But the reads and writes of the shared variables themselves use the faster readBin() and writeBin() functions. Here’s the write part:
# write data dt, of mode md (integer of double), to connection cn
binwrite < function(dt,md,cn) {
writeBin(dt,con=cn)
And here’s the read part:
# read sz elements of mode md (integer of double) from connection cn
binread < function(cn,md,sz) {
return(readBin(con=cn,what=md,n=sz))
On the client side, the connection setup code is as follows:
options(“timeout”=10000)
# connect to server
con < socketConnection(host=host,port=port,blocking=TRUE,open=”a+b”)
serialize(list(req=”checking in”),con)
# receive this client’s ID and total number of clients from server 6 myidandnclnt < unserialize(con)
myinfo <<
list(con=con,myid=myidandnclnt[1],nclnt=myidandnclnt[2])
The code for reading from and writing to the server is similar to the preceding server examples.
STRING MANIPULATION
Although R is a statistical language with numeric vectors and matrices playing a central role, character strings are surprisingly important as well. Ranging from birth dates stored in medical research data files to textmining applications, character data arises quite frequently in R programs. Accordingly, R has a number of stringmanipulation utilities, many of which will be introduced in this chapter.
An Overview of StringManipulation Functions
Here, we’ll briefly review just some of the many stringmanipulation functions R has to offer. Note that the call forms shown in this introduction are very simple, usually omitting many optional arguments. We’ll use some of those arguments in our extended examples later in the chapter, but do check R’s online help for further details.
grep()
The call grep(pattern,x) searches for a specified substring pattern in a vector x of strings. If x has n elements—that is, it contains n strings—then grep(pattern,x) will return a vector of length up to n. Each element of this vector will be the index in x at which a match of pattern as a substring of x[i]) was found.
Here’s an example of using grep:
> grep(“Pole”,c(“Equator”,”North Pole”,”South Pole”))
[1] 2 3
> grep(“pole”,c(“Equator”,”North Pole”,”South Pole”))
integer(0)
In the first case, the string “Pole” was found in elements 2 and 3 of the second argument, hence the output (2,3). In the second case, string “pole” was not found anywhere, so an empty vector was returned.
nchar()
The call nchar(x) finds the length of a string x. Here’s an example:
> nchar(“South Pole”)
[1] 10
The string “South Pole” was found to have 10 characters. C programmers, take note: There is no NULL character terminating R strings. Also note that the results of nchar() will be unpredictable if x is not in character mode. For instance, nchar(NA) turns out to be 2, and nchar(factor(“abc”)) is 1. For more consistent results on nonstring objects, use Hadley Wickham’s stringr package on CRAN.
paste()
The call paste(…) concatenates several strings, returning the result in one long string. Here are some examples:
> paste(“North”,”Pole”)
[1] “North Pole”
> paste(“North”,”Pole”,sep=””)
[1] “NorthPole”
> paste(“North”,”Pole”,sep=”.”)
[1] “North.Pole”
> paste(“North”,”and”,”South”,”Poles”)
[1] “North and South Poles”
As you can see, the optional argument sep can be used to put something other than a space between the pieces being spliced together. If you specify sep as an empty string, the pieces won’t have any character between them.
sprintf()
The call sprintf(…) assembles a string from parts in a formatted manner.
Here’s a simple example:
> i < 8
> s < sprintf(“the square of %d is %d”,i,i^2)
> s
[1] “the square of 8 is 64”
The name of the function is intended to evoke string print for “printing” to a string rather than to the screen. Here, we are printing to the string s. What are we printing? The function says to first print “the square of” and then print the decimal value of i. (The term decimal here means in the base10 number system, not that there will be a decimal point in the result.) The result is the string “the square of 8 is 64.”
substr()
The call substr(x,start,stop) returns the substring in the given character position range start:stop in the given string x. Here’s an example:
> substring(“Equator”,3,5)
[1] “uat”
strsplit()
The call strsplit(x,split) splits a string x into an R list of substrings based on another string split in x. Here’s an example:
> strsplit(“6162011″,split=”“)
[[1]]
[1] “6””16″ “2011”
regexpr()
The call regexpr(pattern,text) finds the character position of the first instance of pattern within text, as in this example:
> regexpr(“uat”,”Equator”)
[1] 3
This reports that “uat” did indeed appear in “Equator,” starting at character position 3.
gregexpr()
The call gregexpr(pattern,text) is the same as regexpr(), but it finds all instances of pattern. Here’s an example:
> gregexpr(“iss”,”Mississippi”)
[[1]]
[1] 2 5
This finds that “iss” appears twice in “Mississippi,” starting at character positions 2 and 5.
Regular Expressions
When dealing with stringmanipulation functions in programming languages, the notion of regular expressions sometimes arises. In R, you must pay attention to this point when using the string functions grep(), grepl(), regexpr(), gregexpr(), sub(), gsub(), and strsplit(). A regular expression is a kind of wild card. It’s shorthand to specify broad classes of strings. For example, the expression “[au]” refers to any string that contains either of the letters a or u. You could use it like this:
> grep(“[au]”,c(“Equator”,”North Pole”,”South Pole”))
[1] 1 3
This reports that elements 1 and 3 of (“Equator”,”North Pole”,”South Pole”)—that is, “Equator” and “South Pole”—contain either an a or a u. A period (.) represents any single character. Here’s an example of using it:
> grep(“o.e”,c(“Equator”,”North Pole”,”South Pole”))
[1] 2 3
This searches for threecharacter strings in which an o is followed by any single character, which is in turn followed by an e. Here is an example of the use of two periods to represent any pair of characters:
> grep(“N..t”,c(“Equator”,”North Pole”,”South Pole”))
[1] 2
Here, we searched for fourletter strings consisting of an N, followed by any pair of characters, followed by a t. A period is an example of a metacharacter, which is a character that is not to be taken literally. For example, if a period appears in the first argument of grep(), it doesn’t actually mean a period; it means any character. But what if you want to search for a period using grep()? Here’s the naive approach:
> grep(“.”,c(“abc”,”de”,”f.g”))
[1] 1 2 3
The result should have been 3, not (1,2,3). This call failed because periods are metacharacters. You need to escape the metacharacter nature of the period, which is done via a backslash:
> grep(“\\.”,c(“abc”,”de”,”f.g”))
[1] 3
Now, didn’t we say a backslash? Then why are there two? Well, the sad truth is that the backslash itself must be escaped, which is accomplished by its own backslash! This goes to show how arcanely complex regular expressions can become. Indeed, a number of books have been written on the subject of regular expressions (for various programming languages). As a start in learning about the topic, refer to R’s online help (type ?regex).
Extended Example: Testing a Filename for a Given Suffix
Suppose we wish to test for a specified suffix in a filename. We might, for instance, want to find all HTML files (those with suffix .html, .htm, and so on). Here is code for that:
testsuffix < function(fn,suff) {
parts < strsplit(fn,”.”,fixed=TRUE)
nparts < length(parts[[1]])
return(parts[[1]][nparts] == suff)
}
Let’s test it.
> testsuffix(“x.abc”,”abc”)
[1] TRUE
> testsuffix(“x.abc”,”ac”)
[1] FALSE
> testsuffix(“x.y.abc”,”ac”)
[1] FALSE
> testsuffix(“x.y.abc”,”abc”)
[1] TRUE
How does the function work? First note that the call to strsplit() on line 2 returns a list consisting of one element (because fn is a oneelement vector)—a vector of strings. For example, calling testsuffix(“x.y.abc”,”abc”) will result in parts being a list consisting of a threeelement vector with elements x, y, and abc. We then pick up the last element and compare it to suff.
A key aspect is the argument fixed=TRUE. Without it, the splitting argu ment . (called split in the list of strsplit()’s formal arguments) would have been treated as a regular expression. Without setting fixed=TRUE, strsplit() would have just separated all the letters.
Of course, we could also escape the period, as follows:
testsuffix < function(fn,suff) {
parts < strsplit(fn,”\\.”)
nparts < length(parts[[1]])
return(parts[[1]][nparts] == suff)
}
Let’s check to see if it still works.
> testsuffix(“x.y.abc”,”abc”)
[1] TRUE
Here’s another way to do the suffixtest code that’s a bit more involved but a good illustration:
testsuffix < function(fn,suff) {
ncf < nchar(fn) # nchar() gives the string length
# determine where the period would start if suff is the suffix in fn
dotpos < ncf – nchar(suff) + 1
# now check that suff is there
return(substr(fn,dotpos,ncf)==suff)
}
Let’s look at the call to substr() here, again with fn = “x.ac” and
suff = “abc”. In this case, dotpos will be 1, which means there should be a period at the first character in fn if there is an abc suffix. The call to substr() then becomes substr(“x.ac”,1,4), which extracts the substring in character positions 1 through 4 of x.ac. That substring will be x.ac, which is not abc, so the filename’s suffix is found not to be the latter.
Extended Example: Forming Filenames
Suppose we want to create five files, q1.pdf through q5.pdf, consisting of histograms of 100 random N(0,i2) variates. We could execute the follow ing code:
for(iin1:5){
fname < paste(“q”,i,”.pdf”)
pdf(fname)
hist(rnorm(100,sd=i))
dev.off()
}
The main point in this example is the string manipulation we use to create the filename fname. For more details about the graphics operations used in this example.
The paste() function concatenates the string “q” with the string form of the number i. For example, when i = 2, the variable fname will be q 2 .pdf. However, that isn’t quite what we want. On Linux systems, filenames with embedded spaces create headaches, so we want to remove the spaces. One solution is to use the sep argument, specifying an empty string for the separator, as follows:
for (i in 1:5) {
fname < paste(“q”,i,”.pdf”,sep=””)
pdf(fname)
hist(rnorm(100,sd=i))
dev.off()
}
Another approach is to employ the sprintf() function, borrowed from C:
for (i in 1:5) {
fname < sprintf(“q%d.pdf”,i)
pdf(fname)
hist(rnorm(100,sd=i))
dev.off()
}
For floatingpoint quantities, note also the difference between %f and %g formats:
> sprintf(“abc%fdef”,1.5)
[1] “abc1.500000def”
> sprintf(“abc%gdef”,1.5)
[1] “abc1.5def”
The %g format eliminated the superfluous zeros.
Use of String Utilities in the edtdbg Debugging Tool
The internal code of the edtdbg debugging tool, which will be discussed later, makes heavy use of string utilities. A typical example of such usage is the dgbsendeditcmd() function:
# send command to editor
dbgsendeditcmd < function(cmd) {
syscmd < paste(“vim –remotesend “,cmd,” –servername “,vimserver,sep=””)
system(syscmd)
}
What is going on here? The main point is that edtdbg sends remote commands to the Vim text editor. For instance, if you are running Vim with a server name of 168 and you want the cursor in Vim to move to line 12, you could type this into a terminal (shell) window:
vim –remotesend 12G –servername 168
The effect would be the same as if you had physically typed 12G at the Vim window. Since 12G is the Vim command to move the cursor to line 12, that’s what would occur. Consider this call:
paste(“vim –remotesend “,cmd,” –servername “,vimserver,sep=””)
Here, cmd is the string “12G”, vimserver is 168, and paste() concatenates all the indicated strings. The argument sep=”” says to use the empty string as separator in this concatenation—that is, no separation. Thus, paste() returns the following:
vim –remotesend 12G –servername 168
Another core element in the operation of edtdbg is that the program has arranged, via a call to R’s sink() function, to record to the file dbgsink most output from R’s debugger in your R window. (The edtdbg utility works in concert with that debugger.) That information includes the line numbers of your positions in your source file as you step through it using R’s debugger. The line position information in the debugger output looks like this:
debug at cities.r#16: {
So, there is code in edtdbg to determine the latest line in dbgsink that begins with “debug at.” That line is then placed, as a string, in a variable named debugline. The following code then extracts the line number (16 in the example) and the source filename/Vim buffer name (cities.r here):
linenumstart < regexpr(“#”,debugline) + 1
buffname < substr(debugline,10,linenumstart2)
colon < regexpr(“:”,debugline)
linenum < substr(debugline,linenumstart,colon1)
The call to regexpr() determines where in debugline the # character is located (character 18 in this example). Adding 1 to that gives the position of the line number within debugline. To get the buffer name, using the preceding example as a guide, we see that the name comes after debug at and ends just before the #. Since “debug at” contains nine characters, the buffer name will start at position 10—hence the 10 in the call,
substr(debugline,10,linenumstart2)
The end of the buffer name field is at linenumstart2, as it is just before the #, which precedes the start of the line number. The line number computation is then similar. Another illustrative example of edtdbg’s internal code is its use of the strsplit() function. For example, at one point, it prints out a prompt to the user:
kbdin < readline(prompt=”enter number(s) of fns you wish to toggle dbg: “)
As you can see, the user’s response is stored in kbdin. It will consist of a set of numbers separated by spaces, such as this:
1 4 5
We need to extract the numbers from the string 1 4 5 into an integer vector. This is done first via strsplit(), which produces three strings: “1”, “4”, and “5”. Then we call as.integer() to convert from characters to numbers:
tognums < as.integer(strsplit(kbdin,split=” “)[[1]])
Note that the output of strsplit() is an R list, in this case consisting of one element, which is in turn the vector (“1″,”4″,”5”). This leads to the expression [[1]] in the example.