R PROGRAMMING CHAPTER 3
MATRICES AND ARRAYS
A matrix is a vector with two additional attributes: the number of rows and the number of columns. Since matrices are vectors, they also have modes, such as numeric and character. (On the other hand, vectors are not one-column or one-row matrices.) Matrices are special cases of a more general R type of object: arrays. Arrays can be multidimensional. For example, a three-dimensional array would consist of rows, columns, and layers, not just rows and columns as in the matrix case. Most of this chapter will concern matrices, but we will briefly discuss higher-dimensional arrays in the final section. Much of R’s power comes from the various operations you can perform on matrices. We’ll cover these operations in this chapter, especially those analogous to vector subsetting and vectorization.
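As a quick illustration of this vector-plus-attributes view, attaching a dim attribute to an ordinary vector turns it into a matrix (a minimal sketch to try in an R session):

```r
# a plain vector becomes a matrix once it carries a dim attribute
v <- c(1,2,3,4)
dim(v) <- c(2,2)  # v is now a 2x2 matrix
v[1,2]            # 3, since storage is column-major
```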
Creating Matrices
Matrix row and column subscripts begin with 1. For example, the upper-left corner of the matrix a is denoted a[1,1]. The internal storage of a matrix is in column-major order, meaning that first all of column 1 is stored, then all of column 2, and so on. One way to create a matrix is by using the matrix() function:
> y <- matrix(c(1,2,3,4),nrow=2,ncol=2)
> y
[,1] [,2]
[1,] 1 3
[2,] 2 4
Here, we concatenate what we intend as the first column, the numbers 1 and 2, with what we intend as the second column, 3 and 4. So, our data is (1,2,3,4). Next, we specify the number of rows and columns. The fact that R uses column-major order then determines where these four numbers are put within the matrix. Since we specified the matrix entries in the preceding example, and there were four of them, we did not need to specify both ncol and nrow; just nrow or ncol would have been enough. Having four elements in all, in two rows, implies two columns:
> y <- matrix(c(1,2,3,4),nrow=2)
> y
[,1] [,2]
[1,] 1 3
[2,] 2 4
Note that when we then print out y, R shows us its notation for rows and columns. For instance, [,2] means the entirety of column 2, as can be seen in this check:
> y[,2]
[1] 3 4
Another way to build y is to specify elements individually:
> y <- matrix(nrow=2,ncol=2)
> y[1,1] <- 1
> y[2,1] <- 2
> y[1,2] <- 3
> y[2,2] <- 4
> y
[,1] [,2]
[1,] 1 3
[2,] 2 4
Note that we do need to warn R ahead of time that y will be a matrix and give the number of rows and columns. Though internal storage of a matrix is in column-major order, you can set the byrow argument in matrix() to TRUE to indicate that the data is coming in row-major order. Here's an example of using byrow:
> m <- matrix(c(1,2,3,4,5,6),nrow=2,byrow=T)
> m
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
Note that the matrix is still stored in column-major order. The byrow argument enabled only our input to come in row-major form. This may be more convenient if you are reading from a data file organized that way, for example.
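We can verify this by indexing m with a single subscript, which treats it as the underlying vector:

```r
m <- matrix(c(1,2,3,4,5,6),nrow=2,byrow=T)
m[2]          # 4: the second stored element is row 2, column 1
as.vector(m)  # 1 4 2 5 3 6, the column-major storage order
```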
General Matrix Operations
Now that we’ve covered the basics of creating a matrix, we’ll look at some common operations performed with matrices. These include performing linear algebra operations, matrix indexing, and matrix filtering.
Performing Linear Algebra Operations on Matrices
You can perform various linear algebra operations on matrices, such as matrix multiplication, matrix scalar multiplication, and matrix addition. Using y from the preceding example, here is how to perform those three operations:
> y %*% y # mathematical matrix multiplication
[,1] [,2]
[1,] 7 15
[2,] 10 22
> 3*y # mathematical multiplication of matrix by scalar
[,1] [,2]
[1,] 3 9
[2,] 6 12
> y+y # mathematical matrix addition
[,1] [,2]
[1,] 2 6
[2,] 4 8
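Base R also provides further linear algebra functions, notably t() for transpose and solve() for inverting a matrix or solving a linear system; a brief sketch using the same y:

```r
y <- matrix(c(1,2,3,4),nrow=2)
t(y)              # transpose of y
yinv <- solve(y)  # inverse of y
y %*% yinv        # gives the 2x2 identity matrix (up to rounding)
solve(y,c(1,1))   # solves y %*% x = c(1,1) for x
```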
Matrix Indexing
The same operations we discussed for vectors before apply to matrices as well. Here’s an example:
> z
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 2 1 0
[3,] 3 0 1
[4,] 4 0 0
> z[,2:3]
[,1] [,2]
[1,] 1 1
[2,] 1 0
[3,] 0 1
[4,] 0 0
Here, we requested the submatrix of z consisting of all elements with column numbers 2 and 3 and any row number. This extracts the second and third columns. Here’s an example of extracting rows instead of columns:
> y
[,1] [,2]
[1,] 11 12
[2,] 21 22
[3,] 31 32
> y[2:3,]
[,1] [,2]
[1,] 21 22
[2,] 31 32
> y[2:3,2]
[1] 22 32
You can also assign values to submatrices:
> y
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> y[c(1,3),] <- matrix(c(1,1,8,12),nrow=2)
>y
[,1] [,2]
[1,] 1 8
[2,] 2 5
[3,] 1 12
Here, we assigned new values to the first and third rows of y.
And here’s another example of assignment to submatrices:
> x <- matrix(nrow=3,ncol=3)
> y <- matrix(c(4,5,2,3),nrow=2)
> y
[,1] [,2]
[1,] 4 2
[2,] 5 3
> x[2:3,2:3] <- y
> x
[,1] [,2] [,3]
[1,] NA NA NA
[2,] NA 4 2
[3,] NA 5 3
Negative subscripts, used with vectors to exclude certain elements, work the same way with matrices:
> y
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> y[-2,]
[,1] [,2]
[1,] 1 4
[2,] 3 6
In the second command, we requested all rows of y except the second.
Extended Example: Image Manipulation
Image files are inherently matrices, since the pixels are arranged in rows and columns. If we have a grayscale image, for each pixel, we store the intensity (the brightness) of the image at that pixel. So, the intensity of a pixel in, say, row 28 and column 88 of the image is stored in row 28, column 88 of the matrix. For a color image, three matrices are stored, with intensities for red, green, and blue components, but we'll stick to grayscale here. For our example, let's consider an image of the Mount Rushmore National Memorial in the United States. Let's read it in, using the pixmap library.
> library(pixmap)
> mtrush1 <- read.pnm("mtrush1.pgm")
> mtrush1
Pixmap image
Type : pixmapGrey
Size : 194×259
Resolution : 1×1
Bounding box : 0 0 259 194
> plot(mtrush1)
We read in the file named mtrush1.pgm, returning an object of class pixmap. We then plot it, as seen in Figure 3-1.
Figure 3-1: Reading in Mount Rushmore
Now, let’s see what this class consists of:
> str(mtrush1)
Formal class 'pixmapGrey' [package "pixmap"] with 6 slots
..@ grey : num [1:194, 1:259] 0.278 0.263 0.239 0.212 0.192 …
..@ channels: chr "grey"
..@ size : int [1:2] 194 259
…
The class here is of the S4 type, whose components are designated by @, rather than $. (S3 and S4 classes are discussed in detail later.) The key item here is the intensity matrix, mtrush1@grey. In the example, this matrix has 194 rows and 259 columns. The intensities in this class are stored as numbers ranging from 0.0 (black) to 1.0 (white), with intermediate values literally being shades of gray. For instance, the pixel at row 28, column 88 is pretty bright.
> mtrush1@grey[28,88]
[1] 0.7960784
To demonstrate matrix operations, let’s blot out President Roosevelt. (Sorry, Teddy, nothing personal.) To determine the relevant rows and columns, you can use R’s locator() function. When you call this function, it waits for the user to click a point within a graph and returns the exact coordinates of that point. In this manner, I found that Roosevelt’s portion of the picture is in rows 84 through 163 and columns 135 through 177. (Note that row numbers in pixmap objects increase from the top of the picture to the bottom, the opposite of the numbering used by locator().) So, to blot out that part of the image, we set all the pixels in that range to 1.0.
> mtrush2 <- mtrush1
> mtrush2@grey[84:163,135:177] <- 1
> plot(mtrush2)
The result is shown in Figure 3-2.
Figure 3-2: Mount Rushmore, with President Roosevelt removed
What if we merely wanted to disguise President Roosevelt’s identity? We could do this by adding random noise to the picture. Here’s code to do that:
# adds random noise to img, at the range rows,cols of img; img and the
# return value are both objects of class pixmap; the parameter q
# controls the weight of the noise, with the result being 1-q times the
# original image plus q times the random noise
blurpart <- function(img,rows,cols,q) {
lrows <- length(rows)
lcols <- length(cols)
newimg <- img
randomnoise <- matrix(nrow=lrows,ncol=lcols,runif(lrows*lcols))
newimg@grey <- (1-q) * img@grey + q * randomnoise
return(newimg)
}
As the comments indicate, we generate random noise and then take a weighted average of the target pixels and the noise. The parameter q controls the weight of the noise, with larger q values producing more blurring. The random noise itself is a sample from U(0,1), the uniform distribution on the interval (0,1). Note that the following is a matrix operation:
newimg@grey <- (1-q) * img@grey + q * randomnoise
So, let’s give it a try:
> mtrush3 <- blurpart(mtrush1,84:163,135:177,0.65)
> plot(mtrush3)
The result is shown in Figure 3-3.
Figure 3-3: Mount Rushmore, with President Roosevelt blurred
Filtering on Matrices
Filtering can be done with matrices, just as with vectors. You must be careful with the syntax, though. Let’s start with a simple example:
> x
[,1] [,2]
[1,] 1 2
[2,] 2 3
[3,] 3 4
> x[x[,2] >= 3,]
[,1] [,2]
[1,] 2 3
[2,] 3 4
Again, let's dissect this, just as we did when we first looked at filtering:
> j <- x[,2] >= 3
> j
[1] FALSE TRUE TRUE
Here, we look at the vector x[,2], which is the second column of x, and determine which of its elements are greater than or equal to 3. The result, assigned to j, is a Boolean vector. Now, use j in x:
> x[j,]
[,1] [,2]
[1,] 2 3
[2,] 3 4
Here, we compute x[j,], that is, the rows of x specified by the true elements of j, getting the rows corresponding to the elements in column 2 that were at least equal to 3. Hence, the behavior shown earlier when this example was introduced:
> x
[,1] [,2]
[1,] 1 2
[2,] 2 3
[3,] 3 4
> x[x[,2] >= 3,]
[,1] [,2]
[1,] 2 3
[2,] 3 4
For performance purposes, it's worth noting again that the computation of j here is a completely vectorized operation, since all of the following are true:
- The object x[,2] is a vector.
- The operator >= compares two vectors.
- The number 3 was recycled to a vector of 3s.
Also note that even though j was defined in terms of x and then was used to extract from x, it did not need to be that way. The filtering criterion can be based on a variable separate from the one to which the filtering will be applied. Here's an example, this time with x being the 3-by-2 matrix containing the numbers 1 through 6:
> z <- c(5,12,13)
> x[z %% 2 == 1,]
[,1] [,2]
[1,] 1 4
[2,] 3 6
Here, the expression z %% 2 == 1 tests each element of z for being an odd number, thus yielding (TRUE,FALSE,TRUE). As a result, we extracted the first and third rows of x. Here is another example:
> m
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> m[m[,1] > 1 & m[,2] > 5,]
[1] 3 6
We're using the same principle here, but with a slightly more complex set of conditions for row extraction. (Column extraction, or more generally, extraction of any submatrix, is similar.) First, the expression m[,1] > 1 compares each element of the first column of m to 1 and returns (FALSE,TRUE,TRUE). The second expression, m[,2] > 5, similarly returns (FALSE,FALSE,TRUE). We then take the logical AND of (FALSE,TRUE,TRUE) and (FALSE,FALSE,TRUE), yielding (FALSE,FALSE,TRUE). Using the latter in the row indices of m, we get the third row of m.
Note that we needed to use &, the vector Boolean AND operator, rather than the scalar one that we would use in an if statement, &&. A complete list of such operators is given in Section 7.2.
The alert reader may have noticed an anomaly in the preceding example. Our filtering should have given us a submatrix of size 1 by 2, but instead it gave us a two-element vector. The elements were correct, but the data type was not. This would cause trouble if we were to then input it to some other matrix function. The solution is to use the drop argument, which tells R to retain the two-dimensional nature of our data. We'll discuss drop in detail in Section 3.6 when we examine unintended dimension reduction.
Since matrices are vectors, you can also apply vector operations to them. Here's an example:
> m
[,1] [,2]
[1,] 5 -1
[2,] 2 10
[3,] 9 11
> which(m > 2)
[1] 1 3 5 6
R informed us here that, from a vector-indexing point of view, elements 1, 3, 5, and 6 of m are larger than 2. For example, element 5 is the element in row 2, column 2 of m, which we see has the value 10, which is indeed greater than 2.
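If you then need those vector-style indices in row/column form, base R's arrayInd() function converts them:

```r
m <- matrix(c(5,2,9,-1,10,11),nrow=3)
idx <- which(m > 2)    # 1 3 5 6, as vector indices
arrayInd(idx,dim(m))   # the same positions as (row, column) pairs
```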
Extended Example: Generating a Covariance Matrix
This example demonstrates R's row() and col() functions, whose arguments are matrices. For a matrix a, row(a) returns a matrix of the same dimensions as a, each element of which gives the row number of the corresponding element of a; thus row(a)[2,8] is 2. Well, we knew that element is in row 2, didn't we? So why would this function be useful? Let's consider an example. When writing simulation code for multivariate normal distributions, for instance using mvrnorm() from the MASS library, we need to specify a covariance matrix. The key point for our purposes here is that the matrix is symmetric; for example, the element in row 1, column 2 is equal to the element in row 2, column 1.
Suppose that we are working with an n-variate normal distribution. Our matrix will have n rows and n columns, and we wish each of the n variables to have variance 1, with correlation rho between pairs of variables. For n = 3 and rho = 0.2, for example, the desired matrix is as follows:
1.0 0.2 0.2
0.2 1.0 0.2
0.2 0.2 1.0
Here is code to generate this kind of matrix:
makecov <- function(rho,n) {
m <- matrix(nrow=n,ncol=n)
m <- ifelse(row(m) == col(m),1,rho)
return(m)
}
Let's see how this works. First, as you probably guessed, col() returns the column number of its argument, just as row() does for the row number. Then the expression row(m) in the ifelse() call returns a matrix of integer values, each one showing the row number of the corresponding element of m. For instance,
> z
[,1] [,2]
[1,] 3 6
[2,] 4 7
[3,] 5 8
> row(z)
[,1] [,2]
[1,] 1 1
[2,] 2 2
[3,] 3 3
Thus the expression row(m) == col(m) returns a matrix of TRUE and FALSE values, TRUE values on the diagonal of the matrix and FALSE values elsewhere. Once again, keep in mind that binary operators, in this case ==, are functions. Of course, row() and col() are functions too, so this expression:
row(m) == col(m)
applies that function to each element of the matrix m, and it returns a TRUE/FALSE matrix of the same size as m. The ifelse() expression is another function call.
ifelse(row(m) == col(m),1,rho)
In this case, with the argument being the TRUE/FALSE matrix just discussed, the result is to place the values 1 and rho in the proper places in our output matrix.
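Putting the pieces together (repeating the function for completeness), here is a sample call for the rho = 0.2, n = 3 case discussed above:

```r
makecov <- function(rho,n) {
  m <- matrix(nrow=n,ncol=n)
  m <- ifelse(row(m) == col(m),1,rho)
  return(m)
}
makecov(0.2,3)  # 3x3 matrix with 1 on the diagonal, 0.2 elsewhere
```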
Applying Functions to Matrix Rows and Columns
One of the most famous and most used features of R is the *apply() family of functions, such as apply(), tapply(), and lapply(). Here, we'll look at apply(), which instructs R to call a user-specified function on each of the rows or each of the columns of a matrix.
Using the apply() Function
This is the general form of apply for matrices:
apply(m,dimcode,f,fargs)
where the arguments are as follows:
- m is the matrix.
- dimcode is the dimension, equal to 1 if the function applies to rows or 2 if it applies to columns.
- f is the function to be applied.
- fargs is an optional set of arguments to be supplied to f.
For example, here we apply the R function mean() to each column of a matrix z:
> z
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> apply(z,2,mean)
[1] 2 5
In this case, we could have used the colMeans() function, but this provides a simple example of using apply(). A function you write yourself is just as legitimate for use in apply() as any R built-in function such as mean(). Here’s an example using our own function f:
> z
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> f <- function(x) x/c(2,8)
> y <- apply(z,1,f)
> y
[,1] [,2] [,3]
[1,] 0.5 1.000 1.50
[2,] 0.5 0.625 0.75
Our f() function divides a two-element vector by the vector (2,8). (Recycling would be used if x had a length longer than 2.) The call to apply() asks R to call f() on each of the rows of z. The first such row is (1,4), so in the call to f(), the actual argument corresponding to the formal argument x is (1,4). Thus, R computes the value of (1,4)/(2,8), which in R's element-wise vector arithmetic is (0.5,0.5). The computations for the other two rows are similar.
You may have been surprised that the size of the result here is 2 by 3 rather than 3 by 2. That first computation, (0.5,0.5), ends up at the first column in the output of apply(), not the first row. But this is the behavior of apply(). If the function to be applied returns a vector of k components, then the result of apply() will have k rows. You can use the matrix transpose function t() to change it if necessary, as follows:
> t(apply(z,1,f))
[,1] [,2]
[1,] 0.5 0.500
[2,] 1.0 0.625
[3,] 1.5 0.750
If the function returns a scalar (which we know is just a one-element vector), the final result will be a vector, not a matrix. As you can see, the function to be applied needs to take at least one argument. The formal argument here will correspond to an actual argument of one row or column in the matrix, as described previously. In some cases, you will need additional arguments for this function, which you can place following the function name in your call to apply().
For instance, suppose we have a matrix of 1s and 0s and want to create a vector as follows: For each row of the matrix, the corresponding element of the vector will be either 1 or 0, depending on whether the majority of the first d elements in that row is 1 or 0. Here, d will be a parameter that we may wish to vary. We could do this:
> copymaj
function(rw,d) {
maj <- sum(rw[1:d]) / d
return(if(maj > 0.5) 1 else 0)
}
> x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 1 1 0
[2,] 1 1 1 1 0
[3,] 1 0 0 1 1
[4,] 0 1 1 1 0
> apply(x,1,copymaj,3)
[1] 1 1 0 1
> apply(x,1,copymaj,2)
[1] 0 1 0 0
Here, the values 3 and 2 form the actual arguments for the formal argument d in copymaj(). Let’s look at what happened in the case of row 1 of x. That row consisted of (1,0,1,1,0), the first d elements of which were (1,0,1). A majority of those three elements were 1s, so copymaj() returned a 1, and thus the first element of the output of apply() was a 1.
Contrary to common opinion, using apply() will generally not speed up your code. The benefits are that it makes for very compact code, which may be easier to read and modify, and you avoid possible bugs in writing code for looping. Moreover, as R moves closer and closer to parallel processing, functions like apply() will become more and more important. For example, the clusterApply() function in the snow package gives R some parallel-processing capability by distributing the submatrix data to various network nodes, with each node basically applying the given function on its submatrix.
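snow's functionality has since been incorporated into R's standard parallel package; here is a minimal sketch of the idea using parApply() (assuming a machine with at least two cores; for a matrix this small, the communication overhead would actually outweigh any speedup):

```r
library(parallel)
cl <- makeCluster(2)              # start two worker processes
m <- matrix(runif(10000), nrow=100)
res <- parApply(cl, m, 1, sum)    # like apply(m,1,sum), but split across workers
stopCluster(cl)
```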
Extended Example: Finding Outliers
In statistics, outliers are data points that differ greatly from most of the other observations. As such, they are treated either as suspect (they might be erroneous) or unrepresentative (such as Bill Gates's income among the incomes of the citizens of the state of Washington). Many methods have been devised to identify outliers. We'll build a very simple one here.
Say we have retail sales data in a matrix rs. Each row of data is for a different store, and observations within a row are daily sales figures. As a simple (undoubtedly overly simple) approach, let's write code to identify the most deviant observation for each store. We'll define that as the observation furthest from the median value for that store. Here's the code:
findols <- function(x) {
findol <- function(xrow) {
mdn <- median(xrow)
devs <- abs(xrow-mdn)
return(which.max(devs))
}
return(apply(x,1,findol))
}
Our call will be as follows:
findols(rs)
How will this work? First, we need a function to specify in our apply() call. Since this function will be applied to each row of our sales matrix, our description implies that it needs to report the index of the most deviant observation in a given row. Our function findol() does that, in the lines that compute devs and then call which.max(). (Note that we've defined one function within another here, a common practice if the inner function is short.) In the expression xrow-mdn, we are subtracting a number that is a one-element vector from a vector that generally will have a length greater than 1. Thus, recycling is used to extend mdn to conform with xrow before the subtraction.
Then we use the R function which.max(). Instead of finding the maximum value in a vector, which the max() function does, which.max() tells us where that maximum value occurs, that is, the index where it occurs. This is just what we need. Finally, the outer function asks R to apply findol() to each row of x, thus producing the indices of the most deviant observation in each row.
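For instance, with a small made-up sales matrix rs (hypothetical data, just for illustration; the function is repeated for completeness):

```r
findols <- function(x) {
  findol <- function(xrow) {
    mdn <- median(xrow)
    devs <- abs(xrow-mdn)
    return(which.max(devs))
  }
  return(apply(x,1,findol))
}
rs <- rbind(c(5,5,6,100),  # store 1: day 4 is far from the median
            c(3,60,3,3))   # store 2: day 2 is far from the median
findols(rs)  # 4 2
```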
Adding and Deleting Matrix Rows and Columns
Technically, matrices are of fixed length and dimensions, so we cannot add or delete rows or columns. However, matrices can be reassigned, and thus we can achieve the same effect as if we had directly done additions or deletions.
Changing the Size of a Matrix
Recall how we reassign vectors to change their size:
> x
[1] 12 5 13 16 8
> x <- c(x,20) # append 20
> x
[1] 12 5 13 16 8 20
> x <- c(x[1:3],20,x[4:6]) # insert 20
> x
[1] 12 5 13 20 16 8 20
> x <- x[-2:-4] # delete elements 2 through 4
> x
[1] 12 16 8 20
In the first case, x is originally of length 5, which we extend to 6 via concatenation and then reassignment. We didn't literally change the length of x but instead created a new vector from x and then assigned x to that new vector. Reassignment occurs even when you don't see it. For instance, even the innocuous-looking assignment x[2] <- 12 is actually a reassignment.
Analogous operations can be used to change the size of a matrix. For instance, the rbind() (row bind) and cbind() (column bind) functions let you add rows or columns to a matrix.
> one
[1] 1 1 1 1
> z
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 2 1 0
[3,] 3 0 1
[4,] 4 0 0
> cbind(one,z)
one
[1,] 1 1 1 1
[2,] 1 2 1 0
[3,] 1 3 0 1
[4,] 1 4 0 0
Here, cbind() creates a new matrix by combining a column of 1s with the columns of z. We choose to get a quick printout, but we could have assigned the result to z (or another variable), as follows:
z <- cbind(one,z)
Note, too, that we could have relied on recycling:
> cbind(1,z)
[,1] [,2] [,3] [,4]
[1,] 1 1 1 1
[2,] 1 2 1 0
[3,] 1 3 0 1
[4,] 1 4 0 0
Here, the 1 value was recycled into a vector of four 1 values. You can also use the rbind() and cbind() functions as a quick way to create small matrices. Here's an example:
> q <- cbind(c(1,2),c(3,4))
> q
[,1] [,2]
[1,] 1 3
[2,] 2 4
Be careful with rbind() and cbind(), though. Like creating a vector, creating a matrix is time consuming (matrices are vectors, after all). In the following code, cbind() creates a new matrix:
z <- cbind(one,z)
The new matrix happens to be reassigned to z; that is, we gave it the name z, the same name as the original matrix, which is now gone. But the point is that we did incur a time penalty in creating the matrix. If we did this repeatedly inside a loop, the cumulative penalty would be large.
So, if you are adding rows or columns one at a time within a loop, and the matrix will eventually become large, it’s better to allocate a large matrix in the first place. It will be empty at first, but you fill in the rows or columns one at a time, rather than doing a time-consuming matrix memory allocation each time. You can delete rows or columns by reassignment, too:
> m <- matrix(1:6,nrow=3)
> m
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> m <- m[c(1,3),]
> m
[,1] [,2]
[1,] 1 4
[2,] 3 6
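To see the performance point made earlier in concrete form, here is a sketch contrasting growing a matrix row by row against preallocating it (the gap widens as n grows):

```r
n <- 1000
# slow: each rbind() allocates and copies a whole new matrix
m <- NULL
for (i in 1:n) m <- rbind(m, c(i, i^2))
# faster: allocate once, then fill in rows
m2 <- matrix(nrow=n, ncol=2)
for (i in 1:n) m2[i,] <- c(i, i^2)
all(m == m2)  # the two results agree
```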
Extended Example: Finding the Closest Pair of Vertices in a Graph
Finding the distances between vertices on a graph is a common example used in computer science courses and is used in statistics/data science too. This kind of problem arises in some clustering algorithms, for instance, and in genomics applications. Here, we'll look at the common example of finding distances between cities, as it is easier to describe than, say, finding distances between DNA strands. Suppose we need a function that inputs a distance matrix, where the element in row i, column j gives the distance between city i and city j, and outputs the minimum one-hop distance between cities and the pair of cities that achieves that minimum. Here's the code for the solution:
# returns the minimum value of d[i,j], i != j, and the row/col attaining
# that minimum, for square symmetric matrix d; no special policy on ties
mind <- function(d) {
n <- nrow(d)
# add a column to identify row number for apply()
dd <- cbind(d,1:n)
wmins <- apply(dd[-n,],1,imin)
# wmins will be 2xn, 1st row being indices and 2nd being values
i <- which.min(wmins[2,])
j <- wmins[1,i]
return(c(d[i,j],i,j))
}

# finds the location, value of the minimum in a row x
imin <- function(x) {
lx <- length(x)
i <- x[lx] # original row number
j <- which.min(x[(i+1):(lx-1)])
k <- i+j
return(c(k,x[k]))
}
And here’s an example of putting our new function to use:
> q
[,1] [,2] [,3] [,4] [,5]
[1,] 0 12 13 8 20
[2,] 12 0 15 28 88
[3,] 13 15 0 6 9
[4,] 8 28 6 0 33
[5,] 20 88 9 33 0
> mind(q)
[1] 6 3 4
The minimum value was 6, located in row 3, column 4. As you can see, a call to apply() plays a prominent role. Our task is fairly simple: We need to find the minimum nonzero element in the matrix. We find the minimum in each row—a single call to apply() accomplishes this for all the rows—and then find the smallest value among those minima. But as you’ll see, the code logic becomes rather intricate.
One key point is that the matrix is symmetric, because the distance from city i to city j is the same as from j to i. So in finding the minimum value in the ith row, we need look at only elements i+1, i+2, ..., n, where n is the number of rows and columns in the matrix. Note too that this means we can skip the last row of d in our call to apply().
Since the matrix could be large (a thousand cities would mean a million entries in the matrix) we should exploit that symmetry and save work. That, however, presents a problem. In order to go through the basic computation, the function called by apply() needs to know the number of the row in the original matrix, knowledge that apply() does not provide to the function. So, in mind(), we augment the matrix with an extra column, consisting of the row numbers, so that the function called by apply() can take row numbers into account.
The function called by apply() is imin(), which finds the minimum in the row specified in the formal argument x. It returns not only the minimum in the given row but also the index at which that minimum occurs. When imin() is called on row 1 of our example matrix q above, the minimum value is 8, which occurs in index 4. For this latter purpose, the R function which.min() is very handy.
The line k <- i+j is noteworthy. Recall that due to the symmetry of the matrix, we skip the early part of each row, as is seen in the expression (i+1):(lx-1). But that means the call to which.min() will return the minimum's index relative to the range (i+1):(lx-1). In row 3 of our example matrix q, we would get an index of 1 instead of 4. Thus, we must adjust by adding i, which is what k <- i+j does.
Finally, making proper use of the output of apply() here is a bit tricky. Think again of the example matrix q above. The call to apply() will return the matrix wmins:
[,1] [,2] [,3] [,4]
[1,] 4 3 4 5
[2,] 8 15 6 33
As noted in the comments, the second row of that matrix contains the upper-diagonal minima from the various rows of d, while the first row contains the indices of those values. For instance, the first column of wmins gives the information for the first row of q, reporting that the smallest value in that row is 8, occurring at index 4 of the row. Thus the which.min() call in mind() will pick up the number i of the row containing the smallest value in the entire matrix (row 3 in our q example, where the overall minimum 6 occurs), and wmins[1,i] will give us the position j in that row where the minimum occurs (4 in the case of q). In other words, the overall minimum is in row i and column j, information that we then use in forming the return value.
Meanwhile, row 1 of apply()'s output shows the indices within those rows at which the row minima occur. That's great, because we can now find which other city was in the best pair. We know that city 3 is one of them, so we go to entry 3 in row 1 of the output, finding 4. So, the pair of cities closest to each other is city 3 and city 4. The computations of i and j in mind() generalize this reasoning. If the minimal element in our matrix is unique, there is an alternate approach that is far simpler:
minda <- function(d) {
smallest <- min(d)
ij <- which(d == smallest,arr.ind=TRUE)
return(c(smallest,ij))
}
This works, but it does have some possible drawbacks. Here’s the key line in this new code:
ij <- which(d == smallest,arr.ind=TRUE)
It determines the index of the element of d that achieves the minimum. The argument arr.ind=TRUE specifies that the returned index will be a matrix index—that is, a row and a column, rather than a single vector subscript. Without this argument, d would be treated as a vector. As noted, this new code works only if the minimum is unique. If that is not the case, which() will return multiple row/column number pairs, contrary to our basic goal. And if we used the original code and d had multiple minimal elements, just one of them would be returned.
Another problem is performance. This new code is essentially making two (behind-the-scenes) loops through our matrix: one to compute smallest and the other in the call to which(). This will likely be slower than our original code. Of these two approaches, you might choose the original code if execution speed is an issue or if there may be multiple minima, but otherwise opt for the alternate code; the simplicity of the latter will make the code easier to read and maintain.
More on the Vector/Matrix Distinction
At the beginning of the chapter, we said that a matrix is just a vector but with two additional attributes: the number of rows and the number of columns. Here, we’ll take a closer look at the vector nature of matrices. Consider this example:
> z <- matrix(1:8,nrow=4)
> z
[,1] [,2]
[1,] 1 5
[2,] 2 6
[3,] 3 7
[4,] 4 8
As z is still a vector, we can query its length:
> length(z)
[1] 8
But as a matrix, z is a bit more than a vector:
> class(z)
[1] "matrix"
> attributes(z)
$dim
[1] 4 2
In other words, there actually is a matrix class, in the object-oriented programming sense. Most of R consists of S3 classes, whose components are denoted by dollar signs. The matrix class has one attribute, named dim, which is a vector containing the numbers of rows and columns in the matrix. You can also obtain dim via the dim() function:
> dim(z)
[1] 4 2
The numbers of rows and columns are obtainable individually via the nrow() and ncol() functions:
> nrow(z)
[1] 4
> ncol(z)
[1] 2
These just piggyback on dim(), as you can see by inspecting the code. Recall once again that objects can be printed in interactive mode by simply typing their names:
> nrow
function (x)
dim(x)[1]
These functions are useful when you are writing a general-purpose library function whose argument is a matrix. By being able to determine the number of rows and columns in your code, you alleviate the caller of the burden of supplying that information as two additional arguments. This is one of the benefits of object-oriented programming.
Avoiding Unintended Dimension Reduction
In the world of statistics, dimension reduction is a good thing, with many statistical procedures aimed at doing it well. If we are working with, say, 10 variables and can reduce that number to 3 while still capturing the essence of our data, we're happy. However, in R, something else might merit the name dimension reduction that we may sometimes wish to avoid. Say we have a four-row matrix and extract a row from it:
> z
[,1] [,2]
[1,] 1 5
[2,] 2 6
[3,] 3 7
[4,] 4 8
> r <- z[2,]
> r
[1] 2 6
This seems innocuous, but note the format in which R has displayed r. It’s a vector format, not a matrix format. In other words, r is a vector of length 2, rather than a 1-by-2 matrix. We can confirm this in a couple of ways:
> attributes(z)
$dim
[1] 4 2
> attributes(r)
NULL
> str(z)
int [1:4, 1:2] 1 2 3 4 5 6 7 8
> str(r)
int [1:2] 2 6
Here, R informs us that z has row and column numbers, while r does not. Similarly, str() tells us that z has indices ranging in 1:4 and 1:2, for rows and columns, while r’s indices simply range in 1:2. No doubt about it—r is a vector, not a matrix.
This seems natural, but in many cases, it will cause trouble in programs that do a lot of matrix operations. You may find that your code works fine in general but fails in a special case. For instance, suppose that your code extracts a submatrix from a given matrix and then does some matrix operations on the submatrix. If the submatrix has only one row, R will make it a vector, which could ruin your computation. Fortunately, R has a way to suppress this dimension reduction: the drop argument. Here's an example, using the matrix z from above:
> r <- z[2,, drop=FALSE]
> r
[,1] [,2]
[1,] 2 6
> dim(r)
[1] 1 2
Now r is a 1-by-2 matrix, not a two-element vector. For these reasons, you may find it useful to routinely include the drop=FALSE argument in all your matrix code. Why can we speak of drop as an argument? Because that [ is actually a function, just as is the case for operators like +. Consider the following code:
> z[3,2]
[1] 7
> "["(z,3,2)
[1] 7
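To see concretely how the dimension reduction can bite, consider this sketch: code that asks for the row count of an extracted row gets NULL unless drop=FALSE was used.

```r
z <- matrix(1:8, nrow=4)   # the same 4-by-2 matrix as above

r1 <- z[2,]                # dimensions dropped: r1 is a plain vector
nrow(r1)                   # NULL -- code expecting a row count breaks here

r2 <- z[2,, drop=FALSE]    # dimensions retained: r2 is a 1-by-2 matrix
nrow(r2)                   # 1
```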
If you have a vector that you wish to be treated as a matrix, you can use the as.matrix() function, as follows:
> u
[1] 1 2 3
> v <- as.matrix(u)
> attributes(u)
NULL
> attributes(v)
$dim
[1] 3 1
Naming Matrix Rows and Columns
The natural way to refer to rows and columns in a matrix is via the row and column numbers. However, you can also give names to these entities. Here’s an example:
> z
[,1] [,2]
[1,] 1 3
[2,] 2 4
> colnames(z)
NULL
> colnames(z) <- c("a","b")
> z
     a b
[1,] 1 3
[2,] 2 4
> colnames(z)
[1] "a" "b"
> z[,"a"]
[1] 1 2
As you see here, these names can then be used to reference specific columns. The function rownames() works similarly. Naming rows and columns is usually less important when writing R code for general applications, but it can be useful when analyzing a specific data set.
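For completeness, here is a small sketch of rownames() used alongside colnames(); the names a, b, x, and y are arbitrary choices for illustration:

```r
z <- matrix(c(1,2,3,4), nrow=2)
rownames(z) <- c("x","y")   # name the rows
colnames(z) <- c("a","b")   # name the columns
z["y","b"]   # 4, the same element as z[2,2]
z["x",]      # the first row, extracted by name
```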
Higher-Dimensional Arrays
In a statistical context, a typical matrix in R has rows corresponding to observations, say on various people, and columns corresponding to variables, such as weight and blood pressure. The matrix is then a two-dimensional data structure. But suppose we also have data taken at different times, one data point per person per variable per time. Time then becomes the third dimension, in addition to rows and columns. In R, such data sets are called arrays.
As a simple example, consider students and test scores. Say each test consists of two parts, so we record two scores for a student for each test. Now suppose that we have two tests, and to keep the example small, assume we have only three students. Here’s the data for the first test:
> firsttest
[,1] [,2]
[1,] 46 30
[2,] 21 25
[3,] 50 48
Student 1 had scores of 46 and 30 on the first test, student 2 scored 21 and 25, and so on. Here are the scores for the same students on the second test:
> secondtest
[,1] [,2]
[1,] 46 43
[2,] 41 35
[3,] 50 49
Now let’s put both tests into one data structure, which we’ll name tests. We’ll arrange it to have two “layers”—one layer per test—with three rows and two columns within each layer. We’ll store firsttest in the first layer and secondtest in the second. In layer 1, there will be three rows for the three students’ scores on the first test, with two columns per row for the two portions of a test. We use R’s array function to create the data structure:
> tests <- array(data=c(firsttest,secondtest),dim=c(3,2,2))
In the argument dim=c(3,2,2), we are specifying two layers (this is the second 2), each consisting of three rows and two columns. This then becomes an attribute of the data structure:
> attributes(tests)
$dim
[1] 3 2 2
Each element of tests now has three subscripts, rather than two as in the matrix case. The first subscript corresponds to the first element in the $dim vector, the second subscript corresponds to the second element in the vector, and so on. For instance, the score on the second portion of test 1 for student 3 is retrieved as follows:
> tests[3,2,1]
[1] 48
R’s print function for arrays displays the data layer by layer:
> tests
, , 1
[,1] [,2]
[1,] 46 30
[2,] 21 25
[3,] 50 48
, , 2
[,1] [,2]
[1,] 46 43
[2,] 41 35
[3,] 50 49
Just as we built our three-dimensional array by combining two matrices, we can build four-dimensional arrays by combining two or more three-dimensional arrays, and so on. One of the most common uses of arrays is in calculating tables.
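As a hedged sketch of that idea, here two semester arrays of the same shape are stacked into a four-dimensional array; the second semester's scores are made-up numbers for illustration:

```r
firsttest  <- matrix(c(46,21,50, 30,25,48), nrow=3)
secondtest <- matrix(c(46,41,50, 43,35,49), nrow=3)
sem1 <- array(c(firsttest, secondtest), dim=c(3,2,2))
sem2 <- array(c(firsttest+1, secondtest+1), dim=c(3,2,2))  # hypothetical data

# dimensions: students x portions x tests x semesters
allsems <- array(c(sem1, sem2), dim=c(3,2,2,2))
allsems[3,2,1,1]   # 48: student 3, portion 2, test 1, semester 1
```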
LISTS
In contrast to a vector, in which all elements must be of the same mode, R's list structure can combine objects of different types. For those familiar with Python, an R list is similar to a Python dictionary or, for that matter, a Perl hash. C programmers may find it similar to a C struct. The list plays a central role in R, forming the basis for data frames, object-oriented programming, and so on. In this chapter, we'll cover how to create lists and how to work with them. As with vectors and matrices, one common operation with lists is indexing. List indexing is similar to vector and matrix indexing but with some major differences. And like matrices, lists have an analog for the apply() function. We'll discuss these and other list topics, including ways to take lists apart, which often comes in handy.
Creating Lists
Technically, a list is a vector. Ordinary vectors—those of the type we've been using so far—are termed atomic vectors, since their components cannot be broken down into smaller components. In contrast, lists are referred to as recursive vectors. For our first look at lists, let's consider an employee database. For each employee, we wish to store the name, salary, and a Boolean indicating union membership. Since we have three different modes here—character, numeric, and logical—it's a perfect place for using lists. Our entire database might then be a list of lists, or some other kind of list such as a data frame, though we won't pursue that here. We could create a list to represent our employee, Joe, this way:
j <- list(name="Joe", salary=55000, union=T)
We could print out j, either in full or by component:
> j
$name
[1] "Joe"
$salary
[1] 55000
$union
[1] TRUE
Actually, the component names—called tags in the R literature—such as salary are optional. We could alternatively do this:
> jalt <- list("Joe", 55000, T)
> jalt
[[1]]
[1] "Joe"
[[2]]
[1] 55000
[[3]]
[1] TRUE
However, it is generally considered clearer and less error-prone to use names instead of numeric indices. Names of list components can be abbreviated to whatever extent is possible without causing ambiguity:
> j$sal
[1] 55000
Since lists are vectors, they can be created via vector():
> z <- vector(mode="list")
> z[["abc"]] <- 3
> z
$abc
[1] 3
General List Operations
Now that you’ve seen a simple example of creating a list, let’s look at how to access and work with lists.
List Indexing
You can access a list component in several different ways:
> j$salary
[1] 55000
> j[["salary"]]
[1] 55000
> j[[2]]
[1] 55000
We can refer to list components by their numerical indices, treating the list as a vector. However, note that in this case, we use double brackets instead of single ones. So, there are three ways to access an individual component c of a list lst and return it in the data type of c:
- lst$c
- lst[["c"]]
- lst[[i]], where i is the index of c within lst
Each of these is useful in different contexts, as you will see in sub-sequent examples. But note the qualifying phrase, “return it in the data type of c.” An alternative to the second and third techniques listed is to use single brackets rather than double brackets:
- lst["c"]
- lst[i], where i is the index of c within lst
Both single-bracket and double-bracket indexing access list elements in vector-index fashion. But there is an important difference from ordinary (atomic) vector indexing. If single brackets [ ] are used, the result is
another list—a sublist of the original. For instance, continuing the preceding example, we have this:
> j[1:2]
$name
[1] "Joe"
$salary
[1] 55000
> j2 <- j[2]
> j2
$salary
[1] 55000
> class(j2)
[1] "list"
> str(j2)
List of 1
$ salary: num 55000
The subsetting operation returned another list consisting of the first two components of the original list j. Note that the word returned makes sense here, since index brackets are functions. This is similar to other cases you've seen for operators that do not at first appear to be functions, such as +. By contrast, you can use double brackets [[ ]] for referencing only a single component, with the result having the type of that component.
> j[[1:2]]
Error in j[[1:2]] : subscript out of bounds
> j2a <- j[[2]]
> j2a
[1] 55000
> class(j2a)
[1] "numeric"
Adding and Deleting List Elements
The operations of adding and deleting list elements arise in a surprising number of contexts. This is especially true for data structures in which lists form the foundation, such as data frames and R classes. New components can be added after a list is created.
> z <- list(a="abc",b=12)
> z
$a
[1] "abc"
$b
[1] 12
> z$c <- "sailing" # add a c component
> # did c really get added?
> z
$a
[1] "abc"
$b
[1] 12
$c
[1] "sailing"
Adding components can also be done via a vector index:
> z[[4]] <- 28
> z[5:7] <- c(FALSE,TRUE,TRUE)
> z
$a
[1] "abc"
$b
[1] 12
$c
[1] "sailing"
[[4]]
[1] 28
[[5]]
[1] FALSE
[[6]]
[1] TRUE
[[7]]
[1] TRUE
You can delete a list component by setting it to NULL.
> z$b <- NULL
> z
$a
[1] "abc"
$c
[1] "sailing"
[[3]]
[1] 28
[[4]]
[1] FALSE
[[5]]
[1] TRUE
[[6]]
[1] TRUE
Note that upon deleting z$b, the indices of the elements after it moved up by 1. For instance, the former z[[4]] became z[[3]]. You can also concatenate lists.
> c(list("Joe", 55000, T),list(5))
[[1]]
[1] "Joe"
[[2]]
[1] 55000
[[3]]
[1] TRUE
[[4]]
[1] 5
Getting the Size of a List
Since a list is a vector, you can obtain the number of components in a list via length().
> length(j)
[1] 3
Extended Example: Text Concordance
Web search and other types of textual data mining are of great interest today. Let's use this area for an example of R list code. We'll write a function called findwords() that will determine which words are in a text file and compile a list of the locations of each word's occurrences in the text. This would be useful for contextual analysis, for example. Suppose our input file, testconcord.txt, has the following contents:
The [1] here means that the first item in this line of output is item 1. In this case, our output consists of only one line (and one item), so this is redundant, but this notation helps to read voluminous output that consists of many items spread over many lines. For example, if there were two rows of output with six items per row, the second row would be labeled [7].
In order to identify words, we replace all nonletter characters with blanks and get rid of capitalization. We could use the string functions presented to do this, but to keep matters simple, such code is not shown here. The new file, testconcorda.txt, looks like this:
the here means that the first item in this line of output is
item in this case our output consists of only one line and one
item so this is redundant but this notation helps to read
voluminous output that consists of many items spread over many
lines for example if there were two rows of output with six items
per row the second row would be labeled
Then, for instance, the word item has locations 7, 14, and 27, which means that it occupies the seventh, fourteenth, and twenty-seventh word positions in the file. Here is an excerpt from the list that is returned when our function findwords() is called on this file:
> findwords("testconcorda.txt")
Read 68 items
$the
[1] 1 5 63
$here
[1] 2
$means
[1] 3
$that
[1] 4 40
$first
[1] 6
$item
[1] 7 14 27
…
The list consists of one component per word in the file, with a word’s component showing the positions within the file where that word occurs. Sure enough, the word item is shown as occurring at positions 7, 14, and 27.
Before looking at the code, let's talk a bit about our choice of a list structure here. One alternative would be to use a matrix, with one row per word in the text. We could use rownames() to name the rows, with the entries within a row showing the positions of that word. For instance, row item would consist of 7, 14, 27, and then 0s in the remainder of the row. But the matrix approach has a couple of major drawbacks:
- There is a problem in terms of the columns to allocate for our matrix. If the maximum frequency with which a word appears in our text is, say, 10, then we would need 10 columns. But we would not know that ahead of time. We could add a new column each time we encountered a new word, using cbind() (in addition to using rbind() to add a row for the word itself). Or we could write code to do a preliminary run through the input file to determine the maximum word frequency. Either of these would come at the expense of increased code complexity and possibly increased runtime.
- Such a storage scheme would be quite wasteful of memory, since most rows would probably consist of a lot of zeros. In other words, the matrix would be sparse—a situation that also often occurs in numerical analysis contexts.
Thus, the list structure really makes sense. Let’s see how to code it.
findwords <- function(tf) {
# read in the words from the file, into a vector of mode character
txt <- scan(tf,"")
wl <- list()
for (i in 1:length(txt)) {
wrd <- txt[i] # ith word in input file
wl[[wrd]] <- c(wl[[wrd]],i)
}
return(wl)
}
We read in the words of the file (words simply meaning any groups of letters separated by spaces) by calling scan(). The details of reading and writing files are covered later, but the important point here is that txt will now be a vector of strings: one string per instance of a word in the file. Here is what txt looks like after the read:
> txt
[1] "the" "here" "means" "that" "the"
[6] "first" "item" "in" "this" "line"
[11] "of" "output" "is" "item" "in"
[16] "this" "case" "our" "output" "consists"
[21] "of" "only" "one" "line" "and"
[26] "one" "item" "so" "this" "is"
[31] "redundant" "but" "this" "notation" "helps"
[36] "to" "read" "voluminous" "output" "that"
[41] "consists" "of" "many" "items" "spread"
[46] "over" "many" "lines" "for" "example"
[51] "if" "there" "were" "two" "rows"
[56] "of" "output" "with" "six" "items"
[61] "per" "row" "the" "second" "row"
[66] "would" "be" "labeled"
The list operations in lines 4 through 8 build up our main variable, a list wl (for word list). We loop through all the words from our long line, with wrd being the current one. Let's see what happens with the code in line 7 when i = 4, so that wrd = "that" in our example file testconcorda.txt. At this point, wl[["that"]] will not yet exist. As mentioned, R is set up so that in such a case, wl[["that"]] = NULL, which means in line 7, we can concatenate it! Thus wl[["that"]] will become the one-element vector (4). Later, when i = 40, wl[["that"]] will become (4,40), representing the fact that words 4 and 40 in the file are both "that". Note how convenient it is that list indexing can be done through quoted strings, such as in wl[["that"]].
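The NULL-concatenation idiom at the heart of that loop can be demonstrated in isolation; this is a minimal sketch with a toy word vector, not part of findwords() itself:

```r
wl <- list()
txt <- c("the", "here", "the")   # toy stand-in for the scanned words
for (i in 1:length(txt)) {
  wrd <- txt[i]
  # wl[[wrd]] is NULL the first time wrd is seen, and c(NULL, i) is just i,
  # so word positions accumulate with no special initialization step
  wl[[wrd]] <- c(wl[[wrd]], i)
}
wl$the    # 1 3
wl$here   # 2
```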
Accessing List Components and Values
If the components in a list do have tags, as is the case with name, salary, and union for j , you can obtain them via names():
> names(j)
[1] "name" "salary" "union"
To obtain the values, use unlist():
> ulj <- unlist(j)
> ulj
name salary union
"Joe" "55000" "TRUE"
> class(ulj)
[1] “character”
The return value of unlist() is a vector—in this case, a vector of character strings. Note that the element names in this vector come from the components in the original list. On the other hand, if we were to start with numbers, we would get numbers.
> z <- list(a=5,b=12,c=13)
> y <- unlist(z)
> class(y)
[1] "numeric"
> y
 a  b  c
 5 12 13
So the output of unlist() in this case was a numeric vector. What about a mixed case?
> w <- list(a=5,b=”xyz”)
> wu <- unlist(w)
> class(wu)
[1] "character"
> wu
a b
"5" "xyz"
Here, R chose the least common denominator: character strings. This sounds like some kind of precedence structure, and it is. As R’s help for unlist() states:
Where possible the list components are coerced to a common mode during the unlisting, and so the result often ends up as a character vector. Vectors will be coerced to the highest type of the components in the hierarchy NULL < raw < logical < integer < real < complex < character < list < expression: pairlists are treated as lists.
But there is something else to deal with here. Though wu is a vector and not a list, R did give each of the elements a name. We can remove them by setting their names to NULL, as you saw in Section 2.11.
> names(wu) <- NULL
> wu
[1] "5" "xyz"
We can also remove the elements’ names directly with unname(), as follows:
> wun <- unname(wu)
> wun
[1] "5" "xyz"
This also has the advantage of not destroying the names in wu, in case they are needed later. If they will not be needed later, we could simply assign back to wu instead of to wun in the preceding statement.
Applying Functions to Lists
Two functions are handy for applying functions to lists: lapply and sapply.
Using the lapply() and sapply() Functions
The function lapply() (for list apply) works like the matrix apply() function, calling the specified function on each component of a list (or vector coerced to a list) and returning another list. Here’s an example:
> lapply(list(1:3,25:29),median)
[[1]]
[1] 2
[[2]]
[1] 27
R applied median() to 1:3 and to 25:29, returning a list consisting of 2 and 27. In some cases, such as the example here, the list returned by lapply() could be simplified to a vector or matrix. This is exactly what sapply() (for simplified [l]apply) does.
> sapply(list(1:3,25:29),median)
[1] 2 27
You saw an example of matrix output in Section 2.6.2. There, we applied a vectorized, vector-valued function—a function whose return value is a vector, each of whose components is vectorized— to a vector input. Using sapply(), rather than applying the function directly, gave us the desired matrix form in the output.
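A quick sketch of that matrix case: if the applied function returns vectors of equal length, sapply() binds them into a matrix, one column per component of the input list. Here range() returns a two-element vector (min, max) for each component:

```r
# sapply() simplifies the list of length-2 results to a 2-by-2 matrix
m <- sapply(list(1:3, 25:29), range)
m
#      [,1] [,2]
# [1,]    1   25
# [2,]    3   29
```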
Extended Example: Text Concordance, Continued
The text concordance creator, findwords(), returns a list of word locations, indexed by word. It would be nice to be able to sort this list in various ways. Recall that for the input file testconcorda.txt, we got this output:
$the
[1] 1 5 63
$here
[1] 2
$means
[1] 3
$that
[1] 4 40
…
Here’s code to present the list in alphabetical order by word:
# sorts wrdlst, the output of findwords() alphabetically by word
alphawl <- function(wrdlst) {
nms <- names(wrdlst) # the words
sn <- sort(nms) # same words in alpha order
return(wrdlst[sn]) # return rearranged version
}
Since our words are the names of the list components, we can extract the words by simply calling names(). We sort these alphabetically, and then in line 5, we use the sorted version as input to list indexing, giving us a sorted version of the list. Note the use of single brackets, rather than double, as the former are required for subsetting lists. (You might also consider using order() instead of sort().) Let’s try it out.
> alphawl(wl)
$and
[1] 25
$be
[1] 67
$but
[1] 32
$case
[1] 17
$consists
[1] 20 41
$example
[1] 50
…
It works fine. The entry for and was displayed first, then be, and so on. We can sort by word frequency in a similar manner.
# orders the output of findwords() by word frequency
freqwl <- function(wrdlst) {
freqs <- sapply(wrdlst,length) # get word frequencies
return(wrdlst[order(freqs)])
}
In line 3, we are using the fact that each element in wrdlst is a vector of numbers representing the positions in our input file at which the given word is found. Calling length() on that vector gives us the number of times the given word appears in that file. The result of calling sapply() will be the vector of word frequencies. We could use sort() here again, but order() is more direct. This latter function returns the indices of a sorted vector with respect to the original vector. Here’s an example:
> x <- c(12,5,13,8)
> order(x)
[1] 2 4 1 3
The output here indicates that x[2] is the smallest element in x, x[4] is the second smallest, and so on. In our case, we use order() to determine which word is least frequent, second least frequent, and so on. Plugging these indices into our word list gives us the same word list but in order of frequency. Let’s check it.
> freqwl(wl)
$here
[1] 2
$means
[1] 3
$first
[1] 6
…
$that
[1] 4 40
$`in`
[1] 8 15
$line
[1] 10 24
…
$this
[1] 9 16 29 33
$of
[1] 11 21 42 56
$output
[1] 12 19 39 57
Yep, ordered from least to most frequent. We can also do a plot of the most frequent words. I ran the following code on an article on R in the New York Times, “Data Analysts Captivated by R’s Power,” from January 6, 2009.
> nyt <- findwords("nyt.txt")
Read 1011 items
> snyt <- freqwl(nyt)
> nwords <- length(snyt)
> barplot(snyt[round(0.9*nwords):nwords])
My goal was to plot the frequencies of the top 10 percent of the words in the article. The results are shown in Figure 4-1.
Figure 4-1: Top word frequencies in an article about R
Extended Example: Back to the Abalone Data
Let's use the lapply() function in our abalone gender example in Section 2.9.2. Recall that at one point in that example, we wished to know the indices of the observations that were male, female, and infant. For an easy demonstration, let's use the same test case: a vector of genders.
g <- c("M","F","F","I","M","M","F")
A more compact way of accomplishing our goal is as follows:
> lapply(c("M","F","I"),function(gender) which(g==gender))
[[1]]
[1] 1 5 6
[[2]]
[1] 2 3 7
[[3]]
[1] 4
The lapply() function expects its first argument to be a list. Here it was a vector, but lapply() will coerce that vector to a list form. Also, lapply() expects its second argument to be a function. This could be the name of a function, as you saw before, or the actual code, as we have here. This is an anonymous function.
Then lapply() calls that anonymous function on "M", then on "F", and then on "I". In that first case, the function calculates which(g=="M"), giving us the vector of indices in g of the males. After determining the indices for the females and infants, lapply() will return the three vectors in a list. Note that even though the object of our main attention is the vector g of genders, it is not the first argument in the lapply() call in the example. Instead, that argument is an innocuous-looking vector of the three possible gender encodings. By contrast, g is mentioned only briefly in the function, as the second actual argument. This is a common situation in R.
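A small variation, offered here as a sketch rather than part of the original example: if the vector of gender codes is given names, lapply() carries those names over to the output list, which makes the result self-describing:

```r
g <- c("M","F","F","I","M","M","F")
# naming the input vector's elements names the output components
out <- lapply(c(M="M", F="F", I="I"), function(gender) which(g == gender))
out$F   # 2 3 7
out$I   # 4
```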
Recursive Lists
Lists can be recursive, meaning that you can have lists within lists. Here’s an example:
> b <- list(u = 5, v = 12)
> c <- list(w = 13)
> a <- list(b,c)
> a
[[1]]
[[1]]$u
[1] 5
[[1]]$v
[1] 12
[[2]]
[[2]]$w
[1] 13
> length(a)
[1] 2
This code makes a into a two-component list, with each component itself also being a list. The concatenate function c() has an optional argument recursive, which controls whether flattening occurs when recursive lists are combined.
> c(list(a=1,b=2,c=list(d=5,e=9)))
$a
[1] 1
$b
[1] 2
$c
$c$d
[1] 5
$c$e
[1] 9
> c(list(a=1,b=2,c=list(d=5,e=9)),recursive=T)
a b c.d c.e
1 2 5 9
In the first case, we accepted the default value of recursive, which is FALSE, and obtained a recursive list, with the c component of the main list itself being another list. In the second call, with recursive set to TRUE, we got a single list as a result; only the names look recursive. (It's odd that setting recursive to TRUE gives a nonrecursive list.) Recall that our first example of lists consisted of an employee database. I mentioned that since each employee was represented as a list, the entire database would be a list of lists. That is a concrete example of recursive lists.
DATA FRAMES
On an intuitive level, a data frame is like a matrix, with a two-dimensional rows-and-columns structure. However, it differs from a matrix in that each column may have a different mode. For instance, one column may consist of numbers, and another column might have character strings. In this sense, just as lists are the heterogeneous analogs of vectors in one dimension, data frames are the heterogeneous analogs of matrices for two-dimensional data. On a technical level, a data frame is a list, with the components of that list being equal-length vectors. Actually, R does allow the components to be other types of objects, including other data frames. This gives us heterogeneous-data analogs of arrays in our analogy. But since this use of data frames is rare in practice, we will assume all components of a data frame are vectors. In this chapter, we'll present quite a few data frame examples, so you can become familiar with their variety of uses in R.
Creating Data Frames
To begin, let's take another look at our simple data frame example:
> kids <- c("Jack","Jill")
> ages <- c(12,10)
> d <- data.frame(kids,ages,stringsAsFactors=FALSE)
> d # matrix-like viewpoint
kids ages
1 Jack 12
2 Jill 10
The first two arguments in the call to data.frame() are clear: We wish to produce a data frame from our two vectors, kids and ages. However, the third argument, stringsAsFactors=FALSE, requires more comment. If the named argument stringsAsFactors is not specified, then by default, stringsAsFactors will be TRUE. (You can also use options() to arrange the opposite default.) This means that if we create a data frame from a character vector—in this case, kids—R will convert that vector to a factor. Because our work with character data will typically be with vectors rather than factors, we'll set stringsAsFactors to FALSE.
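A one-line check confirms the effect. Note that since R 4.0.0 the default for stringsAsFactors is FALSE, so setting it explicitly, as here, keeps the code's behavior the same across R versions:

```r
kids <- c("Jack","Jill")
ages <- c(12,10)
d <- data.frame(kids, ages, stringsAsFactors=FALSE)
class(d$kids)   # "character", not "factor"
```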
Accessing Data Frames
Now that we have a data frame, let’s explore a bit. Since d is a list, we can access it as such via component index values or component names:
> d[[1]]
[1] "Jack" "Jill"
> d$kids
[1] "Jack" "Jill"
But we can treat it in a matrix-like fashion as well. For example, we can view column 1:
> d[,1]
[1] "Jack" "Jill"
This matrix-like quality is also seen when we take d apart using str():
> str(d)
'data.frame': 2 obs. of 2 variables:
 $ kids: chr "Jack" "Jill"
$ ages: num 12 10
R tells us here that d consists of two observations—our two rows—that store data on two variables—our two columns. Consider three ways to access the first column of our data frame above: d[[1]], d[,1], and d$kids. Of these, the third would generally be considered clearer and, more importantly, safer than the first two. It better identifies the column and makes it less likely that you will reference the wrong one. But in writing general code—say, for R packages—the matrix-like notation d[,1] is needed, and it is especially handy if you are extracting subdata frames.
Extended Example: Regression Analysis of Exam Grades Continued
Recall our course examination data set. There, we didn’t have a header, but for this example we do, and the first few records in the file now are as follows:
"Exam 1" "Exam 2" Quiz
2.0 3.3 4.0
3.3 2.0 3.7
4.0 4.0 4.0
2.3 0.0 3.3
2.3 1.0 3.3
3.3 3.7 4.0
As you can see, each line contains the three test scores for one student. This is the classic two-dimensional file notion, like that alluded to in the preceding output of str(). Here, each line in our file contains the data for one observation in a statistical data set. The idea of a data frame is to encapsulate such data, along with variable names, into one object.
Notice that we have separated the fields here by spaces. Other delimiters may be specified, notably commas for comma-separated value (CSV) files. The variable names specified in the first record must be separated by the same delimiter as used for the data, which is spaces in this case. If the names themselves contain embedded spaces, as we have here, they must be quoted. We read in the file as before, but in this case we state that there is a header record:
examsquiz <- read.table("exams",header=TRUE)
The column names now appear, with periods replacing blanks:
> head(examsquiz)
Exam.1 Exam.2 Quiz
1 2.0 3.3 4.0
2 3.3 2.0 3.7
3 4.0 4.0 4.0
4 2.3 0.0 3.3
5 2.3 1.0 3.3
6 3.3 3.7 4.0
Other Matrix-Like Operations
Various matrix operations also apply to data frames. Most notably and usefully, we can do filtering to extract various subdata frames of interest.
Extracting Subdata Frames
As mentioned, a data frame can be viewed in row-and-column terms. In particular, we can extract subdata frames by rows or columns. Here’s an example:
> examsquiz[2:5,]
Exam.1 Exam.2 Quiz
2 3.3 2 3.7
3 4.0 4 4.0
4 2.3 0 3.3
5 2.3 1 3.3
> examsquiz[2:5,2]
[1] 2 4 0 1
> class(examsquiz[2:5,2])
[1] "numeric"
> examsquiz[2:5,2,drop=FALSE]
Exam.2
2 2
3 4
4 0
5 1
> class(examsquiz[2:5,2,drop=FALSE])
[1] "data.frame"
Note that in that second call, since examsquiz[2:5,2] is a vector, R created a vector instead of another data frame. By specifying drop=FALSE, as described for the matrix case, we can keep it as a (one-column) data frame. We can also do filtering. Here’s how to extract the subframe of all students whose first exam score was at least 3.8:
> examsquiz[examsquiz$Exam.1 >= 3.8,]
Exam.1 Exam.2 Quiz
3 4 4.0 4.0
9 4 3.3 4.0
11 4 4.0 4.0
14 4 0.0 4.0
16 4 3.7 4.0
19 4 4.0 4.0
22 4 4.0 4.0
25 4 4.0 3.3
29 4 3.0 3.7
More on Treatment of NA Values
Suppose the second exam score for the first student had been missing. Then we would have typed the following into that line when we were preparing the data file:
2.0 NA 4.0
In any subsequent statistical analyses, R would do its best to cope with the missing data. However, in some situations, we need to set the option na.rm=TRUE, explicitly telling R to ignore NA values. For instance, with the missing exam score, calculating the mean score on exam 2 by calling R’s mean() function would skip that first student in finding the mean. Otherwise, R would just report NA for the mean. Here’s a little example:
> x <- c(2,NA,4)
> mean(x)
[1] NA
> mean(x,na.rm=TRUE)
[1] 3
You were introduced to the subset() function, which saves you the trouble of specifying na.rm=TRUE. You can apply it in data frames for row selection. The column names are taken in the context of the given data frame. In our example, instead of typing this:
> examsquiz[examsquiz$Exam.1 >= 3.8,]
we could run this:
> subset(examsquiz,Exam.1 >= 3.8)
Note that we do not need to write this:
> subset(examsquiz,examsquiz$Exam.1 >= 3.8)
In some cases, we may wish to rid our data frame of any observation that has at least one NA value. A handy function for this purpose is complete.cases().
> d4
kids states
1 Jack CA
2 MA
3 Jillian MA
4 John
> complete.cases(d4)
[1] TRUE FALSE TRUE FALSE
> d5 <- d4[complete.cases(d4),]
> d5
kids states
1 Jack CA
3 Jillian MA
Cases 2 and 4 were incomplete; hence the FALSE values in the output of complete.cases(d4). We then use that output to select the intact rows.
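A sketch reconstructing a data frame like d4 above; the blank entries in the printed output correspond to NA values:

```r
d4 <- data.frame(kids   = c("Jack", NA, "Jillian", "John"),
                 states = c("CA", "MA", "MA", NA),
                 stringsAsFactors = FALSE)
complete.cases(d4)            # TRUE FALSE TRUE FALSE
d5 <- d4[complete.cases(d4),] # keep only the rows with no NA values
d5                            # rows 1 and 3 survive
```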
Using the rbind() and cbind() Functions and Alternatives
The rbind() and cbind() matrix functions work with data frames, too, provided that you have compatible sizes, of course. For instance, you can use cbind() to add a new column that has the same length as the existing columns. In using rbind() to add a row, the added row is typically in the form of another data frame or list.
> d
kids ages
1 Jack 12
2 Jill 10
> rbind(d,list("Laura",19))
kids ages
1 Jack 12
2 Jill 10
3 Laura 19
You can also create new columns from old ones. For instance, we can add a variable that is the difference between exams 1 and 2:
> eq <- cbind(examsquiz,examsquiz$Exam.2-examsquiz$Exam.1)
> class(eq)
[1] "data.frame"
> head(eq)
Exam.1 Exam.2 Quiz examsquiz$Exam.2 - examsquiz$Exam.1
1 2.0 3.3 4.0 1.3
2 3.3 2.0 3.7 -1.3
3 4.0 4.0 4.0 0.0
4 2.3 0.0 3.3 -2.3
5 2.3 1.0 3.3 -1.3
6 3.3 3.7 4.0 0.4
The new name is rather unwieldy: It’s long, and it has embedded blanks. We could change it, using the names() function, but it would be better to exploit the list basis of data frames and add a column (of the same length) to the data frame for this result:
> examsquiz$ExamDiff <- examsquiz$Exam.2 - examsquiz$Exam.1
> head(examsquiz)
Exam.1 Exam.2 Quiz ExamDiff
1 2.0 3.3 4.0 1.3
2 3.3 2.0 3.7 -1.3
3 4.0 4.0 4.0 0.0
4 2.3 0.0 3.3 -2.3
5 2.3 1.0 3.3 -1.3
6 3.3 3.7 4.0 0.4
What happened here? Since one can add a new component to an already existing list at any time, we did so: We added a component ExamDiff to the list/data frame examsquiz. We can even exploit recycling to add a column that is of a different length than those in the data frame:
> d
kids ages
1 Jack 12
2 Jill 10
> d$one <- 1
> d
kids ages one
1 Jack 12 1
2 Jill 10 1
Applying apply()
You can use apply() on data frames, if the columns are all of the same type. For instance, we can find the maximum grade for each student, as follows:
> apply(examsquiz,1,max)
[1] 4.0 3.7 4.0 3.3 3.3 4.0 3.7 3.3 4.0 4.0 4.0 3.3 4.0 4.0 3.7 4.0 3.3 3.7 4.0
[20] 3.7 4.0 4.0 3.3 3.3 4.0 4.0 3.3 3.3 4.0 3.7 3.3 3.3 3.7 2.7 3.3 4.0 3.7 3.7
[39] 3.7
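The same computation on a two-row reconstruction (values taken from the head() output shown earlier) makes the row-wise action of apply() easy to check:

```r
# first two rows of examsquiz, from the earlier printout
eq <- data.frame(Exam.1=c(2.0,3.3), Exam.2=c(3.3,2.0), Quiz=c(4.0,3.7))

# margin 1 means "apply max across each row"
apply(eq, 1, max)   # 4.0 3.7, matching the first two values above
```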
Extended Example: A Salary Study
In a study of engineers and programmers, we examined the question, “How many of these workers are the best and the brightest, that is, people of extraordinary ability?” (Some of the details have been changed here.) The government data we had available was limited. One (admittedly imperfect) way to determine whether a worker is of extraordinary ability is to look at the ratio of actual salary to the government prevailing wage for that job and location. If that ratio is substantially higher than 1.0, you can reasonably assume that this worker has a high level of talent. We used R to prepare and analyze the data and will present excerpts of our preparation code here. First we read in the data file:
all2006 <- read.csv("2006.csv",header=TRUE,as.is=TRUE)
The function read.csv() is essentially identical to read.table() except that the input data is in the CSV format exported by spreadsheets, which is the way the data set was prepared by the US Department of Labor (DOL). The as.is argument is the negation of stringsAsFactors. So, setting as.is to TRUE here is simply an alternate way to achieve stringsAsFactors=FALSE. At this point, we had a data frame, all2006, consisting of all the data for the year 2006. We then did some filtering:
all2006 <- all2006[all2006$Wage_Per=="Year",] # exclude hourly-wagers
all2006 <- all2006[all2006$Wage_Offered_From > 20000,] # exclude weird cases
all2006 <- all2006[all2006$Prevailing_Wage_Amount > 200,] # exclude hrly prv wg
These operations are typical data cleaning. Most large data sets contain some outlandish values—some are obvious errors, others use different measurement systems, and so on. We needed to remedy this situation before doing any analysis. We also needed to create a new column for the ratio between actual wage and prevailing wage:
all2006$rat <- all2006$Wage_Offered_From / all2006$Prevailing_Wage_Amount
Since we would be calculating the median in this new column for many subsets of the data, we defined a function to do the work:
medrat <- function(dataframe) {
return(median(dataframe$rat,na.rm=TRUE))
}
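A quick sanity check of medrat() on a hypothetical miniature data frame (the ratios are invented):

```r
medrat <- function(dataframe) {
  return(median(dataframe$rat, na.rm=TRUE))
}

toy <- data.frame(rat=c(1.1, 1.4, NA, 0.9))  # made-up ratio column
medrat(toy)   # 1.1, the median of the three non-NA values
```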
Note the need to exclude NA values, which are common in government data sets. We were particularly interested in three occupations and thus extracted subdata frames for them to make their analyses more convenient:
se2006 <- all2006[grep("Software Engineer",all2006),]
prg2006 <- all2006[grep("Programmer",all2006),]
ee2006 <- all2006[grep("Electronics Engineer",all2006),]
Here, we used R’s grep() function to identify the rows containing the given job title. Another aspect of interest was analysis by firm. We wrote this function to extract the subdata frame for a given firm:
makecorp <- function(corpname) {
t <- all2006[all2006$Employer_Name == corpname,]
return(t)
}
We then created subdata frames for a number of firms (only some are shown here).
corplist <- c("MICROSOFT CORPORATION","ms","INTEL CORPORATION","intel",
   "SUN MICROSYSTEMS, INC.","sun","GOOGLE INC.","google")
for (i in 1:(length(corplist)/2)) {
corp <- corplist[2*i-1]
newdtf <- paste(corplist[2*i],"2006",sep="")
assign(newdtf,makecorp(corp),pos=.GlobalEnv)
}
There’s quite a bit to discuss in the above code. First, note that we want the variables we are creating to be at the top (that is, global) level, which is the usual place one does interactive analysis. Also, we are creating our new variable names from character strings, such as "intel2006". For these reasons, the assign() function is ideal: It allows us to assign to a variable whose name is given as a string, and it enables us to specify that the assignment take place at the top level. The paste() function concatenates strings, with sep="" specifying that we don’t want any characters between the pieces in our concatenation.
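A stripped-down sketch of the same assign()/paste() idiom, using made-up variable names:

```r
# create global variables total2005 and total2006 from string names
for (yr in c("2005","2006")) {
  varname <- paste("total", yr, sep="")   # e.g. "total2006"
  assign(varname, 0, pos=.GlobalEnv)      # assign at top level by name
}
exists("total2006")   # TRUE
```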
Merging Data Frames
In the relational database world, one of the most important operations is that of a join, in which two tables can be combined according to the values of a common variable. In R, two data frames can be similarly combined using the merge() function. The simplest form is as follows:
merge(x,y)
This merges data frames x and y. It assumes that the two data frames have one or more columns with names in common. Here’s an example:
> d1
kids states
1 Jack CA
2 Jill MA
3 Jillian MA
4 John HI
> d2
ages kids
1 10 Jill
2 7 Lillian
3 12 Jack
> d <- merge(d1,d2)
> d
kids states ages
1 Jack CA 12
2 Jill MA 10
Here, the two data frames have the variable kids in common. R found the rows in which this variable had the same value in both data frames (the ones for Jack and Jill). It then created a data frame with the corresponding rows and with columns taken from both data frames (kids, states, and ages). The merge() function has named arguments by.x and by.y, which handle cases in which variables carry the same information but have different names in the two data frames. Here’s an example:
> d3
ages pals
1 12 Jack
2 10 Jill
3 7 Jillian
> merge(d1,d3,by.x="kids",by.y="pals")
kids states ages
1 Jack CA 12
2 Jill MA 10
Even though our variable was called kids in one data frame and pals in the other, it was meant to store the same information, and thus the merge made sense. Duplicate matches will appear in full in the result, possibly in undesirable ways.
> d1
kids states
1 Jack CA
2 Jill MA
3 Jillian MA
4 John HI
> d2a <- rbind(d2,list(15,"Jill"))
> d2a
ages kids
1 10 Jill
2 7 Lillian
3 12 Jack
4 15 Jill
> merge(d1,d2a)
kids states ages
1 Jack CA 12
2 Jill MA 10
3 Jill MA 15
There are two Jills in d2a. There is a Jill in d1 who lives in Massachusetts and another Jill with unknown residence. In our previous example, merge(d1,d2), there was only one Jill, who was presumed to be the same person in both data frames. But here, in the call merge(d1,d2a), it may have been the case that only one of the Jills was a Massachusetts resident. It is clear from this little example that you must choose matching variables with great care.
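Reconstructing the two frames confirms the duplicate-match behavior:

```r
d1  <- data.frame(kids=c("Jack","Jill","Jillian","John"),
                  states=c("CA","MA","MA","HI"), stringsAsFactors=FALSE)
d2a <- data.frame(ages=c(10,7,12,15),
                  kids=c("Jill","Lillian","Jack","Jill"), stringsAsFactors=FALSE)

m <- merge(d1,d2a)
nrow(m)               # 3 rows: Jack matched once, Jill matched twice
sum(m$kids == "Jill") # 2
```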
Extended Example: An Employee Database
The following is an adaptation of one of my consulting projects. At issue was whether older workers were faring as well as younger ones. I had data on several variables, such as age and performance ratings, which I used in my comparison of the older and younger employees. I also had employee ID numbers, which were crucial in being able to connect the two data files: DA and DB. The DA file had this header:
"EmpID","Perf 1","Perf 2","Perf 3","Job Title"
These are names for the employee ID, three performance ratings, and the job title. DB had no header. The variables again began with the ID, followed by start and end dates of employment. Both files were in CSV format. Part of my data-cleaning phase consisted of checking that each record contained the proper number of fields. DA, for example, should have five fields per record. Here is the check:
> count.fields("DA",sep=",")
[1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
5 5 5 5
…
Here, we specified that the file DA had fields separated by commas. The function then reported the number of fields in each record of the file, which fortunately were all 5s. We could have used all() to check this, rather than checking it visually, via this call:
all(count.fields("DA",sep=",") >= 5)
A return value of TRUE would mean everything is fine. Alternatively, we could have used this form:
table(count.fields("DA",sep=","))
We would then get counts of the numbers of records with five fields, four fields, six fields, and so on. After this check we then read in the files as data frames:
da <- read.csv("DA",header=TRUE,stringsAsFactors=FALSE)
db <- read.csv("DB",header=FALSE,stringsAsFactors=FALSE)
We wanted to check for possible spelling errors in the various fields, so we ran the following code:
for (col in 1:6)
print(unique(sort(da[,col])))
This gave us a list of the distinct values in each column so that we could visually scan for incorrect spellings. We needed to merge the two data frames, matching by employee ID, so we ran the following code:
mrg <- merge(da,db,by.x=1,by.y=1)
We specified that the first column would be the merge variable in both cases. (As remarked earlier, we could also have used field names rather than numbers here.)
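The field-count check from earlier in this example can be tried end to end on a throwaway file (the header below just mimics DA’s):

```r
# write a tiny CSV and confirm every record has exactly 5 fields
tf <- tempfile()
writeLines(c("EmpID,Perf 1,Perf 2,Perf 3,Job Title",
             "1,3.2,3.8,3.5,Engineer"), tf)

table(count.fields(tf, sep=","))   # every record has 5 fields
```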
Applying Functions to Data Frames
As with lists, you can use the lapply and sapply functions with data frames.
Using lapply() and sapply() on Data Frames
Keep in mind that data frames are special cases of lists, with the list components consisting of the data frame’s columns. Thus, if you call lapply() on a data frame with a specified function f(), then f() will be called on each of the frame’s columns, with the return values placed in a list. For instance, with our previous example, we can use lapply as follows:
> d
kids ages
1 Jack 12
2 Jill 10
> dl <- lapply(d,sort)
> dl
$kids
[1] "Jack" "Jill"
$ages
[1] 10 12
So, dl is a list consisting of two vectors, the sorted versions of kids and ages. Note that dl is just a list, not a data frame. We could coerce it to a data frame, like this:
> as.data.frame(dl)
kids ages
1 Jack 10
2 Jill 12
But this would make no sense, as the correspondence between names and ages has been lost. Jack, for instance, is now listed as 10 years old instead of 12.
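If the intent was to sort the data frame by age while keeping each kid paired with the right age, order() on a single column is the tool, since it rearranges whole rows:

```r
d <- data.frame(kids=c("Jack","Jill"), ages=c(12,10), stringsAsFactors=FALSE)
d[order(d$ages),]   # Jill/10 first, then Jack/12; rows stay intact
```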
Extended Example: Applying Logistic Regression Models
Let’s run a logistic regression model on the abalone data we used before, predicting gender from the other eight variables: height, weight, rings, and so on, one at a time. The logistic model is used to predict a 0- or 1-valued random variable Y from one or more explanatory variables. The function value is the probability that Y = 1, given the explanatory variables. Let’s say we have just one of the latter, X. Then the model is as follows:

Pr(Y = 1 | X = t) = 1 / (1 + exp(−(β_{0} + β_{1}t)))
As with linear regression models, the β_{i} values are estimated from the data, using the function glm() with the argument family=binomial. We can use sapply() to fit eight single-predictor models, one for each of the eight variables other than gender in this data set, all in just one line of code.
aba <- read.csv("abalone.data",header=T)
abamf <- aba[aba$Gender != "I",] # exclude infants from the analysis
lftn <- function(clmn) {
   glm(abamf$Gender ~ clmn, family=binomial)$coef
}
loall <- sapply(abamf[,-1],lftn)
In the first two lines, we read in the data frame and then exclude the observations for infants. In the last line, we call sapply() on the subdata frame in which column 1, named Gender, has been excluded. In other words, this is an eight-column subframe consisting of our eight explanatory variables. Thus, lftn() is called on each column of that subframe.
Taking as input a column from the subframe, accessed via the formal argument clmn, the call to glm() fits a logistic model that predicts gender from that column and hence from that explanatory variable. Recall that the ordinary regression function lm() returns a class “lm” object containing many components, one of which is $coefficients, the vector of estimated β_{i}. This component is also in the return value of glm(). Also recall that list component names can be abbreviated if there is no ambiguity; here, we’ve shortened coefficients to coef. In the end, sapply() returns eight pairs of estimated β_{i}. Let’s check it out:
> loall
Length Diameter Height WholeWt ShuckedWt ViscWt
(Intercept) 1.275832 1.289130 1.027872 0.4300827 0.2855054 0.4829153
clmn -1.962613 -2.533227 -5.643495 -0.2688070 -0.2941351 -1.4647507
ShellWt Rings
(Intercept) 0.5103942 0.64823569
clmn -1.2135496 -0.04509376
Sure enough, we get a 2-by-8 matrix, with the j^{th} column giving the pair of estimated β_{i} values obtained when we do a logistic regression using the j^{th} explanatory variable.
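If the abalone file is not at hand, the same sapply()-over-columns pattern can be exercised on synthetic data (the response and predictor names here are invented):

```r
set.seed(1)
y <- factor(sample(c("F","M"), 100, replace=TRUE))  # fake two-level response
x <- data.frame(a=rnorm(100), b=rnorm(100))         # two fake predictors

# one single-predictor logistic fit per column
lftn  <- function(clmn) glm(y ~ clmn, family=binomial)$coef
loall <- sapply(x, lftn)
dim(loall)   # 2 x 2: intercept and slope for each predictor
```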
We could actually get the same result using the ordinary matrix/data frame apply() function, but when we tried it, we found that method somewhat slower. The discrepancy may be due to the matrix allocation time. Note the class of the object returned by glm() itself:

> class(glm(abamf$Gender ~ abamf$Length, family=binomial))
[1] "glm" "lm"

This says that a glm() object actually has two classes: “glm” and “lm”. This is because the “glm” class is a subclass of “lm”.
Extended Example: Aids for Learning Chinese Dialects
Standard Chinese, often referred to as Mandarin outside China, is officially termed putonghua or guoyu. It is spoken today by the vast majority of people in China and among many ethnic Chinese outside China, but the dialects, such as Cantonese and Shanghainese, still enjoy wide usage too. Thus, a Chinese businessman in Beijing who intends to do business in Hong Kong may find it helpful to learn some Cantonese. Similarly, many in Hong Kong may wish to improve their Mandarin. Let’s see how such a learning process might be shortened and how R can help.
The differences among the dialects are sometimes startling. The character for “down,” 下, is pronounced xia in Mandarin, ha in Cantonese, and wu in Shanghainese. Indeed, because of these differences, and differences in grammar as well, many linguists consider these tongues separate languages rather than dialects. We will call them fangyan (meaning “regional speech”) here, the Chinese term.
Let’s see how R can help speakers of one fangyan learn another one. The key is that there are often patterns in the correspondences between the fangyans. For instance, the initial consonant transformation x → h seen for 下 in the previous paragraph (xia → ha) is common, arising also in characters such as 香 (meaning “fragrant”), pronounced xiang in Mandarin and heung in Cantonese. Note, too, the transformation iang → eung for the non-initial consonant portions of these pronunciations, which is also common. Knowing transformations such as these could speed up the learning curve considerably for the Mandarin-speaking learner of Cantonese, which is the setting we’ll illustrate here.
We haven’t mentioned the tones yet. All the fangyan are tonal, and sometimes there are patterns there as well, potentially providing further learning aids. However, this avenue will not be pursued here. You’ll see that our code does need to make some use of the tones, but we will not attempt to analyze how tones transform from one fangyan to another. For simplicity, we also will not consider characters beginning with vowels, characters that have more than one reading, toneless characters, and other refinements.
Though the initial consonant x in Mandarin often maps to h, as seen previously, it also often maps to s, y, and other consonants. For example, the character xie, 謝, in the famous Mandarin term xiexie (for “thank you”) is pronounced je in Cantonese. Here, there is an x → j transformation on the consonant. It would be very helpful for the learner to have a list of transformations and their frequencies of occurrence. This is a job made for R! The function mapsound(), shown a little later in the chapter, does exactly this. It relies on some support functions, also to be presented shortly. To explain what mapsound() does, let’s devise some terminology, illustrated by the x → h example earlier. We’ll call x the source value, with h, s, and so on being the mapped values. Here are the formal parameters:
- df: A data frame consisting of the pronunciation data of two fangyan
- fromcol and tocol: Names in df of the source and mapped columns
- sourceval: The source value to be mapped, such as x in the preceding example
Here is the head of a typical two-fangyan data frame, canman8, that would be used for df:
The function returns a list consisting of two components:
- counts: A vector of integers, indexed by the mapped values, showing the counts of those values. The elements of the vector are named according to the mapped values.
- images: A list of character vectors. Again, the indices of the list are the mapped values, and each vector consists of all the characters that correspond to the given mapped value.
To make this concrete, let’s try it out:
> m2cx <- mapsound(canman8,"Man cons","Can cons","x")
> m2cx$counts
ch f g h j k kw n s y
15 2 1 87 1 4 2 1 81 21
We see that x maps to ch 15 times, to f 2 times, and so on. Note that we could have applied sort() to m2cx$counts to view the mapped values in order of frequency. The Mandarin-speaking learner of Cantonese can then see that if he wishes to know the Cantonese pronunciation of a word whose Mandarin romanized form begins with x, the Cantonese most likely begins with h or s. Little aids like this should help the learning process quite a bit.
To try to discern more patterns, the learner may wish to determine in which characters x maps to ch, for example. We know from the result of the preceding example that there are 15 such characters. Which ones are they? That information is stored in images. The latter, as mentioned, is a list of vectors; the vector of interest is the component for ch, m2cx$images[["ch"]].
Now, let’s look at the code. Before viewing the code for mapsound() itself, let’s consider another routine we need for support. It is assumed here that the data frame df that is input to mapsound() is produced by merging two frames for individual fangyans. In this case, for instance, the head of the Cantonese input frame is as follows:
> head(can8)
Ch char Can
1 一 yat1
2 乙 yuet3
3 丁 ding1
4 七 chat1
5 乃 naai5
6 九 gau2
The one for Mandarin is similar. We need to merge these two frames into canman8, seen earlier. I’ve written the code so that this operation not only combines the frames but also separates the romanization of a character into initial consonant, the remainder of the romanization, and a tone number. For example, ding1 is separated into d, ing, and 1.
We could similarly explore transformations in the other direction, from Cantonese to Mandarin, and involving the nonconsonant remainders of characters. For example, this call determines which characters have eung as the nonconsonant portion of their Cantonese pronunciation:
> c2meung <- mapsound(canman8,"Can sound","Man sound","eung")
We could then investigate the associated Mandarin sounds. Here is the code to accomplish all this:
# merges data frames for 2 fangyans
merge2fy <- function(fy1,fy2) {
outdf <- merge(fy1,fy2)
# separate tone from sound, and create new columns
for (fy in list(fy1,fy2)) {
# saplout will be a matrix, init cons in row 1, remainders in row
# 2, and tones in row 3
saplout <- sapply((fy[[2]]),sepsoundtone)
# convert it to a data frame
tmpdf <- data.frame(fy[,1],t(saplout),row.names=NULL,
stringsAsFactors=F)
# add names to the columns
consname <- paste(names(fy)[[2]]," cons",sep="")
restname <- paste(names(fy)[[2]]," sound",sep="")
tonename <- paste(names(fy)[[2]]," tone",sep="")
names(tmpdf) <- c("Ch char",consname,restname,tonename)
# need to use merge(), not cbind(), due to possibly different
# ordering of fy, outdf
outdf <- merge(outdf,tmpdf)
}
return(outdf)
}
# separates romanized pronunciation pronun into initial consonant, if any,
# the remainder of the sound, and the tone, if any
sepsoundtone <- function(pronun) {
nchr <- nchar(pronun)
vowels <- c("a","e","i","o","u")
# how many initial consonants?
numcons <- 0
for (i in 1:nchr) {
ltr <- substr(pronun,i,i)
if (!ltr %in% vowels) numcons <- numcons + 1 else break
}
cons <- if (numcons > 0) substr(pronun,1,numcons) else NA
tone <- substr(pronun,nchr,nchr)
numtones <- !(tone %in% letters) # 1 if there is a tone digit, else 0
if (numtones == 0) tone <- NA
therest <- substr(pronun,numcons+1,nchr-numtones)
return(c(cons,therest,tone))
}
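To watch the decomposition in isolation, here is a self-contained copy of sepsoundtone(); note that the tone test is negated so that a trailing digit is treated as the tone and stripped from the remainder:

```r
sepsoundtone <- function(pronun) {
  nchr <- nchar(pronun)
  vowels <- c("a","e","i","o","u")
  numcons <- 0                       # count the initial consonants
  for (i in 1:nchr) {
    ltr <- substr(pronun,i,i)
    if (!ltr %in% vowels) numcons <- numcons + 1 else break
  }
  cons <- if (numcons > 0) substr(pronun,1,numcons) else NA
  tone <- substr(pronun,nchr,nchr)
  numtones <- !(tone %in% letters)   # TRUE (1) if the last char is a tone digit
  if (numtones == 0) tone <- NA
  therest <- substr(pronun,numcons+1,nchr-numtones)
  return(c(cons,therest,tone))
}

sepsoundtone("ding1")   # "d" "ing" "1"
```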
So, even the merging code is not so simple. And this code makes some simplifying assumptions, excluding some important cases. Textual analysis is never for the faint of heart! Not surprisingly, the merging process begins with a call to merge(). This creates a new data frame, outdf, to which we will append new columns for the separated sound components.
The real work, then, involves the separation of a romanization into its sound components. For that, there is a loop across the two input data frames. In each iteration, the current data frame is split into sound components, with the result appended to outdf via a second call to merge(). Note the comment preceding that call regarding the unsuitability of cbind() in this situation. The actual separation into sound components is done by the sapply() call. Here, we take a column of romanizations, such as the following:
yat1
yuet3
ding1
chat1
naai5
gau2
We split it into three columns, consisting of initial consonant, remainder of the sound, and tone. For instance, yat1 will be split into y, at, and 1. This is a very natural candidate for some kind of “apply” function, and indeed sapply() is used. Of course, this call requires that we write a suitable function to be applied. (If we had been lucky, there would have been an existing R function that worked, but no such good fortune here.) The function we use is sepsoundtone(). That function makes heavy use of R’s substr() (for substring) function. In its loop, for example, we iterate until we collect all the initial consonants, such as ch. The return value consists of the three sound components extracted from the given romanized form, the formal parameter pronun.
Note the use of R’s built-in constant, letters. We use this to sense whether a given character is numeric, which would mean it’s a tone; some romanizations are toneless. The sapply() call thus returns a matrix with one row for each of the three sound components. We wish to convert this to a data frame for merging with outdf, and we prepare for this with the call to data.frame().
Note that we call the matrix transpose function t() to put our information into columns rather than rows. This is needed because data-frame storage is by columns. Also, we include a column fy[,1], the Chinese characters themselves, so that the call to merge() has a column in common. Now let’s turn to the code for mapsound(), which actually is simpler than the preceding merging code.
mapsound <- function(df,fromcol,tocol,sourceval) {
base <- which(df[[fromcol]] == sourceval)
basedf <- df[base,]
# determine which rows of basedf correspond to the various mapped
# values
sp <- split(basedf,basedf[[tocol]])
retval <- list()
retval$counts <- sapply(sp,nrow)
retval$images <- sp
return(retval)
}
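A toy run makes the split()/sapply() logic concrete. The function is repeated here so the snippet stands alone, and the column names mc (Mandarin consonant) and cc (Cantonese consonant) are hypothetical stand-ins for the real ones:

```r
mapsound <- function(df,fromcol,tocol,sourceval) {
  base <- which(df[[fromcol]] == sourceval)
  basedf <- df[base,]
  sp <- split(basedf,basedf[[tocol]])
  retval <- list()
  retval$counts <- sapply(sp,nrow)
  retval$images <- sp
  return(retval)
}

toy <- data.frame(char=c("A","B","C","D"),
                  mc=c("x","x","x","b"),
                  cc=c("h","s","h","p"),
                  stringsAsFactors=FALSE)

r <- mapsound(toy,"mc","cc","x")
r$counts   # x maps to h twice and to s once
```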
Recall that the argument df is the two-fangyan data frame, output from merge2fy(). The arguments fromcol and tocol are the names of the source and mapped columns. The string sourceval is the source value to be mapped. For concreteness, consider the earlier examples in which sourceval was x. The first task is to determine which rows in df correspond to sourceval. This is accomplished via a straightforward application of which(). That information is then used to extract the relevant subdata frame, basedf.
In that latter frame, consider the form that basedf[[tocol]] takes in the call to split(). These will be the values that x maps to, that is, ch, h, and so on. The purpose of the split() call is to determine which rows of basedf contain which of these mapped values. We’ll discuss split() in detail later, but the salient point is that sp will be a list of data frames: one for ch, one for h, and so on. This sets up the computation of counts. Since sp is a list of data frames, one for each mapped value, applying the nrow() function via sapply() gives us the counts of the numbers of characters for each of the mapped values, such as the number of characters in which the map x → ch occurs (15 times, as seen in the example call).
The complexity of the code here makes this a good time to comment on programming style. Some readers may point out, correctly, that the which() and subsetting steps could be replaced by a one-liner:
basedf <- df[df[[fromcol]] == sourceval,]
But to me, that line, with its numerous brackets, is harder to read. My personal preference is to break down operations if they become too complex. Similarly, the last few lines of code could be compacted to another one-liner:
list(counts=sapply(sp,nrow),images=sp)
Among other things, this dispenses with the return(), conceivably speeding up the code. Recall that in R, the last value computed by a function is automatically returned anyway, without a return() call. However, the time savings here are really small and rarely matter, and again, my personal belief is that including the return() call is clearer.