# R PROGRAMMING CHAPTER 1

As detailed in the introduction, R is an extremely versatile open source programming language for statistics and data science. It is widely used in every field where there is data—business, industry, government, medicine, academia, and so on.

In this chapter, you’ll get a quick introduction to R—how to invoke it, what it can do, and what files it uses. We’ll cover just enough to give you the basics you need to work through the examples in the next few chapters, where the details will be presented.

R may already be installed on your system, if your employer or university has made it available to users.

**How to Run R**

R operates in two modes: *interactive* and *batch*. The one typically used is interactive mode. In this mode, you type in commands, R displays results, you type in more commands, and so on. On the other hand, batch mode does not require interaction with the user. It’s useful for production jobs, such as when a program must be run periodically, say once per day, because you can automate the process.

#### *Interactive Mode*


On a Linux or Mac system, start an R session by typing R on the command line in a terminal window. On a Windows machine, start R by clicking the R icon. The result is a greeting and the R prompt, which is the > sign. The screen will look something like this:

**R version 2.10.0 (2009-10-26)**

**Copyright (C) 2009 The R Foundation for Statistical Computing**

**ISBN 3-900051-07-0**

**…**

**Type 'demo()' for some demos, 'help()' for on-line help, or**

**'help.start()' for an HTML browser interface to help.**

**Type 'q()' to quit R.**

**>**

You can then execute R commands. The window in which all this appears is called the R *console*.

As a quick example, consider a standard normal distribution—that is, with mean 0 and variance 1. If a random variable *X* has that distribution, then its values are centered around 0, some negative, some positive, averaging in the end to 0. Now form a new random variable *Y* = |*X*|. Since we’ve taken the absolute value, the values of *Y* will *not* be centered around 0, and the mean of *Y* will be positive.

Let’s find the mean of *Y*. Our approach is based on a simulated example of *N*(0,1) variates.

**> mean(abs(rnorm(100)))**

**[1] 0.7194236**

This code generates the 100 random variates, finds their absolute values, and then finds the mean of the absolute values.

The [1] you see means that the first item in this line of output is item 1. In this case, our output consists of only one line (and one item), so this is redundant. This notation becomes helpful when you need to read voluminous output that consists of a lot of items spread over many lines. For example, if there were two rows of output with six items per row, the second row would be labeled [7].

**> rnorm(10)**

**[1] -0.6427784 -1.0416696 -1.4020476 -0.6718250 -0.9590894 -0.8684650**

**[7] -0.5974668 0.6877001 1.3577618 -2.2794378**

Here, there are 10 values in the output, and the label [7] in the second row lets you quickly see that 0.6877001, for instance, is the eighth output item.

You can also store R commands in a file. By convention, R code files have the suffix *.R* or *.r*. If you create a code file called *z.R*, you can execute the contents of that file by issuing the following command:

**> source("z.R")**

#### *Batch Mode*


Sometimes it’s convenient to automate R sessions. For example, you may wish to run an R script that generates a graph without needing to bother with manually launching R and executing the script yourself. Here you would run R in batch mode. As an example, let’s put our graph-making code into a file named *z.R* with the following contents:

**pdf("xh.pdf") # set graphical output file**

**hist(rnorm(100)) # generate 100 N(0,1) variates and plot their histogram**

**dev.off() # close the graphical output file**

The items marked with # are *comments*. They’re ignored by the R interpreter. Comments serve as notes to remind us and others what the code is doing, in a human-readable format.

Here’s a step-by-step breakdown of what we’re doing in the preceding code:

- We call the pdf() function to inform R that we want the graph we create to be saved in the PDF file *xh.pdf*.

- We call rnorm() (for *random normal*) to generate 100 *N*(0,1) random variates.

- We call hist() on those variates to draw a histogram of these values.

- We call dev.off() to close the graphical “device” we are using, which is the file *xh.pdf* in this case. This is the mechanism that actually causes the file to be written to disk.

We could run this code automatically, without entering R’s interactive mode, by invoking R with an operating system shell command (such as at the $ prompt commonly used in Linux systems):

**$ R CMD BATCH z.R**

You can confirm that this worked by using your PDF viewer to display the saved histogram. (It will just be a plain-vanilla histogram, but R is capable of producing quite sophisticated variations.)

**A First R Session**

Let’s make a simple data set (in R parlance, a *vector*) consisting of the numbers 1, 2, and 4, and name it x:

**> x <- c(1,2,4)**

The standard assignment operator in R is <-. You can also use =, but this is discouraged, as it does not work in some special situations. Note that there are no fixed types associated with variables. Here, we’ve assigned a vector to x, but later we might assign something of a different type to it.

The c stands for *concatenate*. Here, we are concatenating the numbers 1, 2, and 4. More precisely, we are concatenating three one-element vectors that consist of those numbers. This is because any number is also considered to be a one-element vector. Now we can also do the following:

**> q <- c(x,x,8)**

which sets q to (1,2,4,1,2,4,8) (yes, including the duplicates).

Now let’s confirm that the data is really in x. To print the vector to the screen, simply type its name. If you type any variable name (or, more generally, any expression) while in interactive mode, R will print out the value of that variable (or expression). Programmers familiar with other languages such as Python will find this feature familiar. For our example, enter this:

**> x**

**[1] 1 2 4**

Yep, sure enough, x consists of the numbers 1, 2, and 4. Individual elements of a vector are accessed via [ ]. Here’s how we can print out the third element of x:

**> x[3]**

**[1] 4**

As in other languages, the selector (here, 3) is called the *index* or *subscript*. Those familiar with ALGOL-family languages, such as C and C++, should note that elements of R vectors are indexed starting from 1, not 0.

*Subsetting* is a very important operation on vectors. Here’s an example:

**> x <- c(1,2,4)**

**> x[2:3]**

**[1] 2 4**

The expression x[2:3] refers to the subvector of x consisting of elements 2 through 3, which are 2 and 4 here.

We can easily find the mean and standard deviation of our data set, as follows:

> mean(x)

[1] 2.333333

> sd(x)

[1] 1.527525

This again demonstrates typing an expression at the prompt in order to print it. In the first line, our expression is the function call mean(x). The return value from that call is printed automatically, without requiring a call to R’s print() function.

If we want to save the computed mean in a variable instead of just printing it to the screen, we could execute this code:

**> y <- mean(x)**

Again, let’s confirm that y really does contain the mean of x:

**> y**

**[1] 2.333333**

As noted earlier, we use # to write comments, like this:

**> y # print out y**

**[1] 2.333333**

Comments are especially valuable for documenting program code, but they are useful in interactive sessions, too, since R records the command history (as discussed in Section 1.6). If you save your session and resume it later, the comments can help you remember what you were doing.

Finally, let’s do something with one of R’s internal data sets (these are used for demos). You can get a list of these data sets by typing the following:

**> data()**

One of the data sets is called Nile and contains data on the flow of the Nile River. Let’s find the mean and standard deviation of this data set:

> mean(Nile)

[1] 919.35

> sd(Nile)

[1] 169.2275

We can also plot a histogram of the data:

**> hist(Nile)**

A window pops up with the histogram in it, as shown in Figure 1-1. This graph is bare-bones simple, but R has all kinds of optional bells and whistles for plotting. For instance, you can change the number of bins by specifying the breaks argument. The call hist(z,breaks=12) would draw a histogram of the data set z with 12 bins. You can also create nicer labels, make use of color, and make many other changes to create a more informative and eye-appealing graph. When you become more familiar with R, you’ll be able to construct complex, rich color graphics of striking beauty.

*Figure 1-1: Nile data, plain presentation*
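As an illustrative sketch of the customizations mentioned above (breaks, col, main, and xlab are standard arguments to base R’s hist(); the specific labels here are my own):

```r
# Nile histogram with about 12 bins, a color fill, and descriptive labels
hist(Nile,
     breaks = 12,                               # request roughly 12 bins
     col    = "lightblue",                      # fill color for the bars
     main   = "Annual Flow of the Nile River",  # plot title
     xlab   = "Flow")                           # x-axis label
```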

Well, that’s the end of our first, five-minute introduction to R. Quit R by calling the q() function (or alternatively by pressing CTRL-D in Linux or CMD-D on a Mac):

**> q()**

**Save workspace image? [y/n/c]: n**

That last prompt asks whether you want to save your variables so that you can resume work later. If you answer y, then all those objects will be loaded automatically the next time you run R. This is a very important feature, especially when working with large or numerous data sets. Answering y here also saves the session’s command history. We’ll talk more about saving your workspace and the command history in Section 1.6.

**Introduction to Functions**

As in most programming languages, the heart of R programming consists of writing *functions*. A function is a group of instructions that takes inputs, uses them to compute other values, and returns a result. As a simple introduction, let’s define a function named oddcount(), whose purpose is to count the odd numbers in a vector of integers. Normally, we would compose the function code using a text editor and save it in a file, but in this quick-and-dirty example, we’ll enter it line by line in R’s interactive mode. We’ll then call the function on a couple of test cases.

# counts the number of odd integers in x

> oddcount <- function(x) {

+ k <- 0 # assign 0 to k

+ for (n in x) {

+ if (n %% 2 == 1) k <- k+1 # %% is the modulo operator

+ }

+ return(k)

+ }

> oddcount(c(1,3,5))

[1] 3

> oddcount(c(1,2,3,7,9))

[1] 4

First, we told R that we wanted to define a function named oddcount with one argument, x. The left brace demarcates the start of the body of the function. We wrote one R statement per line. Until the body of the function is finished, R reminds you that you’re still in the definition by using + as its prompt, instead of the usual >. (Actually, + is a line-continuation character, not a prompt for a new input.) R resumes the > prompt after you finally enter a right brace to conclude the function body.

After defining the function, we evaluated two calls to oddcount(). Since there are three odd numbers in the vector (1,3,5), the call oddcount(c(1,3,5)) returns the value 3. There are four odd numbers in (1,2,3,7,9), so the second call returns 4. Notice that the modulo operator for remainder arithmetic is %% in R, as indicated by the comment. For example, 38 divided by 7 leaves a remainder of 3:

**> 38 %% 7**

**[1] 3**

Returning to the loop in oddcount(), let’s see what happens with the following code:

**for (n in x) {**

**if (n %% 2 == 1) k <- k+1**

**}**

First, it sets n to x[1], and then it tests that value for being odd or even. If the value is odd, which is the case here, the count variable k is incremented. Then n is set to x[2], tested for being odd or even, and so on.

By the way, C/C++ programmers might be tempted to write the preceding loop like this:

**for (i in 1:length(x)) {**

**if (x[i] %% 2 == 1) k <- k+1**

**}**

Here, length(x) is the number of elements in x. Suppose there are 25 elements. Then 1:length(x) means 1:25, which in turn means 1,2,3,…,25. This code would also work (unless x were to have length 0), but one of the major themes of R programming is to avoid loops if possible; if not, keep loops simple. Look again at our original formulation:

**for (n in x) {**

**if (n %% 2 == 1) k <- k+1**

**}**

It’s simpler and cleaner, as we do not need to resort to using the length() function and array indexing. At the end of the code, we use the return statement:

**return(k)**

This has the function return the computed value of k to the code that called it. However, simply writing the following also works:

**k**

R functions will return the last value computed if there is no explicit return() call. However, this approach must be used with care.

In programming language terminology, x is the *formal argument* (or *formal parameter*) of the function oddcount(). In the first function call in the preceding example, c(1,3,5) is referred to as the *actual argument*. These terms allude to the fact that x in the function definition is just a placeholder, whereas c(1,3,5) is the value actually used in the computation. Similarly, in the second function call, c(1,2,3,7,9) is the actual argument.
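To illustrate both points, here is a hedged sketch (the names oddcount2 and oddcount3 are my own, not from the text): one version of the odd-counting function that relies on implicit return of the last evaluated expression, and a loop-free version in R’s vectorized style:

```r
# same logic as oddcount(), but without an explicit return() call;
# the last expression evaluated, k, becomes the return value
oddcount2 <- function(x) {
  k <- 0
  for (n in x) {
    if (n %% 2 == 1) k <- k + 1
  }
  k
}

# loop-free version: x %% 2 == 1 yields a vector of TRUEs and FALSEs,
# and sum() counts the TRUEs (TRUE counts as 1, FALSE as 0)
oddcount3 <- function(x) sum(x %% 2 == 1)

oddcount2(c(1,3,5))      # 3
oddcount3(c(1,2,3,7,9))  # 4
```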

#### *Variable Scope*

A variable that is visible only within a function body is said to be *local* to that function. In oddcount(), k and n are local variables. They disappear after the function returns:

**> oddcount(c(1,2,3,7,9))**

**[1] 4**

**> n**

**Error: object 'n' not found**

It’s very important to note that the formal parameters in an R function are local variables. Suppose we make the following function call:

**> z <- c(2,6,7)**

**> oddcount(z)**

Now suppose that the code of oddcount() changes x. Then z would *not* change. After the call to oddcount(), z would have the same value as before. To evaluate a function call, R copies each actual argument to the corresponding local parameter variable, and changes to that variable are not visible outside the function.
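A small sketch of this copy behavior (the function f here is hypothetical, not from the text):

```r
z <- c(2,6,7)

f <- function(x) {
  x[1] <- 999   # changes only the local copy of the argument
  x
}

f(z)  # returns 999 6 7
z     # still 2 6 7; the caller's variable is untouched
```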

Variables created outside functions are *global* and are available within functions as well. Here’s an example:

> f <- function(x) return(x+y)

> y <- 3

> f(5)

[1] 8

Here y is a global variable.

A global variable can be written to from within a function by using R’s *superassignment operator*, <<-.
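A minimal sketch of <<- in action (the variable and function names here are my own):

```r
tally <- 0  # a global variable

addtotally <- function(n) {
  tally <<- tally + n  # <<- writes to the global tally, not a local copy
}

addtotally(3)
addtotally(4)
tally  # now 7
```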

#### *Default Arguments*

R also makes frequent use of *default arguments*. Consider a function defini-tion like this:

**> g <- function(x,y=2,z=T) { … }**

Here y will be initialized to 2 if the programmer does not specify y in the call. Similarly, z will have the default value TRUE. Now consider this call:

**> g(12,z=FALSE)**

Here, the value 12 is the actual argument for x, and we accept the default value of 2 for y, but we override the default for z, setting its value to FALSE.

The preceding example also demonstrates that, like many programming languages, R has a *Boolean* type; that is, it has the logical values TRUE and FALSE.

**NOTE** *R allows TRUE and FALSE to be abbreviated to T and F. However, you may choose not to abbreviate these values, to avoid trouble if you have a variable named T or F.*

**Preview of Some Important R Data Structures**

R has a variety of data structures. Here, we will sketch some of the most frequently used structures to give you an overview of R before we dive into the details. This way, you can at least get started with some meaningful examples, even if the full story behind them must wait.

#### *Vectors, the R Workhorse*

The vector type is really the heart of R. It’s hard to imagine R code, or even an interactive R session, that doesn’t involve vectors.

The elements of a vector must all have the same *mode*, or data type. You can have a vector consisting of three character strings (of mode character) or three integer elements (of mode integer), but not a vector with one integer element and two character string elements.

**Scalars**

*Scalars*, or individual numbers, do not really exist in R. As mentioned earlier, what appear to be individual numbers are actually one-element vectors. Consider the following:

**> x <- 8**

**> x**

**[1] 8**

Recall that the [1] here signifies that the following row of numbers begins with element 1 of a vector—in this case, x[1]. So you can see that R was indeed treating x as a vector, albeit a vector with just one element.

#### *Character Strings*

Character strings are actually single-element vectors of mode character (rather than mode numeric):

> x <- c(5,12,13)

> x

[1] 5 12 13

> length(x)

[1] 3

> mode(x)

[1] "numeric"

> y <- "abc"

> y

[1] "abc"

> length(y)

[1] 1

> mode(y)

[1] "character"

> z <- c("abc","29 88")

> length(z)

[1] 2

> mode(z)

[1] "character"

In the first example, we create a vector x of numbers, thus of mode numeric. Then we create two vectors of mode character: y is a one-element (that is, one-string) vector, and z consists of two strings. R has various string-manipulation functions. Many deal with putting strings together or taking them apart, such as the two shown here:

> u <- paste("abc","de","f") # concatenate the strings

> u

[1] "abc de f"

> v <- strsplit(u," ") # split the string according to blanks

> v

[[1]]

[1] "abc" "de" "f"

#### *Matrices*


An R matrix corresponds to the mathematical concept of the same name: a rectangular array of numbers. Technically, a matrix is a vector, but with two additional attributes: the number of rows and the number of columns. Here is some sample matrix code:

> m <- rbind(c(1,4),c(2,2))

> m

[,1] [,2]

[1,] 1 4

[2,] 2 2

> m %*% c(1,1)

[,1]

[1,] 5

[2,] 4

First, we use the rbind() (for *row bind*) function to build a matrix from two vectors that will serve as its rows, storing the result in m. (A corresponding function, cbind(), combines several columns into a matrix.) Then entering the variable name alone, which we know will print the variable, confirms that the intended matrix was produced. Finally, we compute the matrix product of the vector (1,1) and m. The matrix-multiplication operator, which you may know from linear algebra courses, is %*% in R.

Matrices are indexed using double subscripting, much as in C/C++, although subscripts start at 1 instead of 0.

**> m[1,2]**

**[1] 4**

**> m[2,2]**

**[1] 2**

An extremely useful feature of R is that you can extract submatrices from a matrix, much as you extract subvectors from vectors. Here’s an example:

**> m[1,] # row 1**

**[1] 1 4**

**> m[,2] # column 2**

**[1] 4 2**

#### *Lists*

Like an R vector, an R list is a container for values, but its contents can be items of different data types. (C/C++ programmers will note the analogy to a C struct.) List elements are accessed using two-part names, which are indicated with the dollar sign $ in R. Here’s a quick example:

> x <- list(u=2, v="abc")

> x

$u

[1] 2

$v

[1] "abc"

> x$u

[1] 2

The expression x$u refers to the u component in the list x. The list x also contains one other component, denoted by v.

A common use of lists is to combine multiple values into a single package that can be returned by a function. This is especially useful for statistical functions, which can have elaborate results. As an example, consider R’s basic histogram function, hist(), introduced in Section 1.2. We called the function on R’s built-in Nile River data set:

**> hist(Nile)**

This produced a graph, but hist() also returns a value, which we can save:

**> hn <- hist(Nile)**

What’s in hn? Let’s take a look:

> print(hn)

$breaks

[1] 400 500 600 700 800 900 1000 1100 1200 1300 1400

$counts

[1] 1 0 5 20 25 19 12 11 6 1

$intensities

[1] 9.999998e-05 0.000000e+00 5.000000e-04 2.000000e-03 2.500000e-03

[6] 1.900000e-03 1.200000e-03 1.100000e-03 6.000000e-04 1.000000e-04

$density

[1] 9.999998e-05 0.000000e+00 5.000000e-04 2.000000e-03 2.500000e-03

[6] 1.900000e-03 1.200000e-03 1.100000e-03 6.000000e-04 1.000000e-04

$mids

[1] 450 550 650 750 850 950 1050 1150 1250 1350

$xname

[1] "Nile"

$equidist

[1] TRUE

attr(,"class")

[1] “histogram”

Don’t try to understand all of that right away. For now, the point is that, besides making a graph, hist() returns a list with a number of components. Here, these components describe the characteristics of the histogram. For instance, the breaks component tells us where the bins in the histogram start and end, and the counts component is the numbers of observations in each bin.

The designers of R decided to package all of the information returned by hist() into an R list, which can be accessed and manipulated by further R commands via the dollar sign. Remember that we could also print hn simply by typing its name:

**> hn**

But a more compact alternative for printing lists like this is str():

> str(hn)

List of 7

$ breaks : num [1:11] 400 500 600 700 800 900 1000 1100 1200 1300 …

$ counts : int [1:10] 1 0 5 20 25 19 12 11 6 1

$ intensities : num [1:10] 0.0001 0 0.0005 0.002 0.0025 …

$ density : num [1:10] 0.0001 0 0.0005 0.002 0.0025 …

$ mids : num [1:10] 450 550 650 750 850 950 1050 1150 1250 1350

$ xname : chr "Nile"

$ equidist : logi TRUE

- attr(*, "class")= chr "histogram"

Here str stands for *structure*. This function shows the internal structure of any R object, not just lists.

#### *Data Frames*

A typical data set contains data of different modes. In an employee data set, for example, we might have character string data, such as employee names, and numeric data, such as salaries. So, although a data set of (say) 50 employees with 4 variables per worker has the look and feel of a 50-by-4 matrix, it does not qualify as such in R, because it mixes types.

Instead of a matrix, we use a *data frame*. A data frame in R is a list, with each component of the list being a vector corresponding to a column in our “matrix” of data. Indeed, you can create data frames in just this way:

**> d <- data.frame(list(kids=c("Jack","Jill"),ages=c(12,10)))**

**> d**

**kids ages**

**1 Jack 12**

**2 Jill 10**

**> d$ages**

**[1] 12 10**

Typically, though, data frames are created by reading in a data set from a file or database.

#### *Classes*

R is an object-oriented language. *Objects* are instances of *classes*. Classes are a bit more abstract than the data types you’ve met so far. Here, we’ll look briefly at the concept using R’s S3 classes. (The name stems from their use in the old S language, version 3, which was the inspiration for R.) Most of R is based on these classes, and they are exceedingly simple. Their instances are simply R lists but with an extra attribute: the class name.

For example, we noted earlier that the (nongraphical) output of the hist() histogram function is a list with various components, such as the breaks and counts components. There was also an *attribute*, which specified the class of the list, namely histogram.

> print(hn)

$breaks

[1] 400 500 600 700 800 900 1000 1100 1200 1300 1400

$counts

[1] 1 0 5 20 25 19 12 11 6 1

…


attr(,"class")

[1] “histogram”

At this point, you might be wondering, “If S3 class objects are just lists, why do we need them?” The answer is that the classes are used by *generic* functions. A generic function stands for a family of functions, all serving a similar purpose but each appropriate to a specific class. A commonly used generic function is summary(). An R user who wants to use a statistical function, like hist(), but is unsure of how to deal with its output (which can be voluminous), can simply call summary() on the output, which is not just a list but an instance of an S3 class.

The summary() function, in turn, is actually a family of summary-making functions, each handling objects of a particular class. When you call summary() on some output, R searches for a summary function appropriate to the class at hand and uses it to give a friendlier representation of the list. Thus, calling summary() on the output of hist() produces a summary tailored to that function, and calling summary() on the output of the lm() regression function produces a summary appropriate for that function.

The plot() function is another generic function. You can use plot() on just about any R object. R will find an appropriate plotting function based on the object’s class. Classes are used to organize objects. Together with generic functions, they allow flexible code to be developed for handling a variety of different but related tasks.
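A quick sketch of this dispatch, using the built-in Nile data and the hist() output discussed earlier:

```r
hn <- hist(Nile, plot = FALSE)  # an object of class "histogram"

plot(Nile)  # Nile is a time series, so R dispatches to plot.ts()
plot(hn)    # hn has class "histogram", so R dispatches to plot.histogram()

methods(plot)  # lists the plot.* methods available for dispatch
```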

**Extended Example: Regression Analysis of Exam Grades**

For our next example, we’ll walk through a brief statistical regression analysis. There isn’t much actual programming in this example, but it illustrates how some of the data types we just discussed are used, including R’s S3 objects. Also, it will serve as the basis for several of our programming examples in subsequent chapters.

I have a file, *ExamsQuiz.txt*, containing grades from a class I taught. Here are its first few lines:

**2 3.3 4**

**3.3 2 3.7**

**4 4.3 4**

**2.3 0 3.3**

**…**

The numbers correspond to letter grades on a four-point scale; 3.3 is a B+, for instance. Each line contains the data for one student, consisting of the midterm examination grade, final examination grade, and average quiz grade. It might be interesting to see how well the midterm and quiz grades predict the student’s grade on the final examination.

Let’s first read in the data file:

**> examsquiz <- read.table("ExamsQuiz.txt",header=FALSE)**

Our file does not include a header line naming the variables in each student record, so we specified header=FALSE in the function call. This is an example of a default argument, which we talked about earlier. Actually, the default value of the header argument is FALSE already (which you can check by consulting R’s online help for read.table()), so we didn’t need to specify this setting, but it’s clearer if we do.

Our data is now in examsquiz, which is an R object of class data.frame.

**> class(examsquiz)**

**[1] "data.frame"**

Just to check that the file was read in correctly, let’s take a look at the first few rows:

**> head(examsquiz)**

**V1 V2 V3**

**1 2.0 3.3 4.0**

**2 3.3 2.0 3.7**

**3 4.0 4.3 4.0**

**4 2.3 0.0 3.3**

**5 2.3 1.0 3.3**

**6 3.3 3.7 4.0**

Lacking a header for the data, R named the columns V1, V2, and V3. Row numbers appear on the left. As you might be thinking, it would be better to have a header in our data file, with meaningful names like Exam1. In later examples, we will usually specify names. Let’s try to predict the exam 2 score (given in the second column of examsquiz) from exam 1 (first column):

**lma <- lm(examsquiz[,2] ~ examsquiz[,1])**

The lm() (for *linear model*) function call here instructs R to fit this prediction equation:

**predicted Exam 2 = β₀ + β₁ Exam 1**

Here, *β*₀ and *β*₁ are constants to be estimated from our data. In other words, we are fitting a straight line to the (exam 1, exam 2) pairs in our data. This is done through a classic least-squares method. (Don’t worry if you don’t have background in this.)

Note that the exam 1 scores, which are stored in the first column of our data frame, are collectively referred to as examsquiz[,1]. Omission of the first subscript (the row number) means that we are referring to an entire column of the frame. The exam 2 scores are similarly referenced. So, our call to lm() above predicts the second column of examsquiz from the first. We also could have written

**lma <- lm(examsquiz$V2 ~ examsquiz$V1)**

recalling that a data frame is just a list whose elements are vectors. Here, the columns are the V1, V2, and V3 components of the list. The results returned by lm() are now in an object that we’ve stored in the variable lma. It is an instance of the class lm. We can list its components by calling attributes():

> attributes(lma)

$names

[1] "coefficients" "residuals" "effects" "rank"

[5] "fitted.values" "assign" "qr" "df.residual"

[9] "xlevels" "call" "terms" "model"

$class

[1] “lm”

As usual, a more detailed accounting can be obtained via the call str(lma). The estimated values of *β*ᵢ are stored in lma$coefficients. You can display them by typing the name at the prompt.

You can also save some typing by abbreviating component names, as long as you don’t shorten a component’s name to the point of being ambiguous. For example, if a list consists of the components xyz, xywa, and xbcde, then the second and third components can be abbreviated to xyw and xb, respectively. So here we could type the following:

**> lma$coef**

**(Intercept) examsquiz[, 1]**

**1.1205209 0.5899803**

Since lma$coefficients is a vector, printing it is simple. But consider what happens when you print the object lma itself:

> lma

Call:

lm(formula = examsquiz[, 2] ~ examsquiz[, 1])

Coefficients:

(Intercept) examsquiz[, 1]

1.121 0.590

Why did R print only these items and not the other components of lma? The answer is that here R is using the print() function, which is another example of a generic function. As a generic function, print() actually hands off the work to another function whose job is to print objects of class lm—the print.lm() function—and this is what that function displays.

We can get a more detailed printout of the contents of lma by calling summary(), the generic function discussed earlier. It triggers a call to summary.lm() behind the scenes, and we get a regression-specific summary:

> summary(lma)

Call:

lm(formula = examsquiz[, 2] ~ examsquiz[, 1])

Residuals:

Min 1Q Median 3Q Max

-3.4804 -0.1239 0.3426 0.7261 1.2225

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.1205 0.6375 1.758 0.08709 .

examsquiz[, 1] 0.5900 0.2030 2.907 0.00614 **

…

A number of other generic functions are defined for this class. See the online help for lm() for details. (Using R’s online documentation is discussed in Section 1.7.)

To estimate a prediction equation for exam 2 from both the exam 1 and the quiz scores, we would use the + notation:

**> lmb <- lm(examsquiz[,2] ~ examsquiz[,1] + examsquiz[,3])**

Note that the + doesn’t mean that we compute the sum of the two quantities. It is merely a delimiter in our list of predictor variables.

**Startup and Shutdown**

Like that of many sophisticated software applications, R’s behavior can be customized using startup files. In addition, R can save all or part of a session, such as a record of what you did, to an output file.

If there are R commands that you would like to execute at the beginning of every R session, you can place them in a file called *.Rprofile*, located either in your home directory or in the directory from which you are running R. The latter directory is searched for this file first, which allows you to have custom profiles for particular projects. For example, to set the text editor that R invokes if you call edit(), you can use a line in *.Rprofile* like this (if you’re on a Linux system):

**options(editor="/usr/bin/vim")**

R’s options() function is used for configuration, that is, to tweak various settings. You can specify the full path to your own editor, using the notation (slashes or backslashes) appropriate to your operating system. As another example, in *.Rprofile* on my Linux machine at home, I have the following line:

**.libPaths("/home/nm/R")**

This automatically adds a directory that contains all my auxiliary packages to my R search path.

Like most programs, R has the notion of your *current working directory*. Upon startup, this will be the directory from which you launched R, if you’re using Linux or a Mac. In Windows, it will probably be your *Documents* folder. If you then reference files during your R session, they will be assumed to be in that directory. You can always check your current directory by typing the following:

**> getwd()**

You can change your working directory by calling setwd() with the desired directory as a quoted argument. For example,

**> setwd("q")**

would set the working directory to *q*.

As you proceed through an interactive R session, R records the commands you submit. If you answer yes to the question “Save workspace image?” when you quit, R will save all the objects you created in that session and restore them in your next session. This means you do not need to redo the work from scratch to continue where you left off.

The saved workspace is in a file named *.RData*, which is located either in the directory from which you invoked the R session (Linux) or in the R installation directory (Windows). You can consult the *.Rhistory* file, which records your commands, to remind yourself how that workspace was created. If you want speedier startup/shutdown, you can skip loading all those files and the saving of your session at the end by running R with the vanilla option:

**R --vanilla**

Other options fall between vanilla and “load everything.” You can find more information about startup files by querying R’s online help facility, as follows:

**> ?Startup**

**Getting Help**

A plethora of resources are available to help you learn more about R. These include several facilities within R itself and, of course, on the Web. Much work has gone into making R self-documenting. We’ll look at some of R’s built-in help facilities and then at those available on the Internet.

#### *The help() Function*

To get online help, invoke help(). For example, to get information on the seq() function, type this:

**> help(seq)**

The shortcut to help() is a question mark (?):

**> ?seq**

Special characters and some reserved words must be quoted when used with the help() function. For instance, you need to type the following to get help on the < operator:

**> ?"<"**

And to see what the online manual has to say about for loops, enter this:

**> ?"for"**

#### *The example() Function*

Each of the help entries comes with examples. One really nice feature of R is that the example() function will actually run those examples for you. Here’s an illustration:

> example(seq)

seq> seq(0, 1, length.out=11)

[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

seq> seq(stats::rnorm(20))

[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

seq> seq(1, 9, by = 2) # match

[1] 1 3 5 7 9

seq> seq(1, 9, by = pi) # stay below

[1] 1.000000 4.141593 7.283185

seq> seq(1, 6, by = 3)

[1] 1 4

seq> seq(1.575, 5.125, by=0.05)

[1] 1.575 1.625 1.675 1.725 1.775 1.825 1.875 1.925 1.975 2.025 2.075 2.125

[13] 2.175 2.225 2.275 2.325 2.375 2.425 2.475 2.525 2.575 2.625 2.675 2.725

[25] 2.775 2.825 2.875 2.925 2.975 3.025 3.075 3.125 3.175 3.225 3.275 3.325

[37] 3.375 3.425 3.475 3.525 3.575 3.625 3.675 3.725 3.775 3.825 3.875 3.925

[49] 3.975 4.025 4.075 4.125 4.175 4.225 4.275 4.325 4.375 4.425 4.475 4.525

[61] 4.575 4.625 4.675 4.725 4.775 4.825 4.875 4.925 4.975 5.025 5.075 5.125

seq> seq(17) # same as 1:17

[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

The seq() function generates various kinds of numeric sequences in arithmetic progression. Running example(seq) resulted in R’s running some examples of seq() before our very eyes. Imagine how useful this can be for graphics! If you are interested in seeing what one of R’s excellent graphics functions does, the example() function will give you a “graphic” illustration.

To see a quick and very nice example, try running the following command:

**> example(persp)**

This displays a series of sample graphs for the persp() function. One of these is shown in Figure 1-2. Press ENTER in the R console when you are ready to go to the next one. Note that the code for each example is shown in the console, so you can experiment by tweaking the arguments.

*Figure 1-2: One of the persp() examples*

#### *If You Don’t Know Quite What You’re Looking For*

You can use the function help.search() to do a Google-style search through R’s documentation. For instance, say you need a function to generate random variates from multivariate normal distributions. To determine which function, if any, does this, you could try something like this:

**> help.search("multivariate normal")**

This produces a response containing this excerpt:

**mvrnorm(MASS) Simulate from a Multivariate Normal**

** Distribution**

You can see that the function mvrnorm() will do the job, and it is in the package MASS. There is also a question-mark shortcut to help.search():

**> ??"multivariate normal"**

#### *Help for Other Topics*

R’s internal help files include more than just pages for specific functions. For example, the previous section mentioned that the function mvrnorm() is in the package MASS. You can get information about the function by entering this:

**> ?mvrnorm**

And you can also learn about the entire package by typing this:

**> help(package=MASS)**

Help is available for general topics, too. For instance, if you’re interested in learning about files, type the following:

**> ?files**

This gives you information about a number of file-manipulation functions, such as file.create().

Here are some other topics:

**Arithmetic**

**Comparison**

**Control**

**Dates**

**Extract**

**Math**

**Memory**

**NA**

**NULL**

**NumericConstants**

**Paren**

**Quotes**

**Startup**

**Syntax**

You may find it helpful to browse through these topics, even without a specific goal in mind.

#### *Help for Batch Mode*

Recall that R has batch commands that allow you to run a command directly from your operating system’s shell. To get help on a particular batch command, you can type:

**R CMD command --help**

For example, to learn all the options associated with the INSTALL com-mand, you can type this:

**R CMD INSTALL --help**

#### *Help on the Internet*

There are many excellent resources on R on the Internet. Here are a few:

- The R Project’s own manuals are available from the R home page,
*http://www.r-project.org/*. Click Manuals.

- Various R search engines are listed on the R home page. Click Search.

- The sos package offers highly sophisticated searching of R materials.

- I use the RSeek search engine quite often:
*http://www.rseek.org/*.

- You can post your R questions to r-help, the R list server. You can obtain information about this and other R list servers at *http://www.r-project.org/mail.html*. You can use various interfaces, such as Gmane (*http://www.gmane.org/*).

Because of its single-letter name, R is difficult to search for using general-purpose search engines such as Google. But there are tricks you can employ. One approach is to use Google’s filetype criterion. To search for R scripts (files having a *.R* suffix) pertaining to, say, permutations, enter this:

**filetype:R permutations -rebol**

The -rebol asks Google to exclude pages with the word “rebol,” as the REBOL programming language uses the same suffix.

The Comprehensive R Archive Network (CRAN), at *http://cran.r-project.org/*, is a repository of user-contributed R code and thus makes for a good Google search term. Searching for “lm CRAN,” for instance, will help you find material on R’s lm() function.

**VECTORS**

The fundamental data type in R is the *vector*. You saw a few examples, and now you’ll learn the details. We’ll start by examining how vectors relate to some other data types in R. You’ll see that unlike in languages in the C family, individual numbers (scalars) do not have separate data types but instead are special cases of vectors. On the other hand, as in C family languages, matrices are special cases of vectors. We’ll spend a considerable amount of time on the following topics:

**Recycling** The automatic lengthening of vectors in certain settings

**Filtering** The extraction of subsets of vectors

**Vectorization** Where functions are applied element-wise to vectors

All of these operations are central to R programming, and you will see them referred to often in the remainder of the book.

**Scalars, Vectors, Arrays, and Matrices**

In many programming languages, vector variables are considered different from *scalars*, which are single-number variables. Consider the following C code, for example:

**int x;**

**int y[3];**

This requests the compiler to allocate space for a single integer named x and a three-element integer array (C terminology analogous to R’s vector type) named y. But in R, numbers are actually considered one-element vectors, and there is really no such thing as a scalar.

R variable types are called *modes*. Recall that all elements in a vector must have the same mode, which can be integer, numeric (floating-point number), character (string), logical (Boolean), complex, and so on. If you need your program code to check the mode of a variable x, you can query it by the call typeof(x). Unlike vector indices in ALGOL-family languages, such as C and Python, vector indices in R begin at 1.

#### *Adding and Deleting Vector Elements*

Vectors are stored contiguously, like arrays in C, and thus you cannot insert or delete elements—operations you may be used to if you are a Python programmer. The size of a vector is determined at its creation, so if you wish to add or delete elements, you’ll need to reassign the vector. For example, let’s add an element to the middle of a four-element vector:

**> x <- c(88,5,12,13)**

**> x <- c(x[1:3],168,x[4]) # insert 168 before the 13**

**> x**

**[1] 88 5 12 168 13**

Here, we created a four-element vector and assigned it to x. To insert a new number 168 between the third and fourth elements, we strung together the first three elements of x, then the 168, then the fourth element of x. This creates a *new* five-element vector, leaving x intact for the time being. We then assigned that new vector to x. In the result, it appears as if we had actually changed the vector stored in x, but really we created a new vector and stored *that* vector in x. This difference may seem subtle, but it has implications. For instance, in some cases, it may restrict the potential for fast performance in R.
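
Deletion works the same way: you build a new vector that omits the unwanted element and reassign it. A minimal sketch, continuing with the five-element x:

```r
x <- c(88,5,12,168,13)
x <- x[-4]    # a new vector omitting element 4, reassigned to x
x             # [1] 88 5 12 13
```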

#### *Obtaining the Length of a Vector*

You can obtain the length of a vector by using the length() function:

**> x <- c(1,2,4)**

**> length(x)**

**[1] 3**

In this example, we already know the length of x, so there really is no need to query it. But in writing general function code, you’ll often need to know the lengths of vector arguments. For instance, suppose that we wish to have a function that determines the index of the first 1 value in the function’s vector argument (assuming we are sure there is such a value). Here is one (not necessarily efficient) way we could write the code:

first1 <- function(x) {

for (i in 1:length(x)) {

if (x[i] == 1) break # break out of loop

}

return(i)

}

Without the length() function, we would have needed to add a second argument to first1(), say naming it n, to specify the length of x. Note that in this case, writing the loop as follows *won’t* work:

**for (n in x)**

The problem with this approach is that it doesn’t allow us to retrieve the index of the desired element. Thus, we need an explicit loop, which in turn requires calculating the length of x. One more point about that loop: For careful coding, you should worry that length(x) might be 0. In such a case, look what happens to the expression 1:length(x) in our for loop:

> x <- c()

> x

NULL

> length(x)

[1] 0

> 1:length(x)

[1] 1 0

Our variable i in this loop takes on the value 1, then 0, which is certainly not what we want if the vector x is empty. A safe alternative is to use the more advanced R function seq().

#### *Matrices and Arrays as Vectors*

Arrays and matrices (and even lists, in a sense) are actually vectors too, as you’ll see. They merely have extra class attributes; for example, matrices additionally record the numbers of rows and columns. We’ll discuss them in detail later, but it’s worth noting now that arrays and matrices are vectors, and that means that everything we say about vectors applies to them, too.

Consider the following example:

> m

[,1] [,2]

[1,] 1 2

[2,] 3 4

> m + 10:13

[,1] [,2]

[1,] 11 14

[2,] 14 17

The 2-by-2 matrix m is stored as a four-element vector, column-wise, as (1,3,2,4). We then added (10,11,12,13) to it, yielding (11,14,14,17), but R remembered that we were working with matrices and thus gave the 2-by-2 result you see in the example.
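
You can confirm the column-wise storage directly; a short sketch, rebuilding the m from the example:

```r
m <- matrix(c(1,3,2,4), nrow=2)  # fills column by column: rows 1 2 / 3 4
as.vector(m)    # strips the dim attribute: [1] 1 3 2 4
m[3]            # linear indexing into the underlying vector gives 2
attributes(m)   # the $dim attribute c(2,2) is what makes it a matrix
```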

**Declarations**

Typically, compiled languages require that you *declare* variables; that is, warn the interpreter/compiler of the variables’ existence before using them. This is the case in our earlier C example:

**int x;**

**int y[3];**

As with most scripting languages (such as Python and Perl), you do not declare variables in R. For instance, consider this code:

**z <- 3**

This code, with *no* previous reference to z, is perfectly legal (and commonplace). However, if you reference specific elements of a vector, you must warn R. For instance, say we wish y to be a two-component vector with values 5 and 12. The following will *not* work:

**> y[1] <- 5**

**> y[2] <- 12**

Instead, you must create y first, for instance this way:

**> y <- vector(length=2)**

**> y[1] <- 5**

**> y[2] <- 12**

The following will also work:

**> y <- c(5,12)**

This approach is all right because on the right-hand side we are creating a new vector, to which we then bind y.

The reason we cannot suddenly spring an expression like y[2] on R stems from R’s functional language nature. The reading and writing of individual vector elements are actually handled by functions. If R doesn’t already know that y is a vector, these functions have nothing on which to act. Speaking of binding, just as variables are not declared, they are not constrained in terms of mode. The following sequence of events is perfectly valid:

**> x <- c(1,5)**

**> x**

**[1] 1 5**

**> x <- “abc”**

First, x is associated with a numeric vector, then with a string. (Again, for C/C++ programmers: x is nothing more than a pointer, which can point to different types of objects at different times.)
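
We can watch the mode change with typeof(); a quick sketch:

```r
x <- c(1,5)
typeof(x)   # "double" -- a numeric vector
x <- "abc"
typeof(x)   # "character" -- the same name now points to a string
```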

**Recycling**

When applying an operation to two vectors that requires them to be the same length, R automatically *recycles*, or repeats, the shorter one, until it is long enough to match the longer one. Here is an example:

> c(1,2,4) + c(6,0,9,20,22)

[1] 7 2 13 21 24

Warning message:

longer object length

is not a multiple of shorter object length in: c(1, 2, 4) + c(6,0, 9, 20, 22)

The shorter vector was recycled, so the operation was taken to be as follows:

**> c(1,2,4,1,2) + c(6,0,9,20,22)**

Here’s a more subtle example:

> x

[,1] [,2]

[1,] 1 4

[2,] 2 5

[3,] 3 6

> x+c(1,2)

[,1] [,2]

[1,] 2 6

[2,] 4 6

[3,] 4 8

Again, keep in mind that matrices are actually long vectors. Here, x, as a 3-by-2 matrix, is also a six-element vector, which in R is stored column by column. In other words, in terms of storage, x is the same as c(1,2,3,4,5,6). We added a two-element vector to this six-element one, so our added vector needed to be repeated twice to make six elements. In other words, we were essentially doing this:

**x + c(1,2,1,2,1,2)**

Not only that, but c(1,2,1,2,1,2) was also changed from a vector to a matrix having the same shape as x before the addition took place:

**1 2**

**2 1**

**1 2**

Thus, the net result was to compute the following:

**1 4   1 2   2 6**

**2 5 + 2 1 = 4 6**

**3 6   1 2   4 8**

**Common Vector Operations**

Now let’s look at some common operations related to vectors. We’ll cover arithmetic and logical operations, vector indexing, and some useful ways to create vectors. Then we’ll look at two extended examples of using these operations.

#### *Vector Arithmetic and Logical Operations*

Remember that R is a functional language. Every operator, including + in the following example, is actually a function.

**> 2+3**

**[1] 5**

**> “+”(2,3)**

**[1] 5**

Recall further that scalars are actually one-element vectors. So, we can add vectors, and the + operation will be applied element-wise.

**> x <- c(1,2,4)**

**> x + c(5,0,-1)**

**[1] 6 2 3**

If you are familiar with linear algebra, you may be surprised at what happens when we multiply two vectors.

**> x * c(5,0,-1)**

**[1] 5 0 -4**

But remember, because of the way the * function is applied, the multiplication is done element by element. The first element of the product (5) is the result of the first element of x (1) being multiplied by the first element of c(5,0,-1) (5), and so on. The same principle applies to other numeric operators. Here’s an example:

> x <- c(1,2,4)

> x / c(5,4,-1)

[1] 0.2 0.5 -4.0

> x %% c(5,4,-1)

[1] 1 2 0
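
The integer-division operator %/% (a standard R operator, though not shown in the text above) behaves in the same element-wise way:

```r
x <- c(1,2,4)
x %/% c(5,4,-1)   # element-wise integer division: [1] 0 0 -4
```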

#### *Vector Indexing*

One of the most important and frequently used operations in R is that of *indexing* vectors, in which we form a subvector by picking elements of the given vector for specific indices. The format is vector1[vector2], with the result that we select those elements of vector1 whose indices are given in vector2.

> y <- c(1.2,3.9,0.4,0.12)

> y[c(1,3)] # extract elements 1 and 3 of y

[1] 1.2 0.4

> y[2:3]

[1] 3.9 0.4

> v <- 3:4

> y[v]

[1] 0.40 0.12

Note that duplicates are allowed.

> x <- c(4,2,17,5)

> y <- x[c(1,1,3)]

> y

[1] 4 4 17

Negative subscripts mean that we want to exclude the given elements in our output.

**> z <- c(5,12,13)**

**> z[-1] # exclude element 1**

**[1] 12 13**

**> z[-1:-2] # exclude elements 1 through 2**

**[1] 13**

In such contexts, it is often useful to use the length() function. For instance, suppose we wish to pick up all elements of a vector z except for the last. The following code will do just that:

**z <- c(5,12,13)**

**z[1:(length(z)-1)]**

**[1] 5 12**

Or more simply:

**> z[-length(z)]**

**[1] 5 12**

This is more general than using z[1:2]. Our program may need to work for more than just vectors of length 2, and the second approach would give us that generality.
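
As a side note (not from the text above), R also offers head() and tail() as conveniences here; head() with a negative count is another way to drop trailing elements:

```r
z <- c(5,12,13)
head(z, -1)   # all but the last element: [1] 5 12
tail(z, 2)    # the last two elements: [1] 12 13
```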

#### *Generating Useful Vectors with the : Operator*

There are a few R operators that are especially useful for creating vectors. Let’s start with the colon operator :, which was introduced earlier. It produces a vector consisting of a range of numbers.

**> 5:8**

**[1] 5 6 7 8**

**> 5:1**

**[1] 5 4 3 2 1**

You may recall that it was used earlier in this chapter in a loop context, as follows:

**for (i in 1:length(x)) {**

Beware of operator precedence issues.

**> i <- 2**

**> 1:i-1 # this means (1:i) - 1, not 1:(i-1)**

**[1] 0 1**

**> 1:(i-1)**

**[1] 1**

In the expression 1:i-1, the colon operator takes precedence over the subtraction. So, the expression 1:i is evaluated first, returning 1:2. R then subtracts 1 from that expression. That means subtracting a one-element vector from a two-element one, which is done via recycling. The one-element vector (1) will be extended to (1,1) to be of compatible length with 1:2. Elementwise subtraction then yields the vector (0,1). In the expression 1:(i-1), on the other hand, the parentheses have higher precedence than the colon. Thus, 1 is subtracted from i, resulting in 1:1, as seen in the preceding example.

**NOTE** *You can obtain complete details of operator precedence in R through the included help. Just type* ?Syntax *at the command prompt.*

#### *Generating Vector Sequences with seq()*

A generalization of : is the seq() (or *sequence*) function, which generates a sequence in arithmetic progression. For instance, whereas 3:8 yields the vector (3,4,5,6,7,8), with the elements spaced one unit apart (4 *−* 3 = 1, 5 *−* 4 = 1, and so on), we can make them, say, three units apart, as follows:

** > seq(from=12,to=30,by=3)**

** [1] 12 15 18 21 24 27 30**

The spacing can be a noninteger value, too, say 0.1.

> seq(from=1.1,to=2,length=10)

[1] 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0

One handy use for seq() is to deal with the empty vector problem we mentioned earlier in Section 2.1.2. There, we were dealing with a loop that began with this:

**for (i in 1:length(x))**

If x is empty, this loop should not have any iterations, but it actually has two, since 1:length(x) evaluates to (1,0). We could fix this by writing the statement as follows:

**for (i in seq(x))**

To see why this works, let’s do a quick test of seq():

> x <- c(5,12,13)

> x

[1] 5 12 13

> seq(x)

[1] 1 2 3

> x <- NULL

> x

NULL

> seq(x)

integer(0)

You can see that seq(x) gives us the same result as 1:length(x) if x is not empty, but it correctly evaluates to NULL if x is empty, resulting in zero iterations in the above loop.
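
Applying this to the earlier first1() example, a safe rewrite might look like the following sketch. It uses seq_along() (a close relative of seq(), not covered in the text above), which always returns 1:length(x), including the empty sequence, so the loop body never runs on an empty vector:

```r
first1s <- function(x) {
  for (i in seq_along(x)) {  # zero iterations when x is empty
    if (x[i] == 1) return(i)
  }
  return(NA)  # no 1 found, or x was empty
}
first1s(c(5,12,1,13))  # 3
first1s(NULL)          # NA
```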

#### *Repeating Vector Constants with rep()*

The rep() (or *repeat*) function allows us to conveniently put the same constant into long vectors. The call form is rep(x,times), which creates a vector of times*length(x) elements—that is, times copies of x. Here is an example:

> x <- rep(8,4)

> x

[1] 8 8 8 8

> rep(c(5,12,13),3)

[1] 5 12 13 5 12 13 5 12 13

> rep(1:3,2)

[1] 1 2 3 1 2 3

There is also a named argument each, with very different behavior, which interleaves the copies of x.

**> rep(c(5,12,13),each=2)**

**[1] 5 5 12 12 13 13**
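
The times argument can itself be a vector, giving each element its own repeat count (a detail worth confirming against ?rep):

```r
rep(c(5,12,13), times=c(1,2,3))
# [1] 5 12 12 13 13 13
```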

**Using all() and any()**

The any() and all() functions are handy shortcuts. They report whether any or all of their arguments are TRUE.

> x <- 1:10

> any(x > 8)

[1] TRUE

> any(x > 88)

[1] FALSE

> all(x > 88)

[1] FALSE

> all(x > 0)

[1] TRUE

For example, suppose that R executes the following:

**> any(x > 8)**

It first evaluates x > 8, yielding this:

**(FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,TRUE)**

The any() function then reports whether any of those values is TRUE. The all() function works similarly and reports if *all* of the values are TRUE.
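
When you need the positions of the TRUE values rather than a single yes/no answer, the related function which() reports them:

```r
x <- 1:10
which(x > 8)   # indices of the TRUE values: [1] 9 10
```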

#### *Extended Example: Finding Runs of Consecutive Ones*

Suppose that we are interested in finding runs of consecutive 1s in vectors that consist just of 1s and 0s. In the vector (1,0,0,1,1,1,0,1,1), for instance, there is a run of length 3 starting at index 4, and runs of length 2 beginning at indices 4, 5, and 8. So the call findruns(c(1,0,0,1,1,1,0,1,1),2) to our function to be shown below returns (4,5,8). Here is the code:

findruns <- function(x,k) {

n <- length(x)

runs <- NULL

for (i in 1:(n-k+1)) {

if (all(x[i:(i+k-1)]==1)) runs <- c(runs,i)

}

return(runs)

}

In the if statement inside the loop, we need to determine whether all of the k values starting at x[i]—that is, all of the values in x[i],x[i+1],…,x[i+k-1]—are 1s. The expression x[i:(i+k-1)] gives us this range in x, and then applying all() tells us whether there is a run there. Let’s test it.

**> y <- c(1,0,0,1,1,1,0,1,1)**

**> findruns(y,3)**

**[1] 4**

**> findruns(y,2)**

**[1] 4 5 8**

**> findruns(y,6)**

**NULL**

Although the use of all() is good in the preceding code, the buildup of the vector runs is not so good. Vector allocation is time consuming. Each execution of the following slows down our code, as it allocates a new vector in the call c(runs,i). (The fact that the new vector is assigned to runs is irrelevant; we have still performed a vector memory allocation.)

**runs <- c(runs,i)**

In a short loop, this probably will be no problem, but when application performance is an issue, there are better ways. One alternative is to preallocate the memory space, like this:

1 findruns1 <- function(x,k) {

2 n <- length(x)

3 runs <- vector(length=n)

4 count <- 0

5 for (i in 1:(n-k+1)) {

6 if (all(x[i:(i+k-1)]==1)) {

7 count <- count + 1

8 runs[count] <- i

9 }

10 }

11 if (count > 0) {

12 runs <- runs[1:count]

13 } else runs <- NULL

14 return(runs)

15 }

In line 3, we set up space of a vector of length n. This means we avoid new allocations during execution of the loop. We merely fill runs, in line 8. Just before exiting the function, we redefine runs in line 12 to remove the unused portion of the vector. This is better, as we’ve reduced the number of memory allocations to just two, down from possibly many in the first version of the code.
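
To see the cost of growth-by-c() in isolation, here is a small timing sketch (not from the text; exact times vary by machine) comparing the two allocation strategies on a plain fill loop:

```r
n <- 50000
grow <- function(n) {            # repeated c() calls: a fresh allocation each time
  v <- NULL
  for (i in 1:n) v <- c(v, i)
  v
}
prealloc <- function(n) {        # one allocation up front, filled in place
  v <- vector(mode="integer", length=n)
  for (i in 1:n) v[i] <- i
  v
}
system.time(grow(n))      # noticeably slower
system.time(prealloc(n))  # much faster
```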

#### *Extended Example: Predicting Discrete-Valued Time Series*

Suppose we observe 0- and 1-valued data, one per time period. To make things concrete, say it’s daily weather data: 1 for rain and 0 for no rain. Suppose we wish to predict whether it will rain tomorrow, knowing whether it rained or not in recent days. Specifically, for some number k, we will predict tomorrow’s weather based on the weather record of the last k days. We’ll use majority rule: If the number of 1s in the previous k time periods is at least k/2, we’ll predict the next value to be 1; otherwise, our prediction is 0. For instance, if k = 3 and the data for the last three periods is 1,0,1, we’ll predict the next period to be a 1.

But how should we choose k? Clearly, if we choose too small a value, it may give us too small a sample from which to predict. Too large a value will cause us to rely on data from the distant past that may have little or no predictive value. A common solution to this problem is to take known data, called a *training set*, and then ask how well various values of k would have performed on that data.

In the weather case, suppose we have 500 days of data and suppose we are considering using k = 3. To assess the predictive ability of that value for k, we “predict” each day in our data from the previous three days and then compare the predictions with the known values. After doing this throughout our data, we have an error rate for k = 3. We do the same for k = 1, k = 2, k = 4, and so on, up to some maximum value of k that we feel is enough. We then use whichever value of k worked best in our training data for future predictions. So how would we code that in R? Here’s a naive approach:

1 preda <- function(x,k) {

2 n <- length(x)

3 k2 <- k/2

4 # the vector pred will contain our predicted values

5 pred <- vector(length=n-k)

6 for (i in 1:(n-k)) {

7 if (sum(x[i:(i+(k-1))]) >= k2) pred[i] <- 1 else pred[i] <- 0

8 }

9 return(mean(abs(pred-x[(k+1):n])))

10 }

The heart of the code is line 7. There, we’re predicting day i+k (prediction to be stored in pred[i]) from the k days previous to it—that is, days i,…,i+k-1. Thus, we need to count the 1s among those days. Since we’re working with 0 and 1 data, the number of 1s is simply the sum of x[j] among those days, which we can conveniently obtain as follows:

**sum(x[i:(i+(k-1))])**

The use of sum() and vector indexing allow us to do this computation compactly, avoiding the need to write a loop, so it’s simpler and faster. This is typical R. The same is true for this expression, on line 9:

**mean(abs(pred-x[(k+1):n]))**

Here, pred contains the predicted values, while x[(k+1):n] has the actual values for the days in question. Subtracting the second from the first gives us values of either 0, 1, or *−*1. Here, 1 or *−*1 correspond to prediction errors in one direction or the other, predicting 0 when the true value was 1 or vice versa. Taking absolute values with abs(), we have 0s and 1s, the latter corresponding to errors. So we now know where days gave us errors. It remains to calculate the proportion of errors. We do this by applying mean(), where we are exploiting the mathematical fact that the mean of 0 and 1 data is the proportion of 1s. This is a common R trick.

The above coding of our preda() function is fairly straightforward, and it has the advantage of simplicity and compactness. However, it is probably slow. We could try to speed it up by vectorizing the loop. However, that would not address the major obstacle to speed here, which is all of the duplicate computation that the code does. For successive values of i in the loop, sum() is being called on vectors that differ by only two elements. Except for cases in which k is very small, this could really slow things down. So, let’s rewrite the code to take advantage of previous computation. In each iteration of the loop, we will update the previous sum we found, rather than compute the new sum from scratch.

predb <- function(x,k) {

n <- length(x)

k2 <- k/2

pred <- vector(length=n-k)

sm <- sum(x[1:k])

if (sm >= k2) pred[1] <- 1 else pred[1] <- 0

if (n-k >= 2) {

for (i in 2:(n-k)) {

sm <- sm + x[i+k-1] - x[i-1]

if (sm >= k2) pred[i] <- 1 else pred[i] <- 0

}

}

return(mean(abs(pred-x[(k+1):n])))

}

The key is the line that updates sm: we subtract the oldest element making up the sum (x[i-1]) and add the new one (x[i+k-1]). Yet another approach to this problem is to use the R function cumsum(), which forms cumulative sums from a vector. Here is an example:

**> y <- c(5,2,-3,8)**

**> cumsum(y)**

**[1] 5 7 4 12**

Here, the cumulative sums of y are 5 = 5, 5 + 2 = 7, 5 + 2 + (−3) = 4, and 5 + 2 + (−3) + 8 = 12, the values returned by cumsum(). The expression sum(x[i:(i+(k-1))]) in preda() suggests using differences of cumsum() instead:

predc <- function(x,k) {

n <- length(x)

k2 <- k/2

# the vector pred will contain our predicted values

pred <- vector(length=n-k)

csx <- c(0,cumsum(x))

for (i in 1:(n-k)) {

if (csx[i+k] - csx[i] >= k2) pred[i] <- 1 else pred[i] <- 0

}

return(mean(abs(pred-x[(k+1):n])))

}

Instead of applying sum() to a window of k consecutive elements in x, like this:

**sum(x[i:(i+(k-1))])**

we compute that same sum by finding the difference between the cumulative sums at the end and beginning of that window, like this:

**csx[i+k] - csx[i]**

Note the prepending of a 0 in the vector of cumulative sums:

**csx <- c(0,cumsum(x))**

This is needed in order to handle the case i = 1 correctly. This approach in predc() requires just one subtraction operation per iteration of the loop, compared to two in predb().
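
The windowing trick is worth seeing in isolation. In this sketch we compute every k-length window sum of x with a single cumsum() call, then check the result against direct summation:

```r
x <- c(1,0,1,1,0,1,1,1)
k <- 3
csx <- c(0, cumsum(x))                  # prepend 0 to handle i = 1
wins <- csx[(k+1):length(csx)] - csx[1:(length(csx)-k)]
wins                                    # window sums of x[1:3], x[2:4], ...
# direct computation of the same window sums, for comparison
sapply(1:(length(x)-k+1), function(i) sum(x[i:(i+k-1)]))
```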

**Vectorized Operations**

Suppose we have a function f() that we wish to apply to all elements of a vector x. In many cases, we can accomplish this by simply calling f() on x itself. This can really simplify our code and, moreover, give us a dramatic performance increase of a hundredfold or more. One of the most effective ways to achieve speed in R code is to use operations that are *vectorized*, meaning that a function applied to a vector is actually applied individually to each element.

#### *Vector In, Vector Out*

You saw examples of vectorized functions earlier in the chapter, with the + and * operators. Another example is >.

**> u <- c(5,2,8)**

**> v <- c(1,3,9)**

**> u > v**

**[1] TRUE FALSE FALSE**

Here, the > function was applied to u[1] and v[1], resulting in TRUE, then to u[2] and v[2], resulting in FALSE, and so on. A key point is that if an R function uses vectorized operations, it, too, is vectorized, thus enabling a potential speedup. Here is an example:

**> w <- function(x) return(x+1)**

**> w(u)**

**[1] 6 3 9**

Here, w() uses +, which is vectorized, so w() is vectorized as well. As you can see, there is an unlimited number of vectorized functions, as complex ones are built up from simpler ones. Note that even the transcendental functions—square roots, logs, trig functions, and so on—are vectorized.

**> sqrt(1:9)**

**[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427**

**[9] 3.000000**

This applies to many other built-in R functions. For instance, let’s apply the function for rounding to the nearest integer to an example vector y:

**> y <- c(1.2,3.9,0.4)**

**> z <- round(y)**

**> z**

**[1] 1 4 0**

The point is that the round() function is applied individually to each element in the vector y. And remember that scalars are really single-element vectors, so the “ordinary” use of round() on just one number is merely a special case.

**> round(1.2)**

**[1] 1**

Here, we used the built-in function round(), but you can do the same thing with functions that you write yourself. As mentioned earlier, even operators such as + are really functions. For example, consider this code:

**> y <- c(12,5,13) **

**> y+4**

**[1] 16 9 17**

The reason element-wise addition of 4 works here is that the + is actually a function! Here it is explicitly:

**> "+"(y,4)**

**[1] 16 9 17**

Note, too, that recycling played a key role here, with the 4 recycled into (4,4,4).

Since we know that R has no scalars, let’s consider vectorized functions that appear to have scalar arguments.

**> f**

**function(x,c) return((x+c)^2)**

**> f(1:3,0)**

**[1] 1 4 9**

**> f(1:3,1)**

**[1] 4 9 16**

In our definition of f() here, we clearly intend c to be a scalar, but, of course, it is actually a vector of length 1. Even if we use a single number for c in our call to f(), it will be extended through recycling to a vector for our computation of x+c within f(). So in our call f(1:3,1) in the example, the quantity x+c becomes as follows:

**(1,2,3) + (1,1,1) = (2,3,4)**

This brings up a question of code safety. There is nothing in f() that keeps us from using an explicit vector for c, such as in this example:

**> f(1:3,1:3)**

**[1] 4 16 36**

You should work through the computation to confirm that (4,16,36) is indeed the expected output.

If you really want to restrict c to scalars, you should insert some kind of check, say this one:

**> f**

**function(x,c) {**

**if (length(c) != 1) stop("vector c not allowed")**

**return((x+c)^2)**

**}**
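As a quick check that a guard of this kind works as intended (a sketch, using the same f() as above):

```r
# f() with a guard that rejects a non-scalar c
f <- function(x, c) {
  if (length(c) != 1) stop("vector c not allowed")
  return((x + c)^2)
}
f(1:3, 1)      # fine: returns 4 9 16
# f(1:3, 1:3)  # would now stop with "vector c not allowed"
```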

*Vector In, Matrix Out*


The vectorized functions we’ve been working with so far have scalar return values. Calling sqrt() on a number gives us a number. If we apply this function to an eight-element vector, we get eight numbers, thus another eight-element vector, as output. But what if our function itself is vector-valued, as z12() is here:

**z12 <- function(z) return(c(z,z^2))**

Applying z12() to 5, say, gives us the two-element vector (5,25). If we apply this function to an eight-element vector, it produces 16 numbers:

**> x <- 1:8**

**> z12(x)**

**[1] 1 2 3 4 5 6 7 8 1 4 9 16 25 36 49 64**

It might be more natural to have these arranged as an 8-by-2 matrix, which we can do with the matrix function:

**> matrix(z12(x),ncol=2)**

** [,1] [,2]**

**[1,] 1 1**

**[2,] 2 4**

**[3,] 3 9**

**[4,] 4 16**

**[5,] 5 25 **

**[6,] 6 36 **

**[7,] 7 49 **

**[8,] 8 64**

But we can streamline things using sapply() (or *simplify apply*). The call sapply(x,f) applies the function f() to each element of x and then converts the result to a matrix. Here is an example:

**> z12 <- function(z) return(c(z,z^2))**

**> sapply(1:8,z12)**

** [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]**

**[1,] 1 2 3 4 5 6 7 8**

**[2,] 1 4 9 16 25 36 49 64**

We do get a 2-by-8 matrix, not an 8-by-2 one, but it’s just as useful this way.
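If you do prefer the 8-by-2 layout, R's matrix transpose function t() converts one form to the other:

```r
z12 <- function(z) return(c(z, z^2))
m <- sapply(1:8, z12)  # 2-by-8 matrix, as above
t(m)                   # transpose: the 8-by-2 arrangement
```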

**NA and NULL Values**

Readers with a background in other scripting languages may be aware of “no such animal” values, such as None in Python and undefined in Perl. R actually has two such values: NA and NULL. In statistical data sets, we often encounter missing data, which we represent in R with the value NA. NULL, on the other hand, represents that the value in question simply doesn’t exist, rather than being existent but unknown. Let’s see how this comes into play in concrete terms.

*Using NA*


In many of R’s statistical functions, we can instruct the function to skip over any missing values, or NAs. Here is an example:

> x <- c(88,NA,12,168,13)

> x

[1] 88 NA 12 168 13

> mean(x)

[1] NA

> mean(x,na.rm=T)

[1] 70.25

> x <- c(88,NULL,12,168,13)

> mean(x)

[1] 70.25

In the first call, mean() refused to calculate, as one value in x was NA. But by setting the optional argument na.rm (*NA remove*) to true (T), we calculated the mean of the remaining elements. In the second case, R automatically skipped over the NULL value, which we’ll look at in the next section. There are multiple NA values, one for each mode:

> x <- c(5,NA,12)

> mode(x[1])

[1] "numeric"

> mode(x[2])

[1] "numeric"

> y <- c("abc","def",NA)

> mode(y[2])

[1] "character"

> mode(y[3])

[1] "character"

*Using NULL*


One use of NULL is to build up vectors in loops, in which each iteration adds another element to the vector. In this simple example, we build up a vector of even numbers:

**# build up a vector of the even numbers in 1:10 **

**> z <- NULL**

**> for (i in 1:10) if (i %% 2 == 0) z <- c(z,i)**

**> z**

**[1] 2 4 6 8 10**

Recall that %% is the modulo operator, giving remainders upon division. For example, 13 %% 4 is 1, as the remainder of dividing 13 by 4 is 1. (See Section 7.2 for a list of arithmetic and logic operators.) Thus the example loop starts with a NULL vector and then adds the element 2 to it, then 4, and so on. This is a very artificial example, of course, and there are much better ways to do this particular task. Here are two other ways to find the even numbers in 1:10:

**> seq(2,10,2)**

**[1] 2 4 6 8 10**

**> 2*1:5**

**[1] 2 4 6 8 10**

But the point here is to demonstrate the difference between NA and NULL. If we were to use NA instead of NULL in the preceding example, we would pick up an unwanted NA:

**> z <- NA**

**> for (i in 1:10) if (i %% 2 == 0) z <- c(z,i)**

**> z**

**[1] NA 2 4 6 8 10**

NULL values really are counted as nonexistent, as you can see here:

**> u <- NULL**

**> length(u)**

**[1] 0**

**> v <- NA**

**> length(v)**

**[1] 1**

NULL is a special R object with no mode.
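To test for these values in code, use is.na() and is.null() rather than ==; comparing against NA with == just yields NA:

```r
v <- NA
u <- NULL
is.na(v)     # TRUE
is.null(u)   # TRUE
is.null(v)   # FALSE: NA exists, it is merely unknown
v == NA      # NA, not TRUE -- this is why is.na() is needed
```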

**Filtering**

Another feature reflecting the functional language nature of R is *filtering*. This allows us to extract a vector’s elements that satisfy certain conditions. Filtering is one of the most common operations in R, as statistical analyses often focus on data that satisfies conditions of interest.

*Generating Filtering Indices*


Let’s start with a simple example:

**> z <- c(5,2,-3,8)**

**> w <- z[z*z > 8]**

**> w**

**[1] 5 -3 8**

Looking at this code in an intuitive, “What is our intent?” manner, we see that we asked R to extract from z all its elements whose squares were greater than 8 and then assign that subvector to w. But filtering is such a key operation in R that it’s worthwhile to examine the technical details of how R achieves our intent above. Let’s look at it piece by piece:

**> z <- c(5,2,-3,8)**

**> z**

**[1] 5 2 -3 8**

**> z*z > 8**

** [1] TRUE FALSE TRUE TRUE**

Evaluation of the expression z*z > 8 gives us a vector of Boolean values! It’s very important that you understand exactly how this comes about. First, in the expression z*z > 8, note that *everything* is a vector or vector operator:

- Since z is a vector, that means z*z will also be a vector (of the same length as z).

- Due to recycling, the number 8 (a vector of length 1) becomes the vector (8,8,8,8) here.

- The operator >, like +, is actually a function.

Let’s look at an example of that last point:

**> ">"(2,1)**

**[1] TRUE**

**> ">"(2,5)**

**[1] FALSE**

Thus, the following:

**z*z > 8**

is really this:

**">"(z*z,8)**

In other words, we are applying a function to vectors—yet another case of vectorization, no different from the others you’ve seen. And thus the result is a vector—in this case, a vector of Booleans. Then the resulting Boolean values are used to cull out the desired elements of z:

**> z[c(TRUE,FALSE,TRUE,TRUE)]**

**[1] 5 -3 8**

This next example will place things into even sharper focus. Here, we will again define our extraction condition in terms of z, but then we will use the results to extract from another vector, y, instead of extracting from z:

**> z <- c(5,2,-3,8) **

**> j <- z*z > 8**

**> j**

**[1] TRUE FALSE TRUE TRUE**

**> y <- c(1,2,30,5) **

**> y[j]**

**[1] 1 30 5**

Or, more compactly, we could write the following:

**> z <- c(5,2,-3,8)**

**> y <- c(1,2,30,5)**

**> y[z*z > 8]**

**[1] 1 30 5**

Again, the point is that in this example, we are using one vector, z, to determine indices to use in filtering *another* vector, y. In contrast, our earlier example used z to filter itself. Here’s another example, this one involving assignment. Say we have a vector x in which we wish to replace all elements larger than a 3 with a 0. We can do that very compactly—in fact, in just one line:

**> x[x > 3] <- 0**

Let’s check:

**> x <- c(1,3,8,2,20)**

**> x[x > 3] <- 0**

**> x**

**[1] 1 3 0 2 0**

*Filtering with the subset() Function*

Filtering can also be done with the subset() function. When applied to vectors, the difference between using this function and ordinary filtering lies in the manner in which NA values are handled.

**> x <- c(6,1:3,NA,12)**

**> x**

**[1] 6 1 2 3 NA 12**

**> x[x > 5]**

**[1] 6 NA 12**

**> subset(x,x > 5)**

**[1] 6 12**

When we did ordinary filtering in the previous section, R basically said, “Well, x[5] is unknown, so it’s also unknown whether it is greater than 5.” But you may not want NAs in your results. When you wish to exclude NA values, using subset() saves you the trouble of removing the NA values yourself.
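If you prefer ordinary filtering, you can get the same NA-free result by excluding the NAs explicitly in the index expression:

```r
x <- c(6, 1:3, NA, 12)
x[!is.na(x) & x > 5]   # same result as subset(x, x > 5)
```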

#### *The Selection Function which()*

As you’ve seen, filtering consists of extracting elements of a vector z that satisfy a certain condition. In some cases, though, we may just want to find the positions within z at which the condition occurs. We can do this using which(), as follows:

**> z <- c(5,2,-3,8)**

**> which(z*z > 8)**

**[1] 1 3 4**

The result says that elements 1, 3, and 4 of z have squares greater than 8. As with filtering, it is important to understand exactly what occurred in the preceding code. The expression

**z*z > 8**

is evaluated to (TRUE,FALSE,TRUE,TRUE). The which() function then simply reports which elements of the latter expression are TRUE. One handy (though somewhat wasteful) use of which() is for determining the location within a vector at which the first occurrence of some condition holds. For example, recall our code on page 27 to find the first 1 value within a vector x:

first1 <- function(x) {

for (i in 1:length(x)) {

if (x[i] == 1) break # break out of loop

}

return(i)

}

Here is an alternative way of coding this task:

**first1a <- function(x) return(which(x == 1)[1])**

The call to which() yields the indices of the 1s in x. These indices will be given in the form of a vector, and we ask for element index 1 in that vector, which is the index of the first 1. That is much more compact. On the other hand, it’s wasteful, as it actually finds *all* instances of 1s in x, when we need only the first. So, although it is a vectorized approach and thus possibly faster, if the first 1 comes early in x, this approach may actually be slower.
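Another alternative worth knowing is R's match() function, which reports only the first occurrence:

```r
# first1b() finds the position of the first 1 in x via match()
first1b <- function(x) return(match(1, x))
first1b(c(3, 1, 4, 1))   # 2: the first 1 is in position 2
first1b(c(3, 4, 5))      # NA if there is no 1 at all
```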

**A Vectorized if-then-else: The ifelse() Function**

In addition to the usual if-then-else construct found in most languages, R also includes a vectorized version, the ifelse() function. The form is as follows:

**ifelse(b,u,v)**

where b is a Boolean vector, and u and v are vectors. The return value is itself a vector; element i is u[i] if b[i] is true, or v[i] if b[i] is false. The concept is pretty abstract, so let’s go right to an example:

**> x <- 1:10**

**> y <- ifelse(x %% 2 == 0,5,12) # %% is the mod operator**

**> y**

**[1] 12 5 12 5 12 5 12 5 12 5**

Here, we wish to produce a vector in which there is a 5 wherever x is even or a 12 wherever x is odd. So, the actual argument corresponding to the formal argument b is (F,T,F,T,F,T,F,T,F,T). The second actual argument, 5, corresponding to u, is treated as (5,5,…)(ten 5s) by recycling. The third argument, 12, is also recycled, to (12,12,…). Here is another example:

**> x <- c(5,2,9,12) **

**> ifelse(x > 6,2*x,3*x) **

**[1] 15 6 18 24**

We return a vector consisting of the elements of x, either multiplied by 2 or 3, depending on whether the element is greater than 6.

Again, it helps to think through what is really occurring here. The expression x *>* 6 is a vector of Booleans. If the *i*^{th} component is true, then the *i*^{th} element of the return value will be set to the *i*^{th} element of 2*x; otherwise, it will be set to 3*x[i], and so on. The advantage of ifelse() over the standard if-then-else construct is that it is vectorized, thus potentially much faster.
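A rough timing sketch (not from the text above; exact numbers will vary by machine) suggests the scale of the difference between ifelse() and an explicit loop:

```r
# Compare ifelse() against an element-by-element loop on the same task
x <- rnorm(100000)
system.time(y1 <- ifelse(x > 0, 2*x, 3*x))   # vectorized version
system.time({                                # explicit-loop version
  y2 <- numeric(length(x))
  for (i in seq_along(x)) y2[i] <- if (x[i] > 0) 2*x[i] else 3*x[i]
})
all(y1 == y2)   # same results, very different speeds
```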

*Extended Example: A Measure of Association*


In assessing the statistical relation of two variables, there are many alternatives to the standard correlation measure (Pearson product-moment correlation). Some readers may have heard of the Spearman rank correlation, for example. These alternative measures have various motivations, such as robustness to *outliers*, which are extreme and possibly erroneous data items.

Here, let’s propose a new such measure, not necessarily for novel statistical merits (actually it is related to one in broad use, Kendall’s *τ*), but to illustrate some of the R programming techniques introduced in this chapter, especially ifelse(). Consider vectors x and y, which are time series, say for measurements of air temperature and pressure collected once each hour. We’ll define our measure of association between them to be the fraction of the time x and y increase or decrease together—that is, the proportion of i for which y[i+1]-y[i] has the same sign as x[i+1]-x[i]. Here is the code:

# findud() converts vector v to 1s and -1s, representing an element

# increasing or not, relative to the previous one; output length is 1

# less than input

findud <- function(v) {

vud <- v[-1] - v[-length(v)]

return(ifelse(vud > 0,1,-1))

}

udcorr <- function(x,y) {

ud <- lapply(list(x,y),findud)

return(mean(ud[[1]] == ud[[2]]))

}

Here’s an example:

**> x**

**[1] 5 12 13 3 6 0 1 15 16 8 88**

**> y**

**[1] 4 2 3 23 6 10 11 12 6 3 2**

**> udcorr(x,y)**

**[1] 0.4**

In this example, x and y increased together in 3 of the 10 opportunities (the first time being the increases from 12 to 13 and 2 to 3) and decreased together once. That gives an association measure of 4/10 = 0.4.

Let’s see how this works. The first order of business is to recode x and y to sequences of 1s and *−*1s, with a value of 1 meaning an increase of the current observation over the last. We’ve done that in the two statements of findud(). For example, think what happens in the first of those statements when we call findud() with v having a length of, say, 16 elements. Then v[-1] will be a vector of 15 elements, starting with the second element in v. Similarly, v[-length(v)] will again be a vector of 15 elements, this time starting from the first element in v. The result is that we are subtracting the original series from the series obtained by shifting rightward by one time period. The difference gives us the sequence of increase/decrease statuses for each time period—exactly what we need.

We then need to change those differences to 1 and *−*1s, according to whether a difference is positive or negative. The ifelse() call does this easily, compactly, and with smaller execution time than a loop version of the code would have. We could have then written two calls to findud(): one for x and the other for y. But by putting x and y into a list and then using lapply(), we can do this without duplicating code. If we were applying the same operation to many vectors instead of only two, especially in the case of a variable number of vectors, using lapply() like this would be a big help in compacting and clarifying the code, and it might be slightly faster as well. We then find the fraction of matches, as follows:

**return(mean(ud[[1]] == ud[[2]]))**

Note that lapply() returns a list. The components are our 1/*−*1–coded vectors. The expression ud[[1]] == ud[[2]] returns a vector of TRUE and FALSE values, which are treated as 1 and 0 values by mean(). That gives us the desired fraction. A more advanced version would make use of R’s diff() function, which does *lag* operations for vectors. We might, for instance, compare each element with the element three spots behind it, termed a *lag of 3*. The default lag value is one time period, just what we need here.

**> u**

**[1] 1 6 7 2 3 5**

**> diff(u)**

**[1] 5 1 -5 1 2**

Then the assignment to vud in findud() would become this:

**vud <- diff(v)**

We can make the code really compact by using another advanced R function, sign(), which converts the numbers in its argument vector to 1, 0, or *−*1, depending on whether they are positive, zero, or negative. Here is an example:

**> u**

**[1] 1 6 7 2 3 5**

**> diff(u)**

**[1] 5 1 -5 1 2**

**> sign(diff(u))**

**[1] 1 1 -1 1 1**

Using sign() then allows us to turn the udcorr() function into a one-liner, as follows:

**> udcorr <- function(x,y) mean(sign(diff(x)) == sign(diff(y)))**

This is certainly a lot shorter than the original version. But is it better? For most people, it probably would take longer to write. And although the code is short, it is arguably harder to understand. All R programmers must find their own “happy medium” in trading brevity for clarity.
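As a quick sanity check on the one-liner, with a tiny made-up series:

```r
udcorr <- function(x, y) mean(sign(diff(x)) == sign(diff(y)))
# x falls then rises; y rises twice: they agree only on the second move
udcorr(c(3, 1, 2), c(1, 2, 4))   # 0.5
```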

#### *Extended Example: Recoding an Abalone Data Set*


Due to the vector nature of the arguments, you can nest ifelse() operations. In the following example, which involves an abalone data set, gender is coded as M, F, or I (for infant). We wish to recode those characters as 1, 2, or 3. The real data set consists of more than 4,000 observations, but for our example, we’ll say we have just a few, stored in g:

**> g**

**[1] "M" "F" "F" "I" "M" "M" "F"**

**> ifelse(g == "M",1,ifelse(g == "F",2,3))**

**[1] 1 2 2 3 1 1 2**

What actually happens in that nested ifelse()? Let’s take a careful look. First, for the sake of concreteness, let’s find what the formal argument names are in the function ifelse():

**> args(ifelse)**

**function (test, yes, no)**

**NULL**

Remember, for each element of test that is true, the function evaluates to the corresponding element in yes. Similarly, if test[i] is false, the function evaluates to no[i]. All values so generated are returned together in a vector.

In our case here, R will execute the outer ifelse() call first, in which test is g == "M", and yes is 1 (recycled); no will (later) be the result of executing ifelse(g=="F",2,3). Now since test[1] is true, we generate yes[1], which is 1. So, the first element of the return value of our outer call will be 1. Next R will evaluate test[2]. That is false, so R needs to find no[2]. R now needs to execute the inner ifelse() call. It hasn’t done so before, because it hasn’t needed it until now. R uses the principle of *lazy evaluation*, meaning that an expression is not computed until it is needed.

R will now evaluate ifelse(g=="F",2,3), yielding (3,2,2,3,3,3,2); this is no for the outer ifelse() call, so the latter’s second return element will be the second element of (3,2,2,3,3,3,2), which is 2. When the outer ifelse() call gets to test[4], it will see that value to be false and thus will return no[4]. Since R had already computed no, it has the value needed, which is 3.

Remember that the vectors involved could be columns in matrices, which is a very common scenario. Say our abalone data is stored in the matrix ab, with gender in the first column. Then if we wish to recode as in the preceding example, we could do it this way:

**> ab[,1] <- ifelse(ab[,1] == "M",1,ifelse(ab[,1] == "F",2,3))**

Suppose we wish to form subgroups according to gender. We could use which() to find the element numbers corresponding to M, F, and I:

> m <- which(g == "M")

> f <- which(g == "F")

> i <- which(g == "I")

> m

[1] 1 5 6

> f

[1] 2 3 7

> i

[1] 4

Going one step further, we could save these groups in a list, like this:

> grps <- list()

> for (gen in c("M","F","I")) grps[[gen]] <- which(g==gen)

> grps

$M

[1] 1 5 6

$F

[1] 2 3 7

$I

[1] 4
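The same grouping can be done in one call with R's split() function; here seq_along(g) is just the index vector 1, 2, ..., length(g). Note that split() orders the components alphabetically ($F, $I, $M) rather than by order of first appearance:

```r
g <- c("M", "F", "F", "I", "M", "M", "F")
split(seq_along(g), g)   # list of index vectors, one component per gender
```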

Note that we take advantage of the fact that R’s for() loop has the ability to loop through a vector of strings. We might use our recoded data to draw some graphs, exploring the various variables in the abalone data set. Let’s summarize the nature of the variables by adding the following header to the file:

**Gender,Length,Diameter,Height,WholeWt,ShuckedWt,ViscWt,ShellWt,Rings**

We could, for instance, plot diameter versus length, with a separate plot for males and females, using the following code:

aba <- read.csv("abalone.data",header=T,as.is=T)

grps <- list()

for (gen in c("M","F")) grps[[gen]] <- which(aba$Gender==gen)

abam <- aba[grps$M,]

abaf <- aba[grps$F,]

plot(abam$Length,abam$Diameter)

points(abaf$Length,abaf$Diameter,pch="x")

First, we read in the data set, assigning it to the variable aba (to remind us that it’s abalone data). The call to read.csv() is similar to the read.table() call. We then form abam and abaf, the submatrices of aba corresponding to males and females, respectively.

Next, we create the plots. The first call does a scatter plot of diameter against length for the males. The second call is for the females. Since we want this plot to be superimposed on the same graph as the males, we call points() instead of plot(); points() adds to the existing graph rather than creating a new one. The argument pch="x" means that we want the plot characters for the female graph to consist of *x* characters, rather than the default *o* characters. The graph (for the entire data set) is shown in Figure 2-1. By the way, it is not completely satisfactory. Apparently, there is such a strong correlation between diameter and length that the points densely fill up a section of the graph, and the male and female plots pretty much coincide. (It does appear that males have more variability, though.) This is a common issue in statistical graphics. A finer graphical analysis may be more illuminating, but at least here we see evidence of the strong correlation and that the relation does not vary much across genders.

*Figure 2-1: Abalone diameter vs. length by gender*

We can compact the plotting code in the previous example by yet another use of ifelse. This exploits the fact that the plot parameter pch is allowed to be a vector rather than a single character. In other words, R allows us to specify a different plot character for each point.

**pchvec <- ifelse(aba$Gender == "M","o","x")**

**plot(aba$Length,aba$Diameter,pch=pchvec)**

(Here, we’ve omitted the recoding to 1, 2, and 3, but you may wish to retain it for various reasons.)

**Testing Vector Equality**

Suppose we wish to test whether two vectors are equal. The naive approach, using ==, won’t work.

**> x <- 1:3**

**> y <- c(1,3,4)**

**> x == y**

**[1] TRUE FALSE FALSE **

What happened? The key point is that we are dealing with vectorization. Just like almost anything else in R, == is a function.

**> "=="(3,2)**

**[1] FALSE**

**> i <- 2**

**> "=="(i,2)**

**[1] TRUE**

In fact, == is a vectorized function. The expression x == y applies the function ==() to the elements of x and y, yielding a vector of Boolean values. What can be done instead? One option is to work with the vectorized nature of ==, applying the function all():

**> x <- 1:3**

**> y <- c(1,3,4)**

**> x == y**

**[1] TRUE FALSE FALSE**

**> all(x == y)**

**[1] FALSE**

Applying all() to the result of == asks whether all of the elements of the latter are true, which is the same as asking whether x and y are identical. Or even better, we can simply use the identical function, like this:

**> identical(x,y) **

**[1] FALSE **

Be careful, though, because the word *identical* really means what it says. Consider this little R session:

> x <- 1:2

> y <- c(1,2)

> x

[1] 1 2

> y

[1] 1 2

> identical(x,y)

[1] FALSE

> typeof(x)

[1] "integer"

> typeof(y)

[1] "double"

So, : produces integers while c() produces floating-point numbers.
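If you want value equality regardless of storage type, compare after converting, or use == together with all():

```r
x <- 1:2      # integer
y <- c(1, 2)  # double
identical(x, y)              # FALSE: same values, different types
identical(as.numeric(x), y)  # TRUE once the types match
all(x == y)                  # TRUE: == compares values, not storage types
```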

**Vector Element Names**

The elements of a vector can optionally be given names. For example, say we have a 50-element vector showing the population of each state in the United States. We could name each element according to its state name, such as “Montana” and “New Jersey”. This in turn might lead to naming points in plots, and so on. We can assign or query vector element names via the names() function:

**> x <- c(1,2,4)**

**> names(x)**

**NULL**

**> names(x) <- c("a","b","ab")**

**> names(x)**

**[1] "a" "b" "ab"**

**> x**

** a b ab**

** 1 2 4**

We can remove the names from a vector by assigning NULL:

**> names(x) <- NULL**

**> x**

**[1] 1 2 4**

We can even reference elements of the vector by name:

**> x <- c(1,2,4)**

**> names(x) <- c("a","b","ab")**

**> x["b"]**

**b**

**2**
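Name-based indexing extends to several elements at once:

```r
x <- c(1, 2, 4)
names(x) <- c("a", "b", "ab")
x[c("a", "ab")]   # selects elements by name, keeping the names
```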

**More on c()**

In this section, we’ll discuss a couple of miscellaneous facts related to the concatenate function, c(), that often come in handy. If the arguments you pass to c() are of differing modes, they will be reduced to a type that is the lowest common denominator, as follows:

> c(5,2,"abc")

[1] "5" "2" "abc"

> c(5,2,list(a=1,b=4))

[[1]]

[1] 5

[[2]]

[1] 2

$a

[1] 1

$b

[1] 4

In the first example, we are mixing integer and character modes, a combination that R chooses to reduce to the latter mode. In the second example, R considers the list mode to be of lower precedence in mixed expressions. You probably will not wish to write code that makes such combinations, but you may encounter code in which this occurs, so it’s important to understand the effect. Another point to keep in mind is that c() has a flattening effect for vectors, as in this example:

**> c(5,2,c(1.5,6))**

**[1] 5.0 2.0 1.5 6.0**

Those familiar with other languages, such as Python, may have expected the preceding code to produce a two-level object. That doesn’t occur with R vectors, though you can have two-level lists.
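If you do want a two-level object, use list() instead of c():

```r
c(5, 2, c(1.5, 6))      # flattened into a single four-element vector
list(5, 2, c(1.5, 6))   # the inner vector survives as one component
```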