To follow along, use the lab04 code.

Programming in R

R is made up primarily of

We discussed many R objects in lab 01, but there are many more.

Functions

Functions are written to be performed on objects. The function below simply adds 1 to whatever the argument is.

add1 <- function(a) {
  return(a+1)
}

add1(2)
## [1] 3
add1(a = 3)
## [1] 4

The following function adds the two arguments.

add <- function(a, b) {
  return(a+b)
}

add(1,1)
## [1] 2
add(a = 2, b = 3)
## [1] 5

Many functions in R are vectorized meaning they can take vector arguments and return a sensible value. As written, our functions are vectorized, e.g. 

add1(a = 1:2)
## [1] 2 3
add(a = 1:2, b = 3:4)
## [1] 4 6

This is usually helpful, but it is not always clear what the results should be. For example, what do you think the result is if you run

add(a = 1:4, b = 1:2)

Arguments

Arguments to functions can be provided by order or by name or both. You can also specify arguments using a partial name.

f <- function(first, second) {
  return(first/2 + second)
}

f(1,2)
## [1] 2.5
f(1, second = 2)
## [1] 2.5
f(first = 1, second = 2)
## [1] 2.5
f(second = 2, 1) # I thought this was going to be an error
## [1] 2.5
f(f = 1, s = 2)
## [1] 2.5

My suggestion is that you always use argument full names when writing your code. Due to RStudio’s tab-autocomplete, typing in the full name should not slow you down.

Many functions have default arguments. You can specify default arguments in the functions you write by using the = sign in the function definition.

add <- function(a, b = 1) {
  return(a+b)
}

add(a = 1)
## [1] 2
add(a = 1:2)
## [1] 2 3
add(a = 2:3, b =  2)
## [1] 4 5
add(a = 3:4, b = -1)
## [1] 2 3

For example, the norm functions have default arguments that correspond to a standard normal, i.e. mean=0 and sd=1.

hist(rnorm(1e6), probability = TRUE)
curve(dnorm(x), add = TRUE, col = 'red')

Function activity

Create a function with arguments named first and second and have the function subtract the second argument from the first. Provide a default for first that is the vector 1:2 and for second that is the value 1.

Click for solution

sub <- function(first = 1:2, second = 1) {
  return(first - second)
}

Generic Functions

R has generic functions, i.e. functions where the type of object passed as the first argument to the function determine which version of the function to actually run. The summary() function is a generic function. This function will return different outputs depending on what the first argument is, e.g.

summary(c(TRUE,FALSE,TRUE))
##    Mode   FALSE    TRUE 
## logical       1       2
summary(1:10)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.25    5.50    5.50    7.75   10.00

In the help file it says “summary is a generic function”.

?summary

To determine exactly what function will be executed when using the a generic function, you need to determine the class of the object.

Here is an example

n <- 10
x <- rnorm(n)
y <- rnorm(n, x)
m <- lm(y~x) # a linear regression
summary(m)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2472 -0.3727  0.1527  0.3081  1.0535 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.1156     0.2338   0.494    0.634
## x             0.3269     0.2660   1.229    0.254
## 
## Residual standard error: 0.6698 on 8 degrees of freedom
## Multiple R-squared:  0.1588, Adjusted R-squared:  0.0537 
## F-statistic: 1.511 on 1 and 8 DF,  p-value: 0.2539

Here the generic function summary() checked the class of m.

class(m)
## [1] "lm"

and found that the class is lm and uses the appropriate summary function, i.e. summary.lm().

There are other helpful functions for lm objects, e.g. 

summary(m)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2472 -0.3727  0.1527  0.3081  1.0535 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.1156     0.2338   0.494    0.634
## x             0.3269     0.2660   1.229    0.254
## 
## Residual standard error: 0.6698 on 8 degrees of freedom
## Multiple R-squared:  0.1588, Adjusted R-squared:  0.0537 
## F-statistic: 1.511 on 1 and 8 DF,  p-value: 0.2539
anova(m)
## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value Pr(>F)
## x          1 0.6777 0.67768  1.5108 0.2539
## Residuals  8 3.5885 0.44857
opar <- par(mfrow=c(2,2)); plot(m); par(opar)

Scope

The objects available to an R function are defined by the scope of the function. But if the object is not available within the function, then the function will look to the environment containing that function to see if the object is there.

a <- 1
f <- function() {
  return(a)
}
f()
## [1] 1

If we modify the object within the function, that version of the object is only available within the function.

f <- function() {
  a <- 2
  return(a)
}
f()
## [1] 2
a
## [1] 1

Unless we use a special assignment operator <<-

f <- function() {
  a <<- a + 1
  return(a)
}
a
## [1] 1
f()
## [1] 2
a
## [1] 2

So be careful when constructing functions to deal appropriately with the scope of the objects you are using!

Loops

Basic programming often consists of using loops and conditionals. R has for and while loops as well as if-else statements.

Here is a basic for loop:

for (i in 1:3) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3

A basic while loop:

i <- 0
while (i < 3) {
  i <- i + 1
  print(i)
}
## [1] 1
## [1] 2
## [1] 3

And a basic use of the if-else statement:

for (i in 1:10) {
  if (i<5) {
    print(-i)
  } else {
    print(i*2)
  }
}
## [1] -1
## [1] -2
## [1] -3
## [1] -4
## [1] 10
## [1] 12
## [1] 14
## [1] 16
## [1] 18
## [1] 20

There is also a useful ifelse function, but it is a little confusing. This function takes in a vector logical first argument and two more scalar arguments. For each element in the logical vector, it will return the first scalar argument when the element is true and the second scalar argument when the element is false.

ifelse((1:10) < 5, "Less than 5", "Not less than 5")
##  [1] "Less than 5"     "Less than 5"     "Less than 5"     "Less than 5"     "Not less than 5" "Not less than 5"
##  [7] "Not less than 5" "Not less than 5" "Not less than 5" "Not less than 5"

Loop activity

Create a function that has a single scalar argument named a. Use a loop inside the function to calculate the sum of the numbers from 1 to a and return this sum. Test using the sum up to 100 which should be 5050.

Click for solution

f <- function(a) {
  sum <- 0
  for (i in 1:a) {
    sum <- sum + i
  }
  return(sum)
}
f(100)
## [1] 5050

Monte Carlo Methods in R

With the prevalence of computers, statistics is increasingly utilizing methods based on simulations to answer scientific questions of interest. These methods are called Monte Carlo methods.

Evaluating expectations

Suppose we have a random variable with the probability mass function

x 0 1 2 3
p(x) .5 .25 .1 .15

Now you are interested in estimating its expectation. If you can simulate from this distribution, then you can just take an average of the simulated values.

To simulate from a discrete random variable with finite support, you can use the sample() function.

x <- sample(0:3, size = 1e5, prob = c(.5,.25,.1,.15), replace = TRUE)

Now we can calculate the expectation and standard deviation (or variance) of these values which are estimates of the expectation and standard deviation (or variance) of the random variable.

(m <- mean(x))
## [1] 0.89799
(s <- sd(x))
## [1] 1.089548
s^2 # variance
## [1] 1.187116

The 100(1-a)% CLT can be used to provide bounds based on \[ \overline{X} \pm z_{a/2} s/\sqrt{n} \] where \(\overline{X}\) is the sample mean, \(z_{a/2}\) is the z-critical value, \(s\) is the sample standard deviation, and \(n\) is the size of our sample. For example a 90% interval is

mean(x) + c(-1,1) * 
  qnorm(.95) * 
  sd(x) / sqrt( length(x) )
## [1] 0.8923227 0.9036573

Expectation activity

Find the expectation for discrete random variable with the following pmf:

x 10 12 21 33
p(x) .1 .2 .3 .4

Also find a 99% interval for this expectation using the CLT based on a million samples.

Click for solution

x <- sample(c(10,12,21,33), size = 1e6, prob = c(.1,.2,.3,.4), replace = TRUE)
mean(x) + c(-1,1) * 
  qnorm(.995) * 
  sd(x) / sqrt( length(x) )
## [1] 22.87629 22.92305

Estimating probabilities

We can also use simulations to estimate probabilities. Suppose we have a random variable \(X\sim Unif(0,1)\) and a random variable \(Y\sim N(0,1)\) and we are interested in calculating the probability that \(X\) is larger than \(Y\). This is not an easy calculation to do by hand. But we can simply simulate a bunch of \(X\) and \(Y\) values and calculate the number of times that an \(X\) value is larger than an associated \(Y\) value. Note that the event \(X>Y\) will have a Bernoulli distribution, i.e. there is some probability that \(X>Y\) (precisely the probability we are trying to find). If we do this repeated for independent values of \(X\) and \(Y\) and record the number of times that \(X>Y\), this sum will have a binomial distribution. Thus we can use the CLT with mean \(np\) and variance \(np(1-p)\) or rather than using the sum, we can take the average and use the CLT with mean \(p\) and \(p(1-p)/n\). Since we are trying to estimate \(p\), the latter makes more sense.

So here is the simulation

n <- 1e5
x <- runif(n)
y <- rnorm(n)
(p <- mean(x>y))
## [1] 0.68366

and a 95% interval based on the CLT

p + c(-1,1) * qnorm(.975) * sqrt(p*(1-p)/n)
## [1] 0.6807777 0.6865423

Probability activity

Suppose \(X\sim Bin(10,.5)\) and \(Y\sim Po(5)\), calculate the probability that \(Y \ge X\). Since these are discrete random variables, the equal to sign is important. As a reminder, to get greater than or equal to in R, use >= and the random Poisson generator is rpois(). Provide a 95% interval based on a million samples.

Click for solution

n <- 1e6
x <- rbinom(n, size = 10, prob = 0.5)
y <- rpois(n, lambda = 5)
p <- mean(y >= x)
p + c(-1,1) * qnorm(.975) * sqrt(p*(1-p)/n)
## [1] 0.5562054 0.5581526