To follow along, use the lab04 code.
R is made up primarily of objects and functions. We discussed many R objects in lab 01, but there are many more.
Functions are written to be performed on objects. The function below simply adds 1 to whatever the argument is.
add1 <- function(a) {
return(a+1)
}
add1(2)
## [1] 3
add1(a = 3)
## [1] 4
The following function adds the two arguments.
add <- function(a, b) {
return(a+b)
}
add(1,1)
## [1] 2
add(a = 2, b = 3)
## [1] 5
Many functions in R are vectorized, meaning they can take vector arguments and return a sensible result. As written, our functions are vectorized, e.g.
add1(a = 1:2)
## [1] 2 3
add(a = 1:2, b = 3:4)
## [1] 4 6
This is usually helpful, but it is not always clear what the results should be. For example, what do you think the result is if you run
add(a = 1:4, b = 1:2)
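The answer hinges on R's recycling rule: when vectors of different lengths are combined, the shorter one is repeated to match the longer one (with a warning if the longer length is not a multiple of the shorter).

```r
# b = 1:2 is recycled to c(1, 2, 1, 2) to match the length of a = 1:4,
# so the element-wise sums are 1+1, 2+2, 3+1, 4+2.
add <- function(a, b) {
  return(a + b)
}
add(a = 1:4, b = 1:2)
## [1] 2 4 4 6
```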
Arguments to functions can be provided by position, by name, or both. You can also specify arguments using a partial name.
f <- function(first, second) {
return(first/2 + second)
}
f(1,2)
## [1] 2.5
f(1, second = 2)
## [1] 2.5
f(first = 1, second = 2)
## [1] 2.5
f(second = 2, 1) # I thought this was going to be an error
## [1] 2.5
f(f = 1, s = 2)
## [1] 2.5
My suggestion is that you always use full argument names when writing your code. Due to RStudio’s tab-autocomplete, typing the full name should not slow you down.
Many functions have default arguments. You can specify default arguments in the functions you write by using the =
sign in the function definition.
add <- function(a, b = 1) {
return(a+b)
}
add(a = 1)
## [1] 2
add(a = 1:2)
## [1] 2 3
add(a = 2:3, b = 2)
## [1] 4 5
add(a = 3:4, b = -1)
## [1] 2 3
For example, the norm functions (rnorm(), dnorm(), etc.) have default arguments that correspond to a standard normal, i.e. mean = 0 and sd = 1.
hist(rnorm(1e6), probability = TRUE)
curve(dnorm(x), add = TRUE, col = 'red')
Create a function with arguments named first and second, and have the function subtract the second argument from the first. Provide a default for first that is the vector 1:2 and a default for second that is the value 1.
Click for solution
sub <- function(first = 1:2, second = 1) {
return(first - second)
}
R has generic functions, i.e. functions where the class of the object passed as the first argument determines which version of the function actually runs. The summary() function is a generic function. It will return different output depending on what the first argument is, e.g.
summary(c(TRUE,FALSE,TRUE))
## Mode FALSE TRUE
## logical 1 2
summary(1:10)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.25 5.50 5.50 7.75 10.00
In the help file it says “summary is a generic function”.
?summary
To determine exactly what function will be executed when using a generic function, you need to determine the class of the object. Here is an example:
n <- 10
x <- rnorm(n)
y <- rnorm(n, x)
m <- lm(y~x) # a linear regression
summary(m)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2472 -0.3727 0.1527 0.3081 1.0535
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1156 0.2338 0.494 0.634
## x 0.3269 0.2660 1.229 0.254
##
## Residual standard error: 0.6698 on 8 degrees of freedom
## Multiple R-squared: 0.1588, Adjusted R-squared: 0.0537
## F-statistic: 1.511 on 1 and 8 DF, p-value: 0.2539
Here the generic function summary() checked the class of m:
class(m)
## [1] "lm"
Since the class is lm, R used the corresponding summary function, i.e. summary.lm().
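To see dispatch in action, we can register a method for a made-up class ourselves (the class name grades and the function below are purely illustrative, not part of base R):

```r
# Create an object with a made-up class "grades"
g <- structure(list(scores = c(90, 85, 72)), class = "grades")
# Defining a function named summary.grades() makes the generic
# summary() dispatch to it for any object of class "grades"
summary.grades <- function(object, ...) {
  cat("Average score:", mean(object$scores), "\n")
}
summary(g)
```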
There are other helpful functions for lm
objects, e.g.
anova(m)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 0.6777 0.67768 1.5108 0.2539
## Residuals 8 3.5885 0.44857
opar <- par(mfrow=c(2,2)); plot(m); par(opar)
The objects available to an R function are determined by the function's scope. If an object is not defined within the function, the function will look in the environment containing that function to see if the object is there.
a <- 1
f <- function() {
return(a)
}
f()
## [1] 1
If we modify the object within the function, that version of the object is only available within the function.
f <- function() {
a <- 2
return(a)
}
f()
## [1] 2
a
## [1] 1
Unless we use the special assignment operator <<-:
f <- function() {
a <<- a + 1
return(a)
}
a
## [1] 1
f()
## [1] 2
a
## [1] 2
So be careful when constructing functions to deal appropriately with the scope of the objects you are using!
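One way to sidestep these scoping surprises is to avoid <<- entirely: pass objects in as arguments and assign the returned value at the call site. A small sketch (the function name increment is just illustrative):

```r
a <- 1
increment <- function(x) {
  x + 1  # works on the function's own copy of x
}
a <- increment(a)  # the update is explicit at the call site
a
## [1] 2
```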
Basic programming often consists of using loops and conditionals. R has for
and while
loops as well as if
-else
statements.
Here is a basic for
loop:
for (i in 1:3) {
print(i)
}
## [1] 1
## [1] 2
## [1] 3
A basic while
loop:
i <- 0
while (i < 3) {
i <- i + 1
print(i)
}
## [1] 1
## [1] 2
## [1] 3
And a basic use of the if
-else
statement:
for (i in 1:10) {
if (i<5) {
print(-i)
} else {
print(i*2)
}
}
## [1] -1
## [1] -2
## [1] -3
## [1] -4
## [1] 10
## [1] 12
## [1] 14
## [1] 16
## [1] 18
## [1] 20
There is also a useful ifelse() function, but it is a little confusing. Its first argument is a logical vector, followed by two more arguments. For each element in the logical vector, it returns the second argument when the element is TRUE and the third argument when the element is FALSE; scalar arguments are recycled to the length of the logical vector.
ifelse((1:10) < 5, "Less than 5", "Not less than 5")
## [1] "Less than 5" "Less than 5" "Less than 5" "Less than 5" "Not less than 5" "Not less than 5"
## [7] "Not less than 5" "Not less than 5" "Not less than 5" "Not less than 5"
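Note that the second and third arguments need not be scalars; vectors are selected from elementwise (and recycled if needed):

```r
# For each TRUE, take the corresponding element of the second
# argument; for each FALSE, the corresponding element of the third.
ifelse(c(TRUE, FALSE, TRUE), c("a", "b", "c"), c("x", "y", "z"))
## [1] "a" "y" "c"
```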
Create a function that has a single scalar argument named a. Use a loop inside the function to calculate the sum of the numbers from 1 to a and return this sum. Test it by summing the numbers up to 100, which should give 5050.
Click for solution
f <- function(a) {
sum <- 0
for (i in 1:a) {
sum <- sum + i
}
return(sum)
}
f(100)
## [1] 5050
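A loop is instructive here, but in practice R's vectorized sum() (or the closed form a(a+1)/2) does the same job in one line:

```r
# Vectorized alternative to the loop
sum(1:100)
## [1] 5050
# Closed-form check: a*(a+1)/2 with a = 100
100 * 101 / 2
## [1] 5050
```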
With the prevalence of computers, statistics is increasingly utilizing methods based on simulations to answer scientific questions of interest. These methods are called Monte Carlo methods.
Suppose we have a random variable with the probability mass function
x | 0 | 1 | 2 | 3 |
---|---|---|---|---|
p(x) | .5 | .25 | .1 | .15 |
Now you are interested in estimating its expectation. If you can simulate from this distribution, then you can just take an average of the simulated values.
To simulate from a discrete random variable with finite support, you can use the sample()
function.
x <- sample(0:3, size = 1e5, prob = c(.5,.25,.1,.15), replace = TRUE)
Now we can calculate the mean and standard deviation (or variance) of these simulated values; these are estimates of the expectation and standard deviation (or variance) of the random variable.
(m <- mean(x))
## [1] 0.89799
(s <- sd(x))
## [1] 1.089548
s^2 # variance
## [1] 1.187116
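For this pmf the exact moments are easy to compute, so we can check the Monte Carlo estimates against them:

```r
x_vals <- 0:3
p_vals <- c(.5, .25, .1, .15)
(mu <- sum(x_vals * p_vals))    # exact expectation E[X]
## [1] 0.9
sum(x_vals^2 * p_vals) - mu^2   # exact variance E[X^2] - E[X]^2
## [1] 1.19
```

Both agree closely with the simulated values above.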
The CLT can be used to provide a 100(1-a)% confidence interval \[ \overline{X} \pm z_{a/2} s/\sqrt{n} \] where \(\overline{X}\) is the sample mean, \(z_{a/2}\) is the z critical value, \(s\) is the sample standard deviation, and \(n\) is the size of our sample. For example, a 90% interval is
mean(x) + c(-1,1) *
qnorm(.95) *
sd(x) / sqrt( length(x) )
## [1] 0.8923227 0.9036573
Find the expectation for a discrete random variable with the following pmf:
x | 10 | 12 | 21 | 33 |
---|---|---|---|---|
p(x) | .1 | .2 | .3 | .4 |
Also find a 99% interval for this expectation using the CLT based on a million samples.
Click for solution
x <- sample(c(10,12,21,33), size = 1e6, prob = c(.1,.2,.3,.4), replace = TRUE)
mean(x) + c(-1,1) *
qnorm(.995) *
sd(x) / sqrt( length(x) )
## [1] 22.87629 22.92305
We can also use simulations to estimate probabilities. Suppose we have a random variable \(X\sim Unif(0,1)\) and a random variable \(Y\sim N(0,1)\), and we are interested in calculating the probability that \(X\) is larger than \(Y\). This is not an easy calculation to do by hand, but we can simply simulate many \(X\) and \(Y\) values and count the number of times that an \(X\) value is larger than its associated \(Y\) value. Note that the event \(X>Y\) has a Bernoulli distribution, i.e. there is some probability that \(X>Y\) (precisely the probability we are trying to find). If we do this repeatedly for independent pairs of \(X\) and \(Y\) and record the number of times that \(X>Y\), this sum has a binomial distribution. Thus we can use the CLT with mean \(np\) and variance \(np(1-p)\), or, rather than using the sum, we can take the average and use the CLT with mean \(p\) and variance \(p(1-p)/n\). Since we are trying to estimate \(p\), the latter makes more sense.
So here is the simulation
n <- 1e5
x <- runif(n)
y <- rnorm(n)
(p <- mean(x>y))
## [1] 0.68366
and a 95% interval based on the CLT
p + c(-1,1) * qnorm(.975) * sqrt(p*(1-p)/n)
## [1] 0.6807777 0.6865423
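In this particular case we can also check the simulation numerically: conditioning on \(X = x\), \(P(x > Y) = \Phi(x)\), so \(P(X>Y) = \int_0^1 \Phi(x)\,dx\), which R's integrate() can evaluate:

```r
# Numerically integrate the standard normal cdf over (0, 1)
integrate(pnorm, lower = 0, upper = 1)
```

The result (about 0.684) falls inside the simulation-based interval above.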
Suppose \(X\sim Bin(10,.5)\) and \(Y\sim Po(5)\). Calculate the probability that \(Y \ge X\). Since these are discrete random variables, the equal-to sign is important. As a reminder, to get greater than or equal to in R, use >=, and the random Poisson generator is rpois(). Provide a 95% interval based on a million samples.
Click for solution
n <- 1e6
x <- rbinom(n, size = 10, prob = 0.5)
y <- rpois(n, lambda = 5)
p <- mean(y >= x)
p + c(-1,1) * qnorm(.975) * sqrt(p*(1-p)/n)
## [1] 0.5562054 0.5581526