To follow along, use the lab01 code.

Detailed introduction

For an extremely detailed introduction, please see

help.start()

In this documentation, the above command will be executed at the command prompt, see below.

Brief introduction to R

From help.start():

R is an integrated suite of software facilities for data manipulation, calculation and graphical display.

and from https://www.rstudio.com/products/RStudio/:

RStudio is an integrated development environment (IDE) for R.

R interface

In contrast to many other statistical software packages that use a point-and-click interface, e.g. SPSS, JMP, and Stata, R has a command-line interface. The command line has a command prompt, e.g. >, see below.

>

This means that you will enter commands on this command line and press Enter to execute them, e.g.

help()

Use the up arrow to recover past commands.

hepl() # typo: R reports that it could not find function "hepl"
help() # use the up arrow to recall the command, then fix the typo

R GUI (or RStudio)

Most likely, you are using a graphical user interface (GUI). In addition to the command line, you therefore also have a windowed version of R with some point-and-click options, e.g. File, Edit, and Help.

In particular, there is an editor to create a new R script. So rather than entering commands on the command line, you will write commands in a script and then send those commands to the command line using Ctrl-R (PC) or Command-Enter (Mac). In RStudio on a PC, the corresponding shortcut is Ctrl-Enter.

a = 1 
b = 2
a + b
## [1] 3

Multiple lines can be run in sequence by selecting them and then using Ctrl-R (PC) or Command-Enter (Mac).

Intro Activity

One of the most effective ways to use this documentation is to cut-and-paste the commands into a script and then execute them.

Cut-and-paste the following commands into a new script and then run those commands directly from the script using Ctrl-R (PC) or Command-Enter (Mac).

x <- 1:10
y <- rep(c(1,2), each=5)
m <- lm(y~x)
s <- summary(m)

Now, look at the result of each line

x
y
m
s
s$r.squared

Solution

x
##  [1]  1  2  3  4  5  6  7  8  9 10
y
##  [1] 1 1 1 1 1 2 2 2 2 2
m
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##      0.6667       0.1515
s
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4242 -0.1667  0.0000  0.1667  0.4242 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   0.6667     0.1880   3.546  0.00756 **
## x             0.1515     0.0303   5.000  0.00105 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2752 on 8 degrees of freedom
## Multiple R-squared:  0.7576, Adjusted R-squared:  0.7273 
## F-statistic:    25 on 1 and 8 DF,  p-value: 0.001053
s$r.squared
## [1] 0.7575758

Using R as a calculator

Basic calculator operations

All basic calculator operations can be performed in R.

1+2
## [1] 3
1-2
## [1] -1
1/2
## [1] 0.5
1*2
## [1] 2
2^3 # same as 2**3
## [1] 8

For now, you can ignore the [1] at the beginning of the line; we’ll learn about that when we get to vectors.

Advanced calculator operations

Many advanced calculator operations are also available.

(1+3)*2 + 100^2  # standard order of operations (PEMDAS)
## [1] 10008
sin(2*pi)        # the result is in scientific notation, i.e. -2.449294 x 10^-16 
## [1] -2.449294e-16
sqrt(4)
## [1] 2
log(10)          # the default is base e
## [1] 2.302585
log(10, base = 10)
## [1] 1
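
The sin(2*pi) result above is tiny but not exactly zero, because R stores numbers in finite-precision floating point. The following is a minimal sketch, using only base R functions, of comparing such a result with a numerical tolerance rather than with ==. The variable name x is chosen here purely for illustration.

```r
# sin(2*pi) is mathematically 0, but floating-point arithmetic leaves a tiny remainder
x <- sin(2*pi)

x == 0                    # exact comparison fails
## [1] FALSE
isTRUE(all.equal(x, 0))   # comparison with a small numerical tolerance succeeds
## [1] TRUE
```

In practice, all.equal() wrapped in isTRUE() is usually the safer choice whenever a value is the result of a calculation.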

Using variables

A real advantage to using R rather than a calculator (or calculator app) is the ability to store quantities using variables.

a = 1
b = 2
a + b
## [1] 3
a - b
## [1] -1
a / b
## [1] 0.5
a * b
## [1] 2
b ^ 3
## [1] 8

Assignment operators =, <-, and ->

When assigning values to variables, you can also use the arrows <- and ->, and you will often see these in code, e.g.

a <- 1 # recommended
2 -> b # uncommon, but sometimes useful
c = 3  # similar to other languages

Now print them.

a
## [1] 1
b
## [1] 2
c
## [1] 3

Using informative variable names

While using variables alone is useful, it is much more useful to use informative variable names.

# Rectangle
length <- 4
width  <- 3

area <- length * width
area
## [1] 12
perimeter <- 2 * (length + width)


# Circle
radius <- 2

area   <- pi*radius^2 # this overwrites the previous `area` variable
circumference <- 2*pi*radius

area
## [1] 12.56637
circumference
## [1] 12.56637
# (Right) Triangle
opposite     <- 1
angleDegrees <- 30
angleRadians <- angleDegrees * pi/180

(adjacent     <- opposite / tan(angleRadians)) # = sqrt(3)
## [1] 1.732051
(hypotenuse   <- opposite / sin(angleRadians)) # = 2
## [1] 2

Calculator Activity

Bayes’ Rule

Suppose an individual tests positive for a disease. What is the probability that the individual has the disease? Let

  • \(D\) indicate that the individual has the disease,
  • \(N\) indicate that the individual does not have the disease,
  • \(+\) indicate a positive test result, and
  • \(-\) indicate a negative test result.

The above probability can be calculated using Bayes’ Rule:

\[ P(D|+) = \frac{P(+|D)P(D)}{P(+|D)P(D)+P(+|N)P(N)} = \frac{P(+|D)P(D)}{P(+|D)P(D)+(1-P(-|N))\times(1-P(D))} \]

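where

  • \(P(+|D)\) is the sensitivity of the test,
  • \(P(-|N)\) is the specificity of the test, and
  • \(P(D)\) is the prevalence of the disease.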

Calculate the probability the individual has the disease if the test is positive when

  • the specificity of the test is 0.95,
  • the sensitivity of the test is 0.99, and
  • the prevalence of the disease is 0.001.

Solution

specificity <- 0.95
sensitivity <- 0.99
prevalence <- 0.001
probability <- (sensitivity*prevalence) / (sensitivity*prevalence + (1-specificity)*(1-prevalence))
probability
## [1] 0.01943463
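
The same Bayes' Rule calculation can be wrapped in a function so that it is easy to repeat with different inputs. This is only a sketch: the function name posterior_probability and its internal variable names are hypothetical and not part of the lab code.

```r
# Hypothetical helper (not part of the lab code): P(D|+) from Bayes' Rule
posterior_probability <- function(sensitivity, specificity, prevalence) {
  true_positive  <- sensitivity * prevalence              # P(+|D) P(D)
  false_positive <- (1 - specificity) * (1 - prevalence)  # P(+|N) P(N)
  true_positive / (true_positive + false_positive)
}

# Same inputs as the activity
posterior_probability(sensitivity = 0.99, specificity = 0.95, prevalence = 0.001)
## [1] 0.01943463

# The same test applied to a much more common disease
posterior_probability(sensitivity = 0.99, specificity = 0.95, prevalence = 0.1)
## [1] 0.6875
```

The comparison suggests why a positive result for a rare disease is often a false positive: with a prevalence of 0.001, the false positives from the large healthy population outnumber the true positives.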

Data types

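Objects in R can be broadly classified according to their dimensions:

  • scalar,
  • vector,
  • matrix, and
  • array (more than 2 dimensions),

and according to the type of variable they contain:

  • numeric,
  • character,
  • logical, and
  • factor.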

Scalars

Scalars have a single value assigned to the object in R.

a <- 3.14159265 
b <- "STAT 587 (Eng)" 
c <- TRUE

Print the objects

a
## [1] 3.141593
b
## [1] "STAT 587 (Eng)"
c
## [1] TRUE

Vectors

The c() function creates a vector in R

a <- c(1, 2, -5, 3.6)
b <- c("STAT", "587", "(Eng)")
c <- c(TRUE, FALSE, TRUE, TRUE)

To determine the length of a vector in R use length()

length(a)
## [1] 4
length(b)
## [1] 3
length(c)
## [1] 4

To determine the type of a vector in R use class()

class(a)
## [1] "numeric"
class(b)
## [1] "character"
class(c)
## [1] "logical"

Vector construction

Create a numeric vector that is a sequence using : or seq().

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
5:-2
## [1]  5  4  3  2  1  0 -1 -2
seq(from = 2, to = 5, by = .05)
##  [1] 2.00 2.05 2.10 2.15 2.20 2.25 2.30 2.35 2.40 2.45 2.50 2.55 2.60 2.65 2.70
## [16] 2.75 2.80 2.85 2.90 2.95 3.00 3.05 3.10 3.15 3.20 3.25 3.30 3.35 3.40 3.45
## [31] 3.50 3.55 3.60 3.65 3.70 3.75 3.80 3.85 3.90 3.95 4.00 4.05 4.10 4.15 4.20
## [46] 4.25 4.30 4.35 4.40 4.45 4.50 4.55 4.60 4.65 4.70 4.75 4.80 4.85 4.90 4.95
## [61] 5.00

Another useful function to create vectors is rep()

rep(1:4, times = 2)
## [1] 1 2 3 4 1 2 3 4
rep(1:4, each  = 2)
## [1] 1 1 2 2 3 3 4 4
rep(1:4, each  = 2, times = 2)
##  [1] 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4

Arguments to functions in R can be referenced either by position or by name or both. The safest and easiest to read approach is to name all your arguments. I will often name all but the first argument.
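
As a small illustration of argument matching, the calls below to seq() should all produce the same vector. The variable names by_position, by_name, and mixed are chosen here for illustration only.

```r
# Three equivalent calls to seq(): arguments matched by position, by name, or both
by_position <- seq(2, 5, 0.5)
by_name     <- seq(from = 2, to = 5, by = 0.5)
mixed       <- seq(2, to = 5, by = 0.5)

by_name
## [1] 2.0 2.5 3.0 3.5 4.0 4.5 5.0
identical(by_position, by_name)
## [1] TRUE
identical(by_name, mixed)
## [1] TRUE

# Named arguments can appear in any order
identical(seq(by = 0.5, to = 5, from = 2), by_name)
## [1] TRUE
```

The positional call is the shortest, but a reader has to remember that the third argument of seq() is by, which is why naming arguments tends to be easier to read.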

Accessing vector elements

Elements of a vector can be accessed using brackets, e.g. [index].

a <- c("one","two","three","four","five")
a[1]
## [1] "one"
a[2:4]
## [1] "two"   "three" "four"
a[c(3,5)]
## [1] "three" "five"
a[rep(3,4)]
## [1] "three" "three" "three" "three"

Alternatively we can access elements using a logical vector where only TRUE elements are accessed.

a[c(TRUE, TRUE, FALSE, FALSE, FALSE)]
## [1] "one" "two"
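
Logical vectors are rarely typed by hand. More often they come from a comparison, as in the sketch below. The names long_name and x are illustrative only, and nchar() is a base R function that counts the characters in each string.

```r
a <- c("one","two","three","four","five")
long_name <- nchar(a) > 3   # TRUE for elements with more than 3 characters
long_name
## [1] FALSE FALSE  TRUE  TRUE  TRUE
a[long_name]
## [1] "three" "four"  "five"

x <- c(1, 2, -5, 3.6)
x[x > 0]                    # keep only the positive values
## [1] 1.0 2.0 3.6
```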

You can also access all elements except some by using a negative sign -.

a[-1]
## [1] "two"   "three" "four"  "five"
a[-(2:3)]
## [1] "one"  "four" "five"

Modifying elements of a vector

You can assign new values to elements in a vector using = or <-.

a[2] <- "twenty-two"
a
## [1] "one"        "twenty-two" "three"      "four"       "five"
a[3:4] <- "three-four" # assigns "three-four" to both the 3rd and 4th elements
a
## [1] "one"        "twenty-two" "three-four" "three-four" "five"
a[c(3,5)] <- c("thirty-three","fifty-five")
a
## [1] "one"          "twenty-two"   "thirty-three" "three-four"   "fifty-five"

Matrices

Matrices can be constructed using cbind(), rbind(), and matrix():

m1 <- cbind(c(1,2), c(3,4))       # Column bind
m2 <- rbind(c(1,3), c(2,4))       # Row bind

m1
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
all.equal(m1, m2)
## [1] TRUE
m3 <- matrix(1:4, nrow = 2, ncol = 2)
all.equal(m1, m3)
## [1] TRUE
m4 <- matrix(1:4, nrow = 2, ncol = 2, byrow = TRUE)
all.equal(m3, m4)
## [1] "Mean relative difference: 0.4"
m3
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
m4
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
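
Two base R functions that may help when checking how a matrix was filled are dim() and t(). The sketch below recreates m3 and m4 so that it can be run on its own.

```r
m3 <- matrix(1:4, nrow = 2, ncol = 2)                 # filled column by column
m4 <- matrix(1:4, nrow = 2, ncol = 2, byrow = TRUE)   # filled row by row

dim(m3)                 # number of rows, then number of columns
## [1] 2 2
identical(t(m3), m4)    # t() transposes, i.e. swaps rows and columns
## [1] TRUE
```

Filling by row therefore gives the transpose of filling by column here, because both matrices are square and use the same values.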

Accessing matrix elements

Elements of a matrix can be accessed using brackets separated by a comma, e.g. [row index, column index].

m <- matrix(1:12, nrow=3, ncol=4)
m
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
m[2,3]
## [1] 8

Multiple elements can be accessed at once

m[1:2,3:4]
##      [,1] [,2]
## [1,]    7   10
## [2,]    8   11

If no row (column) index is provided, then the whole row (column) is accessed.

m[1:2,]
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11

Like vectors, you can eliminate rows (or columns)

m[-c(3,4),]
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11

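Be careful not to forget the comma. Without it, R treats the matrix as a vector stored column by column and returns individual elements rather than rows, e.g.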

m[1:4]
## [1] 1 2 3 4

You can also construct an object with more than 2 dimensions using the array() function.
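
A minimal sketch of a 3-dimensional array, assuming the illustrative name arr, is shown below. Indexing works as for matrices, with one index per dimension.

```r
arr <- array(1:24, dim = c(2, 3, 4))   # 2 rows, 3 columns, 4 "layers"
dim(arr)
## [1] 2 3 4
arr[, , 1]                             # the first layer is a 2 x 3 matrix
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
arr[2, 3, 4]                           # the last element
## [1] 24
```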

Cannot mix types

You cannot mix types within a vector, matrix, or array; R converts all elements to a single common type.

c(1, "a")
## [1] "1" "a"

The number 1 is in quotes indicating that R is treating it as a character rather than a numeric.

c(TRUE, 1, FALSE)
## [1] 1 1 0

The logicals are converted to numeric (0 for FALSE and 1 for TRUE).

c(TRUE, 1, "a")
## [1] "TRUE" "1"    "a"

Everything is converted to a character.
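
The conversions above can be checked with class() and reversed with functions such as as.numeric(). The sketch below uses the illustrative names mixed and flags.

```r
mixed <- c(1, "a")
class(mixed)                 # the whole vector is character
## [1] "character"

as.numeric(c("1", "2.5"))    # explicit conversion from character to numeric
## [1] 1.0 2.5

# Because TRUE is 1 and FALSE is 0, sum() counts the TRUE values
# and mean() gives the proportion of TRUE values
flags <- c(TRUE, FALSE, TRUE, TRUE)
sum(flags)
## [1] 3
mean(flags)
## [1] 0.75
```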

Activity

Using the matrix below,

  1. Print the element in the 3rd row and 4th column
  2. Print the 2nd column
  3. Print all but the 3rd row
m <- rbind(c(1, 12, 8, 6),
          c(4, 10, 2, 9),
          c(11, 3, 5, 7))
m
##      [,1] [,2] [,3] [,4]
## [1,]    1   12    8    6
## [2,]    4   10    2    9
## [3,]   11    3    5    7

If you have extra time, try to create this same matrix using the matrix() function.

Solution

# Print the element in the 3rd-row and 4th column
m[3,4]
## [1] 7
# Print the 2nd column
m[,2]
## [1] 12 10  3
# Print all but the 3rd row
m[-3,]
##      [,1] [,2] [,3] [,4]
## [1,]    1   12    8    6
## [2,]    4   10    2    9
# Reconstruct the matrix if time allows
n <- matrix(c(1,12,8,6,4,10,2,9,11,3,5,7), nrow=3, ncol=4, byrow=TRUE)
n
##      [,1] [,2] [,3] [,4]
## [1,]    1   12    8    6
## [2,]    4   10    2    9
## [3,]   11    3    5    7
all.equal(m,n)
## [1] TRUE

Data frames

A data.frame is a matrix-like object that allows different data types in different columns.

class(warpbreaks) # warpbreaks is a built-in data.frame
## [1] "data.frame"

Access data.frame elements

A data.frame can be accessed just like a matrix, e.g. [row index, column index].

warpbreaks[1:3,1:2]
##   breaks wool
## 1     26    A
## 2     30    A
## 3     54    A

The columns of a data.frame can also be accessed by name. To determine the column names, use the names() function.

names(warpbreaks)
## [1] "breaks"  "wool"    "tension"
warpbreaks[1:3, c("breaks","wool")]
##   breaks wool
## 1     26    A
## 2     30    A
## 3     54    A
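
A single column can also be extracted as a vector with $ or [[ ]]. The sketch below uses the built-in warpbreaks data, with head() limiting the output to the first three values.

```r
head(warpbreaks$breaks, n = 3)        # `$` extracts one column as a vector
## [1] 26 30 54
head(warpbreaks[["wool"]], n = 3)     # `[[` does the same using a character string
## [1] A A A
## Levels: A B
identical(warpbreaks$breaks, warpbreaks[, "breaks"])
## [1] TRUE
```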

Different data types in different columns

The function str() allows you to see the structure of any object in R. Using str() on a data.frame object tells you

  1. that the object is a data.frame,
  2. the number of rows and columns,
  3. the names of each column,
  4. each column’s data type, and
  5. the first few elements in that column.
str(warpbreaks)
## 'data.frame':    54 obs. of  3 variables:
##  $ breaks : num  26 30 54 25 70 52 51 26 67 18 ...
##  $ wool   : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
##  $ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...
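
The individual pieces of information reported by str() can also be obtained one at a time, which may be convenient inside a script. A short sketch using base R functions:

```r
dim(warpbreaks)               # number of rows, then number of columns
## [1] 54  3
class(warpbreaks$breaks)      # data type of one column
## [1] "numeric"
class(warpbreaks$wool)
## [1] "factor"
levels(warpbreaks$tension)    # the possible values of a factor
## [1] "L" "M" "H"
```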

Packages

Much of the functionality in R is available in R packages. Many of these packages are hosted on CRAN and mirrored to many locations, e.g. ISU CRAN mirror. Bioconductor also hosts a number of packages primarily for bioinformatics uses. Packages can also be installed directly from GitHub. We will focus on packages available on CRAN.

Installing packages from CRAN

One of the first packages we will need in these labs is ggplot2. Since this package is hosted on CRAN, we can install it using

install.packages("ggplot2")

You may have to choose a CRAN repository. I suggest choosing the repository that is geographically the closest, e.g. ISU CRAN mirror.

When you install a package, the installation process will automatically install any dependencies for that package. For example, ggplot2 depends on rlang, tibble, and other packages.

The ggplot2 package is part of the tidyverse and we will be using many of the packages in the tidyverse throughout this course. Thus I suggest you go ahead and install the tidyverse package which is primarily a wrapper to install ggplot2, dplyr, readr, and other packages.

install.packages("tidyverse")
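
Since a package only needs to be installed once, a script can check whether the package is already present before installing it. The helper below is a sketch: the name install_if_missing is hypothetical, while requireNamespace() and install.packages() are standard R functions.

```r
# Hypothetical helper: install a CRAN package only if it is not already installed
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # report (invisibly) whether the package is available afterwards
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call never triggers an installation
install_if_missing("stats")

# install_if_missing("ggplot2")   # uncomment to install ggplot2 only if it is absent
```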

Installing packages from GitHub

Packages hosted on CRAN undergo substantial testing to ensure the technicalities of package creation are satisfied. This testing does not automatically ensure the package does what it says it does. Many package authors don’t want to go through the hassle of this testing and thus host their package on GitHub. In order to install a package directly from GitHub, you need to first install the devtools package and, if you are on a Windows system, Rtools.

To install the devtools package, use

install.packages("devtools")

and to install Rtools, follow these instructions.

Once devtools has been installed, you can use the install_github() function from the devtools package to install a package from GitHub. To install a package, you need to know the GitHub username and the repository name. For example, to install the swgoh package from user jarad (that’s me), you can use

devtools::install_github("jarad/swgoh")

Loading packages

A package only needs to be installed once on your system (until you upgrade R), but you will need to either load the package when you want to use it or explicitly indicate the package name when using functions in the package.

To load the package in an R session, use

library("ggplot2")

If you want to explicitly call the function, then you use ::. For example

devtools::install_github("jarad/swgoh")

uses the install_github() function from the devtools package.
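
The two approaches can be compared without installing anything by using the stats package, which ships with R. This is only an illustration of the pattern.

```r
# Explicit call: the package name is stated, and no library() call is needed
stats::median(c(1, 5, 3))
## [1] 3

# After loading, the function can be called by name alone
library("stats")   # stats is loaded by default; shown only to illustrate the pattern
median(c(1, 5, 3))
## [1] 3
```

The explicit form makes it clear where a function comes from, which can help when two loaded packages define functions with the same name.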

Personally, I have started to do both in my scripts. That is, at the top of the script I include a library call for every package that is required by the script. Then, throughout the script, I explicitly call each function that comes from a manually loaded package.

tidyverse R style guide

Coding in R, like any language, is a process. It is helpful if you have a structure to start from. The tidyverse R style guide provides suggestions for best practices when coding in R.

Getting help for R

Learning R

To learn R, you may want to try the swirl package. To install, use

install.packages("swirl")

After installation, use the following to get started

library("swirl")
swirl()

General help

As you work with R, there will be many times when you need to get help.

My basic approach is

  1. Use the help contained within R.
  2. Perform an internet search for an answer.
  3. Find somebody else who knows.

In all cases, knowing the R keywords, e.g. a function name, will be extremely helpful.

Help within R I

If you know the function name, then you can use ?<function>, e.g.

?seq

The structure of help is

  • Description: quick description of what the function does
  • Usage: the arguments, their order, and default values (if any)
  • Arguments: more thorough description about the arguments
  • Details: more in depth information about the function
  • Value: what the function returns
  • See Also: similar functions
  • Examples: examples of how to use the function

Help within R II

If you cannot remember the function name, then you can use help.search("<something>"), e.g.

help.search("sequence")

Depending on how many packages you have installed, you will find a lot or a little here.
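
If you remember only part of a function name, the base R function apropos() may also help. A small sketch, with the illustrative variable name seq_functions:

```r
# apropos() lists objects on the search path whose names match a pattern
seq_functions <- apropos("^seq")   # names that start with "seq"
"seq_len" %in% seq_functions
## [1] TRUE

# ??sequence   # shorthand for help.search("sequence")
```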

Getting help on ggplot2

Although the general R help can still be used, e.g. 

?ggplot
?geom_point

it is much more helpful to search the internet for an answer, e.g.

geom_point 
ggplot2 line colors

The top hits will typically include the code along with what the code produces.

Helpful sites

These sites all provide code. The first two also provide the plots that are produced.