Detailed introduction

For an extremely detailed introduction, please see

help.start()

Throughout this documentation, commands like the one above are meant to be executed at the command prompt, which is described below.

Brief introduction to R

From help.start():

R is an integrated suite of software facilities for data manipulation, calculation and graphical display.

and from https://www.rstudio.com/products/RStudio/:

RStudio is an integrated development environment (IDE) for R.

R interface

In contrast to many other statistical software packages that use a point-and-click interface, e.g. SPSS, JMP, Stata, etc., R has a command-line interface. The command line has a command prompt, e.g. >, see below.

>

This means that you will be entering commands on this command line and pressing Enter to execute them, e.g.

help()

Use the up arrow to recover past commands.

hepl()
help() # Use up arrow and fix

R GUI (or RStudio)

Most likely, you are using a graphical user interface (GUI). Therefore, in addition to the command line, you also have a windowed version of R with some point-and-click options, e.g. File, Edit, and Help.

In particular, there is an editor to create a new R script. So rather than entering commands on the command line, you will write commands in a script and then send those commands to the command line using Ctrl-R (PC) or Command-Enter (Mac).

a = 1 
b = 2
a+b
## [1] 3

Multiple lines can be run in sequence by selecting them and then using Ctrl-R (PC) or Command-Enter (Mac).

Intro Activity

One of the most effective ways to use this documentation is to cut-and-paste the commands into a script and then execute them.

Cut-and-paste the following commands into a new script and then run those commands directly from the script using Ctrl-R (PC) or Command-Enter (Mac).

x = 1:10
y = rep(c(1,2), each=5)
m = lm(y~x)
s = summary(m)

Now, look at the result of each line

x
y
m
s
s$r.squared

When you have completed the activity, compare your results to the solutions.

Using R as a calculator

Basic calculator operations

All basic calculator operations can be performed in R.

1+2
## [1] 3
1-2
## [1] -1
1/2
## [1] 0.5
1*2
## [1] 2

For now, you can ignore the [1] at the beginning of the line; we’ll learn about that when we get to vectors.

Advanced calculator operations

Many advanced calculator operations are also available.

(1+3)*2 + 100^2  # standard order of operations
## [1] 10008
sin(2*pi)        # the result is in scientific notation, i.e. -2.449294 x 10^-16 
## [1] -2.449294e-16
sqrt(4)
## [1] 2
10^2
## [1] 100
log(10)          # the default is base e
## [1] 2.302585
log(10, base=10)
## [1] 1
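
The sin(2*pi) result above is not exactly zero because R stores numbers in floating point, so pi is only an approximation. The following is a minimal sketch of how one might check that such a value is zero for practical purposes; the variable name x and the 1e-10 threshold are chosen only for illustration.

```r
x <- sin(2 * pi)          # mathematically 0, numerically a tiny number

x == 0                    # exact comparison is too strict for floating point
isTRUE(all.equal(x, 0))   # compares up to a small numerical tolerance
abs(x) < 1e-10            # checks that the value is negligibly small
```

In practice, all.equal() is usually a safer way to compare computed numbers than ==.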

Using variables

A real advantage to using R rather than a calculator (or calculator app) is the ability to store quantities using variables.

a = 1
b = 2
a+b
## [1] 3
a-b
## [1] -1
a/b
## [1] 0.5
a*b
## [1] 2

Assignment operators =, <-, and ->

When assigning values to variables, you can also use the arrows <- and ->, which you will often see in code, e.g.

a <- 1
2 -> b
c = 3  # is the same as <-

Now print them.

a
## [1] 1
b
## [1] 2
c
## [1] 3

Using informative variable names

While using variables alone is useful, it is much more useful to use informative variable names.

population = 1000
number_infected = 200
deaths = 3

death_rate = deaths / number_infected
attack_rate = number_infected / population

death_rate
## [1] 0.015
attack_rate
## [1] 0.2

Calculator Activity

Bayes’ Rule

Suppose an individual tests positive for a disease. What is the probability that the individual has the disease? Let

  • \(D\) indicate that the individual has the disease
  • \(N\) indicate that the individual does not have the disease
  • \(+\) indicate a positive test result
  • \(-\) indicate a negative test result

The above probability can be calculated using Bayes’ Rule:

\[ P(D|+) = \frac{P(+|D)P(D)}{P(+|D)P(D)+P(+|N)P(N)} = \frac{P(+|D)P(D)}{P(+|D)P(D)+(1-P(-|N))\times(1-P(D))} \]

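where

  • \(P(+|D)\) is the sensitivity of the test,
  • \(P(-|N)\) is the specificity of the test, and
  • \(P(D)\) is the prevalence of the disease.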

Calculate the probability the individual has the disease if the test is positive when

  • the specificity of the test is 0.95,
  • the sensitivity of the test is 0.99, and
  • the prevalence of the disease is 0.001.

When you have completed the activity, compare your results to the solutions.
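
As a check on the mechanics of the formula in this activity, it may help to see Bayes’ Rule written as R code. The sketch below deliberately uses made-up values (sensitivity 0.9, specificity 0.9, prevalence 0.1) rather than the activity's values, and the function name bayes_rule is hypothetical, not part of the workshop package.

```r
# Probability of disease given a positive test, P(D|+), via Bayes' Rule
bayes_rule <- function(sensitivity, specificity, prevalence) {
  true_positive  <- sensitivity * prevalence              # P(+|D) P(D)
  false_positive <- (1 - specificity) * (1 - prevalence)  # P(+|N) P(N)
  true_positive / (true_positive + false_positive)
}

# Made-up inputs, chosen only to make the arithmetic easy to check by hand:
# 0.09 / (0.09 + 0.09) = 0.5
result <- bayes_rule(sensitivity = 0.9, specificity = 0.9, prevalence = 0.1)
result
```

Using named arguments makes it harder to mix up sensitivity and specificity when calling the function.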

Reading data into R

In this section, we will learn how to read csv or Excel files into R. We focus on csv files because they are simplest to import, they can be easily exported from Excel (or other software), and they are portable, i.e. they can be used in other software.

Changing your working directory

One of the first tasks after starting R is to change the working directory. This directory will contain the files and scripts you will use for your current project. Today, that project is this workshop, so you might put a folder on your desktop named MWBDSSworkshop and use this as your working directory. To set the working directory,

  • in RStudio: Session > Set Working Directory > Choose Directory… (Ctrl + Shift + H)
  • in R GUI (Windows): File > Change Dir…
  • in R GUI (Mac): Misc > Change Working Directory…

Make sure you have write access to this directory.
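
The working directory can also be inspected and changed from the command line with getwd() and setwd(). The following is a minimal sketch; it uses a folder inside R's temporary directory as a stand-in for a desktop folder, so the path is an assumption and should be replaced with your own.

```r
old_dir <- getwd()   # remember where R is currently looking for files

# Stand-in for a folder such as MWBDSSworkshop on your desktop
workshop_dir <- file.path(tempdir(), "MWBDSSworkshop")
dir.create(workshop_dir, showWarnings = FALSE)

setwd(workshop_dir)                                   # change the working directory
current_dir <- getwd()                                # confirm the change
can_write <- file.access(current_dir, mode = 2) == 0  # TRUE means write access
current_dir
can_write

setwd(old_dir)       # return to the original directory
```

file.access() with mode = 2 tests for write permission and returns 0 on success.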

Installing and loading a package

Much of the functionality of R is contained in packages. The first time these packages are used, they need to be installed, e.g. to install a package from CRAN use

install.packages('dplyr')

Once installed, a package needs to be loaded into each R session where the package is used.

library('dplyr')
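
Since installation is only needed once, it can be handy to check whether a package is already installed before trying to load it. The following is a small sketch using requireNamespace() from base R; the messages are only illustrative.

```r
pkg <- "dplyr"

# requireNamespace() returns TRUE if the package is installed, FALSE otherwise,
# without attaching it to the session
if (requireNamespace(pkg, quietly = TRUE)) {
  message(pkg, " is installed; load it with library('", pkg, "')")
} else {
  message(pkg, " is not installed; run install.packages('", pkg, "') first")
}
```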

Load and start this workshop

First load the package

library("MWBDSSworkshop")

This package contains a function to help you get started, so run that function.

workshop(write_data = TRUE, write_scripts = TRUE)

This function did three things:

  1. It created a set of .csv data files in your working directory.
  2. It created a set of .R scripts in your working directory.
  3. It opened the workshop outline in a web browser.

Open an R script

As we progress through the workshop, the code for a particular module will be available in the R script for that module.

In R/RStudio, open the module called 01_intro.R. From here on out, as I run commands you should run the commands as well by using Ctrl-R (Windows) or Command-Enter (Mac) with the appropriate line(s) highlighted.

You will notice that nothing after a # will be evaluated by R. That is because the # character indicates a comment in the code. For example,

# This is just a comment. 
1+1 # So is this
## [1] 2
# 1+2

Reading data into R

Data are stored in many different formats. I will focus on data stored in a csv file, but mention approaches to reading in data stored in Excel, SAS, Stata, SPSS, and database formats.

Reading a csv file into R

The most common way I read data into R is through a comma-separated values (csv) file. The utils package (which is installed and loaded with base R) has a function called read.csv for reading csv files into R. For example,

GI = read.csv("GI.csv")

This created a data.frame object in R called GI.

The utils package has the read.table() function which is a more general function for reading data into R and it has many options. We could have gotten the same results if we had used the following code:

GI2 = read.table("GI.csv", 
                 header=TRUE, # There is a header.
                 sep=",")     # The column delimiter is a comma.

To check if the two data sets are equal, use the following

all.equal(GI, GI2)
## [1] TRUE
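
If GI.csv is not at hand, the same comparison can be tried on a tiny file. The sketch below writes a made-up three-column csv to a temporary file and reads it back both ways; the file contents and the names small1 and small2 are invented for illustration.

```r
tmp <- tempfile(fileext = ".csv")              # temporary file path
writeLines(c("id,age,gender",
             "1,7,Male",
             "2,41,Female"), tmp)              # a made-up csv with a header row

small1 <- read.csv(tmp)                        # csv-specific defaults
small2 <- read.table(tmp, header = TRUE, sep = ",")  # same options, spelled out

all.equal(small1, small2)                      # compare the two data.frames
```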

The read.csv function is available in base R, but these days I will often use the read_csv function in the readr package.

library("readr")
GI = read_csv("GI.csv")

This function has two main differences compared to the read.csv function:

  • it does NOT use make.names on the column names
  • it does NOT convert character columns to factors (read.csv did this by default before R 4.0.0)
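
To see what these two differences mean in practice without installing readr, the sketch below uses base R's read.csv with arguments that mimic the read_csv behavior. This illustrates the behavior only and is not how read_csv is implemented; the file contents are made up.

```r
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,chief complaint",             # note the space in the column name
             "1,Abd Pain",
             "2,diarrhea"), tmp)

d_default <- read.csv(tmp)                     # make.names is applied to the header
names(d_default)                               # the space is replaced by a period

d_asis <- read.csv(tmp,
                   check.names = FALSE,        # keep column names as written
                   stringsAsFactors = FALSE)   # keep text columns as character
names(d_asis)
class(d_asis[["chief complaint"]])
```

Column names containing spaces must be accessed with [[ ]] or backticks.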

Read Excel files

My main suggestion for reading Excel files into R is to

  1. Save the Excel file as a csv
  2. Read the csv file into R using read.csv

This approach will work regardless of any changes Excel makes in its document structure.

Reading an Excel xlsx file into R can be done using the read_excel function from the readxl R package. Unfortunately, many scenarios can cause this process to fail. Thus, we do not focus on this method in an introductory R course. When it works, it looks like this

install.packages("readxl")
library("readxl")
d = read_excel("filename.xlsx", sheet = 1) # select the sheet by position, or
d = read_excel("filename.xlsx", sheet = "sheetName") # select the sheet by name

More information can be found in the read_excel help file (?read_excel).

Again, if these approaches don’t work, you can Save as... a csv file in Excel.

Read SAS, Stata, or SPSS data files

The haven package provides functionality to read in SAS, Stata, and SPSS files. An example of reading a SAS file is

install.packages('haven')
library('haven')
d = read_sas('filename.sas7bdat')

Read a database file

There are many different types of databases, so the code you will need will be specific to the type of database you are trying to access.

The dplyr package, which we will be discussing today, has a number of functions to read from some databases. The code will look something like

library("dplyr")
my_db <- src_sqlite("my_db.sqlite3", create = TRUE)

The RODBC package has a number of functions to read from some databases. The code might look something like

install.packages("RODBC")
library("RODBC")

# RODBC Example
# import 2 tables (Crime and Punishment) from a DBMS
# into R data frames (and call them crimedat and pundat)

myconn <- odbcConnect("mydsn", uid="Rob", pwd="aardvark")
crimedat <- sqlFetch(myconn, "Crime")
pundat <- sqlQuery(myconn, "select * from Punishment")
close(myconn)

Exploring the data set

There are a number of functions that will provide information about a data.frame. Here are a few:

dim(GI)
## [1] 21244     9
nrow(GI)
## [1] 21244
ncol(GI)
## [1] 9
names(GI)       # column names
## [1] "id"              "date"            "facility"        "icd9"           
## [5] "age"             "zipcode"         "chief_complaint" "syndrome"       
## [9] "gender"
head(GI, n = 5) # first 5 rows of the data.frame
##           id       date facility   icd9 age zipcode chief_complaint
## 1 1001301988 2005-02-28       67 787.01   7   21075        Abd Pain
## 2 1001829757 2005-02-28       67 558.90  41   20721   upset stomach
## 3 1001581758 2005-02-28      123 787.91   2   22152        diarrhea
## 4 1001950471 2005-02-28      123 787.91  71   22060        ABD PAIN
## 5 1001076304 2005-02-28      309 558.90  28   21702   LOWER AD PAIN
##   syndrome gender
## 1       GI   Male
## 2       GI Female
## 3       GI   Male
## 4       GI   Male
## 5       GI Female
tail(GI, n = 5) # last 5 rows of the data.frame
##               id       date facility   icd9 age zipcode
## 21240 1001392877 2008-12-30      123 787.01   6   22153
## 21241 1001887911 2008-12-30       37 535.50  72   22033
## 21242 1001196061 2008-12-30     6200 536.20  59   20155
## 21243 1001067104 2008-12-30       67 558.90  80   22202
## 21244 1001396039 2008-12-30      123 787.03  12   20112
##                     chief_complaint syndrome gender
## 21240                         N/V/D       GI   Male
## 21241               VOMITING CHILLS       GI   Male
## 21242 LEFT CHEST AND ABDOMINAL PAIN       GI   Male
## 21243                      ABD PAIN       GI Female
## 21244                      Diarrhea       GI   Male
str(GI)
## 'data.frame':    21244 obs. of  9 variables:
##  $ id             : int  1001301988 1001829757 1001581758 1001950471 1001076304 1001087075 1001536948 1001966448 1001868408 1001356109 ...
##  $ date           : Factor w/ 1399 levels "2005-02-28","2005-03-01",..: 1 1 1 1 1 2 2 2 2 2 ...
##  $ facility       : int  67 67 123 123 309 66 6201 67 66 66 ...
##  $ icd9           : num  787 559 788 788 559 ...
##  $ age            : int  7 41 2 71 28 43 12 1 25 64 ...
##  $ zipcode        : int  21075 20721 22152 22060 21702 20762 22192 20121 20772 20602 ...
##  $ chief_complaint: Factor w/ 2936 levels " /v"," /V/D",..: 315 2600 1100 15 1741 2238 640 2480 1833 2311 ...
##  $ syndrome       : Factor w/ 1 level "GI": 1 1 1 1 1 1 1 1 1 1 ...
##  $ gender         : Factor w/ 2 levels "Female","Male": 2 1 2 2 1 1 2 2 2 2 ...

Activity

If you brought your own Excel file, open it and save a sheet as a csv file in your working directory. If you brought your own csv file, save it in your working directory. If you did not bring your own file, use the fluTrends.csv file in your working directory.

Try to use the read.csv or read.table function to read the file into R. There are a number of different options in the read.table() function that may be useful:

d = read.table("filename.csv", # Make sure to change filename to your filename and 
                               # make sure you use the extension, e.g. .csv. 
               header = TRUE,  # If there is no header row, change TRUE to FALSE.
               sep =",",       # The column delimiter is a comma.
               skip = 0        # Skip this many lines before starting to read the file
               )

You may also need to look at the help file for read.table() to find additional options that you need.

?read.table
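
As an illustration of how the sep and skip options work together, the sketch below reads a made-up file that has one note line above the header and uses semicolons instead of commas. The file contents are invented for this example.

```r
tmp <- tempfile(fileext = ".csv")
writeLines(c("Exported from a made-up system",  # a note line above the header
             "week;cases",
             "1;10",
             "2;15"), tmp)

d <- read.table(tmp,
                header = TRUE,  # the first line read is the header
                sep = ";",      # columns are separated by semicolons
                skip = 1)       # skip the note line before reading
d
str(d)
```

If the column names or types look wrong after reading your own file, sep and skip are usually the first options to check.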

When you have completed the activity, compare your results to the solutions.

Descriptive statistics

After reading your data set into R, you will likely want to calculate some descriptive statistics. The single most useful command to assess the whole data set is the summary() command:

summary(GI)
##        id                    date          facility         icd9        
##  Min.   :1.001e+09   2007-01-31:   57   Min.   :  37   Min.   :    3.0  
##  1st Qu.:1.001e+09   2007-01-29:   55   1st Qu.:  66   1st Qu.:  558.9  
##  Median :1.001e+09   2007-01-16:   52   Median :  67   Median :  787.0  
##  Mean   :1.001e+09   2007-02-28:   52   Mean   :1102   Mean   : 1043.2  
##  3rd Qu.:1.002e+09   2007-01-24:   50   3rd Qu.: 309   3rd Qu.:  787.3  
##  Max.   :1.002e+09   2007-01-17:   45   Max.   :7298   Max.   :78791.0  
##                      (Other)   :20933                                   
##       age            zipcode            chief_complaint  syndrome  
##  Min.   :  0.00   Min.   :20001   Abd Pain      : 1390   GI:21244  
##  1st Qu.:  8.00   1st Qu.:20747   ABD PAIN      : 1074             
##  Median : 27.00   Median :21740   Vomiting      :  661             
##  Mean   : 29.98   Mean   :21420   VOMITING      :  563             
##  3rd Qu.: 47.00   3rd Qu.:22182   ABDOMINAL PAIN:  452             
##  Max.   :157.00   Max.   :22556   vomiting      :  423             
##                                   (Other)       :16681             
##     gender     
##  Female:10653  
##  Male  :10591  
##                
##                
##                
##                
## 

Descriptive statistics for continuous (numeric) variables

To access a single column in the data.frame use a dollar sign ($).

GI$age     # or
GI[,'age'] # or
GI[,5]     # since age is the 5th column

Here are a number of descriptive statistics for age:

min(GI$age)
## [1] 0
max(GI$age)
## [1] 157
mean(GI$age)
## [1] 29.98221
median(GI$age)
## [1] 27
quantile(GI$age, c(.025,.25,.5,.75,.975))
##   2.5%    25%    50%    75%  97.5% 
##  0.075  8.000 27.000 47.000 81.000
summary(GI$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    8.00   27.00   29.98   47.00  157.00

Anything look odd here?
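
One way to follow up on a question like this is to count and inspect values outside a plausible range. The sketch below uses a small made-up vector rather than GI$age, and the cutoff of 120 years is an assumption chosen for illustration.

```r
age <- c(7, 41, 2, 71, 28, 157, 139, 0)  # made-up ages, not the GI data

implausible <- age > 120      # logical vector flagging suspicious values
sum(implausible)              # how many values are flagged
age[implausible]              # which values they are
summary(age[!implausible])    # summary without the flagged values
```

Whether flagged values should be corrected, removed, or kept depends on how the data were collected.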

Descriptive statistics for categorical (non-numeric) variables

The table() function provides the number of observations at each level of a categorical variable.

table(GI$gender)
## 
## Female   Male 
##  10653  10591

which is the same as summary() if the variable is not coded as numeric

summary(GI$gender)
## Female   Male 
##  10653  10591

If the variable is coded as numeric, but is really a categorical variable, then you can still use table, but summary will report numeric summaries (minimum, mean, etc.) rather than counts.

table(GI$facility)
## 
##   37   66   67  123  255  256  259  309  390  413  420  522  703 6200 6201 
## 3571 2950 4281 4408   41  575    1  325  661  668  100   67  178 1423 1928 
## 7298 
##   67
summary(GI$facility)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      37      66      67    1102     309    7298

Apparently there is only 1 observation from facility 259. Was that a typo?
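
If a numeric code really represents categories, one option is to convert it to a factor so that summary() reports counts. The sketch below uses a short made-up vector of facility codes rather than the GI data.

```r
facility <- c(37, 66, 37, 123, 66, 37)  # made-up facility codes

summary(facility)                # numeric summaries, not meaningful for codes
facility_f <- factor(facility)   # treat the codes as categories
counts <- summary(facility_f)    # now reports the count in each level
counts
```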

Filtering the data

Rather than descriptive statistics for the data set as a whole, we may be interested in descriptive statistics for a subset of the data, i.e. we want to filter() the data.

The following code creates a new data.frame that only contains observations from facility 37:

library('dplyr')

GI_37 <- GI %>% 
  filter(facility == 37) # Notice the double equal sign!

nrow(GI_37)              # Number of rows (observations) in the new data set
## [1] 3571

The following code creates a new data.frame that only contains observations with chief_complaint “Abd Pain”:

GI_AbdPain <- GI %>% 
  filter(chief_complaint == "Abd Pain") # Need to quote non-numeric variable level

nrow(GI_AbdPain)
## [1] 1390

Alternative ways to filter

There are many other ways to subset/filter the data, but these days I almost exclusively use dplyr::filter() as I find the code is much easier to read.

GI_37a = GI[GI$facility==37,]
rownames(GI_37a) = 1:nrow(GI_37a)
all.equal(GI_37, GI_37a)
## [1] TRUE
GI_37b = subset(GI, facility==37)
rownames(GI_37b) = 1:nrow(GI_37b)
all.equal(GI_37, GI_37b)
## [1] TRUE
GI_AbdPain1 = GI[GI$chief_complaint == "Abd Pain",]
rownames(GI_AbdPain1) = 1:nrow(GI_AbdPain1)
all.equal(GI_AbdPain, GI_AbdPain1)
## [1] TRUE
GI_AbdPain2 = subset(GI, chief_complaint == "Abd Pain")
rownames(GI_AbdPain2) = 1:nrow(GI_AbdPain2)
all.equal(GI_AbdPain, GI_AbdPain2)
## [1] TRUE

Advanced filtering

We can subset observations using other logical statements.

GI %>% filter(age <   5)
GI %>% filter(age >= 60)
GI %>% filter(chief_complaint %in% c("Abd Pain","ABD PAIN")) # Abd Pain or ABD PAIN
GI %>% filter(tolower(chief_complaint) == "abd pain")        # any capitalization pattern
GI %>% filter(!(facility %in% c(37,66)))                     # facility is NOT 37 or 66
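
Logical statements can also be combined with & (and) and | (or). The sketch below uses base R's subset() on a small made-up data.frame so that it runs without dplyr; the same conditions should work inside filter() as well.

```r
d <- data.frame(age      = c(2, 30, 65, 4, 80),
                gender   = c("Male", "Female", "Male", "Female", "Female"),
                facility = c(37, 66, 37, 123, 66))   # made-up data

young_males  <- subset(d, age < 5 & gender == "Male")  # both conditions must hold
young_or_old <- subset(d, age < 5 | age >= 60)         # either condition may hold
not_37_young <- subset(d, facility != 37 & age < 5)    # != means "not equal"

nrow(young_males)
nrow(young_or_old)
nrow(not_37_young)
```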

Descriptive statistics on the subset

Now we can calculate descriptive statistics on this subset, e.g.

summary(GI_37$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0    19.0    36.0    39.3    58.0   139.0
summary(GI_AbdPain$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    7.00   27.00   28.61   45.00  157.00

Activity

Find the min, max, mean, and median age for zipcode 20032.

When you have completed the activity, compare your results to the solutions.

Graphical statistics

Here we focus on the graphical options available in the base package graphics.

Although I sometimes use these base graphics, I end up switching to ggplot2 graphics very quickly.

Histograms

For continuous variables, histograms are useful for visualizing the distribution of the variable.

hist(GI$age)

When there is a lot of data, you will typically want more bins

hist(GI$age, 50)

You can also specify your own bins

hist(GI$age, 0:158)
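
Bin edges can also be built with seq(), and hist() returns the counts it computed. The sketch below uses made-up ages and plot = FALSE so that only the numbers are produced; dropping plot = FALSE draws the histogram instead.

```r
age <- c(0, 3, 10, 25, 25, 47, 99)             # made-up ages

breaks <- seq(0, 100, by = 10)                 # bin edges at 0, 10, ..., 100
h <- hist(age, breaks = breaks, plot = FALSE)  # compute bins without drawing

h$breaks   # the bin edges that were used
h$counts   # number of observations in each bin
```

The breaks must cover the full range of the data, otherwise hist() stops with an error.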

Boxplots

Boxplots are another way to visualize the distribution for continuous variables.

boxplot(GI$age)
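
Boxplots are often most useful for comparing groups, which can be done with a formula of the form y ~ group. The sketch below uses a made-up data.frame rather than GI, and plot = FALSE so that the underlying five-number summaries are returned instead of drawn; removing plot = FALSE produces the side-by-side boxplots.

```r
d <- data.frame(age    = c(7, 41, 2, 71, 28, 43, 12, 64),
                gender = c("Male", "Female", "Male", "Male",
                           "Female", "Female", "Male", "Female"))  # made-up data

b <- boxplot(age ~ gender, data = d, plot = FALSE)  # one box per gender

b$names   # the groups, in the order they would be drawn
b$stats   # one column per group: lower whisker, hinges, median, upper whisker
```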