Close down R, say ‘No’ to saving the workspace. In fact, I recommend you change your settings so that

Workflow

The typical workflow in R is

  1. Start R
  2. Choose your working directory (choose project in RStudio)
  3. Open a script
  4. Read in data
  5. Check data (interactively)
  6. Do analysis

If you are using RStudio, I highly recommend you look into setting up projects.

Choose your working directory

Start R and then choose your working directory.

# choose.dir() opens a folder picker; it is available on Windows only
setwd(choose.dir())
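Because choose.dir() only exists on Windows, it can help to know the path-based form as well. The following is a minimal sketch, assuming a temporary directory as a stand-in for a real project folder; project_dir and old_dir are hypothetical names used only for illustration.

```r
# A temporary directory stands in for a real project folder (assumption)
project_dir <- tempdir()

# setwd() invisibly returns the previous working directory,
# which makes it easy to switch back afterwards
old_dir <- setwd(project_dir)

getwd()        # confirm where R is now looking for files
list.files()   # see which files are available there

# Restore the original working directory
setwd(old_dir)
```

In an RStudio project this step is unnecessary, since opening the project sets the working directory for you.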

Start the workshop

Now start the workshop to get your data files and scripts set up. Typically you would choose your working directory to be where you actually have your data and scripts.

library('ISDSWorkshop')
workshop(launch_index = FALSE) # If you need the outline, set this to TRUE

Open the script

Open the 03_advanced_graphics.R script.

By convention, the library() calls for all packages used in a script go at the top of that script.

library('dplyr')
library('ggplot2')

Read in the data

Typically, you will have a script that reads in the data, converts variables, and then performs statistical analyses.

This time we will use the mutate() function from the dplyr package. This function allows you to create new columns in a data.frame.

# Read in csv files
GI     <- read.csv("GI.csv")
icd9df <- read.csv("icd9.csv")

# Add columns to the data.frame
GI <- GI %>%
  mutate(
    date      = as.Date(date),
    weekC     = cut(date, breaks="weeks"), 
    week      = as.numeric(weekC),
    facility  = as.factor(facility),
    icd9class = factor(cut(icd9, 
                           breaks = icd9df$code_cutpoint, 
                           labels = icd9df$classification[-nrow(icd9df)], 
                           right  = TRUE)),
    ageC      = cut(age, 
                    breaks = c(-Inf, 5, 18, 45, 60, Inf)),
    zip3      = trunc(zipcode/100))
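To see what the two kinds of cut() calls above produce, here is a small sketch on a toy data.frame. The values are made up for illustration and are not taken from GI.csv, and the sketch uses base R assignment rather than mutate() so that it runs without extra packages.

```r
# Toy data (made up for illustration)
toy <- data.frame(
  date = as.Date(c("2005-02-28", "2005-03-03", "2005-03-07", "2005-03-15")),
  age  = c(2, 5, 30, 71)
)

# Weekly bins: cut() on a Date labels each bin by the Monday that starts it
toy$weekC <- cut(toy$date, breaks = "weeks")

# as.numeric() on a factor returns the integer code of each level,
# i.e. the position of the week in the sequence of weeks
toy$week <- as.numeric(toy$weekC)

# Age bins: intervals are right-closed by default, so age 5 falls in (-Inf,5]
toy$ageC <- cut(toy$age, breaks = c(-Inf, 5, 18, 45, 60, Inf))

toy
```

The right-closed default matters at the boundaries: a 5-year-old is grouped with the youngest children, and an 18-year-old falls in (5,18].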

Reading other scripts

If the script that reads in the data is extensive, it may be better to split your work into separate files: one that reads in the data and converts variables, and one or more that perform the statistical analyses. Each analysis file can then source() the file that reads in the data, e.g.

source("read_data.R") 
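The sketch below shows the mechanics of source() end to end. It writes a tiny stand-in for read_data.R to a temporary directory and then sources it; the file contents and the toy object are assumptions made for illustration, not the workshop's actual script.

```r
# Write a small stand-in for read_data.R to a temporary file (assumption:
# the real script would read GI.csv and convert variables instead)
read_script <- file.path(tempdir(), "read_data.R")
writeLines(c(
  "toy <- data.frame(age = c(2, 30, 71))",
  "toy$ageC <- cut(toy$age, breaks = c(-Inf, 18, Inf))"
), read_script)

# An analysis script would start by sourcing the data-reading script ...
source(read_script)

# ... after which the objects that script created are available
table(toy$ageC)
```

Every analysis file that starts with the same source() call works from an identically prepared data.frame, so a conversion only has to be fixed in one place.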

Check the data

At this point, I typically check the data to make sure the data.frame is what I think it is. I usually do this in interactive mode, i.e. not in a script.

dim(GI)
## [1] 21244    14
str(GI)
## 'data.frame':    21244 obs. of  14 variables:
##  $ id             : int  1001301988 1001829757 1001581758 1001950471 1001076304 1001087075 1001536948 1001966448 1001868408 1001356109 ...
##  $ date           : Date, format: "2005-02-28" "2005-02-28" ...
##  $ facility       : Factor w/ 16 levels "37","66","67",..: 3 3 4 4 8 2 15 3 2 2 ...
##  $ icd9           : num  787 559 788 788 559 ...
##  $ age            : int  7 41 2 71 28 43 12 1 25 64 ...
##  $ zipcode        : int  21075 20721 22152 22060 21702 20762 22192 20121 20772 20602 ...
##  $ chief_complaint: Factor w/ 2936 levels " /V/D"," /v",..: 784 2904 2759 9 1646 2229 601 2876 1736 2166 ...
##  $ syndrome       : Factor w/ 1 level "GI": 1 1 1 1 1 1 1 1 1 1 ...
##  $ gender         : Factor w/ 2 levels "Female","Male": 2 1 2 2 1 1 2 2 2 2 ...
##  $ weekC          : Factor w/ 201 levels "2005-02-28","2005-03-07",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ week           : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ icd9class      : Factor w/ 4 levels "infectious and parasitic disease",..: 3 2 3 3 2 3 2 3 3 3 ...
##  $ ageC           : Factor w/ 5 levels "(-Inf,5]","(5,18]",..: 2 3 1 5 3 3 2 1 3 5 ...
##  $ zip3           : num  210 207 221 220 217 207 221 201 207 206 ...
summary(GI)
##        id                 date               facility         icd9        
##  Min.   :1.001e+09   Min.   :2005-02-28   123    :4408   Min.   :    3.0  
##  1st Qu.:1.001e+09   1st Qu.:2006-03-07   67     :4281   1st Qu.:  558.9  
##  Median :1.001e+09   Median :2007-02-14   37     :3571   Median :  787.0  
##  Mean   :1.001e+09   Mean   :2007-02-05   66     :2950   Mean   : 1043.2  
##  3rd Qu.:1.002e+09   3rd Qu.:2008-01-17   6201   :1928   3rd Qu.:  787.3  
##  Max.   :1.002e+09   Max.   :2008-12-30   6200   :1423   Max.   :78791.0  
##                                           (Other):2683                    
##       age            zipcode            chief_complaint  syndrome  
##  Min.   :  0.00   Min.   :20001   Abd Pain      : 1390   GI:21244  
##  1st Qu.:  8.00   1st Qu.:20747   ABD PAIN      : 1074             
##  Median : 27.00   Median :21740   Vomiting      :  661             
##  Mean   : 29.98   Mean   :21420   VOMITING      :  563             
##  3rd Qu.: 47.00   3rd Qu.:22182   ABDOMINAL PAIN:  452             
##  Max.   :157.00   Max.   :22556   vomiting      :  423             
##                                   (Other)       :16681             
##     gender             weekC            week      
##  Female:10653   2007-01-29:  233   Min.   :  1.0  
##  Male  :10591   2007-01-22:  212   1st Qu.: 54.0  
##                 2007-02-26:  200   Median :103.0  
##                 2008-01-28:  184   Mean   :101.7  
##                 2007-01-08:  180   3rd Qu.:151.0  
##                 2007-01-15:  179   Max.   :201.0  
##                 (Other)   :20056                  
##                                        icd9class            ageC     
##  infectious and parasitic disease           : 1611   (-Inf,5] :4324  
##  diseases of the digestive system           : 7242   (5,18]   :3802  
##  symptoms, signs, and ill-defined conditions:12229   (18,45]  :7476  
##  injury and poisoning                       :  162   (45,60]  :3133  
##                                                      (60, Inf]:2509  
##                                                                      
##                                                                      
##       zip3      
##  Min.   :200.0  
##  1st Qu.:207.0  
##  Median :217.0  
##  Mean   :213.8  
##  3rd Qu.:221.0  
##  Max.   :225.0  
## 
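Interactive checks like these are usually enough, but if you later want a few of them to run every time the data are read in, they can be written as assertions. This is a minimal sketch on a toy data.frame: the column names mirror GI, the values are made up, and the 120-year age threshold is an arbitrary choice for illustration.

```r
# Toy data with the same column names as GI (values are made up)
toy <- data.frame(
  date     = as.Date(c("2005-02-28", "2005-03-01")),
  facility = factor(c("67", "66")),
  age      = c(7, 157)
)

# stopifnot() raises an error if any condition is not TRUE
stopifnot(
  inherits(toy$date, "Date"),
  is.factor(toy$facility),
  !anyNA(toy$age)
)

# Flag implausible values for inspection rather than stopping;
# the threshold of 120 is an assumption, not a rule from the workshop
suspicious <- toy[toy$age > 120, ]
nrow(suspicious)
```

The summary() output above reports a maximum age of 157, which is the kind of value such a flag is meant to surface.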

Activity

Create a new variable in the GI data set called weekD: a Date object that holds, for each observation, the Monday of the week in which that observation falls. Check that it is actually a Date using the str() function.

# Create weekD variable in GI data set

When you have completed the activity, compare your results to the solutions.

Graph workflow

Now to construct a graph, the workflow is

  1. construct an appropriate data set and
  2. construct the graph.

Construct the data set

Suppose we want to plot the number of weekly GI cases by facility. We need to summarize the data by week and facility.

GI_wf <- GI %>%
  group_by(week, facility) %>%
  summarize(count = n())

In interactive mode, you should verify that this data set is correct, e.g.

nrow(GI_wf) # At most (number of weeks) x (number of facilities) rows; week-facility pairs with no cases are absent
ncol(GI_wf) # Should have 3 columns: week, facility, count
dim(GI_wf)
head(GI_wf)
tail(GI_wf)
summary(GI_wf)
summary(GI_wf$facility)
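One thing worth checking at this step is whether every week and facility combination is present, because group_by() followed by summarize() only returns combinations that occur in the data. The sketch below illustrates this on a toy data.frame with made-up values; toy and toy_wf are hypothetical names.

```r
library(dplyr)

# Toy data: facility "B" has no cases in week 2 (values are made up)
toy <- data.frame(
  week     = c(1, 1, 1, 2),
  facility = factor(c("A", "A", "B", "A"))
)

# Same summary as for GI_wf: one row per observed week-facility pair
toy_wf <- toy %>%
  group_by(week, facility) %>%
  summarize(count = n())

toy_wf

# Compare the observed number of rows with the number of possible pairs
n_observed <- nrow(toy_wf)
n_possible <- length(unique(toy$week)) * nlevels(toy$facility)
c(observed = n_observed, possible = n_possible)
```

When the two numbers differ, the missing pairs are weeks in which a facility reported no cases, and they will appear as gaps rather than zeros in a plot.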

Construct the graph

Now, we would like week on the x-axis and count on the y-axis.

ggplot(GI_wf, aes(x = week, y = count)) + 
  geom_point()

But clearly we need to distinguish the facilities.

Try colors and shapes to distinguish facilities

Colors:

ggplot(GI_wf, aes(x = week, y = count, color = facility)) + 
  geom_point()