Data Visualization II (ggplot2)

Close down R, say ‘No’ to saving the workspace. In fact, I recommend you change your settings so that

the workspace is never stored and
.RData is not restored upon startup.

Workflow

The typical workflow in R is

Start R
Choose your working directory (choose project in RStudio)
Open a script
Read in data
Check data (interactively)
Do analysis

If you are using RStudio, I highly recommend you look into setting up projects.

Choose your working directory

Start R and then choose your working directory.

setwd(choose.dir())

Start the workshop

Now start the workshop to get your data files and scripts set up. Typically you would choose your working directory to be where you actually have your data and scripts.

library('ISDSWorkshop')
workshop(launch_index  = FALSE) # If you need the outline, set this to TRUE

Open the script

Open the 03_advanced_graphics.R script.

A convention is to put library() calls for all packages that you will use in the script at the top of the script.

library('dplyr')
library('ggplot2')

Read in the data

Typically, you will have a script that reads in the data, converts variables, and then performs statistical analyses.

This time we will use the mutate() function from the dplyr package. This function allows you to create new columns in a data.frame.

# Read in csv files
GI = read.csv('GI.csv')
icd9df = read.csv("icd9.csv")

# Add columns to the data.frame
GI <- GI %>%
  mutate(
    date      = as.Date(date),
    weekC     = cut(date, breaks="weeks"), 
    week      = as.numeric(weekC),
    facility  = as.factor(facility),
    icd9class = factor(cut(icd9, 
                           breaks = icd9df$code_cutpoint, 
                           labels = icd9df$classification[-nrow(icd9df)], 
                           right  = TRUE)),
    ageC      = cut(age, 
                    breaks = c(-Inf, 5, 18, 45 ,60, Inf)),
    zip3      = trunc(zipcode/100))

Reading other scripts

If the script to read in the data is extensive, it may be better to separate your scripts into two different files: one for reading in the data and converting variables and another (or more) to perform statistical analyses. Then, in the second file, i.e. the one that does an analysis, you can source() the one that just reads in the data, e.g.

source("read_data.R")

Check the data

At this point, I typically check the data to make sure the data.frame is what I think it is. I usually do this in interactive mode, i.e. not in a script.

dim(GI)

## [1] 21244    14

str(GI)

## 'data.frame':    21244 obs. of  14 variables:
##  $ id             : int  1001301988 1001829757 1001581758 1001950471 1001076304 1001087075 1001536948 1001966448 1001868408 1001356109 ...
##  $ date           : Date, format: "2005-02-28" "2005-02-28" ...
##  $ facility       : Factor w/ 16 levels "37","66","67",..: 3 3 4 4 8 2 15 3 2 2 ...
##  $ icd9           : num  787 559 788 788 559 ...
##  $ age            : int  7 41 2 71 28 43 12 1 25 64 ...
##  $ zipcode        : int  21075 20721 22152 22060 21702 20762 22192 20121 20772 20602 ...
##  $ chief_complaint: Factor w/ 2936 levels " /V/D"," /v",..: 784 2904 2759 9 1646 2229 601 2876 1736 2166 ...
##  $ syndrome       : Factor w/ 1 level "GI": 1 1 1 1 1 1 1 1 1 1 ...
##  $ gender         : Factor w/ 2 levels "Female","Male": 2 1 2 2 1 1 2 2 2 2 ...
##  $ weekC          : Factor w/ 201 levels "2005-02-28","2005-03-07",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ week           : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ icd9class      : Factor w/ 4 levels "infectious and parasitic disease",..: 3 2 3 3 2 3 2 3 3 3 ...
##  $ ageC           : Factor w/ 5 levels "(-Inf,5]","(5,18]",..: 2 3 1 5 3 3 2 1 3 5 ...
##  $ zip3           : num  210 207 221 220 217 207 221 201 207 206 ...

summary(GI)

##        id                 date               facility         icd9        
##  Min.   :1.001e+09   Min.   :2005-02-28   123    :4408   Min.   :    3.0  
##  1st Qu.:1.001e+09   1st Qu.:2006-03-07   67     :4281   1st Qu.:  558.9  
##  Median :1.001e+09   Median :2007-02-14   37     :3571   Median :  787.0  
##  Mean   :1.001e+09   Mean   :2007-02-05   66     :2950   Mean   : 1043.2  
##  3rd Qu.:1.002e+09   3rd Qu.:2008-01-17   6201   :1928   3rd Qu.:  787.3  
##  Max.   :1.002e+09   Max.   :2008-12-30   6200   :1423   Max.   :78791.0  
##                                           (Other):2683                    
##       age            zipcode            chief_complaint  syndrome  
##  Min.   :  0.00   Min.   :20001   Abd Pain      : 1390   GI:21244  
##  1st Qu.:  8.00   1st Qu.:20747   ABD PAIN      : 1074             
##  Median : 27.00   Median :21740   Vomiting      :  661             
##  Mean   : 29.98   Mean   :21420   VOMITING      :  563             
##  3rd Qu.: 47.00   3rd Qu.:22182   ABDOMINAL PAIN:  452             
##  Max.   :157.00   Max.   :22556   vomiting      :  423             
##                                   (Other)       :16681             
##     gender             weekC            week      
##  Female:10653   2007-01-29:  233   Min.   :  1.0  
##  Male  :10591   2007-01-22:  212   1st Qu.: 54.0  
##                 2007-02-26:  200   Median :103.0  
##                 2008-01-28:  184   Mean   :101.7  
##                 2007-01-08:  180   3rd Qu.:151.0  
##                 2007-01-15:  179   Max.   :201.0  
##                 (Other)   :20056                  
##                                        icd9class            ageC     
##  infectious and parasitic disease           : 1611   (-Inf,5] :4324  
##  diseases of the digestive system           : 7242   (5,18]   :3802  
##  symptoms, signs, and ill-defined conditions:12229   (18,45]  :7476  
##  injury and poisoning                       :  162   (45,60]  :3133  
##                                                      (60, Inf]:2509  
##                                                                      
##                                                                      
##       zip3      
##  Min.   :200.0  
##  1st Qu.:207.0  
##  Median :217.0  
##  Mean   :213.8  
##  3rd Qu.:221.0  
##  Max.   :225.0  
##

Activity

Create a new variable in the GI data set called weekD that is a Date object with each observation having the Monday of the week for that observation. Check that it is actually a date using the str() function.

# Create weekD variable in GI data set

When you have completed the activity, compare your results to the solutions.

Graph workflow

Now to construct a graph, the workflow is

construct an appropriate data set and
construct the graph.

Construct the data set

Suppose we want to plot the number of weekly GI cases by facility. We need to summarize the data by week and facility.

GI_wf <- GI %>%
  group_by(week, facility) %>%
  summarize(count = n())

In interactive mode, you should verify that this data set is correct, e.g.

nrow(GI_wf) # Should have number of weeks times number of facilities rows
ncol(GI_wf) # Should have 3 columns: week, facility, count
dim(GI_wf)
head(GI_wf)
tail(GI_wf)
summary(GI_wf)
summary(GI_wf$facility)

Construct the graph

Now, we would like week on the x-axis and count on the y-axis.

ggplot(GI_wf, aes(x = week, y = count)) + 
  geom_point()

But, clearly we need to distinguish the facilities.

Try colors and shapes to distinguish facilities

Colors:

ggplot(GI_wf, aes(x = week, y = count, color = facility)) + 
  geom_point()

Shapes:

ggplot(GI_wf, aes(x = week, y = count, shape = facility)) + 
  geom_point()

## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have
## 16. Consider specifying shapes manually if you must have them.

## Warning: Removed 1229 rows containing missing values (geom_point).

ggplot2 only plots 6 different shapes and provides a warning.

Activity - weekly GI counts by age category

Construct a data set and then a plot to look at weekly GI cases by age category.

# Construct a data set aggregated by week and age category

# Construct a plot to look at weekly GI cases by age category.

When you have completed the activity, compare your results to the solutions.

Faceting

Faceting is a technique to view many small graphs on a single page.

In ggplot, there are two different ways to construct facets: facet_wrap() and facet_grid(). The former performs the faceting for a single variable and constructs the matrix of plots automatically. The latter performs faceting for two variables putting one vertically and the other horizontally.

Faceting on a single variable

When there are many levels of a single variable, facet_wrap() can often allow comparisons across that variable.

ggplot(GI_wf, aes(x = week, y = count)) + 
  geom_point() + 
  facet_wrap(~ facility)

By default, faceting forces all axes to be exactly the same. If you want the axis to automatically scale for each facet use scales="free".

ggplot(GI_wf, aes(x = week, y = count)) + 
  geom_point() + 
  facet_wrap(~ facility, scales = "free")

Faceting on two variables

When there are two variables (with not very many levels in each), facet_grid() can allow relevant comparisons.

Suppose we are interested in weekly GI counts by gender and age category. First, construct the data set

GI_sa <- GI %>%
  group_by(week, gender, ageC) %>%
  summarize(count = n())

Next, construct the plot using facet_grid(row ~ column) syntax.

ggplot(GI_sa, aes(x = week, y = count)) + 
  geom_point() + 
  facet_grid(gender ~ ageC)

Using faceting and colors/shapes

In addition, we could use faceting together with colors/shapes

ggplot(GI_sa, aes(x = week, y = count, shape = gender, color = gender)) + 
  geom_point() + 
  facet_wrap(~ ageC)

Activity - weekly GI counts by zip3 and ageC

Construct a plot of weekly GI counts by zip3 and ageC.

# Construct a plot of weekly GI counts by zip3 and ageC.

When you have completed the activity, compare your results to the solutions.

Filtering

Often our data set is large and we would like to filter the data to focus on a particular group or time frame. You may want to filter by a

categorical,
continuous (numeric), or
Date variable.

Filtering by a categorical variable

For a categorical variable, you want to filter by

a single category
a set of categories
not in a single category
not in a set of categories

Filtering by a single category

Suppose we are only interested in counts for infectious and parasitic diseases (IPD).

IPD_w <- GI %>%
  filter(icd9class == "infectious and parasitic disease") %>%
  group_by(week) %>%
  summarize(count = n())

ggplot(IPD_w, aes(x = week, y = count)) + 
  geom_point()

Filtering by a set of categories

Suppose we are interested in looking at both IPD and ill-defined conditions

conditions = levels(GI$icd9class)
conditions

## [1] "infectious and parasitic disease"           
## [2] "diseases of the digestive system"           
## [3] "symptoms, signs, and ill-defined conditions"
## [4] "injury and poisoning"

So we want levels 1 and 3.

IPDp <- GI %>%
  filter(icd9class %in% levels(icd9class)[c(1,3)])

summary(IPDp$icd9class)

##            infectious and parasitic disease 
##                                        1611 
##            diseases of the digestive system 
##                                           0 
## symptoms, signs, and ill-defined conditions 
##                                       12229 
##                        injury and poisoning 
##                                           0

Now construct the graph

IPDp_w <- IPDp %>%
  group_by(week, icd9class) %>%
  summarize(count = n())

ggplot(IPDp_w, aes(x = week, y = count)) + 
  geom_point() + 
  facet_wrap(~ icd9class, scales = "free")

Filtering by not in a single category

Suppose we are interested in all icd9 codes except IPD.

# Combine filtering and summarizing
nIPD_w <- GI %>%
  filter(icd9class != "infectious and parasitic disease") %>%
  group_by(week, icd9class) %>%
  summarize(count = n())

ggplot(nIPD_w, aes(x = week, y = count)) + 
  geom_point() + 
  facet_wrap(~ icd9class, scales = "free")

Filtering by not in a set of categories

Suppose we are interested in all icd9 codes except IPD and ill-defined conditions.

nIPDp_w <- GI %>%
  filter(!(icd9class %in% levels(icd9class)[c(1,3)])) %>%
  group_by(week, icd9class) %>%
  summarize(count = n())

ggplot(nIPDp_w, aes(x = week, y = count)) + 
  geom_point() + 
  facet_wrap(~ icd9class, scales = "free")

Filtering by a continuous (numeric) variable

We may want to subset a continuous (numeric) variable by

Larger than (or equal to) a number
Smaller than (or equal to) a number
In a range of numbers (with our without endpoints)

Filtering larger than a number

Age60p_w <- GI %>%
  filter(age > 60) %>%
  group_by(week) %>%
  summarize(count = n())

ggplot(Age60p_w, aes(x = week, y = count)) + 
  geom_point()

Filtering less than or equal to a number

Age_lte30_w <- GI %>%
  filter(age <= 30) %>%
  group_by(week) %>%
  summarize(count = n())

ggplot(Age_lte30_w, aes(x = week, y = count)) + 
  geom_point()

Filtering within a range

Age30to60_w <- GI %>%
  filter(age > 30, age <= 60) %>% # This includes those who are 60, but not those who are 30
  group_by(week) %>%
  summarize(count = n())

ggplot(Age30to60_w, aes(x = week, y = count)) + 
  geom_point()

Filtering a date

Filtering a date works similar to filtering a continuous (numeric) variable once you convert a comparison value to a Date using the as.Date() function.

Let’s create a Date vector containing the two values we will use for filtering.

two_dates = as.Date(c("2007-01-01", "2007-12-31"))

Filtering dates

early <- GI %>% filter(date <  two_dates[1])
late  <- GI %>% filter(date >= two_dates[2])
mid   <- GI %>% filter(date >= two_dates[1], date < two_dates[2])

Now make plots

early_w <- early %>% group_by(week) %>% summarize(count = n())
late_w  <- late  %>% group_by(week) %>% summarize(count = n())
mid_w   <- mid   %>% group_by(week) %>% summarize(count = n())

g1 = ggplot(early_w, aes(x = week, y = count)) + geom_point()
g2 = ggplot(late_w,  aes(x = week, y = count)) + geom_point()
g3 = ggplot(mid_w,   aes(x = week, y = count)) + geom_point()

To arrange multiple plots, use the grid.arrange function in the gridExtra package.

# you may need to run install.packages("gridExtra")
if (require('gridExtra'))
  gridExtra::grid.arrange(g1,g3,g2)

## Loading required package: gridExtra

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

This set of figures would have been a lot easier to construct if we had simply created a new variable that had values early, mid, and late and then used faceting to create the plot, e.g.

cut_dates <- c(as.Date("1900-01-01"),
               two_dates,
               Sys.Date())

GI_time <- GI %>% 
  mutate(time = cut(date, 
                    breaks = cut_dates, 
                    labels = c("early","mid","late"))) %>%
  group_by(week, time) %>%
  summarize(count = n())

ggplot(GI_time,
       aes(x = week, y = count)) +
  geom_point() +
  facet_wrap(~ time, ncol = 1, scales = "free")

Activity - subsetted graphs

Construct a plot for those in zipcode 206xx between Jan 1, 2008 and Dec 31, 2008.

# Filter the data to zipcode 206xx between Jan 1, 2008 and Dec 31, 2008

# Aggregate the date for each week in this time frame

# Construct the plot of weekly GI counts in zipcode 206xx.

When you have completed the activity, compare your results to the solutions.

Making professional looking graphics

To make the graphics we are producing look professional, we will need to customize the graph according to the final production platform, e.g. website, word document, pdf document, etc. This will be highly specific to your use. Nonetheless, likely you will use the following ggplot2 functions:

scale_color_manual()
labs()
theme()

But there are many others. One thing you will notice is that we can keep updating the graph by updating the R object that holds the graph.

Base graphic

Let’s start with the following graph

GI$weekD = as.Date(GI$weekC) 

GI_sum <- GI %>%
  group_by(weekD, gender, ageC) %>%
  summarize(count = n())

ggplot(GI_sum, aes(x = weekD, y = count, color = gender)) + 
  geom_point(size = 3) + 
  facet_grid(ageC ~ ., scales = "free_y")

Change age category labels

The easiest way to change some of the labels on the graph is to change the underlying data. Here we change the levels associated with the ageC variable.

levels(GI_sum$ageC)

## [1] "(-Inf,5]"  "(5,18]"    "(18,45]"   "(45,60]"   "(60, Inf]"

levels(GI_sum$ageC) = paste("Age:", c("<5","5-18","18-45","45-60",">60"))
table(GI_sum$ageC)

## 
##    Age: <5  Age: 5-18 Age: 18-45 Age: 45-60   Age: >60 
##        402        402        402        402        400

g = ggplot(GI_sum, aes(x = weekD, y = count, color = gender)) + 
  geom_point(size = 3) + 
  facet_grid(ageC ~ ., scales = "free_y")
g

g is the R object that contains the graph

Change colors

Intuitive colors can make a graph easier to read.

g = g + scale_color_manual(values=c("hotpink","blue"), name='Gender')
g

Change title/axis labels and put legend on the bottom

Use proper names and capitalization in title and axis labels.

g = g + 
  labs(title = "Weekly GI cases", x = "Year", y = "Weekly count") + 
  theme(legend.position = "bottom")
g

Try alternate themes

Try alternate themes:

g = g + theme_bw()
g

To see a list of possibilities:

?theme_bw
?theme

Adjust font sizes

Depending on the final platform, font sizes may need to be adjusted.

g = g + theme(title = element_text(size=rel(2)),
              text = element_text(size=16),
              legend.background = element_rect(fill  = "white", 
                                               size  = .5, 
                                               color = "black"))
g

Exporting graphs

To export the graphs, use the ggsave() function which saves the last plot that you displayed. As a default, it uses the size of the current graphics device, but you will likely want to modify this.

ggsave("plot.pdf")

## Saving 7 x 5 in image

If you look in your current working directory, you will see the plot.pdf file.

ggsave("plot.pdf", width=14, height=8)

Data Visualization II (ggplot2)

Jarad Niemi

2017-01-19

Workflow

Choose your working directory

Start the workshop

Open the script

Read in the data

Reading other scripts

Check the data

Activity

Graph workflow

Construct the data set

Construct the graph

Try colors and shapes to distinguish facilities

Activity - weekly GI counts by age category

Faceting

Faceting on a single variable

Faceting on two variables

Using faceting and colors/shapes

Activity - weekly GI counts by zip3 and ageC

Filtering

Filtering by a categorical variable

Filtering by a single category

Filtering by a set of categories

Filtering by not in a single category

Filtering by not in a set of categories

Filtering by a continuous (numeric) variable

Filtering larger than a number

Filtering less than or equal to a number

Filtering within a range

Filtering a date

Filtering dates

Activity - subsetted graphs

Making professional looking graphics

Base graphic

Change age category labels

Change colors

Change title/axis labels and put legend on the bottom

Try alternate themes

Adjust font sizes

Exporting graphs