Close down R, say ‘No’ to saving the workspace. In fact, I recommend you change your settings so that
The typical workflow in R is
If you are using RStudio, I highly recommend you look into setting up projects.
Start R and then choose your working directory.
Now start the workshop to get your data files and scripts set up. Typically you would choose your working directory to be where you actually have your data and scripts.
library('MWBDSSworkshop')
workshop(write_data = TRUE, write_scripts = TRUE, launch_index = FALSE)
Open the 03_advanced_graphics.R
script.
A convention is to put library()
calls for all packages that you will use in the script at the top of the script.
library('tidyverse') # loads dplyr/ggplot2/readr
Typically, you will have a script that reads in the data, converts variables, and then performs statistical analyses.
This time we will use the mutate()
function from the dplyr package. This function allows you to create new columns in a data.frame
.
# Read in csv files
GI = read.csv("GI.csv")
icd9df = read.csv("icd9.csv")
# Add columns to the data.frame
GI <- GI %>%
mutate(
date = as.Date(date),
weekC = cut(date, breaks="weeks"),
week = as.numeric(weekC),
facility = as.factor(facility),
icd9class = factor(cut(icd9,
breaks = icd9df$code_cutpoint,
labels = icd9df$classification[-nrow(icd9df)],
right = TRUE)),
ageC = cut(age,
breaks = c(-Inf, 5, 18, 45 ,60, Inf)),
zip3 = trunc(zipcode/100))
If the script to read in the data is extensive, it may be better to separate your scripts into two different files: one for reading in the data and converting variables and another (or more) to perform statistical analyses. Then, in the second file, i.e. the one that does an analysis, you can source()
the one that just reads in the data, e.g.
source("read_data.R")
At this point, I typically check the data to make sure the data.frame
is what I think it is. I usually do this in interactive mode, i.e. not in a script.
dim(GI)
## [1] 21244 14
str(GI)
## 'data.frame': 21244 obs. of 14 variables:
## $ id : int 1001301988 1001829757 1001581758 1001950471 1001076304 1001087075 1001536948 1001966448 1001868408 1001356109 ...
## $ date : Date, format: "2005-02-28" "2005-02-28" ...
## $ facility : Factor w/ 16 levels "37","66","67",..: 3 3 4 4 8 2 15 3 2 2 ...
## $ icd9 : num 787 559 788 788 559 ...
## $ age : int 7 41 2 71 28 43 12 1 25 64 ...
## $ zipcode : int 21075 20721 22152 22060 21702 20762 22192 20121 20772 20602 ...
## $ chief_complaint: Factor w/ 2936 levels " /v"," /V/D",..: 315 2600 1100 15 1741 2238 640 2480 1833 2311 ...
## $ syndrome : Factor w/ 1 level "GI": 1 1 1 1 1 1 1 1 1 1 ...
## $ gender : Factor w/ 2 levels "Female","Male": 2 1 2 2 1 1 2 2 2 2 ...
## $ weekC : Factor w/ 201 levels "2005-02-28","2005-03-07",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ week : num 1 1 1 1 1 1 1 1 1 1 ...
## $ icd9class : Factor w/ 4 levels "infectious and parasitic disease",..: 3 2 3 3 2 3 2 3 3 3 ...
## $ ageC : Factor w/ 5 levels "(-Inf,5]","(5,18]",..: 2 3 1 5 3 3 2 1 3 5 ...
## $ zip3 : num 210 207 221 220 217 207 221 201 207 206 ...
summary(GI)
## id date facility icd9
## Min. :1.001e+09 Min. :2005-02-28 123 :4408 Min. : 3.0
## 1st Qu.:1.001e+09 1st Qu.:2006-03-07 67 :4281 1st Qu.: 558.9
## Median :1.001e+09 Median :2007-02-14 37 :3571 Median : 787.0
## Mean :1.001e+09 Mean :2007-02-05 66 :2950 Mean : 1043.2
## 3rd Qu.:1.002e+09 3rd Qu.:2008-01-17 6201 :1928 3rd Qu.: 787.3
## Max. :1.002e+09 Max. :2008-12-30 6200 :1423 Max. :78791.0
## (Other):2683
## age zipcode chief_complaint syndrome
## Min. : 0.00 Min. :20001 Abd Pain : 1390 GI:21244
## 1st Qu.: 8.00 1st Qu.:20747 ABD PAIN : 1074
## Median : 27.00 Median :21740 Vomiting : 661
## Mean : 29.98 Mean :21420 VOMITING : 563
## 3rd Qu.: 47.00 3rd Qu.:22182 ABDOMINAL PAIN: 452
## Max. :157.00 Max. :22556 vomiting : 423
## (Other) :16681
## gender weekC week
## Female:10653 2007-01-29: 233 Min. : 1.0
## Male :10591 2007-01-22: 212 1st Qu.: 54.0
## 2007-02-26: 200 Median :103.0
## 2008-01-28: 184 Mean :101.7
## 2007-01-08: 180 3rd Qu.:151.0
## 2007-01-15: 179 Max. :201.0
## (Other) :20056
## icd9class ageC
## infectious and parasitic disease : 1611 (-Inf,5] :4324
## diseases of the digestive system : 7242 (5,18] :3802
## symptoms, signs, and ill-defined conditions:12229 (18,45] :7476
## injury and poisoning : 162 (45,60] :3133
## (60, Inf]:2509
##
##
## zip3
## Min. :200.0
## 1st Qu.:207.0
## Median :217.0
## Mean :213.8
## 3rd Qu.:221.0
## Max. :225.0
##
Create a new variable in the GI data set called weekD
that is a Date
object with each observation having the Monday of the week for that observation. Check that it is actually a date using the str()
function.
# Create weekD variable in GI data set
When you have completed the activity, compare your results to the solutions.
Now to construct a graph, the workflow is
Suppose we want to plot the number of weekly GI cases by facility. We need to summarize the data by week and facility.
GI_wf <- GI %>%
group_by(week, facility) %>%
summarize(count = n())
In interactive mode, you should verify that this data set is correct, e.g.
nrow(GI_wf) # Should have number of weeks times number of facilities rows
ncol(GI_wf) # Should have 3 columns: week, facility, count
dim(GI_wf)
head(GI_wf)
tail(GI_wf)
summary(GI_wf)
summary(GI_wf$facility)
Now, we would like week on the x-axis and count on the y-axis.
ggplot(GI_wf, aes(x = week, y = count)) +
geom_point()
But, clearly we need to distinguish the facilities.
Colors:
ggplot(GI_wf, aes(x = week, y = count, color = facility)) +
geom_point()
Shapes:
ggplot(GI_wf, aes(x = week, y = count, shape = facility)) +
geom_point()
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have
## 16. Consider specifying shapes manually if you must have them.
## Warning: Removed 1229 rows containing missing values (geom_point).
ggplot2
only plots 6 different shapes and provides a warning.
Construct a data set and then a plot to look at weekly GI cases by age category.
# Construct a data set aggregated by week and age category
# Construct a plot to look at weekly GI cases by age category.
When you have completed the activity, compare your results to the solutions.
Faceting is a technique to view many small graphs on a single page.
In ggplot, there are two different ways to construct facets: facet_wrap()
and facet_grid()
. The former performs the faceting for a single variable and constructs the matrix of plots automatically. The latter performs faceting for two variables putting one vertically and the other horizontally.
When there are many levels of a single variable, facet_wrap()
can often allow comparisons across that variable.
ggplot(GI_wf, aes(x = week, y = count)) +
geom_point() +
facet_wrap(~ facility)
By default, faceting forces all axes to be exactly the same. If you want the axis to automatically scale for each facet use scales="free"
.
ggplot(GI_wf, aes(x = week, y = count)) +
geom_point() +
facet_wrap(~ facility, scales = "free")
When there are two variables (with few levels in each), facet_grid()
can allow relevant comparisons.
Suppose we are interested in weekly GI counts by gender and age category. First, construct the data set
GI_sa <- GI %>%
group_by(week, gender, ageC) %>%
summarize(count = n())
Next, construct the plot using facet_grid(row ~ column)
syntax.
ggplot(GI_sa, aes(x = week, y = count)) +
geom_point() +
facet_grid(gender ~ ageC)
In addition, we could use faceting together with colors/shapes
ggplot(GI_sa, aes(x = week, y = count, shape = gender, color = gender)) +
geom_point() +
facet_wrap(~ ageC)
Construct a plot of weekly GI counts by zip3 and ageC.
# Construct a plot of weekly GI counts by zip3 and ageC.
When you have completed the activity, compare your results to the solutions.
Often our data set is large and we would like to filter the data to focus on a particular group or time frame. You may want to filter by a
Date
variable.For a categorical variable, you want to filter by
Suppose we are only interested in counts for infectious and parasitic diseases (IPD).
IPD_w <- GI %>%
filter(icd9class == "infectious and parasitic disease") %>%
group_by(week) %>%
summarize(count = n())
ggplot(IPD_w, aes(x = week, y = count)) +
geom_point()
Suppose we are interested in looking at both IPD and ill-defined conditions
conditions = levels(GI$icd9class)
conditions
## [1] "infectious and parasitic disease"
## [2] "diseases of the digestive system"
## [3] "symptoms, signs, and ill-defined conditions"
## [4] "injury and poisoning"
So we want levels 1 and 3.
IPDp <- GI %>%
filter(icd9class %in% levels(icd9class)[c(1,3)])
summary(IPDp$icd9class)
## infectious and parasitic disease
## 1611
## diseases of the digestive system
## 0
## symptoms, signs, and ill-defined conditions
## 12229
## injury and poisoning
## 0
Now construct the graph
IPDp_w <- IPDp %>%
group_by(week, icd9class) %>%
summarize(count = n())
ggplot(IPDp_w, aes(x = week, y = count)) +
geom_point() +
facet_wrap(~ icd9class, scales = "free")
Suppose we are interested in all icd9 codes except IPD.
# Combine filtering and summarizing
nIPD_w <- GI %>%
filter(icd9class != "infectious and parasitic disease") %>%
group_by(week, icd9class) %>%
summarize(count = n())
ggplot(nIPD_w, aes(x = week, y = count)) +
geom_point() +
facet_wrap(~ icd9class, scales = "free")
Suppose we are interested in all icd9 codes except IPD and ill-defined conditions.
nIPDp_w <- GI %>%
filter(!(icd9class %in% levels(icd9class)[c(1,3)])) %>%
group_by(week, icd9class) %>%
summarize(count = n())
ggplot(nIPDp_w, aes(x = week, y = count)) +
geom_point() +
facet_wrap(~ icd9class, scales = "free")
We may want to subset a continuous (numeric) variable by
Age60p_w <- GI %>%
filter(age > 60) %>%
group_by(week) %>%
summarize(count = n())
ggplot(Age60p_w, aes(x = week, y = count)) +
geom_point()
Age_lte30_w <- GI %>%
filter(age <= 30) %>%
group_by(week) %>%
summarize(count = n())
ggplot(Age_lte30_w, aes(x = week, y = count)) +
geom_point()
Age30to60_w <- GI %>%
filter(age > 30, age <= 60) %>% # This includes those who are 60, but not those who are 30
group_by(week) %>%
summarize(count = n())
ggplot(Age30to60_w, aes(x = week, y = count)) +
geom_point()
Filtering a date works similar to filtering a continuous (numeric) variable once you convert a comparison value to a Date
using the as.Date()
function.
Let’s create a Date
vector containing the two values we will use for filtering.
two_dates = as.Date(c("2007-01-01", "2007-12-31"))
early <- GI %>% filter(date < two_dates[1])
late <- GI %>% filter(date >= two_dates[2])
mid <- GI %>% filter(date >= two_dates[1], date < two_dates[2])
Now make plots
early_w <- early %>% group_by(week) %>% summarize(count = n())
late_w <- late %>% group_by(week) %>% summarize(count = n())
mid_w <- mid %>% group_by(week) %>% summarize(count = n())
g1 = ggplot(early_w, aes(x = week, y = count)) + geom_point()
g2 = ggplot(late_w, aes(x = week, y = count)) + geom_point()
g3 = ggplot(mid_w, aes(x = week, y = count)) + geom_point()
To arrange multiple plots, use the grid.arrange
function in the gridExtra
package.
# you may need to run install.packages("gridExtra")
if (require('gridExtra'))
gridExtra::grid.arrange(g1,g3,g2)
## Loading required package: gridExtra
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
This set of figures would have been a lot easier to construct if we had simply created a new variable that had values early, mid, and late and then used faceting to create the plot, e.g.
cut_dates <- c(as.Date("1900-01-01"),
two_dates,
Sys.Date())
GI_time <- GI %>%
mutate(time = cut(date,
breaks = cut_dates,
labels = c("early","mid","late"))) %>%
group_by(week, time) %>%
summarize(count = n())
ggplot(GI_time,
aes(x = week, y = count)) +
geom_point() +
facet_wrap(~ time, ncol = 1, scales = "free")
Construct a plot for those in zipcode 206xx between Jan 1, 2008 and Dec 31, 2008.
# Filter the data to zipcode 206xx between Jan 1, 2008 and Dec 31, 2008
# Aggregate the date for each week in this time frame
# Construct the plot of weekly GI counts in zipcode 206xx.
When you have completed the activity, compare your results to the solutions.
To make the graphics we are producing look professional, we will need to customize the graph according to the final production platform, e.g. website, word document, pdf document, etc. This will be highly specific to your use. Nonetheless, likely you will use the following ggplot2
functions:
scale_color_manual()
labs()
theme()
But there are many others. One thing you will notice is that we can keep updating the graph by updating the R object that holds the graph.
Let’s start with the following graph
GI$weekD = as.Date(GI$weekC)
GI_sum <- GI %>%
group_by(weekD, gender, ageC) %>%
summarize(count = n())
ggplot(GI_sum, aes(x = weekD, y = count, color = gender)) +
geom_point(size = 3) +
facet_grid(ageC ~ ., scales = "free_y")
The easiest way to change some of the labels on the graph is to change the underlying data. Here we change the levels associated with the ageC
variable.
levels(GI_sum$ageC)
## [1] "(-Inf,5]" "(5,18]" "(18,45]" "(45,60]" "(60, Inf]"
levels(GI_sum$ageC) = paste("Age:", c("<5","5-18","18-45","45-60",">60"))
table(GI_sum$ageC)
##
## Age: <5 Age: 5-18 Age: 18-45 Age: 45-60 Age: >60
## 402 402 402 402 400
g = ggplot(GI_sum, aes(x = weekD, y = count, color = gender)) +
geom_point(size = 3) +
facet_grid(ageC ~ ., scales = "free_y")
g
g
is the R object that contains the graph
Intuitive colors can make a graph easier to read.
g = g + scale_color_manual(values=c("hotpink","blue"), name='Gender')
g
Use proper names and capitalization in title and axis labels.
g = g +
labs(title = "Weekly GI cases", x = "Year", y = "Weekly count") +
theme(legend.position = "bottom")
g
Try alternate themes:
g = g + theme_bw()
g
To see a list of possibilities:
?theme_bw
?theme
Depending on the final platform, font sizes may need to be adjusted.
g = g + theme(title = element_text(size=rel(2)),
text = element_text(size=16),
legend.background = element_rect(fill = "white",
size = .5,
color = "black"))
g
To export the graphs, use the ggsave()
function which saves the last plot that you displayed. As a default, it uses the size of the current graphics device, but you will likely want to modify this.
g
ggsave("plot.pdf")
## Saving 7 x 5 in image
If you look in your current working directory, you will see the plot.pdf file.
ggsave("plot.pdf", width=14, height=8)