PDF file location: http://www.murraylax.org/rtutorials/gog.pdf

HTML file location: http://www.murraylax.org/rtutorials/gog.html

*Note on required packages:* The following code requires the packages `tidyverse`, `scales`, `stringr`, `Hmisc`, and `ggthemes`.

The `tidyverse` package contains many packages that allow you to organize, summarize, and plot data. We use the `scales` library to customize the scales of our axes. The `stringr` package allows us to manipulate strings, which we use to adjust text labels. The `Hmisc` package provides mathematical and statistical functions to use with our plots. The `ggthemes` package provides multiple themes, which are combinations of parameters that change a plot's look and feel.

If you have not already done so, download, install, and load the libraries with the following code:

`install.packages("tidyverse") # This only needs to be executed once for your machine`

`install.packages("scales") # This only needs to be executed once for your machine`

`install.packages("stringr") # This only needs to be executed once for your machine`

`install.packages("Hmisc") # This only needs to be executed once for your machine`

`install.packages("ggthemes") # This only needs to be executed once for your machine`

`library("ggplot2") # This needs to be executed every time you load R`

`library("scales") # This needs to be executed every time you load R`

`library("stringr") # This needs to be executed every time you load R`

`library("Hmisc") # This needs to be executed every time you load R`

`library("ggthemes") # This needs to be executed every time you load R`

The **grammar of graphics** is a way of thinking about how graphs are constructed that allows data analysts to move beyond thinking about a small number of graph types (like bar graphs, line graphs, scatter plots, etc.).

Think about the grammar of graphics just like you would about the grammar of sentence structure in language. We think beyond understanding language as just sentence types, like declarative sentences (statements of fact), imperative sentences (statements of request), or exclamatory sentences (statements of excitement/emotion).

Grammar in language considers multiple structures within the sentence that are layered together. For example, at a minimum, the structures of noun and verb can be put together to form the simplest sentence of any type. We can optionally add more structures like conjunctions, pronouns, and prepositions, and layer these together in simple or complex ways to form more complex sentences. We can layer these structures to produce sentences that have infinite possibilities for communicating.

So too with the grammar of graphics. There are seven *structures* or *layers* in the grammar of graphics. We can layer on just a minimum number of structures (just as a minimal sentence can have just one noun and one verb), or make more complicated graphs by using multiple types of layers and multiple layers of some of the types. This is just like a sentence that includes multiple types of structures (nouns, verbs, prepositions) and more than one structure of a given type (more than one noun).

The following three layers are the minimum necessary for any type of plot:

**1. Data**: A data frame with one or more variables, each with one or more observations.

**2. Aesthetic**: A mapping of one or more variables to one or more visual elements on the graph. For example, you could map a variable to the x-axis, another variable to the y-axis, and a categorical variable to color so that different categories get plotted with different colors.

**3. Geometry**: The type or shape of the visual elements on the graph. For example, this could be a *point* in the case of a scatter plot, a *bar* in the case of a bar plot, or a *line* in the case of a line plot.

*Data set:* The following example uses a sample from the 2016 Current Population Survey, a monthly survey conducted by the U.S. Census Bureau and Bureau of Labor Statistics. The data is used, among other things, to compute employment statistics related to earnings, hours employed, and unemployment. This particular sample includes 1,552 observations and includes only heads of household.

The line below downloads and loads the data set.

`load(url("http://murraylax.org/datasets/cps2016.RData"))`

The data is in a `data.frame` object called `df` and a description of the variables is given in another `data.frame` object called `df.desc`. You can familiarize yourself with the data set by opening these data frames in *RStudio*. Alternatively, you can get a short description of the data frame `df` with a call to the `str()` function.

`str(df)`

```
## 'data.frame': 1552 obs. of 16 variables:
## $ age : num 46 37 35 38 66 28 50 49 63 43 ...
## $ incwage : num 24000 86000 22000 35000 50000 24000 75000 52000 40000 30000 ...
## $ sex : Factor w/ 2 levels "Female","Male": 1 2 1 2 2 1 1 2 2 2 ...
## $ race : Factor w/ 5 levels "Asian/Pacific Islander",..: 5 5 2 5 5 5 5 5 5 5 ...
## $ empstat : Factor w/ 4 levels "Armed Forces",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ inlf : num 1 1 1 1 1 1 1 1 1 1 ...
## $ unempl : num 0 0 0 0 0 0 0 0 0 0 ...
## $ industry : Factor w/ 7 levels "Agriculture, Forestry, and Fishing",..: 7 6 5 3 1 5 5 6 5 7 ...
## $ usualhrs : num 32 40 NA 40 60 40 45 40 45 40 ...
## $ ureason : Factor w/ 4 levels "Job Leaver","Job Loser",..: NA NA NA NA NA NA NA NA NA NA ...
## $ vetran : num 0 0 0 0 1 0 0 0 0 0 ...
## $ usualhrearn: num 14.4 41.2 NA 16.8 16 ...
## $ edu : Ord.factor w/ 5 levels "Less than high school"<..: 2 4 3 1 3 3 5 3 3 3 ...
## $ medoop : num 1000 3300 3570 1500 1730 ...
## $ insprem : num 0 1000 1800 0 480 6500 0 2000 16000 4000 ...
## $ totmed : num 1000 4300 5370 1500 2210 ...
```

We will use the three layers of the grammar of graphics above to create a basic scatter plot of the data. We have two numerical variables in the data set, the survey respondents’ annual income from wages and salaries (`incwage`) and their annual total health care expenditures, including health insurance premiums and out-of-pocket expenses for health care services (`totmed`). We will create a scatter plot with `incwage` on the horizontal axis and `totmed` on the vertical axis.

We will first create a ggplot object with the first two layers, **data** and **aesthetics**, and assign it to an object called `scatter.base.p`:

`scatter.base.p <- ggplot(data=df, mapping=aes(x=incwage, y=totmed))`

The function call `ggplot()` above sets two parameters. The first parameter, `data`, is set equal to the data frame `df`, which ggplot will look to for the data.

The second parameter, `mapping`, is the aesthetics layer. Here we map variable names from the data frame to visuals in our plot. We call another function here called `aes()` to create the mapping. The parameters we can set in our call to `aes()` include all the possible visual elements of the graph that can be mapped to variables. We use two visual elements, the x-axis and y-axis. We map the x-axis to income with `x=incwage` and the y-axis to total medical expenditures (insurance premiums plus out-of-pocket expenses) using `y=totmed`.

You will not see a plot yet, but this does set up the foundation for the plot. It does not yet specify a geometry layer, i.e. the geometrical shapes, to put on the graph.
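As a minimal sketch of why no plot appears yet, the ggplot object can be inspected to count its layers before and after a geometry is added. The data frame `toy` and its columns are hypothetical stand-ins, not part of the tutorial's data set.

```r
library(ggplot2)

# Hypothetical toy data standing in for the tutorial's data frame
toy <- data.frame(income=c(20000, 45000, 80000),
                  medical=c(1500, 3200, 5100))

# Data and aesthetics only: nothing to draw yet
base.p <- ggplot(data=toy, mapping=aes(x=income, y=medical))
n.before <- length(base.p$layers)

# Adding a geometry creates the first drawable layer
points.p <- base.p + geom_point()
n.after <- length(points.p$layers)

c(before=n.before, after=n.after)
```

The count goes from zero to one, which is the sense in which the base object is only a foundation until a geometry is attached.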

Additional layers can be “added” to the ggplot object using the addition symbol (+). We next add a **geometry** layer. There are many types of geometries that are available for ggplot. We add *points* to the base plot above, and save this to a new object called `scatter.p`.

`scatter.p <- scatter.base.p + geom_point()`

We can now view the graph by calling `scatter.p` at the R command line:

`scatter.p`

We have a plot, even if it is not too pretty, created with the minimum requirements for layers. Like a minimal sentence with just a noun and verb, it communicates something, but maybe not very much, and maybe not very effectively. We will next add a few layers that make the plot not only more visually appealing, but also more effective at communicating the relationship between our two variables.

Our plot suffers from a common problem called **over-plotting**. This is when a number of data points or statistical visuals land on top of one another which hides data and can obscure a visual relationship.

Here are some strategies we can employ with this plot to overcome our over-plotting problem:

- Decrease size of geometries: We can make the points smaller.
- Make geometries partially transparent: We can make it so that our points are not completely opaque so we can “see through” the points to find more points underneath. When a number of points land on top of one another, we will be better able to see the multiple points underneath. We will see darker areas where many points are clustered close together.
- Zoom in: We can zoom in so that we visualize most of our sample, but not our outliers. The limits for annual income on our horizontal axis range from $0 per year to over $1,000,000 per year, yet most of our data includes individuals who earn less than $130,000 per year.

Let us first employ the first two strategies: making our points smaller and adding some transparency to them. To do this, we will change our call to the geometry layer. Consider the following plot:

`scatter.p <- scatter.base.p + geom_point(alpha=0.3, size=0.7)`

We passed two optional parameters to the function `geom_point()`. The first, `alpha=0.3`, sets the transparency. This is a number between 0 and 1, where 1 is completely opaque (the default) and 0 is completely invisible. A value for transparency of 0.3 lets us see the points, but gives us a lot of transparency to also see through them. The second parameter, `size=0.7`, gives us a smaller point size. Picking the right level of transparency and size for a geometry takes some trial and error and will be different for every plot.
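To see why clusters of semi-transparent points look darker, here is a small arithmetic sketch. It assumes the usual "over" alpha compositing, where each new point covers a fraction `alpha` of whatever still shows through; the helper `stacked.opacity()` is a hypothetical name introduced only for this illustration.

```r
# Combined opacity of n identical points stacked on one spot,
# assuming standard "over" compositing: 1 - (1 - alpha)^n
stacked.opacity <- function(alpha, n) {
  1 - (1 - alpha)^n
}

# Opacity of one to five overlapping points drawn with alpha=0.3
opacity <- stacked.opacity(alpha=0.3, n=1:5)
round(opacity, 3)
```

Under this assumption two overlapping points are already about half opaque, and five are over 80 percent opaque, so dense regions stand out while isolated points stay faint.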

Let’s see what the new plot looks like:

`scatter.p`

Let us next zoom in so that our plot displays well the large majority of the data while omitting the outliers. We can zoom in by specifying another *layer* of the plot, the **coordinates layer**. The coordinates layer specifies the type of coordinate system to use and it allows you to customize the dimensions. The most common coordinate system for a scatter plot, and the one we have been using by default, is the Cartesian plane.

In the code below, we explicitly add the Cartesian coordinate system with a call to the function `coord_cartesian()` and we specify the smallest and largest values for each of the x- and y-axes. We start with the plot we already created and saved in an object called `scatter.p`, add the coordinates layer, and save the resulting plot in an object called `scatter.p.zoom`.

```
scatter.p.zoom <- scatter.p + coord_cartesian(xlim=c(0,130000), ylim=c(0,30000))
scatter.p.zoom
```

The plot itself looks much better. While the wide variability in the data still does not make it too clear whether there is a strong positive correlation, we can see the distribution of the data quite well. In the next subsection, we will fix up some of the non-data ink, including the axes scale labels, axes titles, and plot title.
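A design note on zooming, sketched with hypothetical toy data: limits set through `coord_cartesian()` only change the visible window, while limits set on a scale (for example with `scale_x_continuous(limits=...)`) replace out-of-range values with `NA` before anything is drawn or summarized. The data frame `toy` below is an assumption made for illustration.

```r
library(ggplot2)

# Hypothetical toy data: ten points, half of them beyond x = 5
toy <- data.frame(x=1:10, y=(1:10)^2)
base.p <- ggplot(data=toy, mapping=aes(x=x, y=y)) + geom_point()

# Zooming with the coordinates layer keeps every observation
zoom.p <- base.p + coord_cartesian(xlim=c(0,5))
zoom.x <- layer_data(zoom.p)$x

# Setting limits on the scale censors out-of-range values to NA
clip.p <- base.p + scale_x_continuous(limits=c(0,5))
clip.x <- layer_data(clip.p)$x

c(zoom.missing=sum(is.na(zoom.x)), clip.missing=sum(is.na(clip.x)))
```

For a plain scatter plot the two approaches look alike, but once a statistics layer is involved, scale limits change the data the statistic is computed from, which is one reason to prefer the coordinates layer for zooming.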

In this section, we clean up some of the non-data ink to make our graph more visually appealing and more effectively communicate our variables and how they are measured. Below we give the axes more descriptive labels and give the plot a title. We also saw in the previous subsection that the scale labels on the horizontal axis were inconveniently expressed in scientific notation (i.e. `5e+04` was given as a label rather than `50,000`). In the code below, we add to our zoomed-in scatter plot calls to `scale_x_continuous()` and `scale_y_continuous()`, which set the notation for the x- and y-axis scale labels, respectively.

```
scatter.p.zoom <- scatter.p.zoom +
  scale_x_continuous(labels=dollar) + scale_y_continuous(labels=dollar)
```

The parameter `labels` in each of these functions is set equal to the *function* `dollar()`, which takes a numerical value and returns a string that expresses the same number with a dollar symbol and commas separating each group of three digits to the left of the decimal point. Let us look at the plot now.
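Because `dollar()` is an ordinary function, it can be tried directly at the command line. The short sketch below passes it a few hypothetical tick values to show the strings that would appear on the axis.

```r
library(scales)

# Hypothetical axis tick values
tick.values <- c(0, 50000, 130000)

# Numbers in, formatted strings out
tick.labels <- dollar(tick.values)
tick.labels
# "$0" "$50,000" "$130,000"
```

Passing the function itself (`labels=dollar`, without parentheses) lets the scale call it on whatever tick values it ends up choosing.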

`scatter.p.zoom`

We can set the titles for the x-axis, y-axis, and overall plot with a call to the `labs()` (short for labels) function.

```
scatter.p.zoom <- scatter.p.zoom +
  labs(title="Distribution of Labor Income and Medical Expenditures",
       x="Wage and Salary Income (dollars)", y="Total Medical Expenditures (dollars)")
```

Here is our final scatter plot:

`scatter.p.zoom`

Let us use the same data set and build another common plot using the grammar of graphics in `ggplot`. In this example, we use the usual hourly earnings variable (`usualhrearn`) and education level (`edu`) to compare the *mean* usual hourly earnings across different levels of education. Usual hourly earnings is a numerical variable expressed in dollars. Education level is an *ordinal* variable, which is a categorical variable, expressed descriptively as a string, but encoded with a meaningful order. That is, the variable is coded so that the value `"Baccalaureate degree" > "Some college"` and `"Some college" > "High school diploma"`, etc.

We start with a call to `ggplot()` that specifies the data and aesthetics layer. We use the same data frame as before and map the x-axis to `edu` and the y-axis to `usualhrearn`.

`bar.base <- ggplot(data=df, mapping=aes(x=edu, y=usualhrearn))`

In the case of a bar plot, we do not wish to graph the raw data, but rather a *statistic* that summarizes the data. Our statistic is the *mean*. We wish to compute the mean usual hourly earnings for each education level, and plot rectangular bars with a height equal to the mean. To do this, we will not specify a geometry layer, but rather a **statistics layer** which will both compute the statistic we are interested in and specify the *bar* geometry.

`bar.p <- bar.base + stat_summary(fun.data=mean_sdl, geom="bar")`

The call to `stat_summary()` specifies our statistics and geometry layer. The first parameter, `fun.data=mean_sdl`, tells `stat_summary()` what function to call to calculate the desired statistics. The function `mean_sdl()` takes a single variable as a parameter and computes the mean along with lower and upper limits equal to the mean minus and plus a multiple of the standard deviation (two standard deviations by default). The limits are not used for the bar geometry. The second parameter, `geom="bar"`, tells `stat_summary()` to use the bar geometry for the statistics it calculates.
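To make concrete what `mean_sdl()` returns, the sketch below compares it with a hand computation on a hypothetical vector of earnings. It relies on the default multiplier of 2 and assumes the `Hmisc` package is installed, since `mean_sdl()` is a wrapper around an `Hmisc` function.

```r
library(ggplot2)  # mean_sdl() needs the Hmisc package to be installed

# Hypothetical hourly earnings for one group
earnings <- c(12.5, 18.0, 22.5, 30.0, 41.0)

# By hand: the mean, and the mean minus/plus two standard deviations
by.hand <- c(mean(earnings),
             mean(earnings) - 2*sd(earnings),
             mean(earnings) + 2*sd(earnings))

# mean_sdl() returns a one-row data frame with columns y, ymin, ymax
from.fn <- mean_sdl(earnings)

# Should report TRUE if the two computations agree
all.equal(unname(unlist(from.fn)), by.hand)
```

Only the `y` column, the mean, determines the height of a bar; `ymin` and `ymax` matter for geometries such as error bars.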

Let us look at the plot:

`bar.p`

`## Warning: Removed 271 rows containing non-finite values (stat_summary).`

We get a warning that many observations were not included because there are a number of people in our sample that are not employed so they did not have usual weekly hours and/or wage or salary income. We still have over 1000 data points, so this is not problematic.
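The warning and the bar heights can both be checked by hand. The sketch below uses a hypothetical toy data frame (the names `toy`, `edu`, and `earn` are assumptions for illustration) to count the rows a statistics layer would drop and to compute the group means from the rows that remain.

```r
# Hypothetical toy data with missing earnings for two respondents
toy <- data.frame(
  edu=factor(c("High school", "High school", "College", "College", "College")),
  earn=c(15, NA, 30, 40, NA))

# Rows with non-finite values, which the statistics layer would remove
n.dropped <- sum(!is.finite(toy$earn))
n.dropped

# Group means from the remaining rows: these are the bar heights
group.means <- tapply(toy$earn, toy$edu, mean, na.rm=TRUE)
group.means
```

Applying the same two calls to `df$usualhrearn` and `df$edu` is one way to confirm the number of removed rows reported in the warning.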

So that it is clear the vertical axis refers to usual hourly earnings in dollars, let us specify that the vertical scale labels be expressed in dollars.

`bar.p <- bar.p + scale_y_continuous(labels=dollar)`

Finally, we use the function `labs()` to add a descriptive title for the plot, and we remove the titles for the horizontal and vertical axes since those will be obvious from the title of the plot and the labels on the scales.

```
bar.p <- bar.p + labs(title="Usual Hourly Earnings by Education Level", x="", y="")
bar.p
```

`## Warning: Removed 271 rows containing non-finite values (stat_summary).`

The labels for the education levels are long and overlap with one another. To address this, we can specify a **theme layer**. The theme layer lets us calibrate all non-data ink on our graph. In this case, we will use the theme layer to display the x-axis labels at an angle so that the labels do not overlap and we can read them. The `theme()` function has dozens of possible parameters to calibrate everything from axes labels, titles, tick marks, legend, placement of all of these, background color and shading, and much, much more.

In the call below, we add a theme layer to our `bar.p` object with a call to `theme()` to fix the x-axis labels.

```
bar.p <- bar.p + theme(axis.text.x=element_text(angle=30, vjust=1, hjust=1))
bar.p
```

`## Warning: Removed 271 rows containing non-finite values (stat_summary).`

We set the parameter for `axis.text.x` and call a function `element_text()` that allows us to specify an angle to write the text, vertical justification, and horizontal justification. We set the text at a 30 degree angle, and set both vertical and horizontal justification equal to `1` so that text labels are right-aligned and top-aligned, which puts the end of the labels close to the horizontal axis.

Alternatively, we can wrap our x-axis labels onto new lines so that they do not overwrite each other. To do this, we will change the text for the levels of the ordered factor variable, `edu`. Let us remind ourselves what these levels are:

`levels(df$edu)`

```
## [1] "Less than high school" "High school diploma" "Some college"
## [4] "Baccalaureate degree" "Advanced degree"
```

The code below calls the function `str_wrap()` to re-write the levels for `edu`:

`levels(df$edu) <- str_wrap( levels(df$edu), width=15 )`

The function `str_wrap()` takes as parameters a string or vector of strings and a desired width in characters. We pass the vector of strings that `levels(df$edu)` returns, we specify a maximum width of 15 characters, and `str_wrap()` outputs a new vector of strings that includes new lines as necessary to ensure that no line of any string goes over 15 characters. We save that output to `levels(df$edu)`, which overwrites the existing levels.

Now let us view our levels.

`levels(df$edu)`

```
## [1] "Less than high\nschool" "High school\ndiploma"
## [3] "Some college" "Baccalaureate\ndegree"
## [5] "Advanced degree"
```

The newline characters, `\n`, indicate where a new line will appear.

Unfortunately, we need to recreate the plot from scratch now. We changed the definition of one of our variables (the factor variable `edu` is different, with different levels), which entered into the aesthetics layer in the very first call to `ggplot()`. The code below puts everything together, and is constructed by simply copying and pasting code from above to recreate the plot.

```
bar.p <- ggplot(data=df, mapping=aes(x=edu, y=usualhrearn)) +
  stat_summary(fun.data=mean_sdl, geom="bar") +
  scale_y_continuous(labels=dollar) +
  labs(title="Usual Hourly Earnings by Education Level", x="", y="")
bar.p
```

`## Warning: Removed 271 rows containing non-finite values (stat_summary).`
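As an alternative sketch, not the approach taken in this tutorial, the labels can be wrapped at the scale level so that the factor levels in the data frame are left untouched. The toy data frame and the helper `wrap.labels()` are hypothetical names used only for illustration; `scale_x_discrete()` accepts a function for its `labels` parameter.

```r
library(ggplot2)
library(stringr)

# Hypothetical toy data with long category labels
toy <- data.frame(
  edu=factor(c("Less than high school", "High school diploma",
               "Baccalaureate degree")),
  earn=c(14, 19, 33))

# Helper that wraps any vector of labels at 15 characters
wrap.labels <- function(x) str_wrap(x, width=15)

# Labels are wrapped when the axis is drawn; toy$edu itself is unchanged
wrap.p <- ggplot(data=toy, mapping=aes(x=edu, y=earn)) +
  geom_col() +
  scale_x_discrete(labels=wrap.labels)
wrap.p
```

With this design the wrapping can be added to an existing plot object with `+`, so the plot would not need to be rebuilt from scratch, at the cost of the wrapped labels living in the plot rather than in the data.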

Let’s add some color to our plot. There are many ways of adding color to a plot. Sometimes you may want to map color to a variable, for example, so that different levels of education are represented by different colors. In this case, we will use just one color, but make it something prettier than gray.

We again recreate the plot from scratch, copying and pasting most of the above code. We add two more optional parameters to the `stat_summary()` call though, `col="black"` and `fill="lightblue"`. The `fill` parameter sets the color of the bars, and the `col` parameter sets the color of the *outline* of the bars.

```
bar.p <- ggplot(data=df, mapping=aes(x=edu, y=usualhrearn)) +
  stat_summary(fun.data=mean_sdl, geom="bar", fill="lightblue", col="black") +
  scale_y_continuous(labels=dollar) +
  labs(title="Usual Hourly Earnings by Education Level", x="", y="")
bar.p
```

`## Warning: Removed 271 rows containing non-finite values (stat_summary).`
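For the other option mentioned above, mapping color to a variable, a minimal sketch is to add `fill` to the aesthetics instead of passing a fixed color. The toy data are hypothetical, and the `fun` parameter of `stat_summary()` assumes a reasonably recent version of ggplot2 (3.3.0 or later).

```r
library(ggplot2)

# Hypothetical toy data: three groups with two observations each
toy <- data.frame(
  edu=factor(c("A", "A", "B", "B", "C", "C")),
  earn=c(10, 14, 20, 24, 30, 38))

# fill is mapped to a variable inside aes(), so each group gets its own color;
# col stays outside aes(), so every bar gets the same black outline
color.p <- ggplot(data=toy, mapping=aes(x=edu, y=earn, fill=edu)) +
  stat_summary(fun=mean, geom="bar", col="black")
color.p

# One fill color per group, and bar heights equal to the group means
bar.data <- layer_data(color.p)
bar.data[, c("x", "y", "fill")]
```

The distinction to notice is that a value inside `aes()` is a mapping from data to a visual element, while a value passed directly to the layer is a constant setting.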

Finally, we replace the default theme of the plot with an alternative. Elements of the theme layer can be customized with a significant amount of code, you can create your own custom themes, or you can borrow one of dozens of pretty themes from the `ggthemes` package. Below we add the 538 theme (inspired by the plots on the website http://fivethirtyeight.com/):

```
bar.p <- bar.p + theme_fivethirtyeight()
bar.p
```

`## Warning: Removed 271 rows containing non-finite values (stat_summary).`

Does it look pretty?

In this tutorial, we introduced the concept of the grammar of graphics and illustrated how to use the `ggplot2` package to create graphics using a minimal number of grammar layers. There are still more layers in the grammar of graphics to learn and a whole world of data visualizations and graphical customization that can be built with ggplot. This tutorial sets the framework that is used for all graphics and illustrates the grammar of graphics with two of the simplest and most common plot types.

Use the data frame in this tutorial and create the scatter plots described in the following problems.

(a) Create a scatter plot that illustrates the relationship between a respondent’s age (`age`) and total medical expenses (`totmed`). What problems do you notice about the scatter plot?

(b) Recreate the plot and do the following to remedy over-plotting problems:

- Decrease the point size (parameter is `size`) to 1.0.
- Set transparency (parameter is `alpha`) to 0.4.
- Zoom in: Set the limits for ages to between 18 and 70 years and the limits for total annual medical expenses to between $0 and $30,000. Create a coordinates layer and use parameters `xlim=c(18,70)` and `ylim=c(0,30000)`.

(c) Recreate the plot and set the scale labels and titles so that the plot more easily communicates to the reader what variables the graph shows and how these variables are measured. Change the labels on the vertical axis *scale* so that they are expressed in dollars. Change the text labels of the axes so that they are descriptive and create a title for the overall plot.

(d) Recreate the plot from part (c) and add the geometry, `geom_smooth()`. This function creates a curve that describes the relationship between the two variables, which also includes a shaded region depicting the margin of error in this estimate for the relationship.

Use the data frame in this tutorial and create bar charts as described in the following problems.

(a) Create a bar chart illustrating average annual income by industry.

(b) Recreate the graph in (a) and fix the x-axis labels so that they do not overlap.

(c) Recreate the plot and set the scale labels and titles so that the plot more easily communicates to the reader what variables the graph shows and how these variables are measured. Change the labels on the vertical axis *scale* so that they are expressed in dollars. Create a descriptive title for the overall plot that makes clear what variables are being compared and remove the labels for the horizontal and vertical axes.

(d) Recreate the graph in (c) and add another statistics layer with the following function call, `stat_summary(fun.data=mean_cl_boot, geom="errorbar")`. The `mean_cl_boot` function computes 95% confidence bounds on the mean using the data to estimate the distribution of annual income rather than assuming a normal distribution. The `geom="errorbar"` specifies the geometry used to illustrate these confidence bounds. Does the graph look nice? What does it communicate to you in terms of the relationship between industry and annual income?