PDF file location: http://www.murraylax.org/rtutorials/correlation.pdf

HTML file location: http://www.murraylax.org/rtutorials/correlation.html

*Note on required packages:* The following code requires the packages in the `tidyverse`. The `tidyverse` contains many packages that allow you to organize, summarize, and plot data. If you have not already done so, download and install the libraries (needed only once per computer), and load the libraries (needed every time you start R) with the following code:

`install.packages("tidyverse") # This only needs to be executed once for your machine`

`library("tidyverse") # This needs to be executed every time you load R`

A **correlation** exists between two variables when one is related to the other such that there is **co-movement**. **Positive co-movement** means as one variable increases, the other variable also increases. **Negative co-movement** means as one variable increases, the other variable decreases.
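To make the two kinds of co-movement concrete, the sketch below simulates one variable that tends to rise with `x` and one that tends to fall. The data are made up for illustration (they are not part of the growth data set), and the sign is checked with the `cor()` function that is discussed later in this tutorial.

```r
# Simulated data, purely hypothetical
set.seed(123)
x <- rnorm(200)
y_pos <- 2 * x + rnorm(200)    # tends to increase when x increases
y_neg <- -2 * x + rnorm(200)   # tends to decrease when x increases

sign_pos <- sign(cor(x, y_pos))  # +1 indicates positive co-movement
sign_neg <- sign(cor(x, y_neg))  # -1 indicates negative co-movement
```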

Stock and Watson’s *Introduction to Econometrics* textbook includes a data set with economic growth and education data for 65 countries from 1960-1995. A subset of that data set is available to download from `http://murraylax.org/datasets/growth.RData`

The code below downloads the data file and loads the data set into the workspace as an object named `growthdata`.

`load(url("http://murraylax.org/datasets/growth.RData"))`

We will focus on the average annual growth rate of real GDP from 1960-1995 (labeled `growth`) and the average number of years of schooling for adult residents in the country in 1960 (labeled `yearsschool`).

Let us first create a scatter plot that illustrates the relationship between average years of schooling of adult residents and the subsequent average growth rate over the next 35 years. We can create a scatter plot using the `ggplot()` function as follows:

```
growthplot <- ggplot(growthdata, aes(x=yearsschool, y=growth)) +
  geom_point() +
  xlab("Years Schooling") +
  ylab("Growth Rate %")
```

The first line of the `ggplot()` function call sets the data and aesthetic layers of the plot. The first parameter, `growthdata`, tells ggplot where to find the data. The second parameter, `aes(x=yearsschool, y=growth)`, maps the variable `yearsschool` to the x-axis and the variable `growth` to the y-axis.

The next line adds the geometry layer. In this case, `geom_point()` creates a point for each pair of observations.

The last two lines set the labels for the x-axis and y-axis.

We save the output to an object we call `growthplot`. We can view the plot by entering the name of this object at the R console:

`growthplot`
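The same layering pattern works for any data frame. The sketch below repeats it on `mtcars`, a data set that ships with R and stands in for `growthdata` purely for illustration. The object name `carplot` and the axis labels are our own choices.

```r
library(ggplot2)  # part of the tidyverse

# Data and aesthetic layers, then geometry, then axis labels
carplot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  xlab("Weight (1000 lbs)") +
  ylab("Miles per gallon")

carplot  # entering the object name draws the plot
```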

It appears that years of schooling and real GDP growth may have a positive relationship. We can compute the best fitting straight line that describes this relationship with the function `lm()`, which stands for ‘linear model’. In the code below, we call the `lm()` function and assign its output to a variable we call `growthmodel`.

`growthmodel <- lm(growth ~ yearsschool, data=growthdata)`

The first parameter we passed to the function `lm()` is a *formula* of the form `y ~ x`. This notation means to fit a function that has the linear form \(y = a + b x\). The output variable `growthmodel` contains many components and statistical results that describe the linear relationship between the x and y variables.
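As a quick check of what the formula notation does, the sketch below fits `y ~ x` to hypothetical data that lie exactly on the line \(y = 2 + 3x\). The names `toydata` and `toymodel` are invented for this example. The fitted intercept and slope should recover \(a = 2\) and \(b = 3\).

```r
# Hypothetical data lying exactly on the line y = 2 + 3x
toydata <- data.frame(x = c(1, 2, 3, 4, 5))
toydata$y <- 2 + 3 * toydata$x

# The outcome goes on the left of ~, the explanatory variable on the right
toymodel <- lm(y ~ x, data = toydata)

# The intercept estimates a; the coefficient on x estimates b
coef(toymodel)
```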

We can find out precisely what the equation of the line is by accessing the `coefficients` element of the `growthmodel` object as follows:

`growthmodel$coefficients`

```
## (Intercept) yearsschool
## 0.9582918 0.2470275
```

The output means when Y = (growth rate of real GDP) and X = (average years of schooling of adults in 1960), the equation of the line that best describes the linear relationship between these two variables is \(Y = 0.958 + 0.247 X\).
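To see how the equation is used, the sketch below plugs a value of X into the fitted line. The coefficients are copied from the output above; the choice of 5 years of schooling is a hypothetical input, not a country in the data.

```r
# Coefficients copied from the lm() output above
intercept <- 0.9582918
slope <- 0.2470275

# Hypothetical country with 5 average years of schooling in 1960
schooling <- 5
predicted_growth <- intercept + slope * schooling
predicted_growth  # roughly 2.19 percent per year
```

With the fitted model object in hand, `predict(growthmodel, newdata = data.frame(yearsschool = 5))` should perform the same calculation.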

In a later tutorial, we will discuss the precise equation and hypothesis testing on that equation at length. For now, let us add a graph of this line to our scatter plot, so that we can see the data and the best fitting line together on one graph.

Below, we add another geometry layer to our existing `growthplot` object. The call to the function `geom_smooth(method="lm")` computes the same linear relationship that we computed above, plots the line, and adds a shaded area representing a 95% confidence interval around it.

`growthplot + geom_smooth(method="lm")`

We can see from this graph that an upward sloping line describes the relationship between years of schooling of adults in 1960 and the subsequent 35-year average growth rate of real GDP well. That is, our variables seem to display a *positive, linear co-movement*.

The **Pearson correlation coefficient** is a measure of the strength of a **linear** co-movement between two interval or ratio variables. **Linear co-movement** implies that either an upward sloping or downward sloping **straight line** best describes the relationship.

The Pearson correlation coefficient takes values only between -1.0 and +1.0. The stronger the relationship, the closer the points on the scatter plot will be to the best fitting line. For a positive relationship, the stronger it is, the closer the correlation coefficient will be to +1.0. For a negative relationship, the stronger it is, the closer the correlation coefficient will be to -1.0. If the relationship is weaker, the observations will be farther from the best fitting line, and the correlation coefficient will be closer to 0.0.
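For readers who want to see where the number comes from, the sketch below computes the Pearson coefficient by hand as the sample covariance divided by the product of the two sample standard deviations, and compares it with R's built-in function. The five data points are hypothetical.

```r
# Hypothetical small sample, used only to illustrate the calculation
x <- c(2, 4, 5, 7, 9)
y <- c(1, 3, 2, 6, 8)

# Sample covariance scaled by both sample standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Built-in computation for comparison
r_builtin <- cor(x, y, method = "pearson")

all.equal(r_manual, r_builtin)
```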

The function `cor` can be used to compute the Pearson correlation coefficient for two variables as follows:

`cor(x=growthdata$yearsschool, y=growthdata$growth, method='pearson')`

`## [1] 0.3309986`

We see from our result that the sample estimate for the Pearson correlation coefficient is 0.33. Since this number is positive, the two variables are positively correlated.

Our sample estimate for the correlation coefficient is positive, but is this enough evidence that there is a relationship between years of schooling and real GDP growth in the population? To answer this, let us conduct a hypothesis test with the following null and alternative hypotheses:

**Null hypothesis: \(\rho = 0\)
Alternative hypothesis: \(\rho \neq 0\)**

Following common statistical notation, we use the Greek letter \(\rho\) to denote the *population* Pearson correlation coefficient. The null hypothesis says that the two variables are not correlated, i.e. that there is not a linear relationship. Like all null hypotheses, it states that a population parameter is *equal to* some specified value (zero in this case). The alternative hypothesis says that the two variables are correlated, that there is *some* linear relationship, either positive or negative. The not-equal sign in the alternative hypothesis implies that this is a *two-tailed* test, so either positive or negative Pearson correlation coefficients significantly far away from zero will result in the null hypothesis being rejected.

The function `cor.test` can be called to conduct this hypothesis test as follows:

```
cor.test(x=growthdata$yearsschool, y=growthdata$growth,
         alternative="two.sided", conf.level=0.95, method='pearson')
```

```
##
## Pearson's product-moment correlation
##
## data: growthdata$yearsschool and growthdata$growth
## t = 2.7842, df = 63, p-value = 0.007077
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.09474858 0.53195301
## sample estimates:
## cor
## 0.3309986
```

The first two parameters tell the function which variables to estimate a Pearson correlation coefficient for. The parameter `alternative="two.sided"` tells the function to conduct a two-tailed hypothesis test. Finally, the parameter `conf.level=0.95` is used to construct a 95% confidence interval for the population Pearson correlation coefficient.

The p-value for the hypothesis test is 0.007, which is far below a common significance level of 0.05. With a high degree of confidence we can state we have found sufficient statistical evidence that the average years of schooling is correlated with subsequent real GDP growth.
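The test statistic and p-value in the output can be reproduced from the sample correlation alone. The sketch below uses the standard t statistic for a Pearson correlation, \(t = r\sqrt{n-2}/\sqrt{1-r^2}\), with \(n - 2\) degrees of freedom. The sample size of 65 is taken from the description of the data and agrees with `df = 63` in the output.

```r
r <- 0.3309986  # sample correlation copied from the output above
n <- 65         # number of countries, so df = n - 2 = 63

# t statistic for testing rho = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-tailed p-value: probability in both tails of the t distribution
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

t_stat   # close to the reported t = 2.7842
p_value  # close to the reported p-value = 0.007077
```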

**Confidence Interval**

The 95% confidence interval is also included in the output of `cor.test`. The results reveal an interval estimate for the population Pearson correlation coefficient between 0.095 and 0.53. With 95% confidence, this interval contains the true population Pearson correlation coefficient. Every value in this range is positive, but the range runs from a somewhat weak positive correlation to a strong positive correlation.
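The interval can be reproduced with the Fisher z-transformation, which is the approach we understand `cor.test` to use for the Pearson method. The sketch below transforms the sample correlation with `atanh()`, builds a normal-theory interval with standard error \(1/\sqrt{n-3}\), and transforms back with `tanh()`. The inputs are copied from the output above.

```r
r <- 0.3309986  # sample correlation copied from the output above
n <- 65         # number of countries

z <- atanh(r)          # Fisher z-transformation of r
se <- 1 / sqrt(n - 3)  # approximate standard error of z
crit <- qnorm(0.975)   # critical value for 95% confidence

# Interval on the z scale, transformed back to the correlation scale
ci <- tanh(z + c(-1, 1) * crit * se)
ci  # close to the reported 0.0947 and 0.5320
```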

The **Spearman correlation coefficient** is a non-parametric measure of the relationship between two variables, which measures the strength of the relationship between the **ranks** of observations in the two variables. Because the calculation relies only on ranks, the method is appropriate for ordinal data as well as interval or ratio data.

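Because the Spearman correlation measures the strength of a **linear relationship on ranks** between two variables, and **not** a linear relationship between the raw values, it can also capture relationships that are monotonic but not linear.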

We can create a scatter plot that illustrates the relationship **in the ranks** of average years of schooling of adult residents and the subsequent average growth rate over the next 35 years. We call the `ggplot()` function like before, but instead of passing in the raw data for years of schooling and real GDP growth, we pass in the ranks, with inner calls to the function `rank()`:

```
ggplot(growthdata, aes(x=rank(yearsschool), y=rank(growth))) +
  geom_point() +
  xlab("Years Schooling (Ranks)") +
  ylab("Growth Rate (Ranks)") +
  geom_smooth(method="lm")
```

The Spearman correlation coefficient estimates the strength of the linear relationship between the ranks. It is exactly the same as the Pearson correlation coefficient, except applied to the ranks of the data. We can compute the estimate with either of the following methods:

`cor(x=growthdata$yearsschool, y=growthdata$growth, method='spearman')`

`## [1] 0.376975`

`cor(x=rank(growthdata$yearsschool), y=rank(growthdata$growth), method='pearson')`

`## [1] 0.376975`
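A small made-up example shows why working with ranks matters. In the sketch below, `y` always increases with `x` but along a curve rather than a straight line. The ranks line up perfectly, so the Spearman coefficient is 1, while the Pearson coefficient on the raw values is well below 1.

```r
# Hypothetical data: y is strictly increasing in x, but not linear
x <- 1:10
y <- exp(x)

rank(y)  # ranks keep only the ordering of the observations

pearson_raw <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
pearson_ranks <- cor(rank(x), rank(y), method = "pearson")

# Spearman equals Pearson on the ranks, and both equal 1 here
c(pearson_raw = pearson_raw, spearman = spearman, pearson_ranks = pearson_ranks)
```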

**Hypothesis Testing**

We can conduct hypothesis tests on the Spearman correlation coefficient in the same manner as the Pearson correlation coefficient. If we want to test for evidence for a relationship between schooling and economic growth, we would consider the following null and alternative hypotheses:

**Null hypothesis: \(\rho = 0\)
Alternative hypothesis: \(\rho \neq 0\)**

Again we call the function `cor.test()` to conduct the hypothesis test, this time specifying the Spearman method with the parameter `method='spearman'`.

```
cor.test(x=growthdata$yearsschool, y=growthdata$growth,
         alternative="two.sided", method='spearman')
```

```
## Warning in cor.test.default(x = growthdata$yearsschool, y =
## growthdata$growth, : Cannot compute exact p-value with ties
```

```
##
## Spearman's rank correlation rho
##
## data: growthdata$yearsschool and growthdata$growth
## S = 28510, p-value = 0.001966
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.376975
```
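The numbers in this output can be checked by hand. The sketch below rests on two assumptions drawn from our reading of how `cor.test` behaves when ties are present: that the statistic is computed as \(S = (n^3 - n)(1 - r_s)/6\), and that the p-value comes from a t approximation with \(n - 2\) degrees of freedom. The inputs are copied from the output above.

```r
n <- 65          # number of countries
rho <- 0.376975  # Spearman estimate copied from the output above

# Assumed form of the S statistic when ties are present
S <- (n^3 - n) * (1 - rho) / 6

# Assumed t approximation used when an exact p-value is unavailable
t_stat <- rho * sqrt(n - 2) / sqrt(1 - rho^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

S        # close to the reported S = 28510
p_value  # close to the reported p-value = 0.001966
```

The warning only says that an exact p-value is not available because some observations share the same value. Passing `exact=FALSE` to `cor.test()` requests the approximation directly, which should avoid the warning.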

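Equivalently, we can call `cor.test()` on the ranks of the data instead of the raw data. Ranking data that are already ranks leaves them unchanged, so the call below, which still specifies `method='spearman'`, reproduces the result above exactly. Specifying the *Pearson* method on the ranks would give the same estimate, but would report a t statistic and a confidence interval instead of the S statistic.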

```
cor.test(x=rank(growthdata$yearsschool), y=rank(growthdata$growth),
         alternative="two.sided", conf.level=0.95, method='spearman')
```

```
## Warning in cor.test.default(x = rank(growthdata$yearsschool), y =
## rank(growthdata$growth), : Cannot compute exact p-value with ties
```

```
##
## Spearman's rank correlation rho
##
## data: rank(growthdata$yearsschool) and rank(growthdata$growth)
## S = 28510, p-value = 0.001966
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.376975
```

Using either method, we find a p-value equal to `0.001966`. Since this is below common significance levels, we reject the null hypothesis and conclude there is sufficient statistical evidence that there is a correlation between schooling and economic growth.