Chapter 4: Covariance and Correlation
A great way to understand how two continuous variables relate is through a scatterplot. A scatterplot shows one variable on the x-axis and the other on the y-axis. Take, for example, the continuous variables height and weight, with height on the x-axis and weight on the y-axis.
Figure 4.1. Dataset of height vs. weight. n=3.
To create the plot, one point is drawn for each individual at that individual's (height, weight) values, so in this case there are three points at (60,150), (62,175), and (63,170).
Figure 4.2. Scatterplot of height vs. weight. n=3.
Notice that the x- and y-axes are clearly labeled to show which variable they represent. The reader can tell at a glance, even with a dataset this small, that the relationship between height and weight is positive, meaning that, in general, as height increases, so does weight. Create the dataset in SPSS and create the scatterplot via Graphs → Legacy Dialogs → Scatter/Dot → Simple Scatter.
Now click ‘Define’. Next, put “Weight” in the ‘Y Axis’ box and “Height” in the ‘X Axis’ box and click ‘OK’. Your scatterplot should look like Figure 4.2.
In the scatterplot above we could see that as height increased in our dataset, weight also increased. This arranged our points in a shape approximating a line. We call this type of association between our x and y variables a linear association. A good way to judge a linear association visually is to draw an oval around the points on your scatterplot. The longer and skinnier the oval is, the stronger the linear association is. Because real datasets are more complicated than the one above, it is also useful to have a number that summarizes the linear association between x and y. One such measure is the covariance, which is defined by the following formula:
$$\mathrm{Covariance}(x,y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
Where xi and yi are the individual data points for x and y, x̄ and ȳ are the means of x and y, and n is the sample size. For example, x could be height and y could be weight. Then x̄ would be the mean height and ȳ would be the mean weight for the dataset.
-∞ < Covariance(x,y) < ∞
Covariance has no lower or upper bound: it can be any value from negative infinity to positive infinity, with a value of 0 signifying no linear association between x and y. For our height and weight data in Figure 4.1 the covariance is 17.5, which shows a positive relationship but tells us little about its strength, because the size of the covariance depends on the units in which x and y are measured. To say more about the strength, we use the Pearson correlation coefficient.
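The covariance quoted above can be checked by hand. The following is a minimal Python sketch, not part of the SPSS workflow, that applies the sample covariance formula (with the n - 1 denominator, which is the version that reproduces 17.5) to the three data points. The function name `sample_covariance` is made up for illustration, and the units (inches and pounds) are an assumption, since the chapter does not state them. Converting height to centimeters changes the covariance even though the people are the same, which is why covariance alone says little about strength.

```python
# Height and weight for the three people in Figure 4.1.
# Units (inches, pounds) are assumed; the chapter does not state them.
heights = [60, 62, 63]
weights = [150, 175, 170]


def sample_covariance(x, y):
    """Sum of the products of deviations from the means, divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    products = [(xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)]
    return sum(products) / (n - 1)


cov_inches = sample_covariance(heights, weights)

# Same people, with height converted to centimeters (1 inch = 2.54 cm).
heights_cm = [h * 2.54 for h in heights]
cov_cm = sample_covariance(heights_cm, weights)

print(round(cov_inches, 2))  # 17.5
print(round(cov_cm, 2))      # 44.45
```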
Pearson Correlation Coefficient
The Pearson correlation coefficient is a standardized version of the covariance of a pair of variables. Instead of ranging from -∞ to ∞ like covariance, the Pearson correlation ranges only from -1 to 1, and both endpoints can be reached.
-1 ≤ rxy ≤ 1
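In equation form, the Pearson correlation between x and y, generally written rxy, is the covariance divided by the product of the two standard deviations:
$$r_{xy} = \frac{\mathrm{Covariance}(x,y)}{\sigma_x \sigma_y}$$
Where 𝜎x and 𝜎y are the standard deviations of x and y. Dividing by the standard deviations removes the units of measurement and is what bounds the Pearson correlation coefficient between -1 and +1.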
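As a worked check of this definition, here is a short Python sketch (an illustration, not the SPSS procedure) that divides the sample covariance by the product of the sample standard deviations for the height and weight data. The helper names are hypothetical, and inches for height is an assumption. The result does not change when height is converted to centimeters, which is what "standardized" means in practice.

```python
import math

# Height and weight data from Figure 4.1 (height assumed to be in inches).
heights = [60, 62, 63]
weights = [150, 175, 170]


def sample_covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def sample_std(x):
    """Sample standard deviation: square root of the covariance of x with itself."""
    return math.sqrt(sample_covariance(x, x))


def pearson_r(x, y):
    """Covariance divided by the product of the two standard deviations."""
    return sample_covariance(x, y) / (sample_std(x) * sample_std(y))


r = pearson_r(heights, weights)

# Changing the unit of height rescales the covariance but not the correlation.
heights_cm = [h * 2.54 for h in heights]
r_cm = pearson_r(heights_cm, weights)

print(round(r, 3))     # 0.866
print(round(r_cm, 3))  # 0.866
```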
A Pearson correlation coefficient of -1 indicates a perfect negative linear association, as can be seen in Figure 4.3: as x gets larger, y gets smaller. A Pearson correlation coefficient of 0 means there is no linear association between x and y, as can be seen in Figure 4.4. A Pearson correlation coefficient of 1 means there is a perfect positive linear association, as can be seen in Figure 4.5: as x increases, y also increases.
Figure 4.3. Scatterplot of a perfect negative Pearson correlation coefficient (rxy=-1).
Figure 4.4. Scatterplot of a relationship where there is no linear association between x and y (rxy=0).
Figure 4.5. Scatterplot of a perfect positive linear association (rxy=1).
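The three cases in Figures 4.3 to 4.5 can also be reproduced numerically. The sketch below uses small made-up x and y values, chosen only for illustration and not taken from the figures. Points that fall exactly on a rising line give a correlation of 1, points on a falling line give -1, and a symmetric U-shaped pattern gives 0 even though y clearly depends on x, because the Pearson correlation measures only linear association.

```python
import math


def pearson_r(x, y):
    """Pearson correlation computed from deviations around the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # The n - 1 terms cancel, so they are left out here.
    return sxy / math.sqrt(sxx * syy)


# Illustrative values only; these are not the data behind the figures.
x = [1, 2, 3, 4, 5]
y_rising = [2 * v + 1 for v in x]    # points on a line sloping upward
y_falling = [10 - 3 * v for v in x]  # points on a line sloping downward

x_sym = [-2, -1, 0, 1, 2]
y_curve = [v ** 2 for v in x_sym]    # U-shaped: related, but not linearly

r_up = pearson_r(x, y_rising)
r_down = pearson_r(x, y_falling)
r_curve = pearson_r(x_sym, y_curve)

print(round(r_up, 3))     # 1.0
print(round(r_down, 3))   # -1.0
print(round(r_curve, 3))  # 0.0
```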
Using the Pearson correlation coefficient allows us to describe the strength of the association between x and y. While magnitude cutoffs are arbitrary, an |rxy| > .8 is generally considered a strong correlation, an |rxy| of about .5 a moderately strong correlation, and an |rxy| < .2 a weak correlation. These cutoffs vary by discipline: in the harder sciences, such as chemistry or physics, a relationship may be considered strong only if |rxy| > .9, while in education a correlation of .5 would be considered very strong.
A good way to visually determine the correlation is to draw an oval around the points on your scatterplot. The longer and skinnier the oval is, the stronger the correlation is. Figure 4.6 has a weak correlation between x and y, while Figure 4.7 has a strong correlation.
For the height and weight dataset the correlation is .866, signifying a strong relationship between height and weight. Calculate the Pearson correlation coefficient yourself via Analyze → Correlate → Bivariate. Move both height and weight into the ‘Variables’ box and click ‘OK’.
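The rule-of-thumb cutoffs above can be written as a small helper. This is only a sketch: the function name `describe_correlation` is hypothetical, and labeling everything between the weak and strong cutoffs as "moderate" is an assumption made here to fill the ranges the cutoffs leave open. The cutoffs are parameters because, as noted, they differ across disciplines.

```python
def describe_correlation(r, strong=0.8, weak=0.2):
    """Label a Pearson correlation using rule-of-thumb cutoffs.

    The defaults follow the cutoffs in the text. Calling values between
    `weak` and `strong` "moderate" is an assumption of this sketch.
    """
    if not -1 <= r <= 1:
        raise ValueError("a Pearson correlation must lie between -1 and 1")
    if r == 0:
        return "no linear association"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)  # strength depends on the magnitude, not the sign
    if size > strong:
        strength = "strong"
    elif size < weak:
        strength = "weak"
    else:
        strength = "moderate"
    return f"{strength} {direction}"


print(describe_correlation(0.866))             # strong positive
print(describe_correlation(-0.15))             # weak negative
print(describe_correlation(0.5))               # moderate positive
print(describe_correlation(0.85, strong=0.9))  # moderate positive
```

The last call shows how the same value can be labeled differently when a stricter cutoff is used, as in the physical sciences.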
Click on this link to download the .sav file (an SPSS data file).
Example dataset 4.1.sav
Once you have downloaded this dataset, open it in SPSS and go to the variable view. You should see the variables shown in Figure 4.6.
Figure 4.6. Screenshot of example dataset 4.1
Now go to Graphs → Legacy Dialogs → Scatter/Dot → Simple Scatter and click ‘Define’. Move the first variable “Age in years” to the ‘X Axis’ box and the second variable “Number of books…” to the ‘Y Axis’ box, as seen in Figure 4.7. Now click ‘OK’.
Figure 4.7. Graphical user interface (GUI) for simple scatterplot in SPSS.
You will produce the output shown in Figure 4.8.
Figure 4.8. Scatterplot of age vs. number of books read in past year. n=20
Now, draw an oval around the points in your scatterplot. How would you describe the correlation between these variables?
Repeat the steps you just used with the variable “Hours spent on social…” on the ‘X Axis’ and variable “Depression score…” on the ‘Y Axis’. Your scatterplot will look like Figure 4.9.
Figure 4.9. Scatterplot of hours spent on social media per week vs. depression score on the DASS-21. n=20
How is this scatterplot different from the scatterplot in Figure 4.8? What does the shape of the oval say about the correlation?
Now, repeat the steps using variable “Depression score…” on the ‘X Axis’ and “Anxiety score…” on the ‘Y Axis’. Your scatterplot will look like Figure 4.10.
Figure 4.10. Scatterplot of depression score on the DASS-21 vs. anxiety score on the DASS-21. n=20
How does this scatterplot compare to the previous two? What does the shape of the oval say about the correlation?
Now go to Analyze → Correlate → Bivariate, as seen in Figure 4.11.
Figure 4.11. Graphical user interface (GUI) for bivariate correlation in SPSS.
Move variables “Age in years” and “Number of books…” to the ‘Variables’ box and select ‘Show only the lower triangle’, as shown in Figure 4.12. Now click ‘OK’.
Figure 4.12. Graphical user interface (GUI) for bivariate correlation in SPSS.
Your correlation table will look like Figure 4.13. What does this correlation say about the relationship between age and the number of books read in the past year?
Figure 4.13. Correlation table for age in years vs. number of books read in past year.
Now, create a correlation table for the variables “Hours spent on social…” and “Depression score…”. Your correlation table will look like Figure 4.14.
Figure 4.14. Correlation table for hours spent on social media per week and depression score on the DASS-21.
What kind of correlation is present between these two variables? Why might that be? How does it differ from the previous example? Finally, create a correlation table for the variables “Depression score…” and “Anxiety score…”. Your correlation table will look like Figure 4.15.
Figure 4.15. Correlation table for depression score on the DASS-21 and anxiety score on the DASS-21.
What is the strength of this correlation? Does it make sense for that kind of relationship to exist between the variables?
Figure 4.16. Structural equation modeling (SEM) diagram of the correlation between anxiety and depression on the DASS-21.
You can use the following datasets to practice creating correlations in SPSS. The variable names you should use are in italics. The independent variable (x) is listed first, followed by the dependent variable (y). For each item:
- Create a scatterplot of the two variables
- Find the Pearson correlation
- Characterize the correlation (strong/weak, positive/negative)
Practice problems (Answer key)
- Use this dataset to find the Pearson correlation between the diamonds’ weight (Carat_Weight) and price (Price).
- Use this dataset to find the Pearson correlation between the cars’ curb weight (curbweight) and miles per gallon in a city (citympg).
- Use this dataset to find the Pearson correlation between the cars’ height in inches (height) and horsepower (horsepower).
- Use this dataset to find the Pearson correlation between the cars’ engines’ peak revolutions per minute (peakrpm) and price in dollars (price).
- Use this dataset to find the Pearson correlation between the irises’ sepal length (sepal_length) and petal length (petal_length).
- Use this dataset to find the Pearson correlation between the air temperature in Celsius (temp) and relative humidity in percent (RH).
- Use this dataset to find the Pearson correlation between the wind speed in km/hour (wind) and rain in millimeters (rain).
Next, you will find these Pearson correlations in AMOS. These instructions will walk you through how to do this using practice problem #1. (AMOS output for practice problems 2 to 7).