Chapter 7: Linear Regression

Linear regression is the mathematical model behind the path diagrams introduced in chapter 1. Here is a path diagram.

Figure 7.1. A basic path diagram showing the relationship between being male and an individual’s depression score.

Linear regression has two purposes:

- Prediction: Given the estimated linear regression parameters, a researcher can predict the next person’s depression score based on their gender.
- Explanation: Looking at Figure 7.1, a researcher can tell that males are predicted to score 2 points higher on depression than females. Thus they may claim that gender causes an increase in depression score. This explains some of the variation in depression scores.

Much of the scientific process is an effort to explain why certain phenomena happen. Linear regression and SEM are powerful tools in achieving that.

The Equation

As mentioned in chapter 1, path diagrams like Figure 7.1 can be more intuitive than the corresponding mathematical equation. Many people will have seen the following equation before:

yi = mxi + b

where yi is the predicted outcome for subject i, xi is the independent variable for subject i, m is the slope, and b is the intercept (the value of yi when xi = 0).

Figure 7.2 labels the quantity m, where m = 2. The quantity b is not shown.

Figure 7.2. A basic path diagram showing the relationship between being male and an individual’s depression score, with m labeled and b not shown.

Statistics changes the formula slightly. Please follow along in Table 7.1:

yi = ꞵ0 + ꞵ1xi + 𝜀i

Because this equation is not particularly intuitive, the following figures will help clarify the equation.

Subject i in this case refers to the study ID for a specific individual. Notice that y, x, and 𝜀 each have a little i next to them. Because that i is a subscript, it is pronounced “sub i,” as in “y sub i.”

In this equation, yi is the observed outcome score, that is, the actual value of the variable depression score for subject i.

The variable xi in this case is the individual’s Male score (0=female, 1=Male).

Column 4 is the predicted value, or ŷi. It is not part of the equation; however, it is used to calculate the error, or 𝜀i. It is the value the model predicts for subject i given their Male value xi. Notice the predicted value has a “hat” over the y to show it is the predicted value.

Look at the subject who has ID=1 and is found in the second row. Their Male variable is 0, meaning they are female. Their predicted depression score (ŷi) is 23, which is found in column 4. This is true for all females in the dataset (rows 1-5). Now, look at the subject who has ID=6, whose Male variable is 1, meaning they are male. Their predicted depression score (ŷi) is 25, which is found in column 4. This is true for all males in the dataset. The equation that is used to calculate this predicted value will be discussed below.

The predicted value does not exactly match the observed value for any subject in the dataset, as can be seen by comparing their observed depression score (yi) in column 2 to their predicted score (ŷi) in column 4. The difference between what is observed (yi) and what is predicted (ŷi) by the model is called the error term (𝜀i).

The quantity 𝜀i, pronounced “epsilon sub i,” is the error term. Linear regression predicts a depression score for subject i based on their gender. The error term is the difference between the subject’s actual observed depression score and that predicted depression score.
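The error term can be made concrete with a small sketch. The observed scores below are made-up values for illustration (they are not the values in Table 7.1); the predicted scores of 23 for females and 25 for males are the ones described above.

```python
# Hypothetical observed depression scores; NOT the values from Table 7.1.
subjects = [
    {"id": 1, "male": 0, "observed": 20},
    {"id": 2, "male": 0, "observed": 26},
    {"id": 6, "male": 1, "observed": 28},
    {"id": 7, "male": 1, "observed": 22},
]

# Predicted scores stated in the chapter: 23 for females, 25 for males.
PREDICTED_BY_MALE = {0: 23, 1: 25}


def error_term(observed, predicted):
    """Return the error term: observed score minus predicted score."""
    return observed - predicted


errors = {}
for s in subjects:
    predicted = PREDICTED_BY_MALE[s["male"]]
    errors[s["id"]] = error_term(s["observed"], predicted)
    print(f"ID {s['id']}: observed={s['observed']}, "
          f"predicted={predicted}, error={errors[s['id']]}")
```

A positive error means the subject scored higher than the model predicted; a negative error means they scored lower.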

The parameter ꞵ0 is the predicted value of the outcome when the predictor, xi, is 0. In this case, for the variable Male, someone who has a score of xi = 0 is female. Thus ꞵ0 is equivalent to b in y=mx+b. The parameter ꞵ1 means that for every one-unit increase in xi, the predicted score will change by ꞵ1. In our case, xi can only take on two values, 0 meaning female and 1 meaning male. Therefore, as the variable changes from female to male, the predicted depression score increases by 2 points. ꞵ1 is equivalent to m in y=mx+b.
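For a 0/1 predictor, the least-squares estimates take a simple form: ꞵ0 equals the mean outcome of the group coded 0, and ꞵ1 equals the difference between the two group means. The sketch below checks this on made-up scores (chosen so the estimates come out to 23 and 2; they are not the chapter's data), using the standard least-squares formulas in plain Python rather than SPSS.

```python
from statistics import mean

# Made-up data for illustration only: male (0 = female, 1 = male)
# and depression score.
male = [0, 0, 0, 1, 1, 1]
score = [21, 23, 25, 22, 25, 28]


def fit_simple_regression(x, y):
    """Ordinary least squares with one predictor: returns (beta0, beta1)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sxy / sxx                # slope
    beta0 = y_bar - beta1 * x_bar    # intercept
    return beta0, beta1


beta0, beta1 = fit_simple_regression(male, score)

# Group means computed directly, for comparison with the estimates.
female_mean = mean(s for m, s in zip(male, score) if m == 0)
male_mean = mean(s for m, s in zip(male, score) if m == 1)

print("beta0:", beta0, "| female mean:", female_mean)
print("beta1:", beta1, "| male mean - female mean:", male_mean - female_mean)
```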

Table 7.1. Subset of data used in calculating the model where male predicts depression score

Taking the information from Figure 7.1 and plugging it into the general formula results in:

(depression score)i = ꞵ0 + 2*(male)i + 𝜀i

Memorize the following:

- yi is the observed outcome score for subject i.
- ꞵ0 is the predicted value of the outcome when the predictor, xi, is 0.
- For every one-unit increase in xi, the predicted score will change by ꞵ1.

Figure 7.3. A basic path diagram showing the relationship between being male and an individual’s depression score, with ꞵ1 labeled and ꞵ0 not shown.

yi is the depression score for person i. ꞵ0 is the predicted depression score for person i if they are female (xi = 0). ꞵ1 is the effect of gender on depression score and is estimated to be 2. This is interpreted exactly as above: being male results in a 2-point increase in predicted depression score. Note that the graphic shown in Figure 7.1 gives no estimate of ꞵ0, the predicted depression score for a female. This is because ꞵ0 is generally not a value that researchers are focused on. This study is focused on predicting the effect of gender on depression score. It is not focused on predicting the actual depression score. If the predicted depression score is of interest, then the figure could be modified as seen in Figure 7.2.

Figure 7.2. A basic path diagram with the estimated ꞵ0 associated with the triangle shape

In Figure 7.2, a triangle with a 1 in the middle has been added, with an arrow pointing to the outcome (depression score). The 1 in a triangle symbolizes that this value is a constant for all subjects. The estimated value 23 shows the value for this constant. Thus, we would interpret the 23 as the predicted score for a person i who has the value 0 for all predictors. More complicated models might have more than one predictor. In this case there is only one predictor (Male), with the values 1 = male and 0 = female. With that extra information we can further contextualize the number 23 and say it is the predicted depression score for woman i. If a person is a man, then you take the 23 and add the effect of being male to it (in this case 2). Thus, the predicted depression score for a man is 25. This is how column 5 in Table 7.1 is calculated.

The linear regression equation in this case would be:

(depression score)i = 23 + 2*(male)i + 𝜀i .
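Substituting the two possible values of (male)i into the prediction part of this equation (the equation without the error term, written here with ŷi for the predicted score) reproduces the two predicted scores described above:

```latex
\begin{align*}
\hat{y}_i &= 23 + 2(0) = 23 && \text{female } (x_i = 0) \\
\hat{y}_i &= 23 + 2(1) = 25 && \text{male } (x_i = 1)
\end{align*}
```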

Note that in Figures 7.1 and 7.2 the error term is not represented. Generally the error term is considered a nuisance variable and thus can be safely omitted from the diagram. In such diagrams, every shape that receives an arrow is assumed to have an error term. If, for whatever reason, the error term is desired, the figure can be represented as in Figure 7.3.

Figure 7.3. A basic path diagram with the error component shown explicitly

The circle with 𝜀i in it represents the error component of the prediction. It is a circle because it is not directly observed like the variables male or depression score are, but must be calculated from the model.

Unstandardized and Standardized Coefficients

In Figure 7.3 and in the corresponding equation, the calculated or realized value of ꞵ1, in this case 2, is called the unstandardized beta or the unstandardized coefficient. An unstandardized coefficient is in the natural metric of both xi and yi. Thus, for every one-unit increase in Male, that is, when a subject is male instead of female, the predicted depression score increases by 2 depression score units. If yi were height in inches instead of depression score, then a subject who is considered male (their value is 1) would be predicted to be 2 inches taller than those who are not male (their value is 0).

An unstandardized coefficient is useful if the consumer of the information is familiar with the metrics involved. Thus, an expert in depression scores would know whether the value of 2 is considered large or important. For those who are not experts in the metric of depression score, another way to judge whether a coefficient is large or important is helpful. Thus, we come to standardized coefficients.

Standardized coefficients are created by changing the natural metric of yi and, when appropriate, xi to standard deviations. In the case of depression score and Male, it makes sense to rescale depression score to standard deviations. On the other hand, for the variable Male, standard deviations don’t make sense, as Male is dichotomous and can take on only two values (0 or 1). After transforming just yi into standard deviation units, the results are found in Figure 7.4.

Figure 7.4. A basic path diagram showing the relationship between being male and an individual’s depression score when depression score is standardized.

Now we would interpret the relationship as: males score .1 standard deviations higher than females on depression score, a small value. This is much more informative to consumers who are unfamiliar with the natural metric. Thankfully, SPSS automatically reports a standardized coefficient for us, as seen subsequently.
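One way to see where such a value comes from: dividing the outcome by its standard deviation before fitting divides the coefficient by the same amount. The sketch below uses made-up scores and Python's standard library (not SPSS), and it standardizes only the outcome, as described for Figure 7.4. SPSS's standardized coefficient is computed by rescaling both variables, so its value can differ from this outcome-only version.

```python
from statistics import mean, stdev

# Made-up data for illustration only (0 = female, 1 = male).
male = [0, 0, 0, 1, 1, 1]
score = [21, 23, 25, 22, 25, 28]


def slope(x, y):
    """Least-squares slope of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


# Unstandardized coefficient: in raw depression-score points.
b_unstd = slope(male, score)

# Rescale the outcome to standard deviation units (z-scores).
sd_y = stdev(score)  # sample standard deviation
y_bar = mean(score)
z_score = [(s - y_bar) / sd_y for s in score]

# Coefficient after standardizing only the outcome.
b_ystd = slope(male, z_score)

print(f"unstandardized coefficient:       {b_unstd:.3f}")
print(f"outcome-standardized coefficient: {b_ystd:.3f}")
print(f"unstandardized / sd of outcome:   {b_unstd / sd_y:.3f}")
```

The last two printed values agree, which is the point of the sketch. By the same logic, an unstandardized coefficient of 2 and a standardized value of .1 would together imply an outcome standard deviation of about 20 points, if both figures describe the same data.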

In this example, we’ll look at how to predict a person’s anxiety score based on their gender. The equation for this example is still:

yi = ꞵ0 + ꞵ1xi + 𝜀i

And the model is:

Click on this link to download the .sav file (an SPSS data file).

[Insert file]

Once you have downloaded this dataset, open it in SPSS. Now go to Analyze → Regression → Linear.

Figure 7.#. Screenshot of example dataset

In the GUI for linear regression, move the variable “Anxiety score on the DASS-21” to the ‘Dependent’ box, and move the variable “Gender” to the ‘Independent’ box, as seen in Figure #. Then click ‘OK’ to run your regression.

Figure 7.#. Graphical user interface (GUI) for linear regression in SPSS.

SPSS produces a series of tables when you run your regression. Look at the ‘Coefficients’ table to see the predicted effect of gender on anxiety score.

Figure 7.#. Coefficients table for the linear regression of anxiety score on the DASS-21 on gender

The output of the ‘Coefficients’ table tells us the following:

- ꞵ0 (the predicted value of the outcome when the predictor xi is 0) is 18.400.
- For males you would predict a 5.70 point decrease in anxiety score on the DASS-21.
- The probability of obtaining an unstandardized beta of -5.70 or more extreme when the null hypothesis is true is 15.9%.
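The first two values can be combined into a predicted score for each group, and the third can be compared against a significance threshold. The sketch below plugs the reported coefficients into the regression equation. It assumes gender is coded 0/1 with males as 1 (the coding of the example file is not stated here), and the .05 cutoff is a conventional choice, not something taken from the SPSS output.

```python
# Values read from the 'Coefficients' table described above.
BETA0 = 18.400    # predicted anxiety score when the predictor is 0
BETA1 = -5.70     # unstandardized coefficient for gender
P_VALUE = 0.159   # probability reported for the gender coefficient

ALPHA = 0.05      # assumed conventional significance threshold


def predict_anxiety(gender_code):
    """Predicted DASS-21 anxiety score; assumes gender is coded 0 or 1."""
    return BETA0 + BETA1 * gender_code


pred_group0 = predict_anxiety(0)
pred_group1 = predict_anxiety(1)
significant = P_VALUE < ALPHA

print(f"predicted score when gender = 0: {pred_group0:.2f}")
print(f"predicted score when gender = 1: {pred_group1:.2f}")
print(f"p < {ALPHA}? {significant}")
```

Under these assumptions, the 5.70 point difference would not count as statistically significant at the .05 level, because 15.9% is larger than 5%.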

This content is provided to you freely by EdTech Books.

Access it online or download it at https://edtechbooks.org/sem/linear_regression.