Timothy S. Vaughan

University of Wisconsin - Eau Claire

Kelly E. Berry

University of Wisconsin - Eau Claire

Journal of Statistics Education Volume 13, Number 1 (2005), jse.amstat.org/v13n1/vaughan.html

Copyright © 2005 by Timothy S. Vaughan and Kelly E. Berry, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.

**Key Words:** Multiple regression; Simulation

In contrast, our position is that some degree of multicollinearity is closer to being the rule rather than the exception, when data are not collected under a purposely-structured experimental design. This is particularly true when researchers have a poor understanding of multicollinearity, and hence fail to address the issue in the variable selection and model specification stage of a research study. Indeed, when presented with such an exercise, many students generate laundry lists of highly correlated or even entirely redundant predictor variables. Thus, a lack of understanding with regard to the implications of multicollinearity actually aggravates the problem itself.

The concept itself, moreover, is difficult to teach at an intuitive level. Allison (1999) suggests that typical users’ understanding of the concept is limited to “It’s bad” and “It has something to do with high correlation among the variables.” We find that well-motivated students can learn to describe the phenomenon, and apply basic diagnostic tools directed at identifying its presence. Beyond that, a solid conceptual understanding of the practical implications of multicollinearity seems to be difficult to instill.

One reason this concept is difficult for students to comprehend is that the problems arising from multicollinearity arise
with respect to the *sampling distributions* of regression coefficients, whereas the concept is typically addressed
in class with respect to a particular dataset. The typical manner of covering multicollinearity in statistics education is
to demonstrate the correlation among two or more predictor variables in a regression study, and then to cast doubt on
the validity of the resulting regression coefficients. This “single dataset” approach leads students to
attempt to answer the question, “How has this multicollinearity affected the regression coefficients on my computer output?”
as opposed to “How has this multicollinearity affected the sampling distribution of regression coefficients from which my
results were drawn?” While *identification* of multicollinearity can certainly be taught in this manner, truly
understanding the *implications* of multicollinearity requires attention to the effects on the sampling distribution
of regression coefficients.

It is of course possible to simply “tell” the students these effects, e.g. that multicollinearity increases the variance of the regression coefficients, leads to correlated errors in the regression coefficients themselves, etc. Positively demonstrating these effects, and thus instilling an intuitive level of understanding, is more challenging. Kennedy (2002) presents a method of using Ballentine Venn diagrams, which provides a conceptual framework for understanding the implications of multicollinearity.

This paper describes a Monte Carlo exercise that has been used in a second-semester statistics class, to provide a graphic demonstration of the effect multicollinearity has on the sampling distribution of regression coefficients. The basic approach is to generate a number of random datasets, for a scenario where the underlying relationships between the predictor variables and the response variable are known. For each sample, ordinary least squares regression is used to estimate the coefficients of the underlying relationship, using an appropriately specified model. Replication of this experiment generates a collection of regression coefficients which together demonstrate the nature of the underlying sampling distribution. A graphical display of the resulting coefficients under independent vs. correlated predictor variables provides a compelling depiction of the effects of multicollinearity. The basic exercise described here takes about 30 minutes. Additional explanation, side discussions, etc., will naturally impact the total class time expended.

Mills (2002) presents a thorough review and discussion of using computer simulation methods to teach statistical concepts, including regression. Although multicollinearity itself is not explicitly mentioned, there appears to be particular value in applying the technique to the teaching of this topic. The approach gives students the opportunity to observe the behavior of the regression coefficients, relative to the “true” relationship underlying the data.

The demonstration as described here is implemented using a Microsoft Excel spreadsheet. Of course, the technique could be implemented using any statistical package capable of generating independent and correlated samples, computing appropriate response variable values, and running the necessary regressions. Indeed, the entire process could be automated with an appropriately written macro for the software package of choice. While this approach would speed up the demonstration, we feel there is potential for reduced learning impact when the activities described here fall into a “black box”.

*y* = *β*_{0} + *β*_{1}*x*_{1} + *β*_{2}*x*_{2} + *β*_{3}*x*_{3} + *ε*    (1)

where *ε* is normally distributed with mean E[*ε*] = 0 and variance VAR[*ε*] = *σ*^{2}. As such, it is not difficult to ask them to envision a scenario in which:

*y* = *x*_{1} + *x*_{2} − *x*_{3} + *ε*    (2)

where *ε* is normally distributed with mean E[*ε*] = 0 and VAR[*ε*] = 1600. We thus have a situation in which the “true” relationship coefficients are known, e.g. *β*_{1} = 1, *β*_{2} = 1, *β*_{3} = −1. As with any Monte Carlo demonstration of an estimation technique, it is imperative to remind students that in practice we would not know these values; we present them only to compare our estimates with the “right” answers.

We display an Excel spreadsheet (a portion of which is shown in Figure 1) which
generates a sample of *n* = 100 random observations of (*x*_{1}, *x*_{2}, *x*_{3},
*ε*), and computes the corresponding value of *y* under the
relationship in equation (2). (In our demonstration, the *x*_{i} are generated as
independent observations from the uniform [0,100) distribution, then rounded to the nearest integer.) The random
generation process is demonstrated by hitting the F9 key a few times, and we verify that the *y*’s being computed
correctly reflect the underlying relationship.

Figure 1

**Figure 1.** Portion of spreadsheet used to generate data for the independent variables case.
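For readers working outside Excel, the data-generation step can be sketched in Python with NumPy. This is a sketch under the paper's stated settings (uniform [0,100) predictors rounded to integers, error variance 1600), not the authors' spreadsheet; the variable names and seed are ours.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # fixed seed so the demonstration is repeatable
n = 100  # observations per replication, as in the demonstration

# Independent predictors: uniform on [0, 100), rounded to the nearest integer
x1 = np.round(rng.uniform(0, 100, n))
x2 = np.round(rng.uniform(0, 100, n))
x3 = np.round(rng.uniform(0, 100, n))

# Response under the "true" relationship of equation (2):
# y = x1 + x2 - x3 + eps, with E[eps] = 0 and VAR[eps] = 1600 (std. dev. 40)
eps = rng.normal(0, 40, n)
y = x1 + x2 - x3 + eps
```

Re-running the script plays the role of pressing F9: each run draws a fresh dataset from the same known relationship.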

After a short side discussion reviewing the iterative process of model specification, parameter estimation, diagnostic checking, etc., we proceed to fit the “right” model

*ŷ* = *b*_{0} + *b*_{1}*x*_{1} + *b*_{2}*x*_{2} + *b*_{3}*x*_{3}    (3)

for the current dataset. The resulting predictor variable coefficients *b*_{1}, *b*_{2}, and *b*_{3} are then copied into a separate worksheet. This process is repeated 10 or more times, resulting in 10 or more different triples (*b*_{1}, *b*_{2}, *b*_{3}). (It is important that the students actually witness at least a few replications of the data generation/parameter estimation procedure. Depending on class time available, it is possible to have some of the replications done ahead of time, making it clear that these coefficient sets were produced under the same procedure.)

Before examining the regression coefficients further we point out the correlation matrix displayed in
Figure 1, which suggests the (*x*_{1}, *x*_{2},
*x*_{3}) being generated are independent of one another. The independence between *x*_{1} and
*x*_{2}, in particular, is graphically demonstrated with a scatter plot as shown in
Figure 1. (An instructor might, at this point, note the sampling variability in
the correlation coefficients. This provides an opportunity to review the idea that coefficients strictly equal to zero
will not be realized, despite the fact that *x*_{1} and *x*_{2} are independent.)

With independence between *x*_{1} and *x*_{2} firmly in mind, we turn to a scatter plot of the
various *b*_{1} and *b*_{2} as
depicted in Figure 2. (The worksheet into which the regression coefficients are
copied is already set up to generate the plot.) The grid lines plotted at
*b*_{1} = 1 and *b*_{2} = 1
form “crosshairs” on the true underlying parameters *β*_{1} = 1,
*β*_{2} = 1. As the respective average values of
*b*_{1} and *b*_{2} are
approximately equal to 1, our procedure is generating unbiased estimates of the “separate effects” of *x*_{1}
and *x*_{2} on *y*. Nonetheless we see a certain degree of estimation error around the true
*β*_{1} and *β*_{2}, with
*b*_{1} ranging from 0.63 to 1.27, and
*b*_{2} ranging from 0.75 to 1.14. (This is a good time to review the idea
that the term “effect of *x*_{i}” refers to the change in E[*y*] *associated with*, but not necessarily
*caused by*, a 1-unit increase in *x*_{i} with all other variables held constant.)

Figure 2

**Figure 2.** Regression coefficients produced by independent data.

Once this point is made, we turn to a second worksheet (shown in Figure 3) which
computes *y* values using equation (2) in a manner identical to that used in the first
sampling effort. We then point out the only difference between this sampling mechanism and the previous one, i.e. the
correlation between the *x*_{1} and *x*_{2} values reflected by the correlation matrix and
scatter plot. In our demonstration, each *x*_{2} value is generated from a uniform distribution with mean equal to the
corresponding *x*_{1} and range governed by a parameter *w*. The data in
Figure 3 are generated with *w* = 20; thus each *x*_{2} is
generated uniformly over the range *x*_{1} ± 10. The process of
generating 10 or more different observations of (*b*_{1},
*b*_{2}, *b*_{3}) is performed
with data generated by this worksheet, producing a display such as that shown in Figure 4.

Figure 3

**Figure 3.** Portion of spreadsheet used to generate correlated *x*_{1},
*x*_{2}.

Figure 4

**Figure 4.** Regression coefficients produced by correlated *x*_{1},
*x*_{2}.

The striking contrast between Figure 2 and Figure 4 forms the primary impact of the exercise. We first note that the average *b*_{1} and *b*_{2} are still approximately 1, implying the regression parameters remain unbiased estimates of the true relationship coefficients. In contrast, we then note the inflated variance of the individual coefficients, as *b*_{1} and *b*_{2} now range over ±1.3 from their expectations.

Finally, we draw attention to the manner in which (*b*_{1},
*b*_{2}) are no longer clustered around the point (1,1), but rather tend
toward the line *b*_{1} +
*b*_{2} = 2, in the vicinity of (1,1). Thus, the sampling errors in the (*b*_{1},
*b*_{2}) coefficients have themselves become correlated. To explain this, we
note we have induced a very simple type of correlation such that E[*x*_{2}] = *x*_{1}. As such,
*across many datasets* we realize a collection of *b*_{1} and
*b*_{2} for which *b*_{1} +
*b*_{2} ≈ 2. Under
any such pair, the expression *b*_{1}*x*_{1} +
*b*_{2}*x*_{2} adequately represents the net effect of
1*x*_{1} + 1*x*_{2} when *x*_{1} ≈
*x*_{2}. (Some simple algebraic substitution into (2) with
*x*_{1} strictly equal to *x*_{2} helps to clarify the point.) In short, our regression
coefficients are no longer reliably estimating the “separate effects” of *x*_{1} and *x*_{2},
but rather measure the *net effect* of *x*_{1} and *x*_{2} varying together in the manner depicted
in Figure 3.
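The algebraic substitution suggested above can be written out explicitly. In the limiting case where *x*_{2} is strictly equal to *x*_{1}, equation (2) collapses to a single-variable relationship, and any coefficient pair summing to 2 fits it equally well:

```latex
% Substituting x_2 = x_1 into equation (2):
y = x_1 + x_2 - x_3 + \varepsilon = 2x_1 - x_3 + \varepsilon .
% Any fitted pair (b_1, b_2) with b_1 + b_2 = 2 then reproduces the same fit, since
b_1 x_1 + b_2 x_2 = (b_1 + b_2)\,x_1 = 2x_1 \quad \text{when } x_2 = x_1 .
```

With *x*_{2} merely close to *x*_{1}, the data can discriminate between such pairs only weakly, which is exactly the variance inflation and coefficient correlation seen in Figure 4.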

In order to make the comparison of Figure 2 and
Figure 4 as compelling as possible, an extreme degree of collinearity between
*x*_{1} and *x*_{2} has been introduced. Time permitting, the latter sampling procedure is
repeated under less severe collinearity, generating a display similar to that shown in
Figure 5. (The results in Figure 5 were
generated with the *w* parameter discussed above equal to 60.) While the sample coefficients still tend toward the line
*b*_{1} + *b*_{2} = 2, the
variance inflation effect is reduced under weaker collinearity. Here *b*_{1}
ranges from 0.63 to 1.73, while *b*_{2} now ranges from 0.31 to 1.34. Having
witnessed the creation of Figure 2 and Figure 4,
it is not as critical for the students to witness the sampling involved in this latter case.

Figure 5

**Figure 5.** Regression coefficients under less extreme collinearity between *x*_{1},
*x*_{2}.

The exercise closes with a discussion of how multicollinearity may affect the utility of a multiple regression model. Here, we turn to the two uses of a regression model, i.e. explanation vs. prediction.

If the model is to be used for *explanation* (e.g., we are more interested in the
*β* parameters than the *ŷ*
values produced by the model), multicollinearity casts a shadow of doubt over the classic interpretation of
*β*_{i} as “the change in E[*y*] associated with a 1-unit increase in
*x*_{i}.” As we have seen above, multicollinearity reduces our ability to isolate the change in E[*y*]
associated with a change in any single *x*_{i}. This observation is used to introduce and motivate the concept of
experimental design, the next topic in the course in which the demonstration is used.

If the model is to be used for *prediction* (e.g. we are more interested in the
*ŷ* values produced by the model, and less concerned with the individual
*β* parameters), multicollinearity is not quite as damning. We do, however, need to base our predictions on the assumption
that the predictor variables continue to reflect any relationships present in the data used to fit the model. The scatter
plot of *x*_{2} vs. *x*_{1} from Figure 3 provides an
opportunity to present this idea as a special case of the more basic concept that a regression model should only be used
for prediction within the range of *x*’s actually observed. In this case, the basic idea extends to the region of
(*x*_{1}, *x*_{2}) actually observed, rather than simply the range of *x*_{1} taken
in union with the range of *x*_{2}.

One last useful point that can be made with this exercise is to clarify the distinction between multicollinearity and
interaction. (Students tend to confuse the concepts, especially when relying on a memorized definition as opposed to
conceptual understanding.) Returning to the scatter plot from Figure 3, the
multicollinearity in this exercise is the relationship between *x*_{1} and *x*_{2}. Interaction
between *x*_{1} and *x*_{2}, on the other hand, would exist if we had used the equation:

*y* = *x*_{1} + *x*_{2} − *x*_{3} + *β*_{12}*x*_{1}*x*_{2} + *ε*    (4)

with a non-zero *β*_{12} parameter to generate the *y* values.

It is important to note that if the Monte Carlo demonstration is conducted with a more general correlated case of the form
*x*_{2} ≈ *a* + *b*·*x*_{1}, the
variances of the predictor variables must be held constant across the two scenarios. Otherwise the scenario with the
larger predictor variable variance has an advantage in reducing the variance of the corresponding sample coefficient. Of
course, increasing the strength of the predictor variable correlation relative to the variance of the response variable
magnifies the contrast between the two scenarios.

Some instructors may wish to adopt a realistic setting for the demonstration, rather than simply referring to the variables
generically as *x*_{1}, *x*_{2}, *x*_{3}, and *y*. One anonymous reviewer
proposed an example in which the predictor variables are exam scores and class attendance, while the response variable is
the student’s score on a final exam. While the course in which this demonstration is used is replete with regression
examples grounded in realistic scenarios, for this demonstration we prefer the generic approach. This is partially due to
the difficulty in identifying a realistic example in which two predictor variables might be independent, and a few minutes
later are assumed to be correlated. Moreover, there is potential for student confusion in this change of assumption with
respect to a specific variable. One particular exception would be a scenario in which the correlated data arise due to
convenience sampling, while the independent observations are gathered under a controlled experimental design.

Our demonstration has used a sample size *n* = 100 observations per replication. Time permitting, the effect of
sample size on the results demonstrated above would be an interesting extension or out-of-class assignment. A smaller
sample size (say *n* = 20) would of course result in greater coefficient variation under both scenarios, but would
generate a similar if not more impressive contrast between the independent vs. correlated predictor variables case.
Interestingly, a larger number of replications is not required to see the contrast, despite the smaller sample size.

As a model specification exercise, we return to the independent-variables worksheet and fit the model

*ŷ* = *b*_{0} + *b*_{1}*x*_{1} + *b*_{3}*x*_{3}    (5)

i.e., we omit the *x*_{2} variable from the model specification. Replicating the procedure a few times, we
come to the conclusion that we are still getting unbiased estimates of the “effect of” *x*_{1} (e.g.
*b*_{1} ≈ 1) while the
missing *x*_{2} term is contributing to the error term of the model, as evidenced by a larger standard error
and smaller adjusted R^{2}.

When the same procedure is used with correlated *x*_{1} and *x*_{2} of the type depicted in
Figure 3, we naturally come to the conclusion that the
*b*_{1} parameter is reflecting the combined effect of *x*_{1}
and *x*_{2} (e.g. *b*_{1}
≈ 2); thus the missing *x*_{2} term is contributing to
the *b*_{1} parameter rather than to the error term. The bottom line of this
exercise is to recognize that, to the extent any “other variables” vary independently of the variables included in the
model, the effects of these variables contribute to the error term. To the extent that other variables are correlated
with the variables in our model, we must realize our regression coefficients measure the “net effect” of these simultaneous
changes, for each unit change in the variable we have measured and included in our model. This gives additional credence
to the regression instructors’ mantra that regression coefficients measure association, not cause and effect.
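The omitted-variable contrast can be sketched by fitting equation (5) — the model with *x*_{2} left out — under both sampling schemes. This is our NumPy sketch of that exercise, not the authors' spreadsheet procedure; seeds and names are ours.

```python
import numpy as np

rng = np.random.default_rng(seed=4)
n, reps = 100, 10

def fit_b1(correlated, w=20):
    """Fit y-hat = b0 + b1*x1 + b3*x3 (x2 omitted) and return b1."""
    x1 = np.round(rng.uniform(0, 100, n))
    if correlated:
        # x2 tracks x1, as in the Figure 3 worksheet
        x2 = np.round(rng.uniform(x1 - w / 2, x1 + w / 2))
    else:
        # x2 independent of x1, as in the Figure 1 worksheet
        x2 = np.round(rng.uniform(0, 100, n))
    x3 = np.round(rng.uniform(0, 100, n))
    y = x1 + x2 - x3 + rng.normal(0, 40, n)
    X = np.column_stack([np.ones(n), x1, x3])  # x2 deliberately left out
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b[1]

b1_indep = np.array([fit_b1(False) for _ in range(reps)])
b1_corr = np.array([fit_b1(True) for _ in range(reps)])

# Independent x2: b1 still averages near 1; x2's effect swells the error term.
# Correlated x2:  b1 averages near 2; it absorbs x2's effect.
print(b1_indep.mean(), b1_corr.mean())
```

The two averages make the "bottom line" concrete: an omitted variable that is independent of the included predictors inflates the error term, while a correlated one biases the surviving coefficient toward the combined effect.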

The model specification exercise could also be extended in a number of ways. For example, a “true” functional relationship in which a correlated predictor variable contributes nothing to the observed *y* values (e.g. *β*_{2} = 0 in the preceding example) provides additional fodder in this direction. Such an exercise could examine the merits of a model containing both variables, relative to a reduced model possibly having greater predictive power.

Allison, P. D. (1999), *Multiple Regression – A Primer*, Thousand Oaks, CA: Pine Forge Press.

Kennedy, P. E. (2002), “More on Venn Diagrams for Regression,” *Journal of Statistics Education* [Online], 10(1).
(jse.amstat.org/v10n1/kennedy.html)

Mills, J. D. (2002), “Using Computer Simulation Methods to Teach Statistics: A Review of the Literature,” *Journal of
Statistics Education* [Online], 10(1). (jse.amstat.org/v10n1/mills.html)

Timothy S. Vaughan

Management and Marketing

University of Wisconsin - Eau Claire

Schneider Hall 453

Eau Claire, Wisconsin 54702

U.S.A.
*VAUGHATS@uwec.edu*

Kelly E. Berry

Management and Marketing

University of Wisconsin - Eau Claire

Schneider Hall 423

Eau Claire, Wisconsin 54702

U.S.A.
*BERRYKE@uwec.edu*
