Using Monte Carlo Techniques to Demonstrate the Meaning and Implications of Multicollinearity

Timothy S. Vaughan
University of Wisconsin - Eau Claire

Kelly E. Berry
University of Wisconsin - Eau Claire

Journal of Statistics Education Volume 13, Number 1 (2005), jse.amstat.org/v13n1/vaughan.html

Copyright © 2005 by Timothy S. Vaughan and Kelly E. Berry, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.


Key Words: Multiple regression; Simulation

Abstract

This article presents an in-class Monte Carlo demonstration designed to illustrate for students the implications of multicollinearity in a multiple regression study. In the demonstration, students already familiar with multiple regression concepts are presented with a scenario in which the “true” relationship between the response and predictor variables is known. Two alternative data generation mechanisms are applied to this scenario, one in which the predictor variables are mutually independent, and another in which two predictor variables are correlated. A number of independent realizations of data samples are generated under each scenario, and the regression coefficients for an appropriately specified model are estimated with respect to each sample. Scatter-plots of the estimated regression coefficients under the two scenarios provide a clear visual demonstration of the effects of multicollinearity. The two scenarios are also used to examine the effects of model specification error.

1. Introduction

The issue of multicollinearity occupies a perplexing position in a comprehensive treatment of multiple regression modeling. The topic itself is a complex one, and as such is typically relegated to later sections or chapters in textbooks covering multiple regression. While this “back end” treatment is entirely appropriate, the unintended implication communicated to students is that multicollinearity is an “advanced topic”, addressing an infrequent scenario that won’t affect “casual” users of multiple regression analysis.

In contrast, our position is that some degree of multicollinearity is closer to being the rule rather than the exception, when data are not collected under a purposely-structured experimental design. This is particularly true when researchers have a poor understanding of multicollinearity, and hence fail to address the issue in the variable selection and model specification stage of a research study. Indeed, when presented with a variable selection exercise, many students generate laundry lists of highly correlated or even entirely redundant predictor variables. Thus, a lack of understanding with regard to the implications of multicollinearity actually aggravates the problem itself.

The concept itself, moreover, is difficult to teach at an intuitive level. Allison (1999) suggests that typical users’ understanding of the concept is limited to “It’s bad” and “It has something to do with high correlation among the variables.” We find that well-motivated students can learn to describe the phenomenon and to apply basic diagnostic tools for identifying its presence. Beyond that, a solid conceptual understanding of the practical implications of multicollinearity seems to be difficult to instill.

One reason this concept is difficult for students to comprehend is that the problems caused by multicollinearity arise with respect to the sampling distributions of regression coefficients, whereas the concept is typically addressed in class with respect to a particular dataset. The typical manner of covering multicollinearity in statistics education is to demonstrate the correlation among two or more predictor variables in a regression study, and then to cast doubt on the validity of the resulting regression coefficients. This “single dataset” approach leads students to attempt to answer the question, “How has this multicollinearity affected the regression coefficients on my computer output?” as opposed to “How has this multicollinearity affected the sampling distribution of regression coefficients from which my results were drawn?” While identification of multicollinearity can certainly be taught in this manner, truly understanding the implications of multicollinearity requires attention to the effects on the sampling distribution of regression coefficients.

It is of course possible to simply “tell” the students these effects, e.g. that multicollinearity increases the variance of the regression coefficients, leads to correlated errors in the regression coefficients themselves, etc. Positively demonstrating these effects, and thus instilling an intuitive level of understanding, is more challenging. Kennedy (2002) presents a method of using Ballentine Venn diagrams, which provides a conceptual framework for understanding the implications of multicollinearity.

This paper describes a Monte Carlo exercise that has been used in a second-semester statistics class, to provide a graphic demonstration of the effect multicollinearity has on the sampling distribution of regression coefficients. The basic approach is to generate a number of random datasets, for a scenario where the underlying relationships between the predictor variables and the response variable are known. For each sample, ordinary least squares regression is used to estimate the coefficients of the underlying relationship, using an appropriately specified model. Replication of this experiment generates a collection of regression coefficients which together demonstrate the nature of the underlying sampling distribution. A graphical display of the resulting coefficients under independent vs. correlated predictor variables provides a compelling depiction of the effects of multicollinearity. The basic exercise described here takes about 30 minutes. Additional explanation, side discussions, etc., will naturally impact the total class time expended.

Mills (2002) presents a thorough review and discussion of using computer simulation methods to teach statistical concepts, including regression. Although multicollinearity itself is not explicitly mentioned, there appears to be particular value in applying the technique to the teaching of this topic. The approach gives students the opportunity to observe the behavior of the regression coefficients, relative to the “true” relationship underlying the data.

The demonstration as described here is implemented using a Microsoft Excel spreadsheet. Of course, the technique could be implemented using any statistical package capable of generating independent and correlated samples, computing appropriate response variable values, and running the necessary regressions. Indeed, the entire process could be automated with an appropriately written macro for the software package of choice. While this approach would speed up the demonstration, we feel there is potential for reduced learning impact when the activities described here fall into a “black box”.

2. The Demonstration

The demonstration is given after the students have had significant exposure to the basics of multiple regression. In particular, the students are familiar with the underlying multiple regression model:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon        (1)

where ε is normally distributed with mean E[ε] = 0 and variance VAR[ε] = σ². As such, it is not difficult to ask them to envision a scenario in which:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon        (2)

where ε is normally distributed with mean E[ε] = 0 and VAR[ε] = 1600. We thus have a situation in which the “true” relationship coefficients are known, e.g. β1 = 1, β2 = 1, β3 = -1. As with any Monte Carlo demonstration of an estimation technique, it is imperative to remind students that in practice we would not know these values, and that they are presented only so that our estimates can be compared with the “right” answers.

We display an Excel spreadsheet (a portion of which is shown in Figure 1) which generates a sample of n = 100 random observations of (x1, x2, x3, ε), and computes the corresponding value of y under the relationship in equation (2). (In our demonstration, the xi are generated as independent observations from the uniform [0,100) distribution, then rounded to the nearest integer.) The random generation process is demonstrated by hitting the F9 key a few times, and we verify that the y’s being computed correctly reflect the underlying relationship.




Figure 1. Portion of spreadsheet used to generate data for the independent variables case.
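
For instructors who prefer a scripted version of this data-generation step, the following Python sketch reproduces it outside the spreadsheet. It is our illustration rather than part of the original demonstration: the use of numpy, the function name independent_sample, the random seed, and the intercept value beta0 = 0 are assumptions (the text specifies only the slopes 1, 1, -1 and VAR[ε] = 1600).

    import numpy as np

    rng = np.random.default_rng(1)       # seeded only so the sketch is reproducible

    n = 100                              # observations per replication
    beta0 = 0.0                          # intercept: an assumed value, not stated in the text
    beta = np.array([1.0, 1.0, -1.0])    # true slopes beta1, beta2, beta3 from equation (2)
    sigma = 40.0                         # error standard deviation, so VAR[eps] = 1600

    def independent_sample(n):
        """One dataset with mutually independent predictors, as in Figure 1."""
        X = np.round(rng.uniform(0, 100, size=(n, 3)))        # x1, x2, x3 ~ uniform[0, 100), rounded
        y = beta0 + X @ beta + rng.normal(0, sigma, size=n)   # equation (2)
        return X, y

    X, y = independent_sample(n)         # each call plays the role of pressing F9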


After a short side discussion reviewing the iterative process of model specification, parameter estimation, diagnostic checking, etc., we proceed to fit the “right” model

\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3        (3)

for the current dataset. The resulting predictor variable coefficients b1, b2, and b3 are then copied into a separate worksheet. This process is repeated 10 or more times, resulting in 10 or more different (b1, b2, b3) sets. (It is important that the students actually witness at least a few replications of the data generation/parameter estimation procedure. Depending on class time available, it is possible to have some of the replications done ahead of time, making it clear that these coefficient sets were produced under the same procedure.)
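
The estimation and replication steps can be scripted in the same spirit. The numpy-only sketch below, continuing the code above (fit_ols is our own helper, not part of the original spreadsheet), fits equation (3) by ordinary least squares and stacks the resulting coefficient sets for plotting.

    def fit_ols(X, y):
        """Least-squares coefficients (b0, b1, b2, b3) of y on an intercept plus the columns of X."""
        design = np.column_stack([np.ones(len(y)), X])
        coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
        return coefs

    reps = 10                            # number of replications witnessed in class
    coef_sets = np.array([fit_ols(*independent_sample(n))[1:] for _ in range(reps)])
    # coef_sets holds one (b1, b2, b3) row per replication; a scatter-plot of the
    # first two columns corresponds to Figure 2.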

Before examining the regression coefficients further, we point out the correlation matrix displayed in Figure 1, which suggests the (x1, x2, x3) values being generated are independent of one another. The independence between x1 and x2, in particular, is graphically demonstrated with a scatter plot as shown in Figure 1. (An instructor might, at this point, note the sampling variability in the correlation coefficients. This provides an opportunity to review the idea that sample correlation coefficients strictly equal to zero will not be realized, despite the fact that x1 and x2 are independent.)

With independence between x1 and x2 firmly in mind, we turn to a scatter-plot of the various b1 and b2 values, as depicted in Figure 2. (The worksheet into which the regression coefficients are copied is already set up to generate the plot.) The grid lines plotted at b1 = 1 and b2 = 1 form “crosshairs” on the true underlying parameters β1 = 1, β2 = 1. As the respective average values of b1 and b2 are approximately equal to 1, our procedure is generating unbiased estimates of the “separate effects” of x1 and x2 on y. Nonetheless we see a certain degree of estimation error around the true β1 and β2, with b1 ranging from 0.63 to 1.27, and b2 ranging from 0.75 to 1.14. (This is a good time to review the idea that the term “effect of xi” refers to the change in E[y] associated with, but not necessarily caused by, a 1-unit increase in xi with all other variables held constant.)




Figure 2. Regression coefficients produced by independent data.


Once this point is made, we turn to a second worksheet (shown in Figure 3) which computes y values using equation (2) in a manner identical to that used in the first sampling effort. We then point out the only difference between this sampling mechanism and the previous one, i.e. the correlation between the x1 and x2 values reflected by the correlation matrix and scatter plot. In our demonstration, each x2 value is generated from a uniform distribution with mean equal to the corresponding x1 value and range governed by a parameter w. The data in Figure 3 are generated with w = 20, thus each x2 is generated uniformly over the range x1 ± 10. The process of generating 10 or more different observations of (b1, b2, b3) is performed with data generated by this worksheet, producing a display such as that shown in Figure 4.
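
Under the stated scheme (x2 uniform with mean x1 and range w), the correlated worksheet can be mimicked as follows. Again this is a sketch continuing the code above, with correlated_sample a name of our own choosing.

    w = 20.0   # width of the uniform band around x1 (Figure 3 uses w = 20)

    def correlated_sample(n):
        """One dataset in which x2 is uniform on x1 +/- w/2, as in Figure 3."""
        x1 = np.round(rng.uniform(0, 100, size=n))
        x2 = np.round(x1 + rng.uniform(-w / 2, w / 2, size=n))   # E[x2 | x1] = x1
        x3 = np.round(rng.uniform(0, 100, size=n))
        y = beta0 + 1.0 * x1 + 1.0 * x2 - 1.0 * x3 + rng.normal(0, sigma, size=n)
        return np.column_stack([x1, x2, x3]), y

    coef_sets_corr = np.array([fit_ols(*correlated_sample(n))[1:] for _ in range(reps)])
    # A scatter-plot of the first two columns of coef_sets_corr corresponds to Figure 4.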




Figure 3. Portion of spreadsheet used to generate correlated x1, x2.




Figure 4. Regression coefficients produced by correlated x1, x2.


The striking contrast between Figure 2 and Figure 4 forms the primary impact of the exercise. We first note that the average b1 and b2 values are still approximately 1, implying the estimated coefficients remain unbiased estimates of the true relationship parameters. In contrast, we then note the inflated variance of the individual coefficients, as b1 and b2 now range as much as 1.3 from their expectations.

Finally, we draw attention to the manner in which the (b1, b2) pairs are no longer clustered around the point (1,1), but rather tend toward the line b1 + b2 = 2, in the vicinity of (1,1). Thus, the sampling errors in the (b1, b2) coefficients have themselves become correlated. To explain this, we note we have induced a very simple type of correlation such that E[x2] = x1. As such, across many datasets we realize a collection of b1 and b2 values for which b1 + b2 ≈ 2. Under any such pair, the expression b1x1 + b2x2 adequately represents the net effect of 1x1 + 1x2 when x1 ≈ x2. (Some simple algebraic substitution into (2), with x1 strictly equal to x2, helps to clarify the point.) In short, our regression coefficients are no longer reliably estimating the “separate effects” of x1 and x2, but rather measure the net effect of x1 and x2 varying together in the manner depicted in Figure 3.
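
One way to write out that substitution, assuming for simplicity that x2 is strictly equal to x1:

\[
b_1 x_1 + b_2 x_2 = (b_1 + b_2)\,x_1 = 2\,x_1 = 1\cdot x_1 + 1\cdot x_2
\qquad \text{whenever } b_1 + b_2 = 2 \text{ and } x_2 = x_1 ,
\]

so any pair on the line b1 + b2 = 2 reproduces the true net contribution of x1 and x2, and the data cannot distinguish among such pairs.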

In order to make the comparison of Figure 2 and Figure 4 as compelling as possible, an extreme degree of collinearity between x1 and x2 has been introduced. Time permitting, the latter sampling procedure is repeated under less severe collinearity, generating a display similar to that shown in Figure 5. (The results in Figure 5 were generated with the w parameter discussed above equal to 60.) While the sample coefficients still tend toward the line b1 + b2 = 2, the variance inflation effect is reduced under weaker collinearity. Here b1 ranges from 0.63 to 1.73, while b2 now ranges from 0.31 to 1.34. Having witnessed the creation of Figure 2 and Figure 4, it is not as critical for the students to witness the sampling involved in this latter case.
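
In the scripted version sketched earlier, this weaker-collinearity case amounts to widening the uniform band before repeating the replication loop:

    w = 60.0   # weaker collinearity: a wider band around x1 (the Figure 5 setting)
    coef_sets_weak = np.array([fit_ols(*correlated_sample(n))[1:] for _ in range(reps)])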




Figure 5. Regression coefficients under less extreme collinearity between x1, x2.


The exercise closes with a discussion of how multicollinearity may affect the utility of a multiple regression model. Here, we turn to the two uses of a regression model, i.e. explanation vs. prediction.

If the model is to be used for explanation (i.e., we are more interested in the parameters than in the values produced by the model), multicollinearity casts a shadow of doubt over the classic interpretation of βi as “the change in E[y] associated with a 1-unit increase in xi.” As we have seen above, multicollinearity reduces our ability to isolate the change in E[y] associated with a change in any single xi. This concept is used to introduce and motivate the concept of experimental design, the next topic in the course in which the demonstration is used.

If the model is to be used for prediction (i.e., we are more interested in the values produced by the model, and less concerned with the individual parameters), multicollinearity is not quite as damning. We do, however, need to base our predictions on the assumption that the predictor variables continue to reflect any relationships present in the data used to fit the model. The scatter plot of x2 vs. x1 from Figure 3 provides an opportunity to present this idea as a special case of the more basic concept that a regression model should only be used for prediction within the range of x’s actually observed. In this case, the basic idea extends to the region of (x1, x2) actually observed, rather than simply the range of x1 taken in union with the range of x2.

One last useful point that can be made with this exercise is to clarify the distinction between multicollinearity and interaction. (Students tend to confuse the concepts, especially when relying on a memorized definition as opposed to conceptual understanding.) Returning to the scatter plot from Figure 3, the multicollinearity in this exercise is the relationship between x1 and x2. Interaction between x1 and x2, on the other hand, would exist if we had used the equation:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \varepsilon        (4)

with a non-zero β4 parameter to generate the y values.

3. Alternatives and Extensions

As with any teaching material, there are numerous ways in which the basic demonstration outlined above could be modified or extended, depending on the learning goals adopted by the instructor and the amount of time one wishes to spend on the topic. It might be useful to ask students to anticipate the effect on model parameters if x2 ≈ 3 + 5x1 (for example) rather than the simpler x2 ≈ x1 correlation model used here. Rather than simulating, this issue can initially be addressed with some simple algebraic substitution for the case where x2 is strictly equal to 3 + 5x1, which helps to clear up any impression that the tendency toward b1 + b2 ≈ 2 is a general result. (In the more general case where x2 ≈ a + bx1, the regression coefficients (b1, b2) will tend toward a line passing through (β1 + bβ2, 0) and (0, β1/b + β2), in the vicinity of (β1, β2).)
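
For instructors who want the general-case algebra at hand, one way to present the substitution, assuming x2 is strictly equal to a + bx1:

\[
\beta_1 x_1 + \beta_2 x_2 = \beta_1 x_1 + \beta_2 (a + b x_1) = \beta_2 a + (\beta_1 + b \beta_2)\, x_1 ,
\qquad
b_1 x_1 + b_2 x_2 = b_2 a + (b_1 + b\, b_2)\, x_1 .
\]

With the constant terms absorbed by the intercept, any pair satisfying b1 + b·b2 = β1 + b·β2 fits the data equally well; this is precisely the line through (β1 + bβ2, 0) and (0, β1/b + β2), which passes through (β1, β2).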

It is important to note that if the Monte Carlo demonstration is conducted with a more general correlated case of the form x2 ≈ a + bx1, the variances of the predictor variables must be held constant across the two scenarios. Otherwise the scenario with the larger predictor variable variance has an advantage in reducing the variance of the corresponding sample coefficient. Of course, increasing the strength of the predictor variable correlation relative to the variance of the response variable magnifies the contrast between the two scenarios.

Some instructors may wish to adopt a realistic setting for the demonstration, rather than simply referring to the variables generically as x1, x2, x3, and y. One anonymous reviewer proposed an example in which the predictor variables are exam scores and class attendance, while the response variable is the student’s score on a final exam. While the course in which this demonstration is used is replete with regression examples grounded in realistic scenarios, for this demonstration we prefer the generic approach. This is partially due to the difficulty in identifying a realistic example in which two predictor variables might be independent, and a few minutes later are assumed to be correlated. Moreover, there is potential for student confusion in this change of assumption with respect to a specific variable. One particular exception would be a scenario in which the correlated data arise due to convenience sampling, while the independent observations are gathered under a controlled experimental design.

Our demonstration has used a sample size n = 100 observations per replication. Time permitting, the effect of sample size on the results demonstrated above would be an interesting extension or out-of-class assignment. A smaller sample size (say n = 20) would of course result in greater coefficient variation under both scenarios, but would generate a similar if not more impressive contrast between the independent vs. correlated predictor variables case. Interestingly, a larger number of replications is not required to see the contrast, despite the smaller sample size.

4. A Model Specification Exercise

An interesting follow-up to this demonstration is an exercise in model specification. Using independent x1, x2, x3 and y values computed using equation (2), we fit the model

\hat{y} = b_0 + b_1 x_1 + b_3 x_3        (5)

i.e., we omit the x2 variable from the model specification. Replicating the procedure a few times, we come to the conclusion that we are still getting unbiased estimates of the “effect of” x1 (i.e. b1 ≈ 1), while the missing x2 term is contributing to the error term of the model, as evidenced by a larger standard error and smaller adjusted R2.

When the same procedure is used with correlated x1 and x2 of the type depicted in Figure 3, we naturally come to the conclusion that the b1 parameter reflects the combined effect of x1 and x2 (i.e. b1 ≈ 2); thus the missing x2 term is contributing to the b1 parameter rather than to the error term. The bottom line of this exercise is to recognize that, to the extent any “other variables” vary independently of the variables included in the model, the effects of these variables contribute to the error term. To the extent that other variables are correlated with the variables in our model, we must realize that our regression coefficients measure the “net effect” of these simultaneous changes, for each unit change in the variable we have measured and included in our model. This gives additional credence to the regression instructors’ mantra that regression coefficients measure association, not cause and effect.
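
A minimal sketch of this specification exercise, continuing the Python code above (independent_sample, correlated_sample, and the other names defined there are our own), drops x2 from the design matrix and compares the two cases:

    def fit_without_x2(X, y):
        """Fit the misspecified model (5): y on an intercept, x1, and x3 only."""
        design = np.column_stack([np.ones(len(y)), X[:, 0], X[:, 2]])
        coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
        return coefs                                     # (b0, b1, b3)

    # Independent predictors: b1 stays near 1; the omitted x2 simply inflates the residual error.
    print("independent case:", fit_without_x2(*independent_sample(n)))

    # Correlated predictors (restore the strong-correlation setting of Figure 3):
    # b1 drifts toward 2, absorbing the effect of the omitted x2.
    w = 20.0
    print("correlated case:", fit_without_x2(*correlated_sample(n)))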

The model specification exercise could also be extended in a number of ways. For example, a “true” functional relationship in which a correlated predictor variable contributes nothing to the observed y values (e.g. β2 = 0 in the preceding example) provides additional fodder in this direction. Such an exercise could examine the merits of a model containing both variables, relative to a reduced model possibly having greater predictive power.

5. Conclusion

While simulated data should never entirely replace analysis of real datasets in statistics education, selective use of the technique can greatly enhance students’ understanding of certain complex topics. With regard to multicollinearity, the ability for students to actually witness the effect on the regression coefficients, vis-à-vis the “true” relationship parameters, has an impact that is difficult to replicate through other means.


References

Allison, P. D. (1999), Multiple Regression – A Primer, Thousand Oaks, CA: Pine Forge Press.

Kennedy, P. E. (2002), “More on Venn Diagrams for Regression,” Journal of Statistics Education [Online], 10(1). (jse.amstat.org/v10n1/kennedy.html)

Mills, J. D. (2002), “Using Computer Simulation Methods to Teach Statistics: A Review of the Literature,” Journal of Statistics Education [Online], 10(1). (jse.amstat.org/v10n1/mills.html)


Timothy S. Vaughan
Management and Marketing
University of Wisconsin - Eau Claire
Schneider Hall 453
Eau Claire, Wisconsin 54702
U.S.A.
VAUGHATS@uwec.edu

Kelly E. Berry
Management and Marketing
University of Wisconsin - Eau Claire
Schneider Hall 423
Eau Claire, Wisconsin 54702
U.S.A.
BERRYKE@uwec.edu

