Thomas L. Moore

Grinnell College

Journal of Statistics Education Volume 14, Number 1 (2006), jse.amstat.org/v14n1/datasets.moore.html

Copyright © 2006 by Thomas L. Moore, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.

**Key Words:** Controlling for a variable; Non-transitivity of positive correlation; Simpson’s paradox

The Movie and Video Guide by Leonard Maltin is an annual ratings guide to movies. While not all films ever made are in Maltin’s Guide, it does contain a very large number of movies covering the history of cinema. In this article, I discuss a dataset collected from the 1996 edition, which contained ratings on about 19,000 films.

I used Minitab to generate a simple random sample of 100 titles from the book. I recorded five variables on each movie sampled: the year the movie was released (Year), the running time of the movie in minutes (Length), the number of cast members listed (Cast), the rating that Maltin gave the movie on a rising scale of 1, 1.5, 2, ..., 4 (Rating), and the number of lines of description for the movie in the Guide (Description).

Pair of Variables | R | P-value |
---|---|---|
Length vs. Rating | 0.318 | 0.001 |
Year vs. Length | 0.509 | 0.000 |
Year vs. Rating | -0.148 | 0.143 |

Let’s explore the data further to see what is going on. Figure 2 shows a coded scatterplot of Rating against Year. We have defined a movie as *short* if its length is less than 90 minutes and as *long* if its length is 90 minutes or more. From the plot, we see that the long movies tend to be more recent than the short movies, **but** within each length category there is a fairly clear *negative* relationship between Year and Rating: more recent movies tend to be rated lower, and the negative correlations are now statistically significant. (See Table 2.) Length “masks” the negative relationship between Year and Rating: as Length increases, Year tends to increase, and the tendency of longer movies to get higher ratings offsets the tendency of more recent movies to get lower ratings.

Pair of Variables | R | P-value |
---|---|---|
Rating vs. Year (short movies) | -0.520 | 0.000 |
Rating vs. Year (long movies) | -0.280 | 0.033 |
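The P-values reported with these correlations can be checked by hand from each correlation r and its sample size n, using the usual test statistic t = r·sqrt((n − 2)/(1 − r²)) on n − 2 degrees of freedom. The following is a minimal sketch in Python, not the Minitab computation used for the article. It assumes n = 100 for the full sample and subgroup sizes of 42 short and 58 long movies (read from the counts in Table 4); the function names are made up for illustration, and the t distribution is integrated numerically to keep the example free of third-party libraries.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided P-value, 1 - P(-|t| < T < |t|), via Simpson's rule (steps must be even)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1 - 2 * total * h / 3)

def corr_test(r, n):
    """t statistic and two-sided P-value for H0: population correlation = 0."""
    df = n - 2
    t = r * math.sqrt(df / (1 - r * r))
    return t, two_sided_p(t, df)

# (label, r, n); the sample sizes are assumptions described in the text above
checks = [
    ("Length vs. Rating (all)", 0.318, 100),
    ("Year vs. Length (all)", 0.509, 100),
    ("Year vs. Rating (all)", -0.148, 100),
    ("Rating vs. Year (short movies)", -0.520, 42),
    ("Rating vs. Year (long movies)", -0.280, 58),
]
for label, r, n in checks:
    t, p = corr_test(r, n)
    print(f"{label:32s} r = {r:+.3f}  n = {n:3d}  t = {t:+.2f}  P = {p:.3f}")
```

Because the correlations are rounded to three decimals, a recomputed P-value may differ from the tabled one in the last digit.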

In an elementary course, even at the descriptive statistics level, I like this example because it illustrates the perils of aggregating data. I have also used this example when introducing multiple regression in a more advanced course. The two-predictor model estimates the relationship between Rating (our response variable) and Year, controlling for Length:

Rating = 24.6 - 0.0119 Year + 0.0124 Length

Predictor | Coef | SE Coef | T | P |
---|---|---|---|---|
Constant | 24.59 | 10.04 | 2.45 | 0.018 |
Year | -0.011856 | 0.005095 | -2.33 | 0.024 |
Length | 0.012407 | 0.006154 | 2.02 | 0.049 |

S = 0.6151, R-Sq = 14.2%, R-Sq(adj) = 11.0%

Compare this to the simple linear regression Rating = 13.5 - 0.00570 Year, where the slope estimate of -0.00570 has the confirmatory non-significant P-value of 0.143.

The students can see how our regression output corroborates what we have learned through the coded scatterplots and correlations computed previously: there is a statistically significant, negative relationship between Rating and Year, controlling for Length.
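For students who want to see the masking mechanism in isolation, a small simulation can help. The sketch below uses synthetic data, not the films data: the coefficients are invented to loosely resemble the fitted model, and every variable name is an assumption made for illustration. In it, Year pushes Length up, Length pushes Rating up, and Year pushes Rating down directly, so the simple slope of Rating on Year should sit closer to zero than the slope that controls for Length.

```python
import random

def sxy(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

def simple_slope(x, y):
    """Least-squares slope of y on x alone."""
    return sxy(x, y) / sxy(x, x)

def two_predictor_slopes(x1, x2, y):
    """Least-squares slopes of y on x1 and x2 jointly (normal equations, centered)."""
    s11, s22, s12 = sxy(x1, x1), sxy(x2, x2), sxy(x1, x2)
    s1y, s2y = sxy(x1, y), sxy(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Synthetic "movies": all numbers below are invented for the illustration.
rng = random.Random(2006)
n = 5000
year = [rng.uniform(1920, 1996) for _ in range(n)]
length = [60 + 0.6 * (y - 1920) + rng.gauss(0, 15) for y in year]
rating = [
    2.5 - 0.012 * (y - 1958) + 0.012 * (m - 85) + rng.gauss(0, 0.6)
    for y, m in zip(year, length)
]

b_simple = simple_slope(year, rating)
b_year, b_length = two_predictor_slopes(year, length, rating)
print(f"Rating on Year alone:        slope = {b_simple:+.4f}")
print(f"Rating on Year, with Length: slope = {b_year:+.4f}")
print(f"Length slope in that model:  slope = {b_length:+.4f}")
```

The same comparison on the real data is what the two Minitab fits above provide; the simulation only shows that the pattern is what one would expect when the third variable is positively related to both of the others.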

My favorite examples of Simpson’s paradox are summarized in Table 3. For example, in the Berkeley admissions data from Freedman, Pisani and Purves (1998), male applicants appear to have a higher rate of admission to graduate school than female applicants, but when we control for the graduate program, the men’s advantage disappears. Or in the Florida death sentence data from Witmer (1992), whites convicted of murder appear more likely to be given the death sentence, but when we control for the race of the victim, blacks are more likely to get the death sentence regardless of whether the victim is white or black. The reader can consult the references for the data and story for each example. The data for each example with an abbreviated description can be found at www.math.grinnell.edu/~mooret/reports/SimpsonExamples.pdf.

See the references for the complete data and the stories behind the data.

Subject | X | Y | Z | Reference |
---|---|---|---|---|
Berkeley admissions data | sex of applicant | accept or reject | grad program applied to | Freedman et al. 1998, pp. 17-20. |
Airlines on-time data | airline | on-time or late | airport location | Moore 2003, p. 143. |
Death sentence data | race of convicted murderer | death sentence: yes or no | race of murder victim | Witmer 1992, pp. 110-112. |
Comparing batting averages | person batting | hit or out | year of that at bat | Friedlander 1992, p. 845. |
Prenatal care | care status | infant mortality | clinic | Bishop, Fienberg and Holland 1975, pp. 41-42. |

We can create a Simpson’s paradox from the films data as follows. As above, use 90 minutes to define two categories of movie length: short movies run less than 90 minutes and long movies run 90 minutes or longer. Then define two categories of movie based on Year: 1965 or prior are called ‘old’ and 1966 or later are called ‘new.’ Finally, define ‘bad’ movies as those with ratings at or below 2.5 and ‘good’ movies as those with ratings 3 or above. Based upon these definitions, we obtain a Simpson’s paradox, as Table 4 illustrates.

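Among all movies, new movies are slightly more likely than old movies to be good (32.0% versus 30.0%). But this comparison reverses itself when controlling for movie length (i.e., when disaggregating into short and long movies).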

Short Movies | bad | good | good% | Long Movies | bad | good | good% | All Movies | bad | good | good% |
---|---|---|---|---|---|---|---|---|---|---|---|
new | 7 | 0 | 0.0% | new | 27 | 16 | 37.2% | new | 34 | 16 | 32.0% |
old | 29 | 6 | 17.1% | old | 6 | 9 | 60.0% | old | 35 | 15 | 30.0% |
 | | | | | | | | total | 69 | 31 | 31.0% |
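The reversal can be verified directly from the counts in Table 4. The short Python sketch below recomputes the percentage of good movies within each length group and in the aggregate; the dictionary layout and function names are illustrative choices, while the counts themselves are those in the table.

```python
# (bad, good) counts from Table 4, by length group and era
counts = {
    "short": {"new": (7, 0), "old": (29, 6)},
    "long": {"new": (27, 16), "old": (6, 9)},
}

def pct_good(bad, good):
    """Percentage of movies rated 3 or above."""
    return 100 * good / (bad + good)

def aggregate(era):
    """Pool the short and long movies for one era."""
    bad = sum(counts[group][era][0] for group in counts)
    good = sum(counts[group][era][1] for group in counts)
    return bad, good

for group in counts:
    new_pct = pct_good(*counts[group]["new"])
    old_pct = pct_good(*counts[group]["old"])
    print(f"{group:5s} movies: new {new_pct:5.1f}% good, old {old_pct:5.1f}% good")

new_all, old_all = pct_good(*aggregate("new")), pct_good(*aggregate("old"))
print(f"all   movies: new {new_all:5.1f}% good, old {old_all:5.1f}% good")

# Simpson's paradox: old beats new within every group, yet new beats old overall
reversal = new_all > old_all and all(
    pct_good(*counts[g]["new"]) < pct_good(*counts[g]["old"]) for g in counts
)
print("Reversal present:", reversal)
```

The arithmetic also shows why the reversal occurs: most new movies are long (43 of 50) and long movies are more often good, so the aggregate comparison mixes the effect of era with the effect of length.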

Not every choice of break points defining the categories will lead to an instance of Simpson’s paradox. Simpson’s paradox requires, by definition, an actual reversal in the relationship when controlling for the third variable, but I like to tell my students that the important point in studying Simpson’s paradox is not just that reversals can happen, but that with observational data, relationships that look one way when aggregated can look quite different when disaggregated by a third variable. Calling this more general effect a “Simpson-like paradox,” I tell students that “Simpson’s paradox happens” and “Simpson-like paradoxes happen a lot.” Among famous paradoxes they have studied, Simpson’s may be one they encounter with some frequency in their later lives.

Sampling. How does one take a simple random sample of movies? This question provides lessons in confronting practical
sampling issues in a simple, yet real setting. I sampled by having Minitab choose random (page number, item number) pairs.
For example, the pair (1083, 3) would lead to the third movie listed on page 1083 of the *Guide*. To make the sample
proper, one needs an upper bound on the number of items on a given page, which is admittedly a bit ad hoc. When the page
selected contains fewer items than the item number selected, you ignore that random pair; so for a SRS of 100 you may need
to select a few more than 100 random pairs. It takes some thought to convince oneself that all samples of 100 films have
an equal probability of being selected under this scheme. The reason for selecting pairs is for convenience, as it would
be prohibitive to number all 19,000 movies consecutively.
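The (page number, item number) scheme described above is a form of rejection sampling, and it can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the Minitab procedure actually used: the page counts below are invented (a toy guide of 1,500 pages with 8 to 16 movies each), and `sample_titles` is a hypothetical helper name.

```python
import random

def sample_titles(items_on_page, max_items, k, rng):
    """Draw k distinct (page, item) pairs, each existing movie equally likely.

    items_on_page[p - 1] is the number of movies on page p;
    max_items must be at least as large as every page's count.
    """
    if max(items_on_page) > max_items:
        raise ValueError("max_items must be an upper bound on items per page")
    if k > sum(items_on_page):
        raise ValueError("cannot sample more movies than the guide contains")
    chosen = set()
    draws = 0
    while len(chosen) < k:
        draws += 1
        page = rng.randint(1, len(items_on_page))
        item = rng.randint(1, max_items)
        if item > items_on_page[page - 1]:
            continue  # no such movie on this page: ignore the pair
        chosen.add((page, item))  # a repeated pair leaves the set unchanged
    return sorted(chosen), draws

# Toy guide with invented page counts (assumption, not the real Guide)
rng = random.Random(1996)
items_on_page = [rng.randint(8, 16) for _ in range(1500)]
sample, draws = sample_titles(items_on_page, max_items=16, k=100, rng=rng)
print(f"{len(sample)} movies selected using {draws} random pairs")
print("first few:", sample[:3])
```

Every existing (page, item) pair has the same chance, 1/(pages × max_items), of being hit on any one draw, which is why discarding the invalid pairs leaves all movies equally likely. Without the common upper bound, movies on sparsely filled pages would be overrepresented.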

Identifying outliers. We can see one clear outlier in Figure 1: the movie with a **** rating that runs less than 50 minutes. Identifying this outlier serves, at least symbolically, to make the point that outliers are often the most interesting cases in a dataset. The movie in question is “Sherlock, Jr.,” the 1924 Buster Keaton classic, which Maltin describes as a “sublime study of film and fantasy, which has undoubtedly influenced countless filmmakers.” But does Keaton’s classic influence our correlations? Minus the outlier, the correlation between Rating and Length rises from .318 to .408, but the outlier has no qualitative effect on the paradoxes described above.

EDA for a single variable. Of interest to me, and probably to any user of the Guide, would be the distribution of Rating. For example, I tended to assume that a rating of *** or better was a good movie and that ***1/2 or **** movies were rare. But one doesn’t know this until one looks. Figure 3 shows the distribution of Ratings. Only 31 of the 100 movies had ratings of *** or higher and only 7 had ratings of ***1/2 or ****.

Confidence intervals. Given that we have a SRS from a population, one can ask students to compute confidence intervals for parameters of interest. For example, one could compute a confidence interval for the mean rating: the mean is 2.33 with a 95% confidence interval of 2.19 to 2.47. This assumes that we can treat Rating as a quantitative variable, an issue you can discuss in class as well. We might choose a confidence interval more relevant to the discussion above. For example, 31% of movies in the sample have ratings of 3 or above, with a 95% confidence interval of 22% to 41%. (This is the classical Wald interval; the “plus four” interval gives 31.7% with a confidence interval of 22.8% to 40.6%.)
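As a classroom aid, the two proportion intervals can be computed from the formulas with nothing more than the Python standard library. The sketch below is a minimal illustration for 31 successes in 100 trials; the function names are made up, and the results can differ in the last digit from the figures quoted above, depending on rounding and on the method a given package uses by default.

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(successes, n, level=0.95):
    """Estimate and Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p_hat = successes / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, p_hat - margin, p_hat + margin

def plus_four_interval(successes, n):
    """95% 'plus four' interval: add 2 successes and 2 failures, then use Wald."""
    return wald_interval(successes + 2, n + 4, level=0.95)

for name, (p, lo, hi) in [
    ("Wald", wald_interval(31, 100)),
    ("Plus four", plus_four_interval(31, 100)),
]:
    print(f"{name:9s} estimate = {100 * p:.1f}%  95% CI = {100 * lo:.1f}% to {100 * hi:.1f}%")
```

Students can rerun the same functions with the counts for short or long movies to see how much wider the intervals become for the smaller subgroups.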

Other relationships. One can also look at other bivariate relationships. For example both Cast and Description show statistically significant, positive correlations with Rating. There are plausible explanations for these, which would make good class discussion or exercises.

Freedman, D., Pisani, R., and Purves, R. (1998), *Statistics* (3rd ed.), New York, NY: W.W. Norton and Company.

Friedlander, R. J. (1992), “Ol’ Abner Has Done it Again,” *American Mathematical Monthly,* 99(9), 845.

Langford, E., Schwertman, N., and Owens, M. (2001), “Is the Property of Being Positively Correlated Transitive?” *The American Statistician,* 55, 322-325.

Maltin, L. (1996), *Leonard Maltin’s 1996 Movie and Video Guide*, New York, NY: Penguin Books.

Moore, D. S. (2003), *The Basic Practice of Statistics* (3rd ed.), New York, NY: W.H. Freeman.

Witmer, J. A. (1992), *Data Analysis: An Introduction*, Englewood Cliffs, NJ: Prentice-Hall.

Thomas L. Moore

Department of Mathematics and Statistics

Grinnell College

Grinnell, IA

U.S.A.
*mooret@grinnell.edu*
