Thomas L. Moore
Grinnell College
Journal of Statistics Education Volume 14, Number 1 (2006), jse.amstat.org/v14n1/datasets.moore.html
Copyright © 2006 by Thomas L. Moore, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.
Key Words: Controlling for a variable; Non-transitivity of positive correlation; Simpson’s paradox
I selected a simple random sample of 100 movies from the Movie and Video Guide (1996), by Leonard Maltin. My intent was to obtain some basic information on the population of roughly 19,000 movies through a small sample. In exploring the data, I discovered that it exhibited two paradoxes about a three-variable relationship: (1) a non-transitivity paradox for positive correlation, and (2) Simpson’s paradox. Giving concrete examples of these two paradoxes in an introductory course gives students a sense of the nuances involved in describing associations in observational studies.

The Movie and Video Guide by Leonard Maltin is an annual ratings guide to movies. While not all films ever made are in Maltin’s Guide, it does contain a very large number of movies covering the history of cinema. In this article, I discuss a dataset collected from the 1996 edition, which contained ratings on about 19,000 films.
I used Minitab to generate a simple random sample of 100 titles from the book. I recorded 5 variables on each movie sampled: The year the movie was released (Year), the running time of the movie in minutes (Length), the number of cast members listed (Cast), the rating that Maltin gave the movie on a rising scale of 1, 1.5, 2, ..., 4 (Rating), and the number of lines of description for the movie in the Guide (Description).
Pair of Variables | R | P-value |
---|---|---|
Length vs. Rating | 0.318 | 0.001 |
Year vs. Length | 0.509 | 0.000 |
Year vs. Rating | -0.148 | 0.143 |
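A short numerical check makes the non-transitivity concrete. For any three variables, two of the pairwise correlations limit the third, because the 3x3 correlation matrix must be positive semidefinite. The sketch below applies that general bound to the values in the table above; the function name `third_correlation_bounds` is a hypothetical label chosen for illustration, not something taken from the original analysis.

```python
import math


def third_correlation_bounds(r_ab: float, r_bc: float) -> tuple[float, float]:
    """Range of r_ac that is mathematically possible once r_ab and r_bc are fixed.

    The bound follows from requiring the 3x3 correlation matrix to be
    positive semidefinite (its determinant must be non-negative).
    """
    center = r_ab * r_bc
    half_width = math.sqrt((1 - r_ab**2) * (1 - r_bc**2))
    return center - half_width, center + half_width


# Correlations copied from the table above: A = Rating, B = Length, C = Year.
r_rating_length = 0.318
r_length_year = 0.509
r_rating_year = -0.148

low, high = third_correlation_bounds(r_rating_length, r_length_year)
print(f"Possible range for r(Rating, Year): {low:.3f} to {high:.3f}")
print("Observed value lies in the range:", low <= r_rating_year <= high)
print("Range includes negative values:", low < 0)
```

Because the two positive correlations are only moderate, the possible range for the third correlation extends well below zero, so a negative Year vs. Rating correlation involves no contradiction.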
Let’s explore the data further to see what is going on. Figure 2 shows a coded scatterplot of Rating against Year. We have defined a movie as short if its length is less than 90 minutes and as long if its length is 90 minutes or more. From the plot, we see that the long movies tend to be more recent than the short movies, but within each length category there is a fairly clear negative relationship between Year and Rating: more recent movies tend to be rated lower, and the negative correlations are now statistically significant (see Table 2). Length “masks” the negative relationship between Year and Rating: as Length increases, Year tends to increase, and the tendency of longer movies to get higher ratings offsets the tendency of more recent movies to get lower ratings.
Pair of Variables | R | P-value |
---|---|---|
Rating vs. Year (short movies) | -0.520 | 0.000 |
Rating vs. Year (long movies) | -0.280 | 0.033 |
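The masking effect can also be seen without splitting the data, using only the three correlations in Table 1. The sketch below computes a first-order partial correlation between Year and Rating with Length held fixed. This is a different technique from the short/long split used above, offered here only as a cross-check, and the function name is a hypothetical label.

```python
import math


def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Correlations from Table 1: x = Year, y = Rating, z = Length.
r_year_rating = -0.148
r_year_length = 0.509
r_rating_length = 0.318

r_partial = partial_correlation(r_year_rating, r_year_length, r_rating_length)
print(f"r(Year, Rating)          = {r_year_rating:.3f}")
print(f"r(Year, Rating | Length) = {r_partial:.3f}")
```

The partial correlation is noticeably more negative than the raw correlation, which agrees in direction with the within-category correlations in Table 2.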
In an elementary course, even at the descriptive statistics level, I like this example because it illustrates the perils of aggregating data. I have also used this example when introducing multiple regression in a more advanced course. The two-predictor model estimates the relationship between Rating (our response variable) and Year, controlling for Length:
Rating = 24.6 - 0.0119 Year + 0.0124 Length

Predictor | Coef | SE Coef | T | P |
---|---|---|---|---|
Constant | 24.59 | 10.04 | 2.45 | 0.018 |
Year | -0.011856 | 0.005095 | -2.33 | 0.024 |
Length | 0.012407 | 0.006154 | 2.02 | 0.049 |

S = 0.6151, R-Sq = 14.2%, R-Sq(adj) = 11.0%
Compare this to the simple linear regression Rating = 13.5 - 0.00570 Year, where the slope estimate of -0.00570 has the confirmatory non-significant P-value of 0.143.
The students can see how our regression output corroborates what we have learned through the coded scatterplots and correlations computed previously: there is a statistically significant, negative relationship between Rating and Year, controlling for Length.
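For students reading regression output for the first time, it can help to verify that each T value is simply the coefficient divided by its standard error. The sketch below does this with the numbers copied from the two-predictor output above; the dictionary layout and function name are illustrative choices only.

```python
# (Coef, SE Coef, reported T) copied from the two-predictor regression output.
regression_output = {
    "Constant": (24.59, 10.04, 2.45),
    "Year": (-0.011856, 0.005095, -2.33),
    "Length": (0.012407, 0.006154, 2.02),
}


def t_ratio(coef: float, se: float) -> float:
    """T statistic for testing that a coefficient equals zero."""
    return coef / se


computed = {}
for name, (coef, se, reported_t) in regression_output.items():
    computed[name] = t_ratio(coef, se)
    print(f"{name:8s} T = {computed[name]:6.2f} (reported {reported_t:6.2f})")
```

The sign of each T value matches the sign of its coefficient: negative for Year and positive for Length, which is the pattern seen in the coded scatterplot.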
My favorite examples of Simpson’s paradox are summarized in Table 3. For example, in the Berkeley admissions data from Freedman, Pisani and Purves (1998), male applicants appear to have a higher rate of admission to graduate school than female applicants, but when we control for the graduate program, the men’s advantage disappears. Or, in the Florida death sentence data from Witmer (1992), whites convicted of murder appear more likely to be given the death sentence, but when we control for the race of the victim, blacks are more likely to get the death sentence regardless of whether the victim is white or black. The reader can consult the references for the data and story for each example. The data for each example, with an abbreviated description, can be found at www.math.grinnell.edu/~mooret/reports/SimpsonExamples.pdf.
Subject | X | Y | Z | Reference |
---|---|---|---|---|
Berkeley admissions data | sex of applicant | accept or reject | grad program applied to | Freedman, et al. 1998, pp 17-20. |
Airlines on-time data | airline | on-time or late | airport location | Moore 2003, p 143. |
Death sentence data | race of convicted murderer | death sentence: yes or no | race of murder victim | Witmer 1992, pp 110-112. |
Comparing batting averages | person batting | hit or out | year of that at bat | Friedlander 1992, p 845. |
Prenatal care | care status | infant mortality | clinic | Bishop, Fienberg and Holland 1975, pp 41-42. |
We can create a Simpson’s paradox from the films data as follows. As above, use 90 minutes to define two categories of movie length: short movies run less than 90 minutes and long movies run 90 minutes or longer. Then define two categories of movie based on Year: 1965 or prior are called ‘old’ and 1966 or later are called ‘new.’ Finally, define ‘bad’ movies as those with ratings at or below 2.5 and ‘good’ movies as those with ratings 3 or above. Based upon these definitions, we obtain a Simpson’s paradox, as Table 4 illustrates.
Short Movies | bad | good | good% | Long Movies | bad | good | good% | All Movies | bad | good | good% |
---|---|---|---|---|---|---|---|---|---|---|---|
new | 7 | 0 | 0.0% | new | 27 | 16 | 37.2% | new | 34 | 16 | 32.0% |
old | 29 | 6 | 17.1% | old | 6 | 9 | 60.0% | old | 35 | 15 | 30.0% |
 | | | | | | | | total | 69 | 31 | 31.0% |
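The reversal in Table 4 can be checked directly from the counts. The sketch below recomputes the percentage of good movies within each length category and for all movies combined; the variable and function names are hypothetical, and only the counts come from the table.

```python
# (bad, good) counts copied from Table 4, keyed by (length category, age category).
counts = {
    ("short", "new"): (7, 0),
    ("short", "old"): (29, 6),
    ("long", "new"): (27, 16),
    ("long", "old"): (6, 9),
}


def good_rate(bad: int, good: int) -> float:
    """Proportion of movies rated 3 or above."""
    return good / (bad + good)


def combined(age: str) -> tuple[int, int]:
    """Add the short and long counts for one age category."""
    bad = sum(b for (_, a), (b, _g) in counts.items() if a == age)
    good = sum(g for (_, a), (_b, g) in counts.items() if a == age)
    return bad, good


# Within each length category: do old movies have the higher good%?
old_better_within = {
    length: good_rate(*counts[(length, "old")]) > good_rate(*counts[(length, "new")])
    for length in ("short", "long")
}

# For all movies combined: do new movies have the higher good%?
rate_new_all = good_rate(*combined("new"))
rate_old_all = good_rate(*combined("old"))
new_better_overall = rate_new_all > rate_old_all

is_simpson = all(old_better_within.values()) and new_better_overall

for (length, age), (bad, good) in counts.items():
    print(f"{length:5s} {age}: good% = {100 * good_rate(bad, good):.1f}")
print(f"all   new: good% = {100 * rate_new_all:.1f}")
print(f"all   old: good% = {100 * rate_old_all:.1f}")
print("Simpson's paradox:", is_simpson)
```

Old movies have the higher percentage of good ratings among short movies and among long movies, yet new movies have the slightly higher percentage overall, because most new movies are long and most old movies are short.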
Not every choice of break points for defining the categories will lead to an instance of Simpson’s paradox. Simpson’s paradox requires, by definition, an actual reversal in the relationship when controlling for the third variable, but I like to tell my students that the important point in studying Simpson’s paradox is not just that reversals can happen, but that with observational data, relationships that look one way when aggregated can look quite different when disaggregated by a third variable. Calling this more general effect a “Simpson-like paradox,” I tell students that “Simpson’s paradox happens” and “Simpson-like paradoxes happen a lot.” Among the famous paradoxes they have studied, Simpson’s may be one they encounter with some frequency in their later lives.
Sampling. How does one take a simple random sample of movies? This question provides lessons in confronting practical sampling issues in a simple, yet real setting. I sampled by having Minitab choose random (page number, item number) pairs. For example, the pair (1083, 3) would lead to the third movie listed on page 1083 of the Guide. To make the sample proper, one needs an upper bound on the number of items on a given page, which is admittedly a bit ad hoc. When the page selected contains fewer items than the item number selected, you ignore that random pair; so for an SRS of 100 you may need to select a few more than 100 random pairs. It takes some thought to convince oneself that all samples of 100 films have an equal probability of being selected under this scheme. Pairs are selected for convenience, as it would be prohibitive to number all 19,000 movies consecutively.
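The (page number, item number) scheme can be sketched in a few lines of code. The version below is a minimal illustration under stated assumptions, not a record of the Minitab commands actually used: the page layout `toy_guide` is made up, the function name is hypothetical, and a repeated pair is assumed to be discarded so that the sample is drawn without replacement.

```python
import random


def sample_page_item_pairs(items_on_page, max_items, n, seed=None):
    """Draw n distinct (page, item) pairs by rejection sampling.

    items_on_page: dict mapping page number -> number of movies on that page.
    max_items: upper bound on the number of movies on any single page.
    """
    if max(items_on_page.values()) > max_items:
        raise ValueError("max_items must be at least the largest page count")
    if n > sum(items_on_page.values()):
        raise ValueError("cannot sample more movies than the guide contains")

    rng = random.Random(seed)
    pages = list(items_on_page)
    chosen = set()
    while len(chosen) < n:
        page = rng.choice(pages)         # every page equally likely
        item = rng.randint(1, max_items) # every item slot equally likely
        if item > items_on_page[page]:
            continue                     # no such movie on this page: ignore the pair
        chosen.add((page, item))         # assumption: a repeated pair is ignored
    return sorted(chosen)


# Hypothetical four-page guide, for illustration only.
toy_guide = {1: 12, 2: 15, 3: 9, 4: 14}
sample = sample_page_item_pairs(toy_guide, max_items=15, n=10, seed=1)
print(sample)
```

Every existing (page, item) pair has the same chance of being produced on each draw, regardless of how full its page is, which is why rejecting the invalid pairs leaves all movies equally likely to be chosen.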
Identifying outliers. We can see one clear outlier in Figure 1: the movie with a **** rating that runs less than 50 minutes. Identifying this outlier serves, at least symbolically, to make the point that outliers are often the most interesting cases in a dataset. The movie in question is “Sherlock, Jr.,” the 1924 Buster Keaton classic, which Maltin describes as a “sublime study of film and fantasy, which has undoubtedly influenced countless filmmakers.” But does Keaton’s classic influence our correlations? Minus the outlier, the correlation between Rating and Length rises from .318 to .408, but the outlier has no qualitative effect on the paradoxes described above.
EDA for a single variable. Of interest to me, and probably to any user of the Guide, would be the distribution of Rating. For example, I tended to assume that a rating of *** or better was a good movie and that ***1/2 or **** movies were rare. But one doesn’t know this until one looks. Figure 3 shows the distribution of Ratings. Only 31 of the 100 movies had ratings of *** or higher and only 7 had ratings of ***1/2 or ****.
Confidence intervals. Given that we have an SRS from a population, one can ask students to compute confidence intervals for parameters of interest. For example, one could compute a confidence interval for the mean rating: the mean is 2.33 with a 95% confidence interval of 2.19 to 2.47. This assumes that we can treat Rating as a quantitative variable, an issue you can discuss in class as well. We might choose a confidence interval more relevant to the discussion above. For example, 31% of movies in the sample have ratings of 3 or above, with a 95% confidence interval of 22% to 41%. (This is the classical Wald interval; the “plus four” interval gives 31.7% with a confidence interval of 22.8% to 40.6%.)
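Students can compute both proportion intervals by hand or with a few lines of code. The sketch below uses the sample counts (31 good movies out of 100) and the usual critical value of 1.96; the function names are hypothetical. The results should land close to the figures quoted above, though small differences in the last digit are possible depending on the critical value and rounding a given package uses.

```python
import math

Z_95 = 1.96  # approximate critical value for 95% confidence


def wald_interval(successes: int, n: int, z: float = Z_95):
    """Classical large-sample interval: p-hat +/- z * sqrt(p-hat(1 - p-hat)/n)."""
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, p_hat - margin, p_hat + margin


def plus_four_interval(successes: int, n: int, z: float = Z_95):
    """Wald formula applied after adding two successes and two failures."""
    return wald_interval(successes + 2, n + 4, z)


wald = wald_interval(31, 100)
plus_four = plus_four_interval(31, 100)

for label, (estimate, lower, upper) in (("Wald", wald), ("Plus four", plus_four)):
    print(f"{label:9s} estimate = {100 * estimate:.1f}%, "
          f"95% CI = {100 * lower:.1f}% to {100 * upper:.1f}%")
```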
Other relationships. One can also look at other bivariate relationships. For example, both Cast and Description show statistically significant, positive correlations with Rating. There are plausible explanations for these, which would make good class discussion or exercises.
Freedman, D., Pisani, R., and Purves, R. (1998), Statistics (3rd ed.), New York, NY: W.W. Norton and Company.
Friedlander, R. J. (1992), “Ol’ Abner Has Done it Again,” American Mathematical Monthly, 99(9), 845.
Langford, E., Schwertman, N., and Owens, M. (2001), “Is the Property of Being Positively Correlated Transitive?” The American Statistician, 55, 322-325.
Maltin, L. (1996), Leonard Maltin’s 1996 Movie and Video Guide, New York, NY: Penguin Books.
Moore, D. S. (2003), The Basic Practice of Statistics (3rd ed.), New York, NY: W.H. Freeman.
Witmer, J. A. (1992), Data Analysis: An Introduction, Englewood Cliffs, NJ: Prentice-Hall.
Thomas L. Moore
Department of Mathematics and Statistics
Grinnell College
Grinnell, IA
U.S.A.
mooret@grinnell.edu