Jeffrey S. Simonoff
New York University
Journal of Statistics Education v.5, n.1 (1997)
Copyright (c) 1997 by Jeffrey S. Simonoff, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the author and advance notification of the editor.
Key Words: Classroom exercise; Logistic regression; Model building; Survival data.
Dawson (1995) described a dataset giving population at risk and fatalities for an unusual mortality episode (the sinking of the ocean liner Titanic), and discussed experiences in using the dataset in an introductory statistics course. In this paper the same dataset is analyzed from the point of view of the second statistics course. A combination of exploratory analysis using tables of observed survival percentages, model building using logistic regression, and careful thought allows the statistician (and student) to get to the essence of the random process described by the data. The well-known nature of the episode gives the students a chance at determining its character, and the data are complex enough to require sophisticated modeling methods to get at the truth.
1 In a recent paper Dawson (1995) discussed a dataset relating to an "unusual episode" of mortality -- the sinking of the ocean liner Titanic after colliding with an iceberg on April 15, 1912. The dataset gives the number at risk and deaths for the passengers and crew of the ship categorized by different characteristics. The first part of the paper focused on the process by which a "correct" version of the dataset was determined. The second part described the author's experience in using the dataset in an early class session of an introductory statistics course (by presenting it with identifying characteristics omitted and asking the students to try to identify what the episode actually was).
2 Dawson (1995) actually presented two similar, but not identical, tables related to the Titanic sinking. Table 1 of that paper referred to passengers only, while Table 2 also included crew. Table 2 was a modified version of Table 1 based on information in the Board of Trade Inquiry Report (1990). This latter table is the basis of the analyses in this paper. The 2201 people at risk are categorized by economic status (first-class passengers, second-class passengers, third-class passengers, or crew), age (child or adult), gender (female or male), and survival (survived or did not survive). The table is thus a 4 x 2 x 2 x 2 contingency table, with one dimension (survival) a natural target variable.
3 Informal discussion of these data in an introductory course can help get students thinking about randomness and exploratory data analysis. The data are also ideal for use in a second course, however, where statistical (regression) model building is covered, as they can be used to show how models can highlight, reinforce, and contradict informal impressions. In this way they can be a powerful tool to help students see both the power and limitations of exploratory data analysis, and the power and limitations of statistical models.
4 During the Spring 1996 semester I used the "unusual episode" data in the second statistics course. The course assumes knowledge of data analysis and statistical inference up through basic regression modeling. The second course is called "Regression and Multivariate Data Analysis," and it is the data analysis portion of the title that is key. All discussion of models, testing, estimation, and diagnostics is directed towards practical issues of understanding and exploring real data.
5 The "unusual episode" data were discussed roughly two-thirds of the way through the semester, just after discussion of logistic regression. By that time linear regression (including simple and multiple regression, regression diagnostics, model selection, weighted least squares, and regression on time series data) had been thoroughly discussed at an applied level. This led to discussion of analysis of variance (ANOVA) models, including the use of indicator and effect codings to fit ANOVA models. The students had also seen two real data analyses using logistic regression in class.
6 At the end of class I gave out a nine-page handout entitled "An unusual episode," which I asked the students to read before the next class. The handout, highlights of which follow, described a process by which a logistic regression model can be fit to these data (the entire model selection process was explicitly given in the handout). I also asked the students to write down on a piece of paper what they thought the mortality episode actually was (as all identifying characteristics were omitted from the handout).
7 My goal was to make the analysis as natural as possible by drawing heavily on analogies with least squares regression. The students had seen many analyses of continuous data, but had not seen tables of counts before (at least in any systematic fashion). Instead of histograms of all the variables, we have frequency distributions, because the variables are all categorical. Similarly, cross-classifications take the role of scatter plots.
8 The handout begins with a brief discussion of how cross-classified data can be analyzed as a logistic regression if one of the dimensions is a natural (binary) target variable. Table 1 summarizes the dataset. Each entry gives the percentage of the people in that cell who survived, together with the number at risk in the cell, so that both the relationship with survival probability and the size of each subgroup are apparent. Summing over the table gives the overall survival rate of 32.3% of the 2201 people at risk.
Table 1: Survival Percentages Separated by Characteristics
                            Age
Gender   Economic status    Adult           Child
Male     High               32.6% of 175    100% of 5
         Medium              8.3% of 168    100% of 11
         Low                16.2% of 462    27.1% of 48
         Other              22.3% of 862    ---
Female   High               97.2% of 144    100% of 1
         Medium             86.0% of 93     100% of 13
         Low                46.1% of 165    45.2% of 31
         Other              87.0% of 23     ---
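As a check on the 32.3% figure, the cell percentages in Table 1 can be converted back to survivor counts and summed. The sketch below assumes that each count is the nearest integer to (percentage x number at risk); the counts themselves are not printed in the table, and the variable names are illustrative only.

```python
# Table 1 cells: (gender, status, age, percent survived, number at risk).
# Percentages and numbers at risk are copied from Table 1.
CELLS = [
    ("Male", "High", "Adult", 32.6, 175), ("Male", "High", "Child", 100.0, 5),
    ("Male", "Medium", "Adult", 8.3, 168), ("Male", "Medium", "Child", 100.0, 11),
    ("Male", "Low", "Adult", 16.2, 462), ("Male", "Low", "Child", 27.1, 48),
    ("Male", "Other", "Adult", 22.3, 862),
    ("Female", "High", "Adult", 97.2, 144), ("Female", "High", "Child", 100.0, 1),
    ("Female", "Medium", "Adult", 86.0, 93), ("Female", "Medium", "Child", 100.0, 13),
    ("Female", "Low", "Adult", 46.1, 165), ("Female", "Low", "Child", 45.2, 31),
    ("Female", "Other", "Adult", 87.0, 23),
]

# Assumption: survivors = percentage * number at risk, rounded to the nearest integer.
survivors = [round(pct * n / 100) for (_, _, _, pct, n) in CELLS]

at_risk = sum(n for (*_, n) in CELLS)
total_survivors = sum(survivors)
overall_pct = 100 * total_survivors / at_risk

print(f"{total_survivors} of {at_risk} survived ({overall_pct:.1f}%)")
```

Working from reconstructed counts rather than percentages avoids averaging percentages over cells of very different sizes, which would give a misleading overall rate.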
9 Table 2 gives the three cross-classifications of economic status, age, and gender, respectively, with survival. The tables highlight the patterns that identify this event as a shipwreck: decreasing survival with decreasing economic status ("Other" has lowest survival rate, but of course it is uncertain that it has lowest status), higher survival for children (and relatively few children at risk), and much higher survival for women (with fewer women than men at risk).
Table 2: Observed Survival Percentages by Variable
Economic status   Percent survived
High              62.5% of 325
Medium            41.4% of 285
Low               25.2% of 706
Other             24.0% of 885

Age               Percent survived
Child             52.3% of 109
Adult             31.3% of 2092

Gender            Percent survived
Female            73.2% of 470
Male              21.2% of 1731
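The one-way summaries in Table 2 are simply the 14 cells of the full cross-classification collapsed onto one variable at a time. A minimal sketch follows, assuming survivor counts reconstructed by rounding (percentage x number at risk) in each cell of Table 1; the helper name margin is hypothetical.

```python
from collections import defaultdict

# (gender, status, age, survivors, at risk). Survivor counts are reconstructed
# from the Table 1 percentages by rounding, which is an assumption.
CELLS = [
    ("Male", "High", "Adult", 57, 175), ("Male", "High", "Child", 5, 5),
    ("Male", "Medium", "Adult", 14, 168), ("Male", "Medium", "Child", 11, 11),
    ("Male", "Low", "Adult", 75, 462), ("Male", "Low", "Child", 13, 48),
    ("Male", "Other", "Adult", 192, 862),
    ("Female", "High", "Adult", 140, 144), ("Female", "High", "Child", 1, 1),
    ("Female", "Medium", "Adult", 80, 93), ("Female", "Medium", "Child", 13, 13),
    ("Female", "Low", "Adult", 76, 165), ("Female", "Low", "Child", 14, 31),
    ("Female", "Other", "Adult", 20, 23),
]

def margin(position):
    """Collapse the cells onto one variable (0 = gender, 1 = status, 2 = age)."""
    totals = defaultdict(lambda: [0, 0])
    for cell in CELLS:
        totals[cell[position]][0] += cell[3]  # survivors
        totals[cell[position]][1] += cell[4]  # number at risk
    return {level: (round(100 * y / n, 1), n) for level, (y, n) in totals.items()}

by_status, by_age, by_gender = margin(1), margin(2), margin(0)

for name, table in [("Economic status", by_status), ("Age", by_age), ("Gender", by_gender)]:
    for level, (pct, n) in table.items():
        print(f"{name:16s} {level:7s} {pct}% of {n}")
```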
10 The categorical nature of the dataset allows easy exploration of the effect of interactions of the variables on survival. Table 3 gives one such interaction table, that of economic status by gender. The interaction effect corresponds to an association with survival that is not explained by the marginal (main) effects alone. This table corrects and clarifies several impressions from Table 2. First, "Other" status is not actually lower than "Low" status in terms of survival probability; rather, more than 97% of the members of this status were male, with associated lower survival probability than females (this overwhelming gender imbalance provides a clue that "Other" status corresponds to the crew). The other pieces of new information in Table 3 are that women of "Low" status fared far worse than women at other status levels, while men of "High" status fared better than men at other status levels.
Table 3: Observed Interaction of Economic Status and Gender on Survival
                  Percent survived, by gender
Economic status   Female          Male
High              97.2% of 145    34.4% of 180
Medium            87.7% of 106    14.0% of 179
Low               45.9% of 196    17.3% of 510
Other             87.0% of 23     22.3% of 862
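The two observations about "Other" status (its gender imbalance, and its survival rate once gender is held fixed) can be checked by collapsing the full table over age. The sketch below again assumes survivor counts reconstructed by rounding the Table 1 percentages; the function and variable names are illustrative.

```python
# (gender, status, age, survivors, at risk). Survivor counts are reconstructed
# from the Table 1 percentages by rounding, which is an assumption.
CELLS = [
    ("Male", "High", "Adult", 57, 175), ("Male", "High", "Child", 5, 5),
    ("Male", "Medium", "Adult", 14, 168), ("Male", "Medium", "Child", 11, 11),
    ("Male", "Low", "Adult", 75, 462), ("Male", "Low", "Child", 13, 48),
    ("Male", "Other", "Adult", 192, 862),
    ("Female", "High", "Adult", 140, 144), ("Female", "High", "Child", 1, 1),
    ("Female", "Medium", "Adult", 80, 93), ("Female", "Medium", "Child", 13, 13),
    ("Female", "Low", "Adult", 76, 165), ("Female", "Low", "Child", 14, 31),
    ("Female", "Other", "Adult", 20, 23),
]
STATUSES = ("High", "Medium", "Low", "Other")
GENDERS = ("Female", "Male")

def collapse_over_age(status, gender):
    """Return (survivors, at risk) for one status-by-gender group."""
    rows = [c for c in CELLS if c[0] == gender and c[1] == status]
    return sum(c[3] for c in rows), sum(c[4] for c in rows)

table3 = {(s, g): collapse_over_age(s, g) for s in STATUSES for g in GENDERS}

for s in STATUSES:
    entries = "   ".join(
        f"{100 * y / n:5.1f}% of {n:<4d}" for y, n in (table3[(s, g)] for g in GENDERS)
    )
    print(f"{s:8s}{entries}")

# Share of the "Other" group that is male.
other_at_risk = sum(table3[("Other", g)][1] for g in GENDERS)
male_share_other = 100 * table3[("Other", "Male")][1] / other_at_risk
print(f"Male share of 'Other' status: {male_share_other:.1f}%")
```

Within each gender, the "Other" rate can then be compared directly with the "Low" rate, which is the comparison that the one-way table hides.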
11 This exploratory analysis is supported by formal model building. Details are given in Table 4. The main effects and interaction effects are fit using effect codings (see, for example, Hamilton 1992, pp. 99-101). All of the models are hierarchical, in that the presence of an interaction effect in the model implies that the associated main effects are also present. Note that because there were no children in the crew, the interaction between economic status and age (EA) is fit using only two of the effect codings corresponding to pairwise products of those for the main effects, rather than three. The likelihood ratio goodness-of-fit statistic (G²) is given for each model, along with associated degrees of freedom (df) and tail probability (p). The Akaike Information Criterion (AIC), which attempts to provide a tradeoff between goodness-of-fit and parsimony, is given in the last column (it equals G² + 2 x (number of parameters in the model)). The models are given ordered from smallest to largest AIC value within model class (one main effect, two main effects, three main effects, etc.) to make model selection easier.
12 An important point about the construction of G² should be made here. A different representation of the data set from that given in Table 1 is to consider it as a set of 2201 observations, each having a 0/1 (survived/did not survive) response value associated with it. The representations are equivalent with respect to fitted logistic regression coefficients, likelihood ratio and Wald tests of the significance of any individual effects, and maximized log-likelihood. They are different, however, in their implications for goodness-of-fit. It is inappropriate to use G² to evaluate goodness-of-fit considering each of the 2201 observations as a separate binomial random variable (based on one trial each). The reason for this is that the distribution of G² given the fitted regression coefficients is degenerate in this circumstance, and thus provides no information on goodness-of-fit (McCullagh and Nelder 1989, section 4.4.5). Rather, goodness-of-fit should be evaluated over the set of covariate patterns, which in this case corresponds to the 14 cells in Table 1 with nonzero number at risk.
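To make the grouped-data form of G² concrete, the sketch below evaluates it over the 14 covariate patterns for the model (E, G, EG). That model has a separate parameter for every status-by-gender group, so its fitted probability in each cell is the pooled observed survival rate for the group and no iterative fitting is needed. The survivor counts are an assumption (reconstructed by rounding the Table 1 percentages), and the helper name is hypothetical. The result can be compared with the value of 66.24 on 6 df reported for this model in Table 4.

```python
import math

# (gender, status, age, survivors, at risk). Survivor counts are reconstructed
# from the Table 1 percentages by rounding, which is an assumption.
CELLS = [
    ("Male", "High", "Adult", 57, 175), ("Male", "High", "Child", 5, 5),
    ("Male", "Medium", "Adult", 14, 168), ("Male", "Medium", "Child", 11, 11),
    ("Male", "Low", "Adult", 75, 462), ("Male", "Low", "Child", 13, 48),
    ("Male", "Other", "Adult", 192, 862),
    ("Female", "High", "Adult", 140, 144), ("Female", "High", "Child", 1, 1),
    ("Female", "Medium", "Adult", 80, 93), ("Female", "Medium", "Child", 13, 13),
    ("Female", "Low", "Adult", 76, 165), ("Female", "Low", "Child", 14, 31),
    ("Female", "Other", "Adult", 20, 23),
]

def obs_log_ratio(observed, expected):
    """observed * log(observed / expected), with the convention 0 * log(0) = 0."""
    return 0.0 if observed == 0 else observed * math.log(observed / expected)

# Pool survivors and numbers at risk within each status-by-gender group.
pooled = {}
for gender, status, _, y, n in CELLS:
    ys, ns = pooled.get((gender, status), (0, 0))
    pooled[(gender, status)] = (ys + y, ns + n)

# G2 = 2 * sum over cells of [y log(y / fitted) + (n - y) log((n - y) / (n - fitted))].
g2 = 0.0
for gender, status, _, y, n in CELLS:
    ys, ns = pooled[(gender, status)]
    fitted = n * ys / ns  # fitted number of survivors in this cell
    g2 += 2 * (obs_log_ratio(y, fitted) + obs_log_ratio(n - y, n - fitted))

n_patterns = len(CELLS)   # 14 covariate patterns
n_params = 1 + 3 + 1 + 3  # intercept, E, G, EG
df = n_patterns - n_params
print(f"G2 = {g2:.2f} on {df} df")
```

Almost all of the lack of fit comes from the child cells of the "High" and "Medium" groups, where every child survived but the model assigns the (much lower) adult-dominated pooled rate for males.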
Table 4: Logistic Regression Fits to Survival Data (E = Economic Status, A = Age, G = Gender)
Model                  G²       df   p         AIC
G                      237.49   12   <.0001    241.49
E                      491.06   10   <.0001    499.06
A                      652.40   12   <.0001    656.40

E, G                   131.42    9   <.0001    141.42
A, G                   231.60   11   <.0001    237.60
E, A                   465.48    9   <.0001    475.48

E, A, G                112.57    8   <.0001    124.57

E, G, EG                66.24    6   <.0001     82.24
A, G, AG               215.28   10   <.0001    223.28
E, A, EA               436.27    7   <.0001    450.27

E, A, G, EG             45.90    5   <.0001     63.90
E, A, G, EA             76.91    6   <.0001     92.91
E, A, G, AG             94.55    7   <.0001    108.55

E, A, G, EA, EG          1.69    3   .6395      23.69
E, A, G, EG, AG         37.26    4   <.0001     57.26
E, A, G, EA, AG         65.02    5   <.0001     83.02

E, A, G, EA, EG, AG      0.00    2   1.000      24.00
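The df and AIC columns of Table 4 follow mechanically from G² and the number of parameters in each model. The sketch below recomputes both, assuming an intercept in every model, three effect codings for E and for EG, one each for A, G, and AG, and two for EA (because there were no children in the crew). The dictionary and function names are illustrative.

```python
# Parameters contributed by each term under effect coding.
TERM_PARAMS = {"E": 3, "A": 1, "G": 1, "EG": 3, "AG": 1, "EA": 2}
N_PATTERNS = 14  # cells in Table 1 with nonzero number at risk

# (model terms, G2, df, AIC) copied from Table 4.
TABLE4 = [
    ("G", 237.49, 12, 241.49), ("E", 491.06, 10, 499.06), ("A", 652.40, 12, 656.40),
    ("E, G", 131.42, 9, 141.42), ("A, G", 231.60, 11, 237.60), ("E, A", 465.48, 9, 475.48),
    ("E, A, G", 112.57, 8, 124.57),
    ("E, G, EG", 66.24, 6, 82.24), ("A, G, AG", 215.28, 10, 223.28),
    ("E, A, EA", 436.27, 7, 450.27),
    ("E, A, G, EG", 45.90, 5, 63.90), ("E, A, G, EA", 76.91, 6, 92.91),
    ("E, A, G, AG", 94.55, 7, 108.55),
    ("E, A, G, EA, EG", 1.69, 3, 23.69), ("E, A, G, EG, AG", 37.26, 4, 57.26),
    ("E, A, G, EA, AG", 65.02, 5, 83.02),
    ("E, A, G, EA, EG, AG", 0.00, 2, 24.00),
]

def bookkeeping(terms, g2):
    """Return (df, AIC) implied by a model's terms and its G2."""
    n_params = 1 + sum(TERM_PARAMS[t] for t in terms.split(", "))  # 1 for the intercept
    return N_PATTERNS - n_params, round(g2 + 2 * n_params, 2)

# Rows of Table 4 whose df or AIC disagree with the recomputed values.
mismatches = [row for row in TABLE4 if bookkeeping(row[0], row[1]) != (row[2], row[3])]
best_by_aic = min(TABLE4, key=lambda row: row[3])[0]

print("mismatches:", mismatches)
print("minimum AIC model:", best_by_aic)
```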
13 According to G², the only two models that fit the table include the two interactions EA and EG, or all three pairwise interactions EA, EG and AG. We must recognize, however, that the large sample size here means that statistically significant effects might not have great practical importance. Similarly, while the three models with minimum AIC include all of the main effects and two or three interaction effects, the well-known tendency of AIC to lead to overfitted models (Hurvich and Tsai 1989) implies that more care is called for in choosing a model that fits adequately but is parsimonious.
14 One way of doing this is to compare the fitted values for the three models (E, G, EG), (E, A, G, EG), and (E, A, G, EA, EG) (the best-fitting models of their respective classes). These are given in Table 5. Examination of the fitted survival percentages shows that they are very similar for all three models for the adult classes, but differ for the child classes. Because children represent less than 5% of the total population at risk, the simple model E, G, EG seems adequate to describe the important associations with survival in the data. This model also has the advantage of yielding as fitted survival percentages the observed percentages in Table 3, making summary of the model easy.
Table 5: Survival Percentages Separated by Characteristics for Three Models
Model E, G, EG
                            Age
Gender   Economic status    Adult           Child
Male     High               34.4% of 175    34.4% of 5
         Medium             14.0% of 168    14.0% of 11
         Low                17.3% of 462    17.3% of 48
         Other              22.3% of 862    ---
Female   High               97.2% of 144    97.2% of 1
         Medium             87.7% of 93     87.7% of 13
         Low                45.9% of 165    45.9% of 31
         Other              87.0% of 23     ---

Model E, A, G, EG
                            Age
Gender   Economic status    Adult           Child
Male     High               33.7% of 175    59.4% of 5
         Medium             12.9% of 168    29.9% of 11
         Low                15.5% of 462    34.4% of 48
         Other              22.3% of 862    ---
Female   High               97.2% of 144    99.0% of 1
         Medium             86.7% of 93     94.9% of 13
         Low                41.9% of 165    67.4% of 31
         Other              87.0% of 23     ---

Model E, A, G, EA, EG
                            Age
Gender   Economic status    Adult           Child
Male     High               32.6% of 175    100% of 5
         Medium              8.3% of 168    100% of 11
         Low                16.8% of 462    22.0% of 48
         Other              22.3% of 862    ---
Female   High               97.2% of 144    100% of 1
         Medium             86.0% of 93     100% of 13
         Low                44.6% of 165    53.0% of 31
         Other              87.0% of 23     ---
15 Model fitting can also help students determine what the nature of the mortality episode actually is. The comparison in Table 5 reinforces how few children were at risk here, potentially giving a clue to the students that this was not simply an epidemic in some town or city. The relationship between the fitted survival percentages and the economic status of the people at risk shows that apparently exposure to the mortality agent was higher for poorer people than for richer ones (or, more correctly for this incident, exposure to survival measures was lower). The fitted survival percentages for the two more complicated models highlight that the higher survival rate for children is stronger for boys (compared to men) than for girls (compared to women), perhaps helping to trigger recognition of the "women and children first" rule of the sea. In fact, the nine classes with highest fitted survival percentages for the model (E, A, G, EA, EG) correspond to either women or children or both.
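The claim about the nine classes with the highest fitted survival percentages can be verified by sorting the third panel of Table 5. The sketch below copies the fitted percentages for model (E, A, G, EA, EG) from that table; ties at 100% are left in table order, and the variable names are illustrative.

```python
# (gender, age, status, fitted percent survived) for model (E, A, G, EA, EG),
# copied from Table 5.
FITTED = [
    ("Male", "Adult", "High", 32.6), ("Male", "Child", "High", 100.0),
    ("Male", "Adult", "Medium", 8.3), ("Male", "Child", "Medium", 100.0),
    ("Male", "Adult", "Low", 16.8), ("Male", "Child", "Low", 22.0),
    ("Male", "Adult", "Other", 22.3),
    ("Female", "Adult", "High", 97.2), ("Female", "Child", "High", 100.0),
    ("Female", "Adult", "Medium", 86.0), ("Female", "Child", "Medium", 100.0),
    ("Female", "Adult", "Low", 44.6), ("Female", "Child", "Low", 53.0),
    ("Female", "Adult", "Other", 87.0),
]

# Rank the 14 classes from highest to lowest fitted survival percentage.
ranked = sorted(FITTED, key=lambda cell: cell[3], reverse=True)
top_nine = ranked[:9]
all_women_or_children = all(
    gender == "Female" or age == "Child" for gender, age, _, _ in top_nine
)

for gender, age, status, pct in ranked:
    print(f"{pct:5.1f}%  {gender:6s} {age:5s} {status}")
print("Top nine are all women or children:", all_women_or_children)
```

The tenth class in the ranking is the first made up entirely of adult men, which is where the "women and children first" pattern breaks.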
16 Thirty students turned in their guesses about the nature of the mortality episode at the beginning of the next class. Subsequent tallies demonstrated little prediction success, as only 3 of the 30 correctly identified the episode. Disease (11 of 30, with 4 people specifically saying AIDS), a military engagement (6 of 30), and gunshot (4 of 30) were the most popular choices. Despite this, class discussion was very brief. When the second person to volunteer a guess said "The sinking of the Titanic," the response was positively electric -- all of the students started nodding their heads and saying "That's it." We then spent a few more minutes going over the data and analyses before moving on to new material.
17 One impression I received from the class discussion was that some of the students were unclear on how to obtain the values in Table 4, and how to interpret them. To make this clearer, I added an Appendix to the handout describing how ordinary (weighted) least squares regression can be used to approximate logistic regression fitting (see, e.g., McCullagh and Nelder 1989, pp. 106-107). When combined with the simplification of treating economic status as dichotomous (low status/not low status), a best subsets regression program can be used to try to choose the model that best balances goodness-of-fit with parsimony (the model based on economic status, gender, and their interaction is the model of choice). The revised handout is available with this paper.
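The Appendix itself is not reproduced here, but one common form of the weighted least squares approximation regresses the observed logit of each group on the predictors, with weight n x p x (1 - p) for a group of size n and observed survival proportion p. The sketch below prepares those responses and weights for the dichotomized status-by-gender groups. It rests on several assumptions: survivor counts are reconstructed by rounding the table percentages, "not low" pools the High, Medium, and Other categories, and this particular form of the approximation may differ in detail from the one in the handout.

```python
import math

# (status group, gender, survivors, at risk). "Not low" pools High, Medium and
# Other; survivor counts are reconstructed from the table percentages.
# Both choices are assumptions made for illustration.
GROUPS = [
    ("Not low", "Female", 254, 274),
    ("Not low", "Male", 279, 1221),
    ("Low", "Female", 90, 196),
    ("Low", "Male", 88, 510),
]

rows = []
for status, gender, y, n in GROUPS:
    p = y / n
    logit = math.log(p / (1 - p))  # response for the least squares fit
    weight = n * p * (1 - p)       # reciprocal of the approximate variance of the logit
    rows.append((status, gender, logit, weight))
    print(f"{status:8s} {gender:6s} logit = {logit:6.3f}  weight = {weight:6.1f}")

# The gender gap in log odds at each status level; unequal gaps indicate an
# interaction between status and gender.
logits = {(status, gender): z for status, gender, z, _ in rows}
gap_not_low = logits[("Not low", "Female")] - logits[("Not low", "Male")]
gap_low = logits[("Low", "Female")] - logits[("Low", "Male")]
print(f"Female minus male log odds: not low {gap_not_low:.2f}, low {gap_low:.2f}")
```

These four logits and weights are the inputs a weighted least squares or best subsets program would need; with a gender indicator, a status indicator, and their product as predictors, the unequal gaps show up as a nonzero interaction coefficient.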
18 The "unusual episode" data provide a very rewarding experience in the second statistics course. At the cost of a little class time, the students see how a combination of exploratory analysis, model building, and careful thought allows the statistician to cut through a maze of numbers to the essence of an unknown process. The data are complex enough to require careful and sophisticated modeling methods, yet easily accessible (because the episode itself is so well-known). An alternative approach to that described here is to actually analyze the data in front of the class using a computer (assuming that this is possible). Allowing the students to work through the analysis cooperatively could be a very rewarding educational experience, although it would likely also be a time-consuming one.
I would like to thank the referees for helpful comments on an earlier draft of this article.
"Report on the Loss of the 'Titanic' (S.S.)" (1990), British Board of Trade Inquiry Report (reprint), Gloucester, UK: Allan Sutton Publishing.
Dawson, R. J. M. (1995), "The 'Unusual Episode' Data Revisited," Journal of Statistics Education [Online], 3(3). (http://jse.amstat.org/v3n3/datasets.dawson.html)
Hamilton, L. C. (1992), Regression With Graphics: A Second Course in Applied Statistics, Belmont, CA: Duxbury.
Hurvich, C. M., and Tsai, C.-L. (1989), "Regression and Time Series Model Selection in Small Samples," Biometrika, 76, 297-307.
McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models (2nd ed.), London: Chapman and Hall.
Jeffrey S. Simonoff
Department of Statistics & Operations Research
New York University
44 West 4th Street, Room 8-54
New York, NY 10012-1126
A postscript version of the handout, "An unusual episode", is available.