Scott Preston
SUNY - Oswego
Journal of Statistics Education Volume 14, Number 2 (2006), jse.amstat.org/v14n2/datasets.preston.html
Copyright © 2006 by Scott Preston, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.
Key Words: Descriptive statistics; Legislation; Logistic regression; Model selection.
I use the CAFE dataset described below as the basis for an in-class activity in a second semester applied statistics course taught to mostly Business, Biology/Zoology, and Information Science majors. I am primarily interested in model fitting and descriptive statements about the fit. (The dataset does not lend itself to sample-based inferences about a population.) I have students study the analysis, and I revisit the exercise on homework and an exam, asking students to explain the decisions that lead to a good fit, and to provide interpretations of that fit.
In an upper-division regression course I have used the dataset in an assignment/take-home exam late in the course. Emphasis here is again on descriptive, rather than inferential, statistics. Students have worked with all the tools necessary to perform a good analysis, yet they have trouble wielding those tools together; consequently I require them to consult with me as they take actions in their analyses.
The dataset consists of information about each of the 100 U.S. Senators regarding their vote on the Levin amendment. A senator’s vote (Vote) is the response variable. Provided explanatory variables include the state represented, political party affiliation (Party), and the lifetime total amount of contributions received from auto manufacturers (Amount). See Appendix 1 for a detailed description of the contents and format of the data file cafe.dat.txt.
Motivation for studying the CAFE issue can be found in the media. An article in the October 2005 Consumer Reports describes how automakers classify cars as light trucks to “bend” the restrictions set by the standard. (Vehicles classified as light trucks fall under relaxed fuel efficiency standards. For instance, by having its Outback model reclassified as a light truck, Subaru was able to add weight to the vehicle without making expenditures to remediate the resulting reduction in fuel efficiency.) Paul Roberts’ 2004 book, The End of Oil, provides a good summary of the issue in the context of energy concerns – see Appendix 2 for an excerpt. The New York Times (Hakim 2004b) has reported on the Bush administration’s plan to propose changes to the national fuel economy regulations. John Kerry’s 2004 presidential election campaign vow to pursue better fuel economy involved a stance on CAFE – an article on the Kerry campaign in the New York Times (Hakim 2004a) directly references the vote on the Levin amendment.
Students are led to the dataset on the internet and then import it into a spreadsheet. We first obtain univariate summaries for each of the variables. At this point the Jeffords (Independent, VT) case provokes discussion. For the time being we omit this case.
A tally of Vote by Party results in a two-by-two table that can be put to a significance test (one that might have been performed during coverage of categorical data earlier in the term). Our conclusion is that the difference in vote between the two parties did not come about by randomly allocating the 62 YES votes among the 99 senators. We are now primed to include Amount as a second explanatory variable.
Republican | Stem | Democrat |
99987766544420 | 0 | 00000000000111111222333444455556677 |
Stem unit = $10,000
Figure 1. Back-to-back stemplots of contribution amounts for the two parties.
Variable: | Amount | ||||||
Party | Count | Mean | Minimum | Q1 | Median | Q3 | Maximum |
Democrat | 50 | 10025 | 0 | 1000 | 4375 | 14250 | 133250 |
Republican | 49 | 17783 | 250 | 9250 | 15000 | 25000 | 48939 |
The distributions have skew, and outlying cases are identified. A transformation of Amount is in order. I have settled on log10(10×Amount + 1). (Using log10(10×Amount) expresses “how many ‘figures’ in the dollar amount,” (see “Dealing with Logarithms,” De Veaux, Velleman and Bock (2005), page 45), but leaves cases with Amount = 0 undefined. The proposed compromise allows students to quickly handle the transformation in the spreadsheet and results in essentially the same interpretation. And, to perform the transformation on a calculator, students merely affix the digit 1 to the end of Amount, then press the log10 key.)
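The transformation can be checked against the summary tables with a few lines of code. This is a minimal sketch in Python rather than the spreadsheet the students use; the function name `figures` is a hypothetical label, and the test amounts are the minimum, median, and maximum values reported in the summary of Amount above.

```python
import math

def figures(amount):
    """Return log10(10*amount + 1), the approximate number of
    'figures' in a dollar amount. Defined for amount = 0."""
    return math.log10(10 * amount + 1)

# Amounts taken from the summary table of Amount by Party.
for amount in (0, 250, 4375, 10000, 133250):
    print(f"{amount:>7} -> {figures(amount):.4f}")
```

An amount of 0 maps to exactly 0, which is the point of adding 1 before taking the logarithm.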
Variable: | log(10A+1) | ||||||
Party | Count | Mean | Minimum | Q1 | Median | Q3 | Maximum |
Democrat | 50 | 4.124 | 0.000 | 4.000 | 4.641 | 5.154 | 6.125 |
Republican | 49 | 5.1318 | 3.3981 | 4.9660 | 5.1761 | 5.3979 | 5.6897 |
See the boxplots in Figure 2 and dotplots in Figure 3 for graphical displays of the transformed amounts. This transformation completely, and meaningfully, alters what constitutes an outlier.
Here again one may ask students to anticipate the fit. Supplied with dotplots of Amount by Party as in Figure 3, have them draw a curve for each party, summarizing the relative density of filled circles as Amount increases.
My students begin the logistic regression analysis (logit link function) fitting Vote by Party. (This amounts to estimating cell probabilities in a 2×2 table. Moore, McCabe, Duckworth, and Sclove (2003) use a 2×2 table as a first step in introducing logistic regression.) We confirm that the fit produces the observed probabilities. From this point forward we engage in an exercise fitting Vote as a function of Amount and Party.
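Let REP be an indicator for political affiliation (1 for Republican, 0 for Democrat), and let F = log10(10A + 1) be the continuous variable for the approximate number of “figures” in the contributed amount A. With p = Pr{Vote = YES}, the full model is stated

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\mathrm{REP} + \beta_2 F + \beta_3\,(\mathrm{REP}\times F).$$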
A formal statistical test (the strategy I’d be least inclined to use with my students) reveals the interaction term to be insignificant (from Table 3, P-value = 0.122).
Table 3. Full model (with interaction).
Predictor | Coef | SE Coef | Z | P | Odds Ratio | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
Constant | -3.20274 | 1.64536 | -1.95 | 0.052 | |||
REP | -8.04599 | 6.46223 | -1.25 | 0.213 | 0.00 | 0.00 | 101.49 |
log(10A+1) | 0.618414 | 0.350380 | 1.76 | 0.078 | 1.86 | 0.93 | 3.69 |
REP*log(10A+1) | 2.01934 | 1.30583 | 1.55 | 0.122 | 7.53 | 0.58 | 97.39 |
Measures of Association:
(Between the Response Variable and Predicted Probabilities)
Pairs | Number | Percent | Summary Measures | |
Concordant | 1980 | 86.3 | Somers' D | 0.73 |
Discordant | 301 | 13.1 | Goodman-Kruskal Gamma | 0.74 |
Ties | 13 | 0.6 | Kendall's Tau-a | 0.35 |
Total | 2294 | 100.0 |
Measures of association are commonly used to compare fits in regression. Here, concordant and discordant pairs form the basis for these measures. For each pair of senators whose votes were different (there are 62×37 = 2294 such pairs, out of 99×98/2 = 4851 total pairs), the fitted model is used to estimate the probabilities of voting YES. The pair is concordant (discordant) if the senator who voted YES has a higher (lower) fitted probability of voting YES; ties occur when two senators who vote differently have identical predictor sets.
The percent of concordant pairs, 1980/2294 = 86.3%, is a simple, direct measure of association. The Goodman-Kruskal Gamma is the difference between the proportions of concordant and discordant pairs when ties are ignored: (1980–301)/(1980+301) = 0.736. Somers’ D is similar to Gamma, except that tied pairs stay in the denominator: (1980–301)/2294 = 0.732. Kendall’s Tau-a is the difference between the proportions of concordant and discordant pairs out of all pairs: (1980–301)/4851 = 0.346. (These are covered in detail in Section 9.1 of Agresti (1984).) In logistic regression, these association measures are generally between 0 and 1, and models that have higher values are generally better predictors. The measures displayed in Table 3 and Table 4 show that the full model improves very little on the reduced (no interaction) model.
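The arithmetic behind these summary measures can be reproduced from the pair counts alone. The following is a sketch, not the software's own routine: the function name `association_measures` is hypothetical, and Somers' D is computed with tied pairs kept in the denominator, which is an assumption that agrees with the reported values of 0.73 and 0.72.

```python
def association_measures(concordant, discordant, ties, n_cases):
    """Summary measures from counts of concordant, discordant, and tied
    pairs among cases with different responses."""
    pairs = concordant + discordant + ties        # pairs with different votes
    all_pairs = n_cases * (n_cases - 1) // 2      # every pair of cases
    diff = concordant - discordant
    return {
        "percent_concordant": 100 * concordant / pairs,
        "somers_d": diff / pairs,                  # ties kept in denominator
        "gamma": diff / (concordant + discordant), # ties ignored
        "tau_a": diff / all_pairs,
    }

full = association_measures(1980, 301, 13, 99)     # counts from Table 3
reduced = association_measures(1964, 318, 12, 99)  # counts from Table 4

for label, result in (("full", full), ("reduced", reduced)):
    print(label, {name: round(value, 3) for name, value in result.items()})
```

Printing the two sets side by side makes it easy to see how little the interaction term adds.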
Table 4. Reduced model (no interaction).
Predictor | Coef | SE Coef | Z | P | Odds Ratio | 95% CI Lower | 95% CI Upper |
---|---|---|---|---|---|---|---|
Constant | -4.51004 | 2.00965 | -2.24 | 0.025 | |||
REP | 1.91463 | 0.558377 | 3.43 | 0.001 | 6.78 | 2.27 | 20.27 |
log(10A+1) | 0.898979 | 0.423062 | 2.12 | 0.034 | 2.46 | 1.07 | 5.63 |
Measures of Association:
(Between the Response Variable and Predicted Probabilities)
Pairs | Number | Percent | Summary Measures | |
Concordant | 1964 | 85.6 | Somers' D | 0.72 |
Discordant | 318 | 13.9 | Goodman-Kruskal Gamma | 0.72 |
Ties | 12 | 0.5 | Kendall's Tau-a | 0.34 |
Total | 2294 | 100.0 |
A third assessment involves a graphical comparison of the full and reduced model fits. On first inspection of Table 3 and Table 4, the fits – more precisely the coefficients – are very different. Figure 2 reveals the two fits to differ substantially in the range of amounts from $0 to $1,000,000. On closer inspection it becomes clear that within the range of actual contribution amounts (the boxplots in Figure 2 outline these), the fits are quite similar.
Figure 2. A graphical comparison of the competing models must take into account the range of explanatory values – here displayed with boxplots.
Finally, common sense prevails. A broader view of a plot of the fitted full model reveals that the curves intersect, and that below approximately $1000 ($964.76 to be exact), the full model puts the probability of a YES vote higher for a Democrat than a Republican. (The sparseness of Republicans under $1000, and the one $0 Democrat voting YES, are in part responsible for this.) The result is counterintuitive and best avoided.
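The crossing point can be recovered from the Table 3 coefficients. On the logit scale the Republican and Democrat curves of the full model differ by (REP coefficient) + (interaction coefficient) × F, so they meet where that difference is zero. A short sketch, with hypothetical variable names:

```python
import math

# Full-model coefficients from Table 3.
B_REP = -8.04599       # REP
B_REP_X_F = 2.01934    # REP*log(10A+1)

# Republican logit minus Democrat logit is B_REP + B_REP_X_F * F.
# Setting the difference to zero gives the F at which the curves cross.
f_cross = -B_REP / B_REP_X_F

# Invert F = log10(10A + 1) to get back to dollars.
amount_cross = (10 ** f_cross - 1) / 10

print(f"F = {f_cross:.4f}, Amount = ${amount_cross:,.2f}")
```

Below this amount the difference is negative, which is why the full model gives the Democrat the higher fitted probability there.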
Figure 3. Dotplots of contribution amounts by party, and the fitted logistic regression of the probability of a YES on party and amount contributed. Dotted lines are used for extrapolated values for the regression (six Democrats received no contribution; they are indicated by a solid dot). The vertical line at the $8000 contribution amount marks Vermont’s James Jeffords, the sole independent in the Senate.
Figure 3 shows a plot not only of the fit I obtain, but of the data. I like this plot because it serves also as a pedagogical tool: students can relate the logistic regression fit to the fill density of the dotplots. Students can obtain fairly accurate quantities from such a plot, helping them anticipate and verify some of the quantitative statements discussed below.
We settle on the reduced model.
I want students to synthesize their understanding of the fit by a) identifying coefficients and odds-ratios displayed in Table 4, b) using the coefficients to produce the logit, odds, and probability of a YES vote for any combination of Amount and Party, c) obtaining odds-ratios from the coefficients, and d) interpreting odds-ratios in the context of the CAFE setting.
For example, take a Republican who’s received $10,000. On a handheld calculator input 100001, take the logarithm (the result F = 5.00 tells us $10,000 is a 5-figure amount), multiply by 0.898979, subtract 4.51004, and then add 1.91463 (the Republican effect), to obtain a logit of 1.89949. The odds then are e^1.89949 = 6.6825 (to 1); the probability of a YES vote is 6.6825/(1+6.6825) = 0.8698. For a Democrat, stop before adding 1.91463, yielding a logit of -0.01514. Exponentiate to obtain odds e^(-0.01514) = 0.9850, giving a probability of 0.9850/(1+0.9850) = 0.4962 of a YES vote. Students can now plot these estimated probabilities.
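The calculator steps translate directly into code. This is a minimal sketch using the reduced-model coefficients from Table 4; the function name `fitted` and the constant names are hypothetical.

```python
import math

# Reduced-model coefficients from Table 4.
B_CONSTANT = -4.51004
B_REP = 1.91463
B_F = 0.898979

def fitted(amount, republican):
    """Return (logit, odds, probability) of a YES vote."""
    f = math.log10(10 * amount + 1)          # "figures" in the amount
    logit = B_CONSTANT + B_F * f
    if republican:
        logit += B_REP                       # the Republican effect
    odds = math.exp(logit)
    return logit, odds, odds / (1 + odds)

for party, is_rep in (("Republican", True), ("Democrat", False)):
    logit, odds, prob = fitted(10_000, is_rep)
    print(f"{party:<10} logit={logit:.5f} odds={odds:.4f} p={prob:.4f}")
```

Calling `fitted` over a grid of amounts gives the points students need to plot the two fitted curves.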
Since Amount = 0 is equivalent to F = 0, the coefficient of –4.51004 tells us that for a Democrat with Amount = 0, the log-odds of a YES vote are –4.51004. The odds are e^(–4.51004) = 0.0110; the estimated probability of a YES vote is 0.0110/(1+0.0110) = 0.0109.
The “Republican effect” is quantified by the coefficient 1.91463. The odds-ratio is e^1.91463 = 6.7844. The odds for the Republican voting YES are 6.7844 times that for the Democrat. For example, suppose a Democrat has 0.2500 chance of voting YES. The odds then are 1:3. For a Republican with the same contribution amount, the odds are 6.7844:3, leading to a probability of 0.6934.
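Moving from a probability to odds, applying an odds-ratio, and moving back is a step students often stumble on, so it may help to see it spelled out. A small sketch, with the hypothetical helper `apply_odds_ratio`:

```python
import math

def apply_odds_ratio(prob, odds_ratio):
    """Convert a probability to odds, multiply by an odds-ratio,
    and convert the result back to a probability."""
    odds = prob / (1 - prob)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)

rep_odds_ratio = math.exp(1.91463)   # REP coefficient from Table 4
rep_prob = apply_odds_ratio(0.25, rep_odds_ratio)

print(f"odds-ratio = {rep_odds_ratio:.4f}, Republican p = {rep_prob:.4f}")
```

The same helper works for any starting probability, which shows that a fixed odds-ratio does not correspond to a fixed change in probability.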
The lack of interaction implies that the effect of Amount on Vote is the same for Republicans as for Democrats. However, such a statement requires care, because of the transformation of the Amount variable. Suppose we increase Amount by a fixed amount (or percentage). What is the effect on the likelihood of a YES vote?
The coefficient of 0.898979 tells us that for each 1 unit increase in F = log10(10A + 1) there is a corresponding 0.898979 increase in the log-odds of a YES vote. Solved algebraically, if the amount is multiplied by 10 and 90 cents is added, then there is a 0.898979 increase in the log-odds, which is equivalent to multiplying the odds of a YES vote by 2.4571. The 90 cents is insignificant – it’s an artifact of adding roughly a dime to each contribution amount before multiplying by ten and taking the logarithm. To avoid the algebra, appeal to the transformation of Amount in terms of “figures” and ignore the added dime: Multiplying the amount by 10 (i.e. increasing it 900%) results in the odds being multiplied by 2.4571. Consider the $10,000 Republican from above – odds 6.6825. A $100,000 Republican then has odds 2.4571×6.6825 = 16.4196, leading to a probability of 0.9426.
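Both claims in this paragraph can be verified numerically: that multiplying the amount by 10 and adding 90 cents raises F by exactly one unit, and that a one-unit rise in F multiplies the odds by about 2.4571. A sketch, with hypothetical variable names and the $10,000 Republican's odds of 6.6825 taken from the text:

```python
import math

def figures(amount):
    return math.log10(10 * amount + 1)

amount = 10_000
stepped = 10 * amount + 0.90                  # tenfold plus 90 cents
gain = figures(stepped) - figures(amount)     # should be 1 up to rounding

odds_multiplier = math.exp(0.898979)          # log(10A+1) coefficient, Table 4

odds_10k = 6.6825                             # $10,000 Republican, from the text
odds_100k = odds_multiplier * odds_10k
prob_100k = odds_100k / (1 + odds_100k)

print(f"gain in F = {gain:.6f}")
print(f"odds multiplier = {odds_multiplier:.4f}")
print(f"$100,000 Republican: odds = {odds_100k:.4f}, p = {prob_100k:.4f}")
```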
One of the six senators receiving no contribution voted YES. At issue here is whether the model adequately accounts for this. One can construct a rudimentary goodness of fit test to assess this. Recall that 0.0109 is the estimated probability of a YES vote for a Democrat with Amount = 0. Taking this value as the base probability for a YES vote, the probability of at least one YES vote among six is 1 – 0.9891^6 = 0.0635. This suggests some lack of fit for these cases.
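The rudimentary check amounts to a complement rule calculation. The sketch below assumes the six votes are independent with a common probability, which the text's calculation implies but does not state; the variable names are hypothetical.

```python
import math

# Fitted probability of YES for a Democrat with Amount = 0 (so F = 0):
# the logit is just the Table 4 constant, -4.51004.
p_yes = 1 / (1 + math.exp(4.51004))

# Assumption: six independent votes, each with probability p_yes.
n_senators = 6
p_at_least_one = 1 - (1 - p_yes) ** n_senators

print(f"p_yes = {p_yes:.4f}, P(at least one YES) = {p_at_least_one:.4f}")
```

Carrying the unrounded probability through the calculation avoids small discrepancies in the fourth decimal place.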
The estimated 0.0109 probability of a YES for Democrats with Amount = 0 is somewhat inconsistent with the observed 1/6 = 0.1667. (The interaction term in the full model – which puts Democrats more likely to vote YES at low contribution amounts – is clearly influenced by these cases.) One alternative is to isolate these cases, and fit the remaining 93 with a logistic regression. Doing this presumes that a different model applies to Amount = 0 cases; one possible explanation (which students would need to check) is that among these cases are new senators with little opportunity to receive contributions from any source.
All the issues discussed in this paper can also be addressed with the Amount = 0 cases deleted; naturally the fitted values change. The full and reduced model coefficients are similar on inspection, and the interaction term plays less of a role (P-value = 0.690 vs. P-value = 0.122 when the Amount = 0 cases are included). Still, an interaction implies that the curves cross: while they do so at a fairly unlikely contribution amount of $9.22 when the Amount = 0 cases are omitted, this result remains unappealing.
See Appendix 3 for discussion of these issues.
The CAFE dataset generates a good deal of class interest and discussion. It would be interesting and instructive to perform similar analyses for other issues voted on by lawmakers, and to compare results to those obtained for the CAFE data.
Columns | Description |
1 - 18 | Name (Last, First) |
20 - 21 | Two-character abbreviation for state |
23 | Party affiliation (R = Republican, D = Democrat, I = Independent) |
25 - 27 | Vote on CAFE standard (YES, NO) |
30 - 34 | Amount, lifetime contributions from auto manufacturers |
Values are aligned and delimited by blanks. There are no missing values.
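For readers who prefer a script to a spreadsheet import, the column layout above can be read with simple string slicing. This is a sketch under stated assumptions: the `Senator` type and `parse_line` function are hypothetical, the sample record is invented for illustration and is not a row of the data file, and the amount is read from column 30 to the end of the line so that wider values are not truncated.

```python
from typing import NamedTuple

class Senator(NamedTuple):
    name: str
    state: str
    party: str
    vote: str
    amount: int

def parse_line(line):
    """Parse one fixed-width record using the column layout above.
    Columns are 1-based in the layout, so column k is line[k - 1]."""
    return Senator(
        name=line[0:18].strip(),           # columns 1 - 18
        state=line[19:21].strip(),         # columns 20 - 21
        party=line[22:23],                 # column 23
        vote=line[24:27].strip(),          # columns 25 - 27
        amount=int(line[29:].strip()),     # column 30 onward (assumption)
    )

# Invented record, built to match the layout; not taken from the dataset.
sample = f"{'Doe, Jane':<18} NY D YES  {4375:>5}"
print(parse_line(sample))
```

Applying `parse_line` to each non-blank line of the data file would give a list of records ready for tallying Vote by Party.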
“… By any reasonable standard, the most important step the United States could take to simultaneously improve energy security, cut CO2 emission, boost urban air quality, and deprive Middle Eastern terrorists of financing would be to raise fuel efficiency requirements. American cars and trucks burn two of every three barrels of oil used in the United States – and one of every seven barrels used worldwide – a figure that is hardly surprising, given that economy standards have been frozen since 1988. Today, American cars need to achieve an average fuel economy of just 27.5 miles per gallon, while “light trucks,” that hugely popular category that includes pickups and SUVs, need achieve only 20.5 miles per gallon. Even a modest improvement in fuel-economy standards – say thirty-two miles per gallon for cars and twenty-four per gallon for light trucks – would by 2010 be saving 2.7 million barrels per day – or nearly twice as much as could be pumped every day from the Arctic National Wildlife Refuge.
Yet so far, even that small change has proved to be a political impossibility. Although such efficiency improvements are already technically feasible – Ford’s Escape SUV, a gas-electric hybrid, gets 36 miles per gallon in the city [Consumer Reports’ most recent test of the 2005 Escape, published in the August 2005 edition, found values of 22 (city) and 29 (highway); this “EPA shortfall” is common, and is detailed in the October 2005 issue of Consumer Reports] – U.S. automakers and the big automotive unions have persuaded Congress not to raise fuel-efficiency standards since the late 1980s. Why? Among other reasons, because any regulations requiring greater fuel efficiency will initially favor Japanese and German automakers, whose fleets are already more fuel-efficient – thereby costing U.S. companies more of their market share and U.S. auto workers more of their jobs. And such losses are not inconsequential to American politicians. Since 1990, the U.S. transportation industry has made more than $256 million in campaign contributions. Whereas nearly 70 percent has wound up with Republicans, Democrats haven’t been shy about asking for auto dollars, especially from the auto workers’ unions. No surprise the CAFE has never come close to being updated.”
“Automobile and Light Truck Fuel Economy: The CAFE Standard” (2002), Almanac of Policy Issues [Online], September 25, 2002.
www.policyalmanac.org/environment/archive/crs_cafe_standards.shtml
“CAFE Standard,” (2002) Public Campaign [Online].
www.howdarethey.org/news/cafe/
“The fuel-economy shell game” (2004), Consumer Reports 69(5), 8.
“Not easy being green” (2005), Consumer Reports 70(8), 50-57.
“CR Investigates: Fuel economy” (2005), Consumer Reports 70(10), 20-23.
De Veaux, R. D., Velleman, P. F., and Bock, D. E. (2005), Stats: Data and Models, Boston: Pearson Education, Inc.
Hakim, D. (2004a, March 25), “Kerry Is Sticking With Plan to Raise Auto Fuel Efficiency,” The New York Times [Online].
Hakim, D. (2004b, May 5), “Average U.S. Car Is Tipping Scales at 4,000 Pounds,” The New York Times [Online].
Hosmer, D. W., and Lemeshow, S. (2000), Applied Logistic Regression (2nd ed.), New York: John Wiley & Sons, Inc.
Love, T. E. (1998), “A Project-Driven Second Course,” Journal of Statistics Education [Online], 6(1).
jse.amstat.org/v6n1/love.html
Moore, D. S., McCabe, G. P., Duckworth, W. M., and Sclove, S. L. (2003), The Practice of Business Statistics, New York: W. H. Freeman and Company.
Roback, P. J. (2003), “Teaching an Advanced Methods Course to a Mixed Audience,” Journal of Statistics Education [Online], 11(2).
jse.amstat.org/v11n2/roback.html
Roberts, P. (2004), The End of Oil: On the Edge of a Perilous New World, New York: Mariner Books.
Simonoff, J. S. (1997), “The ‘Unusual Episode’ and a Second Statistics Course,” Journal of Statistics Education [Online], 5(1).
jse.amstat.org/v5n1/simonoff.html
U.S. Senate Roll Call Votes 107th Congress - 2nd Session [Online].
www.senate.gov/legislative/LIS/roll_call_lists/roll_call_vote_cfm.cfm?congress=107&session=2&vote=00047
Scott R. Preston
Department of Mathematics
SUNY Oswego
Oswego, NY 13126
U.S.A.
srp@oswego.edu