Move Over, Roger Maris:
Breaking Baseball's Most Famous Record

Jeffrey S. Simonoff
New York University

Journal of Statistics Education v.6, n.3 (1998)

Copyright (c) 1998 by Jeffrey S. Simonoff, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the author and advance notification of the editor.

Key Words: Contingency table; Exploratory data analysis; Smoothing; Sports statistics.


The race between Mark McGwire and Sammy Sosa to break the major league season home run record captured the attention of sports fans (and even non-sports fans) during the summer of 1998. In this article the game-by-game home run performance of each of these players is provided, along with some team statistics for each game. This dataset offers rich possibilities for analysis in both introductory and advanced statistics courses, including graphical exploratory displays, categorical data analysis, analysis of variance, logistic regression, and smoothing methods for Poisson and binomial data.

1. Introduction

1 The sports world in general, and the baseball world in particular, was electrified by the attempts of Mark McGwire (of the St. Louis Cardinals) and Sammy Sosa (of the Chicago Cubs) to break Roger Maris' 37-year-old all-time season home run record during the 1998 season. Indeed, the race ultimately transcended sports entirely, becoming the lead story in newspapers and television news reports. McGwire broke the record with his 62nd home run on September 8, and Sosa followed suit five days later. Ultimately McGwire ended the season with an almost unbelievable 70 home runs, while Sosa finished with 66.

2 The great interest in these record-breaking performances makes it natural to consider examining them more carefully in statistics classes. Not everyone is a baseball enthusiast, however, and it is certainly a good idea for teachers to use datasets of this type judiciously, so that nonfans don't become bored or confused.

3 Many different questions can be addressed using these data, including informal comparisons of the two players' performances and investigation of potential factors relating to home run hitting, including home field, team success, and performance by teammates. Indeed, in some ways a rich dataset offers too many possible questions, since it becomes more likely that surprising relationships can arise by random chance. This doesn't mean that instructors and students shouldn't explore the dataset in ways that interest them, but it does mean that a bit of caution should be used when interpreting what is found.

2. The Dataset

4 The dataset consists of game-by-game information for McGwire and the Cardinals, and Sosa and the Cubs. For each game (for each team) the following information is given: the date of the game (both as text in the form of month and date and as the number of days from the start of the season on March 31), whether the game was played at home or on the road, the runs scored by the team and by its opponent, the game result, the number of home runs hit by McGwire or Sosa, the number of runs driven in by those home runs, and an indicator variable of whether McGwire or Sosa played in the game. The data were obtained from the official web sites for the St. Louis Cardinals and Chicago Cubs.

5 The standard length of the baseball season is 162 games, but each team played an additional game worth special mention. On August 24 the Cardinals played a game against the Pittsburgh Pirates in Pittsburgh that was rained out after 6 1/2 innings, ending in a 5-5 tie. All statistics for the game are considered official in this context. The game was replayed as part of a doubleheader in St. Louis on September 15, which is why the Cardinals played 82 home games, rather than the expected 81. The Cubs ended up in a tie with the San Francisco Giants for the wild card postseason berth, and the two teams played a one-game playoff in Chicago on September 28 to determine which would play in the postseason. All statistics for the game are considered as part of the regular season in this context, which is why the Cubs also played 82 home games.

3. Exploratory Analysis

6 The first step in the analysis of any dataset is to look at the data. Frequency distributions of the number of home runs by each player show that Sosa had more multi-home run games (two or more home runs), 11 to 10 (Sosa's 11 tied the major league record). Runs scored and given up by each team are long right-tailed, reflecting that while teams usually score between two and five runs, occasionally they score as many as 10 runs or more. See Figure 1 for a histogram of St. Louis' runs scored.

Figure 1. Histogram of Runs Scored by St. Louis.

7 From the point of view of team success it is the difference between runs scored and runs given up (the victory margin, where negative values imply a loss) that matters. Figure 2 is a histogram of this variable for St. Louis. It is obvious that this variable is much more symmetric than either the runs scored or the runs given up variable, illustrating that the distribution of the difference between two variables can have a very different shape from that of the underlying variables.
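The visual impression of symmetry can be backed up with a number. The following is a minimal sketch, assuming the runs scored and runs given up columns have already been read into lists of integers; the function names are mine and are not part of the S-PLUS code that accompanies the article. It computes the moment-based sample skewness, which is near zero for a symmetric variable and positive for a long right tail.

```python
def sample_skewness(values):
    """Moment-based skewness: mean cubed deviation over the cubed standard deviation.

    Assumes the values are not all identical (otherwise the variance is zero).
    """
    n = len(values)
    mean = sum(values) / n
    second_moment = sum((v - mean) ** 2 for v in values) / n
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / second_moment ** 1.5


def victory_margin(runs_for, runs_against):
    """Runs scored minus runs given up, game by game (negative means a loss)."""
    return [scored - allowed for scored, allowed in zip(runs_for, runs_against)]
```

Applying sample_skewness to runs scored, to runs given up, and to the output of victory_margin gives three numbers that can be compared directly with the shapes seen in Figures 1 and 2.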

Figure 2. Histogram of Victory Margin for St. Louis.

8 A particularly compelling view of the home run race between McGwire and Sosa is given in Figure 3. The figure tracks the total numbers of home runs hit by each player (McGwire in blue circles, Sosa in purple crosses) by the calendar date, with Roger Maris' 61 home runs marked by a dotted line. By plotting versus calendar date (rather than game number) several remarkable patterns emerge. McGwire was far ahead of Sosa until June (day 63), when Sosa went on a tear, setting a record with 20 home runs in the month. The players were tied at the end of the day for the first time on August 10 (day 133), and were tied again at the end of 12 other days (the last being September 25, day 179). Amazingly, despite the closeness of the race the last seven weeks of the season, Sosa was never leading the race alone at the end of a day; in fact, he was only ahead of McGwire for less than two hours during the entire season!

Figure 3. Total Home Runs Hit by Each Player by Calendar Date. McGwire's totals are shown by blue circles and Sosa's by purple crosses. Roger Maris' 61 home runs are marked by a dotted line.

9 The long tails for the runs scored and runs given up variables suggest that a transformation of the variables might be useful before formal analysis. Possible choices of transformations include logarithms or square roots (the latter being the variance-stabilizing transformation for count data).
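The variance-stabilizing property of the square root can be illustrated with a small simulation. This is only a sketch under an idealized assumption: the counts are drawn from Poisson distributions, which the article does not claim for runs scored, and the means, seed, and function names are arbitrary choices of mine. For Poisson counts the variance equals the mean, while the variance of the square root stays close to 1/4 once the mean is not too small.

```python
import math
import random


def poisson_draw(rng, mean):
    """One Poisson variate by Knuth's multiplication method (fine for small means)."""
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def variance(values):
    """Unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def compare_variances(mean, n_draws=20000, seed=1998):
    """Return (variance of raw counts, variance of square-root counts)."""
    rng = random.Random(seed)
    counts = [poisson_draw(rng, mean) for _ in range(n_draws)]
    return variance(counts), variance([math.sqrt(c) for c in counts])
```

Calling compare_variances with two quite different means (say 4 and 16) should show raw variances that differ by roughly the ratio of the means, and square-root variances that are much closer to each other.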

4. Categorical Data Analysis

10 Many of the interesting variables in the dataset are categorical, making it an excellent source of data for analyses of contingency tables and tests of hypotheses related to such tables. There are many interesting possibilities, and I will only mention a few here. Tables 1 and 2 show cross-classifications of the number of home runs hit and the game result, along with column percentages, for McGwire and Sosa, respectively. St. Louis' tie is omitted, and games with two or three home runs hit are combined into one category.

Table 1. Cross-Classification of Home Runs Hit and Game Result for McGwire

        Counts                      Column percentages
        Zero   One   Two or more    Zero   One   Two or more
Loss      55    23             1    52.9  47.9          10.0
Win       49    25             9    47.1  52.1          90.0

11 As might be expected, for McGwire the two variables are related, with the Cardinals being more successful when McGwire hits more home runs. A Pearson chi-squared test of independence confirms this, with a value of 6.73 on 2 degrees of freedom (p-value = .034).

Table 2. Cross-Classification of Home Runs Hit and Game Result for Sosa

        Counts                      Column percentages
        Zero   One   Two or more    Zero   One   Two or more
Loss      49    19             5    45.0  44.2          45.5
Win       60    24             6    55.0  55.8          54.5

12 Remarkably, not only is this not the case for Sosa, but the team winning percentages at each of his three home run levels are extraordinarily similar, with the Cubs winning roughly 55% of the games. A Pearson chi-squared test of independence confirms this, with a value of 0.01 on 2 degrees of freedom (p > .99). A more correct version of these tests would omit games where McGwire or Sosa did not play. If this is done, the observed associations between the variables strengthen slightly.
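The two Pearson statistics can be reproduced directly from the counts in Tables 1 and 2. The sketch below is my own illustration in Python rather than the S-PLUS code supplied with the article; it uses only the standard library, and the tail probability function relies on a closed form that holds only when the degrees of freedom are even (as they are here).

```python
import math


def pearson_chi_squared(table):
    """Pearson statistic and degrees of freedom for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


def chi_squared_upper_tail(statistic, df):
    """P(chi-squared with df degrees of freedom > statistic); even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form used here requires positive even df")
    half = statistic / 2
    term = total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Rows are Loss, Win; columns are zero, one, two or more home runs.
MCGWIRE = [[55, 23, 1], [49, 25, 9]]
SOSA = [[49, 19, 5], [60, 24, 6]]

for name, table in (("McGwire", MCGWIRE), ("Sosa", SOSA)):
    stat, df = pearson_chi_squared(table)
    print(name, round(stat, 2), df, round(chi_squared_upper_tail(stat, df), 3))
```

For McGwire the expected count in the loss, two-or-more cell is just under five, so an instructor may want to raise the usual caution about the chi-squared approximation with small expected counts.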

13 More complicated analyses are also possible. Table 3 is a 2x2x2 table cross-classifying team, game location (home or away), and game result. St. Louis' tie is again omitted.

Table 3. Team Success for Chicago and St. Louis at Home and Away

        Chicago       St. Louis
        Loss   Win    Loss   Win
Away      42    39      45    35
Home      31    51      34    48

14 Loglinear modeling can be used to investigate the associations in the table. A simple model that fits particularly well is the constant odds ratio model, where the home field advantage (the association between location of the game and the result) is the same for each team (Pearson statistic = 0.55 on 4 degrees of freedom, p = .97).
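The similarity of the home field effect can be checked by hand from Table 3. The sketch below computes each team's location-by-result odds ratio, and then a Pearson statistic under one particular reading of the fitted model: a single location-by-result table shared by both teams, with no terms involving team at all. That reading is my assumption, chosen because it leaves 8 - 4 = 4 degrees of freedom and reproduces the quoted figures; the article's S-PLUS code may specify the model differently.

```python
import math

# COUNTS[team][location] = (losses, wins), taken from Table 3.
COUNTS = {
    "Chicago": {"Away": (42, 39), "Home": (31, 51)},
    "St. Louis": {"Away": (45, 35), "Home": (34, 48)},
}


def home_field_odds_ratio(team):
    """Odds of winning at home divided by odds of winning away."""
    away_loss, away_win = COUNTS[team]["Away"]
    home_loss, home_win = COUNTS[team]["Home"]
    return (home_win * away_loss) / (home_loss * away_win)


def pearson_shared_table():
    """Pearson statistic when both teams are fitted by one common 2x2 table."""
    statistic = 0.0
    for location in ("Away", "Home"):
        for outcome in (0, 1):  # 0 = loss, 1 = win
            pooled = sum(COUNTS[team][location][outcome] for team in COUNTS)
            fitted = pooled / len(COUNTS)  # same fitted count for each team
            for team in COUNTS:
                observed = COUNTS[team][location][outcome]
                statistic += (observed - fitted) ** 2 / fitted
    return statistic


def upper_tail_4df(statistic):
    """Upper-tail probability of a chi-squared variable with 4 degrees of freedom."""
    half = statistic / 2
    return math.exp(-half) * (1 + half)
```

The two odds ratios come out close to each other (both near 1.8), which is what a constant odds ratio model requires, and the Pearson statistic under the shared-table reading is about 0.55 with a tail probability of about .97.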

5. Other Analyses

15 Many other uses of the data are possible, depending on the level and coverage of the class. Possibilities include the following:

  1. Investigation of other home field effects for each team. For example, do runs scored or given up differ at home versus on the road? This can be explored using side-by-side boxplots, two-sample t-tests (possibly after transformation of the variables), and nonparametric tests for comparisons of groups.
  2. Were McGwire's and Sosa's home runs related to runs accounted for by other players? If so, this suggests that a general factor (such as the weather or the quality of the opposing pitching) helps account for the player's home run hitting. This can be investigated using side-by-side boxplots and analysis of variance, with the target variable being the difference between the total runs scored by the team (St. Louis or Chicago, respectively) and the runs driven in by the player's (McGwire or Sosa, respectively) home runs, and the groups being defined by the number of home runs hit.
  3. Was player or team performance related to time of year? The month (text) variable can be used to construct side-by-side boxplots and perform analyses of variance.
  4. Can the probability of winning a game, or the probability of McGwire or Sosa hitting a home run, be modeled as a function of other variables? This is a logistic regression problem.
  5. Figure 4 is a nice way (along with Figure 3) to summarize the McGwire/Sosa home run race. The figure gives smooth curves estimating the home run per game rate for each player (a blue solid line for McGwire, a purple dashed line for Sosa). The curves are derived as local quadratic nonparametric regression curves for a Poisson regression of the number of home runs hit versus calendar date of the season (see Fan and Gijbels 1996, or Simonoff 1996, for discussion of such estimates). The curves show that McGwire started off hitting home runs at a much higher rate than Sosa, with the rate increasing slowly until peaking near the end of May. Sosa's rate increased rapidly from the start of the season, passing McGwire's at the end of May and peaking in mid-June. Both players slowed down until late July, when their home run rates both began to increase again. Sosa's rate remained higher than McGwire's until September, when it leveled off and then dropped. McGwire, on the other hand, finished the season with a rush, hitting 15 home runs in September, and five home runs the last three days of the season. This likelihood-based smoothing method also can be used with a binomial target variable, investigating (for example) the teams' victory probabilities over the course of the season. More formal analyses based on the time series structure of the data are also possible (although there appears to be little autocorrelation in the variables).

Figure 4. Smooth Curves Estimating the Home Run Rate per Game for Each Player. The blue solid line is for McGwire and the purple dashed line is for Sosa.

This analysis is not one that would routinely be performed in an introductory class. Without further information about what home run per game rate functions look like for other players, it's not possible to say whether the observed patterns for Sosa and McGwire are typical or atypical. That is, they are interesting from an exploratory point of view, but don't necessarily reflect an underlying structure in home run hitting by sluggers like McGwire and Sosa. The actual construction of the curves is clearly beyond the scope of introductory classes. Still, I have found that students in introductory classes are comfortable with the idea of putting a smooth curve through a scatter plot, and the concept of home run hitting rates is a natural one. While the curves in Figure 4 can't resolve whether home runs occur in bunches or not, they are descriptive of what the observed pattern was for these two players during their record-setting years, and illustrate the kinds of sophisticated methodology available to statisticians to uncover hidden structure.
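For readers who want to experiment with rate curves of this kind, here is a deliberately simplified sketch. It is not the local quadratic estimator used for Figure 4: it is the local constant version, in which maximizing the kernel-weighted Poisson log-likelihood for a constant rate reduces to a kernel-weighted average of the per-game home run counts. The Gaussian kernel, the bandwidth argument, and the function names are my own choices.

```python
import math


def gaussian_kernel(u):
    """Unnormalized Gaussian weight; the normalizing constant cancels in the ratio."""
    return math.exp(-0.5 * u * u)


def smoothed_rate(days, home_runs, at_day, bandwidth):
    """Local constant estimate of home runs per game near calendar day at_day."""
    weights = [gaussian_kernel((day - at_day) / bandwidth) for day in days]
    weighted_total = sum(w * hr for w, hr in zip(weights, home_runs))
    return weighted_total / sum(weights)


def rate_curve(days, home_runs, bandwidth):
    """Estimated rate at every distinct calendar day on which a game was played."""
    return [
        (day, smoothed_rate(days, home_runs, day, bandwidth))
        for day in sorted(set(days))
    ]
```

Here days would be the calendar date variable and home_runs the per-game home run counts for one player. Larger bandwidths give smoother curves, and a local constant fit tends to be biased near the start and end of the season, which is one reason to prefer local polynomial fits in practice.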

6. Conclusions

16 The McGwire/Sosa home run race captivated sports and non-sports fans alike, and provides a rich source of data for many different statistical analyses. The connections to individual performance (rather than just team performance) are a bonus, giving a personal connection (i.e., giving a "face") to the numbers.

17 The title of this article, which refers to baseball's most famous record (not its greatest), was chosen carefully. There are other records usually considered more difficult to break than the season home run record. Examples include Hack Wilson's season record of 190 runs batted in, Cy Young's 511 career victories, Nolan Ryan's 5714 career strikeouts, and Joe DiMaggio's 56 consecutive game hitting streak. Personally, I have to agree with New York Mets announcer (and baseball Hall of Famer) Ralph Kiner, who says that the record that will probably never be broken is Johnny Vander Meer's two consecutive no-hit no-run games. After all, someone would have to pitch three consecutive no-hitters to break it!

7. Getting The Data

18 The file homerun.dat.txt contains the raw data. The file homerun.txt is a documentation file containing a brief description of the dataset. The file homerun.s contains the S-PLUS commands used to create the figures, tables, and statistics discussed in this article.

Appendix - Key To Variables in homerun.dat.txt

       1 -  3  Game number
       5 - 13  Month of game (St. Louis)
      15 - 16  Date of game (St. Louis)
      18 - 20  Calendar date of game [days since beginning
               of season] (St. Louis)
           22  Game location (St. Louis)
                     (0 = Away, 1 = Home)
      24 - 25  Runs scored (St. Louis)
      27 - 28  Runs scored by opposition (St. Louis)
      30 - 31  Game result (St. Louis)
                     (-1 = Tie, 0 = Loss, 1 = Win)
           33  Number of home runs hit by McGwire
           35  Runs driven in by McGwire's home runs
           37  McGwire game status
                     (0 = Played, 1 = Did not play)
      39 - 47  Month of game (Chicago)
      49 - 50  Date of game (Chicago)
      52 - 54  Calendar date of game [days since beginning
               of season] (Chicago)
           56  Game location (Chicago)
                     (0 = Away, 1 = Home)
      58 - 59  Runs scored (Chicago)
      61 - 62  Runs scored by opposition (Chicago)
           64  Game result (Chicago)
                     (0 = Loss, 1 = Win)
           66  Number of home runs hit by Sosa
           68  Runs driven in by Sosa's home runs
           70  Sosa game status
                     (0 = Played, 1 = Did not play)

Values are aligned and delimited by blanks.
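As an alternative to the S-PLUS code in homerun.s, the column key above can be turned into a small reader. The following sketch slices each record by column position; the field names are labels I have invented for illustration, and treating an all-blank field as missing is an assumption, since the key does not say how missing values are coded.

```python
# (field name, first column, last column), 1-indexed and inclusive, from the key above.
LAYOUT = [
    ("game", 1, 3),
    ("stl_month", 5, 13), ("stl_date", 15, 16), ("stl_day", 18, 20),
    ("stl_home", 22, 22), ("stl_runs", 24, 25), ("stl_opp_runs", 27, 28),
    ("stl_result", 30, 31), ("mcgwire_hr", 33, 33), ("mcgwire_hr_rbi", 35, 35),
    ("mcgwire_out", 37, 37),
    ("chi_month", 39, 47), ("chi_date", 49, 50), ("chi_day", 52, 54),
    ("chi_home", 56, 56), ("chi_runs", 58, 59), ("chi_opp_runs", 61, 62),
    ("chi_result", 64, 64), ("sosa_hr", 66, 66), ("sosa_hr_rbi", 68, 68),
    ("sosa_out", 70, 70),
]
TEXT_FIELDS = {"stl_month", "chi_month"}


def parse_record(line):
    """Turn one fixed-width line into a dict; blank numeric fields become None."""
    record = {}
    for name, first, last in LAYOUT:
        text = line[first - 1:last].strip()
        if name in TEXT_FIELDS:
            record[name] = text
        else:
            record[name] = int(text) if text else None
    return record


def read_games(path="homerun.dat.txt"):
    """Read every non-empty line of the data file into a list of records."""
    with open(path) as handle:
        return [parse_record(line) for line in handle if line.strip()]
```

Note that the game status fields are coded with 1 meaning the player did not play, so games in which McGwire appeared are those with mcgwire_out equal to 0.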


References

Fan, J., and Gijbels, I. (1996), Local Polynomial Modelling and Its Applications, London: Chapman and Hall.

Simonoff, J. S. (1996), Smoothing Methods in Statistics, New York: Springer.

Jeffrey S. Simonoff
Department of Statistics and Operations Research
New York University
44 West 4th Street, Rm. 8-54
New York, NY 10012-0258
