Data Management, Exploratory Data Analysis, and Regression Analysis with 1969-2000 Major League Baseball Attendance

James J. Cochran
Louisiana Tech University

Journal of Statistics Education Volume 10, Number 2 (2002)

Copyright © 2002 by James J. Cochran, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the author and advance notification of the editor.

Key Words: Classroom data; Exploratory data analysis; Regression analysis.


Abstract

The 1969-2000 Major League Baseball Attendance dataset contains Runs Scored, Runs Allowed, Wins, Losses, Number of Games Behind the Division Leader, and Home Game Attendance of each major league franchise for the 1969 through 2000 seasons. Also included for each franchise are its location, league affiliation (National or American), and division affiliation (East, Central, or West). These data have been used in a project-based modeling course to instruct students in basic data management, the use of exploratory data analysis to "clean" data, and construction of regression models. The dataset, which is both cross-sectional and time-series, is of a manageable size and easily understood. Furthermore, it provides a useful, interesting, and realistic classroom example for discussing many important statistical concepts.

1. Introduction

This paper describes the use of the 1969-2000 Major League Baseball Attendance dataset in a capstone data analysis course designed for undergraduate and master's-level quantitative analysis students. Although most of the enrolled students were working toward business degrees, each had completed a full sequence of mathematical calculus (not business calculus) courses, a course in linear algebra, and two courses in computer programming using a high-level language. Additionally, the undergraduates had completed a three-course sequence in mathematical statistics, and the graduate students had completed a two-course mathematical statistics sequence with similar coverage. Students at both levels had also completed a course in statistical methods.

To provide the students with a comprehensive data analysis experience, the course was designed around an analytic project. At the outset of the course, the class was divided into groups of three or four students. Each student group was given a diskette containing a unique part of a dataset that the instructor had intentionally corrupted by introducing various anomalies, including missing values, repeated observations, and misplaced decimals. The first eight weeks of the ten-week course were devoted to discussions of exploratory data analysis, data management, advanced topics in statistical modeling, and the role of a statistical consultant. During the last two weeks of the quarter, the instructor met individually with the student groups to help refine their analyses and polish their presentations. During the final-examination period, each group submitted a written report and presented its results to an audience composed of classmates, faculty, and Ph.D. students.

2. The Dataset

These data appear in most encyclopedic sources on baseball, including The Baseball Encyclopedia (1993) and Total Baseball (2001). The data include Runs Scored, Runs Allowed, Wins, Losses, Number of Games Behind the Division Leader, and Home Game Attendance for each major league franchise for each season from 1969 through 2000. Also included for each franchise are its location, league affiliation (National or American), and division affiliation (East, Central, or West). Many important events in major league baseball occurred over the period included in the data. A list of some of these events includes:

Each of these events may have had an impact on the total attendance of a franchise by reducing the number of games it played and by antagonizing fans. These issues must be dealt with when modeling attendance. Additionally, several new stadia opened during this period:

Each of these new stadia could also have had an impact on both on-field performance measures and attendance. The above list does not include stadia whose opening or initial use for major league baseball coincides with the inaugural season of a new or relocated franchise. Note also that the use of artificial turf increased over the early years covered by these data and has more recently decreased.

3. The Course

During the first class meeting, each of the student groups was provided with a unique portion of the 1969-2000 Major League Baseball Attendance dataset. The data were limited to the 1969 through 1992 seasons to simplify the modeling process somewhat (by reducing the number of expansions, work stoppages, and changes in scheduling format the students would have to consider). The subsets distributed to the student groups did not overlap, and together they formed the complete dataset used for the course. No subset contained data for every season, nor did any subset contain data for every team.

Students were also given a description of the contents of the full dataset. The class was then instructed to use these data (and any other data that they cared to collect) to determine whether baseball franchises in large markets had significantly higher attendance than their counterparts in small markets. The definitions of large and small market were left to the students’ discretion.

As the class progressed through a discussion of exploratory data analysis, each group was expected to use the methods discussed in class to clean its unique portion of the data. The groups used various univariate measures (such as the maximum, minimum, range, and mean) to detect data that appeared erroneous. The students then consulted The Baseball Encyclopedia to ascertain the validity of these data and make changes as necessary.
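The kind of univariate screening described above can be sketched in a few lines. This is a minimal illustration in Python, not the SAS code the students used; the records, the franchise codes AAA and BBB, and the plausibility bounds on wins are all hypothetical, chosen only to show how a repeated observation, a missing value, and a misplaced decimal might reveal themselves.

```python
from statistics import mean

# Hypothetical records: (franchise, season, wins, losses, attendance).
# These are made-up values with planted anomalies, not rows from the dataset.
records = [
    ("AAA", 1975, 88, 74, 1250000),
    ("AAA", 1976, 91, 71, 1310000),
    ("AAA", 1976, 91, 71, 1310000),   # repeated observation
    ("BBB", 1975, 79, 83, None),      # missing attendance
    ("BBB", 1976, 8.4, 78, 1020000),  # misplaced decimal in wins
]

def summarize(values):
    """Return the number of missing values plus min, max, range, and mean."""
    present = [v for v in values if v is not None]
    return {
        "n_missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "range": max(present) - min(present),
        "mean": mean(present),
    }

wins_summary = summarize([r[2] for r in records])
attendance_summary = summarize([r[4] for r in records])

# A franchise-season pair should appear only once.
seen, duplicates = set(), []
for r in records:
    key = (r[0], r[1])
    if key in seen:
        duplicates.append(key)
    seen.add(key)

# Assumed plausibility bounds for season wins (an illustrative choice).
suspect_wins = [r for r in records if not 30 <= r[2] <= 130]

print(wins_summary)
print(attendance_summary)
print("duplicates:", duplicates)
print("suspect wins:", suspect_wins)
```

A minimum of 8.4 wins stands out immediately in the summary, which is the cue to check the value against a reference source before correcting it.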

As part of the ensuing coverage of data management issues, the groups exchanged their cleaned datasets and attempted to aggregate their accumulated data into a single, cohesive dataset. The student groups had to cooperatively interleave and concatenate their individual datasets to create a single comprehensive dataset.
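As a rough sketch of this aggregation step, the fragment below concatenates and interleaves three small subsets and then checks that they neither overlap nor leave gaps. It uses Python's standard library rather than SAS, and the franchise codes, seasons, and attendance figures are hypothetical.

```python
from heapq import merge
from itertools import product

# Hypothetical cleaned subsets; each row is (franchise, season, attendance).
group_1 = [("AAA", 1969, 900000), ("BBB", 1970, 1100000)]
group_2 = [("AAA", 1970, 950000), ("CCC", 1969, 700000)]
group_3 = [("BBB", 1969, 1050000), ("CCC", 1970, 720000)]
subsets = [group_1, group_2, group_3]

def key(row):
    """Identify an observation by its franchise and season."""
    return (row[0], row[1])

# Concatenate: stack the subsets end to end, in the order received.
concatenated = [row for subset in subsets for row in subset]

# Interleave: merge the subsets, each sorted by key, into one sorted dataset.
interleaved = list(merge(*(sorted(s, key=key) for s in subsets), key=key))

# The subsets should not overlap and should jointly cover every
# franchise-season combination.
keys = [key(row) for row in interleaved]
overlaps = {k for k in keys if keys.count(k) > 1}
expected = set(product(["AAA", "BBB", "CCC"], [1969, 1970]))
missing = expected - set(keys)

for row in interleaved:
    print(row)
print("overlaps:", overlaps)
print("missing:", missing)
```

Checking for overlaps and gaps after the merge is a design choice of this sketch; it mirrors the property that the distributed subsets were disjoint and jointly complete.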

Once the student groups successfully aggregated the data, the class began a discussion of statistical modeling. At this point, the course could move in whatever direction the instructor deemed appropriate. Depending on the class project chosen, the instructor could opt to cover any of a wide variety of topics in applied statistics. The course was designed to allow for reasonable coverage of topics such as regression analysis, design and analysis of experiments, time-series analysis, nonparametrics, or categorical data analysis from an applied perspective. Thus, the course could be tailored to address the specific interests of a particular class of seniors and graduate students.

The first time the course was offered, the 1969-2000 Major League Baseball Attendance dataset was used to facilitate coverage of statistical modeling with regression analysis. Topics covered in the course included model-building strategies, dummy variables, interactions, assumptions, and diagnostics. These topics were covered in much greater detail than in the students’ previous course work. Edward W. Frees’ Data Analysis Using Regression Models (1996) was used as the course text, and the students used SAS software to perform all data management tasks and analyses.

As each topic (including model building and selection, and inference and assumptions) was discussed, the student groups were encouraged to apply what they were learning to their model(s). In using the 1969-2000 Major League Baseball Attendance dataset, the students encountered minor problems with multicollinearity, moderate problems with heteroscedasticity, and substantial problems with autocorrelation (as would be expected with any large dataset that is both cross-sectional and time-series in nature). Eventually each student group independently realized that the heteroscedasticity and autocorrelation were both easily corrected through inclusion of time as an independent variable. Student groups also had to develop their own definition of what was meant by small market (each group defined this characteristic somewhat differently). Finally, the student groups wrestled with the implications of franchise-related differences in the residuals (for example, models consistently underpredicted the Los Angeles Dodgers’ attendance but almost always overpredicted attendance for the Oakland A’s).
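The effect of adding time as an independent variable can be illustrated with a small simulation. The sketch below is not the students' model: it generates a synthetic attendance series for a single hypothetical franchise with an upward trend, fits least-squares models with and without a time index, and compares the Durbin-Watson statistic of the residuals. The choice of the Durbin-Watson statistic as the diagnostic, the coefficients, and the noise level are all assumptions made for illustration, and the example addresses only autocorrelation, not heteroscedasticity.

```python
import random

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def residuals(X, y, beta):
    """Observed values minus fitted values."""
    return [yi - sum(b * x for b, x in zip(beta, row)) for row, yi in zip(X, y)]

def durbin_watson(res):
    """Sum of squared successive differences over the sum of squares."""
    num = sum((res[t] - res[t - 1]) ** 2 for t in range(1, len(res)))
    return num / sum(r * r for r in res)

# Synthetic data for one hypothetical franchise over 24 seasons.
rng = random.Random(0)
seasons = list(range(24))  # 0..23 stands in for 1969..1992
wins = [rng.randint(60, 100) for _ in seasons]
# Attendance in thousands: trend in time + effect of wins + noise (assumed).
attendance = [500 + 40 * t + 5 * w + rng.gauss(0, 50)
              for t, w in zip(seasons, wins)]

X_no_time = [[1.0, w] for w in wins]
X_time = [[1.0, w, t] for w, t in zip(wins, seasons)]

dw_no_time = durbin_watson(
    residuals(X_no_time, attendance, ols(X_no_time, attendance)))
dw_time = durbin_watson(
    residuals(X_time, attendance, ols(X_time, attendance)))

print("Durbin-Watson without time:", round(dw_no_time, 2))
print("Durbin-Watson with time:   ", round(dw_time, 2))
```

Values of the statistic near 2 suggest little first-order autocorrelation, while values well below 2 suggest positive autocorrelation. In this synthetic setting the omitted trend is what drives the autocorrelation, so the statistic moves toward 2 once time enters the model.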

Eventually each student group developed a unique model with its own merits and shortcomings. Although the models differed, each group reached a similar conclusion regarding the large market/small market attendance issue: the models consistently estimated that large market franchises outdrew small market franchises by a statistically significant 450,000 to 800,000 fans annually.

4. Conclusions

Having a single, well-defined course objective encouraged the student groups to consider practical uses for course material as the classroom discussion progressed. This resulted in a much greater level of classroom participation and student involvement. The students generally completed their reading assignments before class meetings and came to class with prepared questions. Furthermore, each student group collected additional demographic and socio-economic data on the various markets and used these data to classify markets as large or small. The student groups also used some of these data as independent variables in their final regression analyses.

The models produced by the student groups were quite diverse, but each was well conceived and thorough. Interestingly, each group reported a significant difference between large and small markets in attendance, but the magnitude of the estimated difference varied greatly. Their oral presentations were well received, and many of the faculty complimented the student groups on the quality of their analyses and presentations. The class has since been offered again (emphasizing generalized linear models) with similar results.

5. Getting The Data

The file MLBattend.dat.txt contains the uncorrupted data in the fixed column format provided in the Appendix. The file MLBattend.txt is a documentation file containing a brief description of the dataset.

Appendix - Key To Variables in MLBattend.dat.txt

Columns    Variable
 1 -  4    Major League Baseball franchise
 9 - 10    League affiliation (National or American)
16 - 19    Division affiliation (East, Central, or West)
25 - 26    Season
32 - 38    Home game attendance
43 - 46    Runs scored
51 - 54    Runs allowed
59 - 61    Wins
66 - 68    Losses
74 - 77    Number of games behind the division winner

Values are aligned and delimited by blanks. There are no missing values.
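A fixed-column file laid out as in the key above could be read along the following lines. This is a hedged sketch: the sample line is constructed by a hypothetical helper (`make_line`) from made-up values, not taken from the data file, and the right-justification of values, the treatment of games behind as a decimal number, and the mapping of two-digit seasons to four-digit years are assumptions.

```python
# (name, first column, last column), 1-indexed and inclusive, from the key.
LAYOUT = [
    ("franchise", 1, 4), ("league", 9, 10), ("division", 16, 19),
    ("season", 25, 26), ("attendance", 32, 38), ("runs_scored", 43, 46),
    ("runs_allowed", 51, 54), ("wins", 59, 61), ("losses", 66, 68),
    ("games_behind", 74, 77),
]
INT_FIELDS = ("attendance", "runs_scored", "runs_allowed", "wins", "losses")

def parse_line(line):
    """Slice one fixed-column line into a dict of typed fields."""
    rec = {name: line[first - 1:last].strip() for name, first, last in LAYOUT}
    for name in INT_FIELDS:
        rec[name] = int(rec[name])
    # Assumption: games behind may be fractional (for example, 3.5).
    rec["games_behind"] = float(rec["games_behind"])
    # Assumption: the two-digit season 69..99 means 1969..1999; 00 means 2000.
    yy = int(rec["season"])
    rec["season"] = 1900 + yy if yy >= 69 else 2000 + yy
    return rec

def make_line(values):
    """Hypothetical helper: build a 77-character line, right-justifying values."""
    chars = [" "] * 77
    for (name, first, last), value in zip(LAYOUT, values):
        chars[first - 1:last] = str(value).rjust(last - first + 1)
    return "".join(chars)

# Made-up values for illustration only; "XYZ" is not a real franchise code.
sample_line = make_line(["XYZ", "NL", "WEST", "69", 1234567, 700, 650, 90, 72, 3.5])
record = parse_line(sample_line)
print(record)
```

To read the actual file, each line of the file would be passed to `parse_line` in turn; the coding of the franchise, league, and division fields should be checked against the documentation file before relying on it.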


References

Munsey, P., and Suppes, C. (2001), Ballparks [Online]. (

Frees, E. W. (1996), Data Analysis Using Regression Models, Englewood Cliffs: Prentice Hall.

InfoPlease (2001), [Online]. (

Reichler, J. L. (ed.) (1993), The Baseball Encyclopedia, New York: Macmillan Publishing Company.

Thorn, J., and Palmer, P. (2001), Total Baseball, New York: HarperCollins Publishers.

James J. Cochran
Department of Economics and Finance
Louisiana Tech University
Ruston, LA 71272
