![]() |
John Holcomb
Cleveland State University
Angela Spalsbury
Youngstown State University
Journal of Statistics Education Volume 13, Number 3 (2005), jse.amstat.org/v13n3/datasets.holcomb.html
Copyright © 2005 by John Holcomb and Angela Spalsbury, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.
Key Words:Activity-based learning; Data cleaning; Team projects.
The goal of the project described in this article is to provide students with an opportunity to use exploratory data analysis techniques taught in the beginning of the course to identify what we call “problem values” for a team homework project. The data for this project come from a clinical investigation of calcium, inorganic phosphorus, and alkaline phosphatase levels in the blood for subjects age 65 and older where the blood was tested at several different commercial laboratories.
As implemented by the authors, the team project has one submission from each team, and the instructor evaluates the project 50% on statistical accuracy and 50% on written presentation. Teams were self-selected, and the project counted toward 5% of the final grade. Previous literature has addressed the use of projects or case studies in statistics education reform. Holcomb and Ruffer (2000) describe their use of projects in introductory statistics. The project described in this article is similar in structure and graded in the same way projects described therein. Other specific examples of using projects in class include Fillebrown (1994), Halvorsen and Moore (2000), Ledolter (1995), Love (1998, 2000), Sevin (1995), Radke-Sharpe (2000), Smith (1998), and Vaughan (2003). Articles on the benefits of scientific writing include Radke-Sharpe (1991), Samsa and Oddone (1994), and Stromberg and Ramanathan (1996).
Before doing statistical analysis it is very important that the data have as much integrity as possible. Data cleaning is a two-step process including detection and then correction of problem values (Mason, Gillenwater, Pugh, Kenefik, Collins, Whitaker and Volk). Some common sources of problem values are
Data cleaning issues are especially important when working with large databases, in particular with databases that include name and address information. Duplicate or incomplete entries are the bane of data managers. These issues are beyond the scope of this paper (cf. Kimball, 1996).
Elementary tools that can be used to clean data include descriptive statistics and statistical graphs. During analysis of a dataset, researchers faced with problem values that are missing, appear erroneous, or are not possible, often need to consult the original record containing the observation. Standard textbook limitations prohibit the inclusion of original records. The advance of technology allows us to provide this information (or mimic this process) on the web. We describe in this article a project that involves placing pseudo records for each observation on the web so that students can have the valuable experience of checking for reliability in the data as well as learning how to correct data transcription errors. It may be helpful to note that this exercise is not designed for use imputing missing values. An instructor who wishes to explore imputation, however, could adapt the project easily. For information regarding data imputation, see Little and Rubin (2002).
Although the original data for this study had observations needing investigation, we massaged the original data to include data problems and issues that have arisen in other research projects. This was done to expose students to a range of cleaning issues that can arise when preparing data for analysis.
The discrete variables in this dataset are sex (1 for male, 2 for female), lab (one of 6 labs where blood was analyzed), and agegroup (five, 5 year categories, starting at 65 and ending at 89). The continuous variables in this dataset are age (years), akphos (alkaline phosphatase, international units per liter), cammol (raw calcium, milligrams per deciliter), and phosmmol (inorganic phosphorous, milligrams per deciliter).
Students downloaded the messy dataset from a class web site and were told that there were problems with the data. They were instructed to identify unusual and missing values with their exploratory data techniques and then check the values against the patient record. The patient record is a grid of observation numbers available on the web that mimics actual patient data values. Thus if a student wanted to confirm the age of observation #34, he or she would click on obs34 at bigtable.htm. The goal for the students was to find 27 incorrect data values and 2 missing data values in the variables age, sex, alkphos, lab, cammol, phosmmol, and agegroup and then correct these data values. They also needed to identify 4 missing values that are truly missing and not available in the original record.
We wanted students to use statistical techniques to determine which values needed confirmation by looking at the patient record. Of course, students have the option of clicking on every patient record on the web and locating any wrong values in the dataset with that approach. One reason in using this dataset for the assignment is that the sample size of 178 observations discourages such activity, without being too large for us to create the html files for each of the patient records. However, at least one team in every class has checked every observation on the web.
Another advantage of this dataset is that it deals with physiological variables (levels of various minerals and enzymes in blood) of which most of our students have no familiarity. Statisticians often have to work with variables for which they have little knowledge. This project illustrates the interdisciplinary nature of statistics and that collaboration with subject-matter experts is often important.
This article describes the class activities of cleaning the data and the summary assignment for students to complete once the data are cleaned. This data, however, can also be used for a variety of subsequent projects. Independent samples t-tests could be used with the grouping variable of gender. One way analysis of variance (ANOVA) can be applied to the lab or the agegroup variables. Two way ANOVA can be used to take into account variability from the factors of gender and lab. Lastly, students can use age (in years) and regression analysis to predict calcium, inorganic phosphorus, or alkaline phosphatase levels.
For example, creating a frequency table for lab yields the following output in SPSS:
Frequency | Percent | Valid Percent | Cumulative Percent | |
---|---|---|---|---|
Valid 1 | 88 | 49.4 | 49.7 | 49.7 |
2 | 41 | 23.0 | 23.2 | 72.9 |
3 | 16 | 9.0 | 9.0 | 81.9 |
4 | 13 | 7.3 | 7.3 | 89.3 |
5 | 11 | 6.2 | 6.2 | 95.5 |
6 | 6 | 3.4 | 3.4 | 98.9 |
21 | 1 | 0.6 | 0.6 | 99.4 |
43 | 1 | 0.6 | 0.6 | 100.0 |
Total | 177 | 99.4 | 100.0 | |
Missing System | 1 | 0.6 |
The student should immediately see the problem values of 21 and 43. The problem values for the continuous variables are also obvious, but in completely a different manner than the discrete variables. By inspecting the boxplot for inorganic phosphorus (phosmmol) the student should realize that something is wrong.
Figure 1. Boxplot of phosmmol with problem values.
If one uses the web grid of observations, the student sees the correct value of phosmmol is .84 for observation 132. Checking observation 108 shows the correct value of phosmmol is 0.9 and the student should begin to surmise that the outliers need to be checked. Students could also examine a histogram and see that something looked amiss as well. With the variable age, the boxplot looks even stranger. One corrupt value given for age was 771. From inspection it is unclear if the true age is 77 or 71, but obviously 771 is not correct. Students need to find the correct age by clicking on the web patient record for that observation number.
In regard to missing data, one of the missing values is the value of agegroup, which can be determined from the actual age. Another article (Johnson 1996) addressing a similar issue had students correct erroneous data values by using redundant variables available within the dataset. (In this case height, weight, and body mass index (BMI) were given, allowing the student to infer the apparent correct height for a man using the BMI and weight.) There is also a missing alkphos value which can be found by looking at the patient record. The remaining four missing values are not available on the patient record, thus exhibiting what often occurs in real research. A complete table of problem values (with the variable name and observation number) with the correct value appears in Appendix B.
Instructors should note for the continuous variables that all problem values are either missing or detectable as outliers. However, not all outliers are problem values. For example, even when the dataset is cleaned, outliers still exist in all three of the outcome variables. This is good reinforcement for students who sometimes feel all outliers are erroneous or need to be removed.
Once the student believes that their dataset is cleaned and has integrity, they can then complete the official summary. We require students to provide a summary of the discrete variables. We also require them to give summary statistics and to construct side-by-side boxplots for the continuous variables to compare the levels of the minerals and enzymes between men and women and between laboratories (see Appendix C).
We would like for students to come to the realization that cleaning data may or may not affect the conclusion. With this in mind, in future assignments, we will require students to compare one or more of the project analyses between the original and the cleaned data (i.e., sensitivity analysis). As one can guess, the erroneous age values greatly inflate both the mean and standard deviation. With the outcome variables, the results are not so clear. Note that the suggested changes are already incorporated into the assignment as presented in Appendix C.
One issue that we faced was that students often asked us if they had found all of the problem values. This is problematic because in real statistical analysis, the analyst never really knows if he or she has found all incorrect data values. When presented with the number of problem observations a group found, we often told the students if they had found them all or not. This is not ideal, and we hope to avoid doing this in the future. Since all of our students found all of the problem values, it appears that there is little difficulty in locating all of the problem values. When the first listed author gave the team class time to begin the assignment, he noted that all the student teams found the problem values within an hour. Also, he observed different student groups updating each other on their progress in finding the problem values. This kind of cross-class collaboration is not necessarily harmful, and it is probably impossible to avoid.
The most important issue that we faced with the written project reports was the lack of substantial commentary on the graphs and tables. Students were reluctant to infer any conclusions from the summary graphics. We feel this is most likely a result of a variety of factors, one of which may be that this was the first time students were asked to express in words their statistical thinking. To address this concern, we have decided that in the future we will give a simple assignment prior to this project. The assignment will contain a dataset with only a few well understood variables. We will require students to perform a summary analysis with at least one or two well written paragraphs analyzing the data. This assignment will also teach students how to cut and paste statistical graphs and tables, with appropriate labels, into their word processing program.
Columns | Variable | Comment |
---|---|---|
9-11 | OBSNO | Patient Observation Number |
21-22 | AGE | Years |
33 | SEX | 1=Male, 2=Female |
42-44 | ALKPHOS | Alkaline Phosphatase International Units/Liter |
55 | Lab | Lab: 1=Metpath; 2=Deyor; 3=St. Elizabeth's;
4=CB Rouche; 5=Youngstown Osteopathic Hospital; 6=Horizon |
63-66 | CAMMOL | Calcium mmol/L |
74-77 | PHOSMMOL | Inorganic Phosphorus mmol/L |
88 | AGEGROUP | Age group 1=65-69; 2=70-74; 3=75-79; 4=80-84; 5=85-89 Years |
Columns | Variable | Comment |
---|---|---|
9-11 | OBSNO | Patient Observation Number |
20-22 | AGE | Years |
32-33 | SEX | 1=Male, 2=Female |
42-44 | ALKPHOS | Alkaline Phosphatase International Units/Liter |
54-55 | Lab | Lab: 1=Metpath; 2=Deyor; 3=St. Elizabeth's;
4=CB Rouche; 5=Youngstown Osteopathic Hospital; 6=Horizon |
62-66 | CAMMOL | Calcium mmol/L |
74-77 | PHOSMMOL | Inorganic Phosphorus mmol/L |
88 | AGEGROUP | Age group 1=65-69; 2=70-74; 3=75-79; 4=80-84; 5=85-89 Years |
Observation Number | Variable | Problem Value | Correct Value |
---|---|---|---|
6 | lab | 43 | 4 |
11 | age | 771 | 71 |
21 | cammol | 3.2 | 2.2 |
22 | alkphos | Missing | 64 |
25 | cammol | 25.3 | 2.53 |
26 | cammol | 20 | 2 |
27 | cammol | 22.3 | 2.23 |
28 | sex | 22 | 2 |
28 | cammol | 24.3 | 24.3 |
29 | cammol | 25 | 2.5 |
30 | cammol | 23.3 | 2.33 |
31 | cammol | 24 | 2.4 |
32 | cammol | 25 | 2.5 |
32 | phosmmol | 3.21 | 1.23 |
33 | cammol | 25 | 2.5 |
34 | cammol | 23.5 | 2.35 |
35 | cammol | 22.5 | 2.25 |
36 | cammol | 24.5 | 2.45 |
37 | cammol | 23.3 | 2.33 |
42 | phosmmol | Missing | Missing |
53 | age | 699 | 69 |
60 | alkphos | 9 | 97 |
78 | lab | 21 | 2 |
79 | agegroup | Missing | 4 |
85 | cammol | Missing | Missing |
105 | age | Missing | Missing |
108 | phosmmol | 0.09 | 0.9 |
120 | sex | 21 | 1 |
123 | age | 730 | 73 |
132 | phosmmol | 8.84 | 0.84 |
149 | cammol | 1.05 | 2.05 |
170 | lab | Missing | Missing |
173 | sex | 12 | 2 |
176 | phosmmol | 2.16 | 1.26 |
Sex | 1 = Male; 2 = Female |
Lab | 1 = Metpath; 2 = Deyor; 3 = St. Elizabeth's; 4 = CB Rouche; 5 = YOH; 6 = Horizon |
Agegroup | 1 = 65-69; 2 = 70-74; 3 = 75-79; 4 = 80-84; 5 = 85-89 |
The other variables of age (years), alkphos - alkaline phosphatase (IU/L), cammol - calcium (mmol/L), and phosmmol – inorganic phosphorus (mmol/L), are continuous.
Cobb, G. (1992), "Teaching Statistics," in Heeding the Call for Change: Suggestions for Curricular Action, ed. L. A. Steen, pp. 3-43, Washington, DC: Mathematical Association of America.
Cobb, G., and Moore, D. S. (1997), "Mathematics, Statistics, and Teaching," The American Mathematical Monthly, 104, 801-823.
Fillebrown, S. (1994), “Using Projects in an Elementary Statistics Course for Non-Science Majors,” Journal of Statistics Education [Online] , 2(2), jse.amstat.org/v2n2/fillebrown.html.
Halvorsen, K.T. and Moore, T.L., (2000), “Motivating, Monitoring and Evaluating Student Projects,” in Resources for Undergraduate Instructors Teaching Statistics, MAA Notes, No. 52, Washington DC:Mathematical Association of America.
Holcomb, J. and Ruffer, R. (2000). “Using a term-long project sequence in introductory statistics,” The American Statistician, 54, 49-53.
Johnson, R. W., (1996). “Fitting Percentage of Body Fat to Simple Body Measurements,” The Journal of Statistics Education, 4(1), jse.amstat.org/v4n1/datasets.johnson.html.
Kimball, R. (1996). “Dealing with Dirty Data,” DBMS, 9(10), www.dbmsmag.com/9609d14.html.
Ledolter, J. (1995), “Projects in Introductory Statistics Courses,” The American Statistician, 49(4), 364-367.
Little, R. J. A., and Rubin, D. B. (2002). Statistical Analysis with Missing Data, 2nd edition, New York: Wiley – Interscience.
Love, T. E. (1998). "A Project-Driven Second Course" Journal of Statistics Education [Online], 6(1), jse.amstat.org/v6n1/love.html.
Love, T. E. (2000). "A Different Approach to Project Assessment" Journal of Statistics Education [Online], 8(1) jse.amstat.org/secure/v8n1/love.cfm.
Mason, J., Gillenwater, K., Pugh, R., Kenefik, E., Collins, G., Whitaker, M., Volk, D. “Practical Analysis of Nutritional Data (PANDA),” Chapter 2, [Online], www.tulane.edu/~panda2/Analysis2/ahome.html.
Radke-Sharpe, M. (1991), “Writing as a Component of Statistics Education,” The American Statistician, 45, 292-293.
Radke-Sharpe, N., (2000), “Curriculum in Context: Teaching with Case Studies in Statistics Courses,” in Resources for Undergraduate Instructors: Teaching Statistics, MAA Notes, No. 52, Washington DC:Mathematical Association of America.
Samsa, G., and Oddone, E. Z. (1994), “Integrating Scientific Writing Into a Statistics Curriculum: A Course in Statistically Based Scientific Writing,” The American Statistician, 48, 117-119.
Sevin, A. (1995), "Some Tips For Helping Students In Introductory Statistics Classes Carry Out Successful Data Analysis Projects" in ASA Proceedings of the Section of Statistical Education, 159-164.
Smith, G. (1998), "Learning Statistics By Doing Statistics" Journal of Statistics Education [Online], 6(3), jse.amstat.org/v6n3/smith.html.
Stromberg, A. J., and Ramanathan, S. (1996), “Easy Implementation of Writing in Introductory Statistics Courses,” The American Statistician, 50, 159-163.
Vaughan, T. S. (2003), "Teaching Statistical Concepts With Student-Specific Datasets" Journal of Statistics Education [Online], 11(1) jse.amstat.org/v11n1/vaughan.html.
John P. Holcomb, Jr.
Department of Mathematics
Cleveland State University
Cleveland, OH
U.S.A.
j.p.holcomb@csuohio.edu
Angela Spalsbury
Department of Mathematics and Statistics
Youngstown State University
Youngstown, OH
U.S.A.
angie@math.ysu.edu
Volume 13 (2005) | Archive | Index | Data Archive | Information Service | Editorial Board | Guidelines for Authors | Guidelines for Data Contributors | Home Page | Contact JSE | ASA Publications