Jack E. Matson
Tennessee Technological University
Brian R. Huguenard
Tennessee Technological University
Journal of Statistics Education Volume 15, Number 2 (2007), http://jse.amstat.org/v15n2/datasets.matson.html
Copyright © 2007 by Jack E. Matson and Brian R. Huguenard, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.
Key Words: data transformation; residual analysis; linear model assumptions; linear regression
When a new software project is being planned, the number and types of function points for the project can be estimated from the design specifications, thus making it possible to estimate development effort during the early phases of project planning. In addition, since the function point count is derived from the design specifications, any changes in the specifications (which occur frequently during software development) can be easily accounted for in the estimate of development effort.
There are five basic types of function points: external inputs (data coming from the user or some other system), external outputs (reports or messages going out to the user or some other system), external inquiries (queries coming from outside the system which result in a report being sent to the requestor), internal logical files (data files that reside within the boundaries of the system), and external interface files (data files that reside outside the boundaries of the system). Standardized criteria have been developed to allow the consistent identification and categorization of function points from the design specifications of a proposed system, or from the actual features of an existing system (International Function Point Users Group, 2005). Once an initial count of function points has been generated, it is adjusted to allow for the overall complexity of the system, using a standardized system of weights that account for 14 different system factors (for a more detailed account of the adjustment process, see Function Point Counting Practices Manual, 2001). The final adjusted function point measure (FP) is then complete, and serves as an objective measure of the system’s size and complexity.
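The counting and adjustment steps described above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure. The average-complexity weights and the adjustment formula (0.65 plus 0.01 times the sum of the 14 factor ratings, each rated 0 to 5) are assumptions taken from common IFPUG practice and do not appear in this paper. The counts and ratings in the example are hypothetical.

```python
# Assumed average-complexity weights per function type. IFPUG also defines
# low and high weights for each type, which this sketch ignores.
AVERAGE_WEIGHTS = {
    "external_inputs": 4,
    "external_outputs": 5,
    "external_inquiries": 4,
    "internal_logical_files": 10,
    "external_interface_files": 7,
}


def unadjusted_function_points(counts):
    """Weighted sum of the counts of the five function types."""
    return sum(AVERAGE_WEIGHTS[kind] * n for kind, n in counts.items())


def value_adjustment_factor(ratings):
    """Assumed adjustment: 0.65 + 0.01 * (sum of 14 ratings, each 0 to 5)."""
    if len(ratings) != 14:
        raise ValueError("expected ratings for exactly 14 system factors")
    if any(r < 0 or r > 5 for r in ratings):
        raise ValueError("each rating must be between 0 and 5")
    return 0.65 + 0.01 * sum(ratings)


def adjusted_function_points(counts, ratings):
    """Adjusted function point measure (FP) for one system."""
    return unadjusted_function_points(counts) * value_adjustment_factor(ratings)


# Hypothetical system, for illustration only.
example_counts = {
    "external_inputs": 10,
    "external_outputs": 5,
    "external_inquiries": 4,
    "internal_logical_files": 3,
    "external_interface_files": 2,
}
example_ratings = [3] * 14  # every factor rated "average"

fp = adjusted_function_points(example_counts, example_ratings)
print(f"adjusted function points: {fp:.2f}")
```

Under these assumptions the adjustment can move the unadjusted count by at most 35 percent in either direction, since the factor ranges from 0.65 to 1.35.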
If any of the statistical assumptions of the model are not met, then the model is not appropriate for the data. The fourth assumption (independence of error terms) is relevant when the data constitute a time series. Because the data in this paper are not time series data, we do not test for independence of the error terms.
Residual analysis combines simple graphical methods with formal statistical tests to assess the aptness of a model. When a model does not satisfy these assumptions, transforming the data may yield a model for which the assumptions are reasonably satisfied.
Eest = 585.7 + 15.124(FP)    (Model 1)
where Eest is the estimated development hours and FP is the size in function points. The coefficient of determination (R²) for this model is 0.655. The Minitab results for the simple linear regression model are shown in Table 1.
To determine the adequacy of the model, residual analysis should be performed. Figure 4 shows the plot of residuals against the independent variable (function points).
In this plot, the spread of the residuals around zero increases as function points increase. Ideally, the residuals would fluctuate in a roughly uniform band around zero. The widening spread in Figure 4 indicates that the error variance is not constant, so the project data violate the equal variance assumption.
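The pattern described above can be reproduced with a small numerical check. The sketch below fits a least-squares line and compares the average size of the residuals for the smaller and larger projects. The data are made up purely for illustration, with noise that grows with project size, and are not the project data analyzed in this paper. The helper names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def residuals(x, y, b0, b1):
    """Observed minus fitted values."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def mean_abs(values):
    """Average absolute value, used here as a crude measure of spread."""
    return sum(abs(v) for v in values) / len(values)


# Illustrative data only: the error alternates in sign and is proportional
# to size, so its spread grows with function points.
fp = [100 * i for i in range(1, 11)]
hours = [600 + 15 * x + (3 * x if i % 2 == 0 else -3 * x)
         for i, x in enumerate(fp)]

b0, b1 = fit_line(fp, hours)
res = residuals(fp, hours, b0, b1)

half = len(fp) // 2  # fp is already sorted, so split into small and large
low_spread = mean_abs(res[:half])
high_spread = mean_abs(res[half:])
print(f"mean |residual|, smaller projects: {low_spread:.0f}")
print(f"mean |residual|, larger projects:  {high_spread:.0f}")
```

A much larger spread in the upper half is the numerical counterpart of the fan shape seen in a residual plot. It is a rough diagnostic only and does not replace the plot or a formal test.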
A normal probability plot of the residuals can be used to test the normality of the error terms. The normal probability plot in Figure 5 is not linear, an indication that this assumption is also being violated. A Kolmogorov-Smirnov (K-S) test for normality resulted in a p-value less than .010, a second indication that the error terms are not normally distributed.
Based on the results of the residual analysis, a simple linear regression of development effort on function points is inappropriate if inference about development effort is to be conducted. The model violates the constant error variance and normality assumptions.
The regression model built from the transformed project data is as follows:
ln(Eest) = 2.5144 + 1.0024 ln(FP)    (Model 2)
where ln(Eest) is the natural logarithm of estimated development hours and ln(FP) is the natural logarithm of function points. The coefficient of determination (R²) for this model is 0.534. The Minitab results for this regression are shown in Table 2.
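It may help to see what Model 2 implies on the original scale. Exponentiating both sides gives a power-law form. This back-transformed expression is derived here from the reported coefficients and is not stated in the paper.

```latex
\begin{align*}
\ln(E_{\text{est}}) &= 2.5144 + 1.0024\,\ln(\mathrm{FP}) \\
E_{\text{est}} &= e^{2.5144}\,\mathrm{FP}^{1.0024} \approx 12.36\,\mathrm{FP}^{1.0024}
\end{align*}
```

Because the exponent is very close to 1, the fitted relationship is nearly proportional: roughly 12.4 hours of effort per function point across the range of the data.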
The next step is to check the equal variance and normality assumptions. A scatter plot of the residuals against the independent variable (natural logarithm of function points) is shown in Figure 7. No pattern in the residual data is apparent. The logarithmic transformation resolved the problem with increasing variance of error terms that existed with the project data in its original form.
To check the error terms for normality, a histogram of the residuals and a normal probability plot of residuals are shown in Figures 8 and 9, respectively. The normal probability plot is nearly linear, indicating that the error terms are normally distributed. The shape of the histogram supports this conclusion. In addition, a K-S test for normality resulted in a p-value greater than .150. The K-S test provides further support for the error terms being normally distributed.
Therefore, Model 2 apparently satisfies the assumptions of equal variance and normality. Thus, Model 2 is appropriate for the transformed project data.
Since the assumptions of our regression model are reasonably satisfied, we may now perform inference for our model. Appendix A, for instance, shows the results of a test of the null hypothesis β1 = 0 against the alternative β1 ≠ 0 (Neter, Wasserman, and Kutner, 1985). We conclude that β1 ≠ 0; a 95% confidence interval for the slope is 0.818 < β1 < 1.186.
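For reference, a two-sided confidence interval for the slope in simple linear regression has the standard form below, where b1 is the estimated slope, s{b1} its estimated standard error, and n the number of projects. The values of n and s{b1} are not reproduced in this section, so only the structure is shown.

```latex
b_1 \pm t\!\left(1 - \tfrac{\alpha}{2};\; n - 2\right) s\{b_1\}
```

As a consistency check, the reported interval has midpoint (0.818 + 1.186)/2 = 1.002, which agrees with the Model 2 slope of 1.0024, and a half-width of 0.184. Since the interval excludes 0, it leads to the same conclusion as the test.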
For an example of applying Model 2 to estimate software development effort, suppose a system has a software size of 381 function points. To estimate development effort, the function points must be transformed: ln(FP) = ln(381) = 5.9428. Using the regression equation for Model 2,
ln(Eest) = 2.5144 + 1.0024(5.9428) = 8.4715.
Taking the inverse logarithm (that is, exponentiating 8.4715) gives an estimated development effort of 4776.5 hours, or 36.7 man-months, where one man-month is defined as 130 hours.
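The worked example above can be reproduced directly. The following sketch uses the Model 2 coefficients and the 130-hour man-month from the text. The function and constant names are hypothetical.

```python
import math

B0 = 2.5144                # Model 2 intercept (log scale)
B1 = 1.0024                # Model 2 slope on ln(FP)
HOURS_PER_MAN_MONTH = 130  # definition used in the text


def estimate_effort_hours(function_points):
    """Point estimate of development hours from Model 2."""
    if function_points <= 0:
        raise ValueError("function points must be positive")
    log_hours = B0 + B1 * math.log(function_points)
    return math.exp(log_hours)  # back-transform from the log scale


hours = estimate_effort_hours(381)
man_months = hours / HOURS_PER_MAN_MONTH
print(f"{hours:.1f} hours, {man_months:.1f} man-months")
```

Small differences from the figures in the text can arise because the text rounds ln(381) and the fitted log value to four decimal places before exponentiating.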
The coefficient of determination for this model was 0.534, which means that almost half of the variability cannot be explained by the model. Basing an estimate solely on the results of this model would be extremely risky. However, the model is not without significance and usefulness because it leads to a measure of the uncertainty that exists in the point estimate. The primary cost estimate should always be based on a detailed analysis of the work to be performed. The model developed in this study could be used to provide further support of the detailed cost estimate and provide a measure of uncertainty for the effort estimate.
Prediction intervals for Model 2 (Appendix B) are narrower for small projects and get wider as software size in function points increases. For example, the smallest project in the sample of project data is 119 function points. The point estimate of development effort for this project is 11.4 man-months. Considering that the corresponding 90% prediction interval is 2.91 man-months to 44.98 man-months, the practical application of this model becomes questionable.
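One way to read the interval quoted above is as a multiplicative band around the point estimate. An interval that is symmetric on the log scale becomes "point estimate times or divided by a factor" after back-transformation. The sketch below checks this against the reported figures. It assumes, without confirmation from Appendix B, that the interval was computed on the log scale and then exponentiated.

```python
import math

B0, B1 = 2.5144, 1.0024    # Model 2 coefficients from the text
HOURS_PER_MAN_MONTH = 130  # definition used in the text


def point_estimate_man_months(function_points):
    """Back-transformed Model 2 point estimate, in man-months."""
    log_hours = B0 + B1 * math.log(function_points)
    return math.exp(log_hours) / HOURS_PER_MAN_MONTH


point = point_estimate_man_months(119)
lower, upper = 2.91, 44.98  # 90% prediction interval reported in the text

factor_up = upper / point    # how far the upper limit sits above the estimate
factor_down = point / lower  # how far the lower limit sits below it
print(f"point estimate: {point:.1f} man-months")
print(f"upper/point = {factor_up:.2f}, point/lower = {factor_down:.2f}")
```

The two factors come out nearly equal, at a little under 4, which is consistent with the assumption. In practical terms, the actual effort for a 119 function point project could plausibly be about a quarter of the estimate or about four times it.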
The second assignment dealt with developing a multiple regression model. In addition to function points, the data also include three other explanatory variables: operating system, database management system, and programming language. Because these variables are categorical, they must be coded as indicator variables. (The variable Unix, for example, is equal to 1 if the Unix operating system was used and is equal to 0 if it was not.) Including these variables results in a modest improvement over the simple linear regression model, increasing R² from 0.534 to 0.694. The Minitab results for a multiple regression model are shown in Table 3. While this paper does not address multiple regression, the topic is discussed, for example, by Chu (2001).
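The indicator coding described above can be sketched as follows. The helper name, the baseline label "Other", and the four example values are hypothetical. Only the Unix variable and its 0/1 definition come from the text.

```python
def indicator_columns(values, baseline):
    """Build one 0/1 column per category level, omitting the baseline level.

    The baseline is left out so that the indicator columns are not
    redundant with the intercept in a regression model.
    """
    levels = sorted(set(values) - {baseline})
    return {
        level: [1 if v == level else 0 for v in values]
        for level in levels
    }


# Hypothetical operating system values for four projects.
operating_systems = ["Unix", "Other", "Unix", "Other"]
columns = indicator_columns(operating_systems, baseline="Other")
print(columns)
```

A categorical variable with k levels produces k - 1 indicator columns under this scheme, and each column enters the multiple regression as an ordinary explanatory variable alongside ln(FP).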
Chu, Singat (2001), “Pricing the C’s of Diamond Stones”, Journal of Statistics Education, Volume 9, Number 2, http://jse.amstat.org/v9n2/datasets.chu.html.
Devore, Jay L. (1995), Probability and Statistics for Engineering and the Sciences, 4th edition, Belmont, CA: Wadsworth Publishing Company.
International Function Point Users Group (2001), Function Point Counting Practices Manual, release 4.1.1, Princeton Junction, NJ, http://www.ifpug.org/publications/manual.htm.
International Function Point Users Group (2005), “About Function Point Analysis”, http://www.ifpug.org/about/about.htm.
Neter, J., Wasserman, W., and Kutner, M. (1985), Applied Linear Statistical Models, Homewood, Illinois: Irwin.
Jack E. Matson
Department of Decision Sciences and Management
Tennessee Technological University
Johnson Hall
1105 North Peachtree Street
Cookeville, TN 38505-0001
U.S.A.
JEMatson@tntech.edu
Brian R. Huguenard
Department of Decision Sciences and Management
Tennessee Technological University
Johnson Hall
1105 North Peachtree Street
Cookeville, TN 38505-0001
U.S.A.
BHuguenard@tntech.edu