Terence C. Mills
Journal of Statistics Education Volume 13, Number 2 (2005), jse.amstat.org/v13n2/datasets.mills.html
Copyright © 2005 by Terence C. Mills, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.
Key Words: Body Mass Index; Functional form; Prediction; Regression.
A scatterplot of all 252 pairs of observations is shown in Figure 1. A traditional starting point for students analysing the relationship between percent body fat and the BMI is to fit a linear regression. Two such regressions are shown superimposed on Figure 1: a fit to all 252 cases and a fit with case 39 omitted from the calculations. This case has a BMI of 48.9 associated with a body fat percentage of 33.8 and is seen to be an outlier, both pulling the fitted line towards it and giving an impression of a distinct curvilinear relationship between the two variables. It will therefore be omitted from further modelling, but students could be encouraged to discuss whether this is the best course of action for dealing with an aberrant observation and whether alternative solutions could be considered.
Figure 1. Scatterplot of body fat percentage against BMI with linear regressions fitted to all observations (bodyfat% = –20.4 + 1.55BMI) and with the outlier removed (bodyfat% = –24.9 + 1.73BMI) superimposed.
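The pull that a single high-leverage case exerts on a fitted line can be illustrated with a short sketch. The data below are synthetic stand-ins, not the Johnson (1996) observations: the generating curve, noise level, seed and the helper name `ols_line` are assumptions made purely for illustration, and only the outlying point (BMI 48.9, body fat 33.8) is taken from the text.

```python
import numpy as np

def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1*x."""
    slope, intercept = np.polyfit(x, y, 1)
    return intercept, slope

# Synthetic stand-in data (NOT the Johnson data): a mildly concave
# relationship between BMI and percent body fat, plus random noise.
rng = np.random.default_rng(0)
bmi = rng.uniform(19.0, 35.0, size=100)
fat = -126.2 + 45.0 * np.log(bmi) + rng.normal(0.0, 5.0, size=100)

# Append one high-BMI case that lies well below the linear trend.
bmi_all = np.append(bmi, 48.9)
fat_all = np.append(fat, 33.8)

b0_all, b1_all = ols_line(bmi_all, fat_all)
b0_trim, b1_trim = ols_line(bmi, fat)
print(f"all cases:       fat = {b0_all:.1f} + {b1_all:.2f} BMI")
print(f"outlier omitted: fat = {b0_trim:.1f} + {b1_trim:.2f} BMI")
```

In this sketch the slope fitted to all cases is flatter than the slope with the extreme case omitted, which is the same direction of change as between the two lines in Figure 1.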
A nonlinear relationship of the inverse form bodyfat% = a(BMI – b)/BMI may be more appropriate on theoretical grounds. Here b represents the theoretical BMI associated with 0% body fat, and a represents the percentage of excess body weight which is fat (see Gray and Fujioka, 1991, pages 548-9, for more detailed interpretation of this relationship). With a and b both positive, the first and second derivatives of this function with respect to BMI, ab/BMI² and –2ab/BMI³, are positive and negative respectively. Thus the model implies that percent body fat increases, but at a decreasing rate, with increasing BMI and is bounded above by the value a. An alternative model which would also capture this type of nonlinearity is the semi-logarithmic relationship bodyfat% = c + d ln(BMI) with d positive. Students can fit these two nonlinear functions to the data and assess their fits relative to that of the linear model. Figure 2 presents the fits of the three functions graphically.
Students can note that the estimates of the inverse function parameters are a = 64.3 ± 2.6% and b = 17.6 ± 0.3, where we use the notation $\hat\theta \pm se(\hat\theta)$, with $\hat\theta$ and $se(\hat\theta)$ being the ordinary least squares (OLS) estimate and associated standard error of a parameter $\theta$; i.e., a BMI of 17.6 is associated with 0% body fat and 64.3% of excess body weight is fat. It should be noted that, while estimates of the parameters of this model can be obtained by regressing percent body fat on the inverse of the BMI and algebraically calculating the estimates from the regression coefficients, parameter standard errors can only be obtained using a dedicated nonlinear regression routine. The R² statistics from the three models are: linear 0.560, inverse 0.557, and semi-log 0.563, with residual error variances 5.123², 5.141² and 5.105², respectively. Thus, on goodness of fit grounds, the semi-logarithmic model is to be preferred. The fits, however, are very similar over the central range of BMI values, differing substantially only for the very highest BMI values, so that a more formal method of model selection could be considered.
Figure 2. Fitted linear, semi-logarithmic (bodyfat% = –126.2 + 45.0ln(BMI)), and inverse (bodyfat% = 64.3(BMI – 17.6)/BMI) functions.
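The algebraic route to the inverse-model estimates can be sketched as follows. Writing the model as bodyfat% = a – ab/BMI, a regression of percent body fat on 1/BMI has intercept a and slope –ab, so b is recovered as minus the slope divided by the intercept. The function name `fit_inverse_model` and the noise-free test data are assumptions for illustration; the sketch recovers point estimates only and says nothing about standard errors.

```python
import numpy as np

def fit_inverse_model(bmi, fat):
    """Estimate a and b in fat = a*(bmi - b)/bmi by OLS of fat on 1/bmi."""
    slope, intercept = np.polyfit(1.0 / bmi, fat, 1)
    a = intercept            # fat = a - (a*b)/bmi, so the intercept is a
    b = -slope / intercept   # the slope is -(a*b)
    return a, b

# Noise-free check using the fitted values reported in Figure 2.
bmi = np.linspace(18.0, 40.0, 50)
fat = 64.3 * (bmi - 17.6) / bmi
a_hat, b_hat = fit_inverse_model(bmi, fat)
print(f"a = {a_hat:.1f}, b = {b_hat:.1f}")
```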
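One approach is to note that all three functional forms may be “nested” within a general functional form by using the Box and Cox (1964) family of power transformations, which are defined for the generic variable Z as
$$Z^{(\lambda)} = \begin{cases} (Z^{\lambda} - 1)/\lambda, & \lambda \neq 0 \\ \ln Z, & \lambda = 0 \end{cases}$$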
If we denote the ith observations, i = 1, 2, ..., n, on percent body fat and the BMI as $y_i$ and $x_i$, respectively, then we may consider the nonlinear regression model
$$y_i^{(\lambda_1)} = \beta_0 + \beta_1 x_i^{(\lambda_2)} + u_i \qquad (1)$$
in which $y_i^{(\lambda_1)}$ and $x_i^{(\lambda_2)}$ denote the power-transformed variables and $u_i$ is an error term, assumed to be normally and independently distributed with zero mean and constant variance $\sigma^2$. The linear model is obtained when $\lambda_1 = \lambda_2 = 1$, setting $\lambda_1 = 1$ and $\lambda_2 = 0$ defines the semi-logarithmic model, while setting $\lambda_1 = 1$ and $\lambda_2 = -1$ defines the inverse model (in each case with a suitably redefined intercept). However, arbitrary values of the power transformation parameters need not be imposed upon (1): rather, they may be estimated along with the other parameters of the model, and tests of the hypotheses implied by the alternative functional forms may then be performed in order to discriminate between them.
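The way the power transformation nests the competing functional forms can be checked numerically. The sketch below is a minimal implementation of the Box-Cox transform; the function name `box_cox` and the three BMI values are illustrative choices, not part of the original analysis.

```python
import numpy as np

def box_cox(z, lam):
    """Box-Cox power transform; the lam = 0 case is the natural logarithm."""
    z = np.asarray(z, dtype=float)
    if abs(lam) < 1e-12:
        return np.log(z)
    return (z**lam - 1.0) / lam

bmi = np.array([18.5, 25.0, 30.0])
print(box_cox(bmi, 1.0))    # BMI - 1: linear in BMI
print(box_cox(bmi, 0.0))    # ln(BMI): the semi-logarithmic regressor
print(box_cox(bmi, -1.0))   # 1 - 1/BMI: linear in the inverse of BMI
```

Each special case differs from the raw regressor only by a constant shift (and a sign change for the inverse), which is why the intercept has to be redefined.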
The procedure for doing this is to recognise that, for any given values of $\lambda_1$ and $\lambda_2$, estimates of $\beta_0$ and $\beta_1$ conditional upon these values may be obtained from the regression of equation (1). Maximum likelihood (ML) estimates of the power transformation parameters, denoted $\hat\lambda_1$ and $\hat\lambda_2$, and hence of the $\beta$'s, are found by maximizing the concentrated log-likelihood, defined as
$$L(\lambda_1, \lambda_2) = -\frac{n}{2} \ln \hat\sigma^2(\lambda_1, \lambda_2) + (\lambda_1 - 1) \sum_{i=1}^{n} \ln y_i$$
where
$$\hat\sigma^2(\lambda_1, \lambda_2) = n^{-1} \sum_{i=1}^{n} \hat u_i^2$$
is an estimate of the “conditional” error variance, the $\hat u_i$ being the residuals from the conditional regression (1). The term “concentrated” is used because the maximization is, in fact, a step-wise procedure in that $\hat\sigma^2(\lambda_1, \lambda_2)$ is first obtained from a linear regression with fixed values of $\lambda_1$ and $\lambda_2$, with the second step being to maximize $L(\lambda_1, \lambda_2)$ over all values of $\lambda_1$ and $\lambda_2$. For an extended discussion of concentrated likelihood methods, see Seber and Wild (2003, pages 37-42), and for a review of the Box-Cox transformation, see Sakia (1992).
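The two-step procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the routine used for the reported results: the log-likelihood is taken to be the standard Box-Cox concentrated form (minus n/2 times the log of the conditional error variance, plus the Jacobian term for the transformation of y, with constants dropped), the data are synthetic, and the names `box_cox`, `concentrated_loglik` and `grid_search` are hypothetical.

```python
import numpy as np

def box_cox(z, lam):
    """Box-Cox power transform (natural log when lam is zero)."""
    z = np.asarray(z, dtype=float)
    return np.log(z) if abs(lam) < 1e-12 else (z**lam - 1.0) / lam

def concentrated_loglik(y, x, lam1, lam2):
    """Step 1: OLS for fixed (lam1, lam2), then the concentrated log-likelihood."""
    n = len(y)
    design = np.column_stack([np.ones(n), box_cox(x, lam2)])
    target = box_cox(y, lam1)
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ beta
    sigma2 = resid @ resid / n                     # conditional error variance
    jacobian = (lam1 - 1.0) * np.sum(np.log(y))    # from transforming y
    return -0.5 * n * np.log(sigma2) + jacobian

def grid_search(y, x, grid1, grid2):
    """Step 2: evaluate the likelihood surface on a grid and take its maximum."""
    surface = np.array([[concentrated_loglik(y, x, l1, l2) for l2 in grid2]
                        for l1 in grid1])
    i, j = np.unravel_index(np.argmax(surface), surface.shape)
    return grid1[i], grid2[j], surface

# Synthetic, strictly positive data from a semi-logarithmic curve
# (illustration only; these are not the body fat observations).
rng = np.random.default_rng(1)
x = rng.uniform(21.0, 40.0, size=250)
y = -126.2 + 45.0 * np.log(x) + rng.normal(0.0, 1.5, size=250)

grid = np.linspace(-1.0, 2.0, 13)   # step 0.25, so -1, 0 and 1 are grid points
lam1_hat, lam2_hat, surface = grid_search(y, x, grid, grid)
print(lam1_hat, lam2_hat, surface.max())
```

The `surface` array can be passed to a contour-plotting routine to visualise how sharply the likelihood identifies the two transformation parameters. A finer grid, or a numerical optimiser started from the grid maximum, would refine the estimates.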
This maximization may conveniently be computed by searching over a grid of $\lambda_1$, $\lambda_2$ values: advanced students can be encouraged to develop routines for carrying out this two-dimensional grid search, which involves calculating the power transformations, saving regression output to compute the log-likelihood, and writing looping procedures. Students may also experiment with plotting the contours of the likelihood function so constructed, which will provide a graphical perspective on the likely precision with which the transformation parameters, $\lambda_1$ and $\lambda_2$, are estimated. This precision may also be examined by calculating the confidence region obtained by using the result that
$$2\,[L(\hat\lambda_1, \hat\lambda_2) - L(\lambda_1, \lambda_2)]$$
is approximately distributed as chi-squared with two degrees of freedom (Box and Cox, 1964). Thus, for example, 95% and 75% confidence regions are defined by $2[L(\hat\lambda_1, \hat\lambda_2) - L(\lambda_1, \lambda_2)] < 5.99$ and $< 2.77$ respectively. The ML estimates are obtained as $\hat\lambda_1$ = 0.92 and $\hat\lambda_2$ = 0.01, with L(0.92, 0.01) = –407.83. Since L(1, 0) = –408.17, it is clear that the semi-logarithmic model is contained within any conventional confidence region. The other functional forms are quite “close” in terms of fit, however. The linear model ($\lambda_1 = \lambda_2 = 1$ in equation (1)) has L(1, 1) = –409.04, and so is contained within the 75% confidence region, while the inverse model ($\lambda_1 = 1$, $\lambda_2 = -1$) has L(1, –1) = –409.95 and is thus contained within a 95% confidence region. The double-logarithmic model ($\lambda_1 = \lambda_2 = 0$ in equation (1) and a functional form that is often used in regression analysis), however, has L(0, 0) = –501.59 and so is excluded from all conventional confidence regions.
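The confidence-region arithmetic can be reproduced directly from the reported log-likelihood values for the linear, inverse and double-logarithmic models. The sketch below uses the closed-form quantile of the chi-squared distribution with two degrees of freedom, whose distribution function is 1 – exp(–q/2); the function and dictionary names are illustrative.

```python
import math

def chi2_2df_quantile(p):
    """Quantile of chi-squared with 2 df, whose CDF is 1 - exp(-q/2)."""
    return -2.0 * math.log(1.0 - p)

L_MAX = -407.83   # reported log-likelihood at the ML estimates (0.92, 0.01)
REPORTED = {
    "linear (1, 1)": -409.04,
    "inverse (1, -1)": -409.95,
    "double-log (0, 0)": -501.59,
}

def smallest_region(loglik, l_max=L_MAX, levels=(0.75, 0.95)):
    """Return the likelihood-ratio statistic and the first level that contains it."""
    lr = 2.0 * (l_max - loglik)
    for level in levels:
        if lr < chi2_2df_quantile(level):
            return lr, level
    return lr, None

for name, loglik in REPORTED.items():
    lr, level = smallest_region(loglik)
    where = f"inside the {level:.0%} region" if level else "outside the 95% region"
    print(f"{name}: LR = {lr:.2f}, {where}")
```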
The fitted regressions for the linear, semi-logarithmic and inverse functional forms are
$$\hat y_i = -24.9 + 1.73\,x_i \qquad (4)$$
$$\hat y_i = -126.2 + 45.0\,\ln x_i \qquad (5)$$
$$\hat y_i = 64.3\,(x_i - 17.6)/x_i \qquad (6)$$
Diagnostic checks on the residuals of each of these regressions found no evidence of heteroskedasticity or non-normality, as was also true for the residuals from the regression using the ML estimates.
At this point students could be asked to consider alternative non-linear specifications. A plausible competitor would be a polynomial in $x_i$, so that students may fit the quadratic regression:
$$\hat y_i = -42.5 + 3.08\,x_i - 0.025\,x_i^2 \qquad (7)$$
Figure 3 shows the implied semi-logarithmic, quadratic and linear functions for 10 ≤ BMI ≤ 50, bearing in mind that the range of BMI values in the sample used for estimation is 18.1 to 39.1. The three functions are almost identical over the central region of the BMI range (20 to 30), but the linear model is a poor approximation to the semi-logarithmic outside of this interval. The quadratic, on the other hand, provides a good approximation to the semi-logarithmic over the entire observed range of BMI values, even though, as students can check, the quadratic term in (7) is insignificant (its t-ratio is just –1.36). Since the semi-logarithmic function also produces a superior fit to the quadratic (the error variance of the latter is 5.114²) and contains one fewer parameter, we prefer the former as a better representation of the relationship between percent body fat and BMI.
Figure 3. Semi-logarithmic (bodyfat% = –126.2 + 45.0ln(BMI)), quadratic (bodyfat% = –42.5 + 3.08BMI – 0.025BMI²), and linear (bodyfat% = –24.9 + 1.73BMI) functions.
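The comparison in Figure 3 can be reproduced numerically from the rounded coefficients given in the caption. The sketch below tabulates the three functions at a few BMI values and measures how far the linear and quadratic curves stray from the semi-logarithmic one over the observed range (18.1 to 39.1); the function and variable names are illustrative, and the results inherit the rounding of the published coefficients.

```python
import numpy as np

def linear(bmi):
    return -24.9 + 1.73 * bmi

def semilog(bmi):
    return -126.2 + 45.0 * np.log(bmi)

def quadratic(bmi):
    return -42.5 + 3.08 * bmi - 0.025 * bmi**2

# Tabulate the three fitted curves inside and outside the sample range.
for bmi in (10.0, 20.0, 25.0, 30.0, 50.0):
    print(f"BMI {bmi:4.0f}: semi-log {semilog(bmi):6.1f}  "
          f"quadratic {quadratic(bmi):6.1f}  linear {linear(bmi):6.1f}")

# Largest absolute gap from the semi-log curve over the observed BMI range.
observed = np.linspace(18.1, 39.1, 211)
gap_lin = np.max(np.abs(linear(observed) - semilog(observed)))
gap_quad = np.max(np.abs(quadratic(observed) - semilog(observed)))
print(f"max gap, linear vs semi-log:    {gap_lin:.2f}")
print(f"max gap, quadratic vs semi-log: {gap_quad:.2f}")
```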
Given a BMI value $x_f$, the fitted linear regression yields the prediction and associated standard error
$$\hat y_f = \hat\beta_0 + \hat\beta_1 x_f, \qquad se(\hat y_f) = \hat\sigma \left[ 1 + \frac{1}{n} + \frac{(x_f - \bar x)^2}{\sum (x_i - \bar x)^2} \right]^{1/2} \qquad (8)$$
The one-standard error bounds for $y_f$ are then calculated as $\hat y_f \pm se(\hat y_f)$. Students may be encouraged to provide interpretations of these bounds. For example, the current U.S. Dietary guidelines define the range 18.5 < BMI < 25 to be “healthy,” 25 < BMI < 30 to be “overweight,” and higher values of BMI to be “obese” (see Kuczmarski and Flegal, 2000, table 2). Using the linear model, these cut-offs predict body fat percentages of $\hat y_f$ = 7.1, 18.3 and 27.0 for $x_f$ = 18.5, 25 and 30, respectively. For the data set here, we have $\bar x$ = 25.3 and $\sum (x_i - \bar x)^2$ = 2787.8. The one-standard error bounds (68% prediction intervals) are then calculated as (1.9, 12.3), (13.2, 23.4) and (21.8, 32.2). These bounds, along with the predicted values, are plotted in Figure 4 and show that, at this level of confidence, a BMI value can only predict body fat within a range of approximately 10 percentage points with 68% accuracy (i.e., if a sequence of forecasts were made given $x_f$, the true value $y_f$ would only be contained in 68% of the calculated prediction intervals). This reflects the modest strength of the fitted relationship given by the R² value of just 0.56. Students can then be asked to repeat the calculations with the semi-logarithmic model and comment on any differences. Here the 68% prediction intervals are (–0.1, 10.3), (13.5, 23.7) and (21.8, 32.0), respectively, so that the prediction intervals are shifted downwards, relative to the linear model, for BMI values further from the sample mean, reflecting the curvature of the semi-logarithmic function.
Figure 4. Scatterplot of body fat and BMI with predicted values and one standard error bounds from the fitted linear equation.
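The one-standard error bounds for the linear model can be recomputed from the summary quantities quoted in the text. The sketch below assumes the usual simple-regression prediction standard error, a residual standard error of 5.123 (the square root of the reported residual error variance) and n = 251 (the 252 cases less the omitted case 39). Because the published coefficients are rounded, the results agree with the reported bounds only to about one decimal place.

```python
import math

# Summary quantities quoted in the text (rounded as published).
B0, B1 = -24.9, 1.73      # fitted linear regression
SIGMA = 5.123             # residual standard error
N = 251                   # assumed: 252 cases less the omitted case 39
X_BAR = 25.3              # sample mean of BMI
SXX = 2787.8              # sum of squared deviations of BMI from its mean

def prediction_bounds(xf, k=1.0):
    """Predicted body fat at BMI xf with +/- k standard error bounds."""
    y_hat = B0 + B1 * xf
    se = SIGMA * math.sqrt(1.0 + 1.0 / N + (xf - X_BAR) ** 2 / SXX)
    return y_hat, y_hat - k * se, y_hat + k * se

for xf in (18.5, 25.0, 30.0):
    y_hat, lower, upper = prediction_bounds(xf)
    print(f"BMI {xf:4.1f}: predicted {y_hat:4.1f}, bounds ({lower:.1f}, {upper:.1f})")
```

Setting `k=1.96` instead of the default gives approximate 95% prediction intervals.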
A related question to ask students is: if we are given a value $y_f$, what is the value $x_f$ that could have given rise to it? An answer to this may be obtained by inverting (8) to give
$$\hat x_f = (y_f - \hat\beta_0)/\hat\beta_1$$
Although this is the ML estimate of $x_f$, it is biased because, in general,
$$E(\hat x_f) \neq x_f$$
(see, for example, Seber and Lee, 2003, section 6.1.5). An alternative estimate is that obtained from estimating the reverse, or inverse, regression of x on y:
$$\tilde x_f = \hat\delta_0 + \hat\delta_1 y_f$$
where $\hat\delta_0$ and $\hat\delta_1$ are the OLS estimates of the inverse regression. Predicting $x_f$ from $y_f$ is often referred to as calibration. The inverse regression estimator $\tilde x_f$ is typically thought to be appropriate when the x observations are regarded as a random sample from a population: this is referred to as the “natural” or “random” calibration problem. The estimator $\hat x_f$ is appropriate when the x values may be taken as fixed by the design of the experiment: this is “controlled” calibration. Osborne (1991) presents a historical review of statistical calibration, which has given rise to confusion and argument for many years, while Brown (1993, chapter 2.3) is a useful reference. Students could be asked to discuss which of the two calibration problems the data set most appropriately falls under.
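The two calibration estimators can be compared on simulated data. The sketch below is an illustration on synthetic observations, not the body fat data, and the function name `calibration_estimates` is hypothetical. It also checks a standard algebraic fact about simple regression: measured from the sample mean of x, the inverse-regression estimate equals the classical estimate multiplied by the squared sample correlation, so it is always pulled towards the mean.

```python
import numpy as np

def calibration_estimates(x, y, yf):
    """Classical (invert y-on-x) and inverse (x-on-y) estimates of x given yf."""
    b1, b0 = np.polyfit(x, y, 1)   # regression of y on x
    d1, d0 = np.polyfit(y, x, 1)   # reverse regression of x on y
    classical = (yf - b0) / b1
    inverse = d0 + d1 * yf
    return classical, inverse

# Synthetic data (illustration only).
rng = np.random.default_rng(2)
x = rng.uniform(18.0, 40.0, size=250)
y = -24.9 + 1.73 * x + rng.normal(0.0, 5.0, size=250)

classical, inverse = calibration_estimates(x, y, yf=15.0)
r2 = np.corrcoef(x, y)[0, 1] ** 2
shrink = (inverse - x.mean()) / (classical - x.mean())
print(f"classical {classical:.2f}, inverse {inverse:.2f}")
print(f"shrinkage factor {shrink:.3f}, squared correlation {r2:.3f}")
```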
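Error bounds (prediction intervals) for $x_f$ using the inverse regression estimate $\tilde x_f$ are straightforwardly calculated by using the formulae in (8) with x and y interchanged. A more challenging question is the construction of a prediction interval for $x_f$ using the ML estimate $\hat x_f$. The approach developed in Seber and Lee (2003, chapter 6.1.5), for example, considers the error made in predicting $y_f$, $y_f - \hat y_f$, which can be written as
$$y_f - \hat y_f = y_f - \hat\beta_0 - \hat\beta_1 x_f = y_f - \bar y - \hat\beta_1 (x_f - \bar x)$$
the last equality being obtained using $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$. For large n, and defining $d_f = x_f - \bar x$, the ratio
$$\frac{y_f - \bar y - \hat\beta_1 d_f}{\hat\sigma \left[ 1 + 1/n + d_f^2 / \sum (x_i - \bar x)^2 \right]^{1/2}}$$
will be distributed as standard normal. Thus
$$P\left[ \frac{(y_f - \bar y - \hat\beta_1 d_f)^2}{\hat\sigma^2 \left( 1 + 1/n + d_f^2 / \sum (x_i - \bar x)^2 \right)} \le \chi^2_\alpha(1) \right] = 1 - \alpha$$
where $\chi^2_\alpha(1)$ is the $\alpha$ percentage point of the chi-square distribution with one degree of freedom. The set of all values of $d_f$ satisfying the inequality
$$(y_f - \bar y - \hat\beta_1 d_f)^2 \le \chi^2_\alpha(1)\, \hat\sigma^2 \left[ 1 + 1/n + d_f^2 / \sum (x_i - \bar x)^2 \right]$$
will then provide a (1 – $\alpha$) confidence interval for the unknown $x_f$, with lower and upper bounds defined as $\bar x + d_1$ and $\bar x + d_2$. $d_1$ and $d_2$ are the solutions (roots) of the quadratic equation
$$\left[ \hat\beta_1^2 - \frac{\chi^2_\alpha(1)\, \hat\sigma^2}{\sum (x_i - \bar x)^2} \right] d_f^2 - 2 \hat\beta_1 (y_f - \bar y)\, d_f + (y_f - \bar y)^2 - \chi^2_\alpha(1)\, \hat\sigma^2 (1 + 1/n) = 0 \qquad (9)$$
It is possible for these roots to be complex if $\hat\beta_1$ is not significantly different from zero, in which case the regression line is close to being horizontal and any value of the regressor is acceptable. As can be seen from the fitted regression (4), however, $\hat\beta_1$ is highly significant as the 95% confidence interval for $\beta_1$ is 1.73 ± 0.20. The resultant interval is often called a discrimination interval rather than a prediction interval (Seber and Wild, 2003, page 146).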
Johnson (1996) reports a suggestion that 15% body fat is a maximum for good health for men, so it is interesting to calculate BMI prediction and discrimination intervals for this value of $y_f$. The estimated coefficients of the inverse regression of x on y are $\hat\delta_0$ = 19.22 and $\hat\delta_1$ = 0.32, so that the inverse regression estimate is
$$\tilde x_f = \hat\delta_0 + \hat\delta_1 (15) = 24.1$$
while the value of the ML estimate of $x_f$ is
$$\hat x_f = (15 + 24.9)/1.73 = 23.1$$
For the calculation of the intervals we require $\bar y$ and $\sum (y_i - \bar y)^2$, so that $se(\tilde x_f)$ = 2.225. A 95% prediction interval using $\tilde x_f$ = 24.1 is thus (19.7, 28.5). Using $\hat x_f$ = 23.1, for a 95% discrimination interval, with $\chi^2_{0.05}(1)$ = 3.84, the quadratic (9) is solved for $d_f = x_f - \bar x$.
The two roots of this equation are $d_1$ = –8.1 and $d_2$ = 3.6. The 95% discrimination limits for $x_f$ are thus 17.2 and 28.9, which show that the interval is asymmetric about $\hat x_f$. Thus, if $\hat x_f$ is used, then at the 95% level of confidence, a 15% body fat is consistent with a BMI ranging from 17.2 to 28.9, i.e., from a BMI below the current lower healthy BMI cut-off to a value close to the upper end of the overweight range. If $\tilde x_f$ is used, this range is a little narrower, running from 19.7 to 28.5. A similar calculation with the semi-logarithmic function obtains 95% limits of 20.2 and 28.3 (using $\tilde x_f$) and 18.3 and 28.9 (using $\hat x_f$). Students may be encouraged to discuss the implications of the width of these intervals for the efficiency of the BMI as an indicator of percent body fat.
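The discrimination limits can be recovered numerically from the summary quantities quoted in the text. The sketch below sets up the quadratic in the deviation d of the unknown BMI from the sample mean, obtained by squaring the standardised prediction error and bounding it by the chi-squared critical value. It assumes n = 251 and a residual standard error of 5.123, and it computes the sample mean of body fat from the fact that an OLS line passes through the point of means; with the rounded published coefficients the results match the reported values to about one decimal place.

```python
import math

B0, B1 = -24.9, 1.73      # fitted linear regression (rounded as published)
SIGMA = 5.123             # residual standard error
N = 251                   # assumed: 252 cases less the omitted case 39
X_BAR = 25.3              # sample mean of BMI
SXX = 2787.8              # sum of squared deviations of BMI from its mean

def discrimination_interval(yf, crit=3.84):
    """Solve the quadratic in d = xf - X_BAR; return (lower, upper, d1, d2)."""
    y_bar = B0 + B1 * X_BAR            # OLS line passes through the means
    k = crit * SIGMA**2
    a = B1**2 - k / SXX
    b = -2.0 * B1 * (yf - y_bar)
    c = (yf - y_bar) ** 2 - k * (1.0 + 1.0 / N)
    disc = b * b - 4.0 * a * c
    if a <= 0.0 or disc < 0.0:
        raise ValueError("slope too imprecise: no finite interval")
    d1 = (-b - math.sqrt(disc)) / (2.0 * a)
    d2 = (-b + math.sqrt(disc)) / (2.0 * a)
    return X_BAR + d1, X_BAR + d2, d1, d2

lower, upper, d1, d2 = discrimination_interval(15.0)
print(f"roots: {d1:.1f}, {d2:.1f}; limits: ({lower:.1f}, {upper:.1f})")
```

The guard on the leading coefficient reflects the case where the slope is not significantly different from zero, in which no finite interval exists.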
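The BMI is defined as weight (in kilograms) divided by the square of height (in metres), so that $\ln x_i = \ln w_i - 2 \ln h_i$, where $w_i$ and $h_i$ are the weight and height of the ith case. Thus, if the multiple regression
$$y_i = \gamma_0 + \gamma_1 \ln w_i + \gamma_2 \ln h_i + u_i \qquad (10)$$
is considered, equation (5) is obtained if the restriction $2\gamma_1 + \gamma_2 = 0$ is imposed. Students may then be asked to consider how they might test such a linear restriction. One approach would be to construct a confidence interval for $2\gamma_1 + \gamma_2$: a 100(1 – $\alpha$)% interval is given by
$$(2\hat\gamma_1 + \hat\gamma_2) \pm z_{\alpha/2} \left[ V(2\hat\gamma_1 + \hat\gamma_2) \right]^{1/2}$$
where $z_{\alpha/2}$ is the $\alpha/2$ percentage point of the standard normal distribution and
$$V(2\hat\gamma_1 + \hat\gamma_2) = 4 V(\hat\gamma_1) + V(\hat\gamma_2) + 4\, Cov(\hat\gamma_1, \hat\gamma_2)$$
(see, for example, Maddala, 1977, chapter 10.3). The Johnson (1996) data set contains data on weight and height, although measured in pounds and inches. After rescaling to kilograms and metres (multiplying by the factors 0.4536 and 0.0254 respectively), the following multiple regression was obtained
$$\hat y_i = \hat\gamma_0 + 44.8 \ln w_i - 101.3 \ln h_i$$
The variances and covariances of the estimated coefficients are estimated as $V(\hat\gamma_1)$ = 6.24, $V(\hat\gamma_2)$ = 103.63 and $Cov(\hat\gamma_1, \hat\gamma_2)$ = –13.23, respectively, so that $V(2\hat\gamma_1 + \hat\gamma_2)$ = 75.67. Thus an approximate 95% confidence interval (using $z_{0.025}$ = 1.96) for $2\gamma_1 + \gamma_2$ is calculated to be –11.7 ± 17.0, which includes 0 and thus provides evidence in favour of using the BMI as a “composite” weight-height index. Note, however, that an approximate 68% interval (using $z_{0.159}$ = 1) is –11.7 ± 8.7, which does not include 0.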
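The interval for the restriction can be verified with a few lines of arithmetic. The sketch below uses the slope estimates (44.8 on log weight, –101.3 on log height) together with the variances and covariance quoted in the text; the function name `restriction_interval` is illustrative.

```python
import math

G1, G2 = 44.8, -101.3            # slope estimates on ln(weight), ln(height)
V1, V2, COV12 = 6.24, 103.63, -13.23

def restriction_interval(z):
    """Estimate, variance and half-width for the combination 2*g1 + g2."""
    estimate = 2.0 * G1 + G2
    variance = 4.0 * V1 + V2 + 4.0 * COV12
    half_width = z * math.sqrt(variance)
    return estimate, variance, half_width

for label, z in (("95%", 1.96), ("68%", 1.0)):
    est, var, half = restriction_interval(z)
    contains_zero = est - half < 0.0 < est + half
    print(f"{label}: {est:.1f} +/- {half:.1f} (variance {var:.2f}), "
          f"contains 0: {contains_zero}")
```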
Earlier research on “weight for height” indices by Benn (1971) considered power indices of the form weight/height^p (see also Flegal, 1990). The testing approach outlined above may be utilised to construct a range of values for the exponent p that are consistent with the data. The power index implies the linear combination $p\gamma_1 + \gamma_2 = 0$ between the slope coefficients of equation (10), where $\gamma_1$ and $\gamma_2$ are the coefficients on $\ln w_i$ and $\ln h_i$, which can then be written as
$$y_i = \gamma_0 + \gamma_1 (\ln w_i - p \ln h_i) + u_i$$
The exponent can be estimated as $\hat p = -\hat\gamma_2 / \hat\gamma_1$. The variance of $\hat p$ is given by
$$V(\hat p) = \hat\gamma_1^{-2} \left[ V(\hat\gamma_2) + 2 \hat p\, Cov(\hat\gamma_1, \hat\gamma_2) + \hat p^2 V(\hat\gamma_1) \right]$$
Using the estimates $\hat\gamma_1$ = 44.8 and $\hat\gamma_2$ = –101.3, we can calculate $\hat p$ = 2.26 and $V(\hat p)$ = 0.038. Thus an approximate 95% confidence interval for p is 2.26 ± 1.96(0.038)^{1/2} = 2.26 ± 0.38. This interval contains p = 2, thus again confirming the choice of the BMI as an appropriate “Benn power-index”, but excludes p = 1.5, a value of the exponent suggested by early studies of the weight-height relationship but not currently recommended (see Kuczmarski and Flegal, 2000).
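The exponent calculation can be checked in the same way. The sketch below assumes the usual first-order (delta-method) approximation for the variance of a ratio of coefficients, which reproduces the reported value of 0.038, and it takes 6.24, 103.63 and –13.23 as the variances of the log-weight and log-height coefficients and their covariance; the function name `exponent_interval` is illustrative.

```python
import math

G1, G2 = 44.8, -101.3            # slope estimates on ln(weight), ln(height)
V1, V2, COV12 = 6.24, 103.63, -13.23

def exponent_interval(z=1.96):
    """Estimate p = -g2/g1 with a delta-method variance and interval."""
    p_hat = -G2 / G1
    variance = (V2 + 2.0 * p_hat * COV12 + p_hat**2 * V1) / G1**2
    half_width = z * math.sqrt(variance)
    return p_hat, variance, p_hat - half_width, p_hat + half_width

p_hat, variance, lower, upper = exponent_interval()
print(f"p = {p_hat:.2f}, variance = {variance:.3f}, "
      f"95% interval = ({lower:.2f}, {upper:.2f})")
```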
Benn, R.J. (1971), “Some mathematical properties of weight for height indices as measures of adiposity,” British Journal of Preventive and Social Medicine, 25, 42-50.
Box, G.E.P. and Cox, D.R. (1964), “An analysis of transformations,” Journal of the Royal Statistical Society, Series B, 26, 211-243.
Brown, P.J. (1993), Measurement, Regression, and Calibration, Oxford, U.K.: Oxford University Press.
Flegal, K.M. (1990), “Ratio of actual to predicted weight as an alternative to a power-type weight-height index (Benn index),” American Journal of Clinical Nutrition, 51, 540-547.
Gray, D.S. and Fujioka, K. (1991), “Use of relative weight and body mass index for the determination of adiposity,” Journal of Clinical Epidemiology, 44, 545-550.
Johnson, R.W. (1996), “Fitting percentage of body fat to simple body measurements,” Journal of Statistics Education [Online], 4(1). jse.amstat.org/v4n1/datasets.johnson.html
Kuczmarski, R.J. and Flegal, K.M. (2000), “Criteria for definition of overweight in transition: background and recommendations for the United States,” American Journal of Clinical Nutrition, 72, 1074-1081.
Maddala, G.S. (1977), Econometrics, McGraw-Hill.
Osborne, C. (1991), “Statistical calibration: a review,” International Statistical Review, 59, 309-336.
Sakia, R.M. (1992), “The Box-Cox transformation technique,” The Statistician, 41, 169-178.
Seber, G.A.F. and Lee, A.J. (2003), Linear Regression Analysis, 2nd Edition, New York: John Wiley & Sons.
Seber, G.A.F. and Wild, C.J. (2003), Nonlinear Regression, New York: John Wiley & Sons.
Webster, J.D., Hesp, R. and Garrow, J.S. (1984), “The composition of excess weight in obese women estimated by body density, total body water and total body potassium,” Human Nutrition: Clinical Nutrition, 38C, 299-306.
Terence C. Mills
Department of Economics