Terence C. Mills
Loughborough University
Journal of Statistics Education Volume 13, Number 2 (2005), jse.amstat.org/v13n2/datasets.mills.html
Copyright © 2005 by Terence C. Mills, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.
Key Words: Body Mass Index; Functional form; Prediction; Regression.
A scatterplot of all 252 pairs of observations is shown in Figure 1. A traditional starting point for students analysing the relationship between percent body fat and the BMI is to fit a linear regression. Two such regressions are shown superimposed on Figure 1: a fit to all 252 cases and a fit with case 39 omitted from the calculations. This case has a BMI of 48.9 associated with a body fat percentage of 33.8 and is seen to be an outlier, both pulling the fitted line towards it and giving an impression of a distinct curvilinear relationship between the two variables. It will therefore be omitted from further modelling, but students could be encouraged to discuss whether this is the best course of action for dealing with an aberrant observation and whether alternative solutions could be considered.
Figure 1. Scatterplot of body fat percentage against BMI with linear regressions fitted to all observations (bodyfat% = –20.4 + 1.55BMI) and with the outlier removed (bodyfat% = –24.9 + 1.73BMI) superimposed.
The inverse relationship bodyfat% = a(BMI – b)/BMI may be more appropriate on theoretical grounds. Here b represents the theoretical BMI associated with 0% body fat,
and a represents the percentage of excess body weight which is fat (see
Gray and Fujioka, 1991, pages 548-9, for more detailed interpretation of
this relationship). With a and b both positive, the first and second derivatives of this function with
respect to BMI are positive and negative respectively. Thus the model implies that percent body fat increases, but at a
decreasing rate, with increasing BMI and is bounded by the value a. An alternative model which would also capture
this type of nonlinearity is the semi-logarithmic relationship bodyfat% = c + dln(BMI) with d positive.
Students can fit these two nonlinear functions to the data and assess their fits relative to that of the linear model.
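The derivative claims above can be checked directly. The following is a short worked derivation, assuming the inverse function has the form described by the interpretation of a and b, with x standing for BMI (the function names f and g are introduced here only for illustration):

```latex
f(x) = \frac{a(x - b)}{x} = a - \frac{ab}{x}, \qquad
f'(x) = \frac{ab}{x^{2}} > 0, \qquad
f''(x) = -\frac{2ab}{x^{3}} < 0, \qquad
\lim_{x \to \infty} f(x) = a .

g(x) = c + d\ln x, \qquad
g'(x) = \frac{d}{x} > 0, \qquad
g''(x) = -\frac{d}{x^{2}} < 0 .
```

Both functions therefore increase at a decreasing rate when a, b and d are positive, but only the inverse function is bounded above (by a); the semi-logarithmic function grows without limit.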
Figure 2 presents the fits of the three functions graphically. Students can note that the estimates of the inverse function parameters are $a = 64.3 \pm 2.6\%$ and $b = 17.6 \pm 0.3\%$, where we use the notation $\hat\theta \pm se(\hat\theta)$, where $\hat\theta$ and $se(\hat\theta)$ are the ordinary least squares (OLS) estimate and associated standard error of a parameter $\theta$, i.e., that a BMI of 17.6 is associated with 0% body fat and that 64.3% of excess body weight is fat. It should be noted that, while estimates of the parameters of this model can be obtained by regressing percent body fat on the inverse of the BMI and algebraically calculating the estimates from the regression coefficients, parameter standard errors can only be obtained using a dedicated nonlinear regression routine. The $R^2$ statistics from the three models are: linear 0.560, inverse 0.557, and semi-log 0.563, with residual error variances $5.123^2$, $5.141^2$ and $5.105^2$, respectively. Thus, on goodness of fit grounds, the semi-logarithmic model is to be preferred. The fits, however, are very similar over the central range of BMI values, only differing substantially for the very highest BMI values, so that a more formal method of model selection could be considered.
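The remark that the inverse model can be estimated by regressing percent body fat on 1/BMI can be sketched as follows. This is a minimal illustration, not the author's code: the function names are hypothetical, and the demonstration values are synthetic and noise-free (they are not the Johnson data), generated from the fitted values a = 64.3 and b = 17.6 only to show that the algebra recovers them.

```python
def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line y = c0 + c1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def fit_inverse_model(bmi, bodyfat):
    """Estimate a and b in bodyfat = a * (BMI - b) / BMI.

    Expanding gives bodyfat = a - (a * b) / BMI, so a regression of bodyfat
    on 1 / BMI has intercept a and slope -a * b.
    """
    intercept, slope = ols_simple([1.0 / v for v in bmi], bodyfat)
    a = intercept
    b = -slope / a
    return a, b


# Synthetic, noise-free check values (an assumption for illustration only).
bmi_demo = [19.0, 22.0, 25.0, 28.0, 31.0, 34.0]
bodyfat_demo = [64.3 * (v - 17.6) / v for v in bmi_demo]
a_hat, b_hat = fit_inverse_model(bmi_demo, bodyfat_demo)
print(round(a_hat, 1), round(b_hat, 1))
```

This recovers point estimates only; as the text notes, standard errors for a and b require a nonlinear regression routine.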
Figure 2. Fitted linear, semi-logarithmic (bodyfat% = –126.2 + 45.0ln(BMI)), and inverse (bodyfat% = 64.3(BMI – 17.6)/BMI) functions.
One approach is to note that all three functional forms may be “nested” within a general functional form by using the Box and Cox (1964) family of power transformations, which are defined for the generic variable Z as
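$$Z^{(\lambda)} = \begin{cases} \dfrac{Z^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[1ex] \ln Z, & \lambda = 0 \end{cases}$$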
If we denote the $i$th observations, $i = 1, 2, \ldots, n$, on percent body fat and the BMI as $y_i$ and $x_i$, respectively, then we may consider the nonlinear regression model
$$y_i^{(\lambda_1)} = \beta_0 + \beta_1 x_i^{(\lambda_2)} + u_i \tag{1}$$
where $y_i^{(\lambda_1)}$ and $x_i^{(\lambda_2)}$ are the Box-Cox transforms of $y_i$ and $x_i$, and $u_i$ is an error term, assumed to be normally and independently distributed with zero mean and constant variance $\sigma^2$. The linear model is obtained when $\lambda_1 = \lambda_2 = 1$; setting $\lambda_1 = 1$ and $\lambda_2 = 0$ defines the semi-logarithmic model, while setting $\lambda_1 = 1$ and $\lambda_2 = -1$ defines the inverse model (in each case with a redefined intercept). However, arbitrary values of the power transformation parameters need not be imposed upon (1): rather, they may be estimated along with the other parameters of the model, and tests of the hypotheses implied by the alternative functional forms may then be performed in order to discriminate between them.
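The procedure for doing this is to recognise that, for any given values of $\lambda_1$ and $\lambda_2$, estimates of $\beta_0$ and $\beta_1$ conditional upon these values may be obtained from the regression of equation (1). Maximum likelihood (ML) estimates of the power transformation parameters, denoted $\hat\lambda_1$ and $\hat\lambda_2$, and hence of the $\beta$'s, are found by maximizing the concentrated log-likelihood, defined as
$$L(\lambda_1, \lambda_2) = -\frac{n}{2}\ln\hat\sigma^2(\lambda_1, \lambda_2) + (\lambda_1 - 1)\sum_{i=1}^{n}\ln y_i \tag{2}$$
where
$$\hat\sigma^2(\lambda_1, \lambda_2) = \frac{1}{n}\sum_{i=1}^{n}\hat u_i^2$$
is an estimate of the “conditional” error variance, the $\hat u_i$ being the residuals from the conditional regression (1). The term “concentrated” is used because the maximization is, in fact, a step-wise procedure in that $\hat\sigma^2(\lambda_1, \lambda_2)$ is first obtained from a linear regression with fixed values of $\lambda_1$ and $\lambda_2$, with the second step being to maximize $L(\lambda_1, \lambda_2)$ over all values of $(\lambda_1, \lambda_2)$.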
For an extended discussion of concentrated likelihood methods, see Seber
and Wild (2003, pages 37-42), and for a review of the Box-Cox transformation, see
Sakia (1992).
This maximization may conveniently be computed by searching over a grid of $(\lambda_1, \lambda_2)$ values: advanced students can be encouraged to develop routines for carrying out this two-dimensional grid search, which involves calculating the power transformations, saving regression output to compute the log-likelihood, and writing looping procedures. Students may also experiment with plotting the contours of the likelihood function so constructed, which will provide a graphical perspective on the likely precision with which the transformation parameters, $\lambda_1$ and $\lambda_2$, are estimated. This precision may also be examined by calculating the confidence region obtained by using the result that
$$2\left[L(\hat\lambda_1, \hat\lambda_2) - L(\lambda_1, \lambda_2)\right] \tag{3}$$
is approximately distributed as chi-squared with two degrees of freedom (Box and Cox, 1964). Thus, for example, 95% and 75% confidence regions are defined by $2[L(\hat\lambda_1, \hat\lambda_2) - L(\lambda_1, \lambda_2)] < 5.99$ and $2[L(\hat\lambda_1, \hat\lambda_2) - L(\lambda_1, \lambda_2)] < 2.77$, respectively.
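The two-dimensional grid search described above might be organised along the following lines. This is a sketch under stated assumptions rather than the author's routine: the concentrated log-likelihood is taken to be the standard Box-Cox form (minus n/2 times the log of the ML error variance, plus the Jacobian term for the transformation of y, with additive constants dropped), all function names are hypothetical, and the demonstration data are synthetic (not the Johnson data).

```python
import math
import random


def box_cox(z, lam):
    """Box-Cox power transformation of a positive value z."""
    if abs(lam) < 1e-12:
        return math.log(z)
    return (z ** lam - 1.0) / lam


def concentrated_loglik(x, y, lam1, lam2):
    """Concentrated log-likelihood of y^(lam1) = b0 + b1 * x^(lam2) + u."""
    n = len(y)
    ty = [box_cox(v, lam1) for v in y]
    tx = [box_cox(v, lam2) for v in x]
    tx_bar, ty_bar = sum(tx) / n, sum(ty) / n
    s_xx = sum((a - tx_bar) ** 2 for a in tx)
    s_xy = sum((a - tx_bar) * (b - ty_bar) for a, b in zip(tx, ty))
    b1 = s_xy / s_xx                      # conditional OLS slope
    b0 = ty_bar - b1 * tx_bar             # conditional OLS intercept
    ssr = sum((b - b0 - b1 * a) ** 2 for a, b in zip(tx, ty))
    jacobian = (lam1 - 1.0) * sum(math.log(v) for v in y)
    return -0.5 * n * math.log(ssr / n) + jacobian


def grid_search(x, y, grid):
    """Evaluate the log-likelihood on grid x grid and return the maximiser."""
    surface = {(l1, l2): concentrated_loglik(x, y, l1, l2)
               for l1 in grid for l2 in grid}
    best_pair = max(surface, key=surface.get)
    return best_pair, surface[best_pair], surface


# Synthetic illustration only: a semi-log relationship with bounded noise,
# chosen so that every y value stays positive (required for the log terms).
rng = random.Random(0)
x_demo = [20.0 + 0.1 * k for k in range(200)]
y_demo = [-126.2 + 45.0 * math.log(v) + rng.uniform(-5.0, 5.0) for v in x_demo]
grid = [-1.0 + 0.25 * k for k in range(13)]   # -1.0, -0.75, ..., 2.0
(lam1_hat, lam2_hat), loglik_max, surface = grid_search(x_demo, y_demo, grid)
print(lam1_hat, lam2_hat, round(loglik_max, 2))
```

The `surface` dictionary holds the log-likelihood at every grid point, so it can be passed to a contour-plotting routine or compared against the chi-squared thresholds to trace out a confidence region. A finer grid around the coarse maximum would be needed to obtain estimates to two decimal places.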
The ML estimates are obtained as $\hat\lambda_1 = 0.92$ and $\hat\lambda_2 = 0.01$, with $L(0.92, 0.01) = -407.83$. Since $L(1, 0) = -407.17$, it is clear that the semi-logarithmic model is contained within any conventional confidence region. The other functional forms are quite “close” in terms of fit, however. The linear model ($\lambda_1 = \lambda_2 = 1$ in equation (1)) has $L(1, 1) = -409.04$, and so is contained within the 75% confidence region, while the inverse model ($\lambda_1 = 1$, $\lambda_2 = -1$) has $L(1, -1) = -409.95$ and is thus contained within a 95% confidence region. The double-logarithmic model ($\lambda_1 = \lambda_2 = 0$ in equation (1) and a functional form that is often used in regression analysis), however, has $L(0, 0) = -501.59$ and so is excluded from all conventional confidence regions.
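The region-membership statements for the linear, inverse and double-logarithmic models can be reproduced from the reported log-likelihood values. The sketch below assumes the likelihood ratio statistic is twice the difference between the maximised log-likelihood and the value at the hypothesised pair, and uses the fact that the chi-squared distribution with two degrees of freedom has the closed-form upper tail exp(–c/2), so no statistical library is needed. The function names are hypothetical.

```python
import math

# Concentrated log-likelihood values as reported in the text.
loglik_max = -407.83                      # at the ML estimates (0.92, 0.01)
reported = {
    "linear (1, 1)": -409.04,
    "inverse (1, -1)": -409.95,
    "double-log (0, 0)": -501.59,
}


def chi2_2df_critical(alpha):
    """Upper-alpha point of chi-squared(2), from P(X > c) = exp(-c / 2)."""
    return -2.0 * math.log(alpha)


def lr_statistic(loglik):
    """Twice the drop in log-likelihood relative to the maximum."""
    return 2.0 * (loglik_max - loglik)


for name, value in reported.items():
    lr = lr_statistic(value)
    print(f"{name}: LR = {lr:.2f}, "
          f"in 75% region: {lr < chi2_2df_critical(0.25)}, "
          f"in 95% region: {lr < chi2_2df_critical(0.05)}")
```

The critical values produced by `chi2_2df_critical` are approximately 2.77 and 5.99 for the 75% and 95% regions.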
The fitted regressions for the linear, semi-logarithmic and inverse functional forms are
![]() | (4) |
![]() | (5) |
and
![]() | (6) |
where parameter standard errors are shown in parentheses. Diagnostic checks on the residuals of each of these regressions found no evidence of heteroskedasticity or non-normality in any of the regressions, as was also true for the residuals from the regression using the ML estimates.
At this point students could be asked to consider alternative non-linear specifications. A plausible competitor would be a polynomial in xi, so that students may fit the quadratic regression:
![]() | (7) |
Figure 3 shows the implied semi-logarithmic, quadratic and linear functions for $10 \le \text{BMI} \le 50$, bearing in mind that the range of BMI values in the sample used for estimation is 18.1 to 39.1. The three functions are almost identical over the central region of the BMI range (20 to 30), but the linear model is a poor approximation to the semi-logarithmic outside of this interval. The quadratic, on the other hand, provides a good approximation to the semi-logarithmic over the entire observed range of BMI values, even though, as students can check, the quadratic term in (7) is insignificant (its t-ratio is just –1.36). Since the semi-logarithmic function also produces a superior fit to the quadratic (the error variance of the latter is $5.114^2$) and contains one fewer parameter, we prefer the former as a better representation of the relationship between percent body fat and BMI.
Figure 3. Semi-logarithmic (bodyfat% = –126.2 + 45.0ln(BMI)), quadratic (bodyfat% = –42.5 + 3.08BMI – 0.025BMI²), and linear (bodyfat% = –24.9 + 1.73BMI) functions.
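The comparison shown in Figure 3 can be tabulated from the coefficients given in the caption. This is a small illustrative sketch: the coefficients are the rounded values printed in the caption, so the numbers will differ slightly from those based on unrounded estimates, particularly for the quadratic at large BMI values.

```python
import math


def semi_log(bmi):
    return -126.2 + 45.0 * math.log(bmi)


def quadratic(bmi):
    return -42.5 + 3.08 * bmi - 0.025 * bmi ** 2


def linear(bmi):
    return -24.9 + 1.73 * bmi


# Evaluate the three fitted functions inside and outside the sample range.
for bmi in (10, 20, 25, 30, 40, 50):
    print(f"BMI {bmi:2d}: semi-log {semi_log(bmi):6.1f}  "
          f"quadratic {quadratic(bmi):6.1f}  linear {linear(bmi):6.1f}")
```

Within 20 to 30 the three columns stay close together, while at the extremes of the plotted range the linear predictions separate clearly from the semi-logarithmic ones.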
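Given a BMI value $x_f$, the linear model predicts the body fat percentage $y_f$, with associated prediction standard error, as
$$\hat y_f = \hat\beta_0 + \hat\beta_1 x_f, \qquad se(\hat y_f) = \hat\sigma\left[1 + \frac{1}{n} + \frac{(x_f - \bar x)^2}{\sum_{i=1}^{n}(x_i - \bar x)^2}\right]^{1/2} \tag{8}$$
where $\bar x = n^{-1}\sum_{i=1}^{n} x_i$ is the sample mean of the BMI and $\hat\sigma$ is the residual standard error of the regression.
The one-standard error bounds for $y_f$ are then calculated as $\hat y_f \pm se(\hat y_f)$. Students may be encouraged to provide interpretations of these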
bounds. For example, the current U.S. Dietary guidelines define the range 18.5 < BMI < 25 to be “healthy,”
25 < BMI < 30 to be “overweight,” and higher values of BMI to be “obese” (see
Kuczmarski and Flegal, 2000, table 2). Using the linear model, these
cut-offs predict body fat percentages of $y_f$ = 7.1, 18.3 and 27.0 for $x_f$ = 18.5, 25 and 30, respectively. For the data set here, we have $\bar x = 25.3$ and $\sum (x_i - \bar x)^2 = 2787.8$. The one-standard error bounds (68% prediction intervals) are then calculated as (1.9, 12.3), (13.2, 23.4) and (21.8, 32.2). These bounds, along with the predicted values, are plotted in Figure 4 and show that, at this level of confidence, a BMI value can only predict body fat within a range of approximately 10 percentage points with 68% accuracy (i.e., if a sequence of forecasts were made given $x_f$, the true value $y_f$ would only be contained in 68% of the calculated prediction intervals). This reflects the modest strength of the fitted relationship given by the $R^2$
value of just 0.56. Students can then be asked to repeat the calculations with the semi-logarithmic model and comment on
any differences. Here the 68% prediction intervals are (–0.1, 10.3), (13.5, 23.7) and (21.8, 32.0), respectively, so that
the prediction intervals are shifted downwards, relative to the linear model, for BMI values further from the sample mean,
reflecting the curvature of the semi-logarithmic function.
Figure 4. Scatterplot of body fat and BMI with predicted values and one standard error bounds from the fitted linear equation.
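The one-standard error bounds quoted for the linear model can be reproduced with a few lines of code. The sketch below uses the standard prediction standard error for simple regression and takes the coefficients, the sample mean of BMI and the sum of squared deviations from the text. Two inputs are assumptions: n = 251 (the 252 cases less the omitted case 39) and a residual standard error of 5.123 (read from the error variance reported for the linear model). Because the inputs are rounded, results may differ from the published bounds in the first decimal place.

```python
import math

BETA0, BETA1 = -24.9, 1.73      # fitted linear model from the text
X_BAR, S_XX = 25.3, 2787.8      # mean of BMI and sum of squared deviations
N = 251                         # assumption: 252 cases less the omitted case 39
SIGMA_HAT = 5.123               # assumption: residual standard error of the fit


def predict_with_bounds(x_f, z=1.0):
    """Return (prediction, lower, upper) for BMI value x_f.

    z = 1.0 gives one-standard error (roughly 68%) bounds.
    """
    y_hat = BETA0 + BETA1 * x_f
    se = SIGMA_HAT * math.sqrt(1.0 + 1.0 / N + (x_f - X_BAR) ** 2 / S_XX)
    return y_hat, y_hat - z * se, y_hat + z * se


for x_f in (18.5, 25.0, 30.0):
    y_hat, lower, upper = predict_with_bounds(x_f)
    print(f"BMI {x_f}: predicted {y_hat:.1f}, bounds ({lower:.1f}, {upper:.1f})")
```

Passing `z=1.96` instead gives approximate 95% bounds, which are roughly twice as wide.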
A related question to ask students is: if we are given a value $y_f$, what is the value $x_f$ that could have given rise to $y_f$? An answer to this may be obtained by inverting the fitted linear regression (8) to give
$$\hat x_f = \frac{y_f - \hat\beta_0}{\hat\beta_1}$$
Although this is the ML estimate of $x_f$, it is biased because, in general,
$$E(\hat x_f) \neq x_f$$
(see, for example, Seber and Lee, 2003, section 6.1.5). An alternative estimate is that obtained from estimating the reverse, or inverse, regression of x on y:
$$\tilde x_f = \hat\gamma_0 + \hat\gamma_1 y_f$$
where $\hat\gamma_0$ and $\hat\gamma_1$ are the OLS estimates of the inverse regression. Predicting $x_f$ from $y_f$ is often referred to as calibration. The inverse regression estimator $\tilde x_f$ is typically thought to be appropriate when the x observations are regarded as a random sample from a population: this is referred to as the “natural” or “random” calibration problem. The estimator $\hat x_f$
is appropriate when the x values may be taken as fixed by the design of the experiment: this is “controlled”
calibration. Osborne (1991) presents a historical review of
statistical calibration, which has given rise to confusion and argument for many years, while
Brown (1993, chapter 2.3) is a useful reference. Students could be asked
to discuss which of the two calibration problems the data set most appropriately falls under.
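Error bounds (prediction intervals) for $x_f$ using the inverse regression estimate $\tilde x_f$ are straightforwardly calculated by using the formulae in (8) with x and y interchanged. A more challenging question is the construction of a prediction interval for $x_f$ using the estimate $\hat x_f = (y_f - \hat\beta_0)/\hat\beta_1$. The approach developed in Seber and Lee (2003, chapter 6.1.5), for example, considers the error made in predicting $y_f$, $y_f - \hat y_f$, which can be written as
$$y_f - \hat y_f = y_f - \hat\beta_0 - \hat\beta_1 x_f = y_f - \bar y - \hat\beta_1(x_f - \bar x)$$
the last equality being obtained using $\hat\beta_0 = \bar y - \hat\beta_1\bar x$. For large n, and defining $d = x_f - \bar x$, the ratio
$$\frac{y_f - \bar y - \hat\beta_1 d}{\hat\sigma\left[1 + \dfrac{1}{n} + \dfrac{d^2}{\sum(x_i - \bar x)^2}\right]^{1/2}}$$
will be distributed as standard normal. Thus
$$P\left[\frac{(y_f - \bar y - \hat\beta_1 d)^2}{\hat\sigma^2\left(1 + \dfrac{1}{n} + \dfrac{d^2}{\sum(x_i - \bar x)^2}\right)} \le \chi^2_{1,\alpha}\right] = 1 - \alpha$$
where $\chi^2_{1,\alpha}$ is the $\alpha$ percentage point of the chi-square distribution with one degree of freedom. The set of all values of $d$ satisfying the inequality
$$(y_f - \bar y - \hat\beta_1 d)^2 \le \chi^2_{1,\alpha}\,\hat\sigma^2\left(1 + \frac{1}{n} + \frac{d^2}{\sum(x_i - \bar x)^2}\right)$$
will then provide a $(1 - \alpha)$ confidence interval for the unknown $x_f$, with lower and upper bounds defined as $\bar x + d_1$ and $\bar x + d_2$. $d_1$ and $d_2$ are the solutions (roots) of the quadratic equation
$$(y_f - \bar y - \hat\beta_1 d)^2 = \chi^2_{1,\alpha}\,\hat\sigma^2\left(1 + \frac{1}{n} + \frac{d^2}{\sum(x_i - \bar x)^2}\right)$$
i.e.,
$$d^2\left(\hat\beta_1^2 - \frac{\chi^2_{1,\alpha}\,\hat\sigma^2}{\sum(x_i - \bar x)^2}\right) - 2\hat\beta_1(y_f - \bar y)\,d + (y_f - \bar y)^2 - \chi^2_{1,\alpha}\,\hat\sigma^2\left(1 + \frac{1}{n}\right) = 0 \tag{9}$$
It is possible for these roots to be complex if $\hat\beta_1$ is not significantly different from zero, in which case the regression line is close to being horizontal and any value of the regressor is acceptable. As can be seen from the fitted regression (4), however, $\hat\beta_1$ is highly significant as the 95% confidence interval for $\beta_1$ is $1.73 \pm 0.20$. The resultant interval is often called a discrimination interval rather than a prediction interval (Seber and Wild, 2003, page 146).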
Johnson (1996) reports a suggestion that 15% body fat is a maximum for
good health for men, so it is interesting to calculate BMI prediction and discrimination intervals for this value of
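$y_f$. The estimated coefficients of the inverse regression of x on y are $\hat\gamma_0 = 19.22$ and $\hat\gamma_1 = 0.32$, so that the inverse regression estimate is
$$\tilde x_f = \hat\gamma_0 + 15\hat\gamma_1 = 24.1$$
while the value of $\hat x_f$ is
$$\hat x_f = \frac{15 + 24.9}{1.73} = 23.1$$
For the calculation of the intervals we require $\bar y$ and $\sum(y_i - \bar y)^2$, so that $se(\tilde x_f) = 2.225$. A 95% prediction interval using $\tilde x_f = 24.1$ is thus (19.7, 28.5). Using $\hat x_f = 23.1$, for a 95% discrimination interval, with $\chi^2_{1,0.05} = 3.84$, the quadratic (9) simplifies to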
![]() |
The two roots of this equation are $d_1 = -8.1$ and $d_2 = 3.6$. The 95% discrimination limits for $x_f$, obtained by adding each root to $\bar x = 25.3$, are thus 17.2 and 28.9, which show that the interval is asymmetric about $\hat x_f$. Thus, if $\hat x_f$ is used, then at the 95% level of confidence, a 15% body fat is consistent with a BMI ranging from 17.2 to 28.9, i.e., from a BMI below the current lower healthy BMI cut-off to a value close to the upper-end of the overweight range. If $\tilde x_f$ is used, this range is a little narrower, running from 19.7 to 28.5. A similar calculation with the semi-logarithmic function obtains 95% limits of 20.2 and 28.3 (using $\tilde x_f$) and 18.3 and 28.9 (using $\hat x_f$). Students may be encouraged to discuss the implications of the width
of these intervals for the efficiency of the BMI as an indicator of percent body fat.
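The discrimination limits for 15% body fat can be checked numerically. The sketch below solves the quadratic in the deviation d = x_f minus the sample mean of BMI that arises from squaring the standardised prediction error and setting it equal to the chi-squared critical value. Several inputs are assumptions not stated explicitly in the text: n = 251, a residual standard error of 5.123, and a sample mean of body fat computed from the fact that an OLS line passes through the point of sample means. The function names are hypothetical.

```python
import math

BETA0, BETA1 = -24.9, 1.73          # fitted linear model from the text
X_BAR, S_XX = 25.3, 2787.8          # mean of BMI and sum of squared deviations
N, SIGMA_HAT = 251, 5.123           # assumptions (see the paragraph above)
Y_BAR = BETA0 + BETA1 * X_BAR       # OLS line passes through the sample means


def classical_estimate(y_f):
    """Invert the fitted line to estimate the BMI that gives body fat y_f."""
    return (y_f - BETA0) / BETA1


def discrimination_interval(y_f, chi2_crit=3.84):
    """Solve a*d^2 + b*d + c = 0 for d = x_f - X_BAR and return BMI limits."""
    k = chi2_crit * SIGMA_HAT ** 2
    a = BETA1 ** 2 - k / S_XX
    b = -2.0 * BETA1 * (y_f - Y_BAR)
    c = (y_f - Y_BAR) ** 2 - k * (1.0 + 1.0 / N)
    if a <= 0:
        # Slope not significantly different from zero: no finite interval.
        raise ValueError("slope is not significant at this level")
    root = math.sqrt(b * b - 4.0 * a * c)
    d_low = (-b - root) / (2.0 * a)
    d_high = (-b + root) / (2.0 * a)
    return X_BAR + d_low, X_BAR + d_high


x_hat = classical_estimate(15.0)
low, high = discrimination_interval(15.0)
print(round(x_hat, 1), round(low, 1), round(high, 1))
```

The guard on the leading coefficient mirrors the remark in the text: when the slope is not significantly different from zero the quadratic has no pair of real roots bracketing a finite interval.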
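Since the BMI is weight divided by height squared,
$$x_i = \frac{w_i}{h_i^2}, \qquad \ln x_i = \ln w_i - 2\ln h_i$$
where $w_i$ and $h_i$ are the weight and height of the $i$th case. Thus, if the multiple regression
$$y_i = \delta_0 + \delta_1\ln w_i + \delta_2\ln h_i + u_i \tag{10}$$
is considered, equation (5) is obtained if the restriction $2\delta_1 + \delta_2 = 0$ is imposed. Students may then be asked to consider how they might test such a linear restriction. One approach would be to construct a confidence interval for $2\delta_1 + \delta_2$: a $100(1 - \alpha)\%$ interval is given by
$$2\hat\delta_1 + \hat\delta_2 \pm z_{\alpha/2}\left[V(2\hat\delta_1 + \hat\delta_2)\right]^{1/2}$$
where $z_{\alpha/2}$ is the $\alpha/2$ percentage point of the standard normal distribution and
$$V(2\hat\delta_1 + \hat\delta_2) = 4V(\hat\delta_1) + V(\hat\delta_2) + 4\,Cov(\hat\delta_1, \hat\delta_2)$$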
(see, for example, Maddala, 1977, chapter 10.3). The Johnson (1996) data set contains data on weight and height, although measured in pounds and inches. After rescaling to kilograms and metres (multiplying by the factors 0.4536 and 0.0254 respectively), the following multiple regression was obtained
![]() |
The variances and covariances of the estimated coefficients on log weight and log height, $\hat\delta_1$ and $\hat\delta_2$, are estimated as $V(\hat\delta_1) = 6.24$, $V(\hat\delta_2) = 103.63$ and $Cov(\hat\delta_1, \hat\delta_2) = -13.23$, respectively, so that $V(2\hat\delta_1 + \hat\delta_2) = 75.67$. Thus an approximate 95% confidence interval (using $z_{0.025} = 1.96$) for $2\delta_1 + \delta_2$ is calculated to be $-11.7 \pm 17.0$, which includes 0 and thus provides evidence in favour of using the BMI as a “composite” weight-height index. Note, however, that an approximate 68% interval (using $z_{0.159} = 1$) is $-11.7 \pm 8.7$, which does not include 0.
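The interval arithmetic for the linear restriction can be verified as follows. This is a small sketch using only numbers reported in the text; the variable names are hypothetical, and the assignment of 44.8 to log weight and –101.3 to log height is inferred from the reported point estimate of –11.7.

```python
import math

d1, d2 = 44.8, -101.3                        # coefficients on ln(weight), ln(height)
var_d1, var_d2, cov_d12 = 6.24, 103.63, -13.23

# Point estimate and variance of the linear combination 2*d1 + d2.
estimate = 2.0 * d1 + d2
variance = 4.0 * var_d1 + var_d2 + 4.0 * cov_d12

half_width_95 = 1.96 * math.sqrt(variance)
half_width_68 = 1.00 * math.sqrt(variance)

print(f"estimate {estimate:.1f}, variance {variance:.2f}")
print(f"95%: ({estimate - half_width_95:.1f}, {estimate + half_width_95:.1f})")
print(f"68%: ({estimate - half_width_68:.1f}, {estimate + half_width_68:.1f})")
```

The factor of 4 on the covariance comes from 2 × 2 × 1, the product of the two weights in the combination multiplied by the usual factor of two.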
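Earlier research on “weight for height” indices by Benn (1971) considered power indices of the form weight/height$^p$ (see also Flegal, 1990). The testing approach outlined above may be utilised to construct a range of values for the exponent p that are consistent with the data. The power index implies the linear combination $p\delta_1 + \delta_2 = 0$ between the slope coefficients of equation (10), which can then be written as
$$y_i = \delta_0 + \delta_1(\ln w_i - p\ln h_i) + u_i$$
The exponent can be estimated as $\hat p = -\hat\delta_2/\hat\delta_1$. The variance of $\hat p$ is given by
$$V(\hat p) = \left(\frac{\partial p}{\partial\delta_1}\right)^2 V(\hat\delta_1) + \left(\frac{\partial p}{\partial\delta_2}\right)^2 V(\hat\delta_2) + 2\,\frac{\partial p}{\partial\delta_1}\frac{\partial p}{\partial\delta_2}\,Cov(\hat\delta_1, \hat\delta_2)$$
Since
$$\frac{\partial p}{\partial\delta_1} = \frac{\delta_2}{\delta_1^2}, \qquad \frac{\partial p}{\partial\delta_2} = -\frac{1}{\delta_1}$$
using the estimates $\hat\delta_1 = 44.8$ and $\hat\delta_2 = -101.3$, we can calculate $\hat p = 2.26$ and $V(\hat p) = 0.038$. Thus an approximate 95% confidence interval for p is $2.26 \pm 1.96(0.038)^{1/2} = 2.26 \pm 0.38$. This interval contains p = 2, thus again confirming the choice of the BMI as an appropriate “Benn power-index”, but excludes p = 1.5, a value of the exponent suggested by early studies of the weight-height relationship but not currently recommended (see Kuczmarski and Flegal, 2000).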
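The variance of the estimated exponent can be checked with a first-order (delta method) approximation. This sketch assumes that approximation is the one intended, treats p as minus the ratio of the log height coefficient to the log weight coefficient, and uses the coefficient estimates, variances and covariance reported in the text. The variable names are hypothetical.

```python
import math

d1, d2 = 44.8, -101.3                        # coefficients on ln(weight), ln(height)
var_d1, var_d2, cov_d12 = 6.24, 103.63, -13.23

p_hat = -d2 / d1

# Partial derivatives of p = -d2 / d1, evaluated at the estimates.
dp_dd1 = d2 / d1 ** 2
dp_dd2 = -1.0 / d1

var_p = (dp_dd1 ** 2 * var_d1
         + dp_dd2 ** 2 * var_d2
         + 2.0 * dp_dd1 * dp_dd2 * cov_d12)

half_width = 1.96 * math.sqrt(var_p)
lower, upper = p_hat - half_width, p_hat + half_width
print(f"p_hat = {p_hat:.2f}, V(p_hat) = {var_p:.3f}, "
      f"95% interval ({lower:.2f}, {upper:.2f})")
```

The negative covariance between the two coefficient estimates reduces the variance of the ratio, which is why the interval is narrow enough to exclude p = 1.5.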
Benn, R.J. (1971), “Some mathematical properties of weight for height indices as measures of adiposity,” British Journal of Preventive and Social Medicine, 25, 42-50.
Box, G.E.P. and Cox, D.R. (1964), “An analysis of transformations,” Journal of the Royal Statistical Society, Series B, 26, 211-243.
Brown, P.J. (1993), Measurement, Regression, and Calibration, Oxford, U.K.: Oxford University Press.
Flegal, K.M. (1990), “Ratio of actual to predicted weight as an alternative to a power-type weight-height index (Benn index),” American Journal of Clinical Nutrition, 51, 540-547.
Gray, D.S. and Fujioka, K. (1991), “Use of relative weight and body mass index for the determination of adiposity,” Journal of Clinical Epidemiology, 44, 545-550.
Johnson, R.W. (1996), “Fitting percentage of body fat to simple body measurements,” Journal of Statistics Education [Online], 4(1). jse.amstat.org/v4n1/datasets.johnson.html
Kuczmarski, R.J. and Flegal, K.M. (2000), “Criteria for definition of overweight in transition: background and recommendations for the United States,” American Journal of Clinical Nutrition, 72, 1074-1081.
Maddala, G.S. (1977), Econometrics, McGraw-Hill.
Osborne, C. (1991), “Statistical calibration: a review,” International Statistical Review, 59, 309-336.
Sakia, R.M. (1992), “The Box-Cox transformation technique,” The Statistician, 41, 169-178.
Seber, G.A.F. and Lee, A.J. (2003), Linear Regression Analysis, 2nd Edition, New York: John Wiley & Sons.
Seber, G.A.F. and Wild, C.J. (2003), Nonlinear Regression, New York: John Wiley & Sons.
Webster, J.D., Hesp, R. and Garrow, J.S. (1984), “The composition of excess weight in obese women estimated by body density, total body water and total body potassium,” Human Nutrition: Clinical Nutrition, 38C, 299-306.
Terence C. Mills
Department of Economics
Loughborough University
Leicestershire
United Kingdom
T.C.Mills@lboro.ac.uk