Gary D. Kader

Appalachian State University

Mike Perry

Appalachian State University

Journal of Statistics Education Volume 15, Number 2 (2007), http://jse.amstat.org/v15n2/kader.html

Copyright © 2007 by Gary D. Kader and Mike Perry all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the author and advance notification of the editor.

**Keywords:** Variability, Categorical Variable, Unalikeability

- Describe a concept of variability for a categorical variable, and provide a method for its measurement. This is done at an elementary level which requires no probability or statistics background and thus is appropriate for an introductory course.
- Show how these ideas evolved from research results on students' concepts of variability for quantitative variables.
- Although our development is done independently of previous ideas, we point out that the underlying ideas have been around for at least ninety years. The early uses were for specialized applications or in statistically sophisticated settings and thus not presented in a fashion appropriate for a student's first exposure to variability.

The students were instructed as follows: “These are two sets of blocks: a set of red blocks and a set of yellow ones. In which set do the blocks have the greater variation among themselves?” Fifty percent selected set I for the greater variation, 36% selected II and 14% said there was no difference.

The 50% who selected set I are making their judgment on the observation that no two blocks have the same length. These students are basing their choice on an intuitive concept of variability, unalikeability: the lack of bars of the same size or the lack of clusters of bars of the same size. These learners do not think of variation as “how much the values differ from the mean.” Their perception has to do with “how often the observations differ from one another.” The authors point out that this can be an important part of a classroom lesson. The teacher can show the students that the standard deviation indicates that set II has the greater variation, because its standard deviation is larger than that of set I, and that the standard deviation does not measure the concept of variation held by the students who selected set I.

The concept of unalikeability focuses on how often observations differ, not how much. The incidences of differences for the six blocks of set I and set II are indicated in Table 1. Each table gives all possible pairings of the sizes of the bars, and table entries are either 0 or 1 to indicate whether the block sizes are equal or different, respectively. Note that all pairs are indicated twice, once in each half of the table. Comparisons of a block with itself are not of interest and are indicated with an asterisk.

Figure 1. Physical Representation for Two Sets of Numerical Data

Table 1. Incidence of Differences

Set I

| | 10 | 20 | 30 | 40 | 50 | 60 |
|---|---|---|---|---|---|---|
| 10 | * | 1 | 1 | 1 | 1 | 1 |
| 20 | 1 | * | 1 | 1 | 1 | 1 |
| 30 | 1 | 1 | * | 1 | 1 | 1 |
| 40 | 1 | 1 | 1 | * | 1 | 1 |
| 50 | 1 | 1 | 1 | 1 | * | 1 |
| 60 | 1 | 1 | 1 | 1 | 1 | * |

Set II

| | 10 | 10 | 10 | 60 | 60 | 60 |
|---|---|---|---|---|---|---|
| 10 | * | 0 | 0 | 1 | 1 | 1 |
| 10 | 0 | * | 0 | 1 | 1 | 1 |
| 10 | 0 | 0 | * | 1 | 1 | 1 |
| 60 | 1 | 1 | 1 | * | 0 | 0 |
| 60 | 1 | 1 | 1 | 0 | * | 0 |
| 60 | 1 | 1 | 1 | 0 | 0 | * |

If the 1’s in a table are added up, we obtain the number of differences that occur when all possible comparisons are made, one observation with another. If we divide by 36-6=30, the number of comparisons, then we get the proportion of differences that occur.

For set I, where all of the data differ from one another, this proportion is 30/30 =1. For set II, the proportion is 18/30 = 0.60. Note that since all pairs appear twice, only half of the entries need to be counted. In the case of set II, there would be 15 comparisons, and the proportion would be 9/15 = 0.60. Also note that if all of the data are equal in value, this proportion is 0.
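The counting described above can be sketched in a few lines of Python. This is only an illustration: the function name `unalikeability` and the use of exact fractions are our own choices, not part of the original development.

```python
from fractions import Fraction
from itertools import permutations


def unalikeability(values):
    """Proportion of ordered pairs of distinct positions whose values differ."""
    n = len(values)
    # permutations(values, 2) yields every ordered pair of distinct positions,
    # which matches the off-diagonal cells of Table 1 (n*n - n comparisons).
    differences = sum(1 for a, b in permutations(values, 2) if a != b)
    return Fraction(differences, n * n - n)


set_i = [10, 20, 30, 40, 50, 60]
set_ii = [10, 10, 10, 60, 60, 60]

print(unalikeability(set_i))   # 1
print(unalikeability(set_ii))  # 3/5
```

The sketch assumes at least two observations; with a single observation there are no comparisons to make and the divisor is zero.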

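This provides a *coefficient of unalikeability* on a scale from 0 to 1. The higher the value, the more unalike the data are. If *x*_{1}, *x*_{2},..., *x*_{n} are *n* observations, the coefficient is

$$u = \frac{\sum_{i \neq j} c(x_i, x_j)}{n^2 - n}, \qquad c(x_i, x_j) = \begin{cases} 1 & \text{if } x_i \neq x_j \\ 0 & \text{if } x_i = x_j \end{cases}$$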

This coefficient was suggested by the idea of a “within data” variance. Gordon (1986) reminds us that standard deviation and variance can be defined independently of the mean by taking the average of the squares of the differences between each pair of values:

$$W_1 = \frac{\sum_{i \neq j} (x_i - x_j)^2}{n^2 - n}$$

The coefficient of unalikeability mimics this idea by replacing the squares of distances with the 0 - 1 indicator of differences. Gordon points out that

$$W_1 = 2s^2$$

where $s^2$ is the variance computed with divisor $n - 1$.
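The identity behind the “within data” variance, namely that the average squared difference over all pairs with *i* ≠ *j* equals twice the variance computed with divisor *n* - 1, can be checked numerically. The sketch below uses the two sets of block lengths from Table 1; the function name `within_variance` is hypothetical.

```python
from itertools import permutations
from statistics import variance


def within_variance(values):
    """Average squared difference over all ordered pairs with i != j."""
    n = len(values)
    total = sum((a - b) ** 2 for a, b in permutations(values, 2))
    return total / (n * n - n)


set_i = [10, 20, 30, 40, 50, 60]
set_ii = [10, 10, 10, 60, 60, 60]

for data in (set_i, set_ii):
    # statistics.variance uses the n - 1 divisor (sample variance).
    print(within_variance(data), 2 * variance(data))
```

Both measures are larger for set II than for set I, whereas the coefficient of unalikeability ranks the two sets the other way around.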

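Note that in the case of a categorical variable, *x*, each observation is classified into one of *m* distinct categories. In this case, the definition for quantity *c*(*x*_{i}, *x*_{j}) becomes:

$$c(x_i, x_j) = \begin{cases} 1 & \text{if } x_i \text{ and } x_j \text{ are in different categories} \\ 0 & \text{if } x_i \text{ and } x_j \text{ are in the same category} \end{cases}$$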

Group 1: Seven responses in Category A; three responses in Category B

Group 2: Five responses in Category A; five responses in Category B

Group 3: One response in Category A; nine responses in Category B

Figure 2 provides a physical representation for these three different situations. Note that, unlike numerical data, the bar height in this representation for categorical data does not indicate the magnitude of a response; it indicates only whether the response was in Category A or Category B.

Which group of data has the most variability? the least variability? For categorical data, the notion “how far apart?” does not make sense; however, the notion of unalikeability does make sense. Within a particular group two responses differ if they are in different categories and are the same if they are in the same category. That is, the two responses are either unalike (different categories) or alike (same category). Consequently, variability in categorical data is equivalent to unalikeability in numerical data.

Comparing Groups 1 and 2, the data in Group 1 are more alike since seven of the values are the same (i.e., 7 are in Category A), while only five of the values in Group 2 are the same (i.e., 5 are in either Category A or B). Consequently, the data in Group 2 are more unalike. That is, there is more variability in Group 2 than in Group 1. Since nine of the values in Group 3 are the same (i.e., 9 are in Category B) then, among the three groups, Group 3 has the least variability.

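When *u* is defined with the divisor *n*^{2}-*n*, the coefficient has value 1 with all distinct measurements. Using *n*^{2} as the divisor instead produces a value close to 1 for large *n* since:

$$u_2 = \frac{\sum_{i \neq j} c(x_i, x_j)}{n^2} = \frac{n^2 - n}{n^2}\,u = \left(1 - \frac{1}{n}\right)u$$

This second coefficient is analogous to the other “within data” variance proposed by Gordon (1986):

$$W_2 = \frac{\sum_{i,j} (x_i - x_j)^2}{n^2}$$

Gordon points out that

$$W_2 = 2V \tag{1}$$

where $V$ is the variance computed with divisor $n$.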

The incidences of differences for the ten responses of Groups 1, 2, and 3 are shown in Table 2. Each table gives all possible pairings of responses, and table entries are either 1 or 0 to indicate whether the responses are unalike or alike, respectively. The corresponding values for *u*_{2} are indicated in Table 3.

Figure 2. Physical Representations for Three Groups of Categorical Data

Table 2. Incidence of Differences for Three Groups of Categorical Data

A | A | A | A | A | A | A | B | B | B | |

A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |

A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |

A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |

A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |

A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |

A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |

A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |

B | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |

B | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |

B | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |

A | A | A | A | A | B | B | B | B | B | |

A | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |

A | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |

A | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |

A | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |

A | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |

B | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |

B | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |

B | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |

B | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |

B | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |

A | B | B | B | B | B | B | B | B | B | |

A | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

B | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

B | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

B | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

B | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

B | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

B | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

B | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

B | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

B | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Table 3. Value of *u*_{2} for Three Groups of Categorical Data

| Group | *u*_{2} |
|---|---|
| 1 | 42/100 = .42 |
| 2 | 50/100 = .50 |
| 3 | 18/100 = .18 |

The values for *u*_{2} indicate that the data in Group 3 are most alike and the data
in Group 2 are most unalike. That is, Group 3 has the least variation and Group 2 has the most
variation.
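The values in Table 3 can be reproduced by counting the 1's over all 100 pairings in each group, as in the following sketch. The function name `u2` and the dictionary layout are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product


def u2(responses):
    """Proportion of all n*n ordered pairings (self-pairings included) that are unalike."""
    n = len(responses)
    # product(..., repeat=2) visits every cell of the incidence table,
    # including the diagonal, where a response is paired with itself.
    unalike = sum(1 for a, b in product(responses, repeat=2) if a != b)
    return Fraction(unalike, n * n)


groups = {
    "Group 1": ["A"] * 7 + ["B"] * 3,
    "Group 2": ["A"] * 5 + ["B"] * 5,
    "Group 3": ["A"] * 1 + ["B"] * 9,
}

for name, responses in groups.items():
    print(name, float(u2(responses)))
```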

A second look at the table of incidences for Group 1 (Table 2) reveals that the 1's occur in the array in blocks.

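The sum of the 1's can be determined by:

$$7(3) + 3(7) = 2(7)(3) = 42$$

Thus

$$u_2 = \frac{2(7)(3)}{10^2} = 2\left(\frac{7}{10}\right)\left(\frac{3}{10}\right) = .42$$

Note that here *u*_{2} has the form:

$$u_2 = 2p_1p_2 \tag{2}$$

where *p*_{1} and *p*_{2} are the proportions of responses in categories A and B, respectively.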

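The sum of the 1's can also be determined by:

$$7(10-7) + 3(10-3) = 42$$

Thus

$$u_2 = \frac{7(10-7) + 3(10-3)}{10^2} = \frac{7}{10}\left(1 - \frac{7}{10}\right) + \frac{3}{10}\left(1 - \frac{3}{10}\right) = .42$$

Note that here *u*_{2} has the form *p*_{1}(1-*p*_{1})+*p*_{2}(1-*p*_{2}).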

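The sum of the 1's can also be determined by:

$$10^2 - 7^2 - 3^2 = 42$$

Thus

$$u_2 = \frac{10^2 - 7^2 - 3^2}{10^2} = 1 - \left(\frac{7}{10}\right)^2 - \left(\frac{3}{10}\right)^2 = .42$$

Note that here *u*_{2} has the form 1-*p*_{1}^{2}-*p*_{2}^{2}.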

In each case we get .42, the proportion of possible pairings which are unalike. Note that the three formulas for finding *u*_{2} work for Groups 2 and 3 as well.
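As a quick check that the three forms agree for all three groups, the following sketch evaluates each of them with exact fractions. The function name `u2_three_ways` is hypothetical.

```python
from fractions import Fraction


def u2_three_ways(count_a, count_b):
    """Return u2 for a two-category sample computed with each of the three forms."""
    n = count_a + count_b
    p1, p2 = Fraction(count_a, n), Fraction(count_b, n)
    product_form = 2 * p1 * p2
    complement_form = p1 * (1 - p1) + p2 * (1 - p2)
    squares_form = 1 - p1**2 - p2**2
    return product_form, complement_form, squares_form


group_counts = {"Group 1": (7, 3), "Group 2": (5, 5), "Group 3": (1, 9)}

for label, (a, b) in group_counts.items():
    forms = u2_three_ways(a, b)
    print(label, [float(f) for f in forms])
```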

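For a variable with two categories, the responses can be coded as a Bernoulli (0-1) variable, where *p*_{1} = the proportion of 1’s or the proportion of responses in Category A, and *p*_{2} = 1-*p*_{1} = the proportion of 0’s or the proportion of responses in Category B.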

It is well known that the mean of a Bernoulli variable is *p*_{1} and the variance, *V*, is *p*_{1}*p*_{2}. So, like the second form of Gordon’s within variance, *W*_{2}, the coefficient of unalikeability as described in Equation (1) can be expressed as:

$$u_2 = 2p_1p_2 = 2V$$
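The relationship between *u*_{2} and the Bernoulli variance can be illustrated with the Group 1 data coded as 0-1 values. This is a minimal sketch; the variable names are our own, and the variance is computed directly with divisor *n*.

```python
from fractions import Fraction

# Group 1 coded as a 0-1 variable: Category A -> 1, Category B -> 0.
coded = [1] * 7 + [0] * 3
n = len(coded)

p1 = Fraction(sum(coded), n)                 # mean of the coded data
V = sum((x - p1) ** 2 for x in coded) / n    # variance with divisor n
u2 = Fraction(sum(a != b for a in coded for b in coded), n * n)

print(p1, V, u2)  # 7/10 21/100 21/50
```

Here the variance equals *p*_{1}*p*_{2} = 21/100, and *u*_{2} = 42/100 is exactly twice that value.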

Group 4: Two responses in Category A; three responses in Category B; and five responses in Category C

The table of incidences for Group 4 (Table 4) reveals that the 1's again occur in the array in blocks.

Table 4. Incidence of Differences for Three Outcome Categorical Variable

A | A | B | B | B | C | C | C | C | C | |

A | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

A | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

B | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |

B | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |

B | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |

C | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |

C | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |

C | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |

C | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |

C | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |

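The sum of the 1's in Table 4 can be determined by:

$$2(3) + 2(5) + 3(2) + 3(5) + 5(2) + 5(3) = 2[(2)(3) + (2)(5) + (3)(5)] = 62$$

Thus

$$u_2 = \frac{2[(2)(3) + (2)(5) + (3)(5)]}{10^2} = 2\left[\left(\frac{2}{10}\right)\left(\frac{3}{10}\right) + \left(\frac{2}{10}\right)\left(\frac{5}{10}\right) + \left(\frac{3}{10}\right)\left(\frac{5}{10}\right)\right] = .62$$

Note that here *u*_{2} has the form 2(*p*_{1}*p*_{2} + *p*_{1}*p*_{3} + *p*_{2}*p*_{3}), where *p*_{1}, *p*_{2}, and *p*_{3} are the proportions of responses in categories A, B, and C, respectively.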

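The sum of the 1's can also be determined by:

$$2(10-2) + 3(10-3) + 5(10-5) = 62$$

Thus

$$u_2 = \frac{2(10-2) + 3(10-3) + 5(10-5)}{10^2} = \frac{2}{10}\left(1 - \frac{2}{10}\right) + \frac{3}{10}\left(1 - \frac{3}{10}\right) + \frac{5}{10}\left(1 - \frac{5}{10}\right) = .62$$

Note that here *u*_{2} has the form *p*_{1}(1-*p*_{1}) + *p*_{2}(1-*p*_{2}) + *p*_{3}(1-*p*_{3}).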

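The sum of the 1's can also be determined by:

$$10^2 - 2^2 - 3^2 - 5^2 = 62$$

Thus

$$u_2 = \frac{10^2 - 2^2 - 3^2 - 5^2}{10^2} = 1 - \left(\frac{2}{10}\right)^2 - \left(\frac{3}{10}\right)^2 - \left(\frac{5}{10}\right)^2 = .62$$

Note that here *u*_{2} has the form 1-*p*_{1}^{2}-*p*_{2}^{2}-*p*_{3}^{2}.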

In each case we get .62, the proportion of possible pairings which are unalike.
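The three computations for Group 4 can be sketched from the category counts as follows. The variable names are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

group_4 = ["A"] * 2 + ["B"] * 3 + ["C"] * 5
n = len(group_4)

# Proportion of responses in each category: 2/10, 3/10, 5/10.
proportions = [Fraction(k, n) for k in Counter(group_4).values()]

pairwise_form = 2 * sum(p * q for p, q in combinations(proportions, 2))
complement_form = sum(p * (1 - p) for p in proportions)
squares_form = 1 - sum(p**2 for p in proportions)

print(float(pairwise_form), float(complement_form), float(squares_form))
```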

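In general, for a categorical variable with *m* categories, *u*_{2} can be written in any of the following equivalent forms:

$$u_2 = \sum_{i=1}^{m} p_i(1 - p_i) \tag{3}$$

$$u_2 = 1 - \sum_{i=1}^{m} p_i^2 \tag{4}$$

$$u_2 = 2\sum_{i<j} p_i p_j \tag{5}$$

where *p*_{i} = the proportion of responses in category *i*, for *i* = 1, 2, ..., *m*.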

The interpretation of *u*_{2} is that it represents the proportion of possible
comparisons (pairings) which are unalike. Note that *u*_{2} includes comparisons
of each response with itself.
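The effect of including self-comparisons can be seen in a small sketch that computes *u*_{2} both by direct counting over all *n*^{2} pairings and from the squared category proportions. The function names and the all-distinct sample are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def u2_by_counting(responses):
    """Fraction of all n*n ordered pairings, self-pairings included, that are unalike."""
    n = len(responses)
    unalike = sum(a != b for a in responses for b in responses)
    return Fraction(unalike, n * n)


def u2_by_formula(responses):
    """1 minus the sum of squared category proportions."""
    n = len(responses)
    return 1 - sum(Fraction(k, n) ** 2 for k in Counter(responses).values())


# Hypothetical sample: ten responses, each in its own category.
all_distinct = list("ABCDEFGHIJ")

print(u2_by_counting(all_distinct))  # 9/10
print(u2_by_formula(all_distinct))   # 9/10
```

Even when every response is different, the ten self-pairings count as alike, so *u*_{2} is 9/10 rather than 1.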

“Suppose an investigator wishes to measure the degree of religious diversity within a specified aggregate, e.g., a city. A very simple operational solution is to describe the city in terms of the probability that randomly paired members of the population will hold different religious affiliations.”

Lieberson points out that his index is essentially identical to each of the following measures:

- Gini's index of mutability (1912)
- Simpson's measure of diversity (1949)
- Bachi's index of linguistic homogeneity (1956)
- Greenberg's Monolingual Non-Weighted Method for measuring linguistic diversity (1956)
- The index of qualitative variation described by Mueller and Schuessler (1961)
- Gibbs and Martin's measurement of industry diversification (1962)

In order to convey the operational interpretations of these measures, discussions of Simpson's and Greenberg's ideas are presented in the following section.

"Suppose two individuals are drawn at random and without replacement from an *S*-species collection containing *N* individuals, of which *N*_{j} belong to the *j*-th species (*j* = 1, 2, ..., *S*; *N*_{1} + *N*_{2} + ... + *N*_{S} = *N*). If the probability is great that both individuals will belong to the same species, we can say that the diversity of the collection is low. This probability is

$$\sum_{j=1}^{S} \frac{N_j(N_j - 1)}{N(N - 1)}$$

and so we may use

$$D = 1 - \sum_{j=1}^{S} \frac{N_j(N_j - 1)}{N(N - 1)}$$

as a measure of the collection's diversity."
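A short sketch can make the connection to the coefficients of unalikeability concrete. Because the two individuals are drawn without replacement, self-pairings are excluded, so this measure equals *u*_{2} rescaled by *N*/(*N* - 1), which is the coefficient with divisor *n*^{2}-*n*. The function names below are assumptions, and the counts are those of Group 4.

```python
from fractions import Fraction


def simpson_diversity(counts):
    """1 minus the probability that two individuals drawn without replacement match."""
    total = sum(counts)
    same = sum(Fraction(k * (k - 1), total * (total - 1)) for k in counts)
    return 1 - same


def u2_from_counts(counts):
    """1 minus the sum of squared category proportions (self-pairings included)."""
    total = sum(counts)
    return 1 - sum(Fraction(k, total) ** 2 for k in counts)


counts = [2, 3, 5]  # category counts for Group 4
total = sum(counts)

print(simpson_diversity(counts))  # 31/45
print(u2_from_counts(counts))     # 31/50
print(u2_from_counts(counts) * Fraction(total, total - 1))  # 31/45
```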

Greenberg (1956) describes the Monolingual Non-Weighted Method for measuring linguistic diversity as follows.

“If from a given area we choose two members of the population at random, the probability that these two individuals speak the same language can be considered a measure of its linguistic diversity. If everyone speaks the same language, the probability that two such individuals speak the same language is obviously 1, or certainty. If each individual speaks a different language, the probability is zero. Since we are measuring diversity rather than uniformity, this measure may be subtracted from 1, so that our index will vary from 0, indicating the least diversity, to 1, indicating the greatest.

$$A = 1 - \sum_{i} i^2$$

where i is the proportion of speakers for a particular language.”

Note that in both discussions, the measure of diversity is described in terms of the likelihood of two responses being either the same or being different, and the measure of diversity is expressed in a form similar to Equation (4).

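Agresti (1990) gives a measure of variation for a categorical response *Y*:

$$V(Y) = \sum_{j} \pi_{+j}(1 - \pi_{+j}) = 1 - \sum_{j} \pi_{+j}^2$$

where $\pi_{+j}$ is the probability a response is in Category *j*. Note that Agresti’s first expression for *V*(*Y*) is equivalent to *u*_{2} as described in Equation (3), and his second expression is equivalent to *u*_{2} as described in Equation (4).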

Agresti points out that this quantity “is the probability that two independent observations from the marginal distribution of Y fall in different categories.” He also notes that in the case of *m* distinct categories for *Y*, *V*(*Y*) is maximized when the category probabilities are all equal,

$$\pi_{+j} = \frac{1}{m}$$

for all *j* and the maximum value is (*m*-1)/*m*. It is minimized when all responses are in the same category, in which case it is 0. Of course, Agresti’s book is not an introductory textbook and his presentation of the notion of variability for a categorical variable is not at an elementary level. Also, the presentation of this idea seems to have been deleted from the latest edition of his book.
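The maximum and minimum values can be checked with a short sketch. The function name `variation` and the skewed example distribution are hypothetical; the measure is computed as 1 minus the sum of squared category probabilities.

```python
from fractions import Fraction


def variation(probabilities):
    """1 minus the sum of squared category probabilities."""
    return 1 - sum(p**2 for p in probabilities)


# Equal probabilities 1/m in each of m categories give the value (m - 1)/m.
for m in (2, 3, 4, 10):
    uniform = [Fraction(1, m)] * m
    print(m, variation(uniform), Fraction(m - 1, m))

# A hypothetical unequal distribution over three categories falls below 2/3.
skewed = [Fraction(7, 10), Fraction(2, 10), Fraction(1, 10)]
print(variation(skewed))  # 23/50

# All responses in a single category give the minimum value, 0.
print(variation([Fraction(1)]))  # 0
```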

Although some may exist, we have not seen a general introductory level statistics text that includes a discussion on measuring variability in qualitative data. However, some introductory statistics books designed for the social sciences do include such a discussion. For example, the book Social Statistics for a Diverse Society (Leon-Guerrero and Frankfort-Nachmias, 2000), presents the index of qualitative variation (IQV) as a measure of variability for nominal variables. The IQV is described as a measure of variability for qualitative variables “based on the ratio of the total number of differences in the distribution to the maximum number of possible differences within the same distribution.” This definition is equivalent to the coefficient of unalikeability. Their presentation does not develop the underlying ideas but is formula driven and moves immediately from the definition to how to calculate the IQV from a frequency table.

Variability in categorical data is based on unalikeability (diversity), which is quite different from variability in quantitative data. Thus the coefficient of unalikeability is a natural measure of variability that has a well-defined interpretation. The concept and its measurement are appropriate for an introductory statistics course.

The evolution of ideas is often ignored in the teaching of statistics. It is important, in our opinion, to show students how definitions and formulas evolve. The coefficient of unalikeability is a fairly straightforward illustration of how measures of statistical concepts can be invented. We have found this sort of development effective for other concepts, for example, developing the mean absolute deviation as a prelude to the standard deviation. The idea of a ratio based on counts as a correlation coefficient can be introduced before the full development of Pearson’s correlation coefficient (Holmes 2001).

The distinction between "unalikeability" and "variation about the mean" is based on the difference between “how often” and “how much.” Throughout statistical analysis we see this type of distinction, especially the difference between measures based on distance and those which are not based on distance. Most introductory presentations of statistics emphasize the differences between measures based on distance and measures based on order for quantitative data. We believe the development of statistical thinking should include a discussion on measuring variability in categorical data as well.

Agresti, Alan (1990). *Categorical Data Analysis*. John Wiley and Sons, Inc. 24-25.

Bachi, R. (1956). “A statistical analysis of the revival of Hebrew in Israel.” In Roberto Bachi (ed.), Scripta Hierosolymitana, Vol. III, Jerusalem: Magnes Press. 179-247.

Gibbs, J. P. and Martin, W. T. (1962). “Urbanization, technology and division of labor:
International patterns.” *American Sociological Review* 27: 667-677.

Gini, C. W. (1912). “Variability and Mutability, contribution to the study of statistical
distributions and relations.” *Studi Economico-Giuridici della R. Universita de Cagliari.*

Gordon, T. (1986). “Is the standard deviation tied to the mean?” *Teaching Statistics*, 8(2),
40-2. (Reprinted in Green, D.R. (ed.) (1994).

Greenberg, J. H. (1956). “The measurement of linguistic diversity.” *Language* 32, 109-115.

Holmes, P. (2001). “Correlation: From Picture to Formula.” *Teaching Statistics* 23(3), 67-70.

Leon-Guerrero, Anna and Frankfort-Nachmias, Chava (2000). *Social Statistics for a Diverse
Society*. 2nd edition, Pine Forge Press: Thousand Oaks, California. 153-162.

Lieberson, S. (1969). “Measuring Population Diversity.” *American Sociological Review*, 34(6),
850-862.

Loosen, F., Lioen, M. and Lacante, M. (1985). “The standard deviation: some drawbacks to an
intuitive approach.” *Teaching Statistics*, 7(1), 2-5.

Mueller, J. H. and Schuessler, K. F. (1961). *Statistical Reasoning in Sociology*.
Boston: Houghton Mifflin.

Perry, M. and Kader, G. (2005). “Variation as Unalikeability.” *Teaching Statistics*,
27 (2), 58-60.

Pielou, E. C. (1969). *An Introduction to Mathematical Ecology*.
John Wiley and Sons, Inc. 223.

Simpson, E. H. (1949). "Measurement of diversity." *Nature*, 163, 688.

Gary D. Kader

Department of Mathematical Sciences

Appalachian State University

Boone, NC 28608

U.S.A.
*gdk@math.appstate.edu*

Mike Perry

Department of Mathematical Sciences

Appalachian State University

Boone, NC 28608

U.S.A.
*perrylm@appstate.edu*
