Gary D. Kader
Appalachian State University
Mike Perry
Appalachian State University
Journal of Statistics Education Volume 15, Number 2 (2007), http://jse.amstat.org/v15n2/kader.html
Copyright © 2007 by Gary D. Kader and Mike Perry. All rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.
Keywords: Variability, Categorical Variable, Unalikeability
The students were instructed as follows: “These are two sets of blocks: a set of red blocks and a set of yellow ones. In which set do the blocks have the greater variation among themselves?” Fifty percent selected set I as having the greater variation, 36% selected set II, and 14% said there was no difference.
The 50% who selected set I based their judgment on the observation that no two blocks have the same length. These students are drawing on an intuitive concept of variability, unalikeability: the lack of bars of the same size, or the lack of clusters of bars of the same size. These learners do not think of variation as “how much the values differ from the mean.” Their perception has to do with “how often the observations differ from one another.” The authors point out that this contrast can be an important part of a classroom lesson. The teacher can show the students that set II has the greater standard deviation, and therefore that the standard deviation is not measuring the concept of variation held by the students who selected set I.
The concept of unalikeability focuses on how often observations differ, not how much. The incidence of differences for the six blocks of set I and of set II is shown in Table 1. Each table gives all possible pairings of the sizes of the bars, and the table entries are 0 or 1 to indicate whether the two block sizes are equal or different, respectively. Note that every pair appears twice, once in each half of the table. Comparisons of a block with itself are not of interest and are marked with an asterisk.
Figure 1. Physical Representation for Two Sets of Numerical Data
Table 1. Incidence of Differences
Set I:
 | 10 | 20 | 30 | 40 | 50 | 60 |
10 | * | 1 | 1 | 1 | 1 | 1 |
20 | 1 | * | 1 | 1 | 1 | 1 |
30 | 1 | 1 | * | 1 | 1 | 1 |
40 | 1 | 1 | 1 | * | 1 | 1 |
50 | 1 | 1 | 1 | 1 | * | 1 |
60 | 1 | 1 | 1 | 1 | 1 | * |
Set II:
 | 10 | 10 | 10 | 60 | 60 | 60 |
10 | * | 0 | 0 | 1 | 1 | 1 |
10 | 0 | * | 0 | 1 | 1 | 1 |
10 | 0 | 0 | * | 1 | 1 | 1 |
60 | 1 | 1 | 1 | * | 0 | 0 |
60 | 1 | 1 | 1 | 0 | * | 0 |
60 | 1 | 1 | 1 | 0 | 0 | * |
If the 1’s in a table are added up, we obtain the number of differences that occur when all possible comparisons are made, one observation with another. If we divide by 36 - 6 = 30, the number of comparisons, then we get the proportion of differences that occur.
For set I, where all of the data differ from one another, this proportion is 30/30 = 1. For set II, the proportion is 18/30 = 0.60. Note that since every pair appears twice, only half of the entries need to be counted; in the case of set II there would be 15 comparisons, and the proportion would be 9/15 = 0.60. Also note that if all of the data are equal in value, this proportion is 0.
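The counting scheme above is easy to automate. A minimal sketch (the function name is ours) makes every ordered comparison and divides by the number of comparisons:

```python
# Proportion of unalike pairs via brute-force pairwise comparison.
# Divisor is n^2 - n: self-comparisons are excluded, as in Table 1.
def unalikeability(data):
    """Proportion of ordered pairs (i, j), i != j, whose values differ."""
    n = len(data)
    unalike = sum(1 for i in range(n) for j in range(n)
                  if i != j and data[i] != data[j])
    return unalike / (n * n - n)

set_one = [10, 20, 30, 40, 50, 60]   # all values distinct
set_two = [10, 10, 10, 60, 60, 60]   # two clusters of equal values

print(unalikeability(set_one))   # 30/30 = 1.0
print(unalikeability(set_two))   # 18/30 = 0.6
```

Counting ordered pairs (each pair twice) and dividing by n² - n gives the same proportion as counting each pair once and dividing by 15.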
This provides a coefficient of unalikeability on a scale from 0 to 1; the higher the value, the more unalike the data are. If x1, x2, ..., xn are n observations on a quantitative variable, x, Perry and Kader (2005) give a general definition for the coefficient of unalikeability as:

u = Σ c(xi, xj) / (n² - n)

where the sum is over all pairs with i ≠ j, and c(xi, xj) = 1 if xi ≠ xj and c(xi, xj) = 0 if xi = xj.
This coefficient was suggested by the idea of a “within data” variance. Gordon (1986) reminds us that the standard deviation and variance can be defined independently of the mean by taking the average of the squares of the differences between each pair of values:

W1 = Σ (xi - xj)² / [2(n² - n)]

where the sum is over all pairs with i ≠ j.
The coefficient of unalikeability mimics this idea by replacing the squares of distances with the 0-1 indicator of differences. Gordon points out that W1 is identical to the usual sample variance, s² = Σ (xi - x̄)² / (n - 1).
Note that in the case of a categorical variable, x, each observation is classified into one of m distinct categories. In this case, the definition for the quantity c(xi, xj) becomes:

c(xi, xj) = 0 if xi and xj are in the same category, and c(xi, xj) = 1 if they are in different categories.
Consider, for example, three groups of ten responses each on a categorical variable with two categories, A and B:

Group 1: Seven responses in Category A; three responses in Category B
Group 2: Five responses in Category A; five responses in Category B
Group 3: One response in Category A; nine responses in Category B

Figure 2 provides a physical representation for these three different situations. Note that, unlike numerical data, the bar height in this representation for categorical data does not indicate the magnitude of a response; it indicates only whether the response was in Category A or Category B.
Which group of data has the most variability? Which has the least? For categorical data, the notion “how far apart?” does not make sense; however, the notion of unalikeability does. Within a particular group, two responses differ if they are in different categories and are the same if they are in the same category. That is, two responses are either unalike (different categories) or alike (same category). Consequently, for categorical data, variability is equivalent to unalikeability.
Comparing Groups 1 and 2, the data in Group 1 are more alike, since seven of its values are the same (the 7 in Category A), while at most five of the values in Group 2 are the same (the 5 in Category A, or the 5 in Category B). Consequently, the data in Group 2 are more unalike; that is, there is more variability in Group 2 than in Group 1. Since nine of the values in Group 3 are the same (the 9 in Category B), Group 3 has the least variability among the three groups.
When u is defined with the divisor n² - n, the coefficient has the value 1 when all measurements are distinct. Using n² as the divisor instead produces a value close to 1 for large n, since:

(n² - n)/n² = 1 - 1/n, which approaches 1 as n grows.
This second coefficient is analogous to the other “within data” variance proposed by Gordon (1986):

W2 = Σ (xi - xj)² / (2n²)

with the sum again over all pairs with i ≠ j.
Gordon points out that W2 is identical to the population variance, Σ (xi - x̄)² / n. Defining the coefficient of unalikeability with divisor n², and summing over all n² comparisons (each observation's comparison with itself contributing 0), gives:

u2 = Σ Σ c(xi, xj) / n²    (1)
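A quick numerical check of the relationship between the two divisors, using set II (a sketch; the names are ours):

```python
# u1 uses divisor n^2 - n (self-comparisons excluded); u2 uses n^2
# (self-comparisons included, each contributing 0 to the count).
def unalike_count(data):
    n = len(data)
    return sum(1 for i in range(n) for j in range(n) if data[i] != data[j])

data = [10, 10, 10, 60, 60, 60]          # set II
n = len(data)
u1 = unalike_count(data) / (n * n - n)   # 18/30 = 0.6
u2 = unalike_count(data) / (n * n)       # 18/36 = 0.5
# u2 = u1 * (1 - 1/n), so the two coefficients agree as n grows.
print(u1, u2)
```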
The incidence of differences for the ten responses in each of Groups 1, 2, and 3 is shown in Table 2. Each table gives all possible pairings of responses, and the table entries are 1 or 0 to indicate whether the responses are unalike or alike, respectively. The corresponding values for u2 are indicated in Table 3.
Figure 2. Physical Representations for Three Groups of Categorical Data
Table 2. Incidence of Differences for Three Groups of Categorical Data
Group 1:
 | A | A | A | A | A | A | A | B | B | B |
A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
B | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
B | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
B | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
Group 2:
 | A | A | A | A | A | B | B | B | B | B |
A | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
A | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
A | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
A | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
A | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
B | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
B | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
B | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
B | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
B | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
Group 3:
 | A | B | B | B | B | B | B | B | B | B |
A | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
B | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
B | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
B | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
B | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
B | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
B | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
B | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
B | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
B | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Table 3. Value of u2 for Three Groups of Categorical Data
Group | u2 |
1 | 42/100 = .42 |
2 | 50/100 = .50 |
3 | 18/100 = .18 |
The values for u2 indicate that the data in Group 3 are most alike and the data in Group 2 are most unalike. That is, Group 3 has the least variation and Group 2 has the most variation.
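The Table 3 values can be reproduced directly from the all-comparisons definition of u2 (a minimal sketch; the function and group names are ours):

```python
# u2 with divisor n^2: all n^2 ordered comparisons, including each
# response compared with itself (always alike, contributing 0).
def u2(responses):
    n = len(responses)
    return sum(1 for a in responses for b in responses if a != b) / n**2

group1 = ["A"] * 7 + ["B"] * 3
group2 = ["A"] * 5 + ["B"] * 5
group3 = ["A"] * 1 + ["B"] * 9

print(u2(group1), u2(group2), u2(group3))   # 0.42 0.5 0.18, as in Table 3
```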
A second look at the table of incidences for Group 1 (Table 2) reveals that the 1's occur in the array in blocks.
The sum of the 1's can be determined by:

2(7)(3) = 42

Thus

u2 = 42/100 = 2(7/10)(3/10) = 0.42
Note that here u2 has the form:

u2 = 2p1p2    (2)
where p1 and p2 are the proportions of responses in Categories A and B, respectively.
The sum of the 1's can also be determined by:

7(10 - 7) + 3(10 - 3) = 42

Thus

u2 = 42/100 = (7/10)(3/10) + (3/10)(7/10) = 0.42
Note that here u2 has the form p1(1-p1)+p2(1-p2).
The sum of the 1's can also be determined by:

10² - 7² - 3² = 42

Thus

u2 = 42/100 = 1 - (7/10)² - (3/10)² = 0.42
Note that here u2 has the form 1 - p1² - p2².
In each case we get 0.42, the proportion of possible pairings which are unalike. Note that the three formulas for finding u2 work for Groups 2 and 3 as well.
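The three closed forms count the same off-diagonal 1's, so they must agree. A sketch checking this for Group 1 (the function name is ours):

```python
# Evaluate u2 three ways from the category proportions: 2*p1*p2,
# the sum of p_i(1 - p_i), and 1 minus the sum of p_i^2.
from collections import Counter

def u2_forms(responses):
    n = len(responses)
    p = [k / n for k in Counter(responses).values()]
    pairs = 2 * sum(p[i] * p[j] for i in range(len(p))
                    for j in range(i + 1, len(p)))
    complements = sum(pi * (1 - pi) for pi in p)
    squares = 1 - sum(pi * pi for pi in p)
    return pairs, complements, squares

print([round(v, 2) for v in u2_forms(["A"] * 7 + ["B"] * 3)])  # [0.42, 0.42, 0.42]
```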
For two categories, the responses may be coded as values of a Bernoulli variable: record a 1 for each response in Category A and a 0 for each response in Category B. Let

p1 = the proportion of 1’s, or the proportion of responses in Category A, and

p2 = 1 - p1 = the proportion of 0’s, or the proportion of responses in Category B.
It is well known that the mean of a Bernoulli variable is p1 and its variance, V, is p1p2. So, like the second form of Gordon’s within variance, W2, the coefficient of unalikeability as described in Equation (1) can be expressed as:

u2 = 2p1p2 = 2V
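With A coded as 1 and B as 0, this relation is easy to verify numerically (a sketch; the names are ours):

```python
# For a two-category variable coded 0/1, u2 is twice the Bernoulli
# variance p1*p2.
def u2(values):
    n = len(values)
    return sum(1 for a in values for b in values if a != b) / n**2

coded = [1] * 7 + [0] * 3          # Group 1: seven A's, three B's
p1 = sum(coded) / len(coded)
variance = p1 * (1 - p1)           # V = p1 * p2
print(u2(coded), 2 * variance)     # 0.42 and 0.42 (up to float rounding)
```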
Now consider a group of ten responses on a categorical variable with three outcome categories:

Group 4: Two responses in Category A; three responses in Category B; and five responses in Category C
The table of incidences for Group 4 (Table 4) reveals that the 1's again occur in the array in blocks.
Table 4. Incidence of Differences for Three Outcome Categorical Variable
Group 4:
 | A | A | B | B | B | C | C | C | C | C |
A | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
A | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
B | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
B | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
B | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
C | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
C | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
C | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
C | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
C | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
The sum of the 1's in Table 4 can be determined by:

2[(2)(3) + (2)(5) + (3)(5)] = 62

Thus

u2 = 62/100 = 2[(2/10)(3/10) + (2/10)(5/10) + (3/10)(5/10)] = 0.62
Note that here u2 has the form 2(p1p2 + p1p3 + p2p3), where p1, p2, and p3 are the proportions of responses in Categories A, B, and C, respectively.
The sum of the 1's can also be determined by:

2(10 - 2) + 3(10 - 3) + 5(10 - 5) = 62

Thus

u2 = 62/100 = (2/10)(8/10) + (3/10)(7/10) + (5/10)(5/10) = 0.62
Note that here u2 has the form p1(1-p1) + p2(1-p2) + p3(1-p3).
The sum of the 1's can also be determined by:

10² - 2² - 3² - 5² = 62

Thus

u2 = 62/100 = 1 - (2/10)² - (3/10)² - (5/10)² = 0.62
Note that here u2 has the form 1 - p1² - p2² - p3².
In each case we get 0.62, the proportion of possible pairings which are unalike.
In general, suppose a categorical variable has m outcome categories and a group of n responses has ki responses in category i. Then the coefficient of unalikeability may be computed from any of the following:

u2 = 2 Σ pi pj  (summed over all pairs with i < j)    (3)

u2 = Σ pi(1 - pi)  (summed over i = 1, ..., m)    (4)

u2 = 1 - Σ pi²  (summed over i = 1, ..., m)    (5)

where pi = ki/n.
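Equations (3), (4), and (5) can be checked against the brute-force pairwise count for Group 4 (a sketch; the variable names are ours):

```python
# Group 4: two A's, three B's, five C's. The pairwise count and the
# three general formulas should all give 62/100 = 0.62.
from collections import Counter
from itertools import combinations

responses = ["A"] * 2 + ["B"] * 3 + ["C"] * 5
n = len(responses)
p = [k / n for k in Counter(responses).values()]

pairwise = sum(1 for a in responses for b in responses if a != b) / n**2
eq3 = 2 * sum(pi * pj for pi, pj in combinations(p, 2))
eq4 = sum(pi * (1 - pi) for pi in p)
eq5 = 1 - sum(pi * pi for pi in p)
print([round(v, 2) for v in (pairwise, eq3, eq4, eq5)])  # [0.62, 0.62, 0.62, 0.62]
```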
The interpretation of u2 is that it represents the proportion of possible comparisons (pairings) which are unalike. Note that u2 includes comparisons of each response with itself.
Lieberson (1969) motivates his measure of population diversity as follows:

“Suppose an investigator wishes to measure the degree of religious diversity within a specified aggregate, e.g., a city. A very simple operational solution is to describe the city in terms of the probability that randomly paired members of the population will hold different religious affiliations.”
Lieberson points out that his index is essentially identical to each of the following measures:
Gini's index of mutability (1912)
Simpson's measure of diversity (1949)
Bachi's index of linguistic homogeneity (1956)
Greenberg's Monolingual Non-Weighted Method for measuring linguistic diversity (1956)
The index of qualitative variation described by Mueller and Schuessler (1961)
Gibbs and Martin's measurement of industry diversification (1962)
In order to convey the operational interpretations of these measures, discussions of Simpson's and Greenberg's ideas are presented in the following section.
Simpson (1949) describes his measure of diversity as follows:

"Suppose two individuals are drawn at random and without replacement from an S-species collection containing N individuals, of which Nj belong to the j-th species (j=1,2,...,S; N1 + N2 + ... + NS = N). If the probability is great that both individuals will belong to the same species, we can say that the diversity of the collection is low. This probability is

Σ Nj(Nj - 1) / [N(N - 1)]  (summed over j)

and so we may use

1 - Σ Nj(Nj - 1) / [N(N - 1)]

as a measure of the collection's diversity."
Greenberg (1956) describes the Monolingual Nonweighted Methods for measuring linguistic diversity as follows.
“If from a given area we choose two members of the population at random, the probability that these two individuals speak the same language can be considered a measure of its linguistic diversity. If everyone speaks the same language, the probability that two such individuals speak the same language is obviously 1, or certainty. If each individual speaks a different language, the probability is zero. Since we are measuring diversity rather than uniformity, this measure may be subtracted from 1, so that our index will vary from 0, indicating the least diversity, to 1, indicating the greatest.
1 - Σ i²  (summed over the languages)

where i is the proportion of speakers for a particular language."
Note that in both discussions, the measure of diversity is described in terms of the likelihood of two responses being either the same or being different, and the measure of diversity is expressed in a form similar to Equation (4).
Agresti (1990) gives a measure of variation for a categorical variable Y:

V(Y) = Σ πi πj  (summed over all pairs with i ≠ j)  = Σ πj(1 - πj)  (summed over j)

where πj is the probability a response is in Category j. Note that Agresti’s first expression for V(Y) is equivalent to u2 as described in Equation (3), and his second expression is equivalent to u2 as described in Equation (4).
Agresti points out that this quantity “is the probability that two independent observations from the marginal distribution of Y fall in different categories." He also notes that, in the case of m distinct categories for Y, V(Y) is maximized when πj = 1/m for all j, in which case the maximum value is (m-1)/m. It is minimized, with value 0, when all responses are in the same category. Of course, Agresti’s book is not an introductory textbook, and his presentation of the notion of variability for a categorical variable is not at an elementary level. Also, the presentation of this idea seems to have been deleted from the latest edition of his book.
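These extremes can be illustrated with a short loop (a sketch; the names are ours):

```python
# u2 attains its maximum (m - 1)/m when the m categories are equally
# likely, and its minimum 0 when all responses fall in one category.
def u2(values):
    n = len(values)
    return sum(1 for a in values for b in values if a != b) / n**2

for m in (2, 3, 4, 5):
    uniform = list(range(m)) * 10      # 10 responses in each category
    print(m, u2(uniform), (m - 1) / m)

print(u2(["A"] * 30))                  # all alike: 0.0
```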
Although some may exist, we have not seen a general introductory-level statistics text that includes a discussion of measuring variability in qualitative data. However, some introductory statistics books designed for the social sciences do include such a discussion. For example, the book Social Statistics for a Diverse Society (Leon-Guerrero and Frankfort-Nachmias, 2000) presents the index of qualitative variation (IQV) as a measure of variability for nominal variables. The IQV is described as a measure of variability for qualitative variables “based on the ratio of the total number of differences in the distribution to the maximum number of possible differences within the same distribution.” This definition is equivalent to the coefficient of unalikeability. Their presentation does not develop the underlying ideas but is formula driven, moving immediately from the definition to how to calculate the IQV from a frequency table.
Variability in categorical data is based on unalikeability (diversity), which is quite different from variability in quantitative data. Thus the coefficient of unalikeability is a natural measure of variability that has a well-defined interpretation. The concept and its measurement are appropriate for an introductory statistics course.
The evolution of ideas is often ignored in the teaching of statistics. It is important, in our opinion, to show students how definitions and formulas evolve. The coefficient of unalikeability is a fairly straightforward illustration of how measures of statistical concepts can be invented. We have found this sort of development effective for other concepts as well: for example, developing the mean absolute deviation as a prelude to the standard deviation, or introducing a ratio based on counts as a correlation coefficient before the full development of Pearson’s correlation coefficient (Holmes 2001).
The distinction between "unalikeability" and "variation about the mean" is based on the difference between “how often” and “how much.” Throughout statistical analysis we see this type of distinction, especially the difference between measures based on distance and those which are not based on distance. Most introductory presentations of statistics emphasize the differences between measures based on distance and measures based on order for quantitative data. We believe the development of statistical thinking should include a discussion on measuring variability in categorical data as well.
Agresti, Alan (1990). Categorical Data Analysis. John Wiley and Sons, Inc. 24-25.
Bachi, R. (1956). “A statistical analysis of the revival of Hebrew in Israel.” In Roberto Bachi (ed.), Scripta Hierosolymitana, Vol. III, Jerusalem: Magnes Press. 179-247.
Gibbs, J. P. and Martin, W. T. (1962). “Urbanization, technology and division of labor: International patterns.” American Sociological Review 27: 667-677.
Gini, C. W. (1912). “Variability and Mutability, contribution to the study of statistical distributions and relations.” Studi Economico-Giuridici della R. Universita de Cagliari.
Gordon, T. (1986). “Is the standard deviation tied to the mean?” Teaching Statistics, 8(2), 40-42. (Reprinted in Green, D.R. (ed.) (1994).)
Greenberg, J. H. (1956). “The measurement of linguistic diversity.” Language 32, 109-115.
Holmes, P. (2001). “Correlation: From Picture to Formula.” Teaching Statistics 23(3), 67-70.
Leon-Guerrero, Anna and Frankfort-Nachmias, Chava (2000). Social Statistics for a Diverse Society. 2nd edition, Pine Forge Press: Thousand Oaks, California. 153-162.
Lieberson, S. (1969). “Measuring Population Diversity.” American Sociological Review, 34(6), 850-862.
Loosen, F., Lioen, M. and Lacante, M. (1985). “The standard deviation: some drawbacks to an intuitive approach.” Teaching Statistics, 7(1), 2-5.
Mueller, J. H. and Schuessler, K. F. (1961). Statistical Reasoning in Sociology. Boston: Houghton Mifflin.
Perry, M. and Kader, G. (2005). “Variation as Unalikeability.” Teaching Statistics, 27 (2), 58-60.
Pielou, E. C. (1969). An Introduction to Mathematical Ecology. John Wiley and Sons, Inc. 223.
Simpson, E. H. (1949). "Measurement of diversity." Nature, 163, 688.
Gary D. Kader
Department of Mathematical Sciences
Appalachian State University
Boone, NC 28608
U.S.A.
gdk@math.appstate.edu
Mike Perry
Department of Mathematical Sciences
Appalachian State University
Boone, NC 28608
U.S.A.
perrylm@appstate.edu