Juana Sanchez
UCLA
Yan He
UCLA
Journal of Statistics Education Volume 13, Number 3 (2005), jse.amstat.org/v13n3/datasets.sanchez.html
Copyright © 2005 by Juana Sanchez and Yan He, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.
Key Words:Exponential; Internet traffic; Inverse Gaussian; Maximum likelihood; Negative Binomial; Poisson; Web server log data.
The two activities introduced in this paper, as well as those in the CS-STATS web site, are suitable for any level of Statistics pedagogy. Instructors interested in using them for service-type Introductory Statistics classes can choose the descriptive analysis and the quantile-quantile plots and curve fitting of the probability distributions appropriate at the introductory level. On the other hand, instructors in calculus-based Introductory Statistics or Mathematical Statistics classes can move beyond the descriptive analysis to maximum likelihood estimation, alternative distributions such as the inverse Gaussian and other distributions with thick tails if they wish. In Introduction to Probability courses, the stories presented below and the research literature on which they are based, can help expand the curriculum by introducing students to phenomena that may not have the usual distributions studied in the course. Classes focused on Data Analysis or Applied Statistics can use these activities too. Moreover, because of the size of the data set on web browsing behavior, instructors can use the data sets as the population from which random samples could be taken to illustrate the Central Limit theorem or sampling theory results.
One common thread in the two stories told below is that there is an increasing demand for Internet data analysis. Computer Scientists are being called more and more frequently to provide computer log data that can be used to understand users' web browsing behavior, to make web pages more responsive to users' needs. In addition to that, communication networks are providing data that can help understand how the network itself responds to users, to make the quality of the network better for users.
For the teaching of Statistics, there are some things in common in the two stories. Because the data and graphs that appear in Internet data analysis behave sometimes differently from what we teach students in the descriptive data analysis module, we can use these stories to reinforce what students already know, by contrast. In addition to that, the data sets are huge, much bigger than the ones we usually use in our teaching, presenting the student with the mystery of dealing with such monsters. Finally, but not the least, there are no definite answers yet, so the students are really being exposed to the ongoing search for new paradigms in the engineering, computer science and statistics community.
Messages that flow from a source to a destination through a network are also known as traffic. This traffic and the network conditions are extremely random in nature.
There are three types of telecommunication networks --telephony (telephone network for voice calls, fax, and also dial up connections), cable-TV networks (cable, web-TV, etc.) and high speed networks such as the Internet. We are concerned with high-speed networks.
Traffic flowing through the networks can be classified into several types. Two of the most common traffic types are Ethernet packets/frames and ATM cells. The length or size of an Ethernet packet ranges anywhere from 60 bytes to 1500 bytes and generally follows a bimodal distribution. The length of ATM cells is fixed at 53 bytes. Therefore the network traffic comprises of millions and billions of these little packets or cells. We are concerned with packets traffic.
The packets arise because when a message needs to be sent from a source to a destination, it is broken down into small packets and transported that way from the source to the destination.
The protocols responsible for this transport of packets over networks are user datagram protocol (UDP) and transmission control protocol (TCP). With UDP, the destination does not acknowledge the receipt of packets to the source. TCP is an acknowledgment (ACK) based protocol. In this paper, we are concerned with packets transported by TCP protocol.
There are several network performance measures that contribute to the Quality of Service of a network. Among others, we have: (a) loss probability, or the probability of delivering a message with some data loss; (b) delay or latency, the time lag between the source sending a message and the destination receiving it; (c) delay-jitter or measure of the variation of the delay; (d) bandwidth or rate at which messages are processed. These measures can be used for optimal design and admission control of the networks. Design deals with buffer sizes, link capacities, network parameters, traffic shaping parameters, and other. Admission control involves rejecting or accepting an incoming request for connection. These variables are very hard to measure; some researchers have come up with inference methods to estimate them. These methods are very complicated and difficult to understand by beginning statistics students. Because the nature of the data is so different from that of data we are more used to in our undergraduate classes, students should be able to understand that scientists are having a very hard time coming up with good models.
There are two main areas of research on Internet traffic data. One area is concerned with understanding how traffic data perform within a single route, i.e., in the connection between a pair of nodes in the network, which allows the use of stochastic process modeling (Willinger, Taqqu, Leland, and Wilson 1995; Willinger and Paxson 1998; Paxson and Floyd 1995). It has been found that the models used for telephone networks are not good for Internet traffic (Willinger, et al. 1995). The other area is more focused in modeling the simultaneous activity across all the nodes in the network, i.e., traffic measurements at different nodes based on carefully done sampling at different nodes. This latter approach is known as “network tomography” (Castro, Coates, Liang, Nowak, and Yu 2004) and it is very recent. The goal of both approaches is to predict measures of performance of the networks. The single route approach supporters believe that by careful selection of processes to model the traffic, more theoretical analysis can be done for multiple routes as well. Many attempts are being made by this group at maintaining the old queueing models that work so well with telephone networks, with some adaptations to the different behavior of Internet data (Guerin, Nyberg, Perrrin, Resnick, Rootzen, and Starica 2003).
It should be pointed out that because of the privacy of network data researchers are having a very hard time obtaining the data they need. But for the data sets that have been made publicly available, there are standardized ways to reduce data on traffic to statistics-friendly data. For example, Jeff Mogul used the steps described at this site ita.ee.lbl.gov/html/contrib/sanitize-readme.txt to make the TCP data set we use in this section user friendly.
The data discussed here can be obtained from the Internet Traffic Archive at: ita.ee.lbl.gov/html/contrib/DEC-PKT.html and is labeled dec-pkt-1.tcp. This data set summarizes traces of one hour's worth of TCP traffic between Digital Equipment Corporation and the rest of the world on March 8th, 1995. Paxson and Floyd (1995) also analyzed this data set. It describes 2,153,462 million packets and contains the following 6 variables.
The packet size and number of packets per unit of time are very important variables, and we will analyze them here.
Our analysis in this paper will be limited to some of the preliminary descriptive data analysis usually done. Since the data set we use is only for an hour, there are many things that other studies look at that we can not investigate. The R commands for the analysis done here can be found in Appendix A.
Figure 1. Histogram of the number of data bytes in a packet.
A histogram of the number of bytes per packet is shown in Figure 1. We can see that it is bimodal as expected. The minimum value is 1 and the maximum is 1460. The most frequent packet size is around 512. Packets of that size arrive uniformly throughout the day, they are not concentrated in any particular hour. The correlation coefficient between the timestamp and the databytes variables is 0.01128, illustrating the lack of relation for a large majority of the data. However, one could do a plot of timestamp against databytes (not shown here) to see that the largest packages, those larger than 600, don't arrive uniformly throughout the hour.
Figure 2. Distribution of interarrival times of packets.
There is some indication of exponential-like behavior for small interarrival times, but the distribution is heavy tailed, and the exponential is not a good model due to large interarrival times. A quantile-quantile plot of the data against simulated exponential random numbers with the same average as our data, shows that there is lack of fit for large interarrival times, i.e., those interarrivals beyond the range one would expect from an exponential model. Figure 3 reveals that the quantiles of the data do not correspond to the same values as the quantiles in the simulated exponential. If the empirical data came from an exponential population, the points should fall approximately along the diagonal reference line. The greater the departure from this reference line, the greater the evidence for the conclusion that the data set have come from a population with a different distribution. Could the long tail be due to outliers? A boxplot (not shown) reveals that there are too many points outside the whisker to consider these tail points outliers. Rare interarrival times are not so rare in this data set.
Figure 3. QQ plot of the interarrival times and the exponential distribution.
Figure 4. QQ plot of the number of packets per second and a Poisson distribution.
This lack of fit has been attributed by computer scientists and engineers to the burstiness of packets per unit of time over time, regardless of the time scales at which we measure them. Many papers have been written showing how this periodic outbursts of activity over time are present in almost all the publicly available data sets. The challenge has been and still is to determine why. Most papers have resorted to asymptotic explanations such as long range dependence and self similarity, and to this day there is no consensus on this matter.
The data set we use in this section is too short to investigate the burstiness of the number of packets per unit of time thoroughly. But it illustrates what this behavior looks like in a plot. We created a data set called ``pacpersecond'' representing the number of packets arriving every second, and plotted it against an index variable representing seconds. Figure 5 shows that the number of packets per second displays burstiness. Changing the scale does not make the burstiness disappear; we created another variable called ``pacperminute'' that represents the number of packets arriving per minute. Figure 6 shows the trace of this variable against an index variable representing minutes. Again, the burstiness is present. If the counts were Poisson, burstiness in the data would disappear when the scale is aggregated. The fact that the burstiness in the data does not change whether we measure the number of packets per minute, per second, or millisecond is another indication that the data is not Poisson. In fact, any distribution whose characteristic is that more aggregated scales will not change the burstiness will not fit the packet data well. And this is what is making the modeling so difficult. (Willinger and Paxson 1998)
Figure 5. Trace of the number of packets per second.
Figure 6. Trace of the number of packets per minute.
The above descriptions of the data we are using in this paper indicate that this data set follows the same behavior as most of the data sets analyzed in the literature on network traffic. With these pieces of information, researchers are trying to determine what kind of model of network traffic will capture these characteristics. There are many other interesting types of analysis that one can do with longer data sets, but they all lead to the same conclusions. These other data sets can be found in the address given earlier in this section.
Figure 7. Box plot of packets per second compared with box plot of a negative binomial distribution.
Figure 8. Q-Q plot of packets per second against the negative binomial distribution.
We can see in the quantile-quantile plot that the negative binomial fits the data better than the Poisson, however at very low and very high values of the variable pacpersecond the fit is not so good. The negative binomial is then not a perfect alternative to the Poisson but only an improvement. Instructors in Introductory Statistics or Mathematical Statistics courses could expand the exercises done here by making students fit other models that might account better for the thickness of the tail, like some of the distributions considered in the literature referenced at the end of the paper. This activity will certainly engage students in the current debate about the nature of Internet traffic data.
Some of the analysis done in the literature to answer that question can be illustrated with data published in the UCI KDD Archive (Heckerman). We processed these data to obtain observations on the number of different pages visited by users who entered the msnbc.com page on September 28, 1999 and other information. A random sample of this data set was used by Cadez, et al. (2003) to do cluster analysis and visualization of the patterns (order) of visits followed by users, i.e., to see the frequency of whole sequences. This is a very important question, too. But we don't look into it in this paper.
The original data comes from Internet Information Server (IIS) logs for msnbc.com and news-related portions of msn.com for the entire day of September 28, 1999 (Pacific Standard Time). Each sequence in the dataset corresponds to page views of a user during that twenty-four hours period. Since there are 989818 users, there are 989818 sequences. This is a 22.6 MB size data set. Each event in a sequence corresponds to a user's request for a page. Requests are not recorded at the finest level of detail--that is, not at the level of URL, but rather, they are recorded at the levels of page category (as determined by a site administrator). The categories are ``frontpage'', ``news'', ``tech'', ``local'', ``opinion'', ``on-air'', ``misc'', ``weather'', ``health'', ``living'', ``business'', ``sports'', ``summary'', ``bbs (bulletin board service)'', ``travel'', ``msn-news'', and ``msn-sports''. As an example, we write below the sequence for the first three users in the data set (one line per user):
User 1 | frontpage, frontpage |
User 2: | news |
User 3: | tech,news,news,local,news,news,news,tech,tech |
We processed the original data set to obtain the variable ''length'', which represents the actual total number of links visited by each user. For example, user one has length=2 user two has length=1, and user three has length=9.
Figure 9. Histogram of the length of visits with an Inverse Gaussian distribution with the same mean and variance superimposed.
Notice that the histogram contains only values of length less than or equal to 100, excluding those users that visited more than 100 pages. The longest visits are probably crawlers or maybe different people logged into the same IP address. One of the problems with web server log data is precisely what to do with these crawlers. Should they be included, should they not? Although we did not include them all in the graphs, all the numbers were used for the computations of the statistics. An important fact to observe is that most users visit few pages, but the tails are very long, indicating that some users visit a lot of pages.
The mean and variance , where is a scale parameter. This distribution “has a very long tail, which extends much farther than that of a normal distribution with comparable mean and variance. This implies a finite probability for events that would be unlikely if described by a normal distribution. Consequently, large deviations from the average number of user-clicks computed at a site will be observed” (Huberman, et al. 1998, pg. 95). Another property is that “because of the asymmetry of the distribution function, the typical behavior of users will not be the same as their average behavior. Thus, because the mode is lower than the mean, care must be exercised with available data on the average number of clicks, as this average overestimates the typical depth being surfed.”
It can be shown that the cumulative distribution function of the inverse Gaussian distribution is
where is the standard normal distribution function.
Is the inverse Gaussian really a good model for the data we have? It is instructive to follow the guidelines given in the references mentioned above to answer this question.
Theoretically, by maximizing the likelihood function, the equations for the maximum likelihood estimators (MLE) of and in the inverse Gaussian distribution given above can be found to be
and
For the msnbc.com data, we find:
The inverse Gaussian with these estimates is fitted to the histogram in Figure 9. Visually, it is not a perfect fit for lower values of length, which is where the majority of the data are concentrated. And we don't show the tails, so we can not conclude from this plot that the fit is good over the whole distribution.
Commands in R to do these computations and the ones that follow can be found in Appendix B.
To see how good is this model, Huberman, et al. (1998) and Hansen and Sen (2003) compared the cumulative distribution function implied by the model to the empirical cumulative distribution function derived from the data. Then they use a probability-probability plot against the fitted distribution. We do the same with the length variable; the plots can be seen in Figure 10.The p-p plot reveals a misfit of the inverse Gaussian model to our data at the lower values of length. Hansen and Sen (2003) got similar results with the bell-labs.com data set they used.
Figure 10. Left: Empirical (ooo) and Inverse Gaussian (---) cumulative distributions.
Right: p-p plot.
Another way of investigating whether the inverse Gaussian is a good model, is based on the following fact: If you take logs on both sides of the inverse Gaussian formula you obtain
Thus a plot of log(L) vs log(frequency) should show a straight line whose slope approximates -3/2 for small values of L and large values of the variance. This is because if we substitute for on the right hand side of the formula, i.e., which follows from the formula for the variance, the variance appears in the denominator and the mean in the numerator. For small mean, which is the case here, and large variance, which is also the case for our data, the second term is almost 0. A plot of the frequency distribution of surfing clicks on the log-log scale can be seen in Figure 11. According to this plot andthe theoretical result, the regression line for the whole range of the data has a slope of -1.9, which is not too far from -1.5, so this result holds approximately for our data.
Figure 11. Plot of frequency distribution of length in log-log scale.
The log-log plot also helps us notice that, up to a constant given by the third term, the probability of finding a group surfing at a given level scales inversely in proportion to its length, . This is a characteristic that appears in a lot of Internet data sets. We don't pursue it further here, but it is at the heart of the debate about the nature of the data and the best possible model.
The appeal of the inverse Gaussian is in its decision theoretic foundations: it is the distribution that would result if visitors to the web site were optimizing their utility (Huberman, et al. 1998). But based on the results above, would we recommend the inverse Gaussian model for the length of visits (or number of links that a user visits) in the msnbc.com data set or other web server log data? This is one of the questions still unanswered and in need of more research. Several authors have tried other distributions with other data sets, for example, the geometric, the Poisson and a power law, but none of these distributions have fit the data well, either in the upper tail or in the lower levels of length. See, for example, Baldi, Frasconi, and Smyth (2003). A power law distribution tends to give a good fit at the tail, but it fails in fitting the lower values, which is where most of the observations are concentrated.
We shall not dwell further on Web browsing behavior research. But before we conclude this section, we would like to point out that the above is just the tip of the iceberg. Once the distribution of ``length'' is settled, the next step for researchers is to model the sequence of requests by users. Huberman, et al. (1998) model them using a simple first order Markov model. Hansen and Sen(2003) try a first and second-order Markov model, a finite mixture of first-order models, and a Bayesian approach. Cadez, et al. (2003) investigate simple Markov models for different clusters of users. The objective of these modeling attempts is to determine the best model to predict a user's next page request. Pages with higher probability of being requested can then be made more accessible.
Interested readers can experiment in class with many of these questions, either using the raw data, or three different processed data sets that we extracted from this raw data set, and that are available at the CS-STATS web site that we will be glad to provide upon request. Perhaps the reader can obtain Web server log data from the school where this material will be taught. In the latter case, be aware that raw log server data with URLs and detailed computer information needs to be converted to something like the raw data of Heckerman using Perl or similar programs. After that, you can process it further to use it for data analysis. The CS-STATS web site has some Perl scripts that could be used to that end.
What is most relevant is that the analysis of Internet data is at its early stages and therefore there are many unsolved questions and no established paradigm. By involving students in those questions, we are making them active participants in current debates, without leaving the classroom. Interested readers can contact the authors to access exercises and data sets that we have prepared for use in the classroom.
dec1 = read.table("packetdata.dat.txt",header=TRUE)
timestamp = dec1$timestamp[dec1$databytes>0]
databytes = dec1$databytes[dec1$databytes>0]
source=dec1$source[dec1$databytes>0]
destination=dec1$destination[dec1$databytes>0]
sourceport=dec1$sourceport[dec1$databytes>0]
destport=dec1$destport[dec1$databytes>0]
hist(databytes)
summary(databytes)
plot(timestamp,databytes)
arrivals = matrix(timestamp)
n=(length(timestamp))-1
interarrivals = matrix(rep(0,n),ncol=1)
for(i in 1:n) {interarrivals[i]= arrivals[i+1]-arrivals[i]} #we are measuring the time between each two arrivals
par(mfrow=c(2,1)) # open space to put two graphs together
hist(interarrivals, main="histogram of interarrivals in the data") # don't copy this graph yet
summary(interarrivals)
lambda=1/mean(interarrivals)
exponential =rexp(n,lambda) #generate exponential random variables
hist(exponential, main="histogram of simulated exponential", xlim=c(0,max(interarrival)) #copy and paste the graph window now.
dev.off() #close the graph window
qqplot(interarrivals,exponential,main="qqplot of data vs model")
abline(0,1)
pacpersecond = matrix(c(rep(0,3600)))
par(mfrow=c(1,2))
negbinomial = rnbinom(3450, mu = 384.2629, size = 384.2629^2/(15306.35-384.2629))
boxplot(pacpersecond,ylim=c(0,1300),main="Packets per second")
boxplot(negbinomial,ylim=c(0,1300),main="Negative binomial")
qqplot(pacpersecond,negbinomial,main="quantile-quantile plot")
abline(0,1)
for(i in 1:3600) {pacpersecond[i]=length(timestamp[(floor(timestamp))==i])}
pacpersecond=pacpersecond[pacpersecond>0]
plot(pacpersecond,type=''b'')
qqplot(pacpersecond, rpois(length(pacpersecond), mean(pacpersecond)),ylab="Poisson")
abline(0,1)
pacperminute = matrix(c(rep(0,60)))
pacperminute[1]= sum(pacpersecond[1:60])
for(i in 2:60)
{j=(i-1)*60+1
k=i*60
pacperminute[i]=sum(pacpersecond[j:k])
}
plot(pacperminute,type=''b'')
# Read data, do histogram and superimpose the inverse Gaussian
length=read.table("msnbclength.dat.txt")
length=length$length
length=length[length<=100]
library(SuppDists)
postscript("histogram.eps",horizontal=FALSE)
par(mar=par("mar")+c(0,0,15,0))
hist(length,prob=T)
pts=seq(from=par("usr")[1],to=par("usr")[2],len=length(length))
lines(pts,dinvGauss(pts,4.747129,3.081917),xpd=T)
dev.off()
# Do the plot of log length against log frequency and
# find regression estimates
freqtable = read.table("frequencytable",header=TRUE)
length=freqtable$length
frequency=freqtable$frequency
plot(log(length),log(frequency))
lm(log(frequency)~log(length))
library(SuppDists)
length=read.table("msnbclength.txt",header=TRUE)
length=length$length
n=length(length)
lengthsum=sum(length)
lengthtable=table(length)
lengthfrequency=table(length)/n
cumlength=cumsum(lengthfrequency)
mu=4.747129
lambda=3.081917
l=seq(1,100)
prob=pinvGauss(l,mu,lambda)
par(mfrow=c(1,2))
plot(l,cumlength[1:100], type="p", xlab="length", ylab="Cumulative probability")
lines(l,prob)
plot(prob,cumlength[1:100], xlab="Inverse Gaussian CDF", ylab="Empirical CDF")
lines(l/100,l/100)
dev.off()
Baldi, P., Frasconi, and P., and Smyth, P. (2003), “Modeling the Internet and the Web,” in Probabilistic Methods and Algorithms, New York: Wiley & Sons.
Cadez, I., Heckerman, D., Meek, C., Smyth, P., and White, S. (2003). “Model-based clustering and visualization of navigation patterns on a Web site,” Journal of Data Mining and Knowledge Discovery, 7(4), 399-424.
Castro,R., Coates M., Liang, G., Nowak R., and Yu, B. (2004), “Network Tomography: recent developments,” Statistical Science, 19(3), 499-517.
CS-STATS www.stat.ucla.edu/~jsanchez/oid03/csstats/cs-stats.html
Digital Equipment Corporation. The traces were made by Jeff Mogul (mogel@pa.dec.com) of Digital's Western Research Lab (WRL). The trace correspond to DEC-WRL-1
Gautam, N. (2003), “Stochastic Models in Telecommunications for Optimal Design, Control and Performance Evaluation,” Handbook of Statistics, Vol. 21, eds. D.N. Shanbhag and C.R. Rao, Elsevier Science B.V.
Guerin, C.A., Nyberg, H., Perrin, O., Resnick, S., Rootzen, H., and Starica, C. (2003) “Empirical Testing of the Infinite Source Poisson Data Traffic Model,” Stochastic Models, 19(2), 151-200.
Hansen, M.H. and Sen, R. (2003), “Predicting Web User's next access based on log data,” Journal of Computational and Graphical Statistics, 12(1), 143-155.
Heckerman, D. The UCI KDD Archive (kdd.ics.uci.edu) Irvine, CA: University of California, Department of Information and Computer Science. The URL for the data used in this paper is kdd.ics.uci.edu/databases/msnbc/msnbc.dat.txta.html
Huberman, B.A., Pirolli, P.L.T., Pitkow, J.E., and Lukose, R.M. (1998), “Strong Regularities in World Wide Web Surfing,” Science, 280, 95-97.
Paxson, V. and Floyd S.(1995). “Wide-Area Traffic: The Failure of Poisson Modeling,” IEEE/ACM Transactions on Networking, 3(3), 226-244.
Willinger, W. and Paxson, V. (1998), “Where Mathematics meets the Internet,” Notices of the AMS, September 1998, 961-970.
Willinger, W., Taqqu, M.S., Leland, W.E. and Wilson, D.V. (1995), “Self-similarity in High-Speed PAcket Traffic: Analysis and Modeling of Ethernet Traffic Measurements,” Statistical Science, 10(1), 67-85.
Juana Sanchez
Department of Statistics
UCLA
Los Angeles, CA
U.S.A.
jsanchez@stat.ucla.edu
Yan He
Department of Statistics
UCLA
Los Angeles, CA
U.S.A.
yanhe@stat.ucla.edu
Volume 13 (2005) | Archive | Index | Data Archive | Information Service | Editorial Board | Guidelines for Authors | Guidelines for Data Contributors | Home Page | Contact JSE | ASA Publications