NAME: packetdata.dat (Internet Data Analysis for Undergrad Curriculum)

TYPE: Observational

SIZE: 100,001 rows with six space-delimited columns, with headers "timestamp", "source", "destination", "sourceport", "destport", "databytes".

DESCRIPTIVE ABSTRACT: The data set contains 50000 observations (1.3 MB) covering roughly 2 minutes of traffic from the larger, one-hour dec-pkt-1.tcp file used in the paper. The larger file can be accessed from the author's web page or from its source. With only 50000 observations, the data set follows behavior similar to that of the larger file.

SOURCE: The data came from the logs of Digital Equipment Corporation servers and were archived in the Internet Traffic Archive, http://ita.ee.lbl.gov/html/traces.html, after being sanitized, i.e., all private content was removed. The trace from that public archive that is used here is called dec-pkt-1 and corresponds to DEC-WRL-1 in the article Wide-Area Traffic: The Failure of Poisson Modeling, V. Paxson and S. Floyd, IEEE/ACM Transactions on Networking, 3(3), pp. 226-244, June 1995. The original dec-pkt-1.tcp data summarize traces of one hour's worth of TCP packet traffic (2,153,462 rows, one row per packet) between Digital Equipment Corporation and the rest of the world on March 8th, 1995. The data are the processed version of the raw traffic logs that the server keeps.

VARIABLE DESCRIPTIONS:
Timestamp: Time of packet arrival, in minutes since the last hour.
Source: Source host of the packet, coded for confidentiality reasons.
Destination: Destination host, coded for confidentiality reasons.
Sourceport: Source TCP port.
Destport: Destination TCP port.
Databytes: Number of data bytes in the packet, or 0 if none (this can happen for packets that only acknowledge data sent by the other side). The 0-byte packets are usually removed.
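
The layout described above (one header row, six space-delimited columns) can be read with a few lines of Python. The following is only a minimal sketch under stated assumptions: the function names load_packets and drop_empty are hypothetical, host codes are kept as text, ports and byte counts are assumed to be integers, and the sample rows are made up for illustration and are not taken from packetdata.dat.

```python
import io

# Column names as listed in the SIZE field of this description.
COLUMNS = ["timestamp", "source", "destination",
           "sourceport", "destport", "databytes"]


def load_packets(handle):
    """Parse a space-delimited packet table with a header row into dicts."""
    # Tolerate quoted or comma-terminated header names (an assumption).
    header = [name.strip('",').lower() for name in next(handle).split()]
    if header != COLUMNS:
        raise ValueError(f"unexpected header: {header}")
    packets = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        row = dict(zip(COLUMNS, fields))
        packets.append({
            "timestamp": float(row["timestamp"]),
            "source": row["source"],            # coded host, kept as text
            "destination": row["destination"],  # coded host, kept as text
            "sourceport": int(row["sourceport"]),
            "destport": int(row["destport"]),
            "databytes": int(row["databytes"]),
        })
    return packets


def drop_empty(packets):
    """Remove packets that carry no data bytes (for example, pure acks)."""
    return [p for p in packets if p["databytes"] > 0]


# Made-up rows in the documented layout, used only to exercise the parser.
SAMPLE = """timestamp source destination sourceport destport databytes
0.0010 1 2 23 2436 1
0.0012 3 4 3930 119 0
0.0025 2 1 2436 23 512
"""

packets = load_packets(io.StringIO(SAMPLE))
data_packets = drop_empty(packets)
print(len(packets), len(data_packets))
```

To read the real file, the same function could be called as load_packets(open("packetdata.dat")), assuming the file sits in the working directory.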
There are no missing values.

STORY BEHIND THE DATA: Messages that flow from a source to a destination through the Internet are known as traffic, which travels according to protocols. This traffic and the network conditions are highly random in nature. The most common protocol for traffic is TCP (Transmission Control Protocol), which acknowledges packet receipt. One way to capture this traffic is to look at the logs that servers keep of all incoming traffic, or tcpdumps. The data set dec-pkt-1 comes from one such log and allows us to measure the number of packets per unit of time, as well as the interarrival time between packets. These are two very important measures for assessing traffic congestion and network performance. The smaller data set provided follows behavior similar to that of the larger data set.

PEDAGOGICAL NOTES: The TCP packet traffic data are excellent for illustrating in class the idea of q-q plots, time plots, box plots and histograms for unusually behaving data with thick tails, and for opening a discussion of which summary statistics are appropriate for these data and which probability model fits them best. The histogram of the databytes is an interesting bimodal histogram. The time plots reveal burstiness, which translates into a nonconstant rate of traffic. This idea can be used to show students that common distributions like the Poisson and the Exponential are not appropriate, or simply that means are not good summaries. The box plots can show students that there is a continuum of outliers, and therefore that rare events are not so uncommon in these data.

REFERENCES:
http://www.stat.ucla.edu/~jsanchez/oid03/csstats/index.htm
http://ita.ee.lbl.gov/

SUBMITTED BY:
Juana Sanchez
UCLA Department of Statistics
8125 Math Sciences Building
Box 951554
Los Angeles, CA 90095-1554
jsanchez@stat.ucla.edu
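
The two measures named in the story above, packets per unit of time and interarrival times, can be computed from the timestamp column alone. The sketch below is one possible classroom check, not the author's method: the function names and the bin width are assumptions, and the timestamps are made up to mimic bursty arrivals. It relies on two standard facts: exponentially distributed interarrival times have a coefficient of variation of 1, and Poisson counts have a variance equal to their mean. Values well above 1 on either measure are consistent with burstiness.

```python
from collections import Counter
from statistics import mean, pstdev, pvariance


def interarrival_times(timestamps):
    """Differences between consecutive packet arrival times."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def counts_per_bin(timestamps, width):
    """Number of packets in consecutive time bins of the given width."""
    start = min(timestamps)
    counts = Counter(int((t - start) // width) for t in timestamps)
    n_bins = max(counts) + 1
    # Include empty bins so that quiet periods count as zeros.
    return [counts.get(i, 0) for i in range(n_bins)]


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (1 for an exponential)."""
    return pstdev(values) / mean(values)


def index_of_dispersion(counts):
    """Variance divided by the mean (1 for Poisson counts)."""
    return pvariance(counts) / mean(counts)


# Made-up timestamps with two bursts and one straggler (illustration only).
timestamps = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 9.9]

gaps = interarrival_times(timestamps)
counts = counts_per_bin(timestamps, width=1.0)

print("interarrival CV:", coefficient_of_variation(gaps))
print("count dispersion:", index_of_dispersion(counts))
```

With the real data, the timestamps list would come from the timestamp column of packetdata.dat, and the bin width would be chosen in the same time unit as that column.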