SideStream  
Description of the SideStream experiment.
Updated Oct 21, 2011 by tizi...@google.com

Introduction

SideStream is an experiment that collects statistics about the TCP connections used by the measurement tools running on the M-Lab platform.

System Architecture

SideStream consists of 3 components:

  • A standard webserver with sample data,
  • A daemon collecting Web100 exit statistics, and
  • A daemon collecting raw TCP packet traces (tcpdump).

Future Extensions

In addition, two other classes of components are envisioned:

  • External clients to pull emulated application data from the web servers, and
  • External analysis tools to extract useful statistics from the raw TCP instrumentation. One useful (pre-)analysis tool could extract individual TCP connections from the raw daemon data files and write them into individual per-connection or per-host files.
Note that the external components do not have to be uniform or even coordinated. In principle, multiple pools of external clients could use the same M-Lab components in conjunction with multiple sets of analysis tools to implement slightly different experiments, all under the SideStream umbrella experiment.

Deployment on M-Lab

SideStream is running on all M-Lab nodes.

All the SideStream components are running in the same M-Lab/PlanetLab slice as NPAD, and NPAD shares the same webserver for its own use.

In particular, NPAD is running a fairly standard Apache configuration on port 8000 and it uses the webserver to load the NPAD diagnostic server form, applet and final reports.

To facilitate SideStream, each NPAD server includes a directory of synthetic data, /Sample/, which can be listed. In the sample data, each file is named by its size, with an appropriate extension for its type. Currently the only data is (highly compressible, repeating) .txt files with power-of-2 sizes from 512 Bytes to 1 MByte. Note that the actual HTTP transfers will be slightly larger than the file sizes due to HTTP overhead.

Arguments on the URL are explicitly ignored, so they can be used to suppress caching.
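As a rough illustration of how an external client might address the sample files, the sketch below enumerates the power-of-2 sizes and appends a throwaway URL argument to suppress caching. The file naming pattern (size in bytes plus .txt), the argument name nocache, and the host name are assumptions made for illustration; list the /Sample/ directory on a real server to see the actual names.

```python
import time
from urllib.parse import urlencode


def sample_sizes():
    """Return the power-of-2 sizes from 512 bytes (2**9) to 1 MByte (2**20)."""
    return [2 ** k for k in range(9, 21)]


def sample_url(host, size, nonce=None):
    """Build a URL for one sample file with a cache-defeating argument.

    ASSUMPTION: files are named "<size in bytes>.txt". The page only says
    that each file is named by its size, so the real names may differ.
    """
    if nonce is None:
        nonce = time.time_ns()  # any changing value works; the server ignores it
    query = urlencode({"nocache": nonce})  # "nocache" is a hypothetical name
    return f"http://{host}:8000/Sample/{size}.txt?{query}"
```

For example, sample_url("npad.example.org", 512, nonce=1) yields http://npad.example.org:8000/Sample/512.txt?nocache=1, where npad.example.org stands in for a real NPAD server.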

Data Collected

The daemons write one-hour data files (Web100 dumps and tcpdumps).

Since Web100 neither uses network namespaces nor participates in the PlanetLab/M-Lab vserver virtual machine, the SideStream Web100 dumps cover all TCP connections on a given M-Lab node, including those of all other experiments, as well as management traffic for all experiments and PlanetLab itself. On the other hand, since the PlanetLab packet capture facility enforces network namespaces, it is not possible to see packets to or from other M-Lab slices. Furthermore, tcpdump only captures TCP connections to local port 8000, which is used by the NPAD webserver. This restriction shields the capture from NPAD's measurement traffic, which could quite easily overwhelm tcpdump and/or the available disk space.

NPAD runs anonymous rsync to facilitate retrieving the results. However, the SideStream data files are not atomically updated. The currently open tcpdump and Web100 dump output files used by the daemons can be expected to change as rsync runs. The partial files will be correctly updated on a later rsync run.

Web100 dumps

Web100 exit statistics are saved in ASCII files named:

SideStream/yyyy/mm/dd/nodename/iso_timeZ_ALL0.web100

where,

  • ALL indicates that this file aggregates data across all clients,
  • In some cases the trailing 0 is replaced by a small integer to avoid overwriting existing files (e.g. when the collection process is restarted).
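The naming scheme above can be taken apart with a regular expression, as in the following sketch. The exact form of the iso_time field is not specified on this page, so the pattern simply accepts anything ending in Z before _ALL; the node name and timestamp used in the usage note are made up. The pattern also accepts the .tra extension used for packet traces.

```python
import re

# ASSUMPTION: iso_time is any run of non-slash characters ending in "Z".
_NAME = re.compile(
    r"SideStream/(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/"
    r"(?P<node>[^/]+)/(?P<iso_time>[^/]+Z)_ALL(?P<serial>\d+)"
    r"\.(?P<kind>web100|tra)$"
)


def parse_dump_path(path):
    """Split a SideStream data file path into its named parts.

    Returns a dict, or None if the path does not follow the scheme.
    """
    match = _NAME.search(path)
    if match is None:
        return None
    info = match.groupdict()
    info["serial"] = int(info["serial"])  # 0 unless a restart bumped it
    return info
```

Calling parse_dump_path("SideStream/2011/10/21/node1.example.org/20111021T12:00:00Z_ALL0.web100") returns a dict whose node is node1.example.org, whose serial is 0, and whose kind is web100.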

Web100 dump files contain 2 types of records:

  • Records for data keys.
  • Records for exit (close) statistics.

A data key record has the following fields:

K: cid PollTime LocalAddress LocalPort RemAddress RemPort <Web100 variables>

where

  • The cid is the connection id, a pid-like identifier unique to each connection for its duration.
  • The PollTime is the ISO timestamp when the connection was observed to already be closed (may be up to 5 seconds after the actual close).
  • LocalAddress, LocalPort, RemAddress, RemPort form the TCP 4-tuple that uniquely identifies the connection.
  • The rest of the line names all Web100 raw instruments.
As of 25 Aug 2009, the keys are nominally deterministic; however, this property should not be assumed. The format may change in the future, and the keys are different and non-deterministic for older data.

An exit statistic record starts with C: and is in the format suggested by the most recent preceding data key record. Note that these are only summary statistics: total bytes, packets, retransmissions, etc.
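Because each C: record is interpreted against the most recent K: record, a reader has to carry the current key list along as it scans the file. The sketch below shows one way to do that. It assumes fields are separated by whitespace and that the record tag is its own token, which matches the layout shown above but is not otherwise confirmed here. Records whose field count does not match the current keys are skipped, since the last line of a file still being written may be incomplete.

```python
def parse_web100_dump(lines):
    """Yield one dict per exit record, keyed by the preceding K: record.

    ASSUMPTION: fields are whitespace separated and contain no spaces.
    All values are returned as strings; no type conversion is attempted.
    """
    keys = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # blank line
        tag, values = fields[0], fields[1:]
        if tag == "K:":
            keys = values  # new key record replaces the old one
        elif tag == "C:" and keys is not None:
            if len(values) != len(keys):
                continue  # truncated or malformed record
            yield dict(zip(keys, values))
        # Unknown record types are ignored so future additions do not break us.
```

With an open file, `for record in parse_web100_dump(open(path)): ...` gives access to fields such as record["RemAddress"] and record["RemPort"], where path is a local copy of a .web100 file.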

In the future we might add other record types, for example "progress" statistics for long running connections.

TCP dumps

Packet traces are collected to provide a mechanism for validating the Web100 data, or to collect statistics that are not covered by Web100.

They are saved in tcpdump binary files named:

SideStream/yyyy/mm/dd/nodename/iso_timeZ_ALL0.tra

where,

  • ALL indicates that this file aggregates data across all clients,
  • In some cases the trailing 0 is replaced by a small integer to avoid overwriting existing files (e.g. when the collection process is restarted).

These files are normally slightly longer than an hour, running from about 50 seconds before to about 10 seconds after the hour, such that consecutive files normally overlap by 1 minute. This was done to minimize potential problems associated with analyzing packet traces that span more than one tcpdump file.
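The timing arithmetic can be checked with a few lines of code. The sketch below uses the nominal figures quoted above (about 50 seconds before the hour, about 10 seconds after the next); real files will vary by a few seconds, and the function names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

LEAD = timedelta(seconds=50)  # nominal start before the hour
TAIL = timedelta(seconds=10)  # nominal run-on after the following hour


def nominal_window(hour_start):
    """Return the (start, end) a trace file for this hour nominally covers."""
    return hour_start - LEAD, hour_start + timedelta(hours=1) + TAIL


def nominal_overlap(hour_start):
    """Return how long this hour's file overlaps the next hour's file."""
    _, end_this = nominal_window(hour_start)
    start_next, _ = nominal_window(hour_start + timedelta(hours=1))
    return end_this - start_next
```

For any hour_start, nominal_overlap returns a timedelta of one minute, so a connection that begins or ends within that minute should appear in both files.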

