Updated Sep 02, 2008 by jwzurawski
Tier2BCP  

Tier 2 Best Common Practices

DRAFT

To Do

Restructure current content & expand into 3 Documents

  • An Introduction document that describes why measurement is important and documents the value of well deployed measurement infrastructure
  • A Best Practices document, styled after IETF BCP 15, that recommends specific measurements, protocols and schedules for the US LHC Tier 1/2/3 community
  • An Implementation Guide that describes how to install, configure & maintain specific tools to implement the BCP.

Review all text in Bold Italics for appropriate parameters with subject area experts.

Table of Contents

Nothing in here should be considered formal, or official, until somebody says otherwise!

Introduction

This document is a recommendation from the US perfSONAR community to the US LHC community. It describes how the US LHC Tier-1, Tier-2 and Tier-3 centers can effectively use the network measurement tools developed by the perfSONAR collaboration to debug, monitor and manage the network paths critical to their center.

The LHC computing model is a complex distributed workflow system that relies on a large number of compute, storage, and network services provided and supported by many different organizations around the globe. Many of the components of this system are new, and have never operated as production infrastructure on this scale before.

The way the LHC community is going to use the global networks differs significantly from how most prior large science experiments used them. This new way of using and stressing the research and education network infrastructure is expected to expose previously unknown problems or limitations in some parts of the global infrastructure. Simple faults, where a single system fails completely until it is repaired, are usually easy to diagnose and repair. However, transient faults and subtle partial failures in a system as large and complex as the LHC computing model can be very difficult to track down.

Deploying a perfSONAR network measurement infrastructure in the LHC Tier 2 centers will make the network components of the workflow system more predictable and deterministic. It will make it trivial to determine if the network services are up and functioning correctly, or suffering some impairment. It should reduce the effort required to diagnose complex workflow problems, and it should allow the LHC scientists to focus their time on other parts of the computing model, or the LHC science.

Goals

Allow LHC Scientists to easily:

  1. Characterize and track network connectivity between their center, and the centers they serve or rely on.
  2. Characterize and quantify network performance problems to accelerate diagnosing and fixing them.
  3. Differentiate between application and network performance problems.
  4. Differentiate between local and remote network problems.
  5. Identify, understand and respond effectively to changes in the underlying network.

Use Cases

There are many use-cases for robust network measurement capabilities. We expect common use cases in the US LHC community to include:

How these use cases apply to the LHC center management process can be illustrated by the following scenario. A Tier 2 center manager has heard that there is a network performance problem between his center and 4 different sites. He might use the measurement infrastructure to confirm the problem, isolate it to a particular domain, and gather the diagnostic information as described below. Note: general information about the diagnostic processes, flow charts, etc. will be in the Usage Guide section of this document.

  1. He looks at the regularly scheduled latency tests and sees the following:
  2. He investigates the performance to Site C.
    1. Checks the regularly scheduled bandwidth tests.
    2. Checks the interface utilization data on the path to Site C
    3. So, he concludes that the network to Site C has failed over to an alternate path, but it is working correctly.
  3. He investigates the performance to Site B.
    1. He notes that the base latency hasn't changed significantly
    2. He checks the regularly scheduled bandwidth tests
    3. He checks the Interface Utilizations along the path
  4. He investigates the performance to site A.
    1. He checks the scheduled bandwidth tests
    2. He checks the scheduled end to end latency tests.
    3. He checks the interface data
    4. He calls his network administrator to open a ticket with their network provider.
    5. The upstream network provider correlates the performance change with a maintenance event on an adjacent piece of equipment and believes a fiber might have been bumped.
  5. He investigates the performance to Site D.
    1. The latency hasn't changed.
    2. The regularly scheduled bandwidth test results haven't changed.
    3. The utilization along the path is low.
    4. The PhEDEx graphs show many queued and failed connections.

Measurements

The network measurement infrastructure will support the following network measurements. These can be used to characterize the network between points of interest, and in some cases all along the path.

General Diagnostics

Continuously measure end to end delay

What

Why

Make regular scheduled bandwidth measurements across paths of interest

What

Why

Monitor up down status of cross domain circuits

What

Why

Monitor Link Circuit Capacity Utilization and Errors

What

Why

Diagnostics to look for specific known performance problems

What

Why

Tools

There are several tools that can be used to provide the measurements listed above. The first priority in this recommendation is to ensure that diagnostic measurements can take place, but those diagnostics are most useful given historical data for comparison. Therefore, the specific tool recommendations in this list were chosen based on the ability of the tool to work in both an on-demand mode for diagnostics as well as a scheduled mode for on-going monitoring and historical analysis. Additionally, because network performance diagnostics is inherently a multi-domain effort, the perfSONAR framework is used as a medium to share the results of the measurements.
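
The value of historical data for diagnostics can be illustrated with a minimal sketch. It assumes scheduled delay results are already available as a list of milliseconds; the function name, the tolerance, and the sample values are hypothetical and are not part of any perfSONAR tool.

```python
from statistics import median


def is_anomalous(history_ms, sample_ms, tolerance=3.0):
    """Return True if sample_ms deviates from the historical median by more
    than `tolerance` times the median absolute deviation (MAD)."""
    center = median(history_ms)
    mad = median([abs(x - center) for x in history_ms])
    # Guard against a perfectly flat history, where the MAD would be zero.
    spread = max(mad, 0.1)
    return abs(sample_ms - center) > tolerance * spread


# Hypothetical scheduled one-way delay results, in milliseconds.
history = [20.1, 20.3, 19.9, 20.2, 20.0, 20.4, 20.1]

print(is_anomalous(history, 20.3))  # an on-demand test within normal variation
print(is_anomalous(history, 31.5))  # an on-demand test far from the baseline
```

Without the scheduled history, the 31.5 ms on-demand result would have nothing to be compared against.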

Delay Measurements

diagnostics: Two main tools are recommended in this area:

  1. ping (ICMP)
  2. owamp

on-going monitoring: Two tools can be used for on-going monitoring; the choice between them depends largely on the amount of cooperation between the sites. (It is expected that sites that are serious about monitoring will implement both.)

  1. pingER along with the perfSONAR-PS pingER-MA
  2. perfSONARBUOY

Bandwidth Measurements

diagnostics: bwctl

up down status of cross domain circuits

TBD

link capacity utilization errors

TBD

Best Practices

Introduction

This section describes the network measurement infrastructure that should be deployed at all major US LHC centers, and the set of regularly scheduled network measurements that should be made to all other US LHC centers of interest.

Measurement Infrastructure

General Guidelines

The measurement infrastructure contains multiple components that may influence each other making results analysis more difficult. The primary example is that bandwidth tests run on the same computer as continuous latency measurements will affect the latency measurement results. In order to simplify the analysis process, bandwidth measurement points and latency measurement points SHOULD NOT be deployed on the same physical machine.

Measurement points SHOULD be deployed as close to the network administrative boundaries as possible. The reason for this is to facilitate diagnosing problems using path decomposition techniques and to make the resulting data as actionable as possible.
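
As a rough illustration of path decomposition, the sketch below assumes that a measurement point sits at each administrative boundary, so that delay is known per segment. The segment names, delay values, and threshold are hypothetical.

```python
def locate_degraded_segments(baseline_ms, current_ms, threshold_ms=5.0):
    """Return the segments whose delay rose by more than threshold_ms
    relative to the baseline, in path order."""
    return [
        segment
        for segment in baseline_ms
        if current_ms[segment] - baseline_ms[segment] > threshold_ms
    ]


# Hypothetical per-segment delays between boundary measurement points (ms).
baseline = {
    ("tier2", "regional"): 2.0,
    ("regional", "backbone"): 12.0,
    ("backbone", "tier1"): 3.0,
}
current = {
    ("tier2", "regional"): 2.1,
    ("regional", "backbone"): 30.0,
    ("backbone", "tier1"): 3.0,
}

print(locate_degraded_segments(baseline, current))
```

Because each segment maps to one administrative domain, the result points directly at the organization that needs to be contacted.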

Bandwidth Measurement Infrastructure

The bandwidth measurement infrastructure will measure achievable TCP bandwidth using memory to memory transfers over tuned TCP sessions between bandwidth measurement points.
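
What "tuned" means in practice depends on the path. A common rule of thumb is that the TCP buffer must hold at least one bandwidth-delay product for a single stream to fill the path. The sketch below computes that figure; the link speeds and round-trip time are illustrative assumptions, not recommended parameters.

```python
def bdp_bytes(bandwidth_bps, rtt_ms):
    """Bandwidth-delay product in bytes: bits in flight during one RTT,
    divided by 8. Integer arithmetic avoids rounding surprises."""
    return bandwidth_bps * rtt_ms // 8000


# Illustrative values only: 1 Gbps and 10 Gbps paths with a 70 ms round trip.
for label, bps in (("1 Gbps", 1_000_000_000), ("10 Gbps", 10_000_000_000)):
    print(label, bdp_bytes(bps, 70), "bytes")
```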

Bandwidth measurements are useful to detect various network problems that may not affect delay measurements. Since bandwidth measurements are intrusive, they should be used with restraint as described in the schedule section below.

One of the issues that must be addressed when deploying bandwidth measurement tools is whether the servers should be capable of saturating the network. This is an area of active debate. There is a general consensus that test machines should be at least 1 Gbps capable to detect the most common problems. Some domains are actively deploying 10 Gbps capable bandwidth test systems so they can easily identify and debug problems that only appear in networks faster than 1 Gbps.

Hardware
Protocols
Operations

Delay Measurement Infrastructure

Delay measurements can provide very sensitive light-weight indications of many different network changes or pathologies.

Clock Synchronization

One-way delay measurement protocols rely on the servers having both stable and accurate clocks. The protocols also require the ability to estimate the accuracy of their time synchronization. Therefore the delay measurement system must have a stable and accurate clock. The problem is that configurations tuned for stability alone are not very accurate and vice versa. The engineering compromise that MUST be maintained is as follows:

Obtaining this level of clock accuracy is not difficult, but it does require some planning. The accuracy requirement can be achieved by synchronizing to a Stratum 1 time source, such as a GPS or CDMA synchronized hardware clock, or by NTP synchronizing with a Stratum 1 time source over a low jitter network path. Maintaining the error bounds within the recommended range requires NTP synchronization with 4 or 5 other Stratum 1 time sources over low jitter paths. This should be straightforward to achieve if most Delay Servers have their own hardware clocks, and they NTP peer with the Delay Servers that they are making regularly scheduled tests against, or with a set of public clocks maintained by the community.
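
To see why the clock error estimate matters, consider the sketch below. It assumes each endpoint can report a worst-case error bound for its own clock, and treats the sum of the two bounds as the uncertainty of the one-way delay. The function names and all values are hypothetical.

```python
def one_way_delay(send_us, recv_us, send_err_us, recv_err_us):
    """Return (delay, uncertainty) in microseconds.

    The sender timestamps with its clock and the receiver with its own, so
    the worst-case error of the difference is the sum of both error bounds.
    """
    delay = recv_us - send_us
    uncertainty = send_err_us + recv_err_us
    return delay, uncertainty


def is_meaningful(delay_us, uncertainty_us):
    """A measured delay smaller than its uncertainty says little."""
    return delay_us > uncertainty_us


# Hypothetical timestamps and clock error bounds, in microseconds.
delay, uncertainty = one_way_delay(0, 35_000, 200, 300)
print(delay, "+/-", uncertainty, "us")
print(is_meaningful(delay, uncertainty))
```

With poorly synchronized clocks the uncertainty can exceed the delay itself on short paths, which is why the error bounds have to be maintained as well as the accuracy.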

NDT and NPAD Measurement Infrastructure

Passive Measurement Point

Scheduled Measurements

Delay

One-way delay measurements are more valuable because they essentially perform a first-level path decomposition by measuring each direction separately. Therefore, one-way latency measurements SHOULD be used whenever possible, and round-trip measurements SHOULD only be used when one-way measurement infrastructure is not available.
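
The difference can be shown with a small sketch using hypothetical numbers: a round-trip measurement reveals that something changed, while the one-way measurements also reveal in which direction.

```python
def changed_directions(before_ms, after_ms, threshold_ms=5.0):
    """Return the directions whose one-way delay moved by more than
    threshold_ms between two measurements."""
    return [
        direction
        for direction in ("forward", "reverse")
        if abs(after_ms[direction] - before_ms[direction]) > threshold_ms
    ]


# Hypothetical one-way delays in milliseconds, before and after a route change.
before = {"forward": 10.0, "reverse": 10.0}
after = {"forward": 40.0, "reverse": 10.0}

# A round-trip tool only sees the sum of both directions.
print("RTT before:", sum(before.values()), "RTT after:", sum(after.values()))
# One-way measurements identify the affected direction.
print("Changed:", changed_directions(before, after))
```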

One Way Delay Measurements
Protocols
Schedule
Round Trip Delay Measurements
Protocols
Schedule

Bandwidth Measurements

Schedule
Server Configuration

Passive Measurements

Interface Statistics

Utilization

Errors and Discards

Adhoc Measurements

NDT

NPAD

Bandwidth

Latency

Legal Issues

Most countries have privacy laws regarding the publication of information about people. They range from the relaxed US laws to the UK requirement that information should be accurate to the Norwegian law that says that you can't publish individually identifiable information unless you get specific permission from the individual. Every maintainer of network performance information should publish data according to the national law of the country in which the database holding the information resides.

In general, individually identifiable information is not required for network performance monitoring, analysis and debugging. It is recommended that organizations do not publish network performance information about interfaces, flow records, or network attributes that can be identified with a single individual.

Organizations should also consider any other legal restrictions on their network performance data before publication. For example, some commercial network provider contracts explicitly prohibit publication of network performance data. It is recommended that organizations attempt to negotiate any such terms to allow as broad a publication of network performance data as possible.

Security Considerations

Considerations for Deploying Measurement Systems

Considerations for Sourcing and Sinking Active Measurements

Considerations for Publishing Measurement Results

Implementation Guide

The implementation guide section of this document describes how to deploy a network measurement infrastructure.

Based on what we expect to be available at Tier-1 sites, these tools and configurations would be useful at Tier-2 sites. This should remain fairly high-level, with the expectation that we will create very detailed instructions for the LHC community accepted portions.

Set up local infrastructure so others can perform robust measurements to your site

The hardware for a typical perfSONAR installation should contain at least 2 systems, one for a bandwidth measurement point and one for the latency measurement point, so the different measurements do not affect each other. These measurement points should be placed as close to the administrative borders of the network as possible.

We anticipate 2 main deployment options. One option is to use a bootable CD with all of the tools already installed. Another option is to use a set of Red Hat Enterprise Linux 5 RPMs.

Bootable CD Installation

Insert URL to knoppix install here.

RHEL5 RPM installation

  1. Basic Configs
  2. OWAMP
  3. BWCTL
  4. PerfSONARBUOY
  5. Pinger
  6. PerfSONAR-PS Utilization MA
    1. Install the utilization MA on the latency test system (or other web services platform.)
    2. Detailed instructions for deploying the utilization MA will be developed
  7. NDT
  8. NPAD
  9. Etc.

Identify important collaborators

  1. Organizations that provide important services to you
    1. Tier 1 sites that serve important data
    2. Tier 2 sites that you collaborate with
    3. Cern?
  2. Organizations you provide services to
    1. Tier 3 sites that you service
    2. Tier 2 sites that you collaborate with.

Configure Local Measurements

Once you have identified your collaborators, you need to determine which of them are participating by deploying local measurement infrastructure and which are not participating at this time. It is expected that all Tier 1 sites will be participating before the LHC goes online.

  1. Participating Collaborators
    1. Set up continuous latency measurements to the peer's OWAMP service. (HOW DO WE DO THIS USING PERFSONAR SERVICES?)
    2. Set up 1 minute bandwidth tests 4 to 6 times a day with the peer's BWCTL service. (HOW DO WE DO THIS USING PERFSONAR SERVICES?)
  2. Non-Participating Collaborators
    1. Send a note to the remote site administrator asking if they could recommend two reliable servers that you could ping to monitor site availability.
      1. Externally accessible Grid service nodes or storage server frontdoors may be good candidates.
      2. Routers are typically not a good idea
      3. NEED TO ADD DETAILED PING TARGET LOAD EXPECTATIONS HERE. IE How many packets per day will the default config send? How does this compare to normal background junk levels from Internet?
    2. Identify 2 hosts per remote location that you are going to measure.
    3. Configure a local perfSONAR Pinger system to track performance to the hosts identified
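
The load these measurements place on a remote site can be estimated with simple arithmetic. The sketch below uses the bandwidth schedule mentioned above (1 minute tests, up to 6 times a day). The ping parameters are placeholders chosen only for illustration; they are not the default configuration of any tool.

```python
SECONDS_PER_DAY = 86_400


def bandwidth_duty_cycle(tests_per_day, seconds_per_test):
    """Fraction of the day during which a bandwidth test is running."""
    return tests_per_day * seconds_per_test / SECONDS_PER_DAY


def ping_packets_per_day(packets_per_test, tests_per_day, hosts):
    """Total probe packets sent per day to all hosts at one remote site."""
    return packets_per_test * tests_per_day * hosts


# Six 60 second bandwidth tests per day to one peer.
print(f"{bandwidth_duty_cycle(6, 60):.2%} of the day")

# Placeholder ping settings: 10 packets per test, 48 tests per day, 2 hosts.
print(ping_packets_per_day(10, 48, 2), "packets per day")
```

Sharing figures of this kind with the remote site administrator may make the request for ping targets easier to evaluate.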

Example Configuration Files

Conclusion

It is possible to participate in the network measurement infrastructure at different levels. Your organization will get different levels of benefits depending on the level of participation.

EXPAND STRAWMAN BELOW....

Non Participant

Non-participants do not expend any effort, and have no control of network measurements made from remote sites into the local infrastructure.

Target

Participating in the measurement infrastructure at this level provides you information about, and some level of control over the measurements made to your local site from remote locations.

You need to do the following to participate at this level:

Normal Participant

(NEED A BETTER TITLE HERE...)

Participating at this level will allow a site to measure, document and understand the network characteristics between the local site and the important customers and providers, simplifying problem identification and resolution, and capacity planning.

You need to do the following to participate at this level:

Measurement Champion

Participants at this level are expected to assist others in their community with deploying and maintaining the measurement infrastructure.

They will host the Web visualization tools allowing inspection and analysis of the measurement data collected at their local site, as well as at other sites...

You need to do the following to participate at this level:

Usage Guide

The Usage Guide section of this document will describe how to use an operating network measurement infrastructure to detect, diagnose and confirm resolution of network performance problems.

Authors

Everybody add your name to the list below.

Joe Metzger

Last Updated

$Id$

