Note the new location for the seminar.


Thursday, November 13th, 2003
Room 202, Packard Building

Monitoring, Analyzing, and Controlling Internet-scale Systems with ACME


David Oppenheimer

University of California, Berkeley

About the talk:
 
Analyzing and controlling large distributed services under a wide range of conditions is difficult. Yet these capabilities are essential to a number of important development and operational tasks such as benchmarking, testing, and system management. To facilitate these tasks, we have built the Application Control and Monitoring Environment (ACME), a scalable, flexible infrastructure for monitoring, analyzing, and controlling Internet-scale systems. ACME consists of two parts. ISING, the Internet Sensor In-Network agGregator, queries "sensors" and aggregates the results as they are routed through an overlay network. ENTRIE, the ENgine for TRiggering Internet Events, uses the data streams supplied by ISING, in combination with a user's XML configuration file, to trigger "actuators" such as killing processes during a robustness benchmark or paging a system administrator when predefined anomalous conditions are observed.

We find that for a 512-node system running atop an emulated Internet topology, ISING's use of in-network aggregation can reduce end-to-end query-response latency by more than 50% compared to using either direct network connections or the same overlay network without aggregation. We also find that an untuned implementation of ACME can invoke an actuator on one or all nodes in response to a discrete or aggregate event in less than four seconds, and we illustrate ACME's applicability to concrete benchmarking and monitoring scenarios.
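The latency savings from in-network aggregation come from combining sensor readings at each interior node of the overlay tree, so each node forwards one aggregated value instead of relaying every individual response to the root. The following is a minimal illustrative sketch of that idea (not ACME's or ISING's actual code; the `Node` class and `aggregate` function are hypothetical names chosen for this example):

```python
# Hypothetical sketch of in-network aggregation over an overlay tree.
# Each interior node folds its children's already-aggregated results
# into its own reading, so the root receives one value per child
# rather than one response per sensor.

class Node:
    def __init__(self, reading, children=None):
        self.reading = reading          # local sensor value, e.g. CPU load
        self.children = children or []  # overlay children of this node

def aggregate(node, op=max):
    """Fold readings up the tree: combine this node's reading with the
    aggregated results returned by each child subtree."""
    return op([node.reading] + [aggregate(c, op) for c in node.children])

# A tiny 7-node tree: the root handles 2 aggregated responses, not 6.
left = Node(0.5, [Node(0.2), Node(0.9)])
right = Node(0.7, [Node(0.4), Node(0.1)])
root = Node(0.3, [left, right])
print(aggregate(root))  # -> 0.9
```

With an associative operator such as max, sum, or count, the result is identical to querying every sensor directly, but the message load and response fan-in at the root drop from the number of sensors to the number of overlay children.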

About the speaker:
 
David Oppenheimer is a Ph.D. candidate in computer science at the University of California, Berkeley. For his thesis he is developing and applying a framework for management, robustness testing, and benchmarking of large-scale, geographically distributed systems. He received a BSE in electrical engineering from Princeton University in 1996 and an MS in computer science from UC Berkeley in 2002. His research is part of the UC Berkeley ROC project, which develops design techniques for building highly dependable systems that place a minimal burden on human operators, including self-testing and fast online detection, diagnosis, and recovery from failures. This is also joint work with the UC Berkeley OceanStore project.