About the talk:
A significant fraction of software failures in large-scale Internet
systems are cured by rebooting, even when the exact failure causes are
unknown. However, rebooting can be expensive, causing nontrivial service
disruption or downtime even when clusters and failover are employed.
In this work we separate process recovery from data recovery to enable
microrebooting -- a fine-grain technique for surgically recovering faulty
application components, without disturbing the rest of the application.
We evaluate microrebooting in an Internet auction system running on an application server. Microreboots recover most of the same failures as full reboots, but do so an order of magnitude faster and result in an order of magnitude savings in lost work. This cheap form of recovery engenders a new approach to high availability: microreboots can be employed at the slightest hint of failure, prior to node failover in multi-node clusters, even when mistakes in failure detection are likely; failure and recovery can be masked from end users through transparent call-level retries; and systems can be rejuvenated by parts, without ever being shut down.
About the speaker:
George Candea is in the final year of his Ph.D. in the Software Infrastructures Group here at Stanford, where he has been working on bringing higher availability to complex Internet services. He recently completed a 5-year stint at Oracle Corp., focused on scalability and high availability in the database server. Before that, he hacked on kernels and dabbled in mobile computing while at MIT, IBM, and Microsoft.