Prof. Kishor S. Trivedi, ECE Dept., Duke University, Durham, NC.
Tuesday June 9, 9:30-12:30
Title: Capacity Planning for Infrastructure-as-a-Service Cloud
Abstract: From an enterprise perspective, a key motivation for moving traditional IT management to the Cloud is reducing the cost of hosted services. In an Infrastructure-as-a-Service (IaaS) Cloud, virtual machine (VM) instances share the physical machines (PMs) in the provider’s data center. Increasing the number of PMs can lower downtime cost at the expense of higher infrastructure and other operational costs (e.g., power consumption and cooling costs). Hence, determining the optimal PM capacity that minimizes the overall cost is of interest. In this talk, we show how an optimization framework can be developed using stochastic availability and performance models of an IaaS Cloud. Specifically, we formulate and solve a cost minimization problem to address capacity planning in an IaaS Cloud: what is the optimal number of PMs that minimizes the total cost of ownership for given downtime and performance requirements set by service level agreements? We use simulated annealing, a well-known stochastic search algorithm, to solve the optimization model. For each point in the search space, we need to determine the performance, availability, and power consumption. Hence, we develop scalable analytic models for the performance, availability, and power consumption of an IaaS Cloud. The essence of our approach is to reduce the complexity of the analysis by dividing the overall model into multiple interacting stochastic process sub-models and then obtaining the overall solution by (fixed-point) iteration over the individual sub-model solutions.
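As a rough illustration of the search described in the abstract, the sketch below runs simulated annealing over the number of PMs. It is not the speaker's model: the cost function `total_cost`, its constants, and the cooling schedule are all hypothetical stand-ins chosen so the example is self-contained. In the talk's framework, each cost evaluation would instead come from the availability, performance, and power sub-models.

```python
import math
import random


def total_cost(n_pms, pm_cost=1.0, downtime_penalty=200.0, scale=10.0):
    """Toy total cost of ownership (hypothetical, for illustration only).

    Infrastructure cost grows linearly with the number of PMs, while the
    expected downtime cost decays as more PMs are provisioned.
    """
    infrastructure = pm_cost * n_pms
    downtime = downtime_penalty * math.exp(-n_pms / scale)
    return infrastructure + downtime


def anneal_pm_count(cost, lo=1, hi=100, start=1, steps=5000,
                    temp=10.0, cooling=0.995, seed=0):
    """Simulated annealing over an integer PM count in [lo, hi]."""
    rng = random.Random(seed)
    current = best = start
    for _ in range(steps):
        # Propose a neighbouring capacity: one PM more or one PM fewer.
        candidate = min(hi, max(lo, current + rng.choice((-1, 1))))
        delta = cost(candidate) - cost(current)
        # Always accept improvements; accept a worse point with
        # probability exp(-delta / temp) to escape local minima.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current = candidate
            if cost(current) < cost(best):
                best = current
        temp *= cooling  # geometric cooling schedule (an assumption)
    return best


if __name__ == "__main__":
    n = anneal_pm_count(total_cost)
    print(n, round(total_cost(n), 3))
```

With this toy cost, the trade-off is visible directly: too few PMs make the downtime term dominate, too many make the infrastructure term dominate, and the search settles between the two.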
Thursday June 11, 9:30-12:30
Title: Why Does Software Fail and What Should be Done About It?
Abstract: Most large-scale systems contain a significant amount of software. Several recent studies have established that most system outages are due to software faults. Traditional methods of fault avoidance, fault removal based on extensive testing/debugging, and fault tolerance based on design/data diversity are found wanting. The key challenge, then, is how to provide highly dependable software. We discuss a new view of fault tolerance for software-based systems. We classify software faults into Bohrbugs and Mandelbugs, and identify aging-related bugs as a subtype of the latter. Traditional methods have been designed to deal with Bohrbugs. The next challenge, then, is to develop mitigation methods for Mandelbugs in general and aging-related bugs in particular. We submit that mitigation methods for Mandelbugs should utilize environmental diversity. Restarting the application, failing over to an identical replica (hot, warm, or cold), and rebooting the OS are examples of mitigation techniques that rely on environmental diversity. We discuss environmental diversity from both experimental and analytic points of view. We also discuss software aging-related faults, for which it is possible to use a proactive environmental diversity technique known as software rejuvenation.
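The idea of environmental diversity can be sketched as retrying the same, unchanged code after changing the environment it runs in, escalating from cheap to expensive recovery actions. The sketch below is only an illustration under stated assumptions: `run_with_recovery`, the simulated `env` dictionary, and the three recovery functions are hypothetical names, not part of any system discussed in the talk.

```python
from typing import Callable, Dict, List, Tuple


def run_with_recovery(operation: Callable[[], str],
                      recovery_actions: List[Tuple[str, Callable[[], None]]]):
    """Run `operation`; after each failure apply the next recovery action
    and retry the same code in the changed environment."""
    applied: List[str] = []
    try:
        return operation(), applied
    except RuntimeError:
        pass
    for name, action in recovery_actions:
        action()              # change the environment, not the code
        applied.append(name)
        try:
            return operation(), applied
        except RuntimeError:
            continue          # escalate to the next, costlier action
    raise RuntimeError("all recovery actions exhausted")


def demo():
    # Simulated environment: the failure depends on state, not on the input,
    # which is the behaviour the abstract attributes to Mandelbugs.
    env: Dict[str, bool] = {"stale_replica_state": True}

    def operation() -> str:
        if env["stale_replica_state"]:
            raise RuntimeError("failure triggered by environment state")
        return "ok"

    def restart_application() -> None:
        pass  # in this simulation a restart does not clear the bad state

    def failover_to_replica() -> None:
        env["stale_replica_state"] = False  # the replica has clean state

    def reboot_os() -> None:
        env["stale_replica_state"] = False

    actions = [
        ("restart application", restart_application),
        ("failover to replica", failover_to_replica),
        ("reboot OS", reboot_os),
    ]
    return run_with_recovery(operation, actions)


if __name__ == "__main__":
    print(demo())
```

The actions are ordered by cost, so the more disruptive ones (failover, reboot) are attempted only when a cheaper one has not cleared the condition. Software rejuvenation would apply such actions proactively, before a failure is observed.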