Using Application-Domain Knowledge in the Runtime Support of Multi-Experiment Computational Studies
Candidate: Siu-Man Yau
Advisors: Vijay Karamcheti and Denis Zorin

Abstract

A Multi-Experiment Study (MES) is a type of computational study in which the same simulation software is executed multiple times, and the results of all executions need to be aggregated to obtain useful insight. As computational simulation experiments become increasingly accepted as part of the scientific process, the use of MESs is becoming more widespread among scientists and engineers.

MESs place several challenging requirements on the computing system. First, many MESs need constant user monitoring and feedback, requiring simultaneous steering of multiple executions of the simulation code. Second, MESs can comprise many executions of long-running simulations; the sheer volume of computation can make them prohibitively long to run.

Parallel architectures offer an attractive computing platform for MESs. Low-cost, small-scale desktops employing multi-core chips allow widespread, dedicated local access to parallel computation power, offering more research groups an opportunity to achieve interactive MESs. Massively parallel, high-performance computing clusters can afford a level of parallelism never seen before, and present an opportunity to address the problem of computationally intensive MESs.

However, in order to fully leverage the benefits of parallel architectures, the traditional parallel systems' view has to be augmented. Existing parallel computing systems often treat each execution of the software as a black box, and are prevented from viewing an entire computational study as a single entity that must be optimized for.

This dissertation investigates how a parallel system can view MESs as an end-to-end system and leverage the application-specific properties of MESs to address their requirements. In particular, the system can 1) adapt its scheduling decisions to the overall goal of an MES to reduce the needed computation, 2) simultaneously aggregate results from, and disseminate user actions to, multiple executions of the software to enable simultaneous steering, 3) store reusable information across executions of the simulation software to reduce individual run-time, and 4) adapt its resource allocation policies to the MES's properties to improve resource utilization.

Using a test bed system called SimX and four example MESs across different disciplines, this dissertation shows that the application-aware, MES-level approach can achieve improvements ranging from multi-fold to multiple orders of magnitude over the traditional simulation-level approach.