Fault-tolerant Parallel Processing Combining Linda,
Checkpointing, and Transactions
4:00 p.m., Thursday, September 28, 1995
Warren Weaver Hall #402
With the advent of high performance workstations and fast LANs, networks of workstations have recently emerged as a promising computing platform for long-running coarse grain parallel applications. Their advantages are wide availability and cost-effectiveness, as compared to massively parallel computers. Long-running computation in the workstation environment, however, requires both fault tolerance and the effective utilization of idle workstations.
In this dissertation, we present a variant of Linda, called Persistent Linda (PLinda), that treats these two issues uniformly: specifically, PLinda treats non-idleness as failure.
PLinda provides a combination of checkpointing and transaction support on both data and program state (an encoding of continuations). The traditional transaction model is simplified and then extended to support robust parallel computation. Treatable failures include processor and main memory hard and slowdown failures, and network omission and corruption failures.
The programmer can customize fault tolerance when constructing an application, trading failure-free performance against recovery time. When creating a PLinda program, the programmer can decide on the frequency of transactions and the encoding of continuations to be saved upon transaction commit. At runtime, the programmer can decide to suppress certain continuations for better failure-free performance.
PLinda has been applied to corporate bond index statistics computation and biological pattern recognition.