Distributed Computing Systems Survey

This paper surveys software-only distributed computing systems with some or all of the following features:

Scorecard

modern-language based
secure
scaleable
fault tolerant (esp. in master/slave paradigm)
dynamically load balanced
distributed computation
expresses task-parallel concurrency naturally
language integrates synchronization primitives
supports heterogeneous networks of workstations

We seek systems that implement as many of these features as possible, and indicate how well each system measures up.

This paper does not consider actual performance measurements of particular implementations, as these are vague and misleading and often reveal only constant factors in speedup. Instead, I point out the design decisions that may affect system scaleability, where applicable.

The projects are arranged in alphabetical order. Each project is marked with the originating institution and an estimate of the starting year and duration. Some descriptive material is quoted directly (sometimes with some massaging) from each project to make browsing easier, and my impressions follow.

Please note that this page is still under construction. Entries which include a Scorecard are considered complete, and it is concerning these that the author will be grateful for feedback and suggestions.



  1. BSP/Many Languages/Many Platforms, Harvard University, 1990-1995
  2. H-BSP is a general purpose parallel computing system, currently under development at Harvard University, which proposes a solution to the current parallel software crisis based on the bridging model approach to developing transportable (i.e. efficient, architecture independent, and scalable) parallel software. HPCC funding enables the H-BSP project to be realized. There are a number of components in the project: a programming language, a compiler, a runtime system, and numerous applications. We believe that the bridging model approach together with these components will make general purpose parallel computing available to a much wider range of people than is presently the case.

    Scorecard

    modern-language based
    secure
    scaleable
    fault tolerant
    dynamically load balanced
    distributed computation
    expresses task-parallel concurrency naturally
    language integrates synchronization primitives
    supports heterogeneous networks of workstations

  3. C4/C++/MPI, University of Texas at Austin, 1994-1995
  4. The Canonical Classes for Concurrency Control (C4) is a library of C++ classes which assist in parallel programming. C4 provides objects which implement a variety of synchronization and data transmission paradigms. Some of these exploit aspects of the C++ language to provide specific high-level control semantics. Others export an abstracted interface to some high-level construct, allowing the user to specialize it for use with his own classes.

    C4 allows the C++ programmer to express parallel programs using high-level abstractions for a variety of useful operations. Instead of having to explicitly code every single interaction between processing nodes with message passing calls, some important patterns of communication can be handled through the use of provided C++ objects. Furthermore, C4 carries forward some of the ideas of C++ relating to strict type matching into a parallel context.

    The primary emphasis in C4 is on elevating the abstraction level of parallel programming through the use of object-oriented programming methods. C4 is not an "API wrapper" per se; although it does contain some global functions which ease the use of message-passing APIs, this is not the primary focus.

    Some examples of specific classes provided by C4 are:

    • Spinlocks (various forms)
    • NodeInfo (multicomputer configuration info)
    • Bswap (typesafe buffer swap - uses templates)
    • DeepObj
    • DeepBSwap (transmit objects with complex internals)
    • Baton (value passing spinlock - uses templates)

    The programming model for C4 is message passing, and it currently supports both Intel's NX and the increasingly popular MPI.

    This system was created by two graduate students for the purpose of structuring message passing calls within an object framework using unadulterated C++. Use of this framework significantly eases the intellectual overhead normally associated with message-passing code. For example, a message buffer is created with correct communication semantics already built in. This paradigm appears to offer some of the benefits for which we employ shared-memory programming.
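
    To make the flavor concrete, here is a minimal sketch (not the real C4 interface, which is not reproduced in this survey) of how a typesafe buffer swap in the spirit of the Bswap template might be layered over MPI; only the class name comes from the C4 class list above, while the body and the swap_with() method are invented for illustration.

        // Sketch only: Bswap is a C4 class name, but this body and the
        // swap_with() method are invented to illustrate the idea of a
        // typesafe buffer swap layered over MPI.
        #include <mpi.h>
        #include <vector>

        template <class T>
        class Bswap {
        public:
            explicit Bswap(std::vector<T>& buf) : buf_(buf) {}

            // Exchange buffer contents with a peer rank in a single call; the
            // template parameter fixes the element type, so sender and receiver
            // cannot disagree about what the buffer holds.
            void swap_with(int peer) {
                std::vector<T> incoming(buf_.size());
                MPI_Sendrecv(buf_.data(), (int)(buf_.size() * sizeof(T)), MPI_BYTE, peer, 0,
                             incoming.data(), (int)(incoming.size() * sizeof(T)), MPI_BYTE, peer, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                buf_.swap(incoming);
            }

        private:
            std::vector<T>& buf_;
        };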

    Fault-tolerance and load-balancing are not addressed at the system level. However, in this era of language-trashing, this example is important in demonstrating what can be done without departing from the C++ definition.

    Scorecard

    modern-language based
    secure
    scaleable
    fault tolerant
    dynamically load balanced
    distributed computation
    expresses task-parallel concurrency naturally
    language integrates synchronization primitives
    supports heterogeneous networks of workstations


  5. Calypso/modified C++/Calypso Runtime, New York University and Arizona State University, 1994-1995
  6. Calypso is an execution environment for running parallel programs in a distributed environment. It is a unique system that automatically provides both fault-tolerance and dynamic load balancing based on its novel design. Furthermore, application programmers are provided with a clean programming interface which separates parallel programming and the execution environment. Calypso involves no kernel modifications and currently runs on SunOS 4.1.3, with ports to Solaris 2.4, Linux and Windows-NT underway.

    Calypso's rationale is based on the need for a highly reliable, efficient and scalable programming model which can effectively utilize the power of networks of workstations (NOWs). Calypso presents a novel technical approach for building such a model, and as the power of both hardware and software components of NOWs increases, we believe the need for systems such as ours will grow dramatically.

    Parallel programs for the Calypso system are written using a slightly augmented C++ called CSL. Its syntax is C++ augmented with a few keywords, namely shared, parbegin, parend and routine. Once a CSL program is written, our preprocessor converts it into a standard C++ program, which is then compiled and linked with Calypso libraries.

    Calypso uses the CSL language, which starts with the C++ standard and adds the obvious `parbegin/parend' parallel statement annotations, including dynamic replication. (A special syntax is available to indicate a condition for premature termination of a parallel step.) A preprocessor churns this into legal (& legible!) C++ (including proper source-file citations for debugging). Calypso implements automatic fault-tolerance and load balancing, but with no attempt at checkpointing and recovering lagging processes; rather, `clone' processes are eagerly scheduled on any available machines. Results are registered from exactly one of the clones; this is how idempotency is maintained. The other clones are thereupon mercilessly terminated. This approach is particularly efficient in not costing extra overhead for fault-tolerance when no faults are present.

    Communication between processes takes place via shared memory, which is declared in global scope. This shared memory is made consistent at the implicit barrier following a `parend'; any synchronization within a step must be programmed explicitly. As a programming convenience, stack variables that are declared in a scope enclosing a parallel step, and directly referenced inside it, are copied into and/or out of the spawned tasks of the parallel step as needed. This does not hold for arrays, which are by definition referenced indirectly through pointers; arrays may be shared only when declared as such in global scope.
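
    As a rough sketch of what a CSL parallel step might look like, using only the keywords named above (shared, parbegin, parend, routine): the exact CSL syntax, in particular how a task learns its own index, is not given here, so the $id notation below is an assumption.

        // Sketch in the spirit of CSL; keyword spellings are from the Calypso
        // description above, the task-index notation ($id) is assumed.
        shared int result[100];            // shared memory, declared in global scope

        int compute(int i) { return i * i; }

        int main()
        {
            parbegin                       // open a parallel step
                routine[100]               // 100 logically parallel tasks
                {
                    result[$id] = compute($id);   // $id: this task's index (assumed notation)
                }
            parend                         // implicit barrier; shared memory made consistent

            // Exactly one clone's writes per task have been registered, so
            // result[] is well defined even if extra clones were scheduled.
            return 0;
        }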

    Scorecard

    modern-language based
    secure
    scaleable
    fault tolerant
    dynamically load balanced
    distributed computation
    expresses task-parallel concurrency naturally
    language integrates synchronization primitives
    supports heterogeneous networks of workstations


  7. CC++/modified C++/Nexus-PORTS, California Institute of Technology, 1992-1995
  8. Compositional C++ (CC++) was designed to alleviate the frustrations of parallel programming by adding a few simple extensions to the sequential language C++. It is a strict superset of the C++ language, so any valid C or C++ program is a valid CC++ program. Conversely, many classes of parallel CC++ programs can be simply rewritten as equivalent sequential programs. For these classes of programs, the development path can be very similar to that of the equivalent sequential program. This compatibility with the C and C++ languages facilitates the transition to the task of parallel programming for users knowledgeable in those languages.

    CC++ extends C++ with the following eight constructs:

    • par blocks enclose statements that are executed in parallel.
    • parfor denotes a loop whose iterations are executed in parallel.
    • spawn statements create a new thread of control executing in parallel with the spawning thread.
    • sync data items are used for synchronization.
    • atomic functions control the level of interleaving of actions composed in parallel.
    • processor objects define the distribution of a computation.
    • global pointers link distributed parts of the computation.
    • data transfer functions describe how information is transmitted between address spaces.

    Despite the simplicity and the small number of extensions, the conjunction of these constructs, when combined with C++, results in an extremely rich and powerful parallel programming notation.
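
    A small sketch of several of these constructs in use follows; the syntax is rendered here from memory of the CC++ literature and may differ in detail from the released compiler.

        // Sketch of CC++ constructs: par, parfor, spawn and sync variables.
        #include <stdio.h>

        sync int done;                        // readers of a sync variable block until it is defined

        void worker(int id, sync int *flag)
        {
            printf("worker %d running\n", id);
            if (id == 0) *flag = 1;           // defining the sync variable releases its readers
        }

        int main()
        {
            par {                             // the enclosed statements execute in parallel
                printf("left branch\n");
                printf("right branch\n");
            }                                 // implicit barrier at the end of the par block

            parfor (int i = 0; i < 4; i++)    // iterations execute in parallel
                worker(i, &done);

            spawn worker(99, &done);          // no implicit barrier with the spawning thread

            int v = done;                     // blocks until worker 0 has defined `done'
            printf("done = %d\n", v);
            return 0;
        }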

    The richness of this language is reflected in its suitability to a wide spectrum of applications. In fact, CC++ integrates many seemingly disparate fields:

    • Sequential and parallel programming - Parallel blocks are notationally similar to sequential blocks.
    • Shared and distributed memory models - CC++ can be used on shared or distributed memory architectures as well as across networks which may be heterogeneous.
    • Granularity - CC++ can be used in computations involving a variety of granularities, ranging from fine-grain parallelism with frequent synchronization between small threads to coarse-grain parallelism with sparse synchronization between large, distributed processes.
    • Task and data parallelism - Task and data parallel applications can be expressed in CC++, as well as programs that combine the two.
    • Synchronization techniques - The synchronization mechanisms provided by CC++ are powerful enough to express any of the traditional imperative synchronization and communication paradigms.

    All of the object-oriented features of C++ are preserved in CC++. These features (especially generic classes and inheritance mechanisms) promote the reuse of code, structure, and verification arguments. Thus, CC++ provides an excellent framework for the development of libraries and for the use of software templates in program construction. This reuse is especially important and useful in the context of parallel programming.

    The CC++ parfor is a slight generalization of the Calypso `routine[...]' notation, in that processor ids are customizable instead of sequentially proceeding from 0. Also, the unsynchronized `spawn' statement is useful when a certain activity need not be subject to an implicit barrier with companion tasks.

  9. Charm/modified C++/Converse-PORTS, University of Illinois at Urbana-Champaign, 1991-1995
  10. CHARM is a machine independent parallel programming system. Programs written using this system will run unchanged on MIMD machines with or without a shared memory. It provides high-level mechanisms and strategies to facilitate the task of developing even highly complex parallel applications.

    Charm programs are written in C with a few syntactic extensions... Charm++ is the C++-based parallel object-oriented language having all the features of Charm; it supports multiple inheritance, late binding, and polymorphism.

    The ... system is based on Efficient Portability, Dynamic Load Balancing, and these other fine features:

    • Latency Tolerance: Latency of communication - the idea that remote data will take longer to access - is a significant issue common across most MIMD platforms. Message-driven execution, supported in CHARM , is a very useful mechanism for tolerating or hiding this latency. In message driven execution (which is distinct from just message-passing), a processor is allocated to a process only when a message for the process is received. This means when a process blocks, waiting for a message, another process may execute on the processor. It also means that a single process may block for any number of distinct messages, and will be awakened when any of these messages arrive. Thus, it forms an effective way of scheduling a processor in the presence of potentially large latencies.
    • Specific Information Sharing Modes: A major activity in a parallel computation is creation and sharing of information. Information is shared in many specific modes. The system provides six information sharing modes, each of which may be implemented differently and efficiently on different machines.
    • Reuse and Modularity: It should be possible to develop parallel software by reusing existing parallel software. CHARM supports this with a well-developed ``module'' construct and associated mechanisms. These mechanisms allow for compositionality of modules without sacrificing the latency-tolerance. With them, two modules, each spread over hundreds of processors, may exchange data in a distributed fashion.
    • The Programming Model: Programs consist of potentially medium-grained processes (called chares), and a special type of replicated process. These processes interact with each other via messages and any of the other information-sharing abstractions. There may be thousands of medium-grained processes on each processor, or just a few, depending on the application. The ``replicated processes'' can also be used for implementing novel information sharing abstractions, distributed data structures, and intermodule interfaces. The system can be considered a concurrent object-oriented system with a clear separation between sequential and parallel objects.
    • Reusable Libraries: The modularity-related features make the system very attractive for building library modules that are highly reusable because they can be used in a variety of data-distributions. We have just begun the process of building such libraries, and have a small collection of library modules. However, we expect such libraries, contributed by us and other users, to be one of the most significant aspects of the system.
    • Regular and Irregular Computations: For regular computations, the system is useful because it provides portability, static load balancing, and latency tolerance via message driven execution, and facilitates construction and flexible reuse of libraries. The system is unique for the extensive support it provides for highly irregular computations. This includes management of many medium-grained processes, support for prioritization, dynamic load balancing strategies, handling of dynamic data-structures such as lists and graphs, etc. The specific information sharing modes are especially useful for such computations.
    • Associated Tools:
      • DagTool allows specification of dependences between messages and sub-computations within a single process, provides a pictorial view of this dependence graph, and simplifies management of message-driven execution.
      • Projections is a performance visualization and feedback tool. Projections has a much more refined understanding of user computation than is possible in traditional tools because it is language-specific. Thus it can provide much specific feedback about entities in the user program such as objects and messages. It also incorporates a unique expert performance analysis component which can provide recommendations for improving performance.
    The execution of a Charm program begins at the CharmInit entry point of the main chare. The CharmInit entry point is used to read in input, to create new chares, and to initialize branch office chares and specifically shared variables, e.g., read only variables, accumulators, monotonics and dynamic tables. The parallel program execution begins (conceptually) after this entry point is executed.

    When a CHARM++ program is run on a parallel machine, invocation of an entry point on a remote processor requires an encoded version of the entry point name to be sent across address spaces, since function pointers are not valid in general on a different processor. This encoding to entry point ids cannot be done at compile time because each separately compiled file needs to have non-overlapping ids. In the absence of a custom linker, the CHARM++ translator generates code to do this encoding during initialization at run time.

    The Charm system introduces the unique concept of message-driven execution.

    Different classes of shared memory are accessed with different functions, possibly a preprocessor quirk. Also, sections of some modules currently need to be defined in a particular order. Some functions return illegal values if called outside of the CharmInit entry point. Also, extern "C" functions can't be declared in the local declaration section of chares.

    The charmc preprocessor is clunky and will not accept standard system header files.

    There are 6 data-sharing modes: Read Only Variables and Messages, and Accumulator, Monotonic, WriteOnce and Dynamic Variables. The system lends itself to computation by phases with quiescence detection. There are special memory allocation and deallocation routines. Differently-moded memory is accessed by specialized functions (awkward).

    Chares do not migrate between processors. To prevent unnecessary marshalling and unmarshalling in the local message passing case, the user installs packing and unpacking handlers which Charm calls when needed.

    Load balancing strategies include: random, acwn, manager, token, receiver. Custom load balancing is achieved by implementing the exported load balancing interface: Ldb_NewMsg_FromLocal(void *msg), Ldb_NewMsg_FromNet(void *msg), LdbFillLDB(LDB_ELEMENT*ldbptr), LdbStripLDB(LDB_ELEMENT*ldbptr), LdbProcessorIdle(), LdbPeriodicCheckInit(), LdbInit(), and LdbProcessMsg(void *msgPtr, void *localdataPtr). Helpful system functions in this endeavor are: McTotalNumPe(), McMyPeNum(), McNumNeighbours(), McGetNodeNeighbours(), QsMyLoad(), Qs_EnQMsg(), and Qs_DeQMsg(). Note that load information is piggy-backed on every message.
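
    A skeleton of such a custom strategy is sketched below; only the function names and signatures come from the interface listed above, while the bodies, the THRESHOLD constant and the argument passed to Qs_EnQMsg() are illustrative guesses rather than any of the shipped strategies.

        /* Illustrative skeleton of a custom Charm load-balancing strategy. */
        #define THRESHOLD 8                       /* hypothetical local-load limit */

        void LdbInit()                            /* per-processor setup */
        {
            LdbPeriodicCheckInit();               /* arrange periodic load checks */
        }

        void Ldb_NewMsg_FromLocal(void *msg)
        {
            if (QsMyLoad() < THRESHOLD)
                Qs_EnQMsg(msg);                   /* keep the work here (argument usage assumed) */
            else
                ;                                 /* ...else forward msg to a neighbour chosen
                                                     via McGetNodeNeighbours() */
        }

        void Ldb_NewMsg_FromNet(void *msg)
        {
            Qs_EnQMsg(msg);                       /* accept work arriving from the network */
        }

        void LdbProcessorIdle()
        {
            /* a receiver-initiated strategy would request work from a neighbour
               here, guided by the load information piggy-backed on messages */
        }

        void LdbFillLDB(LDB_ELEMENT *ldbptr)  { /* attach local load to an outgoing message */ }
        void LdbStripLDB(LDB_ELEMENT *ldbptr) { /* record the sender's load on arrival */ }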

    Local chare queueing strategies include: stack, fifo, fl (fifo-lifo), bstack, bfifo, istack and ififo.

    The Dagger system of chares adds guarded `when' execution (active messaging on conditional expressions). The Branch Office Chare is one which has a representative on every system node. The keyword thishandle holds the handle of the chare whose public function or entry point is currently executing. mainhandle is the handle of the main chare. Function addresses may be converted to global handles & remotely called with: FunctionRefType CFunctionNameToRef( function-name ) and CFunctionRefToName( (FunctionRefType)function-ref ).

    By default, Charm organizes all processors into a logarithmic-height spanning tree. This is useful for implementing global operations with branched chares.

    Scorecard

    modern-language based
    secure
    scaleable
    fault tolerant
    dynamically load balanced
    distributed computation
    expresses task-parallel concurrency naturally
    language integrates synchronization primitives
    supports heterogeneous networks of workstations


  11. Concert/C, IBM, 1991-1994
  12. Concert/C supports distributed C programming. Programmers write a program's local logic in C, and its distributed logic in Concert/C's powerful operations. Concert/C applications are composed of communicating processes which are executing sequential C programs. The language provides primitives to create and terminate processes, and communicate between them. The programmer explicitly expresses parallelization and distribution.
    Communication type-safety is a fundamental design element in this language. The language is implemented via a preprocessor which translates the code to standard C.

  13. Condor, University of Wisconsin - Madison, 1988-1995
  14. Condor is a distributed resource management system that can manage large heterogeneous clusters of workstations. Its design has been motivated by the needs of users who would like to use the unutilized capacity of such clusters for their long-running, computation-intensive jobs.

    Condor pays special attention to the needs of the interactive user of the workstation. It is the interactive user who defines the conditions under which the workstation can be allocated by Condor to a batch user. Condor preserves a large measure of the originating machine's environment on the execution machine, even if the originating and execution machines do not share a common file and/or password system. Condor jobs that consist of a single process are automatically checkpointed and migrated between workstations as needed to ensure eventual completion.

    Due to the limitations of the remote execution and checkpointing mechanisms, there are several restrictions on the type of program which can be executed by the Condor facility. Most importantly, only single-process jobs are supported (i.e. the fork() and exec() system calls are not supported). Secondly, signal and IPC calls are not implemented (e.g. signal(), kill(), and socket() calls will fail). Finally, the checkpointing mechanism will only work correctly if all file operations are idempotent - read-only and write-only file access works correctly, but programs which both read and write the same file may become confused after a checkpoint.

    There are also some practical limitations regarding use of disk space for execution of condor jobs. During the time when a job is waiting for a workstation to become available, it is stored as a "checkpoint file" on the machine from which the job was submitted. The file contains all the text, data, and stack space for the process, along with some additional control information. This means jobs which use a very large virtual address space will generate very large checkpoint files. Some advantage can be gained by submitting multiple jobs which share the same executable as a single unit or "job cluster". This is because all jobs in a cluster share a single copy of the checkpoint file until they begin execution. See condor(1) for details on how to submit multiple jobs as a cluster. Also the workstations on which the jobs will actually execute often have very little free disk. Thus it is not always possible to transfer a condor job to a machine, even though that machine is idle. Since large virtual memory jobs must wait for a machine that is both idle, and has a sufficient amount of free disk space, such jobs may suffer long turnaround times.

    The Condor System is not a multiprocessing system per se, but does illustrate a simple working prototype of checkpointing & migration which actually runs under Linux.

  15. GLU/Lucid/Eduction Engine, SRI International, 1994-1995
  16. GLU (Granular Lucid) is a very high-level programming system for constructing parallel and distributed applications to run on diverse high-performance computing systems. GLU is based on a hybrid programming model that combines intensional (Lucid) and imperative models. GLU not only enables rapid parallel program development using existing sequential code, it results in programs that are portable, efficient, and adaptive.
    The GLU system is based on Lucid, which is compiled by gc, the GLU compiler, and linked with imperative native code.

    The Lucid side of the system functionally composes the modular imperative code into a complete distributed system. Work is executed in a master/slave paradigm. The master executes from a different executable than the regular worker process. Garbage collection clears out saved computation results which haven't been used in a while.

    Imperative procedures may be declared local or remote (the default). Procedures executing in different address spaces may not share global data (there's no DSM). The model is based on the idempotency of individual imperative slaves.

    Scorecard

    modern-language based
    secure
    scaleable
    fault tolerant
    dynamically load balanced
    distributed computation
    expresses task-parallel concurrency naturally
    language integrates synchronization primitives
    supports heterogeneous networks of workstations


  17. Java/modified C++/Java Interpreter, Sun Microsystems, 1992-1995
  18. The object-oriented facilities of Java are essentially those of C++, with extensions from Objective C for more dynamic method resolution... Java omits many rarely used, poorly understood, confusing features of C++ that in our experience bring more grief than benefit. These omitted features primarily consist of operator overloading (although the Java language does have method overloading), multiple inheritance, and extensive automatic coercions.

    We added automatic garbage collection, thereby simplifying the task of Java programming but making the system somewhat more complicated.

    The single biggest difference between Java and C/C++ is that Java has a pointer model that eliminates the possibility of overwriting memory and corrupting data. Instead of pointer arithmetic, Java has true arrays. This allows subscript checking to be performed.

    Java is intended to be used in networked/distributed environments. Toward that end, a lot of emphasis has been placed on security. Java enables the construction of virus-free, tamper-free systems. The authentication techniques are based on public-key encryption.

    To enable a Java application to execute anywhere on the network, the compiler generates an architecture neutral object file format -- the compiled code is executable on many processors, given the presence of the Java runtime system.

    Java has a sophisticated set of synchronization primitives that are based on the widely used monitor and condition variable paradigm that was introduced by C.A.R.Hoare. By integrating these concepts into the language they become much easier to use and are more robust. Much of the style of this integration came from Xerox's Cedar/Mesa system.

    The declaration syntax interfaceName variableName declares a variable or parameter to be an instance of some class that implements interfaceName. Interfaces behave exactly as classes when used as a type. This lets the programmer specify that an object must implement a given interface, without having to know the exact type or inheritance of that object. Using interfaces makes it unnecessary to force related classes to share a common abstract superclass or to add methods to Object.

    The Java language is based on C++ and is compiled by javac, the Java compiler.

    As in Concert/C, where C was pared down to a `safe' subset, here C++ has been purged of dangerous features. There is no preprocessor, so there are no header files or macros. Instead there are class interfaces. There are no stray variables or functions; everything is contained in a class. There are no structures or unions, and no multiple inheritance. Descendant classes are said to ``extend'' their parents. There is no operator overloading or automatic conversions.

    Any class can be executed if it defines a main() function. A class may also be designed to implement the Runnable interface or extend the Thread class. This involves defining a run() method which actually does the work. The instantiated Thread class begins to execute with a call to start() and ceases with a call to stop().

    Threads may be built into an elaborate group hierarchy. The virtual machine runs until all Threads that are not daemon Threads have died. A Thread dies when its run() method returns, or when the stop() method is called.

    Monitors can be associated with code blocks as well as with methods. The superclass can be explicitly referenced with the super variable. Several new variable and method modifiers are abstract, final, native, threadsafe and transient.

    As an example of how robust the runtime system is, casting one class to a subclass generates a runtime check that this is really an instance of the lower class; if not, the runtime throws the ClassCastException. Incidentally, this can be checked programmatically by the new instanceOf binary operator.

    Interfaces address the problems multiple inheritance was meant to solve. They provide encapsulation of method protocols without restricting the implementation to one inheritance tree.

    It should be pointed out that a language which doesn't support pointers is severely handicapped; many important hierarchical data structures simply cannot be created in this model. Also note that the native multitasking capabilities are intended to execute entirely within a single machine. Inter-machine cooperation would require starting up the Java interpreter on the other machine and communicating by message passing.

    Scorecard

    modern-language based
    secure
    scaleable
    fault tolerant
    dynamically load balanced
    distributed computation
    expresses task-parallel concurrency naturally
    language integrates synchronization primitives
    supports heterogeneous networks of workstations


  19. Mentat/modified C++/Macro Data Flow Engine, University of Virginia, 1993-1995
  20. Mentat is an object-oriented parallel processing system designed to directly address the difficulty of developing architecture-independent parallel programs. The fundamental objectives of Mentat are to (1) provide easy-to-use parallelism, (2) achieve high performance via parallel execution, and (3) facilitate the execution of applications across a wide range of platforms. The Mentat approach exploits the object-oriented paradigm to provide high-level abstractions that mask the complex aspects of parallel programming, including communication, synchronization, and scheduling, from the programmer. Instead of managing these details, the programmer concentrates on the application. The programmer uses application domain knowledge to specify those object classes that are of sufficient computational complexity to warrant parallel execution.

    Mentat combines a medium-grain, data-driven computation model with the object-oriented programming paradigm and provides automatic detection and management of data dependencies. The data-driven computation model supports high degrees of parallelism and a simple decentralized control, while the use of the object-oriented paradigm permits the hiding of much of the parallel environment from the programmer. Because Mentat uses a data-driven computation model, it is particularly well-suited for message passing, non-shared memory architectures.

    There are two primary aspects of Mentat: the Mentat Programming Language (MPL) and the Mentat run-time system. The MPL is an object-oriented programming language based on C++ that masks the difficulty of the parallel environment from the programmer. The granule of computation is the Mentat class member function. The programmer is responsible for identifying those object classes whose member functions are of sufficient computational complexity to allow efficient parallel execution. Instances of Mentat classes are used exactly like C++ classes, freeing the programmer to concentrate on the algorithm, not on managing the environment. The data and control dependencies between Mentat class instances involved in invocation, communication, and synchronization are automatically detected and managed by the compiler and run-time system without further programmer intervention. By splitting the responsibility between the compiler and the programmer, we exploit the strengths and avoid the weaknesses of each. The underlying assumption is that the programmer can make better decisions regarding granularity and partitioning, while the compiler can better manage synchronization. This simplifies the task of writing parallel programs, making parallel architectures more accessible to non-computer scientist researchers.

    The run-time system supports parallel object-oriented computing on top of a data-driven, message-passing model. It supports more than just method invocation by remote procedure call (RPC). Instead the run-time system supports a graph-based, data-driven computation model in which the invoker of an object member function need not wait for the result of the computation, or for that matter, ever receive a copy of the result. The run-time system constructs program graphs and manages communication and synchronization. Furthermore, the run-time system is portable across a wide variety of MIMD architectures and runs on top of the existing host operating system. The underlying operating system must provide process support and some form of inter-process communication.

    Mentat has been distributed to over 100 sites in the US and abroad, and has been successfully applied to a wide variety of real-world problems. We invite new users who wish to learn more about Mentat to write to us for documentation.

    An active Mentat network has an Instantiation Manager (im) and a Token Manager Unit (tmu) daemon on each host. These daemons coordinate the execution of an application by instantiating Mentat objects, scheduling where they will execute, and marshalling the arguments to and results from Mentat member functions so that they are matched to their appropriate destinations.

    From the abstract of a technical report on "Fault-Tolerance in Coarse Grain Data Flow", Aug. 28 1995.

    Wide-area parallel processing systems will soon be available to researchers to solve a range of problems. It is certain that host failures and other faults will be an every day occurrence in these systems. Unfortunately contemporary parallel processing systems were not designed with fault-tolerance as a design objective. The data-flow model, long a mainstay of parallel processing, offers hope. The model's functional nature, which makes it so amenable to parallel processing, also facilitates straightforward fault-tolerant implementations. It is the combination of ease of parallelization and fault-tolerance that we feel will increase the importance of the model in the future, and lead to the widespread use of functional components. Using Mentat, an object-oriented, data-flow based, parallel processing system, we demonstrate two orthogonal methods of providing application fault-tolerance. The first method provides transparent replication of actors and requires modification to the existing Mentat run-time system. Providing direct support for replicating actors enables the programmer to easily build fault-tolerant applications regardless of the complexity of their data-flow graph representation. The second method - the checkboard method - is applicable to applications that contain independent and restartable computations such as "bag of tasks", Monte Carlo's, and pipelines, and involves some simple restructuring of code. While these methods are presented separately, they could in fact be combined. For both methods, we present experimental data to illustrate the trade-offs between fault-tolerance, performance and resource consumption.

    From the abstract of a technical report on Braid: Integrating Task and Data Parallelism, November 11, 1994.

    Archetype data parallel or task parallel applications are well served by contemporary languages. However, for applications containing a balance of task and data parallelism the choice of language is less clear. While there are languages that enable both forms of parallelism, e.g., one can write data parallel programs using a task parallel language, there are few languages which support both. We present a set of data parallel extensions to the Mentat Programming Language (MPL) which allow us to integrate task parallelism, data parallelism, and nested task and data parallelism within a single language on top of a single run-time system. The result is an object-oriented language, Braid, that supports both task and data parallelism on MIMD machines. In addition, the data parallel extensions define a language in and of itself which makes a number of contributions to the data parallel programming style. These include subset-level operations (a more general notion of element-level operations), compiler provided iteration within a data parallel data set, and the ability to define complex data parallel operations.

    The Mentat language is based on C++ and is preprocessed by mplc, the Mentat preprocessor (a rather fragile implementation which does not support templates or exceptions), into actual C++ code.

    The system configuration file can be customized to specify workstation ``clusters'' and ``families'', as well as scheduling policies for LOCATION (RANDOM or ROUND_ROBIN), TRANSFER_LIMIT (how many hosts/clusters to examine when scheduling), and TRANSFER_POLICY (LOAD x or RUN_QUEUE x).

    Classes may be declared as regular mentat, persistent mentat (which contain local state which must be preserved between calls) or sequential persistent mentat (where execution order is guaranteed to be program order even with no data dependencies). mentat classes are generally heavy duty, either by virtue of high computation/communication cost or containment of some global system state. Instances are obtained with x.create() (which accepts COLOCATE, DISJOINT and HIGH_COMPUTATION_RATIO hints), with x.bind() (which associates the variable with an existing instance located within a search scope parameter), or by assignment. These methods are inherited from a base class that is logically a superclass for Mentat classes.

    In Mentat, argument semantics is always call-by-value, even for pointers or references, whose referents are copied to the callee (no DSM). If the pointer is to a variable-sized object, the object must implement a size_of() function that returns the object's size in bytes.

    Unlike traditional RPC systems, which execute synchronously, Mentat objects call each other asynchronously, thereby realizing substantial performance improvements. This is possible because the runtime system constructs program graphs and monitors all data dependencies, directing function return values precisely to where they'll be used next, and freely creates extra copies of regular mentat classes for added concurrency. This is in addition to any encapsulation of intra-object parallelism. (The default scheduling model is fully automatic; a programmer may achieve manual scheduling by making an object a persistent class and instantiating instances on the desired processors.)

    The rtf() function is used in Mentat to return values instead of the C++ return statement. This call does not signify the end of a function, but only that all computation to produce the result has been completed. (The function may still continue to update local state.) The runtime system may or may not send this result back to the caller, depending on where it is actually used next. In fact, because of the way Mentat provides remote values, it's possible that this return value may not yet be available (this could happen if no local computation was performed on the value).
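
    A loose MPL sketch of a persistent Mentat class returning its result via rtf() follows; the syntax is rendered from the description above and may differ in detail (for instance, whether rtf() takes the value or its address is not shown here).

        // Loose MPL sketch: a persistent mentat class with state preserved
        // between calls, returning results via rtf().
        persistent mentat class Accumulator {
            int total;                         // local state (public data members are forbidden)
        public:
            int add(int x);
        };

        int Accumulator::add(int x)
        {
            total += x;
            rtf(total);                        // make the result available to whoever consumes it;
                                               // the member function may keep updating local state
        }

        void use()
        {
            Accumulator acc;
            acc.create();                      // instantiate on some processor (placement hints optional)
            int sum = acc.add(5) + acc.add(7); // calls are issued asynchronously; the runtime's program
                                               // graph forwards the values to where they are consumed
        }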

    The mstreams class is available for stream I/O to and from the point-of-launch of the system.

    Due to the lack of shared memory, Mentat classes forbid static members, public data members and friend classes or functions. Furthermore, pointers to instances of classes will not necessarily point to the respective instance data.

    Synchronization primitives (if needed) include an extended model of Ada's select/accept.

    Macro Data Flow is coarser-grained than regular data flow. Potential result variables are tracked in an RV table, which allows optimizations. destroy() is also grandfathered into all Mentat classes.

    Scorecard

    modern-language based
    secure
    scaleable
    fault tolerant
    dynamically load balanced
    distributed computation
    expresses task-parallel concurrency naturally
    language integrates synchronization primitives
    supports heterogeneous networks of workstations


  21. Orca/Orca/Panda, Vrije Universiteit in Amsterdam, 1992-1995
  22. Orca is a language for parallel programming on distributed systems, based on the shared data-object model. This model is a simple and portable form of object-based distributed shared memory.

    Scorecard

    modern-language based
    secure
    scaleable
    fault tolerant
    dynamically load balanced
    distributed computation
    expresses task-parallel concurrency naturally
    language integrates synchronization primitives
    supports heterogeneous networks of workstations


  23. Phantom/Obliq/POSIX Threads, University of Dublin, Trinity College, 1994-1995
  24. Phantom is a new interpreted language designed to address some of the problems presented by large-scale, interactive, distributed applications such as distributed conferencing systems, multi-player games, and collaborative work tools. Phantom combines the distributed lexical scoping semantics of Obliq with a substantial language core. The language core is based on a safe, extended subset of Modula-3, and supports a number of modern programming features, including:
    • static structural-equivalence typing
    • objects
    • modules and interfaces
    • lightweight threads
    • exceptions
    • garbage collection
    • higher-order functions and lambda expressions
    • a keyword binding mechanism
    • dynamically sized lists and slice indexing notation
    • type-safe implicit declarations
    The Phantom interpreter is implemented entirely in ANSI C, and provides a binding to the Tk GUI toolkit.

    Phantom has similar goals to Java, but was developed independently. There are a number of differences between Phantom and Java. The basic goal of Java is similar to that of Phantom: provide a secure and portable language and runtime environment for developing dynamically-extensible, interactive applications for the Internet. The basic approach is similar as well: both Java and Phantom provide a safe language with a number of modern programming features, and support a secure mechanism for sending code across sites.

    However, there are a number of differences between Phantom and Java. Here is an unashamedly biased account of some of the differences:

    • As might be expected from a project with a three year head-start, the Java implementation is at a much more advanced stage of development than the Phantom implementation. In particular, Java has a wealth of standard libraries, working networking code, and a compelling application (HotJava). Phantom has none of these yet, though work is actively underway on the first two.
    • The Phantom interpreter will be freely available for commercial and non-commercial use.
    • Java is based on a safe subset of C++ and Objective C; Phantom is based on a safe subset of Modula-3.
    • The Java team don't believe in transparent distribution, so only provide low-level networking APIs. In contrast, Phantom provides transparent, language-level distribution of both code and data using the distributed lexical scoping semantics of Obliq.
    • Java can clearly be used to write disconnected agents: small programs which are transmitted to a user's site for local execution. However, Java doesn't support connected agents: programs which carry their network connections with them as they travel across sites. Phantom supports both models of code distribution, as a result of using Obliq's distributed semantics.
    • Java's mobile code mechanism is based on a runtime which dynamically obtains the code for classes from across the network as they are needed. Phantom provides a similar mechanism, but also gives the programmer flexible control over the code transmission mechanism through the use of higher-order functions.
    • The Java team are focusing their efforts on interactive applications for the Web; Phantom attempts to be more widely applicable to other application domains (such as distributed conferencing).

    Here is some information on the base language, Obliq:
    Obliq is a lexically-scoped untyped interpreted language that supports distributed object-oriented computation. An Obliq computation may involve multiple threads of control within an address space, multiple address spaces on a machine, heterogeneous machines over a local network, and multiple networks over the Internet. Obliq objects have state and are local to a site. Obliq computations can roam over the network, while maintaining network connections.

    Obliq computations, in the form of procedures or methods, can be freely transmitted over the network. Actual computations (closures, not source text) are transmitted; lexically scoped free identifiers retain their bindings to the originating sites. Through these free identifiers, migrating computations can maintain connections to objects and locations residing at various sites. Disconnected agents can be represented as procedures with no free identifiers; these agents do not need to rely on prolonged network connectivity.

    In order to concentrate on distributed computation issues and to reduce complexity, Obliq is designed as an untyped language. This decision leads to simpler and smaller language processors that can be easily embedded in applications. Moreover, untyped programs are somewhat easier to distribute, because we avoid problems of compatibility of types at multiple sites.

    The Obliq run-time is strongly typed: erroneous computations produce clean errors that are correctly propagated across sites. The run-time data space is heterogeneous, meaning that there are different kinds of run-time values and no provisions to discriminate between them; heterogeneity discourages writing programs that would be difficult to typecheck in typed languages. Because of heterogeneity and lexical scoping, Obliq is in principle suitable for static typing. More importantly, Obliq is compatible with the disciplined approach to programming that is inspired by statically typed languages.

    Lexical scoping makes it easy to distribute computations over multiple sites, since computations behave correctly even when they are carried out at the wrong place (by some measure). Flexibility in distribution can, however, result in undesirable network traffic. Obliq relieves some of the burden of distributing data and computations, but care and planning are still required to achieve satisfactory distributed performance.


  25. Treadmarks/C++, Rice University, 1992-1995
  26. Treadmarks supports parallel computing on networks of workstations. Its main novel feature is that it provides a global shared address space across the different machines on a cluster. The shared address space distinguishes it from other well-known packages such as PVM that provide a message passing interface between machines. There is growing consensus in the parallel computing community that a shared memory interface is more desirable from the application programmer's viewpoint, allowing him or her to focus on algorithmic development rather than on managing communication. The challenge in providing a shared memory interface is to do so efficiently. To this end, TreadMarks incorporates several innovative features, including release consistency and multiple-writer protocols.
    The Treadmarks system is implemented as a C++ library.

    This system implements lazy release consistency, which is a particular way of keeping the DSM consistent. Another way is eager release consistency, implemented in Munin, also from Rice University. These systems improve on the efficiency of the obvious sequential consistency model first implemented in IVY. A slightly different approach which directly associates shared data units with synchronization at the program level, called entry consistency, is implemented in Midway, developed at Carnegie Mellon University. An excellent overview of the design issues of DSM is available here.

    Multiple writers introduce special consistency problems, requiring the use of special diff data structures. Since the system uses lazy release consistency, diff construction can be postponed or avoided, a technique known as lazy diff creation. Garbage collection reclaims the space used by write notice records, interval records and diffs.

    To minimize asynchronous communications latency, Treadmarks handles SIGIO, which is generated upon any communication to a socket. The mprotect() call is used to generate SIGSEGV signals upon access to shared pages. Shared memory accesses are said to be partially ordered by the happened-before-1 relation.
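
    The page-trapping mechanism can be illustrated with a short, self-contained POSIX program; this is not TreadMarks code, it only shows the mprotect()/SIGSEGV idiom that such DSM systems build on.

        // Illustration of the page-trapping idiom (not TreadMarks code): access
        // to a protected page raises SIGSEGV; the handler is where a DSM system
        // would fetch the page's current contents before restoring access.
        #include <signal.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        static char *page;
        static long  pagesize;

        static void on_fault(int sig, siginfo_t *si, void *ctx)
        {
            (void)sig; (void)si; (void)ctx;
            // A real DSM would identify the faulting page from si->si_addr and
            // apply the diffs needed to bring it up to date before this call.
            mprotect(page, pagesize, PROT_READ | PROT_WRITE);
        }

        int main()
        {
            pagesize = sysconf(_SC_PAGESIZE);
            page = (char *)mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            struct sigaction sa;
            sa.sa_flags = SA_SIGINFO;
            sigemptyset(&sa.sa_mask);
            sa.sa_sigaction = on_fault;
            sigaction(SIGSEGV, &sa, NULL);

            mprotect(page, pagesize, PROT_NONE);   // make the "shared" page inaccessible
            page[0] = 42;                          // faults; the handler restores access
            printf("page[0] = %d\n", page[0]);     // the write then completes normally
            return 0;
        }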

    Programs allocate shared memory with Tmk_malloc() (and release it with Tmk_free()). The address returned by the Tmk_malloc() (which is usually called only by process 0) can be distributed to the other processes in the system with Tmk_distribute(). The same executable runs on every machine; the user specifies at runtime how many copies should be created.

    The variables Tmk_nprocs and Tmk_proc_id give the number of processes and the current process's relative rank in the system. Synchronization mechanisms include the global Tmk_barrier() and the exclusive lock administrators Tmk_lock_acquire() and Tmk_lock_release(). Tmk_startup() initializes the entire system, and each process calls Tmk_exit() for local cleanup and bookkeeping before exiting.
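
    Putting these calls together, a typical program has roughly the following shape; the argument lists are indicative and may differ slightly from the distributed headers, and the header name Tmk.h is an assumption.

        /* Indicative TreadMarks-style program using the calls named above. */
        #include <stdio.h>
        #include "Tmk.h"                         /* assumed header name */

        static int *shared_sum;                  /* pointer into the shared address space */

        int main(int argc, char **argv)
        {
            Tmk_startup(argc, argv);             /* start the requested number of processes */

            if (Tmk_proc_id == 0) {
                shared_sum = (int *)Tmk_malloc(sizeof(int));
                *shared_sum = 0;
                Tmk_distribute((char *)&shared_sum, sizeof(shared_sum));
            }
            Tmk_barrier(0);                      /* every process now sees shared_sum */

            Tmk_lock_acquire(0);                 /* each process adds its rank under a lock */
            *shared_sum += Tmk_proc_id;
            Tmk_lock_release(0);

            Tmk_barrier(1);
            if (Tmk_proc_id == 0) {
                printf("sum of ids over %d processes = %d\n", Tmk_nprocs, *shared_sum);
                Tmk_free(shared_sum);
            }
            Tmk_exit(0);
            return 0;
        }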

    Scorecard

    modern-language based
    secure
    scaleable
    fault tolerant
    dynamically load balanced
    distributed computation
    expresses task-parallel concurrency naturally
    language integrates synchronization primitives
    supports heterogeneous networks of workstations


  27. pSather/Sather/Solaris Threads, International Computer Science Institute, 1993-1995
  28. Parallel Sather (pSather) is a parallel version of the language Sather, developed and in use at ICSI. pSather addresses non-uniform-memory-access multiprocessor architectures but presents a shared memory model to the programmer. It extends serial Sather with threads, synchronization and data distribution. Unlike actor languages, multiple threads can execute in one object. A distinguished class GATE combines various dependent low-level synchronization mechanisms efficiently: locks, futures, and conditions.

    Sather's preprocessor, cs, translates Sather into C.

    Scorecard

    modern-language based
    secure
    scaleable
    fault tolerant
    dynamically load balanced
    distributed computation
    expresses task-parallel concurrency naturally
    language integrates synchronization primitives
    supports heterogeneous networks of workstations


Email comments and suggestions to the author, Naftali Schwartz, are welcome.