Chime: A Versatile Distributed Parallel Processing Environment [1]

Shantanu Sardesai and Partha Dasgupta

Department of Computer Science and Engineering
Arizona State University
Tempe AZ 85287-5406

Email: shantanu@asu.edu, partha@asu.edu
Web: http://enuxsa.eas.asu.edu:80/~sardesai/,
http://cactus.eas.asu.edu/partha,
http://cactus.eas.asu.edu/Milan/
phone: +1 602 965 5583, fax: +1 602 965 2751

Abstract

Chime is a “distributed” parallel processing system that provides true multiprocessor semantics, including a parallel programming language targeted at multiprocessors, in an unreliable, asynchronous distributed environment. The language support provided by Chime includes parallel blocks, nested parallel blocks, global shared memory, scoping of local variables, and language-based synchronization constructs. In addition, the runtime system provides an efficient implementation that supports fault tolerance and load balancing. Chime has been implemented and tested on a network of Pentium-based systems running Windows NT.

The language supported by Chime is the shared memory part of the Compositional C++ (CC++) [CK92] language definition, even though the target environment is distributed. Chime uses key mechanisms from the Calypso [BDK95, MSD97] parallel processing system to provide fault tolerance and load balancing. This approach allows CC++ multiprocessor constructs to work in a distributed environment.

Chime actually makes a distributed system look like a multiprocessor. Unlike other DSM-based systems, the shared memory is not an add-on but an integral part of the system. Similarly, the parallelism expressed in the language is mapped to the distributed hardware automatically. Chime's handling of faults and its management of machines with heterogeneous speeds are novel and incur very little overhead. Novel features added by Chime include distributed cactus stacks, management of nested parallel blocks, thread synchronization, and automatic data transfer in a distributed system.

This paper describes the properties, design and implementation details of Chime.

1. Introduction

Over the last decade, there has been ever-increasing interest in using parallel processing, both for high-performance scientific computing and for general-purpose applications, mainly as a result of the demand for higher performance and lower cost. The acceptance of parallel processing has been facilitated by two major developments: multiprocessors and the widespread use of distributed computing. In recent years, the focus of parallel processing technology has shifted from specialized multiprocessor hardware to distributed computing. The most important factor in favor of distributed computing is that it can provide high performance at low cost. The computational power of networked workstations can surpass that of supercomputers, if harnessed properly [2]. The advantages of using a network of workstations for implementing a parallel processing system are evident from the plethora of parallel processing systems based on distributed platforms developed in recent years. These “distributed” parallel processing systems enable the programmer to exploit the computational power of networked workstations, but they do not always address important issues related to the programming language, the programming model, or an elegant execution environment on networked hardware.

We have implemented a “distributed” parallel processing system called Chime that runs on off-the-shelf networked workstations, operating systems and compilers. It provides a programming interface based on the Compositional C++ (CC++) [CK92] language definition. (CC++ is a language defined at CalTech for use in high-performance parallel and distributed applications.) In Chime, a programmer develops applications for a shared memory multiprocessor model using CC++; these are then pre-processed into C++ programs [3]. The compiled programs are executed by the Chime runtime system on a virtual machine implemented on a distributed network. This approach not only provides the programmer with completely transparent distributed shared memory combined with traditional block-structured programming, but also allows transparent utilization of unreliable shared resources available over the network, by providing dynamic load balancing and fault tolerance at almost no additional overhead. Chime is implemented in C++, using Microsoft Visual C++ 4.2, and was developed on Pentium Pro based machines running Windows NT 4.0.

This paper presents an overview of the Chime system, from its goals and properties to its design and implementation. Chime has been implemented and several small applications have been tested. In section 2 we present the motivation behind Chime and its relationship to CC++, Calypso [BDK95] and Calypso NT [MSD97]; we also discuss other approaches and their shortcomings. Section 3 provides a tutorial on the subset of the CC++ language definition that is implemented by Chime. The design of the Chime system is described in section 4 and section 5 details the actual implementation. Sections 6 and 7 discuss the user interface and preliminary testing information about the implemented system.

2. Related Work, Motivation and Goals

Until recently, most parallel processing was done on specialized multiprocessor hardware. Various mature parallel systems and languages exist that allow efficient utilization of the underlying hardware and are straightforward to program. However, these systems have not proven to be cost effective and hence have not succeeded in the marketplace.

Recent developments in networking and distributed computing have made high-performance parallel processing systems feasible in distributed environments. Although there are several challenging problems in providing a satisfactory environment for parallel processing on networks of workstations, the solutions have proven to be cost-effective. As a result, a large number of “distributed” parallel processing systems exist. These systems allow the programmer to utilize the computational power of networked workstations, but they do not always address several important issues, outlined below:

Chime addresses most of these problems in a simple, clean and efficient manner by providing a multiprocessor-like shared memory programming model on a network of workstations, along with automatic fault tolerance and load balancing. As shown below, Chime has its roots in Calypso and CC++. We have coalesced ideas from both these systems and combined them to build a design for a significantly more powerful system.

Calypso [BDK95] and its sibling, Calypso NT [MSD97], are distributed parallel processing systems that use distributed shared memory efficiently and yet provide fault tolerance and automatic load balancing. These systems decouple runtime parallelism (the number of available machines) from application parallelism (the number of concurrent threads). Calypso and Calypso NT have been rigorously tested and shown to provide very good performance (i.e., exceptionally low runtime overhead). However, the parallel tasks are executed in an isolated context and only designated global shared variables can be accessed by the parallel tasks. The systems do not support nested parallelism. Moreover, there is no synchronization or data transfer support for parallel tasks.

CC++ [CK92], on the other hand, is a complete programming language that provides all the features missing from Calypso, and more. However, CC++ uses two memory models - a shared memory model and a distributed memory model - and programmers can use both in the same program. The shared memory model can only be used on multiprocessors; on a network the distributed memory model has to be used. The distributed memory model is far more difficult to use than the shared memory model; for example, the spawning of processes and the collection of work has to be done via difficult-to-use constructs. Using two different models can make programs complex and hard to debug. The existing implementation does not support fault tolerance or load balancing.

Chime is a solution that resolves all the above shortcomings and provides a cleaner alternative. We have chosen the shared memory model of the CC++ language definition as the only programming model, regardless of the execution environment. This model is clean and simple to program with, and can be used regardless of whether the program will run on a shared memory multiprocessor or a distributed system. To the cleaner programming semantics of the CC++ shared memory model we add the features of Calypso, i.e., fault tolerance, high performance and automatic load balancing, to yield a system that, we believe, is superior to both. Specifically, Chime has the following salient features:

3. The Programming Interface

The Chime programming interface is essentially a subset of the CC++ language definition. Along with all C++ keywords and constructs, the following four CC++ keywords are available, allowing the programmer to exploit task and data parallelism, to transfer data between parallel tasks through synchronization, and to control the interleaving of parallel tasks:

par - Used to express task parallelism. Each statement enclosed in the block is executed in parallel.

parfor - Used to express data parallelism. Each iteration of the loop is executed in parallel.

synch - Data items used for synchronization between parallel tasks.

atomic - Functions to control the level of interleaving of parallel tasks.


par {
	statement_1;
	  :
	statement_n;
}

parfor (int i=0; i<N; i++) {
	statement_1;
	  :
	statement_n;
}

Figure 1: par and parfor constructs.

The par construct (figure 1) enables the programmer to exploit task parallelism in the application. The keyword par converts a sequential C++ compound statement into a parallel block and each statement within the par block executes concurrently. The execution order of the statements is not defined and their execution may either be interleaved or concurrent. The execution of a parallel block is complete when the execution of all the parallel tasks is complete. It should be noted that in a distributed environment, to get high performance, each statement in a par block must express a fair amount of computation.

The parfor construct is used to express data parallelism and is similar to the for loop construct in C, except that the keyword for is replaced with parfor and all the iterations are executed in parallel. As in the case of par, the execution order and the actual degree of parallelism are undefined. At the end of par and parfor there is an implicit barrier; thus all programs written with these constructs are barrier-synchronized programs.

A nested parallel construct is a valid statement inside a parallel construct. Nesting of par/parfor statements can be used to express nested parallelism. This nesting can either be linguistic (i.e. a par block contains a par statement) or at runtime (i.e. a statement in a par block calls a function that has an embedded par block). The correct execution of such nesting is implemented by Chime.

Chime allows parallel tasks to synchronize and transfer data between themselves by providing single-assignment [FT93] or synch variables. As specified by the CC++ language definition, a synch variable has only two states: an initial undefined state or a defined state. Once defined, it is the same as a constant; modifying its defined value causes a run-time error. If a parallel task attempts to read a synch variable that has not yet been defined, the task suspends its execution until the synch variable is defined. Thus, synch variables can be used as a means of synchronization as well as data transfer between tasks. If a parallel task wants to use data generated by another task, the two must synchronize (and transfer the data) through a synch variable. A variable can be defined as a synch variable by using the keyword synch at the beginning of its declaration as follows:

synch int a; //synch integer

synch int A[10]; //an array of synch integers
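As a small illustration (our own, not taken from the paper; Compute and UseResult are hypothetical functions), a synch variable lets one task in a par block pass a value to a sibling task. In the Chime implementation described later, such reads and writes map to getValue/setValue operations on the underlying template object.

synch int partial;                // starts in the undefined state

void Example() {
	par {
		partial = Compute();      // hypothetical producer: the first write defines the variable
		UseResult(partial + 1);   // hypothetical consumer: this read blocks until 'partial' is defined
	}                             // implicit barrier: both tasks have completed here
}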

Some parallel applications have a control structure where each task does a huge amount of computation but occasionally executes a function, e.g. writing to a file, whose execution cannot be interleaved with that of the same or some other particular function. In a traditional barrier-synchronized program such a function must be executed within a sequential step, thereby limiting the exploitation of available parallelism. Programmers can overcome this problem by using atomic functions [5]. An atomic function is guaranteed not to be executed in parallel with any other atomic function. An atomic function can be defined as follows:

atomic void AnAtomicFunction (. . .){ . . .}
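For illustration only (our own sketch; LogProgress and HeavyComputation are hypothetical names), an atomic function could serialize console or file output from otherwise fully parallel iterations:

#include <iostream.h>

void HeavyComputation(int i);         // hypothetical compute-intensive routine

atomic void LogProgress(int task) {   // never interleaved with any other atomic call
	cout << "task " << task << " done" << endl;
}

void DoWork() {
	parfor (int i = 0; i < 100; i++) {
		HeavyComputation(i);          // the bulk of the work runs fully in parallel
		LogProgress(i);               // only these calls are serialized across parallel tasks
	}
}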

3.1 The Programming Model

As stated earlier, the programming model provided by the Chime programming interface is the shared memory programming model of the CC++ language definition, which is simple and easy to use. Programs are structured by composing parallel tasks within sequential programs. The execution of the parallel tasks is referred to as a parallel step and the execution of a sequential fragment between two parallel steps is referred to as a sequential step.

The application starts its execution as a sequential program, executing the first sequential step, which is followed by a parallel step. The execution of a parallel step consists of the execution of several parallel tasks. A parallel task's execution may involve one or more nested parallel steps. Figure 2 illustrates the execution of a small Chime program fragment. The parallel steps are barrier synchronized and also allow synchronization and data transfer between concurrently executing parallel tasks of the same or different parallel steps.

As specified by the CC++ language definition, and contrary to most distributed parallel processing systems, parallel tasks start their execution from within the context of the task that started them. Local variables declared within the scope of the function from which parallel tasks are spawned are accessible by the parallel tasks. These variables are allocated on the stack at runtime and are made available by Chime on remote workers through a distributed cactus stack.

Chime provides programmers with a uniform view of the shared memory across the network and does not put the burden of data partitioning and data transfer on them. All the variables defined in the global scope of the program are transparently accessible on remote workers through distributed shared memory. Unlike CC++ and P4 [BBD+87], there is no distributed memory, and thus the programmer does not deal with two completely different memory models in a single program. This makes program design clean, simple and easy to debug.

The semantics of accessing shared memory in Chime is the same as that specified by the CC++ language definition, i.e., concurrent read, exclusive write (CREW). Thus, any two parallel tasks may read from the same location concurrently, but if a parallel task writes to a location, no other parallel task may read from or write to that location. In Chime, this rule holds only between two synchronization points; two parallel tasks can therefore write to the same location by synchronizing between writes.

3.2 An Example - Matrix Multiply

Figure 3 shows the complete source code for an example matrix multiplication program in CC++ that compiles and executes correctly under Chime. This is one of many programs we have checked for correct execution under our current implementation of Chime. The program initializes two matrices (500 × 500) with pseudo-random numbers, multiplies them and stores the result in a third matrix. The actual multiplication is done in parallel using a user-specified number of parallel tasks. The three arrays (matrices) are declared in the global scope and thus are part of the transparent distributed shared memory.

The execution of the program begins in main. The two arrays (A and B) are initialized with pseudo-random numbers, using the function RandomFill, during the sequential step. Then comes the parallel step: the multiplication of the matrices is performed in parallel within the parfor loop, which uses a user-specified number (numOfSubDiv) of parallel iterations.

It should be noted that:

#include "chime.h"
#include <stdlib.h>
#include <iostream.h>
const int N=500;

float A[N][N], B[N][N], C[N][N];
void RandomFill(float mat[N][N], int size){
	for(int i=0; i<size; i++)
	for(int j=0; j<size; j++)
		mat[i][j] = rand();
} /* RandomFill */

void main(int argc, char *argv[]){
	int numOfSubDiv;
	cout << "Input num of sub divisions: ";
	cin >> numOfSubDiv ;
	RandomFill(A,N);
	RandomFill(B,N);
	parfor(int division=0; division < numOfSubDiv; division++){
		int from = division * (N/numOfSubDiv);
		int to = from + (N/numOfSubDiv);
		for(int i=from; i<to; i++)
			for(int j=0; j<N; j++){
				C[i][j] = 0;
				for (int k=0; k<N; k++)
					C[i][j] += A[i][k] * B[k][j];
			}
	} /* parfor */
} /* main */
Figure 3: Matrix Multiplication Program

4. System Design

The run-time environment of Chime consists of regular networked workstations running conventional operating systems. One of the workstations is designated as the manager and the rest as workers. The manager executes sequential steps and the workers execute tasks of parallel steps allocated to them by the manager. While executing a task, a worker gets the required data from the manager as needed, and it reports all updates to the manager after finishing the task.

4.1 Design Challenges

Although the above manager-worker execution model is straightforward, there are various challenging issues involved in implementing a system that provides a multiprocessor-like shared memory parallel processing environment on networked workstations. These are mainly as follows:

In the remainder of this section we present the system design of the existing implementation, describing the core mechanisms used to deal with the above issues in providing a multiprocessor-like programming model in a distributed environment.

4.2 Execution Environment

As stated earlier, Chime uses the manager-worker model of execution [6]. A program starts on the manager with the execution of the first sequential step. When execution reaches a parallel step, parallel tasks are generated and allocated by the manager to waiting workers. The manager uses a scheduling algorithm that takes care of task allocation to the workers as well as scheduling of nested parallel tasks in the correct order. Nested parallel tasks in an application form a DAG, as shown in figure 4. Each nested parallel step consists of several parallel tasks, called siblings. It also has a parent task and a continuation that must be executed once the nested parallel step has been completed. A continuation is an object that fully describes a future computation; it represents the remaining execution of the parent parallel task that must be resumed after the nested parallel step has completed. To complicate the scenario, a continuation may itself contain nested parallel steps.

The manager maintains an execution dependency graph to capture the dependencies between the parallel tasks and schedules them and their corresponding continuations in the correct order. The allocation of tasks to the workers is done by eager scheduling [DKR95, BDK95], which replicates computations in a non-conventional fashion whenever more computational resources than required are available. To ensure the correctness of eager-scheduled computations, the execution uses TIES (two-phase idempotent execution strategy) [BDK95] for memory management. This has two benefits. First, it provides fault tolerance for free, as any worker may fail during the execution of a parallel task without affecting the computation at all. Second, it provides load balancing for free, as fast workers do not wait for slow workers and do more work; this also speeds up the overall computation. The coupling of eager scheduling with TIES provides a very efficient distributed execution scheme [BDK95, MSD97]. [7]
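The scheduling policy can be pictured with a small C++ sketch (our own simplification, not the Chime source): when an idle worker asks for work, the manager hands out the unfinished task that has been started the fewest times, so fast workers naturally re-execute work assigned to slow or failed ones.

#include <vector>
#include <climits>

struct Task { int id; int timesStarted; bool finished; };

// Pick the unfinished task with the lowest start count (eager scheduling).
// Returns -1 when every task has finished, i.e. the parallel step is done.
int SelectTask(std::vector<Task>& table) {
	int best = -1, fewest = INT_MAX;
	for (int i = 0; i < (int)table.size(); i++) {
		if (!table[i].finished && table[i].timesStarted < fewest) {
			fewest = table[i].timesStarted;
			best = i;
		}
	}
	if (best != -1) table[best].timesStarted++;   // the task is (re)assigned to the requesting worker
	return best;
}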

Parallel tasks start their execution from within the context of the parent task, providing uniform scoping of variables between parallel and sequential steps. All the variables that are visible at the beginning of a parallel step are also visible inside the parallel step. This is achieved by the distributed cactus stack. Before the execution of a parallel task begins, its complete context is set to that of the parent task, by setting the hardware context and the stack appropriately.

While a worker is executing a task, it acquires the global shared memory pages that it uses via demand paging from the manager, as discussed later. When a worker finishes a task, it returns memory updates (updated variables in global shared memory and the process stack) to the manager. The manager then determines whether these memory updates have already been received from some other worker (which finished faster). If so, the updates are discarded; otherwise they are committed. If a worker reaches a nested parallel step, it commits its updates and reports all newly generated parallel tasks to the manager. The manager schedules these newly generated tasks in the correct order.

The processes that execute the manager and the worker use a two-threaded architecture. There are two threads in each process. These threads are called the application thread and the controlling thread. The application thread executes user-level code and the controlling thread controls the execution of the application thread.

In the manager, the controlling thread sets up the application thread appropriately for the execution of application code as part of sequential steps. During the execution of a parallel step it communicates with the workers and performs the scheduling, task-allocation and memory management functions; it also handles callbacks from the workers and provides the distributed shared memory service. The manager's application thread executes the sequential steps of the computation and remains suspended while parallel steps are executed by the workers.

In a worker, the controlling thread sets up the complete context (stack, software and hardware context) of the application thread for the execution of application code as part of a parallel task. The application thread executes a single parallel task of the application program. After completing the parallel task, the application thread notifies the controlling thread, which in turn notifies the manager and sends all updates. Whenever there is a need to perform some task other than executing application code - for example, to start execution of a nested parallel step, to get a shared page from the manager, or to synchronize with another parallel task - the application thread requests the controlling thread to accomplish the task on its behalf and suspends until the request is fulfilled.

The controlling thread is also responsible for maintaining the distributed cactus stack as it grows and shrinks. When several parallel tasks start their execution from within the same context as that of the parent task, a cactus stack is formed. Conceptually, a cactus stack grows from the initial parent stack and bifurcates into multiple branches, which vary depending upon the control flow of the parallel tasks. The number of branches depends upon the degree of parallelism and the depth depends upon the number of nested parallel blocks. All the parallel tasks share the initial common stack but have their own branched stacks after the beginning of the parallel step. Before the execution of a parallel task in a worker, the controlling thread gets the parent stack from the manager and sets it up for the application thread. It also sends all stack updates back to the manager once the application thread has finished executing the parallel task.
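To make the mechanism concrete, the per-task state shipped between manager and worker can be imagined roughly as follows. This is a sketch under our own naming (TaskContext and its fields are not the actual Chime declarations), assuming a Win32 hardware context:

#include <windows.h>

struct TaskContext {
	CONTEXT registers;      // hardware context (instruction and stack pointers, etc.) of the parent task
	void   *stackBase;      // same virtual stack address on the manager and on every worker
	size_t  stackSize;      // number of bytes of the parent stack shipped with the task
	char   *stackImage;     // snapshot of the parent stack: the shared trunk of the cactus stack
	int     stepNumber;     // parallel step this task belongs to
	int     taskId;         // row in the manager's progress table
};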

The two-thread architecture provides a clean distribution of responsibilities. It also gives Chime complete control over the execution of the application: the application thread is completely controlled and monitored by the controlling thread. As a result, the application thread can be started, stopped or restarted at any time, in any context, on any worker machine. The architecture can easily accommodate new features such as distributed resource allocation and smart scheduling using process migration, planned for the near future.

5. Implementation

The programmer writes a Chime program (a “.chm” file) using the subset of the CC++ language specified earlier. Our preprocessor converts the program into a C++ (“.cpp”) program, which is then compiled and linked with the Chime runtime library using the Visual C++ compiler to produce an executable (“.exe”). The executable file contains all the code needed to run it as either a manager or a worker.

In this section we first describe the major transformations applied by the preprocessor to the source code, followed by a description of the steps involved in the distributed execution of a program.

5.1 Pre-Processing

The preprocessor converts a Chime program to a C++ program by transforming the two CC++ constructs which are primarily used to express parallelism. The par block is transformed as shown in figure 5.

The preprocessor inserts code to signal the start of a parallel step to the controlling thread and code to obtain information about the stack frame of the function in which the parallel tasks are to be executed. All the statements in the par block are uniquely labeled, thereby assigning an implicit JobNum to each of the parallel tasks. The preprocessor also inserts code to initialize the progress table, code to suspend the parent task, and a switch statement so that workers execute the appropriate parallel task when they are resumed on a remote machine.

The parfor construct is transformed in a different manner, as shown in figure 6. The preprocessor inserts code to initialize the progress table entry corresponding to each parallel iteration of the parfor with the appropriate value of the index variable. It also inserts code to obtain information about the current stack frame, code to suspend the parent task, and code for the worker, so that it executes the body of the parfor with the appropriate index value set when it resumes on a remote machine.

Original code:

par{
	statement_1;
	statement_2;
	statement_3;
}

Transformed code:

{ /*******Start of PAR Block 0********/
	BeginParallelStep();
	__asm mov _theBP, ebp
	__asm mov _theSP, esp
	goto chime_ps0;
chime_ps0_stmt0: statement_1;
	EndParallelThread();
chime_ps0_stmt1: statement_2;
	EndParallelThread();
chime_ps0_stmt2: statement_3;
	EndParallelThread();
chime_ps0: /*Initialize Progress Table with 3 Job(s)*/
	AddParJobs(3);
	SuspendUntilDone();
	if (isThisWorker){
		switch (theJobNum){
			case 0: goto chime_ps0_stmt0;
			case 1: goto chime_ps0_stmt1;
			case 2: goto chime_ps0_stmt2;
		}
	}
	/*******End of PAR Block 0********/
}

Figure 5: Preprocessing of a par block.

It should be noted that all the parallel tasks, as well as the continuation, start their execution from the same point in the program - the point where the parent task suspends until the execution of the parallel step is completed.

Original code:

parfor(int i=0; i<N; i++){
	statement_1;
	statement_2;
	statement_3;
}

Transformed code:

/*********Start of PARFOR*************/
BeginParallelStep();
for(int i=0; i<N; i++){
	AddParforEntry(&i);
}
__asm mov _theBP, ebp
__asm mov _theSP, esp
SuspendUntilDone();
if (isThisWorker && !Skip){
	{
		statement_1;
		statement_2;
		statement_3;
	}
	EndParallelThread();
}
/*********End of PARFOR*************/

Figure 6: Preprocessing of a parfor construct.

The preprocessor also performs other transformations on the source code. It uses a compiler directive to group all global variables into a single data section, which is made available as the shared memory space among the manager and workers. The synch variables are defined as generic objects of the user-specified type, using C++ templates. The atomic functions are transformed by code insertion so that they do not interleave with each other at runtime. Last but not least, the main function is renamed to ChimeMain.
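For instance, under Visual C++ the grouping can be pictured with the data_seg pragma, and a synch variable can be wrapped in a small class template. The sketch below is only indicative of what such generated code could look like; the section name ChimeShared and the class name ChimeSynch are our own, not the actual output of the Chime preprocessor.

#pragma data_seg("ChimeShared")       // initialized globals declared after this go into one named
float A[500][500] = {0};              // section, so the runtime knows the address range it must
float C[500][500] = {0};              // protect and page between the manager and the workers
#pragma data_seg()                    // revert to the default data section

template <class T>
class ChimeSynch {                    // one instance generated per 'synch T x;' declaration
	bool defined;                     // initially in the undefined state
	T value;
public:
	ChimeSynch() : defined(false) {}
	T getValue();                     // blocks (via the controlling thread) until the value is defined
	void setValue(const T &v);        // run-time error if the value is already defined
};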

5.2 Execution Management

When an application is started, the first process created becomes the manager process. In the manager process, the primary thread creates the controlling thread and then takes the role of the application thread by immediately suspending itself. The controlling thread initializes the communication and memory management subsystems and resumes the application thread in ChimeMain. While the controlling thread waits for requests from the application, the application thread executes the first sequential step of the application.

At this point, worker processes are started on other machines via a remote execution facility. A worker executes the same executable program as the manager, and hence has the same startup sequence. The controlling thread in the worker then contacts the manager for work, and the application thread is resumed only when work is allocated to the worker. It should be noted that designating the primary thread as the application thread ensures the same stack address on the manager and on all workers.

At some point, the application thread in the manager reaches a parallel step. The application thread builds a progress table, wakes up the controlling thread by signaling an event [R95], and suspends itself. The progress table contains the information needed to execute the parallel tasks on the workers and to schedule the workers in order to complete the execution of the parallel step. For example, figure 7 shows a simplified version of the progress table that is built while executing the matrix multiplication program with numOfSubDiv set to 4.

Each row in the progress table represents a parallel task in the parallel step and has a unique task id. For instance, the last row represents a parallel task with task id equal to 3, whose type is parfor. The “Started” and “Finished” columns indicate, respectively, that this task has not yet been assigned to any worker and has not completed execution. Chime has different types of tasks: par, parfor, synch, etc. Each task has additional information associated with it based on its task type. For example, parfor tasks have the value and address of the index variable(s) for each of the parallel iterations, a par task has a job id within the par block, and a synch task has a synch object and other information associated with it. It should be noted that during the execution of a parallel step, different types of nested tasks may be added to the progress table.

Figure 7: Progress Table for the parallel step in the matrix multiply program
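A progress-table row can be pictured as a small tagged record. The sketch below is our own rendering of the information just described, not the actual Chime declaration:

enum TaskType { TASK_PAR, TASK_PARFOR, TASK_SYNCH, TASK_CONTINUATION };

struct ProgressEntry {
	int      taskId;      // unique within the parallel step
	TaskType type;        // par, parfor, synch, or a continuation of a parent task
	bool     started;     // assigned to at least one worker
	bool     finished;    // the first completed execution has been committed
	int      jobNum;      // par: position of the statement inside the par block
	int      indexValue;  // parfor: value of the loop index for this iteration
	void    *indexAddr;   // parfor: address of the index variable in the parent stack
};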

The controlling thread, on notification from the application thread, comes out of its wait, makes sure that the application thread has suspended, and fills in the remaining information by saving the context and stack of the application thread. After filling in all the required information in the progress table, the controlling thread listens for workers requesting work. When a worker (actually, the controlling thread in a worker) contacts the manager-controlling-thread, the scheduler assigns to the worker the parallel task that has been started the fewest number of times and has not finished. The dispatcher sends the worker appropriate information from the progress table, based on the type of the selected task. The manager is also informed when a worker finishes its work assignment, and the manager-controlling-thread updates the “Finished” column of the appropriate row upon notification. Once all the rows in the progress table are marked finished, the manager-controlling-thread resumes the manager-application-thread to continue the execution of the sequential step following the finished parallel step. Finally, after a sequence of sequential and parallel steps, the execution of the application thread (ChimeMain) terminates and the controlling thread is notified, which in turn informs all workers of the termination. In addition to the scheduling effort, the controlling thread in the manager services all memory requests and also applies memory updates as workers finish tasks.

5.3 The Worker

The control flow in the worker is interesting and unconventional. The controlling thread in the worker, upon work allocation from the manager, receives the complete context and stack for the task. It then access-protects (makes inaccessible, via a VirtualProtect call) the shared memory area, sets up the stack and context for the application thread, and resumes it.

This causes an interesting effect: suddenly, the application thread starts running at the exact same location, with the exact same context, as the point where the application thread of the manager got suspended! Since the JobNum or index variable is correctly set in the stack of the worker by the manager as it ships the context, the worker executes the task it was assigned. During the course of execution, it accesses shared pages and generates exceptions (page faults). The exception handler notifies the controlling thread, which gets the pages from the manager, installs them and un-protects them as they are required. Upon completion of the task, the application thread notifies the controlling thread and suspends itself. The controlling thread sends all the updated pages and stack updates to the manager in the form of differences (XOR) between the original and updated pages. The controlling thread then waits for the manager to allocate a new task and repeats the same sequence of events until the execution of the application is complete.

The above description of memory handling ignores the caching of memory by workers. Workers actually handle memory quite intelligently, distinguishing between read and write accesses, caching pages that have not changed as well as caching updated pages that other workers have not changed. The details of the caching strategy are outside the scope of this paper.

The manager accepts only the first completed execution of each parallel task and discards the others. If the updates are late, they are discarded; otherwise they are applied by another XOR operation with the manager's copy of the page. Step numbers, task ids and similar information are used to decide whether to update memory. Any two tasks can update different parts of the same page as long as the concurrent-read, exclusive-write condition is satisfied. This ensures correctness in spite of multiple executions of the same thread segment. The mechanism also provides an efficient implementation by avoiding page shuttling completely. Moreover, there is no need for complicated mechanisms such as distributed locking.
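The diffing step itself is simple. The following sketch (ours, with an assumed 4 KB page size) shows how a page difference can be produced by XOR on the worker and applied by XOR on the manager:

const int PAGE_SIZE = 4096;

// Worker side: diff = original XOR updated.
void MakeDiff(const unsigned char *original, const unsigned char *updated,
              unsigned char *diff) {
	for (int i = 0; i < PAGE_SIZE; i++)
		diff[i] = original[i] ^ updated[i];
}

// Manager side: XORing the diff into the manager's copy reproduces the worker's updates.
// Bytes the worker did not touch have a diff of zero and are left unchanged, so two tasks
// may update disjoint parts of the same page without any page shuttling or locking.
void ApplyDiff(unsigned char *managerPage, const unsigned char *diff) {
	for (int i = 0; i < PAGE_SIZE; i++)
		managerPage[i] ^= diff[i];
}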

5.4 Nested Parallelism

The mechanism for supporting nested parallelism integrates well with the manager-worker execution model and also supports fault tolerance. When a worker hits a parallel step, it generates parallel tasks in exactly the same way as the manager does. It sends all updates and reports all the newly generated parallel tasks, along with the continuation of the parent task, to the manager. If the worker is not late, the manager commits the updates, marks the parallel task as finished and adds all new parallel tasks to its progress table. It also adds an extra task entry for the continuation, which represents the remainder of the parallel task that follows the nested parallel step.

As stated earlier, the manager maintains an execution dependency graph to schedule all nested parallel tasks in the correct order. For each nested parallel task, the manager records the parent task, the list of all siblings and the continuation of the parent task. Siblings of a nested parallel task are the tasks that form the parallel step. As stated earlier, all tasks in a parallel step and the continuation start their execution from the same context; thus, only a single context (that of the parent task) is saved for a parallel step.

On completion of each nested parallel task, the manager marks the task done and checks whether all the siblings of the task have completed their execution. If so, it marks the continuation ready and schedules it appropriately; otherwise, it just deletes the completed task from the list of siblings. When a worker reports completion of a continuation, the manager checks the execution dependency of the parallel step containing the parent task and, if appropriate, marks the corresponding continuation ready for execution. The actual algorithm and the data structures used for this purpose are highly optimized and their details are beyond the scope of this paper.
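The sibling bookkeeping can be sketched as follows (our own simplification of the dependency tracking just described; ParallelStep and SiblingFinished are hypothetical names):

#include <set>

struct ParallelStep {
	std::set<int> pendingSiblings;   // task ids of siblings not yet finished
	int           continuationId;    // progress-table entry to release when they all are
};

// Called when the first successful execution of task 'taskId' is committed.
// Returns true when the continuation of the parent task becomes runnable.
bool SiblingFinished(ParallelStep &step, int taskId) {
	step.pendingSiblings.erase(taskId);
	return step.pendingSiblings.empty();   // all siblings done: schedule the continuation
}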

5.5 Synchronization

As stated earlier, synch variables are implemented as generic objects using C++ templates. Each of these objects has a state, which is initially undefined, and a value of the programmer-specified type. The programmer can read from or write to a synch variable by using the getValue and setValue methods, respectively. On each of these operations, the worker sends and commits all of its updates to the manager. It also sends to the manager the context and stack of the parallel task (the worker-application-thread). The manager marks the parallel task as finished and adds a new entry to the progress table representing the remainder of the parallel task's execution.

On a setValue request, if the state of the object is undefined, the manager marks the newly created entry as assigned, sets the value of the variable, changes the state of the object to defined, and sends an OK to the worker, which continues its execution. Otherwise, it aborts the entire application, as an assignment to an already defined synch variable is a runtime error.

If it is a getValue request and the state of the object is defined, the manager marks the newly created entry as assigned and sends the value of the variable to the worker, which continues its execution. Otherwise, the manager enqueues the newly created task in a wait queue for the object and sends a terminate message to the worker. The worker then terminates the current parallel task and gets a new task from the manager. As and when the manager receives a setValue request for the same object, it checks the wait queue and dequeues all the waiting tasks. When a waiting synch job is restarted by a worker, it gets the value of the object, sets its local copy, and resumes execution.
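The manager-side protocol can be summarized with a short sketch. This is our own simplification under assumed names (SynchObject, TaskEntry, HandleSynchRequest, the Reply codes); it is not the Chime source, and the value type is fixed to int for brevity.

#include <queue>

struct SynchObject { bool defined; int value; std::queue<int> waitQueue; };
struct TaskEntry   { int taskId; bool assigned; };

enum Reply { REPLY_OK, REPLY_VALUE, REPLY_TERMINATE, REPLY_ABORT };

// Handle a worker's getValue/setValue request. 'entry' is the progress-table row
// created for the remainder of the requesting parallel task.
Reply HandleSynchRequest(SynchObject &obj, TaskEntry &entry, bool isSet, int newValue) {
	if (isSet) {
		if (obj.defined) return REPLY_ABORT;   // second assignment: run-time error, abort the application
		obj.value = newValue;
		obj.defined = true;
		entry.assigned = true;                 // the requester resumes where it left off
		return REPLY_OK;
	}
	if (obj.defined) {                         // value already available
		entry.assigned = true;
		return REPLY_VALUE;                    // ship obj.value back to the worker
	}
	obj.waitQueue.push(entry.taskId);          // park the task until some setValue arrives
	return REPLY_TERMINATE;                    // the worker drops this task and asks for new work
}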

5.6 Atomic Functions

The execution of an atomic function is not interleaved with any other atomic function. Whenever a call is made to an atomic function, the worker contacts the manager, sends the updates, context and stack of the parallel task to the manager, and waits for a new task to be allocated. On the other side, when the manager gets a request for an atomic execution from a worker, it marks the parallel task finished and creates a new entry for the remainder of the task in the progress table. All the parallel tasks representing atomic executions are executed by a dedicated worker on the manager machine, to preserve fault tolerance. The manager keeps a record of all atomic executions and assigns them one by one to the local dedicated worker. A FIFO queue of all suspended requests is maintained to avoid starvation.

6. Graphical User Interface

A graphical user interface, shown in figure 8, is also provided to enable the user to visually monitor the execution of a Chime/CC++ program on local as well as remote machines. The GUI is a program monitor that displays the progress of the program during execution. The parallel tasks of the parallel step under execution are represented by vertical bars, which change color as their status changes from unassigned to assigned and finally to done. A detailed description and step-by-step instructions for using the GUI can be found in [MSD97].

The user uses the GUI to run a pre-processed and compiled Chime/CC++ program (.exe), which is essentially a compiled Visual C++ console application. The user specifies a list of worker machines and can stop or start execution on any of them. The fault-tolerant nature of Chime allows the user to add as well as delete machines and/or workers at any time during the execution of the program without affecting its correctness. Dynamic load balancing ensures that workers on faster machines do more work than those on slower machines.

7. Testing and Preliminary Results

The Chime system has been implemented. The implementation consists of:

Testing of the Chime system is currently in progress. We have tested ordinary applications such as matrix multiply; these work well. We have also tested applications that cause an explosion of parallel threads. One such application is a parallel version of partition sort. Our version of partition sort takes an array of numbers, finds a pivot element and places it in its correct sorted position, such that all numbers to the left of this element are lower than the element and all numbers to the right are higher. This yields two smaller arrays that can be sorted in parallel. This is done recursively, creating almost as many parallel sorting threads as there are elements in the array. We have tested this program with up to 2,048 threads created by about 11 levels of nesting. Of course, the fine-grained nature of the application makes it inefficient to run on a network of workstations, but it executes correctly under the Chime system, demonstrating the correctness of the memory management and the nested parallelism scheduler.
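For reference, the recursive structure of such a sort can be sketched in the CC++ subset accepted by Chime. This is our own reconstruction of the shape of the test program, not its actual source; Partition is only declared here.

int Partition(int a[], int lo, int hi);   // places a pivot element, returns its final position

void ParSort(int a[], int lo, int hi) {
	if (lo >= hi) return;
	int p = Partition(a, lo, hi);
	par {                                 // the two halves become nested parallel tasks
		ParSort(a, lo, p - 1);
		ParSort(a, p + 1, hi);
	}                                     // implicit barrier: both halves are sorted here
}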

Performance testing of Chime is in progress. Our preliminary tests (matrix multiply) show performance gains that are exactly the same as Calypso's, as expected [8]. However, the newer features of Chime are not fully exploited by the matrix multiply program and hence it is not a good predictor of overall performance. More tests are in progress and will be reported in the final version of this paper.

8. Conclusion

In this paper we have described a complete system that implements, in a distributed environment, a feature-rich parallel processing language developed for shared memory multiprocessors. In addition, our system provides support for workstations that fail or suffer slowdowns due to external load. Specifically, we have implemented completely transparent distributed shared memory, nested parallel statements, inter-thread synchronization, atomic functions, scoping of variables (via a distributed cactus stack), non-isolated distributed execution of parallel tasks, and management of faults and loads. We have shown how the generation of nested parallel tasks can be dealt with elegantly in the manager-worker execution model. Moreover, we have demonstrated how all of the above features can be provided along with fault tolerance, without using elaborate mechanisms like checkpointing, replication or process groups.

Thus, we have implemented a fault-tolerant virtual multiprocessor on a network of workstations, in an effort to make distribution completely transparent and distributed parallel programming easier for the programmer. Our claims of good performance using this approach are based on our experience with the Calypso system, as we share many runtime mechanisms with Calypso. Stand-alone performance testing of Chime is in progress, however, and very preliminary results show that the claim is justified.

9. Acknowledgments

Sponsor Acknowledgment: Effort sponsored by the Defense Advanced Research Projects Agency and Rome Laboratory, Air Force Materiel Command, USAF, under agreement number F30602-96-1-0320. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

Sponsor Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Defense Advanced Research Projects Agency, Rome Laboratory, or the U.S. Government.

10. References

[ACD+95] C. Amza, A.L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel. TreadMarks: Shared Memory Computing on Networks of Workstations, IEEE Computer, December 1995.

[AS91] Brian Anderson and Dennis Shasha. Persistent Linda: Linda + Transactions + Query Processing. Workshop on Research Directions in High-Level Parallel Programming Languages, Mont Saint-Michel, France June 1991.

[BBD+87] J. Boyle, R. Butler, T. Disz, B. Glickfeld, E. Lusk, R. Overbeek, and R. Stevens. Portable Programs for Parallel Processors. Holt, Rinehart and Winston, Inc., 1987.

[BCZ90] J. Bennett, J. Carter, and W. Zwaenepoel. Munin: Distributed Shared Memory Based on Type-Specific Memory Coherence. In Proc. 2nd Annual Symp. on Principles and Practice of Parallel Programming, Seattle, WA (USA), 1990. ACM SIGPLAN.

[BDK95] A. Baratloo, P. Dasgupta, and Z. M. Kedem. A Novel Software System for Fault Tolerant Parallel Processing on Distributed Platforms. In Proceedings of the 4th IEEE International Symposium on High Performance Distributed Computing, 1995.

[BJ87] K. P. Birman and T. A. Joseph. Reliable Communication in the Presence of Failures. ACM Transactions on Computer Systems, Vol. 5, No. 1, pp. 47-76, 1987.

[BS93] D. Bakken and R. Schlichting. Supporting Fault-Tolerant Parallel Programming in Linda. Technical Report TR93-18, The University of Arizona, 1993.

[CG89] N. Carriero and D. Gelernter. Linda in Context. Communications of the ACM, 32, 1989.

[CK92] K. M. Chandy and C. Kesselman, CC++: A Declarative Concurrent, Object Oriented Programming Notation, Technical Report, CS-92-01, California Institute of Technology, 1992.

[C85] E. Cooper. Replicated Distributed Programs, Operating Systems Review, 19(5), pp. 63-78, Dec 1985.

[DKR95] P. Dasgupta, Z. M. Kedem, and M. O. Rabin. Parallel Processing on Networks of Workstations: A Fault-Tolerant, High Performance Approach. In Proceedings of the 15th IEEE International Conference on Distributed Computing Systems, 1995.

[DLA+90] P. Dasgupta, R. J. LeBlanc Jr., M. Ahamad, and U. Ramachandran. The Clouds Distributed Operating System. IEEE Computer, 1990.

[GBD+94] Al. Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Mancheck, and Vaidy Sunderam. PVM: Parallel Virtual Machine. The MIT Press, 1994.

[GLS94] W. Gropp, E. Lusk, A. Skjellum. Using MPI Portable Parallel Programming with the Message Passing Interface. MIT Press, 1994, ISBN 0-262-57104-8.

[HPF93] High Performance Fortran Forum. High Performance Fortran Language Specification Version 1.0, May 1993. Also in Scientific Programming, Vol. 2, No. 1 and 2, Spring and Summer 1993; also Tech Report CRPC-TR92225, Rice University.

[JA91] R. Jagannathan and E. A. Ashcroft. Fault Tolerance in Parallel Implementations of Functional Languages, In The Twenty First International Symposium on Fault-Tolerant Computing. 1991.

[JF92] R. Jagannathan and A. A. Faustini. GLU: A Hybrid Language for Parallel Applications Programming. Technical Report SRI-CSL-92-13, SRI International. 1992.

[K96] Dilip R. Khandekar. Quarks: Distributed Shared Memory as a Basic Building Block for Complex Parallel and Distributed Systems. Master's Thesis. University of Utah. March 1996.

[LFS93] J. Leon, A. Fisher, and P. Steenkiste. Fail-safe PVM: A Portable Package for Distributed Programming with Transparent Recovery. Technical Report CMU-CS-93-124, CMU, 1993.

[MSD97] D. Mclaughlin, S. Sardesai, and P. Dasgupta. Calypso NT: Reliable, Efficient Parallel Processing on Windows NT Networks, Technical Report, TR-97-001, Department of Computer Science and Engineering, Arizona State University, 1997.

[PWC+81] G. Popek and B. Walker and J. Chow and D. Edwards and C. Kline and G. Rudisin and G. Thiel, LOCUS: A Network Transparent, High Reliability Distributed System, Operating Systems Review, 15(5), pp. 169-177, Dec 1981.

[R95] Jeffrey Richter. Advanced Windows: The Developer's Guide to the Win32 API for Windows NT 3.5 and Windows 95. Microsoft Press, Redmond, WA, 1995.

[S90] V. S. Sunderam. PVM: A Framework for Parallel Distributed Computing. Concurrency: Practice and Experience, 2(4):315-339, 1990.


Footnotes

[1] This research is partially supported by grants from DARPA/Rome Labs, Intel Corporation, Microsoft and NSF.

[2] Effective and innovative harnessing of the computing power of workstation networks is one of the research goals of the Milan (Metacomputing In Large Asynchronous Networks) project. The Milan project is a joint effort of Arizona State University and New York University.

[3] While we use CC++ for writing programs, Chime does not use any CC++ runtime systems to execute the program.

[4] The mechanisms for providing fault tolerance and load balancing are adapted from Calypso and are not addressed in detail in this paper. The reader is referred to [DKR95, BDK95, MSD97].

[5] The term “atomic” is taken from CC++ terminology. It implies “non-interleaved” execution rather than “indivisible” execution.

[6] It must be noted that the programmer uses logical parallelism to exploit the parallelism inherent in the application, which is independent of the physical parallelism and the execution model used at runtime.

[7] Eager scheduling and TIES form the basis of our fault tolerance and load balancing mechanisms. Complete descriptions of these mechanisms are beyond the scope of this paper.

[8] The performance of Calypso on matrix multiply, even in the presence of dynamic faults, is very good and has been reported in [BDK95]. Similar results for ray tracing under Calypso NT are reported in [MSD97].