Distributed Context Switching:
A Technique to Improve Turnaround Times
of Distributed Computations [1]


Donald McLaughlin and Partha Dasgupta


Department of Computer Science
Arizona State University
Tempe AZ 85287

Email: dmclaugh@enuxsa.eas.asu.edu, partha@asu.edu
phone: 1 602 965 5583, fax: 1 602 965 2751


Abstract

High performance parallel computing platforms are, with increasing frequency, moving away from expensive multiprocessors to cost effective networks of small (but powerful) computers, typically powered by off the shelf microprocessors. This paper presents a new scheduling policy for managing distributed parallel computations. Distributed Context Switching is the policy of moving an executing task off one machine to another machine in a distributed network. As is obvious, such distributed context switching is not only difficult to implement but adds significant overhead. When each machine runs exactly one process of a distributed (or parallel) computation, such a policy seems needless and counterproductive. Yet we show that this is not so. We also show that preemptive scheduling policies in a distributed system can lead to better turnaround times for parallel programs.

In this paper we investigate common situations that occur in most parallel processing systems that actually benefit from distributed context switching. We show how impressive performance benefits (in terms of improved turnaround times) can be achieved by judicious use of context switching, in spite of the overhead associated with each switch. In addition, we describe a scheduling algorithm that provides optimal performance while minimizing the number of switches required. We also state our assumptions and discuss the approaches that can be taken when the actual system does not meet them.

We have implemented the distributed context switching policy on the Calypso parallel processing system. We show actual performance results obtained by running a Ray-Tracing application.

1. Introduction

Distributed Systems are being used with increasing frequency to run high volume parallel computations. They are cost effective, ubiquitous platforms with an enormous potential for delivering CPU cycles. The popularity of PVM[GS92, LFS93], MPI[GLS94], and Linda[CG89, BS93] attests to the practicality of using workstation networks for high performance computing. This paper addresses the problem of assigning concurrently executable segments of a parallel computation to a set of available machines in order to attain the best possible turnaround time with the minimum amount of overhead. We investigate the use of distributed context switching as a method of decreasing turnaround time.

1.1 Parallel Computations on Distributed Networks

In a distributed computing system, a parallel computation is split into a set of concurrently executable tasks (or processes) and each task is assigned to one machine on a network. However, neither the number of tasks into which a computation splits nor the number of machines available is predictable in advance. So it is possible (and in fact desirable, especially in systems such as Linda[CG89, BS93], Concurrent C++[GR88], and Calypso[BDK95, DKR95]) to decouple the degree of parallelism from the number of actual machines available at some point in time. This leads to a mismatch between the logical and physical structure of parallelism. Such a mismatch also occurs whenever parallel computations are run on systems supporting process migration or in systems in which the number of available machines may change.

Process migration in distributed systems is a mechanism for transferring incomplete computations from one machine to another. Process migration is difficult to implement in most systems, but it has been studied extensively for use in load balancing. Load balancing is a strategy by which processes from a heavily loaded machine are transferred to one or more lightly loaded machines to increase system performance.

1.2 Scheduling Parallel Tasks in Distributed Systems

A parallel computation consists of sequential parts and parallel parts. It starts as a sequential computation, then splits into parallel tasks, and then coalesces into a sequential part after a barrier. The program may then run additional parallel parts. The scheduling problem is to assign the parallel tasks to available processors while the program is executing a parallel step.

Many distributed parallel computing systems, notably PVM[GS92, LFS93], MPI[GLS94], TreadMarks[ACD+95] and Quarks[QUA96], assume that the number of parallel tasks p (or the width of the computation) is equal to the number of available machines m. In such systems, each parallel task is then assigned to run on one machine. Further, for high performance, each machine runs only one parallel task and nothing else; thus p = m. In this case scheduling is simple: each task is sent to one machine and the results are retrieved after it completes. Finally, the results from all p tasks are coalesced to form the final result. In this situation, since each machine is running exactly one computing task (we assume no other external tasks), there is no need to load balance and hence no need for process migration.

However, it is not always possible to ensure p = m. Conditions change dynamically: machines become unavailable, new machines become available, or the width of the computation changes. From a programmer's standpoint, the program should not be written based on the number of machines available at runtime. When p ≠ m there is a need for scheduling. Many systems use some simple form of non-preemptive scheduling. Preemptive scheduling or co-scheduling (gang scheduling) has been used in shared memory multiprocessor systems, but not in distributed systems. This paper presents a preemptive scheduling scheme for distributed systems.

In this paper, we make the assumption that each processor (or machine) executes only one parallel task at a time and runs no other tasks.

1.3 Distributed Context Switching

Distributed Context Switching is the distributed version of regular context switching. In the distributed case, a task may relinquish a processor (or machine) and wait to be run by another processor (or machine). Process migration is the mechanism needed to implement distributed context switching; that is, when a task is to be context switched out of machine M1 to machine M2, we can use established methods of process migration to enable the switch.

Note that we distinguish between process migration and distributed context switching. Process migration has typically been used in load balancing. In systems that do load balancing, multiple computations (or tasks) execute on each machine and the number of tasks per machine is adjusted via process migration. As per the above assumption, we are addressing a somewhat different problem.

When p is not equal to m, in particular when p > m, distributed context switching becomes not only a viable option but a desirable one. In this paper, we show cases in which moving a sole task (or process) off of one machine to another identical but idle machine can result in faster processing even though the movement bears significant overhead. We present an optimal algorithm that lowers turnaround times and minimizes the total number of context switches. We also present the results of actual performance tests that show significant gains in spite of the overhead paid for supporting the distributed context switch.

1.4 A Global Assumption

This research is done in the context of the Calypso parallel processing system [BDK95, DKR95]. Calypso runs on workstation clusters and supports Windows NT, Solaris, and Linux. The Calypso system is available for use and can be downloaded from [CAL96].

Calypso is a shared memory system that makes a few assumptions about the nature of parallel computations. A parallel step consists of a set of tasks. These tasks share memory; however, a task reads the values of shared variables as they existed before the task started, or newer values only if the task itself has updated them. It cannot see updates made by other tasks. Also, tasks run independently and do not communicate or synchronize. A task can update a variable (of any granularity) as long as no other task updates the same variable. Thus all programs are data race free. We retain these assumptions.

We have implemented a rich set of applications using these assumptions, including ray-tracing, radiative heat transfer, financial applications, matrix manipulation, automatic target recognition, and image processing. As a result, we feel the assumptions are not restrictive in the arena of compute intensive application programming.

1.5 Paper Overview

We motivate the need for distributed context switching in the next section (section 2). Section 3 describes our approach and the distributed context switching algorithms. Section 4 discusses how task lengths can be estimated, and section 5 describes our implementation of the system and presents performance results. Section 6 discusses related work.

2. Motivation

As stated earlier, Calypso is a parallel processing system that runs shared memory parallel applications on a set of workstations, with built-in support for fault tolerance, load balancing, and intelligent scheduling. In this paper, we address one of the scheduling problems in Calypso: optimizing the turnaround time (or speed) of parallel computations.

As we stated above, if a set of m machines is available for a parallel computation, the computation may be split into m concurrent tasks, only to find that fewer than m machines are actually available once these tasks are scheduled to run. The solution to this problem is to schedule these tasks in several sets that can be handled by the number of machines available. The drawback of this approach, in its simplest form, is that some of the machines assigned to this computation will remain idle while the last set of tasks is being processed. The result is wasted machine time and an increase in turnaround time for the computation as a whole. Clearly a more sophisticated scheduling method is required.

Let us consider a simple example in which we have 4 tasks and 3 machines. Each task takes t seconds to execute on a single processor. Scheduling the tasks naively, we get the schedule shown in figure 1.

Figure 1: Simple schedule for 4 tasks and 3 machines.

In the above diagram, we can see that the total time taken by the computation is 2t, where t is the time taken to run each task. Also, 2 of the machines are idle for t time each and do not contribute to the progress of the computation. In high performance computing, where turnaround time is of concern, such scheduling will lead to poor system utilization and performance.

Schedulers used in multiprocessor systems are able to avoid this problem through the use of time slicing. A time slice is typically a fraction of a second (about 100 ms). Each processor picks a non-executing task and, after the time slice is over, switches to another non-executing task. Since the individual processors share a common memory, or can at least cache the relevant portions of a global memory in their local memory, context switches in themselves are not burdened with excessive overhead. These systems may, therefore, allow processors to context switch between tasks frequently. The net effect, in this case, would be to share these processors equally among a set of tasks so that no processor is idle for any significant period of time.

By contrast, tasks in a distributed system execute on individual workstations separated by network connections. These processors are each bound to local memory and have no knowledge of the contents of other workstations in the system. If a machine in this type of system wishes to take over the execution of a given task, the context switch requires that the entire task, as well as its current state, be transferred from the machine currently hosting it to the target machine before processing can be resumed. Due to the overhead of this transfer, time slicing in a distributed environment becomes infeasible.

A new approach to scheduling is therefore required for this type of parallel computation in a distributed system. While pure time slicing is too expensive, a simple non-preemptive scheduling scheme provides far from optimal performance. We suggest an approach in which a minimum number of context switches is made in order to prevent processors from remaining idle. In this way, the cost of transferring an incomplete task can be compensated for by the increase in the number of processors that can be used in processing.

3. Our Approach

3.1 The simple example continued

Taking the example shown above, we can actually improve the total time taken for the computation by dividing each task into, say, 2 segments [2]. At first, each machine chooses to execute the first segment of tasks 1-3. After these tasks are half-way completed (t/2 seconds), machines 1 and 2 continue to execute tasks 1 and 2 but machine 3 switches to task 4. After another t/2 seconds, machines 1 and 2 finish tasks 1 and 2. Machine 1 then remains idle, machine 2 starts the second segment of task 3, and machine 3 continues with task 4. Finally, all tasks are complete in another t/2 seconds, and the total computation takes 1.5t seconds.

Even this is not the optimal solution. The first machine is still left idle while the last segments of tasks 3 and 4 are completed. If all machines are to be kept employed, the total number of segments must be a multiple of the number of machines.

Figure 2: Dividing the tasks into 2 segments, better performance

If each task is divided into 3 segments instead of 2, there will be 12 segments to schedule between 3 machines (figure 3). All machines remain active, in this case, and the processing time (or turnaround time) has been reduced to 1.33 times that required to complete a single task. As is obvious, this is an optimal solution in terms of total time needed. The number of actual context switches is 2 (task 2 is moved from machine 2 to machine 1, and task 3 is moved from machine 3 to machine 2).

Figure 3: Dividing tasks into 3 segments, optimal performance
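In text form, one schedule consistent with this figure is shown below, where Ti.j denotes segment j of task i and each column is one interval of length t/3:

    time:      0-t/3   t/3-2t/3   2t/3-t   t-4t/3
    machine 1: T1.1    T1.2       T1.3     T2.3
    machine 2: T2.1    T2.2       T3.2     T3.3
    machine 3: T3.1    T4.1       T4.2     T4.3

Every machine executes 4 segments of length t/3, so all machines finish at 4t/3, or about 1.33t.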

The above scheme is optimal in terms of turnaround time, excluding the time for 2 context switches. If the execution time is larger than the context switch time (which is the case for large-grained computations in a distributed system), then the payoff is quite large. The difficulty is, of course, predicting each task's runtime so that the segments can be defined. This problem is addressed in section 4.

3.2 A General Allocation Scheme

We now discuss the general case of scheduling p tasks on m machines in a distributed environment. We first make some assumptions:

  1. All tasks are of equal length.
  2. The length of a task is known before it is executed.
  3. Tasks are large grained; that is, the execution time of a task is much larger than the cost of a context switch.

Assumption 1 is true in most cases of parallel processing. In case it is not, we can still use our algorithm and get good performance improvements. Assumption 2 is a stronger assumption, and we discuss it further in section 4. Assumption 3 is a safe assumption in distributed implementations of parallel tasks.

The problem of scheduling p tasks between m machines (or processors) gives rise to 4 cases.

  1. p<m: In this case no scheduling is necessary; any assignment will provide an optimal runtime.
  2. p=m: Same as case 1; assigning each task to one machine is optimal.
  3. p>m: This is the case addressed by our algorithm. It provides optimal turnaround time and a minimum number of context switches.
  4. p>>>m: In this case, the idle time produced by the naïve algorithm might be such a small percentage of the total time that our algorithm, though optimal, may not provide a justifiable improvement.

In general, given a set of p tasks to be executed by m machines, where p > m, the optimal solution requires that the tasks share the processors through time slicing. If we were to assume that the overhead associated with context switching was negligible, time slices of arbitrarily small duration could be chosen. In this case, the optimum time in which a given set of p tasks could be processed by m machines would be (pt)/m, where t is the time taken to complete one task. Thus (pt)/m represents the optimum time which our allocation scheme should approach. In reality, distributed context switching is costly. We must, therefore, find an allocation scheme that approaches this optimum with a minimum number of context switches.
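For instance, in the example of section 3.1 we had p = 4 and m = 3, so the optimum is (pt)/m = 4t/3, or about 1.33t, which is exactly the turnaround time achieved by the 3-segment schedule of figure 3.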

3.3 The Task Allocation Scheme

Now we present a task allocation scheme that allocates p tasks on m machines (or processors) and achieves optimal execution time by utilizing distributed context switching. As discussed earlier, this scheme is appropriate in the cases where p > m.

Case 1: p > km, where k ≥ 2 is an integer.

Let the m machines grab m tasks at a time and run them to completion. Repeat the processing until the remaining number of tasks is just less than 2m but greater than m. Then the remainder of the problem is identical to Case 2.

Case 2: 2m > p > m.

This case forms the base case of our algorithm. Since all other cases reduce to this by using the above assignment algorithm, we will discuss next how to allocate machines to tasks for this case.

We begin by dividing each task into m segments of equal length. We choose this value so that we are guaranteed that the segments can be divided equally among the available processors. We also observe that this division, which requires that each machine handle exactly p segments, is consistent with the time constraint (pt)/m discussed previously.

Now, let the m machines again grab m tasks, as in case 1, and begin to run them. We call this the “initial task set”. This leaves an additional p-m tasks, the “delayed task set”, that must be started at some later time.

The initial task set runs uninterrupted for p-m segments; that is, the first p-m segments of each task are completed. Then out of the m machines, any p-m machines stop executing the initial task set and start executing the delayed task set. They execute the delayed task set to completion without further interruption. Note that these machines execute a total of (p-m) + m = p segments, and thus take (pt)/m time.

Now we are left with m unfinished tasks each of which has m-(p-m) = 2m-p unfinished segments. We also have 2m-p machines. Thus we have a problem of assigning p' = m tasks to m' = 2m-p machines. This is a sub-problem of the original problem, but the strongest statement we can make about p' and m' is that p' is greater than m'. Therefore, we must consider both cases 1 and 2 in our solution.

Let us first consider the situation where case 2 applies directly, since this is the recursive case. We solve this problem as discussed above, except that we now have m' machines and we choose p'-m' machines from the initial task set to execute the delayed task set. These machines have now executed a total of (p-m)+(p'-m')+m' = (p-m)+p' segments. Since p' = m, this total is p segments, so these machines also take (pt)/m time.

In case 1 we know that p' > km', k ≥ 2. We must therefore select k-1 sets of m' tasks and run them to completion before case 2 again applies. Since each of these tasks contains m' segments, these rounds account for (k-1)m' segments per machine. In addition, we know from the above discussion that each machine will execute p'' segments in this new case 2, where p'' = p'-(k-1)m' and p' = m. So the total number of segments executed by each machine is:

(p-m) + (k-1)m' + (m - (k-1)m'), which reduces to p.

Finally, we can show that the total number of context switches will always be p-1. We note that there must exist a base case in which one machine is required to execute some number of tasks p_n, where p_n ≥ 2. This follows from the fact that the number of tasks is greater than the number of machines. In this base case the number of context switches is p_n - 1.

The sub-problem that led to this base case must have been of type case 2. In this sub-problem there were 1+x tasks, and the x additional context switches led to the base case. The number of context switches therefore remains one less than the number of tasks, that is, p-1.

If we look back two sub-problems instead of one, we find that this sub-problem may have been of either type case 1 or type case 2. In this case there was an additional number of tasks equal to a multiple of the number of machines. Therefore, we again find that the number of additional context switches is equal to the number of additional tasks, so the total number of context switches is still p-1.
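To make the allocation scheme concrete, the following C++ sketch (ours, not Calypso's scheduler code) computes the points at which machines must context switch, assuming equal task lengths t and segment length seg = t/m. It relies on the invariant derived above: whenever case 2 applies, each remaining task has exactly as many unfinished segments as there are remaining machines.

    // Sketch: switch points for scheduling p tasks on m machines (p > m),
    // assuming equal task lengths and segment length seg = t/m.
    #include <cstdio>

    static void schedule(int p, int m, double seg, double start)
    {
        if (m <= 0 || p <= m)
            return;                       // one task per machine: trivial
        while (p >= 2 * m) {              // case 1: run m whole tasks per
            start += m * seg;             // round (each task has m segments
            p -= m;                       // left, by the invariant)
        }
        if (p <= m)
            return;
        // Case 2: m < p < 2m. The initial set runs for p-m segments; then
        // p-m machines switch to the delayed set and run it to completion.
        start += (p - m) * seg;
        std::printf("t=%.2f: %d machine(s) switch to the delayed task set\n",
                    start, p - m);
        // Sub-problem: p' = m tasks, each with 2m-p segments left, on
        // m' = 2m-p machines.
        schedule(m, 2 * m - p, seg, start);
    }

    int main()
    {
        schedule(4, 3, 1.0 / 3.0, 0.0);   // the example of section 3.1 (t = 1)
        return 0;
    }

For the example of section 3.1 this prints freeze points at t = 0.33 and t = 0.67, which correspond to the two migrations of figure 3 (task 3 moving to machine 2 and task 2 moving to machine 1).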

Figure 4: Schedule for 7 tasks on 5 machines showing recursive algorithm

4. Estimating Task Length

The scheme that we have proposed depends on our ability to determine the length of a task before it has been processed. At first glance this seems impossible, but in practice it is not. Even in cases where task lengths truly cannot be estimated, distributed context switching can still be used (see Method 3).

Below are some methods that can be used to estimate task lengths.

Method 1: Dynamic History

This method works for the case where the number of tasks (p) is greater than twice the number of machines (m). In this case, in the first round, all machines execute the selected tasks to completion. Thus, we can actually measure the time taken per task. This time is then used in the last round to calculate segment lengths and context switching times. This method works for a large number of situations.

Method 2: Static History

If 2m > p > m, then we seemingly have no information about task length. However, we make the following observation:

“High Performance Computing very often involves running the same program over and over again for multiple runs on different data.”

In this case, each run should keep a history of the time taken per task, for all tasks, and this information can be used to make reasonably accurate guesses of the time taken by each task.

Method 3: Round Robin Scheduling

In some cases, we are not sure about the task length because no history is available or because of large variations in the running time of the computation over a number of runs. In such situations, we rely on a sub-optimal method with higher overhead. We choose a time slice scheme where the time slice is significantly larger than the context switch overhead. The machines then execute the tasks round robin: each machine runs a task for one time slice, freezes it, and picks up a waiting task, cycling in this fashion until all tasks are complete.
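A minimal sketch of this round robin scheme, assuming a fixed time slice and a single ready queue of tasks, is shown below. In the real system the manager hands out assignments to the workers; here one loop stands in for a single machine, and all names are ours.

    // Sketch: round robin scheduling (Method 3). A task that does not
    // finish within its time slice is frozen and requeued; each requeue
    // corresponds to a distributed context switch when another machine
    // later resumes the task.
    #include <cstdio>
    #include <queue>

    struct Task {
        int id;
        double remaining;                 // remaining work, in seconds
    };

    void roundRobin(std::queue<Task>& ready, double slice)
    {
        while (!ready.empty()) {
            Task t = ready.front();
            ready.pop();
            double run = (t.remaining < slice) ? t.remaining : slice;
            std::printf("run task %d for %.1f s\n", t.id, run);
            t.remaining -= run;
            if (t.remaining > 0.0)
                ready.push(t);            // freeze and requeue
        }
    }

Because the slice is chosen to be much larger than the context switch overhead, the cost of the extra switches remains a small fraction of the total running time.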

This Round Robin scheduling algorithm has been extensively used in our performance studies (section 5.2) and is indeed a viable solution for increasing throughput of parallel computations.

Thus we have three methods for different conditions. As is apparent, all these methods would provide better throughput than the naive (non-preemptive) methods of task assignment.

5. Implementation on Calypso System Running Under Windows NT

In this section we discuss the details of our implementation of the distributed context switch using process migration. There are many methods of process migration, and these are system dependent. See the related work section for a description of some approaches. Our implementation was done by modifying Calypso NT version 1.0. Calypso NT is a version of Calypso that runs on Windows 95, Windows NT 3.51, and Windows NT 4.0. Calypso NT is downloadable from the Calypso web site [CAL96]. Our work was done on Windows NT 4.0.

5.1 Implementation Details

We built our migration mechanism into the Calypso system. In Calypso, a task runs under the control of a central manager. The manager spawns a worker process on a machine that has indicated its willingness to host such processes. The manager then assigns tasks to this process. The process then executes the code for the task and accesses the shared memory (shared by all tasks, as Calypso is a shared memory system). As the worker accesses shared memory, the pages are demand paged into the worker from the manager. At the end of the task, the memory updates (differences) are sent back to the manager so that the master image can be updated.
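As an illustration of the update "differences" mentioned above, the sketch below shows one common way such an update can be computed, assuming the worker keeps a pristine twin of each page it demand-paged from the manager. The encoding and names are ours; Calypso's actual format may differ.

    // Sketch: computing a page diff at task end by comparing the page
    // against its pristine twin, so that only the words the task modified
    // are shipped back to the manager's master image.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct DiffEntry {
        std::size_t offset;               // byte offset within the page
        std::uint32_t value;              // new value of the word
    };

    std::vector<DiffEntry> makeDiff(const std::uint32_t* twin,
                                    const std::uint32_t* page,
                                    std::size_t words)
    {
        std::vector<DiffEntry> diff;
        for (std::size_t i = 0; i < words; ++i)
            if (page[i] != twin[i])       // word modified by this task
                diff.push_back({i * sizeof(std::uint32_t), page[i]});
        return diff;
    }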

We expanded the architecture of this version of Calypso. In this new architecture, both the worker and the manager contain two separate threads (Windows NT supports kernel-level preemptable threads). In the manager, this allows a cleaner separation to be made between scheduling responsibilities and shared memory management responsibilities. However, in the worker the resulting benefits are somewhat more interesting. One of the threads becomes an "application" thread which is responsible for performing the tasks assigned by the manager. The other thread becomes a "controlling" thread which receives the task assignments from the manager and instructs the application thread to actually perform the work. While this may at first appear somewhat redundant, this separation of control from task execution is the basis for our process migration mechanism.

Windows NT provides the facility for suspending a thread and retrieving or setting its context via a context structure and then resuming that thread again. In our implementation of process migration, the controlling thread uses this facility to suspend the execution of the application thread. It can then retrieve this context structure for the application thread, and “freeze” that thread's state by writing this structure to a shared disk file along with the contents of the thread's stack. At some later point, the task may be resumed by any worker process by retrieving this state information for the frozen computation.
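A minimal sketch of this freeze/resume facility, using the Win32 thread context calls, follows. The function names, the on-disk format, and the assumption that the application thread's stack region (g_stackBase, g_stackSize) occupies the same addresses in every worker process are ours, not Calypso's actual implementation.

    // Sketch: freezing and resuming an application thread via the Win32
    // SuspendThread/GetThreadContext/SetThreadContext/ResumeThread calls.
    #include <windows.h>
    #include <stdio.h>

    extern char*  g_stackBase;            // bounds of the application
    extern size_t g_stackSize;            // thread's stack region

    BOOL FreezeTask(HANDLE hAppThread, const char* fileName)
    {
        CONTEXT ctx;
        ctx.ContextFlags = CONTEXT_FULL;  // registers and control state
        if (SuspendThread(hAppThread) == (DWORD)-1)
            return FALSE;
        if (!GetThreadContext(hAppThread, &ctx))
            return FALSE;
        FILE* f = fopen(fileName, "wb");  // the shared disk file
        if (!f)
            return FALSE;
        fwrite(&ctx, sizeof(ctx), 1, f);          // saved register context
        fwrite(g_stackBase, 1, g_stackSize, f);   // saved stack contents
        fclose(f);
        return TRUE;
    }

    BOOL ResumeTask(HANDLE hAppThread, const char* fileName)
    {
        CONTEXT ctx;
        FILE* f = fopen(fileName, "rb");
        if (!f)
            return FALSE;
        if (SuspendThread(hAppThread) == (DWORD)-1)
            return FALSE;
        fread(&ctx, sizeof(ctx), 1, f);           // restore register context
        fread(g_stackBase, 1, g_stackSize, f);    // overlay the stack
        fclose(f);
        if (!SetThreadContext(hAppThread, &ctx))
            return FALSE;
        ResumeThread(hAppThread);         // continue from the frozen point
        return TRUE;
    }

In the actual system the controlling thread performs these steps on behalf of the application thread, and shared memory pages touched by the resumed task are demand paged from the manager rather than stored in the file.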

A distributed context switch may therefore be seen as the following series of events:

  1. The manager sends a "freeze task" message, containing a file name, to the worker currently executing the task.
  2. The worker's controlling thread suspends the application thread and retrieves its context structure.
  3. The controlling thread writes the context structure, along with the contents of the application thread's stack, to the named shared disk file.
  4. The worker notifies the manager that the task is frozen, making the machine available for another assignment.

At some later time, the manager will send a "resume task" message containing the same file name to another free worker. This is a request to the worker to start executing a computation that has been "frozen" at some point earlier on. The steps in resuming the task are as follows:

  1. The controlling thread reads the saved context structure and stack contents from the shared disk file.
  2. It restores the application thread's stack and sets the thread's context to the saved context.
  3. It resumes the application thread, which continues the computation from the point at which it was frozen.
  4. Shared memory pages referenced by the resumed task are demand paged from the manager as needed.

We have implemented this architecture on Windows NT 4.0. We have thoroughly tested the implementation and are satisfied that we can migrate tasks in this way. Our testing included a number of situations in which the switch was attempted while the computation was executing deep in the guts of some runtime libraries. However, a process blocked on a system call cannot be context switched out and transported to another machine. We have thus guarded some of the calls (such as page faults) to ensure that we do not attempt to switch while they are in progress. Furthermore, since Calypso does not allow the parallel tasks to perform I/O, we have not encountered any problems due to this limitation.

5.2 Results

We ran a set of experiments to verify the following:

  1. The Distributed Context Switching mechanism works correctly.
  2. The Distributed Context Switching algorithm is “implementable” and provides increased throughput.
  3. The overhead of context switching is not too high.

Result A:

In order to test the correctness of our implementation of the Distributed Context Switching mechanism, we ran the system through a large number of tests involving thousands of switches. To ensure the results were correct and "glitch free", we chose a Ray Tracing application and ran it on various "scenes". In each case, we displayed the resulting image. Any error, even in a single pixel, could be detected visually in the picture; and of course, initially we had plenty of pixel level errors. After all the bug fixes, the program ran well even with hundreds of context switches per run. We thus believe we have implemented the mechanism correctly.

Result B:

Next, we designed a set of experiments to validate the conceptual argument that some forms of preemptive scheduling would be better than non-preemptive scheduling in parallel computations. Again, we used the Ray Tracing application and chose a picture that was symmetrical, giving us equal task lengths in the parallel computation. We then used three types of scheduling to run the application.

  1. Eager Scheduling: Eager scheduling is the standard scheduling algorithm used in Calypso. It works very well in situations where CPU speeds on the different machines are not equal, the machines have dynamically varying loads, the machines are prone to fail, and/or new machines can join the computation. However, eager scheduling produces sub-optimal turnaround times in cases where the number of tasks is slightly higher than the number of processors.
  2. Round Robin Scheduling: We have discussed Round Robin Scheduling in section 4 (method 3). We used this scheduling method with a time quantum that varied from 5 seconds to 50 seconds.
  3. Optimal Distributed Context Switching: This method has also been discussed. In this case, we measured the task lengths and then used these values to compute the switching points.

We ran two separate timing experiments in order to compare the performance of these three scheduling methods. The experiments were run on five Pentium-90 systems with 32 Mbytes of memory connected with a 100Mb/s Ethernet. The operating system used was Windows NT 4.0 beta 2. The compiler used was Visual C++ 4.0. All timings are elapsed times, measured by a wall clock (not system reported runtime).

For experiment #1 we chose to split the parallel computation into 3 tasks and use 2 machines to run these tasks. The results are shown in Figure 5.

Our initial test was to determine the time required for sequential execution of the application. This was accomplished using the original version of the application written in (sequential) C and run without using Calypso. All possible compiler provided optimizations were turned on during these tests. This is an honest depiction of the baseline; that is, the fastest an application could run sequentially. In this case, the ray-traced picture was produced in 757 seconds.

We then tested the performance of Eager Scheduling as well as Round Robin Scheduling using the parallel version of the application. Eager Scheduling produced the picture in 518 seconds. For the Round Robin method, the time quantum of 40 seconds produced the fastest run, at 458 seconds. We wish to comment here on two interesting results of the Round Robin testing. The first is that the 40 second time quantum is better than the 50 second time quantum (which sounds counterintuitive) because 50 seconds is "too large": at the end, one processor remained idle while the other was finishing its task. The second is that for the 5 second quantum, the context switching overhead significantly reduced performance (it took 600 seconds).

The final test was conducted with pre-scheduled optimal distributed context switching. It produced the picture in 400 seconds, which gave us a speedup of 1.89 on 2 machines. Note that this experiment takes into account all overhead, including that of demand paging shared memory pages, coalescing shared memory updates, network traffic, context switching and so on.

Figure 5: Performance Results for 3 tasks on 2 machines

In experiment #2 we used 3 machines but split the parallel computation into 5 tasks. The results are shown in Figure 6. This time, Eager Scheduling produced a respectable 319 seconds yielding a speedup of 2.37. As before, the round robin scheduler performance depended heavily on the chosen time quantum, and eager scheduling was not very attractive (though not too bad). The optimal algorithm, in this case, (278 seconds) was only slightly better than round robin scheduling with a time quantum of 50 seconds (283 seconds). As before, the speedup achieved was quite good, i.e. 2.72 on 3 machines. Note again that we are computing speedup by comparing the parallel execution time to a real sequential program execution time.

Figure 6: Performance Results for 5 tasks on 3 machines

To conclude, in most cases, preemptive scheduling did better than eager scheduling (which is a good non-preemptive scheduler). Round robin scheduling in distributed systems is a viable option which, to our knowledge, has not been tried before. Our optimal algorithm can produce good results but the task lengths need to be known or estimated.

Result C:

We have, of course, wondered what the context switching overhead is. It is, however, very difficult (if not impossible) to measure.

The overhead associated with the remote context switch is dependent on how the task is migrated from one machine to another. As discussed previously, some systems, such as MPVM, move the entire task, as well as its current state, to the new machine. In these systems, the overhead associated with migration is dominated by the cost of transmitting the image of the process from machine to machine across a network connection.

The overhead associated with a remote context switch in our implementation (on top of Calypso) is more difficult to quantify. Our process migration technique is innovative and has the potential for lower overhead than copying the process image from one machine to another. We transfer only part of the data updated by the source process into the space of the target process. The rest of the data is treated as a cache and is reused. References made to shared memory by the migrated task that had not been made by the target task are brought in later, on demand. The cache plays a considerable part in determining overhead, and cannot be quantified in general.

However, we can make a very pessimistic calculation based on the above results. Let us take experiment 1. The round robin scheduler with a time quantum of 5 seconds took 600 seconds, while the optimal algorithm took 400 seconds. We can thus assume there were about 120 switches (600/5), and the extra time over the optimal algorithm was 200 seconds. Thus, if we attribute all of the extra time to context switching, we get a context switch overhead of about 1.67 seconds per switch. A similar computation for experiment 2 gives a context switch overhead of 1.53 seconds.

Note that this method includes overhead for time spent in paging, messaging, extra time spent idling at the end of the computation, and so on. Hence we feel it is a pessimistic estimate. Thus, context switching overhead is somewhere in the 1 second to 2 second range. Considering that each task runs for hundreds of seconds, a few switches costing a few seconds each is a feasible overhead, especially since switching reduces idle time by tens or even hundreds of seconds. Of course, all this is application dependent and large-grained applications will benefit the most. Then again, only large grained parallel applications are run on distributed systems.

6. Related Work

Research related to this effort falls under several categories:

  1. Parallel Processing on distributed networks.
  2. Scheduling and job placement in parallel systems
  3. Process migration techniques and their use in load balancing systems.

6.1 Parallel Processing on Distributed Systems

A large number of systems exist that are capable of supporting high performance parallel computations on networks of machines. The most popular amongst such platforms are PVM[GS92, LFS93], MPI[GLS94] and Linda[CG89, BS93].

PVM and MPI are designed for portable, parallel computing on a wide range of platforms from workstations to supercomputers. They use a message passing layer that allows distributed computations to communicate, synchronize and collectively compute parts of a larger computation.

Linda uses a tuple-space idea to provide a shared bag of tuples which in turn can be used to provide several features, including data transfer and control transfer between parts of a distributed computation.

Other message passing or RPC based systems include Amber[CAL+89], Isis[BSS91], Concert/C[ABG+92, AGG+94], GLU[JA91], ORCA[BT88, BT90] and so on.

Some systems use distributed shared memory to provide a more natural application programming interface to the programmer. Such systems include Mether-NFS[MF89], Midway[BLL88], Munin[BCZ90], Treadmarks[ACD+95], Quarks[QUA96], Clouds[DLA+90].

Almost all of the above systems use a naïve approach to scheduling parallel computations. In PVM and MPI the computation is split into as many parts as there are machines to run them, and each part is then statically assigned to a machine. Linda uses a more dynamic scheme in which each machine picks up work out of a bag of tasks. Other systems use one of these two approaches, but all scheduling is non-preemptive: once a task is assigned, it runs to completion. Calypso uses a more innovative scheduling technique called Eager Scheduling, but in its current form it is non-preemptive as well. In this paper we have presented a preemptive scheme that provides optimum turnaround time.

6.2 Scheduling in Parallel Processors

One of the major advantages of multiprocessor systems, when compared with distributed systems, is their ability to support much more fine grained parallelism. This advantage is inherent in their design since all processors share a common memory. In addition, a great deal of work has been done to support multiple threads of execution within a single address space, thus further reducing the cost of supporting and scheduling parallel computations.

Multiprocessor scheduling has tended to exploit the low cost of context switching in order to use dynamic scheduling of threads. This has allowed these systems to reorder processor assignments under varying loads. More recently, this has given way to somewhat less flexible scheduling methods that take into account the structure of the computation.

Coscheduling, also known as gang scheduling [GTU91], is a method that takes into account the fact that certain threads of a computation must coordinate with each other during execution. Coscheduling attempts to identify such dependencies and schedule the related tasks together.

Another form of scheduling that tends to restrict context switching is dynamic partitioning[TG89]. This method attempts to keep a task associated with the same processor whenever possible. This allows the processor to spend less time rebuilding its cache after context switches.

Handoff scheduling [Bla90] is a method by which a task that is aware of its role in a computation can suggest which task would be most advantageous to schedule next. It is a type of scheduling that tends to give more control of scheduling to individual tasks rather than the operating system.

6.3 Process Migration Techniques

Process migration is used for load balancing, as stated above. Systems that implement process migration include MPVM[CCK+95], Dynamic PVM[DLV+94], Dome[SB94], Sprite[DO89, DO91], Charlotte[AF89], Condor[LLM88, LS92]. Support for process migration also exists in Mach[MZD+93], Clouds[DLA+90] and the V-kernel[TLC85].

Process migration is typically done by freezing a process at one site and reviving it on another. The transfer technique varies from system to system. MPVM uses signals to freeze processes and the PVM message passing calls to enable the transfer of data. Sprite uses a global file system which includes the swap space to facilitate such data transfers. The V-kernel uses a scheme called pre-copying in which the transfer is started even while the process is executing. Clouds and Mach use remote demand paging like DSM systems to allow the target process to load itself.

As stated earlier, Calypso[DKR95, BDK95] uses a user level scheme in which the process is not actually migrated; rather, the computation is migrated. This is achieved by letting the old process live on and having its context overlaid with the new task assignment. Data pages are delivered on demand, as and when necessary. Much of the data remains in the client cache, lowering the overhead of context switching.

7. Conclusions

Scheduling in distributed systems is generally non-preemptive. This paper illustrates both in theory and practice how preemptive scheduling can produce speedups of computations. For large grained computations, the overhead introduced by preemptive scheduling is more than offset by higher utilization of the processors and thus better turnaround times of the computation.

In the simplest case, it is possible to benefit from preemptive scheduling by the use of round robin scheduling. For round robin scheduling, the time slice is an important parameter and choosing it carefully is necessary for good performance. However, if some information is available about the execution times of the tasks, then the scheduling can be done using an optimal algorithm. Estimation of task length can also be used to obtain the schedules.

We have implemented the round robin scheduler and the optimal scheduler in a version of the Calypso parallel processing system. The schedulers use a process migration mechanism to perform the context switch. We have presented actual timing measurements using our implementation.

8. References

[ABG+92] J. Auerbach, D. Bacon, A. Goldberg, G. Goldszmidt, M. Kennedy, A. Lowry, J. Russell, W. Silverman, R. Strom, D. Yellin, and S. Yemini. High-Level Language Support for Programming Distributed Systems. In Proceedings of the 1992 International Conference on Computer Languages, Oakland, California, April 1992.

[ACD+95] C. Amza, A.L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel. TreadMarks: Shared Memory Computing on Networks of Workstations. To appear in IEEE Computer, December 1995.

[AF89] Y.Artsy and R.Finkel. Designing a process migration facility -- the Charlotte experience, Computer, 22(9):47-56, 1989.

[AGG+94] J. Auerbach, A. Goldberg, G. Goldszmidt, A. Gopal, M. Kennedy, J. Rao, and J. Russell. Concert/C: A Language for Distributed Programming. Usenix Winter Conference Proceedings, San Francisco, CA, 1994.

[BBB+93] L.Branscomb, T.Belytschko, P.Bridenbaugh, T.Chay, J.Dozier, G.Grest, E.Hayes, B.Honig, N.Lane, W.Lester, Jr., G.McRae, J.Sethian, B.Smith, and M.Vernon. From Desktop to Teraflop: Exploiting the U.S. lead in High Performance Computing. Technical report, NSF Blue Ribbon Panel on High Performance Computing, 1993.

[BCZ90] J.Bennett, J.Carter, and W.Zwaenepoel. Munin: Distributed Shared Memory Based on Type-Specific Memory Coherence. In Proc. 2nd Annual Symp. on Principles and Practice of Parallel Programming, Seattle, WA (USA), 1990. ACM SIGPLAN.

[BDK95] A.Baratloo, P.Dasgupta, and Z.Kedem. A Novel Software System for Fault Tolerant Parallel Processing on Distributed Platforms. In Proceedings of the 4th IEEE International Symposium on High Performance Distributed Computing, 1995.

[Bla90] D.L. Black. Scheduling support for concurrency and parallelism in the mach operating system. IEEE Computer, May 1990.

[BLL88] B.Bershad, E.Lazowska, and H.Levy. PRESTO: A System for Object-Oriented Parallel Programming. Software-Practice and Experience , 18(8):713-732, 1988.

[BS93] D.Bakken and R.Schlichting. Supporting Fault-Tolerant Parallel Programming in Linda. Technical Report TR93-18, The University of Arizona, 1993.

[BSS91] K.Birman, A.Schiper, and P.Stephenson. Lightweight Causal and Atomic Group Multicast. ACM Transactions on Computer Systems, 9(3):272-314, 1991.

[BT88] H.Bal and A.Tanenbaum. Distributed Programming with Shared Data. In Proceedings of ICCL, pages 82-91, Miami, FL, October 1988. IEEE, Computer Society Press.

[BT90] H.E. Bal and A.S. Tanenbaum. Orca: A Language for Distributed Object-Based Programming. SIGPLAN Notices, 25(5):17-24, May 1990.

[CAL+89] J. S. Chase, F. G. Amador, E. D. Lazowska, H. M. Levy, and R. J. Littlefield. The Amber System: Parallel Programming on a Network of Multiprocessors. In Proceedings of the 12th ACM Symposium on Operating Systems Principles, pages 147-158. ACM, 1989.

[CAL96] The Calypso Home Page. http://www.eas.asu.edu/~calypso

[CCK+95] J.Casas, D.Clark, R.Konuru, S.Otto, R.Prouty, and J.Walpole. MPVM: A Migration Transparent Version of PVM. Computing Systems, 8(2):171-216, Spring 1995.

[CG89] N.Carriero and D. Gelernter. Linda in Context. Communications of the ACM, 32(4):444-458, April 1989.

[DKR95] P.Dasgupta, Z.Kedem, and M.Rabin. Parallel processing on networks of workstations: A fault-tolerant, high performance approach. In Proceedings of the 15th IEEE International Conference on Distributed Computing Systems, 1995.

[DLA+90] P.Dasgupta, R.LeBlanc Jr., M.Ahamad, and U.Ramachandran. The Clouds Distributed Operating System. IEEE Computer, 1990.

[DLV+94] L.Dikken, F. van der Linden, J.Vesseur, and P.Sloot, Dynamic PVM -- Dynamic Load Balancing on Parallel Systems, Proceedings Volume II: Networking and Tools, pages 273-277. Springer-Verlag, Munich, Germany, 1994.

[DO89] F.Douglis and J.Ousterhout, Process migration in the Sprite operating system. Proceedings of the 7th IEEE International Conference on Distributed Computing Systems, pages 18-25, Berlin, West Germany, 1987.

[DO91] F.Douglis and J.Ousterhout, Transparent process migration: Design alternatives and the Sprite implementation, Software -- Practice and Experience. 21(8):757-785, 1991.

[GLS94] W.Gropp, E.Lusk, and A.Skjellum. Using MPI: Portable Parallel Programming with the Message Passing Interface. MIT Press, 1994, ISBN 0-262-57104-8.

[GR88] N.Gehani and W.Roome. Concurrent C++: concurrent programming with class(es), Software -- Practice and Experience, 18, 1157-1177, 1988.

[GS92] G.A. Geist and V.S. Sunderam. Network-Based Concurrent Computing on the PVM System. Concurrency: Practice and experience, 4(4):293-311, 1992.

[GTU91] A.Gupta, A.Tucker, and S.Urushibara. The impact of operating system scheduling policies and synchronization methods on the performance of parallel applications. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, May 1991.

[JA91] R.Jagannathan and E.A. Ashcroft. Fault Tolerance in Parallel Implementations of Functional Languages, In The Twenty First International Symposium on Fault-Tolerant Computing. 1991.

[LFS93] J.Leon, A.Fisher, and P.Steenkiste. Fail-safe PVM: A Portable Package for Distributed Programming with Transparent Recovery. Technical Report CMU-CS-93-124, CMU, 1993.

[LH89] K.Li and P.Hudak. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems, 7(4):321-359, November 1989.

[LLM88] M.Litzkow, M.Livny, and M.Mutka. Condor -- A hunter of idle workstations. Proceedings of the 8th IEEE International Conference on Distributed Computing Systems, pages 104-111, 1988.

[LS89] E.Lazowska and M.Squillante. Using processor-cache affinity in shared-memory multiprocessor scheduling. Technical Report 89-06-01, Department of Computer Science and Engineering, University of Washington, June 1989.

[LS92] M.Litzkow and M. Solomon. Supporting checkpointing and process migration outside the Unix kernel. Usenix Winter Conference Proceedings, pages 283-290, San Francisco CA, 1992.

[LV90] S.Leutenegger and M.Vernon. The performance of multiprogrammed multiprocessor scheduling policies. In Proceedings of the 1990 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, May 1990.

[MF89] R.Minnich and D.Farber. The Mether System: Distributed Shared Memory for SunOS 4.0. In USENIX-Summer, pages 51-60, Baltimore, Maryland (USA), 1989.

[MZD+93] D. Milojicic, W.Zint, A.Dangel, and P.Giese. Task migration on top of the Mach microkernel. MACH III Symposium Proceedings, pages 273-289, Santa Fe, New Mexico, 1993.

[Ous82] J.Ousterhout. Scheduling techniques for concurrent systems. In Proceedings of Distributed Computing Systems Conference, pages 22-30, 1982.

[QUA96] The Quarks Home Page. http://www.cs.utah.edu/~khands/quarks.html

[SB94] E. Seligman and A. Beguelin. High-Level Fault Tolerance in Distributed Programs. Technical Report CMU-CS-94-223, School of Computer Science, Carnegie Mellon University, December 1994.

[Sun90] V.S. Sunderam. PVM: A Framework for Parallel Distributed Computing. Concurrency: Practice and Experience, 2(4):315-339, 1990.

[TG89] A.Tucker and A.Gupta. Process control and scheduling issues for multiprogrammed shared memory multiprocessors. In Proceedings of the 12th ACM Symposium on Operating Systems Principles, pages 159 - 166, December 1989.

[TLC85] M. Theimer, K.Lantz, and D.Cheriton. Preemptable remote execution facilities for the V-System, Proceedings of the 10th ACM Symposium on Operating Systems Principles, pages 2-12. Orcas Islands, Washington, 1985.

[ZM90] J.Zahorjan and C.McCann. Processor scheduling in shared memory multiprocessors. In Proceedings of the 1990 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 214-225, May 1990.


Footnotes

[1] This research is supported by grants from NSF (contract CCR-95-05519), DARPA/Rome Labs (contract F30602-96-1-0320), Intel Corporation and a scholarship from the Achievement Rewards for College Scientists Foundation.

[2] A segment is a sequential part of a process (or task).