
Invited Talk
Puppeteer: Component-based Adaptation for Mobile Computing
Willy Zwaenepoel
Department of Computer Science, Rice University
Puppeteer is a system for adapting component-based applications
in mobile environments. Puppeteer takes advantage of the exported interfaces
of these applications to perform adaptation without modifying the
applications. The system is structured in a modular fashion, allowing easy
addition of new applications and adaptation policies.
Our initial prototype focuses on adaptation to limited bandwidth. It runs
on Windows NT, and includes support for a variety of adaptation policies for
Microsoft PowerPoint and Internet Explorer 5. We demonstrate that Puppeteer
can support complex policies without any modification to the application and
with little overhead. To the best of our knowledge, previous implementations
of adaptations of this nature have relied on modifying the application.
This talk reports on joint work with Eyal de Lara and Dan Wallach.

Short Talks

CANS: Composable, Adaptive Network Services
Xiaodong Fu, Weisong Shi, Anatoly Akkerman, and Vijay Karamcheti
New York University
The growth of the internet has been fueled by an
increasing number of sophisticated network-accessible services.
Unfortunately, the high bandwidth and processing requirements of
such services is at odds with current trends towards increased
variation in network characteristics and a large diversity in end
devices. Ubiquitous access to such services requires the injection
of additional functionality into the network to handle protocol
conversion, data transcoding, and in general bridge disparate
portions of the physical network.
CANS is an application-level infrastructure for injecting
application-specific components into the network that focuses on
three challenges: (a) efficient and dynamic composition of
individual components; (b) dynamic and distributed adaptation of
injected components in response to system conditions; and (c)
support for legacy applications and services. The network view
supported by CANS consists of applications, stateful services, and
data paths between them. Both services and data paths can be
dynamically created and reconfigured: a planning and event
propagation model assists in distributed adaptation, and a run-time
type-based composition model dictates composition. It supports
legacy applications and services using interception and delegation.
This talk will describe the CANS architecture and implementation,
and a case study involving a shrink-wrapped client application in a
dynamically changing network environment where CANS was used to
improve overall user experience.

Océano: Infrastructure for a Computing Utility
Michael Kalantar, Karen Appleby, Sameh Fakhouri, Liana Fong,
Germán Goldszmidt, Srirama Krishnakumar, Donald Pazel, John
Pershing, and Benny Rochwerger
IBM T. J. Watson Research Center
Océano is a prototype of a highly available,
scaleable, and manageable infrastructure for an e-business computing
utility. It enables multi-enterprise hosting on a collection of
shared resources (eg., servers). However, at any point of time, each
resource is assigned for use by only a single customer. That is, the
hosting environment is divided into smaller, secure domains, each
supporting one customer. These domains are dynamic: the resources
assigned to them may be augmented when load increases and reduced
when load dips. This dynamic resource allocation enables flexible
Service Level Agreements (SLAs) with customers in an environment
where peak loads are an order of magnitude greater than the normal
steady state.

Effect of Interrupts in Multiprocessor Servers
Aniruddha Bohra and Liviu Iftode
Rutgers University
Problem and Motivation:
Interrupts are the major performance bottlenecks in computer systems
with very high Input/Output(I/O) activity. With the advent of new
applications like webservers, which are not expected to carry out
any computationally intensive tasks, the cost of interrupts has
become a more pronounced bottleneck. The highly optimized
architectures which rely heavily on pipelines, make the interrupt
cost higher than it was before. The problem is compounded in case of
multiprocessor servers where there is a considerable overhead, of
cache and Translation Lookaside Buffer(TLB)pollution,saving of the
system state and synchronization, to service interrupts at a very
fast pace. Denial of Service attacks are proof of the fact that by a
very simple mechanism of bombarding a system with a very large
number of packets at a very fast rate, the applications fail to
progress (livelock). This makes us think whether the current
interrupt driven mechanism for handling of network events is the
best, or even correct!
The present study wishes to address the questions regarding the
interrupt structure on multiprocessors for highly I/O bound
processes like the webserver. We wish to investigate the effect of
interrupts on the performance of a webserver running on a
multiprocessor system under heavy I/O workload. We also aim to
evaluate various alternative mechanisms that have been suggested for
asynchronous event processing and try to suggest a mechanism
which would help reduce the impact of interrupts on the performance
of the system, and eliminate livelock in such systems.
Background and Related Work:
Several studies about webservers and their interaction with the
underlying operating system establish interrupt processing as the
performance bottleneck. Hu et al. point out that the Apache
webserver spends around 25%-40% of its execution time handling
interrupts. Synchronization and signal handling are identified as
severe performance penalties in the same study. Also, they predict
the effect would be more pronounced in case of a multiprocessor.
Many researchers have looked at the problem of processing incoming
messages synchronously. In particular the problem of polling v/s
interrupts with active messages has received much attention. Several
systems use a mixture of polling and interrupts. Some systems use
interrupts for just the operating system or protocol specific
messages and poll otherwise. Langendoen et al. use polling whenever
the processor is idle and use interrupts otherwise, Soft Timers
extend that idea and artificially interrupt the execution at a
certain interval to prevent starvation of the polling thread for the
network interface. Smith and Traw initiate the polling using clocked
interrupts, that is polling of the device is initiated by a timer
interrupt. Another very popular idea is to interrupt only when we
cannot poll fast enough. Several systems implement this in hardware,
in Polling Watchdog and in software by Mogul et al where the
interrupt is used just to wake up the polling thread which then
services all outstanding requests and then reactivates the
interrupts.
However, none of the above studies investigate the impact of
interrupts or their own event handling mechanisms on multiprocessor
systems. Also, the fact that computation is not the main aim of the
system in a very important class of applications has been ignored.
Since the advent of the World Wide Web and the gain of importance of
the class of applications that are not CPU bound, the effect of
interrupt processing needs to be re-evaluated and the above
mechanisms revisited in the new light.
Approach and Uniqueness:
Traditional approaches to eliminating livelock typically suggest the
use of polling and interrupts in conjunction. With interrupts acting
as triggers to polling. The rationale behind it is, that for polling
to be as effective as the interrupts, we need to poll tens of
thousands of times per second, which even with the modern processor
speed would leave few cycles for other processing.Our approach is a
logical extension of the ideas presented in above studies, in the
realm of multiprocessor environments.
Polling of the event source is a way to eliminate livelock. To
have a good performance, we need to poll the source as fast as
possible. If we have more than one processor, we can dedicate one
processor to poll the devices and let the other processors carry on
with the usual processing work. Here, we ensure that we poll as many
times as the processor allows us to. Other benefits that we
expect by letting the processor always poll the event source is
eliminating the cost of invoking the interrupt handler with a
context switch and removing the overhead of crossing the kernel
boundary repeatedly. The above approach ensures that the performance
of the system is not adversely affected while eliminating the
problem of livelock in the system.
Although by dedicating a processor to polling the devices, a
potential processing element is being wasted, we expect, based on
observations of several studies, that modern workloads are highly
I/O bound and the current processors are fast enough that there is
no loss in processing ability of the system as a whole.
Results and Contributions:
We have implemented the mechanisms to handle network events using
some of the above approaches We are currently working to provide a
comprehensive comparison of all approaches for interrupt handling
over multiprocessor systems. We have carried out some performance
measurements over the Apache webserver using the above approaches as
the underlying event handling mechanisms.
The preliminary results are promising and we expect to further
strengthen our claim through them. We also wish to investigate the
impact of a dedicated I/O processor and its possible use for certain
performance enhancing tasks like prefetching. To study the effect of
having multiple I/O processors in case of a large multiprocessor
system is also one of the avenues that we would explore in future.

Efficiency vs. Portability in Cluster-Based
Network Servers
Enrique V. Carrera and Ricardo Bianchini
Rutgers University
Efficiency and portability are usually conflicting
objectives for cluster-based network servers that distribute the
clients' requests across the cluster based on the actual content
requested. Our work is based on the observation that this efficiency
vs. portability tradeoff has not been discussed before in the
literature. To fill this gap, in this paper we study this tradeoff
in the context of an interesting class of content-based network
servers, the locality-conscious servers, using modeling and
experimentation. Our analytical model gauges the potential
performance benefits of portable and non-portable locality-conscious
request distribution with respect to a traditional,
locality-oblivious server, as a function of multiple parameters.
Based on our experience with the model, we design and evaluate a
portable, locality-conscious server. Experiments with our server, a
non-portable server, and a traditional server validate and confirm
our modeling results under several real workloads. Based on our
modeling and experimental results, our main conclusion is that
portability should be promoted in cluster-based network servers with
low processor overhead communication, given its relatively low cost
(<= 15%) in terms of efficiency. For clusters with high processor
overhead communication, efficiency should be the overriding concern,
as the cost of portability can be very high (as high as 106% on 32
nodes). We also conclude that user-level communication can be useful
even for non-scientific applications such as network servers.

Low-Power Fault-Tolerant Real-Time
Network-Attached Storage Device
Tzi-cker Chiueh
State University of New York at Stony Brook
Phoenix is a real-time network-attached storage
device (NASD) that guarantees real-time data delivery to network
clients even across single disk failure. The service interfaces that
Phoenix provides are best-effort/real-time reads/writes based on
unique object identifiers and block offsets. Data retrieval from
Phoenix can be serviced in server push or client pull modes.
Phoenix's real-time disk subsystem performance results from a
standard cycle-based scan-order disk scheduling mechanism. However,
the disk I/O cycle of Phoenix is either completely active or
completely idle. This on-off disk scheduling model effectively
reduces the power consumption of the disk subsystem, without
increasing the buffer size requirement. Phoenix also exploits
unused disk storage space and maintains additional redundancy beyond
the generic RAID5-style parity. This extra redundancy,
typically in the form of block replication, reduces the time to
reconstruct the data on the failed disk. This talk presents the
design and implementation of Phoenix, one of the first, if not the
first, NASDs that support fault-tolerant, real-time, and low-power
network storage service, and a detailed performance evaluation of
the Phoenix prototype based on Linux. I will also talk briefly about
the evolution of the Phoenix project, and the on-going research on
large-scale virtual disk architecture.

Trading Capacity for Performance in a Disk
Array
Xiang Yu and Randy Wang
Princeton University
A variety of performance-enhancing techniques, such
as striping, mirroring, and rotational data replication, exist in
the disk array literature. Given a fixed budget of disks, one must
intelligently choose what combination of these techniques to employ.
In this work, we present a way of designing disk arrays that can
flexibly and systematically reduce seek and rotational delay in a
balanced manner. We give analytical models that can guide an array
designer towards optimal configurations by considering both disk and
workload characteristics. We have implemented a prototype disk array
that incorporates the configuration models. In the process, we have
also developed a robust disk head position prediction mechanism
without any hardware support. The resulting prototype demonstrates
the effectiveness of the configuration models.

The Implementation and Evaluation of a
Rotation-Latency-Sensitive Disk Scheduler
Lan Huang and Tzi-cker Chiueh
State University of New York at Stony Brook
Disk rotation latency is becoming an increasingly
significant component of disk service time, and yet has rarely been
exposed to system software for scheduling purpose. In this paper, we
presented the design and implementation of a disk scheduler that can
effectively incorporate rotation latency into its scheduling
decision. The two enabling mechanisms responsible for the successful
implementation of this disk scheduler are an accurate run-time disk
head position prediction scheme, and a precise disk service time
model. We show that the accuracy of disk head position prediction
and disk service time prediction can approach close to 100% and more
than 95%, respectively. We used both real and synthetic traces to
evaluate the performance of this rotation latency-sensitive disk
scheduler (RSDS) and its variants, and found that the performance
improvement of RSDS over conventional disk schedulers that consider
only seek delay is up to 48%.

GulfStream - Dynamic Topology Management in
Multi-domain Server Farms
Sameh A. Fakhouri
IBM T.J. Watson Research Center
A multi-domain server farm is a collection of
servers divided into a number of distinct domains. While the
resources of the server farm are shared, each domain is isolated
from the others. Server farm administrative software allocates
resources, and enforces isolation policies. This paper describes
GulfStream, a distributed software system that addresses the problem
of managing the network topology of such a server farm. In
particular, it addresses the following core problems: topology
discovery and verification, and failure detection.
Unlike most topology discovery and failure detection systems which
focus on the nodes in a cluster, GulfStream logically organizes all
of the network adapters of the server farm into groups. Each group
contains those adapters that can directly exchange layer 2 messages.
GulfStream dynamically establishes a hierarchy for reporting network
topology and availability of network adapters.
We describe a prototype implementation of GulfStream on a 55 node
heterogeneous server farm interconnected using switched fast
ethernet. This was done in the context of the Oceano project which
provides for dynamic resource allocation in response to workload
variations. We discuss scaling GulfStream to larger environments; we
envision its use in server farms containing thousands of nodes.

Quantifying the Impact of Architectural Scaling
on Communication
Taliver Heath, Samian Kaur, Richard P. Martin, and Thu D. Nguyen
Rutgers University
This work quantifies how persistent increases in
processor speed compared to I/O speed reduce the performance gap
between specialized, high performance messaging layers and general
purpose protocols such as TCP/IP and UDP/IP. The comparison is
important because specialized layers sacrifice considerable system
connectivity and robustness to obtain increased performance. We
first quantify the scaling effects on small messages by measuring
the LogP performance of two Active Message II layers, one running
over a specialized VIA layer and the other over stock UDP as we
scale the CPU and I/O components. We then predict future LogP
performance by mapping the LogP model's network parameters,
particularly overhead, into architectural components. Our
projections show that the performance benefit afforded by
specialized messaging for small messages will erode to a factor of 2
in the next 5 years. Our models further show that the performance
differential between the two approaches will continue to erode
without a radical restructuring of the I/O system. For long
messages, we quantify the variable per-page instruction budget that
a zero-copy messaging approach has for page table manipulations if
it is to outperform a single-copy approach. Finally, we conclude
with an examination of future I/O advances that would result in
substantial improvements to messaging performance.

Operating System Support for Safely and
Efficiently Programmable Routers
Prashant Pradhan and Tzi-cker Chiueh
State University of New York at Stony Brook
Placing computation inside the network can yield
significant performance benefits to network-based applications.
These benefits accrue from topologically strategic placement of
computation and from the ability of such computation to exploit
global network context. For example, congestion control state can be
shared between flows passing through an intranet's portal router, by
placing a function in the router that aggregates congestion control
state on a path-by-path basis [1].
To support placement of computation in the network, a router
operating system should support appropriate abstractions for
composing computation on network flows, and provide efficient
implementations of these abstractions. More importantly, the core
router's integrity and performance should remain unaffected in the
presence of such a composable computation framework. Such isolation
requires memory protection and performance protection
of the router kernel from dynamically added functions. With these
goals in mind, we have developed a router operating system that
allows safe and efficient composition of computation on network
flows. We present the essential features of this OS and describe an
example application, Aggregate TCP (ATCP) [1], that can be
implemented using these features.
Computation is composed in terms of the following entities :
- Extension functions: These are preemptible functions that
can carry state across invocations. Every extension function
invocation is made in some execution context or flow. An
extension function may have multiple pending invocations, issued in
different flow contexts.
- Flows: Flows are abstract execution contexts. A flow is a
unit of scheduling and resource allocation. Flows may allow other
flows to share their resources through a simple access control
mechanism.
Given the above entities, there are two key mechanisms to compose
computation and to determine control flow :
- Asynchronous Invocation: A function, invoked in a given
flow context, may pass control to another function by posting an
invocation to it. The invocation is asynchronous, and the CPU
scheduler determines when to make invocations pending in various
flow contexts.
- Static Binding: Flows may statically bind themselves to a
stream of packets by specifying a packet filtering rule, and an
extension function that should get control when a packet matches the
rule.
- Dynamic Binding: To pass control to a target function, a
given function may reference the target function by names that
are strings with semantic connotations. The router OS provides a
mechanism to register and query these names.
To implement this framework with good invocation performance,
extension functions may be co-located with the router kernel to
avoid expensive context-switching and TLB flushing overheads.
However, the router kernel's safety is not compromised, owing to the
use of intra-address space protection [2]. The extension
functions are placed in a lesser privileged subset of the kernel
address space, which provides memory protection to the kernel, but
only incurs the overhead of a protected function call while making
extension function invocation. Performance protection is ensured
through a preemptive CPU scheduler.
Aggregate TCP congestion control (ATCP) [1] is an ideal function for
placement inside the network, since it can exploit global
information about congestion status on various network paths and
allow TCP flows to avoid their cold-start phase in congestion
estimation. In ATCP, a router placed at the edge of the network,
maintains congestion control related state for flows passing through
it, grouped by the destination subnet of these flows. An ATCP
router, upon receiving a TCP connection request, splits it into a
local subconnection (L) and a remote subconnection (R). R starts
from a congestion window equal to the warm estimate. On L, an available
credit is maintained, depending upon the congestion window of R
and its growth mode (linear/exponential). Since the RTT on L is much
smaller than that on the whole L-R path, the congestion window for L
can grow to the warm estimate much faster.
ATCP doesn't require any changes to the end-system TCP
implementations and its evaluation using a real web server trace
shows a potential improvement of upto a factor of 2 in normalized
HTTP transaction latency. ATCP can be naturally and efficiently
implemented using the API exposed by the proposed operating system.
References
[1] P. Pradhan; T. Chiueh; A. Neogi, "Aggregate TCP Congestion
Control Using Multiple Network Probing", Proc. ICDCS-2000.
[2] T. Chiueh; G. Venkitachalam; P. Pradhan, "Integrating
Segmentation and Paging Protection for Safe, Efficient and
Transparent Software Extensions" , Proc. ACM SOSP-99.

Transport Layer Support for Highly-Available
Network Services
Kiran Srinivasan, Florin Sultan, and Liviu Iftode
Rutgers University
Problem and Motivation:
There has been a growing trend in the Internet to view resources as
services rather than as servers. The coupling between a resource and
its location (IP address) is progressively losing importance. The
recipients of the services (clients) are only interested in
obtaining good quality service no matter from where (geographical
location) the service is being provided. This requires the service
providers to provide highly-available and high-throughput services.
This problem is being approached in two different ways. One method
is of building fault-tolerant, high-throughput servers from a
cluster of computers connected by a SAN. The other approach is of
distributing the servers of the same service onto geographically
different areas. The current transport layer framework (TCP) does
not facilitate either of these mechanisms as it does not support
easy connection hand-off from one server to another. The primary
emphasis of this project is to explore and design a transport layer
mechanism that enables building services based on the above
architectures.
Background and Related Work:
Related work includes TCP/IP connection hand-off protocols used
either for mobility extensions to TCP/IP [Bakre95, Balakrishnan95,
Snoeren00] or for request distribution in clusters [Aron99].
Indirect TCP [Bakre95] provides hand-off support for physical
mobility of a connection endpoint over a wireless link, by splitting
and maintaining hard state for it at a base station, and by
transferring that state to the next base station during hand-off.
[Balakrishnan95] provides support for mobility by maintaining soft
state in the base stations, at the expense of flooding nearby base
stations using multicast to build necessary state prior to a
handoff. [Snoeren00] changes the protocol stack at the endpoints to
provide end-to-end mobility when the network attachment point (IP
address) changes.
All these approaches do not consider the task of migrating the
connection endpoints between physically distinct machines. They
either rely on directly using the full connection state maintained
at the physical endpoints [Bakre95, Balakrishnan95], or on
restarting a previously established connection after a IP address
change by using an authentication mechanism to reuse the state of
the old connection [Snoeren00].
[Aron99] proposes TCP connection handoff in clustered servers for
distributing incoming connection requests from a front-end machine
to the server back-end nodes. The approach has limitations due to
its specialized nature (cluster-based web server): in their single
handoff scheme a connection endpoint can migrate only during the
connection setup phase. Multiple handoff of persistent HTTP/1.1
connections is only mentioned as an alternative, but no design or
implementation is described. Even in the multiple handoff scheme,
the granularity of migration of live connections is
application-dependent: a connection can only migrate after fully
servicing a HTTP request.
Our mechanism for connection handoff targets migration of the
endpoint of an active (live) connection between physically distinct
hosts, transparent to the fixed endpoint, at any moment during the
connection lifetime.
Approach and Uniqueness:
The model of client-server interaction that we envision in our
design assumes that there are a number of hosts (the servers),
either clustered or distributed across the Internet, that provide
the same service (i.e., run the same application). A client contacts
a preferred server using a TCP connection, at the beginning of a
service session.
During the lifetime of a session, under the control of our scheme,
the remote endpoint of the connection may transparently (to the
client application) migrate between the servers, for example in
response to events like failure of the current server or a loss in
the quality of service received by the client. In response, the
transport layer at the destination server reincarnates the
connection endpoint existing at the previous server. In case of a
failure to do so, this event is treated just as a failure to provide
the service and the session/connection is aborted.
Connection migration involves only one endpoint of the connection
(the server side), while the other endpoint (client side) is fixed.
Migration may occur multiple times and at any moment throughout the
lifetime of an active connection.
Because migration may occur at any time, the application level state
associated with the ongoing data transfer may have to be restored at
the destination host to ensure correct continuation of the data
transfer on the migrated connection. We provide a minimal API for
exporting/importing the server-specific state that completely
describes the ongoing data transfer at the application level. The
writer of a server application uses our API in order to take
advantage of dynamic server-side migration of live TCP connections.
Our migration mechanism can be uniformly used both in cluster-based
servers or in groups of servers distributed over wide area. Inside a
cluster, we can take advantage by a backend network of high
bandwidth and low latency (SAN) to proactively replicate connection
state across the cluster. This provides hot-swappable endpoints for
a connection: in case of a failure of the server node, the client
can transparently continue communication with a backup node. Over
wide area, it can be used to recover from events like network
congestion and DoS attacks, that degrade the quality of the service
received by the client.
In our system, the client applications (e.g., web browser) are
unchanged. We change the client and server TCP/IP stack to
accommodate a new type of connection. Although we require changes to
server applications, we believe the programming effort involved
should be fairly low.
Results and Contributions:
We are currently designing and implementing our migration scheme by
extending the TCP specification.
The connection migration mechanism we propose can be used (i) on the
client side, to dynamically shift to another server when the current
server does not provide a satisfactory level of service, or (ii) on
the server side, to recover connections from hard node failures in
cluster-based servers, or to implement a load balancing scheme by
shedding load from existing connections to other less loaded
servers.
In addition, we believe that our scheme can be used to alleviate the
impact on existing connections of certain types of DoS attacks (SYN
flooding, the process table attack) by shifting connections from the
host under attack to alternate servers. While the migration
mechanism may need to use resources on the attacked machine (which,
depending on the severity of the attack, may or may not be
available), connection migration provides in any case a better
alternative than the loss of existing connections.
References:
[Aron99] M. Aron, P. Druschel, W. Zwaenepoel. Efficient Support for
P-HTTP in Cluster-Based Web Servers. USENIX '99.
[Bakre95] Ajay Bakre, B. R. Badrinath. Handoff and system support
for Indirect TCP/IP. Second Usenix Symposium on Mobile and
Location-dependent Computing, April 1995.
[Balakrishnan95] H. Balakrishnan, S. Seshan, E. Amir, R. H. Katz.
Improving TCP/IP Performance over Wireless Networks. 1st ACM Conf.
on Mobile Computing and Networking, November 1995.
[Snoeren00] A. C. Snoeren, H. Balakrishnan. An End-to-End Approach
to Host Mobility. 6th ACM MOBICOM, August 2000.

An Architecture to Define, Monitor and Bill
SLAs in a Server Farm
Juliana Cunha and Karen Appleby
T. J. Watson Research Center
We present an architecture to define and maintain
Service Level Agreements, which was developed as part of a larger
project, Oceano. An XML Contract is used to establishes an SLA
between a customer and a service provider. This contract includes
report definition, violation policy descriptions, penalties for
disruption of service and pricing. The information provided in the
contract must be validated for integrity, to avoid conflicts and
errors. In addition, we need to guarantee that the provider has a
sufficient number of resources to support the defined service level.
Oceano will monitor the enforcement of the contract and triggers the
policy engine whenever a violation occurs. Contract Violations are
expressed as policies, which include a violation scenario, start and
stop time, the violator agent and a related action. The possible
actions are procedures to correct the problem, and/or apply monetary
value to be charged from the service provider violator. A pricing
engine (that combines the aggregate cost values of each resource
involved in the violation scenario) is responsible for the billing
calculations. Reports are periodically generated to inform the
participants of the system behavior. We address the problem of SLA
definition by using customer feedback and providing a flexible way
to define and monitor the quality of service. Another important
issue is the integration between the SLA monitoring and pricing
model. We will present recent results, and ongoing work.

Posters

Cooperative File System (CFS) for PC clusters
Suresh Gopalakrishnan and Liviu Iftode
Rutgers University
Problem and Motivation:
CFS aggregates local storage available on each node in a PC cluster
into a single global file system that supports not just location
transparency but also location independence. The main goal is to
provide scalable I/O throughput and support variability in load
without compromising manageability and flexibility. In recent times
there is a significant trend in the usage of file systems towards
server-based applications rather than the traditional interactive
applications. These changes in usage and traffic patterns along with
the cluster based server models motivate the idea of building a
global cluster file system. Location independence allows CFS to
manage within itself, transparently and dynamically, the tasks of
file placement and file migration for load balancing and file
replication for fault tolerance. The possibility of using global
memory (cooperative caching) and the availability of fast
interconnects with user level communication obviate concerns about
performance.
Background and Related Work:
CFS is different from other DFS projects (Andrew, Sprite, xFS, NFS,
Frangipani, Archipelago etc.) mainly in that
1. it runs in the user level
2. on top of the existing file systems
3. using volatile metadata that can be rebuilt at run time
4. providing location independence
Projects like Frangipani and xFS were built from scratch and
included storage management (Petal) issues also in the design, which
complicates the whole system. CFS runs on top of any commodity FS
and chooses to leverage on existing systems rather than attempting
to redesign a file system itself.
Approach and Uniqueness:
The key to providing location independence is maintaining virtual
directories on each node that contain global information. There is a
single global (virtual) file system that is the conglomeration of
the local file systems on participating nodes, with a single root.
Directories are allowed to be replicated, so directories with the
same name in the local filesystems (eg: /bin) are considered to be
replications of the single global directory with the same name. The
virtual directories are maintained in memory and are volatile - they
are built up at runtime by a procedure called 'directory merging' or
'dirmerge'.
The CFS is implemented at the user level as a library that provides
all the file system interface needed by the applications. The design
of CFS has two logical layers - one is the virtual directories that
contain the metadata required to provide location independence and
the second is a cooperative caching layer with global replacement
that caches the file data at block level. CFS exploits the high
speed interconnects and user level communication with features like
memory mapped communication provided by VIA to achieve better
performance as well as to allow relaxed consistency protocols.
Results and Contributions:
The basic CFS implementation consisting of the cooperative caching
and virtual directories has been implemented. Using CFS, we have
also built a user level distributed NFS server that we are currently
benchmarking using SPEC NFS benchmark and traces from the file
servers in the department. The testbed consists of a cluster of 8
PCs (300 MHz) connected by a VIA-based interconnect (Giganet). The
next step is to design and evaluate file replication and migration
policies and mechanisms. CFS will also be used as a platform to
study other (distributed systems) issues like impact of
communication on performance, local v/s global cache replacement
policies, meta-data consistency models etc.

Encryption Servers: A Cost-Effective Versatile
Solution for Internet Security
Vivek Pathak and Liviu Iftode
Rutgers University
The explosive growth of the Internet has given rise
to a number of security concerns. These concerns have become
important due to the increasing use of electronic commerce. As the
Internet and its user community are distributed in nature, any large
scale Internet security solution has to tackle a number of issues.
These include scalability, lack of single points of failure and
economic viability. This motivates the creation of Encryption
server, a scalable and cost effective medium for encryption and
authentication of Internet traffic.
Encryption server provides IP level security. It implements
encryption and authentication of forwarded IP packets. The
Encryption servers function as secure gateways to the Internet and
support a destination based security policy. Variable key lifetimes
are used to provide a constant level of security and performance for
a wide range of operating loads. The authentication of fresh public
keys is done by a vote among the peers and avoids the use of a
single trusted third party. An automatic key exchange protocol for
authenticated propagation of public keys and secure destinations is
developed. It uses lazy key update and optimistic trust to eliminate
overhead in the common operating case. A partial implementation on
the Linux operating system has been done and preliminary experiments
have shown promising results.
It is intended to explore the performance tradeoffs in the variable
key lifetime approach. Issues like caching of cryptographic
processing and interrupt free forwarding of IP packets have to be
studied. The performance tradeoffs of various encryption methods and
key exchange mechanisms have to be experimentally validated. The
scalability issues present due to hardware limitations have to be
addressed by creation of encryption clusters.

Cooperative Caching Middleware for Clusters
Matias Cuenca and Thu D. Nguyen
Rutgers University
In order to seize the power of commodity clusters,
Internet servers sometimes take advantage of the aggregate amount of
memory in the cluster nodes by using cooperative caching (CC)
mechanisms. Such mechanisms can reduce the number of accesses to
disk, and as a result reduce the service latency and improve overall
server throughput. Most of the past work on CC has been done at the
application level, i.e., caching was tailored to a specific server's
needs. For example, some WWW servers have assumed Zipf request
distributions, have used (variable-sized) files as the caching unit,
and have applied local rather than global replacement policies. The
problem with this approach is that the caching and coherence
infrastructure has to be rebuilt for each new server. In contrast,
we believe that cooperative caching should be implemented as a
middleware layer that can be reused by a variety of servers. To
verify this claim, we are developing a simulator that can test
multiple CC policies and mechanisms. At first, we will focus on a
distributed file system and will compare the performance of
hand-optimized CC against that of our general CC infrastructure.
With our simulator and file system, we plan to study the tradeoff
between the performance and the generality of the middleware we
propose.

A Window Into Your Computing Environment
Christopher Peery and Thu D. Nguyen
Rutgers University
With the advent of Virtual Computing it has become
possible for users to move the execution of applications off of
their local machines and on to a better computing base that is
usually located at a remote location. To the users of this form of
Virtual Computing the application should appear as though it were
still running on the local machine. This application could be a
anything, from a simple program, to a graphic environment, or even
to an entire operating system.
Currently this model has been implemented for simple
applications. For example, AT&T has developed an application
known as Virtual Network Computing (VNC). Using this application,
users are able to access a simplified X-server that is running on a
remote machine. Since the X-server is running remotely, all the
state of the server is kept remotely as well. Thus, users are
able to utilize a graphical interface that can potentially
(depending on the reliability of the remote machine) survive a crash
of the local host. In addition, a given user can access the same
remotely running X-server from any number of different machines and
is therefore not bound to working at a given computer.
In addition to virtual computing, the computing
world has seen a huge growth in the development of PDAs, such as the
Compaq IPAQ. The goal of this project is to combine the versatility
and portability of hand-held devices with the reliability and
availability of virtual computing to allow individuals to access an
already existing computing environment.
This project will attempt to get the IPAQ to connect
with an X server that is running remotely on a reliable computing
base. The IPAQ will use a wireless medium and the VNC
application to interface with this X server. In essence, the IPAQ
will be turned in to a mobile network terminal with graphical
capabilities. This will allow users to access their computing
environments from remote locations using this PDAs.
The purpose of this set-up is not to provide the
IPAQ with the complete functionality of full desktop machine.
Instead it is to allow individuals to access their computing
environment for information or to allow them to make small
alterations in it. The emphasis is not on functionality but on
convienience and look-up.
The current focus of this project is on the
implementation details. This includes altering the VNC application
to allow for efficient realization on the IPAQ, and how to
best grant the user of this application an overview of his computing
environment.

Continuously Available and Scalable Sorted List
Data Structure
Kiran Nagaraja, Richard Martin, and Thu Nguyen
Rutgers University
The emergence of the Internet as the global,
ubiquitous networking infrastructure is driving a new class of
highly scalable Internet services. As these services become a part
of our everyday life, users will demand ultra reliability. In this
work, we argue that that the complexity of replication, fault
tolerance, and consistency necessary to achieve this reliability
could and should be hidden behind data structure abstractions such
as in-memory sorted lists, spatial trees, and graphs.
We believe that the data structure should be
flexible, resilient to structural imbalances and allow high
concurrency in order to provide the above said services in an
efficient manner. We are investigating a distributed sorted list
structure sorted on the value field of a 2-tuple (keyid, value);
potential applications include multi-item auctions and web
ranking/indexing services.
The data structure is organized in a two level
hierarchy, a globally replicated 'Splitter' that partitions the
value-range among participating nodes and a set of B-trees to hold
the data local to each node. Our list implementation allows for high
concurrency and ease of management. We are currently investigating
data rebalancing among B-trees of a node as well as across nodes.