
Invited Talk

Global Internet Content Delivery
Bruce Maggs
Akamai Technologies and Carnegie-Mellon University
This talk describes Akamai's Internet content delivery service called
FreeFlow. Akamai has deployed over 8000 servers on more than 350 networks
around the world, and delivers content for more than 1500 customers. The
talk begins with a review of the mechanics of content delivery on the
Internet, and then examines the unique features of Akamai's massively
distributed system. After addressing several of the technological challenges
faced in designing such a system, the talk concludes by presenting two
theoretical problems that arose during its implementation.

Short Talks

Neptune: Replication Infrastructure
for Cluster-Based Network Services
Kai Shen, Tao Yang, Lingkun Chu, JoAnne L. Holliday, Doug Kuschner,
Huican Zhu - University of California at Santa Barbara
Previous research has addressed the scalability and
availability of building large-scale cluster-based network services.
This project studies the clustering of replicated network services
when the persistent service data is frequently updated. The
infrastructural middleware we built, called Neptune, provides a
flexible interface to glue and replicate existing service modules
and accommodate a variety of underlying storage mechanisms. Neptune
maintains dynamic and location-transparent mapping to isolate faulty
service modules and enforce replica consistency. It allows efficient
use of multiple-level replication consistency mechanisms with
staleness control and failure recovery support. In this talk we
present Neptune's overall architecture and data replication support,
and we illustrate its performance using three network services.
http://www.cs.ucsb.edu/research/neptune

Optimistic
Active Replication
Pascal Felber - Bell Labs, Lucent Technologies
Replication is a powerful technique for increasing availability
of a distributed service. Algorithms for replicating distributed
services do however face a dilemma: they should be (1) efficient
(low latency), while (2) ensuring consistency of the replicas, which
are two contradictory goals. To be efficient, some Atomic Broadcast
algorithms used for active replication deliberately sacrifice
consistency, if inconsistency is likely to occur with a low
probability. We present an algorithm that handles replication
efficiently in most scenarios, while always preventing
inconsistencies.

It Takes Two To Tangle
-
Censorship Resistant Publishing Through Document Entanglements
Marc Waldman and David Mazieres - New York University
Today, documents available over the the Internet are easy to
censor. Each document can usually be traced back to a specific host
or even the individual responsible for publishing the document.
Someone wishing to censor a document can use the courts, threats, or
some other means to force the host administrator or author to delete
a particular file. Even if these methods prove unsuccessful, various
denial of service attacks can be launched against the hosting server
in order to make the document difficult, if not impossible, to
retrieve.
Publishing the document on many hosts may seem to be the answer.
However, no standard naming convention exists that allows one to
easily specify several hosts via a single name. Even if such a
naming convention existed it merely makes the censor's job somewhat
harder --- the censor still knows exactly which hosts contain the
content and therefore which hosts to attack. In addition to the
naming difficulty there is little incentive for a hosting provider
to store documents that he is being pressured into deleting.
We propose a distributed censorship resistant file system that
publishes the blocks of a document in such a way that it is very
difficult for an adversary to delete all the component blocks of a
file. This is accomplished, in part, by using Shamir's secret
sharing algorithm. The component blocks of newly published documents
are split into secret shares, half of which also belong to other
documents.

Fault-Tolerant TCP (FT-TCP)
Thomas C. Bressoud - Bell Labs, Lucent Technologies
In this talk I will describe an implementation of fault-tolerant
TCP (FT-TCP) that allows a faulty server to keep its TCP connections
open until it either recovers or it is failed over to a backup. The
failure and recovery of the server process are completely
transparent to client processes connected with it via TCP. The
system does not use a proxy and no changes to the client software or
the TCP implementation on the server are required.

Distributed Computing
over Networks of Embedded Systems using Smart Messages
Phillip Stanley-Marbell, Cristian Borcea, Kiran Nagaraja, Liviu
Iftode - Rutgers University
Embedded systems have outnumbered traditional desktop computing
systems for several years and will continue to do so in the future.
A new trend in embedded systems is to provide networking, either
wired or wireless. As device costs plummet, it will soon be feasible
to deploy large networks of embedded computing devices. These
networks will be inherently heterogeneous, both in the
interconnection technologies, and in the individual functions of the
component nodes. These embedded computing devices will typically be
mobile and have constrained energy resources, and are very likely to
be battery powered.
To benefit from programmability, and the aggregated computing
resources deployed in these networks, new distributed computing
models must be employed, which will necessarily be different from
the traditional distributed computing models for several reasons.
First, at such large network scale, a property-based naming rather
than a unique identifier for the participant nodes in the
computation must be supported. Second, given the fluidity and
volatility of these networks in terms of node configuration and
network topology, it may be impossible to synchronize computation,
and round-trip communication may never complete. Third, applications
should accept partial execution as long as it is relevant, i.e.
meets a certain {\bf quality of result} (QoR).
Presented is a system architecture, Smart Messages, for
computation and communication in large networks of embedded systems.
In this model, communication is realized by sending Smart Messages
in the network. These messages are comprised of code, which is
executed at each hop in the path of the message, and a payload which
the message carries in the network. The execution of the message at
each hop determines the next hop in the path of the message, thus
smart messages are responsible for their own routing.
The goal of the Smart Message architecture is to keep the support
required from nodes in the network to the bare minimum, placing
intelligence in the Smart Messages rather than in individual nodes.
Placing intelligence in SMs, and providing a common minimal support
from cooperative nodes, provides flexibility and obviates the need
for the potentially impossible task of updating all nodes in a
network for the implementation of a new application or protocol.
This talk details the motivation for Smart Messages, describes
the smart message model and proposes a system architecture to
support this model.

ARMS - Automated Replication for
Meshed Servers
Weibin Zhao and Henning Schulzrinne - Columbia University
Server replication is a widely used mechanism to provide fault
tolerance, enhance availability and improve performance in
distributed systems, e.g., mirrored Web servers, FTP servers, DNS
servers and directory servers. Replication can be synchronous in
which an update delivers to all servers or none of them, or
asynchronous in which an update delivers to one server first, and
then propagates to other servers. Synchronous replication is
difficult to achieve in wide area where asynchronous approach is
more promising. However, propagating updates promptly among
replicated servers is a challenging issue.
This paper describes ARMS, an automated replication protocol for
meshed servers. ARMS supports wide area server replications and
automates replication process by employing service discovery
technology. It is highly efficient for dynamic data, and can be used
in multiple server groups.
ARMS uses the asynchronous replication model, but propagates
updates in a quasi-synchronous way. Replicated servers in ARMS
maintain a mesh topology. Updates are propagated in two ways: batch
transfer and direct forwarding. Batch transfer is used to exchange
bulk data among servers to let them catch up updates with each
other. It is useful for initial population exchanges when new
servers are booted, failed servers are rebooted, or network
partitions are fixed. Direct forwarding is used to propagate new
updates promptly. Each server maintains logical connections to all
other servers and forwards updates using best-effort. In ARMS, each
data entry has a unique updating ID which includes the server's URL,
who initially performs the update, and the updating timestamp at
that server. From time to time, a server sends its updating report
to other servers. Newer updates are exchanged in batch among
servers.
Traditional replication systems often have management overheads,
such as static group membership configuration. ARMS supports dynamic
changes of group membership, which facilities new server joining and
old server leaving. Servers in ARMS periodically send their
advertisements for presence using multicast. They also forward new
server advertisements to each other. Furthermore, replicated servers
exchange group membership information regularly.
ARMS is designed to support dynamic data efficiently. Dynamic
data are soft state in that they are valid only for the specified
periods of time. They need to be refreshed periodically, otherwise
they will be removed from servers when they are timeout. Examples of
dynamic data include the using of lifetime in SLP's Directory
Agents, the using of lease in Jini's Lookup Services, and the using
of time-to-live in LDAP. As there are many update operations for
dynamic data including refreshes, ARMS choose to propagate updating
results instead of updating operations. This way no updating log is
needed, and several updating operations on one data entry can be
propagated as a single updating result.
ARMS supports server replications in multiple groups. Servers
detect the group information in server advertisements, and join
designated groups (a server can join multiple groups). ARMS also
addresses other important replication issues, such as resolving
updating conflicts from clients, handling deleted data entries.
Special care is given in ARMS to various failure scenarios. ARMS
uses keepalive messages to handle server failure and network
partitions. ARMS has been implemented for mSLP - Mesh-enhanced
Service Location Protocol, further details can be found at http://www.cs.columbia.edu/~zwb/project/slp/.

LBFS: Low-Bandwidth File Systems
David Mazieres - New York University
Users rarely consider running network file systems over slow or
wide-area networks, as the performance would be unacceptable.
Nonetheless, efficient remote file access would be desirable in many
situations, particularly over high latency networks. Rather than run
interactive programs such as editors remotely, users could run the
programs locally and manipulate remote files through the file
system. Unfortunately, most network file systems require too much
bandwidth to be practical outside of the local area network.
In this talk I will discuss LBFS, a network file system designed
for low-bandwidth networks. LBFS exploits similarities between files
or versions of the same file to save bandwidth. It avoids sending
data over the network when the same data can already be found in the
server's file system or the client's cache. Using this technique,
LBFS achieves up to two orders of magnitude reduction in bandwidth
utilization on common workloads, compared to traditional network
file systems.

Policy Driven Resource
Allocation of Commodity Servers in an Océano Farm
Karen Appleby, Juliana Cunha, Tamar Eilam, German Goldszmidt - IBM
T. J. Watson Research Center
The Océano project is developing a pilot prototype of a
scalable, manageable hosting infrastructure for e-business
utilities. One of the issues that Océano aims at addressing is
support for peak loads that are an order of magnitude larger than
the average load. Océano mitigates the differences between average
and peak load by sharing the same resources sequentially between the
customers. Threshold events are generated according to performance
monitoring information flow. These threshold events can be a trigger
for actions such as changing the allocation of resources among the
customers.
A problem that has to be addressed is a policy to decide on the
allocation of servers to customers. The main goal is to maximize the
total revenue of the farm. There are two players in a full policy
driven solution to the server allocation problem. o An agreement
structure between the Océano farm owner and the customers (revenue
model). o A server allocation algorithm to maximize the total
revenue of the farm based on the set of agreements with the
customers. We discuss both components in this talk.
The Océano's revenue model is a parameterized structure that
defines some principles according to which charging is performed. It
also gives quantified meaning to different levels of guarantee
(quality of service), and explains how they are taken into account
for charging purposes. An instance of this structure, termed an
Infrastructure Service Level Agreement (ISLA), is an assignment of
values to the set of parameters that are defined in the revenue
model. Every customer is associated with an ISLA. The ISLAs are
mapped and given as input to the resource allocation algorithm. We
present the Océano's revenue model and discuss its relationships
with other existing revenue models.
We also present a novel approach for resource allocation in an
Océano server farm. The main idea is to take advantages of periodic
access patterns in Internet traffic in order to construct a resource
allocation plan that is used to provision servers in advance
according to the expected workload. This planned capacity component
is combined with a reactive component that fine-tunes the allocation
according to the actual behavior of the system. We discuss the
advantages of this combined approach over using one of the
components stand-alone and how it is used for revenue maximization.

An Internet Utility Platform
and a Pub/Sub Messaging Utility
Mike Spreitzer - IBM T. J. Watson Research Center
I will describe the current state of a project to define a
platform for Internet utilities and offer a few specific Internet
utilities on that platform. By utility I mean a service that has a
standardized interface and is offered to all comers; an Internet
utility is one that is delivered over the Internet. I will describe
our thoughts on architecture for an Internet utility platform. I
will also describe our thoughts on how build a pub/sub messaging
utility using that platform.
The value proposition of an Internet utility includes realizing
economies of scale in terms of both human and hardware resources. To
do this requires automated management of the system that provides
the service. This includes dynamically reassigning resources to meet
the changing loads presented by the various customers. We plan to do
this even in situations where multiple customers are served by a
single server.
One of the key challenges in building an Internet utility is
defining what Service Level Agreements (SLAs) will be offered. It is
the meeting of those SLAs that drives the dynamic resource
allocations. In an SLA and its associating pricing scheme it must be
possible to see that both (a) the customer will get good value for
what he pays and (b) the utility operator will get paid for his
expenses. The terms quantified in an SLA must be (1) meaningful and
valuable to the customer, (2) measurable, and (3) controllable by
the automated management system. This is turning out to be a
non-trivial challenge for the pub/sub utillity. I will describe our
current thinking on the subject.

Blue Gene: A Massively
Parallel System
Jose E. Moreira - IBM T. J. Watson Research Center
ABSTRACT: Blue Gene is a massively parallel system being
developed at the IBM T. J. Watson Research Center. With its 4
million-way parallelism and 1 Petaflop peak performance, Blue Gene
is a unique environment for research in parallel processing. Full
exploitation of the machine's capability requires 100-way shared
memory parallelism inside a single-chip multiprocessor node and
message-passing across 30,000 nodes. Even more challenging, this
parallelism has to be exploited in the presence of failed
components, both in the form of entire nodes and in the form of
nodes that have some broken subsystems. New programming models,
languages, compilers, and libraries will need to be investigated and
developed for Blue Gene, therefore offering the opportunity to break
new ground in those areas. In addition, system management and
input/output operations in a system of this scale present their own
challenges. In this talk, I will describe some of the hardware and
software features of Blue Gene. I will also describe some of the
protein science and molecular dynamics computations that are
important driving forces behind Blue Gene.

The Optimistic
Readers Transformation
Rob Strom and Josh Auerbach - IBM T. J. Watson Research Center
Monitors, such as Java classes with synchronized methods, are a
convenient and safe abstraction for designing and reasoning about
multithreaded object-oriented programs. However, the straightforward
implementation of monitors can be inefficient, particularly in
programs in which the majority of calls are to read-only methods. We
introduce the optimistic readers program transformation, which may
be implemented either as a compiler optimization, or as a ``design
pattern.'' This transformation produces an implementation whose
observable behavior is equivalent to that of a monitor, but in which
read-only methods do not acquire locks or perform any shared writes.
As a result, programmers can reason about their programs as if each
shared object were implemented using mutual exclusion, while
achieving the performance benefits of unsynchronized reads. We
present the program transformation using the platform-independent
abstraction CRF. We then demonstrate the performance of this
transformation as applied to benchmarks derived from the core module
of a practical system -- a Java-based publish-subscribe router. We
compare the performance of the optimistic readers transformation to
unoptimized synchronized methods and to reader and writer locks.
This is a preview of a talk to be given at ECOOP 2001 in
Budapest.

An Architecture for
Accelerating Large Scale Distributed Web Applications
Dinesh C. Verma and Seraphin Calo - IBM T. J. Watson Research Center
The prevalent way to access web-based applications over the
Internet is by using a browser that communicates to a web-server
across a number of routers. The web-server typically maintains
static content in data files, and composes dynamic content by
executing programs, typically cgi-bin scripts or Java servlets.
During periods of congestion and traffic overload at the servers,
the response time experienced by the client is often poor.
Typically, as the number of routers lying on the path between the
client and the server increase, the chances of encountering
congestion increases, and the user is likely to see degraded
performance. Different QoS approaches that address the problem have
been proposed, notably the IntServ/RSVP [1] signaling approach based
on the notion of reserving resources within the network and the
DiffServ approach [2] based on the notion of mapping traffic
aggregates into several different service classes. However, the
deployment of any of these approaches requires changes in the basic
infrastructure of the Internet, which creates significant
operational difficulties, and is unlikely to happen in the medium
time range.
An approach that has been partially successful at reducing user
response time is that of caching content closer to the user. A
client can then access web-based content from a proxy server from
which it is likely to obtain a better response time than by going to
the original server [3]. The notion of caching has been extended to
that of content distribution networks by a number of companies such
as Akamai, Digital Island, Cisco, etc. A content distribution
network consists of a network of caching proxy servers to which
clients are directed transparently using various wide-area load
balancing schemes. The caching approach works well for data that is
static and unchanging, e.g. images, video clips, etc. However, such
content forms an increasing smaller percentage of total web-content.
Therefore, there is a need to extend this scheme to other type of
web-applications which include dynamically generated content. As in
the case of caching of static data, it is highly desirable that the
caching of applications be done so that the administrative and
operational control of the data/application resides with the
original server, rather than with the proxy server. A solution is
needed which accelerates applications while still providing the
administrative control of the application from the original server,
rather than the proxy server. In this paper, we present such an
architecture.
The architecture allows either an application developer or
application deployer to split an application into edgable and a non-edgable
components. The edgable components are retrieved by at a proxy cache
server and executed while the non-edgable components are relayed to
the origin web-server. However, the control and management of the
application is kept at the origin web-server. In order to alleviate
the issues involved in managing the proxy servers, our architecture
has been developed to deploy edgeservers with identical
configuration throughout the network. Each edgeserver automatically
configures itself based upon its location with the network. All
management tasks are confined to a single machine for the
administration of edge-servers. All configuration of applications is
confined to the origin web-servers hosting the applications.
During development time, an application developer can mark the
edgable components by defined classes that implement a specific
interface. During deployment time, existing applications that may or
may not use this interface, can define a configuration file that
describes the edgable components to the proxy servers.
We have developed a few sample appliactions on top of this
framework which include websites that create personalized web pages,
generation of banner pages, and lookup in corporate directory
services. Our early experiments demonstrates that applications
developed in this distributed paradigm can significantly improve the
user-percevied response time during periods of network congestion.
References:
[1] R. Braden, L. Zhang, S. Berson, S. Herzog, and S. Jamin,
ReSerVation Protocol (RSVP) Version 1 Functional Specification.
RFC2205, Sept. 1997.
[2] S. Blake et. Al. An Architecture for Differentiated Services,
Internet RFC 2475, Decemember 1998.
[3] Akamai Technologies Inc., "FreeFlow content distribution
service," www.akamai.com.

Virtual-Time Round-Robin: An
O(1) Proportional Share Scheduler
Chris Vaill and Jason Nieh - Columbia University
Proportional share resource management provides a flexible and
useful abstraction for providing performance isolation and fair
resource allocation among users in large multi-user server systems.
However, previous proportional share mechanisms have either weak
proportional sharing accuracy or high scheduling overhead. To
address this problem, we have created Virtual-Time Round-Robin (VTRR),
a proportional share scheduler that can provide fine-grain
proportional sharing accuracy with O(1) scheduling overhead. VTRR
achieves this by combining the benefits of fair queueing algorithms
with a round-robin scheduling mechanism. Unlike many other
schedulers, VTRR is simple to implement. We have implemented a VTRR
CPU scheduler in Linux in less than 100 lines of code. Our
performance results demonstrate that VTRR provides accurate
proportional share allocation with constant, sub-microsecond
scheduling overhead. The scheduling overhead using VTRR is two
orders of magnitude less than the standard Linux scheduler for large
numbers of clients.

Cooperative Caching
Middleware for Cluster-Based Servers
Francisco Matias Cuenca-Acuna and Thu D. Nguyen - Rutgers University
We consider the use of cooperative caching to manage the memories
of cluster-based servers. Over the last several years, a number of
researchers have proposed locality-conscious servers that implement
content-aware request distribution to address this problem [SWEB95,
LARD98, SLARD00 , L2S00, PRESS01]. During this development, it has
become conventional wisdom that cooperative caching cannot match the
performance of these servers [LARD98]. Unfortunately, while
locality-conscious servers provide very high performance, their
request distribution algorithms are typically bound to specific
applications. The advantage of building distributed servers on top
of a block-based cooperative caching layer is the generality of such
a layer; it can be used as a building block for diverse services,
ranging from file systems to web servers.
We have reexamined the question of whether a server built on top
of a generic block-based cooperative caching algorithm can perform
competitively with locality-conscious servers. Specifically, we have
compared the performance of a cooperative caching-based web server
against L2S, a highly optimized locality-conscious server. Our
results show that by modifying the replacement algorithm of
traditional cooperative caching algorithms, we can achieve much of
the performance provided by locality-conscious servers. Our
modification increases network communication to reduce disk
accesses, a reasonable trade-off considering the current trend of
relative performance between LANs and disks.
