Home

Invited Talk

Global Internet Content Delivery
Bruce Maggs
Akamai Technologies and Carnegie-Mellon University

This talk describes Akamai's Internet content delivery service called FreeFlow. Akamai has deployed over 8000 servers on more than 350 networks around the world, and delivers content for more than 1500 customers. The talk begins with a review of the mechanics of content delivery on the Internet, and then examines the unique features of Akamai's massively distributed system. After addressing several of the technological challenges faced in designing such a system, the talk concludes by presenting two theoretical problems that arose during its implementation.

Short Talks

Neptune: Replication Infrastructure for Cluster-Based Network Services
Kai Shen, Tao Yang, Lingkun Chu, JoAnne L. Holliday, Doug Kuschner, Huican Zhu - University of California at Santa Barbara

Previous research has addressed the scalability and availability of building large-scale cluster-based network services. This project studies the clustering of replicated network services when the persistent service data is frequently updated. The infrastructural middleware we built, called Neptune, provides a flexible interface to glue and replicate existing service modules and accommodate a variety of underlying storage mechanisms. Neptune maintains dynamic and location-transparent mapping to isolate faulty service modules and enforce replica consistency. It allows efficient use of multiple-level replication consistency mechanisms with staleness control and failure recovery support. In this talk we present Neptune's overall architecture and data replication support, and we illustrate its performance using three network services.
http://www.cs.ucsb.edu/research/neptune

Optimistic Active Replication
Pascal Felber - Bell Labs, Lucent Technologies

Replication is a powerful technique for increasing availability of a distributed service. Algorithms for replicating distributed services do however face a dilemma: they should be (1) efficient (low latency), while (2) ensuring consistency of the replicas, which are two contradictory goals. To be efficient, some Atomic Broadcast algorithms used for active replication deliberately sacrifice consistency, if inconsistency is likely to occur with a low probability. We present an algorithm that handles replication efficiently in most scenarios, while always preventing  inconsistencies.

It Takes Two To Tangle -
Censorship Resistant Publishing Through Document Entanglements

Marc Waldman and David Mazieres - New York University

Today, documents available over the the Internet are easy to censor. Each document can usually be traced back to a specific host or even the individual responsible for publishing the document. Someone wishing to censor a document can use the courts, threats, or some other means to force the host administrator or author to delete a particular file. Even if these methods prove unsuccessful, various denial of service attacks can be launched against the hosting server in order to make the document difficult, if not impossible, to retrieve.

Publishing the document on many hosts may seem to be the answer. However, no standard naming convention exists that allows one to easily specify several hosts via a single name. Even if such a naming convention existed it merely makes the censor's job somewhat harder --- the censor still knows exactly which hosts contain the content and therefore which hosts to attack. In addition to the naming difficulty there is little incentive for a hosting provider to store documents that he is being pressured into deleting.

We propose a distributed censorship resistant file system that publishes the blocks of a document in such a way that it is very difficult for an adversary to delete all the component blocks of a file. This is accomplished, in part, by using Shamir's secret sharing algorithm. The component blocks of newly published documents are split into secret shares, half of which also belong to other documents.

Fault-Tolerant TCP (FT-TCP)
Thomas C. Bressoud - Bell Labs, Lucent Technologies

In this talk I will describe an implementation of fault-tolerant TCP (FT-TCP) that allows a faulty server to keep its TCP connections open until it either recovers or it is failed over to a backup. The failure and recovery of the server process are completely transparent to client processes connected with it via TCP. The system does not use a proxy and no changes to the client software or the TCP implementation on the server are required.

Distributed Computing over Networks of Embedded Systems using Smart Messages
Phillip Stanley-Marbell, Cristian Borcea, Kiran Nagaraja, Liviu Iftode - Rutgers University

Embedded systems have outnumbered traditional desktop computing systems for several years and will continue to do so in the future. A new trend in embedded systems is to provide networking, either wired or wireless. As device costs plummet, it will soon be feasible to deploy large networks of embedded computing devices. These networks will be inherently heterogeneous, both in the interconnection technologies, and in the individual functions of the component nodes. These embedded computing devices will typically be mobile and have constrained energy resources, and are very likely to be battery powered.

To benefit from programmability, and the aggregated computing resources deployed in these networks, new distributed computing models must be employed, which will necessarily be different from the traditional distributed computing models for several reasons. First, at such large network scale, a property-based naming rather than a unique identifier for the participant nodes in the computation must be supported. Second, given the fluidity and volatility of these networks in terms of node configuration and network topology, it may be impossible to synchronize computation, and round-trip communication may never complete. Third, applications should accept partial execution as long as it is relevant, i.e. meets a certain {\bf quality of result} (QoR).

Presented is a system architecture, Smart Messages, for computation and communication in large networks of embedded systems. In this model, communication is realized by sending Smart Messages in the network. These messages are comprised of code, which is executed at each hop in the path of the message, and a payload which the message carries in the network. The execution of the message at each hop determines the next hop in the path of the message, thus smart messages are responsible for their own routing.

The goal of the Smart Message architecture is to keep the support required from nodes in the network to the bare minimum, placing intelligence in the Smart Messages rather than in individual nodes. Placing intelligence in SMs, and providing a common minimal support from cooperative nodes, provides flexibility and obviates the need for the potentially impossible task of updating all nodes in a network for the implementation of a new application or protocol.

This talk details the motivation for Smart Messages, describes the smart message model and proposes a system architecture to support this model.

ARMS - Automated Replication for Meshed Servers
Weibin Zhao and Henning Schulzrinne - Columbia University

Server replication is a widely used mechanism to provide fault tolerance, enhance availability and improve performance in distributed systems, e.g., mirrored Web servers, FTP servers, DNS servers and directory servers. Replication can be synchronous in which an update delivers to all servers or none of them, or asynchronous in which an update delivers to one server first, and then propagates to other servers. Synchronous replication is difficult to achieve in wide area where asynchronous approach is more promising. However, propagating updates promptly among replicated servers is a challenging issue.

This paper describes ARMS, an automated replication protocol for meshed servers. ARMS supports wide area server replications and automates replication process by employing service discovery technology. It is highly efficient for dynamic data, and can be used in multiple server groups.

ARMS uses the asynchronous replication model, but propagates updates in a quasi-synchronous way. Replicated servers in ARMS maintain a mesh topology. Updates are propagated in two ways: batch transfer and direct forwarding. Batch transfer is used to exchange bulk data among servers to let them catch up updates with each other. It is useful for initial population exchanges when new servers are booted, failed servers are rebooted, or network partitions are fixed. Direct forwarding is used to propagate new updates promptly. Each server maintains logical connections to all other servers and forwards updates using best-effort. In ARMS, each data entry has a unique updating ID which includes the server's URL, who initially performs the update, and the updating timestamp at that server. From time to time, a server sends its updating report to other servers. Newer updates are exchanged in batch among servers.

Traditional replication systems often have management overheads, such as static group membership configuration. ARMS supports dynamic changes of group membership, which facilities new server joining and old server leaving. Servers in ARMS periodically send their advertisements for presence using multicast. They also forward new server advertisements to each other. Furthermore, replicated servers exchange group membership information regularly.

ARMS is designed to support dynamic data efficiently. Dynamic data are soft state in that they are valid only for the specified periods of time. They need to be refreshed periodically, otherwise they will be removed from servers when they are timeout. Examples of dynamic data include the using of lifetime in SLP's Directory Agents, the using of lease in Jini's Lookup Services, and the using of time-to-live in LDAP. As there are many update operations for dynamic data including refreshes, ARMS choose to propagate updating results instead of updating operations. This way no updating log is needed, and several updating operations on one data entry can be propagated as a single updating result.

ARMS supports server replications in multiple groups. Servers detect the group information in server advertisements, and join designated groups (a server can join multiple groups). ARMS also addresses other important replication issues, such as resolving updating conflicts from clients, handling deleted data entries. Special care is given in ARMS to various failure scenarios. ARMS uses keepalive messages to handle server failure and network partitions. ARMS has been implemented for mSLP - Mesh-enhanced Service Location Protocol, further details can be found at http://www.cs.columbia.edu/~zwb/project/slp/.

LBFS: Low-Bandwidth File Systems
David Mazieres - New York University

Users rarely consider running network file systems over slow or wide-area networks, as the performance would be unacceptable. Nonetheless, efficient remote file access would be desirable in many situations, particularly over high latency networks. Rather than run interactive programs such as editors remotely, users could run the programs locally and manipulate remote files through the file system. Unfortunately, most network file systems require too much bandwidth to be practical outside of the local area network.

In this talk I will discuss LBFS, a network file system designed for low-bandwidth networks. LBFS exploits similarities between files or versions of the same file to save bandwidth. It avoids sending data over the network when the same data can already be found in the server's file system or the client's cache. Using this technique, LBFS achieves up to two orders of magnitude reduction in bandwidth utilization on common workloads, compared to traditional network file systems.

Policy Driven Resource Allocation of Commodity Servers in an Océano Farm
Karen Appleby, Juliana Cunha, Tamar Eilam, German Goldszmidt - IBM T. J. Watson Research Center

The Océano project is developing a pilot prototype of a scalable, manageable hosting infrastructure for e-business utilities. One of the issues that Océano aims at addressing is support for peak loads that are an order of magnitude larger than the average load. Océano mitigates the differences between average and peak load by sharing the same resources sequentially between the customers. Threshold events are generated according to performance monitoring information flow. These threshold events can be a trigger for actions such as changing the allocation of resources among the customers.

A problem that has to be addressed is a policy to decide on the allocation of servers to customers. The main goal is to maximize the total revenue of the farm. There are two players in a full policy driven solution to the server allocation problem. o An agreement structure between the Océano farm owner and the customers (revenue model). o A server allocation algorithm to maximize the total revenue of the farm based on the set of agreements with the customers. We discuss both components in this talk.

The Océano's revenue model is a parameterized structure that defines some principles according to which charging is performed. It also gives quantified meaning to different levels of guarantee (quality of service), and explains how they are taken into account for charging purposes. An instance of this structure, termed an Infrastructure Service Level Agreement (ISLA), is an assignment of values to the set of parameters that are defined in the revenue model. Every customer is associated with an ISLA. The ISLAs are mapped and given as input to the resource allocation algorithm. We present the Océano's revenue model and discuss its relationships with other existing revenue models.

We also present a novel approach for resource allocation in an Océano server farm. The main idea is to take advantages of periodic access patterns in Internet traffic in order to construct a resource allocation plan that is used to provision servers in advance according to the expected workload. This planned capacity component is combined with a reactive component that fine-tunes the allocation according to the actual behavior of the system. We discuss the advantages of this combined approach over using one of the components stand-alone and how it is used for revenue maximization.

An Internet Utility Platform and a Pub/Sub Messaging Utility
Mike Spreitzer - IBM T. J. Watson Research Center

I will describe the current state of a project to define a platform for Internet utilities and offer a few specific Internet utilities on that platform. By utility I mean a service that has a standardized interface and is offered to all comers; an Internet utility is one that is delivered over the Internet. I will describe our thoughts on architecture for an Internet utility platform. I will also describe our thoughts on how build a pub/sub messaging utility using that platform.

The value proposition of an Internet utility includes realizing economies of scale in terms of both human and hardware resources. To do this requires automated management of the system that provides the service. This includes dynamically reassigning resources to meet the changing loads presented by the various customers. We plan to do this even in situations where multiple customers are served by a single server.

One of the key challenges in building an Internet utility is defining what Service Level Agreements (SLAs) will be offered. It is the meeting of those SLAs that drives the dynamic resource allocations. In an SLA and its associating pricing scheme it must be possible to see that both (a) the customer will get good value for what he pays and (b) the utility operator will get paid for his expenses. The terms quantified in an SLA must be (1) meaningful and valuable to the customer, (2) measurable, and (3) controllable by the automated management system. This is turning out to be a non-trivial challenge for the pub/sub utillity. I will describe our current thinking on the subject.

Blue Gene: A Massively Parallel System
Jose E. Moreira - IBM T. J. Watson Research Center

ABSTRACT: Blue Gene is a massively parallel system being developed at the IBM T. J. Watson Research Center. With its 4 million-way parallelism and 1 Petaflop peak performance, Blue Gene is a unique environment for research in parallel processing. Full exploitation of the machine's capability requires 100-way shared memory parallelism inside a single-chip multiprocessor node and message-passing across 30,000 nodes. Even more challenging, this parallelism has to be exploited in the presence of failed components, both in the form of entire nodes and in the form of nodes that have some broken subsystems. New programming models, languages, compilers, and libraries will need to be investigated and developed for Blue Gene, therefore offering the opportunity to break new ground in those areas. In addition, system management and input/output operations in a system of this scale present their own challenges. In this talk, I will describe some of the hardware and software features of Blue Gene. I will also describe some of the protein science and molecular dynamics computations that are important driving forces behind Blue Gene.

Rob Strom and Josh Auerbach - IBM T. J. Watson Research Center

Monitors, such as Java classes with synchronized methods, are a convenient and safe abstraction for designing and reasoning about multithreaded object-oriented programs. However, the straightforward implementation of monitors can be inefficient, particularly in programs in which the majority of calls are to read-only methods. We introduce the optimistic readers program transformation, which may be implemented either as a compiler optimization, or as a design pattern.'' This transformation produces an implementation whose observable behavior is equivalent to that of a monitor, but in which read-only methods do not acquire locks or perform any shared writes. As a result, programmers can reason about their programs as if each shared object were implemented using mutual exclusion, while achieving the performance benefits of unsynchronized reads. We present the program transformation using the platform-independent abstraction CRF. We then demonstrate the performance of this transformation as applied to benchmarks derived from the core module of a practical system -- a Java-based publish-subscribe router. We compare the performance of the optimistic readers transformation to unoptimized synchronized methods and to reader and writer locks.

This is a preview of a talk to be given at ECOOP 2001 in Budapest.

An Architecture for Accelerating Large Scale Distributed Web Applications
Dinesh C. Verma and Seraphin Calo - IBM T. J. Watson Research Center

The prevalent way to access web-based applications over the Internet is by using a browser that communicates to a web-server across a number of routers. The web-server typically maintains static content in data files, and composes dynamic content by executing programs, typically cgi-bin scripts or Java servlets. During periods of congestion and traffic overload at the servers, the response time experienced by the client is often poor. Typically, as the number of routers lying on the path between the client and the server increase, the chances of encountering congestion increases, and the user is likely to see degraded performance. Different QoS approaches that address the problem have been proposed, notably the IntServ/RSVP [1] signaling approach based on the notion of reserving resources within the network and the DiffServ approach [2] based on the notion of mapping traffic aggregates into several different service classes. However, the deployment of any of these approaches requires changes in the basic infrastructure of the Internet, which creates significant operational difficulties, and is unlikely to happen in the medium time range.

An approach that has been partially successful at reducing user response time is that of caching content closer to the user. A client can then access web-based content from a proxy server from which it is likely to obtain a better response time than by going to the original server [3]. The notion of caching has been extended to that of content distribution networks by a number of companies such as Akamai, Digital Island, Cisco, etc. A content distribution network consists of a network of caching proxy servers to which clients are directed transparently using various wide-area load balancing schemes. The caching approach works well for data that is static and unchanging, e.g. images, video clips, etc. However, such content forms an increasing smaller percentage of total web-content. Therefore, there is a need to extend this scheme to other type of web-applications which include dynamically generated content. As in the case of caching of static data, it is highly desirable that the caching of applications be done so that the administrative and operational control of the data/application resides with the original server, rather than with the proxy server. A solution is needed which accelerates applications while still providing the administrative control of the application from the original server, rather than the proxy server. In this paper, we present such an architecture.

The architecture allows either an application developer or application deployer to split an application into edgable and a non-edgable components. The edgable components are retrieved by at a proxy cache server and executed while the non-edgable components are relayed to the origin web-server. However, the control and management of the application is kept at the origin web-server. In order to alleviate the issues involved in managing the proxy servers, our architecture has been developed to deploy edgeservers with identical configuration throughout the network. Each edgeserver automatically configures itself based upon its location with the network. All management tasks are confined to a single machine for the administration of edge-servers. All configuration of applications is confined to the origin web-servers hosting the applications.

During development time, an application developer can mark the edgable components by defined classes that implement a specific interface. During deployment time, existing applications that may or may not use this interface, can define a configuration file that describes the edgable components to the proxy servers.

We have developed a few sample appliactions on top of this framework which include websites that create personalized web pages, generation of banner pages, and lookup in corporate directory services. Our early experiments demonstrates that applications developed in this distributed paradigm can significantly improve the user-percevied response time during periods of network congestion.

References:

[1] R. Braden, L. Zhang, S. Berson, S. Herzog, and S. Jamin, ReSerVation Protocol (RSVP) Version 1 Functional Specification. RFC2205, Sept. 1997.

[2] S. Blake et. Al. An Architecture for Differentiated Services, Internet RFC 2475, Decemember 1998.

[3] Akamai Technologies Inc., "FreeFlow content distribution service," www.akamai.com.

Virtual-Time Round-Robin: An O(1) Proportional Share Scheduler
Chris Vaill and Jason Nieh - Columbia University

Proportional share resource management provides a flexible and useful abstraction for providing performance isolation and fair resource allocation among users in large multi-user server systems. However, previous proportional share mechanisms have either weak proportional sharing accuracy or high scheduling overhead. To address this problem, we have created Virtual-Time Round-Robin (VTRR), a proportional share scheduler that can provide fine-grain proportional sharing accuracy with O(1) scheduling overhead. VTRR achieves this by combining the benefits of fair queueing algorithms with a round-robin scheduling mechanism. Unlike many other schedulers, VTRR is simple to implement. We have implemented a VTRR CPU scheduler in Linux in less than 100 lines of code. Our performance results demonstrate that VTRR provides accurate proportional share allocation with constant, sub-microsecond scheduling overhead. The scheduling overhead using VTRR is two orders of magnitude less than the standard Linux scheduler for large numbers of clients.

Cooperative Caching Middleware for Cluster-Based Servers
Francisco Matias Cuenca-Acuna and Thu D. Nguyen - Rutgers University

We consider the use of cooperative caching to manage the memories of cluster-based servers. Over the last several years, a number of researchers have proposed locality-conscious servers that implement content-aware request distribution to address this problem [SWEB95, LARD98, SLARD00 , L2S00, PRESS01]. During this development, it has become conventional wisdom that cooperative caching cannot match the performance of these servers [LARD98]. Unfortunately, while locality-conscious servers provide very high performance, their request distribution algorithms are typically bound to specific applications. The advantage of building distributed servers on top of a block-based cooperative caching layer is the generality of such a layer; it can be used as a building block for diverse services, ranging from file systems to web servers.

We have reexamined the question of whether a server built on top of a generic block-based cooperative caching algorithm can perform competitively with locality-conscious servers. Specifically, we have compared the performance of a cooperative caching-based web server against L2S, a highly optimized locality-conscious server. Our results show that by modifying the replacement algorithm of traditional cooperative caching algorithms, we can achieve much of the performance provided by locality-conscious servers. Our modification increases network communication to reduce disk accesses, a reasonable trade-off considering the current trend of relative performance between LANs and disks.

 Last Update: 04/06/2001