A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable. Leslie Lamport, a foundational thinker in the theory of distributed computing.
Distributed computing has become pervasive. A typical enterprise application involves widely dispersed traders within a city talking to a duplexed server at the primary site with an identical configuration at a disaster recovery site. Many such applications are extended to transoceanic operation. Embedded system enterprises from aerospace to telecommunications have developed or announced major distributed computing systems. (Typical question: How do real-time guarantees interact with fault tolerance?)
The ideal distributed system is cheaper, faster, more highly available, and safer than a centralized system. Unfortunately, few ideal distributed systems exist as the Lamport quote above suggests. The reasons often have to do with programming complexity, but often also have to do with inherent tradeoffs between safety and availability. For example, safety suggests that all server updates go to the primary and backup, but should the primary stop if the backup has failed? Such tradeoffs disappear if we are less demanding about what safety means. Understanding the reasons underlying such tradeoffs helps one make better decisions. (Why are the most successful distributed systems essentially communications applications?)
Distributing work, especially across the web, brings with it a new set of problems as well: How does one manage networks? How does one trust that others do the work they promise or even who those workers are? How does one interface with systems one doesn't know? How can one protect intellectual property at a distance? How does voice over IP work?
Consider the possibility of emergent properties in distributed systems. For example, if you were to design a room for fast exit, you wouldn't put a pillar in front of the door. Yet "swarm intelligence" studies have suggested doing exactly that for movie theatres. What do distributed systems tell us about mass behavior?
In summary, our approach is to find the common underlying principles that underlie most distributed systems. I prefer algorithms to impossibility results though you will see a few because it is important that you see a few. I also very much prefer algorithms/protocols that people are likely to use (or use variants of) in practice. Finally, I want to train you to think about how to model your problem, particularly synchrony and failure assumptions. I will discuss underlying data communication models to make sure you understand the whole picture. So we will look at several models as we go. Sometimes we will engage in in-class challenge and I will expect students to be prepared to think and, in some cases, move around.
Unless otherwise indicated, the material comes from the combination of:
Here is the syllabus in order:
Two homeworks ( homework 1 and homework 2) requiring substantial thought but no programming. One project involving route discovery in an adversarial networks. Once again, the amount of code should not be large, but the thought should be. There is no final exam.
For both homeworks and the project, you may work alone or with at most one partner.
Lateness policy: Late homeworks or project will not be accepted . If you have a medical excuse, then we will simply not count that homework and your grade will be the average over your other work. We will spend some time reviewing the homeworks either the night they are due or the following week. That is one reason for the intolerant lateness policy (see below).
Here is a mail server
Here are the rules regarding academic honesty.
An excellent theoretical text: Distributed Algorithms, Nancy Lynch, published by Morgan-Kaufmann
A closer-to-the-network text: Distributed Systems, Sape Mullender, Addison-Wesley.
A book that discusses competitive analysis in great depth: Approximation Algorithms by Vijay Vazirani
Here is a beautiful example of vastly improving performance by changing a protocol done by my colleague Lakshmi Subramanian
Here is a survey paper in PDF format from ACM Computing Surveys vol. 31, number 1: Fundamentals of Fault-Tolerant Distributed Computing in Asynchronous Environments by Felix Gaertner This is a copy made available for classroom use without fee.
Coordinated attack over unreliable networks, muddy children, and knowledge logic (notes by Lenore Zuck).
Snapshot protocol plus first notes on faulty security protocols (notes by Lenore Zuck).
More notes on faulty security protocols (notes by Lenore Zuck).
Scientific American article about swarm intelligence
"Designing Masking Fault-tolerance via Nonmasking Fault-tolerance" by Anish Arora and Sandeep S. Kulkarni This closely resembles a self-stabilizing network. This paper, like many other excellent papers, is on Nancy Lynch's course web site at MIT.
A nice brief introduction to MPI software for building distributed systems. (I like Linda better.)
A historical view of
Links to papers about
a technology whose goal it is to use lots of standardized parts to deliver
high quality service even though devices have no central manager.
Links to papers about grid computing, a technology whose goal it is to use lots of standardized parts to deliver high quality service even though devices have no central manager.