Programming Project: A Web Distribution Manager for a Web Content Delivery Service

Internet and Intranet Protocols and Applications

Spring 2000

Prof. Arthur P. Goldberg

Modification history

Modifications:

Phases and Phase I revised: 3/9

Update due dates: 3/30

Incorporate emailed instructions into Phase I, specify Phase II, mostly specify Phase III: 4/4

Fix AbsoluteURI rewrite, add extension request, link to TA’s Phase I: 4/5

Clarify HTML: white space in comments, and syntactically correct: 4/10

Add table describing attribute URLs: 4/11

Significantly elaborate Phase III specification: 4/16

Add HEAD request to Phase III, default caching rule, share the ‘are-objects-cachable’ table: 4/20

Summary

A Web Content Delivery service (WCS) distributes network and host load by moving request traffic from publishers’ servers to content servers.  For this project, write a web distribution manager (WDM), which is a critical part of a WCS.

This assignment is due in 3 phases, each of which is worth 10 points.  In addition, questions about your program’s design and implementation will be due with phase III and worth 10 points.

The phases are due as follows:

                I: 4/2

                II: 4/16

                III: 4/30

You will learn about server socket programming, server concurrency, robot client programming, timeout handling, the HTTP specification, and HTTP message semantics, formats and headers.

This document is at http://www.cs.nyu.edu/artg/internet/S00/load_balancing_assignment/index.htm.

Introduction

A WCS operates a global farm of servers.  For example, a WCS might operate 500 server machines at 10 server sites around the world.  These machines run standard-compliant (HTTP/1.0 and HTTP/1.1) content distributor web servers (CDWS).

Consider a web publisher (WP) who publishes their content on a primary Web server (PS).  The WP is entirely responsible for creating and managing the PS.  The WP hires the WCS to host and distribute the WP's static traffic.

The WCS distributes load in 2 steps: static content distribution, and HTML rewriting.

The WCS places a web distribution manager (WDM) as a filter between the PS and the network. All browser requests and responses for the PS pass through the WDM.

The CDWS are loaded with copies of static objects (images, Java programs, etc.) from the PS.  The WDM rewrites HTML responses sent by the PS (on-the-fly) so the domain names of hypertext links to static objects reference a CDWS server.

For example, consider the link to the image on my home page (http://WWW.CS.NYU.EDU/artg/), expressed as the HTML

<img SRC="http://WWW.CS.NYU.EDU/artg/arthur3.gif">

then the WDM for the WCS company wcs.com would rewrite this to say

<img SRC="http://www.server18.wcs.com/WWW.CS.NYU.EDU/artg/arthur3.gif">

and the browser would load the image arthur3.gif from www.server18.wcs.com, a server in the CDWS.

This load distribution method works well because many of the bytes served by the PS are static objects that can be pre-loaded into the content distributor web servers.  Note that every PS is assigned unique names in the CDWS.

Major WCS providers include Akamai and Sandpiper Networks (now merged with Digital Island).  I recommend you access one of their customers (Akamai’s, Digital Island’s) and see how they rewrite HTML.  Also, read a presentation from May 1999 (ppt) by the CTO of Sandpiper.

General Requirements

The WDM must handle all error codes returned by system calls and library routines.  The WDM must avoid deadlock, avoid infinite resource use, and avoid busy waiting.

There may be command line options.

Phases

Write the WDM in phases, passing in each phase separately.  I expect that each phase of your WDM will reuse the code from the previous phase, although this is not required.

Your primary specification for WDM will be the HTTP/1.1 Specification, RFC 2616.  Where appropriate, the WDM must comply with the HTTP/1.1 specification.

Comply with sections 1.1, 1.4, 3.1, 3.2.2, 4.1, 4.2, 4.3, 4.4, 5.*, 6.*, 8.1, 9.3, 10.2.1, 10.4.1, 10.4.5, 10.4.9, 10.5.5, 10.5.6, 14.10, 14.13, 14.23, 14.30, 14.38, 14.41 (perhaps), 14.45.

I: Behave like a proxy

Grading program

The WDM talks HTTP.  It receives a GET request from the browser, forwards the request to the PS, receives the HTTP response, and forwards the response to the browser.  It then closes the TCP connections.

Recognize and reject syntactically incorrect HTTP requests.

Ignore other headers in requests and responses, but you MUST pass them without modification.

Comply with sections 1.1, 1.4, 3.1, 3.2.2, 4.1, 4.2, 4.3, 4.4, 5.*, 6.*, 9.3, 10.2.1, 10.4.1, 10.4.5, 10.5.6, 14.23, 14.45.

Section Details

3.2.2

Although the spec says “The use of IP addresses in URLs SHOULD be avoided whenever possible,” WDM MUST support a host that is an IP address.  Ignore the last 2 sentences of the section.

4.4

To determine the length of a message body, support these techniques: the Content-Length header, and the server closing the connection.  You will never have to support the self-delimiting "multipart/byteranges" technique.
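The two techniques above can be sketched as follows.  This is only an illustration in Python, not a required design: the function name read_body, the binary file-like stream (for example, one made with socket.makefile("rb")), and the dictionary of headers keyed by lower-case name are all assumptions of the sketch.

```python
def read_body(stream, headers):
    """Read a message body from a binary file-like object.

    headers is assumed to be a dict keyed by lower-case header name.
    """
    length = headers.get("content-length")
    if length is None:
        # No Content-Length: the body ends when the server closes
        # the connection, so read until end of file.
        return stream.read()
    expected = int(length)  # raises ValueError on a malformed value
    if expected < 0:
        raise ValueError("negative Content-Length")
    body = stream.read(expected)
    if len(body) < expected:
        # The connection closed before the promised bytes arrived.
        raise EOFError("body shorter than Content-Length")
    return body
```

Note that some responses (for example, any response to a HEAD request) carry no body at all, so a caller would check for those cases before calling a function like this one.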

5.1.2

Assume Request-URI    = absoluteURI | abs_path. 

Change

“In order to avoid request loops, a proxy MUST be able to recognize all of its server names, including any aliases, local variations, and the numeric IP address.”

to

“In order to avoid request loops, a proxy MUST be able to recognize its server name and its numeric IP address.”

WDM's client does not know that WDM exists.  Therefore, the request is formatted for a standard web server.  The web server is actually WDM, which behaves 'like' a proxy, but is transparent to the client.  Thus, for this assignment an abs_path is a legal Request-URI.

9.3

Assume that WDM will receive only GET requests.

10.4.1

Do not forward syntactically bad HTTP requests.  Detect them and return status code 400.

For example, return status code 400 if an absolute URI is given, and the Host value does not agree with the host field in the absolute URI.

10.4.5

WDM MUST detect requests for WDM’s server name or numeric IP address.  This will avoid request loops.  Return status code 404 in this case.  (A clearer error would be nice, but the HTTP/1.1 Spec doesn't provide one.)
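The checks in 10.4.1 and 10.4.5 might be combined roughly as below.  This is a sketch under several assumptions that are mine, not part of the specification: the name check_request, headers passed as a dict keyed by lower-case name, own_names as a set holding WDM's lower-case server name and numeric IP address, port 80 as the default when comparing hosts, and rejecting an abs_path request that has no Host header (since WDM would not know where to forward it).  It does not cover every way a request can be malformed.

```python
from urllib.parse import urlsplit

def _hostport(value):
    """Normalize 'Host' or 'host:port' for comparison (default port 80)."""
    value = value.strip().lower()
    return value if ":" in value else value + ":80"

def check_request(request_line, headers, own_names):
    """Return a status code to reject the request with, or None to forward it."""
    parts = request_line.split(" ")
    if len(parts) != 3 or not parts[2].startswith("HTTP/"):
        return 400                      # not 'Method SP Request-URI SP HTTP-Version'
    uri = parts[1]
    host = headers.get("host")
    if uri.startswith("/"):             # abs_path: the target comes from Host
        if host is None:
            return 400
        target = host
    else:                               # otherwise it must be an absoluteURI
        split = urlsplit(uri)
        if split.scheme.lower() != "http" or not split.netloc:
            return 400
        if host is not None and _hostport(host) != _hostport(split.netloc):
            return 400                  # Host disagrees with the absoluteURI
        target = split.netloc
    name = _hostport(target).rsplit(":", 1)[0]
    if name in own_names:
        return 404                      # request names WDM itself: a loop
    return None
```

For example, with own_names = {"wdm.cs.nyu.edu", "192.0.2.1"} (a made-up address), a request line "GET /x HTTP/1.1" with Host: WDM.cs.nyu.edu yields 404, while the same request with Host: www.cs.nyu.edu is forwarded.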

14.45

via = received-by

II: Concurrent and Rewrite

Grading Program

Continue delivering the functionality in Phase I.

Concurrently respond to multiple requests from different clients.  In particular, suppose request R1 arrives and WDM is forwarding R1 to the PS.  WDM MUST be able to receive and completely serve another request R2 before the PS finishes responding to R1.
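One way to meet this requirement is a thread per connection, sketched below in Python.  It is only one of several acceptable designs; the names serve and handle are hypothetical, and handle stands for whatever routine reads one request from a connected socket and writes the response.

```python
import socket
import threading

def serve(listener, handle):
    """Accept connections on a listening socket, one thread per client.

    A slow request (R1) blocks only its own thread, so a later
    request (R2) can be accepted and finished in the meantime.
    """
    def run(conn):
        with conn:            # always close the client connection
            handle(conn)

    while True:
        try:
            conn, _addr = listener.accept()
        except OSError:
            return            # the listening socket was closed
        threading.Thread(target=run, args=(conn,), daemon=True).start()
```

A real WDM would also bound the number of threads and set socket timeouts, so that it avoids infinite resource use as required above.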

Rewrite HTML for load balancing.  Assume the HTML conforms to the HTML 4.01 Specification.  Assume that WDM receives syntactically correct HTML.  You do not need to fully parse the HTML.  Simply skip over comments and identify and parse the elements below.

Identify HTML responses by detecting a Content-Type: text/html header.

Skip over HTML comments, which are described in Section 3.2.4 of the HTML Specification.  In particular “White space is ... permitted between the comment close delimiter ("--") and the markup declaration close delimiter (">").”  Do not remove the comments from the response body. 
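As an illustration, comments can be separated from the rest of the body without removing them.  The sketch below (Python, with the hypothetical name split_comments) assumes the simple comment form in which a comment starts with "<!--" and ends at the first "--" followed by optional white space and ">".

```python
import re

# "<!--", then as little as possible, then "--", optional white space, ">".
COMMENT = re.compile(r"<!--.*?--\s*>", re.DOTALL)

def split_comments(html):
    """Yield (is_comment, text) pieces that concatenate back to html."""
    pos = 0
    for match in COMMENT.finditer(html):
        if match.start() > pos:
            yield False, html[pos:match.start()]
        yield True, match.group(0)
        pos = match.end()
    if pos < len(html):
        yield False, html[pos:]
```

Only the pieces with is_comment False would be searched for inclusion elements; every piece, comment or not, is written back to the response in order.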

Replace domain names in links to the 3 static HTML inclusion elements described in section 13.1 of the HTML Specification: IMG (includes an image), OBJECT (includes a generic object), and APPLET (includes an applet).

Within these elements examine the attribute that indicates the URI of the object, as indicated. 

Element    Attribute that indicates the URI of the object
IMG        src
OBJECT     data
APPLET     code

For this phase, we assume that the src URI references a static object if and only if the URI does NOT contain a query string.  Rewrite ONLY URIs that reference static objects.  WDM MUST NOT modify any other parts of the HTML, and WDM MUST NOT change the order of attributes within an inclusion element.  The only change WDM may make is to change the length of linear white space (LWS), provided, of course, WDM does NOT remove it.
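A minimal sketch of touching only the URI attribute of the three inclusion elements appears below.  It is Python, it uses regular expressions rather than a parser, and it makes simplifying assumptions that a careful implementation might not: no ">" occurs inside an attribute value, and the attribute name does not appear inside another attribute's quoted value.  The names rewrite_elements and rewrite are hypothetical; rewrite stands for whatever function maps an old URI to a new one (returning the URI unchanged when it should not be rewritten).

```python
import re

# Element name -> attribute that holds the object's URI (from the table above).
URI_ATTRIBUTE = {"img": "src", "object": "data", "applet": "code"}

# Assumption: no ">" appears inside an attribute value of these elements.
START_TAG = re.compile(r"<(img|object|applet)\b[^>]*>", re.IGNORECASE)

def rewrite_elements(html, rewrite):
    """Apply rewrite() to the URI attribute of IMG, OBJECT and APPLET tags.

    Everything else, including attribute order, quoting and white
    space, is copied through unchanged.
    """
    def fix_tag(tag_match):
        attr = URI_ATTRIBUTE[tag_match.group(1).lower()]
        value = re.compile(
            r"""(\s%s\s*=\s*)("[^"]*"|'[^']*'|[^\s>]+)""" % attr,
            re.IGNORECASE)

        def fix_value(m):
            raw = m.group(2)
            quote = raw[0] if raw[0] in "\"'" else ""
            uri = raw[1:-1] if quote else raw
            return m.group(1) + quote + rewrite(uri) + quote

        return value.sub(fix_value, tag_match.group(0), count=1)

    return START_TAG.sub(fix_tag, html)
```

This function does not itself skip comments, so it would be applied only to the parts of the body that lie outside comments.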

The URI may be either an absoluteURI or an abs_path.

If the URI is an absoluteURI, which you may assume takes the form

absoluteURI = http://host[:port]/abs_path,

then rewrite absoluteURI according to the function

Rewritten(absoluteURI) = http://wdm.cs.nyu.edu/host[:port]/abs_path.  In the rewritten absoluteURI the host MUST be in lower case, without URL escape sequences.

If the URI is an abs_path, which takes the form

abs_path = token,

then determine the base URI and rewrite src by rewriting the base URI and then appending the abs_path.  The base URI is defined and discussed in sections 3.2.1 and 14.14.

For example, consider the link to the image on my home page (http://WWW.CS.NYU.EDU/artg/), expressed as the HTML

<img SRC="http://WWW.CS.NYU.EDU/artg/arthur3.gif">

then the WDM would rewrite this to say

<img SRC="http://wdm.cs.nyu.edu/www.cs.nyu.edu/artg/arthur3.gif">

and the browser would load the image arthur3.gif from wdm.cs.nyu.edu, a server in the CDWS (yes, I realize that we’re not creating a farm of servers, but that’s another level of functionality).
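The Rewritten() function above might be sketched as follows.  The names WDM_HOST and rewrite_uri are hypothetical, and the sketch assumes that the base URI has already been determined and is passed in as an absoluteURI.  URIs that contain a query string, or that are neither an http absoluteURI nor an abs_path, are returned unchanged.

```python
from urllib.parse import urlsplit, unquote

WDM_HOST = "wdm.cs.nyu.edu"   # the host named in Rewritten() above

def rewrite_uri(uri, base_uri):
    """Rewrite a static object's URI so that it points at the WDM host."""
    if "?" in uri:
        return uri                        # query string: not a static object
    parts = urlsplit(uri)
    if parts.scheme:
        if parts.scheme.lower() != "http":
            return uri                    # only http absoluteURIs are rewritten
        netloc = parts.netloc
    elif uri.startswith("/"):
        netloc = urlsplit(base_uri).netloc    # abs_path: host comes from the base URI
    else:
        return uri                        # other relative forms are left alone
    host = unquote(netloc).lower()        # lower case, no URL escape sequences
    fragment = "#" + parts.fragment if parts.fragment else ""
    return "http://%s/%s%s%s" % (WDM_HOST, host, parts.path, fragment)
```

With the example above, rewrite_uri("http://WWW.CS.NYU.EDU/artg/arthur3.gif", "http://WWW.CS.NYU.EDU/artg/") returns "http://wdm.cs.nyu.edu/www.cs.nyu.edu/artg/arthur3.gif".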

Finally, change the Content-Length header to the new length.
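A small sketch of this last step follows, assuming the headers are held as an ordered list of (name, value) pairs and the rewritten body as bytes (both assumptions of the sketch, as is the name set_content_length).  The length is counted in bytes, not characters, and all other headers pass through untouched and in order.

```python
def set_content_length(headers, body):
    """Return a copy of headers with Content-Length matching body (bytes).

    If the response had no Content-Length (the body is delimited by
    the server closing the connection), none is added.
    """
    result = []
    for name, value in headers:
        if name.strip().lower() == "content-length":
            value = str(len(body))
        result.append((name, value))
    return result
```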

Comply with sections 3.2.1, 14.14.

III: Dynamic Decision Function for Inclusion Objects

Grading Program

Continue delivering the functionality in Phases I and II.

It is difficult for WDM to determine whether an inclusion object should be delivered from the content network cache.  The decision rule in Phase II (deliver an object from the cache if and only if the URL requesting it does not have a query string) will often fail.  For example, a server program might be accessed by a URL that does not have a query string.

A more direct approach to determine whether an inclusion object should be delivered from the cache is to examine whether the object is cachable.  Caching in HTTP is described extensively in Chapter 13 of the HTTP/1.1 Specification, RFC 2616.  Several HTTP headers indicate whether an object is cachable, and if so, under what conditions and for how long.  These headers include 14.19 ETag, 14.21 Expires, and especially, 14.32 Pragma and 14.9 Cache-Control.  WDM will examine just the Cache-Control and Pragma headers.

WDM MUST follow these rules.  If a Request contains the header Cache-Control: no-cache

then WDM MUST NOT rewrite the URIs of inclusion objects.  Otherwise, for each HTML response WDM MUST evaluate whether each inclusion object can be delivered from a cache, by examining the object.

WDM MUST download the metainformation for each inclusion object from the PS by issuing a HEAD request.  (The PS will be responsible for avoiding infinite loops.)  If the object’s headers indicate that it can be cached then WDM will rewrite the object’s src URI.

By default, assume an object is cachable.  Then, examine the headers and comply with the following semantics.  Note that WDM is a shared cache.

Header           Value       Cachable?
Cache-Control    no-cache    N
Cache-Control    public      Y
Cache-Control    private     N
Pragma           no-cache    N

If headers conflict, that is, if some headers indicate cachable and others indicate not cachable, then WDM MUST assume the object is not cachable.  (As the Spec says in 13.1.3, “if there is any apparent conflict between header values, the most restrictive interpretation is applied”.) 
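The table and the conflict rule can be sketched as a single function.  Because the default is cachable and any conflict resolves to not cachable, it suffices to look for a value that says N.  The name is_cachable and the representation of headers as a list of (name, value) pairs from the HEAD response are assumptions of this sketch.

```python
def is_cachable(headers):
    """Decide cachability from Cache-Control and Pragma headers only."""
    for name, value in headers:
        name = name.strip().lower()
        if name not in ("cache-control", "pragma"):
            continue                      # all other headers are ignored
        for directive in value.split(","):
            # Keep only the directive name, e.g. 'private' from 'private="x"'.
            token = directive.split("=", 1)[0].strip().lower()
            if token == "no-cache":
                return False              # Cache-Control or Pragma: no-cache
            if name == "cache-control" and token == "private":
                return False              # WDM is a shared cache
    return True                           # default, or Cache-Control: public
```

For instance, headers containing both Cache-Control: public and Pragma: no-cache conflict, and the function returns False.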

WDM MUST cache the mapping from URL to ‘cachable’ for use in future requests.  Note that a ‘future’ request may be concurrent, so WDM will need to share the ‘are-objects-cachable’ table across components responding to requests.  (WDM will not worry about aging the cache.)
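For a multithreaded WDM, the shared table might look like the sketch below, where a lock protects a dictionary.  The class name CachableTable and the probe argument (a function that issues the HEAD request and returns True or False) are hypothetical.  A multiprocess server would need a different sharing mechanism.

```python
import threading

class CachableTable:
    """Thread-safe map from URL to True/False ('is it cachable?')."""

    def __init__(self):
        self._lock = threading.Lock()
        self._table = {}

    def lookup(self, url):
        """Return True, False, or None if the URL has not been seen."""
        with self._lock:
            return self._table.get(url)

    def record(self, url, cachable):
        with self._lock:
            self._table[url] = cachable

    def check(self, url, probe):
        """Return the stored answer, calling probe(url) on a miss.

        The lock is not held while probing, so two threads may probe
        the same URL at once; both record the result, which is harmless.
        """
        known = self.lookup(url)
        if known is None:
            known = probe(url)
            self.record(url, known)
        return known
```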

Comply with sections 9.4, some of 13.1.3, 14.9 and 14.32.

Implementation

Advice

Start by studying the following

·         The server examples in Comer, available in the link "Copy of software from the book (45.1 Kbytes)" at    http://www.cs.purdue.edu/homes/comer/netbooks.html   

·         The HTTP/1.1 Specification, at http://www.w3.org/Protocols/History.html#HTTP11.

I recommend you implement one of three server styles:

·         A concurrent multiprocess server.

·         A multithreaded single process server.

·         A single process concurrent server that uses select.

You may combine several of these strategies.  For example, a multithreaded single process server may use select.

I recommend that to simplify your task, you build your server incrementally. Since this is a big program, start by designing a pseudocode algorithm.

Programming Language

You may write your WDM in any commonly used programming language such as C, C++, Java, PERL, Pascal, Ada, Fortran, etc.  Obtain my permission to use any other language.

Software tools

Write your server against a sockets network abstraction.  You may use the networking classes provided in common libraries, such as Java’s.  You may also use low-level libraries, such as C’s stdio, or libraries that parse RFC 822 headers.  You must indicate all sources of code and libraries used by the WDM.

You may use the Java class that parses (and detects malformed) URLs.

You must support the functionality required by the assignment; if you choose a tool that doesn’t support the functionality, create a work-around or abandon the tool.

You may use the TA’s (Guangwei Dai guangwei@cs.nyu.edu) implementation of Phase I in Java for LATER phases.

Logistics

Assistance

You must write all the programs and answers you hand in by yourself. (Except that I expect your code will include fragments of the server in Comer.)  Please do not incorporate code from free servers available on the network.

As always, you can ask either of the TAs, Junhua Wang junhua@cs.nyu.edu and Guangwei Dai guangwei@cs.nyu.edu, or me for advice in class and via email.  The TAs will write the grading program, help you understand the assignment and the technology (but will not do the assignment for you), and grade your work.  You may help each other to understand the problem and the properties of HTTP servers.  You're welcome to post questions to our class mailing list.

As I expect you know, passing in another person's work is cheating, morally wrong, not helpful to your education, and against NYU's rules. We use programs that compare your assignments and can detect copying, even if the copy has been modified.  Finally, I become extremely angry when I discover cheating.

Grading

You will help us grade your WDM by having it communicate via TCP/IP with a grading program (GP).   You will start GP from my Web site.  You will need to tell GP which machine and port number your WDM listens on.  The grading program will access your WDM and check its operation.  Details will follow later.

While being graded your server must run on a machine that can be accessed by a grading program running at NYU.  This means it cannot be behind a firewall.

Unless exceptional circumstances arise, late assignments will be penalized 20% per week, pro-rated.

What to Pass In

As with all assignments in this course, “Grades for late work are penalized 20% per week, pro rated.  Exceptions will be granted for unusual circumstances.” Please make requests for exceptions in advance.

For each phase please

·         Pass in a hardcopy of your program and of the grading program’s output when it evaluated your program.  Bring the hardcopy to class on the next Tuesday, or drop it in my mailbox.  In addition, for phase III, pass in written answers to the 10 points of questions.

·         Email me your code.  Indicate in a comment at the top of the main program the amount of time (in hours) you spent writing and debugging.  If you have multiple files in your WDM implementation, please package them into a single attachment, such as a 'tar' or 'zip' file, named ‘WDM_Your_Name’, and attach that to an email for me.  Call the email’s subject “WDM implementation”.

Computing

You may run WDM on any machine (except that you MUST NOT run WDM on the main CS department server, sparky.cs.nyu.edu).

The department has machines named courses[1-4].cs.nyu.edu which you might want to use, as they might be lightly loaded.  These are sitting in the server room, so they're remote access only.

There are also 5 Sun Ultra10's in room 505, which are for console use, but they can certainly handle additional cycles from remote users.  These are pubsun[1-5].cs.nyu.edu.  Of course, since they are accessible to people, there is no guarantee that someone won't hit the power switch at any given moment, despite the signs telling people not to.

Assignment Updates

You must subscribe to and read the class email beacon “Internet and Intranet Protocols and Applications, Spring 2000” at internetspring2000@forums.nyu.edu.  Announcements I make on the list become official parts of this assignment.  I will refrain from making any announcements less than a few days before an assignment is due.