Research interests
My research concerned how to improve the performance of a system by
understanding, in a detailed fashion, the storage system. Detailed areas: random
bit generation, file
systems, web
searching, payment schemes, improving performance (task
scheduling, parallel
I/O, parallel
file systems, storage
systems, hierarchical memory models).
Recent work
User navigation is a powerful, yet underused source of data. We apply user
navigation to web searching, with the desire to reduce the number of URLs
visited when searching. Our source of user navigation information is proxy logs.
We tease out search sessions and topic sessions. A search
session is the query, and the set of URLs that the user visited when searching. Topic
sessions which are groups of search sessions on the same topic.
Details on how we use search and topic sessions to improve web searching can
be found at the user
navigation home page. Two projects that are part of this effort are SearchLight
and Hyponym.
We have developed a random bit generator that uses disk accesses. Our generator
is practical and economical. It requires no additional hardware (given a system
with a disk), and no user involvement. As a concrete example of performance, on
a Sun Ultra-1 with a Seagate Cheetah disk, it generates bits at a rate of either
5 bits per minute or 577 bits per minute depending on the physical phenomena
that we use as a source of randomness. The generated bits are random by a
theoretical argument, and also pass a severe battery of statistical tests. More
information can be found here.
We have been working on developing high-performance application-specific file
systems. We have some very promising results for a file system for caching web
proxies. More details are on our Hummingbird
web page.
We developed an analytic model of read performance in a file system which
performs prefetching. The model helps one understand why and when prefetching
works. We also have some suggesions on how to improve file system performance
for UNIX-like workloads. Our paper was presented at USENIX,
June 6-11, 1999. More information can be found here.
Previous work
Single disk model
I developed an analytic performance model of a SCSI disk that determines the
service time of a request in a workload-specific method. The inputs into the
model are the workload attributes (e.g., mean request size, arrival rate, and
fraction of read requests) and a description of the disk. More information can
be found here.
Multiple disk model
We have developed an analytic model for multiple disks on a SCSI bus, along
with a technique to increase the performance of the disk system when the
workload has large requests and random access. More information can be found here.
We are interested in developing efficient scheduling algorithms for jobs where
the data set is so large that it is stored on tape. For example, the data for
the tape-resident NASA EOS jobs must be staged from tape into disk before the
processing can begin. Our current results were presented at PODS,
May 31, 1999. More information can be found here.
Jeff Vitter and I developed a computation model (called the Parallel Disk
Model) and a number of out-of-core algorithms for problems such as sorting and
FTT. This work is published as my master's thesis and in a STOC '90 paper. (See published
papers section.)
Len Wisniewski and I designed an application programmer interface for
multiprocessor multi-disk file systems based on the Parallel Disk Model. (See
the published
papers section for our TR describing our work.)
Bibliography files
My .bib files have many references for the above areas: