A large amount of new information is posted on the Web every day. Large-scale web search engines often update their indexes slowly and cannot present such information in a timely manner. We present methods for finding new information on the Web by tracking changes to web documents.
First, we present algorithms and techniques for three problems: detecting web pages that have changed, extracting changes from different versions of a web page, and evaluating the significance of web changes. We propose a two-level change detector consisting of MetaDetector and ContentDetector. The combined detector reduces network traffic by about 67%. Our algorithm for extracting web changes consists of three steps: document tree construction, document tree encoding, and tree matching. It has linear time complexity and effectively extracts the changed content from different versions of a web page. To evaluate web changes, we propose a unified ranking framework that combines three metrics: popularity ranking, content-based ranking, and evolution ranking. Together, these methods can identify and deliver important new information in a timely manner.
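
The two-level detection idea can be illustrated with a minimal sketch. The abstract does not describe the internals of MetaDetector and ContentDetector, so the code below assumes that the first level compares cheap HTTP metadata (the `ETag` and `Last-Modified` headers) and that the second level compares a SHA-256 digest of the downloaded body. The names `Snapshot`, `meta_detect`, `content_detect`, and `page_changed` are hypothetical and chosen only for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


@dataclass
class Snapshot:
    """State remembered from the previous visit (hypothetical record)."""
    etag: Optional[str]
    last_modified: Optional[str]
    digest: str  # SHA-256 hex digest of the previously downloaded body


def digest_of(body: bytes) -> str:
    """Return the SHA-256 hex digest of a page body."""
    return hashlib.sha256(body).hexdigest()


def meta_detect(old: Snapshot, headers: Mapping[str, str]) -> Optional[bool]:
    """Level 1 (assumed behavior): compare cheap metadata.

    Returns True or False when a header is present on both sides,
    and None when no usable metadata is available.
    """
    for key, previous in (("ETag", old.etag),
                          ("Last-Modified", old.last_modified)):
        current = headers.get(key)
        if current is not None and previous is not None:
            return current != previous
    return None


def content_detect(old: Snapshot, body: bytes) -> bool:
    """Level 2 (assumed behavior): compare content digests."""
    return digest_of(body) != old.digest


def page_changed(old: Snapshot,
                 headers: Mapping[str, str],
                 fetch_body: Callable[[], bytes]) -> bool:
    """Download the body only when the metadata does not rule out a change."""
    if meta_detect(old, headers) is False:
        return False  # metadata says unchanged: skip the download
    return content_detect(old, fetch_body())
```

In this sketch the body is fetched only when the metadata is missing or indicates a change, which is the kind of saving a two-level design aims for. The header keys are matched exactly as written, so a real crawler would need case-insensitive header lookup.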
Second, we present an application built on these techniques and algorithms, named "Web Daily News Assistant (WebDNA): finding what's new on Your Web". It is a search tool that helps community users search for new information on their community web. WebDNA is currently deployed on the New York University web site.
Third, we model the changes of web documents using survival analysis. Modeling web changes is useful for web crawler scheduling and web caching. Existing work models changes to web pages as a Poisson process and uses a necessarily incomplete detection history to estimate the true change frequencies. However, other features that can be used to predict change frequency have not previously been studied. Our analysis shows that the PageRank value is a good predictor. Statistically, the change frequency is proportional to $\exp[0.36\cdot (\ln(PageRank)+C)]$. We further study the problem of combining the predictor and the change history into a unified framework. We present an improved estimator of change frequency, which reduces the error by 27.3% when the change history is short.
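
As a rough illustration of the two ingredients named above, the sketch below evaluates the PageRank-based relation and a history-based estimate under the Poisson assumption. The exponent 0.36 comes from the abstract. The proportionality constant `scale`, the value of `c`, and the function names are assumptions made for illustration. The history-based function is the textbook maximum-likelihood estimate for a page polled at a fixed interval, where each visit reveals only whether at least one change occurred. It is not the improved estimator presented in this work.

```python
import math

EXPONENT = 0.36  # coefficient reported in the abstract


def predicted_rate(pagerank: float, c: float, scale: float = 1.0) -> float:
    """Change frequency proportional to exp[0.36 * (ln(PageRank) + C)].

    `scale` is a hypothetical proportionality constant.
    """
    if pagerank <= 0:
        raise ValueError("pagerank must be positive")
    return scale * math.exp(EXPONENT * (math.log(pagerank) + c))


def naive_rate(visits: int, detected: int, interval: float) -> float:
    """Detected changes per unit time; misses repeated changes between visits."""
    if visits <= 0 or interval <= 0:
        raise ValueError("visits and interval must be positive")
    return detected / (visits * interval)


def poisson_mle_rate(visits: int, detected: int, interval: float) -> float:
    """Maximum-likelihood Poisson rate from an incomplete detection history.

    With rate lam, a visit detects a change with probability
    1 - exp(-lam * interval), so lam = -ln(1 - detected / visits) / interval.
    """
    if visits <= 0 or interval <= 0:
        raise ValueError("visits and interval must be positive")
    if detected >= visits:
        return math.inf  # every visit saw a change: the rate is unbounded
    return -math.log(1.0 - detected / visits) / interval
```

Because $\exp[0.36\cdot(\ln(PageRank)+C)] = e^{0.36C}\cdot PageRank^{0.36}$, the relation is a power law: doubling PageRank multiplies the predicted change frequency by $2^{0.36}$, roughly 1.28. The history-based estimate is never smaller than the naive count, since a single detection can hide several changes.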