Bogen, Paul (2011-12). Intelligent Information Interaction for Managing Distributed Collections of Web Documents. Doctoral Dissertation.
Digital collections are ubiquitous. However, not all digital collections are the same. While most digital collections undergo only limited forms of change, primarily the addition and deletion of resources, there exists a class of digital collections that undergo additional kinds of change. These collections are made up of resources that are distributed across the Internet and brought together into the collection via hyperlinking. This means the underlying collection members are not controlled by the curator of the collection, and resources can be expected to change as time goes on. To further complicate matters, these collections can be hard to maintain when they are large, highly dynamic, or lacking active curation. Part of the difficulty in maintaining these collections is determining whether a changed page is still a valid member of the collection. While others have tried to address this problem by measuring change and defining a maximum allowed threshold of change, these methods treat all change as a potential problem and treat web content as a static document despite its intrinsically dynamic nature. Instead, I approach the problem of determining the significance of change on the web by embracing change as a normal part of a web document's lifecycle. Rather than using thresholds to identify abnormal changes, I determine the difference between what a maintainer expects a page to do and what it actually does. I model these expectations using a variety of feature extractors to find pertinent information in a page, a Kalman filter to model the history of a page and predict its next version, and finally classification of the results into either expected or unexpected change. I evaluate the different options for extractors and analyzers to determine the best options from my suite of possibilities.
This work is informed by a series of studies on both web pages and potential collection maintainers, observations of the NSDL Pathways, and a ground-truth set of blog changes, each tagged with a human judgment of the kind of change. The results of this work showed a statistically significant improvement over a range of traditional threshold techniques when applied to the collection of tagged blog changes.
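
The extract, predict, and classify pipeline described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the single numeric feature (assumed here to be a word count per page version), the random-walk state model, the noise variances, the 3-sigma cutoff, and the function name kalman_flags are all hypothetical choices made for illustration. The sketch runs a one-dimensional Kalman filter over a page's feature history and labels each new version as an expected or unexpected change, depending on how far the observation falls from the filter's prediction.

```python
import math


def kalman_flags(observations, process_var=1.0, obs_var=4.0, k=3.0):
    """Label each observation after the first as unexpected (True) or expected (False).

    Hypothetical sketch: a scalar Kalman filter with a random-walk state model.
    All parameter values are illustrative assumptions, not values from the source.
    """
    estimate = float(observations[0])  # initial state: first observed feature value
    variance = obs_var                 # initial uncertainty of that estimate
    flags = []
    for z in observations[1:]:
        # Predict: under a random walk the state is unchanged, uncertainty grows.
        predicted = estimate
        predicted_var = variance + process_var

        # Compare the actual observation with the prediction.
        innovation = z - predicted
        innovation_std = math.sqrt(predicted_var + obs_var)
        unexpected = abs(innovation) > k * innovation_std
        flags.append(unexpected)

        if unexpected:
            # Assumed design choice: do not let an unexpected version
            # contaminate the model of the page's normal behavior.
            estimate, variance = predicted, predicted_var
        else:
            # Update: blend prediction and observation using the Kalman gain.
            gain = predicted_var / (predicted_var + obs_var)
            estimate = predicted + gain * innovation
            variance = (1.0 - gain) * predicted_var
    return flags


# Hypothetical word counts for successive versions of one page.
history = [500, 503, 498, 505, 501, 40, 502]
flags = kalman_flags(history)
```

In this made-up history, only the version whose word count drops to 40 is labeled unexpected; the small fluctuations around 500 fall within what the filter predicts. Skipping the update step for unexpected versions is one possible choice among several, and a real system would need to decide how to re-learn a page whose behavior has permanently changed.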