A quantitative evaluation of techniques for detection of abnormal change events in blogs.
Additional Document Info
While most digital collections have limited forms of change - primarily creation and deletion of additional resources - there exists a class of digital collections that undergoes additional kinds of change. These collections are made up of resources that are distributed across the Internet and brought together into a collection via hyperlinking. Resources in these collections can be expected to change as time goes on. Part of the difficulty in maintaining these collections is determining if a changed page is still a valid member of the collection. Others have tried to address this problem by measuring change and defining a maximum allowed threshold of change, however, these methods treat all change as a potential problem and treat web content as a static document despite its intrinsically dynamic nature. Instead, we approach the significance of change on the web as a normal part of a web document's life-cycle and determine the difference between what a maintainer expects a page to do and what it actually does. In this work we evaluate the different options for extractors and analyzers in order to determine the best options from a suite of techniques. The evaluation used a human-generated ground-truth set of blog changes. The results of this work showed a statistically significant improvement over a range of traditional threshold techniques when applied to our collection of tagged blog changes. 2012 ACM.
name of conference
Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries