A quantitative evaluation of techniques for detection of abnormal change events in blogs.
Conference Paper
Overview
Identity
Additional Document Info
Other
View All
Overview
abstract
While most digital collections have limited forms of change - primarily creation and deletion of additional resources - there exists a class of digital collections that undergoes additional kinds of change. These collections are made up of resources that are distributed across the Internet and brought together into a collection via hyperlinking. Resources in these collections can be expected to change as time goes on. Part of the difficulty in maintaining these collections is determining if a changed page is still a valid member of the collection. Others have tried to address this problem by measuring change and defining a maximum allowed threshold of change, however, these methods treat all change as a potential problem and treat web content as a static document despite its intrinsically dynamic nature. Instead, we approach the significance of change on the web as a normal part of a web document's life-cycle and determine the difference between what a maintainer expects a page to do and what it actually does. In this work we evaluate the different options for extractors and analyzers in order to determine the best options from a suite of techniques. The evaluation used a human-generated ground-truth set of blog changes. The results of this work showed a statistically significant improvement over a range of traditional threshold techniques when applied to our collection of tagged blog changes. 2012 ACM.
name of conference
Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries