Singh, Sanjeev Kumar (2014-08). Identifying Webpage Regions and Their Roles by Combining Image Processing and Markup Analysis. Master's Thesis.
Understanding what the regions of a webpage are and what functions they serve is important for many services that operate over webpages, including screen readers, web search, and assessment of webpage similarity. In this thesis, we present an approach that identifies the regions of a webpage using image processing techniques and then identifies the portions of the DOM tree corresponding to these regions. We then present and compare a rule-based approach and an SVM-based approach that use visual and markup information to classify regions by role. A corpus of 150 webpages exhibiting a wide variety of designs was collected. Each page was annotated with human-assigned regions and roles for use in training and in evaluating results. The segmentation algorithm accurately identified 77.8% of the 1222 webpage regions in the corpus, but its performance was uneven across region types. Segmentation accuracy was above 80% for headers, footers, body regions, and top navigation bars. The algorithm had more difficulty with left, right, and bottom navigation bars and with dynamic content, locating these segments with less than 70% accuracy. The correctly segmented webpage components were used as a test collection to compare the rule-based and SVM-based approaches to assigning a role to each segment. The two approaches each achieved between 74% and 75% accuracy over 951 classifications. The SVM-based approach was better at classifying left and bottom navigation bars, while the rule-based approach was better at recognizing dynamic content. When the two methods were considered together, accuracy rose to 81.3%; in this case, a region counted as correctly identified if either the rule-based or the SVM-based method classified it correctly. Overall, these results are promising for incorporating segmentation and segment role classification into web services.
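The combined figure in the abstract counts a region as correct when either classifier assigns the right role. The minimal Python sketch below illustrates that scoring rule only; it is not the thesis implementation. The function names, role labels, and toy predictions are hypothetical and chosen purely for illustration.

```python
def accuracy(truth, pred):
    """Fraction of regions whose predicted role matches the human label."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def either_correct_accuracy(truth, rule_pred, svm_pred):
    """Fraction of regions labeled correctly by at least one of the two methods."""
    hits = sum(t == r or t == s for t, r, s in zip(truth, rule_pred, svm_pred))
    return hits / len(truth)


# Hypothetical toy data: six regions with human-assigned roles.
truth     = ["header", "footer", "body", "left_nav", "dynamic", "bottom_nav"]
rule_pred = ["header", "footer", "body", "body",     "dynamic", "footer"]
svm_pred  = ["header", "footer", "body", "left_nav", "body",    "footer"]

rule_acc = accuracy(truth, rule_pred)
svm_acc = accuracy(truth, svm_pred)
combined_acc = either_correct_accuracy(truth, rule_pred, svm_pred)

# Sanity check on the reported counts: 77.8% of 1222 regions is about 951.
segmented = round(0.778 * 1222)
```

Because this measure consults the human labels to decide which method was right, it reads as an upper bound on what a scheme that must choose between the two methods without those labels could achieve.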