Probe, cluster, and discover: Focused extraction of QA-Pagelets from the Deep Web (Conference Paper)

abstract

In this paper, we introduce the concept of a QA-Pagelet to refer to the content region in a dynamic page that contains query matches. We present THOR, a scalable and efficient mining system for discovering and extracting QA-Pagelets from the Deep Web. A unique feature of THOR is its two-phase extraction framework. In the first phase, pages from a deep web site are grouped into distinct clusters of structurally similar pages. In the second phase, pages from each page cluster are examined through a subtree filtering algorithm that exploits the structural and content similarity at the subtree level to identify the QA-Pagelets.
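
The two-phase idea in the abstract can be illustrated with a small sketch. This is not the THOR implementation: the tag-path representation, the Jaccard similarity measure, the 0.6 threshold, and every function name below are assumptions chosen only to keep the example short. Phase one groups pages whose sets of tag paths overlap strongly; phase two keeps the paths that occur in every page of a cluster but whose text differs from page to page, on the assumption that static regions (navigation, headers) repeat verbatim while query matches vary.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags with no closing tag; skipped so they do not distort the path stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathCollector(HTMLParser):
    """Record the text found under each root-to-node tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.text_by_path = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.text_by_path["/".join(self.stack)].append(text)


def parse(html):
    """Return a dict mapping tag path -> list of text fragments."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return dict(collector.text_by_path)


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_pages(pages, threshold=0.6):
    """Phase 1 (assumed heuristic): greedy clustering on tag-path overlap."""
    clusters = []
    for page in pages:
        for cluster in clusters:
            if jaccard(set(page), set(cluster[0])) >= threshold:
                cluster.append(page)
                break
        else:
            clusters.append([page])
    return clusters


def candidate_regions(cluster):
    """Phase 2 (assumed heuristic): shared paths whose text varies."""
    common = set.intersection(*(set(page) for page in cluster))
    return [
        path
        for path in sorted(common)
        if len({tuple(page[path]) for page in cluster}) > 1
    ]
```

Calling `cluster_pages` on a list of `parse(...)` results and then `candidate_regions` on each cluster yields, per cluster, the tag paths that look query-dependent. A real system would compare whole subtrees rather than individual text-bearing paths and would need a more robust similarity measure.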

name of conference

Proceedings. 20th International Conference on Data Engineering

authors

Caverlee, James

published proceedings

20th International Conference on Data Engineering, Proceedings

author list (cited authors)

Caverlee, J., Liu, L., & Buttler, D.

citation count

17

complete list of authors

Caverlee, J; Liu, L; Buttler, D

editor list (cited editors)

Özsoyoglu, Z. M., & Zdonik, S. B.

publication date

January 2004

publisher

Institute of Electrical and Electronics Engineers (IEEE)

published in

Proceedings / International Conference on Data Engineering. International Conference on Data Engineering

keywords

46 Information And Computing Sciences
4605 Data Management And Data Science

Digital Object Identifier (DOI)

10.1109/ICDE.2004.1319988

International Standard Book Number (ISBN) 10

0-7695-2065-0

start page

103

end page

114

volume

20

URL

http://dx.doi.org/10.1109/icde.2004.1319988
