Content-based analysis to detect Arabic web spam - Texas A&M University (TAMU) Scholar

abstract

Search engines are important outlets for information query and retrieval. They have to deal with the continual increase of information available on the web, and provide users with convenient access to such huge amounts of information. With this much information, spam in web pages is a more complex challenge that becomes steadily more difficult to eliminate. For several reasons, web spammers try to intrude in the search results and inject artificially biased results in favour of their websites or pages. Spam pages are added to the internet on a daily basis, making it difficult for search engines to keep up with the fast-growing and dynamic nature of the web, especially since spammers tend to add more keywords to their websites to deceive the search engines and increase the rank of their pages. In this research, we have investigated four different classification algorithms (naïve Bayes, decision tree, SVM and k-NN) to detect Arabic web spam pages, based on content. The three groups of datasets used, with 1%, 15% and 50% spam contents, were collected using a crawler that was customized for this study. Spam pages were classified manually. Different tests and comparisons have revealed that the decision tree was the best classifier for this purpose.

authors

Alsmadi, Izzat

published proceedings

JOURNAL OF INFORMATION SCIENCE

author list (cited authors)

Al-Kabi, M., Wahsheh, H., Alsmadi, I., Al-Shawakfa, E., Wahbeh, A., & Al-Hmoud, A.

citation count

13

complete list of authors

Al-Kabi, Mohammed||Wahsheh, Heider||Alsmadi, Izzat||Al-Shawakfa, Emad||Wahbeh, Abdullah||Al-Hmoud, Ahmed

publication date

June 2012

publisher

SAGE Publications

published in

Journal of Information Science

keywords

Arabic Content Features
Arabic Web Spam
Arabic Web Spam Detection
Content Features
Web Spam
Web Spam Detection

Digital Object Identifier (DOI)

10.1177/0165551512439173

start page

284

end page

296

volume

38

issue

3

URL

http://dx.doi.org/10.1177/0165551512439173
