Intelligent high-performance crawlers used to reveal topic-specific structure of the www

András Lőrincz, István Kókai, Attila Meretei

Research output: Contribution to journal › Article

10 Citations (Scopus)

Abstract

The slogan that «information is power» has undergone a slight change. Today, «information updating» is the focus of interest. The largest source of information today is the World Wide Web. Fast search methods are needed to utilize this enormous source of information. In this paper we describe our novel crawler, which uses support vector classification and on-line reinforcement learning. We launched crawler searches from different sites, including sites that offer, at best, very limited information about the search subject. This case may correspond to typical searches by non-experts. Results indicate that the considerable performance improvement of our crawler over other known crawlers is due to its on-line adaptation property. We used our crawler to characterize basic topic-specific properties of WWW environments. It was found that topic-specific regions have a broad distribution of valuable documents. Expert sites are excellent starting points, whereas mailing lists can form traps for the crawler. These properties of the WWW and the emergence of intelligent «high-performance» crawlers that monitor and search for novel information together predict a significant increase in communication load on the WWW in the near future.

Original language: English
Pages (from-to): 477-495
Number of pages: 19
Journal: International Journal of Foundations of Computer Science
Volume: 13
Issue number: 4
Publication status: Published - Dec 1 2002

Keywords

  • Internet
  • adaptation
  • crawler
  • reinforcement learning

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
