Improving recognition accuracy on structured documents by learning structural patterns

György Hévízi, Tamás Marcinkovics, A. Lőrincz

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

In this paper, we present a probabilistic method that can improve the efficiency of document classification when applied to structured documents. The analysis of the structure of a document is the starting point of document classification. Our method is designed to augment other classification schemes and complement pre-filtering information extraction procedures to reduce uncertainties. To this end, a probabilistic distribution on the structure of XML documents is introduced. We show how to parameterise existing learning methods to describe the structure distribution efficiently. The learned distribution is then used to predict the classes of unseen documents. Novelty detection making use of the structure-based distribution function is also discussed. Demonstrations on model documents and on Internet XML documents are presented.
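
The abstract does not spell out the model itself; the keywords point to Bayesian networks and a probabilistic tree model. As a much simpler stand-in, and not the authors' method, the sketch below illustrates the general idea of learning a per-class distribution over XML structure, using it to classify unseen documents, and flagging documents that are unlikely under every class as novel. The feature choice (parent/child tag pairs), the uniform class prior, the Laplace smoothing constant `alpha`, and all names (`structure_features`, `StructureClassifier`, `threshold`) are assumptions made for illustration.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def structure_features(xml_text):
    """Return the (parent tag, child tag) pairs of an XML document, ignoring text."""
    root = ET.fromstring(xml_text)
    pairs = [("<root>", root.tag)]
    for parent in root.iter():
        for child in parent:
            pairs.append((parent.tag, child.tag))
    return pairs


class StructureClassifier:
    """Per-class multinomial model over tag pairs (an assumed, simplified model)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                  # Laplace smoothing constant (assumption)
        self.counts = defaultdict(Counter)  # class label -> tag-pair counts
        self.vocab = set()                  # every tag pair seen during training

    def fit(self, documents, labels):
        for doc, label in zip(documents, labels):
            pairs = structure_features(doc)
            self.counts[label].update(pairs)
            self.vocab.update(pairs)
        return self

    def log_likelihood(self, xml_text, label):
        """Average smoothed log-probability per tag pair under one class."""
        counts = self.counts[label]
        total = sum(counts.values())
        size = len(self.vocab) + 1  # one extra slot reserved for unseen pairs
        pairs = structure_features(xml_text)
        score = sum(
            math.log((counts[p] + self.alpha) / (total + self.alpha * size))
            for p in pairs
        )
        return score / len(pairs)

    def predict(self, xml_text):
        """Pick the class with the highest likelihood (uniform prior assumed)."""
        return max(self.counts, key=lambda c: self.log_likelihood(xml_text, c))

    def is_novel(self, xml_text, threshold):
        """Flag a document whose best class likelihood falls below the threshold."""
        best = max(self.log_likelihood(xml_text, c) for c in self.counts)
        return best < threshold


# Toy usage with two hand-written training documents.
clf = StructureClassifier().fit(
    ["<book><title/><chapter><section/></chapter></book>",
     "<feed><entry><title/><link/></entry></feed>"],
    ["book", "feed"],
)
print(clf.predict("<book><chapter><section/><section/></chapter></book>"))
print(clf.is_novel("<html><body><p/></body></html>", threshold=-2.0))
```

Averaging the log-likelihood per tag pair keeps the novelty threshold comparable across documents of different sizes; a model closer to the paper's would replace the independent tag-pair counts with a learned tree-structured distribution.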

Original language: English
Pages (from-to): 66-76
Number of pages: 11
Journal: Pattern Analysis and Applications
Volume: 7
Issue number: 1
DOI: 10.1007/s10044-004-0208-3
Publication status: Published - 2004

Fingerprint

XML
Information filtering
Distribution functions
Demonstrations
Internet
Uncertainty

Keywords

  • Bayesian networks
  • Classification
  • Novelty detection
  • Probabilistic tree model
  • XML

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Artificial Intelligence
  • Computer Vision and Pattern Recognition

Cite this

Improving recognition accuracy on structured documents by learning structural patterns. / Hévízi, György; Marcinkovics, Tamás; Lőrincz, A.

In: Pattern Analysis and Applications, Vol. 7, No. 1, 2004, p. 66-76.

@article{51ddbc595c5b48a797e0e3ae32e92b20,
title = "Improving recognition accuracy on structured documents by learning structural patterns",
abstract = "In this paper, we present a probabilistic method that can improve the efficiency of document classification when applied to structured documents. The analysis of the structure of a document is the starting point of document classification. Our method is designed to augment other classification schemes and complement pre-filtering information extraction procedures to reduce uncertainties. To this end, a probabilistic distribution on the structure of XML documents is introduced. We show how to parameterise existing learning methods to describe the structure distribution efficiently. The learned distribution is then used to predict the classes of unseen documents. Novelty detection making use of the structure-based distribution function is also discussed. Demonstrations on model documents and on Internet XML documents are presented.",
keywords = "Bayesian networks, Classification, Novelty detection, Probabilistic tree model, XML",
author = "Gy{\"o}rgy H{\'e}v{\'i}zi and Tam{\'a}s Marcinkovics and A. L{\H{o}}rincz",
year = "2004",
doi = "10.1007/s10044-004-0208-3",
language = "English",
volume = "7",
pages = "66--76",
journal = "Pattern Analysis and Applications",
issn = "1433-7541",
publisher = "Springer London",
number = "1",

}

TY - JOUR

T1 - Improving recognition accuracy on structured documents by learning structural patterns

AU - Hévízi, György

AU - Marcinkovics, Tamás

AU - Lőrincz, A.

PY - 2004

Y1 - 2004

N2 - In this paper, we present a probabilistic method that can improve the efficiency of document classification when applied to structured documents. The analysis of the structure of a document is the starting point of document classification. Our method is designed to augment other classification schemes and complement pre-filtering information extraction procedures to reduce uncertainties. To this end, a probabilistic distribution on the structure of XML documents is introduced. We show how to parameterise existing learning methods to describe the structure distribution efficiently. The learned distribution is then used to predict the classes of unseen documents. Novelty detection making use of the structure-based distribution function is also discussed. Demonstrations on model documents and on Internet XML documents are presented.

AB - In this paper, we present a probabilistic method that can improve the efficiency of document classification when applied to structured documents. The analysis of the structure of a document is the starting point of document classification. Our method is designed to augment other classification schemes and complement pre-filtering information extraction procedures to reduce uncertainties. To this end, a probabilistic distribution on the structure of XML documents is introduced. We show how to parameterise existing learning methods to describe the structure distribution efficiently. The learned distribution is then used to predict the classes of unseen documents. Novelty detection making use of the structure-based distribution function is also discussed. Demonstrations on model documents and on Internet XML documents are presented.

KW - Bayesian networks

KW - Classification

KW - Novelty detection

KW - Probabilistic tree model

KW - XML

UR - http://www.scopus.com/inward/record.url?scp=2442507516&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=2442507516&partnerID=8YFLogxK

U2 - 10.1007/s10044-004-0208-3

DO - 10.1007/s10044-004-0208-3

M3 - Article

AN - SCOPUS:2442507516

VL - 7

SP - 66

EP - 76

JO - Pattern Analysis and Applications

JF - Pattern Analysis and Applications

SN - 1433-7541

IS - 1

ER -