Recognition of the logical structure of Arabic newspaper pages

Hassina Bouressace, J. Csirik

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

4 Citations (Scopus)

Abstract

In document analysis and recognition, the aim is to identify document content automatically, going from a simple image to a structured set of information that a machine can exploit. Here, we present a system for recognizing the logical structure (hierarchical organization) of Arabic newspaper pages. Such pages have a rich and variable structure: a page may contain several articles, each composed of titles, figures, authors' names, and figure captions. Recognizing the logical structure of a newspaper page must be preceded by extracting its physical structure. Our system performs this extraction with a combined method based essentially on the RLSA (Run Length Smearing/Smoothing Algorithm) [1], projection profile analysis, and connected component labeling. The logical structure is then extracted using rules on the sizes and positions of the physical elements extracted earlier, together with a priori knowledge of certain properties of the logical entities (titles, figures, authors, captions, etc.). Lastly, the hierarchical organization of the document is represented as an automatically generated XML file. To evaluate the performance of our system, we tested it on a set of images, and the results are encouraging.
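
The pipeline described in the abstract (smearing, projection profiles, rule-based labeling, XML output) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give their thresholds, rules, or XML schema, so the function names, the `threshold` and `title_min_height` parameters, the "tall band is a title" rule, and the `page`/`article`/`title`/`paragraph` tags are all assumptions made for illustration. Connected component labeling is omitted to keep the example short.

```python
import xml.etree.ElementTree as ET


def rlsa_horizontal(row, threshold):
    """Horizontal RLSA on one binary row (1 = ink, 0 = background).

    Background runs of length <= threshold that lie between two ink
    pixels are filled, which merges nearby characters into blocks.
    """
    out = list(row)
    last_fg = None
    for i, px in enumerate(row):
        if px == 1:
            if last_fg is not None:
                gap = i - last_fg - 1
                if 0 < gap <= threshold:
                    for j in range(last_fg + 1, i):
                        out[j] = 1
            last_fg = i
    return out


def horizontal_profile(image):
    """Projection profile: number of ink pixels in each row."""
    return [sum(row) for row in image]


def split_bands(profile):
    """Return (top, bottom) row ranges with a nonzero profile; bottom is exclusive."""
    bands = []
    start = None
    for y, value in enumerate(profile):
        if value > 0 and start is None:
            start = y
        elif value == 0 and start is not None:
            bands.append((start, y))
            start = None
    if start is not None:
        bands.append((start, len(profile)))
    return bands


def label_band(height, title_min_height):
    """Hypothetical size rule: tall bands are titles, the rest are paragraphs."""
    return "title" if height >= title_min_height else "paragraph"


def page_to_xml(image, threshold=2, title_min_height=3):
    """Smear, segment into horizontal bands, label each band, and emit XML."""
    smeared = [rlsa_horizontal(row, threshold) for row in image]
    bands = split_bands(horizontal_profile(smeared))
    root = ET.Element("page")
    article = ET.SubElement(root, "article")  # hypothetical: one article per page
    for top, bottom in bands:
        tag = label_band(bottom - top, title_min_height)
        ET.SubElement(article, tag, top=str(top), bottom=str(bottom))
    return ET.tostring(root, encoding="unicode")


# Toy 6x4 binary "page": a 3-row block, a blank row, a 1-row block, a blank row.
toy_page = [
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]
xml_text = page_to_xml(toy_page)
```

A real system would apply the smearing in both directions, work on scanned images rather than toy arrays, and use position rules as well as size rules, as the abstract indicates.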

Original language: English
Title of host publication: Text, Speech, and Dialogue - 21st International Conference, TSD 2018, Proceedings
Editors: Petr Sojka, Aleš Horák, Ivan Kopecek, Karel Pala
Publisher: Springer Verlag
Pages: 251-258
Number of pages: 8
ISBN (Print): 9783030007935
DOIs: https://doi.org/10.1007/978-3-030-00794-2_27
Publication status: Published - Jan 1 2018
Event: 21st International Conference on Text, Speech, and Dialogue, TSD 2018 - Brno, Czech Republic
Duration: Sep 11 2018 to Sep 14 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11107 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 21st International Conference on Text, Speech, and Dialogue, TSD 2018
Country: Czech Republic
City: Brno
Period: 9/11/18 to 9/14/18

Keywords

  • Arabic language
  • Document processing
  • Document recognition
  • Logical structure
  • Physical structure
  • Segmentation

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Bouressace, H., & Csirik, J. (2018). Recognition of the logical structure of Arabic newspaper pages. In P. Sojka, A. Horák, I. Kopecek, & K. Pala (Eds.), Text, Speech, and Dialogue - 21st International Conference, TSD 2018, Proceedings (pp. 251-258). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11107 LNAI). Springer Verlag. https://doi.org/10.1007/978-3-030-00794-2_27