The Szeged Corpus: A POS tagged and syntactically annotated Hungarian natural language Corpus

Research output: Conference article

22 Citations (Scopus)

Abstract

The Szeged Corpus is a manually annotated natural language corpus comprising 1.2 million word entries plus 225 thousand punctuation marks. This makes it the largest manually processed Hungarian textual database, and it serves as reference material for further research in natural language processing (NLP) as well as a learning database for machine learning algorithms and other software applications. Processing of the corpus texts has so far included morpho-syntactic analysis, POS tagging, and shallow syntactic parsing. Semantic information was also added to a pre-selected section of the corpus to support automated information extraction (IE).

Original language: English
Pages (from-to): 41-47
Number of pages: 7
Journal: Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science)
Volume: 3206
Publication status: Published - Dec 1, 2004
Event: 7th International Conference TSD 2004: Text, Speech and Dialogue - Brno, Czech Republic
Duration: Sep 8, 2004 to Sep 11, 2004

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)
