Szeged Corpus 2.5: Morphological modifications in a manually POS-tagged Hungarian corpus

Veronika Vincze, Viktor Varga, Katalin Ilona Simkó, János Zsibrita, Ágoston Nagy, Richárd Farkas, J. Csirik

Research output: Conference contribution

1 Citation (Scopus)

Abstract

The Szeged Corpus is the largest manually annotated database containing the possible morphological analyses and lemmas for each word form. In this work, we present its latest version, Szeged Corpus 2.5, in which the new harmonized morphological coding system of Hungarian has been employed and the majority of misspelled words have been corrected and tagged with the proper morphological code. New morphological codes are introduced for participles, causative / modal / frequentative verbs, adverbial pronouns and punctuation marks; moreover, the distinction between common and proper nouns is eliminated. We also report some statistical data on the frequency of the new morphological codes. The new version of the corpus made it possible to train magyarlanc, a data-driven POS-tagger of Hungarian, on a dataset with the new harmonized codes. According to the results, magyarlanc is able to achieve a state-of-the-art accuracy score on the 2.5 version as well.

Original language: English
Title of host publication: Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014
Publisher: European Language Resources Association (ELRA)
Pages: 1074-1078
Number of pages: 5
ISBN (Electronic): 9782951740884
Publication status: Published - Jan 1 2014
Event: 9th International Conference on Language Resources and Evaluation, LREC 2014 - Reykjavik, Iceland
Duration: May 26 2014 to May 31 2014

Other

Other: 9th International Conference on Language Resources and Evaluation, LREC 2014
Country: Iceland
City: Reykjavik
Period: 5/26/14 to 5/31/14

ASJC Scopus subject areas

  • Linguistics and Language
  • Library and Information Sciences
  • Education
  • Language and Linguistics

  • Cite this

    Vincze, V., Varga, V., Simkó, K. I., Zsibrita, J., Nagy, Á., Farkas, R., & Csirik, J. (2014). Szeged Corpus 2.5: Morphological modifications in a manually POS-tagged Hungarian corpus. In Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014 (pp. 1074-1078). European Language Resources Association (ELRA).