Patent categorization (PC) is a typical application area of text categorization (TC). TC can be applied in different scenarios in the work of patent offices, depending on the stage at which categorization is needed. This is a challenging field for TC algorithms, since applications have to deal simultaneously with a large number of categories (on the order of 1,000-10,000) organized in a hierarchy and with a large number of long documents with huge vocabularies at training time, and they are required to work quickly and accurately in on-the-fly categorization. In this chapter we present a hierarchical online classifier, called HITEC, which meets the above requirements. The novelty of the method lies in the taxonomy-dependent architecture of the classifier, the applied weight-updating scheme, and the relaxed category selection method. We evaluate the method on two large English patent application databases, the WIPO-alpha and the Espace A/B corpora. We also compare the presented method to other TC algorithms on these collections and show that it outperforms them significantly.
|Title of host publication||Emerging Technologies of Text Mining|
|Subtitle of host publication||Techniques and Applications|
|Number of pages||24|
|Publication status||Published - Dec 1 2007|
ASJC Scopus subject areas
- Social Sciences (all)