Validation of hierarchical classifications by splitting dataset

Research output: Article

3 Citations (Scopus)


Any ecological study analyses a sample, because the whole statistical population cannot be measured. Numerical classification is a useful tool for exploring the structure of many kinds of ecological data, but it reflects the structure of the studied dataset (the sample), whereas we are interested in the structure of the statistical population from which the sample is derived. Some of the clusters obtained by the classification may be representative only of the sample and not of the whole statistical population; such clusters can be called "artificial". This paper describes a method that helps to avoid interpreting these "artificial" clusters. The method is called validation because its steps are similar to those of validation in other fields of numerical analysis. In cluster analysis the definitive characteristics of the particular clusters are unknown, so it is not possible to formulate testable hypotheses from the results of the cluster analysis. Therefore, the method proposed here does not compare the clusters themselves but the "meaning" of the clusters, i.e. the characteristics that are used to interpret the results. Frequency of species was chosen as the "meaning" of clusters here, but other characteristics, e.g. the mean or median of continuous variables, can also be used. The new method is applied to an artificial dataset to illustrate the procedure and to show its merits.
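
The idea of comparing the "meaning" of clusters rather than the clusters themselves can be illustrated with a minimal sketch. It is not the procedure of the paper, whose exact steps the abstract does not give. It assumes that the dataset has already been split into two halves, that each half has been classified separately, and that each plot is recorded as the set of species present in it. The function names, the mean absolute difference used to compare frequency profiles, and the toy data are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def species_frequencies(plots, labels):
    """Relative frequency of each species within each cluster.

    plots  : list of sets, the species present in each plot
    labels : list of cluster labels, one per plot (assumed to come from
             a classification of this half of the dataset)
    """
    members = defaultdict(list)
    for plot, label in zip(plots, labels):
        members[label].append(plot)
    freqs = {}
    for label, group in members.items():
        species = set().union(*group)
        freqs[label] = {s: sum(s in p for p in group) / len(group)
                        for s in species}
    return freqs


def profile_distance(f1, f2):
    """Mean absolute difference between two frequency profiles
    (an assumed measure, chosen only for this illustration)."""
    species = set(f1) | set(f2)
    if not species:
        return 0.0
    return sum(abs(f1.get(s, 0.0) - f2.get(s, 0.0))
               for s in species) / len(species)


def best_matches(freqs_a, freqs_b):
    """For each cluster of half A, find the cluster of half B whose
    species-frequency profile is closest, with the distance."""
    result = {}
    for a, fa in freqs_a.items():
        b = min(freqs_b, key=lambda k: profile_distance(fa, freqs_b[k]))
        result[a] = (b, profile_distance(fa, freqs_b[b]))
    return result


# Hypothetical toy data: two halves of a split dataset, each with its
# own, independently obtained cluster labels.
half_a_plots = [{"sp1", "sp2"}, {"sp1"}, {"sp3", "sp4"}, {"sp3"}]
half_a_labels = [1, 1, 2, 2]
half_b_plots = [{"sp3", "sp4"}, {"sp3"}, {"sp1", "sp2"}, {"sp1", "sp2"}]
half_b_labels = ["x", "x", "y", "y"]

fa = species_frequencies(half_a_plots, half_a_labels)
fb = species_frequencies(half_b_plots, half_b_labels)
matches = best_matches(fa, fb)
print(matches)
```

A cluster of one half whose frequency profile has no close counterpart in the other half would, under these assumptions, be a candidate "artificial" cluster. How close is close enough is left open here.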

Original language: English
Pages (from-to): 73-80
Number of pages: 8
Journal: Acta Botanica Hungarica
Issue number: 1-2
Publication status: Published - Mar 1 2008


ASJC Scopus subject areas

  • Ecology, Evolution, Behavior and Systematics
  • Plant Science
