Validation of hierarchical classifications by splitting dataset

Research output: Contribution to journal › Article

3 Citations (Scopus)

Abstract

In any ecological study we analyse a sample, because the whole statistical population cannot be measured. Numerical classification is a useful tool for exploring the structure of many kinds of ecological data, but it reflects the structure of the studied dataset (the sample), whereas we are interested in the structure of the statistical population from which the sample is drawn. Some of the clusters obtained by the classification may be representative only of the sample and not of the whole statistical population; these clusters can be called "artificial". This paper describes a method that helps to avoid interpreting these "artificial" clusters. The method is called validation because its steps are similar to validation as used in other fields of numerical analysis. In cluster analysis the defining characteristics of the particular clusters are unknown, so no testable hypotheses can be formulated from the results of the cluster analysis. Therefore, the proposed method does not compare the clusters themselves but the "meaning" of the clusters, i.e. the characteristics that are used to interpret the results. Frequency of species was chosen as the "meaning" of clusters here, but other characteristics, e.g. the mean or median of continuous variables, can also be used. The new method is applied to an artificial dataset to illustrate the procedure and to show its merits.
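The idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's procedure, which the abstract does not specify. It assumes presence/absence data, a random split into two halves, naive average-linkage clustering with Jaccard dissimilarity, and a mean absolute difference between species-frequency profiles as the comparison. All function names (`jaccard`, `average_linkage`, `species_frequencies`, `split_validate`) and the toy data are hypothetical.

```python
import random

def jaccard(a, b):
    """Jaccard dissimilarity between two presence/absence rows."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 - both / either if either else 0.0

def average_linkage(rows, k):
    """Naive agglomerative clustering: merge the closest pair until k clusters remain."""
    clusters = [[i] for i in range(len(rows))]

    def dist(c1, c2):
        total = sum(jaccard(rows[i], rows[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        pairs = [(dist(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def species_frequencies(rows, cluster):
    """Fraction of plots in the cluster in which each species is present."""
    n_species = len(rows[0])
    return [sum(rows[i][s] for i in cluster) / len(cluster)
            for s in range(n_species)]

def split_validate(rows, k, seed=0):
    """Cluster two random halves separately and compare cluster 'meanings'.

    Returns, for each cluster of the first half, the smallest mean absolute
    difference in species frequency to any cluster of the second half.
    Small values suggest the cluster is reproduced in the other half.
    """
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    half = len(idx) // 2
    halves = [[rows[i] for i in idx[:half]], [rows[i] for i in idx[half:]]]
    profiles = [[species_frequencies(sub, c) for c in average_linkage(sub, k)]
                for sub in halves]
    return [min(sum(abs(x - y) for x, y in zip(pa, pb)) / len(pa)
                for pb in profiles[1])
            for pa in profiles[0]]

# Toy data (assumption): two plot types with disjoint species sets.
plots = [[1, 1, 1, 0, 0, 0]] * 10 + [[0, 0, 0, 1, 1, 1]] * 10
print(split_validate(plots, k=2))
```

The sketch compares frequency profiles and not cluster memberships, because the two halves share no plots. A cluster whose profile has no close counterpart in the other half would be a candidate "artificial" cluster.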

Original language: English
Pages (from-to): 73-80
Number of pages: 8
Journal: Acta Botanica Hungarica
Volume: 50
Issue number: 1-2
DOI: 10.1556/ABot.50.2008.1-2.4
Publication status: Published - Mar 2008

Keywords

  • Classification
  • Data splitting
  • Validation

ASJC Scopus subject areas

  • Ecology, Evolution, Behavior and Systematics
  • Plant Science

Cite this

Validation of hierarchical classifications by splitting dataset. / Botta-Dukát, Z.

In: Acta Botanica Hungarica, Vol. 50, No. 1-2, 03.2008, p. 73-80.

@article{652a4553396b47a3925b55076e231008,
title = "Validation of hierarchical classifications by splitting dataset",
abstract = "In any ecological study we analyse a sample, because the whole statistical population cannot be measured. Numerical classification is a useful tool for exploring the structure of many kinds of ecological data, but it reflects the structure of the studied dataset (the sample), whereas we are interested in the structure of the statistical population from which the sample is drawn. Some of the clusters obtained by the classification may be representative only of the sample and not of the whole statistical population; these clusters can be called {"}artificial{"}. This paper describes a method that helps to avoid interpreting these {"}artificial{"} clusters. The method is called validation because its steps are similar to validation as used in other fields of numerical analysis. In cluster analysis the defining characteristics of the particular clusters are unknown, so no testable hypotheses can be formulated from the results of the cluster analysis. Therefore, the proposed method does not compare the clusters themselves but the {"}meaning{"} of the clusters, i.e. the characteristics that are used to interpret the results. Frequency of species was chosen as the {"}meaning{"} of clusters here, but other characteristics, e.g. the mean or median of continuous variables, can also be used. The new method is applied to an artificial dataset to illustrate the procedure and to show its merits.",
keywords = "Classification, Data splitting, Validation",
author = "Z. Botta-Duk{\'a}t",
year = "2008",
month = "3",
doi = "10.1556/ABot.50.2008.1-2.4",
language = "English",
volume = "50",
pages = "73--80",
journal = "Acta Botanica Hungarica",
issn = "0236-6495",
publisher = "Akad{\'e}miai Kiad{\'o}",
number = "1-2",

}

TY - JOUR

T1 - Validation of hierarchical classifications by splitting dataset

AU - Botta-Dukát, Z.

PY - 2008/3

Y1 - 2008/3

AB - In any ecological study we analyse a sample, because the whole statistical population cannot be measured. Numerical classification is a useful tool for exploring the structure of many kinds of ecological data, but it reflects the structure of the studied dataset (the sample), whereas we are interested in the structure of the statistical population from which the sample is drawn. Some of the clusters obtained by the classification may be representative only of the sample and not of the whole statistical population; these clusters can be called "artificial". This paper describes a method that helps to avoid interpreting these "artificial" clusters. The method is called validation because its steps are similar to validation as used in other fields of numerical analysis. In cluster analysis the defining characteristics of the particular clusters are unknown, so no testable hypotheses can be formulated from the results of the cluster analysis. Therefore, the proposed method does not compare the clusters themselves but the "meaning" of the clusters, i.e. the characteristics that are used to interpret the results. Frequency of species was chosen as the "meaning" of clusters here, but other characteristics, e.g. the mean or median of continuous variables, can also be used. The new method is applied to an artificial dataset to illustrate the procedure and to show its merits.

KW - Classification

KW - Data splitting

KW - Validation

UR - http://www.scopus.com/inward/record.url?scp=41049097105&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=41049097105&partnerID=8YFLogxK

U2 - 10.1556/ABot.50.2008.1-2.4

DO - 10.1556/ABot.50.2008.1-2.4

M3 - Article

AN - SCOPUS:41049097105

VL - 50

SP - 73

EP - 80

JO - Acta Botanica Hungarica

JF - Acta Botanica Hungarica

SN - 0236-6495

IS - 1-2

ER -