### Abstract

Every ecological study analyses a sample, because the whole statistical population cannot be measured. Numerical classification is a useful tool for exploring the structure of many kinds of ecological data, but it reflects the structure of the studied dataset (the sample), whereas we are interested in the structure of the statistical population from which the sample is derived. Some of the clusters produced by a classification may be representative only of the sample and not of the whole statistical population; such clusters can be called "artificial". This paper describes a method that helps to avoid interpreting these "artificial" clusters. The method is called validation because its steps are similar to validation as used in other fields of numerical analysis. In cluster analysis the defining characteristics of the particular clusters are unknown, so it is not possible to formulate testable hypotheses from the results of the cluster analysis. Therefore, the proposed method does not compare the clusters themselves but the "meaning" of the clusters, i.e. the characteristics that are used to interpret the results. Frequency of species was chosen as the "meaning" of clusters here, but other characteristics, e.g. the mean or median of continuous variables, can also be used. The new method is applied to an artificial dataset to illustrate the procedure and to show its merits.
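The abstract does not spell out the individual steps, so the following is only a minimal sketch of the general idea of validation by data splitting, not the procedure published in the paper. It assumes presence/absence data, Jaccard distance, average-linkage clustering, two random halves, and the mean absolute difference of species frequencies as the comparison measure; all function names and these choices are illustrative assumptions.

```python
import random


def jaccard_distance(a, b):
    """Jaccard distance between two presence/absence (0/1) vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either


def average_linkage(plots, k):
    """Agglomerative average-linkage clustering; returns k lists of plot indices."""
    clusters = [[i] for i in range(len(plots))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                d = sum(jaccard_distance(plots[a], plots[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def species_frequencies(plots, members):
    """Relative frequency of each species among the plots of one cluster."""
    return [sum(plots[m][s] for m in members) / len(members)
            for s in range(len(plots[0]))]


def compare_halves(half_a, half_b, k):
    """Cluster both halves separately and compare the cluster "meanings".

    For every cluster of half_a, return the smallest mean absolute
    difference between its species-frequency profile and the profile of
    any cluster of half_b. Values near 0 mean the cluster reappears in the
    other half; large values flag a possibly "artificial" cluster.
    """
    prof_a = [species_frequencies(half_a, c) for c in average_linkage(half_a, k)]
    prof_b = [species_frequencies(half_b, c) for c in average_linkage(half_b, k)]
    return [min(sum(abs(x - y) for x, y in zip(p, q)) / len(p) for q in prof_b)
            for p in prof_a]


def validate_by_splitting(plots, k, seed=0):
    """Split the plots into two random halves and compare their clusters."""
    order = list(range(len(plots)))
    random.Random(seed).shuffle(order)
    half = len(order) // 2
    half_a = [plots[i] for i in order[:half]]
    half_b = [plots[i] for i in order[half:]]
    return compare_halves(half_a, half_b, k)
```

In this sketch a cluster of the first half whose frequency profile has no close counterpart in the second half would be treated with caution when interpreting the classification. The threshold for "close" is left open here, since the abstract gives no criterion.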

Original language | English |
---|---|
Pages (from-to) | 73-80 |
Number of pages | 8 |
Journal | Acta Botanica Hungarica |
Volume | 50 |
Issue number | 1-2 |
DOIs | https://doi.org/10.1556/ABot.50.2008.1-2.4 |
Publication status | Published - Mar 2008 |

### Keywords

- Classification
- Data splitting
- Validation

### ASJC Scopus subject areas

- Ecology, Evolution, Behavior and Systematics
- Plant Science

### Cite this

**Validation of hierarchical classifications by splitting dataset.** / Botta-Dukát, Z.

Research output: Contribution to journal › Article

*Acta Botanica Hungarica*, vol. 50, no. 1-2, pp. 73-80. https://doi.org/10.1556/ABot.50.2008.1-2.4

TY - JOUR

T1 - Validation of hierarchical classifications by splitting dataset

AU - Botta-Dukát, Z.

PY - 2008/3

Y1 - 2008/3

KW - Classification

KW - Data splitting

KW - Validation

UR - http://www.scopus.com/inward/record.url?scp=41049097105&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=41049097105&partnerID=8YFLogxK

U2 - 10.1556/ABot.50.2008.1-2.4

DO - 10.1556/ABot.50.2008.1-2.4

M3 - Article

VL - 50

SP - 73

EP - 80

JO - Acta Botanica Hungarica

JF - Acta Botanica Hungarica

SN - 0236-6495

IS - 1-2

ER -