Comparison of ridge regression, partial least-squares, pairwise correlation, forward- and best subset selection methods for prediction of retention indices for aliphatic alcohols

Orsolya Farkas, K. Heberger

Research output: Contribution to journal › Article

32 Citations (Scopus)

Abstract

A quantitative structure-retention relationship (QSRR) study based on multiple linear regression (MLR) was performed for the description and prediction of Kováts retention indices (RI) of alcohol compounds. Alcohols were of saturated, linear or branched types and contained a hydroxyl group on the primary, secondary or tertiary carbon atoms. Constitutive and weighted holistic invariant molecular (WHIM) descriptors were used to represent the structure of alcohols in the MLR models. Before the model building, five variable selection methods were applied to select the most relevant variables from a large set of descriptors, respectively. The selected molecular properties were included into the MLR models. The efficiency of the variable selection methods was also compared. The selection methods were as follows: ridge regression (RR), partial least-squares method (PLS), pair-correlation method (PCM), forward selection (FS) and best subset selection (BSS). The stability and the validity of the MLR models were tested by a cross-validation technique using a leave-n-out technique. Neither RR nor PLS selected variables were able to describe the Kováts retention index properly, and PCM gave reliable results in the description but not for prediction. We built models with good predicting ability using FS and BSS as a selection method. The most relevant variables in the description and prediction of RIs were the mean electrotopological state index, the molecular mass, and WHIM indices characterizing size and shape.
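The workflow summarized in the abstract (select descriptors, fit an MLR model, check it by leave-n-out cross-validation) can be illustrated with a minimal sketch. This is not the authors' implementation: the data are synthetic random numbers standing in for molecular descriptors, the function names (`fit_mlr`, `leave_n_out_q2`, `forward_selection`) are hypothetical, and the cross-validated Q² is used here as the selection criterion as an assumption, since the abstract does not state which criterion the forward selection used.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least-squares fit with an intercept; returns the coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    """Predict responses from fitted coefficients (intercept first)."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

def leave_n_out_q2(X, y, n_out=3):
    """Cross-validated Q^2, leaving out consecutive blocks of n_out samples."""
    idx = np.arange(len(y))
    press = 0.0  # predictive residual sum of squares
    for start in range(0, len(y), n_out):
        test = idx[start:start + n_out]
        train = np.setdiff1d(idx, test)
        coef = fit_mlr(X[train], y[train])
        press += np.sum((y[test] - predict_mlr(coef, X[test])) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def forward_selection(X, y, max_vars=3, n_out=3):
    """Greedily add the descriptor that most improves the cross-validated Q^2."""
    selected, best_q2 = [], -np.inf
    while len(selected) < max_vars:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        scores = {j: leave_n_out_q2(X[:, selected + [j]], y, n_out)
                  for j in candidates}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_q2:
            break  # no candidate improves the model any further
        selected.append(j_best)
        best_q2 = scores[j_best]
    return selected, best_q2

# Synthetic stand-in data (assumption): 30 "compounds", 8 "descriptors",
# with the response depending only on columns 1 and 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + rng.normal(scale=0.1, size=30)

selected, q2 = forward_selection(X, y)
print("selected descriptor columns:", selected)
print("cross-validated Q^2:", round(q2, 3))
```

Scoring each candidate by its cross-validated Q² rather than by the fit on the training data reflects the distinction the abstract draws between describing the retention indices and predicting them.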

Original language: English
Pages (from-to): 339-346
Number of pages: 8
Journal: Journal of Chemical Information and Modeling
Volume: 45
Issue number: 2
DOIs: 10.1021/ci049827t
Publication status: Published - Mar 2005

Fingerprint

Set theory
Alcohols
Linear regression
alcohol
regression
Correlation methods
Molecular mass
Hydroxyl Radical
Carbon
Atoms
efficiency
ability

ASJC Scopus subject areas

  • Chemistry(all)
  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems

Cite this

@article{8926e2911c1a4ba0af5db0d09454ebee,
title = "Comparison of ridge regression, partial least-squares, pairwise correlation, forward- and best subset selection methods for prediction of retention indices for aliphatic alcohols",
abstract = "A quantitative structure-retention relationship (QSRR) study based on multiple linear regression (MLR) was performed for the description and prediction of Kov{\'a}ts retention indices (RI) of alcohol compounds. Alcohols were of saturated, linear or branched types and contained a hydroxyl group on the primary, secondary or tertiary carbon atoms. Constitutive and weighted holistic invariant molecular (WHIM) descriptors were used to represent the structure of alcohols in the MLR models. Before the model building, five variable selection methods were applied to select the most relevant variables from a large set of descriptors, respectively. The selected molecular properties were included into the MLR models. The efficiency of the variable selection methods was also compared. The selection methods were as follows: ridge regression (RR), partial least-squares method (PLS), pair-correlation method (PCM), forward selection (FS) and best subset selection (BSS). The stability and the validity of the MLR models were tested by a cross-validation technique using a leave-n-out technique. Neither RR nor PLS selected variables were able to describe the Kov{\'a}ts retention index properly, and PCM gave reliable results in the description but not for prediction. We built models with good predicting ability using FS and BSS as a selection method. The most relevant variables in the description and prediction of RIs were the mean electrotopological state index, the molecular mass, and WHIM indices characterizing size and shape.",
author = "Orsolya Farkas and K. Heberger",
year = "2005",
month = "3",
doi = "10.1021/ci049827t",
language = "English",
volume = "45",
pages = "339--346",
journal = "Journal of Chemical Information and Modeling",
issn = "1549-9596",
publisher = "American Chemical Society",
number = "2",

}

TY - JOUR

T1 - Comparison of ridge regression, partial least-squares, pairwise correlation, forward- and best subset selection methods for prediction of retention indices for aliphatic alcohols

AU - Farkas, Orsolya

AU - Heberger, K.

PY - 2005/3

Y1 - 2005/3

AB - A quantitative structure-retention relationship (QSRR) study based on multiple linear regression (MLR) was performed for the description and prediction of Kováts retention indices (RI) of alcohol compounds. Alcohols were of saturated, linear or branched types and contained a hydroxyl group on the primary, secondary or tertiary carbon atoms. Constitutive and weighted holistic invariant molecular (WHIM) descriptors were used to represent the structure of alcohols in the MLR models. Before the model building, five variable selection methods were applied to select the most relevant variables from a large set of descriptors, respectively. The selected molecular properties were included into the MLR models. The efficiency of the variable selection methods was also compared. The selection methods were as follows: ridge regression (RR), partial least-squares method (PLS), pair-correlation method (PCM), forward selection (FS) and best subset selection (BSS). The stability and the validity of the MLR models were tested by a cross-validation technique using a leave-n-out technique. Neither RR nor PLS selected variables were able to describe the Kováts retention index properly, and PCM gave reliable results in the description but not for prediction. We built models with good predicting ability using FS and BSS as a selection method. The most relevant variables in the description and prediction of RIs were the mean electrotopological state index, the molecular mass, and WHIM indices characterizing size and shape.

UR - http://www.scopus.com/inward/record.url?scp=18344383728&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=18344383728&partnerID=8YFLogxK

U2 - 10.1021/ci049827t

DO - 10.1021/ci049827t

M3 - Article

C2 - 15807497

AN - SCOPUS:18344383728

VL - 45

SP - 339

EP - 346

JO - Journal of Chemical Information and Modeling

JF - Journal of Chemical Information and Modeling

SN - 1549-9596

IS - 2

ER -