Background. Biomarker selection, i.e., identifying which variables are important in statistical regression or discrimination models, is an increasingly important topic in the omics sciences. Data from these fields are typically characterized by a low number of samples but a large number of variables; a meaningful biological interpretation is often possible only when considering the most important variables.
Methods. In this context, statistical tests like the t test will lead to many false positives, while multiple testing corrections tend to lose much power and select only very few variables. In addition, the cutoff value (usually set to a value like 5%) is often chosen in a haphazard way. We present two meta-statistics to tackle the problem of variable selection: higher criticism thresholding [1,2] and stability selection [3,4]. Higher criticism thresholding, applicable in a two-class discrimination setting, is a way to set suitable cutoff levels for significance, based on the data at hand. The underlying mechanism has been described as the “z-score of the p-value” [1]. The current work extends higher criticism to multivariate methods like PLSDA and the VIP statistics [4]. Stability selection is a novel variable selection method that assesses the stability of biomarker selections under perturbations of the data. The concept is extremely general and robust and can be applied in both regression and discrimination cases: primary selection methods assessed in this work include PLS and lasso models.
Results. Simulated as well as experimental data show very good results for both stability selection and higher criticism. The experimental data in this study consist of LC-MS metabolomics data of spiked-in apple extracts [5]; such spike-in data are extremely important in assessing the value of biomarker selection methods but are rarely available. Good results are also obtained in other areas of science [1-3].
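The higher criticism idea described under Methods can be illustrated with a minimal sketch. It is not the BioMark implementation: the function name `higher_criticism_select`, the search fraction `alpha0`, and the particular standardization of the objective (one of several variants found in the literature) are assumptions made here for illustration. The p-values are taken as given, for example from per-variable t tests.

```python
import math


def higher_criticism_select(pvalues, alpha0=0.1):
    """Pick a data-driven p-value cutoff by maximizing an HC objective.

    Hypothetical helper: returns (selected variable indices, p-value cutoff).
    Only the smallest alpha0 * n p-values are considered as candidate cutoffs.
    """
    n = len(pvalues)
    if n < 2:
        raise ValueError("need at least two p-values")
    # Variable indices ordered from smallest to largest p-value
    order = sorted(range(n), key=lambda j: pvalues[j])
    max_rank = min(n - 1, max(1, int(alpha0 * n)))

    best_rank, best_hc = 1, -math.inf
    for rank in range(1, max_rank + 1):
        p = pvalues[order[rank - 1]]
        frac = rank / n
        # "z-score of the p-value": the observed fraction of p-values at or
        # below p, minus the fraction expected under the null (p itself),
        # divided by an approximate standard deviation.
        hc = math.sqrt(n) * (frac - p) / math.sqrt(frac * (1.0 - frac))
        if hc > best_hc:
            best_rank, best_hc = rank, hc

    selected = sorted(order[:best_rank])
    return selected, pvalues[order[best_rank - 1]]


# Illustrative input: 5 very small p-values followed by 95 unremarkable ones
demo_pvalues = [1e-6] * 5 + [(k + 1) / 96 for k in range(95)]
demo_selected, demo_cutoff = higher_criticism_select(demo_pvalues)
```

The cutoff is read off the data rather than fixed in advance at a value like 5%, which is the point the abstract makes about haphazard thresholds.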
The advantages of stability selection include a broad applicability (regression, discrimination) and modest computational demands; on the other hand, the number of samples required is relatively high. For discrimination problems with fewer than, say, eight samples per class, it is probably better to rely on the higher criticism approach. Both higher criticism and stability selection have been implemented in an R package, BioMark, available from the CRAN repository, which also contains the experimental spike-in data.
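The principle of stability selection, repeating a primary selection on perturbed versions of the data and keeping only variables that are chosen consistently, can be sketched as follows. This is an illustration under stated assumptions, not the BioMark code: the primary selector here is a simple stand-in (top-k variables by a Welch-type t statistic) rather than the PLS or lasso models assessed in the work, and all function names, the subsampling fraction, and the frequency threshold `min_freq` are hypothetical choices.

```python
import random
import statistics


def top_k_by_t_statistic(X, y, k):
    """Stand-in primary selector: the k variables with the largest |t|."""
    n_vars = len(X[0])
    scores = []
    for j in range(n_vars):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        se = (statistics.variance(a) / len(a)
              + statistics.variance(b) / len(b)) ** 0.5
        scores.append(abs(statistics.mean(a) - statistics.mean(b)) / (se + 1e-12))
    return sorted(range(n_vars), key=lambda j: scores[j], reverse=True)[:k]


def stability_selection(X, y, k=3, n_rounds=100, fraction=0.7,
                        min_freq=0.8, seed=0):
    """Return (stable variable indices, selection frequency per variable)."""
    rng = random.Random(seed)
    class0 = [i for i, label in enumerate(y) if label == 0]
    class1 = [i for i, label in enumerate(y) if label == 1]
    counts = [0] * len(X[0])
    for _ in range(n_rounds):
        # Perturb the data: stratified subsample without replacement,
        # keeping at least two samples per class so variances are defined
        idx = (rng.sample(class0, max(2, round(fraction * len(class0))))
               + rng.sample(class1, max(2, round(fraction * len(class1)))))
        sub_X = [X[i] for i in idx]
        sub_y = [y[i] for i in idx]
        for j in top_k_by_t_statistic(sub_X, sub_y, k):
            counts[j] += 1
    freqs = [c / n_rounds for c in counts]
    stable = [j for j, f in enumerate(freqs) if f >= min_freq]
    return stable, freqs


def make_demo_data(n_per_class=10, n_vars=30, n_true=3, shift=5.0, seed=1):
    """Simulated two-class data; the first n_true variables differ by `shift`."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_vars)]
            if label == 1:
                for j in range(n_true):
                    row[j] += shift
            X.append(row)
            y.append(label)
    return X, y


demo_X, demo_y = make_demo_data()
demo_stable, demo_freqs = stability_selection(demo_X, demo_y)
```

Each subsample needs enough observations per class for the primary selector to behave sensibly, which is consistent with the remark above that stability selection requires a relatively high number of samples.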
Wehrens, H.R.M.J.; Franceschi, P. (2012). Meta-statistics for biomarker selection in the omics sciences. In: 4th StatSeq Workshop - Verona, Italy, 18-19 April 2012: 26. url: http://ddlab.sci.univr.it/statseq/booklet.pdf handle: http://hdl.handle.net/10449/22015
Meta-statistics for biomarker selection in the omics sciences
Wehrens, Herman Ronald Maria Johan;Franceschi, Pietro
2012-01-01