Many biological processes behave nonlinearly. Observations over time usually trace a curved trajectory in the data space. To understand the dynamics of such processes, we can identify and analyse this time trajectory by using a nonlinear extension of principal component analysis (PCA). Nonlinear PCA is frequently applied in the atmospheric and oceanic sciences, e.g., to analyse the El Niño-Southern Oscillation. It is also used in molecular biology for gene expression analysis, e.g., to analyse the reproductive cycle of the malaria parasite Plasmodium falciparum in red blood cells. Linear PCA can be extended to nonlinear PCA by using artificial neural networks, but the benefit of curved components requires careful control of the model complexity. Moreover, standard techniques for model selection, including cross-validation and, more generally, the use of an independent test set, fail when applied to nonlinear PCA because of its inherently unsupervised nature. Here, we propose a natural approach that validates a model by its own ability to estimate missing data. It is motivated by the idea that only the model of optimal complexity is able to predict missing values with the highest accuracy. While standard test set validation usually favours overfitted nonlinear PCA models, the proposed model validation approach correctly selects the optimal model complexity of nonlinear PCA.
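The validation idea in the abstract can be illustrated with a minimal sketch. It is not the method from the poster: linear PCA fitted by iterative SVD imputation is swapped in for the neural-network-based nonlinear PCA, and the number of components stands in for model complexity. The function names (`impute_with_pca`, `missing_data_error`, `make_data`), the synthetic data, and all parameter values are assumptions made for illustration only.

```python
import numpy as np


def impute_with_pca(X, mask, k, n_iter=200):
    """Estimate masked entries with a rank-k linear PCA model (iterative SVD).

    Linear stand-in for nonlinear PCA. mask is True where a value is
    treated as missing and must not be seen by the model.
    """
    # Start by filling the hidden entries with the observed column means.
    col_means = np.nanmean(np.where(mask, np.nan, X), axis=0)
    filled = np.where(mask, col_means, X)
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        recon = (U[:, :k] * s[:k]) @ Vt[:k] + mu
        # Only the hidden entries are updated; observed values stay fixed.
        filled = np.where(mask, recon, X)
    return filled


def missing_data_error(X, k, frac=0.1, seed=0):
    """Hide a random fraction of entries and score how well they are estimated."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < frac
    estimate = impute_with_pca(X, mask, k)
    return float(np.mean((estimate[mask] - X[mask]) ** 2))


def make_data(n=200, d=8, true_k=2, noise=0.1, seed=1):
    """Synthetic data with a known number of underlying components (assumed)."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n, true_k))
    W = rng.normal(size=(true_k, d))
    return Z @ W + noise * rng.normal(size=(n, d))


X = make_data()
# Candidate complexities: here, the number of retained components.
errors = {k: missing_data_error(X, k) for k in range(1, 6)}
best_k = min(errors, key=errors.get)
for k, err in errors.items():
    print(f"k={k}: missing-data error = {err:.4f}")
print("selected complexity:", best_k)
```

In this sketch the same hidden entries are used for every candidate, so the errors are directly comparable. A model that is too simple cannot reconstruct the hidden values, while an overly complex one would be expected to fit noise and estimate them less accurately, which is the intuition the abstract appeals to.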
|Citation:||Scholz, M. (2012). A missing data approach to validate nonlinear PCA. In: 16th Annual International Conference on Research in Computational Molecular Biology, Barcelona, 21-24 April, 2012. url: http://cdn.f1000.com/posters/docs/249352732 handle: http://hdl.handle.net/10449/22019|
|Organization unit:||Computational Biology # CRI_2011-JAN2016|
|Title:||A missing data approach to validate nonlinear PCA|
|Keywords ENG:||Nonlinear PCA; Gene expression analysis|
|Appears in Collections:||03 - Conference object|