A missing data approach to validate nonlinear PCA

Scholz, Matthias Uwe
2012-01-01

Abstract

Many biological processes behave nonlinearly: observations taken over time usually trace a curved trajectory in the data space. To understand the dynamics of such a process, we can identify and analyse this time trajectory using a nonlinear extension of principal component analysis (PCA). Nonlinear PCA is frequently applied in the atmospheric and oceanic sciences, e.g., to analyse the El Niño-Southern Oscillation. It is also used in molecular biology for gene expression analysis, e.g., to analyse the reproductive cycle of the malaria parasite Plasmodium falciparum in red blood cells. Linear PCA can be extended to nonlinear PCA by using artificial neural networks, but the benefit of curved components requires careful control of the model complexity. Moreover, standard techniques for model selection, including cross-validation and, more generally, the use of an independent test set, fail when applied to nonlinear PCA because of its inherently unsupervised nature. Here, we propose a natural approach that validates a model by its own ability to estimate missing data. It is motivated by the idea that only the model of optimal complexity can predict missing values with the highest accuracy. While standard test set validation usually favours overfitted nonlinear PCA models, the proposed validation approach correctly selects the optimal model complexity of nonlinear PCA.

Keywords: Nonlinear PCA; Neural networks; Missing data; Model validation; Time series; Gene expression analysis

Scholz, M.U. (2012). A missing data approach to validate nonlinear PCA. In: 16th Annual International Conference on Research in Computational Molecular Biology, Barcelona, 21-24 April, 2012. url: http://cdn.f1000.com/posters/docs/249352732 handle: http://hdl.handle.net/10449/22019