CINECA IRIS Institutional Research Information System



A missing data approach to validate nonlinear PCA

Scholz, Matthias Uwe

2012-01-01

Abstract

Many biological processes behave nonlinearly: observations over time usually trace a curved trajectory in the data space. To understand the dynamics of such processes, we can identify and analyse this time trajectory using a nonlinear extension of principal component analysis (PCA). Nonlinear PCA is frequently applied in atmospheric and oceanic sciences, e.g., to analyse the El Niño-Southern Oscillation. In molecular biology, it is also used for gene expression analysis, e.g., to analyse the reproductive cycle of the malaria parasite Plasmodium falciparum in red blood cells. Linear PCA can be extended to nonlinear PCA by using artificial neural networks, but the benefit of curved components requires careful control of model complexity. Moreover, standard techniques for model selection, including cross-validation and, more generally, the use of an independent test set, fail when applied to nonlinear PCA because of its inherently unsupervised nature. Here, we propose a natural approach that validates a model by its own ability to estimate missing data. It is motivated by the idea that only the model of optimal complexity is able to predict missing values with the highest accuracy. While standard test set validation usually favours overfitted nonlinear PCA models, the proposed validation approach correctly selects the optimal model complexity of nonlinear PCA.


Keywords
	Nonlinear PCA
	Neural networks
	Missing data
	Model validation
	Time series
	Gene expression analysis

Date of issue
	2012

Citation
	Scholz, M.U. (2012). A missing data approach to validate nonlinear PCA. In: 16th Annual International Conference on Research in Computational Molecular Biology, Barcelona, 21-24 April, 2012. url: http://cdn.f1000.com/posters/docs/249352732 handle: http://hdl.handle.net/10449/22019

Appears in types
	4.03 Poster

Files in this item:

File	Access	License	Size	Format
2012 Barcelona Scholz.pdf	Open access	All rights reserved	1.22 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10449/22019
