Leonardelli, Lorena (2014-12-09). Grapevine acidity: SVM tool development and NGS data analyses. (Doctoral Thesis). Università degli studi di Trento. Montpellier SupAgro-INRA, a.y. 2013/2014, ICT Doctoral School - XXVI cycle, Ecole Doctoral SIBAGHE, GMPF. handle: http://hdl.handle.net/10449/24467

Grapevine acidity: SVM tool development and NGS data analyses

Leonardelli, Lorena
2014-12-09

Abstract

Single Nucleotide Polymorphisms (SNPs) are the most abundant type of genetic variation and a valuable tool for many biological applications, including linkage mapping, integration of genetic and physical maps, population genetics, and evolutionary and protein structure-function studies. SNP genotyping by mapping DNA reads produced by next-generation sequencing (NGS) technologies onto a reference genome is now a very common and convenient approach, but it is still prone to a significant error rate. The need to define true genetic variants in silico in genomic and transcriptomic sequences is prompted by the high cost of experimental validation through re-sequencing or SNP arrays, not only in terms of money but also of time and sample availability. Several open-source tools have recently been developed to identify small variants in whole-genome data, but the candidate variants they report in the VCF output format still have a high false positive rate. The goal of this thesis is to develop a bioinformatic method that classifies variant calling outputs in order to reduce the number of false positive calls. With the aim of dissecting the molecular bases of acidity in grape (Vitis vinifera L.), this tool was then used to select SNPs in two grapevine varieties that differ markedly in the organic acid content of the berry. The VCF parameters were used to train a Support Vector Machine (SVM) that classifies VCF records into true and false positive variants, removing the most likely false positives from the output. The SVM approach was implemented in a new software tool, called VerySNP, and applied to model and non-model organisms. In both cases, the machine learning method efficiently distinguished true positive from false positive variants in both genomic and transcriptomic sequences.
In the second part of the thesis, VerySNP was applied to identify true SNPs in RNA-seq data of the grapevine variety Gora Chirine, characterized by low acidity, and of Sultanine, a normal-acidity variety closely related to Gora Chirine. The comparative transcriptomic analysis, combined with the SNP information, led to the discovery of non-synonymous polymorphisms inside coding regions and thus provided a list of candidate genes potentially affecting acidity in grapevine.
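As a rough illustration of how VCF parameters can serve as classifier input, the sketch below turns one VCF data line into a numeric feature vector. It is not VerySNP's code: the abstract does not say which VCF parameters were used, so the feature set (`QUAL`, `DP`, `MQ`), the function names, and the handling of missing values are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical feature set: the abstract does not list the VCF parameters used.
FEATURE_KEYS = ["QUAL", "DP", "MQ"]


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO column such as 'DP=14;MQ=60;DB' into a key -> value dict."""
    fields: Dict[str, str] = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue  # '.' marks an empty INFO column
        key, _, value = item.partition("=")
        fields[key] = value  # flags without '=' map to an empty string
    return fields


def record_to_features(line: str, keys: List[str] = FEATURE_KEYS) -> List[float]:
    """Turn one tab-separated VCF data line into a numeric feature vector."""
    cols = line.rstrip("\n").split("\t")
    # Fixed VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO
    values = {**parse_info(cols[7]), "QUAL": cols[5]}
    features = []
    for key in keys:
        try:
            features.append(float(values.get(key, "")))
        except ValueError:
            # Missing or non-numeric values become 0.0 (a simplification).
            features.append(0.0)
    return features


record = "chr1\t12345\t.\tA\tG\t50.0\tPASS\tDP=14;MQ=60;DB"
print(record_to_features(record))  # [50.0, 14.0, 60.0]
```

Vectors of this kind, paired with labels from experimentally validated variants, are the sort of input an SVM library could be trained on; feature scaling and the treatment of missing values would need more care in practice.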
MOSER, CLAUDIO
NGS
SVM
Variant calling
Grapevine
Acidity
Sector INF/01 - Computer Science
9 Dec 2014
2013/2014
ICT Doctoral School - XXVI cycle
Ecole Doctoral SIBAGHE
GMPF
Files in this item:
File: PhD-Thesis.pdf
Access: Open Access from 1 July 2015
Description: Main text
License: All rights reserved
Size: 4.66 MB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/10449/24467