VCF features to train SVM in grapevine SNP detection

Leonardelli, L.; Cestaro, A.; Livi, C.M.; Romieu, C.; This, P.; Moser, C.; Blanzieri, E.

Motivation. Although next generation sequencing (NGS) technologies are increasing genomic information at an unprecedented pace, they are prone to an error rate higher than 1 per 100 bp. Efficient approaches are thus needed to distinguish real polymorphisms from abundant sequencing artifacts. Many open source tools have recently been developed to identify Single Nucleotide Polymorphisms (SNPs) in whole genome data, the most popular being SAMtools (Li et al., 2009) and GATK (Genome Analysis Toolkit; DePristo et al., 2011). Their accuracy is still unsatisfactory, however, because of a high rate of false positive polymorphism predictions. SNPs are the most abundant type of DNA sequence mutation and are efficient markers for several biological applications, such as cultivar identification, construction of genetic maps, assessment of genetic diversity, detection of genotype/phenotype associations, and marker-assisted breeding. The importance of reporting only true SNPs is evident considering the high cost of SNP validation through re-sequencing or SNP chips, not only in monetary terms but also in time and, above all, sample availability. Since these small mutations can be responsible for large changes in the physiology or the evolution of an organism, we aim to determine whether this category of polymorphisms is the genetic determinant of the low acidity of the grapevine (Vitis vinifera L.) cultivar Gora Chirine. We are therefore investigating the acidity trait in grapevine by comparative analysis of the genome sequences of Gora and Sultanine, the latter being a cultivar of normal acidity and a close genetic relative of Gora. Malic acid in grape berries is an essential quality parameter in wine fermentation, and the identification of new genes involved in grapevine acid metabolism is of common interest to biologists, enologists and wine makers.

Method. SAM/BAM (Sequence Alignment/Map and its binary counterpart, Binary Alignment/Map) is the data format for aligned sequences now adopted by the entire genomics community. Calling SNPs from SAM/BAM files with predictors such as SAMtools and GATK produces a Variant Call Format (VCF) file as output (Danecek et al., 2011). The VCF file lists the candidate SNPs with their positions on the contigs, the nucleotide present in the reference genome and in the alternative alleles, the SNP call quality, the genotype and many other parameters. It is difficult to weigh all these values together to decide which candidates are actual polymorphisms and which are sequencing errors, but the VCF parameters become much more informative when used as features to train a Support Vector Machine (SVM) (Vapnik et al., 1998) that classifies the candidate SNPs into real SNPs and false positives. SVM is an efficient and reliable machine learning method for separating data into categories: it separates the positive and negative training data by constructing a linear classifier, or a non-linear classifier by means of a kernel function. Based on the training features, SVM represents the data as points in space, with the two categories (positive and negative) divided by a gap that is as wide as possible. The training features were calculated on an experimentally validated set of SNPs (the positive data set, 550 SNPs) and on monomorphic SNP positions (the negative control data set, 300 positions). The SNP predictors, SAMtools and GATK, output approximately 400 of the 520 positive SNPs and 40 of the 300 negative SNPs, leaving the two classes strongly unbalanced; we therefore re-balanced the SNP sets with the SMOTE algorithm (Chawla et al., 2011) before the SVM training. The SVM training was validated by 10-fold cross validation. The resulting model was applied to the biological study mentioned above as well as to other data sets.

Results. SVM trained with 23 VCF parameters as features reached an average accuracy of 94% with SAMtools data and 91% with GATK data. The SVM performance indicated which VCF parameters are decisive in determining whether a polymorphic site is a real SNP or an error due to sequencing or to low-quality nucleotide alignment. SVM could efficiently distinguish true SNPs from false positive predictions, as shown by the high sensitivity (GATK 94%, SAMtools 96%), specificity (GATK 63%, SAMtools 65%), and precision (GATK 97%, SAMtools 97%) obtained in the SVM 10-fold cross validation.
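The Method paragraph relies on turning each candidate SNP record of a VCF file into a numeric feature vector. A minimal sketch of that step follows, assuming a standard tab-separated VCF data line. The function names, the sample record and the choice of INFO keys (DP, MQ) are illustrative assumptions, not the 23 parameters used in the study.

```python
def parse_info(info_field):
    """Split a VCF INFO column such as 'DP=14;MQ=60;INDEL' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        if not entry or entry == ".":
            continue
        key, sep, value = entry.partition("=")
        # Flag entries (no '=') are stored as True.
        info[key] = value if sep else True
    return info


def vcf_line_to_features(line, info_keys=("DP", "MQ")):
    """Return (chrom, pos, features) for one VCF data line.

    features = [QUAL] followed by one numeric value per requested INFO key.
    Missing or non-numeric values become 0.0 (an arbitrary choice here).
    """
    cols = line.rstrip("\n").split("\t")
    # VCF column order: CHROM POS ID REF ALT QUAL FILTER INFO ...
    chrom, pos, qual, info_field = cols[0], int(cols[1]), cols[5], cols[7]
    info = parse_info(info_field)
    features = [float(qual) if qual != "." else 0.0]
    for key in info_keys:
        raw = info.get(key, 0.0)
        try:
            # Multi-valued entries such as 'AF=0.5,0.1': keep the first value.
            features.append(float(str(raw).split(",")[0]))
        except ValueError:
            features.append(0.0)
    return chrom, pos, features


# Hypothetical record, made up for illustration only.
record = "contig_1\t1042\t.\tA\tG\t87.3\tPASS\tDP=14;MQ=58.2\tGT\t0/1"
chrom, pos, features = vcf_line_to_features(record)
print(chrom, pos, features)
```

Each vector produced this way would be one row of the training matrix, labelled positive or negative according to the validation data.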
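SMOTE re-balances the classes by creating synthetic examples of the minority class: each new point is interpolated between a minority sample and one of its nearest minority neighbours. The following is a simplified pure-Python sketch of that idea, not the implementation used in the study. The name `smote_sketch`, the parameter `k` and the toy feature vectors are assumptions for illustration, and the input needs at least two minority samples.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points from a list of minority feature vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to partner
        synthetic.append([b + gap * (q - b) for b, q in zip(base, partner)])
    return synthetic


# Toy minority class (hypothetical two-feature vectors).
negatives = [[20.0, 5.0], [25.0, 7.0], [22.0, 4.0], [30.0, 6.0]]
new_points = smote_sketch(negatives, n_new=6)
balanced_negatives = negatives + new_points
print(len(balanced_negatives))
```

Because every synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the region already occupied by the real data.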
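Sensitivity, specificity, precision and accuracy all derive from the four confusion-matrix counts of a binary classifier. The sketch below shows the arithmetic; the counts are invented for illustration and are not the study's data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, precision and accuracy as fractions."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true SNPs that are recovered
        "specificity": tn / (tn + fp),  # share of false candidates that are rejected
        "precision": tp / (tp + fp),    # share of predicted SNPs that are real
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=90, fp=5, tn=15, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With few negative examples, a handful of false positives lowers specificity noticeably while precision stays high, as the toy counts illustrate.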

Leonardelli, L.; Cestaro, A.; Livi, C.M.; Romieu, C.; This, P.; Moser, C.; Blanzieri, E. (2013). VCF features to train SVM in grapevine SNP detection. In: BITS Annual Meeting 2013, May 21–23, Udine, Italy. url: http://bits2013.units.it/ handle: http://hdl.handle.net/10449/22276