Virtual genetic coding and time series analysis for alternative splicing prediction in C. elegans

CECCARELLI M;
2009

Abstract

Motivation: Prediction of alternative splicing has traditionally been based on the study of expressed sequences, aided by homology considerations and the analysis of local discriminative features. More recently, machine learning algorithms have been developed that try to avoid or reduce the use of a priori information, with partial success. Objective and method: With the aim of developing a fully automatic procedure for recognizing alternative splicing events based only on the genomic sequence, we first introduce a virtual genetic coding scheme to numerically model the information content of sequences in an effective way; we then use time series analysis to extract a fixed-length set of features from each sequence; finally, we adopt a supervised learning method, namely the support vector machine, to predict alternative splicing events. Results: On the basis of real C. elegans data, we show that within this purely numeric framework it is possible to obtain results better than the state of the art, without any explicit modeling of homology or of positions in the splice site, and without using any other local features. Conclusion: Virtual genetic coding together with time series analysis allows us to introduce an effective and powerful sequence coding scheme that may be useful in various areas of genomics and transcriptomics.
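
The first two steps named in the abstract (numeric coding of a sequence, then extraction of a fixed-length feature set by time series analysis) can be illustrated with a minimal sketch. The abstract does not specify the virtual genetic coding or the time series features actually used, so everything here is an assumption made for illustration: the nucleotide-to-number mapping `ASSUMED_CODE`, the choice of sample autocorrelations as features, and the function names `encode`, `autocorrelation_features` and `sequence_features`.

```python
from typing import List, Sequence

# Hypothetical numeric code for nucleotides. This is NOT the paper's
# virtual genetic coding, which the abstract does not describe.
ASSUMED_CODE = {"A": 1.0, "C": -1.0, "G": 2.0, "T": -2.0}


def encode(sequence: str) -> List[float]:
    """Map a DNA string to a numeric series; unknown symbols map to 0.0."""
    return [ASSUMED_CODE.get(base, 0.0) for base in sequence.upper()]


def autocorrelation_features(series: Sequence[float], max_lag: int) -> List[float]:
    """Return sample autocorrelations at lags 1..max_lag.

    The output always has max_lag entries, whatever the input length,
    which gives the fixed-length representation a classifier needs.
    """
    n = len(series)
    if n == 0:
        return [0.0] * max_lag
    mean = sum(series) / n
    centered = [x - mean for x in series]
    variance = sum(c * c for c in centered)
    features = []
    for lag in range(1, max_lag + 1):
        # Lags beyond the series length, or a constant series, carry no signal.
        if variance == 0.0 or lag >= n:
            features.append(0.0)
            continue
        cov = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        features.append(cov / variance)
    return features


def sequence_features(sequence: str, max_lag: int = 8) -> List[float]:
    """Encode a sequence and summarize it as a fixed-length feature vector."""
    return autocorrelation_features(encode(sequence), max_lag)
```

Feature vectors produced this way, one per labeled sequence, could then be passed to any support vector machine implementation (for example scikit-learn's `SVC`) for the supervised step; that step is omitted here to keep the sketch free of external dependencies.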
File in this record: Artificial Intelligence in Medicine 2009 Ceccarelli.pdf (Adobe PDF, 1.08 MB; not available; license: not specified)

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12070/4484