Virtual genetic coding and time series analysis for alternative splicing prediction in C. elegans
CECCARELLI M;
2009-01-01
Abstract
Motivation: Prediction of alternative splicing has traditionally been based on the study of expressed sequences, helped by homology considerations and the analysis of local discriminative features. More recently, machine learning algorithms have been developed that try to avoid or reduce the use of a priori information, with partial success.

Objective and method: With the aim of developing a fully automatic procedure for recognizing alternative splicing events based only on the genomic sequence, we first introduce a virtual genetic coding scheme to numerically model the information content of sequences in an effective way. We then use time series analysis to extract a fixed-length set of features from each sequence. Finally, we adopt a supervised learning method, namely the support vector machine, to predict alternative splicing events.

Results: On the basis of real C. elegans data, we show that it is possible within this purely numeric framework to obtain results better than the state of the art, without any explicit modeling of homology or of positions in the splice site, and without any use of other local features.

Conclusion: The virtual genetic coding together with time series analysis allows us to introduce an effective and powerful sequence coding scheme that may be useful in various areas of genomics and transcriptomics.

File | Size | Format |
---|---|---|
Artificial Intelligence in Medicine 2009 Ceccarelli.pdf (not available; license: not specified) | 1.08 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.