Improving Fuzzy Clustering of Biological Data by Metric Learning with Side Information

CECCARELLI M
2008

Abstract

Semi-supervised methods use a small amount of auxiliary information to guide learning in the presence of unlabeled data. For clustering algorithms, this auxiliary information takes the form of Side Information, that is, a list of co-clustered points. Recent literature shows that these methods outperform fully unsupervised ones even with a small amount of Side Information. This suggests that semi-supervised methods may be especially useful in difficult, noisy tasks where little a priori information is available, as is the case for data from biological experiments. The two most frequently used paradigms for incorporating Side Information into clustering are Constrained Clustering and Metric Learning. In this paper we use a Metric Learning approach to improve classical fuzzy c-means clustering through a two-step procedure: first, a series of metrics (one for each cluster) that satisfy a randomly generated set of constraints is learnt from the data; then a generalized version of fuzzy c-means, using the metrics computed in the previous step, is executed. We show the benefits and limitations of this method using real-world datasets and a modified version of the Partition Entropy index.
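
The second step of the procedure described in the abstract can be illustrated with a minimal sketch. It assumes the per-cluster metrics have already been learnt and are given as symmetric positive definite matrices, so that cluster k measures squared distance as (x - v_k)^T A_k (x - v_k). The metric-learning step itself is not shown. The function names, the fuzzifier default `m=2.0`, and the fixed iteration count are illustrative assumptions rather than details from the paper, and `partition_entropy` is the classical index, not the modified version the paper uses.

```python
import numpy as np

def fuzzy_c_means_with_metrics(X, centers, metrics, m=2.0, n_iter=50, eps=1e-12):
    """Fuzzy c-means in which cluster k uses its own metric matrix A_k.

    X: (n, d) data; centers: (c, d) initial centers;
    metrics: (c, d, d) symmetric positive definite matrices (assumed given).
    Returns memberships U of shape (c, n) and centers V of shape (c, d).
    """
    X = np.asarray(X, dtype=float)
    V = np.array(centers, dtype=float)
    A = np.asarray(metrics, dtype=float)
    for _ in range(n_iter):
        diff = X[None, :, :] - V[:, None, :]                    # (c, n, d)
        # squared distance of every point to every center under that cluster's metric
        d2 = np.einsum("cnd,cde,cne->cn", diff, A, diff) + eps
        # standard membership update: u_kn proportional to d2_kn^(-1/(m-1))
        w = d2 ** (-1.0 / (m - 1.0))
        U = w / w.sum(axis=0, keepdims=True)                    # each column sums to 1
        # with A_k fixed, the optimal center is the membership-weighted mean
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)
    return U, V

def partition_entropy(U, eps=1e-12):
    """Classical Partition Entropy: -(1/n) * sum_k sum_n u_kn * log(u_kn)."""
    n = U.shape[1]
    return float(-(U * np.log(U + eps)).sum() / n)
```

With identity matrices as metrics this reduces to ordinary Euclidean fuzzy c-means; lower Partition Entropy values indicate a crisper partition.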
File: International Journal of Approximate Reasoning 2008 Ceccarelli.pdf (Adobe PDF, 435.24 kB; license: not specified; not available)

Use this identifier to cite or link to this document: http://hdl.handle.net/20.500.12070/5155