CIF: Small: Info-Clustering: An Information-Theoretic Framework for Data Clustering
Clustering is a procedure that groups similar objects together while separating dissimilar ones. This simple idea has a wide range of applications across scientific research. From a mathematical viewpoint, the problem of clustering is unusual in that it attempts to discover unknown patterns in data without clear knowledge of the ground truth. Instead of jumping to a specific algorithm or dataset (a common practice in the literature), this research aims to lay a rigorous theoretical foundation upon which many meaningful and practical implementations can subsequently be developed. This research is accompanied by the investigator's continuing effort in curriculum development, involving undergraduate and graduate students in research, and broadening the participation of women and underrepresented minorities in engineering.

To achieve this goal, the investigator plans to take an information-theoretic view of the data clustering problem by modeling each object to be clustered as a piece of information. A key advantage of this view is that the similarity among multiple objects can be naturally measured by the amount of information they share. This is precisely where information theory, with over 70 years of accumulated research, can be most useful. The main aims of this research are to understand: 1) what clustering algorithms can be derived from the proposed info-clustering framework by leveraging the large body of literature on multivariate dependency modeling, including graphical models and parameter families; and 2) whether the proposed info-clustering framework can be leveraged to make progress on the long-standing open problem of subset feature selection in statistics and machine learning.