Despite rapid advances in data acquisition technology and the resulting explosion in collected datasets, techniques for organizing, classifying, manipulating, and analyzing very large, diverse, heterogeneous datasets have evolved only modestly. This gap hinders the effective use and understanding of large-scale data for knowledge discovery. In an industrial setting, an interesting visual from McKinsey illustrates that despite collecting data from tens of thousands of sensors, less than 1% is actually utilized.
Data clustering is the classification of data objects into groups (clusters) such that objects in one group are similar to one another and dissimilar from objects in other groups. Typically, homogeneous data objects, i.e. objects of the same data type, are grouped using one of the well-known clustering algorithms. However, many real-world clustering problems arising in data mining applications are pair-wise heterogeneous: they involve two data types that need to be clustered together.
For example, in a customer relationship management (CRM) application, it is desirable to co-cluster customers and purchased items to find the items of interest to a particular category of customers. Customized product promotion campaigns can then be targeted at the appropriate prospective customers. Collaborative information filtering applications such as movie recommender systems co-cluster the ratings provided by viewers and the movies those viewers have watched. When a new viewer rates a movie they liked, the system assigns that rating to a cluster of viewer ratings and watched movies, and recommends other movies from that cluster. In some biomedical applications, co-clustering is performed on the symptoms and medical diagnoses of patients in a database. Computer-aided diagnosis is then achieved for a new patient based on the symptoms provided.
From the above discussion, it is clear that the two pair-wise data types go “hand-in-hand”. In other words, the clustering of one data type induces the clustering of the other, and vice versa. Hence, applying conventional clustering algorithms separately to each data type cannot produce meaningful co-clustering results.
Typically, the data is stored in a contingency or co-occurrence matrix C, where the rows represent objects of one data type and the columns represent objects of the other. An entry Cij of the matrix signifies the relation between the object in row i and the object in column j. Co-clustering is the problem of deriving sub-matrices from the larger data matrix by simultaneously clustering its rows and columns. Names such as bi-clustering, bi-dimensional clustering, and block clustering, among others, are often used in the literature to refer to the same problem formulation.
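To make the matrix view concrete, here is a minimal sketch in Python that builds a co-occurrence matrix for the CRM example. The purchase records, the names `build_cooccurrence` and `submatrix`, and the choice of summed quantity as the "relation" are all illustrative assumptions, not part of any particular product or dataset.

```python
# Hypothetical purchase log: (customer, item, quantity) triples.
purchases = [
    ("alice", "tent", 1),
    ("alice", "stove", 2),
    ("bob", "tent", 1),
    ("carol", "novel", 3),
    ("carol", "lamp", 1),
    ("bob", "stove", 1),
    ("alice", "tent", 1),
]


def build_cooccurrence(records):
    """Return (row_names, col_names, C), where C[i][j] is the summed
    quantity recorded for row object i and column object j."""
    row_names = sorted({r for r, _, _ in records})
    col_names = sorted({c for _, c, _ in records})
    row_index = {r: i for i, r in enumerate(row_names)}
    col_index = {c: j for j, c in enumerate(col_names)}
    C = [[0] * len(col_names) for _ in row_names]
    for r, c, quantity in records:
        C[row_index[r]][col_index[c]] += quantity
    return row_names, col_names, C


def submatrix(C, row_ids, col_ids):
    """Extract the block of C selected by a row cluster and a column cluster."""
    return [[C[i][j] for j in col_ids] for i in row_ids]


rows, cols, C = build_cooccurrence(purchases)
for name, row in zip(rows, C):
    print(name, row)

# A co-cluster is a (row set, column set) pair; here customers 0 and 1
# paired with items 2 and 3 form a dense block.
print(submatrix(C, [0, 1], [2, 3]))
```

In this toy data the block for the first two customers and the last two items is dense while the rest of their rows is zero, which is the kind of sub-matrix a co-clustering method is meant to find automatically.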
One technique for achieving co-clustering is to approach the problem from a graph-theoretic point of view. That is, we model the relationship between the two data types using a weighted bipartite graph. The two data types correspond to the two kinds of vertices in the graph, and each edge connects a vertex of one kind to a vertex of the other, typically weighted by the corresponding entry of the matrix C. Data co-clustering is then achieved by partitioning the bipartite graph.
The square and circular vertices (m and r, respectively) denote the two data types in the co-clustering problem that are represented by the bipartite graph. Partitioning this bipartite graph leads to co-clustering of the two data types.
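As one possible illustration of partitioning the bipartite graph, the sketch below uses a spectral approach in the spirit of Dhillon's bipartite spectral graph partitioning (2001). This is an assumption about which partitioning method to show, not necessarily the method intended here. It treats the matrix entries as edge weights, normalizes them by vertex degree, finds the second singular vector by power iteration, and splits both vertex sets by sign. The function name `spectral_bipartition`, the example matrix, and the sign threshold (used in place of a k-means step) are illustrative choices.

```python
import math


def spectral_bipartition(A, iterations=200):
    """Split the rows and columns of a non-negative matrix A into two co-clusters.

    Assumes every row and column has a positive sum and that A has rank
    of at least 2. Returns (row_labels, col_labels) with entries 0 or 1.
    """
    m, n = len(A), len(A[0])
    d1 = [sum(row) for row in A]                             # row-vertex degrees
    d2 = [sum(A[i][j] for i in range(m)) for j in range(n)]  # column-vertex degrees

    # Degree-normalized edge weights: An = D1^(-1/2) A D2^(-1/2).
    An = [[A[i][j] / math.sqrt(d1[i] * d2[j]) for j in range(n)]
          for i in range(m)]

    # The top right singular vector of An is known in closed form
    # (proportional to sqrt(d2)) and carries no partition information.
    total = sum(d2)
    v1 = [math.sqrt(d / total) for d in d2]

    def deflate_and_normalize(v):
        dot = sum(a * b for a, b in zip(v, v1))
        v = [a - dot * b for a, b in zip(v, v1)]
        norm = math.sqrt(sum(a * a for a in v))
        return [a / norm for a in v]

    # Power iteration on An^T An, kept orthogonal to v1, converges to the
    # second right singular vector (deterministic start for repeatability).
    v = deflate_and_normalize([float(j + 1) for j in range(n)])
    for _ in range(iterations):
        u = [sum(An[i][j] * v[j] for j in range(n)) for i in range(m)]
        w = [sum(An[i][j] * u[i] for i in range(m)) for j in range(n)]
        v = deflate_and_normalize(w)
    u = [sum(An[i][j] * v[j] for j in range(n)) for i in range(m)]

    # Rescale by D^(-1/2) and split both vertex sets at zero.
    row_labels = [1 if u[i] / math.sqrt(d1[i]) >= 0 else 0 for i in range(m)]
    col_labels = [1 if v[j] / math.sqrt(d2[j]) >= 0 else 0 for j in range(n)]
    return row_labels, col_labels


# Hypothetical 4x4 co-occurrence matrix with two dense blocks and two weak cross links.
A = [
    [5, 4, 0, 0],
    [4, 5, 1, 0],
    [0, 0, 4, 5],
    [0, 1, 5, 4],
]
row_labels, col_labels = spectral_bipartition(A)
print(row_labels, col_labels)
```

Rows and columns that receive the same label form one co-cluster. Which block is called 0 and which is called 1 is arbitrary, so only the grouping should be read from the output. Splitting into more than two co-clusters would need additional singular vectors and a proper clustering step, which this sketch leaves out.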
I would welcome any conversation on application development to provide stronger insights for a variety of industries. We can move rapidly into Industry 4.0 by combining subject matter expertise, data collection methods and next-generation data science tools, beyond many of the "me too" products.