Efficient Clustering with SLK
In AI, making machines learn from examples without supervision is called unsupervised learning. Cluster analysis is a very common unsupervised learning method that aims to group similar instances together from a pool of unlabeled instances. At large scale, clustering methods can suffer both in the quality of their solutions and in scalability. We propose an efficient clustering method, Scalable Laplacian K-modes (SLK), which combines the centroid-based and pairwise clustering objectives in a single formulation and optimizes it efficiently with a distributed algorithm, while outperforming most existing methods on clustering tasks.
Organizing Huge Amounts of Data
The advance of data acquisition technologies through the Internet, digital media, social networks, video surveillance and growing business industries has produced massive amounts of high-dimensional data. Making sense of these huge volumes of data is crucial to building useful models for the related applications, so the need for automatic data analysis is obvious. Clustering algorithms appear extensively in the literature as a tool for analyzing data by extracting similar patterns from the data distribution in an unsupervised manner. These techniques have been used in a wide range of data analysis applications, such as business data analysis, market analysis, social network analysis and many other applications related to AI. However, organizing a large amount of high-dimensional data into meaningful clusters is very challenging due to scalability and efficiency issues. Consider, for example, popular photo-sharing platforms such as Facebook or Instagram, which must handle millions of uploaded photos per day.
Numerous clustering algorithms have been proposed in the literature. However, a clustering algorithm that succeeds in one application may not necessarily succeed in other, related applications. We propose Scalable Laplacian K-modes, a clustering method that combines two well-known clustering models in a single objective and optimizes it in a smart, distributed way, yielding faster and better clustering solutions than popular existing methods. This work was accepted as a spotlight paper at Neural Information Processing Systems (NeurIPS), arguably the top venue in AI and machine learning.
Scalable Laplacian K-modes (SLK) Clustering
As the name suggests, Laplacian K-modes combines two models in a single objective: K-modes and the well-known Laplacian regularizer. K-modes finds the mode of each cluster, as in the very popular mean-shift method; the mode is, roughly, the most representative sample, corresponding to the peak of the cluster's density. The Laplacian term brings consistency within clusters by encouraging nearby data samples to take the same cluster assignment, and thereby helps to find manifold-structured clusters, such as the spiral clusters shown in the following image.
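For concreteness, the combined objective can be sketched as follows. The notation here is a paraphrase, not copied from the paper: z_p is the soft assignment of point x_p over K clusters, m_k is the mode of cluster k, k(·,·) is a kernel, and w_pq is the affinity between points p and q:

```latex
\min_{\{\mathbf{z}_p \in \Delta_K\},\ \{\mathbf{m}_k\}}
\; -\sum_{p=1}^{N} \sum_{k=1}^{K} z_{pk}\, k(\mathbf{x}_p, \mathbf{m}_k)
\;+\; \frac{\lambda}{2} \sum_{p,q} w_{pq}\, \lVert \mathbf{z}_p - \mathbf{z}_q \rVert^{2}
```

The first term is the K-modes part (each point should be similar to its cluster's mode under the kernel), and the second is the Laplacian part (points with high affinity w_pq should receive similar assignments), with λ balancing the two.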
Although the model is promising, it poses a very challenging optimization problem due to the discrete assignment constraints and the Laplacian term. We propose a concave-convex relaxation of the formulation and an iterative bound-optimization method that updates the cluster assignments independently, and hence in parallel, at each iteration. Our solution also provides the representative mode of each cluster as a byproduct, in linear time; we call this variant SLK-BO. This makes the algorithm faster than the variant SLK-MS, which finds the modes with an iterative mean-shift-like procedure.
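The flavor of this scheme can be conveyed with a small sketch. The code below is an illustration of the idea, not the authors' released implementation: every row of the assignment matrix is updated independently from a closed-form softmax-like expression (which is what makes the iteration parallelizable), and each mode is then picked as the data point with the highest within-cluster kernel density, in the spirit of SLK-BO. All function and variable names are my own.

```python
import numpy as np

def slk_sketch(X, K, W, lam=1.0, sigma=1.0, iters=20):
    """Illustrative sketch of an SLK-style bound optimization.

    X    : (N, D) data matrix
    K    : number of clusters
    W    : (N, N) symmetric affinity matrix (zero diagonal)
    lam  : weight of the Laplacian smoothing term
    sigma: Gaussian kernel bandwidth
    """
    N = X.shape[0]
    # deterministic farthest-point initialization of the K modes
    idx = [0]
    for _ in range(1, K):
        d2m = ((X[:, None, :] - X[idx][None, :, :]) ** 2).sum(-1).min(axis=1)
        idx.append(int(d2m.argmax()))
    modes = X[idx]
    Z = np.full((N, K), 1.0 / K)  # uniform initial soft assignments
    for _ in range(iters):
        # Gaussian kernel similarity of every point to every mode
        d2 = ((X[:, None, :] - modes[None, :, :]) ** 2).sum(-1)
        S = np.exp(-d2 / (2 * sigma ** 2))
        # closed-form update: softmax of mode similarity plus
        # affinity-weighted neighbor assignments (Laplacian smoothing);
        # each row is independent, so all N updates can run in parallel
        logits = S + lam * (W @ Z)
        logits -= logits.max(axis=1, keepdims=True)
        Z = np.exp(logits)
        Z /= Z.sum(axis=1, keepdims=True)
        # SLK-BO-style mode: the data point maximizing the
        # assignment-weighted kernel density within each cluster
        modes = X[(S * Z).argmax(axis=0)]
    return Z.argmax(axis=1), modes
```

Because each row of Z is updated independently given the previous iterate, the updates can be distributed across workers, which is the property the paper exploits for scalability; likewise, the mode is read off directly from the data rather than found by an inner iterative search.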
We report comprehensive experiments on various datasets, mostly containing images or numeric data, using Normalized Mutual Information (NMI) and clustering accuracy (ACC) as performance measures; higher values mean better performance. We also report the time, in seconds, taken to organize the data; lower times mean better efficiency.
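Both measures are standard and easy to compute. As a sketch (using scikit-learn and SciPy, not the paper's evaluation scripts), NMI comes directly from scikit-learn, while ACC requires finding the best one-to-one mapping between predicted cluster indices and ground-truth labels, typically with the Hungarian algorithm:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: fraction of points correctly labeled under the best
    one-to-one matching of predicted clusters to true classes,
    found with the Hungarian algorithm."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    D = int(max(y_pred.max(), y_true.max())) + 1
    # contingency matrix: counts[p, t] = points in cluster p with label t
    counts = np.zeros((D, D), dtype=int)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1
    # Hungarian algorithm minimizes cost, so negate the counts
    row, col = linear_sum_assignment(counts.max() - counts)
    return counts[row, col].sum() / y_true.size
```

For example, a perfect clustering with permuted cluster indices, such as `y_true = [0, 0, 1, 1]` against `y_pred = [1, 1, 0, 0]`, scores 1.0 on both ACC and NMI, since both measures are invariant to how the clusters are numbered.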
The table shows that SLK-MS and SLK-BO offer much better accuracy and timing than competing methods. For example, on the MNIST (code) dataset, 95% of the images were automatically organized correctly according to their similarity, in a fully unsupervised way. In terms of timing, SLK is much faster than the other methods, particularly on high-dimensional datasets such as LabelMe (GIST) and YTF, and it scales to the large Reuters dataset.
Additional Information
More technical details about SLK clustering can be found in the NeurIPS paper: https://papers.nips.cc/paper/8208-scalable-laplacian-k-modes.pdf, and the Python code is available at: https://github.com/imtiazziko/SLK.