

Efficient Clustering with SLK

Machine learning


SUMMARY

In AI, making machines learn from examples without any supervision is called unsupervised learning. Cluster analysis is a very common unsupervised learning method that aims to gather similar instances from a pool of unlabeled instances into the same group. At large scale, clustering methods can suffer in both solution quality and scalability. We propose an efficient clustering method, Scalable Laplacian K-modes (SLK), which combines the centroid and pairwise clustering objectives in a single formulation and optimizes it efficiently with a fast, distributed algorithm, while outperforming most existing methods on clustering tasks.

Organizing Huge Amounts of Data

The advancement of data acquisition technologies through the Internet, digital media, social networks, video surveillance and growing business industries has produced massive amounts of high-dimensional data. Handling these huge volumes of data is crucial to building a resourceful model for a given application; thus, the need for automatic data analysis is obvious. Clustering algorithms appear extensively in the literature as tools to analyze data by extracting similar patterns from the data distribution in an unsupervised manner. These techniques have been used in a wide range of data analysis applications such as business data analysis, market analysis, social network analysis and many other AI-related applications. However, organizing a large amount of high-dimensional data into meaningful clusters is very challenging due to scalability and efficiency issues. Consider, for example, popular photo-sharing platforms like Facebook or Instagram, which must handle millions of uploaded photos per day.

Clustering algorithms to classify images

Images classified according to similar characteristics

Numerous clustering algorithms have been proposed in the literature. However, a clustering algorithm that succeeds in one application may not necessarily succeed in other, related applications. We propose Scalable Laplacian K-modes, a clustering method that combines two well-known clustering models in a single objective and provides an efficient way of optimizing it, yielding distributed, faster and better clustering solutions than popular existing methods. This research work was accepted as a spotlight paper at Neural Information Processing Systems (NeurIPS), arguably the top venue in AI and machine learning.

Scalable Laplacian K-modes (SLK) Clustering

Scalable Laplacian K-modes

As its name suggests, Laplacian K-modes combines two models in a single objective: K-modes and the well-known Laplacian regularizer. K-modes finds the mode of each cluster, as in the very popular mean-shift method; the term mode roughly means the most representative sample, corresponding to the peak of a distribution. The Laplacian term brings consistency to clusters by encouraging nearby data samples to take the same cluster assignment, and thereby helps find manifold-structured clusters such as the spiral clusters shown in the following image.

Spiral clusters
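To make the notion of a mode concrete, here is a minimal, illustrative sketch (not the paper's algorithm): within one cluster, the mode can be taken as the sample with the highest kernel density estimate, i.e. the point sitting at the peak of the within-cluster distribution. The Gaussian kernel and the bandwidth value are assumptions for illustration.

```python
import numpy as np

def cluster_mode(X, bandwidth=1.0):
    """Return the sample with the highest kernel density estimate:
    a simple stand-in for the 'mode' of a cluster, i.e. the most
    representative point at the peak of the distribution."""
    # Pairwise squared distances between all samples in the cluster
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # Gaussian-kernel density at each sample
    density = np.exp(-d2 / (2.0 * bandwidth ** 2)).sum(axis=1)
    return X[np.argmax(density)]

rng = np.random.default_rng(0)
cluster = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(200, 2))
print(cluster_mode(cluster))  # a sample close to the center (5, 5)
```

Unlike a K-means centroid (an average that may lie far from any real sample), the mode is always an actual data point, which is what makes it a "representative" of the cluster.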

 

Although the model is promising, its optimization is very challenging due to the discrete constraints and the Laplacian term. We propose a concave-convex relaxation of the formulation and an iterative bound-optimization method that solves the clustering problem in parallel at each iteration. Our proposed solution also provides the representative mode of each cluster as a byproduct, in linear time complexity, in the variant we call SLK-BO. This makes the algorithm faster than the variant SLK-MS, which finds each mode with an iterative algorithm.
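The key property of bound optimization here is that each point's update depends only on quantities fixed from the previous iterate, so all points can be updated independently. The sketch below illustrates that structure only; the kernel affinity, the softmax form and the parameter names (`lam`, `bandwidth`) are assumptions for illustration, not the authors' exact derivation.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slk_style_iteration(X, modes, Z, W, lam=1.0, bandwidth=1.0):
    """One illustrative bound-optimization step.

    X:     (n, d) data points
    modes: (K, d) current cluster modes
    Z:     (n, K) soft assignments from the previous iteration
    W:     (n, n) affinity matrix of the Laplacian term

    Every row of the returned assignment matrix is computed from
    fixed inputs, so the n updates are embarrassingly parallel.
    """
    # K-modes term: affinity of each point to each current mode
    d2 = ((X[:, None, :] - modes[None, :, :]) ** 2).sum(-1)
    kernel = np.exp(-d2 / (2.0 * bandwidth ** 2))
    # Laplacian term: agreement with neighbours' previous assignments
    unary = kernel + lam * (W @ Z)
    return softmax(unary, axis=1)  # independent per row
```

Because each row update is closed-form and independent, the work can be sharded across cores or machines, which is what makes the method scale to large datasets.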

We report comprehensive experiments on various datasets, mostly containing image or numeric data. We report Normalized Mutual Information (NMI) and clustering accuracy (ACC) as performance measures; higher values mean better performance. We also measure the time, in seconds, taken to organize the data; lower times mean better efficiency.
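Both measures are standard in the clustering literature and can be computed as follows (a sketch assuming scikit-learn and SciPy are available). Since cluster indices are arbitrary, ACC first finds the best one-to-one matching between predicted clusters and true labels with the Hungarian algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one match between predicted clusters and
    true labels (Hungarian algorithm), then the fraction of
    correctly assigned samples under that matching."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    # Contingency table: counts of (true label, predicted cluster) pairs
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1
    rows, cols = linear_sum_assignment(-cost)  # maximize total matches
    return cost[rows, cols].sum() / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]  # cluster ids permuted, but grouping is perfect
print(clustering_accuracy(y_true, y_pred))           # 1.0
print(normalized_mutual_info_score(y_true, y_pred))  # 1.0
```

Note that a perfect grouping scores 1.0 on both measures even when the cluster indices are shuffled, which is exactly the invariance an unsupervised evaluation needs.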

Clustering results for different databanks

Table: Clustering results as NMI/ACC in the upper half and average elapsed time in seconds (s) in the lower half.

The Table shows that the SLK-MS/SLK-BO methods offer much better accuracy and timing. For example, on the MNIST (code) dataset, 95% of the images were automatically organized correctly according to their similarity, in a fully unsupervised way. In terms of timing, SLK is much faster than the other methods, particularly for high-dimensional datasets such as LabelMe (GIST) and YTF, and it scales to the large Reuters dataset.

Additional Information

More technical details about SLK clustering can be found in the NeurIPS paper: https://papers.nips.cc/paper/8208-scalable-laplacian-k-modes.pdf. The Python code is available at: https://github.com/imtiazziko/SLK.