The k-means cost function is one of the most widely used objectives for clustering. Lloyd's algorithm, a local search heuristic for k-means that has been known for more than sixty years and comes with neither quality nor runtime guarantees, may well be the most widely used clustering algorithm of all.
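For concreteness, here is a minimal Python sketch (using NumPy) of the k-means cost function and of plain Lloyd iterations. The function names and the simple array-based implementation are our own illustration and are not taken from any of the tools mentioned on this page.

```python
import numpy as np

def kmeans_cost(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    # Pairwise squared distances: one row per point, one column per center.
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def lloyd(points, centers, iterations=100):
    """Plain Lloyd iterations: assign each point to its nearest center,
    then move every center to the mean of its assigned points."""
    centers = np.asarray(centers, dtype=float)
    for _ in range(iterations):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break  # local optimum reached, no center moved
        centers = new_centers
    return centers
```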
Lately, research in the theory community has begun to thoroughly analyze the properties of Lloyd's algorithm and of the k-means problem. It is now known that Lloyd's algorithm can have exponential running time but polynomial smoothed complexity, and that it clusters well if the data satisfies certain conditions. For the k-means problem itself, NP-hardness and APX-hardness have been established, and constant-factor approximation algorithms based on primal-dual methods and on local search are known. For more restricted settings, where either the number of centers or the dimension of the input points is a constant, polynomial-time approximation schemes have been found.
Another direction of theoretical research focuses on bridging the gap between practical applications (mostly Lloyd's algorithm) and theoretical results (constant-factor approximations or PTASs). The famous k-means++ algorithm is now the de facto standard for practical applications: it produces high-quality clusterings in practice, comes with an expected approximation guarantee of O(log k) in theory, and is nearly as fast as Lloyd's algorithm.
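The seeding step of k-means++ (so-called D² sampling) can be sketched as follows; again, this is our own simplified illustration of the sampling idea, not code from any of the implementations discussed here.

```python
import numpy as np

def kmeanspp_seeding(points, k, rng=None):
    """D^2 sampling: each new center is drawn with probability proportional
    to the squared distance from the centers chosen so far."""
    if rng is None:
        rng = np.random.default_rng()
    centers = [points[rng.integers(len(points))]]  # first center uniformly at random
    for _ in range(k - 1):
        d2 = np.min(
            ((points[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2),
            axis=1,
        )
        probs = d2 / d2.sum()
        centers.append(points[rng.choice(len(points), p=probs)])
    return np.array(centers)
```

In practice the sampled centers are then refined with Lloyd iterations; the expected O(log k) guarantee already holds for the seeding alone.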
Further challenges lie, for example, in the development of even faster algorithms with theoretical guarantees, in particular data stream algorithms. BICO is a very fast algorithm that computes so-called coresets for the k-means problem. A coreset is a small weighted summary of the data whose k-means cost is approximately the same as that of the full data for every possible solution, i.e., for every possible set of k centers. Such a coreset algorithm can be combined with any algorithm for the k-means problem to obtain a solution much faster. BICO computes coresets of size O(k log n) as long as the dimension of the input points is a constant; in practical applications, BICO computed high-quality summaries of size 200k for data sets with dimension up to 128.
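To illustrate how such a summary is used, here is a sketch of the weighted k-means cost that a solver would minimize on the coreset instead of on the full data. This is only meant to illustrate the coreset property and does not reflect BICO's actual interface.

```python
import numpy as np

def weighted_kmeans_cost(coreset_points, coreset_weights, centers):
    """k-means cost of a center set, evaluated on a weighted summary:
    each summary point contributes its weight times the squared distance
    to its nearest center."""
    d2 = ((coreset_points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return (coreset_weights * d2.min(axis=1)).sum()
```

Any k-means algorithm that can handle point weights can then be run on the small summary; Lloyd's algorithm and k-means++ both extend to weighted inputs by using weighted means and weighted D² sampling, and the cost of the resulting centers on the coreset approximates their cost on the original data.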
Here we collect some of the data sets used in our experiments.
In our experiments with BICO, we used several data sets from the UCI Machine Learning Repository. One of these data sets is Covertype, available here.
We also used two additional large data sets, Caltech128 and BigCross; both are available below.
The data set BigCross is built from the Cartesian product of Covertype and Tower (see here). It was introduced by the authors of StreamKM++.