K-means Clustering

K-means clustering is an iterative partitioning algorithm that assigns each data point to the nearest of k cluster centroids, then recomputes the centroids as the mean of their assigned points, repeating until convergence. It is the default clustering algorithm — taught first, implemented everywhere, and misunderstood most often. Its popularity is a function of its computational simplicity, not its descriptive validity.

The algorithm assumes clusters are spherical, isotropic, and of roughly equal variance — an assumption that is almost never true in real data. K-means minimizes within-cluster sum of squares, which is equivalent to assuming a Gaussian mixture with equal diagonal covariance matrices. When the true clusters are elongated, overlapping, or variably dense, k-means produces partitions that are geometrically convenient and substantively wrong. The algorithm does not fail gracefully; it fails confidently, returning a result that looks clean because Voronoi cells are always convex and disjoint.

The choice of k is the algorithm's Achilles heel. In practice, k is selected by elbow methods, silhouette scores, or gap statistics — each of which embeds its own assumptions about what a good clustering looks like. The result is a nested circularity: the algorithm finds what it is told to find, and the validation metric confirms that it found it. K-means is not a discovery procedure; it is a geometry-imposition procedure that dresses itself as one.

K-means is the taxidermy of clustering — it stuffs every dataset into the same spherical shape and calls the result natural structure. The fact that it works as well as it does is not evidence that the world is spherical; it is evidence that many problems are robust enough to tolerate bad assumptions.