I want to provide a high-level view of some statistical methods and data structures we use in our work. These are meant to be living notes and may contain stubs or incomplete information. There are plenty of exhaustive resources, but I highly recommend Matteo Matteucci’s work.
K-means (or k-medians) is an exclusive model: each observation is associated with exactly one cluster. The number of clusters cannot be determined analytically; instead it is chosen based on domain knowledge and refined iteratively according to the Schwarz Criterion. The model is sensitive to the initial cluster positions, so the choice matters, and Monte Carlo replication (multiple random restarts) should be used; otherwise it tends to get stuck in local minima of the error function. The steps are:
- distribute initial cluster centroids uniformly in the multivariate space
- assign each observation to its nearest centroid
- recalculate each centroid from its assigned observations
- repeat until the centroids are stationary
This process minimizes the squared error function J = Σⱼ Σ_{xᵢ ∈ Sⱼ} ‖xᵢ − cⱼ‖², the sum of squared distances from each observation xᵢ to the centroid cⱼ of its cluster Sⱼ.
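The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation; the function name `kmeans` and its parameters (`n_restarts` for the Monte Carlo replication, `max_iter`) are my own:

```python
import numpy as np

def kmeans(X, k, n_restarts=5, max_iter=100, seed=0):
    """Plain k-means with random restarts to avoid poor local minima.

    X: (n, d) array of observations; k: number of clusters.
    Returns (centroids, labels) from the restart with the lowest squared error.
    """
    rng = np.random.default_rng(seed)
    best = (np.inf, None, None)
    for _ in range(n_restarts):
        # initialise centroids on randomly chosen observations
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # assign each observation to its nearest centroid
            dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
            labels = dists.argmin(axis=1)
            # recompute each centroid from its assigned observations
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centroids[j] for j in range(k)])
            if np.allclose(new, centroids):  # stationary -> converged
                break
            centroids = new
        sse = ((X - centroids[labels]) ** 2).sum()  # squared error J
        if sse < best[0]:
            best = (sse, centroids, labels)
    return best[1], best[2]
```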
Hierarchical models view clustering as a binary tree of most-similar observations. This can be built top-down (divisive) or bottom-up (agglomerative).
- compute the (half, i.e. triangular) distance matrix between observations
- merge the two most similar clusters into a pair cluster
- repeat until a single top-level cluster remains
The final product is a binary tree implemented as a hash map. The distance function is a domain-specific comparison that logically connects clusters by their most similar members (single linkage), least similar members (complete linkage), or average (mean linkage).
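The bottom-up procedure can be sketched directly as a naive O(n³) version. The function name `agglomerate` and its return shape (a list of merges, each a pair of cluster ids plus the merge distance) are illustrative choices, not a standard API:

```python
import numpy as np

def agglomerate(X, linkage="single"):
    """Bottom-up clustering sketch: repeatedly merge the two most
    similar clusters until one remains.

    `linkage` picks the cluster-to-cluster comparison: single (min),
    complete (max), or mean.  Returns [((left_id, right_id), distance)].
    """
    reducer = {"single": min, "complete": max,
               "mean": lambda ds: sum(ds) / len(ds)}[linkage]
    # start with each observation as its own cluster of indices,
    # keyed by cluster id in a hash map
    clusters = {i: [i] for i in range(len(X))}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = list(clusters)
        for a_i, a in enumerate(keys):
            for b in keys[a_i + 1:]:
                # linkage distance between the two clusters
                d = reducer([np.linalg.norm(X[i] - X[j])
                             for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append(((a, b), d))
        # the merged pair becomes a new cluster with a fresh id
        clusters[len(X) + len(merges) - 1] = clusters.pop(a) + clusters.pop(b)
    return merges
```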
Fuzzy clustering is an overlapping model, in which each observation has a degree of membership in every cluster.
A common approach is fuzzy c-means (FCM), which extends k-means and computes each centroid as a membership-weighted mean of the observations.
Model-based methods are probabilistic: each cluster is represented by a component distribution (commonly Gaussian or Poisson), and the data are modeled as a mixture of these components.
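As one concrete instance, expectation-maximization for a mixture of spherical Gaussians can be sketched as below. This is a hypothetical minimal version (`gmm_em` and its parameters are my own names), not the only model-based approach:

```python
import numpy as np

def gmm_em(X, k, n_iter=50, seed=0):
    """EM sketch for a spherical Gaussian mixture: each component has a
    mean, a scalar variance, and a mixing weight.
    Returns (means, variances, weights, responsibilities)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, size=k, replace=False)]
    var = np.full(k, X.var())
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation
        sq = ((X[:, None] - means[None]) ** 2).sum(axis=2)
        logp = -0.5 * sq / var - 0.5 * d * np.log(2 * np.pi * var) + np.log(w)
        logp -= logp.max(axis=1, keepdims=True)  # stabilise before exp
        R = np.exp(logp)
        R /= R.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the soft assignments
        Nk = R.sum(axis=0)
        means = (R.T @ X) / Nk[:, None]
        sq = ((X[:, None] - means[None]) ** 2).sum(axis=2)
        var = (R * sq).sum(axis=0) / (Nk * d) + 1e-9
        w = Nk / n
    return means, var, w, R
```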
These are some of the canonical references in the field:
- MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability (pp. 281–297). Berkeley: University of California Press.
- Johnson, S. C. (1967). Hierarchical Clustering Schemes. Psychometrika, 32(3).
- Day, W. H. E., & Edelsbrunner, H. (1985). Efficient algorithms for agglomerative hierarchical clustering methods. Journal of Classification, 2, 239–254.