|
|
|
|
|
|
|
So far we have discussed how to partition instances into discrete groups with k-means and Gaussian mixtures; this way of grouping the samples is called “flat clustering”. Another strategy we can follow is hierarchical clustering, where we create “trees”. In flat clustering, all clusters live on the same “plane”: a sample belongs to only one cluster (with different degrees of certainty, depending on the model details). In hierarchical clustering (imagine decision trees), we create groups that are nested inside each other. The tree can be built top-down (like a decision tree), which is called divisive clustering, or bottom-up (an inverted decision tree), which is called agglomerative clustering. The tree-building process does not optimize an objective function; samples are merged or split into groups by comparison. Therefore, it will always cluster the samples, even if there is no hidden structure in the observations (white noise).
|
|
|
|
|
|
|
|
|
|
|
|
In agglomerative clustering, we start with an over-fitted tree where each leaf contains a single sample. At each iteration, we merge the two most similar groups into a sub-group (this takes $`O(N^2)`$ time, since we compare all pairs). Iterations continue until we reach a single group (in total $`O(N^3)`$ time). With a smarter data structure (a priority queue), the running time can be reduced to $`O(N^2 \log N)`$. Once the tree is created, it can be cut at any desired level, yielding K clusters (analogous to pruning).
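As a concrete illustration of the build-then-cut idea, here is a minimal sketch using scikit-learn; the library choice, the synthetic data, and the parameter values are assumptions of mine, not part of the original notes.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Toy data with some group structure (4 blobs in 2D).
X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

# Build the tree bottom-up and stop when K = 4 groups remain;
# "ward" merges the pair of groups that least increases within-cluster variance.
model = AgglomerativeClustering(n_clusters=4, linkage="ward")
labels = model.fit_predict(X)

print(np.bincount(labels))  # how many samples ended up in each cluster
```

Setting `n_clusters` plays the role of choosing where to cut the tree, which mirrors the pruning analogy above.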
|
|
|
|
|
|
|
|
In order to group the observations, we need a measure of distance. Once that is chosen, we also need to decide how to compare the similarity between groups of observations (a sketch below illustrates how this choice affects the result). There are three common ways of making that comparison:
|
|
|
|
|
...
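To make the idea tangible, here is a hedged sketch of how the choice of between-group similarity (the linkage criterion) changes the clustering; the SciPy methods shown ("single", "complete", "average") are standard options picked for illustration and are not necessarily the exact three ways referred to above.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# The linkage method defines the distance between two groups of points,
# which decides which pair of groups gets merged next.
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # build the full merge tree
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut it into 3 clusters
    print(method, "->", len(set(labels)), "clusters")
```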
|
|
|
|
|
|
|
### Density-based clustering
|
|
|
|
|
|
|
|
...
|
|
|
|
An alternative way of clustering is to treat the data space as continuous and to identify the regions where the observations are denser. Here we inherently assume that the clusters of interest create "dense" regions, which are separated from one another by "less dense" space. One battle-tested implementation is [density-based spatial clustering of applications with noise](https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf) (DBSCAN), cited over 22600 times in the literature. The approach performs very well on arbitrarily shaped clusters, can handle noise and outliers, and does not require the number of clusters as an input. The model even gathers the outliers among the given observations into a cluster of their own. The algorithm only needs two inputs (a short sketch follows the list below):
|
|
|
|
|
|
|
|
1. the minimum number of points required to call a region dense,
|
|
|
|
2. the distance within which a point is considered a neighbor of another point.
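Here is a minimal usage sketch with scikit-learn's implementation; the dataset and the `eps` and `min_samples` values are illustrative choices of mine (`min_samples` corresponds to the first input, `eps` to the second).

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters plus a little noise: hard for k-means,
# easy for a density-based method.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# min_samples -> minimum number of points to call a region dense
# eps         -> neighborhood radius used to look for those points
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print(sorted(set(labels)))  # cluster ids; -1 marks the outlier "cluster"
```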
|
|
|
|
|
|
|
|
With these rules, we can identify the "core" points (analogous to the dynamic stereotypes) and the neighboring points around them; the rest is treated as outliers. The algorithm first samples a point at random from the dataset and searches locally for its neighbors. If the number of points found satisfies the first criterion, we then search for the neighbors of those neighbors, growing the cluster in the meantime. In other words, it is as if we are dropped into a data world made of islands waiting to be discovered. If we are lucky enough to land on an island, we start exploring it and figure out its extent. If we are dropped into the ocean with nothing to hold on to, we trigger the "outlier" emergency protocol and get beamed to a new spot. By doing so, we travel from one island to another until the whole sample space has been visited.
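To make the island-hopping procedure concrete, here is a minimal, unoptimized sketch of that expansion logic; the function and parameter names and the brute-force neighbor search are my own simplifications, not the original authors' implementation.

```python
import numpy as np

def dbscan_sketch(X, eps, min_pts):
    """Toy density-based clustering; label -1 marks outliers."""
    n = len(X)
    labels = np.full(n, -1)            # everything starts as "noise"
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0

    def neighbors(i):
        # indices of all points within eps of point i (brute-force local search)
        return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)

    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        seed = neighbors(i)
        if len(seed) < min_pts:
            continue                   # dropped into the ocean: stays noise for now
        labels[i] = cluster_id         # landed on an island: start a new cluster
        queue = list(seed)
        while queue:                   # explore the island
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id
            if not visited[j]:
                visited[j] = True
                j_neighbors = neighbors(j)
                if len(j_neighbors) >= min_pts:   # j is a core point too
                    queue.extend(j_neighbors)     # keep growing the cluster
        cluster_id += 1
    return labels
```

It expects `X` as a NumPy array; on the moons data above, `dbscan_sketch(X, eps=0.2, min_pts=5)` should give essentially the same grouping as scikit-learn's implementation.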
|
|
|
|
|
|
|
|
If you want to learn more about the model, you can find the details in the [original study](https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf) and the [follow-up work]().
|
|
|
|
|
|
|
|
### Concept of similarity in machine learning
|
|
|
|
|
|
|
|
...
|
|
|
|
|
|
|
|
## Additional materials
|
|
|
|
|