Tip: How to define your distance function for Hierarchical Clustering
[This article was first published on Python – Predictive Hacks, and kindly contributed to python-bloggers].
Want to share your content on python-bloggers? click here.
You will often need to define your own distance function. I found this Stack Overflow answer very helpful, so I am posting it here as a tip.
The SciPy hierarchical clustering routines that work on raw observations (for example fclusterdata and linkage) will accept a custom distance function that takes two 1D vectors specifying a pair of points and returns a scalar. For example, using fclusterdata:
import numpy as np
from scipy.cluster.hierarchy import fclusterdata

# a custom function that just computes Euclidean distance
def mydist(p1, p2):
    diff = p1 - p2
    return np.vdot(diff, diff) ** 0.5

X = np.random.randn(100, 2)

# cluster once with the custom callable and once with the built-in metric
fclust1 = fclusterdata(X, 1.0, metric=mydist)
fclust2 = fclusterdata(X, 1.0, metric='euclidean')

print(np.allclose(fclust1, fclust2))
# True
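The example above only reproduces a metric that SciPy already ships. As a rough sketch of a distance you might actually want to define yourself, here is a weighted Euclidean distance passed to fclusterdata. The helper name make_weighted_euclidean, the weights, and the choice of criterion="maxclust" with t=3 are my own illustrative assumptions, not part of the original answer.

```python
import numpy as np
from scipy.cluster.hierarchy import fclusterdata


def make_weighted_euclidean(weights):
    """Build a distance function that weights each feature (hypothetical helper)."""
    w = np.asarray(weights, dtype=float)

    def dist(p1, p2):
        # p1 and p2 are 1D vectors; the result must be a scalar
        diff = p1 - p2
        return float(np.sqrt(np.sum(w * diff * diff)))

    return dist


rng = np.random.default_rng(42)
X = rng.standard_normal((100, 2))

# Illustrative weights: the second feature counts four times as much
weighted = make_weighted_euclidean([1.0, 4.0])

# Ask for at most 3 flat clusters using the custom metric
labels = fclusterdata(X, t=3, criterion="maxclust", metric=weighted)
print(np.unique(labels))
```

Keep in mind that a Python callable is evaluated once per pair of points, so on large datasets it can be much slower than the built-in string metrics.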
Valid inputs for the metric= keyword argument are the same as for scipy.spatial.distance.pdist. You can also find some other information here.
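To make the connection with pdist concrete, the sketch below computes the pairwise distances explicitly with pdist, once with a custom callable and once with the equivalent built-in name, and then feeds the condensed distance matrix to linkage and fcluster. The manhattan function, the "average" linkage method and the cut at 3 clusters are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def manhattan(p1, p2):
    # custom callable: sum of absolute coordinate differences
    return np.abs(p1 - p2).sum()


rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))

# metric= accepts either a callable or the name of a built-in metric
d_custom = pdist(X, metric=manhattan)
d_builtin = pdist(X, metric="cityblock")
same_distances = np.allclose(d_custom, d_builtin)
print(same_distances)

# The condensed distance matrix can be passed straight to linkage
Z = linkage(d_custom, method="average")
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.unique(labels))
```

Computing the distances yourself like this is handy when you want to reuse the same distance matrix with several linkage methods or cut thresholds.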
To leave a comment for the author, please follow the link and comment on their blog: Python – Predictive Hacks.