Dunn Index for K-Means Clustering Evaluation

This article was first published on PyShark , and kindly contributed to python-bloggers. (You can report issue about the content on this page here)
Want to share your content on python-bloggers? click here.

In this tutorial we will explore the Dunn index and its application to K-Means clustering evaluation in Python.

Table of Contents


Introduction

The Dunn Index (DI) is one of the clustering algorithms evaluation measures. It is most commonly used to evaluate the goodness of split by a K-Means clustering algorithm for a given number of clusters.

We have previously discussed the Davies-Bouldin index, and Dunn index is yet another metric to evaluate the performance of clustering.

How do you interpret Dunn index?

The Dunn index is calculated as a ratio of the smallest inter-cluster distance to the largest intra-cluster distance. A high DI means better clustering since observations in each cluster are closer together, while clusters themselves are further away from each other.

In the next section the process of calculating the DI is described in detail with a few examples.


Books I recommend:


Dunn Index Explained

In this part we will go through each step of the calculations and provide meaningful illustrative examples to help better understand the formulas.


Step 1: Calculate inter-cluster distance

The first step is to calculate the inter-cluster distance and there are several ways to calculate it.

The inter-cluster distance measures the distance between observations belonging to two difference clusters.

Consider two clusters: \(C_a\) and \(C_b\).

Each cluster has a centroid: \(A\) and \(B\) respectively,

as well as some elements in each cluster \(a_{i}\) and \(b_{j}\), where:

  • \(a_{i}\) : \(i\)-th observation of cluster \(C_a\)
  • \(b_{j}\) : \(j\)-th observation of cluster \(C_b\)

Note:

The inter-cluster distance is computer for a pair of clusters.


There are 5 ways of calculating inter-cluster distance:

  1. Single linkage distance
  2. Complete linkage distance
  3. Average linkage distance
  4. Centroid linkage distance
  5. Average of centroids linkage distance

Single linkage distance

The single linkage distance is the smallest distance between two observations from two different clusters (closest observations):

$$\delta_1 (C_a, C_b) = \min \Bigg \{ d(a_{i}, b_{j}) \Bigg \}$$

where:

  • \(a_{i} \in C_a\)
  • \(b_{j} \in C_b\)

Complete linkage distance

The single linkage distance is the largest distance between two observations from two different clusters (furthest observations):

$$\delta_2 (C_a, C_b) = \max \Bigg \{ d(a_{i}, b_{j}) \Bigg \}$$

where:

  • \(a_{i} \in C_a\)
  • \(b_{j} \in C_b\)

Average linkage distance

The average linkage distance is the average distance between all observations from two different clusters:

$$\delta_3 (C_a, C_b) = \frac{1}{N_{C_a} \times N_{C_b}} \sum_{i=1}^{N_{C_a}} \sum_{j=1}^{N_{C_b}} d(a_i, b_j)$$

where:

  • \(a_{i} \in C_a\)
  • \(b_{j} \in C_b\)
  • \(N_{C_a}\) is the # of observations in \(C_a\)
  • \(N_{C_b}\) is the # of observations in \(C_b\)

Centroid linkage distance

The centroid linkage distance is the distance between the centroids of two clusters:

$$\delta_4 (C_a, C_b) = d(A, B)$$

where:

  • \(A\) is the centroid of \(C_a\)
  • \(B\) is the centroid of \(C_b\)

Average of centroids linkage distance

Average of centroids linkage distance is the average distance between a centroid of a cluster and all observations from a different cluster:

$$\delta_5 (C_a, C_b) = \frac{1}{N_{C_a} + N_{C_b}} \Bigg \{ \sum_{i=1}^{N_{C_a}} d(a_i, B) + \sum_{j=1}^{N_{C_b}} d(b_j, A) \Bigg \}$$

  • \(a_{i} \in C_a\)
  • \(b_{j} \in C_b\)
  • \(N_{C_a}\) is the \(\#\) of observations in \(C_a\)
  • \(N_{C_b}\) is the \(\#\) of observations in \(C_b\)
  • \(A\) is the centroid of \(C_a\)
  • \(B\) is the centroid of \(C_b\)

Depending which approach is chosen, following one of the above formulas will calculate the inter-cluster distance.


Step 2: Calculate intra-cluster distance

The second step is to calculate the intra-cluster distance and there are several ways to calculate it.

The inter-cluster distance measures the distance between observations belonging to the same cluster.

Consider one cluster: \(C_a\) and its centroid \(A\) as well as some elements in this cluster \(a_{i}\) and \(a_{j}\), where:

  • \(a_{i}\) : \(i\)-th observation of cluster \(C_a\)
  • \(a_{j}\) : \(j\)-th observation of cluster \(C_a\)

Note:

The intra-cluster distance is computed for individual cluster (one distance measure per cluster).


There are 3 ways of calculating inter-cluster distance:

  1. Complete diameter distance
  2. Average diameter distance
  3. Centroid diameter distance

Complete diameter distance

The complete diameter distance measures the distance between two furthest observations within the same cluster:

$$\Delta_1 (C) = \max \Bigg \{d(a_i, a_j) \Bigg \}$$

where:

  • \(a_i \in C\)
  • \(a_j \in C\)

Average diameter distance

The average diameter distance measures the average distance between all observations within the same cluster:

$$\Delta_2 (C) = \frac{1}{N_C \times (N_C-1)} \sum_{i \neq j}^{N_C} d(a_i, a_j) $$

where:

  • \(a_i \in C\)
  • \(a_j \in C\)
  • \(N_{C}\) is the \(\#\) of observations in \(C\)

Centroid diameter distance

The centroid diameter distance measures the double average distance between a centroid of a cluster and all observations from the same cluster:

$$\Delta_3 (C) = 2 \times \frac{1}{N_C} \sum_{i=1}^{N_C} d(a_i, A) $$

where:

  • \(a_i \in C\)
  • \(N_{C}\) is the \(\#\) of observations in \(C\)
  • \(A\) is the centroid of \(C\)

Step 3: Calculate Dunn Index

The Dunn index is defined as the ratio of the smallest inter-cluster distance and the largest intra-cluster distance.

For \(m\) clusters, the Dunn index is calculated as:

$$DI_m = \frac{\min\limits_{1 \leq i \leq j \leq m} \delta (C_i, C_j)}{\max\limits_{1 \leq k \leq m} \Delta_k}$$

This means first that we should minimize the inter-cluster distance function, which is to find the distance between two closest clusters.

And then maximize the intra-cluster distance function, which is to find the cluster with the largest diameter.

Finally, we will take the ratio of the above two values which is referred to as the Dunn index (with range between 0 and 1).

From the above explanation, we can conclude that the large values of Dunn index represent better clustering.

Note:

\(\delta (C_i, C_j)\) can be calculated using any of the inter-cluster distance functions. And \(\Delta_k\) can be calculated using any of the intra-cluster distance functions.


Conclusion

In this article we discussed how to calculate the Dunn index for clustering evaluation.

Feel free to leave comments below if you have any questions or have suggestions for some edits and check out more of my Python Programming articles.

The post Dunn Index for K-Means Clustering Evaluation appeared first on PyShark.

To leave a comment for the author, please follow the link and comment on their blog: PyShark .

Want to share your content on python-bloggers? click here.