The Data Science Lab

K-Means Data Clustering from Scratch Using C#

The demo program displays the loaded data using the MatShow() function:

Console.WriteLine("Data: ");
MatShow(X, 2, 6);  // 2 decimals, 6 wide

In a non-demo scenario, you might want to display all the data to make sure it was loaded properly.
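
The MatShow() helper isn't shown in the excerpt above. Here is a minimal sketch of what such a helper might look like, assuming the data is stored as an array-of-arrays style double[][] matrix (not necessarily the article's exact implementation):

// sketch of a MatShow() display helper -- assumes
// double[][] array-of-arrays matrix storage
static void MatShow(double[][] m, int dec, int wid)
{
  for (int i = 0; i < m.Length; ++i)
  {
    for (int j = 0; j < m[0].Length; ++j)
      Console.Write(m[i][j].ToString("F" + dec).PadLeft(wid));
    Console.WriteLine("");
  }
}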

Clustering the Data
The KMeans clustering object is created and used like so:

Console.WriteLine("Clustering with k=3 ");
KMeans km = new KMeans(X, 3);
Console.WriteLine("Done ");
int[] clustering = km.Cluster();

The number of clusters, k, must be supplied. There is no good way to programmatically determine an optimal number of clusters. A somewhat subjective approach for determining a good number of clusters is called the elbow technique. You try different values of k and plot the associated within-cluster sum of squares (WCSS) values. At the point where increasing k no longer produces a large drop in WCSS -- the "elbow" of the plot -- the associated value of k is a good candidate. See the article "Determining the Number of Clusters Using the Elbow and the Knee Techniques."
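
For example, an elbow analysis loop might look like the following sketch. It assumes a hypothetical bestWcss member on the KMeans class that holds the WCSS of the best clustering found; the article doesn't specify how WCSS is exposed:

// sketch of elbow analysis: run k-means for several values
// of k and print the resulting WCSS values for inspection.
// assumes a hypothetical bestWcss member on KMeans.
for (int k = 1; k <= 8; ++k)
{
  KMeans km = new KMeans(X, k);
  int[] clustering = km.Cluster();
  Console.WriteLine("k = " + k + "  WCSS = " +
    km.bestWcss.ToString("F4"));
}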

The demo concludes by displaying the clustering results:

Console.WriteLine("clustering = ");
VecShow(clustering, 3);
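
The VecShow() helper is a simple display routine. A minimal sketch, not necessarily the article's exact implementation, where the second argument is the display width of each value:

// sketch of a VecShow() display helper
static void VecShow(int[] vec, int wid)
{
  for (int i = 0; i < vec.Length; ++i)
    Console.Write(vec[i].ToString().PadLeft(wid));
  Console.WriteLine("");
}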

The results look like:

1  1  2  0  0  1 . . .

The clustering information can be displayed in several ways. For example, instead of displaying by data item, you can display by cluster ID. Or, you can display each data item and its associated cluster ID on the same line.
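
For example, a sketch of displaying the data item indices grouped by cluster ID:

// sketch: display data item indices grouped by cluster ID
for (int cid = 0; cid < 3; ++cid)  // k = 3 clusters
{
  Console.Write("cluster " + cid + ": ");
  for (int i = 0; i < clustering.Length; ++i)
    if (clustering[i] == cid)
      Console.Write(i + " ");  // index of the data item
  Console.WriteLine("");
}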

Wrapping Up
The k-means implementation presented in this article can be used as-is or modified in several ways. The demo uses random partition initialization. The order of the N data items is randomized, then the first k data items are assigned to the first k cluster IDs, and then the remaining N-k data items are assigned to a random cluster ID. This approach guarantees that at least one data item is assigned to each cluster ID, but it doesn't guarantee that the initial distribution of cluster ID counts is equal. Instead of assigning random cluster IDs, you can cycle through the cluster IDs sequentially.
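
A minimal sketch of the random partition initialization just described (not the article's exact code):

// sketch of random partition initialization: shuffle item
// order, assign the first k items to cluster IDs 0..k-1,
// then assign each remaining item a random cluster ID
static int[] InitClustering(int N, int k, Random rnd)
{
  int[] indices = new int[N];
  for (int i = 0; i < N; ++i) indices[i] = i;
  for (int i = 0; i < N; ++i)  // Fisher-Yates shuffle
  {
    int r = rnd.Next(i, N);
    int tmp = indices[r]; indices[r] = indices[i]; indices[i] = tmp;
  }
  int[] clustering = new int[N];
  for (int i = 0; i < k; ++i)   // first k items get IDs 0..k-1
    clustering[indices[i]] = i;
  for (int i = k; i < N; ++i)   // remaining items get random IDs
    clustering[indices[i]] = rnd.Next(0, k);
  return clustering;
}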


The Cluster() method calls ClusterOnce() this.trials times, looking for the clustering that has the smallest within-cluster sum of squares (WCSS). Instead of iterating a fixed number of times, you can track the previous WCSS and the current WCSS and exit the loop when the best value hasn't improved for a threshold number of trials (such as N consecutive trials).
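
A sketch of that alternative stopping scheme, assuming hypothetical ClusterOnce() and Wcss() members (the article's actual member names may differ):

// sketch: instead of a fixed number of trials, stop when the
// best WCSS hasn't improved for maxNoChange consecutive trials.
// assumes hypothetical ClusterOnce() and Wcss() members.
double bestWcss = double.MaxValue;
int[] bestClustering = null;
int noChangeCount = 0;
int maxNoChange = 10;  // threshold
while (noChangeCount < maxNoChange)
{
  int[] clustering = this.ClusterOnce();
  double currWcss = this.Wcss(clustering);
  if (currWcss < bestWcss)  // improvement found
  {
    bestWcss = currWcss;
    bestClustering = clustering;
    noChangeCount = 0;
  }
  else
  {
    ++noChangeCount;
  }
}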

The demo implementation assumes the source data is normalized. An alternative is to integrate the data normalization into the k-means system.
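
The article doesn't specify which normalization technique is assumed, but a minimal sketch of min-max normalization that could be applied inside the KMeans constructor looks like:

// sketch: min-max normalization scales each column to [0, 1].
// one of several possible techniques; z-score is an alternative.
// assumes max > min for every column.
static double[][] MinMaxNormalize(double[][] X)
{
  int n = X.Length; int dim = X[0].Length;
  double[][] result = new double[n][];
  for (int i = 0; i < n; ++i) result[i] = new double[dim];
  for (int j = 0; j < dim; ++j)
  {
    double min = X[0][j]; double max = X[0][j];
    for (int i = 0; i < n; ++i)
    {
      if (X[i][j] < min) min = X[i][j];
      if (X[i][j] > max) max = X[i][j];
    }
    for (int i = 0; i < n; ++i)
      result[i][j] = (X[i][j] - min) / (max - min);
  }
  return result;
}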

The UpdateClustering() method uses Euclidean distance to assign each data item to the cluster ID that has the closest mean. There are alternatives to Euclidean distance, such as Manhattan distance and Minkowski distance. However, Euclidean distance works fine in most scenarios.
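
A minimal sketch of a Manhattan distance function that could be swapped in for Euclidean distance:

// sketch: Manhattan (L1) distance between a data item and a
// cluster mean -- a drop-in alternative to Euclidean distance
static double ManhattanDistance(double[] item, double[] mean)
{
  double sum = 0.0;
  for (int j = 0; j < item.Length; ++j)
    sum += Math.Abs(item[j] - mean[j]);
  return sum;
}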


About the Author

Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].

