The Data Science Lab

K-Means Data Clustering from Scratch Using C#

The demo program displays the loaded data using the MatShow() function:

Console.WriteLine("Data: ");
MatShow(X, 2, 6);  // 2 decimals, 6 wide

In a non-demo scenario, you might want to display all the data to make sure it was loaded properly.
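
The MatShow() helper isn't shown in the excerpt above. Here is a minimal sketch of what such a helper might look like, assuming the data is stored as an array-of-arrays style double[][] matrix (not necessarily the article's exact implementation):

// sketch of a MatShow() display helper -- assumes
// double[][] array-of-arrays matrix storage
static void MatShow(double[][] m, int dec, int wid)
{
  for (int i = 0; i < m.Length; ++i)
  {
    for (int j = 0; j < m[0].Length; ++j)
      Console.Write(m[i][j].ToString("F" + dec).PadLeft(wid));
    Console.WriteLine("");
  }
}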

Clustering the Data
The KMeans clustering object is created and used like so:

Console.WriteLine("Clustering with k=3 ");
KMeans km = new KMeans(X, 3);
Console.WriteLine("Done ");
int[] clustering = km.Cluster();

The number of clusters, k, must be supplied. There is no good way to programmatically determine an optimal number of clusters. A somewhat subjective approach for determining a good number of clusters is called the elbow technique. You try different values of k and plot the associated within-cluster sum of squares (WCSS) values. At the point where increasing k no longer produces a large drop in WCSS -- the "elbow" of the plot -- the associated value of k is a good candidate. See the article "Determining the Number of Clusters Using the Elbow and the Knee Techniques."
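
For example, an elbow analysis loop might look like the following sketch. It assumes a hypothetical bestWcss member on the KMeans class that holds the WCSS of the best clustering found; the article doesn't specify how WCSS is exposed:

// sketch of elbow analysis: run k-means for several values
// of k and print the resulting WCSS values for inspection.
// assumes a hypothetical bestWcss member on KMeans.
for (int k = 1; k <= 8; ++k)
{
  KMeans km = new KMeans(X, k);
  int[] clustering = km.Cluster();
  Console.WriteLine("k = " + k + "  WCSS = " +
    km.bestWcss.ToString("F4"));
}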

The demo concludes by displaying the clustering results:

Console.WriteLine("clustering = ");
VecShow(clustering, 3);
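
The VecShow() helper is a simple display routine. A minimal sketch, not necessarily the article's exact implementation, where the second argument is the display width of each value:

// sketch of a VecShow() display helper
static void VecShow(int[] vec, int wid)
{
  for (int i = 0; i < vec.Length; ++i)
    Console.Write(vec[i].ToString().PadLeft(wid));
  Console.WriteLine("");
}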

The results look like:

1  1  2  0  0  1 . . .

The clustering information can be displayed in several ways. For example, instead of displaying by data item, you can display by cluster ID. Or, you can display each data item and its associated cluster ID on the same line.
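
For example, a sketch of displaying the data item indices grouped by cluster ID:

// sketch: display data item indices grouped by cluster ID
for (int cid = 0; cid < 3; ++cid)  // k = 3 clusters
{
  Console.Write("cluster " + cid + ": ");
  for (int i = 0; i < clustering.Length; ++i)
    if (clustering[i] == cid)
      Console.Write(i + " ");  // index of the data item
  Console.WriteLine("");
}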

Wrapping Up
The k-means implementation presented in this article can be used as-is or modified in several ways. The demo uses random partition initialization. The order of the N data items is randomized, then the first k data items are assigned to the first k cluster IDs, and then the remaining N-k data items are assigned to a random cluster ID. This approach guarantees that at least one data item is assigned to each cluster ID, but it doesn't guarantee that the initial distribution of cluster ID counts is equal. Instead of assigning random cluster IDs, you can cycle through the cluster IDs sequentially.
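
A minimal sketch of the random partition initialization just described (not the article's exact code):

// sketch of random partition initialization: shuffle item
// order, assign the first k items to cluster IDs 0..k-1,
// then assign each remaining item a random cluster ID
static int[] InitClustering(int N, int k, Random rnd)
{
  int[] indices = new int[N];
  for (int i = 0; i < N; ++i) indices[i] = i;
  for (int i = 0; i < N; ++i)  // Fisher-Yates shuffle
  {
    int r = rnd.Next(i, N);
    int tmp = indices[r]; indices[r] = indices[i]; indices[i] = tmp;
  }
  int[] clustering = new int[N];
  for (int i = 0; i < k; ++i)   // first k items get IDs 0..k-1
    clustering[indices[i]] = i;
  for (int i = k; i < N; ++i)   // remaining items get random IDs
    clustering[indices[i]] = rnd.Next(0, k);
  return clustering;
}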


The Cluster() method calls ClusterOnce() this.trials times, looking for the clustering that has the smallest within-cluster sum of squares (WCSS). Instead of iterating a fixed number of times, you can track the previous WCSS and the current WCSS and exit the loop when the best value hasn't improved for a threshold number of trials (such as N consecutive trials).
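
A sketch of that alternative stopping scheme, assuming hypothetical ClusterOnce() and Wcss() members (the article's actual member names may differ):

// sketch: instead of a fixed number of trials, stop when the
// best WCSS hasn't improved for maxNoChange consecutive trials.
// assumes hypothetical ClusterOnce() and Wcss() members.
double bestWcss = double.MaxValue;
int[] bestClustering = null;
int noChangeCount = 0;
int maxNoChange = 10;  // threshold
while (noChangeCount < maxNoChange)
{
  int[] clustering = this.ClusterOnce();
  double currWcss = this.Wcss(clustering);
  if (currWcss < bestWcss)  // improvement found
  {
    bestWcss = currWcss;
    bestClustering = clustering;
    noChangeCount = 0;
  }
  else
  {
    ++noChangeCount;
  }
}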

The demo implementation assumes the source data is normalized. An alternative is to integrate the data normalization into the k-means system.
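
The article doesn't specify which normalization technique is assumed, but a minimal sketch of min-max normalization that could be applied inside the KMeans constructor looks like:

// sketch: min-max normalization scales each column to [0, 1].
// one of several possible techniques; z-score is an alternative.
// assumes max > min for every column.
static double[][] MinMaxNormalize(double[][] X)
{
  int n = X.Length; int dim = X[0].Length;
  double[][] result = new double[n][];
  for (int i = 0; i < n; ++i) result[i] = new double[dim];
  for (int j = 0; j < dim; ++j)
  {
    double min = X[0][j]; double max = X[0][j];
    for (int i = 0; i < n; ++i)
    {
      if (X[i][j] < min) min = X[i][j];
      if (X[i][j] > max) max = X[i][j];
    }
    for (int i = 0; i < n; ++i)
      result[i][j] = (X[i][j] - min) / (max - min);
  }
  return result;
}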

The UpdateClustering() method uses Euclidean distance to assign each data item to the cluster ID that has the closest mean. There are alternatives to Euclidean distance, such as Manhattan distance and Minkowski distance. However, Euclidean distance works fine in most scenarios.
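
A minimal sketch of a Manhattan distance function that could be swapped in for Euclidean distance:

// sketch: Manhattan (L1) distance between a data item and a
// cluster mean -- a drop-in alternative to Euclidean distance
static double ManhattanDistance(double[] item, double[] mean)
{
  double sum = 0.0;
  for (int j = 0; j < item.Length; ++j)
    sum += Math.Abs(item[j] - mean[j]);
  return sum;
}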


About the Author

Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].

