The Data Science Lab
Clustering Mixed Categorical and Numeric Data Using k-Means with C#
The KMeans class uses Euclidean distance between data items. An alternative is Manhattan distance, which uses the sum of the absolute values of the differences between data item elements instead of the square root of the sum of squared differences. Because the differences aren't squared, no single element can dominate the distance value, and so it is sometimes said that, in theory, Manhattan distance may be preferable for data that has very high dimensionality. I'm mildly skeptical of this claim, and Euclidean distance seems to work fine in practice.
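To make the difference concrete, here is a minimal sketch of the two distance functions, assuming the data items are stored as double arrays. The EucDistance() and ManDistance() names are illustrative; the demo KMeans class has its own internal distance method.

// Euclidean distance: square root of the sum of squared differences
static double EucDistance(double[] x, double[] y)
{
  double sum = 0.0;
  for (int i = 0; i < x.Length; ++i)
    sum += (x[i] - y[i]) * (x[i] - y[i]);
  return Math.Sqrt(sum);
}

// Manhattan distance: sum of the absolute differences
static double ManDistance(double[] x, double[] y)
{
  double sum = 0.0;
  for (int i = 0; i < x.Length; ++i)
    sum += Math.Abs(x[i] - y[i]);
  return sum;
}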
The KMeans class clustering algorithm uses two loops. In pseudo-code:
initialize best_wcss = a very large value
loop exactly trials times
  wcss, clustering, means = cluster using at most maxIter iterations
  if wcss < best_wcss then
    best_wcss = wcss
    best_clustering = clustering
    best_means = means
  end_if
end_loop
return best_clustering
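In C#, the outer restart loop looks roughly like the sketch below. The ClusterOnce() and ComputeWcss() helper names, and the means field, are illustrative placeholders rather than the demo's actual internal methods, but the control flow matches the pseudo-code above.

double bestWcss = double.MaxValue;
int[] bestClustering = null;
double[][] bestMeans = null;

for (int trial = 0; trial < this.trials; ++trial)
{
  // one run: random initialization, then at most maxIter update passes
  int[] clustering = this.ClusterOnce(this.maxIter);  // hypothetical helper
  double wcss = this.ComputeWcss(clustering);         // hypothetical helper

  if (wcss < bestWcss)  // keep the best (lowest WCSS) result seen so far
  {
    bestWcss = wcss;
    bestClustering = (int[])clustering.Clone();
    bestMeans = this.means;  // hypothetical field updated by ClusterOnce()
  }
}
return bestClustering;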
The KMeans constructor sets the outer loop number of trials to N * 5, where N is the number of data items being clustered. For the demo data this is 240 * 5 = 1200 trials. The N * 5 value has worked well in practice, but I can imagine scenarios where you might need a larger or smaller value, or possibly a fixed value for doing performance evaluations. For example, the scikit-learn Python KMeans module uses a small fixed default for the number of restarts (its n_init parameter) rather than a value that scales with N. Because the trials variable is a public class member, you can set it directly, like km.trials = 250, before calling the Cluster() method.
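For example, assuming a constructor that accepts the prepared data matrix and the number of clusters (the demo constructor signature may differ), overriding the default looks like:

KMeans km = new KMeans(X, 3);  // assumed constructor: (data, k)
km.trials = 250;               // override the default N * 5 restarts
int[] clustering = km.Cluster();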
The maximum number of iterations used in a single clustering operation is set to N * 2. The k-means algorithm usually converges quickly to a result where the clustering no longer changes -- about 15 iterations for the demo data -- so it's unlikely that you'll need to modify the maxIter value.
Wrapping Up
As is often the case with machine learning, when using the k-means clustering technique presented in this article, the most time-consuming part of the process is preparing the raw data by normalizing and encoding it. For relatively small datasets (typically fewer than N = 1000 items and fewer than 30 columns), I usually just drop the raw data into an Excel spreadsheet and process the data there. For larger datasets it's usually worth the effort to implement a hard-coded specialized function to do the work, like the NormAndEncode() helper function in the demo program.
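As a rough illustration, a hard-coded helper for a dataset with one numeric column (say, age) and one categorical column (say, color) might look like the sketch below. The column layout, the min-max constants and the one-hot encoding are assumptions for illustration only; the demo's NormAndEncode() function and the article's actual encoding scheme may differ.

// illustrative sketch only -- the demo NormAndEncode() and the
// article's actual encoding scheme may differ
static double[][] NormAndEncode(string[][] raw)
{
  // assumes column 0 is numeric (age) and column 1 is categorical
  // (color) with possible values "red", "green", "blue"
  double ageMin = 18.0; double ageMax = 68.0;  // made-up constants
  int n = raw.Length;
  double[][] result = new double[n][];
  for (int i = 0; i < n; ++i)
  {
    result[i] = new double[4];  // 1 normalized numeric + 3 one-hot
    double age = double.Parse(raw[i][0]);
    result[i][0] = (age - ageMin) / (ageMax - ageMin);  // min-max norm
    if (raw[i][1] == "red") result[i][1] = 1.0;
    else if (raw[i][1] == "green") result[i][2] = 1.0;
    else if (raw[i][1] == "blue") result[i][3] = 1.0;
  }
  return result;
}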
The technique presented in this article is all about encoding and normalizing any dataset so that it can be clustered using standard k-means. For example, I passed the encoded and normalized data to the Python scikit-learn KMeans module and got results identical to those from the C# KMeans implementation presented in this article.
The technique presented in this article to cluster mixed categorical and numeric data can also be used to cluster strictly categorical data using k-means.
The demo uses k = 3 clusters, which is more or less arbitrary. To find a good value for k, you can cluster using k values from 1 to 10 and track the resulting WCSS values. The WCSS values typically decrease quickly but begin to level out at some point. The k value at that point is often a good choice. This is called the "elbow" technique.
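A minimal sketch of the elbow approach, assuming the constructor shown earlier and assuming the KMeans class exposes the final WCSS value as a public wcss member (the demo class may expose it differently), is:

for (int k = 1; k <= 10; ++k)
{
  KMeans km = new KMeans(X, k);  // assumed constructor: (data, k)
  int[] clustering = km.Cluster();
  Console.WriteLine("k = " + k +
    "  WCSS = " + km.wcss.ToString("F4"));  // assumes public wcss member
}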
An alternative for clustering mixed categorical and numeric data is to use an older technique called k-prototypes clustering. One disadvantage of k-prototypes clustering is that it's not widely implemented, unlike k-means clustering, which is available in virtually every machine learning library. Another disadvantage of k-prototypes clustering is that it's relatively crude because it uses a dissimilarity function that simply counts the number of mismatched ordinal-encoded categorical values between two items. In a set of limited experiments, the clustering technique presented in this article beat or matched k-prototypes clustering results on all datasets.
About the Author
Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].