The Data Science Lab

DBSCAN Data Clustering from Scratch Using C#

The RegionQuery() method returns the neighbor data items that are at a distance strictly less than the epsilon value. I used the name RegionQuery() instead of something more descriptive, such as GetNeighbors(), because RegionQuery is the name used in the original 1996 research paper, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," by M. Ester, et al. Note that the strictly less-than epsilon condition is not universal; some DBSCAN implementations use a less-than-or-equal condition instead.
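
To make the neighbor condition concrete, here's a minimal sketch of a RegionQuery()-style method. The EucDistance() helper and the this.data and this.eps member names are illustrative assumptions rather than the demo's exact code:

// sketch of a RegionQuery()-style method -- the member names
// and the EucDistance() helper are assumptions for illustration
private List<int> RegionQuery(int p)
{
  List<int> result = new List<int>();
  for (int i = 0; i < this.data.Length; ++i)
  {
    double dist = EucDistance(this.data[p], this.data[i]);
    if (dist < this.eps)  // strictly less-than; some use <=
      result.Add(i);
  }
  return result;
}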

The Cluster() method stores the data to be clustered by reference rather than by value. An alternative is to make an explicit copy of the data in situations where the data might be modified by some other external process.
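
If you do want to cluster a copy of the source data rather than a reference to it, one approach is an explicit element-by-element copy. This is just a sketch, where this.data is an assumed member name for the stored data:

// sketch: store an explicit copy instead of a reference
// (this.data is an assumed member name for illustration)
this.data = new double[data.Length][];
for (int i = 0; i < data.Length; ++i)
{
  this.data[i] = new double[data[i].Length];
  for (int j = 0; j < data[i].Length; ++j)
    this.data[i][j] = data[i][j];
}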

The demo program uses Euclidean distance to determine if data items are neighbors. There are other distance measures, such as Manhattan distance and Minkowski distance, but as far as I know there are no research results that suggest an alternative to Euclidean distance works any better.
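
Euclidean distance is the square root of the sum of the squared differences between corresponding vector elements. A possible implementation looks like this, where the EucDistance() name is my choice for illustration:

// Euclidean distance: sqrt of sum of squared differences
// (the method name is illustrative, not the demo's exact code)
private static double EucDistance(double[] a, double[] b)
{
  double sum = 0.0;
  for (int i = 0; i < a.Length; ++i)
    sum += (a[i] - b[i]) * (a[i] - b[i]);
  return Math.Sqrt(sum);
}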

Utility Functions
The top-level Program class defines six helper functions. The code is presented in Listing 3. The MatShow() and VecShow() functions are used to display the source data and the resulting clustering array. The ListShow() function is not used by the demo, but it is handy if you want to insert code to programmatically examine the neighbor lists, which are defined as List<int> collections.

Listing 3: Utility Functions

static void MatShow(double[][] m, int dec, int wid)
{
  for (int i = 0; i < m.Length; ++i)
  {
    for (int j = 0; j < m[0].Length; ++j)
    {
      double v = m[i][j];
      if (Math.Abs(v) < 1.0e-5) v = 0.0;  // avoid "-0.0"
      Console.Write(v.ToString("F" + dec).PadLeft(wid));
    }
    Console.WriteLine("");
  }
}

static void VecShow(int[] vec, int wid)
{
  for (int i = 0; i < vec.Length; ++i)
    Console.Write(vec[i].ToString().PadLeft(wid));
  Console.WriteLine("");
}

public static void ListShow(List<int> list, int wid)
{
  for (int i = 0; i < list.Count; ++i)
    Console.Write(list[i].ToString().PadLeft(wid));
  Console.WriteLine("");
}

static double[][] MatCreate(int rows, int cols)
{
  double[][] result = new double[rows][];
  for (int i = 0; i < rows; ++i)
    result[i] = new double[cols];
  return result;
}

static int NumNonCommentLines(string fn,
  string comment)
{
  int ct = 0;
  string line = "";
  FileStream ifs = new FileStream(fn, FileMode.Open);
  StreamReader sr = new StreamReader(ifs);
  while ((line = sr.ReadLine()) != null)
    if (line.StartsWith(comment) == false)
      ++ct;
  sr.Close(); ifs.Close();
  return ct;
}

static double[][] MatLoad(string fn, int[] usecols,
  char sep, string comment)
{
  // count number of non-comment lines
  int nRows = NumNonCommentLines(fn, comment);

  int nCols = usecols.Length;
  double[][] result = MatCreate(nRows, nCols);
  string line = "";
  string[] tokens = null;
  FileStream ifs = new FileStream(fn, FileMode.Open);
  StreamReader sr = new StreamReader(ifs);

  int i = 0;
  while ((line = sr.ReadLine()) != null)
  {
    if (line.StartsWith(comment) == true)
      continue;
    tokens = line.Split(sep);
    for (int j = 0; j < nCols; ++j)
    {
      int k = usecols[j];  // into tokens
      result[i][j] = double.Parse(tokens[k]);
    }
    ++i;
  }
  sr.Close(); ifs.Close();
  return result;
}

The MatLoad() function loads data from a text file into an array-of-arrays style matrix. The design of MatLoad() mimics the Python NumPy library loadtxt() function. The MatLoad() function calls the MatCreate() and NumNonCommentLines() helper functions. You might want to consider placing the MatCreate() and NumNonCommentLines() functionality directly inside MatLoad().
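
A hypothetical call to MatLoad() looks much like a call to loadtxt(). The file name and column indices here are placeholders, not the demo's actual values:

// load columns 0 and 1 of a comma-delimited file, skipping
// lines that begin with "#" (file name is a placeholder)
double[][] X = MatLoad(".\\Data\\my_data.txt",
  new int[] { 0, 1 }, ',', "#");
MatShow(X, 4, 9);  // 4 decimals, width 9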

Wrapping Up
The DBSCAN clustering technique often gives very different results than the k-means clustering technique. This happens because DBSCAN can chain together neighbor data items, which can result in the first and last data items in a cluster being far apart. Because DBSCAN and k-means clustering can give significantly different results, it's not a bad idea to use both techniques.

Data clustering is an exploratory process. There is no single correct clustering of a dataset. In most data clustering scenarios, there should be a human in the loop, where human expertise is used to examine the clustering results to see if there are any interesting patterns.


About the Author

Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].
