The Data Science Lab

DBSCAN Data Clustering from Scratch Using C#

The RegionQuery() method returns the neighbor data items that are at a distance strictly less than the epsilon value. I used the name RegionQuery() instead of something more descriptive, such as GetNeighbors(), because RegionQuery is the name used in the original 1996 research paper, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," by M. Ester, et al. Note that the strictly less-than epsilon condition is not universal; some DBSCAN implementations use a less-than-or-equal condition instead.
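
To make the neighbor condition concrete, here's a minimal sketch of a RegionQuery()-style method. The EucDistance() helper and the this.data and this.eps member names are illustrative assumptions rather than the demo's exact code:

// sketch of a RegionQuery()-style method -- the member names
// and the EucDistance() helper are assumptions for illustration
private List<int> RegionQuery(int p)
{
  List<int> result = new List<int>();
  for (int i = 0; i < this.data.Length; ++i)
  {
    double dist = EucDistance(this.data[p], this.data[i]);
    if (dist < this.eps)  // strictly less-than; some use <=
      result.Add(i);
  }
  return result;
}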

The Cluster() method stores the data to be clustered by reference rather than by value. An alternative is to make an explicit copy of the data in situations where the data might be modified by some other external process.
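
If you do want to cluster a copy of the source data rather than a reference to it, one approach is an explicit element-by-element copy. This is just a sketch, where this.data is an assumed member name for the stored data:

// sketch: store an explicit copy instead of a reference
// (this.data is an assumed member name for illustration)
this.data = new double[data.Length][];
for (int i = 0; i < data.Length; ++i)
{
  this.data[i] = new double[data[i].Length];
  for (int j = 0; j < data[i].Length; ++j)
    this.data[i][j] = data[i][j];
}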

The demo program uses Euclidean distance to determine if data items are neighbors. There are other distance measures, such as Manhattan distance and Minkowski distance, but as far as I know there are no research results that suggest an alternative to Euclidean distance works any better.
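
Euclidean distance is the square root of the sum of the squared differences between corresponding vector elements. A possible implementation looks like this, where the EucDistance() name is my choice for illustration:

// Euclidean distance: sqrt of sum of squared differences
// (the method name is illustrative, not the demo's exact code)
private static double EucDistance(double[] a, double[] b)
{
  double sum = 0.0;
  for (int i = 0; i < a.Length; ++i)
    sum += (a[i] - b[i]) * (a[i] - b[i]);
  return Math.Sqrt(sum);
}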

Utility Functions
The top-level Program class defines six helper functions. The code is presented in Listing 3. The MatShow() and VecShow() functions are used to display the source data and the resulting clustering array. The ListShow() function is not used by the demo, but it is handy if you want to insert code to programmatically examine the neighbor lists, which are defined as List<int> collections.

Listing 3: Utility Functions

static void MatShow(double[][] m, int dec, int wid)
{
  for (int i = 0; i < m.Length; ++i)
  {
    for (int j = 0; j < m[0].Length; ++j)
    {
      double v = m[i][j];
      if (Math.Abs(v) < 1.0e-5) v = 0.0;  // avoid "-0.0"
      Console.Write(v.ToString("F" + dec).PadLeft(wid));
    }
    Console.WriteLine("");
  }
}

static void VecShow(int[] vec, int wid)
{
  for (int i = 0; i < vec.Length; ++i)
    Console.Write(vec[i].ToString().PadLeft(wid));
  Console.WriteLine("");
}

public static void ListShow(List<int> list, int wid)
{
  for (int i = 0; i < list.Count; ++i)
    Console.Write(list[i].ToString().PadLeft(wid));
  Console.WriteLine("");
}

static double[][] MatCreate(int rows, int cols)
{
  double[][] result = new double[rows][];
  for (int i = 0; i < rows; ++i)
    result[i] = new double[cols];
  return result;
}

static int NumNonCommentLines(string fn,
  string comment)
{
  int ct = 0;
  string line = "";
  FileStream ifs = new FileStream(fn, FileMode.Open);
  StreamReader sr = new StreamReader(ifs);
  while ((line = sr.ReadLine()) != null)
    if (line.StartsWith(comment) == false)
      ++ct;
  sr.Close(); ifs.Close();
  return ct;
}

static double[][] MatLoad(string fn, int[] usecols,
  char sep, string comment)
{
  // count number of non-comment lines
  int nRows = NumNonCommentLines(fn, comment);

  int nCols = usecols.Length;
  double[][] result = MatCreate(nRows, nCols);
  string line = "";
  string[] tokens = null;
  FileStream ifs = new FileStream(fn, FileMode.Open);
  StreamReader sr = new StreamReader(ifs);

  int i = 0;
  while ((line = sr.ReadLine()) != null)
  {
    if (line.StartsWith(comment) == true)
      continue;
    tokens = line.Split(sep);
    for (int j = 0; j < nCols; ++j)
    {
      int k = usecols[j];  // into tokens
      result[i][j] = double.Parse(tokens[k]);
    }
    ++i;
  }
  sr.Close(); ifs.Close();
  return result;
}

The MatLoad() function loads data from a text file into an array-of-arrays style matrix. The design of MatLoad() mimics the Python NumPy library loadtxt() function. The MatLoad() function calls the MatCreate() and NumNonCommentLines() helper functions. You might want to consider placing the MatCreate() and NumNonCommentLines() functionality directly inside MatLoad().
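
A hypothetical call to MatLoad() looks much like a call to loadtxt(). The file name and column indices here are placeholders, not the demo's actual values:

// load columns 0 and 1 of a comma-delimited file, skipping
// lines that begin with "#" (file name is a placeholder)
double[][] X = MatLoad(".\\Data\\my_data.txt",
  new int[] { 0, 1 }, ',', "#");
MatShow(X, 4, 9);  // 4 decimals, width 9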

Wrapping Up
The DBSCAN clustering technique often gives very different results than the k-means clustering technique. This happens because DBSCAN can chain together neighbor data items, which can result in the first and last data items in a cluster being far apart. Because DBSCAN and k-means clustering can give significantly different results, it's not a bad idea to use both techniques.

Data clustering is an exploratory process. There is no single correct clustering of a dataset. In most data clustering scenarios, there should be a human in the loop, where human expertise is used to examine the clustering results to see if there are any interesting patterns.


About the Author

Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].
