The Data Science Lab

Weighted k-Nearest Neighbors Regression Using C#

Console.WriteLine("Explaining for " +
  "x = (0.5, -0.5, 0.5, -0.5, 0.5) ");
double[] x = 
  new double[] { 0.5, -0.5, 0.5, -0.5, 0.5 };
knnr_model.Explain(x);
      
Console.WriteLine("End demo ");
Console.ReadLine();

The Explain() method computes a predicted y value, but also prints messages that show how the prediction is made. In essence, Explain() is a verbose version of Predict(), so an alternative design is to combine the Predict() and Explain() methods into a single method.

The Predict Method
The implementation of the Predict() method begins with:

public double Predict(double[] x)
{
  if (this.trainX == null)
    Console.WriteLine("Error: Store() not yet called ");

  // 0. set up ordering/indices
  int n = this.trainX.Length;
  int[] indices = new int[n];
  for (int i = 0; i < n; ++i)
    indices[i] = i;
. . .

The purpose of the indices array will become clear in a moment. Next, the distances from the input vector x to all training data items are computed and then sorted from closest to farthest:

  double[] distances = new double[n];
  for (int i = 0; i < n; ++i)
    distances[i] = EucDistance(x, this.trainX[i]);
  Array.Sort(distances, indices);

The C# language Array.Sort() method physically sorts its first argument, the distances, and also sorts its second argument, the indices of the training items, in parallel. It would be possible to sort the distances, the training x data, and the target y values all in parallel, but that approach involves a lot of unnecessary data movement.
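The EucDistance() helper method isn't shown in the code above. Assuming it computes ordinary Euclidean distance between two vectors of equal length, a minimal sketch looks like:

  private static double EucDistance(double[] v1,
    double[] v2)
  {
    // Euclidean distance = square root of the sum
    // of squared component differences
    double sum = 0.0;
    for (int i = 0; i < v1.Length; ++i)
      sum += (v1[i] - v2[i]) * (v1[i] - v2[i]);
    return Math.Sqrt(sum);
  }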

The weights are computed by two statements:

  double[]? wts = null;
  if (this.weighting == "uniform")
    wts = UniformWts(this.k);
  else if (this.weighting == "skewed")
    wts = SkewedWts(this.k);

The weighting scheme is specified by an explicit string of "uniform" or "skewed." Using magic strings is often a bad design choice, but in my opinion the technique is simpler than using an enumeration in this case.
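The UniformWts() helper isn't shown either. Assuming it simply returns k equal weights that sum to 1.0, a minimal sketch is:

  private static double[] UniformWts(int k)
  {
    // each of the k nearest neighbors gets weight 1/k
    double[] result = new double[k];
    for (int i = 0; i < k; ++i)
      result[i] = 1.0 / k;
    return result;
  }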


The Predict() method finishes by computing the weighted sum of the k-nearest target y values:

. . .
  double result = 0.0;
  for (int i = 0; i < this.k; ++i)
    result += wts[i] * this.trainY[indices[i]];

  return result;
} // Predict
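Calling the completed Predict() method follows the same pattern as the Explain() call shown earlier. For example, assuming the training data has already been stored (the xNew and y variable names are just for illustration):

  double[] xNew =
    new double[] { 0.5, -0.5, 0.5, -0.5, 0.5 };
  double y = knnr_model.Predict(xNew);
  Console.WriteLine("Predicted y = " +
    y.ToString("F4"));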

Computing Skewed Weights
The demo program computes skewed weights using a micro-algorithm I devised a few years ago. It's best explained by example. Suppose k = 5, as in the demo. A uniform weight scheme would return 1/5 = 0.20 for each of the five weight values. The demo program gives more weight to the closest training item by multiplying a uniform weight by 1.5, and gives less weight to the farthest item by multiplying by 0.5. So weight[0] = 0.20 * 1.5 = 0.30 and weight[4] = 0.20 * 0.5 = 0.10.

Because the weights must sum to 1.0, this leaves 1 - (0.30 + 0.10) = 0.60 to be shared by the remaining three weights. Therefore weight[1] = weight[2] = weight[3] = 0.60 / 3 = 0.20.

This skewed weighting technique doesn't work for k = 1, 2 or 3, so for those values of k the weights are returned explicitly:

if (k == 1) {
  result[0] = 1.0;
}
else if (k == 2) {
  result[0] = 0.60; result[1] = 0.40;
}
else if (k == 3) {
  result[0] = 0.40; result[1] = 0.35; result[2] = 0.25;
}

To be sure, these weight values are arbitrary to some extent, but they have worked well for me in practice and in a series of experiments with synthetic data. A perfectly good alternative design is to use uniform weighting for small values of k and use skewed weighting for k = 4 or larger.
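Putting the pieces together, a complete SkewedWts() method based on the description above might look like this. The 1.5 and 0.5 multipliers are the ones used by the demo; the rest of the structure is a sketch rather than the exact demo code:

  private static double[] SkewedWts(int k)
  {
    double[] result = new double[k];
    if (k == 1) {
      result[0] = 1.0;
    }
    else if (k == 2) {
      result[0] = 0.60; result[1] = 0.40;
    }
    else if (k == 3) {
      result[0] = 0.40; result[1] = 0.35;
      result[2] = 0.25;
    }
    else {  // k >= 4
      double unif = 1.0 / k;       // uniform weight
      result[0] = unif * 1.5;      // closest gets more
      result[k - 1] = unif * 0.5;  // farthest gets less
      double rem = 1.0 - (result[0] + result[k - 1]);
      for (int i = 1; i < k - 1; ++i)
        result[i] = rem / (k - 2); // middle items share
    }
    return result;
  }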

In some machine learning libraries, notably scikit-learn, the k-nearest neighbors regression module allows you to weight neighbors by inverse distance. Based on my experience, this weighting technique usually works poorly: when a distance is very small, its inverse is huge and the resulting weight overwhelms all the other weights.
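To see the problem concretely, here is a small hypothetical example of inverse distance weighting (not part of the demo program). Each raw weight is 1 over the distance, and the raw weights are normalized so they sum to 1.0:

  // hypothetical k = 3 nearest distances, one tiny
  double[] dists = new double[] { 0.0001, 0.50, 0.60 };
  double[] wts = new double[dists.Length];
  double sum = 0.0;
  for (int i = 0; i < dists.Length; ++i) {
    wts[i] = 1.0 / dists[i];  // inverse distance
    sum += wts[i];
  }
  for (int i = 0; i < dists.Length; ++i)
    wts[i] /= sum;  // normalize so weights sum to 1.0

The resulting weights are approximately (0.9996, 0.0002, 0.0002), so the single closest training item almost completely determines the prediction.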

Wrapping Up
Weighted k-nearest neighbors regression is arguably the simplest technique for regression. The main strengths of weighted KNNR are simplicity, interpretability and the ability to scale to large (but not huge) datasets. The main weaknesses are that it works only with strictly numeric data and that it is sensitive to the value of the k parameter.

In practice, weighted KNNR is often less accurate than more sophisticated techniques such as kernel ridge regression and neural network regression. But sometimes weighted KNNR works better than more sophisticated techniques. A common approach is to apply weighted KNNR and one or more other regression techniques to see if they all give consistent results.

Most machine learning libraries have a k-nearest neighbors regression module. But implementing weighted KNNR from scratch allows you to fully customize your code, makes it easy to integrate your prediction model with other systems, and gives you a complete understanding of how results are computed.


About the Author

Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].
