The Data Science Lab
Linear Ridge Regression Using C#
The Predict Method
The Predict() method is simple. The method assumes that the constant / bias is located at index [0] of the coeffs vector. Therefore, input x[i] is associated with coeffs[i+1].
public double Predict(double[] x)
{
// constant at coeffs[0]
double sum = 0.0;
int n = x.Length; // number predictors
for (int i = 0; i < n; ++i)
sum += x[i] * this.coeffs[i + 1];
sum += this.coeffs[0]; // add the constant
return sum;
}
In a non-demo scenario you might want to add normal error checking, such as making sure that the length of the input x vector is equal to 1 less than the length of the coeffs vector.
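A minimal sketch of such a check, placed at the top of Predict(), might be (the exception type and message are illustrative choices, not part of the demo code):

if (x.Length != this.coeffs.Length - 1)
  throw new ArgumentException(
    "x must have Length equal to number of coefficients minus 1");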
The Train Method
The Train() method, shown in Listing 1, is short but dense because most of the work is farmed out to matrix functions in the Utils class. The training code starts with:
public void Train(double[][] trainX, double[] trainY)
{
double[][] DX =
Utils.MatMakeDesign(trainX); // add 1s column
. . .
The MatMakeDesign() function programmatically adds a leading column of 1s to act as inputs for the constant / bias term. An alternative approach is to physically add a leading column of 1s to the training data.
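The implementation of MatMakeDesign() isn't shown in this section. A minimal sketch that copies the training data into a new matrix with a leading column of 1s might look like this (the actual Utils version may differ in details):

// sketch of a design-matrix helper: prepend a column of 1s
static double[][] MatMakeDesign(double[][] m)
{
  int nRows = m.Length;
  int nCols = m[0].Length;
  double[][] result = new double[nRows][];
  for (int i = 0; i < nRows; ++i)
  {
    result[i] = new double[nCols + 1];
    result[i][0] = 1.0;  // input for the constant / bias term
    for (int j = 0; j < nCols; ++j)
      result[i][j + 1] = m[i][j];
  }
  return result;
}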
Next, the DXt * DX matrix is computed:
// coeffs = inv(DXt * DX) * DXt * Y
double[][] a = Utils.MatTranspose(DX);
double[][] b = Utils.MatProduct(a, DX);
Breaking the code down into small intermediate statements is not necessary but makes the code easier to debug, compared to a single statement like double[][] b = Utils.MatProduct(Utils.MatTranspose(DX), DX).
At this point the alpha / noise value is added to the diagonal of the DXt * DX matrix:
for (int i = 0; i < b.Length; ++i)
b[i][i] += this.alpha;
If alpha = 0, then no matrix conditioning or L2 regularization is applied and linear ridge regression reduces to standard linear regression.
The Train() method concludes with:
. . .
double[][] c = Utils.MatInverse(b);
double[][] d = Utils.MatProduct(c, a);
double[][] Y = Utils.VecToMat(trainY, DX.Length, 1);
double[][] e = Utils.MatProduct(d, Y);
this.coeffs = Utils.MatToVec(e);
}
The utility MatInverse() function uses Crout's decomposition algorithm. A significantly different approach is to avoid computing an explicit matrix inverse and instead use an indirect technique called QR decomposition. The QR decomposition approach is used by many machine learning libraries because it is more numerically stable than explicitly inverting the DXt * DX matrix. However, using matrix inverse is a much cleaner design in my opinion.
Notice that in order to multiply by the target y values in trainY, the target values must first be converted from a one-dimensional vector to a two-dimensional matrix with a single column using the VecToMat() function. Then, after the matrix multiplication, the resulting single-column matrix must be converted back to a vector using the MatToVec() function. Details like this can be a major source of bugs during development.
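The implementations of the two conversion helpers aren't shown in this section. Minimal sketches that are consistent with how they're used above might look like this (the actual Utils versions may differ in details):

// sketch: copy a vector into an nRows x nCols matrix, row by row
static double[][] VecToMat(double[] vec, int nRows, int nCols)
{
  double[][] result = new double[nRows][];
  int k = 0;
  for (int i = 0; i < nRows; ++i)
  {
    result[i] = new double[nCols];
    for (int j = 0; j < nCols; ++j)
      result[i][j] = vec[k++];
  }
  return result;
}

// sketch: flatten a matrix back to a vector, row by row
static double[] MatToVec(double[][] m)
{
  int nRows = m.Length;
  int nCols = m[0].Length;
  double[] result = new double[nRows * nCols];
  int k = 0;
  for (int i = 0; i < nRows; ++i)
    for (int j = 0; j < nCols; ++j)
      result[k++] = m[i][j];
  return result;
}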
Because the linear ridge regression training algorithm presented in this article inverts a matrix, the technique doesn't scale to problems with huge amounts of training data. In such situations, it is possible to use stochastic gradient descent (SGD) to estimate model coefficients and constant / bias. You can see an example of SGD training on the data used in my article, "Linear Ridge Regression from Scratch Using C# with Stochastic Gradient Descent."
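As a rough illustration of the SGD idea, not the code from that article, a sketch might look like the following. The learning rate, epoch count and the decision not to penalize the constant term are illustrative choices:

// rough sketch of SGD training for linear ridge regression
static double[] TrainSGD(double[][] trainX, double[] trainY,
  double alpha, double lrnRate = 0.01, int maxEpochs = 1000)
{
  int n = trainX.Length;
  int dim = trainX[0].Length;
  double[] coeffs = new double[dim + 1];  // [0] is the constant / bias
  Random rnd = new Random(0);
  int[] indices = new int[n];
  for (int i = 0; i < n; ++i) indices[i] = i;

  for (int epoch = 0; epoch < maxEpochs; ++epoch)
  {
    // scramble the visit order (Fisher-Yates shuffle)
    for (int i = n - 1; i > 0; --i)
    {
      int j = rnd.Next(i + 1);
      int tmp = indices[i]; indices[i] = indices[j]; indices[j] = tmp;
    }
    foreach (int idx in indices)
    {
      double[] x = trainX[idx];
      double pred = coeffs[0];                 // compute prediction
      for (int k = 0; k < dim; ++k)
        pred += x[k] * coeffs[k + 1];
      double err = pred - trainY[idx];
      coeffs[0] -= lrnRate * err;              // constant is not penalized
      for (int k = 0; k < dim; ++k)            // gradient plus L2 penalty
        coeffs[k + 1] -= lrnRate * (err * x[k] + alpha * coeffs[k + 1]);
    }
  }
  return coeffs;
}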
Wrapping Up
To recap, linear ridge regression is essentially standard linear regression with L2 regularization added to prevent huge model coefficient values that can cause model overfitting. The weakness of linear ridge regression compared to more powerful techniques such as kernel ridge regression and neural network regression is that LRR assumes the relationship between the predictor variables and the target variable is linear. Even when an LRR model doesn't predict well, LRR is still useful to establish baseline results. The main advantage of LRR is simplicity.
The demo program uses strictly numeric predictor variables. LRR can also handle categorical data such as a predictor variable of color with possible values red, blue and green. Each categorical value can be one-hot encoded, for example red = (1, 0, 0), blue = (0, 1, 0) and green = (0, 0, 1). In standard linear regression it's usually a mistake to use one-hot encoding because the encoded columns are linearly dependent (together with the leading column of 1s they always sum to 1), which makes the DXt * DX matrix singular so matrix inversion fails. The alpha value added to the diagonal in ridge regression avoids that problem. Standard linear regression uses a somewhat ugly technique (in my opinion) called dummy coding for categorical predictor variables.
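A minimal one-hot encoding helper for such a predictor might look like this sketch (the category list and ordering are arbitrary choices):

// sketch: one-hot encode a categorical value such as a color
static double[] OneHotEncode(string value, string[] categories)
{
  double[] result = new double[categories.Length];  // all 0.0
  for (int i = 0; i < categories.Length; ++i)
    if (categories[i] == value) { result[i] = 1.0; break; }
  return result;
}

// example: OneHotEncode("blue", new string[] { "red", "blue", "green" })
// returns (0, 1, 0)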
Implementing LRR from scratch requires more effort than using a library such as scikit-learn. But implementing from scratch allows you to customize your code, makes it easier to integrate with other systems, and gives you a complete understanding of how LRR works.
About the Author
Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].