The Data Science Lab

Probit Regression Using C#

Probit ("probability unit") regression is a classical machine learning technique that can be used for binary classification -- predicting an outcome that can only be one of two discrete values. For example, you might want to predict the job satisfaction (0 = low satisfaction, 1 = high satisfaction) of a person based on their sex, age, job-type and income.

Probit regression is very similar to logistic regression and the two techniques typically give similar results. Probit regression tends to be used most often with finance and economics data, but both probit regression and logistic regression can be used for any type of binary classification problem.

Figure 1: Probit Regression Using C# in Action
[Click on image for larger view.] Figure 1: Probit Regression Using C# in Action

A good way to see where this article is headed is to take a look at the screenshot of a demo program in Figure 1. The demo sets up 40 data items that represent employees. There are four predictor variables: sex (male = -1, female = +1), age (divided by 100), job-type (mgmt. = 1 0 0, supp = 0 1 0, tech = 0 0 1), and income (divided by $100,000). Each employee has a job satisfaction value encoded as 0 = low, 1 = high. The goal is to predict job satisfaction from the four predictor values.

The demo also sets up an 8-item test dataset to evaluate the model after training. The training and test datasets are completely artificial.

The demo program uses the 40-item training data and a modified form of stochastic gradient descent (SGD) to create a probit regression prediction model. After training, the model scores 75 percent accuracy on the training data (30 out of 40 correct), and also 75 percent accuracy on the test data (6 out of 8 correct).


The demo program concludes by making a prediction for a new, previously unseen data item where sex = male, age = 24, job-type = mgmt., income = $48,500. The output of the model is a p-value of 0.6837. Because the p-value is greater than 0.5 the prediction is job satisfaction = high. If the p-value had been less than 0.5 the prediction would have been job satisfaction = low.

This article assumes you have intermediate or better programming skill with the C# language but doesn't assume you know anything about probit regression. The complete demo code and the associated data are presented in this article. The source code and the data are also available in the accompanying download. All normal error checking has been removed to keep the main ideas as clear as possible.

Understanding Probit Regression
Suppose, as in the demo, you want to predict job satisfaction from sex, age, job-type, and income. The input values corresponding to (male, 24, mgmt, $48,500) are (x0, x1, x2, x3, x4, x5) = (-1, 0.24, 1, 0, 0, 0.485). Numeric values are normalized so that they're all between 0 and 1, Boolean values are minus-one plus-one encoded, and categorical values are one-hot encoded. There are several normalization and encoding alternatives, but the scheme used by the demo is simple and works well in practice.

Each input value has an associated numeric constant called a weight. There is an additional constant called a bias. For the demo problem, the model weights are (w0, w1, w2, w3, w4, w5) = (-0.057, 0.099, 0.460, 0.041, -0.508, -0.113) and the model bias is b = -0.0077.

Figure 2: The phi() Function
[Click on image for larger view.] Figure 2: The phi() Function

The first step is to compute a z-value as the sum of the products of the weights times the inputs, plus the bias:

z = (w0 * x0) +  (w1 * x1) +  (w2 * x2) +  (w3 * x3) +  (w4 * x4) +  (w5 * x5) +  b
  = (-0.057 * -1) + (0.099 * 0.24) + (0.460 * 1) + (0.041 * 0) + (-0.508 * 0) +
      (-0.113 * 0.485) + (-0.0077)
  = 0.0570 + 0.0238 + 0.4600 + 0.0000 + 0.0000 + (-0.0548) + (-0.0077)
  = 0.4783

The second step is to pass the computed z-value to the phi() function:

p = phi(0.4783)
  = 0.6837

The p-value result will be a value between 0.0 and 1.0 where a value less than 0.5 indicates a prediction of class 0 (low satisfaction in the demo) and p-value greater than 0.5 indicates a prediction of class 1 (high satisfaction). Therefore, for this example, the prediction is that the employee has high job satisfaction.

The phi() function is the cumulative density function (CDF) of the standard Normal distribution. For any input z-value, the phi() function returns the area under the Normal distribution from negative infinity to z. The graph in Figure 2 shows the phi() function where an input of z = 0.4783 returns p = 0.6837.

The two main challenges when creating a probit regression model are 1.) training the model to find the values of the weights and the bias, and 2.) writing code to implement the phi() function.

The Demo Program
The complete demo program, with a few minor edits to save space, is presented in Listing 1. To create the program, I launched Visual Studio and created a new C# .NET Core console application named ProbitRegression. I used Visual Studio 2022 (free Community Edition) with .NET Core 6.0, but the demo has no significant dependencies so any version of Visual Studio and .NET library will work fine. You can use the Visual Studio Code program too.

After the template code loaded, in the editor window I removed all unneeded namespace references, leaving just the reference to the top-level System namespace. In the Solution Explorer window, I right-clicked on file Program.cs, renamed it to the more descriptive ProbitProgram.cs, and allowed Visual Studio to automatically rename class Program to ProbitProgram.

Listing 1: Probit Regression Demo Code
using System;  // .NET 6.0
namespace ProbitRegression
{
  class ProbitProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin probit regression demo ");
      Console.WriteLine("Predict job satisfaction " +
        "(0 = low, 1 = high) ");
      Random rnd = new Random(0);  // 28

      Console.WriteLine("Raw data: Sex, Age, Job, Income," +
        " Satisfaction looks like: ");
      Console.WriteLine("Male    66  mgmt  $52,100.00 |  low");
      Console.WriteLine("Female  35  tech  $86,300.00 |  high");
      Console.WriteLine("Male    27  supp  $47,900.00 |  high");
      Console.WriteLine(" . . . ");
      Console.WriteLine("Encoded and normed data looks like: ");
      Console.WriteLine("-1  0.66  1 0 0  0.52100  |  0 ");
      Console.WriteLine(" 1  0.35  0 0 1  0.86300  |  1 ");
      Console.WriteLine("-1  0.27  0 1 0  0.47900  |  1 ");
      Console.WriteLine(" . . . ");

      double[][] trainX = new double[40][];
      trainX[0] = new double[] { -1, 0.66, 1, 0, 0, 0.5210 };
      trainX[1] = new double[] { 1, 0.35, 0, 0, 1, 0.8630 };
      trainX[2] = new double[] { -1, 0.24, 0, 0, 1, 0.4410 };
      trainX[3] = new double[] { 1, 0.43, 0, 1, 0, 0.5170 };
      trainX[4] = new double[] { -1, 0.37, 1, 0, 0, 0.8860 };
      trainX[5] = new double[] { 1, 0.30, 0, 1, 0, 0.8790 };
      trainX[6] = new double[] { 1, 0.40, 1, 0, 0, 0.2020 };
      trainX[7] = new double[] { -1, 0.58, 0, 0, 1, 0.2650 };
      trainX[8] = new double[] { 1, 0.27, 1, 0, 0, 0.8480 };
      trainX[9] = new double[] { -1, 0.33, 0, 1, 0, 0.5600 };
      trainX[10] = new double[] { 1, 0.59, 0, 0, 1, 0.2330 };
      trainX[11] = new double[] { 1, 0.52, 0, 1, 0, 0.8700 };
      trainX[12] = new double[] { -1, 0.41, 1, 0, 0, 0.5170 };
      trainX[13] = new double[] { 1, 0.22, 0, 1, 0, 0.3500 };
      trainX[14] = new double[] { 1, 0.61, 0, 1, 0, 0.2980 };
      trainX[15] = new double[] { -1, 0.46, 1, 0, 0, 0.6780 };
      trainX[16] = new double[] { 1, 0.59, 1, 0, 0, 0.8430 };
      trainX[17] = new double[] { 1, 0.28, 0, 0, 1, 0.7730 };
      trainX[18] = new double[] { -1, 0.46, 0, 1, 0, 0.8930 };
      trainX[19] = new double[] { 1, 0.48, 0, 0, 1, 0.2920 };
      trainX[20] = new double[] { 1, 0.28, 1, 0, 0, 0.6690 };
      trainX[21] = new double[] { -1, 0.23, 0, 1, 0, 0.8970 };
      trainX[22] = new double[] { -1, 0.60, 1, 0, 0, 0.6270 };
      trainX[23] = new double[] { 1, 0.29, 0, 1, 0, 0.7760 };
      trainX[24] = new double[] { -1, 0.24, 0, 0, 1, 0.8750 };
      trainX[25] = new double[] { 1, 0.51, 1, 0, 0, 0.4090 };
      trainX[26] = new double[] { 1, 0.22, 0, 1, 0, 0.8910 };
      trainX[27] = new double[] { -1, 0.19, 0, 0, 1, 0.5380 };
      trainX[28] = new double[] { 1, 0.25, 0, 1, 0, 0.9000 };
      trainX[29] = new double[] { 1, 0.44, 0, 0, 1, 0.8980 };
      trainX[30] = new double[] { -1, 0.35, 1, 0, 0, 0.5380 };
      trainX[31] = new double[] { -1, 0.29, 0, 1, 0, 0.7610 };
      trainX[32] = new double[] { 1, 0.25, 1, 0, 0, 0.3450 };
      trainX[33] = new double[] { 1, 0.66, 1, 0, 0, 0.2210 };
      trainX[34] = new double[] { -1, 0.43, 0, 0, 1, 0.7450 };
      trainX[35] = new double[] { 1, 0.42, 0, 1, 0, 0.8520 };
      trainX[36] = new double[] { -1, 0.44, 1, 0, 0, 0.6580 };
      trainX[37] = new double[] { 1, 0.42, 0, 1, 0, 0.6970 };
      trainX[38] = new double[] { 1, 0.56, 0, 0, 1, 0.3680 };
      trainX[39] = new double[] { -1, 0.38, 1, 0, 0, 0.2600 };

      int[] trainY = new int[40] { 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
        1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0,
        1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1 };

      double[][] testX = new double[8][];
      testX[0] = new double[] { 1, 0.36, 0, 0, 1, 0.8670 };
      testX[1] = new double[] { -1, 0.26, 0, 0, 1, 0.4310 };
      testX[2] = new double[] { 1, 0.42, 0, 1, 0, 0.5190 };
      testX[3] = new double[] { -1, 0.37, 1, 0, 0, 0.8260 };
      testX[4] = new double[] { 1, 0.31, 0, 1, 0, 0.8890 };
      testX[5] = new double[] { 1, 0.42, 1, 0, 0, 0.2040 };
      testX[6] = new double[] { -1, 0.57, 0, 0, 1, 0.2750 };
      testX[7] = new double[] { -1, 0.32, 0, 1, 0, 0.5500 };

      int[] testY = new int[8] { 0, 0, 1, 1, 0, 0, 0, 1 };

      Console.WriteLine("Creating dim = 6 probit model ");
      int dim = 6; // number predictors

      double[] wts = new double[dim];
      double lo = -0.01; double hi = 0.01;
      for (int i = 0; i < wts.Length; ++i)
        wts[i] = (hi - lo) * rnd.NextDouble() + lo;
      double bias = 0.0;

      Console.WriteLine("Starting quasi-SGD training " +
        "with lr = 0.01 ");
      int maxEpochs = 100;
      double lr = 0.01;

      int N = trainX.Length;  // number training items
      int[] indices = new int[N];
      for (int i = 0; i < N; ++i)
        indices[i] = i;

      for (int epoch = 0; epoch < maxEpochs; ++epoch)
      {
        Shuffle(indices, rnd);
        for (int i = 0; i < N; ++i)  // each data item
        {
          int ii = indices[i];
          double[] x = trainX[ii];  // inputs
          double y = trainY[ii];    // target 0 or 1
          double p = ComputeOutput(x, wts, bias);  // [0.0, 1.0]

          for (int j = 0; j < dim; ++j)  // each predictor weight
            wts[j] += lr * x[j] * (y - p) *
              p * (1 - p);  // o-t error => t-o update
          bias += lr * (y - p) * p * (1 - p);  // update bias
        }

        if (epoch % 10 == 0)
        {
          double mse = Error(trainX, trainY, wts, bias);
          double acc = Accuracy(trainX, trainY, wts, bias);
          Console.WriteLine("epoch = " +
            epoch.ToString().PadLeft(4) +
            "  |  loss = " + mse.ToString("F4") +
            "  |  acc = " + acc.ToString("F4"));
        }
      }
      Console.WriteLine("Done ");

      Console.WriteLine("Trained model weights: ");
      ShowVector(wts, 3, false);
      Console.WriteLine("Model bias: " + bias.ToString("F4"));

      double accTrain = Accuracy(trainX, trainY, wts, bias);
      Console.WriteLine("Accuracy on train data = " +
        accTrain.ToString("F4"));

      double accTest = Accuracy(testX, testY, wts, bias);
      Console.WriteLine("Accuracy on test data = " +
        accTest.ToString("F4"));

      Console.WriteLine("Predicting satisfaction for: ");
      Console.WriteLine("Male  24  mgmt  $48,500.00 ");
      double[] xx = new double[] { -1, 0.24, 1, 0, 0, 0.485 };
      double pp = ComputeOutput(xx, wts, bias, true);
      Console.WriteLine("output p = " + pp.ToString("F4"));
      if (pp < 0.5)
        Console.WriteLine("satisfaction = low");
      else
        Console.WriteLine("satisfaction = high");

      Console.WriteLine("End probit regression demo ");
      Console.ReadLine();
    } // Main

    static double Phi(double z)
    {
      // cumulative density of Standard Normal
      // erf is Abramowitz and Stegun 7.1.26

      if (z < -6.0)
        return 0.0;
      if (z > 6.0)
        return 1.0;

      double a0 = 0.3275911;
      double a1 = 0.254829592;
      double a2 = -0.284496736;
      double a3 = 1.421413741;
      double a4 = -1.453152027;
      double a5 = 1.061405429;

      int sign = 0;
      if (z < 0.0)
        sign = -1;
      else
        sign = 1;

      double x = Math.Abs(z) / Math.Sqrt(2.0);  // inefficient
      double t = 1.0 / (1.0 + a0 * x);
      double erf = 1.0 - (((((a5 * t + a4) * t) +
        a3) * t + a2) * t + a1) * t * Math.Exp(-x * x);
      return 0.5 * (1.0 + (sign * erf));
    }

    static double ComputeOutput(double[] x, double[] wts,
      double bias, bool showZ = false)
    {
      double z = 0.0;
      for (int i = 0; i < x.Length; ++i)
        z += x[i] * wts[i];
      z += bias;
      if (showZ == true)
        Console.WriteLine("z = " + z.ToString("F4"));
      return Phi(z);  // probit
      // return Sigmoid(z);  // logistic sigmoid
    }

    static double Error(double[][] dataX, int[] dataY,
      double[] wts, double bias)
    {
      double sum = 0.0;
      int N = dataX.Length;
      for (int i = 0; i < N; ++i)
      {
        double[] x = dataX[i];
        int y = dataY[i];  // target, 0 or 1
        double p = ComputeOutput(x, wts, bias);
        sum += (p - y) * (p - y); // E = (o-t)^2 form
      }
      return sum / N; ;
    }

    static double Accuracy(double[][] dataX, int[] dataY,
      double[] wts, double bias)
    {
      int numCorrect = 0; int numWrong = 0;
      int N = dataX.Length;
      for (int i = 0; i < N; ++i)
      {
        double[] x = dataX[i];
        int y = dataY[i];  // actual, 0 or 1
        double p = ComputeOutput(x, wts, bias);
        if (y == 0 && p < 0.5 || y == 1 && p >= 0.5)
          ++numCorrect;
        else
          ++numWrong;
      }
      return (1.0 * numCorrect) / (numCorrect + numWrong);
    }

    static double AccuracyVerbose(double[][] dataX,
      int[] dataY, double[] wts, double bias)
    {
      int numCorrect = 0; int numWrong = 0;
      int N = dataX.Length;
      for (int i = 0; i < N; ++i)
      {
        Console.WriteLine("=========");
        double[] x = dataX[i];
        int y = dataY[i];  // actual, 0 or 1
        double p = ComputeOutput(x, wts, bias);

        Console.WriteLine(i);
        ShowVector(x, 1, false);
        Console.WriteLine("target = " + y);
        Console.WriteLine("computed = " + p.ToString("f4"));

        if (y == 0 && p < 0.5 || y == 1 && p >= 0.5)
        {
          Console.WriteLine("correct ");
          ++numCorrect;
        }
        else
        {
          Console.WriteLine("wrong ");
          ++numWrong;
        }
        Console.WriteLine("=========");
        Console.ReadLine();
      }
      return (1.0 * numCorrect) / (numCorrect + numWrong);
    }

    static void Shuffle(int[] arr, Random rnd)
    {
      int n = arr.Length;
      for (int i = 0; i < n; ++i)
      {
        int ri = rnd.Next(i, n);
        int tmp = arr[ri];
        arr[ri] = arr[i];
        arr[i] = tmp;
      }
    }

    static void ShowVector(double[] vector, int decs,
      bool newLine)
    {
      for (int i = 0; i < vector.Length; ++i)
        Console.Write(vector[i].ToString("F" + decs) + " ");
      Console.WriteLine("");
      if (newLine == true)
        Console.WriteLine("");
    }

  } // Program
} // ns

comments powered by Disqus

Featured

Subscribe on YouTube