Data Prep for Machine Learning: Outliers -- Visual Studio Magazine

Data Prep for Machine Learning: Outliers

After previously detailing how to examine data files and how to identify and deal with missing data, Dr. James McCaffrey of Microsoft Research now uses a full code sample and step-by-step directions to deal with outlier data.

By James McCaffrey
07/14/2020

This article explains how to programmatically identify and deal with outlier data (it's a follow-up to "Data Prep for Machine Learning: Missing Data"). Suppose you have a data file of loan applications. Examples of outlier data include a person's age of 99 (either a very old applicant or possibly a placeholder value that was never changed) and a person's country of "Cannada" (probably a transcription error).

In situations where the source data file is small, about 500 lines or fewer, you can usually find and deal with outlier data manually. But in almost all realistic scenarios with large datasets you must handle outlier data programmatically.

Preparing data for use in a machine learning (ML) system is time-consuming, tedious, and error-prone. A reasonable rule of thumb is that data preparation requires at least 80 percent of the total time needed to create an ML system. There are three main phases of data preparation: cleaning, normalizing and encoding, and splitting. Each of the three phases has several steps. Dealing with outlier data is part of the data cleaning phase.

A good way to understand outlier data and see where this article is headed is to take a look at the screenshot of a demo program in Figure 1. The demo uses a small text file where each line represents an employee. The demo analyzes a representative numeric column (age) and then analyzes a representative categorical column (region).

Figure 1: Programmatically Dealing with Outlier Data

The demo is a Python language program that examines and performs a series of transformations on the original data. The first five lines of the demo source data are:

M   32   eastern   59200.00   moderate
F   43   central   38400.00   moderate
M   35   central   30800.00   liberal
F   36   ?         47800.00   moderate
M   26   western   53800.00   conservative
. . .

There are five tab-delimited fields: sex, age, region, annual income, and political leaning. The eventual goal is to use the data to create a neural network that predicts political leaning from the other fields. The source data file has been standardized so that all lines have the same number of fields/columns.

The demo begins by displaying the source data file. Next the demo scans through the age column and computes a z-score value for each age. Data lines with outlier values where the z-score is less than -2.0 or greater than +2.0 are displayed. These are line [7] where age = 61 and z = +2.26, and line [9] where age = 3 and z = -2.47.

When a line with an outlier value has been identified, you can do one of three things. You can ignore the data line, you can correct the data line, or you can delete the line. The demo leaves line [7] with age = 61 alone. The implication is that the person is just significantly older than the other people in the dataset. The demo updates line [9] with age = 3 by changing the age to 33. The implication is that the value was incorrectly entered and the correct age value of 33 was located in some way.

Next, the demo scans through the region column and computes a frequency count for each value. There are four instances of "eastern", four instances of "central", and three instances of "western." There is one instance of "?" in line [4] and one instance of "centrel" in line [6]. The demo deletes line [4]. The implication is that "?" was entered as a placeholder value to mean "unknown" and that the correct region value could not be determined. The demo updates line [6] by replacing "centrel" with "central." The implication is that this was a typo.

To summarize, outliers are unusual values. For numeric variables, one way to find outliers is to compute z-scores. For categorical variables, one way to find outliers is to compute frequency counts.

This article assumes you have intermediate or better skill with a C-family programming language. The demo program is coded using Python but you shouldn't have too much trouble refactoring the demo code to another language if you wish. The complete source code for the demo program is presented in this article. The source code is also available in the accompanying file download.

The Data Preparation Pipeline
Although data preparation is different for every source dataset, in general the data preparation pipeline for most ML systems is usually something similar to the steps shown in Figure 2.

Figure 2: Data Preparation Pipeline Typical Tasks

Data preparation for ML is deceptive because the process is conceptually easy. However, there are many steps, and each step is much trickier than you might expect if you're new to ML. This article explains the fifth and sixth steps in Figure 2. Future Data Science Lab articles will explain the other steps. The articles can be found here.

The tasks in Figure 2 are usually not followed strictly sequentially. You often have to backtrack and jump around to different tasks. But it's a good idea to follow the steps shown in order as much as possible. For example, it's better to deal with missing data before dealing with bad data, because after you get rid of missing data, all lines will have the same number of fields which makes it dramatically easier to compute column metrics such as the mean of a numeric field or rare occurrences in a categorical field.

The Demo Program
The structure of the demo program, with a few minor edits to save space, is shown in Listing 1. I indent my Python programs using two spaces, rather than the more common four spaces or a tab character, as a matter of personal preference. The program has six worker functions plus a main() function to control program flow. The purpose of worker functions line_count(), show_file(), delete_lines(), show_numeric_outliers(), show_cat_outliers(), and update_line() should be clear from their names.

Listing 1: Outlier Data Detection Demo Program

# file_outliers.py
# Python 3.7.6  NumPy 1.18.1

import numpy as np

def line_count(fn): . . .

def show_file(fn, start, end, indices=False,
  strip_nl=False): . . .

def delete_lines(src, dest, omit_lines): . . .

def show_numeric_outliers(fn, col, z_max,
  delim): . . .

def show_cat_outliers(fn, col, ct_min,
  delim): . . .

def update_line(src, dest, line_num, col_num,
  new_val, delim): . . .

def main():
  # 1. display source file
  print("\nSource file: ")
  fn = ".\\people_no_missing.txt"
  show_file(fn, 1, 999, indices=True, strip_nl=True)

  # 2. numeric outliers
  print("\nIdentifying outliers in Age column:")
  fn = ".\\people_no_missing.txt"
  show_numeric_outliers(fn, 2, 2.0, "\t")  # age

  print("\nModifying line [9] to age = 33")
  src = ".\\people_no_missing.txt"
  dest = ".\\people_no_missing_update1.txt"
  update_line(src, dest, 9, 2, "33", "\t")

  # 3. categorical outliers
  print("\nExamining Region column:")
  fn = ".\\people_no_missing_update1.txt" 
  show_cat_outliers(fn, 3, ct_min=1, delim="\t")

  print("\nUpdating line [6], deleting line [4]")
  src = ".\\people_no_missing_update1.txt"
  dest = ".\\people_no_missing_update2.txt"
  update_line(src, dest, 6, 3, "central", "\t")

  src = ".\\people_no_missing_update2.txt"
  dest = ".\\people_clean.txt"
  delete_lines(src, dest, [4])

  print("\nCleaned data: ")
  fn = ".\\people_clean.txt"
  show_file(fn, 1, 999, indices=True, strip_nl=True)

if __name__ == "__main__":
  main()

Program execution begins with:

def main():
  # 1. display source file
  print("\nSource file: ")
  fn = ".\\people_no_missing.txt"
  show_file(fn, 1, 999, indices=True, strip_nl=True)
. . .

The first step when working with any machine learning data file is to do a preliminary investigation. The source data is named people_no_missing.txt ("no missing columns") and has only 13 lines to keep the main ideas of dealing with outlier data as clear as possible. The number of lines in the file could have been determined by a call to the line_count() function. The entire data file is examined by a call to show_file() with arguments start=1 and end=999. In most cases you'll examine just specified lines of the data file rather than the entire file.

The indices=True argument instructs show_file() to display 1-based line numbers. With some data preparation tasks it's more natural to use 1-based indexing, but with other tasks it's more natural to use 0-based indexing. Either approach is OK but you've got to be careful of off-by-one errors. The strip_nl=True argument instructs function show_file() to remove trailing newlines from the data lines before printing them to the shell so that there aren't blank lines between data lines in the display.
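
One way to keep the two indexing conventions straight is to let Python do the counting. The short sketch below is illustrative only: the sample lines and variable names are assumptions, not part of the demo program. It uses enumerate() with start=1 to produce 1-based display numbers, while ordinary list indexing stays 0-based.

```python
# illustrative sample lines; not read from the demo data file
lines = ["M\t32\teastern", "F\t43\tcentral", "M\t35\tcentral"]

# 1-based numbering for display, similar in spirit to indices=True
numbered = [(ln, line) for ln, line in enumerate(lines, start=1)]

# 0-based list indexing: 1-based line [2] lives at index 1
line_2 = lines[2 - 1]

for ln, line in numbered:
  print("[%3d]  %s" % (ln, line))
```

Converting explicitly with "line number minus one" at the point of use makes an off-by-one error easier to spot during review.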

The demo continues with:

  # 2. numeric outliers
  print("\nIdentifying outliers in Age column:")
  fn = ".\\people_no_missing.txt"
  show_numeric_outliers(fn, 2, 2.0, "\t")  # age
  print("\nModifying line [9] to age = 33")
  src = ".\\people_no_missing.txt"
  dest = ".\\people_no_missing_update1.txt"
  update_line(src, dest, 9, 2, "33", "\t")
. . .

The call to function show_numeric_outliers() means, "Scan the age values in 1-based column number [2], and display lines where the z-score is less than or equal to -2.0 or greater than or equal to +2.0."
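
To make the z-score idea concrete, here is a minimal sketch of one way such a scan could work. It is not the demo's show_numeric_outliers() function: the helper name find_numeric_outliers(), the in-memory list of lines, and the made-up ages are all assumptions for illustration. The sketch also assumes the population standard deviation and uses the standard library statistics module instead of NumPy to stay dependency-free.

```python
import statistics

def find_numeric_outliers(lines, col, z_max, delim):
  # hypothetical helper, not the demo's show_numeric_outliers()
  # lines: list of delimited strings; col: 1-based column number
  vals = [float(line.rstrip("\n").split(delim)[col - 1])
    for line in lines]
  mean = statistics.mean(vals)
  sd = statistics.pstdev(vals)  # assumption: population std dev
  result = []
  if sd == 0.0:
    return result  # all values identical, so nothing is unusual
  for i, x in enumerate(vals):
    z = (x - mean) / sd
    if z <= -z_max or z >= z_max:
      result.append((i + 1, x, z))  # 1-based line number
  return result

# made-up ages, chosen only to illustrate the idea
sample = ["M\t30", "F\t31", "M\t29", "F\t30", "M\t32",
  "F\t28", "M\t30", "F\t31", "M\t29", "F\t75"]
outliers = find_numeric_outliers(sample, 2, 2.0, "\t")
for (ln, x, z) in outliers:
  print("[%3d]  age = %0.0f  z = %+0.2f" % (ln, x, z))
```

With this made-up data only the last line is flagged. Keep in mind that an extreme value inflates the standard deviation it is measured against, so in a small dataset a threshold such as 2.0 is a heuristic rather than a strict rule.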

The call to function update_line() means, "Take file people_no_missing.txt, change the age value in 1-based column [2] on 1-based line number [9] to "33" and save the result as people_no_missing_update1.txt."

Function update_line() follows a functional programming style: it accepts a source file and writes the result to a destination file, leaving the source unchanged. It's possible to implement update_line() so that the source file is modified in place. I do not recommend this approach. It's true that using source and destination files in a data preparation pipeline creates several intermediate files. But you can always delete intermediate files when they're no longer needed. If you corrupt a data file, especially a large one, recovering your data can be very painful or, in some cases, impossible.
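
The source-to-destination pattern can be sketched as follows. This is an assumption-laden illustration, not the demo's update_line() implementation: the helper name write_updated_copy(), the temporary file names, and the two-line sample data are all hypothetical.

```python
import os
import tempfile

def write_updated_copy(src, dest, line_num, col_num,
  new_val, delim):
  # hypothetical helper, not the demo's update_line()
  # line_num and col_num are 1-based; src is only read
  # src and dest must be different files
  with open(src, "r") as fin, open(dest, "w") as fout:
    for ln, line in enumerate(fin, start=1):
      if ln == line_num:
        tokens = line.rstrip("\n").split(delim)
        tokens[col_num - 1] = new_val
        line = delim.join(tokens) + "\n"
      fout.write(line)

# usage with a throwaway directory and made-up data
with tempfile.TemporaryDirectory() as tmp:
  src = os.path.join(tmp, "src.txt")
  dest = os.path.join(tmp, "dest.txt")
  with open(src, "w") as f:
    f.write("M\t32\teastern\nF\t3\tcentral\n")
  write_updated_copy(src, dest, 2, 2, "33", "\t")
  with open(src, "r") as f:
    src_text = f.read()
  with open(dest, "r") as f:
    dest_text = f.read()
```

Because the source is opened only for reading, a bug in the update logic can damage the destination copy at worst, and the step can simply be rerun.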

The demo program examines only the age column. In a non-demo scenario you should examine all numeric columns. The demo continues by examining the region column of categorical values and updating line [6] from "centrel" to "central":

  # 3. categorical outliers
  print("\nExamining Region column:")
  fn = ".\\people_no_missing_update1.txt" 
  show_cat_outliers(fn, 3, ct_min=1, delim="\t")

  print("\nUpdating line [6], deleting line [4]")
  src = ".\\people_no_missing_update1.txt"
  dest = ".\\people_no_missing_update2.txt"
  update_line(src, dest, 6, 3, "central", "\t")
. . .

The call to function show_cat_outliers() means, "Scan the region values in 1-based column [3], and display lines where a region value occurs one time or fewer." Note that "one time or fewer" usually means exactly one time because a frequency count can't be zero unless an external list of possible values is supplied to the show_cat_outliers() function.
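
A frequency-count scan along these lines might look like the sketch below. It is not the demo's show_cat_outliers() function: the helper name find_rare_values(), the use of collections.Counter, and the made-up sample lines are assumptions for illustration.

```python
from collections import Counter

def find_rare_values(lines, col, ct_min, delim):
  # hypothetical helper, not the demo's show_cat_outliers()
  # lines: list of delimited strings; col: 1-based column number
  vals = [line.rstrip("\n").split(delim)[col - 1]
    for line in lines]
  counts = Counter(vals)  # value -> number of occurrences
  rare = [(i + 1, v) for i, v in enumerate(vals)
    if counts[v] <= ct_min]  # 1-based line numbers
  return counts, rare

# made-up region values, chosen only to illustrate the idea
sample = ["M\teastern", "F\tcentral", "M\tcentral", "F\t?",
  "M\twestern", "F\tcentrel", "M\teastern", "F\twestern"]
counts, rare = find_rare_values(sample, 2, 1, "\t")
for v, ct in counts.items():
  print("%-10s %d" % (v, ct))
for ln, v in rare:
  print("[%3d]  %s" % (ln, v))
```

A rare value is only a candidate outlier. A legitimate category can also occur once in a small file, so each flagged line still needs a human decision.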

The demo program concludes by deleting line [4] which has a region value of "?" and then displaying the final people_clean.txt result file:

. . . 
  src = ".\\people_no_missing_update2.txt"
  dest = ".\\people_clean.txt"
  delete_lines(src, dest, [4])

  print("\nCleaned data: ")
  fn = ".\\people_clean.txt"
  show_file(fn, 1, 999, indices=True, strip_nl=True)

if __name__ == "__main__":
  main()

Notice that the order of operations matters: updating and then deleting is not the same as deleting and then updating. An update leaves line numbering unchanged. But if you deleted line [4] first, every later line would shift up by one, so a subsequent update of line [6] would modify the wrong line.
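
The effect of operation order can be shown with a small in-memory sketch. The list of placeholder rows and the helper functions update() and delete() are hypothetical stand-ins for the file-based functions, used only to make the line-number shift visible.

```python
# placeholder rows; line [4] has "?" and line [6] has a typo
rows = ["r1", "r2", "r3", "r4 ?", "r5", "r6 centrel", "r7"]

def update(rows, line_num, new_val):
  # returns a copy with the 1-based line_num replaced
  out = list(rows)
  out[line_num - 1] = new_val
  return out

def delete(rows, omit_lines):
  # returns a copy without the 1-based lines in omit_lines
  return [r for ln, r in enumerate(rows, start=1)
    if ln not in omit_lines]

# update first, then delete: the typo is fixed
good = delete(update(rows, 6, "r6 central"), [4])

# delete first, then update: old line [7] is now line [6]
bad = update(delete(rows, [4]), 6, "r6 central")

print(good)
print(bad)
```

In the second ordering the typo survives and an unrelated row is overwritten, which is the kind of silent corruption that is hard to notice in a large file.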

Exploring the Data
When working with data for an ML system you always need to determine how many lines there are in the data, how many columns/fields there are on each line, and what type of delimiter is used. The demo defines a function line_count() as:

def line_count(fn):
  ct = 0
  fin = open(fn, "r")
  for line in fin:
    ct += 1
  fin.close()
  return ct

The file is opened for reading and then traversed using the Python for-in idiom. Each line of the file, including the terminating newline character, is stored in a variable named "line", but that variable isn't used. There are many alternative approaches.
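
As one example of an alternative, the sketch below counts lines with a with-statement and a generator expression. The function name count_lines() and the temporary sample file are assumptions for illustration. The result should match line_count(), and the with-statement closes the file even if an error occurs while reading.

```python
import os
import tempfile

def count_lines(fn):
  # hypothetical alternative to line_count()
  with open(fn, "r") as fin:
    return sum(1 for _ in fin)  # one tick per line read

# usage with a throwaway three-line file
with tempfile.TemporaryDirectory() as tmp:
  fn = os.path.join(tmp, "sample.txt")
  with open(fn, "w") as f:
    f.write("a\nb\nc\n")
  n = count_lines(fn)
print(n)
```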

The definition of function show_file() is presented in Listing 2. As is the case with all data preparation functions, there are many possible implementations.

Listing 2: Displaying Specified Lines of a File

def show_file(fn, start, end, indices=False,
  strip_nl=False):
  fin = open(fn, "r")

  ln = 1            # advance to start line
  while ln < start:
    fin.readline()
    ln += 1

  while ln <= end:    # show specified lines
    line = fin.readline()
    if line == "": break  # EOF
    if strip_nl == True:
      line = line.strip()
    if indices == True:
      print("[%3d]  " % ln, end="")
    print(line)
    ln += 1
  fin.close()

Because the while-loop exits with a break statement when readline() returns an empty string at end-of-file, if you specify an end parameter value that's greater than the number of lines in the source file, such as 999 for the 13-line demo data, the display ends after the last line has been printed, which is usually what you want.
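
As an illustration of another possible implementation, the sketch below selects a range of lines with itertools.islice instead of two while-loops. The function name read_lines() and the temporary sample file are assumptions. Unlike show_file(), this version returns the selected lines as a list rather than printing them.

```python
import os
import tempfile
from itertools import islice

def read_lines(fn, start, end):
  # hypothetical alternative; start and end are 1-based, inclusive
  # islice stops quietly if end is past the last line
  with open(fn, "r") as fin:
    return [line.rstrip("\n")
      for line in islice(fin, start - 1, end)]

# usage with a throwaway four-line file
with tempfile.TemporaryDirectory() as tmp:
  fn = os.path.join(tmp, "sample.txt")
  with open(fn, "w") as f:
    f.write("a\nb\nc\nd\n")
  middle = read_lines(fn, 2, 3)
  all_lines = read_lines(fn, 1, 999)

for ln, line in enumerate(all_lines, start=1):
  print("[%3d]  %s" % (ln, line))
```

Returning a list keeps the reading logic separate from the display logic, at the cost of holding the selected lines in memory.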
