Module 5: Machine Learning Basics
Topic 4: Training Data and Ethics
How training data affects your model
We have already studied issues with non-representative data, especially in our deep dive into facial recognition with the opening movie. That could be the main topic for this part of the module, but I'm going to cover a few other things instead. Before we leave the topic entirely, though, I wanted to share a study done by NIST (the National Institute of Standards and Technology) about facial recognition on Caucasian and Asian faces.
- This is an article in the popular press (Scientific American) about the study.
- This is the original paper released by NIST. Since this is a mixed undergrad/grad class, I’m not asking you to read the study in depth but do at least look through it and examine the figures in particular. They did a very interesting comparison of human versus algorithm performance on the two varieties of faces!
How the choice of performance metric affects your outcome
Beginners in ML are often tempted to use accuracy as their performance metric. My theory is that students are simply used to evaluating their own performance with accuracy in their classes (accuracy = percent score on exams, homework assignments, etc.). However, accuracy is a very poor metric in most real-world circumstances: it is not a good metric for data with more than two classes, and it especially falls down for imbalanced data.
- Quick read on this
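To see why accuracy falls down, here is a minimal sketch with made-up numbers: on a 95/5 imbalanced dataset, a useless model that always predicts the majority class still scores 95% accuracy, while its recall on the minority class reveals the problem.

```python
# Hypothetical toy data: 95 negatives, 5 positives (a 95/5 imbalance)
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a "model" that predicts the majority class every time

# Accuracy: fraction of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: of the 5 true positives, how many did we find?
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(t == 1 for t in y_true)

print(accuracy)  # 0.95 -- looks great!
print(recall)    # 0.0  -- but the model never finds a single positive case
```

The same effect shows up with real models: on imbalanced data, a high accuracy number can hide the fact that the minority class is being ignored entirely.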
Looking at one application area specifically, consider your weather forecasts. As with the previous scientific article I linked, you don't have to read this in depth because that's beyond the scope of an undergraduate class (although the grad students should be able to read scientific papers!), but do at least read through it to get the main ideas of the different ways one can evaluate a forecast, and think about how that affects your choice of metric for training your ML model.
How to handle imbalanced data
If you do ML outside of class, you are extremely likely to run into imbalanced data, and I want to make sure you know the standard ways to address it. And even if you never do any ML again outside of class, two of your class projects will be on ML, and you will quite likely run into imbalanced-data issues there. For your projects, you will be in charge of creating your data, so you will also be in charge of ensuring your data is as balanced as possible! To get you started, I have some readings about different ways of handling imbalanced data.
The three main approaches are typically:
- Oversample your minority class
- Undersample your majority class
- Perform data augmentation in some form (this means creating synthetic new data based on your existing data)
The articles below go into more detail on these approaches.
- Handling imbalanced data: 7 innovative techniques for successful analysis
- Handling imbalanced datasets in machine learning
- https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalance
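As a quick sketch of the first two approaches (using a hypothetical 90/10 dataset, not anything from the readings), random oversampling just draws minority examples with replacement until the classes match, and random undersampling throws away majority examples:

```python
import random

random.seed(0)  # for reproducibility

# Hypothetical imbalanced dataset: (features, label) pairs, 90/10 split
data = [([i], 0) for i in range(90)] + [([i], 1) for i in range(10)]
majority = [d for d in data if d[1] == 0]
minority = [d for d in data if d[1] == 1]

# Oversample: draw minority examples WITH replacement up to the majority size
oversampled = majority + random.choices(minority, k=len(majority))

# Undersample: keep only a random subset of the majority class
undersampled = random.sample(majority, k=len(minority)) + minority

print(len(oversampled))   # 180 -> 90 examples of each class
print(len(undersampled))  # 20  -> 10 examples of each class
```

Note the trade-off: oversampling duplicates minority examples (risking overfitting to them), while undersampling discards majority data you may have worked hard to collect. The articles discuss when each is appropriate.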
One variant of data augmentation that gets mentioned in the articles is called SMOTE (Synthetic Minority Over-sampling Technique). This article is specific to SMOTE.
- https://medium.com/analytics-vidhya/handling-imbalanced-data-by-oversampling-with-smote-and-its-variants-23a4bf188eaf
- If you get excited about SMOTE, here is the original journal paper
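To give you the core idea without reading the paper first, here is a simplified sketch of SMOTE's main move (my own toy implementation, not a library function): instead of duplicating minority samples, it creates synthetic ones by interpolating between a minority sample and one of its nearest minority-class neighbors.

```python
import numpy as np

rng = np.random.default_rng(0)  # for reproducibility

def smote_sketch(X_min, n_new, k=3):
    """Generate n_new synthetic minority samples by interpolating between
    a randomly chosen minority sample and one of its k nearest neighbors
    (within the minority class). A teaching sketch, not production code."""
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from sample i to every minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]  # skip index i itself (distance 0)
        j = rng.choice(neighbors)
        gap = rng.random()  # random interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Hypothetical 2-D minority-class samples
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.2]])
X_new = smote_sketch(X_minority, n_new=4)
print(X_new.shape)  # (4, 2): four new synthetic minority samples
```

Because each synthetic point lies on a line segment between two real minority samples, SMOTE fills in the minority region of feature space rather than just stacking exact copies, which is why it is usually preferred over plain random oversampling.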
Exercise
This one is more of a quiz than our usual ethics exercise but hopefully makes you think a bit about implications of metrics and data. Complete the exercise on ethics & sampling.