Module 5: Machine Learning Basics

Topic 4: Training Data and Ethics

How training data affects your model

We have already studied issues with non-representative data, especially in our deep dive into facial recognition with the opening movie.  That could be the main topic for this part of the module, but I’m going to jump into a few other things instead.  Before we leave the topic entirely, though, I want to share a study done by NIST (National Institute of Standards and Technology) about facial recognition on Caucasian and Asian faces.

How the choice of performance metric affects your outcome

Beginners in ML are often tempted to use accuracy as their performance metric.  My theory is that so many students are used to evaluating their own performance with accuracy in their classes (accuracy = percent score on exams, homework assignments, etc.).  However, accuracy is a very poor metric in most real-world circumstances.  It is not a good metric for data with more than two classes, and it especially falls down for imbalanced data, since a model can score well simply by always predicting the majority class.
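To make the imbalanced-data point concrete, here is a minimal sketch in plain Python.  The dataset (10 positives out of 1000 examples) and the "always predict negative" model are made up purely for illustration; they are not from any of the readings.  Recall is used here as just one example of an alternative metric, not as a recommendation for every problem.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model found."""
    true_pos = sum(t == positive and p == positive
                   for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical imbalanced dataset: 10 positives, 990 negatives.
y_true = [1] * 10 + [0] * 990

# A "model" that has learned nothing and always predicts the majority class.
y_pred = [0] * 1000

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)

print(f"accuracy = {acc:.2f}")  # accuracy = 0.99
print(f"recall   = {rec:.2f}")  # recall   = 0.00
```

The accuracy looks excellent even though the model never finds a single positive example, which is exactly the situation you want your metric to expose.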

Looking specifically at one application area, consider weather forecasts.  As with the previous scientific article that I linked, you don’t have to read this in depth because that’s beyond the scope of an undergraduate class (but the grad students should be able to read scientific papers!).  At least read through it to get the main ideas about the different ways one can evaluate a forecast, and think about how that affects your choice of metric for training your ML model.

How to handle imbalanced data

If you do ML outside of class, you are extremely likely to run into imbalanced data, and I want to make sure you know the standard ways to address it.  Even if you never do any ML again outside of class, two of your class projects will be on ML, and you will quite likely run into imbalanced data issues.  For your projects, you will be in charge of creating your data, so you will also be in charge of ensuring your data is as balanced as possible!  To get you started, I have some readings about different ways of handling imbalanced data.

The three main approaches are typically:

  • Oversample your minority class
  • Undersample your majority class
  • Perform data augmentation in some form (this means creating synthetic new data based on your existing data)
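The first two approaches can be sketched in a few lines of plain Python.  This is only an illustration of random oversampling and random undersampling; the function names and the toy dataset are hypothetical, and the articles may describe more refined variants.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen examples until every class
    is as large as the biggest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        # Sampling WITH replacement, so copies of the same example can repeat.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_samples.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_samples, out_labels


def random_undersample(samples, labels, seed=0):
    """Randomly discard examples until every class
    is as small as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_samples, out_labels = [], []
    for cls in counts:
        pool = [s for s, l in zip(samples, labels) if l == cls]
        # Sampling WITHOUT replacement, so no example is kept twice.
        out_samples.extend(rng.sample(pool, target))
        out_labels.extend([cls] * target)
    return out_samples, out_labels


# Toy dataset: 90 examples of class 0 and 10 examples of class 1.
samples = list(range(100))
labels = [0] * 90 + [1] * 10

_, over_labels = random_oversample(samples, labels)
_, under_labels = random_undersample(samples, labels)

print(Counter(over_labels))   # both classes now have 90 examples
print(Counter(under_labels))  # both classes now have 10 examples
```

Notice the trade-off: oversampling repeats the same minority examples many times, while undersampling throws away most of the majority class.  Either way, resample only your training data so that your test data still reflects the real class balance.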

The articles go into more detail on these approaches.

One variant of data augmentation that gets mentioned in the articles is called SMOTE (Synthetic Minority Over-sampling Technique).  This article is specific to SMOTE.
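The core idea of SMOTE is to create new minority examples by interpolating between an existing minority example and one of its nearest minority-class neighbors.  The following is a simplified sketch of that idea only, not a full implementation: the function name `smote_sketch`, the parameters, and the toy points are all hypothetical, and it assumes numeric features and Python 3.8+ (for `math.dist`).

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points from a list of minority-class
    points (tuples of numbers) by interpolating toward near neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority example as the starting point.
        base = rng.choice(minority)
        # Find its k nearest minority-class neighbors (excluding itself).
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Place the new point somewhere on the line segment
        # between the base point and the chosen neighbor.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, neighbor)))
    return synthetic


# Toy minority class: five 2-D points (made up for illustration).
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
new_points = smote_sketch(minority_points, n_new=4)

for point in new_points:
    print(point)
```

Because every synthetic point lies between two real minority examples, the new data stays inside the region the minority class already occupies rather than being an exact copy of an existing example.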


This one is more of a quiz than our usual ethics exercise, but hopefully it makes you think a bit about the implications of metrics and data.  Complete the exercise on ethics & sampling.