Module 5: Machine Learning Basics
Topic 4: Training Data and Ethics
How training data affects your model
We have already studied issues with non-representative data, especially in our deep dive into facial recognition with the opening movie. That could be the main topic for this part of the module, but I'm going to cover a few other things instead. Before we leave the topic entirely, though, I wanted to share a study done by NIST (the National Institute of Standards and Technology) about facial recognition on Caucasian and Asian faces.
- This is an article in the popular press (Scientific American) about the study.
- This is the original paper released by NIST. Since this is a mixed undergrad/grad class, I’m not asking you to read the study in depth but do at least look through it and examine the figures in particular. They did a very interesting comparison of human versus algorithm performance on the two varieties of faces!
How the choice of performance metric affects your outcome
Beginners in ML are often tempted to use accuracy as their performance metric. My theory is that students are simply used to evaluating their own performance with accuracy in their classes (accuracy = percent score on exams, homework assignments, etc.). However, accuracy is a very poor metric in most real-world circumstances: it is not a good metric for data with more than two classes, and it especially falls down for imbalanced data.
- Quick read on this
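To see why accuracy falls down, here is a minimal sketch with made-up numbers: on a 95/5 imbalanced dataset, a useless model that always predicts the majority class still scores 95% accuracy, while its recall on the minority class reveals the problem.

```python
# Hypothetical toy data: 95 negatives, 5 positives (a 95/5 imbalance)
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a "model" that predicts the majority class every time

# Accuracy: fraction of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: of the 5 true positives, how many did we find?
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(t == 1 for t in y_true)

print(accuracy)  # 0.95 -- looks great!
print(recall)    # 0.0  -- but the model never finds a single positive case
```

The same effect shows up with real models: on imbalanced data, a high accuracy number can hide the fact that the minority class is being ignored entirely.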
Looking at one application area specifically, consider your weather forecasts. As with the previous scientific article I linked, you don't have to read this in depth because that's beyond the scope of an undergraduate class (although the grad students should be able to read scientific papers!), but do at least read through it to get the main ideas of the different ways one can evaluate a forecast, and think about how that affects your choice of metric for training your ML model.
How to handle imbalanced data
If you do ML outside of class, you are extremely likely to run into imbalanced data, and I want to make sure you know the standard ways to address it. And even if you never do any ML again outside of class, two of your class projects will be on ML, and you will quite likely run into imbalanced-data issues there. For your projects, you will be in charge of creating your data, so you will also be in charge of ensuring your data is as balanced as possible! To get you started, I have some readings about different ways of handling imbalanced data.
The three main approaches are typically:
- Oversample your minority class
- Undersample your majority class
- Perform data augmentation in some form (this means creating synthetic new data based on your existing data)
The articles below go into more detail on these approaches.
- Handling imbalanced data: 7 innovative techniques for successful analysis
- Handling imbalanced datasets in machine learning
- https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalance
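As a quick sketch of the first two approaches (using a hypothetical 90/10 dataset, not anything from the readings), random oversampling just draws minority examples with replacement until the classes match, and random undersampling throws away majority examples:

```python
import random

random.seed(0)  # for reproducibility

# Hypothetical imbalanced dataset: (features, label) pairs, 90/10 split
data = [([i], 0) for i in range(90)] + [([i], 1) for i in range(10)]
majority = [d for d in data if d[1] == 0]
minority = [d for d in data if d[1] == 1]

# Oversample: draw minority examples WITH replacement up to the majority size
oversampled = majority + random.choices(minority, k=len(majority))

# Undersample: keep only a random subset of the majority class
undersampled = random.sample(majority, k=len(minority)) + minority

print(len(oversampled))   # 180 -> 90 examples of each class
print(len(undersampled))  # 20  -> 10 examples of each class
```

Note the trade-off: oversampling duplicates minority examples (risking overfitting to them), while undersampling discards majority data you may have worked hard to collect. The articles discuss when each is appropriate.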
One variant of data augmentation that gets mentioned in the articles is called SMOTE (Synthetic Minority Over-sampling Technique). This article is specific to SMOTE.
- https://medium.com/analytics-vidhya/handling-imbalanced-data-by-oversampling-with-smote-and-its-variants-23a4bf188eaf
- If you get excited about SMOTE, here is the original journal paper
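To give you the core idea without reading the paper first, here is a simplified sketch of SMOTE's main move (my own toy implementation, not a library function): instead of duplicating minority samples, it creates synthetic ones by interpolating between a minority sample and one of its nearest minority-class neighbors.

```python
import numpy as np

rng = np.random.default_rng(0)  # for reproducibility

def smote_sketch(X_min, n_new, k=3):
    """Generate n_new synthetic minority samples by interpolating between
    a randomly chosen minority sample and one of its k nearest neighbors
    (within the minority class). A teaching sketch, not production code."""
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from sample i to every minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]  # skip index i itself (distance 0)
        j = rng.choice(neighbors)
        gap = rng.random()  # random interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Hypothetical 2-D minority-class samples
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.2]])
X_new = smote_sketch(X_minority, n_new=4)
print(X_new.shape)  # (4, 2): four new synthetic minority samples
```

Because each synthetic point lies on a line segment between two real minority samples, SMOTE fills in the minority region of feature space rather than just stacking exact copies, which is why it is usually preferred over plain random oversampling.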
Exercise
This one is more of a quiz than our usual ethics exercise but hopefully makes you think a bit about implications of metrics and data. Complete the exercise on ethics & sampling.