CS 2 (Winter 2025) Lab 06: Decision Trees

This lab focuses on using recursion in tree algorithms and construction. It also introduces some terminology used in machine learning.

Setup

Register for the lab using http://register.caltech.codes/ and clone the repository as explained in the setup instructions.

Goals and Outcomes

In this lab, you will write a simple machine learning algorithm that uses a tree!

By the end of this lab, you will…

Machine Learning with Decision Trees

Machine learning (ML) is all about using patterns in data to solve problems that are difficult to pin down mathematically. Typically, we build a model which aims to predict the “right” outcome given an input. This is where trees come in! One way of modeling a complex function in an intuitive manner is with a decision tree.

We will discuss how to use attributes and features to try to predict outcomes. In this lab, when we refer to an attribute, we are referring to an abstract quality or feature that can be used to categorize something. When we refer to a feature, we are referring to the realization of an attribute. For example, a Caltech student has several attributes, such as their grade and major. The specific values of these attributes are the student’s features. For example, we could consider a specific Caltech student which has “grade”=”Sophomore” and “major”=”Mechanical Engineering”.

We can use attributes and features to make a decision tree which attempts to predict an outcome - which can be the effects of an event or the answer to a question. For example, imagine all Mechanical and Electrical Engineering Caltech students had to take a thermodynamics exam. Since MechEs learn thermodynamics and EEs don’t, the raw data would show a greater percentage of MechEs passed the exam compared to EEs, and that less frosh MechE majors passed compared to non-frosh. Therefore, a decision tree made from this dataset might look like this:

titantic

Now, let’s imagine we have the features of a new student who missed the exam. For example, “grade”=”Junior” and “major”=”Mechanical Engineering”. Using these features, we can navigate down our decision tree and return a predicted outcome:

titantic

ID3 Algorithm

So, how do we actually make a decision tree from a dataset? In this lab, we will use the Iterative Dichotomizer 3 (ID3) algorithm to predict whether a passenger survived the Titanic disaster. We will look at several (problematic) attributes of the passengers (e.g., “age”) and try to use each passenger’s features (realizations of each attribute, e.g. “age”=”old” or “age”=”young”) to predict their outcome (e.g., dead or alive).

To do this, we will build a tree where each non-leaf node represents an attribute and each leaf node represents an outcome. The features are represented as edges and determine which path down the tree a given passenger (datapoint) takes, and ultimately their predicted outcome.

For example, ID3 might output a tree that looks like the following:

titantic

We need some metric to decide how to build our decision tree from the available attributes–which attributes do we want to split on earlier, and which are we okay leaving for later?

Entropy is a fitting metric for such a decision tree: it is the expected value of information content when measuring the outcome of an event. By minimizing entropy as we build the tree from the root, we can narrow down from the more general attributes to the more specific ones and thus minimize uncertainty. We used a similar metric for predicting “englishness” in project01, and like last time, you won’t have to compute entropy yourself.

The pseudocode for constructing the ID3 decision tree is as follows:

id3helper(data, attributes):

  • If all the remaining data points have the same outcome:
    • Return an outcome node with this outcome
  • If there are no remaining attributes:
    • Return an outcome node with the most common outcome
  • Find the lowest entropy attribute, \(a\).
  • For each feature, \(f\), of \(a\):
    • Find all the data points that have the feature \(f\).
    • If there are none:
      • Guess! Make an outcome child with the most common outcome.
    • Otherwise:
      • Use the data! Recursively make a child out of the remaining data points, using all attributes except for \(a\).
  • Return the attribute node with the children generated above.

Checking the Results on gitlab

Ethics and AI

Several of the attributes (e.g., age, sex at birth, wealth) of our model are problematic in terms of equity. For example, if someone used our model during another Titanic-like disaster to determine who got to board the lifeboats first, it would produce biased results. Ponder for a while about what responsibility we have as computer scientists to the communities using our products, and how can we correct for biases when training our models.