Decision Tree from Scratch with Python

In this post, I walk through how I created a decision tree from scratch with Python. I will explain every piece of the code to make it as clear as possible. Note that this article does not introduce the fundamentals of decision trees; instead, it focuses on how to turn the theory into reality. Without further ado, let's go!!

Code

You can view the entire code for building a decision tree at the LINK

Splitting Algorithm

There are various algorithms (ID3, C4.5, C5.0, CART) designed for choosing an optimal split at each node. Scikit-Learn uses an optimized version of the CART algorithm. Here, I used the ID3 algorithm.

How does ID3 work?

This algorithm makes use of entropy to calculate information gain. For a node m, the entropy is defined as follows

    \[E(m) = - \sum_{k=1}^{K} p_{mk}\log_{2}p_{mk}\]

where
p_{mk} stands for the probability of observing class k in node m

Entropy measures the uncertainty (impurity) of a node. Lower entropy means the node is purer, and thus easier to predict. For example, consider the following two nodes, each containing 10 observations

The node on the right is considered more predictable: if you pick an observation at random and label it blue, you have a probability of 0.8 of guessing correctly. The entropy of these two nodes tells us how easy it is to predict an observation in each node; the node on the right has lower entropy

    \[\begin{aligned} E(left) &= -\frac{4}{10}\log_{2}\frac{4}{10} -\frac{3}{10}\log_{2}\frac{3}{10} -\frac{3}{10}\log_{2}\frac{3}{10} = 1.57 \\ E(right) &= -\frac{8}{10}\log_{2}\frac{8}{10} -\frac{1}{10}\log_{2}\frac{1}{10} -\frac{1}{10}\log_{2}\frac{1}{10} = 0.92 \end{aligned}\]
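As a quick sanity check, here is a small NumPy snippet of my own (separate from the class implemented later) that reproduces these two values from the class counts [4, 3, 3] and [8, 1, 1] used in the example:

import numpy as np

def entropy(counts):
    # entropy of a node given its class counts
    p = np.array(counts) / np.sum(counts)
    p = p[p > 0]                        # skip empty classes, since 0*log2(0) is taken as 0
    return -np.sum(p * np.log2(p))

print(round(entropy([4, 3, 3]), 2))     # 1.57, the node on the left
print(round(entropy([8, 1, 1]), 2))     # 0.92, the node on the right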

The next step is to calculate the information gain of splitting a node m. The information gain is defined as

    \[G(m, A) = E(m) - \sum_{v \in values(A)} \frac{|A_{v}|}{|A|} E(m \mid A_{v})\]

where
A is a candidate split (attribute) for node m
|A| is the total number of observations in node m
|A_v| is the number of observations taking value v under the split
E(m|A_v) is the entropy of the child node containing those observations

The second term of the formula above is also known as the conditional entropy of m given A. The idea is similar to entropy: the lower the conditional entropy, the more predictable the split makes node m. The algorithm iterates through every possible split and calculates the information gain of each. It then chooses the split that results in the highest information gain, which represents the largest decrease in entropy. For example, consider the following two splits

Obviously, the split on the left side should be the better split, since it creates a purer node. The information gain for the split on the left side would be

    \[G(m, A_{L}) = 1.57 - \left( \frac{5}{10} \left( -\frac{4}{5} \log_{2}\frac{4}{5} - \frac{1}{5} \log_2{\frac{1}{5}} \right) + \frac{5}{10} \left( -\frac{2}{5} \log_2\frac{2}{5} - \frac{3}{5} \log_2{\frac{3}{5}} \right) \right) = 0.724\]

And the information gain for the split on the right side would be

    \[G(m, A_{R}) = 1.57 - \left( \frac{6}{10} \left( -3 \cdot \frac{1}{3} \log_{2}\frac{1}{3} \right) + \frac{4}{10} \left( -\frac{2}{4} \log_{2}\frac{2}{4} - 2 \cdot \frac{1}{4} \log_{2}\frac{1}{4} \right) \right) = 0.019\]

The calculation of information gain also tells us that the split on the left side is a better split.
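To double-check these numbers, here is a small helper of my own (not part of the class implemented below) that computes the information gain of a split from the class counts of the child nodes; the tiny differences from 0.724 and 0.019 come from the rounded parent entropy 1.57 used in the text:

import numpy as np

def entropy(counts):
    p = np.array(counts) / np.sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(parent_counts, children_counts):
    # information gain = parent entropy - weighted entropy of the children
    n = np.sum(parent_counts)
    weighted = sum(np.sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted

# left split: children with class counts [4, 1] and [2, 3]
print(information_gain([4, 3, 3], [[4, 1], [2, 3]]))        # ~0.72
# right split: children with class counts [2, 2, 2] and [2, 1, 1]
print(information_gain([4, 3, 3], [[2, 2, 2], [2, 1, 1]]))  # ~0.02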

Implementation

def calculate_entropy(self, y):
    # Entropy of a node, computed from the class distribution of its labels
    p = np.bincount(y) / len(y)
    p = p[p > 0]                   # skip empty classes, since 0*log2(0) is taken as 0
    return -np.sum(p * np.log2(p))

def id3(self, X, y):
    # Find the (feature, threshold) pair with the lowest conditional entropy,
    # i.e. the split with the highest information gain for this node
    best_feature = None
    best_threshold = None
    best_entropy = np.inf
    n_samples, n_features = X.shape

    for feature in np.arange(n_features):
        feature_level = sorted(set(X[:, feature]))   # unique feature values, ascending

        for i in range(len(feature_level) - 1):
            # candidate threshold: midpoint of two consecutive unique values
            med = (feature_level[i] + feature_level[i+1]) / 2
            left_outputs = y[X[:, feature] <= med]
            right_outputs = y[X[:, feature] > med]

            left_entropy = self.calculate_entropy(left_outputs)
            right_entropy = self.calculate_entropy(right_outputs)

            # weighted (conditional) entropy of the two child nodes
            entropy = (left_entropy*len(left_outputs) + right_entropy*len(right_outputs)) / n_samples

            if entropy < best_entropy:
                best_feature = feature
                best_threshold = med
                best_entropy = entropy

    return best_feature, best_threshold

The overall structure of this function rests on two for loops

for feature in np.arange(n_features):
    feature_level = sorted(set(X[:, feature]))
    for i in range(len(feature_level) - 1):
        med = (feature_level[i] + feature_level[i+1]) / 2
        ...

For each split, ID3 needs to find a combination of a feature and a threshold that creates an optimal split. Hence, the outer loop iterates over every feature in the data set, and the inner loop iterates over the unique values of that feature to create candidate thresholds.

There are various ways to select a threshold at this step. Here, I set the threshold to the mean of feature_level[i] and feature_level[i+1], where feature_level is an array containing the unique values of the feature, sorted in increasing order (a quick illustration of these candidate thresholds is shown below). After the function decides on a threshold, it sends the observations whose feature value is less than or equal to the threshold to the left node and the rest to the right node, and then calculates the entropy of each child node separately
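Here is that tiny standalone illustration of the candidate thresholds (the feature values are made up for the example):

import numpy as np

feature_values = np.array([2.0, 5.0, 5.0, 3.0, 8.0])      # one feature column
feature_level = sorted(set(feature_values))                # [2.0, 3.0, 5.0, 8.0]
thresholds = [(feature_level[i] + feature_level[i+1]) / 2
              for i in range(len(feature_level) - 1)]
print(thresholds)                                          # [2.5, 4.0, 6.5]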

left_outputs = y[X[:, feature]<=med]
right_outputs = y[X[:, feature]>med]

left_entropy = self.calculate_entropy(left_outputs)
right_entropy = self.calculate_entropy(right_outputs)

What self.calculate_entropy(left_outputs) does here is simply compute the value of E(m) defined above for the left child node. After calculating the entropy of each child node separately, the function then calculates the conditional entropy

entropy = (left_entropy*len(left_outputs) + right_entropy*len(right_outputs)) / n_samples
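As a quick check of this weighted average, plugging in the left-split example from earlier (child class counts [4, 1] and [2, 3], my own illustration) reproduces the conditional entropy implied by the 0.724 gain:

import numpy as np

def entropy(counts):
    p = np.array(counts) / np.sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

left_entropy, right_entropy = entropy([4, 1]), entropy([2, 3])
conditional = (left_entropy * 5 + right_entropy * 5) / 10
print(round(conditional, 3))   # ~0.846, and 1.57 - 0.846 ≈ 0.724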

We only consider this split a potential optimal split if the conditional entropy calculated in this iteration is smaller than the lowest entropy we have seen so far. The feature, threshold, and entropy are then updated whenever such a split is found

if entropy < best_entropy:
    best_feature = feature
    best_threshold = med
    best_entropy = entropy

Build the Tree

def build_tree(self, X, y, current_depth=0, max_depth=9):
    y = y.astype(np.int64)

    # Stopping criteria: maximum depth reached, too few observations, or a pure node
    if current_depth == max_depth or len(X) <= 2 or len(np.unique(y)) == 1:
        return DecisionTree(value=np.bincount(y).argmax())

    feature, thresh = self.id3(X, y)

    left_mask = X[:, feature] <= thresh
    right_mask = ~left_mask

    # Stop if the chosen split would leave one of the child nodes empty
    if sum(right_mask) == 0 or sum(left_mask) == 0:
        return DecisionTree(value=np.bincount(y).argmax())

    # Recursively build the left and right subtrees
    left = self.build_tree(X[left_mask], y[left_mask], current_depth+1, max_depth)
    right = self.build_tree(X[right_mask], y[right_mask], current_depth+1, max_depth)

    return DecisionTree(left, right, feature, thresh, np.bincount(y).argmax())

I used a recursive function to build the tree. The tree is built recursively until certain conditions are met

if current_depth == max_depth or len(X) <= 2 or len(np.unique(y)) == 1:
    return DecisionTree(value=np.bincount(y).argmax())

This is the first set of stopping criteria. When the current depth reaches the maximum depth, the number of observations is 2 or fewer, or all observations in the current node share the same label, the tree stops growing and a leaf is returned. If none of these conditions is met, we know the current node can be split further, so the function then executes the ID3 algorithm to find the optimal split.

feature, thresh = self.id3(X, y)

Yet we still need to make sure that this optimal split won't create a child node with zero observations (rare, but it can happen). So I added another check: if the split creates an empty node, this branch stops growing and a leaf is returned for the current node.

if sum(right_mask) == 0 or sum(left_mask) == 0:
    return DecisionTree(value=np.bincount(y).argmax())

If none of the stopping criteria is met, this split successfully reduces the entropy of the current node, so the function recursively builds the subtree on the left and the one on the right.

left = self.build_tree(X[left_mask], y[left_mask], current_depth+1, max_depth)
right = self.build_tree(X[right_mask], y[right_mask], current_depth+1, max_depth)

return DecisionTree(left, right, feature, thresh, np.bincount(y).argmax())
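One more note: each node stores the majority class of its labels via np.bincount(y).argmax(), which is what a leaf eventually predicts. A tiny illustration with made-up labels:

import numpy as np

y = np.array([2, 0, 2, 2, 1])
print(np.bincount(y))           # [1 1 3] -> counts of classes 0, 1, 2
print(np.bincount(y).argmax())  # 2, the majority class stored in the node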

Prediction Process

def predict(self, x, depth=0):
    # Leaf node: return the stored majority class
    if self.left is None and self.right is None:
        return self.value

    # Otherwise follow the branch given by this node's feature and threshold
    if x[self.feature] <= self.thresh:
        return self.left.predict(x, depth+1)
    else:
        return self.right.predict(x, depth+1)

The prediction process is relatively straightforward. We recursively compare the feature value of the observation with the threshold of the current node and decide whether to move left or right, until we reach a leaf.
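For completeness, the methods above assume a node class that stores left, right, feature, thresh, and value. The full definition is in the linked code; a minimal constructor consistent with how these attributes are used might look like this (my reconstruction, not necessarily the exact original):

class DecisionTree:
    def __init__(self, left=None, right=None, feature=None, thresh=None, value=None):
        self.left = left          # left child (DecisionTree, or None for a leaf)
        self.right = right        # right child
        self.feature = feature    # index of the feature this node splits on
        self.thresh = thresh      # threshold used for the split
        self.value = value        # majority class of the node (used at leaves)

    # calculate_entropy, id3, build_tree and predict are defined as shown above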

Demonstration

To verify whether my self-built decision tree works as expected, I used it to train on and predict two well-known data sets: iris and titanic. I set the test size to 40% of the data and the maximum depth to 4.

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = pd.read_csv("iris.csv")
data = data.drop(["Id"], axis=1)

# Encode the target labels as integers
le = LabelEncoder()
data["Species"] = le.fit_transform(data["Species"])
X = np.array(data.iloc[:, :-1])
y = np.array(data.iloc[:, -1])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

tree = DecisionTree()
tree = tree.build_tree(X_train, y_train, max_depth=4)

predict_results = []
for obs in X_test:
    predict_results.append(tree.predict(obs))
accuracy = accuracy_score(y_test, predict_results)
print(accuracy)

The resulting accuracy is 0.983, which is a great result and shows that my decision tree actually works. Next, let's try it on the titanic data set

data = pd.read_csv("titanic.csv")
data = data.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1)
data = data.dropna()

# Encode the categorical features as integers
le = LabelEncoder()
data["Sex"] = le.fit_transform(data["Sex"])
data["Embarked"] = le.fit_transform(data["Embarked"])

X = np.array(data.iloc[:, 1:])
y = np.array(data.iloc[:, 0])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Build a new tree on the titanic training data
tree = DecisionTree()
tree = tree.build_tree(X_train, y_train, max_depth=4)

predict_results = []
for obs in X_test:
    predict_results.append(tree.predict(obs))
accuracy = accuracy_score(y_test, predict_results)
print(accuracy)

The resulting accuracy is 0.74. Though not as good as expected, it does show that this self-built decision tree works to some extent. Note that I did not do any data preprocessing or hyperparameter optimization, since I wanted to put the emphasis on the process of building a decision tree from scratch.
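For a rough point of reference (my own suggestion, not part of the original experiment), you can fit scikit-learn's DecisionTreeClassifier on the same split and compare its accuracy:

from sklearn.tree import DecisionTreeClassifier

# entropy criterion and the same depth limit, to roughly mirror the settings above
clf = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))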

Conclusion

This article explains how I built a decision tree from scratch using Python. Though this decision tree is specifically designed as a classifier using the ID3 algorithm, you can easily modify the splitting part to use other algorithms such as C4.5 or C5.0, or even adapt it into a regressor. Feel free to comment below with any suggestions and discussions 🙂🙂