5

I have been following the ML course by Tom Mitchell. The inherent assumption when using a decision tree learning algorithm is: the algorithm prefers the smallest decision tree that fits the data.

Why is this so, when we could have larger extensions of the tree which might in principle perform better than the shorter tree?

imflash217
  • 499
  • 4
  • 14

3 Answers

2

The bigger your tree, the more your model overfits. In machine learning, we generally prefer a simpler model unless there is a good reason to go for the more complex one.
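
Here is a minimal sketch of that effect, assuming scikit-learn and a synthetic dataset (both are my assumptions for illustration): as the tree is allowed to grow deeper, training accuracy keeps climbing while test accuracy stalls or drops.

```python
# Sketch: train vs. test accuracy as the tree is allowed to grow deeper.
# Assumes scikit-learn is installed; the dataset here is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (2, 5, 10, None):            # None lets the tree grow fully
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train={tree.score(X_train, y_train):.2f}  "
          f"test={tree.score(X_test, y_test):.2f}")
# The fully grown tree typically scores ~1.00 on the training set but drops
# on the test set -- that widening gap is the overfitting.
```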

ABCD
  • 1,411
  • 1
  • 8
  • 14
  • Why would a bigger tree correspond to more overfitting? I don't see any logic behind it. – imflash217 Jul 31 '17 at 13:30
  • More complexity = more potential for noise and errors. Although I'm not working in Machine Learning specifically, this is consistent with the [KISS](https://en.wikipedia.org/wiki/KISS_principle) principle ingrained in me by the punch-card generation. From a statistical standpoint, it sounds like, in this case, more complexity runs a large risk of reducing the reliability and value of the results. – DukeZhou Jul 31 '17 at 18:42
  • 1
    @DukeZhou Correct. A big tree is more likely to be complicated and less likely to be useful for prediction. – ABCD Aug 01 '17 at 02:46
  • To me this is not a valid reason. Why wouldn't I choose a bigger decision tree if it performs far better than a shorter one? There should be some mathematical foundation for it. And even if what you are saying is correct, there must be some foundation/validation of it; I couldn't find any. – imflash217 Aug 01 '17 at 18:53
  • @vinaykumar2491 please read a machine learning book carefully. What you are asking is just beginner concepts. It's pointless for me to explain; you need to put in some effort. – ABCD Aug 01 '17 at 22:54
  • For some, the concept of overfitting through complexity may seem simple at first glance, but the reason why goes as deep as the Occam's Razor principle in philosophy. – Toon Tran Feb 27 '23 at 18:39
2

Adding to SmallChess's answer: larger trees (with many nodes) are too closely adapted to the training set, so a small change in the training data can change the tree considerably and hence change the estimated values too much. This is mainly due to the hierarchical structure of trees (a change in a higher node can cause all the nodes below it to change). As an extreme case, think of a large tree in which each training example has its own leaf. Such a tree is essentially useless for prediction on test data.
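
A minimal sketch of this instability, assuming scikit-learn and a synthetic dataset (both assumptions on my part): fit two fully grown trees on slightly different resamples of the same data and see how often their predictions disagree.

```python
# Sketch: how much a fully grown tree changes under a small change in the
# training data. Assumes scikit-learn; the dataset is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, flip_y=0.1,
                           random_state=0)
rng = np.random.default_rng(0)

predictions = []
for _ in range(2):
    # Bootstrap resample = a slightly perturbed training set.
    idx = rng.choice(len(X), size=len(X), replace=True)
    tree = DecisionTreeClassifier(random_state=0)  # unpruned: grows until leaves are pure
    tree.fit(X[idx], y[idx])
    predictions.append(tree.predict(X))
    print("leaves:", tree.get_n_leaves())

# Fraction of points on which the two trees disagree -- a rough measure of
# the variance that comes with large, unpruned trees.
print("disagreement:", np.mean(predictions[0] != predictions[1]))
```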

Fenil
  • 181
  • 5
  • But when we change the **training dataset**, there is always a possibility that the tree structure changes (since we cannot know in advance what the outcome of retraining will be). And if the new tree performs better in terms of accuracy, time/space complexity, etc., why should we not choose it? – imflash217 Aug 01 '17 at 19:10
  • 1
    @vinaykumar2491 It's all about the bias-variance tradeoff. I recommend a beginner machine learning book, otherwise you don't really know what you're talking about. You'll understand that your question is not really a question. – ABCD Aug 02 '17 at 04:01
0

Consider the LCD (least common denominator) principle in algebra. A larger common denominator would work for most purposes for which the LCD is calculated, but the least one is used by convention. Why? The interest in prime numbers in general rests on the value of reductive methodologies.

In philosophy, Occam's Razor is the principle that, given two conceptual constructs that correlate equally well with observations, the simpler one is most likely the better. A more formal and generalized prescription is this:

Given two mathematical models of a physical system with equal correlation to the set of all observations about the system being modeled, the simpler model is the more likely to be predictive of conditions outside those of the observation set.

This principle of simplicity as the functional ideal holds for decision trees. The decision tree most likely to be functional and clear, among those a given data set could produce, is the simplest one. Without a clear reason for additional complexity, there may be no benefit derived from the added complexity, yet there may be penalties in terms of clarity, and therefore maintainability and verifiability. There may be computational penalties too, in some applications.

The question mentions "bigger extensions of the tree which could in principle perform better", but the performance of the tree should be a matter of optimization when preparing for execution and during real-time execution of the decision tree. In other words, the minimalist decision tree is the clearest, most workable, and most verifiable construct; a clever software engineer can then translate that minimalist construct into a run-time-optimized equivalent.

Just as with compilation of source code, performance is multidimensional in meaning. Time efficiency, memory efficiency, network bandwidth efficiency, or other performance metrics could be used when optimizing the tree for run time. Nonetheless, the simpler tree is the best starting point for any weighted combination of these interests.
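
A minimal sketch of that "prefer the simpler model that fits equally well" idea, assuming scikit-learn and its bundled breast-cancer dataset (my choices, purely for illustration): cost-complexity pruning via the ccp_alpha parameter trades a little training fit for a much smaller tree that generalizes at least as well.

```python
# Sketch: a pruned (simpler) tree vs. a fully grown one, using scikit-learn's
# cost-complexity pruning (ccp_alpha). The dataset choice is an assumption.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

for name, model in (("full", full), ("pruned", pruned)):
    print(f"{name}: leaves={model.get_n_leaves()}  "
          f"train={model.score(X_train, y_train):.2f}  "
          f"test={model.score(X_test, y_test):.2f}")
# The pruned tree is much smaller and fits the training data slightly worse,
# yet it usually matches or beats the full tree on held-out data.
```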

Douglas Daseeco
  • 7,423
  • 1
  • 26
  • 62