
On this website https://scikit-learn.org/stable/modules/learning_curve.html, the authors discuss bias and variance and give a simple example of how they behave in a linear model.

How can I determine the bias and variance of a random forest?

jennifer ruurs

1 Answer


To gain a good understanding of this, I recommend first reading about the trade-off between bias and variance in ML and AI methods.

A great article on this topic that I recommend as a light mathematical introduction is this: https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229

In short: bias represents the model's tendency to generalize across samples, whereas variance represents the model's tendency to conform to the data it was trained on. A high-bias, low-variance model will thus look more like a straight (underfitted) line, while a low-bias, high-variance model will look jagged and all over the place (overfitted).

In essence, you need to find a balance between the two to avoid both overfitting (high variance, low bias) and underfitting (high bias, low variance) for your specific application.
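As a minimal sketch of the two extremes (not part of the original answer; the noisy sine data and the chosen depths are illustrative assumptions), a depth-1 decision tree underfits while an unlimited-depth tree overfits:

```python
# Sketch: high bias (depth-1 tree) vs. high variance (unlimited-depth tree)
# on noisy 1-D data. Dataset and depths are illustrative choices only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [1, None]:  # 1 = underfit; None = grow until leaves are pure (overfit)
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train R^2={tree.score(X_train, y_train):.2f}, "
          f"test R^2={tree.score(X_test, y_test):.2f}")
```

The underfitted tree scores poorly on both sets, while the overfitted tree scores nearly perfectly on the training set but noticeably worse on the test set.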

But how can I determine this for a model such as a Random Forest classifier?

To determine your model's bias and variance (whether either is too high or too low), you can look at the model's performance on the validation and test sets. The very reason we divide our data into training, validation, and test sets is so that we can evaluate the model's performance when it is presented with samples it has not seen during training.
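A minimal sketch of that comparison, assuming scikit-learn and a synthetic dataset (the dataset and hyperparameters are illustrative, not from the question): a large gap between the training score and the cross-validated score suggests high variance (overfitting), while low scores on both suggest high bias (underfitting).

```python
# Compare a random forest's training accuracy with its cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

train_score = forest.score(X, y)                 # accuracy on the data it was trained on
cv_scores = cross_val_score(forest, X, y, cv=5)  # accuracy on held-out folds
print(f"training accuracy:        {train_score:.3f}")
print(f"cross-validated accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```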

Krrrl
  • @Krrrl I am familiar with the famous dartboard, and for basic linear models I do understand how it works. I am also familiar with the test method that you suggest, but at https://ai.stackexchange.com/questions/15964/how-should-i-interpret-this-validation-plot (look in the comments) the opposite is said (it can be underfitting or overfitting, so I am confused). – jennifer ruurs Oct 22 '19 at 14:52
  • 1
    Sorry, I did not mean to imply that you did not already know that. I am not exactly sure what your question is in your comment, could you type it as a question? The graphs you posted in the other question seem to indicate that the model is overfitting with a tree-depth higher than 15. That is to say, the useful features in the data is already accounted for by depth 15, thus by increasing depth above 15 the model starts to learn features unimportant to the actual data(such as noise present in your training set). Does this answer your question? – Krrrl Oct 22 '19 at 15:02
  • No, I know you want to help and I am thankful for that, but are you sure that I am overfitting there? Because my training scores are higher than my cross-validation scores. And the training scores are predicted based on the model that is created with cross-validation, or is that not true? – jennifer ruurs Oct 22 '19 at 15:23
  • 1
    No, that is not the case. The graph you posted in the other question illustrates what accuracy is achieved for different tree depths - tested on two different data sets(training set and validation set). You should use this graph to determine the depth of the trees in your model, before you are going to use it on real data(non-training data). So, the graph tells you that a depth of 15 seem to be the optimal depth for the specific data you are working with here. If you change out the data set with something else, you will probably get a new optimal tree depth, specific to the new data. – Krrrl Oct 22 '19 at 15:37
  • 1
    Think of the graph like this: When different tree depths are tested, these different accuracy's are found. If you then choose tree depth deeper than 15(say 18), your model _will be_ overfitting. While if you choose a tree depth less than 15(say 12), you model will be underfitting. So, you use this graph to determine how deep your trees should be - in order to avoid both under- and overfitting. The two graphs are the result of using the same model, tested with different tree depths, on two different data sets(the training- and the validation-set). – Krrrl Oct 22 '19 at 15:43
  • In general, making a model more complex will make it more likely to overfit and less likely to underfit. Here, for example, they say that I could use a learning curve plot to determine over- and underfitting: https://www.ritchieng.com/machinelearning-learning-curve/ So, based on that and my plot at https://ai.stackexchange.com/questions/15971/how-to-interpret-this-learning-curve-plot, can I conclude something about under- and overfitting? Is there a formal statistical test to assess over- or underfitting for neural networks and random forests? – jennifer ruurs Oct 22 '19 at 16:00
  • 1
    Thank you for editing your previous comment for clarity. I am not aware of any way to compute over/underfitting analytically, other than comparing performance on trainingset vs. validation set(which is what we were talking about earlier, in your original question). The learning curve can indeed be used to conclude something about over/underfitting, based on the difference between performance on training and validation set, like demonstrated in the article you linked. – Krrrl Oct 22 '19 at 16:22
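The depth comparison discussed in the comments above could be reproduced with a validation curve over `max_depth`. This is a sketch assuming scikit-learn and synthetic data (the dataset and the depth grid are illustrative, not taken from the question): the depth where the validation score peaks, while the training score keeps rising, marks the point past which the forest starts to overfit.

```python
# Sketch: validation curve over max_depth for a random forest.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
depths = np.arange(2, 21, 2)

train_scores, valid_scores = validation_curve(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, param_name="max_depth", param_range=depths, cv=5,
)

for d, tr, va in zip(depths, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(f"max_depth={d:2d}: train={tr:.3f}, validation={va:.3f}")
```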