Author: Sarah Dar
Decision Tree Classifier for Passing
Confusion Matrix
Summary
Importance:
Same decision tree with the target variable set to dropout
It is still about 91% accurate, but this is misleading: very few students drop out compared to the whole population, so a model that always predicts no dropout scores almost as well.
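To see why accuracy alone is misleading here, consider a minimal sketch with a hypothetical cohort (the 9% dropout rate is assumed for illustration, not taken from the real data): a trivial "model" that never predicts dropout is already about 91% accurate.

```python
# Hypothetical illustration: with a 9% dropout rate, a model that never
# predicts dropout is already 91% accurate, so accuracy alone says little.
n_students = 1000
n_dropouts = 90  # assumed 9% dropout rate, not from the real data

# A trivial baseline that predicts "no dropout" for every student
correct = n_students - n_dropouts
accuracy = correct / n_students
print(f"Majority-class baseline accuracy: {accuracy:.0%}")  # 91%
```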
After adjusting for imbalance
This model is a little less accurate, at about 70% accuracy:
We can see that the performance is similar to the decision tree; however, there are more Type II errors and fewer Type I errors. The summary tells us that the model is 75% accurate, just 1% more than the decision tree.
The model also tells us which variables are important (avg_score and active_days).
Most students are classified as FALSE.
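As a reminder of how these errors are read off a confusion matrix, here is a minimal sketch with made-up labels (not the real data), where TRUE means dropout: a Type II error is a missed dropout (false negative) and a Type I error is a false alarm (false positive).

```python
# Toy example of counting Type I / Type II errors from predictions.
# Labels are made up; True = dropout (positive class), False = no dropout.
actual    = [True, True, True, False, False, False, False, False, False, False]
predicted = [True, False, False, False, False, False, False, False, True, False]

tp = sum(a and p for a, p in zip(actual, predicted))          # correctly flagged dropouts
fn = sum(a and not p for a, p in zip(actual, predicted))      # Type II errors (missed dropouts)
fp = sum(p and not a for a, p in zip(actual, predicted))      # Type I errors (false alarms)
tn = sum(not a and not p for a, p in zip(actual, predicted))  # correctly cleared students

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"Accuracy: {(tp + tn) / len(actual):.0%}")  # 70%
```

Note that most predictions fall in the FALSE column, mirroring the imbalance in the data.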
Adjusting for imbalance:
Now the model behaves much better. Again, disability and avg_score seem to be the most important variables for predicting dropout. This model is slightly more accurate (72%) than the decision tree.
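One common way to adjust for imbalance is to weight the minority class more heavily during training. A minimal sketch of the "balanced" weighting formula (the same heuristic scikit-learn applies with class_weight='balanced'); the 9% dropout rate is assumed, not taken from the real data:

```python
# Sketch of the "balanced" class-weight formula: n / (n_classes * count).
# Rare classes get proportionally larger weights, so the model is penalized
# more for missing a dropout than for misclassifying a non-dropout.
from collections import Counter

labels = [False] * 91 + [True] * 9   # assumed 9% dropout rate
counts = Counter(labels)
n, k = len(labels), len(counts)

weights = {cls: n / (k * cnt) for cls, cnt in counts.items()}
print(weights)  # dropouts weighted roughly 10x heavier than non-dropouts
```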
The results seem better. However, a look at the summary tells us that the accuracy is only marginally better (76%).
It seems that we have slightly better performance. The summary says that the model is 73% accurate on the test set.
We can see that the model is not good at predicting the actual grade. The R-squared is low (0.26), which means the variables cannot explain the variability in scores. The error is also high: +/- 16.24 points on average (RMSE).
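For reference, a minimal sketch of how R-squared and RMSE are computed; the scores below are made up for illustration, not drawn from the real data.

```python
# R-squared: share of score variance the predictions explain (1.0 is perfect).
# RMSE: typical prediction error, in the same units as the score.
actual    = [55.0, 70.0, 62.0, 90.0, 48.0]   # made-up final scores
predicted = [60.0, 65.0, 70.0, 75.0, 58.0]   # made-up model predictions

mean_y = sum(actual) / len(actual)
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
ss_tot = sum((a - mean_y) ** 2 for a in actual)

r_squared = 1 - ss_res / ss_tot
rmse = (ss_res / len(actual)) ** 0.5
print(f"R^2 = {r_squared:.2f}, RMSE = {rmse:.2f}")
```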
If you go to Actual/Predicted, the graph is not a clear line (as you would expect from a good prediction).
The R-squared is still low (0.26), meaning that the variables are not able to explain the variability of the score.
If we go to Importance, we see that avg_score is the main variable for predicting the final score.
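A minimal sketch of how variable importance is read off a tree model, assuming scikit-learn; the feature names match the report, but the data here is synthetic, generated so that avg_score drives the target.

```python
# Illustrative only: synthetic data where the final score depends mostly on
# avg_score, so the fitted tree should rank avg_score as the key variable.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 500
avg_score = rng.uniform(0, 100, n)
active_days = rng.uniform(0, 200, n)

# Final score driven mostly by avg_score, plus noise (assumed relationship)
final_score = 0.8 * avg_score + 0.05 * active_days + rng.normal(0, 5, n)

X = np.column_stack([avg_score, active_days])
model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, final_score)

for name, imp in zip(["avg_score", "active_days"], model.feature_importances_):
    print(f"{name}: {imp:.2f}")
```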
I would say that logistic regression proved to be the most accurate; however, it has a high Type II error rate, which is detrimental for a prediction model. If we want to avoid Type II errors, then our next best bet would be random forests, with good accuracy and a low Type II error rate.