Author: Sarah Dar

Decision Tree Classifier for Passing

Confusion Matrix

Summary

Importance:

Same Decision Tree with Target Variable Set to Dropout

It is still about 91% accurate. This is because very few students drop out compared to the whole population, so a model that predicts almost no dropouts is right most of the time.
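To illustrate why accuracy alone can be misleading here, the following minimal sketch uses made-up numbers (90 dropouts among 1,000 students, chosen only to match the roughly 91% figure above, not taken from the actual data set). A hypothetical baseline that never predicts a dropout reaches that accuracy while catching no dropouts at all.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=True):
    """Fraction of actual positives that the model identified."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not predictions_for_positives:
        return 0.0
    hits = sum(p == positive for p in predictions_for_positives)
    return hits / len(predictions_for_positives)


# Hypothetical class balance (an assumption, not the real data):
# 90 students dropped out, 910 did not.
y_true = [True] * 90 + [False] * 910

# Baseline "model" that always predicts FALSE (no dropout).
baseline_pred = [False] * len(y_true)

baseline_accuracy = accuracy(y_true, baseline_pred)
baseline_recall = recall(y_true, baseline_pred)

print(f"accuracy: {baseline_accuracy:.2f}")
print(f"recall on dropouts: {baseline_recall:.2f}")
```

With classes this unbalanced, recall on the dropout class (or the confusion matrix itself) says more about the model than the headline accuracy.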

After adjusting for imbalance

This model is a little less accurate, at 70% accuracy.

Logistic Regression for Results (Pass/Fail)

We can see that the performance is similar to the decision tree; however, there are more Type II errors and fewer Type I errors. The summary tells us that the model is 75% accurate, just 1 percentage point more than the decision tree.
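As a reminder of how these error types are read off a confusion matrix, here is a small sketch. It assumes "Pass" is treated as the positive class, so a Type I error is a false positive (predicted Pass, actually Fail) and a Type II error is a false negative (predicted Fail, actually Pass). The labels below are invented for illustration and are not the model's output.

```python
def confusion_counts(y_true, y_pred, positive="Pass"):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            counts["tp"] += 1
        elif predicted == positive:
            counts["fp"] += 1  # Type I error
        elif actual == positive:
            counts["fn"] += 1  # Type II error
        else:
            counts["tn"] += 1
    return counts


# Invented example labels (assumption: "Pass" is the positive class).
y_true = ["Pass", "Pass", "Pass", "Fail", "Fail", "Pass", "Fail", "Pass"]
y_pred = ["Pass", "Fail", "Pass", "Fail", "Pass", "Fail", "Fail", "Pass"]

counts = confusion_counts(y_true, y_pred)
type_1_errors = counts["fp"]
type_2_errors = counts["fn"]
accuracy = (counts["tp"] + counts["tn"]) / len(y_true)

print(counts)
print("Type I:", type_1_errors, "Type II:", type_2_errors, "accuracy:", accuracy)
```

Two models with nearly the same accuracy can split their mistakes very differently between the two error types, which is the comparison made above.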

The model also tells us which variables are important (avg_score and active_days).

Logistic Regression for Dropout

Most people are classified as FALSE.

Adjusting for imbalance:

Now the model behaves much better. Again, disability and avg_score seem to be the most important variables for predicting dropout. This model is slightly more accurate (72%) than the decision tree.
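The report does not say how the imbalance was adjusted. One common option is random undersampling of the majority class, sketched below under that assumption. The function name `undersample`, the row layout, and the 90/910 class split are all hypothetical and used only for illustration.

```python
import random


def undersample(rows, label_key, seed=0):
    """Randomly drop rows from larger classes until all classes are equal in size."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group the rows by their class label.
    groups = {}
    for row in rows:
        groups.setdefault(row[label_key], []).append(row)

    # Every class is cut down to the size of the smallest one.
    target_size = min(len(group) for group in groups.values())

    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target_size))
    rng.shuffle(balanced)
    return balanced


# Hypothetical training data: 90 dropouts and 910 non-dropouts.
rows = [{"id": i, "dropout": i < 90} for i in range(1000)]

balanced = undersample(rows, "dropout")
n_dropout = sum(row["dropout"] for row in balanced)

print(len(balanced), "rows,", n_dropout, "of them dropouts")
```

Balancing should be applied to the training data only, so that accuracy on the test set still reflects the real class proportions.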

Random Forest for Pass/Fail

The results seem better. However, a look at the summary tells us that the accuracy is only marginally better (76%).

Random Forest for Dropout

It seems that we have slightly better performance. The summary says that the model is 73% accurate on the test set.

Step 6: Regression Models

Linear Regression

We can see that the model is not good at predicting the actual grade. The R-squared is low (0.26), which means that the variables explain only about 26% of the variability in scores. The error is also high: predictions are typically off by about ±16.24 points (RMSE).
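For reference, both metrics can be computed directly from actual and predicted scores. The sketch below uses four invented scores, not values from this analysis, and the standard definitions: RMSE is the square root of the mean squared residual, and R-squared is 1 minus the ratio of residual to total sum of squares.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Share of the variance in y_true that the predictions account for."""
    mean_true = sum(y_true) / len(y_true)
    ss_residual = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_total = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_residual / ss_total


# Invented scores for illustration only.
actual = [50, 60, 70, 80]
predicted = [55, 55, 75, 75]

error = rmse(actual, predicted)
r2 = r_squared(actual, predicted)

print(f"RMSE: {error:.2f}")
print(f"R-squared: {r2:.2f}")
```

An R-squared of 0.26 means the residual sum of squares is still about 74% of the total, so most of the variation in scores is left unexplained.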

In the Actual/Predicted view, the points do not form a clear line, as you would expect from a good prediction.

Random Forest

The R-squared is still low (0.26), meaning that the variables are not able to explain most of the variability in the score.

If we go to Importance, we see that avg_score is the main variable for predicting the final score.

Conclusion

I would say that logistic regression proved to be the most accurate; however, it has a high Type II error rate, which is detrimental for a prediction model. If we want to avoid Type II errors, our next best bet would be random forests, which combine good accuracy with a low Type II error rate.