Author: Sarah Dar

Decision Tree Classifier for Passing


Confusion Matrix


Summary


Importance:


Same Decision Tree with the Target Variable Set to Dropout


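The model is still about 91% accurate. However, this is because very few students drop out compared to the whole population: a model that predicts "no dropout" for almost everyone will score a high accuracy without actually identifying the dropouts.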
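A minimal sketch of why accuracy is misleading here. The class counts are hypothetical (the note does not report the actual dropout rate); they are chosen only so that the majority class makes up 91% of the data.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a 'model' that always predicts the most common label."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)


# Hypothetical data: 1000 students, 90 of whom dropped out.
labels = [True] * 90 + [False] * 910

baseline = majority_baseline_accuracy(labels)
print(f"Always predicting 'no dropout' scores {baseline:.0%} accuracy")
```

Any classifier on data like this should be compared against that baseline rather than against 50%.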

After adjusting for imbalance


This model is less accurate, at 70% accuracy.
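The note does not say which balancing method was used. As one common option, random undersampling of the majority class can be sketched as follows; the function name, the `dropout` key, and the row counts are all assumptions made for illustration.

```python
import random


def undersample(rows, label_key, seed=0):
    """Randomly drop majority-class rows until both classes have equal size."""
    positives = [r for r in rows if r[label_key]]
    negatives = [r for r in rows if not r[label_key]]
    # Order the two groups by size so we know which one to shrink.
    minority, majority = sorted([positives, negatives], key=len)
    rng = random.Random(seed)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


# Hypothetical data: 90 dropouts among 1000 students.
rows = [{"dropout": True}] * 90 + [{"dropout": False}] * 910
balanced = undersample(rows, "dropout")
print(len(balanced), sum(r["dropout"] for r in balanced))
```

On balanced data the majority-class baseline is 50%, so a 70% accuracy reflects real signal even though the headline number is lower than before.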


Logistic Regression for Results (Pass/Fail)


We can see that the performance is similar to the decision tree; however, there are more Type II errors and fewer Type I errors. The summary tells us that the model is 75% accurate, just 1 percentage point more than the decision tree.
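For reference, accuracy and the two error rates can be read off a confusion matrix as sketched below. The counts are hypothetical, not the values from the chart, and the sketch assumes that a Type I error is a false positive and a Type II error is a false negative for whichever class is treated as positive.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy plus Type I (false positive) and Type II (false negative) rates."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "type_1_rate": fp / (fp + tn),  # share of actual negatives flagged positive
        "type_2_rate": fn / (fn + tp),  # share of actual positives that were missed
    }


# Hypothetical counts for illustration only.
metrics = confusion_metrics(tp=90, tn=60, fp=15, fn=35)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Two models with the same accuracy can split their mistakes very differently between the two error types, which is why the comparison above looks at both.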

The model also lets us know which variables are important (avg_score and active_days).


Logistic Regression for Dropout


Most people are classified as FALSE.

Adjusting for imbalance:


Now the model behaves much better. Again, disability and avg_score seem to be the most important variables for predicting dropout. This model is slightly more accurate (72%) than the decision tree.

Random Forest for Pass/Fail


The results seem better. However, a look at the summary tells us that the accuracy is only marginally better (76%).


Random Forest for Dropout


It seems that we have slightly better performance. The summary says that the model is 73% accurate on the test set.


Step 6: Regression Models

Linear Regression


We can see that the model is not good at predicting the actual grade. The R-squared is low (0.26), which means that the variables explain only about a quarter of the variability in scores. The error is also high: predictions are typically off by about 16.24 points (RMSE).

If you go to Actual/Predicted, the points do not fall along a clear line (as you would expect from a good prediction).
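As a reminder of what the two reported numbers measure, here is a small sketch of RMSE and R-squared computed from actual and predicted scores. The scores are made up for illustration and are not taken from the data set.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: typical size of a prediction error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical final scores and model predictions.
actual = [40, 55, 62, 70, 88]
predicted = [58, 60, 61, 66, 70]
print(f"RMSE: {rmse(actual, predicted):.2f}")
print(f"R-squared: {r_squared(actual, predicted):.2f}")
```

An R-squared of 0 corresponds to always predicting the mean score, and 1 corresponds to a perfect fit, so 0.26 sits much closer to the former.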


Random Forest


The R-squared is still low (0.26), meaning that the variables explain little of the variability in the score.

If we go to Importance, we see that avg_score is the main variable for predicting the final score.

Conclusion

I would say that logistic regression proved to be the most accurate; however, it has a high Type II error rate, which is detrimental for a prediction model. If we want to avoid Type II errors, our next best bet would be random forests, which offer good accuracy and a low Type II error rate.
