We see that in the test set, 21.31% of students failed the course and the model correctly predicted that they would fail, while 53.11% passed the course and the model correctly predicted that they would pass. In total, the model is 21.31 + 53.11 ≈ 74% accurate; that is, it fails for around a quarter of the cases. Not good, not bad.
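The accuracy figure is just the sum of the two correctly predicted cells of the confusion matrix. A minimal sketch, using the percentages quoted above (the cell labels are illustrative, not taken from the tool):

```python
# Accuracy from a confusion matrix expressed as percentages of the test set.
correct_fail = 21.31  # failed the course, model predicted fail
correct_pass = 53.11  # passed the course, model predicted pass

accuracy = correct_fail + correct_pass   # share of cases predicted correctly
error_rate = 100 - accuracy              # share of cases the model gets wrong

print(round(accuracy, 2), round(error_rate, 2))  # 74.42 25.58
```

Roughly 74% accurate, so the model misclassifies about a quarter of the students.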
This model is a little less accurate, with 70% accuracy.
We can see that the performance is similar to the decision tree's; however, there are more Type II errors and fewer Type I errors. The summary tells us that the model is 75% accurate, just 1% more than the decision tree.
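To make the Type I / Type II distinction concrete, here is a sketch with a made-up 2×2 confusion matrix (the counts are invented for illustration; "positive" means the outcome we try to detect, i.e. failing the course):

```python
# Toy confusion matrix: 100 students in total (illustrative counts only).
confusion = {"tp": 40,   # failed, predicted fail
             "fp": 10,   # passed, predicted fail
             "fn": 15,   # failed, predicted pass
             "tn": 35}   # passed, predicted pass

type_i  = confusion["fp"]  # Type I error: false alarm (predicted fail, actually passed)
type_ii = confusion["fn"]  # Type II error: miss (predicted pass, actually failed)

total = sum(confusion.values())
accuracy = (confusion["tp"] + confusion["tn"]) / total

print(type_i, type_ii, accuracy)  # 10 15 0.75
```

Two models can share the same overall accuracy while trading false alarms against misses, which is exactly the difference observed here.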
Now the model behaves much better. Again, disability and avg_score seem to be the most important variables for predicting dropout. This model is a little more accurate (72%) than the decision tree.
The results seem better. However, a look at the summary tells us that the accuracy is only marginally better (76%).
It seems that we have slightly better performance. The summary says that the model is 77% accurate on the test set.
As we can see, all the models have similar performance; the inaccuracy is due mainly to some missing information.
We will now try to predict the final grade of the students.
We can see that the model is not good at predicting the actual grade. The R squared is low (0.26), which means that the variables cannot explain the variability in scores. The error is also high: ±16.24 points on average (RMSE).
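For reference, both figures come directly from the actual and predicted grades. A small sketch of how R squared and RMSE are computed (the grade values below are toy numbers, not the real data, where the results were 0.26 and 16.24):

```python
# Toy actual vs predicted final grades (illustrative values only).
actual    = [55, 70, 40, 85, 60]
predicted = [60, 65, 55, 70, 58]

n = len(actual)
mean_actual = sum(actual) / n
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual error
ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total variance

r_squared = 1 - ss_res / ss_tot   # share of grade variability explained
rmse = (ss_res / n) ** 0.5        # typical error, in grade points

print(round(r_squared, 2), round(rmse, 2))
```

An R squared of 0.26 means roughly three quarters of the grade variability is left unexplained, and an RMSE of 16.24 means a typical prediction is off by about 16 points.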
If you go to Actual/Predicted, the graph is not a clear line (as you would expect from a good prediction).
The R squared is still low (0.25), meaning that the variables are not able to explain the variability of the score.
If we go to Importance, we see that avg_score is the main variable for predicting the final score.
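One common way such importance scores are obtained is permutation importance: shuffle one variable's values and measure how much the model's error grows. A minimal pure-Python sketch of the idea, using a toy stand-in "model" and made-up variables (the real tool's algorithm may differ):

```python
import random

random.seed(0)

# Toy data: the final score tracks avg_score; shoe_size is pure noise.
rows = [{"avg_score": s,
         "shoe_size": random.randint(36, 46),
         "final": s + random.uniform(-5, 5)}
        for s in range(40, 90, 2)]

def model(row):
    # Stand-in "trained" model: it only ever looks at avg_score.
    return row["avg_score"]

def rmse(data):
    return (sum((model(r) - r["final"]) ** 2 for r in data) / len(data)) ** 0.5

baseline = rmse(rows)
importance = {}
for col in ("avg_score", "shoe_size"):
    shuffled = [r[col] for r in rows]
    random.shuffle(shuffled)                       # break the column's link to the target
    permuted = [{**r, col: v} for r, v in zip(rows, shuffled)]
    importance[col] = rmse(permuted) - baseline    # error increase = importance

print(importance)  # avg_score matters; shoe_size does not
```

Shuffling avg_score destroys the predictions, so its score is large; shuffling the irrelevant variable changes nothing, so its score stays at zero. That is the intuition behind avg_score topping the Importance view.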
It seems that Random Forest classification models have the highest prediction accuracy, so they might be the better option for predicting a student's final grade.