Prediction of Student Performance

1. The Data

The data were collected from two Portuguese schools and cover student performance in Math and in Portuguese language.

  • Most students attend 'GP' (Gabriel Pereira). In Math, only 46 students (11%) attend 'MS' (Mousinho da Silveira). There are more in Portuguese, but still only 68 (35%).
  • More students took paid classes in Math (46%) than in Portuguese (7%).
  • For alcohol consumption, students were only given options from 1 (very low) to 5 (very high). There was no option for those who don't drink. It is also unclear what "low" and "high" mean in number of drinks, so the answers may be subjective.

2. Prediction of Final Grade based on first and second evaluation

Using linear regression on the Student-Math dataset, we fit four models to predict the final grade. Their predictors were G1 alone, G2 alone, G1 and G2 together, and all variables. Below are the scatter plots:

G1 Scatter Plot

G2 Scatter Plot

G1+G2 Scatter Plot

All Variables Scatter Plot

R-squared for each model:

  • G1: 0.6035
  • G2: 0.7930
  • G1+G2: 0.7972
  • All Variables: 0.8365
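
As a rough illustration of where an R-squared value comes from, here is a minimal sketch in plain Python that fits a one-predictor least-squares line and computes R-squared from it. The grades are made-up toy values, not rows from the Student-Math dataset, and the helper names `fit_line` and `r_squared` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(actual, predicted):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy grades (assumption: second-period grade vs. final grade).
g2 = [5, 8, 10, 11, 12, 13, 15, 18]
g3 = [6, 8, 9, 12, 11, 14, 15, 18]

slope, intercept = fit_line(g2, g3)
predictions = [slope * x + intercept for x in g2]
r2 = r_squared(g3, predictions)

print(f"slope={slope:.3f} intercept={intercept:.3f} R-squared={r2:.4f}")
```

Read this way, an R-squared of 0.7930 for the G2 model means the fitted line accounts for about 79% of the variation in the final grade.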

Here are some of the findings:

  • The previous two grades together explain almost 80% of the variance in the final grade (R-squared of 0.7972).
  • The All Variables model has the highest R-squared (0.8365).
  • Even though the All Variables model fits best, it requires more work to get the results. G2 alone is the most useful predictor: it is quick and easy to look at one set of scores, and it gives up little in fit (0.7930 versus 0.8365).

3. Prediction of risk of Failing (Logistic Regression)

Using the Student-Por dataset, we predicted the risk of failing with logistic regression. A student fails if the score is lower than 12 and passes if the score is 12 or higher.
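
The labelling rule and the model can be sketched as follows. This is only a minimal illustration: it assumes a single predictor (an earlier grade, centred on the cut-off) and plain batch gradient descent, whereas the source does not say which predictors or fitting routine were used. The grades are made-up toy values and every function name is hypothetical.

```python
import math

PASS_MARK = 12  # cut-off from the text: below 12 is a fail


def is_fail(score):
    """Return 1 if the score is a fail (< 12), else 0 (12 or higher passes)."""
    return 1 if score < PASS_MARK else 0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit P(fail) = sigmoid(w * x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b


# Hypothetical toy data: an earlier grade and the final score per student.
earlier = [5, 7, 8, 9, 10, 11, 11, 12, 12, 13, 14, 15, 16, 18]
final = [6, 8, 7, 10, 9, 11, 12, 11, 13, 13, 14, 15, 17, 18]

xs = [g - PASS_MARK for g in earlier]  # centre the predictor on the cut-off
ys = [is_fail(s) for s in final]       # 1 = fail, 0 = pass

w, b = fit_logistic(xs, ys)


def fail_probability(earlier_grade):
    """Estimated probability of failing, given the earlier grade."""
    return sigmoid(w * (earlier_grade - PASS_MARK) + b)


for grade in (6, 12, 17):
    print(grade, round(fail_probability(grade), 3))
```

The model outputs a probability of failing rather than a grade, so a separate decision rule (for example, "predict fail when the probability is at least 0.5") turns it into a pass/fail prediction.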

Confusion Matrix

Based on the confusion matrix, the model performs very well, with only 2 incorrect predictions. I looked at those two records, and they appear to have been misclassified because their predicted grades were very close to the cut-off of 12. I was not able to find any problems with the model.
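
For readers who want to see how such a matrix is tallied, here is a minimal sketch. It assumes, for illustration only, that each student has a predicted score that is compared with the cut-off of 12; the scores are made-up toy values and the function names are hypothetical.

```python
PASS_MARK = 12  # cut-off from the text: below 12 is a fail


def is_fail(score):
    return score < PASS_MARK


def confusion_matrix(actual_fail, predicted_fail):
    """Count the four outcomes, treating 'fail' as the class of interest."""
    counts = {"true_fail": 0, "false_fail": 0, "false_pass": 0, "true_pass": 0}
    for actual, predicted in zip(actual_fail, predicted_fail):
        if actual and predicted:
            counts["true_fail"] += 1    # failed, and predicted to fail
        elif predicted:
            counts["false_fail"] += 1   # passed, but predicted to fail
        elif actual:
            counts["false_pass"] += 1   # failed, but predicted to pass
        else:
            counts["true_pass"] += 1    # passed, and predicted to pass
    return counts


# Hypothetical toy scores; two predictions sit just across the cut-off.
actual_scores = [8, 10, 11, 12, 13, 15, 17]
predicted_scores = [7.5, 10.4, 12.2, 11.8, 13.1, 14.6, 16.9]

matrix = confusion_matrix(
    [is_fail(s) for s in actual_scores],
    [is_fail(s) for s in predicted_scores],
)
print(matrix)

# List the misclassified students to check how close they were to the cut-off.
misclassified = [
    (actual, predicted)
    for actual, predicted in zip(actual_scores, predicted_scores)
    if is_fail(actual) != is_fail(predicted)
]
print(misclassified)
```

In this toy data both errors fall within 0.2 points of the cut-off, which is the kind of check described above.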

4. Decision Tree

Using the Student-Por dataset, we created a decision tree with the variables available to predict pass/fail.

Decision Tree

Important Variables

Study time and previous failures are the most important variables. Reason for attending this school, mother's education, absences, age, and family educational support are also important, and the rest of the variables are not. If a student has many past class failures, they are more likely to fail again. If they study more than 1.5 hours, they have a better chance of passing. For those who don't study as much and whose mother is not well educated, the chances of failing are higher. These patterns seem to make sense.

This model can be used to identify students at risk by looking at their past failures. The intervention would be to encourage those students to study more, perhaps by forming a study group or offering after-school tutoring.
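
To give a sense of how a tree ranks candidate splits, here is a minimal sketch that scores two splits by weighted Gini impurity (lower is better). Gini is an assumption: it is a common split criterion, but the source does not state which one its tool used. The rows are made-up toy values, the 0.5 threshold on failures is hypothetical, and the 1.5 threshold on study time simply echoes the figure mentioned above.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (1 = fail)."""
    if not labels:
        return 0.0
    p_fail = sum(labels) / len(labels)
    return 1.0 - p_fail ** 2 - (1.0 - p_fail) ** 2


def split_impurity(rows, feature, threshold):
    """Weighted Gini impurity after splitting rows at feature <= threshold."""
    left = [r["fail"] for r in rows if r[feature] <= threshold]
    right = [r["fail"] for r in rows if r[feature] > threshold]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical toy students: coded study time, past failures, and outcome.
rows = [
    {"studytime": 1, "failures": 2, "fail": 1},
    {"studytime": 1, "failures": 1, "fail": 1},
    {"studytime": 1, "failures": 0, "fail": 1},
    {"studytime": 1, "failures": 0, "fail": 0},
    {"studytime": 2, "failures": 0, "fail": 0},
    {"studytime": 2, "failures": 1, "fail": 1},
    {"studytime": 3, "failures": 0, "fail": 0},
    {"studytime": 4, "failures": 0, "fail": 0},
]

candidates = [("studytime", 1.5), ("failures", 0.5)]
for feature, threshold in candidates:
    print(feature, threshold, round(split_impurity(rows, feature, threshold), 3))

# The split with the lowest weighted impurity would be chosen first.
best = min(candidates, key=lambda c: split_impurity(rows, c[0], c[1]))
print("best split:", best)
```

Which variable wins depends entirely on the data; the toy rows here say nothing about the ranking in the Student-Por dataset.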

5. Random Forest

Using the Student-Por dataset, we created a random forest with the variables available to predict pass/fail. Below are results using a threshold selection value of 0.5.

Random Forest

Confusion Matrix

Using a threshold of 0.5, the model classifies 71% of students correctly. Although I wouldn't say that Random Forest is better than the other prediction models, it does allow more control over the predictive modelling. For example, we can either set the threshold value manually or optimize it for accuracy, precision, recall, F-score, or specificity.

To create an application guide for instructor-led intervention for students at risk of failing the course, I would use a threshold selection value of 0.5. We want to minimize the number of students who are predicted to pass but would actually fail: those students need help, and we wouldn't want to leave them out of the intervention. We also don't want to include too many students who are predicted to fail but would actually pass, because then most of the class would end up in the intervention. A threshold selection value of 0.5 minimizes the sum of those two numbers.
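
The threshold choice described above can be sketched as a simple sweep. This assumes the model outputs a probability of failing and that a student is flagged when that probability is at or above the threshold; the source does not spell out its tool's convention. The probabilities and outcomes are made-up toy values, built so that 0.5 comes out best, and do not reproduce the actual results.

```python
def count_errors(fail_probs, actual_fail, threshold):
    """Return (missed_fails, false_alarms) when prob >= threshold means 'predict fail'."""
    missed_fails = 0   # predicted to pass, but actually failed
    false_alarms = 0   # predicted to fail, but actually passed
    for prob, failed in zip(fail_probs, actual_fail):
        predicted_fail = prob >= threshold
        if failed and not predicted_fail:
            missed_fails += 1
        elif predicted_fail and not failed:
            false_alarms += 1
    return missed_fails, false_alarms


# Hypothetical model outputs and true outcomes (1 = failed, 0 = passed).
fail_probs = [0.9, 0.8, 0.7, 0.6, 0.55, 0.45, 0.4, 0.3, 0.2, 0.1]
actual_fail = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
for t in thresholds:
    missed, alarms = count_errors(fail_probs, actual_fail, t)
    print(f"threshold={t}: missed fails={missed}, false alarms={alarms}, sum={missed + alarms}")

# Pick the threshold with the smallest total of both error types.
best_threshold = min(
    thresholds, key=lambda t: sum(count_errors(fail_probs, actual_fail, t))
)
print("best threshold:", best_threshold)
```

If missing an at-risk student is considered worse than flagging one unnecessarily, the two counts could be weighted differently before summing, which would generally push the chosen threshold lower.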