Assignment 2: Predictive Models

The data provided concern student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social, and school-related features, and the data were collected using school reports and questionnaires. Two datasets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such a prediction is much more useful.

G3 based on G1

G3 based on G2

G3 based on G2+G1

Final grade based on all variables

I think it is possible to predict the final grade from the previous grades if the student is fairly consistent. It is much harder to predict from just one grade, or when the student is experiencing issues that we may not be aware of.

G3 based on G1: R² is 0.60. The scatterplot shows a linear relation, but the points are still quite scattered.
G3 based on G2: R² is 0.79, which is a much better fit. The scatterplot shows a linear relation, but the points are still a bit scattered.
G3 based on G1 and G2: R² is 0.80, about the same as the previous model. The scatterplot shows a linear relation.
G3 based on all variables: R² is 0.83, the highest of all the models so far. The scatterplot is very similar to the one based on G1 and G2.
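The R² values above can be reproduced in principle with a small least-squares fit. The following is only a minimal sketch: the grade lists are made up for illustration (they are not the assignment data), and the function names `fit_line` and `r_squared` are hypothetical helpers, not part of the tool used for this report.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def r_squared(xs, ys, a, b):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up grades on a 0-20 scale, for illustration only.
g1 = [8, 10, 12, 14, 9, 15, 11, 7]
g3 = [9, 10, 13, 14, 8, 16, 10, 6]

a, b = fit_line(g1, g3)
r2 = r_squared(g1, g3, a, b)
print(f"G3 = {a:.2f} + {b:.2f} * G1, R^2 = {r2:.2f}")
```

An R² close to 1 means the points lie close to the fitted line; a lower value corresponds to the more scattered plots described above.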

I think the most useful model is the one where the final grade is based on all variables. It has the highest R² of the four models, and when choosing a model for prediction we also need to take into account other variables that may play a significant role in the final grade, such as study time, free time, guardian, the guardians' jobs, health, travel time, absences, etc.

However, collecting all the variables would probably mean waiting until close to the end of the semester. By then it may be too late to change the student's behaviour, so the prediction would be less helpful.

Then I built a logistic regression model based on the average grade, (G1+G2+G3)/3, to see whether it could predict if students would pass or fail.
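As a rough illustration of how a pass/fail label can be derived from the average grade, here is a minimal sketch. The pass mark of 10 on a 0-20 scale is an assumption (the report does not state the cut-off), and `pass_fail` is a hypothetical helper name.

```python
PASS_MARK = 10  # assumption: the report does not state the actual cut-off


def pass_fail(g1, g2, g3, pass_mark=PASS_MARK):
    """Label a student 'pass' or 'fail' from the average of the three grades."""
    average = (g1 + g2 + g3) / 3
    return "pass" if average >= pass_mark else "fail"


print(pass_fail(12, 11, 13))  # average 12.0
print(pass_fail(9, 9, 11))    # average about 9.67
```

Because the label is computed directly from G1, G2 and G3, a model that receives those three grades as inputs has everything it needs to reconstruct the label.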

Logistic Regression

How good is the model? The model seems to be quite accurate: there is only one error for pass and 6 errors for fail out of 194 cases in total, which is almost a perfect prediction.
Is there any problem with the model? I haven’t found any problem with the model itself, because G1, G2 and G3 are the strongest predictors of pass and fail. The model is based on the components of the grade: we are not predicting data that doesn’t exist, but data that is directly related to the final result.
If there is a problem with the model, create a new model without the problem and check its confusion matrix.
The result changes if we use just G1 instead of all variables. The model becomes less accurate (0.81); however, predicting from the first grade would be more useful, because it is available earlier in the semester.
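The error counts reported for the first logistic regression can be turned into an accuracy figure with simple arithmetic. This is a minimal sketch, assuming the counts mean 1 misclassified case for pass and 6 for fail (7 errors in total out of 194); `accuracy_from_errors` is a hypothetical helper name.

```python
def accuracy_from_errors(total_cases, error_counts):
    """Accuracy = correctly classified cases / all cases."""
    total_errors = sum(error_counts)
    return (total_cases - total_errors) / total_cases


# Assumed reading of the confusion matrix: 1 error for pass, 6 for fail.
acc_all_grades = accuracy_from_errors(194, [1, 6])
print(f"accuracy with G1, G2, G3: {acc_all_grades:.3f}")
```

Under that reading the accuracy is 187/194, noticeably higher than the 0.81 reported for the model that uses only G1.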

Logistic regression based on G1

What is your conclusion? As mentioned above, I think the logistic regression model is quite accurate at predicting a pass or fail grade based on the average score. However, the prediction is too easy when all the information is given, so that model has little use. If we restrict the inputs to variables available at the beginning of the semester, the model becomes more useful and valuable. I would not use this prediction as a prescription for students, but it helps the instructor see where their students stand, and it may be a good conversation starter with students who could be failing the class.

After that, I created a decision tree to see which variables other than grades are most important in determining whether a student passes or fails.

Decision Tree - Tree

Decision Tree - Importance

Decision Tree - Prediction Matrix

What are the most important variables according to the decision tree? The most important variables are school, failures and higher education.

What are some patterns/rules that appear to be present in the data? About 43% of students will fail, and only 4% of students who failed the class before will still pass it. Only 6% of students who don’t want to go on to higher education passed the course. About 18% of people who consume alcohol on workdays will pass the course, but most of them will fail (about 53%), which is about 79% of the data. About 42% of students whose current health status is less than 2 (bad) will most likely fail the course. However, about 26% of students in bad health still passed the course.

Do these rules make sense to you or are they just coincidence? Most of them make sense. I think these are important variables for predicting whether a student will pass or fail. The most direct influences on pass/fail are the number of past failures and the decision to pursue higher education, because the latter reflects the student's motivation.

Could this model be used to identify students at risk? I think it can be used to get a picture of which variables are important and of the probability that a student fails the course.

Would you use this model? How? If I saw that a student falls under several of these categories and will most likely fail, I would probably talk to the student, but I would not show them this model. I think it is good to have it for the teacher's reference.
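A rule read off the tree can be checked against the data by counting how many students it covers and how many of those passed. The sketch below uses a handful of made-up records, not the assignment data; the `passed` field and the `rule_stats` helper are hypothetical names introduced for illustration, while `failures` and `higher` stand in for the variables named above.

```python
def rule_stats(rows, condition):
    """Return (coverage, pass_rate) for the rows that satisfy the condition."""
    matched = [row for row in rows if condition(row)]
    if not matched:
        return 0.0, None  # the rule covers no students
    coverage = len(matched) / len(rows)
    pass_rate = sum(1 for row in matched if row["passed"]) / len(matched)
    return coverage, pass_rate


# Made-up records, for illustration only.
rows = [
    {"failures": 0, "higher": "yes", "passed": True},
    {"failures": 0, "higher": "yes", "passed": True},
    {"failures": 1, "higher": "yes", "passed": False},
    {"failures": 2, "higher": "no", "passed": False},
    {"failures": 0, "higher": "no", "passed": False},
    {"failures": 1, "higher": "yes", "passed": True},
    {"failures": 0, "higher": "yes", "passed": True},
    {"failures": 3, "higher": "no", "passed": False},
]

coverage, pass_rate = rule_stats(rows, lambda row: row["failures"] > 0)
print(f"failed before: {coverage:.0%} of students, pass rate {pass_rate:.0%}")
```

A rule that covers only a few students, or whose pass rate is close to the overall pass rate, is more likely to be a coincidence than a useful pattern.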

I also built a random forest model to predict Pass/Fail for students using all the variables except G1, G2, G3 and the final grade:

Random Forest 1

Is Random Forest better than the others? It is the most accurate model, with an accuracy rate of almost 84% (taking into account that we are not using any grades).
Which threshold selection value would you use to create an application to guide instructor-led intervention for students at risk of failing the course? By testing, I found that 0.5 or 0.6 (which would also be an F score) would be an optimal threshold for predicting which students will fail the course. There are some errors in the 65 cases out of 194, so I would not necessarily rely fully on this prediction. However, if the instructor knows some of the student's background and how the student progressed in the past, that would give the instructor a better picture and let them create a better guide for an intervention.
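The effect of the threshold can be explored by computing precision, recall and F1 for the "fail" class at each candidate value. The following is a minimal sketch with made-up predicted probabilities and outcomes, not the random forest output from this assignment; `metrics_at` is a hypothetical helper name.

```python
def metrics_at(threshold, p_fail, actual_fail):
    """Precision, recall and F1 for 'fail' when flagging p_fail >= threshold."""
    tp = fp = fn = 0
    for p, failed in zip(p_fail, actual_fail):
        flagged = p >= threshold
        if flagged and failed:
            tp += 1  # correctly flagged as at risk
        elif flagged and not failed:
            fp += 1  # flagged, but the student passed
        elif not flagged and failed:
            fn += 1  # missed a student who failed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up predicted probabilities of failing and true outcomes.
p_fail = [0.9, 0.8, 0.65, 0.55, 0.45, 0.4, 0.3, 0.2, 0.1, 0.7]
actual_fail = [True, True, True, False, True, False, False, False, False, False]

for threshold in (0.4, 0.5, 0.6):
    precision, recall, f1 = metrics_at(threshold, p_fail, actual_fail)
    print(f"threshold {threshold}: precision={precision:.2f} "
          f"recall={recall:.2f} F1={f1:.2f}")
```

A lower threshold flags more students (higher recall, more false alarms), while a higher one flags fewer. For an intervention led by the instructor, missing an at-risk student may cost more than an unnecessary conversation, which is one way to choose between 0.5 and 0.6.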