The data concern student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, and were collected using school reports and questionnaires. Two datasets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such a prediction is much more useful.
On the one hand, I think it is possible to predict the final grade from the previous grades if the student is fairly consistent. It is much harder to predict from just one grade, or when the student is experiencing issues that we may not be aware of. G3 based on G1: R² is 0.60; the scatterplot shows a linear relation, but the points are still quite scattered. G3 based on G2: R² is 0.79, a much better fit; the scatterplot shows a linear relation, but it is still a bit scattered. G3 based on G1 and G2: R² is 0.80, about the same as the previous model; the scatterplot shows a linear relation. G3 based on all variables: R² is 0.83, the highest of the four models. The scatterplot is very similar to the one for the model based on G1 and G2.
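To make the R² comparison above concrete, here is a minimal sketch of how the in-sample R² of such nested linear models can be computed. It uses NumPy and synthetic grades, not the real student data; the helper name `r_squared` and the way the grades are simulated are assumptions made for illustration only.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R^2 of an ordinary least squares fit with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    # Prepend a column of ones so the model includes an intercept.
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - ss_res / ss_tot

# Synthetic grades on a 0-20 scale (assumption): each period's grade is
# the previous one plus noise. This is NOT the real dataset.
rng = np.random.default_rng(0)
n = 300
g1 = np.clip(rng.normal(11, 3, n), 0, 20)
g2 = np.clip(g1 + rng.normal(0, 1.5, n), 0, 20)
g3 = np.clip(g2 + rng.normal(0, 1.5, n), 0, 20)

r2_g1 = r_squared(g1, g3)                            # G3 ~ G1
r2_g2 = r_squared(g2, g3)                            # G3 ~ G2
r2_both = r_squared(np.column_stack([g1, g2]), g3)   # G3 ~ G1 + G2

print(f"G1 only: {r2_g1:.2f}, G2 only: {r2_g2:.2f}, G1 and G2: {r2_both:.2f}")
```

Note that in-sample R² can never decrease when predictors are added to a least squares model, so a small gain (such as 0.80 to 0.83) is best confirmed on held-out data before concluding that the larger model predicts better.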
I think the most useful model would be the one where the final grade is based on all variables, because it has a higher R² than the other models, and because when choosing a model for prediction we also need to take into account other variables that may play a significant role in predicting the final grade, such as study time, free time, guardian, the guardians' jobs, health, travel time, absences, etc.
However, it may be too late to change the student's behaviour: to get all the variables we would probably need to wait until close to the end of the semester, and by then the prediction would not be as helpful.
Then I built a logistic regression based on the average grade, (G1+G2+G3)/3, to see if the model could predict whether students would pass or fail.
After that, I created a decision tree to see which variables other than grades are most important in determining whether a student passes or fails.
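As a small illustration of how a pass/fail target can be derived from the average grade, here is a sketch in Python. The pass mark of 10 on a 0-20 scale is an assumption (the threshold actually used is not stated here), and the function names and example grades are hypothetical.

```python
# Assumption: grades are on a 0-20 scale and an average of 10 or more is a pass.
PASS_MARK = 10

def average_grade(g1, g2, g3):
    """Average of the three period grades, (G1 + G2 + G3) / 3."""
    return (g1 + g2 + g3) / 3

def pass_fail(g1, g2, g3, pass_mark=PASS_MARK):
    """Label a student 'pass' if the average grade reaches the pass mark."""
    return "pass" if average_grade(g1, g2, g3) >= pass_mark else "fail"

# Hypothetical (G1, G2, G3) triples, not taken from the dataset.
students = [(12, 13, 14), (8, 9, 7), (10, 10, 10), (15, 6, 8)]
labels = [pass_fail(*grades) for grades in students]
print(labels)
```

A binary label like this is what a logistic regression or a classification tree would then be trained to predict.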
What are the most important variables according to the decision tree? The most important variables are school, failures and higher education.
What are some patterns/rules that appear to be present in the data? About 43% of students will fail, and only 4% of students who failed the class before will still pass it. Only 6% of students who do not want to go on to higher education passed the course. About 18% of students who consume alcohol on workdays will pass the course, but most of them will fail (about 53%), which is about 79% of the data. About 42% of students whose current health status is less than 2 (bad) will most likely fail the course. However, about 26% of students in bad health could still pass the course.
Do these rules make sense to you or are they just coincidence? Most of them make sense. I think these are important variables when predicting whether a student will pass or fail. Of course, the most direct influences on pass/fail are the number of past failures and the decision to pursue higher education, because the latter reflects the student's motivation.
Could this model be used to identify students at risk? I think it can be used to get a picture of which variables are important and of the probability that a student will fail the course.
Would you use this model? How? If I saw that a student fell into several of these categories and would most likely fail, I would probably talk to the student. I would not show this model to them, but I think it is good to have for the teacher's reference.
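One way to sanity-check percentages like those quoted in the rules above is to compute the pass rate within a subgroup directly from the data. The sketch below does this in plain Python on a few made-up records; the field names `failures`, `higher` and `passed`, and the records themselves, are assumptions for illustration and do not reproduce the figures in this report.

```python
def pass_rate(records, condition):
    """Share of students passing among those who satisfy `condition`.

    Returns None when no record matches, to avoid dividing by zero.
    """
    subgroup = [r for r in records if condition(r)]
    if not subgroup:
        return None
    passed = sum(1 for r in subgroup if r["passed"])
    return passed / len(subgroup)

# Hypothetical records, not taken from the dataset.
records = [
    {"failures": 0, "higher": "yes", "passed": True},
    {"failures": 0, "higher": "yes", "passed": True},
    {"failures": 1, "higher": "yes", "passed": False},
    {"failures": 2, "higher": "no", "passed": False},
    {"failures": 1, "higher": "no", "passed": True},
    {"failures": 0, "higher": "yes", "passed": False},
]

# Pass rate among students with at least one past failure, and with none.
rate_prior_failure = pass_rate(records, lambda r: r["failures"] > 0)
rate_no_failure = pass_rate(records, lambda r: r["failures"] == 0)
print(rate_prior_failure, rate_no_failure)
```

Comparing a subgroup's pass rate with the overall pass rate shows how much a rule actually separates students, which helps when deciding whether a rule is meaningful or a coincidence.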
I also built a random forest model to predict Pass/Fail for students using all the variables except G1, G2, G3 and the final grade: