The datasets contain demographic information about students in two secondary schools in Portugal, for two core course subjects: Math and Portuguese.

They contain information about variables such as: school name, student sex, student age, student address, size of student family (less than or equal to 3, or greater than 3), parents’ cohabitation status (together or apart), mother’s education, father’s education, mother’s job, father’s job, reason for selecting that school, guardian, home to school travel time, weekly study time, number of past failures, extra educational support (Y/N), family educational support (Y/N), extra paid classes within the course subject (Y/N), extracurricular activities (Y/N), attended nursery school (Y/N), wants to take higher education (Y/N), internet access at home (Y/N), in a romantic relationship (Y/N), quality of family relationships (Likert scale 1-5), free time after school (Likert scale 1-5), going out with friends (Likert scale 1-5), daily alcohol consumption, weekly alcohol consumption, current health condition, absences, grade in period 1, grade in period 2 of evaluation, grade in period 3 of evaluation. The grades are on a 20-point grading scale.

In the Math dataset, the mean grade stays close to 10.5 across the three periods of evaluation. The number of students who wish to pursue higher education is high: 375 out of 395, almost 95 percent. The number of absences is low. Around 45 percent of the students take extra paid classes for Math.

In the Portuguese dataset, the number of students who wish to pursue higher education is also quite high, almost 89 percent. In contrast to Math, only 6 percent of the students take extra paid classes for Portuguese.
The mean grade stays close to 11.5 across the three periods of evaluation.

In both datasets, the number of past failures is low for a large share of the students, approximately 94 percent.

Linear regression model

We first use the G1 grade (the grade in the first evaluation period) to predict the final grade (G3) in the Math dataset.

The goodness of fit of a regression predictive model refers to how well the model's predicted values match the actual values in the data. One way to measure the goodness of fit of a model is by looking at the R-squared value. R-squared is a measure that represents the proportion of the variance in the dependent variable that is explained by the independent variable(s) in the model. An R-squared value of 1 indicates a perfect fit between the model and the data.
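The definition above can be written out directly. The following is a minimal sketch in Python, using made-up grades rather than the actual dataset; the function name `r_squared` is an illustrative choice, not something taken from the analysis tool.

```python
def r_squared(actual, predicted):
    """Proportion of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: variation the model fails to explain.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: variation around the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical final grades, for illustration only.
actual = [10, 12, 8, 15, 11]

# Predictions that match the data exactly give a perfect fit.
perfect_fit = r_squared(actual, list(actual))

# Always predicting the mean grade explains none of the variance.
mean_grade = sum(actual) / len(actual)
mean_only_fit = r_squared(actual, [mean_grade] * len(actual))

print(perfect_fit, mean_only_fit)
```

A model that does no better than predicting the mean grade for everyone scores 0, which is why an R-squared of 0.6 reads as a moderate rather than a strong fit.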

We find that the R-squared value of the linear regression model G3~G1 is only 0.6. This means that the model explains 60% of the variation in the dependent variable, which can be considered a moderate fit. However, the 40 percent of unexplained variance might be a cause for concern.

Now, let’s see how the G3 grade relates to the G2 grade. Looking at the summary of the G3~G2 model, we can see that the R-squared value is 0.87, which is considerably better than with the G1 grade as the independent variable.
By using both G1 and G2 grades to predict G3, we see that the R-squared value improves further to approximately 0.91. The RMSE also falls to 1.6. This means that while G1 and G2 individually might not give a very accurate prediction of the G3 grade, together they predict it with greater accuracy.
Finally, when we use all the variables to predict the G3 grade, the R-squared value is around 0.83.
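A single-predictor model such as G3~G1 can be fitted by ordinary least squares and scored with RMSE on students the model has not seen. The sketch below assumes made-up grades and hypothetical helper names (`fit_line`, `rmse`); it does not reproduce the tool or the data behind the figures reported here.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the grades."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical (G1, G3) pairs, split into training and held-out students.
train_g1 = [8, 10, 12, 14, 16, 9]
train_g3 = [7, 10, 12, 15, 16, 9]
test_g1 = [11, 13, 7]
test_g3 = [11, 13, 6]

intercept, slope = fit_line(train_g1, train_g3)
test_pred = [intercept + slope * g for g in test_g1]
print(f"G3 = {intercept:.2f} + {slope:.2f} * G1, test RMSE = {rmse(test_g3, test_pred):.2f}")
```

Scoring on held-out students rather than on the training data is one way to compare models with different numbers of variables, since extra variables cannot make the fit on the training data worse.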

Let's answer a few questions:
* Could we predict the final grade based on the previous two grades? The previous grades seem to be fair choices of predictor variables, as they appear to be related to the final grade, G3. While there might be other factors to consider, it would be a fair assumption.
* Which of the created models (G3~G1, G3~G2, G3~G1+G2 or G3~All variables) is more accurate? Based on the results of the linear regression models, G3~G1+G2 is the most accurate. This might be because the G3~All variables model is overfitted, since it includes many variables that are not closely related to the final grade, while G3~G1+G2 includes both G1 and G2, which makes for more accurate predictions than either of those variables alone.
* Reflect about which of the created models is more useful and why?

It appears that the grades G1 and G2 have the strongest effect on the prediction of the final grade G3. The model G3~G1+G2 has the highest R-squared value, 0.91. So this model might be the most useful: it could identify at-risk students from their grades in the previous two evaluation periods while avoiding inaccurate predictions.

Logistic Regression Model to Predict Risk of Failing

Now, we create a logistic regression model to predict whether a student passes or fails, on the basis of all the variables. The final grade is taken as the mean of G1, G2, and G3, and the criterion for passing is a final grade greater than 12.
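The pass/fail label described here is simple to construct. This is a minimal sketch with hypothetical student records and function names (`final_grade`, `passed`), assuming the cutoff is applied strictly, so a mean of exactly 12 counts as a fail.

```python
def final_grade(g1, g2, g3):
    """Final grade taken as the mean of the three period grades."""
    return (g1 + g2 + g3) / 3


def passed(g1, g2, g3, cutoff=12):
    """A student passes only if the final grade is strictly above the cutoff."""
    return final_grade(g1, g2, g3) > cutoff


# Hypothetical (G1, G2, G3) triples, for illustration only.
students = [(10, 11, 12), (12, 12, 12), (14, 13, 15)]
labels = ["pass" if passed(*grades) else "fail" for grades in students]
print(labels)
```

Because the label is computed from G1, G2, and G3 alone, any model that also receives those grades as inputs already has everything needed to reconstruct it.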

Logistic Regression - Summary

Confusion Matrix_Pass/Fail

How good is the model? From the confusion matrix and the F1 score of 0.96, we can see that the precision and recall of the model on the test data are good.
In 8 instances, the model predicts a fail (a false negative) when the student actually passes, whereas in only 3 instances it predicts a pass (a false positive) when the student actually fails. Since the cost of not identifying students who might fail is higher, and such false passes are the rarer error, this model has above-average accuracy for the purpose.
However, by using G1, G2, and G3 (which together define the average final grade) to predict pass/fail, we are using the actual pass/fail information to predict itself. This makes the model highly accurate, but not very useful. If we keep only the variables that strongly influence the pass/fail outcome, and drop G2 and G3, the model can be made more useful. Adjusting the decision threshold can also help avoid false pass predictions.
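Precision, recall, and the F1 score all follow from the counts in the confusion matrix. In the sketch below, the 3 false positives and 8 false negatives come from the text, while the true-positive count is a placeholder assumption, since the full matrix is not reproduced here; the function name is also hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of predicted passes that are real passes
    recall = tp / (tp + fn)     # share of real passes that the model finds
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# fp and fn are taken from the text; tp is a hypothetical placeholder.
tp, fp, fn = 100, 3, 8
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Note that none of the three metrics uses the true-negative count, so a high F1 score says little about how well failing students are identified when "pass" is treated as the positive class.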

Logistic Regression - Summary

Decision Tree Model

Decision Tree - Tree

Decision Tree - Summary

According to the decision tree, if a student has fewer than 4 past failures, excellent health (a rating of 5), and average free time after school (3 on the rating scale), then the student has an 84 percent chance of passing. On the other hand, if the student has 4 or more past failures, the tree predicts a 0 percent chance of passing.

Decision Tree - Importance

What are the most important variables according to the decision tree? The most important variables are the students' past failures and health, along with the reason for selecting the school and free time after school.
What are some patterns/rules that appear to be present in the data? For example, one of the rules says that if a student has fewer than 4 past failures, excellent health (a rating of 5), low free time after school (less than 3 on the rating scale), and chose the school for either its distance from home or its reputation, then the student has a 91 percent chance of passing.
Do these rules make sense to you or are they just coincidence? To me, these rules seem coincidental, as variables such as the reason for selecting the school and free time after school cannot be relied upon, without further analysis, to predict whether a student passes or fails.
Could this model be used to identify students at risk? This model gives high importance to past failures, which seems reasonable. However, such students would presumably already have been identified as at-risk. The other variables that the model considers important seem coincidental. Using these to predict the final pass/fail outcome or to identify at-risk students might be problematic.
Would you use this model? How? One way to use this model would be to apply feature selection first, keeping fewer variables that have a strong effect on the dependent variable (pass/fail), to avoid overfitting and inaccurate predictions.

Random Forest Model

Random Forest - Summary

Random Forest - Pred. Matrix

Now, looking at the Random Forest model for predicting whether students pass or fail, we find that the F score on the test data is decent but not very high, at 0.62. This could be the result of including too many variables that are not closely related to whether the student passes or fails.

Is Random Forest better than the others? In this case, the Random Forest model does seem to be somewhat better than the logistic regression and decision tree models.
Which threshold selection value would you use to create an application to guide instructor-led intervention for students at risk of failing the course? Since it is extremely important to identify at-risk students who might fail the course, the model should be extremely sensitive to false passes, that is, students predicted to pass who actually fail. So increasing the decision threshold, for example to 0.7, might be better for this application.
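The effect of the threshold can be illustrated with a small sketch. It assumes that the model outputs a probability of passing and that a student is predicted to pass when this probability is at or above the threshold; the probabilities, outcomes, and function name below are made up for illustration.

```python
def count_errors(prob_pass, actually_passed, threshold):
    """Return (false_passes, false_fails) at a given decision threshold."""
    false_passes = 0
    false_fails = 0
    for p, actual in zip(prob_pass, actually_passed):
        predicted_pass = p >= threshold
        if predicted_pass and not actual:
            false_passes += 1  # at-risk student missed by the model
        elif not predicted_pass and actual:
            false_fails += 1   # student flagged who would have passed
    return false_passes, false_fails


# Hypothetical predicted probabilities of passing and true outcomes.
prob_pass = [0.95, 0.80, 0.65, 0.55, 0.40, 0.20]
actually_passed = [True, True, False, True, False, False]

for threshold in (0.5, 0.7):
    false_passes, false_fails = count_errors(prob_pass, actually_passed, threshold)
    print(threshold, false_passes, false_fails)
```

With these made-up numbers, raising the threshold from 0.5 to 0.7 removes the one false pass at the cost of one false fail, which is the trade-off an intervention-focused application would likely accept.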