Assignment II Report

1. Loading and Understanding the Data

First, the two schools, Gabriel Pereira (GP) and Mousinho da Silveira (MS), are not equally represented. The imbalance is largest in the Math performance data, with 349 entries from GP but only 46 from MS. The Portuguese language performance data is somewhat more balanced, with 423 entries from GP and 226 from MS. With such large differences in sample size, I am not sure whether the analysis and comparison of the two schools at the later stages will be reliable.
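To put the imbalance in numbers, the counts quoted above can be turned into shares. This is only a small sketch in Python; the dictionary layout and the function name are my own and are not part of the assignment code.

```python
# Entry counts per school, taken from the paragraph above.
counts = {
    "Math": {"GP": 349, "MS": 46},
    "Portuguese": {"GP": 423, "MS": 226},
}

def school_shares(by_school):
    """Return each school's share of the entries in one dataset."""
    total = sum(by_school.values())
    return {school: n / total for school, n in by_school.items()}

shares = {subject: school_shares(c) for subject, c in counts.items()}
for subject, s in shares.items():
    print(subject, {school: round(p, 3) for school, p in s.items()})
```

With these counts, MS makes up roughly 12% of the Math entries and roughly 35% of the Portuguese entries.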

Second, for "activities" (i.e. extra-curricular activities), the Math and Portuguese language performance data show different patterns. More students responded "yes" in the former, while more students responded "no" in the latter. This is interesting because most of the other variables show similar patterns in both datasets. I wonder whether we will be able to find the reason behind this at the later stages.

2. Prediction of Final Grade based on first and second evaluation

All four graphs show a similar linear relationship, meaning that the final grade can be predicted reasonably well from the previous two grades.
The Adj R Squared for each model is as follows:

G3~G1: 0.60
G3~G2: 0.79
G3~G1+G2: 0.80
G3~all variables: 0.81

Since G3~all variables has the highest Adj R Squared, this model fits the data best, although only by a small margin. G3~G2 and G3~G1+G2 fit almost equally well, while G3~G1 has the lowest fit.
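The Adj R Squared values are compared rather than the plain R Squared because the adjusted version penalises extra predictors. The sketch below uses the standard formula; the sample size of 395 and the predictor counts are assumptions chosen for illustration, not values taken from the model output.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical inputs: the same raw R^2 with 2 versus 30 predictors.
n = 395  # assumed sample size, for illustration only
few_predictors = adjusted_r_squared(0.80, n, 2)
many_predictors = adjusted_r_squared(0.80, n, 30)

print(round(few_predictors, 4), round(many_predictors, 4))
```

For the same raw R Squared, the model with more predictors gets the lower adjusted value, so a small gain from adding all variables is weaker evidence than it first appears.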

The model that takes G1+G2 seems to be the most useful. Its Adj R Squared is slightly higher than that of the models using G1 or G2 alone, and it also contains fewer outliers than the model using all variables.

3. Prediction of risk of Failing (Logistic Regression)

1. The model looks quite accurate to me. The majority of the Fail and Pass predictions are true positives and true negatives. There is only 1 Type I error and 1 Type II error, and the accuracy rate is nearly 0.99.
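For reference, accuracy and the two error rates can be read off a confusion matrix as in the sketch below. Only the single Type I and single Type II error come from my results; the true positive and true negative counts are hypothetical placeholders, so the printed numbers are illustrative only.

```python
def confusion_summary(tp, fp, fn, tn):
    """Summarise a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "type_I_rate": fp / (fp + tn),   # false positives among actual negatives
        "type_II_rate": fn / (fn + tp),  # false negatives among actual positives
    }

# tp and tn are hypothetical; fp = 1 and fn = 1 match the errors reported above.
summary = confusion_summary(tp=60, fp=1, fn=1, tn=100)
print({name: round(value, 3) for name, value in summary.items()})
```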

2. The problem is that we included G1, G2 and G3 as predictors of the final result. Together, G1, G2 and G3 are basically the final result. When we use the final result to predict itself, the model is bound to look accurate.
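The effect of this leakage can be shown with made-up data. In the sketch below the pass mark of 10 and the random grades are assumptions (the report does not state the cut-off), and the "model" is just a rule that is allowed to look at G3.

```python
import random

PASS_MARK = 10  # assumed cut-off, not stated in the report

random.seed(0)
g3 = [random.randint(0, 20) for _ in range(200)]  # synthetic final grades
passed = [grade >= PASS_MARK for grade in g3]     # label derived from G3

# A "model" that sees G3 can simply reproduce the rule that defines the label.
leaky_pred = [grade >= PASS_MARK for grade in g3]
leaky_acc = sum(p == y for p, y in zip(leaky_pred, passed)) / len(passed)

# Baseline that ignores every predictor and always guesses the majority class.
majority_acc = max(sum(passed), len(passed) - sum(passed)) / len(passed)

print(leaky_acc, round(majority_acc, 3))
```

The leaky rule is always right by construction, which is why a near-perfect accuracy says little about how useful the model would be before the final grade is known.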

3. In the first new model, I excluded G1, G2 and G3 (that is, all test scores) and included all other variables as predictors to build the logistic regression model. The accuracy of this model is 0.71.

New model using variables except G1, G2 and G3

The accuracy of the first model is not very high, so I decided to include one test score as a predictor in the second model. In the second new model, I excluded only G2 and G3 and included G1 and the other variables as predictors. The accuracy of this model is 0.89. I think this model should be able to give a more accurate picture of the students' final grades.

New model using all variables except G2 and G3

4. My conclusion is that accuracy is not the main purpose of using a prediction model. Instead, the key is the ability to find students at risk and help them. So when we compare different models, we should not just choose the one with the highest accuracy but also think about whether the variables used are relevant.

4. Decision Tree

The most important variable is failures (i.e. number of past class failures).
The tree suggests that students who are more willing to spend time and effort on their studies have a higher chance of passing. Variables such as "higher" (wants to take higher education), "school", and "Walc" (weekend alcohol consumption) are also important in the tree.
I think these rules make sense. Failures is the most important variable to me, because a student who has failed in the past is likely to fail future tests as well. "Higher" also makes sense to me, because students who want to take higher education are usually more serious about their studies, so they are likely to study harder. "Walc" is less important than the other two, but lower alcohol consumption may also indicate that students spend less time going out and more time studying.

However, there are some variables that I think may not be completely relevant, such as "school". Its effect may be explained by the different teaching and marking styles at the two schools, which could contribute to a higher pass tendency at one particular school.

e.i. Yes. I can use this tree to identify at-risk students: those who failed past tests, do not want to take higher education, and consume more alcohol at weekends. Teachers may pay more attention to these students. The accuracy rate of the model is 66%. In general, I think this tree has done a good job in identifying students at risk.

Yes, I would use this model. The analysis of at-risk students in this tree is quite accurate, and its visualization of the different variables and their importance is quite easy to understand. But I would also determine which variables make more sense to me (e.g. "failures", "higher") and make my own judgement on which categories of students I would want to pay more attention to.

5. Random Forest

Random Forest (Threshold: 0.5)

Random Forest (Threshold: 0.54)

The accuracy rates for different methods are:

new logistic regression model (without any test scores) = 0.71
decision tree = 0.66
random forest (Threshold: 0.5 and 0.54) = 0.69 - 0.72

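The random forest has a better accuracy rate than the decision tree at both thresholds. Compared with the new logistic regression model (the one without test scores), it is slightly more accurate at the 0.54 threshold (0.72 vs 0.71) and slightly less accurate at the 0.5 threshold (0.69 vs 0.71).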

I would choose the model with a 0.5 threshold. Although the one with the 0.54 threshold has a higher accuracy rate, accuracy is not the main deciding factor when designing an instructor-led intervention.

One main reason I chose the 0.5 threshold is that it generates relatively fewer false negatives (i.e. students who are predicted to pass but eventually fail) and relatively more false positives (i.e. students who are predicted to fail but eventually pass). As the purpose of using this model is to provide instructor-led intervention for students at risk, false positives are preferable to false negatives. It is better to help some students who turn out not to need help than to overlook students who need help and receive no support.
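The trade-off between the two thresholds can be sketched as follows. I assume here that the model outputs a predicted probability of failing and that a student is flagged when this probability reaches the threshold; the probabilities and outcomes below are made up for illustration and are not taken from the random forest output.

```python
def count_errors(p_fail, failed, threshold):
    """Flag a student as at risk when predicted P(fail) >= threshold."""
    false_negatives = sum(
        1 for p, y in zip(p_fail, failed) if p < threshold and y
    )
    false_positives = sum(
        1 for p, y in zip(p_fail, failed) if p >= threshold and not y
    )
    return {"false_negatives": false_negatives, "false_positives": false_positives}

# Made-up predicted probabilities of failing and the actual outcomes.
p_fail = [0.10, 0.30, 0.48, 0.51, 0.52, 0.53, 0.60, 0.75, 0.90]
failed = [False, False, False, False, True, False, True, True, True]

at_050 = count_errors(p_fail, failed, 0.50)
at_054 = count_errors(p_fail, failed, 0.54)
print("threshold 0.50:", at_050)
print("threshold 0.54:", at_054)
```

In this toy data the lower threshold flags two students who would have passed anyway but misses none who fail, while the higher threshold flags nobody unnecessarily but misses one failing student. For an intervention, the first kind of error is the cheaper one.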