TPLA Assignment 2

Sarah Dar

Part 1 Loading and Understanding the Data

Downloaded both data sets
Two noteworthy things I found from the data:

i) In both datasets, most students belong to the GP school, and only a small proportion come from the MS school. Two different schools bring in a whole range of environmental factors that affect student grades, and these differences may not have been accounted for in the data, so they can cause disturbances in the data coming from students at the MS school.

ii) In both datasets, the majority of both the mothers' and the fathers' occupations fall in the category "other". If these variables are found to affect G3 significantly, the data won't be very useful, since the occupation of most parents is effectively unknown.

Part 2 Prediction of Final Grade based on first and second evaluation

Using the student_mat data set, I created a linear regression using G1 as the predictor variable:

R squared, or goodness of fit, was found to be close to 0.6, which means it is a fairly good fit and the variable should be used.
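As a minimal sketch of what this step computes, the snippet below fits a one-predictor least-squares line and derives R squared from the residuals. The grade values are hypothetical and are not taken from student_mat, and the function names are illustrative only, so the printed R squared will not match the value reported above.

```python
# Hypothetical grades on the 0-20 scale; NOT taken from student_mat.
g1 = [5, 8, 10, 12, 14, 15, 9, 11]
g3 = [6, 7, 11, 12, 15, 16, 8, 12]


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(y, y_hat):
    """Share of the variance in y that the fitted values y_hat explain."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


intercept, slope = fit_simple_ols(g1, g3)
predicted = [intercept + slope * x for x in g1]
r2 = r_squared(g3, predicted)
print(f"G3 = {intercept:.2f} + {slope:.2f} * G1, R^2 = {r2:.3f}")
```

The list of predicted values is what would be plotted against G3 in the scatter plot.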

Using G1 to predict G3, a scatterplot is made

Here is the scatter plot between G3 and its predicted value

Repeating these steps for G2

After running a linear regression on G3 using G2, the goodness of fit is found to be 0.87, which is very good. I then ran this regression on the test data to find the predicted values.

Here's a scatterplot with the new predicted values and G3

The new predicted values are even closer to the true values.

Repeating the same steps with G1 and G2 together

R^2 is found to be very good: 0.91
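A rough sketch of the two-predictor fit follows, solving the normal equations directly in plain Python. The data are hypothetical (not from student_mat) and the helper names are illustrative assumptions, so the numbers will differ from the 0.91 reported here.

```python
# Hypothetical grades on the 0-20 scale; NOT taken from student_mat.
g1 = [5, 8, 10, 12, 14, 15, 9, 11]
g2 = [6, 7, 10, 13, 14, 16, 8, 11]
g3 = [6, 7, 11, 12, 15, 16, 8, 12]


def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Least-squares coefficients; each row starts with 1.0 for the intercept."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def r_squared(y, y_hat):
    """Share of the variance in y that the fitted values y_hat explain."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def fit_and_score(rows, y):
    """Fit the model and return (coefficients, fitted values, in-sample R^2)."""
    coefs = fit_ols(rows, y)
    fitted = [sum(c * v for c, v in zip(coefs, row)) for row in rows]
    return coefs, fitted, r_squared(y, fitted)


rows_g1 = [[1.0, a] for a in g1]
rows_both = [[1.0, a, b] for a, b in zip(g1, g2)]
_, _, r2_g1 = fit_and_score(rows_g1, g3)
coefs_both, fitted_both, r2_both = fit_and_score(rows_both, g3)
print(f"R^2 with G1 only: {r2_g1:.3f}")
print(f"R^2 with G1 + G2: {r2_both:.3f}")
```

On the data used for fitting, R^2 cannot go down when a predictor is added, so a comparison between models is more informative when it is made on the test data.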

The scatter plot shows values that are even closer to the true value:

Running these steps on all variables to predict G3

R^2 turns out to be a perfect fit: 1

The scatter plot is as follows:

Reflect about if we could predict the final grade based on the previous two grades.

The final grade can be predicted from the previous two grades with good accuracy. This is because the goodness of fit is high for each grade individually (about 0.6 for G1 and 0.87 for G2), and for both of them together it is 0.91.

Reflect about what of the created models (G3~G1, G3~G2, G3~G1+G2 or G3~All variables) is more accurate

G3 ~ G1+G2 is most accurate.

Since the goodness of fit of G3 ~ G1+G2 is 0.91, about 91% of the variation in G3 is explained by the variables G1 and G2. That leaves very little variation for the rest of the variables to explain, and including them can cause inaccuracy in prediction.

Reflect about what of the created models is more useful and why

The G1+G2 linear regression model is the most useful, primarily because it is highly accurate. Secondly, G1 and G2 are easy for the school to collect, so this can prove to be the most useful model for final grade prediction for them.

Part 3 Prediction of risk of Failing (Logistic Regression)

i) Created a new column Passfail
ii) Ran logistic regression with all variables as predictor variables
iii) Ran on test data to generate predicted labels with threshold 0.5
iv) Formed a pivot table with passfail as rows and predicted value as column:

v) Answer the following questions:

1. How good is the model?

Apparently it is a very accurate model with negligible false positives and false negatives.

2. Is there any problem with the model?

I think that the final grade variable is causing artificial accuracy in the predicted labels, because it is the variable that was directly used to form the true pass/fail labels. Moreover, since it is the average of G1, G2, and G3, these variables follow a similar distribution, so all 4 variables need to be excluded from the predictor variables for the model.

3. If there is a problem with the model, create a new model without the problem and check its confusion matrix.

After removing these variables, here's the pivot table:

4. What is your conclusion?

The variables included do give some direction to the predicted value of passfail, but they are not accurate enough to be used reliably.
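The pivot tables in this part are confusion matrices. The sketch below shows one way to build one from true labels and thresholded probabilities. The pass mark of 10, the grades, and the predicted probabilities are all assumptions made for illustration. They are not the assignment's data or the output of the fitted logistic regression.

```python
from collections import Counter

PASS_MARK = 10  # assumption: the report does not state the pass cut-off

# Hypothetical final grades and predicted probabilities of passing.
final_grades = [4, 8, 9, 10, 11, 12, 13, 15, 7, 16]
p_pass = [0.10, 0.55, 0.30, 0.45, 0.70, 0.80, 0.65, 0.95, 0.20, 0.90]

# True labels derived from the final grade, as in the Passfail column.
actual = ["pass" if g >= PASS_MARK else "fail" for g in final_grades]


def confusion_counts(actual, p_pass, threshold=0.5):
    """Count (actual, predicted) label pairs after thresholding p_pass."""
    predicted = ["pass" if p >= threshold else "fail" for p in p_pass]
    return Counter(zip(actual, predicted))


counts = confusion_counts(actual, p_pass, threshold=0.5)
correct = counts[("pass", "pass")] + counts[("fail", "fail")]
accuracy = correct / len(actual)

# Rows are the actual labels, columns the predicted labels.
print("actual  pred_fail  pred_pass")
for label in ("fail", "pass"):
    print(f"{label:<7} {counts[(label, 'fail')]:>9}  {counts[(label, 'pass')]:>9}")
print(f"accuracy = {accuracy:.2f}")
```

Because the true labels here are computed from the final grade, any predictor derived from that grade would reproduce them almost exactly, which is the leakage problem described in question 2.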

Part 4 Decision Tree

Converted required columns to factor
After running the model, here's the decision tree formed:

Analysis:

In the decision tree, the first node carries the decision "Pass", with a misclassification percentage of 53%. This node contains 100% of the data because it is the root node. The tree has a height of 5.

The partition rule for the first node is whether the number of failures is less than 0.5 or greater than or equal to 0.5. Students with failures greater than or equal to 0.5 go directly to a fail node, and the rest move on to the next node. The decision for that next node is also pass, and its partition condition is whether or not study time == 1. Students for whom it is go to a fail node, and the others go to a pass node.

In the fail node, the next condition is mother's education: if it is 0, 1, or 3, all the students fail, and if not, they are divided into pass and fail based on the reason for choosing the school. Going back up the tree, students whose study time is not equal to 1 end up in a pass node where the condition checked is family support. Further conditions that help divide students into pass/fail nodes are the number of absences and the reason for choosing the school.
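The upper part of the tree described above can be written as nested conditions. This is only a partial sketch: the function and parameter names (failures, studytime, medu) are assumptions for illustration, and branches whose exact split values are not given in the analysis return a placeholder instead of a prediction.

```python
def partial_tree_prediction(failures, studytime, medu):
    """Follow only the splits spelled out in the analysis.

    Returns "fail" where the description reaches a fail leaf, and a
    placeholder string where the remaining splits are not specified.
    """
    if failures >= 0.5:
        # Any past failure sends the student straight to a fail leaf.
        return "fail"
    if studytime == 1:
        if medu in (0, 1, 3):
            return "fail"
        return "undetermined: depends on reason for choosing the school"
    return "undetermined: depends on family support, absences and reason"


# Hypothetical students, not rows from the dataset.
print(partial_tree_prediction(failures=2, studytime=3, medu=4))
print(partial_tree_prediction(failures=0, studytime=1, medu=1))
print(partial_tree_prediction(failures=0, studytime=2, medu=4))
```

Writing the rules this way makes it easy to see which students the tree flags from only three easily collected variables.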

Importance:

Prediction Matrix:

Summary:

Answer the following questions
What are the most important variables according to the decision tree?

Primarily, Failures and study time

Secondarily, absences, reason for choosing school, mother's education, and family support

What are some patterns/rules that appear to be present in the data?

Having 0.5 or more past failures leads the tree to classify the student as failing the class
Among students with failures less than 0.5 and study time equal to 1, those whose mother's education is 0, 1, or 3 all fail, which suggests that the mother's education matters to whether the child passes the class
Having family support is also important to passing the class

Do these rules make sense to you or are they just coincidence?

They do make sense because a child who has had failures in the past is more likely to exhibit behaviors that would lead them to fail this class as well. The more time a student spends studying, the less likely they are to fail. Without family support, it is hard for children to pass a class. Finally, since most children spend most of their time with their mother, the level of the mother's education shapes their perception of studies and of the importance of passing a class (a little far-fetched but true).

Answer the following questions
Could this model be used to identify students at risk?

Yes. The decision tree is well suited to identifying students who may fail. The number of past failures (data that is easy for the school to gather), the mother's education, the study time, and family support are the most important factors to consider.

Would you use this model? How?

Yes

I would identify students who have had one or more failures before, whose study time falls under category 1, whose mother's education falls among categories 0, 1, and 3, and who don't have family support, and then work to mitigate these factors to prevent them from failing.

Part 5 Random Forests

After generating a random forest

Here is the pivot table formed with threshold 0.5:

Here is the pivot table formed with threshold 0.2:

Here is the pivot table formed with threshold 0.8:

Based on the results answer the following questions:
Is Random Forest better than the others?

For determining pass/fail based on variables other than those used to calculate pass/fail, random forest has the highest accuracy.

Which threshold value would you use to create an application to guide instructor-led intervention for students at risk of failing the course?

When such an intervention is in use, we would want to avoid type II errors, since a student being deprived of the intervention because of a type II error is more detrimental than a student who is wrongly predicted to be at risk of failing and then receives help from the intervention. We would therefore pick a threshold that minimizes the type II error in our predictions, and hence would use a higher threshold value: 0.7.
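The trade-off can be illustrated with a small sweep over the three thresholds tried above. This sketch assumes that the threshold is applied to the predicted probability of passing, so that raising it flags more students as at risk. If the model outputs the probability of failing instead, the direction reverses. The labels and probabilities are hypothetical and do not come from the random forest.

```python
# Hypothetical data: True means the student actually fails the course.
actual_fail = [True, True, True, True, False, False, False, False, False, False]
# Hypothetical predicted probabilities of PASSING (assumed convention).
p_pass = [0.15, 0.35, 0.60, 0.75, 0.40, 0.65, 0.72, 0.85, 0.90, 0.95]


def error_counts(actual_fail, p_pass, threshold):
    """Return (missed, false_alarms) for a given pass-probability threshold.

    missed: at-risk students predicted to pass (they get no intervention).
    false_alarms: passing students flagged as at risk.
    """
    missed = 0
    false_alarms = 0
    for fails, p in zip(actual_fail, p_pass):
        predicted_pass = p >= threshold
        if fails and predicted_pass:
            missed += 1
        elif not fails and not predicted_pass:
            false_alarms += 1
    return missed, false_alarms


results = [(t, error_counts(actual_fail, p_pass, t)) for t in (0.2, 0.5, 0.8)]
for threshold, (missed, false_alarms) in results:
    print(f"threshold {threshold}: missed = {missed}, false alarms = {false_alarms}")
```

Under this convention, a higher threshold trades fewer missed at-risk students for more students who receive help they did not strictly need.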