Using the data sets, we are trying to predict the score (Pass/Fail) and dropout from the class. We use variables such as: period of time (the first 50 days since the start of the course), average grade, number of assessments delivered, average assessment score, lateness, clicks on the LMS, and general student information such as student id, gender, highest education, socio-economic status, age, and disability. We have only looked at the DDD course.
Here is the Decision Tree Model for Pass/Fail based on average score:
From the model we can tell that if a student's average score is below 67, the student will most likely fail (36% of this group still pass). If the average score is above 67, the student will most likely pass (about 85% of this group pass). The middle number, 57%, is the percentage of the whole population that passes.
The most important variable here is the average score of the student.
In the summary and prediction matrix it is apparent that the accuracy on the test set is about 74%, with an error rate a little over 25%. The accuracy on the training set is higher, at about 76%, but we use the training data to create the model and the test data to evaluate it. The test accuracy of 74% is the honest estimate because the test set contains data the algorithm has not seen before.
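The split at 67 amounts to a single decision rule, which can be sketched in a few lines. This is a minimal pure-Python illustration: the threshold 67 comes from the tree above, but the student records are made up for the example.

```python
# Minimal sketch of the tree's top split: predict Pass iff avg_score >= 67.
# The records below are hypothetical; only the 67 threshold comes from the model.
students = [
    {"avg_score": 72, "passed": True},
    {"avg_score": 55, "passed": False},
    {"avg_score": 81, "passed": True},
    {"avg_score": 60, "passed": True},   # a student the simple rule gets wrong
    {"avg_score": 40, "passed": False},
]

def predict(avg_score: float) -> bool:
    """Single-split decision rule taken from the tree's root node."""
    return avg_score >= 67

correct = sum(predict(s["avg_score"]) == s["passed"] for s in students)
accuracy = correct / len(students)
print(f"accuracy = {accuracy:.2f}")  # 4 of the 5 hypothetical records match
```

On real data the tree would of course be grown and evaluated by a library, but the logic of each node is exactly this kind of threshold test.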
Now let's take a look at the Decision Tree that will predict the dropout probability for students:
It is apparent that the most important variable for this prediction is disability. The decision tree shows that very few students drop out compared to the whole population. If a student has a disability, there is a 29% chance that this student will drop out of the course.
In the summary and prediction matrix we can see that the accuracy rate on the test set is about 70% and the error stays under 30%, so this prediction is still reasonably accurate.
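The accuracy and error figures quoted throughout come straight from the prediction (confusion) matrix. A small sketch with invented counts (the numbers below are not taken from the report) shows the arithmetic:

```python
# Reading accuracy and error from a 2x2 prediction (confusion) matrix.
# Keys are (actual, predicted); the counts are hypothetical.
matrix = {
    ("stay", "stay"): 140,   # correctly predicted to stay
    ("stay", "drop"): 20,    # wrongly predicted to drop
    ("drop", "stay"): 40,    # wrongly predicted to stay
    ("drop", "drop"): 10,    # correctly predicted to drop
}

total = sum(matrix.values())
correct = matrix[("stay", "stay")] + matrix[("drop", "drop")]
accuracy = correct / total   # diagonal of the matrix over everything
error = 1 - accuracy
print(f"accuracy = {accuracy:.2%}, error = {error:.2%}")
```

Note that with so few dropouts, a model can reach a high overall accuracy while still missing most of the students who actually drop out, which is worth checking in the matrix itself.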
Here is a statistical prediction method - logistic regression model for passing or failing.
As we can tell, the most important variables here are average score and active days (days when students were active on the LMS).
From the summary and prediction matrix it is apparent that the accuracy rate on the test set is about 74% and the error is a little over 25%. The performance of this model is similar to the decision tree shown above.
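Under the hood, a fitted logistic regression turns the two predictors into a pass probability via the sigmoid function. A sketch of the scoring step, with invented coefficients (only the two predictors come from the model summary):

```python
import math

# Sketch of how a fitted logistic model scores one student.
# The coefficients below are hypothetical, chosen only for illustration.
B0, B_SCORE, B_DAYS = -6.0, 0.07, 0.05

def pass_probability(avg_score: float, active_days: int) -> float:
    """Sigmoid transform of the linear predictor b0 + b1*score + b2*days."""
    z = B0 + B_SCORE * avg_score + B_DAYS * active_days
    return 1.0 / (1.0 + math.exp(-z))

p = pass_probability(avg_score=70, active_days=30)
print(f"P(pass) = {p:.2f}")  # classify as Pass when the probability exceeds 0.5
```

The real coefficients would be estimated by the fitting routine; the point is that the output is a probability, which is then thresholded to produce the Pass/Fail labels scored in the prediction matrix.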
Let's take a look at logistic regression model for dropout:
As we can see, the most important variables are still disability and the average score students get in the class.
From the summary and prediction matrix we can tell that the accuracy of this model is about 72%, which is 2% higher than the decision tree. The error is about 28%, which is accordingly 2% lower than the decision tree dropout model presented above.
Overall, this statistical model is more accurate for predicting dropout than the decision tree.
Now let's test another prediction model: random forest. We are predicting whether the student will pass or fail the course:
Again, the model shows that the most important variable is average score, followed by the number of elements the student interacted with on the LMS.
The accuracy of this model is about 75%, a slight improvement over the decision tree and the same as logistic regression. The error rate is about 25%, which makes the model quite accurate overall.
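The idea behind the random forest is to train many slightly different trees and let them vote. A toy sketch of that voting step, using single-split "trees" whose thresholds are randomly jittered around the 67 cut from the earlier tree (the jitter and tree count are invented for illustration):

```python
import random

# Toy random forest: many noisy single-split trees vote on Pass/Fail.
# Thresholds are randomized around the 67 split purely for illustration.
random.seed(0)
THRESHOLDS = [67 + random.uniform(-5, 5) for _ in range(25)]  # hypothetical "trees"

def forest_predict(avg_score: float) -> bool:
    """Majority vote of the single-split trees: True means Pass."""
    votes = sum(avg_score >= t for t in THRESHOLDS)
    return votes > len(THRESHOLDS) / 2

print(forest_predict(80))  # well above every threshold, so the vote is unanimous
print(forest_predict(50))  # well below every threshold
```

A real forest also resamples the training rows and the candidate variables for each tree, which is what makes the ensemble more robust than any single tree.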
Here is the same random forest model but for dropout prediction:
The most important variables here are average score and disability; however, in this model disability appears in second place.
The accuracy of this model is slightly better than both logistic regression and the decision tree: about 73% accuracy and 27% error, which makes it the most accurate of the three.
# Regression Models
## Linear Regression - Score
Now we will see if we can predict the final score of students in this course:
As we can see, the values are scattered and do not form a straight line in this linear regression model, which means the model is not effective at predicting the final grade.
The most important variable is the average score. The model is not accurate: R squared is about 0.26 on the test set, which means the model explains little of the variance in the final score, and RMSE is about 16.20, which means the typical error is high.
Verdict: This model is not a good method to predict the final grade of students in this course.
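Both reported metrics are simple to compute directly. The formulas below are standard; the actual and predicted values are made up for the example and are not taken from the course data:

```python
import math

# R^2 and RMSE for a regression model, computed from scratch.
# actual/predicted are hypothetical values; the formulas are the standard ones.
actual    = [55.0, 70.0, 40.0, 85.0, 60.0]
predicted = [50.0, 75.0, 45.0, 80.0, 65.0]

mean_actual = sum(actual) / len(actual)
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total sum of squares

r_squared = 1 - ss_res / ss_tot         # share of variance explained (1.0 is perfect)
rmse = math.sqrt(ss_res / len(actual))  # typical size of the prediction error

print(f"R^2 = {r_squared:.2f}, RMSE = {rmse:.2f}")
```

Read against these definitions, an R squared of 0.26 means roughly three quarters of the variance in the final score is left unexplained, and an RMSE of about 16 means predictions are typically off by around 16 points.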
We have also tried using random forest as a regression algorithm for score:
Again, the values form a blob rather than a straight line, which means this model is also not accurate at predicting the final grade.
The most important variable is the average score, just like before.
This model is not accurate either: R squared is about 0.27 on the test set, which means the model is not good at predicting the final score, and RMSE is about 16.08, so the error is still quite high.
Verdict: this model is not a good method to predict the final grade of students in this course either. More data would be needed to predict the final score accurately.
All models have similar performance (about 2-3% apart in accuracy and error rate). This is likely because some information the models would need to predict more accurately is missing from the data.
However, from the models above I can conclude that the most effective ones for predicting whether students pass or fail were logistic regression and random forest, with about 75% accuracy.
For predicting students dropping out of the course, the most accurate model was random forest with 73% accuracy, although logistic regression was not far behind at 72%. The least accurate of the three was the decision tree with 70% accuracy.
On the other hand, the decision tree model is easier to read for somebody without advanced knowledge of data visualization.
Overall, I assume it depends on who the information will be presented to. If I present it to stakeholders who don't have much experience with data visualization, I would prefer the decision tree because it is easier to explain; however, if I am using this information for my own research, I would use the logistic regression or random forest models because they proved more accurate at predicting the variables we tested.