In this tutorial we will explore a complex dataset and use it to predict student dropout. The main goals of this tutorial are to:
This dataset is a reduced version of the one available at https://analyse.kmi.open.ac.uk/open_dataset and corresponds to student activity in several MOOCs of the Open University UK.
First we will import the csv files with the data:
This file contains the list of all available modules and their presentations. The columns are:
The structure of B and J presentations may differ, and therefore it is good practice to analyse the B and J presentations separately. Nevertheless, for some presentations the corresponding previous B/J presentation does not exist, and therefore the J presentation must be used to inform the B presentation or vice versa. In the dataset this is the case for the CCC, EEE and GGG modules.
This file contains information about assessments in module-presentations. Usually, every presentation has a number of assessments followed by the final exam. The CSV file contains the following columns:
If the information about the final exam date is missing, the exam takes place at the end of the last presentation week.
The csv file contains information about the available materials in the VLE. Typically these are html pages, pdf files, etc. Students have access to these materials online and their interactions with the materials are recorded. The vle.csv file contains the following columns:
This file contains demographic information about the students together with their results. The file contains the following columns:

- code_module – an identification code for the module on which the student is registered.
- code_presentation – the identification code of the presentation during which the student is registered on the module.
- id_student – a unique identification number for the student.
- gender – the student's gender.
- region – the geographic region where the student lived while taking the module-presentation.
- highest_education – the highest student education level on entry to the module presentation.
- imd_band – the Index of Multiple Deprivation band of the place where the student lived during the module-presentation.
- age_band – a band of the student's age.
- num_of_prev_attempts – the number of times the student has attempted this module.
- studied_credits – the total number of credits for the modules the student is currently studying.
- disability – indicates whether the student has declared a disability.
- final_result – the student's final result in the module-presentation.
This file contains information about the time when the student registered for the module presentation. For students who unregistered, the date of unregistration is also recorded. The file contains five columns:

- code_module – an identification code for a module.
- code_presentation – the identification code of the presentation.
- id_student – a unique identification number for the student.
- date_registration – the date of the student's registration on the module presentation, measured as the number of days relative to the start of the module-presentation (e.g. the negative value -30 means that the student registered 30 days before the module-presentation started).
- date_unregistration – the date of the student's unregistration from the module presentation, measured as the number of days relative to the start of the module-presentation. Students who completed the course have this field empty. Students who unregistered have Withdrawn as the value of the final_result column in the studentInfo.csv file.
This file contains the results of students' assessments. If the student does not submit an assessment, no result is recorded. The final exam submission is missing if the result of the assessment is not stored in the system. This file contains the following columns:

- id_assessment – the identification number of the assessment.
- id_student – a unique identification number for the student.
- date_submitted – the date of the student's submission, measured as the number of days since the start of the module presentation.
- is_banked – a status flag indicating that the assessment result has been transferred from a previous presentation.
- score – the student's score in this assessment, in the range from 0 to 100. A score lower than 40 is interpreted as Fail.
The studentVle.csv file contains information about each student's interactions with the materials in the VLE. This file contains the following columns:

- code_module – an identification code for a module.
- code_presentation – the identification code of the module presentation.
- id_student – a unique identification number for the student.
- id_site – an identification number for the VLE material.
- date – the date of the student's interaction with the material, measured as the number of days since the start of the module-presentation.
- sum_click – the number of times the student interacted with the material on that day.
Due to the size of the dataset, we will only work with one of the MOOCs, the module "DDD".
First we calculate the outcome of the students from the different tables. This information comprises the final status (Pass or Fail), whether the student has withdrawn (dropout), and the final score in the course.
First, we get the final state (Pass or Fail) of the students in a given course. Students with a "Withdrawn" state are removed from the list; students with "Distinction" as a final state are counted as a "Pass".
Next, to determine whether a student has dropped out of the course, we use the final_result variable. Dropout students are marked as "Withdrawn".
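The two steps above are performed through Exploratory's UI; as an illustration only, here is a minimal Python sketch of the same logic, using a few hypothetical rows with the studentInfo.csv column names:

```python
# Hypothetical studentInfo.csv rows (illustrative only).
students = [
    {"id_student": 1, "final_result": "Pass"},
    {"id_student": 2, "final_result": "Distinction"},
    {"id_student": 3, "final_result": "Withdrawn"},
    {"id_student": 4, "final_result": "Fail"},
]

# Dropout flag: TRUE when final_result is "Withdrawn".
dropout = {s["id_student"]: s["final_result"] == "Withdrawn" for s in students}

# Final state: drop Withdrawn students, count Distinction as Pass.
final_state = {
    s["id_student"]: ("Pass" if s["final_result"] in ("Pass", "Distinction") else "Fail")
    for s in students
    if s["final_result"] != "Withdrawn"
}
```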
Continuing, we will obtain the final grade of the student in the course. This is a little more complicated because we need to use two datasets. First, we need to find the ID code of the "Exam" assessment for the course in the "assessments" dataset. Then we need to use those ID codes to select only those assessments from the "studentAssessment" dataset.
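As a sketch of this two-step lookup (the tutorial does it through Exploratory's join/filter steps), using hypothetical rows with the column names from assessments.csv and studentAssessment.csv:

```python
# Hypothetical rows mirroring assessments.csv and studentAssessment.csv.
assessments = [
    {"id_assessment": 101, "code_module": "DDD", "assessment_type": "TMA"},
    {"id_assessment": 102, "code_module": "DDD", "assessment_type": "Exam"},
]
student_assessment = [
    {"id_assessment": 101, "id_student": 1, "score": 70},
    {"id_assessment": 102, "id_student": 1, "score": 62},
]

# Step 1: find the ID codes of the "Exam" assessments for the course.
exam_ids = {a["id_assessment"] for a in assessments
            if a["code_module"] == "DDD" and a["assessment_type"] == "Exam"}

# Step 2: keep only the exam rows to obtain each student's final grade.
final_grades = {r["id_student"]: r["score"]
                for r in student_assessment if r["id_assessment"] in exam_ids}
```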
Next, to extract the predictors, we must specify not only the course from which we want the information ("DDD"), but also the period of time since the start of the course at which we want to make the prediction (50 days after the start of the course).
We will start with the information about the assessments delivered by the student. We will extract two predictors: the average grade of the assessments submitted up to that date, and the total number of assessments submitted.
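The filter-then-summarise logic behind these two predictors can be sketched in plain Python (hypothetical rows; the tutorial computes this in Exploratory):

```python
# Average score and count of assessments submitted within the cut-off window.
cutoff = 50
rows = [  # hypothetical studentAssessment rows for one student
    {"id_student": 1, "date_submitted": 20, "score": 80},
    {"id_student": 1, "date_submitted": 45, "score": 60},
    {"id_student": 1, "date_submitted": 80, "score": 90},  # after the cut-off, ignored
]

early = [r for r in rows if r["date_submitted"] <= cutoff]
n_assessments = len(early)                                  # total submitted
avg_score = sum(r["score"] for r in early) / n_assessments  # average grade
```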
Then we obtain information about the late delivery of assessments. For this we need the deadline of each assessment, and we subtract the delivery day to determine whether it was delivered late.
Create a calculation (mutate) with a new column called "delay" using the formula ifelse(date < date_submitted, 1, 0).
Rename the delay column to "sum_delays"
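The ifelse formula above marks a submission as late when the deadline (date) is earlier than the submission day; the equivalent logic in a Python sketch (hypothetical rows):

```python
# delay = 1 when the deadline (date) is before the submission day, else 0.
rows = [
    {"date": 19, "date_submitted": 18},  # submitted on time
    {"date": 19, "date_submitted": 25},  # submitted late
]
for r in rows:
    r["delay"] = 1 if r["date"] < r["date_submitted"] else 0

# Summing the flags per student gives the "sum_delays" predictor.
sum_delays = sum(r["delay"] for r in rows)
```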
Finally, we will get information from the number of clicks recorded in the VLE. We will extract three predictors: the total number of clicks, the average number of clicks per day, and the number of active days.
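A minimal sketch of these three click predictors, again with hypothetical studentVle rows for one student (the tutorial derives them in Exploratory):

```python
cutoff = 50
clicks = [  # hypothetical studentVle rows for one student
    {"date": 3, "sum_click": 10},
    {"date": 3, "sum_click": 5},
    {"date": 12, "sum_click": 4},
    {"date": 60, "sum_click": 7},  # after the cut-off, ignored
]

early = [c for c in clicks if c["date"] <= cutoff]
total_clicks = sum(c["sum_click"] for c in early)      # total number of clicks
active_days = len({c["date"] for c in early})          # distinct days with activity
avg_clicks_per_day = total_clicks / active_days        # average clicks per active day
```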
We generate a dataset that contains additional information about the students.
First we will put together the extracted numerical predictors for the given course ("DDD") and cut-off days (50).
Now we create the outcome values: Pass/Fail (finalState), the final grade (finalGrade) and dropout (dropout).
We will build classification models: one to predict whether a student will pass or fail the course, and another to determine whether students will drop out of the "DDD" course.
We will start with the simplest classifier, the decision tree.
The result tells us that if, in the first 50 days of the course, you have an average score higher than 67, you visited more than 85 elements in the LMS, and you delivered fewer than 2.5 assessments, you have an 85% chance of passing the course (57% of the population matches these characteristics). On the other hand, if your average score is less than 67 in the first 50 days, you only have a 0.36 probability of passing (28% of the training set was in this situation).
If we want to know how accurate the model is, go to the Prediction Matrix (confusion matrix).
We see that in the test set, 21.31% failed the course and the model predicted that they would fail. Also, 53.11% passed the course and the model predicted that they would pass. In total, the model is 21.31 + 53.11 = 74.42% accurate; that is, it fails for around a quarter of the cases. Not good, not bad.
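The accuracy reading above is just the sum of the diagonal of the confusion matrix (the correctly classified percentages), as this tiny check shows:

```python
# Accuracy = sum of the diagonal (correctly classified) percentages
# from the confusion matrix reported in the Prediction Matrix view.
true_fail_pct = 21.31   # actually failed, predicted Fail
true_pass_pct = 53.11   # actually passed, predicted Pass

accuracy = true_fail_pct + true_pass_pct  # remaining ~25.58% are errors
```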
If you want to see all the metrics for the model, go to Summary. Here we can see that the performance on the test and training sets is similar, indicating a consistent model that is not overfitted to the training data. The accuracy rate is the one we calculated before (76% for the training set, 74% for the test set).
We can also see which variables are more important in determining whether you will pass or fail. Go to the Importance tab:
We can see that your score in the first 50 days is highly predictive of passing or failing, while your disability status is much less so.
Now we will create a decision tree for the dropout.
!!! No tree !!!
If you go to the Prediction Matrix you will see why: it is basically assigning No dropout to all the students (and is still 75% accurate).
This is because very few students drop out compared to the whole population. The algorithm provided by Exploratory has a way to balance the samples.
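Exploratory handles the balancing internally; conceptually, one common approach is to oversample the minority class with replacement until the classes match, as this illustrative sketch shows:

```python
import random

# Hypothetical imbalanced dropout labels: 75 "No" vs 25 "Yes".
labels = ["No"] * 75 + ["Yes"] * 25
minority = [l for l in labels if l == "Yes"]
majority = [l for l in labels if l == "No"]

# Oversample the minority class with replacement until both classes
# are the same size, so the classifier cannot win by always saying "No".
balanced = majority + random.choices(minority, k=len(majority))
```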
Now you will have something like this:
Now we see that disability, while not important for passing or failing the course, is actually important for being able to finish the course.
The prediction matrix tells us that there are errors, but they stay below 30% for the test set.
This model is a little less accurate, with 70% accuracy:
Let's try a statistical method for classification: logistic regression. This method only works for TRUE/FALSE variables, so we need to convert the Pass/Fail factor variable into a logical variable.
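The conversion itself is a one-liner; in Python terms it is simply mapping the factor levels to booleans:

```python
# Convert the Pass/Fail factor into a logical (TRUE/FALSE) outcome,
# as required by logistic regression.
final_state = ["Pass", "Fail", "Pass"]  # hypothetical factor values
passed = [s == "Pass" for s in final_state]
```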
We can see that the performance is similar to the decision tree; however, there are more Type II errors and fewer Type I errors. The summary tells us that the model is 75% accurate, just 1% more than the decision tree.
The model also tells us which are the important variables (avg_score and active_days).
Now we apply this algorithm to the dropout data. Again, we need to create a logical variable out of the dropout information.
Now let's train the model:
Something similar is happening: most students are classified as FALSE. Let's balance the samples.
Now the model behaves much better. Again, disability and avg_score seem to be the most important variables for predicting dropout. This model is a little more accurate (72%) than the decision tree.
Now we will run the Random Forest algorithm on the same data.
The results seem better. However, a look at the summary tells us that the accuracy is just marginally better (76%).
Average score and total elements used seem to be the main variables.
Now we apply Random Forest to the dropout data.
It seems that we have slightly better performance. The summary says that the model is 73% accurate on the test set.
As we can see, all the models have similar performance; the inaccuracy is due mainly to some missing information.
We will now try to predict the final grade of the students.
We can see that the model is not good at predicting the actual grade. The R-squared is low (0.26), which means that the variables cannot explain the variability in scores. The error is high: ±16.24 points on average (RMSE).
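For reference, RMSE and R-squared can be computed directly from actual and predicted values; this sketch uses made-up numbers purely to show the formulas behind the two metrics reported in the Summary:

```python
import math

# Hypothetical actual vs. predicted final grades (illustrative only).
actual = [55.0, 70.0, 40.0, 80.0]
predicted = [60.0, 65.0, 55.0, 70.0]
n = len(actual)

# RMSE: root of the mean squared prediction error.
rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# R-squared: fraction of the variance in the actual values
# that the predictions explain (1 = perfect, near 0 = useless).
mean_a = sum(actual) / n
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
ss_tot = sum((a - mean_a) ** 2 for a in actual)
r_squared = 1 - ss_res / ss_tot
```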
If you go to Actual/Predicted, the graph is not a clear line (as you would expect from a good prediction).
Now we will apply Random Forest as a regression algorithm.
The R-squared is still low (0.26), meaning that the variables are not able to explain the variability of the score.
If we go to importance, we see that avg_score is the main variable to predict the final score.
Create a Report comparing the performance of the different models and which one will you choose and why.