Welcome to the Clustering tutorial! In this analysis we are going to look at data collected by the lead instructor of a first-year chemistry course at a mid-sized state university in the northeastern US. The course is required for all science majors and consistently has an enrollment of around 100-120 students. The main learning objectives of the course relate to knowledge of core chemistry concepts and manipulation of chemical equations. It is a class that students often struggle with, particularly if they do not have strong prior preparation from high school.
In 2018, the instructor gave students a short quiz in the first week of the class with questions on each of the two key areas to assess their incoming knowledge about chemistry concepts and ability to work with chemical equations; she gave another quiz in the third week on the concepts and equations covered in the first two weeks of the semester to see how, if at all, the students were improving.
We will perform a cluster analysis on this data to see if we can identify profiles of common student types. This information could help the instructor better understand the class and how to support the students. We will first need to clean and tidy the data to get it ready for analysis (though this data actually comes to us in pretty good shape), and then we will perform a cluster analysis to see if we can identify some number of "groups" (clusters) of students who have similar patterns of data.
Load the data (CSV format):
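A minimal sketch of this step in R, assuming the file is saved locally; the file name and the data frame name `quiz` are placeholders, not the actual names used in the course data:

```r
# Read the quiz data from a CSV file (file name is a placeholder)
quiz <- read.csv("chemistry_quiz_scores.csv", stringsAsFactors = FALSE)

# Quick look at the size and column names
dim(quiz)
names(quiz)
```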
Take a look at the data and think about the points below (a sketch of doing this same inspection in R follows the list):
* How many cases are there and what do they represent?
* Review the histograms of the variables in the Summary view
* Identify clear data entry errors
* Identify potential outliers
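If you prefer to do this inspection in R rather than in the Summary view, a rough equivalent (still assuming the hypothetical `quiz` data frame from the loading sketch) is:

```r
# How many cases are there?
nrow(quiz)

# Structure and five-number summaries: useful for spotting data entry
# errors (impossible values) and potential outliers
str(quiz)
summary(quiz)

# Histograms of the two concepts scores named in this tutorial
hist(quiz$conceptswk1, main = "Concepts, week 1", xlab = "Score")
hist(quiz$conceptswk3, main = "Concepts, week 3", xlab = "Score")
```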
You will have noticed during your inspection of the data that there are some cases with missing values. These are observations that are missing data for one or more variables. Missing values are a problem for cluster analysis since it is not possible to calculate a distance score when we don't know where one of the points lies (on a given dimension). When you have a good number of observations for a variable but some values are missing, there are several common options:
1. The simplest solution is to drop the observations with missing values. This is a reasonable choice when (a) you have many observations and (b) there is no reason to think that the observations missing data are similar in some way. (The opposite would be if the missingness is systematic and thus represents some property of part of the population.) This is also a good choice if you are missing several values for the same observation.
2. One relatively simple (but often problematic) solution is mean or median imputation, where you calculate the mean or median of all of the other values for that variable and use it to replace the missing data. This can be problematic because it may mask actual variation between observations.
3. A slightly more complicated, but better, solution is to replace the missing data using the "nearest neighbor": find the observation that is closest to the one with the missing value on all the non-missing variables and use its value to replace the missing data. This produces a better estimate than simply taking the mean or median because the estimate is specific to the observation.
4. Finally, it is possible to modify the basic clustering algorithm to "work around" missing data. The idea is that even if we are missing a value for one variable on an observation, we can still calculate its distance from the other observations using the remaining variables.
For our data, we will take an approach in the spirit of method 3: instead of using the overall mean or median, we will predict each missing value from another variable, so that the estimate is specific to the observation.
impute_na(conceptswk1, type = "predict", val = conceptswk3)
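For reference, here is a rough base-R sketch of the idea behind prediction-based imputation. The `impute_na` helper above comes from the analysis tool, and this is not necessarily how it works internally; the `quiz` data frame is still a hypothetical name:

```r
# Fit a simple linear model on the complete cases and use it to predict the
# missing conceptswk1 values from conceptswk3 (illustration only)
fit <- lm(conceptswk1 ~ conceptswk3, data = quiz)

missing <- is.na(quiz$conceptswk1)
quiz$conceptswk1[missing] <- predict(fit, newdata = quiz[missing, ])
```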
First, we scale the data. This puts each variable on a comparable scale, with a mean of zero and a standard deviation of one, based on its distribution. So for the scaled data, whether a particular value is positive or negative just tells us whether it is above or below the mean, and its magnitude tells us how far (in standard deviation units) it is above or below the mean.
scale
We now notice that the type of the variables has changed to "array" (scale() returns a matrix rather than a plain vector). We need to convert them back to numeric:
as.numeric
Now the variables should be centered around 0 with a standard deviation of 1
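Written out in R, the whole scaling step looks like the sketch below (again using the hypothetical `quiz` data frame and only the two variables named so far in this tutorial):

```r
# scale() returns a one-column matrix (which shows up as an "array"), so we
# wrap it in as.numeric() to get a plain numeric vector back
quiz$conceptswk1 <- as.numeric(scale(quiz$conceptswk1))
quiz$conceptswk3 <- as.numeric(scale(quiz$conceptswk3))

# Sanity check: means should be about 0 and standard deviations about 1
sapply(quiz[, c("conceptswk1", "conceptswk3")], mean)
sapply(quiz[, c("conceptswk1", "conceptswk3")], sd)
```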
Now let's look at how the different observations are similar / different across the different variables to assess what k (# of clusters) we might be looking for. We will create four different scatter plots (a base-R sketch of one such plot follows the steps below):
* Add a new Scatterplot Chart
* Select as variables X = conceptswk1, Y = conceptswk3
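An equivalent plot in base R, assuming the scaled `quiz` data frame from the earlier sketches:

```r
# Week 1 vs. week 3 concepts scores (scaled values)
plot(quiz$conceptswk1, quiz$conceptswk3,
     xlab = "Concepts, week 1 (scaled)",
     ylab = "Concepts, week 3 (scaled)")
```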
Based on the scatterplots, how many clusters of students do you think there might be? Explain why.
To get a visual representation of the clustering, we will use the Analytics tab
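The Analytics tab handles the clustering for you. If you want to reproduce the step in R, a minimal sketch using k-means might look like the following; the choice of k = 3, the seed, and the restriction to the two named variables are all placeholders for illustration:

```r
set.seed(42)  # k-means uses random starting centers, so fix the seed
km <- kmeans(quiz[, c("conceptswk1", "conceptswk3")], centers = 3, nstart = 25)

# Attach the cluster labels and color the scatter plot by cluster
quiz$cluster <- factor(km$cluster)
plot(quiz$conceptswk1, quiz$conceptswk3, col = quiz$cluster,
     xlab = "Concepts, week 1 (scaled)", ylab = "Concepts, week 3 (scaled)")
```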
Does the number of clusters that you selected seem to correspond to clear separations of the data in the plot? Why?
We will analytically check the number of clusters
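One common way to do this check by hand in R is an elbow plot, sketched below with the same hypothetical `quiz` data frame and variable subset as above:

```r
# Total within-cluster sum of squares for k = 1 through 10
wss <- sapply(1:10, function(k) {
  kmeans(quiz[, c("conceptswk1", "conceptswk3")], centers = k, nstart = 25)$tot.withinss
})

plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```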
Observe the resulting graph
What is the ideal number of clusters according to the “Elbow” graph?
Based on all the analyses above:
* What, if any, ethical issues are there related to the application of this analysis to inform instruction?
* Do you think clustering was a useful approach in this situation? Why or why not?
Now you need to create a Note report where you will explain your findings to the instructor. Use as many graphs and explanations as you see fit.