Assignment 4 Report
1bi) The information captured by Perusall gives us the full list of every comment/annotation made in Perusall throughout the semester. We can see the specific source document, and each comment has an id, so for replies we can see which comment is being replied to. We can also see the student's first and last name along with a student id to distinguish them, the highlighted text being annotated, the content of the annotation, the word count, score, number of replies, number of upvoters, the type of comment (comment/question), the created/edited time, and the page number. This gives us detailed information about each comment for performing a variety of discourse analyses. The summary in Perusall also shows us the distribution of these values.
2ai) The only "missing data" is N/A entries in the "in response to comment id" column, which makes sense since not all comments are replies to other comments.
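A minimal sketch of this missing-data check, assuming the export was saved as a CSV; the file name and column headers are illustrative, not the exact export:

# Illustrative sketch: load the Perusall export and count missing values per column.
import pandas as pd

comments = pd.read_csv("perusall_annotations.csv")

# Only the reply-id column should contain N/A, since top-level comments
# are not replies to any other comment.
print(comments.isna().sum())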
2aii) There are more comments than questions (621 vs. 107).
2aiii) Based on the highest counts in the Document column, Introduction to Learning Analytics (88), Herodotou et al., 2019 (78), Wise 2013 (76), Selwyn 2015 (71), and Wise 2021 (71) have the most comments.
2aiv) The page range with the most comments is 1-3.8, which makes sense since people are likely most engaged in the first parts of a reading, and these sections also contain information that is repeated or further explained later in the text.
2av) The most prolific commenters were Wilhelm, Iron, and Ethan. We can see this from the summary distribution for "First Name"; we use this variable because some last names overlap while no first names are shared. This is also shown in the chart below.
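A short sketch of how the counts in 2aii-2av could be reproduced with pandas, assuming the same export as above and illustrative column names ("Type", "Document", "First Name"):

# Illustrative summary counts for 2aii-2av; column names are assumptions.
import pandas as pd

comments = pd.read_csv("perusall_annotations.csv")

print(comments["Type"].value_counts())                 # comments vs. questions
print(comments["Document"].value_counts().head(5))     # most-commented readings
print(comments["First Name"].value_counts().head(3))   # most prolific commenters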
3di) After tokenizing the submission text, we added the columns for document id, sentence id, and the Tokenized Text.
3dii) The data is long since each row now represents an individual token instead of an individual comment, with the document and sentence ids extracted into their own columns.
3diii) We could begin the process of rebuilding the original text by grouping together tokens with common document ids, sentence ids, and student ids. From the table, it appears that they are already arranged in this manner and the words are in order. We can run a long-to-wide operation with document id and sentence id as keys and the tokenized text as values. Since the table was in order, this gives us the original text separated by commas, which can be merged and resolved through data tidying operations.
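One way this grouping step could look in pandas, using a made-up long token table with illustrative column names:

# Made-up long token table; grouping by document and sentence ids and joining
# the ordered tokens rebuilds one string per sentence.
import pandas as pd

tokens = pd.DataFrame({
    "document_id": [1, 1, 1, 1, 1],
    "sentence_id": [1, 1, 1, 2, 2],
    "token":       ["learning", "analytics", "helps", "it", "scales"],
})

rebuilt = (tokens
           .groupby(["document_id", "sentence_id"])["token"]
           .apply(" ".join)
           .reset_index(name="text"))
print(rebuilt)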
3div) The most common words are stopwords like "the", "to", "and", "of", "a", and "is", since we set the remove-stopwords option to false.
4d) Now the most frequent words are "data", "learn", "student", "use", "think", and "also". As shown in the pivot table below.
4e) These words make more sense since they are more applicable to the course topics and readings. There still seem to be some frequent simple words like "use" and "also", but for the most part it is a better representation of our discussions.
4f) We could have lost important information if a number such as a year was commonly used, or if it is important to differentiate between the various forms of a word, though keeping those distinctions could also complicate the information we want to extract.
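A hedged sketch of the stopword removal, number filtering, and stemming in the 4d step, assuming a Porter-style stemmer such as NLTK's and its English stopword list; the token list is illustrative:

# Drop stopwords and numbers, stem what remains, and count frequencies.
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

raw_tokens = ["the", "students", "use", "data", "to", "learn", "and", "2019"]

kept = [stemmer.stem(t) for t in raw_tokens
        if t.lower() not in stop and not t.isdigit()]
print(Counter(kept).most_common(5))   # stemmed, non-stopword, non-numeric tokens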
5bi) The "Tokenized Text" column disappeared 5bii) The "token" (contains the new tokens with the words merged to form n-grams) and the "gram" (number of words) columns appear
New Pivot Table with ngrams below
5ci) The most common bigrams are learn_analyt, predict_model, data_collect, student_learn, machine_learn, use_data, and collect_data. 5cii) The most common trigrams are make_inform_decis, predict_learn_analyt, provid_valuable_insight, use_predict_model, and way_collect_data.
5ciii) There appears to be considerable overlap between the common simple words and the bigrams and trigrams. However, through the n-grams we gain additional information on how those words are used when they commonly appear alongside other words, giving a clearer picture of common discussion topics and elaborating on common words like "use" or "learn".
5di) The most common bigrams containing the word "student" are student_learn, student_perform, teacher_student, risk_student, and student_teacher. The most common trigrams are dialogu_advisor_student, agre_collect_student, better_understand_student, collect_data_student, and friend_strong_student.
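A small sketch of how the underscore-joined n-grams above could be produced and then filtered for a word of interest; the stemmed token list is made up:

# Build underscore-joined n-grams from an ordered list of stemmed tokens.
def ngrams(tokens, n):
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

stemmed = ["learn", "analyt", "help", "predict", "student", "learn"]
print(ngrams(stemmed, 2))   # ['learn_analyt', 'analyt_help', ...]
print(ngrams(stemmed, 3))

# Bigrams containing "student":
print([g for g in ngrams(stemmed, 2) if "student" in g.split("_")])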
6bvi) The most relevant words correspond with each of the papers' topics. This is because we used TF-IDF to gauge the relevancy of each word instead of just its frequency, so we have high-scoring words that may not be the most frequent, and we also filtered out the most common words to eliminate simple words that aren't directly related to the reading topic.
6civ1) I can recognize the most important words, even though the full names are shown as integers, by looking through the table and finding the section that relates most to the topics I discussed.
6civ2) The most common n-grams I used that weren't used by my classmates were data_often_mislead and often_mislead, and this makes sense to me because in general my classmates' comments were more applicable to the specific readings, while these comments were side notes on my opinion that data can sometimes be misleading. This also makes sense because these were in my most common list and had the lowest TF-IDF score, indicating low usage by classmates.
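A minimal TF-IDF sketch along the lines of 6b, using scikit-learn's TfidfVectorizer on two stand-in documents; the actual pipeline and documents differ, so this only illustrates the scoring idea:

# Score words per document with TF-IDF so distinctive terms outrank merely
# frequent ones. The two documents stand in for per-reading comment text.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "predictive models use student data to predict student performance",
    "critique of learning analytics and the ethics of data collection",
]

vec = TfidfVectorizer(stop_words="english")
scores = vec.fit_transform(docs)

terms = np.array(vec.get_feature_names_out())
row = scores[0].toarray().ravel()
print(terms[row.argsort()[::-1][:5]])   # highest-scoring terms for document 0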
7fi) The scatterplot is shown below
In this section we created a scatterplot to get an idea of how many clusters we should use. In order to simplify the categories, we used 3 dimensions for each student to categorize the tokens, based on TF-IDF scores, into 3 arbitrary categories. As such, it isn't clear what these categories represent, so it isn't clear what topics each cluster of students has in common. However, it does make sense if we interpret this scatterplot as showing that in general there is one main cluster of common conversations with 4 outliers representing less relevant conversations.
7h) These are the analytics derived from k-means clustering with 5 clusters.
7hi) The similarities appear to be preserved, as we have 5 clusters: one with the most common levels of TF-IDF across all 3 dimensions and 4 outlier clusters. 7hii) My interpretation is the same as for the scatterplot since it yielded a similar result and showed that there is one main cluster of common conversations with 4 outliers representing less relevant conversations.
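A sketch of the k-means step in 7h, assuming each student is represented by a 3-dimensional vector of aggregated TF-IDF values; the feature array here is random placeholder data standing in for the real values:

# Cluster students into 5 groups based on a 3-dimensional TF-IDF representation.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
student_features = rng.random((20, 3))    # 20 students x 3 TF-IDF dimensions

km = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = km.fit_predict(student_features)
print(labels)                # cluster assignment per student
print(km.cluster_centers_)   # centroid of each cluster in the 3-D space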
8a) Sentiment Analysis Charts
8bi) The readings with the most negative comments were herodotou-et-al-2019, srinivasa-kurni-2021, and wise-2021 based on having the greatest length below zero on the chart and having the largest numbers in the green category. The readings with the most positive comments were selwyn-2015, wise-2019, and wise-2021 based on having the largest number in the black category.
8bii) Based on the student sentiment chart, Ethan could be considered to have the most negative comments based on being the furthest left, but on average it appears that either Madhu or Soham has the most negative comments, based on the midpoint of their bar being furthest left. Adam Li appears to have the most positive comments on average for the same reason, and he also has the furthest-right graph and the largest combined blue and black categories.
8biii) This sentiment analysis appears to be accurate based on our comments in the readings. One interesting observation is that we didn't have any comments in the yellow or light green sections, which makes sense since I noticed that most of the comments were positive. Additionally, the comment sentiments appear to align with the way each student has commented throughout the year. The results for readings also make sense because readings that themselves include critique correspond with larger numbers of negative comments. We can also see that earlier readings seem to have more positive comments and later readings have more negative comments, which also makes sense since students may be more comfortable and confident providing negative critique of readings the further along they are in a course.
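A hedged sketch of per-comment sentiment scoring using NLTK's VADER analyzer (the tool actually used for the charts may differ); the compound scores could then be aggregated by reading or by student as in 8bi and 8bii:

# Score each comment with VADER's compound sentiment (-1 negative to +1 positive).
# The example comments are made up.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

sample_comments = [
    "This framework is a really useful way to think about student data.",
    "I worry this kind of data collection can be misleading and invasive.",
]
for c in sample_comments:
    print(round(sia.polarity_scores(c)["compound"], 3), c)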