What information is being captured by Perusall in this dataset?
1. Student details: student's last name; student's first name; and student ID.
2. Comment details: ID of the reading on which the comment was made; comment ID; ID of the original comment it responded to; highlighted text; comment submission; word count; type (comment or question); score; creation date; last edit date; status (online or invalid); page number; and range.
3. Interactions with the comment: number of replies; number of upvoters.
Are there missing data?
No, there does not appear to be any missing data.
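A check like the one above can be sketched in a few lines. This is only an illustration in plain Python, not the tool used in the course; the rows and column names below are hypothetical placeholders, not the real Perusall export.

```python
def count_missing(rows):
    """Return {column: number of rows where the value is None or blank}."""
    columns = {col for row in rows for col in row}
    missing = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or (isinstance(value, str) and value.strip() == ""):
                missing[col] += 1
    return missing


# Hypothetical rows; the column names are placeholders, not the real export's.
rows = [
    {"student_id": "s1", "type": "comment", "page": 2},
    {"student_id": "s2", "type": "question", "page": None},
]
missing = count_missing(rows)
```

A dataset with no missing data would give a count of zero for every column.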
Are there more comments or questions?
There are more comments (481) than questions (136).
Which papers receive more comments?
The top three papers receiving the most comments are Wise, 2013 (71 comments), Santos, 2012 (67 comments), and Introduction to Learning Analytics (66 comments).
On which pages do most of the comments occur?
Most comments occur on page 2.
Who were the most prolific commentators?
Zack is the most prolific commentator (81 comments), followed by Jaydon (53 comments) and Janice (46 comments).
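The tallies above (by type, by paper, by page, and by student) are all the same operation: count how often each value appears in one column and keep the largest counts. A minimal sketch, assuming the data is a list of dictionaries with hypothetical column names:

```python
from collections import Counter


def top_counts(rows, column, n=3):
    """Return the n most frequent values in a column with their counts."""
    return Counter(row[column] for row in rows).most_common(n)


# Hypothetical rows, for illustration only.
rows = [
    {"student": "A", "type": "comment", "page": 2},
    {"student": "A", "type": "question", "page": 2},
    {"student": "B", "type": "comment", "page": 5},
]
by_type = top_counts(rows, "type")
by_student = top_counts(rows, "student", n=1)
```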
What new columns are added to the dataset?
"document_id", "sentence_id", and "tokenized text".
Is the structure of the data wide or long?
Long, because each row holds a single token, so the data has many rows and relatively few columns.
Could you rebuild the original text from the new columns? How?
Yes. Every token keeps its document ID (which reading the text comments on) and its sentence ID (which comment thread the word belongs to), so we can rebuild the text by grouping the tokens that share both IDs and joining them in their original order.
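The idea can be sketched as follows. This is a simplified stand-in for the course's tokenizer, written in plain Python; the regular expression and the sample comments are assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(comments):
    """Turn (document_id, sentence_id, text) records into one row per token."""
    long_rows = []
    for document_id, sentence_id, text in comments:
        # Lowercase and keep runs of letters, digits, and apostrophes.
        for token in re.findall(r"[a-z0-9']+", text.lower()):
            long_rows.append((document_id, sentence_id, token))
    return long_rows


def rebuild(long_rows):
    """Group tokens by (document_id, sentence_id) and join them in row order."""
    grouped = defaultdict(list)
    for document_id, sentence_id, token in long_rows:
        grouped[(document_id, sentence_id)].append(token)
    return {key: " ".join(tokens) for key, tokens in grouped.items()}


# Hypothetical comments, for illustration only.
comments = [
    ("doc1", 1, "Data helps students learn."),
    ("doc1", 2, "Why collect it?"),
]
long_rows = tokenize(comments)
rebuilt = rebuild(long_rows)
```

In this sketch the rebuilt text is lowercase and has no punctuation, so the words and their order are recovered but not the exact original formatting.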
What are the most frequent words?
"The" (1,969 entries), "to" (1,334 entries), and "and" (1,025 entries).
Which are the most frequent words now?
"data" (778 entries), "student" (332 entries), and "learn" (294 entries).
Does this list make sense given the topic of our course?
Yes. Because this is a learning analytics course in education, it makes sense that "data" is the most common word, as it relates to data analytics, followed by "student" and "learn", which relate to education.
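The shift in the most frequent words, from function words to topic words, comes from dropping stop words before counting. A minimal sketch, with a tiny illustrative stop-word list (a real list is much longer) and made-up tokens; stemming is not shown:

```python
from collections import Counter

# Illustrative stop-word list only; real lists contain hundreds of words.
STOP_WORDS = {"the", "to", "and", "a", "of", "is"}


def word_counts(tokens, stop_words=frozenset()):
    """Count tokens, skipping any that appear in stop_words."""
    return Counter(t for t in tokens if t not in stop_words)


# Made-up tokens, for illustration only.
tokens = "the data and the student to the data".split()
raw_top = word_counts(tokens).most_common(1)
clean_top = word_counts(tokens, STOP_WORDS).most_common(1)
```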
Do you think we lost important information in this cleaning?
Since we filtered out numbers and any token that is not alphabetical, we may have lost some numbers (e.g. years) that matter to the context of a reading. From what I remember, though, few numbers were important to the reading discussions, so filtering them out should be fine. Reducing words to their stems also makes the analysis easier.
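To see what such a filter discards, here is a minimal sketch that keeps only purely alphabetical tokens. It assumes the filter behaves like Python's `str.isalpha`, which may differ in detail from the one used in the course; the tokens are made up.

```python
def keep_alphabetic(tokens):
    """Keep only tokens made up entirely of letters."""
    return [t for t in tokens if t.isalpha()]


# Made-up tokens, for illustration only.
tokens = ["wise", "2013", "covid19", "data"]
kept = keep_alphabetic(tokens)
dropped = [t for t in tokens if t not in kept]
```

Years such as "2013" are dropped, and so is any token that mixes letters and digits.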
Which column disappears?
"Tokenized text".
Which columns appear?
"gram" and "token".
What are the most common bigrams?
"learn_analyst", "data_collect", "predict_model", "data_analysi", and "tidi_data".
What are the most common trigrams?
"use_learn_analst", "learn_analyt_softwar", "learn_analyt_tool", "social_media_platform", "type_learn_analyst".
Are bigrams and trigrams common compared with simple words?
Some common words, such as "learn", "analytics/analysis", and "data", appear in both the bigrams and the trigrams.
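Bigrams and trigrams like those above are built by sliding a window of two or three tokens over the stemmed text and joining each window with an underscore. A minimal sketch with made-up stemmed tokens; in practice the window would be applied within each comment, so that n-grams do not span two comments:

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up stemmed tokens, for illustration only.
tokens = ["learn", "analyt", "tool", "use", "learn", "analyt"]
bigrams = Counter(ngrams(tokens, 2))
trigrams = ngrams(tokens, 3)
```

Because every trigram contains two bigrams, frequent words tend to recur across both lists.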
What are the most common bigrams and trigrams with the word "student"?
"student_data", "student_learn", and "student_engag".
Which paper receives the most positive / negative comments?
Srinivasa 2021 receives the most positive comments, and Selwyn 2015 receives the most negative comments.
Which student, on average, was the most positive / negative according to the analysis?
Jewelina was the most positive commenter and Zack was the most negative commenter.
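The ranking above depends on averaging a per-comment sentiment score within each group (paper or student). A minimal sketch of that averaging step, assuming each comment already has a numeric score; the scores, names, and column names below are hypothetical, and the scoring method itself is not shown:

```python
from collections import defaultdict


def mean_by(rows, key, value="sentiment"):
    """Return {group: mean of value} for the groups found in column `key`."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum, count]
    for row in rows:
        totals[row[key]][0] += row[value]
        totals[row[key]][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


# Hypothetical scored comments, for illustration only.
rows = [
    {"student": "A", "paper": "P1", "sentiment": 2.0},
    {"student": "A", "paper": "P2", "sentiment": -1.0},
    {"student": "B", "paper": "P2", "sentiment": -3.0},
]
by_student = mean_by(rows, "student")
by_paper = mean_by(rows, "paper")
most_positive = max(by_student, key=by_student.get)
most_negative = min(by_student, key=by_student.get)
```

Using the mean rather than the sum keeps a student who writes many comments from looking more negative only because of volume.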
Do you think the sentiment analysis is accurate? Why?
Yes, I think it is mostly accurate. It is understandable that Srinivasa 2021 received the most positive comments because it is the first introductory reading in Week 1, which is more straightforward and less controversial. In contrast, Selwyn 2015 is a critical study of digital data and education, so students' comments are more critical. I remember commenting on power inequality and the control of data by those in power in that reading.
For the analysis by student, I also think it is mostly accurate. From what I remember, Jewelina liked summarising the reading and raising follow-up questions about it. Her comments were usually not controversial or critical. Zack, in contrast, made a lot of comments, many of them original and provocative. So it is possible that the ideas he raised involved more critical thinking and personal judgement, leading to a higher level of negativity.