Assignment 4 Report

1. Loading and Understanding the Data

What information is being captured by Perusall in this dataset?

1. Student details: student's last name; student's first name; and student ID

2. Comment details: ID of the reading on which the comment was made; comment ID; ID of the original comment it responded to; highlight text; comment submission; word count; type (comment or question); score; creation date; last edit date; status (online or invalid); page number; and range.

3. Interactions with the comment: number of replies; number of upvoters

2. Examining the Data Distribution

Are there missing data?

No, there does not seem to be any missing data.

Are there more comments or questions?

There are more comments (481) than questions (136).

Which papers receive more comments?

The top three papers receiving the most comments are Wise, 2013 (71 comments), Santos, 2012 (67 comments), and Introduction to Learning Analytics (66 comments).

On which pages do most of the comments happen?

Most comments happen on page 2.

Who were the most prolific commentators?

Zack is the most prolific commentator (81 comments), followed by Jaydon (53 comments) and Janice (46 comments).
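The counts in this section can be reproduced with simple tallies. The sketch below uses a few made-up rows; the field names (`author`, `type`, `page`) are assumptions and not necessarily the column names in the Perusall export.

```python
from collections import Counter

# Hypothetical rows standing in for the Perusall export; field names are assumed.
rows = [
    {"author": "A", "type": "comment", "page": 2},
    {"author": "A", "type": "question", "page": 2},
    {"author": "B", "type": "comment", "page": 5},
    {"author": "A", "type": "comment", "page": 1},
]

# Missing data: count empty or None values per field.
missing = Counter(
    field for row in rows for field, value in row.items() if value in (None, "")
)

# Comments vs. questions, comments per page, and comments per author.
by_type = Counter(row["type"] for row in rows)
by_page = Counter(row["page"] for row in rows)
by_author = Counter(row["author"] for row in rows)

print(dict(missing))             # empty dict when nothing is missing
print(by_type.most_common())
print(by_page.most_common(1))
print(by_author.most_common(1))
```

`most_common(n)` returns the top `n` entries sorted by count, which is all that the "which paper", "which page", and "who" questions need.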

3. Tokenizing the Text

What new columns are added to the dataset?

"document_id", "sentence_id", and "tokenized text".

Is the structure of the data wide or long?

Long, because it contains a lot of rows with relatively few columns.

Could you rebuild the original text from the new columns? How?

Yes. Since every row has its own document ID and sentence ID, we can rebuild the text by grouping the tokens by document ID (which reading the text commented on) and sentence ID (which comment thread the word belongs to), then joining the tokens in each group in their original order.
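The rebuilding step described above can be sketched as follows. The rows are invented for illustration, and the sketch assumes that the row order of the long table preserves the original word order.

```python
from collections import defaultdict

# Hypothetical long-format rows: (document_id, sentence_id, token).
# Row order is assumed to preserve the original word order.
tokens = [
    (1, 1, "tidy"), (1, 1, "data"), (1, 1, "helps"),
    (1, 2, "why"), (1, 2, "though"),
    (2, 1, "networks"), (2, 1, "matter"),
]

# Group tokens by their (document_id, sentence_id) key, keeping row order.
grouped = defaultdict(list)
for document_id, sentence_id, token in tokens:
    grouped[(document_id, sentence_id)].append(token)

# Join each group back into a single string.
rebuilt = {key: " ".join(words) for key, words in grouped.items()}

for key, text in rebuilt.items():
    print(key, text)
```

If the tokenizer lowercased the text or dropped punctuation, those details cannot be recovered this way; only the sequence of tokens comes back.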

What are the most frequent words?

"The" (1,969 entries), "to" (1,334 entries), and "and" (1,025 entries).

4. Cleaning the bag of words

Which are the most frequent words now?

"data" (778 entries), "student" (332 entries), and "learn" (294 entries).

Does this list make sense given the topic of our course?

Yes. Given that this course is a learning analytics course for education, it is normal to have "data" as the most common word relating to data analytics, followed by "student" and "learn", which are related to education.

Do you think we lost important information in this cleaning?

Since we filtered out numbers and any token that is not alphabetical, we may have lost some numbers (e.g. years) that are important to the context of the reading. From my memory, though, few numbers were important to the reading discussions, so filtering them out should be fine. Reducing each word to its stem also makes the analysis easier, since different forms of the same word are counted together.
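A minimal sketch of this kind of cleaning is shown below, to make the trade-off concrete. The stop word list and the suffix-stripping rule are assumptions made for illustration; the stems reported here (e.g. "tidi", "analysi") appear to come from a proper stemming algorithm, which this crude stand-in does not reproduce.

```python
import re

# Assumed stop word list and suffix rules, for illustration only.
STOP_WORDS = {"the", "to", "and", "in", "of", "a", "is", "are"}
SUFFIXES = ("ing", "s")

def naive_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean(text):
    """Lowercase, keep alphabetic tokens only, drop stop words, then stem."""
    words = text.lower().split()
    alphabetic = [w for w in words if re.fullmatch(r"[a-z]+", w)]
    return [naive_stem(w) for w in alphabetic if w not in STOP_WORDS]

cleaned = clean("The students are learning to tidy data in 2014")
print(cleaned)
```

The year "2014" is dropped by the alphabetic filter, which is exactly the kind of information loss discussed above, while "students" and "learning" collapse to stems that can be counted together with their other forms.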

5. Creating bi-grams and tri-grams

What column disappears?

"Tokenized text".

What columns appear?

"gram" and "token".

What are the most common bigrams?

"learn_analyst", "data_collect", "predict_model", "data_analysi", and "tidi_data".

What are the most common trigrams?

"use_learn_analst", "learn_analyt_softwar", "learn_analyt_tool", "social_media_platform", "type_learn_analyst".

Are bigrams and trigrams common compared with simple words?

Some common words tend to appear in both bigrams and trigrams, such as "learn", "analytics/analysis", "data", etc.

What are the most common bigrams and trigrams with the word student?

"student_data", "student_learn", and "student_engag".
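The n-gram tokens in this section join consecutive stems with an underscore. A minimal sketch of how such tokens could be built and then filtered for "student" follows; the token lists are invented, and building n-grams within each comment (never across two comments) is an assumption about the pipeline.

```python
from collections import Counter

def ngrams(tokens, n):
    """Join every run of n consecutive tokens with an underscore."""
    return ["_".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical cleaned and stemmed comments, one token list per comment.
comments = [
    ["student", "data", "privaci"],
    ["use", "student", "data"],
    ["student", "learn", "data"],
]

bigrams = Counter(g for tokens in comments for g in ngrams(tokens, 2))
trigrams = Counter(g for tokens in comments for g in ngrams(tokens, 3))

# Keep only the bigrams that contain the token "student".
student_bigrams = Counter(
    {g: c for g, c in bigrams.items() if "student" in g.split("_")}
)

print(bigrams.most_common(1))
print(student_bigrams.most_common())
print(trigrams.most_common())
```

A bigram can never occur more often than either of the words it contains, so n-gram counts are always at or below the counts of the corresponding single words.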

6. Important words for papers and authors

Important words for Papers

Are the most relevant words, used in the comments, relevant for each one of the papers? Why?

Yes, they are mostly relevant to each paper. For example, for Wickham 2014, "column", "messi", and "data_tidi" are commonly used because the reading is about tidying data. Grunspan 2014 is about social networks, so it is expected that "social_network" and "social_influenc" are frequently mentioned. These topics are specific to that week's reading, so they are rarely mentioned in comments on other readings.
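The pattern described above, where a word scores high for one reading because it is rare elsewhere, is what a TF-IDF weighting produces. The report does not state which importance measure was used, so the sketch below assumes plain TF-IDF with a natural-log inverse document frequency, applied to invented token lists.

```python
import math
from collections import Counter

# Hypothetical stemmed tokens from the comments on each reading.
docs = {
    "Wickham 2014": ["data", "column", "messi", "data", "column"],
    "Grunspan 2014": ["data", "social_network", "social_network", "student"],
    "Selwyn 2015": ["data", "student", "power"],
}

def tf_idf(docs):
    """Return {doc: {term: tf * idf}} with idf = ln(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        scores[name] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores

scores = tf_idf(docs)
for name, terms in scores.items():
    top = max(terms, key=terms.get)
    print(name, top, round(terms[top], 3))
```

In this toy data "data" appears in every reading, so its score is zero everywhere, while "column" and "social_network" rise to the top of the one reading that uses them.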

Important words for Authors

Can you recognize the most important words that you have used?

Some important words I have used include "goal", "big", and "meaning". These are words that I use quite commonly, so I do not recall in which particular reading I used them. But I do recognize most of the important words in the table.

What is the most common word that you have used, that is least used by the rest of your classmates? Does it make sense to you?

The most common words I have used include "futur", "assign", "lack", and "enhanc". Some of them, e.g. "futur", are also commonly used by other classmates. I realize I tend to use verbs that are more general (e.g. "enhance", "lack"), while many classmates tend to use nouns that are more specific to the topics (e.g. "data", "analyz_data", "academ"). It makes sense to me because I like describing how data analytics works in real life or its consequences for students and institutions, so I may use more verbs than nouns in my comments.

7. Cluster the authors

Does this graph make sense? Why?

Yes. Points that are closer to (0,0,0) are words that are less important (i.e. commonly used in many readings). The graph shows that most people's points are close to (0,0,0), which aligns with my impression that most words used by us are quite common and not specifically important. Some outliers here include Jewelina, Siqi Du, and Zack, suggesting that they may have used words that are important in certain readings.

K-Means Clustering

Are the similarities of the previous graph preserved in this 2D representation?

Yes. Both graphs show that the majority of the students are close to the center, meaning that the common words used by these students are similar to each other and relevant to the discussion. Similar to the previous 3D visualization, there are a few outliers that are far from the cluster centroid.

What is your interpretation of this graph?

My interpretation is that most students' comments around the same discussion are quite similar and relevant. The few outliers may arise because their comments are not very relevant to the reading, or because few classmates responded to them, so no discussion developed around their comments.
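To illustrate how outliers show up as points far from their cluster centroid, here is a small pure-Python k-means sketch. The 2D coordinates and student labels are invented, and seeding the centroids with the first k points is a simplification chosen to keep the run deterministic; it is not necessarily how the clustering in the assignment was initialised.

```python
import math

def k_means(points, k, iterations=20):
    """Plain k-means on 2D points; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

# Hypothetical 2D coordinates per student.
students = {
    "s1": (0.1, 0.0), "s2": (0.0, 0.2), "s3": (0.2, 0.1), "s4": (0.1, 0.1),
    "s5": (2.0, 2.0), "s6": (2.2, 2.1), "s7": (2.1, 1.9), "s8": (3.0, 3.0),
}
names = list(students)
points = [students[n] for n in names]
centroids, labels = k_means(points, k=2)

# Distance from each student to the centroid of their own cluster.
distance = {n: math.dist(p, centroids[l]) for n, p, l in zip(names, points, labels)}
farthest = max(distance, key=distance.get)
print(farthest, round(distance[farthest], 2))
```

Ranking students by distance to their own centroid gives a simple way to name the outliers rather than reading them off the plot.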

8. Sentiment Analysis

Which paper receives the most positive / negative comments?

Srinivasa 2021 receives the most positive comments, and Selwyn 2015 receives the most negative comments.

Which student on average was the most positive / negative according to the analysis?

Jewelina was the most positive commenter and Zack was the most negative commenter.

Do you think the sentiment analysis is accurate? Why?

Yes, I think it is mostly accurate. It is understandable that Srinivasa 2021 received the most positive comments because it is the introductory reading in Week 1, which is more straightforward and less controversial. In contrast, Selwyn 2015 is a critical study of digital data and education, so students' comments are more critical. I remember commenting on power inequality and the control of data by those in power in that reading.

For the analysis of students, I also think it is mostly accurate. From my memory, Jewelina liked summarising the reading and raising follow-up questions related to it. Her comments were usually not controversial or critical. In contrast, Zack made a lot of comments, many of them original and provocative. So it is possible that the ideas he raised involve more critical thinking and personal judgement, leading to a higher level of negativity.
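One way to see why critical vocabulary can pull a score down is a lexicon-based approach, sketched below. The report does not say which sentiment method was used, so this is only an assumed illustration: the lexicon, the readings, and the comments are all made up.

```python
from collections import defaultdict

# Tiny made-up lexicon: +1 for positive words, -1 for negative words.
LEXICON = {"clear": 1, "helpful": 1, "useful": 1,
           "inequality": -1, "control": -1, "risk": -1}

def sentiment(text):
    """Sum lexicon scores over the words; unknown words count as 0."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())

# Hypothetical (reading, comment) pairs.
comments = [
    ("Reading A", "a clear and helpful introduction"),
    ("Reading A", "useful overview"),
    ("Reading B", "who has control of the data"),
    ("Reading B", "this raises inequality and risk"),
]

# Average the comment scores per reading.
totals = defaultdict(list)
for reading, text in comments:
    totals[reading].append(sentiment(text))
average = {reading: sum(s) / len(s) for reading, s in totals.items()}
print(average)
```

Under a scheme like this, a thoughtful comment about power and inequality scores as negative purely because of its vocabulary, which would be consistent with critical readings and critical commenters ranking lowest.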