<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Analysis of Discussion Forum Data as a Basis for Mentoring Support</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jakub Kuzilek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Milos Kravcik</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rupali Sinha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>German Research Center for Artificial Intelligence, Alt-Moabit 91c</institution>
          ,
          <addr-line>Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Humboldt University of Berlin</institution>
          ,
          <addr-line>Unter den Linden 6, Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Supporting mentoring processes in higher education is a relevant and challenging aim. Meta-cognitive, emotional and motivational aspects play a crucial role here. Big data can help to recognize the affects of mentees, to react accordingly and to make the mentoring support scalable. In our study, we processed data from university discussion forums utilizing text and sentiment analysis. The results suggest that this approach can raise mentors' awareness of the activities in discussion forums, but limitations need to be considered. Evaluations with real users can help to develop these approaches further.</p>
      </abstract>
      <kwd-group>
        <kwd>Mentoring</kwd>
        <kwd>Text analysis</kwd>
        <kwd>Sentiment analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Good learning should be individualized and personalized. This goal was already
addressed with Intelligent Tutoring Systems, Adaptive and Personalized
Learning Environments. These systems mainly aim at the cognitive aspects of the
learning process. Intelligent Mentoring Systems (IMS) are going one step
further by including metacognitive, emotional and motivational elements in the
learning process.</p>
      <p>
        This leads to the following question: What should concepts for designing
learning and teaching look like to make the quality of individual mentoring
scalable for the acquisition of target competences? Compared to coaching and
tutoring, the mentoring process is more spontaneous, more holistic, based on the
needs and interests of the mentee and focusing on psychological support. The
relationship is more complex, two-way and based on emotions [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Here we address
the question of how to help mentors recognize relevant and urgent
contributions in discussion forums by means of text and sentiment analysis techniques.
      </p>
      <p>
        In the following, we first very briefly mention selected related work. Then
we introduce the analyzed data and the methods applied. In the main part, the
results are presented and discussed. Finally, we summarize the paper and outline
next steps.
      </p>
      <p>
        Sentiment analysis (SA) aims to analyse people's opinions and emotions from
written language. It is widely studied in data mining, Web mining, and text
mining to better understand human behaviours [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. It is usually essential to
consider the context of the text and the user preferences [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. User emotions and
intents when contributing to discussion forums can help to elicit their goals [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        There is a lack of opinion mining systems in non-English languages. Moreover,
cross-domain SA is still a significant challenge, including issues like the difference
in sentiment vocabularies across different domains and an objective assignment
of a strength marker to each sentiment word [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Data</title>
      <p>For this research, data from the OPAL discussion forums at the Technical
University of Dresden between the years 2005 and 2009 were used. The
dataset contains 16,614 messages from 123 forums exchanged between 1,490 users
(students and teachers). Each forum, message and user has a unique identifier.</p>
      <p>The data is anonymised. Messages consist of plain text with
HTML tags and may include emoticons from the following set: angel, blushing,
confused, cool, devil, grin, kiss, ohoh, sad, smile, tongue, ugly and wink.</p>
      <p>The analysis focused on the data from the 5 forums containing the highest
number of messages. Tab. 1 shows the statistics of the selected forums.</p>
      <p>To uncover information in the data, we applied text mining methods to the
selected messages. In the following, the data preprocessing is explained and then
each method is introduced.</p>
      <sec id="sec-2-1">
        <title>Text Preprocessing</title>
        <p>The text corpus was preprocessed in the following way:</p>
        <list list-type="order">
          <list-item><p>Extraction of emoticons: First, all emoticons present in the
text as HTML "IMG" tags with class "emoji" were extracted.</p></list-item>
          <list-item><p>Removal of HTML formatting: All messages were stripped of HTML
tags to obtain clean text messages.</p></list-item>
          <list-item><p>Tokenization: All messages were divided into separate words (tokens),
keeping the information about which message each word belongs to. The unnest_tokens
function from the tidytext R<sup>1</sup> package [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] was used.</p></list-item>
          <list-item><p>Stop words removal: German stop words were removed from the
tokenized corpus using a stop words dictionary<sup>2</sup>.</p></list-item>
          <list-item><p>Stemming: The remaining words were stemmed, meaning they were reduced
to their root form. For example, the words "Abschlusses" and "Abschlussen"
are both reduced to the root form "abschluss". For the stemming, we used the
Snowball library [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].</p></list-item>
          <list-item><p>Removal of tokens with length less than 4: All tokens shorter than
four characters, which typically represent shortcuts or abbreviations, were removed.</p></list-item>
        </list>
        <p>The preprocessed data contains 291,151 tokens in root form. Each token can
be mapped back to its original message and to the user who created the message.
        </p>
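        <p>The six preprocessing steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study, which relied on the tidytext and SnowballC R packages. The "alt" attribute as the carrier of the emoticon name, the short stop word list and the crude_stem helper are all hypothetical stand-ins.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re
from html.parser import HTMLParser

# Assumption: a tiny illustrative list; the study used a full German stop word dictionary.
GERMAN_STOP_WORDS = {"der", "die", "das", "des", "und", "ist", "zur", "eine"}


class MessageParser(HTMLParser):
    """Collects the visible text and the emoticon names of one forum message."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.emoticons = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        css_classes = (attributes.get("class") or "").split()
        # Assumption: the emoticon name is stored in the "alt" attribute.
        if tag == "img" and "emoji" in css_classes:
            self.emoticons.append(attributes.get("alt") or "")

    def handle_data(self, data):
        self.text_parts.append(data)


def crude_stem(token):
    """Placeholder suffix stripper; NOT the Snowball algorithm."""
    for suffix in ("es", "en", "er", "e", "s", "n"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def preprocess(message_html):
    """Returns (stemmed tokens, emoticon names) for one HTML message."""
    parser = MessageParser()
    parser.feed(message_html)  # steps 1 and 2: collect emoticons, drop tags
    parser.close()
    text = " ".join(parser.text_parts).lower()
    tokens = re.findall(r"[^\W\d_]+", text)  # step 3: tokenization
    tokens = [t for t in tokens if t not in GERMAN_STOP_WORDS]  # step 4
    tokens = [crude_stem(t) for t in tokens]  # step 5: stemming
    tokens = [t for t in tokens if len(t) >= 4]  # step 6: length filter
    return tokens, parser.emoticons


EXAMPLE_MESSAGE = (
    '<p>Die Frage zur Klausur des Abschlusses '
    '<img class="emoji" alt="smile" src="smile.png"/></p>'
)
example_tokens, example_emoticons = preprocess(EXAMPLE_MESSAGE)
```
]]></preformat>
        <p>In this sketch the length filter runs after stemming, following the order of the steps listed above, so a word is dropped only if its stem is shorter than four characters.</p>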
      </sec>
      <sec id="sec-2-2">
        <title>Word Frequencies and Document Frequencies</title>
        <p>The analysis of word frequencies is the most common way to approach a text
corpus. The purpose is to uncover the most common words reflecting the text
content. At first, the word counts for each forum were analysed by simply counting
the number of word occurrences. The analysis showed the most common words
in each forum.</p>
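        <p>To quantify what the discussion forums are about, the term frequency -
inverse document frequency (tf-idf) measure was used. It measures how important each word
is to a forum within the collection. The tf-idf of word <italic>i</italic> in
document <italic>j</italic> is the product of two measures:
          <disp-formula><tex-math>\mathrm{tf\mbox{-}idf}_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_{i}, \qquad \mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}, \qquad \mathrm{idf}_{i} = \log \frac{|D|}{|\{j : t_{i} \in D_{j}\}|}</tex-math></disp-formula>
The term frequency is the number of occurrences of the word
(<italic>n</italic><sub><italic>i,j</italic></sub>) divided by the document length
(the sum of <italic>n</italic><sub><italic>k,j</italic></sub> over all words <italic>k</italic>),
and the inverse document frequency is the logarithm of the
number of documents (|<italic>D</italic>|) divided by the number of documents in which the word
is present.</p>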
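        <p>The tf-idf definition above can be computed as in the following Python sketch, in which each forum is treated as one document. The natural logarithm, the function name tf_idf and the toy token lists are assumptions made for illustration; the source does not state the base of the logarithm.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def tf_idf(documents):
    """Computes tf-idf per word and document.

    documents: dict mapping a document id to its list of tokens.
    Returns a dict mapping each document id to a {word: tf-idf} dict.
    """
    n_documents = len(documents)
    counts = {doc_id: Counter(tokens) for doc_id, tokens in documents.items()}

    # Number of documents in which each word occurs at least once.
    document_frequency = Counter()
    for counter in counts.values():
        document_frequency.update(counter.keys())

    scores = {}
    for doc_id, counter in counts.items():
        length = sum(counter.values())  # document length: sum of all word counts
        scores[doc_id] = {
            word: (n / length) * math.log(n_documents / document_frequency[word])
            for word, n in counter.items()
        }
    return scores


# Toy data (invented): two "forums" given as lists of stemmed tokens.
example_forums = {
    "forum_a": ["aufgab", "algebra", "aufgab", "hilbert"],
    "forum_b": ["aufgab", "gmbh"],
}
example_scores = tf_idf(example_forums)
```
]]></preformat>
        <p>A word that occurs in every document, such as "aufgab" in the toy data, receives a tf-idf of zero, which is why the measure highlights forum-specific vocabulary rather than words common to all forums.</p>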
      </sec>
      <sec id="sec-2-3">
        <title>Sentiment Analysis</title>
        <p>
          For SA, the SentimentWortschatz sentiment lexicon [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] has been used. It contains
approximately 34,000 German words annotated with a sentiment value ranging from
-1 to 1, representing both negative and positive sentiment. The sentiment of a
message is calculated as the sum of the sentiments of the individual message words.
Three kinds of sentiment analysis were performed, which are described in the
following paragraphs.
          <fn><label>1</label><p>https://cran.r-project.org/</p></fn>
          <fn><label>2</label><p>https://github.com/stopwords-iso</p></fn>
        </p>
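        <p>The message-level score described above amounts to a lexicon lookup followed by a sum. The following Python sketch illustrates this. The three lexicon entries and their values are invented for illustration and are not taken from the lexicon used in the study, and treating words missing from the lexicon as neutral is an assumption.</p>
        <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical lexicon excerpt: stem -> polarity in [-1, 1]. Values are invented.
EXAMPLE_LEXICON = {"falsch": -0.75, "einfach": 0.25, "dank": 0.5}


def message_sentiment(tokens, lexicon):
    """Sums the polarity of all tokens; tokens not in the lexicon count as 0."""
    return sum(lexicon.get(token, 0.0) for token in tokens)


positive_example = message_sentiment(["dank", "einfach", "aufgab"], EXAMPLE_LEXICON)
negative_example = message_sentiment(["falsch", "falsch"], EXAMPLE_LEXICON)
```
]]></preformat>
        <p>Because the score is a sum rather than a mean, it is not bounded by -1 and 1, and longer messages can reach more extreme values than short ones.</p>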
        <p>Sentiment Trajectory. The sentiment of the messages was visualised in
chronological order. The result can be interpreted as the sentiment trajectory
over the whole forum lifetime. This visualisation can uncover the
general sentiment trend as well as outliers from the overall sentiment, that is,
messages that are markedly more positive or negative than the others.</p>
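        <p>A sentiment trajectory with flagged outliers could be derived as in the following Python sketch. The source identifies outliers visually and gives no numeric rule, so the z-score criterion, the threshold of 2.0, the function names and the sample data are all assumptions made for illustration.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from statistics import mean, pstdev


def sentiment_trajectory(messages):
    """Orders (timestamp, sentiment) pairs chronologically."""
    return sorted(messages, key=lambda item: item[0])


def flag_outliers(trajectory, threshold=2.0):
    """Returns the pairs whose sentiment deviates from the forum mean by more
    than `threshold` population standard deviations (an assumed criterion)."""
    values = [sentiment for _, sentiment in trajectory]
    if len(values) < 2:
        return []
    centre = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []  # all messages have the same sentiment
    return [
        (timestamp, sentiment)
        for timestamp, sentiment in trajectory
        if abs(sentiment - centre) / spread > threshold
    ]


# Invented sample: ISO dates sort chronologically as plain strings.
example_messages = [
    ("2008-01-08", -3.0),
    ("2008-01-01", 0.0),
    ("2008-01-02", 0.1),
    ("2008-01-03", -0.1),
    ("2008-01-04", 0.0),
    ("2008-01-05", 0.1),
    ("2008-01-06", -0.1),
    ("2008-01-07", 0.0),
]
example_trajectory = sentiment_trajectory(example_messages)
example_outliers = flag_outliers(example_trajectory)
```
]]></preformat>
        <p>In a mentoring setting, the flagged messages would be candidates for a mentor's attention rather than automatic decisions, since a strongly negative score may simply reflect the vocabulary of the topic.</p>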
        <p>Sentiment Wordcloud. The wordcloud visualisation showing the most
common words can be combined with the sentiment. The "cloud" is
divided into two halves: one half contains words with a positive
attitude, and the other half those with a negative attitude. The size of the halves
is irrelevant in this case. What matters are the terms themselves, which represent
the most common negative and positive words in the text.</p>
        <p>
          Correlation of Sentiment and Emojis. The last analysis addresses the
question of whether the emoticons used within the messages correspond
to the sentiment of the message. We assigned the following sentiment values to the
emojis (sentiment value in brackets): angel (0), blushing (0.4), confused (-0.2),
cool (0.8), devil (0), grin (0.8), kiss (0.4), ohoh (-0.8), sad (-0.8), smile (0.4),
tongue (0.6), ugly (-0.8) and wink (0.4). Then the emoji values and the corresponding
text sentiment were compared using Pearson's product-moment correlation test
[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
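        <p>The comparison of emoji values with text sentiment can be sketched in Python as follows. The emoji values are those listed above. Pairing each emoji occurrence with the sentiment of the message it appears in, and the sample pairs themselves, are assumptions made for illustration. The sketch computes only the correlation coefficient, not the p-value of the test.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math

# Sentiment values assigned to the emoticons, as listed in the text.
EMOJI_SENTIMENT = {
    "angel": 0.0, "blushing": 0.4, "confused": -0.2, "cool": 0.8,
    "devil": 0.0, "grin": 0.8, "kiss": 0.4, "ohoh": -0.8, "sad": -0.8,
    "smile": 0.4, "tongue": 0.6, "ugly": -0.8, "wink": 0.4,
}


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    if variance_x == 0 or variance_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(variance_x * variance_y)


# Invented sample: (emoticon name, sentiment of the message text).
example_pairs = [("smile", 0.5), ("sad", -1.0), ("grin", 1.2), ("ohoh", 0.3)]
emoji_values = [EMOJI_SENTIMENT[name] for name, _ in example_pairs]
text_values = [sentiment for _, sentiment in example_pairs]
example_r = pearson_r(emoji_values, text_values)
```
]]></preformat>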
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results and Discussion</title>
      <p>The previously presented methods have been applied to the data, and the
corresponding results are presented in this section.</p>
      <sec id="sec-3-1">
        <title>Word Count</title>
        <p>Fig. 1 presents the results of the word count analysis for the top 5 forums. Each chart
shows the most frequent words used in one discussion forum. One of the most
used words is "aufgab", the root form of the word
"Aufgabe" (assignment), which refers to an assignment within the course. Other terms such
as "frag" (question), "klausur" (exam) or "dank" (thanks) are also common across the selected discussion
forums. Thus one can assume that most of the message content consists of questions
about course assignments and exams. This is not surprising, since that is
why forums exist in many educational settings.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Term Frequency - Inverse Document Frequency</title>
        <p>Fig. 2 shows the most representative words for each forum. One can observe
that forums 447053831, 220528647 and 320634883 discuss mathematical issues
in their courses: words like "hilbert", "algebra" and "logit" are
typical mathematical terms. The other two forums cover topics in
economics, containing terms like "frank", "gmbh" and "gemeinkost". Based
on the analysis of word count and tf-idf, one can assume that the forums focus
on questions related to the course assessments.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Sentiment</title>
        <p>Our analysis focused on the sentiment within the discussion forums. Fig. 3 shows
the sentiment trajectory of each forum. One can observe that most discussion
forums tend to be slightly negative, and there are several negative outlier
values. For example, forum 447053831 has multiple negative peaks, suggesting that
these messages may be worth analysing and answering by a mentor. One can
also observe that the trajectories show declining sentiment values over time, which
suggests that messages became more urgent towards the end. The forums' educational
focus can explain the overall negativity of the messages. One can expect that a
large portion of the words used in communication between students and their
mentors will be of neutral sentiment. Still, if a student has difficulty, the sentiment
tends to be more negative.</p>
        <p>Another way to analyse sentiment is the sentiment wordcloud, as shown in Fig. 4.
One can observe important positive and negative words. Typical
representatives of positive words are "einfach", "genau" and "verstand". These words
reflect the understanding of assignments and students' success. Words
representing negative sentiment are, for example, "falsch", "frag" and "nicht".</p>
        <p>Finally, we also computed the correlation between emojis and the sentiment of the
corresponding text. The resulting Pearson's product-moment correlation test
estimates a correlation of 0.04 with p-value 4.5; thus we conclude that there
is only a small correlation between the sentiment of the text and the emoji used.</p>
        <p>Mentoring is of crucial importance for university students if provided in a timely manner
and with good quality. As the capacity of experienced mentors is limited,
scalable solutions are needed to make their work efficient. The available technology
can analyze the emotions of students from big data, revealing the learning
progress and the need to intervene.</p>
        <p>This work in progress deals with the text and sentiment analysis of
university discussion forums, which may help mentors to notice critical points where
their help is required. As mentioned earlier, there are many challenges in this
research area, including a lack of opinion mining systems in non-English
languages and cross-domain sentiment analysis. The sentiment vocabularies with
their assignments of words are of significant importance.</p>
        <p>Using our approach, we made several observations. In one case, the
exceptionally high sentiment positivity was caused by the commercial interests of the
author. Several other posts in this category were not related to learning, but
rather to various celebrations. For mentoring purposes, additional
pre-processing and fine-tuning would therefore be helpful. On the other hand, a post with
high sentiment negativity explained privacy and data protection rules. In
another post, the indicated negativity was a context issue: the short text
"das Risiko des Verlusts der Paritätsinfos geringer ist" contains three strongly
negative words, but it just describes a fact. One critical post deals with the
information architecture of an educational web site. So it seems that, for mentors,
posts with an increased negative sentiment are especially relevant to consider.
The sentiment trajectory indicates that the messages after a certain point in
time may become more urgent.</p>
        <p>Of course, we realize various limitations of this study, in addition to the
general challenges mentioned above, regarding the domain and context-dependency.</p>
        <p>Also, the tf-idf method does not capture the contextual information, assuming
the complete independence among all the words.</p>
        <p>Nevertheless, despite these limitations, text and sentiment analysis of
discussion forums can undoubtedly help to make the work of mentors more effective
and efficient. Even if not every flagged post turns out to be urgent, such notifications
should be valid in a longer time frame, if proper tools are deployed. To justify
their usefulness, evaluations with real mentors need to be performed. Relevant
functionalities will be integrated in the infrastructure of the tech4comp project
(https://tech4comp.de/).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>The project underlying this report is funded by the German Federal Ministry of
Education and Research under the funding code 16DHB2102. The presented
work was partially inspired by discussions with Cathleen Stuetzer and Ralf
Klamma. Responsibility for the content of this publication lies with the authors.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Best</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roberts</surname>
            ,
            <given-names>D.E.</given-names>
          </string-name>
          :
          <article-title>Algorithm AS 89: The upper tail probabilities of Spearman's rho</article-title>
          .
          <source>Journal of the Royal Statistical Society. Series C (Applied Statistics)</source>
          <volume>24</volume>
          (
          <issue>3</issue>
          ),
          <fpage>377</fpage>
          –
          <lpage>379</lpage>
          (
          <year>1975</year>
          ), http://www.jstor.org/stable/2347111
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bouchet-Valat</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>SnowballC: Snowball Stemmers Based on the C 'libstemmer' UTF-8 Library</article-title>
          (
          <year>2019</year>
          ), https://CRAN.R-project.org/package=SnowballC, R package version 0.6.0
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Goldhahn</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eckart</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quasthoff</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Building large monolingual dictionaries at the Leipzig Corpora Collection: From 100 to 200 languages</article-title>
          .
          <source>In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)</source>
          . pp.
          <fpage>759</fpage>
          –
          <lpage>765</lpage>
          .
          <publisher-name>European Language Resources Association (ELRA)</publisher-name>
          ,
          <publisher-loc>Istanbul, Turkey</publisher-loc>
          (May
          <year>2012</year>
          ), http://www.lrec-conf.org/proceedings/lrec2012/pdf/327_Paper.pdf
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Krenge</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petrushyna</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kravcik</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klamma</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Identification of learning goals in forum-based communities</article-title>
          (07
          <year>2011</year>
          ). https://doi.org/10.1109/ICALT.2011.95
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Medhat</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hassan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Korashy</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Sentiment analysis algorithms and applications: A survey</article-title>
          .
          <source>Ain Shams Engineering Journal</source>
          <volume>5</volume>
          (
          <issue>4</issue>
          ),
          <fpage>1093</fpage>
          –
          <lpage>1113</lpage>
          (
          <year>2014</year>
          ). https://doi.org/10.1016/j.asej.2014.04.011, http://www.sciencedirect.com/science/article/pii/S2090447914000550
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ravi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ravi</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>A survey on opinion mining and sentiment analysis: Tasks, approaches and applications</article-title>
          .
          <source>Knowledge-Based Systems</source>
          <volume>89</volume>
          ,
          <fpage>14</fpage>
          –
          <lpage>46</lpage>
          (
          <year>2015</year>
          ). https://doi.org/10.1016/j.knosys.2015.06.015, http://www.sciencedirect.com/science/article/pii/S0950705115002336
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Risquez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanchez-Garcia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>The jury is still out: Psychoemotional support in peer e-mentoring for transition to university</article-title>
          .
          <source>The Internet and Higher Education</source>
          <volume>15</volume>
          (
          <issue>3</issue>
          ),
          <fpage>213</fpage>
          –
          <lpage>221</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Silge</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robinson</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>tidytext: Text mining and analysis using tidy data principles in R</article-title>
          .
          <source>JOSS</source>
          <volume>1</volume>
          (
          <issue>3</issue>
          ) (
          <year>2016</year>
          ). https://doi.org/10.21105/joss.00037
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <source>Sentiment Analysis and Opinion Mining</source>
          , pp.
          <fpage>1152</fpage>
          –
          <lpage>1161</lpage>
          .
          <publisher-name>Springer US</publisher-name>
          ,
          <publisher-loc>Boston, MA</publisher-loc>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>