<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Topic Modeling for Answers Detection in Online Game Chats</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexander Sirotkin</string-name>
          <email>avsirotkin@hse.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ilya Musabirov</string-name>
          <email>imusabirov@hse.ru</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paul Okopny</string-name>
          <email>paul.okopny@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Denis Bulygin</string-name>
          <email>bulygindi@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vladimir Ivanov</string-name>
          <email>vivanov@kpfu.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kazan (Volga Region) Federal University</institution>
          ,
          <addr-line>Kazan</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Research University Higher School of Economics</institution>
          ,
          <addr-line>Saint Petersburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Uppsala University</institution>
          ,
          <addr-line>Uppsala</addr-line>
          ,
          <country country="SE">Sweden</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Helping behavior is a significant part of the social learning process in online games. One type of such behavior is answering questions in a chat. We provide a method to detect whether a question asked in a chat was answered and by whom. The proposed method is based on topic modeling of chat messages and on comparing the detected topic of a question with the topic of a possible reply. We show its efficiency on chat messages from online games.</p>
      </abstract>
      <kwd-group>
        <kwd>online games</kwd>
        <kwd>chats</kwd>
        <kwd>topic modeling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Massively Multiplayer Online Games are an extremely abundant source of social
interactions. There are plenty of studies dedicated to the exploration of altruistic or
toxic behavior [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1,2,3</xref>
        ]. Our main goal is to extract informal helping practices from
the in-game chat of an online game. In this article we address one type of helping
behavior in chat, namely answering the questions of other players. This work provides
an alternative and more flexible solution to earlier work on extracting questions
related to predefined game aspects [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Analyzing text chats is a broad field of
research [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This paper describes heuristics for detecting chat blocks that
may contain a question-answer pair, and proposes a method for marking
possible answers. Our main assumption is that a question and an
answer to it should be similar in some way. In this paper we show that the closeness
of the topics assigned to chat lines can serve as a measure of such similarity.
      </p>
      <p>In Section 2 we describe the dataset used for the experiments. Section 3 consists of
three parts. In the first part we introduce the heuristics used for detecting chat blocks
that could potentially contain questions. The second part is devoted to
dataset preparation and the marking-up procedure that allows us to evaluate the quality
of the proposed method. In the third part we report on applying Twitter-LDA to infer
topics for chat lines. The inferred topics are used as a measure of similarity between
questions and answers. In Section 4 we describe the results of evaluating the method
on the dataset and provide some quality metrics. Section 4 also contains the overall
conclusion.</p>
    </sec>
    <sec id="sec-2">
      <title>Description of data</title>
      <p>In this work we analyze chat messages from Castlot, a browser-based Massively Multiplayer
Online Strategy/Role-Playing Game (MMO Strategy/RPG). We use
chat logs from the public chat of a single server covering the period from March
to December 2013. In total there are 366,244 messages from 5,568 users. The
length of a message is limited to 200 characters.</p>
      <p>The strategy part of Castlot gameplay involves both individual development
and collaboration with other players. This structure is reflected in the topics discussed
in chat.</p>
    </sec>
    <sec id="sec-3">
      <title>Text analysis</title>
      <sec id="sec-3-1">
        <title>Block detection</title>
        <p>First, we try to detect blocks of messages that appear to contain a question and
an answer. We assume that in most cases a question contains question
words and/or a question mark. Such messages can be detected with a simple regular
expression. The second assumption is that most people are thankful
for help, so we expect the person who asked a question to say “thank you”
in some form shortly after receiving an answer. We can therefore use a list of possible
synonyms for “thank you” and check whether a thankful message from the same user follows
the question. In this work we assume that the distance between the question and the
thankful message should not exceed 30 messages. These two simple criteria give us a set of
message blocks that are likely to contain a question-answer pair.</p>
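        <p>The two criteria above can be sketched in Python as follows. This is only a minimal illustration: the paper does not list its question words or “thank you” synonyms, so the patterns QUESTION_RE and THANKS_RE, the function name find_candidate_blocks, and the (user, text) message layout are all assumptions made for the example.</p>
        <preformat>
```python
import re

# Hypothetical patterns: the exact question words and "thank you"
# synonyms used in the study are not given, so these are assumptions.
QUESTION_RE = re.compile(
    r"\?|\b(?:who|what|where|when|why|how|which)\b", re.IGNORECASE
)
THANKS_RE = re.compile(r"\b(?:thanks|thank you|thx|ty)\b", re.IGNORECASE)
MAX_DISTANCE = 30  # question-to-thanks distance limit stated in the text


def find_candidate_blocks(messages, max_distance=MAX_DISTANCE):
    """Return (start, end) index pairs of candidate question-answer blocks.

    messages is a list of (user, text) tuples in chat order. A block starts
    at a question-like message and ends at the first thankful message from
    the same user that is at most max_distance messages later.
    """
    blocks = []
    for start, (user, text) in enumerate(messages):
        if not QUESTION_RE.search(text):
            continue
        last = min(start + max_distance, len(messages) - 1)
        for end in range(start + 1, last + 1):
            other_user, other_text = messages[end]
            if other_user == user and THANKS_RE.search(other_text):
                blocks.append((start, end))
                break
    return blocks


chat = [
    ("ann", "how do I upgrade the castle?"),
    ("bob", "click the castle and press upgrade"),
    ("cat", "hello everyone"),
    ("ann", "thx bob"),
]
print(find_candidate_blocks(chat))  # [(0, 3)]
```
        </preformat>
        <p>Each returned pair delimits one candidate block; whether the block really contains an answer still has to be decided separately.</p>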
      </sec>
      <sec id="sec-3-2">
        <title>Test set marking</title>
        <p>
          For testing purposes we consider messages from the first 40 days of the period
covered by the server logs. This set consists of 149,729 messages. We extracted 1,517 blocks that we
suspect to contain a question-answer pair, based on the assumptions described above.
Every block was manually analyzed and marked according to whether it has an answer; then
every message was marked as either containing an Answer (A) to the question that
starts the block, or not (X). We found that 925 blocks contain answers and
592 do not. Below you may find examples of blocks with marking.
        </p>
        <p>
          The most popular approach to topic modeling for text documents is Latent
Dirichlet Allocation (LDA) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. However, it was developed for processing large
documents. A chat consists of many short messages and so differs from the
traditional datasets that LDA was designed for. A similar issue arises
when analyzing Twitter or other social media based on short messages. One
solution to this issue is the Twitter-LDA algorithm [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], a modification of
LDA for short messages.
        </p>
        <p>We apply Twitter-LDA with 20 topics to our dataset. As a result, every
message line in the chat is assigned one topic.</p>
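        <p>To make the idea of one topic per message concrete, the sketch below assigns a tokenized message to the topic under which its words are most likely. It is a simplification for illustration and not the Twitter-LDA inference procedure itself; the function assign_topic, the smoothing value floor, and the toy word probabilities are assumptions, with only the example words taken from the topics discussed in this paper.</p>
        <preformat>
```python
import math


def assign_topic(tokens, topic_word_probs, floor=1e-6):
    """Return the index of the topic with the highest log-likelihood.

    topic_word_probs is a list of {word: probability} dicts, one per topic.
    Words unknown to a topic get the small probability floor (an assumed
    smoothing constant) so that the logarithm stays defined.
    """
    scores = [
        sum(math.log(word_probs.get(token, floor)) for token in tokens)
        for word_probs in topic_word_probs
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy model with two topics; the probabilities are invented for the example.
toy_topics = [
    {"quest": 0.4, "main": 0.3, "dignity": 0.3},
    {"assault": 0.5, "castle": 0.5},
]
print(assign_topic(["how", "finish", "main", "quest"], toy_topics))  # 0
print(assign_topic(["assault", "castle"], toy_topics))  # 1
```
        </preformat>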
        <p>Since the dataset is marked, we can compare the distribution of topics across
different types of messages. Fig. 1 shows this distribution among
the questions (Q), answers (A), and messages that are neither questions nor
answers (X) in the detected blocks of the test set. It is easy to see that the
topic distributions differ between the message types.</p>
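        <p>A comparison of this kind can be computed with a few lines of Python. The sketch below assumes each marked message is available as a (label, topic) pair, where the label is Q, A, or X; the function name and the sample data are hypothetical.</p>
        <preformat>
```python
from collections import Counter, defaultdict


def topic_distribution_by_label(labeled_messages):
    """Map each label (Q, A or X) to {topic: share of that label's messages}."""
    counts = defaultdict(Counter)
    for label, topic in labeled_messages:
        counts[label][topic] += 1
    return {
        label: {
            topic: n / sum(counter.values())
            for topic, n in sorted(counter.items())
        }
        for label, counter in counts.items()
    }


# Invented sample: three questions, one answer, two other messages.
sample = [("Q", 5), ("Q", 5), ("Q", 19), ("A", 5), ("X", 2), ("X", 3)]
dist = topic_distribution_by_label(sample)
print(dist["X"])  # {2: 0.5, 3: 0.5}
```
        </preformat>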
        <p>We examined the messages and their topics and found that some topics can
be easily interpreted by a human. For example, topic 5, which covers more than
15% of questions, contains discussions of interactions between players such as
PvP battles and friendship requests. Topic 19, in turn, represents messages with
help requests. In both cases, questions were mostly answered (fig. 2).</p>
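        <p>The share of answered questions per topic, as plotted in fig. 2, could be derived from the marked blocks along the following lines. The input layout, a (question topic, has answer) pair per block, and the function name are assumptions for this sketch.</p>
        <preformat>
```python
from collections import defaultdict


def answered_share_by_topic(blocks):
    """Return {question_topic: fraction of blocks marked as answered}.

    blocks is an iterable of (question_topic, has_answer) pairs.
    """
    totals = defaultdict(int)
    answered = defaultdict(int)
    for topic, has_answer in blocks:
        totals[topic] += 1
        answered[topic] += int(has_answer)
    return {topic: answered[topic] / totals[topic] for topic in sorted(totals)}


# Invented sample of four blocks.
marked = [(5, True), (5, True), (5, False), (19, True)]
shares = answered_share_by_topic(marked)
print(shares[19])  # 1.0
```
        </preformat>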
        <p>At the same time, there are other topics with a large share of answered
questions. For example, topic 0 is characterized by words related to
in-game quests (“quest”, “dignity”, “main”). Topic 8, in turn, includes words
characterizing interaction between players (“assault”, “castle”).</p>
        <p>In contrast, there are topics with a low share of questions, for example topic
2, which contains greetings, or topics 3 and 6, which contain pronouns and common verbs
(e.g. “you”, “are”, “she”, “know”).</p>
        <p>In every candidate question-answer block we check whether there is a message with the
same topic as the first line of the block, which we assume to be the question. In
table 1 we provide the numbers of answered and unanswered questions estimated
by our method and those given by the dataset marking, which we take as ground
truth. The accuracy of the proposed method on our dataset is 68%, which seems
reasonable for such a simple method.</p>
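        <p>The topic comparison and the accuracy computation can be sketched as below. The sketch assumes each block is given as the list of topics of its messages, question first, paired with its manual mark; the function names and the four sample blocks are hypothetical and do not reproduce the reported figures.</p>
        <preformat>
```python
def first_match_index(block_topics):
    """Index of the first message sharing the question's topic, else None.

    block_topics lists the topic of every message in a block, with the
    question (the first line of the block) at index 0.
    """
    question_topic = block_topics[0]
    for index, topic in enumerate(block_topics[1:], start=1):
        if topic == question_topic:
            return index
    return None


def accuracy(blocks):
    """Fraction of blocks where the topic check agrees with the manual mark."""
    correct = sum(
        (first_match_index(topics) is not None) == marked_answered
        for topics, marked_answered in blocks
    )
    return correct / len(blocks)


# Invented sample blocks: (topics of messages, marked as answered).
sample_blocks = [
    ([5, 2, 5, 3], True),   # match found, marked answered: agreement
    ([19, 2, 3], False),    # no match, marked unanswered: agreement
    ([0, 0, 6], False),     # match found, marked unanswered: disagreement
    ([8, 3, 2], True),      # no match, marked answered: disagreement
]
print(accuracy(sample_blocks))  # 0.5
```
        </preformat>
        <p>The index returned by first_match_index also points to the message, and hence the player, that the check treats as the likely answer.</p>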
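        <p>Table 1. Numbers of answered and unanswered questions: method estimate versus dataset marking. Rows: block with answer; block without answer.</p>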
        <p>
          Compared to the approach described in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and based on a GLR parser [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], our
current approach is more flexible. The structure of questions varies significantly,
and not all questions can be detected using a small list of strict parsing rules. In our
test set only 65 of the 925 blocks that contain answers start with a question
that can be represented in the form described in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>In this work we presented a method for extracting questions and answers from
free-form text in chat logs based on topic modeling. The method relies on
two simple heuristics. The first is that an answered question often leads to a
thankful message; the second is that the topics of a question and its answer
must be similar or exactly the same. The suggested method shows quality good enough
for use in the analysis of social interaction in in-game communication
and helping behavior. We demonstrate that the chance of a question being answered
depends on its topic: technical in-game questions have a higher chance
of being answered than common questions. Our current work is directed
toward integrating common question answering techniques with the proposed topic
modeling approach.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgment</title>
      <p>This study was supported by Russian Science Foundation Project (no.
15-1110019) and carried out at Kazan Federal University.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Kang</surname>
            ,
            <given-names>A.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Woo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>H.K.</given-names>
          </string-name>
          :
          <article-title>Altruism in games: Helping others help themselves</article-title>
          .
          <source>In: 2014 13th Annual Workshop on Network and Systems Support for Games (NetGames)</source>
          (Dec
          <year>2014</year>
          )
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Kwak</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blackburn</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Linguistic Analysis of Toxic Behavior in an Online Video Game</article-title>
          .
          <source>In: Social Informatics: SocInfo 2014 International Workshops, Barcelona, Spain, November 11, 2014, Revised Selected Papers</source>
          . Springer International Publishing, Cham (
          <year>2015</year>
          )
          <fpage>209</fpage>
          -
          <lpage>217</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Dumais</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jeffries</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Russell</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teevan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Understanding User Behavior Through Log Data and Analysis</article-title>
          .
          <source>In: Ways of Knowing in HCI</source>
          . Springer New York, New York, NY (
          <year>2014</year>
          )
          <fpage>349</fpage>
          -
          <lpage>372</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Musabirov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Okopny</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bulygin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Analyzing chat logs in online games for tutorial improvement</article-title>
          .
          <source>In: Proceedings of the 2015 Annual Symposium on Computer-Human Interaction in Play. CHI PLAY '15</source>
          , New York, NY, USA, ACM (
          <year>2015</year>
          )
          <fpage>661</fpage>
          -
          <lpage>666</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Uthus</surname>
            ,
            <given-names>D.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aha</surname>
            ,
            <given-names>D.W.</given-names>
          </string-name>
          :
          <article-title>Multiparticipant chat analysis: A survey</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>199-200</volume>
          (
          <year>2013</year>
          )
          <fpage>106</fpage>
          -
          <lpage>121</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          :
          <article-title>Latent Dirichlet allocation</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>3</volume>
          (
          <year>2003</year>
          )
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>W.X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lim</surname>
            ,
            <given-names>E.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Comparing Twitter and traditional media using topic models</article-title>
          .
          <source>In: Advances in Information Retrieval</source>
          . Springer (
          <year>2011</year>
          )
          <fpage>338</fpage>
          -
          <lpage>349</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Tomita</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>S.K.</given-names>
          </string-name>
          :
          <article-title>The generalized LR parsing algorithm</article-title>
          . In Tomita, M., ed.:
          <source>Generalized LR Parsing</source>
          . Kluwer, Boston (
          <year>1991</year>
          )
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>