<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Identifying Discourse Boundaries in Group Discussions using a Multimodal Embedding Space</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ken Tomiyama</string-name>
          <email>dm176207@cc.seikei.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fumio Nihei</string-name>
          <email>dd166201@st.seikei.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yukiko I. Nakano</string-name>
          <email>y.nakano@st.seikei.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yutaka Takase</string-name>
          <email>yutaka-takase@st.seikei.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Seikei University</institution>
          ,
          <addr-line>Musashino, Tokyo 1808633</addr-line>
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In group discussions, it is not always easy for the participants to steer the discussion effectively so that it is fruitful. With the goal of facilitating group discussions, this study proposes a method for segmenting a discussion. Predicted discussion boundaries may be useful for tracking the discussion topics, analyzing the discussion structure, and determining the timing of an intervention. We created a multimodal embedding space using an autoencoder and represented each multimodal utterance in that space. A simple unsupervised approach was then used to detect discussion boundaries. In a preliminary analysis, we found that the proposed method can generate discussion segments that are comprehensible for analyzing a discourse structure. However, improving the performance in the discourse segmentation task remains future work.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>INTRODUCTION
Group discussion is widely used for decision-making and
idea generation. However, it is not always easy for the
participants to control the discussion effectively by
themselves. A facilitator is a person who helps the
participants establish a common understanding and reach
consensus during the conversation. To contribute
effectively, the facilitator needs to choose the right
moment to intervene while observing the
discussion. Thus, for a computer system that supports
a discussion, tracking the discussion
is one of the basic functions needed to facilitate it.</p>
      <p>© 2018. Copyright for the individual papers remains with the authors.
Copying permitted for private and academic purposes.
SymCollab '18, March 11, Tokyo, Japan.</p>
      <p>Many previous studies have addressed topic tracking and
discourse segmentation, mainly following two approaches.
The unsupervised approach is based on
lexical cohesion, such as identical words, synonyms, and
hypernyms [1, 2]; a discourse boundary is determined by the
cosine similarity between word vectors. The other approach
is supervised: a set of features is
calculated, and a classifier is trained to label each point as a boundary or
non-boundary [3]. The motivation of these previous
studies was to use discourse boundaries to identify more
informative segments, retrieve specific information more
accurately, and generate a summary of the discourse. The
purpose of discourse segmentation in this study is slightly
different. We aim to identify discussion boundaries, each of
which is a kind of shift in the discussion and may be an
appropriate moment for a facilitator to intervene. Thus, each
discourse segment delimited by boundaries should be a
coherent discourse.</p>
      <p>Moreover, group discussion is less well structured than
text, so discussion segmentation is likely to be more difficult
than text segmentation. A discussion does not always
proceed linearly, and the same topic may be discussed
multiple times. In more closely related work, [4, 5] proposed
discourse segmentation models that employ a
feature-based supervised classification approach.</p>
      <p>However, feature selection is a laborious process. In this study,
we employ an autoencoder to learn a multimodal embedding
space in which each utterance is represented as a vector. The advantage
of this approach is that feature selection is not necessary.
We then employ an unsupervised approach that decides
discourse boundaries by calculating the cosine similarity between
these vectors.</p>
      <p>
        GROUP DISCUSSION CORPUS
Task and Subjects
We recruited 30 subjects (10 groups of 3 people), all
native Japanese speakers. Each group participated in a
30-minute discussion to create a one-day travel plan for
foreign visitors. The participants cooperatively filled in
a worksheet in which they described (1) the country of the
expected travelers, (2) the catchphrase, (3) the details of
the sightseeing course, and (4) its selling points. The
participants were instructed to discuss the four themes (1) to (4)
in this order. To motivate them to
engage in the task, they were also told that their plan
would be evaluated later (e.g., by the number of sightseeing
spots included in the plan).
      </p>
      <p>Experimental Environment
Figure 1 shows a snapshot of the experiment. Three people
were seated at a table, and each of them wore a headset
microphone (Audio-technica HYP-190H) to record speech
data. An inertial motion unit (IMU; ATR-Promotions
WAA010) was attached to the back of each participant’s head.
These sensors measured head acceleration, angular velocity,
and terrestrial magnetism along the x, y, and z axes at 20
fps. A Kinect sensor placed opposite each
participant was used to collect face tracking data
individually<sup>1</sup>. In addition, two video cameras were set up to
record an overview of the communication. Speech data were
manually transcribed.</p>
      <p>MULTIMODAL EMBEDDING SPACE
From the speech audio<sup>2</sup>, we obtained 7052 utterances, for
each of which we calculated the following verbal and nonverbal
features.</p>
      <p>
        Features
(1) The number of new/already used nouns: Nouns were
extracted from the speech transcription using the MeCab
morphological tagger. Each extracted noun was then
categorized as a new noun or a used noun: if the noun had
already been used in the conversation, it was categorized as
a used noun; if not, it was categorized as a new noun. The
numbers of new and already used nouns were counted for each
utterance.
(2) The number of nouns in common/different between the
current and the previous utterance: We counted the number
of nouns that appeared in both the current and the previous
utterance (hereafter “nouns in common”). We also counted
the number of nouns that appeared in the current utterance
but not in the previous one (hereafter “different nouns”).
(3) The number of verbs in common/different between the
current and the previous utterance: The number of verbs in
common and the number of different verbs were counted in
the same way as in (2).
(4) Utterance length (time duration and the number of
morphemes): We used two measures of utterance
length. One is the time duration of the utterance. The other is the
number of morphemes contained in the utterance.
(5) Utterance overlap: If a given utterance overlapped
with another one, the duration of the overlap was measured.
If the utterance overlapped with two other utterances
(three people were speaking at the same time), the two
overlap durations were added up.
(6) Speech intensity: Speech intensity (dB) was measured
every 10 ms using the Praat audio analysis tool, and the
maximum, minimum, average, and variance were calculated
for each utterance.
(7) Head rotation: Head rotation in the y-axis was measured
every 20 ms from the Kinect face tracking data. Then, the
maximum, minimum, average, and variance were calculated
for each utterance.
(8) Composite head acceleration: IMUs were attached to the
back of each participant’s head, and the acceleration was
measured at 20 frames per second (fps). The composite
acceleration for the x, y, and z axes was computed for each time
frame i using the following equation:
<fn><label>1</label><p>Kinect data was not used in this work.</p></fn>
      </p>
      <p><disp-formula><tex-math>AH_i = \sqrt{x_i^{2} + y_i^{2} + z_i^{2}}</tex-math></disp-formula>
Then, maximum, minimum, average, and variance were
computed for each participant per utterance.
(9) Wavelet features for the composite head acceleration:
This feature is used for measuring the synchrony of the head
motions between discussion participants. Multiresolution
analysis with Daubechies wavelets [6] was applied to the
composite acceleration calculated in (8). Then, maximum,
minimum, average, and variance were computed for a
wavelet at the highest resolution.
(10) Doc2Vec features: A Doc2Vec [6] model, which was
trained by using Wikipedia articles written in Japanese, was
applied to each utterance, and a 200-dimensional vector was
obtained. All elements of the vector were used as features.
LEARNING A MULTIMODAL EMBEDDING SPACE
All the features described in the previous section were
concatenated, and each utterance was represented as a
214-dimensional vector, including 12 dimensions for wavelet
analysis, 4 dimensions for speech intensity, 4 dimensions for
head rotation, and 200 dimensions for Doc2Vec features.
<fn><label>2</label><p>One group was excluded from the analysis because its
speech audio was mistakenly not recorded.</p></fn></p>
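      <p>As an illustration of feature (8), the composite head acceleration and its per-utterance statistics could be computed along the following lines. This is only a sketch under stated assumptions: the function names and the sample readings are hypothetical, and population variance is assumed because the text does not say which variance estimator was used.</p>
      <preformat>
```python
import math
from statistics import mean, pvariance


def composite_acceleration(x, y, z):
    """Magnitude of one 3-axis acceleration sample."""
    return math.sqrt(x * x + y * y + z * z)


def utterance_stats(samples):
    """Max, min, average, and variance of the composite acceleration.

    `samples` holds the (x, y, z) readings that fall inside one utterance.
    Population variance is an assumption; the text does not name the
    estimator that was used.
    """
    magnitudes = [composite_acceleration(x, y, z) for x, y, z in samples]
    return {
        "max": max(magnitudes),
        "min": min(magnitudes),
        "average": mean(magnitudes),
        "variance": pvariance(magnitudes),
    }


# Hypothetical readings for a very short utterance (three frames).
frames = [(3.0, 4.0, 0.0), (0.0, 0.0, 5.0), (1.0, 2.0, 2.0)]
stats = utterance_stats(frames)
```
</preformat>
      <p>The four values returned for each utterance correspond to the maximum, minimum, average, and variance named in the feature description.</p>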
    </sec>
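    <sec id="sec-2">
      <title>Evaluation results</title>
      <table-wrap>
        <label>Table 1</label>
        <caption>
          <p>Evaluation results of the proposed multimodal model.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Window size</th><th>Precision</th><th>Recall</th><th>F-measure</th></tr>
          </thead>
          <tbody>
            <tr><td>4</td><td>0.52</td><td>0.58</td><td>0.55</td></tr>
            <tr><td>5</td><td>0.60</td><td>0.64</td><td>0.62</td></tr>
            <tr><td>6</td><td>0.67</td><td>0.70</td><td>0.68</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap>
        <label>Table 2</label>
        <caption>
          <p>Evaluation results of the model using only Doc2Vec features.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Window size</th><th>Precision</th><th>Recall</th><th>F-measure</th></tr>
          </thead>
          <tbody>
            <tr><td>4</td><td>0.47</td><td>0.52</td><td>0.50</td></tr>
            <tr><td>5</td><td>0.57</td><td>0.62</td><td>0.60</td></tr>
            <tr><td>6</td><td>0.65</td><td>0.70</td><td>0.67</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-7">
      <title>-</title>
      <p>We used these 214-dimensional vectors as the input to an
autoencoder.</p>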
      <p>We built an autoencoder consisting of one input layer, one
hidden layer, and one output layer. We used ReLU as the
activation function in the hidden layer and a linear function
for the output layer. Mean squared error was used as the cost
function. The 214-dimensional data in the input layer were
reduced to 150 dimensions in the hidden layer. The data from
7 of the 9 groups (4044 utterances) were used for training,
and the data from the remaining two groups (1124
utterances) were used for testing.</p>
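      <p>To make the architecture concrete, the following is a minimal sketch of an autoencoder with one ReLU hidden layer, a linear output layer, and a mean squared error cost. It is not the implementation used in this study: the class name, the weight initialization, the learning rate, plain stochastic gradient descent, and the toy 8-to-4 dimensions are all assumptions made so that the example runs with the standard library only.</p>
      <preformat>
```python
import random


def affine(weights, bias, vec):
    """Compute weights @ vec + bias for list-based matrices."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


class TinyAutoencoder:
    """One hidden ReLU layer, linear output layer, mean squared error."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                   for _ in range(n_in)]
        self.b2 = [0.0] * n_in

    def encode(self, x):
        return [max(0.0, a) for a in affine(self.w1, self.b1, x)]

    def decode(self, h):
        return affine(self.w2, self.b2, h)

    def loss(self, x):
        y = self.decode(self.encode(x))
        return sum((yi - xi) ** 2 for yi, xi in zip(y, x)) / len(x)

    def train_step(self, x, lr=0.05):
        """One stochastic gradient descent update on a single input."""
        pre = affine(self.w1, self.b1, x)
        h = [max(0.0, a) for a in pre]
        y = self.decode(h)
        n = len(x)
        g_y = [2.0 * (yi - xi) / n for yi, xi in zip(y, x)]
        # Back-propagate through the linear output layer, then the ReLU.
        g_h = [sum(self.w2[i][j] * g_y[i] for i in range(n))
               for j in range(len(h))]
        g_pre = [g if a > 0.0 else 0.0 for g, a in zip(g_h, pre)]
        for i in range(n):
            for j in range(len(h)):
                self.w2[i][j] -= lr * g_y[i] * h[j]
            self.b2[i] -= lr * g_y[i]
        for j in range(len(h)):
            for k in range(n):
                self.w1[j][k] -= lr * g_pre[j] * x[k]
            self.b1[j] -= lr * g_pre[j]


# Toy stand-in for utterance vectors: 8 inputs compressed to 4 hidden units
# (the paper reduces its utterance vectors to 150 dimensions).
rng = random.Random(1)
data = [[rng.random() for _ in range(8)] for _ in range(20)]
model = TinyAutoencoder(n_in=8, n_hidden=4)
loss_before = sum(model.loss(x) for x in data) / len(data)
for _ in range(200):
    for x in data:
        model.train_step(x)
loss_after = sum(model.loss(x) for x in data) / len(data)
```
</preformat>
      <p>Comparing loss_before with loss_after shows whether the reconstruction error on the toy data has decreased. In practice a numerical library would replace the list-based arithmetic.</p>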
      <p>ANALYSIS
The test data obtained from the two groups were used in the
following analysis. Each utterance was represented as an
output vector from the autoencoder. The cosine
similarity was then calculated between the current
utterance and each of the previous three utterances, and the
three similarity values were averaged. If this average
similarity was lower than
0.75, the current utterance was identified as a discussion
boundary.</p>
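      <p>The boundary rule described above can be sketched as follows. The function names and the toy two-dimensional vectors are hypothetical, and skipping utterances that have fewer than three predecessors is an assumption, since the text does not say how the start of a discussion was handled.</p>
      <preformat>
```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def detect_boundaries(vectors, history=3, threshold=0.75):
    """Indices whose average similarity to the previous `history` vectors is low.

    Assumption: utterances with fewer than `history` predecessors are skipped.
    """
    boundaries = []
    for i in range(history, len(vectors)):
        previous = vectors[i - history:i]
        average = sum(cosine_similarity(vectors[i], p) for p in previous) / history
        if threshold > average:
            boundaries.append(i)
    return boundaries


# Hypothetical 2-D utterance vectors: the direction changes at index 4.
utterances = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
boundaries = detect_boundaries(utterances)
```
</preformat>
      <p>On this toy input, indices 4, 5, and 6 are all flagged, because the three-utterance history still straddles the shift for two more utterances. The text does not state whether such consecutive detections were merged.</p>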
      <p>Coherence
To test the coherence of each discussion segment, we
calculated the lexical similarity between the segments. First,
a word vector was generated for each segment by extracting
nouns and verbs from the transcription. Then, the cosine
similarity was calculated for all pairs of segments. The
cosine similarity was generally very low: in more than 90%
of the pairs, it was lower than 0.2. In the
0.7% of pairs whose similarity exceeded 0.5, the content of
the two segments was quite similar (e.g.,
discussing the same place). These results suggest that each
discussion segment had sufficient lexical coherence.</p>
      <p>As a qualitative analysis, we visualized the structure of the
discussion based on the segments obtained. Figure 2
visualizes the structure of the discussion of Group2. Labels
such as G2-1 indicate discussion segments. As shown in the
diagram, the main stream of the discussion can be easily
interpreted: it starts by determining the expected travelers,
followed by discussion of the catchphrase and the
places to visit. In addition, it was possible to assign
sub-topics to some segments. For instance, the topic “Midtown”
(G2-16) has four sub-topics: “route”, “evaluation”, “staying
time”, and “reason.” This suggests that the results of
automatic segmentation are comprehensible to a human
analyst, and that such
segmentation may be useful for supporting a human facilitator.</p>
      <p>Agreement with the segmentation by a human annotator
As a preliminary analysis, we compared the result of
automatic segmentation with the segmentation by a human
annotator. In the last part of the group work, the participants
mainly worked on filling out the task sheet, and the
interaction was very different from that in the other parts. We
therefore excluded this part and used 404 utterances from
the Group2 discussion and 453 from the Group7 discussion.
The model detected 58 boundaries for the Group2 discussion
and 79 for the Group7 discussion, while the human annotator
detected 56 and 55, respectively. To allow for near misses,
we set a tolerance window of size n and judged
the model prediction to be correct if, within the window,
the model prediction and the human judgment agreed on
whether a boundary was present. With this tolerant agreement
measure, we calculated precision, recall, and F-measure.</p>
      <p>Table 1 shows the evaluation results of the proposed model,
and Table 2 shows the evaluation results of a model using only
Doc2Vec features. These two models were created to
compare a language-based model with a multimodal model.
As the average segment length in the human annotation was
7.8 utterances, we assume that a window size of n=4 (half the average
segment size) is reasonable. Although the model
performance clearly needs to be improved, the multimodal
model outperformed the language-based unimodal model for
all window sizes. As our final goal is not finding discourse
boundaries but identifying good timing for intervention,
we need to propose more appropriate evaluation metrics as
future work.</p>
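      <p>One way to compute the tolerant precision, recall, and F-measure is sketched below. The matching rule is an assumed reading of the tolerance window: a boundary counts as matched when the other annotation has a boundary at most n utterances away. The function name and the boundary positions are hypothetical, not data from this study.</p>
      <preformat>
```python
def tolerant_scores(predicted, reference, window):
    """Precision, recall, and F-measure with near-miss tolerance.

    A boundary counts as matched when the other sequence has a boundary at
    most `window` utterances away (an assumed reading of the tolerance rule).
    """
    def matched(points, others):
        return sum(1 for p in points
                   if any(window >= abs(p - o) for o in others))

    precision = matched(predicted, reference) / len(predicted) if predicted else 0.0
    recall = matched(reference, predicted) / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2.0 * precision * recall / total if total > 0.0 else 0.0
    return precision, recall, f_measure


# Hypothetical boundary positions, given as utterance indices.
model_boundaries = [10, 20, 33, 50]
human_boundaries = [11, 25]
precision, recall, f_measure = tolerant_scores(
    model_boundaries, human_boundaries, window=4)
```
</preformat>
      <p>The F-measure is the harmonic mean of precision and recall; for example, a precision of 0.52 and a recall of 0.58 give an F-measure of about 0.55.</p>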
      <p>CONCLUSION
This study proposed a method for finding discussion
boundaries based on the similarity of utterance vectors in a
multimodal embedding space created with an
autoencoder. Although its performance in the discourse
segmentation task is not yet sufficient, the proposed method
can generate segments that are comprehensible for
interpreting a discourse structure.</p>
      <p>In future work, we will test the model in terms of
determining the timing of interventions. In addition, the
model needs to be improved for tracking the topics of
discussion.</p>
      <p>ACKNOWLEDGMENTS
This work was supported by CREST, JST.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Hearst</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <article-title>Multi-paragraph segmentation of expository text</article-title>
          ,
          <source>in Proceedings of the 32nd annual meeting on Association for Computational Linguistics</source>
          .
          <year>1994</year>
          , Association for Computational Linguistics: Las Cruces, New Mexico. p.
          <fpage>9</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>F.Y.Y.</given-names>
          </string-name>
          ,
          <article-title>Advances in domain independent linear text segmentation</article-title>
          ,
          <source>in Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference</source>
          .
          <year>2000</year>
          , Association for Computational Linguistics: Seattle, Washington. p.
          <fpage>26</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Beeferman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Berger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lafferty</surname>
          </string-name>
          ,
          <article-title>Statistical Models for Text Segmentation</article-title>
          . Mach. Learn.,
          <year>1999</year>
          .
          <volume>34</volume>
          (
          <issue>1-3</issue>
          ): p.
          <fpage>177</fpage>
          -
          <lpage>210</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Galley</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , et al.,
          <article-title>Discourse segmentation of multi-party conversation</article-title>
          ,
          <source>in Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume</source>
          <volume>1</volume>
          .
          <year>2003</year>
          ,
          Association for Computational Linguistics: Sapporo, Japan
          . p.
          <fpage>562</fpage>
          -
          <lpage>569</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hsueh</surname>
            ,
            <given-names>P.-Y.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>J.D.</given-names>
            <surname>Moore</surname>
          </string-name>
          .
          <article-title>Automatic topic segmentation and labeling in multiparty dialogue</article-title>
          .
          <source>in IEEE Spoken Language Technology Workshop</source>
          .
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <article-title>Distributed representations of sentences and documents</article-title>
          ,
          <source>in Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32</source>
          .
          <year>2014</year>
          , JMLR.org: Beijing, China. p.
          <fpage>II-1188</fpage>
          -
          <lpage>II-1196</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>