Identifying Discourse Boundaries in Group Discussions using a Multimodal Embedding Space

Ken Tomiyama, Fumio Nihei, Yukiko I. Nakano, and Yutaka Takase
Seikei University, Musashino, Tokyo 180-8633, Japan
dm176207@cc.seikei.ac.jp, dd166201@st.seikei.ac.jp, y.nakano@st.seikei.ac.jp, yutaka-takase@st.seikei.ac.jp

ABSTRACT
In a group discussion, it is not always easy for the participants to effectively control the discussion and make it fruitful. With the goal of contributing to the facilitation of group discussions, this study proposes a method of segmenting a discussion. Predicted discussion boundaries may be useful for tracking the discussion topics, analyzing the discussion structure, and determining a timing for intervention. We created a multimodal embedding space using an autoencoder and represented the multimodal data of each utterance in that space. A simple unsupervised approach was then used to detect discussion boundaries. In a preliminary analysis, we found that the proposed method can generate discussion segments that are comprehensible for analyzing a discourse structure. However, the performance on the discourse segmentation task should be improved in future work.

Author Keywords
Group discussion; discourse segmentation; autoencoder; multimodal embedding.

ACM Classification Keywords
H.5.m. Information interfaces and presentation (e.g., HCI): Miscellaneous.

© 2018. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. SymCollab '18, March 11, Tokyo, Japan.

INTRODUCTION
Group discussion is widely used for decision-making and idea generation. However, it is not always easy for the participants to effectively control the discussion by themselves. A facilitator is a person who helps the participants establish common understanding and reach consensus during the conversation. To make an effective contribution, the facilitator needs to choose the right timing for intervening in the discussion while observing it. Thus, for the purpose of exploiting information technology to support a discussion, tracking the discussion is one of the basic functions required for a computer system to facilitate it.

There have been many previous studies on topic tracking and discourse segmentation, following mainly two approaches. The unsupervised approach is based on lexical cohesion, such as identical words, synonyms, and hypernyms [1, 2]; a discourse boundary is determined by the cosine similarity between word vectors. The other approach is supervised: a set of features is calculated, and a classifier is learned to decide between boundary and non-boundary [3]. While the motivation of these previous studies is to use discourse boundaries to identify more informative segments, retrieve specific information more accurately, and generate summaries of the discourse, the purpose of discourse segmentation in this study is slightly different. We aim to identify discussion boundaries, each of which marks a kind of shift in the discussion and may be an appropriate intervention timing for facilitation. Thus, each discourse segment delimited by a boundary should be a coherent discourse.

Moreover, a group discussion is less well structured than text, so discussion segmentation is likely to be more difficult than text segmentation. A discussion does not always proceed straightforwardly, and the same topic may be discussed multiple times. As more closely related work, [4, 5] proposed discourse segmentation models employing a feature-based supervised classification approach. However, feature selection is a laborious process. In this study, we instead employ an autoencoder to learn a multimodal embedding space in which each utterance is represented as a vector, so that feature selection is not necessary. We then employ an unsupervised approach that decides on discourse boundaries by calculating the cosine similarity between the vectors.

GROUP DISCUSSION CORPUS
Task and Subjects
We recruited 30 subjects (10 groups of 3 people), all native Japanese speakers. They participated in a 30-minute group discussion to create a one-day travel plan for foreign visitors. Each group cooperatively filled in a work sheet in which they described (1) the country of the expected travelers, (2) the catchphrase, (3) the details of the sightseeing course, and (4) its selling points. The participants were instructed to discuss the four themes (1) to (4) in this order. To enhance their motivation to engage in the task, they were also told that their plan would be evaluated later (e.g., by the number of sightseeing spots included in the plan).

Experimental Environment
Figure 1 shows a snapshot of the experiment.

[Figure 1. Snapshot of experiment]

Three people were seated at a table, and each of them wore a headset microphone (Audio-technica HYP-190H) to record speech data. Inertial Motion Units (IMU, ATR-Promotions WAA-010) were attached to the back of each participant's head; these sensors measured head acceleration, angular velocity, and terrestrial magnetism along the x, y, and z axes at 20 fps. A Kinect sensor placed on the opposite side of each participant was used to collect face tracking data individually (the Kinect data were not used in this work). In addition, two video cameras were set up to record an overview of the communication. Speech data were manually transcribed.

MULTIMODAL EMBEDDING SPACE
From the speech audio, we obtained 7,052 utterances (one group was excluded from the analysis because its speech audio was not recorded by mistake). For each utterance, we calculated the following verbal and nonverbal features.

Features
(1) The number of new/already used nouns: Nouns were extracted from the speech transcription using the MeCab morphological tagger. Each extracted noun was categorized as a new noun or a used noun: if the noun had already been used in the conversation, it was categorized as a used noun; if not, as a new noun. The number of new/already used nouns was counted for each utterance.

(2) The number of nouns in common/different between the current and the previous utterance: We counted the number of nouns that appeared in both the current and the previous utterance (hereafter "nouns in common"). We also counted the number of nouns that appeared in the current utterance but not in the previous one (hereafter "different nouns").

(3) The number of verbs in common/different between the current and the previous utterance: The numbers of verbs in common and of different verbs were counted in the same way as in (2).

(4) Utterance length (time duration and the number of morphemes): We used two measures of utterance length: the time duration of the utterance and the number of morphemes it contains.

(5) Utterance overlap: If a given utterance overlapped with another one, the length of the overlapping time was measured. If the utterance overlapped with two other utterances (i.e., three people were speaking at the same time), both overlapping time intervals were added up.

(6) Speech intensity: Speech intensity (dB) was measured every 10 ms using the Praat audio analysis tool, and the maximum, minimum, average, and variance were calculated for each utterance.

(7) Head rotation: Head rotation around the y-axis was measured every 20 ms from the Kinect face tracking data, and the maximum, minimum, average, and variance were calculated for each utterance.

(8) Composite head acceleration: The IMUs attached to the back of each participant's head measured acceleration at 20 frames per second (fps). The composite acceleration over the x, y, and z axes was computed for each time frame i as

$HA_i = \sqrt{x_i^2 + y_i^2 + z_i^2}$

Then, the maximum, minimum, average, and variance were computed for each participant per utterance.

(9) Wavelet features for the composite head acceleration: This feature is used for measuring the synchrony of the head motions between discussion participants. Multiresolution analysis with Daubechies wavelets was applied to the composite acceleration calculated in (8), and the maximum, minimum, average, and variance were computed for the wavelet coefficients at the highest resolution.

(10) Doc2Vec features: A Doc2Vec [6] model, trained on Wikipedia articles written in Japanese, was applied to each utterance to obtain a 200-dimensional vector. All elements of this vector were used as features. A sketch of some of these feature computations is given below.
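To make the feature computations concrete, the following is a minimal sketch of features (1)-(2) and (8)-(9). It assumes the mecab-python3 binding for morphological analysis and NumPy plus PyWavelets for the signal features; the function names, the array layout, and the wavelet order ("db2") are illustrative assumptions, not details given in the paper.

```python
# Sketch of features (1)-(2) (noun counts) and (8)-(9) (composite head
# acceleration and its wavelet statistics). mecab-python3, NumPy, and
# PyWavelets are assumed; names and the "db2" order are illustrative.
import MeCab
import numpy as np
import pywt

tagger = MeCab.Tagger()

def extract_nouns(utterance: str) -> list[str]:
    """Surface forms of nouns found by MeCab (first POS field == 名詞)."""
    nouns, node = [], tagger.parseToNode(utterance)
    while node:
        if node.feature.split(",")[0] == "名詞":
            nouns.append(node.surface)
        node = node.next
    return nouns

def noun_features(utterances: list[str]) -> list[dict]:
    """Per utterance: counts of new vs. already used nouns, and of nouns
    in common with / different from the previous utterance."""
    seen, prev, feats = set(), set(), []
    for utt in utterances:
        nouns = extract_nouns(utt)
        feats.append({
            "new": sum(n not in seen for n in nouns),
            "used": sum(n in seen for n in nouns),
            "common": sum(n in prev for n in nouns),
            "different": sum(n not in prev for n in nouns),
        })
        seen.update(nouns)   # nouns used so far in the conversation
        prev = set(nouns)    # nouns of the new "previous" utterance
    return feats

def stats(x: np.ndarray) -> list[float]:
    """The four per-utterance statistics used throughout the paper."""
    return [float(x.max()), float(x.min()), float(x.mean()), float(x.var())]

def head_acceleration_features(acc: np.ndarray) -> list[float]:
    """acc: (n_frames, 3) x/y/z head acceleration of one participant
    during one utterance, sampled at 20 fps."""
    ha = np.sqrt((acc ** 2).sum(axis=1))  # HA_i = sqrt(x_i^2 + y_i^2 + z_i^2)
    feats = stats(ha)                     # feature (8)
    coeffs = pywt.wavedec(ha, "db2")      # multiresolution analysis
    feats += stats(coeffs[-1])            # feature (9): finest-scale details
    return feats
```

Feature (3) would follow the same pattern as noun_features, with verbs in place of nouns.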
LEARNING A MULTIMODAL EMBEDDING SPACE
All the features described in the previous section were concatenated, so that each utterance was represented as a 214-dimensional vector, including 12 dimensions for the wavelet analysis, 4 dimensions for speech intensity, 4 dimensions for head rotation, and 200 dimensions for the Doc2Vec features. We used these 214-dimensional vectors as the input to an autoencoder.

We built an autoencoder consisting of one input layer, one hidden layer, and one output layer. ReLU was used as the activation function in the hidden layer, and a linear function in the output layer; mean squared error was used as the cost function. The 214-dimensional data in the input layer were reduced to 150 dimensions in the hidden layer. The data from 7 of the 9 groups (4,044 utterances) were used for training, and the data from the remaining two groups (1,124 utterances) were used for testing. A sketch of this setup follows.
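The architecture is simple enough to sketch directly. Below is a minimal Keras version, offered only as one possible implementation: the paper does not name a framework, and the optimizer and training schedule here are our assumptions.

```python
# Minimal sketch of the autoencoder: 214-d input, 150-d ReLU hidden
# layer, linear 214-d output, mean-squared-error loss. Keras, the Adam
# optimizer, and the epoch/batch settings are assumptions of this sketch.
import numpy as np
from tensorflow import keras

INPUT_DIM, HIDDEN_DIM = 214, 150

autoencoder = keras.Sequential([
    keras.layers.Dense(HIDDEN_DIM, activation="relu",
                       input_shape=(INPUT_DIM,)),         # hidden layer
    keras.layers.Dense(INPUT_DIM, activation="linear"),   # reconstruction
])
autoencoder.compile(optimizer="adam", loss="mse")

# X_train would be the (4044, 214) matrix from the 7 training groups;
# a random placeholder keeps the sketch self-contained and runnable.
X_train = np.random.rand(4044, INPUT_DIM).astype("float32")
autoencoder.fit(X_train, X_train, epochs=50, batch_size=32, verbose=0)

# As described in the Analysis section, each utterance is represented
# by the autoencoder's *output* vector rather than by the hidden code.
utterance_vectors = autoencoder.predict(X_train)
```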
ANALYSIS
The test data obtained from the two held-out groups were used in the following analysis. Each utterance was represented as an output vector of the autoencoder. Cosine similarity values were then calculated by pairing the current utterance with each of the previous three utterances, and the average of the three similarity values was computed. If the average similarity with the three most recent utterances was lower than 0.75, the current utterance was identified as a discussion boundary, as sketched below.
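Rendered as code, the decision rule looks as follows; only the three-utterance context and the 0.75 threshold come from the paper, while the array layout is assumed from the previous section.

```python
# Sketch of the unsupervised boundary rule: an utterance opens a new
# segment when its average cosine similarity with the previous three
# utterances falls below 0.75 (both constants are from the paper).
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def detect_boundaries(vectors: np.ndarray, threshold: float = 0.75) -> list[int]:
    """vectors: (n_utterances, d) autoencoder output vectors.
    Returns the indices of utterances identified as boundaries."""
    boundaries = []
    for i in range(3, len(vectors)):
        avg_sim = np.mean([cosine(vectors[i], vectors[i - k]) for k in (1, 2, 3)])
        if avg_sim < threshold:
            boundaries.append(i)
    return boundaries
```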
Coherence
In order to test the coherence of each discussion segment, we calculated the lexical similarity between segments. First, a word vector was generated for each segment by extracting the nouns and verbs from its transcription. Then, the cosine similarity was calculated for all pairs of segments. The cosine similarity was generally very low: in more than 90% of the pairs, it was below 0.2. In the 0.7% of pairs where the similarity exceeded 0.5, the content of the two segments was in fact quite similar (e.g., discussing the same place). These results suggest that each discussion segment had sufficient lexical coherence.

As a qualitative analysis, we visualized the structure of the discussion based on the segments obtained. Figure 2 visualizes the structure of a discussion of Group 2; labels such as G2-1 indicate discussion segments.

[Figure 2. Interpretation of the structure of a discussion. The diagram organizes segments G2-1 through G2-27 under the topics "Expected traveler," "Catchphrase," and "Visiting spots" (Tsukiji, Odaiba & Skytree, Midtown), with sub-topics such as staying time, reason, route, and evaluation.]

As shown in the diagram, the main stream of the discussion can be easily interpreted: it starts with determining the expected travelers, followed by the discussion of the catchphrase and the visiting spots. In addition, it was also possible to assign sub-topics to some segments. For instance, the topic "Midtown" (G2-16) has four sub-topics: "route," "evaluation," "staying time," and "reason." This suggests that the results of automatic segmentation are comprehensible for a human analyzer, and that such segmentation has a good chance of being useful for supporting a human facilitator.

Agreement with the segmentation by a human annotator
As a preliminary analysis, we compared the result of automatic segmentation with a segmentation produced by a human annotator. In the last part of the group work, the participants mainly filled out the task sheet, and the interaction was very different from the other parts, so we excluded this part from the data. In total, we used 404 utterances from the Group 2 discussion and 453 from the Group 7 discussion. The model detected 58 boundaries for the Group 2 discussion and 79 for the Group 7 discussion, while the human annotator detected 56 and 55, respectively.

In order to permit near-miss judgments, we set a tolerance window of size n and judged a model prediction to be correct if, within the window, the model prediction and the human judgment agreed on the presence or absence of a boundary (one plausible implementation of this scoring is sketched at the end of this section). With this tolerant agreement measure, we calculated precision, recall, and F-measure. Table 1 shows the evaluation results of the proposed model, and Table 2 shows the results of a model using only the Doc2Vec features; the two models were created to compare a language-based model with a multimodal model.

Table 1. Agreement of discussion boundary judgment with a human annotator (autoencoder).
Window size   Precision   Recall   F-measure
4             0.52        0.58     0.55
5             0.60        0.64     0.62
6             0.67        0.70     0.68

Table 2. Agreement of discussion boundary judgment with a human annotator (Doc2Vec only).
Window size   Precision   Recall   F-measure
4             0.47        0.52     0.50
5             0.57        0.62     0.60
6             0.65        0.70     0.67

As the average segment length in the human annotation was 7.8 utterances, we assume that a window size of n = 4 (half the average segment size) is reasonable. Although the model's performance should definitely be improved, the multimodal model outperformed the language-based unimodal model at all window sizes. As our final goal is not finding discourse boundaries per se but identifying a good timing for intervention, we need to propose more appropriate evaluation metrics in future work.
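The tolerance-window measure described above admits more than one reading; the sketch below is one plausible interpretation in which a predicted boundary counts as correct when an annotated boundary lies within n utterances, and recall is computed symmetrically.

```python
# One plausible reading of the tolerance-window agreement measure:
# a predicted boundary is correct if an annotated boundary falls within
# a window of size n, and symmetrically for recall. The exact pairing
# scheme is our assumption; the window sizes (4-6) come from the paper.
def windowed_prf(predicted: list[int], annotated: list[int], n: int):
    correct_pred = sum(any(abs(p - a) <= n for a in annotated) for p in predicted)
    correct_ann = sum(any(abs(a - p) <= n for p in predicted) for a in annotated)
    precision = correct_pred / len(predicted) if predicted else 0.0
    recall = correct_ann / len(annotated) if annotated else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Example with n = 4 (half the average human-annotated segment length):
# precision, recall, f = windowed_prf(model_boundaries, human_boundaries, 4)
```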
CONCLUSION
This study proposed a method for finding discussion boundaries based on the similarity of utterance vectors in a multimodal embedding space created using an autoencoder. Although the performance on the discourse segmentation task is not yet good enough, the proposed method can generate segments that are comprehensible for interpreting a discourse structure. As future directions, we will test the model in terms of determining a timing for intervention. In addition, the model needs to be improved with respect to tracking the topics of the discussion.

ACKNOWLEDGMENTS
This work was supported by CREST, JST.

REFERENCES
1. Hearst, M.A. Multi-paragraph segmentation of expository text. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, New Mexico, 1994, 9-16.
2. Choi, F.Y.Y. Advances in domain independent linear text segmentation. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, Seattle, Washington, 2000, 26-33.
3. Beeferman, D., Berger, A., and Lafferty, J. Statistical models for text segmentation. Machine Learning 34(1-3), 1999, 177-210.
4. Galley, M., et al. Discourse segmentation of multi-party conversation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, 2003, 562-569.
5. Hsueh, P.-Y. and Moore, J.D. Automatic topic segmentation and labeling in multiparty dialogue. In IEEE Spoken Language Technology Workshop, 2006.
6. Le, Q. and Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014, 1188-1196.