Identifying Discourse Boundaries in Group Discussions using a Multimodal Embedding Space

Ken Tomiyama, Fumio Nihei, Yukiko I. Nakano, and Yutaka Takase
Seikei University, Musashino, Tokyo 180-8633, Japan
dm176207@cc.seikei.ac.jp, dd166201@st.seikei.ac.jp, y.nakano@st.seikei.ac.jp, yutaka-takase@st.seikei.ac.jp

ABSTRACT
In a group discussion, it is not always easy for the participants to effectively control the discussion and make it fruitful. With the goal of contributing to the facilitation of group discussions, this study proposes a method of segmenting a discussion. Predicted discussion boundaries may be useful for tracking the discussion topics, analyzing the discussion structure, and determining a timing for intervention. We created a multimodal embedding space using an autoencoder and represented the multimodal data of each utterance in that space. A simple unsupervised approach was then used to detect discussion boundaries. In a preliminary analysis, we found that the proposed method can generate discussion segments that are comprehensible for analyzing a discourse structure. However, the performance on the discourse segmentation task should be improved in future work.

Author Keywords
Group discussion; discourse segmentation; autoencoder; multimodal embedding.

ACM Classification Keywords
H.5.m. Information interfaces and presentation (e.g., HCI): Miscellaneous.

© 2018. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. SymCollab '18, March 11, Tokyo, Japan.

INTRODUCTION
Group discussion is widely used for decision-making and idea generation. However, it is not always easy for the participants to effectively control the discussion by themselves. A facilitator is a person who helps the participants establish common understanding and reach consensus during the conversation. To make an effective contribution, the facilitator needs to choose the right timing for intervening in the discussion while observing it. Thus, for the purpose of exploiting information technology to support a discussion, tracking the discussion is one of the basic functions required for a computer system to facilitate it.

There have been many previous studies on topic tracking and discourse segmentation, following mainly two approaches. The unsupervised approach is based on lexical cohesion, such as identical words, synonyms, and hypernyms [1, 2]; a discourse boundary is determined by the cosine similarity between word vectors. The other approach is supervised: a set of features is calculated, and a classifier is learned to decide between boundary and non-boundary [3]. While the motivation of these previous studies is to use discourse boundaries to identify more informative segments, retrieve specific information more accurately, and generate summaries of the discourse, the purpose of discourse segmentation in this study is slightly different. We aim to identify discussion boundaries, each of which marks a kind of shift in the discussion and may be an appropriate intervention timing for facilitation. Thus, each discourse segment delimited by a boundary should be a coherent discourse.

Moreover, a group discussion is less well structured than text, so discussion segmentation is likely to be more difficult than text segmentation. A discussion does not always proceed straightforwardly, and the same topic may be discussed multiple times. As more closely related work, [4, 5] proposed discourse segmentation models employing a feature-based supervised classification approach. However, feature selection is a laborious process. In this study, we instead employ an autoencoder to learn a multimodal embedding space in which each utterance is represented as a vector, so that feature selection is not necessary. We then employ an unsupervised approach that decides on discourse boundaries by calculating the cosine similarity between the vectors.

GROUP DISCUSSION CORPUS
Task and Subjects
We recruited 30 subjects (10 groups of 3 people), all native Japanese speakers. They participated in a 30-minute group discussion to create a one-day travel plan for foreign visitors. Each group cooperatively filled in a work sheet in which they described (1) the country of the expected travelers, (2) the catchphrase, (3) the details of the sightseeing course, and (4) its selling points. The participants were instructed to discuss the four themes (1) to (4) in this order. To enhance their motivation to engage in the task, they were also told that their plan would be evaluated later (e.g., by the number of sightseeing spots included in the plan).

Experimental Environment
Figure 1 shows a snapshot of the experiment.

[Figure 1. Snapshot of experiment]

Three people were seated at a table, and each of them wore a headset microphone (Audio-technica HYP-190H) to record speech data. Inertial Motion Units (IMU, ATR-Promotions WAA-010) were attached to the back of each participant's head; these sensors measured head acceleration, angular velocity, and terrestrial magnetism along the x, y, and z axes at 20 fps. A Kinect sensor placed on the opposite side of each participant was used to collect face tracking data individually (the Kinect data were not used in this work). In addition, two video cameras were set up to record an overview of the communication. Speech data were manually transcribed.

MULTIMODAL EMBEDDING SPACE
From the speech audio, we obtained 7,052 utterances (one group was excluded from the analysis because its speech audio was not recorded by mistake). For each utterance, we calculated the following verbal and nonverbal features.

Features
(1) The number of new/already used nouns: Nouns were extracted from the speech transcription using the MeCab morphological tagger. Each extracted noun was categorized as a new noun or a used noun: if the noun had already been used in the conversation, it was categorized as a used noun; if not, as a new noun. The number of new/already used nouns was counted for each utterance.

(2) The number of nouns in common/different between the current and the previous utterance: We counted the number of nouns that appeared in both the current and the previous utterance (hereafter "nouns in common"). We also counted the number of nouns that appeared in the current utterance but not in the previous one (hereafter "different nouns").

(3) The number of verbs in common/different between the current and the previous utterance: The numbers of verbs in common and of different verbs were counted in the same way as in (2).

(4) Utterance length (time duration and the number of morphemes): We used two measures of utterance length: the time duration of the utterance and the number of morphemes it contains.

(5) Utterance overlap: If a given utterance overlapped with another one, the length of the overlapping time was measured. If the utterance overlapped with two other utterances (i.e., three people were speaking at the same time), both overlapping time intervals were added up.

(6) Speech intensity: Speech intensity (dB) was measured every 10 ms using the Praat audio analysis tool, and the maximum, minimum, average, and variance were calculated for each utterance.

(7) Head rotation: Head rotation around the y-axis was measured every 20 ms from the Kinect face tracking data, and the maximum, minimum, average, and variance were calculated for each utterance.

(8) Composite head acceleration: The IMUs attached to the back of each participant's head measured acceleration at 20 frames per second (fps). The composite acceleration over the x, y, and z axes was computed for each time frame i as

$HA_i = \sqrt{x_i^2 + y_i^2 + z_i^2}$

Then, the maximum, minimum, average, and variance were computed for each participant per utterance.

(9) Wavelet features for the composite head acceleration: This feature is used for measuring the synchrony of the head motions between discussion participants. Multiresolution analysis with Daubechies wavelets was applied to the composite acceleration calculated in (8), and the maximum, minimum, average, and variance were computed for the wavelet coefficients at the highest resolution.

(10) Doc2Vec features: A Doc2Vec [6] model, trained on Wikipedia articles written in Japanese, was applied to each utterance to obtain a 200-dimensional vector. All elements of this vector were used as features. A sketch of some of these feature computations is given below.
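To make the feature computations concrete, the following is a minimal sketch of features (1)-(2) and (8)-(9). It assumes the mecab-python3 binding for morphological analysis and NumPy plus PyWavelets for the signal features; the function names, the array layout, and the wavelet order ("db2") are illustrative assumptions, not details given in the paper.

```python
# Sketch of features (1)-(2) (noun counts) and (8)-(9) (composite head
# acceleration and its wavelet statistics). mecab-python3, NumPy, and
# PyWavelets are assumed; names and the "db2" order are illustrative.
import MeCab
import numpy as np
import pywt

tagger = MeCab.Tagger()

def extract_nouns(utterance: str) -> list[str]:
    """Surface forms of nouns found by MeCab (first POS field == 名詞)."""
    nouns, node = [], tagger.parseToNode(utterance)
    while node:
        if node.feature.split(",")[0] == "名詞":
            nouns.append(node.surface)
        node = node.next
    return nouns

def noun_features(utterances: list[str]) -> list[dict]:
    """Per utterance: counts of new vs. already used nouns, and of nouns
    in common with / different from the previous utterance."""
    seen, prev, feats = set(), set(), []
    for utt in utterances:
        nouns = extract_nouns(utt)
        feats.append({
            "new": sum(n not in seen for n in nouns),
            "used": sum(n in seen for n in nouns),
            "common": sum(n in prev for n in nouns),
            "different": sum(n not in prev for n in nouns),
        })
        seen.update(nouns)   # nouns used so far in the conversation
        prev = set(nouns)    # nouns of the new "previous" utterance
    return feats

def stats(x: np.ndarray) -> list[float]:
    """The four per-utterance statistics used throughout the paper."""
    return [float(x.max()), float(x.min()), float(x.mean()), float(x.var())]

def head_acceleration_features(acc: np.ndarray) -> list[float]:
    """acc: (n_frames, 3) x/y/z head acceleration of one participant
    during one utterance, sampled at 20 fps."""
    ha = np.sqrt((acc ** 2).sum(axis=1))  # HA_i = sqrt(x_i^2 + y_i^2 + z_i^2)
    feats = stats(ha)                     # feature (8)
    coeffs = pywt.wavedec(ha, "db2")      # multiresolution analysis
    feats += stats(coeffs[-1])            # feature (9): finest-scale details
    return feats
```

Feature (3) would follow the same pattern as noun_features, with verbs in place of nouns.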
LEARNING A MULTIMODAL EMBEDDING SPACE
All the features described in the previous section were concatenated, so that each utterance was represented as a 214-dimensional vector, including 12 dimensions for the wavelet analysis, 4 dimensions for speech intensity, 4 dimensions for head rotation, and 200 dimensions for the Doc2Vec features. We used these 214-dimensional vectors as the input to an autoencoder.

We built an autoencoder consisting of one input layer, one hidden layer, and one output layer. ReLU was used as the activation function in the hidden layer, and a linear function in the output layer; mean squared error was used as the cost function. The 214-dimensional data in the input layer were reduced to 150 dimensions in the hidden layer. The data from 7 of the 9 groups (4,044 utterances) were used for training, and the data from the remaining two groups (1,124 utterances) were used for testing. A sketch of this setup follows.
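The architecture is simple enough to sketch directly. Below is a minimal Keras version, offered only as one possible implementation: the paper does not name a framework, and the optimizer and training schedule here are our assumptions.

```python
# Minimal sketch of the autoencoder: 214-d input, 150-d ReLU hidden
# layer, linear 214-d output, mean-squared-error loss. Keras, the Adam
# optimizer, and the epoch/batch settings are assumptions of this sketch.
import numpy as np
from tensorflow import keras

INPUT_DIM, HIDDEN_DIM = 214, 150

autoencoder = keras.Sequential([
    keras.layers.Dense(HIDDEN_DIM, activation="relu",
                       input_shape=(INPUT_DIM,)),         # hidden layer
    keras.layers.Dense(INPUT_DIM, activation="linear"),   # reconstruction
])
autoencoder.compile(optimizer="adam", loss="mse")

# X_train would be the (4044, 214) matrix from the 7 training groups;
# a random placeholder keeps the sketch self-contained and runnable.
X_train = np.random.rand(4044, INPUT_DIM).astype("float32")
autoencoder.fit(X_train, X_train, epochs=50, batch_size=32, verbose=0)

# As described in the Analysis section, each utterance is represented
# by the autoencoder's *output* vector rather than by the hidden code.
utterance_vectors = autoencoder.predict(X_train)
```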
ANALYSIS
The test data obtained from the two held-out groups were used in the following analysis. Each utterance was represented as an output vector of the autoencoder. Cosine similarity values were then calculated by pairing the current utterance with each of the previous three utterances, and the average of the three similarity values was computed. If the average similarity with the three most recent utterances was lower than 0.75, the current utterance was identified as a discussion boundary, as sketched below.
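Rendered as code, the decision rule looks as follows; only the three-utterance context and the 0.75 threshold come from the paper, while the array layout is assumed from the previous section.

```python
# Sketch of the unsupervised boundary rule: an utterance opens a new
# segment when its average cosine similarity with the previous three
# utterances falls below 0.75 (both constants are from the paper).
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def detect_boundaries(vectors: np.ndarray, threshold: float = 0.75) -> list[int]:
    """vectors: (n_utterances, d) autoencoder output vectors.
    Returns the indices of utterances identified as boundaries."""
    boundaries = []
    for i in range(3, len(vectors)):
        avg_sim = np.mean([cosine(vectors[i], vectors[i - k]) for k in (1, 2, 3)])
        if avg_sim < threshold:
            boundaries.append(i)
    return boundaries
```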
Coherence
In order to test the coherence of each discussion segment, we calculated the lexical similarity between segments. First, a word vector was generated for each segment by extracting the nouns and verbs from its transcription. Then, the cosine similarity was calculated for all pairs of segments. The cosine similarity was generally very low: in more than 90% of the pairs, it was below 0.2. In the 0.7% of pairs where the similarity exceeded 0.5, the content of the two segments was in fact quite similar (e.g., discussing the same place). These results suggest that each discussion segment had sufficient lexical coherence.

As a qualitative analysis, we visualized the structure of the discussion based on the segments obtained. Figure 2 visualizes the structure of a discussion of Group 2; labels such as G2-1 indicate discussion segments.

[Figure 2. Interpretation of the structure of a discussion. The diagram organizes segments G2-1 through G2-27 under the topics "Expected traveler," "Catchphrase," and "Visiting spots" (Tsukiji, Odaiba & Skytree, Midtown), with sub-topics such as staying time, reason, route, and evaluation.]

As shown in the diagram, the main stream of the discussion can be easily interpreted: it starts with determining the expected travelers, followed by the discussion of the catchphrase and the visiting spots. In addition, it was also possible to assign sub-topics to some segments. For instance, the topic "Midtown" (G2-16) has four sub-topics: "route," "evaluation," "staying time," and "reason." This suggests that the results of automatic segmentation are comprehensible for a human analyzer, and that such segmentation has a good chance of being useful for supporting a human facilitator.

Agreement with the segmentation by a human annotator
As a preliminary analysis, we compared the result of automatic segmentation with a segmentation produced by a human annotator. In the last part of the group work, the participants mainly filled out the task sheet, and the interaction was very different from the other parts, so we excluded this part from the data. In total, we used 404 utterances from the Group 2 discussion and 453 from the Group 7 discussion. The model detected 58 boundaries for the Group 2 discussion and 79 for the Group 7 discussion, while the human annotator detected 56 and 55, respectively.

In order to permit near-miss judgments, we set a tolerance window of size n and judged a model prediction to be correct if, within the window, the model prediction and the human judgment agreed on the presence or absence of a boundary (one plausible implementation of this scoring is sketched at the end of this section). With this tolerant agreement measure, we calculated precision, recall, and F-measure. Table 1 shows the evaluation results of the proposed model, and Table 2 shows the results of a model using only the Doc2Vec features; the two models were created to compare a language-based model with a multimodal model.

Table 1. Agreement of discussion boundary judgment with a human annotator (autoencoder).
Window size   Precision   Recall   F-measure
4             0.52        0.58     0.55
5             0.60        0.64     0.62
6             0.67        0.70     0.68

Table 2. Agreement of discussion boundary judgment with a human annotator (Doc2Vec only).
Window size   Precision   Recall   F-measure
4             0.47        0.52     0.50
5             0.57        0.62     0.60
6             0.65        0.70     0.67

As the average segment length in the human annotation was 7.8 utterances, we assume that a window size of n = 4 (half the average segment size) is reasonable. Although the model's performance should definitely be improved, the multimodal model outperformed the language-based unimodal model at all window sizes. As our final goal is not finding discourse boundaries per se but identifying a good timing for intervention, we need to propose more appropriate evaluation metrics in future work.
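The tolerance-window measure described above admits more than one reading; the sketch below is one plausible interpretation in which a predicted boundary counts as correct when an annotated boundary lies within n utterances, and recall is computed symmetrically.

```python
# One plausible reading of the tolerance-window agreement measure:
# a predicted boundary is correct if an annotated boundary falls within
# a window of size n, and symmetrically for recall. The exact pairing
# scheme is our assumption; the window sizes (4-6) come from the paper.
def windowed_prf(predicted: list[int], annotated: list[int], n: int):
    correct_pred = sum(any(abs(p - a) <= n for a in annotated) for p in predicted)
    correct_ann = sum(any(abs(a - p) <= n for p in predicted) for a in annotated)
    precision = correct_pred / len(predicted) if predicted else 0.0
    recall = correct_ann / len(annotated) if annotated else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Example with n = 4 (half the average human-annotated segment length):
# precision, recall, f = windowed_prf(model_boundaries, human_boundaries, 4)
```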
CONCLUSION
This study proposed a method for finding discussion boundaries based on the similarity of utterance vectors in a multimodal embedding space created using an autoencoder. Although the performance on the discourse segmentation task is not yet good enough, the proposed method can generate segments that are comprehensible for interpreting a discourse structure. As future directions, we will test the model in terms of determining a timing for intervention. In addition, the model needs to be improved with respect to tracking the topics of the discussion.

ACKNOWLEDGMENTS
This work was supported by CREST, JST.

REFERENCES
1. Hearst, M.A. Multi-paragraph segmentation of expository text. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, New Mexico, 1994, 9-16.
2. Choi, F.Y.Y. Advances in domain independent linear text segmentation. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, Seattle, Washington, 2000, 26-33.
3. Beeferman, D., Berger, A., and Lafferty, J. Statistical models for text segmentation. Machine Learning 34(1-3), 1999, 177-210.
4. Galley, M., et al. Discourse segmentation of multi-party conversation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, 2003, 562-569.
5. Hsueh, P.-Y. and Moore, J.D. Automatic topic segmentation and labeling in multiparty dialogue. In IEEE Spoken Language Technology Workshop, 2006.
6. Le, Q. and Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014, 1188-1196.