<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Multimodal Fusion of Body Movement Signals for No-audio Speech Detection</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Xinsheng</forename><surname>Wang</surname></persName>
							<email>wangxinsheng@stu.xjtu.edu.cn</email>
							<affiliation key="aff0">
								<orgName type="department">School of Software Engineering</orgName>
								<orgName type="institution">Xi&apos;an Jiaotong University</orgName>
								<address>
									<settlement>Xi&apos;an</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="laboratory">Multimedia Computing Group</orgName>
								<orgName type="institution">Delft University of Technology</orgName>
								<address>
									<settlement>Delft</settlement>
									<country key="NL">The Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jihua</forename><surname>Zhu</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">School of Software Engineering</orgName>
								<orgName type="institution">Xi&apos;an Jiaotong University</orgName>
								<address>
									<settlement>Xi&apos;an</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Odette</forename><surname>Scharenborg</surname></persName>
							<email>o.e.scharenborg@tudelft.nl</email>
							<affiliation key="aff1">
								<orgName type="laboratory">Multimedia Computing Group</orgName>
								<orgName type="institution">Delft University of Technology</orgName>
								<address>
									<settlement>Delft</settlement>
									<country key="NL">The Netherlands</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Multimodal Fusion of Body Movement Signals for No-audio Speech Detection</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">0674E1F129DB1A12022413BED912E740</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T07:12+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>No-audio Multimodal Speech Detection is one of the tasks in Media-Eval 2020, with the goal to automatically detect whether someone is speaking in social interaction on the basis of body movement signals. In this paper, a multimodal fusion method, combining signals obtained by an overhead camera and a wearable accelerometer, was proposed to determine whether someone was speaking. The proposed system directly takes the accelerometer signals as input, while using a pre-trained 3D convolutional network to extract the video features that work as input. Experiments on the No-audio Multimodal Speech Detection task show that our method outperforms all submissions of previous years.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>There is a close relationship between body movements, e.g., gesturing, and speaking status, i.e., whether someone is speaking or not. This relationship may make it possible to determine whether a person is speaking by analyzing that person's body movements. The No-Audio Multimodal Speech Detection task of MediaEval 2020 focuses on determining the speaking status of standing subjects in crowded mingling scenarios using information recorded by an overhead camera and a single body-worn triaxial accelerometer hung around the neck of each subject <ref type="bibr" target="#b0">[1]</ref>. In this paper, we fuse the signals from these two modalities to perform the no-audio speech detection task. The details of the proposed approach are described in the following section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">APPROACH</head><p>The architecture of the proposed method is shown in Fig. <ref type="figure" target="#fig_0">1</ref>. The proposed model consists of three parts: AccelNet for the accelerometer input, VideoNet for the video input, and a fusion part for combining the two modalities. Following the requirements of this task, AccelNet and VideoNet are also designed to predict the speaking status individually.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Data processing</head><p>In the provided database, video and accelerometer data were recorded for a duration of 22 minutes at 20 Hz. For training, we segmented the video and accelerometer data into 11 segments, each with a duration of 2 minutes, resulting in video segments of 2400 frames and accelerometer segments of size 3 × 2400. The code of the proposed method can be found at: https://github.com/xinshengwang/No-audio-speech-detection</p><p>Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'20, December 14-15 2020, Online.</p></div>
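The segmentation arithmetic above can be sketched as follows (a hypothetical helper written for illustration, not the authors' released code): 22 minutes at 20 Hz gives 26400 frames, which split evenly into 11 two-minute segments of 2400 frames each.

```python
# Sketch of the segmentation step: 22 minutes at 20 Hz are cut into
# 11 two-minute segments of 2400 frames each; the accelerometer stream
# is triaxial, so each of its segments has size 3 x 2400.

FPS = 20                      # sampling rate of both modalities (Hz)
TOTAL_FRAMES = 22 * 60 * FPS  # 22 minutes of recording -> 26400 frames
SEG_FRAMES = 2 * 60 * FPS     # one 2-minute segment -> 2400 frames

def segment(frames, seg_len=SEG_FRAMES):
    """Split a flat sequence of frames into consecutive fixed-length segments."""
    return [frames[i:i + seg_len] for i in range(0, len(frames), seg_len)]

video = list(range(TOTAL_FRAMES))          # stand-in for 26400 video frames
accel = [(0.0, 0.0, 0.0)] * TOTAL_FRAMES   # stand-in for triaxial samples

video_segs = segment(video)
accel_segs = segment(accel)
print(len(video_segs), len(video_segs[0]))  # 11 2400
```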
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">AccelNet</head><p>As shown in Fig. <ref type="figure" target="#fig_0">1</ref>, AccelNet consists of three 1-D convolution layers and a bi-directional GRU layer. A batch normalization layer is adopted between every two adjacent convolutional layers. The three convolution layers have kernel sizes of 5, 3, and 3, and stride sizes of 5, 2, and 2, respectively, resulting in features with a receptive field of 23 frames, which roughly matches the sampling rate of 20 Hz. We can therefore assume that each of the 120 frames output by the last convolutional layer, each with a dimension of 256, represents the movement status within about one second. Intuitively, the speaking status at one moment is related to the several preceding and following time steps; to capture this relationship, a bi-directional GRU with 256 units is adopted after the last 1-D convolutional layer.</p><p>By concatenating the features of the two directions at each time step, the bi-directional GRU produces a 512-dimensional feature with a sequence length of 120. This feature is then concatenated with the video feature to perform the multimodal speech detection task. To let AccelNet detect speaking status on the basis of the accelerometer data alone, a linear transformation followed by a sigmoid layer can be added after the bi-directional GRU.</p></div>
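The temporal downsampling of AccelNet can be verified with the standard 1-D convolution output-length formula. The kernel and stride sizes below come from the text; the padding of 1 on the two kernel-size-3 layers is our assumption, chosen so that a 2400-frame input yields exactly 120 output frames (one per second).

```python
# Back-of-the-envelope check of AccelNet's temporal dimensions (a sketch,
# not the authors' code). Kernels 5/3/3 and strides 5/2/2 are from the
# paper; padding=1 on the two k=3 layers is an assumption.

def conv1d_out_len(length, kernel, stride, padding=0):
    """Standard 1-D convolution output length."""
    return (length + 2 * padding - kernel) // stride + 1

layers = [(5, 5, 0), (3, 2, 1), (3, 2, 1)]  # (kernel, stride, padding)

length = 2400  # frames in one 2-minute segment at 20 Hz
for kernel, stride, padding in layers:
    length = conv1d_out_len(length, kernel, stride, padding)
print(length)  # 120, i.e. one 256-d feature per second of input
```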
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">VideoNet</head><p>The C3D model <ref type="bibr" target="#b6">[7]</ref> pre-trained on Sports-1M <ref type="bibr" target="#b3">[4]</ref> is adopted to extract the video features. The video was recorded at 20 Hz, while the C3D model uses only 16 consecutive frames as context to obtain the 3D convolutional features. In practice, we dropped the last 4 frames of each second of video, so that C3D extracts one video feature per second, resulting in 120 feature vectors with a dimension of 512 for each 2-minute video segment. The C3D features go through a bi-directional GRU with 256 units before being fused with the accelerometer features.</p><p>Similar to AccelNet, the output of VideoNet can also be used for unimodal speech detection.</p></div>
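The frame-dropping step can be sketched as follows (our reconstruction for illustration, not the authors' code): at 20 fps, keeping only the first 16 frames of every second produces exactly one C3D input clip per second, so a 2-minute segment yields 120 clips.

```python
# Sketch of the frame selection that feeds C3D: drop the last 4 of every
# 20 frames so each second of video becomes one 16-frame C3D clip.

FPS = 20
C3D_CLIP_LEN = 16  # frames per C3D input clip

def clips_for_c3d(frames, fps=FPS, clip_len=C3D_CLIP_LEN):
    """Group frames by second and keep the first clip_len frames of each."""
    seconds = [frames[i:i + fps] for i in range(0, len(frames), fps)]
    return [sec[:clip_len] for sec in seconds]

segment = list(range(2400))       # stand-in for one 2-minute video segment
clips = clips_for_c3d(segment)
print(len(clips), len(clips[0]))  # 120 16
```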
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Fusion and objective function</head><p>An early fusion strategy is adopted in this paper. Specifically, the accelerometer feature from AccelNet and the visual feature from VideoNet are concatenated, resulting in a feature with 1024 dimensions and 120 frames. Two linear transformation layers transform the feature dimension from 1024 to 1, and a sigmoid layer after the last linear transformation layer yields the final prediction probability.</p><p>To train the model, the binary cross-entropy loss is adopted at the frame level. First, AccelNet and VideoNet are trained on the unimodal prediction task individually. Next, the pre-trained models are used in the multimodal task. During multimodal training, we only updated the fusion network, i.e., the two linear transformation layers, while keeping the parameters of the pre-trained AccelNet and VideoNet fixed.</p></div>
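A minimal sketch of the fusion head and the frame-level loss, assuming the layer sizes stated above; the hidden width of the first linear layer (`HID = 64`) is our assumption, as the text only fixes the input (1024) and output (1) dimensions, and the random weights are placeholders for trained parameters.

```python
import math
import random

# Sketch of the early-fusion head: concatenate the 512-d accelerometer and
# 512-d video features, apply two linear layers and a sigmoid, and score
# each frame with binary cross-entropy. Not the authors' implementation.

random.seed(0)

def linear(x, w, b):
    """y = W x + b for a single vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

IN_DIM, HID = 1024, 64  # HID is an assumed intermediate width
W1 = [[random.gauss(0, 0.01) for _ in range(IN_DIM)] for _ in range(HID)]
b1 = [0.0] * HID
W2 = [[random.gauss(0, 0.01) for _ in range(HID)]]
b2 = [0.0]

def fusion_head(accel_feat, video_feat):
    """Early fusion for one frame: concat -> linear -> linear -> sigmoid."""
    x = accel_feat + video_feat           # 1024-d fused feature
    h = linear(x, W1, b1)
    return sigmoid(linear(h, W2, b2)[0])  # speaking probability

def bce(p, y):
    """Frame-level binary cross-entropy for probability p and label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

p = fusion_head([0.1] * 512, [0.2] * 512)
print(p, bce(p, 1))
```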
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">RESULTS</head><p>To evaluate our speech detection approach, we followed the data split provided by the No-audio Speech Detection task. The model was trained on data from 54 subjects and tested on data from 16 unseen subjects that do not overlap with the subjects in the training set. We report the Area Under the Curve (AUC) metric for each test subject and each modality. The mean AUC scores computed over all test subjects are shown in Table <ref type="table">1</ref>, while the AUC scores for each test subject separately are shown in Fig. <ref type="figure" target="#fig_1">2</ref>.</p><p>Table <ref type="table">1</ref>: Performance of each of the previously submitted results and our proposed method for the unimodal and multimodal speech detection tasks. Bold indicates the best result.</p></div>
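The AUC metric reported above can be computed without any library via the Mann-Whitney U statistic: the probability that a randomly chosen speaking frame is scored above a randomly chosen non-speaking frame. The scores below are toy values for illustration, not the paper's data.

```python
# Library-free AUC via the Mann-Whitney U statistic (illustrative sketch).

def auc(scores, labels):
    """P(random positive scores above random negative), ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Per-subject AUCs computed this way would then be averaged over the
# 16 test subjects to obtain the mean scores in Table 1.
print(auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # 1.0 for a perfect ranking
```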
<div xmlns="http://www.tei-c.org/ns/1.0"><table><row role="label"><cell>Method</cell><cell>Accel</cell><cell>Video</cell><cell>Fusion</cell></row><row><cell>Cabrera-Quiros et al. <ref type="bibr" target="#b1">[2]</ref></cell><cell>0.656±0.074</cell><cell>0.549±0.079</cell><cell>0.658±0.073</cell></row><row><cell>Liu et al. <ref type="bibr" target="#b5">[6]</ref></cell><cell>0.533±0.020</cell><cell>0.512±0.021</cell><cell>0.535±0.019</cell></row><row><cell>Giannakeris et al. <ref type="bibr" target="#b2">[3]</ref></cell><cell>0.649±0.066</cell><cell>0.614±0.067</cell><cell>0.672±0.051</cell></row><row><cell>Li et al. <ref type="bibr" target="#b4">[5]</ref></cell><cell>0.644</cell><cell>0.513</cell><cell>0.620</cell></row><row><cell>Vargas et al. <ref type="bibr" target="#b7">[8]</ref></cell><cell/><cell/><cell/></row></table><p>In Table <ref type="table">1</ref>, our method is compared with the submission results of previous years. Our method achieves the best performance on the multimodal speech detection task. On the unimodal tasks, our AccelNet outperforms our VideoNet. Moreover, the performance of our accelerometer-based method is only slightly lower than that of <ref type="bibr" target="#b7">[8]</ref>, while our video-based method achieves a much higher performance than the second-best approach <ref type="bibr" target="#b2">[3]</ref>, indicating the good performance of C3D at extracting video features as well as the good design of VideoNet. The best performance of our multimodal result benefits from the good performance of VideoNet.</p><p>From Fig. <ref type="figure" target="#fig_1">2</ref> we can see that the accelerometer-based method does not always outperform the video-based method, indicating that the signals from the accelerometer and video may be complementary, which could explain the higher performance of the fusion of the two modalities compared to the unimodal methods. However, fusion did not lead to an improved performance for all individual test subjects (see subjects 17 and 83), so a better fusion method should be considered in the future.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">CONCLUSION</head><p>In this paper, we proposed a multimodal speech detection model with video and accelerometer data as input. Our model showed competitive results on the unimodal speech detection tasks with either video or accelerometer data as input, and it outperformed previous methods on the multimodal task, which uses both types of input.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: The proposed multimodal speech detection network.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: AUC scores for each test subject.</figDesc></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">The MatchNMingle dataset: a novel multi-sensor resource for the analysis of social interactions and group dynamics in-the-wild during free-standing conversations and speed dates</title>
		<author>
			<persName><forename type="first">Laura</forename><surname>Cabrera-Quiros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrew</forename><surname>Demetriou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ekin</forename><surname>Gedik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Leander</forename><surname>Van Der Meij</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hayley</forename><surname>Hung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Affective Computing</title>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Transductive Parameter Transfer, Bags of Dense Trajectories and MILES for No-Audio Multimodal Speech Detection</title>
		<author>
			<persName><forename type="first">Laura</forename><surname>Cabrera-Quiros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ekin</forename><surname>Gedik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hayley</forename><surname>Hung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Proceedings of the MediaEval Workshop</title>
				<meeting><address><addrLine>MediaEval</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">3</biblScope>
		</imprint>
	</monogr>
	<note type="report_type">CEUR-WS</note>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Multimodal Fusion of Appearance Features, Optical Flow and Accelerometer Data for Speech Detection</title>
		<author>
			<persName><forename type="first">Panagiotis</forename><surname>Giannakeris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefanos</forename><surname>Vrochidis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ioannis</forename><surname>Kompatsiaris</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Large-scale Video Classification with Convolutional Neural Networks</title>
		<author>
			<persName><forename type="first">Andrej</forename><surname>Karpathy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">George</forename><surname>Toderici</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sanketh</forename><surname>Shetty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thomas</forename><surname>Leung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rahul</forename><surname>Sukthankar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Li</forename><surname>Fei-Fei</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Combining Body Pose and Movement Modalities for No-audio Speech Detection</title>
		<author>
			<persName><forename type="first">Liandong</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zhuo</forename><surname>Hao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bo</forename><surname>Sun</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Analyzing Human Behavior in Subspace: Dimensionality Reduction + Classification</title>
		<author>
			<persName><forename type="first">Yang</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zhonglei</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tobey</forename><forename type="middle">H</forename><surname>Ko</surname></persName>
		</author>
		<editor>MediaEval</editor>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Learning spatiotemporal features with 3d convolutional networks</title>
		<author>
			<persName><forename type="first">Du</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lubomir</forename><surname>Bourdev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rob</forename><surname>Fergus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lorenzo</forename><surname>Torresani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Manohar</forename><surname>Paluri</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE international conference on computer vision</title>
				<meeting>the IEEE international conference on computer vision</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="4489" to="4497" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">CNNs and Fisher Vectors for No-Audio Multimodal Speech Detection</title>
		<author>
			<persName><forename type="first">Jose</forename><surname>Vargas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hayley</forename><surname>Hung</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
