<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">No-Audio Multimodal Speech Detection Task at MediaEval 2020</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Laura</forename><surname>Cabrera-Quiros</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Instituto Tecnológico de Costa Rica</orgName>
								<address>
									<country key="CR">Costa Rica</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jose</forename><surname>Vargas</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Delft University of Technology</orgName>
								<address>
									<country key="NL">Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Hayley</forename><surname>Hung</surname></persName>
							<email>h.hung@tudelft.nl</email>
							<affiliation key="aff1">
								<orgName type="institution">Delft University of Technology</orgName>
								<address>
									<country key="NL">Netherlands</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">No-Audio Multimodal Speech Detection Task at MediaEval 2020</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">77256394AE1E09B00A4404AA9687D400</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T07:13+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This overview paper provides a description of the No-Audio multimodal speech detection task for MediaEval 2020. As in the previous two editions, participants of this task are encouraged to estimate the speaking status (i.e. whether a person is speaking or not) of individuals interacting freely during a crowded mingle event, from multimodal data. In contrast to conventional speech detection approaches, no audio is used for this task. Instead, the proposed automatic estimation system must exploit the natural human movements that accompany speech, captured by cameras and wearable sensors. Task participants are provided with cropped videos of individuals while interacting, captured by an overhead camera, and the tri-axial acceleration of each individual throughout the event, captured with a single badge-like device hung around the neck. This year's edition of the task also focuses on investigating possible reasons for interpersonal differences in the performances obtained.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Speaking status is one of the key signals used for studying conversational dynamics in face-to-face settings <ref type="bibr" target="#b9">[10]</ref>. From the speaking status of multiple people one can also derive speaking turns and other features that have proven beneficial for estimating many different social constructs, such as dominance <ref type="bibr" target="#b7">[8]</ref> or cohesion <ref type="bibr" target="#b6">[7]</ref>. Overall, the automated analysis of conversational dynamics in large unstructured social gatherings remains an under-explored problem despite the relevance of such events <ref type="bibr" target="#b10">[11]</ref>, and automated speaking detection is one of its key components.</p><p>The majority of work on speaking status detection focuses on the audio signal captured by microphones. However, most unstructured social gatherings, such as parties or cocktail events, tend to have substantial background noise, and collecting good-quality audio requires participants to wear uncomfortable and intrusive equipment. Recording audio also risks being perceived as an invasion of privacy, since it gives access to the precise verbal content of the conversation, further limiting the natural behavior of the individuals involved. Because of these restrictions, recording audio in such settings is challenging.</p><p>As an alternative, the main goal of this task is to estimate a person's speaking status using video and wearable acceleration data from a smart ID badge hung around the neck, instead of audio. These modalities are more privacy-preserving, and easier to use and replicate in crowded environments such as conferences, networking events, or organizational settings.</p><p>Body movements such as gesturing tend to co-occur with speaking, as has been well documented by social scientists <ref type="bibr" target="#b8">[9]</ref>. Thus, an automatic estimation system should exploit the natural human movements that accompany speech. This task is motivated by these insights, and by past work which estimated speaking status from a single body-worn tri-axial accelerometer <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref> and from video <ref type="bibr" target="#b3">[4]</ref>.</p><p>Despite many efforts, one of the major challenges for these alternative approaches has been achieving estimation performance competitive with audio-based systems. Moreover, results from past editions of this task have shown significant differences in performance across individuals, with lower performance for a particular subset of them (failure cases) that is not yet fully understood.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">TASK DETAILS 2.1 Unimodal estimation of speaking status</head><p>Participants are encouraged to design and implement separate speaking status estimators for each modality. However, baseline approaches for each modality are provided, in case participants prefer to focus on improving an estimator for only one of the modalities, or on the fusion technique. The acceleration baseline implements the logistic regression in <ref type="bibr" target="#b4">[5]</ref>, and the video baseline employs dense trajectories and multiple instance learning, as explained in <ref type="bibr" target="#b2">[3]</ref>.</p><p>For the video modality, the input is a video of a person interacting freely in a social gathering (see Figure <ref type="figure" target="#fig_0">1</ref>), and an estimation of that person's speaking status (speaking/non-speaking) should be provided every second. For the wearable modality, the method takes the tri-axial acceleration signal of a person as input and must also return a speaking status estimation every second.</p></div>
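As an illustration, the acceleration pipeline (per-second windows of tri-axial acceleration, simple movement statistics, logistic regression) can be sketched in plain NumPy. The feature set and training details below are illustrative stand-ins, not the official baseline of [5]:

```python
import numpy as np

def window_features(acc, fs=20):
    """Per-second windows of a (T, 3) tri-axial acceleration signal,
    reduced to simple movement statistics (illustrative feature set)."""
    n = (len(acc) // fs) * fs
    windows = acc[:n].reshape(-1, fs, 3)              # (num_seconds, fs, 3)
    mag = np.linalg.norm(windows, axis=2)             # magnitude per sample
    return np.stack([mag.mean(1), mag.var(1),
                     np.abs(np.diff(mag, axis=1)).mean(1)], axis=1)

def train_logreg(X, y, lr=0.1, epochs=500):
    """Plain-NumPy logistic regression, standing in for the baseline classifier."""
    Xb = np.hstack([X, np.ones((len(X), 1))])         # add a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_scores(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))              # one score per second

# toy demo: "speaking" segments carry more movement energy than "silent" ones
rng = np.random.default_rng(0)
speaking = rng.normal(0, 1.0, (600, 3))               # 30 s at 20 Hz, lively
silent = rng.normal(0, 0.2, (600, 3))                 # 30 s at 20 Hz, still
X = np.vstack([window_features(speaking), window_features(silent)])
y = np.concatenate([np.ones(30), np.zeros(30)])
scores = predict_scores(train_logreg(X, y), X)
print(scores[:30].mean() > scores[30:].mean())        # speaking scores higher
```

The per-second granularity of the output matches the required submission format: one continuous score per second of data.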
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Multimodal estimation of speaking status</head><p>For this subtask, teams must provide an estimation of speaking status every second by exploiting both modalities together. Teams can use any fusion method they see fit <ref type="bibr" target="#b0">[1]</ref>. The goal is to leverage the complementary nature of the modalities to better estimate speaking status. Thus, teams are encouraged to go beyond basic fusion and consider the impact of each modality on the estimation.</p></div>
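A minimal score-level (late) fusion of the two unimodal estimators could look like the following; the weight is a hypothetical hyper-parameter, and more elaborate fusion schemes [1] would replace the weighted average:

```python
import numpy as np

def late_fusion(video_scores, accel_scores, w_video=0.5):
    """Score-level (late) fusion: a weighted average of the per-second
    scores from the video and acceleration estimators. w_video is a
    hypothetical weight one could tune on held-out training data."""
    v = np.asarray(video_scores, dtype=float)
    a = np.asarray(accel_scores, dtype=float)
    return w_video * v + (1.0 - w_video) * a

# three seconds of (toy) unimodal scores; the modalities disagree on second 2
fused = late_fusion([0.9, 0.2, 0.8], [0.7, 0.6, 0.1])
print(fused)  # 0.8, 0.4, 0.45: averaging tempers the disagreement
```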
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Analysis of failure test cases</head><p>As a new addition to this year's edition, teams must analyze the differences in performance on the test set, focusing on the three subjects with the lowest performance, and hypothesize about why the method underperforms for these individuals. Participants are encouraged to consider the circumstances of the subjects (e.g. occlusion) or interpersonal differences that could explain such dissimilarities.</p></div>
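A sketch of the first step of such an analysis: compute the task metric (ROC-AUC, Section 4) per test subject and select the three lowest. The per-subject data here is synthetic, with three subjects made deliberately noisy to play the role of failure cases:

```python
import numpy as np

def roc_auc(y, s):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked
    correctly; ties (rare with continuous scores) get no credit here."""
    pos, neg = s[y == 1], s[y == 0]
    return (pos[:, None] > neg[None, :]).mean()

rng = np.random.default_rng(1)
aucs = {}
for subj in range(16):                                 # the 16 test subjects
    noisy = subj in (3, 7, 11)                         # hypothetical failure cases
    y = rng.integers(0, 2, 200)                        # per-second ground truth
    s = y + rng.normal(0, 1.5 if noisy else 0.4, 200)  # per-second scores
    aucs[subj] = roc_auc(y, s)
worst3 = sorted(aucs, key=aucs.get)[:3]                # subjects to investigate
print(sorted(worst3))
```

With real submissions, the per-subject scores would come from the team's own estimator, and the follow-up would relate `worst3` to metadata such as occlusion level.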
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">DATA</head><p>A subset of the MatchNMingle dataset<ref type="foot" target="#foot_0">1</ref> <ref type="bibr" target="#b1">[2]</ref> is used for this task. It contains data from 70 people who attended one of three separate mingle events for over 45 minutes. To eliminate the effects of acclimatization, only 30 minutes from the middle of each event are used. Subjects were separated using stratified sampling into a train set (54 subjects) and a test set (16 subjects). Stratification used several criteria to ensure balanced distributions in both sets for speaking status, gender, event day, and level of occlusion in the video. An additional segment of the data was created for the optional subject-specific evaluation of the task (see Section 4). While the dataset used this year is the same as in previous editions of the challenge, making comparisons between solutions from different years possible, focus is given to the differences shown by the 16 subjects in the test set.</p><p>Videos were captured from an overhead view at 20 FPS. A rectangular (bounding box) area around each subject has been cropped, so that one video is provided per person. Important challenges in the automatic analysis of this data include the significant amount of cross-contamination and occlusion, both self-occlusion and occlusion by other subjects, due to the crowded nature of the event (a cocktail party).</p><p>Subjects also wore a badge-like body-worn accelerometer (see Figure <ref type="figure" target="#fig_0">1</ref>), recording tri-axial acceleration at 20 Hz. These acceleration readings were processed via whitening applied per axis. All video and wearable data are synchronized.</p><p>Finally, binary speaking status (speaking/non-speaking) was annotated by 3 different annotators. Inter-annotator agreement was calculated on a 2-minute segment of the data, resulting in a Fleiss' kappa coefficient of 0.55.</p></div>
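Two of the details above can be made concrete. Assuming "whitening per axis" means per-axis standardization (an assumption; the dataset's exact preprocessing may differ), and using the standard Fleiss' kappa formula for the 3-annotator binary labels:

```python
import numpy as np

def whiten_per_axis(acc):
    """Per-axis standardization (zero mean, unit variance) of a (T, 3)
    acceleration signal; assumed interpretation of the whitening step."""
    return (acc - acc.mean(0)) / acc.std(0)

def fleiss_kappa(counts):
    """Fleiss' kappa for an (items, categories) matrix of rating counts,
    with a fixed number of raters per item (here: 3 annotators, 2 classes)."""
    counts = np.asarray(counts, dtype=float)
    n = counts[0].sum()                                # raters per item
    p_item = ((counts ** 2).sum(1) - n) / (n * (n - 1))
    p_cat = counts.sum(0) / counts.sum()
    p_e = (p_cat ** 2).sum()                           # chance agreement
    return (p_item.mean() - p_e) / (1 - p_e)

rng = np.random.default_rng(0)
wht = whiten_per_axis(rng.normal(2.0, 5.0, (1200, 3)))  # 60 s at 20 Hz

# toy counts: 6 one-second segments rated by 3 annotators (speaking / not)
counts = [[3, 0], [2, 1], [3, 0], [1, 2], [0, 3], [2, 1]]
kappa = fleiss_kappa(counts)
print(round(kappa, 3))  # about 0.3 for this toy matrix
```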
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">EVALUATION</head><p>The Area Under the ROC Curve (ROC-AUC) is used as the evaluation metric, since it is robust against the class imbalance present in our scenario. Participants therefore need to submit continuous prediction scores (posterior probabilities, distances to the separating hyperplane, etc.) obtained by running their method on the evaluation set. These scores will be compared against the test labels, which are not available to participants.</p><p>Required evaluation. For the unimodal and multimodal estimations, each team must provide up to 5 runs with their scores for a person's speaking status. As mentioned, the training set does not contain any data from the participants in the test set, to achieve person-independent results. (Occlusion levels can be requested if needed for the training set.)</p><p>Optional evaluation. Teams may optionally submit up to 5 runs (per person) using person-dependent training. To do so, a separate 5-minute interval for each person is provided. Thus, samples and labels from the same subject can be used to train or fine-tune, and then to test on, a specific test subject's data, which is temporally adjacent to the training samples. A method would be expected to perform better when trained or fine-tuned on the target person rather than on other people.</p></div>
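The optional person-dependent setting can be sketched as fine-tuning: start from a person-independent model, then take a few gradient steps on the target subject's separate interval. Everything below (features, model, data) is illustrative rather than the task's prescribed procedure:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fine_tune(w, X, y, lr=0.05, steps=100):
    """A few logistic-regression gradient steps on the target subject's
    own (temporally separate) samples, starting from generic weights w."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w

# toy subject whose movement-to-speech mapping differs from the generic model
rng = np.random.default_rng(2)
w_generic = np.array([1.0, 0.0])                 # person-independent weights
X = rng.normal(0, 1, (300, 2))                   # this subject's features
y = (X[:, 1] > 0).astype(float)                  # their speech tracks feature 2
w_personal = fine_tune(w_generic, X, y)
acc_generic = ((sigmoid(X @ w_generic) > 0.5) == y).mean()
acc_personal = ((sigmoid(X @ w_personal) > 0.5) == y).mean()
print(acc_personal > acc_generic)                # adaptation helps this subject
```

This mirrors the person-specific effect reported in [5]: the same features can relate to speech differently across people, so person-dependent weights can outperform a generic model.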
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">DISCUSSION AND OUTLOOK</head><p>With this task, we aim to support the study of speaking status detection in the wild using modalities alternative to audio. We aim to learn more about the connection between speaking and body movements, expecting that in the future this will bring valuable insights to both the social science and multimedia communities.</p><p>Participation in previous editions of the task has been limited, with only small improvements over the baseline. We believe this is due to the variety of ways in which this task is atypical. For example, the connection between speech and body movements has been found to be person-specific <ref type="bibr" target="#b4">[5]</ref>. Additionally, the interaction between the two modalities of interest (chest acceleration and video) is not traditionally explored, i.e. the combination of these two modalities is uncommon. This leaves open opportunities to explore their complementarity, to better understand in which situations one modality is more reliable than the other, and to develop or apply appropriate fusion strategies. Moreover, differences in performance between test subjects were consistently found in previous editions, further supporting past research <ref type="bibr" target="#b4">[5]</ref>. Thus, this year participants are encouraged to focus on such failure cases and hypothesize about the reasons for such dissimilarities.</p><p>We are reaching out to different communities (affective computing, multimedia, computer vision, and speech), as we believe each of these communities can bring its own expertise to the task. In the following years, as well as augmenting the data, we aim to explore the implications of the spatial social component of the mingle (e.g. F-Formations) on speaking status detection.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1:</head><label>1</label><figDesc>Figure 1: Alternative modalities to audio used for the task. Left: Individual video of each participant while interacting freely. Right: Wearable tri-axial acceleration recorded by a device hung around the neck.</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">MatchNMingle is openly available for research purposes under an EULA at http://matchmakers.ewi.tudelft.nl/matchnmingle/pmwiki/</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGMENTS</head><p>This task is partially supported by the Netherlands Organization for Scientific Research (NWO) under project number 639.022.606.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Multimodal fusion for multimedia analysis: a survey</title>
		<author>
			<persName><forename type="first">Pradeep</forename><forename type="middle">K</forename><surname>Atrey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">Anwar</forename><surname>Hossain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Abdulmotaleb</forename><forename type="middle">El</forename><surname>Saddik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mohan</forename><forename type="middle">S</forename><surname>Kankanhalli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Multimedia Systems</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="345" to="379" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">The MatchNMingle dataset: a novel multi-sensor resource for the analysis of social interactions and group dynamics in-the-wild during free-standing conversations and speed dates</title>
		<author>
			<persName><forename type="first">Laura</forename><surname>Cabrera-Quiros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrew</forename><surname>Demetriou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ekin</forename><surname>Gedik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Leander</forename><surname>Van Der Meij</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hayley</forename><surname>Hung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Affective Computing</title>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Gestures in-the-wild: detecting conversational hand gestures in crowded scenes using a multimodal fusion of bags of video trajectories and body worn acceleration</title>
		<author>
			<persName><forename type="first">Laura</forename><surname>Cabrera-Quiros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><forename type="middle">M J</forename><surname>Tax</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hayley</forename><surname>Hung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Multimedia</title>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Look at who&apos;s talking: Voice activity detection by automated gesture analysis</title>
		<author>
			<persName><forename type="first">Marco</forename><surname>Cristani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Anna</forename><surname>Pesarin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alessandro</forename><surname>Vinciarelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marco</forename><surname>Crocco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vittorio</forename><surname>Murino</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Joint Conference on Ambient Intelligence</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="72" to="80" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Personalised models for speech detection from body movements using transductive parameter transfer</title>
		<author>
			<persName><forename type="first">Ekin</forename><surname>Gedik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hayley</forename><surname>Hung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Personal and Ubiquitous Computing</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="723" to="737" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Classifying social actions with a single accelerometer</title>
		<author>
			<persName><forename type="first">Hayley</forename><surname>Hung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Gwenn</forename><surname>Englebienne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeroen</forename><surname>Kools</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing</title>
				<meeting>the 2013 ACM international joint conference on Pervasive and ubiquitous computing</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="207" to="210" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Estimating cohesion in small groups using audio-visual nonverbal behavior</title>
		<author>
			<persName><forename type="first">Hayley</forename><surname>Hung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniel</forename><surname>Gatica-Perez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Multimedia</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="563" to="575" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Modeling Dominance in Group Conversations Using Nonverbal Activity Cues</title>
		<author>
			<persName><forename type="first">Dinesh</forename><forename type="middle">Babu</forename><surname>Jayagopi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hayley</forename><surname>Hung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chuohao</forename><surname>Yeo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniel</forename><surname>Gatica-Perez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Audio, Speech, and Language Processing</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="501" to="513" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">David</forename><surname>McNeill</surname></persName>
		</author>
		<title level="m">Language and gesture</title>
				<imprint>
			<publisher>Cambridge University Press</publisher>
			<date type="published" when="2000">2000</date>
			<biblScope unit="volume">2</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Bridging the gap between social animal and unsocial machine: A survey of social signal processing</title>
		<author>
			<persName><forename type="first">Alessandro</forename><surname>Vinciarelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Maja</forename><surname>Pantic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dirk</forename><surname>Heylen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Catherine</forename><surname>Pelachaud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Isabella</forename><surname>Poggi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Francesca</forename><surname>D&apos;Errico</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marc</forename><surname>Schroeder</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Affective Computing</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="69" to="87" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Effects of networking on career success: a longitudinal study</title>
		<author>
			<persName><forename type="first">Hans-Georg</forename><surname>Wolff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Klaus</forename><surname>Moser</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Applied Psychology</title>
		<imprint>
			<biblScope unit="volume">94</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">196</biblScope>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
