<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">THUHCSI in MediaEval 2018 Emotional Impact of Movies Task</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Ye</forename><surname>Ma</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science &amp; Technology</orgName>
								<orgName type="institution">Tsinghua University</orgName>
								<address>
									<settlement>Beijing</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Xihao</forename><surname>Liang</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science &amp; Technology</orgName>
								<orgName type="institution">Tsinghua University</orgName>
								<address>
									<settlement>Beijing</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mingxing</forename><surname>Xu</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science &amp; Technology</orgName>
								<orgName type="institution">Tsinghua University</orgName>
								<address>
									<settlement>Beijing</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">THUHCSI in MediaEval 2018 Emotional Impact of Movies Task</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">B705957894EAFF821842FFD5ED02F1A9</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T02:17+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper we describe our team's approach to the MediaEval 2018 Emotional Impact of Movies challenge. We extract several sets of audio and visual features and then apply time-sequential models such as LSTM and BLSTM to model the continuous flow of emotion in movies. Different fusion methods are also considered and discussed. The results show that our methods achieve promising performance, indicating the effectiveness of the chosen features and models.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>The MediaEval Emotional Impact of Movies challenge has been held since 2015 <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b8">9]</ref>. The challenge focuses on the emotions that movies evoke in viewers and on how to predict them. This year's task consists of two subtasks: Subtask 1 addresses valence/arousal prediction and Subtask 2 addresses fear prediction. Details of both subtasks can be found in <ref type="bibr" target="#b2">[3]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">APPROACH</head><p>In this section, we describe our approach in detail, including feature extraction, prediction models, fusion methods, pre-processing and post-processing.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Feature extraction</head><p>Audio features. Previous results <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b7">8]</ref> have shown the great potential of the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) <ref type="bibr" target="#b3">[4]</ref>. This feature set contains 23 low-level descriptors (LLDs) and has proved effective in acoustic tasks such as speech emotion recognition. In our experiments, we extract the low-level descriptors of eGeMAPS using the openSMILE toolbox <ref type="bibr" target="#b4">[5]</ref>. We then compute the mean and standard deviation of all 23 descriptors in a centered 5-second sliding window, which yields a 46-dimensional feature vector for each second of the movie clip.</p><p>In addition, we consider the baseline audio features provided by the organizers, namely the Emobase 2010 feature set (1582 dimensions).</p></div>
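<div xmlns="http://www.tei-c.org/ns/1.0"><p>The window statistics described for eGeMAPS above can be sketched as follows. This is a minimal illustration and not our actual extraction script: the function name <code>window_stats</code>, the <code>frames_per_second</code> parameter, the truncation of the window at clip boundaries, the centering on the middle of each second and the use of the population standard deviation are all assumptions made for the example.</p>
<p><code lang="python"><![CDATA[
```python
import math


def window_stats(frames, frames_per_second, window_seconds=5.0):
    """Per-second mean and standard deviation over a centered window.

    frames: list of equal-length lists, one descriptor vector per frame.
    Returns one vector per second: all means followed by all standard deviations.
    """
    n_seconds = len(frames) // frames_per_second
    half = window_seconds / 2.0
    features = []
    for sec in range(n_seconds):
        # Assumption: the window is centered on the middle of the second.
        center = sec + 0.5
        # Assumption: the window is truncated at the start and end of the clip.
        lo = max(0, int(round((center - half) * frames_per_second)))
        hi = min(len(frames), int(round((center + half) * frames_per_second)))
        chunk = frames[lo:hi]
        dims = len(chunk[0])
        means = [sum(f[d] for f in chunk) / len(chunk) for d in range(dims)]
        # Assumption: population standard deviation (divide by N).
        stds = [
            math.sqrt(sum((f[d] - means[d]) ** 2 for f in chunk) / len(chunk))
            for d in range(dims)
        ]
        features.append(means + stds)
    return features
```
]]></code></p>
<p>With 23 descriptors per frame, each returned vector has 46 entries, which matches the dimensionality stated above.</p></div>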
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Visual features.</head><p>The baseline features consist of multiple general-purpose visual features. Following last year's experiments, we concatenate all visual features except the CNN feature into a single vector of 1271 dimensions. The CNN feature is treated separately because it is much larger (4096 dimensions) and comes from a different source than the others.</p><p>In order to utilize more visual information, we also try SentiBank for feature extraction. We apply the MVSO detectors <ref type="bibr" target="#b6">[7]</ref> to image frames extracted once per second from the movies and take the output of the final layer of the Inception network, which can be interpreted as the composition ratio of different concepts (4342 dimensions).</p><p>All features are normalized to zero mean and unit variance.</p></div>
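<div xmlns="http://www.tei-c.org/ns/1.0"><p>The normalization to zero mean and unit variance mentioned above can be sketched as follows. This is only an illustration: the helper names <code>fit_scaler</code> and <code>standardize</code> are hypothetical, and fitting the statistics on one set of vectors (for example the training data) and leaving constant dimensions unscaled are assumptions of the example.</p>
<p><code lang="python"><![CDATA[
```python
import math


def fit_scaler(rows):
    """Estimate the per-dimension mean and standard deviation of feature vectors."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((r[d] - means[d]) ** 2 for r in rows) / n)
        for d in range(dims)
    ]
    return means, stds


def standardize(rows, means, stds):
    """Scale each dimension to zero mean and unit variance."""
    return [
        [
            # Assumption: a constant dimension (std == 0) is only centered.
            (r[d] - means[d]) / (stds[d] if stds[d] > 0 else 1.0)
            for d in range(len(r))
        ]
        for r in rows
    ]
```
]]></code></p>
<p>The same means and standard deviations would then be reused to scale the validation and test features.</p></div>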
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Prediction models</head><p>Last year's results <ref type="bibr" target="#b5">[6]</ref> showed that Support Vector Machines (SVM) outperformed Long Short-Term Memory (LSTM) models. However, since the training dataset is larger than last year's and time-sequential models should perform better on larger datasets, this year we adopt LSTM as the prediction model for the emotional flow. Specifically, we treat the task as a sequence-to-sequence problem, and the length of the input sequences is chosen on the validation set.</p><p>This year, we also use the Bidirectional LSTM (BLSTM), for two main reasons. First, the ground truth is annotated while the annotators are watching the movies, so the latency and mismatch between the ground truth and the movie content must be considered. Second, the emotional flow in movies changes smoothly, and a Bidirectional LSTM could be less affected by fluctuations in the input features.</p><p>Another difference from last year is that we train models for valence and arousal jointly. Since valence and arousal share a similar underlying emotion concept, it is reasonable to use the same underlying structure. Therefore, every regression model is trained to predict a two-dimensional vector that represents both valence and arousal.</p><p>For Subtask 2, the experiments are done in two steps for simplicity. First, we train a classification model to predict a label for every second. Second, we identify a segment as "Fear" according to the labels of the seconds within it. Specifically, we filter out the seconds whose probability of evoking fear is lower than a threshold we set, and we keep only the runs of consecutive seconds that are longer than a certain length threshold, which removes noise from the sequence.</p></div>
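<div xmlns="http://www.tei-c.org/ns/1.0"><p>The second step of the Subtask 2 procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name <code>fear_segments</code> is hypothetical, the default values of <code>prob_threshold</code> and <code>min_length</code> are placeholders rather than the values used in our experiments, and a run is kept when its length is at least <code>min_length</code>.</p>
<p><code lang="python"><![CDATA[
```python
def fear_segments(probs, prob_threshold=0.5, min_length=3):
    """Turn per-second fear probabilities into (start, end) segments.

    probs: one predicted probability per second.
    Returns a list of (start, end) second indices with end exclusive.
    """
    segments = []
    start = None
    for t, p in enumerate(probs):
        if p >= prob_threshold:
            # This second passes the probability threshold.
            if start is None:
                start = t
        elif start is not None:
            # The current run ended; keep it only if it is long enough.
            if t - start >= min_length:
                segments.append((start, t))
            start = None
    # Close a run that lasts until the end of the movie.
    if start is not None and len(probs) - start >= min_length:
        segments.append((start, len(probs)))
    return segments
```
]]></code></p>
<p>Short isolated detections are discarded by the length test, which is the noise-removal effect described above.</p></div>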
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Fusion methods</head><p>In our experiments, we apply several fusion methods, described below.</p><p>Early fusion: We concatenate features from different modalities and different sources into one larger vector. This method is simple and straightforward, yet sometimes very effective.</p><p>Late fusion: We train several LSTM models simultaneously. The outputs of the last layers of these LSTM models are merged and used as the input to a subsequent fully-connected layer.</p><p>Average fusion: To avoid over-fitting and reduce noise, we compute the average of several models' predictions.</p><p>In addition, we apply a triangle filter of 25 seconds to reduce the noise of the outputs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">EXPERIMENTS AND RESULTS</head><p>In this section, we describe our experimental settings and present the results. All hyper-parameters mentioned below, such as the sequence length, the hidden size and the number of layers, are determined on the validation set. The ratio of training data to validation data is 4:1.</p></div>
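<div xmlns="http://www.tei-c.org/ns/1.0"><p>Average fusion and the triangle filter described above can be sketched as follows. This is a minimal illustration and not our actual implementation: the function names are hypothetical, the filter is applied to one output dimension at a time, and truncating and renormalizing the window at the sequence boundaries is an assumption of the example.</p>
<p><code lang="python"><![CDATA[
```python
def average_fusion(predictions):
    """Element-wise mean of several models' per-second predictions."""
    n_models = len(predictions)
    return [sum(values) / n_models for values in zip(*predictions)]


def triangle_smooth(series, width=25):
    """Smooth a sequence with a normalized triangular window of odd width."""
    half = width // 2
    offsets = range(-half, half + 1)
    # Weights rise linearly to the center and fall again: 1, 2, ..., half + 1, ..., 2, 1.
    weights = [half + 1 - abs(k) for k in offsets]
    smoothed = []
    for t in range(len(series)):
        num = 0.0
        den = 0.0
        for k, w in zip(offsets, weights):
            # Assumption: near the edges the window is truncated and renormalized.
            if 0 <= t + k < len(series):
                num += w * series[t + k]
                den += w
        smoothed.append(num / den)
    return smoothed
```
]]></code></p>
<p>For a two-dimensional valence and arousal output, the filter would be applied to each dimension separately.</p></div>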
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Subtask 1</head><p>Our experiments on the validation set show that BLSTM models perform better than LSTM models, which supports our assumption. We also find that BLSTM performs best when the sequence length is 100. As for the features, we have tested multiple early fusion combinations, and the early fusion of Emobase, visual features (except CNN) and eGeMAPS performs best. We therefore submitted 5 runs for Subtask 1, all using BLSTM models with a sequence length of 100 and the same input features. The first three runs differ only in the number of BLSTM layers, which is 4, 2 and 3 respectively. Run 4 is the average fusion of the first three runs. Run 5 is the late fusion of two BLSTM models, whose inputs are Emobase and visual features (except CNN) respectively. All runs are trained with a dropout probability of 0.5 to avoid over-fitting.</p><p>From Table <ref type="table" target="#tab_0">1</ref> we can see that the best run for valence is Run 3, which is a 2-layer BLSTM model using Emobase, visual features (except CNN) and eGeMAPS as inputs. For arousal, Run 4 achieves the best MSE, which indicates that average fusion can sometimes improve performance. The results for valence prediction are markedly better than those for arousal prediction, which suggests that arousal is harder to predict than valence.</p></div>
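<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two measures reported for Subtask 1 can be computed as in the following sketch. It assumes that r in Table 1 denotes Pearson's correlation coefficient, and the function names are hypothetical; it is an illustration rather than the official evaluation script.</p>
<p><code lang="python"><![CDATA[
```python
import math


def mse(y_true, y_pred):
    """Mean squared error between ground truth and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def pearson_r(y_true, y_pred):
    """Pearson correlation between ground truth and predictions.

    Both sequences must vary; a constant sequence makes the denominator zero.
    """
    n = len(y_true)
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    cov = sum((a - mean_true) * (b - mean_pred) for a, b in zip(y_true, y_pred))
    var_true = sum((a - mean_true) ** 2 for a in y_true)
    var_pred = sum((b - mean_pred) ** 2 for b in y_pred)
    return cov / math.sqrt(var_true * var_pred)
```
]]></code></p>
<p>A lower MSE and a higher r both indicate better predictions, so the two measures can rank runs differently.</p></div>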
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Subtask 2</head><p>For Subtask 2, we use the method discussed in Section 2.2. However, it performs much worse than expected. Because the dataset is imbalanced, the predicted probability of fear is very low and only a few segments of consecutive seconds are predicted as "fear". Some movies in the development set have no "fear" segments at all. This suggests that LSTM models may not be well suited to imbalanced problems. We have also tried techniques for imbalanced problems, such as down-sampling movies and giving more weight to positive samples. Nevertheless, these methods hardly help. Owing to time constraints, we ultimately did not submit runs for this subtask, and we will continue to investigate it in future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">DISCUSSION AND OUTLOOK</head><p>In summary, this year we have further studied the Emotional Impact of Movies task and gained some useful insights. First, temporal models such as LSTM and BLSTM can capture more information in time-sequential problems when given enough training data. BLSTM models could be less affected by the latency and mismatch between annotations and movies, and they perform better than unidirectional LSTM models. As for fusion methods, early fusion and average fusion are both simple and intuitive, yet they usually perform well.</p><p>Still, some problems remain to be solved. SentiBank features are not as useful as expected in this task. More CNN-related features should be extracted and tested. Arousal is much harder to predict than valence in our experiments, which needs further investigation. For Subtask 2, the problem of the imbalanced dataset remains unsolved this year, even though the evaluation metric has been changed to intersection over union. In addition, techniques from other domains, such as object segmentation and voice activity detection, could be applied to this subtask to handle the new metric. Moreover, adding more fear-related movies to the dataset could be another effective way to alleviate the imbalance problem.</p><p>In conclusion, this paper presents our approach to the MediaEval 2018 Emotional Impact of Movies task. We have trained BLSTM models using multi-modality features and several fusion methods, which achieve promising performance in valence and arousal prediction. The fear prediction task is not fully solved and remains to be further investigated.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1: Results of Subtask 1 on test set</head><label>1</label><figDesc></figDesc><table><row><cell>Runs</cell><cell cols="2">Valence</cell><cell cols="2">Arousal</cell></row><row><cell></cell><cell>MSE</cell><cell>r</cell><cell>MSE</cell><cell>r</cell></row><row><cell>Run 1</cell><cell>0.1021</cell><cell>0.1714</cell><cell>0.1414</cell><cell>0.0870</cell></row><row><cell>Run 2</cell><cell>0.1036</cell><cell>0.1820</cell><cell>0.1399</cell><cell>-0.0181</cell></row><row><cell>Run 3</cell><cell>0.0924</cell><cell>0.3048</cell><cell>0.1399</cell><cell>0.0761</cell></row><row><cell>Run 4</cell><cell>0.0980</cell><cell>0.2422</cell><cell>0.1396</cell><cell>0.0612</cell></row><row><cell>Run 5</cell><cell>0.0944</cell><cell>0.2511</cell><cell>0.1460</cell><cell>-0.0667</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGMENTS</head><p>This work was partially supported by the National Natural Science Foundation of China (61433018, 61171116) and the National High Technology Research and Development Program of China (863 program) (2015AA016305).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">The MediaEval 2016 Emotional Impact of Movies Task</title>
		<author>
			<persName><forename type="first">Emmanuel</forename><surname>Dellandréa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Liming</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoann</forename><surname>Baveye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mats</forename><surname>Sjöberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christel</forename><surname>Chamaret</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of MediaEval 2016 Workshop</title>
				<meeting>MediaEval 2016 Workshop<address><addrLine>Hilversum, Netherlands</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">The MediaEval 2017 Emotional Impact of Movies Task</title>
		<author>
			<persName><forename type="first">Emmanuel</forename><surname>Dellandréa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Martijn</forename><surname>Huigsloot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Liming</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoann</forename><surname>Baveye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mats</forename><surname>Sjöberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of MediaEval 2017 Workshop</title>
				<meeting>MediaEval 2017 Workshop<address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">The MediaEval 2018 Emotional Impact of Movies Task</title>
		<author>
			<persName><forename type="first">Emmanuel</forename><surname>Dellandréa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Martijn</forename><surname>Huigsloot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Liming</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoann</forename><surname>Baveye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zhongzhe</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mats</forename><surname>Sjöberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of MediaEval 2018 Workshop</title>
				<meeting>MediaEval 2018 Workshop<address><addrLine>Sophia Antipolis, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing</title>
		<author>
			<persName><forename type="first">Florian</forename><surname>Eyben</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Klaus</forename><surname>Scherer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Khiet</forename><surname>Truong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Björn</forename><surname>Schuller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Johan</forename><surname>Sundberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Elisabeth</forename><surname>André</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Carlos</forename><surname>Busso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Laurence</forename><surname>Devillers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Julien</forename><surname>Epps</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Petri</forename><surname>Laukka</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Affective Computing</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="190" to="202" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Recent developments in openSMILE, the munich open-source multimedia feature extractor</title>
		<author>
			<persName><forename type="first">Florian</forename><surname>Eyben</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Felix</forename><surname>Weninger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Florian</forename><surname>Gross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Björn</forename><surname>Schuller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 21st ACM international conference on Multimedia</title>
				<meeting>the 21st ACM international conference on Multimedia</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="835" to="838" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">THUHCSI in MediaEval 2017 Emotional Impact of Movies Task</title>
		<author>
			<persName><forename type="first">Zitong</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yuqi</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ye</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mingxing</forename><surname>Xu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of MediaEval 2017 Workshop</title>
				<meeting>MediaEval 2017 Workshop<address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Visual affect around the world: A large-scale multilingual visual sentiment ontology</title>
		<author>
			<persName><forename type="first">Brendan</forename><surname>Jou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tao</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nikolaos</forename><surname>Pappas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Miriam</forename><surname>Redi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mercan</forename><surname>Topkara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shih-Fu</forename><surname>Chang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 23rd ACM international conference on Multimedia</title>
				<meeting>the 23rd ACM international conference on Multimedia</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="159" to="168" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">THU-HCSI at MediaEval 2016: Emotional Impact of Movies Task</title>
		<author>
			<persName><forename type="first">Ye</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zipeng</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mingxing</forename><surname>Xu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of MediaEval 2016 Workshop</title>
				<meeting>MediaEval 2016 Workshop<address><addrLine>Hilversum, Netherlands</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">The MediaEval 2015 Affective Impact of Movies Task</title>
		<author>
			<persName><forename type="first">Mats</forename><surname>Sjöberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoann</forename><surname>Baveye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hanli</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vu</forename><surname>Lam Quang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bogdan</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Emmanuel</forename><surname>Dellandréa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Markus</forename><surname>Schedl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Claire-Hélène</forename><surname>Demarty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Liming</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of MediaEval 2015 Workshop</title>
				<meeting>MediaEval 2015 Workshop<address><addrLine>Wurzen, Germany</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
