<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">THUHCSI in MediaEval 2017 Emotional Impact of Movies Task</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Zitong</forename><surname>Jin</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Yuqi</forename><surname>Yao</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Ye</forename><surname>Ma</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Mingxing</forename><surname>Xu</surname></persName>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="department">Ministry of Education Tsinghua National Laboratory for Information Science and Technology (TNList</orgName>
								<orgName type="laboratory">Key Laboratory of Pervasive Computing</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="department">Department of Computer Science and Technology</orgName>
								<orgName type="institution">Tsinghua University</orgName>
								<address>
									<settlement>Beijing</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">THUHCSI in MediaEval 2017 Emotional Impact of Movies Task</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">16D82465579D60C264E2722D2E36F887</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T04:53+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper we describe our team's approach to MediaEval 2017 Challenge Emotional Impact of Movies. Except for the baseline features, we use OpenSMILE toolbox to extract audio features eGeMAPS from video clips. We also aim at the continuous flow of emotion, where using time-sequential models such as LSTM will be useful and effective. Fusion methods are also considered and discussed in this paper. The evaluation results of our experiments show that our features and models are competitive in both valence / arousal and fear prediction, indicating our approaches' effectiveness.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>The MediaEval 2017 Challenge Emotional Impact of Movies consists of two subtasks. Subtask 1 aims at Valence/Arousal prediction while subtask 2 aims at Fear prediction. Long movies are considered for both cases and prediction needs to be given every 5 seconds for the consecutive ten seconds' segment. LIRIS-ACCEDE <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref> dataset is used for training and testing, including both discrete and continuous sections of data. For more details, please refer to <ref type="bibr" target="#b4">[5]</ref>.</p><p>Video affective analysis and prediction is an important and challenging issue, which has drawn the attention of many researchers recently. The Emotional Impact of Movies task has been held for three years, so there are many participants who took part in the challenge in 2015 and 2016 <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b8">9]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">APPROACH</head><p>In this section, we will describe the main approaches for the subtasks, including feature extraction, pre-processing, prediction models, fusion and post-processing methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Subtask 1: valence / arousal prediction</head><p>Feature extraction. Except for the baseline features provided by the organizers, the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) <ref type="bibr" target="#b5">[6]</ref> is extracted from audio channel, which contains 88 features and has been proved effective in the same task of last year <ref type="bibr" target="#b7">[8]</ref>. In our experiments, we extract them with the OpenSMILE toolkit <ref type="bibr" target="#b6">[7]</ref> from 5-second-long segments which are cut from original videos in advance.</p><p>As for the visual features, the general purpose visual features provided by the organizers (except CNN features) are merged into one large feature. This is mainly on account of the fact that these features are short and complementary, and that combining them can greatly reduce the training workload to try every one of them.</p><p>All input features are scaled into vectors of zero mean and unit variance for normalization.</p><p>Prediction models. Two aspects of models are adopted in our experiments, which are traditional machine learning models and time-sequential models. Specifically, the traditional models consist of Support Vector Regression (SVR) and AdaBoost while the time-sequential ones are Long-Short Term Memory (LSTM) models. The LSTM models may capture the emotional flow of video and enhance the performance. We take the problem as a Sequenceto-One regression problem and the input features of LSTM models are segmented in a 10-second-long sliding window of 5 seconds overlapping.</p><p>All models are trained separately for valence and arousal.</p><p>Fusion methods. To combine features of different modalities, except for the early fusion method which simply concatenates different features, late fusion method is also considered. As for the traditional prediction models, average fusion is used to avoid overfitting. 
As for the LSTM models, the hidden vectors of several LSTM models taking different inputs are fused using an one-layer fullyconnected network to obtain final prediction, which is trained with LSTM models simultaneously.</p><p>After fusion, to reduce the fluctuation of output and smooth out the random noise, a 25-frame-long triangle filter is applied to each video.</p></div>
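The pre- and post-processing steps above (fixed-length windowing for the LSTM inputs, then triangular smoothing of the per-video outputs) can be sketched in a few lines of NumPy; the function names and the edge-padding behaviour are our illustrative choices, not code from the paper:

```python
import numpy as np

def sliding_windows(features, win=10, hop=5):
    """Cut a per-second feature sequence of shape (T, D) into windows of
    `win` seconds taken every `hop` seconds (i.e. a win - hop second overlap)."""
    starts = range(0, len(features) - win + 1, hop)
    return np.stack([features[s:s + win] for s in starts])

def triangle_smooth(pred, width=25):
    """Smooth a 1-D prediction sequence with a normalized triangular filter."""
    kernel = np.bartlett(width)
    kernel /= kernel.sum()
    # edge-pad so the smoothed sequence keeps its original length
    padded = np.pad(pred, width // 2, mode="edge")
    return np.convolve(padded, kernel, mode="valid")
```

For a 100-second clip described by one feature row per second, `sliding_windows` yields 19 overlapping 10-second windows, matching the 5-second prediction step used in the task.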
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Subtask 2: Fear prediction</head><p>Feature extraction. We use the same feature sets as Subtask 1. However, the main problem and the biggest challenge in Subtask 2 is that the samples are so unbalanced that simply predicting "zero" obtains the accuracy score of 84.34% in the test set (see <ref type="bibr">Run 4)</ref>. Therefore, to solve the unbalanced problem, SMOTE (Synthetic Minority Over-sampling TEchnique <ref type="bibr" target="#b2">[3]</ref>) method is adopted after feature extraction to re-sample. The main idea of SMOTE algorithm is to generate new samples for minorities using interpolation, which will make it more balanced.</p><p>Prediction models. Random Forest model is adopted in fear prediction, which may behave better than Support Vector Machine (SVM) in unbalanced problem. We first use Random Forest model to obtain the probability of predicting fear ("one") for each video clip. Then we set up the decision threshold p, and predict fear when the probability is larger than p. The value of p are adjusted according to the validation set's results. Due to the time constaints, we didn't try the LSTM model for Subtask 2.</p><p>Fusion methods. Similar to Subtask 1, both early and late fusion are used. In late fusion, the probability of different models are averaged to get one probability. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">EXPERIMENTS AND RESULTS</head><p>In this section, we will describe our specific runs in more detail and show the results. Note that all the hyper-parameters are selected due to the results of validation set, and the ratio of training data and validation data is 4:1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Subtask 1: valence / arousal prediction</head><p>We've submitted 5 runs for valence / arousal prediction, where the first two use LSTM and the other ones use SVR and AdaBoost, all listed below: Run 1: For valence, 2-layer LSTM model of hidden size 500 taking eGeMAPS as input; For arousal, 3-layer LSTM model of hidden size 500 taking VGG as input.</p><p>Run 2: For valence, late fusion of three 2-layer LSTM models of hidden size 1000 taking eGeMAPS, VGG and other visual features as input respectively; For arousal, the input features are Emobase, eGeMAPS and CEDD respectively.</p><p>Run 3: For both valence and arousal, SVR model taking VGG as input.</p><p>Run 4: For valence, AdaBoost model taking eGeMAPS as input; For arousal, AdaBoost model taking other visual features as input.</p><p>Run 5: For both valence and arousal, late fusion of Run 3 and Run 4.</p><p>In detail, the "other visual features" in Run 2 and 4 means the concatenation of all the visual features except the CNN feature. CEDD means Color and Edge Directivity Descriptor, which is one of the visual feature provided. VGG means CNN features extracted using VGG16 fc6 layer.</p><p>From Table <ref type="table" target="#tab_0">1</ref> we can see that, the best run of valence MSE is Run 2, using late fusion of LSTM models. Run 3 achieves the best results on other metrics, using SVR model and VGG feature. Notice that Run 2, the LSTM late fusion method, is better at MSE than Run 1, the single LSTM model, which means late fusion of three models utilizes different information in different features and enhances the performance to some extent. However, LSTM models perform worse in Pearson's r, compared to traditional machine learning models. 
This could be because LSTM models tend to predict similar values of all time, and thus obtain lower MSE and lower Pearson's r.</p><p>Taken together, Run 3 using SVR and VGG achieves best results, which means CNN features may contain useful information for emotion analysis, and traditional model could behave well when trained properly. From Table <ref type="table" target="#tab_1">2</ref> we can see that, Run 2 using VGG features achieve best results on recall and f1, while Run 5 using late fusion achieve best results on accuracy and precision. As mentioned before, the problem of subtask 2 is very unbalanced, and the fear samples are much fewer. Therefore, there is no surprise that accuracy and precision are one pair while recall and f1 are the other pair. Predicting more "zeros" will lead to higher accuracy while lower recall, and vice versa.</p><p>When considering f1 score, which is the harmonic mean of both precision and recall, Run 2 using VGG feature performs best, which confronts with the result of subtask 1 that CNN features contain useful information for emotion analysis.</p></div>
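A small numeric sketch (toy numbers, not the paper's data) makes the MSE-versus-Pearson's-r trade-off concrete: a near-constant prediction keeps MSE moderate while its correlation with the ground truth collapses, whereas a prediction that merely follows the shape of the curve scores a high r:

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def pearson_r(y_true, y_pred):
    return np.corrcoef(y_true, y_pred)[0, 1]

rng = np.random.default_rng(0)
y = 0.3 * np.sin(np.linspace(0, 6, 200))                      # toy ground-truth valence curve
flat = np.full_like(y, y.mean()) + rng.normal(0, 0.01, 200)   # near-constant model output
tracking = 0.5 * y + rng.normal(0, 0.02, 200)                 # under-scaled but shape-following
# `flat` scores an MSE close to the variance of y but a near-zero r;
# `tracking` scores both a lower MSE and a high r.
```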
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">CONCLUSION AND DISCUSSION</head><p>In this paper, we illustrate our approach to the MediaEval 2017 Challenge "Emotional Impact of Movies" task. In valence / arousal prediction subtask, both LSTM and SVR models are trained and compared. In fear prediction subtask, Random Forest model using different features are compared. Besides, early fusion and late fusion are adopted in experiments, which shows promising results in some aspects.</p><p>However, some problems have not been solved yet. For instance, some of the LSTM models tend to predict similar values of all time, leading to a very low Pearson's r, which may be caused by inappropriate experiment configuration. Unbalanced problem in subtask 2 still exists even using SMOTE algorithm, which means changing models or features could make no big difference, and all predicting "zero" can still obtain a very high accuracy. These problems remain to be solved in the future.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 : Results of Subtask 1 on test set</head><label>1</label><figDesc></figDesc><table><row><cell>Runs</cell><cell>Valence</cell><cell></cell><cell>Arousal</cell></row><row><cell></cell><cell>MSE</cell><cell>r</cell><cell>MSE</cell><cell>r</cell></row><row><cell cols="5">Run 1 0.2230 -0.0985 0.1577 0.2261</cell></row><row><cell cols="5">Run 2 0.1670 -0.0990 0.1269 -0.0122</cell></row><row><cell cols="5">Run 3 0.1833 0.3707 0.1166 0.3213</cell></row><row><cell cols="5">Run 4 0.2074 -0.0111 0.1318 0.2708</cell></row><row><cell cols="5">Run 5 0.2046 0.0122 0.1300 0.2750</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 : Results of Subtask 2 on test set</head><label>2</label><figDesc>Late fusion of Run 1 and Run 2.</figDesc><table><row><cell cols="4">Runs Accuracy Precision Recall</cell><cell>f1</cell></row><row><cell>Run 1</cell><cell>0.7352</cell><cell>0.0206</cell><cell cols="2">0.0530 0.0239</cell></row><row><cell>Run 2</cell><cell>0.8153</cell><cell>0.2318</cell><cell cols="2">0.2781 0.2352</cell></row><row><cell>Run 3</cell><cell>0.8461</cell><cell>0.2035</cell><cell cols="2">0.0208 0.0371</cell></row><row><cell>Run 4</cell><cell>0.8434</cell><cell>0.0000</cell><cell cols="2">0.0000 0.0000</cell></row><row><cell>Run 5</cell><cell>0.8469</cell><cell>0.2383</cell><cell cols="2">0.2186 0.2165</cell></row><row><cell cols="4">3.2 Subtask 2: fear prediction</cell></row><row><cell cols="5">We've submitted 5 runs for fear prediction, all using Random Forest</cell></row><row><cell cols="2">model, listed below:</cell><cell></cell><cell></cell></row><row><cell cols="4">Run 1: Random Forest + other visual features.</cell></row><row><cell cols="3">Run 2: Random Forest + VGG.</cell><cell></cell></row><row><cell cols="4">Run 3: Random Forest + all visual features.</cell></row><row><cell cols="4">Run 4: All predicting "zero" (just for test)</cell></row><row><cell>Run 5:</cell><cell></cell><cell></cell><cell></cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGMENTS</head><p>This work was partially supported by the National High Technology Research and Development Program of China (863 program) (2015AA016305) and the National Natural Science Foundation of China (61433018, 61171116). Emotional Impact of Movies Task MediaEval'17, 13-15 September 2017, Dublin, Ireland</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Deep learning vs. kernel methods: Performance for emotion prediction in videos</title>
		<author>
			<persName><forename type="first">Yoann</forename><surname>Baveye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Emmanuel</forename><surname>Dellandréa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christel</forename><surname>Chamaret</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Liming</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Affective Computing and Intelligent Interaction (ACII), 2015 International Conference on. IEEE</title>
				<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="77" to="83" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">LIRIS-ACCEDE: A video database for affective content analysis</title>
		<author>
			<persName><forename type="first">Yoann</forename><surname>Baveye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Emmanuel</forename><surname>Dellandrea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christel</forename><surname>Chamaret</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Liming</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Affective Computing</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="43" to="55" />
			<date type="published" when="2015">2015. 2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">SMOTE: synthetic minority over-sampling technique</title>
		<author>
			<persName><forename type="first">Kevin</forename><forename type="middle">W</forename><surname>Nitesh V Chawla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lawrence</forename><forename type="middle">O</forename><surname>Bowyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Philip</forename><surname>Hall</surname></persName>
		</author>
		<author>
			<persName><surname>Kegelmeyer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of artificial intelligence research</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page" from="321" to="357" />
			<date type="published" when="2002">2002. 2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">The MediaEval 2016 Emotional Impact of Movies Task</title>
		<author>
			<persName><forename type="first">Emmanuel</forename><surname>Dellandréa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Liming</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoann</forename><surname>Baveye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mats</forename><surname>Sjöberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christel</forename><surname>Chamaret</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of MediaEval 2016 Workshop. Hilversum</title>
				<meeting>MediaEval 2016 Workshop. Hilversum<address><addrLine>Netherlands</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">The MediaEval 2017 Emotional Impact of Movies Task</title>
		<author>
			<persName><forename type="first">Emmanuel</forename><surname>Dellandréa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Martijn</forename><surname>Huigsloot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Liming</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoann</forename><surname>Baveye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mats</forename><surname>Sjöberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of MediaEval 2017 Workshop</title>
				<meeting>MediaEval 2017 Workshop<address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing</title>
		<author>
			<persName><forename type="first">Florian</forename><surname>Eyben</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Klaus</forename><surname>Scherer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Khiet</forename><surname>Truong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bjorn</forename><surname>Schuller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Johan</forename><surname>Sundberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Elisabeth</forename><surname>Andre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Carlos</forename><surname>Busso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Laurence</forename><surname>Devillers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Julien</forename><surname>Epps</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Petri</forename><surname>Laukka</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Affective Computing</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="190" to="202" />
			<date type="published" when="2016">2016. 2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Recent developments in openSMILE, the munich open-source multimedia feature extractor</title>
		<author>
			<persName><forename type="first">Florian</forename><surname>Eyben</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Felix</forename><surname>Weninger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Florian</forename><surname>Gross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Björn</forename><surname>Schuller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 21st ACM international conference on Multimedia</title>
				<meeting>the 21st ACM international conference on Multimedia</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="835" to="838" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">THU-HCSI at MediaEval 2016: Emotional Impact of Movies Task</title>
		<author>
			<persName><forename type="first">Ye</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zipeng</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mingxing</forename><surname>Xu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of MediaEval 2016 Workshop</title>
				<meeting>MediaEval 2016 Workshop<address><addrLine>Hilversum, Netherlands</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">The MediaEval 2015 Affective Impact of Movies Task</title>
		<author>
			<persName><forename type="first">Mats</forename><surname>Sjöberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoann</forename><surname>Baveye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hanli</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vu</forename><surname>Lam Quang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bogdan</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Emmanuel</forename><surname>Dellandréa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Markus</forename><surname>Schedl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Claire-Hélène</forename><surname>Demarty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Liming</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of MediaEval 2015 Workshop</title>
				<meeting>MediaEval 2015 Workshop<address><addrLine>Wurzen, Germany</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
