<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Fudan-NJUST at MediaEval 2014: Violent Scenes Detection Using Deep Neural Networks</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Qi</forename><surname>Dai</surname></persName>
							<email>daiqi@fudan.edu.cn</email>
							<affiliation key="aff0">
								<orgName type="department">School of Computer Science</orgName>
								<orgName type="institution">Fudan University</orgName>
								<address>
									<settlement>Shanghai</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Zuxuan</forename><surname>Wu</surname></persName>
							<email>zxwu@fudan.edu.cn</email>
							<affiliation key="aff0">
								<orgName type="department">School of Computer Science</orgName>
								<orgName type="institution">Fudan University</orgName>
								<address>
									<settlement>Shanghai</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yu-Gang</forename><surname>Jiang</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">School of Computer Science</orgName>
								<orgName type="institution">Fudan University</orgName>
								<address>
									<settlement>Shanghai</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Xiangyang</forename><surname>Xue</surname></persName>
							<email>xyxue@fudan.edu.cn</email>
							<affiliation key="aff0">
								<orgName type="department">School of Computer Science</orgName>
								<orgName type="institution">Fudan University</orgName>
								<address>
									<settlement>Shanghai</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jinhui</forename><surname>Tang</surname></persName>
							<email>jinhuitang@mail.njust.edu.cn</email>
							<affiliation key="aff1">
								<orgName type="department">School of Computer Science and Engineering</orgName>
								<orgName type="institution">Nanjing University of Science and Technology</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Fudan-NJUST at MediaEval 2014: Violent Scenes Detection Using Deep Neural Networks</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">51BA727E68BF0DB5309ADF7B1F3052A9</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T16:11+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The Violent Scenes Detection task aims at evaluating algorithms that automatically localize violent segments in both Hollywood movies and short web videos. The definition of violence is subjective: "the segments that one would not let an 8 years old child see in a movie because they contain physical violence". This is a highly challenging problem because of the strong content variations among the positive instances. In this year's evaluation, we adopted our recently proposed classification method, named the regularized DNN, which fuses multiple features using Deep Neural Networks (DNN). We extracted a set of visual and audio features that have previously been found useful, and then applied the regularized DNN for feature fusion and classification. Results indicate that using multiple features is still very helpful and, more importantly, that our regularized DNN offers significantly better results than the popular SVM. We achieved a mean average precision of 0.63 for the main task and 0.60 for the generalization task.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">SYSTEM DESCRIPTION</head><p>Figure <ref type="figure" target="#fig_1">1</ref> gives an overview of our system. In this short paper, we briefly describe each of the key components. For the task definition, data and evaluation metric, interested readers may refer to <ref type="bibr" target="#b0">[1]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Features</head><p>Three kinds of audio-visual features were extracted, all of which were found useful in the 2013 evaluation.</p><p>We extracted trajectory-based motion features following our previous work <ref type="bibr" target="#b2">[2]</ref>. The main difference is that the new improved dense trajectories (IDT) <ref type="bibr" target="#b4">[4]</ref> were used as the basis, replacing the original dense trajectories. Four baseline descriptors were computed: histograms of oriented gradients (HOG), histograms of optical flow (HOF), motion boundary histograms (MBH) and trajectory shape (TrajShape). These features were encoded with Fisher vectors (FV) using a codebook of 256 codewords. We further computed our proposed TrajMF <ref type="bibr" target="#b2">[2]</ref> on top of HOG, HOF and MBH, which considers the motion relationships between trajectories. As the dimension of the original TrajMF is very high, we employed expectation-maximization principal component analysis (EM-PCA) <ref type="bibr" target="#b3">[3]</ref> for dimension reduction, producing a 1500-dimensional representation for each feature. In total, there are seven trajectory-based features: four baseline FV features and three dimension-reduced TrajMF features. See <ref type="bibr" target="#b2">[2]</ref> for more details.</p><p>The other two kinds of features are Space-Time Interest Points (STIP) <ref type="bibr" target="#b5">[5]</ref> and Mel-Frequency Cepstral Coefficients (MFCC). STIP describes texture and motion around local interest points; the descriptors were encoded with the bag-of-words framework using 4000 codewords, where we randomly sampled 300k features and ran k-means to generate the codebook. MFCC is a widely used audio feature, extracted from every 32 ms time window with 50% overlap. Bag-of-words encoding with 4000 codewords was also adopted to quantize the MFCC descriptors.</p></div>
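<div xmlns="http://www.tei-c.org/ns/1.0"><p>For illustration, the following is a minimal Python sketch of the bag-of-words step described above: learning a codebook from a random sample of descriptors, then quantizing one video's descriptors into a histogram. The use of scikit-learn and the function names are merely illustrative and do not reflect the actual implementation used in our experiments.</p><p><code>
# Bag-of-words quantization sketch (illustrative only).
# Assumes local descriptors (e.g., MFCC or STIP) are already extracted.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_codebook(sampled_descriptors, n_words=4000, seed=0):
    """Learn a codebook with k-means from a random sample of descriptors."""
    km = MiniBatchKMeans(n_clusters=n_words, random_state=seed, n_init=3)
    km.fit(sampled_descriptors)  # shape: (n_samples, descriptor_dim)
    return km

def encode_bow(codebook, video_descriptors):
    """Quantize one video's descriptors into an L1-normalized histogram."""
    words = codebook.predict(video_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(np.float64)
    return hist / max(hist.sum(), 1.0)
</code></p></div>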
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Classifiers</head><p>We adopted both SVM and deep neural networks (DNN) for classification.</p><p>SVM: A χ² kernel was adopted for the bag-of-words features (STIP and MFCC), and a linear kernel was used for the others. For feature fusion, kernel-level average fusion was used for the trajectory-based features, while score-level average late fusion was adopted to combine the trajectory features with STIP and MFCC.</p></div>
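<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two fusion schemes can be summarized with the short Python sketch below: kernel-level average fusion over precomputed Gram matrices, and score-level late fusion over per-model scores. This is a simplified illustration; the library choices and parameters are not those of our actual system.</p><p><code>
# Fusion sketch (illustrative only).
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel, linear_kernel
from sklearn.svm import SVC

def average_kernels(kernels):
    """Kernel-level average fusion over a list of precomputed Gram matrices."""
    return np.mean(np.stack(kernels, axis=0), axis=0)

def late_score_fusion(score_lists):
    """Score-level late fusion: average the per-model prediction scores."""
    return np.mean(np.stack(score_lists, axis=0), axis=0)

# Example: linear kernels for the trajectory features, then one SVM trained on
# the averaged kernel; chi-squared kernels (chi2_kernel) would be built
# analogously for the bag-of-words features.
# K = average_kernels([linear_kernel(X) for X in trajectory_features])
# clf = SVC(kernel="precomputed", probability=True).fit(K, labels)
</code></p></div>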
<div xmlns="http://www.tei-c.org/ns/1.0"><head>DNN:</head><p>We also adopted a new DNN-based classifier proposed in our recent work <ref type="bibr" target="#b6">[6,</ref><ref type="bibr" target="#b7">7]</ref>. The aforementioned fusion methods used for the SVM classifiers neglect the hidden patterns shared among the different features. To capture the relationships between distinct features, we constructed a regularized DNN for video classification. Specifically, as shown in Figure <ref type="figure">2</ref>, a layer of neurons is first used to perform feature abstraction separately for each input feature. Another layer then performs feature fusion, with a carefully designed structural-norm regularization on the network weights that identifies feature relationships. Finally, the fused representation is used to build a classification model in the last layer. With this network, features are fused by considering both feature correlation and feature diversity, while classification is performed simultaneously. See <ref type="bibr" target="#b6">[6,</ref><ref type="bibr" target="#b7">7]</ref> for more details.</p></div>
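<div xmlns="http://www.tei-c.org/ns/1.0"><p>The structure can be sketched as a small multi-branch network: one abstraction branch per feature, a fusion layer whose weights are grouped per feature, and a final classification layer. The PyTorch sketch below is only a simplified illustration; in particular, the group-norm penalty merely approximates the structural-norm regularization of [6, 7] and is not the exact formulation used there.</p><p><code>
# Simplified multi-feature fusion network (illustrative; the penalty below only
# approximates the structural-norm regularizer described in [6, 7]).
import torch
import torch.nn as nn

class FusionDNN(nn.Module):
    def __init__(self, feature_dims, hidden=512, fused=256, n_classes=2):
        super().__init__()
        # one feature-abstraction branch per input feature
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in feature_dims])
        self.fusion = nn.Linear(hidden * len(feature_dims), fused)
        self.classifier = nn.Linear(fused, n_classes)

    def forward(self, feature_list):
        h = torch.cat([b(x) for b, x in zip(self.branches, feature_list)], dim=1)
        return self.classifier(torch.relu(self.fusion(h)))

    def fusion_group_penalty(self):
        # group norm over the per-feature blocks of the fusion weight matrix
        blocks = self.fusion.weight.chunk(len(self.branches), dim=1)
        return sum(torch.norm(b) for b in blocks)

# Training would minimize: cross_entropy(output, label) + lambda * penalty.
</code></p></div>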
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3">Score Smoothing and Clip Merging</head><p>Temporal score smoothing has proven effective, as incorrect predictions on a short clip may be corrected by taking into account the predictions on nearby clips. All videos were first partitioned uniformly into 3-second clips, and the smoothed prediction score of a clip is simply the average of the scores in a three-clip window.</p><p>Since predictions must be output at the segment level rather than at the fixed-length clip level, consecutive clips are merged when they are all determined to contain violence or all determined not to, i.e., when their violence scores are all above or all below a threshold. The score of a merged segment is set to the average score of its clips.</p></div>
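<div xmlns="http://www.tei-c.org/ns/1.0"><p>A compact Python sketch of this post-processing is given below. The clip length (3 s) and the three-clip smoothing window follow the description above; the threshold value is not specified in this paper and is left as a parameter.</p><p><code>
# Temporal smoothing and clip-merging sketch (illustrative only).
import numpy as np

def smooth_scores(scores, window=3):
    """Replace each clip score by the average over a three-clip window."""
    pad = window // 2
    padded = np.pad(np.asarray(scores, dtype=float), pad, mode="edge")
    return np.convolve(padded, np.ones(window) / window, mode="valid")

def merge_clips(scores, threshold, clip_len=3.0):
    """Merge consecutive clips on the same side of the threshold into segments."""
    segments, start = [], 0
    for i in range(1, len(scores) + 1):
        boundary = i == len(scores) or \
            (scores[i] >= threshold) != (scores[start] >= threshold)
        if boundary:
            seg = scores[start:i]
            segments.append((start * clip_len, i * clip_len, float(np.mean(seg))))
            start = i
    return segments  # list of (start_sec, end_sec, averaged_score)
</code></p></div>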
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">RESULTS AND DISCUSSIONS</head><p>We submitted 5 runs for official evaluation. As shown in Figure <ref type="figure" target="#fig_1">1</ref>, Run 1 and Run 2 used SVM and DNN respectively. Run 2 did not use the FV encoding of the HOG, HOF and MBH features, as the dimensionality of these three features is too high, which would jeopardize the performance of the DNN when there is insufficient training data. Run 3 is the score fusion of Run 1 and Run 2. Run 4 is the score-smoothed version of Run 3 (smoothing was performed before merging), while Run 5 is the direct fusion of SVM and DNN without using any smoothing and merging functions.</p><p>The official results are summarized in Figure <ref type="figure" target="#fig_3">3</ref>. We see that, although some features were not used in the DNN, the performance of the DNN (Run 2) is still significantly better than that of the SVM. This clearly confirms the effectiveness of deep networks. Directly fusing DNN and SVM incurs a small performance drop (Run 3), which may be due to sub-optimal parameters in the fusion process. Another fusion setting (Run 5), which does not use score merging, improves the main-task performance but still hurts the generalization task, indicating that the DNN has better generalization capability than the SVM and that fusing the SVM with the DNN degrades the generalization-task performance. Finally, the results of Run 4 indicate that both smoothing and merging are useful for the main task. It is not surprising that smoothing does not help the generalization task: compared with the long movies used in the main task, the test clips are short and temporally more consistent.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1:</head><label>1</label><figDesc>Figure 1: An overview of the key components in our system, where circled numbers indicate the 5 submitted runs.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2:</head><label>2</label><figDesc>Figure 2: Illustration of the structure of our regularized DNN. Multiple features are used as the inputs, and the network transforms the features separately first, before using regularizations to explore feature relationships. The identified relationships are then utilized for improved classification performance. This figure is reprinted from [7].</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3:</head><label>3</label><figDesc>Figure 3: Performance of our 5 submitted runs on both main and generalization tasks. Note that, following this year's guideline, a specially designed MAP was used (MAP2014 [1]).</figDesc><graphic coords="2,349.00,64.58,189.37,98.26" type="bitmap" /></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>This work was supported in part by a National 863 Program (#2014AA015101), the National Natural Science Foundation of China (#61201387), and the Science and Technology Commission of Shanghai Municipality (#13PJ1400400, #13511504503, #12511501602).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">The MediaEval 2014 Affect Task: Violent Scenes Detection</title>
		<author>
			<persName><forename type="first">M</forename><surname>Sjöberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-G</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">L</forename><surname>Quang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schedl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-H</forename><surname>Demarty</surname></persName>
		</author>
		<editor>MediaEval</editor>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m">Workshop</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">Oct 16-17, 2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Trajectory-based modeling of human actions with motion reference points</title>
		<author>
			<persName><forename type="first">Y.-G</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-W</forename><surname>Ngo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ECCV</title>
				<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">EM Algorithms for PCA and SPCA</title>
		<author>
			<persName><forename type="first">S</forename><surname>Roweis</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1998">1998</date>
			<publisher>NIPS</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Action Recognition With Improved Trajectories</title>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Schmid</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICCV</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">On space-time interest points</title>
		<author>
			<persName><forename type="first">I</forename><surname>Laptev</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IJCV</title>
		<imprint>
			<biblScope unit="volume">64</biblScope>
			<biblScope unit="page" from="107" to="123" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Exploring Inter-feature and Inter-class Relationships with Deep Neural Networks for Video Classification</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-G</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xue</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2014">2014</date>
			<publisher>ACM MM</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Challenge Huawei Challenge: Fusing Multimodal Features with Deep Neural Networks for Mobile Video Annotation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Tu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-G</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xue</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2014">2014</date>
			<publisher>ICME</publisher>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
