<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">MTM at MediaEval 2014 Violence Detection Task</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Bruno</forename><surname>do Nascimento Teixeira</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Universidade Federal de Minas Gerais</orgName>
								<address>
									<settlement>Belo Horizonte</settlement>
									<country key="BR">Brazil</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">MTM at MediaEval 2014 Violence Detection Task</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">D243063AFD71F777239CDE023FF25855</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T16:12+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper describes the participation of team MTM in the Violent Scenes Detection (VSD) task of the MediaEval 2014 campaign. We propose an approach to violence detection based on probabilistic graphical models, using Mel-frequency cepstral coefficients (MFCCs) as audio features. In our approach, we employ Dynamic Bayesian Networks (DBNs) to represent a violent scene as a dynamic system.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>The goal of the Violent Scenes Detection (VSD) task of the MediaEval 2014 benchmarking campaign is to detect violence in movies <ref type="bibr" target="#b4">[5]</ref>. This year the organizers of the VSD task released two datasets: (i) a set of 31 Hollywood movies, of which 24 are used for training and 7 for testing (our focus); and (ii) a YouTube set, composed of 86 violent and non-violent videos. Violence is defined as content that "one would not let an 8 years old child see in a movie because it contains physical violence". In related work, a model based on the variable-duration hidden Markov model is proposed to detect complex events in Internet videos using latent variables <ref type="bibr" target="#b5">[6]</ref>. The authors of <ref type="bibr" target="#b0">[1]</ref> propose an audio-visual approach to video genre classification using content descriptors that exploit audio, color, temporal, and contour information, and report good results compared with other existing approaches when these descriptors are combined. In <ref type="bibr" target="#b1">[2]</ref>, the temporal structure of broadcast tennis video is recovered using hidden Markov models (HMMs); the trained HMM is then used to analyze the temporal interleaving of shots.</p><p>We propose to model video based on its temporal structure and the principle of causality using Dynamic Bayesian Networks (DBNs).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">METHOD</head><p>For this year's benchmark, we have developed an acoustic system based on temporal data (MFCC vector). The main idea behind this approach is to represent a violent scene as a dynamic system.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Dynamic Bayesian Network</head><p>A DBN (see Figure <ref type="figure">1</ref>) is a state-space model of a random variable V_t <ref type="bibr" target="#b2">[3]</ref>:</p><formula xml:id="formula_0">V_t = (U_t, X_t, Y_t),<label>(1)</label></formula><p>where U_t represents the hidden, X_t the input and Y_t the output variables. A pair (B_1, B_2) defines a DBN, where B_1 and B_2 are BNs. The two-slice temporal Bayes net B_2 (the DBN unrolled for 2 slices) defines P(V_t | V_{t-1}):</p><formula xml:id="formula_2">P(V_t \mid V_{t-1}) = \prod_{i=1}^{N} P(V_t^i \mid Pa(V_t^i)),<label>(2)</label></formula><p>where V_t^i is the i-th node in slice t and Pa(V_t^i) are its parents in the net. Next, our acoustic feature detector is described.</p><p>Figure <ref type="figure">1</ref>: A graphical-model view of a DBN unrolled for 4 slices (i−2 to i+1), with hidden state sequence U and an observed node X.</p><p>Copyright is held by the author/owner(s).</p><p>MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Acoustic Feature Detector</head><p>Our audio concept detector is based on MFCCs. The audio signal is segmented into overlapping acoustic frames, each of which groups samples using a window of fixed length. We split the audio signal into frames of 40 ms length with 20 ms overlap, and apply a Hamming window to each frame. The Hamming function is given by:</p><formula xml:id="formula_3">w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N - 1}\right),<label>(3)</label></formula><p>where N is the window length in samples and n = 0, ..., N − 1. For each audio frame, 12 MFCCs (range 133 Hz-6855 Hz) and their first and second derivatives are computed to build a 36-dimensional acoustic vector y_j:</p><formula xml:id="formula_4">y_j = (y_j^1, y_j^2, \ldots, y_j^{36}).<label>(4)</label></formula></div>
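<div xmlns="http://www.tei-c.org/ns/1.0"><p>The framing and windowing step can be illustrated with a minimal sketch. It assumes a hypothetical 16 kHz sample rate and a synthetic tone as input (neither is stated in this paper), the helper names are illustrative, and the MFCC computation itself is left out.</p>

```python
import math

def hamming_window(length):
    """Hamming window w(n) = 0.54 - 0.46 cos(2 pi n / (N - 1)), n = 0..N-1."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def split_into_frames(samples, sample_rate, frame_ms=40, overlap_ms=20):
    """Cut a signal into overlapping frames and apply a Hamming window to each."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * (frame_ms - overlap_ms) / 1000)
    window = hamming_window(frame_len)
    frames = []
    # Only full frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames

# Toy input (assumption): one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
signal = [math.sin(2.0 * math.pi * 440.0 * t / sample_rate)
          for t in range(sample_rate)]
frames = split_into_frames(signal, sample_rate)
print(len(frames), len(frames[0]))
```

<p>With these assumed settings, each frame holds 640 samples and consecutive frames start 320 samples apart, so one second of audio yields 49 full frames.</p></div>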
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Bag of Audio Words representation</head><p>After feature extraction, one way of representing audio is a feature vector model based on Bag of Audio Words (BoAW). In this representation, each audio segment is described by a vector whose size equals the vocabulary size, with each vocabulary word corresponding to one position in the vector. The i-th value of the vector for an audio segment equals the number of occurrences of word i in that segment.</p></div>
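<div xmlns="http://www.tei-c.org/ns/1.0"><p>The counting step described above can be sketched as follows. The paper does not say how the vocabulary is built or how frames are mapped to words, so this sketch assumes a given codebook of centroids and nearest-centroid assignment. The tiny two-dimensional codebook and all function names are hypothetical.</p>

```python
def nearest_word(vector, codebook):
    """Return the index of the codebook entry closest to vector (squared Euclidean)."""
    best_index, best_dist = 0, float("inf")
    for index, word in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, word))
        if best_dist > dist:
            best_index, best_dist = index, dist
    return best_index

def boaw_histogram(segment_vectors, codebook):
    """Count how often each audio word occurs among the frames of one segment."""
    histogram = [0] * len(codebook)
    for vector in segment_vectors:
        histogram[nearest_word(vector, codebook)] += 1
    return histogram

# Toy data (assumption): 3 audio words and 4 frames, 2 dimensions each.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
segment = [[0.1, -0.2], [0.9, 1.2], [1.1, 0.8], [4.7, 5.3]]
histogram = boaw_histogram(segment, codebook)
print(histogram)
```

<p>The histogram always has one entry per vocabulary word, whatever the number of frames in the segment. In the runs reported in this paper the vocabulary has 128 audio words and the frame vectors have 36 components.</p></div>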
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">SUBMITTED RUNS</head><p>For each run, a naive DBN is trained using one of two different observed vectors Y_t: (i) the acoustic vector y_j, and (ii) the BoAW vector by_j with 128 audio words (see Figure <ref type="figure">2</ref>). The likelihood of a model M, P(y_{1:T} | M), is used to assign a sequence y_{1:T} to the non-violent or violent label as follows:</p><formula xml:id="formula_5">M^{*}(y_{1:T}) = \arg\max_{M} P(y_{1:T} \mid M)\, P(M).<label>(5)</label></formula><p>The Bayes Net Toolbox for Matlab (BNT) <ref type="bibr" target="#b3">[4]</ref> is used to train the dynamic networks.</p></div>
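<div xmlns="http://www.tei-c.org/ns/1.0"><p>The decision rule above can be illustrated with a short sketch that works in the log domain, where the product P(y_{1:T} | M) P(M) becomes a sum. The log-likelihood values and priors below are made-up numbers for illustration only. In the described system the likelihoods would come from the trained DBNs, which this sketch does not implement.</p>

```python
import math

def classify(log_likelihoods, priors):
    """Pick the model M that maximizes log P(y_1:T | M) + log P(M)."""
    scores = {label: log_likelihoods[label] + math.log(priors[label])
              for label in log_likelihoods}
    return max(scores, key=scores.get), scores

# Hypothetical per-model log-likelihoods of one sequence and class priors.
log_likelihoods = {"violent": -1204.3, "non-violent": -1201.9}
priors = {"violent": 0.2, "non-violent": 0.8}

label, scores = classify(log_likelihoods, priors)
print(label)
```

<p>Working with log-probabilities avoids numerical underflow for long sequences and leaves the arg max unchanged, since the logarithm is monotonic.</p></div>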
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">RESULTS AND DISCUSSION</head><p>Table <ref type="table" target="#tab_0">1</ref> shows the Mean Average Precision (MAP), MAP2014 and MAP@100 for each test movie. The DBN with BoAW and the DBN without it have similar performance. Both approaches (run #1 and run #2) fail to detect violent scenes in the movie "8 Mile". Run #2 scores higher on the movies "BRAVEHEART", "DESPERADO", "GHOST IN THE SHELL" and "V FOR VENDETTA", but lower on "TERMINATOR 2" and "JUMANJI", in comparison with run #1 (on both the MAP@100 and MAP2014 metrics). Run #2 uses the BoAW representation, which has fewer observations (temporal segments) than run #1, which directly uses the acoustic feature vectors built from MFCCs. Our best result is 16.51% (MAP@100) or 2.64% (MAP2014) for run #2 (see Table <ref type="table" target="#tab_1">2</ref>). After investigating the results, we presume that BoAW removes noisy observations while reducing the number of observations per segment. This might be related to the "grouping" of observations when the BoAW is computed for a temporal segment (see Figure <ref type="figure">2</ref>). Thus, BoAW removes data noise and builds a better representation of a scene (model observation). However, the results are still very poor. We suppose this is due to the features: MFCCs alone do not seem capable of distinguishing all violent from non-violent segments or of generalizing the concept of violence. Future work will focus on capturing the causality in violent segments using different network structures and other feature modalities (feature selection).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">ACKNOWLEDGMENTS</head><p>This work was supported in part by two grants from CAPES and CNPq.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Performance of DBNs for the violence detection task at MediaEval 2014.</figDesc><table><row><cell>Source</cell><cell cols="3">run #1 DBN</cell><cell cols="3">run #2 DBN BoAW</cell></row><row><cell></cell><cell>Mean Average Precision (MAP)</cell><cell>Mean Average Precision 2014 (MAP2014)</cell><cell>Mean Average Precision at 100 (MAP@100)</cell><cell>Mean Average Precision (MAP)</cell><cell>Mean Average Precision 2014 (MAP2014)</cell><cell>Mean Average Precision at 100 (MAP@100)</cell></row><row><cell>8 MILE</cell><cell>0.0000</cell><cell>0.0000</cell><cell>0.0000</cell><cell>0.0000</cell><cell>0.0000</cell><cell>0.0000</cell></row><row><cell>BRAVEHEART</cell><cell>0.0429</cell><cell>0.0029</cell><cell>0.0369</cell><cell>0.0572</cell><cell>0.0149</cell><cell>0.2977</cell></row><row><cell>DESPERADO</cell><cell>0.1875</cell><cell>0.0159</cell><cell>0.1407</cell><cell>0.2165</cell><cell>0.0173</cell><cell>0.1635</cell></row><row><cell>GHOST IN THE SHELL</cell><cell>0.1018</cell><cell>0.0125</cell><cell>0.0458</cell><cell>0.1401</cell><cell>0.0423</cell><cell>0.1970</cell></row><row><cell>JUMANJI</cell><cell>0.0480</cell><cell>0.0235</cell><cell>0.1000</cell><cell>0.0443</cell><cell>0.0180</cell><cell>0.0307</cell></row><row><cell>TERMINATOR 2</cell><cell>0.1974</cell><cell>0.0518</cell><cell>0.1993</cell><cell>0.1113</cell><cell>0.0133</cell><cell>0.0295</cell></row><row><cell>V FOR VENDETTA</cell><cell>0.1201</cell><cell>0.0364</cell><cell>0.1432</cell><cell>0.0985</cell><cell>0.0794</cell><cell>0.4311</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 2 :</head><label>2</label><figDesc>Given a video, we split into segments and build BoAW histograms for each segment.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Global results for the violence detection task at MediaEval 2014.</figDesc><table><row><cell>Run</cell><cell>MAP@100</cell><cell>MAP2014</cell></row><row><cell>#1 (MFCC-DBN)</cell><cell>9.51 %</cell><cell>2.04 %</cell></row><row><cell>#2 (MFCC-BoAW-DBN)</cell><cell>16.51 %</cell><cell>2.64 %</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Video genre categorization and representation using audio-visual information</title>
		<author>
			<persName><forename type="first">B</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Seyerlehner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Rasche</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Vertan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Lambert</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Electronic Imaging</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="23017" to="23018" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Temporal structure analysis of broadcast tennis video using hidden markov models</title>
		<author>
			<persName><forename type="first">E</forename><surname>Kijak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Oisel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gros</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Storage and Retrieval for Media Databases</title>
				<editor>
			<persName><forename type="first">M</forename><forename type="middle">M</forename><surname>Yeung</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Lienhart</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C.-S</forename><surname>Li</surname></persName>
		</editor>
		<imprint>
			<publisher>SPIE</publisher>
			<date type="published" when="2003">2003</date>
			<biblScope unit="volume">5021</biblScope>
			<biblScope unit="page" from="289" to="299" />
		</imprint>
	</monogr>
	<note>SPIE Proceedings</note>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Dynamic Bayesian Networks: Representation, Inference and Learning</title>
		<author>
			<persName><forename type="first">K</forename><surname>Murphy</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2002-07">July 2002</date>
		</imprint>
		<respStmt>
			<orgName>UC Berkeley, Computer Science Division</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">PhD thesis</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">The bayes net toolbox for matlab</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">P</forename><surname>Murphy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computing Science and Statistics</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">The MediaEval 2014 Affect Task: Violent Scenes Detection</title>
		<author>
			<persName><forename type="first">M</forename><surname>Sjöberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Quang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schedl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Demarty</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">MediaEval 2014 Workshop</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">October 16-17 2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Learning latent temporal structure for complex event detection</title>
		<author>
			<persName><forename type="first">K</forename><surname>Tang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, CVPR &apos;12</title>
				<meeting>the 2012 IEEE Conference on Computer Vision and Pattern Recognition, CVPR &apos;12<address><addrLine>Washington, DC, USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="1250" to="1257" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
