<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">SpeeD @ MediaEval 2014: Spoken Term Detection with Robust Multilingual Phone Recognition</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Andi</forename><surname>Buzo</surname></persName>
							<email>andi.buzo@upb.ro</email>
							<affiliation key="aff0">
								<orgName type="laboratory">SpeeD Research Laboratory</orgName>
								<orgName type="institution">University Politehnica of Bucharest</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Horia</forename><surname>Cucu</surname></persName>
							<email>horia.cucu@upb.ro</email>
							<affiliation key="aff0">
								<orgName type="laboratory">SpeeD Research Laboratory</orgName>
								<orgName type="institution">University Politehnica of Bucharest</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Corneliu</forename><surname>Burileanu</surname></persName>
							<email>corneliu.burileanu@upb.ro</email>
							<affiliation key="aff0">
								<orgName type="laboratory">SpeeD Research Laboratory</orgName>
								<orgName type="institution">University Politehnica of Bucharest</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">SpeeD @ MediaEval 2014: Spoken Term Detection with Robust Multilingual Phone Recognition</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">B1B97FF2046A371F5CCDF674A440C239</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T16:11+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper, we address the Spoken Term Detection (STD) problem for under-resourced languages by phone recognition with a multilingual acoustic model covering three languages (Albanian, English and Romanian). Power Normalized Cepstral Coefficients (PNCC) are used as features for improved robustness to noise.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION AND APPROACH</head><p>We approach the Query by Example Search on Speech Task (QUESST) @ MediaEval 2014 <ref type="bibr" target="#b1">[1]</ref> using a multilingual acoustic model (AM) trained on three languages (Albanian, English and Romanian). The task involves searching for audio content within audio content using an audio query. The approach consists of two stages: (1) indexing, i.e. phone recognition of the content data, and (2) searching, i.e. finding, with a DTW-based search algorithm, a string of phones in the indexed content that is similar to that of the query.</p></div>
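The two-stage pipeline above can be sketched as follows. This is a minimal illustration under stated assumptions, not the team's actual implementation: the function names are hypothetical, and the toy demo replaces the phone recognizer and the DTW search with trivial stand-ins just to exercise the two stages.

```python
def index_content(content_audio, recognizer):
    """Stage 1 (indexing): run phone recognition on every content item,
    turning audio into a phone string keyed by utterance id."""
    return {uid: recognizer(audio) for uid, audio in content_audio.items()}

def search_query(query_phones, indexed, dtw_score, threshold):
    """Stage 2 (searching): score the query's phone string against each
    indexed item with a DTW-based search; keep items above the threshold."""
    hits = []
    for uid, phones in indexed.items():
        s = dtw_score(query_phones, phones)
        if s >= threshold:
            hits.append((uid, s))
    return sorted(hits, key=lambda h: -h[1])

# Toy demo: "recognition" is the identity function and the score is a
# crude substring check -- placeholders, not the real components.
indexed = index_content({"utt1": "kasa", "utt2": "mare"}, lambda a: a)
hits = search_query("asa", indexed, lambda q, c: 1.0 if q in c else 0.0, 0.5)
```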
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">The acoustic model</head><p>In our approach, we want to compare the effect of using a multilingual AM against monolingual AMs. To this end we built five acoustic models, described in Table <ref type="table" target="#tab_0">1</ref>. AM training and phoneme recognition are performed using Hidden Markov Models (HMMs). We built an AM for each language (AM1-AM3). AM1 is trained with 8.7 hours of Romanian read speech. More training data was available for Romanian (in the MediaEval 2013 evaluation campaign we used 64 hours <ref type="bibr" target="#b2">[2]</ref>), but this year we chose to train with less Romanian data in order to have a balanced training set across languages. AM2 is trained with 4.1 hours of Albanian read speech and broadcast news. AM3 is trained with 3.9 hours of native English read speech from the standard TIMIT database <ref type="bibr" target="#b3">[3]</ref>. All three languages are among those used in the MediaEval 2014 evaluation campaign <ref type="bibr" target="#b1">[1]</ref> (although the campaign's English, unlike TIMIT, is non-native). Hence, using more training data would go against the spirit of the competition, which targets low-resourced languages. AM4 is trained with all the data from the three languages; the phonemes of different languages, however, are trained separately, which leads to a large number of phonemes (145). AM5 is trained with the same data as AM4, but phonemes that are common to several languages are trained together, reducing the number of phonemes to 98, which is still high. Common phonemes were identified using the International Phonetic Alphabet (IPA) classification <ref type="bibr" target="#b0">[4]</ref>. It is interesting to note that Romanian and Albanian share more than 80% of their phonemes.</p><p>As for English, it shares many consonants with the other two languages, but its vowels are very different. AM6 is the model used by the SpeeD team in MediaEval 2013 and is included here for comparison <ref type="bibr" target="#b2">[2]</ref>. Two types of speech features are used in this work: the common Mel Frequency Cepstral Coefficients (MFCC) and the Power Normalized Cepstral Coefficients (PNCC).</p></div>
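The effect of pooling common phonemes (AM4's 145 separate units vs. AM5's 98 shared units) can be illustrated with a toy inventory. The mapping below is hypothetical and only demonstrates the counting; the real per-language phone sets are far larger:

```python
# Toy (language, phone) -> IPA label mapping; illustrative only.
ipa_map = {
    ("ro", "a"):  "a",  ("sq", "a"):  "a",   # vowel shared by Romanian and Albanian
    ("ro", "sh"): "ʃ",  ("sq", "sh"): "ʃ",   # shared postalveolar fricative
    ("en", "ae"): "æ",                        # English-specific vowel
}
separate_units = len(ipa_map)               # AM4-style: each (language, phone) is its own unit
shared_units = len(set(ipa_map.values()))   # AM5-style: phones with the same IPA label pooled
```

In the toy inventory the pooled model needs 3 units instead of 5; the same counting over the full inventories yields the 145 vs. 98 figures in the text.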
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Searching algorithm</head><p>If ASR accuracy were 100%, STD would reduce to a simple character-string search of the query within the textual content. As the experimental results show, we are far from this ideal case; hence we must find, within the content, a string that is merely similar to the query.</p><p>The DTW String Search (DTWSS) uses Dynamic Time Warping to align a string (the query) within the content. The search is not performed on the entire content, but only on a part of it, by means of a sliding window proportional to the length of the query. The term is considered detected if the DTW score is above a threshold. The method is refined by introducing penalizations for short queries and for the spread of the DTW match. The score s is given by equation (1):</p><formula xml:id="formula_0">s = (1 - PhER)\left(1 - \alpha\,\frac{L_{QM} - L_Q}{L_{QM} - L_{Qm}}\right)\left(1 - \beta\,\frac{L_S - L_Q}{L_W}\right)<label>1</label></formula><p>where LQ is the length of the query, LQM=18 and LQm=4 are the maximum and minimum query lengths found in the development data set, LW is the length of the sliding window, LS is the length of the matched term in the content, and α and β are tuning parameters, both set to 0.6 in this work. The penalizations in equation (1) are motivated by the assumption that, for two queries of different lengths that match their respective contents with the same phone error rate (PhER), the match of the longer query is more likely to be the right one. Similarly, more compact DTW matches are assumed to be more probable than longer ones. The algorithm is suitable for queries of types 1 and 2, because DTW inherently handles small variations from the query, but it is not suitable for queries of type 3, where word order may be inverted.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">EXPERIMENTAL RESULTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">STD results</head><p>The results obtained with the different acoustic models on the development data set are shown in Figure <ref type="figure" target="#fig_0">1</ref>. The comparison uses the Maximum Term-Weighted Value (MTWV) and Detection Error Tradeoff (DET) curves. The speech features used are the PNCCs. Among the acoustic models trained on a single language, the Romanian AM outperforms the other two, most probably because it is trained on more data (8.7h vs. ~4h). AM4 performs slightly better than the monolingual acoustic models: on the one hand, it is trained with multiple languages, which should increase phoneme recognition accuracy; on the other hand, its number of phonemes is significantly larger, which increases the uncertainty during recognition. AM5 addresses this latter aspect by no longer training common phonemes separately across languages, and the results show a corresponding improvement. However, the best results are obtained with AM6. Even though it is trained with only one language (Romanian), it is trained with a large amount of data (64h) and its phoneme set is relatively small (34). This suggests that larger phoneme sets require more training data. Regarding the STD task, it appears that training with multiple languages increases performance, but more data is needed to consolidate the acoustic models. The results obtained on the development database with different speech features (PNCC and MFCC) are shown in Table <ref type="table">2</ref>. The metric used is the normalized cross-entropy cost (Cnxe). The results show almost no difference between the two types of features; the same conclusion holds when comparing by the TWV metric. In general, PNCCs achieve better accuracy in noisy conditions, but, most probably, the noise in the MediaEval 2014 database is not significant. Therefore, the use of PNCC did not bring any improvement.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Official runs results</head><p>The results obtained by the official runs on the evaluation database are shown in Table <ref type="table" target="#tab_2">3</ref>; the metrics used are the actual and the minimum Cnxe. Because no tuning was performed on the development data set, the results on the evaluation data set are quite similar and support the same conclusions. Table <ref type="table" target="#tab_2">3</ref> also shows the results per query type. Better results are obtained for query type 2: these queries are longer than those of type 1, which may account for the difference. Query type 3 yields slightly worse performance, most probably because of the reordering of words in such queries. The results were obtained on a Xeon E5-2430, 6 cores, 2.20GHz, 48GB, under Linux Ubuntu 12.04.2 LTS. The Indexing Speed Factor (ISF), Searching Speed Factor (SSF) and Peak Memory Usage for indexing and searching (PMUi and PMUs), as described in <ref type="bibr" target="#b4">[5]</ref>, are almost the same for all runs (the runs differ only in the AM used). Their average values are ISF=0.81, SSF=1.2×10⁻⁵ s⁻¹, PMUi=2203MB, PMUs=197MB.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">CONCLUSIONS</head><p>We have approached STD as a two-step process: monolingual or multilingual ASR is used as a phone recognizer to index the database, while a DTW-based algorithm is used to search for a given query in the content database. The results show that training with multiple languages increases detection accuracy; however, the quantity of training data used is insufficient for such a large phoneme set. The searching algorithm works better for query types 1 and 2 and slightly worse for query type 3, where word order may be inverted. </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 .</head><label>1</label><figDesc>Figure 1. The results for the development data set</figDesc><graphic coords="2,52.56,88.40,233.76,205.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 . Training data</head><label>1</label><figDesc></figDesc><table><row><cell>ID</cell><cell>Language</cell><cell>No.</cell><cell>Training</cell></row><row><cell></cell><cell></cell><cell>phonemes</cell><cell>data [h]</cell></row><row><cell>AM1</cell><cell>Romanian</cell><cell>34</cell><cell>8.7</cell></row><row><cell>AM2</cell><cell>Albanian</cell><cell>36</cell><cell>4.1</cell></row><row><cell>AM3</cell><cell>English</cell><cell>75</cell><cell>3.9</cell></row><row><cell>AM4</cell><cell>Multilingual separate phones</cell><cell>145</cell><cell>16.7</cell></row><row><cell>AM5</cell><cell>Multilingual common phones</cell><cell>98</cell><cell>16.7</cell></row><row><cell>AM6</cell><cell>Romanian MediaEval 2013</cell><cell>34</cell><cell>64</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 . PNCC vs. MFCC performance comparison</head><label>2</label><figDesc></figDesc><table><row><cell>ID</cell><cell></cell><cell>PNCC</cell><cell></cell><cell>MFCC</cell></row><row><cell></cell><cell>ACnxe</cell><cell>MinCnxe</cell><cell>ACnxe</cell><cell>MinCnxe</cell></row><row><cell>AM1</cell><cell>1.032</cell><cell>0.986</cell><cell>1.032</cell><cell>0.986</cell></row><row><cell>AM2</cell><cell>1.055</cell><cell>0.997</cell><cell>1.055</cell><cell>0.997</cell></row><row><cell>AM3</cell><cell>1.03</cell><cell>0.994</cell><cell>1.03</cell><cell>0.994</cell></row><row><cell>AM4</cell><cell>1.015</cell><cell>0.972</cell><cell>1.016</cell><cell>0.971</cell></row><row><cell>AM5</cell><cell>1.016</cell><cell>0.969</cell><cell>1.016</cell><cell>0.969</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 . Official runs</head><label>3</label><figDesc></figDesc><table><row><cell></cell><cell>Overall</cell><cell>Type 1</cell><cell>Type 2</cell><cell>Type 3</cell></row><row><cell></cell><cell>A/Min Cnxe</cell><cell>A/Min Cnxe</cell><cell>A/Min Cnxe</cell><cell>A/Min Cnxe</cell></row><row><cell>AM1</cell><cell>1.032/0.990</cell><cell>1.035/0.990</cell><cell>1.027/0.982</cell><cell>1.039/0.992</cell></row><row><cell>AM2</cell><cell>1.053/0.997</cell><cell>1.057/0.999</cell><cell>1.046/0.994</cell><cell>1.052/0.995</cell></row><row><cell>AM3</cell><cell>1.027/0.990</cell><cell>1.029/0.991</cell><cell>1.024/0.983</cell><cell>1.032/0.994</cell></row><row><cell>AM4</cell><cell>1.017/0.977</cell><cell>1.019/0.976</cell><cell>1.012/0.973</cell><cell>1.018/0.974</cell></row><row><cell>AM5</cell><cell>1.017/0.972</cell><cell>1.019/0.972</cell><cell>1.016/0.970</cell><cell>1.017/0.963</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m">International Phonetic Alphabet (IPA)</title>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Query by Example Search on Speech at Mediaeval 2014</title>
		<author>
			<persName><forename type="first">X</forename><surname>Anguera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">J</forename><surname>Rodriguez-Fuentes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Szöke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Buzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Metze</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Proceedings of the Mediaeval 2014 Workshop</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date when="2014-10">October 16-17, 2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">SpeeD@MediaEval 2013: A Phone Recognition Approach to Spoken Term Detection</title>
		<author>
			<persName><forename type="first">Andi</forename><surname>Buzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Horia</forename><surname>Cucu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Iris</forename><surname>Molnar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bogdan</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Corneliu</forename><surname>Burileanu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Mediaeval 2013 Workshop</title>
				<meeting>Mediaeval 2013 Workshop<address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">TIMIT Acoustic-Phonetic Continuous Speech Corpus</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Garofolo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Linguistic Data Consortium</title>
				<meeting><address><addrLine>Philadelphia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1993">1993</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">MediaEval 2013 Spoken Web Search Task: System Performance Measures</title>
		<author>
			<persName><forename type="first">L.-J</forename><surname>Rodriguez-Fuentes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Penagarikano</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2013-05">May 2013</date>
		</imprint>
		<respStmt>
			<orgName>GTTS, UPV/EHU</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical report</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
