<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Document Relation System Based on Ontologies for the Security Domain</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Janine</forename><surname>Hellriegel</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Fraunhofer Institute for Open Communication Systems (FOKUS)</orgName>
								<address>
									<addrLine>Kaiserin Augusta Allee 31</addrLine>
									<postCode>10589</postCode>
									<settlement>Berlin</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Hans</forename><forename type="middle">Georg</forename><surname>Ziegler</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Fraunhofer Institute for Open Communication Systems (FOKUS)</orgName>
								<address>
									<addrLine>Kaiserin Augusta Allee 31</addrLine>
									<postCode>10589</postCode>
									<settlement>Berlin</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ulrich</forename><surname>Meissen</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Fraunhofer Institute for Open Communication Systems (FOKUS)</orgName>
								<address>
									<addrLine>Kaiserin Augusta Allee 31</addrLine>
									<postCode>10589</postCode>
									<settlement>Berlin</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Document Relation System Based on Ontologies for the Security Domain</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">7912D149683535222CAD9B8096924BD3</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:09+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Document Relation</term>
					<term>Security</term>
					<term>Ontology</term>
					<term>Semantic Relatedness</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Finding semantic similarity or semantic relatedness between unstructured text documents is an ongoing research field in the semantic web area. For larger text corpora, lexical matching (the matching of shared terms) is often applied. This approach does not consider semantically related terms and concepts. Moreover, documents that take heterogeneous perspectives on a domain cannot be related properly. In this paper, we present our ongoing work on a flexible and expandable system that handles text documents with different points of view, languages and levels of detail. The system is presented in the security domain but could be adapted to other domains. The decision-making process is transparent and the result is a ranked list.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The amount of information available on the Internet is growing day by day. It is difficult to keep an overview of the relevant data in a domain, especially if different views on the same topic are considered. An expert uses different words and a different level of detail than a lay user, even when both describe exactly the same concept. A database of documents written by people with different levels of expertise, language skills and ambitions poses a major challenge for a semantic search algorithm. Using long texts as search input provides a wider range of search terms, which is the basis for detecting a larger spectrum of documents. The relevant results are documents related to the query text document. A basic method to compare two text documents is the vector space model <ref type="bibr" target="#b1">[2]</ref>, which relates text similarity to the number of shared words. However, semantically related words are not considered. Knowledge-based similarity measures use larger document corpora and external networks such as WordNet or Wikipedia to analyze co-occurrences and relations. An overview of these techniques is presented in <ref type="bibr" target="#b2">[3]</ref>, but most of the methods only work for search queries of a few words. Even when all documents belong to one domain (e.g. the security domain), lexical matching and knowledge-based measures do not retrieve a sufficient number of related documents. Another measure, ontology-based matching, includes concepts and heterogeneous relations. Wang <ref type="bibr" target="#b6">[7]</ref> proposes a system that relates documents using the concepts found in WordNet. However, the measurement step still depends on words, and heterogeneous concepts cannot be related. 
In the security and safety domain only specialized ontologies exist <ref type="bibr" target="#b4">[5]</ref>, <ref type="bibr" target="#b5">[6]</ref>, which mainly focus on the security of information systems. An attempt to combine different ontologies was made by <ref type="bibr" target="#b0">[1]</ref>, but it could not express the diversity of the domain, which also addresses e.g. the security of citizens, infrastructures or utilities. As the cited references show, a system that searches for related text documents in a clear and traceable way has not yet been developed. At present, no ontology exists that matches the terminology of the whole security domain. Therefore, a new, more general ontology as well as a general system are being developed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">System of Semantically Related Documents</head><p>Terms are the basis for measuring the semantic relatedness between two documents. A terminology is built and used to compare all documents quickly and to determine their relations. The system is divided into three steps: keyword-based extraction, computation of the document relation with ontologies, and weighting and ranking. Figure <ref type="figure" target="#fig_0">1</ref> displays an overview of the whole system. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Preprocessing and Keyword-Based Extraction</head><p>In order to extract the valuable terms from the documents, a manually created keyword list is used and the term frequency of each keyword is determined for each document. Comparing the occurrence of the keywords gives a first measure of relatedness: the more terms two texts have in common, the more related they are. However, different views and special relations are not yet taken into account. To extract the keywords, all documents are first tokenized at the term level. Stemming algorithms are then used to reduce all terms in the documents, as well as all keywords, to their base form. The keyword list was developed by early-warning system experts together with civil protection and police specialists. It contains about 500 English words relevant to the security domain and can still be modified or extended. Automatic keyword extraction algorithms are not suitable, since they produce too much noise and cannot match the quality of the keyword list. All keywords can be translated semi-automatically, so the system supports different languages. Synonyms, categories and other semantically relevant words are added using BabelNet <ref type="bibr" target="#b3">[4]</ref>. For example, from the term video surveillance the terms surveillance camera, cctv and video home security system are derived. In total, a keyword list with over 4000 terms has been produced. This ensures that only domain-related terms are found.</p></div>
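<div xmlns="http://www.tei-c.org/ns/1.0"><p>The extraction step described in this section can be illustrated with a minimal Python sketch. It is not the implementation used in the system: the function names, the crude suffix-stripping toy_stem (a stand-in for a real stemming algorithm) and the sample keywords are assumptions made for illustration only.</p>
<p><code lang="python"><![CDATA[
```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(token):
    """Crude suffix stripping; an assumed stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text):
    """Tokenize and stem a text, returning a tuple of base forms."""
    return tuple(toy_stem(token) for token in tokenize(text))


def keyword_frequencies(document, keywords):
    """Count occurrences of each (possibly multi-word) keyword in a document."""
    tokens = normalize(document)
    frequencies = {}
    for keyword in keywords:
        stems = normalize(keyword)
        n = len(stems)
        if n == 0:
            continue
        # Slide a window of the keyword's length over the stemmed document.
        hits = sum(
            1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == stems
        )
        if hits:
            frequencies[keyword] = hits
    return frequencies


def shared_keywords(freq_a, freq_b):
    """Keywords found in both documents: a first, purely lexical measure."""
    return set(freq_a).intersection(freq_b)


# Hypothetical sample data, not taken from the actual keyword list.
sample_keywords = ["video surveillance", "body scanner", "cctv"]
sample_text = "Body scanners and video surveillance are used. The body scanner is new."
print(keyword_frequencies(sample_text, sample_keywords))
```
]]></code></p>
<p>Because both the documents and the keywords pass through the same normalization, inflected forms such as scanners and scanner are counted as the same keyword. A production system would presumably replace toy_stem with an established stemming algorithm.</p></div>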
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Computing the Document Relation with the Help of Ontologies</head><p>When related documents do not contain identical or similar terms, an ontology or terminological net can be used to improve the calculation of relatedness. The relation between a technical view and a user view can only be determined via a shared concept. Using the heterogeneous paths between the terms in the graph-based knowledge representation, new relations between the documents are revealed. Not only the distances in the terminological net are considered; the type of relation between the terms, such as is-a or part-of, also determines the relatedness of the text documents. In this way, related keywords can be found for each keyword detected in the query document. Texts containing the related keywords are most likely to correlate with the query document. A new ontology for the security domain is currently being built manually. It contains the original 500 keywords, relations from BabelNet and a taxonomy created by security researchers. The taxonomy is loosely based on a project categorization for the recent FP7 Cordis security call <ref type="bibr" target="#b7">[8]</ref>.</p></div>
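<div xmlns="http://www.tei-c.org/ns/1.0"><p>One possible way to combine path distance and relation type is sketched below in Python. The paper does not specify a scoring scheme, so everything here is an assumption: the relation weights, the treatment of relations as undirected edges, the product-of-weights path score, the min_score cut-off and the sample triples are all hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
import heapq

# Hypothetical weights per relation type; the paper gives no concrete values.
RELATION_WEIGHTS = {"synonym": 0.9, "is-a": 0.8, "part-of": 0.6}


def build_graph(triples):
    """Build an undirected adjacency map from (term, relation, term) triples."""
    graph = {}
    for source, relation, target in triples:
        weight = RELATION_WEIGHTS[relation]
        graph.setdefault(source, []).append((target, weight))
        graph.setdefault(target, []).append((source, weight))
    return graph


def related_terms(graph, start, min_score=0.3):
    """Return the terms reachable from start with their best path score.

    A path score is the product of the relation weights along the path, so
    both longer paths and weaker relation types lower the score. The search
    is a best-first (Dijkstra-style) traversal that maximizes this product.
    """
    best = {start: 1.0}
    heap = [(-1.0, start)]
    while heap:
        negative_score, term = heapq.heappop(heap)
        score = -negative_score
        if score < best.get(term, 0.0):
            continue  # outdated heap entry, a better path was found already
        for neighbour, weight in graph.get(term, []):
            candidate = score * weight
            if candidate >= min_score and candidate > best.get(neighbour, 0.0):
                best[neighbour] = candidate
                heapq.heappush(heap, (-candidate, neighbour))
    del best[start]
    return best


# Hypothetical excerpt of a terminological net.
sample_triples = [
    ("body scanner", "is-a", "screening device"),
    ("metal detector", "is-a", "screening device"),
    ("screening device", "part-of", "airport security"),
    ("airport security", "is-a", "security"),
]
sample_graph = build_graph(sample_triples)
for term, score in sorted(related_terms(sample_graph, "body scanner").items()):
    print(term, round(score, 3))
```
]]></code></p>
<p>Under these assumed weights, a keyword reached over two is-a relations scores higher than one reached over an is-a and a part-of relation, which mirrors the idea that the relation type, and not only the distance, influences relatedness.</p></div>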
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Weighting and Ranking</head><p>The result of the system is a ranked list of texts related to the query document. Two measures are used to rank the results: first, the weighting of the original keywords and, second, the type of relation between the keywords. Not all retrieved terms are equally important for distinguishing the texts. The term security is important but very general and can be found in many documents. Due to the low entropy of the term, it does not help to find unique relations. In contrast, the term body scanner is more useful for finding related documents. Term weighting with the tf-idf statistic <ref type="bibr" target="#b1">[2]</ref> is applied to identify significant terms. The FP7 Cordis security call project descriptions are used as the document corpus. Second, a relation between two specific keywords (body scanner and metal detector) is ranked higher than a relation between a specific keyword and a more general keyword (body scanner and airport security).</p></div>
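<div xmlns="http://www.tei-c.org/ns/1.0"><p>The effect of tf-idf weighting on the ranking can be illustrated with the following Python sketch. It assumes that each document is already reduced to a mapping from keyword to term frequency, uses the common idf definition log(N / df), and ranks by cosine similarity of the weighted vectors. The cosine step, the function names and the toy corpus are assumptions for illustration; the paper only states that tf-idf is used to weight the keywords.</p>
<p><code lang="python"><![CDATA[
```python
import math


def inverse_document_frequency(corpus):
    """Compute idf = log(N / df) for every keyword in the corpus.

    corpus is a list of dicts mapping keyword -> term frequency.
    """
    n_docs = len(corpus)
    document_frequency = {}
    for doc in corpus:
        for keyword in doc:
            document_frequency[keyword] = document_frequency.get(keyword, 0) + 1
    return {kw: math.log(n_docs / df) for kw, df in document_frequency.items()}


def tfidf(doc, idf):
    """Weight the term frequencies of one document by idf."""
    return {kw: tf * idf.get(kw, 0.0) for kw, tf in doc.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(kw, 0.0) for kw, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, corpus):
    """Return (score, document index) pairs, best match first."""
    idf = inverse_document_frequency(corpus)
    weighted_query = tfidf(query, idf)
    scored = [
        (cosine(weighted_query, tfidf(doc, idf)), index)
        for index, doc in enumerate(corpus)
    ]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical toy corpus; every document mentions the general term "security".
sample_corpus = [
    {"security": 3, "body scanner": 2},
    {"security": 5, "video surveillance": 1},
    {"security": 1, "metal detector": 2, "body scanner": 1},
]
sample_query = {"security": 4, "body scanner": 1}
for score, index in rank(sample_query, sample_corpus):
    print(index, round(score, 3))
```
]]></code></p>
<p>In this toy corpus the term security occurs in every document, so its idf is zero and it does not contribute to the ranking, whereas the rarer term body scanner decides the order. The second ranking criterion, the type of relation between keywords, is not covered by this sketch.</p></div>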
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Conclusion and Future Work</head><p>With the presented system, a ranked list of related documents can be retrieved, regardless of the view or level of detail they contain. The system describes a general sequence of functions and could be adapted to other domains if a corresponding keyword list and ontology are available. In the music domain, for example, artist profiles could be related to genre or instrument descriptions. The system is based on a simple method but achieves good results because it works close to the domain. In addition, it allows the results to be evaluated and makes it possible to understand why documents are identified as related. The system is still work in progress; the next steps are to complete the development of the ontology and to evaluate the chosen keywords. Further evaluations of accuracy and user satisfaction have to be performed.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1.</head><label>1</label><figDesc>Fig. 1. System overview with three steps to determine document relatedness</figDesc></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgement</head><p>This work has received funding from the Federal Ministry of Education and Research for the security research project "fit4sec" under grant agreement no. 13N12809.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Ontologies for Crisis Management: a Review of State of the Art in Ontology Design and Usability</title>
		<author>
			<persName><forename type="first">Shuangyan</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Duncan</forename><surname>Shaw</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christopher</forename><surname>Brewster</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Information Systems for Crisis Response and Management Conference</title>
				<meeting>the Information Systems for Crisis Response and Management Conference</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">Christopher</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Prabhakar</forename><surname>Raghavan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hinrich</forename><surname>Schütze</surname></persName>
		</author>
		<title level="m">Introduction to Information Retrieval</title>
				<imprint>
			<publisher>Cambridge University Press</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="volume">1</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Corpus-based and Knowledge-based Measures of Text Semantic Similarity</title>
		<author>
			<persName><forename type="first">Rada</forename><surname>Mihalcea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Courtney</forename><surname>Corley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Carlo</forename><surname>Strapparava</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">AAAI</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="775" to="780" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">BabelNet: Building a Very Large Multilingual Semantic Network</title>
		<author>
			<persName><forename type="first">Roberto</forename><surname>Navigli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Simone</forename><forename type="middle">Paolo</forename><surname>Ponzetto</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 48th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="216" to="225" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Security Ontology for Adaptive Mapping of Security Standards</title>
		<author>
			<persName><forename type="first">Simona</forename><surname>Ramanauskaite</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dmitrij</forename><surname>Olifer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nikolaj</forename><surname>Goranin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Antanas</forename><surname>Čenys</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computers Communications &amp; Control</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="issue">6</biblScope>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Ontologies for Security Requirements: A Literature Survey and Classification</title>
		<author>
			<persName><forename type="first">Amina</forename><surname>Souag</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Camille</forename><surname>Salinesi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Isabelle</forename><surname>Comyn-Wattiau</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advanced Information Systems Engineering Workshops</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="61" to="69" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Concept Forest: A New Ontology-assisted Text Document Similarity Measurement Method</title>
		<author>
			<persName><forename type="first">James</forename><forename type="middle">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">William</forename><surname>Taylor</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Web Intelligence, IEEE/WIC/ACM International Conference On</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="395" to="401" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<ptr target="http://cordis.europa.eu/fp7/security/home_en.html" />
		<title level="m">FP7 Cordis Project</title>
				<imprint/>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
