<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">An approach to unsupervised ontology term tagging of dependency-parsed text using a Self-Organizing Map (SOM)</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Seppo</forename><surname>Nyrkkö</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Digital Humanities</orgName>
								<orgName type="institution">University of Helsinki</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">An approach to unsupervised ontology term tagging of dependency-parsed text using a Self-Organizing Map (SOM)</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">73688F30B5C454B93C1FF4D034E487FB</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T21:03+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>I describe a machine-learning estimation method for term tagging that can learn semantic disambiguation. The model is trained with a Semantic Web ontology and a set of sample text documents in which concepts are tagged with references to that ontology. The method builds numeric representations, or embeddings, from a dependency analysis of the syntactic environment of the word being analyzed. In contrast to many modern data-driven neural models, this model uses a less data-hungry unsupervised clustering method, the Self-Organizing Map (SOM). Based on observations made with the experimental model, I suggest that the method can be used for populating ontologies with new concepts and terms, and for guessing the best-matching ontology concepts for the terms found.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Large amounts of written information flow through news, article databases and knowledge forums, and finding the required information often depends on choosing the right keywords. Semantic Web ontologies describe a vocabulary of concepts and terms that are useful for Information Retrieval in their specified domain.</p><p>Ontologies can improve the results of information search when multiple taxonomies of terms and keywords are used in composing a large document database. Such databases may cover, for instance, a multilingual, cultural or biological domain <ref type="bibr" target="#b0">[1]</ref>, where problems may be caused by diverse term variants, historical synonyms, misspellings and foreign terms.</p><p>Automated content analysis based on machine learning can reduce the amount of manual work in concept annotation and keyword tagging. Automatic concept tagging makes it possible to apply ontology-based retrieval methods that combine keyword search with concept-based search <ref type="bibr" target="#b1">[2]</ref>. This leads to better coverage and quality than standard information retrieval.</p><p>I suggest here a method in which a machine learning model is trained for semantic tagging. For demonstration purposes, a model is trained with a small annotated text that contains a set of examples of the terms described in the annotation ontology. In figure <ref type="figure" target="#fig_0">1</ref>  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">The Method</head><p>The method is intended to assist in adding semantic tags to individual sentences and paragraphs of new documents that are being added to the database. The input text of a new document is analyzed one sentence at a time with a dependency parser. The semantic similarity between terms in the new input and terms in the reference text (training data) is estimated from the similarity of their syntactic dependencies.</p><p>A high syntactic similarity is considered a possible semantic match. Furthermore, the method can be extended to detect a new term that has no proper match. The approach also finds the closest match for an out-of-vocabulary term, that is, a term not yet introduced in the current ontology.</p><p>By using an unsupervised machine learning method such as the Self-Organizing Map (SOM), we can even give a comprehensive, visual impression of the collection of articles available in the text database. A SOM is a neural network model that differs from most modern neural network architectures: it is less data-hungry and it is tolerant of noise in the training data <ref type="bibr" target="#b2">[3]</ref>. It can therefore also classify rare term occurrences that have no exact match in the training data set, by guessing the best partial match based on the syntactic features of the term. This makes it an interesting alternative model for learning features of terms associated with a set of ontology concepts.</p></div>
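<div xmlns="http://www.tei-c.org/ns/1.0"><p>The idea of tagging a term by its closest match on the map can be illustrated with a minimal sketch in plain Python. It is not the OntoR implementation: the function names (train_som, best_matching_unit, label_cells, suggest_concept), the toy three-dimensional vectors and the concept labels "Disease" and "Vector" are assumptions made for illustration, and the sketch uses a small rectangular grid with a Gaussian neighbourhood rather than the hexagonal lattice shown in Fig. 1.</p>
<p><code lang="python">
```python
import math
import random


def best_matching_unit(weights, vector):
    """Return the index of the cell whose weight vector is closest to `vector`."""
    def sq_dist(w):
        return sum((wi - vi) ** 2 for wi, vi in zip(w, vector))
    return min(range(len(weights)), key=lambda i: sq_dist(weights[i]))


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a small rectangular SOM; returns (weights, grid coordinates)."""
    rng = random.Random(seed)
    dim = len(data[0])
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    weights = [[rng.random() for _ in range(dim)] for _ in coords]
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # learning rate decays linearly
        radius = max(radius0 * (1.0 - frac), 0.5)  # neighbourhood shrinks over time
        for vector in rng.sample(data, len(data)):  # shuffled pass over the data
            br, bc = coords[best_matching_unit(weights, vector)]
            for i, (r, c) in enumerate(coords):
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_d2 / (2.0 * radius ** 2))  # Gaussian neighbourhood
                weights[i] = [w + lr * h * (v - w) for w, v in zip(weights[i], vector)]
    return weights, coords


def label_cells(weights, data, labels):
    """Map each winning cell index to the set of labels of its training vectors."""
    cell_labels = {}
    for vector, label in zip(data, labels):
        cell_labels.setdefault(best_matching_unit(weights, vector), set()).add(label)
    return cell_labels


def suggest_concept(weights, coords, cell_labels, vector):
    """Suggest labels from the labelled cell that is nearest on the grid to the BMU."""
    br, bc = coords[best_matching_unit(weights, vector)]
    nearest = min(
        cell_labels,
        key=lambda i: (coords[i][0] - br) ** 2 + (coords[i][1] - bc) ** 2,
    )
    return cell_labels[nearest]


# Toy data: two hypothetical concepts with clearly different feature vectors.
data = [[1.0, 0.9, 0.0], [0.9, 1.0, 0.1], [0.0, 0.1, 1.0], [0.1, 0.0, 0.9]]
labels = ["Disease", "Disease", "Vector", "Vector"]

weights, coords = train_som(data, rows=3, cols=3)
cell_labels = label_cells(weights, data, labels)

# An unseen vector is tagged with the labels of the nearest labelled cell.
print(sorted(suggest_concept(weights, coords, cell_labels, [0.95, 0.95, 0.05])))
```
</code></p>
<p>Because the suggestion is taken from the nearest labelled cell on the grid, a vector with no exact match in the training data still receives a candidate concept, which corresponds to the partial-match behaviour described above.</p></div>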
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Sample experiment</head><p>In the experiment, the sentences of the text corpora are processed with the Stanford Parser (Penn PCFG dependency model for English). The sentences are tokenized as part of the dependency parsing process. Each token (in its actual word form) in the sentences is indexed in the training sentence bank. The dependency arcs and bi-arcs are extracted from the parse output, and each arc forms a feature descriptor on the tagged word. The arcs are bidirectional, so that each dependency is tagged on both the head and the dependent word. The semantic features for individual word tokens are randomly projected indexes of the features produced by the Stanford Parser model. The syntactic context representation is very similar to the one used in Dependency-Based Word Embeddings <ref type="bibr" target="#b3">[4]</ref>.</p><p>The experimental OntoR tool was developed in the R statistical programming environment, using the CRAN library som, which is based on SOM_PAK, the Self-Organizing Map Program Package (version 3.1) <ref type="bibr" target="#b4">[5]</ref>. A screenshot of the OntoR user interface (figure <ref type="figure" target="#fig_1">2</ref>) demonstrates how the ontology-based term structure is reflected in a SOM containing the keywords. A modified plot of the SOM has been developed to explore the mapping of ontology term classes and super-classes over the term model trained with the sample corpus.</p><p>The areas of the resulting SOM grid reflect the taxonomical hierarchy, as seen in the mapping of ontology terms onto the unsupervised model that represents the training corpus. In multiple clusters, both terms and their subterms were categorized in the same map cell or its neighborhood. This supports the earlier hypothesis that a data point cluster with an internal topology, or structure, has a strong tendency to spread over multiple adjacent cells of the SOM lattice.</p></div>
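<div xmlns="http://www.tei-c.org/ns/1.0"><p>The feature construction described in this section can be sketched as follows. This is a hedged illustration in plain Python and not the OntoR code: the arc triples are hypothetical stand-ins for parser output, the feature naming (relation_word and relation^-1_word) follows the general style of dependency-based contexts, and the "randomly projected indexes" are read here as random indexing with sparse ternary index vectors, which is one plausible interpretation. Bi-arcs are left out for brevity, and the dimension (64) and number of non-zero entries (4) are arbitrary choices.</p>
<p><code lang="python">
```python
import hashlib
import random
from collections import defaultdict

# Hypothetical parser output: (head, relation, dependent) triples of one sentence.
ARCS = [
    ("transmit", "nsubj", "mosquitoes"),
    ("transmit", "dobj", "malaria"),
    ("mosquitoes", "amod", "infected"),
]


def arc_features(arcs):
    """Attach every arc to both its head and its dependent, marking the direction."""
    features = defaultdict(list)
    for head, rel, dep in arcs:
        features[head].append(f"{rel}_{dep}")      # arc as seen from the head
        features[dep].append(f"{rel}^-1_{head}")   # inverse arc as seen from the dependent
    return dict(features)


def index_vector(feature, dim=64, nonzero=4):
    """Deterministic sparse ternary index vector for one feature string."""
    digest = hashlib.sha256(feature.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))  # seed derived from the feature
    vector = [0] * dim
    for position in rng.sample(range(dim), nonzero):
        vector[position] = rng.choice((-1, 1))
    return vector


def embed(features, dim=64):
    """Sum the index vectors of a token's features into a single embedding."""
    total = [0] * dim
    for feature in features:
        for i, value in enumerate(index_vector(feature, dim)):
            total[i] += value
    return total


token_features = arc_features(ARCS)
embeddings = {token: embed(feats) for token, feats in token_features.items()}

print(token_features["mosquitoes"])
print(embeddings["mosquitoes"])
```
</code></p>
<p>Tokens that share dependency features share index vectors, so their summed embeddings overlap; vectors of this kind could then serve as the input data for a SOM.</p></div>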
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Related Work and Discussion</head><p>The WebSOM project <ref type="bibr" target="#b5">[6]</ref> inspired work towards unsupervised term learning and classification with the Self-Organizing Map. WebSOM learns from Internet-sourced text articles and extracts topics based on the tokens found in them. The work by Tanev and Magnini <ref type="bibr" target="#b6">[7]</ref> describes the main paradigms of weakly supervised ontology population: one is the term-pattern-related method and the other is context-sensitive triggering. The approach described here is a contextual extension of the WebSOM model, since it adds syntactic dependencies as additional information over the tokens found in the text. In this work, the suggested method for mapping concepts occurring in text onto the SOM grid will analogously support automatic tagging of new term candidates in document databases. This seems applicable especially to hyponyms (terms for subclasses) and to synonyms of previously categorized terms. In the next phase of the experiment, the internal weighting parameters for building numeric embeddings from syntactic analysis will be evaluated and analyzed in contrast to using plain word-based embeddings.</p><p>This method also appears applicable to weakly supervised ontology concept population, that is, to adding new term candidates, since some rare term occurrences were found in distinct areas of the SOM in the experiment. The aim of using the SOM in concept mining is also supported by the work of Honkela and Pöllä <ref type="bibr" target="#b7">[8]</ref>. The set of ontologies used with OntoR is not restricted to the medical domain seen in the sample experiment. The ontologies used can even cover multiple topics, for instance history, politics, science and culture.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1.</head><label>1</label><figDesc>Fig. 1. Left: A visualization of a hexagonal Self-Organizing Map (SOM) lattice with data clusters of hypothetical concepts A and B. The SOM lattice tends to cluster data points with similar features in cells next to each other. Right: A sample ontology, used for tagging terms in the experiment, shown as a Venn diagram.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2.</head><label>2</label><figDesc>Fig. 2. The user interface of the OntoR tool, showing a cell of the Self-Organizing Map (SOM), a matching token as a data point and its syntactic descriptor components. The sample text was based on the Malaria article downloaded from Wikipedia (May 2017) and a set of 200 PubMed article abstracts (100 for the keyword mosquito and 100 for malaria).</figDesc><graphic coords="3,137.60,116.83,340.16,112.08" type="bitmap" /></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgments: Research and development of the method and the OntoR tool have been supported by the MOLTO EU project and Whitelake Software Point. The suggested model and the inspection of the methods described here have been developed with the support of, and feedback from, Professor Timo Honkela and the Research Seminar in Language Technology held at the University of Helsinki.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Biological names and taxonomies on the semantic web-managing the change in scientific conception</title>
		<author>
			<persName><forename type="first">Jouni</forename><surname>Tuominen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nina</forename><surname>Laurenne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eero</forename><surname>Hyvönen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Semantic Web: Research and Applications</title>
				<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="255" to="269" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Aatos-a configurable tool for automatic annotation</title>
		<author>
			<persName><forename type="first">Minna</forename><surname>Tamper</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Petri</forename><surname>Leskinen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Esko</forename><surname>Ikkala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Arttu</forename><surname>Oksanen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eetu</forename><surname>Mäkelä</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Erkki</forename><surname>Heino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jouni</forename><surname>Tuominen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mikko</forename><surname>Koho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eero</forename><surname>Hyvönen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Language, Data and Knowledge</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="276" to="289" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Clustering of the self-organizing map</title>
		<author>
			<persName><forename type="first">Juha</forename><surname>Vesanto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Esa</forename><surname>Alhoniemi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Neural Networks</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="586" to="600" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Dependency-based word embeddings</title>
		<author>
			<persName><forename type="first">Omer</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoav</forename><surname>Goldberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACL (2)</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="302" to="308" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">SOM_PAK: The self-organizing map program package</title>
		<author>
			<persName><forename type="first">Teuvo</forename><surname>Kohonen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jussi</forename><surname>Hynninen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jari</forename><surname>Kangas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jorma</forename><surname>Laaksonen</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1996">1996</date>
		</imprint>
		<respStmt>
			<orgName>Helsinki University of Technology, Laboratory of Computer and Information Science</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Report A31</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Self-organizing maps of very large document collections: Justification for the WEBSOM method</title>
		<author>
			<persName><surname>Honkela</surname></persName>
		</author>
		<author>
			<persName><surname>Kaski</surname></persName>
		</author>
		<author>
			<persName><surname>Kohonen</surname></persName>
		</author>
		<author>
			<persName><surname>Lagus</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Classification, Data Analysis, and Data Highways</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="1998">1998</date>
			<biblScope unit="page" from="245" to="252" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Weakly supervised approaches for ontology population</title>
		<author>
			<persName><forename type="first">Hristo</forename><surname>Tanev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bernardo</forename><surname>Magnini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">11th Conference of the European Chapter of the Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Concept mining with self-organizing maps for the semantic web</title>
		<author>
			<persName><forename type="first">Timo</forename><surname>Honkela</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matti</forename><surname>Pöllä</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">WSOM</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="98" to="106" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
