<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Background knowledge for ontology construction</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Blaz</forename><surname>Fortuna</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Marko</forename><surname>Grobelnik</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Dunja</forename><surname>Mladenic</surname></persName>
						</author>
						<title level="a" type="main">Background knowledge for ontology construction</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">23638A59F4EBBCD28C377C60BDC99769</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T21:50+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper we describe a solution for incorporating background knowledge into the OntoGen system for semi-automatic ontology construction. This makes it easier for different users to construct different and more personalized ontologies for the same domain. To achieve this we introduce a word weighting schema to be used in the document representation. The weighting schema is learned from the background knowledge provided by the user. It is then used by OntoGen's machine learning and text mining algorithms.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>When using ontology-based techniques for knowledge management it is important for the ontology to capture the domain knowledge in a proper way. Different tasks and users often require the knowledge to be encoded into the ontology in different ways, depending on the task. For instance, the same document database in a company may be viewed differently by marketing, management, and technical staff. It is therefore crucial to develop techniques for incorporating the user's background knowledge into ontologies.</p><p>In <ref type="bibr" target="#b3">[4]</ref> we introduced OntoGen, a system for semi-automatic construction of topic ontologies. A topic ontology consists of a set of topics (or concepts) and a set of relations between the topics which best describe the data. The OntoGen system helps the user by discovering possible concepts and relations between them within the data.</p><p>In this paper we propose a method which extends the OntoGen system so that the user can supervise the concept discovery methods by providing background knowledge: his specific view on the data used by the text mining algorithms in the system.</p><p>To encode the background knowledge we require the user to group documents into categories. These categories do not need to describe the data in detail; the important thing is that they show the system the user's view of the data, i.e. which documents are similar and which are different from the user's perspective. The process of manually labelling the documents with categories is time consuming but can be significantly sped up by the use of active learning <ref type="bibr" target="#b4">[5]</ref>, <ref type="bibr" target="#b7">[8]</ref>.</p><p>Another source of such labeled data are popular online tagging services (e.g. del.icio.us), which allow users to label the websites of their interest with labels of their own choosing.</p><p>This paper is organized as follows. In Section 2 we introduce the OntoGen system and in Section 3 we derive the algorithm for calculating word weights. We conclude the paper with some preliminary results in Section 4.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">ONTOGEN</head><p>OntoGen <ref type="bibr" target="#b3">[4]</ref> is a system for semi-automatic ontology construction; a screenshot of the tool is presented in Figure <ref type="figure" target="#fig_0">1</ref>. An important part of OntoGen are the methods for discovering concepts from a collection of documents. For the representation of the documents we use the well established bag-of-words representation, which heavily relies on the weights associated with the words. The word weights are commonly calculated by the so-called TFIDF weighting. We argue that this provides just one of the possible views on the data, and propose an alternative word weighting that takes into account the background knowledge which provides the user's view on the documents.</p><p>OntoGen discovers concepts using Latent Semantic Indexing (LSI) <ref type="bibr" target="#b2">[3]</ref> and k-means clustering <ref type="bibr" target="#b5">[6]</ref>. LSI is a method for linear dimensionality reduction which learns an optimal sub-basis for approximating the documents' bag-of-words vectors; the sub-basis vectors are treated as concepts. The k-means method discovers concepts by clustering the documents' bag-of-words vectors into k clusters, where each cluster is treated as a concept.</p><p>Both methods heavily rely on the representation of the documents: the document representation provides the vectors which LSI tries to approximate, and the clustering algorithm is based on document similarity, which also depends on the document representation.</p><p>By incorporating background knowledge directly into the document representation via word weighting that reflects the similarity between the documents, we enable our methods to discover concepts which resemble the view that the user has on the data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">WORD WEIGHTING</head></div>
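The two concept-discovery routes described above, LSI and k-means over bag-of-words vectors, can be sketched on a toy document-term matrix. This is a rough Python/NumPy illustration, not OntoGen's actual implementation; the matrix and cluster count are made up for the example.

```python
import numpy as np

# Toy document-term matrix (rows = documents, columns = words).
# In OntoGen these would be the (weighted) bag-of-words vectors.
X = np.array([
    [2.0, 1.0, 0.0, 0.0],   # documents 0-1 use words 0-1
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],   # documents 2-3 use words 2-3
    [0.0, 0.0, 1.0, 2.0],
])

# --- LSI: truncated SVD of the document-term matrix.
# The strongest right-singular vectors act as "concept" directions.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
concepts = Vt[:2]            # the two strongest latent concepts

# --- k-means (k = 2) on the document vectors: each cluster = a concept.
rng = np.random.default_rng(0)
centroids = X[rng.choice(len(X), size=2, replace=False)]
for _ in range(10):
    # assign every document to its nearest centroid ...
    labels = np.argmin(((X[:, None] - centroids) ** 2).sum(axis=2), axis=1)
    # ... and move each centroid to the mean of its documents
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

print(labels)  # documents 0-1 and 2-3 end up in different clusters
```

Both methods see only the vectors in `X`, which is why reweighting the words (the topic of this section) changes which concepts they discover.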
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Bag-of-Words and Cosine Similarity</head><p>The most commonly used representation of documents in text mining is the bag-of-words representation. Let V = {w_1, ..., w_n} be the vocabulary of words, and let TF_k be the number of occurrences of the word w_k in the document. In the bag-of-words representation a single document is encoded as a vector x with elements corresponding to the words from the vocabulary, i.e. x^k = TF_k. These vectors are in general very sparse, since the number of different words that appear in the whole collection is usually much larger than the number of different words that appear inside one specific document. The measure usually used to compare text documents is the cosine similarity, defined as the cosine of the angle between two documents' bag-of-words vectors,</p><formula xml:id="formula_0">$\mathrm{sim}(x_i, x_j) = \frac{\sum_{k=1}^{n} x_i^k x_j^k}{\sqrt{\sum_{k=1}^{n} x_i^k x_i^k} \sqrt{\sum_{k=1}^{n} x_j^k x_j^k}}$<label>(1)</label></formula><p>The performance of both the bag-of-words representation and the cosine similarity can be significantly improved by introducing word weights. Each word from the vocabulary V is assigned a weight, and the elements of the vectors x_i are multiplied by the corresponding weights.</p><p>As already mentioned, our approach is based on the word weights being the key to viewing the same data from different angles. We can use the weights to store the background knowledge, since the weights define which words are important.</p></div>
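The bag-of-words encoding and the cosine similarity of equation (1) can be sketched as follows; a minimal Python illustration in which the sparse vectors are plain word-count maps, with made-up example sentences.

```python
import math
from collections import Counter

def bow(tokens):
    """Bag-of-words vector as a sparse {word: term frequency} map."""
    return Counter(tokens)

def cosine(x, y):
    """Cosine of the angle between two sparse bag-of-words vectors."""
    dot = sum(x[w] * y[w] for w in x if w in y)
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny)

d1 = bow("the stock market rose".split())
d2 = bow("the stock market fell".split())
d3 = bow("machine learning for ontologies".split())

print(cosine(d1, d2))  # high: three of the four words are shared
print(cosine(d1, d3))  # 0.0: no words in common
```

Multiplying each count by a per-word weight before calling `cosine` is exactly the lever the rest of this section uses to encode the user's view.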
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">TFIDF</head><p>Most of the research on word weighting schemas was traditionally done in the information retrieval community. A typical goal in information retrieval is to find the most relevant document from the document collection for a given query. Many popular methods from information retrieval are based on measuring the cosine similarity between the documents and a query, and their performance can be significantly improved by appropriate weighting of the words.</p><p>Most of the popular methods for this task developed in the last decades do not involve learning. Word weights are calculated by predefined formulas from basic statistics of the word frequencies inside the document and inside the whole document collection <ref type="bibr" target="#b9">[10]</ref>. These methods are based on intuition and experimental validation.</p><p>The most widely used is the TFIDF weighting schema <ref type="bibr" target="#b9">[10]</ref>, which defines the elements of the bag-of-words vectors with the following formula:</p><formula xml:id="formula_1">$x_i^k = TF_k \cdot \log(N \cdot IDF_k)$<label>(2)</label></formula><p>Here N is the number of documents in the collection and IDF_k is the inverse document frequency of the word w_k. The intuition behind this weighting schema is that words which occur very often are not so important for determining whether a pair of documents is similar, while a less frequent word occurring in both documents is a strong sign of similarity. The TFIDF weighting can be easily modified to include category information by replacing IDF and the number of documents with ICF and the number of categories.</p><p>There are many extensions of this schema, the most famous being the Okapi weighting schema <ref type="bibr" target="#b8">[9]</ref>, which we skip here since it does not incorporate category information.</p></div>
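A minimal sketch of the TFIDF schema of equation (2), assuming IDF_k = 1/DF_k where DF_k is the number of documents containing the word w_k; the three example documents are made up.

```python
import math
from collections import Counter

docs = [
    "stock market rose".split(),
    "stock market fell".split(),
    "ontology construction from text".split(),
]
N = len(docs)

# Document frequency DF_k: the number of documents containing word w_k.
df = Counter(w for d in docs for w in set(d))

def tfidf(doc):
    tf = Counter(doc)
    # x^k = TF_k * log(N / DF_k): words that occur in most documents
    # get a weight near zero, rare words keep a high weight.
    return {w: tf[w] * math.log(N / df[w]) for w in tf}

v = tfidf(docs[0])
print(v)  # "rose" (rare) outweighs "stock" and "market" (common)
```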
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">SVM Feature Selection</head><p>As we will see in the next section, a different approach to generating word weights can be taken, based on feature selection methods. Feature selection methods based on the Support Vector Machine (SVM) <ref type="bibr" target="#b1">[2]</ref> have been found to increase the performance of classification by discovering which words are important for determining the correct category of a document <ref type="bibr" target="#b0">[1]</ref>.</p><p>The method proceeds as follows. First, a linear SVM classifier is trained using all the features. Classification of a document is done by multiplying the document's bag-of-words vector with the normal vector computed by the SVM,</p><formula xml:id="formula_2">$x^T w = x_1 w_1 + x_2 w_2 + \ldots + x_n w_n$<label>(3)</label></formula><p>and if the result is above some threshold b then the document is considered positive. This process can also be seen as voting, where each word is assigned a vote weight w_i and, when a document is classified, each word from the document issues x_i w_i as its vote. All the votes are summed together to obtain the classification. A vote can be positive (the document should belong to the category) or negative (the document should not belong to the category).</p><p>A simple and naive way of selecting the most important words for a given category would be to select the words with the highest vote weights w_i for the category. It turns out to be more stable to select the words with the highest vote x_i w_i averaged over all the positive documents.</p><p>The votes w_i can also be interpreted as word weights, since they are higher for the words which better separate the documents according to the given categories.</p></div>
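The voting view of equation (3) and the averaged-vote ranking can be sketched as follows. The normal vector w and the documents are fixed by hand purely for illustration; in practice w would come from training a linear SVM on the labeled collection.

```python
import numpy as np

# Hypothetical vocabulary and SVM normal vector for one category.
vocab = ["stock", "market", "ontology", "learning"]
w = np.array([1.5, 1.0, -2.0, -0.5])   # stand-in for a trained SVM normal
b = 0.0                                 # classification threshold

# Bag-of-words vectors of the positive documents of the category.
positives = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 1.0],
])

# Classification = sum of the per-word votes x_i * w_i, compared to b.
votes = positives * w                   # one row of votes per document
scores = votes.sum(axis=1) - b

# Feature ranking: the average vote per word across the positive
# documents is more stable than ranking by w_i alone.
avg_votes = votes.mean(axis=0)
ranking = [vocab[i] for i in np.argsort(-avg_votes)]
print(ranking)  # words with the strongest positive average votes first
```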
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Word Weighting with SVM</head><p>The algorithm we developed for assigning weights using the SVM feature selection method is the following:</p><p>1. A classifier is calculated for each category from the document collection (one-vs-all method for multi-class classification). The TFIDF weighting schema can be used at this stage. The result is a set of SVM normal vectors W = {w_j; j = 1, . . . , m}, one for each category. 2. A weighting is calculated for each of the categories from its classifier weight vector. The weights are calculated by averaging the votes x_i w_i across all the documents from the category. Only weights with a positive average are kept, while the negative ones are set to zero. This results in a separate set of word weights for each category. By µ_k^j we denote the weight of the k-th word for the j-th category. 3. Weighted bag-of-words vectors are calculated for each document.</p><p>Let C(d_i) be the set of categories of a document d_i. The elements of the vector x_i are calculated in the following way:</p><formula xml:id="formula_3">$x_i^k = \left( \sum_{j \in C(d_i)} \mu_k^j \right) \cdot TF_k$<label>(4)</label></formula><p>This approach has another strong point: the weights are not only selected so that similarities correspond to the categories given by the user, they also depend on the context. Let us illustrate this on a sample document which contains the words "machine learning". If the document belonged to the category "learning", then the word "learning" would have a high weight and the word "machine" a low weight. However, if the same document belonged to the category "machine learning", then most probably both words would be found important by the SVM.</p></div>
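Steps 2 and 3 of the algorithm above can be sketched as follows, assuming step 1 has already produced one normal vector per category; the vectors, category names, and memberships here are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical inputs: TF vectors, one-vs-all SVM normal vectors
# (step 1 is assumed done), and each document's categories C(d_i).
tf = np.array([
    [2.0, 1.0, 0.0],
    [0.0, 1.0, 2.0],
])
normals = {                      # normal vector w_j for each category j
    "markets": np.array([1.0, 0.5, -1.0]),
    "tech":    np.array([-1.0, 0.2, 1.5]),
}
cats = {0: ["markets"], 1: ["tech"]}   # C(d_i) for each document i

# Step 2: mu[j][k] = positive part of the vote x_i^k * w_j^k averaged
# over the documents of category j; negative averages are set to zero.
mu = {}
for j, w in normals.items():
    members = np.array([tf[i] for i in cats if j in cats[i]])
    mu[j] = np.maximum((members * w).mean(axis=0), 0.0)

# Step 3 (equation 4): x_i^k = (sum of mu[j][k] over C(d_i)) * TF_k.
def weight(i):
    total = sum(mu[j] for j in cats[i])
    return total * tf[i]

print(weight(0))  # document 0 reweighted through its "markets" view
```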
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">PRELIMINARY RESULTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Reuters RCV1 Dataset</head><p>As the document collection for testing our method we chose the Reuters RCV1 <ref type="bibr" target="#b6">[7]</ref> dataset. We chose it because each news article in the dataset has two different types of labels (categories): each article is assigned labels according to (1) the topics covered and (2) the countries involved. We used a subset of 5000 randomly chosen documents for the experiments.</p><p>A list of the 10 most frequent categories from the used subset of the RCV1 dataset is shown in Table <ref type="table" target="#tab_0">1</ref>. The statistics are for the subset used in the experiments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Results</head><p>In Figure <ref type="figure" target="#fig_2">2</ref> are the top 3 concepts discovered with the k-means algorithm for both word weighting schemas. Documents are also placed in different concepts. For example, take two documents talking about stock prices, one at the New York stock exchange and the other at the UK stock exchange. The New York document was placed in (1) the Market concept (the same as the UK document) and in (2) the USA concept, while the UK document was placed in (2) the Europe concept.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">CONCLUSION</head><p>In this paper we have presented a method for learning a document similarity measure through selecting appropriate word weights for the bag-of-words document representation model. We selected the word weights by training a linear SVM classifier for the given categories and then extracting the word weights from the hyperplane normal vector. The learned word weighting schema was used to adjust the concept discovery methods in the OntoGen system to the user's domain knowledge.</p><p>As part of future work we plan to extend this method to the text categorization task, where category information is known only for the documents from the training set.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1.</head><label>1</label><figDesc>Figure 1. Screenshot of the interactive system for constructing topic ontologies.</figDesc><graphic coords="1,312.47,489.50,230.39,172.67" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 .</head><label>2</label><figDesc>Figure 2. The top 3 discovered concepts for topic labels (left) and for country labels (right).</figDesc><graphic coords="3,49.45,432.76,230.40,106.47" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1.</head><label>1</label><figDesc>List of the 10 most frequent categories for the topics and countries views.</figDesc><table><row><cell cols="2">TOPICS VIEW</cell><cell></cell><cell cols="2">COUNTRIES VIEW</cell></row><row><cell>CCAT</cell><cell>corporate/industrial</cell><cell>46%</cell><cell>USA</cell><cell>33%</cell></row><row><cell>GCAT</cell><cell>government/social</cell><cell>30%</cell><cell>UK</cell><cell>11%</cell></row><row><cell>MCAT</cell><cell>markets</cell><cell>24%</cell><cell>Japan</cell><cell>6%</cell></row><row><cell>C15</cell><cell>performance</cell><cell>19%</cell><cell>Germany</cell><cell>4%</cell></row><row><cell>ECAT</cell><cell>economics</cell><cell>14%</cell><cell>France</cell><cell>4%</cell></row><row><cell>C151</cell><cell>accounts/earnings</cell><cell>10%</cell><cell>Australia</cell><cell>3%</cell></row><row><cell>M14</cell><cell>commodity/markets</cell><cell>10%</cell><cell>India</cell><cell>3%</cell></row><row><cell>C152</cell><cell>comment/forecast</cell><cell>9%</cell><cell>China</cell><cell>3%</cell></row><row><cell>GPOL</cell><cell>domestic politics</cell><cell>7%</cell><cell>EEC</cell><cell>3%</cell></row><row><cell>M13</cell><cell>money markets</cell><cell>7%</cell><cell>Hong Kong</cell><cell>2%</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Institute Jozef Stefan, Slovenia, email: {blaz.fortuna, marko.grobelnik, dunja.mladenic}@ijs.si</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGEMENTS</head><p>This work was supported by the Slovenian Research Agency and the IST Programme of the European Community under SEKT Semantically Enabled Knowledge Technologies (IST-1-506826-IP), NeOn Networked Ontologies (IST-2004-27595) and PASCAL Network of Excellence (IST-2002-506778). This publication only reflects the authors' views.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Feature selection using support vector machines</title>
		<author>
			<persName><forename type="first">J</forename><surname>Brank</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Grobelnik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Milic-Frayling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mladenic</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Third International Conference on Data Mining Methods and Databases for Engineering, Finance, and Other Fields</title>
				<meeting>the Third International Conference on Data Mining Methods and Databases for Engineering, Finance, and Other Fields<address><addrLine>Bologna, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2002-09">September 2002</date>
			<biblScope unit="page" from="25" to="27" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Cristianini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shawe-Taylor</surname></persName>
		</author>
		<title level="m">An introduction to support vector machines</title>
				<imprint>
			<publisher>Cambridge University Press</publisher>
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Indexing by Latent Semantic Analysis</title>
		<author>
			<persName><forename type="first">S</forename><surname>Deerwester</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dumais</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Furnas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Landauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Harshman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Society for Information Science</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="391" to="407" />
			<date type="published" when="1990">1990</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Semi-automatic construction of topic ontology</title>
		<author>
			<persName><forename type="first">B</forename><surname>Fortuna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mladenic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Grobelnik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ECML/PKDD Workshop on Knowledge Discovery for Ontologies</title>
				<meeting>the ECML/PKDD Workshop on Knowledge Discovery for Ontologies</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Automated knowledge discovery in advanced knowledge management</title>
		<author>
			<persName><forename type="first">M</forename><surname>Grobelnik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mladenic</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Knowledge Management</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="132" to="149" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Data Clustering: A Review</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">N</forename><surname>Murty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Flynn</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="264" to="323" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">RCV1: A New Benchmark Collection for Text Categorization Research</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Rose</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="361" to="397" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Text classification with active learning</title>
		<author>
			<persName><forename type="first">B</forename><surname>Novak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mladenic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Grobelnik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of GfKl</title>
				<meeting>GfKl</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Okapi at TREC-4</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Robertson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Walker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">M</forename><surname>Hancock-Beaulieu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gatford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Payne</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Fourth Text REtrieval Conference (TREC-4)</title>
				<imprint>
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Developments in Automatic Text Retrieval</title>
		<author>
			<persName><forename type="first">G</forename><surname>Salton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Science</title>
		<imprint>
			<biblScope unit="volume">253</biblScope>
			<biblScope unit="page" from="974" to="979" />
			<date type="published" when="1991">1991</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Pivoted Document Length Normalization</title>
		<author>
			<persName><forename type="first">A</forename><surname>Singhal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Buckley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mitra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 19th ACM SIGIR Conference on Research and Development in Information Retrieval</title>
				<meeting>the 19th ACM SIGIR Conference on Research and Development in Information Retrieval</meeting>
		<imprint>
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
