<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Semantic Annotation of Quantitative Textual Content</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Mehrnaz</forename><surname>Ghashghaei</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of New Brunswick</orgName>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">LS 3</orgName>
								<orgName type="laboratory">Laboratory for Systems, Software and Semantics</orgName>
								<orgName type="institution">Ryerson University</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">John</forename><surname>Cuzzola</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">LS 3</orgName>
								<orgName type="laboratory">Laboratory for Systems, Software and Semantics</orgName>
								<orgName type="institution">Ryerson University</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ebrahim</forename><surname>Bagheri</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">LS 3</orgName>
								<orgName type="laboratory">Laboratory for Systems, Software and Semantics</orgName>
								<orgName type="institution">Ryerson University</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ali</forename><forename type="middle">A</forename><surname>Ghorbani</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of New Brunswick</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Semantic Annotation of Quantitative Textual Content</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">E9192DCB0E230563E2FA511B7230635C</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T09:12+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Semantic annotation techniques provide the basis for linking textual content with concepts in well-grounded knowledge bases. Despite their many application areas, current semantic annotation systems share a prominent limitation: none of them is able to identify and disambiguate quantitative (numerical) content. Textual documents such as Web pages, especially technical ones, contain a great deal of quantitative information, such as product specifications, that needs to be semantically qualified. In this paper, we propose an approach for annotating quantitative values in short textual content. We identify numeric values in the text and link them to an existing property in a knowledge base. Based on this mapping, we are then able to find the concept that the property is associated with, thereby identifying both the concept and the specific property of that concept to which the numeric value belongs. Our experiments show that our proposed approach reaches an accuracy of over 70% for semantically annotating quantitative content.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>As more and more content is disseminated on online platforms such as blogs, social media and microblogs, better and more efficient techniques for organizing, searching and retrieving information are required. Techniques that benefit from well-grounded knowledge bases such as ontologies for information organization and retrieval have received attention in recent years <ref type="bibr" target="#b7">[7]</ref>; these include open information extraction <ref type="bibr" target="#b8">[8]</ref>, ontology population and enrichment <ref type="bibr" target="#b9">[9]</ref>, and semantic tagging and annotation <ref type="bibr">[1,</ref><ref type="bibr" target="#b2">2]</ref>, to name a few. These techniques aim to identify and extract structured information from unstructured content. Automated semantic annotation systems are among them: they enable the identification and labeling of instances of knowledge base concepts within text, thereby enriching textual documents with additional semantic information linked to external knowledge bases.</p><p>With the emergence of the linked open data initiative, many semantic annotator systems now benefit from the knowledge bases shared through this platform to spot, disambiguate and link semantic information within textual content. Knowledge bases such as Freebase and DBpedia, which sit at the core of the linked open data cloud, have been used extensively for this purpose, with their concepts employed for semantically grounding textual content. 
Semantic annotator systems typically provide support for entity linking, suggestion of related but unobserved concepts, role assignment and detection of relevant semantic categories.</p><p>In spite of the growing adoption of semantic annotator systems, one of the major limitations of current annotators concerns quantitative (numerical) textual content: none of the existing semantic annotator systems is able to semantically link or describe numerical content. Therefore, valuable information expressed in the form of numbers is largely ignored by current semantic annotator systems; it is neither exploited in the annotation process nor semantically linked for future use.</p><p>Let us consider a sample short text describing a Samsung Galaxy S smart phone: "The Samsung Galaxy S uses the Samsung S5PC110 processor. This processor combines a 1 GHz ARM Cortex-A8 based CPU core with a PowerVR SGX 540 GPU made by Imagination Technologies." When processed by a state-of-the-art semantic annotator system such as TagMe <ref type="bibr" target="#b10">[10]</ref>, the phrases 'Samsung Galaxy S', 'ARM Cortex-A8', 'processor', 'Imagination Technologies' and 'CPU' are detected and linked to their corresponding Wikipedia entities. However, none of the numerical values is detected for semantic annotation. This limitation prevents the correct interpretation of quantitative values within text, which can constitute a noticeable portion of a document, e.g. product specification Web pages.</p><p>In this paper, we propose an approach for annotating quantitative values in a short text. We identify numeric values in text and not only link them to the most relevant property in the knowledge base but also find the best matching concept 1 that has the identified property. 
Therefore, our method enables the specification of a numeric value within the context of a concept by relating it to one of the properties of that concept. For instance, in the above example, our method is able to determine that 1 GHz is the value of the frequency property of the ARM Cortex-A8 concept.</p><p>To evaluate our approach, we use a gold standard dataset consisting of short textual snippets, each containing at least one numerical value. For each numerical value, we compare the identified property and concept with the gold standard. The results of our evaluation show that our method correctly identifies the most relevant concept and corresponding property in over 70% of the cases.</p><p>The rest of this paper is organized as follows. In Section 2, we review the background on automated semantic annotation of textual content. Section 3 gives a detailed description of our proposed approach, including the procedure for identifying the relevant entities and corresponding properties. The evaluation procedure, dataset and results are provided in Section 4, and finally Section 5 concludes the paper.</p><p>1 Also known as entity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Background</head><p>One of the open areas of knowledge extraction from natural language is semantic annotation. For the sake of brevity, we refer to semantic annotation tools as annotators. Annotation is the task of extracting and disambiguating the entities mentioned in a given text. Annotators typically operate in three main phases: detection of concept candidates, disambiguation, and pruning of results <ref type="bibr">[1]</ref>, which we briefly review in the following.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Detection</head><p>In the first phase, the annotator processes the given input text and picks out specific phrases, called "mentions", that can potentially refer to an existing concept within the source knowledge base. For each mention, a set of candidate concepts associated with that mention is selected. Detection of mentions is also known as "spotting". TagMe <ref type="bibr" target="#b10">[10]</ref> maintains an Anchor Dictionary for this phase and detects mentions by querying this dictionary. DBpedia Spotlight <ref type="bibr" target="#b3">[3]</ref> also relies on a dictionary for spotting: a lexicon that associates multiple surface forms with a concept. Wikipedia Miner <ref type="bibr" target="#b4">[4]</ref> uses pure text processing to find the spots and their candidates. It gathers all n-grams within the text but only keeps those that have a high probability of linking, in order to discard irrelevant phrases and stop words. In AIDA <ref type="bibr" target="#b5">[5]</ref>, a Named Entity Recognition (NER) tool identifies noun phrases that potentially denote named entities, and YAGO2 is then used to associate a candidate set with each potential named entity. In Illinois Wikifier <ref type="bibr" target="#b6">[6]</ref>, the authors perform pure text processing for entity spotting. They utilize an anchor-title index, computed by crawling Wikipedia, that maps each distinct hyperlink anchor text to its target Wikipedia titles. Since checking all substrings of the input text against the index is computationally inefficient, they only consider the expressions marked as named entities by an NER tagger, the noun-phrase chunks extracted by a publicly available shallow parser, and all sub-expressions of up to 5 tokens of the noun-phrase chunks. 
Then, for each mention, the Wikipedia titles mapped to the mention (anchor text) are considered to be the candidate entities.</p><p>In our work, the detection phase starts with finding the numeric values in the input text. Assuming that the disambiguated mentions in the text are available, a set of candidate concepts is extracted; these concepts potentially contain the most relevant property for the numeric value. Then, from all properties of the candidate concepts, a set of candidate properties is selected and associated with the spotted numeric value.</p></div>
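As an illustration of the numeric-value spotting step described above, a minimal pattern-based spotter might pair each number with the unit token that follows it. This is only a sketch: the regular expression, the `spot_numeric_values` function and the small unit vocabulary below are our own illustrative assumptions, not part of the proposed system, which would derive the datatype set D from the knowledge base.

```python
import re

# Hypothetical unit vocabulary standing in for the datatype set D.
DATATYPES = {"GHz", "MHz", "MB", "GB", "kg", "g", "L", "km"}

def spot_numeric_values(text):
    """Return (T.r, T.dt) pairs: a numeric value immediately followed by a known unit."""
    pairs = []
    for match in re.finditer(r"(\d+(?:\.\d+)?)\s*([A-Za-z]+)", text):
        value, unit = match.group(1), match.group(2)
        if unit in DATATYPES:  # keep only tokens that are recognized datatypes
            pairs.append((float(value), unit))
    return pairs
```

For the running example, `spot_numeric_values("Motorola RAZR can support up to 64 MB")` yields the pair for T.r = 64 and T.dt = "MB".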
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Disambiguation</head><p>Within the detection phase, a set of candidate concepts is identified. The objective of the disambiguation phase is then to select, from among these candidates, the concepts that most accurately capture each mention's semantics. There are generally four groups of disambiguation techniques in annotators <ref type="bibr">[1]</ref>: popularity-based, context-based, collective and graph-based.</p><p>In the popularity-based approach, the most frequently observed concept for a given mention is chosen. This method is usually combined with other approaches, since using it alone can lead to erroneous results: it does not consider the context in which the mention appears and therefore largely ignores the main theme of the text. TagMe, Wikipedia Miner, AIDA, and Illinois Wikifier use the popularity-based approach combined with one of the following approaches for disambiguation.</p><p>Within the context-based approach, the context of the mention and the context of the candidate concepts are compared. Context is typically modeled through bag-of-words representations and different distance measures <ref type="bibr">[1]</ref>. Context-based approaches are used in DBpedia Spotlight, AIDA, and Illinois Wikifier for disambiguation.</p><p>The third type of disambiguation relies on collective disambiguation, where multiple mentions are disambiguated together. In this approach, the target entities should be coherent and semantically related to each other. Many semantic annotation tools, such as TagMe, Wikipedia Miner, and Illinois Wikifier, combine this approach with the popularity-based method.</p><p>The final disambiguation approach is built on a graph-based representation, in which the extracted mentions and candidate concepts form the vertices of a graph. 
In this graph, weighted edges between mentions and candidate concepts represent contextual similarity. On this basis, disambiguation is formulated as the task of finding a dense sub-graph in which each mention has exactly one edge. AIDA uses a graph-based approach for disambiguation.</p><p>In our work, disambiguation of a numeric value concerns the identification of the best matching property for that value from among the candidate properties identified in the detection phase. Our work is primarily based on the popularity-based approach: the best candidate is selected based on the cumulative distribution of the values associated with each property in the knowledge base, and the candidate property whose distribution is closest to the value observed in the given input text is chosen.</p></div>
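To make the context-based family described above concrete, the following is a generic bag-of-words sketch of that idea, not the implementation of any of the cited annotators; the function names and toy candidate texts are our own.

```python
from collections import Counter
from math import sqrt

def cosine(bag1, bag2):
    """Cosine similarity between two bag-of-words frequency vectors."""
    common = set(bag1) & set(bag2)
    dot = sum(bag1[w] * bag2[w] for w in common)
    norm = sqrt(sum(v * v for v in bag1.values())) * sqrt(sum(v * v for v in bag2.values()))
    return dot / norm if norm else 0.0

def disambiguate(mention_context, candidates):
    """Pick the candidate concept whose descriptive text is closest to the mention's context.
    `candidates` maps a concept name to a short descriptive text (toy stand-in for a KB)."""
    mention_bag = Counter(mention_context.lower().split())
    bags = {c: Counter(text.lower().split()) for c, text in candidates.items()}
    return max(bags, key=lambda c: cosine(mention_bag, bags[c]))
```

For instance, given candidates `{"Apple_Inc": "technology company that makes the iphone", "Apple_(fruit)": "fruit of the apple tree grown in orchards"}`, the mention context "the company released a new phone" selects `Apple_Inc`.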
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Pruning</head><p>In this phase, the concepts that are irrelevant or only marginally related to the topic of the input text are pruned. Some annotators, such as AIDA, perform this task in the disambiguation phase, while others, such as DBpedia Spotlight, perform it as a post-disambiguation phase.</p><p>In TagMe, pruning is based on the average value of each mention's link probability and the coherence between the selected concepts for all of the identified concepts. In DBpedia Spotlight, pruning is based on a number of parameters that can be tuned by the user. Wikipedia Miner uses automated pruning similar to TagMe: it uses a topic detector to classify related and unrelated links in a document. Positive training instances for the classifier are the articles that were manually linked to an article in Wikipedia, while negative ones are those that were not. Features of these articles and the places where they were mentioned inform the classifier as to which mentions should or should not be linked. In our work, we do not perform pruning.</p><p>There are other areas of research relevant to the theme of this paper, including work on ontology learning and knowledge base population. One of the state-of-the-art automatic knowledge extraction tools is FRED <ref type="bibr" target="#b11">[11]</ref>, which enables robust ontology learning and population (OL&amp;P) from natural language. Ontology learning is the task of acquiring a domain model from a given text; it therefore involves parsing natural language and extracting complex relations and concepts for the purpose of taxonomy induction. FRED performs the OL&amp;P task based on Discourse Representation Theory (DRT).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Theoretical Model</head><p>The overall objective of our work is, for a quantitative value in a short text, to find the property that best describes it and the concept that this property belongs to. We first describe the method for finding the best property and then explain how we identify its corresponding concept.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Property Identification</head><p>In order to find the most relevant property that accurately describes a numeric value<ref type="foot" target="#foot_0">2</ref>, the first step is to identify the set of properties from the knowledge base that can potentially be related to that numeric value. Let us first provide a theoretical foundation for describing our work.</p><p>Definition 1 (Textual Snippet) Let textual snippet T = [w_1 ... w_k] be a string where w_i (1 ≤ i ≤ k) is a word. We define T.dt = w_j ∈ D and T.r = w_{j-1} s.t. w_{j-1} is a numeric value and 2 ≤ j ≤ k, where D is the set of all possible datatypes. Further, we define T.S to be the set of all concepts that are spotted in T.</p><p>According to this definition, our objective is to annotate T.r with the most relevant property. For instance, for a textual snippet T such as "Motorola RAZR can support up to 64 MB", T.dt is "MB", which represents the megabyte datatype, and T.r is "64". Furthermore, with the help of an automated semantic annotation system, one can find all the concepts relevant to T. For this example, T.S is {Motorola Razr, Megabyte, Secure Digital} 3 . We rely on an existing annotator to provide the values for T.S. Now, our task is to find an appropriate property for the value "64" from the list of properties in our knowledge base (e.g. DBpedia).</p><p>Definition 2 (Knowledge Base) Let KB = {c_1, ..., c_n} be a knowledge base, where</p><formula xml:id="formula_0">c_i (1 ≤ i ≤ n) is a concept and c_i.P = {(p^{c_i}_1, v^{c_i}_1), ..., (p^{c_i}_m, v^{c_i}_m)}, where (p^{c_i}_j, v^{c_i}_j) (1 ≤ j ≤ m)</formula><p>represents a property-value pair for concept c_i.</p><p>For instance, for a concept such as "Motorola A1000" in DBpedia, one can find a set of property-value pairs such as {(type, Device), (operatingsystem, "Symbian OS 7.0 + UIQ 2.1"), (storage, "24.0 megabyte"), ...}, among others. 
Based on Definitions 1 and 2, we formally specify the problem of property identification as follows:</p><p>Definition 3 (Property Identification) For a knowledge base KB = {c_1, ..., c_n} and a textual snippet T, let P_c = {p | (p, v) ∈ c.P} be the set of all properties of concept c. The set of all possible properties in our knowledge base is defined as UP = ∪_{c ∈ KB} P_c. The objective is to find the most relevant property p ∈ UP for T.r.</p><p>In the context of the earlier example, our goal would be to find a relevant property for "64", which in this case would be "memory" or "storage". As the first step, we select a set of concepts from the knowledge base that contain appropriate properties for T.r. According to this definition, the candidate concept set includes all concepts that have at least one property with a value whose datatype is equivalent to T.dt. In our running example, the concept "Motorola A1000" would be in the candidate concept set, since it has the datatype "megabyte" in the value of one of its properties. In order to choose the best concepts from the candidate concept set, a ranking function is required. We rank the members of the candidate concept set based on their distance to the spots in T.S. Definition 5 (Concept Distance) For concept c and textual snippet T, a distance function is defined as follows:</p><formula xml:id="formula_1">dist(c, T) = Σ_{s ∈ T.S} ρ(s) / (r(c, s) + β)^2 (1)</formula><p>where the semantic relatedness of two concepts c_1 and c_2 is represented as r(c_1, c_2)<ref type="foot" target="#foot_2">4</ref> and ρ is the function that returns the confidence score of the mentioned concept in the text (provided by the annotator). Also, β is a very small constant used for when r(c, s) = 0.</p><p>Table <ref type="table" target="#tab_0">1</ref> shows a number of concepts and their distances to the spots in the context of the earlier example. 
We rank the concepts in the candidate concept set using the distance function in Definition 5 and hypothesize that less distant concepts are more likely to include relevant properties for our purpose. Therefore, we select the top-k concepts from the candidate concept set, denoted Top Concepts (TC) <ref type="foot" target="#foot_3">5</ref>. Based on the top-k concepts, the properties of the concepts in TC that have a datatype equal to T.dt form the candidate property set, defined as follows: </p><formula xml:id="formula_2">CP(TC) = {p | c ∈ TC, (p, v) ∈ c.P, v is Numeric and v.dt = T.dt}</formula><p>In order to find the best related property from the candidate property set, we perform a statistical analysis over all observed values of each of the properties in CP(TC) to see which property is most likely to have the numeric value T.r. To analyze the values of each property, we first build a set called the Number Set. The Number Set is the set of all numerical values observed in the knowledge base for a specific property whose datatype has a semantic similarity score above a threshold α<ref type="foot" target="#foot_4">6</ref> with the datatype of the value being annotated (T.dt). Based on the Number Set, we calculate the relevance probability of a given property through its Cumulative Distribution Function (CDF), assuming that the Number Set follows a Gaussian distribution. The CDF for a property p and T.r gives the probability of property p being the suitable representation for T.r. Table <ref type="table" target="#tab_1">2</ref> shows a set of properties and their CDF values for the above example, where the numeric value 64.0 was considered. Based on the ranking provided by the CDF function, we are able to determine the property that best matches T.r. 
Algorithm 1 details the proposed approach for finding the best property that describes a quantitative value mentioned in the input text. Lines 2-8 show how the candidate concept set (C) is built: C is the subset of KB whose members (concepts) have a property value with the datatype of interest. After identifying the candidate concepts, we find the top concepts (TC) by taking the top-k members of C based on the ranking function in Definition 5 (line 9). Lines 10-16 show the process of forming the candidate property set (CP): for every concept in the top concept set, all numeric-valued properties with a datatype close to T.dt are added to CP. Finally, the property that has the highest probability of having T.r as its value is identified as the property of interest (line 17).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 1 IdentifyProperty(TextualSnippet T)</head></div>
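A compact sketch of Algorithm 1 follows. It assumes a toy in-memory knowledge base mapping each concept to {property: (value, datatype)} pairs and caller-supplied relatedness (r) and confidence (ρ) functions; all names, the constants, and the exact-datatype-match variant of the Number Set (instead of the α-thresholded similarity) are our own illustrative choices, not the paper's implementation.

```python
from math import erf, sqrt
from statistics import mean, stdev

BETA, K = 1e-6, 10  # beta from Definition 5 and the top-k cutoff; both values are illustrative

def dist(c, spots, rho, r):
    """Equation 1: dist(c, T) = sum over s in T.S of rho(s) / (r(c, s) + beta)^2."""
    return sum(rho[s] / (r(c, s) + BETA) ** 2 for s in spots)

def gaussian_cdf(x, mu, sigma):
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def identify_property(kb, spots, rho, r, value, dt):
    """kb maps concept -> {property: (numeric value, datatype)} (toy stand-in for DBpedia)."""
    # Lines 2-8: candidate concepts with at least one property of the target datatype.
    C = [c for c, props in kb.items() if any(vdt == dt for _, vdt in props.values())]
    # Line 9: keep the top-k concepts closest to the spotted concepts.
    TC = sorted(C, key=lambda c: dist(c, spots, rho, r))[:K]
    # Lines 10-16: candidate properties drawn from the top concepts.
    CP = {p for c in TC for p, (_, vdt) in kb[c].items() if vdt == dt}

    def prob(p):  # Definitions 7 and 8, with exact datatype match in place of the alpha threshold
        ns = [v for props in kb.values() for q, (v, vdt) in props.items()
              if q == p and vdt == dt]  # Number Set for property p
        if len(ns) < 2:
            return 0.0
        mu, sigma, d = mean(ns), stdev(ns), value / 100  # window width from Definition 8
        if sigma == 0:
            return 1.0 if abs(value - mu) <= d else 0.0
        return gaussian_cdf(value + d, mu, sigma) - gaussian_cdf(value - d, mu, sigma)

    # Line 17: the property whose observed value distribution best explains T.r.
    return max(CP, key=prob)
```

With a toy knowledge base where "memory" values cluster near 64 MB and "storage" values do not, the property "memory" is selected for the value 64 MB, mirroring Table 2.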
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Concept Identification</head><p>Now that the most relevant property for the numeric value has been identified, the objective of this phase is to find the most relevant concept mentioned in the text that, either directly or through inference, has the property identified in the previous step. Let the identified property of the numeric value in our knowledge base be P. The objective is to find a subject for P with T.r as its object. Note that the desired concept may or may not have the predicate (property) explicitly assigned to it. For example, Ford XT Falcon is a concept in the category Ford Falcon and it has the property "weight" in our knowledge base. However, Ford Fairmont (Australia) is also in the category Ford Falcon but does not have the property "weight". We are interested in all the concepts in T.S that can potentially have P as one of their properties, where the association may not be direct but can be derived through hierarchical subclass inference.</p><p>Definition 9 (Candidate Mentions) For a textual snippet T and a property P, the candidate mentions set is defined as CM(T, P) = {c | c ∈ T.S, (P, v) ∈ c.P}.</p><p>In case the candidate mentions set is empty (none of the mentioned concepts have the property P), we search for similar concepts in the knowledge base that have P as a property. A mention is considered a candidate if there is at least one concept in the knowledge base that has the property P and shares at least one of the mention's categories, as expressed in DBpedia's hierarchical concept categories. For example, the Ford XT Falcon categories are Vehicles introduced in 1968, Cars of Australia, and Ford Falcon. 
In order to identify the related concepts based on shared categories, we define the Related Concepts set as follows:</p><p>Definition 10 (Related Concepts) For a concept s and a property P, the related concepts set is defined as RC(s, P) = {c | c ∈ O, (P, v) ∈ c.P, cat(c) ∩ cat(s) ≠ Ø}, where cat is a function that returns the set of all DBpedia categories of a concept.</p><p>Algorithm 2 shows the procedure for identifying the best concept for P. First, if the candidate mentions set is not empty, the concept in CM with the highest confidence (ρ) is selected (lines 1-3). Otherwise, we try to find other concepts related to each mention that have the property P; if such a concept is found, the mention is added to CM (lines 5-9). Finally, the best concept for property P is the one with the highest confidence value (ρ) in CM (line 10).</p><p>As an example, in the text "Motorola RAZR can support up to 64 MB" mentioned earlier, "memory" was selected as the best property for 64. Based on this identified property, there is only one concept in T.S = {Motorola Razr, Megabyte, Secure Digital}, i.e. Motorola Razr, that has "memory" as a property. Therefore, Motorola Razr is the selected concept. In case more than one mention has the identified property, the one with the highest confidence is selected.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Evaluation</head><p>In order to evaluate our work, we first developed a gold standard dataset of sentences containing quantitative values. Existing datasets used for evaluating semantic annotator systems were not suitable, as they do not provide gold standard annotations for numeric values. Therefore, we recruited a group of ten Computer Science graduate students at the MSc and PhD levels, all of whom had prior experience working with semantic annotator systems, to collect and annotate the gold standard dataset. 
The recruited students were given a set of suggested concept-property pairs and were asked to collect descriptive sentences about each pair such that the sentences included quantitative content describing the desired property of the desired concept. Since our knowledge base (DBpedia) does not contain much numerical information about concepts, we provided the participants with the suggested concept-property pairs to make sure that the collected gold standard would consist of concepts that exist in the knowledge base. Because the participants worked from these suggested pairs, there were no overlaps between the sentences they collected. The concept-property pairs were chosen to cover various domains including electronics, motor vehicles, movies &amp; music, geographical locations, famous people and food. As a final step, all the collected gold standard content was processed by the TagMe semantic annotator and the extracted concepts were stored in the gold standard.</p><p>The developed gold standard dataset consists of 165 separate entries <ref type="foot" target="#foot_5">7</ref>. Across the whole dataset, there are 1,225 unique concepts extracted by TagMe, and each entry mentions 9.85 concepts on average. Each entry was selected such that TagMe could find at least one spot in it.</p><p>With regards to DBpedia, in our experiments we used DBpedia 3.8, locally installed on a MongoDB server, and specifically exploited the "properties" collection, which has over 130 million subject-predicate-object triples. One of our observations when working with DBpedia was that, although DBpedia is a great source of information, it does not provide substantial reliable numeric data. In other words, many of the properties that should have numeric values are missing or have incorrect or overly generic datatypes associated with them. 
Given that DBpedia does not enforce a schema, we believe that one of the areas in which this knowledge base can be improved is its quantitative values.</p><p>Based on the gold standard, our objective was to identify the correct concept and property for each of the quantitative values in the dataset entries. The experiments were run on a machine with a 3.20 GHz CPU and 8 GB of RAM. Table <ref type="table" target="#tab_3">3</ref> (in the Appendix) shows some sample entries and the corresponding concepts and properties that were identified. In this table, the mentioned entities are the spotted concepts extracted by TagMe, while the predicted property and concept are those identified by our method for the highlighted numeric value in that dataset entry. As an example, in the first entry, fuelCapacity (http://dbpedia.org/property/fuelCapacity) is identified as the best property and Honda Gyro (http://dbpedia.org/resource/Honda_Gyro) as the best concept for the numeric value 5.0L in the entry.</p><p>The experiments on the gold standard show an accuracy of 73% for predicting the correct property and 72% for identifying the correct concept. It should be noted that, since concept identification depends on the performance of the property detection method, when the property was correctly identified, the concept was also identified accurately in 87% of the cases.</p><p>One of the areas that we plan to investigate to further improve the performance of our work is to contextualize the consideration of properties with DBpedia categories. In other words, we intend to first identify the set of categories that a given input text belongs to, and then only consider property values of concepts within those categories when predicting the correct property. 
For example, if the input text is mainly about automobiles and a candidate property is "length", we would only consider values of "length" within concepts related to automobiles rather than the "length" of irrelevant concepts such as rivers or cellphones.</p></div>
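The concept identification procedure of Algorithm 2 (Section 3.2) can be sketched as follows. The dictionaries standing in for DBpedia lookups and all function names are our own illustrative assumptions, not the paper's implementation.

```python
def identify_concept(spots, rho, kb, categories, P):
    """Sketch of Algorithm 2: choose the mentioned concept for property P.
    kb maps concept -> set of property names; categories maps concept -> set of categories."""
    # Lines 1-3: candidate mentions that carry P directly (Definition 9).
    cm = [s for s in spots if P in kb.get(s, set())]
    if not cm:
        # Lines 5-9: a mention qualifies if some concept having P shares a
        # category with it (Definition 10, the category-based fallback).
        for s in spots:
            if any(P in kb[c] and categories.get(c, set()) & categories.get(s, set())
                   for c in kb):
                cm.append(s)
    # Line 10: the highest-confidence candidate wins.
    return max(cm, key=lambda s: rho[s]) if cm else None
```

In the running example, only Motorola Razr among the spotted concepts carries "memory", so it is selected directly; the Ford Fairmont case from Section 3.2 exercises the category-based fallback.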
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Concluding Remarks</head><p>In this paper we have proposed a technique for semantically annotating quantitative values in textual content. To the best of our knowledge, our work is among the first to consider the semantic annotation of numerical values and their connection to appropriate properties in an external knowledge base such as DBpedia. While we reach an overall accuracy of 73% on the gold standard, there is one main limitation to our work that we will address in future work: the core assumption of our work is that a numeric value is followed by a unit measure (datatype), e.g. 5.0 L. However, in many real-world cases no unit measure appears after the numeric value. We are interested in predicting the unit measure of a numeric value based on context. </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Definition 4 (</head><label>4</label><figDesc>Candidate Concepts) For a textual snippet T, a candidate concept set is defined as C(T) = {c | c ∈ KB, ∃(p, v) ∈ c.P s.t. v.dt = T.dt}, where v.dt denotes the datatype of v.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Definition 7 (</head><label>7</label><figDesc>Number Set) For a textual snippet T and a given property p, Number Set is defined as N S(p, T ) = {v i |c ∈ KB, (p, v) ∈ c.P, r(v.dt, T.dt) &gt; α}.</figDesc></figure>
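Definition 7 collects, for a candidate property p, every value in the knowledge base whose datatype is sufficiently related to the snippet's datatype. A sketch under the same hypothetical in-memory representation as above; the `relatedness` callable stands in for the datatype relatedness measure r (the paper uses the TagMe Relatedness API), and `alpha` is the threshold α:

```python
def number_set(kb, prop, t_dt, relatedness, alpha=0.5):
    """Definition 7: NS(p, T) -- values of property `prop` across the KB
    whose datatype dt satisfies relatedness(dt, T.dt) > alpha.

    kb maps a concept name to (property, (value, datatype)) pairs;
    `relatedness` is a hypothetical stand-in for the measure r.
    """
    return [v for props in kb.values()
              for p, (v, dt) in props
              if p == prop and relatedness(dt, t_dt) > alpha]
```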
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Definition 8 (</head><label>8</label><figDesc>CDF) For a random variable R we have P r[R ≤ T.r] ≈ CDF (T.r). So, P r[T.r − ∆T.r &lt; R &lt; T.r + ∆T.r] = CDF (T.r + ∆T.r) − CDF (T.r − ∆T.r) where ∆T.r = T.r/100. Therefore for a property p and a numeric value T.r, P r(p, T.r) = CDF (N S(p, T ), T.r + ∆T.r) − CDF (N S(p, T ), T.r − ∆T.r).</figDesc></figure>
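Definition 8 scores a property p for an observed value T.r by the probability mass that the distribution of NS(p, T) places in a ±1% window around T.r. A minimal sketch using an empirical CDF over the number set (the helper names are hypothetical):

```python
from bisect import bisect_right

def empirical_cdf(values, x):
    """Empirical CDF: the fraction of observed values <= x."""
    s = sorted(values)
    return bisect_right(s, x) / len(s)

def property_probability(number_set, t_r):
    """Definition 8: Pr(p, T.r) = CDF(NS(p,T), T.r + dT.r)
                                - CDF(NS(p,T), T.r - dT.r),
    with dT.r = T.r / 100 (a +/-1% window around the observed value)."""
    delta = t_r / 100.0
    return (empirical_cdf(number_set, t_r + delta)
            - empirical_cdf(number_set, t_r - delta))
```

A property whose known values cluster tightly around T.r (e.g. typical phone memory sizes around 64) receives a higher probability than one whose values lie far away, which is what drives the ranking in Table 2.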
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Concept distances to the spots in T.S</figDesc><table><row><cell>Concept</cell><cell>Distance</cell></row><row><cell cols="2">Theatre of War (video game) 73.09</cell></row><row><cell>Sony Ericsson C510</cell><cell>61.63</cell></row><row><cell>Motorola A1000</cell><cell>7.39</cell></row></table><note>Definition 6 (Candidate Properties) Assume T C is the set of top-k concepts based on the distance function in Equation 1. Candidate property set for T C is defined as CP</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 .</head><label>2</label><figDesc>The cumulative distribution function values for the properties</figDesc><table><row><cell cols="2">Property CDF</cell></row><row><cell cols="2">memory 0.004084854021963902</cell></row><row><cell>storage</cell><cell>1.0839821475783218E-4</cell></row><row><cell>size</cell><cell>1.316693881592279E-6</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3 .</head><label>3</label><figDesc>Some instances of the dataset and their results</figDesc><table><row><cell>Dataset Entry</cell><cell>Spotted Concepts by TagMe</cell><cell>Predicted Property</cell><cell>Predicted Concept</cell></row><row><cell>The Honda Gyro is a family of small, three-wheeled, single-occupant vehicles sold primarily in Japan, and often used for delivery or express service with 5.0 L fuel capacity.</cell><cell>Honda Express, Japan, Fuel, Family, Engine displacement, Tricycle, Single-occupant vehicle, Honda Gyro, Delivery (commerce), Service (economics)</cell><cell>fuelCapacity</cell><cell>Honda Gyro</cell></row><row><cell>Before participating in the reality show, Sibuja Kaliraman weighed 90 kg, and knew nothing about "presentation".</cell><cell>Reality television, Kilogram, Sonika Kaliraman, GMTV, Nothing</cell><cell>weight</cell><cell>Sonika Kaliraman</cell></row><row><cell>Maple syrup is a syrup usually made from the xylem sap of sugar maple, red maple, or black maple trees, although it can also be made from other maple species. It contains 0.1 g fat every 100 grams of the syrup.</cell><cell>Maple syrup, Canadian dollar, Xylem, Doepfer A-100, Acer nigrum, Gram, Fat, Acer saccharum, Plant sap, Species, Acer rubrum, Maple</cell><cell>fat</cell><cell>Maple syrup</cell></row><row><cell>The Boy Who Could Fly is a 114-minute movie about an autistic boy who dreams of flying touches everyone he meets, including a new family who has moved in after their father dies.</cell><cell>Ontario Highway 114, Dream, The Boy Who Could Fly, Autism, Minute, Film, Flight, Family, Death, Everyone (film), Christian Shephard, Father</cell><cell>runtime</cell><cell>The Boy Who Could Fly</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">If numeric values are written in English words, we automatically convert them to numeric form before processing.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">In our work, we employ DBpedia as the source knowledge base; hence, the complete URI for the concepts would be in the form of http://dbpedia.org/resource/Motorola_Razr.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">We use the TagMe Relatedness API for this purpose in our experiments.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">We set k to 10 in our experiments.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4">In our experiments, we set α to 0.5.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_5">The dataset is publicly available at http://ls3.rnet.ryerson.ca/people/mehrnaz/dataset.xlsx.</note>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 2 IdentifyConcept(TextualSnippet T, Property P)</head><p>1: if CM(T, P) is not empty then 2: return arg max c∈CM(T,P) ρ(c) 3: end if 4: CM ← Ø 5: for s ∈ T.S do 6: if RC(s, P) is not empty then 7: add s to CM 8: end if 9: end for 10: return arg max c∈CM ρ(c)</p><p>Now let us suppose that in the above example "storage" was selected instead of "memory". In this case, the candidate property set would be empty, because none of the members of T.S has the "storage" property. Therefore, we need to consider the concepts related to those in T.S. Here, only one concept in T.S, i.e., Motorola Razr, has a non-empty related concepts set. This is because we are able to find concepts, such as Motorola Rokr, that share a common DBpedia category with Motorola Razr, i.e., Motorola mobile phones, and at the same time contain the "storage" property. Therefore, our proposed algorithm identifies Motorola Razr as the concept and "storage" as the property for the numeric value 64.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Automated Semantic Tagging of Textual Content</title>
		<author>
			<persName><forename type="first">J</forename><surname>Jovanovic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Bagheri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cuzzola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Gasevic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Jeremic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bashash</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IT Professional</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="38" to="46" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A framework for benchmarking entity-annotation systems</title>
		<author>
			<persName><forename type="first">M</forename><surname>Cornolti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ferragina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ciaramita</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 22nd international conference on World Wide Web</title>
				<meeting>the 22nd international conference on World Wide Web</meeting>
		<imprint>
			<date type="published" when="2013-05">2013. May</date>
			<biblScope unit="page" from="249" to="260" />
		</imprint>
	</monogr>
	<note>International World Wide Web Conferences Steering Committee</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">DBpedia spotlight: shedding light on the web of documents</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">N</forename><surname>Mendes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jakob</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>García-Silva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bizer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 7th International Conference on Semantic Systems</title>
				<meeting>the 7th International Conference on Semantic Systems</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2011-09">2011. September</date>
			<biblScope unit="page" from="1" to="8" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">An open-source toolkit for mining Wikipedia</title>
		<author>
			<persName><forename type="first">D</forename><surname>Milne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">H</forename><surname>Witten</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">194</biblScope>
			<biblScope unit="page" from="222" to="239" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Robust disambiguation of named entities in text</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hoffart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Yosef</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Bordino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Fürstenau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pinkal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Spaniol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Weikum</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the Conference on Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2011-07">2011. July</date>
			<biblScope unit="page" from="782" to="792" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Local and global algorithms for disambiguation to wikipedia</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ratinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Roth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Downey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Anderson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies</title>
				<meeting>the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<date type="published" when="2011-06">2011. June</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1375" to="1384" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Information organization and retrieval using a topic mapsbased ontology: results of a taskbased evaluation</title>
		<author>
			<persName><forename type="first">M</forename><surname>Yi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Society for Information Science and Technology</title>
		<imprint>
			<biblScope unit="volume">59</biblScope>
			<biblScope unit="issue">12</biblScope>
			<biblScope unit="page" from="1898" to="1911" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Open information extraction using Wikipedia</title>
		<author>
			<persName><forename type="first">F</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Weld</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 48th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2010-07">2010. July</date>
			<biblScope unit="page" from="118" to="127" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Petasis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Karkaletsis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Paliouras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Krithara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Zavitsanos</surname></persName>
		</author>
		<title level="m">Knowledge-driven multimedia information extraction and ontology evolution</title>
				<imprint>
			<publisher>Springer-Verlag</publisher>
			<date type="published" when="2011-01">2011. January</date>
			<biblScope unit="page" from="134" to="166" />
		</imprint>
	</monogr>
	<note>Ontology population and enrichment: State of the art</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Tagme: on-the-fly annotation of short text fragments (by wikipedia entities)</title>
		<author>
			<persName><forename type="first">P</forename><surname>Ferragina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Scaiella</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 19th ACM international conference on Information and knowledge management</title>
				<meeting>the 19th ACM international conference on Information and knowledge management</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2010-10">2010. October</date>
			<biblScope unit="page" from="1625" to="1628" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Knowledge extraction based on discourse representation theory and linguistic frames</title>
		<author>
			<persName><forename type="first">V</forename><surname>Presutti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Draicchio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gangemi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Knowledge Engineering and Knowledge Management</title>
				<meeting><address><addrLine>Berlin Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="114" to="129" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
