<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic Annotation of Quantitative Textual Content</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mehrnaz Ghashghaei</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>John Cuzzola</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ebrahim Bagheri</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ali A. Ghorbani</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Laboratory for Systems, Software and Semantics (LS3), Ryerson University</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of New Brunswick</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Semantic annotation techniques provide the basis for linking textual content with concepts in well-grounded knowledge bases. In spite of their many application areas, current semantic annotation systems have some limitations. One of the most prominent is that none of the existing semantic annotators is able to identify and disambiguate quantitative (numerical) content. Textual documents such as Web pages, especially technical content, contain a great deal of quantitative information, such as product specifications, that needs to be semantically qualified. In this paper, we propose an approach for annotating quantitative values in short textual content. In our approach, we identify numeric values in the text and link them to an existing property in a knowledge base. Based on this mapping, we are then able to find the concept that the property is associated with, thereby identifying both the concept and the specific property of that concept that the numeric value belongs to. Our experiments show that our proposed approach is able to reach an accuracy of over 70% for semantically annotating quantitative content.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        As more and more content is disseminated on online platforms such as blogs, social media and microblogs, better and more efficient techniques for organizing, searching and retrieving information are required. Techniques that benefit from well-grounded knowledge bases such as ontologies for the sake of information organization and retrieval have received attention in recent years [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which include open information extraction [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], ontology population and enrichment [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and semantic tagging and annotation [
        <xref ref-type="bibr" rid="ref1 ref2">1,2</xref>
        ], just to name a few. These techniques aim to identify and extract structured information from unstructured content. Automated semantic annotation systems are among such systems; they enable the identification and labeling of instances of knowledge base concepts within text, thereby enriching textual documents with additional semantic information linked to external knowledge bases.
      </p>
      <p>With the emergence of the linked open data initiative, many semantic annotator systems now benefit from the knowledge bases that are shared through this platform to spot, disambiguate and link semantic information within textual content. Knowledge bases such as Freebase and DBpedia, which sit at the core of the linked open data cloud, have been used extensively for this purpose, where their concepts are employed for semantically grounding textual content. Semantic annotator systems typically provide support for entity linking, suggestion of related but unobserved concepts, role assignment and detection of relevant semantic categories.</p>
      <p>In spite of the growing adoption of semantic annotator systems, one of the major limitations that current annotators face concerns dealing with quantitative (numerical) textual content. In other words, none of the existing semantic annotator systems is able to semantically link or describe numerical content. Therefore, valuable information that is expressed in the form of numbers is largely ignored by current semantic annotator systems; hence, it is neither exploited in the annotation process nor semantically linked for future use.</p>
      <p>
        Let us consider a sample short text describing a Samsung Galaxy S smartphone: "The Samsung Galaxy S uses the Samsung S5PC110 processor. This processor combines a 1 GHz ARM Cortex-A8 based CPU core with a PowerVR SGX 540 GPU made by Imagination Technologies." When processed by a state-of-the-art semantic annotator system such as TagMe [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], the phrases "Samsung Galaxy S", "ARM Cortex-A8", "processor", "Imagination Technologies" and "CPU" are detected and linked to their corresponding Wikipedia entities. However, none of the numerical values are detected for semantic annotation. This limitation prevents the correct interpretation of quantitative values within text, which can constitute a noticeable portion of a document, e.g., product specification Web pages.
      </p>
      <p>In this paper, we propose an approach for annotating quantitative values in short text. In our work, we identify numeric values in text and not only link them to the most relevant property in the knowledge base but also find the best matching concept (also known as an entity) that has the identified property. Therefore, our method enables the specification of a numeric value within the context of a concept by relating it to one of the properties of that concept. For instance, in the above example, our method is able to determine that 1 GHz is the value of the frequency property of the ARM Cortex-A8 concept.</p>
      <p>For evaluating our approach, we exploit a gold standard dataset consisting of short textual snippets that each contain at least one numerical value. We compare the property and concept obtained for each numerical value against the gold standard. The results of our evaluation show that our method is able to correctly identify the most relevant concept and corresponding property in over 70% of the cases.</p>
      <p>The rest of this paper is organized as follows. In Section 2, we review the background on automated semantic annotation of textual content. Section 3 is a detailed description of our proposed approach, including the procedure for identifying the relevant entities and corresponding properties. The evaluation procedure, dataset and results are provided in Section 4, and finally Section 5 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <sec id="sec-2-1">
        <title>2 Background</title>
        <p>
          One of the open areas of knowledge extraction from natural language is semantic annotation. For the sake of brevity, we refer to semantic annotation tools as annotators. Annotation is essentially the task of extracting and disambiguating the entities mentioned in a given text. Annotators typically operate in three main phases: detection of concept candidates, disambiguation, and pruning of results [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], which we briefly review in the following.
        </p>
        <sec id="sec-2-1-1">
          <title>2.1 Detection</title>
          <p>
            In the first phase, the annotator processes the given input text and picks out specific phrases from the text, called "mentions", that can potentially refer to an existing concept in the source knowledge base. For each of the mentions, a set of candidate concepts is selected that are associated with that mention. Detection of mentions is also known as "spotting". TagMe [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] has an Anchor Dictionary for this phase and detects mentions by querying this dictionary. DBpedia Spotlight [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] also relies on a dictionary for spotting; it uses a lexicon that associates multiple surface forms with a concept. Wikipedia Miner [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] uses pure text processing to find the spots and their candidates. It gathers all n-grams within the text but only keeps those that have a high probability of linking, in order to discard irrelevant phrases and stop words. In AIDA [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ], a Named Entity Recognition (NER) tool is used. This NER tool identifies noun phrases that potentially denote named entities. Then YAGO2 is used to associate a candidate set with each potential named entity. In Illinois Wikifier [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ], the authors perform pure text processing for entity spotting. They utilize an anchor-title index, computed by crawling Wikipedia, that maps each distinct hyperlink anchor text to its target Wikipedia titles. Since checking all substrings in the input text against the index is computationally inefficient, they only consider the expressions marked as named entities by a NER tagger, the noun-phrase chunks extracted by a publicly available shallow parser, and all sub-expressions of up to 5 tokens of the noun-phrase chunks. Then, for each mention, the Wikipedia titles that are mapped to the mention (anchor text) are considered to be the candidate entities.
          </p>
          <p>In our work, the detection phase starts with finding the numeric values in the input text. Assuming that we have the disambiguated mentions in the text, a set of candidate concepts is extracted. These concepts have the potential of having the most relevant property for the numeric value. Then, from all properties of the candidate concepts, a set of candidate properties is selected and associated with the spotted numeric value.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>2.2 Disambiguation</title>
          <p>
            Within the detection phase, a set of candidate concepts is identified. The objective of the disambiguation phase is then to select, from among the concepts identified in the previous phase, those that most accurately capture each mention's semantics. There are generally four groups of work that perform disambiguation in annotators [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ], namely popularity-based, context-based, collective disambiguation and graph-based techniques.
          </p>
          <p>In the popularity-based approach, the most frequently observed concept for a given mention is chosen. This method is usually combined with other approaches, since using it alone can lead to erroneous results: the results do not consider the context in which the mention appears and therefore largely ignore the main theme of the text. TagMe, Wikipedia Miner, AIDA, and Illinois Wikifier use the popularity-based approach combined with one of the following approaches for disambiguation.</p>
          <p>
            Within the context-based approach, the context of the mention and the context of candidate concepts are compared. Context is typically modeled through bag-of-words and different distance measures [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]. Context-based approaches are used in DBpedia Spotlight, AIDA, and Illinois Wikifier for disambiguation.
          </p>
          <p>The third type of disambiguation relies on collective disambiguation, where multiple mentions are disambiguated together. In this approach, target entities should be coherent and semantically related to each other. Many semantic annotation tools, such as TagMe, Wikipedia Miner, and Illinois Wikifier, combine this approach with the popularity-based method.</p>
          <p>The final disambiguation approach is based on a graph representation. In this approach, the extracted mentions and candidate concepts form the vertices of a graph, and the weighted edges between the mentions and candidate concepts represent contextual similarity. On this basis, disambiguation is formulated as the task of finding a dense sub-graph in which each mention has exactly one edge. AIDA uses a graph-based approach for disambiguation.</p>
          <p>In our work, disambiguation of a numeric value concerns the identification of the best matching property for that value from among the candidate properties identified in the detection phase. Our work is primarily based on the popularity-based approach. The selection of the best candidate is based on the cumulative distribution of values associated with each property in the knowledge base. The candidate property whose distribution is closest to the value observed in the given input text is selected.</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>2.3 Pruning</title>
          <p>In this phase, the concepts that are irrelevant or marginally related to the topic of the input text are pruned. Some annotators, such as AIDA, perform this task in the disambiguation phase; others, such as DBpedia Spotlight, perform it as a post-disambiguation phase.</p>
          <p>In TagMe, pruning is based on the average value of each mention's link probability and the coherence between the selected concepts for all of the identified concepts. In DBpedia Spotlight, pruning is based on a number of parameters that can be tuned by the user. Wikipedia Miner uses automated pruning similar to TagMe. It uses a topic detector to classify related and unrelated links in a document. Positive training instances for the classifier are the articles that were manually linked to an article in Wikipedia, while negative ones are those that were not. Features of these articles, and of the places where they were mentioned, inform the classifier as to which mentions should or should not be linked. In our work, we do not perform pruning.</p>
          <p>
            There are other areas of research that can be considered relevant to the theme of this paper, including the work on ontology learning and knowledge base population. One of the state-of-the-art automatic knowledge extraction tools is FRED [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. This tool enables robust ontology learning and population (OL&amp;P) from natural language. Ontology learning is the task of acquiring a domain model from a given text; it therefore involves parsing natural language and extracting complex relations and concepts for the purpose of taxonomy induction. FRED performs the OL&amp;P task based on Discourse Representation Theory (DRT).
          </p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>3 Theoretical Model</title>
        <p>The overall objective of our work is, for a quantitative value in a short text, to find the property that best describes it and the concept that the property belongs to. We first describe the method for finding the best property and then explain how we identify its corresponding concept.</p>
        <sec id="sec-2-2-1">
          <title>3.1 Property Identification</title>
          <p>In order to find the most relevant property that accurately describes a numeric value, the first step is to identify the set of properties from the knowledge base that can potentially be related to that numeric value. Let us first provide a theoretical foundation for describing our work.</p>
          <p>Definition 1 (Textual Snippet). Let a textual snippet T = [w1 ... wk] be a string where wi (1 ≤ i ≤ k) is a word. We define T.dt = wj ∈ D and T.r = w(j-1) s.t. w(j-1) is a numeric value, 2 ≤ j ≤ k, and D is the set of all possible datatypes. Further, we define T.S to be the set of all concepts that are spotted in T.</p>
          <p>According to this definition, our objective is to annotate T.r with the most relevant property. For instance, for a textual snippet T such as "Motorola RAZR can support up to 64 MB", T.dt is "MB", which represents the megabyte datatype, and T.r is "64". (If numeric values are written in English words, we automatically convert them to numeric form before processing.) Furthermore, with the help of an automated semantic annotation system, one can find all the concepts relevant to T. For this example, T.S is {Motorola Razr, Megabyte, Secure Digital}. (In our work, we employ DBpedia as the source knowledge base; hence, the complete URI for a concept is of the form http://dbpedia.org/resource/Motorola_Razr.) We rely on an existing annotator to provide the values for T.S. Now, our task is to find an appropriate property for the value "64" from the list of properties in our knowledge base (e.g., DBpedia).</p>
          <p>Definition 2 (Knowledge Base). Let KB = {c1, ..., cn} be a knowledge base, where ci (1 ≤ i ≤ n) is a concept and ci.P = {(p1, v1), ..., (pm, vm)}, where (pj, vj) (1 ≤ j ≤ m) represents a property-value pair for concept ci.</p>
          <p>For instance, for a concept such as "Motorola A1000" in DBpedia, one can find a set of property-value pairs such as {(type, Device), (operatingsystem, "Symbian OS 7.0 + UIQ 2.1"), (storage, "24.0 megabyte"), ...}, among others. Based on Definitions 1 and 2, we formally specify the problem of property identification as follows:</p>
          <p>Definition 3 (Property Identification). For a knowledge base KB = {c1, ..., cn} and a textual snippet T, let Pc = {p | (p, v) ∈ c.P} be the set of all properties for concept c. The set of all possible properties in our knowledge base is defined as UP = ∪(c ∈ KB) Pc. The objective is to find the most relevant property p ∈ UP for T.r.</p>
          <p>In the context of the earlier example, our goal would be to find a relevant property for "64", which in this case would be "memory" or "storage". As the first step, we select a set of concepts from the knowledge base such that they contain appropriate properties for T.r.</p>
          <p>Definition 4 (Candidate Concepts). For a textual snippet T, the candidate concept set is defined as C(T) = {c | c ∈ KB, ∃(p, v) ∈ c.P s.t. v.dt = T.dt}, where v.dt denotes the datatype of v.</p>
          <p>According to this definition, the candidate concept set includes all concepts that have at least one property with a value whose datatype is equivalent to T.dt. In our running example, the concept "Motorola A1000" would be in the candidate concept set, since it has the datatype "megabyte" in the value of one of its properties. In order to choose the best concepts from the members of the candidate concept set, a ranking function is required. We rank the members of the candidate concept set based on their distance to the spots in T.S.</p>
          <p>Definition 5 (Concept Distance). For a concept c and a textual snippet T, a distance function is defined as follows:</p>
          <p>dist(c, T) = √( Σ(s ∈ T.S) ( ρ(s) / (r(c, s) + ε) )² )   (1)</p>
          <p>where the semantic relatedness of two concepts c1 and c2 is represented as r(c1, c2) (we benefit from the TagMe Relatedness API for this purpose in our experiments), ρ is the function that returns the confidence score of a mentioned concept in the text (provided by the annotator), and ε is a very small constant for the case where r(c, s) = 0.</p>
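          <p>The distance of Definition 5 can be sketched directly. Here rel and conf are hypothetical stand-ins for the annotator-provided relatedness r and confidence score ρ, and the summand is read as the squared ratio of confidence to relatedness:</p>
          <preformat>
```python
import math

EPS = 1e-9  # small constant used when r(c, s) = 0


def dist(c, spots, rel, conf):
    """Sketch of Equation (1): square root of the summed squared
    ratios of spot confidence to (relatedness + EPS)."""
    return math.sqrt(
        sum((conf[s] / (rel(c, s) + EPS)) ** 2 for s in spots)
    )
```
          </preformat>
          <p>For a single spot with confidence 0.9 and relatedness 0.5 to c, the distance is 0.9 / 0.5 = 1.8; lower values indicate concepts closer to the spotted context.</p>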
          <p>Table 1 shows a number of concepts and their distances to the spots in the context of the earlier example. We rank the concepts in the candidate concept set using the distance function in Definition 5 and hypothesize that less distant concepts have a higher probability of including relevant properties for our purpose. Therefore, we select the top-k concepts from the candidate concept set, denoted Top Concepts (TC). Based on the top-k concepts, the properties of concepts in TC that have a datatype equal to T.dt form the candidate property set, defined as follows:</p>
          <p>Definition 6 (Candidate Properties). For a textual snippet T and top concepts TC, the candidate property set is defined as CP(TC) = {p | c ∈ TC, (p, v) ∈ c.P, v.dt = T.dt}.</p>
          <p>In order to find the best related property from the candidate property set, we perform a statistical analysis over all observed values of each property in CP(TC) to determine which property is most likely to have the numeric value T.r. To analyze the values of each property, we first build a set called the Number Set.</p>
          <p>Definition 7 (Number Set). For a textual snippet T and a given property p, the Number Set is defined as NS(p, T) = {v | c ∈ KB, (p, v) ∈ c.P, r(v.dt, T.dt) > θ}.</p>
          <p>The Number Set represents the set of all numerical values for a specific property observed in the knowledge base, as long as the datatype of each value has a semantic similarity score above a threshold θ with the datatype of the value that we are annotating (T.dt). Based on the Number Set, we calculate the relevance probability for a given property through its Cumulative Distribution Function (CDF). For the CDF, we assume that a Number Set has a Gaussian distribution.</p>
          <p>Definition 8 (CDF). For a random variable R, we have Pr[R ≤ T.r] = CDF(T.r). So, Pr[T.r − ΔT.r &lt; R &lt; T.r + ΔT.r] = CDF(T.r + ΔT.r) − CDF(T.r − ΔT.r), where ΔT.r = T.r/100. Therefore, for a property p and a numeric value T.r, Pr(p, T.r) = CDF(NS(p, T), T.r + ΔT.r) − CDF(NS(p, T), T.r − ΔT.r).</p>
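          <p>Definition 8 fits a Gaussian to the Number Set and scores T.r by the probability mass in a window of plus or minus one percent around it. A standard-library sketch (math.erf gives the normal CDF); the sample value lists are hypothetical:</p>
          <preformat>
```python
import math
import statistics


def gauss_cdf(x, mu, sigma):
    # CDF of a Gaussian with mean mu and standard deviation sigma.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def relevance(ns, t_r):
    """Pr(p, T.r) = CDF(T.r + d) - CDF(T.r - d) with d = T.r / 100,
    assuming the Number Set ns is Gaussian-distributed."""
    mu = statistics.mean(ns)
    sigma = statistics.stdev(ns)
    d = t_r / 100.0
    return gauss_cdf(t_r + d, mu, sigma) - gauss_cdf(t_r - d, mu, sigma)


memory_values = [16.0, 32.0, 64.0, 64.0, 128.0]  # hypothetical "memory" values
year_values = [1998.0, 2001.0, 2004.0, 2007.0]   # hypothetical "year" values
# A value of 64 is far more plausible under "memory" than under "year".
print(relevance(memory_values, 64.0) > relevance(year_values, 64.0))  # True
```
          </preformat>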
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <p>In our experiments, we set k to 10 and the similarity threshold θ to 0.5.</p>
      <p>The CDF for a property p and T.r gives the probability that property p is a suitable representation for T.r. Table 2 shows a set of properties and their CDF values for the above example, where the numeric value 64.0 was considered.</p>
      <p>Based on the ranking provided by the CDF function, we are able to determine the property that best matches T.r. Algorithm 1 details the proposed approach for finding the best property describing a quantitative value mentioned in the input text. Lines 2-8 show how the candidate concept set (C) is built. C is a subset of KB whose members (concepts) have a property value that includes the datatype of interest. After identifying the candidate concepts, we find the top concepts (TC). TC is formed by taking the top-k members of C based on the ranking function in Definition 5 (line 9). Lines 10-16 show the process of forming the candidate property set (CP). For every concept in the top concept set, all numeric-valued properties of the concepts whose datatype is close to T.dt are chosen for CP. Finally, the property that has the highest probability of having T.r as its value is identified as the property of interest (line 17).</p>
      <p>Algorithm 1 IdentifyProperty(TextualSnippet T)</p>
      <sec id="sec-3-1">
        <title>3.2 Concept Identification</title>
        <p>Now, given that the most relevant property for the numeric value has been identified, the objective in this phase is to find the most relevant concept mentioned in the text which, either directly or through inference, has the property identified in the previous step. Let the identified property of the numeric value in our knowledge base be P. The objective is to find a subject for P with T.r as its object. Note that the desired concept may or may not have the predicate (property) explicitly assigned to it. For example, Ford XT Falcon is a concept in the category Ford Falcon, and it has the property "weight" in our knowledge base. However, Ford Fairmont (Australia) is also in the category Ford Falcon but does not have the property "weight". We are interested in all the concepts in T.S that can potentially have P as one of their properties, which may not be direct but can be derived through hierarchical subclass inference.</p>
        <p>Definition 9 (Candidate Mentions). For a textual snippet T and a property P, the candidate mention set is defined as CM(T, P) = {c | c ∈ T.S, (P, v) ∈ c.P}.</p>
        <p>In case the candidate mention set is empty (none of the mentioned concepts has the property P), we search for similar concepts in the knowledge base that have P as a property. A mention is considered a candidate if there is at least one concept in the knowledge base that has the property P and shares at least one of the mention's categories, as expressed in DBpedia's hierarchical concept categories. For example, the categories of Ford XT Falcon are Vehicles introduced in 1968, Cars of Australia, and Ford Falcon. In order to identify the related concepts based on shared categories, we define the Related Concepts set as follows:</p>
        <p>Definition 10 (Related Concepts). For a concept s and a property P, the related concept set is defined as RC(s, P) = {c | c ∈ KB, (P, v) ∈ c.P, cat(c) ∩ cat(s) ≠ ∅}, where cat is a function that returns the set of all DBpedia categories of a concept.</p>
        <p>Algorithm 2 shows the procedure for identifying the best concept for P. First, if the candidate mention set is not empty, the concept in CM with the highest confidence (ρ) is selected (lines 1-3). Otherwise, we try to find, for each mention, related concepts that have the property P; if such a concept is found, the mention is added to CM (lines 5-9). CM is thus populated based on the related concepts. Finally, the best concept for property P is the one with the highest confidence value (ρ) in CM (line 10).</p>
        <p>As an example, in the text "Motorola RAZR can support up to 64 MB" that was mentioned earlier, "memory" was selected as the best property for 64. Based on this identified property, there is only one concept in T.S = {Motorola Razr, Megabyte, Secure Digital}, namely Motorola Razr, that has "memory" as a property. Therefore, Motorola Razr would be the selected concept. In case more than one mention has the identified property, the one with the highest confidence is selected.</p>
        <p>Algorithm 2 IdentifyConcept(TextualSnippet T, Property P)
1: if CM(T, P) is not empty then
2:   return arg max(c ∈ CM(T, P)) ρ(c)
3: end if
4: CM ← ∅
5: for s ∈ T.S do
6:   if RC(s, P) is not empty then
7:     add s to CM
8:   end if
9: end for
10: return arg max(c ∈ CM) ρ(c)</p>
        <p>Now let us suppose that in the above example "storage" was selected instead of "memory". In this case, the candidate mention set would be empty, because none of the members of T.S has the "storage" property. Therefore, we need to consider the concepts related to the concepts in T.S. Here, there is only one concept in T.S, i.e., Motorola Razr, that has a non-empty related concept set. This is because we are able to find concepts such as Motorola Rokr that share a common DBpedia category with Motorola Razr, i.e., Motorola mobile phones, and at the same time have the "storage" property. Therefore, our proposed algorithm identifies Motorola Razr as the concept and "storage" as the property for the numeric value 64.</p>
        <sec id="sec-3-1-1">
          <title>4 Experimental Results</title>
          <p>In order to evaluate our work, we first developed a gold standard dataset of sentences that contain quantitative values. Existing datasets used for evaluating semantic annotator systems were not suitable, as they do not provide gold standard annotations for numeric values. Therefore, we recruited a group of ten Computer Science graduate students at the MSc and PhD levels, all of whom had prior experience working with semantic annotator systems, to collect and annotate the gold standard dataset. The recruited graduate students were given a set of suggested concept-property pairs and were asked to collect descriptive sentences about each concept-property pair such that the sentences included quantitative content describing the desired property of the desired concept. Since our knowledge base (DBpedia) does not contain much numerical information about concepts, we provided the participants with the suggested concept-property pairs to make sure that the collected gold standard would consist of concepts that exist in the knowledge base. Since the recruited graduate students were given a set of suggested concept-property pairs, there were no overlaps between the sentences they collected. The concept-property pairs were chosen so that they cover various domains, including electronics, motor vehicles, movies &amp; music, geographical locations, famous people and food. As a final step, all the collected gold standard content was processed by the TagMe semantic annotator and the extracted concepts were stored in the gold standard.</p>
          <p>The developed gold standard dataset consists of 165 separate entries. Across the whole dataset, there are 1,225 unique concepts extracted by TagMe, with 9.85 mentioned concepts per entry on average. Each entry was selected such that TagMe could find at least one spot in it.</p>
          <p>With regards to DBpedia, in our experiments we used DBpedia 3.8, locally installed on a MongoDB server, and specifically exploited the "properties" collection, which has over 130 million subject-predicate-object triples. One of our observations when working with DBpedia was that, although DBpedia is a great source of information, it does not provide substantial reliable numeric data. In other words, many of the properties that need to have numeric values are missing, or have incorrect or too-generic datatypes associated with them. Given that DBpedia does not enforce a schema, we believe one of the areas in which this knowledge base can be improved is its quantitative values.</p>
          <p>Based on the gold standard, our objective was to identify the correct concept and property for each of the quantitative values in the dataset entries. The experiments were run on a machine with a 3.20 GHz CPU and 8 GB RAM. Table 3 (in Appendix) shows some sample entries and the corresponding concept and properties that were identified. In this table, the mentioned entities are the spotted concepts extracted by TagMe. The predicted property and concept are those identified by our method for the highlighted numeric value in that dataset entry. As an example, in the first entry, fuelCapacity (http://dbpedia.org/property/fuelCapacity) is identified as the best property and Honda Gyro (http://dbpedia.org/resource/Honda_Gyro) as the best concept for the numeric value 5.0 L in the entry.</p>
          <p>The experiments on the gold standard show an accuracy of 73% for predicting the correct property and 72% for identifying the correct concept. It should be noted that, since concept identification depends on the performance of the property detection method in our work, when the property was correctly identified, the concept was also identified accurately in 87% of the cases.</p>
          <p>One of the areas that we plan to investigate to further improve the
performance of our work is to contextualize the consideration of properties with
DBpedia categories. In other words, we intend to first identify the set of
categories that a given input text belongs to, and then only consider property values
of concepts within those categories when predicting the correct property.
For example, if the input text is mainly about automobiles and a candidate
property is "length", we would only consider values of "length" within concepts
related to automobiles rather than the "length" of irrelevant concepts such as rivers
or cellphones. The dataset is publicly available at
http://ls3.rnet.ryerson.ca/people/mehrnaz/dataset.xlsx.</p>
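<p>The planned category filtering described above can be sketched as follows. This is a hypothetical illustration, assuming category memberships for concepts have already been looked up; the category names and data are invented stand-ins for DBpedia category assignments.</p>

```python
# Invented concept-to-category assignments standing in for DBpedia data.
CONCEPT_CATEGORIES = {
    "Honda_Gyro": {"Automobiles"},
    "Amazon_River": {"Rivers"},
}

def filter_by_category(candidates, text_categories):
    """Keep only candidate (concept, property, value) triples whose concept
    belongs to at least one category detected for the input text."""
    return [
        (concept, prop, value)
        for concept, prop, value in candidates
        if CONCEPT_CATEGORIES.get(concept, set()) & text_categories
    ]

candidates = [
    ("Honda_Gyro", "length", 1.6),
    ("Amazon_River", "length", 6400.0),
]
print(filter_by_category(candidates, {"Automobiles"}))
# [('Honda_Gyro', 'length', 1.6)]
```

<p>With the text categorized as being about automobiles, the "length" of a river is discarded before property prediction, which is exactly the pruning the paragraph above proposes.</p>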
        </sec>
        <sec id="sec-3-1-2">
          <title>Concluding Remarks</title>
          <p>In this paper we have proposed a technique for semantically annotating
quantitative values in textual content. To the best of our knowledge, our work is among
the first to consider the semantic annotation of numerical values and connecting
them to appropriate properties in an external knowledge base such as DBpedia.
While we reach an overall accuracy of 73% on the gold standard, there is one
main limitation of our work that we will address in future work:
the core assumption of our work is that a numeric value is followed by a unit
measure (datatype), e.g., 5.0 L. However, in many real-world cases no such unit
measure appears after the numeric value. We are interested in predicting
the unit measure of a numeric value based on its context.</p>
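<p>The assumption of a trailing unit measure amounts to spotting number-unit pairs such as "5.0 L" in the input. A minimal sketch of that spotting step is shown below; the unit list is an illustrative assumption, not the paper's actual unit lexicon.</p>

```python
import re

# A number optionally followed by whitespace and a unit token. The unit
# alternatives here are examples only.
UNIT_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(L|kg|km/h|mm|hp)\b")

def extract_quantities(text):
    """Return (value, unit) pairs for every number-unit mention in `text`."""
    return [(float(value), unit) for value, unit in UNIT_PATTERN.findall(text)]

print(extract_quantities("The Honda Gyro has a 5.0 L fuel tank."))
# [(5.0, 'L')]
```

<p>Predicting the unit when it is absent, as proposed above, would replace this surface pattern with a context-based classifier.</p>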
          <p>[Table 3 (Appendix): sample dataset entries with their spotted entities and the predicted property and concept for each highlighted numeric value; the rotated table layout could not be recovered from the extracted text.]</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Jovanovic</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bagheri</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cuzzola</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gasevic</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jeremic</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Bashash</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <source>Automated Semantic Tagging of Textual Content</source>
          , IT Professional,
          <volume>16</volume>
          (
          <issue>6</issue>
          ),
          <fpage>38</fpage>
          -
          <lpage>46</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cornolti</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferragina</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ciaramita</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2013</year>
          , May).
          <article-title>A framework for benchmarking entity-annotation systems</article-title>
          .
          <source>In Proceedings of the 22nd international conference on World Wide Web</source>
          (pp.
          <fpage>249</fpage>
          -
          <lpage>260</lpage>
          ).
          <source>International World Wide Web Conferences Steering Committee.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Mendes</surname>
            ,
            <given-names>P. N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jakob</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garc</surname>
            a-Silva,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2011</year>
          , September).
          <article-title>DBpedia spotlight: shedding light on the web of documents</article-title>
          .
          <source>In Proceedings of the 7th International Conference on Semantic Systems</source>
          (pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          ). ACM.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Milne</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I. H.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>An open-source toolkit for mining Wikipedia</article-title>
          .
          <source>Artificial Intelligence</source>
          ,
          <volume>194</volume>
          ,
          <fpage>222</fpage>
          -
          <lpage>239</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hoffart</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yosef</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bordino</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          , Furstenau, H.,
          <string-name>
            <surname>Pinkal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spaniol</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , ... &amp;
          <string-name>
            <surname>Weikum</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2011</year>
          , July).
          <article-title>Robust disambiguation of named entities in text</article-title>
          .
          <source>In Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>
          (pp.
          <fpage>782</fpage>
          -
          <lpage>792</lpage>
          ).
          Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ratinov</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Downey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Anderson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2011</year>
          , June).
          <article-title>Local and global algorithms for disambiguation to wikipedia</article-title>
          .
          <source>In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume</source>
          <volume>1</volume>
          (pp.
          <fpage>1375</fpage>
          -
          <lpage>1384</lpage>
          ).
          Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Yi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Information organization and retrieval using a topic maps-based ontology: results of a task-based evaluation</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          ,
          <volume>59</volume>
          (
          <issue>12</issue>
          ),
          <fpage>1898</fpage>
          -
          <lpage>1911</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Weld</surname>
            ,
            <given-names>D. S.</given-names>
          </string-name>
          (
          <year>2010</year>
          , July).
          <article-title>Open information extraction using Wikipedia</article-title>
          .
          <source>In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics</source>
          (pp.
          <fpage>118</fpage>
          -
          <lpage>127</lpage>
          ).
          Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Petasis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karkaletsis</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paliouras</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krithara</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Zavitsanos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2011</year>
          , January).
          <article-title>Ontology population and enrichment: State of the art</article-title>
          .
          <source>In Knowledge-driven multimedia information extraction and ontology evolution</source>
          (pp.
          <fpage>134</fpage>
          -
          <lpage>166</lpage>
          ). Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Ferragina</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Scaiella</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          (
          <year>2010</year>
          , October).
          <article-title>Tagme: on-the-fly annotation of short text fragments (by wikipedia entities)</article-title>
          .
          <source>In Proceedings of the 19th ACM international conference on Information and knowledge management</source>
          (pp.
          <fpage>1625</fpage>
          -
          <lpage>1628</lpage>
          ). ACM.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Presutti</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Draicchio</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gangemi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Knowledge extraction based on discourse representation theory and linguistic frames</article-title>
          .
          <source>In Knowledge Engineering and Knowledge Management</source>
          (pp.
          <fpage>114</fpage>
          -
          <lpage>129</lpage>
          ). Springer Berlin Heidelberg.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>