      Improving Ontology Service-Driven
            Entity Disambiguation
         A. Patrice SEYED a Zachary FRY b and Deborah L. MCGUINNESS b
                              a 3M HIS, Silver Spring, MD
          b Rensselaer Polytechnic Institute, Department of Computer Science,

                        Tetherless World Constellation, Troy, NY

            Abstract.
One of the long-standing challenges in natural language processing is uniquely identifying entities in text, which, when performed accurately and with formal ontologies, supports efforts such as semantic search and question-answering. With
            the recent proliferation of comprehensive, formalized sources of knowledge (e.g.,
            DBpedia, Freebase, OBO Foundry ontologies) and advancements in supportive Se-
mantic Web technologies and services, leveraging such resources “off the shelf” to address the entity disambiguation problem within industrial natural language processing pipelines becomes a more viable proposition. In this pa-
            per, we evaluate this viability by building and evaluating an entity disambiguation
            pipeline founded on publicly available ontology services, namely those provided
            by the NCBO BioPortal. We chose BioPortal due to its current use as an ontology
            repository and provider of ontological services for the biomedical informatics com-
            munity. To consider its usage outside the biomedical domain, and given our imme-
            diate project goal for facilitating semantic search over Earth science datasets for the
            DataONE project, we focus on the disambiguation of geographic entities. For this
            work, we leverage NCBO’s Term service in conjunction with NCBO’s entity dis-
            ambiguation service, the Annotator, to demonstrate an enhancement of the Annota-
            tor service, through application of a vector space model representation of ontolog-
            ical entities and relationships to drive scoring improvements. This work ultimately
            provides a methodology and pipeline for improving publicly available ontology
            service-based entity disambiguation, demonstrated through an enhanced version of
            the NCBO Annotator service for geographic named entity disambiguation.

            Keywords. entity disambiguation, ontology, geospatial




Introduction

One of the long-standing challenges in natural language processing is uniquely identifying entities in text, which, when performed accurately and with formal ontologies, supports efforts such as semantic search and question-answering. Semantic search would greatly facilitate our project, the Data Observation Network for Earth (DataONE), which aims to limit the excessive time and effort spent to discover, acquire, interpret, and use related biological, ecological, environmental, and Earth science data.1
1   http://dataone.org
The DataONE metadata catalog is composed of content uploaded by participating research institutions and includes a keywords field populated by data managers; these keywords, and keyword-“worthy” terms in the scientific abstracts, are not yet linked to domain knowledge in a formal way to aid discovery, beyond what exists as disparate controlled vocabularies. Here, the use of publicly available formal ontologies to disambiguate terms of relevance for search provides advanced capabilities for precise search and a “free extension” of the existing vocabularies that does not require manual development.
      With the recent proliferation of comprehensive, formalized sources of knowledge
(e.g., DBpedia,2 Freebase,3 OBO Foundry ontologies4 ) and advancements in supportive
Semantic Web technologies and services, leveraging these resources “off the shelf” to address the entity disambiguation problem within industrial natural language processing pipelines becomes a more viable proposition. In this paper, we evaluate this
viability by building and evaluating a novel entity disambiguation pipeline founded on
publicly available ontology services, namely those provided by the NCBO BioPortal. We
chose BioPortal due to its central role as an ontology repository and provider of ontological services for the biomedical informatics community. To consider its
usage outside the biomedical domain, and given our immediate project goal for facilitat-
ing semantic search over Earth science datasets for the DataONE project, we focus on
the disambiguation of geographic entities, using the Gazetteer Ontology (GAZ).
      In this work, we use NCBO’s Term service in unison with NCBO’s own entity dis-
ambiguation service, the NCBO Annotator, to demonstrate an enhancement over the An-
notator service, using a vector space representation of ontological entities and relation-
ships to drive scoring improvements. Fittingly then, we evaluate our results against the
NCBO Annotator, and apply the TopN scoring method, since both systems are suited
to provide inputs to a semi-automated semantic-mapping workbench environment. Ul-
timately, this work evaluates whether our techniques improve on the TopN mapping accuracy of the Annotator service, including whether introducing all domain-level relationships into the vector space provides additional discriminatory power over just those relationships considered “broader”, while at the same time providing insights into the process of using and augmenting publicly available ontology-service-driven web services for entity disambiguation.
      Our overall approach extracts named entity labels from natural language text, and
where applicable, maps each to a resource of an ontology that best represents and disam-
biguates its meaning. The set of entities that can be disambiguated within our pipeline is flexible: it covers what the formal ontology community refers to as particulars or concepts (or types), what the Web Ontology Language (OWL) considers individuals or classes,5 and what the Semantic Web community at large, via the Resource Description Framework (RDF), refers to simply as resources.6 Thus we describe
our pipeline at the abstraction level of “resources”, and where appropriate for its current application we describe how it functions for geographic named entity disambiguation. In the following sections we describe background on the topic-based vector
2   http://dbpedia.org/
3   http://www.freebase.com/
4   http://www.obofoundry.org/
5   http://www.w3.org/TR/owl-ref/
6   http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/
space model algorithm we apply (Section 1), and our methodology and pipeline that utilizes these algorithms and services (Section 2). In Section 3 we evaluate results using the TopN scoring metric, comparing against the existing annotation service using an initial, hand-curated gold standard. In Section 4 we present qualitative findings that resulted from application of our pipeline, in Section 5 we discuss related work, and in Sections 6 and 7 we discuss future work and conclusions.


1. Background on eTVSM

In this section we formally present the definition and implementation of an eTVSM
model [10,5]. An eTVSM formalizes how resources represent named entity labels by
first including a TVSM of the ontology, and then representing entity labels within the
TVSM based on links between named entity labels and resources. The first step in build-
ing an eTVSM is to encode a graph representation of resources connected through a
given set of relationships into resource vectors that compose a TVSM. This graph repre-
sentation is based on candidate resources and their related resources. (We consider can-
didate resources those resources which are discovered by initial lexical matching and
potentially represent what a named entity label refers to.) For each resource, we consider P(r, k) to be the set of all resources at distance k from resource r. We construct a resource vector ~r as follows:
\[
\vec{t} = \Big\langle \sum_{k=0}^{\beta} \sum_{r_1 \in P(r,k)} \alpha^{k},\; \ldots,\; \sum_{k=0}^{\beta} \sum_{r_n \in P(r,k)} \alpha^{k} \Big\rangle, \qquad \vec{r} = \frac{1}{\lVert \vec{t} \rVert}\, \vec{t} \qquad (1)
\]


Each resource in the graph representation of an ontology is assigned an index from 1
to n, which means that each resource vector has size n and there are n resource vectors.
The vector for a given resource is calculated by assigning an exponentially declining weight
to connected resources and summing the total weights for each resource. For example,
resource i is assigned a weight of 1, all resources one node away from i add a weight of
0.5, resources two nodes away from i add a weight of 0.25, etc. The resulting vector is
normalized. In this way we assign a resource vector to each candidate resource. The
constant α is an exponential decay factor that determines how much weight resources
distant from the candidate contribute. [5] experiments with different decay values and identifies 0.9
as optimal for the DBpedia Ontology; in our work we apply an exponential decay
constant of 0.5 and leave this analysis for future work. The constant β limits the maximum
distance between resources that we wish to consider; in our case β is 5.
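As a minimal illustration, the resource vector construction could be sketched in Python as follows, assuming the collected ontology portion is held in a networkx graph and that "distance" means shortest-path hop count; the function and variable names are ours, not part of any NCBO service.

import networkx as nx
import numpy as np

def resource_vectors(graph, alpha=0.5, beta=5):
    # Build a TVSM resource vector for every resource (node) in the graph.
    # Following Equation 1, the component of ~r for resource j accumulates
    # alpha**k, where k is the hop distance from r to j (up to beta hops),
    # and the resulting vector is normalized to unit length.
    nodes = list(graph.nodes())
    index = {node: i for i, node in enumerate(nodes)}
    vectors = {}
    for r in nodes:
        t = np.zeros(len(nodes))
        distances = nx.single_source_shortest_path_length(graph, r, cutoff=beta)
        for other, k in distances.items():
            t[index[other]] += alpha ** k
        vectors[r] = t / np.linalg.norm(t)  # ~r = ~t / ||~t||
    return index, vectors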
Next, an interpretation vector is constructed from an interpretation i, a unique mapping of a named entity label to a resource, by multiplying the corresponding resource vector by an ambiguity weight:

\[
\vec{i} = \vec{r}\, g(i) \qquad (2)
\]

In order to reduce the effects of ambiguous mappings (i.e., labels that map to multiple resources), we weight each interpretation vector by an ambiguity weight g(i), defined as:

\[
g(i) = \frac{1}{\bigl|\{\, j : j \in I(k),\ k \in K(i) \,\}\bigr|} \qquad (3)
\]
The ambiguity weight g(i) forces labels which are mapped to many resources to have
less weight toward disambiguating other named entity labels. Here, K(i) is the set of all
labels for interpretation i, and I(k) is the set of all interpretations derived from label k.
We represent the document as the collection of all resource mappings by summing the
weighted interpretation vectors:

\[
\vec{t}_{d} = \sum_{i \in I} w_{d,i}\, \vec{i} \qquad (4)
\]

Each interpretation vector is weighted by w_{d,i}, the frequency at which the named
entity label for interpretation i is found among the named entities extracted from document
d. The cosine similarity metric is used to compare individual interpretation vectors for
the document against the document vector, ~td. Incorrect interpretations will not signif-
icantly affect the document vector and result in a low cosine similarity score, whereas
interpretations that are more similar to other interpretations will result in a higher cosine
similarity.
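Equations 2-4 and the cosine comparison can be combined into a short sketch (again illustrative Python under the same assumptions as above; label_to_resources maps each label to its candidate resources, label_freq gives w_{d,i}, and interpretations is a list of (label, resource) candidate mappings).

import numpy as np

def ambiguity_weight(label, label_to_resources):
    # g(i) = 1 / (number of interpretations sharing this label), Equation 3.
    return 1.0 / max(1, len(label_to_resources[label]))

def document_vector(interpretations, vectors, label_to_resources, label_freq):
    # ~t_d = sum over interpretations of w_{d,i} * ~i, Equations 2 and 4.
    t_d = np.zeros(len(next(iter(vectors.values()))))
    for label, resource in interpretations:
        i_vec = vectors[resource] * ambiguity_weight(label, label_to_resources)
        t_d += label_freq[label] * i_vec
    return t_d

def cosine(a, b):
    # Cosine similarity between an interpretation vector and the document vector.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))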


2. Pipeline

We apply a three-phase pipeline for executing our entity disambiguation approach, in-
cluding NER, resource mapping, and eTVSM scoring. We describe each phase as a
black-box system, with inputs and outputs which link the phases together. Our overall
pipeline takes as input an unstructured text document, in the form of a science publication abstract from DataONE, and outputs a ranked list of candidate resources.
      The NER phase of our pipeline extracts named entities embedded in unstructured
text using the Natural Language Tool Kit (NLTK) [11]. This phase serves the sole pur-
pose of generating a set of labels to be processed for resource mapping in the next phase.
The NLTK software library contains NER algorithms that extract named entity labels
from unstructured text. It applies a series of tokenizers to parse sentences into terms,
and using part-of-speech tagging generates parse trees as input to supervised learning
algorithms for named entity detection. For each document passed through it, the NER phase of our pipeline outputs a list of named entity labels.
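A minimal sketch of this phase in Python, assuming the standard NLTK models (punkt, averaged_perceptron_tagger, maxent_ne_chunker, words) have been downloaded; the entity types kept here are our own choice for geographic text, not a fixed part of the pipeline.

import nltk

def extract_named_entity_labels(text):
    # Tokenize into sentences and words, POS-tag, chunk named entities,
    # and return the surface labels of the detected entities.
    labels = []
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        tree = nltk.ne_chunk(tagged)
        for subtree in tree.subtrees():
            if subtree.label() in ("GPE", "LOCATION", "FACILITY", "ORGANIZATION", "PERSON"):
                labels.append(" ".join(word for word, tag in subtree.leaves()))
    return labels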
      The resource-mapping phase of our pipeline processes a list of named entity labels
and returns a set of named entity label to resource mappings and an ontology portion
(i.e., a set of statements) describing each matched resource. In the context of geospatial
named entity recognition, there are multiple resources which are referred to by the same
preferred label annotation. For example, in the United States, the name Springfield refers to more than 40 different cities located in multiple states, and some states, such as Wisconsin, have multiple cities named Springfield. Therefore, the purpose of
the resource-mapping phase is to map each named entity label to those resources which
define a unique instance of that particular location with that name. For every named
entity label extracted from the NER phase, we query the NCBO Annotator Service for a
list of candidates, that is, resources which could potentially define that particular named
entity within an ontology.
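As a sketch, a candidate lookup might look as follows; the endpoint and JSON fields reflect the present data.bioontology.org REST API rather than the earlier Annotator version used at the time of this work, and the API key is a placeholder.

import requests

BIOPORTAL = "https://data.bioontology.org"
API_KEY = "YOUR_NCBO_API_KEY"  # placeholder; obtain a key from BioPortal

def candidate_resources(label, ontology="GAZ"):
    # Ask the Annotator for classes in the target ontology that lexically
    # match the named entity label, and return their URIs.
    response = requests.get(
        BIOPORTAL + "/annotator",
        params={"text": label, "ontologies": ontology, "apikey": API_KEY},
    )
    response.raise_for_status()
    return [item["annotatedClass"]["@id"] for item in response.json()]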
      Next, we obtain statements about the resources, including annotations as well as
object property and subclass statements, using the NCBO Term Service, by supplying a
resource URI and ontology identifier as parameters to the service.7 The object property
and subclass-based statements are filtered according to a preselected set of properties,
where the objects of these statements are recursively submitted to the Term Service. In this
way, we perform a breadth-first search along a set of properties until the final resource
in the path is reached. This phase concludes when the statements are collected for each
candidate resource.
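The recursive statement collection amounts to a breadth-first traversal along the preselected properties; a sketch follows, where fetch_statements is a hypothetical wrapper around the NCBO Term Service call that returns (subject, property, object) triples for a resource URI.

from collections import deque

def collect_statements(seed_resources, properties, fetch_statements, max_depth=5):
    # Breadth-first search over the ontology graph: follow only the preselected
    # properties (e.g., 'located in', subclass-of), submitting each newly
    # reached object back to the Term Service until max_depth is reached.
    statements = []
    visited = set(seed_resources)
    queue = deque((resource, 0) for resource in seed_resources)
    while queue:
        resource, depth = queue.popleft()
        for s, p, o in fetch_statements(resource):
            if p not in properties:
                continue
            statements.append((s, p, o))
            if o not in visited and depth + 1 <= max_depth:
                visited.add(o)
                queue.append((o, depth + 1))
    return statements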
     The eTVSM scoring phase scores each entity label-to-resource mapping in a given
document by first generating a TVSM of the obtained ontology graph, and then encoding
the resource mappings into an eTVSM. We use the cosine similarity metric to determine
how strongly one candidate resource is related to all candidate resources for a given doc-
ument [5]. Given an eTVSM that encodes each mapping of a named entity label to can-
didate resources into a vector space, closely related candidate resources will have higher
scores than those that are only indirectly related, or not related at all, to other candidate
resources. Ultimately, the eTVSM scoring phase provides a quantitative measure for how
strongly a resource disambiguates a named entity label.
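Putting the phases together, the scoring step reduces to ranking each candidate by its cosine similarity to the document vector; the following sketch reuses the helper functions from the illustrative code in Section 1 (our own names, not part of any NCBO service).

def rank_candidates(interpretations, vectors, label_to_resources, label_freq):
    # Score each (label, resource) mapping against the document vector and
    # return, per label, its candidates ordered from strongest to weakest.
    t_d = document_vector(interpretations, vectors, label_to_resources, label_freq)
    ranked = {}
    for label, resource in interpretations:
        i_vec = vectors[resource] * ambiguity_weight(label, label_to_resources)
        ranked.setdefault(label, []).append((resource, cosine(i_vec, t_d)))
    return {label: sorted(cands, key=lambda pair: pair[1], reverse=True)
            for label, cands in ranked.items()}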


3. Evaluation

In this section we introduce a small hand-annotated dataset which we use as a gold standard for comparing our results against the existing NCBO Annotator Service. We score output from the NCBO Annotator Service and from our pipeline using the TopN scoring metric. We demonstrate how our approach contributes an enhancement to the NCBO Annotator Service for geographic named entity disambiguation using the eTVSM algorithm. The process by which we created our gold standard is described below.
     First, we selected scientific paper abstracts by processing each abstract through our
pipeline and selecting those which have at least one named entity label that has greater
than 10 candidate concept mappings. This decision was due to a lack of preexisting gold
standard data and our desire to disambiguate highly ambiguous named entity labels. We hand-annotated each named entity label with a concept in the Gazetteer Ontology which correctly identifies it. The result is 24 unique named entity labels, which were extracted using the NER phase of our pipeline. Of these 24 named entity labels, 18 have correct candidate concepts contained in the GAZ ontology, which we discovered by querying the Annotator Service with our named entity labels. The remaining six named
entity labels were either too generic for us to assign a unique concept, or the correct con-
cepts are not contained in the most current version of GAZ; in order to get a meaningful
comparison of our pipeline against the NCBO Annotator Service we excluded these six labels from our evaluation. The remaining 18 named entity labels were hand-mapped to concepts in
GAZ, composing our gold standard dataset which we use in our evaluation.
     In the setting of a semantic-mapping workbench, when a prospective curator inspects
the list of candidate concepts for a set of named entities extracted from a document, the
most accurate or correct concept will ideally appear at or near the top of a ranked list
of candidate concepts. In fact, the Annotator website presents candidate concepts in this
ranked manner. To be fair, there are just two levels of ranking produced by the Annotator
7   We describe the atoms of the ontology portion extraction in terms of statements, to maintain our abstrac-
    tion at the level of resources. In the case of named entity disambiguation these statements are a mix of
    assertions and axioms, and in the case of conceptual entities these statements are primarily axioms.
service, determined by whether a match was made on the preferred or alternate term,
while our approach uses a much more granular similarity score metric. Still, we evaluate
our augmented version of the annotator alongside the publicly available Annotator Ser-
vice in order to demonstrate improvement upon the existing service. Therefore we evalu-
ate the two approaches using the position of the correct concept returned in a ranked list
of concept-mapping scores. To do this, we use the TopN scoring metric [12,5]:
\[
\mathrm{TopN} = \frac{\sum_{j=1}^{N} \alpha_{c}^{\,k_{j}}}{\sum_{i=1}^{N} \alpha_{c}^{\,i}} \qquad (5)
\]

where N is the number of named entity labels which are mapped to a concept in the
gold standard, k_j is the position of concept j in the list of concepts returned from the Annotator
Service, and α_c is an exponential decay coefficient used for penalizing concepts that appear
later in the list. A score of 1 is realized for a document if, for all named entity labels,
the top-scoring candidate concept corresponds to the gold standard's concept for that
document. Note that since we only consider concepts that are in our gold standard and,
by definition, are returned by the Annotator Service, the TopN score can never have a
value of 0. However, if all the correct concepts for each entity are returned at the bottom
of the list, the TopN score will be very small. We chose a value of 0.8 for our
constant α_c, so, for example, the position of the correct concept will contribute approximately 0.1 to the
score if it appears in the tenth position [5].
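For concreteness, the TopN computation as reconstructed in Equation 5 can be written as a few lines of Python (positions holds the 1-based rank of the gold-standard concept for each of the N labels; the function name is ours).

def top_n_score(positions, alpha_c=0.8):
    # Equation 5: reward gold-standard concepts that appear early in the
    # ranked candidate list, with exponentially decaying credit per position.
    numerator = sum(alpha_c ** k for k in positions)
    denominator = sum(alpha_c ** i for i in range(1, len(positions) + 1))
    return numerator / denominator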
     For the evaluation of our pipeline, we construct our vector space using two methods.
The first construction uses located in, while the second uses all (13) relationships con-
tained in the GAZ ontology. This comparison provides some evidence that including rela-
tionships beyond those considered generalizations improves scoring. Table 1 (see footnotes 8 and 9) shows the
comparison between the NCBO Annotator Service and our pipeline as constructed using
the two methods. These results demonstrate that the application of the eTVSM-scoring
phase outperforms the Annotator Service for disambiguating geographic named entities.
The results demonstrate that a quantitative approach for disambiguation which measures
and ranks the strength of each concept mapping outperforms an approach which relies
on simple lexical matching.


4. Qualitative Findings

In this section we present some qualitative insights from applying our pipeline to a sample scientific publication abstract. Figure 1 (see footnote 10) shows a graph output of our pipeline as applied
to a sample scientific abstract. In this run we included all geographic relationships from
GAZ during the concept-mapping phase.
     We confirmed that our ontology-based resource mapping approach successfully
down-ranks irrelevant resources and thus outperforms purely lexical-based resource map-
8    A version of this paper with figures included is available at
     http://tw.rpi.edu/web/doc/ImprovingOntologyServiceDrivenEntityDisambiguation/.
9    https://www.flickr.com/photos/127739444@N02/15066654769/
10   https://www.flickr.com/photos/127739444@N02/15230553716/
pings. For instance, where only lexical matching algorithms are used (e.g., the current Annotator Service), the entity label ‘Oregon’ yields equal rankings for “State of Oregon (GAZ:00002515)” and the resources “Oregon (GAZ:22225751)” and “Oregon (GAZ:00084619)” (cities in Michigan and Illinois). Our approach leverages fine-grained
geographic relationships and considers relationships with other mapped named entities
mentioned in the same abstract (e.g., Deer Creek, Josephine County), so that the similar-
ity scores of the mismatches for ‘Oregon’ are significantly lower than that of the correct resource.
      We also identified improvements to the scoring of correct resources after applying
additional relationships in the resource-mapping phase of our pipeline, beyond those
considered generalizations (e.g., located in). For example, for the abstract of the fourth
study of our gold standard, the named entity label ‘Cumberland River’ was mapped to
three distinct candidates. When only the located in relationship is applied, the two incor-
rect resources received a score while the correct resource (GAZ:00150754) did not (see the left side of Table 2, linked in footnote 11); however, when applying all geographic relationships, the correct
resource accurately received the highest score. This is due to relationships to other re-
sources, via the inclusion of additional kinds of geographic relationships, that have been
mapped to other named entity labels extracted from the abstract (shown in Figure 2, linked in footnote 12).
      Finally, we learned that our pipeline improves the quality of ontologies available
through BioPortal, by helping curators identify and report ex-
isting gaps. For example, in an abstract that mentions the Deer Creek Field Station and
Educational Center of the state of Oregon, the named entity label ‘Deer Creek’ returned
29 unique resources labeled as ‘Deer Creek’, none of which were the correct one. We in-
formed the GAZ team, who quickly created the resource and appropriate statements, in-
creasing coverage of the ontology. Figure 1 illustrates the results of the pipeline after the
newly added resource “Deer Creek (GAZ:00633440)” was included, which became the
highest ranking candidate resource for ‘Deer Creek’ due to relationships to “Josephine
County” and “State of Oregon”.


5. Related Work

In this section we discuss recent work that applies ontologies to the named entity recog-
nition (NER) problem, including that which the current work uses and builds upon:
the NCBO Annotator and enhanced topic-based vector space modeling (eTVSM). Re-
searchers at the BBC experimented with eTVSMs [10] to automatically apply editor tags
to archived radio programs for use in a manual curation environment [5]. Concepts con-
tained in the DBpedia ontology were represented in a topic-based vector space model
(TVSM), a model constructed by creating vectors for each concept that include those
concepts related by SKOS broader13 . The eTVSM was built by linking text transcribed
from radio programs to concepts in the DBpedia Ontology,14 scoring each link using the
relationships between concepts that were encoded in the TVSM. Links that were closely
related scored higher, while incorrect links which were not as closely related scored
lower. Our work reuses the same underlying theory for using a vector space model for
11   https://www.flickr.com/photos/127739444@N02/15066727059/
12   https://www.flickr.com/photos/127739444@N02/15253595325/
13   http://www.w3.org/2004/02/skos/
14   http://wiki.dbpedia.org/Ontology
disambiguation, and additionally explores the benefits of using relationships more ex-
plicit than broader, to take advantage of knowledge beyond that found in a generalization
hierarchy, formalized by expert curators.
The NCBO BioPortal project supports efforts to link unstructured text to ontologies through publicly accessible services for leveraging community-based ontologies [2].
The NCBO Annotator Service matches inputted text to ontological terms contained in
community-developed ontologies by applying a lexical string matching algorithm to a
lexicon based on preferred and synonym labels [3,4].15 By default, the Annotator Ser-
vice is configured to consider all ontologies published through BioPortal; however, there
is a parameter for restricting it to a set of target ontologies. [3] highlights the additional
need for enhancing the service by developing components that use the knowledge in on-
tologies to recognize relationships between concepts, which is a focus of this paper. The
service returns a list of candidate concepts from the selected ontologies and provides
a score for each candidate concept based on whether the concept was matched on pre-
ferred label or synonym. Our approach builds on this service and ultimately creates an
enhanced version of it that quantitatively measures how well each candidate concept represents a named entity. Further, our approach and pipeline start by recognizing to-
kens in the text, while the Annotator spots named entities using terms from the target
ontologies and supporting lexicons. Therefore with our approach a curator is more easily
able to find and report gaps in the existing ontologies in a semantic-mapping workbench
setting, since extracted tokens are immediately available for inspection. We describe how we practically applied this mechanism in Section 4.
Aside from the BBC and NCBO efforts, there exists extensive previous research in the area of entity disambiguation leveraging ontologies or, more generally, linked data sources. Alexopoulos et al. [6] propose a disambiguation framework that utilizes DBpedia to detect the intended meaning of named entities (e.g., soccer clubs, organizations) in unstructured text, using an algorithm similar to [10]. Kleb et al. [8] focus on disambiguation using spreading activation on RDFS-based ontologies. Mendes et al. [9] provide disambiguation and mapping to DBpedia URIs within DBpedia Spotlight. Hoffart et al. [7] apply a novel collective disambiguation strategy based on a new form of coherence graph using DBpedia and YAGO. There are also many off-the-shelf concept extraction tools available (Open Calais,16 Zemanta,17 Alchemy API18 ); all of these approaches identify entities and generate URIs for them through disambiguation.
     Our work differs from these in that we focus on the practicality of using NCBO
Bioportal and its APIs as an ”off the shelf” resource for applying eTVSM for semantic
disambiguation within an NLP pipeline. BioPortal is of particular interest because the
ontologies registered with it include many that are developed by expert curators. The
benefit of our approach is that the more explicit relationships, and the resources they relate, are used for disambiguation and subsequently for fine-grained semantic search
capabilities. What results from our work is an enhanced version of the NCBO Annotator
for geographic entity disambiguation. Due to the Annotator’s wide usage, it provides
immediate utility to the community upon release.

15   http://bioportal.bioontology.org/annotator
16   http://www.opencalais.com/
17   http://www.zemanta.com/
18   http://www.alchemyapi.com/
6. Future Work

In future work, we will expand our gold standard to help determine whether a larger dataset further validates our results. Since this is a time-intensive process, we will seek external resources for
performing the work, such as Mechanical Turk.19 We will also leverage such resources
for tagging corpora for geographic named entities, which can be used for statistically
training the NLP tokenizer that we employ in the pipeline.
Our methodology and pipeline lay a foundation for widening their use to conceptual entities and other types of named entities. Therefore, in future work we will evaluate how well, in practice, our results generalize to disambiguating concepts and named entities in other domains, such as the biomedical domain. On the side of named entities, one requirement is to request that the NCBO add additional ontologies that are specific to individuals, similar to the crowdsourced content available via DBpedia.
For concept-based disambiguation, we lose the immediate benefit of the NLTK named entity recognizer, a loss that is mitigated when corpus tagging is carried out for the concept domain of interest, via Mechanical Turk or some other resource (e.g., PubMed). Because a wide range of available biomedical ontologies cover similar sets of concepts, we will incorporate and test publicly available mappings between NCBO-registered ontologies, though performing the mapping task itself falls outside our scope. For the ontology extraction task of the resource-mapping phase, the mech-
anism for obtaining axioms at the class level instead of assertions at the instance level
remains the same via the NCBO Term service.
In cases where concept mappings are not available, selecting the ontology that provides the best coverage and overall representation becomes more critical, as the
eTVSM approach requires the selection of one ontology. This selection should be an au-
tomated process, as within the context of an annotation software tool for semi-automated
mapping, it reduces burden on the annotator, enabling them to focus on finding the most
accurate concept match in a ranked list of candidates. Therefore, in future work we will
leverage a domain classifier for selecting the most suitable ontology for disambiguation.
     To further support annotation software tools that leverage our pipeline, we will make
modifications to capture the statements in RDF and/or OWL for ease of rendering in
graph form; currently we are applying the XML-based results from the NCBO services
in a non-RDF graph representation for processing into the vector models. We anticipate
that, generally, the graph representations (as shown in Figure 1), when presented, will
provide a curator context and visual justification of the ranking scores. Finally, at the
time of this writing the NCBO ontology service for the version of BioPortal being used
is deprecated; therefore, we are working to port our code to leverage the latest version
prior to making it publicly available.


7. Conclusions

To help address the challenge of using publicly available ontology services for entity dis-
ambiguation, in this paper we 1) provide an enhanced version of the NCBO Annotator
service for geographic named entity disambiguation through novel application of vector space model-based scoring in concert with the existing NCBO Term and Annotator services; 2)
19   https://www.mturk.com/
demonstrate that using the available fine-grained relationships in an expert-curated ontol-
ogy improves disambiguation; and 3) provide insights into the process of using publicly
available ontology-service driven web services and expert-curated domain ontologies for
entity disambiguation and organically improving upon those services.
     In support of future semantic mapping workbench applications, this approach pro-
vides a ranked list of results using quantitative scoring methods to disambiguate named
entities. In Section 3 we evaluated the performance of our pipeline against the Annotator
Service using the TopN scoring metric and demonstrated how in the context of a manual
curation workbench, our pipeline provides benefit by reducing the time a curator would
spend looking for the concept that correctly matched an extracted named entity label. To
further demonstrate its value as an enhanced version of the NCBO BioPortal Annotator
Service, in Section 4 we presented some insights resulting from the pipeline being ap-
plied to Earth and environmental science abstracts, leveraging domain-level relationships
available in GAZ to power the disambiguation process.
Our approach also helps curators create gold standard datasets as training data
for performing entity disambiguation using statistical machine learning methods. For cu-
rators who manage metadata like those within the DataONE project, the output from our
pipeline could be added as metadata, improving metadata quality by showing how named
entities in the text are related, which can be used to enhance search capabilities. Our
approach provides benefits over using methods that do not rely on ontologies for the dis-
ambiguation task, or when ontologies with minimal semantics (e.g., broader relationship
in SKOS) are used, as subsequent search interface capabilities will have better precision.


References

 [1] Heath, T. and Bizer C. Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on
     the Semantic Web. Morgan and Claypool Publishers, 2011.
 [2] Musen MA, Noy NF, Shah NH, Whetzel PL, Chute CG, Story MA, Smith B; NCBO team. The National
Center for Biomedical Ontology. J Am Med Inform Assoc. 2012 Mar;19(2):190-5. Epub 2011 Nov 10.
 [3] Jonquet C, Shah NH, Musen MA. The open biomedical annotator. Summit on Translat Bioinforma. 2009
     Mar 1;2009:56-60. PubMed PMID: 21347171; PubMed Central PMCID: PMC3041576.
 [4] Shah, N., Bhatia, N., Jonquet, C., Rubin, D., Chiang, A., & Musen, M (2009). Comparison of concept
     recognizers for building the Open Biomedical Annotator. BMC bioinformatics, 10(Suppl 9), S14.
 [5] Raimond, Y., & Lowis, C. Automated interlinking of speech radio archives.
[6] P. Alexopoulos, C. Ruiz, J.M. Gómez-Pérez (2012), Scenario-Driven Selection and Exploitation of Se-
     mantic Data for Optimal Named Entity Disambiguation, Proceedings of the 1st Semantic Web and In-
     formation Extraction Workshop (SWAIE 2012), Galway, Ireland, October 8-12, 2012.
[7] Hoffart, J., Yosef, M.A., Bordino, I., Fürstenau, H., Pinkal, M., Spaniol, M., Taneva, B., Thater, S.,
     Weikum, G.: Robust disambiguation of named entities in text. In Proceedings of the Conference on
     Empirical Methods in Natural Language Processing, ACL, Stroudsburg, PA, USA, 782-792.
 [8] Kleb, J., Abecker, A.: Entity Reference Resolution via Spreading Activation on RDF-Graphs. In Pro-
     ceedings of the 7th ESWC, pages 152-166, Springer Berlin, Heidelberg, 2006.
 [9] Mendes, P.N., Jakob, M., Garcia-Silva, A., Bizer, C.: DBpedia spotlight: shedding light on the web of
     documents. In Proceedings of the 7th International Conference on Semantic Systems, ACM, New York,
     USA, 1-8, 2011.
[10] Polyvyanyy, A. (2007). Evaluation of a novel information retrieval model: eTVSM. Master’s thesis,
     Hasso Plattner Institut.
[11] Bird, Steven, Ewan Klein and Edward Loper (2009). Natural Language Processing with Python. O'Reilly Media Inc.
[12] Adam Berenzweig, Beth Logan, Daniel P. W. Ellis, and Brian Whitman. A large-scale evaluation of
acoustic and subjective music-similarity measures. Computer Music Journal, 28(2):63-76, Summer 2004.