NLP-based support for ontology lifecycle
                                                  development

                                               Diana Maynard and Adam Funk and Wim Peters

                                                         Department of Computer Science
                                                              University of Sheffield
                                                          Regent Court, 211 Portobello
                                                              S1 4DP, Sheffield, UK


                                      Abstract. This paper describes the implementation of an approach to
                                      modelling the dynamics of the propagation of textually derived semantic
                                      information, in particular with respect to networked ontologies. On the
                                      one hand, new ontologies may be generated automatically from textual
                                      data, or existing ontologies may be modified or extended according to
                                      new evidence. This can cause problems for large or networked ontologies,
                                      where only a small section of the ontology may be modified, and where
                                      multiple users may be working with the ontologies simultaneously. Col-
                                      laborative editing of both documents and ontologies is becoming more
                                      widespread, but can bring major problems with change management:
                                      both within the ontology itself and between text and ontology. In this
                                      paper, we describe a plugin for the NeOn toolkit which uses both auto-
                                      matic and manual methods for generating and modifying ontologies on
                                      the fly from unstructured text, and enables two-way ontology lifecycle
                                      development.


                                1   Introduction

                                Ontology evolution is increasingly acquiring research momentum in the Semantic
                                Web field. This is due to the fact that ontologies, forming the backbone of Se-
                                mantic Web systems, need to be kept up-to-date so that ontology-based systems
                                remain usable. Furthermore, collaborative creation and editing of documents
                                such as Wikipedia and corporate or group wikis is becoming more and more
                                widespread; however, it is difficult to extract the relevant keywords and facts
                                from such documents when they are constantly changing and when a single user
                                is not responsible for their maintenance. Similarly when ontologies change, this
                                may affect systems dependent on them. In a collaborative ontology development
                                environment, changes tend to be more frequent.
                                    In this work, we describe an approach to modelling the dynamics of ontol-
                                ogy and metadata change in text, motivated by the need for constant ontology
                                evolution and for the propagation of changes in an ontology to its instances and
                                properties. For example, if a user deletes concepts from the ontology, it is impor-
                                tant to have a mechanism for dealing with any associated semantic metadata,


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
2      Diana Maynard and Adam Funk and Wim Peters

in order not to lose vital information. If a user adds new concepts to the ontol-
ogy, then it may be necessary to return to the text to check whether additional
instances can be found which should be used to populate these new concepts in
the ontology. We call this the top-down approach to ontology change.
    On the other hand, not only are ontologies dynamic and subject to struc-
tural change, but so are the texts and instances from which the ontologies may
be derived. If we get additional relevant textual material and/or find new in-
stances in that text, it may be necessary to modify the ontology to take into
consideration this new information (for example, adding new concepts or new
relations between existing concepts in the ontology). We call this the bottom-up
approach to ontology change.
    We have developed a generic GATE application1 called SPRAT (Semantic
Pattern Recognition and Annotation Tool) which uses ontology-based informa-
tion extraction techniques [1, 2] to generate ontological information from unstruc-
tured text and either create a new ontology from scratch or augment an existing
ontology with new entities. This application represents a typical situation where
NLP (natural language processing) techniques can assist in the development of
Semantic Web technology. There are many NLP tools used in both research and
industry, of which GATE is one of the most widely used and well known. There
are also a variety of tools and systems providing ontology lifecycle support, of
which the NeOn toolkit is one newly emerging example. However, one of the cur-
rent research challenges is to provide better interoperability between these two
rather different kinds of systems. For example, GATE provides ontology support
but it is limited in comparison with with ontology editors and frameworks such
as Protégé [3] and the NeOn toolkit2 . SPRAT is available as part of the GATE
webservices plugin for the NeOn Toolkit3 .
    Both the top-down and bottom-up approaches combine support for ontology
lifecyle development and NLP engineers. The ontology refinement and change
log mechanisms described in this paper enable ontological changes to be made in
GATE by language engineers and domain experts who are interested in acquir-
ing new information to populate an ontology, but who do not necessarily know
or care how this information may later be repurposed. These people will use
GATE to analyse the documents and/or create the data, but are not interested
in ontology management per se. On the other hand, ontologists who are not
concerned with where the data comes from, but only what use they will make
of it, do not have to worry about how to get hold of the relevant information in
the first place or details about language engineering tools.


2   Related work
In this work, we use lexico-syntactic pattern-based methods for ontology popu-
lation. The idea of acquiring semantic information from texts dates back to the
1
  General Architecture for Text Engineering, http://gate.ac.uk/
2
  http://www.neon-toolkit.org
3
  http://gate.ac.uk/projects/neon/webservices-plugin.html
                                                           GATE in NeOn          3

early 1960s with Harris’ distributional hypothesis [4] and Hirschman and Sager’s
work in the 1970s [5], which focused on determining sets of sublanguage-specific
word classes using syntactic patterns from domain-specific corpora. A detailed
description and comparison of lexical and syntactic pattern matching can be
found in [6], In particular, research in this area has been used in specific do-
mains such as medicine, where a relatively small number of syntactic structures
is often found, for example in patient reports. Here the structures are also quite
simple, with short and relatively unambiguous sentences typically found: this
makes syntactic pattern matching much easier.
    Using lexico-syntactic patterns (LSPs) has previously proved to be successful
for a variety of tasks[7]. Various attempts have been made to refine the patterns
in a semi-automatic way, for example using the web as evidence [8]. Other meth-
ods focus mainly on a specific kind of pattern, such as part-of relations [9], or
use clustering approaches [10]. The disadvantage of the latter is that they require
large corpora to work well and generally fail to produce good clusters from fewer
than 100 million words.
    The closest approach to ours is probably Text2Onto [11], which performs
hyponym extraction on the basis of patterns. It combines machine learning ap-
proaches with basic linguistic processing such as tokenisation or lemmatisation
and shallow parsing. SPRAT differs in that it has a greater number of LSPs, and
it currently uses only a rule-based approach rather than machine learning, with
no statistical clustering or parsing. This leads to much increased precision over
Text2Onto, though fewer relations are produced. It also enables a more flexible
approach and fine-tuning of the system.
   We also took inspiration from some currently unpublished research carried
out at DFKI in the Musing project4 , which aims to derive T-Box Relations from
unstructured texts in German. In this work, attention is focused primarily on
deriving relations between parts of German compound nouns, but we can make
use of similar restrictions.
    Within the range of activities required for ontology learning, SPRAT covers
a number of intermediate stages in the process of ontology acquisition, namely
term recognition and relation extraction. In the initial acquisition stage, it will
recognise terms from the corpus only if they participate in any of the patterns.
This guarantees termhood only up to a certain extent. For relation extraction,
SPRAT does not make use of a parser. There are many applications that make
use of syntactic dependencies, e.g., [12, 13]. Our approach differs from this in
that our patterns are defined at low levels of syntactic constituency, such as noun
phrases, and by means of finite state transducers. Identifying and engineering
on the basis of the linguistic building blocks that are relevant for each ontology
editing task eliminates the need for a parser. This bottom-up approach is much
faster and less error-prone than a parser, and is more in line with the ontology
bootstrapping approach advocated in [14].

4
    http://www.musing.eu/
4       Diana Maynard and Adam Funk and Wim Peters

3     Ontology acquisition using SPRAT
In this section, we describe the bottom-up approach to ontology change, us-
ing the SPRAT application. We combine aspects from traditional named entity
recognition, ontology-based information extraction and relation extraction, in
order to identify patterns for the extraction of a variety of entity types and
relations between them, and to re-engineer them into ontological concepts, in-
stances and properties. The application is developed in GATE. Basic linguistic
patterns are first identified, and then enhanced with deeper semantic knowledge.
The NEBOnE plugin for GATE is used in order to generate the new ontological
entities on the fly: this is described in more detail in Section 3.3.
    SPRAT makes use of a number of pre-existing GATE resources for pre-
processing the text with linguistic information, such as tokenisation, part-of-
speech tagging, simple noun phrase and verb phrase chunking, and named entity
recognition. It also includes lexical resources containing semantic classes from
WordNet [15] and VerbNet [16], which enable the incorporation of deeper se-
mantic information. This allows us (i) to look for verbal patterns connecting
terms in a sentence, using the ANNIC plugin in GATE [17], and (ii) to restrict
the kinds of relation extracted. For example, we can restrict the kinds of entities
that have body parts associated with them to animals and humans.

3.1   Identifying Linguistic Patterns
Identifying ontological concepts and/or relations in text requires a slightly differ-
ent strategy from that used in traditional named entity recognition and ontology-
based information extraction methods [1]. While we can still make use of known
terms (either via a gazetteer or by accessing the class, instance and property
labels in an existing ontology), this is often not sufficient because the concept
might not be in the ontology, or only as a synonym or variant. We therefore
make more use of linguistic patterns and also contextual clues, rather than re-
lying on gazetteer lists. We have identified three sets of patterns which can help
us identify concepts, instances and properties to create a new ontology or ex-
tend an existing one: the well-known Hearst patterns [18], the LSPs developed
in the NeOn project corresponding to Ontology Design Patterns [19], and some
new contextual patterns defined by us [20]. The ontology building process is
dynamic, so we can make use of the initial concepts and instances generated
in latter stages of the process in order to further refine the ontology. We have
extended the patterns described in [20] to incorporate further semantic restric-
tions and to include more properties. An example of a linguistic pattern is the
following:
Add a new class. Here we postulate that an unknown entity amidst a list of
known entities is likely to be also an entity of the same type. For example, if we
have a list of classes of fish, and there is an unknown noun phrase in amongst
the list, we can presume that this is also a class of fish. To decide where to
add this new class in the ontology, we can look for the Most Specific Common
Abstraction (MSCA) of all the other items in the list (i.e. the lowest common
                                                           GATE in NeOn          5

superclass of all the classes in the list) and add the new entity as a subclass of
this class.
Example: Hornsharks, leopard sharks and catsharks can survive in aquarium
conditions for up to a year or more.
where hornshark and leopard shark are classes in the ontology and catshark is
unknown, so we can recognise catshark as a subclass with the same parent as
that of hornshark and leopard shark, in this case shark.
    An example of selectional restrictions placed on a relation is that a prop-
erty must be restricted to the following semantic type from WordNet: plant,
shape, food, substance, object, body, animal, possession, phenomenon, artifact.
Experiments with this restriction improved precision from around 35% to 75%,
although unsurprisingly, recall dropped a little.


3.2   Implementation of patterns

All three sets of patterns are implemented in GATE as JAPE rules [21]. On
the left hand side of the rule is the pattern to be annotated. This consists of a
number of pre-existing annotations which have been created as a result of pre-
processing components (such as POS tagging, gazetteer lookup and so on) and
earlier JAPE rules. The right hand side of the rule invokes NEBOnE and creates
the new items in the ontology, as well as adding annotations to the document
itself.
    The following snippet shows the left hand side of a JAPE rule to match the
sharks example described above:

Rule: FishList1
({FishClass}(AND)      ({NP}|(LIST)) ):mention
--> ...

    This pattern looks for something that has been identified in the fish ontology
as a class, followed by either a single noun phrase or a list of noun phrases sepa-
rated by conjunctions or commas. On the right hand side of the rule, NEBOnE
is invoked and each Noun Phrase not already listed in the ontology is added as
a new class, at the same level as the existing class. This part of the rule first
gets the relevant information from the annotations (using the labels assigned on
the LHS of the rule), then adds the new information to the ontology and finally
adds annotations to the entities in the document. NEBOnE is responsible also
for ensuring that the resulting changes to the ontology are wellformed: this is
described in more detail in Section 3.3. Figure 1 shows a screenshot from GATE
of an ontology created from a selection of wikipedia texts about animals.


3.3   NEBOnE

The SPRAT application uses the specially developed NEBOnE plugin for GATE
in order to generate the changes to the ontology. NEBOnE (Named Entity Based
ONtology Editing) is an implementation for processing natural language text and
6      Diana Maynard and Adam Funk and Wim Peters


                       Fig. 1. Generated ontology in GATE


manipulating an ontology, and is derived from the CLOnE plugin [22] for GATE.
The major difference between CLONE and NEBOnE is that while CLONE re-
lies on a restricted input text (generated by the user in a controlled language),
NEBOnE can be used with unrestricted free text, so it is a lot more flexible.
When the NEBOnE plugin is installed, actions concerning the ontology are im-
plemented on the right hand side of JAPE rules, such as adding or deleting new
classes, instances, subclasses, properties and so on.
    If an item is selected for addition to the ontology as a new class, NEBOnE
first checks to see whether it already exists in the ontology: if it already exists
in the place where it is scheduled to be added, NEBOnE will do nothing. If
the item exists as a class elsewhere in the ontology, NEBOnE will add the new
class (because it supports multiple inheritance). If the requested parent class
and subclass both exist and are class names, NEBOnE will make the second a
subclass of the first and print a message. If either is already an instance, or the
parent class does not exist yet, NEBOnE will print a warning message.
    Similarly, if an instance to be added already exists (as an instance), NEBOnE
simply generates a notification message. If the item already exists as a class, and
an instance of the same name is to be added, or vice versa, then NEBOnE will not
generate the new instance and will produce a warning message. Thus NEBOnE
                                                           GATE in NeOn          7

ensures consistency in the ontology, avoiding the need to run a checker after the
ontology has been modified. A user can of course choose to ignore any potential
inconsistencies, by checking the generated messages and then manually adding
any offending items or making other changes to the ontology.
   Unlike CLOnE, NEBOnE’s functions will create classes and superclasses as
required in order to accommodate instances and subclasses, respectively; it does
not require every class to be explicitly created before it is used.


3.4   Evaluation

We evaluated the accuracy of the lexical patterns using a corpus of 25 randomly
selected wikipedia articles about animals, such as the entries for rabbit, sheep
etc. We ran SPRAT and examined the results in some detail. In total, SPRAT
generated 1058 classes, of which 83.6% were correct; 659 subclasses, of which
76.6% were correct, 23 instances, of which 52.2% were correct, and 55 properties,
of which 74.5% were correct. We should point out that in these type of texts
(articles about animals) the number of instances is quite small. The wrongly
extracted instances were largely the result of erroneous named entity recognition.
Otherwise, we find the results very encouraging.


4     Ontology change management in GATE

The bottom-up approach to ontology change embodied by applications such as
SPRAT is only half the story. The ontology development lifecycle also requires a
top-down approach, embodied here by a change log management system which
allows the managing of changes made by users in a collaborative environment
[23]. This is necessary because different people, who may be experts in their
domain, sometimes end up making changes to the same ontology, often at dis-
tributed locations. Where it is not possible to have access to a shared repository,
other people wishing to make changes to the same ontology have to wait for
others to finish their tasks. Changes made to an ontology may involve additions,
deletions and modifications [24].
    The basic idea behind producing a change log is to allow people to share
their changes with others and thus collaborate on manipulating ontology data.
This allows them to work independently but simultaneously on the same ontol-
ogy. However, some ontologies can be extremely large and unwieldy, whereas the
changes made to such ontologies may be only very minor. Instead of exchang-
ing different versions of ontologies, it makes it easier to exchange change logs
which can then be applied over the original ontology to get the modified ver-
sion. The procedure of applying the change log is automatic, so no intervention
is required by the user, who simply loads the old ontology and the changelog,
which automatically updates the old ontology to the new one.
    GATE has several plugins useful for carrying out different types of infor-
mation extraction tasks. It also has its own ontology API, which can be used
8      Diana Maynard and Adam Funk and Wim Peters

together with these resources to take the maximum advantage of the combi-
nation. We implemented GATE Ontology Services (GOS) to allow people to
connect to a central repository and manage ontologies centrally. It has been im-
plemented so that it can be used with other resources available in GATE, such
as the SPRAT application.
    GOS supports storing ontologies in a shared repository on a remote server.
GOS is based on OWLIM [25], which is a high performance semantic reposi-
tory developed in Java. It has its own published API with which a user can
make changes to the ontologies stored on the server. Software clients such as
the NeonOntologyServiceClient allow users to connect to this server, upload new
ontologies or use existing ones and manipulate data using the service methods
published by the GOS. Similar to the NeOn Toolkit, GOS also produces a change
log that describes changes made by the various users.
    When making changes to the ontology, users contribute to the change log
maintained on the server. The GOS maintains only one instance of change log per
repository and accumulates changes made by different users in the same change
log. Users wishing to download the entire change log can do so by selecting one
of the options provided in the NeonOntologyServiceClient.
    In its current version, GOS inherits Sesame’s repository management system
which allows giving different rights to different users on different repositories.
If proper rights are assigned, multiple users can connect to the same reposi-
tory from different locations and make changes. It is up to the system using
OWLIM/Sesame as backend to utilise this functionality and restrict multiple
users from editing the same ontology at the same time. In GOS, currently only
one user is allowed to modify the ontology at a time. Other users are given
read-only access if they connect to a repository which is already being edited
by another user. However, since the different repositories are stored on the same
central server, different users connecting to the same repository one after the
other see the latest modified version. If simultaneous access is needed, one can
carefully distribute or make copies of the same ontology across different reposito-
ries and ask users to connect to a different copy. Users wishing to make changes
can then contribute by adding new instances or deleting existing ones in their
copy. GOS does not allow making changes to the implicit resources (imported
ontologies). This helps in restricting users from making changes to the basic tax-
onomy. More information on GOS, how to access it, its methods and the change
log can be found in [26].
    A typical scenario involves an ontology originating from an organisation such
as FAO, a part of which is then modified in the NeOn Toolkit using the loosely
coupled SPRAT plugin. Since SPRAT relies on the GATE architecture, the
change log is produced by GATE as part of the ontology modification process.
In order for the new version of the ontology to be then recoupled with the existing
original ontology, and for further tools to be used on it, the change log must be
readable also by the NeOn Toolkit so that the changes can be applied. We have
tested the change log system in both directions, by verifying that changes made
                                                             GATE in NeOn          9

to an ontology in GATE can be applied successfully to the original ontology in
the NeOn toolkit, and vice versa, with identical ontologies created as a result.


5   Conclusions
We have described the implementation of the approach to modelling some of the
dynamics of semantic metadata, providing mutual support for ontology lifecycle
development and an NLP toolkit. We have developed a tool for deriving new
semantic information from unstructured text and represent this information on-
tologically so that it may then be used for further processing. The importance of
this work is the interoperability achieved between the ontology editing and natu-
ral language processing architectures. Users of both tools may decide to modify
the ontologies in question or to update the source data or facts derived from
them. By creating mechanisms for the interchange and propagation of informa-
tion in this way, including valuable change log data and support for multiple
users working individually or collaboratively in a potentially distributed envi-
ronment with networked ontologies, we have released an important bottleneck.

Acknowledgements. This research was partially supported by the EU Sixth
Framework Program project NeOn (IST-2005-027595).


References
 1. Maynard, D., Li, Y., Peters, W.: NLP Techniques for Term Extraction and On-
    tology Population. In Buitelaar, P., Cimiano, P., eds.: Bridging the Gap between
    Text and Knowledge - Selected Contributions to Ontology Learning and Popula-
    tion from Text. IOS Press (2008)
 2. Bontcheva, K., Davies, J., Duke, A., Glover, T., Kings, N., Thurlow, I.: Semantic
    Information Access. In Davies, J., Studer, R., Warren, P., eds.: Semantic Web
    Technologies. John Wiley and Sons (2006)
 3. Noy, N., Sintek, M., Decker, S., Crubézy, M., Fergerson, R., Musen, M.: Creating
    Semantic Web Contents with Protégé-2000. IEEE Intelligent Systems 16(2) (2001)
    60–71
 4. Harris, Z.: Mathematical Structures of Language. Wiley (Interscience), New York
    (1968)
 5. Hirschman, L., Grishman, R., Sager, N.: Grammatically based automatic word
    class formation. Information Processing and Retrieval 11 (1975) 39–57
 6. Maynard, D.G.: Term Recognition Using Combined Knowledge Sources. PhD
    thesis, Manchester Metropolitan University, UK (2000)
 7. Etzioni, O., Cafarella, M., Downey, D., Kok, S., Popescu, A., Shaked,
    T., Soderland, S., Weld, D.S., Yates, A.:                 Web-scale Information
    Extraction in KnowItAll.              In: Proceedings of WWW-2004. (2004)
    http://www.cs.washington.edu/research/knowitall/papers/www-paper.pdf.
 8. Pantel, P., Pennacchioni, M.: Espresso: Leveraging generic patterns for automat-
    ically harvesting semantic relations. In: Proceedings of Conference on Computa-
    tional Linguistics / Association for Computational Linguistics (COLING/ACL-06),
    Sydney, Australia (2006) 113–120
10      Diana Maynard and Adam Funk and Wim Peters

 9. Berland, M., Charniak, E.: Finding parts in very large corpora. In: Proceedings
    of ACL-99, College Park, MD (1999) 57–64
10. Pantel, P., Ravichandran, D.: Automatically labeling semantic classes. In: Pro-
    ceedings of HLT/NAACL-04), Boston, MA (2004) 321–328
11. Cimiano, P., Voelker, J.: Text2Onto - A Framework for Ontology Learning and
    Data-driven Change Discovery. In: Proceedings of the 10th International Con-
    ference on Applications of Natural Language to Information Systems (NLDB),
    Alicante, Spain (2005)
12. Cimiano, P., Hartung, M., Ratsch, E.: Learning the appropriate generalization
    level for relations extracted from the Genia corpus. In: Proc. of the 5th Language
    Resources and Evaluation Conference (LREC). (2006)
13. Gamallo, P., Gonzalez, M., Agustini, A., Lopes, G., de Lima, V.: Mapping syntactic
    dependencies onto semantic relations. In: Proc. of the ECAI Workshop on Machine
    Learning and Natural Language Processing for Ontology Engineering. (2006)
14. Maedche, A.: Ontology Learning for the Semantic Web. Kluwer Academic Pub-
    lishers, Amsterdam (2002)
15. Fellbaum, C., ed.: WordNet - An Electronic Lexical Database. MIT Press (1998)
16. Schuler, K.K.: VerbNet: A broad-coverage, comprehensive verb lexicon. PhD thesis,
    University of Pennsylvania (2005)
17. Aswani, N., Tablan, V., Bontcheva, K., Cunningham, H.: Indexing and Querying
    Linguistic Metadata and Document Content. In: Proceedings of Fifth International
    Conference on Recent Advances in Natural Language Processing (RANLP2005),
    Borovets, Bulgaria (2005)
18. Hearst, M.A.: Automatic acquisition of hyponyms from large text corpora. In: Con-
    ference on Computational Linguistics (COLING’92), Nantes, France, Association
    for Computational Linguistics (1992)
19. de Cea, G.A., Gómez-Pérez, A., Ponsoda, E.M., Suárez-Figueroa, M.C.: Natu-
    ral language-based approach for helping in the reuse of ontology design patterns.
    In: Proceedings of the 16th International Conference on Knowledge Engineering
    and Knowledge Management Knowledge Patterns (EKAW 2008), Acitrezza, Italy
    (September 2008)
20. Maynard, D., Funk, A., Peters, W.: SPRAT: a tool for automatic semantic pattern-
    based ontology population. In: International Conference for Digital Libraries and
    the Semantic Web, Trento, Italy (September 2009)
21. Cunningham, H., Maynard, D., Tablan, V.: JAPE: a Java Annotation Patterns
    Engine (Second Edition). Research Memorandum CS–00–10, Department of Com-
    puter Science, University of Sheffield (November 2000)
22. Funk, A., Tablan, V., Bontcheva, K., Cunningham, H., Davis, B., Handschuh, S.:
    CLOnE: Controlled Language for Ontology Editing. In: Proceedings of the 6th
    International Semantic Web Conference (ISWC 2007), Busan, Korea (November
    2007)
23. Palma, R., Haase, P., Ji, Q.: Change management to support collaborative work-
    flows. Technical Report D1.3.2, NeOn Project Deliverable (2009)
24. Maynard, D., Peters, W., D’Aquin, M., Sabou, M., Aswani, N.: Dynamics of
    metadata. Technical Report D1.5.1, NeOn Project Deliverable (2007)
25. Kiryakov, A.: OWLIM: balancing between scalable repository and light-weight
    reasoner. In: Proc. of WWW2006, Edinburgh, Scotland (2006)
26. Maynard, D., Peters, W., Angeletou, S., D’Aquin, M.: Implementation of metadata
    evolution. Technical Report D1.5.2, NeOn Project Deliverable (2008)