Knowledge Capturing Tools for Domain Experts
Exploiting Named Entity Recognition and n-ary Relation Discovery for Knowledge Capturing in E-Science

Lars Bröcker, Fraunhofer IAIS, Schloss Birlinghoven, 53754 Sankt Augustin, Germany, lars.broecker@iais.fhg.de
Marc Rössler, Computational Linguistics, University of Duisburg-Essen, 47048 Duisburg, Germany, marc.roessler@uni-due.de
Andreas Wagner, Computational Linguistics, University of Duisburg-Essen, 47048 Duisburg, Germany, andreas.wagner@uni-due.de

ABSTRACT
The success of the Semantic Web depends on the availability of content marked up using its description languages. Although the idea has been around for nearly a decade, the amount of Semantic Web content available is still fairly small. This is despite the existence of many digital archives containing lots of high-quality collections which would, appropriately marked up, greatly enhance the reach of the Semantic Web. The archives themselves would benefit as well, by improved opportunities for semantic search, navigation and interconnection with other archives.

The main challenge lies in the fact that ontology creation at the moment is a very detailed and complicated process. It mostly requires the service of an ontology engineer, who designs the ontology in accordance with domain experts. The software tools available, be it from the text engineering or the ontology creation disciplines, reflect this: they are built for engineers, not for domain experts. In order to really tap the potential of the digital collections, tools are needed that support the domain experts in marking up the content they understand better than anyone else.

This paper presents an integrated approach to knowledge capturing and subsequent ontology creation, called WIKINGER, that aims at empowering domain experts to prepare their content for inclusion into the Semantic Web. This is done by largely automating the process through the use of named entity recognition and relation discovery.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—linguistic processing; I.2.4 [Artificial Intelligence]: Knowledge Representation Formalisms and Methods—semantic networks; I.2.6 [Artificial Intelligence]: Learning—knowledge acquisition, concept learning; I.2.7 [Artificial Intelligence]: Natural Language Processing—text analysis; I.5.3 [Pattern Recognition]: Clustering

General Terms
Algorithms

Keywords
Named Entity Recognition, Relation Discovery, Semantic Networks, Wiki Systems

1. INTRODUCTION
The Semantic Web can only flourish if enough content providers adopt it for the presentation of their content. This lack of adoption is the Achilles heel of the vision of the data web where humans and software agents can work side by side. The main reason for this lies right at the base of the Semantic Web: the creation of ontologies. The process needed to get to a working representation of a domain is too difficult for domain experts to do it on their own - a debilitating factor on the way to widespread adoption: the WWW did flourish simply due to the ease of marking up knowledge in HTML. This does not hold true for OWL or even RDF.

There are tools that deliver support in the process of creating an ontology, both from the domain of text engineering as well as from ontology engineering. But these tools are made for a selected audience: ontology engineers. This in itself is nothing bad, but it reduces the growth of the Semantic Web to the availability (and affordability) of said engineers. Tools are needed that allow domain experts themselves to design and create ontologies tailored for their needs and domain corpora, if the Semantic Web is to come about on a grand scale.

But what is needed to create an ontology from a text corpus? First of all, an ontology can be seen as a graph structure, a semantic network. The nodes of this graph are the entities, i.e.
the actors, topics and objects of the ontology, while the edges of the graph are the relations that exist between the entities. The task of automatically creating an ontology can be broken down into the following steps: first, named entity recognition (NER), and second, the detection of relations existing between those entities.

The detection and classification of proper names into predefined categories is called Named Entity Recognition (NER). The recognition of the categories PERSON, LOCATION and ORGANIZATION within the newspaper domain is especially well-studied as a part of the MUC campaigns (Message Understanding Conferences) and can be conducted automatically with a performance beyond 0.9 F-measure for English texts [4]. The detection of relations between the entities of a corpus is a younger discipline, usually concerned with binary relations. Experiments on English newspapers show performance around 0.75 F-measure [8]. These advances facilitate a largely automated processing of text corpora into domain ontologies. This paper introduces an integrated web service-based framework called WIKINGER that does just that.

This paper is structured as follows: Section 2 gives an overview of the WIKINGER framework, sections 3 and 4 describe our work on named entity extraction, while section 5 describes the relation discovery part of the process. After that, section 6 highlights relevant related work, and we close with remarks on future work and the conclusion in sections 7 and 8.

2. WIKINGER - THE BIG PICTURE
WIKINGER [3], short for Wiki Next Generation Enhanced Repositories, aims at developing collaborative knowledge platforms for scientific communities. The collaboration is facilitated by selecting a Wiki as a presentation layer, and the knowledge contained can be organized via semantic relations. The resulting semantic Wiki can be extended, reorganized and commented on by all (registered) members of the particular scientific community. To set up and maintain the semantic network, NER techniques are applied to the available domain-relevant documents (see section 3). The resulting annotations are the potential nodes of the semantic network that is constructed in a semi-automatic manner.

The architecture of WIKINGER is motivated by the assumption that many nodes of a domain-specific semantic network occur in domain-relevant texts and that these occurrences are proper names or expressions which can be extracted with NER techniques.

The pilot domain of WIKINGER is contemporary history with a focus on the history of Catholicism in Germany. For that domain, the traditional NER categories PERSON, LOCATION, ORGANIZATION, and TIME/DATE expressions obviously carry crucial nodes for a domain-specific semantic network. However, the domain experts desired additional categories, such as HISTORICAL-EVENT, BIOGRAPHIC-EVENT or ROLE. A ROLE is a function or a position a person holds (e.g. "bishop", "professor of theology") and is often part of a BIOGRAPHIC-EVENT, which may contain additional annotations such as LOCATION and TIME/DATE, as the following example shows:

    1936 archbishop of Cologne

The HISTORICAL-EVENT category describes events significant to the domain experts, such as the "Wall Street Crash of 1929", also called "Black Thursday". This category may contain embedded categories, too. The two event categories of the pilot domain are beyond the traditional NER task: depending on the perspective, they either involve relation extraction or embedded categories. The corpus to annotate currently consists of approximately 150 monographs within a book series. The books were scanned and the text was OCR-extracted. The annotations of the resulting corpus will be used as potential nodes of the semantic network to be created.
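Embedded annotations of this kind can be represented as standoff spans over the text, where one token may belong to several categories at once. The following Python sketch is an illustration only, assuming made-up character offsets and the category names from the example above; it is not the actual WIKINGER corpus format:

```python
# Illustrative standoff representation of embedded (overlapping) annotations.
# Categories and offsets are assumptions for this example; the real corpus
# format is not reproduced here.

text = "1936 archbishop of Cologne"

# Each annotation is (category, start offset, end offset); spans may nest.
annotations = [
    ("BIO-EVENT", 0, 26),   # the whole biographic event
    ("DATE",      0, 4),    # "1936"
    ("ROLE",      5, 26),   # "archbishop of Cologne"
]

def categories_at(pos, annots):
    """Return all categories covering a character position."""
    return {cat for cat, start, end in annots if start <= pos < end}

# "1936" is a DATE and, at the same time, part of the BIO-EVENT:
print(sorted(categories_at(0, annotations)))  # ['BIO-EVENT', 'DATE']
```

Because the spans nest rather than partition the text, a token-level classifier that assigns exactly one label per token cannot express this directly; this is the motivation for the multi-classifier setup discussed in section 3.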
The relations are proposed based on clusters of co-occurring entities (see section 5).

Figure 1 shows a view of the components that are part of the WIKINGER framework. It is built following a service-oriented architecture; its modules are loosely coupled, which allows need-driven reconfiguration of the system. The system itself uses a linked set of data repositories to perform its duties. The resource layer at the bottom of fig. 1 shows a drastically simplified view of the outside world: it contains arbitrary data sources that can be imported into the first of the repositories, i.e. the document repository. This repository provides the other services of the system with a versioned corpus of documents to work on. The processing services (e.g. for NER, relation discovery and creation of the ontology) use this repository as a source only. They feed their results into the metadata repository. It is linked to the document repository to uphold references to the original, and it also provides versioned storage of the data. This ensures that the original corpus remains unchanged. The final repository contains the semantic model of the corpus. It makes use of both the document repository as well as the metadata repository. At the moment, the application layer takes the form of a wiki system, but other applications can easily be envisioned.

[Figure 1: The WIKINGER Framework: Component View. The diagram shows an application layer (browser, editor, annotation and administration clients for guests/members, authors/editors, annotators and administrators), a service layer (entity service, NER, analyzer, metadata service, document service, account service), and a resource layer (user database, document sources, and the entity model, metadata and document repositories).]

Since the book series has a consistent layout structure, it was possible to preserve some layout information, such as the distinction between footnotes and other text. This distinction is helpful in order to detect a text unit specific to the texts of our domain called a "biogram". A biogram usually is a footnote that is provided the first time a person is mentioned in the text and comprises a short biography. These biographies usually are short and concise and tend to follow a predetermined structure. For instance, most of the biograms start with the name of the person, and some biograms present the single pieces of information separated by a particular delimiter such as a semicolon or comma.

Thus, in most cases the person named at the beginning of a biogram is the one that the other annotations in that biogram relate to. While some of the information items also belong to persons that are related to the person described in the biogram (e.g. "his father was a prime minister"), this assumption nevertheless holds true for the largest part of the corpus. This is very important for the relation discovery step, since all relations discovered in a specific biogram are linked implicitly to said person, although his or her participation in most of the relations is not readily apparent from their local contexts. Accordingly, they need to be associated with the person discussed in the biogram, which in turn has implications for the creation of the semantic network from the annotation and relation data discovered in the course of the process.

Processing these biograms results in a semantic network in OWL which contains any information that could be harvested automatically from all the biograms within the 150 monographs. This knowledge base constitutes a biographical database for the scientific domain, which, according to the historians working within the WIKINGER project, is a long-time desideratum for the domain of contemporary history of Catholics in Germany.

However, the tasks described are not limited to the pilot application of WIKINGER. Indeed, it has many features in common with a series of annotation tasks found in other domains as well. Our research within the WIKINGER project focuses on the application-oriented generalization of these challenges.

3. NER
It is highly desirable to generalize the successful NER approaches described in section 1 to a broader variety of semantic markup at phrase level (i.e. apart from "standard" categories such as PERSON, ORGANIZATION, or LOCATION) in order to support other NLP applications. However, this requires annotation components that can be extended to new categories and adapted to new domains and new languages. These tasks may have different characteristics than the classical MUC task: first, they may lack the clue of distinctive capitalization for some semantic classes and some languages, such as German. Second, the categories of interest may neither be obvious nor easily understandable due to a highly specialized domain and language.

A well-known example of such a task is the recognition of biomedical entities such as genes, proteins or cell tissue [6, 9]. It is almost impossible for a non-expert in the biomedical domain to judge the correctness of an annotation or even to figure out a definition of the classes to recognize. Additionally, capitalization is not a distinctive feature of the entities to detect. Furthermore, biomedical entities are not proper names in the linguistic sense, since a mention of a particular protein refers to all instances of that protein and not to a particular instance.

The annotation task within WIKINGER has similar characteristics: the documents to be processed are specialized texts, thus the definition of the annotation categories has to be provided by the domain experts. Also, most of the texts are in German, so capitalization is not a reliable clue to detect proper names. Furthermore, discussions with the domain experts have shown that some of the annotation tasks amount to information extraction in a more general sense, in particular involving relation extraction, even though on a local level. For example, the BIO-EVENT provided in section 2 establishes a relation between the person the respective biogram deals with, a role occupied by that person, a certain time, and a location. Although these annotation tasks significantly expand the annotation of proper names, we still consider them a sophisticated form of NER. In other words, we basically employ approaches which have been successfully applied to NER.

In principle, two major kinds of NER approaches have been proposed in the literature: rule-based and machine learning (ML) approaches. Rule-based approaches employ a handcrafted set of rules which is fine-tuned to the particular application domain. The adaptation of such a rather complex rule set to new domains and/or languages brings about extensive modification and maintenance efforts and therefore requires comprehensive knowledge about both the new domain and the proper design of the linguistic rule set. This means that domain experts need extensive support by computational linguists in order to port such a system to their domain. In contrast, adapting machine learning approaches to a new application domain requires the creation of domain-specific training data, i.e. manual annotation of domain-specific documents.
Since this essentially requires domain (rather than linguistic) expertise, domain professionals need much less support by computational linguists (if any at all). Our experience within the WIKINGER project has shown that such support is necessary primarily for the initial task of defining a suitable set of semantic categories. During this definition stage, the communication between domain experts and linguists in essence consists in exchanging annotated examples. We believe that this example-based communication significantly facilitates portability, since concrete examples are much easier to create and understand than the explicit formulation of more or less complex and abstract (sub-)regularities. The same holds true for the annotation of the training data itself, which can be regarded as example-based communication between domain experts and machine learning algorithms.

Consequently, in order to minimize the amount of "external help" by specialists needed to set up the WIKINGER system for their domain, we decided to employ ML approaches for NER. In our current experiments, we are using Maximum Entropy modeling and support vector machines. (As implementations, we employ openNLP¹ and SVMstruct², respectively.) However, we aim at providing a variety of ML algorithms which can either be employed independently or in combination to maximize performance. Regarding portability, it is crucial that the learning approaches employ domain-independent features and resources that can be easily adapted to a new domain or a new NER task. Furthermore, these methods have to be applied in a way that allows the acquisition of embedded annotations. "Standard" ML classifiers assign one class (in our case, a semantic category) to each instance to classify (in our case, a token)³. In embedded annotations, (parts of) entities may receive multiple classes simultaneously (e.g. in the example in section 2, "1936" is at the same time a DATE and part of a BIO-EVENT). To achieve this kind of concurrent classification, we run multiple classifiers, each one assigning different classes, and unify the results. For ML approaches which are restricted to binary classification (e.g. SVM), one classifier is required for each category. For ML approaches without this restriction (e.g. MaxEnt), classifiers assigning multiple classes can be built and combined in a more flexible way. Our experiments with MaxEnt models have shown that combining classifiers, each of which assigns all categories except one, i.e. each of which "ignores" one particular class, yields higher performance than employing binary classifiers. In these experiments, we got F-measures (at token level) of up to 84.6% for persons, 87.1% for organizations, 94.8% for geographic-political entities, and 92.8% for roles.

4. WALU
A prerequisite for enabling domain experts to create training data and control the process of training and (semi-)automatic semantic markup is the availability of a powerful and convenient tool. On the one hand, such a tool has to provide the necessary functionalities, i.e. manual annotation of documents, configuration and initiation of the training process, application of automatic annotation components, as well as inspection and correction of the resulting annotations. On the other hand, intuitive interfaces and convenient facilities supporting these functionalities while encapsulating their complexity are crucial to ensure usability for professionals of any domain. In addition, this tool has to be integrated into the overall WIKINGER infrastructure sketched in section 2. Currently there is no tool available that meets all these requirements (see section 6), at least not to our knowledge. Therefore, we are developing such a tool, which we call WALU (WIKINGER Annotations- und Lern-Umgebung = WIKINGER annotation and learning environment, see [16]).

WALU supports manual annotation with a GUI that is easy to use. It offers comfortable navigation through the annotations, and simple but effective annotation support such as the automatic adjustment of markup boundaries or a dynamic markup dictionary. This dictionary is created during the annotation process and is used to propose markup labels for text passages corresponding to dictionary entries. Using a context-sensitive menu, the annotator confirms or rejects these proposals and/or removes the entry from the dictionary. In our experience, the immediate feedback of the dynamic markup dictionary also helps the domain experts to clarify the task of string-based identification of domain-relevant concepts. Additionally, WALU provides an automatic annotator for strings referring to the category DATE, which is based on regular expressions. This is a simple prototype of a series of automatic mechanisms that will be used to annotate all the available documents. Except for a few annotators based on regular expressions to classify entities with unique patterns (such as email addresses and URLs), most of these annotators are based on machine learning algorithms that will be accessible via WALU.

Training the ML facilities mentioned in section 3 as well as their annotation of new text can be initiated via the WALU GUI. The annotation results can be displayed and manually corrected. Automatic annotations are displayed in a distinct way (only the lower half of the annotated tokens is marked) so that they can be discovered immediately by the user.

WALU is designed both as a part of the WIKINGER infrastructure and as a stand-alone tool. Web-service-based communication facilities allow WALU to load documents from the WIKINGER document repository and load/store corresponding annotations from/to the metadata repository. As a stand-alone tool, WALU currently is able to import text documents (other import formats will be supported later) and to export annotated documents in a straightforward XML standoff format. The transfer between the various different data formats is achieved via a special internal format we call 'WaRP (WALU Rich Paragraph) stream', which is also processed by the automatic annotation components.

¹ http://maxent.sourceforge.net/
² http://svmlight.joachims.org/svm_struct.html
³ Multiword NEs are recognized as a sequence of tokens receiving the same class.

5. SEMIAUTOMATIC RELATION DISCOVERY
The algorithms and tools described in the preceding sections provide named entities for a variety of project-dependent concept classes. They will become the nodes of the semantic network that is to be built. The remaining part is the provision of edges connecting these nodes, which will be explained in this section. The common approach to this problem is to let domain experts come up with a small number of relations and then to model them in an ontology editor. This requires knowledge of both ontology creation and ontology editors, which tends to be too high a hurdle for domain experts. Instead, we propose to do it based on the content of the corpus in question. With the named entities given by the preceding steps, relation discovery applying statistical methods becomes feasible.

5.1 Algorithm
Figure 2 shows the workflow of our approach. The first step, NER, has been covered already. The next step consists of the application of an association rule mining algorithm on the annotated corpus that has been segmented on the sentence level. Only those sentences containing at least two entities are kept. Each sentence is represented by the set of entity classes appearing in it. These item sets serve as input for the apriori algorithm [1], which generates a set of association rules of the form a → b. Each rule carries two parameters: support (the number of observations supporting it) and confidence (in our case #(a → b)/#a). Thresholds for these parameters can be used to influence the result of the algorithm.

[Figure 2: Workflow of the algorithm]

The association rules can be ranked according to the two parameters. High support promises higher coverage; high confidence hints at a tighter correlation between the entity classes involved. Rules with more than one succedent tend to be more specialized, as evidenced by a higher confidence, and thus offer a higher potential information gain; moreover, they tend to be forgotten by the domain experts when asked to come up with possible relations.

The next step is a clustering phase. It takes an association rule as input. The sentences of the rule are preprocessed, i.e. the named entities are replaced with their respective classes. This is done to receive generalized patterns of the relations in the sentences. Only the part between the outermost named entities is taken and transformed into word vectors. The weights of the vectors are created using tf*idf.

The goal of the clustering phase is to receive relation clusters, i.e. clusters in which every vector symbolizes the same relation. Since the number of relation clusters is not known beforehand, agglomerative clustering is applied. In this algorithm, every vector starts as its own cluster. Clusters are then merged if they fulfill a certain clustering criterion that is defined on a distance measure. We use standard cosine similarity as the distance measure and allow both single and complete linkage as criteria. Given two clusters A and B and a distance threshold t, this translates to:

    Single linkage: min { dist(α, β) : α ∈ A, β ∈ B } < t
    Complete linkage: max { dist(α, β) : α ∈ A, β ∈ B } < t

Which method is used depends on the corpus in question. Terse texts show better results with complete linkage; normal text performs better with single linkage.

The result of this step is a set of relation clusters for each association rule. User interaction is needed at this point, in order to review the results and to provide meaningful labels for the relations. They are not generated automatically at the moment, but schemes employing part-of-speech analysis (e.g. using the verbs) are feasible.

The last step of the algorithm is the transformation of the entities and their relations into an ontology language. The transformation process is a straightforward affair for entities, classes and binary relations, since those can be handled by corresponding constructs in RDF. The transformation of n-ary relations is slightly more complex, since it involves blank nodes that act as a hub for the attachment of binary relations to the various members of the relation. The resulting RDF represents the ontology for the domain corpus.

In the use case of our project, we have to deal with a dynamic corpus, since the articles from the wiki are fed back into the system to be analyzed. This continually updates the semantic network and keeps it on par with the wiki. But an additional step is required: relation classification. The relation clusters that have been committed in the initialization phase of the system are used for this task. New instances of sentences are marked up with named entities and are then transformed into word vectors which can be classified against the relation clusters, and subsequently transformed into RDF. Since the provenance of each triple in the ontology is known, exchanges can be restricted to those triples that are affected.
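The agglomerative clustering step with its two linkage criteria can be sketched as follows. This is an illustrative implementation on toy two-dimensional vectors, not the project's code; the threshold and vectors are assumptions, and a cluster pair is merged when its single- or complete-linkage cosine distance falls below t:

```python
from math import sqrt

def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def linkage_distance(A, B, mode):
    """Cluster distance: minimum (single) or maximum (complete) pairwise distance."""
    dists = [cosine_distance(a, b) for a in A for b in B]
    return min(dists) if mode == "single" else max(dists)

def agglomerate(vectors, t, mode="single"):
    """Merge clusters while some pair is closer than the threshold t."""
    clusters = [[v] for v in vectors]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if linkage_distance(clusters[i], clusters[j], mode) < t:
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters

# Toy word vectors: the first two are nearly parallel, the third orthogonal.
vecs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
print(len(agglomerate(vecs, t=0.1)))  # the two similar vectors merge: 2 clusters
```

With a tight threshold only near-parallel vectors end up in one relation cluster; loosening t merges more aggressively, which mirrors the trade-off between many fine-grained clusters and a few coarse ones described above.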
High support promises higher coverage, high against the relation clusters, and subsequently transformed confidence hints at a tighter correlation between the entity into RDF. Since the provenance of each triple in the ontol- classes involved. Rules with more than one succedent tend ogy is known, exchanges can be restricted to those triples to be more specialized, as evidenced by a higher confidence, that are affected. and thus offer a higher potential information gain and they tend to be forgotten by the domain experts, when asked to Preliminary evaluation results of the algorithm show F-mea- come up with possible relations. sures (F1 = 2∗Recall∗P recision ) between 70% and 75% for Recall+P recision clusters representing binary as well as n-ary relations. The The next step is a clustering phase. It takes an association algorithm usually creates more relation clusters than a hu- rule as input. The sentences of the rule are preprocessed, i.e. man would, since humans tend to generalize the relations the named entities are replaced with their respective classes. rather than to have a multitude of minuscule distinctions in This is done to receive generalized patterns of the relations in their relation set. We have performed an evaluation of the the sentences. Only the part between the outermost named performance of the algorithm against a part of the corpus entities is taken and transformed into word vectors. These relevant for the pilot application in the WIKINGER project. weights of the vectors are created using tf*idf. More details can be found in [2]. The goal of the clustering phase is to receive relation clus- ters, i.e. clusters in which every vector symbolizes the same 5.2 User interface In order to provide the domain experts with an interface linguistics. In this respect, WALU complements the range that facilitates directing the relation discovery process, the of existing tools. Wikinger Relation Discovery GUI, short WiReD, has been developed. 
It allows to view the results of the different steps of the algorithms and to experiment with different settings 6.2 Ontology learning environments As has been pointed out above, ontology learning environ- for them. This encompasses the association rules generated ments usually are built as supporting tools for ontology by the apriori algorithm as well as the composition of the engineers. Their task differs from the one tackled by the relation clusters generated by the clustering phase. approaches in this paper insofar as the ontology engineer has the process-knowledge necessary for building ontologies. Association rules can be selected manually for clustering, He usually has access to different domain experts, and thus clusters can be post-processed (merged with others, deleted, needs only marginal software support. Named entity recog- renamed) and finally selected for inclusion into the seman- nition is employed sometimes to facilitate populating the tic network. The parameters for each algorithmic step are ontology, whereas relation discovery is not used extensively, preset with reasonable defaults, but can be changed directly at least not to our knowledge. from within WiReD, thus allowing experiments on the data set. This may sound intimidating at first reading, but in Text-To-Onto[11] contains a module that calculates associ- practice there are never more than two parameters per step ation rules to provide the engineer with an overview over in the processing chain, four parameters in total. possible interrelations between concept classes, but this ap- proach is not followed further in the context of the applica- When the experts have come to a final result, i.e. they have tion. Its successor, Text-2-Onto[5], employs a limited ver- agreed upon a set of relations they want to see included sion of relation extraction, insofar as it searches for hyponym in the ontology, the relation information is fed back into relation patterns (e.g. 
”x is a kind of y”) in order to find ad- the WIKINGER framework. Here it is used for different ditional instances of concept classes in a corpus. Relation purposes. First of all it can be used to transform the infor- discovery is not employed there. mation associated with it - the entities and their relations - into the ontology format of choice. If the corpus is static, this concludes the work needed for the ontology. In the case 6.3 Relation Discovery of dynamic corpora, e.g. wiki systems, the relation infor- Hasegawa et al [8] propose a system with a similar approach mation approved by the experts is used to automatically than the one presented here. They first perform NER on a classify new patterns that enter the system. These basically text corpus, and then collect entity pairs from within sen- follow the same steps of the algorithm, only now in a fully tences. These pairs are grouped by composition, the corre- automated mode. The experts can change the relation set sponding sentences are transformed into word vectors and anytime they want using the WiReD GUI which results in a a clustering step is performed on each of the groups. This total recalculation of the ontology to reflect their desire for results in a couple of relation clusters for each group. With change. some postprocessing (weeding out clusters below a certain size), they report F-measures of between 75% and 80% for selected clusters on a year of newspaper articles from The 6. RELATED WORK New York Times. In addition, they generate cluster labels This section highlights related work in the areas touched by by taking the words with the highest occurrence in each the work described in the sections above. We concentrate cluster. 
We believe that adding an association rule creation on annotation tools rather than individual NER algorithms, phase at the beginning helps in the selection of interesting since the tools mentioned all encompass different approaches combinations of relation candidates, even more so because to NER. Following that, ontology learning environments are we are not restricted to the detection of binary relations. discussed, with a special regard to their use of relation dis- covery. Finally, algorithms partial to the discipline of rela- There are other approaches besides this one, that exploit tion discovery are discussed. syntactic structures and perform parts-of-speech analysis: Jiang et al. [10] analyze sentence grammar trees, model candidate relations in RDF in order to capture their direc- 6.1 Annotation tools tion and extract from the RDF a set of generalized relations. As explained in section 4, the rationale behind WALU is Navigli et al. [14] present an approach to ontology learning its usability by professionals of any domain, in particular that exploits synsets from WordNet in order to disambiguate without computational or linguistic expertise. In this re- meaning and find relations that might hold between different spect, WALU differs from other existing tools for semantic entities from the sentences that explain the different synsets. annotation, e.g. GATE [7], WordFreak [12], MMAX [13], or But these approaches are dependent on deeper knowledge of PALinkA [15]. These tools are primarily intended for users the language of the text corpus. Approaches like Hasegawa’s with a background in (computational) linguistics. Conse- or ours only rely on statistics and the existence of annotated quently, they are either tailored to different, more complex entities, thus they are language agnostic. tasks than WALU (e.g. PALinkA for discourse annota- tion), or are designed as highly multifunctional tools (e.g. GATE, WordFreak, or MMAX). This multifunctionality al- 7. 
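The n-ary transformation described in section 5.1, where a blank node acts as a hub to which the members of the relation are attached by binary relations, can be sketched as follows. The predicate and entity names are made up for illustration and are not the actual WIKINGER ontology vocabulary:

```python
import itertools

_ids = itertools.count()

def nary_to_triples(relation_type, members):
    """Reify an n-ary relation as RDF-style triples.

    A fresh blank node acts as the hub; each member of the relation is
    attached to it by one binary triple. `members` maps a (hypothetical)
    role name to an entity identifier.
    """
    hub = f"_:rel{next(_ids)}"  # blank node acting as the hub
    triples = [(hub, "rdf:type", relation_type)]
    triples += [(hub, role, entity) for role, entity in members.items()]
    return triples

# A hypothetical BIO-EVENT linking a person, a role, a date and a location:
triples = nary_to_triples("ex:BioEvent", {
    "ex:person":   "ex:Person_1",
    "ex:role":     "ex:Role_ArchbishopOfCologne",
    "ex:date":     '"1936"',
    "ex:location": "ex:Cologne",
})
print(len(triples))  # 5: one type triple plus one triple per member
```

Because every member hangs off the same blank node, a binary relation remains a special case (a hub with two members), while relations of any higher arity need no additional RDF machinery.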
FUTURE WORK lows their flexible application with regard to specific and Regarding NER, we will implement an interface to the Weka complex needs. However, the price of this flexibility is that library [17], which comprises a number of machine learning these tools require extensive configuration efforts which sig- algorithms. We will investigate combinations of different nificantly affects usability for non-experts in computational ML approaches either sequentially (i.e. the output of one classifier is used as input to another one) or concurrently (i.e. [1] R. Agrawal and R. Srikant. Fast algorithms for mining several kinds of classifiers are run in parallel and a more-or- association rules. In Proceedings of the 20th VLDB less sophisticated voting mechanism — which might involve conference, pages 487–499, 1994. a further ML approach — decides on the final classification). [2] L. Bröcker. Semiautomatic Creation of Semantic Networks. In Online-proceedings of PhD-symposium at Furthermore, we plan to provide an interface to the UIMA ESWC 2007, June 2007. no URL as of yet. framework4 . This way, further facilities for learning and pre- [3] L. Bröcker, M. Rössler, A. Wagner, et al. WIKINGER processing (e.g. morphological or syntactic analysis, which - Wiki Next Generation Enhanced Repositories. In can provide useful information for semantic annotation as Online Proceedings of the German E-Science well as relation discovery) will become available to our frame- Conference, 2007. work. Since units from the UIMA framework can be pro- [4] N. A. Chinchor, editor. Proceedings of the Seventh vided as web services they can be added to complement the Message Understanding Conference, Fairfax, VA, 1998. WIKINGER framework as needed. [5] P. Cimiano and J. Völker. Text-2-Onto. In Proceedings of NLDB 2005, pages 227–238, 2005. Regarding relation discovery, we intend to apply our ap- [6] N. Collier, P. Ruch, and A. Nazarenko, editors. 
proach to other data sets, especially from the newspaper Proceedings of the International Joint Workshop on domain, in order to evaluate its performance on data sets Natural Language Processing in Biomedicine and its that cover a wide range of topics, and to enhance the al- Applications (JNLPBA-2004), Geneva, Switzerland, gorithm with a stage that extracts suitable labels for the 2004. relations and their members automatically. [7] H. Cunningham. GATE, a General Architectur for Text Engineering. Computers and the Humanities, The WIKINGER framework will be developed further, we 36:223–254, 2002. intend to use it as a base platform for a variety of future projects. [8] T. Hasegawa, S. Sekine, and R. Grishman. Discovering Relations among Named Entities from Large Corpora. In Proceedings of the Annual Meeting of Association of 8. CONCLUSIONS Computational Linguistics, pages 415–422, 2004. This paper described a new approach to semi-automatic [9] L. Hirschman, A. Yeh, C. Blaschke, and A. Valencia. knowledge capturing from large text corpora. The goal is to Overview of BioCreAtIvE: critical assessment of empower domain experts to create domain ontologies them- information extraction for biology. BMC selves, without being dependent on the availability of on- Bioinformatics, 6 (Supplement 1), 2005. tology engineers. This is to be achieved by automating the [10] T. Jiang, A. Tan, and K. Wang. Mining Generalized process to a high degree, by employing named entity recog- Associations of Semantic Relations from Textual Web nition (NER) and relation discovery. Domain experts are in- Content. IEEE Transactions on Knowledge and Data volved at those stages which require a substantial knowledge Engineering, 1(2):164–179, 2007. of the domain in question. Two software tools aiding in the process have been introduced that aid the domain experts [11] A. Maedche. The Text-To-Onto Environment, chapter in the task, WALU and WiReD. 
The former is a workbench 7 in Alexander Maedche: Ontology Learning for the for example-based NER, while the latter is a tool aiding in Semantic Web. Kluwer Academic Publishers, 2002. the relation discovery process. [12] T. Morton and J. LaCivita. WordFreak: an open tool for linguistic annotation. In Proceedings of the 2003 Evaluation results for the different algorithmic solutions have Conference of the North American Chapter of the been presented that show high values for F-measure for the Association for Computational Linguistics on Human automatic knowledge capturing methods. Language Technology, Edmonton, Canada, 2003. [13] C. Müller and M. Strube. MMAX: A tool for the All of this is part of a web service based architecture, the annotation of multi-modal corpora. In Proceedings of WIKINGER framework. It is used to create semantically en- the 2nd IJCAI Workshop on Knowledge and Reasoning hanced collaborative knowledge platforms for scientific com- in Practical Dialogue Systems, Seattle, WA, 2001. munities. The pilot application is a semantic wiki for the [14] R. Navigli, P. Velardi, and A. Gangemi. Ontology domain of contemporary history research regarding German learning and its application to automated terminology catholicism. translation. IEEE Intelligent Systems, 18(1):22–31, 2003. 9. ACKNOWLEDGMENTS [15] C. Orasan. PALinkA: A highly customisable tool for The work presented in this paper is being funded by the discourse annotation. In Proceedings of the Fourth German Federal Ministry of Education and Research under SIGdial Workshop on Discourse and Dialogue, research grant 01C5965. See http://wikinger-escience.de for Sapporo, Japan, 2003. further details regarding the project. The authors would [16] A. Wagner and M. Rössler. WALU — Eine like to thank Prof. Cremers from the University of Bonn Annotations- und Lern-Umgebung für semantisches and Prof. Hoeppner from the University of Duisburg-Essen Tagging. In G. Rehm, A. Witt, and L. 
Lemnitzer, for their helpful suggestions. editors, Data Structures for Linguistic Resources and Applications, pages 263–271. Gunter Narr Verlag, Tübingen, 2007. 10. REFERENCES 4 http://incubator.apache.org/uima/ [17] I. H. Witten and F. Eibe. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Fran- cisco, 2nd edition, 2005.
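To make the pipeline attributed to Hasegawa et al. [8] in Section 6.3 concrete, the following minimal sketch groups entity pairs by type composition, builds bag-of-words context vectors for their sentences, clusters each group, and labels every cluster with its most frequent context word. The toy corpus, the greedy single-link clustering, the stopword list, and the similarity threshold are all illustrative assumptions, not part of the published systems.

```python
# Sketch of clustering-based relation discovery (after Hasegawa et al.):
# group entity pairs by type composition, cluster their sentence
# contexts, and label clusters by the dominant context word.
import math
from collections import Counter
from itertools import combinations

STOPWORDS = {"is", "in", "as", "an", "the"}

# Sentences with their already-recognised entities: (surface, type).
SENTENCES = [
    ("Smith joined Acme as chief engineer", [("Smith", "PER"), ("Acme", "ORG")]),
    ("Jones joined Initech as an analyst",  [("Jones", "PER"), ("Initech", "ORG")]),
    ("Acme is based in Berlin",             [("Acme", "ORG"), ("Berlin", "LOC")]),
    ("Initech is based in Munich",          [("Initech", "ORG"), ("Munich", "LOC")]),
]

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(n * b.get(w, 0) for w, n in a.items())
    na = math.sqrt(sum(n * n for n in a.values()))
    nb = math.sqrt(sum(n * n for n in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(members, threshold=0.2):
    """Greedy single-link clustering of (pair, context-vector) items."""
    clusters = []
    for pair, vec in members:
        for c in clusters:
            if any(cosine(vec, v) >= threshold for _, v in c):
                c.append((pair, vec))
                break
        else:
            clusters.append([(pair, vec)])
    return clusters

# 1. Collect entity pairs per type composition, with their contexts.
groups = {}
for text, entities in SENTENCES:
    for (s1, t1), (s2, t2) in combinations(entities, 2):
        skip = {s1.lower(), s2.lower()} | STOPWORDS
        ctx = Counter(w for w in text.lower().split() if w not in skip)
        groups.setdefault((t1, t2), []).append(((s1, s2), ctx))

# 2. Cluster each group; label clusters by their most frequent word.
discovered = {}
for comp, members in groups.items():
    for c in cluster(members):
        words = Counter()
        for _, vec in c:
            words.update(vec)
        label = words.most_common(1)[0][0]
        discovered.setdefault(comp, []).append(([p for p, _ in c], label))
```

On this toy input the PER-ORG pairs end up in one cluster labeled "joined" and the ORG-LOC pairs in one labeled "based"; the postprocessing step of discarding clusters below a minimum size is omitted here.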
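The association rule creation phase argued for in Section 6.3 can be illustrated in the same spirit. The sketch below simply counts frequent entity-type combinations per sentence, a brute-force stand-in for the Apriori algorithm [1] (real Apriori generates and prunes candidates level by level); because itemsets larger than two are counted as well, n-ary relation candidates surface directly. The transactions and the support threshold are hypothetical.

```python
# Brute-force frequent-itemset counting over entity-type co-occurrences,
# illustrating why an association-rule phase yields n-ary (not only
# binary) relation candidates.
from collections import Counter
from itertools import combinations

# Each "transaction" is the set of entity types co-occurring in a sentence.
transactions = [
    {"PER", "ORG", "DATE"},
    {"PER", "ORG", "DATE"},
    {"PER", "ORG"},
    {"ORG", "LOC"},
    {"ORG", "LOC"},
]

def frequent_itemsets(transactions, min_support=2, max_size=3):
    """Return every entity-type combination seen >= min_support times."""
    counts = Counter()
    for t in transactions:
        for size in range(2, max_size + 1):
            for combo in combinations(sorted(t), size):
                counts[combo] += 1
    return {c: n for c, n in counts.items() if n >= min_support}

freq = frequent_itemsets(transactions)
```

Here the ternary candidate (DATE, ORG, PER) reaches the support threshold alongside the binary ones, which is exactly the kind of combination a pair-only collection step would miss.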
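The "concurrent" classifier combination outlined in Section 7 can be sketched as a simple majority vote. The three rule-based classifiers below are hypothetical stand-ins for models trained with, e.g., the Weka library [17]; the token labels and heuristics are illustrative only.

```python
# Parallel classifier combination with majority voting: each classifier
# labels the same token independently; the most frequent label wins.
from collections import Counter

def by_capitalisation(token):
    return "ENT" if token[:1].isupper() else "O"

def by_suffix(token):
    return "ENT" if token.endswith(("burg", "Corp")) else "O"

def by_lexicon(token, lexicon=frozenset({"Smith", "Acme"})):
    return "ENT" if token in lexicon else "O"

CLASSIFIERS = [by_capitalisation, by_suffix, by_lexicon]

def vote(token):
    """Majority decision over all classifier outputs (odd count: no ties)."""
    return Counter(c(token) for c in CLASSIFIERS).most_common(1)[0][0]

labels = {t: vote(t) for t in ["Smith", "visited", "Hamburg", "Acme"]}
```

A sequential combination would instead feed one classifier's output to the next as an additional feature (stacking), and the voting function itself could be replaced by a further trained model, as the section suggests.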