=Paper= {{Paper |id=Vol-289/paper-3 |storemode=property |title=Knowledge Capturing Tools for Domain Experts |pdfUrl=https://ceur-ws.org/Vol-289/p03.pdf |volume=Vol-289 |dblpUrl=https://dblp.org/rec/conf/kcap/BrockerRW07 }} ==Knowledge Capturing Tools for Domain Experts== https://ceur-ws.org/Vol-289/p03.pdf
Knowledge Capturing Tools for Domain Experts
Exploiting Named Entity Recognition and n-ary Relation Discovery for Knowledge Capturing in E-Science

Lars Bröcker, Fraunhofer IAIS, Schloss Birlinghoven, 53754 Sankt Augustin, Germany, lars.broecker@iais.fhg.de
Marc Rössler, Computational Linguistics, University of Duisburg-Essen, 47048 Duisburg, Germany, marc.roessler@uni-due.de
Andreas Wagner, Computational Linguistics, University of Duisburg-Essen, 47048 Duisburg, Germany, andreas.wagner@uni-due.de

ABSTRACT
The success of the Semantic Web depends on the availability of content marked up using its description languages. Although the idea has been around for nearly a decade, the amount of Semantic Web content available is still fairly small. This is despite the existence of many digital archives containing high-quality collections which would, appropriately marked up, greatly enhance the reach of the Semantic Web. The archives themselves would benefit as well, through improved opportunities for semantic search, navigation, and interconnection with other archives.

The main challenge lies in the fact that ontology creation is currently a very detailed and complicated process. It mostly requires the services of an ontology engineer, who designs the ontology in cooperation with domain experts. The software tools available, be they from the text engineering or the ontology creation disciplines, reflect this: they are built for engineers, not for domain experts. In order to really tap the potential of digital collections, tools are needed that support domain experts in marking up the content they understand better than anyone else.

This paper presents an integrated approach to knowledge capturing and subsequent ontology creation, called WIKINGER, that aims at empowering domain experts to prepare their content for inclusion in the Semantic Web. This is done by largely automating the process through the use of named entity recognition and relation discovery.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—linguistic processing; I.2.4 [Artificial Intelligence]: Knowledge Representation Formalisms and Methods—semantic networks; I.2.6 [Artificial Intelligence]: Learning—knowledge acquisition, concept learning; I.2.7 [Artificial Intelligence]: Natural Language Processing—text analysis; I.5.3 [Pattern Recognition]: Clustering

General Terms
Algorithms

Keywords
Named Entity Recognition, Relation Discovery, Semantic Networks, Wiki Systems

1. INTRODUCTION
The Semantic Web can only flourish if enough content providers adopt it for the presentation of their content. This lack of adoption is the Achilles heel of the vision of a data web in which humans and software agents work side by side. The main reason lies right at the base of the Semantic Web: the creation of ontologies. The process needed to arrive at a working representation of a domain is too difficult for domain experts to carry out on their own - a debilitating factor on the way to widespread adoption. The WWW flourished precisely because marking up knowledge in HTML was easy; the same does not hold true for RDF, let alone OWL.

There are tools that support the process of creating an ontology, both from the domain of text engineering and from ontology engineering. But these tools are made for a select audience: ontology engineers. This in itself is not a bad thing, but it ties the growth of the Semantic Web to the availability (and affordability) of said engineers. If the Semantic Web is to come about on a grand scale, tools are needed that allow domain experts themselves to design and create ontologies tailored to their needs and domain corpora.

But what is needed to create an ontology from a text corpus? First of all, an ontology can be seen as a graph structure, a semantic network. The nodes of this graph are the entities, i.e. the actors, topics, and objects of the ontology, while the edges are the relations that exist between the entities. The task of automatically creating an ontology can thus be broken down into two steps: first, named entity recognition (NER), and second, the detection of relations existing between those entities.
The detection and classification of proper names into predefined categories is called Named Entity Recognition (NER). Recognition of the categories PERSON, LOCATION and ORGANIZATION in the newspaper domain is especially well studied, in particular as part of the MUC campaigns (Message Understanding Conferences), and can be conducted automatically with a performance beyond 0.9 F-measure for English texts [4]. The detection of relations between the entities of a corpus is a younger discipline, usually concerned with binary relations; experiments on English newspaper text show performance around 0.75 F-measure [8]. These advances make a largely automated processing of text corpora into domain ontologies feasible. This paper introduces an integrated web-service-based framework called WIKINGER that does just that.

This paper is structured as follows: Section 2 gives an overview of the WIKINGER framework, sections 3 and 4 describe our work on named entity extraction, and section 5 describes the relation discovery part of the process. After that, section 6 highlights relevant related work, and we close with remarks on future work and the conclusion in sections 7 and 8.

2. WIKINGER - THE BIG PICTURE
WIKINGER [3], short for Wiki Next Generation Enhanced Repositories, aims at developing collaborative knowledge platforms for scientific communities. Collaboration is facilitated by a wiki serving as the presentation layer, and the knowledge it contains can be organized via semantic relations. The resulting semantic wiki can be extended, reorganized and commented on by all (registered) members of the particular scientific community. To set up and maintain the semantic network, NER techniques are applied to the available domain-relevant documents (see section 3). The resulting annotations are the potential nodes of the semantic network, which is constructed in a semi-automatic manner. The relations are proposed based on clusters of co-occurring entities (see section 5).

Figure 1 shows a view of the components of the WIKINGER framework. It follows a service-oriented architecture; its modules are loosely coupled, which allows need-driven reconfiguration of the system. The system uses a linked set of data repositories to perform its duties. The resource layer at the bottom of fig. 1 shows a drastically simplified view of the outside world: it contains arbitrary data sources that can be imported into the first of the repositories, the document repository. This repository provides the other services of the system with a versioned corpus of documents to work on. The processing services (e.g. for NER, relation discovery and creation of the ontology) use this repository as a source only; they feed their results into the metadata repository. The metadata repository is linked to the document repository to uphold references to the original, and it also provides versioned storage of the data. This ensures that the original corpus remains unchanged. The final repository contains the semantic model of the corpus and makes use of both the document repository and the metadata repository. At the moment, the application layer takes the form of a wiki system, but other applications can easily be envisioned.

The architecture of WIKINGER is motivated by the assumption that many nodes of a domain-specific semantic network occur in domain-relevant texts, and that these occurrences are proper names or expressions which can be extracted with NER techniques.

The pilot domain of WIKINGER is contemporary history, with a focus on the history of Catholicism in Germany. For that domain, the traditional NER categories PERSON, LOCATION, ORGANIZATION, and TIME/DATE expressions obviously provide crucial nodes for a domain-specific semantic network. However, the domain experts desired additional categories, such as HISTORICAL-EVENT, BIOGRAPHIC-EVENT or ROLE. A ROLE is a function or position a person holds (e.g. "bishop", "professor of theology") and is often part of a BIOGRAPHIC-EVENT, which may contain additional annotations such as LOCATION and TIME/DATE, as the following example shows:

<BIO-EVENT>
   <TIME>1936</TIME>
   <ROLE>archbishop of <LOCATION>Cologne</LOCATION></ROLE>
</BIO-EVENT>

The HISTORICAL-EVENT category describes events significant to the domain experts, such as the "Wall Street Crash of 1929", also called "Black Thursday". This category may contain embedded categories, too. The two event categories of the pilot domain go beyond the traditional NER task: depending on the perspective, they involve either relation extraction or embedded categories. The corpus to annotate currently consists of approximately 150 monographs within a book series. The books were scanned and the text was extracted via OCR. The annotations of the resulting corpus will be used as potential nodes of the semantic network to be created.

Since the book series has a consistent layout structure, it was possible to preserve some layout information, such as the distinction between footnotes and other text. This distinction is helpful for detecting a text unit specific to the texts of our domain, called a "biogram". A biogram usually is a footnote provided the first time a person is mentioned in the text, and it comprises a short biography. These biographies are usually short and concise and tend to follow a predetermined structure. For instance, most biograms start with the name of the person, and some biograms present the individual pieces of information separated by a particular delimiter such as a semicolon or comma.

Thus, in most cases the person named at the beginning of a biogram is the one that the other annotations in that biogram relate to. While some information items belong to persons merely related to the person described in the biogram (e.g. "his father was a prime minister"), this assumption nevertheless holds true for the largest part of the corpus. This is very important for the relation discovery step, since all relations discovered in a specific biogram are implicitly linked to said person, although that person's participation in most of the relations is not readily apparent from their local contexts. Accordingly, they need to be
(Figure 1 depicts the WIKINGER system components: the application layer with browser, editor, annotation and administration interfaces for the guest/member, author/editor, annotator and administrator roles; the service layer with entity, metadata, document, account and processing services, the latter including an analyzer with NER; the entity model, metadata and document repositories; and the resource layer with document sources and a user database.)

Figure 1: The WIKINGER Framework: Component View


associated with the person discussed in the biogram, which in turn has implications for the creation of the semantic network from the annotation and relation data discovered in the course of the process.

Processing these biograms results in a semantic network in OWL which contains all information that could be harvested automatically from the biograms within the 150 monographs. This knowledge base constitutes a biographical database for the scientific domain which, according to the historians working within the WIKINGER project, is a long-standing desideratum for the domain of contemporary history of Catholics in Germany.

However, the tasks described are not limited to the pilot application of WIKINGER. Indeed, it has many features in common with annotation tasks found in other domains as well. Our research within the WIKINGER project focuses on the application-oriented generalization of these challenges.

3. NER
It is highly desirable to generalize the successful NER approaches described in section 1 to a broader variety of semantic markup at the phrase level (i.e. beyond "standard" categories such as PERSON, ORGANIZATION, or LOCATION) in order to support other NLP applications. However, this requires annotation components that can be extended to new categories and adapted to new domains and new languages. These tasks may have different characteristics than the classical MUC task: first, they may lack the clue of distinctive capitalization for some semantic classes and some languages, such as German. Second, the categories of interest may be neither obvious nor easily understandable due to a highly specialized domain and language.

A well-known example of such a task is the recognition of biomedical entities such as genes, proteins or cell tissue [6, 9]. It is almost impossible for a non-expert in the biomedical domain to judge the correctness of an annotation or even to formulate a definition of the classes to recognize. Additionally, capitalization is not a distinctive feature of the entities to detect. Furthermore, biomedical entities are not proper names in the linguistic sense, since a mention of a particular protein refers to all instances of that protein and not to a particular instance.

The annotation task within WIKINGER has similar characteristics: the documents to be processed are specialized texts, thus the definition of the annotation categories has to be provided by the domain experts. Also, most of the texts are in German, so capitalization is not a reliable clue for detecting proper names. Furthermore, discussions with the domain experts have shown that some of the annotation tasks amount to information extraction in a more general sense, in particular involving relation extraction, even though at a local level. For example, the BIO-EVENT shown in section 2 establishes a relation between the person the respective biogram deals with, a role occupied by that person, a certain time, and a location. Although these annotation tasks significantly expand the annotation of proper names, we still consider them a sophisticated form of NER. In other words, we basically employ approaches which have been successfully applied to NER.

In principle, two major kinds of NER approaches have been proposed in the literature: rule-based and machine learning (ML) approaches. Rule-based approaches employ a handcrafted set of rules which is fine-tuned to the particular application domain. Adapting such a rather complex rule set to new domains and/or languages brings about extensive modification and maintenance efforts and therefore requires comprehensive knowledge about both the new domain and the proper design of the linguistic rule set. This means that domain experts need extensive support by computational linguists in order to port such a system to their domain. In contrast, adapting machine learning approaches to a new application domain requires the creation of domain-specific training data, i.e. manual annotation of domain-specific documents. Since this essentially requires domain (rather than linguistic) expertise, domain professionals need much less support by computational linguists (if any at all). Our experience within the WIKINGER project has shown that such support is necessary primarily for the initial task of defining a suitable set of semantic categories. During this definition stage, the communication between domain experts and linguists essentially consists in exchanging annotated examples. We believe that this example-based communication significantly facilitates portability, since concrete examples are much easier to create and understand than the explicit formulation of more or less complex and abstract (sub-)regularities. The same holds true for the annotation of the training data itself, which can be regarded as example-based communication between domain experts and machine learning algorithms.

Consequently, in order to minimize the amount of "external help" by specialists that is needed to set up the WIKINGER system for a given domain, we decided to employ ML approaches for NER. In our current experiments, we are using Maximum Entropy modeling and support vector machines. (As implementations, we employ openNLP1 and SVMstruct2, respectively.) However, we aim at providing a variety of ML algorithms which can be employed either independently or in combination to maximize performance. Regarding portability, it is crucial that the learning approaches employ domain-independent features and resources that can easily be adapted to a new domain or a new NER task. Furthermore, these methods have to be applied in a way that allows the acquisition of embedded annotations. "Standard" ML classifiers assign one class (in our case, a semantic category) to each instance to classify (in our case, a token)3. In embedded annotations, (parts of) entities may receive multiple classes simultaneously (e.g., in the example in section 2, "1936" is at the same time a DATE and part of a BIO-EVENT). To achieve this kind of concurrent classification, we run multiple classifiers, each assigning different classes, and unify the results. For ML approaches restricted to binary classification (e.g. SVM), one classifier is required per category. For ML approaches without this restriction (e.g. MaxEnt), classifiers assigning multiple classes can be built and combined in a more flexible way. Our experiments with MaxEnt models have shown that combining classifiers each of which assigns all categories except one, i.e. each of which "ignores" one particular class, yields higher performance than employing binary classifiers. In these experiments, we obtained F-measures (at token level) of up to 84.6% for persons, 87.1% for organizations, 94.8% for geographic-political entities, and 92.8% for roles.

4. WALU
A prerequisite for enabling domain experts to create training data and to control the process of training and (semi-)automatic semantic markup is the availability of a powerful and convenient tool. On the one hand, such a tool has to provide the necessary functionality, i.e. manual annotation of documents, configuration and initiation of the training process, application of automatic annotation components, as well as inspection and correction of the resulting annotations. On the other hand, intuitive interfaces and convenient facilities that support these functions while encapsulating their complexity are crucial to ensure usability for professionals of any domain. In addition, the tool has to be integrated into the overall WIKINGER infrastructure sketched in section 2. Currently there is no tool available that meets all these requirements (see section 6), at least not to our knowledge. Therefore, we are developing such a tool, which we call WALU (WIKINGER Annotations- und Lern-Umgebung = WIKINGER annotation and learning environment, see [16]).

WALU supports manual annotation with an easy-to-use GUI. It offers comfortable navigation through the annotations, and simple but effective annotation support such as the automatic adjustment of markup boundaries or a dynamic markup dictionary. This dictionary is created during the annotation process and is used to propose markup labels for text passages corresponding to dictionary entries. Using a context-sensitive menu, the annotator confirms or rejects these proposals and/or removes the entry from the dictionary. In our experience, the immediate feedback of the dynamic markup dictionary also helps the domain experts to clarify the task of string-based identification of domain-relevant concepts. Additionally, WALU provides an automatic annotator, based on regular expressions, for strings referring to the category DATE. This is a simple prototype of a series of automatic mechanisms that will be used to annotate all available documents. Apart from a few annotators based on regular expressions that classify entities with unique patterns (such as email addresses and URLs), most of these annotators are based on machine learning algorithms that will be accessible via WALU.

Training the ML facilities mentioned in section 3, as well as their annotation of new text, can be initiated via the WALU GUI. The annotation results can be displayed and manually corrected. Automatic annotations are displayed in a distinct way (only the lower half of the annotated tokens is marked) so that the user can spot them immediately.

WALU is designed both as a part of the WIKINGER infrastructure and as a stand-alone tool. Web-service-based communication facilities allow WALU to load documents from the WIKINGER document repository and to load/store corresponding annotations from/to the metadata repository. As a stand-alone tool, WALU is currently able to import text documents (other import formats will be covered later) and to export annotated documents in a straightforward XML standoff format. The transfer between the various data formats is achieved via a special internal format we call the 'WaRP (WALU Rich Paragraph) stream', which is also processed by the automatic annotation components.

1 http://maxent.sourceforge.net/
2 http://svmlight.joachims.org/svm_struct.html
3 Multiword NEs are recognized as a sequence of tokens receiving the same class.
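The concurrent classification scheme described in section 3 (running several classifiers and unifying their per-token outputs so that embedded annotations are possible) can be sketched as follows. The two rule-based "classifiers" are toy stand-ins for the trained MaxEnt/SVM models, not the actual WIKINGER features:

```python
# Sketch: several per-category token classifiers run independently and
# their outputs are unified, so one token can carry multiple labels at
# once (e.g. "1936" as DATE and as part of a BIO-EVENT). The toy rules
# below merely stand in for trained models.
def date_classifier(tokens):
    # Label four-digit numbers as DATE, everything else as None.
    return ["DATE" if t.isdigit() and len(t) == 4 else None for t in tokens]

def bio_event_classifier(tokens):
    # Toy rule: treat the whole biogram phrase as one BIO-EVENT span.
    return ["BIO-EVENT" for _ in tokens]

def unify(label_sequences):
    # Per token, collect the non-empty labels proposed by any classifier.
    return [
        {lab for lab in labels if lab is not None}
        for labels in zip(*label_sequences)
    ]

tokens = ["1936", "archbishop", "of", "Cologne"]
merged = unify([date_classifier(tokens), bio_event_classifier(tokens)])
# merged[0] now holds both labels for "1936".
```

Each additional category simply contributes one more label sequence to the unification step.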
                     
                        
                                    
                                     
                                                        
                                                            
                                                                          relation. Since the amount of relation clusters is not known
                                                                          beforehand, agglomerative clustering is applied. In this al-
                                                                gorithm, every vector starts as its own cluster. Clusters
                                                                          are then merged, given they fulfill a certain clustering crite-
                                                                          rion that is defined on a distance measure. We use standard
                                      
                                                                     Cosine similarity as distance and allow both single and com-
                                                                          plete linkage as criteria. Given two clusters A and B and a
                                                            distance threshold t, this translates to:


           Figure 2: Workflow of the algorithm
                                                                             Single Linkage : ∃α ∈ A, β ∈ B : min(dist(α, β)) < t
5.     SEMIAUTOMATIC RELATION DISCO-
                                                                           Complete Linkage : ∃α ∈ A, β ∈ B : max(dist(α, β)) < t
       VERY
The algorithms and tools described in the preceding sections
provide named entities for a variety of project-dependent                 Which method will be used depends on the corpus in ques-
concept classes. They will become the nodes of the semantic               tion. Terse texts show better results with complete linkage,
network that is to be built. The remaining part is the provi-             normal text performs better with single linkage.
sion of edges connecting these nodes, which will be explained
in this section. The common approach to this problem is to                The result of this step is a set of relation clusters for each
let domain experts come up with a small number of rela-                   association rule. User interaction is needed at this point, in
tions and then to model them in an ontology editor. This                  order to review the results and to provide meaningful labels
requires knowledge of both ontology creation and ontology                 for the relations. They are not generated automatically at
editors, which tends to be a too high hurdle for domain ex-               the moment, but schemes employing parts-of-speech analysis
perts. Instead, we propose to do it based on the content of the corpus in question. With the named entities given by the preceding steps, relation discovery applying statistical methods becomes feasible.

5.1 Algorithm
Figure 2 shows the workflow of our approach. The first step, NER, has been covered already. The next step consists of the application of an association rule mining algorithm to the annotated corpus, which has been segmented on the sentence level. Only those sentences containing at least two entities are kept. Each sentence is represented by the set of entity classes appearing in it. These item sets serve as input for the apriori algorithm [1], which generates a set of association rules of the form a → b. Each rule carries two parameters: support (the number of observations supporting it) and confidence (in our case #(a → b)/#a). Thresholds for these parameters can be used to influence the result of the algorithm.
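The support and confidence computation for single-antecedent rules can be sketched as follows. This is a minimal illustration, not the project's implementation: the entity classes are hypothetical examples, and the full apriori algorithm [1] additionally generates rules with multiple antecedents and succedents through candidate itemset generation.

```python
from itertools import permutations

def mine_rules(itemsets, min_support=2, min_confidence=0.5):
    """Derive association rules a -> b over entity classes.

    support(a -> b)    = number of sentences containing both a and b
    confidence(a -> b) = #(a -> b) / #a
    """
    classes = set().union(*itemsets)
    rules = {}
    for a, b in permutations(classes, 2):
        n_a = sum(1 for s in itemsets if a in s)
        n_ab = sum(1 for s in itemsets if a in s and b in s)
        if n_a and n_ab >= min_support and n_ab / n_a >= min_confidence:
            rules[(a, b)] = (n_ab, n_ab / n_a)
    return rules

# Each sentence is reduced to the set of entity classes occurring in it.
sentences = [
    {"Person", "Organisation"},
    {"Person", "Organisation", "Date"},
    {"Person", "Date"},
    {"Organisation", "Place"},
]
rules = mine_rules(sentences)
```

Raising the thresholds prunes rules such as Organisation → Place, which is supported by only a single sentence here.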
                                                                          of sentences are marked up with named entities and are
The association rules can be ranked according to the two parameters. High support promises higher coverage; high confidence hints at a tighter correlation between the entity classes involved. Rules with more than one succedent tend to be more specialized, as evidenced by a higher confidence, and thus offer a higher potential information gain; they also tend to be forgotten by the domain experts when asked to come up with possible relations.

The next step is a clustering phase. It takes an association rule as input. The sentences matching the rule are preprocessed, i.e. the named entities are replaced with their respective classes. This is done to obtain generalized patterns of the relations in the sentences. Only the part between the outermost named entities is taken and transformed into word vectors. The weights of the vectors are created using tf*idf.

The goal of the clustering phase is to obtain relation clusters, i.e. clusters in which every vector symbolizes the same […] (e.g. using the verbs) are feasible.

The last step of the algorithm is the transformation of the entities and their relations into an ontology language. The transformation process is a straightforward affair for entities, classes, and binary relations, since those can be handled by corresponding constructs in RDF. The transformation of n-ary relations is slightly more complex, since it involves blank nodes that act as a hub for the attachment of binary relations to the various members of the relation. The resulting RDF represents the ontology for the domain corpus.

In the use case of our project, we have to deal with a dynamic corpus, since the articles from the wiki are fed back into the system to be analyzed. This continually updates the semantic network and keeps it on par with the wiki. But an additional step is required: relation classification. The relation clusters that were committed in the initialization phase of the system are used for this task. New instances of sentences are marked up with named entities and then transformed into word vectors, which can be classified against the relation clusters and subsequently transformed into RDF. Since the provenance of each triple in the ontology is known, updates can be restricted to those triples that are affected.

Preliminary evaluation results of the algorithm show F-measures (F1 = 2 · Precision · Recall / (Precision + Recall)) between 70% and 75% for clusters representing binary as well as n-ary relations. The algorithm usually creates more relation clusters than a human would, since humans tend to generalize the relations rather than maintain a multitude of minuscule distinctions in their relation set. We have evaluated the performance of the algorithm against a part of the corpus relevant for the pilot application in the WIKINGER project. More details can be found in [2].
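The relation classification step described above can be sketched as follows. This is a minimal illustration under stated assumptions: representing each relation cluster by a centroid, using cosine similarity, and applying a similarity threshold are choices made for the sketch; the centroid vectors, relation labels, and threshold value are hypothetical, not part of the evaluated system.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse tf*idf vectors given as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(vector, centroids, threshold=0.3):
    """Assign a word vector to the best-matching relation cluster,
    or to None if no centroid is similar enough."""
    best, score = None, threshold
    for label, centroid in centroids.items():
        s = cosine(vector, centroid)
        if s > score:
            best, score = label, s
    return best

# Hypothetical centroids of two relation clusters (tf*idf weights).
centroids = {
    "member-of": {"joined": 0.8, "member": 0.5},
    "born-in":   {"born": 0.9, "in": 0.2},
}
# Word vector of a new, entity-annotated sentence from the wiki.
vec = {"joined": 0.7, "in": 0.1}
label = classify(vec, centroids)
```

Sentences falling below the threshold for every cluster would be held back for the experts rather than asserted as RDF triples.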
5.2 User interface
In order to provide the domain experts with an interface that facilitates directing the relation discovery process, the Wikinger Relation Discovery GUI, WiReD for short, has been developed. It allows the experts to view the results of the different steps of the algorithm and to experiment with different settings for them. This encompasses the association rules generated by the apriori algorithm as well as the composition of the relation clusters generated by the clustering phase.

Association rules can be selected manually for clustering; clusters can be post-processed (merged with others, deleted, renamed) and finally selected for inclusion into the semantic network. The parameters for each algorithmic step are preset with reasonable defaults, but can be changed directly from within WiReD, thus allowing experiments on the data set. This may sound intimidating at first reading, but in practice there are never more than two parameters per step in the processing chain, four parameters in total.

When the experts have come to a final result, i.e. they have agreed upon a set of relations they want to see included in the ontology, the relation information is fed back into the WIKINGER framework. Here it is used for different purposes. First of all, it can be used to transform the information associated with it - the entities and their relations - into the ontology format of choice. If the corpus is static, this concludes the work needed for the ontology. In the case of dynamic corpora, e.g. wiki systems, the relation information approved by the experts is used to automatically classify new patterns that enter the system. These basically follow the same steps of the algorithm, only now in a fully automated mode. The experts can change the relation set at any time using the WiReD GUI, which results in a total recalculation of the ontology to reflect the change.

6. RELATED WORK
This section highlights related work in the areas touched by the work described in the sections above. We concentrate on annotation tools rather than individual NER algorithms, since the tools mentioned all encompass different approaches to NER. Following that, ontology learning environments are discussed, with special regard to their use of relation discovery. Finally, algorithms pertaining to the discipline of relation discovery are discussed.

6.1 Annotation tools
As explained in section 4, the rationale behind WALU is its usability by professionals of any domain, in particular without computational or linguistic expertise. In this respect, WALU differs from other existing tools for semantic annotation, e.g. GATE [7], WordFreak [12], MMAX [13], or PALinkA [15]. These tools are primarily intended for users with a background in (computational) linguistics. Consequently, they are either tailored to different, more complex tasks than WALU (e.g. PALinkA for discourse annotation), or are designed as highly multifunctional tools (e.g. GATE, WordFreak, or MMAX). This multifunctionality allows their flexible application with regard to specific and complex needs. However, the price of this flexibility is that these tools require extensive configuration efforts, which significantly affect usability for non-experts in computational linguistics. In this respect, WALU complements the range of existing tools.

6.2 Ontology learning environments
As has been pointed out above, ontology learning environments are usually built as supporting tools for ontology engineers. Their task differs from the one tackled by the approaches in this paper insofar as the ontology engineer has the process knowledge necessary for building ontologies. He usually has access to different domain experts, and thus needs only marginal software support. Named entity recognition is sometimes employed to facilitate populating the ontology, whereas relation discovery is not used extensively, at least not to our knowledge.

Text-To-Onto [11] contains a module that calculates association rules to provide the engineer with an overview of possible interrelations between concept classes, but this approach is not followed further in the context of the application. Its successor, Text2Onto [5], employs a limited version of relation extraction, insofar as it searches for hyponym relation patterns (e.g. "x is a kind of y") in order to find additional instances of concept classes in a corpus. Relation discovery is not employed there.

6.3 Relation Discovery
Hasegawa et al. [8] propose a system with an approach similar to the one presented here. They first perform NER on a text corpus, and then collect entity pairs from within sentences. These pairs are grouped by composition, the corresponding sentences are transformed into word vectors, and a clustering step is performed on each of the groups. This results in a number of relation clusters for each group. With some postprocessing (weeding out clusters below a certain size), they report F-measures of between 75% and 80% for selected clusters on a year of newspaper articles from The New York Times. In addition, they generate cluster labels by taking the words with the highest occurrence in each cluster. We believe that adding an association rule creation phase at the beginning helps in the selection of interesting combinations of relation candidates, even more so because we are not restricted to the detection of binary relations.

There are other approaches besides this one that exploit syntactic structures and perform part-of-speech analysis: Jiang et al. [10] analyze sentence grammar trees, model candidate relations in RDF in order to capture their direction, and extract from the RDF a set of generalized relations. Navigli et al. [14] present an approach to ontology learning that exploits synsets from WordNet in order to disambiguate meaning and find relations that might hold between different entities from the sentences that explain the different synsets. But these approaches depend on deeper knowledge of the language of the text corpus. Approaches like Hasegawa's or ours rely only on statistics and the existence of annotated entities, and are thus language agnostic.

7. FUTURE WORK
Regarding NER, we will implement an interface to the Weka library [17], which comprises a number of machine learning algorithms. We will investigate combinations of different ML approaches, either sequentially (i.e. the output of one classifier is used as input to another one) or concurrently (i.e. several kinds of classifiers are run in parallel and a more-or-less sophisticated voting mechanism, which might involve a further ML approach, decides on the final classification).
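The concurrent combination could be sketched as follows. This is a hypothetical illustration of the simplest voting mechanism, not the planned Weka integration: the classifier functions and entity-class labels are invented for the example.

```python
from collections import Counter

def vote(token, classifiers):
    """Combine several NER classifiers by simple majority vote.

    Each classifier maps a token to an entity-class label; the label
    predicted most often wins.
    """
    predictions = [clf(token) for clf in classifiers]
    counts = Counter(predictions)
    top = max(counts.values())
    # Ties are resolved in favour of the earliest classifier's prediction.
    for p in predictions:
        if counts[p] == top:
            return p

# Three toy classifiers with differing opinions.
def clf_a(tok):  # capitalisation heuristic
    return "PERSON" if tok[0].isupper() else "O"

def clf_b(tok):  # suffix heuristic
    return "PERSON" if tok.endswith("er") else "O"

def clf_c(tok):  # always abstains
    return "O"

label = vote("Adenauer", (clf_a, clf_b, clf_c))
```

A learned combiner would replace the majority count with a further ML model trained on the individual classifiers' outputs.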
Furthermore, we plan to provide an interface to the UIMA framework (http://incubator.apache.org/uima/). This way, further facilities for learning and preprocessing (e.g. morphological or syntactic analysis, which can provide useful information for semantic annotation as well as relation discovery) will become available to our framework. Since units from the UIMA framework can be provided as web services, they can be added to complement the WIKINGER framework as needed.

Regarding relation discovery, we intend to apply our approach to other data sets, especially from the newspaper domain, in order to evaluate its performance on data sets that cover a wide range of topics, and to enhance the algorithm with a stage that automatically extracts suitable labels for the relations and their members.

The WIKINGER framework will be developed further; we intend to use it as a base platform for a variety of future projects.

8. CONCLUSIONS
This paper described a new approach to semi-automatic knowledge capturing from large text corpora. The goal is to empower domain experts to create domain ontologies themselves, without being dependent on the availability of ontology engineers. This is to be achieved by automating the process to a high degree, employing named entity recognition (NER) and relation discovery. Domain experts are involved at those stages which require substantial knowledge of the domain in question. Two software tools that aid the domain experts in this task have been introduced: WALU and WiReD. The former is a workbench for example-based NER, while the latter supports the relation discovery process.

Evaluation results for the different algorithmic solutions have been presented, showing high F-measure values for the automatic knowledge capturing methods.

All of this is part of a web service based architecture, the WIKINGER framework. It is used to create semantically enhanced collaborative knowledge platforms for scientific communities. The pilot application is a semantic wiki for the domain of contemporary history research regarding German catholicism.

9. ACKNOWLEDGMENTS
The work presented in this paper is being funded by the German Federal Ministry of Education and Research under research grant 01C5965. See http://wikinger-escience.de for further details regarding the project. The authors would like to thank Prof. Cremers from the University of Bonn and Prof. Hoeppner from the University of Duisburg-Essen for their helpful suggestions.

10. REFERENCES
[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th VLDB Conference, pages 487-499, 1994.
[2] L. Bröcker. Semiautomatic Creation of Semantic Networks. In Online Proceedings of the PhD Symposium at ESWC 2007, June 2007. No URL as of yet.
[3] L. Bröcker, M. Rössler, A. Wagner, et al. WIKINGER - Wiki Next Generation Enhanced Repositories. In Online Proceedings of the German E-Science Conference, 2007.
[4] N. A. Chinchor, editor. Proceedings of the Seventh Message Understanding Conference, Fairfax, VA, 1998.
[5] P. Cimiano and J. Völker. Text2Onto. In Proceedings of NLDB 2005, pages 227-238, 2005.
[6] N. Collier, P. Ruch, and A. Nazarenko, editors. Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-2004), Geneva, Switzerland, 2004.
[7] H. Cunningham. GATE, a General Architecture for Text Engineering. Computers and the Humanities, 36:223-254, 2002.
[8] T. Hasegawa, S. Sekine, and R. Grishman. Discovering Relations among Named Entities from Large Corpora. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 415-422, 2004.
[9] L. Hirschman, A. Yeh, C. Blaschke, and A. Valencia. Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics, 6 (Supplement 1), 2005.
[10] T. Jiang, A. Tan, and K. Wang. Mining Generalized Associations of Semantic Relations from Textual Web Content. IEEE Transactions on Knowledge and Data Engineering, 1(2):164-179, 2007.
[11] A. Maedche. The Text-To-Onto Environment. Chapter 7 in A. Maedche: Ontology Learning for the Semantic Web. Kluwer Academic Publishers, 2002.
[12] T. Morton and J. LaCivita. WordFreak: an open tool for linguistic annotation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Edmonton, Canada, 2003.
[13] C. Müller and M. Strube. MMAX: A tool for the annotation of multi-modal corpora. In Proceedings of the 2nd IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, Seattle, WA, 2001.
[14] R. Navigli, P. Velardi, and A. Gangemi. Ontology learning and its application to automated terminology translation. IEEE Intelligent Systems, 18(1):22-31, 2003.
[15] C. Orasan. PALinkA: A highly customisable tool for discourse annotation. In Proceedings of the Fourth SIGdial Workshop on Discourse and Dialogue, Sapporo, Japan, 2003.
[16] A. Wagner and M. Rössler. WALU: Eine Annotations- und Lern-Umgebung für semantisches Tagging. In G. Rehm, A. Witt, and L. Lemnitzer, editors, Data Structures for Linguistic Resources and Applications, pages 263-271. Gunter Narr Verlag, Tübingen, 2007.
[17] I. H. Witten and E. Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Francisco, 2nd edition, 2005.