Knowledge Capturing Tools for Domain Experts
Exploiting Named Entity Recognition and n-ary Relation Discovery for Knowledge Capturing in E-Science

Lars Bröcker, Fraunhofer IAIS, Schloss Birlinghoven, 53754 Sankt Augustin, Germany, lars.broecker@iais.fhg.de
Marc Rössler, Computational Linguistics, University of Duisburg-Essen, 47048 Duisburg, Germany, marc.roessler@uni-due.de
Andreas Wagner, Computational Linguistics, University of Duisburg-Essen, 47048 Duisburg, Germany, andreas.wagner@uni-due.de

ABSTRACT
The success of the Semantic Web depends on the availability of content marked up using its description languages. Although the idea has been around for nearly a decade, the amount of Semantic Web content available is still fairly small. This is despite the existence of many digital archives containing lots of high-quality collections which would, appropriately marked up, greatly enhance the reach of the Semantic Web. The archives themselves would benefit as well, by improved opportunities for semantic search, navigation and interconnection with other archives.

The main challenge lies in the fact that ontology creation at the moment is a very detailed and complicated process. It mostly requires the service of an ontology engineer, who designs the ontology in accordance with domain experts. The software tools available, be it from the text engineering or the ontology creation disciplines, reflect this: they are built for engineers, not for domain experts. In order to really tap the potential of the digital collections, tools are needed that support the domain experts in marking up the content they understand better than anyone else.

This paper presents an integrated approach to knowledge capturing and subsequent ontology creation, called WIKINGER, that aims at empowering domain experts to prepare their content for inclusion into the Semantic Web. This is done by largely automating the process through the use of named entity recognition and relation discovery.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—linguistic processing; I.2.4 [Artificial Intelligence]: Knowledge Representation Formalisms and Methods—semantic networks; I.2.6 [Artificial Intelligence]: Learning—knowledge acquisition, concept learning; I.2.7 [Artificial Intelligence]: Natural Language Processing—text analysis; I.5.3 [Pattern Recognition]: Clustering

General Terms
Algorithms

Keywords
Named Entity Recognition, Relation Discovery, Semantic Networks, Wiki Systems

1. INTRODUCTION
The Semantic Web can only flourish if enough content providers adopt it for the presentation of their content. This lack of adoption is the Achilles heel of the vision of the data web where humans and software agents can work side by side. The main reason for this lies right at the base of the Semantic Web: the creation of ontologies. The process needed to get to a working representation of a domain is too difficult for domain experts to do it on their own - a debilitating factor on the way to widespread adoption: the WWW did flourish simply due to the ease of marking up knowledge in HTML. This does not hold true for OWL or even RDF.

There are tools that deliver support in the process of creating an ontology, both from the domain of text engineering as well as from ontology engineering. But these tools are made for a selected audience: ontology engineers. This in itself is nothing bad, but it reduces the growth of the Semantic Web to the availability (and affordability) of said engineers. Tools are needed that allow domain experts themselves to design and create ontologies tailored for their needs and domain corpora, if the Semantic Web is to come about on a grand scale.

But what is needed to create an ontology from a text corpus? First of all, an ontology can be seen as a graph structure, a semantic network. The nodes of this graph are the entities, i.e.
the actors, topics and objects of the ontology, while the edges of the graph are the relations that exist between the entities. The task of automatically creating an ontology can be broken down into the following steps: first, named entity recognition (NER), and second, the detection of relations existing between those entities.

The detection and classification of proper names into predefined categories is called Named Entity Recognition (NER). The recognition of the categories PERSON, LOCATION and ORGANIZATION within the newspaper domain is especially well-studied as a part of the MUC campaigns (Message Understanding Conferences) and can be conducted automatically with a performance beyond 0.9 F-measure for English texts [4]. The detection of relations between the entities of a corpus is a younger discipline, usually concerned with binary relations. Experiments on English newspapers show performance around 0.75 F-measure [8]. These advances facilitate a largely automated processing of text corpora into domain ontologies. This paper introduces an integrated web service-based framework called WIKINGER that does just that.

This paper is structured as follows: Section 2 gives an overview of the WIKINGER framework, sections 3 and 4 describe our work on named entity extraction, while section 5 describes the relation discovery part of the process. After that, section 6 highlights relevant related work, and we close with remarks on future work and the conclusion in sections 7 and 8.

2. WIKINGER - THE BIG PICTURE
WIKINGER [3], short for Wiki Next Generation Enhanced Repositories, aims at developing collaborative knowledge platforms for scientific communities. The collaboration is facilitated by selecting a Wiki as a presentation layer, and the knowledge contained can be organized via semantic relations. The resulting semantic Wiki can be extended, reorganized and commented on by all (registered) members of the particular scientific community. To set up and maintain the semantic network, NER techniques are applied to the available domain-relevant documents (see section 3). The resulting annotations are the potential nodes of the semantic network that is constructed in a semi-automatic manner.

The architecture of WIKINGER is motivated by the assumption that many nodes of a domain-specific semantic network occur in domain-relevant texts and that these occurrences are proper names or expressions which can be extracted with NER techniques.

The pilot domain of WIKINGER is contemporary history with a focus on the history of Catholicism in Germany. For that domain, the traditional NER categories PERSON, LOCATION, ORGANIZATION, and TIME/DATE expressions obviously carry crucial nodes for a domain-specific semantic network. However, the domain experts desired additional categories, such as HISTORICAL-EVENT, BIOGRAPHIC-EVENT or ROLE. A ROLE is a function or a position a person holds (e.g. "bishop", "professor of theology") and is often part of a BIOGRAPHIC-EVENT, which may contain additional annotations such as LOCATION and TIME/DATE, as the following example shows:

    1936 archbishop of Cologne

The HISTORICAL-EVENT category describes events significant to the domain experts, such as the "Wall Street Crash of 1929", also called "Black Thursday". This category may contain embedded categories, too. The two event categories of the pilot domain are beyond the traditional NER task: depending on the perspective, they either involve relation extraction or embedded categories. The corpus to annotate currently consists of approximately 150 monographs within a book series. The books were scanned and the text was OCR-extracted. The annotations of the resulting corpus will be used as potential nodes of the semantic network to be created.
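Embedded annotations of this kind can be represented as standoff spans over the text, where one token may belong to several categories at once. The following Python sketch is an illustration only, assuming made-up character offsets and the category names from the example above; it is not the actual WIKINGER corpus format:

```python
# Illustrative standoff representation of embedded (overlapping) annotations.
# Categories and offsets are assumptions for this example; the real corpus
# format is not reproduced here.

text = "1936 archbishop of Cologne"

# Each annotation is (category, start offset, end offset); spans may nest.
annotations = [
    ("BIO-EVENT", 0, 26),   # the whole biographic event
    ("DATE",      0, 4),    # "1936"
    ("ROLE",      5, 26),   # "archbishop of Cologne"
]

def categories_at(pos, annots):
    """Return all categories covering a character position."""
    return {cat for cat, start, end in annots if start <= pos < end}

# "1936" is a DATE and, at the same time, part of the BIO-EVENT:
print(sorted(categories_at(0, annotations)))  # ['BIO-EVENT', 'DATE']
```

Because the spans nest rather than partition the text, a token-level classifier that assigns exactly one label per token cannot express this directly; this is the motivation for the multi-classifier setup discussed in section 3.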
The relations are proposed based on clusters of co-occurring entities (see section 5).

Figure 1 shows a view of the components that are part of the WIKINGER framework. It is built following a service-oriented architecture; its modules are loosely coupled, which allows need-driven reconfiguration of the system. The system itself uses a linked set of data repositories to perform its duties. The resource layer at the bottom of fig. 1 shows a drastically simplified view of the outside world: it contains arbitrary data sources that can be imported into the first of the repositories, i.e. the document repository. This repository provides the other services of the system with a versioned corpus of documents to work on. The processing services (e.g. for NER, relation discovery and creation of the ontology) use this repository as a source only. They feed their results into the metadata repository. It is linked to the document repository to uphold references to the original, and it also provides versioned storage of the data. This ensures that the original corpus remains unchanged. The final repository contains the semantic model of the corpus. It makes use of both the document repository as well as the metadata repository. At the moment, the application layer takes the form of a wiki system, but other applications can easily be envisioned.

[Figure 1: The WIKINGER Framework: Component View. The diagram shows an application layer (browser, editor, annotation and administration clients for guests/members, authors/editors, annotators and administrators), a service layer (entity service, NER, analyzer, metadata service, document service, account service), and a resource layer (user database, document sources, and the entity model, metadata and document repositories).]

Since the book series has a consistent layout structure, it was possible to preserve some layout information, such as the distinction between footnotes and other text. This distinction is helpful in order to detect a text unit specific to the texts of our domain called a "biogram". A biogram usually is a footnote that is provided the first time a person is mentioned in the text and comprises a short biography. These biographies usually are short and concise and tend to follow a predetermined structure. For instance, most of the biograms start with the name of the person, and some biograms present the single pieces of information separated by a particular delimiter such as a semicolon or comma.

Thus, in most cases the person named at the beginning of a biogram is the one that the other annotations in that biogram relate to. While some of the information items also belong to persons that are related to the person described in the biogram (e.g. "his father was a prime minister"), this assumption nevertheless holds true for the largest part of the corpus. This is very important for the relation discovery step, since all relations discovered in a specific biogram are linked implicitly to said person, although his or her participation in most of the relations is not readily apparent from their local contexts. Accordingly, they need to be associated with the person discussed in the biogram, which in turn has implications for the creation of the semantic network from the annotation and relation data discovered in the course of the process.

Processing these biograms results in a semantic network in OWL which contains any information that could be harvested automatically from all the biograms within the 150 monographs. This knowledge base constitutes a biographical database for the scientific domain, which, according to the historians working within the WIKINGER project, is a long-time desideratum for the domain of contemporary history of Catholics in Germany.

However, the tasks described are not limited to the pilot application of WIKINGER. Indeed, it has many features in common with a series of annotation tasks found in other domains as well. Our research within the WIKINGER project focuses on the application-oriented generalization of these challenges.

3. NER
It is highly desirable to generalize the successful NER approaches described in section 1 to a broader variety of semantic markup at phrase level (i.e. apart from "standard" categories such as PERSON, ORGANIZATION, or LOCATION) in order to support other NLP applications. However, this requires annotation components that can be extended to new categories and adapted to new domains and new languages. These tasks may have different characteristics than the classical MUC task: first, they may lack the clue of distinctive capitalization for some semantic classes and some languages, such as German. Second, the categories of interest may neither be obvious nor easily understandable due to a highly specialized domain and language.

A well-known example of such a task is the recognition of biomedical entities such as genes, proteins or cell tissue [6, 9]. It is almost impossible for a non-expert in the biomedical domain to judge the correctness of an annotation or even to figure out a definition of the classes to recognize. Additionally, capitalization is not a distinctive feature of the entities to detect. Furthermore, biomedical entities are not proper names in the linguistic sense, since a mention of a particular protein refers to all instances of that protein and not to a particular instance.

The annotation task within WIKINGER has similar characteristics: the documents to be processed are specialized texts, thus the definition of the annotation categories has to be provided by the domain experts. Also, most of the texts are in German, so capitalization is not a reliable clue to detect proper names. Furthermore, discussions with the domain experts have shown that some of the annotation tasks amount to information extraction in a more general sense, in particular involving relation extraction, even though on a local level. For example, the BIO-EVENT provided in section 2 establishes a relation between the person the respective biogram deals with, a role occupied by that person, a certain time, and a location. Although these annotation tasks significantly expand the annotation of proper names, we still consider them a sophisticated form of NER. In other words, we basically employ approaches which have been successfully applied to NER.

In principle, two major kinds of NER approaches have been proposed in the literature: rule-based and machine learning (ML) approaches. Rule-based approaches employ a handcrafted set of rules which is fine-tuned to the particular application domain. The adaptation of such a rather complex rule set to new domains and/or languages brings about extensive modification and maintenance efforts and therefore requires comprehensive knowledge about both the new domain and the proper design of the linguistic rule set. This means that domain experts need extensive support by computational linguists in order to port such a system to their domain. In contrast, adapting machine learning approaches to a new application domain requires the creation of domain-specific training data, i.e. manual annotation of domain-specific documents.
Since this essentially requires domain (rather than linguistic) expertise, domain professionals need much less support by computational linguists (if any at all). Our experience within the WIKINGER project has shown that such support is necessary primarily for the initial task of defining a suitable set of semantic categories. During this definition stage, the communication between domain experts and linguists in essence consists in exchanging annotated examples. We believe that this example-based communication significantly facilitates portability, since concrete examples are much easier to create and understand than the explicit formulation of more or less complex and abstract (sub-)regularities. The same holds true for the annotation of the training data itself, which can be regarded as example-based communication between domain experts and machine learning algorithms.

Consequently, in order to minimize the amount of "external help" by specialists needed to set up the WIKINGER system for their domain, we decided to employ ML approaches for NER. In our current experiments, we are using Maximum Entropy modeling and support vector machines. (As implementations, we employ openNLP¹ and SVMstruct², respectively.) However, we aim at providing a variety of ML algorithms which can either be employed independently or in combination to maximize performance. Regarding portability, it is crucial that the learning approaches employ domain-independent features and resources that can be easily adapted to a new domain or a new NER task. Furthermore, these methods have to be applied in a way that allows the acquisition of embedded annotations. "Standard" ML classifiers assign one class (in our case, a semantic category) to each instance to classify (in our case, a token)³. In embedded annotations, (parts of) entities may receive multiple classes simultaneously (e.g. in the example in section 2, "1936" is at the same time a DATE and part of a BIO-EVENT). To achieve this kind of concurrent classification, we run multiple classifiers, each one assigning different classes, and unify the results. For ML approaches which are restricted to binary classification (e.g. SVM), one classifier is required for each category. For ML approaches without this restriction (e.g. MaxEnt), classifiers assigning multiple classes can be built and combined in a more flexible way. Our experiments with MaxEnt models have shown that combining classifiers, each of which assigns all categories except one, i.e. each of which "ignores" one particular class, yields higher performance than employing binary classifiers. In these experiments, we got F-measures (at token level) of up to 84.6% for persons, 87.1% for organizations, 94.8% for geographic-political entities, and 92.8% for roles.

4. WALU
A prerequisite for enabling domain experts to create training data and control the process of training and (semi-)automatic semantic markup is the availability of a powerful and convenient tool. On the one hand, such a tool has to provide the necessary functionalities, i.e. manual annotation of documents, configuration and initiation of the training process, application of automatic annotation components, as well as inspection and correction of the resulting annotations. On the other hand, intuitive interfaces and convenient facilities supporting these functionalities while encapsulating their complexity are crucial to ensure usability for professionals of any domain. In addition, this tool has to be integrated into the overall WIKINGER infrastructure sketched in section 2. Currently there is no tool available that meets all these requirements (see section 6), at least not to our knowledge. Therefore, we are developing such a tool, which we call WALU (WIKINGER Annotations- und Lern-Umgebung = WIKINGER annotation and learning environment, see [16]).

WALU supports manual annotation with a GUI that is easy to use. It offers comfortable navigation through the annotations, and simple but effective annotation support such as the automatic adjustment of markup boundaries or a dynamic markup dictionary. This dictionary is created during the annotation process and is used to propose markup labels for text passages corresponding to dictionary entries. Using a context-sensitive menu, the annotator confirms or rejects these proposals and/or removes the entry from the dictionary. In our experience, the immediate feedback of the dynamic markup dictionary also helps the domain experts to clarify the task of string-based identification of domain-relevant concepts. Additionally, WALU provides an automatic annotator for strings referring to the category DATE, which is based on regular expressions. This is a simple prototype of a series of automatic mechanisms that will be used to annotate all the available documents. Except for a few annotators based on regular expressions to classify entities with unique patterns (such as email addresses and URLs), most of these annotators are based on machine learning algorithms that will be accessible via WALU.

Training the ML facilities mentioned in section 3 as well as their annotation of new text can be initiated via the WALU GUI. The annotation results can be displayed and manually corrected. Automatic annotations are displayed in a distinct way (only the lower half of the annotated tokens is marked) so that they can be discovered immediately by the user.

WALU is designed both as a part of the WIKINGER infrastructure and as a stand-alone tool. Web-service-based communication facilities allow WALU to load documents from the WIKINGER document repository and load/store corresponding annotations from/to the metadata repository. As a stand-alone tool, WALU currently is able to import text documents (other import formats will be supported later) and to export annotated documents in a straightforward XML standoff format. The transfer between the various different data formats is achieved via a special internal format we call 'WaRP (WALU Rich Paragraph) stream', which is also processed by the automatic annotation components.

¹ http://maxent.sourceforge.net/
² http://svmlight.joachims.org/svm_struct.html
³ Multiword NEs are recognized as a sequence of tokens receiving the same class.

5. SEMIAUTOMATIC RELATION DISCOVERY
The algorithms and tools described in the preceding sections provide named entities for a variety of project-dependent concept classes. They will become the nodes of the semantic network that is to be built. The remaining part is the provision of edges connecting these nodes, which will be explained in this section. The common approach to this problem is to let domain experts come up with a small number of relations and then to model them in an ontology editor. This requires knowledge of both ontology creation and ontology editors, which tends to be too high a hurdle for domain experts. Instead, we propose to do it based on the content of the corpus in question. With the named entities given by the preceding steps, relation discovery applying statistical methods becomes feasible.

5.1 Algorithm
Figure 2 shows the workflow of our approach. The first step, NER, has been covered already. The next step consists of the application of an association rule mining algorithm on the annotated corpus that has been segmented on the sentence level. Only those sentences containing at least two entities are kept. Each sentence is represented by the set of entity classes appearing in it. These item sets serve as input for the apriori algorithm [1], which generates a set of association rules of the form a → b. Each rule carries two parameters: support (the number of observations supporting it) and confidence (in our case #(a → b)/#a). Thresholds for these parameters can be used to influence the result of the algorithm.

[Figure 2: Workflow of the algorithm]

The association rules can be ranked according to the two parameters. High support promises higher coverage; high confidence hints at a tighter correlation between the entity classes involved. Rules with more than one succedent tend to be more specialized, as evidenced by a higher confidence, and thus offer a higher potential information gain; moreover, they tend to be forgotten by the domain experts when asked to come up with possible relations.

The next step is a clustering phase. It takes an association rule as input. The sentences of the rule are preprocessed, i.e. the named entities are replaced with their respective classes. This is done to receive generalized patterns of the relations in the sentences. Only the part between the outermost named entities is taken and transformed into word vectors. The weights of the vectors are created using tf*idf.

The goal of the clustering phase is to receive relation clusters, i.e. clusters in which every vector symbolizes the same relation. Since the number of relation clusters is not known beforehand, agglomerative clustering is applied. In this algorithm, every vector starts as its own cluster. Clusters are then merged if they fulfill a certain clustering criterion that is defined on a distance measure. We use standard cosine similarity as the distance measure and allow both single and complete linkage as criteria. Given two clusters A and B and a distance threshold t, this translates to:

    Single linkage: min { dist(α, β) : α ∈ A, β ∈ B } < t
    Complete linkage: max { dist(α, β) : α ∈ A, β ∈ B } < t

Which method is used depends on the corpus in question. Terse texts show better results with complete linkage; normal text performs better with single linkage.

The result of this step is a set of relation clusters for each association rule. User interaction is needed at this point, in order to review the results and to provide meaningful labels for the relations. They are not generated automatically at the moment, but schemes employing part-of-speech analysis (e.g. using the verbs) are feasible.

The last step of the algorithm is the transformation of the entities and their relations into an ontology language. The transformation process is a straightforward affair for entities, classes and binary relations, since those can be handled by corresponding constructs in RDF. The transformation of n-ary relations is slightly more complex, since it involves blank nodes that act as a hub for the attachment of binary relations to the various members of the relation. The resulting RDF represents the ontology for the domain corpus.

In the use case of our project, we have to deal with a dynamic corpus, since the articles from the wiki are fed back into the system to be analyzed. This continually updates the semantic network and keeps it on par with the wiki. But an additional step is required: relation classification. The relation clusters that have been committed in the initialization phase of the system are used for this task. New instances of sentences are marked up with named entities and are then transformed into word vectors which can be classified against the relation clusters, and subsequently transformed into RDF. Since the provenance of each triple in the ontology is known, exchanges can be restricted to those triples that are affected.
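The agglomerative clustering step with its two linkage criteria can be sketched as follows. This is an illustrative implementation on toy two-dimensional vectors, not the project's code; the threshold and vectors are assumptions, and a cluster pair is merged when its single- or complete-linkage cosine distance falls below t:

```python
from math import sqrt

def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def linkage_distance(A, B, mode):
    """Cluster distance: minimum (single) or maximum (complete) pairwise distance."""
    dists = [cosine_distance(a, b) for a in A for b in B]
    return min(dists) if mode == "single" else max(dists)

def agglomerate(vectors, t, mode="single"):
    """Merge clusters while some pair is closer than the threshold t."""
    clusters = [[v] for v in vectors]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if linkage_distance(clusters[i], clusters[j], mode) < t:
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters

# Toy word vectors: the first two are nearly parallel, the third orthogonal.
vecs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
print(len(agglomerate(vecs, t=0.1)))  # the two similar vectors merge: 2 clusters
```

With a tight threshold only near-parallel vectors end up in one relation cluster; loosening t merges more aggressively, which mirrors the trade-off between many fine-grained clusters and a few coarse ones described above.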
High support promises higher coverage, high against the relation clusters, and subsequently transformed confidence hints at a tighter correlation between the entity into RDF. Since the provenance of each triple in the ontol- classes involved. Rules with more than one succedent tend ogy is known, exchanges can be restricted to those triples to be more specialized, as evidenced by a higher confidence, that are affected. and thus offer a higher potential information gain and they tend to be forgotten by the domain experts, when asked to Preliminary evaluation results of the algorithm show F-mea- come up with possible relations. sures (F1 = 2∗Recall∗P recision ) between 70% and 75% for Recall+P recision clusters representing binary as well as n-ary relations. The The next step is a clustering phase. It takes an association algorithm usually creates more relation clusters than a hu- rule as input. The sentences of the rule are preprocessed, i.e. man would, since humans tend to generalize the relations the named entities are replaced with their respective classes. rather than to have a multitude of minuscule distinctions in This is done to receive generalized patterns of the relations in their relation set. We have performed an evaluation of the the sentences. Only the part between the outermost named performance of the algorithm against a part of the corpus entities is taken and transformed into word vectors. These relevant for the pilot application in the WIKINGER project. weights of the vectors are created using tf*idf. More details can be found in [2]. The goal of the clustering phase is to receive relation clus- ters, i.e. clusters in which every vector symbolizes the same 5.2 User interface In order to provide the domain experts with an interface linguistics. In this respect, WALU complements the range that facilitates directing the relation discovery process, the of existing tools. Wikinger Relation Discovery GUI, short WiReD, has been developed. 
It allows to view the results of the different steps of the algorithms and to experiment with different settings 6.2 Ontology learning environments As has been pointed out above, ontology learning environ- for them. This encompasses the association rules generated ments usually are built as supporting tools for ontology by the apriori algorithm as well as the composition of the engineers. Their task differs from the one tackled by the relation clusters generated by the clustering phase. approaches in this paper insofar as the ontology engineer has the process-knowledge necessary for building ontologies. Association rules can be selected manually for clustering, He usually has access to different domain experts, and thus clusters can be post-processed (merged with others, deleted, needs only marginal software support. Named entity recog- renamed) and finally selected for inclusion into the seman- nition is employed sometimes to facilitate populating the tic network. The parameters for each algorithmic step are ontology, whereas relation discovery is not used extensively, preset with reasonable defaults, but can be changed directly at least not to our knowledge. from within WiReD, thus allowing experiments on the data set. This may sound intimidating at first reading, but in Text-To-Onto[11] contains a module that calculates associ- practice there are never more than two parameters per step ation rules to provide the engineer with an overview over in the processing chain, four parameters in total. possible interrelations between concept classes, but this ap- proach is not followed further in the context of the applica- When the experts have come to a final result, i.e. they have tion. Its successor, Text-2-Onto[5], employs a limited ver- agreed upon a set of relations they want to see included sion of relation extraction, insofar as it searches for hyponym in the ontology, the relation information is fed back into relation patterns (e.g. 
”x is a kind of y”) in order to find ad- the WIKINGER framework. Here it is used for different ditional instances of concept classes in a corpus. Relation purposes. First of all it can be used to transform the infor- discovery is not employed there. mation associated with it - the entities and their relations - into the ontology format of choice. If the corpus is static, this concludes the work needed for the ontology. In the case 6.3 Relation Discovery of dynamic corpora, e.g. wiki systems, the relation infor- Hasegawa et al [8] propose a system with a similar approach mation approved by the experts is used to automatically than the one presented here. They first perform NER on a classify new patterns that enter the system. These basically text corpus, and then collect entity pairs from within sen- follow the same steps of the algorithm, only now in a fully tences. These pairs are grouped by composition, the corre- automated mode. The experts can change the relation set sponding sentences are transformed into word vectors and anytime they want using the WiReD GUI which results in a a clustering step is performed on each of the groups. This total recalculation of the ontology to reflect their desire for results in a couple of relation clusters for each group. With change. some postprocessing (weeding out clusters below a certain size), they report F-measures of between 75% and 80% for selected clusters on a year of newspaper articles from The 6. RELATED WORK New York Times. In addition, they generate cluster labels This section highlights related work in the areas touched by by taking the words with the highest occurrence in each the work described in the sections above. We concentrate cluster. 
We believe that adding an association rule creation on annotation tools rather than individual NER algorithms, phase at the beginning helps in the selection of interesting since the tools mentioned all encompass different approaches combinations of relation candidates, even more so because to NER. Following that, ontology learning environments are we are not restricted to the detection of binary relations. discussed, with a special regard to their use of relation dis- covery. Finally, algorithms partial to the discipline of rela- There are other approaches besides this one, that exploit tion discovery are discussed. syntactic structures and perform parts-of-speech analysis: Jiang et al. [10] analyze sentence grammar trees, model candidate relations in RDF in order to capture their direc- 6.1 Annotation tools tion and extract from the RDF a set of generalized relations. As explained in section 4, the rationale behind WALU is Navigli et al. [14] present an approach to ontology learning its usability by professionals of any domain, in particular that exploits synsets from WordNet in order to disambiguate without computational or linguistic expertise. In this re- meaning and find relations that might hold between different spect, WALU differs from other existing tools for semantic entities from the sentences that explain the different synsets. annotation, e.g. GATE [7], WordFreak [12], MMAX [13], or But these approaches are dependent on deeper knowledge of PALinkA [15]. These tools are primarily intended for users the language of the text corpus. Approaches like Hasegawa’s with a background in (computational) linguistics. Conse- or ours only rely on statistics and the existence of annotated quently, they are either tailored to different, more complex entities, thus they are language agnostic. tasks than WALU (e.g. PALinkA for discourse annota- tion), or are designed as highly multifunctional tools (e.g. GATE, WordFreak, or MMAX). This multifunctionality al- 7. 
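The n-ary transformation described in section 5.1, where a blank node acts as a hub to which the members of the relation are attached by binary relations, can be sketched as follows. The predicate and entity names are made up for illustration and are not the actual WIKINGER ontology vocabulary:

```python
import itertools

_ids = itertools.count()

def nary_to_triples(relation_type, members):
    """Reify an n-ary relation as RDF-style triples.

    A fresh blank node acts as the hub; each member of the relation is
    attached to it by one binary triple. `members` maps a (hypothetical)
    role name to an entity identifier.
    """
    hub = f"_:rel{next(_ids)}"  # blank node acting as the hub
    triples = [(hub, "rdf:type", relation_type)]
    triples += [(hub, role, entity) for role, entity in members.items()]
    return triples

# A hypothetical BIO-EVENT linking a person, a role, a date and a location:
triples = nary_to_triples("ex:BioEvent", {
    "ex:person":   "ex:Person_1",
    "ex:role":     "ex:Role_ArchbishopOfCologne",
    "ex:date":     '"1936"',
    "ex:location": "ex:Cologne",
})
print(len(triples))  # 5: one type triple plus one triple per member
```

Because every member hangs off the same blank node, a binary relation remains a special case (a hub with two members), while relations of any higher arity need no additional RDF machinery.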
FUTURE WORK lows their flexible application with regard to specific and Regarding NER, we will implement an interface to the Weka complex needs. However, the price of this flexibility is that library [17], which comprises a number of machine learning these tools require extensive configuration efforts which sig- algorithms. We will investigate combinations of different nificantly affects usability for non-experts in computational ML approaches either sequentially (i.e. the output of one classifier is used as input to another one) or concurrently (i.e. [1] R. Agrawal and R. Srikant. Fast algorithms for mining several kinds of classifiers are run in parallel and a more-or- association rules. In Proceedings of the 20th VLDB less sophisticated voting mechanism — which might involve conference, pages 487–499, 1994. a further ML approach — decides on the final classification). [2] L. Bröcker. Semiautomatic Creation of Semantic Networks. In Online-proceedings of PhD-symposium at Furthermore, we plan to provide an interface to the UIMA ESWC 2007, June 2007. no URL as of yet. framework4 . This way, further facilities for learning and pre- [3] L. Bröcker, M. Rössler, A. Wagner, et al. WIKINGER processing (e.g. morphological or syntactic analysis, which - Wiki Next Generation Enhanced Repositories. In can provide useful information for semantic annotation as Online Proceedings of the German E-Science well as relation discovery) will become available to our frame- Conference, 2007. work. Since units from the UIMA framework can be pro- [4] N. A. Chinchor, editor. Proceedings of the Seventh vided as web services they can be added to complement the Message Understanding Conference, Fairfax, VA, 1998. WIKINGER framework as needed. [5] P. Cimiano and J. Völker. Text-2-Onto. In Proceedings of NLDB 2005, pages 227–238, 2005. Regarding relation discovery, we intend to apply our ap- [6] N. Collier, P. Ruch, and A. Nazarenko, editors. 
proach to other data sets, especially from the newspaper Proceedings of the International Joint Workshop on domain, in order to evaluate its performance on data sets Natural Language Processing in Biomedicine and its that cover a wide range of topics, and to enhance the al- Applications (JNLPBA-2004), Geneva, Switzerland, gorithm with a stage that extracts suitable labels for the 2004. relations and their members automatically. [7] H. Cunningham. GATE, a General Architectur for Text Engineering. Computers and the Humanities, The WIKINGER framework will be developed further, we 36:223–254, 2002. intend to use it as a base platform for a variety of future projects. [8] T. Hasegawa, S. Sekine, and R. Grishman. Discovering Relations among Named Entities from Large Corpora. In Proceedings of the Annual Meeting of Association of 8. CONCLUSIONS Computational Linguistics, pages 415–422, 2004. This paper described a new approach to semi-automatic [9] L. Hirschman, A. Yeh, C. Blaschke, and A. Valencia. knowledge capturing from large text corpora. The goal is to Overview of BioCreAtIvE: critical assessment of empower domain experts to create domain ontologies them- information extraction for biology. BMC selves, without being dependent on the availability of on- Bioinformatics, 6 (Supplement 1), 2005. tology engineers. This is to be achieved by automating the [10] T. Jiang, A. Tan, and K. Wang. Mining Generalized process to a high degree, by employing named entity recog- Associations of Semantic Relations from Textual Web nition (NER) and relation discovery. Domain experts are in- Content. IEEE Transactions on Knowledge and Data volved at those stages which require a substantial knowledge Engineering, 1(2):164–179, 2007. of the domain in question. Two software tools aiding in the process have been introduced that aid the domain experts [11] A. Maedche. The Text-To-Onto Environment, chapter in the task, WALU and WiReD. 
The former is a workbench 7 in Alexander Maedche: Ontology Learning for the for example-based NER, while the latter is a tool aiding in Semantic Web. Kluwer Academic Publishers, 2002. the relation discovery process. [12] T. Morton and J. LaCivita. WordFreak: an open tool for linguistic annotation. In Proceedings of the 2003 Evaluation results for the different algorithmic solutions have Conference of the North American Chapter of the been presented that show high values for F-measure for the Association for Computational Linguistics on Human automatic knowledge capturing methods. Language Technology, Edmonton, Canada, 2003. [13] C. Müller and M. Strube. MMAX: A tool for the All of this is part of a web service based architecture, the annotation of multi-modal corpora. In Proceedings of WIKINGER framework. It is used to create semantically en- the 2nd IJCAI Workshop on Knowledge and Reasoning hanced collaborative knowledge platforms for scientific com- in Practical Dialogue Systems, Seattle, WA, 2001. munities. The pilot application is a semantic wiki for the [14] R. Navigli, P. Velardi, and A. Gangemi. Ontology domain of contemporary history research regarding German learning and its application to automated terminology catholicism. translation. IEEE Intelligent Systems, 18(1):22–31, 2003. 9. ACKNOWLEDGMENTS [15] C. Orasan. PALinkA: A highly customisable tool for The work presented in this paper is being funded by the discourse annotation. In Proceedings of the Fourth German Federal Ministry of Education and Research under SIGdial Workshop on Discourse and Dialogue, research grant 01C5965. See http://wikinger-escience.de for Sapporo, Japan, 2003. further details regarding the project. The authors would [16] A. Wagner and M. Rössler. WALU — Eine like to thank Prof. Cremers from the University of Bonn Annotations- und Lern-Umgebung für semantisches and Prof. Hoeppner from the University of Duisburg-Essen Tagging. In G. Rehm, A. Witt, and L. 
Lemnitzer, for their helpful suggestions. editors, Data Structures for Linguistic Resources and Applications, pages 263–271. Gunter Narr Verlag, Tübingen, 2007. 10. REFERENCES 4 http://incubator.apache.org/uima/ [17] I. H. Witten and F. Eibe. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Fran- cisco, 2nd edition, 2005.
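To make the pipeline attributed to Hasegawa et al. [8] in Section 6.3 concrete, the following minimal sketch groups entity pairs by type composition, builds bag-of-words context vectors for their sentences, clusters each group, and labels every cluster with its most frequent context word. The toy corpus, the greedy single-link clustering, the stopword list, and the similarity threshold are all illustrative assumptions, not part of the published systems.

```python
# Sketch of clustering-based relation discovery (after Hasegawa et al.):
# group entity pairs by type composition, cluster their sentence
# contexts, and label clusters by the dominant context word.
import math
from collections import Counter
from itertools import combinations

STOPWORDS = {"is", "in", "as", "an", "the"}

# Sentences with their already-recognised entities: (surface, type).
SENTENCES = [
    ("Smith joined Acme as chief engineer", [("Smith", "PER"), ("Acme", "ORG")]),
    ("Jones joined Initech as an analyst",  [("Jones", "PER"), ("Initech", "ORG")]),
    ("Acme is based in Berlin",             [("Acme", "ORG"), ("Berlin", "LOC")]),
    ("Initech is based in Munich",          [("Initech", "ORG"), ("Munich", "LOC")]),
]

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(n * b.get(w, 0) for w, n in a.items())
    na = math.sqrt(sum(n * n for n in a.values()))
    nb = math.sqrt(sum(n * n for n in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(members, threshold=0.2):
    """Greedy single-link clustering of (pair, context-vector) items."""
    clusters = []
    for pair, vec in members:
        for c in clusters:
            if any(cosine(vec, v) >= threshold for _, v in c):
                c.append((pair, vec))
                break
        else:
            clusters.append([(pair, vec)])
    return clusters

# 1. Collect entity pairs per type composition, with their contexts.
groups = {}
for text, entities in SENTENCES:
    for (s1, t1), (s2, t2) in combinations(entities, 2):
        skip = {s1.lower(), s2.lower()} | STOPWORDS
        ctx = Counter(w for w in text.lower().split() if w not in skip)
        groups.setdefault((t1, t2), []).append(((s1, s2), ctx))

# 2. Cluster each group; label clusters by their most frequent word.
discovered = {}
for comp, members in groups.items():
    for c in cluster(members):
        words = Counter()
        for _, vec in c:
            words.update(vec)
        label = words.most_common(1)[0][0]
        discovered.setdefault(comp, []).append(([p for p, _ in c], label))
```

On this toy input the PER-ORG pairs end up in one cluster labeled "joined" and the ORG-LOC pairs in one labeled "based"; the postprocessing step of discarding clusters below a minimum size is omitted here.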
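The association rule creation phase argued for in Section 6.3 can be illustrated in the same spirit. The sketch below simply counts frequent entity-type combinations per sentence, a brute-force stand-in for the Apriori algorithm [1] (real Apriori generates and prunes candidates level by level); because itemsets larger than two are counted as well, n-ary relation candidates surface directly. The transactions and the support threshold are hypothetical.

```python
# Brute-force frequent-itemset counting over entity-type co-occurrences,
# illustrating why an association-rule phase yields n-ary (not only
# binary) relation candidates.
from collections import Counter
from itertools import combinations

# Each "transaction" is the set of entity types co-occurring in a sentence.
transactions = [
    {"PER", "ORG", "DATE"},
    {"PER", "ORG", "DATE"},
    {"PER", "ORG"},
    {"ORG", "LOC"},
    {"ORG", "LOC"},
]

def frequent_itemsets(transactions, min_support=2, max_size=3):
    """Return every entity-type combination seen >= min_support times."""
    counts = Counter()
    for t in transactions:
        for size in range(2, max_size + 1):
            for combo in combinations(sorted(t), size):
                counts[combo] += 1
    return {c: n for c, n in counts.items() if n >= min_support}

freq = frequent_itemsets(transactions)
```

Here the ternary candidate (DATE, ORG, PER) reaches the support threshold alongside the binary ones, which is exactly the kind of combination a pair-only collection step would miss.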
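The "concurrent" classifier combination outlined in Section 7 can be sketched as a simple majority vote. The three rule-based classifiers below are hypothetical stand-ins for models trained with, e.g., the Weka library [17]; the token labels and heuristics are illustrative only.

```python
# Parallel classifier combination with majority voting: each classifier
# labels the same token independently; the most frequent label wins.
from collections import Counter

def by_capitalisation(token):
    return "ENT" if token[:1].isupper() else "O"

def by_suffix(token):
    return "ENT" if token.endswith(("burg", "Corp")) else "O"

def by_lexicon(token, lexicon=frozenset({"Smith", "Acme"})):
    return "ENT" if token in lexicon else "O"

CLASSIFIERS = [by_capitalisation, by_suffix, by_lexicon]

def vote(token):
    """Majority decision over all classifier outputs (odd count: no ties)."""
    return Counter(c(token) for c in CLASSIFIERS).most_common(1)[0][0]

labels = {t: vote(t) for t in ["Smith", "visited", "Hamburg", "Acme"]}
```

A sequential combination would instead feed one classifier's output to the next as an additional feature (stacking), and the voting function itself could be replaced by a further trained model, as the section suggests.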