A semantic wiki for novelty search on documents

Michael Färber∗
Karlsruhe Institute of Technology (KIT), Institute AIFB, 76131 Karlsruhe
michael.faerber@kit.edu

Achim Rettinger
Karlsruhe Institute of Technology (KIT), Institute AIFB, 76131 Karlsruhe
rettinger@kit.edu

∗ This work is supported by the German Federal Ministry of Education and Research (BMBF) under grant 02PJ1002 (SyncTech).

DIR 2013, April 26, 2013, Delft, The Netherlands.

ABSTRACT
Technology-oriented companies are typically interested in monitoring developments concerning their technologies. However, most companies, especially SMEs, do not have an efficient process for doing so. If at all, efforts are mostly limited to uncoordinated keyword queries on web resources. Here, we present a semi-automatic approach that allows for structured and continuous detection of relevant, novel and domain-specific documents appearing on the Web. Our system is based on a semantic wiki where the domain expert is able (i) to store all relevant information in an adequate knowledge base with the ability for monitoring and trend mining and (ii) to import detected novel items such as future technologies and their properties into the knowledge base in a continuous fashion. The latter is achieved by generating a structured query based on the user context and by representing found documents as semantic graphs. In this way, novel items can be found more easily and in a semi-automatic fashion.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous

General Terms
Algorithms, Economics

Keywords
semantic wiki, novelty detection, document ranking, ontology-supported information extraction.

1. MOTIVATION
Technology forecast and trend detection are indispensable tasks for technology companies in order to be informed about market developments and inventions in their fields. With the advent of more and more documents on the Web, companies face the task of extracting relevant and novel information for this purpose. Currently, this usually has to be done purely manually and without any structured background data, making it a very time-consuming task. Therefore, we provide a semi-automatic process for trend detection and monitoring services. We present a semantic wiki-based application which is based on ontology-based information extraction (OBIE), where ontologies are used within the information extraction (IE) process. Since appropriate ontologies regarding technologies and their properties are usually missing or too small, we focus our work on the crucial task of how to efficiently find new textual information which is relevant to the domain expert, but has not been stored in the knowledge base (KB) and, therefore, has not yet been made usable.

2. RELATED WORK
Within the TREC “novelty track” in 2002–2004 [2], systems for detecting novelty were designed. However, the task took place on sentence level, was limited to event and opinion detection, and was geared towards non-domain-specific texts such as news. Newsjunkie [1] is also geared to detecting novelty by comparing a new document against an existing document collection. Contrary to such systems, we face domain-specific documents like technical reports and patents, and therefore do not have to deal with the problem of analysing huge amounts of articles in a very short time period, known as “burst of novelty”. Instead of purely statistical measures, our approach is based on semantic technologies.

3. DOCUMENT RANKING AND ONTOLOGY POPULATION
Figure 1 gives an overview of the interplay between an ontology and documents with potentially novel information: Given our own KB with instances and schema, our goal is to search for documents and to rank them, so that the documents most novel to the KB and relevant to both the query and the KB have the highest ranking. In a second step the user is able to import phrases marked in the document into his/her KB as property values.

[Figure 1 components: WWW search, auto annotation, document ranking, fact extraction, ontology population, GUI-1, GUI-2.]
Figure 1: According to a user’s context a structured query is generated with the help of an underlying ontology. Afterwards, ranking is performed using the annotated document corpus. In the last step, annotations are verified by the user and used for populating the ontology. In succeeding search rounds search is based on the enriched ontology.

Concerning the first part, Semantic MediaWiki¹ as an instance of a semantic wiki is assigned the central role: The user is able to create new wiki pages (within the semi-automatic process or just manually) and to add appropriate properties with the help of a class-specific form (see Figure 2). Internally, all data is stored in a structured way. The wiki allows the user to create a search query out of the context by taking instances and property values (from the KB) as well as search keywords written by the user. After an optional expansion of the query graph with neighbouring entities, we can generate the final query graph. Since all documents are annotated with the help of named entity recognition tools², we can compare the generated query graph with all document entity graphs (generated from extracted named entities). Ranking of the documents is facilitated by weights which were assigned to every relation in the KB schema graph. We can use implicit user feedback in the following way: If a user imports some novel item as a new property or instance, the weights in the KB schema graph are adapted. By this means, we can take into account the user's personal view of which relationships between certain classes and properties (or other classes) are of great significance and should be reinforced in subsequent search sessions.

[Figure 2 example page “Lithium-ion battery” with tabs Principle, Assessment, Sources/Contact. Technology description: A Lithium-ion battery (also: Li-ion battery) is a hypernym of batteries on the basis of lithium. Operand: Energy. Operation: Storage. Special features: Independency of time and place. Very high energy density. Thermal stable. No memory effect. Market fields: Industry, Household, Automotive, Other. Handling: easy. In the edit form, Operand offers N/A, Matter, Energy, Information; Operation offers N/A, Change, Transportation, Storage.]
Figure 2: Screenshots of a Semantic MediaWiki: (a) displaying technology property values within a wiki page (b) edit functionality using form.

Our use cases are determined by our industry partners³, which are medium-sized technology companies. Hence, the lightweight ontologies we used consist of classes like technology, institution, and product. As document corpus, web documents retrieved by search engine requests are considered. In addition, trend detection in conjunction with patents can be enabled by using the patent database Espacenet⁴, where access to over 70 million patent documents and their metadata is provided.

4. CONCLUSION
Existing processes and tools for trend mining and technology watch are often only rudimentarily implemented, especially in SMEs. We have presented a semantic wiki for storing and displaying structured information about a specific domain (industrial technology field) and for generating a context-aware semantic search query. With the help of a newly proposed ranking schema, the more relevant and potentially novel information a document contains, the higher it is ranked and, hence, the more likely it is to be worth reading and used for ontology population. Due to the use of structured information and appropriate background data, the way of doing trend mining can be changed towards a semi-automatic process with better search and monitoring capabilities.

5. REFERENCES
[1] Evgeniy Gabrilovich, Susan Dumais, and Eric Horvitz. Newsjunkie: providing personalized newsfeeds via analysis of information novelty. In Proceedings of the 13th international conference on World Wide Web, WWW ’04, pages 482–490, New York, NY, USA, 2004. ACM.
[2] Ian Soboroff and Donna Harman. Novelty detection: the TREC experience. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT ’05, pages 105–112, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics.

¹ http://semantic-mediawiki.org/
² One of these tools is the wikify service of the Wikipedia Miner (http://wikipedia-miner.cms.waikato.ac.nz/), which we adapt by using the content of our domain-specific semantic wiki. In order to also detect new entities, property values, and relationships, we use GATE (http://gate.ac.uk), a well-established rule-based framework.
³ Industry partners within the German research project syncTech (http://synctech-innovation.de).
⁴ http://worldwide.espacenet.com
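As an illustration of the ranking and feedback step described in Section 3, the following minimal sketch builds a query graph from the user context, scores document entity graphs against it, and reinforces a relation weight after an import. The paper does not specify its ranking formula, so the scoring rule, the `known_factor` discount, the multiplicative update in `reinforce`, and relation names such as `hasOperation` are all hypothetical choices made for this example, not the authors' method.

```python
from typing import Dict, List, Set, Tuple

# A graph is modelled as a set of (subject, relation, object) triples.
Triple = Tuple[str, str, str]
Graph = Set[Triple]


def build_query_graph(kb: Graph, context: Set[str], expand: bool = False) -> Graph:
    """Collect KB triples touching the context entities.

    With expand=True, direct neighbours of the context entities are added
    first, mirroring the optional query graph expansion (assumed semantics).
    """
    entities = set(context)
    if expand:
        for s, _, o in kb:
            if s in context:
                entities.add(o)
            if o in context:
                entities.add(s)
    return {t for t in kb if t[0] in entities or t[2] in entities}


def score_document(doc: Graph, query: Graph, kb: Graph,
                   weights: Dict[str, float], known_factor: float = 0.5) -> float:
    """Assumed scoring rule: sum relation weights over document triples that
    touch a query entity; triples already in the KB are discounted."""
    query_entities = {e for s, _, o in query for e in (s, o)}
    score = 0.0
    for triple in doc:
        s, rel, o = triple
        if s not in query_entities and o not in query_entities:
            continue  # triple is not relevant to the query
        w = weights.get(rel, 1.0)
        score += w * known_factor if triple in kb else w
    return score


def rank_documents(docs: Dict[str, Graph], query: Graph, kb: Graph,
                   weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (document id, score) pairs, highest score first."""
    scored = [(name, score_document(g, query, kb, weights)) for name, g in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def reinforce(weights: Dict[str, float], relation: str, rate: float = 0.1) -> None:
    """Implicit feedback: raise the weight of a relation the user imported
    (hypothetical multiplicative update)."""
    weights[relation] = weights.get(relation, 1.0) * (1.0 + rate)


if __name__ == "__main__":
    kb = {
        ("Lithium-ion battery", "hasOperation", "Storage"),
        ("Lithium-ion battery", "hasMarketField", "Automotive"),
    }
    weights = {"hasOperation": 1.0, "hasMarketField": 1.0, "hasSpecialFeature": 2.0}
    docs = {
        "doc_a": {("Lithium-ion battery", "hasMarketField", "Automotive")},
        "doc_b": {("Lithium-ion battery", "hasSpecialFeature", "No memory effect")},
        "doc_c": {("Solar panel", "hasMarketField", "Household")},
    }
    query = build_query_graph(kb, {"Lithium-ion battery"})
    for name, score in rank_documents(docs, query, kb, weights):
        print(name, score)
    # The user imports the novel special feature, so that relation is reinforced.
    reinforce(weights, "hasSpecialFeature")
```

In this toy setting the document carrying a fact that is relevant to the query but absent from the KB is ranked above the one that only repeats known facts, and the unrelated document scores zero. A real system would derive the document graphs from the named entity annotations rather than from hand-written triples.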