A semantic wiki for novelty search on documents

                             Michael Färber∗                                     Achim Rettinger
                Karlsruhe Institute of Technology (KIT)                Karlsruhe Institute of Technology (KIT)
                            Institute AIFB                                         Institute AIFB
                           76131 Karlsruhe                                        76131 Karlsruhe
                      michael.faerber@kit.edu                                   rettinger@kit.edu


∗ This work is supported by the German Federal Ministry of Education and Research (BMBF) under grant 02PJ1002 (SyncTech).

ABSTRACT
Technology-oriented companies are typically interested in monitoring developments concerning their technologies. However, most companies, especially SMEs, do not have an efficient process for doing so. If any effort is made at all, it is mostly limited to uncoordinated keyword queries on web resources. Here, we present a semi-automatic approach that allows for structured and continuous detection of relevant, novel, and domain-specific documents appearing on the Web. Our system is based on a semantic wiki where the domain expert is able (i) to store all relevant information in an adequate knowledge base with support for monitoring and trend mining and (ii) to import detected novel items, such as future technologies and their properties, into the knowledge base in a continuous fashion. The latter is achieved by generating a structured query based on the user context and by representing found documents as semantic graphs. In this way, novel items can be found more easily and in a semi-automatic fashion.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous

General Terms
Algorithms, Economics

Keywords
semantic wiki, novelty detection, document ranking, ontology-supported information extraction.

1. MOTIVATION
Technology forecasting and trend detection are indispensable tasks for technology companies that want to stay informed about market developments and inventions in their fields. With the advent of more and more documents on the Web, companies face the task of extracting relevant and novel information for this purpose. Currently, this usually has to be done purely manually and without any structured background data, which makes it a very time-consuming task. Therefore, we provide a semi-automatic process for trend detection and monitoring services. We present a semantic wiki-based application that builds on ontology-based information extraction (OBIE), where ontologies are used within the information extraction (IE) process. Since appropriate ontologies regarding technologies and their properties are usually missing or too small, we focus our work on the crucial task of how to efficiently find new textual information which is relevant to the domain expert, but has not been stored in the knowledge base (KB) and, therefore, has not yet been made usable.

2. RELATED WORK
Within the TREC “novelty track” in 2002–2004 [2], systems for detecting novelty were designed. However, the task operated at the sentence level, was limited to event and opinion detection, and was geared towards non-domain-specific texts such as news. Newsjunkie [1] also detects novelty by comparing a new document against an existing document collection. In contrast to such systems, we deal with domain-specific documents such as technical reports and patents, and therefore do not have to deal with the problem of analysing huge amounts of articles in a very short time period, known as “burst of novelty”. Instead of purely statistical measures, our approach is based on semantic technologies.

3. DOCUMENT RANKING AND ONTOLOGY POPULATION
Figure 1 gives an overview of the interplay between an ontology and documents with potentially novel information: Given our own KB with instances and schema, our goal is to search for documents and to rank them so that the documents most novel to the KB and relevant to both the query and the KB have the highest ranking. In a second step, the user is able to import phrases marked in the document into his/her KB as property values.
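
The ranking goal stated above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the KB, the query, and each document have already been reduced to sets of entity labels, and that a document's score is the product of a relevance term (share of its entities known from the query or the KB) and a novelty term (share of its entities not yet in the KB). The function names, the scoring formula, and the sample data are assumptions made for this example and are not taken from the system itself.

```python
def relevance(doc_entities, query_entities, kb_entities):
    """Share of the document's entities that occur in the query or the KB."""
    if not doc_entities:
        return 0.0
    known = doc_entities & (query_entities | kb_entities)
    return len(known) / len(doc_entities)


def novelty(doc_entities, kb_entities):
    """Share of the document's entities that are not yet stored in the KB."""
    if not doc_entities:
        return 0.0
    return len(doc_entities - kb_entities) / len(doc_entities)


def rank_documents(docs, query_entities, kb_entities):
    """Return (doc_id, score) pairs, highest score first.

    The product (an assumed formula) rewards documents that are anchored
    in known entities and also mention entities missing from the KB.
    """
    scored = [
        (doc_id,
         relevance(ents, query_entities, kb_entities) * novelty(ents, kb_entities))
        for doc_id, ents in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data.
kb = {"lithium-ion battery", "energy", "storage"}
query = {"lithium-ion battery"}
docs = {
    "known": {"lithium-ion battery", "energy"},
    "mixed": {"lithium-ion battery", "solid-state electrolyte"},
    "off_topic": {"football", "weather"},
}
ranking = rank_documents(docs, query, kb)
```

In this toy setting, only the document that combines a known entity with an unknown one receives a non-zero score; a document that repeats KB content and a document unrelated to the KB both score zero.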

Concerning the first part, Semantic MediaWiki¹ as an instance of a semantic wiki is assigned the central role: The user is able to create new wiki pages (within the semi-automatic process or just manually) and to add appropriate properties with the help of a class-specific form (see Figure 2). Internally, all data is stored in a structured way. The wiki allows the user to create a search query out of the context by taking instances and property values (from the KB) as well as search keywords written by the user. After an optional expansion of the query graph with neighbouring entities, we can generate the final query graph. Since all documents are annotated with the help of named entity recognition tools², we can compare the generated query graph with all document entity graphs (generated from extracted named entities). Ranking of the documents is facilitated by weights which are assigned to every relation in the KB schema graph. We can use implicit user feedback in the following way: If a user imports some novel item as a new property or instance, the weights in the KB schema graph are adapted. In this way, we can adapt to the user's personal view of which relationships between certain classes and properties (or other classes) are of great significance and should be reinforced in subsequent search sessions.

[Figure 1 diagram; labelled components: WWW search, auto annotation, document ranking, GUI-1, fact extraction, GUI-2, ontology population]
Figure 1: According to the user's context, a structured query is generated with the help of an underlying ontology. Afterwards, ranking is performed using an annotated document corpus. In the last step, annotations are verified by the user and used for populating the ontology. In succeeding search rounds, search is based on the enriched ontology.

[Figure 2 screenshots of the wiki page "Lithium-ion battery"; tabs: Principle, Assessment, Sources/Contact]
(a) Property values shown on the page:
    Technology description: A Lithium-ion battery (also: Li-ion battery) is a hypernym of batteries on the basis of lithium.
    Operand: Energy
    Operation: Storage
    Special features: Independency of time and place. Very high energy density. Thermal stable. No memory effect.
    Market fields: Industry, Household, Automotive, Other
    Handling: easy
(b) Edit form for the same properties; Operand offers N/A, Matter, Energy, Information, and Operation offers N/A, Change, Transportation, Storage.
Figure 2: Screenshots of a Semantic MediaWiki: (a) displaying technology property values within a wiki page; (b) edit functionality using a form.

Our use cases are determined by our use case partners³, which are medium-sized technology companies. Hence, the lightweight ontologies we used consist of classes such as technology, institution, and product. As the document corpus, we consider web documents retrieved by search engine requests. In addition, trend detection on patents can be enabled by using the patent database Espacenet⁴, which provides access to over 70 million patent documents and their metadata.

4. CONCLUSION
Existing processes and tools for trend mining and technology watch are often implemented only rudimentarily, especially in SMEs. We have presented a semantic wiki for storing and displaying structured information about a specific domain (industrial technology field) and for generating a context-aware semantic search query. With the newly proposed ranking scheme, the more relevant and potentially novel information a document contains, the higher it is ranked and, hence, the more likely it is to be worth reading and using for ontology population. Due to the use of structured information and appropriate background data, trend mining can be turned into a semi-automatic process with better search and monitoring capabilities.

5. REFERENCES
[1] Evgeniy Gabrilovich, Susan Dumais, and Eric Horvitz. Newsjunkie: providing personalized newsfeeds via analysis of information novelty. In Proceedings of the 13th international conference on World Wide Web, WWW ’04, pages 482–490, New York, NY, USA, 2004. ACM.
[2] Ian Soboroff and Donna Harman. Novelty detection: the TREC experience. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT ’05, pages 105–112, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics.

¹ http://semantic-mediawiki.org/
² One of these tools is the wikify service of the Wikipedia Miner (http://wikipedia-miner.cms.waikato.ac.nz/), which we adapt by using the content of our domain-specific semantic wiki. In order to also detect new entities, property values, and relationships, we use GATE (http://gate.ac.uk), a well-established rule-based framework.
³ Industry partners within the German research project syncTech (http://synctech-innovation.de).
⁴ http://worldwide.espacenet.com

DIR 2013, April 26, 2013, Delft, The Netherlands.
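
As an illustration of the weighted graph comparison and the implicit feedback described in Section 3, the following Python sketch scores a document entity graph against a query graph by summing the schema weights of the relations they share, and raises a relation's weight when the user imports an item via that relation. Representing graphs as sets of (subject, relation, object) triples, the additive update, the learning rate, and the relation names hasOperand and hasOperation are all assumptions made for this example; the paper does not specify these details.

```python
from collections import defaultdict


class SchemaWeights:
    """Per-relation weights of the KB schema graph (assumed default: 1.0)."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = defaultdict(lambda: 1.0)

    def reinforce(self, relation):
        """Implicit feedback: the user imported an item via this relation."""
        self.weights[relation] += self.learning_rate


def score(query_graph, doc_graph, schema):
    """Sum the weights of the relations whose triples occur in both graphs."""
    return sum(schema.weights[rel] for (_, rel, _) in query_graph & doc_graph)


# Hypothetical triples modelled on the wiki page shown in Figure 2.
query_graph = {
    ("Li-ion battery", "hasOperand", "Energy"),
    ("Li-ion battery", "hasOperation", "Storage"),
}
doc_a = {("Li-ion battery", "hasOperand", "Energy")}
doc_b = {("Li-ion battery", "hasOperation", "Storage")}

schema = SchemaWeights()
before = (score(query_graph, doc_a, schema), score(query_graph, doc_b, schema))

# The user imports a value found via hasOperation, so that relation gains weight.
schema.reinforce("hasOperation")
after = (score(query_graph, doc_a, schema), score(query_graph, doc_b, schema))
```

Before the feedback both documents score the same; afterwards the document matching the reinforced relation ranks higher, which mirrors the idea that relations the user acts on should count more in later search sessions.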