<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A semantic wiki for novelty search on documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Färber</string-name>
          <email>michael.faerber@kit.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Achim Rettinger</string-name>
          <email>rettinger@kit.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Karlsruhe Institute of Technology (KIT), Institute AIFB</institution>
          ,
          <addr-line>76131 Karlsruhe</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Technology-oriented companies are typically interested in monitoring developments concerning their technologies. However, most companies, especially SMEs, do not have an efficient process for achieving this. If such efforts exist at all, they are mostly limited to uncoordinated keyword queries on web resources. Here, we present a semi-automatic approach that allows for structured and continuous detection of relevant, novel, and domain-specific documents appearing on the Web. Our system is based on a semantic wiki where the domain expert is able (i) to store all relevant information in an adequate knowledge base with the ability for monitoring and trend mining and (ii) to import detected novel items such as future technologies and their properties into the knowledge base in a continuous fashion. The latter is achieved by generating a structured query based on the user context and by representing found documents as semantic graphs. In this way, novel items can be found more easily and in a semi-automatic fashion.</p>
      </abstract>
      <kwd-group>
        <kwd>semantic wiki</kwd>
        <kwd>novelty detection</kwd>
        <kwd>document ranking</kwd>
        <kwd>ontology-supported information extraction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. MOTIVATION</title>
      <p>Technology forecasting and trend detection are indispensable
tasks for technology companies that want to stay informed
about market developments and inventions in their fields.
With the advent of more and more documents on the Web,
companies face the task of extracting relevant and novel
information for this purpose. Currently, this is usually
done purely manually and without any structured
background data, making it a very time-consuming task.
Therefore, we provide a semi-automatic process for trend
detection and monitoring services. We present a semantic
wiki-based application that builds on ontology-based
information extraction (OBIE), where ontologies are used
within the information extraction (IE) process. Since
appropriate ontologies regarding technologies and
their properties are usually missing or too small, we focus our
work on the crucial task of how to efficiently find new textual
information that is relevant to the domain expert but has
not yet been stored in the knowledge base (KB) and has,
therefore, not yet been made usable.</p>
      <p>This work is supported by the German Federal Ministry
of Education and Research (BMBF) under grant 02PJ1002
(SyncTech).</p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>
        Within the TREC "novelty track" in 2002–2004 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], systems
for detecting novelty were designed. However, the task
was carried out at the sentence level, was limited to event and
opinion detection, and targeted non-domain-specific
texts such as news. Newsjunkie [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is also geared to
detecting novelty by comparing a new document against an
existing document collection. In contrast to such systems,
we work with domain-specific documents like technical reports
and patents, and therefore do not have to deal with the
problem of analysing huge amounts of articles in a very
short time period, known as a "burst of novelty". Instead
of purely statistical measures, our approach is based on
semantic technologies.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. DOCUMENT RANKING AND ONTOLOGY POPULATION</title>
      <p>Figure 1 gives an overview of the interplay between an
ontology and documents with potentially novel information:
Given our own KB with instances and schema, our goal
is to search for documents and to rank them, so that the
documents most novel to the KB and relevant to both the
query and the KB have the highest ranking. In a second step
the user is able to import phrases marked in the document
into his/her KB as property values.</p>
      <p>Concerning the first part, Semantic MediaWiki<sup>1</sup>, as an
instance of a semantic wiki, is assigned the central role:
The user is able to create new wiki pages (within the
semi-automatic process or just manually) and to add
appropriate properties with the help of a class-specific form
(see Figure 2). Internally, all data is stored in a structured
way. The wiki allows the user to create a search query
from the context by taking instances and property values
(from the KB) as well as search keywords written by the
user. After optionally expanding the query graph with
neighbouring entities, we can generate the final query graph.
Since all documents are annotated with the help of named
entity recognition tools<sup>2</sup>, we can compare the generated
query graph with all document entity graphs (generated
from extracted named entities). Ranking of the documents is
facilitated by weights that are assigned to every relation
in the KB schema graph. We can use implicit user feedback
in the following way: If a user imports some novel item as
a new property or instance, the weights in the KB schema
graph are adapted. In this way, we can take into account the
user's personal view of which relationships between certain classes
and properties (or other classes) are of great significance
and should be reinforced for subsequent search sessions.</p>
      <p>[Figure 1: overview diagram. Recoverable labels: WWW search, document ranking, auto annotation, fact extraction, ontology population, GUI-1, GUI-2; example wiki page "Lithium-ion battery" with the tabs Principle, Assessment, and Sources/Contact.]</p>
      <p>Our use cases are determined by our use case
partners<sup>3</sup>, which are medium-sized technology companies.
Hence, the lightweight ontologies we used consist of
classes like technology, institution, and product. As
the document corpus, we consider web documents retrieved by
search engine requests. In addition, trend detection in
conjunction with patents can be enabled by using the patent
database Espacenet<sup>4</sup>, which provides access to over 70 million
patent documents and their metadata.</p>
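      <p>The ranking and feedback steps described above can be illustrated with a minimal sketch. It is not the implementation used in the system; it rests on several assumptions made only for illustration: graphs are represented as sets of (subject, relation, object) triples, a document's score is the sum of the relation weights of all triples that touch a query entity and are not yet in the KB, and implicit feedback multiplies the weight of the imported relation by a hypothetical learning rate. All names (score_document, rank_documents, reinforce), the relation labels, and the sample data are invented for this example.</p>
      <preformat>
```python
# Illustrative sketch only: the scoring rule, the update rule, and all
# names and sample data below are assumptions, not the paper's method.

Triple = tuple[str, str, str]  # (subject, relation, object)


def score_document(
    doc_triples: set[Triple],
    query_entities: set[str],
    kb_triples: set[Triple],
    weights: dict[str, float],
    default_weight: float = 1.0,
) -> float:
    """Sum the relation weights of triples that are relevant and novel."""
    score = 0.0
    for subject, relation, obj in doc_triples:
        # Relevant: the triple mentions at least one query entity.
        relevant = subject in query_entities or obj in query_entities
        # Novel: the exact triple is not stored in the KB yet.
        novel = (subject, relation, obj) not in kb_triples
        if relevant and novel:
            score += weights.get(relation, default_weight)
    return score


def rank_documents(
    docs: dict[str, set[Triple]],
    query_entities: set[str],
    kb_triples: set[Triple],
    weights: dict[str, float],
) -> list[str]:
    """Return document names by descending score (ties broken by name)."""
    scored = [
        (score_document(triples, query_entities, kb_triples, weights), name)
        for name, triples in docs.items()
    ]
    return [name for _, name in sorted(scored, key=lambda p: (-p[0], p[1]))]


def reinforce(
    weights: dict[str, float], relation: str, rate: float = 0.1
) -> dict[str, float]:
    """Implicit feedback: return a copy with one relation weight increased."""
    updated = dict(weights)
    updated[relation] = updated.get(relation, 1.0) * (1.0 + rate)
    return updated


# Hypothetical KB content, schema weights, query context, and documents.
kb_triples = {("Lithium-ion battery", "hasApplication", "Automotive")}
weights = {"hasApplication": 1.0, "hasProperty": 2.0, "developedBy": 1.0}
query_entities = {"Lithium-ion battery"}
docs = {
    "doc_a": {("Lithium-ion battery", "hasApplication", "Automotive")},
    "doc_b": {
        ("Lithium-ion battery", "hasProperty", "High energy density"),
        ("Solar cell", "hasProperty", "Flexible"),
    },
    "doc_c": {("Lithium-ion battery", "developedBy", "Institute X")},
}

ranking_before = rank_documents(docs, query_entities, kb_triples, weights)

# The user imports a "developedBy" fact, so that relation is reinforced.
new_weights = reinforce(weights, "developedBy", rate=1.5)
ranking_after = rank_documents(docs, query_entities, kb_triples, new_weights)

print(ranking_before)
print(ranking_after)
```
      </preformat>
      <p>In this sketch, a document that only repeats facts already stored in the KB receives a score of zero, and reinforcing a relation after an import moves documents containing that relation upwards in later rankings. A real system would additionally need entity matching between document annotations and KB instances, which is omitted here.</p>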
    </sec>
    <sec id="sec-5">
      <title>4. CONCLUSION</title>
      <p>Existing processes and tools for trend mining and technology
watch are often implemented only in rudimentary form, especially
in SMEs. We have presented a semantic wiki for storing
and displaying structured information about a specific
domain (industrial technology field) and for generating a
context-aware semantic search query. With the help of
a newly proposed ranking scheme, the more relevant and
potentially novel information a document contains, the
higher it is ranked and, hence, the more likely it is to be worth
reading and using for ontology population. Due to the use
of structured information and appropriate background data,
trend mining can be turned into a
semi-automatic process with better search and monitoring
capabilities.</p>
      <p>[Figure 2: Screenshots of a Semantic MediaWiki: (a) displaying technology property values within a wiki page; (b) edit functionality using a form. The example page describes the technology "Lithium-ion battery".]</p>
      <p><sup>2</sup> One of these tools is the wikify service of the Wikipedia
Miner (http://wikipedia-miner.cms.waikato.ac.nz/), which
we adapt by using the content of our domain-specific
semantic wiki. To also detect new
entities, property values, and relationships, we use
GATE (http://gate.ac.uk), a well-established rule-based
framework.</p>
      <p><sup>3</sup> Industry partners within the German research project
syncTech (http://synctech-innovation.de).</p>
      <p><sup>4</sup> http://worldwide.espacenet.com</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Evgeniy</given-names>
            <surname>Gabrilovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Susan</given-names>
            <surname>Dumais</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Eric</given-names>
            <surname>Horvitz</surname>
          </string-name>
          .
          <article-title>Newsjunkie: providing personalized newsfeeds via analysis of information novelty</article-title>
          . In
          <source>Proceedings of the 13th international conference on World Wide Web, WWW '04</source>
          , pages
          <fpage>482</fpage>
          –
          <lpage>490</lpage>
          , New York, NY, USA,
          <year>2004</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Ian</given-names>
            <surname>Soboroff</surname>
          </string-name>
          and
          <string-name>
            <given-names>Donna</given-names>
            <surname>Harman</surname>
          </string-name>
          .
          <article-title>Novelty detection: the TREC experience</article-title>
          . In
          <source>Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05</source>
          , pages
          <fpage>105</fpage>
          –
          <lpage>112</lpage>
          , Stroudsburg, PA, USA,
          <year>2005</year>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>