<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic Concept Discovery Over Event Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oktie Hassanzadeh</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shari Trewin</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alfio Gliozzo</string-name>
        </contrib>
        <aff>IBM Research</aff>
      </contrib-group>
      <abstract>
        <p>Preparing a comprehensive, accurate, and unbiased report on a given topic or question is a challenging task. The first step is often a daunting discovery task that requires searching through an overwhelming number of information sources without introducing bias from the analyst's current knowledge or from limitations of the information sources. A common requirement for many analysis reports is a deep understanding of various kinds of historical and ongoing events that are reported in the media. To enable better analysis based on events, several event databases provide structured representations of events extracted from news articles. Examples include GDELT [4], ICEWS [1], and EventRegistry [3]. These event databases have been used successfully for various kinds of analysis tasks, e.g., forecasting societal events [6]. However, there has been little work on the discovery aspect of the analysis, which results in a gap between the information requirements and the available data, and potentially in a biased view of the available information.</p>
        <p>In this presentation, we describe a framework for concept discovery over event databases using semantic technologies. Unlike existing concept discovery solutions that perform discovery over text documents and in isolation from the remaining data analysis tasks [5, 8], our goal is to provide a unified solution that allows deep understanding of the same data that will be used to perform other analysis tasks (e.g., hypothesis generation [7] or building models for forecasting [2]).</p>
        <p>Figure 1 shows the architecture of our system. The system takes as input a set of event databases and RDF knowledge bases, and provides as output a set of APIs that offer a unified retrieval mechanism over the input data and knowledge bases, along with an interface to a number of concept discovery algorithms. Figure 2 shows different portions of our system's UI, which is built using our concept discovery framework APIs. The analyst can enter a natural language question or a set of concepts, and retrieve collections of relevant concepts identified and ranked using different concept discovery algorithms.</p>
        <p>A key aspect of our framework is the use of semantic technologies. In particular:
          <list list-type="bullet">
            <list-item><p>A unified view over multiple event databases and a background RDF knowledge base is achieved through semantic link discovery and annotation.</p></list-item>
            <list-item><p>Natural language or keyword query understanding is performed by mapping input terms to the concepts in the background knowledge base.</p></list-item>
            <list-item><p>Concept discovery and ranking are performed through neural-network-based semantic term embeddings.</p></list-item>
          </list>
        </p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <fig id="fig-1">
        <label>Figure 1</label>
        <caption>
          <p>System architecture. Labels recovered from the figure: Knowledge Sources (DBpedia, Wikidata, …); Event Databases (GDELT Events and GKG, ICEWS, EventRegistry); Ingestion: Crawl, Parse, Clean/Filter, Store; Curation: Pre-process, Match, Index; SolrCloud; Event Knowledge Graph &amp; Concept Discovery APIs.</p>
        </caption>
      </fig>
      <p>We will present the results of our detailed evaluation of our proposed concept discovery techniques. We prepared a ground truth from reports on specific topics written by human experts, including reports from the Human Rights Watch organization, and Wikipedia pages on people and events. The ground truth queries included hand-built test queries on various topics, and an automatically generated set of queries based on the title of the reports. Given only these query terms, we measure the ability of different algorithms to find the concepts mentioned in the original reports. Our study finds that combining our neural network based semantic term embeddings over structured data with an index-based method can significantly outperform either method alone.</p>
    </sec>
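    <sec id="sec-12">
      <p>The combination of embedding-based and index-based ranking mentioned in the evaluation can be illustrated with a small sketch. This is not the system's implementation: the source does not state how the two methods are combined, so the linear interpolation, the min-max normalization, the weight <monospace>alpha</monospace>, the function names, and the toy data below are all assumptions made for illustration.</p>
      <preformat>
```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def min_max(scores):
    """Rescale a {concept: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {c: (s - lo) / span if span else 0.0 for c, s in scores.items()}


def combine(query_vec, embeddings, index_scores, alpha=0.5):
    """Rank concepts by a weighted sum of normalized embedding and index scores.

    Assumption: simple linear interpolation; the source does not specify the method.
    """
    emb = min_max({c: cosine(query_vec, v) for c, v in embeddings.items()})
    idx = min_max(index_scores)
    fused = {
        c: alpha * emb.get(c, 0.0) + (1 - alpha) * idx.get(c, 0.0)
        for c in set(emb) | set(idx)
    }
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(fused, key=lambda c: (-fused[c], c))


# Toy data (hypothetical): 2-d term embeddings and index relevance scores.
query_vec = [1.0, 0.0]
embeddings = {"protest": [1.0, 0.0], "election": [0.0, 1.0], "strike": [1.0, 1.0]}
index_scores = {"protest": 2.0, "election": 8.0, "strike": 5.0}

print(combine(query_vec, embeddings, index_scores))
```
      </preformat>
      <p>In this sketch, setting <monospace>alpha</monospace> to 1.0 ranks by embedding similarity alone and 0.0 ranks by the index score alone, so the three configurations can be compared against the same ground truth.</p>
    </sec>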
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Boschee</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lautenschlager</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Brien</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shellman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Starz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ward</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <source>ICEWS Coded Event Data</source>
          (
          <year>2017</year>
          ), http://dx.doi.org/10.7910/DVN/28075
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Korkmaz</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cadena</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuhlman</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marathe</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vullikanti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramakrishnan</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Combining heterogeneous data sources for civil unrest forecasting</article-title>
          .
          <source>In: Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015</source>
          . pp.
          <fpage>258</fpage>
          –
          <lpage>265</lpage>
          . ASONAM '15 (
          <year>2015</year>
          ), http://doi.acm.org/10.1145/2808797.2808847
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Leban</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fortuna</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brank</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grobelnik</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Event Registry: Learning About World Events from News</article-title>
          .
          <source>In: Proceedings of the 23rd International Conference on World Wide Web</source>
          . pp.
          <fpage>107</fpage>
          –
          <lpage>110</lpage>
          . WWW '14 Companion (
          <year>2014</year>
          ), http://doi.acm.org/10.1145/2567948.2577024
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Leetaru</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schrodt</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          :
          <article-title>GDELT: Global data on events, location, and tone, 1979–2012</article-title>
          .
          <source>In: ISA Annual Convention</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pantel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Concept Discovery from Text</article-title>
          .
          <source>In: Proceedings of the 19th International Conference on Computational Linguistics - Volume 1</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>7</lpage>
          . COLING '02
          (
          <year>2002</year>
          ), http://dx.doi.org/10.3115/1072228.1072372
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Muthiah</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Butler</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khandpur</surname>
            ,
            <given-names>R.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saraf</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Self</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rozovskaya</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cadena</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>C.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vullikanti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marathe</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Summers</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katz</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doyle</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arredondo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>D.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mares</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramakrishnan</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Embers at 4 years: Experiences operating an open source indicators forecasting system</article-title>
          .
          <source>In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          . pp.
          <fpage>205</fpage>
          –
          <lpage>214</lpage>
          . KDD '16 (
          <year>2016</year>
          ), http://doi.acm.org/10.1145/2939672.2939709
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Sohrabi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Udrea</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riabov</surname>
            ,
            <given-names>A.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hassanzadeh</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Interactive Planning-Based Hypothesis Generation with LTS++</article-title>
          .
          <source>In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016</source>
          . pp.
          <fpage>4268</fpage>
          –
          <lpage>4269</lpage>
          (
          <year>2016</year>
          ), http://www.ijcai.org/Abstract/16/654
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>A.H.</given-names>
          </string-name>
          :
          <article-title>Text Mining: The state of the art and the challenges</article-title>
          .
          <source>In: Proceedings of the PAKDD 1999 Workshop on Knowledge Discovery from Advanced Databases</source>
          . pp.
          <fpage>65</fpage>
          –
          <lpage>70</lpage>
          (
          <year>1999</year>
          ), http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.132.6973
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>