<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Applying Multidimensional Navigation and Explanation in Semantic Dataset Summarization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>James R. Michaelis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Deborah L. McGuinness</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cynthia Chang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joanne S. Luciano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>James Hendler</string-name>
          <email>hendler@cs.rpi.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Tetherless World Constellation, Rensselaer Polytechnic Institute</institution>
          ,
          <addr-line>110 8th Street, Troy, NY 12180</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>A key objective of multidimensional dataset analysis is to reveal patterns of interest to users, but such analysis can be difficult to conduct because large datasets are challenging both to present and to navigate. This work explores how initial summarizations of multidimensional datasets can be generated (designed to reduce the number of data points that need to be displayed), using summarization policies based on provided dataset values. Additionally, functionality for explaining the derivation of summarizations is being designed in line with prior work on aiding analyst interactions with data processing systems. To help drive development of this work, as well as provide illustrative use cases, we are presently designing a dataset summarization generator as part of a larger effort to build an infrastructure for managing evidence of technical emergence in varying research disciplines via automated review of published materials.</p>
      </abstract>
      <kwd-group>
        <kwd>OLAP</kwd>
        <kwd>Explanation</kwd>
        <kwd>Provenance</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        A key objective of multidimensional dataset analysis is to reveal patterns of
interest to analysts. In many cases, these analyses involve navigating over
a dataset to expose content likely to contain interesting patterns. However,
multidimensional analysis has been observed to be challenging for analysts for the
following reasons [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]:
1. They may be overwhelmed by a data space (such as an evidence set) if it is too large.
2. They may not have the time or expertise to perform extensive navigation.
      </p>
      <p>
        This work explores how initial summarizations of multidimensional datasets
can be generated for consuming parties (designed to reduce the number of data
points that need to be displayed), driven by summarization policies
based on provided dataset values. Focus has been given to RDF-based dataset
encodings, due largely to RDF's flexibility in linking to outside data sources
(e.g., ontologies for expressing possible data values). Finally, functionality for
explaining the derivation of summarizations is being developed, in line with
prior work on aiding analyst interactions with data processing systems [
        <xref ref-type="bibr" rid="ref2 ref4">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Evidence Summarization in the ARBITER System</title>
      <p>
        To help drive development of this work, as well as provide illustrative use cases,
we are presently developing a dataset summarization generator for the Abductive
Reasoning Based on Indicators and Topics of EmeRgence (ARBITER) system,
which is being jointly developed by Rensselaer, BAE Systems, NYU, Brandeis and 1790
Analytics as part of IARPA's Foresight and Understanding from Scientific
Exposition (FUSE) program. ARBITER's design objective is to scan for signs of
technical emergence in published literature, where technical emergence is
defined in the FUSE program as [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]: the process by which research domains appear,
mature, and if conditions are favorable, make a significant impact.
      </p>
      <p>In ARBITER, sets of one or more evidence entries are evaluated to make
hypotheses about emergence-related questions for a given topic and time period.
For example: Has a practical application for DNA Microarrays been established
in the time period of 2006-2010, based on the document collection PubMed-42?</p>
      <p>In this setting, evidence entries are defined as emergence indicators,
calculated based on analysis over document collections. Indicators are classified
according to an OWL ontology of indicator types, where each indicator is defined
to have at least one RDF type, as well as a set of numerical scoring metrics that
define the relationship of the evidence to a hypothesis. For brevity, an example is provided
with five indicators, each with a single RDF type and two numerical properties
(value and relevance to the question answer, where a higher value is better).</p>
      <p>
        Currently, these evidence entries are presented as a 2-dimensional
spreadsheet. To reduce the number of rows directly presented, policy-based
summarization techniques are being explored. These derive from established navigation
techniques in OLAP [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]: grouping rows into collection-based entries and
filtering table entries, each based on specified criteria. For this submission, the
following two summarization policies are provided for illustrative purposes:
1. Grouping: Group together entries that are SKOS subconcepts of the
"FunderCount" class.
2. Filtering: Remove entries with relevance scores below 0.55.
      </p>
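      <p>The two illustrative policies can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not ARBITER's implementation: the indicator labels, scores, and every concept name other than "FunderCount" are made up for illustration, and the SKOS hierarchy is reduced to a plain dictionary of broader links.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Indicator:
    """One evidence entry: a single RDF type plus two numeric scores."""
    label: str        # hypothetical identifier
    rdf_type: str
    value: float
    relevance: float


# Hypothetical skos:broader links (child concept -> parent concept).
# Only "FunderCount" is named in the text; every other name is made up.
BROADER: Dict[str, str] = {
    "IndustryFunderCount": "FunderCount",
    "GovernmentFunderCount": "FunderCount",
}

# Five made-up indicators, used purely for illustration.
INDICATORS: List[Indicator] = [
    Indicator("i1", "IndustryFunderCount", 0.8, 0.7),
    Indicator("i2", "GovernmentFunderCount", 0.6, 0.5),
    Indicator("i3", "CitationCount", 0.9, 0.9),
    Indicator("i4", "IndustryFunderCount", 0.4, 0.6),
    Indicator("i5", "AuthorCount", 0.3, 0.4),
]


def is_subconcept(concept: str, ancestor: str) -> bool:
    """Follow broader links upward from `concept` until `ancestor` or a root."""
    seen = set()
    current = BROADER.get(concept)
    while current is not None and current not in seen:
        if current == ancestor:
            return True
        seen.add(current)
        current = BROADER.get(current)
    return False


def group_subconcepts(entries: List[Indicator], ancestor: str) -> List[List[Indicator]]:
    """Grouping policy: merge all entries typed under `ancestor` into one row."""
    grouped = [e for e in entries if is_subconcept(e.rdf_type, ancestor)]
    rest = [[e] for e in entries if not is_subconcept(e.rdf_type, ancestor)]
    return ([grouped] if grouped else []) + rest


def filter_by_relevance(entries: List[Indicator], threshold: float) -> List[Indicator]:
    """Filtering policy: drop entries whose relevance is below `threshold`."""
    return [e for e in entries if e.relevance >= threshold]


GROUPED = group_subconcepts(INDICATORS, "FunderCount")
KEPT = filter_by_relevance(INDICATORS, 0.55)
```
]]></preformat>
      <p>With the made-up data above, grouping collapses the five rows into three (one collection row plus two singletons), and filtering at 0.55 keeps three of the five entries. Each policy is applied here to the raw entries independently.</p>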
      <p>Ultimately, the following system conditions are assumed: (i) A maximum
number of summary rows to appear in the presented summary will be specified;
(ii) A pre-defined collection of policies will be accessible by ARBITER,
along with a pre-defined ordering for their execution; (iii) Policies will be
sequentially applied to the evidence set until the number of rows no longer exceeds
the specified maximum, or all policies have been applied. Initially, an evidence dataset
D<sub>0</sub> will represent content directly generated by evidence gathering routines
in ARBITER. Each policy execution will yield a transformed dataset view
(D<sub>1</sub>, ..., D<sub>n</sub>), up until condition (iii) is satisfied.</p>
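      <p>Conditions (i) to (iii) describe a simple loop, which might look like the following. This is a sketch under stated assumptions: rows are plain dictionaries, the two policy functions and their ordering are hypothetical stand-ins for ARBITER's pre-defined policies, and the evidence values are made up.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Any, Callable, Dict, List, Sequence

Row = Dict[str, Any]
Policy = Callable[[List[Row]], List[Row]]

# Hypothetical subconcepts of "FunderCount"; the names are made up.
FUNDER_TYPES = {"IndustryFunderCount", "GovernmentFunderCount"}


def drop_low_relevance(rows: List[Row]) -> List[Row]:
    """Filtering policy: remove entries with relevance below 0.55.

    Grouped rows (those carrying "members") have no score of their own
    in this sketch, so they are passed through unchanged.
    """
    return [r for r in rows if "members" in r or r["relevance"] >= 0.55]


def group_funder_counts(rows: List[Row]) -> List[Row]:
    """Grouping policy: merge FunderCount subconcept rows into one row."""
    members = [r for r in rows if r.get("type") in FUNDER_TYPES]
    if len(members) < 2:
        return list(rows)  # nothing to merge
    others = [r for r in rows if r.get("type") not in FUNDER_TYPES]
    group = {"label": "FunderCount (grouped)", "type": "FunderCount",
             "members": members}
    return [group] + others


def summarize(d0: Sequence[Row], policies: Sequence[Policy],
              max_rows: int) -> List[List[Row]]:
    """Return the views [D0, D1, ..., Dn] produced by applying policies in order.

    Stops once the latest view has at most `max_rows` rows, or when
    every policy has been applied.
    """
    views = [list(d0)]
    for policy in policies:
        if len(views[-1]) <= max_rows:
            break
        views.append(policy(views[-1]))
    return views


# Made-up evidence set D0 and an assumed execution order.
D0: List[Row] = [
    {"label": "i1", "type": "IndustryFunderCount", "relevance": 0.7},
    {"label": "i2", "type": "GovernmentFunderCount", "relevance": 0.5},
    {"label": "i3", "type": "CitationCount", "relevance": 0.9},
    {"label": "i4", "type": "IndustryFunderCount", "relevance": 0.6},
    {"label": "i5", "type": "AuthorCount", "relevance": 0.4},
]
POLICIES: List[Policy] = [drop_low_relevance, group_funder_counts]

VIEWS = summarize(D0, POLICIES, max_rows=2)
```
]]></preformat>
      <p>Keeping every intermediate view, rather than only the final one, is a design choice of this sketch: it leaves the sequence of transformations available for later inspection.</p>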
      <p>While initial summarization can be a powerful aid for analyst users, care has
to be taken in its use, since one summarization strategy may not be
appropriate for all users and information-seeking tasks. To help analysts keep track
of applied strategies, summaries will be accompanied by explanations of their
derivation, accessible for individual entries. In Figure 2, an example summary
view, along with a supporting explanation, is provided.</p>
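      <p>One way to make a derivation explainable per entry is to have each summary row carry the original entries it came from and the policy steps that produced it. The sketch below illustrates that idea only; the row structure, the function names, and the explanation wording are assumptions and do not reflect ARBITER's actual explanation format.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class SummaryRow:
    """A summary row plus the trace needed to explain how it was derived."""
    label: str
    sources: Tuple[str, ...]                        # labels of original evidence entries
    steps: List[str] = field(default_factory=list)  # policy descriptions, in order


def group(rows: List[SummaryRow], member_labels: Set[str],
          new_label: str, reason: str) -> List[SummaryRow]:
    """Merge the named rows into one row and record why they were merged."""
    members = [r for r in rows if r.label in member_labels]
    others = [r for r in rows if r.label not in member_labels]
    if not members:
        return list(rows)
    # Carry over earlier steps from the merged rows, without duplicates.
    prior: List[str] = []
    for r in members:
        for step in r.steps:
            if step not in prior:
                prior.append(step)
    merged = SummaryRow(
        label=new_label,
        sources=tuple(s for r in members for s in r.sources),
        steps=prior + [reason],
    )
    return [merged] + others


def explain(row: SummaryRow) -> str:
    """Render a plain-text explanation for a single summary row."""
    lines = [f"Row '{row.label}' derives from: {', '.join(row.sources)}"]
    for i, step in enumerate(row.steps, start=1):
        lines.append(f"  {i}. {step}")
    return "\n".join(lines)


# Hypothetical starting rows, one per original evidence entry.
ROWS = [SummaryRow("i1", ("i1",)), SummaryRow("i2", ("i2",)),
        SummaryRow("i3", ("i3",))]
SUMMARY = group(ROWS, {"i1", "i2"}, "FunderCount group",
                "Grouping: SKOS subconcepts of FunderCount")
```
]]></preformat>
      <p>Calling explain on the first row of SUMMARY lists the two source entries and the single grouping step, while an untouched row reports only its own source.</p>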
      <p>System Development: ARBITER's summary generator is being designed
to take three inputs: (i) A set of fine-grained evidence; (ii) A set of
SPARQL-encoded preference policies, along with an accompanying execution order; and
(iii) Corresponding ontologies for encoding the preference and evidence data. For
encoding evidence, we are now exploring use of the RDF Data Cube vocabulary,
given its support for representing multidimensional data.</p>
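      <p>As a rough illustration of what a SPARQL-encoded policy could look like, the sketch below builds a query for the filtering policy (keep entries with relevance of at least 0.55). The namespace and the ex:relevance property are hypothetical placeholders; the actual ARBITER ontologies and policy encoding are not given here.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from textwrap import dedent

# Hypothetical namespace; the real ARBITER vocabulary is not specified here.
EX_NS = "http://example.org/arbiter#"


def filtering_policy_query(threshold: float) -> str:
    """Build a SPARQL SELECT that keeps indicators at or above `threshold`.

    `ex:relevance` is an assumed property name used only for illustration.
    """
    return dedent(f"""\
        PREFIX ex: <{EX_NS}>
        SELECT ?indicator ?relevance
        WHERE {{
          ?indicator ex:relevance ?relevance .
          FILTER (?relevance >= {threshold})
        }}
        """)


QUERY = filtering_policy_query(0.55)
```
]]></preformat>
      <p>The resulting string could be handed to any SPARQL 1.1 engine; which engine ARBITER uses is not stated.</p>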
      <p>
        Upcoming Directions: In upcoming work, focus will be given to the
following three issues: (i) selecting summarization policies that align with an
analyst's perceived preferences, (ii) enabling analysts to tweak applied strategies,
based on the summarization explanations provided, to generate new
summarizations, and (iii) enabling analysts to identify source documents used to create
evidence entries (similar to efforts discussed in [
        <xref ref-type="bibr" rid="ref2 ref4">2</xref>
        ]). For situations where
significant numbers of evidence entries are presented (e.g., over 100), all three issues
are expected to need addressing.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgements</title>
      <p>We would like to thank our collaborators at BAE Systems, Sean Stromsten,
Dan Hunter and Olga Babko-Malaya for their assistance in this work.
Support has been provided by the Intelligence Advanced Research Projects Activity
(IARPA) via Department of Interior National Business Center contract number
D11PC20154. The U.S. Government is authorized to reproduce and distribute
reprints for Governmental purposes notwithstanding any copyright annotation
thereon.</p>
      <p>Disclaimer: The views and conclusions contained herein are those of the
authors and should not be interpreted as necessarily representing the official
policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or
the U.S. Government.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Giacometti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Marcel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Negre</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <article-title>A framework for recommending OLAP queries</article-title>
          .
          <source>11th International Workshop on Data Warehousing and OLAP (DOLAP08)</source>
          ,
          <fpage>73</fpage>
          -
          <lpage>80</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Murdock</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinheiro da Silva</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welty</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ferrucci</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <article-title>Explaining conclusions from diverse knowledge sources</article-title>
          .
          <source>Proceedings of ISWC 2006</source>
          ,
          <fpage>861</fpage>
          -
          <lpage>872</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <article-title>Foresight and Understanding from Scientific Exposition (FUSE) Program - Broad Agency Announcement (BAA) [IARPA-BAA-10-06]</article-title>
          . Retrieved from: http://www.iarpa.gov/solicitations_fuse.html. Date Last Accessed: 07/28/2012.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          RDF Data Cube Vocabulary: http://www.w3.org/TR/vocab-data-cube/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>