<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Where did you hear that? Information and the Sources They Come From</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>James P. McCusker</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Timothy Lebo</string-name>
          <email>lebot@rpi.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Li Ding</string-name>
          <email>dingl@cs.rpi.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cynthia Chang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paulo Pinheiro da Silva</string-name>
          <email>paulo@utep.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Deborah L. McGuinness</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CyberShARE Center, University of Texas at El Paso</institution>
          ,
          <addr-line>500 W University Ave, El Paso TX 79968</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Tetherless World Constellation, Rensselaer Polytechnic Institute</institution>
          ,
          <addr-line>110 8th St., Troy, NY 12180</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>One current challenge in linked science is to adequately describe where a piece of information in the linked science cloud came from. Provenance models, such as the Proof Markup Language (PML), have developed methods for expressing simple relationships between information and its sources. We argue that representing where information comes from is central to trusting linked data in scientific applications. We introduce a model of an information source and of the usage of that source to obtain information by describing the Proof Markup Language's notion of source usage, and we show how this relationship can be modeled in a library science schema, Functional Requirements for Bibliographic Records (FRBR). We discuss why these kinds of representations are critical to provenance models.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Before publishing a result, scientists need to check their facts. We stand on the
shoulders of giants, but as we push forward in science, we need to make sure that
we aren't standing on a giant house of cards. Knowing how, when, and where your
data came from is critical for good science, and it's even more critical for linked
science, where it isn't immediately clear where a database record or knowledge
assertion came from. Sources of information are therefore critical to evaluating
information quality. It is difficult, if not impossible, to assess the trustworthiness of
information, or to encode it as knowledge, without a link between the information and
its sources. For example, one may want to know if the information came from
a source such as the New York Times, and further, it may be useful to know the
date, edition, page, and exact text fragment where the information was asserted.</p>
      <p>There are many challenges in the task of assigning a source to a piece of
information. First, it may not be easy to characterize the piece of information
within larger information containers (databases, printed documents, web documents,
documents that require parameters from a system to be retrieved, etc.). Second,
the source of a piece of information is often a source of other pieces of information
and should be referenced by an identifier and characterized elsewhere. Third,
asserting a piece of information is a point-in-time event that occurs during
the lifetime of the information source. Thus, not every assertion event is
valid: it must occur during the lifespan of its source(s) and in places
where the sources are located. These are all critical conditions that need to be
properly captured in provenance languages if one wants to make proper use of
source information in combination with linked data.</p>
    </sec>
    <sec id="sec-2">
      <title>Current Implementation: Proof Markup Language</title>
      <p>
        The Proof Markup Language (PML) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] evolved to include language constructs
to handle use cases such as those encountered in many text analytic settings [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
where components, such as entity extractors, review natural language text and
infer structured assertions from it. To maintain provenance, PML
needed to capture the source that was accessed, in this case by the text analytic
component, and the information that was obtained or inferred. Further, in many
cases it was important to be able to encode the particular fragment of the source
that was used. The notion of using a source to obtain information is captured
in PML's SourceUsage class, which records the event of information
assertion; this event also serves as the general case for information retrieval
and information extraction. Sources can be as fine-grained as particular regions
of text or data files, or text fragments, or as broad as entire online databases. The
raw data that was received from the Source is attached to Information using the
property hasRawString. Using this, it is possible to determine whether two pieces of
Information came from the same Source, and whether they were derived
from the same data fragment.
      </p>
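      <p>The relationship between Source, SourceUsage, and Information described above can be illustrated with a minimal Python sketch. This is not PML itself: the class and field names are hypothetical stand-ins for the PML concepts, and representing a fragment as a pair of character offsets is an assumption made only for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass(frozen=True)
class Source:
    """Hypothetical stand-in for pmlp:Source (a document, database, ...)."""
    identifier: str


@dataclass(frozen=True)
class SourceUsage:
    """The event of using a source, optionally narrowed to one fragment."""
    source: Source
    used_at: datetime
    # Assumption: a fragment is a (start, end) pair of character offsets.
    fragment: Optional[Tuple[int, int]] = None


@dataclass(frozen=True)
class Information:
    """A piece of information plus the raw data it was obtained from."""
    has_raw_string: str
    usage: SourceUsage


def same_source(a: Information, b: Information) -> bool:
    """True when both pieces of information were obtained from one source."""
    return a.usage.source == b.usage.source


def same_fragment(a: Information, b: Information) -> bool:
    """True when both come from the same fragment of the same source."""
    return (
        same_source(a, b)
        and a.usage.fragment is not None
        and a.usage.fragment == b.usage.fragment
    )


# Usage: two facts read from one fragment, a third from another fragment.
primer = Source("http://inference-web.org/2007/primer/")
when = datetime(2011, 1, 1)  # arbitrary illustrative timestamp
intro = SourceUsage(primer, when, fragment=(0, 75))
later = SourceUsage(primer, when, fragment=(100, 120))
fact_a = Information("raw text A", intro)
fact_b = Information("raw text A", intro)
fact_c = Information("raw text C", later)
```
]]></preformat>
      <p>In this sketch, fact_a and fact_c share a source but not a fragment, which is the distinction that SourceUsage is meant to preserve.</p>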
      <p>
        An example of this representation can be seen in Figure 1<sup>1</sup>. The top level
concept of a NodeSet encodes the support for a particular piece
of information that can be viewed as the conclusion of some inference step. That
inference step can be as simple as a told assertion, or it could be an inference that uses
some antecedents and results in a conclusion. One type of inference is
the usage of a document source to obtain a piece of information. Figure 1 shows
a particular usage of a document (the PML primer) and includes an encoding of
the time it was used and the fragment of the text used. The inference step's use of
a SourceUsage identifies the date, time, location, and source of the
assertion. The Source is a pmlp:Source concept that can be used to represent
things like publications, documents, websites, datasets, people, organizations,
etc. Additional examples are available in Murdock et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Welty et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>The PML classes for Source, SourceUsage, and Information are empirically
derived; that is, they are responses to a set of use cases that required tracking
where information came from and how it was used. This model was effective
at capturing the text analytic requirements from the Unstructured Information
Management Architecture components; however, a more general representation
may be beneficial to support additional use cases such as copying files,
transforming data from one format to another, and so on. Files on disk are considered
to be pmlp:Sources, which limits the ability to describe mechanical duplication,
as derivational provenance in PML is limited to pmlp:Information. Similarly,
transforming data from one file format to another results in two
pmlp:Information instances that, from one perspective, carry the same information
but, from another perspective, have completely different file
content, since the information is represented using a different file encoding.
Generalizing these sorts of relationships can allow faithful representation of
these operations and allow for extension and decomposition of concepts like "the
source of a piece of information".</p>
      <p><sup>1</sup> The RDF can be downloaded from
http://inference-web.org/proofs/csctest/iwppml-2.rdf</p>
      <p>
        Library science has spent significant time dealing with some kinds of
provenance in the realm of bibliographic resources. Functional Requirements for
Bibliographic Records (FRBR) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is designed to address issues of abstraction in
bibliographic resources. For instance, when we mention The Art of Computer
Programming, we could be referring to the work as a whole, a particular edition,
a particular rendering of that edition (electronic versus paper, for instance), or a
particular copy. FRBR separates these different levels into, respectively, Work,
Expression, Manifestation, and Item. Electronic information resources can be
similarly distinguished by using Item to refer to a particular copy, Manifestation
to refer to a specific bit image, Expression to refer to a fixed set of
information, and Work to refer to all versions of that information. Here we refer to
a set of interlinked Work, Expression, Manifestation, and Item instances as a FRBR
stack, in that it is a complete stack of instances representing a particular piece
of information at all levels of abstraction.
      </p>
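      <p>The four FRBR levels and the idea of a stack can be sketched in Python as follows. This is an illustrative data structure under our own naming, not an implementation of any FRBR vocabulary: the labels such as "edition A" are placeholders, and the helper shared_level is hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Work:
    """The abstract work, covering all versions of the information."""
    title: str


@dataclass(frozen=True)
class Expression:
    """A fixed set of information, such as one edition of a work."""
    work: Work
    label: str


@dataclass(frozen=True)
class Manifestation:
    """A particular rendering of an expression (paper, a specific bit image)."""
    expression: Expression
    label: str


@dataclass(frozen=True)
class Item:
    """A particular copy; following its links upward gives a full stack."""
    manifestation: Manifestation
    label: str


def shared_level(a: Item, b: Item) -> str:
    """Return the most specific FRBR level at which two items coincide."""
    if a == b:
        return "Item"
    if a.manifestation == b.manifestation:
        return "Manifestation"
    if a.manifestation.expression == b.manifestation.expression:
        return "Expression"
    if a.manifestation.expression.work == b.manifestation.expression.work:
        return "Work"
    return "none"


# Usage: placeholder editions and copies of one work.
taocp = Work("The Art of Computer Programming")
edition_a = Expression(taocp, "edition A")
edition_b = Expression(taocp, "edition B")
paper_a = Manifestation(edition_a, "paper")
ebook_a = Manifestation(edition_a, "electronic")
copy_1 = Item(paper_a, "copy 1")
copy_2 = Item(paper_a, "copy 2")
download = Item(ebook_a, "downloaded file")
copy_3 = Item(Manifestation(edition_b, "paper"), "copy 3")
```
]]></preformat>
      <p>Comparing two items by the level at which their stacks meet makes explicit what "the same book" means in a given context: the same copy, the same rendering, the same edition, or only the same work.</p>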
    </sec>
    <sec id="sec-3">
      <title>Mapping Source and Information into FRBR</title>
      <p>Describing how information is derived from a source is a little more complicated
in FRBR, but it is more generalizable and allows for finer shades of distinction.
We propose that, since pmlp:Source is considered to be an opaque, specific
resource, it should be a subclass of frbr:Item. pmlp:Information is actually a role
of a frbr:Work, frbr:Expression, or frbr:Manifestation. For instance, an
expression may not be information but may play the role of being information in the
context of an assertion. In Figure 2, two FRBR stacks show how a quote, "This
document provides a brief introduction to the Proof Markup Language (PML)",
from the PML Primer<sup>2</sup> is derived from a downloaded copy of the primer. In
the representation, we use the abstractive perspective so that the
physical movement of data and the transformation of information can be described using the same
derivational ontology. Conversely, what happened is
encoded directly in the relationships. For instance, the fact that the copyEvent
produced an identical copy is stated by the fact that the server copy and client
copy are exemplars of the same Manifestation, while the subset event produces a
quote simply because that stack is declared to be partOf the PML Primer stack.
The derivational ontology is not named here, but PML is an adequate candidate for
this task; its information source construction can be replaced by FRBR.</p>
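      <p>The copy and subset relationships described above can be sketched in Python. The sketch flattens each FRBR stack into four identifiers, which is a simplification of ours; the function names copy_event, subset_event, is_identical_copy, and derives_from, as well as the identifier strings, are hypothetical and are not part of PML or FRBR.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stack:
    """One FRBR stack, flattened to four identifiers (assumed encoding)."""
    work: str
    expression: str
    manifestation: str
    item: str
    part_of: Optional["Stack"] = None  # set when this stack is a fragment


def copy_event(original: Stack, new_item: str) -> Stack:
    """Mechanical copy: a new Item that exemplifies the same Manifestation."""
    return Stack(original.work, original.expression,
                 original.manifestation, new_item, original.part_of)


def subset_event(whole: Stack, name: str) -> Stack:
    """Extract a fragment: a new stack at every level, partOf the whole."""
    return Stack(f"{name}/work", f"{name}/expression",
                 f"{name}/manifestation", f"{name}/item", part_of=whole)


def is_identical_copy(a: Stack, b: Stack) -> bool:
    """Distinct items are identical copies if they share a Manifestation."""
    return a.item != b.item and a.manifestation == b.manifestation


def derives_from(fragment: Stack, source: Stack) -> bool:
    """Follow partOf links; any copy of the source's Manifestation counts."""
    current = fragment.part_of
    while current is not None:
        if current.manifestation == source.manifestation:
            return True
        current = current.part_of
    return False


# Usage: download the primer, then quote from the downloaded copy.
server_copy = Stack("primer", "primer/text", "primer/html", "server-copy")
client_copy = copy_event(server_copy, "client-copy")
quote = subset_event(client_copy, "quote")
```
]]></preformat>
      <p>Because the client copy shares a Manifestation with the server copy, the quote can be traced back to the server copy without recording any extra statement; the relationships themselves carry what happened.</p>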
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We believe that any serious model of provenance that supports linked science
must provide a mechanism for describing information sources and their usage.
This can be and has been achieved using the modeling primitives provided in
PML. By using the mapping to FRBR that we describe, we can also model
additional nuanced explanations of data access, transformation, and analysis;
generalized models of abstractive provenance provide the same opportunities. We show how
the link between information and source can be modeled using a combination
of FRBR and a derivational provenance model. This combination is powerful,
and allows for unambiguous descriptions of data and information access and
transformation. Finally, we argue that the abstractive dimension should be a
key component of any provenance model that attempts to deal with artifacts of
information or data.</p>
      <p><sup>2</sup> http://inference-web.org/2007/primer/</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>The Tetherless World Constellation is partially funded by DARPA, U.S.
Department of Energy, Fujitsu, LGS, Lockheed Martin, Microsoft Research, NASA,
National Ecological Observatory Network (NEON), the National Science
Foundation, Qualcomm, and the Woods Hole Oceanographic Institution (WHOI).
This research was partially funded by the National Science Foundation under
CREST Grant No. HRD-0734825.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinheiro da Silva</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>PML 2: A modular explanation interlingua</article-title>
          .
          <source>In: Proceedings of AAAI. Volume</source>
          <volume>7</volume>
          . (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Pinheiro da Silva</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fikes</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>A Proof Markup Language for Semantic Web Services</article-title>
          .
          <source>Information Systems</source>
          <volume>31</volume>
          (
          <issue>4-5</issue>
          ) (
          <year>2006</year>
          )
          <fpage>381</fpage>
          –
          <lpage>395</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Welty</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murdock</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silva</surname>
            ,
            <given-names>P.P.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferrucci</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fikes</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Tracking information extraction from intelligence documents</article-title>
          .
          <source>In: Proceedings of the 2005 International Conference on Intelligence Analysis (IA 2005)</source>
          . (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Murdock</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinheiro da Silva</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welty</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferrucci</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Explaining conclusions from diverse knowledge sources</article-title>
          .
          <source>The Semantic Web-ISWC 2006</source>
          (
          <year>2006</year>
          )
          <fpage>861</fpage>
          –
          <lpage>872</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>O'Neill</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>FRBR: Functional Requirements for Bibliographic Records</article-title>
          .
          <source>Library Resources &amp; Technical Services</source>
          <volume>46</volume>
          (
          <issue>4</issue>
          ) (
          <year>2002</year>
          )
          <fpage>150</fpage>
          –
          <lpage>159</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Wilkinson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vandervalk</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCarthy</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>SADI Semantic Web Services - 'cause you can't always GET what you want!</article-title>
          <source>In: Services Computing Conference, 2009. APSCC 2009. IEEE Asia-Pacific</source>
          ,
          <publisher-name>IEEE</publisher-name>
          (
          <year>2009</year>
          )
          <fpage>13</fpage>
          –
          <lpage>18</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>