<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Capturing Provenance Information in the File System</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lars Gleim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Databases and Information Systems, RWTH Aachen University</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fraunhofer FIT</institution>
          ,
          <addr-line>Sankt Augustin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Textile Technology, RWTH Aachen University</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Leon Müller</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>As business processes become increasingly complex, agile, and specialized, provenance information can improve the interpretability and contextualization of process data. While individual process steps frequently employ digital computer files, their relationships within the overall process are rarely captured. To address this issue, we extend the factFUSE system for managing versioned Web resources in the file system to capture provenance relationships. By introducing an extensible commit system, we enable recording the relations between digital files and resources in process steps (activities), which are then captured as RDF metadata using the W3C PROV standard. Our evaluation shows that users without prior experience in provenance management successfully employ the system to capture semantic process provenance, attesting to excellent usability and promising utility. factFUSE is available for practical use under the open-source GNU AGPLv3 license.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic Data Management</kwd>
        <kwd>Version Control System</kwd>
        <kwd>FAIR Data</kwd>
        <kwd>Desktop Computing</kwd>
        <kwd>FactDAG</kwd>
        <kwd>FactStack</kwd>
        <kwd>FUSE</kwd>
        <kwd>Linked Data Platform</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Contributions. Addressing these goals, we propose a platform-independent concept
for provenance capturing in the file system and provide a corresponding open-source
implementation for both Linux and macOS. We further provide an evaluation of the
system w.r.t. the design goals defined above. Sec. 2 introduces the main concept and
presents the implementation of an extensible commit system for factFUSE. Sec. 3
discusses its quantitative and qualitative evaluation. We conclude our work in Sec. 4.
</p>
    </sec>
    <sec id="sec-2">
      <title>Concept &amp; Realization</title>
      <p>
        To capture provenance information on computer files and to enable context embedding
in traditional data management environments, a bridge between the classic hierarchical
file system and semantic data management is fundamental. We build on the recently
proposed factFUSE [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] system for the joint management of computer files and semantic
data and metadata in the file system. The solution is based on FactStack [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], a unified
semantic data management system integrating RDF, arbitrary data types, and computer
files in a fundamentally provenance-linked knowledge graph through a combination
of open Web standards according to the FAIR principles [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. factFUSE maps Linked
Data Platform (LDP) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] resources into the local file system and vice versa through a
FUSE file system driver. This allows users to interact with semantic data in the same
way as with regular computer files and enables the drag-and-drop integration of files into
semantic graphs. Each resource is versioned using the HTTP Memento protocol and
augmented with a dedicated metadata record linked via the HTTP rel="describedby"
Link header as specified by the LDP standard [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
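      <p>As a minimal sketch of how a client might locate the metadata record mentioned above, the following Python snippet extracts the target of a rel="describedby" link from an HTTP Link header value. The function name, the example header, and the example.org IRIs are assumptions made for illustration and are not taken from the factFUSE code base. The parsing is deliberately simplified and does not handle commas inside quoted parameter values.</p>
      <preformat><![CDATA[
```python
import re
from typing import Optional


def find_describedby(link_header: str) -> Optional[str]:
    """Return the target IRI of the first rel="describedby" link, if any."""
    # Each link-value has the form: <target>; param="value"; ...
    pattern = r'<([^>]*)>((?:\s*;\s*[^;,]+)*)'
    for match in re.finditer(pattern, link_header):
        target, params = match.group(1), match.group(2)
        rel = re.search(r'rel\s*=\s*"?([^";]+)"?', params)
        # A rel parameter may carry several space-separated relation types.
        if rel and "describedby" in rel.group(1).split():
            return target
    return None


# Hypothetical header value, as an LDP server might return it for a file.
header = (
    '<http://www.w3.org/ns/ldp#Resource>; rel="type", '
    '<http://example.org/data/IMG1.jpg.meta>; rel="describedby"'
)
print(find_describedby(header))
```
]]></preformat>
      <p>The returned IRI could then be dereferenced to obtain the RDF metadata of the resource.</p>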
      <p>
        Concept. Building on the foundation provided by the factFUSE system, we
embed provenance management into the traditional workflow of managing files in the
file system by introducing (i) a commit system inspired by the distributed version control
system Git [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which may expose user interfaces for metadata collection during the
process of persisting resource modifications, and (ii) extensible Metadata Interfaces that
may be opened directly from the file system explorer (either through context menus or
by adding buttons to the operating system’s file system explorer itself) to manage and
display metadata of the selected resources, as illustrated in Fig. 1. Subsequently, the
collected provenance information is persisted to the resource’s metadata record to
track the changes and revisions that resources go through and to capture process knowledge.
      </p>
      <sec id="sec-2-12">
        <title>Metadata Generation LDP</title>
        <p>Fig. 1. The factFUSE concept, extended to capture computer file provenance through a commit
system and corresponding metadata interfaces in the file system. Figure labels: Launch Application; File System; Provenance Explorer; Link/Relation Display; Metadata Interfaces; Commit System; Commit Interface; Provenance Creation; Metadata Generation; LDP.</p>
        <p>[Figure labels: Mountpoint; RDF1.ttl (edited); IMG1.jpg; RDF2.ttl (deleted); commits C1 and C2; activity EditA with comment; agent lmueller. Edge labels: contains, used, revisionOf, generatedBy, associatedWith.]</p>
        <p>
          Realization. factFUSE is a Node.js application that provides a custom user-space
file system based on the FactStack [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] system, which provides primitives for handling
resource versioning, network communication, and metadata management. Additionally,
it provides helper functions for provenance management and preservation according
to the W3C PROV standard [
          <xref ref-type="bibr" rid="ref1 ref5">5,1</xref>
          ], which expresses data provenance through entities,
activities, and agents. factFUSE tracks changes made to resources in the file system in an
internal cache before asynchronously synchronizing them with the upstream LDP server.
To capture process provenance in the file system, we introduce an extensible commit
system.
        </p>
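        <p>The caching behaviour described above can be illustrated with a small Python sketch. It is not the factFUSE implementation, which is written for Node.js. The class ChangeCache, its methods, and the upload callable are hypothetical names chosen for this example. Local changes are recorded per path, and a later synchronization step pushes only the latest state of each path upstream.</p>
        <preformat><![CDATA[
```python
import asyncio


class ChangeCache:
    """Collects local changes and pushes them upstream in one async batch."""

    def __init__(self, upload):
        self._upload = upload  # hypothetical async callable: upload(path, content)
        self._dirty = {}       # path -> latest content; None marks a deletion

    def record(self, path, content):
        # Later changes to the same path overwrite earlier, unsynchronized ones.
        self._dirty[path] = content

    async def sync(self):
        # Swap the buffer first so changes recorded during the upload are kept.
        pending, self._dirty = self._dirty, {}
        await asyncio.gather(*(self._upload(p, c) for p, c in pending.items()))
        return sorted(pending)


async def demo():
    sent = []

    async def fake_upload(path, content):
        await asyncio.sleep(0)  # stands in for a network request
        sent.append((path, content))

    cache = ChangeCache(fake_upload)
    cache.record("RDF1.ttl", "first edit")
    cache.record("RDF1.ttl", "second edit")
    cache.record("RDF2.ttl", None)
    synced = await cache.sync()
    return synced, sent


synced, sent = asyncio.run(demo())
print(synced)
```
]]></preformat>
        <p>In this sketch, a commit would correspond to calling sync after the user has confirmed which cached changes to include.</p>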
        <p>A commit (modeled as a PROV
activity) carries information on the time and
content of changed resources (PROV
entities), a title, its author (PROV agent),
a message, and possibly additional
resources that were used (but not modified)
in the process. Fig. 2 shows an example
in which two resources in the local file
system are modified. The CommitGUI
illustrated in Fig. 3 is then used to capture
provenance and additional metadata of the
generating process and to commit the data
to the upstream LDP server.</p>
        <p>To enable the semantic exploration and management of existing resources, a set of
user interfaces (see Fig. 4) has been designed and implemented. They are accessible
through the right-click context menu and thereby extend the file system’s functionality.
The RevisionView displays a history of all existing revisions of a resource as well as
each revision’s generating activity. Additionally, UI elements to download or restore
revisions are provided. Inspecting a revision’s generating activity allows further
inspection along the edges of the PROV graph by opening the ActivityView. An
activity’s ActivityView holds information on its responsible agent, the time of execution,
and a list of resources that have been used during the activity. A full overview of the
implemented interfaces in use can be found on the project’s repository page
(https://git.rwth-aachen.de/i5/factdag/factfuse).
This approach of providing context-dependent UI extensions to manage and visualize
metadata, in alignment with the ontologies and best practices of specific application
domains, allows for the progressive and configurable adoption of semantic data
management in the traditional file system.</p>
        <p>Fig. 3. Configuring changes &amp; metadata to include when creating a process step commit.</p>
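        <p>The navigation between the two views can be pictured as a walk over PROV edges. The following Python sketch uses a small in-memory triple list instead of an LDP server. The resource identifiers and the functions revision_view and activity_view are hypothetical and only mirror the information the views are described to display.</p>
        <preformat><![CDATA[
```python
# (subject, predicate, object) triples using PROV-O property names.
TRIPLES = [
    ("rdf1@v2", "prov:wasRevisionOf", "rdf1@v1"),
    ("rdf1@v2", "prov:wasGeneratedBy", "C2"),
    ("rdf1@v1", "prov:wasGeneratedBy", "C1"),
    ("C2", "prov:used", "img1@v1"),
    ("C2", "prov:wasAssociatedWith", "lmueller"),
    ("C2", "prov:endedAtTime", "2021-06-01T12:00:00Z"),
    ("C1", "prov:wasAssociatedWith", "lmueller"),
]


def objects(subject, predicate):
    """Return all objects of triples with the given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def revision_view(revision):
    """List (revision, generating activity) pairs, newest first."""
    history, seen = [], set()
    while revision is not None and revision not in seen:
        seen.add(revision)  # guards against cycles in malformed data
        activities = objects(revision, "prov:wasGeneratedBy")
        history.append((revision, activities[0] if activities else None))
        previous = objects(revision, "prov:wasRevisionOf")
        revision = previous[0] if previous else None
    return history


def activity_view(activity):
    """Collect the agent, time, and used resources of an activity."""
    return {
        "agents": objects(activity, "prov:wasAssociatedWith"),
        "ended": objects(activity, "prov:endedAtTime"),
        "used": objects(activity, "prov:used"),
    }


print(revision_view("rdf1@v2"))
print(activity_view("C2"))
```
]]></preformat>
        <p>Selecting an activity from the revision history and passing it to activity_view corresponds to opening the ActivityView from the RevisionView.</p>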
        <sec id="sec-2-12-2">
          <title>Evaluation</title>
          <p>
            With the proposed provenance management extensions, factFUSE provides a
tool for managing version-controlled Web resources in the file system that automatically
collects provenance information and offers interfaces to explore and use it. Goal G1 is
met by design, since factFUSE allows for the drag-and-drop integration of arbitrary
computer files and represents Web resources in the file system such that they can be used
and edited in existing desktop applications. Goals G2 and G3 are met by employing
the FactStack data management system as detailed in [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. Goal G4 is addressed by
extending factFUSE with a commit system as detailed above. Provenance information is
automatically captured and can either be synchronized with the LDP server in near real-time
or be manually customized and committed by the user. To validate the
user-friendliness specified by goal G5, a user study was conducted that evaluated the
management and exploration of semantic provenance metadata as well as versioned resources.
The participants, all without prior experience in provenance management, were asked to
complete a set of six tasks that each represented a core functionality and use case of the
system. A detailed description of the study, its participants, the raw task and result data, and
further discussion can be found in the project repository
(https://git.rwth-aachen.de/i5/factdag/factfuse). The System Usability Scale (SUS) [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]
results, with a score of 83.25, attest to the excellent usability of the factFUSE system, fulfilling
goal G5. Summarizing the key results, all participants successfully used the system to
solve the provenance and version management-related tasks and shared positive feedback
about the system and its utility. The commit system was further identified as an extension
point for future metadata collection, possibly depending on the type and context of the
modified resources, enabling deeper integration of semantic data management primitives
into the file system.
          </p>
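          <p>For readers unfamiliar with the scale, the standard SUS scoring rule can be sketched in Python as follows. Each participant answers ten items on a scale from 1 to 5. Odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5 to yield a score between 0 and 100. The example responses below are invented for illustration and are not data from the study.</p>
          <preformat><![CDATA[
```python
def sus_score(responses):
    """Compute the System Usability Scale score for one participant."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly ten responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")
    total = 0
    for item, response in enumerate(responses, start=1):
        if item % 2 == 1:
            total += response - 1   # positively worded items
        else:
            total += 5 - response   # negatively worded items
    return total * 2.5


def mean_sus(all_responses):
    """Average the per-participant scores, as is usual when reporting SUS."""
    scores = [sus_score(r) for r in all_responses]
    return sum(scores) / len(scores)


# Invented example responses for two hypothetical participants.
best_case = [5, 1, 5, 1, 5, 1, 5, 1, 5, 1]
neutral = [3] * 10
print(sus_score(best_case), sus_score(neutral), mean_sus([best_case, neutral]))
```
]]></preformat>
          <p>A reported study score such as the one above is typically the mean of the individual participant scores.</p>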
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>In this paper, we presented a solution for the collection and management of provenance
information in the file system, by extending the factFUSE system for managing versioned
Web resources. Through the implementation of an extensible commit system, we provide
a solution for semi-automatic provenance collection combined with manual user prompts
for additional metadata that is expressed as RDF, using the W3C PROV standard. This
enables the semantic embedment of digital files into processes and subsequently enables
a detailed overview of the relations within a process and between different resources. Our
user study yielded excellent usability results and showed a quick adoption of principles
by new users. As such, factFUSE provides a first step towards the flexible integration
of traditional file-based data management with semantic metadata using open Web
standards and technologies and is available as open source for evaluation by the community.</p>
      <p>Acknowledgments. Funded by the Deutsche Forschungsgemeinschaft (DFG, German
Research Foundation) under Germany’s Excellence Strategy – EXC-2023 Internet of
Production – 390621612.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Belhajjame</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>B'Far</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , et al.:
          <article-title>Prov-dm: The prov data model</article-title>
          .
          <source>W3C Recommendation</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Brooke</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>SUS: A quick and dirty usability scale</article-title>
          . In: Usability Evaluation In Industry, pp.
          <fpage>207</fpage>
          -
          <lpage>212</lpage>
          . CRC Press (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gleim</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pennekamp</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>FactDAG: Formalizing Data Interoperability in an Internet of Production</article-title>
          .
          <source>IEEE Internet Things J</source>
          .
          <volume>7</volume>
          (
          <issue>4</issue>
          ) (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Gleim</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pennekamp</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>FactStack: Interoperable Data Management and Preservation for the Web and Industry 4.0</article-title>
          . In: BTW (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Gleim</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tirpitz</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , et al.:
          <article-title>Expressing FactDAG Provenance with PROV-O</article-title>
          . In: MEPDaW @ ISWC (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Loeliger</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCullough</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Version Control with Git: Powerful tools and techniques for collaborative software development.</article-title>
          <publisher-name>O'Reilly Media</publisher-name>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Müller, L.,
          <string-name>
            <surname>Gleim</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Managing Versioned Web Resources in the File System</article-title>
          . In: ICWE (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Speicher</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arwe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <source>Linked Data Platform 1.0. W3C Rec</source>
          . (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Wilkinson</surname>
            ,
            <given-names>M.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumontier</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , et al.:
          <article-title>The FAIR Guiding Principles for scientific data management and stewardship</article-title>
          .
          <source>Sci. Data</source>
          <volume>3</volume>
          ,
          <issue>160018</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>