<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>File Systems to Knowledge Graphs (Demo)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yannis Tzitzikas</string-name>
          <email>tzitzik@ics.forth.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, University of Crete</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Computer Science</institution>
          ,
          <addr-line>FORTH-ICS</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>The tree-structured and semantics-neutral approach of file systems has been the dominant method for organizing information for decades. In this paper we elaborate on the following two questions: (a) can a file system structure benefit from a Knowledge Graph (KG), and (b) can the construction of a KG be facilitated by the file system? To this end, we propose an automatic method for producing KGs from folder structures. The method can be configured through small, easy-to-write configuration files that are placed in the desired folders to guide the KG construction. We present FS2KG, an implementation of the approach. The approach can facilitate the rapid creation of KGs, as well as various file-system-related tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Graph creation</kwd>
        <kwd>File systems</kwd>
        <kwd>Semantic Access over File Systems</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Motivation, Challenges, Methodology and Approach</title>
      <p>
        https://www.ics.forth.gr/~tzitzik (Y. Tzitzikas)
of using) file systems, be leveraged for speeding up, or just facilitating, the creation of KGs?
Both directions could have significant impact. The first would enable leveraging Semantic
Web technologies in everyday tasks. The second would assist the creation of KGs, which is
desirable, since there is a need for practical and mature tools to foster knowledge engineering
(there are some critiques about the practicality and availability of tools for the Semantic Web,
e.g. see [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and a detailed discussion of these critiques in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]).
      </p>
      <p>Challenges: Enriching a file system with a KG is a challenging task, because a file system
is used for various purposes and tasks and therefore contains very heterogeneous material. For
instance, one part of the file system may contain training material (books, papers, slides,
assignments, student exercises), another part various personal material (family documents, photos
and videos, travel information), and other parts datasets, software code and systems, and so on.
Moreover, applications also use the file system, creating and modifying parts of it.</p>
      <p>Methodology: We started by inspecting existing file system structures and reflecting on
what we would like to achieve and which file system weaknesses we would like to tackle. We came up
with various ideas, which we implemented and tested; only those that seemed effective were
included in the proposed tool, FS2KG.</p>
      <p>Approach: In brief, we propose supporting two fundamental, interrelated aspects: the folder
structure and a semantic network, with connections between the two. The core schema is
illustrated in Figure 1 (upper part), where “*” denotes multiplicity (as in UML Class
Diagrams). The approach is equipped with methods that create entities based on the files and
folders of the file system, as well as by extracting them from CSV files. The big picture is sketched
in Figure 1 (lower part).</p>
      <p>
        Related Work: In comparison to the line of research under the term “semantic desktop” that
was developed 15 years ago (e.g. [
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7 ref8">4, 5, 6, 7, 8</xref>
        ]), the current work has a more
modest, but realistic, vision: not to integrate data, applications, and tasks, but to focus on the
data part (folders and files). As pointed out in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], existing Semantic Desktops are either too
complicated or do not scale well, and a real “killer app” is still missing. The approach proposed
in this paper is more closely tied to classical file system usage. It adopts a modular
configuration approach: there is no dependency on a central repository, a central configuration,
or any other service.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. The Functionality of FS2KG</title>
      <p>
        FS2KG supports a default operation that requires no configuration. It starts by traversing the file
system from the desired folder(s). Each folder is represented by a class, subfolder relationships
by rdfs:subClassOf, while each file is represented as a named individual classified under the
class of its folder. However, the user can place a “.kg” file in some folders to configure the
creation of the KG in the corresponding part of the file system. In particular, a “.kg” file
contains configuration parameters in the form of key-value pairs. It supports commands: (I) for
scope restriction, i.e. with traverse=off the traversal stops, and we can ignore files based on
their extension, e.g. ignoreExt=tmp;aux; (II) for the automatic creation and classification of
entities corresponding to subfolders, e.g. with subFoldersClass=example:Student, for each
subfolder of the hosting folder an entity is created (belonging to the Semantic Network view), with
a link example:moreAt pointing to the class of that folder. An example is shown in Figure 2, where
with subFoldersClass=example:Student in the “.kg” file of the folders ReccomendationLetters and
MScStudents we managed to get one entity for each student, which is connected with these folders;
(III) for leveraging ‘readme’ files, i.e. with readme=on, if we encounter readme files they are
connected with the entities of the corresponding folders; (IV) for adding arbitrary metadata,
i.e. extra, explicitly specified triples can be associated with a file, say f1.pdf, by placing them in
a file f1.kg in the same folder; (V) for extracting and transforming data from the desired
CSV files. Specifically, FS2KG adopts the following convention: if we want to perform extraction
from a file, say “fn.y”, we can create a file “fn.kg” in which we place rules to extract data from the
corresponding file. We support an easy-to-use language (much simpler than existing approaches,
like [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]), with which we can construct RDF triples (data, taxonomies, ontologies) from CSV
files. For example, suppose a file with name Connections.txt that contains lines of the form:
Leonardo;Rome;Football.
      </p>
      <p>We can place in the same folder a file Connections.kg with:
        <preformat>C1=example:Student
C2=example:Location
C3=example:Sport
R=C1,example:livesAt,C2;C1,example:likes,C3</preformat>
Property C1 refers to the first column, and its value means that the values that occur in the first
column of the data file should become instances of the class example:Student. Consequently,
with the first three lines (properties C1-C3), we manage to classify all values that appear in the
CSV file into the classes Student, Location and Sport. The last line contains two rules, separated
by a semicolon, for creating relationships. The first is “C1,example:livesAt,C2”, which states that the
values in C1 should be connected via example:livesAt with the values of C2. Analogously, the
second rule relates the values of the first column with the values of the third column. If we also
have provenance=on, then these extracted entities will be connected through rdfs:isDefinedIn
with the corresponding file. FS2KG also offers a lightweight query client, shown in Figure 2 (right).</p>
      <p>Efficiency. The application of FS2KG over a file system of 140 GB that contains 60K folders and
382K files takes only 90 seconds and produces a ttl file of size 140 MB.</p>
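      <p>To make the extraction rules of the Connections.kg example concrete, the following is a minimal Python sketch of how such rules could be interpreted. It is not the FS2KG implementation: the function names parse_kg and extract_triples are hypothetical, triples are plain tuples rather than an RDF library's objects, and forming an entity name by prefixing the cell value with “example:” is an assumption made only for illustration.</p>
      <preformat>
```python
def parse_kg(text):
    """Parse the key=value lines of a '.kg' rule file into a dict."""
    config = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:  # skip blank lines and lines without '='
            config[key.strip()] = value.strip()
    return config


def extract_triples(data_text, config, prefix="example:"):
    """Apply the Cn class rules and the R relationship rules to ';'-separated rows."""
    # Keys C1, C2, ... map a 1-based column number to a class.
    classes = {int(k[1:]) - 1: v for k, v in config.items()
               if k.startswith("C") and k[1:].isdigit()}
    # R holds rules of the form "Ci,property,Cj", separated by ';'.
    rules = []
    for rule in config.get("R", "").split(";"):
        parts = [p.strip() for p in rule.split(",")]
        if len(parts) == 3:
            rules.append((int(parts[0][1:]) - 1, parts[1], int(parts[2][1:]) - 1))

    triples = []
    for row in data_text.splitlines():
        if not row.strip():
            continue
        # Assumption: each cell value is turned into a name by adding a prefix.
        cells = [prefix + c.strip() for c in row.split(";")]
        for col, cls in sorted(classes.items()):
            if len(cells) > col:
                triples.append((cells[col], "rdf:type", cls))
        for src, prop, dst in rules:
            if len(cells) > max(src, dst):
                triples.append((cells[src], prop, cells[dst]))
    return triples


kg_text = """C1=example:Student
C2=example:Location
C3=example:Sport
R=C1,example:livesAt,C2;C1,example:likes,C3"""

for triple in extract_triples("Leonardo;Rome;Football", parse_kg(kg_text)):
    print(triple)
```
      </preformat>
      <p>For the single data line of the example, the sketch yields three classification triples (one per column) and two relationship triples (one per rule in R).</p>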
      <p>Use Cases. We can identify two main scenarios: (1) Over existing file systems, to enable
querying, identification and grouping of entities scattered in different subfolders. To this end, one
immediate next step is the implementation of an explorer that combines the functionality of the
classical file explorer with the query client (as shown in Figure 2).
(2) Over folder structures and files created for facilitating KG construction. For example, the user
can use the file system to define a taxonomy (e.g. of papers organized in categories), instead of
having to use a taxonomy/ontology editor. Moreover, arbitrary KGs can be constructed from
CSV files through FS2KG and the supported extraction language.</p>
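      <p>The default operation described at the start of this section (a class per folder, rdfs:subClassOf per subfolder relationship, an individual per file) is also what allows a folder tree to act as a taxonomy. It could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the tool's code: the function default_kg, the “fs:” prefix, the path-based naming of resources, and the explicit rdfs:Class typing of folders are all hypothetical choices, and the ignore_ext parameter only mimics the effect that the text attributes to ignoreExt.</p>
      <preformat>
```python
import os


def default_kg(root, prefix="fs:", ignore_ext=()):
    """Return (subject, predicate, object) triples for the folder tree under root."""
    root = os.path.abspath(root)
    parent = os.path.dirname(root)

    def name(path):
        # Assumption: a resource is named by its path relative to the root's parent.
        return prefix + os.path.relpath(path, parent).replace(os.sep, "/")

    triples = []
    for folder, subfolders, files in os.walk(root):
        subfolders.sort()  # deterministic traversal order
        triples.append((name(folder), "rdf:type", "rdfs:Class"))
        for sub in subfolders:
            # A subfolder relationship becomes rdfs:subClassOf.
            triples.append((name(os.path.join(folder, sub)),
                            "rdfs:subClassOf", name(folder)))
        for filename in sorted(files):
            ext = os.path.splitext(filename)[1].lstrip(".")
            if ext and ext in ignore_ext:
                continue  # mimics the effect of ignoreExt
            # A file becomes an individual classified under its folder's class.
            triples.append((name(os.path.join(folder, filename)),
                            "rdf:type", name(folder)))
    return triples
```
      </preformat>
      <p>Calling default_kg on a folder Students that contains a subfolder MSc would, under these naming assumptions, state that fs:Students/MSc is a subclass of fs:Students, and every file inside MSc would be typed as an instance of fs:Students/MSc.</p>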
    </sec>
    <sec id="sec-3">
      <title>3. Conclusion</title>
      <p>
        Finding an effective method to reconcile freedom of file system usage with Knowledge Graph
integrity and usability is a challenging task. We will demonstrate FS2KG, a tool for the automatic
creation of KGs from file systems that supports a modular (and easy-to-use) configuration
approach relying on small configuration files in the folders, and KG reconstruction at any
moment. The tool is open source, available at https://github.com/YannisTzitzikas/FS2KG, and
open to a plethora of extensions. We have decided to include in FS2KG a core set of functionality.
On top of this, several straightforward extensions are applicable (since they have already been
studied in isolation), including: (a) representation of the file system’s file metadata in RDF (as in
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]), (b) extraction of the metadata embedded in the files and its representation in RDF (as in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]),
(c) instance matching over the KG to establish connections between entities whose names are
slightly different in different folders, (d) regex-based specification of the desired files/folders (as
in web crawlers), (e) information extraction capabilities from files according to their type (text,
images, etc.) based on the application context and requirements at hand (including scripts in the
“.kg” files), (f) materialization of the extracted triples from big CSV files, to avoid re-extracting
them in the next KG reconstruction if the files have not changed in the meantime, and (g)
keyword search based on both the contents of the files and the produced KG (as in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]).
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Barnard</surname>
            <suffix>III</suffix>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fein</surname>
          </string-name>
          ,
          <article-title>Organization and retrieval of records generated in a large-scale engineering project</article-title>
          ,
          <source>in: Papers and discussions presented at the December 3-5, 1958, Eastern Joint Computer Conference: Modern computers: objectives, designs, applications</source>
          ,
          <year>1958</year>
          , pp.
          <fpage>59</fpage>
          -
          <lpage>63</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vander Sande</surname>
          </string-name>
          ,
          <article-title>The semantic web identity crisis: in search of the trivialities that never were</article-title>
          ,
          <source>Semantic Web</source>
          <volume>11</volume>
          (
          <year>2020</year>
          )
          <fpage>19</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <article-title>The semantic web: Two decades on</article-title>
          ,
          <source>Semantic Web</source>
          <volume>11</volume>
          (
          <year>2020</year>
          )
          <fpage>169</fpage>
          -
          <lpage>185</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Jenkins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jackson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Burden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <article-title>Automatic RDF metadata generation for resource discovery</article-title>
          ,
          <source>Computer Networks</source>
          <volume>31</volume>
          (
          <year>1999</year>
          )
          <fpage>1305</fpage>
          -
          <lpage>1320</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Sauermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bernardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dengel</surname>
          </string-name>
          ,
          <article-title>Overview and outlook on the semantic desktop</article-title>
          ,
          <source>in: Semantic Desktop Workshop</source>
          , volume
          <volume>175</volume>
          ,
          <publisher-name>Citeseer</publisher-name>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Schandl</surname>
          </string-name>
          ,
          <article-title>SemDAV: a file exchange protocol for the semantic desktop</article-title>
          ,
          <source>in: SemDesk'06: Proceedings of the 5th International Conference on Semantic Desktop and Social Semantic Collaboration</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Sauermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Van Elst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dengel</surname>
          </string-name>
          ,
          <article-title>PIMO - a framework for representing personal information models</article-title>
          ,
          <source>Proceedings of I-Semantics</source>
          <volume>7</volume>
          (
          <year>2007</year>
          )
          <fpage>270</fpage>
          -
          <lpage>277</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Drăgan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Decker</surname>
          </string-name>
          ,
          <article-title>Knowledge management on the desktop</article-title>
          ,
          <source>in: International Conference on Knowledge Engineering and Knowledge Management</source>
          , Springer,
          <year>2012</year>
          , pp.
          <fpage>373</fpage>
          -
          <lpage>382</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Jilek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schröder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Maus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dengel</surname>
          </string-name>
          ,
          <article-title>Context spaces as the cornerstone of a near-transparent and self-reorganizing semantic desktop</article-title>
          ,
          <source>in: European Semantic Web Conference</source>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>89</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Marketakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Minadakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kondylakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Konsolaki</surname>
          </string-name>
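          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Samaritakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Theodoridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Flouris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Doerr</surname>
          </string-name>
          ,
          <article-title>X3ML mapping framework for information integration in cultural heritage and beyond</article-title>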
          ,
          <source>International Journal on Digital Libraries</source>
          <volume>18</volume>
          (
          <year>2017</year>
          )
          <fpage>301</fpage>
          -
          <lpage>319</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Marketakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tzanakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tzitzikas</surname>
          </string-name>
          ,
          <article-title>Prescan: towards automating the preservation of digital objects</article-title>
          ,
          <source>in: Proceedings of the International Conference on Management of Emergent Digital EcoSystems</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>404</fpage>
          -
          <lpage>411</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Nikas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kadilierakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fafalios</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tzitzikas</surname>
          </string-name>
          ,
          <article-title>Keyword search over RDF: Is a single perspective enough?</article-title>
          ,
          <source>Big Data and Cognitive Computing</source>
          <volume>4</volume>
          (
          <year>2020</year>
          )
          <fpage>22</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>