
File Systems to Knowledge Graphs (Demo)

Yannis Tzitzikas

tzitzik@ics.forth.gr

Computer Science Department, University of Crete, Greece
Institute of Computer Science, FORTH-ICS, Greece

1. Motivation, Challenges, Methodology and Approach

2022

The tree-structured and semantics-neutral approach of file systems has been the dominant method for organizing information for decades. In this paper we elaborate on the following two questions: (a) can a file system structure benefit from a Knowledge Graph (KG), and (b) can the construction of a KG be facilitated by the file system? To this end we propose an automatic method for producing KGs from folder structures, which can be configured through small, easy-to-write configuration files that can be placed in the desired folders to guide the KG construction. We present FS2KG, an implementation of the approach. The approach can facilitate the rapid creation of KGs, as well as various file-system-related tasks.

Keywords: Knowledge Graph creation, File systems, Semantic Access over File Systems

https://www.ics.forth.gr/~tzitzik (Y. Tzitzikas)

of using) file systems, be leveraged for speeding up, or just facilitating, the creation of KGs? Both directions could have significant impact. The first would enable leveraging Semantic Web technologies in everyday tasks. The second would assist the creation of KGs, which is desirable since there is a need for practical and mature tools to foster knowledge engineering (there are some critiques about the practicality and availability of tools for the Semantic Web, e.g. see [2], and an elaborated discussion of these critiques in [3]).

Challenges: Enriching a file system with a KG is a challenging task, because a file system is used for various purposes and tasks and therefore contains very heterogeneous material. For instance, one part of the file system may contain training material (books, papers, slides, assignments, student exercises), another part various personal material (family documents, photos and videos, travel information), and others datasets, software code and systems. Moreover, applications also use the file system, creating and modifying parts of it.

Methodology: We started by inspecting existing file system structures and reflecting on what we would like to achieve and which file system weaknesses we would like to tackle. We came up with various ideas, which we implemented and tested, and only those that seemed effective were included in the proposed tool, FS2KG.

Approach: In brief, we propose supporting two fundamental, interrelated aspects: a folder structure and a semantic network, with connections between the two. The core schema is illustrated in Figure 1 (upper part), where "*" denotes multiplicity (as in UML Class Diagrams). The approach is equipped with methods that create entities based on the files and folders of the file system, as well as by extracting them from CSV files. The big picture is sketched in Figure 1 (lower part).

Related Work: In comparison to the line of research under the term "semantic desktop" that was developed 15 years ago (e.g. [4, 5, 6, 7, 8]), the current work has a more modest, but realistic, vision: not to integrate data, applications, and tasks, but to focus on the data part (folders and files). As pointed out in [9], existing Semantic Desktops are either too complicated or do not scale well, and a real "killer app" is still missing. The approach proposed in this paper is more tightly related to classical file system usage. It adopts a modular configuration approach: there is no dependency on a central repository, a central configuration, or any other service.

2. The Functionality of FS2KG

FS2KG supports a default operation that requires no configuration. It starts by traversing the file system from the desired folder(s). Each folder is represented by a class, subfolder relationships by rdfs:subClassOf, and each file is represented as a named individual classified under the class of its folder. However, the user can place a ".kg" file in some folders to configure the creation of the KG in the corresponding part of the file system. In particular, a ".kg" file contains configuration parameters in the form of key-value pairs. It supports commands:

(I) for scope restriction: with traverse=off the traversal stops, and files can be ignored based on their extension, e.g. ignoreExt=tmp;aux.

(II) for the automatic creation and classification of entities corresponding to subfolders: e.g. with subFoldersClass=example:Student, for each subfolder of the hosting folder an entity is created (belonging to the Semantic Network view), with a link example:moreAt pointing to the class of that folder. An example is shown in Figure 2, where with subFoldersClass=example:Student in the ".kg" file of the folders ReccomendationLetters and MScStudents we get one entity for each student, connected with these folders.

(III) for leveraging "readme" files: with readme=on, any readme files encountered are connected with the entities of the corresponding folders.

(IV) for adding arbitrary metadata: extra, explicitly specified triples can be associated with a file, say f1.pdf, by placing them in a file f1.kg in the same folder.

(V) for extracting and transforming data from the desired CSV files. Specifically, FS2KG adopts the following convention: to perform extraction from a file, say "fn.y", we create a file "fn.kg" where we place rules that extract data from the corresponding file. We support an easy-to-use language (much simpler than existing approaches, like [10]) with which we can construct RDF triples (data, taxonomies, ontologies) from CSV files. For example, suppose a file named Connections.txt that contains lines of the form: Leonardo;Rome;Football.

We can place in the same folder a file Connections.kg with:

    C1=example:Student
    C2=example:Location
    C3=example:Sport
    R=C1,example:livesAt,C2;C1,example:likes,C3

Property C1 refers to the first column, and its value means that the values that occur in the first column of the data file should become instances of the class example:Student. Consequently, with the first three lines (properties C1-C3), we classify all values that appear in the CSV file under the classes Student, Location and Sport. The last row contains two rules, separated by a semicolon, for creating relationships. The first is "C1,example:livesAt,C2", which states that the values in C1 should be connected via example:livesAt with the values of C2. Analogously, the second rule relates the values of the first column with the values of the third column. If we also have provenance=on, then these extracted entities will be connected through rdfs:isDefinedIn with the corresponding file.

FS2KG also offers a lightweight query client, shown in Figure 2 (right).

Efficiency. Applying FS2KG over a file system of 140 GB that contains 60 K folders and 382 K files takes only 90 seconds and produces a ttl file of size 140 MB.
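The column rules of the Connections.kg example can be illustrated with a minimal Python sketch. It is not the FS2KG implementation: the function names parse_kg_rules and extract_triples are hypothetical, and the way cell values are turned into identifiers (prepending an "example:" prefix) is an assumption made only for illustration. It assumes a semicolon-separated data file and the Cn / R properties described above.

```python
def parse_kg_rules(kg_text):
    """Parse 'Cn=<class>' column properties and the 'R=' relationship rules."""
    column_classes = {}  # 1-based column index -> class name
    relations = []       # (subject column, predicate, object column)
    for line in kg_text.splitlines():
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "R":
            # Rules are separated by ';', their three parts by ','.
            for rule in value.split(";"):
                if rule.strip():
                    subj, pred, obj = (p.strip() for p in rule.split(","))
                    relations.append((int(subj[1:]), pred, int(obj[1:])))
        elif key.startswith("C") and key[1:].isdigit():
            column_classes[int(key[1:])] = value
    return column_classes, relations


def extract_triples(data_text, column_classes, relations, prefix="example:"):
    """Apply the parsed rules to every row of a ';'-separated data file.

    Assumption: a cell value becomes an identifier by prepending `prefix`.
    """
    triples = []
    for row in data_text.splitlines():
        if not row.strip():
            continue
        values = [v.strip() for v in row.split(";")]
        # Classification: each value becomes an instance of its column's class.
        for col, cls in column_classes.items():
            if col <= len(values):
                triples.append((prefix + values[col - 1], "rdf:type", cls))
        # Relationships: connect the values of two columns of the same row.
        for subj_col, pred, obj_col in relations:
            if max(subj_col, obj_col) <= len(values):
                triples.append((prefix + values[subj_col - 1], pred,
                                prefix + values[obj_col - 1]))
    return triples


kg = """C1=example:Student
C2=example:Location
C3=example:Sport
R=C1,example:livesAt,C2;C1,example:likes,C3"""
classes, rules = parse_kg_rules(kg)
for triple in extract_triples("Leonardo;Rome;Football", classes, rules):
    print(triple)
```

For the single row Leonardo;Rome;Football, the sketch yields three classification triples (one per column) and two relationship triples (livesAt and likes).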

Use Cases. We can identify two main scenarios: (1) Over existing file systems, to enable querying, identification and grouping of entities scattered in different subfolders. To this end, one immediate next step is the implementation of an explorer that combines the functionality of the classical file explorer with the query client (as shown in Figure 2). (2) Over folder structures and files created for facilitating KG construction. For example, the user can use the file system to define a taxonomy (e.g. of papers organized in categories), instead of having to use a taxonomy/ontology editor. Moreover, arbitrary KGs can be constructed from CSV files through FS2KG and the supported extraction language.
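To make the taxonomy scenario concrete, the default folder-to-class mapping of Section 2, together with the scope-restriction keys traverse and ignoreExt, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the FS2KG code: the names read_kg_config and folder_triples and the "fs:" identifier scheme are hypothetical, a folder with traverse=off is assumed to be recorded itself while nothing beneath it is visited, ignoreExt is assumed to apply only to the folder hosting the ".kg" file, and files ending in ".kg" are assumed not to become individuals.

```python
import os


def read_kg_config(folder):
    """Read key=value pairs from a '.kg' file in `folder`, if one exists."""
    config = {}
    path = os.path.join(folder, ".kg")
    if os.path.isfile(path):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if "=" in line:
                    key, value = line.split("=", 1)
                    config[key.strip()] = value.strip()
    return config


def folder_triples(root):
    """Map folders to classes and files to individuals of their folder's class."""
    root = os.path.abspath(root)
    triples = []

    def ident(path):
        # Hypothetical identifier scheme: 'fs:' plus the path relative to
        # the parent of the root folder, with '/' as separator.
        rel = os.path.relpath(path, os.path.dirname(root))
        return "fs:" + rel.replace(os.sep, "/")

    def visit(folder):
        config = read_kg_config(folder)
        if config.get("traverse") == "off":
            return  # assumption: nothing beneath this folder is visited
        ignored = {ext.strip().lower()
                   for ext in config.get("ignoreExt", "").split(";")
                   if ext.strip()}
        for name in sorted(os.listdir(folder)):
            path = os.path.join(folder, name)
            if os.path.isdir(path) and not os.path.islink(path):
                # Subfolder relationship -> rdfs:subClassOf
                triples.append((ident(path), "rdfs:subClassOf", ident(folder)))
                visit(path)
            elif os.path.isfile(path) and not name.endswith(".kg"):
                extension = os.path.splitext(name)[1].lstrip(".").lower()
                if extension not in ignored:
                    # File -> individual classified under its folder's class
                    triples.append((ident(path), "rdf:type", ident(folder)))

    visit(root)
    return triples
```

Calling folder_triples on a folder of papers organized in category subfolders returns one rdfs:subClassOf triple per subfolder and one rdf:type triple per file that is not ignored, which is the taxonomy a user would otherwise define in an ontology editor.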

3. Conclusion

Finding an effective method to reconcile freedom of file system usage with Knowledge Graph integrity and usability is a challenging task. We will demonstrate FS2KG, a tool for the automatic creation of KGs from file systems that supports a modular (and easy-to-use) configuration approach relying on small configuration files in the folders, and KG reconstruction at any moment. The tool is open source and available at https://github.com/YannisTzitzikas/FS2KG, and is open to a plethora of extensions. We have decided to include in FS2KG a sort of core functionality. On top of this, several straightforward extensions are applicable (since they have already been studied in isolation), including: (a) representation of the file system's file metadata in RDF (as in [4]), (b) extraction of the metadata embedded in the files and their representation in RDF (as in [11]), (c) instance matching over the KG to establish connections between entities whose names differ slightly across folders, (d) regex-based specification of the desired files/folders (as in web crawlers), (e) information extraction capabilities from files according to their type (text, images, etc.) based on the application context and requirements at hand (including scripts in the ".kg" files), (f) materialization of the triples extracted from big CSV files, to avoid re-extracting them in the next KG reconstruction if the files have not changed in the meantime, and (g) keyword search based on both the contents of the files and the produced KG (as in [12]).

References

[1] Barnard III, L. Fein, Organization and retrieval of records generated in a large-scale engineering project, in: Papers and discussions presented at the December 3-5, 1958, eastern joint computer conference: Modern computers: objectives, designs, applications, 1958, pp. 59-63.

[2] Verborgh, M. Vander Sande, The semantic web identity crisis: in search of the trivialities that never were, Semantic Web 11 (2020) 19-27.

[3] Hogan, The semantic web: Two decades on, Semantic Web 11 (2020) 169-185.

[4] Jenkins, Jackson, Burden, Wallis, Automatic RDF metadata generation for resource discovery, Computer Networks 31 (1999) 1305-1320.

[5] Sauermann, Bernardi, Dengel, Overview and outlook on the semantic desktop, in: Semantic Desktop Workshop, volume 175, Citeseer, 2005.

[6] Schandl, SemDAV: a file exchange protocol for the semantic desktop, in: SemDesk'06: Proceedings of the 5th International Conference on Semantic Desktop and Social Semantic Collaboration, 2006.

[7] Sauermann, L. Van Elst, Dengel, PIMO - a framework for representing personal information models, Proceedings of I-Semantics 7 (2007) 270-277.

[8] Drăgan, Decker, Knowledge management on the desktop, in: International Conference on Knowledge Engineering and Knowledge Management, Springer, 2012, pp. 373-382.

[9] Jilek, Schröder, Schwarz, Maus, Dengel, Context spaces as the cornerstone of a near-transparent and self-reorganizing semantic desktop, in: European Semantic Web Conference, Springer, 2018, pp. 89-94.

[10] Marketakis, Minadakis, Kondylakis, Konsolaki, G. Samaritakis, Theodoridou, G. Flouris, M. Doerr, X3ML mapping framework for information integration in cultural heritage and beyond, International Journal on Digital Libraries 18 (2017) 301-319.

[11] Marketakis, Tzanakis, Tzitzikas, Prescan: towards automating the preservation of digital objects, in: Proceedings of the International Conference on Management of Emergent Digital EcoSystems, 2009, pp. 404-411.

[12] Nikas, G. Kadilierakis, Fafalios, Tzitzikas, Keyword search over RDF: Is a single perspective enough?, Big Data and Cognitive Computing 4 (2020) 22.