Lifting File Systems into the Linked Data Cloud with TripFS

                     Bernhard Schandl                                            Niko Popitsch
               bernhard.schandl@univie.ac.at                              niko.popitsch@univie.ac.at
                         University of Vienna, Department of Distributed and Multimedia Systems
                                          Liebiggasse 4/3-4, 1010 Vienna, Austria


ABSTRACT
A major fraction of digital information is stored in file systems. File systems usually organize
files in labelled directory trees and provide minimal support for user-driven file annotation,
linkage and categorization. Although file systems play a major role in knowledge organization,
both in enterprise contexts and in the personal information sphere, they have rarely been
considered in Web-based information integration. To a large extent, this can be attributed to the
limited metadata support of file systems and to the lack of stable identifiers for files and
directories, which makes it hard to expose these objects in a global Web. We present TripFS, a
lightweight approach for exposing parts of local file systems as Linked Data. Serving file system
objects via dereferenceable HTTP URIs paves the way to integrate them with the Web of Data, and
enables new possibilities of exploiting file system data, for example, by linking them with other
data sources or by annotating them using Semantic Web technologies.

Categories and Subject Descriptors
D.4.3 [Operating Systems]: File Systems Management; H.3.5 [Information Storage and Retrieval]:
Online Information Services

General Terms
Algorithms, Design

Keywords
Linked Data, file systems, file metadata, information representation, information integration,
event detection

Copyright is held by the author/owner(s).
LDOW2010, April 27, 2010, Raleigh, North Carolina, USA.

1.   INTRODUCTION
   File systems store and organize data and documents of all sorts and of arbitrary complexity,
ranging from small information snippets that can be put into single files, to large repositories
of heterogeneous content that are organized within deep hierarchical structures. They act as the
storage backbone of many information processing systems and can be considered one major
foundation of personal and corporate information management. Since common file systems do not
impose major restrictions on creating, naming, and arranging directories and files, they support
a user’s individual preferences for data organization. File systems do not only store files that
were created or modified locally: a large share of files originates from other sources, like
multimedia devices, other desktops, or the Web. In corporate environments it is common to store
data of collective interest on shared file servers that enable a simple form of collaboration.
   Overall, file systems can be considered one of the primary information sources both for
organizations and individuals, and it is quite likely that they will remain important in the
future. Therefore they are of high interest for information integration. However, file systems
have only rarely been considered in the field of Web-based data integration. This stems mostly
from their limited means of data organization¹, limited metadata support, and the lack of stable
identifiers for files and directories.

¹ By now it seems commonly accepted that a single hierarchical scheme is insufficient for the
  organization of large amounts of data as we encounter them on today’s desktop environments.

   One promising strategy for Web-based information integration is the Linked Data paradigm. This
term denotes a set of technologies and best practices that facilitate information integration and
linkage on a global scale. Exposing information as Linked Data means following three simple
principles [6]: first, identify each resource of interest with a globally unique, dereferenceable
HTTP URI; second, provide useful information for clients when they access the URI (usually
expressed in RDF and HTML); and third, include links to other resources so that clients can
retrieve more potentially interesting information.
   In this paper, we present TripFS, a lightweight approach that applies these principles to file
systems in order to expose their contents as Linked Data, and therefore enables their direct
inclusion in Web-based integration scenarios. It assigns stable, globally valid, dereferenceable
URIs to files and directories, monitors changes in the system, serves metadata extracted from
files as RDF data, and interlinks files with external data sources. It provides a plug-in
architecture so that it can easily be extended to support additional file types and linking
components, it adapts to the specifics of the underlying file system, and it provides a
sophisticated file change tracking component that increases the stability of file identifiers.
   Because it is easy to set up, TripFS also facilitates ad-hoc sharing of file-based resources
using standardized (semantic)
Web technologies. Moreover, it overcomes shortcomings of hierarchical organization mechanisms,
because its metadata-centric approach allows querying for descriptive information instead of file
location, and establishing multiple, orthogonal views on file system data.
   After outlining application scenarios and describing how users can benefit from exposing file
systems as Linked Data (Section 2), we discuss which steps have to be taken in order to realize
this idea (Section 3). We present details about the TripFS architecture and implementation
(Section 4). After a discussion of related work (Section 5) we conclude the paper in Section 6.

2.   BENEFITS OF LINKED FILE SYSTEMS
   The benefits of exposing data as Linked Data resources are manifold [9]. In this section we
outline three scenarios that illustrate how the quality of file system usage can be increased by
exposing files as Linked Data.

A) Integrating File Systems into Enterprise Data. A substantial fraction of enterprise data is
available in the form of file systems. While these data can be accessed in a distributed context
using protocols like CIFS or WebDAV, it is difficult to integrate them in a global enterprise
context due to the lack of stable identifiers for files and of platform-independent
metadata-based file access mechanisms. Linked Data has been shown to be a viable approach for
lightweight enterprise information integration [16]; therefore, making file systems part of a
global or enterprise-internal Web of Data enables them to be seamlessly integrated with, and
semantically connected to, other data sources.

B) Web-based Ad-hoc Data Sharing. Despite the vast number of possibilities for digital
communication we have at our disposal, ad-hoc sharing of meaningful information (e.g., the
exchange of digital documents between participants’ laptops during face-to-face meetings) is
still cumbersome. We can regularly observe that collaborators use e-mail or instant messaging to
quickly exchange files. This approach, however, does not allow more complex data to be shared, or
files to be exchanged together with metadata that describe their correct context. Linked Data
builds on top of common Web technologies, thus any Linked Data source can be directly accessed
using a common Web browser. A tool that allows users to temporarily share selected parts of their
local file systems as Linked Data (which implies not only sharing plain files, but also extracted
metadata, annotations, and links) facilitates efficient information exchange amongst
collaborators.

C) Semantic Web-based File Annotations. Semantic annotation and interlinking of files is badly
supported today: although modern file systems support the storage, management, and retrieval of
file annotations (e.g., extended attributes or file forks), these data are not accessible in a
standardized and platform-independent way. This makes the organization of files into logically
connected units difficult, and reduces the efficiency of file retrieval especially in distributed
environments. If file systems were published as part of a Web of Data, they could be annotated
and interlinked using tools like the LEMO annotation framework [13] or the Silk framework [22],
which would lead to an increased quality of search and retrieval, as well as linkage with other
relevant data sources. In turn, these Linked Data and Web-based annotations could be propagated
back into the working context of the file system user, e.g., by being considered by desktop
search engines.

3.   REPRESENTING FILE SYSTEMS AS LINKED DATA
   Since the characteristics of file systems and Linked Data differ significantly, a number of
steps have to be performed in order to lift file system data into a Web of Data:

     1. Appropriate representations for files and directories have to be found that comply with
        the Linked Data principles.

     2. Vocabularies that convey the characteristics of data found in file systems have to be
        specified and aligned with existing relevant vocabularies.

     3. Descriptive metadata about files have to be extracted from the file system and
        transformed into the RDF data model.

     4. Meaningful links to other, external data sources have to be detected and established.

     5. Consistency between the file system and its corresponding Linked Data representation has
        to be ensured.

     6. Data have to be served according to Linked Data principles, i.e., in a form that is
        usable for both humans and machines.

   In the following we outline how each of these steps can be realized.

3.1   File URIs in the Web of Data?
   Within the context of a file system, files and directories can be uniquely identified using
their absolute paths, each of which consists of a sequence of directory names and a file name.
The file: URI scheme [5] is a means to directly reuse these paths to form URIs, which can in turn
be used to access local file resources in a computer system.
   However, file URIs are neither globally unique, since they describe a local path to a resource
on a particular host, nor stable, since the referenced files and directories may be removed,
moved, or renamed. Therefore they are not suitable for being used in a global Web of Data.
   To solve this identifier problem, we chose to use opaque, randomly generated UUIDs, and assign
them to files and directories. The usage of random UUIDs in a global distributed context is
assumed to be safe since the probability of a collision is sufficiently low. Further, since UUIDs
are fully opaque, they do not convey information about the physical location of files and
directories, and are therefore stable even when the underlying file system objects are changed.
However, this requires maintaining a mapping between stable, UUID-based URIs on the one hand, and
unstable, path-based identifiers on the other hand, to ensure that modifications in the file
system are properly reflected in the Linked Data representation. In Section 3.6 we outline our
strategy to accomplish this.
<urn:uuid:887d728e-bc12-4f28-a497-7d66439086e9>
   a tripfs:File ;
   rdfs:label "eswc2009-schandl.pdf" ;
   tripfs:local-name "eswc2009-schandl.pdf"^^xsd:string ;
   tripfs:path "/Users/bs/Data/work/papers/2009/eswc/eswc2009-schandl.pdf"^^xsd:string ;
   tripfs:size "425561"^^xsd:long ;
   tripfs:modified "2009-03-11T02:38:45"^^xsd:dateTime ;
   tripfs:parent <urn:uuid:35069c61-451e-4688-98f5-080924b261f4> .

                                     Figure 1: A Linked Data representation of a PDF file
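
The identifier scheme of Section 3.1 can be illustrated with a minimal sketch. It is written in Python for brevity and is not the TripFS implementation (which is Java-based); the class name UriRegistry and its methods are hypothetical. It shows how a random UUID URN, once minted, can stay attached to a file while the path it maps to changes.

```python
import uuid


class UriRegistry:
    """Hypothetical mapping between stable UUID URNs and unstable local paths."""

    def __init__(self):
        self._urn_by_path = {}

    def on_create(self, path):
        # Mint a random (version 4) UUID and use its urn:uuid: form as identifier.
        urn = uuid.uuid4().urn
        self._urn_by_path[path] = urn
        return urn

    def on_move(self, old_path, new_path):
        # A move or rename changes only the path; the URN stays the same.
        urn = self._urn_by_path.pop(old_path)
        self._urn_by_path[new_path] = urn
        return urn

    def on_delete(self, path):
        # A deleted file no longer has a resolvable identifier.
        return self._urn_by_path.pop(path)

    def urn_for(self, path):
        return self._urn_by_path.get(path)
```

In this sketch, renaming a file only re-keys the mapping, so clients holding the URN are unaffected; a persistent store would be needed for the mapping to survive restarts.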


<urn:uuid:887d728e-bc12-4f28-a497-7d66439086e9>
   a tripfs:File, foaf:Document, nfo:FileDataObject ;
   tripfs:parent <urn:uuid:35069c61-451e-4688-98f5-080924b261f4> ;
   nfo:belongsToContainer <urn:uuid:35069c61-451e-4688-98f5-080924b261f4> .

<urn:uuid:35069c61-451e-4688-98f5-080924b261f4>
   a tripfs:Directory, dctype:Collection, nfo:FileDataObject, nfo:Folder ;
   tripfs:child <urn:uuid:887d728e-bc12-4f28-a497-7d66439086e9> ;
   nie:hasPart <urn:uuid:887d728e-bc12-4f28-a497-7d66439086e9> .

                    Figure 2: Interoperability through the usage of multiple overlapping vocabularies
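
As an illustration of how a description in the style of Figure 1 might be derived from the file system, the following Python sketch reads name, size and modification time with the standard library and prints them as Turtle. The function describe_file and the helper _escape are hypothetical names; prefix declarations are omitted as in the figures, and the timestamp is rendered in local time without a time zone.

```python
import os
from datetime import datetime


def _escape(text):
    # Escape backslashes and double quotes for use inside a Turtle string literal.
    return text.replace("\\", "\\\\").replace('"', '\\"')


def describe_file(urn, parent_urn, path):
    """Return a Turtle description of one file, using the property names of Figure 1."""
    info = os.stat(path)
    name = _escape(os.path.basename(path))
    modified = datetime.fromtimestamp(info.st_mtime).replace(microsecond=0).isoformat()
    lines = [
        f"<{urn}>",
        "   a tripfs:File ;",
        f'   rdfs:label "{name}" ;',
        f'   tripfs:local-name "{name}"^^xsd:string ;',
        f'   tripfs:path "{_escape(path)}"^^xsd:string ;',
        f'   tripfs:size "{info.st_size}"^^xsd:long ;',
        f'   tripfs:modified "{modified}"^^xsd:dateTime ;',
        f"   tripfs:parent <{parent_urn}> .",
    ]
    return "\n".join(lines)
```

Additional type statements from other vocabularies, as in Figure 2, could be appended to the same resource in the same way.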


3.2   Files and Directories as Web Resources
   The parent-child relationships between files and directories can be represented as RDF triples
with appropriate predicates. Several triples are added to each file or directory resource that
convey data that are directly retrieved from the file system: the local name (i.e., the actual
file or directory name without the entire path information), the file size, and the dates of
creation and last modification. An example of a file’s RDF representation is depicted in
Figure 1. Resources that represent files or directories are internally identified by UUID-based
URNs; for serving them as Linked Data they are dynamically rewritten to HTTP URIs (cf.
Section 3.7).

3.3   Vocabularies
   In order to describe files, directories, their metadata and their relations as RDF, we have
developed a simple OWL vocabulary published at http://purl.org/tripfs/2010/02#. We have derived
our vocabulary from existing semantic vocabularies as much as possible. However, as it is
currently uncommon to expose file resources as Linked Data, we observed a lack of
community-accepted vocabularies for this purpose. To the best of our knowledge, only the NEPOMUK
File Ontology² (NFO) has been specifically defined to model the contents of file systems. It
provides terms to describe files, directories, and their properties. Our vocabulary is aligned
with NFO and provides more specialized terms, according to our system’s requirements.
   A number of other vocabularies, however, have a general notion of the concept of documents,
and usually align this concept to the foaf:Document class. On the other hand, several
vocabularies have a notion of collections, which can be compared to directories in a file system;
for instance, OAI-ORE [17] or Dublin Core³. The Dublin Core Type Vocabulary⁴, as another example,
defines terms for different resource types as well as collections. Additionally, there exists a
large number of vocabularies that can be used to identify media types and their specifics; e.g.,
the MPEG-7 ontology⁵, the Music Ontology⁶, or the set of NEPOMUK ontologies.
   To reach a maximum level of interoperability, a data source should aim to adhere to commonly
accepted vocabularies as much as possible. The RDF semantics allows different, unrelated
vocabularies to be mixed arbitrarily; therefore we propose, in addition to using a custom
vocabulary, to model file system data using the NFO vocabulary, and to add type information from
popular vocabularies like Dublin Core and FOAF as they fit. By serving data using multiple, even
already aligned vocabularies, we relieve data consumers of the need to perform additional
inference. An example of such a mixed representation is presented in Figure 2.

² http://www.semanticdesktop.org/ontologies/nfo/
³ http://dublincore.org/groups/collections/collection-application-profile/
⁴ http://dublincore.org/documents/dcmi-type-vocabulary/
⁵ http://metadata.net/mpeg7
⁶ http://musicontology.com

3.4   Extracting Semantic File Metadata
   Current file systems provide only a limited set of low-level metadata attributes associated
with files such as name, owner, size, creation and modification date, or permission attributes.
Modern file systems provide additional means to store higher-level metadata, like extended
attributes or multiple data streams; however, these are only useful if they are actually
populated by applications, which is rarely the case.
1    <urn:uuid:887d728e-bc12-4f28-a497-7d66439086e9>
2       nie:mimeType "application/pdf" ;
3       nie:title "The Sile Model --- A Semantic File System Infrastructure for the Desktop" ;
4       nfo:pageCount 15 .
5

6    <urn:uuid:a998272d-45f0-4814-8f15-be5db5fe811a>
7       nie:mimeType "audio/mpeg" ;
8       nid3:title "Bohemian Rhapsody" ;
9       nid3:leadArtist [ nco:fullname "Queen" ] ;
10      nid3:length 355106 .

                                    Figure 3: Metadata extracted from a PDF and an MP3 file
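
Metadata such as those in Figure 3 are produced by extractor components that are applied one after another to each file (Section 3.4). The following Python sketch illustrates the idea of such a cascade under simplifying assumptions: the function names are hypothetical, extractors receive the raw file bytes, and type detection relies only on the leading bytes "%PDF-" (PDF) and "ID3" (MP3 files that start with an ID3v2 tag). It is not the Aperture-based implementation described in Section 4.

```python
def size_extractor(data):
    # Low-level metadata: size of the content in bytes.
    return [("tripfs:size", len(data))]


def mime_extractor(data):
    # Rough type sniffing based on the first bytes of the file.
    if data.startswith(b"%PDF-"):
        return [("nie:mimeType", "application/pdf")]
    if data.startswith(b"ID3"):
        return [("nie:mimeType", "audio/mpeg")]
    return []


def run_pipeline(subject, data, extractors):
    """Apply each extractor in order and collect (subject, predicate, object) triples."""
    triples = []
    for extract in extractors:
        for predicate, obj in extract(data):
            triples.append((subject, predicate, obj))
    return triples
```

A call such as run_pipeline(urn, data, [size_extractor, mime_extractor]) returns the union of all extracted statements; supporting a further file type only requires adding another function to the list.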


<urn:uuid:887d728e-bc12-4f28-a497-7d66439086e9>
   owl:sameAs <http://dblp.l3s.de/d2r/resource/publications/conf/esws/SchandlH09> .

<urn:uuid:a998272d-45f0-4814-8f15-be5db5fe811a>
   rdfs:seeAlso <http://musicbrainz.org/track/c7faf83f-9cb3-4de4-a39f-1c1f98b8d81a> ,
      <http://musicbrainz.org/track/95ebc842-9926-4658-8012-12c358247946> ;
   owl:sameAs <http://musicbrainz.org/track/bbd5a2e7-9814-4988-8f5a-dc38c208eeea> ,
      <http://musicbrainz.org/track/064c440c-4eba-47a6-83c4-c91a979eeb4b> .

                                         Figure 4: External links to DBLP and MusicBrainz
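
Links like those in Figure 4 are derived from string similarity between extracted metadata and the labels of external resources. The following Python sketch shows one way such a linker could be structured. The similarity measure (difflib's SequenceMatcher), the two thresholds, and the rule that maps high scores to owl:sameAs and medium scores to rdfs:seeAlso are assumptions made for illustration; the paper does not specify them.

```python
from difflib import SequenceMatcher

SAME_AS_THRESHOLD = 0.9   # assumed cut-off for owl:sameAs
SEE_ALSO_THRESHOLD = 0.6  # assumed cut-off for rdfs:seeAlso


def similarity(a, b):
    # Case-insensitive similarity in the range [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link(label, candidates):
    """Compare a local label with candidate labels (URI -> label) and emit links."""
    links = []
    for uri, candidate_label in candidates.items():
        score = similarity(label, candidate_label)
        if score >= SAME_AS_THRESHOLD:
            links.append(("owl:sameAs", uri))
        elif score >= SEE_ALSO_THRESHOLD:
            links.append(("rdfs:seeAlso", uri))
    return links
```

For an audio file, the label would be the combination of track title and artist name; the candidate set would be obtained beforehand from the external service.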


   As it is one of the Linked Data principles to “provide useful information” about a resource
when a client dereferences its URI, it is desirable to extract additional, descriptive metadata
from files and directories and expose them also as Linked Data. Reconsider, for example,
Scenario A described in Section 2, where the value of file-system level metadata (like file size,
file type, or file permissions) is limited; higher-level descriptive metadata that can be used
for selective retrieval of files or their descriptions, e.g., via SPARQL, is required. However,
the combination of these metadata enables sophisticated discovery, retrieval and access methods
based on (i) the parent/child relations of file system objects, (ii) low-level file system
metadata, and (iii) high-level content-based metadata.
   The problem of extracting metadata from file systems has been studied for a long time. The
biggest challenge in this field is the data diversity found in file systems, which is imposed by
the multitude of different file types. To illustrate this, currently more than 51,000 file types
are registered at the popular FILExt service⁷. Different file types exhibit different internal
structures, and consequently different metadata can be extracted. It is therefore impractical to
provide metadata extractors for this large number of different file types within a single
software component. It is instead more feasible to define a generic metadata extraction framework
that allows specific extraction components for different file types to be plugged in. In this
way, the system can be tailored to the respective application context.
   In our approach, extractors read files and extract an RDF graph that contains triples
representing the extracted metadata. Multiple extractors can be cascaded into an extractor
pipeline and are sequentially applied to each object. The resulting RDF graphs are stored in the
triple store and are then served as part of the file’s and directory’s description via the Linked
Data interface. Extractors may extract not only file metadata (i.e., data about the documents
represented by files), but also entities that are related to files (e.g., the artist who has
performed the music stored in an MP3 file) and can in turn be linked to external data sources.
   As an example, Figure 3 shows the RDF representation of metadata that have been extracted from
two files; the first resource represents a PDF document containing a scientific publication, the
second represents an MP3 audio file⁸. The blank node used to identify the artist in this example
(line 9) needs to be dynamically rewritten to a stable, dereferenceable URI by the Web server
(see Section 3.7).

⁷ http://filext.com
⁸ In this example we have used terms from the OSCAF/NEPOMUK ontologies
  (http://www.semanticdesktop.org/ontologies).

3.5   Linking Files to External Sources
   Once files and directories are represented as RDF resources it is possible to link them to
other related resources on the Web. Doing so allows clients to retrieve more potentially
interesting information about the resource. For instance, files may be classified according to a
classification scheme that uses dereferenceable URIs as identifiers; in this case, clients are
enabled to query for files using these terms.
   The task of linking files and directories to external resources can be accomplished by tools
that provide this functionality for generic Web resources, which usually apply various heuristics
to detect semantically related resources (e.g., shared identifiers or object similarity [22]).
These heuristics depend on the information that is available for a particular entity; therefore
in the context of a file system they depend on the data provided by metadata extraction
components, as described in the previous section.
 Event            Reaction                                         and links to external data sources, the resulting RDF graph
                                                                   can be served according to Linked Data principles. For
 Creation         Mint URI, add resource to RDF graph,
                  perform extraction and linking                   this purpose, internal UUID-based URNs are dynamically
 Deletion         Delete respective resource and associated        rewritten to HTTP-based URIs with a configurable host
                  metadata from RDF graph                          part; e.g., http://example.com:8080/resource/<uuid>. It
 Move/Rename      Update local path properties in the RDF graph    is considered good practice [7] to serve at least two variants
 Update           Re-extract features, re-link, update RDF graph   of the data, an RDF representation for machines and an
                                                                   HTML representation for human consumption, and to let
                                                                   clients choose which representation they prefer using HTTP
Table 1: Reactions on file system events detected by               content negotiation. In addition to serving resources accord-
the watcher component                                              ing to Linked Data principles, it is recommended to provide
                                                                   a SPARQL endpoint [10] to allow clients to search for re-
                                                                   sources based on their RDF descriptions. Furthermore, the
                                                                   actual file data itself can be downloaded to the client. In
  As a consequence, we follow the same strategy as for meta-       the special case where the Linked Data resources are re-
data extractors and do not provide an all-in-one solution to       trieved locally (i.e., server and client are executed on the
the problem of linking files to external resources, but in-        same machine), the Web server can add links to the HTML
stead provide a framework that allows specialized linking          interface that allow the user to directly open directories or
components to be plugged in. These linking components              launch files from the browser, thus providing a seamless in-
can access not only the raw file data, but also extracted          teraction experience. Figure 5 shows a screenshot of such
metadata, and use this information as basis for interlink-         an HTML-based interface, which provides these options to
ing. Like extractors, linking components return RDF triples        the user.
which are added to the metadata model and served via the
Linked Data interface.
  As an example, Figure 4 shows to which external sources a        4.   IMPLEMENTATION
scientific publication and a music file can be linked, based on       TripFS has been designed as a modular service framework,
string similarity between the publication title and the com-       which defines plug-in interfaces that can be used to extend
bination of track title and artist name, respectively. In this     and adapt the system to the actual needs of the use case,
example the PDF document from Figure 3 has been linked             the file types to be served, and the special characteristics
to the Linked Data variant of the popular DBLP publication         of the underlying operating system. Such interfaces exist
database, and the MP3 file has been linked to resources of         for RDF storage components, file metadata extractors, file
the MusicBrainz service.                                           linkers, and file system crawlers (responsible for crawling
                                                                   a configured subtree of the file system) and watchers (re-
3.6    Maintaining Consistency                                     sponsible for maintaining the consistency of the mapping
   As described in Section 3.1, it is required to mint a UUID-     between external UUID-based URIs and internal file-based
based URI for each file and directory, which can be consid-        URIs). The system’s architecture is depicted in Figure 6.
ered globally unique from a practical point of view. However, without further precautions such
URIs might be quite unstable, as the mapping between a UUID-based external URI and a file-based
internal URI is invalidated whenever a referenced file is moved, removed, or renamed. Further,
updating such files may result in inconsistencies between a file and the metadata that has been
previously extracted and stored. Note that this could also lead to invalid links between
resources if these were automatically created based on file metadata, as described in Section 3.5.
   In order to preserve a stable mapping between these URIs and the local files and directories
they represent, we have to employ a watcher component that is responsible for detecting file
system events that may result in different file URIs or modified file contents of referenced
files. Whenever such an event is detected, appropriate actions have to be taken, and the RDF
model has to be updated. Note that in this sense, the mapping between stable UUID-based URIs and
unstable file and directory paths acts as a kind of translation service between external,
globally valid UUID-based URIs and corresponding local file URIs, comparable to PURL or DOI
services [2]. Table 1 summarizes the reactions that have to be taken after file system events
have been detected.

3.7    Serving File Systems as Web Resources

   Once the RDF-based representation of files and directories has been generated and enriched
with extracted metadata

   The TripFS core is a standalone server application, which has been implemented in pure Java,
based on the Jena Semantic Web framework⁹. On startup, it crawls a configured sub-tree of the
local file system, applies extractor and linker components to crawled files, and stores the
resulting RDF triples in a triple store (either in memory or persistent). It initializes the
watcher component to monitor the exposed file system sub-tree, which in turn notifies TripFS
upon changes to files or directories. Subsequently, the RDF model is updated accordingly, and
extractors and linkers are re-applied to the modified objects.
   Metadata Extraction and Linking. We have implemented simple extractors that extract low-level
file metadata, such as name, file size, or a hash sum that could, for example, be used to
identify and link equal files across different TripFS instances.
   Further, we have implemented extractor components based on the Aperture metadata extraction
framework¹⁰, which provides a multitude of extractors for many different file types, including
Office documents and multimedia data. As a proof of concept, we have also implemented several
linker components: one that links documents, based on their titles, to resources in the DBLP
data set; one that links audio files to MusicBrainz by analyzing track title and artist

⁹ An evaluation version of TripFS can be obtained from http://www.cs.univie.ac.at/tripfs.
¹⁰ http://aperture.sourceforge.net
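
The translation between stable UUID-based URIs and unstable file paths described in Section 3.6 can be illustrated with a minimal Python sketch. It is not the TripFS implementation: the class StableUriMap and its method names are hypothetical, random (version 4) UUIDs are an assumption, and the reactions to move and remove events are simplified guesses rather than the reactions listed in Table 1.

```python
import uuid


class StableUriMap:
    """Hypothetical two-way index between stable UUID URNs and current paths."""

    def __init__(self):
        self._path_by_uri = {}
        self._uri_by_path = {}

    def register(self, path):
        """Return the stable URI for a path, minting one on first sight."""
        if path not in self._uri_by_path:
            uri = uuid.uuid4().urn  # assumption: random UUIDs, 'urn:uuid:...'
            self._uri_by_path[path] = uri
            self._path_by_uri[uri] = path
        return self._uri_by_path[path]

    def on_moved(self, old_path, new_path):
        """A move or rename changes only the internal path, never the URI."""
        uri = self._uri_by_path.pop(old_path)
        self._uri_by_path[new_path] = uri
        self._path_by_uri[uri] = new_path
        return uri

    def on_removed(self, path):
        """Drop the mapping; the URI no longer resolves to a local file."""
        uri = self._uri_by_path.pop(path)
        del self._path_by_uri[uri]
        return uri

    def resolve(self, uri):
        """Translate an external URI to the current local path, or None."""
        return self._path_by_uri.get(uri)
```

A watcher would call on_moved or on_removed when it detects the corresponding event, so that external references to the UUID-based URI keep resolving after a file has been renamed.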
[Figure 5 (image not reproduced). Labels: Path-based navigation; Direct file access; Metadata access; Link-based navigation; Extracted metadata]

Figure 5: Accessing local files via a Linked Data representation


name, and one that links files to potentially interesting DBpedia resources via the DBpedia
lookup service. Both the set of extractors and the set of linkers are to be understood as proofs
of concept; they by no means leverage the full potential of the presented approach. However, as
described before, more extractors and linkers can be integrated easily according to the needs of
an actual use case.
   Maintaining Consistency. We have used DSNotify [19] as an implementation for the watcher
component. DSNotify is a change detection add-on for data sources, supporting them in
maintaining link integrity in their data. At its core, DSNotify extracts feature vectors from
considered data entities that are used in heuristic comparisons to determine whether items that
are no longer found at their original locations were in fact removed or moved to another
location. DSNotify can easily be extended by implementing custom crawlers, feature extractors,
and comparison heuristics.

   Feature             Datatype    Similarity                  Weight
   Last access         Date        Plausibility
   Last modification   Date        Plausibility
   IsDirectory         Bool        Plausibility
   Checksum            Integer     Plausibility
   Name                String      Levenshtein                 3.0
   Extension           String      Major MIME type equality    1.0
   Path                String      Levenshtein                 0.5
   Size                Long        Equality                    0.1
   Permissions         Bitstring   Equality                    0.1

Table 2: Extracted features, their data type and the strategy used to calculate a similarity
between them. Features that are used only in plausibility checks have a value Plausibility here.
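
To make the notion of a feature vector concrete, the following minimal Python sketch collects the attributes listed in Table 2 for a single file or directory. It is an illustration only and not DSNotify's extractor: the names FileFeatures, crc32_of_file and extract_features are hypothetical, and CRC32 is an assumed choice, since the checksum algorithm is not specified here.

```python
import os
import stat
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FileFeatures:
    """Hypothetical feature vector mirroring the rows of Table 2."""
    last_access: float        # seconds since the epoch
    last_modification: float  # seconds since the epoch
    is_directory: bool
    checksum: int             # CRC32 of the content (assumed); 0 for directories
    name: str
    extension: str
    path: str
    size: int
    permissions: int          # permission bits as reported by the OS


def crc32_of_file(path, chunk_size=65536):
    """Compute a CRC32 checksum without loading the whole file into memory."""
    crc = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc


def extract_features(path):
    """Build the feature vector for one file or directory."""
    info = os.stat(path)
    is_dir = stat.S_ISDIR(info.st_mode)
    name = os.path.basename(os.path.normpath(path))
    return FileFeatures(
        last_access=info.st_atime,
        last_modification=info.st_mtime,
        is_directory=is_dir,
        checksum=0 if is_dir else crc32_of_file(path),
        name=name,
        extension=os.path.splitext(name)[1].lstrip(".").lower(),
        path=os.path.abspath(path),
        size=info.st_size,
        permissions=stat.S_IMODE(info.st_mode),
    )
```

Storing such records in an index keyed by path would allow a later crawl to compare the features of a vanished item against those of newly discovered items.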
   We have implemented a generic file-feature extractor for DSNotify that extracts low-level
features from local files (cf. Table 2)¹¹. Further, we have developed a simple heuristic that
calculates the plausibility that a file (described by the feature vector X) was moved to another
location (the file there being described by the feature vector Y). This heuristic consists of
two parts. First, plausibility checks are performed. For example, if the last modification date
of file Y is before the one of file X, it cannot be a successor of X. Another example is that a
file cannot become a directory or vice versa (checked by the isDirectory feature). Second, a
similarity metric between the remaining features is calculated by using the strategies listed in
Table 2. The resulting similarities are weighted¹² (e.g., the name similarity is considered more
important than equal file sizes), summed up, and normalized. These similarities are then used by
DSNotify to detect move, remove and create events. Furthermore, DSNotify reports update events
based on changes in the extracted feature vectors (cf. [19]).
   DSNotify periodically monitors the file subtree that is exposed by TripFS, extracts feature
vectors based on the file attributes described before, and stores these vectors in an index.
DSNotify uses a native C++ component for efficiently monitoring the local file system that makes
use of the Windows API function FindNextChangeNotification(). We

¹¹ The set of extracted features used by DSNotify overlaps with, but is not equal to, the set of
metadata attributes extracted and exposed by TripFS. In the current implementation, these latter
metadata are stored in the RDF graph while DSNotify stores features in its own indices.
¹² The selection of features as well as their weights was our own subjective choice based on
several test runs with the system. We consider an extensive evaluation of DSNotify as a tool for
detecting file system events as future work.
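
The two-part move heuristic (plausibility checks first, then a weighted and normalized similarity) can be sketched in Python as follows. Only the weights and the per-feature strategies are taken from Table 2. All names, the mapping of edit distance to the range [0, 1], and the tiny MAJOR_MIME lookup are assumptions for illustration. The last-access and checksum plausibility checks are omitted because their rules are not spelled out here.

```python
from dataclasses import dataclass

# Weights copied from Table 2; everything else in this sketch is an assumption.
WEIGHTS = {"name": 3.0, "extension": 1.0, "path": 0.5, "size": 0.1, "permissions": 0.1}

# Tiny stand-in for a MIME lookup; a real system would use a fuller registry.
MAJOR_MIME = {"jpg": "image", "png": "image", "txt": "text", "html": "text", "mp3": "audio"}


@dataclass(frozen=True)
class Features:
    name: str
    extension: str
    path: str
    size: int
    permissions: int
    last_modification: float
    is_directory: bool


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping two rows at a time."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def string_similarity(a, b):
    """Map edit distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def same_major_mime(ext_a, ext_b):
    """1.0 if both extensions share a major MIME type, else 0.0."""
    ext_a, ext_b = ext_a.lower(), ext_b.lower()
    if ext_a == ext_b:
        return 1.0
    major_a = MAJOR_MIME.get(ext_a)
    return 1.0 if major_a is not None and major_a == MAJOR_MIME.get(ext_b) else 0.0


def is_plausible_successor(x, y):
    """Y must not be older than X, and a file cannot become a directory."""
    if y.last_modification < x.last_modification:
        return False
    return x.is_directory == y.is_directory


def move_similarity(x, y):
    """Return a score in [0, 1], or 0.0 if y cannot be a successor of x."""
    if not is_plausible_successor(x, y):
        return 0.0
    scores = {
        "name": string_similarity(x.name, y.name),
        "extension": same_major_mime(x.extension, y.extension),
        "path": string_similarity(x.path, y.path),
        "size": 1.0 if x.size == y.size else 0.0,
        "permissions": 1.0 if x.permissions == y.permissions else 0.0,
    }
    weighted = sum(WEIGHTS[key] * value for key, value in scores.items())
    return weighted / sum(WEIGHTS.values())
```

A detector built on this sketch would compare a vanished item against each newly found item and report a move when the best score exceeds some threshold, and a removal otherwise. The threshold is a tuning parameter that is not given here.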
have also implemented a generic, yet less efficient Java-based monitor component that should
work on all common platforms. This allows us to re-crawl the respective subdirectory tree only
if there were actual changes reported by the operating system. The detected events are then
forwarded to TripFS; the file's path is updated in the RDF model, and extractors and linkers are
re-applied.

[Figure 6 (image not reproduced). Labels: HTTP; SPARQL; Linked Data Interface; RDF; Watcher; Crawler; Extractors; Linkers; TripFS; FILE/CIFS/SMB/NFS/...; Local Filesystem]

Figure 6: TripFS architecture

   Linked Data Interface. TripFS includes an embedded Jetty Web server, which serves data from
the triple store, as described in Section 3.7. It dynamically rewrites the internally used UUIDs
and blank nodes to dereferenceable HTTP URIs, and provides XHTML+RDFa and pure RDF
representations of file and directory resources, as well as a SPARQL endpoint. It further allows
clients to directly download file contents and, in the case of local requests, to directly
launch these files.
   No component of TripFS makes any changes to the exposed file system; i.e., no special files
or directories (like those needed, e.g., for SVN) are created. Currently, TripFS also does not
provide means to modify file systems via the Linked Data interface.

5.    RELATED WORK

   Although modern file systems support the creation, storage, management, and retrieval of
file-related metadata (e.g., using extended attributes or file forks), they remain mostly
isolated from Web-based information integration and exchange contexts. Even file systems that
provide sophisticated support for file annotations or links (e.g., LiFS [1] or AttrFS [23]) do
not consider a global Web context but often restrict their features to objects within the local
system. On the other hand, Web-based file systems usually focus on performance (e.g., [12]) or
security (e.g., [4]), but not on semantically rich file descriptions or metadata
interoperability. In this respect, TripFS can be seen as complementary to metadata-rich or
highly scalable file systems in order to bridge the gap between file systems and Web
environments. In combination with other works that represent Web resources as virtual file
systems (e.g., [21]), local file systems and remote Web resources can be seamlessly integrated,
providing unified programming interfaces and a consistent user experience.
   As described before, file system contents are highly diverse and heterogeneous, and contain
information that is valuable in many scenarios. TripFS presents a generic framework to expose
these contents as Linked Data, but does not by itself extract higher-level metadata from files.
For this, it relies on additional components, of which a wide variety exists. The Aperture
metadata extraction framework was already mentioned before; it is based on the Gnowsis adapter
framework [20] and is capable of extracting RDF descriptions from a wide range of files and
other data sources. For most file types there exist extractors that return RDF descriptions of
the file content, ranging from BibTeX files over calendar data to JPEG images; a list of these
extractors is maintained at the W3C ESW Wiki¹³. Such conversion or extraction components also
exist for Web sources, e.g., PiggyBank [15] or Virtuoso Sponger technology¹⁴, which create RDF
descriptions from a multitude of Web sources on the fly.
   TripFS is in line with a number of other generic frameworks that allow one to expose Linked
Data based on a different underlying data representation. Frameworks in this area include D2R
[8] and Triplify [3] for relational databases, SparqPlug [11] for DOM-based sources, OAI2LOD
[14] for OAI-PMH repositories, and XLWrap [18] for spreadsheet data. With TripFS, file system
contents can likewise be made "first-class citizens" of the Web of Data and can be seamlessly
integrated with all these other data sources.

¹³ http://esw.w3.org/topic/ConverterToRdf
¹⁴ http://docs.openlinksw.com/virtuoso/virtuososponger.html

6.   CONCLUSIONS AND FUTURE WORK

   In this paper we have presented and discussed TripFS, a service that exposes local file
systems according to Linked Data principles. This approach potentially brings benefit to a range
of application scenarios (cf. Section 2). In an enterprise information integration scenario
(Scenario A), files are assigned stable, globally unique URIs and can therefore be referenced
from external systems. Metadata that are extracted from files can be indexed by Semantic Web
search engines, and links to other (enterprise-internal or external) data sources can increase
the quality of information organization and data retrieval.
   A lightweight component like TripFS can also be used in ad-hoc file sharing situations
(Scenario B): participants in a face-to-face meeting can easily set up and start the sharing
server, which exposes a certain sub-tree of their file system as Linked Data. This enables
collaborators in the same network to access and retrieve these files, based not only on
low-level characteristics like file name, but also using extracted semantic metadata and links.
Using additional components, more intuitive approaches like faceted navigation can be performed
on top of extracted data, and more experienced users are enabled to issue complex SPARQL queries
over the file system.
   A Linked Data representation of file systems also facilitates the application of Web-based
annotation services (Scenario C), which overcomes the limitations of the hierarchical directory
metaphor for file organization. Such annotations can refer to single files or even parts
thereof, and can range from simple text-based comments to complex descriptions that may refer to
external entities and concepts. TripFS makes file systems a part of a global, uniform Web of
Data and therefore allows one to apply Web-based annotation techniques immediately to file
system objects.
   In future work, we plan an extensive evaluation of TripFS,
in particular regarding the performance and scalability of our approach. For this purpose, we
aim to apply TripFS in a concrete enterprise information integration setting, and we plan to
develop a simple user interface that allows end users to more easily share their files using
Linked Data technologies. Further, we plan to improve and evaluate the accuracy of the DSNotify
component for detecting file system events.
   Additionally, we plan to introduce a more fine-grained model for selecting what file system
objects are exposed via TripFS (currently one can select only a single subtree of the file
system) and implement a secure HTTPS version that takes privacy considerations into account.

Acknowledgements
Parts of this work have been funded by FIT-IT grants 812513 and 815133 from the Austrian Federal
Ministry of Transport, Innovation, and Technology.

7.   REFERENCES
 [1] Sasha Ames, Nikhil Bobb, Kevin M. Greenan, Owen S. Hofmann, Mark W. Storer, Carlos Maltzahn,
     Ethan L. Miller, and Scott A. Brandt. LiFS: An Attribute-Rich File System for Storage Class
     Memories. In Proceedings of the 23rd IEEE / 14th NASA Goddard Conference on Mass Storage
     Systems and Technologies, 2006.
 [2] William Y. Arms. Uniform Resource Names: Handles, PURLs, and Digital Object Identifiers.
     Commun. ACM, 44(5):68, 2001.
 [3] Sören Auer, Sebastian Dietzold, Jens Lehmann, Sebastian Hellmann, and David Aumueller.
     Triplify: Light-weight Linked Data Publication from Relational Databases. In WWW '09:
     Proceedings of the 18th international conference on World wide web, pages 621–630, New York,
     NY, USA, 2009. ACM.
 [4] Arati Baliga, Joe Kilian, and Liviu Iftode. A Web-based Covert File System. In Proceedings
     of the 11th Workshop on Hot Topics in Operating Systems, 2007.
 [5] T. Berners-Lee, L. Masinter, and M. McCahill. Uniform Resource Locators (URL) (RFC 1738).
     Network Working Group, 1994.
 [6] Tim Berners-Lee. Linked Data. World Wide Web Consortium, 2006. Available at
     http://www.w3.org/DesignIssues/LinkedData.html, retrieved 08-Aug-2008.
 [7] Chris Bizer, Richard Cyganiak, and Tom Heath. How to Publish Linked Data on the Web, 2007.
     Available at http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/, retrieved
     02-Dec-2008.
 [8] Chris Bizer and Andy Seaborne. D2RQ - Treating Non-RDF Databases as Virtual RDF Graphs. In
     Poster at the 3rd International Semantic Web Conference (ISWC2004), 2004.
 [9] Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked Data - The Story So Far.
     International Journal on Semantic Web and Information Systems, 5(3), 2009.
[10] Kendall Grant Clark, Lee Feigenbaum, and Elias Torres. SPARQL Protocol for RDF (W3C
     Recommendation 15 January 2008). World Wide Web Consortium, 2008.
[11] Peter Coetzee, Tom Heath, and Enrico Motta. SparqPlug: Generating Linked Data from Legacy
     HTML, SPARQL and the DOM. In Proceedings of the First International Workshop on Linked Data
     on the Web (LDOW), 2008.
[12] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. In 19th ACM
     Symposium on Operating Systems Principles, 2003.
[13] Bernhard Haslhofer, Wolfgang Jochum, Ross King, Christian Sadilek, and Karin Schellner. The
     LEMO Annotation Framework: Weaving Multimedia Annotations with the Web. International
     Journal on Digital Libraries, 10(1), 2009.
[14] Bernhard Haslhofer and Bernhard Schandl. The OAI2LOD Server: Exposing OAI-PMH Metadata as
     Linked Data. In International Workshop on Linked Data on the Web (LDOW2008), 2008.
[15] David Huynh, Stefano Mazzocchi, and David R. Karger. Piggy Bank: Experience the Semantic Web
     Inside Your Web Browser. In International Semantic Web Conference, volume 3729 of Lecture
     Notes in Computer Science, pages 413–430. Springer, 2005.
[16] Georgi Kobilarov, Tom Scott, Yves Raimond, Silver Oliver, Chris Sizemore, Michael Smethurst,
     Christian Bizer, and Robert Lee. Media Meets Semantic Web - How the BBC Uses DBpedia and
     Linked Data to Make Connections. In Proceedings of the 6th European Semantic Web Conference,
     pages 723–737, Berlin, Heidelberg, 2009. Springer-Verlag.
[17] Carl Lagoze and Herbert Van de Sompel. ORE Specification - Abstract Data Model. Open
     Archives Initiative, 2008. Available at http://www.openarchives.org/ore/1.0/datamodel.
[18] Andreas Langegger and Wolfram Wöß. XLWrap - Querying and Integrating Arbitrary Spreadsheets
     with SPARQL. In International Semantic Web Conference. Springer, 2009.
[19] Niko Popitsch and Bernhard Haslhofer. DSNotify: Handling Broken Links in the Web of Data. In
     19th International WWW Conference (WWW2010), Raleigh, NC, USA, 2010. ACM. To be published.
[20] Leo Sauermann and Sven Schwarz. Gnowsis Adapter Framework: Treating Structured Data Sources
     as Virtual RDF Graphs. In Proceedings of the 4th International Semantic Web Conference (ISWC
     2005), pages 1016–1028. Springer-Verlag GmbH, 2005.
[21] Bernhard Schandl. Representing Linked Data as Virtual File Systems. In Proceedings of the
     2nd International Workshop on Linked Data on the Web (LDOW), Madrid, Spain, 2009.
[22] Julius Volz, Christian Bizer, Martin Gaedke, and Georgi Kobilarov. Discovering and
     Maintaining Links on the Web of Data. In Proceedings of the 8th International Semantic Web
     Conference (ISWC 2009), 2009.
[23] C.E. Wills, D. Giampaolo, and M.S. Mackovitch. Experience with an Interactive
     Attribute-based User Information Environment. In Computers and Communications, 1995.
     Conference Proceedings of the 1995 IEEE Fourteenth Annual International Phoenix Conference
     on, pages 359–365, Mar 1995.