Proceedings of the 3rd International Workshop on Semantic Digital Archives (SDA 2013)


     A Storage Ontology for Hierarchical Storage
               Management Systems

      Sandro Schmidt, Torsten Wauer, Ronny Fritzsche, and Klaus Meißner

                    University of Dresden, 01062 Dresden, Germany
                 {sandro.schmidt, torsten.wauer, ronny.fritzsche,
                           klaus.meissner}@tu-dresden.de


        Abstract. The increasing capacity of storage media could store a huge
        amount of data while the price per Gigabyte is decreasing. On the other
        hand users and companies produce much more information that still
        overruns the available storage capacities. As a consequence, the man-
        agement and storage of this data is very complex and expensive. Users
        and apps want to access their files directly and without any time delay,
        resulting in using fast but very expensive storage media. A deeper look
        on the usage of data, especially in companies, shows that only a very
        small amount of it is used every day. Summarizing these facts, a con-
        cept is needed to find out which data is important to provide them on
        fastest storage media, and less important one on cheapest storage me-
        dia. Concepts derived from the Semantic Desktop can be a solution. We
        introduce a concept to describe files by an ontology. This allows for the
        description of their relations to each other, as well as their attributes.
        The resulting ontology offers an extensive volume of data that helps to
        find the adequate storage conditions for every single file. Another great
        advantage that showed up, is the independence from file system accesses
        to gather information about the stored files.

        Keywords: Semantic Storage, Ontology, Hierarchical Storage Manage-
        ment


1     Introduction

As Gantz described in his IDC White Paper [4], there is a huge amount of
information that overruns the available amount of storage capacity. That does
not mean that all of this information represented as files is important during the
whole lifetime. Studies showed that only 1 % of all files stored in a file system with
200,000 files are modified daily [15]. Furthermore, some data are more important,
oriented on their content than others, or some should only be stored on special
storage media. Some files are duplicates or redundant. As a consequence, only
    Acknowledgments - Parts of this paper have been researched within the scope of the
    SENSE project which is funded by the Federal Ministry of Education and Research,
    German Aerospace Center. It is part of the KMU-Innovativ: IKT campaign and
    goes by the funding number FKZ 01IS11025D.


                                               81
     Proceedings of the 3rd International Workshop on Semantic Digital Archives (SDA 2013)


                                                                             Performance tier

                                                                             Capacity tier

                                                                             Archive tier

Typical storage media and devices           Capacity Cost/GB Access time


                 Fig. 1. Comparison of the three typical tiers of a HSM


a fraction of the stored files in a system needs to be on fast and, therefore,
expensive storage media like the newest solid state drives to provide them in a
performant way to users or programs.
     Hierarchical Storage Management (HSM) Systems solve this problem by stor-
ing files on the most efficient storage media respective to the importance of these
files. Efficient means the balance between the access frequency, the time it needs
to load files from slower storage media, and how expensive the storage is. Thus,
files that have a high importance because, e. g. , they were accessed a lot, should
be accessed very fast by users or programs. Therefore, it is necessary to store
these files on media with low access time and high data rate. Conversely files that
are used rarely and, therefore, seems to be needed less often in future, should be
stored on cheap media.
     As shown in Fig. 1, a classical HSM System is divided into three tiers: Per-
formance Tier (PT), Capacity Tier (CT) and Archive Tier (AT) [12, 6]. The
PT has the fastest but most expensive storage media (e. g. , solid state drives),
while the AT utilizes the cheapest but slowest storage media like Tape Libraries
(Fig. 1). The CT acts as a compromise between cost and performance and uses
storage media like SATA RAID systems, since they are faster than tape storage
media but a lot cheaper than solid state drives. Due to economical reasons, the
PT has the lowest storage capacity and the AT the greatest one.
     Before the initialization phase no user, program or other data is stored on
the HSM System and all new files will be placed down on the PT. If there is
no more capacity available on the PT, files were migrated to the next lower tier
and the same happens when there is no more capacity available on the CT. In
practice, this migration is done in cyclic jobs, e. g. , once a day. The difficulty
now, is to find the right files to migrate, since the migration is expensive due to
read and write actions on the storage media and one does not want to migrate
files that one need often in the future. A trivial method is to find all files that
had no access for a specific time. Furthermore, every file would be analysed
without context and relationship to other files. This reveals the main problems
of classical HSM Systems. There is a lack of semantic and the very restricted
amount of values to decide if a file has to be migrated between tiers. The lack of
semantic means, that the system can not analyze files and consider files that are
related by the same topic as a group of files and should be migrated together. A


                                              82
      Proceedings of the 3rd International Workshop on Semantic Digital Archives (SDA 2013)


second problem is the very restricted amount of file attributes such as file size
and last access time.
    The next section introduces our Storage Ontology (SO), that is the main part
to overcome the lack of missing relations between files (Section 2). In Section 3,
we present our experience and results we have made with a practical realization
of our approach. Closing with related work (Section 4), we summarize our results
and give an outlook in Section 5.


2     Storage Ontology

The purpose of the Storage Ontology (SO) is to offer an optimal information
repository to find files that should be migrated from one tier to another. In this
paper, we want to restrict to HSM Systems as described in Section 1. We also
limit the files to be stored on the HSM to documents created by programs and
users. This excludes files belonging to the operating system. The domain of the
SO deals with files on the one side and HSM Systems on the other side and,
of course, the interaction between both. This involves every information that
is necessary to determine the best place of storage for each file. The following
subsections provides a deeper look into the domain of the SO, starting with the
domain file and going on with HSM.


2.1    Analyzing the Domain

Imagine a typical HSM System described in Section 1. Each of the files stored
on it, are going to be analysed, and the extracted information is stored in a
repository. Our purpose is not to collect each single possible information, since
this ends up in a repository too big, where querying is not much faster than on
file system level. Using too little semantic information leads to lack of knowledge
and one can not determine the correct files for migration. To find a trade-off for
this conflict, we took some typical scenarios from the real world to analyze which
information is important. Starting with analysing files, we have a closer look on
what is special about files that come from the same raw material, have the same
project, or files that are grouped by the same topic.
     Same raw material: At first, we observed that a lot of files are generated
from the same raw material. E. g., professional cameras are taking photos in a
very rich raw format, that needs a lot of storage capacity. After importing them,
a graphic designer edits the photos and exports them maybe as JPEG files,
which need less storage capacity. Later, these photos are presented on a website
and small teasers from these photos are created. In this case, we don’t need the
raw photos from the camera any more, since we have more performant photos
and the teaser ones.
     This small scenario shows the following. First, there is a group of files that
comes from a specific source (camera) at a specific time. Second, there are three
different characteristics resulting from the same raw material, having different


                                               83
     Proceedings of the 3rd International Workshop on Semantic Digital Archives (SDA 2013)


amount of information. Third, there is a purpose of use for each of the charac-
teristics. Among others, one could decide about the storage medium from this
information, e. g. , the preview photos should always be stored in the PT.
     Same project: As aforementioned, files often appear in groups. Groups are
very important for the migration, because they offer the possibility to migrate
more than just one file with only one query. As an example, imagine a video
project. It consists of lots of files such as material from a video camera or audio
files that should be underlayed into the video. There are also different chapters
that occurs in different files. Also, the finished film, the trailer or specific still
frames e. g. , for the web, could be contained in this project. As a consequence,
if lots of files from one project are accessed permanently, one could infer that
none of the files should be migrated to lower tiers. Only if the project is finished
and very few accesses are made on the project files, these could be migrated in
one step.
     Same topic: Imagine there are lots of Christmas photos stored on the HSM
System. Assume minimal accesses are made on these photos in the middle of the
year, these photos are stored on the archive tier. But when Christmas time is
coming, photos related to this topic are accessed more than in the middle of the
year. Related to the migration of the files belonging the same topic, one could
easily archive photos from this group in the archive tier or put them back on the
higher tiers when a special topic becomes more important. Such topics could be
inferred from more accesses on files related to this topic.
     In every scenario, information related to time and place of a file is also needed,
if the important files for the migration should be found. Therefore, it’s necessary
to know which storage media could offer the storage conditions that are needed
for a special file. As introduced in Section 1, a typical HSM System is build up
into three Storage Tiers, the Performance Tier, Capacity Tier and Archive Tier.
Some of these tiers have directly mounted Storage Media, like solid state drives,
while others have mounted whole storage devices, like tape libraries, that consists
of several storage media (tapes). As defined, devices do not directly store files,
they are more of an overlaying construct that is necessary to read, write and
manage their containing media. Only physical storage media can directly store
files and we do not want to lose the possibility of knowing where a file is stored
concretely. Furthermore, an HSM System could consist of several Storage Vaults.
Each of them manages one tiered hierarchy. Fig. 2 shows the structure of our
understanding of an HSM System with the help of Extended Backus–Naur Form
(EBNF). Since we map these structure into an ontology, EBNF is a good, simple
and formal tool. Additionally, every Storage Medium should be characterized by
its typical values like read/write access, average throughput or access time.
     Furthermore, it is important to know about typical properties of every storage
device and medium such as storage capacity, access time or data rate, while
the latter could be distinguished into read and write values. These values are
important to decide on which storage tier the devices or media should be placed,
even in future when faster media is available and the current fast ones are not
sufficient for the performance tier.


                                              84
      Proceedings of the 3rd International Workshop on Semantic Digital Archives (SDA 2013)


HSM              = { Storage Vault }
Storage Vault    = Performance Tier, Capacity Tier, Archive Tier
Performance Tier = Storage Tier (* also for Capacity and Archive Tier *)
Storage Tier     = { Storage Device }
Storage Device   = { Storage Media }
Storage Media    = { File }


      Fig. 2. Domain of HSM described in Extended Backus–Naur Form (EBNF)


2.2    Modeling of the SO
Ontologies are formal and their purpose is that every program and human using
them will understand it exactly in the same way. To gain this goal, a modeling
language like OWL [14], which we are using, is needed.
     Since the XML-Syntax is not easy to read, we will use the Manchester Syntax
[8], that uses natural language, to explain the structure of our SO. Thereby we
can describe the set of all audio files (all things that are files and have an audio
file format) as Defined Class by writing File and (hasFileFormat exactly 1
AudioFileFormat). Furthermore, the term contains a (normal) class File. The
differences between a Defined Class and a Class is the necessity of inference.
While we must explicitly set classes to an individual, Defined Classes are inferred
by reasoner and we do not have to assign them. There is also an object property
in the upper term: hasFileFormat. Object properties describe the connection
between two individuals. Additionally, we will also use datatype properties that
describes the connection to data values, like string, int or self defined types. An
example is hasSize long. In this case, the reasoner looks which of all individ-
uals, who are member of the class File, has exactly one property hasFileType
that has the range of another individual as an instance of AudioFileFormat,
that is also a Defined Class. At last, we have two keywords from the Manch-
ester Syntax: and and exactly. The keyword and is an intersection (u) and
exactly x (=) means that every instance of such a class must have x times
a given property. Another important concept of our ontology are individuals.
These are concrete instances that belongs to concrete classes. While File is a
class describing an individual belonging to its class, File123 could be a concrete
instance of the class File and could optionally have a file name or a size.

2.3    Structure of the SO
Following, we present the schema of our ontology by explaining the main ideas
with the help of the Manchester Syntax. At first, we describe the classes and
properties belonging to files, and later we want to end up with HSM Systems
and the connection between both.

File Concluding from Section 2.1 we modelled the class File as the central class
(see Fig. 3). There are some datatype properties, among others, that describes


                                               85
     Proceedings of the 3rd International Workshop on Semantic Digital Archives (SDA 2013)


attributes such as path, name, id, size or other important once such as last
access or the date of creation. There is also a property isPreview that indicates
if a file is a preview or not. Furthermore, a collection of files could be added
to a Group. The connection between File and FileFormat is more complex. To
explain the FileFormat, we have to go further to the two classes MediaType and
QualityType. These classes are modeled as closed enumerations. MediaType has
only five predefined instances based on the MIME Media Types by IANA [9]:
Audio, Application, Image, Video and Text. All the same, we defined the
class QualityType, that has two instances: Performance and Raw, considering
that a lot of files exists in several characteristics (see Section 2.1). While the
MediaType could be extracted through several tools, the QualityType must be
set manually, except of file formats for preview files, because, if a system creates a
preview from another file, one could observe this and set the property isPreview
on true. As an example, PNG as instance of FileFormat could be created. The two
properties MediaType and QualityType are set to Image and Performance. A
reasoner could infer from this information, that the individual is also a member
of ImageFileFormat and RawFileFormat. Described in Manchester Syntax it
looks like this: FileType and (hasMediaType value Image) for the defined
class ImageFileFormat. As one could imagine, every subclass from FileType
has an analogous class description and, thus, there is a subclass for every member
from the two enumerations.


                                                   Image

                                                                      Perfromance

                   ≡ ImageFileFormat           {} MediaType
                                                                   {} QualityType
      ≡ QualityFileFormat                   hasMediaType
                                                     hasQualityType        {} StorageTier            PerformanceTier

                                        FileFormat
     Representation
                                                                            onStorageTier

                                                                            StorageDevice             ≡ PerformanceSD
  Group               hasFile       hasFileFormat
                hasMember                                                  onStorageDevice
                                     File           onStorageMedium          StorageMedium            ≡ PerformanceSM
  xsd:string           hasPath
                     isPreview

  xsd:boolean
                                                                          isActiveSince isOnline


                ≡ PerformanceFile           ≡ ImageFile               xs:dateTime            xs:boolean


   Class               ≡ DefinedClass            {} Enumeration           Individual               Datatype Property


      rdf:subClassOf                rdf:type                                                                      Key


Fig. 3. Partial graph from the SO, that describes raw and image files on the perfor-
mance tier


                                                              86
    Proceedings of the 3rd International Workshop on Semantic Digital Archives (SDA 2013)


    Furthermore, an individual, e. g. , holiday.png as member of the class File
would be created. One could use an analyzing tool to get the MIME Media
Type from this file and set the property hasMediaType to the individual, in
this case with the range PNG. From this connection, a reasoner could infer that
this file also belongs to the classes ImageFile and PerformanceFile, since the
class description is File and (hasFileFormat exactly 1 ImageFileFormat)
for the Defined Class ImageFile. As one could envision, there are also analogous
class descriptions for every subclass from File that is build from every subclass
of FileFormat. An exception is the subclass PreviewFile as described earlier.
This class description is File and (isPreview true).
    We also defined the class Representation in the SO. This is an abstraction
of Files, that have at least one similar characteristic. As an example, there
could be three different characteristics from one photo as raw photo from a
camera, a performance file that was created from the first one and a preview.
If an individual of Representation would be created and add connections to
the three Files, one could find these three ones with only one query. As an
example, imagine an access on the preview file of a photo, and later the user
wants to retrieve the raw file of this photo which is archived. In this case, it is
easy to find this photo.

HSM Going on with the classes and properties belonging to the domain of the
HSM System, we have made a meaningful basis in Section 2.1 by describing the
HSM with the help of EBNF and it is easy to map this into our SO. There-
fore, we add a class for every type from the EBNF description, but we omit
the class Storage Vault in Fig. 1, since it is not necessary if only one is used
in the HSM System. According to the classes MediaType and QualityType, we
modelled the class StorageTier as an enumeration with the three members
PerformanceTier, CapacityTier and also ArchiveTier. With the property
onStorageTier, an instance of the class StorageDevice could be connected
to one of the tier, and also with the property onStorageDevice, an instance
of StorageMedium could be placed on a given storage device. Depending on to
which kind of storage tier a storage device and on to which storage device a stor-
age medium is connected, a further type could be added to these individuals.
Taken the example from Fig. 1, a storage device that is connected to the perfor-
mance tier would be additionally inferred as a PerformanceDevice, because the
class description of the defined class PerformanceDevice is StorageDevice and
(onStorageTier) PerformanceTier. We modelled the class description from
the defined classes CapacityDevice and ArchiveDevice and also the different
storage mediums, accordingly. Among others, a Storage Device has properties
for read/write, access time or throughput.


3   Evaluation
As we focus on the meaning and context of documents we decided to use semantic
data technologies instead of a relational database to store extracted information


                                             87
      Proceedings of the 3rd International Workshop on Semantic Digital Archives (SDA 2013)


[13]. This section introduces our test environment, gives an example to compare
our approach with the classical one and shows the advantages.

3.1    Test setup and environment
In 2005, we started the K-IMM2 project [10], which aim is to include the content
and context of documents in personal document management. Although the ar-
chitecture was designed to enable a simple replacement of the used ontology, the
prototype supported it in a cumbersome way. Another problem was the support
of only one ontology at runtime. Thus, we redesigned the architecture and reim-
plemented the core module to support multiple ontologies, exchange them at
runtime and extended the approach to support small and medium-sized enter-
prises. We call it KIMM+. In the SENSE3 project4 , KIMM+ is used as a seman-
tic middleware. Web applications are used to upload documents such as videos
and pictures to the SENSE application. These files are analysed by KIMM+ and
extracted information is stored in the ontologies, especially the SO. The files are
stored on a file system under control of an HSM System. In this infrastructure
both, HSM and web applications are able to query the SO without gathering
information or access to the file system and the files themselves. Therefore, the
HSM System was extended to execute and analyse SPARQL queries. The fol-
lowing section will give two short examples of how to use the SO and highlights
the advantages.

3.2    Use cases
In Section 1, we focussed on two aspects: (1) the lack of semantic (e. g., the
unknown relations between documents represented by their content) and (2)
that classical HSM Systems just use a restricted set of attributes to store files
on a specific tier. For the following examples, we assume to have three folders:
music, documents and pictures. Each folder contains files related to the topics:
flowers, birds and research paper.

(1) In Section 2, we introduced the concept Group. We see the SO as a minimal
subset of relations necessary to store a file in the best place. Therefore, we need
a concept to group files. By keeping it abstract, a group can represent the same
topic (e. g. , flowers) or persons in documents. A file can be part of several groups.
If a file named national fauna.pdf should be migrated from the PT to the CT,
the following query can be used to get the associated groups.
SELECT ?group WHERE {
    ?file :hasName "national_fauna.pdf". ?group :hasMember ?file . }
The result contains all groups the document is related to. The other documents
of these groups can be found by executing for each group:
2
  Knowledge through intelligent media management
3
  Intelligent Storage and Exploration of large Document Sets
4
  http://www.sense-projekt.de


                                               88
     Proceedings of the 3rd International Workshop on Semantic Digital Archives (SDA 2013)


SELECT ?file WHERE { ?group :hasMember ?file .}.

To prevent the HSM System from migrating files that should not be migrated,
other attributes have to be considered to get the final result. This is done by
the logic of the extended HSM System in SENSE. The example shows that in
this way the relations and semantic of documents will be involved in the decision
making for migration. We chose the group concept to keep the semantic within
the SO as simple as possible, because the HSM does not need to know what a
topic is but which files are related by topic. Each concept of grouped entities
can be broken down to a SO group.

(2) The second aspect we focussed on is the restricted set of attributes. They
are limited to classical attributes to filter files for a policy definition. The folder
pictures introduced above contains, for example, PNG, TIFF, NEF and JPEG
files. If RAW formats can be archived shortly after they where accessed, this is
difficult to model with standard policies. First, the attributes to match a RAW
format have to be defined. Lets assume TIFF and NEF are RAW formats and
need to have a file size greater than 1024 KByte. A policy would like this:
(File name matches pattern *.TIFF or *.NEF) and
    (File is larger than 1204KByte) .

If the definition of our RAW format changes or a new one needs to be regarded,
the policy has to be modified. Using the new approach, the HSM System gets a
list of files by executing the following query:
SELECT ?file WHERE
    { ?file a RawFile. ?file :hasFileFormat ?f. ?f a :ImageFileFormat. }.

In this case, changed requirements can be applied directly to the class definition
for RawFile. Another benefit is the possibility that files can be excluded in the
Defined Class if they match specific criteria. This improves the flexibility of the
HSM System.
    First tests within the SENSE framework showed another advantage of the
Storage Ontology. Listing the size of files, no matter if a group of files or all files
are selected, is much faster if the ontology is used. Linux as well as Windows and
OS X showed problems, when the HSM filesystem was loaded, containing more
than 500.000 files, although they were uniformly distributed in the filesystem.
This section showed two small examples how the introduced SO can be used to
improve a existing HSM Systems. Currently, we are testing and refactoring the
SO within the prototype of the SENSE application under real conditions to verify
the improvement. The next sections gives an overview on existing approaches on
using semantic technologies to store, search and access files.


4    Related Work
As mentioned in Section 1, we focus on two main problems: the lack of relations
regarding to file content and the usage of restricted values such as file size.


                                              89
      Proceedings of the 3rd International Workshop on Semantic Digital Archives (SDA 2013)


     Services and platforms like Google Docs, Facebook, Flickr, Youtube and
Twitter are used to store personal documents and to share them with others.
Even for enterprises this becomes more and more important. Regarding this
situation, Semantic Web Technologies can yield to a more flexible way to con-
nect and use these services. Focussing on the requirements of small and medium
sized enterprises, we concentrated on the storage of files within a centralized
infrastructure. Reasons for such a infrastructure can be required by law. In this
context, the Memex Vision published by Vannevar Bush [2] is of great interest.
He described a system to handle a large amount of personalized heterogeneous
data and defined four requirements, cited among, others by Gemmell et al.[5]:
(1) collections and search must replace hierarchy for organization, (2) many vi-
sualizations should be supported, (3) annotations are critical to non-text media
and must be made easy, and (4) authoring should be via transclusion5 .
     We understand (1) and (3) as necessary precondition to fulfill (2), and (4)
given by the concept of Semantic Web Technologies. Describing the relations of
files using an ontology supports the first requirement. While (semi)-automatic
creation from meta data and information retrieval fulfills the third requirement.
Having these information in form of an ontology, different visualization can be
realized (3). And as shown in Section 2, using Semantic Web Technologies, re-
lations between documents change when rules or schema change (4). A first
published approach to store relations between documents and their contents in
a ubiquitous way was WinFS [7]. Through an API, the documents and settings
folder of Windows Vista was planed to handle files in transactions supported by
relational database technologies. The aim was to handle all files in the same way
and to get a homogeneous knowledge base on documents on a PC. Although
WinFS never was released, parts of the concepts were included in following ver-
sions of NTFS on Windows 7 and Windows 8.
An interesting approach to fulfill the first requirement of Vannevar Bush [2] was
published by Bloehdorn et al.[1] in 2006. Using WebDAV as an abstraction layer
between user and file system enables Bloehdorn et al. to break with the classical
hierarchical approach of managing files in folders. To the user, the file system
hierarchy is presented, matching results of search queries. The files are sorted
and organized in virtual folders depending on their tags. Each Folder represents
a tag and a concatenation of multiple tags represents a location. In this way files
are tagged by their location with multiple tags. On the one hand, this approach
enables an orthogonal search, but on the other hand the subset of tags is limited
to the folder names, as they are the tags. That means, that no real search for
files ordered by persons is possible.
     In [3] Crenze et al. the challenges of information management in enterprise
environments are presented. They identified the: (1) amount of data, (2) the
quality and performance of extraction tools, and (3) security and authentication
of data and data access. Semantic Web Technologies are seen as possibility to
replace a classical full-text search by a semantic search on ontologies. According
to Crenze et al. a combination of full-text search with semantic-aware filtering
5
    Including documents into each other by using a reference.


                                               90
     Proceedings of the 3rd International Workshop on Semantic Digital Archives (SDA 2013)


and proposal functions is the best combination to retrieve good search results.
Because of weak performance, the authors intend to use Apache Lucene6 to index
properties instead of Apache Jena7 .
    Regarding the description of files through properties and meta data, PREMIS
[11] gives a good system to describe them in a standardized way. Especially,
since there also exist RDFS and OWL schemas. Beside a subset of properties
to describe data for archiving, PREMIS also covers security and authentica-
tion. Although it is designed to achieve a flexible technology-independent way
to archive physical objects it is also useful to identify storage related properties
for an HSM.
    In 2011, we came up with the idea of using a ontology to optimize the place-
ment of files in an HSM System. For the first prototype, we did not use all
capabilities of ontologies and used a combination of semantic descriptions and
relational rules to generate storage solutions [12]. The great disadvantage of this
approach was the modeling of rules that depended directly on information stored
in the ontology. Another disadvantage was the minimalistic schema, which was
not able to represent storage-related properties. For example, it did not support
the description of different file types like raw files that can be stored on cheap but
slow memory8 . In consequence, we focussed on developing the Storage Ontology.
The next section gives a conclusion on our results and overview on upcoming
work.


5    Conclusion
In Section 2 we introduced a schema to describe files stored on an arbitrary file
system controlled by an HSM System. We focused on describing files and their
relations among themselves and within an HSM System. The quality of the on-
tology depends on the used extraction tools to gather information. Using OWL
allows a more flexible classification of files such as image files or files which should
be moved to the archive. This leads to a more flexible configuration for HSM Sys-
tem policies as the underlying system uses SPARQL queries to get related files.
In consequence, the policy engine of existing systems can be used with minimal
modifications. The located files are related based on the two mentioned aspect.
They are chosen either because of matching attributes or because of relations in
content or type. In this way, we introduced a source for decision-making in HSM
systems and enabled applications in the front-end to get file specific information
as well as knowledge about the placement in the storage hierarchy. We integrated
our schema in ontologies developed in the SENSE project. Currently, we evaluate
and adapt the Storage Ontology by defined scenarios. The recorded data, like file
access-times on the Performance Tier and migration tasks, are compared with
the results of the classical approach using the policy engine without semantic.
Furthermore, we improve the SENSE framework.
6
  http://lucene.apache.org/core/
7
  http://jena.apache.org/
8
  Depending on the application domain.


                                              91
       Proceedings of the 3rd International Workshop on Semantic Digital Archives (SDA 2013)


References
 [1]    Stephan Bloehdorn, Olaf Görlitz, Simon Schenk, et al. “TagFS - Tag Se-
        mantics for Hierarchical File Systems”. In: Proceedings of the 6th Interna-
        tional Conference on Knowledge Management (I-KNOW 06), Graz, Aus-
        tria, September 6-8, 2006. Sept. 2006.
 [2]    Vannevar Bush. As We May Think. The Atlantic Monthly. 1945.
 [3]    Uwe Crenze, Stefan Köhler, Kristian Hermsdorf, et al. “Semantic Descrip-
        tions in an Enterprise Search Solution”. In: Reasoning Web. Edited by
        Grigoris Antoniou, Uwe Aßmann, Cristina Baroglio, et al. Volume 4636.
        Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2007,
        pages 334–337.
 [4]    J Gantz. “The Diverse and Exploding Digital Universe”. In: IDC white
        paper. White Paper 2 (2008), pages 1–16.
 [5]    Jim Gemmell, Gordon Bell, Roger Lueder, et al. “MyLifeBits: fulfilling the
        Memex vision”. In: Proceedings of the tenth ACM international conference
        on Multimedia. MULTIMEDIA ’02. Juan-les-Pins, France: ACM, 2002,
        pages 235–238.
 [6]    PoINT Software & Systems GmbH. Automated Storage Tiering: Optimiz-
        ing the storage infrastructure concerning Cost, Efficiency and Compliance.
        http://www.point.de/fileadmin/Documents/White Paper Automated Sto
        rage Tiering by PoINT Storage Manager.pdf. 2012.
 [7]    Richard Grimes. Code Name WinFS. http://msdn.microsoft.com/de-de/
        magazine/cc164028%28en-us%29.aspx. Aug. 2004.
 [8]    Matthew Horridge, Nick Drummond, John Goodwin, et al. “The manch-
        ester owl syntax”. In: In Proc. of the 2006 OWL Experiences and Direc-
        tions Workshop (OWL-ED2006. 2006.
 [9]    IANA. MIME Media Types. http://www.iana.org/assignments/media-types.
        2013.
[10]    Annett Mitschick. “Ontologiebasierte Indexierung und Kontextualisierung
        multimedialer Dokumente für das persönliche Wissensmanagement”. PhD
        thesis. Technischen Universität Dresden, 2009.
[11]    PREMIS Editorial Committee. PREMIS Data Dictionary for Preserva-
        tion Metadata. Edited by PREMIS Editorial Committee. www.loc.gov/
        standards/premis/v2/premis-2-0.pdf. 2008.
[12]    Axel Schröder, Ronny Fritzsche, Sandro Schmidt, et al. “A Semantic Ex-
        tension of a Hierarchical Storage Management System for Small and Medium-
        sized Enterprises”. In: SDA. 2011, pages 23–36.
[13]    M. Uschold. Ontologies and Database Schema: What’s the Difference. http://
        semtech2011.semanticweb.com/programDetails.cfm?ptype=K&optionID=2
        84&pgid=4. 2011.
[14]    W3C OWL Working Group. OWL 2 Web Ontology Language Document
        Overview (Second Edition). http://www.w3.org/TR/owl2-overview/. 2012.
[15]    Muzhou Xiong, Hai Jin, and Song Wu. “FDSSS: An Efficient Metadata
        Management Scheme in Large Scale Data Environment”. In: Grid and
        Cooperative Computing 90412010 (2006).


                                                92