=Paper=
{{Paper
|id=None
|storemode=property
|title=A Semantic Extension of a Hierarchical Storage Management System for Small and Medium-sized Enterprises
|pdfUrl=https://ceur-ws.org/Vol-801/paper2.pdf
|volume=Vol-801
|dblpUrl=https://dblp.org/rec/conf/ercimdl/SchroderFSMM11
}}
==A Semantic Extension of a Hierarchical Storage Management System for Small and Medium-sized Enterprises==
<pdf width="1500px">https://ceur-ws.org/Vol-801/paper2.pdf</pdf>
<pre>
    Proceedings of the 1st International Workshop on Semantic Digital Archives (SDA 2011)


 A Semantic Extension of a Hierarchical Storage
      Management System for Small and
          Medium-sized Enterprises

Axel Schröder, Ronny Fritzsche, Sandro Schmidt, Annett Mitschick and Klaus
                                   Meißner

              University of Dresden, 01187 Dresden, Germany, 2011
    {axel.schroeder, Ronny.Fritzsche, Sandro.Schmidt, annett.mitschick,
                        klaus.meissner}@tu-dresden.de,
             WWW home page: http://www.mmt.inf.tu-dresden.de


       Abstract. The number of company data deposited in hierarchical stor-
       age management systems heavily increases. Thus, new approaches are
       necessary to keep track of a data pool. This paper introduces a seman-
       tic storage extension (SSE) for existing hierarchical storage management
       systems that allows them to exploit semantic relations between ﬁles and
       use them for a more eﬃcient and more intelligent data management. Our
       approach enhances traditional hierarchical storage management systems
       regarding migration, deletion, and retrieval operations by making use of
       semantic relations between ﬁles and contextual knowledge. Thereby a
       predictive ﬁle management is possible, which contributes to an increas-
       ing system performance and a better user experience. To this end, the
       SSE uses extracted features of documents to deﬁne relations between
       them and also oﬀers the possibility to specify additional knowledge by a
       domain expert.

       Keywords: semantic, hierarchical, storage, management


1    Introduction

At present, the amount of digital data stored by companies doubles year by year
[15]. To save costs, companies increasingly use (hierarchical) storage manage-
ment systems (SMSs). Those systems distribute data between diﬀerent storage
technologies and deposit information in a cost-optimized way. A SMS typically
divides the storage environment into three tiers (Figure 1). The performance tier
utilizes very fast and also expensive storage technologies like solid-state drives or
SAS1 /FC2 hard disks. The capacity tier usually consists of SATA RAID systems
which oﬀer lower costs per gigabyte, but also suﬀer from an higher access time.
The archive tier uses long time archiving technologies (WORM3 ) like optical
1
  Serial Attached SCSI
2
  Fibre Channel
3
  Write Once Read Many


                                             23
   Proceedings of the 1st International Workshop on Semantic Digital Archives (SDA 2011)


jukeboxes or tape libraries. This tier oﬀers the lowest costs per gigabyte and
the highest capacity, but it also has a very high access time. So the SMS has to
ﬁnd a tradeoﬀ between costs, capacity and access time. It has to distribute its
data in an optimized way. We concentrate on three typical operations of a SMS:
migration, retrieval and deletion. Migration means the movement of a ﬁle to a
slower storage tier (e.g., from the performance to the capacity tier) and retrieval
means the opposite, the movement to a faster storage tier. However, current
SMSs treat ﬁles individually and do not recognize and use semantic relations
(e.g same author, topic, accessdate, ..) between them. Furthermore, about 80 %


                                                        &RVWV*E
                                                                   3HUIRUPDQFH7LHU

                                                   $FFHVV7LPH     &DSDFLW\7LHU
                                        &DSDFLW\
           /7 8 /7 8 /7 8


                                                                   $UFKLYH7LHU
             2 OWUL 2 OWULX2 OWUL
                   XP P XP


Fig. 1. A typical three tier architecture of a hierarchical storage management system.


of all data in a company is stored unstructured and contentwise unorganized [4],
what soon becomes problematic for a useful cost optimization. Also only about
2 % of all stored data is used in daily business [17]. This means only these 2 %
of all available data needs to be provided within the SMS’s performance tier.
The use of inherent and additional semantic information and semantic relations
between ﬁles is one key to a more eﬃcient and more intelligent way of storage
management. There are other approaches to achieve a better performance, like
monitoring user behaviour for a anticipatory data management, which are not
focussed in this paper. We are concentrating on the aspect of using semantics
in SMS. However, current SMS do not or only rudimentarily use those semantic
relations. Treating this oﬀers huge potential to improve management algorithms
in order to achieve better performance and user experience.
    This paper introduces a semantic-aware software component as extension for
a SMS that allows the use of semantic information to optimize the way data is
stored. We call it Semantic Storage Extension (SSE). The SSE can be integrated
into existing SMS (see Section 3) and enables them to recognize semantic rela-
tions between documents, which may be used for distributing digital information.
Initially, Section 2 will illustrate current developments and research projects in
this ﬁeld of study. Hence, requirements for the SSE will be derived, which are
fundamental for its design, described in Section 3. This section also illustrates
the functionality of the SSE and points out its interactions with other compo-
nents like the SMS. Section 4 summarizes the results of this paper and illustrates
both current and further work in this research area.


                                              24
      Proceedings of the 1st International Workshop on Semantic Digital Archives (SDA 2011)


2      Related Work

Modern ﬁle systems like NTFS4 already use a wealth of metadata in documents.
However, these are primarily used for describing ﬁles. When trying to ﬁnd similar
information between diﬀerent documents, the limitations of those ﬁle systems
become apparent very fast. There is a wide range of papers which deal with
semantics in documents (e.g. to improve ﬁle searching) [6, 8, 9, 12, 14]. It is also
possible to enrich a rulebook of a document management system in order to
choose appropriate ﬁle handling strategies [2]. Other approaches (e.g. [5]) try to
separate the metadata from ﬁles to achieve a better system performance.
    In contrast to all this work, we focus on improving the document management
in a SMS. Thus, the SSE tries to help on managing ﬁles cost-orientedly according
to their actuality and relevance. To achieve this goal, new semantic relations
should be used. There are only few papers addressing semantic associations
between documents and attempting to use those for managing ﬁles. The next
section outlines some selected work of this speciﬁc research area.


2.1     Semantic Information in File Systems

To overcome the limitations of current ﬁle systems and SMSs, TagFS [1] anno-
tates ﬁles with keywords. This is done automatically as well as user-controlled.
For example, the keywords could contain diﬀerent metadata, names of folders in
a document path or other manually added terms. Subsequently these tags can be
used to ﬁlter from a set of ﬁles. Thereby restrictions of the documents path are
avoided. By using tags that reﬂect folder names of its original path, it is possible
to ﬁlter directly for ﬁles within subfolders, without knowing their exact location
(e.g., ../pictures/vacation or ../vacation/pictures then means the same).
    Another approach for mapping semantic information in ﬁle systems was cre-
ated in 2003 by introducing semantic vectors [7]. Thereby the metadata of ﬁles
are converted into vectors, which subsequently span a common feature space.
This leads to two results: On the one hand, duplicates easily become visible and
on the other hand, strong dependencies between several documents can be found
through vectors that are very close together.
    There is another approach outlined in [16]. It illustrates how to capture
external events in an ontology and to link them with the data that is related
to this event. By doing so, the data pool is enriched with additional knowledge
and ﬁles get indirectly linked with each other via events. For example, if ﬁles are
created or modiﬁed during a phone call, they become related to this event inside
the ontology. Thus, these documents have an implicit relation to all other ﬁles
related to this event. The authors clarify that the correct linking of ﬁles with
events needs a longer training process. But they also underline that this approach
ﬁnally works very precisely. This idea allows a SMS to cluster data based on
their real connections. Another advantage is that the users also beneﬁt from
this concept. They get the possibility to retrieve data by recalling speciﬁc events
4
    New Technology File System


                                               25
      Proceedings of the 1st International Workshop on Semantic Digital Archives (SDA 2011)


(meetings, phone calls, ...). So this approach closely resembles human thinking.
The navigation through a data pool is no longer based on the place (Where is
information stored?), but on events (Why and Whereby?). A disadvantage is,
that the relevance of ﬁles is not captured. So this approach has great potential
for user-based ﬁles searched, but still needs additional improvements to support
internal SMS operations.


2.2     Semantic Information in Additional Systems

There is a need for more research in the analyzed areas. The generation and
processing of additional information creates a greater workload. For example,
the approach of [16] introduces a mechanism to gather events. Corresponding to
the size of a company, this concept can require extensive upgrades in the existing
IT infrastructure. However, a long-term inﬂuence on current ﬁle systems is only
possible, if such an approach is enforced as a well-established practice in a com-
pany. Also, it has to be carefully considered, if the achieved advantages justify the
higher resource requirements. At this point, most publications only provide theo-
retical estimations or smaller ﬁeld tests. In [17], studies about extensive scenarios
are realized and performance and ﬂexibility of those approaches are evaluated.
They especially indicate, that modern DBMS5 (in this case MySQL6 ) are not
optimized for a very large number of metadata. Also many of the investigated
concepts are only partially applicable for ubiquitous use. For example, some ap-
proaches require additional computing power to permanently extract metadata
and analyze them, which then can be used to derive semantic relations. Usu-
ally it is very diﬃcult to realize event gathering on external devices (e.g., fax)
and connect them to documents, because there are no standardized interfaces to
catch these events. So the integration of the illustrated approaches in current ﬁle
systems and the combination of diﬀerent research concepts is tricky. Thus, a so-
lution is needed, which has enough potential compared to conventional methods
and also presents an additional value abreast them.


3      The Semantic Storage Extension

This section illustrates the design for the SSE in detail. In Section 3.1, necessary
requirements for the SSE are described. They lead to a software architecture out-
lined in Section 3.2. The following sections show the functionality of a semantic
service component (Section 3.3) which is needed for feature extraction, illus-
trate necessary modiﬁcations inside the SMS (Section 3.4) and ﬁnally describe
the structure of the SSE (Section 3.5).

5
    Database Management Systems
6
    http://www.mysql.com/


                                               26
      Proceedings of the 1st International Workshop on Semantic Digital Archives (SDA 2011)


3.1     Requirements

As shown in the introduction, the data pool of SME7 grows exponentially. Thus,
it is important that a semantic extension for a storage management system (SSE)
works performantly even after years. In contrast to traditional SMS which just
relocate ﬁles in regarding their own context, the SSE should oﬀer a foundation
to enable the SMS to preemptively relocate related ﬁles as well. Furthermore,
metadata should be managed centralized and independently from their docu-
ments. Similar to the approach of [5], who suggests separating metadata, two
advantages follow. Metadata can be accessed faster and the number of read ac-
cesses on the actual storage media decreases, which means a longer lifetime [11].
In respect of the limitations of DBMS, metadata should be stored in an ontology
[17]. Current DBMS often only support data mining and clustering. An ontol-
ogy allows a more expressive description of semantic relations and metadata. It
oﬀers additional possibilities for reasoning mechanisms to infer semantic knowl-
edge that is not explicitly modelled. Furthermore, the approach of gathering and
processing events oﬀers a huge potential [16]. The SSE should link documents
with external knowledge to not only manage data about their structure and con-
tent, but about their origin and meaning. By doing so, the SMS would be able
to not only provide relevant data for fast access, but presenting other data that
is semantically close as well.
     In order to function as an extension, the SSE has to work autonomously. It
should enhance the underlying SMS with minimal modiﬁcations, but never be
able to interfere with the SMSs basic functionality. This means the SSE should
be available for the SMS through its own interface as an independent component.
     Another requirement is a centralized metadata storage [5]. It can be used to
look up information about the data pool in one place and use it to quickly get
a link to related ﬁles. The metadata should also be managed centrally inside
the SSE. Here it is especially necessary to pay attention to the consistency of
information regarding the data pool in the SMS.
     The heterogeneity of the data pool requires various and complex extraction
algorithms. Metadata should not only be extracted from current ﬁle formats, but
new ﬁle formats should be supported in the future. The complexity and exten-
sibility of those extraction processes requires a complex treatment and was not
focused yet. A semantic service component (SSC) has to provide these extraction
features. The SSC has to be able to extract all necessary metadata, information
from the ﬁle system, semantic information of single documents and save them
into an ontology. In addition, a mutable set of policies inside the SSE is needed.
Through these policies it should be possible to capture additional knowledge
and connect it with the ontology. This additional knowledge is not implicitly
available and can not be derived through extraction algorithms from the data
pool itself. To avoid sophisticated, technical modiﬁcations as described by [16], a
domain expert should become the informal interface between the company’s pro-
cesses and the SSE. So the SSE operates semi-automatically, whereby existing
7
    Small and Medium-sized Enterprises


                                               27
      Proceedings of the 1st International Workshop on Semantic Digital Archives (SDA 2011)


information are extracted automatically and additional knowledge is generated
manually. An advantage of this procedure is the possibility to consider diﬀerent
processes and requirements of a company.
    Another requirement for the SSE is, that it just provides advisory functions
for the SMS. The SMS may consult the SSE, but must never lose control over its
tasks and responsibilities. The SMS consults the SSE by requesting semantically
related documents to a given source document or a set of source documents.
Furthermore, it should be possible to inform the SSE about which tasks should
be performed with the source ﬁle within a request (e.g., deletion, migration, ...),
in order to support a decision. To achieve these requirements the SMS needs the
ability to query the SSE and correctly interpret its answer.


3.2      Software Architecture

The requirements in Section 3.1 lead to a software architecture that is shown
in Figure 2. Our approach concentrates on hierarchical SMSs, but is also usable
for otherSMS that use equal operations on ﬁles (e.g., migrating them between
diﬀerent storage tiers). The ﬁgure shows that the SMS communicates with the
SSE over a dedicated interface. Thereby, requests for documents are sent to the
SSE, which searches for existing semantic relations to other documents. This
interface is also used to inform the SSE about every modiﬁcation in the data
pool, to ensure consistency to its ontology. Additionally, the SSE communicates
with the SSC, which extracts all relevant information and manages them in an
ontology. In the context of this analysis, a controlled data access to the SMS’s
ﬁles takes place, where the SMS stays in charge and provides access only to
ﬁles, that are needed for the current update. Furthermore, the system policies
for getting semantic associations are administrated in a decentralized way. The
following sections describe the realization of this architecture in detail.


                                     
                                                    
                                                      
                                                    
                                  
                                                                                  
                                   
                                               
                                   
       
                    
                   
                                          
                                       


                                Fig. 2. The software architecture


                                                  28
      Proceedings of the 1st International Workshop on Semantic Digital Archives (SDA 2011)


3.3     Functionality of the Semantic Service Component

The SSC needs access to the data pool to analyze documents in the SMS. To
decide which data should be analyzed and stored, the storage system has to
specify these documents, whereby two procedures are possible. The ﬁrst one is a
complete analysis. This is normally triggered when the SSE is activated for the
ﬁrst time or if a full consistency check should take place. The second procedure
is a partial analysis, which is called at runtime, whereby only modiﬁed, removed
or new documents of the SMS are examined.
During the analysis, the SSC has to extract all available information regarding
ﬁles and save them in an ontology. This information can be categorized as follows:

 1. File system information (e.g., resident attributes like ﬁlename or -type)
 2. Metadata (e.g., ID38 , TEI9 , EXIF10 )
 3. Implicit semantic knowledge (e.g., CBIR11 , face recognition or audio analy-
    sis)

Another important requirement is the extensibility of the SSC to integrate new
extraction algorithms for future ﬁle formats or metadata standards.
    The available information about documents is very heterogeneous regarding
their attributes and parameters. Diﬀerent identiﬁers sometimes got the same
meaning (e.g., author /creator, creation/date of creation). This leads to a lack
of interoperability. Therefore, navigation or search in the knowledge base is very
expensive. To achieve a good performance with an increasing data pool, we broke
down all semantic information to four fundamental dimensions: Person, Place,
Topic and Moment. Figure 3 illustrates the structural layout (schema) of infor-
mation for a speciﬁc document inside the ontology. All extracted information is
reduced to these four semantic concepts which are instantiated at runtime and
stored in the ontology. So they represent a very compact document knowledge
base. The ontology schema allows to ask about the who, where, what and when
in context of a document.


                                                           6SHDNHU$XWKRU
           V\VWHPILOH                                                                     
             VL]H                                           /RFDWLRQ&RRUGLQDWHV
                                                                                               
                                                    7LWOH7H[W$QDO\VLV
                                                                                                
              
                                                       0RG&UHDWLRQ'DWH
            UHPRYHG                                                                       


       Fig. 3. Used ontology schema to describe a document’s context information

8
   Identify an MP3 (metadata for audio ﬁles), http://www.id3.org
9
   Text Encoding Initiative (description of text documents), http://www.tei-c.org
10
   Exchangeable Image File Format (metadata for images), http://www.exif.org
11
   Content Based Image Retrieval


                                               29
      Proceedings of the 1st International Workshop on Semantic Digital Archives (SDA 2011)


Figure 3 shows the semantic concepts and below them extracted feature (Speaker,
Location, ...) which led to the speciﬁc concept. For example, an extracted speaker
leads to a designated personand a document title leads to a speciﬁc topic. Fur-
thermore, each document has basic properties which are derived from the ﬁle
system (system, size, ...). To diﬀerentiate between removing and irrecoverable re-
moving, the property removed marks a document as removed. So it can recovered
until is was irrecoverably deleted.
We use the K-IMM12 system [10] as a demo SSC. It especially oﬀers most of the
functionality we speciﬁed. In particular the extraction mechanism for ﬁle system
information, metadata and content based information retrieval. Additionally,
K-IMM has a modular structure, which easily allows the extension with new
extraction mechanisms. Furthermore, K-IMM stores generated knowledge in an
ontology and uses the ontology schema described in Figure 3.

3.4     Modiﬁcations in the Storage Management System
To interact with the SSE, the SMS needs some modiﬁcations. Basically it has
to be able to request the SSE for semantically related documents. Also it is
necessary that the SMS interprets the answers. To ensure consistency of the
ontology it is important that modiﬁcations in the data pool are immediately
delivered to the SSE. Last but not least, a common interface is essential to
realize those tasks. The following subsections illustrate the modiﬁcations which
have to be done.

Requesting the SSE According to the requirements (Section 3.1) the SMS
can use two diﬀerent types of requests. We call them simple request and directed
request. Simple requests only pass the identiﬁer (ID) of a ﬁle in the SMS. Each
ﬁle that has stored metadata records got an ID (see Figure 3) that is used to link
a ﬁle to its metadata inside the ontology. Simple requests are used to get a list
of semantically related documents to a source document without any additional
knowledge. The second type are directed requests which have a second param-
eter. This informs the SSE about the planned action for the source ﬁle (inside
the SMS). Currently our concept supports the three core operations: migration,
retrieval and deletion. The aim of a directed request is not only to receive a list
of semantically related documents, but also to receive a recommendation for a
given action.
    Another important aspect are concurrency issues. To fulﬁll the requirements,
the SMS should never wait for a response from the SSE. It has to be guaranteed
that the SMS can process its tasks without depending on the SSE. The SMS
just gets the advice to request recommendations from the SSE before starting
a planned task and then to integrate the response to its workload to optionally
re-schedule future actions.
    Between all ﬁles in the storage system, there are at least weak semantic
associations. For example, all ﬁles are stored in the same ﬁle system, that are
12
     Knowledge through Intelligent Media Management


                                               30
   Proceedings of the 1st International Workshop on Semantic Digital Archives (SDA 2011)


managed by the same SMS. Since the SSE also considers weak semantic bindings,
a response list could be very large. In the worst case, the response set contains all
ﬁles of the storage environment ordered by their relevance. At this point another
parameter for requests is introduced. This one is optional and is used to limit
the size of the response list to a request. This threshold parameter represents the
maximum disk space consumption in megabyte for related ﬁles. For example, if
only 2 GB of data could be retrieved from a speciﬁc storage tier, the value of
the parameter has to be 2048. So the response list only contains as many related
documents as the SMS can use.


Response Interpretation Responses of the SSE basically contain a sorted list
of ﬁle identiﬁers (FIDs). The order and their interpretation is inﬂuenced by the
type of request. Usually, ﬁrst listed FIDs have a closer semantic relation to a
source document than FIDs with a lower rank.
    In case of a directed request the SMS initially examines the recommendation
of the SSE. This is constructed as a boolean value. True stands for an approval
and false for a rejection. The detailed interpretation of the response can be
categorized as follows:

Directed request for a migration: As long as the SSE approves the planned mi-
gration (true), it can take place. In this case no sign has been found that the
source ﬁle belongs to the active data pool. So it would be recommendable for
the SMS to migrate other documents from the response list, which are also most
likely redundant. This helps to predictively move unused data into a slower stor-
age tier to save costs. If the SSE responds with a rejection (false), the response
list has to be interpreted in the opposite way. The source document (and its se-
mantically related documents) seems to be relevant and active in the company.
Thus, a migration to a slower storage tier is not advisable.

Directed request for a retrieval: A ﬁle retrieval works in an opposite way. For
example, a rejection (false) of a speciﬁc document means that this one does not
have (many) active or relevant relations to other ﬁles. Therefore it is unnecessary
to move this ﬁle to a faster storage tier. Also semantically relevant documents
most likely do not need to be retrieved.

Directed request for a deletion: This case requires an additional treatment by the
SMS. Also the signiﬁcance of the response list has to be handled with care. If the
SSE calculates a high relevance or actuality for the requested ﬁle, its response
contains a rejection (false). Like the other requests, this recommendation counts
for all ﬁles in its response list as well.


Modiﬁcations in the Data Pool As described in Section 3.3, we diﬀerentiate
a partial and a complete analysis. At the ﬁrst activation of the SSE and also
at the regular consistency check of the ontology, the SMS initiates a complete


                                            31
      Proceedings of the 1st International Workshop on Semantic Digital Archives (SDA 2011)


analysis, whereas the modiﬁcation of single informations just invokes a partial
analysis.
    For a well-structured delivery of necessary data to the SSE, the SMS needs an
additional module. This module provides all aﬀected documents as separate data
streams. Because of the complexity of the data pool, all data streams should be
processed iteratively during the analysis. Every data stream contains a reference
to following data stream (chained list). The SSE can process them step-by-step
until all modiﬁed documents, which are aﬀected by this update, are processed.
To clearly identify a data object, an FID has to be embedded into the data
stream.

3.5     Design of the Semantic Storage Extension
In this section, we introduce the structure of the underlying policies and show
how to formalize explicit semantic knowledge. Next, the process of reasoning
will be explained, particularly, how the SSE determines related ﬁles by semantic
bindings. Concluding, this subsection will show how response lists are generated
and structured.

Policies With policies, the domain expert should be able to specify any knowl-
edge about the processes and the structure of a company. A DBMS is used to
store this knowledge. Below, the structure of the database tables is illustrated
by the following example: “If documents were modiﬁed at Computer7 by Mrs.
Schulz or Mr. Meyer, they belong to the accounting.” To express this knowledge,
the domain expert can use two types of policies: basic policies and relational poli-
cies (Figure 4).


  
                                         
                                               
                                                                      

              
                             
                                        
                                         


                                Fig. 4. A simple policy example


Basic policies gather all processes which have temporal boundaries. These pro-
cesses are called projects. Policy 4 implies that the project company started on
2009-10-29. Furthermore, Policy 5 deﬁnes the project accounting which started
on 2010-01-01. Both projects have no deﬁned ending. Also the accounting is part
of company.
    The second type of policies are relational policies which are stored in a sep-
arate table (Figure 4). They associate existing projects with persons, places,


                                                  32
   Proceedings of the 1st International Workshop on Semantic Digital Archives (SDA 2011)


moments and topics. Policy 8 declares that all documents which contain the
person Schulz and were created or modiﬁed on Computer7 belong to accounting
(policy id 5 ). Policy 9 describes the same for Meyer and Computer7.
    To minimize the amount of policies, a third table (constraints) is introduced
(Figure 4). In this table, dependencies between relational policies can be ex-
pressed by logical operators. In our example, constraint 14 describes a relation
between policy 8 and 9 and combines them by an OR-operator. This means that
only one of those policies needs to be fulﬁlled for a document to be matched to
the accounting.


Procedure for Getting new Semantic Information Figure 5 shows how
the SSE tries to ﬁnd semantically related documents and which communication
is necessary between the involved components. The SMS requests the SSE and
inform it about the ID, and optional about the planned operation, of a source
ﬁle. Initially this ID is used to query the SSC for all information on this ﬁle.
The request itself is created by the SSE and formulated in SPARQL [13]. How
the returned knowledge is structure, is described in Figure 3. Next, the SSE
checks if existing relational policies match and if projects may be associated.
For example, if a source document is related to a place Computer7 and a person
named Meyer, then already one of the two policies matches. As both policies are
OR-linked, the ﬁle would be assigned to the project accounting.


                                 
          
                                                        


                         
                                   
                                                         


                   


                      
                       
      
                                                 
                                                                  
                                                                              

            Fig. 5. Workﬂow for getting semantically related documents


If all projects which can be assigned to a given document were found, the SSE
executes another SPARQL query against the SSC. This query determines the
IDs of all ﬁles which can be associated with the found projects. For the project
accounting this means that all IDs of ﬁles are requested which are related to at
least Meyer and Computer7 or Schulz and Computer7.


                                            33
   Proceedings of the 1st International Workshop on Semantic Digital Archives (SDA 2011)


    Thereafter the results are converted into data objects which are separated
into diﬀerent sets. A data object not only contains the ID of an aﬀected ﬁle, it
also contains diﬀerent properties (ﬁle size, ...), attributes (system, ...) and the
last access time. Each set represents documents with a simple semantic relation
to the source ﬁle.
    The next step is a calculation of the degree of relationship of the determined
ﬁles. All data objects of the diﬀerent sets are merged into one response set. Data
objects that contain the property “removed” (Figure 3) are skipped. If a data
object is not already available in the response set, it is added with a counter
initialized with 1. This counter represents the number of determined semantic
relations (degree of relationship). If a data object is already available, it is not
repeatedly added. Instead the counter is incremented by 1. The result of this
calculation is a set of data objects that contains diﬀerent degrees of relationships
for documents regarding their binding to a source ﬁle.
    Before the response list can be generated, a recommendation has to be pre-
pared, if requested. If the request of the SMS contains a deletion as planned
action, the attributes of the source document are checked again. If there is a
“system” attribute, the recommendation will be false. Otherwise, all associated
projects are considered. Those projects have a speciﬁed period of time. If one of
them is active (the end point is in future), the source document is also classi-
ﬁed as active. In this case the recommendation for deletion or migration is set
to false. Otherwise, if all related projects are inactive (the end point is in the
past) the recommendation is set to true, which means a migration or deletion of
related ﬁles is possible. If the planned action is a retrieval, the recommendation
is inverted (Section 3.4).


Generating Response Lists If the request of the SMS contains a threshold
parameter (Section 3.4), the SSE makes sure that the sum of all ﬁle sizes in
the response list does not exceed this value. A response list is constructed as a
dual sorted list of data objects with an optional recommendation ﬂag (Figure 6).
Each data object represents a semantically related document to a source ﬁle. The
dual sorting oﬀers the SMS an additional beneﬁt on its processing. Furthermore,
the order depends on the planned action. Figure 6 illustrates an example for a
directed response list in case of a planned migration. The data objects themselves
are labeled from FID1 to FID9. The primary order is accomplished in accordance
with the number of semantic bindings for each data object (curved brackets) and
in the case of a migration the secondary order is accomplished with the ﬁle size,
starting with the biggest ﬁle. So in case of a migration, if documents own the
same semantic binding, the bigger ones are migrated ﬁrst. The advantage is that
just a few operations are necessary to create enough space and that a lot of
small ﬁles can remain on the fast storage tier. This is particularly useful if no
threshold value was submitted.
Response lists for simple requests are sorted the same way but do not contain a
recommendation bit. In case of a retrieval, the secondary order is based on the
ﬁle size, starting with the smallest one. This is done to retrieve as many actual


                                            34
    Proceedings of the 1st International Workshop on Semantic Digital Archives (SDA 2011)


    7UXH   ),'      ),'      ),'       ),'       ),'       ),'      ),'       ),'
           .%      .%      .%       .%       .%       .%      .%       .%

                         UHODWLRQV          UHODWLRQV                    UHODWLRQV

                 Fig. 6. A sample response list for a planned migration


documents as possible. However, on a deletion, the SSE does not consider ﬁle
size. Here, the secondary order is based on the last-access-time-property of data
objects. Thus, older ﬁles are listed ﬁrst.


4    Conclusion and Further Work

A great amount of data will force software designers to implement more eﬃcient
and scalable algorithms to optimize storage solutions. One way to reach this goal
is given by semantic web technologies.
    As we have showed in Section 2, there is a great lack of publications about
using semantic technologies in SMSs. Current approaches do not regard relations
between documents. Only simple hierarchies are used for classifying ﬁles and
folders.
    In this paper, we introduced the so-called Semantic Storage Extension (SSE)
(Section 3). The SSE is a software component which can be integrated in an ex-
isting SMS with just minimal eﬀort. To enable a semantic information extraction,
this architecture is completed by a SSC which handles extracted information by
using a specialized ontology schema to describe documents in a semantic way.
As K-IMM is used as a SSC, it is easy to add new extraction plug-ins to ana-
lyze future document formats and enable information retrieval in the way it is
necessary for the SSE. Through inference algorithms that are provided by the
SSC new relations between indexed documents can be found which cannot be
derived by methods like data mining and clustering. Using this knowledge, the
SSE can advise the SMS on planned actions for a collection of ﬁles. To fulﬁll
the requirements of diﬀerent domains, we developed an additional policy-based
approach to enable a domain expert to describe the application domain in detail.
With this architecture a SMS can decide on actions like migration, retrieval and
deletion by using semantic knowledge.
    For future work we plan to improve the implementation of this architecture in
the project HSM [3]. Thereby, we want to perform tests to proof our theoretical
evaluation and to show that the beneﬁt and the performance for a SMS increases.
Also, a detailed analysis need to be made on security issues (like authentication
and authorization), sorting of given answers and the way ﬁles are weighted in
response lists. At least the handling of concurrence issues between the SMS
operations and operations of the SSC and SSE needs to be improved.


                                             35
   Proceedings of the 1st International Workshop on Semantic Digital Archives (SDA 2011)


References
 1. Bloehdorn, S., Görlitz, O., Schenk, S.: TagFS - Tag Semantics for Hierarchical File
    Systems. Proceedings of the 6th International Conference on Knowledge Manage-
    ment I-KNOW’06 (2006)
 2. Chang, W., Masinter, L.: System and method of determining and
    recommending a document control policy for a document (2008),
    http://www.freepatentsonline.com/y2008/0059448.html
 3. Fritzsche,      R.:       HSM-Projekt         (2011),     http://www.mmt.inf.tu-
    dresden.de/Forschung/Projekte/HSM/
 4. Gnasa, M.: Fraunhofer IAIS: Text Mining und Information Retrieval (2009),
    http://www.iais.fraunhofer.de/4862.html
 5. Hackl, G., Pausch, W., Schönherr, S., Specht, G., Thiel, G.: Synchronous Meta-
    data Management of Large Storage Systems. Proceedings of the Fourteenth Inter-
    national Database Engineering & Applications Symposium pp. 2–7 (2010)
 6. Jurafsky, D., Martin, J.H.: Speech and Language Processing: An Introduction to
    Natural Language Processing, Computational Linguistics and Speech Recognition.
    Prentice Hall (2008)
 7. Mahalingam, M., Tang, C., Xu, Z.: Towards a semantic, deep archival ﬁle system.
    The Ninth IEEE Workshop on Future Trends of Distributed Computing Systems
    pp. 115–121 (2003)
 8. Mierswa, I.: Beatles vs . Bach : Merkmalsextraktion im Phasenraum von Audio-
    daten. Tagungsband der GI-Workshop-Woche Lernen - Lehren - Wissen - Adap-
    tivitat (2003)
 9. Mitschick, A.: Ontologiebasierte Indexierung und Kontextualisierung multimedi-
    aler Dokumente für das persönliche Wissensmanagement. Dissertation, University
    of Dresden (2010)
10. Mitschick, A.: Ontology-based Indexing and Contextualization of Multimedia Doc-
    uments for Personal Information Management Applications. International Journal
    on Advances in Software 3(1), 31–40 (2010)
11. Neuroth, H., Oß wald, A., Scheﬀel, R., Strathmann, S., Huth, K.: nestor Handbuch
    - eine kleine Enzyklopädie der digitalen Langzeitarchivierung Version 2.3 (2010),
    http://nestor.sub.uni-goettingen.de/handbuch/nestor-handbuch 23.pdf
12. Orio, N.: Music Retrieval: A Tutorial and Review. Foundations and Trends      R in
    Information Retrieval 1(1), 1–96 (2006)
13. Prud’hommeaux, E., Seaborne, A.: SPARQL Query Language for RDF (2008),
    http://www.w3.org/TR/rdf-sparql-query/
14. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput-
    ing Surveys 34(1), 1–47 (2002)
15. Troopens, U., Erkens, R., Müller-Friedt, W., Wolafka, R., Haustein, N.: Storage
    Networks Explained. John Wiley & Sons Ltd., Mainz, 2 edn. (2009)
16. Weippl, E.R., Klemen, M., Linnert, M., Fenz, S., Goluch, G., Tjoa, A.M.: Semantic
    Storage : A Report on Performance and Flexibility. Lecture Notes in Computer
    Science 3588, 586–595 (2005)
17. Xiong, M., Jin, H., Wu, S.: FDSSS: An Eﬃcient Metadata Management
    Scheme in Large Scale Data Environment. Fifth International Conference
    on Grid and Cooperative Computing Workshops pp. 71–77 (Oct 2006),
    http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4031532


                                            36

</pre>