Computing Recommendations for Long Term Data
    Accessibility basing on Open Knowledge and Linked Data

                 Sergiu Gordea                        Andrew Lindley                            Roman Graf
            AIT - Austrian Institute of           AIT - Austrian Institute of             AIT - Austrian Institute of
               Technology GmbH                       Technology GmbH                         Technology GmbH
             Donau-City-Strasse 1                  Donau-City-Strasse 1                    Donau-City-Strasse 1
                 Vienna, Austria                       Vienna, Austria                         Vienna, Austria
          sergiu.gordea@ait.ac.at               andrew.lindley@ait.ac.at                  roman.graf@ait.ac.at

ABSTRACT

Digital access to cultural heritage assets has been facilitated by the rapid development of digitization and by online publishing initiatives such as Europeana or the Google Books project. As Galleries, Libraries, Archiving institutions and Museums (GLAM) create digital representations of their masterpieces, new concerns arise regarding the long-term accessibility of digitized and born-digital content. Repository managers need to take well-documented decisions on which digital object representations to use for archiving or for long-term access to their valuable collections. The digital preservation recommender system presented in this paper aims at reducing the complexity of this decision making by providing support for the classification and the preservation risk analysis of digital objects. Technical information that is available as linked data in open knowledge sources facilitates the construction of the recommender's knowledge base. This paper presents the DiPRec recommender system, a community approach to generating well-founded and trusted recommendations from open linked data and inferred knowledge in the domain of long-term information preservation for GLAM institutions.

Categories and Subject Descriptors
H.3.7 [Information Systems Applications]: Digital Libraries; M.8 [Knowledge Management]: Knowledge Reuse

General Terms
Digital preservation, Recommender systems

Keywords
Knowledge based recommender, open recommendations, linked open data, preservation planning

1. INTRODUCTION

Knowledge based recommender systems (KBRs), the natural successors of expert systems, are nowadays used to support decision making in many application areas such as e-commerce, financial services and tourism. One of the most important challenges of KBRs is the construction of their underlying knowledge base. This typically consists of factual knowledge, i.e. information describing the application's domain, and of business rules. Together they allow conclusions to be drawn and support the decision making process when analyzing the utility of a specific item in a given context, for example when analyzing the effectiveness of digitizing and publishing Mircea Eliade's book "History of Religious Ideas" within Google Books.

Even though the World Wide Web has turned out to be the largest knowledge base, the information published there lacks a unified, well-formed representation and is mainly intended for human readers. The Linked Open Data (LOD)¹ and Open Knowledge² initiatives address these weaknesses by describing how to provide structured data in a well-defined and queryable format. By linking together and inferring properties from different independent and publicly available information sources like FreeBase³, DbPedia⁴ and Pronom⁵ within the specific context of a digital preservation scenario, we shortcut the well-known challenge of KBRs, the knowledge acquisition bottleneck.

In this paper we present our work carried out in the context of the Assets⁶ project with the aim of preparing the ground for digital preservation within Europeana⁷. The Europeana portal serves as a central point for the general public to easily explore and research European cultural and scientific heritage online. It aggregates data on digital resources from galleries, libraries, archives and museums across Europe and by now manages about 19 million object descriptions collected from more than 1,500 institutions. Within this very heterogeneous context it is easily understandable that digital objects are encoded in very heterogeneous file formats and versions across different hardware and software content repository systems. Depending on the underlying use case it is likely that multiple representations of the same 'physical' object exist at a time. For example, in most cases it is useful to provide access copies on demand which are easily distributable via the web, while the master record needs to adhere to different requirements, for example the institution's long-term scenario and preservation policy.

A key topic in preservation planning is the file formats used for encoding the digital information. The Pronom Unique Identifiers (PUIDs) registry provides persistent, unique and unambiguous identifiers for file formats and therefore plays a fundamental role in the process of managing electronic records. Currently it lists information on about 820 different PUIDs. While some of the formats are properly documented, open-source and well supported, others may be outdated, abandoned by software vendors and no longer functional in modern operating systems. Since the binary file's dependencies on the underlying platform, its configuration (codecs, plugins, etc.) and the rendering software are jointly responsible for generating the concrete performance presented to the user, it is vital to have a solid understanding of all of them. This process is costly and requires a high degree of engineering expertise. Many GLAM institutions already outsource IT related activities and do not have the resources to keep track of the required level of complexity in house.

The Digital Preservation Recommender (DiPRec) system addresses the topics of 'preservation watch' and 'preservation policy recommendation'. It proposes a solution in the domain of digital long-term preservation for making documented recommendations based on risk scores, while the underlying knowledge base is built through a linked data approach. Information from FreeBase, DbPedia and Pronom in the areas of file formats, file conversion tools, and hardware and software vendors is taken into account. The main contribution of this paper consists in the integration of open (general or domain specific) data when constructing knowledge based recommendations. The "knowledge acquisition bottleneck" and the high costs of setting up and maintaining KBRs are still an impediment to their extensive adoption by the industry. Recommendations provided by DiPRec are meant to support GLAM institutions across Europe in the process of analyzing their digital assets. Because the technical foundation and the explanation of the DiPRec recommendations are computed on top of shared and collaboratively built data sources, trust in the area of LOD and digital preservation is a key issue; it has been left out of this paper for simplicity.

The novelty of our work consists in combining expert tools (such as File, Droid or Fido) and automated object identification processes with structured information (e.g. technical information on file formats) from open data repositories. This information is used for inferring new knowledge, calculating preservation risks and finally for computing recommendations on preservation actions in the domain of digital long-term preservation. We present the rationale used for the construction of the DiPRec recommender through concrete examples of a content analysis which was provided for the Assets project. The rest of the paper is organized as follows. In Section 2 we present related work carried out on recommender systems and in the field of digital preservation. Section 3 highlights the architecture of DiPRec by comparing it against the construction of classical KBRs. The functionality provided by our system is explained in detail through a concrete example on the TIFF file format. The evaluation of our approach is presented in Section 4 by analyzing the digital collections of the Assets project. Section 5 summarizes the concluding remarks for our work.

¹ http://linkeddata.org/
² http://www.okfn.org/
³ http://www.freebase.com
⁴ http://dbpedia.org/
⁵ http://www.nationalarchives.gov.uk/PRONOM/
⁶ http://www.assets4europeana.eu/
⁷ http://www.europeana.eu/portal/

2. RELATED WORK

Knowledge based recommender systems gained broad popularity in e-commerce and e-tourism applications [7, 11, 24, 19], where they support customers in their decision making processes. The two most popular use cases are guidance through large and complex product offers (e.g. trip organization, feature selection of technical equipment) and accompanying the process of high cost decision making (e.g. financial investments). When designing the DiPRec recommender we took into consideration the Advisor Suite [12] and the Planets Testbed infrastructure [16]. The main component of the Advisor Suite is a multipurpose workbench which offers support and advanced graphical user interfaces for constructing knowledge based recommenders. Advisor Suite features include the import of product catalogues, visual editing of a recommendation workflow and the generation of a runtime environment. The Planets⁸ project focused on constructing practical services and tools for establishing empirical evidence in the process of informed decision making in the area of digital long-term preservation. A major achievement was the definition of basic nouns and verbs for core preservation operations. This makes it easy to combine and swap tools within a preservation workflow and led to more than fifty preservation services. Available services were deployed and tested within the Planets Testbed [22], a uniform environment for experimentation under well-defined and controlled conditions. It provides automated quality assurance support for tools like DROID⁹, JHOVE¹⁰ and the eXtensible Characterisation Languages¹¹ [5].

A key topic in preservation planning is the process of evaluating objectives under the limitation of well-known constraints. A state of the art report on technical requirements and standards, as well as on available tools to support the analysis and planning of preservation actions, is given in [2]. Strodl et al. present the Planets preservation planning methodology Plato¹² through an empirical evaluation of image scenarios [21] and demonstrate specific cases of recommendations for image content in four major National Libraries in Europe [4]. After eliciting information regarding the preservation scenario (user requirements), the Plato tool is able to recommend specific preservation actions [3] for a given scenario. The tool was specifically designed to work on samples of the underlying data set and is therefore able to make use of XCL or similar tools for automated quality assurance and semi-automated evaluation of objectives. In contrast to these scenario evaluations, DiPRec aims at collecting information on a broader range from open linked data registries and dynamic knowledge sources. It can evaluate more general, even 'non-technical' objectives (e.g. what is the risk that no software vendor will support old formats like Word 3 documents?). This is a significant improvement over the Plato tool, where all this information needs to be provided by domain experts.

The Scape¹³ project is one of the major current initiatives [18] on institutional preservation requirements and is partially funded by the European Union's FP7. Besides the issues of scalable preservation and quality-assured preservation workflows, the project also addresses the topic of policy-based preservation planning and watch.

The paradigms of the semantic Web and linked open data [6] transform the web from a pool of information into a valuable knowledge source of data according to the definitions of a knowledge management theory [17]. The exploitation of linked data as a knowledge source for recommender systems started as a research topic in the last few years and was first applied to improve case-based and collaborative filtering recommenders [10, 9, 20]. In [20] the authors present the Talis Aspire system, which is able to assist educational staff in picking educational web resources. The employment of linked data in collaborative filtering and case-based reasoning was explored by Heitmann and Hayes in [9] and [10].

⁸ http://www.planets-project.eu/
⁹ http://droid.sourceforge.net/
¹⁰ http://hul.harvard.edu/jhove/
¹¹ http://planetarium.hki.uni-koeln.de/public/XCL/
¹² http://www.ifs.tuwien.ac.at/dp/plato/intro.html
¹³ http://www.scape-project.eu/

3. SYSTEM OVERVIEW

Typically the creation of classic knowledge based recommender systems consists of three main tasks. First, detailed descriptions of the product offers are collected. Second, a recommendation knowledge base is constructed (see Section 3.2). Third, at runtime, user requirements are elicited and recommendations are computed from the underlying recommendation knowledge base and the items that match the given user requirements. DiPRec follows the same process but improves the way the knowledge base is built in order to reduce the effort spent on domain knowledge acquisition. This is especially relevant in GLAM preservation scenarios, where the underlying knowledge base contains broader information than that of domain specific KBRs. Within the DiPRec recommender the Domain Information Aggregation module is responsible for collecting file format related information (e.g. formats, vendors, applications, etc.) from the open knowledge bases Pronom, DBPedia and Freebase. Furthermore, the Domain Knowledge Aggregation module combines the outcome of a risk analysis process with the knowledge manually provided by domain experts. Figure 1 compares the process used by regular KBRs with the one used by DiPRec, which enhances the process of building the underlying knowledge base. In the following sections we present details on how the knowledge base of DiPRec is built, using the Tagged Image File Format (TIFF) as an example.

[Figure 1 diagram. Panel a) Classic KBR recommender: Domain Expert Knowledge, Domain Information Acquisition (description and data), Knowledge Base, User Requirements, Recommendation Engine, Recommendations. Panel b) DiPRec recommender: Open Domain Knowledge (Pronom, UDFR), Domain Expert Knowledge Acquisition, Domain Knowledge Aggregation, Knowledge Base, User Requirements, Recommendation Engine, Recommendations, Domain Information Aggregation, Format Identification, Linked Open Data (DbPedia, FreeBase).]

Figure 1: A comparison of regular KBRs and DiPRec recommender processes

The TIFF format is still very popular in the publishing industry, as it is a very adaptable file format, although it has not had a major update since 1992. It was originally created by Aldus and, as of 2009, is under the control of Adobe Systems. There are a number of extensions available (e.g. TIFF/IT, TIFF-FX) which are based on the TIFF 6.0 specification, but not all of them are broadly used. A standard and broadly accepted approach in the archiving world is the migration of TIFF encoded content to the JPEG2000 format. In [4, 2] one can find the context in which several content providers took the decision to perform this kind of content migration. However, within these scenarios the context evaluation and the recommendation were computed by domain experts and by expert systems.

The DiPRec system follows the approach of well-documented and trackable decision making and at the same time uses a semi-automatic approach to domain knowledge acquisition. This reduces the human effort invested by domain experts when providing preservation recommendations and reduces the financial effort invested in the context evaluations, while still offering good quality recommendations.
                                   FILE FORMAT DESCRIPTION
     Format Name                    Tagged Image File Format (P), Tagged Image File Format (D), Tagged Image File Format (F)
     Pronom Id                      fmt/10 (P)
     Mime Type                      /media type/image/tiff-fx, /media type/image/tiff (F), image/tiff (P)
     File Extensions                .tiff, .tif (D)
     Current Version                6 (P)
     Current Version Release Date   03 Jun 1992 (P)
     Software License               Proprietary software (D)
     Software                       QuickView Plus, Acrobat, AutoCAD, CorelDraw, Freemaker, GoLive, Illustrator, Photoshop, Powerpoint (P), SimpleText, Seashore, Imagine (D)
     Software Homepage              http://adobe.com/photoshop (D)
     Operating System               PC, Mac OS X, Microsoft Windows (D)
     Genre                          Image (Raster) (P), Image file format (I), SimpleText - Text editor, Adobe Photoshop - Raster graphics editor (D)
     Open Format                    none (P)
     Standards                      ISO 12639:2004 (W)
     Vendors                        Aldus, Adobe Systems, Apple Computer, now Apple Inc., Microsoft (D), Adobe Systems Incorporated (P), Aldus Corporation (P)

                                   VENDOR DESCRIPTION
     Organization Name              Adobe Systems
     Country                        United States (P)
     Foundation date                Dec 1982 (F)
     Number of Employees            6068 (Jan 2007), 8660 (2009) (F), 9,117 (2010) (W)
     Revenue                        3,579,890,000 US$ (Nov 28, 2008) (F)
     Homepage                       http://adobe.com/photoshop (F)

Table 1: File format and vendor description. (Information sources: P = Pronom, D = DBPedia, F = Freebase, W = Wikipedia)
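
Table 1 merges values from several registries and tags each one with its source. A minimal sketch of such a source-tagged record is shown below. The names `SourcedValue` and `unify` are hypothetical, and the single-letter source codes are taken from the table legend; this is an illustration of the idea, not DiPRec's actual property model.

```python
from collections import defaultdict
from dataclasses import dataclass

# Source codes follow the legend of Table 1.
SOURCES = {"P": "Pronom", "D": "DBPedia", "F": "Freebase", "W": "Wikipedia"}


@dataclass(frozen=True)
class SourcedValue:
    """One property value together with the registry it came from."""
    value: str
    source: str  # a key of SOURCES


def unify(raw_records):
    """Group (property, value, source) triples by property name.

    Hypothetical helper: exact duplicates are dropped, but the same value
    reported by two different sources is kept twice so provenance survives.
    """
    unified = defaultdict(list)
    for prop, value, source in raw_records:
        if source not in SOURCES:
            raise ValueError(f"unknown source code: {source}")
        entry = SourcedValue(value.strip(), source)
        if entry not in unified[prop]:
            unified[prop].append(entry)
    return dict(unified)


# A few rows of Table 1, flattened into triples.
rows = [
    ("Pronom Id", "fmt/10", "P"),
    ("File Extensions", ".tiff", "D"),
    ("File Extensions", ".tif", "D"),
    ("Vendors", "Adobe Systems", "D"),
    ("Vendors", "Adobe Systems Incorporated", "P"),
    ("Vendors", "Adobe Systems", "D"),  # exact duplicate, dropped
]

record = unify(rows)
for prop, values in record.items():
    print(prop, "->", ", ".join(f"{v.value} ({v.source})" for v in values))
```

Keeping the source next to each value is what later allows a recommendation to be explained in terms of where its evidence came from.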


3.1 Domain Information Aggregation

In contrast to the e-commerce domain, where KBRs import detailed item descriptions from product catalogs, there is no such catalog for computer file formats. The Unified Digital Format Registry (UDFR)¹⁴ project was started in 2009 by a group of universities and GLAM institutions with the aim of building a single, shared technical registry for file formats based on a semantic web and linked data approach. The project is based on the Pronom database, which provides basic information about a large number of file formats, and will be extended by data on migration pathways and available software/tools. The registry should be available from the beginning of 2012. As Pronom data alone is not rich enough to build a recommendation and reasoning mechanism for preservation scenarios of file formats, we collect additional information sources and aggregate them into a single homogeneous property representation in the recommender's knowledge base. DiPRec uses two types of operations for aggregating domain information:

       • data unification: the data representations retrieved from different knowledge bases are unified and combined under the DiPRec property model definition. For example, the number of software tools supporting a given file format is calculated over different data sources. The individual object's namespace, the transformation process of values, the query used to extract a given record, etc. are preserved and are part of the property's model representation.

       • property composition: more abstract properties which require a hierarchical composition are computed by aggregating basic properties using weights. The model for property definition is meant to be kept very simple. For example, "supported by major vendors" checks whether at least one of the software companies qualifies as a major vendor, which is decided by combining properties such as "NUMBER OF EMPLOYEES" and "VENDOR REVENUE" (see Table 2).

When aggregating domain information we query external knowledge sources like DbPedia and Freebase, which manage huge amounts of linked open data triples. This allows us to extract fragmentary descriptions of file formats, software applications and vendors supporting given file formats (see Table 1). DbPedia supports sophisticated queries, expressed with the SPARQL query language and the OWL ontology language [13], over data available in Wikipedia. Freebase [15] is a practical, scalable semantic database for structured knowledge and is mainly composed and maintained by community members. Public read/write access to Freebase is provided through a graph-based query API using the Metaweb Query Language (MQL) [6]. PRONOM data is released as linked open data and is accessible through a public SPARQL endpoint.

                         AGGREGATED PROPERTIES
                          File format related
     Is supported by major software vendors?           yes
     Is an open file format?                           no
     Is widely supported by current web browsers?      yes
     Which versions officially supported by vendor?    6.0
     Which versions are frequently used?               6.0
     Image file compression supported?                 yes
                     Preservation related metadata
     Is creator information available?                 yes/no
     Is publisher information available?               yes/no
     Is digital rights information available?          yes/no
     Is file migration allowed?                        yes/no
     Object creation date?                             datetime
     Is an object preview available?                   URL

Table 2: Sample compound properties.

¹⁴ http://www.udfr.org/
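
The property composition step can be illustrated with the "supported by major vendors" property from Table 2. The sketch below is one hypothetical reading of it: the weights, the normalisation caps and the 0.5 threshold are invented for illustration (the paper does not publish concrete values), and only the Adobe figures are taken from Table 1.

```python
from dataclasses import dataclass


@dataclass
class Vendor:
    name: str
    employees: int
    revenue_usd: float


# Assumed weights, caps and threshold; none of these come from the paper.
WEIGHTS = {"employees": 0.5, "revenue": 0.5}
EMPLOYEE_CAP = 10_000
REVENUE_CAP = 1_000_000_000
MAJOR_THRESHOLD = 0.5


def vendor_score(v: Vendor) -> float:
    """Weighted combination of two basic properties, each scaled to [0, 1]."""
    employees = min(v.employees / EMPLOYEE_CAP, 1.0)
    revenue = min(v.revenue_usd / REVENUE_CAP, 1.0)
    return WEIGHTS["employees"] * employees + WEIGHTS["revenue"] * revenue


def supported_by_major_vendor(vendors) -> bool:
    """Compound property: true if at least one vendor counts as major."""
    return any(vendor_score(v) >= MAJOR_THRESHOLD for v in vendors)


# Adobe figures from Table 1; the second vendor is made up.
adobe = Vendor("Adobe Systems", employees=8660, revenue_usd=3_579_890_000)
small = Vendor("Hypothetical Tools Ltd", employees=12, revenue_usd=400_000)

print(supported_by_major_vendor([small, adobe]))
print(supported_by_major_vendor([small]))
```

Scaling each basic property to a common range before weighting keeps the compound property simple to define, in line with the stated goal of a very simple property model.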


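
Section 3.1 notes that DbPedia can be queried with SPARQL. A hedged sketch of how such a request could be assembled with the Python standard library follows. The endpoint URL `https://dbpedia.org/sparql`, the resource IRI used for TIFF and the `format` request parameter are assumptions about the public DbPedia service, not details given in the paper. The code only builds the request and does not contact the network.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed public endpoint and resource IRI; adjust for the actual deployment.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"
TIFF_RESOURCE = "http://dbpedia.org/resource/TIFF"


def build_property_query(resource_iri: str, limit: int = 100) -> str:
    """Return a SPARQL query listing property/value pairs of one resource."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        "SELECT ?property ?value WHERE { "
        f"<{resource_iri}> ?property ?value . "
        "} "
        f"LIMIT {limit}"
    )


def build_request_url(endpoint: str, query: str) -> str:
    """Encode the query as an HTTP GET URL using the 'query' parameter."""
    # The 'format' parameter is an assumption about the endpoint software.
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{endpoint}?{urlencode(params)}"


query = build_property_query(TIFF_RESOURCE, limit=50)
url = build_request_url(DBPEDIA_ENDPOINT, query)

# Decode the URL again to confirm the query survives URL encoding unchanged.
decoded = parse_qs(urlparse(url).query)["query"][0]
print(query)
print(decoded == query)
```

The generic triple pattern returns raw property/value pairs; mapping them onto rows such as those of Table 1 would be the job of the data unification step.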
            Media File      Domain Knowledge
                               Aggregation
                                                   PRESERVATION
                                                        RISK
                                                                            property set and consequently contributes to the risk com-
putation over a given dimension is modeled through the introduction of specific weighting factors (see Equation 1).

The value of the overall risk score for a given collection of objects is computed as a weighted sum over all digital preservation dimensions:

$$R_i = \sum_{ps \in PS_i} w_{ps,i} \cdot \sum_{p \in PROP_{ps}} w_{p,ps} \cdot d(p, PFV(p)) \qquad (1)$$

Where $R_i$ represents the preservation risk computed over the dimension $i$. $ps$ represents the index of the current property set within all sets associated to the dimension $i$. $w_{ps,i}$ is the weight of the contribution of the property set $ps$ to the dimension $i$. Similarly, $p$ stands for the index of current properties within the list of properties available in the given property set $PROP_{ps}$. $w_{p,ps}$ denotes the importance of a property $p$ for the property set $ps$. The distance between the current property and the defined - 'preservation conform' - value for this property is represented through $d(p, PFV(p))$.

Figure 2: Domain knowledge aggregation process. (Diagram labels: Digital Preservation Tools (Droid/Pronom), File Metadata, Open Knowledge Sources (DBPedia, Freebase, PRONOM), Domain Information Aggregation, Property Sets, Score / Report.)

3.2 Domain Knowledge Aggregation

Pronom as presented before is a viable resource for anyone requiring impartial and definitive information about the file formats, software products and other related data. Extremely valuable to the DiPRec recommender is the information related to the file conversion tools based on a given PUID. Therefore we employ the Droid¹⁵ characterization service for automatically extracting technical metadata and
identifying file formats from physical media files. This metadata is then used within the domain knowledge aggregation process presented in Fig. 2.

The risk analysis module is in charge of evaluating the information previously aggregated in the DiPRec knowledge base for a given record at hand over the following (exemplary) dimensions of digital preservation:

      • Web accessibility: Dissemination copies are published and accessible on e.g. the content provider's web portal. There should be previews of objects (e.g. thumbnails for images, video summaries, short intros for audio files) and 'rich' object descriptions to increase their visibility and retrieval. The chosen file representations should render in the latest browsers without plugin support and cope with modern features (e.g. pseudo streaming, progressive image display, HTML5, X3D, etc.). Content is made available through different exploitation channels.

      • Archiving and costs: The decision to follow a specific institutional preservation policy for a given technology is heavily influenced by given hardware and budget constraints. Future costs of content exploitation need to be predicted and taken into account.

Other scenarios may include:

      • Provenance metadata

      • Data exchange and collaborative data enrichment

      • Publishing and digital rights management

The definition of preservation dimensions is not orthogonal and therefore certain properties might be involved more than once when computing different risk scores. For management and maintenance reasons properties are also grouped by sets, and a property may belong to one or more property sets. The extent to which a property belongs to a

¹⁵ http://sourceforge.net/projects/droid/

3.3 User requirements elicitation

DiPRec is designed to work as a multi-purpose digital preservation support tool which can be used in various scenarios by different types of customers. For example, the tool may support content providers in analyzing the 'preservation friendliness' of their infrastructure, their archiving solutions or the visibility of their artifacts published in the Europeana portal. Recommendations are always to be seen in the context in which the digital objects are used. Within the scope of the Assets project there is a common interest to offer public access to digital assets through the Europeana portal (i.e. web discovery), to provide advanced search functionality (i.e. description richness and preservation of provenance information) and to address the data archiving dimension.

As a result of the requirements elicitation process, user profiles are created. A set of multiple choice questions is used to distinguish the relevant dimensions of available preservation objectives. Depending on the level of complexity, the user's role and the required domain knowledge, the system offers a subset of questions that the user can readily understand and that best allow them to express their needs. Fig. 3 presents a sample workflow which could be used to determine a given user profile. For example, a private user (ut = private person) with a solid level of IT knowledge (itk = expert) will be asked about preferred encodings and compression types of the digital content, while others would define attributes about storage limitations and upload samples of a given collection.

3.4 Recommendation computation

In contrast to classic KBRs, where the application's scope is very well delimited in terms of selecting the best matching items in a list of known possibilities, the DiPRec system relies on expressing an institutional preservation context in the form of user requirements that are combined with the knowledge acquired about threats to long-term accessibility. We employ tools to evaluate the content of a given collection from a technical point of view and to generate fine-grained preservation risk scores. When records are identified as having vulnerabilities on certain preservation dimensions, a


rule-based engine such as JBoss Drools¹⁶ is used to propose appropriate preservation actions. The set of available business rules is defined by domain experts in the form of simple IF-THEN-ELSE rules. These rules are neither complete nor meant to be non-overlapping. Unified tool access for processing executable preservation plans is provided through the Assets preservation normalisation framework, which is able to invoke the tools with exactly defined settings and parameter configurations.

Figure 3: User requirements elicitation workflow

    IF (rac > 0.5 AND ia == true AND iwa == true AND open_format == FALSE)
    THEN migrate(preservation_format)

    IF (content_type == IMAGE)
    THEN preservation_format = (JPEG/2000:1, TIFF/6:0.8)

    IF (file_format == TIFF/5 AND preservation_format == JPEG/2000)
    THEN migration_tool = IMAGE_MAGICK                                  (2)

The preservation recommendations are computed using constraint satisfaction problem (CSP) theory [8, 11]. Constraints are defined within the preservation actions knowledge base, the CSP context is defined by user profiles, and the preservation risks are identified for the given data collection. The recommendations are represented in the form of preservation actions. For example, the set of business rules defined above, combined with a user profile indicating interest in the dimensions of archiving and web accessibility, will lead to the following recommendation when analyzing a collection of images in TIFF format:

    migrate(TIFF/5, JPEG/2000, IMAGE_MAGICK)

In free text translation, the recommendation will suggest the migration of the files available in TIFF/5 format to JPEG/2000 by using the ImageMagick software with standard settings.

¹⁶ http://www.jboss.org/drools

4. EVALUATION

The evaluation of the first prototype of DiPRec was conducted within the scope of the Assets project. Ten partners of the project consortium provided metadata and binary content (10 collections with a total size of 516GB contained in 368067 media files) for supporting the development and testing of services developed within the scope of the project. The first step in the evaluation process was the identification of file formats, the definition of property sets and the aggregation of the domain knowledge available in open knowledge bases on these file formats.

Table 3 lists the distribution of file formats by content type. Even though the experimental data was taken from a small number of content providers, we discovered a variety of 18 formats in 38 different versions used for encoding the digital content.

 Content Type     File Format   # Versions    # Files
 TEXT             TXT           1             4
 TEXT             DOC           1             16
 TEXT             XML           1             20101
 TEXT             HTML          1             1205
 IMAGE            JPG           8             323332
 IMAGE            PSD           1             3
 IMAGE            PNG           4             1228
 IMAGE            BMP           2             141
 IMAGE            GIF           2             1066
 IMAGE            TIFF          4             4
 IMAGE            PDF           16            25008
 AUDIO            MP3           1             3634
 VIDEO            FLV           1             9468
 VIDEO            MPEG4         1             935
 VIDEO            MPEG1         1             3074
 VIDEO            MPEG3         1             3074
 3D               PLY           1             50
 3D               DAE           1             307

Table 3: Distribution of file formats in Assets collections.

The Digital Record Object Identification tool (DROID) version 5, signature file 45, was executed through the Assets preservation normalisation tool suite and was able to successfully identify file formats in 95 percent of the cases through its binary signature method, except for the 3D model objects, which are not yet covered by Pronom. Appropriate information on all of the file formats was contained in DbPedia and Freebase, and the domain knowledge acquisition process was completed by successfully computing the preservation risk analysis scores.

The second part of the evaluation consisted of computing recommendations for the given content. To this end, we created a user profile for content providers that are interested in making their content accessible through Europeana. Within this context, the content providers are interested in the web accessibility dimension of digital preservation.

The highest diversity of file formats was found in the image collections. The recommendation to migrate these files to the JPEG 2000 format did not get a high priority


and will be performed within the next period of scheduled storage migration. The ImageMagick tool was the recommended choice for performing this transformation action. The whole audio content available in Assets was provided in the mp3 format and no recommendation was made for transforming audio collections. The most restrictive constraints for web accessibility are defined for the video content. The pseudostreaming protocol is an advanced technological solution used for distributing information over the web. It allows the user to interact with the media player and to quickly navigate within the content without the need to download the entire media file. This protocol is supported by two file formats: flash video (FLV) and MPEG4 with H.264 video encoding. It has native support in HTML5 and is used in HTML4 with an adequate browser plugin. A part of the Assets content is already available in FLV format and another part is available in MPEG1 or MPEG2. The resulting DiPRec recommendation is to migrate the content to FLV by using the ffmpeg¹⁷ tool.

5. CONCLUSION

Within this paper we introduced the DiPRec recommender system, an expert support tool in the domain of digital long-term preservation for GLAMs. An important contribution of this paper is the exploitation of an open linked data approach for constructing the recommender's knowledge base, built upon open registries such as DbPedia and Pronom. Since the knowledge acquisition, aggregation and unification process is fully automated, it is easy to upgrade the recommender's knowledge base.

We looked at preservation planning, which is the process of specifying clearly defined and relevant trees of objectives in a defined preservation dimension and evaluating them within a given (institutional) context to generate well-documented decisions. DiPRec is able to advance the process with inferred community knowledge and reduces the amount of manual evaluation and the technical expertise required in this process.

An important concern related to KBRs is the trust in the provided recommendations. This is especially relevant for the digital preservation domain, where we deal with a large amount of multimedia material and the execution of the preservation actions is associated with considerable costs. Within this paper we did not examine the completeness, correctness and quality of the underlying data. We argue, however, that data from open knowledge bases like DbPedia or Freebase could, through their underlying community approach, protect against biases introduced by the economic interests of professional companies.

The tool has been designed by reusing our past experience in building knowledge-based and case-based recommender systems [23, 8] and combining it with our expertise in creating long-term preservation infrastructure and applications [14, 1]. Based on this work, the Assets normalisation tool suite is able to automate the process of object identification and characterisation and therefore directly integrates within the property evaluation, risk analysis and recommendation process for a given record. We presented a first evaluation of digital content provided by national libraries and archives through the Assets project, where the underlying concepts of the DiPRec approach were proven to work adequately.

¹⁷ http://www.ffmpeg.org/

6. ACKNOWLEDGMENTS

This work was partially supported by the EU project "ASSETS - Advanced Search Services and Enhanced Technological Solutions for the European Digital Library" (CIP-ICT PSP-2009-3, Grant Agreement n. 250527).

7. REFERENCES

 [1] Aitken, B., Helwig, P., Jackson, A., Lindley, A.,
     Nicchiarelli, E., Ross, S.: The planets testbed: Science
     for digital preservation. Code4Lib 1(3) (2008),
     http://journal.code4lib.org/articles/83
 [2] Becker, C., Kulovits, H., Guttenbrunner, M., Strodl,
     S., Rauber, A., Hofman, H.: Systematic planning for
     digital preservation: evaluating potential strategies
     and building preservation plans. International Journal
     on Digital Libraries 10(4), 133–157 (2009),
     http://dblp.uni-trier.de/db/journals/jodl/jodl10.html#BeckerKGSRH09
 [3] Becker, C., Kulovits, H., Rauber, A., Hofman, H.:
     Plato: a service-oriented decision support system for
     preservation planning. In: Proceedings of the 8th
     ACM/IEEE-CS joint conference on Digital libraries.
     pp. 367–370. ACM, New York (2008),
     http://publik.tuwien.ac.at/files/PubDat_170832.pdf,
     talk: 8th ACM/IEEE-CS joint conference on
     Digital libraries (JCDL 2008), Pittsburgh,
     Pennsylvania; 2008-06-16 – 2008-06-20
 [4] Becker, C., Rauber, A.: Four cases, three solutions:
     Preservation plans for images. Tech. rep., Vienna
     University of Technology, Vienna, Austria (April 2011)
 [5] Becker, C., Rauber, A., Heydegger, V., Schnasse, J.,
     Thaller, M.: A generic xml language for characterising
     objects to support digital preservation. In: SAC '08:
     Proceedings of the 2008 ACM symposium on Applied
     computing. pp. 402–406. ACM, New York, NY, USA
     (2008)
 [6] Bizer, C., Heath, T., Berners-Lee, T.: Linked data -
     the story so far. Int. J. Semantic Web Inf. Syst. 5(3),
     1–22 (2009)
 [7] Burke, R.D.: Hybrid web recommender systems. In:
     The Adaptive Web. pp. 377–408 (2007)
 [8] Felfernig, A., Gordea, S.: Ai technologies supporting
     effective development processes for knowledge-based
     recommender applications. In: SEKE. pp. 372–379
     (2005)
 [9] Heitmann, B., Hayes, C.: Using linked data to
     build open, collaborative recommender systems. In:
     AAAI Spring Symposium: Linked Data Meets
     Artificial Intelligence (2010)
[10] Heitmann, B., Hayes, C.: Enabling case-based
     reasoning on the web of data. In: The WebCBR
     Workshop on Reasoning from Experiences on the Web
     (2010)
[11] Jannach, D., Zanker, M., Fuchs, M.: Constraint-based
     recommendation in tourism: A multiperspective case
     study. J. of IT & Tourism 11(2), 139–155 (2009)
[12] Jannach, D., Zanker, M., Jessenitschnig, M., Seidler,
     O.: Developing a conversational travel advisor with
     advisor suite. In: ENTER'07. pp. 43–52 (2007)
[13] Lehmann, J., Schüppel, J., Auer, S.: Discovering unknown
     connections - the dbpedia relationship finder. In:
     Proceedings of the 1st Conference on Social Semantic


     Web (CSSW). vol. P-113, pp. 99–109. Gesellschaft für
     Informatik, Leipzig, Germany (2007)
[14] King, R., Schmidt, R., Jackson, A., Wilson, C., Steeg,
     F.: The planets interoperability framework: An
     infrastructure for digital preservation actions. In:
     ECDL09 Proceedings of the 13th European conference
     on Research and advanced technology for digital
     libraries. vol. 5714/2009, pp. 425–428. Springer-Verlag
     (2009),
     http://dx.doi.org/10.1007/978-3-642-04346-8_50
[15] Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.:
     Freebase: a collaboratively created graph database for
     structuring human knowledge. In: SIGMOD ’08
     Proceedings of the 2008 ACM SIGMOD international
     conference on Management of data. pp. 1247–1249.
     ACM, New York, NY, USA (2008)
[16] Lindley, A., Jackson, A.N., Aitken, B.: A collaborative
     research environment for digital preservation - the
     planets testbed. Enabling Technologies, IEEE
     International Workshops on 0, 197–202 (2010)
[17] Nonaka, I., Takeuchi, H.: The Knowledge-Creating
     Company: How Japanese Companies Create the
     Dynamics of Innovation. Oxford University Press
     (May 1995)
[18] Orit Edelstein, Michael Factor, R.K.T.R.E.S.P.T.:
     Evolving domains, problems and solutions for long
     term digital preservation. iPRES 2011 - 8th
     International Conference on Preservation of Digital
     Objects (2011)
[19] Ricci, F., Werthner, H.: Case base querying for travel
     planning recommendation. Journal of IT & Tourism
     4(3-4), 215–226 (2001), http://dblp.uni-trier.de/
     db/journals/jitt/jitt4.html#RicciW01
[20] Shabir, N., Clarke, C.: Using linked data as a basis for
     a learning resource recommendation system. In: 1st
     International Workshop on Semantic Web
     Applications for Learning and Teaching Support in
     Higher Education (SemHE’09) (September 2009),
     http://eprints.ecs.soton.ac.uk/18053/
[21] Strodl, S., Becker, C., Neumayer, R., Rauber, A.: How
     to choose a digital preservation strategy: evaluating a
     preservation planning procedure. In: JCDL ’07:
     Proceedings of the 2007 conference on digital libraries.
     pp. 29–38. ACM, New York, NY, USA (2007),
     http://doi.acm.org/10.1145/1255175.1255181
[22] Sven Schlarb, Edith Michaelar, M.K.A.L.B.A.S.R.A.J.:
     A case study on performing a complex file-format
     migration experiment using the planets testbed. IS&T
     Archiving Conference 7, 58–63 (2010)
[23] Zanker, M., Gordea, S., Jessenitschnig, M., Schnabl,
     M.: A hybrid similarity concept for browsing
     semi-structured product items. In: EC-Web. pp. 21–30
     (2006)
[24] Zanker, M., Jessenitschnig, M., Jannach, D., Gordea,
     S.: Comparing recommendation strategies in a
     commercial context. IEEE Intelligent Systems 22(3),
     69–73 (2007)
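
As an illustration of Equation 1 (Section 3.2), the weighted risk score for one preservation dimension can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the DiPRec implementation: the class names, the function name `risk_score`, the example weights, and the idea that the distances d(p, PFV(p)) are normalised to [0, 1] are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Property:
    weight: float    # w_{p,ps}: importance of the property within its set
    distance: float  # d(p, PFV(p)): distance to the preservation-conform value
                     # (assumed here to be normalised to [0, 1])


@dataclass
class PropertySet:
    weight: float               # w_{ps,i}: contribution of the set to dimension i
    properties: List[Property]  # PROP_ps


def risk_score(property_sets: List[PropertySet]) -> float:
    """R_i: weighted sum over all property sets of one dimension (Equation 1)."""
    return sum(
        ps.weight * sum(p.weight * p.distance for p in ps.properties)
        for ps in property_sets
    )


# Hypothetical example: one dimension described by two property sets.
web_accessibility = [
    PropertySet(0.6, [Property(0.5, 1.0), Property(0.5, 0.0)]),
    PropertySet(0.4, [Property(1.0, 0.25)]),
]
print(round(risk_score(web_accessibility), 3))  # 0.6 * 0.5 + 0.4 * 0.25 = 0.4
```

Because the preservation dimensions are not orthogonal, the same property may appear in several property sets; in this sketch it would simply be listed once per set with its own weight.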


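
The three business rules of rule set (2) in Section 3.4 can be read as a small decision procedure. The sketch below is illustrative only: DiPRec uses a rule engine (JBoss Drools), whereas this example replaces it with ordinary Python conditionals. The dictionary layout and the function name `recommend` are hypothetical. The section does not define the rule variables, so their meaning here is an assumption: `rac` is read as the risk score for the archiving and costs dimension, and `ia` and `iwa` as user-profile flags for interest in archiving and in web accessibility.

```python
from typing import Optional, Tuple

# Rule 2: candidate preservation formats per content type, with preference weights.
PRESERVATION_FORMATS = {"IMAGE": [("JPEG/2000", 1.0), ("TIFF/6", 0.8)]}

# Rule 3: migration tool for a (source format, target format) pair.
MIGRATION_TOOLS = {("TIFF/5", "JPEG/2000"): "IMAGE_MAGICK"}


def recommend(record: dict, profile: dict) -> Optional[Tuple[str, str, str, str]]:
    """Return a migrate(...) action as a tuple, or None if no rule chain fires."""
    # Rule 1: migrate only for risky, non-open formats and an interested user.
    if not (record["rac"] > 0.5 and profile["ia"] and profile["iwa"]
            and not record["open_format"]):
        return None
    candidates = PRESERVATION_FORMATS.get(record["content_type"], [])
    if not candidates:
        return None
    # Pick the candidate format with the highest preference weight.
    target = max(candidates, key=lambda fmt: fmt[1])[0]
    tool = MIGRATION_TOOLS.get((record["file_format"], target))
    if tool is None:
        return None
    return ("migrate", record["file_format"], target, tool)


# Hypothetical record and profile values, chosen to trigger all three rules.
record = {"rac": 0.7, "open_format": False,
          "content_type": "IMAGE", "file_format": "TIFF/5"}
profile = {"ia": True, "iwa": True}
print(recommend(record, profile))
```

With these example values the sketch yields the same action as the recommendation quoted in Section 3.4, migrate(TIFF/5, JPEG/2000, IMAGE_MAGICK). A real rule engine would additionally handle overlapping and incomplete rule sets, which this sketch does not attempt.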