Computing Recommendations for Long Term Data Accessibility basing on Open Knowledge and Linked Data

Sergiu Gordea, Andrew Lindley, Roman Graf
AIT - Austrian Institute of Technology GmbH
Donau-City-Strasse 1, Vienna, Austria
sergiu.gordea@ait.ac.at, andrew.lindley@ait.ac.at, roman.graf@ait.ac.at

ABSTRACT

Digital access to our cultural heritage assets has been facilitated by the rapid development of the digitization process and of online publishing initiatives such as Europeana or the Google Books project. As Galleries, Libraries, Archiving institutions and Museums (GLAM) created digital representations of their masterpieces, new concerns arose regarding the long-term accessibility of digitized and digitally born content. Repository managers of institutions need to take well-documented decisions with regard to which digital object representations to use for archiving or long-term access to their valuable collections. The digital preservation recommender system presented in this paper aims at reducing the complexity of the decision making process by providing support for classification and for the preservation risk analysis of digital objects. Technical information which is available as linked data in open knowledge sources facilitates the construction of the DiPRec recommender's knowledge base. This paper presents the DiPRec recommender system, a community approach to generating well-founded and trusted recommendations through open linked data and inferred knowledge in the domain of long-term information preservation for GLAM institutions.

Categories and Subject Descriptors
H.3.7 [Information Systems Applications]: Digital Libraries; M.8 [Knowledge Management]: Knowledge Reuse

General Terms
Digital preservation, Recommender systems

Keywords
Knowledge based recommender, open recommendations, linked open data, preservation planning

1. INTRODUCTION

Knowledge based recommender systems (KBRs), as natural followers of expert systems, are nowadays used for supporting the decision making process in multiple application areas such as e-commerce, financial services and tourism. One of the most important challenges of KBRs is the construction of their underlying knowledge base. This is typically composed of sets of factual knowledge, i.e. information describing the application's domain, and business rules. Both together enable the drawing of conclusions and support the decision making process when analyzing the utility of a specific item in a given context, for example when analyzing the effectiveness of digitizing and publishing Mircea Eliade's book "History of Religious Ideas" within Google Books.

Even though the world wide web has turned out to be the largest knowledge base, the information published there lacks a unified, well-formed representation and is mainly intended for human readers. The Linked Open Data (LOD, http://linkeddata.org/) and Open Knowledge (http://www.okfn.org/) initiatives address these weaknesses by describing a method for providing structured data in a well-defined and queryable format. By linking together and inferring properties of different independent and publicly available information sources like FreeBase (http://www.freebase.com), DbPedia (http://dbpedia.org/) and Pronom (http://www.nationalarchives.gov.uk/PRONOM/) within the specific context of a digital preservation scenario, we shortcut the well-known challenge of KBRs, the knowledge acquisition bottleneck.

In this paper we present our work carried out in the context of the Assets project (http://www.assets4europeana.eu/) with the aim of preparing the ground for digital preservation within Europeana (http://www.europeana.eu/portal/). The Europeana portal serves as a central point for the general public to easily explore and research European cultural and scientific heritage online. It aggregates and collects data on digital resources from galleries, libraries, archives and museums across Europe and by now manages about 19 million object descriptions collected from more than 1,500 institutions. Within this very heterogeneous context it is easily understandable that digital objects are encoded in very heterogeneous file formats and versions throughout various different hardware and software content repository systems. Depending on the underlying use case it is likely that multiple representations of the same 'physical' object exist at a time. For example, in most cases it is useful to provide access copies on demand which are easily distributable via the web, while the master record needs to adhere to different requirements, for example the institution's long-term scenario and preservation policy.

A key topic in preservation planning is the file formats used for encoding the digital information. The Pronom Unique Identifiers (PUIDs) registry provides persistent, unique and unambiguous identifiers for file formats and therefore takes a fundamental role in the process of managing electronic records. Currently it lists information on about 820 different PUIDs. While some of the formats are properly documented, open-source and well supported, others may be outdated, abandoned by software vendors and no longer functional in modern operating systems. As the binary file's dependencies on the underlying platform, its configuration (codecs, plugins, etc.) and the rendering software together determine the concrete performance experienced by the user, it is vital to have a solid understanding of all of them. This process is costly and requires a high degree of engineering expertise. Many of the GLAM institutions already outsource IT related activities and do not have the resources to keep track of the required level of complexity in house.

The Digital Preservation Recommender (DiPRec) system addresses the topics of 'preservation watch' and 'preservation policy recommendation'. It proposes a solution in the domain of digital long-term preservation for making documented recommendations based on risk scores, while the underlying knowledge base is built through a linked data approach. Information from FreeBase, DbPedia and Pronom in the areas of file formats, file conversion tools, and hardware and software vendors is taken into account. The main contribution of this paper consists in the integration of open (general or domain specific) data when constructing knowledge based recommendations. The "knowledge acquisition bottleneck" and the high costs of setting up and maintaining KBRs are still an impediment to extensive adoption by the industry. Recommendations provided by DiPRec are meant to support GLAM institutions across Europe in the process of analyzing their digital assets. The technical foundation and the explanation of the DiPRec recommendations are computed on top of shared and collaboratively built data sources. Trust in the area of LOD and digital preservation is a key issue, which has been left out of this paper for simplicity.

The novelty of our work consists in combining expert tools (such as File, Droid or Fido) and automated object identification processes with structured information (e.g. technical information on file formats) from open data repositories. This information is used for inferring new knowledge, calculating preservation risks and finally for computing recommendations on preservation actions in the domain of digital long-term preservation. We present the rationale used for the construction of the DiPRec recommender through concrete examples of a content analysis which was provided for the Assets project. The rest of the paper is organized as follows. In Section 2 we present related work carried out on recommender systems and in the field of digital preservation. Section 3 highlights the architecture of DiPRec by comparing it against the construction of classical KBRs. The functionality provided by our system is explained in detail through a concrete example on the TIFF file format. The evaluation of our approach is presented in Section 4 by analyzing the digital collections of the Assets project. This is followed in the last section of the paper (Section 5) by a summary of the concluding remarks on our work.

2. RELATED WORK

Knowledge based recommender systems gained broad popularity in e-commerce and e-tourism [7, 11, 24, 19] applications supporting customers in their decision making processes. The two most popular use cases are guidance through large and complex product offers (e.g. trip organization, feature selection of technical equipment) and accompanying the process of high cost decision making (e.g. financial investments). When designing the DiPRec recommender we took into consideration the Advisor Suite [12] and the Planets Testbed infrastructure [16]. The main component of the Advisor Suite is a multipurpose workbench which offers support and advanced graphical user interfaces for constructing knowledge based recommenders. Advisor Suite features include the import of product catalogues, visual editing of a recommendation workflow and the generation of a runtime environment. The Planets project (http://www.planets-project.eu/) focused on constructing practical services and tools for establishing empirical evidence in the process of informed decision making in the area of digital long-term preservation. A major achievement was the definition of basic nouns and verbs for core preservation operations. This allows tools to be easily combined and swapped within a preservation workflow and led to more than fifty preservation services. Available services were deployed and tested within the Planets Testbed [22], a uniform environment for experimentation under well-defined and controlled conditions. It provides automated quality assurance support for tools like DROID (http://droid.sourceforge.net/), JHOVE (http://hul.harvard.edu/jhove/) and the eXtensible Characterisation Languages (http://planetarium.hki.uni-koeln.de/public/XCL/) [5].

A key topic in preservation planning is the process of evaluating objectives under the limitation of well-known constraints. A state of the art report on technical requirements and standards, as well as on available tools to support the analysis and planning of preservation actions, is given in [2]. Strodl et al. present the Planets preservation planning methodology Plato (http://www.ifs.tuwien.ac.at/dp/plato/intro.html) through an empirical evaluation of image scenarios [21] and demonstrate specific cases of recommendations for image content in four major National Libraries in Europe [4]. After eliciting information regarding the preservation scenario (user requirements), the Plato tool is able to recommend specific preservation actions [3] for a given scenario. The tool was specifically designed to work on samples of the underlying data set and is therefore able to make use of XCL or similar tools for automated quality assurance and semi-automated evaluation of objectives. In contrast to these scenario evaluations, DiPRec aims at collecting information on a broader range from open linked data registries and dynamic knowledge sources. It can evaluate more general, even 'non-technical' objectives (e.g. what is the risk that no software vendor will support old formats like Word 3 documents?). This is a significant improvement over the Plato tool, where all this information needs to be provided by domain experts.

The Scape project (http://www.scape-project.eu/), which is partially funded by the European Union's FP7, is one of the major current initiatives [18] on institutional preservation requirements. Besides the issues of scalable preservation and quality-assured preservation workflows, the project also addresses the topic of policy-based preservation planning and watch.

The paradigms of the semantic Web and linked open data [6] transform the web from a pool of information into a valuable knowledge source of data according to the definitions of knowledge management theory [17]. The exploitation of linked data as a knowledge source for recommender systems started as a research topic in the last few years and was first applied to improve case-based and collaborative filtering recommenders [10, 9, 20]. In [20] the authors present the Talis Aspire system, which is able to assist educational staff in picking educational web resources. The employment of linked data in collaborative filtering and case-based reasoning was explored by Heitmann and Hayes in [9] and [10].

3. SYSTEM OVERVIEW

Typically the creation of classic knowledge based recommender systems consists of three main tasks. The collection of detailed descriptions of product offers is followed by the construction of a recommendation knowledge base (see Section 3.2). At runtime, user requirements are elicited and recommendations are computed based on the underlying recommendation knowledge base and the items that match the given user requirements. DiPRec follows the same process but improves the way the knowledge base is built in order to reduce the effort spent on domain knowledge acquisition. This is especially relevant in GLAM preservation scenarios, where the underlying knowledge base contains broader information than in domain specific KBRs. Within the DiPRec recommender the Domain Information Aggregation module is responsible for collecting file format related information (e.g. formats, vendors, applications, etc.) from the open knowledge bases Pronom, DBPedia and Freebase. Furthermore the Domain Knowledge Aggregation module combines the outcome of a risk analysis process with the knowledge manually provided by domain experts. Figure 1 compares the process used by regular KBRs with the one presented by DiPRec, which enhances the process of building the underlying knowledge base. In the following sections we present extended details on how the knowledge base of DiPRec is built, using the Tagged Image File Format (TIFF) as an example.

[Figure 1: A comparison of regular KBRs and DiPRec recommender processes. Panels: a) Classic KBR recommender, b) DiPRec recommender.]

The TIFF format is still very popular in the publishing industry, as it is a very adaptable file format, although it has not had a major update since 1992. It was originally created by Aldus and as of 2009 it is under the control of Adobe Systems. There are a number of extensions available (e.g. TIFF/IT, TIFF-FX) which are based on the TIFF 6.0 specification, but not all of them are broadly used. A standard and broadly accepted approach in the archiving world is the migration of TIFF encoded content to the JPEG2000 format. In [4, 2] one can find the context in which several content providers took the decision to perform this kind of content migration. However, within these scenarios the context evaluation and the recommendation were computed by domain experts and by expert systems. The DiPRec system follows the approach of well-documented and trackable decision making and, at the same time, uses a semi-automatic approach to domain knowledge acquisition. This reduces the human effort invested by domain experts when providing preservation recommendations, reduces the financial effort invested in the context evaluations, and at the same time is able to offer good quality recommendations.

Table 1: File format and vendor description. (Information sources: P = Pronom, D = DBPedia, F = Freebase, W = Wikipedia)

FILE FORMAT DESCRIPTION
  Format Name: Tagged Image File Format (P), Tagged Image File Format (D), Tagged Image File Format (F)
  Pronom Id: fmt/10 (P)
  Mime Type: /media type/image/tiff-fx, /media type/image/tiff (F), image/tiff (P)
  File Extensions: .tiff, .tif (D)
  Current Version: 6 (P)
  Current Version Release Date: 03 Jun 1992 (P)
  Software License: Proprietary software (D)
  Software: QuickView Plus, Acrobat, AutoCAD, CorelDraw, Freemaker, GoLive, Illustrator, Photoshop, Powerpoint (P), SimpleText, Seashore, Imagine (D)
  Software Homepage: http://adobe.com/photoshop (D)
  Operating System: PC, Mac OS X, Microsoft Windows (D)
  Genre: Image (Raster) (P), Image file format (I), SimpleText - Text editor, Adobe Photoshop - Raster graphics editor (D)
  Open Format: none (P)
  Standards: ISO 12639:2004 (W)
  Vendors: Aldus, Adobe Systems, Apple Computer, now Apple Inc., Microsoft (D), Adobe Systems Incorporated (P), Aldus Corporation (P)
VENDOR DESCRIPTION
  Organization Name: Adobe Systems
  Country: United States (P)
  Foundation date: Dec 1982 (F)
  Number of Employees: 6068 (Jan 2007), 8660 (2009) (F), 9,117 (2010) (W)
  Revenue: 3,579,890,000 US$ (Nov 28, 2008) (F)
  Homepage: http://adobe.com/photoshop (F)

3.1 Domain Information Aggregation

Unlike the e-commerce domain, where KBRs import detailed item descriptions from product catalogs, there is no such catalog for computer file formats. The Unified Digital Format Registry (UDFR) project (http://www.udfr.org/) was started in 2009 by a group of universities and GLAM institutions with the aim of building a single, shared technical registry for file formats based on a semantic web and linked data approach. The project is based on the Pronom database, which provides basic information about a large number of file formats and will be extended by data on migration pathways and available software/tools. The registry should be available from the beginning of 2012. As Pronom data is not rich enough to build a recommendation and reasoning mechanism for preservation scenarios of file formats on top of it, we collect additional information sources and aggregate them into a single homogeneous property representation in the recommender's knowledge base. DiPRec uses two types of operations for aggregating domain information:

• data unification: the data representation retrieved from different knowledge bases is unified and combined under the DiPRec property model definition. For example, the number of software tools supporting a given file format is calculated over different data sources. The individual object's namespace, the transformation process of values, the query on how to extract a given record, etc. are preserved and are part of the property's model representation.

• property composition: more abstract properties which require a hierarchical composition are computed by aggregating basic properties using weights. The model for property definition is meant to be kept very simple. For example, "supported by major vendors" will check if at least one of the software companies is considered to fulfill this requirement by combining properties like "NUMBER OF EMPLOYEES" and "VENDOR REVENUE". See Table 2.

When aggregating domain information we query external knowledge sources like DbPedia and Freebase, which manage huge amounts of linked open data triples. This allows us to extract fragmentary descriptions of file formats, software applications and vendors supporting given file formats (see Table 1). DbPedia allows sophisticated queries to be posted using the SPARQL query and OWL ontology languages [13] for retrieving data available in Wikipedia. Freebase [15] is a practical, scalable semantic database for structured knowledge and is mainly composed and maintained by community members. Public read/write access to Freebase is allowed through a graph-based query API using the Metaweb Query Language (MQL) [6]. PRONOM data is released as linked open data and is accessible through a public SPARQL endpoint.

Table 2: Sample compound properties.

AGGREGATED PROPERTIES
File format related
  Is supported by major software vendors?          yes
  Is an open file format?                          no
  Is widely supported by current web browsers?     yes
  Which versions officially supported by vendor?   6.0
  Which versions are frequently used?              6.0
  Image file compression supported?                yes
Preservation related metadata
  Is creator information available?                yes/no
  Is publisher information available?              yes/no
  Is digital rights information available?         yes/no
  Is file migration allowed?                       yes/no
  Object creation date?                            datetime
  Is an object preview available?                  URL

3.2 Domain Knowledge Aggregation

Pronom, as presented before, is a viable resource for anyone requiring impartial and definitive information about file formats, software products and other related data. Extremely valuable to the DiPRec recommender is the information related to the file conversion tools available for a given PUID. Therefore we employ the Droid characterization service (http://sourceforge.net/projects/droid/) for automatically extracting technical metadata and identifying file formats from physical media files. This metadata is then used within the domain knowledge aggregation process presented in Fig. 2.

[Figure 2: Domain knowledge aggregation process.]

The risk analysis module is in charge of evaluating the information previously aggregated in the DiPRec knowledge base for a given record over the following (exemplary) dimensions of digital preservation:

• Web accessibility: Dissemination copies are published and accessible on e.g. the content provider's web portal. There should be previews of objects (e.g. thumbnails for images, video summaries, short intros for audio files) and 'rich' object descriptions to increase their visibility and retrieval. The chosen file representations should render in the latest browsers without plugin support and cope with modern features (e.g. pseudo streaming, progressive image display, HTML5, X3D, etc.). Content is made available through different exploitation channels.

• Archiving and costs: The decision to follow a specific institutional preservation policy for a given technology is heavily influenced by given hardware and budget constraints. Future costs for content exploitation need to be predicted and taken into account.

Other scenarios may include:

• Provenance metadata
• Data exchange and collaborative data enrichment
• Publishing and digital rights management

The preservation dimensions are not orthogonal and therefore certain properties might be involved more than once when computing different risk scores. For management and maintenance reasons properties are also grouped by sets, and a property may belong to one or more property sets. The extent to which a property belongs to a property set, and consequently contributes to the risk computation over a given dimension, is modeled through the introduction of specific weighting factors (see Equation 1). The value of the overall risk score for a given collection of objects is computed as a weighted sum over all digital preservation dimensions:

$$R_i = \sum_{ps \in PS_i} w_{ps,i} \cdot \sum_{p \in PROP_{ps}} w_{p,ps} \cdot d(p, PFV(p)) \qquad (1)$$

Here $R_i$ represents the preservation risk computed over the dimension $i$, and $ps$ is the index of the current property set within the sets $PS_i$ associated with the dimension $i$. The weight $w_{ps,i}$ is the contribution of the property set $ps$ to the dimension $i$. Similarly, $p$ stands for the index of the current property within the list of properties available in the given property set $PROP_{ps}$, and $w_{p,ps}$ denotes the importance of a property $p$ for the property set $ps$. The distance between the current property and the defined 'preservation conform' value for this property is represented through $d(p, PFV(p))$.

3.3 User requirements elicitation

DiPRec is designed to work as a multi-purpose digital preservation support tool which can be used in various scenarios by different types of customers. For example, the tool may support content providers in analyzing the 'preservation friendliness' of their infrastructure, their archiving solutions or the visibility of their artifacts published in the Europeana portal. Recommendations are always to be seen in the context in which the digital objects are used. Within the scope of the Assets project there is the common interest to offer public access to digital assets through the Europeana portal (i.e. web discovery), to provide advanced search functionality (i.e. description richness and preservation of provenance information), as well as the topic of the data archiving dimension.

As a result of the requirements elicitation process user profiles are created. A set of multiple choice questions is used to distinguish the relevant dimensions of available preservation objectives. According to different levels of complexity, role and required domain knowledge, the system offers a subset of questions which are well understood and are the best available choice for a user to express his needs. Fig. 3 presents a sample workflow which could be used to determine a given user profile. For example, a private user (ut = private person) with a solid level of IT knowledge (itk = expert) will be asked about preferred encodings and compression types of the digital content, while others would define attributes about storage limitations and upload samples of a given collection.

3.4 Recommendation computation

Unlike classic KBRs, where the application's scope is very well delimited in terms of selecting the best matching items in a list of known possibilities, the DiPRec system relies on expressing an institutional preservation context in the form of user requirements that are combined with knowledge acquired about the threats to long-term accessibility. We employ tools to evaluate the content of a given collection from a technical point of view and to generate fine grained preservation risk scores.
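As an illustration of the weighted sum in Equation 1 (Section 3.2), the following minimal Python sketch computes the risk score R_i of a single preservation dimension. The property names, the weights, the 'preservation conform' values and the 0/1 distance function are hypothetical and were chosen only for this example; the text does not prescribe a concrete distance function or weighting.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Property:
    name: str
    value: object      # aggregated value of the property for the analyzed object
    preferred: object  # hypothetical 'preservation conform' value PFV(p)
    weight: float      # w_{p,ps}: importance of the property within its set


@dataclass
class PropertySet:
    name: str
    weight: float      # w_{ps,i}: contribution of the set to dimension i
    properties: List[Property]


def mismatch(value: object, preferred: object) -> float:
    """Assumed distance d(p, PFV(p)): 0.0 if the value conforms, 1.0 otherwise."""
    return 0.0 if value == preferred else 1.0


def risk_score(property_sets: List[PropertySet],
               distance: Callable[[object, object], float] = mismatch) -> float:
    """R_i = sum over ps of w_{ps,i} * (sum over p of w_{p,ps} * d(p, PFV(p)))."""
    return sum(
        ps.weight * sum(p.weight * distance(p.value, p.preferred)
                        for p in ps.properties)
        for ps in property_sets
    )


# Hypothetical property sets for the 'web accessibility' dimension.
example_sets = [
    PropertySet("format", 0.6, [
        Property("open_format", False, True, 0.5),
        Property("browser_support", True, True, 0.5),
    ]),
    PropertySet("metadata", 0.4, [
        Property("preview_available", False, True, 1.0),
    ]),
]

if __name__ == "__main__":
    # 0.6 * (0.5 * 1 + 0.5 * 0) + 0.4 * (1.0 * 1), i.e. approximately 0.7
    print(risk_score(example_sets))
```

Passing a different `distance` callable (for example one that returns graded values between 0 and 1) would change how strongly a non-conforming property contributes, without changing the structure of the sum.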
When records are identified as having vulnerabilities on certain preservation dimensions, a rule based engine such as JBoss Drools (http://www.jboss.org/drools) is used to propose appropriate preservation actions. The set of available business rules is defined by domain experts in the form of simple IF-THEN-ELSE rules. These rules are neither complete nor meant to be non-overlapping. Unified tool access for processing executable preservation plans is provided through the Assets preservation normalisation framework, which is able to invoke the tools with exactly defined settings and parameter configurations.

[Figure 3: User requirements elicitation workflow]

    IF (rac > 0.5 AND ia == true AND iwa == true AND open_format == FALSE)
        THEN migrate(preservation_format)
    IF (content_type == IMAGE)
        THEN preservation_format = (JPEG/2000:1, TIFF/6:0.8)
    IF (file_format == TIFF/5 AND preservation_format == JPEG/2000)
        THEN migration_tool = IMAGE_MAGICK                              (2)

The preservation recommendations are computed using constraint satisfaction problem (CSP) theory [8, 11]. Constraints are defined within the preservation actions knowledge base, the CSP context is defined by user profiles, and the preservation risks are identified for the given data collection. The recommendations are represented in the form of preservation actions. For example, the set of business rules defined above, combined with a user profile indicating interest in the dimensions of archiving and web accessibility, will lead to the following recommendation when analyzing a collection of images in TIFF format:

    migrate(TIFF/5, JPEG/2000, IMAGE_MAGICK)

In free text translation, the recommendation suggests the migration of the files available in TIFF/5 format to JPEG/2000 by using the IMAGE_MAGICK software with standard settings.

4. EVALUATION

The evaluation of the first prototype of DiPRec was conducted within the scope of the Assets project. Ten partners of the project consortium provided metadata and binary content (10 collections with a total size of 516GB contained in 368067 media files) for supporting the development and testing of services developed within the scope of the project. The first step in the evaluation process was the identification of file formats, the definition of property sets and the aggregation of the domain knowledge available in open knowledge bases on these file formats. Table 3 lists the distribution of file formats by content type. Even though the experimental data was taken from a small number of content providers, we discovered a variety of 18 formats in 38 different versions used for encoding the digital content.

Table 3: Distribution of file formats in Assets collections.

Content Type   File Format   # Versions   # Files
TEXT           TXT           1            4
TEXT           DOC           1            16
TEXT           XML           1            20101
TEXT           HTML          1            1205
IMAGE          JPG           8            323332
IMAGE          PSD           1            3
IMAGE          PNG           4            1228
IMAGE          BMP           2            141
IMAGE          GIF           2            1066
IMAGE          TIFF          4            4
IMAGE          PDF           16           25008
AUDIO          MP3           1            3634
VIDEO          FLV           1            9468
VIDEO          MPEG4         1            935
VIDEO          MPEG1         1            3074
VIDEO          MPEG3         1            3074
3D             PLY           1            50
3D             DAE           1            307

The Digital Record Object Identification tool (DROID, http://sourceforge.net/projects/droid/) version 5, signature file 45, was executed through the Assets preservation normalisation tool suite and was able to successfully identify file formats in 95 percent of the cases through its binary signature method, with the exception of the 3D model objects, which have not yet been collected by Pronom. Appropriate information on all of the file formats was contained in DbPedia and Freebase, and the domain knowledge acquisition process was completed by successfully computing the preservation risk analysis scores.

The second part of the evaluation consisted in computing recommendations for the given content. For this purpose we created a user profile for content providers that are interested in making their content accessible through Europeana. Within this context, the content providers are interested in the web accessibility dimension of digital preservation. The highest diversity of file formats was found in the image collections. The recommendation to migrate these files to the JPEG 2000 format did not get a high priority and will be performed within the next period of scheduled storage migration. The ImageMagick tool was the recommended choice for performing this transformation action. The whole audio content available in Assets was provided in the mp3 format and no recommendation was made for transforming audio collections. The most restrictive constraints for web accessibility are defined for the video content. The pseudostreaming protocol is an advanced technological solution used for distributing information over the web. It allows the user to interact with the media player and to quickly navigate within the content without the need to download the entire media file. This protocol is supported by two file formats: flash video (FLV) and MPEG4 with H.264 video encoding. It has native support in HTML5 and is used in HTML4 with an adequate browser plugin. A part of the Assets content is already available in FLV format and another part is available in MPEG1 or MPEG2. The resulting DiPRec recommendation is to migrate the content to FLV by using the ffmpeg tool (footnote 17).

5. CONCLUSION

Within this paper we introduced the DiPRec recommender system, an expert support tool in the domain of digital long-term preservation for GLAMs. An important contribution of this paper is the exploitation of an open linked data approach for constructing the recommender's knowledge base, built upon open registries such as DbPedia and Pronom. Since the knowledge acquisition, aggregation and unification process is fully automated, it is easy to upgrade the recommender's knowledge base.

We looked at preservation planning, which is the process of specifying clearly defined and relevant trees of objectives in a defined preservation dimension and evaluating them within a given (institutional) context to generate well-documented decisions. DiPRec is able to advance the process with inferred community knowledge and reduces the degree of manual evaluation and the technical expertise required in this process.

An important concern related to KBRs is the trust in the provided recommendations. This is especially relevant for the digital preservation domain, where we deal with a large amount of multimedia material and the execution of the preservation actions is associated with considerable costs. Within this paper we did not examine the completeness, correctness and quality of the underlying data. We argue, however, that data from open knowledge bases like DbPedia or Freebase could, through its underlying community approach, protect from biases introduced by the economic interests of professional companies.

The tool has been designed by reusing our past experience in building knowledge based and case based recommender systems [23, 8] and combining it with our expertise in creating long-term preservation infrastructure and applications [14, 1].

6. ACKNOWLEDGMENTS

This work was partially supported by the EU project "ASSETS - Advanced Search Services and Enhanced Technological Solutions for the European Digital Library" (CIP-ICT PSP-2009-3, Grant Agreement n. 250527).

7. REFERENCES

[1] Aitken, B., Helwig, P., Jackson, A., Lindley, A., Nicchiarelli, E., Ross, S.: The planets testbed: Science for digital preservation. Code4Lib 1(3) (2008), http://journal.code4lib.org/articles/83
[2] Becker, C., Kulovits, H., Guttenbrunner, M., Strodl, S., Rauber, A., Hofman, H.: Systematic planning for digital preservation: evaluating potential strategies and building preservation plans. International Journal on Digital Libraries 10(4), 133–157 (2009), http://dblp.uni-trier.de/db/journals/jodl/jodl10.html#BeckerKGSRH09
[3] Becker, C., Kulovits, H., Rauber, A., Hofman, H.: Plato: a service-oriented decision support system for preservation planning. In: Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries. pp. 367–370. ACM, New York (2008), http://publik.tuwien.ac.at/files/PubDat_170832.pdf, presented at the 8th ACM/IEEE-CS joint conference on Digital libraries (JCDL 2008), Pittsburgh, Pennsylvania, 2008-06-16 to 2008-06-20
[4] Becker, C., Rauber, A.: Four cases, three solutions: Preservation plans for images. Tech. rep., Vienna University of Technology, Vienna, Austria (April 2011)
[5] Becker, C., Rauber, A., Heydegger, V., Schnasse, J., Thaller, M.: A generic xml language for characterising objects to support digital preservation. In: SAC '08: Proceedings of the 2008 ACM symposium on Applied computing. pp. 402–406. ACM, New York, NY, USA (2008)
[6] Bizer, C., Heath, T., Berners-Lee, T.: Linked data - the story so far. Int. J. Semantic Web Inf. Syst. 5(3), 1–22 (2009)
[7] Burke, R.D.: Hybrid web recommender systems. In: The Adaptive Web. pp. 377–408 (2007)
[8] Felfernig, A., Gordea, S.: Ai technologies supporting effective development processes for knowledge-based recommender applications. In: SEKE. pp. 372–379 (2005)
[9] Heitmann, B., Hayes, C.: Using linked data to build open, collaborative recommender systems. In: AAAI Spring Symposium: Linked Data Meets Artificial Intelligence (2010)
[10] Heitmann, B., Hayes, C.: Enabling case-based reasoning on the web of data. In: The WebCBR Workshop on Reasoning from Experiences on the Web (2010)
Based on this work the Assets normalisation tool [11] Jannach, D., Zanker, M., Fuchs, M.: Constraint-based suite is able to automate the process of object identification recommendation in tourism: A multiperspective case and characterisation and therefore directly integrates within study. J. of IT & Tourism 11(2), 139–155 (2009) the property evaluation, risk analysis and recommendation process for a given record. We presented a first evaluation [12] Jannach, D., Zanker, M., Jessenitschnig, M., Seidler, of digital content provided by national libraries and archives O.: Developing a conversational travel advisor with through the Assets project where the underlying concepts of advisor suite. In: ENTER’07. pp. 43–52 (2007) the DiPRec approach were proven to work adequately. [13] Jens, L., Jörg, S., Sören, A.: Discovering unknown connections -the dbpedia relationship finder. In: 17 http://www.ffmpeg.org/ Proceedings of the 1st Conference on Social Semantic 57 Web (CSSW). vol. P-113, pp. 99–109. Gesellschaft für Informatik, Leipzig, Germany (2007) [14] King, R., Schmidt, R., Jackson, A., Wilson, C., Steeg, F.: The planets interoperability framework: An infrastructure for digital preservation actions. In: ECDL09 Proceedings of the 13th European conference on Research and advanced technology for digital libraries. vol. 5714/2009, pp. 425–428. Springer-Verlag (2009), http://dx.doi.org/10.1007/978-3-642-04346-8_50 [15] Kurt, B., Colin, E., Praveen, P., Tim, S., Jamie, T.: Freebase: a collaboratively created graph database for structuring human knowledge. In: SIGMOD ’08 Proceedings of the 2008 ACM SIGMOD international conference on Management of data. pp. 1247–1249. ACM, New York, NY, USA (2008) [16] Lindley, A., Jackson, A.N., Aitken, B.: A collaborative research environment for digital preservation - the planets testbed. 
Enabling Technologies, IEEE International Workshops on 0, 197–202 (2010) [17] Nonaka, I., Takeuchi, H.: The Knowledge-Creating Company: How Japanese Companies Create the Dynamics of Innovation. Oxford University Press (May 1995) [18] Orit Edelstein, Michael Factor, R.K.T.R.E.S.P.T.: Evolving domains, problems and solutions for long term digital preservation. iPRES 2011 - 8th International Converence on Preservation of Digital Objects (2011) [19] Ricci, F., Werthner, H.: Case base querying for travel planning recommendation. Journal of IT & Tourism 4(3-4), 215–226 (2001), http://dblp.uni-trier.de/ db/journals/jitt/jitt4.html#RicciW01 [20] Shabir, N., Clarke, C.: Using linked data as a basis for a learning resource recommendation system. In: 1st International Workshop on Semantic Web Applications for Learning and Teaching Support in Higher Education (SemHE’09) (September 2009), http://eprints.ecs.soton.ac.uk/18053/ [21] Strodl, S., Becker, C., Neumayer, R., Rauber, A.: How to choose a digital preservation strategy: evaluating a preservation planning procedure. In: JCDL ’07: Proceedings of the 2007 conference on digital libraries. pp. 29–38. ACM, New York, NY, USA (2007), http://doi.acm.org/10.1145/1255175.1255181 [22] Sven Schlarb, Edith Michaelar, M.K.A.L.B.A.S.R.A.J.: A case study on performing a complex file-format migration experiment using the planets testbed. IS&T Archiving Conference 7, 58–63 (2010) [23] Zanker, M., Gordea, S., Jessenitschnig, M., Schnabl, M.: A hybrid similarity concept for browsing semi-structured product items. In: EC-Web. pp. 21–30 (2006) [24] Zanker, M., Jessenitschnig, M., Jannach, D., Gordea, S.: Comparing recommendation strategies in a commercial context. IEEE Intelligent Systems 22(3), 69–73 (2007) 58
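
As an illustration of how the business rules in (2) lead to the recommendation migrate(TIFF/5, JPEG/2000, IMAGE_MAGICK), the following minimal Python sketch evaluates the three rules in sequence. It is plain rule evaluation, not a constraint solver, and it is not the DiPRec implementation. The class and function names are hypothetical. We read ia and iwa as the archiving and web accessibility interest flags of the user profile and rac as a risk score of the collection; these readings, and the score 0.7 used in the usage note, are assumptions made only for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Weighted preservation formats per content type, copied from rule set (2).
PRESERVATION_FORMATS = {"IMAGE": [("JPEG/2000", 1.0), ("TIFF/6", 0.8)]}

# (source format, target format) -> migration tool, copied from rule set (2).
MIGRATION_TOOLS = {("TIFF/5", "JPEG/2000"): "IMAGE_MAGICK"}


@dataclass
class Collection:
    content_type: str
    file_format: str
    rac: float         # assumed meaning: risk score of the collection's format
    open_format: bool  # assumed meaning: whether the format counts as open


@dataclass
class UserProfile:
    ia: bool   # assumed meaning: interest in the archiving dimension
    iwa: bool  # assumed meaning: interest in the web accessibility dimension


def recommend(collection: Collection,
              profile: UserProfile) -> Optional[Tuple[str, str, str]]:
    """Return (source, target, tool), or None if no migration is suggested."""
    # Rule 1: decide whether a migration is needed at all.
    must_migrate = (collection.rac > 0.5 and profile.ia and profile.iwa
                    and not collection.open_format)
    if not must_migrate:
        return None
    # Rule 2: candidate preservation formats for this content type.
    candidates = PRESERVATION_FORMATS.get(collection.content_type, [])
    # Rule 3: try targets by descending weight, keep the first with a known tool.
    for target, _weight in sorted(candidates, key=lambda c: c[1], reverse=True):
        tool = MIGRATION_TOOLS.get((collection.file_format, target))
        if tool is not None:
            return (collection.file_format, target, tool)
    return None
```

Calling recommend(Collection("IMAGE", "TIFF/5", rac=0.7, open_format=False), UserProfile(ia=True, iwa=True)) yields the triple corresponding to the recommendation quoted in the text. If either interest flag is false, no migration is suggested.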