=Paper=
{{Paper
|id=Vol-2038/paper7
|storemode=property
|title=Repolytics: Identifying Measurable Insights for Digital Repositories
|pdfUrl=https://ceur-ws.org/Vol-2038/paper7.pdf
|volume=Vol-2038
|authors=Vangelis Nomikos
|dblpUrl=https://dblp.org/rec/conf/ercimdl/Nomikos17
}}
==Repolytics: Identifying Measurable Insights for Digital Repositories==
Vangelis Nomikos

Intelibility Innovative Data Engineering LLC, Atlanta, GA, USA

nomikos@intelibility.com

Abstract. This paper presents specific aspects of the Repolytics platform, a data analytics platform for digital repositories. An overview of the platform is presented alongside an example of how one can employ its services to analyze a digital repository's data and identify quality-related issues.

Keywords: metadata quality, data quality, data analytics, quality metrics, digital repositories.

===1 Introduction===
In a data-driven world, data is captured, analyzed and used to drive all aspects of our lives. Recently, we have seen a rapid growth of digital repositories and open data catalogues being made available to the public. For digital repositories that target the library and archive domain, the scientific and research data domain, and open data portals, the market is dominated by open source solutions such as DSpace, Omeka, DKAN and CKAN, owing to their simplicity and low or no cost. Most solutions enable ingest and cataloguing of information either through automated means (SWORD, REST APIs) or through simple and intuitive forms. Quality assurance in most cases consists of a number of mandatory metadata fields that the user has to enter. The reality, however, is far more complex, and this has a profound effect on the quality of the ingested data.

Repolytics [1] is a platform that aims at filling this gap through intelligent data analytics. The data loaded into the platform are analyzed and specific quality metrics are presented alongside a more thorough analysis per metadata element. The metrics include metadata completeness, accuracy and consistency. Furthermore, similar metrics are calculated for the data as well.

===2 Related Work===
Metadata quality is an important issue for the Digital Library domain and has attracted the attention of several research groups.
One of the main demands is the establishment of a conceptual framework consisting of a set of well-defined quality assessment criteria such as completeness, validity, consistency, timeliness, appropriateness and accuracy [2, 3]. In such a framework any assessment effort would be based on reliable indications about metadata quality. The first attempt to define a framework established a narrow set of criteria such as accuracy, completeness and serviceability [4]. Some researchers expanded the set of criteria from which metadata quality is approached [6, 8], while others narrowed the perspective, focusing only on the completeness criterion [5].

An important evolution of this scientific field was the introduction of the context-dependent metadata quality approach [3, 7]. According to this approach, metadata quality issues follow four major concepts: mappings, changes to the information entity, changes to the underlying entity and context changes. For these concepts a taxonomy of 22 information quality criteria was developed. The criteria were clustered into three categories (intrinsic, relational and reputational) and are measured via 41 metrics [7].

Recent research suggests that a metadata quality framework doesn’t have to “invent new dimensions in order to accommodate the needs of diverse communities of practice” [3] but should give each evaluator the flexibility to assess the results within a specific context.

This paper follows the context-dependent approach and assumes that metadata quality strongly depends on the viewpoint of the evaluator and should be aligned with the application domain for which the metadata were produced and used. Therefore, weighting functions for these factors (evaluator viewpoint, application domain, metadata usage) should be defined and used to weight the values of the metadata quality metrics.
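The idea of weighting metric values by context can be illustrated with a minimal sketch. It assumes that each metric is normalized to the range [0, 1] and that a context is reduced to one non-negative weight per metric; the names `Context` and `weighted_quality`, the sample contexts and the weighted-mean aggregation are hypothetical choices made for illustration and are not taken from the Repolytics platform.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Context:
    """Hypothetical evaluation context: one non-negative weight per metric."""
    name: str
    weights: Dict[str, float]


def weighted_quality(metrics: Dict[str, float], context: Context) -> float:
    """Return the weighted mean of the metric values for the given context.

    Assumes every metric value lies in [0, 1]. Metrics without a weight
    in the context contribute nothing to the score.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for metric, value in metrics.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"metric {metric!r} is outside [0, 1]: {value}")
        weight = context.weights.get(metric, 0.0)
        if weight < 0.0:
            raise ValueError(f"negative weight for metric {metric!r}")
        weighted_sum += weight * value
        total_weight += weight
    if total_weight == 0.0:
        raise ValueError("context assigns no weight to any supplied metric")
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Illustrative metric values for one repository (not real measurements).
    metrics = {"completeness": 0.9, "accuracy": 0.5, "consistency": 0.7}
    archivist = Context("archivist", {"completeness": 3, "accuracy": 1, "consistency": 1})
    analyst = Context("analyst", {"completeness": 1, "accuracy": 3, "consistency": 1})
    for ctx in (archivist, analyst):
        print(ctx.name, round(weighted_quality(metrics, ctx), 2))
```

With the same metric values, the two contexts yield different overall scores, which is the behaviour the context-dependent approach calls for. Other aggregation functions could be substituted for the weighted mean.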
===3 The Repolytics approach===
The core principle of the Repolytics platform is the provisioning of an expandable set of middleware services that operate at both the metadata and the data level of a digital repository. The most fundamental services of the platform are:
* (meta)-data integration services
* (meta)-data profiling services
* (meta)-data quality services

The Repolytics platform enables the user to load data from different data sources, either directly (e.g. through a file archive) or by using one of the supported data providers such as the DKAN API and the OAI-PMH protocol. Each data source provides one or more metadata representations (e.g. OAI-DC, MODS) and data representations (e.g. CSV, Excel). Once the data has been loaded, each digital object is analyzed and the main workflow seen in Figure 1 is executed.

Figure 1. Overall workflow per digital object.

For each digital object, if metadata is provided, the format is identified, every metadata element is indexed and a set of core analyzers is invoked for each element. For specific elements, depending on their type, a set of custom analyzers is invoked. If data is provided (usually through some kind of URL), the actual data is fetched and verified, and each bitstream is then profiled. Again, a set of custom analyzers is invoked depending on the file type. Some of the metrics that are calculated include:
* metadata completeness
* distinct values for each metadata element
* accuracy for specific element types (e.g. dates, actors), where all values are classified according to their class
* itemset frequencies when applicable (e.g. in the case of subject terms and keywords)

One of the primary challenges is the efficient visualization of the results to the end user. For that reason a series of bar charts, radar charts, tables, gauge meters etc. are employed per case. For example, as shown in Figure 2, a radar chart is employed to fingerprint an entire repository according to its completeness.

Figure 2.
Radar chart showing the overall completeness for a data source.

Similarly, an accuracy detector identifies and classifies all metadata element values according to their class and provides an insight into the accuracy level of each class (low, medium, high).

{| class="wikitable"
! Class type !! Example !! Accuracy
|-
| YYYY-MM-DD || 2015-12-20 || Medium
|-
| YYYY || 1998 || Low
|-
| YYYY-MM-DD HH:MM:SS || 1996-02-23 14:30:11 || High
|-
| UNKNOWN || Sp.1996/23 || -
|}

===4 Conclusions & Future Work===
In this paper, Repolytics, a platform for repository data analytics, is presented. The platform enables the user to easily load data from supported data sources and analyzes this data, focusing primarily on completeness and accuracy. In specific cases, more specialized metrics (such as itemset frequencies) are employed to help gain insight into highly subjective metrics such as consistency. The platform can also access and profile the data itself.

===References===
1. Repolytics Platform, http://www.repolytics.com
2. Herzog, T., Scheuren, F., Winkler, W.: Data Quality and Record Linkage Techniques. Springer-Verlag, NY (2007)
3. Tani, A., Candela, L., Castelli, D.: Dealing with metadata quality: The legacy of digital library efforts. Inf. Process. Manag. 49 (2013) 1194–1205
4. Moen, W.E., Stewart, E.L., McClure, C.R.: The Role of Content Analysis in Evaluating Metadata for the U.S. Government Information Locator Service (GILS): Results from an Exploratory Study. http://digital.library.unt.edu/ark:/67531/metadc36312/citation/ (1997)
5. Margaritopoulos, M., Margaritopoulos, T., Mavridis, I., Manitsaris, A.: Quantifying and measuring metadata completeness. J. Am. Soc. Inf. Sci. Tech. 63 (2012) 724–737
6. Ochoa, X., Duval, E.: Quality Metrics for Learning Object Metadata. In: World Conference on Educational Multimedia, Hypermedia and Telecommunications. (2006) 1004–1011
7. Stvilia, B., Gasser, L., Twidale, M.B., Smith, L.C.: A framework for information quality assessment. J. Am. Soc. Inf. Sci. Tech. 58 (2007) 1720–1733
8.
Chen, Y.-N., Wen, C.-Y., Chen, H.-P., Lin, Y.-H., Sum, H.-C.: Metrics for metadata quality assurance and their implications for digital libraries. In: Xing, C., Crestani, F., and Rauber, A. (eds.) 13th International Conference on Asia-Pacific Digital Libraries, Beijing, China. Springer-Verlag, Berlin (2011) 138–147
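
As a supplementary illustration of the metrics described in Section 3, the following minimal sketch computes a per-element completeness score and classifies date values into the classes listed in the accuracy table. The function names, the regular expressions and the definition of completeness (the share of records with a non-empty value for an element) are assumptions made for this example; the paper does not specify how the platform implements these metrics.

```python
import re
from typing import Dict, Iterable, List, Optional, Tuple

# Date classes and accuracy levels from the accuracy table in Section 3.
# The regular expressions are assumed patterns, checked most specific first.
DATE_CLASSES = [
    ("YYYY-MM-DD HH:MM:SS", re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"), "High"),
    ("YYYY-MM-DD", re.compile(r"\d{4}-\d{2}-\d{2}"), "Medium"),
    ("YYYY", re.compile(r"\d{4}"), "Low"),
]


def classify_date(value: str) -> Tuple[str, str]:
    """Return (class type, accuracy level) for one date value."""
    text = value.strip()
    for class_type, pattern, accuracy in DATE_CLASSES:
        if pattern.fullmatch(text):
            return class_type, accuracy
    return "UNKNOWN", "-"


def is_filled(value: Optional[str]) -> bool:
    """A value counts as present when it is not None and not blank."""
    return value is not None and str(value).strip() != ""


def completeness_per_element(
    records: List[Dict[str, Optional[str]]], elements: Iterable[str]
) -> Dict[str, float]:
    """Share of records with a non-empty value, for each metadata element."""
    scores = {}
    for element in elements:
        if not records:
            scores[element] = 0.0
            continue
        filled = sum(1 for record in records if is_filled(record.get(element)))
        scores[element] = filled / len(records)
    return scores


if __name__ == "__main__":
    # Illustrative records, not taken from a real repository.
    records = [
        {"title": "A", "date": "1998"},
        {"title": "B", "date": ""},
        {"title": "C", "date": None},
        {"title": " "},
    ]
    print(completeness_per_element(records, ["title", "date"]))
    for sample in ["2015-12-20", "1998", "1996-02-23 14:30:11", "Sp.1996/23"]:
        print(sample, classify_date(sample))
```

The per-element scores could serve as the axes of a radar chart such as the one in Figure 2, although the actual inputs of that chart are not described in the paper.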