Quality Assessment in Digital Libraries – Challenges and Chances

Sascha Tönnies, Wolf-Tilo Balke
L3S Research Center
Appelstraße 9a
30167 Hannover
{toennies, balke}@L3S.de

Copyright is held by the author/owner(s).
GvD Workshop’10, 25.-28.05.2010, Bad Helmstedt, Germany.

ABSTRACT
Today, more and more information providers such as digital libraries offer corpora related to a specialized domain. Beyond simple keyword-based searches, the resulting information systems often rely on entity-centered searches. To offer this kind of search, high-quality document processing is essential. In addition, information systems increasingly have to rely on semantic techniques during the workflows of metadata generation, search and navigational access. But, due to the statistical and/or collaborative nature of such techniques, the underlying quality of automatically generated metadata is questionable. Thus, the quality of an information system’s metadata annotations used for subsequent querying of collections has to be assessed and guaranteed. In this paper we discuss the importance of metadata quality assessment for information systems and the chances gained from controlled and guaranteed quality.

Categories and Subject Descriptors
H.3.3 [INFORMATION STORAGE AND RETRIEVAL]: Information Search and Retrieval
H.3.7 [INFORMATION STORAGE AND RETRIEVAL]: Digital Libraries – Systems issues

General Terms
Algorithms, Experimentation.

Keywords
Digital Libraries, Information Quality.

1. INTRODUCTION
Recently, exponential growth of available information can be observed not only on the Web but also in highly specialized domains. Thus, all information providers have to face the problem of information overload. Relying on the work done in the virtual library for chemistry project (ViFaChem II [1]), we selected chemistry as an application area. A good example of the information growth in this domain is given in a press release of the Chemical Abstracts Service (CAS). As of September 2009, a total of 50 million chemical substances had been indexed in the curated CAS registry, the world's most comprehensive registry of chemical substances. Remarkably, only 40 million substances had been indexed just nine months before. In contrast, the CAS registry contained 10 million entries in 1990 and around 22 million entries in 2000.

Typical Web search engines are already able to handle such an amount of data in a fully automated way by indexing Web content using text-retrieval methods and structural properties of the underlying collection, such as link analysis. In contrast, there are information providers, e.g. libraries, that still rely on manually created indexes due to high quality requirements. These providers have to face two serious problems even for focused collections. First, it is increasingly costly and time-consuming to build up a proper index; second, even given an ideal collection, the indexing has to foresee all possible uses for each specific item. Also, the information overload for the individual customer and the increasing specialization of (research) interests require indexes to be more specific in the choice of appropriate indexing terms.

Chemical documents have to be extended by two types of metadata. The first type, bibliographic metadata such as authors, affiliation, publisher and year, is readily available in a library environment. The second and, for our purposes, more important type is chemical metadata, specified by the chemical entities, reactions, concepts and techniques contained in the original document. This chemical metadata is not readily available and must be extracted, collected and structured. Therefore, the development and application of technologies for automated metadata generation gain in importance. The advantage of using such techniques is twofold: first, document processing becomes less expensive, and second, a higher degree of personalization is possible. In particular, the usage of semantic techniques has been proposed to bring a higher rate of automation into the indexing process. Commonly used semantic techniques in the domain of digital libraries are (bibliographic) ontologies, tagging and classification systems. In summary, semantic techniques rely on statistical and collaborative methods to assess textual documents. However, due to the nature of statistical and collaborative methods, using such techniques may result in a loss of retrieval quality in comparison to handcrafted indexes. For information providers this potential loss in quality is a serious problem; if users cannot trust the results, the added value of curated information systems over simple Web searches becomes questionable. Hence, before a semantic technique can be used, information providers have to gauge the impact of the technology’s use on the retrieval process. In this paper, we discuss the problem of quality assessment for semantic techniques in information systems with a focus on digital libraries and show first results on quality assessment.

2. CHALLENGES IN QUALITY ASSESSMENT
The development of a digital library is a multi-stage process comprising the following steps: preprocessing of the underlying documents, metadata enrichment, indexing of the collection and its metadata, and personalized document retrieval. Given that each of these steps may incur a loss in quality, the total quality of the overall information is questionable.

Preprocessing. Chemists often search for documents containing particular chemical entities or reactions. Therefore, an important part of document preprocessing is entity recognition, which is a difficult task when working on proprietary document formats, e.g. PDF documents. Due to the unstructured representation of PDF documents it is very difficult to achieve high quality during entity recognition. PDF documents store all characters using their absolute position within the document, and thus all paragraphs are split into single-line paragraphs. Since entity names are usually quite long, the probability that names are split into several parts is rather high. Entity extractors then have a hard time figuring out whether different parts of a word belong to the same entity or are entities in their own right (for example, the chemical name 4-(aminomethyl)cyclohexamine separated into 4-aminomethyl and cyclohexamine). Another difficulty is the processing of chemical formulas containing superscript and subscript letters, which were lost during text conversion with every software product available on the market. Even simple document elements like tables and figure captions cannot be processed in high quality. The resulting quality is even worse if the document is not yet digitized and OCR software has to be used, because the digitized result will always contain additional OCR errors. These errors are insignificant when building up a full-text index, since standard IR techniques are not really affected by unsystematic (OCR) errors. But in entity-centered domains they may already be a relevant factor in tokenization and entity recognition, also affecting the overall retrieval quality [2].

Metadata enrichment. The important part of introducing significant quality control for information systems lies in the metadata used for querying the system. The quality of (handcrafted) metadata for traditional libraries may be measured in terms of completeness, correctness and relevance [3], [4]. For instance, the classification of a resource according to the Library of Congress Subject Headings (LCSH) or the Dewey Decimal Classification (DDC) can be measured using these criteria. In contrast, for a semantic technique such as collaborative tagging, where users categorize resources with a free vocabulary, such measures are difficult to apply. Thus, semantically enriched metadata has to be evaluated regarding the quality of the metadata itself. In the domain of collaborative tagging systems, some work investigating the quality of tags has already been done, see e.g. [5]. But in general, research in the field of quality assessment for semantic techniques is still rare. A good example is [6], where measures for the quality of automatically generated taxonomies for resource classification are investigated. Further approaches, like the comparison with other knowledge spaces, e.g. DBpedia and curated databases, are possible and have to be evaluated. But a major problem in the quality assessment of metadata generation is the absence of ‘gold standards’ for benchmarking.

Indexing. Besides simple text indexes for metadata and full texts, chemical information service providers offer specialized indexes built up by identifying all chemical structures from a document collection and indexing them in structure databases. The resulting databases can be accessed through graphical interfaces. By drawing a chemical structure a domain expert can thus formulate a query, which in turn is parsed by the chemical query parser and matched against the entities’ fingerprints stored inside the structure database. The amount of manual work required for building up and maintaining such indexes results in high costs. Today, CAS offers high-quality data at a price of about 30,000 USD/year for a single-user subscription. Obviously, for the growing open access movement this type of document indexing is not a viable option. Also, the widely used Web search interfaces, e.g. Google or Yahoo!, cannot retrieve and search for structural data stored in a database.

Document retrieval. Besides the topical focus, the major success factor for effective information access is the respective user interface and the (generally metadata-based) query facility. In terms of suitable interfaces, information visualization is becoming increasingly prevalent for understanding and explaining information. Currently, faceted navigation is a popular technique for supporting exploration and discovery in digital libraries and document collections. Facets refer to different kinds of categories used to characterize information items in each corpus. However, the large-subject-space problem is still unresolved and makes innovative, yet understandable extensions of the faceted model essential [7].

Quality assessment. As today’s digital libraries rely more and more on automatically enriched (semantic) metadata, the quality assessment of a digital library is clearly not trivial. The traditional quality measures of digital libraries, i.e., the attractiveness of the collections, the technology’s ease of use and the user satisfaction [8], may not be sufficient anymore, as all of them measure the user’s experience and presuppose high-quality (manually generated) metadata. Even though user satisfaction should be the overall goal, users may look favorably upon the novelty of the interface rather than assess the retrieval effectiveness. Furthermore, during the assessment, users may not know which information they are missing because of low data quality.

A possible approach would be a combination of the users’ satisfaction and each individual quality aspect during the acquisition of documents, the metadata generation and the retrieval process. Adding up the single quality values may not be sufficient for an overall quality assessment. It seems likely that some errors affect each other in such a way that the overall loss of quality is higher than the sum of the single values. The interaction of the individual aspects therefore has to be investigated.

3. USE CASE – BUILDING A CHEMICAL DIGITAL LIBRARY
The ViFaChem II project focuses on using knowledge about chemical workflows as a basis for creating a digital library portal. The overall vision is a personalized knowledge space for the individual practitioner in the field of chemistry. Building on (automatically derived) ontologies structuring the domain, openly accessible topical databases, and specialized indexes of substances derived from a set of user-selected documents, a personalized knowledge space can be created that promises to help users combat the information flood. A characteristic of chemical documents is that the relevant chemical structure information is encoded not just in text but also in images. Within this project a digital library for chemical documents was built up and integrated into the Chem.de portal. In this context we have already investigated some quality challenges.

[Figure 1. ViFaChem II architecture. Component labels recoverable from the figure: ViFaChem II System; Document processing; Document Retrieval; Named Entity Recognition (Chemical Entities); Keyword Extraction (Keywords); GrowBag (Keyword Hierarchy); Chemical OCR (Structural Data); Document Repository; Lucene; IR; Syndication; Reasoner; Ontologies (Ontology, RDF); Structure DB; Open Chemical Databases; Quality Assurance.]

Preprocessing. The document collection at the German National Library of Science and Technology (TIB Hannover) contains several hundred thousand journal articles, conference proceedings, research reports and online resources. For our chemical digital library we indexed, among others, the collection of chemical documents from the journal Archive for Organic Chemistry (ARKIVOC)¹, which is one of the most renowned open access sources for organic chemistry. This journal publishes all papers as PDF documents. Since we have to use many tools for deriving metadata for use in our system, all the different document types first have to be converted into one general interface format. We rely on SciXML (an XML derivative) because the named entity recognition framework (Oscar3) expects that format as input.

One of the first challenges we had to tackle was the quality loss during the conversion from the PDF format to the SciXML format. We observed this problem during our first experiments: we had a very low entity recognition rate in comparison to our manually annotated corpora, caused by a very poor conversion quality of the PDF documents. Further investigations showed that this was a structural problem of the PDF document format, and that all available conversion tools ran into the same problems during the conversion step. Thus, we were not able to use any of the available tools out of the box and developed a Java-based framework enabling the ViFaChem II document processor to convert a document into an object model, verify the model and serialize it as a SciXML file, resulting in a considerable quality improvement, also for keyword extraction.

The last step in the ViFaChem document processing is chemical OCR. Chemical OCR analyzes images contained within the document and tries to convert these images back into structural information. This information can be stored in a structure database and be used for, e.g., substructure searches. The problem of this semantic technique is its very poor quality as measured in [9]. Thus, this technique should definitely not be used in any automatic information retrieval process today.

Metadata enrichment. The personalized retrieval process of the Chem.de portal relies on classical bibliographic metadata and semantically enriched metadata, i.e. chemical metadata. Whereas the classical bibliographic metadata is derived from the catalog system of the TIB Hannover and thus has high quality due to the review process in the library, the quality of the automatically generated semantic metadata has to be assessed.

First experiments rely on the author keywords, which were subsequently used for automatically creating folksonomies. The resulting tag clouds were calculated by the Semantic GrowBag technique [10], which investigates higher-order co-occurrences. To assess the quality of these graphs, we conducted a user study with domain experts. All experts were asked to think aloud after being exposed to the individual graph and to provide feedback on how they assessed the quality and which metadata items they considered useful for the average user of the respective collection. Moreover, after reviewing the metadata, the experts were asked about their expectations in terms of correctness and completeness of the automatically generated metadata.

The study resulted in three major observations:

1. Domain experts always started from a (reasonably similar) cognitive classification of possible entities. They expected to find relevant terms with respect to all expected classes.

2. Considering the given metadata, all experts expected to find a similar degree of generality / specificity of the keywords. The respective degree was derived relative to the general understanding of the respective domain.

3. Assessing the type of relationship between each keyword and the query term, all experts tried to embed the terms in a common context. With increasing broadness of the context, the satisfaction with the keywords decreased.

Based on these observations, we proposed three measures, namely degree of category coverage, semantic word bandwidth and relevance of covered terms. Although our preliminary results suggest that these measures are sensible, a detailed investigation using several document corpora is still needed to reflect different topics and sizes. In addition, automatically building folksonomies is just one possible semantic technique for assisted information retrieval, and many more are possible, e.g., author networks, personalized document ranking and automated classification of documents. For all of these semantic techniques information providers should try to find possibilities for quality assessment in order to fulfill their mission of high quality standards.

Besides the chemical entities, reaction names are also extracted. These names are linked to the Chemical Entities of Biological Interest (ChEBI)² and the Name Reaction Ontology (RXNO)³ by simple string matching algorithms.

Indexing. Our chemical digital library has different indexes used for different kinds of document retrieval. All extracted chemical entities are converted into chemical structures and stored within a structure database. This enables the user to search for documents by drawing a chemical entity as a query. In addition, we build up a Lucene-based text index containing the trivial names of the identified entities and the full texts. To provide high recall we had to solve several problems. Chemical substances can have many different and often ambiguous textual representations, like several trivial names, InChI codes or SMILES. In chemical documents, besides structure images, usually only trivial names are used for brevity and improved readability. We developed a workflow allowing the automatic enrichment of chemical metadata from publicly accessible databases for each occurring chemical entity. In this way, it is possible to provide a simple keyword-based search interface within Chem.de. Our experiments show that the resulting retrieval quality of our enriched index is almost as good as that of exact chemical structure searches and significantly better than that of a full-text search [11].

Document retrieval. A user can retrieve documents by doing either a keyword-based search over the text index or a (drawn) structure search over the structure database. A chemical entity search results in a hitlist of chemical entities. The user can then select the chemical entities of interest to retrieve the documents containing the selected entities. The resulting document hitlist can be further filtered by chemical and bibliographic facets as shown in Figure 2.

[Figure 2. Parts of the advanced search interface of the Chem.de portal]

Each facet entry can be selected to be included or excluded. If an entry is included, the document hitlist will only contain documents linked to that facet entry. If an entry is excluded, the hitlist will contain only documents that are not linked to that facet entry.

The document retrieval process also includes ontology-based document retrieval (see Figure 3): all documents containing named reactions are linked to the respective term of the RXNO ontology. Thus, a user can retrieve documents by browsing the RXNO ontology.

[Figure 3. Ontology based document browsing]

4. CHANCES OF QUALITY ASSESSMENT
Despite the enormous complexity of examining the overall quality of information systems, quality assessment will also result in chances for information providers such as digital libraries. Today, the major difference between digital libraries and a simple Web search engine as information provider is the given quality. Web search engines currently do not focus on quality but on fast and effective information retrieval. For instance, Google is indexing millions of books for its book search project⁴ without considering the requirements discussed in [12]. Digital libraries, in contrast, still rely on information quality. This competitive advantage can only be retained if the quality standard can be guaranteed in the future.

Of course, even if semantic techniques may help in the future to obtain automatically generated semantic metadata, digital libraries will have to spend a lot of money on quality assessment. But through the assessment of the quality of service it may also be possible to establish new business models. For instance, digital libraries can provide a higher quality of service to premium customers who pay for the services, in contrast to the standard customer. That way, the semantic techniques used for the retrieval process could be adapted to the customer’s needs. For instance, the digital library can adapt a more general ontology used within the retrieval process to the specific domain of the customer and thus gain higher quality.

High-quality metadata also opens up good options in terms of suitable information interfaces. A promising example of a beneficial usage of high-quality semantic metadata is the GoPubMed portal⁵, which provides ontology-based literature search over around 19 million biomedical research journal articles in the Medline collection. This portal relies on the manually curated MeSH⁶ and Gene Ontology⁷ and thus can offer enormous capabilities in semantic document retrieval.

Generally speaking, the biggest change for digital libraries will be the transparency of the whole information retrieval process. With it, users can understand how the search result is generated and to what extent the underlying data quality affects the retrieval process. As a metaphor, a color-coded visualization based on a traffic light may express information quality. In this way, users know which quality they can expect from the information and can decide whether the given quality is acceptable for the task at hand.

5. FUTURE WORK
Currently, quality is examined for only a few semantic techniques. Therefore, we will investigate different semantic techniques using manual inspection together with appropriate quality measures. Applying these quality measures in a real digital library will be the next step. This will result in an investigation of how the individual quality measures affect the outcome of other semantic techniques and whether it is possible to tune them using the quality input.

For the retrieval part of the digital library, the influence of the quality assessment on the user has to be investigated. This implies the personalized creation of retrieval workflows based on the users’ quality requirements and the visualization of different quality aspects in the digital library interface.

6. ACKNOWLEDGMENTS
This work was partially supported by the German Research Foundation (DFG) within the ViFaChem II project.

7. REFERENCES
[1] S. Tönnies, B. Köhncke, O. Koepler, and W. Balke, "Building Chemical Information Systems - the ViFaChem II Project," Datenbanksysteme in Business, Technologie und Web (BTW 2009), 13. Fachtagung des GI-Fachbereichs "Datenbanken und Informationssysteme" (DBIS), GI, 2009.
[2] A. Abdulkader and M.R. Casey, "Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment," 10th International Conference on Document Analysis and Recognition, IEEE, 2009, pp. 576-580.
[3] D.M. Nichols, C. Chan, D. Bainbridge, D. McKay, and M.B. Twidale, "A lightweight metadata quality tool," International Conference on Digital Libraries, 2008.
[4] T. Margaritopoulos, M. Margaritopoulos, I. Mavridis, and A. Manitsaris, "A conceptual framework for metadata quality assessment," International Conference on Dublin Core and Metadata Applications, 2008.
[5] K. Bischoff, C.S. Firan, W. Nejdl, and R. Paiu, "Can all tags be used for search?," Conference on Information and Knowledge Management, 2008.
[6] S. Tönnies and W. Balke, "Using Semantic Technologies in Digital Libraries – A Roadmap to Quality Evaluation," 13th European Conference, ECDL 2009, Corfu, Greece, September 27 - October 2, 2009, Berlin, Heidelberg: Springer Berlin / Heidelberg, 2009, pp. 168-179.
[7] M. Hearst, "UIs for Faceted Navigation: Recent Advances and Remaining Open Problems," 2008.
[8] N. Fuhr, G. Tsakonas, T. Aalberg, M. Agosti, P. Hansen, S. Kapidakis, C. Klas, L. Kovács, M. Landoni, A. Micsik, C. Papatheodorou, C. Peters, and I. Sølvberg, "Evaluation of digital libraries," International Journal on Digital Libraries, vol. 8, 2007, pp. 21-38.
[9] A. Valko and P. Johnson, "CLiDE Pro: A chemical OCR tool," Proceedings of the 8th International Conference on Chemical Structures (ICCS), 2008.
[10] J. Diederich and W. Balke, "The Semantic GrowBag Algorithm: Automatically Deriving Categorization Systems," 11th European Conference on Research and Advanced Technology for Digital Libraries (ECDL), 2007.
[11] S. Tönnies, B. Köhncke, O. Koepler, and W. Balke, "Exposing the Hidden Web for Chemical Digital Libraries," 10th ACM/IEEE Joint Conference on Digital Libraries (JCDL), Surfers Paradise, Gold Coast, Australia, 2010.
[12] S. Tönnies and W. Balke, "User-centered Content Provisioning over Large Collections of eBooks," Proceedings of the 2009 2nd ACM Workshop on Research Advances in Large Digital Book Repositories, BooksOnline 2009, Corfu, Greece, October 2, 2009.

¹ www.arkat-usa.org
² http://www.ebi.ac.uk/chebi/
³ http://www.rsc.org/ontologies/RXNO/index.asp
⁴ http://books.google.com
⁵ http://www.gopubmed.org/web/gopubmed/
⁶ http://www.nlm.nih.gov/mesh/
⁷ http://www.geneontology.org/