Quality Assessment in Digital Libraries – Challenges and Chances

Sascha Tönnies, Wolf-Tilo Balke
L3S Research Center
Appelstraße 9a
30167 Hannover

{toennies, balke}@L3S.de

ABSTRACT
Today, more and more information providers such as digital libraries offer
corpora related to a specialized domain. Beyond simple keyword-based searches,
the resulting information systems often rely on entity-centered searches. To
offer this kind of search, high-quality document processing is essential. In
addition, information systems increasingly have to rely on semantic techniques
during the workflows of metadata generation, search and navigational access.
However, due to the statistical and/or collaborative nature of such techniques,
the underlying quality of automatically generated metadata is questionable.
Thus, the quality of an information system's metadata annotations, which are
used for subsequent querying of collections, has to be assessed and guaranteed.
In this paper we discuss the importance of metadata quality assessment for
information systems and the chances that arise from controlled and guaranteed
quality.

Categories and Subject Descriptors
H.3.3 [INFORMATION STORAGE AND RETRIEVAL]: Information Search and Retrieval
H.3.7 [INFORMATION STORAGE AND RETRIEVAL]: Digital Libraries – Systems issues

General Terms
Algorithms, Experimentation.

Keywords
Digital Libraries, Information Quality.

Copyright is held by the author/owner(s).
GvD Workshop’10, 25.-28.05.2010, Bad Helmstedt, Germany.

1. INTRODUCTION
Recently, an exponential growth of available information can be observed not
only on the Web but also in highly specialized domains. Thus, all information
providers have to face the problem of information overload. Building on the
work done in the virtual library for chemistry project (ViFaChem II [1]), we
selected chemistry as an application area. A good example of the information
growth in this domain is given in a press release of the Chemical Abstracts
Service (CAS): as of September 2009, a total of 50 million chemical substances
had been indexed in the curated CAS registry, the world's most comprehensive
registry of chemical substances. Remarkably, only 40 million substances had
been indexed just nine months before. In contrast, the CAS registry contained
10 million entries in 1990 and around 22 million entries in 2000.

Typical Web search engines are already able to handle such an amount of data in
a fully automated way by indexing Web content using text-retrieval methods and
structural properties of the underlying collection, such as link analysis. In
contrast, some information providers, e.g. libraries, still rely on manually
created indexes due to high quality requirements. These providers have to face
two serious problems even for focused collections. First, it is increasingly
costly and time-consuming to build up a proper index; second, even given an
ideal collection, the indexing has to foresee all possible uses for each
specific item. In addition, the information overload for the individual
customer and the increasing specialization of (research) interests require
indexes to be more specific in the choice of appropriate indexing terms.

Chemical documents have to be extended by two types of metadata. The first
type, bibliographic metadata such as authors, affiliation, publisher and year,
is readily available in a library environment. The second and, for our
purposes, more important type is chemical metadata, specified by the chemical
entities, reactions, concepts and techniques contained in the original
document. This chemical metadata is not readily available and must be
extracted, collected and structured. Therefore, the development and application
of technologies for automated metadata generation gain in importance. The
advantage of using such techniques is twofold: first, document processing
becomes less expensive, and second, a higher degree of personalization is
possible. In particular, the use of semantic techniques has been proposed to
bring a higher rate of automation into the indexing process. Commonly used
semantic techniques in the domain of digital libraries are (bibliographic)
ontologies, tagging and classification systems. In summary, semantic techniques
rely on statistical and collaborative methods to assess textual documents.
However, due to the nature of statistical and collaborative methods, using such
techniques may result in a loss of retrieval quality in comparison to
handcrafted indexes. For information providers this potential loss in quality
is a serious problem: if users cannot trust the results, the added value of
curated information systems over simple Web searches becomes questionable.
Hence, before a semantic technique can be used, information providers have to
gauge the impact of the technology's use in the retrieval process.

In this paper, we discuss the problem of quality assessment for semantic
techniques in information systems with a focus on digital libraries and show
first results on quality assessment.
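
The registry figures quoted in this section imply a sharp acceleration. As a back-of-the-envelope check, the following Python sketch derives average yearly growth from the numbers given above. The variable names are ours, and treating nine months as 0.75 years is a simplifying assumption.

```python
# Registry sizes quoted in the introduction (in millions of substances).
entries_1990 = 10
entries_2000 = 22
entries_nine_months_earlier = 40   # nine months before September 2009
entries_sep_2009 = 50

# Average growth per year between 1990 and 2000.
rate_1990s = (entries_2000 - entries_1990) / 10

# Average growth per year over the last nine months
# (0.75 years is an approximation, not a figure from the press release).
rate_2009 = (entries_sep_2009 - entries_nine_months_earlier) / 0.75

print(f"1990-2000: {rate_1990s:.1f} million entries per year")
print(f"last nine months: {rate_2009:.1f} million entries per year")
print(f"acceleration factor: {rate_2009 / rate_1990s:.1f}")
```

Under these assumptions the registry grew roughly eleven times faster in 2009 than it did on average during the 1990s.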
2. CHALLENGES IN QUALITY ASSESSMENT
The development of a digital library is a multi-stage process consisting of the
following steps: preprocessing of the underlying documents, metadata
enrichment, indexing of the collection and its metadata, and personalized
document retrieval. Given that each of these steps may incur a loss in quality,
the total quality of the overall information is questionable.

Preprocessing. Chemists often search for documents containing particular
chemical entities or reactions. Therefore, an important part of document
preprocessing is entity recognition, which is a difficult task when working on
proprietary document formats, e.g. PDF documents. Due to the unstructured
representation of PDF documents it is very difficult to achieve high quality
during entity recognition. PDF documents store all characters using their
absolute position within the document, and thus all paragraphs are split into
single-line paragraphs. Since entity names are usually quite long, the
probability that names are split into several parts is rather high. Entity
extractors then have a hard time figuring out whether different parts of a word
belong to the same entity or are entities in their own right (for example the
chemical name 4-(aminomethyl)cyclohexamine separated into 4-aminomethyl and
cyclohexamine). Another difficulty is the processing of chemical formulas
containing superscript and subscript letters, which are lost during text
conversion with any software available on the market. Even simple document
elements like tables and figure captions cannot be processed in high quality.
The resulting quality is even worse if the document has not yet been digitized
and OCR software has to be used, because the digitized result will always
contain additional OCR errors. These errors are insignificant when building up
a full-text index, since standard IR techniques are not really affected by
unsystematic (OCR) errors. In entity-centered domains, however, they may
already be a relevant factor in the process of tokenization and entity
recognition, and may thus also affect the overall retrieval quality [2].

Metadata enrichment. The important part of introducing significant quality
control for information systems lies in the metadata used for querying the
system. The quality of (handcrafted) metadata for traditional libraries may be
measured in terms of completeness, correctness and relevance [3], [4]. For
instance, the classification of a resource according to the Library of Congress
Subject Headings (LCSH) or the Dewey Decimal Classification (DDC) can be
measured using these criteria. In contrast, for a semantic technique such as
collaborative tagging, where users categorize resources with a free vocabulary,
such measures are difficult to apply. Thus, semantically enriched metadata has
to be evaluated with regard to the quality of the metadata itself. In the
domain of collaborative tagging systems, some work investigating the quality of
tags has already been done, see e.g. [5]. But in general, research in the field
of quality assessment for semantic techniques is still rare. A good example is
[6], where measures for the quality of automatically generated taxonomies for
resource classification are investigated. Further approaches, such as the
comparison with other knowledge spaces, e.g. DBpedia and curated databases, are
possible and have to be evaluated. But a major problem in the quality
assessment of metadata generation is the absence of 'gold standards' for
benchmarking.

Indexing. Besides simple text indexes for metadata and full texts, chemical
information service providers offer specialized indexes built up by identifying
all chemical structures in a document collection and indexing them in structure
databases. The resulting databases can be accessed through graphical
interfaces. By drawing a chemical structure, a domain expert can thus formulate
a query, which in turn is parsed by the chemical query parser and matched
against the entities' fingerprints stored inside the structure database. The
amount of manual work required for building up and maintaining such indexes
results in high costs. Today, CAS offers high quality data at a price of about
30,000 USD/year for a single user subscription. Obviously, for the growing open
access movement this type of document indexing is not a viable option. Moreover,
the widely used Web search interfaces, e.g. Google or Yahoo!, cannot retrieve
and search for structural data stored in a database.

Document retrieval. Besides the topical focus, the major success factor for
effective information access is the respective user interface and the
(generally metadata-based) query facility. In terms of suitable interfaces,
information visualization is becoming increasingly prevalent for understanding
and explaining information. Currently, faceted navigation is a popular
technique for supporting exploration and discovery in digital libraries and
document collections. Facets refer to different kinds of categories used to
characterize information items in each corpus. However, the large-subject-space
problem is still unresolved and makes innovative, yet understandable extensions
of the faceted model essential [7].

Quality assessment. As today's digital libraries increasingly rely on
automatically enriched (semantic) metadata, the quality assessment of a digital
library is not trivial. The traditional quality measures of digital libraries,
i.e., the attractiveness of the collections, the technology's ease of use and
the user satisfaction [8], may not be sufficient anymore, as all of them measure
the user's experience and presuppose high-quality (manually generated)
metadata. Even though user satisfaction should be the overall goal, users may
look favorably upon the novelty of the interface rather than assess the
retrieval effectiveness. Furthermore, during the assessment, users may not know
which information they are missing because of low data quality.

A possible approach would be to combine the users' satisfaction with each
individual quality aspect during the acquisition of documents, the metadata
generation and the retrieval process. Adding up the single quality values may
not be sufficient for an overall quality assessment: some errors are likely to
affect each other in such a way that the overall loss of quality is higher than
the sum of the single values. The interaction of the individual aspects,
however, still has to be investigated.

3. USE CASE – BUILDING A CHEMICAL DIGITAL LIBRARY
The ViFaChem II project focuses on using knowledge about chemical workflows as
a basis for creating a digital library portal. The overall vision is a
personalized knowledge space for the individual practitioner in the field of
chemistry. Building on (automatically derived) ontologies structuring the
domain, openly accessible topical databases, and specialized indexes of
substances derived from a set of user-selected documents, a personalized
knowledge space can be created that promises to help users combat the
information flood. A characteristic of chemical documents is that the relevant
chemical structure information is encoded not just in text but also in images.
Within this project, a digital library for chemical documents was built up and
integrated
into the Chem.de portal. In this context we already investigated some quality
challenges.

[Figure 1. ViFaChem II architecture. Components shown in the ViFaChem II
System: document processing (Named Entity Recognition, Keyword Extraction, IR,
Ontologies, Chemical OCR), Document Repository, document retrieval (Lucene:
Chemical Entities; GrowBag: Hierarchy Keywords; Syndication; Reasoner:
Ontology, RDF; Structure DB: Structural Data), Open Chemical Databases, and
Quality Assurance.]

Preprocessing. The document collection at the German National Library of
Science and Technology (TIB Hannover) contains several hundred thousand journal
articles, conference proceedings, research reports and online resources. For
our chemical digital library we indexed, among others, the collection of
chemical documents from the journal Archive for Organic Chemistry (ARKIVOC,
www.arkat-usa.org), which is one of the most renowned open access sources for
organic chemistry. This journal publishes all papers as PDF documents. Since we
have to use many tools for deriving metadata for use in our system, all the
different document types first have to be converted into one general interface
format. We rely on SciXML (an XML derivative) because the named entity
recognition framework (Oscar3) expects that format as input.

One of the first challenges we had to address was the quality loss during the
conversion from the PDF format to the SciXML format. We observed that problem
during our first experiments: we had a very low entity recognition rate in
comparison to our manually annotated corpora, caused by a very poor conversion
quality of the PDF documents. Further investigations showed that this was a
structural problem of the PDF document format, and that all available
conversion tools ran into the same problems during the conversion step. Thus,
we were not able to use any of the available tools out of the box and developed
a Java-based framework that enables the ViFaChem II document processor to
convert a document into an object model, verify the model and serialize it as a
SciXML file. This resulted in a considerable quality improvement, also for
keyword extraction.

The last step in the ViFaChem document processing is the chemical OCR. Chemical
OCR analyzes images contained within the document and tries to convert these
images back to structural information. This information can be stored in a
structure database and be used for, e.g., substructure searches. The problem of
this semantic technique is its very poor quality as measured in [9]. Thus, this
technique should not be used in any automatic information retrieval process
today.

Metadata enrichment. The personalized retrieval process of the Chem.de portal
relies on classical bibliographic metadata and semantically enriched metadata,
i.e. chemical metadata. Whereas the classical bibliographic metadata is derived
from the catalog system of the TIB Hannover and thus has high quality due to
the review process in the library, the quality of the automatically generated
semantic metadata has to be assessed.

First experiments rely on the author keywords, which were subsequently used for
automatically creating folksonomies. The resulting tag clouds were calculated
by the Semantic GrowBag technique [10], which investigates higher order
co-occurrences. To assess the quality of these graphs, we conducted a user
study with domain experts. All experts were asked to think aloud after being
exposed to the individual graph and to provide feedback on how they assessed
the quality and which metadata items they considered useful for the average
user of the respective collection. Moreover, after reviewing the metadata, the
experts were asked about their expectations in terms of correctness and
completeness of the automatically generated metadata.

The study resulted in three major observations:
     1.   Domain experts always started from a (reasonably similar) cognitive
          classification of possible entities. They expected to find relevant
          terms with respect to all expected classes.
     2.   Considering the given metadata, all experts expected to find a
          similar degree of generality / specificity of the keywords. The
          respective degree was derived relative to the general understanding
          of the respective domain.
     3.   Assessing the type of relationship between each keyword and the query
          term, all experts tried to embed the terms in a common context. With
          increasing broadness of the context, the satisfaction with the
          keywords decreased.

Based on these observations, we proposed three measures, namely degree of
category coverage, semantic word bandwidth and relevance of covered terms.
Although our preliminary results suggest that these measures are sensible, a
detailed investigation using several document corpora is still needed to
reflect different topics and sizes. In addition, automatically building
folksonomies is just one possible semantic technique for assisted information
retrieval, and many more are possible, e.g., author networks, personalized
document ranking and automated classification of documents. For all of these
semantic techniques, information providers should try to find possibilities for
quality assessment in order to fulfill their mission of high quality standards.

Besides the chemical entities, reaction names are also extracted. These names
are linked to the Chemical Entities of Biological
Interest (ChEBI, http://www.ebi.ac.uk/chebi/) and the Name Reaction Ontology
(RXNO, http://www.rsc.org/ontologies/RXNO/index.asp) by simple string matching
algorithms.

Indexing. Our chemical digital library has different indexes used for different
kinds of document retrieval. All extracted chemical entities are converted into
chemical structures and stored within a structure database. This enables the
user to search for documents by drawing a chemical entity as a query. In
addition, we built a Lucene-based text index containing the trivial names of
the identified entities and the full texts. To provide high recall we had to
solve several problems. Chemical substances can have many different and often
ambiguous textual representations, such as several trivial names, InChI codes
or SMILES. In chemical documents, besides structure images, usually only
trivial names are used for brevity and improved readability. We developed a
workflow allowing the automatic enrichment of chemical metadata from publicly
accessible databases for each occurring chemical entity. In this way, it is
possible to provide a simple keyword-based search interface within Chem.de. Our
experiments show that the resulting retrieval quality of our enriched index is
almost as good as that of exact chemical structure searches and significantly
better than that of a full-text search [11].

Document retrieval. A user can retrieve documents either by a keyword-based
search over the text index or by a (drawn) structure search over the structure
database. A chemical entity search results in a hitlist of chemical entities.
The user can then select the chemical entities of interest to retrieve the
documents containing the selected entities. The resulting document hitlist can
be further filtered by chemical and bibliographic facets as shown in Figure 2.

[Figure 2. Parts of the advanced search interface of the Chem.de portal]

Each facet entry can be selected to be included or excluded. If an entry is
included, the document hitlist will only contain documents linked to that facet
entry. If an entry is excluded, the hitlist will contain only documents that
are not linked to that facet entry.

The document retrieval process also includes ontology-based document retrieval
(see Figure 3): all documents containing named reactions are linked to the
respective term of the RXNO ontology. Thus, a user can retrieve documents by
browsing the RXNO ontology.

[Figure 3. Ontology based document browsing]

4. CHANCES OF QUALITY ASSESSMENT
Despite the enormous complexity of examining the overall quality of information
systems, quality assessment will also create chances for information providers
such as digital libraries. Today, the major difference between digital
libraries and a simple Web search engine as information provider is the quality
offered. Web search engines currently do not focus on quality but on fast and
effective information retrieval. For instance, Google is indexing millions of
books for the Book Search project (http://books.google.com) without considering
the requirements discussed in [12]. Digital libraries, in contrast, still rely
on information quality. This competitive advantage can only be retained if the
quality standard can be guaranteed in the future. Of course, even if semantic
techniques help to generate semantic metadata automatically in the future,
digital libraries will have to spend a lot of money on quality assessment. But
through the assessment of the quality of service it may also be possible to
establish new business models. For instance, digital
libraries can provide a higher quality of service to premium customers, who pay
for the services, than to standard customers. That way, the semantic techniques
used for the retrieval process could be adapted to the customer's needs. For
instance, the digital library can adapt a more general ontology used within the
retrieval process to the specific domain of the customer and thus achieve
higher quality.

High quality metadata also opens up good options in terms of suitable
information interfaces. A promising example of a beneficial use of high quality
semantic metadata is the GoPubMed portal
(http://www.gopubmed.org/web/gopubmed/), which provides ontology-based
literature search over around 19 million biomedical research journal articles
in the Medline collection. This portal relies on the manually curated MeSH
(http://www.nlm.nih.gov/mesh/) and Gene Ontology (http://www.geneontology.org/)
and thus can offer enormous capabilities in semantic document retrieval.

Generally speaking, the biggest change for digital libraries will be the
transparency of the whole information retrieval process. With it, the user can
understand how the search result is generated and to what extent the underlying
data quality affects the retrieval process. As a metaphor, a color-coded
visualization based on a traffic light may express information quality. In this
way, users know which quality they can expect from the information and can
decide whether the given quality is acceptable for the task at hand.

5. FUTURE WORK
Currently, quality is only examined for a few semantic techniques. Therefore,
we will investigate different semantic techniques using manual inspection
together with appropriate quality measures. Applying these quality measures in
a real digital library will be the next step. This will lead to an
investigation of how the individual quality measures affect the outcome of
other semantic techniques and whether it is possible to tune them using the
quality input.

For the retrieval part of the digital library, the influence of the quality
assessment on the user has to be investigated. This implies the personalized
creation of retrieval workflows based on the users' quality requirements and
the visualization of different quality aspects in the digital library
interface.

6. ACKNOWLEDGMENTS
This work was partially supported by the German Research Foundation (DFG)
within the ViFaChem II project.

7. REFERENCES
[1]  S. Tönnies, B. Köhncke, O. Koepler, and W. Balke, "Building Chemical
     Information Systems - the ViFaChem II Project," Datenbanksysteme in
     Business, Technologie und Web (BTW 2009), 13. Fachtagung des
     GI-Fachbereichs "Datenbanken und Informationssysteme" (DBIS), GI, 2009.
[2]  A. Abdulkader and M.R. Casey, "Low Cost Correction of OCR Errors Using
     Learning in a Multi-Engine Environment," 10th International Conference on
     Document Analysis and Recognition, IEEE, 2009, pp. 576-580.
[3]  D.M. Nichols, C. Chan, D. Bainbridge, D. McKay, and M.B. Twidale, "A
     lightweight metadata quality tool," International Conference on Digital
     Libraries, 2008.
[4]  T. Margaritopoulos, M. Margaritopoulos, I. Mavridis, and A. Manitsaris, "A
     conceptual framework for metadata quality assessment," International
     Conference on Dublin Core and Metadata Applications, 2008.
[5]  K. Bischoff, C.S. Firan, W. Nejdl, and R. Paiu, "Can all tags be used for
     search?," Conference on Information and Knowledge Management, 2008.
[6]  S. Tönnies and W. Balke, "Using Semantic Technologies in Digital Libraries
     – A Roadmap to Quality Evaluation," 13th European Conference, ECDL 2009,
     Corfu, Greece, September 27 - October 2, 2009, Berlin, Heidelberg:
     Springer Berlin / Heidelberg, 2009, pp. 168-179.
[7]  M. Hearst, "UIs for Faceted Navigation: Recent Advances and Remaining Open
     Problems," 2008.
[8]  N. Fuhr, G. Tsakonas, T. Aalberg, M. Agosti, P. Hansen, S. Kapidakis, C.
     Klas, L. Kovács, M. Landoni, A. Micsik, C. Papatheodorou, C. Peters, and
     I. Sølvberg, "Evaluation of digital libraries," International Journal on
     Digital Libraries, vol. 8, 2007, pp. 21-38.
[9]  A. Valko and P. Johnson, "CLiDE Pro: A chemical OCR tool," Proceedings of
     the 8th International Conference on Chemical Structures (ICCS), 2008.
[10] J. Diederich and W. Balke, "The Semantic GrowBag Algorithm: Automatically
     Deriving Categorization Systems," 11th European Conference on Research and
     Advanced Technology for Digital Libraries (ECDL), 2007.
[11] S. Tönnies, B. Köhncke, O. Koepler, and W. Balke, "Exposing the Hidden Web
     for Chemical Digital Libraries," 10th ACM/IEEE Joint Conference on Digital
     Libraries (JCDL), Surfers Paradise, Gold Coast, Australia: 2010.
[12] S. Tönnies and W. Balke, "User-centered Content Provisioning over Large
     Collections of eBooks," Proceedings of the 2009 2nd ACM Workshop on
     Research Advances in Large Digital Book Repositories, BooksOnline 2009,
     Corfu, Greece, October 2, 2009.
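
The include/exclude behaviour of the facets described under Document retrieval in Section 3 can be illustrated with a minimal Python sketch. The data layout (a mapping from document identifiers to sets of facet entries), the function name filter_hitlist and the sample documents are hypothetical and are not taken from the Chem.de implementation. We also assume that several included entries are combined conjunctively, which the text does not specify.

```python
def filter_hitlist(hitlist, facets, included=(), excluded=()):
    """Keep documents linked to every included entry and to no excluded entry.

    hitlist  -- list of document identifiers (hypothetical layout)
    facets   -- dict mapping a document identifier to its set of facet entries
    included -- facet entries a document must be linked to
    excluded -- facet entries a document must not be linked to
    """
    included, excluded = set(included), set(excluded)
    result = []
    for doc in hitlist:
        entries = facets.get(doc, set())
        # A document survives if it carries all included entries
        # and shares no entry with the excluded set.
        if included <= entries and not (excluded & entries):
            result.append(doc)
    return result


# Hypothetical sample data for illustration only.
facets = {
    "doc1": {"year:2008", "reaction:Diels-Alder"},
    "doc2": {"year:2009", "reaction:Diels-Alder"},
    "doc3": {"year:2009", "reaction:Wittig"},
}
hitlist = ["doc1", "doc2", "doc3"]

only_diels_alder = filter_hitlist(hitlist, facets, included=["reaction:Diels-Alder"])
without_2009 = filter_hitlist(hitlist, facets, excluded=["year:2009"])
print(only_diels_alder)
print(without_2009)
```

Including an entry narrows the hitlist to documents linked to it, while excluding an entry removes every document linked to it; both selections can be combined in one call.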