Knowledge Organisation for Digital Libraries

Lena-Luise Stahn¹, Ingetraut Dahlberg, and Ernesto William De Luca¹

¹ Georg Eckert Institute, Leibniz Institute for International Textbook Research

stahn@leibniz-gei.de, Ingetraut.Dahlberg@t-online.de, deluca@leibniz-gei.de

Abstract. Today research data and output from nearly every disciplinary and interdisciplinary domain also exist (and sometimes only exist) in digital form. Consequently, to ensure the data’s efficient retrieval as well as its long-term use and relevance, Knowledge Organisation and its tools and systems, i.e. lexical resources like thesauri and ontologies, play a major role in the Digital Libraries world. In this paper we discuss our approach to mapping two of the most promising lexical resources, the Information Coding Classification and the MultiWordNet, after converting both into the EuroWordNet RDF/OWL format. In a second step we show how to integrate this additional knowledge into a domain Knowledge Organisation System (KOS). Finally, we present the method and mapping together with a use case for evaluating it for Information Retrieval purposes.

Keywords: Knowledge Organisation, Lexical Linked Open Data, Digital Libraries

1 Introduction

1.1 Knowledge Organisation Systems (KOS), Linked Open Data (LOD) and Digital Libraries: Current Situation and Problems

Knowledge Organisation has gained in importance with the emergence of the digital era and the World Wide Web. Just as in the physical world, the large digital libraries now emerging require adequate and thorough structuring systems in order to organise the increasing amount of available data and to avoid loss of information. Making them accessible by "putting them into the Semantic Web" requires thorough description and homogenisation, i.e. the provision of a data model capable of dealing with large amounts of data from various domains.
"A KOS serves as a bridge between the user’s information need and the material in the collection. With it, the user should be able to identify an object of interest without prior knowledge of its existence." [1]

[1] https://www.clir.org/pubs/reports/pub91/1knowledge.html

The Georg Eckert Institute [2] (GEI) deals with various textbook research resources from multiple disciplines. Research data, produced within a specific context and under particular research questions, results in a broad spectrum of digital resources, e.g. a digital collection of historic textbooks [3], repositories for curricula [4] and for publications related to textbook research [5], as well as digital information services [6]. Because the GEI still lacks proper knowledge management of the kind a suitable KOS would provide, these research results cannot yet be linked to other research contexts and are therefore in danger of not being reused. Their production within a highly specialised research community with complex but separate contexts and systems prevents them from being found easily, which leads to "death of data", duplicated work and wasted resources. Another researcher with related information interests has no knowledge of the existing GEI data and cannot easily satisfy their information need.

Although many data models and classification systems have been developed to help here, their inadequacy is evident, as these models, too, have their intrinsic structures and characteristics and are most often bound to their respective domain. Digital Libraries suffer from overly specialised models that cannot communicate with each other. The advantages of the Linked Data World (Berners-Lee et al. 2001) are therefore not being exploited (De Luca/Dahlberg 2014, 9). Our idea is to transport this knowledge into the Digital Libraries world. With this a link may be enabled, e.g.
between two disciplines as seemingly loosely related as textbook research and the restoration domain, which eventually enriches both knowledge domains.

Building on our preliminary work (De Luca/Dahlberg 2014), summarised in section 1.2, we show our approach to improving this situation by using broader KOS such as Top Ontologies and Lexical Resources (e.g. WordNet Domains and the Information Coding Classification), and how to integrate these into the Linked Data Cloud. In this paper, we give an overview of Ingetraut Dahlberg’s Information Coding Classification and its potential for Information Retrieval and Knowledge Organisation in the Digital Libraries world, and of Ernesto William De Luca’s work on converting EuroWordNet into the Semantic Web-compliant format RDF/OWL (both in section 1.2). Both procedures are adapted in this work, as shown in sections 2.1 and 2.2, followed by a proof of concept (section 2.3). Because of additional expertise in the field, our current approach pursues this work by applying the restoration domain as a use case: if the integration of the domain knowledge into the (Multilingual) Lexical Linked Data Cloud succeeds, it will be enriched with knowledge from a broader KOS, as provided by the ICC with its Knowledge Domains. Consequently, a link will be created to other research fields for which its relevance has not yet been detected. Ideally, new research questions may arise from this unexpected combination. The field of textbook research, with its highly multidisciplinary character, may benefit in particular (cf. Dahlberg 2017, 2014, 120).

[2] http://www.gei.de/en/home.html
[3] http://gei-digital.gei.de/viewer/
[4] http://bibliothek.gei.de/en/library/curricula-workstation-textbookcat.html
[5] http://repository.gei.de/
[6] http://edumeres.net/
1.2 Preliminary Work

Dahlberg’s Information Coding Classification

Dahlberg’s Information Coding Classification (ICC) (Dahlberg 2017, 2014, 1974) follows this intention: its structure is based on a model that orders the world’s knowledge disciplines according to "evolution-theoretic aspects" ("evolutionstheoretischen Gesichtspunkten"). The knowledge of the world is seen as belonging to different levels of existence ("Seinsschichten"), each structured through "aspects". Because the same aspects structure every (sub-)level, this aspect-oriented view can be extended without limit. Its bilingual character can be seen as an additional advantage: the multilingualism of the Web is a source of ambiguities, and here the ICC may function as a translation tool.

The initial development of the ICC as a "faceted classification" by Dahlberg in the early seventies was made possible by funding from the former Deutsche Gesellschaft für Dokumentation e.V. (DGD) [7]. Dahlberg’s intention was to establish a classification system that does not depend on the division into knowledge disciplines, in order to facilitate long-term use and extensibility (De Luca/Dahlberg 2014, 3 et seq.). Her approach is based on J. K. Feibleman’s (1954) and N. Hartmann’s (1964) "Schichtentheorie des Seins" ("Theory of Levels of Being"), which divides existence/being into layers of "areas of being" ("Seinsbereiche"). Each layer is structured by nine "categorically defined aspects", which function as a "Systemstellenplan" or "Systemifikator" and divide the "areas of being" into nine "Subject Groups" ("Sachgruppen") (SG) and further into nine "Knowledge Domains" ("Wissensgebiete") (WG) (for further description cf. Dahlberg 2017, 2014, 1974, 230 et seq., 259 et seq.).

De Luca’s RDF/OWL for EuroWordNet

De Luca (De Luca/Dahlberg 2014, 4) pursued these findings in the context of improving Information Retrieval results by means of NLP techniques.
He decided to take the ICC into account when extending the classification resource WordNet Domains (WN Domains; Magnini/Cavaglià 2000), which is based on the commonly used Princeton WordNet (PWN; Miller et al. 1990, Fellbaum 1998), and especially the multilingual extension EuroWordNet (EWN; Vossen 1997), used in the Semantic Web context. De Luca’s work comprises the conversion of EuroWordNet into a Semantic Web-compliant format (De Luca et al. 2007), for which RDF/OWL has proved to be the most useful, based on the RDF/OWL format for Princeton WordNet developed by van Assem et al. (2004). The RDF/OWL WordNet model comprises three main classes, SynSet, WordSense and Word, with their respective subclasses, and three relation types (relations between SynSets, relations between WordSenses, and attributes). An example is shown in figure 1. The conversion supports standardisation and hence long-term use, as well as EuroWordNet’s multilingualism.

In (De Luca/Dahlberg 2014) we adopted this EuroWordNet conversion for the ICC and discussed how to extend the ICC with the EuroWordNet in RDF/OWL. The present paper follows this approach (see section 2.2), using the MultiWordNet (Magnini et al. 1994), a multilingual version of the Princeton WordNet which, unlike the EWN, also adheres to the PWN’s structure and hence to the WN Domains. This way we will be able to implement the mapping between the WN Domains and the ICC Knowledge Domains, which so far exists only on a theoretical level.

[7] In 2014 it was renamed Deutsche Gesellschaft für Information und Wissen e. V. (DGI). http://dgi-info.de/

Fig. 1. Table 1: Information Coding Classification. Survey of Subject Groups (English)

Fig. 2.
Figure 1: RDF/OWL-EuroWordNet SynSet Example

2 Approach Extension and Use Case

2.1 ICC Actualisation

The continuation of this work presented in this paper comprises several steps. The first step is to update and complete the ICC itself, namely to integrate its fourth to sixth levels. This will be achieved by completing Dahlberg’s preliminary project "Logstruktur" (Dahlberg 1977, 1979) and continuing the work of her dissertation (Dahlberg 1974). The following tasks need to be handled:

1. Revision: review of the work done by Dahlberg in 1974, addition of any missing definitions in all divisions, English translation.
2. Actualisation/Updating: insertion of newly established knowledge domains into the ICC. Since the "Vademecum Deutscher Lehr- und Forschungsstätten" (1960), used by Dahlberg for this step in 1974, no longer exists, a new source listing German research institutions needs to be found; domain experts will also be consulted.
3. Review by domain experts
4. Compilation of defined terms in alphabetical order (lexicon basis)
5. Compilation of defined terms in systematic order (classification basis)
6. English translation

2.2 ICC and MultiWordNet conversion into RDF/OWL EWN and mapping

This part applies the conversion method from (De Luca et al. 2007) and (De Luca/Dahlberg 2014, 8), starting with the format adaptation:

1. Schema extension/adaptation:
   – ICC and MultiWordNet analysis: specific requirements and format adaptation
   – RDF/OWL EWN format adaptation for ICC and MultiWordNet
2. The second step will be to apply the conversion method developed in (De Luca et al.
2007) and again proposed in (De Luca/Dahlberg 2014), so that the ICC and MultiWordNet data are available in a format that allows the theoretical and practical mapping onto the WN Domains:
   – Theoretical mapping: map the ICC Knowledge Domains onto the WN Domains, based on the initial approach in (De Luca/Dahlberg 2014, 6), taking into account the ICC updates from step one.
   – Practical mapping: once the two ontologies have been mapped on a theoretical as well as on a schema level, the actual data integration between the ICC and the WordNet Domains via the MultiWordNet becomes possible. This procedure again follows the approach in (De Luca/Dahlberg 2014, 8 et seq.): enlarging the MultiWordNet coverage with the ICC generic terms of the Knowledge Domains, declaring every ICC generic term as a class (owl:Class) and every underlying term as a subclass (rdfs:subClassOf), and finally supplementing every class (of the ICC top ontology) with a language description (e.g. xml:lang="en") in order to add it to the correct set of language files of the MultiWordNet.

2.3 Proof of concept: Restoration Domain as Use Case and Information Retrieval Tasks

As a proof of concept, we decided to add a third step that links the restoration domain ontology, which is intended for document indexing purposes: the chosen restoration terminologies were established especially for the SemRes project and have so far been used only on an experimental level.

The integration is based on the approach in (De Luca et al. 2007) and will be done manually: after determining the correct SynSet in the WN-ICC-Extension, the domain ontology’s top concept will be added as a hyponym of that SynSet [8]. This requires converting the domain ontologies into the same RDF/OWL format [9].

Eventually we will be able to perform Information Retrieval tasks with the converted KOS, which will test the newly integrated ICC top ontology for Information Retrieval purposes.
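As a minimal sketch of the two integration steps described above (declaring ICC generic terms and their underlying terms as classes, and attaching a domain ontology’s top concept as a hyponym of a SynSet), the following Python fragment builds the corresponding triples and prints them in a Turtle-like notation. It is an illustration under stated assumptions, not the conversion tool used in the project: the prefixes `icc:`, `res:` and `wn:`, the property name `wn:hyponymOf`, the SynSet identifier and all term names are hypothetical placeholders, and the Turtle language tag `@en` stands in for the `xml:lang="en"` attribute of the RDF/XML serialisation.

```python
# Sketch only: prefixes, property names and terms below are hypothetical
# placeholders, not taken from the ICC, MultiWordNet or SemRes data.

RDF_TYPE = "rdf:type"
OWL_CLASS = "owl:Class"
RDFS_SUBCLASS_OF = "rdfs:subClassOf"
RDFS_LABEL = "rdfs:label"
ICC_PREFIX = "icc:"  # assumed prefix for ICC terms


def icc_name(term):
    """Turn a term label into a prefixed local name (spaces become underscores)."""
    return ICC_PREFIX + term.replace(" ", "_")


def icc_term_triples(generic_term, underlying_terms, lang="en"):
    """Declare a generic term as owl:Class and each underlying term as its subclass."""
    parent = icc_name(generic_term)
    triples = [
        (parent, RDF_TYPE, OWL_CLASS),
        (parent, RDFS_LABEL, f'"{generic_term}"@{lang}'),  # language description
    ]
    for term in underlying_terms:
        child = icc_name(term)
        triples.append((child, RDF_TYPE, OWL_CLASS))
        triples.append((child, RDFS_SUBCLASS_OF, parent))
        triples.append((child, RDFS_LABEL, f'"{term}"@{lang}'))
    return triples


def attach_domain_ontology(top_concept, synset):
    """Add the domain ontology's top concept as a hyponym of the chosen SynSet."""
    return [(top_concept, "wn:hyponymOf", synset)]  # property name is assumed


def to_turtle(triples):
    """Serialise triples one statement per line, Turtle style."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


if __name__ == "__main__":
    graph = icc_term_triples("Generic Term X", ["Underlying Term X1"])
    graph += attach_domain_ontology("res:TopConcept", "wn:synset-example-noun-1")
    print(to_turtle(graph))
```

Keeping the triples as plain tuples makes the mapping rules easy to inspect by hand, which suits the manual integration step; a production conversion would more likely use a dedicated RDF library and the actual schema IRIs.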
We imagine use cases in which Information Retrieval is carried out in the restoration domain, searching for the semantic concept "democracy". Through the newly established links between the restoration domain ontology and the ICC Knowledge Domains via MultiWordNet, information on the "democracy" concept may also be found in the textbook research domain, where the concept would be used in the context of education or art history. In this way, the abstraction of concepts integrated into the available Lexical Linked Data Cloud is facilitated, and relations between different domains can be discovered.

[8] "After having found the correct SynSet, we merged manually the complete converted domain ontology under the appropriate hyperonym (SynSet). In this case we could enlarge the EuroWordNet coverage with domain-specific terms." De Luca et al. 2007, 9.
[9] "The first step before including the domain ontologies in the new EuroWordNet OWL hierarchy was to convert these in the same OWL format [...] merging methods for including these domain-ontologies to the EuroWordNet Owl representation [...] domain ontology is then added to the generic one, directly under its new hyperonym." De Luca et al. 2007, 9.

3 Conclusion and future work

In this paper we have shown our approach to establishing a "Lexikon der Wissensgebiete" (lexicon of knowledge fields) by continuing the work of Dahlberg (1974, 1977, 1979). We then discussed how to bring this knowledge database into the lexical linked data cloud by converting it into RDF/OWL in order to use it as a Top Ontology within the Lexical Resource WordNet. This step is based on the EuroWordNet conversion into RDF/OWL presented by De Luca et al. (2007). By integrating the restoration domain ontology as a use case, we can evaluate the usefulness of the enriched lexical resource for Information Retrieval or query expansion tasks.

Further possible work comprises the connection of additional domain ontologies, e.g.
archaeology, to evaluate the usefulness of the resulting enriched lexical linked data knowledge repository in Digital Humanities research scenarios. We expect this to offer a general improvement for Information Retrieval within the Digital Libraries world.

References

1. van Assem, M., Gangemi, A., and Schreiber, G.: Wordnet in RDFS and OWL. Technical report, W3C (2004)
2. Berners-Lee, T., Hendler, J., and Lassila, O.: The Semantic Web. A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities. Scientific American, 284(5), 1-5 (2001)
3. Dahlberg, I.: Grundlagen universaler Wissensordnung. (1974)
4. Dahlberg, I.: Logstruktur: Ein Projekt zur Generierung einer Systematik der Wissenschaften. Deutsche Universitätszeitung 8, 250 (1977)
5. Dahlberg, I.: Das DFG-Projekt "Logstruktur". Formaler Abschlussbericht. (1979)
6. Dahlberg, I.: Information Coding Classification: Geschichtliches, Prinzipien, Inhaltliches. Information Wissenschaft und Praxis 61(8), 449-454 (2010)
7. Dahlberg, I.: A faceted classification of general concepts. Classification & Ontology. Proc. Intern. UDC-Seminar 2011, 177-193 (2011)
8. Dahlberg, I.: A systematic new lexicon of all knowledge fields, based on the Information Coding Classification. Knowledge Organization 39(2), 142-150 (2012)
9. Dahlberg, I.: Wissensorganisation: Entwicklung, Aufgabe, Anwendung, Zukunft. Textbooks for Knowledge Organization, Vol 3 (2014)
10. Dahlberg, I.: Brief Communication: Why a new universal classification system is needed. Knowledge Organization, 44(1), 65-71 (2017)
11. De Luca, E. W., Eul, M., and Nürnberger, A.: Converting EuroWordNet in OWL and Extending It with Domain Ontologies. In: Proceedings of the Workshop on Lexical-Semantic and Ontological Resources, ## (2007)
12. De Luca, E. W.: Semantic Support in Multilingual Text Retrieval. (2008)
13. De Luca, E. W., Plumbaum, T., Kunegis, J., and Albayrak, S.: Multilingual Ontology-based User Profile Enrichment.
In: Proceedings of the First International Workshop on the Multilingual Semantic Web (MSW 2010), in conjunction with WWW 2010 - 19th International World Wide Web Conference 571, 41-42 (2010)
14. De Luca, E. W.: Extending the Linked Data Cloud with Multilingual Lexical Linked Data. Knowledge Organization, 40(5) (2013)
15. De Luca, E. W., and Dahlberg, I.: Die Multilingual Lexical Linked Data Cloud: Eine mögliche Zugangsoptimierung? Information-Wissenschaft & Praxis, 65(4-5), 279-287 (2014)
16. Feibleman, J. K.: The Integrative Levels in Nature. British J. Philosophy of Science, May (1954)
17. Fellbaum, C.: WordNet, an electronic lexical database. (1998)
18. Hartmann, N.: Der Aufbau der realen Welt. Grundriss einer allgemeinen Kategorienlehre. (1964)
19. Magnini, B., Strapparava, C., Ciravegna, F., and Pianta, E.: Multilingual lexical knowledge bases: Applied WordNet prospects. In: Proceedings of the International Workshop on "The Future of the Dictionary" (1994)
20. Magnini, B., and Cavaglià, G.: Integrating Subject Field Codes into WordNet. In: LREC, 1413-1418 (2000)
21. Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., and Miller, K. J.: Introduction to WordNet: An online lexical database. International Journal of Lexicography, 3(4), 235-244 (1990)
22. VDLF; ein Handbuch des wissenschaftlichen Lebens. (1960)
23. Vossen, P.: EuroWordNet: a multilingual database for information retrieval. In: Proceedings of the DELOS workshop on Cross-language Information Retrieval, 5-7 (1997)