Towards Semantic Recommendation of Biodiversity Datasets based on Linked Open Data

Felicitas Löffler
Dept. of Mathematics and Computer Science
Friedrich Schiller University
Jena, Germany

Bahar Sateli
Semantic Software Lab
Dept. of Computer Science and Software Engineering
Concordia University
Montréal, Canada

René Witte
Semantic Software Lab
Dept. of Computer Science and Software Engineering
Concordia University
Montréal, Canada

Birgitta König-Ries
Friedrich Schiller University
Jena, Germany
and German Centre for Integrative Biodiversity Research (iDiv)
Halle-Jena-Leipzig, Germany

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.

ABSTRACT

Conventional content-based filtering methods recommend documents based on extracted keywords. They calculate the similarity between keywords and user interests and return a list of matching documents. In the long run, this approach often leads to overspecialization and fewer new entries with respect to a user's preferences. Here, we propose a semantic recommender system using Linked Open Data for the user profile and adding semantic annotations to the index. Linked Open Data allows recommendations beyond the content domain and supports the detection of new information. One research area with a strong need for the discovery of new information is biodiversity. Because biodiversity data are heterogeneous, their exploration requires interdisciplinary collaboration. Personalization, in particular in recommender systems, can help to link the individual disciplines in biodiversity research and to discover relevant documents and datasets from various sources. We developed a first prototype for our semantic recommender system in this field, where a multitude of existing vocabularies facilitates our approach.

Categories and Subject Descriptors
H.3.3 [Information Storage And Retrieval]: Information Search and Retrieval; H.3.5 [Information Storage And Retrieval]: Online Information Services

General Terms
Design, Human Factors

Keywords
content filtering, diversity, Linked Open Data, recommender systems, semantic indexing, semantic recommendation

1. INTRODUCTION

Content-based recommender systems observe a user's browsing behaviour and record the interests [1]. By means of natural language processing and machine learning techniques, the user's preferences are extracted and stored in a user profile. The same methods are utilized to obtain suitable content keywords to establish a content profile. Based on previously seen documents, the system attempts to recommend similar content. Therefore, a mathematical representation of the user and content profile is needed. A widely used scheme is TF-IDF (term frequency-inverse document frequency) weighting [19]. Computed from the frequency of keywords appearing in a document, these term vectors capture the influence of keywords in a document or preferences in a user profile. The angle between these vectors describes the distance or the closeness of the profiles and is calculated with similarity measures, like the cosine similarity. The recommendation lists of these traditional, keyword-based recommender systems often contain very similar results to those already seen, leading to overspecialization [11] and the "filter bubble" effect [17]: The user obtains only content according to the stored preferences; other related documents that do not perfectly match the stored interests are not displayed. Thus, increasing diversity in recommendations has become a research area of its own [21, 25, 24, 18, 3, 6, 23], mainly used to improve the recommendation results in news or movie portals.

One field where content recommender systems could enhance daily work is research. Scientists need to be aware of relevant research in their own but also neighboring fields. Increasingly, in addition to literature, the underlying data itself and even data that has not been used in publications are being made publicly available. An important example for such a discipline is biodiversity research, which explores the variety of species and their genetic and characteristic diversity [12]. The morphological and genetic information of an organism, together with the ecological and geographical context, forms a highly diverse structure. Collected and stored in different data formats, the datasets often contain or link to spatial, temporal and environmental data [22]. Many important research questions cannot be answered by working with individual datasets or data collected by one group, but require meta-analysis across a wide range of data. Since the analysis of biodiversity data is quite time-consuming, there is a strong need for personalization and new filtering techniques in this research area. Ordinary search functions in relevant data portals or databases, e.g., the Global Biodiversity Information Facility (GBIF)¹ and the Catalog of Life,² only return data that match the user's query exactly and fail at finding more diverse and semantically related content. Also, user interests are not taken into account in the result list. We believe our semantic-based content recommender system could facilitate the difficult and time-consuming research process in this domain.

Here, we propose a new semantic-based content recommender system that represents the user profile as Linked Open Data (LOD) [9] and incorporates semantic annotations into the recommendation process. Additionally, the search engine is connected to a terminology server and utilizes the provided vocabularies for a recommendation. The result list contains more diverse predictions and includes hierarchical concepts or individuals.

The structure of this paper is as follows: Next, we describe related work. Section 3 presents the architecture of our semantic recommender system and some implementation details. In Section 4, an application scenario is discussed. Finally, conclusions and future work are presented in Section 5.

2. RELATED WORK

The major goal of diversity research in recommender systems is to counteract overspecialization [11] and to recommend related products, articles or documents. More books by an author or different movies of a genre are the classical applications, mainly used in recommender systems based on collaborative filtering methods. In order to enhance the variety in book recommendations, Ziegler et al. [25] enrich user profiles with taxonomical super-topics. The recommendation list generated by this extended profile is merged with a rank in reverse order, called dissimilarity rank. Depending on a certain diversification factor, this merging process supports more or less diverse recommendations. Larger diversification factors lead to more diverse products beyond user interests. Zhang and Hurley [24] favor another mathematical solution and describe the balance between diversity and similarity as a constrained optimization problem. They compute a dissimilarity matrix according to applied criteria, e.g., movie genres, and assign a matching function to find a subset of products that are diverse as well as similar. One hybrid approach by van Setten [21] combines the results of several conventional algorithms, e.g., collaborative and case-based, to improve movie recommendations. Mainly focused on news or social media, approaches using content-based filtering methods try to present different viewpoints on an event to decrease the media bias in news portals [18, 3] or to facilitate the filtering of comments [6, 23].

Apart from Ziegler et al., none of the presented approaches have considered semantic technologies. However, utilizing ontologies and storing user or document profiles in triple stores represents a large potential for diversity research in recommender systems. Frasincar et al. [7] define semantically enhanced recommenders as systems with an underlying knowledge base. This can either be linguistic-based [8], where only linguistic relations (e.g., synonymy, hypernymy, meronymy, antonymy) are considered, or ontology-based. In the latter case, the content and the user profile are represented with concepts of an ontology. This has the advantage that several types of relations can be taken into account. For instance, for a user interested in "geology", the profile contains the concept "geology" that also permits the recommendation of inferred concepts, e.g., "fossil". The idea of recommending related concepts was first introduced by Middleton et al. [15]. They developed Quickstep, a recommender system for research papers with ontological terms in the user profile and for paper categories. The ontology only considers is-a relationships and omits other relation types (e.g., part-of). Another simple hierarchical approach from Shoval et al. [13] calculates the distance among concepts in a profile hierarchy. They distinguish between perfect, close and weak matches. When the concept appears in both a user's and a document's profile, it is called a perfect match. In a close match, the concept emerges only in one of the profiles and a child or parent concept appears in the other. The largest distance is called a weak match, where only one of the profiles contains a grandchild or grandparent concept. Finally, a weighted sum over all matching categories leads to the recommendation list. This ontological filtering method was integrated into the news recommender system epaper. Another semantically enhanced recommender system is Athena [10]. The underlying ontology is used to explore the semantic neighborhood in the news domain. The authors compared several ontology-based similarity measures with the traditional TF-IDF approach. However, this system lacks a connection to a search engine that allows querying large datasets.

All presented systems use manually established vocabularies with a limited number of classes. None of them utilize a generic user profile to store the preferences in a semantic format (RDF/XML or OWL). The FOAF (Friend Of A Friend) project³ provides a vocabulary for describing and connecting people, e.g., demographic information (name, address, age) or interests. In 2006, Celma [2] was one of the first to leverage FOAF, using it in his music recommender system to store users' preferences. Our approach goes beyond the FOAF interests by incorporating another generic user model vocabulary, the Intelleo User Modelling Ontology (IUMO).⁴ Besides user interests, IUMO offers elements to store learning goals, competences and recommendation preferences. This makes it possible to adapt the results to a user's previous knowledge or to recommend only documents for a specific task.

3. DESIGN AND IMPLEMENTATION

In this section, we describe the architecture and some implementation details of our semantic-based recommender system (Figure 1). The user model component, described in Section 3.1, contains all user information. The source files, described in Section 3.2, are analyzed with GATE [5], as described in Section 3.3. Additionally, GATE is connected with a terminology server (Section 3.2) to annotate documents with concepts from the provided biodiversity vocabularies. In Section 3.4, we explain how the annotated documents are indexed with GATE Mímir [4]. The final recommendation list is generated in the recommender component (Section 3.5).

Figure 1: The architecture of our semantic content recommender system

3.1 User profile

The user interests are stored in an RDF/XML format utilizing the FOAF vocabulary for general user information. In order to improve the recommendations regarding a user's previous knowledge and to distinguish between learning goals, interests and recommendation preferences, we incorporate the Intelleo User Modelling Ontology for an extended profile description. Recommendation preferences will contain settings with respect to visualization, e.g., highlighting of interests, and recommender control options, e.g., keyword-search or more diverse results. Another adjustment will adapt the result set according to a user's previous knowledge. In order to enhance the comprehensibility for a beginner, the system could provide synonyms; for an expert, the recommender could include more specific documents.

The interests are stored in the form of links to LOD resources. For instance, in our example profile in Listing 1, a user is interested in "biotic mesoscopic physical object", which is a concept from the ENVO⁵ ontology. Note that the interest entry in the RDF file does not contain the textual description, but the link to the concept in the ontology, i.e., http://purl.obolibrary.org/obo/ENVO_01000009. Currently, we only support explicit user modelling. Thus, the user information has to be added manually to the RDF/XML file. Later, we intend to develop a user profiling component, which gathers a user's interests automatically. The profile is accessible via an Apache Fuseki⁶ server.

Listing 1: User profile with interests stored as Linked Open Data URIs
(The RDF/XML markup of this listing is not preserved in this copy; only the following element values survive.)

    Felicitas
    Loeffler
    Felicitas Loeffler
    Female
    Friedrich Schiller University Jena
    felicitas.loeffler@uni-jena.de

3.2 Source files and terminology server

The content provided by our recommender comes from the biodiversity domain. This research area offers a wide range of existing vocabularies. Furthermore, biodiversity is an interdisciplinary field, where the results from several sources have to be linked to gain new knowledge. A recommender system for this domain needs to support scientists by improving this linking process and helping them find relevant content in an acceptable time.

Researchers in the biodiversity domain are advised to store their datasets together with metadata, describing information about their collected data. A very common metadata format is ABCD.⁷ This XML-based standard provides elements for general information (e.g., author, title, address), as well as additional biodiversity-related metadata, like information about taxonomy, scientific name, units or gathering. Very often, each taxon needs specific ABCD fields, e.g., fossil datasets include data about the geological era. Therefore, several additional ABCD-related metadata standards have emerged (e.g., ABCDEFG,⁸ ABCDDNA⁹). One document may contain the metadata of one or more species observations in a textual description. This provides for annotation and indexing for a semantic search. For our prototype, we use the ABCDEFG metadata files provided by the GFBio¹⁰ project; specifically, metadata files from the Museum für Naturkunde (MfN).¹¹ An example for an ABCDEFG metadata file is presented in Listing 2, containing the core ABCD structure as well as additional information about the geological era.

The terminology server supplied by the GFBio project offers access to several biodiversity vocabularies, e.g., ENVO, BEFDATA, TDWGREGION. It also provides a SPARQL endpoint¹² [...]

3.3 Semantic annotation

The source documents are analyzed and annotated according to the vocabularies provided by the terminology server. [...] that offers several standard language engineering components [5]. We developed a custom GATE pipeline (Figure 2) that analyzes the documents: First, the documents are split into [...] included in the GATE distribution. Afterwards, an 'Annotation Set Transfer' processing resource adds the original markups of the ABCDEFG files to the annotation set, e.g., abcd:HigherTaxon. The following ontology-aware 'Large KB Gazetteer' is connected to the terminology server. For each document, all occurring ontology classes are added as specific "gfbioAnnot" annotations that have both an instance URI (a link to the concrete source document) and a class URI. At the end, a 'GATE Mímir Processing Resource' submits the annotated documents to the semantic search engine.

Figure 2: The GFBio pipeline in GATE presenting the GFBio annotations

3.4 Semantic indexing

For semantic indexing, we are using GATE Mímir:¹³ "Mímir is a multi-paradigm information management index and repository which can be used to index and search over text, annotations, semantic schemas (ontologies), and semantic metadata (instance data)" [4]. Besides ordinary keyword-based search, Mímir incorporates the previously generated semantic annotations from GATE into the index. Additionally, it can be connected to the terminology server, allowing queries over the ontologies. All index-relevant annotations and the connection to the terminology server are specified in an index template.

3.5 Content recommender

The Java-based content recommender sends a SPARQL query to the Fuseki server and obtains the interests and preferred recommendation techniques from the user profile [...] SPARQL query to the Mímir server. Presently, this query asks only for child nodes (Figure 3). The result set contains ABCDEFG metadata files related to a user's interests. We intend to experiment with further semantic relations in the future, e.g., object properties. Assuming that a specific fossil used to live in rocks, it might be interesting to know if other species, living in this geological era, occurred in rocks. Another filtering method would be to use parent or grandparent [...] provide control options and feedback mechanisms to support the user in steering the recommendation process actively. The recommender component is still under development and has not been added to the implementation yet.

Listing 2: Excerpt from a biodiversity metadata file in ABCDEFG format [20]
(The XML markup of this listing is not preserved in this copy; only the following element values survive.)

    MfN - Fossil invertebrates
    Gastropods, bivalves, brachiopods, sponges
    Gastropods, Bivalves, Brachiopods, Sponges
    MfN
    MfN - Fossil invertebrates Ia
    MB.Ga.3895
    Euomphaloidea
    Family
    Euomphalus sp.
    System
    Triassic

Figure 3: A search for "biotic mesoscopic physical object" returning documents about fossils (child concept)

4. APPLICATION

The semantic content recommender system allows the recommendation of more specific and diverse ABCDEFG metadata files with respect to the stored user interests. Listing 3 shows the query to obtain the interests from a user profile, introduced in Listing 1. The result contains a list of (LOD) URIs to concepts in an ontology.

Listing 3: SPARQL query to retrieve user interests

    SELECT ?label ?interest ?syn WHERE
    {
      ?s foaf:firstName "Felicitas" .
      ?s um:TopicPreference ?interest .
      ?interest rdfs:label ?label .
      ?interest oboInOwl:hasRelatedSynonym ?syn
    }

In this example, the user would like to obtain biodiversity datasets about a "biotic mesoscopic physical object", which is the textual description of http://purl.obolibrary.org/obo/ENVO_01000009. This technical term might be incomprehensible for a beginner, e.g., a student, who would prefer a description like "organic material feature". Thus, for a later adjustment of the result according to a user's previous knowledge, the system additionally returns synonyms.

The returned interest (LOD) URI is utilized for a second query to the search engine (Figure 3). The connection to the terminology server allows Mímir to search within the ENVO ontology (Figure 4) and to include related child concepts as well as their children and individuals. Since there is no metadata file containing the exact term "biotic mesoscopic physical object", a simple keyword-based search would fail. However, Mímir can retrieve more specific information than stored in the user profile and returns biodiversity metadata files about "fossil". That ontology class is a child node of "biotic mesoscopic physical object" and represents a semantic relation. Due to a high similarity regarding the content of the metadata files, the result set in Figure 3 contains only documents which closely resemble each other.

Figure 4: An excerpt from the ENVO ontology

5. CONCLUSIONS

We introduced our new semantically enhanced content recommender system for the biodiversity domain. Its main benefit lies in the connection to a search engine supporting integrated textual, linguistic and ontological queries. We are using existing vocabularies from the terminology server of the GFBio project. The recommendation list contains not only classical keyword-based results, but also documents including semantically related concepts.

In future work, we intend to integrate semantic-based recommender algorithms to obtain further diverse results and to support the interdisciplinary linking process in biodiversity research. We will set up an experiment to evaluate the algorithms in large datasets with the established classification metrics Precision and Recall [14]. Additionally, we would like to extend the recommender component with control options for the user [1]. Integrated into a portal, the result list should be adapted according to a user's recommendation settings or adjusted to previous knowledge. These control functions allow the user to actively steer the recommendation process. We are planning to utilize the new layered evaluation approach for interactive adaptive systems from Paramythis, Weibelzahl and Masthoff [16]. Since adaptive systems present different results to each user, ordinary evaluation metrics are not appropriate. Thus, accuracy, validity, usability, scrutability and transparency will be assessed in several layers, e.g., the collection of input data and their interpretation or the decision upon the adaptation strategy. This should lead to an improved consideration of adaptivity in the evaluation process.

6. ACKNOWLEDGMENTS

This work was supported by DAAD (German Academic Exchange Service)¹⁴ through the PPP Canada program and by DFG (German Research Foundation)¹⁵ within the GFBio project.

7. REFERENCES

[1] F. Bakalov, M.-J. Meurs, B. König-Ries, B. Sateli, R. Witte, G. Butler, and A. Tsang. An approach to controlling user models and personalization effects in recommender systems. In Proceedings of the 2013 international conference on Intelligent User Interfaces, IUI '13, pages 49–56, New York, NY, USA, 2013. ACM.
[2] Ò. Celma. FOAFing the music: Bridging the semantic gap in music recommendation. In Proceedings of 5th International Semantic Web Conference, pages 927–934, Athens, GA, USA, 2006.
[3] S. Chhabra and P. Resnick. Cubethat: News article recommender. In Proceedings of the sixth ACM conference on Recommender systems, RecSys '12, pages 295–296, New York, NY, USA, 2012. ACM.
[4] H. Cunningham, V. Tablan, I. Roberts, M. Greenwood, and N. Aswani. Information extraction and semantic annotation for multi-paradigm information management. In M. Lupu, K. Mayer, J. Tait, and A. J. Trippe, editors, Current Challenges in Patent Information Retrieval, volume 29 of The Information Retrieval Series, pages 307–327. Springer Berlin Heidelberg, 2011.
[5] H. Cunningham et al. Text Processing with GATE (Version 6). University of Sheffield, Dept. of Computer Science, 2011.
[6] S. Faridani, E. Bitton, K. Ryokai, and K. Goldberg. Opinion space: A scalable tool for browsing online comments. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '10, pages 1175–1184, New York, NY, USA, 2010. ACM.
[7] F. Frasincar, W. IJntema, F. Goossen, and F. Hogenboom. A semantic approach for news recommendation. Business Intelligence Applications and the Web: Models, Systems and Technologies, IGI Global, pages 102–121, 2011.
[8] F. Getahun, J. Tekli, R. Chbeir, M. Viviani, and K. Yétongnon. Relating RSS News/Items. In M. Gaedke, M. Grossniklaus, and O. Díaz, editors, ICWE, volume 5648 of Lecture Notes in Computer Science, pages 442–452. Springer, 2009.
[9] T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan & Claypool, 2011.
[10] W. IJntema, F. Goossen, F. Frasincar, and F. Hogenboom. Ontology-based news recommendation. In Proceedings of the 2010 EDBT/ICDT Workshops, EDBT '10, pages 16:1–16:6, New York, NY, USA, 2010. ACM.
[11] P. Lops, M. de Gemmis, and G. Semeraro. Content-based recommender systems: State of the art and trends. In F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, editors, Recommender Systems Handbook, pages 73–105. Springer, 2011.
[12] M. Loreau. Excellence in ecology. International Ecology Institute, Oldendorf, Germany, 2010.
[13] V. Maidel, P. Shoval, B. Shapira, and M. Taieb-Maimon. Ontological content-based filtering for personalised newspapers: A method and its evaluation. Online Information Review, 34(5):729–756, 2010.
[14] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[15] S. E. Middleton, N. R. Shadbolt, and D. C. D. Roure. Ontological user profiling in recommender systems. ACM Trans. Inf. Syst., 22(1):54–88, Jan. 2004.
[16] A. Paramythis, S. Weibelzahl, and J. Masthoff. Layered evaluation of interactive adaptive systems: Framework and formative methods. User Modeling and User-Adapted Interaction, 20(5):383–453, Dec. 2010.
[17] E. Pariser. The Filter Bubble - What the internet is hiding from you. Viking, 2011.
[18] S. Park, S. Kang, S. Chung, and J. Song. Newscube: delivering multiple aspects of news to mitigate media bias. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '09, pages 443–452, New York, NY, USA, 2009. ACM.
[19] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24:513–523, 1988.
[20] Museum für Naturkunde Berlin. Fossil invertebrates, UnitID:MB.Ga.3895. http://coll.mfn-berlin.de/u/MB_Ga_3895.html.
[21] M. van Setten. Supporting people in finding information: hybrid recommender systems and goal-based structuring. PhD thesis, Telematica Instituut, University of Twente, The Netherlands, 2005.
[22] R. Walls, J. Deck, R. Guralnick, S. Baskauf, R. Beaman, et al. Semantics in Support of Biodiversity Knowledge Discovery: An Introduction to the Biological Collections Ontology and Related Ontologies. PLoS ONE 9(3): e89606, 2014.
[23] D. Wong, S. Faridani, E. Bitton, B. Hartmann, and K. Goldberg. The diversity donut: enabling participant control over the diversity of recommended responses. In CHI '11 Extended Abstracts on Human Factors in Computing Systems, CHI EA '11, pages 1471–1476, New York, NY, USA, 2011. ACM.
[24] M. Zhang and N. Hurley. Avoiding monotony: Improving the diversity of recommendation lists. In Proceedings of the 2008 ACM Conference on Recommender Systems, RecSys '08, pages 123–130, New York, NY, USA, 2008. ACM.
[25] C.-N. Ziegler, G. Lausen, and L. Schmidt-Thieme. Taxonomy-driven computation of product recommendations. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, CIKM '04, pages 406–415, New York, NY, USA, 2004. ACM.

FOOTNOTES

1 GBIF, http://www.gbif.org
2 Catalog of Life, http://www.catalogueoflife.org/col/search/all/
3 FOAF, http://xmlns.com/foaf/spec/
4 IUMO, http://intelleo.eu/ontologies/user-model/spec/
5 ENVO, http://purl.obolibrary.org/obo/envo.owl
6 Apache Fuseki, http://jena.apache.org/documentation/serving_data/
7 ABCD, http://www.tdwg.org/standards/115/
8 ABCDEFG, http://www.geocase.eu/efg
9 ABCDDNA, http://www.tdwg.org/standards/640/
10 GFBio, http://www.gfbio.org
11 MfN, http://www.naturkundemuseum-berlin.de/
12 GFBio terminology server, http://terminologies.gfbio.org/sparql/
13 GATE Mímir, https://gate.ac.uk/mimir/
14 DAAD, https://www.daad.de/de/
15 DFG, http://www.dfg.de
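
The keyword-based baseline described in the Introduction (TF-IDF term vectors compared with the cosine similarity) can be illustrated with a minimal sketch. The weighting variant used here (relative term frequency times the logarithm of the inverse document frequency), the function names and the three toy token lists are assumptions made for illustration; they are not taken from the prototype.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per tokenised text.

    Assumed weighting: relative term frequency * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many texts does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if one is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy data (hypothetical): a user profile and two document keyword lists.
profile = ["fossil", "triassic", "gastropod"]
doc_fossils = ["fossil", "gastropod", "triassic", "museum"]
doc_soil = ["soil", "moisture", "sensor"]

profile_vec, fossils_vec, soil_vec = tf_idf_vectors([profile, doc_fossils, doc_soil])
sim_fossils = cosine_similarity(profile_vec, fossils_vec)
sim_soil = cosine_similarity(profile_vec, soil_vec)
print(sim_fossils > sim_soil)
```

A document that shares no keyword with the profile scores zero regardless of how closely related its topic is, which is the overspecialization problem the semantic approach is meant to address.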
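
The perfect, close and weak matches summarized in Section 2 for the approach of Shoval et al. [13] can be sketched as follows. This is a simplified reading of that description, not the published method: the weights, the toy is-a hierarchy, the function names and the choice to count only the best match per user concept are all assumptions.

```python
# Assumed weights per match type; the values in [13] may differ.
WEIGHTS = {"perfect": 1.0, "close": 0.5, "weak": 0.25}

# Hypothetical toy hierarchy, stored as child -> parent.
PARENT = {"fossil": "geology", "geology": "science"}


def ancestors(concept, parent, max_depth=2):
    """Return up to max_depth ancestors: [parent, grandparent]."""
    chain = []
    while concept in parent and len(chain) < max_depth:
        concept = parent[concept]
        chain.append(concept)
    return chain


def match_type(user_concept, doc_concept, parent):
    """Classify a concept pair as 'perfect', 'close', 'weak' or None."""
    if user_concept == doc_concept:
        return "perfect"
    # Check both directions: the related concept may sit in either profile.
    for lower, upper in ((user_concept, doc_concept), (doc_concept, user_concept)):
        chain = ancestors(lower, parent)
        if upper in chain:
            return "close" if chain.index(upper) == 0 else "weak"
    return None


def score(user_profile, doc_profile, parent):
    """Weighted sum over the best match found for each user concept."""
    total = 0.0
    for user_concept in user_profile:
        best = 0.0
        for doc_concept in doc_profile:
            kind = match_type(user_concept, doc_concept, parent)
            if kind is not None:
                best = max(best, WEIGHTS[kind])
        total += best
    return total


close_score = score(["geology"], ["fossil"], PARENT)
weak_score = score(["science"], ["fossil"], PARENT)
print(close_score, weak_score)
```

Ranking documents by this score places exact concept matches first, followed by documents that only mention a parent, child, grandparent or grandchild concept.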
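
The child-concept retrieval described in Sections 3.5 and 4 can be illustrated with an in-memory sketch. It does not use the Mímir or Fuseki APIs; the subclass map, the document annotations and the example.org URIs are hypothetical placeholders. Only the interest URI and the unit ID MB.Ga.3895 appear in the paper.

```python
from collections import deque

# Interest URI stored in the user profile (taken from the paper).
INTEREST = "http://purl.obolibrary.org/obo/ENVO_01000009"

# Hypothetical placeholder URIs standing in for ontology classes.
FOSSIL = "http://example.org/onto/fossil"
TRACE_FOSSIL = "http://example.org/onto/trace-fossil"
SOIL = "http://example.org/onto/soil"

# Hypothetical subclass map, parent -> direct children.
CHILDREN = {INTEREST: [FOSSIL], FOSSIL: [TRACE_FOSSIL]}

# Hypothetical annotation index: document ID -> annotated class URIs.
ANNOTATIONS = {
    "MB.Ga.3895": {FOSSIL},
    "soil-plot-17": {SOIL},
}


def descendants(concept, children):
    """Breadth-first list of all child concepts and their children."""
    found = []
    visited = {concept}
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in visited:
                visited.add(child)
                found.append(child)
                queue.append(child)
    return found


def recommend(interest, children, annotations):
    """Return documents annotated with the interest or any descendant class."""
    targets = {interest, *descendants(interest, children)}
    return sorted(doc for doc, classes in annotations.items() if targets & classes)


exact_hits = [doc for doc, classes in ANNOTATIONS.items() if INTEREST in classes]
expanded_hits = recommend(INTEREST, CHILDREN, ANNOTATIONS)
print(exact_hits, expanded_hits)
```

In this toy index, matching only the exact interest URI finds nothing, while expanding the interest to its child classes retrieves the fossil record, mirroring the behaviour reported for Figure 3.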