Towards Semantic Recommendation of Biodiversity Datasets based on Linked Open Data

Felicitas Löffler
Dept. of Mathematics and Computer Science
Friedrich Schiller University
Jena, Germany

Bahar Sateli
Semantic Software Lab
Dept. of Computer Science and Software Engineering
Concordia University
Montréal, Canada

René Witte
Semantic Software Lab
Dept. of Computer Science and Software Engineering
Concordia University
Montréal, Canada

Birgitta König-Ries
Friedrich Schiller University
Jena, Germany and
German Centre for Integrative Biodiversity Research (iDiv)
Halle-Jena-Leipzig, Germany

ABSTRACT
Conventional content-based filtering methods recommend documents based on extracted keywords. They calculate the similarity between keywords and user interests and return a list of matching documents. In the long run, this approach often leads to overspecialization and fewer new entries with respect to a user’s preferences. Here, we propose a semantic recommender system using Linked Open Data for the user profile and adding semantic annotations to the index. Linked Open Data allows recommendations beyond the content domain and supports the detection of new information. One research area with a strong need for the discovery of new information is biodiversity. Due to the heterogeneity of biodiversity data, their exploration requires interdisciplinary collaboration. Personalization, in particular in recommender systems, can help to link the individual disciplines in biodiversity research and to discover relevant documents and datasets from various sources. We developed a first prototype of our semantic recommender system in this field, where a multitude of existing vocabularies facilitates our approach.

Categories and Subject Descriptors
H.3.3 [Information Storage And Retrieval]: Information Search and Retrieval; H.3.5 [Information Storage And Retrieval]: Online Information Services

General Terms
Design, Human Factors

Keywords
content filtering, diversity, Linked Open Data, recommender systems, semantic indexing, semantic recommendation

Copyright © by the paper’s authors. Copying permitted only for private and academic purposes.
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.

1. INTRODUCTION
Content-based recommender systems observe a user’s browsing behaviour and record the user’s interests [1]. By means of natural language processing and machine learning techniques, the user’s preferences are extracted and stored in a user profile. The same methods are used to obtain suitable content keywords for a content profile. Based on previously seen documents, the system attempts to recommend similar content, which requires a mathematical representation of the user and content profiles. A widely used scheme is TF-IDF (term frequency-inverse document frequency) weighting [19]. Computed from the frequency of keywords appearing in a document, these term vectors capture the influence of keywords in a document or of preferences in a user profile. The angle between these vectors describes the distance or closeness of the profiles and is calculated with similarity measures such as the cosine similarity. The recommendation lists of these traditional, keyword-based recommender systems often contain results very similar to those already seen, leading to overspecialization [11] and the “filter bubble” effect [17]: the user obtains only content that matches the stored preferences, while other related documents that do not perfectly match the stored interests are not displayed. Thus, increasing diversity in recommendations has become a research area of its own [21, 25, 24, 18, 3, 6, 23], mainly aimed at improving the recommendation results in news or movie portals.

One field where content recommender systems could enhance daily work is research. Scientists need to be aware of relevant research in their own and in neighboring fields. Increasingly, in addition to literature, the underlying data itself and even data that has not been used in publications are being made publicly available. An important example of such a discipline is biodiversity research, which explores the variety of species and their genetic and characteristic diversity [12]. The morphological and genetic information of an organism, together with the ecological and geographical context, forms a highly diverse structure. Collected and stored in different data formats, the datasets often contain or link to spatial, temporal and environmental data [22]. Many important research questions cannot be answered by working with individual datasets or data collected by one group, but require meta-analysis across a wide range of data. Since the analysis of biodiversity data is quite time-consuming, there is a strong need for personalization and new filtering techniques in this research area. Ordinary search functions in relevant data portals or databases, e.g., the Global Biodiversity Information Facility (GBIF)¹ and the Catalog of Life,² only return data that match the user’s query exactly and fail to find more diverse and semantically related content. Also, user interests are not taken into account in the result list. We believe our semantic-based content recommender system could facilitate the difficult and time-consuming research process in this domain.

Here, we propose a new semantic-based content recommender system that represents the user profile as Linked Open Data (LOD) [9] and incorporates semantic annotations into the recommendation process. Additionally, the search engine is connected to a terminology server and uses the provided vocabularies for a recommendation. The result list contains more diverse predictions and includes hierarchical concepts or individuals.

The structure of this paper is as follows: Next, we describe related work. Section 3 presents the architecture of our semantic recommender system and some implementation details. In Section 4, an application scenario is discussed. Finally, conclusions and future work are presented in Section 5.

¹ GBIF, http://www.gbif.org
² Catalog of Life, http://www.catalogueoflife.org/col/search/all/

2. RELATED WORK
The major goal of diversity research in recommender systems is to counteract overspecialization [11] and to recommend related products, articles or documents. More books by an author or different movies of a genre are the classical applications, mainly used in recommender systems based on collaborative filtering methods. In order to enhance the variety in book recommendations, Ziegler et al. [25] enrich user profiles with taxonomical super-topics. The recommendation list generated by this extended profile is merged with a rank in reverse order, called dissimilarity rank. Depending on a certain diversification factor, this merging process supports more or less diverse recommendations. Larger diversification factors lead to more diverse products beyond user interests. Zhang and Hurley [24] favor another mathematical solution and describe the balance between diversity and similarity as a constrained optimization problem. They compute a dissimilarity matrix according to the applied criteria, e.g., movie genres, and assign a matching function to find a subset of products that are diverse as well as similar. One hybrid approach by van Setten [21] combines the results of several conventional algorithms, e.g., collaborative and case-based, to improve movie recommendations. Mainly focused on news or social media, approaches using content-based filtering methods try to present different viewpoints on an event to decrease the media bias in news portals [18, 3] or to facilitate the filtering of comments [6, 23].

Apart from Ziegler et al., none of the presented approaches have considered semantic technologies. However, using ontologies and storing user or document profiles in triple stores offers large potential for diversity research in recommender systems. Frasincar et al. [7] define semantically enhanced recommenders as systems with an underlying knowledge base. This can either be linguistic-based [8], where only linguistic relations (e.g., synonymy, hypernymy, meronymy, antonymy) are considered, or ontology-based. In the latter case, the content and the user profile are represented with concepts of an ontology. This has the advantage that several types of relations can be taken into account. For instance, for a user interested in “geology”, the profile contains the concept “geology”, which also permits the recommendation of inferred concepts, e.g., “fossil”. The idea of recommending related concepts was first introduced by Middleton et al. [15]. They developed Quickstep, a recommender system for research papers with ontological terms in the user profile and for paper categories. The ontology only considers is-a relationships and omits other relation types (e.g., part-of). Another simple hierarchical approach by Maidel et al. [13] calculates the distance among concepts in a profile hierarchy. They distinguish between perfect, close and weak matches. When the concept appears in both the user’s and the document’s profile, it is called a perfect match. In a close match, the concept appears in only one of the profiles and a child or parent concept appears in the other. The largest distance is called a weak match, where only one of the profiles contains a grandchild or grandparent concept. Finally, a weighted sum over all matching categories leads to the recommendation list. This ontological filtering method was integrated into the news recommender system epaper. Another semantically enhanced recommender system is Athena [10]. The underlying ontology is used to explore the semantic neighborhood in the news domain. The authors compared several ontology-based similarity measures with the traditional TF-IDF approach. However, this system lacks a connection to a search engine that allows querying large datasets.

All presented systems use manually established vocabularies with a limited number of classes. None of them uses a generic user profile to store the preferences in a semantic format (RDF/XML or OWL). The FOAF (Friend Of A Friend) project³ provides a vocabulary for describing and connecting people, e.g., demographic information (name, address, age) or interests. In 2006, Celma [2] was among the first to leverage FOAF, using it in his music recommender system to store users’ preferences. Our approach goes beyond the FOAF interests by incorporating another generic user model vocabulary, the Intelleo User Modelling Ontology (IUMO).⁴ Besides user interests, IUMO offers elements to store learning goals, competences and recommendation preferences. This allows adapting the results to a user’s previous knowledge or recommending only documents for a specific task.

³ FOAF, http://xmlns.com/foaf/spec/
⁴ IUMO, http://intelleo.eu/ontologies/user-model/spec/

3. DESIGN AND IMPLEMENTATION
In this section, we describe the architecture and some implementation details of our semantic-based recommender system (Figure 1). The user model component, described in Section 3.1, contains all user information. The source files, described in Section 3.2, are analyzed with GATE [5], as described in Section 3.3. Additionally, GATE is connected with a terminology server (Section 3.2) to annotate documents with concepts from the provided biodiversity vocabularies. In Section 3.4, we explain how the annotated documents are indexed with GATE Mímir [4]. The final recommendation list is generated in the recommender component (Section 3.5).

3.1 User profile
The user interests are stored in an RDF/XML format utilizing the FOAF vocabulary for general user information. In
order to improve the recommendations with regard to a user’s previous knowledge and to distinguish between learning goals, interests and recommendation preferences, we incorporate the Intelleo User Modelling Ontology for an extended profile description. Recommendation preferences will contain settings for visualization, e.g., highlighting of interests, and recommender control options, e.g., keyword search or more diverse results. Another adjustment will adapt the result set according to a user’s previous knowledge. To enhance comprehensibility for a beginner, the system could provide synonyms; for an expert, the recommender could include more specific documents.

Figure 1: The architecture of our semantic content recommender system

The interests are stored in the form of links to LOD resources. For instance, in our example profile in Listing 1, a user is interested in “biotic mesoscopic physical object”, which is a concept from the ENVO⁵ ontology. Note that the interest entry in the RDF file does not contain the textual description, but the link to the concept in the ontology, i.e., http://purl.obolibrary.org/obo/ENVO_01000009. Currently, we only support explicit user modelling. Thus, the user information has to be added manually to the RDF/XML file. Later, we intend to develop a user profiling component that gathers a user’s interests automatically. The profile is accessible via an Apache Fuseki⁶ server.

Listing 1: User profile with interests stored as Linked Open Data URIs
  Felicitas
  Loeffler
  Felicitas Loeffler
  Female
  Friedrich Schiller University Jena
  felicitas.loeffler@uni-jena.de

⁵ ENVO, http://purl.obolibrary.org/obo/envo.owl
⁶ Apache Fuseki, http://jena.apache.org/documentation/serving_data/

3.2 Source files and terminology server
The content provided by our recommender comes from the biodiversity domain. This research area offers a wide range of existing vocabularies. Furthermore, biodiversity is an interdisciplinary field, where the results from several sources have to be linked to gain new knowledge. A recommender system for this domain needs to support scientists by improving this linking process and helping them find relevant content in an acceptable time.

Researchers in the biodiversity domain are advised to store their datasets together with metadata describing the collected data. A very common metadata format is ABCD.⁷ This XML-based standard provides elements for general information (e.g., author, title, address), as well as additional biodiversity-related metadata, such as information about taxonomy, scientific name, units or gathering. Very often, each taxon needs specific ABCD fields, e.g., fossil datasets include data about the geological era. Therefore, several additional ABCD-related metadata standards have emerged (e.g., ABCDEFG,⁸ ABCDDNA⁹). One document may contain the metadata of one or more species observations in a textual description, which makes it suitable for annotation and indexing for a semantic search. For our prototype, we use the ABCDEFG metadata files provided by the GFBio¹⁰ project; specifically, metadata files from the Museum für Naturkunde (MfN).¹¹ An example of an ABCDEFG metadata file is presented in Listing 2, containing the core ABCD structure as well as additional information about the geological era.

The terminology server supplied by the GFBio project offers access to several biodiversity vocabularies, e.g., ENVO, BEFDATA, TDWGREGION. It also provides a SPARQL endpoint.¹²

⁷ ABCD, http://www.tdwg.org/standards/115/
⁸ ABCDEFG, http://www.geocase.eu/efg
⁹ ABCDDNA, http://www.tdwg.org/standards/640/
¹⁰ GFBio, http://www.gfbio.org
¹¹ MfN, http://www.naturkundemuseum-berlin.de/
¹² GFBio terminology server, http://terminologies.gfbio.org/sparql/

3.3 Semantic annotation
The source documents are analyzed and annotated according to the vocabularies provided by the terminology server. The analysis is carried out with GATE, a framework that offers several standard language engineering components [5]. We developed a custom GATE pipeline (Figure 2) that analyzes the documents: First, the documents are split into [...] included in the GATE distribution. Afterwards, an ‘Annotation Set Transfer’ processing resource adds the original
markups of the ABCDEFG files to the annotation set, e.g., abcd:HigherTaxon. The subsequent ontology-aware ‘Large KB Gazetteer’ is connected to the terminology server. For each document, all occurring ontology classes are added as specific “gfbioAnnot” annotations that carry both an instance URI (a link to the concrete source document) and a class URI. Finally, a ‘GATE Mímir Processing Resource’ submits the annotated documents to the semantic search engine.

Figure 2: The GFBio pipeline in GATE presenting the GFBio annotations

Listing 2: Excerpt from a biodiversity metadata file in ABCDEFG format [20]
  MfN - Fossil invertebrates
  Gastropods, bivalves, brachiopods, sponges
  Gastropods, Bivalves, Brachiopods, Sponges
  abcd:TaxonomicTerm>
  MfN
  MfN - Fossil invertebrates Ia
  MB.Ga.3895
  Euomphaloidea
  Family
  Euomphalus sp.
  abcd:FullScientificNameString>
  System
  efg:ChronoStratigraphicDivision>
  Triassic
  efg:ChronostratigraphicAttributions>

3.4 Semantic indexing
For semantic indexing, we use GATE Mímir:¹³ “Mímir is a multi-paradigm information management index and repository which can be used to index and search over text, annotations, semantic schemas (ontologies), and semantic metadata (instance data)” [4]. Besides ordinary keyword-based search, Mímir incorporates the previously generated semantic annotations from GATE into the index. Additionally, it can be connected to the terminology server, allowing queries over the ontologies. All index-relevant annotations and the connection to the terminology server are specified in an index template.

¹³ GATE Mímir, https://gate.ac.uk/mimir/

3.5 Content recommender
The Java-based content recommender sends a SPARQL query to the Fuseki server and obtains the interests and preferred recommendation techniques from the user profile as a list of (LOD) URIs. This list is used for a second SPARQL query to the Mímir server. Presently, this query asks only for child nodes (Figure 3). The result set contains ABCDEFG metadata files related to a user’s interests. We intend to experiment with further semantic relations in the future, e.g., object properties. Assuming that a specific fossil used to live in rocks, it might be interesting to know whether other species living in this geological era occurred in rocks. Another filtering method would be to use parent or grandparent nodes from the vocabularies to broaden the search. We will provide control options and feedback mechanisms to support the user in actively steering the recommendation process. The recommender component is still under development and has not been added to the implementation yet.
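The two-step flow of Section 3.5 (read the interest URIs from the profile, then ask the index for documents annotated with child concepts) can be illustrated with a small in-memory sketch. It is not the Java implementation described above and does not contact Fuseki or Mímir: the `CHILDREN` hierarchy, the `INDEX` contents, the `example.org` URIs and the function names are hypothetical stand-ins chosen for illustration. Only the interest URI is taken from the example profile.

```python
from typing import Dict, Iterable, List, Set

# Interest URI from the example user profile in the text.
INTEREST = "http://purl.obolibrary.org/obo/ENVO_01000009"

# Hypothetical vocabulary excerpt: parent concept -> direct child concepts.
# The example.org URIs are placeholders, not real ENVO identifiers.
CHILDREN: Dict[str, List[str]] = {
    INTEREST: ["http://example.org/fossil"],
    "http://example.org/fossil": ["http://example.org/trace-fossil"],
}

# Hypothetical semantic index: document id -> class URIs of its annotations.
INDEX: Dict[str, Set[str]] = {
    "MB.Ga.3895": {"http://example.org/fossil"},
    "doc-trace": {"http://example.org/trace-fossil"},
    "doc-soil": {"http://example.org/soil"},
}


def descendants(uri: str, children: Dict[str, List[str]], depth: int = 1) -> Set[str]:
    """Collect concepts up to `depth` levels below `uri` (depth=1: direct children)."""
    found: Set[str] = set()
    frontier = {uri}
    for _ in range(depth):
        frontier = {c for parent in frontier for c in children.get(parent, [])} - found
        found |= frontier
    return found


def recommend(interests: Iterable[str], children: Dict[str, List[str]],
              index: Dict[str, Set[str]], depth: int = 1) -> List[str]:
    """Return ids of documents annotated with a concept below any interest."""
    wanted: Set[str] = set()
    for uri in interests:
        wanted |= descendants(uri, children, depth)
    return sorted(doc for doc, classes in index.items() if classes & wanted)


print(recommend([INTEREST], CHILDREN, INDEX))           # direct children only
print(recommend([INTEREST], CHILDREN, INDEX, depth=2))  # children and grandchildren
```

With `depth=1` the sketch mirrors the current behaviour of asking only for child nodes; a larger `depth` is one possible way to include the children of child concepts.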
Figure 3: A search for “biotic mesoscopic physical object” returning documents about fossils (child concept)

4. APPLICATION
The semantic content recommender system allows the recommendation of more specific and diverse ABCDEFG metadata files with respect to the stored user interests. Listing 3 shows the query to obtain the interests from the user profile introduced in Listing 1. The result contains a list of (LOD) URIs to concepts in an ontology.

Listing 3: SPARQL query to retrieve user interests
  SELECT ?label ?interest ?syn
  WHERE
  {
    ?s foaf:firstName "Felicitas" .
    ?s um:TopicPreference ?interest .
    ?interest rdfs:label ?label .
    ?interest oboInOwl:hasRelatedSynonym ?syn
  }

In this example, the user would like to obtain biodiversity datasets about a “biotic mesoscopic physical object”, which is the textual description of http://purl.obolibrary.org/obo/ENVO_01000009. This technical term might be incomprehensible to a beginner, e.g., a student, who would prefer a description like “organic material feature”. Thus, for a later adjustment of the result according to a user’s previous knowledge, the system additionally returns synonyms.

The returned interest (LOD) URI is used for a second query to the search engine (Figure 3). The connection to the terminology server allows Mímir to search within the ENVO ontology (Figure 4) and to include related child concepts as well as their children and individuals. Since there is no metadata file containing the exact term “biotic mesoscopic physical object”, a simple keyword-based search would fail. However, Mímir can retrieve more specific information than is stored in the user profile and returns biodiversity metadata files about “fossil”. That ontology class is a child node of “biotic mesoscopic physical object” and represents a semantic relation. Due to the high similarity of the content of the metadata files, the result set in Figure 3 contains only documents which closely resemble each other.

Figure 4: An excerpt from the ENVO ontology

5. CONCLUSIONS
We introduced our new semantically enhanced content recommender system for the biodiversity domain. Its main benefit lies in the connection to a search engine supporting integrated textual, linguistic and ontological queries. We are using existing vocabularies from the terminology server of the GFBio project. The recommendation list contains not only classical keyword-based results, but also documents including semantically related concepts.

In future work, we intend to integrate semantic-based recommender algorithms to obtain more diverse results and to support the interdisciplinary linking process in biodiversity research. We will set up an experiment to evaluate the algorithms on large datasets with the established classification metrics Precision and Recall [14]. Additionally, we would like to extend the recommender component with control options for the user [1]. Integrated into a portal, the result list should be adapted according to a user’s recommendation settings or adjusted to previous knowledge. These control functions allow the user to actively steer the recommendation process. We are planning to use the new layered evaluation approach for interactive adaptive systems from Paramythis, Weibelzahl and Masthoff [16]. Since adaptive systems present different results to each user, ordinary evaluation metrics are not appropriate. Thus, accuracy, validity, usability, scrutability and transparency will be assessed in several layers, e.g., the collection of input data and their interpretation or the decision upon the adaptation strategy. This should lead to an improved consideration of adaptivity in the evaluation process.
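For the planned evaluation with Precision and Recall [14], the following minimal sketch shows how both metrics could be computed for a single recommendation list. The function name and the document identifiers are illustrative assumptions, and returning 0.0 for an empty list is a convention chosen here, not something prescribed by the text.

```python
from typing import Iterable, Tuple


def precision_recall(recommended: Iterable[str], relevant: Iterable[str]) -> Tuple[float, float]:
    """Precision = hits / |recommended|, recall = hits / |relevant|."""
    rec, rel = set(recommended), set(relevant)
    hits = len(rec & rel)
    precision = hits / len(rec) if rec else 0.0
    recall = hits / len(rel) if rel else 0.0
    return precision, recall


# Hypothetical document ids: four recommended, three judged relevant, two in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```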
6. ACKNOWLEDGMENTS
This work was supported by DAAD (German Academic Exchange Service)¹⁴ through the PPP Canada program and by DFG (German Research Foundation)¹⁵ within the GFBio project.

¹⁴ DAAD, https://www.daad.de/de/
¹⁵ DFG, http://www.dfg.de

7. REFERENCES
[1] F. Bakalov, M.-J. Meurs, B. König-Ries, B. Sateli, R. Witte, G. Butler, and A. Tsang. An approach to controlling user models and personalization effects in recommender systems. In Proceedings of the 2013 international conference on Intelligent User Interfaces, IUI ’13, pages 49–56, New York, NY, USA, 2013. ACM.
[2] Ò. Celma. FOAFing the music: Bridging the semantic gap in music recommendation. In Proceedings of 5th International Semantic Web Conference, pages 927–934, Athens, GA, USA, 2006.
[3] S. Chhabra and P. Resnick. Cubethat: News article recommender. In Proceedings of the sixth ACM conference on Recommender systems, RecSys ’12, pages 295–296, New York, NY, USA, 2012. ACM.
[4] H. Cunningham, V. Tablan, I. Roberts, M. Greenwood, and N. Aswani. Information extraction and semantic annotation for multi-paradigm information management. In M. Lupu, K. Mayer, J. Tait, and A. J. Trippe, editors, Current Challenges in Patent Information Retrieval, volume 29 of The Information Retrieval Series, pages 307–327. Springer Berlin Heidelberg, 2011.
[5] H. Cunningham et al. Text Processing with GATE (Version 6). University of Sheffield, Dept. of Computer Science, 2011.
[6] S. Faridani, E. Bitton, K. Ryokai, and K. Goldberg. Opinion space: A scalable tool for browsing online comments. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’10, pages 1175–1184, New York, NY, USA, 2010. ACM.
[7] F. Frasincar, W. IJntema, F. Goossen, and F. Hogenboom. A semantic approach for news recommendation. Business Intelligence Applications and the Web: Models, Systems and Technologies, IGI Global, pages 102–121, 2011.
[8] F. Getahun, J. Tekli, R. Chbeir, M. Viviani, and K. Yétongnon. Relating RSS News/Items. In M. Gaedke, M. Grossniklaus, and O. Díaz, editors, ICWE, volume 5648 of Lecture Notes in Computer Science, pages 442–452. Springer, 2009.
[9] T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan & Claypool, 2011.
[10] W. IJntema, F. Goossen, F. Frasincar, and F. Hogenboom. Ontology-based news recommendation. In Proceedings of the 2010 EDBT/ICDT Workshops, EDBT ’10, pages 16:1–16:6, New York, NY, USA, 2010. ACM.
[11] P. Lops, M. de Gemmis, and G. Semeraro. Content-based recommender systems: State of the art and trends. In F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, editors, Recommender Systems Handbook, pages 73–105. Springer, 2011.
[12] M. Loreau. Excellence in ecology. International Ecology Institute, Oldendorf, Germany, 2010.
[13] V. Maidel, P. Shoval, B. Shapira, and M. Taieb-Maimon. Ontological content-based filtering for personalised newspapers: A method and its evaluation. Online Information Review, 34(5):729–756, 2010.
[14] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[15] S. E. Middleton, N. R. Shadbolt, and D. C. De Roure. Ontological user profiling in recommender systems. ACM Trans. Inf. Syst., 22(1):54–88, Jan. 2004.
[16] A. Paramythis, S. Weibelzahl, and J. Masthoff. Layered evaluation of interactive adaptive systems: Framework and formative methods. User Modeling and User-Adapted Interaction, 20(5):383–453, Dec. 2010.
[17] E. Pariser. The Filter Bubble - What the internet is hiding from you. Viking, 2011.
[18] S. Park, S. Kang, S. Chung, and J. Song. Newscube: delivering multiple aspects of news to mitigate media bias. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’09, pages 443–452, New York, NY, USA, 2009. ACM.
[19] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24:513–523, 1988.
[20] Museum für Naturkunde Berlin. Fossil invertebrates, UnitID:MB.Ga.3895. http://coll.mfn-berlin.de/u/MB_Ga_3895.html.
[21] M. van Setten. Supporting people in finding information: hybrid recommender systems and goal-based structuring. PhD thesis, Telematica Instituut, University of Twente, The Netherlands, 2005.
[22] R. Walls, J. Deck, R. Guralnick, S. Baskauf, R. Beaman, et al. Semantics in Support of Biodiversity Knowledge Discovery: An Introduction to the Biological Collections Ontology and Related Ontologies. PLoS ONE 9(3): e89606, 2014.
[23] D. Wong, S. Faridani, E. Bitton, B. Hartmann, and K. Goldberg. The diversity donut: enabling participant control over the diversity of recommended responses. In CHI ’11 Extended Abstracts on Human Factors in Computing Systems, CHI EA ’11, pages 1471–1476, New York, NY, USA, 2011. ACM.
[24] M. Zhang and N. Hurley. Avoiding monotony: Improving the diversity of recommendation lists. In Proceedings of the 2008 ACM Conference on Recommender Systems, RecSys ’08, pages 123–130, New York, NY, USA, 2008. ACM.
[25] C.-N. Ziegler, G. Lausen, and L. Schmidt-Thieme. Taxonomy-driven computation of product recommendations. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, CIKM ’04, pages 406–415, New York, NY, USA, 2004. ACM.