Proceedings of the ISWC 2014
Posters & Demonstrations Track
Editors:
Matthew Horridge, Marco Rospocher and Jacco van Ossenbruggen
Preface
The ISWC 2014 Poster and Demonstration track complements the Research Paper track of
the conference and offers an opportunity for presenting late-breaking research results, ongoing
research projects, and speculative or innovative work in progress. The informal setting of the
track encourages presenters and participants to engage in discussions about the presented work.
Such discussions can be a valuable input into the future work of the presenters, while offering
participants an effective way to broaden their knowledge of the emerging research trends and
to network with other researchers.
These proceedings contain the four-page abstracts of all accepted posters and demos pre-
sented at ISWC 2014. Posters range from technical contributions and reports on Semantic Web
software systems to descriptions of completed work and work in progress. Demonstrations
showcase innovative Semantic Web related implementations and technologies. This year we had
156 submissions, of which the program committee accepted 71 posters and 49 demos. We would
like to take this opportunity to thank all of the authors for their contributions to the ISWC
2014 programme!
We would also like to thank the members of the program committee and the additional
reviewers for their time and efforts. A special thanks for respecting our deadlines; we know
these fell in the middle of the summer holidays for many of you! All abstracts included here
have been revised and improved based on your valuable feedback, and we feel the final result
represents a wide variety of topics that will offer a vibrant and exciting session at the conference.
Finally, we would like to thank our local organisers Luciano Serafini and Chiara Ghidini for
their invaluable help in sorting out the logistics of this track.
September 2014 Matthew Horridge
Stanford, Trento, Amsterdam Marco Rospocher
Jacco van Ossenbruggen
Program Committee
Alessandro Adamou
Carlo Allocca
Samantha Bail
Pierpaolo Basile
Eva Blomqvist
Victor de Boer
Stefano Bortoli
Loris Bozzato
Volha Bryl
Marut Buranarach
Jim Burton
Elena Cabrio
Annalina Caputo
Vinay Chaudhri
Gong Cheng
Sam Coppens
Oscar Corcho
Francesco Corcoglioniti
Claudia D’Amato
Chiara Del Vescovo
Aidan Delaney
Daniele Dell’Aglio
Chiara Di Francescomarino
Mauro Dragoni
Marieke van Erp
Antske Fokkens
Marco Gabriele Enrico Fossati
Anna Lisa Gentile
Aurona Gerber
Jose Manuel Gomez-Perez
Tudor Groza
Gerd Gröner
Peter Haase
Armin Haller
Karl Hammar
Michiel Hildebrand
Pascal Hitzler
Aidan Hogan
Matthew Horridge
Hanmin Jung
Simon Jupp
Haklae Kim
Pavel Klinov
Patrick Lambrix
Paea Le Pendu
Florian Lemmerich
Yuan-Fang Li
Joanne Luciano
Despoina Magka
Sara Magliacane
James Malone
Nicolas Matentzoglu
Georgios Meditskos
Alessandra Mileo
Kody Moodley
Andrea Moro
Yuan Ni
Andrea Giovanni Nuzzolese
Alessandro Oltramari
Jacco van Ossenbruggen
Alessio Palmero Aprosio
Matteo Palmonari
Jeff Z. Pan
Guilin Qi
José Luis Redondo-García
Marco Rospocher
Tuukka Ruotsalo
Manuel Salvadores
Bernhard Schandl
Christoph Schütz
Patrice Seyed
Giorgos Stoilos
Nenad Stojanovic
Mari Carmen Suárez-Figueroa
Hideaki Takeda
Myriam Traub
Tania Tudorache
Anni-Yasmin Turhan
Victoria Uren
Davy Van Deursen
Willem Robert van Hage
Maria Esther Vidal
Serena Villata
Boris Villazón-Terrazas
Stefanos Vrochidis
Martine de Vos
Simon Walk
Hai H. Wang
Haofen Wang
Kewen Wang
Shenghui Wang
Jun Zhao
Yi Zhou
Amal Zouaq
Additional Reviewers
Muhammad Intizar Ali
David Carral
Marco Cremaschi
Brian Davis
Jangwon Gim
Hegde Vinod
Nazmul Hussain
Myunggwon Hwang
Amit Joshi
Kim Taehong
Kim Young-Min
Fadi Maali
Theofilos Mailis
Nicolas Matentzoglu
Jim Mccusker
David Molik
Raghava Mutharaju
Alina Patelli
Thomas Ploeger
Riccardo Porrini
Anon Reviewera
Victor Saquicela
Ana Sasa Bastinos
Veronika Thost
Jung-Ho Um
Zhangquan Zhou
Contents
1 Life Stories as Event-based Linked Data: Case Semantic National Biog-
raphy
Eero Hyvönen, Miika Alonen, Esko Ikkala and Eetu Mäkelä 1
2 News Visualization based on Semantic Knowledge
Sebastian Arnold, Damian Burke, Tobias Dörsch, Bernd Löber and An-
dreas Lommatzsch 5
3 Sherlock: a Semi-Automatic Quiz Generation System using Linked Data
Dong Liu and Chenghua Lin 9
4 Low-Cost Queryable Linked Data through Triple Pattern Fragments
Ruben Verborgh, Olaf Hartig, Ben De Meester, Gerald Haesendonck,
Laurens De Vocht, Miel Vander Sande, Richard Cyganiak, Pieter Col-
paert, Erik Mannens and Rik Van de Walle 13
5 call: A Nucleus for a Web of Open Functions
Maurizio Atzori 17
6 Cross-lingual detection of world events from news articles
Gregor Leban, Blaž Fortuna, Janez Brank and Marko Grobelnik 21
7 Multilingual Word Sense Disambiguation and Entity Linking for Every-
body
Andrea Moro, Francesco Cecconi and Roberto Navigli 25
8 Help me describe my data: A demonstration of the Open PHACTS
VoID Editor
Carole Goble, Alasdair J. G. Gray and Eleftherios Tatakis 29
9 OUSocial2 - A Platform for Gathering Students’ Feedback from Social
Media
Keerthi Thomas, Miriam Fernandez, Stuart Brown and Harith Alani 33
10 Using an Ontology Learning System for Trend Analysis and Detection
Gerhard Wohlgenannt, Stefan Belk, Matyas Karacsonyi and Matthias
Schett 37
11 A Prototype Web Service for Benchmarking Power Consumption of Mo-
bile Semantic Applications
Evan Patton and Deborah McGuinness 41
12 SPARKLIS: a SPARQL Endpoint Explorer for Expressive Question An-
swering
Sébastien Ferré 45
13 Reconciling Information in DBpedia through a Question Answering Sys-
tem
Elena Cabrio, Alessio Palmero Aprosio and Serena Villata 49
14 Open Mashup Platform - A Smart Data Exploration Environment
Tuan-Dat Trinh, Ba-Lam Do, Peter Wetz, Amin Anjomshoaa, Elmar
Kiesling and Amin Tjoa 53
15 CIMBA - Client-Integrated MicroBlogging Architecture
Andrei Sambra, Sandro Hawke, Timothy Berners-Lee, Lalana Kagal and
Ashraf Aboulnaga 57
16 The Organiser - A Semantic Desktop Agent based on NEPOMUK
Sebastian Faubel and Moritz Eberl 61
17 HDTourist: Exploring Urban Data on Android
Elena Hervalejo, Miguel A. Martinez-Prieto, Javier D. Fernández and
Oscar Corcho 65
18 Integrating NLP and SW with the KnowledgeStore
Marco Rospocher, Francesco Corcoglioniti, Roldano Cattoni, Bernardo
Magnini and Luciano Serafini 69
19 Graphical Representation of OWL 2 Ontologies through Graphol
Marco Console, Domenico Lembo, Valerio Santarelli and Domenico
Fabio Savo 73
20 LIVE: a Tool for Checking Licenses Compatibility between Vocabularies
and Data
Guido Governatori, Ho-Pun Lam, Antonino Rotolo, Serena Villata,
Ghislain Auguste Atemezing and Fabien Gandon 77
21 The Map Generator Tool
Valeria Fionda, Giuseppe Pirrò and Claudio Gutierrez 81
22 Named Entity Recognition using FOX
René Speck and Axel-Cyrille Ngonga Ngomo 85
23 A Linked Data Platform adapter for the Bugzilla issue tracker
Nandana Mihindukulasooriya, Miguel Esteban-Gutierrez and Raúl Garcı́a-
Castro 89
24 LED: curated and crowdsourced Linked Data on Music Listening Expe-
riences
Alessandro Adamou, Mathieu D’Aquin, Helen Barlow and Simon Brown 93
25 WhatTheySaid: Enriching UK Parliament Debates with Semantic Web
Yunjia Li, Chaohai Ding and Mike Wald 97
26 Multilingual Disambiguation of Named Entities Using Linked Data
Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, Wencan Luo and Lars
Wesemann 101
27 The Wikipedia Bitaxonomy Explorer
Tiziano Flati and Roberto Navigli 105
28 Enhancing Web intelligence with the content of online video fragments
Lyndon Nixon, Matthias Bauer and Arno Scharl 109
29 EMBench: Generating Entity-Related Benchmark Data
Ekaterini Ioannou and Yannis Velegrakis 113
30 Demonstration of multi-perspective exploratory search with the Discov-
ery Hub web application
Nicolas Marie and Fabien Gandon 117
31 Modeling and Monitoring Processes exploiting Semantic Reasoning
Mauro Dragoni, Piergiorgio Bertoli, Chiara Di Francescomarino, Chiara
Ghidini, Michele Nori, Marco Pistore, Roberto Tiella and Francesco
Corcoglioniti 121
32 WikipEvent: Temporal Event Data for the Semantic Web
Ujwal Gadiraju, Kaweh Djafari Naini, Andrea Ceroni, Mihai Georgescu,
Dang Duc Pham and Marco Fisichella 125
33 Towards a DBpedia of Tourism: the case of Tourpedia
Stefano Cresci, Andrea D’Errico, Davide Gazzè, Angelica Lo Duca, An-
drea Marchetti and Maurizio Tesconi 129
34 Using Semantics for Interactive Visual Analysis of Linked Open Data
Gerwald Tschinkel, Eduardo Veas, Belgin Mutlu and Vedran Sabol 133
35 Exploiting Linked Data Cubes with OpenCube Toolkit
Evangelos Kalampokis, Andriy Nikolov, Peter Haase, Richard Cyga-
niak, Arkadiusz Stasiewicz, Areti Karamanou, Maria Zotou, Dimitris
Zeginis, Efthimios Tambouris and Konstantinos Tarabanis 137
36 Detecting Hot Spots in Web Videos
José Luis Redondo-Garcı́a, Mariella Sabatino, Pasquale Lisena and
Raphaël Troncy 141
37 EUROSENTIMENT: Linked Data Sentiment Analysis
J. Fernando Sánchez-Rada, Gabriela Vulcu, Carlos A. Iglesias and Paul
Buitelaar 145
38 Property-based typing with LITEQ
Stefan Scheglmann, Martin Leinberger, Ralf Lämmel, Steffen Staab and
Matthias Thimm 149
39 From Tale to Speech: Ontology-based Emotion and Dialogue Annotation
of Fairy Tales with a TTS Output
Christian Eisenreich, Jana Ott, Tonio Süßdorf, Christian Willms and
Thierry Declerck 153
40 BIOTEX: A system for Biomedical Terminology Extraction, Ranking,
and Validation
Juan Antonio Lossio Ventura, Clement Jonquet, Mathieu Roche and
Maguelonne Teisseire 157
41 Visualizing and Animating Large-scale Spatiotemporal Data with EL-
BAR Explorer
Suvodeep Mazumdar and Tomi Kauppinen 161
42 A Demonstration of Linked Data Source Discovery and Integration
Jason Slepicka, Chengye Yin, Pedro Szekely and Craig Knoblock 165
43 Developing Mobile Linked Data Applications
Oshani Seneviratne, Evan Patton, Daniela Miao, Fuming Shih, Weihua
Li, Lalana Kagal and Carlos Castillo 169
44 A Visual Summary for Linked Open Data sources
Fabio Benedetti, Laura Po and Sonia Bergamaschi 173
45 EasyESA: A Low-effort Infrastructure for Explicit Semantic Analysis
Danilo Carvalho, Cagatay Calli, Andre Freitas and Edward Curry 177
46 LODHub - A Platform for Sharing and Analyzing large-scale Linked
Open Data
Stefan Hagedorn and Kai-Uwe Sattler 181
47 LOD4AR: Exploring Linked Open Data with a Mobile Augmented Re-
ality Web Application
Silviu Vert, Bogdan Dragulescu and Radu Vasiu 185
48 PLANET: Query Plan Visualizer for Shipping Policies against Single
SPARQL Endpoints
Maribel Acosta, Maria Esther Vidal, Fabian Flöck, Simon Castillo and
Andreas Harth 189
49 High Performance Linked Data Processing for Virtual Reality Environ-
ments
Felix Leif Keppmann, Tobias Käfer, Steffen Stadtmüller, René Schubotz
and Andreas Harth 193
50 Analyzing Relative Incompleteness of Movie Descriptions in the Web of
Data: A Case Study
Wancheng Yuan, Elena Demidova, Stefan Dietze and Xuan Zhou 197
51 A Semantic Metadata Generator for Web Pages Based on Keyphrase
Extraction
Dario De Nart, Carlo Tasso and Dante Degl’Innocenti 201
52 A Multilingual SPARQL-Based Retrieval Interface for Cultural Heritage
Objects
Dana Dannells, Ramona Enache and Mariana Damova 205
53 Extending Tagging Ontologies with Domain Specific Knowledge
Frederic Font, Sergio Oramas, György Fazekas and Xavier Serra 209
54 Disambiguating Web Tables using Partial Data
Ziqi Zhang 213
55 On Linking Heterogeneous Dataset Collections
Mayank Kejriwal and Daniel Miranker 217
56 Scientific data as RDF with Arrays: Tight integration of SciSPARQL
queries into MATLAB
Andrej Andrejev, Xueming He and Tore Risch 221
57 Measuring similarity in ontologies: a new family of measures
Tahani Alsubait, Bijan Parsia and Uli Sattler 225
58 Towards Combining Machine Learning with Attribute Exploration for
Ontology Refinement
Jedrzej Potoniec, Sebastian Rudolph and Agnieszka Lawrynowicz 229
59 ASSG: Adaptive structural summary for RDF graph data
Haiwei Zhang, Yuanyuan Duan, Xiaojie Yuan and Ying Zhang 233
60 Evaluation of String Normalisation Modules for String-based Biomedical
Vocabularies Alignment with AnAGram
Anique van Berne and Veronique Malaise 237
61 Keyword-Based Semantic Search Engine Koios++
Björn Forcher, Andreas Giloj and Erich Weichselgartner 241
62 Supporting SPARQL Update Queries in RDF-XML Integration
Nikos Bikakis, Chrisa Tsinaraki, Ioannis Stavrakantonakis and Stavros
Christodoulakis 245
63 CURIOS: Web-based Presentation and Management of Linked Datasets
Hai Nguyen, Stuart Taylor, Gemma Webster, Nophadol Jekjantuk, Chris
Mellish, Jeff Z. Pan and Tristan Ap Rheinallt 249
64 The uComp Protege Plugin for Crowdsourcing Ontology Validation
Florian Hanika, Gerhard Wohlgenannt and Marta Sabou 253
65 Frame-Semantic Web: a Case Study for Korean
Jungyeul Park, Sejin Nam, Youngsik Kim, Younggyun Hahm, Dosam
Hwang and Key-Sun Choi 257
66 SparkRDF: Elastic Discreted RDF Graph Processing Engine With Dis-
tributed Memory
Xi Chen, Huajun Chen, Ningyu Zhang and songyang Zhang 261
67 LEAPS: A Semantic Web and Linked data framework for the Algal
Biomass Domain
Monika Solanki 265
68 Bridging the Semantic Gap between RDF and SPARQL using Com-
pleteness Statements
Fariz Darari, Simon Razniewski and Werner Nutt 269
69 COLINA: A Method for Ranking SPARQL Query Results through Con-
tent and Link Analysis
Azam Feyznia, Mohsen Kahani and Fattane Zarrinkalam 273
70 Licentia: a Tool for Supporting Users in Data Licensing on the Web of
Data
Cristian Cardellino, Serena Villata, Fabien Gandon, Guido Governa-
tori, Ho-Pun Lam and Antonino Rotolo 277
71 Automatic Stopword Generation using Contextual Semantics for Senti-
ment Analysis of Twitter
Hassan Saif, Miriam Fernandez and Harith Alani 281
72 The Manchester OWL Repository: System Description
Nicolas Matentzoglu, Daniel Tang, Bijan Parsia and Uli Sattler 285
73 A Fully Parallel Framework for Analyzing RDF Data
Long Cheng, Spyros Kotoulas, Tomas Ward and Georgios Theodoropoulos 289
74 Objects as results from graph queries using an ORM and generated
semantic-relational binding
Marc-Antoine Parent 293
75 Hedera: Scalable Indexing and Exploring Entities in Wikipedia Revision
History
Tuan Tran and Tu Ngoc Nguyen 297
76 Evaluating Ontology Alignment Systems in Query Answering Tasks
Alessandro Solimando, Ernesto Jimenez-Ruiz and Christoph Pinkel 301
77 Using Fuzzy Logic For Multi-Domain Sentiment Analysis
Mauro Dragoni, Andrea Tettamanzi and Célia Da Costa Pereira 305
78 AMSL — Creating a Linked Data Infrastructure for Managing Elec-
tronic Resources in Libraries
Natanael Arndt, Sebastian Nuck, Andreas Nareike, Norman Radtke,
Leander Seige and Thomas Riechert 309
79 Extending an ontology alignment system with BioPortal: a preliminary
analysis
Xi Chen, Weiguo Xia, Ernesto Jimenez-Ruiz and Valerie Cross 313
80 How much navigable is the Web of Linked Data?
Valeria Fionda and Enrico Malizia 317
81 A Framework for Incremental Maintenance of RDF Views of Relational
Data
Vânia Vidal, Marco Antonio Casanova, Jose Monteiro, Narciso Ar-
ruda, Diego Sá and Valeria Pequeno 321
82 Document Relation System Based on Ontologies for the Security Do-
main
Janine Hellriegel, Hans Ziegler and Ulrich Meissen 325
83 Representing Swedish Lexical Resources in RDF with lemon
Lars Borin, Dana Dannells, Markus Forsberg and John P. Mccrae 329
84 QASM: a Q&A Social Media System Based on Social Semantic
Zide Meng, Fabien Gandon and Catherine Faron-Zucker 333
85 A Semantic-Based Platform for Efficient Online Communication
Zaenal Akbar, José Marı́a Garcı́a, Ioan Toma and Dieter Fensel 337
86 SHAX: The Semantic Historical Archive eXplorer
Michael Feldman, Shen Gao, Marc Novel, Katerina Papaioannou and
Abraham Bernstein 341
87 SemanTex: Semantic Text Exploration Using Document Links Implied
by Conceptual Networks Extracted from the Texts
Suad Aldarra, Emir Muñoz, Pierre-Yves Vandenbussche and Vit Novacek 345
88 Towards a Top-K SPARQL Query Benchmark
Shima Zahmatkesh, Emanuele Della Valle, Daniele Dell’aglio and Alessan-
dro Bozzon 349
89 Exploring type-specific topic profiles of datasets: a demo for educational
linked data
Davide Taibi, Stefan Dietze, Besnik Fetahu and Giovanni Fulantelli 353
90 TEX-OWL: a Latex-Style Syntax for authoring OWL 2 ontologies
Matteo Matassoni, Marco Rospocher, Mauro Dragoni and Paolo Bouquet 357
91 Supporting Integrated Tourism Services with Semantic Technologies and
Machine Learning
Francesca Alessandra Lisi and Floriana Esposito 361
92 Towards a Semantically Enriched Online Newspaper
Ricardo Kawase, Eelco Herder and Patrick Siehndel 365
93 Identifying Topic-Related Hyperlinks on Twitter
Patrick Siehndel, Ricardo Kawase, Eelco Herder and Thomas Risse 369
94 Capturing and Linking Human Sensor Observations with YouSense
Tomi Kauppinen, Evgenia Litvinova and Jan Kallenbach 373
95 An update strategy for the WaterFowl RDF data store
Olivier Curé and Guillaume Blin 377
96 Linking Historical Data on the Web
Valeria Fionda and Giovanni Grasso 381
97 User driven Information Extraction with LODIE
Anna Lisa Gentile and Suvodeep Mazdumar 385
98 QALM: a Benchmark for Question Answering over Linked Merchant
Websites Data
Amine Hallili, Elena Cabrio and Catherine Faron Zucker 389
99 GeoTriples: a Tool for Publishing Geospatial Data as RDF Graphs Us-
ing R2RML Mappings
Kostis Kyzirakos, Ioannis Vlachopoulos, Dimitrianos Savva, Stefan Mane-
gold and Manolis Koubarakis 393
100 New Directions in Linked Data Fusion
Jan Michelfeit and Jindřich Mynarz 397
101 Bio2RDF Release 3: A larger, more connected network of Linked Data
for the Life Sciences
Michel Dumontier, Alison Callahan, Jose Cruz-Toledo, Peter Ansell,
Vincent Emonet, François Belleau and Arnaud Droit 401
102 Infoboxer: Using Statistical and Semantic Knowledge to Help Create
Wikipedia Infoboxes
Roberto Yus, Varish Mulwad, Tim Finin and Eduardo Mena 405
103 The Topics they are a-Changing — Characterising Topics with Time-
Stamped Semantic Graphs
Amparo E. Cano, Yulan He and Harith Alani 409
104 Linked Data and facets to explore text corpora in the Humanities: a
case study
Christian Morbidoni 413
105 Dexter 2.0 - an Open Source Tool for Semantically Enriching Data
Diego Ceccarelli, Claudio Lucchese, Salvatore Orlando, Raffaele Perego
and Salvatore Trani 417
106 A Hybrid Approach to Learn Description Logic Ontology from Texts
Yue Ma and Alifah Syamsiyah 421
107 Identifying First Responder Communities Using Social Network Analy-
sis
John Erickson, Katherine Chastain, Zachary Fry, Jim Mccusker, Rui
Yan, Evan Patton and Deborah McGuinness 425
108 Exploiting Semantic Annotations for Entity-based Information Retrieval
Lei Zhang, Michael Färber, Thanh Tran and Achim Rettinger 429
109 Crawl Me Maybe: Iterative Linked Dataset Preservation
Besnik Fetahu, Ujwal Gadiraju and Stefan Dietze 433
110 A Semantics-Oriented Storage Model for Big Heterogeneous RDF Data
Hyeongsik Kim, Padmashree Ravindra and Kemafor Anyanwu 437
111 Approximating Inference-enabled Federated SPARQL Queries on Mul-
tiple Endpoints
Yuji Yamagata and Naoki Fukuta 441
112 VKGBuilder – A Tool of Building and Exploring Vertical Knowledge
Graphs
Tong Ruan, Haofen Wang and Fanghuai Hu 445
113 Using the semantic web for author disambiguation - are we there yet?
Cornelia Hedeler, Bijan Parsia and Brigitte Mathiak 449
114 SHEPHERD: A Shipping-Based Query Processor to Enhance SPARQL
Endpoint Performance
Maribel Acosta, Maria Esther Vidal, Fabian Flöck, Simon Castillo, Car-
los Buil Aranda and Andreas Harth 453
115 AgreementMakerLight 2.0: Towards Efficient Large-Scale Ontology Match-
ing
Daniel Faria, Catia Pesquita, Emanuel Santos, Isabel F. Cruz and Fran-
cisco Couto 457
116 Extracting Architectural Patterns from Web data
Ujwal Gadiraju, Ricardo Kawase and Stefan Dietze 461
117 Xodx — A node for the Distributed Semantic Social Network
Natanael Arndt and Sebastian Tramp 465
118 An Ontology Explorer for Biomimetics Database
Kouji Kozaki and Riichiro Mizoguchi 469
119 Semi-Automated Semantic Annotation of the Biomedical Literature
Fabio Rinaldi 473
120 Live SPARQL Auto-Completion
Stephane Campinas 477
Life Stories as Event-based Linked Data:
Case Semantic National Biography
Eero Hyvönen, Miika Alonen, Esko Ikkala, and Eetu Mäkelä
Semantic Computing Research Group (SeCo), Aalto University
http://www.seco.tkk.fi/, firstname.lastname@aalto.fi
Abstract. This paper argues, by presenting a case study and a demonstration on
the web, that biographies make a promising application case of Linked Data: the
reading experience can be enhanced by enriching the biographies with additional
lifetime events, by providing the user with a spatio-temporal context for reading,
and by linking the text to additional contents in related datasets.
1 Introduction
This paper addresses the research question: How can the reading experience of biogra-
phies be enhanced using web technologies? Our research hypothesis is that the
Linked Data (LD) approach can achieve this, by providing the reader with a richer
reading context than the biography document alone. The focus of research is on: 1)
Data linking. Biographies can be linked with additional contextual data, such as links
to the literary works of the person. 2) Data enriching. Data from different sources can
be used for enriching the life story with additional events and data, e.g., with metadata
about a historical event that the person participated in. 3) Visualization. LD can be vi-
sualized in useful ways. The life story can, e.g., be shown on maps and timelines. We
tested the hypothesis in a case study1 where the Finnish National Biography2 (NB), a
collection of 6,381 short biographies, is published as LD in a SPARQL endpoint with a
demonstrational application based on its standard API.
2 Representing Biographies as Linked Data
To enrich and link biographical data with related datasets the data must be made se-
mantically interoperable, either by data alignments (using, e.g., Dublin Core and the
dumb-down principle) or by data transformations into a harmonized form [3]. In our
case study we selected the data harmonization approach and the event-centric CIDOC
CRM3 ISO standard as the ontological basis, since biographies are based on life events.
NB biographies are modeled as collections of CIDOC CRM events, where each event is
characterized by the 1) actors involved, 2) place, 3) time, and 4) the event type.
1 Our work was funded by Tekes, Finnish Cultural Foundation, and the Linked Data Finland consortium of 20 organizations.
2 http://www.kansallisbiografia.fi/english/?p=2
3 http://www.cidoc-crm.org/
A simple custom event extractor was created for transforming biographies into this
model represented in RDF. The extractor first lemmatizes a biography and then analyzes
its major parts: a textual story followed by systematically titled sections listing major
achievements of the person, such as “works”, “awards”, and “memberships” as snip-
pets. A snippet represents an event and typically contains mentions of years and places.
For example, the biography of architect Alvar Aalto tells “WORKS: ...; Church of Muu-
rame 1926-1929;...” indicating an artistic creation event. The named entity recognition
tool of the Machinese4 NLP library is used for finding place names in the snippets,
and Geonames is used for geocoding. Timespans of snippet events are found easily as
numeric years or their intervals, and an actor of the events is the subject person of the
biography. The result of processing a biography is a list of spatio-temporal CIDOC
CRM events with short titles (snippet texts) related to the corresponding person. At the
moment, the extractor uses only the snippets for event creation—more generic event
extraction from the free biography narrative remains a topic of further research.
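To make this snippet-to-event step concrete, the following is a minimal sketch in Python with rdflib. The CIDOC CRM classes and properties used (E7 Activity, E52 Time-Span, P4 has time-span, P7 took place at, P11 had participant) come from the standard, but the namespace URIs, the event-naming scheme and the geocode() stub that stands in for the Machinese/Geonames services are illustrative assumptions rather than the extractor actually built for NB.

```python
import re
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

# Namespace URIs and the event-naming scheme are illustrative assumptions.
CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
EX = Namespace("http://example.org/nb/")

def geocode(text):
    """Placeholder for the Machinese NER + Geonames lookup used in the paper."""
    return None  # would return a place URIRef when a place name is recognised

def snippet_to_event(graph, person_uri, snippet, index):
    """Turn one titled snippet (e.g. 'Church of Muurame 1926-1929') into a
    CIDOC CRM activity with its actor, time-span and, when found, a place."""
    event = EX["event/%s" % index]
    graph.add((event, RDF.type, CRM["E7_Activity"]))
    graph.add((event, RDFS.label, Literal(snippet)))
    graph.add((event, CRM["P11_had_participant"], person_uri))

    # Time-spans of snippet events appear as a single year or a year interval.
    years = re.search(r"(\d{4})(\s*[-–]\s*\d{4})?", snippet)
    if years:
        span = EX["event/%s/timespan" % index]
        graph.add((event, CRM["P4_has_time-span"], span))
        graph.add((span, RDF.type, CRM["E52_Time-Span"]))
        graph.add((span, RDFS.label, Literal(years.group(0))))

    place = geocode(snippet)
    if place is not None:
        graph.add((event, CRM["P7_took_place_at"], place))
    return event

g = Graph()
aalto = EX["person/alvar-aalto"]
snippet_to_event(g, aalto, "Church of Muurame 1926-1929", 1)
print(g.serialize(format="turtle"))
```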
For a domain ontology, we reused the Finnish History Ontology HISTO by trans-
forming it into CIDOC CRM. The new HISTO version contains 1,173 major historical
events (E5 Event in CIDOC CRM) covering over 1000 years of Finnish history, and in-
cludes 80,085 activities (E7 Activity) of different kinds, such as armistice, election etc.
Linked to these are 7,302 persons (E21 Person) and a few hundred organizations and
groups, 3,290 places (E53 Place), and 11,141 time spans (E52 Time-span). The data
originates from the Agricola timeline5 created by Finnish historians.
Fig. 1. Spatio-temporal visualization of Alvar Aalto’s life with external links.
4 http://www.connexor.com/nlplib/
5 http://agricola.utu.fi/
The extracted events were then enriched with events from external datasets as fol-
lows: 1) Persons in SNB and HISTO were mapped onto each other based on their
names. This worked well without further semantic disambiguation since few different
persons had similar names. NB and HISTO shared 921 persons p, and the biography
of each p could therefore be enriched with all HISTO events that p was involved in.
2) There were 361 artistic creation events (e.g., publishing a book) of NB persons that
could be extracted from Europeana Linked Open Data6 using the person as the creator.
Related biographies could therefore be enriched with events pointing to Europeana con-
tents. 3) The NB persons were involved in 263 instances of publications of the Project
Gutenberg data7 . Corresponding events could therefore be added into the biographies,
and links to the original digitized publications be provided. 4) The NB persons were also
linked to Wikipedia for additional information; again simple string matching produced
good results. These examples demonstrate how linked events can be extracted from other
datasets and be used for enriching other biographical events. In the experiment, 116,278
spatio-temporal events were finally extracted for the NB biography records.
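The name-based mapping between NB and HISTO persons described above could be sketched roughly as follows; the dictionary-shaped inputs stand in for the real RDF data, and ambiguous names are simply skipped, which matches the observation that few distinct persons share a name.

```python
def link_persons_by_name(nb_persons, histo_persons):
    """Map NB persons onto HISTO persons by exact (normalised) name match.
    Both arguments: dict of {uri: name}."""
    histo_by_name = {}
    for uri, name in histo_persons.items():
        histo_by_name.setdefault(name.strip().lower(), []).append(uri)

    mapping = {}
    for nb_uri, name in nb_persons.items():
        candidates = histo_by_name.get(name.strip().lower(), [])
        if len(candidates) == 1:          # skip ambiguous names
            mapping[nb_uri] = candidates[0]
    return mapping

def enrich_with_histo_events(mapping, histo_events):
    """Attach every HISTO event a mapped person participates in to the
    corresponding NB biography.
    histo_events: list of (event_uri, participant_uri) pairs."""
    enriched = {nb_uri: [] for nb_uri in mapping}
    participants = {histo_uri: nb_uri for nb_uri, histo_uri in mapping.items()}
    for event_uri, participant in histo_events:
        if participant in participants:
            enriched[participants[participant]].append(event_uri)
    return enriched
```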
3 Biographies Enriched in a Spatio-temporal Context
Based on the enriched and linked biography data, a demonstrator was created providing
the end user with a spatio-temporal context for reading NB biographical data as
well as links to additional content from related sources. Fig. 1 depicts the user inter-
face online8 with architect Alvar Aalto’s biography selected; the other 6,400 celebrities
can be selected from the alphabetical list above. On the left column, temporal events
extracted from the biography and related datasets are presented (in Finnish), such as
“1898 Birth”, and “1908-1916 Jyväskylä Classical Lyceum”. The event “1930–1939:
Alvar Aalto created his famous functionalist works (Histo)” shows an external link to
HISTO for additional information. The events are also seen as bubbles on a timeline
at the bottom. The map in the middle shows the end-user the places related to the bi-
ography events. By hovering the mouse over an event or its bubble the related event is
highlighted and the map zoomed and centered around the place related to the event. In
this way the user can quickly get an overview about the spatio-temporal context of Al-
var Aalto’s life, and get links to additional sources of information. The actual biography
text can be read by clicking a link lower in the interface (not visible in the figure). The
user interface also performs dynamic SPARQL querying for additional external links.
In our demonstration, the BookSampo dataset and SPARQL endpoint [6] is used for
enriching literature-related biographies with additional publication and literature award
events.
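Dynamic lookups of this kind can be issued with SPARQLWrapper as sketched below; the endpoint URL, prefix and query shape are illustrative assumptions rather than the exact queries sent by the demonstrator.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Illustrative endpoint; the demonstrator queries the LDF/BookSampo services.
ENDPOINT = "http://www.ldf.fi/service/sparql"  # assumption, not verified

def external_links_for_person(person_name):
    """Fetch resources whose label equals the person's name, to be offered
    as additional external links next to the biography."""
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT DISTINCT ?resource WHERE {
          ?resource rdfs:label ?label .
          FILTER (STR(?label) = "%s")
        } LIMIT 20
    """ % person_name.replace('"', ''))
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["resource"]["value"] for b in results["results"]["bindings"]]

print(external_links_for_person("Alvar Aalto"))
```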
The user interface for spatio-temporal lifeline visualization was implemented using
AngularJS9 and D310 on top of the Linked Data Finland (LDF) data service11 .
6 http://pro.europeana.eu/linked-open-data
7 http://datahub.io/dataset/gutenberg
8 http://www.ldf.fi/dataset/history/map.html
9 http://angularjs.org
10 http://d3js.org
11 Cf. http://www.ldf.fi/dataset/history/ for dataset documentation and SPARQL endpoint
4 Discussion, Related Work, and Future Research
Our case study suggests that biography publication is a promising application case for
LD. The event-based modeling approach was deemed useful and handy, after learning the
basics of the fairly complex CIDOC CRM model. The snippet events could be extracted
and aligned with related places, times, and actors fairly accurately using simple string-
based techniques. However, the results of event extraction and entity linking have not
been evaluated formally, and it is obvious that problems grow with larger datasets and
when analysing free text—these issues are a topic of future research.
Biographical data has been studied by genealogists (e.g., (Event) GEDCOM12 ), CH
organizations (e.g., the Getty ULAN13 ), and semantic web researchers (e.g., BIO on-
tology14 ). Semantic web event models include, e.g., Event Ontology [8], LODE ontol-
ogy15 , SEM [1], and Event-Model-F16 [9]. A history ontology with map visualizations
is presented in [7], and an ontology of historical events in [4]. Visualization using his-
torical timelines is discussed, e.g., in [5], and event extraction reviewed in [2].
References
1. van Hage, W.R., Malaisé, V., Segers, R., Hollink, L., Schreiber, G.: Design and use of the
simple event model (SEM). Web Semantics: Science, Services and Agents on the World Wide
Web 9(2), 128–136 (2011)
2. Hogenboom, F., Frasincar, F., Kaymak, U., de Jong, F.: An overview of event extraction from
text. In: DeRiVE 2011, Detection, Representation, and Exploitation of Events in the Semantic
Web (2011), http://ceur-ws.org/Vol-779/
3. Hyvönen, E.: Publishing and using cultural heritage linked data on the semantic web. Morgan
& Claypool, Palo Alto, CA (2012)
4. Hyvönen, E., Alm, O., Kuittinen, H.: Using an ontology of historical events in seman-
tic portals for cultural heritage. In: Proceedings of the Cultural Heritage on the Semantic
Web Workshop at the 6th International Semantic Web Conference (ISWC 2007) (2007),
http://www.cs.vu.nl/~laroyo/CH-SW.html
5. Jensen, M.: Vizualising complex semantic timelines. NewsBlip Research Papers, Report
NBTR2003-001 (2003), http://www.newsblip.com/tr/
6. Mäkelä, E., Ruotsalo, T., Hyvönen, E.: How to deal with massively heterogeneous cultural her-
itage data—lessons learned in CultureSampo. Semantic Web – Interoperability, Usability, Ap-
plicability 3(1) (2012)
7. Nagypal, G., Deswarte, R., Oosthoek, J.: Applying the semantic web: The VICODI experi-
ence in creating visual contextualization for history. Lit Linguist Computing 20(3), 327–349
(2005), http://dx.doi.org/10.1093/llc/fqi037
8. Raimond, Y., Abdallah, S.: The event ontology (2007),
http://motools.sourceforge.net/event/event.html
9. Scherp, A., Saathoff, C., Franz, T.: Event-Model-F (2010),
http://www.uni-koblenz-landau.de/koblenz/fb4/AGStaab/Research/ontologies/events
12 http://en.wikipedia.org/wiki/GEDCOM
13 http://www.getty.edu/research/tools/vocabularies/ulan/
14 http://vocab.org/bio/0.1/.html
15 http://linkedevents.org/ontology/
16 http://www.uni-koblenz-landau.de/koblenz/fb4/AGStaab/Research/ontologies/events
News Visualization Based on Semantic Knowledge
Sebastian Arnold, Damian Burke, Tobias Dörsch,
Bernd Loeber, and Andreas Lommatzsch
Technische Universität Berlin
Ernst-Reuter-Platz 7, D-10587 Berlin, Germany
{sarnold,damian.burke,tobias.m.doersch,bernd.loeber,
andreas.lommatzsch}@mailbox.tu-berlin.de
Abstract. Due to the overwhelming amount of news articles from a growing
number of sources, it has become nearly impossible for humans to select and read
all articles that are relevant to get deep insights and form conclusions. This leads
to a need for an easy way to aggregate and analyze news articles efficiently and
visualize the garnered knowledge as a base for further cognitive processing.
The presented application provides a tool to satisfy said need. In our approach we
use semantic techniques to extract named entities, relations and locations from
news sources in different languages. This knowledge is used as the base for data
aggregation and visualization operators. The data operators include filtering of
entities, types and date range, detection of correlated news topics for a set of
selected entities and geospatial analysis based on locations. Our visualization
provides a time-based graphical representation of news occurrences according
to the given filters as well as an interactive map which displays news within a
perimeter for the different locations mentioned in the news articles. In every step
of the user process, we offer a tag cloud that highlights popular results and provide
links to the original sources including highlighted snippets. Using the graphical
interface, the user is able to analyze and explore vast amounts of fresh news
articles, find possible relations and perform trend analysis in an intuitive way.
1 Introduction
Comprehensive news analysis is a common task for a broad range of recipients. To
keep an overview of the overwhelming amount of articles that are published on the Web every hour,
new technologies are needed that help to classify, search and explore topics in real
time. Current approaches focus on automated classification of documents into expert-
defined categories, such as politics, business or sports. The results need to be tagged
manually with meta-information about locations, people and current news topics. The
simple model of categories and tags, however, is not detailed enough to capture temporal
or regional relationships and it cannot bridge the semantic gap that the small subset of
tagged information opens. The challenge for machine-driven news analysis consists of
two parts. First, an extractor needs to be able to identify the key concepts and entities
mentioned in the documents and to find the most important relationships between them.
Second, an intuitive way for browsing the results with support for explorative discovery
of relevant topics and emerging trends needs to be developed.
We present a semantic approach that abstracts from multi-lingual representation of
facts and enriches extracted information with background knowledge. Our implemen-
tation utilizes natural language processing tools for the extraction of named entities,
relations and semantic context. Open APIs are used to add further knowledge
(e.g. geo-coordinates) to the results. Our application visualizes the gained knowledge
and provides time-based, location-based and relationship-based exploration operators.
The relationship between original news documents and aggregated search results is
maintained throughout the whole user process.
In Section 2, we give an overview of existing projects with a similar focus. Our knowledge-
based approach and its implementation are introduced in Section 3. The user interaction
and visualization operators are discussed in Section 4. We conclude in Section 5.
2 Related Work
We start with an overview of existing projects in the field of semantic news visualization.
The following projects are related to our approach on a conceptual or visual level.
MAPLANDIA1 visualizes news for a specific date beginning in 2005 on a map. The
system uses the BBC news feed as its only source to deliver markers and outlines for the
countries that were mentioned in the news on a specified date. Additionally, it offers a
list of the news for the day. However, by using only one marker the application is unable
to visualize news on a more detailed and fine-grained level. MAPLANDIA also does not
offer any possibility to limit the displayed visualizations to a certain region of interest.
The application offers news from only one source and in only one language for a specific day.
The idea behind the SPIGA-System [3] is to provide businesses with a multilingual
press review of news from national and international sources. Using the Apache UIMA
framework, the system crawls a few thousand sources regardless of the language used.
After a fitting business profile has been created, the system clusters information and
visualizes current trends.
3 Implementation of Knowledge Extraction
In this section, we explain our semantic approach to news aggregation. In contrast
to classical word- or tag-based indexing, we focus on semantic features that we ex-
tract from daily news documents. To handle the linguistic complexity of this problem,
we utilize information extraction techniques for natural language [2]. The knowledge
extraction pipeline is shown in Fig. 1. It consists of a periodic RSS feed crawler as
source for news documents2 and the language components for sentence splitting, part-
of-speech (POS) tagging, named entity recognition (NER) and coreference resolution.
We utilize a Stanford CoreNLP named entity recognition pipeline [1] for the languages
English and German. The pipeline periodically builds histograms over the frequency
of named entity occurrences in all documents. Using a 3-class entity typification (Per-
son, Organization, Location) we apply special treatment to each of the entity types.
1 http://maplandia.com/news
2 In our demonstrator, it is configured to use feeds from http://www.theguardian.com
Fig. 1: The figure shows the architecture and a screenshot of our system. The system architecture
is divided into the user operators FILTER, CORRELATE, PLOT, TRACE and LOCATE (shown left) and
the knowledge extraction pipeline (shown right). The screenshot of the Web application visualizes
news entities related to Vladimir Putin utilizing the operators CORRELATE, FILTER and PLOT.
Person and Organization names are normalized to obtain representative identifiers
from different spellings. Location names are resolved to geo-location coordinates us-
ing Google Maps API.3 The results are processed and aggregated into relations of the
form MentionedIn(ENTITY(type), DOCUMENT(date), frequency). To get a comprehensive
view on the relevant information, we create histograms over these relations and store
the distributions in a relational database. The data is processed using relational statements
to fit the type of user query that is requested from the frontend. To increase processing
performance at Web scale, we partly precompute these statements.
The results are then visualized to allow easy recognition of trends and relationships.
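A compact sketch of this aggregation step, assuming the extraction pipeline emits (entity, type, document date) mentions; the real system keeps these histograms in a relational database and partly precomputes the statements.

```python
from collections import Counter
from typing import Iterable, Tuple

def aggregate_mentions(mentions: Iterable[Tuple[str, str, str]]):
    """Aggregate NER output into MentionedIn(ENTITY(type), DOCUMENT(date), frequency)
    relations, i.e. a histogram of mention counts per (entity, type, date)."""
    histogram = Counter()
    for entity, entity_type, doc_date in mentions:
        histogram[(entity, entity_type, doc_date)] += 1
    return histogram

def top_entities(histogram, date_from, date_to, k=50):
    """Most mentioned entities in a date range -- the data behind the tag cloud."""
    totals = Counter()
    for (entity, entity_type, date), freq in histogram.items():
        if date_from <= date <= date_to:
            totals[(entity, entity_type)] += freq
    return totals.most_common(k)
```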
4 Demonstrator and Benefits to the Audience
In this section, we present our user interface which has a specific focus on a simple
workflow.4 Our aim is to spare the user the work of sorting and
filtering tables and instead allow exploratory search [4] inside the data set. We present
the results in a clean web interface that establishes an interaction flow from top to bottom.
Our system offers several operators to explore the temporal, geographical and relational
distribution in the news corpus.
A user session starts with the FILTER operator that is visualized as a tag cloud
showing the most mentioned entities in a given time range in a size proportional to their
frequency (e.g. location Russia). The selection can be influenced by further filter settings,
such as type and date restrictions. Clicking on an entity within the tag cloud will trigger
the CORRELATE operator, which offers entities related to the selected one in the tag
cloud (e.g. location Ukraine, person Vladimir Putin). This is done by intersecting the
most relevant documents for a given entity and time range and picking the top mentioned
named entities in these articles. Selecting different items will further narrow down the
3 https://developers.google.com/maps/
4 An online demo is available at http://irml-lehre.aot.tu-berlin.de
results. Both the selected and the correlated entities are then displayed in a time-based
PLOT with the time range on the x-axis and the frequency of occurrences on the y-axis.
To instantly reveal the relationships and importance of co-occurrent entities, one can
modify the display style (e.g. stacked, expanded or streamed). To get more detailed
information about specific data points, the user can hover the cursor above them to
trigger a TRACE operation. Then, more details about the selected tuple (ENTITY, date)
are revealed: counts and snippets of the news articles that mention the selected entity
and links to trace back the original documents.
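The CORRELATE step can be sketched as follows, under the assumption that the entity mentions per document are already available; ranking the "most relevant" documents is simplified here to the mention frequency of the selected entity.

```python
from collections import Counter

def correlate(selected_entity, doc_entities, doc_dates,
              date_from, date_to, top_docs=100, k=10):
    """Suggest entities correlated with the selected one: take the documents in the
    date range that mention it most often, then count co-mentioned entities.
    doc_entities: {doc_id: Counter of entity mentions}; doc_dates: {doc_id: date}."""
    relevant = [
        (counts[selected_entity], doc_id)
        for doc_id, counts in doc_entities.items()
        if date_from <= doc_dates[doc_id] <= date_to and counts[selected_entity] > 0
    ]
    relevant.sort(reverse=True)

    co_mentions = Counter()
    for _, doc_id in relevant[:top_docs]:
        for entity, freq in doc_entities[doc_id].items():
            if entity != selected_entity:
                co_mentions[entity] += freq
    return co_mentions.most_common(k)
```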
The LOCATE operator focuses on geographically locating selected entities and their
relations. The operator works by computing a set of bounding coordinates which are
used to query the database for possible locations. Using the same FILTER interface, a
location of interest can be selected from the tag cloud or by entering its name in the text
field. By utilizing a slider to set a search perimeter, the user is able to further focus on the
regions around the selected location. After selection, a world map will display locations
mentioned in the matching articles. By clicking the markers on the map, a balloon listing
shows up to ten headlines and links to the respective articles. This allows the user to gain
an overview of connections and associations of different countries and locations.
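The bounding coordinates that LOCATE uses to pre-filter candidate locations can be computed with the usual spherical approximation, as sketched below; the database lookup itself is left out of this sketch.

```python
import math

EARTH_RADIUS_KM = 6371.0

def bounding_box(lat, lon, radius_km):
    """Return (lat_min, lat_max, lon_min, lon_max) enclosing a circle of
    radius_km around (lat, lon); used to pre-filter candidate locations
    before querying the database for articles mentioning them."""
    dlat = math.degrees(radius_km / EARTH_RADIUS_KM)
    dlon = math.degrees(radius_km / (EARTH_RADIUS_KM * math.cos(math.radians(lat))))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon

# Example: a 50 km perimeter around Berlin (52.52 N, 13.405 E).
print(bounding_box(52.52, 13.405, 50))
```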
5 Conclusion and Future Work
The application allows the user to quickly visualize and analyze vast amounts of news
articles. By the use of interactive elements such as graphs and a world map the user is
able to check hypotheses and draw conclusions in an explorative and playful manner.
This greatly reduces the cognitive load for the user as he or she is able to find the relevant
facts fast and browse the underlying news articles to get further information from the
original source. In our conducted user studies we observed that a streamlined interface
with the options at the top and the results below was most appealing to users, and fewer
options led to a more intuitive experience. The presented application is realized as a
prototype and will be expanded by further development. A broader range of information
can be achieved by including more news sources, implementing extended language
support (e.g. multi-lingual knowledge extraction) and expanding the features of the
FILTER operator (e.g. including sentiment selection). A deeper enrichment of knowledge
can be achieved by linking the detected entities to additional knowledge sources (e.g.
DBpedia or freebase) and using context information to extract more language features
(e.g. sentiments, quotations, relations).
References
1. D. Cer, M.-C. de Marneffe, D. Jurafsky, and C. D. Manning. Parsing to stanford dependencies:
Trade-offs between speed and accuracy. In 7th Intl. Conf. LREC 2010, 2010.
2. R. Grishman. Information extraction: Techniques and challenges. In Information Extraction A
Multidisciplinary Approach to an Emerging Information Technology. Springer, 1997.
3. L. Hennig, D. Ploch, D. Prawdzik, B. Armbruster, H. Düwiger, E. W. De Luca, and S. Albayrak.
Spiga - a multilingual news aggregator. In Proceedings of GSCL 2011, 2011.
4. G. Marchionini. Exploratory search: from finding to understanding. Communications of the
ACM, 49(4):41–46, 2006.
Sherlock: a Semi-Automatic Quiz Generation
System using Linked Data
Dong Liu1 and Chenghua Lin2
1 BBC Future Media & Technology - Knowledge & Learning, Salford M50 2QH, UK,
Dong.Liu@bbc.co.uk
2 Department of Computing Science, University of Aberdeen, AB24 3UE, UK
chenghua.lin@abdn.ac.uk
Abstract. This paper presents Sherlock, a semi-automatic quiz gener-
ation system for educational purposes. By exploiting semantic and ma-
chine learning technologies, Sherlock not only offers a generic framework
for domain independent quiz generation, but also provides a mechanism
for automatically controlling the difficulty level of the generated quizzes.
We evaluate the effectiveness of the system based on three real-world
datasets.
Keywords: Quiz Generation, Linked Data, RDF, Educational Games
1 Introduction
Interactive games are effective ways of helping knowledge transfer between
humans and machines. For instance, efforts have been made to unleash
the potential of using Linked Data to generate educational quizzes. However,
the existing approaches [1, 2] share some common limitations: they are either
based on domain-specific templates, or the creation of quiz templates relies
heavily on ontologists and Linked Data experts. There is no
mechanism provided to end-users to engage with customised quiz authoring.
Moreover, a system that can generate quizzes with different difficulty lev-
els will better serve users’ needs. However, such an important feature is rarely
offered by the existing systems, where most of the practices simply select the dis-
tractors (i.e., the wrong candidate answers) at random from an answer pool (e.g.,
obtained by querying the Linked Data repositories). Some work has attempted
to determine the difficulty of a quiz, but it is simply based on assessing the
popularity of an RDF resource, without considering the fact that the difficulty
level of a quiz is directly affected by the semantic relatedness between the correct
answer and the distractors [3].
In this paper, we present a novel semi-automatic quiz generation system
(Sherlock) empowered by semantic and machine learning technologies. Sherlock
is distinguished from existing systems in a few aspects: (1) it offers a generic
framework for generating quizzes of multiple domains with minimum human
effort; (2) a mechanism is introduced for controlling the difficulty level of the
generated quizzes; and (3) an intuitive interface is provided for engaging users
Fig. 1. Overall architecture of Sherlock. [Figure: the framework is divided into offline components (data collection and integration, similarity computation, adaptive similarity clustering, template-based question and answer generator) and online components (quiz renderer, quiz creator), connected through shared question/answer and distractor databases.]
in creating customised quizzes. The live Sherlock system can be accessed from
http://sherlock.pilots.bbcconnectedstudio.co.uk/1 .
2 System Architecture
Fig. 1 depicts an overview of the Sherlock framework, in which the components
are logically divided into two groups: online and offline. These components can
interact with each other via shared databases which contain the information of
the questions, correct answers and distractors (i.e., incorrect answers).
Data Collection and Integration: We collected RDF data published by
DBpedia and the BBC. These data play two main roles, i.e., serving as the
knowledge base for quiz generation and being used to calculate the similarity scores
between objects/entities (i.e., answers and distractors).
Similarity Computation: The similarity computation module is the core
for controlling the difficulty level of quiz generation. It first accesses the RDF
store, and then calculates the similarity scores between object/entity pairs.
In the second step, the module performs K-means clustering to partition the
distractors into different difficulty levels according to their Linked Data Semantic
Distance (LDSD) [4] scores with the correct answer of a quiz. In the preliminary
experiment, we empirically set K=3 corresponding to three difficulty levels, i.e.
“easy”, “medium” and “difficult”.
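A minimal sketch of this clustering step with scikit-learn, assuming the LDSD-based similarity of each candidate distractor to the correct answer has already been computed; mapping the resulting clusters to the easy/medium/difficult labels by their mean similarity is an assumption of the sketch, not necessarily how Sherlock does it.

```python
import numpy as np
from sklearn.cluster import KMeans

def partition_by_difficulty(distractors, similarity_to_answer, k=3):
    """Cluster distractors into k difficulty levels from their similarity to the
    correct answer: the more similar a distractor, the harder the quiz it yields."""
    scores = np.array([[similarity_to_answer[d]] for d in distractors])
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scores)

    # Order clusters by mean similarity: low similarity -> "easy", high -> "difficult".
    order = np.argsort([scores[labels == c].mean() for c in range(k)])
    names = dict(zip(order, ["easy", "medium", "difficult"]))

    buckets = {"easy": [], "medium": [], "difficult": []}
    for distractor, label in zip(distractors, labels):
        buckets[names[label]].append(distractor)
    return buckets
```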
Template-based Question and Answer Generator: This module auto-
mates the process of generating questions and the correct answers. Fig. 2(a)
demonstrates the instantiation of an example template: “Which of the following
animals is {?animal name}?”, where the variable is replaced with rdfs:label of
the animal. The generated questions and answers will be saved in the database.
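The template instantiation can be sketched with rdflib as below; the class URI used to select animals and the shape of the stored record are illustrative assumptions.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

DBO = Namespace("http://dbpedia.org/ontology/")  # illustrative class namespace

def generate_animal_questions(graph, animal_class=DBO.Animal):
    """Instantiate the template 'Which of the following animals is {?animal name}?'
    for every resource of the given class, using its rdfs:label as the answer."""
    questions = []
    for animal in graph.subjects(RDF.type, animal_class):
        label = graph.value(animal, RDFS.label)
        if label is None:
            continue
        questions.append({
            "question": "Which of the following animals is %s?" % label,
            "answer": str(animal),  # the correct answer; distractors are supplied
                                    # by the similarity computation module
        })
    return questions

# Usage: load the collected RDF data, then store the generated pairs.
kb = Graph()
# kb.parse("wildlife.ttl")  # hypothetical local dump of the collected data
print(generate_animal_questions(kb))
```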
Quiz Renderer: The rendering module first retrieves the question and the
correct answer from the database, and then selects suitable distractors from
the entities returned by the similarity computation module. Fig. 2(a) shows the
module’s intuitive gaming interface, as well as a distinctive feature for tuning
the quiz difficulty level up or down dynamically, making Sherlock able to better serve
the needs of different user groups (e.g., users of different ages and with different
1 For the best experiences, please use Safari or Opera to access the demo.
Fig. 2. (a) User interface for playing a quiz; (b) User interface for creating a quiz.
Fig. 3. (a) Averaged similarity of the testing quizzes (Wildlife domain), per difficulty level (easy, medium, difficult); (b) Correlation between the averaged similarity of the answer choices and quiz accuracy for the Wildlife domain (r = −0.97, p < 0.1).
educational background). Furthermore, to enhance a user’s learning experience,
the “learn more” link on the bottom left of the interface points to a Web page
containing detailed information about the correct answer (e.g., Cheetah).
Quiz Creator: Fig. 2(b) depicts the quiz creator module, which complements
the automatic quiz generation by allowing users to create customised quizzes
with more diverse topics and to share with others. Quiz authoring involves
three simple steps: 1) write a question; 2) set the correct answer (distractors
are suggested by the Sherlock system automatically); and 3) preview and sub-
mit. For instance, one can take a picture of several ingredients and let people
guess what dish one is going to cook. The quiz creator interface can be accessed
from http://sherlock.pilots.bbcconnectedstudio.co.uk/#/quiz/create.
3 Empirical Evaluation
This demo aims to show how Sherlock can effectively generate quizzes of different
domains and how well a standard similarity measure can be used to suggest
quiz difficulty levels that match human perception. The hypothesis is that if
some objects/entities have a higher degree of semantic relatedness, their differences
would be subtle and hence more difficult to disambiguate, and vice versa.
We investigated the correlation between the difficulty level captured by the
similarity measure and that perceived by humans. To test our hypothesis, a group
of 10 human evaluators were presented with 45 testing quizzes generated by
Sherlock based on the BBC Wildlife domain data, i.e., 15 quizzes per difficulty
level. Next, the averaged pairwise similarity between the correct answer and
distractors of each testing quiz was computed, as shown in Fig. 3(a). Fig. 3(b)
demonstrates that the quiz test accuracy of the human evaluation indeed shows a
negative correlation (r = −0.97, p < 0.1) with the average similarity of the quiz
answer choices (i.e., each data point is the averaged value over 15 quizzes per
difficulty level). This suggests that LDSD is an appropriate similarity measure
for indicating quiz difficulty level, which is in line with our hypothesis.
In another set of experiments, we evaluated Sherlock as a generic framework
for quiz generation, in which the system was tested on structured RDF datasets
from three different domains, namely BBC Wildlife, BBC Food and BBC
YourPaintings2, with 321, 991 and 2,315 quizzes automatically generated by the
system for each domain respectively. Benefiting from the domain-independent
similarity measure (LDSD), Sherlock can be easily adapted to generate quizzes
of new domains with minimum human effort, i.e., no need to manually define
rules or rewrite SPARQL queries.
4 Conclusion
In this paper, we presented a novel generic framework (Sherlock) for generating
educational quizzes using linked data. Compared to existing systems, Sherlock
offers a few distinctive features, i.e., it not only provides a generic framework
for generating quizzes of multiple domains with minimum human effort, but
also introduces a mechanism for controlling the difficulty level of the generated
quizzes based on a semantic similarity measure.
Acknowledgements
The research described here is supported by the BBC Connected Studio pro-
gramme and the award made by the RCUK Digital Economy theme to the
dot.rural Digital Economy Hub; award reference EP/G066051/1. The authors
would like to thank Ryan Hussey, Tom Cass, James Ruston, Herm Baskerville
and Nava Tintarev for their valuable contribution.
References
[1] Damljanovic, D., Miller, D., O’Sullivan, D.: Learning from quizzes using intelligent
learning companions. In: WWW (Companion Volume). (2013) 435–438
[2] Álvaro, G., Álvaro, J.: A linked data movie quiz: the answers are out there, and
so are the questions [blog post]. http://bit.ly/linkedmovies (2010)
[3] Waitelonis, J., Ludwig, N., Knuth, M., Sack, H.: WhoKnows? - evaluating linked
data heuristics with a quiz that cleans up dbpedia. International Journal of Inter-
active Technology and Smart Education (ITSE) 8 (2011) 236–248
[4] Passant, A.: Measuring semantic distance on linking data and using it for resources
recommendations. In: AAAI Symposium: Linked Data Meets AI. (2010)
2 http://www.bbc.co.uk/nature/wildlife, http://www.bbc.co.uk/food and http://www.bbc.co.uk/arts/yourpaintings
Low-Cost Queryable Linked Data
through Triple Pattern Fragments
Ruben Verborgh1 , Olaf Hartig2 , Ben De Meester1 , Gerald Haesendonck1 ,
Laurens De Vocht1 , Miel Vander Sande1 , Richard Cyganiak3 , Pieter Colpaert1 ,
Erik Mannens1 , and Rik Van de Walle1
1 Ghent University – iMinds, Belgium
{firstname.lastname}@ugent.be
2 University of Waterloo, Canada
ohartig@uwaterloo.ca
3 Digital Enterprise Research Institute, NUI Galway, Ireland
richard@cyganiak.de
Abstract. For publishers of Linked Open Data, providing queryable
access to their dataset is costly. Those that offer a public SPARQL end-
point often have to sacrifice high availability; others merely provide
non-queryable means of access such as data dumps. We have developed
a client-side query execution approach for which servers only need to
provide a lightweight triple-pattern-based interface, enabling queryable
access at low cost. This paper describes the implementation of a client
that can evaluate SPARQL queries over such triple pattern fragments of
a Linked Data dataset. Graph patterns of SPARQL queries can be solved
efficiently by using metadata in server responses. The demonstration
consists of a SPARQL client for triple pattern fragments that can run as
a standalone application, browser application, or library.
Keywords: Linked Data, Linked Data Fragments, querying, availability,
scalability, SPARQL
1 Introduction
An ever increasing amount of Linked Data is published on the Web, a large part
of which is freely and publicly available. The true value of these datasets becomes
apparent when users can execute arbitrary queries over them, to retrieve pre-
cisely those facts they are interested in. The SPARQL query language [3] allows users to
specify highly precise selections, but it is very costly for servers to offer a public
SPARQL endpoint over a large dataset [6]. As a result, current public SPARQL
endpoints, often hosted by institutions that cannot afford an expensive server
setup, suffer from low availability rates [1]. An alternative for these institutions
is to provide their data in a non-queryable form, for instance, by allowing con-
sumers to download a data dump which they can use to set up their own private
SPARQL endpoint. However, this prohibits live querying of the data, and is in
turn rather expensive on the client side.
In this demo, we will show a low-cost server interface that offers access to
a dataset through all of its triple patterns, together with a client that performs
efficient execution of complex queries through this interface. This enables pub-
lishers to provide Linked Data in a queryable way at low cost. The demo comple-
ments our paper at the ISWC 2014 Research Track [6], which explains in depth
the principles behind the technology and experimentally verifies its scalability.
The present paper details the implementation and introduces the supporting
prototype implementation of our SPARQL client for triple pattern fragments.
2 Related Work
We contrast our approach with the three categories of current HTTP interfaces
to RDF, each of which comes with its own trade-offs regarding performance,
bandwidth, and client/server processor usage and availability.
Public SPARQL endpoints The current de-facto way for providing queryable access
to triples on the Web is the SPARQL protocol, which is supported by many triple
stores such as Virtuoso, AllegroGraph, Sesame, and Jena TDB. Even though
current SPARQL interfaces offer high performance, individual queries can con-
sume a significant amount of server processor time and memory. Because each
client requests unique, highly specific queries, regular HTTP caching is ineffective,
since this can only optimize repeated identical requests. These factors contribute
to the low availability of public SPARQL endpoints, which has been documented
extensively [1]. This makes providing reliable public SPARQL endpoints an excep-
tionally difficult challenge, incomparable to hosting regular public HTTP servers.
Linked Data servers Perhaps the most well-known alternative interface to triples
is described by the Linked Data principles. The principles require servers to pub-
lish documents with triples about specific entities, which the client can access
through their entity-specific URI, a process which is called dereferencing. Each
of these Linked Data documents should contain data that mention URIs of other
entities, which can be dereferenced in turn. Several Linked Data querying tech-
niques [4] use dereferencing to solve queries over the Web of Data. This process
happens client-side, so the availability of servers is not impacted. However, exe-
cution times are high, and many queries cannot be solved (efficiently) [6].
Other http interfaces for triples Additionally, several other http interfaces for
triples have been designed. Strictly speaking, the most trivial http interface
is a data dump, which is a single-file representation of a dataset. The Linked
Data Platform [5] is a read/write http interface for Linked Data, scheduled to
become a w3c recommendation. It details several concepts that extend beyond
the Linked Data principles, such as containers and write access. However, the
api has been designed primarily for consistent read/write access to Linked Data
resources, not to enable reliable and/or efficient query execution. The interface
we will discuss next offers low-cost publishing and client-side querying.
3 Linked Data Fragments and Triple Pattern Fragments
Linked Data Fragments [6] enable a uniform view on all possible http interfaces
for triples, and allow new interfaces with different trade-offs to be defined.
Definition 1. A Linked Data Fragment (ldf) of a dataset is a resource con-
sisting of those triples of this dataset that match a specific selector, together with
their metadata and hypermedia controls to retrieve other Linked Data Fragments.
We define a specific type of ldf that requires minimal effort for a server to
generate, while still enabling efficient querying on the client side:
Definition 2. A triple pattern fragment is a Linked Data Fragment with
a triple pattern as selector, count metadata, and the controls to retrieve any other
triple pattern fragment of the dataset. Each page of a triple pattern fragment
contains a subset of the matching triples, together with all metadata and controls.
Triple pattern fragments can be generated easily, as triple-pattern selection
is an indexed operation in the majority of triple stores. Furthermore, specialized
formats such as hdt (Header Dictionary Triples) [2], a compressed rdf format,
natively support fast triple-pattern extraction. This keeps server costs low.
Clients can then efficiently evaluate sparql queries over the remote dataset
because each page contains an estimate of the total number of matching triples.
This allows efficient asymmetric joins by first binding those triple patterns with
the lowest number of matches. For basic graph patterns (bgps), which are the
main building blocks of sparql queries, the algorithm works as follows (a code
sketch follows the steps below):
1. For each triple pattern $tp_i$ in the bgp $B = \{tp_1, \ldots, tp_n\}$, fetch the first
page $i_1$ of the triple pattern fragment $f_i$ for $tp_i$, which contains an es-
timate $cnt_i$ of the total number of matches for $tp_i$. Choose $\varepsilon$ such that
$cnt_\varepsilon = \min(\{cnt_1, \ldots, cnt_n\})$. $f_\varepsilon$ is then the optimal fragment to start with.
2. Fetch all remaining pages of the triple pattern fragment $f_\varepsilon$. For each triple $t$ in
the ldf, generate the solution mapping $\mu_t$ such that $\mu_t(tp_\varepsilon) = t$. Compose
the subpattern $B_t = \{tp \mid tp = \mu_t(tp_j) \wedge tp_j \in B\} \setminus \{t\}$. If $B_t \neq \emptyset$, find
mappings $\Omega_{B_t}$ by recursively calling the algorithm for $B_t$. Else, $\Omega_{B_t} = \{\mu_\emptyset\}$
with $\mu_\emptyset$ the empty mapping.
3. Return all solution mappings $\mu \in \{\mu_t \cup \mu' \mid \mu' \in \Omega_{B_t}\}$.
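A minimal Java sketch of this recursive evaluation is given below. It is not the authors' implementation: the Fragments interface, the String[]-based triple representation, and all identifiers are hypothetical stand-ins for a real client that fetches pages of triple pattern fragments over http and reads their count metadata.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BgpSketch {

    // Hypothetical stand-in for an HTTP client that pages through a triple
    // pattern fragment and reads the count estimate from its metadata.
    interface Fragments {
        long countEstimate(String[] pattern);        // metadata of the first page
        List<String[]> allMatches(String[] pattern); // triples from all pages
    }

    // Triples and triple patterns are plain String[3] arrays; terms starting
    // with "?" are variables. Returns all solution mappings for the BGP.
    static List<Map<String, String>> evaluate(List<String[]> bgp, Fragments f) {
        List<Map<String, String>> solutions = new ArrayList<>();
        if (bgp.isEmpty()) {                         // base case: { the empty mapping }
            solutions.add(new HashMap<>());
            return solutions;
        }
        // Step 1: start with the pattern that has the fewest estimated matches.
        Comparator<String[]> byCount = Comparator.comparingLong(f::countEstimate);
        String[] best = Collections.min(bgp, byCount);

        for (String[] triple : f.allMatches(best)) {
            // Step 2: bind the variables of `best` to the terms of `triple`.
            Map<String, String> mu = new HashMap<>();
            for (int i = 0; i < 3; i++)
                if (best[i].startsWith("?")) mu.put(best[i], triple[i]);

            // Substitute the binding into the remaining patterns and recurse.
            List<String[]> rest = new ArrayList<>();
            for (String[] tp : bgp) {
                if (tp == best) continue;            // drop the now-bound pattern
                String[] bound = tp.clone();
                for (int i = 0; i < 3; i++)
                    bound[i] = mu.getOrDefault(bound[i], bound[i]);
                rest.add(bound);
            }
            // Step 3: combine the binding with every mapping of the sub-pattern.
            for (Map<String, String> sub : evaluate(rest, f)) {
                Map<String, String> merged = new HashMap<>(mu);
                merged.putAll(sub);
                solutions.add(merged);
            }
        }
        return solutions;
    }
}

Note how the recursion mirrors the three steps above: the smallest fragment is materialised, each of its triples yields a partial mapping, and the remaining, partially bound pattern is solved recursively.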
4 Demo of Client-side Querying
The above recursive algorithm has been implemented by a dynamic pipeline
of iterators [6]. At the deepest level, a client uses TriplePatternIterators to
retrieve pages of triple pattern fragments from the server, turning the triples on
those pages into bindings. A basic graph pattern of a sparql query is evaluated
by a GraphPatternIterator, which first discovers the triple pattern in this graph
with the lowest number of matches by fetching the first page of the corresponding
triple pattern fragment. Then, TriplePatternIterators are recursively chained
together in the optimal order, which is chosen dynamically based on the number
of matches for each binding. More specific iterators enable other sparql features.
Fig. 1. The demo shows how Linked Data Fragments clients, such as Web browsers,
evaluate sparql queries over datasets offered as inexpensive triple pattern fragments.
In the above example, a user searches for artists born in Italian cities.
This iterator-based approach has been implemented as a JavaScript appli-
cation (Fig. 1), to allow its usage on different platforms (standalone, library,
browser application). The source code of the client, and also of triple pattern frag-
ment servers, is freely available at https://github.com/LinkedDataFragments/.
The versatility and efficiency of client-side querying are demonstrated through the
Web application http://client.linkeddatafragments.org, which allows users
to execute arbitrary sparql queries over triple pattern fragments. That way,
participants experience first-hand how low-cost Linked Data publishing solutions
can still enable efficient, real-time query execution over datasets on the Web.
References
1. Buil-Aranda, C., Hogan, A., Umbrich, J., Vandenbussche, P.Y.: sparql Web-
querying infrastructure: Ready for action? In: Proceedings of the 12th International
Semantic Web Conference (Nov 2013)
2. Fernández, J.D., Martínez-Prieto, M.A., Gutiérrez, C., Polleres, A., Arias, M.: Bi-
nary rdf representation for publication and exchange (hdt). Journal of Web Se-
mantics 19, 22–41 (Mar 2013)
3. Harris, S., Seaborne, A.: sparql 1.1 query language. Recommendation, w3c (Mar
2013), http://www.w3.org/TR/sparql11-query/
4. Hartig, O.: An overview on execution strategies for Linked Data queries. Datenbank-
Spektrum 13(2), 89–99 (2013)
5. Speicher, S., Arwe, J., Malhotra, A.: Linked Data Platform 1.0. Working draft, w3c
(Mar 2014), http://www.w3.org/TR/2014/WD-ldp-20140311/
6. Verborgh, R., Hartig, O., De Meester, B., Haesendonck, G., De Vocht, L., Van-
der Sande, M., Cyganiak, R., Colpaert, P., Mannens, E., Van de Walle, R.: Querying
datasets on the Web with high availability. In: Proceedings of the 13th International
Semantic Web Conference (Oct 2014)
call: A Nucleus for a Web of Open Functions
Maurizio Atzori
Math/CS Department
University of Cagliari
Via Ospedale 72
09124 Cagliari (CA), Italy
atzori@unica.it
Abstract. In our recent work we envisioned a Web where functions,
like Linked Data, can be openly published and available to all the users
of any remote sparql endpoint. The resulting Web of Functions can
be realized by introducing a call sparql extension that can invoke
any remote function (custom third-party extensions) by only knowing
its corresponding URI, while the implementation and the computational
resources are made available by the function publisher. In this paper we
demo our framework with a set of functions showing (1) advanced use of
its higher-order expressive power, featuring, e.g., composition of
third-party functions, and (2) a possible bridge between hundreds
of standard Web APIs and the Web of Functions. In our view these
functions form an initial nucleus to which anyone can contribute within
the decentralized Web of Functions, made available through call.
1 Introduction
While extending the language with user-defined custom functions (sometimes
called extension functions) represented by URIs is a native feature of the sparql
language, the mechanism only works on the single endpoint featuring that spe-
cific function. In our recent work [1], we investigated interoperability, compu-
tational power and expressivity of functions that can be used within a sparql
query, envisioning a Web where also functions can be openly published, making
them available to all the users of any other endpoint. In [1] we define a wfn:call
function with three possible architectures to deploy it in a backward compatible
manner. It is the basis needed in order to realize a Web of Open Functions,
meaning that users can call a function by only knowing its corresponding URI,
as is the case for entities and properties in the Web of Open Data1 , while
the implementation and the computational resources are made available by the
function publisher, as happens with typical Web APIs.
Practically, supposing that Alice wants to use one of Bob's sparql extensions (defined
only in Bob's endpoint) from her own endpoint, she writes the following:
PREFIX wfn:
PREFIX bob:
SELECT *
WHERE {
  # within Alice's data, find useful values for ?arg1, ?arg2
  ...
  # now use Bob's function
  FILTER(wfn:call(bob:complexFunction, ?arg1, ?arg2))
}
1
We titled this paper after the DBpedia milestone work in [2].
Therefore, the function wfn:call takes care of finding the right endpoint (see [1,
3, 4]), i.e., Bob's, and then remotely calls Bob's complexFunction. We believe this
may be the first step toward a novel view of the Web as a place holding code
and functions, not only data, as Linked Data already does so successfully. The Semantic
Web already shifted URIs from pages to conceptual entities, primarily struc-
tured data. We believe that among these concepts there should be computable
functions.
In this paper we demo our open source implementation for the wfn:call
function, realized as an Apache Jena custom function extension, and avail-
able with other resources (including a link to our endpoint that publishes it)
at http://atzori.webofcode.org/projects/wfn/. In particular, we devise an
initial nucleus of practical functions that may empower sparql queries with
computations that exploit higher-order expressivity and hundreds of existing
Web APIs.
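As a rough illustration only (not the authors' code), the snippet below shows how a custom SPARQL extension function is typically registered with Apache Jena; the package names assume a recent Jena release (org.apache.jena.*), and the function IRI and its logic are purely hypothetical.

import org.apache.jena.sparql.expr.NodeValue;
import org.apache.jena.sparql.function.FunctionBase2;
import org.apache.jena.sparql.function.FunctionRegistry;

// A toy two-argument extension function: concatenates the lexical forms of its arguments.
public class ComplexFunction extends FunctionBase2 {
    @Override
    public NodeValue exec(NodeValue arg1, NodeValue arg2) {
        return NodeValue.makeString(arg1.asString() + arg2.asString());
    }

    public static void main(String[] args) {
        // Bind the function to an IRI so queries can invoke it in FILTER/BIND expressions.
        FunctionRegistry.get().put("http://example.org/fn#complexFunction",
                                   ComplexFunction.class);
    }
}

Once registered on an endpoint, such a function can be invoked in a query in the same way that wfn:call is used in the example above.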
2 Fully Higher-Order Functions in SPARQL
Higher-order functions (HOF) are functions that take other functions as input
and/or return them as output. Languages that allow functions to be used like any
other kind of data are said to feature first-class functions. Here we show that this ad-
vanced expressivity, typical of functional languages, can be used within sparql
by means of wfn:call. In the following we exemplify it by describing the use of
three important HOFs: reduce, compose and memoize.
Reduce. In functional languages “reduce” (also called “fold”, “inject” or “aggre-
gate”) is a function that takes a binary function (e.g., the + operator) and a list
(e.g., a list of integers), producing a result by recursively applying the function
to the remaining list (providing, e.g., the sum of all the elements). Thus, it rep-
resents a general-purpose aggregation mechanism potentially useful in sparql
queries. In the following we show how it can be used to apply the binary max
function provided by Jena to a list of 4 numbers:
PREFIX call:
PREFIX afn: .
SELECT ?max {
BIND( call:(wfn:reduce, afn:max, 5, 7, -1, 3) AS ?max)
}
resulting in ?max = 7.
Compose. Another important HOF is the composition function. Given two func-
tions g and f , it returns a third function that behaves as the application of f
followed by the application of g, i.e., g(f (.)). The following query excerpt:
BIND(call:(wfn:compose, fn:upper-case, afn:namespace)
AS ?uppercase_ns).
BIND(call:(?uppercase_ns, ) AS ?result)
returns the uppercased namespace, that is, HTTP://SOMETHING.ORG/. In partic-
ular, variable ?uppercase_ns is bound to a dynamically generated sparql func-
tion that, whenever invoked, applies afn:namespace followed by fn:upper-case.
Memoize. Many sparql queries may iterate over intermediate results, requir-
ing the execution of the same function multiple times, possibly with the same
parameters. In order to speed up the execution of potentially time-consuming
functions, we implemented a memoization function that keeps the results of func-
tion calls and then returns the cached result when the same inputs occur again.
This part of the query:
BIND(call:(wfn:memoize, ?slow_function) AS ?fast_function).
BIND(call:(?fast_function, 1) AS ?result1). #1st time at normal speed
BIND(call:(?fast_function, 1) AS ?result2). #2nd time is faster
dynamically generates a ?fast_function that is the memoization of the function
in ?slow_function. Please note that features like this are possible
only in higher-order environments, such as the one resulting from the use of our
call function [1].
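To make the idea behind wfn:memoize concrete outside sparql, here is a minimal, generic memoization wrapper in plain Java; it only illustrates the higher-order pattern and is not code from the demo.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public final class Memoize {
    // Wrap a function so that repeated calls with the same argument reuse a cached result.
    public static <A, R> Function<A, R> memoize(Function<A, R> slow) {
        Map<A, R> cache = new ConcurrentHashMap<>();
        return arg -> cache.computeIfAbsent(arg, slow);  // compute once, then reuse
    }
    // usage (illustrative): Function<Integer, Integer> fast = Memoize.memoize(slowFunction);
}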
3 Bridging Web APIs and the Web of Functions
In order to develop a useful Web of Functions, the small set of auxiliary func-
tions presented in the previous section is clearly not enough. While some other
powerful user-defined functions are already online (e.g., runSPARQL [5] computes
recursive sparql functions), we need a larger nucleus of functions allowing any
sort of computation from within sparql queries. In this section we propose to
exploit a well-known Web API hub, namely Mashape2, by using a simple
bridge function that allows any Mashape-featured API to be called. This function,
which we call wfn:api-bridge, pushes hundreds of Web APIs into the Web
of Functions, ranging from weather forecasts to face detection, and from language
translation to flight information lookup. For instance, we can iterate over DB-
pedia cities in Tuscany to find those with a nearby airport, cheap flights, and good
weather during the week after a given arrival day. In the following query we find large
Tuscan cities sorted by current weather temperature:
SELECT * {
?city dbpedia-owl:region dbpedia:Tuscany;
dbpedia-owl:populationTotal ?population;
geo:lat ?lat; geo:long ?long.
FILTER(?population > 80000).
BIND(CONCAT("lat=",?lat,"&lon=",?long) AS ?parameters)
BIND( call:(wfn:api-bridge, "community-open-weather-map", ?parameters,
"main.temp") AS ?temperature).
} ORDER BY ?temperature
2
Freely available at http://www.mashape.com/
The wfn:api-bridge function calls the Mashape Web API corresponding to the
first argument, with the parameters specified in the second argument, and returns
the JSON field selected by the third argument. Different APIs needed to answer
a query can be combined with compose, and the resulting function
may be memoized for better performance if needed. Advanced uses may exploit
Linked Data information to search through existing Web API repositories [6].
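A bridge of this kind is conceptually simple. The hedged sketch below (not the actual wfn:api-bridge implementation) calls a hypothetical JSON Web API over http with java.net.http and pulls out a single top-level field with a deliberately naive regular expression, standing in for proper JSON parsing and for any authentication the real hub would require.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ApiBridgeSketch {

    // Fetch the response body of a (hypothetical) JSON Web API, e.g.
    // callApi("https://api.example.org/weather?lat=43.7&lon=11.2").
    static String callApi(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    // Naive extraction of one top-level field; a real bridge would use a JSON
    // parser and support nested selectors such as "main.temp".
    static String extractField(String json, String field) {
        Matcher m = Pattern.compile("\"" + Pattern.quote(field) + "\"\\s*:\\s*\"?([^,\"}]+)")
                           .matcher(json);
        return m.find() ? m.group(1) : null;
    }
}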
4 Conclusions and Demo Showcase
We presented a set of sparql extensions containing higher-order manipulation
functions, allowing for instance function composition, together with a bridge
function that allows the use of hundreds of existing Web APIs from any sparql
endpoint featuring the wfn:call function, that we developed and opensourced
for Apache Jena. This set, forming an initial nucleus for the Web of Functions,
enables a wide spectrum of much powerful sparql queries w.r.t. the ones we are
currently used to, with a number of practical examples that will be showcased
during the demo and made available at our website.
Acknowledgments. This work was supported in part by the RAS Project
CRP-17615 DENIS: Dataspaces Enhancing Next Internet in Sardinia and by
MIUR PRIN 2010-11 project Security Horizons.
References
1. Atzori, M.: Toward the Web of Functions: Interoperable Higher-Order Functions
in SPARQL. In: 13th International Semantic Web Conference (Research Track).
(2014)
2. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.G.: DBpedia:
A Nucleus for a Web of Open Data. In: The Semantic Web, 6th International
Semantic Web Conference, 2nd Asian Semantic Web Conference (ISWC/ASWC).
(2007) 722–735
3. Paulheim, H., Hertling, S.: Discoverability of SPARQL Endpoints in Linked Open
Data. In: International Semantic Web Conference (Posters & Demos). (2013)
4. Alexander, K., Cyganiak, R., Hausenblas, M., Zhao, J.: Describing Linked Datasets
with the VoID Vocabulary (December 2010)
5. Atzori, M.: Computing Recursive SPARQL Queries. In: 8th IEEE International
Conference on Semantic Computing. (2014)
6. Bianchini, D., Antonellis, V.D., Melchiori, M.: A Linked Data Perspective for Ef-
fective Exploration of Web APIs Repositories. In: ICWE 2013. (2013) 506–509
Cross-lingual detection of world events from
news articles
Gregor Leban, Blaž Fortuna, Janez Brank, and Marko Grobelnik
Jožef Stefan Institute
Ljubljana, Slovenia
{firstname.surname}@ijs.si
Abstract. In this demo we describe a system called Event Registry
(http://eventregistry.org) that can identify world events from news arti-
cles. Events can be detected in different languages. For each event, the
system can extract core event information and store it in a structured
form that allows advanced search options. Numerous visualizations are
provided for exploring search results.
Keywords: information extraction; cross-linguality; events; named entity de-
tection; clustering; news
1 Introduction
Each day, thousands of small, medium, and large events happen in the world.
These events range from insignificant meetings, conferences, and sports events
to important natural disasters and decisions made by world leaders. We learn
about these events from different media channels. The unstructured nature of
the content provided on these channels is suitable for humans, but hard for
computers to understand.
Identifying events described in the news and extracting main information
about these events is the main goal of the system presented in this paper. Event
Registry [2] is able to process news articles published in di↵erent languages world-
wide. By analyzing the articles it can identify the mentioned events and extract
main event information. Extracted event information is stored in a structured
way that provides unique features such as searching for events by date, location,
entity, topic, and other event properties. Besides finding events of interest, Event
Registry also provides a user interface with numerous visualizations showing date,
location, concept, and topic aggregates of the events that match the search criteria.
The rest of the paper is organized as follows. We start by describing the high
level architecture of the system. Next, we describe in more detail the process of
identifying events from individual news articles. We continue by describing how
event information is extracted from the group of articles mentioning an event.
Finally, we describe the main features of the frontend interface of Event Registry.
2 System architecture
Event Registry consists of a pipeline of components, each of which provides unique
and relevant features for the system. In order to identify events we first need
data from which we can extract the information. In Event Registry we use news
articles as the data source. The news articles are collected using the News-
Feed [5] service, which gathers news articles from more than 75,000 worldwide
news sources. The collected articles are in more than 40 languages, with articles in
English, Spanish, German, and Chinese amounting to about 70% of all
articles. These are also the only languages we currently use in Event Registry.
Each collected article in the mentioned languages is then analyzed in order
to extract relevant information. One of the key tasks is identification and disam-
biguation of named entities and topics mentioned in the article. We perform this
task using a semantic enrichment service developed in the XLike project. We also
detect date mentions in the text, since they frequently specify the date when the
event occurred. By analyzing the articles we noticed that different news sources
frequently publish almost identical news articles. These duplicated articles don’t
bring any new information, which is why we identify them and ignore them in
the rest of the pipeline.
An important feature of the Event Registry is cross-linguality. To support
it, we identify for each article a set of most similar articles in other languages.
To identify these articles we use canonical correlation analysis [4], which maps
articles from different languages into a common semantic space. The common
space can then be used to identify most similar articles in all other languages.
In order to train the mapping to the common space we used the aligned corpus
of Wikipedia articles.
After extracting relevant features from each individual article we start with
the main task of identifying events from groups of articles. Since this is the main
contribution of the paper we will describe the details of the process in the next
three sections.
3 Identification of events
An assumption that we make in identifying events is that any relevant event
should be reported by at least a few different news publishers. In order to identify
events, we therefore apply an online clustering algorithm based on [1] to articles
as they are added to the system. Each article is first transformed into a TF-IDF
weighted feature vector. The features in the vector are the words in the document
as well as the identified named entities and topics. If the cosine similarity to the
closest centroid is above a threshold, the article is added to the closest cluster
and the cluster properties are updated. Otherwise, a new cluster is formed that
contains only the new article.
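The following is a small, self-contained sketch of this kind of online centroid clustering. It is not the Event Registry implementation (which follows [1]): articles are assumed to arrive as already-computed sparse TF-IDF vectors, and the similarity threshold is an arbitrary placeholder.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OnlineClusteringSketch {

    static final double THRESHOLD = 0.5;   // placeholder similarity threshold
    static final List<Map<String, Double>> centroids = new ArrayList<>();
    static final List<List<Map<String, Double>>> clusters = new ArrayList<>();

    // Assign an incoming article (a sparse TF-IDF vector) to the closest cluster,
    // or open a new cluster if no centroid is similar enough.
    static void add(Map<String, Double> article) {
        int best = -1;
        double bestSim = -1.0;
        for (int i = 0; i < centroids.size(); i++) {
            double sim = cosine(article, centroids.get(i));
            if (sim > bestSim) { bestSim = sim; best = i; }
        }
        if (best >= 0 && bestSim >= THRESHOLD) {
            clusters.get(best).add(article);
            centroids.set(best, mean(clusters.get(best)));   // update the centroid
        } else {
            List<Map<String, Double>> c = new ArrayList<>();
            c.add(article);
            clusters.add(c);
            centroids.add(new HashMap<>(article));
        }
    }

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    static Map<String, Double> mean(List<Map<String, Double>> vectors) {
        Map<String, Double> m = new HashMap<>();
        for (Map<String, Double> v : vectors)
            v.forEach((k, x) -> m.merge(k, x / vectors.size(), Double::sum));
        return m;
    }
}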
Each identified cluster of articles is considered to describe an event if it
contains at least a minimum number of articles (the minimum value used in
our system is 5 articles). Once a cluster reaches this limit, we treat it as an
event, and the information about its articles is sent to the next components in
the pipeline. Those components are responsible for extracting event information
from the articles and will be described in the next sections.
Since clusters are being constantly updated with new articles we want to
reevaluate each cluster after a few updates in order to determine if it should
be split into two clusters or merged with another cluster. In order to decide if
the cluster should be split we apply a bisecting k-means algorithm (with k = 2)
on the cluster. We then use a variant of the Bayesian Information Criterion to
decide whether to accept the new split or not. Periodically we also identify pairs
of clusters with high similarity and decide if they should be merged or not. The
decision is made using Lughofer's ellipsoid criterion [3]. We assume that
clusters that have not been modified for a few days describe past events, and we
therefore remove them from the list of maintained clusters.
3.1 Cross-lingual merging of clusters
Each identified cluster of articles contains only articles in a single language.
Since articles in different languages can describe the same events, we want to
identify clusters describing the same event in different languages and represent
them as a single event. In order to determine which cluster pairs to merge we
represent the task as a binary classification problem. Given a cluster pair c1
and c2 in languages l1 and l2 we want to extract a set of features that would
discriminate between cluster pairs that describe the same event and those that
don’t. A very important learning feature that we can extract for each cluster pair
is computed by inspecting individual articles in each cluster. Using canonical
correlation analysis we are able to obtain for each article in c1 a list of 10 most
similar articles in language l2 . Using this information we can check how many
of these articles are in c2 . We can repeat the same computation for articles
in c2. By normalizing the results by the size of the clusters we can obtain a
score that should intuitively correlate with the similarity of the two clusters across
the two languages. Some of the other features that we extract include the time
difference between the average article dates of the clusters, the cosine similarity of
the annotated concepts and topics, and category similarity.
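A sketch of how that overlap feature could be computed is given below. The knnInOtherLanguage function is a hypothetical stand-in for the CCA-based retrieval of the 10 most similar articles in the other language, and the normalisation shown is illustrative rather than the authors' exact formula.

import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class CrossLingualFeatureSketch {

    // Fraction of cross-lingual nearest neighbours of c1's articles that land in c2,
    // plus the symmetric fraction for c2, each normalised by the cluster size.
    static double overlapFeature(Set<String> c1, Set<String> c2,
                                 Function<String, List<String>> knnInOtherLanguage) {
        return directedOverlap(c1, c2, knnInOtherLanguage)
             + directedOverlap(c2, c1, knnInOtherLanguage);
    }

    private static double directedOverlap(Set<String> from, Set<String> to,
                                          Function<String, List<String>> knn) {
        int hits = 0;
        for (String articleId : from)
            for (String neighbour : knn.apply(articleId))   // top-10 similar articles
                if (to.contains(neighbour)) hits++;
        return from.isEmpty() ? 0.0 : (double) hits / from.size();
    }
}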
In order to build a classification model we collected 85 learning examples.
Some cluster pairs were selected randomly and some were selected by users
based on their similarity. The features for the selected learning examples were
extracted automatically, while the correct class was assigned manually. A linear
SVM was then trained on the data, achieving 87% accuracy under 10-fold
cross-validation.
4 Event information extraction
Once we obtain one or more new clusters that are believed to describe a single
event, we assign them a new id in the Event Registry. From the associated arti-
cles we then try to extract the relevant information about the event. We try to
determine the event date by checking whether a common date reference is frequently
mentioned in the articles. If no date is mentioned frequently enough, we use the
average published date of the articles as the event date. The location of the event
is determined by locating frequently mentioned named entities that represent
locations. To determine what the event is about we aggregate the named enti-
ties and topics identified in the articles. Each event is also categorized using a
DMoz taxonomy. All extracted information is stored in the Event Registry in a
structured form that provides rich search and visualization capabilities.
5 Event search, visualization options and data
accessibility
Event Registry is available at http://eventregistry.org and currently contains
15,000,000 articles, from which we have identified about 1 million events. Available
search options include search by relevant named entities, keywords, publishers,
event location, date and category. The resulting events that match the criteria
can then be seen as a list or using one of numerous visualizations. Main vi-
sualizations of search results include location and time aggregates, list of top
named entities and topics, graph of related entities, concept trends, concept ma-
trix, date mentions, clusters of events and event categories. For each individual
event we can provide the list of articles describing it as well as visualizations
of concepts, article timeline, date mentions, article sources and other similar
events. Examples of these visualizations are (due to space limits) available at
http://eventregistry.org/screens. All Event Registry data is also stored using the
Storyline ontology and is available through a SPARQL endpoint at
http://eventregistry.org/sparql.
6 Acknowledgments
This work was supported by the Slovenian Research Agency and X-Like (ICT-
288342-STREP).
References
1. C. C. Aggarwal and P. Yu. A framework for clustering massive text and categorical
data streams. In Proceedings of the sixth SIAM international conference on data
mining, volume 124, pages 479–483, 2006.
2. G. Leban, B. Fortuna, J. Brank, and M. Grobelnik. Event registry – learning about
world events from news. In WWW 2014 Proceedings, pages 107–110. ACM, 2014.
3. E. Lughofer. A dynamic split-and-merge approach for evolving cluster models.
Evolving Systems, 3(3):135–151, 2012.
4. J. Rupnik, A. Muhic, and P. Skraba. Cross-lingual document retrieval through hub
languages. In NIPS, 2012.
5. M. Trampus and B. Novak. Internals of an aggregated web news feed. In Proceedings
of 15th Multiconference on Information Society 2012 (IS-2012), 2012.
Multilingual Word Sense Disambiguation and
Entity Linking for Everybody
Andrea Moro, Francesco Cecconi, and Roberto Navigli
Sapienza University of Rome, Viale Regina Elena 295, 00198, Italy
{moro,cecconi,navigli}@di.uniroma1.it
Abstract. In this paper we present a Web interface and a RESTful API
for our state-of-the-art multilingual word sense disambiguation and en-
tity linking system. The Web interface has been developed, on the one
hand, to be user-friendly for non-specialized users, who can thus easily
obtain a first grasp on complex linguistic problems such as the ambi-
guity of words and entity mentions and, on the other hand, to provide
a showcase for researchers from other fields interested in the multilin-
gual disambiguation task. Moreover, our RESTful API enables an easy
integration, within a Java framework, of state-of-the-art language tech-
nologies. Both the Web interface and the RESTful API are available at
http://babelfy.org
Keywords: Multilinguality, Word Sense Disambiguation, Entity Link-
ing, Web interface, RESTful API
1 Introduction
The tasks of Word Sense Disambiguation (WSD) and Entity Linking (EL) are
well-known in the computational linguistics community. WSD [9, 10] is a histori-
cal task aimed at assigning meanings to single-word and multi-word occurrences
within text, while the aim of EL [3, 12] is to discover mentions of entities within
a text and to link them to the most suitable entry in the considered knowl-
edge base. These two tasks are key to many problems in Artificial Intelligence
and especially to Machine Reading (MR) [6], i.e., the problem of automatic,
unsupervised understanding of text. Moreover, the recent upsurge of interest in
the use of semi-structured resources to create novel repositories of knowledge
[5] has opened up new opportunities for wide-coverage, general-purpose Natural
Language Understanding techniques. The next logical step, from the point of
view of Machine Reading, is to link natural language text to the aforementioned
resources.
In this paper, we present a Web interface and a Java RESTful API for our
state-of-the-art approach to WSD and EL in arbitrary languages: Babelfy [8].
Babelfy is the first approach which explicitly aims at performing both multi-
lingual WSD and EL at the same time. The approach is knowledge-based and
exploits semantic relations between word meanings and named entities from Ba-
belNet [11], a multilingual semantic network which provides lexicalizations and
glosses for more than 9 million concepts and named entities in 50 languages.
2 BabelNet
In our work we use the BabelNet 2.51 semantic network [11] since it is the largest
available multilingual knowledge base and is obtained from the automatic seam-
less integration of Wikipedia2 , WikiData3 , OmegaWiki4 , WordNet [4], Open
Multilingual WordNet [1] and Wiktionary5. It is available in different formats,
such as via its Java API, a SPARQL endpoint and a linked data interface [2]. It
contains more than 9 million concepts and named entities, 50 million lexicaliza-
tions and around 250 million semantic relations (see http://babelnet.org/stats
for more detailed statistics). Moreover, by using this resource we can leverage the
multilingual lexicalizations of the concepts and entities it contains to perform
disambiguation in any of the 50 languages covered in BabelNet.
1
http://babelnet.org
2
http://www.wikipedia.org
3
http://wikidata.org
4
http://omegawiki.org
5
http://wiktionary.org
3 The Babelfy System
Our state-of-the-art approach, Babelfy [8], is based on a loose identification of
candidate meanings (substring matching instead of exact matching) coupled with
a densest subgraph heuristic which selects high-coherence semantic interpreta-
tions. Here we briefly describe its three main steps:
1. Each vertex, i.e., either concept or named entity, is automatically associated
with a semantic signature, that is, a set of related vertices by means of
random walks with restart on the BabelNet network.
2. Then, given an input text, all the linkable fragments, i.e., pieces of text that are
equal to, or a substring of, at least one lexicalization contained in BabelNet, are
selected and, for each of them, the possible meanings are listed according to
the semantic network.
3. A graph-based semantic interpretation of the whole text is produced by link-
ing the candidate meanings of the selected fragments using the previously-
computed semantic signatures. Then a densest subgraph heuristic is used
to extract the most coherent interpretation and finally the fragments are
disambiguated by using a centrality measure within this graph.
A detailed description and evaluations of the approach are given in [7, 8].
4 Web Interface and RESTful API
Fig. 1. A screenshot of the Babelfy Web interface.
We developed a Web interface and a RESTful API by following the KISS principle,
i.e., “keep it simple, stupid”. As can be seen from the screenshot in Figure 1, the
Web interface asks for the input text, its language and whether the par-
tial matching heuristic should be used instead of the exact string matching one.
After clicking on “Babelfy!” the user is presented with the annotated text where
we denote with green circles the concepts and with yellow circles the named en-
tities. As for the Java RESTful API, users can exploit our approach by writing
less than 10 lines of code. Here we show a complete example:
// get an instance of the Babelfy RESTful API manager
Babelfy bfy = Babelfy.getInstance(AccessType.ONLINE);
// the string to be disambiguated
String inputText = "hello world, I'm a computer scientist";
// the actual disambiguation call
Annotation annotations = bfy.babelfy("", inputText,
                                     Matching.EXACT, Language.EN);
// printing the result
for (BabelSynsetAnchor annotation : annotations.getAnnotations())
    System.out.println(annotation.getAnchorText() + "\t" +
                       annotation.getBabelSynset().getId() + "\t" +
                       annotation.getBabelSynset());
4.1 Documentation for the RESTful API
Annotation babelfy(String key, String inputText,
Matching candidateSelectionMode, Language language)
The first parameter is the access key. A random or empty key will grant 100
requests per day (but a less restrictive key can be requested). The second pa-
rameter is a string representing the input text (sentences or whole documents
can be input up to a maximum of 3500 characters). The third parameter is an
enum with two possible values: EXACT or PARTIAL, to enable, respectively,
the exact or partial matching heuristic for the selection of fragment candidates
found in the input text. The fourth parameter is the language of the input text
(among 50 languages denoted with their ISO 639-1 uppercase code).
Annotation is the object that contains the output of our system. A user
can access the POS-tagged input text with getText() which returns a list of
WordLemmaTag objects with the respective getters. With getAnnotations() a
user will get a list of BabelSynsetAnchor objects, i.e., the actual annotations.
A user can use getAnchorText() to get the disambiguated fragment of text and
with getBabelSynset() get the selected Babel synset. Moreover, if a user wants
to anchor the disambiguated entry to the input text, the start and end indices
of the tagged text can be obtained with getStart() and getEnd().
5 Conclusion
In this paper, we presented and described the typical use of the Web interface and
Java RESTful API of our state-of-the-art system for multilingual Word Sense
Disambiguation and Entity Linking, i.e., Babelfy, available at http://babelfy.org
Acknowledgments
The authors gratefully acknowledge the support of the
ERC Starting Grant MultiJEDI No. 259234.
References
1. Bond, F., Foster, R.: Linking and extending an open multilingual wordnet. In:
Proc. of ACL. pp. 1352–1362 (2013)
2. Ehrmann, M., Cecconi, F., Vannella, D., Mccrae, J.P., Cimiano, P., Navigli, R.:
Representing Multilingual Data as Linked Data: the Case of BabelNet 2.0. In:
Proc. of LREC. pp. 401–408 (2014)
3. Erbs, N., Zesch, T., Gurevych, I.: Link Discovery: A Comprehensive Analysis. In:
Proc. of ICSC. pp. 83–86 (2011)
4. Fellbaum, C.: WordNet: An Electronic Lexical Database. MIT Press (1998)
5. Hovy, E.H., Navigli, R., Ponzetto, S.P.: Collaboratively built semi-structured con-
tent and Artificial Intelligence: The story so far. Artificial Intelligence 194, 2–27
(2013)
6. Mitchell, T.M.: Reading the Web: A Breakthrough Goal for AI. AI Magazine (2005)
7. Moro, A., Navigli, R., Tucci, F.M., Passonneau, R.J.: Annotating the MASC Cor-
pus with BabelNet. Proc. of LREC pp. 4214–4219 (2014)
8. Moro, A., Raganato, A., Navigli, R.: Entity Linking meets Word Sense Disam-
biguation: A Unified Approach. TACL 2, 231–244 (2014)
9. Navigli, R.: Word sense disambiguation: A survey. ACM Comput. Surv. 41(2),
1–69 (2009)
10. Navigli, R.: A Quick Tour of Word Sense Disambiguation, Induction and Related
Approaches. In: Proc. of SOFSEM. pp. 115–129 (2012)
11. Navigli, R., Ponzetto, S.P.: BabelNet: The automatic construction, evaluation and
application of a wide-coverage multilingual semantic network. Artificial Intelligence
193, 217–250 (2012)
12. Rao, D., McNamee, P., Dredze, M.: Entity linking: Finding extracted entities in a
knowledge base. In: Multi-source, Multilingual Information Extraction and Sum-
marization, pp. 93–115 (2013)
Help me describe my data: A demonstration of
the Open PHACTS VoID Editor
Carole Goble1 , Alasdair J G Gray2 , and Eleftherios Tatakis1
1
School of Computer Science, University of Manchester, Manchester, UK
2
Department of Computer Science, Heriot-Watt University, Edinburgh, UK
Abstract. The Open PHACTS VoID Editor helps non-Semantic Web
experts to create machine interpretable descriptions for their datasets.
The web app guides the user, an expert in the domain of the data,
through a series of questions to capture details of their dataset and then
generates a VoID dataset description. The generated dataset description
conforms to the Open PHACTS dataset description guidelines that en-
sure suitable provenance information is available about the dataset to
enable its discovery and reuse.
The VoID Editor is available at http://voideditor.cs.man.ac.uk.
The source code can be found at
https://github.com/openphacts/Void-Editor2.
Keywords: Dataset descriptions, VoID, Provenance, Metadata
1 Motivating Problem
Users of systems such as the Open PHACTS Discovery Platform3 [1,2] need to
know which datasets have been integrated. In the scientific domain they partic-
ularly need to know which version of a dataset is loaded in order to correctly
interpret the results returned by the platform. To satisfy this need, the prove-
nance of the datasets loaded into the Open PHACTS Discovery Platform is
needed. This provenance information is then available for any data returned
by the platform’s API. Within the Open PHACTS project we have identified
a minimal set of metadata that should be provided to aid understanding and
reuse of the data [3]. Additionally, we recommend that the metadata is provided
using the VoID vocabulary [4] so that the data is self-describing and machine
processable.
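To give a flavour of what such a machine-processable description looks like, the sketch below builds a very small VoID dataset description with the Jena Model API. The dataset IRI, title, and licence are invented examples, and the property set is far smaller than what the Open PHACTS guidelines [3] actually require.

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.DCTerms;
import org.apache.jena.vocabulary.RDF;

public class VoidSketch {
    public static void main(String[] args) {
        String VOID = "http://rdfs.org/ns/void#";
        Model m = ModelFactory.createDefaultModel();
        m.setNsPrefix("void", VOID);
        m.setNsPrefix("dcterms", DCTerms.NS);

        Property triples = m.createProperty(VOID, "triples");
        Resource voidDataset = m.createResource(VOID + "Dataset");

        // An invented example dataset with a handful of core properties.
        m.createResource("http://example.org/dataset/my-chemistry-data")
            .addProperty(RDF.type, voidDataset)
            .addProperty(DCTerms.title, "Example chemistry dataset (RDF distribution)")
            .addProperty(DCTerms.license,
                         m.createResource("http://creativecommons.org/licenses/by/4.0/"))
            .addLiteral(triples, 1234567L);

        m.write(System.out, "TURTLE");  // analogous output to the editor's 'Export RDF' step
    }
}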
Open PHACTS does not publish its own datasets; it integrates existing pub-
licly available domain data. Typically the publishers of these scientific data sets
are experts in their scientific domain, viz. chemistry or biology, but not in the
semantic web. They need to be supported in the creation of VoID descriptions
of their datasets which may have been published in a database and converted
into RDF. A tool which hides the underlying details of the semantic web but
enables the creation of descriptions understandable to a domain expert is thus
needed.
3
https://dev.openphacts.org/ accessed July 2014
Fig. 1: Screenshot of the VoID Editor
2 VoID Editor
The aim of the VoID Editor (see screenshot in Figure 1) is to allow a data pub-
lisher to create validated dataset descriptions within 30 minutes. In particular,
the data publisher does not need to read and understand the Open PHACTS
dataset descriptions guidelines [3] which provide a checklist of the RDF prop-
erties that must and should be provided. There is also no need for the data
publisher to understand RDF or the VoID vocabulary.
The VoID Editor is a web application that guides the data provider through
a series of questions to acquire the required metadata properties. The user is first
asked for details about themselves and other individuals involved in the author-
ing of the data. Core publishing metadata such as the publishing organisation
and the license are then gathered. The user is then asked for versioning infor-
mation and the expected update frequency of the data. The Sources tab helps
the user to provide details of source datasets from which their data is derived.
They can either select from the datasets already known to the Open PHACTS
Discovery Platform or enter the details manually. The list of known datasets is
populated by a call to the Open PHACTS API. The Distribution Formats tab
allows the user to describe the distributions in which the data is provided, e.g.
RDF, database dump, or CSV. The final screen allows the user to export the
RDF of their dataset description as well as providing a summary of any vali-
dation errors, e.g. not supplying a license, which is a required field; such errors
will already have been indicated by a red bar at the top of the screen containing
an error message. Note that the ‘Export RDF’ button is only activated when a
valid dataset description can be created, i.e. all required fields have been filled in.
Fig. 2: Screenshots of the Linkset Editor
At any stage, the generated RDF dataset description may be inspected by
clicking the ‘Under the Hood’ button. This button can also be used to save a par-
tially generated description that can later be imported into the editor through
the ‘Import VoID’ button. The ‘Under the Hood’ feature is also useful for se-
mantic web experts to see what is being generated at any stage.
3 Linkset Editor
As a companion to the VoID Editor, a Linkset Editor (see screenshot in Figure 2)
has been developed. The Linkset Editor allows for the creation of descriptions
of the links between two datasets. The same interface design and framework are
used.
The Linkset Editor reuses the first three tabs of the VoID Editor to capture
details of the authors, core publishing information, and details about versioning.
The Source/Target tab allows the user to select the pair of datasets that are
connected by the linkset. Again, the list of possible datasets is generated by
a call to the Open PHACTS API. The Link Info tab asks the user to declare
the link predicate used in the linkset and provide some justification to capture
the nature of the equality relationship encoded in the links. (For details about
linkset justifications, please see Section 5 of [3].)
4 Implementation
The VoID and Linkset Editors have been implemented using AngularJS as the
JavaScript framework for the web client, with a server implementation using Jena
libraries. A user-centric approach was followed for the design and development
of the VoID Editor. A small number of data providers were consulted about the
type of tool they required with regular interviews and feedback on prototype
versions. A larger number of potential users were involved in an evaluation of
the VoID Editor. Full details can be found in [5].
In the future we plan to investigate how the VoID Editor can generate tem-
plate descriptions that can be populated as part of the data publishing pipeline.
We also plan to look at how the editor could be adapted to other dataset descrip-
tion guidelines, e.g. DCAT4 or the W3C HCLS community profile5 . However,
this is not a straightforward process since considerable care and attention is paid
to the phrasing and grouping of questions to ensure a pleasant user experience.
Acknowledgements
The research has received support from the Innovative Medicines Initiative Joint
Undertaking under grant agreement number 115191, resources of which are com-
posed of financial contribution from the European Union’s Seventh Framework
Programme (FP7/2007- 2013) and EFPIA companies in kind contribution.
References
1. Gray, A.J.G., Groth, P., Loizou, A., Askjaer, S., Brenninkmeijer, C.Y.A., Burger,
K., Chichester, C., Evelo, C.T., Goble, C.A., Harland, L., Pettifer, S., Thompson,
M., Waagmeester, A., Williams, A.J.: Applying linked data approaches to phar-
macology: Architectural decisions and implementation. Semantic Web 5(2) (2014)
101–113 doi:10.3233/SW-2012-0088.
2. Groth, P., Loizou, A., Gray, A.J.G., Goble, C., Harland, L., Pettifer, S.: API-centric
Linked Data Integration: The Open PHACTS Discovery Platform Case Study. Jour-
nal of Web Semantics (2014) In press. doi:10.1016/j.websem.2014.03.003.
3. Gray, A.J.G.: Dataset descriptions for the Open Pharmacological Space. Working
draft, Open PHACTS (September 2013)
4. Alexander, K., Cyganiak, R., Hausenblas, M., Zhao, J.: Describing Linked Datasets
with the VoID Vocabulary. Note, W3C (March 2011)
5. Tatakis, E.: VoID Editor v2. Undergraduate dissertation, School of Computer
Science, University of Manchester, Manchester, UK (April 2014)
4
http://www.w3.org/TR/vocab-dcat/ accessed July 2014
5
http://www.w3.org/2001/sw/hcls/notes/hclsdataset/ accessed July 2014
OUSocial2 - A Platform for Gathering Students’
Feedback from Social Media
Keerthi Thomas, Miriam Fernandez, Stuart Brown, and Harith Alani
Open University, UK
first.last@open.ac.uk, h.alani@open.ac.uk
Abstract. Universities strive to collect feedback from students to improve their
courses and tutorship. Such feedback is often collected at the end of a course via
survey forms. However, such methods of collecting feedback are too controlled,
slow, and passive. With the rise of social media, many students are finding online
venues to group and share their experiences and seek peers’ support. OUSocial2
is a platform that monitors behaviour, sentiment, and topics, in open social media
groups set up by, and for, Open University students. It captures anonymous feed-
back from students towards their courses, and tracks the evolution of engagement
behaviour and sentiment within those groups.
Keywords: social media, behaviour analysis, sentiment analysis
1 Introduction
The Open University (OU) has around 250 thousand students, rendering it the largest
university in the United Kingdom and one of the leading distance teaching institutions.
Although the university provides students with several websites and applications where
they can discuss their courses with their tutors and peers, many seem to be more en-
gaged in such discussions on open social media platforms, such as Facebook groups.
Social media has become a rich source for student feedback, which could be col-
lected and investigated, to capture any issues and problems in real time, as well as to
monitor the engagement of students with their courses and peers. Student retention is
especially challenging in distance learning, and close monitoring of students' activities
and involvement can greatly help to predict student churn, thus giving their
tutors an opportunity to intervene and support disengaging or struggling students [4].
OUSocial2 is a prototypical platform for collecting and analysing content from rel-
evant and public Facebook groups, set up by OU students. These open groups have been
set up to bring together students enrolled in particular OU courses or mod-
ules. OUSocial2 extends its predecessor, described in [1], with a completely
new interface and a lexicon-based sentiment tracking service.
More specifically, the objectives of the OUSocial2 project are:
1. Build a data collection service for gathering, storing, and integrating data from
public Facebook groups related to OU
2. Develop and train a model for identifying the behaviour of individual users based
on their activities and interactions in the Facebook online groups
3. Extract the topics that emerge in Facebook group discussions
4. Track the sentiment expressed about the specific topics by the group members
This paper describes the OUSocial2 platform’s architecture, analysis components,
and data enrichment with the OU's linked data portal.
Demo: A fully working OUSocial2 platform will be demoed at the conference, running
over 44 groups from Facebook, with a total of 172,695 posts from 19,759 users. The au-
dience will be able to see how the various analysis components described below can
be used to assess and monitor engagement of students in course groups, their evolving
sentiment, and topics. For privacy reasons, the live demo is not publicly available yet.
A video recording of the demo is available at:
https://dl.dropboxusercontent.com/u/17906712/ousocial2-demo.mp4
https://dl.dropboxusercontent.com/u/17906712/ousocial2-demo.avi
2 OUSocial2
In this section we describe the three main OUSocial2 analysis components and how
their output is visualised in the demo. The Facebook API is used to collect all posts and
interactions from public groups about OU courses. 44 such groups are identified by
matching their titles to official OU course codes (e.g., T224). Collected data includes
group ID, posts’ content, owner, time of posting, whether the post is a reply to another
post, users, etc. Data collection is reactivated every 24 hours to update the database.
Fig. 1. Distribution of behaviour roles over time for several selected groups
2.1 Behaviour Analyser
This component applies our behaviour analysis service (see [3]), which uses machine
learning and SPIN (spinrdf.org/) rules to identify the roles of users. Understanding
the behaviour composition of a group, and the evolution of behaviour roles of individ-
uals (micro) and groups (macro) is useful for assessing user engagement and future
prospects [3, 2].
This component identifies eight types of roles: Lurker, Follower, Daily User, Con-
tributor, Broadcaster, Leader, Celebrity, and Super User. Figure 1 is the OUSocial2
display of the role compositions of the top 10 active groups. The slider at the bottom
allows viewing the roles at different points in time. The number and percentage of each type
of role in a group is displayed on the right hand side. Engagement of particular group
members can also be studied (see Figure 2).
Fig. 2. User behaviour over time
Fig. 3. Evolution of group sentiment over time. Red line is average sentiment across all groups.
2.2 Sentiment Analyser
The sentiment analysis component calculates the sentiment for each post. We use Sen-
tiStrength,1 a lexicon-based sentiment analyser, to estimate the strength of positive and
negative sentiments in our posts. We calculate sentiment at the community and member
levels. OUSocial2 users can visualise and compare the evolution of sentiment in se-
lected groups (Figure 3). Users can also see the sentiment distribution of a given group
over time (Fig. 4) and, upon clicking on a specific time point, the top positive, negative,
and neutral posts are listed (Fig. 5).
1
http://sentistrength.wlv.ac.uk/
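For readers unfamiliar with lexicon-based scoring, the toy example below illustrates the general idea of assigning a post separate positive and negative strength scores; it is a simplification and not SentiStrength itself, whose lexicon and heuristics are far richer.

import java.util.Map;

public class LexiconSentimentSketch {

    // A tiny illustrative lexicon: word -> strength (positive 1..5, negative -1..-5).
    static final Map<String, Integer> LEXICON = Map.of(
        "great", 4, "helpful", 3, "love", 4,
        "confusing", -3, "hate", -4, "awful", -4);

    // Returns {maxPositive, maxNegative} for a post, in the spirit of dual-polarity scoring.
    static int[] score(String post) {
        int pos = 1, neg = -1;   // neutral baseline
        for (String token : post.toLowerCase().split("\\W+")) {
            int s = LEXICON.getOrDefault(token, 0);
            if (s > pos) pos = s;
            if (s < neg) neg = s;
        }
        return new int[] { pos, neg };
    }
}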
Fig. 4. Overall positive and negative sentiment levels in a group
Fig. 5. Topics appearing in posts with positive sentiment
Fig. 6. Topics in positive posts
2.3 Topic Analyser
Several named entity recognition systems have emerged recently, such as Textwise,
Zemanta, DBpedia Spotlight, OpenCalais, Alchemy API, and TextRazor. OUSocial2
uses TextRazor since it seems to provide the best accuracy in our context. TextRazor
(textrazor.com/) identifies entities from our posts, and returns the relevant URIs from
DBpedia and Freebase, with confidence scores. Users of OUSocial2 can view the topics
that appear in posts, in tag clouds of positive or negative entities (Fig. 6), to help them
spot any issues or concerns raised by the members of these groups.
2.4 Data Enrichment
Official OU information about all courses already exists as linked data at data.open.ac.uk.
Course information can be queried via SPARQL to retrieve course titles, descriptions, topic cat-
egories, relevant courses, etc.
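As an illustration of that enrichment step (hypothetical, not OUSocial2 code), the sketch below sends a query to the portal's public SPARQL endpoint using Jena. The endpoint path and the use of rdfs:label as the lookup property are assumptions, since the actual data.open.ac.uk vocabulary is not described here.

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class CourseLookupSketch {
    public static void main(String[] args) {
        // Placeholder query: find resources whose label mentions a course code.
        String query =
            "SELECT ?course ?label WHERE { " +
            "  ?course <http://www.w3.org/2000/01/rdf-schema#label> ?label . " +
            "  FILTER(CONTAINS(STR(?label), \"T224\")) " +
            "} LIMIT 10";
        // sparqlService sends the query to the remote SPARQL endpoint over HTTP.
        try (QueryExecution qe =
                 QueryExecutionFactory.sparqlService("http://data.open.ac.uk/sparql", query)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.getResource("course") + "  " + row.getLiteral("label"));
            }
        }
    }
}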
3 Feedback and Future Work
OUSocial2 was demonstrated to the university's executive board and strategy office, and
was generally very well received as a tool that could enhance our collection of feedback
and speed up our reaction to any concerns or challenges raised by students. Privacy
was raised as an important issue, and further steps are planned to abstract any informa-
tion that could lead to the identification of students. It was suggested that sentiment and
engagement results could be compared to actual students’ performance on the courses
in question, as well as to their end-of-year feedback forms. Other requests include the
implementation of alerts on abnormal activities in chosen groups (e.g., drop in engage-
ment, rise in negative sentiment), and a comparison between groups on the same course
but on different years.
Sentiment analysis was done using SentiStrength, a general-purpose lexicon-based
tool. However, results showed that many posts on a course about World Wars were being
incorrectly flagged as negative, whereas they were simply mentioning various course
topic words (e.g., war, deaths, holocaust), rather than expressing a negative opinion
about the topic or course itself. We plan to investigate using course descriptions to
further tune our sentiment analyses.
4 Conclusions
In this document we described the main components of OUSocial2, a web-based tool
for assessing and monitoring students' engagement and sentiment in public social media
groups about their courses. The tool enables course leaders and the university to become
aware of potential concerns and factors that could lead students to quit their courses,
which is a known problem with online learning and MOOCs.
References
1. M. Fernandez, H. Alani, and S. Brown. Ou social: reaching students in social media. In Proc.
12th International Semantic Web Conference (ISWC 2013) - Demo, Sydney, Australia, 2013.
2. M. Rowe and H. Alani. What makes communities tick? community health analysis using role
compositions. In 4th IEEE Int. Conf. Social Computing (SocialCom), Amsterdam, 2012.
3. M. Rowe, M. Fernandez, S. Angeletou, and H. Alani. Community analysis through semantic
rules and role composition derivation. Journal of Web Semantics (JWS), 18(1):31–47, 2013.
4. A. Wolff, Z. Zdrahal, A. Nikolov, and M. Pantucek. Improving retention: predicting at-risk
students by analysing clicking behaviour in a virtual learning environment. In Third Confer-
ence on Learning Analytics and Knowledge (LAK 2013), Leuven, Belgium, 2013.
Using an Ontology Learning System for Trend
Analysis and Detection
Gerhard Wohlgenannt and Stefan Belk and Matyas Karacsonyi and Matthias
Schett
Vienna Univ. of Economics and Business, Welthandelsplatz 1, 1200 Wien, Austria
{gerhard.wohlgenannt,stefan.belk,matyas.karacsonyi}@wu.ac.at
http://www.wu.ac.at
Abstract. The aim of ontology learning is to generate domain models
(semi-) automatically. We apply an ontology learning system to create
domain ontologies from scratch at monthly intervals and use the re-
sulting data to detect and analyze trends in the domain. In contrast to
traditional trend analysis on the level of single terms, the application
of semantic technologies allows for a more abstract and integrated view
of the domain. A Web frontend displays the resulting ontologies, and a
number of analyses are performed on the data collected. This frontend
can be used to detect trends and evolution in a domain, and dissect them
on an aggregated as well as a fine-grained level.
Keywords: trend detection, ontology evolution, semantic technologies,
ontology learning
1 Introduction
Ontologies are a cornerstone technology of the Semantic Web. As the manual
construction of ontologies is expensive, there have been a number of efforts toward
(semi-)automatic ontology learning (OL). The demo application builds upon an
existing OL system, but extends it for use as a Web intelligence and trend
detection tool.
As the system generates lightweight domain ontologies from scratch at reg-
ular intervals (i.e. monthly), the starting point is always the same. This allows
meaningful comparisons between ontologies, making it possible to trace ontology
evolution and general trends in the domain. The system captures an abundance of data
about the ontologies in a relational database, from high-level to low-level (see
below), which helps to analyze and visualize trends. The OL system generates
ontologies from 32 heterogeneous evidence sources, which contain domain data
from the respective period of time, so we can not only analyze the resulting
ontologies but also trace which sources support which ontological elements.
In summary, we use Semantic Web technologies as a Web intelligence tool by
extending the system with visual and analytic components for trend detection.
Trend detection is a major issue in a world that is changing rapidly. Timely
detection of trends (and reaction to them) is important in many areas, e.g. for
success in business [2].
2 The Underlying Ontology Learning System
This section gives a brief introduction to the ontology learning (OL) system,
as well as the sources of evidence used. We try to be as brief as possible, and
include only the information crucial to understanding the trend detection application
(for more details see the related work section and the referenced literature).
All trend detection analyses described in the following are based on a specific
system for OL and ontology evolution. The system learns lightweight ontologies,
more precisely taxonomies plus unlabeled non-taxonomic relations, from het-
erogeneous input sources. At the moment we use “climate change” as our test
domain, and generate ontologies in monthly intervals. As the framework learns
from scratch, it starts with a small seed ontology (two static concepts). For this
seed ontology, we collect evidence from the evidence sources, and integrate the
data (typically a few thousand terms including their relation to the seed con-
cepts) into a spreading activation network. The spreading activation algorithm
selects the 25 (current setting) most important new domain concept candidates.
The only step which needs human assessment is a relevance check for the con-
cept candidates done with crowdsourcing. A positioning step integrates the can-
didates into the existing seed ontology. This concludes the first “stage” of OL.
We then use the extended ontology as new seed ontology, and start over. The
system halts after three rounds of extension.
As already mentioned, the learning process relies on 32 heterogeneous ev-
idence sources. Most of these sources are very dynamic and therefore well suited
for trend detection. The text-based sources include domain-specific corpora ex-
tracted from news media articles (segregated by country of origin), Web sites
of NGOs and Fortune 1000 companies, domain-filtered postings from Facebook,
Youtube, etc. We use keyword extraction and Hearst-style patterns to collect
evidence, i.e. terms and relations. Furthermore, the system queries Social Web
APIs (Twitter, Flickr) to get related terms. We also use a few rather static
sources, such as WordNet and DBpedia, to help with taxonomy building.
3 Trend Detection and Analysis on Different Levels
Our demo system contains three main areas, namely (i) the ontologies, i.e., the
monthly snapshots of the domain model, (ii) high-level evolution, which includes
aggregated analyses of the characteristics of the evidence sources and ontologies,
and (iii) low-level evolution, which traces the dynamics of concepts and individual
pieces of evidence on a fine-grained level. The demo portal can be found at http://
hugo.ai.wu.ac.at:5050, a screencast presentation of the portal is available at
http://ai.wu.ac.at/~wohlg/iswc-demo.mp4.
3.1 Ontologies
The Ontologies menu lists all ontologies computed per computation setting. The
computation setting is simply a distinct system configuration. By clicking on an
ontology, the system displays detailed information. This includes representations
in OWL/Turtle syntax and as a graph of the resulting ontology, as well as of
intermediary results. A user also finds performance data and the list of concepts
by extension level. For a more detailed analysis, one can take a look at all
evidence collected and used in the learning process. Multiple viewpoints (by
concept, by evidence source, . . . ) allow investigating the underlying data.
In a nutshell, the Ontologies menu facilitates the analysis of trends in the
domain both on the level of ontologies and the underlying evidence data.
Fig. 1. Example snippet from a generated ontology (shown as a graph).
3.2 Low-Level Evolution
The Concept History shows which concepts have been added to and removed from
the ontology over time – for a specific system setting. For example, due to media
coverage on hurricanes in October 2013 (see also Google trends), the concept hur-
ricane was added to the ontology in November 2013 (in most settings). Entering
“hurricane” as concept candidate in the ECM analysis presents the fine-grained
development of evidence of the concept. Figure 2 shows which sources (US news
media, UK news media, etc.) support the concept to what extent.
3.3 High-Level Evolution
The High-Level Evolution menu includes tools and visualizations to trace the
evolution of evidence sources and the quality of the OL algorithms. For example,
the source impact vector (SIV) graph shows the impact of the evidence sources on
the system, which is computed according to the observed quality of suggestions
from these sources. Source evolution displays the evolution of the quality of concept
candidates suggested by each source.
Fig. 2. (Keyword-generated) evidence for concept hurricane in various Web corpora
(News Media, NGO websites, etc.)
4 Related Work
More information about the OL system used as foundation for the trend detec-
tion experiments and visualizations can be found in Weichselbraun et al. [3] and
Wohlgenannt et al. [4]. A number of approaches have been proposed for trend
detection from text data. For example, Bolelli et al. [1] first divide documents
into time segments, then detect topics with a latent Dirichlet allocation model,
and finally trace the evolution of the topics. In the realm of social media, Twit-
terMonitor [2] identifies trends on Twitter in real time.
5 Conclusions
The demo application uses Semantic Web (ontology learning) technologies to
facilitate trend analysis and detection in a given domain. Users can trace change
on different levels: (i) the level of the ontologies themselves, (ii) the aggregated
level of quality of the system and impact of evidence sources, and (iii) the fine-
grained level of concepts and individual evidence. The fine-grained level is especially
helpful to determine the reasons for trends in the sources of evidence. Future
work will include the implementation of additional analyses and visualizations
and the application of the tool in other domains, for example finance and politics.
References
1. Bolelli, L., Ertekin, S., Giles, C.L.: Topic and trend detection in text collections
using latent Dirichlet allocation. In: Proc. 31st European Conf. on IR Research. pp.
776–780. ECIR ’09, Springer-Verlag, Berlin, Heidelberg (2009)
2. Mathioudakis, M., Koudas, N.: Twittermonitor: Trend detection over the twitter
stream. In: Proc. of the 2010 ACM SIGMOD Int. Conference on Management of
Data. pp. 1155–1158. SIGMOD ’10, ACM, New York, NY, USA (2010)
3. Weichselbraun, A., Wohlgenannt, G., Scharl, A.: Refining non-taxonomic relation
labels with external structured data to support ontology learning. Data & Knowl-
edge Engineering 69(8), 763–778 (2010)
4. Wohlgenannt, G., Weichselbraun, A., Scharl, A., Sabou, M.: Dynamic integration
of multiple evidence sources for ontology learning. Journal of Information and Data
Management (JIDM) 3(3), 243–254 (2012)
A Prototype Service for Benchmarking Power
Consumption of Mobile Semantic Applications
Evan W. Patton and Deborah L. McGuinness
Rensselaer Polytechnic Institute
110 8th Street, Troy NY 12180 USA
{pattoe, dlm}@cs.rpi.edu
http://tw.rpi.edu/
Abstract. We present a prototype web service that enables researchers
to evaluate the performance per watt of semantic web tools. The web
service provides access to a hardware platform for collecting power con-
sumption data for a mobile device. Experiments are specified using RDF
to define the conditions of the experiment, the operations that compose
those conditions, and how they are combined into individual execution
plans. Further, experimental descriptions and their provenance are pub-
lished as linked data, allowing others to easily repeat experiments. We
will demonstrate how we have used the system to date, how others can
use it, and discuss its potential to revolutionize design and development
of semantically enabled mobile applications.
Keywords: reasoning, mobile, power, performance, resource-constrained
1 Introduction
One challenge that semantic technologies face when deployed on mobile plat-
forms like smartphones is the amount of energy available for the device to com-
pute and communicate with other agents. For example, the Google Nexus One,
one of the first Android smartphones, had a single core processor operating at
1 GHz and 512 MB of RAM. Samsung's latest offering, the Galaxy S5, has a
quad-core, 2.5 GHz processor and 2 GB of RAM, more than an 8-fold increase in
processing power and a 4-fold increase in memory in 5 years. However, the battery
capacities of the two phones are 1400 mAh and 2800 mAh, respectively, which indicates
that battery technology is progressing more slowly than processing technology.
Further, application complexity has also increased. Tools are needed to help de-
velopers understand how semantic tools consume power so as to identify when
they can use local reasoning on mobile platforms or when off-device computation
is more practical.
We introduced a broadly reusable methodology [3] motivated by these con-
cerns to evaluate the performance of reasoners relative to the amount of energy
consumed during operation. Ultimately, these metrics will provide developers
deeper insight into power consumption and enable next-generation applications
of semantic technologies for power constrained devices. We present a prototype
Table 1. A data sample for query 14 (Listing 1.1) from LUBM executed on the Sam-
sung Galaxy S4. Times are in milliseconds, memory is in kilobytes, and power is in
milliwatts.
Reasoner Init. Ont. Load Data Load Query Plan Answer Memory Power
Jena 0.122 372.6 7076 2.594 233.2 35023 944
Pellet 0.152 355.7 8872 1.984 12350 59418 1024
HermiT 0.427 407.8 17442 0.092 21205 58720 995
ontology-driven web service for researchers to use our reference hardware setup
to perform analysis of semantic web tools’ power consumption.
2 Web Service for Power-Performance Evaluation
Our power benchmarking methodology [3] bypasses the removable battery in a
Samsung Galaxy S4 to collect power data during reasoning and query answering
tasks using three reasoning engines. Because our methodology requires a hard-
ware setup, we are developing and will demonstrate a web service to execute
experiments using our existing infrastructure. The web service is based on the
Semantic Automated Discovery and Integration (SADI) Framework1 and accepts
jobs described using RDF and the ontology we will discuss in Section 3. On com-
pletion, it provides a ZIP file containing runtime information, raw and processed
power measurements, power and energy consumption statistics, and provenance
capturing information about the process. Table 1 shows a sample data point for
each of three different reasoners on the Lehigh University Benchmark [2], query
14 (shown in Listing 1.1).
Listing 1.1. Lehigh University Benchmark query 14
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X WHERE {?X rdf:type ub:UndergraduateStudent}
3 Toward an Ontology for Experiment Descriptions
We will demonstrate our experimental ontology for declaratively describing the
operational constraints of an experiment, which is then executed on the target
device. The experiment description is published as linked data, along with meta-
data about the experiment output, and provenance modeled using the PROV-O
ontology [1]. These metadata are published in a triple store to enable meta-
analysis, recombination, and extension of power experiments.
1 http://sadiframework.org/content/
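As an illustration of such meta-analysis, the sketch below queries a triple store of this kind for experiments and the runs that used them. The prov: terms are standard PROV-O, whereas the exp: namespace URI and the exact shape of the stored descriptions are assumptions made only for this example.

PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX exp: <http://example.org/power-experiment#>
SELECT ?experiment ?name ?run ?ended WHERE {
  ?experiment a exp:Experiment ;
              exp:name ?name .
  ?run prov:used ?experiment ;       # a run (activity) that consumed the experiment description
       prov:endedAtTime ?ended .
}
ORDER BY DESC(?ended)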
Experiment. Experiment provides the root of an experiment description. List-
ing 1.2 shows an example experiment. Core to an experiment description are
conditions, which are grouped together based on some common dimension. If
an experiment defines more than one condition group, the engine performing
the experiment can generate specific conditions through the use of a condition-
FillStrategy. We are currently investigating two different strategies, CrossJoin
and Paired, that evaluate the cross product and paired conditions across groups,
respectively. To provide control over what data are returned, the author of the
experiment can declare which variables are of interest.
Listing 1.2. An example description of an experiment
[] a exp:Experiment ;
exp:name "LUBM on Android" ;
exp:version "1.0" ;
dc:creator ;
exp:trials 30 ;
exp:conditions :ReasonerConditionGroup,
:OntologyConditionGroup ;
exp:conditionFillStrategy exp:CrossJoin ;
exp:dependentVariable exp:ExecutionInterval,
exp:AveragePowerConsumption, exp:MaxPowerConsumption .
Conditions and Condition Groups. Conditions are the highest unit of test in
our experiment ontology. They are composed of collections of operations that
specify a sequence of actions to take on the device. Listing 1.3 shows an example
of an ontology condition group that specifies two different ontology operations.
Currently, we only support nominal values, but future versions of the ontology
will also support ordinal, scalar, and ratio level inputs.
Listing 1.3. An example of a condition group with conditions
:OntologyConditionGroup a exp:ConditionGroup ;
exp:name "Ontology Condition" ;
exp:varies exp:OntologyDatasetQueryOperation ;
exp:nominalValues ( :SchemaOrgOperations :LUBMOperations ) .
Operations. Operations encapsulate actions to be performed on the experimental
device. In Listing 1.4, we show an example of how an operation would define tests
for the LUBM query set. The measurePowerDuringDownload property can be
used to evaluate the performance of communication channels while retrieving the
content required for performing the experiment. In addition to loading ontolo-
gies, datasets, and executing queries, our ontology supports modeling reasoners,
parallel and sequential operations, and randomization of operations.2
2 Due to space constraints, we cannot elaborate on the details of modeling each oper-
ation type. For more information and examples, please see http://power.tw.rpi.edu
Listing 1.4. An example of a combined operation on an ontology, dataset, and queries
:LUBMOperations a exp:OntologyDatasetQueryOperation ;
exp:name "LUBM" ;
exp:measurePowerDuringDownload false ;
exp:ontology lubm:univ-bench.owl ;
exp:dataset lubm:lubm-100k.ttl ;
exp:query lubm:query1.rq , lubm:query2.rq , ... .
4 Discussion and Conclusions
As semantic technologies become more prevalent, we need to ensure that tools
are available to assist in their deployment on a variety of devices including mobile
platforms, which are often power constrained. While our initial work provides
direction for this e↵ort, we recognize that more widespread adoption requires
lower barriers to entry. We described a web service under active development
to provide access to a reference implementation of the hardware described in [3]
where we found that, while compute time accounts for most energy consumption,
significant memory consumption may affect power consumption during reason-
ing. With this investigation, we are working to enable semantic web researchers
and implementers to obtain insight into the power requirements for semantic
technology stacks. We will demonstrate a variety of example experiments and
discuss broader usage with attendees. In future work we intend to expand
this web service to provide a means of easily repeating experiments as well as
further support for modeling the execution of experiments. We also in-
tend to provide example code that presents an analysis of more reasoners, e.g., by
utilizing the work in [4].
Acknowledgements
Mr. Patton was funded by an NSF Graduate Research Fellowship. RPI's Tether-
less World Constellation is supported in part by Fujitsu, Lockheed Martin, LGS,
Microsoft Research, Qualcomm, in addition to sponsored research from DARPA,
IARPA, NASA, NIST, NSF, and USGS.
References
1. Belhajjame, K., Cheney, J., Corsar, D., Garijo, D., Soiland-Reyes, S., Zednik, S.,
Zhao, J.: PROV-O: The PROV ontology. Tech. rep., W3C (2013), http://www.w3.
org/TR/prov-o/
2. Guo, Y., Pan, Z., Heflin, J.: LUBM: A benchmark for OWL knowledge base
systems. Web Semantics 3(2), 158–182 (2005)
3. Patton, E.W., McGuinness, D.L.: A power consumption benchmark for reasoners on
mobile devices. In: Proceedings of the 13th International Semantic Web Conference:
Replication, Benchmark, Data & Software Track (2014)
4. Yus, R., Bobed, C., Esteban, G., Bobillo, F., Mena, E.: Android goes semantic: DL
reasoners on smartphones. In: OWL Reasoner Evaluation Workshop 2013 (2013)
44
Sparklis: a SPARQL Endpoint Explorer
for Expressive Question Answering
Sébastien Ferré
IRISA, Université de Rennes 1
Campus de Beaulieu, 35042 Rennes cedex, France
Email: ferre@irisa.fr
Abstract. Sparklis is a Semantic Web tool that helps users explore
SPARQL endpoints by guiding them in the interactive building of
questions and answers, from simple ones to complex ones. It com-
bines the fine-grained guidance of faceted search, most of the expressiv-
ity of SPARQL, and the readability of (controlled) natural languages.
No endpoint-specific configuration is necessary, and no knowledge of
SPARQL and the data schema is required from users. This demonstra-
tion paper is a companion to the research paper [2].
1 Motivation
A wealth of semantic data is accessible through SPARQL endpoints. DBpedia
alone contains several billion triples covering all sorts of topics (e.g., people,
places, buildings, species, films, books). Although different endpoints may use
different vocabularies and ontologies, they all share a common interface to access
and retrieve semantic data: the SPARQL query language. In addition to being a
widely-adopted W3C standard, the advantages of SPARQL are its expressivity,
especially since version 1.1, and its scalability for large RDF stores thanks to
highly optimized SPARQL engines (e.g., Virtuoso, Jena TDB). Its main draw-
back is that writing SPARQL queries is a tedious and error-prone task, and is
largely inaccessible to most potential users of semantic data.
Our motivation in developing Sparklis1 , shared by many other developers
of Semantic Web tools and applications, is to unleash access to semantic data by
making it easier to define and send SPARQL queries to endpoints. The novelty
of Sparklis is to combine in an integrated fashion different search paradigms:
Faceted Search (FS), Query Builders (QB), and Natural Language Interfaces
(NLI). That integration is the key to reconciling properties for which there is
generally a trade-off in existing systems: user guidance, expressivity, readability
of queries, scalability, and portability to different endpoints [2].
2 Principles
Sparklis re-uses and generalizes the interaction model of Faceted Search
(FS) [8], where users are guided step-by-step in the selection of items. At each
1 Online at http://www.irisa.fr/LIS/ferre/sparklis/osparklis.html
step, the system gives a set of suggestions to refine the current selection, and
users only have to pick a suggestion according to their preferences. The sugges-
tions are specific to the selection, and therefore support exploratory search [7]
by providing overview and feedback during the search process.
To overcome expressivity limitations of FS and existing extensions for the Se-
mantic Web (e.g., gFacet [4], VisiNav [3], SemFacet [1]), we have generalized it
to Query-based Faceted Search (QFS), where the selection of items is replaced
by a structured query. The latter is built step-by-step through the successive
choices of the user. This makes Sparklis a kind of Query Builder (QB), like
SemanticCrystal [5]. QBs have the advantage of allowing high expressivity
while assisting users with syntax, e.g., avoiding syntax errors and listing eligible
constructs. However, the FS-based guidance of Sparklis is more fine-grained
than in QBs. Sparklis avoids vocabulary errors by retrieving the URIs and
literals right from the SPARQL endpoint. It need not be configured for a par-
ticular dataset, and dynamically discovers the data schema. In fact, Sparklis
only allows the building of queries that do return results, preventing users from
ending up with empty results. That is because system suggestions are computed for the in-
dividual results, not for their common class. In fact, Sparklis is as much about
building answers as about building questions.
To overcome the lack of readability of SPARQL queries for most users, Spark-
lis queries and suggestions are verbalized in natural language so that SPARQL
queries never need to be shown to users. This makes Sparklis a kind of Natural
Language Interface (NLI), like PowerAqua [6]. The important difference is that
questions are built through successive user choices in Sparklis instead of be-
ing freely input in NLIs. Sparklis interaction makes question formulation more
constrained, slower, and less spontaneous, but it provides guidance and safety
with intermediate answers and suggestions at each step. Moreover, it avoids
the hard problems of NL understanding, i.e., ambiguities and out-of-scope questions.
A few NLI systems, like Ginseng [5], are based on a controlled NL and auto-
completion to suggest the next words in a question. However, their suggestions
are not as fine-grained as with FS, and are less flexible because they only apply to the
end of the question. In Sparklis, questions form complete sentences at any step
of the search; and suggestions are not words but meaningful phrases (e.g., that
has a director), and can be inserted at any position in the current question.
3 User Interface and Interaction
Figure 1 is a Sparklis screenshot taken during an exploration of book
writers in DBpedia. From top to bottom, the user interface contains (1)
navigation buttons and the endpoint URL, (2) the current question and
the current focus as a subphrase (highlighted in green), (3) three lists
of suggestions for insertion at the focus, and (4) the table of answers.
The shown question and answer have been built in 10 steps (8 inser-
tions and 2 focus moves): a Writer/that has a birthDate/after 1800/focus
on a Writer/that is the author of something/a Book/a number of/the
46
Fig. 1. Sparklis screenshot: a list of writers with their birth date (after 1800), nation-
ality, and (decreasing) number of written books. Current focus is on writer’s nationality.
highest-to-lowest/focus on a Writer/that has a nationality. Note that
different insertion orderings are possible for the same question. Navigation but-
tons allow moving backward/forward in the construction history. A permalink
to the current navigation state (endpoint+question) can be generated at any
time. To switch to another SPARQL endpoint, it is enough to input its URL in
the entry field. The query focus is moved simply by clicking on different parts of
the question, or on different table column headers. Every suggestion in the three
lists, as well as every table cell, can be inserted or applied to the current focus
by clicking it. The first suggestion list contains entities (individuals and liter-
als). The second list contains concepts (classes and properties). The third list
contains logical connectives, sorting modifiers, and aggregation operators. Each
suggestion list is equipped with an immediate-feedback filtering mechanism to
quickly locate suggestions in long lists. With the first list, filters can be inserted
into the query with different filter operators listed in a drop-down menu (e.g.,
matches, higher or equal than, before). Questions and suggestions use in-
dentation to disambiguate different possible groupings and improve readability,
and syntax coloring to distinguish between the different kinds of words.
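For orientation, the question built above corresponds roughly to a SPARQL query of the following shape; the class and property URIs (dbo:Writer, dbo:birthDate, dbo:author, dbo:nationality) are taken from the DBpedia ontology, and the exact query Sparklis generates may differ in structure.

PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?writer ?birthDate ?nationality (COUNT(?book) AS ?books) WHERE {
  ?writer a dbo:Writer ;
          dbo:birthDate ?birthDate ;
          dbo:nationality ?nationality .
  ?book a dbo:Book ;
        dbo:author ?writer .
  FILTER (YEAR(?birthDate) > 1800)
}
GROUP BY ?writer ?birthDate ?nationality
ORDER BY DESC(?books)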
4 Performance and Limitations
Portability. Sparklis conforms to the SPARQL standard, and requires no pre-
processing or configuration to explore an endpoint. It entirely relies on the end-
point to discover data and its schema. The main limitation is that URIs are
displayed through their local names, which is not always readable.
Expressivity. Sparklis covers many features of SPARQL: basic graph pat-
terns (including cycles), basic filters, UNION, OPTIONAL, NOT EXISTS, SELECT,
ORDER BY, multiple aggregations with GROUP BY. Almost all queries of the
QALD2 challenge can be answered. Uncovered features are expressions, named
graphs, nested queries, queries returning RDF graphs, and updates.
Scalability. Sparklis is responsive on the largest well-known endpoint: DB-
pedia. Among the 100 QALD-3 questions, half can be answered in less than 30
seconds (wall-clock time including user interaction and system computations).
5 Demonstration
The demonstration shows participants how QALD questions over DB-
pedia can be answered in a step-by-step process. Those questions cover various
retrieval tasks: basic facts (Give me the homepage of Forbes), entity lists (Which
rivers flow into a German lake?), counts (How many languages are spoken in
Colombia?), optimums (Which of Tim Burton’s films had the highest budget?).
More complex analytical question answering has also been demonstrated (Give
me the total runtime, from highest to lowest, of films per director and per coun-
try). Participants were also given the opportunity to explore any SPARQL end-
point of their choice.
References
1. Arenas, M., Grau, B., Kharlamov, E., Š. Marciuška, Zheleznyakov, D., Jimenez-
Ruiz, E.: SemFacet: Semantic faceted search over YAGO. In: World Wide Web
Conf. Companion. pp. 123–126. WWW Steering Committee (2014)
2. Ferré, S.: Expressive and scalable query-based faceted search over SPARQL end-
points. In: Mika, P., Tudorache, T. (eds.) Int. Semantic Web Conf. Springer (2014)
3. Harth, A.: VisiNav: A system for visual search and navigation on web data. J. Web
Semantics 8(4), 348–354 (2010)
4. Heim, P., Ertl, T., Ziegler, J.: Facet graphs: Complex semantic querying made easy.
In: et al., L.A. (ed.) Extended Semantic Web Conference. pp. 288–302. LNCS 6088,
Springer (2010)
5. Kaufmann, E., Bernstein, A.: Evaluating the usability of natural language query
languages and interfaces to semantic web knowledge bases. J. Web Semantics 8(4),
377–393 (2010)
6. Lopez, V., Fernández, M., Motta, E., Stieler, N.: PowerAqua: Supporting users in
querying and exploring the semantic web. Semantic Web 3(3), 249–265 (2012)
7. Marchionini, G.: Exploratory search: from finding to understanding. Communica-
tions of the ACM 49(4), 41–46 (2006)
8. Sacco, G.M., Tzitzikas, Y. (eds.): Dynamic taxonomies and faceted search. The
information retrieval series, Springer (2009)
2 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/
Reconciling Information in DBpedia through a
Question Answering System
Elena Cabrio1,2 , Alessio Palmero Aprosio3 , and Serena Villata1
1
INRIA Sophia Antipolis, France - firstname.lastname@inria.fr
2
EURECOM, France
3
Fondazione Bruno Kessler, Trento, Italy - aprosio@fbk.eu
Abstract. Results obtained querying language-specific DBpedia chap-
ters SPARQL endpoints for the same query can be related by several het-
erogeneous relations, or contain an inconsistent set of information about
the same topic. To overcome this issue in question answering systems
over language-specific DBpedia chapters, we propose the RADAR frame-
work for information reconciliation. Starting from a categorization of
the possible relations among the resulting instances, such framework: (i)
classifies such relations, (ii) reconciles the obtained information using
argumentation theory, (iii) ranks the alternative results depending on
the confidence of the source in case of inconsistencies, and (iv) explains
the reasons underlying the proposed ranking.
1 Introduction
In the Web of Data, it is possible to retrieve heterogeneous information items
concerning a single real-world object coming from different data sources, e.g.,
the results of a single SPARQL query on different endpoints. These results may
conflict with each other, or they may be linked by some other relation like a spec-
ification. The automated detection of the kind of relationship holding between
different instances about a single object, with the goal of reconciling them, is an
open problem for data consumption in the Web of Data. In particular, this problem
arises while querying the language-specific chapters of DBpedia, which may con-
tain different information with respect to the English version. This issue becomes
therefore particularly relevant in Question Answering (QA) systems exploiting
DBpedia language-specific chapters as referential data set, since the user expects
a unique (and possibly correct) answer to her factual natural language question.
In this demo, we propose the RADAR (ReconciliAtion of Dbpedia through
ARgumentation) framework that: i) adopts a classification method to return
the relation holding between two information items; ii) applies abstract argu-
mentation theory [4] for reasoning about conflicting information and assessing
the acceptability degree of the information items, depending on the kind of rela-
tion linking them; and iii) returns the graph of the results set, together with the
acceptability degree of each information item, to motivate the resulting informa-
tion ranking. We have integrated RADAR into the QA system QAKiS [1], that
queries language-specific DBpedia chapters using a natural language interface.
2 RADAR: a Framework for Information Reconciliation
The RADAR framework (Fig. 1) takes as input a collection of results from
the same SPARQL query raised against the language-specific DBpedia chapters
SPARQL endpoints, and retrieves: (i) the sources proposing each particular el-
ement of the results set, and (ii) the elements of the results set themselves. The
first module of RADAR (Source confidence assignment score, Fig. 1) takes each
information source, and following two different heuristics, i.e., Wikipedia page
length (the chapter of the longest language-specific Wikipedia page describing
the queried entity is rewarded w.r.t. the others) and entity geo-localization (the
chapter of the language spoken in the places linked to the page of the entity is
rewarded with respect to the others), assigns a confidence degree to the source.
Such metrics are summed and normalized (0 ≤ score ≤ 1), where 0 is the least
reliable chapter for a certain entity and 1 is the most reliable one. Such con-
fidence degree will affect the reconciliation if inconsistencies arise: information
proposed by the more reliable source will obtain a higher acceptability degree.
Fig. 1: RADAR framework architecture.
The second module (Relation classification module, Fig. 1) starts from the results
set, and it matches every element with all the other returned elements, detecting
the kind of relation holding between this pair of elements, following the catego-
rization of [3]. Such categories correspond to the linguistic phenomena (mainly
discourse and lexical semantics relations) holding among heterogeneous values
obtained querying two DBpedia language-specific chapters, given a certain
subject and a certain ontological property. RADAR clusters the relations of iden-
tity, disambiguated entity and coreference into a unique category, called surface
variants of the entity, and automatically detects such relation among two enti-
ties applying one of the following strategies: cross-lingual links (using WikiData),
text identity (i.e., string matching), Wikipedia redirection and disambiguation
pages. Moreover, RADAR integrates into a unique category geo-specification
and renaming, and classifies a relation in this category when, according to GeoNames,
one entity is contained in the other. We also consider the alternative
names gazetteer included in GeoNames, and geographical information extracted
from English Wikipedia infoboxes, such as Infobox former country. Finally,
RADAR clusters meronymy, hyponymy, metonymy and identity:stage name into
a unique category, called inclusion, and detects it exploiting a set of features
extracted from heterogeneous resources: MusicBrainz, NCF Thesaurus, DBpe-
50
dia, WikiData and Wikipedia hierarchical information. Concerning inconsistent
data in DBpedia language-specific chapters, RADAR labels a relation between
entities/objects as negative, if every attempt to find one of the positive relations
described above fails. The output consists of a graph composed of the elements
of the results set connected with each other by the identified relations. Both the
sources associated with a confidence score and the results set under the form of
a graph are then provided to the third module of RADAR, the Argumentation
module (Fig. 1). Its aim is to reconcile the results set: it considers all positive
relations as a support relation and all negative relations as an attack relation,
building a bipolar argumentation graph where each element of the results set
is seen as an argument. Finally, adopting a bipolar fuzzy labeling algorithm [2]
relying on the source’s confidence to decide the acceptability of the information,
the module returns the acceptability degree of each argument, i.e., element of the
results set. RADAR provides as output: i) the acceptable elements (a threshold
is adopted), and ii) the graph of the results set, i.e., the explanation about the
choice of the acceptable elements returned.
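To make the surface-variants category concrete: RADAR relies on WikiData cross-lingual links (among the other strategies listed above), but the same idea can be sketched directly over DBpedia's own interlanguage owl:sameAs links, e.g., to find the French-chapter counterpart of an English resource. The query below is only illustrative and is not the implementation used by RADAR.

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?counterpart WHERE {
  dbr:Skype owl:sameAs ?counterpart .
  # keep only links pointing into the French chapter
  FILTER (STRSTARTS(STR(?counterpart), "http://fr.dbpedia.org/resource/"))
}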
Integrating RADAR into QAKiS. QAKiS addresses the task of QA over
structured knowledge-bases (e.g., DBpedia) [1], where the relevant information
is expressed also in unstructured forms (e.g., Wikipedia pages). It implements
a relation-based match for question interpretation, to convert the user ques-
tion into a query language (e.g., SPARQL), making use of relational patterns
(automatically extracted from Wikipedia and collected in the WikiFramework
repository) that capture different ways to express a certain relation in a given
language. In QAKiS, the SPARQL query created after the question interpre-
tation phase is sent to a set of language-specific DBpedia chapters SPARQL
endpoints for answer retrieval. The set of retrieved answers from each endpoint
is then sent to RADAR for answer reconciliation1 . The user can select the DB-
pedia chapter she wants to query besides English (that must be selected as it
is needed for Named Entity (NE) recognition), i.e., French or German. After
writing a question or selecting it among the proposed examples, the user has
to click on the tab RADAR where a graph with the answers provided by the
different endpoints and the relations among them is shown. Each node has an
associated confidence score, resulting from the fuzzy labeling algorithm. More-
over, each node is related to the others by a relation of support or attack, and a
further specification of such relations according to the identified categories [3] is
provided to the user as justification of the performed reconciliation and ranking.
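As a sketch of this pipeline, for a question such as "Who developed Skype?" (used as an example in the evaluation below) QAKiS would produce a query along the following lines and send it to each selected chapter endpoint; the property and resource URIs are taken from the English DBpedia ontology, and the actual query generated by QAKiS may differ.

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT DISTINCT ?developer WHERE {
  dbr:Skype dbo:developer ?developer .
}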
To evaluate RADAR integration into QAKiS, we extract from QALD-2 data
set2 the questions currently addressed by QAKiS (i.e., questions containing a
NE related to the answer through one single ontological property), correspond-
ing to 58 questions (26 in the training, 32 in the test set). The discarded ques-
tions either require some form of reasoning on data or aggregation from data sets
other than DBpedia, involve n-relations, or are boolean questions. We submit
1 A demo of RADAR integrated into QAKiS can be tested at http://qakis.org.
2 http://bit.ly/QALD2014
such questions to QAKiS on the English, German and French DBpedia chap-
ters. Since QALD-2 questions were created to query the English chapter only,
it turned out that only in 25/58 cases at least two endpoints provide an answer
(in all the other cases the answer is provided by the English chapter only, not
useful for our purposes). For instance, given the question Who developed Skype?
the English DBpedia provides Skype Limited as the answer, while the French
one returns Microsoft. We evaluate the ability of RADAR to correctly classify
the relations among the answers provided to the same query by the different
language-specific endpoints, w.r.t. a manually annotated gold standard (built ac-
cording to [3]’s guidelines), carrying out two sets of experiments: i) we start from
the answers provided by the different DBpedia endpoints to the 25 QALD ques-
tions, and we run RADAR on it; ii) we add QAKiS in the loop, meaning that the
data we use as input for the argumentation module are directly provided by the
system. We obtain the following results: RADAR achieves a precision/recall/f-
measure of 1 in the classification of surface form and inclusion relations (overall
positive: p/r/f=1); QAKiS+RADAR obtains p=1, r=0.60 and f=0.75 on sur-
face form, p/r/f of 1 on inclusion (overall positive: p/r/f= 1/0.63/0.77). Since
QALD-2 data was created to query the English chapter only, this small data set
does not capture the variability of possibly inconsistent answers among DBpedia
language-specific chapters. Only two categories of relations are present in this
data, i.e., surface forms and inclusion, and for this reason RADAR has out-
standing performance when applied to the correct mapping between NL ques-
tions and SPARQL queries. When QAKiS is added into the loop, its mistakes
in translating the NL question into the correct SPARQL query are propagated.
3 Future Perspectives
This demo improves on the results of [2], as the categorization is more specific, thus
producing a more insightful explanation graph, and better-performing techniques
are applied to extract the relations. As future work, we will address a user
evaluation to check whether the QAKiS answer explanation suits data consumers'
needs, and we will explore the possibility of letting the data consumer herself
assign the confidence degree to the sources depending on the searched information.
References
1. Cabrio, E., Cojan, J., Gandon, F.: Mind the cultural gap: bridging language spe-
cific dbpedia chapters for question answering. In: Cimiano, P., Buitelaar, P. (eds.)
Towards the Multilingual Semantic Web. Springer Verlag (2014)
2. Cabrio, E., Cojan, J., Villata, S., Gandon, F.: Argumentation-based inconsistencies
detection for question-answering over dbpedia. In: NLP-DBPEDIA@ISWC (2013)
3. Cabrio, E., Villata, S., Gandon, F.: Classifying inconsistencies in dbpedia language
specific chapters. In: LREC-2014 (2014)
4. Dung, P.: On the acceptability of arguments and its fundamental role in non-
monotonic reasoning, logic programming and n-person games. Artif. Intell. 77(2),
321–358 (1995)
Open Mashup Platform – A Smart Data
Exploration Environment
Tuan-Dat Trinh, Ba-Lam Do, Peter Wetz, Amin Anjomshoaa,
Elmar Kiesling, and A Min Tjoa
Vienna University of Technology, Vienna, Austria
{tuan.trinh,peter.wetz,ba.do,amin.anjomshoaa,
elmar.kiesling,a.tjoa}@tuwien.ac.at
Abstract. The number of applications designed around Linked Open
Data (LOD) has expanded rapidly in recent years. However, these appli-
cations typically do not make use of the vast number of available LOD datasets,
but only provide access to predefined, domain-specific subsets. Excep-
tions that do allow for more flexible exploration of LOD are not targeted
at end users, which excludes users who have limited experience with
Semantic Web technologies from realizing the potential of the so-called
LOD cloud. This paper introduces a Mashup Platform that models, man-
ages, reuses, and interconnects LOD web applications, thereby encourag-
ing initiative and creativity of potential users. Figuratively, our approach
allows developers to implement building blocks whereas the platform pro-
vides the cement so that end users can build houses by themselves.
1 Introduction
More than ten years after the concept of LOD was introduced, can end users
really benefit from it? In spite of considerable effort made by researchers, the
answer, unfortunately, is “just a little”. This issue is becoming more pressing as
the number of LOD datasets increases; as of 2014,1 there are already more than
60 billion triples from 928 datasets.
The limited adoption of LOD by end users may be explained by the following
observations: (i) There are currently no platforms that allow end users to manage
and reuse LOD applications. An “LOD App Store” – by analogy with digital
distribution platforms for mobile apps such as Google Play Store or App Store
– would be an interesting concept to foster diffusion among end users. (ii) Most
LOD providers publish their datasets without paying regard to how their data
may be used effectively or how it may be combined with data provided by others.
After all, specific use cases are typically the most effective way to illustrate
that the data is useful for end users. To save time, LOD providers
need tools that support them in implementing such applications as efficiently as
possible. (iii) Most importantly, end users currently play a passive role, waiting
for developers to deliver LOD applications rather than leveraging LOD according
1 http://stats.lod2.eu/
to their individual needs. To make good use of LOD, users currently need to
equip themselves with knowledge about Semantic Web technologies as well as
the SPARQL query language. Since this cannot generally be expected, the key
question we address here is: “Is there any way to utilize the users’ capabilities
and creativity and allow them to explore LOD themselves without the need to
acquire specialized knowledge?”
To address these issues, we propose an Open Mashup Platform. The key idea
of this platform is to compose LOD applications from linkable Web widgets pro-
vided by data publishers and developers. These widgets are divided into three
categories, i.e., data, process, and visualization widgets which perform the data
retrieval, data processing/data integration, and data presentation tasks, respec-
tively. The internal mechanics of these complicated tasks are not visible to end
users because they are encapsulated inside the widgets. Widgets have inputs and
outputs and can be easily linked to each other by end users. Being web appli-
cations, they can run on various platforms and be shared and reused. When a
LOD provider publishes a new dataset, they can develop new widgets and add
them to the platform so that users can dynamically, actively and creatively com-
bine these widgets to compose LOD applications across multiple LOD datasets.
This paper presents the most basic functionalities of the Platform – a smart data
exploration environment for end users available at http://linkedwidgets.org.
2 Prototype System
2.1 Graph-based model and the Annotator tool
To communicate and transmit data between widgets, each of them implements
its own well-defined model as well as interfaces to provide the features required
by the Platform. A widget is similar to a service in that it has multiple inputs
and a single output. To model them, however, instead of capturing the func-
tional semantics and focusing on input and output parameters like SAWSDL
[6], OWL-S [2], WSMO [1] etc., we use a graph-based model similar to [5]. This
approach has a number of advantages and is a prerequisite for the semantic
search and auto-composition algorithms described in this paper. For example,
from the model for a Film Merger widget that requires an Actor and a Director
as its two inputs and returns a list of Films starring the actor and being directed
by the director (cf. Fig. 1a), the relation between input and output instances is
immediately apparent.
To allow developers to create and annotate widgets correctly and efficiently,
we provide a Widget Annotator tool. Developers simply drag, drop and then
configure three components, i.e., WidgetModel, Object, and Relation to visually
define their widget models. After that, the OWL description file for the model as
well as the corresponding HTML widget file are generated automatically. The lat-
ter includes the injected JavaScript code snippet used for the widget communi-
cation protocol and a sample JSON-LD input/output of the widget according to
the defined model. Based on that, developers can implement the widget’s process-
ing function which receives input from widgets and returns output to others. De-
velopers can also rewrite/improve widgets using server-side scripting languages.
Finally, as soon as they have deployed their widgets, developers can submit their
work to the platform where it is listed and can be reused with other available
widgets; in particular, the widget annotations are published into the LOD of
widgets, which can be accessed via the SPARQL endpoint at
http://ogd.ifs.tuwien.ac.at/sparql.
2.2 Semantic Widget Search
In line with the growth of the LOD cloud, the number of available widgets can be
expected to grow rapidly. In this case, to ensure that users can find widgets on the
platform, a semantic search will be provided in addition to conventional search
methods by keywords, category, tags, etc. Because the widgets’ RDF metadata
is openly available via the SPARQL endpoint, other third parties, if necessary,
can also develop their own widget-search tool. Our search tool is similar to the
annotator tool, but it is much simpler and directed at end users. By defining
constraints on inputs/outputs, users can, for example, find widgets which return
Films with particular properties for each Film, or even find widgets which involve
a relationship between Films and Actors.
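A semantic widget search of this kind can be expressed as a SPARQL query over the published widget metadata. Since the platform's widget vocabulary is not spelled out here, the lw: namespace and the property names lw:Widget, lw:hasOutput and lw:hasObject below are placeholders used only for illustration, while dbpedia:Film is the DBpedia class appearing in the Film Merger model.

PREFIX dbpedia: <http://dbpedia.org/ontology/>
PREFIX lw: <http://linkedwidgets.org/ontology#>
SELECT DISTINCT ?widget WHERE {
  ?widget a lw:Widget ;
          lw:hasOutput ?output .      # the widget's output model
  ?output lw:hasObject ?object .
  ?object a dbpedia:Film .            # widgets whose output contains Films
}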
2.3 Mashup Panel
The Mashup Panel is the most crucial part of the platform; it allows users to
compose, publish and share their applications, thereby enabling them to dynam-
ically and actively explore LOD datasets without special skills or knowledge.
Widgets are grouped into Widget Collections to offer a group of scenarios af-
ter being combined with each other. Users can create their own collections or choose
to work with existing collections from other users. The list of widgets that belong
to the selected collection is placed at the left-hand side of the Mashup Panel.
Users simply drag and drop widgets into the mashup area at the right-hand side.
For each chosen widget, available operations are resize, run, view/cache output
data, get detailed information about the widget based on its URI. In the next
step, users can wire the input of a widget to the output of another one and thus
build up a data-processing flow. The connected widgets, under the coordination
of the platform, will communicate and transmit data to each other, from the
very first data widgets to the visualization widgets. Finally, if the whole mashup
is saved, parameters set in HTML form inputs from each widget will be auto-
matically detected and stored so that users can publish the final result displayed
inside the visualization widget onto their websites. Furthermore, the combined
applications are semantically annotated and can be shared between users via their
URLs or URIs.
We implemented two algorithms to help users acquaint themselves with their
new widgets: auto-matching and auto-composition. The auto-matching algorithm
enables users to find – given an input/output terminal A – all terminals B from all
other widgets such that the connection between A and B is valid. The auto-
composition algorithm is a more advanced approach in that it can automatically
compose a complete application from a widget, or a complete branch that con-
sumes/provides data for a specific output/input terminal. "Complete" in this
context means that all terminals must be wired. This, as well as the semantic
search feature, distinguishes our platform from similar contributions, e.g., [4] or [3].
Fig. 1: Sample widget model and widget combination. (a) Model of the Film Merger
widget: two inputs (a dbpedia:Actor and a dbpedia:Director, each with a foaf:name of
type xsd:string) and one output (a foaf:Film linked to the inputs via dbpedia:starring
and dbpedia:director). (b) A complete sample LOD application.
A sample application that collects all movies starring Arnold Schwarzenegger
and directed by James Cameron is shown in Fig. 1b. Many other use cases can
be found on the platform at http://linkedwidgets.org.
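The data combined by this sample application corresponds roughly to the following query against the public DBpedia endpoint; it is shown only to clarify what the connected widgets compute, and the widgets themselves need not issue this exact query.

PREFIX dbpedia: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbpedia:starring dbr:Arnold_Schwarzenegger ;
        dbpedia:director dbr:James_Cameron .
}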
References
1. de Bruijn, J., et al.: Web Service Modeling Ontology (WSMO) (2005), http://www.
w3.org/Submission/WSMO/
2. David, M., et al.: OWL-S : Semantic Markup for Web Services (2004), http://www.
w3.org/Submission/OWL-S/
3. Imran, M., et al.: ResEval mash: a mashup tool for advanced research evaluation. In:
21st international conference companion on World Wide Web. pp. 361–364 (2012)
4. Le-Phuoc, D., et al.: Rapid prototyping of semantic mash-ups through semantic web
pipes. In: 8th international conference on World Wide Web. p. 581. ACM Press, New
York, New York, USA (2009)
5. Taheriyan, M., Knoblock, C.: Rapidly integrating services into the linked data cloud.
In: 11th International Semantic Web Conference. pp. 559–574 (2012)
6. Kopecký, J., Vitvar, T., Bournez, C., Farrell, J.: SAWSDL: Semantic Annotations for WSDL and XML Schema.
IEEE Internet Computing 11(6), 60–67 (2007)
CIMBA - Client-Integrated MicroBlogging
Architecture
Andrei Vlad Sambra1 , Sandro Hawke1 , Tim Berners-Lee1 , Lalana Kagal1 , and
Ashraf Aboulnaga2
1
Decentralized Information Group,
MIT CSAIL
2
Qatar Computing Research Institute
asambra@mit.edu,sandro@w3.org,timbl@w3.org,lkagal@csail.mit.edu,
aaboulnaga@qf.org.qa
Abstract. Personal data ownership and interoperability for decentral-
ized social Web applications are currently two debated topics, especially
when taking into consideration the aspects of privacy and access control.
To increase data ownership, users should have the freedom to choose
where their data resides and who is allowed access to it by decoupling
data storage from the application that consumes it. Through CIMBA,
we propose a decentralized architecture based on Web standards, which
puts users back in control of their own data.
Keywords: decentralization, Linked Data, social Web, privacy, Web
apps
1 Introduction
Recently, we have witnessed a dramatic increase in the number of social Web
applications. These applications come in different forms and offer different ser-
vices such as social networks, content management systems (CMS), bug trackers,
blogging tools, or collaboration services in general.
A common practice, specific to most Web services, is to centralize user re-
sources, thus creating so-called data silos. When signing up for online ser-
vices, people usually end up creating dedicated local accounts, which ties and
limits users to particular services and/or resources. A solution to data silos can
be achieved through decentralization, where users are free to host their data
wherever they want, and then use several Web apps to consume and manage the
data. In the following section we will discuss how our decentralized architecture
plays an important role in achieving true data ownership for users.
2 Architecture
Today, more and more software is built around an application-specific back-
end database. This makes switching applications problematic, since data are
structured according to each specific application and it only has meaning within
the context of those applications. Moreover, this practice also forces a tight
coupling between backends and applications consuming the data (cf. Fig.1 a).
The proposed architecture uses the Resource Description Framework (RDF) [1]
to achieve greater interoperability between servers and applications as well as to
ensure the data structure remains the same, regardless of the server on which
the data are stored. In our case, CIMBA is a simple microblogging client that is
completely decoupled from the backend, which in turn is a generic storage server
(cf. Fig.1 b).
Fig. 1. a) Current architecture; b) Proposed decentralized architecture
By fully decoupling the server from the Web app, developers will be able
to produce large scale Web apps without having to also manage the backend
servers, making it very simple to switch from one backend to another, as well as
from one Web app to another one without losing any data (cf. Fig.2 a).
Another advantage is that users are no longer locked into a silo because
of their social connections (cf. Fig.2 b). Web apps reuse the user’s social graph,
which is also located on the data manager. The data manager is a generic Linked
Data personal data server, which implements the Linked Data Platform spec-
ification [2] (currently on REC track at W3C3 ), as well as the Web Access
Control [3] ontology (to enforce privacy policies).
Our architecture uses WebID [4] as the main mechanism to identify peo-
ple at the Web scale, together with WebID-TLS [4] (to authenticate requests
to restricted resources), a fully decentralized authentication scheme based on
WebID.
3 http://www.w3.org
Fig. 2. a) users can easily switch software; b) social connections stay with the user
3 CIMBA
The name CIMBA stands for Client-Integrated MicroBlogging Application. It
provides users with the power of having their own blog combined with the ease of
using Twitter. CIMBA was written in JavaScript, using the AngularJS4 frame-
work. The source code is publicly available on GitHub5 , and a running online
demo can be accessed by visiting http://cimba.co.
Compared to Twitter, CIMBA users are not stuck with a single feed or time-
line, but instead they can create multiple Channels and use them as categories
for their posts (e.g., main, work, family). Access control can be set per chan-
nel as well as per post, though policies for posts will override those set per
channel (e.g., a private post in a public channel).
Figure 3 displays an overview of the architecture behind CIMBA. Users Alice
and Bob each have their own personal data managers, which hold the posts data,
configuration files as well as their personal WebID profiles.
Accessing CIMBA simply means loading all the necessary HTML, Javascript
and CSS files from the application server into the user’s browser. From that
moment on, the application which now runs in the browser will communicate
directly with the user’s personal data manager. The location of the personal
data manager is found by “faking” a WebID-TLS authentication process, with
the purpose of finding the user’s WebID and implicitly, the WebID profile. There
is no actual need to authenticate the user to CIMBA, since all requests for data
are authenticated by the user’s personal data store.
Once the WebID profile is found, the app follows a series of links to discover
useful information about the user, such as a generic Linked Data server from
where the app can store and retrieve resources. CIMBA stores all the appli-
cation data on the user’s personal data manager, in a workspace dedicated to
microblogging. Microblogging data are stored using the SIOC ontology.
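The link-following step can be pictured as a simple query over the user's WebID profile graph; the pim:storage property comes from the W3C workspace/storage vocabulary commonly used for this purpose, the profile URI is of course hypothetical, and CIMBA's actual discovery logic may follow additional links.

PREFIX pim: <http://www.w3.org/ns/pim/space#>
SELECT ?storage WHERE {
  # locate the generic Linked Data server advertised in the profile
  <https://alice.example.org/profile/card#me> pim:storage ?storage .
}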
4 https://angularjs.org/
5 https://github.com/linkeddata/cimba
Fig. 3. Overview of the architecture for CIMBA
To read what other people write, users can subscribe to their channels. The
list of subscriptions is also expressed using the SIOC vocabulary and it is stored
on the user’s server.
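Because posts and subscriptions are plain SIOC data on the user's own server, reading a channel boils down to a query like the one below; the channel URI is hypothetical, and the exact workspace layout depends on the user's data manager.

PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?post ?content ?created WHERE {
  ?post a sioc:Post ;
        sioc:has_container <https://alice.example.org/mb/channels/main> ;
        sioc:content ?content ;
        dcterms:created ?created .
}
ORDER BY DESC(?created)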
4 Conclusions and future work
Our proposed decentralized architecture offers significant benefits compared to
current Web apps in terms of data ownership and privacy, as well as inter-
operability. Being fully decoupled from the backend, Web apps can be easily
forked and improved by anyone with access to the source code, thus spurring
innovation and creativity. At its current stage, CIMBA suffers from scalability
issues, though we are very close to overcoming them.
References
1. Klyne G., Carroll J., McBride B.: Resource Description Framework (RDF): Con-
cepts and abstract syntax. In: W3C recommendation. (2004)
2. Speicher S., Arwe J., Malhotra A.: Linked Data Platform 1.0. http://www.w3.org/
TR/ldp/ (2014)
3. Hollenbach J., Presbrey J., Berners-Lee T.: Using RDF metadata to enable access
control on the social semantic web. In: Proceedings of the Workshop on Collabo-
rative Construction, Management and Linking of Structured Knowledge (CK2009),
vol. 514. (2009)
4. Sambra A., Henry S., Berners-Lee T.: WebID Specifications. http://www.w3.org/
2005/Incubator/webid/spec/ (2014)
5. Breslin J.G., Harth A., Bojars U., Decker, S.: Towards semantically-interlinked on-
line communities. In: The Semantic Web: Research and Applications, pages 500-514.
(2005)
The Organiser - A Semantic Desktop Agent
based on NEPOMUK
Sebastian Faubel and Moritz Eberl
Semiodesk GbR, D-86159 Augsburg, Germany
{sebastian, moritz}@semiodesk.com
Abstract. In this paper we introduce our NEPOMUK-based Semantic
Desktop for the Windows platform. It uniquely features an integrative
user interface concept which allows a user to focus on personal infor-
mation management while relying on the Property Projection agent for
semi-automated file management.
Keywords: Semantic Desktop, Personal Information Management
1 Introduction
In recent years, mobile cloud computing [5] has created a paradigm shift in the
use of electronic devices. It is now common for people to consume and produce
content using multiple devices, online platforms and communication channels.
However, productive and collaborative work is becoming increasingly fragmented
across different mobile platforms, social networks and collaboration platforms [1].
Di↵erent devices and applications often come with their separate methods
of organizing and storing information. Thus, considerable e↵ort has to be made
to represent a single piece of information in multiple systems. In our case, the
ISWC 2014 conference is being represented in five di↵erent entities: a calendar
event, a shared folder in the hierarchical file system, a notes list, a bookmarks
folder and a task list.
There is a need for active computer support in the creation and filtering of personal and group information. Such support has to provide a consolidated view on data and blur the boundaries between workstations, mobile devices and web services. Semantic Web technologies, specifically the Semantic Desktop [6], offer a suitable platform for this purpose.
2 Our Solution
To solve this problem we have created the Organiser 1 , a Semantic Desktop agent
based on the NEPOMUK ontologies [4]. It integrates personal information such
as contacts, events and notes from cloud services with the local file system and
thus, provides a consolidated view.
1 Demo video: http://www.semiodesk.com/media/2014/0714-organiser-intro
Fig. 1. Upcoming events are easily accessible from the Organiser’s dashboard.
The application’s dashboard is shown in figure 1. It serves as an entry point
into exploring resources and provides quick access to resource collections (i.e.
documents, pictures, events, etc.) and the local file system. Most prominently,
it features an activity control (agenda and journal) which allows a user to plan
into the future. Because future activities also serve as containers for files and
related information, the dashboard provides quick access to all resources which
are relevant to the user at a certain time.
Moreover, all resources can act as containers for relevant files and information. The relations displayed in the resource view can be hyperlinks to other resources, which enables browsing for interesting files and information. In order to assist a user in adding reasonable relations to a newly created resource, the Organiser actively analyzes the resource's properties and provides suggestions, such as relating a collection of pictures to an event if they were taken at the time of the event. A user may accept or decline those suggestions.
Another feature of the resource view is that it not only consolidates existing relations but also offers the ability to create new content. When new files, such as office documents, are added, a file system path is generated from the metadata of the file and the context of the resource in which it was created.
2.1 File System Abstraction using Property Projection
The hierarchical file system is the de facto standard for organizing and sharing
files in a productive computing environment. In order to support a soft transition
away from the static file system as a primary means of organizing and browsing
files, we have developed the Property Projection method: a Semantic Web agent that is capable of learning how resource metadata is projected into the path component of a URI, either by analyzing existing file systems or through interactive user input (figure 2).
Fig. 2. Training the PropertyProjection agent interactively.
Once a projection schema has been learned, the agent can suggest storage
locations for files being created in the context of a resource. This allows a user to
shift from working with the static file system folders to working with resources
such as contacts, events, notes or tasks that have temporal relevance to his or
her activities.
The agent can formalize the schematic generation of URIs using the Property
Projection Ontology. Based on the Path Projection Ontology [3], the revised
ontology provides a vocabulary for the following features:
– Generating readable URIs from metadata (conforming to RFC 2396 [2])
– Generating names and titles for resources from metadata
URI schemes can be shared with other compatible agents on a global scale. This can improve overall team productivity, since a common file organization schema removes the burden of choosing a storage location and helps co-workers spend less time looking for misfiled information.
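As a rough illustration of the idea (not the Organiser's actual implementation), a learned projection schema can be thought of as a template that maps selected metadata properties onto path segments; all names and the template below are hypothetical:

import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of applying a learned projection schema:
// metadata properties are substituted into a path template.
public class PropertyProjectionSketch {

    // Example template a projection agent might have learned from an existing file system.
    static final String TEMPLATE = "/Events/{year}/{eventTitle}/{fileName}";

    static String project(String template, Map<String, String> metadata) {
        Matcher m = Pattern.compile("\\{([^}]+)\\}").matcher(template);
        StringBuilder path = new StringBuilder();
        while (m.find()) {
            // Replace each placeholder with URI-safe metadata (in the spirit of RFC 2396).
            String value = metadata.getOrDefault(m.group(1), "unknown");
            m.appendReplacement(path, value.replaceAll("[^A-Za-z0-9._-]", "_"));
        }
        m.appendTail(path);
        return path.toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = Map.of(
                "year", "2014",
                "eventTitle", "ISWC 2014",
                "fileName", "poster.pdf");
        System.out.println(project(TEMPLATE, metadata)); // /Events/2014/ISWC_2014/poster.pdf
    }
}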
2.2 Implementation
The Organiser is implemented using Trinity, our Semantic Web application development platform for .NET/Mono. It features a Semantic Object Mapping mechanism that allows developers to define an object-oriented abstraction layer on top of an RDF triple store. This layer promotes the use of common development methods and proven application design patterns, and significantly increases compatibility with existing APIs.
Although the Organiser's user interface is currently implemented using WPF, the consistent use of the MVVM design pattern allows for reimplementation with other technologies such as HTML and JavaScript. The interface is laid out in such a way that it can be used on touch screens and scaled down to the resolution of current smartphones.
Metadata from the file system and cloud services is gathered in the back-
ground by Ubiquity, a metadata extraction and synchronisation service. All
changes to the extracted resources made in the Organiser are mirrored back
to the metadata of the respective resource.
3 Conclusions / Future Work
The Organiser concept was refined over multiple iterations and the software is nearing completion; only final usability tests and a few connectivity and format extensions remain.
A future focus is to implement the Organiser for mobile devices such as tablets and smartphones. Because they accompany the user most of the time, we want them to act as a conduit that brings the seemingly virtual planning and organisation of the desktop into the everyday world of the user. Achieving a convergence of the user's data on mobile devices and the desktop PC can lead to device-independent productivity.
References
1. Bergman, O., Beyth-Marom, R., Nachmias, R.: The project fragmentation problem
in personal information management. In: Proceedings of the SIGCHI Conference
on Human Factors in Computing Systems. pp. 271–274. CHI ’06, ACM, New York,
NY, USA (2006)
2. Berners-Lee, T.: RFC 2396: Uniform Resource Identifiers (URI). Tech. rep., MIT
(1998)
3. Faubel, S., Kuschel, C.: Towards semantic file system interfaces. In: Bizer, C., Joshi,
A. (eds.) International Semantic Web Conference (Posters & Demos). CEUR Work-
shop Proceedings, vol. 401. CEUR-WS.org (2008)
4. Groza, T., Handschuh, S., Moeller, K., Grimnes, G., Sauermann, L., Minack, E., Mesnage, C., Jazayeri, M., Reif, G., Gudjonsdottir, R.: The NEPOMUK project - on the way to the social semantic desktop. In: Pellegrini, T., Schaffert, S. (eds.) Proceedings of I-Semantics '07. pp. 201–211. JUCS (Sep 2007)
5. Liu, F., Shu, P., Jin, H., Ding, L., Yu, J., Niu, D., Li, B.: Gearing resource-poor
mobile devices with powerful clouds: architectures, challenges, and applications.
IEEE Wireless Commun. 20(3), 1–0 (2013)
6. Sauermann, L., Bernardi, A., Dengel, A.: Overview and outlook on the semantic
desktop. In: Proceedings of the 1st Workshop on The Semantic Desktop at the
ISWC 2005 Conference (2005)
HDTourist: Exploring Urban Data on Android
Elena Hervalejo1 , Miguel A. Martı́nez-Prieto1 , Javier D. Fernández1,2 , Oscar Corcho2
1
DataWeb Research, Department of Computer Science, Univ. de Valladolid (Spain)
elena.hervalejo@gmail.com, {migumar2,jfergar}@infor.uva.es
2
Ontology Engineering Group (OEG), Univ. Politécnica de Madrid (Spain)
{jdfernandez,ocorcho}@fi.upm.es
1 Introduction
The Web of Data currently comprises ≈ 62 billion triples from more than 2,000 different datasets covering many fields of knowledge3. This volume of structured Linked Data can be seen as a particular case of Big Data, referred to as Big Semantic Data [4]. Obviously, powerful computational configurations are traditionally required to deal with the scalability problems arising from Big Semantic Data. It is not surprising that this "data revolution" has unfolded in parallel with the growth of mobile computing. Smartphones and tablets are massively used at the expense of traditional computers but, to date, mobile devices have more limited computational resources.
Therefore, one question that we may ask ourselves is: can (potentially large) semantic datasets be consumed natively on mobile devices? Currently, only a few mobile apps (e.g., [1, 9, 2, 8]) make use of semantic data that they store on the mobile device, while many others access existing SPARQL endpoints or Linked Data directly. Two main reasons can be considered for this fact. On the one hand, in spite of some initial approaches [6, 3], there are no well-established triplestores for mobile devices. This is an important limitation because any potential app must take on both RDF storage and SPARQL resolution itself. On the other hand, the particular features of these devices (little storage space, less computational power, more limited bandwidth) limit the adoption of semantic data for different uses and purposes.
This paper introduces our HDTourist mobile application prototype. It consumes urban data from DBpedia4 to help tourists visiting a foreign city. Although it is a simple app, its functionality illustrates how semantic data can be stored and queried with limited resources. Our prototype is implemented for Android, but its foundations, explained in Section 2, can be deployed on any other platform. The app is described in Section 3, and Section 4 concludes with our current achievements and outlines future work.
2 Managing RDF in Mobile Devices
Our approach for managing RDF is inspired by the role played by SQLite5 in
Android devices. SQLite is a self-contained SQL engine which is deployed as
3 Stats reported by LODStats: http://stats.lod2.eu/.
4 http://dbpedia.org/.
5 http://www.sqlite.org/.
an internal component of the application program. This way, the app itself can
read and write data directly from the database files without requiring a separate
process running as a DBMS (Database Management System).
Similarly, our only requirements are properly serialized RDF files and a standardized interface to operate on them. Both are provided by the RDF/HDT [5] format, which serializes RDF using up to 15 times less space than other syntaxes [4], while allowing basic SPARQL queries to be efficiently resolved on the serialized file [7]. Thus, including RDF/HDT as a library6 in the app allows it to manage and query semantic data in compressed space.
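As a minimal sketch of what this looks like with the Java RDF/HDT library (the file name and the query pattern are illustrative):

import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.triples.IteratorTripleString;

// Sketch: open a (hypothetical) HDT file for a city and run a simple triple-pattern query.
public class HdtQuerySketch {
    public static void main(String[] args) throws Exception {
        // Map the compressed HDT file; no separate database process is required.
        HDT hdt = HDTManager.mapIndexedHDT("verona.hdt", null);
        try {
            // Triple-pattern search: empty strings act as wildcards.
            IteratorTripleString it =
                    hdt.search("http://dbpedia.org/resource/Verona", "", "");
            while (it.hasNext()) {
                System.out.println(it.next());
            }
        } finally {
            hdt.close();
        }
    }
}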
3 HDTourist
HDTourist is a proof-of-concept app7 built on top of RDF/HDT. It is designed as a lightweight app that provides tourists with information when they are in a foreign place. In such situations, people are reluctant to connect to the Internet because of potentially expensive roaming costs. Thus, the mobile device can keep compressed semantic information and query it offline.
Use case. Let us suppose that we plan our trip to Riva del Garda to attend ISWC'2014, and our flight arrives in Verona. Fortunately, we have a day to visit the city and decide to use HDTourist. Before leaving home, or at a Wi-Fi hotspot (e.g., in the hotel), we use our Internet connection to download the RDF/HDT file with relevant information about Verona. Currently, these data are obtained by exploring different categories related to the DBpedia entity modeling the city: http://dbpedia.org/page/Verona. In addition to semantic data, we can download multimedia (images, maps of the region, etc.) to improve the user experience. We download them and HDTourist is ready to be used during our visit.
Verona's HDT file has 18,208 triples, with a size of ≈850 KB, more than 4 times smaller than the original NTriples file (≈3.6 MB). Beyond the space savings, this HDT file is self-queryable, in contrast to the flat NTriples serialization.
3.1 Retrieving Urban Data from DBpedia
DBpedia contains a lot of descriptive data about cities, which we filter as follows: given the URI u of a city (e.g. http://dbpedia.org/page/Verona), we run a CONSTRUCT query on DBpedia which retrieves: i) all triples describing the city, i.e., all triples comprising u as subject, and ii) all landmarks related to the city, i.e., all resources (and their descriptions) linking to u. We restrict the latter to certain kinds of landmark resources that we have manually identified, e.g. resources of type Place (http://dbpedia.org/ontology/Place), Historical Buildings (http://dbpedia.org/ontology/HistoricPlace), etc. Other types specifically related to the city are also considered, for instance the squares in Verona (http://dbpedia.org/class/yago/PiazzasInVerona).
The RDF subgraph returned by this CONSTRUCT query is then converted to RDF/HDT, ready to be downloaded and queried by our mobile app.
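The exact query is not given in the paper; the following Jena-based sketch only illustrates the kind of CONSTRUCT described above, using the canonical DBpedia resource URI for Verona and dbo:Place as one of the manually identified landmark types:

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.rdf.model.Model;

// Sketch of the server-side retrieval step: fetch the description of a city and of
// landmarks linking to it. The query is an approximation, not the one used by HDTourist.
public class CityRegionSketch {
    public static void main(String[] args) {
        String city = "http://dbpedia.org/resource/Verona";
        String query =
                "PREFIX dbo: <http://dbpedia.org/ontology/>\n" +
                "CONSTRUCT { <" + city + "> ?p ?o . ?landmark ?lp ?lo . }\n" +
                "WHERE {\n" +
                "  { <" + city + "> ?p ?o . }\n" +
                "  UNION\n" +
                "  { ?landmark a dbo:Place ; ?link <" + city + "> ; ?lp ?lo . }\n" +
                "}";
        try (QueryExecution qe =
                     QueryExecutionFactory.sparqlService("http://dbpedia.org/sparql", query)) {
            Model region = qe.execConstruct();
            region.write(System.out, "N-TRIPLES");  // this subgraph would then be converted to RDF/HDT
        }
    }
}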
6 We use the Java RDF/HDT library: https://github.com/rdfhdt/hdt-java.
7 Available at: http://dataweb.infor.uva.es/project/hdtourist/?lang=en.
3.2 Browsing Urban Data
HDTourist uses categories to organize and display data. The main menu comprises four categories: description, demography and geography, attractions, and other interesting data. Figure 1 (a) shows the description of Verona, which includes basic information about the city. The information shown in each category is defined as SPARQL templates in XML configuration files (one per category), such as the following one:
Attractions
Squares
<sparql>
SELECT ?label
WHERE {
  { ?place rdf:type <...> .
    ?place rdfs:label ?label . }
  UNION
  { ?place rdf:type <...> .
    ?place rdfs:label ?label . }
}
</sparql>
</group>
Buildings
....
This XML excerpt corresponds to Figure 1 (b), showing the category "Attractions", which includes Squares, Buildings, etc. Each group retrieves the labels of attractions with a SPARQL query which typically consists of a UNION of Basic Graph Patterns searching for certain types of resources, as shown in the excerpt. When parsing the XML, the template ${CITY} is converted to the appropriate name, e.g. Verona. Each SPARQL query is then resolved making use of the query API of RDF/HDT, retrieving the labels shown in the screen layout.
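A hedged sketch of this resolution step, assuming the hdt-jena binding of the RDF/HDT library (HDTGraph) so that a standard SPARQL engine can run over the compressed file; the template string and file name are illustrative:

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdtjena.HDTGraph;

// Sketch: substitute ${CITY} in a SPARQL template and resolve it against the local HDT file.
public class TemplateResolutionSketch {
    public static void main(String[] args) throws Exception {
        HDT hdt = HDTManager.mapIndexedHDT("verona.hdt", null);
        Model model = ModelFactory.createModelForGraph(new HDTGraph(hdt));

        // Illustrative template; the real templates live in the per-category XML files.
        String template =
                "SELECT ?label WHERE { ?place a <http://dbpedia.org/class/yago/PiazzasIn${CITY}> ; "
                + "<http://www.w3.org/2000/01/rdf-schema#label> ?label . }";
        String query = template.replace("${CITY}", "Verona");

        try (QueryExecution qe = QueryExecutionFactory.create(query, model)) {
            ResultSet rs = qe.execSelect();
            while (rs.hasNext()) {
                System.out.println(rs.next().get("label"));
            }
        }
    }
}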
As shown in Figure 1 (c), each landmark can be expanded, obtaining further information. In this screenshot, we choose the "Piazza delle Erbe" (within "Squares"), and the app retrieves the triples describing it. The concrete information to be shown in the landmark description is also configured by means of an XML file containing one SPARQL template per category, again resolved against the local RDF/HDT. As shown in the screenshot, pictures can be downloaded and stored offline. Finally, HDTourist is able to show geolocated landmarks on interactive maps, as shown in Figure 1 (d) for "Piazza delle Erbe". The app uses Google Maps by default, but offline maps8 can be downloaded beforehand.
4 Conclusions and Future Work
The offline capabilities and structured information consumption possibilities of mobile devices are still several orders of magnitude below those of traditional devices. With our demo we show that RDF/HDT can be used as a self-contained engine to retrieve RDF information on mobile devices. To date, we have explored a given set of cities and certain query templates to build the screen layout. We are now exploring a spreading activation mechanism to automatically retrieve interesting features of a city, which are then converted to HDT on the server side. This also takes into account other datasets besides DBpedia.
8 In this prototype we use Nutiteq SDK Maps, available at http://www.nutiteq.com/.
Fig. 1. Some screenshots of HDTourist.
Acknowledgments
This work has been funded by the European Commission under the grant Plan-
etData (FP7-257641) and by the Spanish Ministry of Economy and Competi-
tiveness (TIN2013-46238-C4-2-R).
References
1. C. Becker and C. Bizer. DBpedia Mobile: A Location-Enabled Linked Data Browser.
In Proc. of LDOW, CEUR-WS 369, paper 14, 2008.
2. A.E. Cano, A.S. Dadzie, and M. Hartmann. Who’s Who–A Linked Data Visualisa-
tion Tool for Mobile Environments. In Proc. of ISWC, pages 451–455, 2011.
3. J. David, J. Euzenat, and M. Rosoiu. Linked Data from your Pocket. In Proc. of
DownScale, CEUR-WS 844, paper 2, pages 6–13, 2012.
4. J.D. Fernández, M. Arias, M.A. Martı́nez-Prieto, and C. Gutiérrez. Management
of Big Semantic Data. In Big Data Computing, chapter 4. Taylor & Francis, 2013.
5. J.D. Fernández, M.A. Martı́nez-Prieto, C. Gutiérrez, and A. Polleres. Binary RDF
Representation for Publication and Exchange (HDT). W3C Member Submission,
2011. http://www.w3.org/Submission/2011/03/.
6. D. Le-Phuoc, J.X. Parreira, V. Reynolds, and M. Hauswirth. RDF On the Go: An
RDF Storage and Query Processor for Mobile Devices. In Proc. ISWC, CEUR-WS
658, paper 19, 2010.
7. M.A. Martı́nez-Prieto, M. Arias, and J.D. Fernández. Exchange and Consumption
of Huge RDF Data. In Proc. of ESWC, pages 437–452, 2012.
8. V.C. Ostuni, G. Gentile, T. Di Noia, R. Mirizzi, D. Romito, and E. Di Sciascio.
Mobile Movie Recommendations with Linked Data. In Proc. of CD-ARES, pages
400–415, 2013.
9. G. Parra, J. Klerkx, and E. Duval. More!: Mobile Interaction with Linked Data. In
Proc. of DCI, CEUR-WS 817, paper 4, 2010.
Integrating NLP and SW with the KnowledgeStore
Marco Rospocher, Francesco Corcoglioniti, Roldano Cattoni,
Bernardo Magnini, and Luciano Serafini
Fondazione Bruno Kessler—IRST, Via Sommarive 18, Trento, I-38123, Italy
{rospocher,corcoglio,cattoni,magnini,serafini}@fbk.eu
Abstract. We showcase the KnowledgeStore (KS), a scalable, fault-tolerant, and Semantic Web grounded storage system for interlinking unstructured and structured content. The KS helps bridge the unstructured (e.g., textual documents, web pages) and structured (e.g., RDF, LOD) worlds, enabling both types of content to be jointly stored, managed, retrieved, and queried.
1 Introduction: Motivations and Vision
Despite the widespread diffusion of structured data sources and the public acclaim of the Linked Open Data (LOD) initiative, a preponderant amount of information is nowadays still available only in unstructured form, both on the Web and within organizations. While different in form, structured and unstructured contents are often related in content, as they speak about the very same entities of the world (e.g., persons, organizations, locations, events), their properties, and relations among them. Despite the achievements of the last decades in Natural Language Processing (NLP), which now supports large-scale extraction of knowledge about entities of the world from unstructured text, frameworks enabling the seamless integration and linking of knowledge coming from both structured and unstructured contents are still lacking.1
In this demo we showcase the KnowledgeStore (KS), a scalable, fault-tolerant, and Semantic Web grounded storage system to jointly store, manage, retrieve, and query both structured and unstructured data. Fig. 1a shows schematically how the KS manages unstructured and structured content in its three representation layers. On the one hand (and similarly to a file system), the resource layer stores unstructured content in the form of resources (e.g., news articles), each having a textual representation and some descriptive metadata. On the other hand, the entity layer is the home of structured content, which, based on Knowledge Representation and Semantic Web best practices, consists of axioms (a set of ⟨subject, predicate, object⟩ triples) that describe the entities of the world (e.g., persons, locations, events), and for which additional metadata are kept to track their provenance and to denote the formal contexts where they hold (e.g., point of view, attribution). Between these two layers sits the mention layer, which indexes mentions, i.e., snippets of resources (e.g., some characters in a text document) that denote something of interest, such as an entity or an axiom of the entity layer. Mentions can be automatically extracted by NLP tools, which can enrich them with additional attributes about how they denote their referent (e.g., with which name, qualifiers, "sentiment").
1 See [1] for an overview of works related to the contribution presented in this demo.
Fig. 1: (a) The three KS layers; (b) Interactions with external modules; (c) Components.
Far from being simple pointers, mentions thus present both unstructured and structured facets (respectively, the snippet and the attributes) that are not available in the resource and entity layers alone, and are therefore a valuable source of information in their own right.
Thanks to the explicit representation and alignment of information at different lev-
els, from unstructured to structured knowledge, the KS supports a number of usage
scenarios. It enables the development of enhanced applications, such as effective de-
cision support systems that exploit the possibility to semantically query the content of
the KS with requests combining structured and unstructured content, such as “retrieve
all the documents mentioning that person Barack Obama participated to a sport event”.
Then, it favours the design and empirical investigation of information processing tasks
otherwise difficult to experiment with, such as cross-document coreference resolution
(i.e., identifying that two mentions refer to the same entity of the world) exploiting the
availability of interlinked structured knowledge. Finally, the joint storage of (i) extracted
knowledge, (ii) the resources it derives from, and (iii) extracted metadata provides an
ideal scenario for developing, training, and evaluating ontology population techniques.
2 An overview of the KnowledgeStore
In this section we briefly outline the main characteristics of the KS. For a more exhaus-
tive presentation of the KS design, we point the reader to [1]. More documentation, as
well as binaries and source code,2 are all available on the KS web site [2].
Data Model The data model defines what information can be stored in the KS. It
is organized in three layers (resource, mention and entity), with properties that relate
objects across them. To favour the exposure of the KS content according to LOD prin-
ciples, the data model is defined as an OWL 2 ontology (available on [2]). It contains
the TBox definitions and restrictions for each model element and can be extended on a
per-deployment basis, e.g., with domain-specific resource and linguistic metadata.
2 Released under the terms of the Apache License, Version 2.0.
API The KS presents a number of interfaces through which external clients may access and manipulate stored data. Several aspects have been considered in defining them (e.g., operation granularity, data validation). These interfaces are offered through two HTTP ReST endpoints. The CRUD endpoint provides the basic operations to access and manipulate (CRUD: create, retrieve, update, and delete) any object stored in any of the layers of the KS. Operations of the CRUD endpoint are all defined in terms of sets of objects, in order to enable bulk operations as well as operations on single objects. The SPARQL endpoint allows querying the axioms in the entity layer using SPARQL. This endpoint provides a flexible and Semantic Web-compliant way to query for entity data, and leverages the grounding of the KS data model in Knowledge Representation and Semantic Web best practices. A Java client is also offered to ease the development of (Java) client applications.
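As an illustration of how a client can use the SPARQL endpoint from Java with a standard library such as Apache Jena (the endpoint URL and the query below are placeholders, not taken from the KS documentation):

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

// Sketch: query a KS SPARQL endpoint from a Java client application.
public class KsSparqlSketch {
    public static void main(String[] args) {
        String endpoint = "http://localhost:8080/knowledgestore/sparql"; // hypothetical deployment
        String query = "SELECT ?entity ?type WHERE { ?entity a ?type } LIMIT 10";
        try (QueryExecution qe = QueryExecutionFactory.sparqlService(endpoint, query)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.get("entity") + " a " + row.get("type"));
            }
        }
    }
}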
Architecture At its core, the KS is a storage server whose services are utilized by ex-
ternal clients to store and retrieve the contents they process. From a functional point of view, we identify three main types of clients (see Fig. 1b): (i) populators, whose purpose is to feed the KS with basic contents needed by other applications (e.g., documents, background knowledge from LOD sources); (ii) linguistic processors, which read input data from the KS and write back the results of their computation; and (iii) applications, which mainly read data from the KS (e.g., decision support systems). Internally,
the KS consists of a number of software components (see Fig. 1c) distributed on a
cluster of machines: (i) the Hadoop HDFS filesystem provides a reliable and scalable
storage for the physical files holding the representations of resources (e.g., texts and
linguistic annotations of news articles); (ii) the HBase column-oriented store builds
on Hadoop to provide database services for storing and retrieving semi-structured in-
formation about resources and mentions; (iii) the Virtuoso triple-store stores axioms
to provide services supporting reasoning and online SPARQL query answering; and,
(iv) the Frontend Server has been specifically developed to implement the operations
of the CRUD and SPARQL endpoints on top of the components listed above, handling
global issues such as access control, data validation and operation transactionality.
User Interface (UI) The KS UI (see Fig. 2) enables human users to access and inspect
the content of the KS via two core operations: (i) the SPARQL query operation, with
which arbitrary SPARQL queries can be run against the KS SPARQL endpoint, obtain-
ing the results directly in the browser or as a downloadable file (in various file formats,
including the recently standardized JSON-LD); and (ii) the lookup operation, which, given the URI of an object (i.e., resource, mention, entity), retrieves all the KS content
about that object. These two operations are seamlessly integrated in the UI, to offer a
smooth browsing experience to the users.
3 Showcasing the KnowledgeStore and concluding remarks
During the Posters and Demos session, we will demonstrate live how to access the KS
content via the UI (similarly to the detailed demo preview available at [3]), highlighting
the possibilities offered by the KS to navigate back and forth from unstructured to
structured content. For instance, we will show how to run arbitrary SPARQL queries,
retrieving the mentions of entities and triples in the query result set, and the documents
where they occur. Similarly, starting from a document URI, we will show how to access
the mentions identified in the document, up to the entities and triples they refer to.
Fig. 2: KS UI. Lookup of a mention. Note the three boxes (Mention resource, Mention
Data, Mention Referent) corresponding to the three representation layers of the KS.
In the last few months, several running instances of the KS were set up (on a cluster of five average-spec servers) and populated using the NewsReader Processing Pipeline [4]
with contents coming from various domains: to name a few, one on the global automo-
tive industry [5] (64K resources, 9M mentions, 316M entity triples), and one related to
the FIFA World Cup (212K resources, 75M mentions, 240M entity triples). The latter,
which will be used for the demo, was exploited during a Hackathon event [6], where 38
web developers accessed the KS to build their applications (over 30K SPARQL queries
were submitted – on average 1 query/s, with peaks of 25 queries/s).
Acknowledgements The research leading to this paper was supported by the European
Union’s 7th Framework Programme via the NewsReader Project (ICT-316404).
References
1. Corcoglioniti, F., Rospocher, M., Cattoni, R., Magnini, B., Serafini, L.: Interlinking unstruc-
tured and structured knowledge in an integrated framework. In: 7th IEEE International Con-
ference on Semantic Computing (ICSC), Irvine, CA, USA. (2013)
2. http://knowledgestore.fbk.eu
3. http://youtu.be/if1PRwSll5c
4. https://github.com/newsreader/
5. http://datahub.io/dataset/global-automotive-industry-news
6. http://www.newsreader-project.eu/come-hack-with-newsreader/
Graphical Representation of
OWL 2 Ontologies through Graphol
Marco Console, Domenico Lembo, Valerio Santarelli, and Domenico Fabio Savo
Dipartimento di Ingegneria Informatica, Automatica e Gestionale “Antonio Ruberti”
S APIENZA Università di Roma
{console,lembo,santarelli,savo}@dis.uniroma1.it
Abstract. We present Graphol, a novel language for the diagrammatic represen-
tation of ontologies. Graphol is designed to offer a completely visual representation to the users, thus helping people not skilled in logic to understand ontologies. At the same time, it provides designers with simple mechanisms for ontology edit-
ing, which free them from having to write down complex textual syntax. Through
Graphol we can specify SROIQ(D) ontologies, thus our language essentially
captures the OWL 2 standard. In this respect, we developed a basic software tool
to translate Graphol ontologies realized with the yEd graph editor into OWL 2
functional syntax specifications.
1 Introduction
Ontologies have become popular in recent years in several contexts, such as
biomedicine, life sciences, e-commerce, enterprise applications [9]. Obviously, it is
very likely that people operating in such contexts are not experts in logic and generally
do not possess the necessary skills to interpret formulas through which ontologies are
typically expressed. This turns out to be a serious problem also in the development of
an ontology. Indeed, ontologists usually work together with domain experts, the former
providing their knowledge about ontology modelling and languages, the latter provid-
ing their expertise on the domain of interest. During this phase, communication between
these actors is fundamental to produce a correct specification.
The use of a graphical representation for ontologies is widely recognized as a means
to mitigate this communication problem. At the same time, the possibility of specifying
ontologies in a graphical way might bring software analysts and experts in conceptual
modelling to approach ontology modelling, since they would be provided with mecha-
nisms that are close in spirit to those they usually adopt for software design.
Various proposals in this direction exist in the literature, but to date graphical languages for ontologies have not become very popular, especially for the editing task. Among various reasons, we single out the following points: (i) many languages for the graphical representation of ontologies do not capture the current standard OWL 2, and their extension to it is not straightforward (see, e.g., [3,8,7,6,1]); (ii) other proposals require the use of formulas mixed with the graphical representation (see, e.g., [4,1]); (iii) popular ontology management tools, such as Protégé1 or TopBraid Composer2, offer visualization functionalities, but do not support completely graphical editing.
1 http://protege.stanford.edu
2 http://www.topquadrant.com/tools
To address the main disadvantages mentioned above, in this paper we present our
proposal for graphical specification and visualization of ontologies, and introduce the
novel Graphol language, whose main characteristics can be summarized as follows:
– Graphol is completely graphical (no formulae need to be used in our diagrams) and
adopts a limited number of symbols. In Graphol, an ontology is a graph, whose
nodes represent either predicates from the ontology alphabet or constructors used
to build complex expressions from named predicates. Then, two kinds of edges
are adopted: input edges, used to specify arguments of constructors, and inclusion
edges, used to denote inclusion axioms between (complex) expressions.
– Graphol has a precise syntax and semantics, which is given through a natural en-
coding in Description Logics.
– Such an encoding shows that Graphol subsumes SROIQ(D), the logical underpin-
ning of OWL 2.
– Graphol is rooted in a standard language for conceptual modeling: the basic com-
ponents of Graphol are taken from the Entity-Relationship (ER) model. Notably,
simple ontologies that correspond to classical ER diagrams (e.g., some OWL 2 QL
ontologies) have in Graphol a representation that is isomorphic to the ER one.
– Graphol comes with some basic tools that support both the graphical editing and
the automatic translation of the diagrams into a corresponding OWL 2 specifica-
tion, to foster the interoperation with standard OWL reasoners and development
environments.
We have adopted Graphol in various industrial projects, where we have produced
large ontologies with hundreds of predicates and axioms. In such projects we could
verify the effectiveness of the language for communicating with domain experts. At the
same time, we exploited Graphol in the editing phase: all ontologies realized in these
projects have indeed been completely specified in our graphical language, whereas an
OWL functional syntax encoding thereof has been obtained automatically through the
use of our translator tool. One of these experiences is described in [2], where the impact
of the use of Graphol on the quality of the realized ontology is widely discussed.
We also conducted some user evaluation tests, where both designers skilled in con-
ceptual modelling (but with no or limited experience in ontology modelling) and users
without specific logic background were involved. From these tests, we obtained promis-
ing results about the effectiveness of our language for both visualizing and editing on-
tologies. A complete description of our evaluation study is given in [5].
For a complete description of both the syntax and the semantics of Graphol we refer
the reader to [5] and to the Graphol web site3 , where it is also possible to download
currently available software tools for our language. In the rest of the paper we instead
discuss how the Graphol demonstration will be carried out.
2 The Graphol demonstration
In this demo we will show the process we devised to obtain an OWL 2 ontology starting
from the specification of a Graphol diagram. Such a process relies on both existing
3 http://www.dis.uniroma1.it/˜graphol
Fig. 1: A simple Graphol ontology
open source tools and original software components. In more detail, to draw a Graphol ontology we make use of the yEd editor for graphs4, which we equip with a palette
containing all and only the symbols needed for Graphol. yEd allows us to save the
ontology in GraphML5 , a popular XML-based file format for encoding graphs.
An example of a Graphol ontology obtained through yEd is given in Figure 1. In
the figure, the reader can see that in Graphol classes (i.e., Person, Car maniac, Car),
object properties (i.e., is owner of car), and data properties (i.e., age) are modeled
by labeled rectangles, diamonds, and circles, respectively, similarly to ER diagrams.
The white (resp. black) square labeled with exists is a graphical constructor that takes
as input a property, through a dashed arrow whose end node is a small diamond, and
returns the domain (resp. the range) of the property. Such squares can also have different
labels, to denote different constructs. In the example, the label (4,-) on the white
square taking as input the is owner of car property specifies a cardinality restriction
on the domain of such property, i.e., it denotes all individuals participating at least 4
times to is owner of car. The solid arrow always indicates a subsumption relation.
This means that Car maniac is a subclass of Person, and also of the complex class
obtained through the cardinality restriction, which implies that a car maniac owns at
least four cars.
Furthermore, the ontology in the example says that the domain of is owner of car is Person, its range is Car, and also that each Person has an age, and that the domain of age is Person. Also, the additional dash orthogonal to the edge connecting age to its domain specifies that this property is functional.
Fig. 2: The Graphol2OWL tool
The above example uses only a limited set of the constructors available in Graphol. Participants in the demo will be provided with the yEd editor and the Graphol palette to draw their own ontologies, experiencing the entire expressive power of the language.
4 http://www.yworks.com/en/products_yed_about.html
5 http://graphml.graphdrawing.org/
To both check the correctness of the specification and translate it into OWL 2, we
developed a dedicated tool. The tool provides a syntactic validation of a given diagram:
while parsing the GraphML file, if a portion of the graph is found that does not respect
the Graphol syntax, the tool reports an error to the user in a pop-up window and visual-
izes this portion by means of an external yEd viewer. A screenshot of this tool showing
an error identified in a Graphol diagram is given in Figure 2. In this example, the error
consists in linking a class to a property with a solid arrow, which actually corresponds to an incorrect subsumption between a concept and a role.
The translator to obtain OWL 2 encodings from Graphol will be used during the
demo. We will also show the compatibility of the produced OWL 2 functional syntax
file with popular tools for ontology editing and management, like Protégé 6 .
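As a hedged sketch (not the actual translator) of the target encoding, the axioms read off the diagram in Figure 1 can be emitted in OWL 2 functional syntax with the OWL API; the namespace and class/property IRIs below are placeholders:

import java.io.File;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.formats.FunctionalSyntaxDocumentFormat;
import org.semanticweb.owlapi.model.*;

// Sketch: build a few axioms corresponding to the Figure 1 diagram and save them
// in OWL 2 functional syntax. This illustrates the target encoding, not the Graphol tool.
public class GrapholToOwlSketch {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager mgr = OWLManager.createOWLOntologyManager();
        OWLDataFactory df = mgr.getOWLDataFactory();
        String ns = "http://example.org/cars#"; // placeholder namespace
        OWLOntology onto = mgr.createOntology(IRI.create(ns));

        OWLClass person = df.getOWLClass(IRI.create(ns + "Person"));
        OWLClass car = df.getOWLClass(IRI.create(ns + "Car"));
        OWLClass carManiac = df.getOWLClass(IRI.create(ns + "Car_maniac"));
        OWLObjectProperty owns = df.getOWLObjectProperty(IRI.create(ns + "is_owner_of_car"));
        OWLDataProperty age = df.getOWLDataProperty(IRI.create(ns + "age"));

        mgr.addAxiom(onto, df.getOWLSubClassOfAxiom(carManiac, person));
        // A car maniac owns at least four cars (the (4,-) cardinality restriction).
        mgr.addAxiom(onto, df.getOWLSubClassOfAxiom(carManiac, df.getOWLObjectMinCardinality(4, owns)));
        mgr.addAxiom(onto, df.getOWLObjectPropertyDomainAxiom(owns, person));
        mgr.addAxiom(onto, df.getOWLObjectPropertyRangeAxiom(owns, car));
        mgr.addAxiom(onto, df.getOWLDataPropertyDomainAxiom(age, person));
        mgr.addAxiom(onto, df.getOWLFunctionalDataPropertyAxiom(age));

        mgr.saveOntology(onto, new FunctionalSyntaxDocumentFormat(),
                IRI.create(new File("cars.ofn").toURI()));
    }
}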
3 Future Work
Our main future work on Graphol is the development of editing tools (stand-alone sys-
tems or plugins of existing ontology development environments) tailored to the specifi-
cation of ontologies in our graphical language and integrated with state-of-the-art rea-
soners. At the same time, we are working to improve ontology visualization in Graphol,
by investigating mechanisms to automatically extract ontology views at different levels
of detail on the basis of specific user requests.
References
1. do Amaral, F.N.: Model outlines: A visual language for DL concept descriptions. Semantic
Web J. 4(4), 429–455 (2013)
2. Antonioli, N., Castanò, F., Coletta, S., Grossi, S., Lembo, D., Lenzerini, M., Poggi, A., Virardi,
E., Castracane, P.: Ontology-based data management for the italian public debt. In: Proc. of
FOIS (2014), to Appear
3. Brockmans, S., Volz, R., Eberhart, A., Löffler, P.: Visual modeling of OWL DL ontologies
using UML. In: Proc. of ISWC. pp. 198–213. Springer (2004)
4. Cerans, K., Ovcinnikova, J., Liepins, R., Sprogis, A.: Advanced OWL 2.0 ontology visualization in OWLGrEd. In: Proc. of DB&IS. pp. 41–54 (2012)
5. Console, M., Lembo, D., Santarelli, V., Savo, D.F.: The Graphol language for
ontology specification, available at http://www.dis.uniroma1.it/˜graphol/
documentation/GrapholLVPrel.pdf
6. Dau, F., Eklund, P.W.: A diagrammatic reasoning system for the description logic ALC. J.
Vis. Lang. Comput. 19(5), 539–573 (2008)
7. Krivov, S., Williams, R., Villa, F.: GrOWL: A tool for visualization and editing of OWL
ontologies. J. of Web Semantics 5(2), 54–57 (2007)
8. Object Management Group: Ontology definition metamodel. Tech. Rep. formal/2009-05-01,
OMG (2009), available at http://www.omg.org/spec/ODM/1.0
9. Staab, S., Studer, R. (eds.): Handbook on Ontologies. International Handbooks on Information
Systems, Springer, 2nd edn. (2009)
6 A preview of the demo is available at http://www.dis.uniroma1.it/˜graphol/research.html.
LIVE: a Tool for Checking Licenses Compatibility
between Vocabularies and Data
Guido Governatori1*, Ho-Pun Lam1, Antonino Rotolo2, Serena Villata3,
Ghislain Atemezing4 , and Fabien Gandon3
1 NICTA Queensland Research Laboratory
firstname.lastname@nicta.com.au
2 University of Bologna
antonino.rotolo@unibo.it
3 INRIA Sophia Antipolis
firstname.lastname@inria.fr
4 Eurecom
auguste.atemezing@eurecom.fr
Abstract In the Web of Data, licenses specifying the terms of use and reuse are associated not only with datasets but also with vocabularies. However, even less support is provided for taking the licenses of vocabularies into account than the (already limited) support available for dataset licenses. In this paper, we present a framework called LIVE that supports data publishers in verifying license compatibility, taking into account both the licenses associated with the vocabularies and those assigned to the data built using such vocabularies.
1 Introduction
The license of a dataset in the Web of Data can be specified within the data, or outside
of it, for example in a separate document linking the data. In line with the Web of
Data philosophy [3], licenses for such datasets should be specified in RDF, for instance
through the Dublin Core vocabulary1 . Despite such guidelines, still a lot of effort is
needed to enhance the association of licenses to data on the Web, and to process licensed
material in an automated way. The scenario becomes even more complex when another
essential component of the Web of Data is taken into account: the vocabularies. Our goal is to support the data provider in assigning a license to her data and in verifying its compatibility with the licenses associated with the adopted vocabularies. We address this need by proposing an online framework called LIVE2 (LIcenses VErification) that exploits the formal approach to license composition proposed in [2] to verify the compatibility of a set of heterogeneous licenses. LIVE, after retrieving the licenses associated with the vocabularies used in the dataset under analysis, supports data providers in verifying whether the license assigned to the dataset is compatible with those of the vocabularies, and returns a warning when this is not the case.
* NICTA is funded by the Australian Government as represented by the Department of Broadband,
Communications and the Digital Economy and the Australian Research Council through the
ICT Centre of Excellence program.
1 http://purl.org/dc/terms/license
2 The online tool is available at http://www.eurecom.fr/~atemezin/licenseChecker/
2 The LIVE framework
The LIVE framework is a Javascript application combining HTML and Bootstrap. Hence, installation has no prerequisites. Since the tool is written in Javascript, the best way to monitor the execution time is with the performance.now() function. We use the 10 LOD datasets with the highest number of links towards other LOD datasets, available at http://lod-cloud.net/state/#links. For each of the URLs in Datahub, we retrieve the VoID3 file in Turtle format, and we use the voidChecker function4 of the LIVE tool to retrieve the associated license, if any. The input of the LIVE framework (Figure 1) consists of the dataset (URI or VoID) whose license has to be verified. The framework is composed of two modules. The first module takes care of retrieving the vocabularies used in the dataset and, for each vocabulary, retrieves the associated license5 (if any) by querying the LOV repository. The second module takes as input the set of licenses (i.e., the licenses of the vocabularies used in the dataset as well as the license assigned to the dataset) to verify whether they are compatible with each other. The result returned by the module is a yes/no answer. In case of a negative answer, the data provider is invited to change the license associated with the dataset and check again with the LIVE framework whether further inconsistencies arise.
Figure 1. LIVE framework architecture.
Retrieving licensing information from vocabularies and datasets. Two use-cases are
taken into account: a SPARQL endpoint, or a VoID file in Turtle syntax. In the first
use case, the tool retrieves the named graphs present in the repository, and then the
user is asked to select the URI of the graph that needs to be checked. Having that infor-
mation, a SPARQL query is triggered, looking for entities declared as owl:Ontology, voaf:Vocabulary or object of the void:vocabulary property. The final step is to look up the LOV catalogue to check whether they declare any license. There are two options for checking the license: (i) a "strict checking", where the FILTER clause contains exactly the namespace of the submitted vocabulary, or (ii) a "domain checking", where only the domain of the vocabulary is used in the FILTER clause. The latter option is recommended when only one vocabulary has to be checked for its license. In the second use case, the module parses a VoID file using an N3 parser for Javascript6, and then collects the vocabularies declared in the file, again querying LOV7 to check their licensing information. When the URIs of the licenses associated with the vocabularies and the dataset are retrieved, the module retrieves the machine-readable descriptions of the licenses from the dataset of licenses [1].
3 http://www.w3.org/TR/void/
4 http://www.eurecom.fr/~atemezin/licenseChecker/voidChecker.html
5 Note that the LIVE framework relies on the dataset of machine-readable licenses (RDF, Turtle syntax) presented in [1].
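A sketch of the kind of queries described above, issued with Apache Jena; the endpoint URLs and the exact query strings are approximations, not the ones used by LIVE:

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSet;

// Sketch: (1) find vocabularies declared in a dataset, (2) look up their license in LOV.
// Endpoint URLs and the FILTER strategy below are approximations of what the text describes.
public class LicenseLookupSketch {
    public static void main(String[] args) {
        String datasetEndpoint = "http://example.org/sparql";          // hypothetical dataset endpoint
        String lovEndpoint = "http://lov.okfn.org/dataset/lov/sparql"; // assumed LOV endpoint

        String findVocabs =
                "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n" +
                "PREFIX voaf: <http://purl.org/vocommons/voaf#>\n" +
                "PREFIX void: <http://rdfs.org/ns/void#>\n" +
                "SELECT DISTINCT ?vocab WHERE {\n" +
                "  { ?vocab a owl:Ontology } UNION { ?vocab a voaf:Vocabulary }\n" +
                "  UNION { ?dataset void:vocabulary ?vocab }\n" +
                "}";

        try (QueryExecution qe = QueryExecutionFactory.sparqlService(datasetEndpoint, findVocabs)) {
            ResultSet vocabs = qe.execSelect();
            while (vocabs.hasNext()) {
                String vocab = vocabs.next().getResource("vocab").getURI();
                // "Strict checking": the FILTER contains exactly the namespace of the vocabulary.
                String findLicense =
                        "PREFIX dct: <http://purl.org/dc/terms/>\n" +
                        "SELECT ?license WHERE { ?v dct:license ?license . FILTER(STR(?v) = \"" + vocab + "\") }";
                try (QueryExecution lq = QueryExecutionFactory.sparqlService(lovEndpoint, findLicense)) {
                    ResultSet licenses = lq.execSelect();
                    while (licenses.hasNext()) {
                        System.out.println(vocab + " -> " + licenses.next().get("license"));
                    }
                }
            }
        }
    }
}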
Licenses compatibility verification. The logic proposed in [2] and the license compatibility verification process have been implemented using SPINdle [4] – a defeasible logic reasoner capable of performing inference over defeasible theories with hundreds of thousands of rules.
Figure 2. Licenses compatibility module.
As depicted in Figure 2, after receiving queries from users, the selected licenses (represented using RDF) will be translated into the DFL formalism supported by SPINdle using the RDF–Defeasible Theory Translator. That is, each RDF triple will be translated into a defeasible rule based on the subsumption relation between the subject and object of the triple. In our case, we can use the subject and object of an RDF triple as the antecedent and head of a defeasible rule, respectively. In addition, the translator also supports direct import from the Web and processing of RDF data into SPINdle theories.
6 https://github.com/RubenVerborgh/N3.js
7 Since the LOV endpoint does not support the JSON format in its results, we have uploaded the data to eventmedia.eurecom.fr/sparql.
The translated defeasible theories will then be composed into a single defeasible theory based on the logic proposed in [2], using the Theories Composer. Afterwards, the
composed theory, together with other contextual information (as defined by the user), will be loaded into the SPINdle reasoner to perform a compatibility check before returning the results to the users.
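The following schematic sketch illustrates the triple-to-rule translation step; the textual rule form is illustrative only and does not claim to be SPINdle's concrete input syntax, and the license URI is a placeholder:

import java.util.ArrayList;
import java.util.List;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.rdf.model.StmtIterator;

// Schematic sketch: turn each RDF triple of a license description into a defeasible rule
// whose antecedent is the subject and whose head is the object, as described above.
public class RdfToRulesSketch {

    public static List<String> toRules(Model license) {
        List<String> rules = new ArrayList<>();
        int i = 0;
        StmtIterator it = license.listStatements();
        while (it.hasNext()) {
            Statement st = it.next();
            rules.add("r" + (i++) + ": " + st.getSubject() + " => " + st.getObject());
        }
        return rules;
    }

    public static void main(String[] args) {
        Model license = ModelFactory.createDefaultModel();
        license.read("http://example.org/licenses/cc-by"); // placeholder machine-readable license
        toRules(license).forEach(System.out::println);
    }
}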
We have evaluated the time performance of the LIVE framework in two steps. First, we evaluated the time performance of the licenses compatibility module: it needs about 6ms to compute the compatibility of two licenses. Second, we evaluated the time performance (Chrome v. 34) of the whole LIVE framework for the 10 LOD datasets with the highest number of links towards other LOD datasets, considering both the licenses retrieval module and the licenses compatibility one. The results show that LIVE provides the compatibility evaluation in less than 5 seconds for 7 of the selected datasets. The time performance of LIVE is mostly affected by the first module, while the compatibility module does not produce a significant overhead. For instance, consider Linked Dataspaces8, a dataset where we retrieve the licensing information of both the dataset and the adopted vocabularies. In this case, LIVE retrieves 48 vocabularies in 13.20s; the license for the dataset is CC-BY, and the PDDL license is attached to one of the vocabularies9. The time for verifying the compatibility is 8ms, leading to a total of 13.208s.
3 Future perspectives
We have introduced the LIVE framework for license compatibility. The goal of the framework is to verify the compatibility of the licenses associated with the vocabularies exploited to create an RDF dataset and the license associated with the dataset itself. Several points have to be taken into account as future work. More precisely, in the present paper we consider vocabularies as data, but this is not the only possible interpretation. For instance, we may see vocabularies as a kind of compiler, such that, after the creation of the dataset, the external vocabularies are no longer used. In this case, what is a suitable way of defining compatibility verification? We will investigate this issue, and we will also evaluate the usability of the online LIVE tool in order to improve its user interface.
References
1. Cabrio, E., Aprosio, A.P., Villata, S.: These are your rights: A natural language processing
approach to automated rdf licenses generation. In: ESWC2014, LNCS (2014)
2. Governatori, G., Rotolo, A., Villata, S., Gandon, F.: One license to compose them all - a deontic
logic approach to data licensing on the web of data. In: International Semantic Web Conference
(1). Lecture Notes in Computer Science, vol. 8218, pp. 151–166. Springer (2013)
3. Heath, T., Bizer, C.: Linked Data: Evolving the Web into a Global Data Space. Morgan &
Claypool (2011)
4. Lam, H.P., Governatori, G.: The making of SPINdle. In: Proceedings of RuleML, LNCS 5858.
pp. 315–322. Springer (2009)
8 http://270a.info/
9 http://purl.org/linked-data/cube
The Map Generator Tool*
Valeria Fionda1, Giuseppe Pirrò2, Claudio Gutierrez3
1
Department of Mathematics, University of Calabria, Italy
2
WeST, University of Koblenz-Landau, Germany
3
DCC, Universidad de Chile, Chile
Abstract. We present the MaGe system, which helps users and devel-
opers to build maps of the Web graph. Maps abstract and represent in a
concise and machine-readable way regions of information on the Web.
1 Introduction
The Web is a large and interconnected information space (usually modeled as
a graph) commonly accessed and explored via navigation enabled by browsers.
To cope with the size of this huge (cyber)space, Web users need to track, record
and specify conceptual regions on the Web (e.g., a set of Web pages; friends
and their interests; a network of citations), for their own use, for exchanging, for
further processing. Users often navigate large fragments of the Web, to discover
and isolate very few resources of interest and struggle to keep connectivity in-
formation among them. The idea of a map of a Web region is essentially that of
representing in a concise way information in the region in terms of connectivity
among a set of distinguished resources (nodes).
Fig. 1. Building maps of the Web.
With the advent of the Web of Data [4], maps to describe and navigate infor-
mation on the Web in a machine-processable way become more feasible. The key new technical support consists of: (i) the availability of a standard infrastructure, based
on the Resource Description Framework (RDF), for the publishing/interlinking
of structured data on the Web; (ii) a community of developers; (iii) languages
to specify regions of information on the Web [1]. Fig. 1 sketches the high level
process of building maps of the Web via the Map Generator (MaGe) system.
* V. Fionda was supported by the European Commission, the European Social Fund and the Calabria region.
2 The Map framework
The idea of a map on the Web is to represent in a concise and comprehensive
way connectivity information between pairs of distinguished nodes. Given a con-
ceptual region G of information on the Web, there can be several maps of G
with different level of detail (i.e., nodes and edges to be included).
Formally, let G = (VG, EG) be a Web region, where VG and EG are the set of nodes and the set of edges, respectively. Then:
• u → v denotes an edge (u, v) ∈ EG.
• u ⇝ v denotes a path from u to v in G.
• Let N ⊆ VG. Then, u ⇝N v if and only if there is a path from u to v in G not passing through intermediate nodes in N.
Let VM ⊆ VG be the set of distinguished nodes of the Web region G = (VG, EG), i.e., those that we would like to represent.
Definition 1 (Map) A map M = (VM, EM) of G = (VG, EG) is a graph such that VM ⊆ VG and each edge (x, y) ∈ EM implies x ⇝ y in G.
A basic (and highly used) example of a map of the Web are bookmarks. In this case, VM is the set of nodes highlighted or marked, and EM = ∅, that is, there is no connectivity recorded among them. An important idea is that of a good map, i.e., a map which represents connectivity among the distinguished nodes and avoids redundant edges [3].
Definition 2 (Good map) A map M = (VM, EM) of G = (VG, EG) is good if and only if:
1. ∀x, y ∈ VM: x ⇝VM y in G implies x → y in M;
2. ∀x, y ∈ VM: x → y in M implies x ⇝VM y in G.
Good maps have two nice properties: (i) uniqueness and (ii) low complexity of computation. Indeed, given a region G = (VG, EG) and a set of distinguished nodes VM ⊆ VG, there exists a unique good map M = (VM, EM) that is computable in O(|VM| × (|VG \ VM| + |EG|)) by an adaptation of the BFS algorithm.
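A minimal sketch of this computation: a BFS from each distinguished node that never expands other distinguished nodes, so that an edge x → y is added exactly when y is reachable from x without passing through intermediate nodes of VM.

import java.util.*;

// Sketch of the good-map construction: for every distinguished node x, run a BFS over the
// region that never expands other distinguished nodes; every distinguished node y reached
// this way yields an edge x -> y of the map.
public class GoodMapSketch {

    public static Map<String, Set<String>> goodMap(Map<String, List<String>> region,
                                                   Set<String> distinguished) {
        Map<String, Set<String>> map = new HashMap<>();
        for (String x : distinguished) {
            Set<String> edges = map.computeIfAbsent(x, k -> new HashSet<>());
            Set<String> visited = new HashSet<>();
            Deque<String> queue = new ArrayDeque<>(List.of(x));
            while (!queue.isEmpty()) {
                String u = queue.poll();
                for (String v : region.getOrDefault(u, List.of())) {
                    if (distinguished.contains(v)) {
                        edges.add(v);           // reachable without crossing another node of VM
                    } else if (visited.add(v)) {
                        queue.add(v);           // expand only fresh non-distinguished nodes
                    }
                }
            }
        }
        return map;
    }

    public static void main(String[] args) {
        // Tiny region: a -> x -> b -> c, with distinguished nodes {a, b, c}.
        Map<String, List<String>> region = Map.of(
                "a", List.of("x"), "x", List.of("b"), "b", List.of("c"));
        System.out.println(goodMap(region, Set.of("a", "b", "c")));
        // a reaches c only through b, so the good map has edges a -> b and b -> c but not a -> c
    }
}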
3 MaGe: Building Maps of the Web
Maps are built on top of regions of information on the Web. To automate the process of generating regions, MaGe uses the NautiLOD [1,2] language. Given an expression ex, NautiLOD enables the extraction of a Web region G = (VG, EG) such that VG and EG are the sets of nodes and edges visited while evaluating ex. Once the region has been obtained, MaGe computes good maps as sketched in Section 2, considering the set of distinguished nodes VM = {s} ∪ T, where s is the node in the region that corresponds to the seed URI where the evaluation of ex starts and T is the set of nodes satisfying ex.
MaGe has been implemented in Java and is available for download4. It includes two main modules: the selection module and the abstraction module.
4 The MaGe website: http://mapsforweb.wordpress.com
Fig. 2. The GUI of the MaGe tool.
The first
one is responsible for the implementation of the NautiLOD language. In par-
ticular, given a seed URI and an expression, this module retrieves a Web region
and a set of distinguished nodes. The second module, given the Web region and
the set of distinguished nodes, leverages the map framework to build maps. The decoupling between selection and abstraction makes it possible to use the two functionalities separately. MaGe is endowed with a GUI, which is shown in Fig. 2. It includes four main tabs. The first one (Fig. 2 (b)) is used to specify the region via
a NautiLOD expression. The second and fourth display the region retrieved in
RDF and the expression endpoints, respectively. The third tab (Fig. 2 (a)) deals
with the creation of maps and their visualization. Both regions and maps can be
saved in RDF allowing their storage, sharing, reuse and exchange. We now pro-
vide an example that we plan to show (along with others) in the demo. A video
explaining how to use the tool is available at http://youtu.be/BsvAiX3n968.
Maps of Influence. An influence network is a graph where nodes are per-
sons and edges represent influence relations. We leverage information taken from
dbpedia.org and the property dbpprop:influenced.
Example 3 Build a map of a region containing people that have influenced, or have been influenced by, Stanley Kubrick (SK) up to distance 6. The distinguished
nodes must be scientists.
The region can be specified via the following NautiLOD expression. Here, the
URI of SK in DBpedia (dbpedia:Stanley_Kubrick) is used as seed node:
dbpprop:influenced<1-6>[ASK {?p rdf:type dbpedia:Scientist.}]
In the expression, the notation <1-6> is a shorthand for the concatenation of
(up to) six steps of the predicate dbpprop:influenced, while the ASK query in
the test [ ] is used to filter the distinguished nodes (i.e., scientists).
Fig. 3. Region (f) and good map (a) for SK with some zooms (b)-(e).
Fig. 3 (f) reports the region associated with the influence network of SK. The region contains 2981 nodes and 7893 edges. Indeed, in such a region it is very difficult to identify the distinguished nodes and, more importantly, the connectivity among them and with the seed node. Fig. 3 (a)-(e) show the good map of this region (109 nodes; 2629 edges). The abstraction provided by the good map makes it possible to identify the influence path, for instance, between SK and C. Sagan (Fig. 3 (e)).
4 Conclusions
The availability of machine-processable information at a Web scale opens new
perspectives toward the development of systems for the harnessing of knowledge
on the Web. We contend that maps, key devices in helping human navigation
in information spaces, are also meaningful in the Web space. They are useful
navigation cues and powerful ways of conveying complex information via concise
representations. Effectively, they play the role of navigational charts, that is,
tools that provide users with abstractions of regions of information on the Web.
We have implemented the MaGe system to generate maps. During the demo we
will show maps in different domains, including bibliographic networks.
References
1. V. Fionda, C. Gutierrez, and G. Pirrò. Extracting Relevant Subgraphs from Graph
Navigation. In ISWC (Posters & Demos), 2012.
2. V. Fionda, C. Gutierrez, and G. Pirrò. Semantic Navigation on the Web of Data:
Specification of Routes, Web Fragments and Actions. In WWW, 2012.
3. V. Fionda, C. Gutierrez, and G. Pirrò. Knowledge Maps of Web Graphs. In KR,
2014.
4. T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space.
Morgan & Claypool, 2011.
Named Entity Recognition using FOX
René Speck and Axel-Cyrille Ngonga Ngomo
AKSW, Department of Computer Science, University of Leipzig, Germany
{speck,ngonga}@informatik.uni-leipzig.de
Abstract. Unstructured data still makes up an important portion of the Web.
One key task towards transforming this unstructured data into structured data is
named entity recognition. We demo FOX, the Federated knOwledge eXtraction
framework, a highly accurate open-source framework that implements RESTful
web services for named entity recognition. Our framework achieves a higher F-
measure than state-of-the-art named entity recognition frameworks by combining
the results of several approaches through ensemble learning. Moreover, it disam-
biguates and links named entities against DBpedia by relying on the AGDISTIS
framework. As a result, FOX provides users with accurately disambiguated and
linked named entities in several RDF serialization formats. We demonstrate the
different interfaces implemented by FOX within use cases pertaining to extracting
entities from news texts.
1 Introduction
The Semantic Web vision requires the data on the Web to be represented in a machine-
readable format. Given that a significant percentage of the data available on the Web
is unstructured, tools for transforming text into RDF are of central importance. In this
demo paper, we present FOX, the federated knowledge extraction framework.1 It inte-
grates state-of-the-art named entity recognition (NER) frameworks by using ensemble
learning (EL). By these means, FOX achieves up to 95.23% F-measure, whereas the
best current state-of-the-art system (Stanford NER) achieves 91.68% F-measure.
In this paper, we aim to demonstrate several of the features of FOX, including the large
number of input and output formats it supports, the different bindings with which FOX
can be integrated into Java and Python code, and the easy extension model underly-
ing the framework. Our framework is already being used in several systems, including
SCMS [5], ConTEXT [3] and IR frameworks [8]. The approach underlying FOX is
described in [7], which will be presented at the same conference. All features presented
herein will be part of the demonstration.
2 Demonstration
The goal of the demonstration will be to show the whole of the FOX workflow from
the gathering and preprocessing of input data to the generation of RDF data. In addi-
tion, we will show how to configure and train FOX after it has been enhanced with a
1 FOX online demo: http://fox-demo.aksw.org; FOX project page: http://fox.aksw.org;
source code, evaluation data and evaluation results: http://github.com/AKSW/FOX.
novel NER tool or EL algorithm. Further, we will present FOX’s feedback RESTful
service to improve the training and test datasets. In the demonstration, we also go over
the Python2 and Java bindings3 for easy use of FOX's RESTful service within an
application. At the end, we will explain how to use the FOX Java interfaces to integrate
future algorithms.
2.1 Workflow
The workflow underlying FOX consists of four main steps: (1) preprocessing of the
unstructured input data, (2) recognizing the Named Entities (NE), (3) linking the NE to
resources using AGDISTIS [9] and (4) converting the results to an RDF serialization
format.
Preprocessing FOX allows users to use a URL, text with HTML tags or plain text as
input data (see the top left part of Figure 1). The input can be provided in a form
(see the center of Figure 1) or via FOX's web service. In the case of a URL, FOX sends
a request to the given URL to retrieve the input data. Then, for all input formats, FOX
removes HTML tags and detects sentences and tokens.
Fig. 1. Request form of the FOX online demo.
We will use text examples, URLs and text with HTML tags to show how FOX
gathers or cleans them for the sake of entity recognition.
2 https://pypi.python.org/pypi/foxpy
3 https://github.com/renespeck/fox-java
Entity Recognition Our approach relies on four state-of-the-art NER tools so far: (1)
the Stanford Named Entity Recognizer (Stanford) [2], (2) the Illinois Named Entity
Tagger (Illinois) [6], (3) the Ottawa Baseline Information Extraction (Balie) [4] and
(4) the Apache OpenNLP Name Finder (OpenNLP) [1]. FOX also allows using a particu-
lar NER approach integrated in it (see bottom right of Figure 1). To this end,
FOX light has to be set to the absolute path of the class of the tool to use. If FOX light
is off, then FOX runs these four NER tools in parallel and stores the received NEs
for further processing. It maps the entity types of each of the NER tools to the classes
Location, Organization and Person. Finally, the results of all tools are merged
by using FOX's EL layer as discussed in [7]. We will show the named entities recog-
nized by FOX and contrast them with those recognized by the other tools. Moreover,
we will show the runtime log that FOX generates to illustrate FOX's scalability.
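FOX's actual ensemble layer is a trained model (see [7]); purely as an illustration of how the outputs of several NER tools can be merged, a naive token-level majority vote might look as follows. The tool outputs and labels below are invented for the example and do not reflect FOX's learned combination.

# Illustrative only: a naive majority vote over token-level NER labels.
# FOX itself uses a trained ensemble-learning layer [7], not this heuristic.
from collections import Counter

def majority_vote(predictions):
    """predictions: one label sequence per tool, all of equal length."""
    merged = []
    for labels in zip(*predictions):                 # one tuple of labels per token
        label, count = Counter(labels).most_common(1)[0]
        merged.append(label if count > 1 else "O")   # require agreement of at least two tools
    return merged

# Hypothetical outputs of four tools for the tokens ["Barack", "Obama", "visited", "Leipzig"]
stanford = ["PERSON", "PERSON", "O", "LOCATION"]
illinois = ["PERSON", "PERSON", "O", "O"]
balie    = ["PERSON", "O",      "O", "LOCATION"]
opennlp  = ["O",      "PERSON", "O", "LOCATION"]
print(majority_vote([stanford, illinois, balie, opennlp]))
# ['PERSON', 'PERSON', 'O', 'LOCATION']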
Entity Linking FOX makes use of AGDISTIS [9], an open-source named entity dis-
ambiguation framework able to link entities against every linked data knowledge base,
to disambiguate entities and to link them against DBpedia. In contrast to lookup-based
approaches, our framework can also detect resources that are not in DBpedia. In this
case, these are assigned their own URIs. Moreover, FOX provides a Java interface and
a configuration file for easy integration of other entity linking tools. We will show the
messages that FOX generates and sends to AGDISTIS as well as the answers it receives
and serializes.
Serialization Formats FOX is designed to support a large number of use cases. To
this end, our framework can serialize its results into the following formats: JSON-LD4 ,
N-Triples5 , RDF/JSON6 , RDF/XML7 , Turtle8 , TriG9 , N-Quads10 . FOX allows the user
to choose between these formats (see bottom left part of Figure 1). We will show what
the output of FOX looks like in the different formats and point to how they can be parsed.
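As a rough sketch of how the web service might be called from Python to obtain one of these serializations: the endpoint path and the parameter names ("input", "type", "task", "output") below are assumptions and should be checked against the FOX documentation; only the demo host is taken from the footnote above.

# Sketch of a call to the FOX RESTful service; parameter names and the
# "/api" path are assumptions, not the confirmed FOX API.
import json
import urllib.request

payload = json.dumps({
    "input": "Leipzig is a city in Germany.",
    "type": "text",        # plain text rather than a URL
    "task": "NER",         # named entity recognition
    "output": "Turtle",    # one of the serialization formats listed above
}).encode("utf-8")

req = urllib.request.Request(
    "http://fox-demo.aksw.org/api",   # assumed path on the demo instance
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as response:
    print(response.read().decode("utf-8"))   # RDF describing the recognized entities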
3 Evaluation and Results
We performed a thorough evaluation of FOX by using five different datasets and com-
paring it with state-of-the-art NER frameworks (see Table 1). Our evaluation shows that
FOX clearly outperforms the state of the art. The details of the complete evaluation are
presented in [7]. The evaluation code and datasets are also available at FOX’s Github
page, i.e., http://github.com/AKSW/FOX.
4 http://www.w3.org/TR/json-ld
5 http://www.w3.org/TR/n-triples/
6 http://www.w3.org/TR/rdf-json
7 http://www.w3.org/TR/REC-rdf-syntax
8 http://www.w3.org/TR/turtle
9 http://www.w3.org/TR/trig
10 http://www.w3.org/TR/n-quads
Table 1. Comparison of the F-measure of FOX with the included NER tools. Best results are
marked in bold font.
          token-based                            entity-based
         News  News* Web   Reuters All     News  News* Web   Reuters All
FOX      92.73 95.23 68.81 87.55   90.99   90.70 93.09 63.36 81.98   90.28
Stanford 90.34 91.68 65.81 82.85   89.21   87.66 89.72 62.83 79.68   88.05
Illinois 80.20 84.95 64.44 85.35   79.54   76.71 83.34 54.25 83.74   76.25
OpenNLP  73.71 79.57 49.18 73.96   72.65   67.89 75.78 43.99 72.89   67.66
Balie    71.54 79.80 40.15 64.78   69.40   69.66 80.48 35.07 68.71   67.82
4 Conclusion
We will present FOX, a NER framework which relies on EL, and demonstrate how it
can be used. In future work, we will extend the number of tools integrated in FOX.
Moreover, we will extend the tasks supported by the framework. In particular, we aim
to integrate tagging, keyword extraction as well as relation extraction in the near future.
References
1. J. Baldridge. The OpenNLP project, 2005.
2. Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating non-local in-
formation into information extraction systems by Gibbs sampling. In ACL, pages 363–370,
2005.
3. Ali Khalili, Sören Auer, and Axel-Cyrille Ngonga Ngomo. ConTEXT – lightweight text analytics
using linked data. In 11th Extended Semantic Web Conference (ESWC2014), 2014.
4. David Nadeau. Balie—baseline information extraction: Multilingual information extraction
from text with machine learning and natural language techniques. Technical report,
University of Ottawa, 2005.
5. Axel-Cyrille Ngonga Ngomo, Norman Heino, Klaus Lyko, René Speck, and Martin
Kaltenböck. SCMS - Semantifying Content Management Systems. In Proceedings of the
International Semantic Web Conference, 2011.
6. Lev Ratinov and Dan Roth. Design challenges and misconceptions in named entity recog-
nition. In Proceedings of the Thirteenth Conference on Computational Natural Language
Learning, CoNLL ’09, pages 147–155, Stroudsburg, PA, USA, 2009. Association for Com-
putational Linguistics.
7. René Speck and Axel-Cyrille Ngonga Ngomo. Ensemble learning for named entity recog-
nition. In Proceedings of the International Semantic Web Conference, Lecture Notes in
Computer Science, 2014.
8. Ricardo Usbeck. Combining linked data and statistical information retrieval. In 11th Extended
Semantic Web Conference, PhD Symposium. Springer, 2014.
9. Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, Sören Auer, Daniel Gerber, and Andreas Both.
Agdistis - agnostic disambiguation of named entities using linked open data. In Submitted to
12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia, 2013.
A Linked Data Platform adapter
for the Bugzilla issue tracker
Nandana Mihindukulasooriya,
Miguel Esteban-Gutiérrez, and Raúl Garcı́a-Castro
Center for Open Middleware
Ontology Engineering Group, Escuela Técnica Superior de Ingenieros Informáticos
Universidad Politécnica de Madrid, Spain
{nmihindu,mesteban,rgarcia}@fi.upm.es
Abstract. The W3C Linked Data Platform (LDP) specification defines
a standard HTTP-based protocol for read/write Linked Data and pro-
vides the basis for application integration using Linked Data. This paper
presents an LDP adapter for the Bugzilla issue tracker and demonstrates
how to use the LDP protocol to expose a traditional application as a
read/write Linked Data application. This approach provides a flexible
LDP adoption strategy with minimal changes to existing applications.
1 Introduction
The W3C Linked Data Platform (LDP) is an initiative to produce a standard
protocol and a set of best practices for the development of read/write Linked
Data applications [1]. The LDP protocol provides the basis for a novel paradigm
of application integration using Linked Data1 in which each application exposes
its data as a set of Linked Data resources and the application state is driven
following the REST design principles [2].
Some advantages of this approach over traditional SOAP-based web services
include: (a) global identifiers for data that can be accessed using the Web in-
frastructure and typed links between data from different applications [3]; (b) the
graph-based RDF data model that allows consuming and merging data from dif-
ferent sources without having to do complex structural transformations; and (c)
explicit semantics of data expressed in RDF Schema or OWL ontologies which
can be aligned and mapped to data models of other applications using techniques
such as ontology matching.
This approach is more suitable when the integration is data-intensive and the
traceability links between different applications are important. The Application
Lifecycle Management (ALM) domain, in which heterogeneous tools are used
in different phases of the software development lifecycle, provides a good use
case for this approach. The ALM iStack project2 has developed a prototype for
1 http://www.w3.org/DesignIssues/LinkedData.html
2 https://sites.google.com/a/centeropenmiddleware.com/alm-istack/
integrating ALM tools by using the LDP protocol and this paper presents an
LDP adapter developed to LDP-enable the Bugzilla3 issue tracker.
2 An LDP adapter for Bugzilla
The three main alternatives for LDP-enabling an application are: (a) native
support built into the application; (b) an application plugin; and (c) an LDP
adapter. Providing native support requires modifying the application, and
not all applications allow extensions through plugins. As was seen in the
early stages of web services [4], adapters provide a more flexible mechanism for
gradually adopting a technology while using existing tools with minimal
changes, and we have leveraged this approach.
An application is defined in terms of its data model and business logic. An
LDP-enabled application exposes its data as Linked Data and allows its
business logic to be driven following the REST design principles. Thus, to LDP-enable
an application, its data model should be expressed in RDF by mapping it to a
new ontology or by reusing existing vocabularies. In the Bugzilla adapter, the
Bugzilla native data model is mapped to the ALM iStack ontology4 . The adapter
exposes the Bugzilla data as LDP resources by transforming the data between
the ALM iStack ontology and the Bugzilla native model so that LDP clients can
consume RDF data from Bugzilla as if it was a native LDP application.
The Bugzilla LDP adapter, which is a JavaEE web application, consists of
three main layers: (a) LDP layer, (b) transformation layer, and (c) application
gateway layer, as illustrated in Figure 1.
The LDP layer handles the LDP communications and exposes the Bugzilla
data as LDP resources. This layer is built using the LDP4j framework5 which
provides a middleware for the development of read/write Linked Data appli-
cations [5]. Concepts such as bugs, products, product versions, and users
are mapped to LDP containers, which list these entities and allow the creation of new
ones. Each entity, such as a bug, a product, or a user, is mapped to an LDP
resource with its own URI that can be used by clients to access it.
The transformation layer handles data validation and transformation.
This includes extracting information from RDF data, validating them based
on application restrictions, and mapping them to the Bugzilla model. The ALM
iStack ontology is generic so that it can be used with other issue trackers (e.g.,
JIRA6 , Redmine7 ); thus there is an impedance mismatch between the ontology
and the Bugzilla native model which is managed by the adapter.
The application gateway layer handles the communication with the Bug-
zilla instance using its XML-RPC remote interface. Because the Bugzilla bug
tracker is also accessed using its web UI, the adapter synchronizes with the
3 http://www.bugzilla.org/
4 http://delicias.dia.fi.upm.es/ontologies/alm-istack.owl
5 http://www.ldp4j.org/
6 https://www.atlassian.com/software/jira
7 http://www.redmine.org/
Bugzilla instance based on user-defined policies. In addition, there are several
cross-cutting services, such as configuration management, consistency, security,
and synchronization, which are used by multiple layers.
Fig. 1. High-level architecture of the Bugzilla adapter
3 Demonstration
This demonstration shows how LDP clients can use the adapter to access the
Bugzilla bug tracker and to perform tasks such as discovering bugs reported
against a product, modifying the status or other properties of a bug, or
creating new bugs (e.g., Fig. 2 shows a creation request and response).
Fig. 2. Creation of a new bug using the Bugzilla LDP adapter
For example, a continuous integration server in an integrated ALM setting
encounters a build failure. Thus, the “integration server agent” (1) wants to
report a defect (2) titled “Bugzilla adapter build is broken” (3) with descrip-
tion “Bugzilla adapter build fails due to a test failure” (4) for the “version 1.0
of the Bugzilla Adapter ” product (5) that is related to the “issue 730698 ” in
“https://bugzilla.mozilla.org/ ” (6). The LDP client converts this message to an
LDP request according to the ALM iStack ontology as shown in Figure 2.
Once this request is received by the adapter, it extracts the necessary in-
formation, transforms it into the Bugzilla model using a mapping between the
ontology and Bugzilla models, and creates a bug in the Bugzilla instance using
its remote XML-RPC interface. Once created, the Bugzilla instance returns the
identifier for the bug inside Bugzilla. Then, the adapter generates a URI for
the bug and manages the mapping between the identifier given by Bugzilla
and the URI. Any information that does not fit into the Bugzilla model, such as
links to external applications, is maintained in the adapter. Finally, the adapter
returns the URI using the Location header (7) and lets the client know it is an
LDP resource using the “type” link relation (8) according to the LDP protocol.
The LDP client or other external applications can access and link to the bug
using the URI returned by the adapter. In addition, clients can modify the bug
using the PUT operation with modified RDF data, which will then be propagated
to the Bugzilla instance following a similar process.
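A minimal sketch of such an interaction from an LDP client's point of view is given below. The container URL, the example namespace and the property names are hypothetical stand-ins; the real adapter expects terms from the ALM iStack ontology as shown in Figure 2.

# Sketch of creating a bug through an LDP container and reading the Location
# header of the response. Container URL and vocabulary terms are hypothetical.
import urllib.request

turtle = """@prefix ex: <http://example.org/alm#> .
<> a ex:Defect ;
   ex:title "Bugzilla adapter build is broken" ;
   ex:description "Bugzilla adapter build fails due to a test failure" .
"""

req = urllib.request.Request(
    "http://localhost:8080/ldp/bugs/",      # assumed LDP container for bugs
    data=turtle.encode("utf-8"),
    headers={"Content-Type": "text/turtle"},
    method="POST",
)
with urllib.request.urlopen(req) as response:
    bug_uri = response.headers["Location"]   # URI minted by the adapter
    print("Created LDP resource:", bug_uri)
    print("Link header:", response.headers.get("Link"))  # advertises the LDP resource type

# The bug could later be updated with an HTTP PUT of modified Turtle to bug_uri.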
4 Conclusion
In this paper, we presented the Bugzilla LDP adapter and provided an overview
of how to build adapters for LDP-enabling existing applications in order to use
them as read/write Linked Data applications. With minimal changes to the
existing application, the Bugzilla LDP adapter enables semantic integration of
the Bugzilla tool with other LDP-enabled applications and makes it possible to
have typed links between application data.
Acknowledgments: The authors are partially supported by the ALM iStack
project of the Center for Open Middleware.
References
1. Speicher, S., Arwe, J., Malhotra, A.: Linked Data Platform 1.0 (June 2014) W3C
Candidate Recommendation, http://www.w3.org/TR/ldp/.
2. Mihindukulasooriya, N., Garcı́a-Castro, R., Esteban-Gutiérrez, M.: Linked Data
Platform as a novel approach for Enterprise Application Integration. In: Proceed-
ings of the 4th International Workshop on Consuming Linked Data (COLD2013),
Sydney, Australia (Oct 2013)
3. Heath, T., Bizer, C.: Linked Data: Evolving the Web into a Global Data Space.
Synthesis lectures on the semantic web: theory and technology 1(1) (2011) 1–136
4. Benatallah, B., Casati, F., Grigori, D., Nezhad, H.R.M., Toumani, F.: Developing
adapters for web services integration. In: Advanced Information Systems Engineer-
ing, Springer (2005) 415–429
5. Esteban-Gutiérrez, M., Mihindukulasooriya, N., Garcı́a-Castro, R.: LDP4j: A frame-
work for the development of interoperable read-write Linked Data applications. In:
Proceedings of the 1st ISWC Developers Workshop, Riva del Garda, Italy (Oct
2014)
LED: curated and crowdsourced Linked Data on
Music Listening Experiences
Alessandro Adamou1 , Mathieu d’Aquin1 , Helen Barlow1 and Simon Brown2
1 The Open University, United Kingdom
{alessandro.adamou, mathieu.daquin, helen.barlow}@open.ac.uk
2 Royal College of Music, United Kingdom
simon.brown@rcm.ac.uk
Abstract. We present the Listening Experience Database (LED), a
structured knowledge base of accounts of listening to music in docu-
mented sources. LED aggregates scholarly and crowdsourced contribu-
tions and is heavily focused on data reuse. To that end, both the storage
system and the governance model are natively implemented as Linked
Data. Reuse of data from datasets such as the BNB and DBpedia is inte-
grated with the data lifecycle from the entry phase onwards, and several content
management functionalities are implemented using semantic technolo-
gies. Imported data are enhanced through curation and specialisation
with degrees of granularity not provided by the original datasets.
Keywords: Linked Data, Crowdsourcing, Digital Humanities, Data Workflow
1 Introduction
Most research on listening to music focuses on investigating associated cognitive
processes or analysing its reception by critics or commercial indicators such as
sales. There is only sporadic research on the cultural and aesthetic position of
music among individuals and societies over the course of history. One obstacle
to this kind of research is the sparsity of primary source evidence of listening to
music. Should such evidence be compiled, we argue that the adoption of explicit
structured semantics would help highlight the interactions of listeners with a
range of musical repertoires, as well as the settings where music is performed.
With the Listening Experience Database (LED)1 , we aim at covering
this ground. LED is the product of a Digital Humanities project focused on
gathering documented evidence of listening to music across history and musical
genres. It accepts contributions from research groups in the humanities as well as
the crowdsourcing community; however, the data management workflow is super-
vised to guarantee that minimum scholarly conventions are met. Being conceived
with data reuse in mind, LED is natively implemented as Linked Data. All the
operations in the data governance model manipulate triples within, and across,
named RDF graphs that encode provenance schemes for users of the system.
1 LED, online at http://www.open.ac.uk/Arts/LED
Several content management functionalities available in LED, such as content
authoring, review, reconciliation and faceted search, incorporate Linked Data
reuse. Reused datasets include DBpedia2 and the British National Bibliography
(BNB)3 , with music-specific datasets currently under investigation. Reused data
are also enhanced, as the LED data model is fine-grained and allows for describing
portions of documents and excerpts, which are not modelled in the datasets at
hand. LED therefore also aims at being a node in its own right in the Linked
Data Cloud, providing unique content and contributing to existing data too.
At the time of writing, the LED dataset stores about 1,000 listening experience
records contributed by 25 users, half of whom are volunteers from the crowd.
2 Related work
A similar effort in aggregating structured data on primary evidence was already
carried out for reading experiences [1], though the process was not data-driven
and the resulting Linked Data were only marginally aligned. We also acknowl-
edge an ongoing project which gathers direct personal experiences from
young users, albeit with a minimal data structure4. We also drew
inspiration from earlier accounts of using DBpedia for music, such as the dbrec
recommender [3]. Crowdsourcing is also gaining the attention of the Semantic
Web community, with very recent attempts at tackling data quality aspects [4].
3 The Listening Experience Database
We define a listening experience (LE) as a documented (i.e. with a quotable
and citable source) engagement of an individual in an event where some piece
of music is played. In terms of conceptual modelling, a LE is a subjective event,
and one document describing it is the quoted evidence reported in the database.
The lifecycle of data in LED involves the roles of contributor, consumer and
gatekeeper, and states called draft, submitted, public and blacklisted. Every arti-
fact stored in the system exists in one or more of these states (except blacklisted
ones, which exclude all other states), and a state determines if a user with a
certain role can “see” an artifact or not. What these artifacts are depends on
the specific phases of the workflow, which are transitions between these states.
Authoring. Contributors populate the knowledge base by entering data on
a LE and its associated entities. The entry forms are dynamic and provide sug-
gestions and autocompletion data from LED and external datasets in real time
(cf. Figure 1). Artifacts declared during this phase remain in a draft state, only
to enter a submitted state once the contributor submits the LE to gatekeepers.
Review. Privileged users with the gatekeeper role review a submitted artifact
and either promote it to the public state, or reject it for blacklisting, or demote it
2 DBpedia, http://dbpedia.org
3 British National Bibliography, http://bnb.data.bl.uk
4 Experiencing Music, http://experiencingmusic.com
to draft again, which they can do by either taking over the artifact and amending
its data themselves, or sending it back to the original contributor.
Reconciliation. Gatekeepers can align and merge duplicate artifacts that
are found to match. They can compare candidate duplicates with other artifacts
in LED and third-party data. This operation does not modify their state.
Faceted search. Consumers can navigate LE’s by filtering keyword search
results by bespoke criteria which are not necessarily stored in LED, but also
reused from third-party datasets. Only public artifacts contribute to searches.
(a) Listening experience submission. (b) Autocompletion from the BNB dataset.
Fig. 1: Example of data entry for “Pictures from Italy” by Charles Dickens.
With a native Linked Data implementation, we can immediately integrate
reuse with every stage of the data lifecycle starting with data entry, and eliminate
a posteriori revision and extraction phases from the workflow, thereby reducing
the time-to-publish of our data and having them linked right from the beginning.
Also, the named graph model of quad-stores can encode provenance information
with the granularity of atomic statements [2], thus lending itself to fine-grained
and complex trust management models.
To encode the above workflow entirely in RDF, we used the named graph
paradigm in order to represent states and artifacts. Deciding on the scale of
the latter was an issue: while we intended to give gatekeepers control over single
RDF triples (or quads, from the named graph perspective), and to contributors
a way to support the truth or falsehood of a triple, this can be complex and
time-consuming. Therefore, artifacts are encapsulated into LE’s, musical works,
literary sources, agents (e.g. people, groups or organisations) and places: these
are, for instance, the classes of artifacts that gatekeepers may want to review or
reconcile. However, LE’s remain the core artifacts of the system: only by creating
or editing them can their associated artifacts be generated.
The LED knowledge base is partitioned into data spaces, each belonging
to a user or role. Every contributor owns two RDF graphs, one for draft ar-
tifacts and one for submitted ones. Thus, we can keep track of which con-
tributors support a fact by reusing it (e.g. ). There is a single graph for public artifacts, and one
for blacklisted ones. Contributors have access to the graphs they own plus the
public graph; gatekeepers can access every user’s submitted graph and the pub-
lic and blacklist graphs. State transitions are realised by parametric SPARQL
queries that selectively move RDF triples across these graphs. Along with these
data spaces there are rules that determine the visibility of triples to each user,
depending on the content of their private graphs. In general, these rules assume
that contributors have greater confidence in the facts in their possession and that, when
these are missing, they should trust those provided by the community or other datasets.
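For illustration, a state transition of this kind can be expressed as a SPARQL UPDATE that moves an artifact's triples from a contributor's submitted graph into the public graph. The graph URIs, the update endpoint and the deliberately simplified selection pattern below are assumptions; they are not LED's actual parametric queries.

# Illustrative sketch of a "submitted -> public" transition over named graphs.
# Graph URIs, endpoint and selection pattern are assumptions, not LED's code.
import urllib.parse
import urllib.request

UPDATE_TEMPLATE = """
DELETE {{ GRAPH <{submitted}> {{ <{artifact}> ?p ?o }} }}
INSERT {{ GRAPH <{public}>    {{ <{artifact}> ?p ?o }} }}
WHERE  {{ GRAPH <{submitted}> {{ <{artifact}> ?p ?o }} }}
"""

def publish(artifact_uri, contributor):
    update = UPDATE_TEMPLATE.format(
        submitted=f"http://example.org/graphs/{contributor}/submitted",  # assumed graph URIs
        public="http://example.org/graphs/public",
        artifact=artifact_uri,
    )
    data = urllib.parse.urlencode({"update": update}).encode("utf-8")
    req = urllib.request.Request("http://localhost:3030/led/update", data=data)  # assumed endpoint
    urllib.request.urlopen(req)

publish("http://example.org/led/experience/42", "alice")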
4 Demonstration
The audience will be given a live demonstration of the LED system from
the point of view of users with the privileged roles of contributor and gatekeeper.
We will show the benefits of reusing data from indexed datasets during the entry
phase, as well as the implementation of our governance model in Linked Data
and its effects on the representation of a resource as seen by the general public
or a specific user. Data reuse and enhancement will be demonstrated through a
LE entry form to be auto-populated in real time and open to input by audience
members. To demonstrate the governance model, we will run two distinct entries
with shared data through the whole draft-submission-gatekeeping lifecycle. We
will then show how di↵erently the shared data and their RDF representations
appear to each user, based on the trust and provenance policies in place.
References
1. Matthew Bradley. The Reading Experience Database. Journal of Victorian Culture,
15(1):151–153, 2010.
2. Jeremy J. Carroll, Christian Bizer, Patrick J. Hayes, and Patrick Stickler. Named
graphs, provenance and trust. In Allan Ellis and Tatsuya Hagino, editors, WWW,
pages 613–622. ACM, 2005.
3. Alexandre Passant. dbrec - music recommendations using DBpedia. In Peter F.
Patel-Schneider, Yue Pan, Pascal Hitzler, Peter Mika, Lei Zhang, Jeff Z. Pan, Ian
Horrocks, and Birte Glimm, editors, International Semantic Web Conference (2),
volume 6497 of Lecture Notes in Computer Science, pages 209–224. Springer, 2010.
4. Elena Simperl, Maribel Acosta, and Barry Norton. A semantically enabled architec-
ture for crowdsourced linked data management. In Ricardo A. Baeza-Yates, Stefano
Ceri, Piero Fraternali, and Fausto Giunchiglia, editors, CrowdSearch, volume 842 of
CEUR Workshop Proceedings, pages 9–14. CEUR-WS.org, 2012.
WhatTheySaid: Enriching UK Parliament
Debates with Semantic Web
Yunjia Li, Chaohai Ding, and Mike Wald
School of Electronics and Computer Science,
University of Southampton, UK
{yl2,cd8e10,mw}@ecs.soton.ac.uk
Abstract. To improve the transparency of politics, the UK Parliament
Debate archives have been published online for a long time. However,
there is still a lack of efficient ways to analyse the debate data in depth.
WhatTheySaid is an initiative to solve this problem by applying natural
language processing and semantic Web technologies to enrich UK Parlia-
ment Debate archives and publish them as linked data. It also provides
various data visualisations for users to compare debates over years.
Keywords: linked data, parliamentary debate, semantic web
1 Introduction
The public availability of UK Parliament debates, for example through BBC Parliament1, has
exerted a tremendous influence on the transparency of politics in the UK. Political fig-
ures need to be accountable for what they have said in the debates, as they are
monitored by the public. However, it is currently still difficult to automatically
analyse the debate archives to answer questions such as how the de-
bates across months or even years are related to each other. For this purpose, we
have developed WhatTheySaid2 (WTS), which uses semantic Web and natural
language processing (NLP) technologies to automatically enrich the UK Parlia-
ment debates and categorize them for searching, visualisation and comparison.
In the UK, there are already applications, such as TheyWorkForYou3, that pro-
vide extended functions for users to search debates and view the performance
of each Member of Parliament (MP), such as their voting history and recent ap-
pearances. The semantic Web approach is also applied in Polimedia [1] as a way
to model Dutch Parliament debates and enrich them with named entities and
external links to different media. In this demo, we build on the data sources and
the methodologies provided by these previous works and add more advanced
features to fulfil the following requirements: (R1) Calculate the similarities be-
tween debates so that users can easily navigate through similar debates; (R2)
Categorise debates into different topics and extract the key statements, so that
1 http://www.bbc.co.uk/democracylive/bbc_parliament/
2 http://whattheysaid.org.uk
3 http://theyworkforyou.com
users can easily spot statements that contradict each other; (R3)
Based on R2, link the debates to fragments of the debate video archive, so that
users can watch the video fragment as proof of a statement; (R4) Analyse
the speeches of a particular MP and see how their sentiment changes over time.
To demonstrate the implementation of the requirements above, we have taken the
UK House of Commons debate data for 2013 from TheyWorkForYou as the sample
dataset; the following sections walk through the system.
2 Semantic Model of UK Parliament Debate
The WTS ontology4 models the UK Parliament debate structure and the agents involved.
This ontology reuses vocabularies such as FOAF5 and the Ontology for Media
Resource6. When designing this ontology, we first referred to the data
structure of TheyWorkForYou, where one debate is identified by a Heading and
a Heading contains one or more Speeches. We have also added several attributes
to Speech, such as the sentiment score, primary topic, summary text and related
media fragment, in order to store the data required to implement R2, R3 and R4
in Section 1.
Fig. 1. WhatTheySaid Ontology
3 System Design and Walk-through
Figure 2 shows the architecture of the WTS application. Our major data sources are
the debate information from TheyWorkForYou, including debate date, speakers,
headings, the text of speeches in each debate, etc., and the debate video with
automatic transcripts provided by the BBC Parliament archive. We then use Alche-
myAPI7 to perform sentiment analysis on each speech in the debates, so that
4 http://www.whattheysaid.org.uk/ontology/v1/whatheysaid.owl
5 http://www.foaf-project.org
6 http://www.w3.org/TR/mediaont-10/
7 http://www.alchemyapi.com/
each speech made by a speaker is assigned a score between 1.0 (posi-
tive) and -1.0 (negative). For speeches with more than 1000 characters, we also
carry out topic detection and text summarisation using AlchemyAPI.
To link the debates to each other, we apply the TF-IDF [3] algorithm to calculate
the similarity scores between each pair of debates. We first merge the plain text
of all the speeches in a debate into one big debate document d. Then, given a
debate document collection D, a document d ∈ D and a word w, we calculate the
weighting Wd:
Wd = fw,d · log(|D| / fw,D)    (1)
where fw,d is the number of times w appears in d, |D| is the size of the corpus,
and fw,D is the number of documents in D in which w appears [3]. In information
retrieval, the Vector Space Model (VSM) represents each document in a collec-
tion as a point in a space, and the semantic similarity of words depends
on the distance between the related points [4]. Once Wd is calculated for each
document, we use cosine similarity8 over the vector space to compute the
similarity score between any two debate documents. On the user interface, every
time a debate document is viewed, we list the top ten debates most similar
to it, so that users can easily navigate through similar debates.
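For illustration, the weighting and cosine-similarity computation described above can be reproduced with scikit-learn. This is a sketch of the general technique rather than the WTS implementation; the sample documents are invented and scikit-learn's IDF differs slightly from Equation (1) in its smoothing.

# Sketch of TF-IDF weighting plus cosine similarity between debate documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each string stands for one merged "debate document" d (all speeches concatenated).
debates = [
    "energy prices and the cost of living",
    "household energy bills and winter fuel payments",
    "reform of the national health service",
]

tfidf = TfidfVectorizer()                 # term-frequency times (smoothed) inverse document frequency
weights = tfidf.fit_transform(debates)    # one weighted vector per debate document
scores = cosine_similarity(weights)       # pairwise similarity matrix

# Top matches for the first debate (excluding itself), as listed on the UI.
ranking = scores[0].argsort()[::-1]
print([i for i in ranking if i != 0][:10])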
Fig. 2. WhatTheySaid Architecture Diagram
For named entity recognition, we use DBpedia Spotlight9 to extract named
entities and interlink those concepts with the speeches in which they are mentioned.
All the enrichment information is saved in a triple store implemented with
rdfstore-js10, which also exposes a SPARQL endpoint for data querying and vi-
sualisation. For the whole of 2013, we have collected 68968 speeches,
and more than 400K named entities (with duplicates) have been recognised.
Using the model defined in Figure 1, we have generated more than 1.2 million
triples.
8 http://en.wikipedia.org/wiki/Cosine_similarity
9 https://github.com/dbpedia-spotlight
10 https://github.com/antoniogarrote/rdfstore-js
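As a rough sketch of how a single speech could be annotated through the DBpedia Spotlight web service: the endpoint URL, parameters and response fields below follow Spotlight's commonly documented REST interface, but should be treated as assumptions rather than the exact calls used by WTS.

# Sketch of annotating a speech with the DBpedia Spotlight REST interface.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "text": "The Prime Minister answered questions on the National Health Service.",
    "confidence": 0.5,
})
req = urllib.request.Request(
    "https://api.dbpedia-spotlight.org/en/annotate?" + params,  # assumed endpoint
    headers={"Accept": "application/json"},
)
with urllib.request.urlopen(req) as response:
    body = json.loads(response.read().decode("utf-8"))

for resource in body.get("Resources", []):
    print(resource["@surfaceForm"], "->", resource["@URI"])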
We visualise the enriched debate data in various ways. Firstly, we use both
a heat map and a line chart to visualise the sentiment scores of speeches for each
MP on a yearly (see Figure 3(a)) and monthly basis, respectively. We also provide
a timeline visualisation (Figure 3(b)) for the statements in different topics made
by a certain MP. To implement R3, we have referred to previous work [2]
and designed a replay page with the transcript and named entities aligned with
the fragments of debate video11. The full demo is available online12 and the
RDF dataset is published for download13. We plan to expand the appli-
cation with more debates from earlier years, so that debates across years can be
interlinked and enriched for analysis.
Fig. 3. WhatTheySaid Data Visualisation
4 Acknowledgement
This mini-project is funded by the EPSRC Semantic Media Network. We also
would like to thank Yves Raimond from BBC and Sebastian Riedel from UCL
for the support of this mini-project.
References
1. Juric, D., Hollink, L., Houben, G.J.: Bringing parliamentary debates to the semantic
web. Detection, Representation, and Exploitation of Events in the Semantic Web
(2012)
2. Li, Y., Rizzo, G., Troncy, R., Wald, M., Wills, G.: Creating enriched YouTube media
fragments with NERD using timed-text (2012)
3. Ramos, J.: Using tf-idf to determine word relevance in document queries. In: Pro-
ceedings of the First Instructional Conference on Machine Learning (2003)
4. Turney, P.D., Pantel, P., et al.: From frequency to meaning: Vector space models of
semantics. Journal of artificial intelligence research 37(1), 141–188 (2010)
11 Due to copyright issues, we cannot make the debate video publicly available.
12 http://whattheysaid.org.uk
13 http://whattheysaid.org.uk/download/wtstriple.ttl
Multilingual Disambiguation of Named Entities
Using Linked Data
Ricardo Usbeck} , Axel-Cyrille Ngonga Ngomo} , Wencan Luo~ , and Lars
Wesemann
} University of Leipzig, Germany ,
R & D, Unister GmbH, Leipzig, Germany,
~ University of Pittsburgh, United States of America
email: {usbeck|ngonga}@informatik.uni-leipzig.de
Abstract. One key step towards extracting structured data from un-
structured data sources is the disambiguation of entities. With AGDIS-
TIS, we provide a time-efficient, state-of-the-art, knowledge-base-agnostic
and multilingual framework for the disambiguation of RDF resources.
The aim of this demo is to present the English, German and Chinese
version of our framework based on DBpedia. We show the results of
the framework on texts pertaining to manifold domains including news,
sports, automobiles and e-commerce. We also summarize the results of
the evaluation of AGDISTIS on several languages.
1 Introduction
A significant portion of the information on the Web is still only available in
textual format. Addressing this information gap between the Document Web
and the Data Web requires, amongst others, the extraction of entities and rela-
tions between these entities from text. One key step during this processing is
the disambiguation of entities (also known as entity linking). The AGDISTIS
framework [7] (which will also be presented at this conference) addresses two
of the major drawbacks of current entity linking frameworks [1,2,3]: time com-
plexity and accuracy. With AGDISTIS, we have developed a framework that
achieves polynomial time complexity and outperforms the state of the art w.r.t.
accuracy. The framework is knowledge-base-agnostic (i.e., it can be deployed on
any knowledge base) and is also language-independent. In this demo, we will
present AGDISTIS deployed on three different languages (English, German and
Chinese) and three different knowledge bases (DBpedia, the German DBpedia
and the Chinese DBpedia). To the best of our knowledge, we therewith provide
the first Chinese instantiation of entity linking to DBpedia. We will also demon-
strate the AGDISTIS web services endpoints for German, English and Chinese
disambiguation and show how data can be sent to the endpoints. Moreover, the
output format of AGDISTIS will be explained. An online version of the demo is
available at http://agdistis.aksw.org/demo.
2 Demonstration
Within our demonstration, we aim to show how AGDISTIS can be used by
non-expert as well as expert users. For non-experts, we provide a graphical user
interface (GUI). Experts can choose to use the REST interfaces provided by
the tool or use a Java snippet to call the REST interface. The whole of this
functionality, which will be described in more detail in the following sections,
will also be demonstrated at the conference.
2.1 AGDISTIS for non-expert users
A screenshot of the AGDISTIS GUI is shown in Figure 1. This GUI supports
the following workflow.
Fig. 1: Screenshot of the demo with an English example which is already anno-
tated.
Entity Recognition After typing or pasting text into the input field, users can
choose between either annotating the entities manually or having the entities
detected automatically. In the first case, the labels of the entities are to be
marked by using square brackets (see central panel of Figure 1). In the case of
an automatic annotation, we send the text to the FOX framework, which has
been shown to outperform the state of the art in [6]. We will demonstrate this
feature by using both manually pre-annotated text and text without annotations
in our examples (see upper bar of Figure 1). Moreover, we will allow the crowd
to enter arbitrary texts that pertain to their domain of interest.
Automatic Language Detection Once the user has set which entities are to
be disambiguated, the marked-up text is sent to the language detection module
based on [5]. We chose this library because it is both precise (> 99% precision)
and time-efficient. If the input is detected to belong to one of the languages
we support (i.e., German, Chinese, English), then we forward the input to a
dedicated AGDISTIS instance for this given language. In all other cases, an error
message is shown to the user, indicating that the language at hand is not
supported. The main advantage of this approach is that the user does not need
to manually select the language of the text, thus leading to
an improved user experience. We will demonstrate this feature by entering text
in different languages (German, English, French, Chinese, etc.) and presenting
the output of the framework for each of these test cases.
Entity Linking This is the most important step of the whole workflow. The
annotated text is forwarded to the corresponding language-specific deployment
of AGDISTIS, each of which relies on a language-specific version of DBpedia 3.9.
The approach underlying AGDISTIS [7] is language-independent and combines
breadth-first search and the well-known HITS algorithm. In addition, string
similarity measures and label expansion heuristics are used to account for typos
and morphological variations in naming. Moreover, Wikipedia-specific surface
forms for resources can be used.
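Purely as an illustration of the graph-based ingredient of this step, the sketch below runs HITS on a small candidate graph and picks the best-scoring candidate per surface form. The graph, candidates and scoring rule are invented; AGDISTIS's actual combination of BFS expansion, HITS, string similarity and label expansion is described in [7].

# Illustrative only: rank candidate resources for a surface form by their
# HITS scores in a small, hand-made candidate graph.
import networkx as nx

g = nx.DiGraph()
# Hypothetical edges between candidate DBpedia resources found by graph expansion.
g.add_edges_from([
    ("dbr:Barack_Obama", "dbr:Washington,_D.C."),
    ("dbr:Barack_Obama", "dbr:United_States"),
    ("dbr:Washington,_D.C.", "dbr:United_States"),
    ("dbr:Barack_Obama_(album)", "dbr:Music"),
])

hubs, authorities = nx.hits(g)

candidates = {"Barack Obama": ["dbr:Barack_Obama", "dbr:Barack_Obama_(album)"]}
for surface_form, uris in candidates.items():
    best = max(uris, key=lambda u: authorities.get(u, 0.0) + hubs.get(u, 0.0))
    print(surface_form, "->", best)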
Output Within the demo, the annotated text is shown below the input field,
where disambiguated entities are colored to highlight them. While hovering over a
highlighted entity, the disambiguated URI is shown. We will demonstrate the
output of the entity linking by using the examples shown in the upper part of
Figure 1. The output of the system will be shown both in an HTML version
and made available as a download in JSON. Moreover, we will allow interested
participants to enter their own examples and view the output of the tool.
2.2 AGDISTIS for expert users
To support different languages, we set up a REST URI for each of the language
versions. Each of these endpoints understands two mandatory parameters: (1)
text, which is a UTF-8 and URL-encoded string with entities annotated with
an XML tag, and (2) type='agdistis' to disambiguate with the AGDIS-
TIS algorithm. In the future, several wrappers will be implemented to use differ-
ent entity linking algorithms for comparison. The following cURL1 snippet shows
how to address the web service (see also http://agdistis.aksw.org):
curl --data-urlencode "text='Barack Obama arrives
in Washington, D.C..'" -d type='agdistis'
{AGDISTIS URL}/AGDISTIS
1 http://curl.haxx.se/
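For convenience, the same request can be issued from Python. The sketch below simply mirrors the cURL call, keeping the {AGDISTIS URL} placeholder; the exact annotation tags expected around entity mentions in the text parameter should be taken from the service documentation.

# Sketch: the cURL call above issued from Python; {AGDISTIS URL} is kept as
# a placeholder exactly as in the snippet above.
import urllib.parse
import urllib.request

AGDISTIS_URL = "{AGDISTIS URL}"  # fill in the deployment URL

data = urllib.parse.urlencode({
    "text": "'Barack Obama arrives in Washington, D.C..'",
    "type": "agdistis",
}).encode("utf-8")

req = urllib.request.Request(AGDISTIS_URL + "/AGDISTIS", data=data)
with urllib.request.urlopen(req) as response:
    print(response.read().decode("utf-8"))   # JSON with the disambiguated URIs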
3 Evaluation
English and German Evaluation. AGDISTIS has been evaluated on 8 dif-
ferent datasets from diverse domains such as news, sports or business reports.
For English datasets AGDISTIS is able to outperform the currently best dis-
ambiguation framework, TagMe2, on three out of four datasets by up to 29.5%
F-measure. Considering the only German dataset available for named entity dis-
ambiguation, i.e., news.de [4], we are able to outperform the only competitor
DBpedia Spotlight by 3% F-measure.
Chinese Evaluation. We evaluated the Chinese version of AGDISTIS within
a question answering setting. To this end, we used the multilingual benchmark
provided in QALD-42 . Since the Chinese language is not supported, we extended
the QALD-4 benchmark by translating the English questions to Chinese and
inserting the named entity links manually. The accuracies achieved by AGDISTIS
for the train and test datasets are 65% and 70% respectively.
4 Conclusion
We presented the demo of AGDISTIS for three different languages on three
different DBpedia-based knowledge bases. In future work, we aim to create a
single-server multilingual version of the framework that will intrinsically support
several languages at the same time. To this end, we will use a graph merging
algorithm to combine the different versions of DBpedia into a single graph. The
disambiguation steps will then be carried out on this unique graph.
Acknowledgments This work has been supported by the ESF and the Free State of
Saxony and the FP7 project GeoKnow (GA No. 318159).
References
1. Paolo Ferragina and Ugo Scaiella. Fast and accurate annotation of short texts with
wikipedia pages. IEEE software, 29(1), 2012.
2. Pablo N. Mendes, Max Jakob, Andres Garcia-Silva, and Christian Bizer. DBpedia
Spotlight: Shedding light on the web of documents. In Proceedings of the 7th
International Conference on Semantic Systems (I-Semantics), 2011.
3. Andrea Moro, Alessandro Raganato, and Roberto Navigli. Entity linking meets
word sense disambiguation: a unified approach. TACL, 2:231–244, 2014.
4. Michael Röder, Ricardo Usbeck, Sebastian Hellmann, Daniel Gerber, and Andreas
Both. N3 - a collection of datasets for named entity recognition and disambiguation
in the nlp interchange format. In LREC, 2014.
5. Nakatani Shuyo. Language detection library for java, 2010.
6. René Speck and Axel-Cyrille Ngonga Ngomo. Ensemble learning for named entity recognition.
In International Semantic Web Conference. 2014.
7. Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, Sören Auer, Daniel Gerber, and An-
dreas Both. Agdistis - agnostic disambiguation of named entities using linked open
data. In International Semantic Web Conference. 2014.
2 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald
The Wikipedia Bitaxonomy Explorer
Tiziano Flati and Roberto Navigli
Dipartimento di Informatica
Sapienza Università di Roma
Abstract. We present WiBi Explorer, a new Web application devel-
oped in our laboratory for visualizing and exploring the bitaxonomy of
Wikipedia, that is, a taxonomy over Wikipedia articles aligned to a tax-
onomy over Wikipedia categories. The application also enables users to
explore and convert the taxonomic information into RDF format. The
system is publicly accessible at wibitaxonomy.org and all the data is
freely downloadable and released under a CC BY-NC-SA 3.0 license.
1 Introduction
Knowledge modeling is a long-standing problem which has been addressed in a
variety of ways (see [8] for a survey). If we leave aside knowledge-lean taxonomy
learning approaches [9], a typical and widespread model consists of knowledge
resources and multilingual dictionaries which provide concepts and relationships
between concepts. The scenario is characterized by two types of resources: those,
such as BabelNet [6], which provide general untyped relationships, and those,
such as DBpedia [1], in which edges model arbitrarily labelled predicates over
concepts (e.g., dbpedia-owl:birthPlace).
In neither of these resource types, however, is any explicit attention paid to
hypernymy as a distinct relation type. Yet hypernymy has proven to
be a relevant relation type, capable of improving systems on several hard tasks
in Natural Language Processing [2, 7]. Indeed, even restricting ourselves to Wikipedia, no
high-quality, large-scale taxonomy is yet available that exhibits high coverage
for both Wikipedia pages and categories.
WiBi [4] is a project set up with the specific aim of providing hypernymy
relations over Wikipedia, and our tests confirm it as the best current resource
for taxonomizing both Wikipedia pages and categories in a joint fashion, with
state-of-the-art results. Here we present a Web application for visualizing and
exploring our bitaxonomy of Wikipedia. The interface also offers a customization
of the “view” and allows the export of data into RDF, in line with today’s
Semantic Web trend.
2 The Wikipedia Bitaxonomy
WiBi [4] is an approach which aims at building a bitaxonomy of Wikipedia, that
is, automatically extracting two taxonomies, one for Wikipedia pages and one
for Wikipedia categories, aligned to one another.
The bitaxonomy is built thanks to a three-phase approach that i) first builds
a taxonomy for the Wikipedia pages, then ii) leverages this partial information
to iteratively infer new hypernymy relations over Wikipedia categories while at
the same time increasing the page taxonomy, and finally iii) refines the obtained
category taxonomy by means of three ad-hoc heuristics that cope with structural
problems affecting some categories. As a result, a bitaxonomy is obtained where
each element - either page or category - is associated with one or more hypernyms
and where elements of one taxonomy are aligned (i.e., linked) to elements of the
other taxonomy. In order to transfer hypernymy knowledge from either one of
the two Wikipedia sides to the other side, the whole process remarkably, and as
a key feature, exploits categorization edges (here called cross-edges) manually
provided by Wikipedians, which connect any page on one side to its categories
on the other side and vice versa. Extensive comparison has been carried out
on two datasets of 1,000 pages and categories each, against all the available
knowledge resources, including MENTA, DBpedia, YAGO, WikiTaxonomy and
WikiNet (for an extensive survey, see [5]). Results show that WiBi surpasses all
competitors not only in terms of quality, with the highest precision and recall,
but also in terms of coverage and specificity.
3 The demo interface
Here we present a Web-based visual explorer for displaying the two aligned
taxonomies of WiBi, centered on any given Wikipedia item of interest chosen
by the user. The interface easily integrates search facilities with customization
tools which personalize the experience from a user’s point of view.
The home page. An excerpt of the interface’s home page is shown in Fig.
1(a). As can be seen, this page has been kept very clean with as few elements
as possible. At the top of the page, a navigation bar contains links to i) the
about page, which contains release information about the website content, ii) a
download area, where it is possible to obtain the data underlying the interface
and iii) the search page, which represents the core contribution of this work.
The search page mainly contains a text area in which the user is requested
to input her query of interest, additionally opting for searching through either
the page inventory, the category inventory or both, thanks to dedicated radio
buttons. After the query is sent, the search engine tries to match the input text
against the whole database of Wikipedia pages (or categories) and, if a match
is found, the engine displays the final result to the user. Otherwise, the query
is interpreted as a lemma and the user is presented with the (possibly empty) list of all
Wikipedia pages/categories whose lemma matches the query.
The result page. Starting from the Wikipedia element provided by the user,
the objective of the result page is to show a relevant excerpt of the bitaxonomy,
that is, the nearest (or more relevant) nodes connected to it, drawn from both
of the two taxonomies. To do this, WiBi Explorer performs a series of steps:
1. Start a DFS of maximum depth d1 from the given element p of a taxonomy. As a
result, a subgraph ST1 = (SV1, SE1) is obtained;
(a) WiBi Explorer’s home page. (b) Result for the ISWC Wikipedia page.
Fig. 1. The Wikipedia Bitaxonomy Explorer overview.
2. Collect all the nodes π(p) belonging to the other taxonomy (i.e., those whose cross-
edges are incident to p). Start a DFS of maximum depth d2 from each element in
π(p). As a result, a subgraph ST2 = (SV2, SE2) is obtained;
3. Display ST1 and ST2, as well as all the possible cross-edges linking nodes of the
two subgraphs. Prune out low-connected nodes from the displayed bitaxonomy.
As a result, the interface displays a meaningful excerpt of the two taxonomies,
centered on the issued query. The result for the Wikipedia page International
Semantic Web Conference is shown in Fig. 1(b).
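The excerpt-extraction procedure above can be sketched compactly as follows; plain adjacency dictionaries stand in for the two taxonomies, and both the pruning criterion and the cross-edge handling are simplified placeholders rather than WiBi Explorer's actual implementation.

# Sketch of the excerpt extraction: depth-bounded DFS around the queried
# element, then a second DFS from its cross-linked elements in the other
# taxonomy. Data structures and pruning are simplified.
def dfs(graph, start, max_depth):
    """Return the set of nodes reachable from start within max_depth hops."""
    visited, stack = {start}, [(start, 0)]
    while stack:
        node, depth = stack.pop()
        if depth == max_depth:
            continue
        for succ in graph.get(node, []):
            if succ not in visited:
                visited.add(succ)
                stack.append((succ, depth + 1))
    return visited

def excerpt(page_taxonomy, category_taxonomy, cross_edges, p, d1=1, d2=2):
    sv1 = dfs(page_taxonomy, p, d1)                       # step 1: ST1 around p
    sv2 = set()
    for c in cross_edges.get(p, []):                      # step 2: pi(p), then ST2
        sv2 |= dfs(category_taxonomy, c, d2)
    return sv1, sv2                                       # step 3: display and pruning omitted

# Tiny hypothetical taxonomies (edges point to hypernyms).
pages = {"ISWC": ["Academic_conference"], "Academic_conference": ["Conference"]}
categories = {"Cat:CS_conferences": ["Cat:Conferences"], "Cat:Conferences": ["Cat:Events"]}
cross = {"ISWC": ["Cat:CS_conferences"]}
print(excerpt(pages, categories, cross, "ISWC"))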
Customization of the view Since a user might be interested in a more general
view of the bitaxonomy, two additional sliders are provided to the user in order
to manually adjust the two maximum depths d1 and d2 (see Fig. 1(b), top).
Moreover, the interface provides the user with the capability to click on nodes
and interactively explore different parts of the taxonomy. The application thus
acts as a dynamic explorer that enables users to navigate through the structure
of the bitaxonomy and discover new relations as the visit proceeds.
4 Converting data to RDF
Interestingly, data can also be exported in RDF format, in line with recent
work on (linguistic) linked open data and the Semantic Web [3]. To this end, the
explorer is backed by the Apache Jena framework (https://jena.apache.org/)
and thus also integrates a single-click functionality that seamlessly converts the
displayed data into RDF format. The user can opt for Turtle, RDF/XML or
N-Triple format (see blue box in Fig. 1(b), bottom left). An excerpt of a view of
the bitaxonomy converted into RDF for the query ISWC is shown in Fig. 2. As
can be seen, several namespaces are used: WiBi-specific entities encode
Wikipedia items, while the standard SKOS subsumption relations (skos:narrower
and skos:broader) encode is-a relations.
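Once exported (an excerpt is shown in Fig. 2 below), the data can be consumed with any RDF library. The following is a minimal sketch using Python's rdflib; the wibi namespace URI is a hypothetical placeholder, since the actual WiBi namespaces are not reproduced in the excerpt.

# Sketch: load an exported Turtle view and walk its is-a (skos:broader) edges
# with rdflib. The wibi namespace URI below is a hypothetical placeholder.
from rdflib import Graph
from rdflib.namespace import SKOS

TURTLE = """
@prefix wibi: <http://example.org/wibi/page/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
wibi:International_Semantic_Web_Conference a skos:Concept ;
    skos:broader wibi:Academic_conference .
wibi:Academic_conference a skos:Concept ;
    skos:broader wibi:Conference .
"""

g = Graph()
g.parse(data=TURTLE, format="turtle")

# Print every hypernymy (is-a) edge encoded via skos:broader.
for concept, hypernym in g.subject_objects(SKOS.broader):
    print(f"{concept} is-a {hypernym}")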
@prefix wibi: .
@prefix wibi-model: .
@prefix skos: .
wibi:International_Semantic_Web_Conference a skos:Concept;
wibi-model:hasWikipediaCategory ;
skos:broader wibi:Academic_conference .
a skos:Concept;
wibi-model:hasWikipediaPage wibi:Academic_conference ;
skos:narrower .
Fig. 2. RDF excerpt of the taxonomy view for the ISWC Wikipedia page.
5 Conclusions
We have proposed the Wikipedia Bitaxonomy Explorer, a new, flexible and ex-
tensible Web interface that allows the navigation of the recently created Wikipedia
Bitaxonomy [4]. In addition to default settings, several parameters concerning
the general appearance of the results can also be customized according to the
user’s preferences. The demo is available at wibitaxonomy.org; it is seamlessly
integrated into the BabelNet interface (http://babelnet.org/), and the data
is freely downloadable under a CC BY-NC-SA 3.0 license.
Acknowledgments
The authors gratefully acknowledge the support of the
ERC Starting Grant MultiJEDI No. 259234.
The authors also acknowledge support from the LIDER project (No. 610782), a
Coordination and Support Action funded by the EC under FP7.
References
1. Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann,
S.: DBpedia - a crystallization point for the Web of Data. Web Semantics 7(3), 154–
165 (2009)
2. Cui, H., Kan, M.Y., Chua, T.S.: Soft Pattern Matching Models for Definitional
Question Answering. ACM Transactions on Information Systems 25(2) (2007)
3. Ehrmann, M., Cecconi, F., Vannella, D., Mccrae, J.P., Cimiano, P., Navigli, R.:
Representing Multilingual Data as Linked Data: the Case of BabelNet 2.0. In: Proc.
of LREC 2014. pp. 401–408. Reykjavik, Iceland
4. Flati, T., Vannella, D., Pasini, T., Navigli, R.: Two Is Bigger (and Better) Than One:
the Wikipedia Bitaxonomy Project. In: Proc. of ACL 2014. pp. 945–955. Baltimore,
Maryland
5. Hovy, E.H., Navigli, R., Ponzetto, S.P.: Collaboratively built semi-structured con-
tent and Artificial Intelligence: The story so far. Artificial Intelligence 194, 2–27
(2013)
6. Navigli, R., Ponzetto, S.P.: BabelNet: The automatic construction, evaluation and
application of a wide-coverage multilingual semantic network. Artificial Intelligence
193, 217–250 (2012)
7. Snow, R., Jurafsky, D., Ng, A.: Semantic taxonomy induction from heterogeneous
evidence. In: Proc. of the COLING-ACL 2006. pp. 801–808
8. Van Harmelen, F., Lifschitz, V., Porter, B.: Handbook of knowledge representation,
vol. 1. Elsevier (2008)
9. Velardi, P., Faralli, S., Navigli, R.: OntoLearn Reloaded: A graph-based algorithm
for taxonomy induction. Computational Linguistics 39(3), 665–707 (2013)
Enhancing Web intelligence with the content of
online video fragments
Lyndon Nixon1, Matthias Bauer1, and Arno Scharl1,2
1 MODUL University, Vienna, Austria
  lyndon.nixon@modul.ac.at
2 webLyzard technology gmbh, Vienna, Austria
  scharl@webLyzard.com
Abstract. This demo shows work to extend a Web intelligence platform,
which crawls and analyses online news and social media content about
climate change topics to uncover sentiment and opinions around those
topics over time, so that it also incorporates the content of non-textual
media, in our case YouTube videos. YouTube contains a lot of organisational
and individual opinion about climate change which currently cannot be
taken into account by the platform's sentiment and opinion mining
technology. We describe the approach taken to extract and include the
content of YouTube videos and why we believe this can lead to improved
Web intelligence capabilities.
1 Introduction
Web intelligence refers to the use of technologies to extract knowledge from data on
the Web, in particular to learn how opinions or sentiment towards specific
topics or concepts change over time by analyzing the content of time-specific Web
data such as activity streams on social media platforms, news stories from trust-
worthy newsfeeds and press releases from relevant organisations. The webLyzard
Web intelligence platform3, which builds on many years of university R&D,
collects and analyzes big data repositories gathered from a
range of electronic sources and uses state-of-the-art Web intelligence tools to re-
veal flows of relevant information between stakeholders through trend analyses,
benchmarks and customized reports. One key domain to which the platform has
been applied is climate change [1], and the insights provided by webLyzard are
being used by the NOAA in the US to inform their online communication.
A public demonstrator “Media Watch on Climate Change” is available at
http://www.ecoresearch.net/climate. This demonstrator analyses news ar-
ticles from British and US news sources, social media and the (RSS) PR feeds of
Fortune 1000 companies. For example, for any searched term, users can see the
frequency of mentions of the term over a selected time period, the level of positive
or negative sentiment expressed around that term, and the extent of disagreement
across sources. The individual sources can be explored and their content displayed,
3 http://webLyzard.com
e.g. text of a news article or company press release, or in social media a tweet or
a YouTube video. While mentions of terms within the textual sources are
used to provide deep analytics on frequency, sentiment and disagreement over
time for that term, any use of the same term within the YouTube videos which
are crawled continually by the platform is disregarded, as details of the video
content are not available to the internal analytics tools of the platform.
In the MediaMixer project (http://mediamixer.eu), whose goal was to promote
innovative media technology supporting fragments and semantics to industry
use cases [2], a collaboration with webLyzard led to a prototype platform
where the content of fragments of crawled YouTube videos could be exposed to
the platform's analytics capabilities and hence video fragments could be made
available to platform search and data visualisation components. This demo paper
describes the approach taken and the resulting implementation (under the name
"videoLyzard")4 as well as how we believe this work can help lead to improved
Web intelligence capabilities for stakeholders such as the NOAA.
2 Technical process and workflow
A new server-side processing pipeline has been created which takes a batch of
YouTube video URLs from the webLyzard crawl component and processes them
by getting transcripts of each YouTube video, performing Named Entity Recog-
nition (NER) over the transcripts, and generating on that basis an annotation
for each YouTube video which identifies the temporal fragments of the video and
the entities which occur in each fragment. These annotations are exported back
into the webLyzard platform and on that basis access to video fragments match-
ing search terms is made possible. videoLyzard makes use of different semantic
and multimedia technologies to (see the sketch after this list):
– split videos into distinct temporal fragments (generally corresponding with
the sentence level in the accompanying speech), using the Media Fragment
URI specification to refer to the fragments by URL5
– extract distinct entities from the textual transcript of the video, using the
aggregation of Named Entity Recognition (NER) Web services called NERD6,
and attach entity annotations to a temporal fragment of the video
– normalize entity identifications to DBpedia URIs, thus using Linked Data
to provide a Web-wide unique identification for each concept, disambiguating
terms which are ambiguous in natural language and connecting annotations
to additional metadata about each entity
– create machine-processable annotations of the video in RDF, using the LinkedTV
Ontology7, which follows the Open Annotation Model8 with specific extensions
for multimedia annotation
4 http://webLyzard.com/video
5 http://www.w3.org/TR/media-frags/
6 http://nerd.eurecom.fr
7 http://linkedtv.eu/ontology
8 http://www.openannotation.org/
– enable computer-aided ‘semantic search’ over video at the fragment
level by storing the generated RDF in a triple store (Sesame) where it can
be queried using SPARQL in combination with queries to complementary
Linked Data repositories.
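As a rough illustration of the kind of annotation this pipeline produces, the sketch below links a temporal Media Fragment URI of a video to a DBpedia entity. It uses plain Open Annotation terms and a placeholder video id; the actual videoLyzard annotations follow the LinkedTV Ontology, so the concrete property names will differ:

import org.apache.jena.rdf.model.*;
import org.apache.jena.vocabulary.RDF;

public class FragmentAnnotation {
    public static void main(String[] args) {
        String OA = "http://www.w3.org/ns/oa#";
        Model m = ModelFactory.createDefaultModel();
        m.setNsPrefix("oa", OA);

        // Media Fragment URI: seconds 10 to 25 of a (placeholder) video id.
        Resource fragment = m.createResource("http://www.youtube.com/watch?v=VIDEO_ID#t=10,25");
        Resource entity   = m.createResource("http://dbpedia.org/resource/Hydroelectricity");

        m.createResource()                                   // the annotation itself
         .addProperty(RDF.type, m.createResource(OA + "Annotation"))
         .addProperty(m.createProperty(OA + "hasTarget"), fragment)
         .addProperty(m.createProperty(OA + "hasBody"), entity);

        m.write(System.out, "TURTLE");
    }
}

Storing such triples in the Sesame store is then what makes fragment-level search by entity possible through SPARQL.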
3 Implementation and results
The public demonstrator 9 initially incorporated 297 annotated YouTube videos
(the videos crawled by MediaWatch in September and October 2013). Now more
YouTube videos are returned for a search (listed in the Documents tab) based
on locating the search term within the video transcript, and the search can be
expanded to show each video fragment containing the search term (listed in
the Quotes tab). For the purpose of easily browsing through transcribed videos,
a new Video Tab (cf. top right corner, Figure 1) plays back the entire video
(from the Documents tab) or just the fragment which matched the search (from
the Quotes tab). An administrator interface is also available to allow admins to
launch new video processing batches via videoLyzard as well as monitor running
processes through to the successful export of the generated video annotations.
Considering the small video dataset analyzed so far, which is much smaller
than the total number of YouTube videos in webLyzard's crawl
index, it is noteworthy how much new relevant content can be uncovered now
that the platform search is able to include video fragments in its search index.
For example, the search term “hydroelectricity” in the live site returns a total
of 3 YouTube videos for the period November 2013 to July 2014. On the other
hand, the prototype with videoLyzard annotations finds 6 YouTube video frag-
ment matches for the 2 month period alone, all matched against semantically
similar terms (hydropower, hydro geothermal, hydro-electric) which have been
normalized to the DBPedia term for hydroelectricity in the NER step.
4 Future work and conclusion
Our next step is to scale up the number of annotated YouTube videos, with
the goal of reaching near real-time accessibility of annotated YouTube
content. Based on knowledge of common terms in the specific domain (climate
change), we also plan to research means to clean up YouTube’s automatic tran-
scriptions prior to performing a more domain-specific NER over them. Since a
video fragment at sentence level typically does not contain enough context for
viewer understanding, we also want to explore less granular fragmentation of
the videos for playback, as well as use of semantic and sentiment information
attached to fragments to drive exploration of related (or opposing) fragments
around a topic. In conclusion, given the growing use of audiovisual content to
9 Available at http://link.weblyzard.com/video-showcase with an increasing number of annotated videos.
Fig. 1: Video fragment search and playback in webLyzard
communicate about topics online as opposed to just text, Web intelligence plat-
forms miss out on significant amounts of information when they do not consider
video material such as that being shared by organisations and individuals on
YouTube. The videoLyzard prototype shows that even a small amount of video
analysis can uncover additional intelligence for stakeholders, with semantic tech-
nologies playing a key role in associating content to distinct entities.
Acknowledgments
This work is supported by the EU FP7 funded Support Action MediaMixer
(www.mediamixer.eu).
References
1. ”Media Watch on Climate Change - Visual Analytics for Aggregating and
Managing Environmental Knowledge from Online Sources”. A Scharl, A
Hubmann-Haidvogel, A Weichselbraun, H-P Lang and M Sabou. In 46th
Hawaii International Conference on Systems Sciences (HICSS-46), Maui,
USA, January 2013
2. ”Second Demonstrators”, L Nixon et al., MediaMixer Deliverable D2.2.3,
April 2014
EMBench: Generating Entity-Related Benchmark Data
Ekaterini Ioannou1⋆ and Yannis Velegrakis2
1 Technical University of Crete, Greece, ioannou@softnet.tuc.gr
2 University of Trento, Italy, velgias@disi.unitn.eu
Abstract. The entity matching task aims at identifying whether instances are re-
ferring to the same real world entity. It is considered a fundamental task in data
integration and cleaning techniques. More recently, the entity matching task has
also become a vital part of techniques focusing on entity search and entity evolu-
tion. Unfortunately, the existing data sets and benchmarking systems are not able
to cover the related evaluation requirements. In this demonstration, we present
EMBench, a system for benchmarking entity matching, search or evolution sys-
tems in a generic, complete, and principled way. We will discuss the technical
challenges for generating benchmark data for these tasks, the novelties of our
system with respect to existing similar efforts, and explain how EMBench can be
used for generating benchmarking data.
1 Introduction
The entity matching task aims at identifying instances representing the same real world
entity, such as an author or a conference [6]. Existing matching approaches are typi-
cally based on some similarity function that measures syntactic and semantic proximity
of two instances. Depending on the results of this comparison, it is decided whether the
two instances are matching or not. More advanced matching approaches exploit relation-
ships between instances [1,2], use blocking to reduce the required processing
time [7,8], and use information encoded in the available schemata [3,4].
Despite the many different techniques for entity matching, there is no evaluation
methodology that covers all the aspects of matching tasks or at least gives the user the
ability to test the aspects of interest. Most matching techniques have followed their own
ad-hoc evaluation approach, tailored to their own specific goals. Comparing
entity matching systems and selecting the best system for a specific task at hand is
becoming a challenge. Developers cannot easily test the new features of the products
they develop against competitors, practitioners cannot make informed choices about
the most suitable tool to use, and researchers can neither compare the techniques they
are developing against those already existing, nor identify existing limitations that
can serve as potential research directions.
In this demonstration, we will present and discuss the EMBench system for bench-
marking entity matching systems in a generic, complete, and principled way [5]. The
⋆ This research has been co-financed by the European Union (European Social Fund ESF) and
Greek national funds through the Operational Program “Education and Lifelong Learning”
of the National Strategic Reference Framework (NSRF) - Research Funding Program: Thalis.
Investing in knowledge society through the European Social Fund.
system provides a series of scenarios that cover the majority of the matching situa-
tions that are met in practice and which the existing matching systems are expected to
support. EMBench is fully configurable and allows the dynamic (i.e., on-the-fly) gener-
ation of the different test cases in terms of different sizes and complexities both at the
schema and at the instance level. The fact that the entity matching scenarios are created
in a principled way allows the identification of the actual type of heterogeneities that
the matching system under evaluation does not support. This is a fundamental differ-
ence from other existing benchmark or competition-based approaches that come with a
static set of cases that do not always apply in all the real world scenarios.
The following URL provides online access to the system as well as the source
code and binary file; further details can be found in the full version of the paper [5]:
http://db.disi.unitn.eu/pages/EMBench/
2 Entity Matching Scenarios
To generate test cases in a systematic way, we introduce the notion of a scenario. A
scenario is a tuple $\langle e_n, I, e_r \rangle$ where $e_n$ is an entity, $I$ is an entity collection, and $e_r$ an
entity from $I$ referred to as the ground truth. The scenario is said to be successfully
executed by an entity matching technique if the technique returns the entity $e_r$ as a
response when provided as input the pair $\langle e_n, I \rangle$, i.e., returns $e_r$ as the best match of $e_n$
in the entity collection $I$.
EMBench creates a scenario by first selecting an entity $e_r$ from the collection $I$ and
a series of modifiers $f_1, f_2, \ldots, f_n$. It then applies the modifiers over the selected entity,
i.e., $e_r \xrightarrow{f_1} e_1 \xrightarrow{f_2} \cdots \xrightarrow{f_n} e_n$, and generates as a scenario the triple $\langle e_n, I, e_r \rangle$.
Each modifier reflects a specific heterogeneity that matching tasks are frequently
requested to detect. An example of such a category of modifiers is Syntactic Variations
and it includes modifiers such as misspellings, word permutations, aliases, abbrevia-
tions, and homonymity. Structural Variations is another category of modifiers. These
modifiers exploit variations on the attribute level. For example, we might have entities
that use a set of attributes to describe some information while other entities use just
one attribute (e.g., human names might be split into first name and last name, or may
not). Another category is Entity Evolution simulating scenarios in which the entities
have modifications due to time. These modifications can be, for example, changes in
the attribute values, elimination of attributes, or addition of new attributes.
An important feature of the system is that the data engineer creating the scenarios
can choose not only the case but also the size of the data instance to generate. In this way
the matching algorithm is tested not only in terms of effectiveness but also in terms of
efficiency (scalability).
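EMBench's internal interfaces are not described here; the following sketch, with hypothetical names, only illustrates the idea of chaining modifiers over a selected entity to obtain a scenario:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

public class ScenarioSketch {
    // An entity is modelled here simply as attribute -> value pairs (hypothetical, not EMBench's API).
    record Entity(Map<String, String> attributes) {}
    record Scenario(Entity modified, List<Entity> collection, Entity groundTruth) {}

    // Example syntactic-variation modifier: swap the first two characters of the "name" attribute.
    static UnaryOperator<Entity> misspelling() {
        return e -> {
            Map<String, String> copy = new HashMap<>(e.attributes());
            String name = copy.getOrDefault("name", "");
            if (name.length() > 2) {
                copy.put("name", "" + name.charAt(1) + name.charAt(0) + name.substring(2));
            }
            return new Entity(copy);
        };
    }

    // Apply the chain e_r -f1-> e_1 -f2-> ... -fn-> e_n and return the scenario <e_n, I, e_r>.
    static Scenario buildScenario(Entity er, List<Entity> collection, List<UnaryOperator<Entity>> modifiers) {
        Entity current = er;
        for (UnaryOperator<Entity> f : modifiers) {
            current = f.apply(current);
        }
        return new Scenario(current, collection, er);
    }
}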
3 The EMBench System
Figure 1(a) illustrates the architecture of the system. As shown, EMBench maintains a
Repository that contains the data used during the collection generation. The synthetic
data generated by EMBench are not completely random strings but are based on real
world values following realistic scenarios. This is achieved by Shredders, i.e., software
components that receive a source and shred it into a series of Column Tables. The
system incorporates general purpose shredders (e.g., relational databases, XML files) as
Fig. 1. (a) An illustration of the EMBench’s architecture. (b) A screenshot of the EMBench GUI
for creating an entity collection.
well as shredders specifically designed for popular systems (e.g., Wikipedia, DBPedia,
Amazon, IMDb, DBLP, OKKAM).
The system also supports cleaning the repetitive, overlapping, or complementary
information in the resulting column tables. Among the processes incorporated for this
functionality, we have rules that specify how the values of the column tables are to be
combined together or modified and that guide the creation of a new set of column tables,
referred to as the Derived Column Tables. Note that a derived column table may be
created through an identity function rule, meaning that it is considered a derived table
without any modification.
There is no need to shred the original sources or to create the derived column ta-
bles every time the benchmark needs to run. Once they are created, they remain in the
repository until deleted or overwritten. Actually, the current version of EMBench con-
tains a Default Data Collection that is considered sufficient for the realistic evaluation
of matching tasks. For instance, it contains 49299 feminine names, 74079 masculine
names, 4003 diseases, 84847 companies, and 11817 universities.
The Entity Generator creates an entity collection I of N entities by constructing an
entity for every tuple of the populated table R. Each such entity will have M attributes,
one for each of the M attributes of the table R. EMBench provides two options for
selecting the N values from the derived column table: (i) a random selection with or
without repetitions, and (ii) a selection of values following a Zipfian distribution.
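The paper does not detail how the Zipfian option is implemented; a minimal sketch of Zipf-weighted selection over a rank-ordered column table could look as follows (the class and method names are hypothetical):

import java.util.List;
import java.util.Random;

public class ZipfSampler {
    private final double[] cumulative;   // cumulative Zipf probabilities over ranks 1..k
    private final Random random = new Random();

    // k must not exceed the number of available column values.
    public ZipfSampler(int k, double exponent) {
        double[] weights = new double[k];
        double sum = 0;
        for (int rank = 1; rank <= k; rank++) {
            weights[rank - 1] = 1.0 / Math.pow(rank, exponent);
            sum += weights[rank - 1];
        }
        cumulative = new double[k];
        double acc = 0;
        for (int i = 0; i < k; i++) {
            acc += weights[i] / sum;
            cumulative[i] = acc;
        }
    }

    // Pick a value from the (rank-ordered) column table following the Zipf distribution.
    public String sample(List<String> columnValues) {
        double u = random.nextDouble();
        for (int i = 0; i < cumulative.length; i++) {
            if (u <= cumulative[i]) return columnValues.get(i);
        }
        return columnValues.get(cumulative.length - 1);
    }
}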
As mentioned in Section 2, EMBench includes a set of Entity Modifiers that modify
in various ways the data of an entity collection and construct a new entity collection
with a high degree of heterogeneity. The modifiers used, their order, and the modifica-
tion degree are specified by a set of configuration parameters. These
parameters have some default values in the system but can also be modified by the user.
Overall, EMBench offers three main functionalities. The first is to create a source
repository by importing data using shredders. The second is to generate entity collec-
tions using the data from the source repository. The third functionality is to evaluate
matching algorithms. To ease the use of these functionalities, EMBench is in general
fully parametrized through a configuration file. In addition, EMBench is accompanied
by a user interface that allows the specification of the parameters that build the con-
figuration file on-the-fly and run EMBench (shown in Figure 1(b)).
4 Demonstration Highlights
In the proposed demonstration we will discuss with the audience the functionalities and
abilities of EMBench. We will particularly focus on the following four parts.
A. Using EMBench. During the first part we will discuss the two available ways
for using EMBench. The first is usage through a configuration file, which allows
providing a description of the functionalities to be executed, for example which
shredders to run, or which matching tasks to evaluate. The second usage is through the
EMBench GUI (shown in Figure 1(b)). The GUI provides an alternative mechanism for
selecting EMBench’s configuration and executing the functionalities of EMBench.
B. Repository and Default Data Collection. The second part of the demonstration
focuses on the repository. We will present the data included in the default data collec-
tion, and illustrate how to use existing EMBench shredders for importing additional
data. We will also explain how to create, configure, and execute new shredders.
C. Creating Entity Collections. In the subsequent part of the demonstration we
will present the creation of collections. This includes describing the schema for the
entities to be generated (e.g., maximum number of entity attributes, value distribution,
column tables). It also includes the specification and configuration of the modifiers.
D. Evaluating Algorithms using EMBench. The last part of the demonstration
focuses on illustrating how EMBench can be used for evaluating algorithms. We will
discuss the metrics that are currently incorporated in EMBench and how additional ones
can be easily implemented. Furthermore, we will present and illustrate the supported
matching-related tasks (i.e., one-to-one matching and blocking).
The demonstration is intended for researchers and practitioners alike. The confer-
ence participant will have the opportunity to understand the principles behind the bench-
mark. This will help the participants in evaluating and testing new matching systems in
order to select the one that best fits a task at hand, but will also give valuable insight
on how to design and improve matching systems.
References
[1] I. Bhattacharya and L. Getoor. Deduplication and group detection using links. In LinkKDD,
2004.
[2] X. Dong, A. Halevy, and J. Madhavan. Reference reconciliation in complex information
spaces. In SIGMOD, 2005.
[3] J. Euzenat and P. Shvaiko. Ontology matching. Springer-Verlag, 2007.
[4] F. Giunchiglia, P. Shvaiko, and M. Yatskevich. S-Match: an algorithm and an implementation
of semantic matching. In Semantic Interoperability and Integration, 2005.
[5] E. Ioannou, N. Rassadko, and Y. Velegrakis. On generating benchmark data for entity match-
ing. J. Data Semantics, 2(1), 2013.
[6] E. Ioannou and S. Staworko. Management of inconsistencies in data integration. In Data
Exchange, Information, and Streams, 2013.
[7] G. Papadakis, E. Ioannou, C. Niederée, T. Palpanas, and W. Nejdl. Beyond 100 million
entities: large-scale blocking-based resolution for heterogeneous data. In WSDM, 2012.
[8] S. Whang, D. Menestrina, G. Koutrika, M. Theobald, and H. Garcia-Molina. Entity resolu-
tion with iterative blocking. In SIGMOD, 2009.
Demonstration of multi-perspectives exploratory
search with the Discovery Hub web application
Nicolas Marie1,2 and Fabien Gandon1
1 WIMMICS, INRIA Sophia-Antipolis, Sophia Antipolis, France
2 Alcatel-Lucent Bell Labs, Nozay, France
{firstname.lastname}@inria.fr
Abstract. This paper describes a demonstration of the Discovery Hub
exploratory search system. The demonstration focuses on the exploration
of topics through several perspectives.
1 Introduction
Exploratory search refers to cognitively demanding search tasks like learning
or investigation. There is a need to develop systems optimized for supporting
exploratory search, as today's widely-used search engines fail to efficiently
support it [1]. Linked data offers exciting perspectives in this
context, and several systems have already been published. We want to reach a new
level in linked data based exploration by allowing the users to unveil knowl-
edge nuances corresponding to specific facets of interest of the topic explored.
In this demonstration paper we first give a brief overview of linked data based
exploratory search systems. Then we present the Discovery Hub web applica-
tion and focus more particularly on its multi-perspectives exploration capacity.
Finally we present the demonstration scenario we propose.
2 Linked data based exploratory search
The number of contributions at the crossroads of semantic search and exploratory
search is increasing today. Linked data knowledge, and especially DBpedia,
makes it possible to design new information retrieval approaches and interaction models
that efficiently support exploratory search tasks. Yovisto3 (2009) is an academic
video platform that retrieves topic suggestions that are semantically related
to the users' query. The objective is to ease the exploration of the video collec-
tion. Lookup Explore Discover4 (2010) helps the users to compose queries
about topics of interest by suggesting related query-terms. Once the query is
composed, the system retrieves the results from several other services such as
search engines and social networks. Aemoo5 (2012) offers a graph-view on topics
of interest. The graph shows their neighborhood filtered by a semantic pattern.
3 http://www.yovisto.com/
4 http://sisinflab.poliba.it/led/
5 http://wit.istc.cnr.it/aemoo
The users can reverse the filtering to show more surprising knowledge. They can
also ask for explanations (cross-references in Wikipedia) about the relations be-
tween the shown resources. The Seevl6 (2013) demonstrator is a music discovery
platform implementing a linked data based recommendation algorithm. The DB-
pedia semantics are also used in Seevl to support browsing (e.g. by music genres,
band members) and to provide explanations about the recommendations (show-
ing the shared properties between the artists). Linked Jazz7 (2013) aims to
capture the relations within the American jazz community in RDF. The authors
rely on a large number of transcripts of interviews with jazz musicians. These transcripts
are automatically processed and then finely analyzed through a crowd-sourced
approach.
The approaches recently published in the literature produced good results
when evaluated. Nevertheless, a common limitation of the existing linked data based
exploratory search systems is that they constrain the exploration through
a single result selection and ranking scheme. The users cannot influence the
retrieved results to reveal specific aspects of knowledge that interest them in
particular.
3 Multi-perspectives exploratory search
The framework and models implemented by the Discovery Hub application were
presented in [2]. Contrary to other systems, it does not pre-compute and store
the results for later retrieval. Instead it computes the results on demand thanks
to a semantics-sensitive graph traversal algorithm. The algorithm is applied on
a small amount of data stored in a local and transient triple store. The data is
incrementally imported at query time using SPARQL queries sent to the targeted
SPARQL endpoint (DBpedia in the case of Discovery Hub). The objective of this
step is to identify a set of relevant results related to the initial topics of interest
that will be explored by the user [2]. The web application demonstrating the
framework, called Discovery Hub, is available online8 and was showcased in
several screencasts9 .
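The exact import queries are not given in this paper; the sketch below only illustrates the general idea of pulling a small neighbourhood of the query topic from the DBpedia endpoint into a local, transient Jena model at query time (the CONSTRUCT pattern and the result limit are assumptions):

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.rdf.model.Model;

public class QueryTimeImport {
    public static Model importNeighborhood(String topicUri) {
        String query =
            "CONSTRUCT { <" + topicUri + "> ?p ?o . ?o ?p2 ?o2 } " +
            "WHERE { <" + topicUri + "> ?p ?o . OPTIONAL { ?o ?p2 ?o2 } } LIMIT 10000";
        try (QueryExecution qe =
                 QueryExecutionFactory.sparqlService("http://dbpedia.org/sparql", query)) {
            return qe.execConstruct();   // materialize the sample in a local, transient model
        }
    }

    public static void main(String[] args) {
        Model sample = importNeighborhood("http://dbpedia.org/resource/Claude_Monet");
        System.out.println("Imported triples: " + sample.size());
    }
}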
The fact that the results are computed at query time makes it possible to let the
users control several computation parameters through the interface and to offer multi-
perspective exploratory search. Indeed, the objects described in linked data
datasets can be rich, complex and approached in many manners. For exam-
ple, a user can be interested in a painter (e.g. Claude Monet or Mary Cassatt) in
many ways: works, epoch, movement, entourage, social or political contexts and
more. The user may also be interested in basic information or unexpected facts
depending on his actual knowledge about the topic. He may also want to explore
the topic through a specific culture or area, e.g. impressionism in American or
French culture.
6 http://play.seevl.fm
7 http://linkedjazz.org/
8 http://discoveryhub.co; current CPU-intensive experiments might slow down the search temporarily
9 https://www.youtube.com/user/wearediscoveryhub/videos
The framework allows three operations for building such exploration per-
spectives, detailed in [3]. (1) The users can specify criteria of interest and
disinterest that are used by the framework during the sample importation and
its computation. The DBpedia categories are used for this purpose, see Figure
1. The objective is to guide the algorithm in order to retrieve results that are
more specifically related to the aspects that interest the user, see example of
queries and results in Table 1. (2) It is possible to inject randomness into the
algorithm values in order to modify the ranking scheme and expose more un-
expected results10. (3) With the proposed framework it is easy to change the
data source used to process the query11. In the context of DBpedia this enables
the use of the DBpedia international chapters, like the French, German and Italian
ones12, to leverage cultural bias.
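How exactly the criteria are injected into the importation and propagation steps is described in [3]; the following simplified sketch assumes that category membership is tested through DBpedia's dcterms:subject links and only illustrates the shape of such a filtered query:

public class CriteriaQuery {
    // Keep only neighbours of the topic that belong to an "interest" category and
    // not to a "disinterest" one (illustrative, not the actual Discovery Hub query).
    public static String build(String topicUri, String plusCategory, String minusCategory) {
        return "PREFIX dct: <http://purl.org/dc/terms/>\n" +
               "SELECT DISTINCT ?neighbour WHERE {\n" +
               "  { <" + topicUri + "> ?p ?neighbour } UNION { ?neighbour ?p <" + topicUri + "> }\n" +
               "  ?neighbour dct:subject <" + plusCategory + "> .\n" +
               "  FILTER NOT EXISTS { ?neighbour dct:subject <" + minusCategory + "> }\n" +
               "}";
    }

    public static void main(String[] args) {
        // Categories taken from Table 1 (Impressionist painters +, French painters -).
        System.out.println(build(
            "http://dbpedia.org/resource/Claude_Monet",
            "http://dbpedia.org/resource/Category:Impressionist_painters",
            "http://dbpedia.org/resource/Category:French_painters"));
    }
}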
Fig. 1. Discovery Hub interface, criteria of interest specification query
4 Demonstration scenario
The demonstration will consist of a sequence of interactions like the ones
presented in the previously published screencasts. First the user launches a sim-
ple query (query 1 in Table 1) and examines the results. At this point we show the
audience the Discovery Hub functionalities supporting exploration and un-
derstanding. We will focus on the faceted browsing aspect, the explanations and
the redirections toward third-party platforms. During this step we will engage
in a conversation about how we compute the results and what the software
architecture is.
10 This advanced query mode is supported by the framework but currently not available through the interface; it will be integrated soon.
11 Idem 10.
12 wiki.dbpedia.org/Internationalization/Chapters?v=190k
Table 1. Results of three queries about Claude Monet using the criteria specification

Query      Claude Monet (1)        Claude Monet (2)                       Claude Monet (3)
Criteria   None                    Impressionist painters +              Impressionist painters -
                                   Artists from Paris -                  Artists from Paris +
                                   People from Le Havre -                People from Le Havre +
                                   Alumni of the École des Beaux-Arts -  Alumni of the École des Beaux-Arts +
                                   French painters -                     French painters +
Results
1          Pierre-Auguste Renoir   Theodore Robinson                     Pierre-Auguste Renoir
2          Alfred Sisley           Édouard Manet                         Gustave Courbet
3          Édouard Manet           Alfred Sisley                         Edgar Degas
4          Mary Cassatt            Wladyslaw Podkowiński                 Jacques-Louis David
5          Camille Pissarro        Leslie Hunter                         Jean-Baptiste-Camille Corot
6          Edgar Degas             Theodore Earl Butler                  Jean-François Millet
7          Charles Angrand         Lilla Cabot Perry                     Paul Cézanne
8          Gustave Courbet         Frank Weston Benson                   Marc Chagall
During the results examination we will voluntarily focus on the French im-
pressionist painters that were close to Monet. At this point the user might be
interested in the relations of Monet with the non-French impressionists (query
2 in Table 1). We will explain the querying system for criteria of interest spec-
ification and then emphasize the differences between the results obtained with
queries 1 and 2.
Following the same logic, we will submit query 3 as well as a query
with a high level of randomness and one using the French chapter of DBpedia,
in several tabs, and let the audience compare the results. We aim for an interactive
demonstration, encouraging the audience to try the application while we com-
ment on the system, rather than a strict and pre-defined sequence of interactions
(which serves only to start the interactions).
5 Conclusion and perspectives
Discovery Hub is a linked data based exploratory search system built on
top of DBpedia. With this demonstration we want to show the value of linked
data for exploratory search. Mature datasets like DBpedia allow the creation of
new information retrieval approaches as well as new interaction models. More
specifically we want to demonstrate the multi-perspectives exploratory search
capacities of Discovery Hub. Thanks to the demonstration track we hope to
have discussions with other researchers about the perspectives we envision for
Discovery Hub. It notably includes an approach where the user can specify or
change the criteria of interest interactively in order to re-rank the results
without relaunching the whole query process.
References
1. G. Marchionini. Exploratory search: from finding to understanding. Communica-
tions of the ACM, 49(4):41–46, 2006.
2. N. Marie, F. Gandon, M. Ribière, and F. Rodio. Discovery hub: on-the-fly linked
data exploratory search. In Proceedings of the 9th International Conference on
Semantic Systems. ACM, 2013.
3. N. Marie, F. Gandon, A. Giboin, and E. Palagi. Exploratory search on topics
through different perspectives with DBpedia. In Proceedings of the 10th International
Conference on Semantic Systems. ACM, 2014.
Modeling and Monitoring Processes
exploiting Semantic Reasoning
Piergiorgio Bertoli1 , Francesco Corcoglioniti2 , Chiara Di Francescomarino2 , Mauro
Dragoni2 , Chiara Ghidini2 , Michele Nori1 , Marco Pistore1 , and Roberto Tiella2
1 SayService, Trento, Italy, bertoli|nori|pistore@sayservice.it
2 FBK-IRST, Trento, Italy, corcoglio|dfmchiara|dragoni|ghidini|tiella@fbk.eu ⋆⋆
Abstract. Data about process executions has witnessed a notable increase in the
last decades, due to the growing adoption of Information Technology systems
able to trace and store this information. Meanwhile, Semantic Web methodolo-
gies and technologies have become more and more robust and able to face the
issues posed by a variety of new domains, taking advantage of reasoning services
in the “big data” era. In this demo paper we present ProMo, a tool for the col-
laborative modeling and monitoring of Business Process executions. Specifically,
by exploiting semantic modeling and reasoning, it enables the reconciliation of
business and data layers as well as of static and procedural aspects, thus allowing
business analysts to infer knowledge and use it to analyze process executions.
1 Introduction
The last decades have witnessed a rapid and widespread adoption of Information Tech-
nology (IT) to support business activities in all phases. As a side effect, IT systems
have made available huge quantities of data about process executions, thus enabling
(i) to monitor the actual execution and the progress of (instances of) Business Processes
(BPs); (ii) to provide statistical analysis; (iii) to detect deviations of process executions
from process models (e.g., [1]); and (iv) to identify problems in process executions.
Meanwhile, Semantic Web technologies have experienced important growth and have
made available powerful reasoning services able to reason on complex domains, as well
as technologies able to deal with huge quantities of data. This opens the way to the use
of Semantic Web technologies for process modeling and monitoring and for the analysis
of processes characterizing complex scenarios as those of large organizations.
In these complex scenarios, knowledge can be classified in two orthogonal ways.
First, we distinguish between a dynamic dimension, which concerns the procedures
and the activities carried out by the organization for realizing specific objectives, and
a static dimension, which concerns the organization structure (e.g., the role hierarchy),
the data structure (e.g., the document organization), and the relationships among these
and other domain entities. Then, knowledge can be ascribed to two layers: the IT layer,
which concerns the actual data items processed by IT systems; and the business layer,
⋆⋆ This work is supported by “ProMo - A Collaborative Agile Approach to Model and Monitor
Service-Based Business Processes”, funded by the Operational Programme “Fondo Europeo
di Sviluppo Regionale (FESR) 2007-2013 of the Province of Trento, Italy.
Fig. 1: ProMo overview
which concerns the models of the dynamic and static aspects of the organization do-
main. Given this frame, two main challenges need to be faced: (i) bridging the unavoid-
able gap between the business and the data layer; and (ii) reconciling the static and
dynamic dimensions so as to make them available for monitoring and analysis purposes.
In this demo we present and showcase ProMo, a tool that exploits Semantic Web
technologies to address the above challenges through an integrated representation of
knowledge, enabling the collaborative modeling, monitoring and analysis of business
processes. By reconciling all these different dimensions and layers, ProMo goes beyond
existing approaches. In the remainder we describe how ProMo reconciles the business
and IT layers and the static and dynamic dimensions, introducing the ProMo main com-
ponents that will be demonstrated live during the Posters and Demo session.
2 Reconciling Business and IT layers
Aligning the business and IT layers is a difficult task. For example, process monitoring
at the IT layer cannot observe data exchanged on paper documents or user activities not
mediated by IT systems, and thus brings only partial information on which activities
were executed and what data or artifacts they produced. Even when IT data exists, it is
not easy to associate it to a specific process instance. Indeed, IT services can be shared
by process classes and instances, and traced information can be hard to disambiguate.
ProMo's solution to this problem is based on the introduction of an intermediate layer
(Figure 1), which enables the communication between the business and the IT lay-
ers through an intermediate model. Such a model formalizes the relationships between
business models and information extracted at the IT layer and relies on the integrated
representation of all the information collected about a process execution (the IT-trace).
To accomplish its goal, ProMo integrates a modeling component and a monitoring
component. At the business level, the modeling component provides MoKi-ProMo, a
customized version of the MediaWiki-based3 tool MoKi [2] for the collaborative mod-
eling of processes and ontologies. At the intermediate layer ProMo provides (i) mapping
3 http://www.mediawiki.org
Fig. 2: MoKi-ProMo visualization of a reconstructed (partial) trace
and monitoring editors that allow IT Experts (taking advantage of the Domain Experts'
modeling) to specify, respectively, aggregation/monitoring rules and the relationships
between business models and the information extracted at IT level; and (ii) an editor
for defining interesting Key Performance Indicators (KPIs) to be monitored. Specifi-
cally, the input required at the intermediate layer is provided by experts by using the
DomainObject language [3] for defining mapping properties, an ad-hoc rule language
for monitoring rules, and SPARQL queries for business KPIs.
At run-time, whenever an IT-level event occurs, it is captured and handled by the
monitoring component. In detail, the event is managed by the monitoring engine, which,
based on the specification and rules defined at design-time, correlates and aggregates
events, produces new control events, monitors and maps the events to the correspond-
ing one(s) at the business layer and eventually produces the IT-trace. The information
in the IT-trace, which in many cases is only partial with respect to a complete execu-
tion flow of a designed process model, is hence passed to a reasoning engine. Such
an engine, by taking advantage of the business knowledge, reconstructs missing infor-
mation by applying model-driven satisfiability rules [4] and the reconstructed trace is
then visualized by the BP monitoring and analysis component. Figure 2 shows how a
reconstructed (partial) execution trace is visualized in MoKi-ProMo, pointing out the
path possibly taken by the process execution and distinguishing between monitored and
reconstructed (with some certainty degree) activities. The reconstructed IT-trace is then
recorded in a semantic-based knowledge store, which is then queried by the BP moni-
toring and analysis component in order to provide monitoring services at business level.
An implementation built on top of current Semantic Web technologies aims at coping
with large quantities of data and high data rates typical of real application scenarios.
3 Reconciling Static and Dynamic Dimensions
Although different in their nature, static and dynamic knowledge about an organization
domain are strictly related and should be jointly considered in order to obtain a com-
prehensive view of the organization processes. Importantly, reconciliation of these two
dimensions should be done both at the business layer, allowing an explicit representa-
tion of the links between static and dynamic model elements (e.g., the fact that a process
activity operates on a certain document), and at the data layer, allowing the collection,
integration and comprehensive querying of static and procedural data.
At the business layer, ProMo's solution is represented by the modeling component
of MoKi-ProMo, which allows different experts (e.g., Business Designers, Knowledge
Engineers and Domain Experts) to collaboratively model the different static and dy-
namic aspects describing the domain (see Figure 1). Specifically, MoKi-ProMo allows
Domain Experts and Knowledge Engineers to collaboratively model the static aspects
of the domain in form of OWL 2 ontologies. Concerning the dynamic aspects, MoKi-
ProMo customizes the Oryx editor4 for the BPMN modeling of business processes by
introducing symbol variations (e.g., special data objects for explicitly capturing data
structures). Moreover, MoKi-ProMo also provides an interface allowing Business Ana-
lysts and Domain Experts to edit KPIs of interest, thus enabling them to access IT data
from a (static and dynamic) business perspective.
At the IT layer, ProMo's solution consists in exploiting a Domain ontology [5], con-
sisting of an upper-level cross-domain core and a domain-dependent extension, and a
BPMN [6] ontology to build an integrated semantic model combining static and proce-
dural knowledge acquired at modeling time, together with knowledge about IT-data.
By leveraging scalable Semantic Web technologies for data storage, reasoning and
querying, the semantic model enables Business Analysts to query asserted and inferred
knowledge and bring execution data analysis at business level. In particular, analytical
SPARQL queries combining static and dynamic dimensions with data derived from the
IT-layer can be formulated and evaluated, such as the number of times a path is followed
or an actor instance executes a business activity, or the average time spent by an actor of
a given category to complete the process. Experiments carried out in the context of an
Italian use case have shown the applicability of the approach in realistic scenarios [5].
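The concrete vocabulary of the integrated semantic model is described in [5] and [6]; the query below is only a sketch with hypothetical property names (ex:executes, ex:actorCategory), meant to convey the kind of KPI that can be expressed over the combined static and procedural knowledge:

public class KpiQuerySketch {
    // Hypothetical vocabulary: ex:executes and ex:actorCategory are NOT the actual
    // ProMo/BPMN ontology terms; they only illustrate the shape of a KPI query.
    static final String COUNT_EXECUTIONS =
        "PREFIX ex: <http://example.org/promo#>\n" +
        "SELECT ?activity (COUNT(?execution) AS ?times)\n" +
        "WHERE {\n" +
        "  ?execution ex:executes ?activity ;\n" +
        "             ex:actorCategory ?category .\n" +
        "  FILTER (?category = \"BusinessAnalyst\")\n" +
        "}\n" +
        "GROUP BY ?activity";

    public static void main(String[] args) {
        System.out.println(COUNT_EXECUTIONS);
    }
}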
References
1. van der Aalst, W.M.P.: Process Mining: Discovery, Conformance and Enhancement of Busi-
ness Processes. 1st edn. Springer Publishing Company, Inc. (2011)
2. Ghidini, C., Rospocher, M., Serafini, L.: Conceptual modeling in wikis: a reference archi-
tecture and a tool. In: 4th Int. Conf. on Information, Process, and Knowledge Management
(eKNOW). (2012) 128–135
3. Bertoli, P., Kazhamiakin, R., Nori, M., Pistore, M.: SMART: Modeling and monitoring sup-
port for business process coordination in dynamic environments. In: 15th Int. Conf. on Busi-
ness Inf. Systems (BIS) - Workshops. Volume 127 of LNBIP., Springer (2012) 243–254
4. Bertoli, P., Di Francescomarino, C., Dragoni, M., Ghidini, C.: Reasoning-based techniques
for dealing with incomplete business process execution traces. In: 13th Conf. of Italian Asso-
ciation for Artificial Intelligence (AI*IA). Volume 8249 of LNCS., Springer (2013) 469–480
5. Di Francescomarino, C., Corcoglioniti, F., Dragoni, M., Bertoli, P., Tiella, R., Ghidini, C.,
Nori, M., Pistore, M.: Semantic-based process analysis. In: 13th Int. Semantic Web Confer-
ence (ISWC) - In-use track. (2014) (to appear).
6. Rospocher, M., Ghidini, C., Serafini, L.: An ontology for the Business Process Modelling
Notation. In: 8th Int. Conf. on Formal Ontology in Inf. Systems (FOIS). (2014) (to appear).
4 http://bpt.hpi.uni-potsdam.de/Oryx/
WikipEvent: Temporal Event Data for the
Semantic Web
Ujwal Gadiraju, Kaweh Djafari Naini, Andrea Ceroni, Mihai Georgescu, Dang
Duc Pham, Stefan Dietze, and Marco Fisichella
L3S Research Center, Leibniz Universität Hannover, Germany
{gadiraju, naini, ceroni, georgescu, pham, dietze, fisichella}@L3S.de
Abstract. In this demo we present WikipEvent, an exploratory system
that captures and visualises continuously evolving complex event struc-
tures, along with the involved entities. The framework facilitates entity-
centric and event-centric search, presented via a user-friendly interface
and supported by temporal snippets from corresponding Wikipedia page
versions. The events detected and extracted using different mechanisms
are exposed as freely available Linked Data for further reuse.
Keywords: Events; Temporal Evolution; Wikipedia; RDF; Interface
1 Introduction
Exploratory search systems help users to search, navigate, and discover new facts
and relationships. We detect and extract events from different sources and merge
these events into a unique event repository. Each event is described primarily in
terms of (i) a list of entities (Wikipedia pages) participating in the event, (ii) a
textual description of the event, (iii) start and end dates of the event, and (iv)
the extraction method used to obtain the event. We further classify entities as
people, organizations, artifacts, and locations by exploiting the class hierarchy
defined in YAGO2 [2], since different entity categories play different roles while
participating in an event.
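Concretely, each repository entry can be thought of as a record of the following shape (an illustrative sketch, not the actual WikipEvent schema):

import java.time.LocalDate;
import java.util.List;

// Illustrative sketch of the information kept per event.
public record Event(
        List<String> participatingEntities,   // Wikipedia pages involved in the event
        String description,                   // textual description of the event
        LocalDate startDate,
        LocalDate endDate,
        String extractionMethod) {            // e.g. "Current Events", "YAGO2", "Co-References"
}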
In this demo, we showcase two popular use case scenarios. First, we show
that WikipEvent can be used to explore the events in which particular
entities have been involved. Secondly, we show the suitability of WikipEvent for
exploring the evolution of entities based on the events they are involved in. Both
these scenarios can additionally be surveyed along the temporal dimension.
In addition to the events that are presented using a timeline1, we adopt a ver-
sioning approach introduced previously [1] in order to display significant versions
of wikipages corresponding to the entities involved in the event.
WikipEvent facilitates the understanding of events and their related enti-
ties on a temporal basis. Instead of exploration of isolated knowledge bases,
WikipEvent takes advantage of the complementary nature of sources contribut-
ing to the underlying event repository.
1 Interface: http://wikipeventdemo.l3s.uni-hannover.de/WikiEventEntity/
2 WikipEvent Data
We extract events from three sources: the Wikipedia Current Events portal [3], YAGO2 [2],
and an event detection method called Co-References [4]. Firstly, the WikiTimes
project2 provides an API for 50,000 events between 2001 and 2013, acquired from
the Wikipedia Current Events portal. The second event source is the YAGO2
ontology, including entities which describe events, e.g. 2011 Australian Open, as
well as facts connecting entities, e.g. <BobDylan> wasBornIn <Duluth>.
The Co-References method introduced in [4] extracts Wikipedia pages (entities)
related to an event by using the Wikipedia edit history. The edits corresponding
to a Wikipedia page (entity) are analysed for indications of the occurrence of an
event involving that entity.
The resulting repository contains more than 2.6 million events, extracted from
the different sources as shown in Table 1. The contribution from the sources
is skewed for two reasons. Firstly, they cover very different time periods:
YAGO2 contains events and temporal facts spanning over thousands of years,
Current Events captures events since 2001 only, and, for performance reasons,
the Co-References method has been restricted to edits in 2011. In addition,
Co-References exclusively analyses entities of type politician, while YAGO2 and
Current Events contain almost all the Wikipedia pages. To facilitate compar-
isons across entities occurring in all three sources, in the rest of this section we
will consider only those events that occurred in 2011 and involved politicians.
Source           Total      Politicians  Politicians 2011
All              2,629,740  50,168       1,401
YAGO2            2,578,547  42,399       360
Current Events   50,951     7,527        799
Co-Reference     242        242          242
Table 1: Number of events within the event repository, split by different sources.
Co-References is able to detect events with different duration and granularity
(from a wrestling match to the Egypt Revolution). YAGO2 mostly contains
high-level and well-known events represented through temporal facts regarding
entities, often lacking textual descriptions. The Current Events portal contains daily
events which have a self-explanatory textual description and are reliable, thanks
to the high level of control within Wikipedia. The complementary nature of
the different sources in terms of complexity (number of participants), duration,
and granularity of events is evident. For these reasons, the schemas
used to represent events across the different sources are also distinct, yet overlapping.
While certain properties (for instance for time points) are overlapping yet follow
different conventions, we lifted the events from the different sources into a unified
dataset following Linked Data principles and deploying a joint RDF schema.
Exposing Events Data as RDF.
We have exposed the WikipEvent events data through a public Linked Data
2 http://data.l3s.de/dataset/wikitimes
interface using D2R Server3 , enabling URI dereferencing via content negotiation
and providing a public SPARQL endpoint. This data can be accessed and queried
via http://wikipevent.l3s.uni-hannover.de/ and using our SPARQL end-
point4 .
Fig. 1: Example of event related data as an RDF repository.
We represent events through established event RDF vocabularies, to facilitate
reuse, interpretation and linking of our data by third parties. In particular,
we use properties from the LODE ontology5 to map different properties per-
taining to events in our dataset; for instance, the property lode:atPlace is used as
a predicate for stating the venues of events. Figure 1 presents an example of an event
and its entailing properties in our event repository.
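As a small sketch of how an event can be described with these vocabularies (the event URI is a placeholder and only two LODE properties are shown):

import org.apache.jena.rdf.model.*;

public class EventRdfSketch {
    public static void main(String[] args) {
        String LODE = "http://linkedevents.org/ontology/";
        Model m = ModelFactory.createDefaultModel();
        m.setNsPrefix("lode", LODE);

        // Placeholder event URI, only meant to show the use of LODE properties.
        Resource event = m.createResource("http://wikipevent.l3s.uni-hannover.de/resource/event/EXAMPLE");
        event.addProperty(m.createProperty(LODE + "atPlace"),
                          m.createResource("http://dbpedia.org/resource/Hanover"));
        event.addProperty(m.createProperty(LODE + "involvedAgent"),
                          m.createResource("http://dbpedia.org/resource/Barack_Obama"));

        m.write(System.out, "TURTLE");
    }
}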
3 Use Case Scenarios
WikipEvent can help students, scholars, historians, or journalists by facilitating
temporal search focussed on either the entities at hand or the events. Thus, the
WikipEvent framework can be used to satisfy two primary scenarios: entity-
based and event-based information needs. Results are presented through a user-
friendly interface that supports faceted search, query-path tracing, query com-
pletion and temporal settings, and is enriched with events' sources as well as filters
for related entities. The underlying versioning system [1] helps us to identify
significant wikipage revisions of the entities involved in the event. These sig-
nificant revisions of wikipages are also presented to the user in addition to the
timeline of events. As introduced in our previous work, a significant revision is
one where the edits between the preceding and succeeding revisions are above a cer-
tain threshold. Here, the notion of significance is modeled based on the Cosine
distance between successive revisions [1].
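A minimal sketch of this significance test over bag-of-words revision vectors (the threshold value is left as a parameter) is:

import java.util.HashMap;
import java.util.Map;

public class RevisionSignificance {
    // Term-frequency vector of a revision's text.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    // Cosine distance = 1 - cosine similarity between two revision vectors.
    static double cosineDistance(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) normB += v * v;
        if (normA == 0 || normB == 0) return 1.0;
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    static boolean isSignificant(String previous, String next, double threshold) {
        return cosineDistance(termFrequencies(previous), termFrequencies(next)) > threshold;
    }
}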
Entity-based Search. Users may want to learn about entities of interest
with respect to their temporal participation in events. For example, journalists
might aim at studying political affiliations of individuals, and the campaigns
3 http://d2rq.org/
4 http://wikipevent.l3s.uni-hannover.de/snorql/
5 http://linkedevents.org/ontology/
Fig. 2: Entity-centric search: Barack Obama. Fig. 3: Event-centric search: Iraq War.
they participated in. Existing systems, however, make it cumbersome to easily
access this information. The WikipEvent interface overcomes this challenge by
presenting a timeline of events that an entity is involved in. Additional filters
for relevant entities help users to navigate through the retrieved results. Figure
2 presents an example of an entity-based search on WikipEvent.
Event-based Search. Historic events are a subject of interest to a wide
array of people, ranging from students to archivists. WikipEvent facilitates a
free-text search for events. Events relevant to a given query are presented in the
form of a continuous timeline (Figure 3), while highlighting the entities involved.
WikipEvent enables users to sift through event-related information on a temporal
basis, in order to learn more about the events and the participating entities.
To gain a complete understanding of the WikipEvent framework, we point
the reader to a demo video6.
References
1. Andrea Ceroni, Mihai Georgescu, Ujwal Gadiraju, Kaweh Djafari Naini, and Marco
Fisichella. Information evolution in wikipedia. In Proceedings of the 10th Interna-
tional Symposium on Open Collaboration, OpenSym ’14. ACM, 2014.
2. Johannes Hoffart, Fabian M Suchanek, Klaus Berberich, and Gerhard Weikum.
Yago2: a spatially and temporally enhanced knowledge base from wikipedia. Arti-
ficial Intelligence, 194:28–61, 2013.
3. Giang Binh Tran and Mohammad Alrifai. Indexing and analyzing wikipedia’s cur-
rent events portal, the daily news summaries by the crowd. In Proceedings of the
companion publication of the 23rd international conference on World wide web com-
panion, pages 511–516, 2014.
4. Tuan A. Tran, Andrea Ceroni, Mihai Georgescu, Kaweh Djafari Naini, and Marco
Fisichella. Wikipevent: leveraging wikipedia edit history for event detection. In
Web Information Systems Engineering - WISE 2014, Thessaloniki, Greece, 12-14
October 2014, Proceedings, 2014.
6 A demo video is available on the home screen of the web interface.
Towards a DBpedia of Tourism: the case of
Tourpedia
Stefano Cresci, Andrea D’Errico, Davide Gazzé, Angelica Lo Duca,
Andrea Marchetti, Maurizio Tesconi
Institute of Informatics and Telematics, National Research Council,
via Moruzzi 1, 56124 Italy
email: [name].[surname]@iit.cnr.it
Abstract. In this paper we illustrate Tourpedia, which aims to become the
DBpedia of tourism. Tourpedia contains more than half a million places,
divided into four categories: accommodations, restaurants, points of in-
terest and attractions. They are related to eight locations: Amsterdam,
Barcelona, Berlin, Dubai, London, Paris, Rome and Tuscany, but new lo-
cations are continuously added. Information about places was extracted
from four social media: Facebook, Foursquare, GooglePlaces and Book-
ing, and was integrated in order to build a unique catalogue. Tourpedia
also provides a Web API and a SPARQL endpoint to access data.
1 Introduction
The concept of the Semantic Web was introduced by Tim Berners-Lee in 2001 [2].
His main idea consisted in migrating from the Web of documents to the Web
of data. The purpose of the Web of data is to connect concepts and contents
to each other, instead of simply connecting documents. Thus the Web of data
has led to the conversion of existing documents to linked data [6], and to the
creation of new datasets1 . Among them, one of the most exploited datasets is
DBpedia2 , which is the linked data version of Wikipedia3 .
DBpedia is available in different languages. Its English version contains about
4.0 million things, classified in different categories, including people, places, cre-
ative works, organizations, species and diseases. However, DBpedia, as well
as Wikipedia, contains only a small number of things related to the tourism
domain, such as accommodations and restaurants. In addition, to the best of
our knowledge, only a few linked datasets have been implemented in the field
of tourism. Among them are El Viajero4, which provides information
about more than 20,000 travel guides, pictures, videos and posts, and Accom-
modations in Tuscany5, which contains the list of accommodations in
1 For a list of shared datasets, please look at: http://datahub.io.
2 http://dbpedia.org
3 http://wikipedia.org
4 http://datahub.io/dataset/elviajero
5 http://datahub.io/dataset/grrt
Tuscany, Italy. For more details about tourism datasets, please refer to:
http://datahub.io/dataset?q=tourism.
In this paper we illustrate Tourpedia, which aims to become the DBpedia of Tourism.
Tourpedia is reachable through its portal6 and is also available on the datahub.io
platform7.
Tourpedia was developed within the OpeNER Project8 (Open Polarity En-
hanced Name Entity Recognition), whose main objective is to implement a
pipeline to process natural language.
Tourpedia can be used in various ways. For example, it could be used
to perform named entity disambiguation in the tourism domain, or to extract the
most appreciated points of interest in a town.
2 Tourpedia
Figure 1 illustrates the Tourpedia architecture. The Data Extraction module
consists of four ad-hoc scrapers, which extract data from four social media:
Facebook9 , Foursquare10 , Google Places11 and Booking12 . We chose these social
media firstly because they are very popular and secondly because they pro-
vide an easy way to extract data. The scrapers of Facebook, GooglePlaces and
Foursquare exploit the RESTful APIs the social media provide, while the Book-
ing scraper extracts information from each accommodation page.
The Named Entity repository contains two main datasets, which belong to the specific domain of tourism: Places and Reviews about places. The dataset of Places contains more than 500,000 places in Europe divided into four categories: accommodations, restaurants, points of interest and attractions13. At the moment the following locations are covered: Amsterdam, Barcelona, Berlin, Dubai, London, Paris, Rome and Tuscany. Places were processed and integrated through the Data Integration module in order to build a unique catalogue. Data Integration was performed by using a merging algorithm based on distance and string similarity.
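The merging algorithm itself is not detailed here; the following minimal Python sketch only illustrates the general idea of matching candidate records from two sources by geographic distance and name similarity. The haversine distance, the thresholds and the record layout are illustrative assumptions, not the actual Tourpedia implementation.

    from difflib import SequenceMatcher
    from math import radians, sin, cos, asin, sqrt

    def distance_km(lat1, lon1, lat2, lon2):
        # Haversine distance between two WGS84 points, in kilometres.
        dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
        a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
        return 2 * 6371.0 * asin(sqrt(a))

    def same_place(p1, p2, max_km=0.2, min_sim=0.8):
        # Merge two records when they are geographically close and their names are similar.
        close = distance_km(p1["lat"], p1["lng"], p2["lat"], p2["lng"]) <= max_km
        similar = SequenceMatcher(None, p1["name"].lower(), p2["name"].lower()).ratio() >= min_sim
        return close and similar

    # Example: a Facebook record and a Foursquare record describing the same restaurant.
    fb = {"name": "Trattoria da Mario", "lat": 41.8902, "lng": 12.4922}
    fsq = {"name": "Trattoria Da Mario", "lat": 41.8903, "lng": 12.4925}
    print(same_place(fb, fsq))  # True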
The dataset of Reviews contains about 600,000 reviews about places. Reviews were analysed through the OpeNER pipeline in order to extract their sentiment.
2.1 Web application
Tourpedia also provides a Web application14 [5], which shows the sentiment about places on an interactive, Google Maps-like map.
6 http://tour-pedia.org
7 http://datahub.io/dataset/tourpedia
8 http://www.opener-project.eu
9 http://www.facebook.com
10 http://foursquare.com
11 https://plus.google.com/u/0/local
12 http://www.booking.com
13 http://tour-pedia.org/about/statistics.html
14 http://tour-pedia.org/gui/demo/
[Figure: architecture diagram. Social media feed the Data Extraction module; the NE Repository holds the Places and Reviews datasets, processed by the Data Integration module and the OpeNER pipeline; data are exposed through a Web API, a D2R server with a SPARQL endpoint (Linked Dataset), and a Web Application.]
Fig. 1. The architecture of Tourpedia.
The sentiment of a place is calculated as a function of the sentiments of all the reviews about that place. In order to retrieve the sentiment of a review, the OpeNER pipeline was used. In particular, each place is associated with zero or more reviews extracted from social media (i.e. Facebook, Foursquare and Google Places). Each review is processed through the OpeNER pipeline and is associated with a rating, which expresses its specific sentiment.
2.2 Linked Data
Tourpedia is exposed as a linked data node and provides a SPARQL endpoint15. The service is implemented through the use of a D2R server16. Each place is represented using the following ontologies: VCARD [9] and the DBpedia OWL ontology17, for generic properties; Acco [8], Hontology [4] and GoodRelations [7] for domain-specific properties. In a previous work [1], we illustrated the employed ontologies and the structure of accommodations as linked data. In order to fulfill the principles of linked data [3], each location is linked to the corresponding location in DBpedia.
2.3 Web API
Tourpedia provides a RESTful API18 to access places and statistics. The output of each request can be in JSON, CSV or XML format. For example, a search request about Places is an HTTP URL of the following form:
http://tour-pedia.org/api/getPlaces?parameters
where parameters must include at least one of the following: location (the location of the places), category (the type of the places, such as accommodation, attraction, restaurant or poi), and name (the keyword to be searched).
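As a concrete illustration, such a request can be issued from any HTTP client. The following minimal Python sketch uses the requests library with the documented location and category parameters; the structure of the returned JSON is not described in this paper, so only the number of records is printed.

    import requests

    # Search for restaurants in Rome; we assume JSON is the default output format.
    response = requests.get(
        "http://tour-pedia.org/api/getPlaces",
        params={"location": "Rome", "category": "restaurant"},
    )
    response.raise_for_status()
    places = response.json()
    print(len(places), "places returned")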
15 http://tour-pedia.org/sparql
16 http://d2rq.org/
17 http://wiki.dbpedia.org/Ontology
18 http://tour-pedia.org/api
3 Conclusions and Future Work
In this paper we have illustrated Tourpedia, which aims to be the DBpedia of tourism. A deeper connection between Tourpedia and DBpedia would be interesting: at the moment, in fact, only locations are connected to DBpedia. As future work, we plan to also align the attractions and points of interest contained in Tourpedia to DBpedia.
Tourpedia could be exploited both by tourism stakeholders, to get the sentiment about tourist places, and by general users.
At the moment, the procedure to update the datasets is manual. As future work, we plan to define a semi-automatic procedure to update them and to add new locations.
Acknowledgements
This work has been carried out within the OpeNER project, co-funded by the European Commission under FP7 (7th Framework Programme, Grant Agreement no. 296451).
References
1. Bacciu, C., Lo Duca, A., Marchetti, A., Tesconi, M.: Accommodations in Tuscany
as Linked Data. In: Proceedings of The 9th edition of the Language Resources and
Evaluation Conference (LREC 2014). pp. 3542–3545 (May, 26-31 2014)
2. Berners-Lee, T., Hendler, J., Lassila, O.: The semantic web. Scientific Ameri-
can 284(5), 34–43 (May 2001), http://www.sciam.com/article.cfm?articleID=
00048144-10D2-1C70-84A9809EC588EF21
3. Bizer, C., Heath, T., Berners-Lee, T.: Linked data - the story so far. Int. J. Semantic
Web Inf. Syst. 5(3), 1–22 (2009)
4. Chaves, M.S., de Freitas, L.A., Vieira, R.: Hontology: A multilingual ontology for
the accommodation sector in the tourism industry. In: Filipe, J., Dietz, J.L.G. (eds.)
KEOD. pp. 149–154. SciTePress (2012)
5. Cresci, S., D’Errico, A., Gazzé, D., Lo Duca, A., Marchetti, A., Tesconi, M.: Tour-
pedia: a Web Application for Sentiment Visualization in Tourism Domain. In: Pro-
ceedings of The OpeNER Workshop in The 9th edition of the Language Resources
and Evaluation Conference (LREC 2014). pp. 18–21 (May, 26 2014)
6. Heath, T., Bizer, C.: Linked Data: Evolving the Web into a Global Data Space.
Morgan & Claypool, 1st edn. (2011), http://linkeddatabook.com/
7. Hepp, M.: Goodrelations language reference. Tech. rep., Hepp Research GmbH,
Innsbruck (2011)
8. Hepp, M.: Accommodation ontology language reference. Tech. rep., Hepp Research
GmbH, Innsbruck (2013)
9. Iannella, R., McKinney, J.: VCARD ontology. Available at:
http://www.w3.org/TR/vcard-rdf/. Tech. rep. (2013)
Using Semantics for Interactive Visual Analysis
of Linked Open Data
Gerwald Tschinkel1 , Eduardo Veas1 , Belgin Mutlu1 and Vedran Sabol1,2
1 Know-Center, gtschinkel|eveas|bmutlu|vsabol@know-center.at
2 Graz University of Technology
Abstract. Providing easy to use methods for visual analysis of Linked
Data is often hindered by the complexity of semantic technologies. On
the other hand, semantic information inherent to Linked Data provides
opportunities to support the user in interactively analysing the data. This
paper provides a demonstration of an interactive, Web-based visualisa-
tion tool, the “Vis Wizard”, which makes use of semantics to simplify the
process of setting up visualisations, transforming the data and, most im-
portantly, interactively analysing multiple datasets using brushing and
linking methods.
1 Introduction
An objective of the CODE3 project is to make Linked Data accessible to novice
users by providing easy to use methods for visual data analysis. This is hard to
achieve with current Linked Data tools, which require user’s knowledge of se-
mantic technologies (such as SPARQL). This paper demonstrates how semantic
information can be used to support the interactive analytical process, without
the need for users to understand the complexities of the underlying technology.
Within CODE we use the RDF Data Cube Vocabulary4 for describing statis-
tical datasets. Our “Vis Wizard”5 tool provides an intuitive, easy to use interface
supporting visualisation and interactive analysis of RDF Cubes. In the Vis Wiz-
ard we utilise semantic information from Linked Data to support the user in:
1. Selecting and configuring the visualisations
2. Aggregating datasets
3. Brushing and linking over multiple datasets
This paper illustrates the use of semantic technologies in a visual analytics
tool that enables novice users to perform complex operations and analyses on
Linked Data. The demonstration focuses mainly on step 3, with a screencast of
the demonstration also being available6 .
3 http://code-research.eu
4 http://www.w3.org/TR/vocab-data-cube
5 http://code.know-center.tugraz.at/vis
6 http://youtu.be/aBfuGhgVaxA
Related work: A wide range of tools offers functionalities for visualising and interacting with data, but only a few rely on semantic information to support the analytical process. Tableau [6] provides a powerful visualisation toolset; however, it does not make use of semantic information for assisting the user. The CubeViz Framework [5] facilitates visual analytics on RDF Data Cubes, but does not use semantics for the user interface. CubeViz supports neither brushing, nor direct comparison of datasets, nor automatic selection of visualisations. Cammarano et al. [1] introduce a method to automatically analyse data attributes and map them to visual properties of the visualisation. Even so, this does not include an automatic selection of visualisation types.
2 The Linked Data Vis Wizard
The underlying idea is to enable the user to visually analyse data without knowing about the concepts of Linked Data or RDF Data Cubes. However, the Vis Wizard utilises the available semantic information to support users in interacting with the data and performing analytical tasks.
Fig. 1. Two RDF Data Cubes are shown in the Vis Wizard. Brushing the 3G coverage
value in the parallel coordinates highlights corresponding countries in the geo-chart.
Scenario: Figure 1 compares two datasets taken from the EU Open Data Endpoint7 in the Vis Wizard. The first one, shown in parallel coordinates, represents the 3G coverage in Europe, as a percentage value, per country for each year. The second dataset, shown in the geo-chart, contains active SIM cards per 100 people (encoded by colour-grading) for countries in Europe. In the following we use the Vis Wizard to gain insights into the data and to ascertain whether the datasets correlate.
2.1 Interactive Visual Analysis
Selecting and configuring the visualisation: The first step is to find an appropriate visual representation for the given dataset. Of the 10 supported
7 http://open-data.europa.eu
charts, only those are made available which can be actively and meaningfully used with the provided data. For example, the geo-chart is only available if the data contains a geographic dimension. After the chart has been selected, the user can adjust the mapping of the data onto the visual properties of the chart (e.g. axes, colours, item sizes), whereby only suitable mappings are offered. Chart selection and data mapping are computed by an algorithm [3] that compares the semantic information in the RDF Data Cube with the visual capabilities of the chart, which are described using the Visual Analytics Vocabulary8.
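Neither the algorithm of [3] nor the Visual Analytics Vocabulary is reproduced here; the following schematic Python sketch only illustrates the filtering principle, with chart requirements and dataset characteristics reduced to hand-written sets of labels.

    # Each chart declares what the data must provide; a dataset declares what it has.
    CHART_REQUIREMENTS = {
        "geo-chart":            {"geographic dimension"},
        "parallel coordinates": {"several numeric measures"},
        "bar chart":            {"categorical dimension", "numeric measure"},
    }

    def available_charts(dataset_characteristics):
        # Offer only those charts whose requirements are covered by the dataset.
        return [chart for chart, needed in CHART_REQUIREMENTS.items()
                if needed <= dataset_characteristics]

    sim_cards_cube = {"geographic dimension", "categorical dimension", "numeric measure"}
    print(available_charts(sim_cards_cube))  # ['geo-chart', 'bar chart']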
Aggregation: We provide a dialogue for aggregating the data and creating a new Data Cube. In the scenario shown in Fig. 1 the second dataset was averaged over the years and visualised over the countries. Using semantics we differentiate between dimensions and measures and enable validation of the user's choices. For suggesting charts and supporting aggregation we utilise RDF datatypes as well as occurrence and persistence information.
Brushing and Linking: The idea behind brushing and linking is to combine different visualisations to overcome the shortcomings of single techniques [2]. Interactive changes made in one visualisation are automatically reflected in the other ones. Our scenario contains two separate datasets: the first dataset has the dimensions "country" and "year", the second dataset has only "country". For conventional tools it is hard to provide interaction over different datasets, because relationships between them are usually not explicitly available. In cases where columns are labelled using equal strings, guessing the relationships may be possible, but when labels differ, e.g. a dimension in dataset A is called "Country" while in dataset B it is called "State", the relation cannot be established. In such cases the burden of understanding the structure of the datasets and linking them together falls on the user. Within RDF Data Cubes, each dimension has a URI which is (by definition) unique and can be used to establish the connection between datasets, making linking and brushing over different datasets possible.
Applied to our scenario the following interactive analysis is performed (see
Fig. 1): The user applies a brush on the first dataset by selecting a specific
value range in the “3G coverage” dimension using the parallel coordinates chart.
Countries outside of the selected range are greyed out in the geo-chart, which
shows the second dataset (SIM card penetration). Obviously, a high 3G coverage
correlates with high SIM card penetration (red), with one exception - France.
It should be noted that the functionality of linking data over different datasets, or even different endpoints, depends on the quality of the semantic information: the URIs of the cube dimensions in different datasets need to be consistent. If datasets use different, domain-specific URI namespaces, linking the data will not be possible.
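To make the URI-based linking concrete, the following Python sketch (not part of the Vis Wizard) collects the dimension URIs declared in the data structure definitions of two cubes and intersects them; the SPARQL endpoint, the dataset URIs and the use of the SPARQLWrapper library are illustrative assumptions.

    from SPARQLWrapper import SPARQLWrapper, JSON

    def dimension_uris(endpoint, dataset_uri):
        # Collect the URIs of all dimensions declared in the dataset's data structure definition.
        sparql = SPARQLWrapper(endpoint)
        sparql.setQuery("""
            PREFIX qb: <http://purl.org/linked-data/cube#>
            SELECT DISTINCT ?dim WHERE {
              <%s> qb:structure/qb:component/qb:dimension ?dim .
            }""" % dataset_uri)
        sparql.setReturnFormat(JSON)
        results = sparql.query().convert()
        return {b["dim"]["value"] for b in results["results"]["bindings"]}

    # Dimensions shared by two cubes are the ones that can drive brushing and linking.
    endpoint = "http://example.org/sparql"  # placeholder endpoint and dataset URIs
    shared = dimension_uris(endpoint, "http://example.org/cube/3g-coverage") & \
             dimension_uris(endpoint, "http://example.org/cube/sim-cards")
    print(shared)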
3 Evaluation
We conducted a formative evaluation to explore if our goals regarding the us-
ability of the Vis Wizard could be achieved and to ascertain that users were able
8 http://code.know-center.tugraz.at/static/ontology/visual-analytics.owl
to analyse complex datasets. Eight test users participated, each executing six tasks, where one task was exclusively about linking and brushing. The test users had a good knowledge of computers, but were not familiar with semantic data. We conducted a quantitative subjective workload test, using the simplified NASA R-TLX, and a qualitative thinking-aloud test. More details on the evaluation, including methodology, test users and results, are available in [4].
The functionality supporting the choice and configuration of the visualisa-
tion was much appreciated, but users pointed out that immediately suggesting
the most suitable visualisation would have been even more helpful. The task
regarding brushing in the scatterplot had a very high subjective performance of
accomplishing (the median was 91.25 on a scale from 0 to 100, 100 being the
highest value achievable). The conclusion of our evaluation is that, while several
usability issues still need to be fixed, the overall advantage is clearly observable.
4 Conclusion and Future Work
Within this research we have observed a high potential in using semantic infor-
mation for improving interaction in visual analytics. It has been shown that the
user supporting techniques were helpful in gaining insights from the data, with-
out spending much time in selecting and configuring visualisations or analysing
how to link the datasets manually.
As the correctness of the semantic annotations of the data is essential for our purpose, the stability of our approach could be improved by supporting URI aliases. We will also explore possibilities to rank the visualisations in order to, given a particular dataset, automatically show the most suitable one.
Acknowledgements. This work is funded by the EC FP7 projects CODE (grant 296150) and EEX-
CESS (grant 600601). The Know-Center GmbH is funded by Austrian Federal Government within
the Austrian COMET Program, managed by the Austrian Research Promotion Agency (FFG).
References
1. Cammarano, M., Dong, X.L., Chan, B., Klingner, J., Talbot, J., Halevey, A., Han-
rahan, P.: Visualization of heterogeneous data. In: IEEE Information Visualization
2. Keim, D.A.: Information visualization and visual data mining. In: IEEE Transac-
tions on Visualization and computer graphics (2002)
3. Mutlu, B., Höfler, P., Tschinkel, G., Veas, E.E., Sabol, V., Stegmaier, F., Granitzer,
M.: Suggesting visualisations for published data. In: Proceedings of IVAPP 2014.
pp. 267–275 (2014)
4. Sabol, V., Tschinkel, G., Veas, E., Hoefler, P., Mutlu, B., Granitzer, M.: Discovery
and visual analysis of linked data for humans. In: Accepted for publication at the
13th International Semantic Web Conference (2014)
5. Salas, P.E., Martin, M., Mota, F.M.D., Breitman, K., Auer, S., Casanova, M.A.:
Publishing statistical data on the web. In: Proceedings of 6th International IEEE
Conference on Semantic Computing. IEEE 2012, IEEE (2012)
6. Stolte, C., Hanrahan, P.: Polaris: A system for query, analysis and visualization
of multi-dimensional relational databases. IEEE Transactions on Visualization and
Computer Graphics 8, 52–65 (2002)
Exploiting Linked Data Cubes with OpenCube Toolkit
Evangelos Kalampokis1,2, Andriy Nikolov3, Peter Haase3, Richard Cyganiak4,
Arkadiusz Stasiewicz4, Areti Karamanou1,2, Maria Zotou1,2, Dimitris Zeginis1,2,
Efthimios Tambouris 1,2, Konstantinos Tarabanis1,2
1 Centre for Research & Technology - Hellas, 6th km Xarilaou-Thermi, 57001, Greece
2 University of Macedonia, Egnatia 156, 54006 Thessaloniki, Greece, {ekal, akarm, mzotou, zegin, tambouris, kat}@uom.gr
3 fluid Operations AG, Altrottstraße 31, 69190 Walldorf, Germany, {andriy.nikolov, peter.haase}@fluidops.com
4 Insight Centre for Data Analytics, Galway, Ireland, {richard.cyganiak, arkadiusz.stasiewicz}@insight-centre.org
Abstract. The adoption of the Linked Data principles and technologies has
promised to enhance the analysis of statistics at a Web scale. Statistical data,
however, is typically organized in data cubes where a numeric fact (aka
measure) is categorized by dimensions. Both data cubes and linked data
introduce complexity that raises the barrier for reusing the data. The majority of
linked data tools are not able to support or do not facilitate the reuse of linked
data cubes. In this demo we present the OpenCube Toolkit that enables the easy
publishing and exploitation of linked data cubes using visualizations and data
analytics.
Keywords: Linked data, statistics, data cubes, visualization, analytics.
1 Introduction
A major part of Open Data concerns statistics such as population figures, economic
and social indicators. Analysis of statistical open data can provide value to both
citizens and businesses in various areas such as business intelligence, epidemiological
studies and evidence-based policy-making. Linked Data has emerged as a promising
paradigm to enable the exploitation of the Web as a platform for data integration. As a
result Linked Data has been proposed as the most appropriate way for publishing
open data on the Web. Statistical data needs to be formulated as RDF data cubes [1]
characterized by dimensions, slices and observations in order to unveil its full
potential and value [2]. Processing of linked statistical data has only become a popular research topic in recent years. Several practical solutions have been
developed in this domain: for example, the LOD2 Statistical Workbench1 brings
together components developed in the LOD2 project by means of the OntoWiki2 tool.
1 http://wiki.lod2.eu/display/LOD2DOC/LOD2+Statistical+Workbench
2 http://aksw.org/Projects/OntoWiki.html
In this demo paper we describe the OpenCube Toolkit, which enables users to work with linked data cubes in an easy manner. In comparison with existing tools, our toolkit provides the following contributions:
• an application development SDK allowing customized domain-specific applications to be built to support various use cases;
• new functionalities enabling users to better exploit linked data cubes;
• components supporting the whole linked data cube lifecycle.
2 OpenCube Toolkit
The OpenCube Toolkit3 integrates a number of components which enable the user to
work with semantic statistical data at different stages of the lifecycle: from importing
legacy data and exposing it as linked open data to applying advanced visualization
techniques and complex statistical methods to it.
The Information Workbench (IWB) platform [3] serves as a backbone for the
toolkit components. The components are integrated into a single architecture via
standard interfaces provided by the IWB SDK: widgets (for UI controls) and data
providers (for data importing and processing components). The overall UI design is
based on the use of wiki-based templates providing dedicated views for RDF
resources: an appropriate view template is applied to an RDF resource based on its
type. All components of the architecture share the access to a common RDF
repository (local or remote) and can retrieve data by means of SPARQL queries.
The OpenCube Toolkit demo uses datasets from the Linked Data version of
Eurostat4 and can be currently accessed using the following link:
http://data.fluidops.net.
2.1 Using the OpenCube Toolkit for data import, transformation, and publishing
Much of the relevant and valuable statistical data are only available in various legacy formats, such as CSV and Excel. To present these data in the form of linked RDF data cubes, they have to be imported, transformed into the RDF Data Cube format and made accessible for querying.
The OpenCube TARQL5 component enables cube construction from legacy data via TARQL (Transformation SPARQL): a SPARQL-based data mapping language that enables conversion of data from RDF, CSV, TSV and JSON (and potentially XML and relational databases) to RDF. TARQL is a tool for converting CSV files to RDF using SPARQL 1.1 syntax. It is built on top of Apache ARQ6. The OpenCube TARQL component includes the new release of TARQL, which brings several improvements, such as: streaming capabilities, multiple query patterns in one
3 http://opencube-toolkit.eu
4 http://eurostat.linked-statistics.org
5 https://github.com/cygri/tarql
6 http://jena.apache.org/documentation/query/
mapping file, convenient functions for typical mapping activities, validation rules included in the mapping file, and increased flexibility (dealing with CSV variants such as TSV).
The R2RML7 language is a W3C standard for mappings from relational databases to RDF datasets. D2RQ8 is a platform for accessing relational databases as virtual, read-only RDF graphs. The D2RQ Extensions for Data Cube cover the functionality of importing raw data as data cubes by mapping the raw data to RDF. The process of mapping a data cube to a relational data source includes: (a) mapping the tables to classes of entities, (b) mapping selected columns into cube dimensions and cube measures, (c) mapping selected rows into observation values, and (d) generating triples with the data structure definition. By providing information about the dataset, such as the data dimensions and related measures, the user receives an R2RML mapping file, which is then used to generate and store the output.
2.2 Using the OpenCube Toolkit to utilize statistical data
To make use of the available statistical data cubes, the user needs, as a minimum, to be able to explore and visualize the data. The next step involves being able to apply relevant statistical analysis methods to these data.
The OpenCube Browser enables the exploration of an RDF data cube by presenting two-dimensional slices of the cube as a table. Currently, the browser enables users to change the two dimensions that define the table and also to change the values of the fixed dimensions, thus selecting a different slice to be presented. Moreover, the browser supports roll-up and drill-down OLAP operations through dimension reduction and insertion, respectively. Finally, the user can create and store a two-dimensional slice of the cube based on the data presented in the browser. Initially, the browser selects two dimensions to present in the table and sets a fixed value for all other dimensions. Based on these it creates and sends a SPARQL query to the store to retrieve the appropriate data. For the drill-down and roll-up operations the browser assumes that a set of data cubes has been created out of the initial cube by summarizing observations across one or more dimensions. We assume that these cubes define an Aggregation Set.
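The query generated by the browser is not given in the paper; the following Python sketch shows, under the RDF Data Cube vocabulary, the kind of SELECT query that retrieves a two-dimensional slice by pinning all remaining dimensions to fixed values. All URIs below are placeholders, and SPARQLWrapper is just one possible client library.

    from SPARQLWrapper import SPARQLWrapper, JSON

    def slice_query(dataset_uri, dim_row, dim_col, measure, fixed):
        # Keep dim_row and dim_col free; pin every other dimension to the chosen value.
        lines = [
            "PREFIX qb: <http://purl.org/linked-data/cube#>",
            "SELECT ?row ?col ?value WHERE {",
            "  ?obs a qb:Observation ;",
            "       qb:dataSet <%s> ;" % dataset_uri,
            "       <%s> ?row ;" % dim_row,
            "       <%s> ?col ;" % dim_col,
            "       <%s> ?value ." % measure,
        ]
        for dim_uri, value_uri in fixed.items():
            lines.append("  ?obs <%s> <%s> ." % (dim_uri, value_uri))
        lines.append("}")
        return "\n".join(lines)

    query = slice_query(
        "http://example.org/dataset/unemployment",
        "http://example.org/dimension/geo",
        "http://example.org/dimension/time",
        "http://example.org/measure/rate",
        {"http://example.org/dimension/sex": "http://example.org/code/sex-T"},
    )
    sparql = SPARQLWrapper("http://example.org/sparql")  # placeholder endpoint
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]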
The OpenCube Map View enables the visualization of RDF data cubes on a map
based on their geospatial dimension. Initially, Map View presents to the user the
supported types of visualization (including markers, bubbles, choropleth and heat
maps) along with all the dimensions and their values in drop-down lists.
The user selects the type of visualization and a map appears that actually visualizes a
one-dimension slice of the cube where the geospatial dimension is free and the other
dimensions are randomly “fixed”. In addition, the user can click on an area or marker
or bubble and see the details of the specific observation. The maps are created using
OpenStreetMap9 and the Leaflet10 open-source library.
7 http://www.w3.org/TR/r2rml/
8 http://d2rq.org/
9 http://wiki.openstreetmap.org/wiki/Develop
10 http://leafletjs.com/
To allow the user to explore the data in a data cube, it is important that the visualization controls used are (i) interactive and (ii) adapted to the cube data representation. In this way the user can easily switch between different slices of the cube and compare them. To this end, we implemented our Chart-based Visualization functionality. The charts can be inserted into a wiki page of an RDF resource and configured to show data cube slices. When viewing the page, the user can change the selection of dimension values to change the visualised cube slices. The SPARQL query to retrieve the appropriate data is constructed based on the slice definition, and the data is downloaded from the SPARQL endpoint dynamically.
When working with statistical data, a crucial requirement is the possibility to apply
specialized analysis methods. One of the most popular environments for statistical
data analysis is R11. To use the capabilities of R inside the OpenCube Toolkit, we
integrated it with our architecture through the Statistical Analysis of RDF Data
Cubes component. R is run as a web service (using Rserve12 package) and accessed
via HTTP. Input data are retrieved using SPARQL queries and passed to R together
with an R script. Then, R capabilities can be exploited in two modes: (i) as a widget
(the script generates a chart, which is then shown on the wiki page) and (ii) as a data
source (the script produces a data frame, which is then converted to RDF using
defined R2RML mappings and stored in the data repository).
3 Conclusions
This demo paper presents the first release of the OpenCube Toolkit developed to
enable easy publishing and reusing of linked data cubes. The toolkit smoothly
integrates separate components dealing with different subtasks of the linked statistical
data processing workflow to provide the user with a rich set of functionalities for
working with statistical semantic data.
Acknowledgments. The work presented in this paper was partially carried out in
OpenCube13 project, which is funded by the EC within FP7 (No. 611667).
References
1. Cyganiak, R., Reynolds, D.: The RDF Data Cube vocabulary,
http://www.w3.org/TR/vocab-data-cube/ (2013)
2. Kalampokis, E., Tambouris, E., Tarabanis, K.: Linked Open Government Data Analytics.
In: Wimmer, M.A., Janssen, M., Scholl, H.J. (eds.) EGOV 2013. LNCS, vol. 8074, pp. 99-
110. IFIP International Federation for Information Processing (2013)
3. Haase, P., Schmidt, M., Schwarte, A. Information Workbench as a Self-Service platform.
COLD 2011, ISWC 2011, Shanghai, China (2011).
11 http://www.r-project.org/
12 http://www.rforge.net/Rserve/
13 http://www.opencube-project.eu
Detecting Hot Spots in Web Videos
José Luis Redondo Garcı́a1 , Mariella Sabatino1 ,
Pasquale Lisena1 , Raphaël Troncy1
EURECOM, Sophia Antipolis, France,
{redondo, mariella.sabatino, pasquale.lisena, raphael.troncy}@eurecom.fr
Abstract. This paper presents a system that detects and enables the
exploration of relevant fragments (called Hot Spots) inside educational
online videos. Our approach combines visual analysis techniques and
background knowledge from the web of data in order to quickly get an
overview about the video content and therefore promote media consump-
tion at the fragment level. First, we perform a chapter segmentation by
combining visual features and semantic units (paragraphs) available in
transcripts. Second, we semantically annotate those segments via Named
Entity Extraction and topic detection. We then identify consecutive segments talking about similar topics and entities, which we merge into bigger, semantically independent media units. Finally, we rank those
segments and filter out the lowest scored candidates, in order to pro-
pose a summary that illustrates the Hot Spots in a dedicated media
player. An online demo is available at http://linkedtv.eurecom.fr/
mediafragmentplayer.
Keywords: Semantic Video Annotation, Media Fragments, Summa-
rization
1 Introduction
Nowadays, people consume all kinds of audiovisual content on a daily basis. From breaking news to satirical videos, personal recordings or cooking tutorials, we are constantly fed with video content to watch. A common practice by viewers consists in fast browsing through the video, sometimes using the key frames provided by the video sharing platform, with the risk of missing the essence of the video. This phenomenon is even more obvious when it comes to educational web content. A study made over media entertainment streaming services reveals that the majority of partial content views (52.55%) are ended by the user within the first 10 minutes, and about 37% of these sessions do not last past the first five minutes [5]. In practice, it is difficult and time consuming to manually gather video insights that give the viewers a fair understanding of what the video is talking about. Our research tackles this problem by proposing a set of automatically annotated media fragments called Hot Spots, which are intended to highlight the main concepts and topics discussed in a video. We also propose a dedicated exploration interface that eases the consumption and sharing of those hot spots.
The challenge of video segmentation has been addressed in numerous previous works. Some of them rely exclusively on low-level visual features such as color histograms or visual concept detection and clustering operations [4]. Other approaches rely on text, leveraging the video transcripts and sometimes manual annotations and comments attached to the video [6], while the combination of both text and visual features is explored in [1]. Our approach also combines visual and textual features, with the added value of leveraging structured knowledge available in the web of data.
2 Generating and Exploring Hot Spots in Web Videos
This demo implements a multimodal algorithm for detecting and annotating the key fragments of a video in order to propose a quick overview of the main topics being discussed. We conduct an experiment over a corpus of 1681 TED talks1, a global set of conferences owned by the private non-profit Sapling Foundation under the slogan "Ideas Worth Spreading".
2.1 Media Fragments Generation
First, we perform shot segmentation for each video using the algorithm described in [3]. Shots are the smallest units in a video, capturing visual changes between frames but not necessarily reflecting changes of the topic being discussed in the video. Therefore, we introduce the notion of chapters, corresponding to wider chunks illustrating particular topics. In order to obtain such fragments, we use specific marks embedded in the video transcripts available for all TED talks that indicate the start of new paragraphs. In a last step, those fragments are combined with the visual shots. Hence, we adjust the boundaries of each chapter using both paragraph and shot boundaries.
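The exact combination rule is not spelled out above; one simple way to realise it, assuming paragraph marks and shot boundaries are both given as start times in seconds, is to snap every paragraph start to the nearest shot boundary, as in this sketch.

    def align_chapters(paragraph_starts, shot_starts):
        # Snap every paragraph start (from the transcript) to the nearest shot boundary
        # (from visual segmentation), so that chapters begin on a visual cut.
        aligned = [min(shot_starts, key=lambda s: abs(s - p)) for p in paragraph_starts]
        # Remove duplicates while keeping temporal order.
        return sorted(set(aligned))

    paragraphs = [0.0, 62.4, 118.9, 240.3]   # paragraph marks from a transcript (illustrative)
    shots = [0.0, 13.1, 60.8, 121.5, 180.0, 238.7, 300.2]
    print(align_chapters(paragraphs, shots))  # [0.0, 60.8, 121.5, 238.7]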
2.2 Media Fragments Annotation
We rely on the subtitles available for the 1681 TED talks for annotating the
media fragments which have been generated. More precisely, we detect topics
and named entities. For the former, we have used the dedicated TextRazor topic
detection method2 , while for the latter, we used the NERD framework [2]. Both
entities and topics come with a relevance score which we use to give a weight to
this particular semantic unit within the context of the video story. Topics and
named entities are attached to a chapter.
2.3 Hot Spots Generation
Once all chapters are delimited and annotated, we iteratively cluster them, in particular when temporally close segments are similar enough in terms of topics and named entities. More precisely, we compute a similarity function between
1 http://www.ted.com/
2 https://www.textrazor.com/documentation
consecutive pairs of segments S_1 and S_2 until no new merges are possible. This comparison leverages the annotations attached to each segment by analyzing the number of coincidences between the topics T = \{\max_3 \sum_i topic_i\,Rel_i\} and the entities E = \{\max_{5W's} \sum_i entity_i\,Rel_i\}, where Rel_i is the TextRazor relevance:

d(S_1, S_2) = w_{topic} \cdot \frac{|T_1 \cap T_2|}{\max\{|T_1|, |T_2|\}} + w_{entity} \cdot \frac{|E_1 \cap E_2|}{\max\{|E_1|, |E_2|\}}    (1)
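A direct reading of Equation (1), with T and E taken as the sets of top-ranked topic and entity annotations of a segment, can be sketched in Python as follows; the weights w_topic and w_entity are free parameters whose values are not reported here.

    def segment_similarity(t1, e1, t2, e2, w_topic=0.5, w_entity=0.5):
        # Overlap of topic and entity annotations between two consecutive segments,
        # normalised by the size of the larger annotation set (Equation 1).
        def overlap(a, b):
            return len(a & b) / max(len(a), len(b)) if a and b else 0.0
        return w_topic * overlap(t1, t2) + w_entity * overlap(e1, e2)

    s1_topics, s1_entities = {"education", "technology"}, {"TED", "MIT"}
    s2_topics, s2_entities = {"education", "children"}, {"TED"}
    print(segment_similarity(s1_topics, s1_entities, s2_topics, s2_entities))  # 0.5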
After this clustering process, the video is decomposed into fewer but longer chapters. However, there are still too many candidates to be proposed as Hot Spots. Therefore, we filter out those fragments which contain potentially less interesting topics. We define a function for measuring the interestingness of a video segment, which directly depends on the relevance and frequency of the annotations and which is inversely proportional to the segment's length. In our current approach, the Hot Spots are those fragments whose relative relevance falls under the first quarter of the final score distribution.
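The interestingness function is only characterised qualitatively above. A minimal sketch consistent with that description (relevance of the annotations in the numerator, segment length in the denominator) is given below; the concrete scoring and the reading of the "first quarter" as the top quartile of the score distribution are our assumptions, not the authors' exact formulation.

    def interestingness(annotation_relevances, duration_seconds):
        # Higher for segments with many highly relevant annotations, lower for long segments.
        return sum(annotation_relevances) / duration_seconds if duration_seconds > 0 else 0.0

    def select_hot_spots(segments):
        # segments: list of (segment_id, relevances, duration); keep the top quartile by score.
        scored = sorted(segments, key=lambda s: interestingness(s[1], s[2]), reverse=True)
        keep = max(1, len(scored) // 4)
        return [seg_id for seg_id, _, _ in scored[:keep]]

    demo = [("c1", [0.9, 0.8], 60), ("c2", [0.2], 120), ("c3", [0.7, 0.6, 0.5], 90), ("c4", [0.3], 45)]
    print(select_hot_spots(demo))  # ['c1']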
In a last step, for each Hot Spot, we also generate a summarization to be
shown in a dedicated media player where we highlight the main topics T and
entities E which have been discovered.
2.4 Exploring Hot Spots within TED Talks
The Hot Spots and their summaries are visualized in a user-friendly, Media Fragment URI-compliant media player. The procedure to get the Hot Spots for a particular TED talk is the following: the user enters a valid TED Talk URL to get a landing page (Figure 1a). When the results are available, the hot spots are highlighted on the timeline together with the label of the most relevant chapter annotation (Figure 1b). This label can be extended to a broader set of entities and topics (Figure 1c). Finally, the user can always share those hot spot segments using media fragment URIs (Figure 1d).
3 Discussion
We have presented a demo for automatically discovering Hot Spots in online educational videos. We leverage visual analysis and background knowledge available in the web of data to detect which fragments best illustrate the main topics discussed in the video. Those Hot Spots allow the viewer to quickly decide if a video is worth watching and provide an incentive for consuming videos at the fragment level. In addition, Hot Spots can be explored in a dedicated media fragment player which also displays the attached semantic annotations.
We plan to carry out an exhaustive evaluation of our approach involving real user feedback, in order to optimize the results of our Hot Spot detection algorithm and to improve the usability and efficiency of the developed interface.
Fig. 1: Visualizing the Hot Spots of a TED Talk (available at http://linkedtv.eurecom.fr/mediafragmentplayer/video/bbd70fff-e828-4db5-80d0-1a4c9aea430e)
We also plan to further exploit the segmentation results and their corresponding
annotations for establishing links between fragments belonging to different videos
in order to generate true hyperlinks within a closed collection such as TED talks
and make results available following Linked Data principles.
References
1. S.-F. Chang, R. Manmatha, and T.-S. Chua. Combining text and audio-visual
features in video indexing. In IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP’05), 2005.
2. G. Rizzo and R. Troncy. NERD: A Framework for Unifying Named Entity Recog-
nition and Disambiguation Extraction Tools. In 13th Conference of the European
Chapter for Computational Linguistics (EACL’12), Avignon, France, 2012.
3. P. Sidiropoulos, V. Mezaris, I. Kompatsiaris, H. Meinedo, M. Bugalho, and I. Tran-
coso. Temporal video segmentation to scenes using high-level audiovisual features.
IEEE Transactions on Circuits and Systems for Video Technology, 21(8):1163–1177,
2011.
4. C. G. Snoek and M. Worring. Multimodal video indexing: A review of the state-of-
the-art. Multimedia tools and applications, 25(1):5–35, 2005.
5. H. Yu, D. Zheng, B. Y. Zhao, and W. Zheng. Understanding user behavior in
large-scale video-on-demand systems. In 1st ACM SIGOPS/EuroSys European
Conference on Computer Systems, pages 333–344, 2006.
6. Z.-J. Zha, M. Wang, J. Shen, and T.-S. Chua. Text mining in multimedia. In Mining
Text Data, pages 361–384. Springer, 2012.
EUROSENTIMENT: Linked Data Sentiment
Analysis
J. Fernando Sánchez-Rada1 , Gabriela Vulcu2 , Carlos A. Iglesias1 , and Paul
Buitelaar2
1 Dept. Ing. Sist. Telemáticos, Universidad Politécnica de Madrid, {jfernando,cif}@gsi.dit.upm.es, http://www.gsi.dit.upm.es
2 Insight, Centre for Data Analytics at National University of Ireland, Galway, {gabriela.vulcu,paul.buitelaar}@insight-centre.org, http://insight-centre.org/
Abstract. Sentiment and Emotion Analysis strongly depend on quality
language resources, especially sentiment dictionaries. These resources are
usually scattered, heterogeneous and limited to specific domains of appli-
cation by simple algorithms. The EUROSENTIMENT project addresses these issues by 1) developing a common language resource representation model for sentiment analysis, together with APIs for sentiment analysis services based on established Linked Data formats (lemon, Marl, NIF and ONYX), and 2) creating a Language Resource Pool (a.k.a. LRP) that makes existing scattered language resources and services for sentiment analysis available to the community in an interoperable way. In this paper we describe
the available language resources and services in the LRP and some sam-
ple applications that can be developed on top of the EUROSENTIMENT
LRP.
Keywords: Language Resources, Sentiment Analysis, Emotion Analy-
sis, Linked Data, Ontologies
1 Introduction
This paper reports our ongoing work in the European R&D project EUROSEN-
TIMENT, where we have created a multilingual Language Resource Pool (LRP)
for Sentiment Analysis based on a Linked Data approach for modelling linguistic
resources.
Sentiment Analysis requires language resources such as dictionaries that pro-
vide a sentiment or emotion value to each word. Just as words have different
meanings in different domains, the associated sentiment or emotion also varies.
Hence, every domain has its own dictionary. The information about what each
domain represents or how the entries for each domain are related is usually un-
documented or implied by the name of each dictionary. Moreover, it is common
that dictionaries from different providers use different representation formats.
Thus, it is very difficult to use different dictionaries at the same time.
In order to overcome these limitations, we have defined a Linked Data Model
for Sentiment and Emotion Analysis, which is based on the combination of sev-
eral vocabularies: the NLP Interchange Format (NIF) [1], to represent informa-
tion about texts, referencing text in the web with unique URIs; the Lexicon
Model for Ontologies (lemon) [2], to provide lexical information, and differen-
tiate between different domains and senses of a word; Marl [5], to link lexical
entries or senses with a sentiment; and Onyx [3], that adds emotive information.
The use of a semantic format not only eliminates the interoperability issue,
but it also makes information from other Linked Data sources available for the
sentiment analysis process. The EUROSENTIMENT LRP generates language
resources from legacy corpora, linking them with other Linked Data sources, and
shares this enriched version with other users.
In addition to language resources, the pool also offers access to sentiment
analysis services with a unified interface and data format. This interface builds on
the NIF Public API, adding several extra parameters that are used in Sentiment
Analysis. Results are formatted using JSON-LD and the same vocabularies as
for language resources. The NIF-compatible API allows for the aggregation of
results from different sources.
The project documentation3 contains further information about the EU-
ROSENTIMENT format, APIs and tools.
2 Language Resources
The EUROSENTIMENT LRP contains a set of language resources (lexicons and
corpora). The available EUROSENTIMENT language resources can be found
here.4 The user can see the domain and the language of each language resource.
At the moment the LRP contains resources for electronics and hotel domains in
six languages (Catalan, English, Spanish, French, Italian and Portuguese) and
we are currently working on adding more language resources from other domains
like telco, movies, food and music. Table 1 shows the number of reviews in each
available corpus and the number of lexical entries in each available lexicon.
A detailed description of the methodology for creating the domain-specific
sentiment lexicons and corpora to be added in the EUROSENTIMENT LRP
was presented at LREC 2014 [4].
The EUROSENTIMENT demonstrator5 shows how users can benefit from
the LRP, including an interactive SPARQL query editor to access the resources
and a faceted browser.
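As an indication of the kind of query the editor supports, the following Python sketch looks up the polarity recorded for a lexical entry in a domain lexicon. The endpoint URL is a placeholder, and the exact property paths are assumptions based on the lemon and Marl vocabularies used by the LRP, not a documented query.

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://example.org/eurosentiment/sparql")  # placeholder endpoint
    sparql.setQuery("""
        PREFIX lemon: <http://lemon-model.net/lemon#>
        PREFIX marl:  <http://www.gsi.dit.upm.es/ontologies/marl/ns#>
        SELECT ?word ?polarity WHERE {
          ?entry lemon:canonicalForm/lemon:writtenRep ?word ;
                 lemon:sense ?sense .
          ?sense marl:polarityValue ?polarity .
          FILTER(str(?word) = "quiet")
        }""")
    sparql.setReturnFormat(JSON)
    for b in sparql.query().convert()["results"]["bindings"]:
        print(b["word"]["value"], b["polarity"]["value"])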
3 Sentiment Services
In addition to a model for language resources, EUROSENTIMENT also provides an API for sentiment and emotion analysis services. Several already existing services in different languages have been adapted to expose this API. Any user can benefit from these services, which are conveniently listed in the EUROSENTIMENT portal. At the moment, the following services are provided in several languages: language detection, domain detection, sentiment and emotion detection, and text analysis.
3 http://eurosentiment.readthedocs.org
4 http://portal.eurosentiment.eu/home_resources
5 http://eurosentiment.eu/demo
Table 1. Summary of the resources in the LRP
Lexicons:
Language     Domains             #Entities
German       General             107417
English      Hotel, Electronics  8660
Spanish      Hotel, Electronics  1041
Catalan      Hotel, Electronics  1358
Portuguese   Hotel, Electronics  1387
French       Hotel, Electronics  651
Corpora:
Language     Domains             #Entities
English      Hotel, Electronics  22373
Spanish      Hotel, Electronics  18191
Catalan      Hotel, Electronics  4707
Portuguese   Hotel, Electronics  6244
French       Electronics         22841
Fig. 1. The LRP provides a list of available services
4 Applications Using the LRP
To demonstrate the capabilities of the EUROSENTIMENT LRP, we open-
sourced the code of several applications that make use of the services and re-
sources of the EUROSENTIMENT LRP. The applications are written in dif-
ferent programming languages and are thoroughly documented. Using these ap-
plications as a template, it is straightforward to immediately start consuming
the services and resources. The code can be found on the EUROSENTIMENT
Github repositories.6
6 http://github.com/eurosentiment
Fig. 2. Simple service that uses the resources in EUROSENTIMENT to analyse opin-
ions in different languages and domains
Acknowledgements
This work has been funded by the European project EUROSENTIMENT under
grant no. 296277
References
1. Hellmann, S., Lehmann, J., Auer, S., Nitzschke, M.: Nif combinator: Combining nlp
tool output. In: Knowledge Engineering and Knowledge Management, pp. 446–449.
Springer (2012)
2. McCrae, J., Spohr, D., Cimiano, P.: Linking lexical resources and ontologies on the
semantic web with lemon. In: Antoniou, G., Grobelnik, M., Simperl, E., Parsia, B.,
Plexousakis, D., De Leenheer, P., Pan, J. (eds.) The Semantic Web: Research and
Applications, Lecture Notes in Computer Science, vol. 6643, pp. 245–259. Springer
Berlin Heidelberg (2011)
3. Sánchez-Rada, J.F., Iglesias, C.A.: Onyx: Describing emotions on the web of data.
In: ESSEM@ AI* IA. pp. 71–82. Citeseer (2013)
4. Vulcu, G., Buitelaar, P., Negi, S., Pereira, B., Arcan, M., Coughlan, B., Sánchez-Rada, J.F., Iglesias, C.A.: Generating Linked-Data based Domain-Specific Sentiment Lexicons from Legacy Language and Semantic Resources. In: International Workshop on EMOTION, SOCIAL SIGNALS, SENTIMENT & LINKED OPEN DATA, co-located with LREC 2014, Reykjavik, Iceland (May 2014)
5. Westerski, A., Iglesias, C.A., Tapia, F.: Linked Opinions: Describing Sentiments on
the Structured Web of Data. In: Proceedings of the 4th International Workshop
Social Data on the Web (2011)
Property-based typing with LITEQ
Programming access to weakly-typed RDF data
Stefan Scheglmann1 , Martin Leinberger1 , Ralf Lämmel2 , Steffen Staab1 , Matthias
Thimm1 , Evelyne Viegas3
1 Institute for Web Science and Technologies, University of Koblenz-Landau, Germany
2 The Software Languages Team, University of Koblenz-Landau, Germany
3 Microsoft Research Redmond, US
Abstract. Coding against the semantic web can be quite difficult as the basic
concepts of RDF data and programming languages differ greatly. Existing map-
pings from RDF to programming languages are mostly schema-centric. However,
this can be problematic as many data sources lack schematic information. To al-
leviate this problem, we present a data centric approach that focuses on the prop-
erties of the instance data found in RDF and that lets a developer create types
in his programming language by specifying properties that need to be present.
This resembles a type definition rooted in description logics. We show how such
a type definition can look like and demonstrate how a program using such type
definitions can can be written.
1 Introduction
Access to RDF data from within programs is difficult to realize since (i) RDF follows a flexible and extensible data model, (ii) schema information is often missing or incomplete, and (iii) RDF type information for the data is missing. In order to establish robust access from a program to RDF data, a developer faces several challenges. Most data sources are defined externally and the developer has only a vague idea of what is to be found in the data source; therefore he has to explore the data first. Once explored, he has to deal with the impedance mismatch between how RDF types structure RDF data and how code types are used in programming languages [4, 2, 6, 1]. In response to these challenges, we have proposed LITEQ [5], a system that allows for the mapping of RDF schema information into programming language types.
However, using LITEQ in practice has shown that purely relying on RDF schema information for the mapping raises new issues. To alleviate these problems, we have implemented a property-based approach and included it in LITEQ as an alternative mode of usage. Using this, a developer is able to create code types by listing properties that should be present in instances of the code type.
In this demo, we present an implementation of this approach1 in F#. It supports developers in defining new code types by specifying their properties. This is aided by an auto-completion mechanism, which computes its suggestions directly on the instance data without any need for a schema. The approach is intended as an extension to the LITEQ library, also presented in the In-Use Track of ISWC 2014.
1 http://west.uni-koblenz.de/Research/systems/liteq
2 The LITEQ approach
Typically, the integration of RDF data in a programming environment is a multi-step process: (1) the structure and content of the data source have to be explored, (2) the code types and their hierarchy have to be designed and implemented, (3) the queries for the concrete data have to be defined, and finally (4) the data can be retrieved and mapped to the predefined code types.
The NPQL Approach: LITEQ provides an IDE-integrated workflow to cope with all of these tasks. It implements NPQL, a novel path query language, to explore the data source, to define types based on schematic information and to query the data. The results returned by these queries are automatically typed in the programming language. All of this is aided by the auto-completion of the IDE. Figure 1 shows a typical LITEQ expression using an NPQL query in order to retrieve all mo:MusicArtist entities which actually have the foaf:made and mo:biography properties defined. The result is returned as a set of objects of the created code type for mo:MusicArtist.
Fig. 1: Querying for all music artists that made at least one record and have a biography.
Property-based Type access with LITEQ: The example shown above has several problems. A schema-centric technique only provides code types for which an RDF type is defined in the schema. It is not possible to introduce more specific code types; e.g. if it is known that all entities of interest will have at least one foaf:made and mo:biography relation, one may like to reflect that in the returned code type. Lastly, and most importantly, for all schema-centric access methods, like LITEQ, a more or less complete schema must be present, which is not always the case, especially in Linked Data. To cope with these problems of schema-centric approaches, we introduce the idea of a different data access and typing approach: property-based RDF access in a program [7].
Property-based Type declaration: The basic idea is very simple: (1) A code type is defined by a set of properties. (2) Extensionally, a code type represents an anonymous RDF type (or view on the data) which refers to the set of all entities that actually have all the properties defined in the code type. (3) Intensionally, the code type's signature is given by the set of properties (for these it provides direct access). All other properties of an instance of this type can only be accessed indirectly in a program using the predicate name. Specifically, this means that a developer might define a type by just declaring a set of properties as the type's signature (Sig), e.g. Sig = {foaf:made, mo:biography}. Our approach allows for two different ways to map such a property-based type to a corresponding code type. The first is a flat mapping, which just maps to an rdf:Resource representation and only makes those properties explicitly accessible which are defined in the code type signature. Such an unnamed code type refers to the set of all entities sharing the properties in the code type's signature (Sig), cf. (1).
unnamedType_{Sig} \equiv \exists foaf:made \sqcap \exists mo:biography    (1)
The second option is a named mapping, which allows mapping the previously unnamed type to a given RDF type. This ensures separation if entities of distinct types share properties, e.g. music artists and producers both having the properties foaf:made and mo:biography, and it allows searching for other properties stating that their domain is the specified type in order to provide a richer API. The named type is defined as the intersection of the unnamed type for the provided signature (Sig) and all entities of the given RDF type (mo:MusicArtist), cf. (2).
namedType_{(Sig,\,mo:MusicArtist)} \equiv unnamedType_{Sig} \sqcap mo:MusicArtist    (2)
Type declaration in Practice: In the following, we give a brief overview of the new method of accessing RDF data in a programming environment:
(1) To use the library, the DLL must be referenced from the project. This allows opening
the library namespace and creating a store object by passing a URL to a SPARQL
endpoint, cf. Figure 2.
Fig. 2: Creating the type representing the store.
(2) All properties contained in the SPARQL endpoint can be accessed from the store
object (cf. Fig. 3).
Fig. 3: Defining types via property selection.
(3) Once a property has been chosen, the set of properties presented in the next step is restricted to those properties which have been observed in combination with the previously chosen properties, cf. Figure 4. Here foaf:made has already been chosen and the only co-occurring property, mo:biography, is presented for the next selection.
Fig. 4: Refining a type by adding additional property constraints.
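The restriction to co-occurring properties can be computed directly on the instance data. The following Python sketch (not the F# type provider itself) shows one way to ask a SPARQL endpoint for the properties that co-occur with an already chosen property such as foaf:made; the endpoint and the use of SPARQLWrapper are illustrative.

    from SPARQLWrapper import SPARQLWrapper, JSON

    def co_occurring_properties(endpoint, chosen_property):
        # Properties used on at least one subject that also carries the chosen property.
        sparql = SPARQLWrapper(endpoint)
        sparql.setQuery("""
            SELECT DISTINCT ?p WHERE {
              ?s <%s> ?o1 .
              ?s ?p ?o2 .
              FILTER(?p != <%s>)
            }""" % (chosen_property, chosen_property))
        sparql.setReturnFormat(JSON)
        results = sparql.query().convert()
        return [b["p"]["value"] for b in results["results"]["bindings"]]

    # With the music dataset of the running example this would suggest mo:biography
    # once foaf:made has been chosen.
    print(co_occurring_properties("http://example.org/sparql",  # placeholder endpoint
                                  "http://xmlns.com/foaf/0.1/made"))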
(4) Finally, the developer has to decide how he would like the new type to be mapped. As mentioned in the previous section, he can choose either an unnamed representation or a named one, cf. Figure 5.
(5) If the developer decides on the named representation, he has to choose the RDF type, cf. Figure 6. In this case, only mo:MusicArtist instances contain the specified properties and therefore he can only choose this one RDF type.
The presented method of relying on instance information instead of the schema has
a severe drawback when it comes to specifying return code types of properties in the
programming language. As one cannot use schematic information, the only option is
Fig. 5: Named and unnamed variants
Fig. 6: Defining the named code type based on RDF type mo:MusicArtist.
to probe the instance set. However, for big instance sets, this process takes too long to be useful. The current prototype avoids this problem by only probing whether a property returns another RDF resource or a literal and types the returned values accordingly.
3 Conclusion and Further Work
In this extended abstract, we presented a new way to work with RDF in a programming language: by defining types based on their properties. As an extension of LITEQ, this demo focuses on the new feature of property-based type definition, as described in [8, 7] and discussed in [3].
Acknowledgements This work has been supported by Microsoft.
References
1. V. Eisenberg and Y. Kanza. Ruby on semantic web. In Serge Abiteboul, Klemens Böhm,
Christoph Koch, and Kian-Lee Tan, editors, ICDE2011, pages 1324–1327. IEEE Computer
Society, 2011.
2. L. Hart and P. Emery. OWL Full and UML 2.0 Compared.
http://uk.builder.com/whitepapers/0and39026692and60093347p-39001028qand00.htm,
2004.
3. S. Homoceanu, P. Wille, and W. T. Balke. Proswip: Property-based data access for semantic
web interactive programming. In 12th International Semantic Web Conference, ISWC 2013,
Sydney, Australia, 2013.
4. A. Kalyanpur, D. J. Pastor, S. Battle, and J. A. Padget. Automatic Mapping of OWL Ontolo-
gies into Java. In SEKE2004, 2004.
5. M. Leinberger, S. Scheglmann, R. Lämmel, S. Staab, M. Thimm, and E. Viegas. Semantic
web application development with LITEQ. In International Semantic Web Conference, 2014.
6. T. Rahmani, D. Oberle, and M. Dahms. An adjustable transformation from owl to ecore. In
MoDELS2010, volume 6395 of LNCS, pages 243–257. Springer, 2010.
7. S. Scheglmann and G. Gröner. Property-based Typing for RDF Data. In PSW 2012, First
Workshop on Programming the Semantic Web, Boston, Massachusetts, November 11th, 2012,
2012.
8. S. Scheglmann, G. Gröner, S. Staab, and R. Lämmel. Incompleteness-aware programming
with rdf data. In Evelyne Viegas, Karin Breitman, and Judith Bishop, editors, DDFP, pages
11–14. ACM, 2013.
From Tale to Speech: Ontology-based Emotion and
Dialogue Annotation of Fairy Tales with a TTS Output
Christian Eisenreich 1, Jana Ott1, Tonio Süßdorf1, Christian Willms1,
Thierry Declerck2,1
1 Saarland University, Computational Linguistics Department, D-66041 Saarbrücken, Germany, (eisenr|janao|tonios|cwillms)@coli.uni-saarland.de
2 German Research Center for Artificial Intelligence (DFKI), Language Technology Lab, Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Germany, thierry.declerck@dfki.de
Abstract. In this demo and poster paper, we describe the concept and imple-
mentation of an ontology-based storyteller for fairy tales. Its main functions are
(i) annotating the tales by extracting timeline information, characters and dia-
logues with corresponding emotions expressed in the utterances, (ii) populating
an existing ontology for fairy tales with the previously extracted information
and (iii) using this ontology to generate a spoken version of the tales.
Common natural language processing technologies and resources, such as
part-of-speech tagging, chunking and semantic networks have been successfully
used for the implementation of the three tasks mentioned just above, including
the integration of an open source text-to-speech system. The code of the system
is publicly available.
Keywords: ontology, natural language processing, text-to-speech, semantic-
network, fairy tale, storytelling
1 Introduction
The idea of developing an ontology-based storyteller for fairy tales was based on the
consideration of two previous works in the field of narrative text processing. The first
work is described in (Scheidel & Declerck, 2010), which is about an augmented
Proppian1 fairy tale markup language, called Apftml, which we extended according to
the needs of our current work.
Our second starting point is described in (Declerck et al., 2012), which presents an
ontology-based system that is able to detect and recognize the characters (partici-
pants) playing a role in a folktale. Our system combines and extends the results of
1 From "Vladimir Yakovlevich Propp", who was "a Soviet folklorist and scholar who analyzed the basic plot components of Russian folk tales to identify their simplest irreducible narrative elements." (http://en.wikipedia.org/wiki/Vladimir_Propp)
those studies, adding the detection of dialogues and emotions in the tales and an on-
tology-driven Text-To-Speech (TTS) component that “reads” the tales, with individu-
al voices for every character, including also a voice for the narrator, and taking into
account the types of emotions detected during the textual processing of the tales.
To summarize: Our system first parses the input tale (in English or German) and
extracts as much relevant information as possible on the characters, including their emotions, and the events they are involved in. This provides us with an annotated
version of the tale that is used for populating the ontology. The system finally uses the
ontology and a robust and parameterizable TTS system to generate the speech output.
All the data of the system have been made available in a bitbucket repository
(https://bitbucket.org/ceisen/apftml2repo), including documentation and related in-
formation2.
2 Architecture of the System
First, we use the Python NLTK3 and the Pattern API4 to annotate the tale. We then use the Java OWL API5 to populate the ontology, and finally the Mary Text-To-Speech system6 is used to generate the speech output. Mary is an open-source, multilingual text-to-speech synthesis platform, which is robust, easy to configure and allows us to extend our storyteller to more languages. The general architecture of the system is displayed below in Fig. 1.
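As an illustration of the first (annotation) step, a minimal Python sketch using NLTK to tag and chunk a tale sentence and to spot direct speech; the chunk grammar and the quotation heuristic are simplified stand-ins for the actual pipeline, not the system's code:

import nltk

# One-time downloads: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

SENTENCE = 'The frog said: "Let me sit next to you and eat from your golden plate."'

# Part-of-speech tagging and simple noun-phrase chunking, as used to find character mentions.
tokens = nltk.word_tokenize(SENTENCE)
tagged = nltk.pos_tag(tokens)
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
tree = chunker.parse(tagged)
characters = [" ".join(word for word, tag in subtree.leaves())
              for subtree in tree.subtrees(lambda t: t.label() == "NP")]

# A crude dialogue heuristic: anything between double quotes is treated as an utterance.
utterance = SENTENCE.split('"')[1] if '"' in SENTENCE else None

print("candidate noun phrases:", characters)
print("utterance:", utterance)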
Fig. 1. The general architecture of the ontology-driven "Tale to Speech" system
2 An example of the audio data generated for the tale "The Frog Prince" is available at https://bytebucket.org/ceisen/apftml2repo/raw/763c5eb533f09997e757ec61652310c742238384/example%20output/audio_output.mp3.
3 Natural Language Toolkit: http://www.nltk.org/. See also (Bird et al., 2009).
4 See (De Smedt & Daelemans, 2012).
5 See (Horridge & Bechhofer, 2011).
6 http://mary.dfki.de/. See also (Schröder & Trouvain, 2003) or (Charfuelan & Steiner, 2013).
3 The Ontology Population
The ontology we use is an extension of the one presented in (Declerck et al., 2012), which basically describes family structures among human beings, together with a small list of extra-natural beings. The extended version of the ontology also includes temporal information (mainly for representing the mostly linear structure of the narrative) as well as dialogue structures, including the participants involved in the dialogues (sender(s) and receiver(s)). We give special attention to the narrator of the tale, since this "character" also provides relevant information about the status of the characters in the tales, including their emotional state. Dialogues are synchronized with the linear narrative structure. Detected emotions are also included in the populated ontology; for the time being they are attached to utterances, and in the future they will be attached directly to the characters. The Mary TTS system accesses all this information in order to parameterize the voices attached to each detected character.
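The system populates the ontology with the Java OWL API; purely as an illustration of the kind of assertions produced, here is a minimal sketch using Python and rdflib, with hypothetical class and property names standing in for the extended fairy-tale ontology:

from rdflib import Graph, Namespace, Literal, RDF

# Hypothetical namespace and vocabulary; the actual ontology extends (Declerck et al., 2012).
FT = Namespace("http://example.org/fairytale#")

g = Graph()
g.bind("ft", FT)

# A character, an utterance it produces, and the emotion detected for that utterance.
g.add((FT.FrogPrince, RDF.type, FT.Character))
g.add((FT.Utterance_12, RDF.type, FT.Utterance))
g.add((FT.Utterance_12, FT.hasSpeaker, FT.FrogPrince))
g.add((FT.Utterance_12, FT.hasText, Literal("Let me sit next to you.")))
g.add((FT.Utterance_12, FT.hasEmotion, FT.Joy))
g.add((FT.Utterance_12, FT.atNarrativePosition, Literal(12)))

print(g.serialize(format="turtle"))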
4 A Gold Standard
In order to support the evaluation of the automated annotation of fairy tales with our integrated set of tools, five fairy tales have been manually annotated7. The tales are "The Frog Prince", "The Town Musicians of Bremen", "Die Bremer Stadtmusikanten" (the German original version), "The Magic Swan Geese" and "Rumpelstiltskin".
The annotation examples show the different steps involved in the system: the text analysis, the temporal segmentation, the recognition of the characters and the dialogues they are involved in, and the emotions that are attached to the utterances and delivered during the speech output of the story in near real time.
5 Summary and Outlook
We have designed and implemented an ontology-based emotion and dialogue annotation system with speech output for the field of fairy tales. The system provides robust results for the tested fairy tales. While the annotation and ontology population processes work for both English and German texts, the TTS output is for the time being optimized for English.
Future work can deal with adding a graphical user interface, extending the parsing
process for annotating tales in other languages and populating the ontology with more
information, like the Proppian functions.
7 The manually annotated tales, together with the annotation schema, are available at https://bitbucket.org/ceisen/apftml2repo/src/763c5eb533f09997e757ec61652310c742238384/soproworkspace/SoPro13Java/gold/?at=master
6 References
1. Horridge, M. and Bechhofer, S. (2011). The OWL API: A Java API for OWL ontologies. IOS Press, volume 2, number 1, pp. 11–12.
2. Schröder, M. and Trouvain, J. (2003). The German text-to-speech synthesis system MARY: A tool for research, development and teaching. International Journal of Speech Technology, volume 6, number 4, pp. 365–377. Springer.
3. Charfuelan, M. and Steiner, I. (2013). Expressive speech synthesis in MARY TTS using audiobook data and EmotionML. Proceedings of Interspeech 2013. ISCA.
4. Bird, S., Klein, E. and Loper, E. (2009). Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media. (http://www.nltk.org/book/)
5. Ekman, P. (1999). Emotions. In T. Dalgleish and T. Power (Eds.), The Handbook of Cognition and Emotion, pp. 45–60. Sussex, UK: John Wiley & Sons, Ltd.
6. De Smedt, T. and Daelemans, W. (2012). Pattern for Python. The Journal of Machine Learning Research, volume 13, number 1, pp. 2063–2067.
7. Scheidel, A. and Declerck, T. (2010). Apftml – Augmented Proppian Fairy Tale Markup Language. First International AMICUS Workshop on Automated Motif Discovery in Cultural Heritage and Scientific Communication Texts. Szeged University, volume 10.
8. Declerck, T., Koleva, N. and Krieger, H.-U. (2012). Ontology-based incremental annotation of characters in folktales. Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pp. 30–34. Association for Computational Linguistics.
9. Propp, V.Y. (1928). Morphology of the Folktale. Leningrad; English translation: The Hague: Mouton, 1958; Austin: University of Texas Press, 1968.
10. Mani, I. (2012). Computational Modeling of Narrative. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.
BioTex: A system for Biomedical Terminology Extraction, Ranking, and Validation
Juan Antonio Lossio-Ventura1, Clement Jonquet1, Mathieu Roche1,2, and Maguelonne Teisseire1,2
1 University of Montpellier 2, LIRMM, CNRS - Montpellier, France
2 Irstea, CIRAD, TETIS - Montpellier, France
juan.lossio@lirmm.fr,jonquet@lirmm.fr,
mathieu.roche@cirad.fr,maguelonne.teisseire@teledetection.fr
Abstract. Term extraction is an essential task in domain knowledge acquisition.
Although hundreds of terminologies and ontologies exist in the biomedical do-
main, the language evolves faster than our ability to formalize and catalog it.
We may be interested in the terms and words explicitly used in our corpus in or-
der to index or mine this corpus or just to enrich currently available terminologies
and ontologies. Automatic term recognition and keyword extraction measures are
widely used in biomedical text mining applications. We present BioTex, a Web
application that implements state-of-the-art measures for automatic extraction of
biomedical terms from free text in English and French.
1 Introduction
Within a corpus, there are different kinds of information to represent, and different communities to express that information. Therefore, the terminology and vocabulary are often very corpus specific and not explicitly defined. For instance, in the medical world, terms employed by lay users on a forum will necessarily differ from the vocabulary used by doctors in electronic health records. We thus intend to offer users an opportunity to automatically extract biomedical terms and use them for any natural language processing, indexing, knowledge extraction, or annotation purpose. Extracted terms can also be used to enrich biomedical ontologies or terminologies by offering new terms or synonyms to attach to existing defined classes. Automatic Term Extraction (ATE) methods are designed to automatically extract relevant terms from a given corpus1. Relevant terms are useful to
gain further insight into the conceptual structure of a domain. In the biomedical domain,
there is a substantial difference between existing resources (ontologies) in English and
French. In English there are about 7 000 000 terms associated with about 6 000 000
concepts, such as those in UMLS or BioPortal [7], whereas in French there are only about 330 000 terms associated with about 160 000 concepts [6]. French ontologies therefore have to be populated, and a tool like BioTex will help with this task. Our project involves two main stages: (i) biomedical term extraction, and (ii) ontology population, in order to populate ontologies with the extracted terms.
1 We refer to ATE when the extracted terms are not previously defined in existing standard ontologies or terminologies. We refer to 'semantic annotation' when the extracted term can be attached or matched to an existing class (URI), such as in [8]. Both approaches are related to Named Entity Recognition (NER), which automatically extracts names of entities (disease, person, city).
In this paper, we present BioTex, an application that performs the first step. Given a text corpus, it extracts and ranks biomedical terms according to the selected state-of-the-art extraction measure. In addition, BioTex automatically validates terms that already exist in UMLS/MeSH-fr terminologies. We have presented different measures and performed comparative assessments in other publications [4, 5]. In this paper, we focus on the presentation of BioTex and the use cases it supports.
2 Related work and available extraction measures
Term extraction techniques can be divided into four broad categories: (i) Linguistic
approaches attempt to recover terms via linguistic patterns [3]. (ii) Statistical meth-
ods focus on external evidence through contextual information. Similar methods, called
Automatic Keyword Extraction (AKE), are geared towards extracting the most relevant
words or phrases in a document. These measures, such as Okapi BM25 and TF-IDF,
can be used to automatically extract biomedical terms, as we proposed in [4]. These two measures are included in BioTex. (iii) Machine Learning approaches are often designed for specific entity classes and thus integrate term extraction and term classification. (iv) Hybrid methods combine several methods (typically linguistic and statistical) for the term extraction task. This is the case of C-value [2], a very popular measure specialized in multi-word and nested term extraction.
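As a concrete illustration of C-value, a minimal sketch following the definition in [2]: frequent multi-word candidates are rewarded, while candidates that mostly occur nested inside longer candidates are discounted (the candidate terms and frequencies below are invented for the example):

import math
from collections import defaultdict

def c_value(freq):
    """Compute C-value for multi-word candidates given as {term tuple: frequency}."""
    # For every candidate, collect the longer candidates that contain it.
    nests = defaultdict(list)
    terms = list(freq)
    for a in terms:
        for b in terms:
            if len(b) > len(a) and any(b[i:i + len(a)] == a
                                       for i in range(len(b) - len(a) + 1)):
                nests[a].append(b)
    scores = {}
    for a in terms:
        length_factor = math.log2(len(a))          # multi-word terms only (len >= 2)
        if not nests[a]:                           # a never appears nested
            scores[a] = length_factor * freq[a]
        else:                                      # discount occurrences inside longer terms
            longer = nests[a]
            scores[a] = length_factor * (freq[a] -
                                         sum(freq[b] for b in longer) / len(longer))
    return scores

# Illustrative candidates with hypothetical frequencies (not from the paper's corpora).
candidates = {
    ("adenoid", "cystic", "basal", "cell", "carcinoma"): 5,
    ("cystic", "basal", "cell", "carcinoma"): 11,
    ("basal", "cell", "carcinoma"): 984,
}
for term, score in sorted(c_value(candidates).items(), key=lambda x: -x[1]):
    print(" ".join(term), round(score, 2))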
In [4], we proposed the new hybrid measures F-TFIDF-C and F-OCapi, which combine C-value with TF-IDF and Okapi respectively, to extract terms and obtain better results than C-value. In [5], we proposed the LIDF-value measure, based on linguistic and statistical information. We offer all of these measures within BioTex. Our measures were evaluated in terms of precision [4, 5] and obtained the best results over the top k extracted terms (P@k) on several corpora (LabTestOnline, GENIA, PubMed). For instance, on a GENIA corpus, LIDF-value achieved 82% for P@100, thus improving the C-value precision by 13%, and 66% for P@2000, with an improvement of 11%. BioTex allows users to assess the performance of the measures on different corpora.
A detailed study of related work revealed that most existing systems implement-
ing statistical methods are made to extract keywords and, to a lesser extent, to extract
terminology from a text corpus. Indeed, most systems take a single text document as input, not a set of documents (i.e., a corpus) over which the IDF can be computed. Most
systems are available only in English. Table 1 shows a quick comparison with TerMine
(C-value), the most commonly used application, and FlexiTerm, the most recent one.
Table 1. Brief comparison of biomedical terminology extraction applications.
                             BioTex               TerMine            FlexiTerm
Languages                    en/fr                en                 en
Type of Application          Desktop/Web          Web                Desktop
License                      Open                 Open               Open
Processing Capacity          No Limits / < 6 MB   < 2 MB             No Limits
Possibility to save results  XML - CSV
POS tool                     TreeTagger           Genia/TreeTagger   Stanford POS
# of Implemented Measures    8                    1                  1
TerMine: http://www.nactem.ac.uk/software/termine/
FlexiTerm: http://users.cs.cf.ac.uk/I.Spasic/flexiterm/
3 Implementation of BioTex
BioTex is an application for biomedical terminology extraction which offers several baselines and new measures to rank candidate terms for a given text corpus. BioTex can be used either as (i) a Web application taking a text file as input, or (ii) a Java library. When used as a Web application, it produces a file with a maximum of 1200 ranked candidate terms. Used as a Java library, it produces four files with the ranked candidate terms found in the corpus, containing, respectively, the unigram, bigram, 3-gram and 4+-gram terms. BioTex supports two main use cases:
(1) Term extraction and ranking measures: As illustrated by the Web application
interface, Figure 1 (1), BioTex users can customize the workflow by changing the
following parameters:
– Choose the corpus language (i.e., English or French), and the Part-of-Speech
(PoS) tagger to apply. Note that we tested three PoS tagger tools, but currently only TreeTagger is available within BioTex.
– Select a number of patterns to filter out the candidate terms (200 by default).
Those reference patterns (e.g., noun-noun, noun-prep-noun, etc.) were built
with terms taken from UMLS for English and MeSH-fr for French. They are
ranked by frequency.
– Select the type of terms to extract: all terms (i.e., single- and multi-word terms)
or multi-word terms only.
– Select the ranking measures to apply.
(2) Validation of candidate terms: After the extraction process, BioTex automatically validates the extracted terms by using UMLS (Eng) & MeSH-fr (Fr). As illustrated in Figure 1 (2), the validated terms are displayed in green, specifying the knowledge source used, and the others in red. Therefore, BioTex allows users to easily distinguish the classes annotating the original corpus (in green) from the terms that may also be considered relevant for their data but still need to be curated (in red). The latter may be considered candidates for ontology enrichment (a minimal sketch of this validation step follows below).
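A minimal sketch of the validation idea, assuming a plain-text list of reference terms (the file name and example terms are hypothetical; BioTex itself checks the candidates against UMLS and MeSH-fr):

def load_reference_terms(path):
    """Load a reference terminology (one term per line) into a normalized set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def validate(candidates, reference):
    """Split extracted candidates into validated (green) and to-be-curated (red)."""
    green = [t for t in candidates if t.lower() in reference]
    red = [t for t in candidates if t.lower() not in reference]
    return green, red

# Hypothetical inputs: extracted candidates and a MeSH-fr term list.
reference = load_reference_terms("mesh_fr_terms.txt")
green, red = validate(["insuffisance cardiaque", "patient fragile"], reference)
print("validated:", green)
print("to curate:", red)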
4 Conclusions and Future Work
In this article, we present the BioTex application for biomedical terminology extraction. It is available for online testing and evaluation but can also be used in any program as a Java library (PoS tagger not included). In contrast to other existing systems, BioTex allows users to analyze a French corpus, manually validate extracted terms and export the list of extracted terms. We hope that BioTex will be a valuable tool for the biomedical community. It is currently used in a couple of test-beds within the SIFR project (http://www.lirmm.fr/sifr). The application is available at http://tubo.lirmm.fr/biotex/ along with a video demonstration at http://www.youtube.com/watch?v=EBbkZj7HcL8. For our future validations, we
will enrich our validation dictionaries with BioPortal [7] terms for English and CISMeF
[9] terms for French. In the future, we will offer disambiguation features using the Web
to find the context in order to populate biomedical ontologies with the new extracted
terms (red terms), while looking into the possibility of extracting relations [1] between
new terms and already known terms.
Fig. 1. (1) Interface: term extraction. (2) Interface: term validation. Users can export the results
for offline processing.
Acknowledgments. This work was supported in part by the French National Research Agency
under JCJC program, grant ANR-12-JS02-01001, as well as by University of Montpellier 2,
CNRS, IBC of Montpellier project and the FINCyT program, Peru.
References
1. Abacha, A. B., Zweigenbaum, P.: Automatic extraction of semantic relations between medical
entities: a rule based approach. Journal of Biomedical Semantics, vol. 2 (2011)
2. Frantzi K., Ananiadou S., Mima, H.: Automatic recognition of multiword terms: the C-
value/NC-value Method. International Journal on Digital Libraries, vol. 3, pp. 115-130, (2000)
3. Gaizauskas, R., Demetriou, G., Humphreys, K.: Term recognition and classification in biological science journal articles. Proceedings of the Computational Terminology for Medical and Biological Applications Workshop of the 2nd International Conference on NLP, pp. 37-44 (2000)
4. Lossio-Ventura, J.A., Jonquet, C., Roche, M., Teisseire M.: Towards a Mixed Approach to
Extract Biomedical Terms from Text Corpus. International Journal of Knowledge Discovery
in Bioinformatics, IGI Global. vol. 4, pp. 1-15, Hershey, PA, USA (2014)
5. Lossio-Ventura, J.A., Jonquet, C., Roche, M., Teisseire M.: Yet another ranking function to
automatic multi-word term extraction. Proceedings of the 9th International Conference on
Natural Language Processing (PolTAL’14), Springer LNAI. Warsaw, Poland (2014)
6. Neveol, A., Grosjean, J., Darmoni, S., Zweigenbaum, P.: Language Resources for French in
the Biomedical Domain. 9th International Conference on Language Resources and Evaluation
(LREC’14). Reykjavik, Iceland (2014)
7. Noy, N. F., Shah, N. H., Whetzel, P. L., Dai, B., Dorf, M., Griffith, N., Jonquet, C., Rubin, D.
L., Storey, M., Chute, C.G., Musen, M. A.: BioPortal: ontologies and integrated data resources
at the click of a mouse. Nucleic acids research, vol. 37(suppl 2), pp 170–173. (2009)
8. Jonquet, C., Shah, N.H., Youn, C.H., Callendar, Chris, Storey, M-A, Musen, M.A.: NCBO
Annotator: Semantic Annotation of Biomedical Data. 8th International Semantic Web Con-
ference, Poster and Demonstration Session Washington DC, USA (2009)
9. Darmoni, S.J., Pereira, S., Sakji, S., Merabti, T., Prieur, E., Joubert, M., Thirion, B.: Multiple
Terminologies in a Health Portal: Automatic Indexing and Information Retrieval. 12th Con-
ference on Artificial Intelligence in Medicine, LNCS 5651, pp.255-259, Verona, Italy (2009)
Visualizing and Animating Large-scale
Spatiotemporal Data with ELBAR Explorer
Suvodeep Mazumdar1 and Tomi Kauppinen2,3
1 Department of Computer Science, University of Sheffield, 1 Portobello, S1 4DP, United Kingdom
s.mazumdar@sheffield.ac.uk
2 Cognitive Systems Group, University of Bremen, Germany
3 Department of Media Technology, Aalto University School of Science, Finland
tomi.kauppinen@uni-bremen.de
Abstract. Visual exploration of data enables users and analysts to observe interesting patterns that can trigger new research for further investigation. With the increasing availability of Linked Data, facilitating support for making sense of the data via visual exploration tools for hypothesis generation is critical. Time and space play important roles in this because of their ability to illustrate dynamicity in a spatial context. Yet, Linked Data visualization approaches typically have not made efficient use of time and space together, apart from rather static multi-visualization approaches and mashups. In this paper we demonstrate the ELBAR explorer, which visualizes a vast amount of scientific observational data about the Brazilian Amazon Rainforest. Our core contribution is a novel mechanism for animating between the different observed values, thus illustrating the observed changes themselves.
Keywords: Visual Analytics, Information Visualization, Linked Data
1 Introduction
Making sense of spatiotemporal data is a crucial step in providing insight for
critical actionable decisions. Linked Data is no exception in this. With the increased availability of potentially interesting information, the task is to support decision makers by illustrating significant patterns in the data. This way data becomes a narrative and can tell a story [3].
In this paper we demonstrate the combined use of spatial and temporal as-
pects of Linked Data and illustrate how very heterogeneous phenomena can
be illustrated over time and space. Our contribution is an explorer that takes
spatiotemporal Linked Data as an input via SPARQL queries and enables explo-
ration of variables via animations. To evaluate and illustrate the use of the tool
we make use of openly published data about the Brazilian Amazon Rainforest.
This data, including its economic, social and ecological dimensions, serves to show the potential of visualizing Linked Data by animations over time.
Section 2 explains both the ELBAR explorer and the spatiotemporal data
we used for evaluation. Section 3 discusses the use of ELBAR explorer with a
concrete scenario. Section 4 finishes the paper with concluding remarks and future work ideas.
2 Animating Large-scale Temporal Data with ELBAR
In this demonstration, ELBAR4 makes use of the openly available Linked Brazil-
ian Amazon Rainforest Data5 [2], which captures and shares environmental observations (like deforestation) together with information about social phenomena (like population size) and economic phenomena (like market prices of products). The data has been aggregated to 25 km x 25 km grid cells [1], extended with open governmental data and linked to DBpedia. ELBAR uses
the paradigm of visual animations to illustrate changes over time on maps. The
core idea is that employing such means to navigate multiple dimensions of data
supports analysts in generating hypotheses for further investigation.
Fig. 1. Explorer for Linked Brazilian Amazon Rainforest (ELBAR). The interface con-
tains four sections: A – Filters, B – Map, C – Info window, D – Graph
Fig. 1 presents a screenshot of the ELBAR explorer. The filters (Section A)
provide mechanisms for selecting variables (e.g. deforestation rates) for further
inspection. SPARQL queries are built from the users' interactions and sent to SPARQL endpoints. Results from the endpoint are processed and
converted to visual elements on maps (Section B) and graphs (Section D)6 . The
relevant observations (as retrieved from a triple store) are then visually encoded
and overlaid on a map, based on their spatial positions. The information window
(Section C) is then updated with further information regarding the filter being
selected. Clicking on visual elements of the graph (Section D) highlights the
individual sections on the map.
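As an illustration of this querying step, a minimal sketch of how such a request could be issued and its results prepared for visual encoding, using SPARQLWrapper; the endpoint URL and property URIs are placeholders, since the actual queries are generated by the ELBAR interface:

from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical endpoint and property URIs; ELBAR builds the real queries
# from the filters selected in the interface.
ENDPOINT = "http://example.org/amazon/sparql"

QUERY = """
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT ?cell ?value WHERE {
  ?obs a qb:Observation ;
       <http://example.org/property/ACUM_2008> ?value ;
       <http://example.org/property/cell> ?cell .
}
"""

def fetch_observations(endpoint, query):
    """Run the query and return {grid cell URI: observed value} for map encoding."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return {row["cell"]["value"]: float(row["value"]["value"])
            for row in results["results"]["bindings"]}

values = fetch_observations(ENDPOINT, QUERY)
print(len(values), "grid cells retrieved")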
4 A demo of ELBAR is presented at http://linkedscience.org/demos/elbar
5 http://linkedscience.org/data/linked-brazilian-amazon-rainforest/
6 Users are also presented with Preset Animations, which can be previously defined or extracted from the data.
3 Scenario for the Use of ELBAR
3.1 Visualizing the Phenomena
Assuming a user would like to understand a certain phenomenon over time (like deforestation), she/he will select a property (like the deforestation rate) to visualise it on the map. The explorer then presents the corresponding values of the property, color-coded and overlaid on the map, as well as a distribution of the values of the property as a graph. The graph (see the bottom right of Figure 1) is interactive and supports mouse events such as zoom (by clicking and dragging to select a section of the graph) or left clicks.
Fig. 2. User interactions on graphs are translated into highlighted visual elements on the map, helping to provide context
Zooming the graph provides a finer-grained view of the distribution to facilitate the selection of individual data points. Clicking on the graph selects the respective section on the map and highlights it. Such interactions support understanding of how different phenomena are distributed. They also potentially illustrate their spatial proximity and the patterns they form. The example illustrated in Figure 2 shows the distribution of a property (ACUM 2008 - the accumulated percentage of deforestation in the Brazilian Amazon Rainforest until 2008) on the graph (left) and the map (right). The graph is then zoomed to the highest few points and explored using mouse clicks. Clicking on individual data points indicates their spatial references by marking up the respective section on the map. Clicking on all the points with a high value of ACUM 2008 in section A (middle) highlights the respective section A on the map (right). However, sections B and C, with almost similarly high values, highlight sections in other areas.
3.2 Animating the Phenomena over Time
Comparing two properties selected in Section A, for example accumulated deforestation in 1997 (ACUM 1997) and accumulated deforestation in 2008 (ACUM 2008), is a helpful feature that provides animated transitions to assist users in observing change. The result is then parsed and the values are visually encoded into the relevant sections on the map, with a transition7 defined that
7 https://github.com/mbostock/d3/wiki/Transitions
Fig. 3. Generating hypotheses for the high deforestation in the North East Amazon. 1: accumulated deforestation in 1997, 2: accumulated deforestation in 2008, 3: distance from the nearest road, 4: distance from the nearest municipality.
iterates between the two visual encodings, thereby creating an animated effect. Since no other visual feature is altered in the user's field of view apart from the color encoding, the user can easily observe the evolution of deforestation. The analyst then attempts to understand why the North East Amazon contains the highest amount of deforestation. Different properties such as the distance from the nearest road (bottom left, Fig. 3) and the distance from the nearest municipality (bottom right, Fig. 3) indicate the clear segregation of the North Eastern region, which could lead an analyst to hypothesise that the remoteness of these areas helps illegal deforestation activities.
4 Conclusions
In this paper, we presented the ELBAR explorer, which employs generic visualization and animation techniques to support analysts and decision makers in exploring spatial and temporal data. We demonstrate the system and the architecture, along with a description of the data. This will be accompanied by a guided example of how such techniques can be used, and we also invite our demonstration attendees, both online and onsite, to build custom transitions.
Future work will include the development and evaluation of ELBAR with different kinds of spatiotemporal data. Moreover, we will investigate other novel mechanisms for exploring the spatiotemporal and topical aspects of Linked Data. We also see potential in the community expanding the background data: census data from individual municipalities and other authorities, for example, could support gaining insight into social, economic and ecological processes.
References
1. de Espindola, G.M.: Spatiotemporal trends of land use change in the Brazilian
Amazon. Ph.D. thesis, National Institute for Space Research (INPE), São José dos
Campos, Brazil (2012)
2. Kauppinen, T., de Espindola, G., Jones, J., Sanchez, A., Gräler, B., Bartoschek, T.:
Linked Brazilian Amazon Rainforest Data. Semantic Web Journal 5(2) (2014)
3. Segel, E., Heer, J.: Narrative visualization: Telling stories with data. Visualization
and Computer Graphics, IEEE Transactions on 16(6), 1139–1148 (2010)
A Demonstration of Linked Data Source Discovery and Integration*
Jason Slepicka, Chengye Yin, Pedro Szekely, and Craig A. Knoblock
University of Southern California
Information Sciences Institute and Department of Computer Science, USA
{knoblock,pszekely,slepicka}@isi.edu
{chengyey}@usc.edu
Abstract. The Linked Data cloud is an enormous repository of data,
but it is difficult for users to find relevant data and integrate it into their
datasets. Users can navigate datasets in the Linked Data cloud with
ontologies, but they lack detailed characterization of datasets’ contents.
We present an approach that leverages r2rml mappings to characterize
datasets. Our demonstration shows how users can easily create r2rml
mappings for their datasets and then use these mappings to find data
from the Linked Data cloud and integrate it into their datasets.
1 Introduction
The Linked Data cloud contains an enormous amount of data about many topics.
Consider museums, which often have detailed data about their artworks but may
only have sparse data about the artists who created them. Museums typically
have tombstone data about artists (name, birth/death years, and places) but
may lack biographies, influences, etc. Museums could use additional information
about their artists in the Linked Data cloud and integrate it with their own to
produce a richer, more complete dataset.
Our approach to this, built into our Karma data integration system [8],
uses r2rml mappings [7] to describe users’ datasets and datasets in the Linked
Data cloud. Today, datasets include, at best, a VoID description [1] with basic
metadata, such as the access method and vocabularies used. r2rml-style mappings could complement VoID with their schema-like nature by capturing the semantic structure of a dataset and characterizing its subjects and properties with statistics or set approximations like Bloom filters. With this information,
users can reason better about how a dataset might integrate with their own data.
r2rml was defined to specify mappings from relational DBs to RDF, but
recent work [2] has proposed extensions to handle data types like CSV, JSON,
XML and Web APIs. Consequently, it is reasonable to expect that more datasets
in the Linked Data cloud could be published with r2rml-style descriptions.
In this demonstration we show how museum users can use Karma to quickly
define an r2rml mapping of a dataset (our previous work), use r2rml mappings
* A video demonstration is available at http://youtu.be/sr-XDBKeNCY
from other datasets to find more information about artists in their dataset, and
then augment their dataset with that information.
2 Datasets
For our demonstration we will integrate a CSV file containing 197 artists with
Linked Data published by the Smithsonian American Art Museum (SAAM). In
previous work [8], we mapped the SAAM dataset, including over 40,000 artworks and 8,000 artists, to the CIDOC CRM ontology [3] using r2rml and made it accessible via a SPARQL endpoint, along with a repository for the r2rml map-
pings. The SAAM LOD here is a proxy for the Linked Data cloud to illustrate
the vision of a Linked Data cloud populated with r2rml models.
3 Demonstration
We will show how a user can interactively model an artist dataset, discover the
Smithsonian’s data for those artists, and then integrate the Smithsonian’s data.
Step 1: Modeling a New Source. The user begins by using Karma’s
existing capability to model the artists in the CSV file as crm:E21 Person in an
r2rml mapping shown in Figure 1. Karma can use this mapping to generate
RDF, and can also compare it to retrieve other mappings, discovering new related
sources that can be integrated with the artist dataset.
Step 2: Discovering Data Sources. The user then clicks on E21 Person1
in the r2rml mapping and selects Augment Data to discover new data to in-
tegrate into artist records. Karma retrieves r2rml mappings in its repository
that describe crm:E21 Person, and uses these mappings to generate a candidate
set of linked data sources to integrate, identifies meaningful object and data
properties, and presents them to the user as illustrated in Figure 2. To help the
users select properties to integrate, Karma uses Bloom filters to estimate the
number of artists that have each of the properties listed in Figure 2.
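A minimal sketch of how a Bloom filter shipped with a dataset description could support such estimates; the filter sizes, hashing scheme and URIs are illustrative and not Karma's actual implementation:

import hashlib

class BloomFilter:
    """A simple Bloom filter over strings (illustrative size and hashing)."""
    def __init__(self, size=8192, hashes=4):
        self.size, self.hashes, self.bits = size, hashes, bytearray(size)

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

# The remote dataset publishes, per property, a Bloom filter of the subjects
# (here: artist URIs) that have that property.
biography_filter = BloomFilter()
for uri in ["http://example.org/artist/12", "http://example.org/artist/98"]:
    biography_filter.add(uri)

# The local tool estimates how many of its own artists would gain a biography.
local_artists = ["http://example.org/artist/12", "http://example.org/artist/55"]
estimate = sum(1 for uri in local_artists if uri in biography_filter)
print(f"{estimate} of {len(local_artists)} artists likely have a biography")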
Fig. 1. A Karma user creates an r2rml mapping for a CSV file of a museum’s artists’
biographical records and clicks ’Augment Data’ to discover new data sources
Fig. 2. A Karma user selects CIDOC CRM object and data properties discovered from
other sources to augment crm:E21 Person
Step 3: Integrating Data Sources. The user selects the artist’s biography
(for completeness) and birth (for validation). Karma automatically constructs
SPARQL queries to retrieve the data, integrates it into the worksheet, and aug-
ments the r2rml mapping accordingly (Figure 3). To support the integrated
SPARQL queries, we generated owl:sameAs links between the artists in the CSV
file and the Smithsonian dataset using LIMES [5] (we plan to integrate LIMES
with Karma to enable users to perform all integration steps within Karma).
Fig. 3. A Karma user has integrated biographical data from the Smithsonian as new
columns in their dataset. The columns contain artists’ biographies and birth dates.
4 Related Work and Conclusions
We see similarities in our approach with those used in relational database inte-
gration and semantic service composition. ORCHESTRA [4] starts, like r2rml,
by aligning database tables to a schema graph. For integration, heuristics are
used to translate keyword searches over the graph into join paths using its Q
query system. However, these joins are not guaranteed to be semantically mean-
ingful, unlike the integration paths Karma finds using r2rml.
Platforms such as iServe [6] capture Linked Services and make them discover-
able and queryable by annotating them with their Minimal Service Model. How-
ever, the past work on service discovery and composition only uses a semantic
model of the inputs and outputs of the services. In contrast, Karma service de-
scriptions [9] also capture the relationship between the attributes, which allows
us to automatically discover semantically meaningful joins.
By building on Karma’s ability to quickly model many source types, we
demonstrate how a user can discover other linked data sources, select the desired
attributes from those sources, and then integrate the data from those sources
into their own dataset. Through this source discovery and integration, a user
can transparently compose and join other sources and services in a semantically
meaningful, interactive way that was not previously possible.
References
1. Alexander, K., Cyganiak, R., Hausenblas, M., and Zhao, J. Describing
linked datasets with the VoID vocabulary. W3C note, W3C, Mar. 2011.
2. Dimou, A., Sande, M. V., Colpaert, P., Mannens, E., and de Walle, R. V.
Extending R2RML to a source-independent mapping language for RDF. In Interna-
tional Semantic Web Conference (Posters and Demos) (2013), vol. 1035 of CEUR
Workshop Proceedings, CEUR-WS.org, pp. 237–240.
3. Doerr, M. The CIDOC conceptual reference module: An ontological approach to
semantic interoperability of metadata. AI Mag. 24, 3 (Sept. 2003), 75–92.
4. Ives, Z. G., Green, T. J., Karvounarakis, G., Taylor, N. E., Tannen, V.,
Talukdar, P. P., Jacob, M., and Pereira, F. The ORCHESTRA collaborative
data sharing system. ACM SIGMOD Record 37, 3 (2008), 26–32.
5. Ngomo, A.-C. N., and Auer, S. LIMES: a time-efficient approach for large-scale
link discovery on the web of data. In Proceedings of the Twenty-Second international
joint conference on Artificial Intelligence (2011), AAAI Press, pp. 2312–2317.
6. Pedrinaci, C., Liu, D., Maleshkova, M., Lambert, D., Kopecky, J., and
Domingue, J. iServe: a linked services publishing platform. In CEUR workshop
proceedings (2010), vol. 596.
7. Sundara, S., Cyganiak, R., and Das, S. R2RML: RDB to RDF mapping lan-
guage. W3C recommendation, W3C, Sept. 2012.
8. Szekely, P., Knoblock, C. A., Yang, F., Zhu, X., Fink, E., Allen, R., and
Goodlander, G. Connecting the Smithsonian American Art Museum to the
Linked Data Cloud. In Proceedings of the 10th ESWC (2013).
9. Taheriyan, M., Knoblock, C. A., Szekely, P., and Ambite, J. L. Semi-
automatically modeling web APIs to create linked APIs. In Proceedings of the
ESWC 2012 Workshop on Linked APIs (2012).
Developing Mobile Linked Data Applications
Oshani Seneviratne1 , Evan W. Patton2,1 , Daniela Miao1 , Fuming Shih1 , Weihua
Li1 , Lalana Kagal1 , and Carlos Castillo3
1 Massachusetts Institute of Technology
{oshanis,ewpatton,dmiao,fuming,wli17,lkagal}@csail.mit.edu
2 Rensselaer Polytechnic Institute, pattoe@rpi.edu
3 Qatar Computing Research Institute, chato@acm.org
Abstract. With the rapid advancement of mobile technologies, users are
generating and consuming a significant amount of data on their handheld
devices. However, the lack of Linked Data tools for these devices has left
much of the data unstructured and difficult to reuse and integrate with
other datasets. We will demonstrate an application development framework
that enables the easy development of mobile apps that generate and con-
sume Linked Data. We also provide a set of easy-to-deploy Web Services
to supplement functionality for mobile apps focused on crowdsourcing. We
motivate our work by describing a real-world application of this framework,
which is a disaster relief application that streams crowd-sourced reports in
real time.
1 Introduction
Many developers are shifting their attention to the mobile world as smartphones
are becoming the information hub for people’s daily lives [1]. The pervasiveness of
smartphones has led to the ubiquitous consumption and generation of data on them.
Smartphones can derive contextual information from their environment, enabling
applications that provide great value both to individual users and to society. For ex-
ample, context-aware applications can recommend products, services, or connections
to others based on people’s surroundings. People can also use social applications to
report their status and the situation around them in cases of emergencies, such as
floods or earthquakes.
Most mobile applications create or consume data stored in standalone databases
without the potential of being “interlinked” with data from other applications or
organizations. The Web community has advocated the use of Linked Data technolo-
gies to address this data interoperability issue in Web-based applications. Although
we have seen some success from content publishers in using or publishing Linked
Data, few examples exist for mobile platforms [2].
Our goal is to bring support for Linked Data to mobile platforms by allowing
developers to build mobile applications that consume and publish Linked Data au-
tomatically. We will demonstrate a framework that allows developers to accomplish
this on Android devices through modular components, as well as cloud scripts aimed
at enabling quick deployment of companion web services to interact with streams
of Linked Data. We believe that this framework will reduce the burdens of mobile
developers to work with Linked Data and open up many opportunities for building
applications that interact with and contribute to the Linked Data world.
Fig. 1: Consuming Linked Data. Left: Logic used for obtaining the user’s loca-
tion, constructing a SPARQL query, querying a remote endpoint, and displaying
the retrieved results on a map. Right: Output as seen on the phone. ‘QueryButton’
in the blocks editor is the name of the button with the label ‘Get Landmarks near
Current Location’ as seen on the phone.
2 Framework
Our framework1 extends MIT App Inventor,2 an open source, web-based environ-
ment for developing Android applications. Though primarily designed for pedagog-
ical use, App Inventor has found great success in many domains and boasts about
87,000 users weekly, 1.9 million total users, and over 4.9 million applications ranging
from tracking rainfall in Haiti3 to teaching U.S. Marines how to destroy weapons.4
App Inventor is structured around components: discrete blocks that provide func-
tionality and user interface elements that developers connect together to create an
application. App Inventor provides a designer view for laying out an application’s
user interface and a blocks view where users define their program logic by connecting
method blocks, associated with components, together (as shown in Fig. 1).
We will demonstrate several components that we have developed, powered by Apache Jena5, including forms to generate Linked Data, maps for visualizing geographic
data, web services for cloud messaging, web services for exploring Linked Data,
and sensor components for context awareness. We will also demonstrate companion
web services that provide cloud functionality for mobile phones such as providing
messaging services to integrate streaming RDF content.
3 Use Case: WeReport
WeReport allows people to report (e.g. through a photo) the situation on the ground
during an emergency, hence enhancing situational awareness with respect to the
emergency [3]. The application demonstrates the capacity to generate and consume
Linked Data, as well as to integrate with different public datasets.
Consider a scenario where a hurricane has hit a city, and Joe, a volunteer citizen
reporter, notices a series of potential hazards in his neighborhood, e.g. fallen trees
blocking an intersection. With WeReport, Joe can take a picture of the hazard and
upload it, along with a tag and description, to warn others in the area. An example
of this report can be seen in Figure 2a.
1 http://punya.appinventor.mit.edu
2 http://appinventor.mit.edu/
3 http://www.webcitation.org/6PUXWVG0U
4 http://www.webcitation.org/6PUXZb7FM
5 https://jena.apache.org
(a) Submitting Reports (b) Browsing Reports
Fig. 2: WeReport application for Mobile Devices: an application that allows users to
submit disaster reports to the cloud. Users can subscribe to reports close to a certain
location on a topic such as ‘Fallen Trees’, and results are streamed in real-time to
subscribers via push notifications.
When creating a report, WeReport consumes Linked Data via pre-defined SPARQL
queries that search for popular landmarks near Joe’s location, so that those places
can be tagged if hazardous. At the same time, the mobile application enables him to
transparently produce structured data through our Linked Data-enhanced compo-
nents. The generated data can be published to SPARQL 1.1 compliant endpoints,
which enables data re-use by others.
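A minimal sketch of the kind of pre-defined landmark query such a component might issue, here written against the public DBpedia endpoint with a simple bounding-box filter; the endpoint choice, radius and selected properties are illustrative, as the actual queries ship with the application:

from SPARQLWrapper import SPARQLWrapper, JSON

def landmarks_near(lat, lon, delta=0.05):
    """Return labelled places within a small bounding box around (lat, lon)."""
    query = f"""
    PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?place ?label WHERE {{
      ?place geo:lat ?lat ; geo:long ?long ; rdfs:label ?label .
      FILTER (?lat  > {lat - delta} && ?lat  < {lat + delta})
      FILTER (?long > {lon - delta} && ?long < {lon + delta})
      FILTER (lang(?label) = "en")
    }} LIMIT 20
    """
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return [(r["place"]["value"], r["label"]["value"]) for r in rows]

# Example: landmarks around MIT's campus.
for uri, label in landmarks_near(42.3601, -71.0942):
    print(label, uri)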
WeReport also supports real-time streaming of user-submitted disaster reports.
In this scenario, Bob and Anna are both relief workers heading towards the affected
area. Using WeReport, they have subscribed to the hurricane disaster feed and they
continuously receive push notifications as reports on the disaster area are submitted
by users like Joe. While Bob wishes to be continuously notified on every newly
submitted report, Anna chooses to receive an alert only when there are at least
3 reports of a certain type in a given location, a custom limit she had previously
specified. Whenever Bob and Anna want to view the newest reports, WeReport
provides a “Browse Reports” interface which automatically displays the received
reports, as shown in Figure 2b.
4 Related Work
There has been some work on Linked Data use for smartphones; however, it does not provide a comprehensive solution to mobile app development as our framework does. The authors of [4] propose a lightweight general-purpose RDF framework for Android to share application data inside the phone. Our framework focuses on obtaining and consuming external RDF sources rather than on application data integration using RDF. Spinel [5] is a plug-in architecture and a set of web configuration tools for
Android that enable new datasets to be added to mobile phones without program-
ming. We provide both a plug-in architecture and support for adding new datasets
using program blocks that can be configured easily. Another notable platform for
creating mobile Linked Data applications is the “RDF on the go”, an RDF storage
and query processor library for mobile devices [6]. Our framework not only allows
storage and querying of RDF data, but more importantly makes it much more user
friendly to manipulate the data, be it contextual data from the phone or data from
a remote endpoint.
Unlike applications such as Cinemappy [7] and DBPedia Mobile [8], which re-
quire developers to have extensive knowledge of both Linked Data and mobile pro-
gramming, our target audience is Linked Data developers who want to develop
mobile applications but do not have mobile programming experience.
5 Summary
Though mobile devices have become the primary computing and communication
platform for users, the number of Linked Data apps available for these devices is
insignificant because developing them is currently extremely time consuming.
In this demonstration, we show a mobile application platform that enables mo-
bile Linked Data applications to be quickly and easily developed. Along with leading
to applications that are able to leverage structured data, these mobile applications
will help get around the classic “chicken-or-egg” situation by being both genera-
tors and consumers of Linked Data. The current implementation contains a few
limitations including scalability, which our team will be working on. We will also
be conducting further user studies to evaluate the usability and usefulness of the
platform.
References
1. Tillmann, N., Moskal, M., de Halleux, J., Fahndrich, M., Bishop, J., Samuel, A., Xie, T.:
The future of teaching programming is on mobile devices. In: Proceedings of the 17th
ACM annual conference on Innovation and technology in computer science education,
ACM (2012) 156–161
2. Ermilov, T., Khalili, A., Auer, S.: Ubiquitous Semantic Applications: A Systematic
Literature Review. International Journal on Semantic Web and Information Systems
10(1) (2014)
3. Vieweg, S.: Situational Awareness in Mass Emergency: A Behavioral and Linguis-
tic Analysis of Microblogged Communications. PhD thesis, University of Colorado at
Boulder (2012)
4. David, J., Euzenat, J., Rosoiu, M., et al.: Linked data from your pocket. In: Proc. 1st
ESWC workshop on downscaling the semantic web. (2012) 6–13
5. Chang, K.S.P., Myers, B.A., Cahill, G.M., Simanta, S., Morris, E., Lewis, G.: A plug-in
architecture for connecting to new data sources on mobile devices. In: Visual Languages
and Human-Centric Computing (VL/HCC), 2013 IEEE Symposium on, IEEE (2013)
51–58
6. Le Phuoc, D., Parreira, J.X., Reynolds, V., Hauswirth, M.: RDF On the Go: RDF
Storage and Query Processor for Mobile Devices. In: ISWC Posters&Demos. (2010) 12
7. Ostuni, V., Gentile, G., Noia, T., Mirizzi, R., Romito, D., Sciascio, E.: Mobile movie
recommendations with linked data. In Cuzzocrea, A., Kittl, C., Simos, D., Weippl, E.,
Xu, L., eds.: Availability, Reliability, and Security in Information Systems and HCI.
Volume 8127 of Lecture Notes in Computer Science. Springer Berlin Heidelberg (2013)
400–415
8. Passant, A.: Measuring Semantic Distance on Linking Data and Using it for Resources
Recommendations. In: AAAI Spring Symposium: Linked Data Meets Artificial Intelli-
gence. (2010)
A Visual Summary for Linked Open Data sources
Fabio Benedetti, Sonia Bergamaschi, Laura Po
Università di Modena e Reggio Emilia - Dipartimento di Ingegneria "Enzo Ferrari" - Italy
firstname.lastname@unimore.it
Abstract. In this paper we propose LODeX, a tool that produces a representative summary of a Linked Open Data (LOD) source starting from scratch, thus supporting users in exploring and understanding the contents of a dataset. The tool takes as input the URL of a SPARQL endpoint and launches a set of predefined SPARQL queries; from the results of the queries it generates a visual
summary of the source. The summary reports statistical and structural informa-
tion of the LOD dataset and it can be browsed to focus on particular classes or
to explore their properties and their use. LODeX was tested on the 137 public
SPARQL endpoints contained in Data Hub (formerly CKAN)1 , one of the main
Open Data catalogues. The statistical and structural information extraction was
successfully performed on 107 sources, among these the most significant ones
are included in the online version of the tool2 .
1 Introduction
The RDF Data Model plays a key role in the birth and continuous expansion of the Web of Data, since it allows structured and semi-structured data to be represented. However, while the LOD cloud is still growing, we observe a lack of tools able to produce a meaningful, high-level representation of these datasets.
Quite a lot of portals catalog datasets that are available as LOD on the Web and
permit users to perform keyword search over their list of sources. Nevertheless, when a user starts exploring an unknown LOD dataset in detail, several issues arise: (1) the
difficulty in finding documentation and, in particular, a high level description of classes
and properties of the dataset; (2) the complexity of understanding the schema of the
source, since there are no fixed modeling rules; (3) the effort to explore a source with a
high number of instances; (4) the impossibility, for non skilled users, to write specific
SPARQL queries in order to explore the content of the dataset.
To overcome the above problems, we devise LODeX, a tool able to automatically provide a high-level summarization of a LOD dataset, including its inferred schema. It is composed of several algorithms that discern between intensional and extensional knowledge. Moreover, it handles the problem of long-running queries, which are subject to timeout failures, by generating a pool of low-complexity queries able to return the same information.
This work has been accomplished in the framework of a PhD program organized by the Global
Grant Spinner 2013, and funded by the European Social Fund and the Emilia Romagna Region.
1 http://datahub.io
2 http://dbgroup.unimo.it/lodex
As presented in [3], the majority of the tools for data visualization are not able to provide a synthetic view of the data (instances) contained in a single source. Payola3 [4] and LOD Visualization4 [2] are two recent tools that exploit analysis functionalities for guiding the process of visualization. However, these tools always need some querying parameters to start the analysis of a LOD dataset. Conversely, LODeX neither requires any a priori knowledge of the dataset, nor asks users to set any parameters; it focuses
any a priori knowledge of the dataset, nor asks users to set any parameters; it focuses
on extracting the schema from a LOD endpoint and producing a summarized view of
the concepts contained in the dataset.
The paper is structured as follows. Section 2 describes the architecture of LODeX,
while a use case and demonstration scenario is described in Section 3. Conclusions and
some ideas for future work are described in Section 4.
2 LODeX - Overview
LODeX aims to be totally automatic in the production of the schema summary.
Figure 1 depicts the architecture of LODeX. The tool is composed of three main
processes: Index Extraction, Post-processing and Visualization. The goal of the first
two steps is to automatically extract from a SPARQL endpoint the information needed
to produce its schema summary, while the third step aims to produce a navigable view of the schema summary for the users. For easy reuse, all the contents extracted and pro-
cessed by LODeX are stored in a NoSQL document database, since it allows a flexible
representation of the indexes.
Fig. 1. LODeX Architecture
The Index Extraction (IE) takes as input the URL of a SPARQL endpoint and
generates the queries needed to extract structural and statistical information about the
source. Further details about the IE process can be found in [1]. The IE component has
been designed in order to maximize the compatibility with LOD sources and minimize
the costs in terms of time and computational complexity. The intensional and exten-
sional knowledge are extracted and collected in a set of statistical indexes, stored in the
NoSQL Database.
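As an illustration of the kind of low-complexity query issued during Index Extraction, a minimal sketch that counts the instances per class of an endpoint (the exact query pool used by LODeX is described in [1]; the endpoint below is just an example):

from SPARQLWrapper import SPARQLWrapper, JSON

CLASS_COUNT_QUERY = """
SELECT ?class (COUNT(?s) AS ?instances) WHERE {
  ?s a ?class .
} GROUP BY ?class ORDER BY DESC(?instances) LIMIT 100
"""

def class_statistics(endpoint_url):
    """Return a list of (class URI, instance count) pairs for the endpoint."""
    sparql = SPARQLWrapper(endpoint_url)
    sparql.setQuery(CLASS_COUNT_QUERY)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    return [(b["class"]["value"], int(b["instances"]["value"])) for b in bindings]

# Example endpoint (any public SPARQL endpoint can be used).
for cls, count in class_statistics("http://dbpedia.org/sparql")[:10]:
    print(count, cls)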
The Post-processing (PP) combines the information contained in the statistical in-
dexes to produce the schema summary of a specific dataset. The summary is induced
3
http://live.payola.cz/
4
http://lodvisualization.appspot.com/
174
from the distribution of the instances in the dataset. The PP also collects synthetic in-
formation regarding the endpoint. The schema summary is also stored in the NoSQL database.
The Visualization of the schema summary is performed through a web application written in Python that uses the NoSQL database as a backend. We used Data Driven Docu-
ments5 to create a visual representation of the dataset with which the user can interact
to navigate the schema and discover the information that he/she is looking for.
The tool has been tested on the entire set of sources described in SPARQL Endpoint Status (SPARQLES)6, a specialized application that regularly monitors the availability of the public SPARQL endpoints contained in DataHub. At the time of our evaluation (May 2014), SPARQLES indicated that 52% of the SPARQL endpoints (244/469) were available and that only 13% of the endpoints provided documentation, i.e. VoID and/or Service descriptions. LODeX was able to complete the extraction phase, thus building the visual summaries, for 107 LOD sources (78% of the 137 datasets that were compliant with the necessary SPARQL operators), which are now collected and shown in the online demo.
Fig. 2. Visual summary of the Linked Clean Energy Data source and a detail of the "Sector" property.
3 Use Case and Demonstration Scenario
We refer to a hypothetical use case involving a company in the clean energy sector.
The company has its own products and services and attempts to discover new informa-
tion on renewable energy and energy efficiency in the country where it is located. While
5
http://d3js.org/
6
http://sparqles.okfn.org/
175
searching the key datasets in the energy field, the company will likely find the Linked
Clean Energy Data dataset7 . This dataset, composed of 60140 triples, is described as a
“Comprehensive set of linked clean energy data including: policy and regulatory coun-
try profiles, key stakeholders, project outcome documents and a thesaurus on renewable,
energy efficiency and climate change for public re-use”.
By using our application to explore this dataset (see Figure 2)8, the user can, at a glance, get an overview of all the instantiated classes (the nodes in the graph) and the connections among them (the arcs), as well as the number of instances defined for each class (reflected in the size of the node). Focusing on the color of the nodes in the graph, a user can understand which classes are defined by the provider of the source and which others are taken from external vocabularies (in this case we can see that some of the class definitions are acquired from FOAF, Geonames.org and SKOS). By positioning the mouse on a node, more information about the class is shown (as depicted in Figure 2 on the left). Since classes are linked to each other by properties, it is possible to explore the property details. Thus, by clicking on a property another visual representation of the intensional knowledge is shown (see the right part of Figure 2).
4 Conclusions and Future Work
This paper has shown how LODeX is able to provide a visual and navigable summary
of a LOD dataset including its inferred schema starting from the URL of a SPARQL
Endpoint. The result gained by LODeX could also be useful to enrich LOD sources’
documentation, since the schema summary can be easily translated with respect to a
vocabulary and inserted into the LOD source. LODex is currently limited to display the
contents of a source proposing a graph. However, new developments are being imple-
mented in order to facilitate the query definition by exploiting the visual summary.
References
1. F. Benedetti, S. Bergamaschi, and L. Po. Online index extraction from linked open data
sources. To appear in Linked Data for Information Extraction (LD4IE) Workshop held at
International Semantic Web Conference, 2014.
2. J. M. Brunetti, S. Auer, and R. García. The linked data visualization model. In International
Semantic Web Conference (Posters & Demos), 2012.
3. A.-S. Dadzie and M. Rowe. Approaches to visualising linked data: A survey. Semantic Web,
2(2):89–124, 2011.
4. J. Klímek, J. Helmich, and M. Nečaskỳ. Payola: Collaborative linked data analysis and vi-
sualization framework. In The Semantic Web: ESWC 2013 Satellite Events, pages 147–151.
Springer, 2013.
7 http://data.reegle.info/
8 The visual summary of this source is available at http://dbgroup.unimo.it/lodex/157
EasyESA: A Low-effort Infrastructure for Explicit Semantic Analysis
Danilo Carvalho1,2, Çağatay Çallı3, André Freitas1, Edward Curry1
1 Insight Centre for Data Analytics, National University of Ireland, Galway
2 PESC/COPPE, Federal University of Rio de Janeiro (UFRJ)
3 Department of Computer Engineering, METU, Ankara
1 Introduction
Distributional semantic models (DSMs) are semantic models which are based
on the statistical analysis of co-occurrences of words in large corpora. DSMs
can be used in a wide spectrum of semantic applications including semantic
search, question answering, paraphrase detection, word sense disambiguation,
among others. The ability to automatically harvest meaning from unstructured
heterogeneous data, its simplicity of use and the ability to build comprehensive
semantic models are major strengths of distributional approaches.
The construction of distributional models, however, is dependent on process-
ing large-scale corpora. The English version of Wikipedia 2014, for example,
contains 44 GB of article data. The hardware and software infrastructure re-
quirements necessary to process large-scale corpora bring high entry barriers for
researchers and developers to start experimenting with distributional semantics.
In order to reduce these barriers we developed EasyESA, a high-performance and
easy-to-deploy distributional semantics framework and service which provides an
Explicit Semantic Analysis (ESA) [4] infrastructure.
2 Explicit Semantic Analysis (ESA)
DSMs are represented as a vector space model, where each dimension represents a
linguistic context C in which the target word occurs in a reference corpus. A context
can be defined using documents, co-occurrence windows (a number of neighbouring
words or data elements) or syntactic features. The distributional interpretation of a
target word is a weighted vector of the contexts in which the word occurs, defining a
geometric interpretation within a distributional vector space. The weights associated
with the vectors are defined by a weighting scheme W, which can recalibrate the
relevance of more generic or more discriminative contexts. A semantic relatedness
measure S between two words can be calculated using different similarity/distance
measures such as the cosine similarity or the Euclidean distance.
In the Explicit Semantic Analysis DSM [4], Wikipedia is used as a reference
corpus and the contexts are defined by each Wikipedia article. The weighting
scheme is defined by TF/IDF (term frequency/inverse document frequency) and
the similarity measure by the cosine similarity. The interpretation vector of a
term in ESA is a weighted vector of Wikipedia articles, which the ESA model
calls a concept vector.
A keyword query over the ESA semantic space returns a list of ranked
article titles, which defines the concept vector associated with the query terms
(where each vector component receives a relevance score). The approach supports
the interpretation of small text fragments, where the final context vector is the
centroid of the words’ concept vectors. The ESA semantic relatedness measure
between two terms is calculated by computing the cosine similarity between the
concept vectors representing the interpretation of the two terms.
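To make the construction above concrete, the following Python sketch (our own illustration, not part of EasyESA) mimics ESA on a toy corpus: the three "articles", their texts and the scikit-learn-based implementation are all assumptions. Each dimension of a concept vector is the TF-IDF weight of the term in one article, and relatedness is the cosine similarity between two such vectors.

```python
# A minimal, illustrative ESA-style sketch (not the EasyESA implementation):
# concept vectors are TF-IDF weights of a term across a toy "article" corpus,
# and relatedness is the cosine similarity between such vectors.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

articles = {
    "Marriage": "a wife or spouse is a partner in a marriage",
    "Family": "family members include the spouse and children",
    "Car": "a car is a wheeled motor vehicle used for transportation",
}  # stand-in for Wikipedia articles

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(articles.values())  # rows: articles, cols: terms
vocab = vectorizer.vocabulary_

def concept_vector(term: str) -> np.ndarray:
    """Weighted vector of 'articles' (contexts) in which the term occurs."""
    col = vocab.get(term.lower())
    if col is None:
        return np.zeros(tfidf.shape[0])
    return tfidf[:, col].toarray().ravel()

def relatedness(t1: str, t2: str) -> float:
    """Cosine similarity between the two concept vectors, in [0, 1]."""
    v1, v2 = concept_vector(t1), concept_vector(t2)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(v1 @ v2 / denom) if denom else 0.0

print(relatedness("wife", "spouse"))   # relatively high: shared context
print(relatedness("wife", "car"))      # zero: no shared context
```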
3 EasyESA
EasyESA consists of an open source platform that can be used as a remote
service or can be deployed locally. The API consists of three services:
Semantic relatedness measure: Calculates the semantic relatedness measure
between two terms. The semantic relatedness measure is a real number in the
[0,1] interval, representing the degree of semantic proximity between two terms.
Semantic relatedness measures are comparative measures and are useful when
sets of terms are compared in relation to their semantic proximity. Semantic
relatedness can be used for semantic matching in the context of the development
of semantic systems such as question answering, text entailment, event matching
and semantic search.
– Example: Request for the semantic relatedness measure between the words wife
and spouse.
– Service URL: http://vmdeb20.deri.ie:8890/esaservice?task=esa&term1=wife&
term2=spouse
Concept vector: Given a term, it returns the associated concept vector: a
weighted vector of contexts (Wikipedia articles). The term can contain multiple
words. The concept vectors can be used to build semantic indexes, which can be
applied for semantic applications which depends on high performance semantic
matching. An example of a semantic index built using ESA concept vectors is
available in [1].
– Example: Request for the concept vector of the word wife with maximum dimen-
sionality of 50.
– Service URL: http://vmdeb20.deri.ie:8890/esaservice?task=vector&source=wife&
limit=50
Query explanation: Given two terms, returns the overlap between the concept
vectors.
– Example: Request for the concept vector overlap between the words wife and spouse
for concept vectors with 100 dimensions.
– Service URL: http://vmdeb20.deri.ie:8890/esaservice?task=explain&term1=wife
&term2=spouse&limit=100
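The snippet below sketches how these services could be called programmatically, using the parameters shown above (task, term1/term2, source, limit). The host is the one given in this paper and may no longer be online; the assumed response formats (a plain number for relatedness, JSON for the concept vector) are our guess and should be checked against a running deployment.

```python
# Sketch of calling the EasyESA REST services with the parameters shown above
# (task, term1/term2, source, limit). The response parsing below is an
# assumption; point BASE at a local deployment if the host is unreachable.
import requests

BASE = "http://vmdeb20.deri.ie:8890/esaservice"

def esa_relatedness(term1: str, term2: str) -> float:
    r = requests.get(BASE, params={"task": "esa", "term1": term1, "term2": term2})
    r.raise_for_status()
    return float(r.text)            # assumed: a value in [0, 1]

def esa_concept_vector(source: str, limit: int = 50):
    r = requests.get(BASE, params={"task": "vector", "source": source, "limit": limit})
    r.raise_for_status()
    return r.json()                 # assumed: ranked (article, weight) components

print(esa_relatedness("wife", "spouse"))
print(esa_concept_vector("wife", limit=50))
```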
Mean request times are 0.055 ms for the semantic relatedness measure and
0.080 ms for the concept vector (500 dimensions) on an Intel Core i7 Quad Core
3770 3.40 GHz machine with 32 GB DDR3 RAM.
EasyESA was developed using Wikiprep-ESA1 as a basis. The software is
available as an open source tool at http://easy-esa.org. The work targeted the
following contributions: (i) major performance improvements (fundamental for
applying distributional semantics in real applications that depend on hundreds
of requests per second); (ii) robust concurrent queries; (iii) a RESTful service API;
(iv) deployment of an online service infrastructure; (v) packaging and pre-processed
files for easy deployment of a local ESA infrastructure. A detailed description of
the improvements can be found at http://easy-esa.org/improvements.
4 Demonstrations
Two demonstration applications were built using EasyESA, aiming to show
the low effort involved in using distributional semantics for different semantic
tasks. In the demonstration, Wikipedia 2013 was used as
a reference corpus. Videos of the running applications are available at: http:
//treo.deri.ie/iswc2014demo
Semantic Search: The first demonstration consists of using EasyESA to simulate
a semantic search application. In this scenario users can enter a set of terms
representing the searchable items (for example film genres). Each term associated
with a genre has a distributional conceptual representation, i.e. it is represented
by a concept vector. Users can then enter a search term which has no lexical
similarity to the indexed terms. The demonstration computes the semantic
relatedness for each vector, ranking the results by their degree of semantic
relatedness. In the example, the search query 'winston churchill' returns the film
genres most likely associated with the query. The genres 'war', 'documentary',
and 'historical' were the top related terms. Figure 1 shows a screenshot of the
interface of the example. The demonstration application can be accessed at:
http://vmdeb20.deri.ie/esa-demo/semsearch.html.
Word Sense Disambiguation (WSD): In the second demonstration, EasyESA
is used to perform a word sense disambiguation (WSD) task. The user enters
a sentence and then selects a word for which the correct WordNet sense should
be determined. The WSD application takes the sentence context (the words
surrounding the target word), computes its associated context vector, and
computes the semantic relatedness to the context vector of the WordNet gloss
of each available word sense. The different senses are then ranked by their
semantic relatedness values. In the examples there is no lexical overlap between
the sentence context and the different WordNet glosses, with the distributional
knowledge from Wikipedia filling the semantic gap between the context and the
glosses. The demonstration application can be accessed at:
http://vmdeb20.deri.ie/esa-demo/sensedisambig.html.
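A minimal sketch of this gloss-ranking strategy is shown below; it reuses the esa_relatedness helper from the earlier snippet and NLTK's WordNet interface, and it is our own illustration rather than the demo's actual code.

```python
# Minimal sketch of the gloss-ranking WSD strategy described above, reusing
# the esa_relatedness() helper from the previous snippet and NLTK's WordNet
# interface; this is our illustration, not the demo's code.
from nltk.corpus import wordnet as wn   # requires: nltk.download("wordnet")

def disambiguate(sentence: str, target: str):
    context = " ".join(w for w in sentence.split() if w.lower() != target.lower())
    ranked = []
    for synset in wn.synsets(target):
        score = esa_relatedness(context, synset.definition())  # gloss vs. context
        ranked.append((score, synset.name(), synset.definition()))
    return sorted(ranked, reverse=True)   # best-matching sense first

for score, name, gloss in disambiguate("the bank of the river was muddy", "bank")[:3]:
    print(f"{score:.3f}  {name}  {gloss}")
```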
1 https://github.com/faraday/wikiprep-esa
Fig. 1: Screenshot of the semantic search application.
5 Applications using EasyESA
While the demonstration focuses on providing easy to replicate applications
for users to start experimenting with distributional semantics, more complex
applications were built using EasyESA. Freitas et al [2] used EasyESA for
terminology-level search over Linked Data vocabularies, achieving better per-
formance in the semantic matching process when compared to WordNet-based
query expansion approach. EasyESA was used in the Treo QA system [1], a
schema-agnostic query approach over the Linked Data Web. The system uses
a distributional semantics approach to match query terms to dataset elements,
supporting schema-agnostic queries. Hasan & Curry [3] use EasyESA for seman-
tically matching complex events from semantically heterogeneous data sources,
in a real time scenario.
Acknowledgment: This publication was supported in part by Science Foundation
Ireland (SFI) (Grant Number SFI/12/RC/2289) and by the Irish Research Council.
References
1. Freitas, A., Curry, E., Natural Language Queries over Heterogeneous Linked Data
Graphs: A Distributional-Compositional Semantics Approach. In Proc. of the 19th
Intl. Conf. on Intelligent User Interfaces. (2014).
2. Freitas, A., Curry, E., O’Riain, S., A Distributional Approach for Terminological
Semantic Search on the Linked Data Web. In Proc. of the 27th ACM Symposium
On Applied Computing (SAC), Semantic Web and Applications (SWA). (2012).
3. Hasan, S., Curry, E., Approximate Semantic Matching of Events for the Internet
of Things. In ACM Transactions on Internet Technology (TOIT). (2014).
4. Gabrilovich, E., Markovitch S., Computing semantic relatedness using Wikipedia-
based explicit semantic analysis. In Proc. of the 20th Intl. Joint Conf. on Artificial
Intelligence, 1606–1611. (2007).
LODHub - A Platform for Sharing and Analyzing
Large-Scale Linked Open Data
Stefan Hagedorn and Kai-Uwe Sattler
Database & Information Systems Group,
Technische Universität Ilmenau, Germany
{first.last}@tu-ilmenau.de
Abstract. In this demo paper we describe the current prototype of our
new platform LodHub, which allows users to publish and share linked
datasets. The platform further allows users to run SPARQL queries and
execute Pig scripts on these datasets to support their data processing
and analysis tasks.
Keywords: Linked Open Data, data processing, data analysis, web
platform
1 Introduction
Over the last years, the (Linked) Open Data movement has gained growing
popularity. More and more public and government agencies as well as organizations
publish data of common interest, allowing others to make use of it and to build
interesting applications. This is fostered by various hubs such as datahub.io,
data.gov, etc., which act as central registries for data sources.
However, creating added value from the published data typically requires
preprocessing and cleaning the data, combining it with other data, and creating new
datasets or building models. The results of such transformation and analysis tasks are
twofold: first, the new datasets or models might be useful to other users in the
form of curated datasets, and second, the tasks could be reused for other datasets,
too. In particular, recent developments in big data technologies provide numerous
useful building blocks for such an environment, e.g. scalable platforms like
Hadoop or Spark and higher-level programming models and data flow languages
like Pig or Jaql.
In [2] we argue that there is a need for a platform addressing these require-
ments by combining functionalities of Open Data management frameworks with
an infrastructure for exploration, processing, and analytics over large collec-
tions of (linked) data sets while respecting access and privacy restrictions of the
datasets.
In this demo we plan to show our initial results of building such a plat-
form. Our LodHub platform provides services for uploading and sharing (RDF)
datasets. Furthermore, it contains facilities to explore, query, and analyze datasets
by integrating a SPARQL query processor as well as a visual data flow tool for
designing and executing Pig scripts on the hosted datasets.
2 LodHub Services
The platform has several features that let users work with their datasets. How-
ever, it is currently only a prototype and some features are not completed yet.
2.1 Managing datasets
Upload Users can upload their datasets via the website. During the upload
process, the user can enter the dataset name, tags, and a short description.
The tags can later be used to search in the user's collection of datasets. An
import feature that lets users upload datasets that are not in a linked data
format (e.g., CSV files, Excel sheets, etc.) is not finished yet, but is work
in progress.
Update Over time, the contents of a dataset may change. These changes can
be updated values for particular statements, new additional statements, or
removed statements. If the set of changes is small, one might consider the result
a new version of the dataset rather than a completely new one. To reflect this,
LodHub supports versioning: it is possible to explicitly upload a new version of
a dataset. By default, users work with the most recent version of a dataset, but
it is also possible to switch to an older version and, e.g., run queries on that
version.
Collaboration The idea of LodHub is to allow users to work together on
potentially large datasets. When a user uploads a new dataset, she or he becomes
its owner and hence has the permission to perform all operations on it. To allow
other users to work with this dataset as well, the owner can share the dataset
with other users. By sharing we mean that the other users can see this dataset in
their collection so that they are able to work with it. When sharing a dataset, the
owner can choose which permission the other users should have on this dataset:
Read allows querying the dataset, Write is Read access plus the permission to
upload new versions, and Share is the permission to share the dataset with other
users.
2.2 Querying datasets
Working with datasets means running queries on them to find the needed
information. However, there are different types of queries that users need to
execute. On the one hand, there are rather short ad-hoc queries that can be run
directly on the datasets. On the other hand, there are data-intensive
transformation and analysis tasks. LodHub supports both types of workload by
providing two ways to formulate queries.
Ad-hoc queries Since LodHub was designed around linked data, the main
query language used on the platform is SPARQL. SPARQL queries can be used
to instantly formulate a query on one dataset or even on the union of many
datasets. The user is presented with a text input area where she or he can enter
the query. The query is then directly executed by the underlying RDF framework
(in our case Jena) and the resulting triples are presented on the website.
Analytical tasks Writing SPARQL queries can be cumbersome and requires
users to learn the language. Furthermore, there may be complex tasks that are
too difficult to express in SPARQL, e.g., data transformation steps, or complex
operators that are not available in SPARQL at all. In this case, it is easier to
formulate the query in a script language where data is processed in a streaming
fashion to achieve high throughput.
To achieve a low entry barrier, we provide a graphical editor that lets users
create queries via drag and drop. Users can choose between several predefined
operators which they can then connect to model the data flow between the
operators. For each operator, parameters like filter conditions, join attributes,
or projection columns can be set individually. Thus, users intuitively build an
operator tree without having to care about language-specific rules.
Each operator is translated into a Pig script statement. Pig, being a framework
on top of Apache Hadoop, allows distributed execution of the query and loading
of the data from a distributed file system (e.g., HDFS). Hence, this approach
does not use the indexes generated by the RDF framework. However, the data
flow approach allows a high degree of parallelization in a cluster environment.
Currently, the graphical editor produces Pig scripts only. However, it was
designed so that it will also be possible to generate other languages from the
graph. Thus, in a future version the editor will be able to generate traditional
SPARQL queries and possibly other languages, too.
Data exploration To help people understand the content of datasets and to
find useful datasets that may contribute to the user's question, LodHub allows
visualizing how datasets are interlinked with each other, i.e., how many objects
of one dataset occur as subjects in another dataset (and vice versa).
3 Architecture
The platform was written using the Play! framework1. Play's integration of Akka
makes it easy to run the application in a cluster environment. However, in the
current development phase we concentrate on a single machine, but plan to
distribute the application across a cluster to achieve better load distribution.
To store the datasets, we use the Jena TDB framework2. SPARQL queries
on these datasets are passed directly to the Jena library, which then
1 http://www.playframework.com/
2 http://jena.apache.org/documentation/tdb/
evaluates the query. The Pig scripts are currently executed in local mode, i.e.,
they are not distributed to a cluster. However, this is just a configuration step,
so the application can be run in a cluster environment without having to change
a lot of code.
The modular design of the application will enable us to easily replace the
RDF store with another one or even to install a new store alongside the others.
Thus, we could, for example, use our fast CameLOD [1] for analytical SPARQL
queries that have to read a massive amount of data, while transactional queries
are still handled by the Jena TDB store.
4 Demo
In our demo we will show the features described in the previous section. Users
will be able to upload datasets, update existing ones, and execute queries on the
datasets. For the queries they can either type in traditional SPARQL queries or
use our graphical editor to define a data flow and then generate a Pig script from
the created graph.
A short demonstration of the current status of the platform can be found in
this video:
http://youtu.be/m4kKiBrw2m4
In this demonstration we used several datasets which contain information
about events, their location, date, and a URL for more information, as well as
datasets containing information about cities and the country they belong to.
After an introduction of the dashboard that is the user’s entry point to all
actions, we run a SPARQL query with a GROUP BY and HAVING clause to find
the subjects that have the same predicate two or more times.
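One way to express this query is sketched below; the endpoint URL is a placeholder for a LodHub-hosted dataset, and SPARQLWrapper stands in for the platform's Jena-based execution.

```python
# Sketch of the ad-hoc query described above: subjects that use the same
# predicate two or more times. The endpoint URL is a placeholder; SPARQLWrapper
# is used here in place of the platform's built-in Jena execution.
from SPARQLWrapper import SPARQLWrapper, JSON

query = """
SELECT ?s ?p (COUNT(?o) AS ?uses)
WHERE { ?s ?p ?o }
GROUP BY ?s ?p
HAVING (COUNT(?o) >= 2)
"""

endpoint = SPARQLWrapper("http://localhost:9000/sparql")  # placeholder endpoint
endpoint.setQuery(query)
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["uses"]["value"])
```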
Next, we show how to create more complex analytical data processing tasks
using the graphical editor to create Pig scripts. In this editor we create a data
flow using a LOAD, a GROUP BY, and a PROJECTION operator. For each operator, we
enter the necessary parameters: for LOAD, the path of the file to load and the schema;
for GROUP BY, the grouping column; and for PROJECTION, the column names to project.
After uploading a new dataset we create a second Pig script that uses a special
MATERIALIZE operator. This operator allows materializing the results of a Pig
script into a new dataset that is immediately available to the user and can be
used just like a normal dataset.
At the end of the demo video we show how to visualize the interlinks between
selected datasets.
References
1. Hagedorn, S., Sattler, K.U.: Efficient parallel processing of analytical queries on
linked data. In: OTM, pp. 452–469 (Sept 2013)
2. Hagedorn, S., Sattler, K.U.: Lodhub - a platform for sharing and integrated process-
ing of linked open data. In: Proceedings of the 5th International Workshop on Data
Engineering Meets the Semantic Web. pp. 260–262. IEEE (March 2014)
LOD4AR: Exploring Linked Open Data with a Mobile
Augmented Reality Web Application
Silviu Vert, Bogdan Dragulescu and Radu Vasiu
Multimedia Research Centre, Politehnica University Timisoara, Timisoara, Romania
{silviu.vert, bogdan.dragulescu, radu.vasiu}@cm.upt.ro
Abstract. There is a vast amount of linked open data published nowadays on the
web, ranging from user-generated data to government public data. This data
needs visualization and exploration tools which people can use to make sense of
the data and turn it into useful and used information. Several such tools have been
proposed in the past, but they are designed mostly for specialists. Augmented
reality has recently emerged as an interactive medium for exploring information
in the real world and is well-suited for non-specialists. We propose a demo of a
mobile augmented reality application that runs in the browser and consumes and
displays linked open data in a friendly manner on top of the surroundings of the
user.
1 Introduction
Due to the vast amount of data published in recent years, linked data has become a
popular research field. The Linking Open Data cloud1 has grown from 12 datasets in
2007 to 295 datasets in 2011. One of the major research areas concerns the consumption
of linked data by browsers designed specifically for this task. Such browsing and
visualization tools are needed by tech users and lay users alike to easily retrieve
information from the Web of Data. A recent survey [1] categorized these tools into text
browsers and visualization-based browsers, with the latter being more suited for lay
users, but also being fewer in number and in need of more tweaking. However, browsers
that make use of large quantities of linked (open) data and are well suited for specific
tasks are still under-researched.
Mobile augmented reality applications have recently emerged as interactive and
usable tools for exploring the world surrounding a user [2]. We propose this medium as
a suitable form of exploring geo-based linked data. This approach is not without its
research challenges, mainly geodata integration, data quality assessment, and
provenance and trust issues [3]. We present a demo of a mobile augmented reality
application that helps tourists get a sense of unfamiliar surroundings based on popular
linked open data content sources that are integrated for this purpose. The current
version of the demo is available online.2
1 http://lod-cloud.net/
2 http://dev.cm.upt.ro/ar/
2 Overview of the implementation
The application implements the crawling/data warehousing pattern. Figure 1 highlights
the steps needed to build an augmented reality application that uses data from multiple
sources.
Fig. 1. Overview of the flow of information in the application
In the first step, a list of data sources is identified with the purpose of being used in
an augmented reality application. The data sources have to contain information
regarding places with geographic coordinates, with labels and descriptions, ideally in
multiple languages, and with mapping links and specific categories.
In the second step, the data from these multiple data sources is collected and mapped
to a consistent vocabulary. Identity resolution is performed for different URIs
addressing the same resource, a data quality assessment is carried out, the merged data
is stored in an RDF store, and an API is provided for the AR application to obtain the
desired information in JSON format.
The mobile augmented reality application, the third step, is browser-based, so there
is no need for the user to download a standalone application from a store. Given the
detected geolocation of the user, the application displays a set of Points of Interest
(POIs) in the immediate vicinity of the user. The information about the POIs is
consumed from the triple store, via the above-mentioned API.
The first two steps are described in Section 3 and the AR application in Section 4.
3 Linked Open Data Integration
In order to build a dataset usable in an augmented reality application, we first
identified possible data sources that satisfy the requirements described above. The
datasets chosen for this demo application are extracted from DBpedia.org,
LinkedGeoData.org and the Romanian Government Open Data portal3.
In order to collect, map and integrate the chosen datasets we used the powerful
Linked Data Integration Framework (LDIF) [4]. From DBpedia we collected
information relevant to the authors' hometown, Timisoara, using the SPARQL Import
module of LDIF. In the case of LinkedGeoData the SPARQL Import module is unusable
because the query cannot be limited to a geographic area; the solution was to generate
a dump file for the desired area and use the Triple/Quad Dump Import module to load
the data into LDIF. From the Romanian Government Open Data portal we used the
museums dataset, available only in CSV format. The OpenRefine tool was used to
clean and convert the dataset into RDF, and the Triple/Quad Dump Import
module to load the data into LDIF.
Data translation was carried out to obtain a consistent list of categories and to build
geometry values where missing. Identity resolution was performed using the Levenshtein
distance on labels with a threshold of 1, and by comparing the distance between POIs
with a threshold of 100 meters. For quality assessment the metric was time closeness;
this implies that only the most recent data is used in the data fusion step.
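The matching rule can be illustrated with the following sketch (our own code, not LDIF's implementation); the example POIs and their coordinates are made up.

```python
# Illustrative sketch (not LDIF's implementation) of the matching rule described
# above: two POIs are considered the same resource if their labels are within
# Levenshtein distance 1 and their locations are within 100 meters.
from math import radians, sin, cos, asin, sqrt

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def distance_m(lat1, lon1, lat2, lon2) -> float:
    """Haversine distance in meters."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def same_poi(p1, p2, max_edit=1, max_dist=100.0) -> bool:
    return (levenshtein(p1["label"].lower(), p2["label"].lower()) <= max_edit
            and distance_m(p1["lat"], p1["lon"], p2["lat"], p2["lon"]) <= max_dist)

# Made-up example POIs (labels differ by one character, locations ~14 m apart)
a = {"label": "Muzeul de Arta", "lat": 45.7570, "lon": 21.2290}
b = {"label": "Muzeul de Artă", "lat": 45.7571, "lon": 21.2291}
print(same_poi(a, b))  # True
```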
The resulting dataset was output by LDIF into an OpenRDF Sesame RDFS store
and contained 25,000 statements for the city of Timisoara. To allow the AR application
to query the data, an API was built that queries triples from the RDF store and returns
them in JSON format.
4 The Augmented Reality Web Application
The augmented reality solution chosen is a browser-based one. Modern web technologies
(HTML, CSS, JavaScript) combined with the capabilities of mobile devices that are now
available in the browser (geolocation, camera, WebGL) have given rise to such
solutions. Our demo uses the recently launched awe.js library,4 which builds an
augmented reality layer on top of three.js, a 3D library that runs in the browser.
The awe.js library is advertised to work with both location- and marker-based AR,
on the latest versions of Chrome and Firefox on Android, as well as on devices such as
Oculus Rift, Leap Motion and Google Glass. So far we have successfully tested our web
application on Chrome v35, Firefox v30 and Opera v22 on a Nexus 4 smartphone
with Android 4.4.
The web application asynchronously queries the Sesame server (through a proxy, to
overcome the same-origin policy) to get the required POIs, based on the location of the
user, an area around it in which to search for points, and the desired category of interest.
It then processes the list of retrieved points and places 3D pinpoints into the space, based
on the category the POI belongs to and on the distance from the user to that POI. On
touching a certain POI, the user is presented with a short snippet of information, which
can be expanded to read more about that place. Figure 2 shows some screenshots of the
application.
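For illustration, the request issued by the AR client could look roughly as follows; the API host, path, parameter names and coordinates are assumptions, since the paper does not specify the exact interface.

```python
# Hypothetical sketch of the POI request the AR client issues: fetch POIs
# around the user's location, filtered by category, as JSON. The endpoint,
# parameters and coordinates below are assumed for illustration only.
import requests

API = "http://example.org/ar/api/pois"   # assumed endpoint behind the proxy

def nearby_pois(lat, lon, radius_m=500, category="museum"):
    r = requests.get(API, params={"lat": lat, "lon": lon,
                                  "radius": radius_m, "category": category})
    r.raise_for_status()
    return r.json()   # assumed: list of POIs with label, coordinates, description

for poi in nearby_pois(45.7489, 21.2087):   # illustrative coordinates in Timisoara
    print(poi)
```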
3 http://data.gov.ro
4 https://github.com/buildar/awe.js
Fig. 2. Screenshots of the mobile browser application. On the left, a screenshot of the menu and
multiple display of POIs. In the middle, the selected POI and a short snippet of information. On
the right, the extended information for the same POI.
5 Conclusions and Future Work
In this demo paper we proposed a mobile augmented reality application that runs in the
browser and consumes and displays linked open data. To accomplish this we used three
open datasets, a powerful linked data integration framework, LDIF, an OpenRDF
Sesame RDFS store and we built an API for extracting data from the store and
delivering it to the mobile augmented reality browser application built on top of awe.js.
As future work we intend to include: submenus (to be able to further refine the selection
of the POIs), UI improvement, additional datasets, improved data quality.
Acknowledgements. This work was partially supported by the strategic grant
POSDRU/159/1.5/S/137070 (2014) of the Ministry of National Education, Romania,
co-financed by the European Social Fund – Investing in People, within the Sectoral
Operational Programme Human Resources Development 2007-2013.
References
1. Dadzie, A.-S., Rowe, M.: Approaches to visualising linked data: A survey. Semantic Web.
2, 89–124 (2011).
2. Kounavis, C.D., Kasimati, A.E., Zamani, E.D.: Enhancing the Tourism Experience through
Mobile Augmented Reality: Challenges and Prospects. Int. J. Eng. Bus. Manag. 4, (2012).
3. Vert, S., Vasiu, R.: Relevant Aspects for the Integration of Linked Data in Mobile
Augmented Reality Applications for Tourism. The 20th International Conference on
Information and Software Technologies (ICIST 2014) , Druskininkai, Lithuania October 9
(2014) (accepted).
4. Schultz, A., Matteini, A., Isele, R., Mendes, P.N., Bizer, C., Becker, C.: LDIF-A Framework
for Large-Scale Linked Data Integration. 21st International World Wide Web Conference
(WWW 2012), Developers Track, Lyon, France (2012).
PLANET: Query Plan Visualizer for Shipping Policies
against Single SPARQL Endpoints
Maribel Acosta1 , Maria-Esther Vidal2 , Fabian Flöck1 ,
Simon Castillo2 , and Andreas Harth1
1 Institute AIFB, Karlsruhe Institute of Technology, Germany
{maribel.acosta,fabian.floeck,harth}@kit.edu
2 Universidad Simón Bolívar, Venezuela
{mvidal, scastillo}@ldc.usb.ve
Abstract. Shipping policies allow for deciding whether a query should be executed
at the server, at the client, or distributed between the two. Given the limitations
of public SPARQL endpoints, selecting appropriate shipping plans is crucial
for successful query execution without harming endpoint performance. We
present PLANET, a query plan visualizer for shipping strategies against a single
SPARQL endpoint. We demonstrate the performance of the shipping policies
followed by existing SPARQL query engines. Attendees will observe the effects of
executing different shipping plans against a given endpoint.
1 Introduction and Overview
In the context of the Web of Data, endpoints are acknowledged as promising SPARQL
server infrastructures to access a wide variety of Linked Data sets. Nevertheless, recent
studies reveal high variance in the behavior of public SPARQL endpoints, depending
on the queries posed against them [3]. One of the factors that impact endpoint
performance is the type of shipping policy followed to execute a query.
Shipping policies [4] define the way the workload of executing a query is distributed
among servers and clients. Query-shipping policies conduct the execution of query
operators at the server, while plans following data shipping exploit the capacity
of the client and execute the query operators locally. In contrast, hybrid approaches
distribute sub-queries and operators according to the complexity of the queries and
to the server workload and availability. Current SPARQL query engines implement
different policies. For example, FedX [5] implements a query-shipping strategy,
executing the whole query at the endpoint when the federation comprises a single
endpoint. ANAPSID [2] usually follows a hybrid-shipping strategy: it locally gathers
the results of star-shaped sub-queries executed by the endpoint. To showcase an
adaptive hybrid approach alongside FedX and ANAPSID, we also demonstrate
SHEPHERD [1], an endpoint-tailored SPARQL client-server query processor that
aims at reducing the endpoint workload and favors the generation of hybrid shipping plans.
Analyzing the shipping policies followed to execute a query not only provides the basis
for understanding the behavior of an endpoint, but also allows for the development of
endpoint-aware query processing techniques that preserve endpoint resources. The goal
of the work presented here is to assist data consumers and data providers in understand-
ing the effects of posing different shipping plans against an existing public SPARQL
endpoint. We introduce PLANET, a query plan visualizer for shipping strategies that
provides an intuitive overview of the plan structure and the shipping strategies used, as
well as key metrics to understand the behavior of different engines when executing a query.
PLANET is designed to shed light on the distribution of operator execution between
client and server, which is crucial for investigating the type of plans that may lead to
severe under-performance of endpoints. Attendees will observe the impact of different
shipping strategies when queries are posed against a single endpoint. The demo is
published at http://km.aifb.kit.edu/sites/planet/.
2 The PLANET Architecture
PLANET invokes SPARQL query engines that implement different shipping strategies.
In this work, we studied the query-shipping plans produced by FedX and the hybrid-
shipping plans of ANAPSID and SHEPHERD. The plan retrieved from each engine is
processed by PLANET's query plan parser, which translates the plans into JSON struc-
tures encoding the visualization data that will be consumed by the rendering module.
Currently PLANET is able to parse plans generated by engines that use the Sesame1
framework, or the ANAPSID or SHEPHERD internal structures.
The rendering module uses the "Collapsible Tree" layout of the D3.js JavaScript
library2 to generate the visualizations of the plans produced by the SPARQL query pro-
cessing engines. Figures 1(a) and 1(b) show snapshots of plans rendered by PLANET.
Plan operator nodes are filled with different colors to distinguish whether an operator
is executed by the engine (locally) or at the server (remotely), which makes it easy to
identify the type of shipping strategy followed in each plan.
The plan descriptor reports a set of metrics characterizing the shipping plans. Exe-
cution performance is measured by the execution time of the query and the number of
results produced. In addition, the metric hybrid ratio measures the quantitative relation
between the SPARQL operators executed at the local engine and the ones executed at
the endpoint. The hybrid ratio of a plan p is calculated as follows:
clientOp(p) · serverOp(p)
hybridRatio(p) =
totalOp(p) · max{clientOp(p), serverOp(p)}
where clientOp(p), serverOp(p) stand for the number of operations executed for plan
p at the client and server, respectively, and totalOp(p) = clientOp(p) + serverOp(p).
Note that the hybrid ratio for plans following data- or query-shipping strategies is zero.
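The metric can be transcribed directly into code; the operator counts in the example calls below are illustrative and not taken from the plans in the demo.

```python
# Direct transcription of the hybrid ratio formula above; the operator counts
# in the example calls are illustrative, not taken from the paper's plans.
def hybrid_ratio(client_op: int, server_op: int) -> float:
    total_op = client_op + server_op
    if client_op == 0 or server_op == 0:      # pure data- or query-shipping plan
        return 0.0
    return (client_op * server_op) / (total_op * max(client_op, server_op))

print(hybrid_ratio(3, 1))   # hybrid plan: 3*1 / (4*3) = 0.25
print(hybrid_ratio(0, 7))   # query-shipping plan: 0.0
print(hybrid_ratio(4, 4))   # evenly split plan: 16 / (8*4) = 0.5
```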
The output of PLANET is a set of plan visualizations and a summary report with
the metrics computed for each plan.
3 Demonstration of Use Cases
As an illustrative example, consider the following query included in our online demo:
GP Query 2 against the DBpedia endpoint, comprising 9 triple patterns and one
FILTER operator. Figure 1 reports on the plans depicted by PLANET for this
query. The plan reported in Figure 1(a) follows a hybrid shipping strategy where
the quantitative relation between the SPARQL operators executed locally and the ones
1 http://www.openrdf.org/
2 http://d3js.org/
(a) Hybrid Shipping Plan. Sub-query under red bar is posed against DBpedia endpoint; Op-
erators under blue bar are executed at the client side
(b) Query Shipping Plan. The whole query is posed to the DBpedia endpoint
Fig. 1. Plans against the DBpedia endpoint. Blue circles represent operators executed locally; red
circles correspond to operators that will be posed against the endpoint
posed against the DBpedia endpoint, i.e. the hybrid ratio, is 0.27; the execution time is
0.63 seconds and one tuple is retrieved. On the other hand, when the query-shipping plan
presented in Figure 1(b) is executed, the execution time is 1.44 seconds and no answer is
received. Finally, if the query is executed directly against the endpoint, the answer is
effectively retrieved but the execution time is 6.91 seconds. Across the three engines,
we can observe that the combined performance of engine and endpoint deteriorates as
the number of operators posed against the endpoint increases.
In line with the previous example, the following research questions arise: (i) is the
observed behavior due to limitations of the endpoints? Or (ii) is this behavior caused
by the shipping plan followed during query execution? As part of this demo, we
visualize characteristics of different plans and public endpoints that provide evidence
for answering these research questions. So far, we have performed a comprehensive
study of the execution of 70 queries against seven different public SPARQL endpoints.
We selected the query targets from the list of endpoints monitored by the SPARQLES
tool [3] and classified them into quartiles. These included two high-performing
endpoints (Top 25% quartile), two medium-performing endpoints ((25%;50%] quartile),
and three of the second-least performing endpoints ((50%;75%] quartile). We
crafted ten SPARQL queries for each endpoint; five are composed of basic graph pat-
terns (BGP queries), and the others comprise arbitrary SPARQL graph patterns (GP queries).
Attendees of the demo have the possibility to analyze the results of executing these
queries, which are already loaded in the system, or to visualize the plans of their own
SPARQL queries. Query plans are computed on-the-fly, while the reported results were
computed off-line to facilitate the demonstration. We will demonstrate the following use cases:
Effects of Shipping Policies in BGP Queries. We show that in setups such as the one
reported in Figure 1, where endpoints receive a large number of concurrent requests
per day, i.e., the endpoint is in the (50%;75%] quartile, hybrid-shipping policies can
reduce execution time, and the retrieved results surpass 79% of the answers retrieved by
query-shipping plans. For high- and medium-performing endpoints, there is a trade-off
between execution time and the size of the retrieved answers. Nevertheless, in all the
queries the effectiveness of the endpoints is increased by up to one order of magnitude,
i.e., the number of answers produced per second following a hybrid-shipping plan can
be up to 20 times the number of answers produced by a query-shipping plan.
Effects of Shipping Policies in GP Queries. We observed that for highly loaded
endpoints, 90% of hybrid-shipping plans reduce execution time by up to two orders of
magnitude. Hybrid plans generated by SHEPHERD achieved the highest performance
on DBpedia. For endpoints in other quartiles, competitive performance between hybrid-
and query-shipping plans is observed, but hybrid plan performance never deteriorates
significantly. This suggests that hybrid-shipping policies are appropriate for achieving
reasonable performance while shifting load from the server to the client.
4 Conclusions
PLANET visualizes the impact of shipping policies and provides the basis for under-
standing the conditions that favor the implementation of hybrid shipping plans;
e.g., attendees will be able to observe that for non-selective queries hybrid plans
significantly outperform the others. Thus, PLANET facilitates the analysis of the
behavior of public endpoints, as well as the development of scalable real-world
client-server applications against single SPARQL endpoints.
Acknowledgements
The authors acknowledge the support of the European Community’s Seventh Frame-
work Programme FP7-ICT-2011-7 (XLike, Grant 288342).
References
1. M. Acosta, M.-E. Vidal, F. Flöck, S. Castillo, C. Buil-Aranda, and A. Harth. SHEPHERD: A
shipping-based query processor to enhance SPARQL endpoint performance. In ISWC Poster
Track, 2014.
2. M. Acosta, M.-E. Vidal, T. Lampo, J. Castillo, and E. Ruckhaus. Anapsid: an adaptive query
processing engine for SPARQL endpoints. In ISWC, pages 18–34, 2011.
3. C. B. Aranda, A. Hogan, J. Umbrich, and P.-Y. Vandenbussche. SPARQL web-querying in-
frastructure: Ready for action? In ISWC, pages 277–293, 2013.
4. M. J. Franklin, B. T. Jónsson, and D. Kossmann. Performance tradeoffs for client-server query
processing. In SIGMOD Conference, pages 149–160, 1996.
5. A. Schwarte, P. Haase, K. Hose, R. Schenkel, and M. Schmidt. Fedx: Optimization techniques
for federated query processing on linked data. In ISWC, pages 601–616, 2011.
High Performance Linked Data Processing for
Virtual Reality Environments
Felix Leif Keppmann1 , Tobias Käfer1 , Steffen Stadtmüller1 , René Schubotz2 ,
and Andreas Harth1
1 Karlsruhe Institute of Technology (KIT)
{felix.leif.keppmann, tobias.kaefer, steffen.stadtmueller, andreas.harth}@kit.edu
2 Airbus Group
rene.schubotz@eads.net
1 Introduction
The success of Linked Data (LD) [1] has enabled an environment in which ap-
plication data can easily be enriched with the abundance of information available
on the Web. Many recent approaches of the Linked Data community go beyond
the mere exposure of static data and propose the combination of Linked Data
and Representational State Transfer (REST) [3, 5, 7] to enable dynamic systems.
However, in highly dynamic environments, where near real-time data integration
and processing with high update frequencies are required, the perceived over-
head of Linked Data query processing and stateless communication patterns often
prevents the adoption of systems oriented towards the exchange of resource state.
Nevertheless, in our demonstration, we show a Virtual Reality (VR) informa-
tion system that leverages the REST principles and the integration capabilities
of LD. We specifically chose a VR setting because it requires very low latency [2]
in order to enable natural interaction of the user with the system. Our system
consists of loosely coupled components [4], as implied by REST, and provides
an interactive experience by seamlessly integrating existing LD sources from the
Web as well as highly dynamic body tracking data in a VR environment.
We show how sensor data exposed as LD can be processed with high update
frequencies and rendered in a VR environment. Constantly evaluated queries
are employed to realise both gesture recognition and collision detection of objects
in the VR. Derived actions, such as data retrieval from the Web and the subsequent
integration of the retrieved data with the sensor data, are performed on-the-fly.
With our system we contribute by demonstrating:
– the applicability of Linked Data in a VR environment
– the feasibility of a REST-based distributed system with high frequencies
– the capability to execute high frequency on-the-fly declarative integration of
Linked Data in a REST environment
In the following we present the experience provided by our demonstration
from both a user's and a technological point of view (Section 2). Afterwards, we
elaborate on the underlying technologies (Section 3) and conclude briefly
(Section 4).
Fig. 1. Depth Video with Skeleton Tracking and Demo System Visualization
2 Demonstration
Our demonstration system lets the user experience an interactive, integrated and
responsive Virtual Reality. The system displays an avatar representing the user
and information from various sources, integrated on-the-fly in response to user
commands. In our system, the user issues commands via Natural Interaction (NI),
i.e. by interacting with the system through movements and gestures. The user in
our set-up stands in front of a motion tracking sensor that tracks all parts of the
body. As the user moves, for example, a hand to form a specific pose, the system
executes a particular action. Figure 1 shows both a visualization of the depth
video input data (including skeleton tracking) and the VR which the user is
remote controlling via NI.3
The user is represented in the VR by a human-like avatar (blue in Figure 1).
Each recognized joint of the user's skeleton is mapped to the avatar, e.g. a knee
of the user is mapped to a knee of the avatar and moved accordingly. Instead
of walking on a floor, the avatar is placed on a map (in Figure 1 the city of
Karlsruhe in Germany) which moves in a specific direction if the avatar steps
on the corresponding border of the map. Further, Points of Interest (PoIs) are
visualized on the map, e.g. important buildings or upcoming concerts in the
area (represented by red arrows in Figure 1). Via gestures performed at a PoI,
more detailed information is requested, integrated on-the-fly and displayed
in the VR. In this way, a user is able to navigate through the map by walking
and to request additional information with gestures.
Besides the user experience of our system, we demonstrate, from a techno-
logical point of view, loosely coupled components representing several internal
and external data sources and sinks with a variety of exposed update frequen-
cies, which are integrated on-the-fly in a declarative manner. All data sources
and data sinks expose their data and functionality as LD via REST interfaces.
The data sources include sources of a relatively static nature, e.g. map data or
PoIs, and highly dynamic data sources, e.g. the data of the body-tracking
video sensor. Nevertheless, all sources and sinks are integrated via one compo-
nent, without the need for hard-coded programmatic integration of the interfaces.
3 More screenshots and videos of the demonstration system are available at: http://purl.org/NET/HighPerfLDVR
In particular, a set of rules defined in a declarative rule language specifies the
data flow between sources and sinks. A corresponding engine handles the query
processing, evaluates the rule set in each cycle and executes low-level tasks, e.g.
retrieving data from the REST interfaces of the video sensor and a PoI service,
transforming and integrating it into the target schema, and updating or creating
resources in the REST interface of the visualization. Currently, our demonstra-
tion system processes data at a frequency of 20 Hz under high load and up to 28 Hz
under low load, which is close to the ideal 30 Hz update frequency at which the
body tracking sensor updates.
3 Technology
We use Linked Data-Fu (LD-Fu) [6] as the central component for the integration of
data sources and sinks. LD-Fu is both a declarative rule language and an engine
that handles the execution of programs consisting of these rules. The LD-Fu engine
is able to interact with LD resources via REST interfaces. Rules defined in a
program are evaluated, and the interactions defined in the rule body are executed
if the conditions are met. These interactions can include the retrieval of data from
LD resources, the integration of data and the manipulation of LD resources. We
developed the engine as a multi-threaded forward-chaining data processor, thereby
supporting fast parallel data processing with reasoning capabilities.
In our demonstration system LD-Fu handles 1) the data flow between the body
tracking of the depth sensor and the visualization on the screen or wall, 2) collision
detection in the interaction of the user with virtual objects, 3) the execution of
actions as a result of user interaction, and 4) the integration of additional
data sources on the Web, e.g. information about PoIs or map data. All these
interactions are defined as rule sets in LD-Fu programs.
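To give an impression of what one such data-flow rule does, the following imperative Python sketch polls a (hypothetical) NIREST skeleton resource and pushes its state to a (hypothetical) visualization resource; in the actual system this behaviour is expressed declaratively as an LD-Fu rule rather than in Python.

```python
# Imperative sketch of one LD-Fu data-flow rule (our illustration): poll the
# NIREST skeleton resource and push the joint positions to the corresponding
# avatar resource of the visualization. Both URIs are hypothetical; the actual
# system expresses this declaratively as an LD-Fu rule, not in Python.
import time
import requests

SKELETON_URI = "http://localhost:8080/nirest/users/1/skeleton"   # assumed NIREST resource
AVATAR_URI = "http://localhost:8081/scene/avatar/joints"         # assumed jMonkey LD resource

def sync_once():
    # GET the current skeleton state exposed as Linked Data (Turtle)
    skeleton = requests.get(SKELETON_URI, headers={"Accept": "text/turtle"})
    skeleton.raise_for_status()
    # PUT the (possibly transformed) state onto the visualization resource
    update = requests.put(AVATAR_URI, data=skeleton.text,
                          headers={"Content-Type": "text/turtle"})
    update.raise_for_status()

while True:               # the LD-Fu engine evaluates its rule set in cycles like this
    sync_once()
    time.sleep(1 / 30)    # aim for the sensor's 30 Hz update frequency
```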
With Natural Interaction via REST (NIREST) we expose body tracking data
as LD via a REST interface. NIREST utilizes data extracted from depth video
sensors, e.g. Microsoft Kinect4 devices. The position information of people
recognized in the depth video, their skeleton data and the device metadata are
exposed as LD resources on the REST interface and are updated at a frequency
of 30 Hz. We developed NIREST in Java as an application container that
encapsulates all required components and is deployable on application servers.
The application is built on top of the OpenNI5 framework and the NiTE
middleware. OpenNI is an Open Source framework providing low-level access
to colour video, depth video and audio data of sensor devices. NiTE acts as
middleware on top of OpenNI and provides body, skeleton and hand tracking.
We use NIREST in our demonstration system as 1) a highly dynamic data source
providing 2) body positions and 3) the positions of the skeleton joint points of all
people in front of the sensor. LD-Fu programs use this data via the REST interface
for collision detection and for visualizations in the virtual reality.
4 http://www.microsoft.com/en-us/kinectforwindows/
5 https://github.com/openni
Our user interface is based on jMonkey6, an Open Source 3D engine for the
development of games in Java. We combine jMonkey with a REST interface to
expose data about the objects in the 3D scene graph as LD resources. A scene
graph is a data structure in VR engines which represents all objects in the VR
as well as their interrelations. The LD interface on top of jMonkey allows for
the retrieval and modification of data about the 3D scene graph nodes. In our
demonstration we employ the jMonkey-based visualization as the user interface,
which can be displayed on a monitor or projected onto a wall.
4 Conclusion
With our system we demonstrate data integration in a highly dynamic envi-
ronment using Semantic Web technologies. We successfully applied the LD and
REST paradigms, which facilitate on-the-fly declarative data integration in our
VR scenario. Moreover, the user is part of the demonstration system and is able
to control the integration and visualization of information via natural interaction,
all programmed using a declarative rule language.
Acknowledgements
This work was supported by the German Ministry of Education and Research
(BMBF) within the projects ARVIDA (FKZ 01IM13001G) and Software-Campus
(FKZ 01IS12051) and by the EU within the projects i-VISION (GA #605550)
and PlanetData (GA #257641).
References
1. Bizer, C., Heath, T., Berners-Lee, T.: Linked Data – The Story So Far. International
Journal on Semantic Web and Information Systems 5(3), 1–22 (2009)
2. International Telecommunication Union (ITU): End-user multimedia QoS categories.
ITU-T Recommendation G.1010 (2001)
3. Krummenacher, R., Norton, B., Marte, A.: Towards Linked Open Services and Pro-
cesses. In: Proceedings of the Future Internet Symposium. Springer Berlin Heidel-
berg (2010)
4. Pautasso, C., Wilde, E.: Why is the Web Loosely Coupled? A Multi-Faceted Metric
for Service Design. In: Proceedings of the International World Wide Web Conference
(2009)
5. Speiser, S., Harth, A.: Integrating Linked Data and Services with Linked Data
Services. In: Proceedings of the Extended Semantic Web Conference (2011)
6. Stadtmüller, S., Speiser, S., Harth, A., Studer, R.: Data-Fu: A Language and an
Interpreter for Interaction with Read/Write Linked Data. In: Proceedings of the
International World Wide Web Conference (2013)
7. Verborgh, R., Steiner, T., van Deursen, D., van de Walle, R., Gabarró Vallès, J.:
Efficient Runtime Service Discovery and Consumption with Hyperlinked RESTdesc.
In: Proceedings of the International Conference on Next Generation Web Services
Practices (2011)
6 http://jmonkeyengine.org/
Analyzing Relative Incompleteness of Movie
Descriptions in the Web of Data: A Case Study
Wancheng Yuan1 , Elena Demidova2 , Stefan Dietze2 , Xuan Zhou1
1 DEKE Lab, MOE. Renmin University of China. Beijing, China
wancheng.yuan@ruc.edu.cn, zhou.xuan@outlook.com
2 L3S Research Center and Leibniz University of Hanover, Germany
{demidova, dietze}@L3S.de
1 Introduction and Approach
In the context of Linked Open Data (LOD) [3], datasets are published or updated
frequently, constantly changing the landscape of the Linked Data Cloud. In this
paper we present a case study investigating relative incompleteness among sub-
graphs of three LOD datasets (DBpedia (dbpedia.org), Freebase (www.freebase.com),
LinkedMDB (www.linkedmdb.com)) and propose measures for relative data
incompleteness in LOD. The study provides insights into the level of accuracy and
the actual conflicts between different LOD datasets in a particular domain (movies).
In addition, we investigate the impact of the neighbourhood size (i.e. path length)
under consideration, to better understand the reliability of cross-dataset links.
Fig. 1 presents an example of relative incompleteness in the representation
of the movie entity "Holy Smoke!" in DBpedia and Freebase. In this example, the
difference between the actor sets indicates that the "Movie.Actor" property might
be incomplete. As we do not know the exact complete set of actors, and the
noise observed in linked datasets interferes with completeness estimation, we call this
phenomenon relative incompleteness. If we follow the "Movie.Cinematographer"
link in the data graphs of the two datasets, we can observe further relative
incompleteness in its "birthPlace" property.
Fig. 1. Representation of the movie “Holy Smoke!” in DBpedia and Freebase.
In this paper we discuss incompleteness-related measures that can be ob-
tained by pairwise dataset comparisons exploiting entity co-resolution across
these datasets, and we apply these measures in a case study. The basis for the pro-
posed measures is the assumption that dataset-specific differences in the representa-
tion of equivalent entities, and in particular in the values of multi-value properties,
can provide valuable insights into the relative incompleteness of these datasets.
To facilitate the discovery of relative incompleteness, we assume correct schema map-
pings and make use of known equivalent entities. In the context of LOD, absolute
incompleteness is difficult to judge, as it is difficult to obtain a ground truth of
absolute completeness. Therefore, we choose to estimate the relative incompleteness
of properties by following paths of limited length in the data graphs.
By ith-Order property, we mean a property that can be reached from
the target entity through a path of length i in the data graph. For instance,
"Movie.Actor" is a 1st Order property in Fig. 1, while "Movie.Cinematographer.
Name" is a 2nd Order property. We then define ith-Order Value Incompleteness
as follows:
ith-Order Value Incompleteness (Dx, Dy, P) between the pair of datasets
Dx, Dy with respect to an ith-Order multi-value property P is the proportion of
entities in Dx and Dy having different values in P.
As P is a multi-value property, a difference on P usually indicates that at
least one of the datasets does not provide sufficient information on P. In Fig. 1,
we observe a 2nd Order Value Incompleteness in the "Movie.Cinematographer
.birthPlace" property. To determine equivalent values, we rely on direct value
comparisons and identity links.
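The following sketch (our own, with made-up mini-data and hypothetical identifiers) shows how 1st-Order Value Incompleteness for a multi-value property can be computed from aligned entity pairs.

```python
# Illustrative sketch (our own, with made-up mini-data and hypothetical IDs) of
# 1st-Order Value Incompleteness for a multi-value property P: the proportion of
# aligned entity pairs whose value sets for P differ between the two datasets.
def value_incompleteness(dataset_x, dataset_y, aligned_pairs, prop):
    """dataset_*: {entity_id: {property: set(values)}}; aligned_pairs: owl:sameAs links."""
    differing = sum(
        1 for ex, ey in aligned_pairs
        if dataset_x[ex].get(prop, set()) != dataset_y[ey].get(prop, set())
    )
    return differing / len(aligned_pairs) if aligned_pairs else 0.0

dbpedia = {"dbp:Holy_Smoke!": {"actor": {"Kate Winslet", "Harvey Keitel"}}}
freebase = {"fb:movie123": {"actor": {"Kate Winslet", "Harvey Keitel", "Julie Hamilton"}}}
pairs = [("dbp:Holy_Smoke!", "fb:movie123")]          # known equivalent entities

print(value_incompleteness(dbpedia, freebase, pairs, "actor"))   # 1.0: the sets differ
```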
Considering the LOD cloud as a large interlinked knowledge graph, relative
incompleteness of data across different datasets is a crucial and often under-
investigated issue. Relative incompleteness can result e.g. from extraction errors,
lack of source maintenance [4], imprecise identity links [2], as well as incompat-
ibilities in schemas and their interpretation (as we observed in this study). In
the literature, the detection and resolution of data inconsistency has been studied
in the context of data fusion [1]. However, the corresponding methods for the as-
sessment of LOD datasets are underdeveloped. The measures proposed in this
paper can help judge the relative agreement of datasets on certain properties and
thus support source selection. E.g. these statistics can support the identification of
sources with the highest relative agreement as well as of sources containing
complementary information, depending on the particular scenario.
2 A Case Study
Datasets and Schemas: We used the latest versions of three datasets from LOD:
LinkedMDB (LMDB), DBpedia and Freebase. The LMDB dataset contains
eight concepts about movies, such as Movie, Actor and Country, and more than
200,000 records. The DBpedia and Freebase datasets contain around 150,000
and 1,000,000 movie records respectively. To perform the study, we randomly
selected 200 Movie and 200 Actor entities shared between these datasets. To
establish the relationship of entities across the three datasets, we obtained the
existing interlinking information (i.e., the owl:sameAs predicate) of the Movie
entities across all three datasets, as well as of the Actor entities in DBpedia
and Freebase. We manually established schema mappings between the Movie
and Actor concepts and their properties among the datasets.
Evaluation Results: We computed the 1st and 2nd Order Value Incomplete-
ness for each property in each pair of datasets. Table 1 presents the aggregated
1st and 2nd Order Value Incompleteness results for the Movie and Actor entities.
In this aggregation, if a single property is incomplete for an entity, we count
that entity as incomplete. As we can see in Table 1, the relative incompleteness
in the DBpedia/Freebase pair reaches 100% in the first order and 89% in the
second order, meaning that all the Movie entities in these datasets are affected
by incompleteness issues. The overall 1st Order Incompleteness of the Movie
entities in the other dataset pairs is also rather high, e.g., 70% for
LMDB/Freebase and 56% for LMDB/DBpedia.
Table 1. Aggregated Incompleteness for Movie and Actor Entities
Datasets          Movie 1st O.     Movie 2nd O.     Actor 1st O.
                  Incompleteness   Incompleteness   Incompleteness
LMDB/DBpedia      0.56             n/a              n/a
LMDB/Freebase     0.70             n/a              n/a
DBpedia/Freebase  1.00             0.89             0.76
Table 2 presents the details of the evaluation for each property. As we can see in Table 2, the highest relative incompleteness among all dataset pairs is observed for DBpedia/Freebase on the property Actor, whose incompleteness is 73%. This is because DBpedia tends to include only the key people in a movie, whereas Freebase tends to include more complete actor lists. For example, for the movie “Stand by Me”, DBpedia lists only five actors: Wil Wheaton, Kiefer Sutherland, River Phoenix, Corey Feldman, and Jerry O’Connell. The “starring” property of Freebase includes many more actor names, such as Gary Riley, Bradley Gregg, Frances Lee McCain, etc. We also observed that the “starring” property sometimes mixes actor and character names in a movie. For example, for “Stand by Me”, it includes the characters Teddy Duchamp and Waitress. Regarding the LMDB/Freebase pair, the incompleteness on the properties Producer, Release Date and Actor is 30%, 26% and 19%, respectively. LMDB/DBpedia shows a similar distribution, i.e., 29%, 11% and 15%, on the same properties.
Table 2. 1st Order Value Incompleteness of Selected Movie Properties

Dataset            Release Date   Country   Language   Actor   Director   Writer   Editor   Producer
LMDB/DBpedia       0.11           0.02      0.16       0.15    0.02       n/a      n/a      0.29
LMDB/Freebase      0.26           0.15      0.24       0.19    0.02       n/a      n/a      0.30
DBpedia/Freebase   0.21           0.12      0.25       0.73    0.04       0.25     0.08     0.36
An exemplary evaluation performed on the Actor entities indicates a similar tendency as for the Movie type, with 76% incompleteness in the first order for the DBpedia/Freebase pair. While Actor entities always agree on the names and very often on the birth dates (which suggests that the existing interlinking of Actor entities was established using these properties), they frequently disagree on the birthPlace property. This is because the values of the birthPlace property in DBpedia are much more detailed than those in Freebase. DBpedia typically includes a country name in an address, whereas Freebase does not. For example, the place of birth of the person “Len Wiseman” in DBpedia is “Fremont, California, United States”, while in Freebase it is “Fremont, California”. As a result, we observe an increased incompleteness in the birthPlace property. Interestingly, the deathPlace property is much less incomplete, as most actors listed in these databases are still alive (we regard null values as incomparable).
3 Conclusions and Outlook
In this paper we presented measures to automatically evaluate relative incompleteness in linked datasets and applied these measures in a case study. From the experiment performed on three linked datasets in the movie domain we can conclude that incompleteness is a very common phenomenon in these datasets, and that its extent increases significantly with increasing order, i.e., with the size of the investigated entity neighbourhood. The main causes of relative incompleteness observed during our experiment are different interpretations of properties in the datasets. Our method of classification and identification of quality issues provides not only insights into the level of agreement between datasets but also into the overall quality of the datasets. In future work we intend to extend these approaches to infer knowledge about the correctness and agreement of schemas.
Acknowledgments
This work was partially funded by NSFC Project No. 61272138, the ERC under ALEXANDRIA (ERC 339233), the COST Action IC1302 (KEYSTONE) and the 973 Program Project of China (Grant No. 2012CB316205).
References
1. X. L. Dong and F. Naumann. Data fusion: resolving data conflicts for integration.
Proc. VLDB Endow., 2(2):1654–1655, 2009.
2. H. Halpin, P. J. Hayes, J. P. McCusker, D. L. McGuinness, and H. S. Thompson. When owl:sameAs isn't the same: An analysis of identity in linked data. In Proc. of the 9th International Semantic Web Conference, ISWC 2010, Shanghai, 2010.
3. T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space
(1st edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1,
1-136. Morgan & Claypool, 2011.
4. P. N. Mendes, H. Mühleisen, and C. Bizer. Sieve: linked data quality assessment
and fusion. In Proc. of the 2012 Joint EDBT/ICDT Workshops, Berlin, 2012.
A Semantic Metadata Generator for Web Pages
Based on Keyphrase Extraction
Dario De Nart, Carlo Tasso, Dante Degl’Innocenti
Artificial Intelligence Lab
Department of Mathematics and Computer Science
University of Udine, Italy
{dario.denart,carlo.tasso}@uniud.it, dante.deglinnocenti@spes.uniud.it
Abstract. The annotation of documents and web pages with semantic metadata is an activity that can greatly increase the accuracy of Information Retrieval and Personalization systems, but the growing amount of available text data is too large for an extensive manual process. On the other hand, automatic keyphrase generation and wikification can significantly support this activity. In this demonstration we present a system that automatically extracts keyphrases, identifies candidate DBpedia entities, and returns as output a set of RDF triples compliant with the Opengraph and Schema.org vocabularies.
1 Introduction
In the last few years we have witnessed the rapid growth of the Semantic Web and all its related technologies, in particular the ones that allow the embedding of semantic data inside the HTML markup of Web pages, such as RDFa. Recent studies highlight that a significant part of the most visited pages on the Web is annotated with semantic data, and this number is expected to grow in the near future. However, up to now, the majority of such metadata is manually authored and maintained by the owners of the pages, especially for pages with textual content (such as articles and blog posts). Keyphrase Extraction (herein KPE) and Wikification can greatly ease this task by automatically identifying relevant concepts in the text and Wikipedia/DBpedia entities to be linked. In this demonstration we propose a system for semantic metadata generation based on a knowledge-based KPE and Wikification phase and a subsequent rule-based translation of the extracted knowledge into RDF (a live demo of the system can be found at http://goo.gl/beKJu5 and can be accessed by logging in as user “guest” with password “guest”). The generated metadata adhere to the Opengraph and Schema.org vocabularies, which are currently, according to a recent study [2], widespread on the Web.
2 Related Work
Several authors in the literature have already addressed the problem of extracting keyphrases (herein KPs) from natural language documents, and a wide range of
approaches have been proposed. The authors of [11] identify four types of KPE
strategies:
– Simple Statistical Approaches: mostly unsupervised techniques, considering word frequency, TF-IDF or word co-occurrence [8].
– Linguistic Approaches: techniques relying on linguistic knowledge to identify
KPs. Proposed methods include lexical analysis [1], syntactic analysis [4],
and discourse analysis [6].
– Machine Learning Approaches: techniques based on machine learning algo-
rithms such as Naive Bayes classifiers and SVM. Systems such as KEA [10],
LAKE [3], and GenEx [9] belong to this category.
– Other Approaches: other strategies exist which do not fit into one of the
above categories, mostly hybrid approaches combining two or more of the
above techniques. Among others, heuristic approaches based on knowledge-
based criteria [7] have been proposed.
Automatic semantic data generation from natural language text has already been investigated as well, and several knowledge extraction systems already exist [5], such as OpenCalais (http://www.opencalais.com/), AIDA (www.mpi-inf.mpg.de/yago-naga/aida/), Apache Stanbol (https://stanbol.apache.org/), and NERD (http://nerd.eurecom.fr/).
3 System Overview
The proposed system includes three main modules: a Domain Independent KPE module (herein DIKPE), a KP Inference module (KPIM), and an RDF Triple Builder (RTB). Our KPE technique exploits a knowledge-based strategy. After a candidate KP generation stage, candidate KPs are selected according to various features, including statistical ones (such as word frequency), linguistic ones (part-of-speech analysis), meta-knowledge-based ones (life span in the text, first and last occurrence, and presence of specific tags), and external-knowledge-based ones (existence of a match with a DBpedia entity). Such features correspond to the different kinds of knowledge involved in the process of recognizing relevant entities in a text. Most of these features are language-independent, and the modular architecture of DIKPE allows an easy substitution of language-dependent components, making our framework language-independent. Currently, English and Italian are supported.
The result of this KPE phase is a set of relevant KPs including DBpedia matches, hence providing a partial wikification of the text. Such knowledge is used by the KPIM for a further step of KP generation, in which a new set of potentially relevant KPs not included in the text is inferred by exploiting the link structure of DBpedia. Properties such as type and subject are considered in order to discover concepts possibly related to the text. Finally, the extracted and the inferred KPs are used by the RTB to build a set of Opengraph and Schema.org triples. Due to the simplicity of the adopted vocabularies, this task is performed in a rule-based way. The RDF fragment to be generated is, in fact, treated by the RTB as a template to be filled according to the data provided by the DIKPE and the KPIM.
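As an illustration of such a rule-based builder, the sketch below uses rdflib to fill a fixed template; the choice of properties (og:title, schema:keywords, schema:about) and the input structure are our assumptions for the example, not the actual RTB templates.

from rdflib import Graph, Literal, Namespace, URIRef

SCHEMA = Namespace("http://schema.org/")
OG = Namespace("http://ogp.me/ns#")

def build_triples(page_url, title, keyphrases):
    # keyphrases: list of (text, optional DBpedia URI) pairs from DIKPE/KPIM
    g = Graph()
    g.bind("schema", SCHEMA)
    g.bind("og", OG)
    page = URIRef(page_url)
    g.add((page, OG.title, Literal(title)))
    for text, dbpedia_uri in keyphrases:
        g.add((page, SCHEMA.keywords, Literal(text)))
        if dbpedia_uri:                 # wikified KPs also become entity links
            g.add((page, SCHEMA.about, URIRef(dbpedia_uri)))
    return g

g = build_triples("http://example.org/post", "Intro to keyphrase extraction",
                  [("keyphrase extraction", "http://dbpedia.org/resource/Keyword_extraction"),
                   ("semantic web", "http://dbpedia.org/resource/Semantic_Web")])
print(g.serialize(format="turtle"))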
4 Evaluation and Conclusions
In order to support and validate our approach, several experiments have been performed. Due to the early stage of development of the system, and since KP generation is the critical component of the system, testing efforts were focused on assessing the quality of the generated KPs. The DIKPE module was benchmarked against the KEA algorithm on a set of 215 English documents labelled with keyphrases generated by the authors and by additional experts. For each document, the KP sets returned by the two compared systems were matched against the set of human-generated KPs. Each time a machine-generated KP matched a human-generated KP, it was considered a correct KP; the number of correct KPs generated for each document was then averaged over the whole data set. Various machine-generated KP set sizes were tested. As shown in Table 1, the DIKPE system significantly outperformed the KEA baseline.
Table 1. Performance of DIKPE compared to KEA.

Extracted Keyphrases   Average number of correct KPs
                       KEA     DIKPE
7                      2.05    3.86
15                     2.95    5.29
20                     3.08    5.92
A user evaluation of the perceived quality of the generated KPs was also performed: a set of 50 articles was annotated, and a pool of experts of various ages and genders was asked to assess the quality of the generated metadata. Table 2 shows the results of this user evaluation.
Table 2. User evaluation of generated keyphrases.

Evaluation     Frequency
Good           56.28%
Too Generic    14.72%
Too Specific    2.27%
Incomplete      9.85%
Not Relevant    9.85%
Meaningless     7.03%
Evaluation is, however, still ongoing: an extensive benchmark against more complex Knowledge Extraction systems is planned, as well as further enhancements such as the inclusion of more complex vocabularies and integration with the Apache Stanbol framework.
References
1. Barker, K., Cornacchia, N.: Using noun phrase heads to extract document
keyphrases. In: Advances in Artificial Intelligence, pp. 40–52. Springer (2000)
2. Bizer, C., Eckert, K., Meusel, R., Mühleisen, H., Schuhmacher, M., Völker, J.: Deployment of RDFa, Microdata, and Microformats on the Web – a quantitative analysis. In: The Semantic Web – ISWC 2013, pp. 17–32. Springer (2013)
3. D'Avanzo, E., Magnini, B., Vallin, A.: Keyphrase extraction for summarization purposes: The LAKE system at DUC-2004. In: Proceedings of the 2004 Document Understanding Conference (2004)
4. Fagan, J.: Automatic phrase indexing for document retrieval. In: Proceedings of the
10th Annual International ACM SIGIR Conference on Research and Development
in Information Retrieval. pp. 91–101. SIGIR ’87, ACM, New York, NY, USA (1987),
http://doi.acm.org/10.1145/42005.42016
5. Gangemi, A.: A comparison of knowledge extraction tools for the semantic web.
In: The Semantic Web: Semantics and Big Data, pp. 351–366. Springer (2013)
6. Krapivin, M., Marchese, M., Yadrantsau, A., Liang, Y.: Unsupervised key-phrases
extraction from scientific papers using domain and linguistic knowledge. In: Digital
Information Management, 2008. ICDIM 2008. Third International Conference on.
pp. 105–112 (Nov 2008)
7. Liu, Z., Li, P., Zheng, Y., Sun, M.: Clustering to find exemplar terms for
keyphrase extraction. In: Proceedings of the 2009 Conference on Empirical Meth-
ods in Natural Language Processing: Volume 1 - Volume 1. pp. 257–266. EMNLP
’09, Association for Computational Linguistics, Stroudsburg, PA, USA (2009),
http://dl.acm.org/citation.cfm?id=1699510.1699544
8. Matsuo, Y., Ishizuka, M.: Keyword extraction from a single document using word
co-occurrence statistical information. International Journal on Artificial Intelli-
gence Tools 13(01), 157–169 (2004)
9. Turney, P.D.: Learning algorithms for keyphrase extraction. Information Retrieval
2(4), 303–336 (2000)
10. Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C., Nevill-Manning, C.G.: KEA: Practical automatic keyphrase extraction. In: Proceedings of the Fourth ACM Conference on Digital Libraries. pp. 254–255. ACM (1999)
11. Zhang, C.: Automatic keyword extraction from documents using conditional ran-
dom fields. Journal of Computational Information Systems 4(3), 1169–1180 (2008),
http://eprints.rclis.org/handle/10760/12305
A Multilingual SPARQL-Based Retrieval
Interface for Cultural Heritage Objects
Mariana Damova1, Dana Dannélls2, and Ramona Enache3
1 Mozaika, Bulgaria, mariana.damova@mozajka.co
2 Språkbanken, University of Gothenburg, dana.dannells@svenska.gu.se
3 Department of Computer Science and Engineering, University of Gothenburg, ramona.enache@cse.gu.se
1 Introduction
In this paper we present a multilingual SPARQL-based [1] retrieval interface for querying cultural heritage data in natural language (NL). The presented system offers an elegant grammar-based approach built on Grammatical Framework (GF) [2], a grammar formalism supporting multilingual applications. Using GF, we are able to present a cross-language SPARQL grammar covering 15 languages and a cross-language retrieval interface that uses this grammar for interacting with the Semantic Web (a demo is available at http://museum.ontotext.com/). To our knowledge, this is the first implementation of SPARQL generation and parsing via GF that is published as a knowledge representation infrastructure-based prototype.
Querying the Semantic Web in natural language, more specifically using English to formulate SPARQL queries with the help of controlled natural language (CNL) syntax, has been explored before [3,4]. Such approaches, based on verbalization methods, are adequate for English, but in a multilingual setting, where major challenges such as lexical and structural gaps become prominent [5], grammar-based approaches are preferable. The work presented here complements the method proposed by Lopez et al. [6] in that it faces the challenges of realizing NL in real-world systems, not only in English, but also in multiple languages.
2 An Interface for Multilingual Queries
Our system follows the approach of the Museum Reason-able View (MRV) of Linked Open Data (LOD) [7]. It provides unified access to cultural heritage sources, including LOD from DBpedia (http://dbpedia.org), among other sources.
Fig. 1. Demo of the natural language query “show all paintings that are by Leonardo
da Vinci” in Italian.
The query grammar for this data covers nine central classes (title, painter, type, colour, size, year, material, museum, place) and the major properties describing the relationships between them (hasCreationDate, fromTimePeriodValue, toTimePeriodValue, hasMaterial, hasTitle, hasDimension, hasCurrentLocation, hasColour). The set of SPARQL queries we cover includes the five classic WH questions: who, where, when, how, what. Table 1 shows some NL queries and their mappings to query variables in SPARQL.
NL Query                                                SPARQL
Where is Mona Lisa located?                             :hasCurrentLocation ?location
What are the colours of Mona Lisa?                      :hasColour ?colour
Who painted Mona Lisa?                                  :createdBy ?painter
When was Mona Lisa painted?                             :hasCreationDate ?crdat
How many paintings were painted by Leonardo da Vinci?   (count(distinct ?painting) as ?count)
Table 1. Queries and query variables
The NL-to-SPARQL mapping is implemented as a transformation table, which could be extended to cover larger syntactic question variations.
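The Python fragment below sketches what such a transformation table could look like; the patterns and property names follow Table 1, while the dictionary layout and the to_sparql helper are simplifying assumptions, not the grammar's internal representation.

# Hypothetical NL-pattern -> SPARQL-fragment lookup, in the spirit of Table 1.
TRANSFORMATION_TABLE = {
    "where is {painting} located":        "?painting :hasCurrentLocation ?location",
    "what are the colours of {painting}": "?painting :hasColour ?colour",
    "who painted {painting}":             "?painting :createdBy ?painter",
    "when was {painting} painted":        "?painting :hasCreationDate ?crdat",
}

def to_sparql(nl_pattern, painting_label):
    body = TRANSFORMATION_TABLE[nl_pattern]
    return 'SELECT * WHERE { ?painting :hasTitle "%s" . %s }' % (painting_label, body)

print(to_sparql("who painted {painting}", "Mona Lisa"))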
The grammar has a modular structure with three main components: (1) lexicon modules covering ontology classes and properties; (2) a data module covering ontology instances; and (3) a query module covering NL questions and SPARQL query patterns. It supports NL queries in 15 languages: Bulgarian, Finnish, Norwegian, Catalan, French, Romanian, Danish, Hebrew, Russian, Dutch, Italian, Spanish, English, German and Swedish. The system relies on GF grammars, treating SPARQL as yet another language. In the same manner as NL generation, SPARQL patterns are encoded as grammar rules. Because of this compact representation within the same grammar, we can achieve parallel translations between any pair of the 15 languages and SPARQL.
The grammar-based interface provides a mechanism to formulate a query in any of the 15 languages, translate it to SPARQL, and view the answers in any of those languages. The answers can be displayed as natural language descriptions or as triples. The latter can then be navigated as linked data. The browsing of the triples can be carried out continuously: by clicking on one of the triples listed in the answers, a new SPARQL query is launched and the results are generated as natural language text via the same grammar-based interface, or as triples.
Fig. 2. Example of the query “who painted Guernica?” in 15 languages and in
SPARQL.
3 Evaluation
Following previous question answering over linked data (QALD) evaluation challenges [5], we divided the evaluation into three parts, each focusing on a specific aspect: (1) user satisfaction, i.e., how many queries were answered; (2) correctness; and (3) coverage, i.e., how the system scales up.
For the first two parts of the evaluation, we considered a number of random queries in 7 languages and counted the number of corrections that 1-2 native informants would make to the original queries. The results of the evaluation showed that the number of suggested corrections is relatively low for the majority of the evaluated languages. The overall correctness of the generated queries seems to be representative and acceptable, at least among the users who participated in the evaluation.
Regarding coverage, the grammar allows for paraphrasing most of the question patterns, which amounts, on average, to 3 paraphrases per construction in the English grammar. The number of alternatives varies across languages, but
the average across languages ranges between 2 and 3 paraphrases per construction. In addition, the 112 basic query patterns from the query grammar can be combined with logical operators in order to obtain more complex queries, which amounts to 1159 query patterns covered by the grammar, including WH and Yes/No questions. The additions needed for the query grammar to scale up are small, given that the other resources are in place. Moreover, building the query grammar for a given language requires no more than 150 lines of code, and this process can be done semi-automatically.
4 Conclusions
We introduce a novel approach to multilingual interaction with Semantic Web content via GF grammars. The method has been successfully demonstrated for the cultural heritage domain and could subsequently be applied to other domains or scaled up in terms of languages or content coverage. The main contribution with respect to current state-of-the-art approaches is SPARQL support and question answering in 15 languages.
Acknowledgment
This work was supported by the MOLTO European Union Seventh Framework Programme project (FP7/2007-2013) under grant agreement FP7-ICT-247914. The authors would like to acknowledge the Centre for Language Technology in Gothenburg.
References
1. Harris, S., Seaborne, A.: SPARQL 1.1 Query Language. W3C Recommendation (March 2013)
2. Ranta, A.: Grammatical Framework: Programming with Multilingual Grammars.
CSLI Studies in Computational Linguistics. CSLI, Stanford (2011)
3. Ferré, S.: SQUALL: A controlled natural language for querying and updating RDF
graphs. In: CNL. (2012) 11–25
4. Ngonga Ngomo, A.C., Bühmann, L., Unger, C., Lehmann, J., Gerber., D.: Sorry,
I don’t speak SPARQL — translating SPARQL queries into natural language. In:
Proceedings of WWW. (2013)
5. Walter, S., Unger, C., Cimiano, P., Bär, D.: Evaluation of a Layered Approach to
Question Answering over Linked Data. In: International Semantic Web Conference
(2). (2012) 362–374
6. Lopez, V., Fernández, M., Motta, E., Stieler, N.: Poweraqua: Supporting users in
querying and exploring the semantic web. Semantic Web 3(3) (2012) 249–265
7. Damova, M., Dannélls, D.: Reason-able View of Linked Data for cultural heritage.
In: Proceedings of the third International Conference on Software, Services and
Semantic Technologies (S3T). (2011)
Extending Tagging Ontologies with
Domain Specific Knowledge
Frederic Font1, Sergio Oramas1, György Fazekas2, and Xavier Serra1
1 Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
2 Centre for Digital Music, Queen Mary University of London, London, UK
{name.surname}@upf.edu, gyorgy.fazekas@eecs.qmul.ac.uk
Abstract. Currently proposed tagging ontologies are mostly focused on
the definition of a common schema for representing the agents involved in
a tagging process. In this paper we describe preliminary research around
the idea of extending tagging ontologies by incorporating some domain
specific class definitions and relations. We illustrate our idea with a par-
ticular use case where a tag recommendation system is driven by such
an ontology. Besides our use case, we believe that such extended tagging
ontologies can bring more meaningful structure into folksonomies and
improve browsing and organisation functionalities of online platforms
relying on tagging systems.
Keywords: Tagging ontology, Tag recommendation, Folksonomy, Freesound
1 Introduction
Tagging systems are extensively used in online sharing sites as a means of content browsing and organisation. In general, tagging systems allow users to annotate resources with free-form textual labels chosen by the users of the system. The resulting set of associations between tags, users and resources that arises in tagging systems is known as a folksonomy. Folksonomies suffer from a number of well-known issues, including tag scarcity, ambiguities with synonymy and polysemy, typographical errors, the use of user-specific naming conventions, or even the use of different languages [1]. Despite these issues, folksonomies have succeeded in providing basic organisation and browsing functionalities to online sharing sites. However, their unstructured nature makes it difficult to support more advanced capabilities such as hierarchical browsing or faceted searching.
In order to bring some structure to folksonomies, some studies have focused on the analysis of folksonomies to automatically derive structured or semi-structured representations of the knowledge of the domain, typically in the form of lightweight ontologies or hierarchical taxonomies [2–4]. However, these methods still tend to require a significant amount of manual effort to provide meaningful representations. Other studies have proposed modelling folksonomies and the tagging process using ontologies [5]. These ontologies are focused on defining a common schema for the agents involved in a tagging process. Current tagging ontologies may enhance interoperability between folksonomies, but do not generally provide ways of structuring a folksonomy with domain-specific knowledge.
In this paper, we present preliminary research on extending a tagging ontology with the possibility of representing the semantics of a specific domain. The generic idea is presented and discussed in Sec. 2. In Sec. 3 we describe a practical application in a real-world tagging system, where the tagging ontology is used to drive a tag recommendation system. Finally, in Sec. 4, we discuss possible future directions.
2 Extending a tagging ontology
Our starting point for the extension of the tagging ontology is the Modular
Unified Tagging Ontology (MUTO) [5]. In the core of the MUTO ontology, the
muto:Tagging class is defined which supports several relations to indicate, among
others, a resource that is tagged (muto:hasResource of type rdfs:Resource), the
tag assigned to the resource (muto:hasTag of type muto:Tag), and the user that
made the tag assignment (muto:hasCreator of type sioc:UserAccount).
We propose to extend the tagging ontology in two ways. First, we add a number of subclasses to the muto:Tag class which can be used instead of muto:Tag (right side of Fig. 1). These subclasses represent different tag categories (i.e. with a narrower scope than the generic muto:Tag class), similarly to the idea of TagSet introduced in the SCOT ontology [6], but in a semantic sense. A tag category represents a broad concept that groups a set of tags that share some semantic characteristics related to the specific domain. The same principle is applied
to resources, and a number of rdfs:Resource subclasses are defined (left side
of Fig. 1). Resource subclasses (or resource categories) are used to organise re-
sources into groups with a narrower scope than the general rdfs:Resource class.
The particular definition of tag and resource categories would depend on the
particular application domain of the extended tagging ontology (an example is
given below). Also, in the diagram of Fig. 1, both tag and resource subclasses
are only shown as a flat hierarchy, but more complex class structures could be
explored. Moreover, existing domain ontologies and taxonomies may be reused
to extend the tagging ontology.
Second, we propose to extend the tagging ontology by adding object prop-
erties to model semantic relations among tag categories and resource categories
(dashed lines in Fig. 1). These object properties are useful to, for example, model
dependencies between categories of tags and resources. The specific meaning of
these semantic relations would also depend on the particular application domain
of the extended tagging ontology. In addition to semantic relations between tag
and resource categories, and given that the muto:Tag class inherits from the Sim-
ple Knowledge Organization System (SKOS) [7] class skos:Concept, semantic
relations between tag individuals can be also modelled [5].
Fig. 1. Diagram of the extended parts of the tagging ontology.

3 Use case: tag recommendation in Freesound

We applied an extended tagging ontology as described above in a tag recommendation task in the context of Freesound, an online collaborative sound database with more than 200,000 uploaded sounds and 3.8 million registered users [8]. In
previous work by the authors, a tag recommendation system was proposed which,
given a set of input tags, is able to suggest other potentially relevant tags [9].
The system is based on the construction of five tag-tag similarity matrices tai-
lored to five manually defined and rather generic audio categories (e.g. “Music”,
“Effects”, etc.). The recommendation system uses a classifier to automatically
predict one of these five categories depending on the input tags, and then uses
the corresponding tag-tag similarity matrix for the recommendation process.
To improve that recommendation system, we used the extended tagging ontology to model the folksonomy and include some domain-specific knowledge. On the one hand, we extended the tagging ontology by adding 5 resource subclasses corresponding to the 5 sound categories mentioned above (e.g. :EffectsSound). Moreover, we defined 26 tag subclasses that are intended to group the tags into categories according to the type of information they describe about sounds (i.e. grouped into audio properties). These include categories like “instrument”, “microphone”, “chord”, “material”, or “action” (e.g. :InstrumentTag). On the other hand, we extended the ontology by defining several object properties that relate resource and tag categories. These object properties indicate that a particular tag category is relevant for one or more resource categories. For example, :InstrumentTag is relevant for the :MusicSound audio category, and this is indicated with a :hasInstrument object property that relates the instrument tag category with the music resource category. Furthermore, we populated the extended ontology by manually classifying the 500 most used tags in Freesound into one of the 26 defined tag categories and adding these tags as individuals (instances) of the corresponding tag category. This last step was necessary to bootstrap the tag recommendation system (see below).
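A hedged rdflib sketch of this modelling pattern is shown below; the example namespace is a placeholder, and the class, property and individual names simply mirror the examples in the text rather than the ontology actually deployed in Freesound.

from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

MUTO = Namespace("http://purl.org/muto/core#")
EX = Namespace("http://example.org/freesound-tagging#")   # placeholder namespace

g = Graph()
# Resource and tag categories as subclasses of rdfs:Resource and muto:Tag.
g.add((EX.MusicSound, RDFS.subClassOf, RDFS.Resource))
g.add((EX.InstrumentTag, RDFS.subClassOf, MUTO.Tag))
g.add((EX.TempoTag, RDFS.subClassOf, MUTO.Tag))
# Object property stating which tag category is relevant for which resource category.
g.add((EX.hasInstrument, RDF.type, RDF.Property))
g.add((EX.hasInstrument, RDFS.domain, EX.MusicSound))
g.add((EX.hasInstrument, RDFS.range, EX.InstrumentTag))
# Tag individuals used to bootstrap the recommender.
g.add((EX.violin, RDF.type, EX.InstrumentTag))
g.add((EX["120bpm"], RDF.type, EX.TempoTag))
print(g.serialize(format="turtle"))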
Using this ontology, we can extend the tag recommendation system in such a way that, given the audio category detected by the classifier and the object properties that relate resource and tag categories, we can guide the annotation process by suggesting tag categories that are relevant for a particular sound. For example, for a sound belonging to the resource category :MusicSound, we can suggest tag categories like :InstrumentTag or :TempoTag, which are particularly relevant for musical sounds. Once tag categories are suggested, users can click on them and get a list of tag recommendations for every category. This list is obtained by computing the intersection of the tags provided by the aforementioned recommendation system (based on the tag-tag similarity matrix) with those that have been manually introduced in the ontology as tag instances of the selected tag category. See Fig. 2 for a screenshot of a prototype interface for this system.
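The intersection step itself is simple; the following Python lines are a sketch under assumed data structures (the tag categories, their instances and the similarity-based suggestions are illustrative).

TAG_INSTANCES = {                       # manually populated tag categories
    "InstrumentTag": {"violin", "piano", "guitar", "drums"},
    "TempoTag": {"120bpm", "90bpm", "fast", "slow"},
}
RELEVANT_CATEGORIES = {                 # derived from the object properties
    "MusicSound": ["InstrumentTag", "TempoTag"],
}

def recommend(audio_category, similar_tags):
    # For each tag category relevant to the predicted audio category, keep only
    # the similarity-based suggestions that are known instances of that category.
    return {cat: sorted(TAG_INSTANCES[cat] & set(similar_tags))
            for cat in RELEVANT_CATEGORIES.get(audio_category, [])}

print(recommend("MusicSound", ["violin", "120bpm", "loop", "guitar"]))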
Fig. 2. Screenshot of the interface of a prototype tag recommendation system driven
by the extended tagging ontology.
4 Conclusions
In this paper we have presented preliminary research on extending current tagging ontologies with structured knowledge specific to the application domain of a tagging system. By incorporating domain-specific knowledge in tagging ontologies, we expect to be able to bring some semantically meaningful structure into folksonomies. We have illustrated the idea with a use case in the context of an audio clip sharing site where a tag recommendation system is driven by an extended tagging ontology. A formal evaluation of the ontology-driven tag recommendation system is planned for future work. Besides the described use case, we think that using extended tagging ontologies can improve other aspects of online platforms relying on tagging systems, such as browsing and organisation functionalities. The main limitation for such improvements is the population of the ontology. In our use case, we use a manually populated ontology to bootstrap the recommender, but the tagging system could further populate the ontology by learning new “tag individual - tag category” relations when users annotate new sounds. Furthermore, other knowledge extraction techniques could be used to automatically populate the ontology with information coming from other user-generated data (which, in our case, could be sound comments or textual descriptions), and even from external data sources such as linked open data.
References
1. H. Halpin, V. Robu, and H. Shepard, “The dynamics and semantics of collaborative tagging,”
in Proceedings of the 1st Semantic Authoring and Annotation Workshop, pp. 1–21, 2006.
2. P. Mika, “Ontologies are us: A unified model of social networks and semantics,” Web Semantics:
Science, Services and Agents on the World Wide Web, vol. 5, pp. 5–15, Mar. 2007.
3. P. Heymann and H. Garcia-Molina, “Collaborative Creation of Communal Hierarchical Tax-
onomies in Social Tagging Systems,” tech. rep., 2006.
4. F. Limpens, F. L. Gandon, and M. Buffa, “Linking folksonomies and ontologies for supporting knowledge sharing: a state of the art,” 2009.
5. S. Lohmann, P. Dı́az, and I. Aedo, “MUTO: the modular unified tagging ontology,” Proceedings
of the 7th International Conference on Semantic Systems - I-Semantics ’11, pp. 95–104, 2011.
6. H. L. Kim, S. Scerri, J. G. Breslin, S. Decker, and H. G. Kim, “The State of the Art in Tag
Ontologies : A Semantic Model for Tagging and Folksonomies,” pp. 128–137, 2008.
7. SKOS: Simple Knowledge Organization System. http://www.w3.org/TR/skos-reference.
8. F. Font, G. Roma, and X. Serra, “Freesound Technical Demo,” in Proceedings of the 21st ACM
Conference on Multimedia (ACM MM 13), pp. 411–412, 2013.
9. F. Font, J. Serrà, and X. Serra, “Class-based tag recommendation and user-based eval-
uation in online audio clip sharing,” Journal on Knowledge Based Systems, 2014,
10.1016/j.knosys.2014.06.003.
Disambiguating Web Tables using Partial Data
Ziqi Zhang
Department of Computer Science, University of Sheffield, UK
z.zhang@dcs.shef.ac.uk
Abstract. This work addresses disambiguating Web tables, i.e., annotating content cells with named entities and table columns with semantic type information. Contrary to the state of the art, which builds features based on the entire table content, this work uses a method that starts by annotating table columns using automatically selected partial data (i.e., a sample), and then uses the type information to guide content cell disambiguation. Different sample selection methods are introduced and tested, showing that they contribute to higher accuracy in cell disambiguation and comparable accuracy in column type annotation with reduced computation.
1 Introduction
Enabling machines to effectively and efficiently access the increasing amount of tab-
ular data on the Web remains a major challenge to the Semantic Web, as the classic
indexing, search and NLP techniques fail to address the underlying semantics carried
by tabular structures [1, 2]. This has sparked increasing interest in research on seman-
tic Table Interpretation, which deals with semantically annotating tabular data such as
shown in Figure 1. This work focuses specifically on annotating table columns that
contain named entity mentions with semantic type information (column classification),
and linking content cells in these columns with named entities from knowledge bases
(cell disambiguation). Existing work follows a typical workflow involving 1) retrieving
candidates (e.g., named entities, concepts) from the knowledge base, 2) constructing
features of candidates, and 3) applying inference to choose the best candidates. One
key limitation is that they adopt an exhaustive strategy to build the candidate space for
inference. In particular, annotating table columns depends on candidate entities from
all cells in the column [1, 2]. However, for human cognition this is unnecessary. For example, one does not need to read the entire table shown in Figure 1 (which may contain over a hundred rows) to label the three columns. Being able to make such inferences using partial (as opposed to the entire table) or sample data can improve the efficiency of the task, as the first two phases can cost up to 99% of computation time [1].
Sample-driven Table Interpretation opens up several challenges. The first is defining a sample with respect to each task. The second is determining the optimal size of the sample with respect to varying sizes of tables. The third is choosing the optimal sample entries, since a skewed sample may damage accuracy. Our previous work [5] proposed TableMiner to address the first two challenges. This work adapts TableMiner to explore the third challenge. A number of sample selection techniques are introduced, and experiments show that they can further improve cell disambiguation accuracy and, in the column type annotation task, contribute to a reduction in computation with comparable learning accuracy.
2 Related Work
An increasing amount of work has been carried out on semantic Table Interpretation, such as Venetis et al. [3], who use a maximum likelihood model, Limaye et al. [1], who use a joint inference model, and Mulwad et al. [2], who use joint inference with semantic message passing. These methods differ in terms of the inference models, features and background knowledge bases used. All these methods are, as discussed earlier, ‘exhaustive’, as they require features built based on all content cells in order to annotate table columns. Zwicklbauer et al. [6] is the first method that annotates a table column using a sample of the column. However, the sample is arbitrarily chosen.

Fig. 1. Lakes in Central Greece
3 Methodology
TableMiner was previously described in [5]. It disambiguates named entity columns in a table in two phases. The first phase creates preliminary annotations by using a sample of a column to classify the column in an iterative, incremental algorithm shown in Algorithm 1. In each iteration, a content cell Ti,j drawn from a column Tj is disambiguated (output Ei,j). Then the concepts associated with the winning entity are gathered to create a set of candidate concepts for the column, Cj. Candidate concepts are scored, and their scores can change at each iteration as newly disambiguated content cells add reinforcing evidence. At the end of each iteration, the Cj from the current iteration is compared with the previous one. If the scores of candidate concepts are little changed (convergence; see [5] for a detection method), then column classification is considered stable and the highest scoring candidates (Cj+) are chosen to annotate the column. The second phase begins by disambiguating the remaining cells (part I), this time using the type information for the column to limit the candidate entity space to those entities belonging to the type only. This may revise Cj for the column, either adding new elements, or resetting scores of existing ones and possibly causing the winning concept for the column to change. In this case, the next part of the second phase (part II) repeats the disambiguation and classification operations on the entire column, while using the new Cj+ as constraints to restrict the candidate entity space. This procedure repeats until Cj+ and the winning entity in each cell stabilize (i.e., no change).
Modified TableMiner. For the purpose of this study, TableMiner is modified (TMmod) to contain only the first phase and part I of the second phase. In other words, we do not revise the column classification results obtained from the sample data. Therefore TMmod may use only a fraction of a column's data to classify the column, which reduces computation overhead compared to classic ‘exhaustive’ methods.
Sample selection. The choice of the sample can affect learning in TMmod in two ways. While the size of the sample is dealt with by the convergence measure described in [5], here we address the issue of selecting suitable sample entries to ensure learning accuracy. Since column classification depends on the disambiguated cells in the sample, we hypothesize that high accuracy of cell disambiguation contributes to high accuracy in column classification. We further hypothesize that higher accuracy of content cell disambiguation can be achieved by 1) richer feature representation, and 2) less ambiguous names (i.e., names used by only one or very few named entities). Therefore, we propose three methods to compute a score for each content cell in a column, and rank the cells by this score before running Algorithm 1 (i.e., the input Tj will contain content cells whose order is re-arranged based on the scores).

Algorithm 1 Sample-based classification
1: Input: Tj, Cj
2: for all cells Ti,j in Tj do
3:   prevCj ← Cj
4:   Ei,j ← disambiguate(Ti,j)
5:   Cj ← updateclass(Cj, Ei,j)
6:   if convergence(Cj, prevCj) then
7:     break
8:   end if
9: end for
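For readability, the loop of Algorithm 1 can be rendered in Python roughly as follows; disambiguate, update_class and converged are placeholders for TableMiner's actual components described in [5], so this is an illustration of the control flow only.

def classify_column_with_sample(column_cells, disambiguate, update_class, converged):
    # Incrementally classify a column from a (possibly re-ordered) list of cells,
    # stopping as soon as the candidate-concept scores stabilize.
    candidate_concepts = {}                  # Cj: concept -> score
    winning_entities = []
    for cell in column_cells:                # cells may be pre-ranked by ospd/fs/nl
        previous = dict(candidate_concepts)  # prevCj
        entity = disambiguate(cell)          # Ei,j
        winning_entities.append(entity)
        candidate_concepts = update_class(candidate_concepts, entity)
        if converged(candidate_concepts, previous):
            break                            # the sample seen so far is enough
    return candidate_concepts, winning_entities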
One-sense-per-discourse (ospd) First and foremost, we make the hypothesis of ‘one sense per discourse’ in the table context: if an NE-column is not the subject column of the table (e.g., the first column in Figure 1 is a subject column), then cells containing the same text content are extremely likely to express the same meaning (due to space limitations, details are omitted but can be found in [3, 4]). Thus, to apply ospd we firstly re-arrange the cells in a column by putting those containing duplicate text content adjacent to each other. Next, when disambiguating a content cell, the feature representation of the cell concatenates the row context of the cell and that of any adjacent cells with the same text content (e.g., in Figure 1 we assume ‘Aetolia-Acarnania’ on the three rows to have the same meaning, and build a single feature representation by concatenating all three rows). Effectively, this creates a richer feature representation for cells whose content re-occurs across a table.
Feature size (fs) With fs, we firstly apply ospd, then rank the cells in a column by the size of their feature representation, as determined by the number of tokens in a bag-of-words representation. This allows TMmod to start with the cells that potentially have the largest, and hence ‘richest’, feature representation in Algorithm 1.
Name length (nl) With nl, we count the number of words in the cell text content to be disambiguated and rank the cells by this number, the name length (e.g., in Figure 1, ‘Aetolia-Acarnania’ has two words and will be disambiguated before ‘Boeotia’). nl relies merely on the name length of the cell content and does not apply ospd. The idea is that longer names are less likely to be ambiguous.
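The three orderings can be sketched as follows; the assumed cell representation (a text field plus a bag of row-context tokens) is ours, for illustration only.

def order_ospd(cells):
    # Keep the order of first occurrences, but pull cells with duplicate text
    # next to the first cell carrying that text (non-subject NE-columns only).
    groups, first_seen = {}, []
    for cell in cells:
        key = cell["text"]
        if key not in groups:
            groups[key] = []
            first_seen.append(key)
        groups[key].append(cell)
    return [cell for key in first_seen for cell in groups[key]]

def order_fs(cells):
    # ospd grouping, then groups with the richest combined row context first.
    grouped = order_ospd(cells)
    size = {}
    for cell in grouped:
        size[cell["text"]] = size.get(cell["text"], 0) + len(cell["context_tokens"])
    return sorted(grouped, key=lambda c: size[c["text"]], reverse=True)

def order_nl(cells):
    # Longest cell text (in words) first; no ospd grouping.
    return sorted(cells, key=lambda c: len(c["text"].split()), reverse=True)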
4 Evaluation and Conclusion
We evaluate the proposed methods of sample selection using two datasets: LimayeAll and Limaye200 ([4], currently under review). LimayeAll contains over 6000 tables and is used for evaluating content cell disambiguation. Limaye200 contains a subset of 200 tables from LimayeAll with
         Cell disambiguation (LimayeAll)                 Column classification (Limaye200)
         TMmod   TMmod^ospd  TMmod^fs  TMmod^nl          TMmod   TMmod^ospd  TMmod^fs  TMmod^nl
         0.809   0.812       0.812     0.813             0.723   0.719       0.721     0.723
Table 1. Cell disambiguation and column classification accuracy in F1.
columns manually annotated with Freebase concepts, and is used for evaluating column classification. As a baseline, TMmod without any sample selection techniques is used. It simply chooses cells from a column in their original order in Algorithm 1. This is compared against TMmod^ospd, which applies ospd to non-subject NE-columns, preserving the original order but disambiguating groups of cells containing the same text content; TMmod^fs, which applies ospd to non-subject NE-columns and then prioritizes cells that potentially have a richer feature representation; and TMmod^nl, which prioritizes cells containing longer text content. Results on both datasets are shown in Table 1. They suggest that, compared against TMmod, the sample selection techniques can enhance the accuracy of cell disambiguation marginally. In the column classification task, however, they do not add benefits in terms of accuracy. Analyzing the computation overhead in terms of the automatically determined sample size in each table shows that the sample selection techniques reduce the amount of data to be processed in column classification. As an example, TMmod converges on average after processing 58% of the cells in a table column, i.e., it manages to classify a table column using a sample size of 58% of the total number of cells in that column. TMmod^ospd reduces this to 53%, TMmod^fs to 52%, and TMmod^nl leaves it at 58% (unchanged). This may contribute to a noticeable reduction in CPU time, since the construction of the feature space (including querying knowledge bases) for each data unit consumes over 90% of computation time [1]. To summarize, it has been shown that, by using sample selection techniques, it is possible to semantically annotate Web tables in a more efficient way, achieving comparable or even higher learning accuracy depending on the task.
Acknowledgement: Part of this work is carried out in the LODIE project (Linked Open
Data Information Extraction), funded by EPSRC (EP/J019488/1).
References
1. Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities,
types and relationships. Proceedings of the VLDB Endowment 3(1-2), 1338–1347 (2010)
2. Mulwad, V., Finin, T., Joshi, A.: Semantic message passing for generating linked data from
tables. In: International Semantic Web Conference (1). pp. 363–378. Springer (2013)
3. Venetis, P., Halevy, A., Madhavan, J., Paşca, M., Shen, W., Wu, F., Miao, G., Wu, C.: Recov-
ering semantics of tables on the web. Proc. of VLDB Endowment 4(9), 528–538 (Jun 2011)
4. Zhang, Z.: Start small, build complete: Effective and efficient semantic table interpretation using TableMiner. In: The Semantic Web Journal (under review, #668-1878) (2014)
5. Zhang, Z.: Towards efficient and effective semantic table interpretation. In: To appear in:
ISWC2014 (2014)
6. Zwicklbauer, S., Einsiedler, C., Granitzer, M., Seifert, C.: Towards disambiguating web tables.
In: International Semantic Web Conference (Posters & Demos). pp. 205–208 (2013)
On Linking Heterogeneous Dataset Collections
Mayank Kejriwal and Daniel P. Miranker
University of Texas at Austin
{kejriwal,miranker}@cs.utexas.edu
Abstract. Link discovery is the problem of linking entities between
two or more datasets, based on some (possibly unknown) specification.
A blocking scheme is a one-to-many mapping from entities to blocks.
Blocking methods avoid O(n²) comparisons by clustering entities into blocks, and limiting the evaluation of link specifications to entity pairs
within blocks. Current link-discovery blocking methods explicitly assume
that two RDF datasets are provided as input, and need to be linked. In
this paper, we assume instead that two heterogeneous dataset collections,
comprising arbitrary numbers of RDF and tabular datasets, are provided
as input. We show that data model heterogeneity can be addressed by
representing RDF datasets as property tables. We also propose an un-
supervised technique called dataset mapping that maps datasets from
one collection to the other, and is shown to be compatible with existing
clustering methods. Dataset mapping is empirically evaluated on three
real-world test collections ranging over government and constitutional
domains, and shown to improve two established baselines.
Keywords: Heterogeneous Blocking, Instance Matching, Link Discov-
ery
With the advent of Linked Data, discovering links between entities emerged
as an active research area [2]. Given a link specification, a naive approach would
discover links by conducting O(n2 ) comparisons on the set of n entities. In the
Entity Resolution (ER) community, a preprocessing technique called blocking
mitigates full pairwise comparisons by clustering entities into blocks. Only en-
tities within blocks are paired and compared. ER is critical in data integration
systems [1]. In the Semantic Web, the problem has received attention as scalably
discovering owl:sameAs links between RDF datasets [5].
In the Big Data era, scalability and heterogeneity are essential components of systems and hence practical requirements for real-world link discovery. Scalability is addressed by blocking, but current work assumes that the dataset pairs between which entities are to be linked are provided. In other words, datasets A and B are input to the pipeline, and entities in A need to be linked to entities in B. Investigations in some important real-world domains show that pairs of dataset collections also need to undergo linking. Each collection is a set of datasets. An example is government data. Recent government efforts have led to the release of public data as batches of files, both across related domains and over time, as one of our real-world test sets demonstrates. Thus, there are (at least) two scalability issues: at the collection level, and at the dataset level. That is, datasets in one collection first need to be mapped to datasets in the second collection, after which a blocking scheme is learned and applied on each mapped pair. The problem of blocking two collections is exacerbated by data model heterogeneity, where some datasets are RDF and the others are tabular.

Fig. 1. The property table representation. For subjects that don't have a property, the reserved keyword null is entered. ';' is a reserved delimiter that allows each field value to have set semantics.
We note that data model heterogeneity has larger implications, since it also applies in the standard case where two datasets are provided but one is RDF and the other tabular. In recent years, the growth of both Linked Open Data and the Deep Web has been extensively documented. Datasets in the former are in RDF, while datasets in the latter are typically relational. Because of data model heterogeneity, the two communities have adopted different techniques for performing link discovery (typically called record linkage in the relational community). There is therefore a clear motivation for addressing this particular type of heterogeneity, since doing so would enable significant cross-fertilization between both communities. We will show an example of this empirically.
The intuition behind our proposed solution to data model heterogeneity is
to represent the RDF dataset as an information-preserving table, not as a set of
triples or a directed graph. The literature shows that such a table has previously
been proposed as a physical data structure, for efficient implementation of triple
stores [6]. An example of this table, called a property table, is shown in Figure 1.
We note that this is the first application of property tables as logical data struc-
tures in the link-discovery context. The table is information-preserving because
the original set of triples can be reconstructed from the table.
Note that the property table builds a schema (in the form of a set of proper-
ties) for the RDF file, regardless of whether it has accompanying RDFS or OWL
metadata. Thus, it applies to arbitrary files on Linked Open Data. Secondly,
numerous techniques in relational data integration can handle datasets with dif-
ferent schemas (called structural heterogeneity). By representing RDF datasets
in the input collections as property tables, data model heterogeneity is reduced
to structural heterogeneity in the tabular domain.
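The conversion can be sketched in a few lines of Python; the reserved keyword null and the ';' delimiter follow the description of Figure 1, while the list-of-dictionaries output layout is an assumption made for illustration.

def to_property_table(triples):
    # One row per subject, one column per property; multi-values joined with ';'.
    rows, properties = {}, []
    for s, p, o in triples:
        if p not in properties:
            properties.append(p)
        rows.setdefault(s, {}).setdefault(p, []).append(str(o))
    table = []
    for s, values in rows.items():
        row = {"subject": s}
        # Subjects without a given property get the reserved keyword "null".
        row.update({p: ";".join(values[p]) if p in values else "null"
                    for p in properties})
        table.append(row)
    return table

triples = [("ex:movie1", "rdfs:label", "Stand by Me"),
           ("ex:movie1", "ex:starring", "Wil Wheaton"),
           ("ex:movie1", "ex:starring", "River Phoenix"),
           ("ex:movie2", "rdfs:label", "Guernica")]
for row in to_property_table(triples):
    print(row)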
Figure 2 shows the overall framework of link discovery. The first step, proposed in this paper for collections, is called the dataset mapping step. It takes two collections A and B of heterogeneous datasets as input and produces a set of mappings between datasets. Let such a mapping be (a, b), where a ∈ A, b ∈ B. For each such mapping, the subsequent blocking process is invoked. Blocking has been extensively researched, with even the least expensive blocking methods having complexity O(n), where n is the total number of entities in the input datasets. Blocking generates a candidate set of entity pairs, Γ, with |Γ| << O(n²). Thus, blocking provides complexity improvements over brute-force linkage. To understand the savings of dataset mapping, assume that each collection contains q datasets and each dataset contains n entities. Without dataset mapping, any blocking method would be at least O(qn). With mapping, there would be q instances of complexity O(n) each. Since Γ depends heavily on n, the savings carry over to the final quadratic process (though these cannot be quantified without assumptions about the blocking process). We empirically demonstrate these gains. An added benefit is that there is now scope for parallelization.

Fig. 2. The overall link-discovery framework. Dataset mapping is our contribution.
The mapping process itself relies on document similarity measures developed in the information retrieval community, representing each dataset as a bag of tokens. Intuitively, mapped datasets should have relatively high document similarity to each other. Empirically, we found a tailored version of cosine similarity to work best; many packages exist for computing it efficiently. Computing similarities between all pairs of datasets yields an |A| × |B| matrix. A straightforward approach would use a threshold to output many-many mappings, or a bipartite graph matcher to output one-one mappings. The former requires a parameter specification, while the latter is cubic (O(q³)). Therefore, we opted for a dominating strategy, which can be computed in the same time it takes to build the matrix. Namely, a mapping (a, b) is chosen if the score in the cell of (a, b) dominates, that is, it is the highest in its constituent row and column. This has the advantage of being conservative against false positives. The method applies even when |A| ≠ |B|. In our experiments, we used cosine document similarity combined with the dominating strategy.
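A minimal sketch of this step is given below; the plain token-based cosine is a generic stand-in for the tailored version used in the paper, and the toy collections are invented for illustration.

import math
from collections import Counter

def cosine(bag_a, bag_b):
    # Cosine similarity between two token bags (Counter objects).
    dot = sum(bag_a[t] * bag_b[t] for t in bag_a)
    norm = math.sqrt(sum(v * v for v in bag_a.values())) * \
           math.sqrt(sum(v * v for v in bag_b.values()))
    return dot / norm if norm else 0.0

def dataset_mappings(collection_a, collection_b):
    # Choose (a, b) pairs whose similarity is the maximum of both its row and column.
    sims = {(a, b): cosine(collection_a[a], collection_b[b])
            for a in collection_a for b in collection_b}
    mappings = []
    for (a, b), score in sims.items():
        row_max = max(sims[(a, b2)] for b2 in collection_b)
        col_max = max(sims[(a2, b)] for a2 in collection_a)
        if score > 0 and score >= row_max and score >= col_max:
            mappings.append((a, b))
    return mappings

A = {"budget2009": Counter("federal state budget 2009".split())}
B = {"budget2009_rdf": Counter("state federal budget appropriations".split()),
     "courts": Counter("court case ruling".split())}
print(dataset_mappings(A, B))   # expected: [('budget2009', 'budget2009_rdf')]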
Experiments: Some results are demonstrated in Figure 3. We use three real-world test cases. The first two test cases (a and b in the figure) comprise RDF dataset collections describing court cases decided in Colombia and Venezuela, respectively, along with Constitution articles. The third test set consists of ten US government budget dataset collections from 2009 to 2013 (http://www.pewstates.org/research/reports/). Other such collections can also be observed on the same website, providing motivation for dataset mapping. We have released the publicly available datasets, with ground-truths, on a single page (https://sites.google.com/a/utexas.edu/mayank-kejriwal/datasets). We used two popular methods as baselines: a state-of-the-art unsupervised clustering method called Canopy Clustering (CC in the figure) [4], as well as an extended feature-selection based blocking method (Hetero in the figure) [3]. The gains produced by dataset mapping are particularly large on CC. More importantly, we found that the dataset mapping algorithm was able to deduce the correct mappings without introducing false positives or negatives, and with a run-time negligible compared to the subsequent blocking procedures.

Fig. 3. RR (Reduction Ratio) quantifies efficiency and is given by 1 − |Γ|/|Ω|, where Ω is the set of all O(n²) entity pairs, while PC (Pairs Completeness) measures the recall of Γ with respect to the ground-truth set Ωm (⊆ Ω). SS indicates whether dataset mapping (equivalently denoted Source Selection) was used.
Future Work: We continue to investigate dataset mapping, including other
document similarity measures, task domains and mapping strategies. We are
also investigating supervised versions of the problem, particularly in cases where
token overlap is low. Finally, we are investigating the property table further.
References
1. P. Christen. Data matching: concepts and techniques for record linkage, entity res-
olution, and duplicate detection. Springer, 2012.
2. R. Isele and C. Bizer. Learning expressive linkage rules using genetic programming.
Proceedings of the VLDB Endowment, 5(11):1638–1649, 2012.
3. M. Kejriwal and D. P. Miranker. An unsupervised algorithm for learning blocking
schemes. In Data Mining, 2013. ICDM’13. Thirteenth International Conference on.
IEEE, 2013.
4. A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional
data sets with application to reference matching. In Proceedings of the sixth ACM
SIGKDD international conference on Knowledge discovery and data mining, pages
169–178. ACM, 2000.
5. F. Scharffe, Y. Liu, and C. Zhou. RDF-AI: an architecture for RDF datasets matching, fusion and interlink. In Proc. IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR), Pasadena (CA US), 2009.
6. K. Wilkinson, C. Sayers, H. A. Kuno, D. Reynolds, et al. Efficient RDF storage and retrieval in Jena2. In SWDB, volume 3, pages 131–150, 2003.
Scientific data as RDF with Arrays:
Tight integration of SciSPARQL queries into MATLAB
Andrej Andrejev1, Xueming He1, Tore Risch1
1 Uppsala DataBase Laboratory (UDBL), Department of Information Technology,
Uppsala University, Box 337, SE-751 05 Uppsala, Sweden.
{Andrej.Andrejev, Tore.Risch}@it.uu.se, emilyhexueming@hotmail.com
Abstract. We present an integrated solution for storing and querying scientific
data and metadata, using the MATLAB environment as the client front-end and our
prototype DBMS on the server. We use RDF for experiment metadata, and
numeric arrays for the rest. Our extension of SPARQL supports array operations
and extensibility with foreign functions.
1 Introduction
In many branches of science and engineering, researchers accumulate large amounts
of experimental data [3,4] and use widely recognized (de-facto standard) libraries of
algorithms to analyze and refine that data. Tools such as MATLAB or similar serve
as integrated environments that provide basic file management, extensibility with
algorithmic libraries, visualization and debugging tools, and are generally oriented
towards a single-user scenario.
What is typically missing is the infrastructure for storing the descriptions of
experiments, including parameters, terminology mappings, provenance records and
other kinds of metadata. At best, this information is stored in a set of variables in the
same files that contain large numeric arrays of experimental data, and thus is prone
to duplication and hard to update. We have addressed this problem in our previous
work [2] utilizing the Semantic Web approach for storing both data and metadata,
and using Scientific SPARQL query language [1], that extends SPARQL queries
with numeric array operations and external user-defined functions. The goal of
SciSPARQL is to provide uniform query access to both metadata about the
experiments and the massive experimental data itself, as illustrated by Table 1.
Section 3 gives a more detailed account of SciSPARQL features.
SciSPARQL is supported by our software prototype - SSDM (Scientific
SPARQL Database Manager [1,2]), a database management system (DBMS) for
storing and querying data originating from scientific experiments. SSDM provides
scalable storage representation of RDF and numeric multidimensional arrays.
Table 1. Comparison of data processing domains of MATLAB, SPARQL and SciSPARQL
                                    MATLAB   SPARQL   SciSPARQL
Metadata                                     ✓        ✓
Scientific data including Arrays    ✓                 ✓
In this work, we demonstrate a client-server architecture featuring (i) SSDM
server: the centralized storage for both experiment metadata (as RDF) and arrays
stored in binary files linked from the RDF dataset and (ii) MSL: a MATLAB
extension that allows the user to establish connections to the SSDM server, run SPARQL queries
and updates directly from the MATLAB interpreter, and access the query result sets.
We show that the data is shipped from the server only on demand. Also, the
conversion of numeric array data between native MATLAB format and internal
SSDM representation only takes place if a non-MATLAB function is going to access the
array or, more typically, a certain range within it.
For this demo1 we have deployed the SSDM server on a Linux machine to store
RDF datasets in-memory and array data in binary .mat files [5], currently a
de-facto standard. (This provides the same speed for reading and processing array
data as using MATLAB alone.) The demo script is run on the client
machine inside the MATLAB interpreter.
2 MATLAB-SciSPARQL Link
The extension to MATLAB includes two main classes: Connection and Scan, and
additional classes used to represent RDF types on MATLAB client side, e.g. URIs
and typed literals. An additional class MatProxy is used to represent (on the client
side) an array stored in a .mat file on the server.
Connection encapsulates a connection to the SSDM server, including methods for
- executing SciSPARQL queries and obtaining the result as a Scan,
- executing non-query SciSPARQL statements, e.g. updates and function definitions, as well as inserting RDF triples into the dataset on the server,
- defining URI prefixes to be used both on the client and the server side,
- shipping MATLAB arrays from the client to the server,
- managing data persistence on the server.
Scan encapsulates a result set of the query. The data is not physically retrieved,
stored or shipped anywhere before it is explicitly accessed as a row in the scan. Scan
includes methods for iterating through the result sets of SciSPARQL queries: arrays
and scalar numbers are represented by MATLAB arrays and numbers, while other
RDF values are represented by the wrapper objects defined in MSL.
As we show in the demo, the user can easily create MATLAB routines to
convert (partially or entirely) the data from the Scan into the desired representation,
e.g. for visualization.
3 Scientific SPARQL
We have extended the SPARQL language to query and update RDF datasets extended
with arrays. SciSPARQL [1] includes
- extensions for declaratively specifying element access, slicing, projection and transposition operations over numeric arrays of arbitrary dimensionality,
- a library of array aggregation functions that are performed on the server in order to reduce the amount of data shipped to the client,
- extensibility with user-defined foreign functions, allowing the use of existing computational libraries.
1 The demo script is available at http://www.it.uu.se/research/group/udbl/SciSPARQL/demo3/
SciSPARQL is designed to handle both metadata (stored or viewed as RDF) and
large numeric data to be accessed in a uniform way: by the same query, from the
same dataset.
One important feature of SciSPARQL is the ability to define SciSPARQL functional
views, essentially named parameterized queries (or, similarly, updates). These
can be used in other queries, or called directly from MATLAB client with
parameters provided as MATLAB values. The conversion of values from MATLAB
to RDF is performed automatically on the client.
4 SSDM Server and Array Proxy Objects
Scientific SPARQL Database Manager is designed for storing RDF data and
numeric multidimensional arrays, working either as an in-memory DBMS or with the
help of an SQL-based [2] or any other interfaced back-end storage. In this demo the SSDM
server is configured to store RDF triples in-memory, and array data as a managed
directory of native .mat files. Reading and writing .mat files on the server side is
done via freely distributed MATLAB MCR libraries.
To save a snapshot of the RDF dataset linking to the arrays stored in .mat files, the
save() SciSPARQL directive can be sent via the connection. The server can be re-
started with a named image and continue to function as an in-memory DBMS.
The main purpose of the SSDM server is to process SciSPARQL queries and
updates. As part of an update, a store() function can be called from the client. A
MATLAB value (e.g. a numeric multidimensional array) will be shipped to the server
as a binary .mat file and saved under a server-managed name in the server file system.
The Array Proxy object pointing to the value in that .mat file will be returned to the
client and used as a replacement for the actual array, e.g. as a parameter to
SciSPARQL queries and updates. Once stored in the RDF dataset, the Array Proxy serves as
a link from the metadata RDF graph to the numeric data stored externally in a .mat file.
If the file is already on the server and its location is known (for example, due to some
convention among the users), an alternative link() function can be used to obtain
an equivalent Array Proxy object.
When a SciSPARQL query involves slicing, element access, projection or array
aggregate operations on an array represented by an Array Proxy, the SSDM server reads
the specified part of the array stored in the file into SSDM's internal array representation
(thus performing slicing, projection or element access), does any further processing
(e.g. applying array aggregate functions, like "sum of all columns"), and ships the
resulting, typically much smaller, array to the MATLAB client, where it is converted
back to MATLAB representation. It is also possible to do slicing and projection
operations within the native .mat array representation, when no further processing by
SSDM is planned.
One of the possible workflows involving arrays is shown in Figs. 1–2. First, a
MATLAB array A is created on the client. A call to the store() function ships it to
the server and returns an Array Proxy object. This object is used in RDF triples sent
to SSDM while populating RDF graph describing the experiment.
At the query phase (Fig. 2), a subset of A (that is now stored on the server in a
.mat file) is selected, fed to array_sum() aggregate function, and the result (a
single number) is shipped back to the client for post-processing and visualization.
Fig. 1. Storing client-generated data and metadata on the SSDM server.
Fig. 2. Querying data and metadata on the SSDM server from the MATLAB client.
5 Conclusion
The use of standard query languages for bringing the remotely stored data into
the computational environments is becoming increasingly popular as the data gets
bigger and more distributed. MATLAB already has a facility to execute SQL, and the R
statistical environment recently gained a simple SPARQL package [6]. We take the
next step, by providing extensions to the standard query techniques, to make the
database connections even more useful and efficient.
The approach of linking to the data, instead of copying and storing it locally, is
beneficial, as the creation of the RDF graph to represent metadata takes negligible
time compared to copying the massive data described by this RDF graph.
There are a number of efficient binary storage formats around, and our approach can
easily be extended to any of them, as long as it is possible to address stored data in
terms of string or symbolic identifiers and to read specified parts of the arrays.
The main benefit, however, is integrating the Semantic Web metadata management
approach (RDF and SPARQL) into an environment that so obviously lacks it.
MATLAB users can now take advantage of remote and centralized repositories for
both massive numeric data and metadata, send queries that combine them both,
retrieve exactly as much data as required for the task, and do any further processing
the way they already do.
References
1. A. Andrejev and T. Risch. Scientific SPARQL: Semantic Web queries over scientific data.
In International Workshop on Data Engineering Meets the Semantic Web, ICDE'12.
2. A. Andrejev, S. Toor, A. Hellander, S. Holmgren, and T. Risch: Scientific Analysis by
Queries in Extended SPARQL over a Scalable e-Science Data Store. In e-Science'13.
3. M. Stonebraker, J. Becla, D. J. DeWitt, K.-T. Lim, D. Maier, O. Ratzesberger, and
S. B. Zdonik. Requirements for science data bases and SciDB. In CIDR '09.
4. E. Soroush, M. Balazinska, and D. L. Wang. ArrayStore: a storage manager for complex
parallel array processing. In SIGMOD '11.
5. http://www.mathworks.se/help/pdf_doc/matlab/matfile_format.pdf
6. http://cran.r-project.org/web/packages/SPARQL/index.html
Measuring Similarity in Ontologies: A new
family of measures
Tahani Alsubait, Bijan Parsia, and Uli Sattler
School of Computer Science, The University of Manchester, United Kingdom
{alsubait,bparsia,sattler}@cs.man.ac.uk
1 Introduction
Similarity measurement is important for numerous applications, be it classical
information retrieval, clustering, ontology matching or various other applica-
tions. It is also known that similarity measurement is difficult. This can easily be
seen by looking at the several attempts that have been made to develop
similarity measures, see for example [2, 4]. The problem is also well-founded in
psychology, and a number of psychological models of similarity have already been
developed, see for example [3]. Rather than adopting a psychological model for
similarity as a foundation, we noticed that some existing similarity measures
for ontologies are ad-hoc and unprincipled. In addition, there is still a need for
similarity measures which are applicable to expressive Description Logics (DLs)
(i.e., beyond EL) and which are terminological (i.e., do not require an ABox).
To address these requirements, we have developed a new family of similarity
measures which are founded on the feature-based psychological model [3]. The
individual measures vary in their accuracy/computational cost based on which
features they consider.
To date, there has been no thorough empirical investigation of similarity
measures. This has motivated us to carry out two separate empirical studies.
First, we compare the new measures along with some existing measures against
a gold-standard. Second, we examine the practicality of using the new measures
over an independently motivated corpus of ontologies (BioPortal library) which
contains over 300 ontologies. We also examine whether cheap measures can be
an approximation of some more computationally expensive measures. In addi-
tion, we explore what could possibly go wrong when using a cheap similarity
measure.
2 A new family of similarity measures
The new measures are based on Jaccard's similarity coefficient, which has been
proved to be a proper metric (i.e., it satisfies equivalence closure, symmetry and
the triangle inequality). Jaccard's coefficient, which maps similarity to a value in
the range [0,1], is defined as follows (for sets of "features" A′, B′ of A, B, i.e.,
the subsumers of A and B):

J(A, B) = |A′ ∩ B′| / |A′ ∪ B′|
We aim at similarity measures for general OWL ontologies and thus a naive
implementation of this approach would be trivialised because a concept has in-
finitely many subsumers. To overcome this, we present refinements for the simi-
larity function in which we do not count all subsumers but consider subsumers
from a set of (possibly complex) concepts of a concept language L. Let C and
D be concepts, let O be an ontology and let L be a concept language. We set:
S(C, O, L) = {D ∈ L(Õ) | O ⊨ C ⊑ D}
Com(C, D, O, L) = S(C, O, L) ∩ S(D, O, L)
Union(C, D, O, L) = S(C, O, L) ∪ S(D, O, L)
Sim(C, D, O, L) = |Com(C, D, O, L)| / |Union(C, D, O, L)|
To design a new measure, it remains to specify the set L. For example:

AtomicSim(C, D) = Sim(C, D, O, LAtomic(Õ)), where LAtomic(Õ) = Õ ∩ NC.
SubSim(C, D) = Sim(C, D, O, LSub(Õ)), where LSub(Õ) = Sub(O).
GrSim(C, D) = Sim(C, D, O, LG(Õ)), where LG(Õ) = {E | E ∈ Sub(O)
or E = ∃r.F, for some r ∈ Õ ∩ NR and F ∈ Sub(O)}.
where Õ is the signature of O, NC is the set of concept names and Sub(O) is the
set of concept expressions in O. The rationale of SubSim(·) is that it provides
similarity measurements that are sensitive to the modeller’s focus. To capture
more possible subsumers, one can use GrSim(·) for which the grammar can be
extended easily.
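As an illustration of how the measure reduces to simple set arithmetic once the subsumer sets have been computed (e.g. by a DL reasoner), here is a minimal Python sketch; the function name and the toy concept names are ours, not part of the original work.

def sim(subsumers_c, subsumers_d):
    """Jaccard-style similarity over two subsumer sets drawn from a concept language L.

    subsumers_c and subsumers_d play the roles of S(C, O, L) and S(D, O, L):
    the (possibly complex) concepts of L entailed to subsume C and D.
    """
    common = subsumers_c & subsumers_d    # Com(C, D, O, L)
    union = subsumers_c | subsumers_d     # Union(C, D, O, L)
    return len(common) / len(union) if union else 1.0

# toy example with atomic subsumers only (i.e. AtomicSim-style input)
s_cat = {"Cat", "Felid", "Mammal", "Animal"}
s_dog = {"Dog", "Canid", "Mammal", "Animal"}
print(sim(s_cat, s_dog))                  # 2 common subsumers / 6 in the union = 0.33...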
3 Approximations of similarity measures
Some measures might be practically inefficient due to the large number of can-
didate subsumers. For this reason, it would be useful to examine whether
a "cheap" measure can be a good approximation of a more expensive one.
Definition 1 Given two similarity functions Sim(·), Sim′(·), we say that:
– Sim′(·) preserves the order of Sim(·) if ∀A1, B1, A2, B2 ∈ Õ: Sim(A1, B1) ≤
Sim(A2, B2) ⟹ Sim′(A1, B1) ≤ Sim′(A2, B2).
– Sim′(·) approximates Sim(·) from above if ∀A, B ∈ Õ: Sim(A, B) ≤ Sim′(A, B).
– Sim′(·) approximates Sim(·) from below if ∀A, B ∈ Õ: Sim(A, B) ≥ Sim′(A, B).
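These properties can be checked empirically over a finite sample of concept pairs once both measures have been evaluated on the same pairs; the following Python sketch (with hypothetical scores) shows one straightforward way to do so.

from itertools import permutations

def order_preserving(cheap, expensive):
    """cheap, expensive: dicts mapping a concept pair (A, B) to a similarity score.
    True if, whenever the expensive measure ranks one pair no higher than another,
    the cheap measure agrees (first item of Definition 1)."""
    return all(
        cheap[p] <= cheap[q]
        for p, q in permutations(expensive, 2)
        if expensive[p] <= expensive[q]
    )

def approximates_from_above(cheap, expensive):
    return all(cheap[p] >= expensive[p] for p in expensive)

def approximates_from_below(cheap, expensive):
    return all(cheap[p] <= expensive[p] for p in expensive)

# hypothetical scores for two concept pairs under an expensive and a cheap measure
exp_scores   = {("Heart", "Aorta"): 0.6, ("Heart", "Liver"): 0.3}
cheap_scores = {("Heart", "Aorta"): 0.7, ("Heart", "Liver"): 0.4}
print(order_preserving(cheap_scores, exp_scores),         # True
      approximates_from_above(cheap_scores, exp_scores))  # True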
Consider AtomicSim(·) and SubSim(·). The first thing to notice is that the
set of candidate subsumers for the first measure is actually a subset of the set
of candidate subsumers for the second measure (Õ ∩ NC ⊆ Sub(O)). However,
we also need to notice that the number of entailed subsumers in the two cases
need not be proportionally related. Hence, the above examples of similarity
measures are, theoretically, non-approximations of each other.
4 Empirical evaluation
We compare the three measures GrSim(·), SubSim(·)
and AtomicSim(·) against human similarity judgements. We also include two
existing similarity measures in this comparison (Rada [2] and Wu & Palmer [4]).
We also study in detail the behaviour of our new family of measures in practice.
GrSim(·) is considered the expensive and most precise measure in this study.
To study the relation between the different measures in practice, we examine
the following properties: order-preservation, approximation from above/below
and correlation (using Pearson’s coefficient).
4.1 Experimental set-up
Part 1: Comparison against a gold-standard The similarity of 19 SNOMED-
CT concept pairs was calculated using the three methods along with Rada [2]
and Wu & Palmer [4] measures. We compare these similarities to human judge-
ments taken from the Pedersen et al.[1] test set.
Part 2: Cheap vs. expensive measures A snapshot of BioPortal from
November 2012 was used as a corpus. It contains a total of 293 ontologies.
We excluded 86 ontologies which have only atomic subsumptions as for such
ontologies the behaviour of the considered measures will be identical, i.e., we
already know that AtomicSim(·) is good and cheap. Due to the large number
of classes and difficulty of spotting interesting patterns by eye, we calculated
the pairwise similarity for a sample of concepts from the corpus. The size of the
sample is 1,843 concepts with a 99% confidence level. To ensure that the sample
encompasses concepts with different characteristics, we picked 14 concepts from
each ontology. The selection was not purely random. Instead, we picked 2 random
concepts and for each random concept we picked some neighbouring concepts.
4.2 Results
How good is the expensive measure? Not surprisingly, GrSim and SubSim
had the highest correlation values with experts’ similarity (Pearson’s correlation
coefficient r = 0.87, p < 0.001). Next comes AtomicSim with r = 0.86,
followed by Wu & Palmer and then Rada with r = 0.81 and r = 0.64 respectively.
Figure 1 shows the similarity curves for the 6 measures used in this comparison.
The new measures, along with the Wu & Palmer measure, preserve the order of human
similarity more often than the Rada measure. They mostly underestimated similarity,
whereas the Rada measure mostly overestimated human similarity.
Cost of the expensive measure The average time per ontology taken to
calculate grammar-based similarities was 2.3 minutes (standard deviation =
10.6 minutes, median m = 0.9 seconds) and the maximum time was 93 minutes
for the Neglected Tropical Disease Ontology which is a SRIQ ontology with
1237 logical axioms, 252 concepts and 99 object properties. For this ontology,
the cost of AtomicSim(·) was only 15.545 sec and 15.549 sec for SubSim(·). 9 out
of 196 ontologies took over 1 hour to be processed. One thing to note about these
ontologies is the high number of logical axioms and object properties. Clearly,
GrSim(·) is far more costly than the other two measures. This is why we want
to know how good/bad a cheaper measure can be.
Fig. 1: 6 Curves of similarity for 19 SNOMED clinical terms
How good is a cheap measure? Although we have excluded all ontologies
with only atomic subsumptions from the study, in 12% of the ontologies the three
measures were perfectly correlated (r = 1, p < 0.001). These perfect correlations
indicate that, in some cases, the benefit of using an expensive measure is
negligible.
AtomicSim(·) and SubSim(·) did not preserve the order of GrSim(·) in 80%
and 73% of the ontologies respectively. Also, they were not approximations from
above nor from below in 72% and 64% of the ontologies respectively.
Take a look at the African Traditional Medicine ontology in Figure 2. SubSim(·)
is 100% order-preserving while AtomicSim(·) is only 99% order-preserving.
Fig. 2: African Traditional Medicine
Fig. 3: Platynereis Stage
Note also the Platynereis Stage Ontology in Figure 3 in which both AtomicSim(·)
and SubSim(·) are 75% order-preserving. However, AtomicSim(·) was 100% ap-
proximating from above while SubSim(·) was 85% approximating from below.
References
1. T. Pedersen, S. Pakhomov, S. Patwardhan, and C. Chute. Measures of semantic
similarity and relatedness in the biomedical domain. Journal of Biomedical Infor-
matics, 30(3):288–299, 2007.
2. R. Rada, H. Mili, E. Bicknell, and M. Blettner. Development and application of a
metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics,
volume 19, pages 17–30, 1989.
3. A. Tversky. Features of similarity. Psychological Review, 84(4), July 1977.
4. Z. Wu and M. Palmer. Verb semantics and lexical selection. In Proceedings of
the 32nd Annual Meeting of the Association for Computational Linguistics (ACL
1994), pages 133–138, 1994.
Towards Combining Machine Learning with
Attribute Exploration for Ontology Refinement
Jedrzej Potoniec1 , Sebastian Rudolph2 , and Agnieszka Lawrynowicz1
1 Institute of Computing Science, Poznan University of Technology, Poland
{jpotoniec,alawrynowicz}@cs.put.poznan.pl
2 Technische Universität Dresden, Germany
sebastian.rudolph@tu-dresden.de
Abstract. We propose a new method for knowledge acquisition and on-
tology refinement for the Semantic Web utilizing Linked Data available
through remote SPARQL endpoints. This method is based on a combination of
the attribute exploration algorithm from formal concept analysis
and the active learning approach from machine learning.
1 Introduction
Knowledge acquisition is a process of capturing knowledge, typically from a hu-
man expert, and thus it concerns all systems and environments where that kind
of knowledge is required. It is also said to be a major bottleneck in the development
of intelligent systems due to its difficulty and time requirements. Of course the
Semantic Web, as an area concerned with structured and precise representation
of information, has to deal with exactly the same issue.
Since the early days of the Semantic Web, building ontologies has been a dif-
ficult and laborious task. Frequently people trying to express complex knowledge
do not know how to perform this task properly. Mistakes come from the difficulty
of understanding the complex logical formalism underlying OWL.
Frequently an ontology engineer would start by collecting vocabulary and re-
quirements for an ontology, structure the vocabulary and later specify more
complex dependencies [6]. We propose a solution to support knowledge acquisi-
tion for ontology construction. In particular, we address the last part of the process,
where some basic knowledge is already gathered and more complex dependencies
are to be specified. We aim to answer the question: how can an ontology be extended
with meaningful, valid and non-trivial axioms, taking into consideration the available
data and the user's workload?
2 Related work
Many approaches to knowledge acquisition for ontology development have been
proposed so far. The most basic ones are ontology editors supporting ontology
development, such as Protégé3. In addition, there are methodologies helpful
in ontology development, such as the one proposed in NeOn [6].
3 http://protege.stanford.edu/
In [1,5], applications of the attribute exploration algorithm from formal concept
analysis to ontology development have been proposed. [1] describes how to dis-
cover subsumptions between conjunctions of classes and [5] extends it to proper-
ties' domains and ranges.
In [3] the idea of learning ontologies purely from Linked Data by means of dis-
covering association rules is presented. [2] presents a methodology for manually
building and populating domain ontologies from Linked Data.
3 Approach
The proposed approach is to support the user during attribute exploration by
means of machine learning (ML). The ML algorithm's task is to answer the simple,
uninteresting questions posed by the attribute exploration algorithm and to leave
to the user only those questions which are non-trivial to answer.
The input of the proposed algorithm is an ontology O, a partial context derived
from it, and two thresholds θa and θr. They are, respectively, thresholds for ac-
cepting and rejecting an implication and have to be chosen manually w.r.t. the
ML algorithm used. The result of the algorithm is a set of probably valid implica-
tions, which can be transformed into subsumptions for extending the ontology.
A detailed description of the algorithm is presented below, followed by a short
code sketch. For the sake of clarity, the description treats the attribute exploration
algorithm as a black box which provides the next implication to consider.
1. Generate an implication L → R by means of the attribute exploration algorithm.
2. For every r ∈ R, do the following sequence of steps:
(a) If L → {r} is already refuted by some of the known individuals, go to the next r.
(b) If O ⊨ ⊓L ⊑ r, remember the implication L → {r} as a valid one and go to the next r.
(c) Compute the probabilities of acceptance pa and rejection pr of the implication L → {r} with the ML algorithm. Note that pa + pr = 1.
(d) If pa ≥ θa, remember the implication L → {r} as a valid one and go to the next r.
(e) If pr ≥ θr, go to step 2i.
(f) Ask the user if the implication L → {r} is valid.
(g) Add the considered implication with the user's answer to the set of learning examples for the ML algorithm.
(h) If the implication is valid, remember it as a valid one and go to the next r.
(i) Otherwise, extend the partial context with a counterexample either provided by the user or auto-generated.
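The following Python sketch shows the shape of this loop; it treats the attribute exploration algorithm, the reasoner checks and the classifier as black boxes passed in as callables, so all names here are ours and the generation of counterexamples is left abstract.

def explore_with_ml(next_implication, refuted, entailed, classifier, ask_user,
                    add_counterexample, theta_a, theta_r):
    """One run of the exploration loop sketched above (all arguments are black boxes).

    next_implication() yields (L, R) pairs from attribute exploration;
    refuted(L, r) and entailed(L, r) implement steps (a) and (b) against the
    partial context and the ontology; classifier.predict(L, r) returns the
    acceptance probability p_a (with p_r = 1 - p_a); ask_user(L, r) returns
    True/False; add_counterexample(L, r) extends the partial context.
    """
    accepted = []
    for L, R in next_implication():
        for r in R:
            if refuted(L, r):                      # (a) already contradicted
                continue
            if entailed(L, r):                     # (b) follows from the ontology
                accepted.append((L, r))
                continue
            p_a = classifier.predict(L, r)         # (c)
            if p_a >= theta_a:                     # (d) accept automatically
                accepted.append((L, r))
            elif 1 - p_a >= theta_r:               # (e) reject automatically
                add_counterexample(L, r)           # (i)
            else:
                valid = ask_user(L, r)             # (f)
                classifier.learn(L, r, valid)      # (g) new training example
                if valid:                          # (h)
                    accepted.append((L, r))
                else:
                    add_counterexample(L, r)       # (i)
    return accepted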
The purpose of iterating through the set of conclusions R in the algorithm
is twofold. We believe that this way the user can more easily decide if the presented
implication is valid or not, because she does not have to consider a complex relation
between two conjunctions of attributes.
The other reason is that this way the automated generation of counterexamples
provides more concrete results. For an arbitrary implication L → R, a counterex-
ample can be generated and said to have all attributes from L and to lack at
least one attribute from R. This is not in line with the method of partial context
induction, as it is unclear which attribute from R exactly the counterexample
does not have. Because of that, the partial context can no longer reflect the knowledge
base accurately, and the attribute exploration algorithm can start to generate
invalid implications. If the implication has a single attribute in its right-hand
side, it is clear which attribute the counterexample does not have.
3.1 Application of machine learning
The task which the ML algorithm is to solve can be seen as a kind of active learning
with binary classification. Every implication is classified as valid or invalid and,
if the algorithm is unsure, the user is asked.
One should note that not every classifier generates reasonable probabilities.
For example, rule-based or tree-based systems are usually not suitable for that
purpose. The problem of generating probabilities can also be seen as a regression
problem.
Moreover, the costs of the two types of mistakes are different and the distribution of
learning examples can be heavily imbalanced, i.e. implications with one decision
may appear much more often than with the other. To reflect these facts, a classifier
suitable for cost-sensitive learning is required.
To apply machine learning techniques, a way to transform implications to
feature vectors is required. We apply three approaches to this problem. First
of all, a single purely syntactic measure is used: the number of attributes in
the left-hand side divided by the number of all attributes. Secondly, there are
features made of values of measures typical for association rules mining. Their
computation is based on features of individuals in the ontology. Following the
naming convention from [4], we use coverage, prevalence, support, recall and lift.
Finally, we use a mapping from the set of the attributes to Linked Data
in order to obtain the number of objects in an RDF repository supporting an
implication or its parts. Every attribute is mapped to a SPARQL graph pattern
with a single featured variable denoting the object identifier. Following the same
naming convention from [4], coverage, prevalence, support, recall and confidence
are used. All of these features can be computed using only SPARQL COUNT
DISTINCT expressions and basic graph patterns and thus they maintain relatively
low complexity and are suitable to use with remote SPARQL endpoints.
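As an illustration of such a Linked Data feature, the following Python sketch issues a pair of COUNT DISTINCT queries with the SPARQLWrapper library; the endpoint, the graph patterns and the attribute mapping are placeholders, not those used in the paper.

from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://example.org/sparql"   # hypothetical remote SPARQL endpoint

def count_distinct(patterns):
    """Number of distinct objects ?x matching all of the given basic graph patterns."""
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setReturnFormat(JSON)
    sparql.setQuery("SELECT (COUNT(DISTINCT ?x) AS ?n) WHERE { %s }" % " ".join(patterns))
    result = sparql.query().convert()
    return int(result["results"]["bindings"][0]["n"]["value"])

# hypothetical attribute-to-pattern mapping for an implication L -> {r}
left  = ["?x a <http://example.org/Writer> ."]
right = ["?x <http://example.org/authorOf> ?w ."]

n_left = count_distinct(left)            # objects carrying all left-hand attributes
n_both = count_distinct(left + right)    # objects supporting the whole implication
confidence = n_both / n_left if n_left else 0.0   # one ratio feature in the spirit of rule confidence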
Such a feature vector is later labeled by the classifier mentioned above, and the
given answer (valid/invalid/unsure) is used to either refine the ontology or ask
the user. If the user is asked, her answer is then used as the correct label for the
feature vector and the classifier is relearned.
4 Conclusions and future work
As we are proposing a method intended to make the development and refinement of
domain-specific ontologies easier, our main goal for evaluation is to validate its
practical usability. We plan to apply our method to a selection of domain-specific
ontologies concerning knowledge of a general type, such as literature, music
and movies. We plan to use a crowdsourcing service to validate our hypotheses.
We hope that, with ontology themes that are general enough and additional
information available on the Internet, the crowd will be able to validate our
decisions about implications and Linked Data mappings.
We believe that our approach is promising and will be able to help ontology
engineers in the process of ontology refinement. We are combining three tech-
nologies very suitable for this kind of task: first of all, the attribute exploration
algorithm, which has been developed especially for discovering additional relations
between attributes. Moreover, Linked Data is supposed to describe parts of the
world. Obviously, this description cannot be assumed to be either accurate
or complete, yet it should be sufficient to support the user in the process of on-
tology refinement. Finally, the whole purpose of machine learning algorithms is
to adapt themselves, and thus they are suitable to replace the user in uniform,
repeatable tasks.
Acknowledgement. Jedrzej Potoniec and Agnieszka Lawrynowicz acknowl-
edge support from the PARENT-BRIDGE program of Foundation for Polish Sci-
ence, cofinanced from European Union, Regional Development Fund (Grant No
POMOST/2013-7/8 LeoLOD – Learning and Evolving Ontologies from Linked
Open Data).
References
1. Baader, F., Ganter, B., et al.: Completing description logic knowledge bases using
formal concept analysis. In: Proc. of IJCAI 2007. pp. 230–235. AAAI Press (2007)
2. Dastgheib, S., Mesbah, A., Kochut, K.: mOntage: Building Domain Ontologies from
Linked Open Data. In: IEEE Seventh International Conference on Semantic Com-
puting (ICSC). pp. 70–77. IEEE (2013)
3. Fleischhacker, D., Völker, J.: Inductive learning of disjointness axioms. In: Meers-
man, R., Dillon, T., et al. (eds.) On the Move to Meaningful Internet Systems: OTM
2011, LNCS, vol. 7045, pp. 680–697. Springer Berlin Heidelberg (2011)
4. Le Bras, Y., Lenca, P., Lallich, S.: Optimonotone measures for optimal rule discov-
ery. Computational Intelligence 28(4), 475–504 (2012)
5. Rudolph, S.: Acquiring generalized domain-range restrictions. In: Medina, R.,
Obiedkov, S. (eds.) Formal Concept Analysis, LNCS, vol. 4933, pp. 32–45. Springer
Berlin Heidelberg (2008)
6. Suárez-Figueroa, M.C., Gómez-Pérez, A., Fernández-López, M.: The NeOn Method-
ology for Ontology Engineering. In: Suárez-Figueroa, M.C., Gómez-Pérez, A., et al.
(eds.) Ontology Engineering in a Networked World, pp. 9–34. Springer Berlin Hei-
delberg (2012)
ASSG: Adaptive structural summary for RDF
graph data
Haiwei Zhang, Yuanyuan Duan, Xiaojie Yuan, and Ying Zhang*
Department of Computer Science and Information Security, Nankai University.
94,Weijin Road, Tianjin, China
{zhanghaiwei,duanyuanyuan,yuanxiaojie,zhangying}
@dbis.nankai.edu.cn
http://dbis.nankai.edu.cn
Abstract. RDF, modeled as a labeled directed graph, is an important data
model for the Semantic Web. Querying massive RDF graph data is known to be
hard. In order to reduce the data size, we present ASSG, an Adaptive Structural
Summary for RDF Graph data, built on bisimulations between nodes. ASSG
compresses only the part of the graph related to queries. Thus ASSG contains
fewer nodes and edges than existing work. More importantly, ASSG has the
adaptive ability to adjust its structure according to changing query graphs.
Experimental results show that ASSG can reduce graph data by 85% on average,
a higher reduction than that of existing work.
Keywords: Adaptive structural summary, RDF graph, Equivalence class
1 Introduction
The resource description framework (RDF) data model has been designed as a
flexible representation of schema-relaxable or even schema-free information for
the Semantic Web [1]. RDF can be modeled by a labeled directed graph and
querying in RDF data is usually thought to be a process of subgraph matching.
The subgraph matching problem is defined as follows: for a data graph G and a
query graph Q, retrieve all subgraphs of G that are isomorphic to Q. The two existing
solutions, subgraph isomorphism and graph simulation, are expensive:
subgraph isomorphism is NP-complete and graph simulation takes quadratic
time. Further, indices are used to accelerate subgraph queries on large graph
data, but indices incur extra cost for construction and maintenance (see [2] for
a survey). Motivated by this, a new approach using graph compression has
been proposed recently [3]. In [3], Fan et al. proposed the query preserving graph
compression Gr, which compresses a massive graph into a small one by partitioning
nodes into equivalence classes. For subgraph matching, Gr can reduce graph data
by 57% on average. However, for a designated query graph, many
components (nodes and edges) of Gr are redundant. Hence it is possible to
construct a compressed graph tailored to the designated subgraph matching.
* Corresponding author.
In this paper, we present ASSG (Adaptive Structural Summary of Graphs),
a graph compression method that further reduces the size of the graph data.
ASSG has fewer components than Gr and, more importantly, it has the adaptive ability
to adjust its structure according to different subgraph matchings. In the following
sections, we mainly introduce our novel technique.
2 Adaptive Structural Summary
In this section, we present our approach to adaptive structural summaries for
labeled directed graph data (such as RDF). ASSG is a compressed
graph constructed from equivalence classes of nodes, and it has the adaptive ability to
adjust its structure according to different query graphs.
Graph data is divided into different equivalence classes by bisimulation rela-
tions, as proposed in [3]. For computing the bisimulation relation, we use the notion of
rank proposed in [4], which describes the structural feature of nodes with respect to leaf
nodes (if they exist). A. Dovier et al. [4] proposed functions for computing the ranks of
nodes in both directed acyclic graphs (DAGs) and directed cyclic graphs (DCGs).
An equivalence class ECG of nodes in graph data G = (V, E, L) is denoted
by a triple (Ve , Re , Le ), where (1) Ve is a set of nodes included in the equivalence
class, (2) Re is the rank of the nodes, and (3) Le denotes the labels of the nodes.
Fig. 1. Graph data and equivalence classes: (a) graph data G, (b) ASSG, (c) query graphs Q1 and Q2, (d) the adjusted ASSG.
Fig. 1 shows an example of a graph (Fig. 1(a)) and the equivalence classes
of its nodes (Fig. 1(b)). Labels and ranks of nodes in the same equivalence class
are the same, e.g. rank(C1) = rank(C2) = rank(C3) = 0, rank(A1) =
rank(A2) = rank(A3) = 2, and so on. Two passes of DFS processing are
performed to construct the equivalence classes of nodes. A DCG is changed into a
DAG by Tarjan's algorithm in the first DFS (not shown in Fig. 1). Subsequently,
in the second DFS, the rank of each node is measured and the node is
collapsed into the corresponding equivalence class by its label and rank. Hence,
V of G is partitioned into different equivalence classes with a cost of O(|V| + |E|).
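A minimal Python sketch of this construction for the already acyclic case (i.e. after Tarjan's algorithm has collapsed cycles) is given below; the rank used here is the simplified well-founded one (0 for leaves, otherwise 1 plus the maximum child rank), and the example edges are only one plausible reading of Fig. 1(a).

from collections import defaultdict

def equivalence_classes(nodes, edges, label):
    """Group the nodes of a DAG into (label, rank) equivalence classes.

    edges maps a node to its list of children; the rank of a leaf is 0 and
    otherwise 1 + the maximum rank of its children (cycles are assumed to
    have been collapsed beforehand, e.g. with Tarjan's SCC algorithm).
    """
    rank = {}

    def compute_rank(v):
        if v not in rank:
            children = edges.get(v, [])
            rank[v] = 1 + max(compute_rank(c) for c in children) if children else 0
        return rank[v]

    classes = defaultdict(set)
    for v in nodes:
        classes[(label[v], compute_rank(v))].add(v)
    return dict(classes)

# an illustrative graph in the spirit of Fig. 1(a); the exact edges are assumptions
nodes = ["A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3", "D1", "D2", "D3"]
label = {v: v[0] for v in nodes}
edges = {"A1": ["B1"], "A2": ["B1", "B2"], "A3": ["B3"],
         "B1": ["C1", "D1"], "B2": ["C2", "D2"], "B3": ["C3", "D3"]}
print(equivalence_classes(nodes, edges, label))
# A-nodes get rank 2, B-nodes rank 1, C- and D-nodes rank 0, so the twelve
# nodes collapse into the four classes (A, 2), (B, 1), (C, 0) and (D, 0).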
For a labeled directed graph G = (V, E, L), we define ASSG as GASS =
(VASS, EASS, LASS, RASS), where: (1) VASS denotes the set of nodes obtained by col-
lapsing the nodes of each equivalence class, (2) EASS is the set of edges,
(3) LASS gives the labels of the nodes in VASS, and (4) RASS records the rank of each
v ∈ VASS.
Obviously, ASSG is the minimum pattern that can describe the labeled directed
graph data, because nodes with the same label and rank are collapsed. Unfor-
tunately, the process of measuring ranks loses some descendants or ancestors
of nodes. This case does not conform to the definition of bisimulation, and
thus brings out wrong answers for subgraph matching. For example, in Fig. 1(b),
the nodes A1 and A2 in the same equivalence class have different children. To
solve this problem, ASSG adaptively adjusts its structure for changing query
graphs.
For each subgraph matching, the procedure of adaptively updating ASSG
includes two stages: matching and partitioning. Given a query graph Q =
(VQ, EQ, LQ) and an ASSG GASS = (VASS, EASS, LASS, RASS), assume that
RQ = {rank(vQ) | vQ ∈ VQ}. For the matching stage, ∀v, u ∈ VQ, ∃v′, u′ ∈ VASS:
if LQ(v) = LASS(v′), LQ(u) = LASS(u′), and RQ(v) − RQ(u) = RASS(v′) −
RASS(u′), then v and u match v′ and u′, respectively. For the partitioning stage, the
nodes in ASSG matching the current query graph are partitioned into different
parts according to their neighbors by the algorithm presented in [5], with a time com-
plexity of O(|E| log |VQ|). In Fig. 1(c), ASSG does not change while matching
Q1, but it changes to the structure shown in Fig. 1(d) while matching
Q2. It is obvious that the size of ASSG increases after further partitioning, but
each partitioning adjusts a minimum number of nodes. When subgraph matching
focuses on frequent nodes, ASSG remains stable.
3 Experimental Evaluation
In this section, we report experiments on both real-world and synthetic data
sets to verify the performance of ASSG.
Table 1. Compress Ratio of Gr and ASSG
Data Set |G| < |V |, |E|, |L| > Gr ASSG(15%)
California 60K<24K, 32K, 95> 49.22% 33.25%
Internet 530K<96K, 421K, 50> 42.41% 17.08%
Citation 1.7M<815K, 806K, 67> 31.71% 5.83%
Synthetic 2.6M<1.4M, 2.1M, 60> 26.9% 3.73%
Firstly, we use the compression ratio as a measurement for evaluating the effec-
tiveness of ASSG for subgraph matchings compared with Gr. We define the com-
pression ratio of ASSG as CASS = |VASS|/|V|. Similarly, the compression ratio
of Gr is CGr = |Vr|/|V|. The lower the ratio, the better. The effectiveness of
ASSG compared with Gr is reported in Table 1, where |G| denotes the size
of the graph data. For a query graph Gq = (Vq, Eq, Lq), the compression ratio of
ASSG is decided by the number of labels |Lq| in the query graph. Assuming
that |Lq| = 15% × |L|, we can observe from Table 1 that graph data
can be highly compressed by ASSG according to query graphs. ASSG reduces graph
data by 85% on average. The compression ratio of ASSG is lower than that of Gr.
Secondly, we evaluate the efficiency of updating ASSG, assuming that the number
of labels in the query graph is 15% of |L|. We generate two query graphs for
updating ASSG. The numbers of repeated labels in these two graphs are 0, 1,
2 and 5 respectively, as Table 2 shows. We can observe that the more repeated labels
in different query graphs, the less time it takes to update ASSG. As a
result, for frequent subgraph matchings, ASSG can be updated and maintained
at a low time cost.
Table 2. Time Occupations of Updating ASSG (s)
Data Set 0 repeated label 1 repeated label 2 repeated labels 5 repeated labels
California 8.95 2.96 2.79 2.73
Internet 28.64 25.42 21.29 9.9
Citation 55.49 53.7 47.1 6.35
Synthetic 113.47 101.32 91.24 33.73
4 Conclusion and Future work
We have proposed ASSG, an adaptive structural summary for RDF graph data.
ASSG is based on equivalence classes of nodes, and it compresses graph
data according to the query graphs. We presented the main idea for constructing
and updating ASSG and designed experiments on real-world and synthetic data
sets to evaluate the effectiveness and efficiency of our technique. Experimental
results show that the compression ratio of ASSG is lower than that of the existing
work Gr and that ASSG is efficiently updated for frequent queries. Furthermore, we
will use ASSG for optimizing SPARQL queries on RDF data for the Semantic Web.
Acknowledgments. This work is supported by National Natural Science Foun-
dation of China under Grant No. 61170184, 61402243, the National 863 Project
of China under Grant No. 2013AA013204, National Key Technology R&D Pro-
gram under Grant No.2013BAH01B05, and the Tianjin Municipal Science and
Technology Commission under Grant No.13ZCZDGX02200, 13ZCZDGX01098
and 13JCQNJC00100.
References
1. T. Neumann, G. Weikum: The RDF-3X engine for scalable management of RDF data.
VLDB J., 19(1), 91–113, 2010.
2. Z. Sun, H. Wang, H. Wang, B. Shao, J. Li: Efficient subgraph matching on billion
node graphs. The VLDB Journal, 5(9), 788–799 (2012)
3. W. Fan, J. Li, X. Wang, Y. Wu: Query preserving graph compression. In: ACM SIG-
MOD International Conference on Management of Data, pp. 157–168. ACM, New
York (2012)
4. A. Dovier, C. Piazza, A. Policriti: A fast bisimulation algorithm. In: Conference on
Computer Aided Verification, pp. 79–90. Springer-Verlag Berlin Heidelberg (2001)
5. R. Paige, R. E. Tarjan, R. Bonic: A linear time solution to the single function coars-
est partition problem. Theoretical Computer Science, 40(1), 67–84 (1985)
Evaluation of String Normalisation Modules for String-based
Biomedical Vocabularies Alignment with AnAGram
Anique van Berne, Veronique Malaisé
A.vanBerne@Elsevier.com V.Malaise@Elsevier.com
Elsevier BV Elsevier BV
Abstract: Biomedical vocabularies have specific characteristics that make their
lexical alignment challenging. We have built a string-based vocabulary alignment
tool, AnAGram, dedicated to efficiently compare terms in the biomedical domain, and
evaluate this tool’s results against an algorithm based on Jaro-Winkler’s edit-distance.
AnAGram is modular, enabling us to evaluate the precision and recall of different
normalization procedures. Globally, our normalization and replacement strategy im-
proves the F-measure score from the edit-distance experiment by more than 100%.
Most of this increase can be explained by targeted transformations of the strings, with
the use of a dictionary of adjective/noun correspondences yielding useful results.
However, we found that the classic Porter stemming algorithm needs to be adapted to
the biomedical domain to give good quality results in this area.
1. Introduction
Elsevier has a number of online tools in the biomedical domain. Improving their
interoperability involves aligning the vocabularies these tools are built on. The vo-
cabulary alignment tool needs to be generic enough to work with any of our vocabu-
laries, but each alignment requires specific conditions to be optimal, due to vocabular-
ies’ specific lexical idiosyncrasies.
We have designed a modular, step-wise alignment tool: AnAGram. Its normaliza-
tion procedures are based on previous research[1], basic Information Retrieval nor-
malization processes, and our own observations. We chose a string-based alignment
method as these perform well on the anatomical datasets of the OAEI campaign[1],
and string-based alignment is an important step in most methods identified in [3][4].
We compare the precision and recall of AnAGram against an implementation of
Jaro-Winkler’s edit-distance method (JW)[7] and evaluate the precision of each step
of the alignment process. We gain over 100% F-measure compared to the edit-
distance method. We evaluate the contribution and quality of the string normalization
modules independently and show that the Porter stemmer[2] does not give optimal
results in the biomedical domain.
In Section 2 we present our use-case: aligning Dorland’s to Elsevier’s Merged
Medical Taxonomy (EMMeT)1. Section 3 describes related work in vocabulary
1 http://river-valley.tv/elsevier-merged-medical-taxonomy-emmet-from-smart-content-to-smart-collection/
alignment in the biomedical domain. Sections 4 and 5 present AnAGram and evaluate it
against Jaro-Winkler's edit-distance. Section 6 presents future work and conclusions.
2. Use case: Dorland’s definition alignment to EMMeT
Elsevier’s Merged Medical Taxonomy (EMMeT) is used in “Smart Content” applica-
tions2; it contains more than 1 million biomedical concepts and their hierarchical,
linguistic and semantic relationships. We aim at expanding EMMeT with definitions
from the authoritative biomedical dictionary Dorland’s3 by aligning them.
3. Related work
Cheatham and Hitzler[1] list the types of linguistic processes used by at least one
alignment tool in the Ontology Alignment Evaluation Initiative (OAEI)[5]. AnAGram
implements all syntactic linguistic transformations listed; instead of a generic syno-
nym expansion system, we used a correspondence dictionary of adjective/noun pairs.
This dictionary is a manually curated list based on information automatically extract-
ed from Dorland's. It contains pairs that would not be solved by stemming, such as
saturnine/lead. Ambiguous entries, such as gluteal/natal, were removed.
Chua and Kim’s[6] approach for string-based vocabulary alignment is the closest
to AnAGram: they use WordNet4, a lexical knowledge base, to gather adjective/noun
pairs to improve the coverage of their matches, after using string normalization steps;
our set of pairs is larger than the one derived from WordNet.
4. AnAGram: biomedical vocabularies alignment tool
AnAGram was built for use on a local system5, and is tuned for performance by us-
ing hash-table lookups to find matches. Currently, no partial matching is possible. The
matching steps are built in a modular way: one can select the set of desired steps. The
source taxonomy is processed using these steps and the target taxonomy is processed
sequentially: the alignment stops at the first match. Modules are ordered by increasing
distance between the original and the transformed string, simulating a confidence value.
Exact matching: corresponds to JW edit-distance 1.
Normalization: special characters are removed or transformed (Sjögren’s syndrome
to Sjogren’s syndrome; punctuation marks to space), string is lower cased.
Stop word removal: tokenization by splitting on spaces, removal of stop words, us-
ing a list that was fine-tuned over several rounds of indexing with EMMeT.
2 http://info.clinicalkey.com/docs/Smart_Content.pdf
3 http://www.dorlands.com/
4 http://wordnet.princeton.edu/
5 Dell™ Precision™ T7500, 2x Intel® Xeon® CPU E5620 2.4 GHz processors, 64 GB RAM.
Software: Windows 7 Professional 64 bit, Service Pack 1; Perl v5.16.3
Re-ordering: tokens are sorted alphabetically, enabling matches for inverted terms.
Substitution: sequences of tokens are replaced with the corresponding value from our
dictionary, applying a longest string matching principle.
Stemming: using the Porter stemming algorithm [2] (Perl module Lingua::Stem::Snowball).
The substitution step is then repeated, using stemmed dictionary entries.
Independent lists: stop-words list and substitution dictionary are independent files.
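As a rough illustration of this modular, first-match design, the Python sketch below chains simplified versions of the steps above; the stop-word list and substitution dictionary are placeholder data, NLTK's Porter stemmer stands in for the Perl Snowball module, and details such as multi-token substitutions and the repeated substitution on stemmed entries are omitted.

import re
from nltk.stem import PorterStemmer   # stand-in for the Perl Lingua::Stem::Snowball module

STOP_WORDS = {"of", "the", "nos"}                   # placeholder stop-word list
SUBSTITUTIONS = {"saturnine": "lead"}               # placeholder adjective/noun dictionary
stemmer = PorterStemmer()

def normalise(term):
    term = term.lower().replace("ö", "o").replace("’", "'")   # illustrative character folding
    return re.sub(r"[^\w\s']", " ", term)                     # punctuation marks to space

def keys(term):
    """Yield successively more aggressive match keys for a term, mirroring the module order."""
    yield term                                                # exact match on the raw string
    tokens = normalise(term).split()
    yield " ".join(tokens)                                    # normalised
    tokens = [t for t in tokens if t not in STOP_WORDS]
    yield " ".join(tokens)                                    # stop words removed
    tokens = sorted(tokens)
    yield " ".join(tokens)                                    # re-ordered
    tokens = [SUBSTITUTIONS.get(t, t) for t in tokens]
    yield " ".join(tokens)                                    # substituted (single tokens only)
    yield " ".join(stemmer.stem(t) for t in tokens)           # stemmed

def align(source_terms, target_terms):
    """Hash-table alignment: the first key of a source term found in the target index wins."""
    index = {}
    for t in target_terms:
        for k in keys(t):
            index.setdefault(k, t)
    return {s: next((index[k] for k in keys(s) if k in index), None) for s in source_terms}

print(align(["Syndrome, Sjögren's"], ["Sjogren's syndrome"]))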
5. Experimentation and results
We align EMMeT version 3.2 (13/12/13) (1,027,717 preferred labels) to Dorland’s
32nd edition (115,248 entries). We evaluate AnAGram as a whole against JW, with a
0.92 threshold (established experimentally). The JW implementation can work only
with preferred labels.
To evaluate the recall of AnAGram vs the JW implementation, we use a manual
gold set of 115 mappings created by domain experts (Table 1). AnAGram gives better
recall and better precision than the JW method.
               Correct mapping   Incorrect mapping   Recall (%)   Precision (%)   F-measure
Jaro-Winkler   46                8                   43%          85%             0.57
AnAGram        80                3                   71%          96%             0.82
Table 1 - Results of AnAGram vs. Jaro-Winkler on Dorland's Gold Set pairs
We evaluate a random sample of 25 non-exact alignments from each module to get
a better insight into AnAGram's normalization process. The results are either Correct,
Related (useful but not exactly correct), or Incorrect (Table 2 and Figure 1). AnA-
Gram gives more correct results but JW is useful for finding related matches.
Table 2 – Results for AnAGram's modules (C: correct; R: related; I: incorrect)
Preferred labels      C    R    I
Jaro-Winkler          16   40   44
AnAGram non-exact     77   14   9
Normalised            25   0    0
No stop words         16   3    6
Word order            25   0    0
Substituted           16   9    0
Stemmed               11   11   3
Subst. & stem         13   7    5
Figure 1 - Quality of matches returned by AnAGram's modules (correct / related / incorrect).
We evaluate the performance of each normalization step by evaluating 25 ran-
dom results for each of AnAGram's modules separately6 (Table 2, Figure 1). Normal-
ization does very well (100% correct results). Removal of stop words causes some
errors and related matches: single-letter stop words can be meaningful, like A for
hepatitis A. Word order rearranging ranks second: it does not often change the mean-
ing of the term. Substitution performs reasonably well; most of the non-correct results
are related matches. Stemming gives the poorest results, with false positives due to
nouns/verbs stemmed to the same root, such as ciliated/ciliate. The substituted and
stemmed matches have results similar to the stemmed results. Still, even the worst
results from any AnAGram module are better than the overall results of the non-exact
matches from the JW algorithm. One reason for this is that JW does not stop the
alignment at the best match, but delivers everything that satisfies the threshold of
0.92.
6 Some modules are based on the result of a previous transformation, so the later the module
comes in the chain, the more complicated matches it faces.
Not all modules account for an equal portion of the non-exact results. The nor-
malization module delivers around 70% of matches, stemming accounts for 15 to 20%
and the other modules account for 2% to 4% of the matches each.
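The kind of stemming collision discussed above is easy to reproduce with an off-the-shelf Porter stemmer; the short Python snippet below uses NLTK's implementation (as a stand-in for the Perl module AnAGram uses) on the example terms.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("ciliated"), stemmer.stem("ciliate"))
# the two distinct terms are reduced to a common root, so they would become an
# exact match after the stemming module -- the false-positive case described above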
6. Future work and conclusion
Our results are good compared to the OAEI large biomedical vocabulary alignment re-
sults for string-based tools [1]. We will work on the stemming algorithm, the im-
provement of our stop word list and substitution dictionary, and on adding an opti-
mized version of the JW algorithm as a final optional module for AnAGram to im-
prove results further. In this way we will benefit from additional related matches in
cases where no previous match was found.
References
[1] Michelle Cheatham, Pascal Hitzler. String Similarity Metrics for Ontology Alignment. In-
ternational Semantic Web Conference (ISWC2013) (2) 2013: 294-309
[2] Cornelis J. van Rijsbergen, Stephen E. Robertson, Martin F. Porter. New models in proba-
bilistic information retrieval. London: British Library. (British Library Research and Develop-
ment Report, no. 5587), 1980
[3] Jérôme Euzenat (Coordinator) et al. State of the art on Ontology alignment. Knowledge
Web D 2.2.3, 2004.
[4] Jérôme Euzenat, Pavel Shvaiko. Ontology Matching. Springer-Verlag, Berlin Heidelberg
2013
[5] Jérôme Euzenat, Christian Meilicke, Heiner Stuckenschmidt, Pavel Shvaiko, Cássia Tro-
jahn. Ontology Alignment Evaluation Initiative: Six Years of Experience. Journal on Data Se-
mantics XV, Lecture Notes in Computer Science (6720) 2011: 158-192
[6] Watson W. K. Chua and Jung-Jae Kim. BOAT: Automatic alignment of biomedical ontolo-
gies using term informativeness and candidate selection. Journal of Biomedical Informatics
(45) 2012: 337-349
[7] William E. Winkler. String Comparator Metrics and Enhanced Decision Rules in the
Fellegi-Sunter Model of Record Linkage. Proceedings of the Section on Survey Research
Methods (American Statistical Association) 1990: 354–359
[Figure: system architecture — Keywords → Keyword-Element Mapping → Graph Index Exploration / Top-K Query Computation → Element-Query Mapping → Conjunctive Queries, built on a Keyword Index and a Summary Graph produced by Keyword Indexing, Graph Summarization and Data Preprocessing over RDF(S) Data.]
[Figure: summarization example — an RDF graph with resources typed zpid:Author, zpid:Publication, zpid:Librarian and zpid:Metadata (zpid:ErichW zpid:wrote zpid:Article-01; zpid:JuergenW zpid:wrote zpid:Abstract-02) is summarized into a graph over zpid:Person, zpid:Author, zpid:Article, zpid:Librarian and zpid:Metadata connected by zpid:wrote and zpid:workedOn.]
Supporting SPARQL Update Queries
in RDF-XML Integration
Nikos Bikakis1 * Chrisa Tsinaraki2 Ioannis Stavrakantonakis3
Stavros Christodoulakis4
1 NTU Athens & R.C. ATHENA, Greece
2 EU Joint Research Center, Italy
3 STI, University of Innsbruck, Austria
4 Technical University of Crete, Greece
Abstract. The Web of Data encourages organizations and companies to publish
their data according to the Linked Data practices and offer SPARQL endpoints.
On the other hand, the dominant standard for information exchange is XML. The
SPARQL2XQuery Framework focuses on the automatic translation of SPARQL
queries in XQuery expressions in order to access XML data across the Web. In
this paper, we outline our ongoing work on supporting update queries in the
RDF–XML integration scenario.
Keywords: SPARQL2XQuery, SPARQL to XQuery, XML Schema to OWL,
SPARQL update, XQuery Update, SPARQL 1.1.
1 Introduction
The SPARQL2XQuery Framework, which we have previously developed [6], aims to
bridge the heterogeneity issues that arise in the consumption of XML-based sources
within the Semantic Web. In our working scenario, mappings between RDF/S–OWL and
XML sources are automatically derived or manually specified. Using these mappings,
the SPARQL queries are translated on the fly into XQuery expressions, which access
the XML data. Therefore, the current version of SPARQL2XQuery provides read-only
access to XML data. In this paper, we outline our ongoing work on extending the
SPARQL2XQuery Framework towards supporting SPARQL update queries.
Both SPARQL and XQuery have recently standardized their update operation seman-
tics in the SPARQL 1.1 and XQuery Update Facility, respectively. We have studied the
correspondences between the update operations of these query languages, and we de-
scribe the extension of our mapping model and the SPARQL-to-XQuery translation
algorithm towards supporting SPARQL update queries.
Similarly to the motivation of our work, in the RDB–RDF interoperability scenario,
D2R/Update [1] (a D2R extension) and OntoAccess [2] enable SPARQL update queries
over relational databases. Regarding the XML–RDB–RDF interoperability scenario
[5], the work presented in [3] extends the XSPARQL language [4] in order to support
update queries.
* This work is partially supported by the EU/Greece funded KRIPIS: MEDA Project.
2 Translating SPARQL Update Queries to XQuery
This section describes the translation of SPARQL update operations into XQuery ex-
pressions using the XQuery Update Facility. We present how similar methods and al-
gorithms previously developed in the SPARQL2XQuery Framework can be adopted
for the update operation translation. For instance, graph pattern and triple pattern trans-
lation are also used in the update operation translation. Note that, due to space limita-
tions, some issues are presented in a simplified way in the rest of this section and several
details are omitted.
Table 1 presents the SPARQL update operations and summarizes their translation into
XQuery. In particular, there are three main categories of SPARQL update operations: a)
Delete Data; b) Insert Data; and c) Delete/Insert. For each update operation, a simplified
SPARQL syntax template is presented, as well as the corresponding XQuery expres-
sions. In the SPARQL context, we assume the following sets: let tr be an RDF triple set, tp
a triple pattern set, trp a set of triples and/or triple patterns, and gp a graph pattern.
Additionally, in XQuery, we denote as xEW, xEI and xED the sets of XQuery expressions
(i.e., FLWOR expressions) that have resulted from the translation of the graph patterns
included in the Where, Insert and Delete SPARQL clauses, respectively. Let xE be a set
of XQuery expressions; xE($v1, $v2, …, $vn) denotes that xE uses (as input) the values
assigned to the XQuery variables $v1, $v2, …, $vn. Finally, xn denotes an XML fragment,
i.e., a set of XML nodes, and xp denotes an XPath expression.
Table 1. Translation of the SPARQL Update Operations in XQuery

Operation: DELETE DATA
  SPARQL syntax template1: Delete data{ tr }
  Translated XQuery expressions:
    delete nodes collection("http://dataset...")/xp1
    ...
    delete nodes collection("http://dataset...")/xpn

Operation: INSERT DATA
  SPARQL syntax template: Insert data{ tr }
  Translated XQuery expressions:
    let $n1 := xn1
    ...
    let $nn := xnn
    let $data1 := ($nk, $nm, ...)    // k, m, ... ∈ [1,n]
    ...
    let $datap := ($nj, $nv, ...)    // j, v, ... ∈ [1,n]
    let $insert_location1 := collection("http://xmldataset...")/xp1
    ...
    let $insert_locationp := collection("http://xmldataset...")/xpp
    return(
      insert nodes $data1 into $insert_location1,
      ...
      insert nodes $datap into $insert_locationp
    )

Operation: DELETE / INSERT
  SPARQL syntax templates:
    (a) Delete{ trp }Where{ gp }
    (b) Insert{ trp }Where{ gp }
    (c) Delete{ trp }Insert{ trp }Where{ gp }
  Translated XQuery expressions:
    (a) let $where_gp := xEW
        let $delete_gp := xED($where_gp)
        return delete nodes $delete_gp
    (b) let $where_gp := xEW
        let $insert_location1 := xp1
        for $it1 in $insert_location1
          xEI($where_gp, $it1)
          return insert nodes into $it1
        ...
        let $insert_locationn := xpn
        for $itn in $insert_locationn
          xEI($where_gp, $itn)
          return insert nodes into $itn
    (c) Translate the Delete/Where part as in (a), then the Insert/Where part as in (b).

1 For simplicity, the WITH, GRAPH and USING clauses are omitted.
In the following examples, we assume that an RDF source has been mapped to an XML
source. In particular, we assume the example presented in [6], where an RDF and an
XML source describing persons and students have been mapped. Here, due to space
limitations, we just outline the RDF and XML concepts, as well as the mappings that are
involved in the following examples. In RDF, we have a class Student having several
datatype properties, e.g., FName, E-mail, Department, GivenName, etc. In XML, we have
an XML complex type Student_type, having an attribute SSN and several simple ele-
ments, e.g., FirstName, Email, Dept, GivenName, etc. Based on the XML structure, the stu-
dents' elements appear in the /Persons/Student path. We assume that the Student class
has been mapped to the Student_type and the RDF datatype properties to the corresponding
XML elements.
Delete Data. The Delete Data SPARQL operation removes a set of triples from RDF
graphs. This SPARQL operation can be translated in XQuery using the Delete Nodes
XQuery operation. Specifically, using the predefined mappings, the set of triples tr de-
fined in the SPARQL Delete Data clause is transformed (using an approach similar
to the BGP2XQuery algorithm [6]) into a set of XPath expressions XP. For each xpi ∊ XP,
an XQuery Delete Nodes operation is defined.
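Before turning to the concrete examples, here is a rough, hypothetical sketch in Python of this triples-to-XPath mapping step; the mapping table and the subject resolution are invented stand-ins for the framework's RDF-to-XML mappings, not part of SPARQL2XQuery itself.

```python
# Illustrative sketch only: the property-to-XPath mapping table and the
# subject resolution are hypothetical stand-ins for the framework's mappings.
PROPERTY_TO_ELEMENT = {
    "ns:FName": "FirstName",
    "ns:E-mail": "Email",
}

def subject_xpath(subject_iri):
    # Hypothetical: the real framework derives this from the RDF-to-XML
    # mappings; here we hard-code the running example.
    return '/Persons/Student[./@SSN=1209]'

def delete_data_to_xquery(triples, collection_uri="http://xml.gr"):
    """Translate a DELETE DATA triple set into XQuery 'delete nodes' statements."""
    statements = []
    for subject, predicate, literal in triples:
        element = PROPERTY_TO_ELEMENT[predicate]       # mapped XML element
        xpath = f'{subject_xpath(subject)}/{element}[.= "{literal}"]'
        statements.append(f'delete nodes collection("{collection_uri}"){xpath}')
    return "\n".join(statements)

if __name__ == "__main__":
    triples = [
        ("http://rdf.gr/person1209", "ns:FName", "John"),
        ("http://rdf.gr/person1209", "ns:E-mail", "john@smith.com"),
    ]
    print(delete_data_to_xquery(triples))
```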
In this example, two RDF triples are deleted from an RDF graph. In
addition to the mappings described above, we assume that the person
"http://rdf.gr/person1209" in RDF data has been mapped to the person
"/Persons/Student[.@SSN=1209]" in XML data.
SPARQL Delete Data query:
  Delete data{
    ns:FName "John" .
    ns:E-mail "john@smith.com" .
  }
Translated XQuery query:
  delete nodes collection("http://xml.gr")/Persons/Student[.@SSN=1209]/FirstName[.= "John"]
  delete nodes collection("http://xml.gr")/Persons/Student[.@SSN=1209]/Email[.= "John@smith.com"]
Insert Data. The Insert Data SPARQL operation adds a set of new triples to RDF
graphs. This SPARQL operation can be translated in XQuery using the Insert Nodes
XQuery operation. In the Insert Data translation, the set of triples tr defined in SPARQL
is transformed into XML node sets xni, using the predefined mappings. In particular,
a set of Let XQuery clauses is used to build the XML nodes and define the appropriate
node nesting and grouping. Then, the location of the XML node insertion can be easily
determined considering the triples and the mappings. Finally, the constructed nodes are
inserted at their insertion locations using the XQuery Insert Nodes clause.
In this example, the RDF triples deleted in the previous example are re-
inserted in the RDF graph.
SPARQL Insert Data query:
  Insert data{
    ns:FName "John" .
    ns:E-mail "john@smith.com" .
  }
Translated XQuery query:
  let $n1 := John
  let $n2 := john@smith.com
  let $data1 := ($n1, $n2)
  let $insert_location1 := collection("http://xml.gr")/Persons/Student[.@SSN=1209]
  return insert nodes $data1 into $insert_location1
Insert / Delete. The Delete/Insert SPARQL operations are used to remove and/or add a
set of triples from/to RDF graphs, using the bindings that resulted from the evaluation
of the graph pattern defined in the Where clause. According to the SPARQL 1.1 seman-
tics, the Where clause is the first one that is evaluated. Then, the Delete/Insert clause is
applied over the produced results. In particular, in case both Delete and Insert opera-
tions exist, the deletion is performed before the insertion, and the Where clause is eval-
uated only once. The Delete and the Insert SPARQL operations can be translated to XQuery
using the Delete Nodes and Insert Nodes operations, respectively. In brief, initially the
graph pattern used in the Where clause is translated to XQuery expressions xEW (simi-
larly as in the GP2XQuery algorithm [6]). Then, the graph pattern used in the Delete/In-
sert clause is translated to XQuery expressions xED/xEI (as is also done in the BGP2XQuery
algorithm [6]), using the bindings that resulted from the evaluation of xEW.
In this example, the Where clause selects all the students studying in a
computer science (CS) department. Then, the Delete clause deletes all the triples that
match its triple patterns, using the ?student bindings determined from the Where
clause. In particular, from all the retrieved students (i.e., CS students), the students
whose first name is "John" are deleted.
SPARQL Delete query:
  Delete{
    ?student ns:FName "John" .
  }Where{
    ?student ns:Department "CS" .
  }
Translated XQuery query:
  let $where_gp := collection("http://xml.gr")/Persons/Student[./Dept="CS"]
  let $delete_gp := $where_gp[./FirstName="John"]
  return delete nodes $delete_gp
In this example, the Where clause selects all the students studying in a CS
department, as well as their first names. Then, the Insert clause creates new triples
according to its triple patterns, using the ?student and ?name bindings determined
from the Where clause. In particular, a new triple, having as predicate "ns:GivenName"
and as object the first name of the ?student, is inserted for each ?student.
SPARQL Insert query:
  Insert{
    ?student ns:GivenName ?name .
  }Where{
    ?student ns:FName ?name .
    ?student ns:Department "CS" .
  }
Translated XQuery query:
  let $where_gp := collection("http://xml.gr")/Persons/Student[./Dept="CS"]
  let $insert_location1 := $where_gp
  for $it1 in $insert_location1
  let $insert_gp1 := {fn:string($it1/FirstName)}
  return insert nodes $insert_gp1 into $it1
References
1. Eisenberg V., Kanza Y.: "D2RQ/update: updating relational data via virtual RDF". In WWW
2012
2. Hert M., Reif G., Gall H. C.: "Updating relational data via SPARQL/update". In EDBT/ICDT
Workshops 2010.
3. Ali M.I., Lopes N., Friel O., Mileo A.: "Update Semantics for Interoperability among XML,
RDF and RDB". In APWeb 2013
4. Bischof S., Decker S., Krennwallner T., Lopes N., Polleres A.: "Mapping between RDF and
XML with XSPARQL". J. Data Semantics 1(3), (2012)
5. Bikakis N., Tsinaraki C., Gioldasis N., Stavrakantonakis I., Christodoulakis S.: "The XML
and Semantic Web Worlds: Technologies, Interoperability and Integration. A survey of the
State of the Art". In Semantic Hyper/Multi-media Adaptation: Schemes and Applications,
Springer 2013
6. Bikakis N., Tsinaraki C., Stavrakantonakis I., Gioldasis N., Christodoulakis S.: "The
SPARQL2XQuery Interoperability Framework". World Wide Web Journal (WWWJ), 2014
CURIOS: Web-based Presentation and
Management of Linked Datasets
Hai H. Nguyen1 , Stuart Taylor1 , Gemma Webster1 , Nophadol Jekjantuk1 ,
Chris Mellish1, Jeff Z. Pan1, and Tristan ap Rheinallt2
1 dot.rural Digital Economy Hub, University of Aberdeen, Aberdeen AB24 5UA, UK
2 Hebridean Connections, Ravenspoint, Kershader, Isle of Lewis HS2 9QA, UK
1 Introduction
A number of systems extend the traditional web and Web 2.0 technologies by
providing some form of integration with semantic web data [1,2,3]. These ap-
proaches build on tested content management systems (CMSs) to support
users on the Semantic Web. However, instead of directly managing existing linked
data, these systems provide a mapping from their own data model to linked
datasets using an RDF or OWL vocabulary. This sort of integration can be seen
as a read-only or write-only approach, where linked data is either imported into or
exported from the system. The next step in this evolution of CMSs is a full
integration with linked data: allowing ontology instances, already published as
linked data, to be directly managed using widely used web content manage-
ment platforms. The motivation is to keep data (i.e., linked data repositories)
loosely coupled to the tool used to maintain them (i.e., the CMS).
In this poster we extend [3], a query builder for SPARQL, with an update
mechanism to allow users to directly manage their linked data from within the
CMS. To make the system sustainable and extensible in the future, we choose to
use Drupal as the default CMS and develop a module to handle query/update
against a triple store. Our system, which we call a Linked Data Content Man-
agement System (Linked Data CMS) [4], performs similar operations to those of
a traditional CMS but whereas a traditional CMS uses a data model of content
types stored in some relational database back end, a Linked Data CMS per-
forms CRUD (create, read, update and delete) operations on linked data held in
a triple store. Moreover, we show how the system can assist users in producing
and consuming linked data in the cultural heritage domain and introduce two case
studies used for system evaluation.
2 Using CURIOS
We introduce CURIOS, an implementation of a Linked Data CMS.3 A dataset
managed by CURIOS needs to have a structure described by an OWL ontology
that imports a small CURIOS “upper ontology”. It must relate some of its classes
and properties to constructs in that ontology. This has the benefit that they can
3 Available open-source at https://github.com/curiosproject/curios.
be recognised and treated specially by the generated website. For instance, an
image can be presented in a special way (see Fig. 1) if it is an instance of the
hc:ImageFile class and its URL is provided by the hc:URL property.
Once the ontology is defined, it is necessary to provide a separate description
of which parts of the data (and the level of detail) are to be managed by the web-
site. This description takes the form of an application-dependent configuration
file which is loaded as part of the Drupal module. This file describes the classes,
fields, and relationships to be shown in the website and how these relate to the
constructs of the ontology [4]. Although the configuration file could be generated
automatically, it is a declarative description and can easily be edited by hand.
The configuration file centralises the maintenance of the structure of the CMS
with respect to the ontology, e.g., if a new type of page is required, the user
can update the configuration and then run the Linked Data CMS mapping to
create the required Drupal entities. Additionally our approach can handle some
changes to the schema of the ontology. For example if a change in the ontology
occurs, such as a domain/range, additional classes or a change of URIs, then the
configuration can be reloaded to synchronise Drupal with the ontology schema.
When the CURIOS Drupal module is initialised, it automatically creates a
set of Drupal resources and Views based on the configuration file, along with
an additional set of pages allowing the linked data to be maintained via CRUD
operations. Drupal site administrators can then maintain the website generated
by the configuration in the same way as a regular Drupal site.
2.1 Browsing and Update Functionalities
Figure 1: Details of a croft
CURIOS allows users to browse and update their linked data in a triple
store directly without transforming RDF triples to Drupal content and vice
versa. A CURIOS record consists of a set of RDF triples where the subject
is a unique URI representing the record identifier. For browsing, depending on
250
3
different conditions, CURIOS presents data in different ways. For instance, a
list of records or details of a record (see Fig. 1) will be displayed depending
on whether the record URI is provided. To navigate between linked individuals,
object properties of an RDF individual are presented as hyperlinks to other
records instead of the normal text used for datatype properties.
Users are also able to create, update, or delete a record via a user-friendly
GUI. Firstly, CURIOS assists users in entering data by providing different widgets
depending on the datatype the user wants to edit (Fig. 2a). For instance, with
geographical coordinates, a map is displayed to allow users to choose a loca-
tion rather than to type in the coordinates as text. Secondly, to prevent users
from entering incorrect values for some special properties such as an occupation
or a type of place, an auto-complete widget is provided. Thirdly, in the cultural
heritage domain it is typical that temporal data such as dates are rather vague
and not recorded in a consistent format. To facilitate users during the data
entry process, CURIOS provides a simple treatment of vague dates by introdu-
cing the hc:DateRange class, which consists of two datetime datatype properties:
hc:dateFrom and hc:dateTo. A user can enter an exact date or a vague date
such as a year, a season in a year, a decade, a century, etc., and CURIOS can
convert the vague date into an appropriate instance of hc:DateRange. Finally,
to manage object properties (i.e., links) between individuals, CURIOS allows
property add and remove operations as presented in Fig. 2b, which are then
mapped onto corresponding SPARQL update queries, e.g., INSERT and DELETE,
to insert and remove the appropriate triples.
Figure 2: Updating Linked Data in CURIOS. (a) Updating special datatypes such as dates, geographical coordinates, etc. (b) Adding/removing object properties.
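As an illustration of the last point, a minimal sketch (not the CURIOS source code) of how an add/remove action on an object property could be turned into a SPARQL 1.1 update; the record and property URIs are hypothetical.

```python
# Minimal sketch (not the CURIOS source code) of mapping an "add/remove
# object property" GUI action to a SPARQL 1.1 update. URIs are invented.
def object_property_update(record_uri, prop_uri, target_uri, add=True):
    """Build a SPARQL update that links (add=True) or unlinks two records."""
    keyword = "INSERT DATA" if add else "DELETE DATA"
    return f"{keyword} {{ <{record_uri}> <{prop_uri}> <{target_uri}> . }}"

print(object_property_update(
    "http://example.org/record/1234",
    "http://example.org/ontology#livedAt",
    "http://example.org/record/5678",
    add=True,
))
# -> INSERT DATA { <.../record/1234> <...#livedAt> <.../record/5678> . }
```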
2.2 Use of the Triple Store
Although in principle we could use any SPARQL 1.1 compliant triple store /
SPARQL server, in practice we are using Jena Fuseki [5]. The reasoner in Fuseki
251
4
creates a complete graph of all the consequences of the stated information when
a SPARQL query is presented, and this is kept in a memory cache. Unfortu-
nately, this reasoning has to be repeated after an update has been performed,
and especially with complex updates, this can take an unreasonable amount
of time that can affect the website's responsiveness. Also, although one ideally
wants to show a user all the inferred information in order that they have an ac-
curate model of the system’s knowledge, if they are allowed to specify arbitrary
updates on this then they may remove a piece of inferred information which is
then re-inferred whenever the reasoner is next invoked. For these two reasons,
we perform all updates via a Fuseki endpoint where no reasoning takes place. A
second endpoint, where reasoning is enabled, is used for normal browsing. With
this method, the information shown for browsing gradually becomes out of date
as nothing prompts the recomputation of the inference graph. This is overcome
by allowing the user to explicitly invoke the recomputation or by having a pro-
cess outside of the website causing the recomputation at regular time intervals.
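The dual-endpoint arrangement can be sketched as follows; the dataset names and URLs are illustrative, and the recomputation of the inference graph is left to the application, as described above.

```python
# Sketch of the dual-endpoint pattern (illustrative only; dataset names and
# URLs are hypothetical). Updates go to an endpoint with reasoning disabled,
# while browsing queries go to an endpoint with reasoning enabled.
from urllib import request, parse

UPDATE_ENDPOINT = "http://localhost:3030/records-raw/update"       # no reasoning
QUERY_ENDPOINT  = "http://localhost:3030/records-inferred/query"   # reasoning enabled

def run_update(sparql_update):
    data = parse.urlencode({"update": sparql_update}).encode()
    request.urlopen(request.Request(UPDATE_ENDPOINT, data=data))

def run_query(sparql_query):
    data = parse.urlencode({"query": sparql_query}).encode()
    req = request.Request(QUERY_ENDPOINT, data=data,
                          headers={"Accept": "application/sparql-results+json"})
    with request.urlopen(req) as resp:
        return resp.read().decode()

if __name__ == "__main__":
    run_update('INSERT DATA { <http://example.org/r1> '
               '<http://www.w3.org/2000/01/rdf-schema#label> "A record" . }')
    print(run_query("SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"))
```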
3 Case Studies and Future Work
To test the generality of our system, we conducted two case studies, one involving
historical societies based in the Western Isles of Scotland (Hebridean Connec-
tions) and another with the local historical group at Portsoy, a fishing village
located in the North East of Scotland. The dataset used in the Hebridean Con-
nections case study consists of over 45,000 records with about 850,000 RDF
triples (before inference), incorporated within a relatively simple OWL onto-
logy. The Portsoy case study uses a similar ontology with 1,370 records and
23,258 RDF triples (before inference). The Drupal website which we have built
with the software is already being used by Hebridean Connections at
http://www.hebrideanconnections.com.
In the future we plan to make the system easier to set up for naïve users as well
as to evaluate our system with different SPARQL servers/RDF stores.
Acknowledgements The research described here is supported by the award
made by the RCUK Digital Economy programme to the dot.rural Digital Eco-
nomy Hub; award reference: EP/G066051/1.
References
1. Krötzsch, M., Vrandecic, D., Völkel, M.: Semantic MediaWiki. In: ISWC. (2006)
2. Corlosquet, S., Delbru, R., Clark, T., Polleres, A., Decker, S.: Produce and consume
Linked Data with Drupal! In: ISWC. (2009)
3. Clark, L.: SPARQL Views: A Visual SPARQL Query Builder for Drupal. In Polleres,
A., Chen, H., eds.: ISWC Posters&Demos. Volume 658 of CEUR Workshop Pro-
ceedings., CEUR-WS.org (2010)
4. Taylor, S., Jekjantuk, N., Mellish, C., Pan, J.Z.: Reasoning driven configuration of
linked data content management systems. In: JIST 2013. (2013)
5. Seaborne, A.: Fuseki: serving RDF data over HTTP. http://jena.apache.org/
documentation/serving_data/ (2011) Accessed: 2012-10-27.
The uComp Protégé Plugin for Crowdsourcing
Ontology Validation
Florian Hanika1 , Gerhard Wohlgenannt1 , and Marta Sabou2
1 WU Vienna
{florian.hanika,gerhard.wohlgenannt}@wu.ac.at
2 MODUL University Vienna
marta.sabou@modul.ac.at
Abstract. The validation of ontologies using domain experts is expen-
sive. Crowdsourcing has been shown to be a viable alternative for many knowl-
edge acquisition tasks. We present a Protégé plugin and a workflow for
outsourcing a number of ontology validation tasks to Games with a Pur-
pose and paid micro-task crowdsourcing.
Keywords: Protégé plugin, ontology engineering, crowdsourcing, hu-
man computation
1 Introduction
Protégé3 is a well-known free and open-source platform for ontology engineering.
Protégé can be extended with plugins using the Protégé Development Kit. We
present a plugin for crowdsourcing ontology engineering tasks, as well as the
underlying technologies and workflows. More specifically, the plugin supports
outsourcing of some typical ontology validation tasks (see Section 2.2) to Games
with a Purpose (GWAP) and paid-for crowdsourcing.
The research question our work focuses on is how to integrate ontology en-
gineering processes with human computation (HC), to study which tasks can
be outsourced, how this affects the quality of the ontological elements, and to
provide tool support for HC. This paper concentrates on the integration pro-
cess and tool support. As manual ontology construction by domain experts is
expensive and cumbersome, HC helps to decrease cost and increase scalability
by distributing jobs to multiple workers.
2 The uComp Protégé Plugin
The uComp Protégé Plugin allows the validation of certain parts of an ontology,
which makes it useful in any setting where the quality of an ontology is ques-
tionable, for example if an ontology was generated automatically with ontology
learning methods, or if a third-party ontology needs to be evaluated before use.
This section covers the uComp API, and the uComp Protégé plugin (function-
ality and installation).
3
protege.stanford.edu
2.1 The uComp API
The Protégé plugin sends all validation tasks to the uComp HC API. Depending
on the settings, the API further delegates the tasks to a GWAP or to Crowd-
Flower4 . CrowdFlower is a platform for paid micro-task crowdsourcing. The
uComp API5 currently supports classification tasks (other task types are under
development). The API user can create new HC jobs, cancel jobs, and collect
results from the service. All communication is done via HTTP and JSON.
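To give a feel for this interaction style, the following sketch is purely hypothetical: the base URL, resource paths and JSON fields are invented for illustration, and the real uComp API (see footnote 5) defines its own.

```python
# Hypothetical sketch of the HTTP/JSON interaction pattern described above.
# The endpoint URL, resource names and JSON fields are invented; the real
# uComp API (see tinyurl.com/mkarmk9) defines its own.
import json
from urllib import request

API_BASE = "https://ucomp.example.org/api"   # placeholder, not the real base URL
API_KEY = "abcdefghijklmnopqrst"             # the key from the settings file

def create_classification_job(question, items, judgments_per_unit=5):
    """Create a new human-computation classification job and return its id."""
    payload = json.dumps({
        "api_key": API_KEY,
        "type": "classification",
        "question": question,
        "units": items,
        "judgments_per_unit": judgments_per_unit,
    }).encode("utf-8")
    req = request.Request(f"{API_BASE}/jobs", data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["job_id"]

def collect_results(job_id):
    """Poll the service for the aggregated judgments of a job."""
    with request.urlopen(f"{API_BASE}/jobs/{job_id}/results") as resp:
        return json.load(resp)

if __name__ == "__main__":
    job = create_classification_job(
        "Is this class relevant for the domain 'Finance'?",
        ["bond", "currency", "dollar"],
    )
    print(collect_results(job))
```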
2.2 The plugin
The plugin supports the validation of various parts of an ontology: relevance of
classes, subClassOf relations, domain and range axioms, instanceOf relations,
etc. The general usage pattern is as follows: the user selects the respective part
of the ontology, provides some information for the crowdworkers, and submits
the job. As soon as available, the results are presented to the user.
Fig. 1. Class relevance check for class bond including results.
Class relevance check For the sake of brevity, we only describe the Class
Relevance Check and SubClass Relation Validation in some detail. The other
task types follow a very similar pattern. The Class Relevance Check helps to decide if a
given class (or a set of classes) – based on the class label – is relevant for the given
domain. Figure 1 shows an example class relevance check for the class bond. After
selecting a class, the user can enter an ontology domain (here: Finance) to validate
against, and give additional advice to the crowdworkers. Furthermore, (s)he can
choose between the GWAP and CrowdFlower for validation. If CrowdFlower is
4
www.crowdflower.com
5
tinyurl.com/mkarmk9
254
The uComp Protégé Plugin for Crowdsourcing Ontology Validation 3
selected, the expected cost of the job can be calculated. The validate subtree
option allows the user to validate not only the current class, but also all its subclasses
(recursively). To validate the whole ontology in one go, the user selects the root
class (Thing) and marks the validate subtree option. When available, the results
of the HC task are presented in a textbox. In Figure 1 only one judgment was
collected – the crowdworker stated that class bond is relevant for the domain.
Validation of SubClass Relations With this component, a user can ask the
crowd if there exists a subClass relation between a given class and its super-
classes.
Fig. 2. Validation of class dollar and its superclass currency.
Similar to the class relevance check, users can set the ontology domain, and
choose CrowdFlower or GWAP (“uComp-Quiz”). In Figure 2 the subClass rela-
tion between dollar and currency is evaluated. Before sending a job to CrowdFlower, the
expected costs can be calculated as the number of units (elements to evaluate) mul-
tiplied by the number of judgments per unit and the payment per judgment.
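As a concrete reading of that cost formula, here is a minimal sketch; the per-judgment price used is an invented example value.

```python
def expected_cost(num_units, judgments_per_unit, payment_per_judgment):
    """Expected cost of a paid job: units x judgments per unit x price per judgment."""
    return num_units * judgments_per_unit * payment_per_judgment

# e.g., validating 40 subClassOf relations with 5 judgments each at 0.02 per judgment
print(expected_cost(40, 5, 0.02))  # -> 4.0 (in the job's currency units)
```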
2.3 Installation and Configuration
As the uComp plugin is part of the official Protégé repository, it can easily
be installed from within Protégé via File → Check for plugins → Downloads.
To configure and use the plugin, the user needs to create a file named
ucomp_api_settings.txt in the folder .Protege. The file contains the uComp API
key6, the number of judgments per unit which will be collected, and the payment
per judgment (if using CrowdFlower), for example: abcdefghijklmnopqrst,5,2
6 For API requests see tinyurl.com/mkarmk9
Detailed information about the functionality, usage and installation of the plugin
is provided with the plugin documentation.
3 Related Work
Human computation outsources computing steps to humans, typically for prob-
lems computers cannot solve (yet). Together with altruism, fun (as in GWAPs)
and monetary incentives are central ways to motivate humans to participate.
Early work in the field of GWAPs was done by von Ahn [1]. Games have suc-
cessfully been used, for example, in ontology alignment [6] or to verify class defi-
nitions [3]. Micro-task crowdsourcing has recently become very popular in knowledge ac-
quisition and natural language processing, and has also been integrated into the
popular NLP framework GATE [2]. A number of studies show that crowdworkers
provide results of similar quality as domain experts [4, 5].
4 Conclusions
In this paper we introduce a Protégé plugin for validating ontological elements,
and its integration into a human computation workflow. The plugin delegates
validation tasks to a GWAP or to CrowdFlower and displays the results to the
user. Future work includes an extensive evaluation of various aspects: HC work-
flows in ontology engineering, quality of crowdsourcing results, and the usability
of the plugin itself.
Acknowledgments. The work presented was developed within project uComp,
which receives the funding support of EPSRC EP/K017896/1, FWF 1097-N23,
and ANR-12-CHRI-0003-03, in the framework of the CHIST-ERA ERA-NET.
References
1. von Ahn, L.: Games With a Purpose. Computer 39(6), 92 –94 (2006)
2. Bontcheva, K., Roberts, I., Derczynski, L., Rout, D.: The GATE Crowdsourcing
Plugin: Crowdsourcing Annotated Corpora Made Easy. In: Proc. of the 14th Con-
ference of the European Chapter of the Association for Computational Linguistics
(EACL). ACL (2014)
3. Markotschi, T., Voelker, J.: Guess What?! Human Intelligence for Mining Linked
Data. In: Proceedings of the Workshop on Knowledge Injection into and Extraction
from Linked Data (KIELD) at the International Conference on Knowledge Engi-
neering and Knowledge Management (EKAW) (2010)
4. Noy, N.F., Mortensen, J., Musen, M.A., Alexander, P.R.: Mechanical Turk As an
Ontology Engineer?: Using Microtasks As a Component of an Ontology-engineering
Workflow. In: Proc. 5th ACM WebSci Conf. pp. 262–271. WebSci ’13, ACM (2013)
5. Sabou, M., Bontcheva, K., Scharl, A., Föls, M.: Games with a Purpose or Mecha-
nised Labour?: A Comparative Study. In: Proc. of the 13th Int. Conf. on Knowledge
Management and Knowledge Technologies. pp. 1–8. i-Know ’13, ACM (2013)
6. Siorpaes, K., Hepp, M.: Games with a Purpose for the Semantic Web. IEEE Intel-
ligent Systems 23(3), 50–60 (2008)
Frame-Semantic Web: a Case Study for Korean*
Jungyeul Park†‡ , Sejin Nam‡ , Youngsik Kim‡
Younggyun Hahm‡ , Dosam Hwang‡§ , and Key-Sun Choi‡
† UMR 6074 IRISA, Université de Rennes 1, France
‡ Semantic Web Research Center, KAIST, Republic of Korea
§ Department of Computer Science, Yeungnam University, Republic of Korea
http://semanticweb.kaist.ac.kr
Abstract. FrameNet itself can become a resource for the Semantic Web.
It can be represented in RDF. However, mapping FrameNet to other
resources such as Wikipedia for building a knowledge base is becoming
common practice. Through such mappings, FrameNet can be considered to
provide the capability to describe the semantic relations between RDF data.
Since the FrameNet resource has proven very useful, multiple global
projects for other languages have arisen over the years, parallel to the
original English FrameNet. Accordingly, significant steps have been made to
further develop FrameNet for Korean. This paper presents how frame
semantics becomes a frame-semantic web. We also report the Wikipedia
coverage of Korean FrameNet lexicons in the context of constructing a
knowledge base from sentences in Wikipedia, to show the usefulness of
our work on frame semantics in the Semantic Web environment.
Keywords: Semantic Web, Frame Semantics, FrameNet, Korean FrameNet.
1 Introduction
FrameNet [1]1 is a both human- and machine-readable large-scale on-line lexical
database, consisting not only of thousands and thousands of words and sentences
but also of an extensive and complex range of semantic information.
Based on a theory of meaning called frame semantics, FrameNet strongly sup-
ports the idea that the meanings of words and sentences can be best understood
on the basis of a semantic frame, a coherent conceptual structure of a word
describing a type of event, relation, or entity and the participants in it. It is be-
lieved that the semantic frames of related concepts are inseparable from each other,
so that one cannot have a complete understanding of a word without knowledge
of all the semantic frames related to that word. FrameNet itself serves as a great
example of such a principle, wherein 1,180 semantic frames are closely linked together
by a system of semantic relations and provide a solid basis for reasoning about
the meaning of the entire text.
* This work was supported by the IT R&D program of MSIP/KEIT. [10044494,
WiseKB: Big data based self-evolving knowledge base and reasoning platform]
1 https://framenet.icsi.berkeley.edu
FrameNet itself can become a resource for the Semantic Web as represented
in RDF/OWL [2, 3]. Mapping FrameNet to other resources such as Wikipedia
for building a knowledge base can also be considered to provide capability to
describe the semantic relations between RDF data. Since the FrameNet resource
has been proven useful in the development of a number of other NLP appli-
cations, even in the Semantic Web environment such as in [4], multiple global
projects have arisen over the years, parallel to the original English FrameNet,
for a wide variety of languages around the world. In addition to Brazilian Por-
tuguese2 , French3 , German (the SALSA Project)4 , Japanese5 , Spanish6 , and
Swedish7 , significant steps were made to further develop FrameNet for Korean,
and the following sections of this paper present the process and mechanisms.
By using FrameNet, the Semantic Web can become a frame-semantic web, where frame semantics
is enabled for the Semantic Web. We also report the Wikipedia coverage of
Korean FrameNet lexicons in the context of constructing a knowledge base from
sentences in Wikipedia. This shows how the frame-semantic web would be useful
in the Semantic Web environment.
2 Building a Database of Frame Semantic Information
for Korean
We describe the manual construction of a FrameNet-style annotated corpus for
Korean translated from the FrameNet corpus and its FE transfer based on
English-Korean alignment using cross-linguistic projection proposed in [5, ?].
We also explain this process by using the translated Korean FrameNet corpus
and its counterpart English corpus as our bilingual parallel corpus. We propose
a method for mapping a Korean LU to an existing FrameNet-defined frame to
acquire a Korean frame semantic lexicon. Finally, we illustrate a self-training
technique that can build a database of large-scale frame semantic information
for Korean.
Manual Construction: The development of FrameNet for Korean has been
the central goal of our project, and we have chosen to perform this task by
starting off with “manually translating” the already-existing FrameNet from
English to Korean. Such a decision was made on the grounds that, even
though obtaining a large set of data by means of manual translation can be
a difficult, costly and time-consuming process, its expected advantages indeed
far outweigh the cost in the long run. The fact that only humans can really
develop a true understanding and appreciation of the complexities of languages,
subject knowledge and expertise, creativity and cultural sensitivity also makes
manual translation the best option to adopt for our project. Expert translators
2 http://www.ufjf.br/framenetbr
3 https://sites.google.com/site/anrasfalda
4 http://www.coli.uni-saarland.de/projects/salsa
5 http://jfn.st.hc.keio.ac.jp
6 http://sfn.uab.es/SFN
7 http://spraakbanken.gu.se/eng/swefn
performed the manual translation for the whole FrameNet full-text annotated corpus with
a word alignment recommendation system. A guideline manual for translating
the FrameNet-style annotated corpus into Korean sentences was prepared for the
clean transfer of English FrameNet annotated sentences to Korean.
Automatic Construction: We also extend previous approaches described
in [5] using a bilingual English-Korean parallel corpus. Assuming that the same
kinds of frame elements (FEs) exist for each frame for the English and Korean
sentences, we achieve the cross-linguistic projection of English FE annotation to
Korean via alignment of tokenized English and Korean sentences. English FE re-
alization can be projected to its corresponding Korean sentences by transforming
consecutive series of Korean tokens in the Korean translation of any given sen-
tence. Since the alignment of English tokens to Korean tokens defines the trans-
formation, the success of token alignment is crucial for the cross-linguistic pro-
jection process. For frame population to Korean lexical units (LUs), we present
our method for the automatic creation of the Korean frame semantic lexicon for
verbs in this section. We start by finding an appropriate translation for each verb
to create a mapping between a Korean LU and an existing FrameNet-defined
frame. In contrast to mapping from one sense to one frame, mapping to more
than one frame requires using a further disambiguation process to select the most
probable frame for a given verb. We use maximum likelihood estimation (MLE)
for possible frames from the existing annotated corpora to select the correct
frame. For the current work, we only used FrameNet's lexicographic annotation
to estimate MLE. We use the Sejong predicate dictionary8 for frame semantic
lexicon acquisition. We place 16,807 Korean verbs in FrameNet-defined frames,
which constitute 12,764 distinctive orthographic units in Korean. We assume
that FEs with respect to the assigned frame for Korean LUs are directly equiv-
alent to the FEs in the corresponding English frames. Thus, we do not consider
redefining FEs specifically for Korean.
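A minimal sketch of the MLE-based frame selection described above; the annotation input is a simplified list of (verb, frame) pairs rather than the actual FrameNet annotation format.

```python
from collections import Counter, defaultdict

def build_frame_counts(annotations):
    """annotations: iterable of (verb, frame) pairs taken from annotated corpora."""
    counts = defaultdict(Counter)
    for verb, frame in annotations:
        counts[verb][frame] += 1
    return counts

def most_likely_frame(verb, counts):
    """Pick the frame with the maximum likelihood estimate for a given verb."""
    if verb not in counts or not counts[verb]:
        return None
    frame, _ = counts[verb].most_common(1)[0]
    return frame

# toy example: a verb seen three times in one frame and once in another
counts = build_frame_counts([
    ("run", "Self_motion"), ("run", "Self_motion"),
    ("run", "Self_motion"), ("run", "Operating_a_system"),
])
print(most_likely_frame("run", counts))  # -> Self_motion
```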
Bootstrapping Frame-Semantic Information: Self-training for frame se-
mantic role projection consists of annotating FrameNet-style semantic informa-
tion, inducing word alignments between two languages, and projecting semantic
information of the source language onto the target language. We used the bilin-
gual parallel corpus for self-training, and a probabilistic frame-semantic parser
[6] to annotate semantic information of the source language (English). Then, we
induced an HMM word alignment model between English and Korean with a
statistical machine translation toolkit. Finally, we projected semantic role in-
formation from the English onto the Korean sentences. For the experiment, we
employed a large bilingual English-Korean parallel corpus, which contains al-
most 100,000 bilingual parallel sentences to bootstrap the semantic information.
During self-training, errors in the original model would be amplified in the new
model; thus, we calibrate the results of the frame-semantic parser by using the
confidence score of the frame-semantic parser as a threshold. As a result, 120,621
pairs of frames with their FEs are obtained and among them 30,149 are unique;
715 frames are used for 10,898 different lexica.
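The calibration step can be sketched as follows; the record fields and the threshold value are illustrative assumptions, not the actual output format of the frame-semantic parser [6].

```python
def calibrate(parses, threshold=0.8):
    """Keep only frame annotations whose parser confidence passes the threshold.

    parses: list of dicts like {"sentence_id": ..., "frame": ..., "confidence": ...}
    """
    return [p for p in parses if p["confidence"] >= threshold]

parses = [
    {"sentence_id": 1, "frame": "Commerce_buy", "confidence": 0.93},
    {"sentence_id": 2, "frame": "Motion", "confidence": 0.41},
]
print(calibrate(parses))  # only the high-confidence annotation survives
```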
8 http://www.sejong.or.kr
3 Linking FrameNet to Wikipedia
DBpedia9 is a knowledge base constructed from Wikipedia based on the DBpedia
ontology (DBO). DBO can be viewed as a vocabulary to represent knowledge
in Wikipedia. However, DBO is a Wikipedia-Infobox-driven ontology. That is,
although DBO is suitable for representing the essential information of Wikipedia, it is
not sufficient to represent knowledge in Wikipedia written in natural
language. To overcome this problem, FrameNet has been considered useful
at the linguistic level as a language resource representing semantics. We calculate
the Wikipedia coverage rate of DBO and FrameNet's LUs to match the relation
instantiation from DBpedia and FrameNet to Wikipedia. Before we calculate
the Wikipedia coverage rate, we need to know which sentences within Wikipedia
actually contain knowledge. We define that a typical sentence with extractable
knowledge can be linked to DBpedia entities as a triple. From almost three
million sentences in Korean Wikipedia, we find over four million predicates
for cases where only a subject appears, only an object appears, or both a
subject and an object appear (2.11 predicates per sentence). We obtain 6.92%
and 95.19% coverage for DBO and FrameNet's LUs, respectively. The low coverage of
DBO can be explained by the fact that its pre-defined predicates are too few to cover
the actual predicates used in Wikipedia. However, FrameNet gives almost full
coverage for sentences with extractable knowledge, which is very promising for
extracting and representing knowledge in Wikipedia using FrameNet.
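A small sketch of how such a coverage rate can be computed, assuming the extracted predicates and the two vocabularies are available as plain sets (toy data only).

```python
def coverage_rate(extracted_predicates, vocabulary):
    """Fraction of extracted predicate occurrences covered by a given vocabulary."""
    if not extracted_predicates:
        return 0.0
    covered = sum(1 for p in extracted_predicates if p in vocabulary)
    return covered / len(extracted_predicates)

# toy data standing in for the ~4 million predicates extracted from Korean Wikipedia
predicates = ["bear", "locate", "win", "marry", "found"]
dbo_properties = {"locate", "found"}                        # pretend DBO vocabulary
framenet_lus = {"bear", "locate", "win", "marry", "found"}  # pretend LU list

print(coverage_rate(predicates, dbo_properties))  # low coverage
print(coverage_rate(predicates, framenet_lus))    # near-full coverage
```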
4 Discussion and Conclusion
Throughout this paper, by building a database of frame semantic information,
we explained that FrameNet can become a resource for the Semantic Web and it
can gather lexical linked data and knowledge patterns with almost full coverage
for Wikipedia.
References
1. Ruppenhofer, J., Ellsworth, M., Petruck, M.R.L., Johnson, C.R., Scheffczyk, J.:
FrameNet II: Extended Theory and Practice. (2010)
2. Narayanan, S., Fillmore, C.J., Baker, C.F., Petruck, M.R.L.: FrameNet Meets the
Semantic Web: A DAML+OIL Frame Representation. In: Proc. of AAAI-02.
3. Narayanan, S., Baker, C.F., Fillmore, C.J., Petruck, M.R.L.: FrameNet Meets the
Semantic Web: Lexical Semantics for the Web. In: ISWC 2003.
4. Fossati, M., Tonelli, S., Giuliano, C.: Frame Semantics Annotation Made Easy with
DBpedia. In: Proc. of CrowdSem2013. 69–78
5. Padó, S., Lapata, M.: Cross-lingual Bootstrapping of Semantic Lexicons: The Case
of FrameNet. In: Proc. of AAAI-05.
6. Das, D., Schneider, N., Chen, D., Smith, N.A.: Probabilistic Frame-Semantic Pars-
ing. In: Proc. of NAACL 2010.
9 http://dbpedia.org/About
SparkRDF: Elastic Discreted RDF Graph
Processing Engine With Distributed Memory
Xi Chen, Huajun Chen, Ningyu Zhang, and Songyang Zhang
College of Computer Science, Zhejiang University,
Hangzhou 310027, China
{xichen,huajunsir,zxlzr,syzhang1991}@zju.edu.cn
Abstract. With the explosive growth of semantic data on the Web over
the past years, many large-scale RDF knowledge bases with billions of
facts are being generated. This poses significant challenges for the storage and
retrieval of big RDF graphs. In this paper, we introduce SparkRDF,
an elastic discreted semantic graph processing engine with distributed
memory. To reduce the high I/O and communication costs of distribut-
ed platforms, SparkRDF implements SPARQL querying based on Spark, a
novel in-memory distributed computing framework. All the intermediate
results are cached in the distributed memory to accelerate the process
of iterative joins. To reduce the search space and memory overhead,
SparkRDF splits the RDF graph into multi-layer subgraphs based on
the relations and classes. For SPARQL query optimization, SparkRDF
generates an optimal execution plan for join queries, leading to effective
reduction of the size of intermediate results, the number of joins and
the cost of communication. Our extensive evaluation demonstrates the
efficiency of our system.
Keywords: Big RDF Graph, SPARQL, SPARK, Distributed memory.
1 Introduction
With the development of Semantic technologies and Web 3.0, the amount of
Semantic Web data represented by the Resource Description Framework (RDF)
is increasing rapidly. Traditional RDF systems mainly face two challenges.
i) Scalability: the ability to process big RDF data. Most existing RDF systems
are based on a single node [4][1], which makes them vulnerable to the growth of the
data size because they usually need to load large indexes into limited memory.
ii) Real-time: the capacity to evaluate SPARQL queries over big RDF graphs in
near real time. For highly iterative SPARQL queries, existing MapReduce-based
RDF systems suffer from high I/O costs because they iteratively read and write
large intermediate results on disk [3].
In this paper, we introduce SparkRDF, an elastic discreted RDF graph pro-
cessing system with distributed memory. It is based on Spark, an in-memory
cluster computing system which is quite suitable for large-scale real-time itera-
tive computing jobs [5]. SparkRDF splits the big RDF graph into MESGs (Multi-
layer Elastic SubGraphs) based on relations and classes by creating 5 kinds of
indexes (C, R, CR, RC, CRC) with different granularities to cater for diverse triple pat-
terns (TPs). These index files, loaded on demand, are modeled as an RDSG (Resilient Discret-
ed SubGraph), a collection of in-memory semantic subgraph objects partitioned
across machines, which can implement SPARQL queries by a series of basic opera-
tors. All intermediate results (IRs), which are also regarded as RDSGs, remain
in the distributed memory to support further fast joins. Based on the query
model, several corresponding optimization tactics are then presented.
The remainder of this paper is organized as follows. Section 2 introduces
the index data model and iterative query model of SparkRDF. In Section 3,
we present the results of our experiments. Finally, we conclude and discuss the
future work in Section 4.
2 SparkRDF
2.1 Index Data Model: MESG
We create an index model called MESG based on relations and classes, which ex-
tends the traditional vertical partitioning solution by connecting class indexes with
predicate indexes; the goal is to construct a smaller index file for every TP in
a SPARQL query. At the same time, as it is uncertain whether the class informa-
tion about the entities is given in the SPARQL query, SparkRDF needs
a multi-layer elastic index scheme to meet the query needs of different kinds of
TPs. Specifically, we first construct the class indexes (C) and relation indexes (R).
Then a set of finer-grained index files (CR, RC, CRC) is created by joining the
two kinds of index files. All the index files are stored in HDFS.
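For illustration only (this is not the authors' implementation), the following PySpark sketch builds the two coarsest layers, a relation index (R) and a class index (C), from an N-Triples file and writes them to HDFS; paths and parsing are simplified assumptions.

```python
# Illustrative sketch only (not the authors' code): building a relation index
# (R) and a class index (C) from an N-Triples file with PySpark, writing one
# group per predicate/class to HDFS. File paths are assumptions.
from pyspark import SparkContext

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

def parse_triple(line):
    s, p, o = line.strip().rstrip(" .").split(" ", 2)
    return s, p, o

if __name__ == "__main__":
    sc = SparkContext(appName="mesg-index-sketch")
    triples = sc.textFile("hdfs:///data/lubm.nt").map(parse_triple)

    # R index: all (subject, object) pairs grouped per predicate
    r_index = triples.filter(lambda t: t[1] != RDF_TYPE) \
                     .map(lambda t: (t[1], (t[0], t[2])))

    # C index: all instances grouped per class
    c_index = triples.filter(lambda t: t[1] == RDF_TYPE) \
                     .map(lambda t: (t[2], t[0]))

    r_index.groupByKey().mapValues(list).saveAsTextFile("hdfs:///index/R")
    c_index.groupByKey().mapValues(list).saveAsTextFile("hdfs:///index/C")
    sc.stop()
```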
2.2 RDSG-based Iterative Query Model
For SparkRDF, all the index files and IRs can be modeled as a unified con-
cept called RDSG (Resilient Discreted SubGraph). It is a distributed memory
abstraction that lets us perform in-memory query computations on large clus-
ters by providing the following basic operators: RDSG_Gen, RDSG_Filter,
RDSG_Prepartition, and RDSG_Join. Figure 1 illustrates the RDSG-based query process.
Every job corresponds to one query variable.
2.3 Optimization techniques
Based on the data model and query model, several optimization strategies are
applied to improve query efficiency. First, TR-SPARQL (Type-Restrictive
SPARQL) passes a variable's implicit class information to the corresponding TPs that
contain the variable. It cuts down the number of tasks (by removing the TPs whose
predicate is rdf:type) and the cost of parsing every TP (by forming a more restrictive
index file). Then we use a selectivity-based greedy algorithm to design an optimal
execution order of TPs, greatly reducing the size of the IRs. Finally, location-
free prepartitioning is implemented to avoid the shuffling cost in the distributed
join. It ignores the partitioning information of the index files, while repartitioning
the data with the same join key to the same node.
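A toy sketch of a selectivity-based greedy ordering; the selectivity estimates are assumed to come from index statistics, and the actual cost model of SparkRDF may differ.

```python
def greedy_order(triple_patterns, estimated_selectivity):
    """Order triple patterns so that the most selective, join-connected ones come first.

    triple_patterns: list of (s, p, o) with variables written as '?x'.
    estimated_selectivity: dict mapping a pattern to its estimated result size.
    """
    remaining = list(triple_patterns)
    ordered, bound_vars = [], set()

    def variables(tp):
        return {t for t in tp if t.startswith("?")}

    while remaining:
        # prefer patterns that share a variable with what is already bound
        connected = [tp for tp in remaining if variables(tp) & bound_vars] or remaining
        best = min(connected, key=lambda tp: estimated_selectivity[tp])
        ordered.append(best)
        bound_vars |= variables(best)
        remaining.remove(best)
    return ordered

tps = [("?x", "rdf:type", "ub:GraduateStudent"),
       ("?x", "ub:takesCourse", "?c"),
       ("?c", "rdf:type", "ub:Course")]
sel = {tps[0]: 5_000, tps[1]: 200_000, tps[2]: 20_000}
print(greedy_order(tps, sel))  # most selective pattern first, then its neighbours
```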
3 Evaluation
We ran the experiments on a cluster with three machines. Each node has
16 GB DDR3 RAM and 8-core Intel Xeon(R) E5606 CPUs at 2.13 GHz. We com-
pare SparkRDF with the state-of-the-art centralized RDF-3X and the distributed
HadoopRDF. We run RDF-3X on one of the nodes; HadoopRDF and
SparkRDF were executed on the cluster. We use the widely-used LUBM dataset
with scales of 10000, 20000 and 30000 universities, consisting of 1.3, 2.7 and
4.1 billion triples. For the LUBM queries, we chose 7 representative queries which
are roughly classified into 2 categories: highly selective queries (Q4, Q5, Q6) and
unselective queries (Q1, Q2, Q3, Q7). A short description of the chosen queries is
provided in the Appendix.
Table 1 summarizes our comparison with HadoopRDF and RDF-3X (best
times are boldfaced). The first observation is that SparkRDF performs much
better than HadoopRDF for all queries. This can be mainly attributed to the
following three characteristics of SparkRDF: the finer granularity of the index scheme,
the optimal query order and effective memory-based joining. Another observation
is that SparkRDF outperformed RDF-3X in Q1, Q2, Q3 and Q7, while RDF-3X did
better in Q4, Q5 and Q6. The result conforms to our initial conjecture: RDF-3X can
achieve high performance for queries with high selectivity and bound objects or
subjects, while SparkRDF did well for queries with unbound objects or subjects,
low selectivity or large intermediate result joins. Another result is that RDF-3X
fails to answer Q1 and Q3 when the dataset size is 4.1 billion triples. On the
contrary, SparkRDF scales linearly and smoothly when the scale of the datasets
increases from 1.3 to 4.1 billion triples. This proves that SparkRDF has good
scalability.
Fig. 1. The Iterative Query Model of SparkRDF (per-variable jobs composed of RDSG_Gen, Prepartition, RDSG_Filter, RDSG_OP and RDSG_Join operations over index RDSGs and intermediate results)
Table 1. Performance comparison in seconds for SparkRDF (SRDF), HadoopRDF (HRDF) and RDF-3X (best times were boldfaced in the original). SRDF and HRDF ran on the cluster; RDF3X is the centralized system.

        LUBM-10000                 LUBM-20000                 LUBM-30000
        SRDF    HRDF     RDF3X     SRDF    HRDF     RDF3X     SRDF    HRDF     RDF3X
  Q1    478.5   8475.4   2131.4    1123.2  >3h      4380.3    1435.4  >3.5h    failed
  Q2    11.9    3425.2   13.8      25.8    >2h      28.9      40.3    >2.5h    43.5
  Q3    1.4     6869.7   24.6      1.4     >2.5h    90.7      1.4     >3h      failed
  Q4    14.4    11940.3  0.7       23.8    >4h      0.8       32.5    >8h      0.8
  Q5    6.8     2587.5   0.7       10.9    >1h      0.7       13.0    >3h      0.7
  Q6    10.3    7210.5   0.6       16.4    >2.5h    0.7       20.0    >3h      0.7
  Q7    54.6    1911.2   101.5     112.5   >0.7h    198.5     201.3   >1h      853.0
4 Conclusion and Future Work
In this paper, we introduced SparkRDF, a real-time, scalable big RDF graph
processing engine. We also gave experimental results to show the effective-
ness of SparkRDF. In the future, we would like to extend the work in a few
directions. First, we will handle more complex SPARQL patterns (such as OP-
TIONAL). Second, we will carry out a more complete and comprehensive experiment
to validate the efficiency of SparkRDF.
References
1. Atre, M., Chaoji, V., Zaki, M.J., Hendler, J.A.: Matrix Bit loaded: a scalable
lightweight join query processor for rdf data. In: Proceedings of the 19th inter-
national conference on World wide web. pp. 41–50. ACM (2010)
2. Guo, Y., Pan, Z.: LUBM: A benchmark for owl knowledge base systems. Web Se-
mantics: Science, Services and Agents on the World Wide Web 3(2), 158–182 (2005)
3. Husain, M., McGlothlin, J., Masud, M.M., Khan, L., Thuraisingham, B.: Heuristics-
based query processing for large rdf graphs using cloud computing. Knowledge and
Data Engineering, IEEE Transactions on 23(9), 1312–1327 (2011)
4. Neumann, T., Weikum, G.: The rdf-3x engine for scalable management of rdf data.
The VLDB Journal 19(1), 91–113 (2010)
5. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin,
M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: A fault-tolerant ab-
straction for in-memory cluster computing. In: Proceedings of the 9th USENIX
conference on Networked Systems Design and Implementation. pp. 2–2 (2012)
APPENDIX
We provide the SPARQL queries used in the experimental section:
Q1-Q6 are the same as [1]. Q7 corresponds to the Q14 of [2].
LEAPS: A Semantic Web and Linked data
framework for the Algal Biomass Domain
Monika Solanki1 and Johannes Skarka2
1 Aston University, UK
m.solanki@aston.ac.uk
2 Karlsruhe Institute of Technology, ITAS, Germany
johannes.skarka@kit.edu
Abstract. In this paper we present LEAPS, a Semantic Web and
Linked data framework for searching and visualising datasets from the
domain of algal biomass. LEAPS provides tailored interfaces to explore
algal biomass datasets via REST services and a SPARQL endpoint for
stakeholders in the domain of algal biomass. The rich suite of datasets
includes data about potential algal biomass cultivation sites, sources of
CO2, the pipelines connecting the cultivation sites to the CO2 sources
and a subset of the biological taxonomy of algae derived from the world's
largest online information source on algae.
1 Motivation
Recently the idea that algae biomass based biofuels could serve as an alternative
to fossil fuels has been embraced by councils across the globe. Major companies,
government bodies and dedicated non-profit organisations such as ABO (Algal
Biomass Organisation) 3 and EABA(European Algal Biomass Association)4 have
been pushing the case for research into clean energy sources including algae
biomass based biofuels.
It is quickly evident that because of extensive research being carried out,
the domain itself is a very rich source of information. Most of the knowledge is
however largely buried in various formats of images, spreadsheets, proprietary
data sources and grey literature that are not readily machine accessible/inter-
pretable. A critical limitation that has been identified is the lack of a knowledge
level infrastructure that is equipped with the capabilities to provide semantic
grounding to the datasets for algal biomass so that they can be interlinked,
shared and reused within the biomass community.
Integrating algal biomass datasets to enable knowledge representation and
reasoning requires a technology infrastructure based on formalised and shared
vocabularies. In this paper, we present LEAPS 5 , a Semantic Web/Linked data
framework for the representation and visualisation of knowledge in the domain
3 http://www.algalbiomass.org/
4 http://www.eaba-association.eu/
5 http://www.semanticwebservices.org/enalgae
of algal biomass. One of the main goals of LEAPS is to enable the stakeholders
of the algal biomass domain to interactively explore, via linked data, potential
algal sites and sources of their consumables across NUTS (Nomenclature of Units
for Territorial Statistics)6 regions in North-Western Europe.
Some of the objectives of LEAPS are:
– motivating the use of Semantic Web technologies and LOD for the algal biomass
domain.
– laying out a set of ontological requirements for knowledge representation
that support the publication of algal biomass data.
– elaborating on how algal biomass datasets are transformed to their corre-
sponding RDF model representation.
– interlinking the generated RDF datasets along spatial dimensions with other
datasets on the Web of data.
– visualising the linked datasets via an end user LOD REST Web service.
– visualising the scientific classification of the algae species as large network
graphs.
2 LEAPS Datasets
The transformation of the raw datasets to linked data takes place in two steps.
The first part of the data processing and the potential calculation are performed
in a GIS-based model which was developed for this purpose using ArcGIS 7
9.3.1. The second step of lifting the data from XML to RDF is carried out using
a bespoke parser that exploits XPath 8 to selectively query the XML datasets
and generate linked data using the ontologies.
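The lifting step can be sketched roughly as follows; the XML layout, vocabulary URIs and file name are assumptions for illustration and do not reflect the actual EnAlgae schemas.

```python
# Illustrative sketch of the XML-to-RDF lifting step (not the LEAPS parser):
# select nodes from an XML export with XPath-style queries and emit triples.
# The XML layout, vocabulary URIs and file names are assumptions.
import xml.etree.ElementTree as ET

VOCAB = "http://example.org/enalgae/ontology#"    # placeholder vocabulary
BASE = "http://example.org/enalgae/resource/site/"

def lift_sites(xml_path):
    """Yield N-Triples lines for each cultivation site element in the XML file."""
    root = ET.parse(xml_path).getroot()
    for site in root.findall(".//Site"):           # XPath-style selection
        site_id = site.get("id")
        subject = f"<{BASE}{site_id}>"
        yield f"{subject} <{VOCAB}name> \"{site.findtext('Name')}\" ."
        yield f"{subject} <{VOCAB}nutsRegion> \"{site.findtext('Region')}\" ."

if __name__ == "__main__":
    for triple in lift_sites("sites.xml"):
        print(triple)
```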
The transformation process yielded four datasets which were stored in dis-
tributed triple store repositories: Biomass production sites, CO2 sources, pipelines
and region potential. We stored the datasets in separate repositories to simulate
the realistic scenario of these datasets being made available by distinct and ded-
icated dataset providers in the future. While a linked data representation of the
NUTS regions data9 was already available, there was no SPARQL endpoint or
service to query the dataset for region names. We retrieved the dataset dump and
curated it in our local triple store as a separate repository. The NUTS dataset
was required to link the biomass production sites and the CO2 sources to re-
gions where they would be located and to the dataset about the region potential
of biomass yields. The transformed datasets interlinked resources defining sites,
CO2 sources, pipelines, regions and NUTS data using link predicates defined in
the ontology network.
Datasets about algae cultivation can become more meaningful and useful to
the biomass community if they are integrated with datasets about algal strains.
6 http://bit.ly/I7y5st
7 http://www.esri.com/software/arcgis/index.html
8 http://www.w3.org/TR/xpath/
9 http://nuts.geovocab.org/
This can help the plant operators in taking judicious decisions about which
strain to cultivate at a specific geospatial location. Algaebase10 provides the
largest online database of algae information. While Algaebase does not make
RDF versions of the datasets directly available through its website, they can
be programmatically retrieved via their LSIDs (Life Science Identifiers) from
the LSID Web resolver11 made available by the Biodiversity Information Standards
(TDWG)12 working group.
We retrieved RDF metadata for 113,061 species of algae13 and curated it in our
triple store. We then used the Semantic import plugin with Gephi to visualise
the biological taxonomy of the algae species.
3 System Description
LEAPS provides an integrated view over multiple heterogeneous datasets of po-
tential algal sites and sources of their consumables across NUTS regions in North-
Western Europe. Figure 1 illustrates the conceptual architecture of LEAPS. The
Fig. 1. Architecture of LEAPS
main components of the application are:
10
http://www.algaebase.org/about/
11
http://lsid.tdwg.org/
12
http://www.tdwg.org/
13
The retrieval algorithm ran on an Ubuntu server for three days
267
– Parsing modules: As shown in Figure 1, the parsing modules are responsi-
ble for lifting the data from their original formats to RDF. The lifting process
takes place in two stages to ensure uniformity in transformation.
– Linking engine: The linking engine along with the bespoke XML parser
is responsible for producing the linked data representation of the datasets.
The linking engine uses ontologies, dataset specific rules and heuristics to
generate interlinking between the five datasets. From the LOD cloud, we
currently provide outgoing links to DBpedia14 and Geonames15 .
– Triple store: The linked datasets are stored in a triple store. We use
OWLIM SE 5.0 16 .
– Web services: Several REST Web services have been implemented to pro-
vide access to the linked datasets.
– Ontologies: A suite of OWL ontologies for the algal biomass domain have
been designed and made available.
– Interfaces: The Web interface provides an interactive way to explore various
facets of sites, sources, pipelines, regions, ontologies and SPARQL endpoints.
The map visualisation has been rendered using Google maps. Besides the
SPARQL endpoint and the interactive Web interface, a REST client has
been implemented for access to the datasets. Query results are available in
RDF/XML, JSON, Turtle and XML formats.
4 Application access
LEAPS 17 is available on the Web. The interface currently provides visualisation
and navigation of the algae cultivation datasets in a way most intuitive for the
phycologists. The application has been demonstrated to several stakeholders of
the community at various algae-related workshops and congresses. They have
found the navigation very useful and made suggestions for future dataset ag-
gregation. At the time of this writing, data retrieval is relatively slow for some
queries because of their federated nature, however optimisation work on the
retrieval mechanism is in progress to enable faster retrieval of information.
Acknowledgments
The research described in this paper was partly supported by the Energetic Algae
project (EnAlgae), a 4 year Strategic Initiative of the INTERREG IVB North West
Europe Programme. It was carried out while the first author was a researcher at BCU,
UK.
References
14 http://dbpedia.org/About
15 http://sws.geonames.org/
16 http://www.ontotext.com/owlim/editions
17 http://www.semanticwebservices.org/enalgae
Bridging the Semantic Gap between RDF and
SPARQL using Completeness Statements
Fariz Darari, Simon Razniewski, and Werner Nutt
Faculty of Computer Science, Free University of Bozen-Bolzano, Italy
fariz.darari@stud-inf.unibz.it,{razniewski,nutt}@inf.unibz.it
Abstract. RDF data is often treated as incomplete, following the Open-
World Assumption. On the other hand, SPARQL, the standard query
language over RDF, usually follows the Closed-World Assumption, as-
suming RDF data to be complete. This gives rise to a semantic gap
between RDF and SPARQL. In this paper, we address how to close the
semantic gap between RDF and SPARQL in terms of certain answers
and possible answers using completeness statements.
Keywords: SPARQL, RDF, data completeness, OWA, CWA
1 Introduction
Due to its open and incremental nature, data on the Semantic Web is gen-
erally incomplete. This can be formalized using the Open-World Assumption
(OWA)[1]. SPARQL, on the other hand, interprets RDF data under closed-
world semantics. While for positive queries this does not create problems [2],
SPARQL queries with negation only make sense under closed-world semantics.
This ambiguous interpretation of RDF data poses the question of how to determine
which semantics is appropriate in a given situation.
As a particular example, suppose we want to retrieve all Oscar winners that
have tattoos. For obvious reasons, no sources on the Semantic Web contain
complete information about this topic and hence the Open-World Assumption
applies. On the other hand, suppose we want to retrieve all Oscar winners. Here,
an RDF version of IMDb1 would contain complete data and hence one may
intuitively apply closed-world reasoning.
In [3], completeness statements were introduced, which are metadata that
allow one to specify that the CWA applies to parts of a data source. We argue
that these statements can be used to understand the meaning of query results
in terms of certain and possible answers [4], which is especially interesting for
queries with negation: Suppose an RDF data source has the completeness state-
ment “complete for all Oscar winners”. The result of a SPARQL query for Oscar
winners contains not only all certain but also all possible answers. The result
of a SPARQL query for Oscar winners with tattoos contains only all certain
1 http://www.imdb.com/oscars/
answers. Moreover, the result of a SPARQL query for people with tattoos that
did not win an Oscar contains all certain answers, but not all possible answers.
The result of a SPARQL query for Oscar winners not having tattoos contains
all possible answers but none of the answers is certain.
In this paper, we discuss how to assess the relationship between certain an-
swers, possible answers and the results retrieved by SPARQL queries in the
presence of completeness statements.
2 Formalization
SPARQL Queries. A basic graph pattern (BGP) is a set of triple patterns [5].
In this work, we do not consider blank nodes. We define a graph pattern induc-
tively as follows: (1) a BGP P is a graph pattern; (2) for a BGP P , a NOT-EXISTS
pattern ¬P is a graph pattern; (3) for graph patterns P1 and P2 , P1 AND P2 is a
graph pattern. Any graph pattern P can be equivalently written as a conjunction
of a BGP (called the positive part and denoted as P^+) and several NOT-EXISTS
patterns { ¬P1, . . . , ¬Pn } (referred to as the negative part and denoted as P^-).
A query with negation has the form Q = (W, P) where P is a graph pattern and
W ⊆ var(P^+) is the set of distinguished variables. The evaluation of a graph
pattern P over a graph G is defined in [5]:
⟦P^+ AND ¬P1 AND . . . AND ¬Pn⟧_G = { μ | μ ∈ ⟦P^+⟧_G ∧ ∀i . ⟦μ(Pi)⟧_G = ∅ }
The result of evaluating (W, P) over a graph G is the restriction of ⟦P⟧_G to W.
This fragment of SPARQL queries is safe, that is, for a query with negation Q
and a graph G, the evaluation ⟦Q⟧_G returns finitely many answers. We assume all
queries to be consistent, that is, there is a graph G where ⟦Q⟧_G ≠ ∅.
Completeness Statements. Formally, a completeness statement C is of the
form Compl(P1 |P2 ) where P1 , called the pattern, and P2 , called the condition,
are BGPs. Intuitively, such a statement expresses that the source contains all
instantiations of P1 that satisfy condition P2 (e.g., all Golden Globe winners
that won an Oscar).
In line with the Open-World Assumption of RDF, for a graph G, we call any
graph G' such that G' ⊇ G an interpretation of G [2]. We associate to a completeness
statement C the CONSTRUCT query Q_C = (CONSTRUCT { P1 } { P1 AND P2 }). A
pair (G, G') of a graph and one of its interpretations satisfies a completeness
statement C, written (G, G') |= C, if ⟦Q_C⟧_G' ⊆ G holds. It satisfies a set C of
completeness statements, written (G, G') |= C, if it satisfies every element in C. A
set of completeness statements C entails a statement C, written C |= C, if for all
pairs (G, G') of a graph G and an interpretation G' of G such that (G, G') |= C,
it is the case that (G, G') |= C.
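
To make the satisfaction condition ⟦Q_C⟧_G' ⊆ G concrete, the following self-contained Python sketch represents triples as tuples and BGPs as lists of triple patterns (strings starting with "?" are variables), and checks whether a pair (G, G') satisfies a completeness statement Compl(P1 | P2). It is a toy illustration of the definition above, not the authors' implementation.

# Toy illustration of completeness-statement satisfaction (not the authors' code).
# Triples are 3-tuples of strings; strings starting with "?" are variables.
def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def match(pattern, triple, binding):
    # try to extend `binding` so that `pattern` matches `triple`
    b = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            if p in b and b[p] != t:
                return None
            b[p] = t
        elif p != t:
            return None
    return b

def eval_bgp(bgp, graph):
    # all solution mappings of a basic graph pattern over `graph`
    bindings = [{}]
    for pattern in bgp:
        bindings = [b2 for b in bindings for t in graph
                    for b2 in [match(pattern, t, b)] if b2 is not None]
    return bindings

def construct(template, bindings):
    # instantiate `template` under every solution mapping (a CONSTRUCT query)
    inst = lambda x, b: b.get(x, x) if is_var(x) else x
    return {tuple(inst(x, b) for x in pat) for b in bindings for pat in template}

def satisfies(G, G_prime, P1, P2):
    # (G, G') |= Compl(P1 | P2)  iff  [[Q_C]]_{G'} is a subset of G
    return construct(P1, eval_bgp(P1 + P2, G_prime)) <= set(G)

# G is complete for all Oscar winners; the interpretation G' adds no new winner.
G       = {("Bob", "won", "Oscar"), ("Bob", "has", "Tattoo")}
G_prime = G | {("Alice", "has", "Tattoo")}
print(satisfies(G, G_prime, [("?x", "won", "Oscar")], []))   # True

Checking entailment of a statement from a set of statements, in contrast, quantifies over all valid interpretations and is reduced to query containment [6]; the sketch does not attempt that.
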
Completeness statements restrict the set of interpretations:
Definition 1 (Valid Interpretation with Completeness Statements). Let
G be a graph and C be a set of completeness statements. An interpretation G' ⊇ G
is valid w.r.t. C iff (G, G') |= C. We write the set of all valid interpretations of
G w.r.t. C as int(G, C).
Definition 2 (Certain and Possible Answers with Completeness Statements).
Let Q be a query, G be a graph and C be a set of completeness statements. Then
the certain and possible answers of Q over G w.r.t. C are
CA^Q_C(G) = ⋂_{G' ∈ int(G,C)} ⟦Q⟧_G'   and   PA^Q_C(G) = ⋃_{G' ∈ int(G,C)} ⟦Q⟧_G',
respectively.
If C is empty, then CA^Q_C(G) and PA^Q_C(G) correspond to the classical certain and
possible answers [4]. With completeness statements, some query results may
correspond to the certain and possible answers while others may not.
Example 1 (Possible Answers). Consider again the query asking for people that
have tattoos. Since this is private information, we have no knowledge of how
complete our data source is, and hence the set of possible answers is (nearly) infinite:
Anyone not explicitly stated to have tattoos might still have tattoos.
On the other hand, consider the query for people who won an Oscar. It is
relatively easy for a data source to be complete on this topic, e.g., by comparison
with the Internet Movie Database (IMDb). If a data source is complete for all
Oscar winners, then there are no further possible answers.
The reasoning for queries with negation is more complex. By [2], under the OWA,
for a monotonic query Q and a graph G the certain answers are ⟦Q⟧_G, while
for queries with negation, the certain answers are empty. With completeness
statements, the certain answers of a query with negation can be non-empty.
Example 2 (Certain Answers). Consider first a query for people that won an
Oscar but no Golden Globe. If a data source is complete both for Oscar winners
and Golden Globe winners, then the query result contains all possible and all
certain answers.
On the other hand, a query for people with tattoos that did not win an Oscar
would only return certain answers, but not all possible answers, because there
are probably many more people with tattoos that did not win an Oscar.
The result of a query for people that won an Oscar and do not have tattoos
contains all possible answers but no certain answers, because we do not know
for certain which Oscar winners have tattoos and which do not.
We next define for each query some completeness statements that allow one to
capture the crucial information for getting certain or possible answers. Knowing
about the crucial statements helps, during data acquisition, to identify which data
is needed in order to achieve the desired answer semantics.
Definition 3 (Crucial Statements). For a query Q = (W, P), the positive
crucial statement of Q, denoted as C^+_Q, is the statement Compl(P^+ | true). The
set of negative crucial statements of Q, denoted as C^-_Q, is the set { Compl(P1 | P^+),
. . . , Compl(Pn | P^+) }, where P1, . . . , Pn are from the negative part P^- of Q.
The next theorems show that the crucial statements can be used to infer
relationships between certain answers, possible answers and SPARQL query re-
sults.
Theorem 1 (Bounded Possible Answers). Let C be a set of completeness
statements and Q be a positive query. Then
C |= C^+_Q implies that for all graphs G, PA^Q_C(G) = CA^Q_C(G) (= ⟦Q⟧_G).
While the equality ⟦Q⟧_G = CA^Q_C(G) always holds, the new insight is that ⟦Q⟧_G =
PA^Q_C(G). This means the query results cannot miss any information w.r.t. reality.
Theorem 2 (Queries with Negation). Let C be a set of completeness statements
and Q be a query. Then
1. C |= C^-_Q implies that for all graphs G, CA^Q_C(G) = ⟦Q⟧_G;
2. C |= C^-_Q ∧ C^+_Q implies that for all graphs G, PA^Q_C(G) = CA^Q_C(G) = ⟦Q⟧_G.
The first item means that if C |= C^-_Q, then every answer returned by the query
is a certain answer. The second item means that if additionally C |= C^+_Q, then
there also cannot be any other possible answers than those returned by ⟦Q⟧_G.
The completeness statement entailment problems can be solved using standard
query containment techniques [6].
3 Discussion
We have shown that in the presence of completeness statements, the semantics
of SPARQL may correspond to the certain answer or possible answer semantics.
Our work is based on the observation that parts of data on the Semantic Web are
actually complete. In future research, we would like to consider explicit negative
RDF knowledge and completeness statements over it as an alternative for getting
certain and possible answers. In this work, we assume all information contained
in a graph is correct. An interesting case is when this assumption does not hold
in general. We would like to investigate correctness statements as the dual of
completeness statements. A further study is also needed on an effective way of
maintaining completeness statements to cope with information changes.
One possible way is to add timestamps to completeness statements. An extended
version with proofs of this paper is available at http://arxiv.org/abs/1408.6395.
References
1. Raymond Reiter. On Closed World Data Bases. 1977.
2. Marcelo Arenas and Jorge Pérez. Querying Semantic Web Data with SPARQL. In
PODS, 2011.
3. Fariz Darari, Werner Nutt, Giuseppe Pirrò, and Simon Razniewski. Completeness
Statements about RDF Data Sources and Their Use for Query Answering. In ISWC
2013, pages 66–83. Springer Berlin Heidelberg, 2013.
4. Serge Abiteboul, Paris C. Kanellakis, and Gösta Grahne. On the Representation
and Querying of Sets of Possible Worlds. Theor. Comput. Sci., 78(1):158–187, 1991.
5. Steve Harris and Andy Seaborne. SPARQL 1.1 Query Language. Technical report,
W3C, 2013.
6. Simon Razniewski and Werner Nutt. Completeness of Queries over Incomplete
Databases. PVLDB, 4(11):749–760, 2011.
COLINA: A Method for Ranking SPARQL Query
Results through Content and Link Analysis
Azam Feyznia, Mohsen Kahani, Fattane Zarrinkalam
Computer Engineering Department, Ferdowsi University of Mashhad, Mashhad, Iran
azam.feyznia@stu-mail.um.ac.ir
kahani@um.ac.ir
fattane.zarrinkalam@stu-mail.um.ac.ir
Abstract. The growing amount of Linked Data increases the importance of se-
mantic search engines for retrieving information. Users often examine the first
few results among all returned results. Therefore, using an appropriate ranking
algorithm has a great effect on user satisfaction. To the best of our knowledge,
all previous methods for ranking SPARQL query results are based on popularity
calculation, and currently there is no method for calculating the relevance of
results to a SPARQL query. In contrast, the ranking method proposed in this paper
calculates both relevancy and popularity ranks for SPARQL query results
through content and link analysis, respectively. It calculates the popularity rank
by generalizing the PageRank method on a graph with two layers, data sources and
semantic documents. It also assigns weights automatically to different semantic
links. Further, the relevancy rank is based on the relevance of semantic docu-
ments with SPARQL query.
Keywords: Ranking, SPARQL, Semantic Search Engine, Link Analysis, Con-
tent Analysis, Semantic Web
1 Introduction
Structured data has enabled users to search the Semantic Web using SPARQL queries.
The increasing amount of structured data on the web has led to many results being
returned by a SPARQL query [1]. Further, since in most cases all returned results equally
satisfy the query conditions, checking all of them and finding the best answers takes too
much time. Therefore, semantic web search engines that provide a SPARQL endpoint
for processing and running SPARQL queries on their indexed data require mechanisms
for ranking SPARQL query results, besides the ranking methods applied to keyword
queries, to help users find their desired answers in less time.
In search engines, ranking is usually done by content and link analysis, and the final
rank for each result is calculated by combining the scores obtained from each analysis
algorithm [2-3]. The content analysis ranking algorithms calculate the relevancy between
each result and the user query in online mode. In the link analysis ranking algorithms,
popularity calculation is done in offline mode, before the user query is received, by
constructing data graph and analyzing the existing links in it.
To the best of our knowledge, all previous methods for ranking SPARQL query re-
sults are based on popularity calculation and currently there is no method for calculating
the relevance of sub-graph results to a SPARQL query. The ranking methods that
are based on link analysis compute ranks for the entities of result graphs by utilizing
entity-centric data models. It is worth noting that the results of a SPARQL query may,
in addition to entities, be made up of predicates and constant values. As a result, the
algorithms proposed by [4] and [5], which are based only on entity ranking, cannot rank
all results of SPARQL queries. One of the cornerstones in ranking SPARQL query results
is language-model-based ranking methods [6]. Providing an approach for analyzing the
content of structured queries such as SPARQL queries is a significant advance obtained
by these methods.
Therefore, by studying the limitations of existing research and considering the specific
features of SPARQL queries and results, this paper proposes a ranking method that
calculates relevancy and popularity scores through content and link analysis, respectively.
2 Proposed Method: COLINA
We are interested in measuring how valuable a result graph is for a given query. Our
method ranks SPARQL query results by combining the content and link analysis scores
of the semantic documents from which the results are retrieved. In the next subsections,
we briefly describe the two key components of our method.
2.1 Offline Ranker
The offline ranker calculates data popularity by applying a weighted PageRank algorithm
on the data graph. We first explain our data model and then present our scheme for
weighting semantic links.
Data Model. In order to consider the provenance of data in our link analysis ranking,
we choose a two-layer graph including data source and semantic document layers. Data
source layer is made up of a collection of inter-connected data sources. A data source
is the source that has the authority to assign URI identifiers and is defined as a pay-level
domain, similar to [3]. The semantic document layer is composed of independent graphs
of semantic documents. Each graph contains a set of internal nodes and edges.
Our reason for using a document-centric data model instead of an entity-centric one
is that, in response to a SPARQL query, the sub-graphs that meet the query conditions
are returned as results. Depending on the number of triple patterns in the query, each
sub-graph consists of several triples. Hence, we can estimate the rank score of triples
by the rank score of the documents in which they appear. The document graph is
constructed by extracting explicit and implicit links between semantic documents
according to [7].
Weighting mechanism. We categorize links into two classes based on their labels rather
than their frequency: specific and general links. On the semantic web, links are semantically
different and so have different importance. Our method for measuring the importance of
link labels goes beyond just measuring the frequency of labels by also taking these
categories into account. We first determine which category a link label belongs to, and
then use different frequency-based measurements. The intuition behind this idea is that
general and common link labels such as owl:sameAs, which convey high importance, get
a high weight. On the other hand, specific link labels, which hold much information from
an information-theoretic point of view, get a high weight too. In this way we can consider
the importance of common link labels and also maintain the importance of specific link
labels. In this paper, we exploit a hierarchical approach to separate the link labels that
are used between data sources. From this point of view, a link label that is defined for
a particular class is considered general for all of its subclasses. Since each data source
is a subclass of owl:Thing, we can derive general labels by extracting, with Virtuoso1,
the link labels whose rdfs:domain is defined as owl:Thing.
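
The offline popularity computation can be pictured with the following minimal sketch of a weighted PageRank iteration over a document-level link graph. The damping factor, the concrete edge weights and the toy graph are illustrative assumptions, and the sketch omits the two-layer (data source / semantic document) structure of the full model.

# Minimal sketch of weighted PageRank over a semantic-document link graph.
# Edge weights stand in for the label-based weighting described above; the
# damping factor (0.85) and the toy graph are assumptions for illustration.
def weighted_pagerank(edges, damping=0.85, iterations=50):
    nodes = {n for e in edges for n in e[:2]}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    out_weight = {n: 0.0 for n in nodes}
    for src, _dst, w in edges:
        out_weight[src] += w
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src, dst, w in edges:
            if out_weight[src] > 0:
                new_rank[dst] += damping * rank[src] * w / out_weight[src]
        rank = new_rank
    return rank

# docA links to docB with a general label (e.g. owl:sameAs) and to docC with a
# rarer, specific label; both kinds receive high weights per the scheme above.
edges = [("docA", "docB", 1.0), ("docA", "docC", 0.8), ("docB", "docC", 0.5),
         ("docC", "docA", 0.3)]
for doc, score in sorted(weighted_pagerank(edges).items(), key=lambda x: -x[1]):
    print(doc, round(score, 3))
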
2.2 Online Ranker
Unlike keyword-based queries, which are collections of words specified by users,
each triple pattern in a SPARQL query has two kinds of arguments: the bound arguments,
which are labeled by users, and the unbound arguments, which are variables. We can
measure the relevancy of a document based on the bound and unbound query arguments
as follows:
S_q(doc) = β r_q(doc) + (1 − β) r_r(doc)     (1)
where r_q(doc) and r_r(doc) denote the relevancy score of a document with respect
to the unlabeled arguments in the query and to the produced answer, respectively. The
parameter β is set empirically to a calibrated global value.
For example, assume that “Bob a Physicist” is an answer for “?x a Physicist”. If
this triple appears in a document that is exclusively about physicists or about Bob, it is
more relevant than when it is included in a document that is about anything else. This ex-
ample highlights our justification for using both bound and unbound arguments in the
relevance calculation for documents.
Since the computed values of r_q(doc) and r_r(doc) depend on the query formulation,
we need to deal with the possible forms of triple patterns. For this, we define the ACDT
and QCDT functions for estimating these two scores.
The ACDT is the Answer Container Document's Triples measure. In short, it computes
the frequency of a result in semantic documents with respect to the position of the
unbound arguments in the intended triple pattern. The QCDT is the Query Container
Document's Triples measure. Similarly, it computes the frequency of a query in semantic
documents with respect to the position of the bound arguments in the intended triple
pattern. The basic idea for ACDT and QCDT is derived from the TF scheme in
information retrieval.
1 http://lod.openlinksw.com/sparql
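
A rough sketch of how the online scores of equation (1) could be computed is given below. The TF-style counting inside the acdt/qcdt helpers and the value β = 0.6 are assumptions made for illustration, since the exact formulas are not spelled out here.

# Hedged sketch of the online relevancy score of equation (1).
# The TF-style counting used for ACDT/QCDT and beta = 0.6 are illustrative
# assumptions; the text only states that both are frequency based.
def acdt(answer_triple, doc_triples, unbound_positions):
    # frequency of the produced answer in a document, restricted to the
    # positions that were unbound (variables) in the triple pattern
    hits = sum(1 for t in doc_triples
               if all(t[i] == answer_triple[i] for i in unbound_positions))
    return hits / max(len(doc_triples), 1)

def qcdt(query_pattern, doc_triples, bound_positions):
    # frequency of the query's bound arguments in a document
    hits = sum(1 for t in doc_triples
               if all(t[i] == query_pattern[i] for i in bound_positions))
    return hits / max(len(doc_triples), 1)

def relevancy(query_pattern, answer_triple, doc_triples, beta=0.6):
    bound   = [i for i, x in enumerate(query_pattern) if not x.startswith("?")]
    unbound = [i for i, x in enumerate(query_pattern) if x.startswith("?")]
    r_q = qcdt(query_pattern, doc_triples, bound)
    r_r = acdt(answer_triple, doc_triples, unbound)
    return beta * r_q + (1 - beta) * r_r

# "?x a Physicist" answered by "Bob a Physicist", scored against a document
# that is mostly about Bob (and therefore more relevant for this answer).
doc = [("Bob", "a", "Physicist"), ("Bob", "bornIn", "Vienna"), ("Bob", "worksAt", "CERN")]
print(relevancy(("?x", "a", "Physicist"), ("Bob", "a", "Physicist"), doc))
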
3 Combining Content and Link Analysis Ranks
We combine the relevancy score and the popularity score in order to compute the final
score for each document. Since the foundation of our ranking algorithms is similar to the
algorithms presented in [2], we use the combination method described there.
4 Conclusion
In this paper we presented a method for ranking SPARQL query results based on content
and link analysis, which can be used as a ranking component in semantic web search
engines. In our method, the ranks of the triples that constitute the result graphs are
approximated by the rank scores of the semantic documents that express them. We
introduced a two-layer data model and proposed a novel link weighting mechanism based
on the separation of link labels, incorporating the notion of label frequency in a convenient
manner. Our content analysis ranking algorithm provides an approach to compute the
relevancy of results with respect to the bound and unbound arguments of the intended
SPARQL query. We believe that using content analysis ranking in combination with link
analysis ranking, powered by our data model and weighting mechanism, can improve the
accuracy of ranking algorithms for SPARQL query results.
References
1. J. Hees, M. Khamis, R. Biedert, S. Abdennadher, and A. Dengel. Collecting links between
entities ranked by human association strengths. In Proceedings of ESWC-13, pages 517-531,
2013.
2. R. Delbru. Searching Web Data: an Entity Retrieval Model. Ph.D. Thesis, National Univer-
sity of Ireland, Ireland, 2010.
3. A. Hogan, A. Harth, J. Umbrich, S. Kinsella, A. Polleres, S. Decker. Searching and Brows-
ing Linked Data with SWSE: the Semantic Web Search Engine. Journal of web semantics,
pages 365-401, 2011.
4. K. Mulay, P.S. Kumar. SPRING: Ranking the results of SPARQL queries on Linked Data.
17th International Conference on Management of Data COMAD, Bangalore, India, 2011.
5. A. Buikstra, H. Neth, L. Schooler, A. ten Teije, F. van Harmelen. Ranking query results
from linked open data using a simple cognitive heuristic. In Workshop on Discovering
Meaning on the Go in Large Heterogeneous Data (LHD-11), Twenty-second International
Joint Conference on Artificial Intelligence (IJCAI-11), Barcelona, Spain, 2011.
6. G. Kasneci, F. M. Suchanek, G. Ifrim, M. Ramanath, G. Weikum. NAGA: Searching and
Ranking Knowledge. In 24th International Conference on Data Engineering (ICDE 2008).
IEEE, 2008.
7. A. Feyznia, M. Kahani, R. Ramezani. A Link Analysis Based Ranking Algorithm for Se-
mantic Web Documents. In 6th Conference on Information and Knowledge (IKT 2014),
Shahrood, Iran, 2014.
Licentia: a Tool for Supporting Users in Data
Licensing on the Web of Data
Cristian Cardellino1, Serena Villata1, Fabien Gandon1,
Guido Governatori2⋆, Ho-Pun Lam2, and Antonino Rotolo3
1 INRIA Sophia Antipolis, France - firstname.lastname@inria.fr
2 NICTA Queensland Research Laboratory - firstname.lastname@nicta.com.au
3 University of Bologna - antonino.rotolo@unibo.it
Abstract. Associating a license to data is a fundamental task when
publishing data on the Web. However, in many cases data producers
and publishers are not legal experts, and they usually have only a basic
knowledge about the possible constraints they want to ensure concerning
the use and reuse of their data. In this paper, we propose a framework
called Licentia that offers to the data producers and publishers a suite of
services to deal with licensing information. In particular, Licentia sup-
ports, through a user-friendly interface, the users in selecting the license
that better suits their needs, starting from the set of constraints proposed
to regulate the terms of use and reuse of the data.
1 Introduction
In order to ensure the high quality of the data published on the Web of Data,
part of the self-description of the data should consist of the licensing terms
which specify the admitted use and re-use of the data by third parties. This
issue is relevant both for data publication, as underlined in the “Linked Data
Cookbook”1, where it is required to specify an appropriate license for the data,
and for open data publication, as expressing the constraints on the reuse of
the data would encourage the publication of more open data. The main problem
is that data producers and publishers often do not have extensive knowledge
about the existing licenses and the legal terminology used to express the terms
of data use and reuse. To address this open issue, we present Licentia, a suite
of services to support data producers and publishers in data licensing by means
of a user-friendly interface that hides the complexity of the legal reasoning
process from the user. In particular, Licentia offers two services: i) the user selects
among a pre-defined list those terms of use and reuse (i.e., permissions, prohibitions,
and obligations) she would assign to the data, and the system returns
⋆ NICTA is funded by the Australian Government as represented by the Department of
Broadband, Communications and the Digital Economy and the Australian Research
Council through the ICT Centre of Excellence program.
1 http://www.w3.org/2011/gld/wiki/Linked_Data_Cookbook
the set of licenses meeting (some of) the selected requirements together with
the machine readable licenses’ specifications, and ii) the user selects a license
and she can verify whether a certain action2 is allowed on the data released
under such license. Licentia relies on the dataset of machine-readable licenses
(RDF, Turtle syntax, ODRL vocabulary3 and Creative Commons vocabulary4 )
available at http://datahub.io/dataset/rdflicense. We rely on the deon-
tic logic presented by Governatori et al. [2] to address the problem of verifying
the compatibility of the licensing terms in order to find the license compatible
with the constraints selected by the user. The need for licensing compatibility
checking is high, as shown by other similar services (e.g., Licensius5 or Creative
Commons Choose service6 ). However, the advantage of Licentia with respect to
these services is twofold: first, in these services compatibility is pre-calculated
among a pre-defined and small set of licenses, while in Licentia compatibility is
computed at runtime and we consider more than 50 heterogeneous licenses; sec-
ond, Licentia provides a further service that is not considered by the others, i.e.,
it allows the user to select a license from our dataset and verify whether some
selected actions are compatible with such a license.
2 Licentia: services for supporting data licensing
Licentia is implemented as a Web service7. It is written as a Play Framework
application in Scala and Java, using the Model-View-Controller architecture, and
powered by the SPINdle [3] Java library as the background reasoner. The architecture
of the Web service is shown in Fig. 1. The workflow is defined by three steps:
selection and specification of licensing conditions, reasoning and incompatibility
checking, and process and return of results.
Selection and specification of conditions. Using the web interface form, the data
producer and publisher specifies a set of licensing conditions she wants to as-
sociate to the data. These conditions are divided into three categories: permis-
sions (e.g., Distribution), obligations (e.g., Attribution) and prohibitions (e.g.,
Commercial Use). The chosen set of conditions is taken by a server-side controller,
which also retrieves, through a SPARQL endpoint of an RDF triplestore server
containing our license repository, a list of the stored licenses and their
corresponding conditions. This data is delivered to a module that handles the
information and formalizes it into defeasible logic rules for processing in SPINdle8 –
a modular and efficient reasoning engine for defeasible logic and modal defeasible
2 In the interface, we adopt the terminology and the rights of the ODRL vocabulary.
3 http://www.w3.org/ns/odrl/2/
4 http://creativecommons.org/ns
5 http://oeg-dev.dia.fi.upm.es/licensius/
6 https://creativecommons.org/choose/
7 A demo video of Licentia showing the finding-licenses service is available at
http://wimmics.inria.fr/projects/licentia/.
8 http://spin.nicta.org.au/spindle/index.html
Fig. 1: Licentia Web service architecture.
logic [3]. Licentia is based on the logic and the licenses compatibility verification
process proposed by Governatori et al. [2]. The module translates the licenses
and the set of conditions from the RDF specification to defeasible logic rules so
that SPINdle can reason over them. The module considers every single license in
the repository and compares it to the list of conditions selected by the user.
Reasoning and incompatibility checking. SPINdle returns a set of conclusions
for each license’s conditions compared to the set of conditions selected by the
user. From these conclusions, the module gets the set of user-chosen conditions
that are incompatible with each license: all defeasible non-provable rules
are incompatible conditions. If this list is empty, then the license is compatible with
the set of conditions the user selected. After the module gets all the conclusions
for each license, it has two partial results: one containing compatible licenses, and
one containing incompatible ones. If the set of compatible licenses is non-empty,
the module divides this set into two parts: one with those licenses containing the
complete set of the user's conditions, and the other with those licenses that do not
contain all the user's conditions (but are still compatible), highlighting those of the
user's conditions that are not explicitly mentioned in such a license. If the set of
compatible licenses is empty, the module returns the set of incompatible licenses
along with a list highlighting, for each license, which of the user's conditions
are incompatible with it.
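
The incompatibility check can be pictured with the simplified Python sketch below, which compares a user's selected conditions with each license's permissions, obligations and prohibitions using plain set operations. It deliberately ignores the defeasible-logic reasoning performed by SPINdle, and the license descriptions it uses are invented for illustration rather than taken from the RDF dataset.

# Simplified illustration of the compatibility check: plain set comparison of
# the user's conditions against each license. The real system reasons over
# defeasible logic rules with SPINdle; the licenses below are invented examples.
def incompatible_conditions(user, lic):
    # conditions selected by the user that clash with the license
    clashes  = user["permissions"] & lic["prohibitions"]    # wanted but forbidden
    clashes |= user["prohibitions"] & lic["permissions"]    # to be forbidden but granted
    return clashes

def rank_licenses(user, licenses):
    compatible, incompatible = [], []
    for name, lic in licenses.items():
        clashes = incompatible_conditions(user, lic)
        missing = (user["permissions"] - lic["permissions"]) | \
                  (user["obligations"] - lic["obligations"])
        (compatible if not clashes else incompatible).append((name, clashes, missing))
    # compatible licenses first, ordered by the number of unmet conditions;
    # incompatible ones ordered by the number of clashing conditions
    return (sorted(compatible, key=lambda x: len(x[2])),
            sorted(incompatible, key=lambda x: len(x[1])))

user = {"permissions": {"distribution", "derivativeWorks"},
        "obligations": {"attribution"},
        "prohibitions": {"commercialUse"}}
licenses = {
    "LicenseA": {"permissions": {"distribution", "derivativeWorks", "commercialUse"},
                 "obligations": {"attribution"}, "prohibitions": set()},
    "LicenseB": {"permissions": {"distribution"},
                 "obligations": {"attribution"}, "prohibitions": {"commercialUse"}},
}
ok, not_ok = rank_licenses(user, licenses)
print("compatible:", ok)
print("incompatible:", not_ok)
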
Process and return of results. If a set of compatible licenses is returned, the
controller provides a view listing, at the top of the page, all the licenses that are
compatible and contain the user's selected conditions. Secondly, it returns a list
of all other compatible licenses that do not share all of the user's conditions, in
ascending order by the number of conditions not contained, highlighting each of
the user’s conditions that are not explicitly defined in the license. If the set of
incompatible licenses is returned, the controller filters all licenses not matching
any of the conditions the user selected, keeping those licenses containing at least
one of the conditions chosen by the user. If the filtered set is non empty, the
system shows a message stating there is no license in the repository compatible
with the selected conditions, but there exist some licenses that meet some of
the conditions. Such licenses are thus listed in ascending order by number of
incompatible conditions, highlighting every unmet condition. In any case, the
list provides a link to the legal specification of the license as well as a link to
a downloadable RDF version of the license. If no compatible license is found
then a disclaimer is shown. Note that the evaluation of the performance of the
SPINdle9 reasoning module in verifying the compatibility of licensing terms has
been presented in [2].
3 Future Perspectives
In this demo, we present the Licentia tool that proposes a set of services for
supporting users in data licensing. We are currently finalizing the second service
(verification of compatible actions with respect to a specific license), and we
are increasing the number of considered machine readable licenses. We plan to
extend Licentia by integrating an improved version of the generator of RDF
licenses specifications from natural language texts introduced in [1]. Finally, a
user evaluation should not be underestimated in order to improve the usability
of the user interface.
References
1. Cabrio, E., Aprosio, A.P., Villata, S.: These are your rights - a natural language
processing approach to automated rdf licenses generation. In: ESWC. Lecture Notes
in Computer Science, vol. 8465, pp. 255–269. Springer (2014)
2. Governatori, G., Rotolo, A., Villata, S., Gandon, F.: One license to compose them
all - a deontic logic approach to data licensing on the web of data. In: ISWC. Lecture
Notes in Computer Science, vol. 8218, pp. 151–166. Springer (2013)
3. Lam, H.P., Governatori, G.: The making of SPINdle. In: Proceedings of RuleML,
LNCS 5858. pp. 315–322. Springer (2009)
4. Maher, M.J., Rock, A., Antoniou, G., Billington, D., Miller, T.: Efficient defeasible
reasoning systems. International Journal of Artificial Intelligence Tools 10, 483–501
(2001)
9 SPINdle has been experimentally tested against the benchmark of [4], showing that
it is able to handle very large theories; indeed, the largest theory it has been tested
with has 1 million rules.
Automatic Stopword Generation using Contextual
Semantics for Sentiment Analysis of Twitter
Hassan Saif, Miriam Fernandez and Harith Alani
Knowledge Media Institute, The Open University, United Kingdom
{h.saif,m.fernandez,h.alani}@open.ac.uk
Abstract. In this paper we propose a semantic approach to automatically identify
and remove stopwords from Twitter data. Unlike most existing approaches, which
rely on outdated and context-insensitive stopword lists, our proposed approach
considers the contextual semantics and sentiment of words in order to measure
their discrimination power. Evaluation results on 6 Twitter datasets show that
removing our semantically identified stopwords from tweets increases the binary
sentiment classification performance over the classic pre-compiled stopword list
by 0.42% and 0.94% in accuracy and F-measure respectively. Also, our approach
reduces the sentiment classifier’s feature space by 48.34% and the dataset sparsity
by 1.17%, on average, compared to the classic method.
Keywords: Sentiment Analysis, Contextual Semantics, Stopwords, Twitter
1 Introduction
The excessive presence of abbreviations and irregular words in tweets makes them very
noisy, sparse and hard to extract sentiment from [7, 8]. Aiming to address this problem,
existing works on Twitter sentiment analysis remove stopwords from tweets as a pre-
processing procedure [5]. To this end, these works usually use pre-compiled lists of
stopwords, such as the Van stoplist [3]. These stoplists, although widely used, have
previously been criticised for (i) being outdated [2] and (ii) not accounting for
the specificities of the context under analysis [1]. Words with low informative value
in some context or corpus may have discrimination power in a different context. For
example, the word “like”, generally considered a stopword, has an important sentiment
discrimination power in the sentence “I like you”.
In this paper, we propose an unsupervised approach for automatically generating
context-aware stoplists for the sentiment analysis task on Twitter. Our approach captures
the contextual semantics and sentiment of words in tweets in order to calculate their
informative value. Words with low informative value are then selected as stopwords. Con-
textual semantics (aka statistical semantics) are based on the proposition that meaning
can be extracted from word co-occurrences [9].
We evaluate our approach against the Van stoplist (the so-called classic method) using
six Twitter datasets. In particular, we study how removing stopwords generated by our
approach affects: (i) the level of data sparsity of the used datasets and (ii) the performance
of the Maximum Entropy (MaxEnt) classifier in terms of: (a) the size of the classifier’s
feature space and, (b) the classifier’s performance. Our preliminary results show that
our approach outperforms the classic stopword removal method in both accuracy and
F1-measure by 0.42% and 0.94% respectively. Moreover, removing our semantically-
identified stopwords reduces the feature space by 48.34% and the dataset sparsity by
1.17%, compared to the classic method, on average.
2 Stopwords Generation using Contextual Semantics
The main principle behind our approach is that the informativeness of words in sentiment
analysis relies on their semantics and sentiment within the contexts they occur in.
Stopwords correspond to those words with weak contextual semantics and sentiment.
Therefore, our approach functions by first capturing the contextual semantics and
sentiment of words and then calculating their informative values accordingly.
2.1 Capturing Contextual Semantics and Sentiment
To capture the contextual semantics and sentiment of words, we use our previously
proposed semantic representation model SentiCircles [6].
In summary, the SentiCircle model extracts the contextual semantics of a word from its
co-occurrences with other words in a given tweet corpus. These co-occurrences are then
represented as a geometric circle, which is subsequently used to compute the contextual
sentiment of the word by applying simple trigonometric identities on it. In particular,
for each unique term m in a tweet collection, we build a two-dimensional geometric
circle, where the term m is situated in the centre of the circle, and each point around it
represents a context term ci (i.e., a term that occurs with m in the same context).
Fig. 1: SentiCircle of a term m. The stopword region is shaded in gray.
The position of ci, as illustrated in Figure 1, is defined jointly by its Cartesian
coordinates xi, yi as:
xi = ri cos(θi · π)    yi = ri sin(θi · π)
where θi is the polar angle of the context term ci and its value equals the prior
sentiment of ci in a sentiment lexicon before adaptation, and ri is the radius of ci whose
value represents the degree of correlation (tdoc) between ci and m, computed as:
ri = tdoc(m, ci) = f(ci, m) × log(N / Nci)
where f(ci, m) is the number of times ci occurs with m in tweets, N is the total number
of terms, and Nci is the total number of terms that occur with ci. Note that all terms'
radii in the SentiCircle are normalised. Also, all angle values are in radians.
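
The construction of the circle points can be sketched as follows; the toy co-occurrence counts and prior sentiment values are invented for illustration.

# Sketch of building SentiCircle points for a term m from co-occurrence counts.
# Prior sentiments (in [-1, 1]) and the counts below are toy values.
import math

def tdoc(f_ci_m, N, N_ci):
    # degree of correlation between context term ci and m (the radius r_i)
    return f_ci_m * math.log(N / N_ci)

def senticircle(cooccurrences, priors, N):
    # return {ci: (x_i, y_i)} with radii normalised to [0, 1]
    radii = {ci: tdoc(f, N, n_ci) for ci, (f, n_ci) in cooccurrences.items()}
    r_max = max(radii.values()) or 1.0
    points = {}
    for ci, r in radii.items():
        angle = priors[ci] * math.pi          # theta_i * pi, theta_i = prior sentiment
        points[ci] = ((r / r_max) * math.cos(angle), (r / r_max) * math.sin(angle))
    return points

# Toy statistics for the term m = "like":
# ci -> (co-occurrence count with m, number of terms occurring with ci)
cooccurrences = {"love": (12, 40), "hate": (3, 25), "movie": (20, 300)}
priors = {"love": 0.8, "hate": -0.7, "movie": 0.05}   # prior lexicon sentiment
print(senticircle(cooccurrences, priors, N=1000))
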
The trigonometric properties of the SentiCircle allow us to encode the contextual
semantics of a term as sentiment orientation and sentiment strength. The Y-axis defines
the sentiment of the term, i.e., a positive y value denotes a positive sentiment and vice
versa. The X-axis defines the sentiment strength of the term: the smaller the x value, the
stronger the sentiment.1 This, in turn, divides the circle into four sentiment quadrants.
Terms in the two upper quadrants have a positive sentiment (sin θ > 0), with the upper
left quadrant representing stronger positive sentiment since it has larger angle values than
the top right quadrant. Similarly, terms in the two lower quadrants have negative
sentiment values (sin θ < 0). Moreover, a small region called the “Neutral Region” can
be defined. This region is located very close to the X-axis in the “Positive” and the
“Negative” quadrants only; terms that lie in this region have very weak sentiment
(i.e., |θ| ≈ 0).
1 This is because cos θ < 0 for large angles.
The overall Contextual Semantics and Sentiment. An effective way to compute the
overall sentiment of m is by calculating the geometric median of all the points in its
SentiCircle. Formally, for a given set of n points (p1, p2, ..., pn) in a SentiCircle Ω, the
2D geometric median g is defined as g = arg min_{g ∈ R²} Σ_{i=1..n} ||pi − g||₂. We call
the geometric median g the SentiMedian, as its position in the SentiCircle determines the
total contextual-sentiment orientation and strength of m.
2.2 Detecting Stopwords with SentiCircles
Stopwords in sentiment analysis are those words that have weak semantics and sentiment
within the context in which they occur. Hence, stopwords in our approach are those whose
SentiMedians are located in the SentiCircle within a very small region close to the origin,
as shown in Figure 1. This is because points in this region have (i) very weak sentiment
(i.e., |θ| ≈ 0) and (ii) low importance or a low degree of correlation (i.e., r ≈ 0). We call
this region the stopword region. Therefore, to detect stopwords in our approach, we first
build a SentiCircle for each word in the tweet corpus, calculate its overall contextual
semantics and sentiment by means of its SentiMedian, and check whether the word's
SentiMedian lies within the stopword region or not.
We assume the same stopword region boundary for all SentiCircles emerging from
the same Twitter corpus, or context. To compute these boundaries we first build the
SentiCircle of the complete corpus by merging all SentiCircles of each individual term
and then we plot the density distribution of the terms within the constructed SentiCircle.
The boundaries of the stopword region are delimited by an increase/decrease in the
density of terms along the X- and Y-axis. Table 1 shows the X and Y boundaries of the
stopword region for all Twitter datasets that we use in this work.
Dataset        OMD      HCR      STS-Gold  SemEval  WAB      GASP
X-boundary     0.0001   0.0015   0.0015    0.002    0.0006   0.0005
Y-boundary     0.0001   0.00001  0.001     0.00001  0.0001   0.001
Table 1: Stopword region boundaries for all datasets.
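
The stopword test itself then reduces to computing the geometric median of a term's circle points and comparing it against the region boundaries. The sketch below uses Weiszfeld's iterative algorithm for the geometric median and the STS-Gold boundaries from Table 1; the example points are invented.

# Sketch: SentiMedian via Weiszfeld's algorithm and the stopword-region test.
# Region boundaries below are the STS-Gold values from Table 1.
import math

def geometric_median(points, iterations=100, eps=1e-9):
    # 2D geometric median g = argmin_g sum_i ||p_i - g||_2 (Weiszfeld iteration)
    gx = sum(p[0] for p in points) / len(points)
    gy = sum(p[1] for p in points) / len(points)
    for _ in range(iterations):
        num_x = num_y = denom = 0.0
        for x, y in points:
            d = math.hypot(x - gx, y - gy) or eps   # avoid division by zero
            num_x += x / d
            num_y += y / d
            denom += 1.0 / d
        gx, gy = num_x / denom, num_y / denom
    return gx, gy

def is_stopword(points, x_bound=0.0015, y_bound=0.001):
    gx, gy = geometric_median(points)
    return abs(gx) <= x_bound and abs(gy) <= y_bound

# A term whose context points all sit very close to the origin is a stopword.
weak_context = [(0.001, 0.0002), (-0.0008, -0.0001), (0.0004, 0.0003)]
print(is_stopword(weak_context))   # True for the STS-Gold boundaries
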
3 Evaluation and Results
To evaluate our approach, we perform binary sentiment classification (positive / negative
classification of tweets) using a MaxEnt classifier and observe fluctuations (increases
and decreases) after removing stopwords on: the classification performance, measured in
terms of accuracy and F-measure, the size of the classifier’s feature space and the level
of data sparsity. To this end, we use 6 Twitter datasets: OMD, HCR, STS-Gold, SemEval,
WAB and GASP [4]. Our baseline for comparison is the classic method, which is based
on removing stopwords obtained from the pre-compiled Van stoplist [3].
Figure 2 depicts the classification performance in accuracy and F1-measure as well
as the reduction in the classifier's feature space obtained by applying our SentiCircle
stopword removal method on all datasets. As noted, our method outperforms the classic
stopword list by 0.42% and 0.94% in accuracy and F1-measure on average, respectively.
Moreover, we observe that our method shrinks the feature space substantially by 48.34%,
while the classic method has a reduction rate of 5.5% only.
Figure 3 shows the average impact of the SentiCircle and the classic methods on the
sparsity degree of our datasets. We notice that our SentiCircle method always lowers the
sparsity degree of all datasets by 1.17% on average compared to the classic method.
Fig. 2: Average accuracy, F-measure and reduction rate of MaxEnt using different stoplists.
Fig. 3: Impact of the classic and SentiCircle methods on the sparsity degree of all datasets.
4 Conclusions
In this paper we proposed a novel approach for generating context-aware stopword
lists for sentiment analysis on Twitter. Our approach exploits the contextual semantics
of words in order to capture their context and calculates their discrimination power
accordingly. We have evaluated our approach for binary sentiment classification using
6 Twitter datasets. Results show that our stopword removal approach outperforms the
classic method in terms of the sentiment classification performance and the reduction in
both the classifier’s feature space and the dataset sparsity.
Acknowledgment
This work was supported by the EU-FP7 project SENSE4US (grant no. 611242).
References
1. Ayral, H., Yavuz, S.: An automated domain specific stop word generation method for natural
language text classification. In: International Symposium on Innovations in Intelligent Systems
and Applications (INISTA) (2011)
2. Lo, R.T.W., He, B., Ounis, I.: Automatically building a stopword list for an information retrieval
system. In: Journal on Digital Information Management: Special Issue on the 5th Dutch-Belgian
Information Retrieval Workshop (DIR) (2005)
3. Rijsbergen, C.J.V.: Information Retrieval. Butterworth-Heinemann, Newton, MA, USA, 2nd
edn. (1979)
4. Saif, H., Fernandez, M., He, Y., Alani, H.: Evaluation datasets for twitter sentiment analysis a
survey and a new dataset, the sts-gold. In: Proceedings, 1st ESSEM Workshop. Turin, Italy
(2013)
5. Saif, H., Fernandez, M., He, Y., Alani, H.: On Stopwords, Filtering and Data Sparsity for
Sentiment Analysis of Twitter. In: Proc. 9th Language Resources and Evaluation Conference
(LREC). Reykjavik, Iceland (2014)
6. Saif, H., Fernandez, M., He, Y., Alani, H.: Senticircles for contextual and conceptual semantic
sentiment analysis of twitter. In: Proc. 11th Extended Semantic Web Conf. (ESWC). Crete,
Greece (2014)
7. Saif, H., He, Y., Alani, H.: Alleviating data sparsity for twitter sentiment analysis. In: Proc. 2nd
Workshop on Making Sense of Microposts (#MSM2012). Layon, France (2012)
8. Saif, H., He, Y., Alani, H.: Semantic sentiment analysis of twitter. In: Proceedings of the 11th
international conference on The Semantic Web. Boston, MA (2012)
9. Turney, P.D., Pantel, P., et al.: From frequency to meaning: Vector space models of semantics.
Journal of artificial intelligence research 37(1), 141–188 (2010)
The Manchester OWL Repository: System
Description
Nicolas Matentzoglu, Daniel Tang, Bijan Parsia, and Uli Sattler
The University of Manchester
Oxford Road, Manchester, M13 9PL, UK
{matentzn,bparsia,sattler}@cs.manchester.ac.uk
Abstract. Tool development for and empirical experimentation in OWL
ontology research require a wide variety of suitable ontologies as input for
testing and evaluation purposes and detailed characterisations of real on-
tologies. Findings of surveys and results of benchmarking activities may
be biased, even heavily, towards manually assembled sets of “somehow
suitable” ontologies. We are building the Manchester OWL Repository,
a resource for creating and sharing ontology datasets, to push the qual-
ity frontier of empirical ontology research and provide access to a great
variety of well curated ontologies.
Keywords: Repository, Ontologies, Empirical
1 Introduction
Empirical work with ontologies comes in a wide variety of forms, for example
surveys of the modular structure of ontologies [1], surveys of modelling patterns to
inform design decisions of engineering environments [4] and benchmarking activ-
ities for reasoning services such as Description Logic (DL) classification [2]. Since
it is generally difficult to obtain representative datasets, both due to technical
reasons (lack of suitable collections) and conceptual reasons (lack of agreement
on what they should be representative of), it is common practice to manually
select a somewhat arbitrary set of ontologies that usually supports the given
case. On top of that, few authors ever publish the datasets they used, often for
practical reasons (e.g. size, effort), which often makes it impossible to reproduce
experimental results. The currently best option for ontology related research is the
BioPortal repository [5], which provides a web based interface for browsing on-
tologies in the biomedical domain and a REST web service to programmatically
obtain copies of all (public) versions of a wide range of biomedical ontologies.
There are, however, certain problems with this option. First, the repository is
limited to biomedical ontologies, which makes BioPortal unsuitable for surveys
that require access to ontologies from different domains. The second problem is the
technical barrier of accessing the web service: it requires a good amount of work
to download all interesting ontologies, for example due to a range of ontologies
being published in compressed form or the logistical hurdle of recreating new
snapshots over and over again. The third problem is due to the fact that there is
Fig. 1. The repository architecture.
no shared understanding of what it means to “use BioPortal”. Different authors
have different inclusion and exclusion criteria; for example, they only take the
ontologies that are easily parseable after download, or the ones that were accessible
at a particular point in time. The Manchester OWL Repository aims to bridge
that gap by providing a framework for conveniently retrieving some standard
datasets and allowing users to create, and share, their own.
2 Overall architecture
The Manchester OWL repository can be divided into four layers (see Figure 1).
The first layer represents the data gathering. Through web crawls, web scrapes,
API calls, and user contributions, ontologies are collected and stored in their
respective collections. The second layer represents the three main data sources of
the repository, each providing ontologies in their original and curated (OWL/XML)
form. The third layer, the pool, represents a virtual layer in which access to the
ontologies is unified, providing some means of de-duplication since the corpora
may intersect. Lastly, the interface layer provides access to the repository through
a REST service and a web-based front end.
3 Data Gathering
The main component of the data gathering layer is a web crawl, based on
crawler4j (a Java-based framework for custom web crawling) and daily calls to
the Google Custom Search API, that fills MOWLCorp, which makes up
the bulk of the repository's data. An ongoing BioPortal downloader creates a
snapshot of BioPortal once per month using the BioPortal web services, whilst
retaining copies of all versions available so far. The third (minor) component of
the repository is a web scrape of the Oxford Ontology Library (OOL), a
hand-curated set of ontologies which features some ontologies that are particularly
difficult, and thus interesting to reasoner developers. Ontologies are downloaded
in their raw form and fed into the curation pipeline.
4 Data curation
Ontology candidates from all three sources undergo a mild form of repair
(undeclared entity injection, rewriting of non-absolute IRIs) and are exported into
OWL/XML, with their imports closure merged into a single ontology, while
retaining information about the axiom source ontology through respective
annotations. Metrics and files for both the original and the curated versions of
the ontologies are retained and form part of the repository. The data curation
looks slightly different for the three data sources, especially with respect to filter-
ing. Apart from the criterion of OWL API [3] parse-ability, BioPortal and the
OOL are left unfiltered because they are already deemed curated. This means
that some ontologies in the corpus may not contain any logical axioms at all.
In MOWLCorp, on the other hand, we filter out ontologies that 1) have an
empty TBox (root ontology) and 2) have byte-identical duplicates after seriali-
sation into OWL/XML. The reason for the first step is our focus on ontologies
(which excludes pure collections of RDF instance data) and the fact that the
imports closure is part of the repository, i.e., imported ontologies are downloaded
and evaluated independently of the root ontology.
5 Accessing the repository
There are currently three different means to access the repository: 1) a web
frontend1 provides access to preconstructed datasets and their descriptions, 2)
an experimental data set creator allows users to create custom datasets based
on a wide range of metrics and 3) an experimental REST-based web service that
allows users to create a dataset using the REST API. Since 2) is based on 3), we
now describe the query language that allows users to create their own datasets
and access the web service.
The query language allows the user to construct statements that represent
filter criteria for ontologies based on some essential metrics such as axiom and
entity counts, or profile membership. It roughly conforms to the following gram-
mar:
q = comp {(“&&” | “||”) comp}
comp = metric (“>=” | “<=” | “=”) n
metric = “axiom count” | “class count” | ...
where “metric” should be a valid metadata element. The query language parser
was built with the open-source parser generator tools Yacc and Lex.
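
Rather than reproducing the Yacc/Lex parser, the following Python sketch shows how a query in this small language could be evaluated against per-ontology metadata records; the metadata field names and example values are illustrative only.

# Sketch: evaluating a repository query such as
#   "axiom count >= 100 && class count >= 10"
# against per-ontology metadata dictionaries (field names are illustrative).
import re

TOKEN = re.compile(r"\s*(>=|<=|=|&&|\|\|)\s*")

def parse(query):
    # split into alternating comparisons and boolean connectives
    parts = [p for p in TOKEN.split(query) if p.strip()]
    comps, ops = [], []
    i = 0
    while i < len(parts):
        metric, op, value = parts[i].strip(), parts[i + 1], float(parts[i + 2])
        comps.append((metric, op, value))
        i += 3
        if i < len(parts) and parts[i] in ("&&", "||"):
            ops.append(parts[i])
            i += 1
    return comps, ops

def matches(record, query):
    comps, ops = parse(query)
    check = {"<=": lambda a, b: a <= b, ">=": lambda a, b: a >= b,
             "=": lambda a, b: a == b}
    result = check[comps[0][1]](record[comps[0][0]], comps[0][2])
    for (metric, op, value), conn in zip(comps[1:], ops):
        term = check[op](record[metric], value)
        result = (result and term) if conn == "&&" else (result or term)
    return result

ontologies = [{"name": "o1", "axiom count": 250, "class count": 40},
              {"name": "o2", "axiom count": 30,  "class count": 5}]
query = "axiom count >= 100 && class count >= 10"
print([o["name"] for o in ontologies if matches(o, query)])   # ['o1']
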
The repository web services are built using the PHP framework Laravel.
Laravel is an advanced framework which implements the REST protocol, so that
users can get access to the services using a REST client, or simply using a web
1 http://mowlrepo.cs.manchester.ac.uk/
Table 1. The REST service parameters.
service        url                 method  param  return
query          /api/               POST    query  JSON array with fields: status, count, size, message, progress
check status   /api/checkStatus/   GET     id     JSON array with fields: status, progress
download       /api/resource       GET     id     file stream
browser or tools such as curl. For now, we have implemented three
services: query, checkStatus and download. The query service accepts a query
string that complies with the query language and returns an id string. Afterwards,
users can use the id string to check the status of their query and to download
the final dataset using the checkStatus and download services.
The usage of the services is listed in Table 1; note that the URLs in the table
are relative and should be appended to mowlrepo.cs.manchester.ac.uk, which has
been omitted.
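
Under the assumption that the query string is sent as a form field named query and the id as a request parameter named id (the exact parameter encodings are not fixed by Table 1), a client session could look roughly as follows.

# Rough client sketch for the three services in Table 1. How exactly the
# query and id parameters are encoded is an assumption made here.
import time
import requests

BASE = "http://mowlrepo.cs.manchester.ac.uk"

def create_dataset(query):
    resp = requests.post(BASE + "/api/", data={"query": query})
    resp.raise_for_status()
    # the text states that the query service returns an id string; the exact
    # field carrying it in the JSON response is assumed here
    return resp.json()

def wait_until_ready(dataset_id, poll_seconds=5):
    while True:
        status = requests.get(BASE + "/api/checkStatus/", params={"id": dataset_id}).json()
        if status.get("status") == "done":      # assumed status value
            return status
        time.sleep(poll_seconds)

def download(dataset_id, target="dataset.zip"):
    with requests.get(BASE + "/api/resource", params={"id": dataset_id}, stream=True) as r:
        r.raise_for_status()
        with open(target, "wb") as fh:
            for chunk in r.iter_content(chunk_size=1 << 16):
                fh.write(chunk)

# Example (field name "id" assumed): all ontologies with at least 1000 axioms.
# result = create_dataset("axiom count >= 1000")
# wait_until_ready(result["id"]); download(result["id"])
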
6 Next steps
We have presented the Manchester OWL Repository and a range of prototype
interfaces to access pre-constructed datasets and create custom ones. We believe
that the repository will help push the quality frontier of empirical ontology-related
research by providing access to shareable, well curated datasets. We are currently
working on the REST services, the dataset creator and improved dataset descriptions.
In the near future, we aim to 1) integrate the repository with Zenodo, a service that
allows hosting large datasets that are citable via DOIs, 2) extend our metadata to
capture even more ontology properties (in particular consistency and coherence) and
3) improve the curation pipeline by implementing extended yet safe fixes for OWL DL
profile violations.
References
1. C. Del Vescovo, P. Klinov, B. Parsia, U. Sattler, T. Schneider, and D. Tsarkov.
Empirical study of logic-based modules: Cheap is cheerful. In Lecture Notes in
Computer Science (including subseries Lecture Notes in Artificial Intelligence and
Lecture Notes in Bioinformatics), volume 8218 LNCS, pages 84–100, 2013.
2. R. S. Gonçalves, S. Bail, E. Jiménez-Ruiz, N. Matentzoglu, B. Parsia, B. Glimm,
and Y. Kazakov. OWL Reasoner Evaluation (ORE) Workshop 2013 Results: Short
Report. In ORE, pages 1–18, 2013.
3. M. Horridge and S. Bechhofer. The OWL API: A Java API for OWL ontologies.
Semantic Web, 2:11–21, 2011.
4. M. Horridge, T. Tudorache, J. Vendetti, C. Nyulas, M. A. Musen, and N. F. Noy.
Simplified OWL Ontology Editing for the Web: Is WebProtégé Enough? In
International Semantic Web Conference (1), pages 200–215, 2013.
5. N. F. Noy, N. H. Shah, P. L. Whetzel, B. Dai, M. Dorf, N. Griffith, C. Jonquet,
D. L. Rubin, M. A. Storey, C. G. Chute, and M. A. Musen. BioPortal: Ontologies
and integrated data resources at the click of a mouse. Nucleic Acids Research, 37,
2009.
A Fully Parallel Framework for Analyzing RDF Data
Long Cheng1, Spyros Kotoulas2, Tomas E. Ward3, Georgios Theodoropoulos4
1 Technische Universität Dresden, Germany   2 IBM Research, Ireland
3 National University of Ireland Maynooth, Ireland   4 Durham University, UK
long.cheng@tu-dresden.de, spyros.kotoulas@ie.ibm.com
tomas.ward@nuim.ie, theogeorgios@gmail.com
Abstract. We introduce the design of a fully parallel framework for quickly ana-
lyzing large-scale RDF data over distributed architectures. We present three core
operations of this framework: dictionary encoding, parallel joins and indexing
processing. Preliminary experimental results on a commodity cluster show that
we can load large RDF data very fast while remaining within an interactive range
for query processing.
1 Introduction
Fast loading speed and query interactivity are important for exploration and analysis
of RDF data at Web scale. In such scenarios, large computational resources would be
tapped in a short time, which requires very fast data loading of the target dataset(s). In
turn, to shorten the data processing life-cycle for each query, exploration and analysis
should be done in an interactive manner. Under these conditions, we adopt the
following design paradigm.
Model. We employ the commonly used relational model. Namely, RDF data is stored
in the form of triples and SPARQL queries are implemented by a sequence of lookups
and joins. We do not use the graph-based approaches, because they focus on subgraph
matching, which is not suitable for handling large-scale RDF data, as described in [1].
Moreover, for a row-oriented output format, graph exploration is not sufficient to
generate the final join results, as presented in [2], and graph-partitioning approaches
are too costly in terms of loading speed.
Parallelism. We parallelize all operations such as dictionary encoding, indexing and
query processing. Although asynchronous parallel operations (such as joins) have been
shown to improve load balancing in state-of-the-art systems [2], we still adopt the
conventional synchronous approach, since asynchronous operations always rely on
specific communication protocols (e.g. MPI). To remedy the consequent load-imbalance prob-
lem, we focus on techniques to improve the implementations of each operation. For
example, for a series of parallel joins, we keep each join operation load-balanced.
Performance. We are interested in the performance in both data loading and query-
ing. In fact, current distributed RDF systems generally operate on a trade-off between
loading complexity and query efficiency. For example, the similar-size partitioning
method [3, 4] offers superior loading performance at the cost of more complex/slower
querying, and the graph-based partitioning approach [2, 5] requires significant compu-
tational effort for data loading and/or partitioning. Given the trade-offs between the two
Fig. 1. General design of our parallel framework.
approaches, we combine elements of both to achieve fast loading while still keeping
query time in an interactive range.
Our parallel framework is shown in Figure 1. The entire data process is divided into
two parts: data loading and data querying. (1) The equal-size partitioned raw RDF data
at each computation node (core) is encoded in parallel in the form of integers and then
loaded in memory in local indexes (without redistributing data). (2) Based on the query
execution plan, the candidate results are retrieved from the built indexes, and parallel
joins are applied to produce the final outputs. In the latter process, local filters1 at each node can be used to reduce or remove retrieved results that do not contribute to the final outputs, and the data redistributed during parallel joins can be used to create additional sharded indexes.
2 Core Operations
Triple Encoding. We utilise a distributed dictionary encoding method, as described
in [6,7], to transform RDF terms into 64-bit integers and to represent statements (aligned
in memory) using this encoding. Using a straightforward technique and an efficient
skew-handling strategy, our implementation [6] is shown to be notably faster than [8]
and additionally supports small incremental updates.
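To make the idea concrete, the following Python sketch shows one common way to realise hash-partitioned dictionary encoding; it is not the authors' X10 implementation and ignores their skew-handling strategy. Terms are routed to workers by hash, and each worker assigns 64-bit integer IDs from a private range, so no global coordination is needed while IDs stay globally unique.

    # Illustrative sketch of distributed dictionary encoding (not the paper's X10 code).
    NUM_WORKERS = 4
    ID_BITS_PER_WORKER = 64 - 2   # toy split of the 64-bit ID space (2 bits for the worker id)

    def owner(term):
        return hash(term) % NUM_WORKERS

    class WorkerDictionary:
        def __init__(self, worker_id):
            self.base = worker_id << ID_BITS_PER_WORKER   # private, disjoint ID range
            self.next_local = 0
            self.term_to_id = {}

        def encode(self, term):
            if term not in self.term_to_id:
                self.term_to_id[term] = self.base + self.next_local
                self.next_local += 1
            return self.term_to_id[term]

    workers = [WorkerDictionary(i) for i in range(NUM_WORKERS)]

    def encode_triple(s, p, o):
        # every RDF term is routed to its owning worker and replaced by an integer
        return tuple(workers[owner(t)].encode(t) for t in (s, p, o))

    print(encode_triple("ex:Alice", "rdf:type", "ub:GraduateStudent"))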
Parallel Joins. Based on existing indexes, we can lookup the candidate results for each
graph pattern and then use joins to compute SPARQL queries. For the most critical join
operation, parallel hash joins [3] are commonly used in current RDF systems. However, they are prone to load-imbalance problems, because the terms in real-world Linked Data are highly skewed [9]. In comparison, our implementation adopts the query-based distributed joins we proposed in [10–13] so as to achieve more efficient and robust performance for each join operation under different query workloads.
Two-tier Indexing. We adopt an efficient two-tier index architecture we presented
in [14]. We build the primary index l1 for the encoded triples at each node using a
modified vertical partitioning approach [15]. Different from [15], to speed up the load process, we do not perform any sort operation, but just insert each tuple into the corresponding vertical table. For join operations, we may have to redistribute a large number of
1 Though our system supports filtering operations, we do not give the details in this paper.
(intermediate) results around all computation nodes, which is normally very costly. To
remedy this, we employ a bottom-up dynamic programming-like parallel algorithm to
build a multi-level secondary index (l2 ... ln ), based on each query execution plan. With
that, we will simply copy the redistributed data of each join to the local secondary in-
dexes, and these parts of data will be re-used by other queries that contain patterns in
common, so as to reduce (or remove) the corresponding network communication during
the execution. In fact, according to the terminology regarding graph partitioning used
in [5], the k-level index lk on each node in our approach will dynamically construct a
k-hop subgraph. This means that our method essentially does dynamic graph-based par-
titioning based on the query load, starting from an initial equal-size partitioning. There-
fore, our approach can combine the loading speed of similar-size partitioning with the
execution speed of graph-based partitioning in an analytical environment.
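The following Python sketch illustrates, under our own naming and simplifications, the two-tier idea described above: an unsorted vertical-partition primary index per node, plus secondary indexes keyed by query patterns that cache the data redistributed during joins so later queries sharing a pattern avoid the network transfer.

    from collections import defaultdict

    class NodeIndex:
        def __init__(self):
            # l1: vertical partitioning without sorting -- one append-only
            # table of (subject, object) pairs per predicate
            self.l1 = defaultdict(list)
            # l2..ln: secondary indexes keyed by the query pattern whose join
            # redistributed the data to this node
            self.secondary = {}

        def load(self, s, p, o):
            self.l1[p].append((s, o))          # no sorting, fast loading

        def lookup(self, p):
            return self.l1[p]                  # candidates for one triple pattern

        def cache_join_input(self, pattern_key, tuples):
            # data shipped here during a parallel join is kept locally for reuse
            self.secondary[pattern_key] = tuples

        def cached(self, pattern_key):
            return self.secondary.get(pattern_key)

    node = NodeIndex()
    node.load(1, 100, 2)                                   # an encoded (s, p, o)
    node.cache_join_input(("?x", "ub:advisor", "?y"), [(1, 7)])
    print(node.lookup(100), node.cached(("?x", "ub:advisor", "?y")))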
3 Preliminary Results
Experiments were conducted on 16 IBM iDataPlex nodes with two 6-core Intel Xeon
X5679 processors, 128GB of RAM and a single 1TB SATA hard-drive, connected using
Gigabit Ethernet. We use Linux kernel version 2.6.32-220 and implement our method
using X10 version 2.3, compiled to C++ with gcc version 4.4.6.
Data Loading. We test the performance of triple encoding and primary index building
through loading 1.1 billion triples (LUBM8000 with indexes on P, PO and PS) in mem-
ory. The entire process takes 340 seconds, for an average throughput of 540 MB/s or 3.24M triples per second (254 seconds to encode the triples and 86 seconds to build the primary index).
[Figure 2 plots, for queries Q2 and Q9, the runtime in seconds of querying over the primary index, the 2nd-level index and the 3rd-level index, together with the time needed to build the 2nd-level and 3rd-level indexes.]
Fig. 2. Runtime over different indexes using 192 cores.
Data Querying2 . We implement queries over the indexes l1 , l2 and l3 to examine the
efficiency of our secondary indexes. We run the two most complex queries Q2 and Q9
of LUBM. As we do not support RDF inference, the query Q9 is modified as below so
as to guarantee that we can get results for each basic graph pattern.
Q9: select ?x ?y ?z where { ?x rdf:type ub:GraduateStudent. ?y rdf:type ub:FullProfessor. ?z
rdf:type ub:Course. ?x ub:advisor ?y. ?y ub:teacherOf ?z. ?x ub:takesCourse ?z.}
2 The results presented here mirror our previous work [14].
To focus on the core performance only, we report the times of the result-retrieval and join operations in the execution phase (i.e., we exclude the time to decode the output).
The results in Figure 2 show that the secondary indexes clearly improve the query performance. Moreover, the higher the level of the index, the lower the execution
time. Additionally, it can be seen that building a high-level index is very fast, taking
only hundreds of ms, which is extremely small compared to the query execution time.
We did not employ the query-based joins mentioned above in the query execution presented here, as the data skew in our tests was not pronounced (due to the nature of the LUBM benchmark). We plan to integrate these joins as the system develops, and then present more detailed results using much more complex workloads (e.g., similar to the one used in [4]).
Acknowledgments. This work was supported by the DFG in grant KR 4381/1-1. We
thank Markus Krötzsch for comments that greatly improved the manuscript.
References
1. Sun, Z., Wang, H., Wang, H., Shao, B., Li, J.: Efficient subgraph matching on billion node
graphs. Proc. VLDB Endow. 5(9) (May 2012) 788–799
2. Gurajada, S., Seufert, S., Miliaraki, I., Theobald, M.: TriAD: A distributed shared-nothing
RDF engine based on asynchronous message passing. In: SIGMOD. (2014) 289–300
3. Weaver, J., Williams, G.T.: Scalable RDF query processing on clusters and supercomputers.
In: SSWS. (2009)
4. Kotoulas, S., Urbani, J., Boncz, P., Mika, P.: Robust runtime optimization and skew-resistant
execution of analytical SPARQL queries on PIG. In: ISWC. (2012) 247–262
5. Huang, J., Abadi, D.J., Ren, K.: Scalable SPARQL querying of large RDF graphs. Proc.
VLDB Endow. 4(11) (2011) 1123–1134
6. Cheng, L., Malik, A., Kotoulas, S., Ward, T.E., Theodoropoulos, G.: Efficient parallel dic-
tionary encoding for RDF data. In: WebDB. (2014)
7. Cheng, L., Malik, A., Kotoulas, S., Ward, T.E., Theodoropoulos, G.: Scalable RDF data
compression using X10. arXiv preprint arXiv:1403.2404 (2014)
8. Urbani, J., Maassen, J., Drost, N., Seinstra, F., Bal, H.: Scalable RDF data compression with
MapReduce. Concurrency and Computation: Practice and Experience 25(1) (2013) 24–39
9. Kotoulas, S., Oren, E., Van Harmelen, F.: Mind the data skew: Distributed inferencing by
speeddating in elastic regions. In: WWW. (2010) 531–540
10. Cheng, L., Kotoulas, S., Ward, T.E., Theodoropoulos, G.: QbDJ: A novel framework for
handling skew in parallel join processing on distributed memory. In: HPCC. (2013) 1519–
1527
11. Cheng, L., Kotoulas, S., Ward, T.E., Theodoropoulos, G.: Efficiently handling skew in outer
joins on distributed systems. In: CCGrid. (2014) 295–304
12. Cheng, L., Kotoulas, S., Ward, T.E., Theodoropoulos, G.: Robust and efficient large-large
table outer joins on distributed infrastructures. In: Euro-Par. (2014) 258–269
13. Cheng, L., Kotoulas, S., Ward, T.E., Theodoropoulos, G.: Robust and skew-resistant parallel
joins in shared-nothing systems. In: CIKM. (2014)
14. Cheng, L., Kotoulas, S., Ward, T.E., Theodoropoulos, G.: A two-tier index architecture for
fast processing large RDF data over distributed memory. In: HT. (2014) 300–302
15. Abadi, D.J., Marcus, A., Madden, S.R., Hollenbach, K.: Scalable semantic web data man-
agement using vertical partitioning. In: VLDB. (2007) 411–422
Objects as results from graph queries using an
ORM and generated semantic-relational binding
Marc-Antoine Parent
maparent@acm.org
Imagination for People
Abstract. This poster describes a method to expose data in an object-relational model as linked data. It uses Virtuoso's Linked Data Views to expose and query relational data. As our object-relational model is evolving, we generate the linked data view definitions from augmented ORM declarations.
This project has received funding from the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 611188.
1 Interoperability in the Catalyst Consortium
The Catalyst consortium1 has been funded by the European Commission to
develop existing tools for argument mapping and argument analytics into an
ecosystem of collective intelligence tools [2] using Linked Data, and test those
tools on a large scale.
The existing tools (Cohere2 by Open University’s KMI3 , Deliberatorium4 by
MIT’s Centre for Collective Intelligence5 and University of Zürich6 , EdgeSense7
by Wikitalia8 , and Assembl9 by Imagination for People10 ) have used a disparate
array of languages (PHP, Lisp, Python) and relational databases (PostgreSQL,
MySQL). More importantly, the data models differ in many ways: idea-centric or message-centric, the importance given to data history and to social analytics, idea classification at creation time or post hoc, etc. Finally, we were all dealing with algorithmic problems that we knew could benefit from graph queries in semantic databases.
1 http://catalyst-fp7.eu/
2 http://cohere.open.ac.uk/
3 http://kmi.open.ac.uk/
4 http://cci.mit.edu/klein/deliberatorium.html
5 http://cci.mit.edu/
6 http://www.ifi.uzh.ch/index.html
7 https://github.com/Wikitalia/edgesense
8 http://www.wikitalia.it/
9 http://assembl.org/
10 http://imaginationforpeople.org/fr/
The technical partners agreed to use Linked Data technologies for interoper-
ability, both for its inherent flexibility and because there were relevant standard
ontologies that could be leveraged. We co-designed a format [6] that could ac-
commodate the model diversity, leveraging a few common ontologies, such as
SIOC, OpenAnnotation, and FOAF; and with equivalences to other relevant on-
tologies such as AIF. New ontologies were developed when necessary, for example
for IBIS data. To lower the barrier to entry for partners with limited expertise
with Semantic Web technologies, we agreed on RESTful exchange of JSON-LD
data as the main interchange protocol. Partners with legacy relational models
could enter the ecosystem with simple JSON parsers and generators.
The Assembl platform could not follow that simple model: our legacy model
was object-oriented, using SQLAlchemy [1], a Python declarative ORM, with
PostgreSQL. We wanted to use a semantic database rather than a semantic
wrapper on a traditional relational database, so we could display the results
of complex graph queries efficiently. On the other hand, we had a fair amount
of business logic coded at the object layer, which we wanted to leverage; and
the object model was under continuous development, and we did not want to
maintain a semantic-relational wrapper independently.
2 Existing solutions for proxies to data storage
The first alternatives we rejected were: Pure relational (weakness of graph-
oriented queries), pure semantic (relative obscurity of object-semantic tooling
in Python), and partitioning our data between two databases, with the relation-
ships in a semantic database and the content in an RDBMS (overhead of joining
across database systems).
When dealing with relational data, Object-Relational Mappings (ORMs) allow
developers to write an object model annotated with mappings to the relational
model. This annotated object model can act as a OO wrapper, or proxy to the
relational model. Many ORMs also allow developers to generate the relational
model from the object model, or more rarely the object model (with relational
annotations) from the relational model through relational introspection. In either
case, we have a single authoritative model, which software engineers also call the
“don’t repeat yourself” (DRY) principle.
With semantic data, we also have object proxies over semantic data11. As with an ORM, OO code can be annotated with semantic mapping annotations, as with RDFAlchemy12 or Elmo13. Also similarly, the OO code of those proxies can be generated from the RDFS or OWL models, as with RdfReactor14 or Jastor15 respectively, and others [5]. In the case of dynamic languages like Python, it is
11 http://semanticweb.org/wiki/Tripresso
12 http://www.openvest.com/trac/wiki/RDFAlchemy
13 http://www.openrdf.org/elmo.jsp
14 http://rdfreactor.semweb4j.org/
15 http://jastor.sourceforge.net/
also possible to dynamically translate accessor queries to unchecked RDF access,
as with, for example, SuRF16 or OldMan17 .
Another bridging technique involves semantic mapping over relational data.
An adapter will use this mapping to act as a semantic proxy. The most well-
known mapping language is the R2RML standard [3], but Virtuoso offers its
own Linked Data Views syntax [4]. Those technologies allow great flexibility,
but require to maintain the semantic-relational mapping in synchrony with the
pre-existing semantic and relational models.
3 Generated semantic-relational binding for Assembl
We opted to use the Virtuoso database and enrich the relational annotation layer
of SQLAlchemy with a semantic layer, with enough information to generate not
only the relational model, but also the semantic mapping. This gives us both
OO and semantic proxies over relational data. Simple traversals are converted
by the ORM into relational queries, while more complex traversals are written as
embedded SPARQL. We can expose the data as a SPARQL endpoint, or export
it as JSON-LD for the benefit of the Catalyst ecosystem.
We generate Virtuoso linked data view definitions rather than R2RML mappings (which our annotations would also allow). This allows us to also exploit
Virtuoso’s capability to embed a SPARQL subquery in a SQL query. Thus, our
code receives ORM objects directly from a SPARQL query, and a single code-
base serves as both an OO and semantic proxy to our data. We still have to keep
this annotation layer up-to-date with both our relational and semantic models,
as in the case of a hand-crafted semantic-relational mapping; but we avoid the
maintainability cost of updating a distinct OO layer.
The poster18 contains examples of data definitions in the object model, and their translation to a Virtuoso linked data binding.
Implementation Our semantic annotation layer19 is based on work by William
Waites20 to extend SQLAlchemy with specificities of the Virtuoso SQL dialect.
An extensible introspector visits the classes known to the ORM and obtains the
following information from class and columns annotations: the IRI pattern for
the class; a global condition for the class to be included in the mapping; and
for each database column, a specification needed to define a specific quad in the
Linked data view.
SQLAlchemy has advanced capability to translate Object-Oriented (OO) in-
heritance into complex relational patterns21 , so the introspector has to cater to
class definitions spanning many tables, or multiple classes sharing a single table,
16 https://pythonhosted.org/SuRF/
17 https://github.com/oldm/OldMan
18 http://maparent.ca/iswc2014poster.pdf
19 https://github.com/maparent/virtuoso-python
20 http://river.styx.org/ww/2010/10/pyodbc-spasql/index
21 http://docs.sqlalchemy.org/en/rel_0_9/orm/inheritance.html
and generate appropriate bindings. In Assembl, a subclass of the introspector
also allows more quad specifications tied to more than one column, multiple
graphs, global conditions that apply to class patterns, etc.
Much of the quad specification besides the predicate can be left blank: the introspector can be initialized with a default graph, the subject defaults to the class's subject IRI pattern, and the object defaults to the column to which the quad specification is attached, interpreted either as a literal or as the application of an IRI pattern that can be inferred from foreign key information.
The quad specification may also specify a condition of applicability using the
ORM constructs. The condition’s structure is visited to define a coherent set of
table aliases for this condition, which will be used in the linked data binding. A
reference to a column defined in a superclass (which may appear in the object or
condition of the quad specification) will enrich the condition with the appropriate
table join; similarly, a reference to a subclass which does not define its own table
will re-use the appropriate ORM condition.
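As a rough illustration of the kind of annotation such an introspector can consume (the actual constructs of the annotation layer differ; the attribute names below are our own invention), one can piggyback on SQLAlchemy's generic info dictionaries and walk the mapped classes:

    # Hypothetical sketch, not the project's real annotation API.
    from sqlalchemy import Column, Integer, String
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class Idea(Base):
        __tablename__ = "idea"
        # assumed class-level annotations: IRI pattern and RDF class
        __rdf__ = {"iri": "http://example.org/idea/{id}",
                   "type": "http://rdfs.org/sioc/ns#Item"}
        id = Column(Integer, primary_key=True)
        title = Column(String, info={"rdf_predicate": "http://purl.org/dc/terms/title"})

    def introspect(cls):
        """Walk an ORM class and yield (subject IRI pattern, predicate, column) specs."""
        spec = cls.__rdf__
        yield (spec["iri"], "rdf:type", spec["type"])
        for col in cls.__table__.columns:
            pred = col.info.get("rdf_predicate")
            if pred:
                yield (spec["iri"], pred, "%s.%s" % (cls.__tablename__, col.name))

    for quad_spec in introspect(Idea):
        print(quad_spec)

From specs of this shape, a generator can emit either Virtuoso linked data view statements or R2RML, which is the design choice discussed above.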
4 Open issues
Having exposed our relational data as linked data, we will next work on importing semantic data and translating it into our relational model. We have
to contend with the fact that open-world semantic data may not conform to
referential integrity constraints defined at the relational layer. Also, because it
is based on SQLAlchemy models, our solution follows its OO model with single
inheritance.
References
1. Bayer, M.: SQLAlchemy. In: Brown, A., Wilson, G. (eds.) The Architecture of Open Source Applications: Elegance, Evolution, and a Few More Fearless Hacks, vol. 2. Lulu.com (2012), http://aosabook.org/en/sqlalchemy.html
2. Buckingham Shum, S., De Liddo, A., Klein, M.: DCLA meet CIDA: Collective intelligence deliberation analytics. The Second International Workshop on Discourse-Centric Learning Analytics (Mar 2014), http://dcla14.files.wordpress.com/2014/03/dcla14_buckinghamshumdeliddoklein1.pdf
3. Das, S., Sundara, S., Cyganiak, R.: R2RML: RDB to RDF mapping language. W3C recommendation, World Wide Web Consortium (September 2012), http://www.w3.org/TR/2012/REC-r2rml-20120927/
4. Haynes, T.: Mapping relational data to RDF with Virtuoso's RDF Views. Tech. rep., OpenLink Software (2010), http://virtuoso.openlinksw.com/whitepapers/Mapping%20Relational%20Data%20to%20RDF.pdf
5. Kalyanpur, A., Pastor, D.J., Battle, S., Padget, J.A.: Automatic mapping of OWL ontologies into Java. In: Maurer, F., Ruhe, G. (eds.) SEKE. pp. 98–103 (2004)
6. Parent, M.A., Grégoire, B.: Software architecture and cross-platform interoperability specification. D 3.1, Catalyst-FP7 (Mar 2014), http://catalyst-fp7.eu/wp-content/uploads/2014/03/D3.1-Software-Architecture-and-Cross-Platform-Interoperability-Specification.pdf
Hedera: Scalable Indexing and Exploring
Entities in Wikipedia Revision History
Tuan Tran and Tu Ngoc Nguyen
L3S Research Center / Leibniz Universität Hannover, Germany
{ttran, tunguyen}@L3S.de
Abstract. Much of the work in the Semantic Web that relies on Wikipedia as its main source of knowledge operates on static snapshots of the dataset. The full history of Wikipedia revisions, while containing much more useful information, is still difficult to access due to its exceptional volume. To enable further research on this collection, we developed a tool, named Hedera, that efficiently extracts semantic information from Wikipedia revision history datasets. Hedera exploits the Map-Reduce paradigm to achieve rapid extraction; it is able to process the entire revision history of Wikipedia articles within a day on a medium-scale cluster, and supports flexible data structures for various kinds of Semantic Web studies.
1 Introduction
For over a decade, Wikipedia has been a backbone of Semantic Web research, with the proliferation of high-quality big knowledge bases (KBs) such as DBpedia [1], where information is derived from various public Wikipedia collections. Existing approaches often rely on a single offline snapshot of the dataset; they treat knowledge as static and ignore the temporal evolution of information in Wikipedia. When, for instance, a fact changes (e.g. the death of a celebrity) or entities themselves evolve, this can only be reflected in the next version of the knowledge bases (typically extracted fresh from a newer Wikipedia dump). This undesirable quality of KBs makes them unable to capture temporally dynamic relationships latent among revisions of the encyclopedia (e.g., entities participating together in complex events), which are difficult to detect in a single Wikipedia snapshot. Furthermore, applications relying on obsolete facts might fail to reason under new contexts (e.g. question answering systems for recent real-world incidents), because those facts were not captured in the KBs. In order to complement these temporal aspects, the whole Wikipedia revision history should be well exploited. However, such longitudinal analytics over the enormous volume of Wikipedia require huge computation. In this work, we develop Hedera, a large-scale framework that
supports processing, indexing and visualising Wikipedia revision history. Hedera
is an end-to-end system that works directly with the raw dataset, processes it into streaming data, and incrementally indexes and visualizes the information of entities registered in the KBs in a dynamic fashion. In contrast to existing work that handles the dataset in centralized settings [2], Hedera employs the Map-Reduce paradigm to achieve scalable performance: it is able to transform the raw 2.5-year revision history of 1 million entities into a full-text index within a few hours on an 8-node cluster. We have open-sourced Hedera to facilitate further research.1
2 Extracting and Indexing Entities
2.1 Preprocessing Dataset
Here we describe the Hedera architecture and workflow. As shown in Figure 1,
the core data input of Hedera is a Wikipedia Revision history dump 2 . Hedera
currently works with the raw XML dumps; it supports accessing and extracting information directly from compressed files and makes heavy use of the Hadoop framework. The preprocessor is responsible for re-partitioning the raw files into independent units (a.k.a. InputSplits in Hadoop) depending on users' needs. There are two levels of partitioning: entity-wise and document-wise. Entity-wise partitioning guarantees that revisions belonging to the same entity are sent to one computing node, while document-wise partitioning sends the content of revisions arbitrarily to any node and keeps track, in each revision, of a reference to its preceding ones for future use at the Map-Reduce level. The preprocessor accepts user-defined low-level filters (for instance, only partition articles, or revisions within 2011 and 2012), as well as a list of entity identifiers from a knowledge base to restrict the extraction to. If filtering by knowledge base, users must provide methods to verify a revision against the map of entities (for instance, using Wikipedia-derived URLs of entities). The results are Hadoop file splits, in XML or JSON format.
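The following Python toy (not Hedera's Java code) illustrates the difference between the two partitioning modes: entity-wise routing keeps all revisions of a page together, while document-wise partitioning spreads revisions arbitrarily but records a link to the preceding revision.

    # Illustrative sketch of the two partitioning modes; field names are our own.
    def entity_wise(revisions, num_splits):
        splits = [[] for _ in range(num_splits)]
        for rev in revisions:
            splits[hash(rev["page_id"]) % num_splits].append(rev)   # same page -> same split
        return splits

    def document_wise(revisions, num_splits):
        splits = [[] for _ in range(num_splits)]
        last_rev_of_page = {}
        for i, rev in enumerate(revisions):
            rev["parent_rev"] = last_rev_of_page.get(rev["page_id"])  # link to predecessor
            last_rev_of_page[rev["page_id"]] = rev["rev_id"]
            splits[i % num_splits].append(rev)                        # arbitrary placement
        return splits

    revs = [{"page_id": "Q1", "rev_id": 1}, {"page_id": "Q1", "rev_id": 2},
            {"page_id": "Q2", "rev_id": 3}]
    print(entity_wise(revs, 2))
    print(document_wise(revs, 2))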
[Figure 1 shows the Hedera architecture: the Wikipedia revision history dump, optionally restricted by Wikipedia-derived ontologies, is read by the Preprocessor with user-defined filters; the resulting input splits are streamed through Transformers into the Map-Reduce Extraction layer (Hadoop jobs, Pig workflows, extensions), whose entity snippets feed batch indexing, temporal information extraction and longitudinal analytics.]
Fig. 1. Hedera Framework Architecture
2.2 Extracting Information
Before being processed in the Map-Reduce phase (Extraction component in Figure 1), file splits output by the preprocessor are streamed into a Transformer. The main goal of the transformer is to consume the files and emit (key, value) pairs suitable as input to a Map function. Hedera provides several classes of
transformer, each of which implements one operator specified in the extraction
1 Project documentation and code can be found at: https://github.com/antoine-tran/Hedera
2 http://dumps.wikimedia.org
layer. Pushing down these operators into transformers reduces significantly the
volume of text sent around the network. The extraction layer enables users to
write extraction logic in high-level programming languages such as Java or Pig 3 ,
which can be used in other applications. The extraction layer also accepts user-defined filters, allowing users to extract and index different portions of the same partitions at different times. For instance, the user can choose to first filter and partition Wikipedia articles published in 2012, and later sample, from one partition, the revisions about people published in May 2012. This flexibility facilitates rapid development of research-style prototypes on the Wikipedia revision dataset, which is one of our major contributions.
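A minimal, purely illustrative sketch of this filter-then-emit pattern (Hedera itself exposes it through Java and Pig operators) could look as follows in Python:

    # Toy transformer: apply a user-defined filter and emit (key, value) pairs
    # ready to feed a Map function, so filtering happens before data is shipped.
    def year_filter(year):
        return lambda rev: rev["timestamp"].startswith(str(year))

    def transform(revisions, rev_filter):
        for rev in revisions:
            if rev_filter(rev):
                # key = entity, value = the snippet the extraction logic needs
                yield rev["page_title"], {"rev_id": rev["rev_id"], "text": rev["text"]}

    revs = [{"page_title": "Barack Obama", "rev_id": 10,
             "timestamp": "2012-05-01T00:00:00Z", "text": "..."},
            {"page_title": "Barack Obama", "rev_id": 11,
             "timestamp": "2013-01-01T00:00:00Z", "text": "..."}]

    for key, value in transform(revs, year_filter(2012)):
        print(key, value["rev_id"])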
3 Indexing and Exploring Entity-based Evolutions in
Wikipedia
In this section, we illustrate the use of Hedera in one application - incremental
indexing and visualizing Wikipedia revision history. Indexing large-scale longitudinal data collections, i.e., the Wikipedia history, is not a straightforward problem. Challenges in finding a scalable data structure and distributed storage that can best exploit data along the time dimension are still not fully addressed. In Hedera, we present a distributed approach in which the collection is processed and thereafter the indexing is parallelized using the Map-Reduce paradigm. This approach (which is based on the document-based data structure of ElasticSearch) can be considered as a baseline for further optimizations. The index schema is loosely structured, which allows flexible updates and incremental indexing of new revisions (a necessity for the evolving Wikipedia history collection). Our preliminary evaluation showed that this approach outperformed the well-known centralized indexing method provided by [2]; the gap in processing (indexing) time widens rapidly as the data volume increases. In addition, we also evaluated the querying time of the system and observed a similar result. We describe how the temporal index facilitates large-scale analytics on the semantics of Wikipedia with some case studies. The details of the experiment are described below.
We extract 933,837 entities registered in DBpedia, each of which corresponds to one Wikipedia article. The time interval spans from 1 Jan 2011 to 13 July 2013, containing 26,067,419 revisions and amounting to 601 GB of text in uncompressed format. The data is processed and re-partitioned using Hedera before being passed on and indexed into ElasticSearch 4 (a distributed real-time indexing framework that supports data at large scale) using Map-Reduce. Figure 2 illustrates a toy example of analysing the temporal dynamics of entities in Wikipedia. Here we aggregate the results for three distinct entity queries, i.e., obama, euro and olympic, on the temporal anchor-text index (anchor text being the visible text of a hyperlink between two Wikipedia revisions). The left-most table shows the top terms appearing in the returned results, whereas the two timeline graphs illustrate the dynamic evolution of the entities over the studied time period (with
3 http://pig.apache.org
4 http://www.elasticsearch.org
Fig. 2. Exploring Entity Structure Dynamics Over Time
1-week and 1-day granularity, from left to right respectively). As is easily observed, the three entities peak at the times when related events happen (Euro 2012 for euro, the US Presidential Election for obama, and the Summer and Winter Olympics for olympic). This further shows the value of temporal anchor text in mining Wikipedia entity dynamics. We experimented analogously on the Wikipedia full-text index. Here we present a case study of entity co-occurrence (or temporal relationship) between Usain Bolt and Mo Farah, where the two co-peak at the time of the Summer Olympics 2012, a big tournament in which the two athletes participated together. These examples demonstrate the value of our temporal Wikipedia indexes for temporal semantic research challenges.
4 Conclusions and Future Work
In this paper, we introduced Hedera, our ongoing work on supporting flexible and efficient access to the Wikipedia revision history dataset. Hedera works directly with the low-level raw data and uses Map-Reduce to achieve high-performance computation. We open-source Hedera for future use by the research community, and believe our system is the first publicly available one of its kind. Future work includes deeper integration with knowledge bases, with more APIs and services to access the extraction layer more flexibly.
Acknowledgements
This work is partially funded by the FP7 project ForgetIT (under grant No. 600826)
and the ERC Advanced Grant ALEXANDRIA (under grant No. 339233).
References
1. S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia:
A nucleus for a web of open data. Springer, 2007.
2. O. Ferschke, T. Zesch, and I. Gurevych. Wikipedia revision toolkit: efficiently ac-
cessing wikipedia’s edit history. In HLT, pages 97–102, 2011.
Evaluating Ontology Alignment Systems in
Query Answering Tasks
Alessandro Solimando1 , Ernesto Jiménez-Ruiz2 , and Christoph Pinkel3
1 Dipartimento di Informatica, Bioingegneria, Robotica e Ingegneria dei Sistemi, Università di Genova, Italy
2 Department of Computer Science, University of Oxford, UK
3 fluid Operations AG, Walldorf, Germany
Abstract. Ontology matching has received increasing attention and gained
importance in more recent applications such as ontology-based data ac-
cess (OBDA). However, query answering over aligned ontologies has not
been addressed by any evaluation initiative so far. A novel Ontology
Alignment Evaluation Initiative (OAEI) track, Ontology Alignment for
Query Answering (OA4QA), introduced in the 2014 evaluation cam-
paign, aims at bridging this gap in the practical evaluation of matching
systems w.r.t. this key usage.
1 Introduction
Ontologies play a key role in the development of the Semantic Web and are being
used in many application domains such as biomedicine and the energy industry. An application domain may have been modeled with different points of view and purposes. This situation usually leads to the development of different ontologies that intuitively overlap but use different naming and modeling conventions.
The problem of (semi-)automatically computing mappings between indepen-
dently developed ontologies is usually referred to as the ontology matching prob-
lem. A number of sophisticated ontology matching systems have been developed
in the last years [5]. Ontology matching systems, however, rely on lexical and
structural heuristics and the integration of the input ontologies and the map-
pings may lead to many undesired logical consequences. In [1] three principles
were proposed to minimize the number of potentially unintended consequences,
namely: (i) consistency principle, the mappings should not lead to unsatisfiable
classes in the integrated ontology; (ii) locality principle, the mappings should
link entities that have similar neighbourhoods; (iii) conservativity principle, the
mappings should not introduce alterations in the classification of the input on-
tologies. The occurrence of these violations is frequent, even in the reference
mapping sets of the Ontology Alignment Evaluation Initiative4 (OAEI ) [6].
Violations to these principles may hinder the usefulness of ontology map-
pings. The practical effect of these violations, however, is clearly evident when
ontology alignments are involved in complex tasks such as query answering [4].
4 http://oaei.ontologymatching.org/
Fig. 1. Ontology Alignment in an OBDA Scenario
The traditional tracks of OAEI evaluate ontology matching systems w.r.t. scala-
bility, multi-lingual support, instance matching, reuse of background knowledge,
etc. Systems’ e↵ectiveness is, however, only assessed by means of classical infor-
mation retrieval metrics (i.e., precision, recall and f-measure) w.r.t. a manually-
curated reference alignment, provided by the organisers. The new OA4QA track5
evaluates those same metrics, but w.r.t. the ability of the generated alignments
to enable the answer of a set of queries in an OBDA scenario, where several
ontologies exist. Figure 1 shows an OBDA scenario where the first ontology pro-
vides the vocabulary to formulate the queries (QF-Ontology) and the second is
linked to the data and is not visible to the users (DB-Ontology). Such an OBDA scenario arises in real-world use cases (e.g., the Optique project6 [2, 6]). The integration via ontology alignment is required since only the vocabulary of the DB-Ontology is connected to the data. The OA4QA track will also be key for investigating the effects of logical violations affecting the computed alignments, and for evaluating the effectiveness of the repair strategies employed by the matchers.
2 Ontology Alignment for Query Answering
This section describes the considered dataset and its extensions (Section 2.1), the
query processing engine (Section 2.2), and the evaluation metrics (Section 2.3).
2.1 Dataset
The set of ontologies coincides with that of the conference track,7 in order to
facilitate the understanding of the queries and query results. The dataset is
however extended with synthetic ABoxes, extracted from the DBLP dataset.8
Given a query q expressed using the vocabulary of ontology O1, another ontology O2 enriched with synthetic data is chosen. Finally, the query is executed over the aligned ontology O1 ∪ M ∪ O2, where M is an alignment between O1 and O2. Referring to Figure 1, O1 plays the role of the QF-Ontology, while O2 plays that of the DB-Ontology.
5 http://www.cs.ox.ac.uk/isg/projects/Optique/oaei/oa4qa/
6 http://www.optique-project.eu/
7 http://oaei.ontologymatching.org/2014/conference/index.html
8 http://dblp.uni-trier.de/xml/
2.2 Query Evaluation Engine
The evaluation engine considered is an extension of the OWL 2 reasoner Her-
miT, known as OWL-BGP 9 [3]. OWL-BGP is able to process SPARQL queries
in the SPARQL-OWL fragment, under the OWL 2 Direct Semantics entailment
regime.10 The queries employed in the OA4QA track are standard conjunctive
queries, which are fully supported by the more expressive SPARQL-OWL fragment. SPARQL-OWL, for instance, also supports queries where variables occur within complex class expressions or bind to class or property names.
2.3 Evaluation Metrics and Gold Standard
As already discussed in Section 1, the evaluation metrics used for the OA4QA
track are the classic information retrieval ones (i.e., precision, recall and f-
measure), but on the result set of the query evaluation. In order to compute
the gold standard for query results, the publicly available reference alignment ra1 has been manually revised. The aforementioned metrics are then evaluated, for each alignment computed by the different matching tools, against ra1 and against a version of ra1 manually repaired from conservativity and consistency violations.
Three categories of queries will be considered in OA4QA: (i) basic, (ii) queries
involving violations, (iii) advanced queries involving nontrivial mappings.
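Conceptually, the metrics are the usual set-based ones applied to query result sets rather than to mappings; a small sketch (with invented example bindings) is:

    # Precision/recall/f-measure over result sets; the bindings below are made up.
    def evaluate(returned, gold):
        returned, gold = set(returned), set(gold)
        tp = len(returned & gold)
        precision = tp / len(returned) if returned else 0.0
        recall = tp / len(gold) if gold else 0.0
        f = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
        return precision, recall, f

    gold_q2 = {"ekaw:person1", "ekaw:person2"}
    returned_q2 = {"ekaw:person1", "ekaw:person2", "ekaw:student7"}  # a wrongly returned student
    print(evaluate(returned_q2, gold_q2))   # precision < 1, recall = 1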
2.4 Impact of the Mappings in the Query Results
As an illustrative example, consider the aligned ontology O_U computed using confof and ekaw as input ontologies (O_confof and O_ekaw, respectively), and the ra1 reference alignment between them. O_U entails ekaw:Student ⊑ ekaw:ConfParticipant, while O_ekaw does not, and therefore this represents a conservativity principle violation. Clearly, the result set for the query q(x) ← ekaw:ConfParticipant(x) will erroneously contain any student not actually participating in the conference. The explanation for this entailment in O_U is given below, where Axioms 1 and 3 are mappings from the reference alignment.

    confof:Scholar ≡ ekaw:Student               (1)
    confof:Scholar ⊑ confof:Participant          (2)
    confof:Participant ≡ ekaw:ConfParticipant    (3)

The softening of Axiom 3 into confof:Participant ⊒ ekaw:ConfParticipant represents a possible repair for the aforementioned violation.
3 Preliminary Evaluation
In Table 1,11 we show a preliminary evaluation using the alignments of the OAEI 2013 participants and the following queries: (i) q1(x) ← ekaw:Author(x),
9 https://code.google.com/p/owl-bgp/
10 http://www.w3.org/TR/2010/WD-sparql11-entailment-20100126/#id45013
11 #q(x) refers to the cardinality of the result set.
Category     Query  #M    Reference Alignment               Repaired Alignment
                          #q(x)  Prec.  Rec.  F-meas.       #q(x)  Prec.  Rec.  F-meas.
Basic        q1     5      98    1      1     1              98    1      1     1
Violations   q2     4      53    0.8    1     0.83           38    0.57   1     0.68
Advanced     q3     7      -     -      -     -             182    1      0.5   0.67
Table 1. Preliminary query answering results for the OAEI 2013 alignments
over the ontology pair ⟨cmt, ekaw⟩; (ii) q2(x) ← ekaw:ConfParticipant(x), over ⟨confof, ekaw⟩, involving the violation described in Section 2.4; and (iii) q3(x) ← confof:Reception(x) ∪ confof:Banquet(x) ∪ confof:Trip(x), over ⟨confof, edas⟩. The evaluation12 shows the negative effect on precision of logical flaws affecting the computed alignments (q2) and a lowering in recall due to a missing mapping (q3). For q3 the results w.r.t. the reference alignment (ra1) are missing due to the unsatisfiability of the aligned ontology O_confof ∪ O_edas ∪ ra1.
4 Conclusions and Future Work
We have presented the novel OAEI track addressing query answering over pairs
of ontologies aligned by a set of ontology-to-ontology mappings. From the prelim-
inary evaluation, the main limits of the traditional evaluation with respect to logical violations of the alignments clearly emerged. As future work we plan
to cover increasingly complex queries and ontologies, including the ones in the
Optique use case [6]. We also plan to consider more complex scenarios involving
a single QF-Ontology aligned with several DB-Ontologies.
Acknowledgements. This work was supported by the EU FP7 IP project Optique
(no. 318338), the MIUR project CINA (Compositionality, Interaction, Negotia-
tion, Autonomicity for the future ICT society) and the EPSRC project Score!.
References
1. Jiménez-Ruiz, E., Cuenca Grau, B., Horrocks, I., Berlanga, R.: Logic-based Assess-
ment of the Compatibility of UMLS Ontology Sources. J. Biomed. Semant. (2011)
2. Kharlamov, E., et al.: Optique 1.0: Semantic Access to Big Data: The Case of
Norwegian Petroleum Directorate’s FactPages. ISWC (Posters & Demos) (2013)
3. Kollia, I., Glimm, B., Horrocks, I.: SPARQL query answering over OWL ontologies.
In: The Semantic Web: Research and Applications, pp. 382–396. Springer (2011)
4. Meilicke, C.: Alignments Incoherency in Ontology Matching. Ph.D. thesis, Univer-
sity of Mannheim (2011)
5. Shvaiko, P., Euzenat, J.: Ontology Matching: State of the Art and Future Challenges.
IEEE Transactions on Knowl. and Data Eng. (TKDE) (2012)
6. Solimando, A., Jiménez-Ruiz, E., Guerrini, G.: Detecting and Correcting Conser-
vativity Principle Violations in Ontology-to-Ontology Mappings. In: International
Semantic Web Conference (2014)
12 Out of the 26 alignments of OAEI 2013, only the number of alignments shown in column #M were able to produce a result (the others failed either due to logical problems or to an empty result set caused by missing mappings). Reported precision/recall values are averages.
Using Fuzzy Logic For Multi-Domain Sentiment
Analysis
Mauro Dragoni1 , Andrea G.B. Tettamanzi2 , and Célia da Costa Pereira2
1 FBK–IRST, Trento, Italy
2 Université Nice Sophia Antipolis, I3S, UMR 7271, Sophia Antipolis, France
dragoni@fbk.eu andrea.tettamanzi|celia.pereira@unice.fr
Abstract. Recent advances in the Sentiment Analysis field focus on investigating the polarities that concepts describing the same sentiment have when they are used in different domains. In this paper, we investigate the use of a fuzzy logic representation for modeling knowledge concerning the relationships between sentiment concepts and different domains. The developed system is built on top of a knowledge base defined by integrating WordNet and SenticNet, and it implements an algorithm for learning the use of sentiment concepts from multi-domain datasets and for propagating such information to each concept of the knowledge base. The system has been validated on the Blitzer dataset, a multi-domain sentiment dataset built from reviews of Amazon products, demonstrating the effectiveness of the proposed approach.
1 Introduction
Sentiment Analysis is a kind of text categorization task that aims to classify documents
according to their opinion (polarity) on a given subject [1]. This task has attracted considerable interest due to its wide range of applications. However, in classic Sentiment Analysis the polarity of each term of the document is computed independently of the domain to which the document belongs. Recently, the idea of adapting term polarities to different domains emerged [2]. The rationale behind such an investigation is simple. Let's consider the following example concerning the adjective "small":
1. The sideboard is small and it is not able to contain a lot of stuff.
2. The small dimensions of this decoder allow to move it easily.
In the first text, we considered the Furnishings domain and, within it, the polarity of the adjective "small" is clearly "negative" because it highlights an issue of the described item. On the other hand, in the second text, where we considered the Electronics domain, the polarity of the same adjective can be considered "positive".
In the literature, different approaches to Multi-Domain Sentiment Analysis have been proposed. Briefly, two main categories may be identified: (i) the transfer of learned classifiers across different domains [3, 4], and (ii) the propagation of labels through graph structures [5, 6]. Independently of the kind of approach, works using concepts rather than terms for representing different sentiments have been proposed.
Differently from the approaches already discussed in the literature, we address the
multi-domain sentiment analysis problem by applying fuzzy logic theory to model membership functions representing the relationships between concepts and domains. Moreover, the proposed system exploits semantic background knowledge to propagate the information represented by the learned fuzzy membership functions to each element of the network.
2 System
The main aim of the implemented system is the learning of fuzzy membership func-
tions representing the belonging of a concept with respect to a domain in terms of both
sentiment polarity as well as aboutness. The two pillars on which the system has been built are: (i) the use of fuzzy logic for modeling the polarity of a concept with respect to a domain as well as its aboutness, and (ii) the creation of a two-level graph where the top level represents the semantic relationships between concepts, while the bottom level contains the links between all concept membership functions and the domains.
Figure 1 shows the conceptualization of the two-level graph. Relationships between the concepts of Level 1 (the Semantic Level) are described by the background knowledge exploited by the system. The types of relationships are the same as those generally used in linguistic resources: for example, concepts C1 and C3 may be connected through an Is-A relationship rather than an Antonym one. Instead, each connection of Level 2 (the Sentiment Level) describes the belonging of each concept with respect to the different domains taken into account.
The system has been trained by using the Blitzer dataset3 in two steps: (i) the fuzzy membership functions are initially estimated by analyzing only the explicit information present within the dataset (Section 2.1); (ii) the explicit information is then propagated through the Sentiment Level graph by exploiting the connections defined in the Semantic Level (Section 2.2).
2.1 Preliminary Learning Phase
The Preliminary Learning (PL) phase aims to estimate the starting polarity of each
concept with respect to a domain. The estimation of this value is done by analyzing
only the explicit information provided by the training set. This phase allows us to define the preliminary fuzzy membership functions between the concepts defined in the Semantic Level of the graph and the domains defined in the Sentiment one. Such a value is computed by Equation 1:

    polarity*_i(C) = k_C^i / T_C^i  ∈ [−1, 1],   ∀ i = 1, . . . , n,   (1)
where C is the concept taken into account, index i refers to the domain D_i which the concept belongs to, n is the number of domains available in the training set, k_C^i is the arithmetic sum of the polarities observed for concept C in the training set restricted to
3 http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
Fig. 1: The two-layer graph initialized during the Preliminary Learning Phase (a) and
its evolution after the execution of the Information Propagation Phase (b).
domain D_i, and T_C^i is the number of instances of the training set, restricted to domain D_i, in which concept C occurs. The shape of the fuzzy membership function generated during this phase is a triangle with the top vertex at coordinates (x, 1), where x = polarity*_i(C), and with the two bottom vertices at coordinates (−1, 0) and (1, 0) respectively. The rationale is that while we have one point (x) in which we have
(1, 0) respectively. The rationale is that while we have one point (x) in which we have
full confidence, our uncertainty covers the entire space because we do not have any
information concerning the remaining polarity values.
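The following Python sketch illustrates Equation 1 and the triangular membership function with toy observations (the data and names are ours, not taken from the Blitzer dataset):

    # Preliminary Learning sketch: estimated polarity and triangular membership.
    def preliminary_polarity(observations):
        """observations: list of +1/-1 polarities for concept C, restricted to domain i."""
        return sum(observations) / len(observations)       # k_C^i / T_C^i, in [-1, 1]

    def triangular_membership(x_apex):
        """Triangle with apex (x_apex, 1) and base vertices (-1, 0) and (1, 0)."""
        def mu(p):
            if p <= x_apex:
                return (p + 1.0) / (x_apex + 1.0) if x_apex > -1.0 else float(p == -1.0)
            return (1.0 - p) / (1.0 - x_apex) if x_apex < 1.0 else float(p == 1.0)
        return mu

    obs_small_electronics = [+1, +1, -1, +1]                # toy observations
    x = preliminary_polarity(obs_small_electronics)         # 0.5
    mu = triangular_membership(x)
    print(x, mu(0.5), mu(-1.0), mu(1.0))                    # 0.5 1.0 0.0 0.0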
2.2 Information Propagation Phase
The Information Propagation (IP) phase aims to exploit the explicit information learned in the PL phase in order to both (i) refine the fuzzy membership functions of the known concepts and (ii) model such functions for concepts that are not specified in the training set but are semantically related to the specified ones. Figure 1 shows how the two-level graph evolves before and after the execution of the IP phase. After the PL phase only four membership functions are modeled: C1 and C2 for the domain D1, and C1 and C5 for the domain D2 (Figure 1a). However, as we may observe, in the Semantic Level there are concepts that are semantically related to the ones explicitly defined in the training set, namely C3 and C4; there are also concepts for which a fuzzy membership function has not been modeled for some domains (i.e. C2 for the domain D2 and C5 for the domain D1).
Such fuzzy membership functions may be inferred by propagating the information
modeled in the PL phase. Similarly, existing fuzzy membership functions are refined by
the influence of the other ones. Let’s consider the polarity between the concept C3 and
the domain D2 . The fuzzy membership function representing this polarity is strongly
influenced by the ones representing the polarities of concepts C1 and C5 with respect
to the domain D2 .
The propagation of the learned information through the graph is done iteratively: in each iteration, the estimated polarity value of a concept x learned during the PL phase is updated based on the learned values of the adjoining concepts. At each iteration, the updated value is saved in order to exploit it for re-shaping the fuzzy membership function associating the concept x with the domain i.
The resulting shapes of the inferred fuzzy membership functions are trapezoids, where the extension of the upper base is proportional to the difference between the value learned during the PL phase (V_pl) and the value obtained at the end of the IP phase (V_ip), while the support is proportional to both the number of iterations needed by the concept x to converge to V_ip and the variance with respect to the average of the values computed after each iteration of the IP phase.
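A possible reading of this construction, with arbitrary placeholder scaling constants (the paper does not give the exact proportionality factors), is sketched below:

    # Illustrative trapezoid construction; alpha and beta are arbitrary placeholders.
    def trapezoid_vertices(v_pl, v_ip, iterations, variance, alpha=1.0, beta=1.0):
        half_top = alpha * abs(v_pl - v_ip) / 2.0                # upper-base half width
        half_support = half_top + beta * iterations * variance   # lower-base half width
        def clamp(p):
            return max(-1.0, min(1.0, p))
        return [(clamp(v_ip - half_support), 0.0),
                (clamp(v_ip - half_top), 1.0),
                (clamp(v_ip + half_top), 1.0),
                (clamp(v_ip + half_support), 0.0)]

    print(trapezoid_vertices(v_pl=0.5, v_ip=0.2, iterations=4, variance=0.01))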
3 Concluding Remarks
The system has been validated on the full version of the Blitzer dataset4 and the results,
compared with the precision obtained by three baselines, are shown in Table 1.
SVM [7]               Naive-Bayes [8]        Max-Entropy [8]        MDFSA       MDFSA
Precision (Rec. 1.0)  Precision (Rec. 1.0)   Precision (Rec. 1.0)   Precision   Recall
0.8068                0.8227                 0.8275                 0.8617      0.9987
Table 1: Results obtained on the full version of the Blitzer dataset.
The results demonstrated that the modeled fuzzy membership functions may be
exploited effectively for computing the polarities of concepts used in different domains.
References
1. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up? sentiment classification using machine
learning techniques. In: Proceedings of the Conference on Empirical Methods in Natural Lan-
guage Processing (EMNLP), Philadelphia, Association for Computational Linguistics (July
2002) 79–86
2. Blitzer, J., Dredze, M., Pereira, F.: Biographies, bollywood, boom-boxes and blenders: Do-
main adaptation for sentiment classification. In Carroll, J.A., van den Bosch, A., Zaenen, A.,
eds.: ACL, The Association for Computational Linguistics (2007)
3. Bollegala, D., Weir, D.J., Carroll, J.A.: Cross-domain sentiment classification using a senti-
ment sensitive thesaurus. IEEE Trans. Knowl. Data Eng. 25(8) (2013) 1719–1731
4. Xia, R., Zong, C., Hu, X., Cambria, E.: Feature ensemble plus sample selection: Domain
adaptation for sentiment classification. IEEE Int. Systems 28(3) (2013) 10–18
5. Ponomareva, N., Thelwall, M.: Semi-supervised vs. cross-domain graphs for sentiment anal-
ysis. In Angelova, G., Bontcheva, K., Mitkov, R., eds.: RANLP, RANLP 2011 Organising
Committee / ACL (2013) 571–578
6. Tsai, A.C.R., Wu, C.E., Tsai, R.T.H., jen Hsu, J.Y.: Building a concept-level sentiment dic-
tionary based on commonsense knowledge. IEEE Int. Systems 28(2) (2013) 22–30
7. Chang, C.C., Lin, C.J.: Libsvm: A library for support vector machines. ACM TIST 2(3)
(2011) 27
8. McCallum, A.K.: Mallet: A machine learning for language toolkit. http://mallet.cs.
umass.edu (2002)
4 Detailed results and tool demo are available at http://dkmtools.fbk.eu/moki/demo/mdfsa/mdfsa demo.html
AMSL: Creating a Linked Data Infrastructure for Managing Electronic Resources in Libraries
Natanael Arndt1, Sebastian Nuck1, Andreas Nareike1,2, Norman Radtke1,2, Leander Seige2, and Thomas Riechert3
1 AKSW, Institute for Applied Informatics (InfAI) e.V., Hainstraße 11, 04109 Leipzig, Germany, {lastname}@infai.org
2 Leipzig University Library, Beethovenstr. 6, 04107 Leipzig, Germany, {lastname}@ub.uni-leipzig.de
3 Leipzig University of Applied Science (HTWK), Gustav-Freytag-Str. 42, 04251 Leipzig, Germany, thomas.riechert@htwk-leipzig.de
1 Introduction
The library domain is currently undergoing changes relating to not only physical resources (e.g. books, journals, CDs/DVDs) but to the rapidly growing numbers of electronic resources (e.g. e-journals, e-books or databases) which are contained in their collections as well. Since those resources are not limited to physical items anymore and can be copied without loss of information, new licensing and lending models have been introduced. Current licensing models are pay-per-view, patron-driven acquisition, short term loan and big deal. In addition, with changing markets, user expectations and publication formats, new models will be introduced in the future.
The existing infrastructure is not prepared for managing those new electronic resources, licensing and lending models. Software which is developed to meet the current requirements (cf. section 2) is likely to be outdated in a couple of years, due to the changing modeling requirements. To develop a future-proof managing system, a highly flexible data model and software which is adaptable to a changed data model is needed. The targeted user group are librarians with limited or no understanding of Semantic Web, Linked Data and RDF techniques.
We present AMSL [1], an Electronic Resource Management System which is based on the generic and collaborative RDF resource management system OntoWiki [2, 3]. Using RDF as data model makes the application flexible and thus future-proof, while the changing modeling requirements can be met by adjusting the used RDF vocabulary resp. application profile. To adapt the generic OntoWiki to the needs of the domain experts, we have added some components to support domain specific use cases, while still keeping them agnostic to the used RDF vocabulary. The current status of the project, including links to the source code and a live demonstration, is available at the project web page: http://amsl.technology/.
2 State of the Art
Software currently in use in libraries is mainly concerned with managing print resources, such as the Integrated Library System LIBERO (http://www.libero.com.au/), which is used at Leipzig University Library. But such software is not prepared for managing electronic resources with complex licensing and access models. Recently, some commercial vendors have started providing Next Generation Library Systems which constitute a new approach to Electronic Resource Management. Due to their commercial nature, it is difficult to evaluate how flexible the used data models are and which requirements are met [4, 5]. Since the data models are closed, they can not be extended by the customer as requirements change, whereas the contract and subscription terms often change with every new business year. Further, it is more difficult to integrate external knowledge bases, in contrast to Linked Data and the Linked Open Data Cloud [6].
Another approach is to avoid specific library software but to use generic resource or document management systems, such as Wikis. OntoWiki was developed as a Wiki system for collaboratively creating semantic data in RDF and OWL knowledge bases. Over the time, OntoWiki has evolved towards a highly extensible development framework for knowledge intensive applications [3].
j h?2 JaG 1H2+i`QMB+ _2bQm`+2 JM;2K2Mi avbi2K
h?2 JaG 1H2+i`QMB+ _2bQm`+2 JM;2K2Mi avbi2K Bb #b2/ QM i?2 PMiQqBFB
TTHB+iBQM 6`K2rQ`FX Ai T`QpB/2b i?2 #bB+ 7mM+iBQMHBiv 7Q` KM;BM; `2@
bQm`+2b BX2X +`2iBM;- 2/BiBM;- [m2`vBM;- pBbmHBxBM;- HBMFBM; M/ 2tTQ`iBM; _.6f
PqG FMQrH2/;2 #b2bX qBi? i?2 GBMF2/ .i a2`p2` M/ GBMF2/ .i q`T@
T2`fAKTQ`i2` +QKTQM2Mib- Bi +M Tm#HBb? M/ +QMbmK2 GBMF2/ .i ++Q`/BM;
iQ i?2 `mH2b (e)X M TTHB+iBQM S`Q;`KKBM; AMi2`7+2 HHQrb i?2 /2p2HQTK2Mi
Q7 TQr2`7mH i?B`/ T`iv 2ti2MbBQMbX
+Q`2 T`i Q7 Qm` TTHB+iBQM Bb iQ mb2 M 2tT`2bbBp2 /i KQ/2H b T@
THB+iBQM T`Q}H2 r?B+? +QKT`Bb2b /Bz2`2Mi _.6 M/ PqG pQ+#mH`B2bX h?2
2tT`2bbBp2M2bb Q7 i?2 /i KQ/2H ?2HTb iQ KQp2 /2bB;M /2+BbBQMb 7`QK i?2 T`Q@
;`K +Q/2 iQ i?2 2bBHv /QTi#H2 pQ+#mH`v /2}MBiBQMb- r?B+? BM+`2b2b i?2
~2tB#BHBiv Q7 i?2 r?QH2 bvbi2KX 6Q` 2tT`2bbBM; i?2 /i KQ/2H r2 +QK#BM2 r2HH
FMQrM 2tBbiBM; pQ+#mH`B2b- bm+? b .*JA J2i/i h2`Kb U?iiT,ffTm`HX
Q`;f/+fi2`KbfV- .m#HBM *Q`2 J2i/i 1H2K2Mi a2i U?iiT,ffTm`HXQ`;f/+f
2H2K2MibfRXRfV- h?2 "B#HBQ;`T?B+ PMiQHQ;v U?iiT,ffTm`HXQ`;fQMiQHQ;vf
#B#QfV- +/2KB+ AMbiBimiBQM AMi2`MH ai`m+im`2 PMiQHQ;v U?iiT,ffTm`HXQ`;f
pQ+#fBBbQfb+?2KOV M/ h?2 6`B2M/ Q7 6`B2M/ _.6 pQ+#mH`v U?iiT,
fftKHMbX+QKf7Q7fyXRfV- iQ F22T i?2 /i +QKTiB#H2 iQ H`2/v 2tBbiBM; +QK@
TQM2MibX qBi? JaG- r2 7m`i?2` BMi`Q/m+2 i?2 oQ+#mH`v 7Q` GB#``v 1_J
310
U"A"_J- ?iiT,ffpQ+#Xm#XmMB@H2BTxB;X/2f#B#`KfVX Ai T`QpB/2b i2`Kb 7Q`
2tT`2bbBM; HB+2MbBM; M/ ++2bb KQ/2Hb M/ Bb HB;M2/ iQ i?2 B/2b Q7 i?2 1H2+@
i`QMB+ _2bQm`+2 JM;2K2Mi AMBiBiBp2 U1_JAV (9)X A7 M2r `2[mB`2K2Mib `Bb2 BM
i?2 7mim`2- i?2 /i KQ/2H ?b iQ #2 +?M;2/ M/ B7 M2+2bb`v M2r pQ+#mH`v
i2`Kb +M #2 //2/X
We have developed data templates to provide a user interface for resource creation to the domain experts. Since our system does not require technical RDF knowledge of its users, the data templates provide a form-based editing interface, which further supports the created resources to be compliant with the defined application profile. The template definition itself is expressed in RDF as well, to achieve the required extensibility without a need to change the software.
To support the work-flows for managing meta-data coming with electronic resources (such as contacts, contracts, packages, agreements and licenses), special import and integration components are developed. Publicly available Linked Data and SPARQL services, which have evolved in the library domain in the past years (e.g. title information, ISSN history and authority files), are necessary for the electronic resource management and are imported using the existing Linked Data import process.
Being able to restore changes made on triples is one of the existing features of OntoWiki. In the AMSL project further requirements for reproducibility, consistency and increased clarity of resource changes were formulated. To meet these requirements the present versioning system was extended towards a ChangeSet ontology4. The versioning metadata is stored in a SQL database containing the ChangeSet elements. Hence, the underlying data model is now capable of expressing the same amount of information as ChangeSet. Additions and removals concerning the same triple are aggregated to a change statement. Even though the versioning metadata is not stored in the form of triples, the information can be queried and transferred into a ChangeSet knowledge base for further purposes. Moreover, multiple retrieval capabilities for querying extended versioning information have been added in the form of an OntoWiki extension.
The present search function of OntoWiki is based on a conventional SPARQL search, using a bif:contains filter on labels. To improve search speed and provide additional features like fuzzy search, the ElasticSearch search engine was integrated as an OntoWiki extension. This full-text search makes use of a class-based index structure. Hence, it provides faster access to pre-indexed resources. The underlying index structure is built up by indexing classes of the knowledge base which contain information that needs to be accessed frequently. That is, classes containing properties such as dc:title are more meaningful to be searched with full-text search than classes that only include properties like bibo:issn, which contain numerical sequences. In addition to the auto-completion features of the search function, an extended search supports an enhanced fuzzy search which provides the possibility of restricting the result set to the previously defined classes and is more robust against typing mistakes.
4
http://vocab.org/changeset/schema.html
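A minimal sketch of such a class-based full-text index, assuming a local Elasticsearch instance reachable over its standard REST interface (the actual AMSL integration is an OntoWiki/PHP extension; the index and field names here are made up):

import requests
from rdflib import Graph
from rdflib.namespace import DCTERMS

ES = "http://localhost:9200/amsl-titles"      # assumed local Elasticsearch index

def index_titles(g: Graph):
    # index only "meaningful" properties such as dcterms:title, keyed by subject URI
    for i, (s, _, title) in enumerate(g.triples((None, DCTERMS.title, None))):
        requests.put(f"{ES}/_doc/{i}", json={"uri": str(s), "title": str(title)})

def fuzzy_title_search(term: str):
    # fuzzy matching makes the search robust against typing mistakes
    query = {"query": {"fuzzy": {"title": {"value": term}}}}
    return requests.post(f"{ES}/_search", json=query).json()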
311
4 Conclusion and Future Work
We demonstrate a novel approach to build up an electronic resource management system for libraries by using generic RDF resource management technology. The used generic components are extended and complemented by some domain-adaptable components to provide a customized interface to domain experts. Consequently, using Linked Data and RDF further enables and supports libraries to build up a Linked Data infrastructure for the exchange of metadata across libraries as well as across institutions, using the services provided by a library.
5 Acknowledgments
The presented software system was developed in the AMSL project for developing an Electronic Resource Management System based on Linked Data technology (http://amsl.technology/). We want to thank our colleagues Lydia Unterdörfel, Carsten Krahl and Björn Muschall from Leipzig University Library, Jens Mittelbach from the Saxon State and University Library Dresden (SLUB) and our fellows from the Agile Knowledge Engineering and Semantic Web (AKSW) research group, especially Henri Knochenhauer, for their support, helpful comments and inspiring discussions. This work was supported by the European Union and the Free State of Saxony by a grant from the European Regional Development Fund (ERDF) for project number 100151134 (SAB index).
References
1. Nareike, A., Arndt, N., Radtke, N., Nuck, S., Seige, L., Riechert, T.: amsl – managing electronic resources for libraries based on semantic web. In: Workshop on Data Management and Electronic Resource Management in Libraries (DERM 2014): INFORMATIK 2014. (September 2014)
2. Heino, N., Dietzold, S., Martin, M., Auer, S.: Developing semantic web applications with the OntoWiki framework. In Pellegrini, T., Auer, S., Tochtermann, K., Schaffert, S., eds.: Networked Knowledge - Networked Media. Volume 221 of Studies in Computational Intelligence. Springer, Berlin / Heidelberg (2009) 61–77
3. Frischmuth, P., Martin, M., Tramp, S., Riechert, T., Auer, S.: OntoWiki - An Authoring, Publication and Visualization Interface for the Data Web. Semantic Web Journal (2014)
4. Jewell, T.D., Anderson, I., Chandler, A., Farb, S.E., Parker, K., Riggio, A., Robertson, N.D.M.: Electronic resource management – report of the DLF ERM initiative. Technical report, Digital Library Federation, Washington, D.C. (2004) http://old.diglib.org/pubs/dlf102/.
5. Jewell, T., Aipperspach, J., Anderson, I., England, D., Kasprowski, R., McQuillan, B., McGeary, T., Riggio, A.: Making good on the promise of ERM: standards and best practices discussion paper. Technical report, NISO ERM Data Standards and Best Practices Review Steering Committee, One North Charles Street, Suite 1905, Baltimore, MD 21201 (January 2012) http://www.niso.org/apps/group_public/document.php?document_id=7946&wg_abbrev=ermreview.
6. Berners-Lee, T.: Linked Data. Design issues, W3C (June 2009) http://www.w3.org/DesignIssues/LinkedData.html.
312
Extending an ontology alignment system with
BioPortal: a preliminary analysis⋆
Xi Chen1 , Weiguo Xia1 , Ernesto Jiménez-Ruiz2 , Valerie Cross1
1
Miami University, Oxford, OH 45056, United States
2
University of Oxford, United Kingdom
1 Introduction
Ontology alignment (OA) systems developed over the past decade produced alignments
by using lexical, structural and logical similarity measures between concepts in two dif-
ferent ontologies. To improve the OA process, string-based matchers were extended to
look up synonyms for source and target concepts in background or external knowledge
sources such as general purpose lexicons, for example, WordNet.3 Other OA systems
such as SAMBO [8] and ASMOV [6] applied this approach but with specialized back-
ground knowledge, i.e. the UMLS Metathesaurus,4 for the anatomy track of the Ontol-
ogy Alignment Evaluation Initiative5 (OAEI). Then a composition-based approach was
proposed to use background knowledge sources such as Uberon6 and the Foundational
Model of Anatomy7 (FMA) as intermediate ontologies [5] for the anatomy track. Here
source concepts and target concepts are first mapped to the intermediate background
ontology. If source and target concepts map to an exact match in the intermediate on-
tology, a mapping can be made between them. Other OA systems also followed with a
composition-based approach using Uberon [1, 2].
One issue on the use of background knowledge sources is determining the best
knowledge source on which to use these various alignment techniques. Previous OA
systems using specialized knowledge sources have pre-selected specific biomedical on-
tologies such as Uberon for the anatomy track.
As a coordinated community effort, BioPortal [3, 4] provides access to more than 370 biomedical ontologies, synonyms, and mappings between ontology entities via a set of REST services.8 By tapping into this resource, an OA system has access to the full range of these ontologies, including Uberon and many of the ontologies integrated in the UMLS Metathesaurus. Since BioPortal has not been exploited in the context of the OAEI, this paper examines two practical uses of BioPortal as a generalized yet also specialized background knowledge source for the biomedical domain. We provide a preliminary investigation of the results of these two uses of BioPortal in the OAEI’s
anatomy track using the LogMap system [7].
⋆
This research was financed by the Optique project with grant agreement FP7-318338
3
http://wordnet.princeton.edu/
4
http://www.nlm.nih.gov/research/umls
5
http://oaei.ontologymatching.org
6
http://obophenotype.github.io/anatomy/
7
http://sig.biostr.washington.edu/projects/fm/AboutFM.html
8
http://data.bioontology.org/documentation
313
Algorithm 1 Algorithm to assess border-line mappings using BioPortal
Input: m = ⟨e1, e2⟩: mapping to assess; τ1, τ2: thresholds; Output: true/false
1: Extract set of similar entities E1 from BioPortal for entity e1
2: Extract set of similar entities E2 from BioPortal for entity e2
3: if E1 ≠ ∅ and E2 ≠ ∅ then
4:   if JaccardIndex(E1, E2) > τ1 then
5:     return true
6:   Extract mappings M1 from BioPortal for entities in E1
7:   Extract mappings M2 from BioPortal for entities in E2
8:   if M1 ≠ ∅ and M2 ≠ ∅ and JaccardIndex(M1, M2) > τ2 then
9:     return true
10: return false
2 BioPortal as an Oracle
Over the last few years, OA systems have made only minor improvements based on
alignment performance measures of precision, recall, and F-score. This experience pro-
vides evidence that a performance upper bound is being reached using OA systems
which are completely automatic. To increase their performance, some OA systems (e.g.
LogMap) have included a semi-automatic matching approach which incorporates user
interaction to assess borderline alignments (i.e. non “clear cut” cases with respect to
their confidence values). For example, LogMap identifies 250 borderline mappings in
the OAEI’s anatomy track when its interactive mode is active.
The research presented in this paper investigates replacing the human expert with
an automated expert or “oracle” that relies on specialized knowledge sources in the
biomedical domain. BioPortal provides access to different resources including a wide variety of ontologies, classes within ontologies and mappings between the classes of different ontologies. For example, BioPortal allows searching for ontology classes
whose labels have an exact match with a given term. The oracle can use this capabil-
ity to assist in determining whether a borderline mapping produced by an OA system
should be included in the final alignment output or not (i.e. increasing its confidence).
Algorithm 1 shows the implemented method to assess a given mapping m between entities e1 and e2 using BioPortal as an oracle.
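A compact Python rendering of Algorithm 1; the two fetch functions stand in for the BioPortal REST calls (retrieving similar entities and mappings) and are assumptions, not actual client code.

def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def assess_mapping(e1, e2, tau1, tau2, get_similar, get_mappings) -> bool:
    """Return True if BioPortal evidence supports keeping the borderline mapping (e1, e2)."""
    E1, E2 = get_similar(e1), get_similar(e2)          # lines 1-2
    if E1 and E2:                                      # line 3
        if jaccard(E1, E2) > tau1:                     # line 4
            return True
        M1, M2 = get_mappings(E1), get_mappings(E2)    # lines 6-7
        if M1 and M2 and jaccard(M1, M2) > tau2:       # line 8
            return True
    return False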
3 BioPortal as a Mediating Ontology Provider
Mediating ontologies are typically pre-selected specifically for the OA task. For example, the top systems in the OAEI’s anatomy track used Uberon as a (pre-selected) mediating ontology [5, 1, 2]. Limited research, however, has addressed the challenge of au-
tomatically selecting an appropriate mediating ontology as background knowledge [10,
9]. This research investigates using BioPortal as a (dynamic) provider of mediating
ontologies instead of relying on a few preselected ontologies.
Unlike [10] and [9], due to the large number of ontologies available in BioPortal, we have followed a fast-selection approach to identify a suitable set of mediating
314
Algorithm 2 Algorithm to identify mediating ontologies from BioPortal
Input: O1, O2: input ontologies; LM: a lexical matcher; N: stop condition
Output: Top-5 (candidate) mediating ontologies MO
1: Compute exact mappings M between O1 and O2 using the lexical matcher LM
2: Extract representative entity labels S from M
3: for each label ∈ S
4:   Get ontologies from BioPortal that contain an entity with label label (search call)
5:   Add to MO the ontologies that provide synonyms for label (record positive hits I)
6:   Record number of synonyms (II)
7:   Record ontology information: # of classes (III), depth (IV) and DL expressiveness (V)
8: stop condition: if after N calls to BioPortal MO did not change then stop iteration
9: return Top-5 ontologies from MO according to the number of positive hits and synonyms
Table 1: Top 5 mediating (BioPortal) ontologies for the OAEI’s anatomy track
# Ontology % pos. hits (I) Avg. # syn. (II) # classes (III) Depth (IV) DL exp. (V)
1 SNOMED CT 60% 5.1 401,200 28 ALER
2 UBERON 63% 3.3 12,091 28 SRIQ
3 MeSH 34% 5.0 242,262 16 AL
4 EFO 16% 5.1 14,253 14 SROIF
5 CL (Cell Onto.) 22% 3.3 5,534 19 SH
ontologies from BioPortal (see Algorithm 2⁹). The fast-selection approach identifies entity labels that appear in the input ontologies and searches to find ontologies in BioPortal that include those labels and contain synonyms for them. The algorithm stops if the number of identified mediating ontologies does not change after a specified number N of (search) calls to BioPortal or when there are no more labels to check.
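The fast-selection loop can be sketched as follows; search_bioportal is an assumed wrapper that, for a given label, yields the BioPortal ontologies containing an entity with that label together with the synonyms they provide.

def select_mediating_ontologies(labels, search_bioportal, n_stop=25, top_k=5):
    hits = {}          # ontology -> [positive hits (I), total synonyms (II)]
    unchanged = 0
    for label in labels:
        before = {k: tuple(v) for k, v in hits.items()}
        for ontology, synonyms in search_bioportal(label):    # one search call per label
            if synonyms:                                      # record a positive hit and its synonyms
                entry = hits.setdefault(ontology, [0, 0])
                entry[0] += 1
                entry[1] += len(synonyms)
        current = {k: tuple(v) for k, v in hits.items()}
        unchanged = unchanged + 1 if current == before else 0
        if unchanged >= n_stop:                               # stop condition
            break
    ranked = sorted(hits.items(), key=lambda kv: (kv[1][0], kv[1][1]), reverse=True)
    return [ontology for ontology, _ in ranked[:top_k]]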
Table 1 shows the identified top-5 mediating ontologies for the OAEI’s anatomy
track (with N=25 as stop condition). The ranking is based on the number of labels
(i.e. search calls to BioPortal) for which an ontology is able to provide synonyms
(positive hits, I) and the average number of provided synonyms per positive hit (II).
Additionally, information about the ontology is also given (III–V).
4 Preliminary evaluation
We have conducted a preliminary evaluation of the use of BioPortal as a background knowledge provider (i.e. as an oracle and as a mediating ontology provider) in the OAEI’s anatomy track and with LogMap as the OA system. For this purpose, we have extended LogMap’s matching process to (i) use Algorithm 1 as an oracle within its interactive mode (see Figure 3 in [7]); and (ii) use a mediating ontology MO as in Algorithm 3.
The results10 are summarized in Table 2. The last column shows the original scores produced by LogMap (without BioPortal). As expected, the best results in terms of
9
In the near future, we plan to combine this algorithm with the ontology recommender provided by BioPortal: https://bioportal.bioontology.org/recommender
10
SNOMED and MeSH have been discarded as mediating ontologies. SNOMED is not available for download, and we were unable to download MeSH due to a time-out given by BioPortal.
315
Algorithm 3 Use of a mediating ontology with LogMap
Input: O1 , O2 : input ontologies; MO: mediating ontology; Output: M: output mappings;
1: M1 := LogMap(O1 , MO)
2: M2 := LogMap(MO, O2 )
3: MC := ComposeMappings(M1 , M2 )
4: M := LogMap(O1 , O2 , MC )
5: return M
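The mapping composition in step 3 of Algorithm 3 amounts to joining the two anchor sets on the mediating ontology's entities; a sketch (treating mappings as plain entity pairs and ignoring confidence values):

def compose_mappings(m1, m2):
    # m1: pairs (a, x) from O1 to MO; m2: pairs (x, b) from MO to O2
    by_anchor = {}
    for x, b in m2:
        by_anchor.setdefault(x, set()).add(b)
    return {(a, b) for a, x in m1 for b in by_anchor.get(x, ())}

In Algorithm 3 the composed set MC is then passed to LogMap together with O1 and O2 to produce the final output mappings.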
Table 2: Results of LogMap with/without BioPortal as background knowledge
Score     | Oracle | MO Uberon | MO CL | MO EFO | LogMap (no BioPortal)
Precision | 0.915  | 0.899     | 0.907 | 0.914  | 0.913
Recall    | 0.846  | 0.927     | 0.867 | 0.846  | 0.846
F-score   | 0.879  | 0.913     | 0.886 | 0.879  | 0.878
F-score have been obtained using Uberon as the mediating ontology. Using CL as mediator also improves the results with respect to those obtained by LogMap, although the improvement is not as large as with Uberon. There is no significant improvement when using EFO as the mediating ontology. Using BioPortal as an oracle leads to a small increase in precision, but recall remains the same.
This preliminary evaluation has shown the potential of using BioPortal as background knowledge. In the near future we plan to conduct an extensive evaluation involving more challenging datasets (e.g. OAEI’s largebio track) and other OA systems, and combining several mediating ontologies.
References
1. Cruz, I.F., Stroe, C., Caimi, F., Fabiani, A., Pesquita, C., Couto, F.M., Palmonari, M.: Using
AgreementMaker to Align Ontologies for OAEI 2011. In: 6th OM Workshop (2011)
2. Faria, D., Pesquita, C., Santos, E., Palmonari, M., Cruz, I.F., Couto, F.M.: The Agreement-
MakerLight Ontology Matching System. In: OTM Conferences. pp. 527–541 (2013)
3. Fridman Noy, N., Shah, N.H., Whetzel, P.L., Dai, B., et al.: BioPortal: ontologies and inte-
grated data resources at the click of a mouse. Nucleic Acids Research 37, 170–173 (2009)
4. Ghazvinian, A., Noy, N.F., Jonquet, C., Shah, N.H., Musen, M.A.: What four million map-
pings can tell you about two hundred ontologies. In: Int’l Sem. Web Conf. (ISWC) (2009)
5. Gross, A., Hartung, M., Kirsten, T., Rahm, E.: Mapping composition for matching large life
science ontologies. In: 2nd International Conference on Biomedical Ontology (ICBO) (2011)
6. Jean-Mary, Y.R., Shironoshita, E.P., Kabuka, M.R.: Ontology Matching with Semantic Veri-
fication. Journal of Web Semantics 7(3), 235–251 (2009)
7. Jiménez-Ruiz, E., Cuenca Grau, B., Zhou, Y., Horrocks, I.: Large-scale interactive ontology
matching: Algorithms and implementation. In: European Conf. on Art. Int. (ECAI) (2012)
8. Lambrix, P., Tan, H.: A System for Aligning and Merging Biomedical Ontologies. Journal
of Web Semantics 4(3), 196–206 (2006)
9. Quix, C., Roy, P., Kensche, D.: Automatic selection of background knowledge for ontology
matching. In: Proc. of the Int’l Workshop on Sem. Web Inf. Management (2011)
10. Sabou, M., d’Aquin, M., Motta, E.: Exploring the Semantic Web as Background Knowledge
for Ontology Matching. J. Data Semantics 11, 156–190 (2008)
316
How much navigable is the Web of Linked
Data?
Valeria Fionda1 , Enrico Malizia2
1
Department of Mathematics, University of Calabria, Italy
2
DIMES, University of Calabria, Italy
Abstract. Millions of RDF links connect data providers on the Web of
Linked Data. Here, navigability is a key issue. This poster provides a
preliminary quantitative analysis of this fundamental feature.
1 Motivation
Linked Data are published on the Web following the Linked Data principles [2].
One of them states that RDF links must be used to allow clients to navigate the
Web of Linked Data from dataset to dataset. In particular, RDF links allow: (i)
data publishers to connect the data they provide to data already on the Web; and
(ii) clients to discover new knowledge by traversing links and retrieving data.
Hence, navigability is a key feature. The Web of Linked Data is growing rapidly
and both the set of data providers and the structure of RDF links continuously
evolve. Is this growth taking place preserving the basic navigability principle?
In this poster we try to answer this question by analyzing the pay-level do-
main (PLD) networks extracted from the last three Billion Triple Challenge
datasets. In addition, we also analyze the sameAs network obtained by consid-
ering only owl:sameAs links. Some recent works analyzed sameAs networks [1,
3] to provide some statistics on the deployment status and use of owl:sameAs
links [3] and to evaluate their quality [1]. However, to the best of our knowledge,
this is the first attempt to use the PLD and sameAs networks to perform a
quantitative analysis of the navigability of the Web of Linked Data.
2 Methodology
Navigability indices. We model the Web of Linked Data as a directed graph
G = ⟨V, E⟩, where V = {v1, ..., vn} is the set of vertices and E ⊆ V × V is the set of edges. The vertices of G represent the pay-level domains that identify data publishers in the Web of Linked Data. The edges of G represent directed links between different pay-level domains (i.e., there are no loops) and are ordered
V. Fionda’s work was supported by the European Commission, the European Social
Fund and the Calabria region. E. Malizia’s work was supported by the ERC grant
246858 (DIADEM), while he was visiting the Department of Computer Science of the
University of Oxford.
317
pairs of vertices (vi , vj ), where vi is the source vertex and vj is the target one.
Intuitively, an edge (vi , vj ) models the fact that there is at least one URI having
PLD vi by dereferencing which an RDF link to a URI having PLD vj is obtained.
We denote by vi ⇝G vj the existence of a path from vi to vj in G (otherwise we write vi ⇝̸G vj). For a graph G = ⟨V, E⟩, G* = ⟨V, E*⟩ is the closure of G, where (vi, vj) ∈ E* if and only if vi ≠ vj and vi ⇝G vj. We define the reachability matrix RG ∈ {0, 1}^(n×n) such that RG[i, j] = 1 if and only if (vi, vj) ∈ E* (i.e., vi ⇝G vj). Moreover, we define the distance matrix DG ∈ N^(n×n) such that DG[i, j] is the length of the shortest path between vi and vj (DG[i, j] = ∞ if vi ⇝̸G vj). When G is understood we simply write vi ⇝ vj, R[i, j], and D[i, j].
To evaluate the navigability of the Web of Linked Data we use two indices.
The first one is the reachability index η(G), corresponding to the edge density of G*. The reachability index is the probability that between any two given vertices of G there exists a path. In particular, η(G) = (1/(n(n−1))) Σ_{vi,vj ∈ V, vi ≠ vj} R[i, j] = |E*|/(n(n−1)). This index takes into account only the reachability between vertices and implies that η(G1) = η(G2) for any pair of graphs G1 = ⟨V, E1⟩ and G2 = ⟨V, E2⟩ such that G1* = G2*, even if E1 ⊂ E2 (or E2 ⊂ E1).
To take into account differences in graph topologies, we use the efficiency index η̃(G) [4]. This index exploits the distance matrix D to weight the reachability by the (inverse of the) length of the shortest path between vertices. Given a graph G, η̃(G) = (1/(n(n−1))) Σ_{vi,vj ∈ V, vi ≠ vj} R[i, j]/D[i, j], where R[i, j]/D[i, j] = 0 when vi ⇝̸ vj. It can be shown that for any graph G, η̃(G) ≤ η(G), and given two graphs G1 = ⟨V, E1⟩ and G2 = ⟨V, E2⟩ such that E1 ⊂ E2, then η(G1) ≤ η(G2), while η̃(G1) < η̃(G2). The index η̃(·) has been used in the literature to measure how efficiently small-world networks exchange information [4].
Intuitively, the closer η(G) is to 1, the more G is similar to a strongly connected graph; on the other hand, the closer η̃(G) is to 1, the more G is similar to a complete graph. Note that η̃ combines information on reachability with information about the distances between pairs of vertices, and it is not simply the arithmetic mean of the inverse of the shortest path lengths.
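A small self-contained sketch of how the two indices can be computed for a directed graph given as an adjacency dictionary (plain BFS; this makes no claim about the implementation actually used by the authors):

from collections import deque

def bfs_distances(adj, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def navigability_indices(adj):
    nodes = set(adj) | {v for targets in adj.values() for v in targets}
    n = len(nodes)
    if n < 2:
        return 0.0, 0.0
    reachable_pairs = 0
    efficiency_sum = 0.0
    for u in nodes:
        for v, d in bfs_distances(adj, u).items():
            if v != u:
                reachable_pairs += 1        # R[i, j] = 1
                efficiency_sum += 1.0 / d   # R[i, j] / D[i, j]
    denom = n * (n - 1)
    return reachable_pairs / denom, efficiency_sum / denom   # eta, eta-tilde

# tiny example: a directed 3-cycle is strongly connected (eta = 1) but far from complete (eta-tilde = 0.75)
print(navigability_indices({"a": ["b"], "b": ["c"], "c": ["a"]}))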
Datasets. To perform our analysis we used the Billion Triple Challenge (BTC)
datasets of 2010, 2011 and 2012³. Unfortunately, the BTC dataset for 2013 was not crawled and a dataset for 2014 was not available at the date of submission.
We decided to use the pay-level domain (PLD) granularity to build our networks,
where the PLD of a URI is a sub-domain (generally one level below) of a generic
public top-level domain, for which users usually pay. PLDs in the Web
of Linked Data are often in one-to-one correspondence with Linked Open Data
datasets. We extracted the PLD network from each BTC dataset by considering
each RDF quad and adding an edge between the PLD of the context URI and the
PLD of the subject and object. In particular, we extracted two PLD networks:
the first one (denoted by All) considers all types of links and the second one
(denoted by SA) considers only owl:sameAs links.
3
http://km.aifb.kit.edu/projects/btc-X/, X={2010,2011,2012}
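A simplified sketch of this network construction from an N-Quads file; the PLD function is a crude approximation (a faithful implementation would consult the Public Suffix List) and the parser assumes well-formed quad lines.

from urllib.parse import urlparse

OWL_SAMEAS = "<http://www.w3.org/2002/07/owl#sameAs>"

def pld(uri):
    # crude: keep the last two DNS labels; a real implementation should use
    # the Public Suffix List (e.g. via the tldextract package)
    host = urlparse(uri).hostname or ""
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def pld_edges(nquad_lines, only_sameas=False):
    edges = set()
    for line in nquad_lines:
        tokens = line.split()
        if len(tokens) < 5 or not tokens[0].startswith("<"):
            continue                                   # skip malformed or context-less lines
        subj, pred, obj, ctx = tokens[0], tokens[1], tokens[2], tokens[-2]
        if only_sameas and pred != OWL_SAMEAS:         # the SA network keeps only owl:sameAs links
            continue
        source = pld(ctx.strip("<>"))                  # PLD of the context (dereferenced) URI
        for term in (subj, obj):
            if term.startswith("<"):                   # ignore literals and blank nodes
                target = pld(term.strip("<>"))
                if source and target and source != target:   # no loops
                    edges.add((source, target))
    return edges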
318
Fig. 1. The largest internal connected component of the PLD sameAs network ex-
tracted from the BTC2012 dataset.
Since BTC datasets are obtained by crawling the LOD cloud starting from
a set of seed URIs, we also extracted from each network the largest internal
connected component obtained by ignoring the PLDs “on the border” (i.e., those
without any outgoing edge) that are probably those where the crawling stopped.
We denote by All-I and SA-I the internal subnetworks extracted from All and
SA, respectively. Fig. 1 shows the SA-I network of the BTC 2012 dataset.
3 Evaluation
Table 1 reports our results. The table shows that η and η̃ on the complete network, for both All and SA, decrease in 2011 with respect to 2010 and are still decreasing for the All network even in 2012. Moreover, the values obtained for both η and η̃ are very small. For example, η(All) = 4.899·10⁻⁴ for the BTC 2012 dataset means that given a random pair of PLDs from the All network the probability that they are connected by a path is less than 0.5‰. Translated to the Web, this means that starting from a given URI and following RDF links only a very small portion of the Web of Linked Data can be reached. However, an explanation for such a behavior could be that the BTC datasets are obtained by crawling the Web and it is reasonable to think that PLDs at the “border” of the network are those at which the crawling process stopped. If this is the case, some links can actually be missing and our indices can be biased. In general, a decrease over time of η and η̃ on the full Web of Linked Data highlights a
decrease in its navigability. Nevertheless, since in our case the BTC datasets
are used as representative samples of the full Web of Linked Data, this decrease
can be related to the fact that in 2012 and 2011 the crawler retrieved triples
spanning more data providers than in 2010 and a large portion of them is on the
319
Network | η 2010 | η 2011 | η 2012     | η̃ 2010 | η̃ 2011    | η̃ 2012
All     | 0.034  | 0.002  | 4.899·10⁻⁴ | 0.009   | 5.98·10⁻⁴  | 1.719·10⁻⁴
SA      | 0.134  | 0.018  | 0.059      | 0.037   | 0.005      | 0.016
All-I   | 0.312  | 0.887  | 0.867      | 0.089   | 0.319      | 0.326
SA-I    | 0.400  | 0.658  | 0.497      | 0.131   | 0.223      | 0.187
Table 1. Summary of the analysis carried out.
border. Indeed, in general, both η and η̃ decrease if the proportion of the nodes on the border increases with respect to the total size of the network.
For this reason we decided to analyze the internal largest connected components All-I and SA-I. As for All-I, on one hand it can be observed that the difference in the values of both indices between 2011 and 2012 is negligible. There
is, on the other hand, a big increase in both indices for the 2011 network with
respect to the 2010 one. It is evident, from these results, that the Web of Linked
Data gained a lot in navigability from 2010 to 2011, according to the All-I sam-
ple, while the navigability remained almost unchanged in 2012. A similar trend
can be identified also on the SA-I network, apart from a noticeable decrease in
the navigability of the 2012 network compared to the 2011 one. Our results show
that, for example, in 2012 given a random pair of PLDs from the All-I network
the probability that they are connected by a path is greater than 86% with an
efficiency of 0.326. It is worth pointing out that, in a distributed environment such as the Web, efficiency is a fundamental property that, besides reachability, is related to the number of dereferences that must be performed to move from a source PLD to a target one. Roughly speaking, lower values of efficiency for the same value of reachability translate into more traffic on the network.
4 Conclusions
Navigability is a key feature of the Web of Linked Data. We introduced some
indices to quantitatively measure the connectivity of Linked Data providers and
the efficiency of their connections. The results obtained show that, as hoped, the
navigability of the Web of Linked Data is increasing with its growth. However, in
order not to be biased toward a certain interpretation, it is important to stress
that the results obtained could have been influenced by the crawling strategy
used to build the BTC datasets used in our analysis. We plan to perform our
analysis in the following years to monitor and hopefully confirm this trend.
References
1. G. Bartolomeo and S. Salsano. A spectrometry of linked data. In LDOW, 2012.
2. T. Berners-Lee. Linked data design issues, 2006.
3. L. Ding, J. Shinavier, Z. Shangguan, and D. McGuinness. SameAs Networks and
Beyond: Analyzing Deployment Status and Implications of owl:sameAs in Linked
Data. In ISWC, 2010.
4. V. Latora and M. Marchiori. Efficient behavior of small-world networks. Phys. Rev.
Lett., 87(19):198701, 2001.
320
A Framework for Incremental Maintenance of
RDF Views of Relational Data
Vânia M. P. Vidal1 , Marco A. Casanova2 , José M. Monteiro1 , Narciso Arruda1 ,
Diego Sá1 , and Valéria M. Pequeno3
1
Federal University of Ceará, Fortaleza, CE, Brazil
{vvidal, jmmfilho, narciso, diego}@lia.ufc.br
2
Pontifical Catholic University of Rio de Janeiro, RJ, Brazil
casanova@inf.puc-rio.br
3
DMIR, INESC-ID Porto Salvo, Portugal
vmp@inesc-id.pt
Abstract. A general and flexible way to publish relational data in RDF
format is to create RDF views of the underlying relational data. In this
paper, we demonstrate a framework, based on rules, for the incremen-
tal maintenance of RDF views defined on top of relational data. We
also demonstrate a tool that automatically generates, based on the map-
ping between the relational schema and a target ontology, the RDF view
exported from the relational data source and all rules required for the
incremental maintenance of the RDF view.
Keywords: RDF View Maintenance, RDB-to-RDF, Linked Data
1 Introduction
The Linked Data initiative [1] promotes the publication of previously isolated
databases as interlinked RDF triple sets, thereby creating a global scale datas-
pace, known as the Web of Data. However, the full potential of linked data
depends on how easy it is to publish data stored in relational databases (RDBs)
in RDF format. This process is often called RDB-to-RDF.
A general way to publish relational data in RDF format is to create RDF
views of the relational data. The contents of views can be materialized to improve
query performance and data availability. However, to be useful, a materialized
view must be continuously maintained to reflect dynamic source updates.
In this demo, we show a framework, based on rules, for the incremental
maintenance of external RDF views defined on top of relational data. Figure 1
depicts the main components of the framework. Briefly, the administrator of a relational database should create RDF views and define a set of rules using Rubya (Rules by assertion) - Figure 1(a). These rules are responsible
for: (i) computing the view maintenance statements necessary to maintain a
materialized view V with respect to base updates; and (ii) sending the view
maintenance statements to the view controller of V - Figure 1(b). The rules can
be implemented using triggers. Hence, no middleware system is required. The
321
view controller for the RDF view has the following functionality: (i) receives the
view maintenance updates from the RDB server and (ii) applies the updates to
the view accordingly.
Fig. 1. Suggested Framework.
Our approach is very effective for an externally maintained view because:
the view maintenance rules are defined at view definition time; no access to
the materialized view is required to compute the view maintenance statements
propagated by the rules; and the application of the view maintenance statements
by the view controller does not require any additional queries over the data
source to maintain the view. This is important when the view is maintained
externally [4], because accessing a remote data source may be too slow.
The use of rules is therefore an effective solution for the incremental mainte-
nance of external views. However, creating rules that correctly maintain an RDF
view can be a complex process, which calls for tools that automate the rule gen-
eration process. In Section 2, we further detail the Rubya tool that, based on
the mapping between the relational schema and a target ontology, automatically
generates the RDF view exported from the relational data source and the set of
rules required for the incremental maintenance of the RDF view.
The demo video is available at http://tiny.cc/rubya. First, the video shows,
with the help of a real-world application, the process of defining the RDF view
and generating the maintenance rules with Rubya. Then, it shows some practical
examples of using the rules for incremental maintenance of a materialized RDF
view. For more information see http://www.arida.ufc.br/ivmf/.
2 Generating Rules with Rubya
Figure 1 highlights the main components of Rubya. The process of defining the
RDF view and generating the maintenance rules with Rubya consists of three
steps:
STEP 1 (Mapping specification): Using the correspondence assertions ed-
itor of Rubya, the user loads the source and target schema and then he can
322
draw correspondence assertions (CAs) to specify the mapping between the tar-
get RDF schema and the source relational schema. The demo video shows how
the CA Editor helps the user graphically to define CAs.
A CA can be: (i) a class correspondence assertion (CCA), which matches
a class and a relation schema; (ii) an object property correspondence assertion
(OCA), which matches an object property with paths (list of foreign keys) of a
relation schema; or (iii) a datatype property correspondence assertion (DCA),
which matches a datatype property with attributes or paths of a relation schema.
CAs have a simple syntax and semantics and yet suffice to capture most of the
subtleties of mapping relational schemas into RDF schemas. Figure 2 shows some
examples of correspondence assertions between the relational schema ISWC REL
and the ontology CONF OWL. CCA1 matches the class foaf:Person with the
relation Persons. We refer the reader to [4, 5] for the details and motivation of
the mapping formalism.
Fig. 2. CONF OWL and ISWC REL schemas and some examples of CAs.
STEP 2 (RDF view creation): The GRVS module automatically generates
the RDF view schema, which is induced by the correspondence assertions defined
in Step 1. The vocabulary of the RDF view schema contains all the elements of
the target RDF schema that match an element of the source relational schema.
STEP 3 (Rule generation): The GVMR module automatically generates the
set of rules required to maintain the RDF view defined in Step 2. The process
of generating the rules for a view V consists of the following steps: (a) Obtain,
based on the CAs of V, the set of all relations in the relational schema that
are relevant to V. (b) For each relation R that is relevant to V, three rules are
generated to account for insertions, deletions and updates on R.
Two procedures, GVU INSERTonR and GVU DELETEonR, are automati-
cally generated, at view definition time, based on the CAs of V that are relevant
to R. Note that an update is treated as a deletion followed by an insertion, as
usual. GVU INSERTonR takes as input a tuple rnew inserted in R and returns
the updates necessary to maintain the view V. GVU DELETEonR takes as in-
put a tuple rold deleted from R and returns the updates necessary to maintain
323
the view V. In [4], we present the algorithms that compile GVU INSERTonR
and GVU DELETEonR based on the CAs of V that are relevant to R.
Once the rules are created, they are used to incrementally maintain the ma-
terialized RDF view. For example, Figure 3 shows the process to update an RDF view when an insertion occurs on Papers: a corresponding trigger is fired. The trigger computes the view maintenance statements U and sends them to the view controller. The view controller computes the view updates U* and applies them to the view state.
Fig. 3. Using the rules generated by Rubya when an insertion occurs on Papers.
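A minimal sketch of the view-controller side, assuming the maintenance statements arrive as simple (operation, triple) pairs; the real component and its wire format are not specified at this level of detail in the paper.

from rdflib import Graph

class ViewController:
    """Applies view maintenance statements to a materialized RDF view."""

    def __init__(self):
        self.view = Graph()

    def apply(self, statements):
        # statements: iterable of ("insert" | "delete", (s, p, o)) as produced,
        # conceptually, by the GVU_INSERTonR / GVU_DELETEonR procedures
        for operation, triple in statements:
            if operation == "insert":
                self.view.add(triple)
            elif operation == "delete":
                self.view.remove(triple)

An update on a relation R is handled as a deletion followed by an insertion, matching the treatment described above.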
3 Conclusions
In this paper, we present Rubya, a tool for incremental maintenance of external
RDF views defined on top of relational data. There is significant work on reusing
relational data in terms of RDF (see a survey in [3]). Karma [2], for example, is
a tool to semi-automatically create mappings from a source to a target ontology.
In our tool, the user defines mappings between a source and a target ontology
using a GUI. The novelty of our proposal is that we generate rules to maintain
the RDF views.
References
1. Berners-Lee, T.: Linked Data. Design issues, http://www.w3.org/DesignIssues/LinkedData.html
2. Knoblock, C.A., et al.: Semi-automatically Mapping Structured Sources into the
Semantic Web. In: ESWC, pp. 375–390. Springer-Verlag, Berlin, Heidelberg (2012)
3. Spanos, D.E., Stavrou, P., Mitrou, N.: Bringing Relational Databases into the Se-
mantic Web: A Survey. Semantic Web Journal 3(2), 169–209 (2012)
4. Vidal, V.M.P., Casanova, M.A., Cardoso, D.S.: Incremental Maintenance of RDF
Views of Relational Data. In: OTM 2013 Conferences, pp. 572–587. Austria (2013)
5. Vidal, V.M.P., Casanova, M.A., Neto, L.E.T., Monteiro, J.M.: A Semi-Automatic
Approach for Generating Customized R2RML Mappings. In: SAC, pp. 316–322
(2014)
324
Document Relation System Based on Ontologies for the
Security Domain
Janine Hellriegel1, Hans Georg Ziegler1, and Ulrich Meissen1
1
Fraunhofer Institute for Open Communication Systems (FOKUS),
Kaiserin Augusta Allee 31, 10589 Berlin, Germany
{ Janine.Hellriegel, Hans.Georg.Ziegler,
Ulrich.Meissen }@fokus.fraunhofer.de
Abstract. Finding semantic similarity or semantic relatedness between unstruc-
tured text documents is an ongoing research field in the semantic web area. For
larger text corpora, lexical matching – the matching of shared terms – is often applied. Related semantic terms and concepts are not considered in this solution. Also, documents that take heterogeneous perspectives on a domain cannot be related properly. In this paper, we present our ongoing work on a flexible and expandable system that handles text documents with different points of view, languages and levels of detail. The system is presented in the se-
curity domain but could be adapted to other domains. The decision making pro-
cess is transparent and the result is a ranked list.
Keywords: Document Relation, Security, Ontology, Semantic Relatedness
1 Introduction
The amount of available information on the Internet is growing day by day. It is difficult to keep an overview of relevant data in a domain, especially if different kinds of views on the same topic are considered. An expert uses different words and a different level of detail than a normal user, but they describe exactly the same concept. Having a database consisting of documents authored by people with different levels of
expertise, language skills and ambitions imposes a big challenge on a semantic search
algorithm. The usage of long texts as search input enables a wider range of search
terms, which is the foundation to detect a larger spectrum of documents. The relevant
results are documents related to the input query text document. A basic method to
compare two text documents is the vector space model [2], which relates the text similarity to the number of shared words. However, semantically related words are not considered. Knowledge-based similarity measures use larger document corpora and external networks like WordNet or Wikipedia to analyze co-occurrences and relations. An overview of these techniques is presented in [3], but most of the methods only work for a few words as a search query. Although all documents belong to one domain (e.g. the security domain), lexical matching and knowledge-based measures do not retrieve a sufficient number of related documents. Another measure,
325
ontology-based matching, includes concepts and heterogeneous relations. Wang [7] proposes a system to relate documents using the concepts found in WordNet, but the measurement step still depends on words, and heterogeneous concepts cannot be related. In the security and safety domain only specialized ontologies exist [5], [6], which mainly focus on the security of information systems. An attempt to combine different ontologies was made by [1], but it could not express the diversity of the domain, which also covers e.g. the security of citizens, infrastructures or utilities. As the mentioned references show, a system that searches for related text documents in a clear and traceable way has not yet been developed. At the moment no ontology exists that would match the terminology of the whole security domain. Therefore a new, more general ontology as well as a general system are being developed.
2 System of Semantic Related Documents
The foundation for measuring semantic relatedness between two documents is terms. A terminology is built, which is used to compare all documents quickly and determine their relations. The whole system is divided into three steps; Figure 1 displays an overview.
Fig. 1. System overview with three steps to determine document relatedness
A possible scenario is finding related work and potential partners for a project idea. In the first step, predefined keywords are extracted from the project descriptions
and organization profiles as well as the query text containing the project idea to dis-
cover terms that characterize the documents. Each document is now represented with
the detected keywords. In the second step the text documents are classified in the
ontology according to their keywords in order to discover further relations. If the
keyword maritime borders appears in the query document, all relations from this
keyword to others, like border surveillance, are used. The ontology helps to discover
related keywords and therefore related documents. With the help of a weighting algorithm, a ranked list of related documents is produced in the last step.
326
2.1 Preprocessing and Keyword based extraction
In order to extract the valuable terms from the documents, a manually created key-
word list is used and their term frequency for each document is determined. Compar-
ing the occurrence of the keywords gives a first measure for the relatedness. The more
terms the texts have in common, the more related they are. However, different views
and special relations are not yet taken into account. In order to extract the keywords
all documents are preprocessed with a tokenization on term bases. Further, stemming
algorithms are used to transform all terms in the documents as well as all keywords to
their base form. The keyword list was developed by early-warning system experts
together with civil protection and police specialists. It contains about 500 English
words relevant to the security domain, but it could still be modified or extended. Automatic keyword extraction algorithms are not suitable since they produce too much noise and cannot match the quality of the keyword list. All keywords can be translated semi-automatically; therefore the system supports different languages. Synonyms, categories and other semantically relevant words are added by using BabelNet [4]. From the term video surveillance the terms surveillance camera, cctv and video home security system are derived. In total a keyword list with over 4000 terms has been produced. In this way it is ensured that only domain-related terms are found.
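A sketch of this first step for single-word keywords (tokenization, stemming with NLTK's Porter stemmer, term-frequency counting); multi-word terms such as maritime borders would additionally need phrase matching on the token stream.

import re
from collections import Counter
from nltk.stem import PorterStemmer

def keyword_frequencies(text, keywords):
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(t) for t in re.findall(r"[a-z]+", text.lower())]
    counts = Counter(tokens)
    # documents are compared on the stemmed base forms of the predefined keywords
    frequencies = {kw: counts[stemmer.stem(kw.lower())] for kw in keywords}
    return {kw: n for kw, n in frequencies.items() if n > 0}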
2.2 Computing the Document Relation with help of Ontologies
In the case when related documents do not contain identical or similar terms, an ontology or terminological net can be used in order to improve the calculation of the relatedness. The relation between a technical and a user view can only be determined via a shared concept. Using the heterogeneous paths between the terms in the graph-
based knowledge representation, new relations between the documents are revealed.
Not only the distances in the terminological net are considered; the type of relation between the terms, like is-a or part-of, also determines the relatedness of the text documents. In this way, for each detected keyword in the query document, related keywords can be found. Texts containing the related keywords are most likely to correlate with the query document. A new ontology in the security domain is currently being built manually, containing the original 500 keywords, relations from BabelNet
and a taxonomy created by security researchers. The taxonomy is loosely based on a
project categorization for the recent FP 7 Cordis security call [8].
2.3 Weighting and Ranking
A ranked list of texts related to the query document is the result of the system. Two measurements are used to rank the results: the first is the weighting of the original keywords and the second is the type of relation between the keywords. Not all retrieved
terms are equally important to distinguish the texts. The term security is important but
very general and can be found in a lot of documents. Due to the low entropy of the
term, it does not help to find unique relations. In contrast, the term body scanner is
more useful to find related documents. A term weighting is applied with the tf-idf
327
statistic [2] to identify significant terms. The FP7 Cordis security call project descriptions are used as the document corpus. Secondly, the relation between two specific keywords (body scanner and metal detector) is ranked higher than a relation between a
specific keyword and a more general keyword (body scanner and airport security).
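The two ranking signals can be sketched as follows: a standard smoothed tf-idf weight per keyword, plus a bonus for relations that connect two specific keywords rather than a specific and a general one (the concrete weights are illustrative, not taken from the system).

import math

def tfidf(term, query_counts, corpus):
    # query_counts: Counter of keyword frequencies in the query document
    # corpus: list of Counters, one per corpus document (here: FP7 project descriptions)
    df = sum(1 for doc in corpus if term in doc)
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1   # one common smoothed idf variant
    return query_counts[term] * idf

RELATION_BONUS = {                 # illustrative values
    "specific-specific": 1.0,      # e.g. body scanner <-> metal detector
    "specific-general": 0.5,       # e.g. body scanner <-> airport security
}

# a candidate document's score then sums the tf-idf weights of the keywords it shares
# with the query plus the bonuses of the ontology relations that connect them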
3 Conclusion and Future Work
With the presented system, a ranked list of related documents can be retrieved, regardless of what kind of view or level of detail the documents contain. The system describes a
general sequence of functions and could be adapted to other domains if a correspond-
ing keyword list and ontology are available. In the music domain e.g. artist profiles
could be related to genre or instrument descriptions. The system is based on a simple
method but achieves good results because it works close to the domain. In addition, it
allows evaluating the results and understanding why documents are identified as related. The system is still a work in progress; the next steps are to complete the devel-
opment of the ontology and to evaluate the chosen keywords. Further evaluations
concerning the accuracy as well as user satisfaction have to be performed.
References
1. Liu, Shuangyan, Duncan Shaw, and Christopher Brewster: Ontologies for Crisis Manage-
ment: a Review of State of the Art in Ontology Design and Usability. In: Proceedings of
the Information Systems for Crisis Response and Management Conference (2013)
2. Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze: Introduction to In-
formation Retrieval. Vol. 1. Cambridge university press Cambridge (2008)
3. Mihalcea, Rada, Courtney Corley, and Carlo Strapparava: Corpus-based and Knowledge-
based Measures of Text Semantic Similarity. In: AAAI, 6:775–80 (2006)
4. Navigli, Roberto, and Simone Paolo Ponzetto: BabelNet: Building a Very Large Multilin-
gual Semantic Network. In: Proceedings of the 48th Annual Meeting of the Association for
Computational Linguistics, 216–25. Association for Computational Linguistics (2010)
5. Ramanauskaite, Simona, Dmitrij Olifer, Nikolaj Goranin, and Antanas Čenys: Security
Ontology for Adaptive Mapping of Security Standards. In: International Journal of Com-
puters Communications & Control 8, no. 6 (2013)
6. Souag, Amina, Camille Salinesi, and Isabelle Comyn-Wattiau: Ontologies for Security
Requirements: A Literature Survey and Classification. In: Advanced Information Systems
Engineering Workshops, 61–69. Springer (2012)
7. Wang, James Z., and William Taylor: Concept Forest: A New Ontology-assisted Text
Document Similarity Measurement Method. In: Web Intelligence, IEEE/WIC/ACM Inter-
national Conference On, 395–401. IEEE (2007)
8. FP7 Cordis Project, http://cordis.europa.eu/fp7/security/home_en.html
Acknowledgement
This work has received funding from the Federal Ministry of Education and Research
for the security research project “fit4sec” under grant agreement no. 13N12809.
328
Representing Swedish Lexical Resources in RDF with
lemon
Lars Borin1 and Dana Dannélls1 and Markus Forsberg1 and John P. McCrae2
1
Språkbanken, University of Gothenburg
{lars.borin, dana.dannells, markus.forsberg}@svenska.gu.se
2
Cognitive Interaction Technology Center of Excellence, University of Bielefeld
jmccrae@cit-ec.uni-bielefeld.de
Abstract. The paper presents an ongoing project which aims to publish Swedish
lexical-semantic resources using Semantic Web and Linked Data technologies.
In this article, we highlight the practical conversion methods and challenges of converting three of the Swedish language resources into RDF with lemon.
Keywords: Language Technology, Lexical Resources, Lexical Semantics.
1 Introduction
For state-of-the-art lexical resources, availability as Linked Open Data (LOD) is a ba-
sic requirement for their widest possible dissemination and use in research, education,
development of products and services. In addition, in order to provide language sup-
port to individuals requiring augmentative and alternative communication (AAC) we
need linguistic resources suitably organized and represented, e.g., sign language mate-
rial, symbol and image libraries adapted to multiple cognitive levels, as well as textual
support in many languages. So far, in Sweden, these resources have been developed as
separate and uncoordinated efforts, either commercially or by non-profit organizations
targeting specific groups and needs. In the long run, this is an exclusive and expensive
way of proceeding, leading to limited usefulness. In the project, we aim to link Con-
cept Coding Framework (CCF) technology and some symbol sets, to a common LOD
format for language resources (LRs) to be developed together with Språkbanken (The
Swedish Language Bank),3 which will be a great step forward.
There are ongoing international initiatives aiming to define suitable formats for
publishing linguistic content according to linked open data principles [1]. Integrating
linguistic content on the Web by using these principles is central for many language
technology applications. It requires harmonization on different levels in particular on
the metadata level. In this paper we present our first attempt to publish three Swedish
lexical-semantic resources in RDF with lemon [2].
3
329
2 lemon
lemon (Lexicon Model for Ontologies) is a model for associating linguistic information
with ontologies,4 in particular Semantic Web ontologies. The model builds on exist-
ing models for incorporating multilingual knowledge in ontologies, and for the rep-
resentation of lexical resources [3]. lemon is built around the principle of semantics
by reference [4]. It separates the lexical layer, that is the words and their morphology
and syntactic behaviour, and the semantic layer in the ontology, which describes the
domain-specific meaning of that entry. The model of lemon is based around lexical
entries, which connect to ontology entities, by means of an object called LexicalSense,
which refers to one meaning of a word, or correspondingly a pair consisting of the word
and the meaning. In this sense, the model of lemon is primarily semasiological, i.e, or-
ganized around words, as opposed to the onomasiological resources, such as SALDO,
which are primarily built around senses. However, the usage of the sense object and the
distributed nature of the RDF graph model, means that from a linked data viewpoint
this distinction is of less relevance, and lemon proves to be an effective model for the
lexical resources discussed here.
The lemon model has since 2011 been the focus of the W3C OntoLex community
group,5 and as such significant developments on both the model and its applications
are still active. In particular, lemon has already been used successfully for the repre-
sentation of several existing lexical resources, most notably WordNet [5], UBY [6] and
BabelNet [7]. Furthermore, the use of lemon has already proved to be a key component
in systems for tasks such as question answering [8], natural language generation [9] and
information extraction [10].
3 Converting the Swedish Lexical Resources into RDF with lemon
Språkbanken at the Department of Swedish, University of Gothenburg, Sweden main-
tains a very large collection of lexical resources for both modern and historical Swedish.
Currently there exist 23 different lexical resources with over 700,000 lexical entries.
Within the time frame of our project we have so far considered three of the modern lexicons, which are also freely available in Lexical Markup Framework (LMF) [11]. As we will show in this section, the form of these lexical resources varies substantially. We minimize this variation with lemon. Since lemon builds on LMF, it allows easy conversion supported by the EXtensible Stylesheet Language Transformation (XSLT) mechanism.6
SALDO, the new version of the Swedish Associative Thesaurus [12], is a semanti-
cally organized lexicon containing morphological and lexical-semantic information for
more than 130,000 Swedish lexical entries of which 13,000 are verbs.7 It is the largest
freely available electronic Swedish lexical resource for language technology, and is the
pivot of all the Swedish lexical resources maintained at Språkbanken. SALDO entries
4
5
6
7
330
are arranged in a hierarchical structure capturing semantic closeness between senses, indicated by a unique sense identifier; in lemon this unique identifier is represented with the object lemon:LexicalSense. A challenge here was how to represent SALDO’s lemgram, which is a pairing of the word’s base form and its inflectional paradigm. A lemgram is represented with the object lemon:LexicalEntry. The base form is described
with a lemma value of the lexical entry and is represented with the object lemon:Form.
The inflectional paradigm is described with a form value combined with a digit, and is
also represented with the object lemon:Form. We defined our own objects for capturing the
paradigm patterns and the morphosyntactic tags.
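To make the mapping concrete, the following sketch builds one simplified SALDO-style entry in lemon with rdflib; the identifiers are placeholders, the custom paradigm objects are omitted, and the lemon namespace used here is the one commonly associated with the model.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

LEMON = Namespace("http://lemon-model.net/lemon#")     # assumed lemon namespace
EX = Namespace("http://example.org/saldo/")            # placeholder for SALDO URIs

g = Graph()
entry, form, sense = EX["bok..nn.1-entry"], EX["bok..nn.1-form"], EX["bok..1"]

g.add((entry, RDF.type, LEMON.LexicalEntry))           # the lemgram
g.add((entry, LEMON.canonicalForm, form))
g.add((form, RDF.type, LEMON.Form))
g.add((form, LEMON.writtenRep, Literal("bok", lang="sv")))   # the base form (lemma)
g.add((entry, LEMON.sense, sense))
g.add((sense, RDF.type, LEMON.LexicalSense))           # SALDO's unique sense identifier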
Swedish FrameNet (SweFN), created as a part of a large project called SweFN++ [13],
is a lexical-semantic resource that has been expanded from and constructed in line with
Berkeley FrameNet (BFN) [14].8 It is defined in terms of semantic frames. Each frame
is unique and is evoked by one or more target words, called lexical units (LUs), which carry valence information about the possible syntactic and semantic realizations of frame
elements (FEs). Frames are represented with the object lemon:LexicalSense. The LUs
evoked by a frame are linked to SALDO entities with the property lemon:isSenseOf.
The property lemon:SemArg links to FE objects. There are two types of FEs, core and peripheral; these are represented with the object lemon:Argument and are linked
to either uby:core or uby:peripheral with the property uby:coreType. A challenge here
was how to represent the syntactic and semantic realizations of FEs that are illustrated
with annotated example sentences. In the LMF file they are annotated with extra, non-
standardized tags. Our solution was to define our object karpHash:example to represent
the annotated example sentences for each FE.
Lexin is a bilingual dictionary,9 originally developed for immigrants by the Swedish
national agency for education.10 It contains detailed linguistic information for 15 lan-
guages including sentence and expression examples, sentence constructions, explana-
tions through comments, synonyms, etc. A lexical entry in Lexin is represented with
lemon:LexicalSense. Entries are linked to SALDO entries with owl:sameAs property. In
addition, we defined the objects spraakbanken:translation and spraakbanken:synonym
to represent translation equivalents of sentences in our RDF model.
4 Summary
We described the effort of transforming three Swedish lexical resources into LOD us-
ing Semantic Web and Linked Data technologies.11 The decision on how to transform the lexical resource attributes to lemon features has been made manually for each resource. Once the transformation is decided, the integration is conducted automatically.
Publishing lexical resources in Swedish as RDF data is valuable for a variety of use
cases. One of the many benefits of having this semantically interlinked content is to
8
9
10
11
The published resources can be accessed here:
331
enhance accessibility and availability on the web, in particular for language technology
applications.
Acknowledgements
The research presented here has been conducted with funding by VINNOVA (Swedish
Governmental Agency for Innovation Systems; grant agreement 2013-04996), and by
the University of Gothenburg through its support of the Centre for Language Technol-
ogy.12
References
1. Chiarcos, C., Nordhoff, S., Hellmann, S., eds.: Linked Data in Linguistics. Representing and
Connecting Language Data and Language Metadata. Springer (2012)
2. McCrae, J., Spohr, D., Cimiano, P.: Linking lexical resources and ontologies on the semantic
web with lemon. In: The Semantic Web: Research and Applications. (2011) 245–259
3. Cimiano, P., Buitelaar, P., McCrae, J., Sintek, M.: Lexinfo: A declarative model for the
lexicon-ontology interface. Web Semantics: Science, Services and Agents on the World
Wide Web 9(1) (2011)
4. Buitelaar, P. In: Ontology-based Semantic Lexicons: Mapping between Terms and Object
Descriptions. Cambridge University Press (2010) 212–223
5. McCrae, J.P., Fellbaum, C., Cimiano, P.: Publishing and linking WordNet using RDF and
lemon. In: Proceedings of the 3rd Workshop on Linked Data in Linguistics. (2014)
6. Eckle-Kohler, J., McCrae, J., Chiarcos, C.: lemonUby-a large, interlinked, syntactically-rich
resource for ontologies. Semantic Web Journal, submitted. (2014)
7. Ehrmann, M., Vannela, D., McCrae, J.P., Cecconi, F., Cimiano, P., Navigli, R.: Representing
Multilingual Data as Linked Data: the Case of BabelNet 2.0. In: Proceedings of the Ninth
International Conference on Language Resources and Evaluation (LREC). (2014)
8. Unger, C., Cimiano, P.: Pythia: Compositional meaning construction for ontology-based
question answering on the semantic web. In Munoz, R., ed.: Natural Language Processing
and Information Systems: 16th International Conference on Applications of Natural Lan-
guage to Information Systems. Volume 6716., Springer (2011) 153–160
9. Cimiano, P., Lüker, J., Nagel, D., Unger, C.: Exploiting ontology lexica for generating natural
language texts from RDF data. In: Proceedings of the 14th European Workshop on Natural
Language Generation. (2013) 10–19
10. Davis, B., Badra, F., Buitelaar, P., Wunner, T., Handschuh, S.: Squeezing lemon with GATE.
In: Proceedings of the First Workshop on the Mulitlingual Semantic Web. (2011) 74
11. Francopoulo, G., George, M., Calzolari, N., Monachini, M., Bel, N., Pet, M., Soria, C., et al.:
Lexical markup framework (LMF). In: International Conference on Language Resources and
Evaluation LREC. (2006)
12. Borin, L., Forsberg, M., Lönngren, L.: SALDO: a touch of yin to WordNet’s yang. Language
Resources and Evaluation 47(4) (2013) 1191–1211
13. Borin, L., Dannélls, D., Forsberg, M., Toporowska Gronostaj, M., Kokkinakis, D.: The
past meets the present in Swedish FrameNet++. In: Proceedings of the 14th EURALEX
International Congress. (2010) 269–281
14. Fillmore, C.J., Johnson, C.R., Petruck, M.R.L.: Background to FrameNet. International
Journal of Lexicography 16(3) (2003) 235–250
QASM: a Q&A Social Media System
Based on Social Semantic
Zide Meng1 , Fabien Gandon1 , and Catherine Faron-Zucker2
INRIA Sophia Antipolis Méditerranée, 06900 Sophia Antipolis, France1
Univ. Nice Sophia Antipolis, CNRS, I3S, UMR 7271, 06900 Sophia Antipolis, France2
Abstract. In this paper, we describe the QASM (Question & Answer
Social Media) system based on social network analysis to manage the
two main resources in CQA sites: users and contents. We first present
the QASM vocabulary used to formalize both the level of interest and the
expertise of users on topics. Then we present our method to extract this
knowledge from CQA sites. Finally we show how this knowledge is used
both to find relevant experts for a question and to search for similar
questions. We tested QASM on a dataset extracted from the popular
CQA site StackOverflow.
Keywords: Community Question Answering, Social Media Mining, Se-
mantic Web
1 Introduction
Community Question Answering (CQA) services provide a platform where users
can ask experts for help. Since questions and answers can be viewed and searched afterwards, people with similar questions can also directly find solutions by browsing this content. Therefore, effectively managing this content is a key
issue. Previous research on this topic mainly focuses on expert detection [2] and similar question retrieval [1]. In this paper, we describe QASM (Question & An-
swer Social Media), a system based on social network analysis (SNA) to manage
the two main resources in CQA sites: users and contents. We first present the
QASM vocabulary used to formalize both the level of interest and the expertise
of users on topics. Then we present our method to extract this knowledge from
CQA sites. Our knowledge model and knowledge extraction method are an extension of our work presented in [3] on social media mining for detecting topics from
question tags in CQA sites. Finally we show how this knowledge is used both to
find relevant experts for routing questions (users interested and experts in the
question topics) and to find answers to questions by browsing CQA content and
by identifying relevant answers to similar questions previously posted. We tested
QASM on a dataset extracted from the popular CQA site StackOverflow.
2 QASM System Description
2.1 Overview
Figure 1 presents an overview of QASM. We first use the SIOC ontology1 to
construct an RDF dataset from social media data extracted from a CQA site.
Then we use social media mining techniques to extract topics, interests and
expertise levels from this dataset. We formalize them with the QASM schema
and enrich our RDF dataset with this knowledge. As a result, we provide an integrated and enriched Q&A triple store which contains user interests, levels of expertise and topics learned from question tags. Finally, we link our dataset with DBpedia (through named entity identification).
Based on the QASM RDF dataset, we can provide the users of the Q&A site
with two services to find relevant experts for a question and to search for similar
questions. We detail them in the following subsections.
Fig. 1. Overview of QASM
2.2 QASM Vocabulary
The QASM vocabulary2 enables modeling the levels of user interest and expertise and the topics of questions and answers from Q&A sites. Figure 2 provides an overview of it. It reuses both the SIOC ontology and the Weighting ontology3.
– qasm:Topic represents a set of tags related to a specified topic. In our model, tags belong to instances of qasm:Topic, and different tags have different weights for each topic.
1 http://sioc-project.org/ontology
2 It is available online at http://ns.inria.fr/qasm/qasm.html
3 http://smiy.sourceforge.net/wo/spec/weightingontology.html
Fig. 2. Overview of the QASM vocabulary
– qasm:WeightedObject is used to describe the weight that a specified subject
has with regard to a specified object. This class has four subclasses which
represent question topics, users' interests, users' expertise and tag topics, respectively. This class is used to model the distributions we extracted from the original data, for example the topic-tag distribution and the user-interest distribution.
– qasm:interestIn is used to describe the user-interest distribution. This
property differs from foaf:interest in its range: in FOAF, people are interested in documents, while in QASM a user is interested in a topic to a certain degree (a weight).
– qasm:expertiseIn is used to describe the user-expertise distribution. A user
has different weights for different topics.
2.3 Knowledge Extraction by Social Media Mining
Topics, interests and levels of expertise are implicit information in the available
raw CQA data. We use social media mining techniques to extract this knowledge.
– Topics & User Interests In [3], we proposed a light-weight model to extract
topics from question tags. The output of this model is a topic-tag distribution
where each tag belonging to a topic is given a weight (probability) indicating
to what extent the tag is related to the topic. A user answering a question
acquires the tags attached to this question and can therefore be represented
by a list of tags. Then we use the topic-tag distribution to compute a user-
topic distribution indicating to what extent each user is related to a topic.
– User Expertise The users interested in a question may provide answers to it
or comments to other answers. Each question or answer may get votes from
other users and an answer may be chosen as the best answer. By exploiting
the tags attached to a question and the topic-tag distribution, the users
providing questions or answers with a high number of votes or the best
answers can be considered as experts in the topics to which their questions belong. Equation 1 defines how we use the vote information to compute users' levels of expertise. E_{u,k} denotes the expertise of user u on topic k, m denotes the number of answers provided by user u, P_{t,k} denotes the weight of tag t for topic k, and Q_i and A_{i,j} denote the votes on question i and on its j-th answer, which is the answer provided by user u to question i.

E_{u,k} = \sum_{i=1}^{m} P_{t,k} \cdot \log(Q_i) \cdot \log(A_{i,j})    (1)
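To make Equation 1 concrete, the following minimal Python sketch reproduces the computation for a single user and topic. The inputs are illustrative assumptions: in the actual system the vote pairs come from the StackOverflow dump and P_{t,k} from the learned topic-tag distribution.

import math

def expertise(answers, p_tk):
    """Sketch of Eq. 1: expertise of one user u on one topic k.

    answers: list of (question_votes, answer_votes) pairs, one per answer
             the user posted on questions carrying tag t.
    p_tk:    weight of tag t for topic k (from the topic-tag distribution).
    """
    score = 0.0
    for q_votes, a_votes in answers:
        # log() is undefined for non-positive vote counts; such answers are
        # skipped here (how the original system handles them is not specified).
        if q_votes > 0 and a_votes > 0:
            score += p_tk * math.log(q_votes) * math.log(a_votes)
    return score

# Hypothetical usage: three answers by one user on questions tagged "python"
print(expertise([(120, 35), (40, 8), (3, 1)], p_tk=0.42))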
2.4 Experimental Evaluation
We first built an RDF dataset from StackOverflow raw data which comprises 15327727 triples4. We then randomly chose several questions and, for each question, recorded the 10 or 20 users suggested by our system. For each question, we then computed the proportion of the recorded users who actually answered it. Compared to [4], our results are much better.
Table 1. Preliminary results on question routing
              100      500      1000     average   [4]
precision@10  0.021    0.0188   0.0187   0.0195    0.0167
precision@20  0.016    0.0134   0.0134   0.0143    0.0118
3 Conclusion and Future Work
We presented QASM, a Q&A system combining social media mining and se-
mantic web models and technologies to manage Q&A users and content in CQA
sites. There are many potential future directions for this work. We are currently
considering constructing a benchmark for Q&A systems based on our StackOverflow dataset. In the near future we will also enrich the linking of QASM with the LOD cloud, which may help to improve question routing and similar question search.
References
1. Anderson, A., Huttenlocher, D., Kleinberg, J., Leskovec, J.: Discovering value from
community activity on focused question answering sites: a case study of stack over-
flow. In Proc. of the 18th ACM SIGKDD Int. Conf. on Knowledge Discovery and
Data Mining (2012)
2. Yang, L., Qiu, M., Gottipati, S., Zhu, F., Jiang, J., Sun, H., Chen, Z.: CQArank:
jointly model topics and expertise in community question answering. In Proc. of the
22nd ACM Int. Conf. on Information & Knowledge Management (2013)
3. Zide, M., Fabien, G., Catherine, F., Ge, S.: Empirical Study on Overlapping Commu-
nity Detection in Question and Answer Sites. In Proc. of the Int. Conf. on Advances
in Social Networks Analysis and Mining (2014)
4. Chang, S., Pal, A.: Routing questions for collaborative answering in community
question answering. In Proc. of Int. Conf. on Advances in Social Networks Analysis
and Mining (ASONAM) (2013)
4 It is available online at https://wimmics.inria.fr/data
A Semantic-Based Platform for Efficient Online
Communication
Zaenal Akbar, José Marı́a Garcı́a, Ioan Toma, Dieter Fensel
Semantic Technology Institute, University of Innsbruck, Austria
firstname.lastname@sti2.at
Abstract. To achieve an effective and efficient way of disseminating information to ever-growing communication channels, we propose an approach that separates the information and communication channels and interlinks them with an intermediary component. The separation enables the information and the communication channels to be reused along various dimensions in transactional communication. In this paper we introduce our online communication platform, which comprises several components. The important roles of semantic web technologies in the platform are explained in detail, including a use case showing the contributions of the semantic web in supporting the effectiveness and efficiency of information dissemination.
Keywords: semantic web, online communication, platform, information dis-
semination
1 Introduction
In today’s internet era, the number and kinds of information dissemination chan-
nels are growing exponentially and changing constantly. Websites, e-mails, blogs
and social media have become the mainstream means of communication. Nev-
ertheless, information dissemination is not only about finding suitable channels,
but also about fitting the content to the available channels. These are the main challenges for effective and efficient information dissemination, and for online com-
munication in general.
Our solution to overcoming these challenges is to decouple information from
channels, defining separate models for each of them, and then interlinking them
with an intermediary component [1]. Semantic technologies play important roles
in our solution: analysis and understanding of the natural language statements,
information modeling and sharing with common vocabularies, matchmaking in-
formation and channels using a rules-based approach [2].
In this paper, we focus on the information modeling (including annotations)
part such that the matchmaking of information to appropriate channels can be
performed efficiently. First, we present the overall architecture, then we discuss
how semantics contribute to the solution and finally we show a use case, followed
by the conclusion and future works.
2 The Online Communication Platform
As shown in Fig. 1, the online communication platform consists of several compo-
nents which are grouped based on their conceptual functions:
Fig. 1. The Online Communication platform architecture
Information Management is responsible for gathering the content from data sources (annotated and un-annotated) and representing it using the common vocabularies. First, the content is extracted by a Content Extractor (implemented using Any23 1), then stored in a triplestore such as OWLIM 2. Fur-
ther, an RDF to OO Mapper (implemented using RDFBeans 3 ) maps the stored
triples onto object-oriented models to be used by the other components. For an-
notated sources where the sources have been annotated with the selected vocab-
ularies, the content can be extracted automatically. For un-annotated sources, a
manual mapping is required to inter-relate the database items (i.e. table fields)
to relevant terms in the desired vocabularies.
Weaver is responsible for matching the information to appropriate chan-
nels through a rule based system. A Rule Editor enables experts to create and
maintain rules through an integrated user interface and access-controlled rules
repository. The rules are then matched to the facts in the working memory of
the rule-based system by a Rule Engine. In our implementation we use Drools 4 .
Channel Management is responsible for distributing the information to
the selected channels according to the defined rules. Dacodi 5 offers various func-
tionalities for distributing the content to the selected communication channels,
as well as for collecting and analyzing feedback from those channels [3].
1 http://any23.apache.org
2 http://www.ontotext.com/owlim
3 http://rdfbeans.sourceforge.net
4 http://drools.jboss.org
5 http://dacodi.sti2.at
3 Applying Semantic Technologies to Online
Communication
Semantic technologies contribute mainly to content modeling, namely in how to obtain content from distributed and heterogeneous sources (i.e. through annotation) and represent it in a common representation to make an efficient match between information and desired channels possible. The matching is not performed between content sources and channels directly, but between the common representation and the channels.
To achieve a reusable and interoperable information model, we selected vo-
cabularies (whole or partial) from the Linked Open Vocabularies 6 :
1. Dublin Core 7 , all metadata terms to support resource description
2. Friend of a Friend 8 , a vocabulary to describe people, the links between them,
the things they create and do
3. Good Relations 9 , a vocabulary to describe e-commerce products and services
4. Schema.org 10 , a collection of tags to markup a page in ways recognized by
major search engines
These vocabularies are widely used, especially Schema.org which has been adopted
by webmasters to increase their webpages’ visibility in search engines.
We show these contributions in detail within the Tourismusverband (TVb)
Innsbruck 11 use case. As one of the big tourism boards in Austria, TVb Innsbruck aims to achieve the highest visibility possible in search engines as well as to be present in various social channels [4]. It has many content types (i.e. Place, Event, Trip)
to be disseminated to numerous channels (i.e. Facebook, YouTube).
a) The TVb Innsbruck content sources (i.e. Blog, services from touristic providers)
were annotated with the selected terms of Schema.org.
Farmer’s Market
14.07.2014 - 01.01.2015
Location:
Markthalle (Herzog-Siegmund-Ufer 1-3, AT-6020, Innsbruck)
In this example, information about Event is annotated with the term Event
from Schema.org by using microdata format 12 .
6 http://lov.okfn.org
7 http://dublincore.org
8 http://www.foaf-project.org
9 http://purl.org/goodrelations/
10 http://schema.org
11 http://www.innsbruck.info
12 http://www.w3.org/TR/microdata/
b) The publication rules were defined to guide the publication of extracted con-
tents to selected channels.
rule "Event Publication Rule"
when item : Event()
then insert(new ItemToBePublishedIn(item, facebookWall))
insert(new ItemToBePublishedIn(item, youtube))
end
In this rule, each time a new Event is found in the extracted content, it is prepared for publication to facebookWall and youtube (instances representing TVb's Facebook and YouTube accounts, respectively).
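Outside of Drools, the weaving step can be illustrated with a minimal Python sketch. The classes, rule structure and channel names below mirror the "Event Publication Rule" above but are illustrative assumptions, not the platform's actual implementation.

from dataclasses import dataclass

@dataclass
class Event:                   # a content item extracted from an annotated source
    name: str

@dataclass
class ItemToBePublishedIn:     # a publication task produced by a matched rule
    item: object
    channel: str

# A rule pairs a condition on the item with the target channels.
RULES = [
    (lambda item: isinstance(item, Event), ["facebookWall", "youtube"]),
]

def weave(items):
    """Match extracted items against the rules and emit publication tasks."""
    tasks = []
    for item in items:
        for condition, channels in RULES:
            if condition(item):
                tasks.extend(ItemToBePublishedIn(item, ch) for ch in channels)
    return tasks

for task in weave([Event("Farmer's Market")]):
    print(task.channel, task.item.name)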
4 Evaluation, Conclusions and Future Work
In order to evaluate our work, we compared the number of visitors to the TVb website before and after annotating the content. Compared to the same period in 2013, the number of visitors increased by 8.63% in Jan-Feb 2014, which may be caused by the annotation. Also, the platform is currently being tested by 6 people at TVb as a substitute for their social media dissemination tool.
The platform comprises several components and uses semantic web technologies to integrate various information sources, extracting and representing the content in common vocabularies to enable efficient matchmaking to
appropriate channels using a rules-based approach. There are four vocabularies
currently supported and in the future, we would like to add more vocabularies
(i.e. Schema.org Action, SIOC 13 ) to enhance the channel management, in order
to improve the feedback collection, for example.
Acknowledgements We would like to thank all members of the OC working
group 14 for their valuable feedback. This work was partly funded by the EU FP7
under grants no. 600663 (Prelida), 257641 (PlanetData) and 284860 (MSEE).
References
1. Fensel, A., Toma, I., Garcı́a, J.M., Stavrakantonakis, I., Fensel, D.: Enabling cus-
tomers engagement and collaboration for small and medium-sized enterprises in
ubiquitous multi-channel ecosystems. Computers in Industry 65(5) (2014) 891–904
2. Akbar, Z., Garcı́a, J.M., Toma, I., Fensel, D.: On using semantically-aware rules for
efficient online communication. In Bikakis, A., Fodor, P., Roman, D., eds.: Rules on
the web. From theory to applications. Volume 8620 of LNCS. Springer (2014) 37–51
3. Toma, I., Fensel, D., Oberhauser, A., Fuchs, C., Stanciu, C., Larizgoitia, I.: Sesa: A
scalable multi-channel communication and booking solution for e-commerce in the
tourism domain. In: The 10th International Conference on e-Business Engineering
(ICEBE). (Sept 2013) 288–293
4. Akbar, Z., Fensel, A., Fensel, D., Fuchs, C., Garcia, J., Juen, B., Lasierra, N.,
Stanciu, C.V., Toma, I., Tymaniuk, S.: Tvb innsbruck semantic pilot analysis.
White paper, Semantic Technology Institute, University of Innsbruck (May 2014)
http://oc.sti2.at/TR/TVBInnsbruck.
13 http://sioc-project.org
14 http://oc.sti2.at
SHAX: A Semantic Historical Archive eXplorer
Michael Feldman1 , Shen Gao1 , Marc Novel2 , Katerina Papaioannou1 , and
Abraham Bernstein1
1
Department of Informatics, University of Zurich, Zurich, Switzerland
2
Swiss Federal Research Institute WSL, Birmensdorf, Switzerland
Abstract. Newspaper archives are some of the richest historical doc-
ument collections. Their study is, however, very tedious: one needs to
physically visit the archives, search through reams of old, very fragile pa-
per, and manually assemble cross-references. We present Shax, a visual
newspaper-archive exploration tool that takes large, historical archives as
an input and allows interested parties to browse the information included
in a chronological or geographic manner so as to re-discover history.
We used Shax on a selection of the Neue Zürcher Zeitung (NZZ)—the
longest continuously published German newspaper in Switzerland with
archives going back to 1780. Specifically, we took the highly noisy OCRed
text segments, extracted pertinent entities, geolocation, as well as tem-
poral information, linked them with the Linked Open Data cloud, and
built a browser-based exploration platform.
This platform enables users to interactively browse the 111906 newspaper
pages published from 1910 to 1920 and containing historic events such
as World War I (WWI) and the Russian Revolution. Note that Shax
is neither limited to this newspaper nor to this time-period or language
but exemplifies the power in combining semantic technologies with an
exceptional dataset.
1 Introduction
During the past decade, many newspapers (most notably the New York Times 3
but see also [1] for an overview) have digitalized their archive in order to make it
searchable and publicly available. Usually, the scanned newspapers are converted
into text via Optical Character Recognition (OCR). The resulting output contains a great deal of noise and makes knowledge discovery from historical newspaper archives a challenging task. Approaches like data cleaning
with specialized Information Retrieval (IR) tools are commonly used for this
task but require substantial human involvement and domain-specific knowledge
[4].
Alternatively, we develop a Semantic-Web based, data-driven approach, which effectively retrieves information from a large volume of newspaper issues. Our
methodology was applied to a part of the digitalized archive of the Neue Zürcher
3 http://open.blogs.nytimes.com/2013/07/11/introducing-the-new-timesmachine/
Zeitung (NZZ) for the years ranging from 1910 to 1920 and is applicable to dif-
ferent news corpora in various languages. The interactive visualization of our
results enables the user to browse and discover historical events with the related
geographic information.
2 Dataset
The NZZ kindly provided us with a part of the archive covering the issues pub-
lished from 1910 to 1920. This period covers historic events such as WWI and the
Russian Revolution. The scanning, digitizing, and OCRing of the NZZ archive
was conducted by the Fraunhofer Institute4. The dataset we use consists of 354 GB of scanned PDFs and 111906 OCRed pages in XML format (one XML file per newspaper page).
4 https://www.iais.fraunhofer.de/nzz.html
One of the biggest problems when processing the data is noise. The OCR
struggled with the Gothic font, which was used during the longest period covered by the archive, including the one under discussion. Additionally, during wartime, when printing resources were scarce, ink and paper quality decreased: some pages are simply not readable and others were printed on thin paper, causing the text of the backside to shimmer through the front side in the scans. The recognized text also contains unavoidable errors, such as spelling variation due to language change. However, the names of high-frequency entities remain the same during this time period, so these errors do not affect our results considerably.
As a result, only a part of the text was correctly recognized: using a spell-checker, we found that only 64% of the words were correctly recognized. As we assume a random distribution of the errors, our results contain insignificant biases.
3 The Application
Fig. 1: The Semantic-Web based approach for analysing newspaper corpora (pipeline: OCRed results → NER → geo entities → Geo Enrichment → geo-entities with context → Data Analysis → Visualization)
3.1 The Semantic Web Approach
Existing ways of dealing with historical corpora rely on Information Retrieval
methods requiring a substantial amount of human e↵ort. Based on the assump-
tion that the locations of important historical events are explicitly mentioned in
the newspaper, we develop a purely data-driven approach that leverages Seman-
tic Web technologies to analyze the noisy dataset (Fig. 1).
Specifically, we first perform the Named Entity Recognition (NER) on the
dataset with DBpedia Spotlight [3], trying out different confidence values to
improve the accuracy. We focus on the correlation between temporal and geo-
graphical information; hence, we only extract the geographic entities (e.g., the
city of Sarajevo). Each entity is linked with its corresponding meta-data as well
as information retrieved from DBpedia and GeoNames. For example, we query
the longitude and latitude from DBpedia and use GeoNames to find its country
code by reverse geo-indexing. The result of this process includes tuples in the
following format: (entity name, longitude, latitude, country code, DBpedia link,
date of mention, issue ID). Finally, we perform data analysis on the results on a
monthly basis by aggregating on the country code or the entity name. In both
cases, we compute the sum of counts in every group.
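As an illustration of this pipeline, the Python sketch below annotates one OCRed page with a public DBpedia Spotlight REST endpoint and aggregates place mentions per month. The endpoint URL, the confidence value and the per-page aggregation are assumptions, and the GeoNames reverse geocoding step that yields the country code is omitted.

import requests
from collections import Counter

# Assumed public Spotlight endpoint (the NZZ text is German).
SPOTLIGHT = "https://api.dbpedia-spotlight.org/de/annotate"

def geo_entities(text, confidence=0.5):
    """Annotate one OCRed page and keep only entities typed as DBpedia places."""
    resp = requests.post(SPOTLIGHT,
                         data={"text": text, "confidence": confidence},
                         headers={"Accept": "application/json"})
    resp.raise_for_status()
    for res in resp.json().get("Resources", []):
        if "DBpedia:Place" in res.get("@types", ""):
            yield res["@URI"]

def monthly_mentions(pages):
    """pages: iterable of (issue_month, page_text) pairs.
    Returns mention counts per (month, entity URI); aggregation by country
    code would additionally require the GeoNames lookup described above."""
    counts = Counter()
    for month, text in pages:
        for uri in geo_entities(text):
            counts[(month, uri)] += 1
    return counts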
3.2 The Interactive Visualization
In this section, we briefly introduce the functions of our exploration platform
which is available at https://files.ifi.uzh.ch/ddis/nzz-demo/WebContent/.
Function 1: Country Mentions over time A choropleth-map of Europe
was generated for each year based on the country counts. As shown in Fig. 2(a)
and 2(b), the color intensity of a country is proportional to its counts (i.e. the darker the color, the higher the count). By navigating through the years, the way the colors change provides an overview of the popularity of each country.
For example, the Balkan countries are mentioned more often at the beginning
of WWI. In order to avoid biases, such as countries being mentioned extensively
due to higher geographical proximity to Switzerland or due to larger population,
the annual counts of each country were also normalized by relative distance ([5])
to Zürich and population estimated in 1910.
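The text does not state this normalization explicitly; one plausible reading, given here only as an assumption, is

\tilde{c}_y(C) = \frac{c_y(C)}{d(C, \text{Zürich}) \cdot p_{1910}(C)}

where c_y(C) is the raw number of mentions of country C in year y, d(C, Zürich) its relative distance to Zürich [5], and p_{1910}(C) its population estimated in 1910; the actual combination of the two factors may differ.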
Fig. 2: Color changing of EU in 1910 (panel a) and 1915 (panel b)
Function 2: Linking countries, issues and historical events By construct-
ing an inverse index that links the countries to the issues where they are men-
tioned, users can further explore the reasons behind the change in the colors.
By clicking on a country, they can see the historical events it was involved in (Fig. 3) as well as the relevant newspaper PDFs. The historical events presented
are systematically extracted from DBpedia [2] by querying the category “Event”
with the corresponding country. A newspaper’s PDF is considered relevant if the
country or a place within its borders was mentioned.
Function 3: Entity Mentions over time A more detailed analysis of entity
mentions is visualized using the word cloud (Fig. 4(a)) and trend line (Fig.
Fig. 3: Linking countries, issues and historical events
4(b)) of all geographic entities per year. It is possible to directly observe the
changes in the popularity of each entity over time, as well as correlations among
them. Additionally, we plot each entity as a bubble on the map based on its
coordinates and number of mentions (Fig. 4(c)). Thus, users can discover the
actual location of a historical event.
Fig. 4: Entity mentions over time ((a) word cloud, (b) trend line, (c) bubble map)
4 Conclusion
Shax is a browser-based exploration platform, which highlights the major role of
Semantic Web tools in extracting entities mentioned in highly noisy newspaper
archives. Moreover, it shows that interactive visualization is necessary not only
in presenting the information within the newspaper in a user-friendly way, but
also in discovering implicit knowledge from the corpus. In the future, we plan to
apply our method on the whole 250 years of NZZ archives and try to extend it
to explore other kinds of entities such as notable people.
Acknowledgements
We would like to thank the Neue Zürcher Zeitung, Thilo Haas, Thomas Schar-
renbach, and Daniel Spicar.
References
[1] Doerr, M., Markakis, G., Theodoridou, M., Tsikritzis, M.: Diathesis: Ocr based
semantic annotation of newspapers (2007)
[2] Hienert, D., Luciano, F.: Extraction of historical events from wikipedia. In: CoRR
(2012)
[3] Mendes, P.N., Jakob, M., Garcı́a-Silva, A., Bizer, C.: Dbpedia spotlight: shedding
light on the web of documents. In: Proc. of the 7th i-Semantics (2011)
[4] Places, Á., Fariña, A., Luaces, M., Pedreira, Ó., Seco, D.: A workflow management
system to feed digital libraries: proposal and case study. Multimedia Tools and
Applications (2014)
[5] Worboys, M.F.: Metrics and topologies for geographic space. In: Proc. 7th Intl.
Symp. Spatial Data Handling (1996)
SemanTex: Semantic Text Exploration Using
Document Links Implied by Conceptual
Networks Extracted from the Texts*
Suad Aldarra1 , Emir Muñoz1 , Pierre-Yves Vandenbussche1 , and Vı́t Nováček2
1
Fujitsu (Ireland) Limited
Airside Business Park, Swords, Co. Dublin, Ireland
E-mail: Firstname.Lastname@ie.fujitsu.com
2
Insight @ NUI Galway (formerly known as DERI)
IDA Business Park, Lower Dangan, Galway, Ireland
E-mail: vit.novacek@deri.org
1 Introduction
Despite advances in digital document processing, exploration of implicit relationships within large amounts of textual resources can still be daunting. This
is partly due to the ‘black-box’ nature of most current methods for computing
links (i.e., similarities) between documents (cf. [1] and [2]). The methods are
mostly based on numeric computational models like vector spaces or probabilis-
tic classifiers. Such models may perform well according to standard IR evaluation
methodologies, but can be sub-optimal in applications aimed at end users due
to the difficulties in interpreting the results and their provenance [3, 1].
Our Semantic Text Exploration prototype (abbreviated as SemanTex) aims
at finding implicit links within a corpus of textual resources (such as articles or
web pages) and exposing them to users in an intuitive front-end. We discover the
links by: (1) finding concepts that are important in the corpus; (2) computing
relationships between the concepts; (3) using the relationships for finding links
between the texts. The links are annotated with the concepts from which the
particular connection was computed. Apart from being presented to human users
for manual exploration in the SemanTex interfaces, we are working on repre-
senting the semantically annotated links between textual documents in RDF
and exposing the resulting datasets for particular domains (such as PubMed or
New York Times articles) as a part of the Linked Open Data cloud.
In the following we provide more details on the method and give an example
of its practical application to browsing of biomedical articles. A video example
of a specific SemanTex prototype to be demonstrated at the conference can be
looked up at http://goo.gl/zL8lJ2.
* This work has been supported by the 'KI2NA' project funded by Fujitsu Laboratories Limited in collaboration with Insight @ NUI Galway.
2 Method
Extracting Conceptual Networks. For extracting links between concepts in
the texts we use methods we introduced in [4]. The essentials of the method are
as follows: (1) Extracting noun phrases that may refer to domain-specific con-
cepts (using either a shallow parser for general texts or biomedical named-entity
recognition tool for life sciences). (2) Computing co-occurrence relationships be-
tween the extracted noun phrases by means of point-wise mutual information
(PMI). (3) Filtering out the relationships with the PMI scores below a threshold.
(4) Computing (cosine) similarity relationships based on the co-occurrence ones.
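A minimal Python sketch of steps (2) and (3), assuming document-level co-occurrence counting (the original work may use a different context window and filtering strategy):

import math
from collections import Counter
from itertools import combinations

def pmi_network(doc_concepts, threshold=0.0):
    """Build PMI-weighted co-occurrence edges between extracted concepts.

    doc_concepts: list of sets, the concepts found in each document.
    """
    n_docs = len(doc_concepts)
    occ = Counter()    # in how many documents each concept occurs
    cooc = Counter()   # in how many documents each concept pair co-occurs
    for concepts in doc_concepts:
        occ.update(concepts)
        cooc.update(frozenset(p) for p in combinations(sorted(concepts), 2))
    edges = {}
    for pair, n_xy in cooc.items():
        x, y = tuple(pair)
        pmi = math.log((n_xy * n_docs) / (occ[x] * occ[y]))
        if pmi >= threshold:          # step (3): drop weak relationships
            edges[(x, y)] = pmi
    return edges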
Computing Paths between Documents. From a conceptual network, one
can generate sets of paths leading out from every concept. To prevent a combi-
natorial explosion, we limit the paths by two factors: (1) the maximum path
length; (2) the minimum product of the edge weights of the path. From the set
of such paths associated with particular nodes, paths between the original docu-
ments (i.e., text-to-text links semantically annotated by the concepts appearing
on them) can be generated using inverted indices of concept-text provenance.
For instance, imagine a text A contains concept x. Now assume that x is related
to a concept y in text B via a path (x, u, v, y). Then we can say the texts A and
B are related by a (x, u, v, y) path.
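The bounded path enumeration and the provenance-based lifting of concept paths to text-to-text links can be sketched in Python as follows; the parameter values and data structures are illustrative assumptions rather than the SemanTex implementation.

def bounded_paths(graph, start, max_len=3, min_weight=0.1):
    """Enumerate paths from `start` whose length and edge-weight product stay
    within the two bounds described above.

    graph: dict mapping a concept to {neighbour: edge_weight}.
    """
    stack = [([start], 1.0)]
    while stack:
        path, weight = stack.pop()
        if len(path) > 1:
            yield path, weight
        if len(path) - 1 >= max_len:         # bound (1): maximum path length
            continue
        for nxt, w in graph.get(path[-1], {}).items():
            if nxt not in path and weight * w >= min_weight:  # bound (2)
                stack.append((path + [nxt], weight * w))

def document_links(paths, provenance):
    """Lift concept paths to text-to-text links via an inverted index
    provenance: concept -> set of documents mentioning it."""
    for path, weight in paths:
        for doc_a in provenance.get(path[0], ()):
            for doc_b in provenance.get(path[-1], ()):
                if doc_a != doc_b:
                    yield doc_a, doc_b, path, weight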
Selecting the Most Relevant Paths. The critical part of the method is find-
ing out which paths are most promising out of potentially huge numbers of
them. For that we use multi-objective optimisation of several specific complex-
ity, coherence and entropy measures introduced in [4]. We follow certain intuitive
assumptions when selecting the path measures to optimise: (1) Paths leading
through more complex environs are more informative for a user. (2) Paths sur-
rounded by many highly balanced (i.e., entropic) topics are more informative.
(3) Coherent paths with gradual topical changes on the way are better (less
chaotic, more focused progression from one topic to another en route to the
linked text). (4) From an Information Retrieval point of view, it is more interesting when one ends up in a topically distant (incoherent) area, provided that the progress through the topics is gradual, i.e., less random. The result of this
step is a set of optimal (non-dominated) text-to-text paths that can be further
ranked according to their combined score.
3 Usage
To demonstrate the SemanTex technology, we have applied it to the corpus of
Parkinson’s Disease article abstracts from PubMed which we experimented with
in [4]. As can be seen in Figure 1, the front-end has been incorporated into
a PubMed look-and-feel. Domain-specific concepts are highlighted in the abstract display (the darker the shade, the more important the concept is for the given abstract). After clicking on any of the highlights, a separate 'Path Diagram'
window is displayed where one can navigate paths leading from the selected
concept. The nodes on the paths can be expanded with further connections
while the corresponding related articles are always displayed in the bottom of
the window. Clicking on a related article leads to the article view. One can also
explore articles related by a path to the currently browsed one. A diagram of
paths that connect the articles via the concepts in them can be displayed as well.
Figure 1 illustrates SemanTex on the example of an article about the correlation of caffeine consumption, risk of Parkinson's disease and the related differences between men and women. When exploring the concept 'high caffeine consumption', one can continue to the 'hormones' and 'women' nodes. Expanding the 'women'
link shows many concepts related to women’s health, such as ‘oophorectomy’ (re-
moval of ovaries). There is a single article related to that concept, dealing with
increased risk of parkinsonism in women who underwent oophorectomy before
menopause. This shows how one can quickly explore a problem from many dif-
ferent viewpoints with SemanTex, linking an article dealing with the influence of
particular hormonal levels on the development of Parkinson’s Disease in women
with another article looking into higher risk of parkinsonism due to lower levels
of estrogen caused by the pre-menopausal removal of ovaries.
When further exploring some articles related to the last one, one can see all
the paths that connect them. For instance, a study about Dutch elderly people is
linked to the oophorectomy article by means of four paths, all involving concepts
clearly related to common geriatric ailments. This illustrates the possibility of
smooth topical progression in exploring the articles.
4 Conclusions
In this work, we have presented SemanTex, an application that discovers im-
plicit, semantically annotated links within a corpus of textual resources. We
have implemented a sample prototype of the technology deployed on PubMed
articles using the standard PubMed look-and-feel. This was to show how we can
easily add value to many traditional applications involving exploration of large
numbers of textual resources. In a similar way, we will implement SemanTex ver-
sions for New York Times and Wikipedia articles within an evaluation trial of
the technology. Last but not least, we plan to generate RDF representations of
the semantically annotated links between texts computed by particular instances
of SemanTex and expose them as a part of the LOD cloud.
References
1. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval.
Cambridge University Press (2008)
2. Lin, J., Wilbur, W.J.: PubMed related articles: a probabilistic topic-based model
for content similarity. BMC Bioinformatics 8(1) (2007)
3. Grefenstette, E.: Analysing document similarity measures. Master’s thesis, Univer-
sity of Oxford (2009)
4. Nováček, V., Burns, G.A.: SKIMMR: Facilitating knowledge discovery in life sci-
ences by machine-aided skim reading. PeerJ (2014) In press, see https://peerj.
com/preprints/352/ for a preprint.
Fig. 1. SemanTex usage example - Parkinson’s Disease
Towards a Top-K SPARQL Query Benchmark
Shima Zahmatkesh, Emanuele Della Valle, Daniele Dell’Aglio, and Alessandro
Bozzon
DEIB - Politecnico of Milano, Delft University of Technology
shima.zahmatkesh@polimi.it, emanuele.dellavalle@polimi.it,
daniele.dellaglio@polimi.it, a.bozzon@tudelft.nl
Abstract. The research on optimization of top-k SPARQL queries would
largely benefit from the establishment of a benchmark that allows com-
paring different approaches. For such a benchmark to be meaningful, at
least two requirements should hold: 1) the benchmark should resemble
reality as much as possible, and 2) it should stress the features of the top-
k SPARQL queries both from a syntactic and performance perspective.
In this paper we propose Top-k DBPSB: an extension of the DBpedia
SPARQL benchmark (DBPSB), a benchmark known to resemble real-
ity, with the capabilities required to compare SPARQL engines on top-k
queries.
Keywords: Top-k Query, SPARQL, Benchmark
1 Problem Statement
Top-k queries – queries returning the top k results ordered according to a user-
defined scoring function – are gaining attention in the Database [1] and Semantic
Web communities [2–6]. Order is an important property that can be exploited
to speed up query processing, but state-of-the-art SPARQL engines such as
Virtuoso [7], Sesame [8], and Jena [9] do not exploit order for query optimisation purposes. Top-k SPARQL queries are managed with a materialize-then-sort
processing schema [1] that computes all the matching solutions (e.g., thousands)
even if only a limited number k (e.g., ten) are requested.
Recent works [2–5] have shown that an efficient split-and-interleave process-
ing schema [1] could be adopted to improve the performance of top-k SPARQL
queries. To the best of our knowledge, a consistent comparison of those works
does not exist. As often occurs, the main cause for this fragmentation resides in
the partial lack of a SPARQL benchmark covering top-k SPARQL queries. To
foster the work on top-k query processing within the Semantic Web community,
we believe that it is the right time to define a top-k SPARQL benchmark.
Following well known principles of benchmarking [10], we formulate the re-
search question of this work as: is it possible to set up a benchmark for top-k
SPARQL queries, which resembles reality as much as possible and stresses the
features of top-k queries both from a syntactic (i.e., queries should contain rank-
related clauses) and performance (i.e., the query mix should insist on character-
istics of top-k queries which stress SPARQL engine) perspective?
2 Methodology
In this poster, we describe our ongoing work on Top-k DBpedia SPARQL Bench-
mark (Top-k DBPSB). It extends the methodology, proposed in DBPSB [11], to
build SPARQL benchmarks that resemble reality. It uses the same dataset, per-
formance metrics, and test driver of DBPSB, but it extends the query variabil-
ity feature of DBPSB by proposing an algorithm to automatically create top-k
queries from the 25 auxiliary queries of the DBPSB and its datasets.
In order to present Top-k DBPSB, we need to introduce some terminology:
– Definition 1: Rankable Data Property is a RDF property whose range is
in xsd:int, xsd:long, xsd:float, xsd:integer, xsd:decimal, xsd:double, xsd:date,
xsd:dateTime, xsd:time, or xsd:duration.
– Definition 2: Rankable Triple Pattern is a triple pattern that has a Rank-
able Data Property in the property position of the pattern.
– Definition 3: When a variable, in the object position of a Rankable Triple
Pattern, appears in a scoring criteria of the scoring function, we call it Scor-
ing Variable and we call Rankable Variable the one appearing in the subject
position.
Figure 1 shows an overview of the process followed by Top-k DBPSB to generate top-k SPARQL queries from the Auxiliary Queries and datasets of DBPSB.
The process consists of four steps: 1) finding rankable variables, 2) computing
maximum and minimum values for each rankable variable, 3) generating scoring
functions, and 4) generating top-k queries.
Fig. 1: Top-k DBPSB generates top-k SPARQL queries starting from the Auxiliary queries and datasets
of DBPSB (a SPARQL benchmark known to resemble reality).
In the first step, Top-k DBPSB looks for all rankable variables, in each auxiliary query of each DBPSB query template, which could be used as part of a ranking criterion in a scoring function. For each DBPSB auxiliary query, Top-k DBPSB first checks if the variables in the query fit the definition of scoring variable. To find additional rankable variables, Top-k DBPSB considers all variables in the query and also tries to find rankable triple patterns related to them.
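A minimal sketch of the rankable-property test implied by Definition 1 is given below; the property/range pairs in the usage example are hypothetical, whereas in Top-k DBPSB they would be harvested from the DBPSB datasets.

# XSD datatypes listed in Definition 1: a property whose range is one of
# these is treated as a Rankable Data Property.
RANKABLE_XSD = {"int", "long", "float", "integer", "decimal", "double",
                "date", "dateTime", "time", "duration"}

XSD = "http://www.w3.org/2001/XMLSchema#"

def is_rankable_datatype(range_uri):
    """True if an rdfs:range URI points to one of the rankable XSD types."""
    return range_uri.startswith(XSD) and range_uri[len(XSD):] in RANKABLE_XSD

# Hypothetical usage over (property, range) pairs:
pairs = [("http://dbpedia.org/ontology/populationTotal", XSD + "integer"),
         ("http://xmlns.com/foaf/0.1/name", XSD + "string")]
print([p for p, r in pairs if is_rankable_datatype(r)])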
For the sake of simplicity, Top-k DBPSB generates the scoring function as a weighted sum of normalized ranking criteria; therefore, for each rankable variable, Top-k DBPSB needs to compute its maximum and minimum values. So, for each DBPSB auxiliary query and each rankable triple pattern identified in the previous step, Top-k DBPSB generates a query to find those values.
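The scoring function itself is not spelled out here; a weighted sum of min-max normalized criteria, using the minima and maxima just computed, would plausibly take the form

score = \sum_{i=1}^{n} w_i \cdot \frac{s_i - \min_i}{\max_i - \min_i}

where s_i is the value bound to the i-th scoring variable, w_i its weight, and min_i, max_i the values obtained from the generated queries; the exact normalization used by Top-k DBPSB may differ.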
After having collected those values, for each rankable triple pattern, Top-k
DBPSB creates a new top-k query template. For instance, Query 2 of Figure
1 shows such a query as generated by the Top-K DBPSB. In order to obtain
an executable query, for each scoring variable that appears in the scoring function, Top-k DBPSB adds the related rankable triple pattern to the body of the query.1
3 Experiments
Given that we extend DBPSB, we consider the first part of our research question (i.e., Top-k DBPSB resembles reality) positively answered. In this section, we
provide preliminary evidence that the query variability feature of Top-k DBPSB
positively answers also the second part of our research question. We do so by
operationalising our research question in three hypotheses:
H.1 The larger the number of Rankable Variables, the longer the average execution time.
H.2 The larger the number of Scoring Variables in the scoring function, the longer the average execution time.
H.3 The value of the LIMIT clause has no significant impact on the average execution time.
In order to evaluate our hypotheses, we carry out our experiments using the SPARQL engines Virtuoso and Jena. After preparing the experimental environment, we load the DBpedia dataset with a scale factor of 10% into the SPARQL engines. In order to evaluate the performance of the SPARQL engines, we use the DBpedia SPARQL Benchmark test driver and modify it for top-k queries. We execute the generated top-k SPARQL queries against these two SPARQL engines. The information gained from the execution of the randomly generated queries against the SPARQL engines is used to evaluate our hypotheses.
As a result, we found that most of the query templates generated by Top-k DBPSB are compatible with our hypotheses H.2 and H.3. On the contrary, Top-k DBPSB does not generate adequate query templates to test hypothesis H.1: only 6 query templates out of the 25 in DBPSB can be used in validating H.1, while the others have only one rankable variable. Further investigation is needed to refine hypothesis H.1 and better qualify the cases that stress SPARQL engines when answering top-k queries.
1 The implementation code is available online at https://bitbucket.org/sh_zahmatkesh/top-k-dbpsb
4 Conclusion
In this work, we presented Top-k DBPSB, an extension of DBPSB [11] that adds
the possibility to automatically generate top-k queries. Top-k DBPSB satisfies the requirement of resembling reality by extending DBPSB, which automatically derives datasets and queries from DBpedia and its query logs. Moreover, we provide experimental evidence that Top-k DBPSB also satisfies the requirement to stress the distinguishing features of top-k queries.
In order to support the claim that we positively answered the research question presented in Section 1, we experimentally showed that the query variability provided by Top-k DBPSB stresses the SPARQL engines. To this end, we formulated three hypotheses and empirically demonstrated that when the number of scoring variables increases the average execution time also increases (hypothesis H.2) and that the average execution time is independent of the value used in the LIMIT clause (hypothesis H.3). Counterintuitively, hypothesis H.1 is not confirmed by our experiments.
References
1. Ilyas, I.F., Beskales, G., Soliman, M.A.: A survey of top-k query processing tech-
niques in relational database systems. ACM Computing Surveys (CSUR) 40(4)
(2008) 11
2. Magliacane, S., Bozzon, A., Della Valle, E.: Efficient execution of top-k sparql
queries. In: The Semantic Web–ISWC 2012. Springer (2012) 344–360
3. Wagner, A., Duc, T.T., Ladwig, G., Harth, A., Studer, R.: Top-k linked data query
processing. In: The Semantic Web: Research and Applications. Springer (2012)
56–71
4. Cheng, J., Ma, Z., Yan, L.: f-sparql: a flexible extension of sparql. In: Database
and Expert Systems Applications, Springer (2010) 487–494
5. Cedeno, J.P.: A Framework for Top-K Queries over Weighted RDF Graphs. PhD
thesis, Arizona State University (2010)
6. Siberski, W., Pan, J.Z., Thaden, U.: Querying the semantic web with preferences.
In: ISWC. (2006) 612–624
7. Erling, O., Mikhailov, I.: Rdf support in the virtuoso dbms. In Auer, S., Bizer, C.,
Müller, C., Zhdanova, A.V., eds.: CSSW. Volume 113 of LNI., GI (2007) 59–68
8. Broekstra, J., Kampman, A., van Harmelen, F.: Sesame: A generic architecture for
storing and querying rdf and rdf schema. In Horrocks, I., Hendler, J.A., eds.: In-
ternational Semantic Web Conference. Volume 2342 of Lecture Notes in Computer
Science., Springer (2002) 54–68
9. Owens, A., Seaborne, A., Gibbins, N., mc schraefel: Clustered tdb: A clustered
triple store for jena. Technical report, Electronics and Computer Science, Univer-
sity of Southampton (2008)
10. Gray, J., ed.: The Benchmark Handbook for Database and Transaction Systems
(2nd Edition). Morgan Kaufmann (1993)
11. Morsey, M., Lehmann, J., Auer, S., Ngomo, A.C.N.: Dbpedia sparql benchmark
- performance assessment with real queries on real data. In Aroyo, L., Welty, C.,
Alani, H., Taylor, J., Bernstein, A., Kagal, L., Noy, N.F., Blomqvist, E., eds.: Inter-
national Semantic Web Conference (1). Volume 7031 of Lecture Notes in Computer
Science., Springer (2011) 454–469
Exploring type-specific topic profiles of datasets:
a demo for educational linked data
Davide Taibi1, Stefan Dietze2, Besnik Fetahu2, Giovanni Fulantelli1
1
Istituto per le Tecnologie Didattiche, Consiglio Nazionale delle Ricerche, Palermo, Italy
{davide.taibi, giovanni.fulantelli}@itd.cnr.it
2
L3S Research Center, Hannover, Germany
{dietze, fetahu}@l3s.de
Abstract. This demo presents the dataset profile explorer which provides a re-
source type-specific view on categories associated with available datasets in the
Linked Data cloud, in particular the ones of educational relevance. Our work uti-
lises type mappings with dataset topic profiles to provide a type-specific view on
datasets and their categorisation with respect to topics, i.e. DBpedia categories.
Categories associated with each dataset are shown in an interactive graph, gener-
ated for the specific profiles only, allowing for more representative and meaning-
ful classification and exploration of datasets.
Keywords: Dataset profile, Linked Data for Education, Linked Data Explorer
1 Motivation
The diversity of datasets in the Linked Data (LD) cloud has increased in the last few
years, and identifying a dataset containing resources related to a specific topic is, at
present, a challenging activity. Moreover, the lack of up-to-date and precise descriptive
information has exacerbated this challenge. The mere keywords-based classification
derived from the description of the dataset owner is not sufficient, and for this reason,
it is necessary to find new methods that exploit the characteristics of the resources
within the datasets to provide useful hints about topics covered by datasets and their
subsequent classification.
In this direction, authors in [1] proposed an approach to create structured metadata
to describe a dataset by means of topics, where a weighted graph of topics constitutes
a dataset profile. Profiles are created by means of a processing pipeline1 that combines
techniques for dataset resource sampling, topic extraction and topic ranking. Topics have been associated with datasets using named entity recognition (NER) techniques, and a score, representing the relevance of a topic for a dataset, has been calculated using algorithms that evaluate node relevance in networks, such as PageRank, K-Step Markov, and HITS.
1 http://data-observatory.org/lod-profiles/profiling.htm
The limitations of such an approach are related mainly to the following aspects. First,
the meaning of individual topics assigned to a dataset can be extremely dependent on
the type of resources they are attached to. Also, the entire topic profile of a dataset is
hard to interpret if categories from different types are considered at the same time. As
an example of the first issue, the same category (e.g. "Technology") might be associated with resources of very different types such as "video" (e.g. in the Yovisto Dataset2) or "research institution" (e.g. in the CNR dataset3). Concerning the second issue, the single
topic profile attached for instance to bibliographic datasets (such as: the LAK dataset4
or Semantic Web Dog Food5) - in which people (“authors”), organisations ("affilia-
tions") and documents (“papers”) are represented – is characterized by the diversity of
its categories (e.g. DBpedia categories: Scientific_disciplines, Data_management, Information_science, but also Universities_by_country, Universities_and_colleges). In-
deed, classification of datasets in the LD Cloud is highly specific to the resource types
one is looking at. While one might be interested in the classification of "persons" listed
in one dataset (for instance, to learn more about the origin countries of authors in
DBLP), another one might be interested in the classification of topics covered by the
documents (for instance disciplines of scientific publications) in the very same dataset.
The approach we propose in this demo to overcome the limitations described above
relies on filtering the topic profiles defined in [1] according to the types of the resources.
This results in a type-specific categorisation of datasets, which considers both the cat-
egories associated with one dataset and the resource types these are associated with.
However, the schemas adopted by the datasets of the LD cloud are heterogeneous, thus making it difficult to compare the topic profiles across datasets. While there are
many overlapping type definitions representing the same or similar real world entities,
such as "documents", "people", “organization”, type-specific profiling relies on type
mappings to improve the comparability and interpretation of types and consequently,
profiles. For this aim the explicit mappings and relations declared within specific sche-
mas (as an example foaf:Agent has as subclasses: foaf:Group, foaf:Person, foaf:Organ-
ization) as well as across schemas (for instance through owl:equivalentClass or
rdfs:subClassOf properties) are crucial.
Since we rely on explicit type mappings, we have based our demo on a set of datasets where explicit schema mappings are available from earlier work [2]. This includes ed-
ucation-related datasets identified by the LinkedUp Catalog6 in combination with the
dataset profiles generated by the Linked Data Observatory7. While the latter provides
topic profiles for all selected datasets, the LinkedUp Catalog contains explicit schema
mappings which were manually created for the most frequent types in the dataset. Spe-
cifically, the profile explorer proposed in this demo aims at providing a resource type-
specific view on categories associated with the datasets in the LinkedUp Catalog. In
2 http://www.yovisto.com/
3 http://data.cnr.it/
4 http://lak.linkededucation.org
5 http://data.semanticweb.org
6 http://data.linkededucation.org/linkedup/catalog/
7 http://data-observatory.org/lod-profiles
this initial stage, a selection of 23 datasets of the catalog has been considered, as representative of datasets including different resource types related to several topics. Type
mappings across all involved datasets link "documents" of all sorts to the common
foaf:Document class, "persons" and "organisations" to the common foaf:Agent class,
and course and module to the aiiso:KnowledgeGrouping8 class. Categories associated
with each dataset are shown in an interactive graph, generated for the specific types
only, allowing for more representative and meaningful classification and exploration of
datasets (Figure 1).
Fig. 1. A screenshot of the demo
2 The Dataset Profile Explorer
The dataset explorer is available at: http://data-observatory.org/led-explorer/. The
explorer is composed of three panels: the panel at the center of the screen shows the
network of datasets and categories, the panel on the left shows general and detailed
descriptions about datasets and categories, and at the top of these panels the selection
panel allows users to apply specific filters on the network. In the central panel, green
nodes represent datasets while blue nodes represent categories. An edge connects a da-
taset to a category if the category belongs to the dataset topic profile. In order to draw
the network, the sigmajs9 library has been used and the nodes of the network have been
displayed using the ForceAtlas2 layout. By clicking on a node (dataset or category),
general and detailed descriptions are shown on the left panel. In the case of a dataset,
8 http://purl.org/vocab/aiiso/schema#KnowledgeGrouping
9 http://sigmajs.org
the general description reports the description of the dataset retrieved from the Datahub
repository10. In the detailed description, the list of the top ten categories (and the related
score) associated to the dataset is reported. In the case of a category, the description
panel reports the list of datasets which have that category in their profile. The datasets
including the category in their top ten list are highlighted in bold.
The selection panel at the top allows users to filter the results by means of three
combo boxes, respectively related to: dataset, resource type, and resource sub-type. The
list of datasets is composed of the datasets of the LinkedUp catalog. Regarding the resource type, the explorer is focused on three classes: foaf:Document, foaf:Agent and aiiso:KnowledgeGrouping. The foaf:Document class is related to learning material such as research papers, books, and so on; the foaf:Agent resource type has been included to take into account elements such as persons and organizations. The aiiso:KnowledgeGrouping class represents resources related to courses and modules. This initial
set of resource types can be easily enlarged by means of configuration settings. The
resource sub-type has been included with the aim of refining the results already filtered
by resource type. Another filter that has been included into the explorer is related to the
score of the relationships between datasets and categories. A slider bar allows users to
filter results based on a specific range of the scores. As stated before, the scores have
been calculated by the linked dataset profiling pipeline. The filters on datasets, resource
types and resource sub-types can be combined and, as a result, only the portion of the network consistent with the filter selections is highlighted.
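The combined effect of these filters can be sketched with a few lines of Python; the ProfileEdge structure and its field names are illustrative assumptions, not the explorer's actual data model.

from dataclasses import dataclass

@dataclass
class ProfileEdge:            # one dataset-category link of a type-specific profile
    dataset: str
    resource_type: str        # e.g. "foaf:Document", "foaf:Agent"
    sub_type: str
    category: str
    score: float

def filter_edges(edges, dataset=None, resource_type=None,
                 sub_type=None, min_score=0.0, max_score=1.0):
    """Return the portion of the profile network matching the selected filters,
    mirroring the combo boxes and the score slider of the explorer."""
    return [e for e in edges
            if (dataset is None or e.dataset == dataset)
            and (resource_type is None or e.resource_type == resource_type)
            and (sub_type is None or e.sub_type == sub_type)
            and min_score <= e.score <= max_score]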
3 Conclusion
In order to foster an effective use of the resources in the LD cloud, it is important to
make explicit the topics covered by the datasets even in relation to the types of resources
in the datasets. To this aim, we have developed a dataset profile explorer focused on
the domain of educational related datasets. In this domain, topic coverage and the type
of the resources assume a key role in supporting the search for content suitable for a
specific learning course. The explorer allows users to navigate topic profiles associated
with datasets with respect to the type of the resource in the dataset.
The explorer can be configured to be used with different datasets provided that the
dataset topic profile is available, thus extending the application of the proposed ap-
proaches to several fields.
4 References
1. Fetahu, B., Dietze, S., Nunes, B. P., Taibi, D., Casanova, M. A., Generating structured Pro-
files of Linked Data Graphs, 12th International Semantic Web Conference (ISWC2013),
Sydney, Australia, (2013).
2. D’Aquin, M., Adamou, A., Dietze, S., Assessing the Educational Linked Data Landscape,
ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
10 http://datahub.io
TEX-OWL: a Latex-Style Syntax for authoring OWL 2
ontologies
Matteo Matassoni1 , Marco Rospocher2 , Mauro Dragoni2 , and Paolo Bouquet1
1
The University of Trento, Via Sommarive 9, Trento, I-38123, Italy
2
Fondazione Bruno Kessler—IRST, Via Sommarive 18, Trento, I-38123, Italy
Abstract. This paper describes a new syntax that can be used to write OWL 2
ontologies. The syntax, which is known as TEX-OWL, was developed to address
the need for an easy-to-read and easy-to-write plain text syntax. TEX-OWL is
inspired by LATEX syntax, and covers all construct of OWL 2. We designed TEX-
OWL to be less verbose than the other OWL syntaxes, and easy-to-use especially
for quickly developing small-size ontologies with just a text editor. The important
features of the syntax are discussed in this paper, and a reference implementation
of a Java-based parser and writer is described.
1 Introduction and Motivation
Since OWL became a World Wide Web Consortium (W3C) recommendation, there has been a steady stream of Web Ontology Language (OWL) ontology editing tools that have made their way to users' desktops, most notably Protégé-OWL [1], Swoop [2], TopBraid Composer,1 and MoKi [3].
All of these tools offer a variety of presentation or rendering formats for class,
property and individual descriptions and axioms. These formats range from the W3C
officially required RDF/XML exchange syntax [4] to optional ones such as Turtle [5], OWL/XML [6], the Functional-Style Syntax [7] and the Manchester Syntax [8], together with some syntaxes that are not W3C standards, like a Description Logic-style syntax and the Open Biomedical Ontologies (OBO) format [9].
While the use of ontology editing tools is becoming more and more popular, there are still situations where users have to quickly write a small-size ontology for testing or prototyping purposes, and directly writing the ontology with a text editor would be more effective (i.e., the overhead of learning the ontology editing tool's functionalities and features exceeds the benefit obtained by using it). These situations quite frequently occur in academic contexts.
W3C chose RDF/XML as the primary exchange syntax for OWL 2; indeed, this
syntax must be supported by all OWL 2 tools. However, the fact that XML is extremely
verbose and hard to write by hand rules this syntax out for quickly authoring and editing ontologies in a concise manner.
W3C provides alternatives to RDF/XML for OWL 2; these include Turtle, OWL/
XML, the Functional-Style Syntax and the Manchester Syntax. OWL/XML is an XML
1
http://topbraidcomposer.com/.
357
serialization for OWL 2 that mirrors its structural specification. It is more human-readable and easier to parse than RDF/XML; however, like RDF/XML, OWL/XML is still XML. Another syntax that follows OWL 2's structural specification is the Functional-Style Syntax. It is a human-readable, plain-text syntax that removes the burden of XML; however, like the previous two, the Functional-Style Syntax is also verbose. In fact, it has an excessive number of keywords and typically requires the use of a large number of brackets, as its name might suggest. The Manchester Syntax is a user-friendly syntax that can be used to write OWL 2 ontologies. It is the least verbose OWL 2 syntax and, when it is used to write ontology documents, it gathers together information about names in a frame-like manner, as opposed to the other OWL 2 syntaxes. At first glance this may seem a great advantage of the Manchester Syntax, but on the other hand it makes this syntax unable to handle General Concept Inclusions (GCIs) (i.e., the Manchester Syntax does not cover the expressivity of the whole OWL 2 language).
The OWL LaTeX-Style Syntax was created to deal with the above issues and provide users with a lightweight syntax that makes it easier to write ontologies. It has been designed primarily for writing ontology documents in a simple text editor. The syntax is discussed in detail throughout the rest of this paper.
2 OWL Latex-Style Syntax
The full specification of the TEX-OWL syntax is available at http://github.com/
matax87/TexOwl/blob/master/docs/grammar.pdf, together with several
examples of using the various syntax constructs. The primary design consideration in
developing TEX-OWL was to produce a syntax that was concise, and quick and easy to
read and write by hand. We took inspiration for developing the syntax from the LaTeX format, given its popularity, especially in academic environments. A previous attempt to develop a LaTeX-like syntax was proposed in [11], but its syntax is restricted to a limited subset of OWL 1. Lessons learned from this previous experience were also taken into consideration. For example, the keywords that represented datatypes and annotations (i.e., datatypes and annotations were hard-coded in the syntax) were removed in the new syntax, which generalises them via IRIs.
It was also decided that although the syntax should be aligned as much as pos-
sible with the OWL specification, for example by using keywords derived from the
Functional-Style Syntax specification, the main objective would be to strive for con-
ciseness and a reduction in the amount of time it took users to write axioms. To this
end, some new keywords were created and others changed in name or name length.
Moreover, it was also decided that the syntax should match as much as possible the LaTeX peculiarities of using keywords and commands that start with a backslash (‘\’) symbol, with required parameters inside curly braces and optional parameters inside square brackets.
Although the TEX-OWL Syntax borrows ideas from the OWL Functional-Style
Syntax, it is much less verbose. An OWL ontology written in TEX-OWL starts with
an optional preface and continues with the actual ontology document. The optional
preface is where prefixes can be declared via the new keyword \ns. This keyword
can be used also to declare a default prefix, which will be used for interpreting sim-
358
ple IRIs.2 The actual ontology document begins with \begin{ontology} and ends with \end{ontology}. After the begin ontology statement, users can also provide an optional ontology IRI and an (even more optional) version IRI by typing them inside square brackets: [ontologyIRI, versionIRI]. Inside the begin/end block, users can import other ontology documents using the keyword \import, declare axioms and add ontology annotations, as shown in the example below:
\ns
\begin{ontology}[]
% Animals form a class
animal \c
% Plants form a class disjoint from animals
animal \cdisjoint plant
% Trees are a type of plant
tree \cisa plant
% Branches are parts of trees
branch \cisa \oforall{ispartof}{tree}
% Leaves are parts of branches
leaf \cisa \oforall{ispartof}{branch}
% Herbivores are exactly those animals that eat only plants or parts of plants
herbivore \ceq (animal \cand \oforall{eats}{(plant \cor \oforall{ispartof}{plant})})
% Carnivores are exactly those animals that eat animals
carnivore \ceq (animal \cand \oexists{eats}{animal})
% Giraffes are herbivores, and they eat only leaves
giraffe \cisa (herbivore \cand \oforall{eats}{leaf})
% Lions are animals that eat only herbivores
lion \cisa (animal \cand \oforall{eats}{herbivore})
% Tasty plants are plants that are eaten both by herbivores and carnivores
tastyplant \cisa \candof{plant, \oexists{eatenby}{herbivore}, \oexists{eatenby}{carnivore}}
eatenby \oinv eats
eats \odomain animal
\end{ontology}
3 Implementation
A Java-based reference implementation of a TEX-OWL parser and writer was created.3 Both use the OWLAPI framework [10] and were developed as modules that can be integrated into it. The parser was constructed using the Java Compiler Compiler (JavaCC) [12]. It can parse complete ontologies written in TEX-OWL. The writer, which inside the OWLAPI is known as a renderer, can serialize OWLAPI ontology objects to files written in TEX-OWL. Moreover, the implementation also includes converters, which can transform a TEX-OWL ontology into any other OWL 2 syntax and vice versa.
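As a hedged illustration of how such a conversion could be driven from client code, the following sketch uses only standard OWLAPI (version 4+) calls; the assumption that the TEX-OWL parser module is automatically picked up by the ontology manager, as well as the file names, are ours and not taken from the reference implementation.

import java.io.File;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.formats.FunctionalSyntaxDocumentFormat;
import org.semanticweb.owlapi.model.IRI;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyManager;

public class TexOwlConversionSketch {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        // Parsing: the TEX-OWL parser module is assumed to be registered with the manager,
        // so that the .texowl document can be loaded like any other supported syntax.
        OWLOntology ontology =
            manager.loadOntologyFromOntologyDocument(new File("animals.texowl"));
        // Writing: serialize the parsed ontology in another OWL 2 syntax.
        manager.saveOntology(ontology, new FunctionalSyntaxDocumentFormat(),
            IRI.create(new File("animals.ofn").toURI()));
    }
}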
4 Concluding Remarks
TEX-OWL is a new OWL 2 syntax that was designed in response to a demand from
users for a more concise syntax that can be easily used to quickly write small-size
ontologies by hand. Key features of the syntax are that it is inspired by the LaTeX syntax:
2
Simple IRIs are equivalent to abbreviated IRIs where the default prefix is used and there is no need to type the colon (‘:’) symbol.
3
The implementation is available from http://github.com/matax87/TexOwl/.
359
in particular, the syntax uses the same format for parameters and keywords. The syntax is suited for use in simple text editors. A reference implementation of a Java-based parser and writer has been produced, which may be integrated into any tool. The implementation also includes converters, which can transform TEX-OWL into other OWL 2 syntaxes and vice versa.
In order to evaluate TEX-OWL, two questionnaires were designed and sent to knowledge engineers with experience in authoring ontologies with the various OWL syntaxes. In the first questionnaire (accessible here: http://goo.gl/Cjpqtg), the intuitiveness, conciseness, and understandability of TEX-OWL and of all the OWL 2 syntaxes were compared using ten different examples of use. The second questionnaire (accessible here: http://goo.gl/lbFu4R) was focused on evaluating the usability of the new syntax for authoring a small ontology. Ten knowledge engineers participated in the first questionnaire, and five of them took part in the second one. Ratings were expressed on the typical five-level Likert scale.
In summary, the results show that TEX-OWL is indeed the most concise syntax and is as intuitive as the Manchester Syntax, which is the most intuitive among the OWL 2 syntaxes. Moreover, users found it easy to use TEX-OWL for authoring a small example ontology and reported that, in general, this syntax is better suited for writing ontologies by hand than the other OWL 2 syntaxes.
References
1. Knublauch, H., Musen, M.A., Rector, A.L.: Editing description logics ontologies with the
Protégé OWL plugin (2004)
2. Kalyanpur, A., Parsia, B., Hendler, J.: A tool for working with web ontologies (2005)
3. Chiara Ghidini, Marco Rospocher, Luciano Serafini: Modeling in a Wiki with MoKi: Reference Architecture, Implementation, and Usages. International Journal On Advances in Life Sciences, IARIA, volume 4, 111-124 (2012)
4. Gandon, F., Schreiber, G.: RDF 1.1 XML syntax specification (2014) http://www.w3.org/TR/2014/REC-rdf-syntax-grammar-20140225/.
5. Beckett, D.: New syntaxes for RDF. Technical report, Institute For Learning And Research Technology, Bristol (2004)
6. Motik, B., Parsia, B., Patel-Schneider, P.F.: OWL 2 Web Ontology Language XML Serialization (Second Edition) (2012) http://www.w3.org/TR/2012/REC-owl2-xml-serialization-20121211/.
7. Motik, B., Parsia, B.: OWL 2 Web Ontology Language Structural Specification and Functional-Style Syntax (Second Edition) (2012) http://www.w3.org/TR/2012/REC-owl2-syntax-20121211/.
8. Horridge, M., Drummond, N., Goodwin, J., Rector, A.L., Stevens, R., Wang, H.: The Manchester OWL Syntax (2006)
9. Motik, B., Parsia, B.: OBO Flat File Format 1.4 Syntax and Semantics [draft] (2011) ftp://ftp.geneontology.org/go/www/obo-syntax.html.
10. The OWL API, http://owlapi.sourceforge.net.
11. Latex2owl, http://dkm.fbk.eu/index.php/Latex2owl.
12. Sreeni, V., Sriram, S.: Java Compiler Compiler [tm] (JavaCC [tm]) - the Java parser generator, http://javacc.java.net.
360
Supporting Integrated Tourism Services with
Semantic Technologies and Machine Learning
Francesca A. Lisi and Floriana Esposito
Dipartimento di Informatica, Università degli Studi di Bari “Aldo Moro”, Italy
{francesca.lisi,floriana.esposito}@uniba.it
Abstract. In this paper we report our ongoing work on the application
of semantic technologies and machine learning to Integrated Tourism in
the Apulia Region, Italy, within the Puglia@Service project.
1 Introduction
Integrated Tourism can be defined as the kind of tourism which is explicitly linked
to the localities in which it takes place and, in practical terms, has clear connec-
tions with local resources, activities, products, production and service industries,
and a participatory local community. Integrated Tourism thus needs ICTs that
should go beyond the mere technological support for tourism marketing, differently from most approaches in eTourism research (see [1] for a comprehensive
yet not very recent review). In this paper, we report our experience in support-
ing Integrated Tourism services with Semantic Technologies (STs) and Machine
Learning (ML). The work has been conducted within Puglia@Service,1 an Italian
PON Research & Competitivity project aimed at creating an innovative service
infrastructure for the Apulia Region, Italy.
The paper is structured as follows. Section 2.1 shortly describes a domain
ontology for Integrated Tourism, named OnTourism, which has been modeled
for being used in Puglia@Service. Section 2.2 briefly presents a Web Informa-
tion Extraction (WIE) tool, named WIE-OnTour, which has been developed
for populating OnTourism with data automatically retrieved from the Web. Sec-
tion 2.3 illustrates some of the Semantic Web Services (SWSes) which have been
defined on top of OnTourism for supporting Integrated Tourism in Apulia. Sec-
tion 3 outlines an application scenario for a ML tool, named Foil-DL, to better
adapt the automated composition of these services to user demands. Section 4
concludes the paper with final remarks and directions of future work.
2 Semantic Technologies for Integrated Tourism
2.1 A Domain Ontology
Domain ontologies for tourism are already available, e.g. the travel 2 ontology is
centered around the concept of Destination. However, it is not fully satisfactory
1
http://www.ponrec.it/open-data/progetti/scheda-progetto?ProgettoID=5807
2
http://www.protege.cim3.net/file/pub/ontologies/travel/travel.owl
361
from the viewpoint of Integrated Tourism because, e.g., it lacks concepts mod-
eling the reachability of places. In Puglia@Service, we have decided to build a
domain ontology, named OnTourism, 3 more suitable for the project objectives
and compliant with the OWL 2 standard. It consists of 379 axioms, 205 logical
axioms, 117 classes, 9 object properties, and 14 data properties, and has the
expressivity of the DL ALCOF(D).
The main classes of the terminology are Site, Place and Distance. The first
is the root of a taxonomy which covers several types of sites of interest (e.g.,
Hotel and Church). The second models the places where sites are located. The third, together with the object properties hasDistance and isDistanceFor and the data properties hasLengthValue/hasTimeValue, makes it possible to represent the distance relation between sites with values in either length or time units. Distances
are further specified according to the transportation means used (see, e.g., the
class Distance on Foot). Other relevant classes in the terminology are Amenity
(with subclasses such as Wheelchair Access) and Service (with subclasses such as
Bike Rental ) that model, respectively, amenities and services available at the ac-
commodations. Finally, the terminology includes the official 5-star classification
system for hotel ranking.
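To make the modelling pattern concrete, the following OWLAPI sketch asserts a foot distance between a hotel and a church using the classes and properties described above; the namespace separator, the individual names, the local name Distance_on_Foot and the unit values are assumptions made for the example, not data from the actual OnTourism instantiation.

import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;

public class OnTourismDistanceSketch {
    public static void main(String[] args) throws Exception {
        String ns = "http://www.di.uniba.it/~lisi/ontologies/OnTourism.owl#"; // '#' separator assumed
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLDataFactory df = manager.getOWLDataFactory();
        OWLOntology abox = manager.createOntology();

        OWLNamedIndividual hotel = df.getOWLNamedIndividual(IRI.create(ns + "Hotel_Adria"));             // hypothetical
        OWLNamedIndividual church = df.getOWLNamedIndividual(IRI.create(ns + "Basilica_di_San_Nicola"));
        OWLNamedIndividual dist = df.getOWLNamedIndividual(IRI.create(ns + "d_hotel_basilica"));         // hypothetical

        OWLClass distanceOnFoot = df.getOWLClass(IRI.create(ns + "Distance_on_Foot"));                   // assumed local name
        OWLObjectProperty hasDistance = df.getOWLObjectProperty(IRI.create(ns + "hasDistance"));
        OWLObjectProperty isDistanceFor = df.getOWLObjectProperty(IRI.create(ns + "isDistanceFor"));
        OWLDataProperty hasLengthValue = df.getOWLDataProperty(IRI.create(ns + "hasLengthValue"));
        OWLDataProperty hasTimeValue = df.getOWLDataProperty(IRI.create(ns + "hasTimeValue"));

        // The distance individual links the hotel to the church and carries both values.
        manager.addAxiom(abox, df.getOWLClassAssertionAxiom(distanceOnFoot, dist));
        manager.addAxiom(abox, df.getOWLObjectPropertyAssertionAxiom(hasDistance, hotel, dist));
        manager.addAxiom(abox, df.getOWLObjectPropertyAssertionAxiom(isDistanceFor, dist, church));
        manager.addAxiom(abox, df.getOWLDataPropertyAssertionAxiom(hasLengthValue, dist, 650)); // metres, assumed unit
        manager.addAxiom(abox, df.getOWLDataPropertyAssertionAxiom(hasTimeValue, dist, 8));     // minutes, assumed unit
    }
}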
2.2 Ontology Population with Web Information Extraction
WIE-OnTour is a wrapper-based WIE tool implemented in Java and conceived
for the population of OnTourism with data concerning hotels and B&Bs available
on the web site of TripAdvisor4 . The tool is also able to compute the distances of
the extracted accommodations from sites of interest (e.g., touristic attractions)
by means of the Google Maps5 API. Finally, the tool supports the user in the
specification of sites of interest.
Instantiations of OnTourism for the main destinations of urban tourism in
Apulia have been obtained with WIE-OnTour. Here, we consider an instan-
tiation for the city of Bari (the capital town of Apulia). It contains 34 hotels,
70 B&Bs, 106 places, and 208 foot distances for a total of 440 individuals. The
distances are provided in time and length on foot and have been computed with
respect to Basilica di San Nicola and Cattedrale di San Sabino (both instances
of Church and located in Bari). The restriction to foot distances is due to the
aforementioned preference of Integrated Tourism for eco-mobility.
2.3 Semantic Web Services
In Puglia@Service, we have defined several atomic services in OWL-S on top
of the aforementioned domain ontologies, travel and OnTourism. For example,
city churches service returns the churches (o.p. of type Church) located in a given
city (i.p. of type City) whereas near attraction accomodations service returns all
the accommodations (o.p. of type Accommodation) near a given attraction (i.p.
3
http://www.di.uniba.it/~lisi/ontologies/OnTourism.owl
4
http://www.tripadvisor.com/
5
http://maps.google.com/
362
of type Attraction). Note that closeness can be defined on the basis of distance
either in a crisp way (i.e., when the distance value is under a fixed threshold)
or in a fuzzy way (i.e., through grades of closeness). In both cases, however,
the computation should consider the transportation means used as well as the
measure units adopted according to the OnTourism ontology.
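The difference between the two definitions can be sketched as follows (an illustrative example only; the 10-minute threshold and the shape of the fuzzy membership function are assumptions, and the project may define closeness differently):

public class ClosenessSketch {

    // Crisp closeness: an accommodation is "near" iff the distance is under a fixed threshold.
    static boolean crispNear(double walkingMinutes, double thresholdMinutes) {
        return walkingMinutes <= thresholdMinutes;
    }

    // Fuzzy closeness: a left-shoulder membership function returning a grade in [0,1];
    // 1 below `full`, 0 above `none`, decreasing linearly in between.
    static double fuzzyNear(double walkingMinutes, double full, double none) {
        if (walkingMinutes <= full) return 1.0;
        if (walkingMinutes >= none) return 0.0;
        return (none - walkingMinutes) / (none - full);
    }

    public static void main(String[] args) {
        System.out.println(crispNear(8, 10));     // true: 8 minutes is under the 10-minute threshold
        System.out.println(fuzzyNear(8, 5, 15));  // 0.7: a partial grade of closeness
    }
}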
In Puglia@Service, we intend to obtain composite services by applying meth-
ods such as [3]. For example, the sequence composed of city churches service
and near attraction accomodations service could satisfy, e.g., the user request for
accommodations around Basilica di San Nicola. Indeed, since Bari is a major
destination of religious tourism in Apulia, it could e↵ectively support the de-
mand from pilgrims who prefer to find an accommodation in the neighborhood
of places of worship so that they can practise their own religions at any hour of
the day. Also, if the suggested accommodations are easy to reach (i.e., at foot
distance) from the site of interest, the service will bring benefit also to the city,
by reducing the car traffic. In a more complex scenario, disabled pilgrims might
need a wheelchair-accessible accommodation. The service composition mechanism should then also append the wheelchairaccess accommodations service, so that the resulting composite service could be considered more compatible with the special needs of this user profile.
3 Towards Learning from Users’ Feedback
In Puglia@Service, automated service composition will be enhanced by exploiting
users’ feedback. The idea is to apply ML tools in order to induce ontology ax-
ioms which can be used for discarding those compositions that do not reflect the
preferences/expectations/needs of a certain user profile. Here, we illustrate this
idea with an application scenario which builds upon the accommodation rating
provided by TripAdvisor’s users. More precisely, we consider the task of accom-
modation finding. This task strongly relies on a classification problem aimed at
distinguishing good accommodations from bad ones according to the amenities available, the services offered, the location and the distance from sites of interest, etc. In order to address this classification problem, we need ML tools able to deal with the inherent incompleteness of Web data and the inherent vagueness of concepts such as closeness. One such tool is Foil-DL [2], an ML system able to induce a set of fuzzy General Concept Inclusion (GCI) EL(D) axioms from positive and negative examples for a target class in any OWL ontology.
As an illustration of the potential usefulness of Foil-DL in the Puglia@Service
context, we report here a couple of experiments concerning the filtering of results
returned by the SWSes reported in the previous section for the case of Bari. We
set up a learning problem with the class Bad Accommodation as target of the
learning process. Ratings from TripAdvisor users have been exploited for provid-
ing Foil-DL with positive and negative examples. Out of the 104 accommoda-
tions, 57 with a higher percentage (say, over 0.7) of positive users’ feedback are
asserted as instances of Good Accommodation, whereas 15 with a lower percent-
age (say, under 0.5) are asserted as instances of Bad Accommodation. The latter,
of course, play the role of positive examples in our learning problem. Syntactic
restrictions are imposed on the form of the learnable GCI axioms.
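A minimal sketch of how such example sets could be derived from the users' ratings is shown below; the data access is abstracted into a plain map, only the 0.7 and 0.5 thresholds come from the text, and the accommodation names are hypothetical.

import java.util.HashMap;
import java.util.Map;

public class ExampleLabelingSketch {
    // Maps each accommodation to "Good_Accommodation", "Bad_Accommodation" or nothing,
    // given the fraction of positive user feedback it received on TripAdvisor.
    static Map<String, String> label(Map<String, Double> positiveFeedbackRatio) {
        Map<String, String> labels = new HashMap<>();
        for (Map.Entry<String, Double> e : positiveFeedbackRatio.entrySet()) {
            if (e.getValue() > 0.7) {
                labels.put(e.getKey(), "Good_Accommodation");   // used as negative examples
            } else if (e.getValue() < 0.5) {
                labels.put(e.getKey(), "Bad_Accommodation");    // positive examples for the target class
            }
            // accommodations in between are left unlabeled
        }
        return labels;
    }

    public static void main(String[] args) {
        Map<String, Double> ratios = new HashMap<>();
        ratios.put("BB_SanNicola", 0.45);   // hypothetical names and feedback ratios
        ratios.put("Hotel_Adria", 0.82);
        System.out.println(label(ratios));  // e.g. {Hotel_Adria=Good_Accommodation, BB_SanNicola=Bad_Accommodation}
    }
}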
363
In the first experiment, we have not considered the distances of the accom-
modations from the sites of interest. With this configuration, Foil-DL returns
just the following GCI with confidence 0.5:
Bed_and_Breakfast and hasAmenity some (Pets_Allowed) and hasAmenity some (Wheelchair_Access)
subclass of Bad_Accommodation
The GCI suggests that B&Bs are not recommended even though they provide disabled facilities. It can be used to filter out from the result set of the wheelchairaccess accommodations service those accommodations which are classified as bad.
In the second experiment, conversely, we have considered the distances of
the accommodations from the sites of interest. With this configuration, Foil-DL
returns the following GCI with confidence 1.0:
hasAmenity some (Bar) and hasAmenity some (Wheelchair_Access) and
hasDistance some (isDistanceFor some (Bed_and_Breakfast) and isDistanceFor some (Church))
subclass of Bad_Accommodation
The GCI strengthens the opinion that B&Bs are not recommendable accommodations for disabled people, whatever their distance from the churches.
As a further experiment, we have restricted our analysis of accommodations
in Bari to only B&Bs. Starting from 12 positive examples and 39 negative ex-
amples for Bad Accommodation, Foil-DL returns the following two GCIs with
confidence 0.154 and 0.067 respectively:
hasAmenity some (Pets_Allowed) and hasAmenity some (Wheelchair_Access) subclass of Bad_Accommodation
hasAmenity some (Bar) and hasAmenity some (Wheelchair_Access) subclass of Bad_Accommodation
which confirm that B&Bs should not be recommended to disabled tourists.
4 Conclusions and future work
In this paper we have reported our ongoing work on the use of STs and ML
for Integrated Tourism in Apulia within the Puglia@Service project. Though
developed for the purposes of the project, the technical solutions here described
are nevertheless general enough to be reusable for similar applications in other
geographical contexts. Notably, they show the added value of having ontologies
and ontology reasoning (including also non-standard inferences like induction as
exemplified by Foil-DL) behind a Web Service infrastructure.
In the future we intend to carry on with the work on applying Foil-DL to automated service composition. Notably, we shall consider the problem of
learning from the feedback provided by specific user profiles.
References
1. Buhalis, D., Law, R.: Progress in information technology and tourism management:
20 years on and 10 years after the internet - the state of eTourism research. Tourism
Management 29(4), 609–623 (2008)
2. Lisi, F.A., Straccia, U.: A System for Learning GCI Axioms in Fuzzy Description
Logics. In: Eiter, T., Glimm, B., Kazakov, Y., Kroetzsch, M. (eds.) Proc. of the
26th Int. Workshop on Description Logics. CEUR Workshop Proceedings, vol. 1014.
CEUR-WS.org (2013)
3. Redavid, D., Iannone, L., Payne, T.R., Semeraro, G.: OWL-S atomic services com-
position with SWRL rules. In: An, A., Matwin, S., Ras, Z.W., Slezak, D. (eds.)
Foundations of Intelligent Systems. LNCS, vol. 4994, pp. 605–611. Springer (2008)
364
Towards a Semantically Enriched Online Newspaper
Ricardo Kawase, Eelco Herder, Patrick Siehndel
L3S Research Center, Leibniz University Hannover, Germany
{kawase, herder, siehndel}@L3S.de
Abstract. The Internet plays a major role as a source of news. Many publishers offer online versions of their newspapers to paying customers. Online newspapers bear more similarity to traditional print papers than to regular news sites. In
a close collaboration with Mediengruppe Madsack - publisher of newspapers in
several German federal states, we aim at providing a semantically enriched online
newspaper. News articles are annotated with relevant entities - places, persons
and organizations. These annotations form the basis for an entity-based ‘Theme
Radar’, a dashboard for monitoring articles related to the users’ explicitly indi-
cated and inferred interests.
1 Introduction
Traditional print media are nowadays replaced or complemented by online media. Most
publishers of international, national and local newspapers use the Web as an additional
communication channel. Many news sites that are connected to a print newspaper also offer online subscriptions or provide content as pay-per-view. A commonly used solution is subscription-based access to an online newspaper, which is a digital copy of the print newspaper, often with additional features for search, recommendation or archiving. However, in most cases, these additional features are based on content analysis, manual interlinking by the editors and collaborative filtering. In this paper, we present our work towards a semantically enriched online newspaper, carried out in an ongoing collaboration between the L3S Research Center and Madsack GmbH & Co. KG.
1.1 Madsack
Madsack GmbH & Co. KG is a German media group with headquarters in Hannover,
Germany. Its core business comprises the production of several regional newspapers in
Lower Saxony, Schleswig-Holstein, Mecklenburg-Western Pomerania, Saxony, Hessen
and Saxony-Anhalt. Madsack is the sixth largest publishing house in Germany1 ; in the
year 2012, the average circulation of their 18 paid newspapers amounted to 939,590
copies2 .
1
http://www.media-perspektiven.de/uploads/tx_mppublications/05-2010_
Roeper.pdf
2
http://www.madsack.de/fileadmin/content/downloads/Geschaeftsbericht_
MGM_2012-2013_web.pdf
365
The digital business of Madsack media group includes the distribution of edito-
rial content (e-paper, mobile apps, usually using the brand of the corresponding daily
newspapers), marketing services (e.g. programming of websites and apps) as well as
collaborations with online marketplaces.
We focus on Madsack’s e-paper product. The e-paper is a Web environment that
allows subscribers to access the daily editions of the newspaper in digital format. The
environment is restricted to paying subscribers, who are required to log in with their
personal username and password. Once they are logged in, the website presents the reader with the current daily newspaper editions. The online newspaper has the same design as the printed version. Every morning, except on Sundays (there are no editions printed on Sundays), a new daily edition is available on the website.
2 Enrichment
In order to effectively archive, categorize and publish news articles, most large media companies have documentation departments that assign labels, categories or terms to news articles. Due to the increasingly large number of items and the need for the term assignment to be quick [3], automatic semantic annotation is increasingly considered as an alternative to human annotation. Several established off-the-shelf tools for knowledge extraction and semantic annotation are readily available, including DBpedia Spotlight [4], AIDA [7], Open Calais, Wikimeta and Zemanta [2]. Wikipedia Miner [6]
directly makes use of the evolving Wikipedia structure; the toolkit includes search and
annotation services and provides links to relevant Wikipedia articles. In a comparison
of entity-annotation systems [1], Wikipedia Miner consistently scored high in terms of
recall and F1.
We semantically enriched the news articles by identifying entities and types. For this purpose, we use the Wikipedia Miner [5] service as an annotation tool. First, detected
words are disambiguated using machine learning algorithms that take the context of the
word into account. This step is followed by the detection of links to Wikipedia articles
(which will later be aligned with DBpedia entities). By using a predefined threshold,
we ensured that only those words that are relevant for the whole document are linked
to articles. The goal of the whole process is to annotate a given news article in the
same way as a human would link a Wikipedia article. We set up a local deployment
of Wikipedia Miner and trained the models on top of a German Wikipedia dump from
February, 20143 .
After annotating the content of the news articles, with the identified entities in hand,
we query DBpedia in order to gather further information regarding the entities. Specifi-
cally, we explore their relationships through the predicate rdf:type to extract the type of
the entity given by DBpedia's ontology (dbpedia-owl). Although several different types are identified, we selected the three most relevant types for news articles: dbpedia-owl:Place, dbpedia-owl:Person and dbpedia-owl:Organisation. These three types were reported by Madsack's editorial staff to be the most relevant for their readers. Additionally, as we describe in Section 3, it is important to avoid overloading the readers with features and information. Thus, we aim at having just a few, very useful facets that can improve relevant news retrieval.
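As an illustration of the DBpedia type lookup described above, the following sketch assumes Apache Jena as the SPARQL client (the paper does not state which library is actually used) and queries the public DBpedia endpoint for the rdf:type values of one annotated entity, keeping only the three types of interest; the example entity is hypothetical, and the dbo: prefix abbreviates the same namespace as dbpedia-owl.

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class DBpediaTypeLookupSketch {
    public static void main(String[] args) {
        String entity = "http://dbpedia.org/resource/Hannover";   // hypothetical example entity
        String query =
            "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> " +
            "PREFIX dbo: <http://dbpedia.org/ontology/> " +
            "SELECT DISTINCT ?type WHERE { <" + entity + "> rdf:type ?type . " +
            "  FILTER(?type IN (dbo:Place, dbo:Person, dbo:Organisation)) }";
        QueryExecution qe =
            QueryExecutionFactory.sparqlService("http://dbpedia.org/sparql", query);
        try {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                // One of dbo:Place, dbo:Person or dbo:Organisation, if the entity has such a type.
                System.out.println(row.getResource("type").getURI());
            }
        } finally {
            qe.close();
        }
    }
}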
3
http://dumps.wikimedia.org/dewiki/20140216/
366
Fig. 1. Madsack’s e-paper prototype interface.
3 User Interface
From the end users' (the readers') perspective, the main innovation of the e-paper is the so-called ‘Themenradar’ (Theme Radar). The Theme Radar provides users with shortcuts to news articles that pertain to their particular interests - as explicitly indicated by subscribing to an entity (which represents a theme or topic), combined with the entities that most often occur in the articles that the user has read so far. Augmenting the
‘Theme Radar’ with the assistance of semantically enriched data is, in fact, one of our
main goals in this collaboration.
Figure 1 depicts the first prototype of the ‘Theme Radar’. It consists of a dashboard
of the readers’ interests. The ‘Theme Radar’ works as a semantically enhanced topic
dashboard that enables readers to get suggestions for themes and topics, to manage their
367
topics and to get personalized news recommendations based on entity co-occurrences,
linked data relations and the aforementioned semantic properties types.
Based on the users’ activity logs, the system automatically builds the ‘Theme Radar’.
Top entities of interest are presented to the users in their ‘Theme Radar’ with additional
suggested entities and recommended articles (based on entity co-occurrence). In the in-
terface, these entities are grouped by type and also by ‘Book’ (Books are sections within
the newspaper, as predefined by the editors - such as ‘Sports’ and ‘Politics’). Addition-
ally, the users can manually add entities to their profiles, which get higher weights in
the news recommendation process.
4 Conclusions
In this paper, we presented our work towards a semantically enriched online newspaper,
which - to the best of our knowledge - is the first of its kind in a fully commercial setup.
We are currently at the stage of interface design, which, in a commercial product, requires validation and approval from several stakeholders. As future work, we plan to evaluate the quality of the annotations with user feedback and to perform an analysis of online reading behavior, with a focus on the semantic aspects. Building upon these steps, we plan to develop an entity-based news recommender that fulfills Madsack's online readers' interests, and to evaluate it in practice.
5 Acknowledgement
We would like to thank the Madsack Online GmbH & Co. KG team for the collaboration
opportunity and the support during the implementation presented in this work.
References
1. M. Cornolti, P. Ferragina, and M. Ciaramita. A framework for benchmarking entity-annotation
systems. In Proceedings of the 22nd international conference on World Wide Web, pages 249–
260. International World Wide Web Conferences Steering Committee, 2013.
2. A. Gangemi. A comparison of knowledge extraction tools for the semantic web. In The
Semantic Web: Semantics and Big Data, pages 351–366. Springer, 2013.
3. A. L. Garrido, O. Gómez, S. Ilarri, and E. Mena. An experience developing a semantic anno-
tation system in a media group. In Natural Language Processing and Information Systems,
pages 333–338. Springer, 2012.
4. P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: shedding light
on the web of documents. In Proceedings of the 7th International Conference on Semantic
Systems, pages 1–8. ACM, 2011.
5. D. Milne and I. H. Witten. Learning to link with wikipedia. In CIKM ’08: Proceeding of
the 17th ACM conference on Information and knowledge management, pages 509–518, New
York, NY, USA, 2008. ACM.
6. D. Milne and I. H. Witten. An open-source toolkit for mining wikipedia. Artificial Intelli-
gence, 194:222–239, 2013.
7. M. A. Yosef, J. Hoffart, I. Bordino, M. Spaniol, and G. Weikum. AIDA: An online tool for
accurate disambiguation of named entities in text and tables. Proceedings of the VLDB En-
dowment, 4(12):1450–1453, 2011.
368
Identifying Topic-Related Hyperlinks on Twitter
Patrick Siehndel, Ricardo Kawase, Eelco Herder and Thomas Risse
L3S Research Center, Leibniz University Hannover, Germany
{siehndel, kawase, herder, risse}@L3S.de
Abstract. The microblogging service Twitter has become one of the most popu-
lar sources of real time information. Every second, hundreds of URLs are posted
on Twitter. Due to the maximum tweet length of 140 characters, these URLs are
in most cases a shortened version of the original URLs. In contrast to the origi-
nal URLS, which usually provide some hints on the destination Web site and the
specific page, shortened links do not tell the users what to expect behind them.
These links might contain relevant information or news regarding a certain topic
of interest, but they might just as well be completely irrelevant, or even lead to a
malicious or harmful website. In this paper, we present our work towards iden-
tifying credible Twitter users for given topics. We achieve this by characterizing
the content of the posted URLs and further relating it to the expertise of Twitter users.
1 Introduction
The microblogging service Twitter has become one of the most popular and most
dynamic social networks available on the Web, reaching almost 300 million active
users [1]. Due to its popularity and dynamics, Twitter has been the topic of various areas of research. Twitter clearly trades content size for dynamics, which results in one
major challenge for researchers - tweets are too short to put them into context without
relating them to other information. Nevertheless, these short messages can be combined
to build a larger picture of a given user (user profiling) or a given topic. Additionally,
tweets may contain hyperlinks to external additional Web pages. In this case, these
linked Web pages can be used for enriching tweets with plenty of information.
An increasing number of users post URLs on a regular basis, and there are more
than 500 million Tweets every day1 . With such a high volume, it is unlikely that all
posted URLs link to relevant sources. Thus, measuring the quality of a link posted on
Twitter is an open question [3].
In many cases, a lot can be deduced just from the URL of a given Web page. For
example, if the URL domain is from a news provider, a video hosting website or a so-
cial network, the user already knows more or less what to expect after clicking on it.
However, regular URLs are, in many cases, too long to fit in a single tweet. Conse-
quently, Twitter automatically reduces the link length using shortening services. This
leads to the problem that the user’s educated guess of what is coming next is completely
gone. In this work, we focus on ameliorating these problems by identifying those tweets
containing URLs that might be relevant for the rest of the community.
The reasonable assumption behind our method is that users who usually talk about
a certain topic (‘experts’) will post interesting links about the same topic. The strong
1
https://blog.twitter.com/2013/new-tweets-per-second-record-and-how
369
point of our method is that it is independent of the users' social graph. There is no need to verify the user's network or the retweet behavior. Thus, it can be calculated on the fly. To achieve our final goal, we divide our work into two main steps: the generation of
user profiles [5] and the generation of URL profiles. In this paper, we focus on the latter
step.
2 Methodology
In our scenario, we build profiles for Twitter users based on the content of their tweets.
Besides the profiles for users we also generate profiles for the URLs posted by the users.
One of the biggest challenges in this task is to find appropriate algorithms and metrics
for building comparable profiles for users and websites. The method we developed to
solve this task is based on the vast amount of information provided by Wikipedia. We
use the articles and the related category information supplied by Wikipedia to define
the topic and the expertise level inherent in certain terms. Our method consists of three
main steps to create profiles for users and websites.
Extraction: In this step, we annotate the given content (all tweets of a user, or
the contents of a Web page) using the Wikipedia Miner Toolkit [4]. The tool provides
us with links to Wikipedia articles. The links discovered by Wikipedia Miner have a similar style to the links that can be found inside a Wikipedia article. Not all words that have a related article in Wikipedia are used as links; only words that are relevant for the whole text are linked. Whether a detected article is relevant for the whole text is decided based on different features, like its relatedness to other detected articles or the generality of the article.
Categorization: In the second stage, Categorization, we extract the categories of
each entity that has been mentioned in the users’ tweets or in the posted URL. For
each category, we follow the path through all parent categories, up to the root category. In most cases, this procedure results in the assignment of several top-level categories to an entity. Since the graph structure of Wikipedia also contains links to less relevant categories, we only follow links to parent categories whose distance to the root is shorter than or equal to that of the child category. For each category, a weight is calculated by first defining a value for the detected entity. This value is based on the distance of the entity to the root node. Following the parent categories, we divide the weight of each node by the number of sibling categories. This step ensures that categories do not get higher values just because of a broader structure inside the graph. Based on this calculation, we give higher scores to categories that are deeper inside the category graph and more focused on one topic.
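The following sketch illustrates one possible implementation of this weight propagation; it is a simplification rather than the actual code: the category graph is assumed to be given as parent links with precomputed distances to the root, the initial entity weight is simply taken to be its depth, and parents are followed only when strictly closer to the root so that the recursion is guaranteed to terminate.

import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CategoryWeightingSketch {
    private final Map<String, List<String>> parents;   // parent links in the category graph
    private final Map<String, Integer> depth;          // precomputed distance to the root category
    private final Map<String, Integer> siblingCount;   // number of sibling categories of each node

    CategoryWeightingSketch(Map<String, List<String>> parents, Map<String, Integer> depth,
                            Map<String, Integer> siblingCount) {
        this.parents = parents;
        this.depth = depth;
        this.siblingCount = siblingCount;
    }

    // Propagates the weight of one detected entity's category up to the root and
    // returns the per-category scores contributed by that entity.
    Map<String, Double> propagate(String category) {
        Map<String, Double> scores = new HashMap<>();
        double initialWeight = depth.getOrDefault(category, 1); // deeper categories start higher
        propagate(category, initialWeight, scores);
        return scores;
    }

    private void propagate(String category, double weight, Map<String, Double> scores) {
        scores.merge(category, weight, Double::sum);
        for (String parent : parents.getOrDefault(category, Collections.emptyList())) {
            // The paper follows parents whose distance to the root is not larger than the
            // child's; requiring a strictly smaller distance here guarantees termination.
            if (depth.getOrDefault(parent, 0) < depth.getOrDefault(category, 0)) {
                // Dividing by the number of siblings keeps broad structures from inflating scores.
                double divided = weight / Math.max(1, siblingCount.getOrDefault(category, 1));
                propagate(parent, divided, scores);
            }
        }
    }
}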
Aggregation: In the final stage, Aggregation, we perform a linear aggregation over
all of the scores for a document, in order to generate the final profile for the user (or
for the website). The generated profile displays the topics a user/website talks about, as
well as the expertise in - or focus on - a certain topic.
3 Validation
As mentioned in Section 1, in this paper we focus our attention on the generation of
URL profiles and the relation to the corresponding tweets and users. Thus, in order
to validate our methodology, we crawled Twitter with a number of predefined queries
(keywords) and collected all resulting tweets that additionally contain URLs. We have
370
Table 1. Statistics about the used dataset.
                    Items        Annotations   Annotations per Item
Topic Tweets        83,300       88,530        1.06
Linked Websites     40,940       457,164       11.1
All Tweets          11,303,580   30,059,981    3.127
Fig. 1. Coverage of Wikipedia Categories based on the URL Content for each selected topic.
previously validated our approach by characterizing and connecting heterogeneous re-
sources based on the aggregated topics [2]. Here, the goal is to qualitatively validate
if the topic assignment given by our method in fact represents the real topics that are
expected to be covered in a given query.
3.1 Dataset
The used dataset consists of around 83,300 tweets related to seven different topics. The idea behind this approach is to collect a series of tweets that contain links and certain keywords relevant for one particular topic. Within these tweets, we found 40,940 different URLs. For each of these URLs, we tried to download and extract the textual content, which resulted in 26,475 different websites. Additionally, we downloaded the last 200 posts of each user. The statistics of the dataset are shown in Table 1.
371
Table 2. Correlations between created profiles
                           URL Content /   URL Content /   Single Tweet /
                           Single Tweet    User Tweets     User Tweets
Edward Snowden             0.995           0.968           0.961
Higgs Boson                0.812           0.628           0.496
iPhone 5                   0.961           0.698           0.664
Israeli Palestinian Talks  0.984           0.884           0.867
Nexus 5                    0.968           0.972           0.956
Obamacare                  0.983           0.79            0.752
World Music Awards         0.921           0.718           0.614
All topics average         0.946           0.808           0.759
3.2 Topic Comparison
Figure 1 shows the generated profiles for two of the chosen example topics. The shown profiles are averaged over all users and are based on the content of the crawled web pages, on the tweets containing the URLs and on the complete user profile (the last 200 tweets, due to API restrictions). We can see that for the very specific topic ‘Israeli Palestinian Talks’ the generated profiles are very similar. For the topic ‘iPhone 5’ the profiles are less similar; since this topic or keyword is less specific, it becomes much harder for a user to find the content he is looking for. A tweet like ‘The new iPhone is really cool’ together with a link may be related to many different aspects of the product. Table 2 displays the correlation between the different profiles for the chosen exemplifying topics. While users who write about topics like ‘Snowden’ or ‘Nexus phones’ seem to write about related topics in most of their tweets, this is not true for more general topics.
4 Conclusion
In this paper, we presented a work towards the identification of credible topic-related
hyperlinks in social networks. Our basic assumption is that users who usually talk about
a certain topic (‘experts’) will post interesting (and safe) links about the same topic.
The final goal of our work requires analyzing the quality of the posted URLs. Here, we presented our profiling method with preliminary results for the URL profiles. As future work, we plan to analyze the quality of profiles and URLs in order to provide a
confidence and quality score for URLs.
5 Acknowledgment
This work has been partially supported by the European Commission under ARCOMEM (ICT 270239) and QualiMaster (ICT 619525).
References
1. Twitter now the fastest growing social platform in the world.
http://globalwebindex.net/thinking/twitter-now-the-fastest-growing-social-platform-in-
the-world/, Jan. 2013.
2. R. Kawase, P. Siehndel, B. P. Nunes, E. Herder, and W. Nejdl. Exploiting the wisdom of the
crowds for characterizing and connecting heterogeneous resources. In HT, 2014.
3. S. Lee and J. Kim. Warningbird: Detecting suspicious urls in twitter stream. In NDSS, 2012.
4. D. Milne and I. H. Witten. An open-source toolkit for mining wikipedia. Artificial Intelli-
gence, 194:222–239, 2013.
5. P. Siehndel and R. Kawase. Twikime! - user profiles that make sense. In International Seman-
tic Web Conference (Posters & Demos), 2012.
372
Capturing and Linking Human Sensor
Observations with YouSense
Tomi Kauppinen1,2 and Evgenia Litvinova1 and Jan Kallenbach1
1
Department of Media Technology
Aalto University School of Science, Finland
evgenia.litvinova@aalto.fi
jan.kallenbach@aalto.fi
2
Cognitive Systems Group,
University of Bremen, Germany
tomi.kauppinen@aalto.fi
Abstract. Semantic technologies are prominent for gathering human sensor observations. Linked Data supports sharing and accessing not just data but also the vocabularies describing the data. Human sensor observations are often a combination of natural language and categorizable entries, thus calling for semantic treatment. Space and time serve as natural integrators of data in addition to concepts. In this paper we demonstrate the YouSense tool, which supports gathering experiences about spaces (like generic buildings or office spaces). Our contribution also includes a vocabulary for describing the experiences as RDF and tools for visualizing and making sense of the gathered user experiences.
1 Introduction
Understanding how people experience their surroundings (like office spaces or conference venues) helps to further develop spaces and to modify them to meet user needs. This Human Sensor Web approach is rather different from monitoring just technical parameters such as the indoor temperature, humidity, or CO2 concentration, which are typically used to assess the performance of buildings.
The promise of Linked Data and semantic technologies in this setting is that they can offer ways to share and access these human sensor observations in radically novel ways. The crucial tasks are to figure out how to describe the experiences of people in a processable way, and also how to gather the observations. Finally, Information Visualization is a useful step in understanding whether the gathered data has a story to tell [1,2].
In this paper we present and demonstrate YouSense, a mobile and web application for supporting the sensing of spaces by people. We also present the EXPERIENCE vocabulary, a Linked Data compatible collection of terms to enable the creation of RDF about the gathered experiences. The resulting history data provides evidence about comfortable and problematic spaces for both users of buildings and building managers. Our contribution also includes a set of visualization tools to make sense of the human sensor observations and their thematic, spatial and temporal characteristics.
373
The paper is structured as follows. Section 2 presents the tool for gathering
the user experiences and discusses the vocabulary for encoding them. Section 3
demonstrates the use of the gathered data and outlines questions one can ask
with the system. Section 4 provides a future research agenda and concluding
remarks.
2 Gathering Observations for Semantic Descriptions
YouSense3 is a web application that can be used from both mobile and desktop browsers. In YouSense, a user composes a sentence to describe how he/she feels in the space (see Figure 1 for an example). The structure of the sentence is defined with the help of the EXPERIENCE4 Vocabulary. EXPERIENCE is a lightweight vocabulary providing terms for creating descriptions of experiences (as instances of Experience) about the environment, for example how cold or warm one feels in a particular context created by the location, time and activities an Experiencer is involved with. It thus supports describing user experiences and feelings.
Fig. 1. YouSense in action.
The structure of EXPERIENCE was designed to include things people gen-
erally use when they describe situations in spaces. We conducted a diary study, where people were asked to report on their experiences of spaces in a free form. By analyzing the results of the user study, we designed the set of terms to
be included in EXPERIENCE. For instance, while we supposed people would
3
Demonstration also available online at http://yousense.aalto.fi
4
http://linkedearth.org/experience/ns/
374
like to relate the experiences to spaces and times, it was rather surprising that they also wanted to guess the reason for the experience.
Below is a simple example use of the EXPERIENCE Vocabulary, in line with Figure 1 but completed. In this example there is a person (Laura) who has the experiences (VeryCold) and (FreshAir) in her office (Room3156) in June 2014 while performing a certain activity (Sitting). The observer has also communicated the (possible) reason (“Window is open”) for the experience (VeryCold). We also gather the action that the experiencer plans to do next (Work).
@prefix experience: <http://linkedearth.org/experience/ns/> .
example:exampleExperience_34
a experience:Experience ;
experience:hasExperiencer feelingdata:Laura ;
experience:hasEventExperience feelingdata:VeryCold, feelingdata:FreshAir ;
experience:hasLocation feelingdata:Room3156 ;
experience:hasTime "2014-06-20T10:00+02:00" ;
experience:hasActivity dbpedia:Sitting ;
experience:hasReason "Window is open" ;
experience:hasFollowingAction dbpedia:Work .
3 Experimenting with YouSense in Concrete Use Cases
The sensemaking part of YouSense creates diagrams for each of the message parts (reasons, locations, times, people, spaces, activities) with reasoning support (partonomy, concept hierarchy, temporal abstraction). For instance, the adjectives people reported for certain experiences about spaces (such as cold or warm) support the understanding of those spaces.
Figure 2 depicts the approach by presenting a set of experiences about spaces as a bubble visualization. The experiences are aggregated into positive (green) and negative (red) ones. The floor plan visualization on the right shows a heat map of these aggregated experiences by room. Zooming in on a room allows users to see more detailed information about the collected experiences and the spatial configuration of the room. The idea is to retrieve patterns like “rooms facing the sea generally have more positive experiences than ones facing the inner yard”, thus helping to reveal the causes of the experiences.
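The kind of aggregation behind such a heat map can be sketched as a SPARQL query over the triple store of gathered experiences; in the following Jena-based sketch, the client library, the endpoint URL and the feelingdata namespace are assumptions (only the experience: terms appear in the example of Section 2).

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class ExperienceAggregationSketch {
    public static void main(String[] args) {
        String query =
            "PREFIX experience: <http://linkedearth.org/experience/ns/> " +
            "PREFIX feelingdata: <http://example.org/feelingdata/> " +   // hypothetical data namespace
            "SELECT ?room (COUNT(?exp) AS ?n) WHERE { " +
            "  ?exp a experience:Experience ; " +
            "       experience:hasLocation ?room ; " +
            "       experience:hasEventExperience feelingdata:VeryCold . " +
            "} GROUP BY ?room ORDER BY DESC(?n)";
        // Hypothetical SPARQL endpoint of the YouSense triple store.
        QueryExecution qe =
            QueryExecutionFactory.sparqlService("http://yousense.aalto.fi/sparql", query);
        try {
            ResultSet rs = qe.execSelect();
            while (rs.hasNext()) {
                QuerySolution row = rs.next();
                System.out.println(row.get("room") + " : " + row.getLiteral("n").getInt());
            }
        } finally {
            qe.close();
        }
    }
}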
We have experimented with YouSense in selected spaces at Aalto University to evaluate its usefulness. These spaces include the Design Factory5 , Media Factory6 and the spaces of the Department of Media Technology and the Department of Automation and Systems Technology. According to the experiments and discussions with building managers, the following types of questions arose as the ones that need to be answered by making sense of the gathered data.
– what is the air quality of this space?
– do people feel comfortable in this space?
– do people stay in this space to work and study?
5
http://www.aaltodesignfactory.fi
6
http://mediafactory.aalto.fi
375
Fig. 2. Example of visualizing user experiences about spaces
– or do they prefer to work and study somewhere else? where?
– do people need some additional services or activities in this space?
– what kind of things (furniture, games, ..) would people like to have in spaces, like
in the lobby area?
– do people feel comfortable in this space and why it is so?
4 Conclusions
We argued that the gathering of user experiences and other human sensor observations is a good use case for Linked Data and semantic technologies. As we demonstrated, the EXPERIENCE vocabulary supports describing user experiences about spaces and linking them to reusable terms from DBpedia. The YouSense app enables gathering the experiences via a mobile/web compliant interface and storing them in a queryable triple store. We also illustrated the use of YouSense for supporting the understanding of spaces with visualizations.
As we showed, visualizations support getting an overview of the gathered data. They also raised a set of questions, partially already answered by the gathered data. There were also interesting new questions which call for answers in the research agenda for the coming months. We are particularly interested in studying what recurring, interesting patterns we can find in observation feeds.
References
1. D.A. Keim. Information visualization and visual data mining. Visualization and
Computer Graphics, IEEE Transactions on, 8(1):1–8, Jan/Mar 2002.
2. Edward Segel and Jeffrey Heer. Narrative visualization: Telling stories with data. Vi-
sualization and Computer Graphics, IEEE Transactions on, 16(6):1139–1148, 2010.
376
An Update Strategy for the WaterFowl RDF
Data Store
Olivier Curé1 , Guillaume Blin2
1
Université Paris-Est, LIGM - UMR CNRS 8049, France
ocure@univ-mlv.fr
2
Université de Bordeaux, LaBRI - UMR CNRS 5800, France
guillaume.blin@labri.fr
Abstract. The WaterFowl RDF Store is characterized by its high com-
pression rate and a self-indexing approach. Both of these characteristics
are due to its underlying architecture. Intuitively, it is based on a stack
composed of two forms of Succinct Data Structures, namely bitmaps and
wavelet trees. Information is efficiently retrieved from these structures via a set of operations, i.e., rank, select and access, which are used by our query processor. The nice properties, e.g., compactness and efficient data retrieval, that we have observed in our first experiments come at the price of poor performance when insertions or deletions are required. For instance, a naive approach has a dramatic impact on the capacity to handle ABox updates. In this paper, we address this issue by proposing an update strategy which uses a hybrid wavelet tree (using both pointer-based and pointerless sub-wavelet trees).
1 Introduction
Large amounts of RDF data are being produced in diverse domains. Such a data deluge is generally addressed by distributing the workload over a cluster of commodity machines. We believe that this will soon not be enough and that, in order to respond to the exponential production of data, the next generation of systems will distribute highly compressed data. One valuable property of such systems would be to perform some data-oriented operations without requiring a decompression phase.
We have recently proposed the first building blocks of such a system, namely
WaterFowl, for RDF data [1]. The current version corresponds to an in-memory, self-indexed approach operating at the bit level, which uses data structures with a compression rate close to the theoretical optimum. These so-called Succinct Data
Structures (SDS) support efficient decompression-free query operations on the
compressed data. The first components we have developed for this architec-
ture are a query processor and an inference engine which supports the RDFS
entailment regime with inference materialization limited to rdfs:range and
rdfs:domain. Both of these components take advantage of the SDS properties
and highly performant operations. Of course, SDS are not perfect and a main
limitation corresponds to their inability to efficiently handle update operations,
377
i.e., inserting or deleting a bit. In the worst case, one has to completely rebuild
the corresponding SDS (bitmap or wavelet tree) to address such updates. Even
if we consider that the sweet spot for RDF stores is OnLine Analytic Processing
(OLAP), rather than OnLine Transactional Processing (OLTP), such a draw-
back is not acceptable when managing data sets of several millions of triples.
The main contribution of this paper is to present an update strategy that addresses instance updates. Due to space limitations, we do not consider updates at the schema level (i.e., the TBox). This approach is based on a set of heuristics and the definition of a hybrid wavelet tree using both pointer-based and pointerless sub-wavelet trees.
2 WaterFowl architecture
Figure 1 presents the three main components of the WaterFowl system: dictionary, query processing and triple storage. The first is responsible for the encoding and decoding of the entries of triple data sets. The main contribution here consists of encoding the concepts and properties of an ontology such that their respective hierarchies use common binary prefixes. This enables us to perform prefix versions of the rank, select and access operations and thus avoids navigating the whole depth of the wavelet trees to return an answer. The query processing component handles the classical operations of a database query processor and communicates intensively with the dictionary unit. A peculiarity of this query processor is to translate SPARQL queries into sequences of rank, select and access SDS operations over sequences. These are designed to (i) count the number of occurrences of a letter appearing in a prefix of a given length, (ii) find the position of the kth occurrence of a letter and (iii) retrieve the letter at a given position in the sequence.
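For readers unfamiliar with these primitives, the following naive sketch shows their semantics on a plain character sequence; WaterFowl, of course, implements them on compressed succinct structures with o(n) auxiliary indexes rather than by linear scans.

public class RankSelectAccessSketch {
    private final char[] seq;

    RankSelectAccessSketch(String s) { this.seq = s.toCharArray(); }

    // rank(c, i): number of occurrences of c in the prefix seq[0..i-1]
    int rank(char c, int i) {
        int count = 0;
        for (int k = 0; k < i; k++) if (seq[k] == c) count++;
        return count;
    }

    // select(c, k): position of the k-th occurrence of c (1-based), or -1 if absent
    int select(char c, int k) {
        for (int pos = 0; pos < seq.length; pos++) {
            if (seq[pos] == c && --k == 0) return pos;
        }
        return -1;
    }

    // access(i): the letter stored at position i
    char access(int i) { return seq[i]; }

    public static void main(String[] args) {
        RankSelectAccessSketch s = new RankSelectAccessSketch("abracadabra");
        System.out.println(s.rank('a', 5));    // 2: 'a' occurs twice in the prefix "abrac"
        System.out.println(s.select('b', 2));  // 8: the second 'b' is at index 8
        System.out.println(s.access(4));       // 'c'
    }
}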
Finally, the triple storage component consists of two layers of bitmaps and
wavelet trees3 . It uses an abstraction of a set of triples represented as a forest
where each tree corresponds to a single subject, i.e., its root, the intermediate
nodes are the properties of the root subject and the leaves are the objects of those
subject-property pairs. The first layer encodes the relation between the subjects
and the predicates of a set of triples. It is composed of a bitmap, to encode
the subjects, and a wavelet tree, to encode the sequences of predicates of those
subjects. Unlike the first layer, the second one has two bitmaps and two wavelet
trees. Bo encodes the relation between the predicates and the objects, that is, the edges between the leaves and their parents in the tree representation, whereas the bitmap Bc encodes the positions of ontology concepts in the sequence of objects. Finally, the sequence of objects obtained from a pre-order traversal of the forest is split into two disjoint subsequences: one for the concepts and one for the rest. Each of these sequences is encoded in a wavelet tree (WToc and WToi, respectively). This architecture reduces the sparsity of identifiers and enables the management of very large datasets and ontologies while allowing time and
3 Due to space limitations, we refer the reader to [1] for the corresponding definitions.
More details on these components are provided in [1]. That paper presents evaluations in which different wavelet tree implementations were used: with and without pointers, and a so-called matrix variant [3]. The characteristics of the first two motivated our update approach.
Fig. 1. WaterFowl’s architecture
3 Update strategy
As previously mentioned, one of the benefits of using wavelet trees in our system is the ability to drastically compress the data while being able to query it without decompression. The main drawback of such SDS lies in the requirement to precompute and store some small but necessary extra information. More formally, the bitmaps used in the inner construction of the wavelet trees require n + o(n) bits of storage space (the original bit array and an o(n) auxiliary structure) to support rank and select in constant time. Note that while the implementation of rank is simple and practical, this is not the case for select, which should be avoided whenever possible, e.g., in our SPARQL query translations. Recently, some attempts have been made to provide faster and smaller structures for select [4]. This precomputed auxiliary information is used during query processing in order to obtain constant time complexity. This is why, by definition, the bitmaps are static and cannot be updated. In our context, this implies an immutable RDF store (which is quite restrictive).
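The reason the auxiliary information makes the bitmaps static can be sketched as follows (a simplified Python illustration with an arbitrary block size, not the actual SDS layout): per-block counters allow rank to inspect at most one block, but inserting or deleting a single bit invalidates every counter after the affected block.

BLOCK = 64  # illustrative block size

def build_rank_index(bits):
    # cumulative number of 1-bits before each block boundary (the o(n) part)
    index, count = [], 0
    for start in range(0, len(bits), BLOCK):
        index.append(count)
        count += sum(bits[start:start + BLOCK])
    return index

def rank1(bits, index, i):
    # number of 1-bits in bits[:i], touching only one block of the bit array
    block = min(i // BLOCK, len(index) - 1)
    return index[block] + sum(bits[block * BLOCK:i])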
In order to overcome this issue, we propose an update strategy using the
inner tree structure of the wavelet trees. First, recall that there are mainly two
implementations of wavelet trees: with and without pointers. On the one hand, the implementation without pointers uses less memory space and provides better time performance. On the other hand, any modification to the represented sequence implies a full reconstruction of the wavelet tree, while, in the case where pointers are used, only a subset of the nodes (and the corresponding bitmaps) has to be rebuilt, which in our first experiments is much faster than a total reconstruction. A second important point of our approach is based on the fact that in wavelet trees, each bit of an encoded entry specifies a path in the tree. That is, the instance data only influence the size of the nodes but not their placement in the tree. Considering these two observations together with the huge amount of data we want to handle, our strategy adopts a hybrid approach based on the natural definition of a tree. Indeed, a tree can be defined recursively as a node with a sequence of children which are themselves trees. Our hybrid wavelet tree is then defined as a node representing a wavelet tree without pointers of height k and a sequence of 2^k children which are themselves hybrid wavelet trees. In practice, considering a querying scenario composed of read (e.g., select) and write (e.g., add, delete, update) operations, for performance purposes the hybrid wavelet tree can adapt its composition to the scenario by minimizing the number of pointers traversed by read operations and maximizing it for write operations. Note that this approach of cracking the database into manageable pieces is reminiscent of the dynamic indexing solution presented in [2].
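The following Python sketch illustrates the recursive shape of the hybrid structure under simplifying assumptions (symbols are fixed-width integers, and the pointerless sub-wavelet tree of height k is stood in for by a plain list of k-bit prefixes); it is not the actual bitmap-based implementation, but it shows that an insertion only touches one block and one child per level:

class HybridNode:
    def __init__(self, symbols, bits, k):
        self.bits, self.k = bits, max(1, min(k, bits))
        self.block = []        # stands in for a pointerless WT of height k
        self.children = {}     # up to 2**k children, built lazily
        for i, s in enumerate(symbols):
            self.insert(i, s)

    def insert(self, i, symbol):
        shift = self.bits - self.k
        prefix = symbol >> shift
        local = self.block[:i].count(prefix)   # position inside the child
        self.block.insert(i, prefix)           # only this block...
        if shift > 0:                          # ...and one child are modified
            child = self.children.setdefault(
                prefix, HybridNode([], shift, self.k))
            child.insert(local, symbol & ((1 << shift) - 1))

    def access(self, i):
        shift = self.bits - self.k
        prefix = self.block[i]
        if shift == 0:
            return prefix
        local = self.block[:i].count(prefix)
        return (prefix << shift) | self.children[prefix].access(local)

# e.g., HybridNode([5, 3, 5], bits=3, k=2).access(1) == 3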
4 Conclusion
This poster exploits available wavelet tree implementations to address the is-
sue of updating an ABox. We have already implemented a prototype of the
WaterFowl system and of the updating system. So far they provide interesting performance results, but we have yet to test them on real use cases. This will enable us to observe practical modifications and to study their efficiency. These observations should provide directions for optimizations. Our future work will consist in the development of two new components. The first one will address updates at the schema level, i.e., insertions or removals of concepts and properties of the underlying ontology. The second one will consider the partitioning of a data set over a machine cluster together with both ABox and TBox updates.
References
1. Olivier Curé, Guillaume Blin, Dominique Revuz, and David Célestin Faye. WaterFowl: A compact, self-indexed and inference-enabled immutable RDF store. In ESWC, pages 302–316, 2014.
2. Stratos Idreos, Martin L. Kersten, and Stefan Manegold. Database cracking. In
CIDR, pages 68–78, 2007.
3. Gonzalo Navarro. Wavelet trees for all. J. Discrete Algorithms, 25:2–20, 2014.
4. Gonzalo Navarro and Eliana Providel. Fast, small, simple rank/select on bitmaps.
In Ralf Klasing, editor, Experimental Algorithms, volume 7276 of Lecture Notes in
Computer Science, pages 295–306. Springer Berlin Heidelberg, 2012.
Linking Historical Data on the Web*
Valeria Fionda1 , Giovanni Grasso2
1 Department of Mathematics, University of Calabria, Italy
fionda@mat.unical.it
2 Department of Computer Science, Oxford University, UK
giovanni.grasso@cs.ox.ac.uk
Abstract. Linked Data available on the Web today mostly represent snapshots at particular points in time. The temporal aspect of data is mostly taken into account only by adding and removing triples to keep datasets up-to-date, thus neglecting the importance of keeping track of the evolution of data over time. To overcome this limitation, we introduce the LinkHisData framework to automate the creation and publication of linked historical data extracted from the Deep Web.
1 Introduction
Most of the linked datasets available on the Web today offer a static view over the data they provide, ignoring their evolution over time. Indeed, the temporal aspect is only considered by adding and removing information to keep datasets up-to-date [4]. However, while some information is intrinsically static because it is universally valid (e.g., the author of the Harry Potter books is J. K. Rowling), a large portion of data is only valid from a particular point in time onwards (e.g., the first Harry Potter novel is available in a Braille edition from May 1998) or is valid for a certain period (e.g., the price of the children's paperback edition of the first Harry Potter book was £3.85 between the 10th and the 15th May 2014).
The possibility of including the temporal validity of RDF data opens the door to the creation of new datasets in several domains by keeping track of the evolution of information over time (e.g., book prices, currency exchange rates) and publishing it as linked historical data. In turn, the availability of historical datasets would stimulate and enable a large number of applications (e.g., data analytics) and research areas [1]. Unfortunately, a recent work [6] showed that the amount of linked data with temporal information available on the Web is very small and, to the best of our knowledge, there are no proposals aimed at publishing historical datasets on the Web of Linked Data. In this paper we propose LinkHisData, a configurable framework to automate the creation and publication of linked historical data extracted from the Deep Web. In particular, the focus is on extracting and making persistent transient information (e.g., the price of a book on a certain day) before it becomes no longer accessible.
* V. Fionda was supported by the European Commission, the European Social Fund and the Calabria region. G. Grasso was supported by the European Commission's Seventh Framework Programme (FP7/2007–2013) from ERC grant agreement DIADEM, no. 246858.
Fig. 1: An overview of the components of the LinkHisData framework: the LOD Interrogator, the Data Extractor, the Integrator, and the Linked Historical Dataset.
2 The LinkHisData framework
LinkHisData (Linked Historical Data) is a configurable framework that builds
on well-established semantic technologies and tools (e.g., SPARQL, RDF), as
well as languages for Deep Web data extraction that have already been success-
fully employed by the LOD community (i.e., OXPath [3, 5]).
Fig. 1 shows the LinkHisData architecture. Its main components are: the
LOD Interrogator, the Data Extractor, the Integrator, and the Linked Historical
Dataset (LHD). The LOD Interrogator uses the SPARQL query and the endpoint address provided as input by the user to retrieve entities' URIs and related information from the Web of Linked Data. These data feed the Data Extractor, which runs the OXPath wrapper, again provided as input by the user, to extract transient information from the Deep Web directly into RDF. OXPath is a modern wrapping language able to execute actions (e.g., click, form filling) and schedule periodic extraction tasks. The query and the wrapper may share variable names so that the wrapper is instantiated with the actual values provided by the LOD Interrogator. The RDF data generated by the Data Extractor populate the LHD and are published on the Web. A single execution of the extraction and publication process produces RDF data with temporal information that represent a snapshot at the time of extraction. To produce historical data, the whole process is repeated at the frequency set by the user, and for each repetition the Integrator is responsible for integrating the fresh triples (with temporal validity set to the time of extraction) with the data already stored in the LHD (whose validity dates back to a prior extraction) by executing the (set of) SPARQL queries provided as input. Different inputs supplied by the user configure LinkHisData to extract and publish historical datasets in different domains.
3 LinkHisData: Book Price Example
We instantiate the LinkHisData framework for the extraction and publication of linked historical data about book prices extracted from Barnes&Noble (www.bn.com). The following query retrieves from DBpedia books and their related attributes3:
SELECT ?b ?t ?ab ?an WHERE {
?b dbp:name ?t. ?b rdf:type dbo:Book. ?b dbp:author ?ab. ?ab foaf:name ?an. }
3 The prefixes used in the paper are taken from www.prefix.cc
doc("www.bn.com")//*[@id=’keyword’]/{?t ?an/}//*[@id=’quick-search’]/{"Books"/}
2 //*[@id=’quick-search-1’]//button/{click /}//li#search-result
[jarowinkler(./li#title, ?t)=1][jarowinkler(./li#auth, ?an)>.7]/{click /}/
4 /html:<(schema:Book(isbn))> [.: ]
[.//*[starts-with(.,"ISBN")][1]/text()[2]: ]
6 [.: [.: ]]
[.:
8 [? .//::*[@itemprop="price"][1]: ]
[? .//::*[@itemprop="price"][1]: ]
10 [? .: ]
Fig. 2: OXPath RDF wrapper
This query returns a set of tuples containing the URI of a book (?b), its title (?t), the URI of its author (?ab) and the author's name (?an). For instance, for the book "The Firm" by J. Grisham, the retrieved tuple is 〈dbpedia:The_Firm_(novel), "The Firm", dbpedia:John_Grisham, "John Grisham"〉. The remainder of the example uses the values of this tuple to instantiate the various components.
The tuples returned by the LOD Interrogator constitute the input for the
Data Extractor which runs the OXPath wrapper shown in Figure 2 (for space
reasons some parts are simplified or omitted). Here, the variables (e.g., ?b) refer
to the corresponding ones in the LOD Interrogator SPARQL query. We assume
some familiarity with XPath to illustrate the wrapper. It comprises three parts:
(i) navigation to the pages containing relevant data, (ii) identification of data
to extract, and (iii) RDF output production. In our example, we use types and
properties from schema.org (e.g., Book, Offer, price). For instance, for our ex-
ample tuple about “The Firm”, the wrapper produces the following RDF output:
lhd:b9780440245926 a schema:Book ; owl:sameAs dbpedia:The_Firm_(novel);
schema:isbn "9780440245926"; schema:author lhd:John_Grisham;
schema:offers lhd:off_123.
lhd:off_123 a schema:Offer,lhd:HistoricalEntity ; schema:price "9.21";
schema:priceCurrency "$"; schema:validFrom "2014-06-28".
lhd:John_Grisham a schema:Person ; owl:sameAs dbpedia:John_Grisham.
The wrapper encodes part (i) in lines 1–2. First, the website is loaded; then the search form is filled with "The Firm John Grisham", i.e., book title and author name, and the search is restricted to the book category. Finally, the submit button is clicked and all the books on the result page are selected by the expression //li.search-result. However, many of the results may not refer to the book of interest (e.g., collections/sets of books containing it are also retrieved), and it is crucial to identify the correct entities, as these will be linked back to the original entities in DBpedia. We address this problem (part (ii)) by using the Jaro-Winkler distance [2] to match author name and book title. This metric has been widely used in record linkage and performs particularly well on person and entity names. Our wrapper (line 3) demands a perfect match on the title and a significant similarity (>0.7) on the author's name to deal with different variations (e.g., "J. Grisham" or "John Ray Grisham, Jr").
This strategy prevents our framework from overtaxing the site by extracting (a possibly huge quantity of) data on irrelevant books that would only be discarded later by a costly post-processing linking phase. For example, for "The Firm" we correctly identify only 2 books out of the original 22 results.
Part (iii) is realized by visiting the detail page of each selected book via the action {click/}. An RDF extraction marker is used to create a schema:Book instance (line 4) and to explicitly link it to the corresponding DBpedia URI via owl:sameAs. The unique URI for the created book instance relies on the functional dependency on its ISBN (schema:Book(isbn)). This ensures that this URI will always be the same across subsequent extractions, making it possible to refer to the right entity in the integration phase. The wrapper also creates one linked data entity for the book author (line 6) and one for the offer (line 7). The former is of type schema:Person; its URI is created on the basis of the author name (?an) and is linked via owl:sameAs to the author on DBpedia. The latter is of type schema:Offer and lhd:HistoricalEntity, a type used to mark entities with temporal validity. Its URI is randomly generated to ensure different URIs for consecutive extractions. Some properties of the offer are also extracted (e.g., schema:price) and schema:validFrom is set to now(), the current date.
The Integrator takes as input, for each book (?b), the set of triples T produced by the Data Extractor and is responsible for integrating them with those already present in the LHD. In particular, it deals with entities having a temporal validity, e.g., book offers and their prices in our example. Each book in the LHD may have several associated offers that represent the evolution of the price over time. However, only one of them provides the current selling price (i.e., the one not having a schema:validThrough triple). Therefore, for each book (?b), the Integrator instantiates the query template (provided by the user) to retrieve this current price (?p) and its corresponding offer (?o) from the historical dataset. If ?b is not already present in the historical dataset, the triples in T are added to it. Otherwise, the Integrator compares the current price (?p) with the price of the offer in T (freshly extracted). If they differ, the price validity is updated by adding to the dataset both the triple (?o, schema:validThrough, now()) and the offer in T. Together they provide a new piece of historical information.
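The decision logic of the Integrator for a single book can be summarised with the following Python sketch, in which the LHD and the freshly extracted offer are represented by plain Python structures rather than an actual SPARQL endpoint (all names used here are illustrative):

from datetime import date

def integrate(lhd, book_uri, fresh_offer):
    # lhd maps a book URI to its list of offers; an offer is a dict with a
    # "price" and, once superseded, a "validThrough" date
    if book_uri not in lhd:                        # book seen for the first time
        lhd[book_uri] = [fresh_offer]
        return
    current = next(o for o in lhd[book_uri] if "validThrough" not in o)
    if current["price"] != fresh_offer["price"]:   # price changed
        current["validThrough"] = date.today().isoformat()
        lhd[book_uri].append(fresh_offer)          # a new piece of history
    # otherwise the current offer is still valid and nothing is added

# integrate({}, "dbpedia:The_Firm_(novel)",
#            {"price": "9.21", "validFrom": "2014-06-28"})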
References
1. O. Alonso, J. Strötgen, R. A. Baeza-Yates, and M. Gertz. Temporal information
retrieval: Challenges and opportunities. In TWAW, volume 813, pages 1–8, 2011.
2. W.W. Cohen, P.D. Ravikumar, and S.E. Fienberg. A comparison of string distance
metrics for name-matching tasks. In IIWeb, pages 73–78, 2003.
3. T. Furche, G. Gottlob, G. Grasso, C. Schallhart, and A.J. Sellers. OXPath: A
language for scalable data extraction, automation, and crawling on the deep web.
VLDB J., 22(1):47–72, 2013.
4. T. Käfer, A. Abdelrahman, J. Umbrich, P. O’Byrne, and A. Hogan. Observing
linked data dynamics. In ESWC, pages 213–227, 2013.
5. J. Lehmann, T. Furche, G. Grasso, A.C. Ngonga Ngomo, C. Schallhart, C. Unger, et al. DEQA: Deep web extraction for question answering. In ISWC, 2012.
6. A. Rula, M. Palmonari, A. Harth, S. Stadtmüller, and A. Maurino. On the diversity
and availability of temporal information in linked open data. In ISWC, 2012.
User driven Information Extraction with LODIE
Anna Lisa Gentile and Suvodeep Mazumdar
Department of Computer Science, University of Sheffield, UK
{a.gentile, s.mazumdar}@sheffield.ac.uk
Abstract. Information Extraction (IE) is the technique of transforming unstructured or semi-structured data into a structured representation that can be understood by machines. In this paper we use a user-driven Information Extraction technique to wrap entity-centric Web pages. The user can select concepts and properties of interest from available Linked Data. Given a number of websites containing pages about the concepts of interest, the method exploits (i) recurrent structures in the Web pages and (ii) available knowledge in Linked Data to extract the information of interest from the Web pages.
1 Introduction
Information Extraction transforms unstructured or semi-structured text into
structured data that can be understood by machines. It is a crucial technique
towards realizing the vision of the Semantic Web. Wrapper Induction (WI) is
the task of automatically learning wrappers (or extraction patterns) for a set
of homogeneous Web pages, i.e. pages from the same website, generated using
consistent templates1 . WI methods [1,2] learn a set of rules enabling the system-
atic extraction of specific data records from the homogeneous Web pages. In this
paper we adopt a user-driven paradigm for IE and we perform on-demand extraction on entity-centric webpages. We adopt our WI method [2,3] developed within the LODIE (Linked Open Data for Information Extraction) framework [4]. The main advantage of our method is that it does not require manually annotated pages. The training examples for the WI method are automatically generated by exploiting Linked Data.
2 State of the Art
Using WI to extract information from structured Web pages has been studied
extensively. Early studies focused on the DOM-tree representation of Web pages
and learned templates that wrap data records in HTML tags [1,5,6]. Su-
pervised methods require manual annotation on example pages to learn wrappers
for similar pages [1,7,8]. The number of required annotations can be drastically
reduced by annotating pages from a specific website and then adapting the learnt
1 For example, a yellow pages website will use the same template to display information (e.g., name, address, cuisine) of different restaurants.
rules to previously unseen websites of the same domain [9,10]. Completely un-
supervised methods (e.g. RoadRunner [11] and EXALG [12]) do not require any
training data, nor an initial extraction template (indicating which concepts and
attributes to extract), and they only assume the homogeneity of the considered
pages. The drawback of unsupervised methods is that the semantics of the produced results is left as a post-processing step for the user. Hybrid methods [2] aim to find a trade-off between these two limitations by proposing a supervised strategy where the training data is automatically generated by exploiting Linked Data. In this work we perform IE using the method proposed in [2,3] and follow the general IE paradigm from [4].
3 User-driven Information Extraction
In LODIE we adopt a user-driven paradigm for IE. As a first step, the user must define her/his information need. This is done via a visual exploration of linked data (Figure 1).
Fig. 1: Exploring linked data to define user need, by selecting concepts and attributes to extract. Here the user selected the concept Book and the attributes title and author. As author is a datatype attribute, of type Person, the attribute name is chosen.
The user can explore the underlying linked data using the Affective Graphs visualization tool [13] and select concepts and properties she/he is interested in (a screenshot is shown in Figure 1). These concepts and properties get added to the side panel. Once the selection is finished, she/he can start the IE process. The IE starts with a dictionary generation phase. A dictionary d_{i,k} consists of values for the attribute a_{i,k} of instances of concept c_i. Noisy entries in the dictionaries are removed using a cleaning procedure detailed in [3]. As a running example we will assume the user wants to extract title and author for the concept Book. We retrieve from the Web k websites containing entity-pages of the concept types selected by the user, and save the pages W_{c_i,k}. Following the Book example, the Barnes&Noble2 or AbeBooks3 websites can be used, and pages collected in W_{book,barnesandnoble} and W_{book,abebooks}.
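For the running example, a dictionary of book titles can be obtained with a query analogous to the following Python sketch (using the SPARQLWrapper library against the public DBpedia endpoint; the cleaning procedure of [3] is omitted and the query is only an approximation of the one actually used):

from SPARQLWrapper import SPARQLWrapper, JSON

def build_title_dictionary(endpoint="http://dbpedia.org/sparql", limit=1000):
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery("""
        PREFIX dbo:  <http://dbpedia.org/ontology/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?title WHERE {
            ?b a dbo:Book ; rdfs:label ?title .
            FILTER(lang(?title) = "en")
        } LIMIT %d""" % limit)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    # lower-cased title strings form the dictionary entries
    return {r["title"]["value"].lower()
            for r in results["results"]["bindings"]}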
For each W_{c_i,k} we generate a set of extraction patterns for every attribute. In our example we will produce 4 sets of patterns, one for each website and attribute.
2 http://www.barnesandnoble.com/
3 http://www.abebooks.co.uk
To produce the patterns we (i) use our dictionaries to generate brute-force annotations on the pages in W_{c_i,k} and then (ii) use statistical (occurrence frequency) and structural (position of the annotations in the webpage) clues to choose the final extraction patterns.
Briefly, a page is transformed into a simplified page representation P_{c_i}: a collection of pairs 〈xpath4, text value〉. Candidates are generated by matching the dictionaries d_{i,k} against possible text values in P_{c_i} (Figure 2).
/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[2]/DIV[1]/H2[1]/text()[1] breaking dawn
/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[2]/DIV[4]/DIV[1]/H2[1]/EM[1]/text()[1] breaking dawn
/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[2]/DIV[4]/TABLE[10]/TBODY[1]/TR[1]/TD[3]/B[1]/A[1]/text()[1] breaking dawn
/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[2]/DIV[4]/TABLE[1]/TBODY[1]/TR[1]/TD[3]/B[1]/A[1]/text()[1] breaking dawn
/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[2]/DIV[4]/TABLE[2]/TBODY[1]/TR[1]/TD[3]/B[1]/A[1]/text()[1] breaking dawn
/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[2]/DIV[4]/TABLE[3]/TBODY[1]/TR[1]/TD[3]/B[1]/A[1]/text()[1] breaking dawn
/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[2]/DIV[4]/TABLE[6]/TBODY[1]/TR[1]/TD[3]/B[1]/A[1]/text()[1] breaking dawn
/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[2]/DIV[4]/TABLE[8]/TBODY[1]/TR[1]/TD[3]/B[1]/A[1]/text()[1] breaking dawn
/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[3]/DIV[3]/UL[1]/LI[2]/A[1]/text()[1] the host
/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[3]/DIV[3]/UL[1]/LI[5]/A[1]/text()[1] new moon
Fig. 2: Example of candidates for book title for a Web page on the book “Breaking Dawn”, from the
website AbeBooks.
Final patterns are chosen amongst the candidates exploiting frequency in-
formation and other heuristics. Details of the method can be found in [2,3].
In the running example, higher scoring patterns for extracting book title from
AbeBooks website are shown in Figure 3.
/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[2]/DIV[1]/H2[1]/text()[1] 329.0
/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[2]/DIV[4]/DIV[1]/H2[1]/EM[1]/text()[1] 329.0
Fig. 3: Extraction patterns for book titles from AbeBooks website.
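A simplified Python sketch of steps (i) and (ii) is given below: the dictionary is matched against the (xpath, text value) pairs of the page representation, and the XPaths that carried a match are scored by frequency (the index-collapsing generalisation used here is only illustrative; the actual heuristics are described in [2,3]):

import re
from collections import Counter

def candidates(page, dictionary):
    # page: list of (xpath, text) pairs; dictionary: set of known values
    return [(xpath, text) for xpath, text in page
            if text.strip().lower() in dictionary]

def score_patterns(pages, dictionary):
    counts = Counter()
    for page in pages:
        for xpath, _ in candidates(page, dictionary):
            # collapse positional indices so recurring template paths align,
            # e.g. TABLE[3] and TABLE[8] fall into the same pattern
            counts[re.sub(r"\[\d+\]", "[*]", xpath)] += 1
    return counts.most_common()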
All extraction patterns are then used to extract target values from all W_{c_i,k}.
Results are produced as linked data, using the concept and properties initially
selected by the user for representation, and made accessible to the user via an
exploration interface (Figure 4), implemented using Simile Widgets5 .
A video showing the proposed system used with the running Book exam-
ple can be found at http://staffwww.dcs.shef.ac.uk/people/A.L.Gentile/
demo/iswc2014.html.
4 Conclusions and future work
In this paper we describe the LODIE approach to performing IE on user-defined extraction tasks. The user is provided with a visual tool to explore available linked data and choose concepts for which she/he wants to mine additional material from the Web. We learn extraction patterns to wrap relevant websites and return structured results to the user.
4 http://www.w3.org/TR/xpath/
5 http://www.simile-widgets.org/
Fig. 4: Exploration of results produced by the IE method
References
1. Kushmerick, N.: Wrapper Induction for information Extraction. In: IJCAI97.
(1997) 729–735
2. Gentile, A.L., Zhang, Z., Augenstein, I., Ciravegna, F.: Unsupervised wrapper
induction using linked data. In: Proc. of the seventh international conference on
Knowledge capture. K-CAP ’13, New York, NY, USA, ACM (2013) 41–48
3. Gentile, A.L., Zhang, Z., Ciravegna, F.: Self training wrapper induction with linked
data. In: Proceedings of the 17th International Conference on Text, Speech and
Dialogue (TSD 2014). (2014) 295–302
4. Ciravegna, F., Gentile, A.L., Zhang, Z.: Lodie: Linked open data for web-scale
information extraction. In: SWAIE. (2012) 11–22
5. Muslea, I., Minton, S., Knoblock, C.: Hierarchical wrapper induction for semistruc-
tured information sources. Autonomous Agents and Multi-Agent Systems (2001)
1–28
6. Soderland, S.: Learning information extraction rules for semi-structured and free
text. Mach. Learn. 34(1-3) (February 1999) 233–272
7. Muslea, I., Minton, S., Knoblock, C.: Active Learning with Strong and Weak Views:
A Case Study on Wrapper Induction. IJCAI'03, 18th International Joint Conference on Artificial Intelligence (2003) 415–420
8. Dalvi, N., Kumar, R., Soliman, M.: Automatic wrappers for large scale web ex-
traction. Proc. of the VLDB Endowment 4(4) (2011) 219–230
9. Wong, T., Lam, W.: Learning to adapt web information extraction knowledge
and discovering new attributes via a Bayesian approach. Knowledge and Data
Engineering, IEEE 22(4) (2010) 523–536
10. Hao, Q., Cai, R., Pang, Y., Zhang, L.: From One Tree to a Forest : a Unified
Solution for Structured Web Data Extraction. In: SIGIR 2011. (2011) 775–784
11. Crescenzi, V., Mecca, G.: Automatic information extraction from large websites.
Journal of the ACM 51(5) (September 2004) 731–779
12. Arasu, A., Garcia-Molina, H.: Extracting structured data from web pages. In: Proc.
of the 2003 ACM SIGMOD international conference on Management of data, ACM
(2003) 337–348
13. Mazumdar, S., Petrelli, D., Elbedweihy, K., Lanfranchi, V., Ciravegna, F.: Affective graphs: The visual appeal of linked data. Semantic Web–Interoperability, Usability,
Applicability. IOS Press (to appear, 2014) (2013)
QALM: a Benchmark for Question Answering
over Linked Merchant Websites Data
Amine Hallili1 , Elena Cabrio2,3 , and Catherine Faron Zucker1
1 Univ. Nice Sophia Antipolis, CNRS, I3S, UMR 7271, Sophia Antipolis, France
amine.hallili@inria.fr; faron@unice.fr
2 INRIA Sophia Antipolis Méditerranée, Sophia Antipolis, France
elena.cabrio@inria.fr
3 EURECOM, Sophia Antipolis, France
Abstract. This paper presents a benchmark for training and evaluat-
ing Question Answering Systems aiming at mediating between a user,
expressing his or her information needs in natural language, and seman-
tic data in the commercial domain of the mobile phones industry. We
first describe the RDF dataset we extracted through the APIs of mer-
chant websites, and the schemas on which it relies. We then present the
methodology we applied to create a set of natural language questions
expressing possible user needs in the above-mentioned domain. This question set has then been further annotated with both the corresponding SPARQL queries and the correct answers retrieved from the dataset.
1 Introduction
The evolution of the e-commerce domain, especially Business to Consumer (B2C), has encouraged the implementation and use of dedicated applications (e.g. Question Answering Systems) trying to provide end-users with a better experience. At the same time, users' needs are getting more and more complex and specific, especially for commercial products, where questions more often concern technical aspects (e.g. price, color, seller). Several systems propose solutions to answer these needs, but many challenges have not been overcome yet, leaving room for improvement. For instance, federating several commercial knowledge bases into one knowledge base has not been accomplished yet. Also, understanding and interpreting complex natural language questions, also known as n-relation questions, remains one of the ambitious topics that systems are currently trying to address.
In this paper we present a benchmark for training and evaluating Question
Answering (QA) Systems aiming at mediating between a user, expressing his or
her information need in natural language, and semantic data in the commercial
domain of the mobile phone industry. We first describe the RDF dataset that we
have extracted through the APIs of merchant sites, and the schemas on which it
relies. We then present the methodology we applied to create a set of natural lan-
guage questions expressing possible user needs in the above mentioned domain.
This question set has then been further annotated with both the corresponding SPARQL queries and the correct answers retrieved from the dataset.
2 A Merchant Sites Dataset for the Mobile Phones
Industry
This section describes the QALM (Question Answering over Linked Merchant
websites) ontology (Section 2.1), and the RDF dataset (Section 2.2) we built by
extracting a sample of data from a set of commercial websites.
2.1 QALM Ontology
The QALM RDF dataset relies on two ontologies: the Merchant Site Ontology
(MSO) and the Phone Ontology (PO). Together they build up the QALM On-
tology.4 MSO models general concepts of merchant websites, and it is aligned to
the commercial part of the Schema.org ontology. MSO is composed of 5 classes:
mso:Product, mso:Seller, mso:Organization, mso:Store, mso:ParcelDelivery, and of 29 properties (e.g. mso:price, mso:url, mso:location, mso:seller) declared as subclasses and subproperties of Schema.org classes and properties. We added multilingual labels to them (both in English and in French), which can be exploited by QA systems, in particular for property identification in the question interpretation step. We relied on WordNet synonyms [2] to extract as many labels as possible. For example, the property mso:price has the following English labels: "price", "cost", "value", "tariff", "amount", and the following French labels: "prix", "coût", "coûter", "valoir", "tarif", "s'élever".
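The label harvesting step can be approximated with the NLTK interface to WordNet, as in the following Python sketch (the authors' exact selection and filtering procedure is not described here and may differ):

from nltk.corpus import wordnet as wn   # requires the NLTK WordNet corpus

def candidate_labels(seed):
    # collect lemma names from all noun synsets of the seed word
    labels = {seed}
    for synset in wn.synsets(seed, pos=wn.NOUN):
        labels.update(l.name().replace("_", " ") for l in synset.lemmas())
    return sorted(labels)

# candidate_labels("price") returns, among others, "price" and "cost"; the
# list would be filtered manually before being attached as rdfs:label values.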
PO is a domain ontology modeling concepts specific to the phone indus-
try. It is composed of 7 classes (e.g. po:Phone, po:Accessory) which are de-
clared as subclasses of mso:Product, and of 35 properties (e.g. po:handsetType,
po:operatingSystem, po:phoneStyle).
2.2 QALM RDF Dataset
Our final goal is to build a unified RDF dataset integrating commercial product
descriptions from various e-commerce websites. In order to achieve this goal,
we analyze the web services of the e-commerce websites regardless of their type
(either SOAP or REST). To feed our dataset, we create a mapping between
the remote calls to the web services and the ontology properties, which we store in a separate file for reuse. In particular, we built the QALM RDF dataset by extracting data from the eBay5 and BestBuy6 commercial websites through the BestBuy Web service and the eBay API. The extracted raw data is transformed into RDF triples by applying the above-described mapping between the QALM ontology and the API/web service.
4 Available at www.i3s.unice.fr/qalm/ontology
5 http://www.ebay.com/
6 http://www.bestbuy.com/
For instance, the method getPrice() in the eBay API is mapped to the property mso:price in the QALM ontology. Currently, the QALM dataset comprises 500,000 product descriptions and up to 15 million triples extracted from eBay and BestBuy.7
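The application of such a mapping can be sketched in Python as follows; the mapping table, the namespace and the item object are simplified stand-ins for illustration, not the actual QALM mapping file or the eBay/BestBuy APIs:

MSO = "http://www.i3s.unice.fr/qalm/mso#"   # assumed namespace, for illustration

MAPPING = {                                  # e.g. loaded from the reuse file
    "getPrice":  MSO + "price",
    "getSeller": MSO + "seller",
    "getUrl":    MSO + "url",
}

def to_triples(product_uri, item):
    # item: an object exposing the mapped accessor methods
    for accessor, prop in MAPPING.items():
        value = getattr(item, accessor)()
        if value is not None:
            yield (product_uri, prop, value)

class FakeItem:                              # stand-in for one API result
    def getPrice(self): return "199.99"
    def getSeller(self): return "BestBuy"
    def getUrl(self): return None

# list(to_triples("http://example.org/product/123", FakeItem())) yields the
# price and seller triples; the missing URL is simply skipped.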
3 QALM Question Set
In order to train and evaluate a QA system mediating between a user and semantic data in the QALM dataset, a set of questions representing user requests in the phone industry domain is required. To the best of our knowledge, the only available standard sets of questions to evaluate QA systems over linked data are the ones released by the organizers of the QALD (Question Answering over Linked Data) challenges.8 However, such questions are over the English DBpedia dataset9 and therefore cover several topics. For this reason, we created a set of natural language questions for the specific commercial domain of the phone industry, following the guidelines described by the QALD organizers for the creation of their question sets [1]. More specifically, these questions were created by 12 external people (students and researchers in other groups) with no background in question answering, in order to avoid a bias towards a particular approach. To accomplish the task of question creation, each person was given i) the list of the product types present in the QALM dataset (mainly composed of IT products such as phones and accessories), and ii) the list of the properties of the QALM ontology, presented as product features in which they could be interested; and they were asked to produce i) both 1-relation and 2-relation questions, and ii) at least 5 questions each. The questions were designed to represent potential user questions and to include a wide range of challenges such as lexical ambiguities and complex syntactic structures. These questions were then annotated with the corresponding SPARQL queries and the correct answers retrieved from the dataset, in order to serve as a reliable gold standard for our benchmark.
The final question set comprises 70 questions; it is divided into a training set10 and a test set of 40 and 30 questions, respectively. Annotations are provided in XML format and, according to the QALD guidelines, the following attributes are specified for each question along with its ID: aggregation (indicates whether any operation beyond triple pattern matching is required to answer the question, e.g., counting, filtering, ordering) and answertype (gives the answer type: resource, string, boolean, double, date). We also added the attribute relations, to indicate whether the question is connected to its answer through one or more properties of the ontology (values: 1, n). Finally, for each question the corresponding SPARQL query is provided, as well as the answers this query returns. Examples 1 and 2 show some questions from the collected question set, connected to their answers through one property or more than one property of the ontology, respectively.
7 Available at www.i3s.unice.fr/QALM/qalm.rdf
8 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/
9 http://dbpedia.org
10 Available at www.i3s.unice.fr/QALM/training_questions.xml
In particular, questions 14 and 50 from Example 2 also require some reasoning on the results, in order to rank them and produce the correct answer.
Example 1. 1-relation questions.
id=36. Give me the manufacturers who supply on-ear headphones.
id=52. What colors are available for the Samsung Galaxy 5 ?
id=61. Which products of Alcatel are available online?
Example 2. n-relations questions.
id=14. Which cell phone case (any manufacturer) has the most ratings?
id=50. What is the highest camera resolution of phones manufactured by Motorola?
id=58. I would like to know in which stores I can buy Apple phones.
4 Conclusions and Ongoing Work
This paper presented a benchmark to train and test QA systems, composed of i)
the QALM ontologies; ii) the QALM RDF dataset of product descriptions ex-
tracted from eBay and BestBuy; and iii) the QALM Question Set, containing 70
natural language questions in the commercial domain of phones and accessories.
As for future work, we will consider aligning the QALM ontology to the
GoodRelations ontology to fully cover the commercial domain, and to benefit
from the semantics captured in this ontology. We also consider improving the
QALM RDF dataset by i) extracting RDF data from additional commercial
websites that provide web services or APIs; and ii) directly extracting RDF
data in the Schema.org ontology from commercial websites whose pages are
automatically generated with Schema.org markup (e.g. Magento, OSCommerce,
Genesis2.0, Prestashop), to extend the number of addressed commercial websites.
In parallel, we are currently developing the SynchroBot QA system [3], an
ontology-based chatbot for the e-commerce domain. We will evaluate it by using
the proposed QALM benchmark.
Acknowledgements
We thank Amazon, eBay and BestBuy for contributing to this work by sharing
with us public data about their commercial products. The work of E. Cabrio was
funded by the French Government through the ANR-11-LABX-0031-01 program.
References
1. Cimiano, P., Lopez, V., Unger, C., Cabrio, E., Ngomo, A.C.N., Walter, S.: Multi-
lingual question answering over linked data (qald-3): Lab overview. In: CLEF. pp.
321–332 (2013)
2. Fellbaum, C.: WordNet: An Electronic Lexical Database. Bradford Books (1998)
3. Hallili, A.: Toward an ontology-based chatbot endowed with natural language pro-
cessing and generation. In: Proc. of ESSLLI 2014 - Student Session, Poster paper
(2014)
GeoTriples: a Tool for Publishing Geospatial
Data as RDF Graphs Using R2RML Mappings
Kostis Kyzirakos1 , Ioannis Vlachopoulos2 , Dimitrianos Savva2 ,
Stefan Manegold1 , and Manolis Koubarakis2
1 Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
{firstname.lastname}@cwi.nl
2 National and Kapodistrian University of Athens, Greece
{johnvl,dimis,koubarak}@di.uoa.gr
Abstract. In this paper we present the tool GeoTriples, which allows the transformation of Earth Observation data and geospatial data into RDF graphs, by using and extending the R2RML mapping language to deal with the specificities of geospatial data. GeoTriples is a semi-automated tool that transforms geospatial information into RDF following state-of-the-art vocabularies such as GeoSPARQL and stSPARQL, while not being tightly coupled to a specific vocabulary.
Keywords: Linked Geospatial Data, data publishing, GeoSPARQL, stSPARQL
1 Introduction
In the last few years there has been significant effort to publish EO and geospatial data sources as linked open data. However, the problem of publishing geospatial data sources as RDF graphs using a generic and extensible framework has received little attention, as it has only recently emerged. Instead, scripting methods adapted to the subject were mostly employed for this task, such as the custom Python scripts developed in the project TELEIOS3. However, some work towards developing automated methods for translating geospatial data into RDF has been presented at the latest LGD Workshop4. In this paper we present the tool GeoTriples, which allows the transformation of geospatial data stored in spatially-enabled relational databases and raw files. It is implemented as an extension of the D2RQ platform5 [1] and goes beyond the state of the art by extending the R2RML mapping language6 to deal with the specificities of geospatial data. GeoTriples uses GeoSPARQL7 as the target vocabulary, but the user is free to use any vocabulary she finds appropriate.
3 http://www.earthobservatory.eu
4 http://www.w3.org/2014/03/lgd
5 http://d2rq.org
6 http://www.w3.org/TR/r2rml/
7 http://www.opengeospatial.org/standards/geosparql/
NAME       TYPE    WIDTH
Zeitlbach  stream  1
Mangfall   river   25
Triftbach  canal   10
(a) Example data from an ESRI shapefile

osm_w:1 rdf:type geo:Feature ;
        osm_ont:hasName "Mangfall"^^xsd:string ;
        geo:hasGeometry osm_g:1 .
osm_g:1 rdf:type geo:Geometry ;
        geo:dimension "2"^^xsd:integer .
(b) Expected RDF triples about Mangfall

_:osm
  rr:logicalTable [ rr:tableName "`osm`"; ];
  rr:subjectMap [
    rr:class geo:Feature;
    rr:template "http://data.example.com/osm-waterways/Feature/id/{`gid`}"; ];
  rr:predicateObjectMap [
    rr:predicate osm:hasName;
    rr:objectMap [ rr:datatype xsd:string; rr:column "`NAME`"; ]; ];
  rr:predicateObjectMap [
    rr:predicate geo:hasGeometry ;
    rr:objectMap [
      rr:parentTriplesMap _:osmGeometry;
      rr:joinCondition [
        rr:child "gid";
        rr:parent "gid"; ]; ]; ].
(c) Mapping of thematic information

_:osmGeometry
  rr:logicalTable [ rr:tableName "`osm`"; ];
  rr:subjectMap [
    rr:class geo:Geometry;
    rr:template "http://data.example.com/osm-waterways/Geometry/id/{`gid`}"; ];
  rr:predicateObjectMap [
    rr:predicate geo:dimension;
    rr:objectMap [
      rrx:transformation [
        rrx:function geof:dimension;
        rrx:argumentMap (
          [rr:column "`geom`"] ); ] ]; ].
(d) Mapping of geometric information

Fig. 1: Examples of extended R2RML mappings for OSM
2 The Tool GeoTriples
GeoTriples8 is an open source tool that takes as input geospatial data stored in a spatially enabled database, data residing in raw files (e.g. ESRI shapefiles), or results derived from processing the aforementioned data (e.g. a SciQL query over raster or array data). At a lower level, GeoTriples uses a connector for each type of input data that transparently accesses and processes the input data. It consists of two main components: the mapping generator and the R2RML processor. The mapping generator automatically creates an R2RML mapping document from the input data source. The mapping is also enriched with subject and predicate object maps so that the RDF graph that will be produced follows the GeoSPARQL vocabulary. Geospatial information is modeled using a variety of data models (e.g., relational, hierarchical) and is made available in a variety of formats (e.g., ESRI shapefiles, KML documents). In order to deal with these specificities of geospatial information, we extended the R2RML language to allow the representation of a transformation function over the input data via an object map. In [2] we provide more information about our approach. Figure 1 presents an example of such a transformation. The R2RML processor is responsible for producing the desired RDF output by taking into account the generated mapping document, which may optionally be edited by the user. When the R2RML processor of GeoTriples detects an object map with a transformation function, it applies this function on the fly to the serialization of the geometry described in the subject map. However, if the input data source is a spatially enabled database, it generates the appropriate SQL queries that push these transformations to the underlying DBMS.
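The push-down idea can be sketched as a simple rewriting from the GeoSPARQL function named in the extended object map to a corresponding spatial SQL function; the PostGIS function name below is an assumption for illustration, and GeoTriples' actual rewriting rules may differ:

# assumed GeoSPARQL-to-PostGIS correspondence, for illustration only
GEOF_TO_POSTGIS = {
    "geof:dimension": "ST_Dimension",
}

def push_down(table, geometry_column, geof_function):
    # build the SQL query executed by the underlying spatially enabled DBMS
    sql_fn = GEOF_TO_POSTGIS[geof_function]
    return 'SELECT gid, {fn}("{col}") AS value FROM "{tbl}"'.format(
        fn=sql_fn, col=geometry_column, tbl=table)

# push_down("osm", "geom", "geof:dimension")
#   -> SELECT gid, ST_Dimension("geom") AS value FROM "osm"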
8 https://sourceforge.net/projects/geotriples/
Fig. 2: The graphical user interface of GeoTriples
3 Using GeoTriples in a real-world scenario
In this section we present how we will demonstrate the tool GeoTriples in the
context of a precision farming application that is developed by the FP7 EU
project LEO9 . The application combines traditional geospatial data with linked
geospatial data for enhancing the quality of precision farming activities. Precision
farming aims to solve numerous problems for farmers such as the minimization
of the environmental pollution by fertilizers. For dealing with this issue, the
farmers have to comply with many legal and technical guidelines that require
the combination of information that resides in diverse information sources. In
this section we present how linked geospatial data can form the knowledge base
for providing solutions for this problem. We will publish the following datasets
as RDF graphs using GeoTriples in order to use them in the precision farming
application.
OpenStreetMap (OSM) is a collaborative project for publishing free maps
of the world. OSM maintains a community-driven global editable map that gath-
ers map data in a crowdsourcing fashion.
Talking Fields aims to increase the efficiency of agricultural production via precision farming by means of geo-information services integrating space and ground-based assets. It produces products for improved soil probing using satellite-based zone maps, and provides services for monitoring crop development through the provision of biomass maps and yield estimates.
Natura 2000 is a European ecological network where national authorities
submit a standard data form that describes each site and its ecology in order to
be characterized as a Natura site.
Corine Land Cover (CLC) is an activity of the European Environment
Agency that collects data regarding the land cover of European countries.
In this demo we will use GeoTriples in order to produce the R2RML map-
pings that dictate the process of generating the desired RDF output from the
above data. Then, using the R2RML processor of GeoTriples, we translate the
input data into RDF graphs and store the latter into the geospatial RDF store
Strabon10 [3].
9 http://www.linkedeodata.eu
10 http://www.strabon.di.uoa.gr/
SELECT DISTINCT ?field_name ?river_name
WHERE { ?river rdf:type osmo:River ;
               osmo:hasName ?river_name ;
               geo:hasGeometry ?river_geo .
        ?river_geo geo:asWKT ?river_geowkt .
        ?field rdf:type tf:Field ;
               tfo:hasFieldName ?field_name ;
               tf:hasRasterCell ?cell .
        ?cell geo:hasGeometry ?cell_geo .
        ?cell_geo geo:asWKT ?field_geowkt .
        FILTER(geof:distance(?river_geowkt, ?field_geowkt, uom:meter) < 100) }
(a) GeoSPARQL Query (b) Query results
Fig. 3: Discover the parts of the agricultural fields that are close to rivers.
The user can use the graphical interface of GeoTriples displayed in Figure 2 to publish these datasets as RDF graphs. First, the user provides the necessary credentials of the DBMS that stores the above datasets. Then, she selects the tables and the columns that contain the information she wants to publish as RDF graphs. Optionally, an existing ontology may be loaded, in order to map the columns of the selected table to properties from the loaded ontology and map the instances generated from the rows to a specific class. Afterwards, GeoTriples automatically generates the R2RML mappings and presents them to the user. Finally, the user may either customize the generated mappings or proceed with the generation of the RDF graph.
A series of GeoSPARQL queries will be issued afterwards in Strabon for
providing the precision farming application with information like the location of
agricultural fields that are close to a river. This information allows the precision
farming application to take into account legal restrictions regarding distance
requirements when preparing the prescription maps that the farmers will utilize
afterwards. In Figure 3a we present a GeoSPARQL query that discovers this
information, and in Figure 3b we depict the query results.
4 Conclusions
In this paper we presented the tool GeoTriples, which uses an extended form of the R2RML mapping language to transform geospatial data into RDF graphs, and the GeoSPARQL ontology to properly express it. We demonstrated how GeoTriples is used for publishing geospatial information that resides in different data sources for the realization of a precision farming application.
References
1. C. Bizer and A. Seaborne. D2RQ-treating non-RDF databases as virtual RDF
graphs. In Proceedings of the 3rd International Semantic Web Conference, 2004.
2. K. Kyzirakos, I. Vlachopoulos, D. Savva, S. Manegold, and M. Koubarakis. Data models and languages for mapping EO data to RDF. Deliverable 2.1, FP7 project LEO, 2014.
3. K. Kyzirakos, M. Karpathiotakis, and M. Koubarakis. Strabon: A Semantic Geospa-
tial DBMS. In International Semantic Web Conference, 2012.
New Directions in Linked Data Fusion
Jan Michelfeit1 and Jindřich Mynarz2
1 Faculty of Mathematics and Physics, Charles University in Prague, Czech Rep.
michelfeit@ksi.mff.cuni.cz
2 University of Economics, Prague, Czech Republic
Abstract. When consuming Linked Data from multiple sources, or from
a data source after deduplication of entities, conflicting, missing or out-
dated values must be dealt with during data fusion in order to increase
the usefulness and quality of the data. In this poster, we argue that the
nature of Linked Data in RDF requires a more sophisticated approach
to data fusion than the current Linked Data fusion tools provide. We
demonstrate where they fall short on a real case of public procurement
data fusion when dealing with property dependencies and fusion of struc-
tured values, and we propose new data fusion extensions to address these
problems.
Keywords: Data Fusion, Linked Data, RDF, Data Integration
1 Introduction
The value of Linked Data lies in the ability to link pieces of data. A data in-
tegration process applied to the data can provide a unified view on data and
simplify the creation of Linked Data consuming applications. Nevertheless, con-
flicts emerge during the integration. Deduplication reveals different URIs rep-
resenting the same real-world entities (identity conflicts), and conflicting values
appear due to errors, missing, or outdated pieces of information (data conflicts).
Resolution of these conflicts is a task for data fusion. It combines multiple
records representing the same real-world object into a single, consistent and clean
representation [1]. In the context of Linked Data represented as RDF, real-world
objects are represented as resources. A set of RDF triples describing a resource, a
resource description, corresponds to a “record”. Conflicts are resolved, and low-
quality values purged to get a clean representation of a resource. This is typically
realized by fusion functions such as Latest, Vote, or Average. Tools imple-
menting Linked Data fusion include, e.g., Sieve [2], or LD-FusionTool3 which we
develop as part of the UnifiedViews ETL framework4 [3].
In this poster, we demonstrate how errors can be introduced in the fused
data when there are dependencies between RDF properties, or when fusing the
common pattern of structured values (e.g., address of an entity). We propose how
to deal with these cases by extending the data fusion process with the notion of
dependent properties and dependent resources.
3 https://github.com/mifeet/LD-FusionTool
4 Successor of the ODCleanStore framework, where LD-FusionTool originated.
Fig. 1. Sample representation of a business entity. Red boxes denote dependent re-
sources (structured values), green arrows denote groups of dependent properties.
2 Motivating Example
We demonstrate the need for new data fusion capabilities on a real scenario with
public procurement data extracted from an XML-based API and RDFized using
the UnifiedViews framework. The extracted data needs to be deduplicated and
fused in order to obtain high-quality data before further analytical processing.
Fig. 1 shows how a business entity (BE) is represented in RDF. It has a
legal name, address, and official identifier, which may be marked as syntactically
invalid. The extracted data contains many copies of the same BE because of
duplication in the source dataset. Simple merge of matched BEs would result in
data conflicts due to misspellings and errors in the dataset or mismatches in the
generated owl:sameAs links. Our goal is to fuse BEs so that each has a single
legal name, address, and identifier, choosing the best possible values.
Property dependencies. We encounter the first problem with the state-of-
the-art Linked Data fusion tools when fusing addresses. The tools resolve each property independently, which can result in the selection of, e.g., a town from one address in the input and a postal code from another one. Such a result could be incorrect, however, because the postal code is related to the town. We need to introduce dependencies between properties to obtain a correct fused result.
Fusing structured values. Both address and identifier can be regarded
as structured values of a BE. We will refer to the main resource (e.g., BE) as
a parent resource and to the resource representing the structured value (e.g.,
address) as a dependent resource. Currently, structured values need to be fused
separately. One way of achieving this is generating owl:sameAs links among structured values based on their properties, e.g., matching addresses based on the similarity of street and town. This approach has two drawbacks: it does not guarantee that a BE will have only a single address after fusion, and the error of automatically generated owl:sameAs links accumulates. Another way is generating owl:sameAs links between dependent resources that belong to the same parent resource. This approach may lead to errors when two parent resources point to the same dependent resource, e.g., two different BEs point to the same address. All addresses for the two BEs would incorrectly be merged in such a case.
We argue that a smarter approach considering structured values in resource
descriptions could (1) overcome the outlined problems with the separate fusion
of structured values, (2) reduce the overhead of additional linking, fusion, and
validation, (3) gracefully handle blank nodes, where linking may not be practical.
3 Extending Linked Data Fusion
In this section, we propose how to extend data fusion to improve on the issues
demonstrated in Section 2. Let there be a set U (RDF URI references), a set B (blank nodes) and a set L (literals). A triple (s, p, o) ∈ (U ∪ B) × U × (U ∪ B ∪ L) is an RDF triple, and we refer to its components as subject, predicate, and object, respectively. Let g ∈ U be a graph name. We regard a triple (s, p, o) that belongs to a named graph g as a quad (s, p, o, g).
3.1 Property Dependencies
Independent fusion of properties is not always sufficient, as we demonstrated in
Section 2. What we want is to keep the values of dependent properties together
in the fused result if the values occurred together in the input data. Let us call a
set of input quads sharing the same subject s and graph name g before resolution
of identity conflicts an input group IG_{s,g}. Furthermore, let d(p1, p2) denote that there is a dependency between properties p1 and p2.
Definition 1. A fused result R from input quads I satisfies property dependencies if and only if ∀p1, p2 ∈ U such that d(p1, p2): all quads (s, p, o, g) ∈ R such that p = p1 ∨ p = p2 are derived5 from the same input group in I.
We chose to define input groups based on subject and graph because it covers two common scenarios: (1) fusing data from multiple sources (input quads can have different graph names), and (2) fusion after deduplication of a single source (quads will have different subjects before resolution of identity conflicts).
Here is how a basic data fusion algorithm can be extended to produce results satisfying property dependencies. The input of the algorithm includes these dependencies – we assume they are given as an equivalence relation d. We also assume the input resource description contains all quads for all mutually dependent properties. The extended algorithm consists of the following high-level steps:
1. Find the set of equivalence classes 𝒞 of the equivalence relation d.
2. For every class of dependent properties C ∈ 𝒞:
   (a) Let I_C be all input quads with one of the properties in C.
   (b) For every nonempty input group I_{s,g} in I_C, let O_{s,g} be the fused result of the basic data fusion algorithm applied to I_{s,g}.
   (c) Select one set O_C from all sets O_{s,g} of fused quads according to some fusion-tool-specific criterion and add O_C to the result.
3. Fuse input quads with properties that do not have any dependency using the basic data fusion algorithm.
It is straightforward to prove that the extended algorithm indeed produces results satisfying property dependencies. The criterion used in step (2c) can depend on the implementing fusion tool. In LD-FusionTool, which can assess the quality of fused quads, we select O_{s,g} such that the average quality of the fused result is maximal.
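Under the assumption that quads are plain (s, p, o, g) tuples, that the dependencies are given as a mapping from each property to its equivalence class, and that fuse and quality stand in for the tool-specific basic fusion function and quality criterion, the extended algorithm can be sketched in Python as follows (a simplified illustration, not LD-FusionTool's implementation):

def fuse_with_dependencies(quads, dependencies, fuse, quality):
    result = []
    classes = {frozenset(c) for c in dependencies.values()}
    for cls in classes:                                    # step 2
        in_cls = [q for q in quads if q[1] in cls]
        groups = {}
        for s, p, o, g in in_cls:                          # input groups IG_{s,g}
            groups.setdefault((s, g), []).append((s, p, o, g))
        fused_groups = [fuse(group) for group in groups.values()]
        if fused_groups:                                   # step (2c): keep one group
            result.extend(max(fused_groups, key=quality))
    independent = [q for q in quads if q[1] not in dependencies]
    result.extend(fuse(independent))                       # step 3
    return result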
5 By derived we mean "selected from" for so-called deciding fusion functions such as Latest, or "computed from" for mediating fusion functions such as Average.
3.2 Dependent Resources
Current Linked Data tools fuse resource descriptions composed of triples having
the respective resource as its subject. Further triples describing structured values
are not included (e.g., street is not included for a BE). This leaves a space for
improvement as demonstrated in Section 2. We propose the inclusion of depen-
dent resources reachable from the parent resource through specially annotated
properties, in analogy to [4]. For resource r with resource description R, we fuse
a property p annotated with fusion function DependentResource as follows:
1. Let Dr,p = {o | (r, p, o, g) ∈ R; o, g ∈ U} be the dependent resources. Recursively
fuse the resources in Dr,p as if there were owl:sameAs links between all pairs of
resources in Dr,p. Denote the fused result Fr,p.
2. Let d ∈ U be a new unique URI. Add a new quad (r, p, d, g), and the quads
{(d, q, o, g) | (s, q, o, g) ∈ Fr,p} to the result.
This approach produces a single fused dependent resource (e.g., a single ad-
dress of a BE), and takes advantage of the locality of owl:sameAs links to avoid
incorrect merge of dependent resources with multiple parents. A unique URI is
generated in step (2) so that other parts of the RDF graph where the dependent
resource may occur are not affected by its “local” fusion for one parent resource.
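The two steps above can be rendered schematically as follows. This is an illustrative
sketch only: the helpers fuse_as_same and pick_graph, and the use of a UUID-based URN
for the fresh URI, are assumptions rather than part of LD-FusionTool.

import uuid

def fuse_dependent_resource(r, p, R, fuse_as_same, pick_graph):
    # R            -- resource description of r as a set of (s, p, o, g) quads
    # fuse_as_same -- fuses a set of resources as if connected by owl:sameAs,
    #                 returning their fused quads (hypothetical helper)
    # pick_graph   -- chooses the graph name for the newly added quads (hypothetical)
    # Step 1: dependent resources reachable from r via p, fused recursively.
    # (A full implementation would keep only URI objects here.)
    D = {o for (s, pp, o, g) in R if s == r and pp == p}
    F = fuse_as_same(D)
    # Step 2: mint a fresh URI so that other occurrences of the dependent
    # resources elsewhere in the graph are untouched by this "local" fusion.
    d = "urn:uuid:" + str(uuid.uuid4())
    g = pick_graph(r, p)
    return {(r, p, d, g)} | {(d, q, o, g) for (s, q, o, gg) in F}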
4 Conclusion
Our practical experience with fusion of public procurement data shows that the
graph nature of RDF has its specifics that need to be addressed. State-of-the-art
Linked Data fusion tools do not cover two common patterns in RDF: fusion of
structured values, and dependencies between properties.
We answer this challenge with new data fusion features. We introduce the
concepts of dependent properties and dependent resources, and propose how to
appropriately extend data fusion. The extensions have been implemented in LD-
FusionTool and successfully used to fulfill the goals of our motivational scenario.
The new data fusion features show a new direction in Linked Data fusion –
taking advantage of the broader context in the RDF graph. This can be further
leveraged not only in conflict resolution, but also in quality assessment.
Acknowledgement. This work was supported by a grant from the EU’s 7th
Framework Programme number 611358 provided for the project COMSODE.
References
[1] Bleiholder, J., Naumann, F.: Data Fusion. In: ACM Computing Surveys 41.1 (2008)
[2] Mendes, P. N., Mühleisen, H., Bizer, C.: Sieve: Linked Data Quality Assessment and
Fusion. In: Proceedings of the 2012 Joint EDBT/ICDT Workshops, ACM (2012)
[3] Knap, T., et al.: UnifiedViews: An ETL Framework for Sustainable RDF Data
Processing. In: The Semantic Web: ESWC 2014, Posters and Demos Track (2014)
[4] Mynarz, J., Svátek, V.: Towards a Benchmark for LOD-enhanced Knowledge Dis-
covery from Structured Data. In: Proceedings of the Second International Workshop
on Knowledge Discovery and Data Mining Meets Linked Open Data (2013).
Bio2RDF Release 3: A Larger Connected Network of
Linked Data for the Life Sciences
Michel Dumontier1, Alison Callahan1, Jose Cruz-Toledo2, Peter Ansell3, Vincent
Emonet4, François Belleau4, Arnaud Droit4
1 Stanford Center for Biomedical Informatics Research, Stanford University, CA; 2 IO
Informatics, Berkeley, CA; 3 Microsoft QUT eResearch Centre, Queensland University
of Technology, Australia; 4 Department of Molecular Medicine, CHUQ Research
Center, Laval University, QC
{michel.dumontier, alison.callahan, josemiguelcruztoledo,
peter.ansell, vincent.emonet, francois.belleau,
arnaud.droit}@gmail.com
Abstract. Bio2RDF is an open source project to generate and provide Linked
Data for the Life Sciences. Here, we report on a third coordinated release of
~11 billion triples across 30 biomedical databases and datasets, representing a
10-fold increase in the number of triples since Bio2RDF Release 2 (Jan 2013).
New clinically relevant datasets have been added. New features in this release
include improved data quality, typing of every URI, extended dataset statistics,
tighter integration, and a refactored linked data platform. Bio2RDF data is
available via REST services, SPARQL endpoints, and downloadable files.
Keywords: linked open data, semantic web, RDF
1 Introduction
Bio2RDF is an open-source project to transform the vast collections of heteroge-
neously formatted biomedical data into Linked Data [1], [2]. GitHub-housed PHP
scripts convert data (e.g. flat files, tab-delimited files, XML, JSON) into RDF using
downloadable files or APIs. Bio2RDF scripts follow a basic convention to specify the
syntax of HTTP identifiers for i) source-identified data items, ii) script-generated data
items, and iii) vocabulary used to describe the dataset contents [1]. Bio2RDF scripts
use the Life Science Registry (http://tinyurl.com/lsregistry), a comprehensive list of
over 2200 biomedical databases, datasets and terminologies, to obtain a canonical
dataset name (prefix), which is used to formulate a Bio2RDF URI
(http://bio2rdf.org/{prefix}:{identifier}) and an identifiers.org URI. Each data item is
annotated with provenance, including the URL of the files from which it was generat-
ed. Bio2RDF types and relations have been mapped to the Semanticscience Integrated
Ontology (SIO)[3], thereby enabling queries to be formulated using a single terminol-
ogy [4]. Bio2RDF has been used for a wide variety of biomedical research including
understanding HIV-based interactions [5] and drug discovery [6].
Here, we report an update to the Bio2RDF network, termed Bio2RDF Release 3,
and compare the results to Bio2RDF Release 2.
2 Bio2RDF Release 3
Bio2RDF Release 3 (July 2014) is comprised of ~11B triples across 30 datasets, 10
of which are new since Release 2 (Table 1). The top 3 datasets are Pubmed (scholarly
citations; 5B triples), iProClass (database cross-references; 3B triples), and NCBI
Gene (sequences and annotations; 1B triples). Compared to Release 2, there are 10x
the number of triples, and each dataset has increased by an average of 300%.
Table 1. Bio2RDF Release 3 Datasets.
Dataset (* = new)      triples      % increase      types      out links      in links
Affymetrix 86,943,196 196% 2 30 0
Biomodels 2,380,009 404% 16 50 0
BioPortal 19,920,395 125% - - -
Clinicaltrials * 8,323,598 100% 55 1 1
CTD 326,720,894 230% 9 9 1
dbSNP * 8,801,487 100% 6 5 2
DrugBank 3,649,561 325% 69 23 2
GenAge * 73,048 100% 2 6 0
GenDR * 11,663 100% 4 4 0
GO Annotations 97,520,151 122% 1 6 1
HGNC 3,628,205 434% 3 11 3
Homologene 7,189,769 561% 1 4 0
InterPro 2,323,345 233% 8 18 3
iProClass 3,306,107,223 1564% 0 16 0
iRefIndex 48,781,511 157% 5 24 0
KEGG * 50,197,150 100% 17 43 7
LSR * 55,914 100% 1 2 0
MeSH 7,323,864 176% 7 0 4
MGD 8,206,813 334% 11 8 4
NCBI Gene 1,164,672,432 296% 16 11 11
NDC 6,033,632 34% 12 0 2
OMIM 7,980,581 432% 8 17 8
OrphaNet * 377,947 100% 3 12 2
PharmGKB 278,049,209 733% 18 41 1
PubMed 5,005,343,905 1350% 9 0 18
SABIO-RK 2,716,421 104% 15 11 1
SGD 12,399,627 223% 42 24 1
SIDER * 17,509,770 100% 8 3 0
NCBI Taxonomy 21,310,356 120% 5 2 12
WormBase * 22,682,002 100% 34 5 2
Total 10,495,601,538 370 343 79
Fig. 1. Connectivity in Bio2RDF Release 3 datasets. Nodes represent datasets, edges represent
connections between datasets.
Figure 1 shows a dataset network diagram using pre-computed SPARQL-based graph
summaries (excluding BioPortal ontologies). The network exhibits a power-law
distribution, with a few highly connected nodes connected to a vast number of nodes
with about a single edge.
3 REST services
We redeveloped the Bio2RDF linked data platform to provide 3 basic services
(describe, search, links) by querying the target SPARQL endpoint using Talend ESB,
a graphical Java code generator based on the Eclipse framework. The REST services
now return RDF triples or quads based on content negotiation or RESTful URIs of the
form http://bio2rdf.org/[prefix]/[service]/[format]/[searchterm]. The describe service
returns statements with the searchterm as an identifier in the subject position. The
links service returns triples with the searchterm as an identifier in the object position.
Finally, the search service returns triples containing matched literals. Descriptions of
the datasets and the available services are stored and retrieved by the web application
using a new SPARQL endpoint (http://dataset.bio2rdf.org/sparql).
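As a rough illustration of the URI pattern above, the snippet below builds a service URL
and fetches it with content negotiation. The prefix, search term and Accept value used in
the commented example are illustrative assumptions, not documented constants of the
platform.

import requests

BASE = "http://bio2rdf.org"

def bio2rdf(prefix, service, searchterm, fmt=None, accept="text/turtle"):
    # Follows the documented pattern
    # http://bio2rdf.org/[prefix]/[service]/[format]/[searchterm];
    # the format segment is optional here and content negotiation is used instead.
    parts = [BASE, prefix, service] + ([fmt] if fmt else []) + [searchterm]
    return requests.get("/".join(parts), headers={"Accept": accept})

# Hypothetical usage: describe one NCBI Gene record by its identifier.
# resp = bio2rdf("ncbigene", "describe", "4853")
# print(resp.status_code, resp.headers.get("Content-Type"))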
4 Availability
Bio2RDF is accessible from http://bio2rdf.org. The Bio2RDF scripts, mappings, and
web application are available from GitHub (https://github.com/bio2rdf). A list of the
datasets, detailed statistics, and downloadable content (RDF files, VoID description,
statistics, Virtuoso database) are available from
http://download.bio2rdf.org/current/release.html. Descriptions of Bio2RDF datasets
and file locations are also available from datahub.io.
5 References
[1] A. Callahan, J. Cruz-Toledo, P. Ansell, and M. Dumontier, “Bio2RDF
Release 2: Improved coverage, interoperability and provenance of life science
linked data,” in Lecture Notes in Computer Science (including subseries
Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics),
2013, vol. 7882 LNCS, pp. 200–212.
[2] F. Belleau, M. A. Nolin, N. Tourigny, P. Rigault, and J. Morissette,
“Bio2RDF: Towards a mashup to build bioinformatics knowledge systems,”
J. Biomed. Inform., vol. 41, no. 5, pp. 706–716, 2008.
[3] M. Dumontier et al, “The Semanticscience Integrated Ontology (SIO) for
biomedical research and knowledge discovery.,” J. Biomed. Semantics, vol. 5,
p. 14, 2014.
[4] A. Callahan, J. Cruz-Toledo, and M. Dumontier, “Ontology-Based Querying
with Bio2RDF’s Linked Open Data.,” J. Biomed. Semantics, vol. 4 Suppl 1, p.
S1, 2013.
[5] M. A. Nolin, M. Dumontier, F. Belleau, and J. Corbeil, “Building an HIV data
mashup using Bio2RDF,” Brief. Bioinform., vol. 13, pp. 98–106, 2012.
[6] B. Chen, X. Dong, D. Jiao, H. Wang, Q. Zhu, Y. Ding, and D. J. Wild,
“Chem2Bio2RDF: a semantic framework for linking and data mining
chemogenomic and systems chemical biology data.,” BMC Bioinformatics,
vol. 11, p. 255, 2010.
Infoboxer: Using Statistical and Semantic
Knowledge to Help Create Wikipedia Infoboxes
Roberto Yus1 , Varish Mulwad2 , Tim Finin2 , and Eduardo Mena1
1 University of Zaragoza, Zaragoza, Spain
{ryus,emena}@unizar.es
2 University of Maryland, Baltimore County, Baltimore, USA
{varish1,finin}@cs.umbc.edu
Abstract. Infoboxer uses statistical and semantic knowledge from linked
data sources to ease the process of creating Wikipedia infoboxes. It cre-
ates dynamic and semantic templates by suggesting attributes common
for similar articles and controlling the expected values semantically.
Keywords: Infoboxes, Wikipedia, DBpedia, Semantic Web
1 Introduction
Wikipedia is a free and collaborative encyclopedia launched in 2001 which, as
of June 2014, has more than four million English articles. Wikipedia is centered
around collaboratively creating and editing articles for a variety of topics and
subjects. The information in these articles is often split into two parts: 1) un-
structured text with details on the article’s subject and 2) a semi–structured
infobox that summarizes the most important facts about the article’s subject.
Thus, infoboxes are usually preferred by systems using Wikipedia content (such
as Google’s Knowledge Graph or Microsoft Bing’s Satori) as they are easier to
process by machines.
Current creation of Wikipedia infoboxes is based on templates that are cre-
ated and maintained collaboratively. While templates provide a standardized
way of representing infobox information across Wikipedia articles, they pose
several challenges. Different communities use different infobox templates for articles
of the same category; attribute names differ (e.g., date of birth vs. birthdate),
and attribute values are expressed using a wide variety of measurements and
units [2]. Infobox templates are grouped by article categories, with typically one
template associated with one category (e.g., it is hard to find an infobox tem-
plate for an article whose categories are both Artist and Politician). Given the large
number of Wikipedia categories, it is difficult to create templates for every pos-
sible category and combination. Finally, templates are free-form in nature; when
users fill in attribute values, no integrity check is performed on whether a value is of
the appropriate type for the given attribute, often leading to erroneous infoboxes.
Infoboxer3 is a tool grounded in Semantic Web technologies that overcomes
challenges in creating and updating infoboxes, along the way making the pro-
cess easier for users. Using statistical information from Linked Open Data (LOD)
datasets, Infoboxer helps people populate infoboxes using the most popular at-
tributes used to describe instances for a given category or any combination of
categories, thus generating an infobox “template” automatically. For each at-
tribute or property Infoboxer also identifies the most popular types and provides
them as suggestions to be used to represent attribute values. The attribute value
types allow Infoboxer to enforce semantic constraints on the values entered by
the user. It also provides suggestions for attribute values whenever possible and
links them to existing entities in Wikipedia.
2 Using DBpedia to Help Create Wikipedia Infoboxes
The Infoboxer demonstration presented in this paper uses DBpedia [1], a semi-
structured representation of Wikipedia’s content, to implement and power all of
its features and functionalities. While our demonstration system uses DBpedia, it
could be replaced with any other LOD knowledge base, such as Yago or Freebase.
In the following sections we explain each functionality in detail.
Identifying popular attributes. The most popular attributes for a given category
are generated by computing attribute usage statistics based on instance data for
the category. Infoboxer first obtains a list of DBpedia instances for the given
category. For example, the list of instances associated with the category dbpedia-
owl:SoccerPlayer includes dbpedia:David Beckham and dbpedia:Tim Howard. A
list of attributes used by these instances is generated and then ordered based
on the number of instances using each attribute. Duplicate counts are avoided by
counting each distinct attribute of an instance only once (at this point we want to
know how many different instances of the category are using the property to
highlight its popularity). For example, the property dbpedia-owl:team appears
several times with the soccer player dbpedia:David Beckham (as he played for
several soccer teams), but it is only counted once.
Sorting the list of attributes based on frequency of usage provides Infoboxer
with the most popular attributes for each category. Figure 1 shows the most pop-
ular properties for soccer players, e.g., dbpedia-owl:team, foaf:name, and dbpedia-
owl:position, along with the percentage of instances using them. This first step
could be simplified by only using information about the domains and ranges of
each property (e.g., to obtain properties where the domain is a soccer player).
However, DBpedia does not impose restrictions over domain and range for most
of the properties. In fact, in a previous analysis, we detected that for DBpedia
3.9, 21% of properties have no domain defined, 15% have no range, and 2%
have no domain and range. On July 1, Wikidata, a project focused on human-edited
structured data for Wikipedia, rolled out a similar feature which is restricted to
suggesting only popular properties4.
3 http://sid.cps.unizar.es/Infoboxer
Fig. 1. Screenshot of Infoboxer creating the Wikipedia infobox of a soccer player.
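The attribute counting described above can be approximated with a single aggregate
query against a DBpedia SPARQL endpoint. The sketch below is an illustration of the
idea, not Infoboxer's actual implementation; the public endpoint URL and the result
handling are assumptions.

import requests

ENDPOINT = "http://dbpedia.org/sparql"   # public endpoint; Infoboxer's setup may differ

POPULAR_ATTRIBUTES = """
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT ?p (COUNT(DISTINCT ?s) AS ?instances)
WHERE {
  ?s a dbpedia-owl:SoccerPlayer ;
     ?p ?o .
}
GROUP BY ?p
ORDER BY DESC(?instances)
LIMIT 20
"""

def popular_attributes():
    # COUNT(DISTINCT ?s) counts each instance at most once per property, so
    # dbpedia-owl:team is counted once for dbpedia:David_Beckham even though
    # he is linked to several teams.
    resp = requests.get(ENDPOINT, params={"query": POPULAR_ATTRIBUTES,
                                          "format": "application/sparql-results+json"})
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        print(row["p"]["value"], row["instances"]["value"])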
Identifying popular range types. Infoboxer finds the most popular types used to
represent values for each attribute identified in the previous step. Attribute value
types are akin to the rdfs:range classes associated with an attribute or property in
an ontology. Infoboxer first obtains a list of attribute values for a given category
and attribute by identifying the triples in DBpedia's ABox whose subjects are
instances of the given category and whose property is the given attribute. For example,
the category dbpedia-owl:SoccerPlayer and attribute dbpedia-owl:team generates
a list of values such as dbpedia:Arsenal F.C. and dbpedia:Korea University. A
list of value types is generated from the values and ordered based on number
of instances whose attribute values have the type. Based on the attribute, value
types are either semantic classes, such as dbpedia-owl:SoccerClub and dbpedia-
owl:University, or XML datatypes such as xsd:string, xsd:integer, or xsd:dateTime.
Sorting the list of types provides Infoboxer with the most popular attribute value
(or range) types.
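The same counting idea extends to value types. A query of roughly the following shape,
sendable to the endpoint exactly like the previous sketch, tallies how many instances
use values of each type for the pair (dbpedia-owl:SoccerPlayer, dbpedia-owl:team);
again, this is only an illustration of the idea.

# Illustrative value-type popularity query; run it against the SPARQL endpoint
# in the same way as the attribute-popularity sketch above.
RANGE_TYPES = """
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT ?type (COUNT(DISTINCT ?s) AS ?instances)
WHERE {
  ?s a dbpedia-owl:SoccerPlayer ;
     dbpedia-owl:team ?o .
  ?o a ?type .
}
GROUP BY ?type
ORDER BY DESC(?instances)
LIMIT 3
"""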
Suggesting attribute values and enforcing semantic constraints. The top three
value types for an attribute are provided as suggestions to users as they add
values for the most popular attributes in the infobox. Infoboxer also uses these
types to enforce semantic constraints on the values entered, thus ensuring infobox
correctness. In cases where the value type is a semantic class, Infoboxer retrieves
instances of that class and offers them for auto-completion as the user starts
filling in the value. In cases where the value type is an XML datatype, Infoboxer
4 http://lists.wikimedia.org/pipermail/wikidata-l/2014-July/004148.html
shows the most popular values used as examples. Once the user enters a value,
Infoboxer checks whether the value conforms to the expected type.
Fixing existing infoboxes. Infoboxer also uses its functionalities to improve ex-
isting Wikipedia infoboxes. Given an article title, Infoboxer fetches its categories
and existing attribute values. Then, it highlights popular properties with miss-
ing values and also highlights attribute values that have an incorrect semantic
type. For example, as of June 2014, dbpedia:David Beckham has the value db-
pedia:England national football team (whose rdf:type is dbpedia-owl:SoccerClub)
for the attribute dbpedia-owl:birthPlace, and Infoboxer highlights it as a possible
error since only 2% of soccer players have a soccer club as their birth place (49% of them
have a dbpedia-owl:Settlement and 22% a dbpedia-owl:City). Also, Infoboxer en-
courages users to update the attribute value if it is of a less popular type (e.g.,
suggesting a value of type dbpedia-owl:SoccerClub over dbpedia-owl:Organisation
for the property dbpedia-owl:team).
The combination of the four functionalities allows Infoboxer to dynamically
generate infobox templates, ensure infobox correctness, and assist in fixing
existing ones. Since Infoboxer relies on KBs such as DBpedia, the generated tem-
plates will automatically evolve as the information in those KBs changes over time.
3 Demonstration
The demo will allow users to create new infoboxes and edit existing ones. They
begin by entering the name of a new or existing Wikipedia article and selecting
appropriate categories for it (e.g., Soccer Player and Scientist). Users will be
provided with the most popular attributes to be completed, along with their pop-
ularity, based on the selected categories. For each attribute, users will also be
provided with information about the top three value types; auto-complete will assist
users in selecting the appropriate value. A “Google it” button will help users fire
Google search queries to discover a possible value. Also, as users start filling in
values in the forms, the current version of the infobox will be displayed on the side.
In summary, users will be able to experience how fast and controlled it is to
create semantically correct Wikipedia infoboxes with Infoboxer.
Acknowledgments. This research was supported by the CICYT project TIN-
2010-21387-C02-02, DGA FSE, NSF awards 1228198, 1250627 and 0910838 and
a gift from Microsoft Research.
References
1. Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann,
S.: DBpedia - a crystallization point for the web of data. Web Semantics: Science,
Services and Agents on the World Wide Web 7(3), 154–165 (2009)
2. Morsey, M., Lehmann, J., Auer, S., Stadler, C., Hellmann, S.: DBpedia and the
Live Extraction of Structured Data from Wikipedia. Program: electronic library
and information systems 46, 157–181 (2012)
The Topics they are a-Changing — Characterising
Topics with Time-Stamped Semantic Graphs
A. Elizabeth Cano,1 Yulan He,2 and Harith Alani1
1 Knowledge Media Institute, Open University, UK
ampaeli@gmail.com, h.alani@open.ac.uk
2 School of Engineering and Applied Science, Aston University, UK
y.he@cantab.net
Abstract. DBpedia has become one of the major sources of structured knowl-
edge extracted from Wikipedia. Such structures gradually re-shape the represen-
tation of Topics as new events relevant to such topics emerge. Such changes make
evident the continuous evolution of topic representations and introduce new chal-
lenges to supervised topic classification tasks, since labelled data can rapidly be-
come outdated. Here we analyse topic changes in DBpedia and propose the use
of semantic features as a more stable representation of a topic. Our experiments
show promising results in understanding how the relevance of features to a topic
changes over time.
Keywords: social media, topic detection, DBpedia, concept drift, feature rele-
vance decay
1 Introduction
Supervised topic classifiers which depend on labelled data can rapidly become outdated
since new information regarding these topics emerges. This challenge becomes apparent
when applying topic classifiers to streaming data like Twitter. The continuous change of
vocabulary (in many cases event-dependent) makes the task of retraining such classi-
fiers with fresh topic-label annotations a costly one. In event-dependent topics, not only
do new lexical features re-characterise the topic, but existing features can also potentially
become irrelevant to the topic (e.g., Jan25 being relevant to violence in the Egyptian
revolution is now less relevant to current representations of the topic violence). In dy-
namic environments the expectation that the progressive feature drift of a topic stays in
the same feature space is not normally met.
The incorporation of new event data into a topic representation leads to a linguis-
tic evolution of the topic, but also to a change in its semantic structure. To the best of
our knowledge, none of the existing approaches for topic classification using seman-
tic features [4][2][5][7] has focused on the epoch-based transfer learning task. In this
paper we aim to disseminate our work presented in [1] by summarising our proposed
transfer learning approach for the epoch-based topic classification of tweets. In [1] we
investigate whether the use of semantic features as opposed to lexical features can pro-
vide a more stable representation of a topic. Here we extend our work by presenting,
with infographics, the gain in F-measure in cross-epoch settings for both lexical and
semantic features. This enables us to highlight the relevance of the studied semantic
features over the lexical ones.
1.1 Evolving Topics
DBpedia is periodically updated to incorporate additions and modifications in Wikipedia.
This enables us to track how specific resources evolve over time, by comparing these re-
sources over subsequent DBpedia editions. For example, changes to the semantic graph
for the concept Barack Obama can be derived from snapshots of this resource's seman-
tic graph from different DBpedia dumps3. E.g., in Figure 1, although some of the triples
remain unchanged in consecutive dumps, new triples provide further information on the
resource.
(Figure body: triples for dbp:Barack_Obama across DBpedia dumps 3.6, 3.7 and 3.8, e.g.
dbo:birthPlace dbp:Hawaii, rdf:type yago:PresidentOfTheUnitedStates (rdfs:subClassOf
dbo:Person), skos:subject category:United_States_presidential_candidates,_2012, and
dbo:wikiPageWikiLink links to dbp:Al-Qaeda and dbp:Budget_Control_Act_of_2011.)
Fig. 1. Triples of the Barack Obama resource extracted from different DBpedia dumps (3.6 to
3.8). Each DBpedia dump presents a snapshot in time of factual information of a resource.
Changes regarding a resource are exposed both through new semantic features (i.e.,
triples) and new lexical features (appearing as changes in a resource's abstract). In
DBpedia a topic can be represented by the collection of resources belonging to the
main topic (e.g. cat:War) together with resources (e.g. dbp:Combat assessment) belong-
ing to subcategories (e.g. cat:Military operations) of the main topic. Therefore
a topic's evolution can be easily tracked by following changes in existing and new re-
sources belonging to it.
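As a sketch of how such a topic graph could be assembled, the query below collects the
resources of cat:War together with resources in its direct subcategories. The properties
dcterms:subject and skos:broader are the usual DBpedia category modelling and are used
here as assumptions, since the paper does not spell out the exact query.

# Illustrative SPARQL query, kept as a Python string so it can be sent to a
# DBpedia endpoint (or evaluated over a local dump) by any SPARQL client.
TOPIC_RESOURCES = """
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>
PREFIX cat:     <http://dbpedia.org/resource/Category:>

SELECT DISTINCT ?resource
WHERE {
  { ?resource dcterms:subject cat:War . }
  UNION
  { ?subcat   skos:broader    cat:War .
    ?resource dcterms:subject ?subcat . }
}
"""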
2 Topic Classification with Time-Stamped Semantic Graphs
In [1], we propose a novel transfer learning [6][3] approach to address the classification
task of new data when the only available labelled data belongs to a previous epoch.
The approach relies on the incorporation of knowledge from DBpedia graphs. It
is summarised in Figure 2 and consists of the following stages: 1) Extraction
of lexical and semantic features from tweets; 2) Time-dependent content modelling;
3) Strategy for weighting topic-relevant features with DBpedia; and 4) Construction of
time-dependent topic classifiers based on lexical, semantic and joint features.
Our analysis involves the use of two main feature types: lexical and semantic fea-
tures. The semantic features consist of Class, Property, Category, and Resource. The
semantic feature representation of a document is therefore built upon the collection of
such features derived from the document's entities mapped to DBpedia resources. The
mapping targets the DBpedia dump available when the document was generated. In [1],
we proposed different weighting strategies, some of which made use of graph prop-
erties of a topic in a DBpedia graph. Such strategies incorporated statistics of the topic
graph representation considering a DBpedia graph at time t.
2.1 Construction of Time-Dependent Topic Classifiers
We focus on binary topic classification in epoch-based scenarios, where the classi-
fier that we train on a corpus from epoch t−1 is tested on a corpus from epoch t. Our
3 The DBpedia dumps correspond to Wikipedia articles at different time periods as fol-
lows: DBp3.6 was generated on 2010-10-11, DBpedia 3.7 on 2011-07-22, DBp3.8 on
2012-06-01, and DBp3.9 in late April. The dumps are available for download at
http://wiki.dbpedia.org/Downloads39
Fig. 2. Architecture for backtrack mapping of resources to DBpedia dumps and deriving topic-
relevance based features for epoch-dependent topic classification.
analysis targeted our hypothesis that, as opposed to lexical features which are situation-
dependent and can change progressively in time, semantic structures – including onto-
logical classes and properties – can provide a more stable representation of a Topic.
Following the proposed weighting strategies, the semantic feature representations of
the t−1 corpus and the t corpus are both generated from the DBpedia graph available
at t−1. For example, when applying a classifier trained on data from 2010, the feature
space of a target test set from 2011 is computed based on the DBpedia version used
for training the 2010-based classifier. This is in order to simulate the availability of
resources in a DBpedia graph at a given time. A semantic feature f in a document
x is weighted based on the frequency of f in x, with Laplace smoothing, and on the
topic-relevance of the feature in the DB^t graph:
W_x(f)_{DB^t} = \left[\frac{N_x(f)_{DB^t} + 1}{|F| + \sum_{f' \in F} N_x(f')_{DB^t}}\right] \cdot \left(W_{DB^t}(f)\right)^{1/2} \qquad (1)
where N_x(f) is the number of times feature f appears in all the semantic meta-
graphs associated with document x derived from the DB^t graph; F is the vocabulary
of the semantic feature type, and W_{DB^t}(f) is the weighting function corresponding
to the semantic feature type, computed based on the DB^t graph.
This weighting function captures the relative importance of a document's semantic fea-
tures against the rest of the corpus and incorporates the topic-relative importance of
these features in the DB^t graph.
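Equation (1) translates directly into a few lines of Python. The sketch below assumes the
per-document counts and the topic-relevance weighting are supplied by the caller; the
data structures are stand-ins, not the paper's implementation.

import math

def feature_weight(f, counts, vocabulary, topic_relevance):
    # counts          -- dict: N_x(f') for each feature f' occurring in the semantic
    #                    meta-graphs of document x derived from the DB^t graph
    # vocabulary      -- F, the vocabulary of this semantic feature type
    # topic_relevance -- function returning W_{DB^t}(f) for a feature
    total = sum(counts.get(fp, 0) for fp in vocabulary)            # sum of N_x(f')
    smoothed = (counts.get(f, 0) + 1) / (len(vocabulary) + total)  # Laplace smoothing
    return smoothed * math.sqrt(topic_relevance(f))                # Eq. (1)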
3 Experiments
We evaluated our approach using two collections: DBpedia and Twitter datasets. The
DBpedia collection comprises four DBpedia dumps (3.6 to 3.9)4 . The Twitter datasets
consist of a collection of Violence-related topics: Disaster Accident, Law Crime and
War Conflict. Each of these datasets comprises three epoch-based collections of tweets,
corresponding to 2010, 2011, and 2013. The Twitter dataset contained 12,000 annotated
tweets5. To compare the overall benefit of the proposed weighting strategies
against the baselines on these three topics, we averaged the P, R and F-measure of the
three cross-epoch settings for each topic. Table 1 presents a summarised version of our
results in [1], showing only the best performing features. We can see that on average
the Class-based semantic features improve upon the bag-of-words (BoW) features in
F-measure. This reveals that the use of ontological classes is a more stable option for
the representation of a topic. In order to analyse the differences in F-measure gain
for each topic for each of the examined features we used the radar plots in Figure 3. In
4 General statistics of these dumps are available at http://wiki.dbpedia.org/Downloads39
5 Further information about this dataset is available in [1]
this figure a positive value indicates an improvement of the classifier. While semantic
features improve upon lexical features in the three topics, the weighted features for re-
source, class and category exhibit a positive improvement in these scenarios. Moreover,
the class-based features consistently outperform the BoW in all three topics.
BoW CatSFF CatSFG CatJoint ResSFF ResSFG ResJoint ClsSFF ClsSFG ClsJoint SemSFF SemSFG SemJoint
P 0.808 0.719 0.784 0.775 0.764 0.775 0.777 0.692 0.691 0.705 0.708 0.751 0.75
R 0.429 0.433 0.434 0.383 0.438 0.426 0.408 0.649 0.638 0.640 0.438 0.373 0.404
F 0.536 0.524 0.550 0.501 0.544 0.529 0.517 0.660 0.658 0.665 0.525 0.490 0.518
Table 1. Average results for the cross-epoch scenarios for all three topics.
(Radar plots for the topics Disaster_Accident, Law_Crime and War_Conflict, with one axis per
feature type (BoW and the SFF, SFG and Joint variants of Cat, Prop, Res, Cls, Sem/All) and one
series per cross-epoch setting: 2010-2011, 2010-2013 and 2011-2013.)
Fig. 3. Summary of performance decays for each feature for each Topic on the three cross-epoch
scenarios.
4 Conclusions
Our results showed that Class-based semantic features are much slower to decay than
other features, and that they can improve performance upon traditional BoW-based clas-
sifiers in cross-epoch scenarios. These results demonstrate the feasibility of the use of
semantic features in epoch-based transfer learning tasks. This opens new possibilities
for the research of concept drift tracking for transfer learning based on existing Linked
Data sources.
References
1. A. E. Cano, Y. He, and H. Alani. Stretching the life of twitter classifiers with time-stamped se-
mantic graphs. In ISWC 2014, Riva del Garda, Trentino, Italy, Oct 19-23, 2014. Proceedings,
Lecture Notes in Computer Science. Springer, 2014.
2. A. E. Cano, A. Varga, M. Rowe, F. Ciravegna, and Y. He. Harnessing linked knowledge
sources for topic classification in social media. In Proc. 24th ACM Conf. on Hypertext and
Social Media (Hypertext), Paris, France, 2013.
3. R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
4. Y. Genc, Y. Sakamoto, and J. V. Nickerson. Discovering context: classifying tweets through a
semantic transform based on wikipedia. In Proceedings of the 6th international conference on
Foundations of augmented cognition: directing the future of adaptive systems, FAC’11, pages
484–492, Berlin, Heidelberg, 2011. Springer-Verlag.
5. Y. He. Incorporating sentiment prior knowledge for weakly supervised sentiment analysis.
ACM Transactions on Asian Language Information Processing, 11(2):4:1–4:19, June 2012.
6. S. Thrun. Is learning the n-th thing any easier than learning the first? In Advances in Neural
Information Processing Systems, pages 640–646. The MIT Press, 1996.
7. A. Varga, A. Cano, M. Rowe, F. Ciravegna, and Y. He. Linked knowledge sources for topic
classification of microposts: A semantic graph-based approach. Journal of Web Semantics:
Science, Services and Agents on the World Wide Web (JWS), 2014.
Linked Data and facets to explore text corpora
in the Humanities: a case study
Christian Morbidoni
Semedia, Department of Information Engineering (DII), Università Politecnica delle
Marche, Ancona, Italy
Abstract. Faceted search and browsing is an intuitive and powerful
way of traversing a structured knowledge base and has been applied
with success in many contexts. The GramsciSource project is currently
investigating how faceted navigation and Linked Data can be combined
to help Humanities scholars in working with digital text corpora. In
this short paper we focus on the ”Quaderni dal carcere” by Antonio
Gramsci, one of the most popular Italian philosophers and politicians;
we present our ongoing work and discuss our approach. This consists
of first building an RDF graph to encode different ”levels” of knowledge
about the texts and then extracting relevant graph paths to be used
as navigation facets. We then built a first prototype exploration tool
with a two-fold objective: a) allow non-experts to make sense of the
extremely fragmented and multidisciplinary text corpus, and b) allow
Gramsci scholars to easily select a subset of the corpus of interest as well
as to possibly discover new insights or answer research questions.
Keywords: Faceted browsing, Digital Humanities, Entity Extraction
1 The data
Gramsci’s ”Quaderni dal carcere” is an extremely fragmented corpus composed
of more that 4,000 ”notes’ organized in 29 books (quaderni). Notes vary in
length from single lines to several pages and span di↵erent domains, from soci-
ology and politics to literature. They are available in the GramsciSource digital
library1 (created in the frame of the project) with stable URLs. We built a Linked
Data graph by merging structured knowledge coming from di↵erent sources: the
Linked Data Gramsci Dictionary, data coming from DBpedia Italia2 and seman-
tic annotations made with Pundit [1]3 .
The Linked Data Gramsci Dictionary is a dataset extracted from the
Gramsci Dictionary [2], a recognized scholarly contribution within the inter-
national Gramsci scholarly community, which includes all the most important
topics in Gramsci's thought. Each topic in the dictionary is documented by a
1 The DL is currently officially offline due to maintenance, but can be reached at the
following address: http://89.31.77.216
2 http://it.dbpedia.org
3 http://thepund.it
Fig. 1. A screenshot of the prototype
text with references to specific Gramsci notes from the Quaderni dal Carcere.
We automatically processed such citations using regular expressions, pro-
ducing RDF triples that represent such connections, i.e., relations between
single notes and the dictionary topics they are relevant to.
Entity extraction and linking. Several approaches and tools for extract-
ing and disambiguating relevant entities mentioned in a text have appeared in
recent years. Among them, DataTXT4 is, to our knowledge, one of the best tools
supporting the Italian language. DataTXT derives from previous academic research
[3], makes use of Wikipedia to disambiguate matched entities and to link them
to the Italian DBpedia, and proved to be highly performant even on very short
texts. Running this entity extraction tool on all the notes resulted in over
2,038 notes (50% of the total number) annotated with at least one entity and a
total of 43,000 entities matched. After a manual revision of the results we re-
moved around 30 entities that were clearly wrong matches. We then inspected
80 random notes (2% of the total number of notes) and measured an accuracy
of around 85%. Extracted entities span 144 different entity rdf:types and 5,876
distinct dc:types (which can be considered as entity categories). A more accu-
rate evaluation of the results as well as a better tuning of the tool are goals for
the next stage of the project.
Scholars' annotations. Pundit5 is a semantic web tool that enables users
to produce machine-readable data in the form of RDF by annotating web pages.
Annotations from a single scholar are collected in so-called ”notebooks”, which
can be private or public. For the purpose of our proof of concept we created a
set of sample annotations by manually linking texts to DBpedia and Freebase
entities. At the data representation level, such annotations are equivalent to those
produced by DataTXT; once imported, they are naturally captured by the facet
queries (discussed in the next section).
4 https://dandelion.eu/products/datatxt/
5 http://thepund.it
2 Faceted search prototype
Existing approaches identify relevant facets for browsing an RDF graph based on
quantitative measures such as predicate frequency, balance and object cardi-
nality [4]. These kinds of approaches do not account for the informative content of a
facet and only consider facets derived from a set of triples with the same pred-
icate. In the general case, however, relevant facets could be derived from more
complex paths in the graph. Approaches to automatic facet extraction in such
a general case have been recently proposed [6] and we plan to investigate their
applicability in the near future.
Our simple approach is to derive facets from SPARQL queries of the form:
select distinct ?url ?facet ?value where { CUSTOM_QUERY }
where ?url is a resource of interest (notes in our case), ?facet is a facet name
and ?value is a possible value of such a facet. Such a simple approach is also
quite flexible and allows, for example, easily turning all the datatype properties
of a resource into facets, e.g. with the following query:
select distinct ?uri ?facet ?value where {
?uri rdf:type gramsci:Note. ?uri ?facet ?value.}
Deriving facets from SPARQL queries is an approach already explored in the litera-
ture [5]. For the purpose of our proof of concept we chose candidate graph paths
by inspecting the data and according to scholars' preferences. The facets we
implemented in our prototype are:
– Gramsci Dictionary topics. This facet lists all the dictionary topics where a
note is referenced;
– DBpedia entities. A set of facets where entities mentioned in a note are
grouped according to their rdf:type. The relevant rdf:types identified are Per-
sons, Books, Languages, Places and Events, but they could be more specific
(e.g. Politicians, Artists, Magazines, etc.);
– Categories. A facet listing all the dc:types associated with entities mentioned
in a note;
– Scholars Notebooks. This facet lists all the scholars (Pundit users) who man-
ually annotated a note.
To enable navigation of the corpus along the different ”dimensions”, we im-
plemented a faceted browser based on Apache Solr6. Solr, along with its Ajax-
Solr7 frontend, provides a relatively easy way to build a performant faceted
browser on top of Lucene. We built the Solr index by running the SPARQL
queries (described in the previous section) and using the results associated with the
?uri variable as document ID, ?facet as index field and ?value as field values.
The prototype is available at http://purl.org/gramscisource/quaderni.
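The indexing step can be pictured with the sketch below. The endpoint URL, the Solr core
name, the gramsci: namespace expansion and the use of Solr's JSON update handler are all
placeholders or assumptions; the project's actual code may differ.

import requests
from collections import defaultdict

SPARQL_ENDPOINT = "http://example.org/gramscisource/sparql"          # placeholder
SOLR_UPDATE_URL = "http://localhost:8983/solr/quaderni/update"       # placeholder core

FACET_QUERY = """
PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX gramsci: <http://example.org/gramsci/>
SELECT DISTINCT ?uri ?facet ?value WHERE {
  ?uri rdf:type gramsci:Note .
  ?uri ?facet ?value .
}
"""

def build_index():
    bindings = requests.get(SPARQL_ENDPOINT,
                            params={"query": FACET_QUERY,
                                    "format": "application/sparql-results+json"}
                            ).json()["results"]["bindings"]
    docs = defaultdict(dict)
    for b in bindings:
        doc = docs[b["uri"]["value"]]
        doc["id"] = b["uri"]["value"]                                        # ?uri -> document ID
        doc.setdefault(b["facet"]["value"], []).append(b["value"]["value"])  # ?facet -> field
    # One standard way of loading documents into Solr (JSON update handler).
    requests.post(SOLR_UPDATE_URL, params={"commit": "true"}, json=list(docs.values()))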
6 http://lucene.apache.org/solr/
7 http://github.com/evolvingweb/ajax-solr/wiki
Some usage patterns have been identified by scholars involved in the
project: a) Using the Dictionary facet to intersect two or more topics from the
vocabulary. This is a simple but useful ”advanced search” feature; b) Choose one
or more Dictionary topics (e.g. Storia), then use the facets on the right (DBpedia
entities) to provide additional context (e.g. Hegel, Croce and Plechanov are the
main persons related to History, ”Teoria e storia della storiografia” and ”Misère
de la philosophie” are two related books, etc.); c) Start from a full text search or
from a DBPedia entity (e.g. ”Conte di Montecristo”) and discover related topics.
3 Conclusions and Acknowledgements
In this short paper we discussed preliminary results in leveraging Linked Data in
the GramsciSource project and we presented a proof of concept prototype. Feed-
back from Humanities scholars involved in the project (and in related projects,
such as DM2E8 ) was positive and encouraged us to move further. End user eval-
uation will be run in the coming months. We are currently evaluating automatic
methods to derive entities and facets (e.g. based on language analysis tools such
as [7]), with the aim of making the approach easily applicable to different text
corpora.
This work is supported by the GramsciSource project funded by the Italian
Ministry of Education under the FIRB action.
References
1. Marco Grassi, Christian Morbidoni, Michele Nucci, Simone Fonda and Francesco
Piazza. Pundit: Augmenting Web Contents with Semantics. Literary & Linguistic
Computing, 2013
2. Dizionario gramsciano 1926-1937, Curated by Guido Liguori, Pasquale Voza, Roma,
Carocci Editore, 2009, pp. 918.
3. Paolo Ferragina, Ugo Scaiella, TAGME: on-the-fly annotation of short text frag-
ments (by wikipedia entities), Proceedings of the 19th ACM international conference
on Information and knowledge management, New York, 2010
4. Eyal Oren, Renaud Delbru, Stefan Decker, Extending Faceted Navigation for RDF
Data, The Semantic Web - ISWC 2006, Lecture Notes in Computer Science Volume
4273, 2006, pp 559-572.
5. Philipp Heim, Jürgen Ziegler, Faceted Visual Exploration of Semantic Data, Human
Aspects of Visualization, Lecture Notes in Computer Science Volume 6431, 2011,
pp 58-75.
6. Bei Xu, Hai Zhuge, Automatic Faceted Navigation, Future Generation Computer
Systems archive, Volume 32, March, 2014, Pages 187-197
7. Dell’Orletta F., Venturi G., Cimino A., Montemagni S. (2014) T2K: a System for
Automatically Extracting and Organizing Knowledge from Texts. In Proceedings
of 9th Edition of International Conference on Language Resources and Evaluation
(LREC 2014), 26-31 May, Reykjavik, Iceland.
8 http://dm2e.eu
Dexter 2.0 - an Open Source Tool for
Semantically Enriching Data
Salvatore Trani1,4 , Diego Ceccarelli1,2 , Claudio Lucchese1 ,
Salvatore Orlando1,3, and Raffaele Perego1
1 ISTI-CNR, Pisa, Italy, 2 IMT Lucca, Italy, 3 Ca' Foscari - University of Venice,
4 University of Pisa
{name.surname}@isti.cnr.it
Abstract. Entity Linking (EL) makes it possible to automatically link unstruc-
tured data with entities in a Knowledge Base. Linking unstructured data
(like news, blog posts, tweets) has several important applications: for ex-
ample, it allows enriching the text with useful external content or
improving the categorization and retrieval of documents. In recent
years many effective approaches for performing EL have been proposed,
but only a few authors have published the code to perform the task. In this
work we describe Dexter 2.0, a major revision of our open source frame-
work for experimenting with different EL approaches. We designed Dexter
to make it easy to deploy and to use. The new version provides
several important features: the possibility to adopt different EL strate-
gies at run-time and to annotate semi-structured documents, as well as a
well-documented REST-API. In this demo we present the current state
of the system, the improvements made, its architecture and the APIs
provided.
1 Introduction
In recent years many researchers have proposed new techniques for performing
Entity Linking (or Wikification), which consists of enriching a document with the
entities that are mentioned within it. For example, consider the document in
Figure 1: an EL framework first detects the pieces of text that refer to
an entity, e.g., Maradona, Argentina, or Belgium (usually called mentions
or spots); this step is known as mention detection or spotting. Then the sys-
tem performs the disambiguation step: each spot is linked to an entity chosen
from a list of candidates. The entity is represented by its URI or identifier in
a knowledge base, in our case Wikipedia. As an example, in Figure 1 the cor-
rect entity for the spot Argentina is http://en.wikipedia.org/wiki/Argentina_
national_football_team. Please note that linking the mention to the correct entity
is not a trivial task since often a mention is ambiguous: indeed in the previous
example Argentina is not referring to the most common sense (the country)
but rather to the national football team.
In this demo we present the current status of Dexter, our open source frame-
work for entity linking. We introduced Dexter one year ago [1] in order to provide
Maradona, [http://en.wikipedia.org/wiki/Diego_Maradona] played his first World Cup
tournament [http://en.wikipedia.org/wiki/FIFA_World_Cup] in 1982 when Argentina
[http://en.wikipedia.org/wiki/Argentina_national_football_team] played Belgium
[http://en.wikipedia.org/wiki/Belgium_national_football_team] in the opening game of the 1982 Cup
[http://en.wikipedia.org/wiki/1982_FIFA_World_Cup] in Barcelona
[http://en.wikipedia.org/wiki/Barcelona].
Fig. 1: Example of annotated document
a tool for implementing new EL methods, and for comparing or simply exploiting
the existing EL methods on a common platform.
We designed the framework for researchers and students; Dexter is easy to
deploy: it consists of a unique jar file without external dependencies, and some
binary files representing the model. The user only has to run the program that
will expose a web server providing both a Rest API and a web interface for
performing EL. The framework is highly modular and it allows the developers
to replace single parts of the EL process. It runs on commodity hardware and it
requires only 3 gigabytes of memory.
2 Dexter Framework
2.1 Architecture
Dexter1 is developed in Java, and is organized in several Maven2 modules (as
depicted in Figure 2):
Json-wikipedia3 This module converts the Wikipedia XML dump into a JSON
dump, where each line is a JSON record representing an article. The parser is
based on the MediaWiki markup parser UKP4. While DBpedia only contains
semistructured data extracted from the dump (mainly from the infoboxes) in
RDF format, JSON-Wikipedia contains other fields, e.g., the section headers,
the text (divided in paragraphs), the templates with their schema, emphasized
text, and so on. The module is designed to support different languages;
Dexter-Common Contains the domain objects, shared among all the modules
of Dexter;
Dexter-Core The core implements the EL pipeline (illustrated on the right
of Figure 2): the text is first processed by a Spotter, which produces a list
of spot matches. Each spot match contains the offset of the match in the
document, the list of entities that could be represented by the spot (produced
by an Entity Ranker ) and other features useful to perform the linking. The
spot matches are then processed by a Disambiguator that for each spot
tries to select the correct entity in the list of candidates (often relying on
a Relatedness function, that estimates the semantic distance between two
entities);
1 The project page is http://dexter.isti.cnr.it; the website also presents a demo
2 http://maven.apache.org/
3 json-wikipedia is available at https://github.com/diegoceccarelli/json-wikipedia
4 http://www.ukp.tu-darmstadt.de/software/jwpl/
Dexter-Webapp exposes a REST API for performing the annotations. It
also implements a simple web interface for performing a demo. The current
version of the REST API is briefly described in Table 1, and it is organized in
4 logical categories: the Annotate API, used for annotating a document,
the Spot API, which allows retrieving the candidate spots in a document
and visualizing their features, and the Graph API and the Category API,
which allow browsing, respectively, the Wikipedia article link graph and the
category graph. The current API is available and testable. We provide well-
written documentation for each method, on a web page that also allows the
user to test the service;
Dexter-Client a simple client to perform EL from a client machine, implicitly
calling the REST API.
Fig. 2: Dexter Architecture
2.2 Novel Features
Since the first version we have added the possibility to replace and combine different
versions of the components of the system (the spotter, the disambiguator, the
relatedness function, etc.). An EL annotation can then be performed by providing to
the linker the symbolic names of the components that the developer wants to use
(the spotter x, the disambiguator y, . . . ). More in detail, in the annotate REST
API the spotter and the disambiguator components are parameters, allowing the
use of different EL techniques at run-time. Another interesting feature is
the possibility to annotate semi-structured documents; the previous version, as
well as other EL frameworks, annotates only flat text, i.e., a plain string. In
the new version we added the possibility to annotate documents composed of
several fields (e.g., title, headlines, paragraphs); when developing a new spot-
ter/disambiguator, a researcher can exploit this information about the structure
of a document. It is worth observing that in the new version each candidate
spot also contains the field where the match was performed. The system also
offers a Category API (extracted from the DBpedia categories).
Annotate API
api/rest/annotate Performs the EL on a given text
api/rest/get-desc Given the Wikipedia Id of an entity, returns an object describing
the entity (title, short description, . . . )
Spot API
api/rest/spot Performs only the spotting step on the document, returning a list
of mentions detected in the document, and for each mention some
useful features and the list of possible candidate entities
api/rest/get-spots Given an entity returns all the spots used in the Wikipedia dump
for referring to the entity, for example given the entity Mona lisa,
it returns mona lisa, gioconda, la joconde, . . .
Graph API
api/rest/get-target-entities Returns the entities linked by the given entity
api/rest/get-source-entities Returns the entities that link to the given entity
api/rest/relatedness Returns the semantic relatedness between two entities (by default
using the Milne and Witten formula [2])
Category API
api/rest/get-parent-categories Given a category, returns its parent categories
api/rest/get-child-categories Given a category, returns its child categories
api/rest/get-entity-categories Given an entity, returns its categories
api/rest/get-belonging-entities Given a category, returns the entities belonging to the category
Table 1: The current version of the Dexter’s REST-API
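Calling the annotate endpoint then reduces to an HTTP request, as in the sketch below.
The deployment URL and the exact parameter names (text, spotter, disambiguator) are
assumptions based on the description above; the bundled documentation page gives the
authoritative signatures.

import requests

DEXTER_API = "http://localhost:8080/dexter-webapp/api/rest"   # placeholder deployment URL

def annotate(text, spotter=None, disambiguator=None):
    # 'spotter' and 'disambiguator' select the EL components at run-time,
    # as described in Section 2.2; the parameter names are assumptions.
    params = {"text": text}
    if spotter:
        params["spotter"] = spotter
    if disambiguator:
        params["disambiguator"] = disambiguator
    return requests.get(DEXTER_API + "/annotate", params=params).json()

# Hypothetical usage:
# result = annotate("Maradona played his first World Cup tournament in 1982 ...")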
Finally, we released a framework for evaluating the quality of the annotations
and comparing our framework with the others5 . We are also planning to integrate
our tool with the NERD framework [3].
The demonstration will present the main functionalities provided by our sys-
tem. We will illustrate how to use the API, how to deploy the system on a server,
how to extend the components, and we will show some applications built on top
of Dexter.
Future work. We are planning to add several disambiguators and spotters
proposed in the literature, and to produce a performance comparison on different types
of datasets.
Acknowledgements This work was partially supported by the EU project E-CLOUD
(no. 325091), the Regional (Tuscany) project SECURE! (POR CReO FESR 2007/2011),
and the Regional (Tuscany) project MAPaC (POR CReO FESR 2007/2013).
References
1. D. Ceccarelli, C. Lucchese, S. Orlando, R. Perego, and S. Trani. Dexter: an open
source framework for entity linking. In ESAIR, 2013.
2. D. Milne and I. H. Witten. Learning to link with wikipedia. In Proceedings of
CIKM, 2008.
3. G. Rizzo and R. Troncy. Nerd: a framework for unifying named entity recognition
and disambiguation extraction tools. In Proceedings of EACL, 2012.
5 https://github.com/diegoceccarelli/dexter-eval
A Hybrid Approach to Learn Description Logic based
Biomedical Ontology from Texts*
Yue Ma1 and Alifah Syamsiyah1,2
1 Institute of Theoretical Computer Science, Technische Universität Dresden, Germany,
2 Free University of Bozen-Bolzano
mayue@tu-dresden.de, alifah.syamsiyah@stud-inf.unibz.it
Abstract. Augmenting formal medical knowledge is straightforward neither manually
nor automatically. However, this process can benefit from the rich information
in narrative texts, such as scientific publications. Snomed-supervised relation ex-
traction has been proposed as an approach for mining knowledge from texts in
an unsupervised way. It can capture not only superclass/subclass relations but also
existential restrictions, and hence produces more precise concept definitions. Based on
this approach, the present work aims to develop a system that takes biomedical
texts as input and outputs the corresponding EL++ concept definitions. Several
extra features are introduced in the system, such as generating general class inclu-
sions (GCIs) and negative concept names. Moreover, the system allows users to
trace textual causes for a generated definition, and also give feedback (i.e. correc-
tion of the definition) to the system to retrain its inner model, a mechanism for
ameliorating the system via interaction with domain experts.
1 Introduction
Biomedicine is a discipline that involves a large number of terminologies, concepts, and
complex definitions that need to be modeled in a comprehensive knowledge base to be
shared and processed distributively and automatically. The National Library of Medicine
(NLM) has maintained the world’s largest biomedical library since 1836 [5]. One of the
medical terminologies preserved by NLM is Systematized Nomenclature of Medicine
Clinical Terms (SNOMED CT). It is a comprehensive clinical vocabulary structured in a
well-defined form that has the lightweight Description Logic EL++ [2] as the underlying
logic, which can support automatic checking of modeling consistency.
However, creating, maintaining, and extending a formal ontology is an expensive
process [6]. In contrast, narrative texts, such as medical records, health news, and
scientific publications, contain rich information that is useful for augmenting a medical
knowledge base. In this paper, we propose a hybrid system that can generate EL TBoxes
from texts. It extends the formal definition candidates learned by the Snomed-supervised
relation extraction process [4, 3] with linguistic patterns to give a finer-grained translation
of the learned candidates. Besides generating the concept name hierarchy, which has been
widely studied, the system can also generate definitions with existential restrictions to
exploit the expressivity of EL. Moreover, the implemented Graphical User Interface
helps a user visualize the flow of this framework, track the textual sentences from
which a formal definition is generated, and give feedback to enhance the system
interactively. The implementation of the system can be found at
https://github.com/alifahsyamsiyah/learningDL.
* We acknowledge financial support by the DFG Research Unit FOR 1513, project B1. Alifah
Syamsiyah was supported by the European Master's Program in Computational Logic (EMCL)
Fig. 1. Hybrid approach overview: the upper dark block is the machine learning phase that extracts
definition candidates, the lower left dark block is the pattern-based transformation of definition
candidates, and the brown ellipse is the interaction with users.
2 Task and Our Approach
Our task is to generate EL definitions from textual sentences. For example, from the
sentence “Baritosis is a pneumoconiosis caused by barium dust”, it is desired to have an
automatic way to generate the formal EL axiom (together with some confidence value),
as shown in the red frame of Figure 2. Moreover, to help users understand the origin of
a generated definition and/or give their feedback, the system should be able to trace
the textual sources from which a definition is generated (implemented with the question
mark in our system), and allow users to correct automatically learned definitions (the “V”
mark in Figure 2).
Fig. 2. An illustrative example for the functionality of the system (CA is shortened form for the
SNOMED CT relation Causative_Agent)
Below we describe our hybrid system that has two components, as shown in Figure 1:
one for extracting definition candidates by machine learning techniques, and the other
for formulating final definitions from the definition candidates by linguistic patterns.
2.1 Extracting Formal Definition Candidates
The first part of the system is to generate definition candidates via the steps given in
the upper block of Figure 1. It again contains two components: learning a model from
the training data (training texts and ontology) and generating candidates from new texts.
Each step in the two components is described below.
Common steps in processing training and test texts. One is to recognize SNOMED CT
concept names from a given textual sentence, called ontology annotation in Figure 1.
In our implementation, this is done by invoking the tool Metamap [1]. Since we are only
interested in the most specific and precise concepts, we filter the Metamap annotations
by keeping only those that refer to the head of a phrase and are not verbs. The other common
processing step for training and test sentences is to extract textual features for a pair of
concepts occurring in a sentence, called feature extraction. Currently, the system uses
classical lexical features of n-grams over both characters and words as in [4].
A special processing step on training texts generates labelled training data
and learns a multi-class classification model for the predefined relations [4]:
– Automatic generation of training data is realized by the step named Relationship
Alignment, which matches a sentence annotated by Metamap with relationships
between concept names from the ontology: if a sentence contains a pair of concepts
that has a relation R according to the ontology, the sentence is considered a textual
representation of R and is thus labelled with R. Furthermore, we also consider the
inverse roles that often appear in texts via active and passive sentences. Hence, if
there are n predefined relations, there will be 2n possible labels for a sentence (see the sketch after this list).
– Building probabilistic model is to learn a probabilistic multi-class classification
model based on the textual features of labelled sentences from the previous step. For
this, the current system uses the maximum entropy Stanford Classifier1.
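To make the Relationship Alignment step concrete, the following minimal Python sketch labels a Metamap-annotated sentence with a relation (or its inverse) taken from the ontology. The ontology_relations table, the annotation format, and the "-1" suffix for inverse roles are illustrative assumptions, not the system's actual data structures.

from itertools import permutations

# Hypothetical lookup: ordered concept pairs -> SNOMED CT relation name.
ontology_relations = {
    ("Baritosis", "Barium_dust"): "Causative_agent",
}

def label_sentence(sentence_text, annotated_concepts):
    """Return (sentence, concept pair, label) training examples."""
    examples = []
    for a, b in permutations(annotated_concepts, 2):
        if (a, b) in ontology_relations:
            examples.append((sentence_text, (a, b), ontology_relations[(a, b)]))
        elif (b, a) in ontology_relations:
            # Inverse role: the pair occurs in object-subject order in the text,
            # so n predefined relations yield up to 2n possible labels.
            examples.append((sentence_text, (a, b), ontology_relations[(b, a)] + "-1"))
    return examples

print(label_sentence("Baritosis is a pneumoconiosis caused by barium dust",
                     ["Baritosis", "Barium_dust"]))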
A special processing on test texts is to extract definition candidates from a new test
sentence. A definition candidate is a triple (A, R, B) where A, B are concept names and
R is a relation, meaning that A and B have a relation R according to a test sentence.
– Concept pair extraction is to get pairs of concepts from an annotated test sentence.
– Multiclass classification is to answer whether a pair of concept names has a
relation, and if yes, which relation it is. This part can be achieved by the model
learned from training data by Stanford Classifier. A positive answer returned by the
classifier gives a definition candidate (A, R, B). Slightly abusing the notation,
we also call ∃R.B a definition candidate for A.
2.2 Pattern based Transformation of Definition Candidates
Once we get a definition candidate, we first normalize inverse roles so that the relation al-
ways appears as an active role. Next, different from [4], we distinguish two ways
to formalize it: (1) into a subsumption (A ⊑ ∃R.B) or (2) into a conjunction (A ⊓
∃R.B). For example, the sentence “Baritosis is caused by barium dust” stands for
the subsumption Baritosis (disorder) ⊑ ∃Causative_agent.Barium_dust; whilst
“Chest pain from anxiety ...” corresponds to the conjunction Chest_pain (disorder) ⊓
∃Causative_agent.Anxiety (disorder). To decide which transformation to apply to a defini-
tion candidate, we follow the intuition observable from the above examples:
– A subsumption A ⊑ ∃R.B should be generated from a candidate (A, R, B) if A
and B are connected in the sentence in a subject-object relation, called S-form.
– A conjunction A ⊓ ∃R.B should be formed if A and B appear in a noun phrase
structure, called NP-form.
1 http://nlp.stanford.edu/software/classifier.shtml
To implement this linguistic pattern based strategy, we use the Stanford Parser2 to
obtain the syntactic parse tree of a test sentence. The S-form and NP-form are detected in
the following way: First, the phrases corresponding to A and B are recognized in the
sentence, and then the least common node of these two phrases is searched for in the
syntactic parse tree of the whole sentence. If the least common node has type S (resp.
NP)3, then A and B are in S-form (resp. NP-form). Otherwise, a parsing error is returned.
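A minimal sketch of this step, assuming the parse tree is available as an nltk Tree and using Manchester-syntax-like strings for the generated axioms; the helper names and the example tree are illustrative assumptions rather than the actual implementation.

from nltk.tree import Tree

def least_common_node_label(tree, leaf_a, leaf_b):
    """Label of the lowest node dominating both leaves (the head words of A and B)."""
    path_a = tree.leaf_treeposition(leaf_a)
    path_b = tree.leaf_treeposition(leaf_b)
    common = []
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common.append(x)
    return tree[tuple(common)].label()

def to_axiom(a, r, b, lcn_label):
    if lcn_label == "S":                        # subject-object relation -> subsumption
        return f"{a} SubClassOf {r} some {b}"   # A ⊑ ∃R.B
    if lcn_label == "NP":                       # noun phrase structure -> conjunction
        return f"{a} and ({r} some {b})"        # A ⊓ ∃R.B
    raise ValueError("parsing error: unsupported least common node type")

tree = Tree.fromstring("(S (NP (NN Baritosis)) (VP (VBZ is) (VP (VBN caused) "
                       "(PP (IN by) (NP (NN barium) (NN dust))))))")
print(to_axiom("Baritosis", "Causative_agent", "Barium_dust",
               least_common_node_label(tree, 0, 4)))   # leaves 0 and 4: S-form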
Negated Concept Names. In natural language, a negative formulation is sometimes used to express
the opposite meaning. For example, the sentence “The disease from foot is not relative to
heart attack” will be translated to Disease (disorder) ⊓ ∃FS.Foot (body structure) ⊑
¬Heart_disease (disorder). This is achieved in the system based on negated atomic
concept names detectable by Metamap version 2013.
2.3 Tracing Source Sentence and Classifier Model Retraining
There are two extra functions provided by the system, namely tracing to the source sentence and
classifier model retraining. As shown in Figure 2, if a user clicks the "?" mark, the system
will provide the sentences from which the formal definition was extracted. Note that the system
uses a machine learning approach to acquire definition candidates, which may be wrong.
Therefore, we provide a mechanism for the user to validate the answer by clicking the "V"
symbol and then giving the correct relation to link the two concept names. As shown in Figure
3, the user changes the role from the inverse of Finding Site (FS-1) to Causative
Agent (CA).
Fig. 3. Interaction with users: change the predicted relation in a definition candidate (FS is the
shortened form for the SNOMED CT relation Finding_site, and FS-1 is the inverse role of FS)
References
1. Aronson, A.R., Lang, F.M.: An overview of metamap: historical perspective and recent
advances. JAMIA 17(3) (2010) 229–236
2. Baader, F., Brandt, S., Lutz, C.: Pushing the EL envelope. In: Proceedings of IJCAI’05. (2005)
3. Ma, Y., Distel, F.: Concept adjustment for description logics. In: Proceedings of K-Cap’13.
(2013) 65–72
4. Ma, Y., Distel, F.: Learning formal definitions for Snomed CT from text. In: Proceedings of
AIME’13. (2013) 73–77
5. National Library of Medicine: NLM overview. http://www.nlm.nih.gov/about/index.html
(2014)
6. Simperl, E., Bürger, T., Hangl, S., Wörgl, S., Popov, I.: Ontocom: A reliable cost estimation
method for ontology development projects. Web Semantics: Science, Services and Agents on
the World Wide Web 16(5) (2012)
2 http://nlp.stanford.edu/software/lex-parser.shtml
3 “S” is for sentence, and “NP” for noun phrase.
Identifying First Responder Communities Using
Social Network Analysis
John S. Erickson, Katie Chastain, Evan W. Patton, Zachary Fry, Rui Yan,
James P. McCusker, Deborah L. McGuinness
Rensselaer Polytechnic Institute, Tetherless World Constellation
110 8th Street, Troy NY 12180 USA
{erickj4}@cs.rpi.edu
Abstract. First responder communities must identify technologies that
are e↵ective in performing duties ranging from law enforcement to emer-
gency medical to fire fighting. We aimed to create tools that gather and
assist in quickly understanding responders’ requirements using semantic
technologies and social network analysis. We describe the design and pro-
totyping of a set of semantically-enabled interactive tools that provide
a “dashboard” for visualizing and interacting with aggregated data to
perform focused social network analysis and community identification.1
Keywords: first responders, emergency response, network analysis, topic
modeling
1 Introduction
In response to a request from NIST to develop approaches to using social net-
works and associated technology to improve first responder e↵ectiveness and
safety, we used semantic technologies and social network analysis to locate
Twitter-based first responder sub-communities and to identify current topics
and active stakeholders within those communities. Our objective is to create a
repeatable set of Twitter-compatible methods that constitute an initial require-
ments gathering process. We report on using social media analysis techniques
for the tasks of identifying first responder communities and on examining tools
and techniques for identifying potential requirements stakeholders within those
networks.
Our First Responders Social Network Analysis Workflow (Figure 1²) has
helped researchers make sense of the vast quantity of information moving through
Twitter. Identified stakeholders might be engaged by researchers in (for exam-
ple) participatory design 3 tasks that are elements of a requirements gathering
1 A technical report discussing this work in greater detail may be found at [3]. All tool
screenshots mentioned in this paper appear in the tech report in greater detail.
2 See also http://tw.rpi.edu/media/latest/workflow2
3 Participatory design studies end user participation in the design and introduction of
computer-based systems in the workplace, with the goal of creating a more balanced
relationship between the technologies being developed and the human activities they
are meant to facilitate. See e.g. [4], citing [5].
Fig. 1. Overview of a First Responders Social Network Analysis Workflow
methodology. We present first responder-related Twitter data and metadata
through interfaces that reduce the overall information load, making it possible to keep up with the
quickly-changing environment of social media.
2 Identifying First Responder Communities During
Disasters
We employed the Twitter Search API4 to collect tweets containing one or more
hashtags from a list of 17 hashtags identified as relevant by the first responder
community.5 We report on two events: the anticipated February 2013 Nemo
storm and the unanticipated Boston Marathon bombing. A visualization tool
allows browsing over time showing (for example) total tweets for a hashtag over
time while enabling a user to zoom in and explore with finer temporal granularity.
3 Identifying Themes through Topic Modelling
We created a tool to visualize and enable interaction with topic modeling6 re-
sults, applying MALLET (http://mallet.cs.umass.edu/ ) across Twitter
sample data. The tool presents topics as a pie chart; each ”pie slice” represents
an emergent topic, with assigned names indicating the most prevalent hashtags
occurring in that topic. A popup list of hashtags enables the researcher to view
other hashtags that are more loosely related to the topic.
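As a hedged illustration of this topic-modeling step, the sketch below runs LDA with gensim instead of MALLET (a deliberate substitution for brevity); the toy tweet tokens and parameter values are assumptions, not the actual dataset or configuration.

from gensim import corpora, models

tweets = [
    ["#nemo", "blizzard", "boston", "snow"],
    ["#bostonmarathon", "explosion", "police", "response"],
    ["#nemo", "power", "outage", "storm"],
]  # toy, pre-tokenized sample

dictionary = corpora.Dictionary(tweets)
corpus = [dictionary.doc2bow(t) for t in tweets]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# Each emergent topic can then be named after its most prevalent hashtags/terms.
for topic_id, words in lda.show_topics(num_topics=2, num_words=4, formatted=False):
    print(topic_id, [w for w, _ in words])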
4 See e.g. “Using the Twitter Search API”, http://bit.ly/1sY7O
5 For the complete list see http://www.sm4em.org/active-hashtags/
6 See especially [2]
4 Identifying Hashtags of Interest
Machine learning can help researchers identify hashtags that are topically related
to the user’s area of interest but might not be immediately obvious. One of our
tools uses co-occurrence to help identify evolving hashtags of interest, relating
Twitter frequent posters with hashtags. The intensity of each cell in a matrix
indicates the relative frequency with which a given user has tweeted using a
particular hashtag. Users are filtered by weighted entropy and a subset is selected
to provide the most coverage over hashtags of interest. Researchers may use this
tool to develop a fine-grained understanding of topics and to pinpoint users of
interest for further requirements gathering.
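A minimal sketch of the co-occurrence matrix and user filtering, assuming a list of (user, hashtag) pairs as input; since the abstract does not give the exact weighting, the entropy-based score used below is an illustrative assumption.

import math
from collections import Counter, defaultdict

posts = [("alice", "#nemo"), ("alice", "#blizzard"), ("alice", "#nemo"),
         ("bob", "#nemo"), ("carol", "#bostonmarathon"), ("carol", "#nemo"),
         ("carol", "#response")]   # toy (user, hashtag) pairs

# user x hashtag frequency matrix (cell intensity in the visualization).
matrix = defaultdict(Counter)
for user, tag in posts:
    matrix[user][tag] += 1

def weighted_entropy(counts):
    """Spread over hashtags, scaled by posting volume (illustrative weighting)."""
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total, 2) for c in counts.values())
    return entropy * math.log(total + 1, 2)

# Select a small user subset that still covers many hashtags of interest.
ranked = sorted(matrix, key=lambda u: weighted_entropy(matrix[u]), reverse=True)
print(ranked[:2])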
5 Multi-modal Visualization Tools
Situations may arise where close examination of network dynamics and conversa-
tion evolution is necessary. “Multi-modal” data visualizations enable researchers
to move seamlessly from macro-scale visualizations to the micro-scale of individ-
ual tweets. Fig. 2 shows one level where the propagation and retweeting of themes
can be dynamically observed, while another level supports examination of indi-
vidual tweets and associated media content.7
Fig. 2. Multi-modal visualization of Boston Marathon Twitter activity
7 Further details of the Twitter dataset used for this visualization may be found in [3].
6 Discussion and Conclusions
Our social network analysis and visualization tools demonstrate methods of pas-
sive social network monitoring8 intended to help researchers discover topical
social network conversations among first responders. These tools have limited
ability to connect and engage researchers with individual persons of interest.
Current and future work includes extending the tools to expose and make ac-
tionable more user information, including identifying which individuals are most
active on pertinent hashtags and are stakeholders of interest from a requirements
gathering perspective. The time-sensitive nature of any Twitter sample dataset
requires that visualization tools be adept at filtering over time periods of inter-
est. Current and future work includes improved support for browsing over time
with emphasis on finding and understanding topic shifts.
Recent events have demonstrated [1] that passive studies using social net-
work data without full user knowledge and consent may backfire. Further studies
should carefully examine the social implications of this work and in particular
seek to understand at what point, if any, researchers should seek informed con-
sent from potential stakeholder candidates.
7 Acknowledgements
We are grateful to the Law Enforcement Standards Office (OLES) of the U.S.
National Institute of Standards and Technology (NIST) for sponsoring this work,
and members of the DHS First Responders Communities of Practice Virtual
Social Media Working Group (VSMWG) for numerous helpful discussions.
References
1. Arthur, C.: Facebook emotion study breached ethical guidelines, researchers say.
The Guardian (June 2014), http://bit.ly/1kuebVW
2. Blei, D.M.: Probabilistic topic models. Communications of the ACM 55, 77–84
(April 2012), http://bit.ly/1rdOccT
3. Erickson, J.S., Chastain, K., Patton, E., Fry, Z., Yan, R., McCusker, J., McGuin-
ness, D.L.: Technical report: Identifying first responder communities using so-
cial network analysis. Tech. rep., RPI (July 2014), tw.rpi.edu/web/doc/tr_
firstresponder_communityanalysis
4. Kensing, F., Blomberg, J.: Participatory design: Issues and concerns. Computer
Supported Cooperative Work 7, 167–185 (1998), http://bit.ly/1zESj44
5. Suchman, L.: Forward. In: Schuler, D., Namioka, A. (eds.) Participatory Design:
Principles and Practices. pp. vii–ix (1993)
8 Passive monitoring supports constant monitoring of a “default” set of known first
responder hashtags, meaning that when unanticipated events such as natural dis-
asters happen, it is likely we’ll have a useful if not perfect sample dataset. Active
monitoring, conducted after the fact, supports a deeper examination of user activity,
including a focused examination of retweets and an investigation of “spontaneous”
hashtags that emerge throughout the event.
Exploiting Semantic Annotations for
Entity-based Information Retrieval
Lei Zhang1 , Michael Färber1 , Thanh Tran2 , and Achim Rettinger1
1 Institute AIFB, Karlsruhe Institute of Technology, Germany
2 San Jose State University, USA
{l.zhang,michael.faerber,rettinger}@kit.edu,
{ducthanh.tran}@sjsu.edu
Abstract. In this paper, we propose a new approach to entity-based
information retrieval by exploiting semantic annotations of documents.
With the increased availability of structured knowledge bases and
semantic annotation techniques, we can capture documents and queries
at their semantic level to avoid the high semantic ambiguity of terms and
to bridge the language barrier between queries and documents. Based on
various semantic interpretations, users can refine the queries to match
their intents. By exploiting the semantics of entities and their relations
in knowledge bases, we propose a novel ranking scheme to address the
information needs of users.
1 Introduction
The ever-increasing amount of semantic data on the Web poses new challenges
but at the same time open up new opportunities for information access. With
the advancement of semantic annotation technologies, the semantic data can be
employed to significantly enhance information access by increasing the depth
of analysis of current systems, whereas traditional document search already excels at
the shallow information needs expressed by keyword queries, where meaningful
semantic annotations contribute very little. There is a pressing need to
exploit the currently emerging knowledge bases (KBs), such as DBpedia and
Freebase, as underlying semantic model and make use of semantic annotations
that contain vital cues for matching the specific information needs of users.
There is a large body of work in which documents are automatically analyzed and
the analysis results, such as part-of-speech tags, syntactic parses, word senses, and
named entity and relation information, are leveraged to improve the search
performance. A study [1] investigates the impact of named entity and relation
recognition on search performance. However, this kind of work is based on natural
language processing (NLP) techniques to extract linguistic information from
documents, where the rich semantic data on the Web has not been utilized. In [2],
an ontology-based scheme for semi-automatic annotation of documents and a
retrieval system is presented, where the ranking is based on an adaptation of the
traditional vector space model taking into account adapted TF-IDF weights.
Our work belongs to this line of research. Nevertheless, it provides a
significantly new search paradigm. The main contributions include: (1) The rich
semantics in KBs are used to yield the semantic representations of documents
and queries. Based on the various semantic interpretations of queries, users
can refine them to match their intents. (2) Given our emphasis on the semantics
of entities and relations, we introduce a novel scoring mechanism to influence
document ranking through manual selection of entities and weighting of relations
by users. (3) Another important feature is the support of cross-linguality, which
is crucial when queries and documents are in different languages.
2 Document Retrieval Process
In this section, we present our document retrieval process, which consists of five
steps. While lexica extraction and text annotation are performed offline, entity
matching, query refinement and document ranking are handled online based on
the index generated by offline processing.
Lexica Extraction. In this step, we constructed the cross-lingual lexica by
exploiting the multilingual Wikipedia to extract the cross-lingual groundings of
entities in KBs, also called surface forms, i.e., words and phrases in different
languages that can be used to refer to entities [3]. Besides the extracted surface
forms, we also exploit statistics of the cross-lingual groundings to measure the
association strength between the surface forms and the referent entities.
Text Annotation. The next step is performed to enrich documents with
entities in KBs to help bridge the gap between the ambiguity of natural language text
and the precise formal semantics captured by KBs, as well as to transform documents in
different languages into a language-independent representation. For this purpose,
we employ our cross-lingual semantic annotation system [4] and the resulting
annotated documents are indexed to make them searchable with KB entities.
Entity Matching. Our online search process starts with the keyword query
in a specific language. Instead of retrieving documents, our approach first finds
entities from KBs matching the query based on the index constructed in the
lexica extraction step. These entities represent different semantic interpretations
of the query and thus are employed in the following steps to help users to refine
the search and influence document ranking according to their intents.
Query Refinement. Different interpretations of the query are presented for
users to select the intended ones. Since interpretations correspond to entities in
this step, users can choose the intended entity for refinement of their information
needs. We also enable users to adjust the weights of entity relations to influence
the document ranking for a personalized document retrieval. For this, the chosen
entity is shown and extended with relations to other entities retrieved from KBs.
Document Ranking. After query refinement by users, the documents in
different languages containing the chosen entity are retrieved from the index
constructed by text annotation. Then, we exploit the semantics of entities and
relations for ranking. We observe that annotated documents generally share the
following structural pattern: every document is linked to a set of entities, where
a subset (or several subsets) of these entities is connected via relations in the
KB, forming a graph (or graphs). In this regard, a document can be conceived as
a graph containing several connected components. Leveraging this pattern, we
propose a novel ranking scheme based on the focus on the chosen entity and the
relevance to the weighted relations.
Focus-Based Ranking: Intuitively, given two documents d1 and d2 retrieved
for the chosen entity e, d1 is more relevant than d2 if it focuses more on e than
d2 does, i.e., when the largest connected component of d1 containing e is larger
than that of d2. Based on this rationale, we propose Score_Focus(d, e) between
document d and entity e to capture the focus of d on e as follows:

    Score_Focus(d, e) = |LCC_d^e|                                    (1)

where LCC_d^e is the largest connected component of d containing e and |LCC_d^e|
represents the number of entities in LCC_d^e.
Relation-Based Ranking: Given the chosen entity e, the users can weight
both the existence and the occurrence frequency of its relations to influence the
document ranking. This differentiation separates the one scenario, where users
are interested in obtaining more detailed information about the relationship
(qualitative information), from the other, where users are interested in the
quantity. Let R_e be the set of relations of the chosen entity e. We define x_r = 1
if r ∈ R_e, otherwise 0, and y_r = log(|r_d| / avg_r), where |r_d| denotes the occurrence
frequency of r in d and avg_r is the average occurrence frequency of r. Then,
we propose Score_Relation(d, e) between document d and entity e to capture the
relevance of d to the weighted relations in R_e as follows:

    Score_Relation(d, e) = Σ_{r ∈ R_e} ( x_r · w_r^existence + y_r · w_r^frequency )    (2)

where w_r^existence and w_r^frequency are weights given by users for the existence and
the occurrence frequency of relation r, respectively.
By taking into account both focus-based and relation-based ranking, we
present the final function for scoring the documents as given in Eq. 3.

    Score(d, e) = ( Score_Focus(d, e) · Score_Relation(d, e) ) / ndl_d^e    (3)
where ndl_d^e is the normalized document length of d w.r.t. annotations, i.e. the
number of entities contained in d, which is used to penalize documents in
accordance with their lengths, because a document containing more entities has
a higher likelihood of being retrieved. The effect of this component is similar to
that of the normalized document length w.r.t. terms in IR. We can compute it as

    ndl_d^e = (1 − s) + s · ( ef_d / avg_ef )    (4)
where ef_d denotes the total number of entities in d, avg_ef is the average number
of entities in the document collection, and s is a parameter taken from the IR
literature, which is typically set to 0.2.
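To make Eqs. 1–4 concrete, the sketch below computes the scores for one document, assuming each annotated document is given as a graph over its entities with KB relations as edges (networkx is used for the connected components). The toy document, the reading of x_r as "relation r occurs in d", and the weight values are assumptions for illustration only.

import math
from collections import Counter
import networkx as nx

def score_focus(doc_graph, e):
    """Eq. 1: size of the largest connected component of the document graph containing e."""
    return max((len(c) for c in nx.connected_components(doc_graph) if e in c), default=0)

def score_relation(doc_graph, e, weights, avg_freq):
    """Eq. 2: user-weighted existence / frequency of the relations of e found in the document."""
    freq = Counter(data["relation"] for _, _, data in doc_graph.edges(e, data=True))
    total = 0.0
    for r, (w_exist, w_freq) in weights.items():      # weights: r -> (w^existence, w^frequency)
        x_r = 1 if r in freq else 0                   # assumption: x_r marks r occurring in d
        y_r = math.log(freq[r] / avg_freq[r]) if r in freq else 0.0
        total += x_r * w_exist + y_r * w_freq
    return total

def score(doc_graph, e, weights, avg_freq, avg_ef, s=0.2):
    ndl = (1 - s) + s * doc_graph.number_of_nodes() / avg_ef                      # Eq. 4
    return score_focus(doc_graph, e) * score_relation(doc_graph, e, weights, avg_freq) / ndl  # Eq. 3

doc = nx.Graph()
doc.add_edge("Barack_Obama", "United_States", relation="presidentOf")
doc.add_edge("Barack_Obama", "Michelle_Obama", relation="spouse")
doc.add_node("Hawaii")                                # annotated but unconnected entity
print(score(doc, "Barack_Obama",
            weights={"presidentOf": (1.0, 0.5), "spouse": (0.5, 0.0)},
            avg_freq={"presidentOf": 1.0, "spouse": 1.0}, avg_ef=4))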
3 Evaluation
We now discuss our preliminary evaluation results. In the experiment, we use
DBpedia [5] as the KB and Reuters Corpus Volume 1 (RCV1) as the document
corpus containing about 810,000 English news articles. To assess the effectiveness
of our approach, we investigate the normalized discounted cumulative gain
(nDCG) measure of the top-k results instead of the common measures like
precision and recall, which are not suitable for our scenario because the results
can be different in relevance for each query and differ for each facet or weight
used. We asked volunteers to provide keyword queries in Chinese (17 in total),
along with descriptions of the intents used to set the weights for the relations,
which yielded an average nDCG of 0.87 and an average of 612 results per query.
4 Conclusions and Future Work
In this paper, we show that the semantics captured in KBs can be exploited
to allow the information needs to be specified and addressed on the semantic
level, resulting in the semantic representations of documents and queries, which
are language independent. The user feedback on our demo system [6] suggests
that the proposed approach enables more precise refinement of the queries and
is also valuable in terms of cross-linguality. In the future, we plan to advance
the query capability to support keyword queries involving several entities and
conduct more comprehensive experiments to evaluate our system.
Acknowledgments. This work is supported by the European Community’s
Seventh Framework Programme FP7-ICT-2011-7 (XLike, Grant 288342) and
FP7-ICT-2013-10 (XLiMe, Grant 611346). It is also partially supported by
the German Federal Ministry of Education and Research (BMBF) within the
SyncTech project (Grant 02PJ1002) and the Software-Campus project “SUITE”
(Grant 01IS12051).
References
1. Chu-Carroll, J., Prager, J.M.: An experimental study of the impact of information
extraction accuracy on semantic search performance. In: CIKM. (2007) 505–514
2. Castells, P., Fernández, M., Vallet, D.: An adaptation of the vector-space model for
ontology-based information retrieval. IEEE Trans. Knowl. Data Eng. 19(2) (2007)
261–272
3. Zhang, L., Färber, M., Rettinger, A.: xlid-lexica: Cross-lingual linked data lexica.
In: LREC. (2014) 2101–2105
4. Zhang, L., Rettinger, A.: X-lisa: Cross-lingual semantic annotation. PVLDB 7(13)
(2014) 1693–1696
5. Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann,
S.: DBpedia - A crystallization point for the Web of Data. J. Web Sem. 7(3) (2009)
154–165
6. Färber, M., Zhang, L., Rettinger, A.: Kuphi - an investigation tool for searching for
and via semantic relations. In: ESWC. (2014)
Crawl Me Maybe: Iterative Linked Dataset
Preservation
Besnik Fetahu, Ujwal Gadiraju, and Stefan Dietze
L3S Research Center, Leibniz Universität Hannover, Germany
{fetahu, gadiraju,dietze}@L3S.de
Abstract. The abundance of Linked Data being published, updated,
and interlinked calls for strategies to preserve datasets in a scalable
way. In this paper, we propose a system that iteratively crawls and
captures the evolution of linked datasets based on flexible crawl defi-
nitions. The captured deltas of datasets are decomposed into two con-
ceptual sets: evolution of (i)metadata and (ii)the actual data covering
schema and instance-level statements. The changes are represented as
logs which determine three main operations: insertions, updates and
deletions. Crawled data is stored in a relational database, for efficiency
purposes, while exposing the di↵s of a dataset and its live version in
RDF format.
Keywords: Linked Data; Dataset; Crawling; Evolution; Analysis
1 Introduction
Over the last decade there has been a large drive towards publishing structured
data on the Web, a prominent case being data published in accordance with
Linked Data principles [1]. Next to the advantages concomitant with the dis-
tributed and linked nature of such datasets, challenges emerge with respect to
managing the evolution of datasets through adequate preservation strategies.
Due to the inherent nature of linkage in the LOD cloud, changes with respect
to one part of the LOD graph influence and propagate changes throughout the
graph. Hence, capturing the evolution of entire datasets or specific subgraphs is a
fundamental prerequisite, to reflect the temporal nature of data and links. How-
ever, given the scale of existing LOD, scalable and efficient means to compute
and archive diffs of datasets are required.
A significant effort towards this problem has been presented by Käfer et al. [2],
with the Dynamic Linked Data Observatory: a long-term experiment to monitor
a two-hop neighbourhood of a core set of diverse linked data documents.
The authors investigate the lifespan of the core set of documents, measuring
their on- and off-line time, and the frequency of changes. Furthermore, they delve
into how links between dereferenceable documents evolve over time.
An understanding of how links evolve over time is essential for traversing linked
data documents, in terms of reachability and discoverability. In contrast to the
previous initiatives, in this work we provide an iterative linked dataset crawler.
It distinguishes between two main conceptual types of data: metadata and the
actual data covering schema and instance-level statements.
In the remainder of this paper, we explain the schema used to capture the
crawled data, the workflow of the iterative crawler and the logging states which
encode the evolution of a dataset.
2 Iterative Linked Dataset Crawler
The dataset crawler extracts resources from linked datasets. The crawled data
is stored in a relational database. The database schema (presented in Figure 1)
was designed towards ease of storage and retrieval.
Fig. 1. Conceptual schema for the iteratively crawled linked datasets. Logs are rep-
resented with dashed lines (e.g. triple insertion: ⟨s, p, o⟩) of the various conceptual
classes of data within linked datasets.
The crawler is designed with the intent to accommodate methods for assess-
ing the temporal evolution of linked datasets. A dataset which has not been
crawled before will thereby be crawled completely and all corresponding data
will be stored in a database. This corresponds to a dump of that
dataset, stored according to the database schema. In case a dataset has al-
ready been crawled, the differences between the previously crawled state of the
dataset and the current state are determined on-the-fly. Such deltas, or diffs, are
then stored. Therefore, for any dataset that has been crawled multiple times at
different crawl-points1, it is possible to reconstruct the state of the dataset at
any of the given crawl-points.
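A minimal sketch of such a reconstruction, assuming the logs are stored per crawl-point as (operation, old triple, new triple) records; the record format and the toy data are illustrative assumptions rather than the actual database schema.

def reconstruct(base_triples, logs_by_crawl_point, target_crawl_point):
    """Replay insert/update/delete logs up to the requested crawl-point."""
    state = set(base_triples)
    for cp in sorted(logs_by_crawl_point):
        if cp > target_crawl_point:
            break
        for op, old, new in logs_by_crawl_point[cp]:
            if op == "insert":
                state.add(new)
            elif op == "delete":
                state.discard(old)
            elif op == "update":           # one element of the triple changed
                state.discard(old)
                state.add(new)
    return state

base = {("resource_uri_1", "rdfs:label", "Chennai district")}
logs = {1: [("insert", None, ("resource_uri_2", "dbpedia-owl:city", "Madras"))],
        2: [("update", ("resource_uri_2", "dbpedia-owl:city", "Madras"),
                       ("resource_uri_2", "dbpedia-owl:city", "Chennai"))]}
print(reconstruct(base, logs, target_crawl_point=2))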
2.1 Computation of Diffs
The differences between the state of a dataset at different crawl-points can
be captured efficiently using the dataset crawler. Evolution of datasets can be
1 The time at which a given crawl operation is triggered.
computed at different levels. Each crawl explicitly logs the various changes at
schema and resource-levels in a dataset as either inserted, updated or deleted.
The changes themselves are first captured at triple-level, and then attributed to
either schema-level or resource instance-level. The following log operators with
respect to dataset evolution are handled by the dataset crawler.
– Insertions. New triples may be added to a dataset. Such additions intro-
duced in the dataset correspond to insertions.
– Deletions. Over time, triples may be deleted from a dataset due to various
reasons, ranging from preserving correctness to the detection of errors. These
correspond to deletions.
– Updates. Updates correspond to the update of one element of a triple
⟨s, p, o⟩ (see the sketch below).
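The following sketch computes such diffs on the fly between the previously crawled state and the live state of a dataset; treating a changed object for the same (s, p) pair as an update is a simplifying assumption made only for this illustration.

def compute_diffs(previous, live):
    """Return (insertions, updates, deletions) between two sets of (s, p, o) triples."""
    prev_by_sp = {(s, p): o for s, p, o in previous}
    live_by_sp = {(s, p): o for s, p, o in live}
    inserts = [(s, p, o) for (s, p), o in live_by_sp.items() if (s, p) not in prev_by_sp]
    deletes = [(s, p, o) for (s, p), o in prev_by_sp.items() if (s, p) not in live_by_sp]
    updates = [((s, p, prev_by_sp[(s, p)]), (s, p, o))
               for (s, p), o in live_by_sp.items()
               if (s, p) in prev_by_sp and prev_by_sp[(s, p)] != o]
    return inserts, updates, deletes

prev = {("resource_uri_2", "dbpedia-owl:city", "Madras")}
live = {("resource_uri_2", "dbpedia-owl:city", "Chennai"),
        ("resource_uri_3", "rdfs:label", "New resource")}
print(compute_diffs(prev, live))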
Figure 2 presents an example depicting the computation of diffs between a
previously crawled dataset at crawl-point t0 and a fresh crawl at crawl-point t1.
Fig. 2. Computation of diffs on-the-fly.
First, assume a change in the ‘live dataset’ in the form of an insertion of the
triple corresponding to the URI resource_uri_2. Thus, the triple describing the
city Madras is added. Consequently, if the value of the property dbpedia-owl:city
is updated, then a subsequent crawl would capture this difference in the
literal value of the property as an update to Chennai. Similarly, deletions made
are also detected during the computation of diffs. Thus, computing and storing
diffs on-the-fly in accordance with the log operators is beneficial; we avoid the
overheads emerging from storing dumps of entire datasets.
2.2 Web Interface for the Iterative Dataset Crawler
We present a Web interface (accessible at http://data-observatory.org/
dataset_crawler) that provides means to access the crawled resources, given
specific crawl-points of interest from the periodical crawls. The interface allows
us to filter for specific datasets and resource types. The Web application has
three main components (see Figure 3): (i) displaying metadata of the dataset, (ii)
dataset evolution, showing summaries of added/updated/deleted resources for
the different types, and (iii) dataset type-specific evolution, showing a summary
of the added/updated/deleted resource instances for a specific resource type and
corresponding to specific crawl time-points. In addition, the crawler tool is made
available along with instructions for installation and configuration2 .
Fig. 3. Functionalities of the Dataset Crawler Web Interface.
3 Conclusion
In this paper, we presented a linked dataset crawler for capturing dataset evolu-
tion. Data is preserved in the form of three logging operators (insertions/updates/
deletions) by performing an online computation for any given dataset with
respect to the live state of the dataset and its previously crawled state (if avail-
able). Furthermore, the crawled data and computed diffs of a dataset can be used to
assess its state at any given crawl-point. Finally, we provided a web interface
which allows the setup of the crawler, and facilitates simple query functionalities
over the crawled data.
References
1. C. Bizer, T. Heath, and T. Berners-Lee. Linked data - the story so far. Int. J.
Semantic Web Inf. Syst., 5(3):1–22, 2009.
2. T. Käfer, A. Abdelrahman, J. Umbrich, P. O'Byrne, and A. Hogan. Observing linked
data dynamics. In The Semantic Web: Semantics and Big Data, pages 213–227.
Springer, 2013.
2 https://github.com/bfetahu/dataset_crawler
A Semantics-Oriented Storage Model for Big
Heterogeneous RDF Data
HyeongSik Kim, Padmashree Ravindra, and Kemafor Anyanwu
Department of Computer Science, North Carolina State University, Raleigh, NC
{hkim22, pravind2, kogan}@ncsu.edu
Abstract. Increasing availability of RDF data covering different domains is en-
abling ad-hoc integration of different kinds of data to suit varying needs. This
usually results in large collections of data such as the Billion Triple Challenge
datasets or SNOMED CT, that are not just “big” in the sense of volume but also
“big” in variety of property and class types. However, techniques used by most
RDF data processing systems fail to scale adequately in these scenarios. One
major reason is that the storage models adopted by most of these systems, e.g.,
vertical partitioning, do not align well with the semantic units in the data and
queries. While Big Data distributed processing platforms such as the Hadoop-
based platforms offer the promise of “unlimited scale-out processing”, there are
still open questions as to how best to physically partition and distribute RDF
data for optimized distributed processing. In this poster, we present the idea of a
semantics-oriented RDF storage model that partitions data into logical units that
map to subqueries in graph patterns. These logical units can be seen as equiva-
lence classes of star subgraphs in an RDF graph. This logical partitioning strategy
enables more aggressive pruning of irrelevant query results by pruning irrelevant
partitions. It also enables the possibility of semantic-query optimization for some
queries such as eliminating joins under appropriate conditions. These benefits in
addition to appropriate techniques for physically partitioning the logical parti-
tions, translate to improved performance as shown by some preliminary results.
Keywords: RDF Storage Model, Partitioning Scheme, Hadoop, MapReduce
1 Introduction and Motivation
The Resource Description Framework (RDF) has been widely adopted and used to rep-
resent various datasets in many different communities such as government, life sci-
ence, and finance. One challenge that arises from this phenomenon is that most
RDF datasets now contain a significant number of different properties and classes, e.g.,
10^5 distinct properties and classes in DBPedia [3] and 400k concepts in SNOMED
CT [7]. This is in contrast to popular benchmark datasets that are often used for eval-
uating RDF data processing systems, like LUBM1, which contain only a few hundred
distinct properties and classes. To efficiently process such collections, data needs to
be organized suitably into a storage model. A very common storage model is
Vertical Partitioning (VP) [1] and its variants [4], which partition data in terms of the
1 The Lehigh University Benchmark: http://swat.cse.lehigh.edu/projects/lubm
Fig. 1: A comparison of partitioning schemes: (a) vertical partitioning and its execution
plan and (b) equivalence-class-based partitioning.
types of properties and classes in a dataset. Given a query, matching vertical partitions
are selected based on the properties in the graph pattern and then join operations are
performed using the partitions. In a sense, this approach allows all vertical partitions
corresponding to properties that are not in the query to be pruned out. However, despite
this degree of prunability, the joins between “relevant” vertical partitions still incur
some overhead of processing irrelevant data, since not all entries in the vertical partitions
form joined results. For example, consider the star pattern query with properties :name,
:mbox, and :homepage. Fig. 1(a) shows the example of the execution plan and parti-
tioned data using the VP-based approach, which results in two join operations. The
violet-colored cells denote triples that are not relevant to the query, but are processed and
discarded during expensive join operations. Furthermore, the vertical partitioning pro-
cess itself can be challenging for large heterogeneous datasets for multiple reasons.
First, it may require the management of a large number of file descriptors/buffers (> 10^5
for DBPedia) in memory during the partitioning process, which can be impractical de-
pending on the hardware architecture being used. Second, scalability is a key design ob-
jective of Hadoop-based frameworks, but the distributed file system used in Hadoop
(or HDFS) does not scale well when there are numerous small files2 . Given these chal-
lenges, there is a clear need for investigating novel storage and distribution schemes for
RDF on scale-out platforms such as Hadoop.
2 Semantics-Oriented Storage Model : SemStorm
In this poster, we build on our previous works (e.g.,[5, 6]) which introduced the no-
tion of a triplegroup as a first class object in our data and query model. A triplegroup
is a group of triples related to the same resource (i.e. with the same subject), i.e. a star
subgraph. Fig. 1(b) shows the example triplegroup representation of our previous exam-
ple, e.g., under the equivalence class [nmha], a single triplegroup instance exists, which
contains a subject (:prs1) and objects corresponding to properties :n, :m, :h, and :a. The
benefits of both the triplegroup data model and algebra have been articulated in our pre-
vious works, including shortening of query execution workflows, reducing the footprint
of intermediate results which impacts I/Os in distributed processing. Here, we present
an overview of an RDF storage model called SemStorm that is based on logically and
2 http://blog.cloudera.com/blog/2009/02/the-small-files-problem
Fig. 2: Query processing in SemStorm.
Fig. 3: Execution time of queries with type triples (Q1, Q2) and a negative query (Q3).
physically partitioning triplegroups in a semantics-oriented way. By semantics-oriented
we mean partitioning triplegroups into equivalence classes such that all members in an
equivalence class are equivalent with respect to queries. This approach enables more ag-
gressive pruning of logical partitions, i.e. equivalence classes, than other approaches like
vertical partitioning. For example, Fig. 2 shows that we select matching equiva-
lence class sets (mecs): [nmha] and [nmhp], which contain all the properties in the
example query, i.e. :n, :m, and :h (ignore type properties such as tA for now). All other
remaining equivalence classes are pruned out, e.g., [nma] is not selected due to the ab-
sence of :h. The mecs sometimes contain extra values, e.g., objects for property :a and
:p in [nmha] and [nmhp]. We later filter such values for exact matching results.
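A minimal sketch of this equivalence-class-based pruning, assuming triples have already been grouped into triplegroups by subject; the data follows the :prs1 example, but the code is an illustration under these assumptions, not SemStorm's actual implementation.

from collections import defaultdict

def build_partitions(triples):
    """Group triples into triplegroups by subject; key each group by its property set."""
    by_subject = defaultdict(list)
    for s, p, o in triples:
        by_subject[s].append((p, o))
    partitions = defaultdict(list)               # equivalence class label -> triplegroups
    for s, pos in by_subject.items():
        label = frozenset(p for p, _ in pos)     # e.g. [nmha] for :n, :m, :h, :a
        partitions[label].append((s, pos))
    return partitions

def matching_equivalence_classes(partitions, query_properties):
    """Select mecs whose label contains every property of the star query."""
    wanted = set(query_properties)
    return {label: tgs for label, tgs in partitions.items() if wanted <= label}

triples = [(":prs1", ":name", "Alice"), (":prs1", ":mbox", "a@x.org"),
           (":prs1", ":homepage", "http://x.org/a"), (":prs1", ":age", "30"),
           (":prs2", ":name", "Bob"), (":prs2", ":mbox", "b@x.org")]
parts = build_partitions(triples)
print(matching_equivalence_classes(parts, [":name", ":mbox", ":homepage"]).keys())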
Another unique advantage of SemStorm is that it enables additional optimizations that
are not possible with other approaches, e.g., it may be possible to avoid explicitly mate-
rializing rdf:type triples if such triples can be inferred by the label of an equivalence
class (the label of an equivalence class can be considered to be the set of proper-
ties in that equivalence class). For example, triplegroups under an equivalence class
[pubAuthor, rdf:type with Publication] can skip materializations of Publication type
triples if a schema file contains a triple “pubAuthor rdfs:domain Publication”. Fig. 2
shows that type triples are not materialized in triplegroup instances such as tg1 and tg2 ,
which are denoted as (tA ) for the class A. This optimization can add significant advan-
tages because rdf:type triples tend to be disproportionately more numerous than triples of other properties
for many datasets, e.g., approx. 20% in LUBM datasets. Thus, avoiding their explicit
representation reduces the amount of I/Os needed when rdf:type triples need to be pro-
cessed and may in some cases eliminate the need to perform a join with such properties
since it is implicitly captured in the equivalence class representation.
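A small sketch of this type-triple elision under the stated rdfs:domain condition; the schema table and triplegroup format are illustrative assumptions.

# Hypothetical schema entry: pubAuthor rdfs:domain Publication.
schema_domains = {"pubAuthor": "Publication"}

def materialize(triplegroup):
    """Skip rdf:type triples whose class is implied by the rdfs:domain of a present property."""
    implied = {schema_domains[p] for p, _ in triplegroup if p in schema_domains}
    return [(p, o) for p, o in triplegroup if not (p == "rdf:type" and o in implied)]

tg = [("pubAuthor", ":alice"), ("rdf:type", "Publication"), ("title", "Paper X")]
print(materialize(tg))   # the Publication type triple is not materialized (tA in Fig. 2)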
Implementation Issues. Triplegroups can be generated easily using a group-by opera-
tion on the subject field of triples using a single MR job; they are then categorized based
on equivalence class and stored in HDFS. Each equivalence class could be mapped into
a physical file, but it is likely that such 1:1 mappings would cause the many-files issue
in case many distinct equivalence classes are generated. To alleviate this issue, we
need a heuristic that clusters equivalence classes into a smaller number of files, e.g.,
group equivalence classes that share a specific set of properties and store them together.
We also need to consider building indexes to locate matching equivalence classes from
physical files, e.g., mappings between equivalence classes and their offsets in files.
Preliminary Evaluation. We evaluated three types of queries using the LUBM datasets
(450GB, Univ. 20k) on an 80-node Hadoop cluster in VCL³, where each node was
equipped with 2.33 GHz dual core CPUs, 4GB RAM, and a 40GB HDD. Hive 0.12⁴ was
selected for the VP approach. Query Q1 retrieves a list of publication authors with Pub-
lication type triples, and Q2 additionally retrieves name (or title) of the publications.
We evaluated two variations of each query: one with the object field of some non-type triple pattern
bound (high selectivity, denoted with the postfix h), and the same query with an unbound ob-
ject (low selectivity, marked with l). Fig. 3 shows that SemStorm was 3 times faster than
Hive for Q1 because SemStorm can process queries using a Map-only job and save the
disk I/O for all the type triples. The execution time of Hive increased from Q1 to Q2
due to reading the additional property relation :name, but the execution time of SemStorm
was almost constant because both queries read the same equivalence classes. Finally,
Q3 was a negative query, which produces no answers. While SemStorm determined that
there are no answers (due to no matching equivalence classes) even before launching
the job, Hive executed join operations, producing zero answers. The details are available on
the project website.5
Related Work. Our approach is similar to the Property Table [2], which groups
triples that tend to occur together. However, the main difference is that the property table
is query-driven and is built gradually based on query logs, whereas SemStorm is
data-driven and can be constructed directly from the datasets without any query
logs. In addition, while the property table approach mainly suffers from storage
inefficiencies, e.g., many NULLs and left-over tables, our approach does not, i.e. all
triples can be transformed into triplegroups without any leftovers.
Acknowledgment The work presented in this paper is partially funded by NSF grant
IIS-1218277.
References
1. Abadi, D.J., Marcus, A., Madden, S.R., Hollenbach, K.: Scalable Semantic Web Data Man-
agement Using Vertical Partitioning. In: Proc. VLDB. pp. 411–422 (2007)
2. Carroll, J.J., Dickinson, I., Dollin, C., Reynolds, D., Seaborne, A., Wilkinson, K.: Jena: Im-
plementing the Semantic Web Recommendations. In: Proc. WWW Alt. pp. 74–83 (2004)
3. Duan, S., Kementsietsidis, A., Srinivas, K., Udrea, O.: Apples and Oranges: A Comparison of
RDF Benchmarks and Real RDF Datasets. In: Proc. SIGMOD. pp. 145–156 (2011)
4. Husain, M., McGlothlin, J., Masud, M., Khan, L., Thuraisingham, B.: Heuristics-Based Query
Processing for Large RDF Graphs Using Cloud Computing 23(9), 1312–1327 (2011)
5. Kim, H., Ravindra, P., Anyanwu, K.: Scan-Sharing for Optimizing RDF Graph Pattern Match-
ing on MapReduce. In: Proc. CLOUD. pp. 139–146 (2012)
6. Ravindra, P., Kim, H., Anyanwu, K.: An Intermediate Algebra for Optimizing RDF Graph
Pattern Matching on MapReduce. In: Proc. ESWC. pp. 46–61 (2011)
7. Salvadores, M., Horridge, M., Alexander, P.R., Fergerson, R.W., Musen, M.A., Noy, N.F.:
Using SPARQL to Query Bioportal Ontologies and Metadata. In: Proc. ISWC. pp. 180–195
(2012)
3 Virtual Computing Lab: http://vcl.ncsu.edu
4 https://hive.apache.org
5 http://research.csc.ncsu.edu/coul/RAPID+/ISWC2014.html
Approximating Inference-enabled Federated
SPARQL Queries on Multiple Endpoints
Yuji Yamagata and Naoki Fukuta
Graduate School of Informatics, Shizuoka University
Shizuoka, Japan
{gs14044@s,fukuta@cs}.inf.shizuoka.ac.jp
Abstract. Running inference-enabled SPARQL queries may sometimes
require unexpectedly long execution times. Therefore, there is an increasing
demand to make such queries more usable by slightly changing them,
so that they still produce an acceptable level of similar results. In this demon-
stration, we present our query-approximation system that can transform
an inference-enabled federated SPARQL query into another one that
produces acceptably similar results without unexpectedly long runtimes, thus
avoiding timeouts when executing inference-enabled federated SPARQL queries.
Keywords: SPARQL, inference, federated query, ontology mapping
1 Introduction
Reasoning over LOD allows queries to obtain knowledge that is not explicitly stated
in a dataset [1]. Techniques to utilize ontology-based reasoning capabilities have been
developed to overcome several issues, such as high complexity in the worst case
[5][7][8][9].
Even when a query prepared by the client requires a long execution time, a
standard SPARQL endpoint implementation will try to execute the query at
great cost to return answers. If the endpoint receives many heavy queries,
it might spend much time on their execution or, more severely, the server might
go down. This is especially important for endpoints that use inference
engines to support OWL reasoning capabilities.
In this paper, we present an idea and its prototype implementation of a
query-approximation system that can transform an inference-enabled federated
SPARQL query into another one that produces acceptably similar results
without unexpectedly long runtimes, thus avoiding timeouts when executing inference-
enabled federated SPARQL queries.1
2 Background
In [7], Kang et al. introduced a number of metrics that can be used to predict
reasoning performance and evaluated various classifiers to know how accurately
1 A demonstration is available at http://whitebear.cs.inf.shizuoka.ac.jp/Yaseki/
they predict classification time for an ontology based on its metric values. Ac-
cording to their evaluation results, they have obtained prediction models with
an accuracy of more than 80%, but there are still major difficulties in improving
them.
In [5], it was shown that reasoning tasks on ontologies constructed in an
expressive description logic have a high worst-case complexity. This was done
by analyzing experimental results in which each of several ontologies was divided
into four and eight random subsets of equal size and the classification
times of these subsets were measured incrementally. They reported that some ontologies ex-
hibit non-linear sensitivity in their inference performance. They also argued
that there is no straightforward relationship between the performance of an iso-
lated subset of an ontology and the contribution of each subset to the whole
inference performance on the whole ontology, although they provided an algorithm
that identifies an ontology’s hot spots.
There are two possible approaches to managing long-running queries. One is
to utilize parallel and distributed computing techniques to make those executions
faster [10]. Another possible approach is to rewrite a query that requires a long ex-
ecution time into a more light-weight one. There are some query rewriting approaches that
improve the quality of queries [2][3][4]. Also, there are some heuristic techniques
to approximate inference-enabled queries by modifying some hotspots in the
query that prevent faster execution [12]. However, since those hotspots are also
dependent on their individual ontologies, such query modification should take
into account both query-structure and characteristics of the ontologies used.
3 Outline and System Architecture
If a query seems not to be a time-consuming one, the endpoint executes the
query. If the query execution is classified as time-consuming, the endpoint may
have the option to reject the execution of the query or to transform that query into
an optimized one. To implement such behaviors in an endpoint, some extensions
should be provided to notify the client that the received query
has been transformed into another one, or that the query has been rejected due to a
heavy-load condition.
To realize the idea, we are implementing a preliminary system to classify
whether a query execution is time-consuming or not, to rewrite the query into a
more light-weight one, and to extend the protocol to notify the client of the rejection of the
query, the query transformation applied, and so on. We applied a
pattern-based heuristic query rewriting technique that, for example, substitutes
some named classes with subsets of their potential subclasses that are derived by
inference (see the sketch at the end of this section). Our prototype system has a unique proxy module called “Front-end
EP” between the client and the endpoint (called “Back-end EP” in this paper).
Figure 1 shows a brief overview of the query execution process mediated by a
Front-end EP. Figure 2 shows the basic procedure of query processing on our
system. Table 1 shows our preliminary evaluation on the heavy-query detection
on a single-endpoint configuration shown in [12]. Here, we used the Linklings ontology
from the OAEI dataset in the preliminary experiment.
To prepare datasets to evaluate the performance sensitivity of ontology-level
simplification techniques, we reduced the Linklings ontology by cutting several re-
lational descriptions and added 10 instances for each named class. As an ex-
perimental environment, we set up a SPARQL endpoint using Joseki (v3.4.4)
in conjunction with a server-side reasoner using Pellet [11] to enable OWL-level
inference capability on the endpoint. In this experiment, we used 100 ms as the
threshold time. The evaluation data set was generated by queries to get the
instances of a named class in the Linklings ontology. Here, we conducted an ex-
periment for all 1,369 queries on N-fold cross validation. We used two classifiers:
Bagged C4.5 and Boosted C4.5, implemented in Weka [6] with default parame-
ters. Further evaluation of the performance on multiple-endpoint configurations
remains future work.
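As an illustration of the pattern-based rewriting mentioned above, the sketch below replaces a triple pattern over a named class with a VALUES clause over a precomputed subset of its named subclasses, trading completeness for a bounded runtime. The class names, the subclass table, and the string-level rewriting are assumptions for illustration and do not reflect the prototype's actual implementation.

# Hypothetical, precomputed subset of named subclasses for an expensive class.
subclasses = {":Conference": [":Workshop", ":Symposium", ":Meeting"]}

def approximate(query, class_iri, var="?x"):
    """Rewrite '?x a :C .' into an inference-free pattern over known subclasses."""
    pattern = f"{var} a {class_iri} ."
    if class_iri not in subclasses or pattern not in query:
        return query, False                         # nothing rewritten
    values = " ".join(subclasses[class_iri])
    rewritten = query.replace(
        pattern, f"VALUES ?__type {{ {values} }} {var} a ?__type .")
    return rewritten, True                           # client is notified of the rewrite

query = "SELECT ?x WHERE { ?x a :Conference . ?x :hasTitle ?t . }"
print(approximate(query, ":Conference"))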
Fig. 1. Overview of Our System
Table 1. Classification Performance on Our Approach (N-fold Cross Validation)[12]
Classifier Recall Precision F-Measure
Bagged C4.5 0.959 0.977 0.964
Boosted C4.5 0.999 0.999 0.999
References
1. Baader, F., Suntisrivaraporn, B.: Debugging Snomed ct Using Axiom Pinpointing
in the Description Logic EL+ . In: Cornet, R., Spackman, K. (eds.) Representing
and sharing knowledge using SNOMED. Proceedings of the 3rd International Con-
ference on Knowledge Representation in Medicine KR-MED 2008, vol. 410, pp.
1–7. CEUR-WS (2008)
2. Bischof, S., Polleres, A.: RDFS with Attribute Equations via SPARQL Rewriting.
In: Cimiano, P., Corcho, O., Presutti, V., Hollink, L., Rudolph, S. (eds.) The
Semantic Web: Semantics and Big Data. LNCS, vol. 7882, pp. 335–350. Springer-
Verlag (2013)
3. Fujino, T., Fukuta, N.: SPARQLoid - a Querying System using Own Ontology and
Ontology Mappings with Reliability. In: Proc. of the 11th International Semantic
Web Conference (Poster & Demos) (ISWC 2012) (2012)
[Fig. 2 annotations: the Front-end EP sends the request as is when the received query is classified as not-time-consuming; when the query is classified as time-consuming, the Front-end EP either rejects it or optimizes it, sends the optimized query to the Back-end EP, and notifies the client that the received query has been rewritten.]
Fig. 2. Basic Query Processing Procedure
4. Fujino, T., Fukuta, N.: Utilizing Weighted Ontology Mappings on Federated
SPARQL Querying. In: Kim, W., Ding, Y., Kim, H.G. (eds.) The 3rd Joint In-
ternational Semantic Technology Conference (JIST2013). LNCS, vol. 8388, pp.
331–347. Springer International Publishing (2013)
5. Gonçalves, R.S., Parsia, B., Sattler, U.: Performance Heterogeneity and Approx-
imate Reasoning in Description Logic Ontologies. In: Cudré-Mauroux, P., Heflin,
J., Sirin, E., Tudorache, T., Euzenat, J., Hauswirth, M., Parreira, J.X., Hendler,
J., Schreiber, G., Bernstein, A., Blomqvist, E. (eds.) The Semantic Web–ISWC
2012 Part I. LNCS, vol. 7649, pp. 82–98. Springer-Verlag (2012)
6. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The
WEKA Data Mining Software: An Update. ACM SIGKDD explorations newsletter
11(1), 10–18 (2009)
7. Kang, Y.B., Li, Y.F., Krishnaswamy, S.: Predicting Reasoning Performance Using
Ontology Metrics. In: Cudré-Mauroux, P., Heflin, J., Sirin, E., Tudorache, T.,
Euzenat, J., Hauswirth, M., Parreira, J.X., Hendler, J., Schreiber, G., Bernstein,
A., Blomqvist, E. (eds.) The Semantic Web–ISWC 2012 Part I. LNCS, vol. 7649,
pp. 198–214. Springer-Verlag (2012)
8. Motik, B., Shearer, R., Horrocks, I.: Hypertableau Reasoning for Description Log-
ics. Journal of Artificial Intelligence Research 36, 165–228 (2009)
9. Romero, A.A., Grau, B.C., Horrocks, I.: MORe: Modular Combination of OWL
Reasoners for Ontology Classification. In: Cudré-Mauroux, P., Heflin, J., Sirin, E.,
Tudorache, T., Euzenat, J., Hauswirth, M., Parreira, J.X., Hendler, J., Schreiber,
G., Bernstein, A., Blomqvist, E. (eds.) The Semantic Web–ISWC 2012 Part I.
LNCS, vol. 7649, pp. 1–16. Springer (2012)
10. Schätzle, A., Przyjaciel-Zablocki, M., Hornung, T., Lausen, G.: PigSPARQL: A
SPARQL Query Processing Baseline for Big Data. In: Proc. of the 12th Interna-
tional Semantic Web Conference (Poster & Demos) (ISWC 2013) (2013)
11. Sirin, E., Parsia, B., Grau, B.C., Kalyanpur, A., Katz, Y.: Pellet: A practical owl-dl
reasoner. Journal of Web Semantics 5(2), 51 – 53 (2007)
12. Yamagata, Y., Fukuta, N.: A Dynamic Query Optimization on a SPARQL End-
point by Approximate Inference Processing. In: Proc. of 5th International Confer-
ence on E-Service and Knowledge Management (ESKM 2014). pp. 161–166 (2014)
VKGBuilder – A Tool of Building and Exploring
Vertical Knowledge Graphs
Tong Ruan, Haofen Wang, and Fanghuai Hu
East China University of Science & Technology, Shanghai, 200237, China
{ruantong,whfcarter}@ecust.edu.cn, xiaohuqi@126.com
Abstract. Recently, search engine companies like Google and Baidu are
building their own knowledge graphs to empower the next generation of
Web search. Due to the success of knowledge graphs in search, customers
from vertical sectors are eager to embrace KG-related technologies to
develop domain-specific semantic platforms or applications. However,
they lack the skills or tools to achieve this goal. In this paper, we present
an integrated tool VKGBuilder to help users manage the life cycle of
knowledge graphs. We will describe three modules of VKGBuilder in
detail which construct, store, search and explore knowledge graphs in
vertical domains. In addition, we will demonstrate the capability and
usability of VKGBuilder via a real-world use case in the library industry.
1 Introduction
Recently, an increasing number of semantic data sources has been published on the
Web. These sources are further interlinked to form the Linking Open Data (LOD) cloud.
Search engine companies like Google and Baidu leverage LOD to build their
own semantic knowledge bases (called knowledge graphs 1 ) to empower semantic
search. The success of KGs in search attracts much attention from users in
vertical sectors. They are eager to embrace related technologies to build semantic
platforms in their domains. However, they either lack skills to implement such
platforms from scratch or fail to find sufficient tools to accomplish the goal.
Compared with general-purpose KGs, knowledge graphs in vertical industries
(denoted as VKG) have the following characteristics: a) More accurate and richer
data of certain domains to be used for business analysis and decision making;
b) Top-down construction to ensure the data quality and stricter schema while
general KGs are built in a bottom-up manner with more emphasis on the wide
coverage of data from different domains; c) Internal data stored in RDBs are
additionally considered for integration into VKGs; and d) Besides search, VKGs
should provide user interfaces especially for KG construction and maintenance.
While there exist tool suites (e.g., LOD2 Stack2 ) which help to build and
explore LOD, these tools are mainly developed for researchers and developers
of the Semantic Web community. Vertical markets, on the other hand, need
1
http://en.wikipedia.org/wiki/Knowledge_Graph
2
http://lod2.eu/
Fig. 1. Architecture of VKGBuilder. (The diagram shows the Knowledge Integration Module with its RDB/D2R importer, LOD linker, UGC wrapper, text extractor, schema and data editors, schema expansion and alignment, and schema inconsistency or data conflict detection for incremental schema design and data enrichment; the Knowledge Store Module, a virtual graph database; and the Knowledge Access Module with Restful APIs, a visual explorer (Card View, Wheel View) and semantic search with a natural language interface.)
Fig. 2. Semantic Search Interface
end-to-end solutions to manage the life cycle of knowledge graphs and hide the
technical details as much as possible. To the best of our knowledge, the tool we present,
VKGBuilder, is the first one suited to vertical industry users. It allows
rapid and continuous VKG construction which imports and extracts data from
diverse data sources, provides a mechanism to detect intra- and inter-data source
conflicts, and consolidates these data into a consistent KG. It also provides
intuitive and user-friendly interfaces for novice users with little knowledge of
semantic technologies to understand and exploit the underlying VKG.
2 Description of VKGBuilder
VKGBuilder is composed of three modules namely the Knowledge Integration
module, the Knowledge Store module, and the Knowledge Access module. The
whole architecture is shown in Figure 1. Knowledge Integration is the core mod-
ule for VKG construction with three main components. Knowledge Store is a
virtual graph database which combines RDBs, in-memory stores and inverted
indexes to support fast access to the VKG in different scenarios, and the Knowledge
Access module provides different interfaces for end users and applications.
2.1 Knowledge Integration Module
– Data Importers and Information Extractors. Structured data from internal
relational databases are imported and converted into RDF triples by D2R
importers3. A LOD Linker is developed to enrich the VKG with domain on-
tologies from the public linked open data. For the user generated contents
(UGCs), we mainly consider encyclopaedic sites like Wikipedia, Baidu Baike,
and Hudong Baike. Due to the semi-structured nature of these sites, wrap-
pers automatically extract properties and values of certain entities. As for
unstructured text, distant-supervised learning methods are adapted to dis-
cover missing relations between entities or fill property values of a given
entity where the above extracted semantic data serve as seeds.
3
http://d2rq.org/
– Schema Inconsistency and Data Conflict Detection. After semantic data are
extracted or imported from various sources, data integration is performed
to build an integrated knowledge graph. During integration, schema-level
inconsistencies and data-level conflicts might occur. Schema editing is used
to define property axioms (e.g., functional, inverse, transitive),
concept subsumptions, and the concepts of entities. Then a rule-based validator
is triggered to check whether the newly added data or imported ontologies
cause any conflicts with existing ones. The possible conflicts are resolved
by user-defined rules or delivered to domain experts for human intervention
(a minimal sketch of such a check is given after this list).
– Schema and Data Editor. Knowledge workers can extend or refine a VKG
at both the schema and the data level with a collaborative editing interface.
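To make the functional-property check above concrete, the following is a minimal, hypothetical sketch in Python (the triple representation, property names and resolution strategy are illustrative assumptions, not VKGBuilder's actual implementation):

from collections import defaultdict

def functional_conflicts(triples, functional_props, same_as=None):
    # triples: iterable of (subject, property, value); same_as: optional map of
    # equivalent values; returns the (subject, property, values) groups that
    # violate a functional property and need rules or expert intervention.
    same_as = same_as or {}
    canon = lambda v: same_as.get(v, v)
    seen = defaultdict(set)
    conflicts = []
    for s, p, v in triples:
        if p not in functional_props:
            continue
        values = seen[(s, p)]
        if values and canon(v) not in {canon(x) for x in values}:
            conflicts.append((s, p, values | {v}))
        values.add(v)
    return conflicts

triples = [("LittleYellowCroaker", "class", "Ray-finned Fishes"),
           ("LittleYellowCroaker", "class", "Actinopterygii")]
print(functional_conflicts(triples, {"class"}))  # flags the conflict for resolution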
2.2 Knowledge Access Module
– Visual Explorer. It includes three views namely the Wheel View, the Card
View, and the Detail View. The Wheel View organizes concepts and entities
in two wheels. In the left wheel, the node of interest is displayed in the
center. If it is a concept, its child concepts are neighbors in the same wheel.
If it is an entity, its related entities are connected via properties as outgoing
(or incoming) edges. When a related concept (or entity) is clicked, the right
wheel is expanded with the clicked node in the center surrounded with its
related information on the VKG. Thus, we allow users to navigate through
the concept hierarchy and traverse between different entities. The Card View
visualizes entities in a force-directed graph layout, which is similar to a
galaxy visualization in 3D space. The Card View also allows the user to change the
focus through drag and drop as well as zoom-in and zoom-out. The Detail
View shows all properties and property values of a particular entity. The
three views can be switched from one to another in a flexible way.
– Semantic Search with Natural Language Interface. Users can submit any
keyword query or natural language question. The query is interpreted into
possible SPARQL queries with natural language descriptions. Once a SPAR-
QL query is selected, the corresponding answers are returned, along with
relevant documents which contain semantic annotations on these answers.
Besides, a summary (a.k.a. knowledge card) of the main entity mentioned in
the query or the top-ranked answer is shown. Related entities defined in the
VKG as well as correlated entities in the query log are recommended.
– Restful APIs. They are designed for developers with little knowledge of se-
mantic technologies to access the VKG with ease, using any programming language
from any platform. These APIs are essentially templated SPARQL
queries that support graph traversal or sub-graph matching on the VKG.
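As an illustration only (the concrete endpoints of VKGBuilder's APIs are not documented here), such a REST call can amount to little more than a templated SPARQL query executed against the store; the endpoint URL below is a placeholder:

from SPARQLWrapper import SPARQLWrapper, JSON

def neighbours(entity_uri, endpoint="http://localhost:8890/sparql", limit=50):
    # One-hop traversal: return the outgoing properties and values of an entity.
    client = SPARQLWrapper(endpoint)
    client.setQuery("SELECT ?p ?o WHERE { <%s> ?p ?o } LIMIT %d" % (entity_uri, limit))
    client.setReturnFormat(JSON)
    rows = client.query().convert()["results"]["bindings"]
    return [(r["p"]["value"], r["o"]["value"]) for r in rows]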
3 Demonstration
VKGBuilder is first used in the ZhouShan Library. The current VKG (marine-
oriented KG) contains more than 32,000 fishes and each fish has more than 20
Fig. 3. Wheel View
Fig. 4. Conflict Resolution
properties. Besides fishes, VKGBuilder also captures knowledge about fishing
grounds, fish processing methods, related researchers and local enterprises. An
online demo video of VKGBuilder can be downloaded at http://202.120.1.49:
19155/SSE/video/VKGBuilder.wmv.
Figure 2 shows a snapshot of the semantic search interface. When a user
enters a query “Distribution of Little Yellow Croaker”, VKGBuilder first seg-
ments the query into “Little Yellow Croaker” and “Distribution”. Here, “Little
Yellow Croaker” is recognized as a fish, and properties about “distribution” are
returned. Then all sub-graphs connecting the fish with each property are found
as possible SPARQL query interpretations of the input query. Top interpreta-
tions whose scores are above a threshold are returned with natural language
descriptions for further selection. Once a user selects a query, the answers (e.g.,
China East Sea) are returned. Also, related books with these answers as seman-
tic annotations are returned. The related library classification of these books is
displayed on the left, and the knowledge card as well as related concepts and
entities of Little Yellow Croaker are listed in the right panel.
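For illustration, one interpretation generated for this query could look like the following SPARQL query (wrapped here in Python; the vkg: vocabulary and labels are invented for this sketch and are not the actual schema of the ZhouShan VKG):

INTERPRETATION = """
PREFIX vkg: <http://example.org/vkg/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?area WHERE {
  ?fish rdfs:label "Little Yellow Croaker"@en ;
        a vkg:Fish ;
        vkg:distribution ?area .
}
"""
# Each such interpretation is shown to the user with a natural language
# description; once one is selected, its answers (e.g., China East Sea) are returned.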
In Figure 3, the Wheel View initially shows the root concept (owl:Thing)
in the center of the left wheel (denoted as LC). When a sub-concept Fish is
clicked, it becomes the center of the right wheel (denoted as RC) with its child
concepts (e.g., Chondrichthyes). We can also navigate between entities. For
instance, selenium is one of the nutrients of Little Yellow Croaker. When clicking
selenium, all fishes containing this nutrient are shown in the right wheel.
The user experience heavily depends on the quality of the underlying VKG.
The extraction and importing are executed automatically in the back-end while
we provide a user interface for conflict resolution. For “Little Yellow Croaker”,
we extract Ray-finned Fishes and Actinopterygii from different sources as
values of the property Class in the scientific classification. Since Class is defined
as a functional property and the two values do not refer to the same thing, a
conflict occurs. As shown in Figure 4, VKGBuilder accepts Actinopterygii as
the final value because this value is extracted from more trusted sources.
Acknowledgements This work is funded by the National Key Technology
R&D Program through project No. 2013BAH11F03.
Using the semantic web for author
disambiguation - are we there yet?
Cornelia Hedeler1 , Bijan Parsia1 , and Brigitte Mathiak2
1
School of Computer Science, The University of Manchester, Oxford Road,
M13 9PL Manchester, UK,
{chedeler,bijan.parsia}@manchester.ac.uk
2
GESIS - Leibniz Institute for the Social Sciences, Unter Sachsenhausen 6-8,
50667 Cologne, Germany
brigitte.mathiak@gesis.org
Abstract. The quality, and therefore, the usability and reliability of
data in digital libraries depends on author disambiguation, i.e., the cor-
rect assignment of publications to a particular person. Author disam-
biguation aims to resolve name ambiguity, i.e., synonyms (the same au-
thor publishing under different names), and polysemes (different authors
with the same name), and assign publications to the correct person.
However, author disambiguation is difficult given that the information
available in digital libraries is sparse and, when integrated from multi-
ple data sources, contains inconsistencies in representation, e.g., of per-
son names, or venue titles. Here we analyse and evaluate the usability of
person-centred reference data available as linked data to complement the
information present in digital libraries and aid author disambiguation.
1 Introduction
Users of digital libraries are not only interested in literature related to a par-
ticular topic or research field of interest, but more frequently also in literature
written by a particular author [2]. However, as digital libraries tend to integrate
information from various sources, they suffer from inconsistencies in represen-
tation of, e.g., author names or venue titles, despite best efforts to maintain a
high data quality. For the actual disambiguation process, a wide variety of addi-
tional metadata are used, e.g., journal or conference names, author affiliations,
co-author networks, and keywords or topics [1, 6]. However, in some digital li-
braries the available metadata can be quite sparse, providing an insufficient amount
and level of detail of information to disambiguate authors efficiently.
To complement the sometimes sparse bibliographic information a number of
approaches surveyed in [1] utilise information available elsewhere, e.g., using web
searches, and most of the approaches proposed are evaluated utilising gold stan-
dard datasets of high quality, such as Google Scholar author profiles. However,
to the best of the authors’ knowledge, these high quality data sets have so far not
been used as part of the disambiguation process itself. Here we analyse person-
centred reference data available on the semantic web and evaluate whether it
contains sufficient detail and content to provide additional information and aid
author disambiguation.
2 Data sets
2.1 Digital library data sets
In contrast to the wealth of metadata available in some digital libraries, the
records in the two digital library data sets used here only o↵er limited metadata.
DBLP Most publication records in the DBLP Computer Science Bibliography
[4] consist only of author names, publication titles, and venue information, such
as names of conferences and journals. In addition to the publication records,
DBLP also contains person records, which are created as a result of ongoing efforts
for author disambiguation [5].
Sowiport The portal Sowiport (http://sowiport.gesis.org) is provided by GESIS
and contains publication records relevant to the social sciences. Here we only
focus on a subset of just over 500,000 literature entries in Sowiport from three
data sources (SOFIS, SOLIS, SSOAR) within GESIS that have been annotated
with keywords from TheSoz, a German thesaurus for the Social Sciences.
So far, no author disambiguation has taken place in these records, and in-
consistencies in particular in author names make it hard for users to find all
publications by a particular author. An analysis of the search logs has shown
that the authors most frequently searched for are those with large numbers of
publications, who tend to have entries in DBpedia and GND, motivating the use
of the reference data sources introduced below.
2.2 Person-centred reference data
GND authority file and GND publication information. As the literature
in Sowiport, in particular the subset used here, is heavily biased towards German
literature, we use the Integrated Authority File (GND) of the German-speaking
countries and the bibliographic data offered as part of the linked data service
by the German National Library (http://www.dnb.de/EN/lds). Amongst other
information, which also includes keywords, the GND file contains differentiated
person records, which refer to a real person, and are used here.
DBpedia [3] is available for download (http://wiki.dbpedia.org/Downloads39)
and comes in various data sets containing different kinds of data, amongst them
‘Persondata’, with information about people, such as their date and place of
birth and death. As the persondata subset itself does not contain much addi-
tional detail, other data sets are required to obtain information useful for author
disambiguation. The data is available either as raw infobox data or cleaned
mapping-based data, which we use here.
3 Approach for author disambiguation
Our approach for author disambiguation can be seen as preliminary, as the main
focus of this work was to evaluate whether there is sufficient information available
in such reference data sets to make this a viable approach. It uses a domain
specific heuristic as similarity function, and the reference data sets introduced
above as additional (web information) evidence. To limit the number of records
that need to be compared in detail, we use an index on the author/person names
Table 1. Left: number of person records in GND with selected professions; right: number of instances in persondata in DBpedia for selected classes (y = yago).

GND professions (# person records):
  author / female author: 8,319 / 6,301
  lecturer / female lecturer: 7,931 / 1,204
  research associate / female: 1,117 / 759
  physicist / female physicist: 11,595 / 1,280
  mathematician / female: 7,561 / 908
  computer scientist / female: 5,443 / 589
  sociologist / female sociologist: 3,298 / 1,590
  social scientist / female: 998 / 546

DBpedia classes (instances, English / German):
  #foaf:person: 1,055,682 / 479,201
  #dbpedia-owl:person: 652,031 / 215,585
  #y:person: 844,562 / 0
  #dbpedia-owl:scientist: 15,399 / 0
  #y:Scientist110560637: 44,033 / 0
  #y:ComputerScientist109951070: 1,667 / 0
  #y:Mathematician110301261: 4,994 / 0
  #y:Physicist110428004: 6,020 / 0
  #y:SocialScientist110619642: 9,083 / 0
  #y:Philosopher110423589: 6,116 / 0
  #dbpedia-owl:Philosopher: 1,276 / 0
Table 2. Number of person instances in DBpedia with selected properties of relevance (English / German).

Author names:
  foaf:Name: 1,055,682 / 479,201
  rdfs:label: 1,055,682 / 479,201
  dbpedia-owl:birthName: 44,977 / 285
  dbpedia-owl:pseudonym: 1,865 / 0
Author affiliation:
  dbpedia-owl:almaMater: 42,318 / 0
  dbpedia-owl:employer: 3,232 / 0
  dbpedia-owl:school: 1,974 / 0
  dbpedia-owl:university: 1,073 / 0
  dbpedia-owl:institution: 923 / 0
  dbpedia-owl:college: 13,510 / 1,829
Co-authors:
  dbpedia-owl:academicAdvisor: 508 / 0
  dbpedia-owl:doctoralAdvisor: 3,698 / 0
  dbpedia-owl:doctoralStudent: 1,791 / 0
  dbpedia-owl:notableStudent: 372 / 0
  dbpedia-owl:influenced: 2,830 / 0
  dbpedia-owl:influencedBy: 5,928 / 0
Keywords or topics / research area:
  dbpedia-owl:knownFor: 17,702 / 0
  dbpedia-owl:notableIdea: 392 / 0
  dbpedia-owl:field: 17,831 / 0
  dbpedia-owl:significantProject: 614 / 0
for the records in each of the data sources. We preprocess the author names to
make the representation of names consistent across all data sources. The search
over the index allows for slight spelling variations, the presence of only the
initial of the forename, missing middle names, and a swap in the order of the
fore- and surnames. The decision of whether a person record is considered to be
sufficiently similar to the author of a publication record is currently based on
a domain specific heuristic, and can be improved. However, the algorithm only
serves as a test-bed to assess whether GND and DBpedia provide sufficiently
detailed information to be used for author disambiguation.
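The matching step can be pictured with a minimal sketch (the normalisation rules and the 0.9 threshold below are illustrative choices, not the heuristic actually used):

import re
from difflib import SequenceMatcher

def normalise(name):
    # e.g. "C. Hedeler" -> ("hedeler", "c"); "Hedeler, Cornelia" -> ("hedeler", "cornelia")
    name = name.lower().replace(".", " ")
    if "," in name:
        last, first = [p.strip() for p in name.split(",", 1)]
    else:
        parts = name.split()
        first, last = " ".join(parts[:-1]), parts[-1]
    first_token = re.sub(r"[^a-z]", "", first.split()[0]) if first.split() else ""
    return last.strip(), first_token

def candidate_match(author, person, threshold=0.9):
    a_last, a_first = normalise(author)
    p_last, p_first = normalise(person)

    def compatible(l1, f1, l2, f2):
        surname_ok = SequenceMatcher(None, l1, l2).ratio() >= threshold  # slight spelling variations
        forename_ok = not f1 or not f2 or f1[0] == f2[0]                 # initial-only forenames, missing middle names
        return surname_ok and forename_ok

    # direct comparison, or with fore- and surname swapped on one side
    return compatible(a_last, a_first, p_last, p_first) or \
           compatible(a_last, a_first, p_first, p_last)

print(candidate_match("Hedeler, Cornelia", "C. Hedeler"))  # True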
4 Analysis and Evaluation
Analysis of GND and DBpedia. In addition to the (incomplete) list of publi-
cations of a person, GND contains additional information that could characterise
a person sufficiently, including subject categories, their profession, and keywords
for their publications. Unlike the GND authority file and the additional publica-
tion records, which are maintained by a library, and therefore, are structured and
contain data more akin to digital libraries, DBpedia was not developed for that
purpose, resulting in the required information being less readily available. In ad-
dition, the German part of DBpedia contains significantly less of the information
useful for author disambiguation (see Table 1 right and Table 2).
Evaluation To determine whether the lack of more detailed information has a
negative effect on the performance of author disambiguation using these reference
data, we have taken the following data sets: (i) a manually created small test
data set consisting of 30 of the top social scientists with a differentiated entry
in GND and DBpedia; (ii) a random subset of 250 computer scientists from
person records in DBLP with a link to the corresponding Wikipedia page. We then
ran the part of the author disambiguation algorithm that identifies the GND and
DBpedia entries of an author of a publication record in Sowiport and DBLP. The
precision, ranging between 0.7 and 1, is encouraging (in detail: social scientists
with an entry in the German DBpedia: 0.97; with an entry in the English DBpedia: 1; with
an entry in GND: 0.92; computer scientists with an entry in DBpedia: 0.89, taking into
account the language of the false positives; with an entry in GND: 0.7). However,
the data set used here is fairly small and does not contain too many people with
common names, which contribute the majority of the false positives.
5 Discussion
The analysis and evaluation of DBpedia and GND has shown that the seman-
tic markup of the information in DBpedia is still lacking in various aspects.
How much of a problem this lack of appropriately detailed information and of
completeness really poses for a given task depends not only on the corresponding
subset of the reference data and its properties, but also on the remainder of the
reference data set and on the digital library data set. This would suggest that a
quality measure that assesses the suitability of the reference data set for author
disambiguation should take into account the following: (i) tuple completeness,
(ii) specificity of the annotation with ontologies, (iii) how much of the informa-
tion is provided in the form of ontologies or thesauri or, even worse, literal strings,
which provides an indication of the expected heterogeneity of the information
across different data sets, and (iv) the number of people in the reference data
set who share their names.
To bring this into context with the digital library data set, one could also
determine whether and how many of the author names are shared with several
person records in the reference data set. In particular in these cases, sufficiently
detailed information is vital in order to be able to identify the correct person
record or determine that there is no person record available for that particular
person, even though there are plenty of records for people with the same name.
References
1. Ferreira, A.A., Gonçalves, M.A., Laender, A.H.F.: A brief survey of automatic meth-
ods for author name disambiguation. SIGMOD Record 41(2) (2012)
2. Herskovic, J.R.J., Tanaka, L.Y.L., Hersh, W.W., Bernstam, E.V.E.: A day in the life
of PubMed: analysis of a typical day’s query log. Journal of the American Medical
Informatics Association : JAMIA 14(2), 212–220 (2007)
3. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N.,
Hellmann, S., Morsey, M., van Kleef, P., Auer, S.: DBpedia – A large-scale, multilingual
knowledge base extracted from Wikipedia. Semantic Web Journal (2014)
4. Ley, M.: DBLP: some lessons learned. In: VLDB’09. pp. 1493–1500 (2009)
5. Reuther, P., Walter, B., Ley, M., Weber, A., Klink, S.: Managing the Quality of
Person Names in DBLP. In: ECDL’06. pp. 508–511 (2006)
6. Smalheiser, N.R., Torvik, V.I.: Author name disambiguation. Annual review of in-
formation science and technology 43(1) (2009)
SHEPHERD: A Shipping-Based Query Processor to
Enhance SPARQL Endpoint Performance
Maribel Acosta1 , Maria-Esther Vidal2 , Fabian Flöck1 ,
Simón Castillo2 , Carlos Buil-Aranda3 , and Andreas Harth1
1
Institute AIFB, Karlsruhe Institute of Technology, Germany
{maribel.acosta,fabian.floeck,harth}@kit.edu
2
Universidad Simón Bolı́var, Venezuela
{mvidal,scastillo}@ldc.usb.ve
3
Department of Computer Science, Pontificia Universidad Católica, Chile
cbuil@ing.puc.cl
Abstract. Recent studies reveal that publicly available SPARQL endpoints ex-
hibit significant limitations in supporting real-world applications. In order for this
querying infrastructure to reach its full potential, more flexible client-server ar-
chitectures capable of deciding appropriate shipping plans are needed. Shipping
plans indicate how the execution of query operators is distributed between the
client and the server. We propose SHEPHERD, a SPARQL client-server query
processor tailored to reduce SPARQL endpoint workload and generate shipping
plans where costly operators are placed at the client site. We evaluated SHEP-
HERD on a variety of public SPARQL endpoints and SPARQL queries. Experi-
mental results suggest that SHEPHERD can enhance endpoint performance while
shifting workload from the endpoint to the client.
1 Introduction
Nowadays, public SPARQL endpoints are widely deployed as one of the main mecha-
nisms to consume Linked Data sets. Although endpoints are acknowledged as a promis-
ing technology for RDF data access, a recent analysis by Buil-Aranda et al. [1] indi-
cates that performance and availability vary notably between different public endpoints.
One of the main reasons for the at times undesirable performance of public SPARQL
endpoints is the unpredictable workload, since a large number of clients may be concur-
rently accessing the endpoint and some of the queries handled by endpoints may incur
prohibitively high computational costs. To relieve endpoints of some of the workload
they face, many operators of the query can potentially be executed at the client side.
Shipping policies [2] allow for deciding which parts of the query will be executed at the
client or the server according to the abilities of SPARQL endpoints.
The goal of this work is to provide a system to access SPARQL endpoints that
shifts workload from the server to the client taking into account the capabilities of the
addressed endpoint for executing a certain query – while still offering a competitive per-
formance in terms of execution time and the number of answers produced. We propose
SHEPHERD, a SPARQL query processor that mitigates the workload posed to pub-
lic SPARQL endpoints by tailoring hybrid shipping plans to every specific endpoint.
In particular, SHEPHERD performs the following tasks: (i) decomposing SPARQL
queries into lightweight sub-queries that will be posed against the endpoint, (ii) travers-
ing the plan space in terms of formal properties of SPARQL queries, and (iii) generating
shipping-based query plans based on the public SPARQL endpoint performance statis-
tics collected by SPARQLES [1]. We designed a set of 20 different SPARQL queries
over four public SPARQL endpoints. We empirically analyzed the performance of the
hybrid shipping policies devised by SHEPHERD and the query shipping policy when
submitting a query directly to a SPARQL endpoint.
2 The SHEPHERD Architecture
SHEPHERD is a SPARQL query processor based on the wrapper architecture [4].
SHEPHERD implements different shipping policies to reduce the workload posed over
public SPARQL endpoints. Figure 1 depicts the SHEPHERD architecture which con-
sists of three core components: the SHEPHERD optimizer, the engine broker, and the
SPARQL query engine that is considered a black box component.
Fig. 1. The SHEPHERD architecture. (The diagram shows the SHEPHERD optimizer with its query parser, planner, algebraic space, rewriter, cost model and Endpoint Reliability Estimator (ERE) over the endpoints status and engine catalog; the engine broker passes the optimized plan τ(Q) as a query Q′ to a SPARQL query engine, which queries the SPARQL endpoint.)
The SHEPHERD optimizer is designed for enhancing SPARQL query plans since
it relies on formal properties of SPARQL and statistics of SPARQL endpoints to esti-
mate the plan cost. In the following, we elaborate on each of the sub-components that
comprise the proposed optimizer.
– Query parser: Translates the input query Q into internal structures that will be
processed by the planner.
– Planner: Implements a dynamic programming algorithm to traverse the space
of plans. During the optimization process, SHEPHERD decides whether to place
the operators at the server (i.e., endpoint), or client (i.e., SHEPHERD) according to
statistics of the endpoint. In this way, SHEPHERD explores shipping policies tai-
lored for each public endpoint. The planner generates bushy-tree plans, where the
leaves correspond to light-weight sub-queries and the nodes correspond to opera-
tors (annotated with the shipping policy to follow).
– Algebraic space and rewriter: The algebraic space defines a set of algebraic rules
to restrict plan transformations. The algebraic rules correspond to the formal prop-
erties for well-designed SPARQL patterns [3]. The rewriter transforms a query in
terms of the algebraic space to produce a more efficient equivalent plan.
– Cost model: The cost of executing sub-queries and SPARQL operators at the end-
point is obtained from the ERE component. Based on these values, SHEPHERD
estimates the cost of combining sub-plans with a given operator. The cost model
is designed to gradually increase the cost of operators when more complex expres-
sions are evaluated. This behavior is modeled with the Boltzmann distribution.1
A minimal, illustrative sketch of this idea is given after this list.
– Endpoint Reliability Estimator (ERE): Endpoint statistics collected by the SPAR-
QLES tool [1] are used to provide reliable estimators for the SHEPHERD cost
model. The endpoint statistics are aggregated and stored in the SHEPHERD cata-
log, and used to characterize endpoints in terms of operator performance.
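The sketch below illustrates, with invented numbers, how a Boltzmann-style weighting lets the estimated endpoint cost of an operator grow gradually with expression complexity, so that sufficiently complex operators end up placed at the client (the constants and the crude client-side estimate are assumptions for illustration, not SHEPHERD's actual cost model):

import math

def boltzmann_cost(base_cost, complexity, temperature=2.0):
    # Endpoint-side cost grows gradually (exponentially) with complexity.
    return base_cost * math.exp(complexity / temperature)

def place_operator(complexity, endpoint_base=1.0, shipping_overhead=2.0, temperature=2.0):
    server_cost = boltzmann_cost(endpoint_base, complexity, temperature)
    client_cost = shipping_overhead + complexity  # crude client-side estimate
    return "client" if client_cost < server_cost else "endpoint"

for c in (0.5, 1.0, 2.0, 4.0):
    # simple operators stay at the endpoint, complex ones move to the client
    print(c, place_operator(c))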
The engine broker translates the optimized plan τ(Q) into the corresponding input
for the SPARQL query engine that will execute the plan. The engine broker can specify
the plan in two different ways: i) as a query Q′ with the SPARQL Federation Extension,
to execute the query with a SPARQL 1.1 engine; ii) by translating the plan directly into
the internal structures of a given query engine.
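For the first option, a hybrid plan could, for instance, be rendered as a SPARQL 1.1 query in which the lightweight sub-queries are sent to the endpoint through SERVICE clauses while the join with the OPTIONAL is evaluated by the local engine (the query below is a made-up illustration, not SHEPHERD output):

HYBRID_PLAN = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?city ?mayor WHERE {
  SERVICE <http://dbpedia.org/sparql> { ?city a dbo:City ; dbo:country dbr:Chile . }
  OPTIONAL {
    SERVICE <http://dbpedia.org/sparql> { ?city dbo:leaderName ?mayor . }
  }
}
"""  # the two SERVICE blocks are the lightweight leaves; the OPTIONAL join runs at the client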
3 Experimental Study
We empirically compared the performance of the hybrid shipping policies implemented
by SHEPHERD with the query shipping policy when executing queries directly against
the endpoint. We selected the following four public SPARQL endpoints monitored by
SPARQLES [1]: DBpedia, IproBio2RDF, data.oceandrilling.org and TIP.2 We designed
a query benchmark comprising five different queries for each endpoint.3 Each query
contains modifiers as well as graph patterns that include UNIONs, OPTIONALs, and
filters. SHEPHERD was implemented using Python 2.7.6. Queries were executed di-
rectly against the endpoints using the command curl. All experiments were performed
from the Amazon EC2 Elastic Compute Cloud infrastructure.4
Figure 2 depicts the result of the queries in terms of the execution time (sec.) as re-
ported by the Python time.time() function. We can observe that in the majority of
the cases SHEPHERD retrieves the results notably faster, except in three queries. Con-
cerning the cardinality of the result set retrieved, a similar picture emerges. For 18 of
the overall 20 queries tested, both methods produced the same amount of results, while
in one instance each SHEPHERD and the query shipping approach did not retrieve any
answers. Both methods therefore seem to be on par in this regard.
Even though SHEPHERD is able to reduce the execution time to a certain extent, the
most important finding is that it shifted on average 26% of the operators in the queries to
the client, thereby relieving the servers of notable workload. The ratio of data shipping
varies significantly from case to case depending on the individual shipping strategy
chosen, but does not show a direct correlation with the achieved runtime decrease.5
Hence, we can affirm that SHEPHERD is able to reduce computational load on the
endpoints in an efficient way; this could be achieved neither by simply moving all
1
The Boltzmann distribution is also used in Simulated Annealing to model the gradual decreas-
ing of a certain function (temperature).
2
Available at http://dbpedia.org/sparql, http://iproclass.bio2rdf.
org/sparql, http://data.oceandrilling.org/sparql and http://lod.
apc.gov.tw/sparql, respectively.
3
http://people.aifb.kit.edu/mac/shepherd/
4
https://aws.amazon.com/ec2/instance-types/
5
Std.Dev. is 0.09. The poster will discuss further details about this ratio and its impact.
operator execution to the client – since this increases the bandwidth consumption and
the evaluation of non-selective queries may starve the resources of the client – nor by
submitting the whole query to the endpoint as shown in our experiments.
Fig. 2. Runtime results for the four different public endpoints and studied queries
4 Conclusions
We presented SHEPHERD, a SPARQL query processor that implements hybrid ship-
ping policies to reduce public SPARQL endpoint workload. We crafted 20 different
queries against four SPARQL endpoints and empirically demonstrated that SHEPHERD
is (i) able to adapt to endpoints with different characteristics by varying the rate of op-
erators executed at the client, and (ii) in doing so not only retrieves the same number of
results as query shipping but even decreases runtime in the majority of the cases. While
these results provide a first insight into SHEPHERD’s capabilities, they showcase the
potential of adaptive hybrid shipping approaches, which we will explore in future work.
Acknowledgements
The authors acknowledge the support of the European Community’s Seventh Frame-
work Programme FP7-ICT-2011-7 (XLike, Grant 288342).
References
1. C. B. Aranda, A. Hogan, J. Umbrich, and P.-Y. Vandenbussche. SPARQL web-querying in-
frastructure: Ready for action? In International Semantic Web Conf. (2), pp. 277–293, 2013.
2. M. J. Franklin, B. T. Jónsson, and D. Kossmann. Performance tradeoffs for client-server query
processing. In SIGMOD Conference, pp. 149–160, 1996.
3. J. Pérez, M. Arenas, and C. Gutierrez. Semantics and complexity of SPARQL. ACM Trans.
Database Syst., 34(3):16:1–16:45, Sept. 2009.
4. G. Wiederhold. Mediators in the architecture of future information systems. IEEE Computer,
25(3):38–49, 1992.
AgreementMakerLight 2.0: Towards Efficient
Large-Scale Ontology Matching
Daniel Faria1 , Catia Pesquita1,2 , Emanuel Santos2 , Isabel F. Cruz3 , and
Francisco M. Couto1,2
1
LASIGE, Faculdade de Ciências, Universidade de Lisboa, Portugal
2
Dept. Informática, Faculdade de Ciências, Universidade de Lisboa, Portugal
3
ADVIS Lab, Dept. of Computer Science, University of Illinois at Chicago, USA
Abstract. Ontology matching is a critical task to realize the Semantic
Web vision, by enabling interoperability between ontologies. However,
handling large ontologies efficiently is a challenge, given that ontology
matching is a problem of quadratic complexity.
AgreementMakerLight (AML) is a scalable automated ontology match-
ing system developed to tackle large ontology matching problems, par-
ticularly for the life sciences domain. Its new 2.0 release includes several
novel features, including an innovative algorithm for automatic selection
of background knowledge sources, and an updated repair algorithm that
is both more complete and more efficient.
AML is an open source system, and is available through GitHub 1 both
for developers (as an Eclipse project) and end-users (as a runnable Jar
with a graphical user interface).
1 Background
Ontology matching is the task of finding correspondences (or mappings) between
semantically related concepts of two ontologies, so as to generate an alignment
that enables integration and interoperability between those ontologies [2]. It is
a critical task to realize the vision of the Semantic Web, and is particularly
relevant in the life sciences, given the abundance of biomedical ontologies with
partially overlapping domains.
At its base, ontology matching is a problem of quadratic complexity as it entails
comparing all concepts of one ontology with all concepts of the other. Early
ontology matching systems were not overly concerned with scalability, as the
matching problems they tackled were relatively small. But with the increasing
interest in matching large (biomedical) ontologies, scalability became a critical
aspect, and as a result, traditional all-versus-all ontology matching strategies
are giving way to more efficient anchor-based strategies (which have linear time
complexity).
1
https://github.com/AgreementMakerLight
Fig. 1. AgreementMakerLight ontology matching framework. (The input ontologies in OWL/RDF are loaded into ontology objects, passed through primary matching to a core alignment and through secondary matching to a refined alignment, which undergoes selection and repair before the output alignment is produced.)
2 The AgreementMakerLight System
AgreementMakerLight (AML) is a scalable automated ontology matching system
developed to tackle large ontology matching problems, and focused in particu-
lar on the biomedical domain. It is derived from AgreementMaker, one of the
leading first generation ontology matching systems [1], and adds scalability and
efficiency to the design principles of flexibility and extensibility which character-
ized its namesake.
2.1 Ontology Matching Framework
The AML ontology matching framework is represented in Figure 1. It is divided
into four main modules: ontology loading, primary matching, secondary match-
ing, and alignment selection and repair.
The ontology loading module is responsible for reading ontologies and parsing
their information into the AML ontology data structures, which were conceived
to enable anchor-based matching [5]. AML 2.0 marks the switch from the Jena2
ontology API to the more efficient and flexible OWL API, and includes several
upgrades to the ontology data structures. The most important data structure
AML uses for matching is the Lexicon, a table of class names and synonyms in
an ontology, which uses a ranking system to weight them and score their matches
[7].
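A toy version of such a structure can convey the idea (the weighting of primary labels versus synonyms below is an assumption for illustration, not AML's actual ranking scheme):

from collections import defaultdict

class Lexicon:
    def __init__(self):
        self.names = defaultdict(dict)  # normalised name -> {class_id: weight}

    def add(self, class_id, name, weight=1.0):
        key = name.strip().lower()
        self.names[key][class_id] = max(weight, self.names[key].get(class_id, 0.0))

    def match(self, other):
        # Score of a candidate mapping = product of the weights of the shared name.
        for name, sources in self.names.items():
            for src_id, src_w in sources.items():
                for tgt_id, tgt_w in other.names.get(name, {}).items():
                    yield src_id, tgt_id, src_w * tgt_w

src, tgt = Lexicon(), Lexicon()
src.add("onto1:Heart", "heart", 1.0)
tgt.add("onto2:Heart", "heart", 1.0)
tgt.add("onto2:Heart", "cardiac structure", 0.8)  # synonym with a lower weight
print(list(src.match(tgt)))  # [('onto1:Heart', 'onto2:Heart', 1.0)]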
The primary and secondary matching modules contain AML's ontology match-
ing algorithms, or matchers, with the difference between them being their time
complexity. Primary matchers have O(n) time complexity and therefore can be
employed globally in all matching problems, whereas secondary matchers have
O(n²) time complexity and thus can only be applied locally in large problems.
The use of background knowledge in primary matchers is a key feature in AML,
and it includes an innovative automated background knowledge selection algo-
rithm [4].
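The complexity argument can be seen in a few lines: an all-versus-all matcher compares every pair of names, whereas a hash-based (exact-name) matcher indexes one side once and probes it, which is what makes primary matchers applicable to large ontologies (a simplified illustration; AML's matchers are of course richer than exact string equality):

def all_versus_all(names1, names2, similar):
    return [(a, b) for a in names1 for b in names2 if similar(a, b)]  # O(n*m) comparisons

def hash_based(names1, names2):
    index = {n.lower(): n for n in names2}           # build once: O(m)
    return [(a, index[a.lower()]) for a in names1    # probe: O(n)
            if a.lower() in index]

print(hash_based(["Heart", "Lung"], ["heart", "kidney"]))  # [('Heart', 'heart')]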
The alignment selection and repair module ensures that the final alignment has
the desired cardinality and that it is coherent (i.e., does not lead to the viola-
tion of restrictions of the ontologies) which is important for several applications.
AML’s approximate alignment repair algorithm features a modularization step
which identifies the minimal set of classes that need to be analyzed for coherence,
thus greatly reducing the scale of the repair problem [8].
2.2 User Interface
The GUI was a recent addition to AML, as we sought to make our system
available to a wider range of users. The main challenge in designing the GUI was
finding a way to visualize an alignment between ontologies that was both scalable
and useful for the user. Our solution was to visualize only the neighborhood of
one mapping at a time, while providing several options for navigating through
the alignment [6]. The result is a simple and easy-to-use GUI, which is shown in
Figure 2.
Fig. 2. AgreementMakerLight graphical user interface.
3 Performance
AML 1.0 achieved top results in the 2013 edition of the Ontology Alignment
Evaluation Initiative (OAEI) [3]. Namely, it ranked first in F-measure in the
anatomy track, and second in the large biomedical ontologies, conference and
interactive matching tracks. In addition to its effectiveness in matching life sci-
ences ontologies, AML was amongst the fastest systems in all tracks and, more
importantly, consistently had a high F-measure/runtime ratio.
AML 2.0 is more effective than its predecessor (thanks to the improved handling
of background knowledge, the richer data structures and the addition of new
matching algorithms) without sacrificing efficiency, so we expect it to perform
even better in this year’s edition of the OAEI.
Acknowledgments
DF, CP, ES and FMC were funded by the Portuguese FCT through the SOMER
project (PTDC/EIA-EIA/119119/2010) and the LASIGE Strategic Project
(PEst-OE/EEI/UI0408/2014). The research of IFC was partially supported by
NSF Awards CCF–1331800, IIS–1213013, IIS–1143926, and IIS–0812258 and by
a UIC-IPCE Civic Engagement Research Fund Award.
References
1. I. F. Cruz, F. Palandri Antonelli, and C. Stroe. AgreementMaker: Efficient Matching
for Large Real-World Schemas and Ontologies. PVLDB, 2(2):1586–1589, 2009.
2. J. Euzenat and P. Shvaiko. Ontology Matching. Springer-Verlag New York Inc,
2007.
3. D. Faria, C. Pesquita, E. Santos, I. F. Cruz, and F. M. Couto. AgreementMakerLight
Results for OAEI 2013. In ISWC International Workshop on Ontology Matching
(OM), 2013.
4. D. Faria, C. Pesquita, E. Santos, I. F. Cruz, and F. M. Couto. Automatic Back-
ground Knowledge Selection for Matching Biomedical Ontologies. PLoS One, In
Press, 2014.
5. D. Faria, C. Pesquita, E. Santos, M. Palmonari, I. F. Cruz, and F. M. Couto. The
AgreementMakerLight Ontology Matching System. In OTM Conferences, volume
8185 of LNCS, pages 527–541, 2013.
6. C. Pesquita, D. Faria, E. Santos, and F. M. Couto. Towards visualizing the align-
ment of large biomedical ontologies. In 10th International Conference on Data
Integration in the Life Sciences, 2014.
7. C. Pesquita, D. Faria, C. Stroe, E. Santos, I. F. Cruz, and F. M. Couto. What’s
in a ‘nym’ ? Synonyms in Biomedical Ontology Matching. In The Semantic Web
- ISWC 2013, volume 8218 of Lecture Notes in Computer Science, pages 526–541.
Springer Berlin Heidelberg, 2013.
8. E. Santos, D. Faria, C. Pesquita, and F. M. Couto. Ontology alignment repair
through modularization and confidence-based heuristics. CoRR, arXiv:1307.5322,
2013.
Extracting Architectural Patterns from Web Data
Ujwal Gadiraju, Ricardo Kawase, and Stefan Dietze
L3S Research Center, Leibniz University Hannover, Germany
{gadiraju, kawase, dietze}@L3S.de
Abstract. Knowledge about the reception of architectural structures is crucial
for architects or urban planners. Yet obtaining such information has been a chal-
lenging and costly activity. With the advent of the Web, a vast amount of struc-
tured and unstructured data describing architectural structures has become avail-
able publicly. This includes information about the perception and use of buildings
(for instance, through social media), and structured information about the build-
ing’s features and characteristics (for instance, through public Linked Data). In
this paper, we present the first step towards the exploitation of structured data
available in the Linked Open Data cloud, in order to determine well-perceived
architectural patterns.
1 Introduction and Motivation
Urban planning and architecture encompass the requirement to assess the popularity or
perception of built structures (and their evolution) over time. This aids in understanding
the impact of a structure, identifying needs for restructuring, or drawing conclusions useful
for the entire field, for instance about successful architectural patterns and features.
Thus, information about how people think about a building that they use or see, or
how they feel about it, could prove to be invaluable information for architects, urban
planners, designers, building operators, and policy makers alike. For example, keeping
track of the evolving feelings of people towards a building and its surroundings can help
to ensure adequate maintenance and trigger retrofit scenarios where required. On the
other hand, armed with prior knowledge of specific features that are well-perceived by
the public, builders and designers can make better-informed design choices and predict
the impact of building projects.
The Web contains structured information about particular building features, for
example, size, architectural style, built date, etc. of certain buildings through public
Linked Data. Here in particular, reference datasets such as Freebase1 or DBpedia2 offer
useful structured data describing a wide range of architectural structures.
The perception of an architectural structure itself has historically been studied to
be a combination of the aesthetic as well as functional aspects of the structure [3, 4].
The impact of such buildings of varying types on the built environment, as well as how
these buildings are perceived, thus varies. For example, intuitively we can say that in
1
http://www.freebase.com/
2
http://dbpedia.org/
case of churches, the appearance plays a vital role in the emotions induced amongst
people. However, in case of airports or railway stations, the functionality aspects such
as the efficiency or the accessibility may play a more significant role. This suggests that
the impact of particular influence factors differs significantly between different building
types.
In this paper, we present our work regarding the alignment of Influence Factors with
structured data. Firstly, we identified the influence factors for a predefined set of archi-
tectural structures. Secondly, we align these factors with structured data from DBpedia.
This work serves as a first step towards semantic enhancement of the architectural do-
main, which can support semantic classification of architectural structures, semantic
analysis, and ranking, amongst others.
2 Crowdsourcing Influential Factors and Ranking Buildings
Recent research in the field of neuroscience [1, 2] reliably suggests that neu-
rophysiological correlates of building perception successfully reflect aspects of an ar-
chitectural rule system that adjusts the appropriateness of style and content. They show
that people subconsciously rank buildings that they see, between the categories of ei-
ther high-ranking (‘sublime’) or low-ranking (‘low’) buildings. However, what exactly
makes a building likeable or prominent remains unanswered. Size could be an influ-
ential factor. At the same time, it is not sound to suggest that architects or builders
should design and build only big structures. For instance, a small hall may invoke more
sublime feelings while a huge kennel may not. This indicates that there are additional
factors that influence building perception. In order to determine such factors, we employ
Crowdsourcing.
An initial survey was conducted using LimeService3 with a primary focus on the
expert community of architects, builders and designers in order to determine influential
factors. The survey administered 32 questions spanning the background of the par-
ticipants and their feelings about certain buildings of different types (bridges, churches,
skyscrapers, halls and airports). We received 42 responses from the expert community.
The important influential factors that surfaced from the responses of the survey are pre-
sented below.
For bridges, churches, skyscrapers and halls: history, surroundings, materials, size,
personal experiences, and level of detail. For airports: Ease of access, efficiency, ap-
pearance, choice/availability, facilities, miscellaneous facilities and size.
Based on these influential factors we acquired perception scores of buildings on
a Likert-scale, through crowdsourcing. By aggregating and normalizing these scores
between 0 and 1, we arrived at a ranked list of buildings of each type within our dataset.
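A minimal sketch of that aggregation and normalisation step (the ratings are invented; the exact aggregation function used for the rankings is not spelled out here):

def normalise(scores):
    lo, hi = min(scores.values()), max(scores.values())
    return {b: (s - lo) / (hi - lo) if hi > lo else 0.0 for b, s in scores.items()}

raw = {"Hall A": 3.8, "Hall B": 2.4, "Hall C": 4.6}  # e.g. mean Likert ratings (1-5)
ranked = sorted(normalise(raw).items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # Hall C -> 1.0, Hall A -> ~0.64, Hall B -> 0.0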
3 Correlating Influential Factors with Relevant Structured Data
In order to determine patterns in the perception of well-received structures (as per the
building rankings), we correlate the influential factors of buildings with concrete prop-
erties and values from DBpedia.
3
http://www.limeservice.com/
Table 1: DBpedia properties that are used to materialize corresponding Influence Factors.
  Airports: dbpedia-owl:runwaySurface, dbpedia-owl:runwayLength, dbprop:cityServed, dbpedia-owl:locatedInArea, dbprop:direction
  Bridges: dbprop:architect, dbpedia-owl:constructionMaterial, dbprop:material, dbpedia-owl:length, dbpedia-owl:width, dbpedia-owl:mainspan
  Churches: dbprop:architectureStyle, dbprop:consecrationYear, dbprop:materials, dbprop:domeHeightOuter, dbprop:length, dbprop:width, dbprop:area, dbpedia-owl:location, dbprop:district
  Halls: dbpedia-owl:yearOfConstruction, dbprop:built, dbprop:architect, dbprop:area, dbprop:seatingCapacity, dbpedia-owl:location
  Skyscrapers: dbprop:startDate, dbprop:completionDate, dbpedia-owl:architect, dbpedia-owl:floorCount
Table 1 depicts some of the properties that are extracted from the DBpedia knowl-
edge graph in order to correlate the influence factors corresponding to each structure
with specific values.
By doing so, we can analyze the well-received patterns for architectural structures
at a finer level of granularity, i.e., in terms of tangible properties. In order to extract
relevant data from DBpedia for each structure in our dataset, we first collect a pool of
properties that correspond to each of the influence factors as per the building type (see
Table 1). In the next step, by traversing the DBpedia knowledge graph leading to each
structure in our dataset, we try to extract corresponding values for each of the prop-
erties identified. The properties thus extracted semi-automatically are limited to those
available on DBpedia. In addition, it is important to note that not all structures of a par-
ticular type have the same properties available on DBpedia. Therefore, although all the
identified values accurately correspond to the structure, the coverage itself is restricted
to the data available on DBpedia (see Table 2).
Table 2: Coverage of properties related to ‘size’, extracted from DBpedia for different architectural structures in our dataset.
  Airports: runwayLength 95% | Bridges: length 67.79% | Churches: architectureStyle 36.69% | Halls: seatingCapacity 65.67% | Skyscrapers: floorCount 91%
4 Application and Conclusions
By correlating the influence factors to specific DBpedia properties, we can identify pat-
terns for well-perceived architectural structures. In order to demonstrate how such ob-
served patterns for architectural structures can be used, we choose the influence factor
‘size’ of the structure. Although this approach can be directly extended to other in-
fluence factors and across different kinds of architectural structures, due to the limited
space we restrict ourselves to showcasing this influence factor.
We observe that for each airport, we can extract indicators of size using the DBpe-
dia property dbpedia-owl:runwayLength. Similarly, in the case of bridges the influence
factor ‘size’ can be represented using the DBpedia properties dbpedia-owl:length,
dbpedia-owl:width and dbpedia-owl:mainspan; for halls we can use the DBpedia
properties dbprop:area and dbprop:seatingCapacity, while we can use
dbpedia-owl:floorCount and dbprop:height to consolidate the well-perceived
patterns for skyscrapers. We thereby extract corresponding property values for each
structure in our dataset4 using the DBpedia knowledge graph.
Fig. 1: Influence of Size in the perception of Halls.
Figure 1 depicts the influence of size in the perception of halls. We observe that halls
with a seating capacity between 1000 and 4000 people are well-perceived, with positive
perception scores varying between 0.1 and 1. The perception scores are obtained through the
aggregation of results from the crowdsourcing process. Similarly, as a result of the
quantitative analysis of churches, by leveraging the rankings and correlating with the
property dbpedia-owl:architecturalStyle, we find that the most well-received
styles of churches in Germany are (i) Gothic, (ii) Gothic Revival, and (iii) Romanesque.
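As a small illustration of how such property values can be pulled from the public DBpedia endpoint (using the SPARQLWrapper library; the example resource is arbitrary and the endpoint's availability is of course not guaranteed):

from SPARQLWrapper import SPARQLWrapper, JSON

def floor_count(resource_uri):
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
        SELECT ?floors WHERE { <%s> dbpedia-owl:floorCount ?floors }
    """ % resource_uri)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    return [b["floors"]["value"] for b in bindings]

print(floor_count("http://dbpedia.org/resource/Empire_State_Building"))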
With this, we demonstrated that by correlating building characteristics with ex-
tracted data from DBpedia, one is able to compute and analyze architectural structures
quantitatively. Thus, our main contribution includes semantic analysis and quantitative
measurement of public perception of architectural structures based on structured data.
As future work, we plan to develop algorithms that exploit properties from the struc-
tured data on the web in order to provide multi-dimensional architectural patterns like
‘skyscrapers with x size, y uniqueness, and z materials used are best perceived’, which
architects and urban planners can benefit from.
References
1. I. Oppenheim, H. Mühlmann, G. Blechinger, I. W. Mothersill, P. Hilfiker, H. Jokeit, M. Kur-
then, G. Krämer, and T. Grunwald. Brain electrical responses to high-and low-ranking build-
ings. Clinical EEG and Neuroscience, 40(3):157–161, 2009.
2. I. Oppenheim, M. Vannucci, H. Mühlmann, R. Gabriel, H. Jokeit, M. Kurthen, G. Krämer,
and T. Grunwald. Hippocampal contributions to the processing of architectural ranking. Neu-
roImage, 50(2):742–752, 2010.
3. C. Sitte. City planning according to artistic principles. Rizzoli, 1986.
4. L. H. Sullivan. The autobiography of an idea, volume 281. Courier Dover Publications, 1956.
4
Our dataset and building rankings:
http://data-observatory.org/building-perception/
Xodx
A node for the Distributed Semantic Social Network
Natanael Arndt and Sebastian Tramp
Universität Leipzig, Institut für Informatik, AKSW,
Postfach 100920, D-04009 Leipzig, Germany
{arndt|tramp}@informatik.uni-leipzig.de
1 Introduction
The world wide web (WWW) is not anymore just an information retrieval system [1] but rather an interactive communication medium. Within the last decade online social networks have evolved and constantly increased in popularity. The currently most used online social network services, according to their estimated monthly active users1, are Facebook (1.27 billion, facebook.com), Google Plus (541 million, plus.google.com) and Twitter (283 million, twitter.com). Compared to the estimated total number of WWW users of 2.93 billion, over 40 % of the users of the WWW are actively using Facebook. This concentration on a few single services contradicts the actual organisation of the WWW and the whole Internet as a network of decentrally organised and interconnected computer nodes. This situation bears risks regarding privacy, data security, data ownership, reliability of the services and freedom of communication. By building up a distributed online social network with multiple interconnected services these risks can be minimised and a much more flexibly expandable network is created.
We present Xodx (http://aksw.org/Projects/Xodx, includes a live demo), an implementation of a node for the Distributed Semantic Social Network (DSSN). The DSSN is a general architecture for building an online social network using Semantic Web standards and additional protocols for real-time communication. Xodx provides functionality for publishing and editing personal profiles, adding friends to the friend list, sending and receiving friendship requests, publishing posts and following other users' activities across distributed nodes.
2 Node Intercommunication and Integration in the Web
The complete architecture for the DSSN is proposed in “An Architecture of a Distributed Semantic Social Network” [2]. It combines established Web 2.0 technologies, i.e. (Semantic) Pingback [3] (Ping) and Activity-Streams published through PubSubHubbub (PuSH)2, with an RDF data model and the Linked Data protocol [4]. The consequent use of RDF facilitates the integration of heterogeneous data in the data model. By using the Linked Data protocol the DSSN is
1 as estimated on Internet Live Stats: internetlivestats.com, July 13th 2014
2 PubSubHubbub: https://code.google.com/p/pubsubhubbub/
easily integrated with any other Semantic Web application and thus facilitates an extendable infrastructure. This enables us to build up a completely distributed network of data and services, which is highly integrated and embedded in the Web of Data and WWW.
Figure 1. Architecture of an Xodx node and intercommunication between nodes on the DSSN. [5] (The diagram shows the node's application components – profile manager, sharing application, notification component, activity streams, feeds, WebIDs and data/media artefact resources – together with the Ping and PuSH services that connect the node to the PuSH hub and to other DSSN nodes; the numbered interactions 1–7 are referenced in the text below.)
The architecture of the Xodx implementation and the intercommunication with other nodes is depicted in Fig. 1. The profile manager is used for editing the WebID and adding new friendship relations; it creates and updates the according resources and activity streams (1). Similarly, the sharing application can be used for sharing media artefacts and web resources. Updates of the WebID or any other data or media artefact are announced (2a) by updating the activity stream, which is then announced to the PuSH service (2b), which in turn publishes the updates to the PuSH hub (2c). In parallel this change is announced to the Ping component (3a), which sends a ping to each resource which is mentioned in the update (3b). With the PuSH service a user can be subscribed to any activity stream on the DSSN (4a, 4b), which implements a follow functionality. If the PuSH hub receives a new update announcement from any DSSN node (publish, 5a) it will update all nodes which are subscribed to the according resource stream (5b). The PuSH service of the Xodx node will then notify the according components (5c). When the ping service receives a new ping (6a) it will call the notification component to generate a new user notification (6b). If other nodes (or any Linked Data application) browse the DSSN, the activity streams, data and media artefacts and any RDF resource can be retrieved via standard HTTP GET requests (7) according to the Linked Data principle.
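A minimal sketch (not the Xodx code base) of the interactions above, assuming a PubSubHubbub 0.3/0.4 hub and a form-encoded (Semantic) Pingback endpoint; all URLs are placeholders:

import requests

def announce_update(hub_url, topic_url):
    # Step 2c: tell the PuSH hub that the activity stream (the topic) has new entries.
    return requests.post(hub_url, data={"hub.mode": "publish", "hub.url": topic_url})

def follow(hub_url, topic_url, callback_url):
    # Steps 4a/4b: subscribe the local node's callback to a remote activity stream.
    return requests.post(hub_url, data={
        "hub.mode": "subscribe",
        "hub.topic": topic_url,
        "hub.callback": callback_url,
        "hub.verify": "async",
    })

def ping(pingback_service_url, source, target):
    # Step 3b: notify a resource mentioned in an update via (Semantic) Pingback.
    return requests.post(pingback_service_url, data={"source": source, "target": target})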
Figure 2. Screenshot and feature demonstration of Xodx
Figure 2 shows the home screen of a logged-in user; it is organised in three main parts: the top navigation bar, the button and friend-list bar to the right, and the main part with the activity stream in the middle. All other resource views are organised very similarly to the user's home screen. The top navigation bar from left to right has a home button for the user, which also indicates if new notifications are available; a button to navigate to the list of profiles available at the current node; and the profile editor, which enables the user to edit any triple of her WebID. The rightmost button is for log-out.
On the right hand side one can view the Activity Feed of the currently displayed resource by clicking on “Show Activity Feed”. If the profile of a person different from the current user is displayed, an additional button “Add {name} as Friend” is displayed, to add a friendship relation to the own profile, send a ping to the respective person for notification and subscribe to his activity stream for updates. Below the buttons one can find the friend list. It contains a field for adding a new friend using its WebID-URI and a list of existing friendship relations. By clicking on the individual entries the user can browse the profiles and further navigate to the friends of their friends. If the browser reaches a resource which is not yet available in the local triple store, it is retrieved according to the Linked Data principles.
The central part shows the picture and the name of the current user, current notifications and the sharing application with the activity stream. If a new friend request is sent to the node or if the user was mentioned in a post, a new notification is generated and displayed in the notification section to inform the user. Using the sharing application a user can create and share new posts, web re-
467
bQm`+2b Q` T?QiQb rBi? ?Bb bm#b+`B#2`b pB Sma>X h?2 +iBpBiv bi`2K /BbTHv2/
i i?2 p2`v #QiiQK Bb +QK#BM2/ bi`2K Q7 i?2 T2`bQMH bi`2K Q7 i?2 mb2` M/
HH +iBpBiv bi`2Kb b?2 Bb bm#b+`B#2/ iQX qBi? i?2 irQ #miiQMb iQ i?2 `B;?i Q7
i?2 +iBpBiv 2Mi`v mb2` +M ;2M2`i2 `2THv `2bQm`+2 Mbr2`BM; iQ i?Bb +iBpBiv
`2bQm`+2 M/ pB2r i?2 +iBpBiv 722/ Q7 i?Bb `2bQm`+2- r?B+? BM+Hm/2b HH Bib `2THB2bX
3 Prospect and Future Work
The Xodx implementation demonstrates the feasibility of a semantic social network built from distributed nodes. It already supports the main features of building friendship relations and sharing resources across the network. Currently some further features for supporting work in groups are planned. Practical tests with multiple nodes have already pointed out the usage of the PubSubHubbub protocol as a limiting factor. Currently only a few hub implementations are available, and effectively the Google reference implementation and instance is the only usable hub. So we are trying to substitute the federation protocol with the Linked Data and Semantic Pingback protocols. But due to the Linked Data nature of the DSSN we have seen that it is easy to integrate this social network architecture with any application or data set on the Web of Data. Possible applications that can benefit from the semantic integration with the social network are, e.g., collaborative wikis, weblogs or personal information management systems. This concept will give us new opportunities for integrating social functionality in any web application and thus extend the social web to the whole Internet rather than to some big closed nodes.
References
1. Berners-Lee, T., Cailliau, R., Groff, J.F., Pollermann, B.: World-Wide Web: The Information Universe. Electronic Networking: Research, Applications and Policy 2(1) (1992) 52–58
2. Tramp, S., Frischmuth, P., Ermilov, T., Shekarpour, S., Auer, S.: An Architecture of a Distributed Semantic Social Network. Semantic Web Journal, Special Issue on The Personal and Social Semantic Web (2012)
3. Tramp, S., Frischmuth, P., Ermilov, T., Auer, S.: Weaving a Social Data Web with Semantic Pingback. In Cimiano, P., Pinto, H., eds.: Proceedings of the EKAW 2010 – Knowledge Engineering and Knowledge Management by the Masses; 11th October–15th October 2010, Lisbon, Portugal. Volume 6317 of Lecture Notes in Artificial Intelligence (LNAI), Berlin / Heidelberg, Springer (October 2010) 135–149
4. Berners-Lee, T.: Linked Data. Design issues, W3C (June 2009) http://www.w3.org/DesignIssues/LinkedData.html
5. Arndt, N.: Xodx – Konzeption und Implementierung eines Distributed Semantic Social Network Knotens. Master's thesis, Universität Leipzig, Fakultät für Mathematik und Informatik, Institut für Informatik (June 2013)
468
An Ontology Explorer for Biomimetics Database
Kouji KOZAKI1 and Riichiro MIZOGUCHI2
1
The Institute of Scientific and Industrial Research, Osaka University
8-1 Mihogaoka, Ibaraki, Osaka, 567-0047 Japan
kozaki@ei.sanken.osaka-u.ac.jp
2
Japan Advanced Institute of Science and Technology
1-1 Asahidai, Nomi, Ishikawa 923-1292, Japan
mizo@jaist.ac.jp
Abstract. Biomimetics contributes to innovative engineering by imitating the
models, systems, and elements of nature. For biomimetics research, it is important to develop a biomimetics database that includes widely varied knowledge across different domains such as biology and engineering. Interoperability of knowledge among those domains is necessary to create such a database. For this purpose, the authors are developing a biomimetics ontology which bridges gaps between biology and engineering. In this demo, the authors show an ontology exploration tool for a biomimetics database. It is based on linked data techniques and allows users to find important keywords so that they can search for meaningful knowledge in various databases.
Keywords: ontology, linked data, biomimetics, database, semantic search
1 Introduction
Learning from nature aids the development of technologies. Awareness of this fact has been increasing, and biomimetics1 [1], innovative engineering through imitation of the models, systems, and elements of nature, has caught the attention of many people. Well-known examples of biomimetics include paint and cleaning technologies that imitate
the water repellency of the lotus, adhesive tapes that imitate the adhesiveness of gecko
feet, and high-speed swimsuits that imitate the low resistance of a shark’s skin. These
results integrate studies on the biological mechanisms of organisms with engineering
technologies to develop new materials. Facilitating such biomimetics-based innova-
tions requires integrating knowledge, data, requirements, and viewpoints across differ-
ent domains. Researchers and engineers need to develop a biomimetics database to as-
sist them in achieving this goal.
Because ontologies clarify concepts that appear in target domains [2], we assume
that it is important to develop a biomimetics ontology that contributes to improvement
of knowledge interoperability between the biology and engineering domains. Further-
more, linked data technologies are very effective for integrating a database with exist-
ing biological diversity databases. On the basis of these observations, we developed a
1
http://www.cbid.gatech.edu/
469
biomimetics ontology and an ontology exploration system based on linked data techniques. The tool allows users to find important keywords for retrieving meaningful knowledge, from the viewpoint of biomimetics, across various databases. This demo shows how the ontology explorer for a biomimetics database works on the Web.
2 A Biomimetics Ontology
Before we began developing a biomimetics ontology, we conducted interviews with
engineers working with biomimetics regarding their requirements for biomimetics da-
tabase search. When we asked, “What do you want to search for in a biomimetic data-
base?” they said they wanted to search for organisms or organs that perform functions
that they were trying to develop in their new products. In fact, most successful examples
are imitations of capabilities that organisms possess, such as the water repellency of a
lotus and the adhesiveness of a gecko’s feet. Therefore, we proposed that it is important
to search the biomimetic database for functions or goals that they want to achieve.
On the other hand, someone engaged in cooperative research with engineers and
biologists reported that engineers do not have knowledge that is very familiar to biolo-
gists. For instance, when an engineer had a question about functions of projections
shown in an electron microscopy image of an insect, a biologist (entomologist) sug-
gested that it could have an anti-slip capability, because the insect often clings to slip-
pery surfaces. This suggests that a biomimetic ontology must bridge knowledge gaps
between engineers and biologists.
Considering the requirements discussed above, we set the first requirement for the biomimetics ontology: it should be possible to search for related organisms by the function the user wants to achieve. At the same time, we propose that it should support various viewpoints to bridge gaps among domains. As a result, we built a biomimetics ontology that includes 379 concepts (classes) and 314 relationships (properties), excluding the is-a (sub-class-of) relation. For example, Organism may have relationships such as Ecological environment, Characteristic behavior, Characteristic structure, Characteristic function, and Region Part, while Goal may have relationships such as Structure on which to base and Related function. Other top-level concepts include Behavior, Act, Function, Process, Structure, Living environment, and so on.
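As an informal illustration of how such relationships could be expressed in RDF, the following sketch builds a few toy triples with rdflib; the namespace, class and property identifiers are invented for this example and are not the actual identifiers of the biomimetics ontology.

    from rdflib import Graph, Literal, Namespace, RDF, RDFS

    BIO = Namespace("http://example.org/biomimetics#")  # hypothetical namespace
    g = Graph()
    g.bind("bio", BIO)

    # Toy triples in the spirit of the ontology: an organism, one of its
    # characteristic functions, and a label (identifiers are invented).
    g.add((BIO.Sandfish, RDF.type, BIO.Organism))
    g.add((BIO.Sandfish, BIO.characteristicFunction, BIO.Antifouling))
    g.add((BIO.Antifouling, RDF.type, BIO.Function))
    g.add((BIO.Antifouling, RDFS.label, Literal("Antifouling")))

    print(g.serialize(format="turtle"))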
3 An Ontology Explorer for Biomimetics Database
We developed the ontology explorer for the biomimetics database based on an ontology exploration technique proposed in our previous work [3]. The framework enables users to freely explore a sea of concepts in the ontology from a variety of perspectives according to their own motives. Exploration stimulates their way of thinking and contributes to a deeper understanding of the ontology and hence of its target world. As a result, users can discover what interests them. This can include findings that are new to them, because they might encounter unexpected conceptual chains in the ontology exploration that they would otherwise never have thought of.
Exploration of an ontology can be performed by choosing arbitrary concepts from
which multi-perspective conceptual chains can be traced, according to the explorer’s
470
Fig.1 A snapshot of the Ontology Explorer for Biomimetics Database.
intention. We define the viewpoint for exploring an ontology and obtaining multi-per-
spective conceptual chains as the combination of a focal point and aspects. A focal
point indicates a concept to which the user pays attention as a starting point of the ex-
ploration. The aspect is the manner in which the user explores the ontology. Because
an ontology consists of concepts and the relationships among them, the aspect can be
represented by a set of methods for extracting concepts according to its relationships.
The multi-perspective conceptual chains are visualized in a user-friendly form, i.e., in
a conceptual map. Based on these techniques, we developed the ontology explorer for
retrieving information from a biomimetics database as a web application, so that the user can easily reuse the results for searching other databases, whereas the previously described system was developed as a Java client application. We implemented the ontology exploration tool using HTML5 and JavaScript to enable it to work in web browsers on many platforms, including not only PCs but also tablets and smartphones. We implemented the exploration methods as SPARQL Protocol and RDF Query Language (SPARQL) queries.
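Because the exploration methods are implemented as SPARQL queries, a much-simplified sketch of such a query is shown below: it looks for two-hop conceptual chains from a focal concept to concepts of type Organism using SPARQLWrapper. The endpoint URL and vocabulary URIs are hypothetical, and the fixed two-hop pattern is only an approximation of the actual system, which combines aspects of varying length.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Hypothetical endpoint and vocabulary; the real system searches combinations
    # of aspects of varying length rather than a fixed two-hop pattern.
    endpoint = SPARQLWrapper("http://biomimetics.example.org/sparql")
    endpoint.setQuery("""
    PREFIX bio: <http://example.org/biomimetics#>
    SELECT ?p1 ?mid ?p2 ?organism WHERE {
      bio:Antifouling ?p1 ?mid .   # first aspect (relationship) from the focal point
      ?mid ?p2 ?organism .         # second aspect leading to the end point
      ?organism a bio:Organism .
    }
    LIMIT 50
    """)
    endpoint.setReturnFormat(JSON)

    for row in endpoint.query().convert()["results"]["bindings"]:
        print(row["p1"]["value"], "->", row["mid"]["value"], "->", row["organism"]["value"])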
Fig.1 shows one result of ontology exploration using the system. In this example,
the user selected Antifouling as the focal point (starting point) and obtained conceptual
chains to some Organism as the end point. In this case, the system searches all combinations of aspects (relationships) to generate conceptual chains from the concept selected as the starting point to those specified by the user. As a result, the system shows all conceptual chains between the selected concepts as a conceptual map. By clicking the nodes on the map, the user can view detailed information about each path. Furthermore, the user can use the selected information to search other Linked Data, such as DBpedia, and databases. Though the current version supports only a few LOD sources and databases, it can be easily extended to others.
471
4 Conclusion and Future work
This article outlined an ontology explorer for a biomimetics database. Since the current version of the system is a prototype, it uses only a small ontology and has limits on the conditions of exploration. However, it was well received by researchers in biomimetics. In fact, one of them said that the resulting path from Antifouling to Sandfish shown in Fig.1 was an unexpected one for him. This suggests that the proposed system could contribute to innovations in biomimetics. The researchers also plan to use the biomimetics ontology and system as an interactive index for a biomimetics textbook.
Future work includes extensions of the biomimetics ontology and the exploration system. For the former, we plan to use documents on biomimetics and existing linked data related to biology, and to consider methods for semi-automatic ontology building based on them. For the latter, we are exploring potentially useful patterns through discussion with biomimetics researchers and ontology engineers.
There are many approaches to Semantic Search using SPARQL. For example, Ferré proposes QFS (Query-based Faceted Search) to support navigation in faceted search using LISQL (Logical Information System Query Language) [4] and implements it on top of SPARQL endpoints to scale to large datasets [5]. Popov proposes an exploratory search approach called Multi-Pivot [6] which extracts concepts and relationships from ontologies according to a user's interest. These are visualized and used for semantic searches among instances (data). The authors took the same approach as Popov. Considering how to use these techniques in our system is important future work.
The current version of the proposed system is available at the following URL:
http://biomimetics.hozo.jp/ontology_db.html
Acknowledgements
This work was supported by JSPS KAKENHI Grant Number 25280081 and 24120002.
References
1. Shimomura, M.: Engineering Biomimetics: Integration of Biology and Nanotechnology, De-
sign for Innovative Value Towards a Sustainable Society, 905-907 (2012)
2. Gruber, T.: A translation approach to portable ontology specifications, Proc. of JKAW'92,
pp.89-108 (1992)
3. Kozaki, K., Hirota, T., Mizoguchi, R.: Understanding an Ontology through Divergent Ex-
ploration, Proc. of ESWC 2011, Part I, LNCS 6643, pp.305-320 (2011).
4. Ferré, S., Hermann, A.: Reconciling faceted search and query languages for the semantic
web. IJMSO 7(1), 37-54 (2012)
5. Guyonvarch, J., Ferré, S.: Scalewelis: a scalable query-based faceted search system.
Multilingual Question Answering over Linked Data (QALD-3), Valencia, Spain
6. Popov, I. O., m.c. schraefel, Hall, W., Shadbolt, N.: Connecting the dots: A multi-pivot
approach to data exploration. International Semantic Web Conference (ISWC2011),
LNCS 7031, 553–568 (2011)
472
Semi-Automated Semantic Annotation of the Biomedical
Literature
Fabio Rinaldi
Institute of Computational Linguistics, University of Zurich
fabio.rinaldi@uzh.ch
Abstract. Semantic annotations are a core enabler for efficient retrieval of relevant infor-
mation in the life sciences as well as in other disciplines. The biomedical literature is a major
source of knowledge, which however is underutilized due to the lack of rich annotations that
would allow automated knowledge discovery.
We briefly describe the results of the SASEBio project (Semi Automated Semantic Enrich-
ment of the Biomedical Literature) which aims at adding semantic annotations to PubMed
abstracts, in order to present a richer view of the existing literature.
1 Introduction
The scientific literature contains a wealth of knowledge which however cannot be easily used
automatically due to its unstructured nature. In the life sciences, the problem is so acutely felt that
large budgets are invested in the process of literature curation, which aims at the construction of structured databases using information mostly manually extracted from the literature. There are several dozen life science databases, each specializing in a particular subdomain of biology.
Examples of well-known biomedical databases are UniProt (proteins), EntrezGene (genes), NCBI
Taxonomy (species), IntAct (protein interactions), BioGrid (protein and genetic interactions),
PharmGKB (drug-gene-disease relations), CTD (chemical-gene-disease relations), and RegulonDB
(regulatory interactions in E. coli).
The OntoGene group1 aims at developing text mining technologies to support the process of
literature curation, and promote a move towards assisted curation. By assisted curation we mean a
combination of text mining approaches and the work of an expert curator, aimed at leveraging the
power of text mining systems, while retaining the high quality associated with human expertise.
We believe that it is possible to gradually automate many of the most repetitive activities of the
curation process, and therefore free up the creative resources of the curators for more challenging
tasks, in order to enable a much more efficient and comprehensive curation process. Our text
mining system specializes in the detection of entities and relationships from selected categories,
such as proteins, genes, drugs, diseases, chemicals. OntoGene derives some of its resources from
life sciences databases, thus allowing a deeper connection between the unstructured information
contained in the literature and the structured information contained in databases. The quality of
the system has been tested several times through participation in some of the community-organized
evaluation campaigns, where it often obtained top-ranked results. We have also implemented a
platform for assisted curation called ODIN (OntoGene Document INspector) which aims at serving
the needs of the curation community. The usage of ODIN as a tool for assisted curation has been
tested within the scope of collaborations with curation groups, including PharmGKB [7], CTD
[8], RegulonDB [5].
Assisted curation is also of utility in the process of pharmaceutical drug discovery. Many text
mining tasks in drug discovery require both high precision and high recall, due to the importance
of comprehensiveness and quality of the output. Text mining algorithms, however, often cannot achieve both high precision and high recall, sacrificing one for the other. Assisted curation can be paired with text mining algorithms which have high recall and moderate precision to produce results that are amenable to answering pharmaceutical problems with only a reasonable effort being allocated to curation.
1
http://www.ontogene.org/
473
2 Methods
The OntoGene system is based on a pipeline architecture (see Fig. 1), which includes, among others, modules for entity recognition and relation extraction. Some of the modules are rule-based (e.g. lexical lookup with variants) while others use machine-learning approaches (e.g. maximum entropy techniques). The initial step consists of annotating the names of relevant domain entities in the biomedical literature (currently the system considers proteins, genes, species, experimental methods, cell lines, chemicals, drugs and diseases). These names are sourced from reference databases and are associated with their unique identifiers in those databases, thus allowing resolution of synonyms and cross-linking among different resources.
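The essence of this step can be pictured as a dictionary lookup that maps surface forms, including synonyms, to database identifiers, as in the sketch below; the terminology entries and identifiers are invented placeholders, and the actual OntoGene modules handle variants far more thoroughly.

    # Dictionary-based lookup of entity names, resolving synonyms to identifiers.
    # The terminology entries and identifiers are invented placeholders.
    TERMINOLOGY = {
        "tp53": ("ExampleGeneDB", "GENE:0001"),
        "p53": ("ExampleGeneDB", "GENE:0001"),   # synonym mapped to the same identifier
        "aspirin": ("ExampleDrugDB", "DRUG:0042"),
    }

    def annotate(sentence):
        """Return (surface form, database, identifier) for each recognised term."""
        hits = []
        for token in sentence.lower().replace(",", " ").split():
            if token in TERMINOLOGY:
                database, identifier = TERMINOLOGY[token]
                hits.append((token, database, identifier))
        return hits

    print(annotate("Aspirin modulates TP53 expression"))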
Fig. 1. Schema of the OntoGene pipeline:
BioC XML input
(0) Wait request / validate BioC
(1) Read input with PyBioC reader
(2) Fetch PubMed source (optional)
(3) Convert to OGXML
(4) Sentence splitting + tokenization
(5) Term annotation
(6) Extract terms
(7) Merge tokens
(8) Entity disambiguation
(9) Compute concept relevance
(10) Filter concepts by score
(11) Compute relation relevance
(12) Filter relations by score
(13) Annotate OGXML for visualization
(14) Add annotations to PyBioC writer
(15) Send back annotated BioC
BioC XML output
One of the problems with sourcing resources from several databases is the possible inconsistencies among them. The fact that domain knowledge is scattered across dozens of data sources, occasionally also with some incompatibilities among them, is a severe problem in the life sciences. Ideally these resources should be integrated in a single repository, as some projects are attempting to do (e.g. OpenPhacts [16]), allowing querying within a unified platform. However, a deep integration of the information provided by the scientific literature and the content of the databases is still missing.
We train our system using the knowledge provided by life sciences databases as our gold standard, instead of hand-labeled corpora, since we believe that the scope and size of manually annotated corpora, however much effort has been invested in creating them, is not sufficient to capture the wide variety of linguistic phenomena that can be encountered in the full corpus of biomedical literature, let alone other types of documents, such as internal scientific reports in the pharma industry, which are not represented at all in annotated corpora. For example, PubMed currently contains more than 23 million records, while the entire set of all annotated publications probably barely reaches a few thousand, most of them sparsely annotated for very specific purposes.
We generate interaction candidates using co-occurrence of entities within selected syntactic units (typically sentences). An additional step of syntactic parsing using a state-of-the-art dependency parser allows us to derive specialized features in order to increase precision. The details of the algorithm are presented in [14]. The information delivered by the syntactic analysis is used as a factor in order to score and filter candidate interactions based on the syntactic fragment which connects the two participating entities. All available lexical and syntactic information is used in order to provide an optimized ranking for candidate interactions. The ranking of relation candidates is further optimized by a supervised machine learning method described in detail in [2].
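The sketch below illustrates the candidate-generation step in isolation, assuming entity mentions have already been grouped per sentence; the entity tuples are invented, and the scoring and syntactic filtering described above are omitted.

    from itertools import combinations

    # Hypothetical output of the entity-annotation step: one list of (type, id)
    # tuples per sentence.
    annotated_sentences = [
        [("protein", "P1"), ("protein", "P2"), ("chemical", "C1")],
        [("protein", "P2")],
    ]

    candidates = []
    for entities in annotated_sentences:
        # every unordered pair of entities in the same sentence is a candidate
        for pair in combinations(entities, 2):
            candidates.append(pair)

    print(candidates)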
474
3 Results
The OntoGene annotator offers an open architecture allowing for a considerable level of customization, so that it is possible to plug in in-house terminologies. We additionally provide access to some of our text mining services through a RESTful interface.2 Users can submit arbitrary documents to the OntoGene mining service by embedding the text to be mined within a simple XML wrapper. Both input and output of the system are defined according to the BioC standard [4]. However, typical usage will involve processing of PubMed abstracts or PubMed Central full papers. In this case, the user can simply provide the PubMed identifier of the article as input. Optionally the user can specify which type of output they would like to obtain: if entities, which entity types, and if relationships, which combination of types.
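As a hedged illustration only, a submission to such a service could look roughly like the following; the exact endpoint path, wrapper elements and parameters are assumptions rather than the documented OntoGene API, for which the service documentation (see footnote 2) is authoritative.

    import requests

    # Base URL from the paper's footnote; the exact path, parameters and wrapper
    # element names below are hypothetical placeholders, not the documented API.
    SERVICE = "http://www.ontogene.org/webservices/"
    payload = ("<collection><document><id>DEMO-1</id>"
               "<passage><text>IL-2 binds to its receptor.</text></passage>"
               "</document></collection>")

    resp = requests.post(SERVICE, data=payload.encode("utf-8"),
                         headers={"Content-Type": "application/xml"}, timeout=30)
    print(resp.status_code)
    print(resp.text[:500])  # on success, annotated BioC XML would be returned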
The OntoGene pipeline identifies all relevant entities mentioned in the paper, and their interac-
tions, and reports them back to the user as a ranked list, where the ranking criterion is the system's
own confidence for the specific result. The confidence value is computed taking into account sev-
eral factors, including the relative frequency of the term in the article, its general frequency in
PubMed, the context in which the term is mentioned, and the syntactic configuration between two
interacting entities (for relationships). A detailed description of the factors that contribute to the
computation of the confidence score can be found in [14].
The user can choose to either inspect the results, using the ODIN web interface, or to have
them delivered back via the RESTful web service in BioC XML format, for further local process-
ing. ODIN (OntoGene Document Inspector) is a flexible browser-based client application which
interfaces with the OntoGene server. The curator can use the features provided by ODIN to vi-
sualize selected annotations, together with the statements from which they were derived, and, if
necessary, add, remove or modify them. Once the curator has validated a set of candidate annota-
tions, they can be exported, using a standard format (e.g. CSV, RDF), for further processing by
other tools, or for inclusion in a reference database, after a suitable format conversion. In case of
ambiguity, the curator is offered the opportunity to correct the choices made by the system, at any of the different levels of processing: entity identification and disambiguation, organism selection,
interaction candidates. The curator can access all the possible readings given by the system and
select the most accurate.
As a way to verify the quality of the core text mining functionalities of the OntoGene sys-
tem, we have participated in a number of text mining evaluation campaigns [9, 3, 12, 13]. Some of the most interesting results include the best results in the detection of protein-protein interactions in BioCreative 2009 [14], top-ranked results in several tasks of BioCreative 2010 [15], and the best results in
the triage task of BioCreative 2012 [9]. The usage of ODIN as a curation tool has been tested in
a few collaborations with curation groups, including PharmGKB [10], CTD [7], RegulonDB [11].
Assisted curation is also one of the topics being evaluated at the BioCreative competitions [1],
where OntoGene/ODIN participated with favorable results. The effectiveness of the web service
has been recently evaluated within the scope of one of the BioCreative 2013 shared tasks [6].
Different implementations can rapidly be produced upon request.
Since internally the original database identifiers are used to represent the entities and interac-
tions detected by the system, the annotations can be easily converted into a semantic web format,
by using a reference URI for each domain entity, and using RDF statements to express interac-
tions. While it is possible to access the automatically generated annotations for further processing
by a reasoner or integrator tool, we strongly believe that at present a process of semi-automated
validation is preferable and would lead to better data consistency.
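A sketch of this conversion, with an invented predicate vocabulary and example identifiers, could look as follows; the OntoGene system does not prescribe these exact URI patterns.

    from rdflib import Graph, Namespace, URIRef

    EX = Namespace("http://example.org/ontogene/")            # invented predicate vocabulary
    UNIPROT = Namespace("http://purl.uniprot.org/uniprot/")   # UniProt URI base

    g = Graph()
    article = URIRef("http://www.ncbi.nlm.nih.gov/pubmed/12345678")  # hypothetical PubMed id
    protein_a = UNIPROT["P01308"]   # example UniProt-style accession
    protein_b = UNIPROT["P06213"]   # example UniProt-style accession

    g.add((protein_a, EX.interactsWith, protein_b))
    g.add((protein_a, EX.mentionedIn, article))
    g.add((protein_b, EX.mentionedIn, article))

    print(g.serialize(format="turtle"))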
Acknowledgments. The OntoGene group is partially supported by the Swiss National Sci-
ence Foundation (grant 105315 130558/1 to Fabio Rinaldi) and by the Data Science Group at
Hoffmann-La Roche, Basel, Switzerland.
2
http://www.ontogene.org/webservices/
475
References
1. Arighi, C., Roberts, P., Agarwal, S., Bhattacharya, S., Cesareni, G., Chatr-aryamontri, A., Clematide,
S., Gaudet, P., Giglio, M., Harrow, I., Huala, E., Krallinger, M., Leser, U., Li, D., Liu, F., Lu, Z.,
Maltais, L., Okazaki, N., Perfetto, L., Rinaldi, F., Saetre, R., Salgado, D., Srinivasan, P., Thomas, P.,
Toldo, L., Hirschman, L., Wu, C.: Biocreative iii interactive task: an overview. BMC Bioinformatics
12(Suppl 8), S4 (2011), http://www.biomedcentral.com/1471-2105/12/S8/S4
2. Clematide, S., Rinaldi, F.: Ranking relations between diseases, drugs and genes for a curation task.
Journal of Biomedical Semantics 3(Suppl 3), S5 (2012), http://www.jbiomedsem.com/content/3/
S3/S5
3. Clematide, S., Rinaldi, F., Schneider, G.: Ontogene at calbc ii and some thoughts on the need of
document-wide harmonization. In: Proceedings of the CALBC II workshop, EBI, Cambridge, UK,
16-18 March (2011)
4. Comeau, D.C., Doğan, R.I., Ciccarese, P., Cohen, K.B., Krallinger, M., Leitner, F., Lu, Z., Peng, Y.,
Rinaldi, F., Torii, M., Valencia, A., Verspoor, K., Wiegers, T.C., Wu, C.H., Wilbur, W.J.: BioC:
a minimalist approach to interoperability for biomedical text processing. The Journal of Biological
Databases and Curation bat064 (2013), published online
5. Gama-Castro, S., Rinaldi, F., López-Fuentes, A., Balderas-Martínez, Y.I., Clematide, S., Ellendorff,
T.R., Collado-Vides, J.: Assisted curation of growth conditions that affect gene expression in E. coli
K-12. In: Proceedings of the Fourth BioCreative Challenge Evaluation Workshop. vol. 1, pp. 214–218
(2013)
6. Rinaldi, F., Clematide, S., Ellendorff, T.R., Marques, H.: OntoGene: CTD entity and action term
recognition. In: Proceedings of the Fourth BioCreative Challenge Evaluation Workshop. vol. 1, pp.
90–94 (2013)
7. Rinaldi, F., Clematide, S., Garten, Y., Whirl-Carrillo, M., Gong, L., Hebert, J.M., Sangkuhl, K.,
Thorn, C.F., Klein, T.E., Altman, R.B.: Using ODIN for a PharmGKB re-validation experiment.
Database: The Journal of Biological Databases and Curation (2012)
8. Rinaldi, F., Clematide, S., Hafner, S.: Ranking of ctd articles and interactions using the ontogene
pipeline. In: Proceedings of the 2012 BioCreative workshop. Washington D.C. (April 2012)
9. Rinaldi, F., Clematide, S., Hafner, S., Schneider, G., Grigonyte, G., Romacker, M., Vachon, T.: Using
the OntoGene pipeline for the triage task of BioCreative 2012. The Journal of Biological Databases
and Curation, Oxford Journals (2013)
10. Rinaldi, F., Clematide, S., Schneider, G., Romacker, M., Vachon, T.: ODIN: An advanced interface
for the curation of biomedical literature. In: Biocuration 2010, the Conference of the International
Society for Biocuration and the 4th International Biocuration Conference. p. 61 (2010), available from
Nature Precedings http://dx.doi.org/10.1038/npre.2010.5169.1
11. Rinaldi, F., Gama-Castro, S., López-Fuentes, A., Balderas-Martínez, Y., Collado-Vides, J.: Digital cu-
ration experiments for regulondb. In: BioCuration 2013, April 10th, Cambridge, UK (2013)
12. Rinaldi, F., Kappeler, T., Kaljurand, K., Schneider, G., Klenner, M., Clematide, S., Hess, M., von
Allmen, J.M., Parisot, P., Romacker, M., Vachon, T.: OntoGene in BioCreative II. Genome Biology
9(Suppl 2), S13 (2008), http://genomebiology.com/2008/9/S2/S13
13. Rinaldi, F., Kappeler, T., Kaljurand, K., Schneider, G., Klenner, M., Hess, M., von Allmen, J.M.,
Romacker, M., Vachon, T.: OntoGene in Biocreative II. In: Proceedings of the II Biocreative Workshop
(2007)
14. Rinaldi, F., Schneider, G., Kaljurand, K., Clematide, S., Vachon, T., Romacker, M.: OntoGene in
BioCreative II.5. IEEE/ACM Transactions on Computational Biology and Bioinformatics 7(3), 472–
480 (2010)
15. Schneider, G., Clematide, S., Rinaldi, F.: Detection of interaction articles and experimental methods
in biomedical literature. BMC Bioinformatics 12(Suppl 8), S13 (2011), http://www.biomedcentral.
com/1471-2105/12/S8/S13
16. Williams, A.J., Harland, L., Groth, P., Pettifer, S., Chichester, C., Willighagen, E.L., Evelo, C.T.,
Blomberg, N., Ecker, G., Goble, C., Mons, B.: Open phacts: semantic interoperability for drug discov-
ery. Drug Discovery Today 17(21–22), 1188–1198 (2012), http://www.sciencedirect.com/science/
article/pii/S1359644612001936
476
Live SPARQL Auto-Completion
Stéphane Campinas
Insight Centre for Data Analytics, National University of Ireland, Galway
stephane.campinas@insight-centre.org
Abstract. The amount of Linked Data has been growing steadily. How-
ever, the efficient use of that knowledge is hindered by the lack of informa-
tion about the data structure. This is reflected by the difficulty of writing
SPARQL queries. In order to improve the user experience, we propose an
auto-completion library1 for SPARQL that suggests possible RDF terms. In
this work, we investigate the feasibility of providing recommendations by
only querying the SPARQL endpoint directly.
1 Introduction
The Linking Open Data movement has made a tremendous amount of data available to the general user. The available knowledge spans a wide range of domains, from life sciences to films. However, using SPARQL to search through this knowledge is a tedious process, not only because of the syntax barrier but mainly due to the schema heterogeneity of the data. The expression of an information need in SPARQL is difficult due to the schema being generally unknown to the user, as well as to the heterogeneous mix of several vocabularies.
A common solution is for the user to manually gain knowledge about the data
structure, i.e., what predicates and classes are used, by executing additional queries
in parallel to the main one. The paper [3] proposes a “context-aware” auto-completion method for assisting a user in writing a SPARQL query by recommending schema terms in various positions in the query. The method is context-aware in the sense that only essential triple patterns are considered for the recommendations. To do so, it leverages a data-generated schema. Instead, in this work we propose to bypass this need by executing live SPARQL queries in order to provide recommendations. This removes the overhead of pre-computing the data-generated schema. The proposed approach exposes a trade-off between the performance of the application and the quality of the recommendations. We make available a library1 for providing
data-based recommendations that can be used with other tools such as YASGUI [8].
In Section 2 we discuss related works regarding auto-completion for SPARQL.
In Section 3 we present the proposed approach. In Section 4 we report an evaluation
of the system based on query logs of DBpedia.
2 Related Work
Over the years, many contributions have been made towards facilitating the use of SPARQL, either visually [4] or by completely hiding SPARQL from the user [7]. In
this work, we aim to help users with a knowledge of SPARQL by providing an auto-
completion feature. Several systems have been proposed in this direction. Although
1
Gosparqled: https://github.com/scampi/gosparqled
477
the focus in [1] is the visual interface, it can provide recommendations of terms such
as predicates and classes. In [6] possible recommendations are taken from query
logs. The system proposed in [5] provides recommendations based on the data itself,
with a focus on SPARQL federation. Instead, we aim to make available an easy-to-use library whose core feature is to provide data-based recommendations. In [3] an editor with auto-completion was developed that leverages a data-generated schema (i.e., a graph summary). We investigate in this work the practicability of bypassing
the graph summary by relying only on the data.
3 Live Auto-Completion
We propose a data-based auto-completion which retrieves possible items with regard to the current state of the query. Recommended items can be predicates,
classes, or even named graphs. Firstly, we indicate the position in the SPARQL
query that is to be auto-completed, i.e., the Point Of Focus (POF), by inserting the
character ‘<’. Secondly, we reduce the query down to its recommendation scope [3].
Finally, we transform the POF into the SPARQL variable “?POF” which is used
for retrieving recommendations. The retrieved recommendations are then ranked,
e.g., by the number of occurrences of an item.
Recommendation Scope. While building a SPARQL query, not all triple patterns are
relevant for the recommendation. Therefore, we define the scope as the connected
component that contains the POF. Figure 1a depicts a SPARQL query where the
POF is associated with the variable “?s”: it seeks possible predicates that occur with
a “:Person” having the predicate “:name”. Figure 1b depicts the previous SPARQL
query reduced to its recommendation scope. Indeed, the pattern on line 4 is removed
since it is not part of the connected component containing the POF.
1 SELECT * {
2   ?s a :Person ;
3      :name ?name ; < .
4   ?o a :Document
5 }
(a) A query with ‘<’ as the POF

1 SELECT ?POF {
2   ?s a :Person ;
3      :name ?name ; ?POF [] .
4
5 }
(b) Scope of the query

Fig. 1: Query auto-completion
Recommendation Capabilities. The scope may include content-specific terms, e.g., resources and filters, unlike in [3], since the graph summary is an abstraction that captures only the structure of the data. Recommendations about predicates, classes and named graphs are possible as in [3]. In addition, using the data directly makes it possible to provide recommendations for specific resources.
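As an illustration of this capability (not of the Gosparqled internals), the sketch below runs a scope query similar to Fig. 1b against the public DBpedia endpoint and ranks the candidate predicates for ?POF by their number of occurrences; dbo:Person and foaf:name stand in for the unspecified :Person and :name of the figure, and the aggregation over the full data replaces the LIMIT-based sampling used in the evaluation below.

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?POF (COUNT(*) AS ?occurrences) WHERE {
      ?s a dbo:Person ;
         foaf:name ?name ;
         ?POF [] .
    }
    GROUP BY ?POF
    ORDER BY DESC(?occurrences)
    LIMIT 10
    """)
    sparql.setReturnFormat(JSON)

    for binding in sparql.query().convert()["results"]["bindings"]:
        print(binding["POF"]["value"], binding["occurrences"]["value"])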
4 Evaluation
Systems. In this section, we evaluate the recommendations returned by the proposed
system, that we refer to as “S1”, against the ones provided by the approach in [3],
which we refer to as “S2”.
478
Settings. We compare the recommendations with regards to (1) the response-time,
i.e., the time spent on retrieving the recommendations via a SPARQL query; and
(2) the quality of the recommendations. A run of the evaluation consists of the fol-
lowing steps. First, we vary the amount of information retrieved via the “LIMIT”
clause. Then, we compare the ranked TOP-10 recommendations against a gold stan-
dard. The ranking is based on the number of occurrences of a recommendation. The
gold standard consists in retrieving recommendations directly from the data without
the LIMIT clause, and retaining only the 10 most occurring terms. The TOP-10 of
the gold standard and the system are compared using the Jaccard similarity. We
consider that the higher the similarity, the higher the quality of recommendations.
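For reference, the comparison metric is the plain Jaccard similarity between the two TOP-10 sets, as in this small sketch with toy term sets:

    def jaccard(a, b):
        """Jaccard similarity of two sets of recommended terms."""
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    gold = {"dbo:birthPlace", "foaf:name", "rdfs:label"}   # toy gold-standard TOP terms
    system = {"foaf:name", "rdfs:label", "dbo:abstract"}   # toy system TOP terms
    print(round(jaccard(gold, system), 2))                 # prints 0.5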
Queries. We used the query logs of the DBpedia endpoint version 3.3 available
from the USEWOD20132 dataset. The queries3 were stripped of any pattern about
specific resources, in order to keep only the structure of the query. In addition,
we removed queries that contain more than one connected component. Queries are
grouped according to their complexity, which depends on the number of triple pat-
terns and on the number of star graphs. A group is identified by a string that has
as many numbers as there are stars, with numbers separated by a dash ’-’ and rep-
resenting the number of triple patterns in a star. For example, a query with two stars of one triple pattern each is identified by 1-1. This definition of query complexity exposes the potential errors, i.e., recommendations having zero results, that a graph summary can produce, as described in [2].
Graphs. We loaded into an endpoint the English part of the DBpedia 3.34 dataset,
which consists of 167 199 852 triples. The graph summary consists of 29 706 051
triples, generated by grouping resources sharing the same set of classes.
Endpoint. We used a Virtuoso5 SPARQL endpoint. The endpoint is deployed on a
server with 32GB of RAM and with SSD drives.
Comparison. For each group of query complexity QC, we report in Table 1 the
results of the evaluation, with J1 (resp., J2) the average Jaccard similarity for the
system S1 (resp., S2); and T 1 (resp., T 2) the average response-time in ms for the
system S1 (resp., S2). The reported values are the averages over 5 runs. We can see that as the LIMIT gets larger, the Jaccard similarity becomes higher. Since the graph summary used in S2 is a concise representation of the graph structure, the data sample at a certain LIMIT value contains more terms than in S1. However, this negatively impacts the quality of S2, as reflected by the values of J2. This shows that the graph summary is subject to errors [2], i.e., zero-result recommendations. Nonetheless, it is interesting to remark that in S1 the recommendations can lead the query to an “isolated” part of the graph, from which the way out is through the use of “OPTIONAL” clauses. In S2, the graph summary helps to reduce this effect. The response-times of the two systems are similar, with S2 being slightly faster than S1. This indicates that directly querying the endpoint for recommendations is feasible. However, the significant difference in size between the graph summary and the original graph would become increasingly predominant as the data grows.
2
http://usewod.org/
3
https://github.com/scampi/gosparqled/tree/master/eval/data
4
http://wiki.dbpedia.org/Downloads33
5
Virtuoso v7.1.0 at https://github.com/openlink/virtuoso-opensource
479
J1 J2 J1 J2 J1 J2 J1 J2 J1 J2 J1 J2 J1 J2 J1 J2 J1 J2
QC 2 3 4 5 6 9 10 1-1 1-2
10 0.12 0.12 0.17 0.21 0.15 0.21 0.16 0.19 0.14 0.16 0.17 0.19 0.16 0.19 0.11 0.09 0.19 0.18
100 0.15 0.17 0.28 0.26 0.27 0.28 0.28 0.29 0.24 0.26 0.25 0.26 0.25 0.27 0.12 0.11 0.24 0.22
500 0.24 0.27 0.34 0.29 0.34 0.30 0.36 0.35 0.38 0.31 0.42 0.27 0.43 0.26 0.15 0.18 0.29 0.29
QC 1-3 1-4 1-5 2-2 3-4 1-1-2 1-1-3 1-1-4
10 0.62 0.64 0.23 0.22 0.15 0.19 0.17 0.17 0.15 0.06 0.55 0.38 0.50 0.49 0.38 0.43
100 0.62 0.60 0.38 0.32 0.24 0.32 0.19 0.19 0.24 0.10 0.57 0.39 0.53 0.52 0.44 0.40
500 0.62 0.59 0.60 0.34 0.25 0.29 0.25 0.22 0.21 0.12 0.57 0.40 0.55 0.51 0.47 0.46
T1 T2 T1 T2 T1 T2 T1 T2 T1 T2 T1 T2 T1 T2 T1 T2 T1 T2
QC 2 3 4 5 6 9 10 1-1 1-2
10 107 81 119 82 127 81 129 82 144 85 314 197 688 468 97 79 103 80
100 108 81 180 84 147 95 202 86 173 88 311 198 701 458 122 84 140 83
500 141 91 192 96 144 79 172 99 149 101 337 207 701 467 127 89 133 111
QC 1-3 1-4 1-5 2-2 3-4 1-1-2 1-1-3 1-1-4
10 101 108 108 87 104 93 102 80 114 83 107 391 106 87 105 87
100 103 105 102 94 105 84 106 80 142 85 115 385 112 89 105 96
500 126 105 141 92 136 97 158 94 137 117 126 400 133 99 139 102
Table 1: Average Jaccard similarity (J1 for system S1 and J2 for S2) and response-
times in ms (T 1 for system S1 and T 2 for S2) for each group of query complexity QC,
and with the LIMIT varying from 10 to 500. The reported values are the averages
over 5 runs.
Acknowledgement
This material is based upon work supported by the European FP7 project LOD2 (257943).
Bibliography
[1] Ambrus, O., Möller, K., Handschuh, S.: Konduit VQB: a visual query builder for SPARQL
on the social semantic desktop
[2] Campinas, S., Delbru, R., Tummarello, G.: Efficiency and precision trade-offs in graph
summary algorithms. In: Proceedings of the 17th International Database Engineering
& Applications Symposium. pp. 38–47. IDEAS ’13, ACM, New York, NY, USA (2013)
[3] Campinas, S., Perry, T.E., Ceccarelli, D., Delbru, R., Tummarello, G.: Introducing rdf
graph summary with application to assisted sparql formulation. In: Proceedings of the
2012 23rd International Workshop on Database and Expert Systems Applications. pp.
261–266. DEXA ’12, IEEE Computer Society, Washington, DC, USA (2012)
[4] Clark, L.: SPARQL Views: A visual SPARQL query builder for Drupal. In: International
Semantic Web Conference (2010)
[5] Gombos, G., Kiss, A.: Sparql query writing with recommendations based on datasets.
In: Yamamoto, S. (ed.) Human Interface and the Management of Information. Infor-
mation and Knowledge Design and Evaluation, Lecture Notes in Computer Science,
vol. 8521, pp. 310–319. Springer International Publishing (2014)
[6] Kramer, K., Dividino, R.Q., Gröner, G.: Space: Sparql index for efficient autocom-
pletion. In: Blomqvist, E., Groza, T. (eds.) International Semantic Web Conference.
CEUR Workshop Proceedings, vol. 1035, pp. 157–160. CEUR-WS.org (2013)
[7] Lehmann, J., Bühmann, L.: Autosparql: Let users query your knowledge base. In:
Proceedings of the 8th Extended Semantic Web Conference on The Semantic Web:
Research and Applications - Volume Part I. pp. 63–79. ESWC’11, Springer-Verlag,
Berlin, Heidelberg (2011)
[8] Rietveld, L., Hoekstra, R.: Yasgui: Not just another sparql client. In: SALAD@ESWC.
pp. 1–9 (2013)
480