=Paper=
{{Paper
|id=Vol-369/paper-7
|storemode=property
|title=Building Linked Data For Both Humans and Machines
|pdfUrl=https://ceur-ws.org/Vol-369/paper06.pdf
|volume=Vol-369
|dblpUrl=https://dblp.org/rec/conf/www/HalbRH08
}}
==Building Linked Data For Both Humans and Machines==
Building Linked Data For Both Humans and Machines∗
Wolfgang Halb†, Institute of Information Systems & Information Management, Graz, Austria

Yves Raimond‡, Centre for Digital Music, London, UK

Michael Hausenblas§, Institute of Information Systems & Information Management, Graz, Austria
ABSTRACT

In this paper we describe our experience with building the riese dataset, an interlinked, RDF-based version of the Eurostat data, containing statistical data about the European Union. The riese dataset (http://riese.joanneum.at) aims at serving roughly 3 billion RDF triples, along with millions of high-quality interlinks. Our contribution is twofold: Firstly, we suggest using RDFa as the main deployment mechanism, hence enabling both humans and machines to explore and use the dataset effectively and efficiently. Secondly, we introduce a new way of enriching the dataset with high-quality links: User Contributed Interlinking, a Wiki-style way of adding semantic links to data pages.

Categories and Subject Descriptors

H.4 [Information Systems Applications]: Miscellaneous

Keywords

Linked data, Semantic Web, XHTML+RDFa, User Contributed Interlinking

1. MOTIVATION

The goal of the “RDFising and Interlinking the Eurostat Data Set Effort” (riese)¹ is to offer a Semantic Web version of the publicly accessible data provided by the Eurostat data source. riese has been initiated as part of the W3C SWEO Linking Open Data (LOD) project and aims at being useful for both humans and machines.

Existing linked datasets such as [3] are slanted towards machines as the consumer. Although there are exceptions to this machine-first approach (cf. [13]), we strongly believe that satisfying both humans and machines from a single source is a necessary path to follow.

We subscribe to the view that every LOD dataset can be understood as a Semantic Web application. Every Semantic Web application in turn is a Web application in the sense that it should support a certain task for a human user. Without a state-of-the-art Web user interface, potential end-users are scared away. Hence a Semantic Web application needs an appealing presentation as well.

Further, the interlinking algorithms found in current LOD datasets are largely based on templates. This means that a huge number of interlinks can be generated; however, the quality of these links in terms of their respective 'semantic strength' is somewhat limited. It is well known that humans are good at associations, so we propose in the following to let humans do the hard part of the interlinking.

The paper is structured as follows: Section 2 discusses related efforts, then in section 3 we introduce the Eurostat dataset and state the requirements. In section 4 we describe riese, and in section 5 we discuss the currently implemented version of the system. Finally, in section 6, we conclude on the current results and outline future steps.

∗ Copyright is held by the author/owner(s). LDOW2008, April 22, 2008, Beijing, China.

† JOANNEUM RESEARCH Forschungsges. mbH, email: wolfgang.halb@joanneum.at

‡ Queen Mary, University of London, email: yves.raimond@elec.qmul.ac.uk

§ JOANNEUM RESEARCH Forschungsges. mbH, email: michael.hausenblas@joanneum.at

¹ http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData/EuroStat

2. RELATED WORK

Statistical data on the (Semantic) Web. Looking at related work reveals that there is actual demand for new solutions to disseminate statistical data using semantic technologies. As reported by Assini [2], the European Union funded a research and development project called NESSTAR in 1998, with the aim of bringing the advantages of the Web to the world of statistical data dissemination. Another project that is entirely situated on the Semantic Web is the U.S. Census data [20], where 1 billion RDF triples containing statistical information about the United States were published in 2007. An earlier attempt to publish Eurostat data is known from the FU Berlin², using a very small subset of country and region statistics. Stuckenschmidt [19] has reported on translating and modelling the European fishery statistics in ontologies. Grossenbacher [10] recently pointed out issues with translating the Swiss statistics to an RDF basis. A somewhat related approach is Rswub³, a package for handling statistical data, based on RDF and capable of handling ontologies.

RDFa. As RDFa [1] is turning into a W3C Last Call document at the time of writing of this paper, its adoption is expected to increase dramatically in the coming months. Although it is not yet a standard, there exist a number of smaller-sized deployed datasets, such as those listed at http://rdfa.info/rdfa-in-the-wild/. It has been reported that, for example, Joost plans to offer RDFa-enriched content⁴, and we have recently proposed to use RDFa as a base for multimedia metadata deployment [11]. However, to the best of our knowledge there exists no other linked dataset deployed in RDFa.

² http://www4.wiwiss.fu-berlin.de/eurostat/

³ http://www.biostat.harvard.edu/~carey/hbsfin.html

⁴ http://rdfa.info/2007/08/23/joost-using-rdfa-on-website/

3. REQUIREMENTS AND ISSUES

3.1 The Eurostat data

This section provides a short description of the Eurostat data, which served as the primary input for the riese dataset.

Eurostat provides detailed statistics for the entire European Union as well as additional statistics for major non-European countries. The Eurostat data is arranged along the following themes:

• General and regional statistics
• Economy and finance
• Population and social conditions
• Industry, trade and services
• Agriculture and fisheries
• External trade
• Transport
• Environment and energy
• Science and technology

Three main data sources are provided by Eurostat for public download⁵, namely (i) the statistical data itself, (ii) a table of content, and (iii) dictionaries.

The statistical data is provided as a dump download of approximately 4,000 single tab-separated values (TSV) documents, having a total size of approximately 5 GByte and containing some 350 million data values. This data is updated twice a day. Only limited semantically exploitable information is contained in these TSV documents, hence it is necessary to use the other available information sources.

A table of content (TOC) provides a hierarchical overview of the datasets, organised in so-called themes, which makes it possible to identify the structure and content of a dataset.

Dictionaries are especially valuable as they contain all information for resolving the nearly 100,000 data codes used in the statistical data. These data codes refer to dimensions such as time, location, currency, etc. The data codes also contain an implicit hierarchy, which can be used for further classification. However, various schemas have been used, requiring individual processing for extracting classifying features. For example, in order to refer to locations, the Nomenclature of Territorial Units for Statistics (NUTS)⁶ is in use. This allows information about the structure of the administrative divisions of countries to be extracted. For each of the dictionaries a different terminology is used.

Most of the data is represented in time series with varying granularity, ranging from annual to daily data. Each single data item can be identified using the corresponding dataset and various dimensions, as the following example illustrates: The population of the European Union can be seen as one single data item valued at 497,198,740 (contained in the dataset 'Total population'), having as time-dimension the year 2008, as indicator-dimension 'Population on 1. January', and as geo-dimension the 'European Union (27 countries)'. Additionally the data is flagged as provisional and as a Eurostat estimate.

⁵ http://europa.eu/estatref/download/everybody/

⁶ http://ec.europa.eu/comm/eurostat/ramon/nuts/

3.2 Requirements

In a first phase we have analysed the Eurostat data. We have identified the implicit semantics present in the TOC and the dictionaries and gathered a number of issues. Firstly, the Eurostat data set is highly heterogeneous; the formats of the data sources vary (TSV, HTML) and are not machine-processable per se. Another issue is the modelling of temporal data, more specifically how to represent time intervals. Further, the schemas in the dictionaries form a multidimensional space that has to be linearised in order to be represented in a URI format. We have also identified data provenance (and trust) issues, which are currently only handled on a global level.

Based on the analyses given above we state the following requirements for a linked dataset that is designed to serve both humans and machines:

• The system must serve both humans and machines in an adequate way by applying the don't-repeat-yourself (DRY)⁷ principle;
• To allow both humans and machines to discover more information, the follow-your-nose⁸ principle must be applied;
• To be a useful (real-world) Semantic Web application, the system must be able to scale to the size of the Web.

Additionally we want to point out that we aim at providing high-quality interlinking. Hence, the purely template-driven generation of global interlinks is certainly not sufficient.

⁷ http://skimstone.x-port.net/node/272

⁸ http://www.inkdroid.org/journal/2008/01/04/following-your-nose-to-the-web-of-data/
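The linearisation issue raised in Section 3.2 can be illustrated with a minimal sketch. It is not the scheme used in riese: the namespace http://example.org/data/, the function name item_uri and the rule of ordering codes by dimension name are assumptions made purely for illustration. The sketch flattens a dataset code and its dimension codes into one URI, so that the same data item always maps to the same identifier.

```python
from urllib.parse import quote

# Hypothetical namespace, chosen for illustration only.
DATA_NS = "http://example.org/data/"


def item_uri(dataset, dimensions):
    """Linearise a dataset code and its dimension codes into one URI."""
    # Order the codes by dimension name so that the result does not
    # depend on the order in which the dimensions were supplied.
    codes = [code for _, code in sorted(dimensions.items())]
    parts = [dataset] + codes
    # Lower-case and percent-encode every part, then join with "_".
    local_name = "_".join(quote(part.lower(), safe="") for part in parts)
    return DATA_NS + local_name


print(item_uri("eb040", {"time": "2006", "geo": "AT"}))
```

Sorting by dimension name is one possible design choice; any fixed ordering would serve, as long as it is applied consistently to every item.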
4. LINKED DATA FOR HUMANS AND MACHINES

In order to demonstrate how to address the issues raised earlier in this paper, we have implemented the riese dataset (http://riese.joanneum.at) as a Semantic Web application. This section describes how the mapping from the available, relational data into RDF form has been done, explains the interlinking mechanisms applied, and finally introduces the riese system architecture.

4.1 Data, Schemas and Mapping

This section explains the schemas utilised in riese and discusses the mapping to RDF.

The data used in riese is a snapshot of the data available for bulk download, taken on 9 Jan 2008. Depending on the type of data, three formats are used by Eurostat: HTML or plain text for the TOC, and TSV for the dictionary files and the actual data tables.

In Fig. 1 the riese core schema is depicted. Currently the riese core schema is modelled using RDF-Schema [4] rather than OWL [14] and comprises three main classes: riese:Dataset, riese:Item and riese:Dimension. A dataset is the logical container of either further sub-datasets (related via skos:narrower) or data items. An item represents one single data value (like 497,198,740 for the population of the European Union) with all accompanying metadata about the containing dataset and the dimensions used. A dimension semantically describes the value of a data item in terms of, e.g., time, location, unit, etc. In listing 1 an exemplary snippet of an item is shown.

    data:eb040_infl_2006_at a :Item ;
        dc:title "Inflation rate Austria 2006" ;
        rdf:value "1.7" ;
        :dimension dim:geo_at ;
        :dimension dim:time_2006 ;
        :dataset data:eb040 .

Listing 1: A single data item.

Additionally, the following schemas are used or have been extended:

• Dublin Core (DC) Elements [7] and Terms [6]
• Geonames [9]
• Simple Knowledge Organisation Systems (SKOS) [18]
• Description of a Project (DOAP) [8]
• the event ontology [16]

We decided to model a flat schema for the following reasons:

• Queries can be constructed with very little a-priori knowledge about the structure of the dataset;
• Additional Eurostat datasets can easily be added without changing the schema (and are instantaneously integrated in the hierarchy, hence available to all users regardless of the access method);
• Dimensions can be added without any changes to the schema;
• Finally, it is possible to formulate very flexible queries.

Other approaches, such as the U.S. Census data [20], use a more complex schema, where for example a new property is introduced for every possible description. This yields properties such as population15YearsAndOverWithIncomeIn1999, which do not offer any additional semantic information.

Querying data using these properties can get very cumbersome, as the user would have to know the exact terms beforehand. We believe that our flat approach, where every value can be identified by the corresponding dataset and dimensions, enables fairly flexible queries.

    SELECT *
    WHERE {
      ?item riese:dimension dim:geo_at .
      ?item riese:dataset ?dataset .
      ?dataset dc:title ?ds_title
      FILTER regex(?ds_title, "food", "i")
    }

Listing 2: A query in riese.

The example in listing 2 demonstrates this. All items for Austria are returned that belong to a dataset with 'food' in the description⁹.

⁹ with default namespace http://riese.joanneum.at/schema/core#

4.2 Interlinking

Apart from mapping the Eurostat data into RDF, it is equally important to apply the follow-your-nose principle, hence creating interlinks to other datasets. For creating interlinks in riese we have used the following approach:

1. Restrict the source dataset to possible candidates for interlinking to the target dataset;
2. For each qualifying item in the source dataset look up the label or another identifying feature in the target dataset;
3. Restrict the results by appropriate classifications or identifiers;
4. Create the interlink.

For example, the interlinking between country descriptions in riese and Geonames is done using the ISO-3166 alpha2 country codes (AT) instead of the label (Austria), ensuring that exactly the same resource is addressed in both datasets.
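The four interlinking steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the riese implementation: the target dataset is replaced by an in-memory dictionary keyed by ISO-3166 alpha2 code, all URIs are hypothetical, and the classification labels ("country", "region") are placeholders rather than real Geonames feature codes.

```python
def interlink(source_items, target_index, wanted_class):
    """Return owl:sameAs triples for unambiguous matches."""
    triples = []
    for item in source_items:
        # Step 1: restrict the source to interlinking candidates,
        # here geographical items that carry an ISO code.
        if item["kind"] != "geo" or "iso" not in item:
            continue
        # Step 2: look up the identifying feature in the target.
        candidates = target_index.get(item["iso"], [])
        # Step 3: restrict the results by classification.
        matches = [c for c in candidates if c["class"] == wanted_class]
        # Step 4: create the interlink only if exactly one match remains.
        if len(matches) == 1:
            triples.append((item["uri"], "owl:sameAs", matches[0]["uri"]))
    return triples


# Hypothetical sample data for illustration.
source = [
    {"uri": "http://example.org/riese/geo_at", "kind": "geo", "iso": "AT"},
    {"uri": "http://example.org/riese/time_2006", "kind": "time"},
]
target = {
    "AT": [
        {"uri": "http://example.org/target/austria", "class": "country"},
        {"uri": "http://example.org/target/at-region", "class": "region"},
    ]
}

links = interlink(source, target, "country")
print(links)
```

Requiring exactly one remaining match is a conservative choice made in this sketch: ambiguous lookups produce no link at all instead of a possibly wrong owl:sameAs statement.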
Figure 1: The core schema.

Note that ISO-3166 codes have already been used for identifying country descriptions in different datasets by Voss [21] and others.

In the practical implementation this means that first of all the source dataset is restricted to geographical features only. According to the nomenclature used, it is also possible to identify country descriptions in the source dataset. Then the Geonames search Web service (i.e. the target dataset) is queried using the standardised codes. The result from the target dataset is then further restricted to return only countries, i.e. entries having a specified Geonames feature code (A.ADM1). Finally all matches are interlinked by inserting a new triple into the source dataset which relates the resources using owl:sameAs. In this case it is possible to create exactly matching, high-quality interlinks.

Further candidates for interlinking the Eurostat data are Geonames (more geographical features), DBpedia, CIA Factbook and Wikicompany. By introducing these interlinks, users of riese will benefit not only from a larger interlinked dataspace but, especially for the geographic features, also from being able to formulate even more flexible and powerful queries.

As already mentioned above, the pure pattern-based approach is not believed to be sufficient for high-quality interlinks. This is why we additionally allow users to add their own links, a new feature called 'User Contributed Interlinking' (UCI). The idea behind it is to apply the WikiWiki approach to LOD: Users can add semantic links to other datasets on their own. Currently three different types are supported: rdfs:seeAlso, owl:sameAs and foaf:topic (cf. [5]).

4.3 System Architecture

Based on the lessons learned from [12] we have developed the riese Web application. It comprises:

1. An (offline) module responsible for converting the Eurostat data into an RDF representation and creating the global, pattern-based interlinks (RDFising & Interlinking), and
2. A Web server including a scripting environment that fills predefined templates with the values from the (static) RDF/XML representation in order to generate an RDFa representation of the themes and the data tables.

Fig. 2 depicts the riese system architecture and also shows the interfaces with the environment (in and out ports).

The riese Web application supports the following tasks:
• Human users: Users can navigate the dataset provided in XHTML+RDFa;
• Semantic Web agents:
  – single item query: XHTML+RDFa per page allows the exploration of the dataset and the query of a single data item (FYN);
  – global query: to allow an efficient query of the entire dataset, a SPARQL endpoint is provided;
  – indexer: to allow semantic search engines (indexers) to process the data effectively, the entire dataset is offered as a dump, together with a corresponding description using the semantic crawler sitemap extension protocol¹⁰.

Figure 2: The system architecture of riese.

For creating the RDF representation from the original Eurostat files, SWI-Prolog scripts are used. The SWI-Prolog Semantic Web Library provides an infrastructure for reading, querying and storing Semantic Web documents. Additionally the Prolog-2-RDF (p2r) modules¹¹ and individually defined mappings are used for translating the input data to RDF.

The resulting RDF can be accessed via a SPARQL endpoint and it is further possible to consume a dump of the entire data. We have created one large dump containing all triples, and also store the triples according to their URI directly in the file system as RDF/XML.

The latter approach is currently used for 'Rendering & Serving', where the PHP scripts look up the files in the file system and render an RDFa representation. Besides the data that originates from Eurostat (the official statistical data), the UCI module stores the user-contributed triples in a separate document. This physical separation mainly serves to make it possible to replace parts of the data without too much additional effort.

¹⁰ http://sw.deri.org/2007/07/sitemapextension/

¹¹ http://moustaki.org/p2r/

5. USING RIESE

In the following we show how riese can satisfy both the human user and the machine (Semantic Web agents). Please note that the alpha version of the riese system is available at http://riese.joanneum.at/.

Both human and machine users would presumably start at the top-level page in order to get an overview of the available data. In Fig. 3 the hierarchical rendering of a selected Eurostat theme (the 'Economy' theme) is depicted. A machine accessing the same page would have another view, namely one focusing on the embedded RDF, as shown by way of example in listing 3.

Note that although both humans and machines access the same resource, different parts are relevant. This is made possible through the deployment in XHTML+RDFa. The browser will render a nice GUI, the machine gets what it deserves: triples.

Figure 3: The Eurostat theme 'Economy' viewed by a human user.

    <body about="http://riese.joanneum.at/data/economy"
          instanceof="riese:Dataset">
      ...
      <div id="main-ind">
        ...
        Balance of payments - International transactions
      </div>

Listing 3: The Eurostat theme 'Economy' viewed by a machine.

Further, a single table may be explored; this is depicted in Fig. 4.

Figure 4: A single data table in XHTML+RDFa.

However, up to this point the user has been passively consuming the information. But riese offers more: Users can provide their own links using the UCI (cf. Fig. 5).

Figure 5: The UCI module: users can provide their own links.

The UCI module enables the user to add (and remove, for that matter) additional links to a certain data page. As the user must specify the type (cf. the drop-down box in Fig. 5), it is ensured that only valid triples are introduced to the system: the subject of the RDF statement is always the page on which the 'Related' box appears; the predicate is determined through the type selection. The object (named target in our context) is the only variable we are not able to control. However, we rely on the community effect, i.e. we expect that 'wrong' links will be removed. A REST-based interface for adding UCI triples automatically is available as well. Regarding the acceptance of the UCI, i.e. enabling users to contribute semantically typed links, we refer to the success story of Wikipedia [17] and strive for considerable community involvement. In riese, we therefore try to implement many of the success factors of Wikipedia, such as openness or ease of editing. However, UCI may need to be applied to other datasets with more 'appealing' data than statistical data in order to properly evaluate its uptake.

6. CONCLUSION

In this paper we have presented the riese dataset containing statistical data from Eurostat. We have shown how to RDFise and interlink this data, hence making it possible to expose it on the Semantic Web. The benefits of supplying data for both humans and machines have been explained and a WikiWiki approach for adding user-contributed interlinks has been introduced.

We have also identified some issues and bottlenecks when deploying datasets of such enormous size. Generating a static file structure with small RDF files requires quite a lot of time. This is due to our current way of storing the data items in the file system. Because several hundred million folders and files have to be created in riese, the bottleneck is fairly obvious. Moreover, accessing datasets (tables) containing thousands of items (cells) in individual files yields thousands of file access operations simply for parsing them. Regarding the file system we came across another limitation: reserved names on the MS Windows operating systems (as it turned out, it is not possible to create files or folders named 'con', 'aux', etc. [15]).

When modelling the representation of time related to a certain statistical information we encountered some challenges, as the raw data from Eurostat is sometimes ambiguous and the ambiguity can only be resolved by analysing the corresponding document. For example, the statement time\2007 can stand for the value over a period of time (e.g. the entire year) or at the end of the reporting period (e.g. 31 Dec). In our future work we will focus on resolving these issues.

The future work roughly comprises a thorough analysis of the current bottlenecks, as well as gathering feedback from end-users of the system. We are planning to use a solution based on a triple store (such as SESAME or Virtuoso), allowing us to generate triples at a faster pace; currently it would take us several weeks to RDFise the entire Eurostat data set. Using a dedicated store will likely also improve the performance of serving the data to both human and machine users.

Finally, as Eurostat updates their data twice a day, we aim at updating the data on riese continuously. One of the issues to be solved in this respect is how to deprecate the data when updating the items. From a UI point of view we also want to address navigational issues (using maps and timelines¹²) to further enhance the user experience.

¹² http://simile.mit.edu/timeline/
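The file-system limitation mentioned in the conclusion can be worked around when deriving file paths from URIs. The following is a minimal sketch under stated assumptions, not the code used in riese: the function names safe_segment and uri_to_path and the choice of prefixing an underscore are hypothetical. The set of reserved device names (CON, PRN, AUX, NUL, COM1 to COM9, LPT1 to LPT9) follows the Microsoft documentation cited as [15].

```python
import re

# Device names reserved by Windows, with or without a file extension.
RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def safe_segment(segment):
    """Return a path segment that is a legal Windows file or folder name."""
    # Replace characters that Windows does not allow in names.
    cleaned = re.sub(r'[<>:"/\\|?*]', "_", segment)
    # The part before the first dot decides whether a name is reserved.
    stem = cleaned.split(".", 1)[0]
    if stem.upper() in RESERVED:
        cleaned = "_" + cleaned
    return cleaned


def uri_to_path(local_name):
    """Map the local part of a URI to a relative path, segment by segment."""
    return "/".join(safe_segment(s) for s in local_name.split("/"))


print(uri_to_path("data/con/aux.rdf"))
```

A server resolving URIs to files would have to apply the same mapping on lookup, otherwise the renamed files could not be found again.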
7. ACKNOWLEDGMENTS

The research leading to this paper was carried out in the “Understanding Advertising” (UAd) project¹³, funded by the Austrian FIT-IT Programme, and was partially supported by the European Commission under contract FP6-027026-K-SPACE.

The authors would like to thank the Linking Open Data community and the RDFa folks. Additionally we would like to credit all the people that made available the following magnificent technologies: SWI-Prolog, Apache, PHP, RAP - Rdf API and YUI. We wish to thank Danny Ayers for his early comments on modelling issues, Giovanni Tummarello for feeding sindice (allowing advanced queries), and Jan Wielemaker for his superb SWI-Prolog support.

¹³ http://www.sembase.at/index.php/UAd

8. REFERENCES

[1] B. Adida, M. Birbeck, S. McCarron, and S. Pemberton. RDFa in XHTML: Syntax and Processing. W3C Working Draft 18 October 2007, W3C Semantic Web Deployment Working Group, 2007.

[2] P. Assini. NESSTAR: A Semantic Web Application for Statistical Data and Metadata. In International Workshop Real World RDF and Semantic Web Applications, 11th International World Wide Web Conference (WWW2002), 2002.

[3] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. G. Ives. DBpedia: A Nucleus for a Web of Open Data. In The Semantic Web, 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, pages 722–735, 2007.

[4] D. Brickley and R. Guha. RDF Vocabulary Description Language 1.0: RDF Schema. W3C Recommendation, RDF Core Working Group, 2004.

[5] D. Brickley and L. Miller. FOAF Vocabulary Specification. http://xmlns.com/foaf/0.1/, 2004.

[6] Dublin Core Metadata Initiative. DCMI Metadata Terms. http://dublincore.org/documents/dcmi-terms/, 2008.

[7] Dublin Core Metadata Initiative. Dublin Core Metadata Element Set, Version 1.1. http://dublincore.org/documents/dces/, 2008.

[8] E. Dumbill. Description of a Project (DOAP) vocabulary. http://usefulinc.com/ns/doap, 2005.

[9] Geonames. Geonames Ontology. http://www.geonames.org/ontology/, 2007.

[10] A. Grossenbacher. Semantic Web: Basics, RDF, DC and the description of a statistical site. http://tinyurl.com/2d5gta, 2007.

[11] M. Hausenblas, W. Bailer, and H. Mayer. Deploying Multimedia Metadata in Cultural Heritage on the Semantic Web. In First International Workshop on Cultural Heritage on the Semantic Web, collocated with the 6th International Semantic Web Conference (ISWC07), Busan, South Korea, 2007.

[12] M. Hausenblas, W. Slany, and D. Ayers. A Performance and Scalability Metric for Virtual RDF Graphs. In 3rd Workshop on Scripting for the Semantic Web (SFSW07), Innsbruck, Austria, 2007.

[13] T. Heath and E. Motta. Revyu.com: a Reviewing and Rating Site for the Web of Data. In The Semantic Web, 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, pages 895–902, 2007.

[14] D. L. McGuinness and F. van Harmelen. OWL Web Ontology Language Overview. W3C Recommendation, OWL Working Group, 2004.

[15] Microsoft. Naming a File. http://msdn2.microsoft.com/en-us/library/aa365247.aspx, 2008.

[16] Y. Raimond and S. Abdallah. The Event Ontology. http://motools.sourceforge.net/event/event.html, 2007.

[17] L. Sanger. The Early History of Nupedia and Wikipedia: A Memoir. In C. DiBona, M. Stone, and D. Cooper, editors, Open Sources 2.0: The Continuing Evolution. O’Reilly, 2005.

[18] Semantic Web Deployment Working Group. SKOS Simple Knowledge Organization System Reference. http://www.w3.org/TR/swbp-skos-core-spec/, 2008.

[19] H. Stuckenschmidt and F. van Harmelen. Information Sharing on the Semantic Web. Springer, 2005.

[20] J. Tauberer. The 2000 U.S. Census: 1 Billion RDF Triples. http://www.rdfabout.com/demo/census/, 2007.

[21] J. Voss. Encoding changing country codes in RDF with ISO 3166 and SKOS. In International Conference on Metadata and Semantics Research (MTSR07), 2007.