Linking Enterprise Data
François-Paul Servant
RENAULT
FR EQV NOV 3 31 - 13 av Paul Langevin
92359 Le Plessis Robinson Cedex FRANCE
+33 (1) 76 84 38 30
francois-paul.servant@renault.com


ABSTRACT
The “Linking Open Data” community initiative contributed a great deal to the concretization of the web of data, describing best practice, publishing large sets of RDF data on the web, and consequently giving birth to a new area of possibilities for innovative mashups using these data. Enterprises’ information systems too can be envisioned as a space of linked data. We describe herein how we used Linked Data principles in a work intended to foster adoption of semantic web technologies in our company.

Categories and Subject Descriptors
H.4.m [Information Systems]: Miscellaneous; D.2 [Software]: Software Engineering

General Terms
Experimentation

Keywords
RDF in the enterprise, RDF based web services, Linking Enterprise Data, Linked Data

1. INTRODUCTION
Integrating the disparate applications and data sources in a large corporation is expensive, and using semantic web technologies could dramatically cut down these costs: this is the “Business Model for the Semantic Web” [1]. The lowest layer of semantic web specifications, RDF, an open and mature standard, has some very appealing properties in a corporate context [4]. RDF is indeed a format built upon a simple, powerful, and well known data model that makes exchanging, aggregating and querying information easier.

However, despite their potential, adoption of semantic web technologies seems to remain rather slow in the enterprise world - at least this is our feeling about the situation at Renault. Not being advertised by solution providers, they are very often simply overlooked, or at best considered as “promising”, but not ready to be used right now. They may even suffer from a negative prejudice, being perceived as yet another technological hype, whose promises, such as the easy exchange of information, have already been heard many times before. Also, the current focus is on “Web Services”, and people do not know what semantic web technologies add to the picture: a data-oriented viewpoint that complements the application-oriented one provided by web services. In short, people need to be shown that RDF can make it easier to exchange and use the results computed by services.

But things change! After a slow start, the Semantic Web finally took off, and it is one merit of the “Linking Open Data” community project to have proven that it is now possible to build on the Semantic Web stack, to publish large RDF data sets, and to create applications using that data. In this respect, the description of Linked Data principles and the accompanying how-to were undoubtedly of great help [3].

Why not use the same strategy in the enterprise? A company’s Information Systems can be envisioned as a space of Linked Data. Convinced that Linked Data principles were good, also in a corporate context, we decided to try to put them into practice and foster adoption of semantic web technologies at Renault.

These principles provide effective solutions for two questions that Renault regards as priorities for its IS architecture: data repositories and services. Their implementation indeed yields an architecture of REST services that are easy to set up and to connect to. Our work should make that clear.

The cornerstone of Linked Data principles and of semantic web architecture - the identification of data by means of URIs - is, by itself, of great importance in the definition of information systems. Not only because a sharable way of naming things is needed to support exchange of information about those things: it is indeed our observation that a frequent cause of problems is the absence of proper identification of real world things. Those problems tend to surface when legacy systems need to be interconnected. In [6], we described such a situation, where concepts central to the domain were not formally identified, thus hindering efficient use of existing information resources, and increasing the costs of data reconciliation.

In what follows we describe what has been done in the work we undertook:
• the publishing of a data repository as Linked Data,
• the implementation of a simple RDF browser,
• and the prototyping of access to the published data from an external application.
We then discuss some noticeable points concerning Linked Data.
 Copyright is held by the author/owner(s).
 LDOW2008, April 22, 2008, Beijing, China.
2. PUBLISHING A DATA REPOSITORY AS LINKED DATA

2.1 Previous experiences
We had some previous experience with the publishing of RDF data.

Semanlink, the first of these experiences, is a free tagging tool developed by the author, where tags are SKOS-like concepts, identified by URIs, all dereferenceable [5].

The second was a prototype repository of repair and diagnostic operations, modeled with OWL, and developed to provide a probabilistic diagnostic tool with data [6].

In these experiences, however, implementing Linked Data principles had not been the central point of the work. Our concern in this new work was to emphasize the publishing and consuming of linked data. The objectives were to better explore the topic, and to highlight the benefits of the method in a corporate context. We also wanted to produce guidelines and sample code for Renault developers: if Semantic Web technologies are to be used in projects, we’ll need them trained in these techniques.

2.2 Chosen use case
As a use case we chose a repository created recently by the department in charge of after-sales repair documentation. Basically, it is a dictionary of the terms that documentation writers may use when describing repair methods. The first purpose of this repository is to enforce a homogeneous naming scheme of things throughout the whole documentation. These terms are translated into the many different languages the documentation is produced in, and they are classified in a SKOS-like hierarchy. Finally, the repository also contains a link to a data set produced by a different department, the department in charge of spare parts: a list of so-called “generic parts” is associated with each such term. A “generic part” is a part seen through its function in the car (for instance “engine”, “air conditioning compressor”, “right front wing”; several spare part references correspond to a “generic part”).

We chose this repository for several reasons. First, it’s simple enough, yet significant, and it is supposed to be well managed. Second, there is no way to access its data from another application: it is not available as a service, hence publishing this data as Linked Data corresponded to an actual need. Third, a dump as XML is available. It was therefore easy to produce RDF out of it. Finally, we could envision interesting uses of this data, that we would be able to demonstrate in the context of our work:

• access to the repository’s RDF data from another application;
• inclusion of RDFa data in repair methods generated as XHTML pages (repair methods are produced as small chunks of XML, and they contain references to the terms of the repository. Having RDF statements inside the page allows for interesting features using JavaScript).

2.3 Development environment
Java is the main language used for software development at Renault, and therefore the obvious choice for our work, all the more so given the high quality of Jena’s implementation of the semantic web stack.

2.4 First steps
2.4.1 Minting URIs for the items of the repository
Minting URIs for the items of the repository was not a big deal: as they already had an id, it was just a matter of choosing a namespace where we would be able to publish the data. The URI of an item of the repository is just the concatenation of that namespace, of the item’s class name (the name of the database table whose primary key is the item’s id) and of the item’s id.

Regarding the “generic parts”, which, as we mentioned earlier, do not pertain to this repository, they should be published by their owner (the parts department), under their own namespace. Of course, this is not yet the case and, to get started quickly, we minted URIs for these external items inside the same namespace. It will be possible to fix this later, for instance using owl:sameAs statements to link them to their true URIs when their natural owner makes them available as Linked Data.

2.4.2 Hash or slash URIs
Another decision concerns the type of URI to use. The best choice depends on the size of the repository. Hash URIs are simple, but they imply that the whole file gets downloaded when dereferencing one of them. It is therefore better in our case to use slash URIs, which can be dereferenced one by one.

2.4.3 Information and Non-Information Resources
Clearly, the items of the repository are “real-world things”, not “information resources”: to respect the web architecture, we must implement the “httpRange-14” resolution when dereferencing them.

Using Non-Information Resources (NIR) requires the creation of three different URIs: one for the resource itself, one for the RDF data describing the resource, and one for an HTML description. There are several ways that make sense for the naming of these URIs. We used [namespace]:[resid] for the NIR, and [namespace]:[resid].rdf and [namespace]:[resid].html for the two corresponding information resources.

2.4.4 Producing the RDF
An XML dump of the repository being available, it was easy to produce RDF out of it - a really simple RDF, by the way, without blank nodes, for instance.

2.5 The “LED” Servlet
URIs of the data set can be made dereferenceable by using a servlet, the “context path” of which matches the namespace chosen for the repository.

The servlet must of course have access to the RDF data of the repository. As the data set contains no more than 60,000 statements, handling it as a Jena memory model, loaded at startup, was not a problem.

2.6 Dereferencing URIs of NIRs
When an agent dereferences (GETs) the URI of an NIR, the servlet must respond with a 303 HTTP status code, and include a redirection to the URI of an information resource that best fits the preferences expressed in the “Accept” HTTP header of the request - that is, in our case, the URI of either the RDF or the HTML describing the NIR. A servlet has access to the HTTP headers of a request. It can therefore analyze its Accept header to decide what it should return.
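As an illustration of sections 2.4.1, 2.4.3 and 2.6, the URI scheme and the choice of the 303 redirect target can be sketched as below. This is a minimal sketch, not the code of the servlet described here: the class and method names are hypothetical, the "/" separator between class name and id is an assumption (the text only says the parts are concatenated), and so are the decisions to ignore wildcards and to fall back to HTML when RDF is not strictly preferred. It deliberately avoids the servlet API, so that only the decision logic is shown.

```java
import java.util.Locale;

// Hypothetical helper: URI naming plus the choice of the 303 redirect target.
final class NirDereferencing {

    // HTTP status code sent when an NIR is dereferenced.
    static final int SEE_OTHER = 303;

    static final String RDF_XML = "application/rdf+xml";
    static final String HTML = "text/html";

    // Concatenates namespace, class name and id; the "/" separator is an assumption.
    static String mintUri(String namespace, String className, String id) {
        return namespace + className + "/" + id;
    }

    // URIs of the two information resources describing an NIR.
    static String rdfUri(String nirUri)  { return nirUri + ".rdf"; }
    static String htmlUri(String nirUri) { return nirUri + ".html"; }

    // Returns the q-value given to an exactly matching media type in an Accept
    // header, or 0.0 when the type is not listed. Wildcard ranges are ignored.
    static double quality(String acceptHeader, String mediaType) {
        if (acceptHeader == null) {
            return 0.0;
        }
        double best = 0.0;
        for (String range : acceptHeader.split(",")) {
            String[] parts = range.trim().split(";");
            String type = parts[0].trim().toLowerCase(Locale.ROOT);
            if (!type.equals(mediaType)) {
                continue;
            }
            double q = 1.0; // default quality when no q parameter is given
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].trim();
                if (param.startsWith("q=")) {
                    try {
                        q = Double.parseDouble(param.substring(2));
                    } catch (NumberFormatException e) {
                        q = 0.0; // malformed q-value: treat as not acceptable
                    }
                }
            }
            best = Math.max(best, q);
        }
        return best;
    }

    // Value of the Location header of the 303 response: the RDF document only
    // when RDF is strictly preferred over HTML, the HTML document otherwise.
    static String redirectTarget(String nirUri, String acceptHeader) {
        return quality(acceptHeader, RDF_XML) > quality(acceptHeader, HTML)
                ? rdfUri(nirUri)
                : htmlUri(nirUri);
    }
}
```

With the made-up namespace `http://example.org/repo/`, `mintUri(ns, "Term", "42")` gives `http://example.org/repo/Term/42`; a request for that URI with `Accept: application/rdf+xml` would be redirected to the `.rdf` URI, while a typical browser header starting with `text/html` would be redirected to the `.html` URI. In a servlet, the returned string would go into the `Location` header alongside the 303 status.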
2.7 Answering requests for the RDF about an NIR
When the client requests the RDF data about an NIR (that is, when it dereferences the “.rdf” URI), answering is just a matter of extracting the statements of interest from the Jena Model, and of returning them, which is made easy by Jena serializers. As our repository is very simple, there is no question about the content to be returned: all statements of the form <nirURI,?,?> and <?,?,nirURI>. We just added the statements defining the links between the three resources nirURI, nirURI.rdf and nirURI.html.

We can think about including labels of linked resources. This can be a useful optimization when we know that the RDF returned will be displayed as HTML. We didn’t do it by default, only for resources linked by certain properties. (The question of the amount of information to be returned about one URI is briefly discussed later: in other cases, the behavior adopted here would lead to a uselessly huge number of statements.)

2.8 Answering requests for the HTML about an NIR
Generating HTML about a resource from its RDF description is of course one important topic. It can be done on the server, for instance using JSP. (That’s what we had done in our previous experiments with Linked Data publishing, such as Semanlink.) But it can also be done on the client, thanks to the JavaScript RDF parser made available with the Tabulator project [7].

2.9 Generating a page out of RDF data using JavaScript on the client
That’s the road we chose, because it allows for a nice architecture:

• it provides a clean separation of “Views” and “Model” (in MVC parlance), with easier reuse of GUI widgets,
• it decreases the load on the server,
• it gives the possibility to change the display on the client without sending a new request to the server,
• it allows RDF to be loaded incrementally.

The principle is simple: a request for the HTML about an NIR returns an HTML page which is almost an empty box (or a template), containing a call to a JavaScript function that takes the URI of the RDF data as argument. It is this JavaScript that downloads the RDF data, and displays it.

2.9.1 Parsing problems with Internet Explorer
We faced a difficulty with Tabulator’s JavaScript RDF parser, as it didn’t support Internet Explorer (neither 6 nor 7) at the time we undertook our work (this was Tabulator 0.8). We had to correct that, as most people use Explorer in our company, and it wouldn’t be acceptable to propose a solution that doesn’t work with it. The patch is available for download by interested people.

2.9.2 Generic display
In a first step, we didn’t do much more than displaying the RDF about the described resource in a very generic way, that is, listing values, property by property.

2.9.3 Tree-like GUI widget
One important part of the repository is the classification of the terms in a SKOS-like hierarchy. It is supposed to provide a way for a human user to easily navigate inside the corpus of data and help her find the term she’s looking for. As navigating a hierarchy of concepts to access data is so common, we developed a dedicated tree-like widget, that we took care to make reusable (Semanlink GUI for instance could use it, instead of having its trees created by the server. We’ll see another reuse example later).

2.10 A simple RDF browser
We had not mentioned it yet, but what we described until now supports only dereferencing URIs from our own data set, and we had limited our GUI based on this constraint. It is indeed standard JavaScript security to forbid connections to servers other than the one the page comes from. It is possible to bypass this constraint, as Tabulator does, using Firefox with one of its default settings changed. This was not an option in our case: we had to work with standard versions and configurations of browsers. Dereferencing URIs from other servers therefore meant implementing a usual trick, namely an HTTP proxy: requests to dereference URIs outside our domain must be sent to the servlet, which forwards them to the actual server, and then returns the result.

Implementing this trick is all it takes to transform the solution into a very simple, yet generic RDF browser: RDF can be downloaded from anywhere, and displayed using JavaScript.

2.11 Getting linked data from another application
An important aspect of Linked Data is the fact that it provides the implementation of a service to which applications can connect over HTTP to get data and use it as they see fit. It was important to demonstrate that this can actually be done without difficulty, at a low cost in terms of development for the client application. In order to provide sample code, we implemented two kinds of connections to the repository’s data: one in Java, using the Jena API to parse and use the RDF, the other in JavaScript.

For this demonstration, we used a completely unrelated tool that we had built some time ago for the parts department. It is a web application that computes and displays information about “generic parts”: the user enters the code of a “generic part”, and the application displays a list of corresponding spare part references.

The first feature we added to this application, in Java, is simply the possibility to list the spare part references corresponding to a term of our repository: the “part” servlet connects to the repository, dereferences the term, parses the returned RDF, extracts the corresponding list of “generic parts”, and for each, computes the list of spare part references. For the second feature, we reused our tree widget: the user can use it to navigate down the hierarchy of concepts of our linked data set, to choose either a “generic part” or a term, and to get the list of corresponding spare part references.

2.12 Conclusion concerning this prototype
We implemented in Java and JavaScript an example of a repository published as linked data. It is composed of a servlet that uses a Jena Model containing the RDF data, and that ensures the dereferencing of URIs, respecting the principles regarding Non-Information Resources. Basically, this servlet only produces RDF output. The display in HTML is done in JavaScript. This is enough to build a very simple but generic RDF browser, and to publish the repository’s data, providing the functionalities of a REST web service. We demonstrated how to connect to it from a program, and how to use its data.

We now plan to include a SPARQL endpoint, to complete our accompanying how-to.
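To make the servlet's core step concrete, the statement selection described in section 2.7 (all statements of the form <nirURI,?,?> and <?,?,nirURI>, plus the links between the three resources) can be sketched as follows. This is only an illustration under stated assumptions: it uses a plain list of string triples as a stand-in for the Jena Model, the names `DescriptionExtractor`, `Triple` and `describe` are hypothetical, and the choice of foaf:primaryTopic as the linking property is an assumption, since the text does not say which properties were used. With Jena, the same selection would presumably be done with two `Model.listStatements` calls.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the servlet's extraction step (no Jena dependency).
final class DescriptionExtractor {

    // Assumed linking property; the paper does not name the properties it used.
    static final String PRIMARY_TOPIC = "http://xmlns.com/foaf/0.1/primaryTopic";

    // A very simple triple of URIs (or literals) represented as strings.
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    // Returns every statement having nirUri as subject or as object, followed by
    // the statements linking the RDF and HTML documents to the NIR they describe.
    static List<Triple> describe(List<Triple> model, String nirUri) {
        List<Triple> result = new ArrayList<>();
        for (Triple t : model) {
            if (t.subject.equals(nirUri) || t.object.equals(nirUri)) {
                result.add(t);
            }
        }
        result.add(new Triple(nirUri + ".rdf", PRIMARY_TOPIC, nirUri));
        result.add(new Triple(nirUri + ".html", PRIMARY_TOPIC, nirUri));
        return result;
    }
}
```

For a small data set such as this one, a linear scan is enough; the point of the sketch is only that the description of a resource is defined by two triple patterns, which is also why the amount of data returned can become a problem for heavily linked resources.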
We think this is a fairly noticeable achievement, for a relatively simple development, which is almost completely reusable, and easily extensible. One obvious idea is to improve the RDF browser, implementing, for instance, templates that customize the display depending on the type of the resource that gets dereferenced.

The solution should be compared side by side with more traditional or advertised ways to proceed, such as SOAP web services. Let us just note some points. This approach respects the web architecture, and this has its benefits: caching, for instance, which is important for a repository like this one, whose data barely changes. Unlike WS-* services, this service only uses HTTP GET, and therefore benefits from the standard HTTP cache mechanism. The last point we would like to insist on is related to the use of RDF, which is a generic data model. We do not have to learn a special syntax to be able to use the service: we directly manipulate the data, not a specialized API. Furthermore, any chunk of RDF extracted from the repository could be transferred from application to application, possibly aggregated with more data, and still remain completely understandable with just a standard RDF parser. This reduces the cost of development in client applications (a question that seems to be sometimes overlooked when speaking about services).

This concludes what had to be said about this prototype, and we are now going to discuss some points about Linked Data, in no particular order.

3. RDF FORMS
Quoting [2]: "The Semantic Web [...] is about making links, so that a person or machine can explore the web of data." Besides hypertext (“href”) links, forms are an important feature of the web. How does this transpose to the web of data? Shouldn't there be a standardized way to "include forms" in RDF data? This would allow a server of RDF data to require some input, with a well defined meaning, from its clients (whether they are humans or machines).

We had a use case in the field of technical after-sales documentation. The repository of repair and diagnostic operations mentioned earlier can be used by humans (mechanics looking for information about how to repair a car) and by programs: for instance a program that computes estimates needs to get the list of parts that are necessary for a given repair on a given car, and the time needed to perform the repair.

Suppose for instance that you have to replace the engine of a car, and that you want to get information about how to do that. The repository contains a concept "ex:engine_removal". Now, for any given car, there is one and only one method to (correctly) take apart the engine. You are not interested in the information that you would get by dereferencing "ex:engine_removal", but rather in the subset that is relevant to your car. This subset depends on the characteristics of that car. It would not make sense to return all the information about all the cars. Not only because it would be a waste of bandwidth, but also because it would be difficult for the client to understand this information. (This information would indeed contain "conditional links", things such as: “if condition then statement”, where condition is a Boolean function of the characteristics of the car).

Typically, the service extracts from its underlying database the list of all documents matching "ex:engine_removal". Each record has a property "condition" (a Boolean function of the technical characteristics of a car which returns true when the document is relevant for a given car). To filter the list, the service has to evaluate these Boolean expressions. Let's say that the engine removal depends on the model and the engine type of the car. We could ask the user to enter those values by returning some RDF such as:

  <rdf:Description
    rdf:about="ex:engine_removal">
    <form:hasForm><form:Form>
      <form:param
        rdf:resource="ex:model"/>
      <form:param
        rdf:resource="ex:engine_type"/>
    </form:Form></form:hasForm>
  </rdf:Description>

If the client program knows the conventions of this "form vocabulary", it can understand that it has to provide values for model and engine_type. How it determines these values varies (if there is a human user, it can generate an HTML form). The point is that the client program knows (or can discover) the exact meaning of the parameters of the form. If it is able to provide the answer, it will then construct a URL including the values and dereference it to get the data it is looking for.

We chose this example, though it is a bit long, on purpose. The European Commission has issued a directive that requires automotive manufacturers to publish their technical documentation about repairs. A specification by an OASIS technical committee describes how this should be achieved, using RDF for metadata. The protocol for this situation could be improved using “RDF Forms”.

We described here a form with a GET method, but POST methods are of course interesting as well.

4. FINDING THE URIS OF “REAL WORLD THINGS”
This is probably the major difficulty with Linked Data on the web. The publishers of the large data sets of the LOD project made a significant effort to interlink their data, and they built a huge source of identities for real-world things. But how can a user discover how to say “Paris” or “Hamlet”?

4.1 The case of enterprise data
Let’s note that this is not really a problem in the corporate context. Companies indeed do have large and standardized vocabularies, that are shared throughout their whole organization, to name many of the things they manipulate in their operations. Of course, not everything is perfect: different departments sometimes give a different meaning to the same term, and this may be a cause of misunderstandings and problems. We believe that adopting URIs for identification would reduce the number of such cases: since URIs include the designation of their “owner”, they are an incentive to check the real meaning of a thing before deciding to use it.

4.2 On the web
The question is of prime importance. If we don’t have tools to help a user discover the URIs for things, she won’t be able to write statements using a shared vocabulary. The best she’ll be able to do is to write statements using her own vocabulary. She shouldn’t be left helpless, because this hurts the chances of the semantic web being widely adopted.
We think that the publishers of the massive linked data sets should      could the server do, when such a URI gets dereferenced, to avoid
try to provide tools to help connect to them, and find the URI of        waste (that is, to avoid returning all statements involving the
things.                                                                  URI), yet let the client know that more information is available
                                                                         and could be returned if needed? Some interesting suggestions
The problem is of course difficult, but pragmatic tools can be of        were made, based basically on the idea: “look up there to get
great help. If I type “Paris” in Wikipedia, I get to a page about the    triples involving that property”. These suggestions imply a new
capital of France, with a link to a disambiguating page.                 convention (that is, the definition of a few properties). We would
Implementing such a service, returning its results as a small chunk      appreciate an agreement of the Linked Data community on such a
of RDF, should not be a problem for the large LOD projects, and          convention.
this would be very useful.
More complex tools can be thought of. If a data set provides a           6. CONCLUSION
SPARQL endpoint, some interesting tools can be developed by              We are convinced that Linked Data principles are good for the
third parties.                                                           web as well as for companies’ information systems. In a work of
                                                                         evangelization about Semantic Web technologies inside our
4.3 Semanlink                                                            company, we published a repository as linked data, without
We will try to participate in this effort. As noted earlier,             difficulty. We showed how to connect to the resulting service from
Semanlink publishes data as RDF, following Linked Data                   an external application. The code written is largely reusable, and
principles. We hope to be able in the coming months                      expandable. And we will soon reuse it to publish new sets of
                                                                         linked data. For instance, we will shortly implement the
• to work on the linking of Semanlink to the LOD data sets,              publishing as RDF of data served by a SOAP web service.
                                                                         Increasing the number of use cases should convince about the
• to make results of the tag search available as RDF,
                                                                         flexibility of the method. We’ll continue our work in the field of
• to build tools that help discover whether a given concept is           after-sales technical documentation. This field looks like a typical
  actually used in a Semanlink data set,                                 use case for Semantic Web technologies, given the number of
                                                                         business objects that have to be shared among the many systems
• and to work on the interlinking of several Semanlink databases.        involved, in a corporation-wide process. But it is also a field that
  We hope that this will eventually provide a test bed for semi-         is slow to evolve, partly for the same reasons.
  automatic reconciliations of portions of independently
  developed vocabularies.                                                Concerning Linked Data on the web, we think that the community
                                                                         should work on defining a small vocabulary to better handle some
4.4 How to get the URI of an NIR when you                                problems that surface (such as the amount of data to return when
know the URI of the corresponding HTML ?                                 dereferencing a URI), or to provide new functionalities (such as
                                                                         “RDF Forms”).
Most of the time, what you see of an NIR is the HTML page that
gets displayed in your standard web browser when dereferencing           7. REFERENCES
its URI. How do you get back from this HTML to the URI of the
                                                                         [1] Berners-Lee, T., “Business Model for the Semantic Web",
NIR? Today, there is no easy way to get it. If Linked Data
                                                                             2001 http://www.w3.org/DesignIssues/Business
publishers follow the recommendations of [3], the URI of the
RDF data describing the NIR can be extracted from the HTML               [2] Berners-Lee, T. “Linked Data”, http://www.w3.org/
page, but not the URI of the NIR. If you’re interested in it (for            DesignIssues/LinkedData.html
instance because you want to write a statement involving this            [3] Bizer, C., Cyganiak, R., Heath, T., “How to Publish Linked
NIR), you have to download the RDF, parse it, and find the                   Data on the Web” http://www4.wiwiss.fu-berlin.de/bizer/
statement linking the NIR to the HTML. It would be easier to                 pub/LinkedDataTutorial/
have the statement describing the link between the HTML page
and the NIR directly in the HTML (probably as RDFa) We think             [4] Feigenbaum, L., “Semantic Web Technologies in the
that suggesting data publishers to do so would be a nice addition            Enterprise”, 2006 http://www.thefigtrees.net/lee/blog/
to the “How to Publish Linked Data on the Web” document [3].                 2006/11/semantic_web_technologies_in_t.html
                                                                         [5] Semanlink http://www.semanlink.net
5. AMOUNT OF DATA TO RETURN WHEN                                         [6] Servant, FP "Semantic Web Technologies in Technical
A URI GETS DEREGERENCED                                                      Automotive Documentation" - CEUR-WS.org/Vol-258 -
This is a known issue with Linked Data, with several aspects. In             OWL: Experiences and Directions 2007 http://
particular, it is not always practical, or wise, to blindly return all       ftp.informatik.rwth-aachen.de/Publications/CEUR-WS/
triples containing the URI that is dereferenced. This information            Vol-258/paper04.pdf
can be huge, or computing the statements involving some special          [7] “Tabulator: Generic Data Browser” http://www.w3.org/2005/
properties may be a heavy task. So, returning all the statements             ajar/tab
that a server is able to provide can be a waste of bandwidth, or of
server resources, if the client is not interested with them. What
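
As an illustration of the suggestion made in Section 4.4 (stating the link between the HTML page and the NIR directly in the HTML), here is a minimal sketch of what a client could then do. It rests on assumptions that are not part of the paper: the publisher is assumed to express the link with the property foaf:primaryTopic, written as an RDFa-style rel attribute; the names find_nir_uri and NirLinkFinder and the example.org URIs are hypothetical. It is not a full RDFa processor (prefix declarations are not resolved); it only shows that the NIR's URI can be read from the page without downloading and parsing the RDF.

```python
from html.parser import HTMLParser
from typing import Optional

# Assumption (not prescribed by the paper): the page-to-NIR link is stated
# with foaf:primaryTopic, written as a prefixed name in a rel attribute.
ASSUMED_PROPERTY = "foaf:primaryTopic"


class NirLinkFinder(HTMLParser):
    """Remembers the target of the first element whose rel lists ASSUMED_PROPERTY."""

    def __init__(self) -> None:
        super().__init__()
        self.nir_uri: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if self.nir_uri is not None:
            return  # keep the first match only
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").split()
        if ASSUMED_PROPERTY in rel_values:
            # In RDFa the object may be given by @resource or by @href.
            self.nir_uri = attributes.get("resource") or attributes.get("href")


def find_nir_uri(html_text: str) -> Optional[str]:
    """Return the URI of the NIR described by the page, or None if not stated."""
    finder = NirLinkFinder()
    finder.feed(html_text)
    return finder.nir_uri


# Hypothetical page: the first link points to the RDF document,
# the second one states which NIR the page is about.
EXAMPLE_PAGE = """
<html xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <head>
    <title>Paris</title>
    <link rel="alternate" type="application/rdf+xml"
          href="http://example.org/data/paris.rdf" />
    <link rel="foaf:primaryTopic" href="http://example.org/id/paris" />
  </head>
  <body><h1>Paris</h1></body>
</html>
"""

if __name__ == "__main__":
    print(find_nir_uri(EXAMPLE_PAGE))
```

With only the first link element present, as in a page that follows [3] today, the function returns None, and the client has to fall back on fetching and parsing the RDF document.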