Linking Enterprise Data

François-Paul Servant
RENAULT
FR EQV NOV 3 31 - 13 av Paul Langevin
92359 Le Plessis Robinson Cedex
FRANCE
+33 (1) 76 84 38 30
francois-paul.servant@renault.com

Copyright is held by the author/owner(s). LDOW2008, April 22, 2008, Beijing, China.

ABSTRACT
The “Linking Open Data” community initiative contributed a great deal to the concretization of the web of data, describing best practice, publishing large sets of RDF data on the web, and consequently giving birth to a new area of possibilities for innovative mashups using these data. Enterprises’ information systems too can be envisioned as a space of linked data. We describe herein how we used Linked Data principles in a work intended to foster adoption of semantic web technologies in our company.

Categories and Subject Descriptors
H.4.m [Information Systems]: Miscellaneous; D.2 [Software]: Software Engineering

General Terms
Experimentation

Keywords
RDF in the enterprise, RDF based web services, Linking Enterprise Data, Linked Data

1. INTRODUCTION
Integrating the disparate applications and data sources in a large corporation is expensive, and using semantic web technologies could dramatically cut down these costs: this is the “Business Model for the Semantic Web” [1]. The lowest layer of semantic web specifications, RDF - an open and mature standard - has some very appealing properties in a corporate context [4]. RDF is indeed a format built upon a simple, powerful, and well known data model that makes exchanging, aggregating and querying information easier.

However, despite their potential, adoption of semantic web technologies seems to remain rather slow in the enterprise world - at least this is our feeling about the situation at Renault. Not being advertised by solution providers, they are very often simply overlooked, or at best considered as “promising”, but not ready to be used right now. They may even suffer from a negative prejudice, being perceived as yet another technological hype, whose promises, such as the easy exchange of information, have already been heard many times before. Also, the current focus is on “Web Services”, and people do not know what semantic web technologies add to the picture - a data-oriented viewpoint that complements the application-oriented one provided by web services. To put it shortly, people need to be told that RDF can make it easier to exchange and use the results computed by services.

But things change! After a slow start, the Semantic Web finally took off, and it is one merit of the “Linking Open Data” community project to have proven that it is possible right now to build on the Semantic Web stack, to publish large RDF data sets, and to create applications using that data. In this respect, the description of Linked Data principles and the accompanying how-to undoubtedly were of big help [3].

Why not use the same strategy in the enterprise? A company’s Information Systems can be envisioned as a space of Linked Data. Convinced that Linked Data principles were good, also in a corporate context, we decided to try to put them into practice and foster adoption of semantic web technologies at Renault.

These principles provide effective solutions for two questions that Renault regards as priorities for its IS architecture: data repositories and services. Their implementation indeed yields an architecture of REST services, easy to set up and to connect to. Our work should make that clear.

The cornerstone of Linked Data principles and of semantic web architecture - the identification of data by means of URIs - is, by itself, of great importance in the definition of information systems. Not only because a sharable way of naming things is needed to support exchange of information about those things: it is indeed our observation that a frequent cause of problems is the absence of proper identification of real world things. Those problems tend to surface when legacy systems need to be interconnected. In [6], we described such a situation, where concepts central to the domain were not formally identified, thus hindering efficient use of existing information resources, and increasing the costs of data reconciliation.

In what follows we describe what has been done in the work we undertook:
• the publishing of a data repository as Linked Data,
• the implementation of a simple RDF browser,
• and the prototyping of access to the published data from another application.
We then discuss some noticeable points concerning Linked Data.

2. PUBLISHING A DATA REPOSITORY AS LINKED DATA

2.1 Previous experiences
We had some previous experience with the publishing of RDF Data.

Semanlink, the first of these experiences, is a free tagging tool developed by the author, where tags are SKOS-like concepts, identified by URIs, all dereferenceable [5].

The second was a prototype repository of repair and diagnostic operations, modeled with OWL, and developed to provide a probabilistic diagnostic tool with data [6].

In these experiences however, implementing Linked Data principles had not been the central point of the work. Our concern in this new work was to emphasize the publishing and consuming of linked data. The objectives were to better explore the topic, and to highlight the benefits of the method, in a corporate context. We also wanted to produce guidelines and sample code for Renault developers: if Semantic Web technologies are to be used in projects, we’ll need them trained in these techniques.

2.2 Chosen use case
As a use case we chose a repository created recently by the department in charge of after-sales repair documentation. Basically, it is a dictionary of the terms that documentation writers may use when describing repair methods. The first purpose of this repository is to enforce a homogeneous naming scheme of things throughout the whole documentation. These terms are translated into the many different languages the documentation is produced in, and they are classified in a SKOS-like hierarchy. Finally, the repository also contains a link to a data set produced by a different department, the department in charge of spare parts: a list of so-called “generic parts” is associated with each such term. A “generic part” is a part seen through its function in the car (for instance “engine”, “air conditioning compressor”, “right front wing”; several spare part references correspond to a “generic part”).

We chose this repository for several reasons. First, it’s simple enough, yet significant, and it is supposed to be well managed. Second, there is no way to access its data from another application: it is not available as a service, hence, publishing this data as Linked Data corresponded to an actual need. Third, a dump as XML is available. It was therefore easy to produce RDF out of it. Finally, we could envision interesting uses of this data, that we would be able to demonstrate in the context of our work:
• access to the repository’s RDF data from another application;
• inclusion of RDFa data in repair methods generated as XHTML pages (repair methods are produced as small chunks of XML, and they contain references to the terms of the repository. Having RDF statements inside the page allows for interesting features using javascript).

2.3 Development environment
Java is the main language used for software developments at Renault, and therefore the obvious choice for our work, given the high quality of Jena’s implementation of the semantic web stack.

2.4 First steps

2.4.1 Minting URIs for the items of the repository
Minting URIs for the items of the repository was not a big deal: as they already had an id, it was just a matter of choosing a namespace where we would be able to publish the data: the URI of an item of the repository is just the concatenation of that namespace, of the item’s class name (the name of the table in the database the item is primary key of) and of the item’s id.

Regarding the “generic parts”, which, as we mentioned earlier, do not pertain to this repository, they should be published by their owner (the parts department), under their own namespace. Of course it is not the case and, to get started quickly, we minted URIs for these external items inside the same namespace. It will be possible to fix this later, for instance using owl:sameAs statements to link them to their true URI when their natural owner makes them available as Linked Data.

2.4.2 Hash or slash URIs
Another decision concerns the type of URI to use. The best choice depends on the size of the repository. Hash URIs are simple, but they imply that the whole file gets downloaded when dereferencing one of them. It is therefore better in our case to use slash URIs, which can be dereferenced one by one.

2.4.3 Information and Non-Information Resources
Clearly, the items of the repository are “real-world things”, not “information resources”: to respect the web architecture, we must implement the “httpRange-14” resolution when dereferencing them.

Using Non-Information Resources (NIR) requires the creation of three different URIs: one for the resource itself, one for the RDF data describing the resource, and one for an HTML description. There are several ways that make sense for the naming of these URIs. We used [namespace]:[resid] for the NIR, [namespace]:[resid].rdf and [namespace]:[resid].html for the two corresponding information resources.

2.4.4 Producing the RDF
An XML dump of the repository being available, it was easy to produce RDF out of it - a really simple RDF, by the way, without blank nodes, for instance.

2.5 The “LED” Servlet
URIs of the data set can be made dereferenceable by using a servlet, the “context path” of which matches the namespace chosen for the repository.

The servlet must of course have access to the RDF data of the repository. As the data set contains no more than 60,000 statements, handling it as a Jena memory model, loaded at startup, was not a problem.

2.6 Dereferencing URIs of NIRs
When an agent gets the URI of an NIR, the servlet must respond with a 303 HTTP status code, and include a redirection to the URI of an information resource that best fits the preferences expressed in the “accept” HTTP header of the request - that is, in our case, the URI of either the RDF or the HTML describing the NIR.

A servlet has access to the HTTP headers of a request. It can therefore analyze its accept header to decide what it should return.

2.7 Answering requests for the RDF about an NIR
When the client requests the RDF data about an NIR (that is, when it dereferences the “.rdf” URI), answering is just a matter of extracting the statements of interest from the Jena Model, and of returning them, which is made easy by Jena serializers. As our repository is very simple, there is no question about the content to be returned: all statements of the form [...] and [...]. We just added the statements defining the links between the three resources nirURI, nirURI.rdf and nirURI.html.

We can think about including labels of linked resources. This can be a useful optimization when we know that the RDF returned will be displayed as HTML. We didn’t do it by default, only for resources linked by certain properties. (The question of the amount of information to be returned about one URI is briefly discussed later: in other cases, the behavior adopted here would lead to a uselessly huge number of statements).

2.8 Answering requests for the HTML about an NIR
Generating HTML about a resource from its RDF description is of course one important topic. It can be done on the server, for instance using JSP. (That’s what we had done in our previous experiments with Linked Data publishing, such as Semanlink). But it can also be done on the client, thanks to the javascript RDF parser made available with the Tabulator project [7].

2.9 Generating a page out of RDF data using javascript on the client
That’s the road we chose, because it allows for a nice architecture:
• it provides a clean separation of “Views” and “Model” (in MVC parlance), with easier reuse of GUI widgets,
• it decreases the load on the server,
• it gives the possibility to change the display on the client without sending a new request to the server,
• it allows RDF to be loaded incrementally.

The principle is simple: a request for the HTML about an NIR returns an HTML page which is almost an empty box (or a template), containing a call to a javascript function that takes the URI of the RDF data as argument. It is this javascript that downloads the RDF data, and displays it.

2.9.1 Parsing problems with Internet Explorer
We faced a difficulty with Tabulator’s javascript RDF parser, as it didn’t support Internet Explorer (neither 6 nor 7) at the time we undertook our work (this was Tabulator 0.8). We had to correct that, as most people use Explorer in our company, and it wouldn’t be acceptable to propose a solution that doesn’t work with it. The patch is available for download by interested people.

2.9.2 Generic display
In a first step, we didn’t do much more than displaying the RDF about the described resource in a very generic way, that is listing values, property by property.

2.9.3 Tree-like GUI widget
One important part of the repository is the classification of the terms in a SKOS-like hierarchy. It is supposed to provide a way for a human user to easily navigate inside the corpus of data and help her find the term she’s looking for. As navigating a hierarchy of concepts to access data is so common, we developed a dedicated tree-like widget, that we took care to make reusable (the Semanlink GUI for instance could use it, instead of having its trees created by the server. We’ll see another reuse example later).

2.10 A simple RDF browser
We had not mentioned it yet, but what we described until now supports only dereferencing URIs from our own data set, and we had limited our GUI based on this constraint. It is indeed standard javascript security to forbid connections to servers other than the one the page comes from. It is possible to bypass this constraint, as Tabulator does, using Firefox with one of its default settings changed. This was not an option in our case: we had to work with the standard version and configuration of browsers. Dereferencing URIs from other servers therefore meant implementing a usual trick (namely, an HTTP proxy: requests to dereference URIs outside our domain must be sent to the servlet, which forwards them to the actual server, and then returns the result).

Implementing this trick is all it takes to transform the solution into a very simple, yet generic RDF browser: RDF can be downloaded from anywhere, and displayed using javascript.

2.11 Getting linked data from another application
An important aspect of Linked Data is the fact that it provides the implementation of a service to which applications can connect over HTTP to get data and use it as they see fit. It was important to demonstrate that this can actually be done without difficulty, at a low cost in terms of development for the client application. In order to provide sample code, we implemented two kinds of connections to the repository’s data: one in Java, using the Jena API to parse and use the RDF, the other in Javascript.

For this demonstration, we used a completely unrelated tool that we had built some time ago for the parts department. It is a web application that computes and displays information about “generic parts”: the user enters the code of a “generic part”, and the application displays a list of corresponding spare part references.

The first feature we added to this application, in Java, is simply the possibility to list the spare part references corresponding to a term of our repository: the “part” servlet connects to the repository, dereferences the term, parses the returned RDF, extracts the corresponding list of “generic parts”, and for each, computes the list of spare part references. For the second feature, we reused our tree widget: the user can use it to navigate down the hierarchy of concepts of our linked data set, to choose either a “generic part” or a term, and to get the list of corresponding spare part references.

2.12 Conclusion concerning this prototype
We implemented in Java and Javascript an example of a repository published as linked data. It is composed of a servlet that uses a Jena Model containing the RDF data, and that ensures the dereferencing of URIs, respecting the principles regarding Non-Information Resources. Basically, this servlet only produces RDF output. The display in HTML is done in Javascript. This is enough to build a very simple but generic RDF browser, and to publish the repository’s data, providing the functionalities of a REST web service. We demonstrated how to connect to it from a program, and how to use its data.

We now plan to include a SPARQL endpoint, to complete our accompanying how-to.

We think this is a fairly noticeable achievement, for a relatively simple development, which is almost completely reusable, and easily extensible. One obvious idea is to improve the RDF browser, implementing, for instance, templates that customize the display depending on the type of the resource that gets dereferenced.

The solution should be compared side by side with more services. Let us just note some points. This approach respects the web architecture, and this has its benefits: caching, for instance, [...] barely changes. Unlike WS-* services, this service only uses HTTP GET, and therefore benefits from the standard [...] is related to the use of RDF, which is a generic data model. We do not have to learn a special syntax to be able to use the service: we directly manipulate the data, not a specialized API. Furthermore, any chunk of RDF extracted from the repository could be transferred from application to application, possibly aggregated with more data, and still remain completely understandable with just a standard RDF parser. This reduces the cost of development in client applications (a question that seems to be sometimes overlooked when speaking about services).

This concludes what had to be said about this prototype, and we are now going to discuss some points about Linked Data, in no particular order.

3. RDF FORMS
Quoting [2]: "The Semantic Web [...] is about making links, so that a person or machine can explore the web of data." Besides hypertext (“href”) links, forms are an important feature of the web. How does this transpose to the web of data? Shouldn't there be a standardized way to "include forms" in RDF data? This would allow a server of RDF data to require some input, with a well defined meaning, from its clients (be they humans or machines).

We had a use case in the field of technical after-sales documentation. The repository of repair and diagnostic operations mentioned earlier can be used by humans (mechanics looking for information about how to repair a car) and by programs: for instance a program that computes estimates needs to get the list of parts that are necessary for a given repair on a given car, and the time needed to perform the repair.

Suppose for instance that you have to replace the engine of a car, and that you want to get information about how to do that. The repository contains a concept "ex:engine_removal". Now, for any given car, there is one and only one method to (correctly) take apart the engine. You are not concerned with the information that you would get by dereferencing "ex:engine_removal", but rather with the subset that is relevant to your car. This subset depends on the characteristics of that car. It would not make sense to return all the information about all the cars. Not only because it would be a waste of bandwidth, but also because it would be difficult for the client to understand this information. (This information would indeed contain "conditional links", things such as: “if condition then statement”, where condition is a Boolean function of the characteristics of the car).

Typically, the service extracts from its underlying database the list of all documents matching "ex:engine_removal". Each record has a property "condition" (a Boolean function of the technical characteristics of a car which returns true when the document is relevant for a given car). To filter the list, the service has to evaluate these Boolean expressions. Let's say that the engine removal depends on the model and the engine type of the car. We could ask the user to enter those values by returning some RDF such as:

[RDF example not preserved in the source]

If the client program knows the conventions of this "form vocabulary", it can understand that it has to provide values for model and engine_type. How it determines these values varies (if there is a human user, it can generate an HTML form). The point is that the client program knows (or can discover) the exact meaning of the parameters of the form. If it is able to provide the answer, it will then construct a URL including the values and dereference it to get the data it is looking for.

We chose this example, though it is a bit long, on purpose. The European Commission has issued a directive requiring car manufacturers to publish their technical documentation about repairs. A specification by an OASIS technical committee describes how this should be achieved, using RDF for metadata. The protocol for this situation could be improved using “RDF Forms”.

We described here a form with a GET method, but POST methods are of course interesting as well.

4. FINDING THE URIS OF “REAL WORLD THINGS”
This is probably the major difficulty with Linked Data on the web. The publishers of the large data sets of the LOD project made an important effort to interlink their data, and they built a huge source of identities for real-world things. But how can a user discover how to say “Paris” or “Hamlet”?

4.1 The case of enterprise data
Let’s note that this is not really a problem in the corporate context. Companies indeed do have large and standardized vocabularies, that are shared throughout their whole organization, to name many of the things they manipulate in their operations. Of course, not everything is perfect: different departments sometimes give a different meaning to the same term, and this may be a cause of misunderstandings and problems. We believe that adopting URIs for identification would reduce the number of such cases: as URIs include the designation of their “owner”, they are an incentive to check for the real meaning of the thing before deciding to use it.

4.2 On the web
The question is of first importance. If we don’t have tools to help a user discover the URIs for things, she won’t be able to write statements using a shared vocabulary. The best she’ll be able to do is to write statements using her own vocabulary. She shouldn’t be left helpless, because this hurts the chances of the semantic web being widely adopted.

We think that the publishers of the massive linked data sets should try to provide tools to help connect to them, and find the URI of things.

The problem is of course difficult, but pragmatic tools can be of great help. If I type “Paris” in Wikipedia, I get to a page about the capital of France, with a link to a disambiguating page. Implementing such a service, returning its results as a small chunk of RDF, should not be a problem for the large LOD projects, and this would be very useful.

More complex tools can be thought of. If a data set provides a SPARQL endpoint, some interesting tools can be developed by third parties.

4.3 Semanlink
We will try to participate in this effort. As noted earlier, Semanlink publishes data as RDF, following Linked Data principles. We hope to be able in the next months
• to work on the linking of Semanlink to the LOD data sets,
• to make results of the tag search available as RDF,
• to build tools that help discover whether a given concept is actually used in a Semanlink data set,
• and to work on the interlinking of several Semanlink databases.
We hope that this will eventually provide a test bed for semi-automatic reconciliations of portions of independently developed vocabularies.

4.4 How to get the URI of an NIR when you know the URI of the corresponding HTML?
Most of the time, what you see of an NIR is the HTML page that gets displayed in your standard web browser when dereferencing its URI. How do you get back from this HTML to the URI of the NIR? Today, there is no easy way to get it. If Linked Data publishers follow the recommendations of [3], the URI of the RDF data describing the NIR can be extracted from the HTML page, but not the URI of the NIR. If you’re interested in it (for instance because you want to write a statement involving this NIR), you have to download the RDF, parse it, and find the statement linking the NIR to the HTML. It would be easier to have the statement describing the link between the HTML page and the NIR directly in the HTML (probably as RDFa). We think that suggesting that data publishers do so would be a nice addition to the “How to Publish Linked Data on the Web” document [3].

5. AMOUNT OF DATA TO RETURN WHEN A URI GETS DEREFERENCED
This is a known issue with Linked Data, with several aspects. In particular, it is not always practical, or wise, to blindly return all triples containing the URI that is dereferenced. This information can be huge, or computing the statements involving some special properties may be a heavy task. So, returning all the statements that a server is able to provide can be a waste of bandwidth, or of server resources, if the client is not interested in them. What could the server do, when such a URI gets dereferenced, to avoid waste (that is, to avoid returning all statements involving the URI), yet let the client know that more information is available and could be returned if needed? Some interesting suggestions were made, based basically on the idea: “look up there to get triples involving that property”. These suggestions imply a new convention (that is, the definition of a few properties). We would appreciate an agreement of the Linked Data community on such a convention.

6. CONCLUSION
We are convinced that Linked Data principles are good for the web as well as for companies’ information systems. In a work of evangelization about Semantic Web technologies inside our company, we published a repository as linked data, without difficulty. We showed how to connect to the resulting service from another application. The code written is largely reusable, and expandable. And we will soon reuse it to publish new sets of linked data. For instance, we will shortly implement the publishing as RDF of data served by a SOAP web service. Increasing the number of use cases should demonstrate the flexibility of the method. We’ll continue our work in the field of after-sales technical documentation. This field looks like a typical use case for Semantic Web technologies, given the number of business objects that have to be shared among the many systems involved, in a corporation-wide process. But it is also a field that is slow to evolve, partly for the same reasons.

Concerning Linked Data on the web, we think that the community should work on defining a small vocabulary to better handle some problems that surface (such as the amount of data to return when dereferencing a URI), or to provide new functionalities (such as “RDF Forms”).

7. REFERENCES
[1] Berners-Lee, T., “Business Model for the Semantic Web”, 2001 http://www.w3.org/DesignIssues/Business
[2] Berners-Lee, T., “Linked Data” http://www.w3.org/DesignIssues/LinkedData.html
[3] Bizer, C., Cyganiak, R., Heath, T., “How to Publish Linked Data on the Web” http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/
[4] Feigenbaum, L., “Semantic Web Technologies in the Enterprise”, 2006 http://www.thefigtrees.net/lee/blog/2006/11/semantic_web_technologies_in_t.html
[5] Semanlink http://www.semanlink.net
[6] Servant, FP, “Semantic Web Technologies in Technical Automotive Documentation” - CEUR-WS.org/Vol-258 - OWL: Experiences and Directions 2007 http://ftp.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-258/paper04.pdf
[7] “Tabulator: Generic Data Browser” http://www.w3.org/2005/ajar/tab