<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>EURECOM at the SemStats 2019 Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thibault Ehrhart</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raphael Troncy</string-name>
          <email>raphael.troncy@eurecom.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>EURECOM</institution>
          ,
          <addr-line>Sophia Antipolis</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>In this paper, we present two contributions for the SemStats 2019 Challenge. First, we developed the SIRENE ontology for modeling the official database of French enterprises (legal units) and establishments (local units), and we studied the coverage of this dataset in Wikidata. Second, we developed a web-based application for visualizing the public database of facilities, which was previously enriched using a tourism and culture knowledge graph.</p>
      </abstract>
      <kwd-group>
        <kwd>Ontology modeling</kwd>
        <kwd>data interlinking</kwd>
        <kwd>knowledge graph</kwd>
        <kwd>visualization</kwd>
        <kwd>Wikidata</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <sec id="sec-1-1">
        <title>Dataset overview</title>
        <p>Sirene is the French directory managed by INSEE, which assigns a SIREN number to French enterprises and a SIRET number to their establishments. The Sirene track of the challenge consists in proposing an RDF model for this data. The Sirene dataset is divided into 5 files:</p>
        <p>(1) StockUniteLegale, one of the two main files of the dataset together with StockEtablissement. It contains all active and ceased companies in their current state in the directory. A legal unit is a legal entity governed by public or private law. This legal entity can be either a legal person, whose existence is recognized by law independently of the persons or institutions that own it or are members of it, or a natural person who, as a self-employed individual, can carry on an economic activity.</p>
        <p>(2) StockEtablissement, the second main file, which contains all active and closed establishments in their current state in the directory.</p>
        <p>(3) StockEtablissementLiensSuccession, the list of predecessors and successors of establishments.</p>
        <p>(4) StockUniteLegaleHistorique, a set of values of certain variables historized in the Sirene directory for all the companies.</p>
        <p>(5) StockEtablissementHistorique, a set of values of certain variables historized in the Sirene directory for all the establishments.</p>
        <p>The data is saved in CSV format and is updated on a monthly basis. Ceased businesses and closed establishments are included, providing access to Sirene data since 1973.</p>
      </sec>
      <sec id="sec-1-2">
        <title>Re-using popular vocabularies</title>
        <p>We need a model that can represent all the data in the Sirene database. For this, we re-used existing vocabularies, which we extended when necessary. Our modeling work was initially based on euBusinessGraph, an ontology designed to represent basic company information. It relies on several other vocabularies, including W3C Org<sup>1</sup>, W3C RegOrg<sup>2</sup>, FOAF<sup>3</sup>, schema.org<sup>4</sup>, and ADMS<sup>5</sup>.</p>
        <p>W3C Org is an ontology designed to publish information about organizations
and organizational structures. It is intended to provide a generic and reusable
basic ontology that can be expanded or specialized for use in particular
situations.</p>
        <p>W3C RegOrg is a vocabulary used to represent registered organizations. It is
an extension of the W3C Org ontology and is designed to describe organizations
that have acquired legal status through a formal registration process, typically
in a national or regional register. This ontology focuses only on companies, and
excludes natural persons.</p>
      </sec>
      <sec id="sec-1-3">
        <title>SKOS controlled vocabulary</title>
        <p>We also defined a controlled vocabulary to represent the legal categories and the employee groups. The vocabulary of legal categories is organized as a 3-level hierarchy that mirrors the one found in the data provided by Sirene. The URI of each concept is based on the code of the legal category.</p>
        <p>1 https://www.w3.org/TR/vocab-org/
2 https://www.w3.org/TR/vocab-regorg/
3 http://xmlns.com/foaf/spec/
4 https://schema.org/
5 https://www.w3.org/TR/vocab-adms/</p>
        <p>Listing 1.1. Samples from the legal categories vocabulary</p>
        <preformat>&lt;http://sirene.eurecom.fr/categorie-juridique/&gt; a skos:ConceptScheme ;
    rdfs:label "Catégories juridiques"@fr ;
    rdfs:comment "La nomenclature des catégories juridiques retenue dans la gestion du répertoire Sirene, répertoire officiel d'immatriculation des entreprises et des établissements, a été élaborée sous l'égide du comité interministériel Sirene.\n\nC'est une nomenclature à vocation inter-administrative, utilisée aussi dans la gestion du Registre du Commerce et des Sociétés. Elle sert de référence aux Centres de Formalités des Entreprises (CFE) pour recueillir les déclarations des entreprises."@fr ;
    dct:created "2019-10-01"^^xsd:date ;
    dct:modified "2019-10-01"^^xsd:date .

&lt;http://sirene.eurecom.fr/categorie-juridique/5&gt; a skos:Concept ;
    skos:inScheme &lt;http://sirene.eurecom.fr/categorie-juridique/&gt; ;
    skos:prefLabel "Société commerciale"@fr .

&lt;http://sirene.eurecom.fr/categorie-juridique/54&gt; a skos:Concept ;
    skos:broader &lt;http://sirene.eurecom.fr/categorie-juridique/5&gt; ;
    skos:inScheme &lt;http://sirene.eurecom.fr/categorie-juridique/&gt; ;
    skos:prefLabel "Société à responsabilité limitée (SARL)"@fr .

&lt;http://sirene.eurecom.fr/categorie-juridique/5422&gt; a skos:Concept ;
    skos:broader &lt;http://sirene.eurecom.fr/categorie-juridique/54&gt; ;
    skos:inScheme &lt;http://sirene.eurecom.fr/categorie-juridique/&gt; ;
    skos:prefLabel "SARL immobilière pour le commerce et l'industrie (SICOMI)"@fr .
...</preformat>
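        <p>As an illustration of how such concepts could be generated from legal category codes, the following Python sketch derives the skos:broader link from the code itself. It assumes, based only on the three samples of Listing 1.1, that the levels use 1, 2 and 4 digits; the helper names broader_code and concept_turtle are hypothetical and are not part of the published code.</p>
        <preformat><![CDATA[
```python
from typing import Optional

# Scheme URI taken from Listing 1.1.
SCHEME = "http://sirene.eurecom.fr/categorie-juridique/"


def broader_code(code: str) -> Optional[str]:
    """Return the parent code, or None for a top-level category.

    Assumption (from the samples only): levels use 1, 2 and 4 digits.
    """
    if len(code) == 4:
        return code[:2]
    if len(code) == 2:
        return code[:1]
    return None


def concept_turtle(code: str, label: str) -> str:
    """Serialize one legal category as a Turtle skos:Concept."""
    # Escape backslashes and double quotes so the literal stays valid.
    safe_label = label.replace("\\", "\\\\").replace('"', '\\"')
    lines = [f"<{SCHEME}{code}> a skos:Concept ;"]
    parent = broader_code(code)
    if parent is not None:
        lines.append(f"    skos:broader <{SCHEME}{parent}> ;")
    lines.append(f"    skos:inScheme <{SCHEME}> ;")
    lines.append(f'    skos:prefLabel "{safe_label}"@fr .')
    return "\n".join(lines)


if __name__ == "__main__":
    print(concept_turtle("54", "Société à responsabilité limitée (SARL)"))
```
]]></preformat>
        <p>Deriving the parent from the code prefix keeps the hierarchy consistent with the URI pattern, at the cost of relying on the assumed code lengths.</p>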
        <p>The employee group vocabulary uses the schema:QuantitativeValue class and contains intervals of the number of employees, each with a minimum value and a maximum value. There are 16 employee groups defined by Sirene<sup>6</sup>.</p>
        <p>Listing 1.2. Example of employee group</p>
        <preformat>&lt;http://sirene.eurecom.fr/tranche-effectif/11&gt; a schema:QuantitativeValue ;
    schema:minValue "10"^^xsd:int ;
    schema:maxValue "19"^^xsd:int .</preformat>
        <p>6 https://www.sirene.fr/sirene/public/variable/tefen</p>
      </sec>
      <sec id="sec-1-4">
        <title>Sirene ontology and URI pattern</title>
        <p>We started by creating a mapping between the properties defined in the dataset and those available in the different ontologies.</p>
        <p>Legal units are mapped to rov:RegisteredOrganization, reusing the properties defined in this vocabulary. The URI of a legal unit is composed of the base URI followed by the SIREN number of the unit (e.g. &lt;http://sirene.eurecom.fr/siren/19450855200016&gt;). The legal category uses the rov:orgType property and points to the category URI, as defined in our SKOS controlled vocabulary. The employee group value is mapped to schema:numberOfEmployees and points to the URI of the employee group as defined in our ontology.</p>
        <p>Establishments are mapped to rov:RegisteredOrganization and org:Site. The URI of an establishment is composed of the base URI followed by the SIRET number of the establishment (e.g. &lt;http://sirene.eurecom.fr/siret/32517500032&gt;). The establishment's address is mapped to the org:siteAddress property, which points to a URI made from the establishment's URI followed by /address (e.g. &lt;http://sirene.eurecom.fr/siret/32517500032/address&gt;). The link between the legal unit and the establishment is represented by the org:hasSite property. If etablissementSiege is set to true, the link is also represented by the org:hasRegisteredSite property, which indicates that this is the primary site legally registered by the organization.</p>
        <p>Organizational changes are mapped to org:ChangeEvent, where the properties org:originalOrganization and org:resultingOrganization are set to the URIs of the original and the resulting establishments. The URI of the succession link is composed of the base URI followed by a unique identifier generated from the SIRET numbers (e.g. &lt;http://sirene.eurecom.fr/event/32517500032-12345678901&gt;).</p>
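        <p>The URI patterns described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the event identifier is assumed to be the two SIRET numbers joined by a hyphen, as the example URI suggests. The org: namespace is the one published by the W3C.</p>
        <preformat><![CDATA[
```python
from typing import List

BASE_URI = "http://sirene.eurecom.fr"  # base URI seen in the examples
ORG = "http://www.w3.org/ns/org#"      # W3C Organization Ontology namespace


def legal_unit_uri(siren: str) -> str:
    return f"{BASE_URI}/siren/{siren}"


def establishment_uri(siret: str) -> str:
    return f"{BASE_URI}/siret/{siret}"


def address_uri(siret: str) -> str:
    # The address URI is the establishment URI followed by /address.
    return f"{establishment_uri(siret)}/address"


def succession_uri(predecessor_siret: str, successor_siret: str) -> str:
    # Assumption: the identifier joins both SIRET numbers with a hyphen.
    return f"{BASE_URI}/event/{predecessor_siret}-{successor_siret}"


def site_triples(siren: str, siret: str, is_head_office: bool) -> List[str]:
    """N-Triples linking a legal unit to one of its establishments."""
    unit, site = legal_unit_uri(siren), establishment_uri(siret)
    triples = [f"<{unit}> <{ORG}hasSite> <{site}> ."]
    if is_head_office:  # corresponds to etablissementSiege being true
        triples.append(f"<{unit}> <{ORG}hasRegisteredSite> <{site}> .")
    return triples


if __name__ == "__main__":
    for line in site_triples("19450855200016", "32517500032", True):
        print(line)
```
]]></preformat>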
        <p>Since none of the existing ontologies covered the complete scope we needed,
we reused them where possible, and we created an extension called sirene:UniteJuridique,
in the base URI http://sirene.eurecom.fr/ontology#.</p>
        <p>Listing 1.3. Definition of UniteJuridique</p>
        <preformat>sirene:UniteJuridique a owl:Class ;
    rdfs:isDefinedBy &lt;http://sirene.eurecom.fr/ontology#&gt; ;
    rdfs:label "Unité Juridique"@fr .</preformat>
        <p>This owl:Class is also complemented with 37 properties that are based on
the name of the variables from the Sirene dataset.</p>
        <p>Listing 1.4. List of properties from Sirene Ontology for the UniteJuridique class</p>
        <preformat>sirene:identifiantAssociationUniteLegale
sirene:nicSiegeUniteLegale
sirene:nombrePeriodesUniteLegale
sirene:economieSocialeSolidaireUniteLegale
sirene:categorieEntreprise
sirene:caractereEmployeurUniteLegale
sirene:anneeEffectifsUniteLegale
sirene:anneeCategorieEntreprise
sirene:statutDiffusionUniteLegale
sirene:unitePurgeeUniteLegale
sirene:activitePrincipaleEtablissement
sirene:activitePrincipaleRegistreMetiersEtablissement
sirene:anneeEffectifsEtablissement
sirene:caractereEmployeurEtablissement
sirene:codeCedexEtablissement
sirene:codeCedex2Etablissement
sirene:codeCommuneEtablissement
sirene:codeCommune2Etablissement
sirene:codePaysEtrangerEtablissement
sirene:codePaysEtranger2Etablissement
sirene:denominationUsuelleEtablissement
sirene:distributionSpecialeEtablissement
sirene:distributionSpeciale2Etablissement
sirene:etablissementSiege
sirene:etatAdministratifEtablissement
sirene:indiceRepetitionEtablissement
sirene:indiceRepetition2Etablissement
sirene:nic
sirene:nombrePeriodesEtablissement
sirene:nomenclatureActivitePrincipaleEtablissement
sirene:statutDiffusionEtablissement
sirene:transfertSiege
sirene:continuiteEconomique</preformat>
        <p>The data has been enriched with other sources by linking legal units and establishments with data from http://entreprise.data.gouv.fr. We have materialized this link using the owl:sameAs property. The link points to https://entreprise.data.gouv.fr/etablissement/&lt;identifier&gt;, where &lt;identifier&gt; corresponds to the SIREN number for legal units, or the SIRET number for establishments.</p>
        <p>The following diagram shows an example of a materialized legal unit and the
relationship with its establishment.</p>
      </sec>
      <sec id="sec-1-5">
        <title>Studying Sirene coverage in Wikidata</title>
        <p>We extracted data from the Wikidata knowledge base using a SPARQL query that retrieves the entities with properties P1616 (SIREN number) and P3215 (SIRET number). This yields about 41k registered organizations and 374 registered establishments in Wikidata. We then link these entities to the Sirene entities using their registration numbers. In the end, we obtain links to the Wikidata pages of 40,984 companies and 374 establishments, which are materialized using the owl:sameAs property.</p>
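        <p>A minimal sketch of this linking step is given below. The query text is an assumption about what such a SPARQL query could look like (only the property identifiers P1616 and P3215 come from the text), and same_as_triples is a hypothetical helper that keeps only the registration numbers also present in Sirene. The target URL pattern follows the one used in Listing 1.5.</p>
        <preformat><![CDATA[
```python
from typing import Iterable, List, Set, Tuple

# Assumed shape of the query; wdt: is the Wikidata "direct property" prefix.
WIKIDATA_QUERY = """\
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item ?siren ?siret WHERE {
  { ?item wdt:P1616 ?siren . }
  UNION
  { ?item wdt:P3215 ?siret . }
}
"""

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triples(
    rows: Iterable[Tuple[str, str]],
    known_numbers: Set[str],
    kind: str,
) -> List[str]:
    """Build owl:sameAs N-Triples for numbers found in both datasets.

    rows: (registration number, Wikidata identifier) pairs.
    kind: "siren" for legal units or "siret" for establishments.
    """
    triples = []
    for number, qid in rows:
        if number in known_numbers:
            subject = f"http://sirene.eurecom.fr/{kind}/{number}"
            target = f"https://www.wikidata.org/wiki/{qid}"
            triples.append(f"<{subject}> <{OWL_SAME_AS}> <{target}> .")
    return triples


if __name__ == "__main__":
    rows = [("19450855200016", "Q13334")]
    print(same_as_triples(rows, {"19450855200016"}, "siren"))
```
]]></preformat>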
        <p>Listing 1.5. Example of entity linking between a legal unit from Sirene and a Wikidata page</p>
        <preformat>&lt;http://sirene.eurecom.fr/siren/19450855200016&gt;
    owl:sameAs &lt;https://www.wikidata.org/wiki/Q13334&gt; .</preformat>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>BPE Track</title>
      <sec id="sec-2-1">
        <title>Dataset overview</title>
        <p>The permanent facilities database (BPE, for "Base Permanente des Équipements") provides information on the level of facilities and services that a territory offers to its population. It lists over 2.5 million facilities of a wide range of types with their main features, most of which are geolocated.</p>
        <p>The datasets provided for the challenge are separated into 3 folders:</p>
        <p>1. bpe2018-facilities: contains data for each facility, in RDF format.</p>
        <p>2. bpe2018-codelists: the code lists used, expressed in SKOS.</p>
        <p>3. bpe2018-geo-quality: metadata on geolocation quality. The quality level is established according to the following rules:</p>
        <p>- good: the deviation between the provided (X, Y) coordinates and the actual ground position is less than 100 m;</p>
        <p>- acceptable: the maximum deviation between the provided (X, Y) coordinates and the actual ground position is between 100 m and 500 m;</p>
        <p>- bad: the maximum deviation between the provided (X, Y) coordinates and the actual ground position is greater than 500 m, and random imputations may have been made.</p>
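        <p>For illustration, the three quality levels can be expressed as a small Python function over the deviation in meters. The function name and the handling of the exact boundaries (100 m and 500 m counted as acceptable) are assumptions, since the rules above do not state which side the boundaries belong to.</p>
        <preformat><![CDATA[
```python
def geo_quality(deviation_m: float) -> str:
    """Map a positional deviation in meters to a BPE quality level.

    Assumption: the boundaries 100 m and 500 m count as "acceptable".
    """
    if deviation_m < 0:
        raise ValueError("deviation must be non-negative")
    if deviation_m < 100:
        return "good"
    if deviation_m <= 500:
        return "acceptable"
    return "bad"


if __name__ == "__main__":
    for value in (20, 250, 1200):
        print(value, geo_quality(value))
```
]]></preformat>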
        <p>The facilities data contains information about the creation date, category, commune number, and geolocation of each facility. The category refers to a SKOS controlled vocabulary that contains 3 levels of categories, with 7 first-level categories, 27 second-level categories, and 187 third-level categories. Geolocation uses the Lambert-93 projection<sup>7</sup>. In order to facilitate the computation of distances and the visualization of the results, we converted the coordinates to WGS84 (World Geodetic System 1984), a widely used coordinate system.</p>
      </sec>
      <sec id="sec-2-2">
        <title>City Moove Knowledge Graph</title>
        <p>City Moove is a knowledge base specialized in the domain of tourism and city exploration. It contains descriptions of events, places, transportation facilities and social activities, collected from numerous local and global data providers. The entities in the knowledge base are deduplicated, interlinked and enriched using semantic technologies [1]. The query endpoint is available at https://kb.city-moove.fr/sparql.</p>
        <p>The data model used in the City Moove knowledge base is based on a set of ontologies: DOLCE+DnS Ultralite<sup>8</sup>, schema.org<sup>9</sup>, Dublin Core<sup>10</sup>, LODE<sup>11</sup>, Location Core<sup>12</sup>, Geo<sup>13</sup>, Transit<sup>14</sup>, Media Annotations<sup>15</sup>, and Topo<sup>16</sup>. In addition to these ontologies, a system of categories applies to events, activities, and points of interest. It relies on both the label and the description of each category, as well as on all the instances belonging to these categories. The result is represented in SKOS, in particular with the properties skos:closeMatch and skos:broadMatch. This vocabulary has 480 place categories.</p>
        <p>During our experiment, we focused on one of the largest areas available in the City Moove knowledge base, the French Riviera, with nearly 339k locations collected to date. The BPE dataset contains 70k facilities on the Côte d'Azur.</p>
        <p>7 https://geodesie.ign.fr/?p=72&amp;page=site_lambert93
8 http://ontologydesignpatterns.org/ont/dul/DUL.owl
9 http://schema.org/
10 http://purl.org/dc/elements/1.1/
11 http://linkedevents.org/ontology/
12 http://www.w3.org/ns/locn/
13 http://www.w3.org/2003/01/geo/wgs84_pos#
14 http://vocab.org/transit/terms/
15 http://www.w3.org/ns/ma-ont#
16 http://data.ign.fr/def/topo#</p>
      </sec>
      <sec id="sec-2-3">
        <title>Enriching BPE data using social media</title>
        <p>We started by defining a mapping between the categories from BPE and those from the City Moove knowledge base. Across all BPE categories, we managed to map 59 of them to one or more City Moove categories. We have materialized these relations as RDF triples using the owl:sameAs property.</p>
        <p>Listing 1.6. Samples from the mapping between BPE categories and City Moove categories</p>
        <preformat>&lt;http://data.linkedevents.org/kos/3cixty/touristinformationcenter&gt;
    &lt;http://www.w3.org/2002/07/owl#sameAs&gt;
    &lt;http://beta.id.insee.fr/codes/territoire/typeEquipement/G104&gt; .
&lt;http://data.linkedevents.org/kos/3cixty/bank&gt;
    &lt;http://www.w3.org/2002/07/owl#sameAs&gt;
    &lt;http://beta.id.insee.fr/codes/territoire/typeEquipement/A203&gt; .
&lt;http://data.linkedevents.org/kos/3cixty/postoffice&gt;
    &lt;http://www.w3.org/2002/07/owl#sameAs&gt;
    &lt;http://beta.id.insee.fr/codes/territoire/typeEquipement/A206&gt; .
...</preformat>
        <p>In order to enrich the BPE data with data from the City Moove knowledge base, we must first link the entities based on properties common to both datasets. For this, we use the geographical position and the category mapping. The objective is to compute a score for each candidate pair of entities and to keep the pairs that minimize it. The distance is calculated using the Haversine formula. The weight of the geographical quality is defined as follows: 1.0 if the quality is bad, 0.8 if the quality is acceptable, and 0.6 if the quality is good.</p>
        <p>Given the low number of links in the category mapping, we set the category weight to 0.1, in order to favor geographic distance over categorization. The formula for calculating the similarity score can be summarized as follows:
          <disp-formula><tex-math><![CDATA[\mathit{score} = (\mathit{distanceInMeters} \times \mathit{geoWeight}) + (\mathit{catMatch} \times \mathit{catWeight}) \qquad (1)]]></tex-math></disp-formula>
where score is the similarity score, distanceInMeters is the distance in meters between the two geographic positions, geoWeight is the weight of the geographic quality, catMatch is equal to 0.0 when the categories match and 1.0 otherwise, and catWeight is the weight of the category mapping.
        </p>
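        <p>The scoring step can be sketched in Python as follows. This is an illustration under assumptions rather than the code used for the challenge: the two terms of the score are read as products (the operators are not legible in the formula as extracted), the Earth radius is the usual mean value of 6,371 km, and the function names are hypothetical. The weights are those given in the text.</p>
        <preformat><![CDATA[
```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (assumption)

# Weights stated in the text.
GEO_WEIGHTS = {"bad": 1.0, "acceptable": 0.8, "good": 0.6}
CAT_WEIGHT = 0.1


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding errors before asin.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def score(distance_m: float, geo_quality: str, categories_match: bool) -> float:
    """Raw score of a candidate pair; lower means a better match."""
    cat_match = 0.0 if categories_match else 1.0
    return distance_m * GEO_WEIGHTS[geo_quality] + cat_match * CAT_WEIGHT


if __name__ == "__main__":
    # Two arbitrary points near Nice, used only as sample input.
    d = haversine_m(43.6961, 7.2718, 43.7009, 7.2683)
    print(round(d, 1), round(score(d, "good", True), 1))
```
]]></preformat>
        <p>With a category weight of 0.1 and a distance expressed in meters, the category term only breaks ties between candidates that are almost at the same distance, which is consistent with the stated intent of favoring geographic distance.</p>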
        <p>The scores obtained are then normalized to lie in an interval between 0 and 1, where 1 corresponds to the best score and 0 to the worst score. Finally, the results are converted into RDF using the Expressive and Declarative Ontology Alignment Language (EDOAL) format<sup>17</sup>, which makes it possible to represent the relations between two entities in the form of RDF triples.</p>
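        <p>The text does not specify how the normalization is performed. One possible reading, sketched below as an assumption, is an inverted min-max scaling in which the lowest raw score maps to 1 and the highest to 0. The function name normalize is hypothetical.</p>
        <preformat><![CDATA[
```python
from typing import List, Sequence


def normalize(raw_scores: Sequence[float]) -> List[float]:
    """Inverted min-max scaling: the lowest raw score becomes 1.0."""
    if not raw_scores:
        return []
    low, high = min(raw_scores), max(raw_scores)
    if high == low:
        # All candidates are equally good; treat them all as best.
        return [1.0 for _ in raw_scores]
    return [1.0 - (s - low) / (high - low) for s in raw_scores]


if __name__ == "__main__":
    print(normalize([0.0, 50.0, 100.0]))
```
]]></preformat>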
        <p>Listing 1.7. Example of an alignment between a facility from BPE and a place from
the City Moove knowledge base</p>
        <p>The properties align:entity1 and align:entity2 contain the URI of each entity, align:measure contains the similarity score obtained in the previous steps, and align:relation describes the kind of relation between the two entities. In the example of Listing 1.7, the entities are considered to be exactly equal.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Visualizing enriched BPE data</title>
        <p>In order to explore the results, we developed a web application that presents the user with a world map on which each BPE facility is represented as a marker. The color of each marker is based on the second-level category given by the BPE. Only reconciled facilities with a minimum score of 0.8 are shown on the map.</p>
        <p>When moving the mouse over a marker, a popup appears with the label, category and photo of the reconciled place. The data is queried directly from the City Moove knowledge base in real time using a federated SPARQL query<sup>18</sup>, which allows queries to be distributed over different SPARQL endpoints.</p>
        <p>17 http://alignapi.gforge.inria.fr/edoal.html
18 https://www.w3.org/TR/sparql11-federated-query/</p>
        <p>Listing 1.8. Query being used to retrieve the data of a given facility from both the BPE graph and the City Moove knowledge base</p>
        <preformat>SELECT ?ent1 ?ent2 ?geo ?capacite ?typeNotation ?typeNotationLabel ?businessType ?businessTypeLabel ?label ?poster ?streetAddress ?measure WHERE {
}
GRAPH &lt;http://semstats.eurecom.fr/bpe/facilities&gt; {
  OPTIONAL { ?ent1 ibpe:capacite ?capacite . }
  ?ent1 dcterms:type ?type .
GRAPH &lt;http://semstats.eurecom.fr/bpe/codelists&gt; {
  ?type skos:notation ?typeNotation .
  ?type skos:prefLabel ?typeNotationLabel .
}
SERVICE &lt;https://kb.city-moove.fr/sparql&gt; {
  ?ent2 rdfs:label ?label .
  ?ent2 geo:location/locn:geometry ?geo .
  ?ent2 locationOnt:businessType ?businessType .
  OPTIONAL { ?businessType skos:prefLabel ?businessTypeLabel . }
  OPTIONAL { ?ent2 lode:poster/ma-ont:locator ?poster . }
  OPTIONAL { ?ent2 schema:location/schema:streetAddress ?streetAddress . }</preformat>
        <p>In this paper, we tackled two challenges offered by SemStats 2019. We first proposed a way to model the data from the Sirene database by reusing popular ontologies from W3C and the euBusinessGraph H2020 project. This allows us to connect and enrich the data using Linked Data technologies, as we have shown by linking Wikidata pages with Sirene entities based on the SIREN and SIRET numbers. Moreover, this could be used to enrich Wikidata by completing existing pages that do not have a SIREN number yet.</p>
        <p>We also showed how existing RDF data could be interlinked with other data sources by using entity matching techniques. We were then able to create a prototype of a web application to showcase the usage of multiple Linked Data sources. The source code for the Sirene track and BPE track challenges is available on GitHub at https://github.com/D2KLab/insee/.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Troncy</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rizzo</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jameson</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corcho</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palumbo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hermida</surname>
            ,
            <given-names>J.C.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spirescu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuhn</surname>
            ,
            <given-names>K.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barbu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rossi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Celino</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scanu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Valla</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haaker</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>3cixty: Building comprehensive knowledge bases for city exploration</article-title>
          .
          <source>Journal of Web Semantics (JWS)</source>
          <volume>46-47</volume>
          ,
          <fpage>2</fpage>
          -
          <lpage>13</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>