<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Theme Identi cation in RDF Graphs?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>PRiSM</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Univ. Versailles St Quentin</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>UMR CNRS</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Versailles France hanane.ouksili@prism.uvsq.fr</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>An increasing number of RDF datasets is published on the Web. A user willing to use these datasets will rst have to explore them in order to determine which information is relevant for his speci c needs. To facilitate this exploration, we present an approach allowing to provide a thematic view of a given RDF dataset, making it easier to target the relevant resources and properties. Our approach combines a density-based graph clustering algorithm with semantic criteria in order to identify clusters, each one corresponding to a theme. Prior to clustering, the initial RDF graph is simpli ed, and user preferences are mapped into a set of transformations applied to the graph. Once the clusters are identi ed, labels are extracted to express their semantics. In this paper, we describe the main features of our approach to generate a set of themes from an RDF dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>Theme idendi cation</kwd>
        <kwd>RDF(S) data</kwd>
        <kwd>Clustering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>An increasing number of RDF datasets is published on the Web, making a huge
amount of data available for users and applications. In this context, a key issue
for the users is to locate the relevant information for their speci c needs. A
typical way of exploring RDF datasets is the following: the users rst select a
URI, called a seed of interest, which they are willing to use as a starting point
for their queries; then they explore all the URIs reachable from this seed by
submitting queries to obtain information about the existing properties.</p>
      <p>To facilitate this interaction, a thematic view of an RDF dataset can be
given in order to guide the exploration process. We argue that once the data
is presented as a set of themes, it is easier to target the relevant resources and
properties by exploring the interesting topics only. In this paper, we present
our approach for theme identi cation which combines a density-based graph
clustering algorithm with semantic clustering criteria in order to identify clusters,
each one corresponding to a theme.</p>
      <p>The paper is organized as follows. Section 2 gives an overview of our proposal.
Section 3 details the preprocessing step. Section 4 presents the clustering
algorithm and we discuss methods of describing themes in Section 5. Our prototype
and an example scenario are described in Section 6, related works are provided
in Section 7, and nally, Section 8 concludes the paper.
? This work was supported by Electricity of France (EDF R&amp;D).</p>
    </sec>
    <sec id="sec-2">
      <title>General Principle of Theme Identi cation</title>
      <p>Given an RDF dataset, our goal is to identify a set of themes and to extract the
labels or tags which best capture their semantics. Providing this thematic view
raises several questions:
{ Which information could be used to de ne a theme?
{ As di erent users may not have the same perception of the data, how to
capture their preferences and use them for building the themes?
{ Finally, once the themes have been identi ed, how to label them so as to
make their semantic as clear as possible to the user?</p>
      <p>Our approach relies on the idea that a theme corresponds to a highly
connected area on the RDF graph. The more a set of resources is connected, the
more likely it is that they belong to the same theme or are related to the same
topic. We will therefore use the structure of the RDF graph itself in order to
build the themes. We apply a graph clustering algorithm which identi es these
highly connected areas and their neighborhood in order to form clusters, each
one corresponding to a theme.</p>
      <p>The structure of the graph alone is not su cient to provide meaningful
themes. Indeed, di erent users may have distinct perceptions of what a theme is.
If we consider a dataset providing information about universities and scientists,
one possible view is that themes correspond to research areas such as
Mathematics or Physics, another one is that themes correspond to research teams located
in the same geographical area. These preferences will be used for identifying the
themes, in addition to the structure of the graph.</p>
      <p>User preferences are captured by specifying the characteristics of all resources
which should be assigned to the same cluster (for example, resources having the
same value or linked by a given property). Each preference will be mapped
into one or several transformations applied to the graph. For example, if the
user expresses that two resources related by the owl:sameAs property should
be assigned to the same cluster, the transformation will consist in merging the
corresponding nodes in the graph.</p>
      <p>An overview of our approach is given in Figure 1. It comprises three main
steps, (i) preprocessing, where transformations are applied on the RDF graph,
(ii) graph clustering, where themes are identi ed, and (iii) label extraction which
provides a summary of the content of each cluster. In this paper, we mainly
address the rst two steps.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Preprocessing</title>
      <p>The initial RDF graph will be transformed prior to the execution of the clustering
algorithm. Some transformations are systematic regardless of the context, others
consist in integrating user preferences in the graph. This section describes both
of them.
Some of the information in the initial graph is not useful in order to group the
nodes into meaningful clusters.</p>
      <p>In the initial RDF graph, edges are oriented and labeled with the name of
a predicate. The clustering algorithm used in our approach will try to identify
highly connected areas, regardless of the orientation of the edges; no matter what
the orientation of an edge is, what we are interested in is that some semantic
relation exists between the resources. For example, consider dbo:in uenced
property that is asymmetric in nature; if we have the triplet hri dbo:in uenced rj i.
We are not interested in which researcher between ri and rj that in uenced the
other; the most important is there is a semantic relationship between the two
researchers. We can therefore simplify the graph by removing the orientation of
the edges. Similarly, the clustering algorithm will not use the label of the edges,
and they are also removed from the graph.</p>
      <p>An RDF graph contains several types of nodes which can be either resources
or literals. A literal is related to one resource and is a characteristic of this
resource. Obviously, a resource and the related literals should be grouped into the
same cluster. We could therefore apply the clustering algorithm on a simpli ed
version of the graph which doesn't contain the literals.</p>
      <p>The output of the preprocessing stage is a graph where the labels, orientation
of the edges and literal nodes have been removed. Figure 2 shows an example of
simpli ed graph (2(b)) corresponding to an RDF dataset (2(a)).
3.2</p>
      <p>Capturing User Preferences
As stated earlier, the structure of the graph alone is not su cient for the
identication of meaningful clusters. Sometimes the density of the graph doesn't fully
capture semantic closeness: for example, two resources might not be located in
a very connected area of the graph, but if there is an edge in the graph relating
them, and if this edge expresses a strong semantic link (e.g. owl:sameAs ), the
two resources should be assigned to the same cluster. Furthermore, for the same
dataset, di erent users might have di erent points of view and be interested in
distinct properties. To capture this need, the clustering should take into account
these properties as semantic criteria.</p>
      <p>In our approach, user preferences are captured by mapping each of them
into one or several graph transformation primitives. We consider that there are
mainly two kinds of preferences a user might want to express. The rst one is
that two resources related by a given property should belong to the same theme.
The second one is that a set of resources having the same value for a given
property should belong to the same theme.</p>
      <p>Grouping Two Resources According to a Property. Some properties express a
strong semantic link that should be used as a clustering criteria. For example,
resources linked by the owl:sameAs property should obviously be assigned to
the same cluster, and this is true for any user in any context. Besides, some
users may wish to give a property more importance than other users. For
example, if we consider a dataset containing information about scientists in di erent
research domains, a given user might consider that the resources Student and
Scientist related by the dbo:doctoralAdvisor property should be grouped in the
same cluster. This kind of preference is taken into account by merging the two
resources.</p>
      <p>Grouping a Set of Resources According to a Property. Resources that should be
assigned to the same theme are not always linked by a property; the semantic
closeness between them can be expressed by the values of some shared property.
In other words, a set of resources having the same value for a property p should
be assigned to the same cluster. For instance, the user could state that scientists
having the same value for the dbo: eld property should be in the same cluster,
thus ensuring that scientists of the same research domain are grouped together.
This kind of preference is taken into account by creating in the graph a highly
connected area containing the speci ed resources. If we consider the set R of
resources ri having the same value for a property p, then an edge (ri; rj ) will be
added for each pair (ri; rj ) of resources such that ri and rj are in R, unless the
edge already exists in the graph.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Clustering Algorithm</title>
      <p>The clustering algorithm at the core of our approach has to ful ll a set of
requirements, the rst of which is exploiting the density of the graph to enable the
identi cation of clusters corresponding to highly connected areas of the graph.
The second requirement is that the algorithm should not require the number of
clusters as a parameter, as this information cannot be known prior to clustering
in our context. Finally, resulting clusters provided by the algorithm should not
necessarily be disjoint, as it is possible that two distinct resources in our initial
graph belong to two di erent themes.</p>
      <p>
        We have chosen the algorithm proposed by [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and initially used in the domain
of bioinformatics. It is a density-based algorithm producing possibly overlapping
clusters.
      </p>
      <p>The algorithm operates in three steps. First (i), it computes the weights of
each node in the graph using the concept of k-core. Consider that the degree of
a node is the number of his adjacent nodes. A k-core is a graph in which the
minimal node degree is k. The weight of a node Si is computed based on the
highest possible k-core value in the subgraph composed of Si and its adjacent
nodes; once the weights have been computed, (ii) the nodes are explored in a
descending order of their weights; each node Si will initiate a cluster, and for
each adjacent node Sj such as the di erence between the weights of Si and Sj
is below a threshold t is assigned to the same cluster as Si; nally, (iii) once all
the nodes have been explored, the algorithm enriches the clustering by checking
all the adjacent nodes for a given cluster; if for a node Si in a cluster Ci, the
subgraph composed of Si and its adjacent nodes is highly connected, then all
the adjacent nodes of Si will also be added to Ci. This will enable nodes to be
part of more than one clusters.
5</p>
    </sec>
    <sec id="sec-5">
      <title>Labels Extraction</title>
      <p>Goal of this step is to provide the user a view of the cluster content by extracting
a set of relevant labels that describe the theme. The set of labels is extracted
from the names of RDF resources is composed by the top-k keywords having the
high weight in the cluster Ci. The weight wij of the keyword j (noted keywordj )
in the cluster Ci is computed according to the degree of the node j (noted nodej ).
We note that keywordj appears in the name of nodej . We give an example in
Figure 4, where selected theme represents a set of researchers workings in the
eld of physics. The top-1 labels extracted using our approach is "Physics" as
we can see in the name of the sub window of the gure. This label re ects the
semantic content of the cluster. We can add more labels by increasing the value
of k.</p>
      <p>
        This approach can be extended to use more characteristics to calculate the
weight of keyword by combining the degree of the node with the frequency of the
keyword in the cluster. Castano et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] use the most frequent keyword
combining with the most frequent type of entities in the cluster. Another alternative
would be to use an adaptation of the tf-idf function to determinate the weight
wij. In this way, the relevant of the keywordj is proportional to its frequency of
the keyword in the cluster Ci and its scarcity in other clusters Ck with k 6= i.
6
      </p>
    </sec>
    <sec id="sec-6">
      <title>Our System</title>
      <p>We have implemented a tool to support our approach for theme identi cation.
The system requires two types of parameters: (i) clustering parameters, used to
specify thresholds for assigning a node to a cluster, and (ii) semantic parameters,
used to capture user preferences.</p>
      <p>To illustrate the way our tool is used for theme identi cation, consider the
following example of an RDF dataset extracted from DBPedia (see Figure 3).
This dataset contains resources describing scientists working in di erent domains
with their organizations and their countries. Assume that the user wants to
identify themes in the input graph, and would like scientists from the same
domain to be assigned to the same cluster. As the research domain is represented
by the dbo: eld property in our example, the user will indicate that two resources
having the same value for this property should be assigned to the same cluster.
He can repeat the clustering process either on the initial graph by adding new
semantic parameters, or on a cluster obtained in previous iterations in order to
get further details.</p>
      <p>
        According to the preference set by the user, scientists of the same domain
will be assigned to the same cluster. But it may happen that this property
is not de ned for some of the scientists in the dataset, and the user would
therefore like to use another semantic criteria. For example, he could state that
scientists related by the dbo:doctoralAdvisor property should be assigned to the
same cluster.
Theme identi cation approaches have been proposed for text documents [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] or
for other types of data published on the web, e.g. social networks [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], youtube
documents [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and DBpedia [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The goal of these approaches is to facilitate the
search process and the navigation into the dataset. All of them use clustering
techniques. Unlike our approach, they do not consider RDF datasets except the
one described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Furthermore, they rely on text comparison to compute the
distance between documents. This distance is used as the similarity measure for
the clustering algorithm.
      </p>
      <p>
        Despite the increasing amount of RDF(S)/OWL datasets available online,
the problem of discovering themes have received little attention. Some works
have focused on improving the quality of data by grouping resources to detect
concepts and induce new classes or re ne existing one [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ].
      </p>
      <p>
        The closest work to ours is an approach for topic identi cation presented in
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It exploits a graph generated from an input RDF dataset, by adding new
edges between resources that have an important number of similar terms in their
labels. A clustering algorithm is then applied to identify regions that are highly
connected in the graph, which represent the topics. Similarly to our approach,
this work is based on a clustering algorithm, but focuses only on identifying
highly connected areas while we combine the density-based clustering process
with semantic criteria capturing user preferences.
8
      </p>
    </sec>
    <sec id="sec-7">
      <title>Conclusions</title>
      <p>
        In this paper, we have proposed an approach for theme identi cation in RDF
datasets. It combines a density-based clustering algorithm and semantic criteria
capturing user preferences. Our approach comprises three stages: (1)
preprocessing and capturing user preferences, (2) density-based clustering to form the
clusters and (3) extraction of labels to describe the semantic of the cluster.
Preprocessing consists mainly in simplifying the graph and removing the
information which is not necessary for the clustering algorithm. Users' preferences are
captured by mapping them into graph transformation primitives. Our approach
di ers from existing ones such as [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] in that it combines structural and semantic
criteria for graph clustering. We have implemented a system for theme identi
cation. Future works include the extension of the approach by improving label
identi cation and providing the user with a summary of the clusters' content to
describe its semantics. We are currently experimenting the use of our system on
di erent RDF datasets in order to evaluate the precision of the clustering and
the performances of the system.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>G.</given-names>
            <surname>Bader</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Hogue</surname>
          </string-name>
          .
          <article-title>An automated method for nding molecular complexes in large protein interaction networks</article-title>
          .
          <source>BMC bioinformatics</source>
          ,
          <volume>27</volume>
          :1{
          <fpage>27</fpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>S.</given-names>
            <surname>Castano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ferrara</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Montanelli</surname>
          </string-name>
          .
          <article-title>Thematic clustering and exploration of linked data</article-title>
          .
          <source>Search Computing</source>
          , pages
          <volume>157</volume>
          {
          <fpage>175</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>S.</given-names>
            <surname>Castano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ferrara</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Montanelli</surname>
          </string-name>
          .
          <article-title>Mining topic clouds from social data</article-title>
          .
          <source>In Proceedings of the Fifth International Conference on Management of Emergent Digital EcoSystems (MEDES '13)</source>
          , pages
          <fpage>108</fpage>
          {
          <fpage>112</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>K.</given-names>
            <surname>Christodoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. W.</given-names>
            <surname>Paton</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A. A. A.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          .
          <article-title>Structure inference for linked data sources using clustering</article-title>
          .
          <source>In Proceedings of the Joint EDBT/ICDT 2013 Workshops on - EDBT '13, page 60</source>
          . ACM Press,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>N.</given-names>
            <surname>Fanizzi</surname>
          </string-name>
          , C. DAmato, and
          <string-name>
            <given-names>F.</given-names>
            <surname>Esposito</surname>
          </string-name>
          .
          <article-title>Metric-based stochastic conceptual clustering for ontologies</article-title>
          .
          <source>Information Systems</source>
          ,
          <volume>34</volume>
          (
          <issue>8</issue>
          ):
          <volume>792</volume>
          {
          <fpage>806</fpage>
          ,
          <string-name>
            <surname>Dec</surname>
          </string-name>
          .
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>U.</given-names>
            <surname>Gargi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mirrokni</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Yoon</surname>
          </string-name>
          .
          <article-title>Large-Scale Community Detection on YouTube for Topic Discovery and Exploration</article-title>
          . ICWSM, pages
          <volume>486</volume>
          {
          <fpage>489</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>R.</given-names>
            <surname>Mirizzi</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Ragone</surname>
          </string-name>
          .
          <article-title>Semantic wonder cloud: exploratory search in DBpedia</article-title>
          .
          <source>In ICWE 2nd Int. Workshop on Semantic Web Information Management (SWIM</source>
          <year>2010</year>
          ), pages
          <fpage>138</fpage>
          {
          <fpage>149</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>H. Shahsavand</given-names>
            <surname>Baghdadi</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Ranaivo-Malancon</surname>
          </string-name>
          .
          <article-title>An Automatic Topic Identi - cation Algorithm</article-title>
          .
          <source>Journal of Computer Science</source>
          ,
          <volume>7</volume>
          (
          <issue>9</issue>
          ):
          <volume>1363</volume>
          {
          <fpage>1367</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>