Toward Social, Structured and Semantic Search

       Raphaël Bonaque1,2 , Bogdan Cautis2,1 François Goasdoué3,1 , and Ioana
                                  Manolescu1,2
                 1
                     INRIA 2 Université Paris-Sud 3 Université de Rennes 1
       Abstract. Social content such as social network posts, tweets, news articles and
       more generally web page fragments is often structured. Such social content is
       also frequently enriched with annotations, most of which carry semantics, either
       by collaborative effort or from automatic tools. Searching for relevant informa-
       tion in this context is both a basic feature for the users and a challenging task. We
       present a data model and a preliminary approach for answering queries over such
       structured, social and semantic-rich content, taking into account all dimensions
       of the data in order to return the most meaningful results.

1   Introduction
In this work, we introduce S3, a novel data model for structured, semantic-rich content
exchanged in a social context. The data model builds on widespread Web standards for
structured documents (XML/JSON) and semantic Web data (RDF), while it also ex-
tends well-established models for data shared and queried in social applications. Based
on this data model, we revisit top-k keyword search, well studied in the social context
for unstructured (flat) data items, to also take into account document structure and se-
mantics. The implementation and evaluation of a query answering algorithm for this
framework is currently ongoing; more details are provided in our technical report [1].
This paper is organised as follows: we give a presentation of the data model and its
various components (Sec.2), then a query model (Sec.3) and finally a quick overview of
a query evaluation algorithm (Sec.4), and we conclude with the related works (Sec.5).
2   Data model
In this section, we describe our model integrating the social, structured and semantic-
rich content we are interested in into a single graph. First, we define RDF graphs and
use one as an instance of our problem, then we show how social networks are em-
bedded in such an instance. Next, structured documents are considered along with the
links between social, semantics and structure, and finally we introduce some tools to
manipulate such an instance.
    We assume given a set U of Uniform Resource Identifiers (URIs, in short), as de-
fined by the standard [8], and a set of literals (constants) denoted L; U and L are disjoint.
Keywords We denote by K = U ∪ stem(L) the set of all possible keywords, i.e., we
consider that stemming has been applied on literals prior to their inclusion in K. In this
paper, literals are systematically shown between quotes, e.g., “literal”.
2.1 RDF
We adopt a standard model for semantic Web data, as follows. An RDF graph (or graph,
in short) is a set of triples of the form s p o , stating that its subject s has the property
            Constructor               Triple Relational notation under Open Word Assumption
            Subclass constraint      s ≺sc o                       s⊆o
            Subproperty constraint   s ≺sp o                       s⊆o
            Domain typing constraint s ←-d o                  Πdomain (s) ⊆ o
            Range typing constraint s ,→r o                    Πrange (s) ⊆ o
                                     Fig. 1. RDFS statements.
p and the value of that property is the object o. A triple is well-formed whenever its
subject, s, belongs to U , its property, p, belongs to U and its object, o, belongs to K. In
what follows, we only consider well-formed triples.
RDF Schema A valuable feature of RDF is RDF Schema (RDFS), which allows en-
hancing the resource descriptions provided by RDF graphs. An RDFS declares semantic
constraints between the classes and the properties used in these graphs, through the use
of RDF built-in properties, as summarized in Figure 1. In this figure, s, o ∈ U , while
domain and range denote respectively the first and second attribute of every property.
    Within the RDF standard [7], the built-in property rdf:type, denoted a from now
on, is used to describe the classes to which a resource belongs, i.e., resource typing.
Saturation RDF graphs also feature implicit triples, notably when an RDFS is avail-
able. For instance, if x a c1 and c1 ≺sc c2 , then the implicit triple x ≺sc c2 holds.
A graph in which all implicit triples have been made explicit is termed the saturation
(or closure) of the RDF graph. In the following, we assume RDF graphs are saturated.
Weighted RDF graph For the needs of our model, we introduce weighted triples of
the form (s, p, o, w) such that (s, p, o) is an RDF triple, and w ∈ [0, 1] is the weight of
the (s, p, o) triple. In the following, the weight of any RDF triple is assumed 1 unless
specified otherwise. We define the saturation of a weighted RDF graph as the saturation
derived from certain facts (triples with weight 1) only, using the standard RDF [7].
S3 namespace and instance To model semantically rich data, we rely on a small
special-purpose set of RDF classes and properties, identified in this paper by the S3
namespace prefix. Relationships between documents, users, comments, keywords, so-
cial connections etc. naturally lead to a graph structure. We encode them as weighted
RDF triples in a special graph called an S3 instance, denoted I, as explained next.
2.2 Social network
We distinguish a non-empty set Ω ⊂ U of social network users and for each u ∈ Ω we
add u a S3:user 1 to I where S3:user is a dedicated RDF class.
   We introduce the property S3:social generalizing the various relationships which
may connect users in the social network. For every relationship from a user u1 to a user
u2 having the weight w, we have: u1 S3:social u2 w ∈ I.
   In our model, the higher the weight, the closer we consider the two users are.
2.3 Documents
We consider that content is created under the form of structured documents e.g., XML,
JSON, etc. A document is an unranked, ordered tree of nodes. Let N be a set of node
names (e.g., legal XML or JSON node names), and  be a special symbol denoting the
empty node name (allowed in JSON for instance). Any node has an URI.
    We denote by D the subset of U consisting of node URIs. Each node has a name
from N ∪ {}, and a content: a set of keywords from K. We assume every text node
has been broken in words, stop words have been removed, and the remaining words are
              URI and structure           Name                       Content
            URI 0                                      ∅
                    URI 0.0              text           ∅
                          URI 0.0.0                    {"troop","attack", "Crimea"}
                          URI 0.0.1                    {"#Ukraine"}
                    URI 0.1              created_at     {"Mon, 10 Mar 2014 16:43:29 +0000"}
                    URI 0.2              from_user_id   {2314512344}
                                      Fig. 2. Sample S3 document.
stemmed to obtain our version of the node’s text content. For example, in Figure 2, the
original text of URI 0.0.0 could be “Troops are attacking Crimea”. For simplicity, we
consider the URI of a document to be that of its root node.
    We term any subtree rooted at a node in document d a fragment of d. The set of
fragments (nodes) of a document d is denoted F rag(d). We may use f to refer inter-
changeably to a fragment f and to its URI.
Document-derived triples We capture the relationships between documents, fragments
and keywords through a set of RDF statements using S3-specific properties.
    We introduce a dedicated RDF class S3:doc corresponding to the documents, and:
  – For each d ∈ D, we have d a S3:doc 1 ∈ I;
  – For each document d ∈ D and each fragment n of d: n S3:partOf d 1 ∈ I;
  – For each node n and keyword k in the content of n: n S3:contains k 1 ∈ I;
  – For each node n with name m: n S3:nodeName m 1 ∈ I.
Example 1. Based on the document in Figure 2, the following triples are in I:
URI 0.1 S3:partOf URI 0 1, URI 0.0.0 S3:contains ”Crimea” 1, and
URI 0.0 S3:nodeName text 1.

Fragment position We assume available a function pos(d, f ) returning the position of
fragment f within document d, as the sequence of nodes from node d to node f .
2.4 Relations between structure, semantics, and users
The multiple facets of our data model are interconnected, reflecting the relationships
between users, content, and semantics. We model these relationships by using a set of
S3 classes and properties, as we outline below.
Connections between documents and semantics Document semantics can be charac-
terized by I quadruples relying on a class we call hereafter S3:relatedTo. Intuitively,
S3:relatedTo accounts for the multiple ways in which a node of a document can be
related to a keyword by a user. S3:relatedTo is thus used to make statements about the
semantics of a document (or node), as illustrated by the following example, where user
u4 states that the fragment URI 0.0.0 is somehow related to Russia:
                     a a S3:relatedTo 1         a S3:hasSubject URI 0.0.0 1
               a S3:hasKeyword ”Russia” 1          a S3:hasAuthor u4 1
The above four triples are part of I; a is an ad-hoc resource whose role is to encapsulate
the S3:relatedTo statement.
    The property S3:hasSubject can take values either from D or from other instances
of S3:relatedTo. The latter allows to express higher-level annotations, i.e., make state-
ments about the semantics of other statements.
We define by T the set of resources of type S3:relatedTo and we refer to them as tags.
                                                       S3:postedBy, 1                   S3:social, 0.5

                                            URI 0                           u0                                u3
                                                       S3:postedBy, 1                   S3:social, 0.3
                         S3:partOf, 1                                                                                     S3:social, 0.5
                                                       S3:partOf, 1
                                                                                                     S3:social, 0.7
                                                                     S3:commentsOn, 1                 S3:postedBy, 1

                            URI 0.0       S3:partOf, 1     URI 0.1                          URI 1                            u1
                                                                     S3:commentsOn, 1                    S3:postedBy, 1
                                                             k2
    S3:contains, 1 S3:partOf, 1                                                         S3:contains, 1
                                                   S3:hasKeyword, 1
        k0                              S3:hasSubject, 1        S3:hasAuthor, 1                               k1
                           URI 0.0.0                         a0                              u2
                                        S3:hasSubject, 1              S3:hasAuthor, 1

                                                    Fig. 3. Sample graph I.
     We stress that we do not aim at a single, “standard”, interpretation of S3:relatedTo,
since there are already many concrete instantiations thereof. For instance, a natural
language processing (NLP) tool may recognize a text fragment as related to a person
or a topic, a user may state that a document is related to a certain topic, etc. Any such
concrete flavors of “being related to” can be used, and we consider that they are all
stated to be subclasses (in RDFS sense) of our special class S3:relatedTo.
Connections between users and documents We identify two different relationships
between users and documents, as detailed below.
  1. Users can post documents. We model this using a dedicated property S3:postedBy;
     for each user u having posted a document d, d S3:postedBy u 1 is in I.
  2. Users can also create documents that comment on other documents. We capture
     this using the dedicated property S3:commentsOn. For each user u having created
     a document c as a comment on the document d, the following hold:
                    c S3:commentsOn d 1 ∈ I and c S3:postedBy u 1 ∈ I.
     Observe that when user u posts a document c, which comments on a document d,
and possibly cites part of d, each fragment of d exactly replicated in the comment is
now part of c, and thus has a new URI.
Example 2. Assuming that URI 1, posted by u1 , is a comment on URI 0, we have:
 URI 1 S3:postedBy u1 1 ∈ I and URI 1 S3:commentsOn URI 0 1 ∈ I.

Inverse properties Observe that the relations introduced previously are interesting
to explore in both directions. For instance, it is interesting to know what comments
have been posted on a given document d, and by whom, but it is also interesting to
know “What are the comments posted by user u?”, or “What does this comment refer
to?”. To support bidirectional manipulation of such relationship, we introduce in our
model a set of inverse properties for S3:postedBy, S3:commentsOn, S3:hasSubject
and S3:hasAuthor, denoted respectively:
            S3:postedBy, S3:commentsOn, S3:hasSubject, S3:hasAuthor
with the straightforward semantics: s p̄ o 1 ∈ I iff o p s 1 ∈ I, for p ∈ {S3:postedBy,
S3:commentsOn, S3:hasSubject, S3:hasAuthor}.
2.5 Manipulating the S3 instance I
As we will show, in our framework, user queries are expressed using search keywords.
Based on the semantics encoded in the RDF schema, we define for every keyword
k ∈ K a simple semantic extension as follows:
Definition 1 (Keyword extension). Given a S3 instance I and a keyword k ∈ K, the
extension of k, denoted Ext(k), is defined as follows:
  – k ∈ Ext(k);
  – for any weighted triple b a k 1, b ≺sc k 1 or b ≺sp k 1 in I, we have b ∈ Ext(k).
     For example, given the keyword fish and assuming that tuna ≺sc fish 1 holds in I,
we have tuna ∈ Ext(fish). Note that the extension is not a generalization, in the sense
that it does not introduce any loss of precision: whenever we include a keyword (URI)
k2 in the extension of another keyword k1 , it follows from the RDFS in I that k2 is an
instance of or a specialization (particular case) of k1 .
Social components A notion of distance is natural in a social context such as the one
we consider. We consider that a subset Inet of I’s edges encapsulate quantitative in-
formation on the strength of the links between users, documents and tags. The network
edges, Inet , are exactly those labeled with properties from the S3 namespace (other than
S3:partOf), and whose subject and object are either users, documents, or tags.
     For instance in Figure 3, u1 S3:social u3 0.5 and URI 0 S3:postedBy u1 1 are net-
work edges but URI 0.0 S3:contains k0 1 and URI 0.1 S3:partOf URI 0 1 are not: the
first triple’s object is neither a user, a document, or a tag but a keyword while the sec-
ond has the property S3:partOf.
     Observe that the above notion of network edges: (i) includes any edge between two
social network users (given that for any s-labeled edge between two users u1 , u2 ∈
Ω, s is a sub-class of S3:relatedTo); (ii) includes all the S3-prefixed properties we
introduced, such as commenting upon, authoring, annotating etc.; (iii) does not include
relations between document fragments. While a document node can be seen as “close”
to its children or ancestors, two distant nodes from the same document may be harder to
relate. We formalize the possible connections between document nodes by introducing:
Definition 2 (Document vertical neighbourhood). Two documents are vertical neigh-
bours if one of them is a fragment of the other. We define by neigh the set of vertical
neighbours of an URI.
In Figure 3, URI 0 and URI 0.0.0 are vertical neighbours; URI 0.0.0 and URI 0.1 are not.
     We are interested in vertical neighborhoods, because we consider that a social in-
teraction (e.g., a comment or tag) upon a document fragment carries over to both the
descendants and the ancestors of this fragment. As a consequence, we define:
Definition 3 (Social path). A social path (or simply a path) in I is a chain of network
edges such that the end of each edge and the beginning of the next one are either the
same or vertical neighbours. We may also designate a path simply by the list of nodes it
traverses, when there is no ambiguity as to the edges taken.
                        u2 S3:hasAuthor a0 1     a0 S3:hasSubject URI 0.0.0 1
     In Figure 3, u2 −−   −−−−−−−−−−−→ a0 −−−−−−−−−−−−−−−−−−→ URI 0.0.0 99K
      URI 0 S3:postedBy u0 1
URI 0 −
      −−−−−−−−−−−−−−→ u0 is an example of such a path, stepping from URI 0.0.0
to its vertical neighbour URI 0 (the dotted arrow denotes this).
     We use the notation x     y to denote the set of all social paths leading from node x
or one of its vertical neighbours to y or one of its vertical neighbours in I. Finally, |p|
denotes the number of edges along the path p.
     Given a path p, the normalized path from p is the copy of p where the edge weights
have been modified so that the sum of the weights of the edges outgoing from any source
of social edge is 1. In more details, let n be the end point of a network edge in a path
and e be the next edge in this path, the normalized weight of e for this path is:
                                           P
                     wnormalized e = we / e0 ∈out(neigh(n)) we0
where we denotes the weight of the edge e and out(X) the network edges outgoing
from a set of edges X. In the following, whenever we work with a path, we first obtain
its normalized version and manipulate it instead.
3      Querying the S3 instance I
Users can search S3 instances by means of ranked search queries; the answer consists
of the k top score fragments, according to a joint structured, social and semantic score.
Definition 4 (Query). A query is a pair of the form (u, φ) where u ∈ Ω is a user and
φ ⊂ K is a set of keywords. We call u the seeker.
    The ranked results to be computed in response to a query require the presence of a
numeric score function. More precisely, score(q, d) assigns a numerical score from R
to any document d or any fragment f thereof in the graph, for a given query q.
Definition 5 (Query answer). The top-k answer to a query q is the k highest scoring
documents or fragments for q, such that the presence of a document or fragment in a
given rank prevents the inclusion of its vertical neighbours at lower ranks in the result.
     Our approach for effective search over the structured, social and semantic graph re-
lies on two core ingredients: (i) a model of relationships between keywords and various
graph components (users, documents etc.); (ii) a score function quantifying the interest
of a fragment w.r.t. a query. We outline these two components next.
Relationships to keywords, and their sources We need to capture both the explicit and
the implicit relationships connecting a document or a resource of type S3:relatedTo to
a search keyword. The relationship can be explicit, for instance, when the document
contains the keyword, or a member of its extension Ext(k) (Definition 1). Implicit
relationships are due to a document (or the S3:relatedTo resource) being connected
through a succession of comments and tags, to a search keyword k. For instance, a doc-
ument d2 which has been stated as related to k may be a comment on another document
d1 ; in this case, d1 is related to k through d2 .
     To model these relationships, we introduce a function relation source, denoted
rel(d, k). For any document d and keyword k, rel(d, k) is a set of tuples of the form
(type, f rag, src) such that:
  – type ∈ {S3:contains, S3:relatedTo, S3:commentsOn} is the kind of relation,
  – f ∈ F rag(d) is the exact fragment of d due to which d is involved in this relation,
  – src ∈ Ω ∪ D (users or documents) records the origin of this relation between d and
      k, as we explain next.
    Concretely, the relation source tuples are built as follows. Let d be in D ∪ T , let
f ∈ F rag(d), let k ∈ K, and let src ∈ Ω ∪ D be a user or a document4 .
    1. Documents connected to the keywords of their fragments If the fragment f con-
tains a keyword k, then (S3:contains, f, d) ∈ rel(d, k). This holds even if f does not
contain k itself, but some k 0 ∈ Ext(k).
    2. Connections due to S3:relatedTo statements Whenever I comprises an annota-
tion a of a fragment f (of document d) by a source (author) src, with a specialization
of the keyword k, we record the resulting connection between d and k, as well as the
 4
     This is a slight abuse of notation (tags do not have fragments): if t ∈ T we use F rag(t) = {t}.
fact that the connection is due to the source of the annotation on f , through the tuple
(S3:relatedTo, f, src) ∈ S3:relatedTo(d, k).
     Moreover, if the annotation a itself is related to a keyword k (directly or indirectly,
due to a source src0 , and no matter what the relationship type is: we denote this by
(_, a, src0 ) ∈ rel(a, k)), and a is about the fragment f of document d, then rel(d, k)
includes the tuple (S3:relatedTo, f, src0 ). This records that a connects (relates) f and
all its ancestors, to the keyword k, and keeps the source src0 of the connection.
     3. Connections due to S3:commentsOn statements Finally, if a relation source
tuple connects a comment c on a document d to the keyword k due to some relation
source src, then rel(d, k) includes (S3:commentsOn, c, src). Intuitively, social con-
nections to a comment on d also extend to d.
Social proximity Our second ingredient for a score function is a comprehensive social
proximity measure on I. To each pair of vertices connected by at least one social path,
it associates a global measure (of all the paths betwen them).
     First, for each path we define a proximity function path_prox, returning values in
[0, 1], satisfying the following conditions:
   – path_prox([ ]) = 1, that is: the proximity is maximal on an empty path;
   – for any two paths p1 , p2 such that the end point of p1 is the start point of p2 :
                 path_prox(p1 ||p2 ) 6 min(path_prox(p1 ), path_prox(p2 ))
      where || denotes path concatenation. Again, this follows the intuition of a proximity
      measure: it can only decrease as the path gets longer.
     Based on this, we express the social proximity prox : (Ω ∪D ∪T )2 → [0, 1], which
is a measurement along every possible path:
                   prox(a, b) = ⊕path ({path_prox(p), p ∈ a          b})
where ⊕path is a function which aggregates a set of values in [0, 1] in a single scalar
value. We consider all paths and not just the shortest, because different paths traverse
different nodes, users, documents and relationships, all of which must be taken into
account when computing the score.
     Based on the rel tuples and the proximity measure introduced above, we define:
Definition 6 (Generic score). Given a document d and a query q = (u, φ), the score
of d for q is: score(d, (u, φ)) = ⊕gen ({(k, type, pos(d, f ), prox(u, src)) |k ∈ φ,
(type, f, src) ∈ rel(d, k)}), where ⊕gen aggregates four-tuples of the form (keyword,
relationship type, importance of fragment f in d, social proximity) into a value from R.
     Let us comment on how various aspects of the graph reflect in the generic score.
     First, ⊕gen aggregates over all the keywords in φ. Further, recall that tuples from
rel(d, k) account not only for k but also for all keywords k 0 ∈ Ext(k). This is how
semantics is injected into the score.
     Second, the score of d takes into account the relationships between fragments f of
d, and the keyword k (or some k 0 ∈ Ext(k)). We use pos(d, f ) as an indication of
the structural importance of the fragment within the document. Document structure is
thus taken into account both directly through pos, and indirectly, since the rel tuples
previously introduced also propagate relationships from fragments to their documents.
     Third, the score takes into account the social component of the graph, through prox.
This function accounts for the relationships between the seeker u, and the various par-
ties (users, documents and tags), denoted src, due to which f is related to k.
     We outline concrete methods for implementing the ⊕path , path_prox and ⊕gen
functions in our technical report [1].
4   Query evaluation algorithm
The core idea of our algorithm is the following. The graph is explored starting from
the seeker and moving to vertices (users, documents, or tags) at increasing distance,
looking for candidate fragments that may be part of the top-k answer. We compute
Ext(k) for each query keyword, since the members of these extensions will be used to
compute scores. For each candidate c we maintain a score interval with its lowest and
highest possible real scores, which are refined during the exploration. When a candidate
fragment is discovered, we are aware of at least one path from the seeker to it, but not of
all of them, and the candidate’s score w.r.t. the query may be significantly impacted by
a path not visited yet. During the search, candidate documents are kept sorted by their
highest possible score. The exploration ends when we are certain that no candidate
document outside the current top-k candidates may have an upper bound above the
minimum lower bound within the current top k.
    For the particular proximity and score functions we considered (based on the liter-
ature and adapted to our setting), we have the guarantee that our algorithm correctly
produces the query result (Definition 5), as we show in [1].
5   Conclusion and main related work.
To the best of our knowledge, this is the first attempt to model social media with both
structured and semantis content. Past research has considered keyword search on struc-
tured XML documents, e.g., [2]. Other studies have focused instead on social keyword-
based search, ignoring structure and semantics [4,5,9]. In these works, data items are
seen as unstructured collections of keywords (tags) assigned by users, and top-k search
ranks higher items close to the seeker. [6] studies top-k search over unstructured text
documents, exploiting a measure of semantic similarity between keywords, on which
query expansion is based in an early-termination algorithm. Such a model ignores both
social and structural features. We have previously considered querying interconnected
corpora of XML documents and RDF triples [3]; however, that work ignored any social
aspect o top-k search.
    Our work seeks to extend and integrate the aforementioned models, to support
the three dimensions (structured, semantic and social), in the most common query
paradigm: the one of keyword query.
References
1. R. Bonaque, B. Cautis, F. Goasdoué, and I. Manolescu. S4: Social, Structured and Semantic
   Search. http://pages.saclay.inria.fr/raphael.bonaque/p/9efac4ffa.
2. L. J. Chen and Y. Papakonstantinou. Supporting top-k keyword search in XML databases. In
   ICDE. IEEE, 2010.
3. F. Goasdoué, K. Karanasos, Y. Katsis, J. Leblay, I. Manolescu, and S. Zampetakis. Growing
   triples on trees: an XML-RDF hybrid model for annotated documents. VLDB Journal, 2013.
4. S. Maniu and B. Cautis. Network-aware search in social tagging applications: instance opti-
   mality versus efficiency. In CIKM, 2013.
5. R. Schenkel, T. Crecelius, M. Kacimi, S. Michel, T. Neumann, J. X. Parreira, and G. Weikum.
   Efficient top-k querying over social-tagging networks. In SIGIR, 2008.
6. M. Theobald, R. Schenkel, and G. Weikum. Efficient and self-tuning incremental query ex-
   pansion for top-k query processing. In SIGIR, 2005.
7. Resource Description Framework. http://www.w3.org/RDF.
8. Uniform Resource Identifier. http://tools.ietf.org/html/rfc3986.
9. S. A. Yahia, M. Benedikt, L. V. Lakshmanan, and J. Stoyanovich. Efficient network aware
   search in collaborative tagging sites. PVLDB, 2008.