Ranking, Aggregation, and Reachability in
                Faceted Search with SemFacet?

            Evgeny Kharlamov1 Luca Giacomelli2 Evgeny Sherkhonov1
             Bernardo Cuenca Grau1 Egor V. Kostylev1 Ian Horrocks1
              1                              2
                  University of Oxford, UK       Sapienza University of Rome, Italy


1     Introduction
Motivation. Faceted search is a prominent search paradigm that became the standard in
many Web applications including online shopping and real-estate portals, where users
can progressively narrow down the search results by applying filters, called facets [9]
which are organised in faceted interfaces. Faceted search has also been proposed in the
Semantic Web context for exploring and querying RDF graphs and a number of RDF-
based faceted search systems have been developed (see [3,7,4,5,2,6] for an overview).
    One of the main challenges that hampers usability of faceted search systems is in-
formation overload [9]: when the size of the faceted interface becomes comparable to
the size of data over which the search is performed. The overload is already a challenge
in the case of the classical faceted search and it becomes even stronger when faceted
search is applied to RDF data. Indeed, observe that in both classical and RDF settings
the search is over annotated entities, at the same time, in the latter case the number of
annotations and possible values is typically much larger: any predicate occurring in an
RDF data set can give a facet during a search, and any entity or value that it points to in
the RDF data can become a value in this facet. At the same time, in the classical case
the list of possible facets is typically predefined and controlled.
    Moreover, differently from the classical case, in the RDF settings entities are also
interconnected. Thus, in an RDF driven faceted interface one has to nest facets in order
to reflect this interconnections. In Figure 1, left, one can see a possible way to nest
facets which is implemented in our SemFacet system. This not only raises a non-trivial
problem of how to arrange nested facets within an interface in an intuitive and user
oriented way, but also leads to an unavoidable overload of the interface with various
paths of nested facets. Thus, faceted navigation over RDF requires from users to be
experts in the structure of the underlying RDF graph in order to be able to find the
required facet which can be deeply nested.
    In order to see why it might be hard to find a required nested facet, consider in
Figure 2 an example RDF data. Assume that one is looking for a smartphone with
the price within £500–£900 and the processor that is manufactured by a company with
headquarters in Asia. One can see in Figure 2 that Samsung S8 is such a smartphone. At
the same time, finding this smartphone via a faceted interface requires one to traverse
the data via 4 nested facets (Figure 1, left).
SemFacet System. In our RDF-based faceted search system SemFacet we propose to
address the information overload by
    – ranking facets and their values and thus offering to users top-k most prominent
      facets and values;
?
    This work was supported by the Royal Society under a University Research Fellowship and
    the EPSRC under an IAA award and the projects DBOnto, MaSI3 , ED3 , and VADA.
      phone                             keywords                                    Search                           phone                                                                                Search

              type                                 Samsung Galaxy S8                                                         type                                     Exycon 8 Processor
                                                   Galaxy S8 is a phone that offers an exceptional                                                      ANY           Galaxy S8 is built around the new Exynos 8895,
       SmartPhone                                  experience for any user. The large screen is a real                SmartPhone      (8.270)                         featuring new custom CPU cores, new Mali GPU, and
       Laptop                                      turning point in flagship phone design and should usher            Laptop                                          with the whole thing fabricated via a cutting-edge 10nm
                                                                                                                                      (5.850)
                                                   in the end of large bezels, and the camera and slick                                                               process intended to maximize performance and power
                                                   performance work brilliantly under the finger.                                                                     efficiency.
          hasPart                                                                                      ANY               hasPart
       Processor                                                                                                      Processor       (2.490)                   refocusing                     refocussed answer

              producedBy                                                                                                 producedBy                                                   Search Task:
         ANY                                                                                                            ANY                                                           Show me processors of smartphones
                                                                                                                                           (1.800)
                                                                                  answer                                                                                               ▪   whose processor
                      withHQ                                                                                                     withHQ                                                ▪   is manufactured by a company
               ANY                                                                                                           ANY                                                       ▪   with headquarters in Asia
                                                                                                                                                (630)
                                                                                                                                                          Reachable facets             ▪   and whose max price among
                       inCountry                                                                                             San Diego            (8)
                                                                                                                                                                                           providers is £500-£900
                                                                                                                             Suwon                (5)
                   ANY
                         inContinent               facet predicate                                                           enter facet                                                          inContinent
                                                                       Search Task:
                        ANY                                                                                             price        max                      aggregate facets                     Asia            (1)
                                                                       Find smartphones
                        Asia                        facet values         ▪   whose processor                                                                                                       N. America      (3)

                                                                         ▪   is manufactured by a                0       500         900         1.600
                        North America
                                                                             company
              price                                                      ▪   with headquarters in Asia
                                                                         ▪   and whose price is within
                                                                             £500-£900
  0      500           900     1.600


Fig. 1. Left: SemFacet with deep nesting over RDF data; Right: SemFacet enhanced with rank-
ing, aggregation, and reachability

 – minimising the number of values within a facet by first grouping them according to
   the corresponding entities and then aggregating them with the standard aggregate
   functions max, min, count, sum, avg;
 – shortcutting paths of nested facets with the help of a reachability operator.
    Observe in Figure 1, right, the SemFacet interface where all these three features
are incorporated. The facets and values are now ranked and only the top 10 of them
are displayed. The users can choose whether to look for a maximal price of entities by
activating aggregate functions within facets with numerical values. In the screenshot
the user selected max and thus the system is searching for smartphones whose maximal
price across providers is within the range £500–£900. Finally, the users can enter the
name of a desired predicate instead of choosing a facet value in order to ask the system
to find a shortcut (a reachable facet). In the screenshot the user entered inContinent
instead of selecting SanDiego or Suwon.
Demo Overview. The goal of the demo is to show the attendees how our novel features,
that is, ranking, aggregation, and shortcutting, make hard faceted search—over RDF
datasets that are highly interconnected and with many data values—easier. In order to
experience the impact of these features the attendees will be able to explore the demo
dataset using two versions of our SemFacet systems: with and without the features.

2 SemFacet System
We first give an overview of SemFacet with the focus on its novel components and then
present technical details behind ranking, aggregation, and shortcuts.
System Overview. SemFacet is implemented in Java and available under an academic
license [1]. In Figure 2 there is the architecture of SemFacet where the arrows denote
the data flow between the systems’ components.
    On the client side, SemFacet has an HTML 5 based GUI consisting of three main
parts: a free text search box for keywords, a hierarchically organised faceted interface,
and a scrollable panel containing snippet-shaped answers. User keywords are sent by
the client to the server, evaluated and this gives the initial faceted interface. User se-
lections in the faceted interface are compiled into a SPARQL query using the SPARQL
query constructor and then sent to the server for evaluation. The snippet composer and
facet composer receive information about facets and answers that should be displayed
to the user and update the currently displayed interface and query answers. The sys-
tem updates the faceted interface incrementally: only the parts of the interface that are
affected by users’ actions are updated, which allows for a significantly faster response


                                                                                                             2
4
          800             900           290                       270                                                                                                                     type
                                                                                                                                                                                    SmartPhone    (8.270)
                                                                                                                                                                                                                     Exycon 8 Processor
                                                                                                                                                                                                                     Galaxy S8 is built around the new Exynos 8895,
                                                                                                                                                                                                                     featuring new custom CPU cores, new Mali GPU, and
                                                                                                                                                                                     Laptop                          with the whole thing fabricated via a cutting-edge 10nm
                                                                                                                                                                                                  (5.850)            process intended to maximize performance and power
                                                                                                        phone                         phone                                                                          efficiency.
       price    price                          price      price                                                                                                                         hasPart
                                                                                                                                                                                    Processor     (2.490)


                        hasPart                        producedBy                     withHQ             inCountry
     :SamsungS8                   :Exynos                           :Samsung                   :Suwon                :S .Korea
                                                                                                                                                                           Facet                       SPARQL Q.            Snippet
                                                                                                                                                                         Composer                      Constructor         Composer
        type                          type                              type                       inContinent                                                                                                                                                Client

                                                                                                                                               Search                                                                        Snippet                        Server
      Smartphone                  Processor                         Company                    :Asia             :NorthAmerica                                       Facet Generator
                                                                                                                                               Engine                                                                       Generator

        type                            type                                   type                                  inContinent                             Aggregation            Shortcuts
                                                                                                                                                               Handler               Handler
                        hasPart                               producedBy                       withHQ                   inCountry
     :HP Elite X3                     :Snapdragon                               :Qualcomm               :SanDiego                   :USA                       Ranking               Range
                                                                                                                                                               Handler               Handler
       price                  price      price
                                                                                                                                                                                   RDF Data, Rules
          730           280                    300                                                                                            RDFox: Storage and Query Execution


                            Fig. 2. Example RDF graph about products
                          Fig. 2. Left: Example RDF graph about products; Right: Architecture of SemFacet
Let O = [F1 , . . . , Fn ] be a possible ordering of S. Our function f computes the
score of O time.
           by combiningThe user     is ablethree
                              the following   to do   refocusing.
                                                  characteristics of O:This feature of SemFacet can be observed in
               Figure 1, right, where the user can click on a box in the hasPart facet to change the
(1) selectivity of facets in O,
           answers
(2) diversity of facets from    phones to their parts, that is, from Samsung S8 to its processor Exycon 8.
                         in O, and
(3) nesting depthOnof the
                       facetsserver
                              in O. side, the system relies on an in-memory triple store RDFox [8] to
           store
In particular,     theweinverted
               for (1)                 index,
                            prefer those           input
                                          facets that    are RDF     data, query
                                                              more selective,            answers, and all necessary auxiliary
                                                                               i.e., ticking
values in the  facet narrow such
           information         down the    search result more rapidly
                                      as materialisation                    than doing
                                                                   rules which         wesodiscuss later in this section. One of
in other facets. For (2), we prefer those facets that lead to results which are
not coveredthe   server other
             by selecting   components          is an
                                  facets. Finally,        inverted
                                                     for (3), we preferindex     based
                                                                         facets that   allowfull-text search engine; in order to
           ensure
deeper nesting       a betterexplore
                thus allowing      integration
                                         the graphbetween
                                                       structure offull-text     anddata.
                                                                    the underlying      faceted search and thus achieve good
We now define
           efficiency     of SemFacet
                 the functions    correspondingweeach      of (1)-(3) andour
                                                      implemented           thenowncombine
                                                                                        search engine. Another backend com-
them into the final scoring function f .
    Let F denote a facet, A the set of current answers, r(F ) ✓ A the setentities
           ponent     is snippet     composer          that   for given     answer         of      retrieves their textual descrip-
           tions, by
answers obtained    images,
                       selectingand    links.
                                  any value      The
                                              in F  , andnext
                                                           d > 0component          is facet generator that constructs faceted
                                                                  a predefined threshold
parameter interfaces
           for nesting depth.   To account to
                           in response        for user
                                                  (1), weactions.
                                                            compute the
                                                                      Thisselectivity
                                                                              componentscore     is backed by four handlers: ag-
of F given A as follows:
            gregation, shortcuts, ranking, and ranges. The range handler computes and stores left
            and right boundssel(F, A)for
                                      = 1 the
                                            logrange     sliders for numerical facet values, like the price-slider
                                               |A| |r(F )|.

            in Figure 1, that correspond to the lower and upper bounds of possible values for this
In particular, the less search results is obtained by applying F , the higher the
selectivity property    name
            score of F is.       in theforunderlying
                           To account      (2), we consider RDF  data. of the Set-
                                                             a variation
Cover ranking [5] to compute the overlap score of Fi 2 O w.r.t. O as follows:
            Ranking facets. Whenever          n
                                                       SemFacetj
                                                                       updates the interface it should decide in which
                                     1       X                [
            order    toi ,present
              overlap(F   O) =       relevant    | facets.
                                                   r(Fi ) \ This r(F  is mdone
                                                                           )|     in two steps. First, we compute the set
                                n ⇥ |r(Fi )| j=1
            S of all relevant        facets that should     m=1,m6be
                                                                  =i displayed in the same level of nesting. Assume

In particular,     S has
            thata high      n facets.
                        overlap  score of Then,
                                          facet Fi for
                                                     means each
                                                             that of   n! possible
                                                                   selecting            orderings of S we find the optimal
                                                                              values in it
            order
leads to results thatusing
                      are notour   scoring
                              present           function
                                       in the first           f and
                                                     highly ranked      SemFacet
                                                                     facets.           displays the facets from S (or some
                                                                             Thus, highly
positioned offacets (according
                them)           to the overlap
                          according     to thisscore)
                                                    order.provide
                                                             Ourmore      diversefresults.
                                                                    function        computes the score of a given ordering
To account for (3), we compute the depth score depth(F, d) of F given d as the
            of facet by combining the following three characteristics of the order: (1) selectivity
            of its facets, (2) diversity of its facets, and (3) nesting depth of its facets. In particular,
            for (1) we prefer those facets that are more selective, i.e., ticking values in the facet
            narrow down the search result more rapidly than doing so in other facets. For (2), we
            prefer those facets that lead to results which are not covered by selecting other facets.
            Finally, for (3), we prefer facets that allow deeper nesting thus allowing explore the
            graph structure of the underlying data. We now define the functions corresponding each
            of (1)-(3) and then combine them into the final scoring function f .
               Ranking facet values. Besides ranking of facets, it is important to rank facet values
               as well. We adopt the count-based ranking (see also [10]): facet values that lead to a
               larger number of results are ranked higher. Although the computation of counts is con-
               ceptually trivial, the main challenge is in their update. Indeed, the integers associated
               to each facet value in the interface must be updated every time the user interacts with
               the faceted interface without affecting the performance of the system. Additional chal-
               lenges come from (i) graph structure of the data and (ii) interplay of conjunctive and
               disjunctive interpretation of facet values. In our experience, implementing the straight-

                                                                                                                            3
forward approach to updating counts leads to a significant increase of the user interface
response time. To mitigate this problem, we adopted a multi-threading solution: each
thread receives a set of facet values for which counts need to be updated and, addition-
ally, the load among the threads is balanced. For instance, we avoid the situation when
one thread is busy with updating counts for values with a high number of results only,
while another gets away with updating counts for values with a low number of results.
This approach led to a significant response time decrease.
Aggregation. Recall that every interface update performed by the user (i.e., refocusing,
selection of a facet or a facet value) results in formulating a corresponding SPARQL
query on the fly that is then issued to RDFox. Then the interface is updated (i.e., with
search results and available facets at this point) depending on the result of this query.
When aggregate facets are considered, it is possible to follow the same approach and
formulate aggregate SPARQL queries. However, we decided to do materialisation of
aggregate information at loading time instead since (1) RDFox, our back-end system,
supports efficient materialisation of aggregate information and, more importantly, (2)
non-aggregate SPARQL queries are usually faster to answer which is essential for re-
sponsive user interface updates.
Reachability. SemFacet provides the shortcut functionality described in Section 1. In
the user interface, SemFacet offers a search box within each facet that allows users to
search for reachable facets. As the user has typed in a facet name F in the box, the
system checks if such a facet is reachable from the current facet. For this we perform
breadth-first search to find all reachable nodes with an outgoing property F and we
store corresponding witnessing paths. These paths are important in constructing the
corresponding SPARQL query for fetching possible facet values for F and for further
facet navigation. For a faster response time, we predefine a parameter B and check
facet reachability up to length B instead. In our example, in Figure 1 (right) the user
can search for the inContinent facet, which in this case is reachable within 2 steps, and
then select ‘Asia’, thus selecting processors produced by an asian company.

References
 1. SemFacet Project Page. http://www.cs.ox.ac.uk/isg/tools/SemFacet/
 2. Arenas, M., Cuenca Grau, B., Kharlamov, E., Marciuska, S., Zheleznyakov, D.: Towards
    Semantic Faceted Search. In: Proc. of WWW (Companion Volume). pp. 219–220 (2014)
 3. Arenas, M., Cuenca Grau, B., Kharlamov, E., Marciuška, Š., Zheleznyakov, D.: Faceted
    search over RDF-based knowledge graphs. J. Web Semantics 37, 55–74 (2016)
 4. Arenas, M., Cuenca Grau, B., Kharlamov, E., Marciuška, Š., Zheleznyakov, D.: Enabling
    Faceted Search over OWL 2 with SemFacet. In: Proc. of OWLED. pp. 121–132 (2014)
 5. Arenas, M., Cuenca Grau, B., Kharlamov, E., Marciuška, Š., Zheleznyakov, D., Jiménez-
    Ruiz, E.: SemFacet: Semantic Faceted Search over Yago. In: Proc. of WWW (Companion
    Volume). pp. 123–126 (2014)
 6. Cuenca Grau, B., Kharlamov, E., Marciuška, Š., Zheleznyakov, D., Zhou, Y.: Querying Life
    Science Ontologies with SemFacet. In: Proc. of SWAT4LS (2014)
 7. Grau, B.C., Kharlamov, E., Marciuska, S., Zheleznyakov, D., Arenas, M.: Semfacet: Faceted
    search over ontology enhanced knowledge graphs. In: ISWC Posters & Demos (2016)
 8. Motik, B., Nenov, Y., Piro, R., Horrocks, I., Olteanu, D.: Parallel Materialisation of Datalog
    Programs in Centralised, Main-Memory RDF Systems. In: Proc. of AAAI (2014)
 9. Tunkelang, D.: Faceted Search. Synthesis Lectures on Information Concepts, Retrieval, and
    Services, Morgan & Claypool Publishers (2009)
10. Wagner, A.J.: Faceted semantic search, technical report. Tech. rep.


                                                4