TRT - A Tripleset Recommendation Tool

 Alexander Arturo Mera Caraballo¹, Bernardo Pereira Nunes¹, Giseli Rabello Lopes¹,
 Luiz André P. Paes Leme², Marco A. Casanova¹, Stefan Dietze³

 ¹ Department of Informatics, PUC-Rio, Rio de Janeiro/RJ – Brazil
   {acaraballo,bnunes,grlopes,casanova}@inf.puc-rio.br
 ² Computer Science Institute, Fluminense Federal University, Niterói/RJ – Brazil
   {lapaesleme}@ic.uff.br
 ³ L3S Research Center, Leibniz University Hannover, Germany
   {dietze}@l3s.de


         Abstract. According to the Linked Data principles, a tripleset should
         be interlinked with others to take advantage of existing knowledge.
         However, interlinking is a laborious task. Thus, users interlink their
         triplesets mostly with data hubs, such as DBpedia and Freebase, ignoring
         the more specific yet often even more promising triplesets. To alleviate
         this problem, this paper describes a tripleset interlinking recommendation
         tool based on link prediction techniques and evaluates the tool on a
         real-world tripleset repository.

         Key words: Linked Data, Recommender Systems, Social Networks


1      Introduction
A considerable number of triplesets, following the Linked Data principles, have
already been published in a large number of areas, ranging from geographic
to bibliographic data. This growth makes it difficult to choose which triplesets
should be interlinked with a given tripleset. Thus, users interlink their
triplesets mostly with data hubs, such as DBpedia and Freebase, ignoring the more
specific triplesets which often contain particularly useful data. Furthermore, the
metadata provided in data repositories such as the DataHub are typically not
sufficient to help users choose the most suitable triplesets to interlink with.
    To help alleviate this situation, we describe a tool for tripleset interlinking
recommendation, based on previous work by the authors [1, 2]. More precisely,
the tool addresses the tripleset recommendation problem, defined as follows:
given a tripleset t and a set of triplesets S, rank the triplesets in S based
on the probability of interlinking t with them.


2      TRT - The Tripleset Recommendation Tool

Recommendation Procedure. A tripleset t is a set of RDF triples. A resource,
identified by an RDF URI reference s, is defined in t iff s occurs as the
subject of a triple in t.
                          Table 1: Local and quasi-local indices

Type                  Name                       Equation
Local indices         Common Neighbors           $CN_{t,u} = |C_t \cap C_u|$
                      Salton                     $Salton_{t,u} = \frac{|C_t \cap C_u|}{\sqrt{C'_t \cdot C'_u}}$
                      Jaccard                    $Jaccard_{t,u} = \frac{|C_t \cap C_u|}{|C_t \cup C_u|}$
                      Sørensen                   $Sørensen_{t,u} = \frac{2 \cdot |C_t \cap C_u|}{C'_t + C'_u}$
                      Hub Promoted Index         $HPI_{t,u} = \frac{|C_t \cap C_u|}{\min\{C'_t, C'_u\}}$
                      Hub Depressed Index        $HDI_{t,u} = \frac{|C_t \cap C_u|}{\max\{C'_t, C'_u\}}$
                      Leicht-Holme-Newman        $LHN_{t,u} = \frac{|C_t \cap C_u|}{C'_t \cdot C'_u}$
                      Preferential Attachment    $PA_{t,u} = C'_t \cdot C'_u$
                      Adamic-Adar                $AA_{t,u} = \sum_{w \in C_t \cap C_u} \frac{1}{\log |C'_w|}$
                      Resource Allocation        $RA_{t,u} = \sum_{w \in C_t \cap C_u} \frac{1}{|C'_w|}$
Quasi-local indices   Local Path                 $LP_{t,u} = A^2 + \varepsilon A^3$
                      Local Random Walk          $LRW_{t,u}(s) = \frac{C'_t}{2|C|} \cdot \pi_{t,u}(s) + \frac{C'_u}{2|C|} \cdot \pi_{u,t}(s)$



    Let t and u be two triplesets. A link from t to u is a triple of the form (s, p, o),
where s is an RDF URI reference identifying a resource defined in t and o is an
RDF URI reference identifying a resource defined in u; we also say that (s, p, o)
interlinks s and o. We say that t can be interlinked with u iff it is possible to
define links from t to u. A Linked Data network is a graph G = (S, C) such that
S is a set of triplesets and C contains an edge (t, u), called a connection from t
to u, iff there is at least one link from t to u.
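
The definitions above can be illustrated with a minimal Python sketch that derives the connection set C from a collection of triplesets. It assumes that each link is stored in the tripleset that defines its subject, that connections from a tripleset to itself are ignored, and that triples are plain (s, p, o) tuples of strings; the function names and the sample data are hypothetical.

```python
def defined_resources(tripleset):
    """Resources defined in a tripleset: the subjects of its triples."""
    return {s for (s, p, o) in tripleset}


def build_connections(triplesets):
    """Return the edge set C of the Linked Data network G = (S, C).

    `triplesets` maps a tripleset name to a set of (s, p, o) tuples.
    An edge (t, u) is added when some triple stored in t has an object
    that is defined in u (assumption: links are stored in t, and t != u).
    """
    defined = {name: defined_resources(ts) for name, ts in triplesets.items()}
    connections = set()
    for t, triples in triplesets.items():
        for (s, p, o) in triples:
            for u, resources in defined.items():
                if u != t and o in resources:
                    connections.add((t, u))
    return connections


# Hypothetical sample data: three tiny triplesets.
triplesets = {
    "A": {("ex:a1", "owl:sameAs", "ex:b1"), ("ex:a2", "rdfs:label", "two")},
    "B": {("ex:b1", "rdfs:seeAlso", "ex:c1")},
    "C": {("ex:c1", "rdfs:label", "one")},
}
connections = build_connections(triplesets)
```

In this sample, A is connected to B and B is connected to C; the two label triples create no connection because their objects are not defined in any tripleset.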
    Our recommendation procedure analyses the Linked Data network in much
the same way as a Social Network. The inputs of the procedure are: (i) a Linked
Data network G = (S, C); (ii) a target tripleset t not in S (intuitively the user
wishes to define links from t to the triplesets in S); and (iii) a target context Ct
for t consisting of one or more triplesets u in S (intuitively the user knows that
t can be interlinked with u). The output is an ordered list L of triplesets in S,
called a ranking. The triplesets in the ranking are ordered using link prediction
techniques discussed in what follows.
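
The inputs and output of the procedure can be sketched as follows. This is only an illustration under stated assumptions: the triplesets already in the target context are left out of the ranking, ties are broken alphabetically, and the Common Neighbors index stands in for whichever index is chosen. The names rank_triplesets and common_neighbors are hypothetical.

```python
def common_neighbors(c_t, c_u):
    """Common Neighbors index: size of the intersection of two contexts."""
    return len(c_t & c_u)


def rank_triplesets(triplesets, edges, target_context, index):
    """Rank the triplesets in S for a target tripleset t.

    triplesets     : the set S
    edges          : the connection set C, as (t, u) pairs
    target_context : C_t, the triplesets the user already relates to t
    index          : function (C_t, C_u) -> score
    """
    # Context of each tripleset in S: the triplesets it points to.
    contexts = {}
    for t, u in edges:
        contexts.setdefault(t, set()).add(u)
    c_t = set(target_context)
    # Assumption: members of the target context are not recommended again.
    candidates = set(triplesets) - c_t
    scores = {u: index(c_t, contexts.get(u, set())) for u in candidates}
    # Highest score first; alphabetical order breaks ties (an assumption).
    return sorted(candidates, key=lambda u: (-scores[u], u))


# Hypothetical network with six triplesets and four connections.
S = {"u1", "u2", "u3", "h", "k", "z"}
C = {("u1", "h"), ("u2", "h"), ("u2", "k"), ("u3", "k")}
ranking = rank_triplesets(S, C, {"h", "k"}, common_neighbors)
```

Here u2 comes first because it shares both h and k with the target context, while z comes last because it shares nothing.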

Link prediction techniques. The procedure uses link prediction theory to
estimate the likelihood of the existence of a link between triplesets. We focus
on local and quasi-local indices to measure the structural similarity between
triplesets [3] according to their link structure. Table 1 summarizes the indices
the procedure implements, where:
 – $C_i$ is the context of i (the triplesets that i points to), where i is a
   specific tripleset;
 – $C'_i$ is the inverse context of i (the triplesets that point to i), where i
   is a specific tripleset;
 – $A^j$ is the number of different paths of length j connecting t and u;
 – $\varepsilon$ is a free parameter;
 – $\pi_{t,u}(s)$ is the probability that a random walker starting on t locates u
   after s steps;
 – C is the set of all edges of the Linked Data network G.
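
As an illustration of Table 1, the sketch below computes the local indices and the Local Path index for one pair (t, u). It rests on several assumptions that the text does not settle: an inverse context that appears in a product, sum, min or max is read as its size; log is the natural logarithm; a zero denominator yields 0; Adamic-Adar terms with log 1 = 0 are skipped; and Local Path counts directed walks of length 2 and 3. All function names, the value of eps and the sample values are hypothetical.

```python
import math


def ratio(num, den):
    """num / den, or 0.0 when the denominator is zero (assumption)."""
    return num / den if den else 0.0


def local_indices(c_t, c_u, k_t, k_u, inv_size):
    """Local indices of Table 1 for a pair (t, u).

    c_t, c_u : contexts C_t and C_u (sets of tripleset names)
    k_t, k_u : sizes of the inverse contexts C'_t and C'_u
    inv_size : maps a tripleset w to the size of C'_w
    """
    common = c_t & c_u
    n = len(common)
    return {
        "CN": n,
        "Salton": ratio(n, math.sqrt(k_t * k_u)),
        "Jaccard": ratio(n, len(c_t | c_u)),
        "Sorensen": ratio(2 * n, k_t + k_u),
        "HPI": ratio(n, min(k_t, k_u)),
        "HDI": ratio(n, max(k_t, k_u)),
        "LHN": ratio(n, k_t * k_u),
        "PA": k_t * k_u,
        # Terms with |C'_w| = 1 are skipped because log(1) = 0.
        "AA": sum(1 / math.log(inv_size[w]) for w in common if inv_size[w] > 1),
        "RA": sum(ratio(1, inv_size[w]) for w in common),
    }


def local_path(adj, t, u, eps=0.01):
    """LP index: walks of length 2 plus eps times walks of length 3, from t to u."""
    walks2 = sum(1 for w in adj.get(t, ()) if u in adj.get(w, ()))
    walks3 = sum(1 for w in adj.get(t, ())
                 for x in adj.get(w, ()) if u in adj.get(x, ()))
    return walks2 + eps * walks3


# Hypothetical values for one pair (t, u).
indices = local_indices(
    c_t={"a", "b", "c"}, c_u={"b", "c", "d"}, k_t=2, k_u=8,
    inv_size={"a": 1, "b": 2, "c": 4, "d": 1},
)
lp = local_path({"t": {"a", "b"}, "a": {"u"}, "b": {"a", "u"}}, "t", "u")
```

With these values the two contexts share b and c, so CN is 2, Jaccard is 2/4 and RA is 1/2 + 1/4. The Local Random Walk index is omitted because it additionally requires simulating the walker's transition probabilities.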

Description of the TRT Tool in Action. Briefly, suppose that the user is
working on a tripleset t and wants to discover one or more triplesets u such that
t can be interlinked with u. The user then runs the tool to obtain recommendations.
    The tool first builds the Linked Data network G = (S, C) defined by the
metadata stored in the DataHub repository.
    Then, the user defines the rest of the input data the tool requires. The user
may define a target context Ct for t, consisting of one or more triplesets in S, in
two different ways: (i) by providing a VoID descriptor Vt for t from which the tool
extracts Ct by analysing the void:linkset declarations occurring in Vt ; or (ii) by
manually selecting triplesets from the categories the tool displays. Finally, the
user chooses a similarity index from those shown in Table 1.
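
Option (i) might look roughly like the sketch below. It is not the tool's actual extraction code: it assumes the VoID descriptor has already been parsed into (s, p, o) tuples of strings, and it reads linksets through the void:Linkset class and the void:target, void:subjectsTarget and void:objectsTarget properties of the VoID vocabulary. The function name and the ex: identifiers are hypothetical.

```python
VOID = "http://rdfs.org/ns/void#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def target_context(void_triples, t):
    """Collect the datasets that the linksets of a VoID description connect t to."""
    linksets = {s for (s, p, o) in void_triples
                if p == RDF_TYPE and o == VOID + "Linkset"}
    context = set()
    for linkset in linksets:
        # Group the property values of this linkset.
        props = {}
        for (s, p, o) in void_triples:
            if s == linkset:
                props.setdefault(p, set()).add(o)
        targets = props.get(VOID + "target", set())
        subjects = props.get(VOID + "subjectsTarget", set())
        objects = props.get(VOID + "objectsTarget", set())
        # Undirected declaration: the other target joins the context.
        if t in targets:
            context |= targets - {t}
        # Directed declaration: only links whose subjects are in t count.
        if t in subjects:
            context |= objects - {t}
    return context


# Hypothetical descriptor with three linksets.
void_triples = [
    ("ex:ls1", RDF_TYPE, VOID + "Linkset"),
    ("ex:ls1", VOID + "target", "ex:mine"),
    ("ex:ls1", VOID + "target", "ex:geonames"),
    ("ex:ls2", RDF_TYPE, VOID + "Linkset"),
    ("ex:ls2", VOID + "subjectsTarget", "ex:mine"),
    ("ex:ls2", VOID + "objectsTarget", "ex:dblp"),
    ("ex:ls3", RDF_TYPE, VOID + "Linkset"),
    ("ex:ls3", VOID + "subjectsTarget", "ex:other"),
    ("ex:ls3", VOID + "objectsTarget", "ex:mine"),
]
c_t = target_context(void_triples, "ex:mine")
```

The third linkset is ignored because its links point to ex:mine rather than from it.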
    From this input data, the tool outputs a ranked list of triplesets, thereby
helping reduce the effort required to find related triplesets for the interlinking
process.
    The tool can be accessed at http://web.ccead.puc-rio.br:8080/Uncover/.

3   Evaluation
The tool was evaluated using the DataHub repository, which contains more than
6,000 triplesets, with approximately 15 thousand links that connect only 711
of the available triplesets. The links across triplesets were used to rank and
recommend triplesets for interlinking. The recommendation process was assessed
using the 10-fold cross validation approach, where we randomly divided the
observed links into 10 subsets used as recommendation subgraphs. Finally, the
overall performance was computed in terms of the average of the performances
in the testing partitions.
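
One possible reading of this protocol is the standard k-fold scheme sketched below, in which each of the 10 subsets is held out once while the remaining links form the network used for recommendation. The shuffling seed, the function name and the sample links are assumptions made for illustration.

```python
import random


def k_fold_links(links, k=10, seed=0):
    """Yield (training_links, held_out_links) pairs for k-fold cross validation."""
    shuffled = sorted(links)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled links into k folds of nearly equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        training = [link for j, fold in enumerate(folds) if j != i for link in fold]
        yield training, held_out


# Hypothetical set of 25 observed links.
links = [(f"t{i}", f"u{i}") for i in range(25)]
splits = list(k_fold_links(links))
```

The per-fold scores of a metric are then averaged to obtain the overall performance.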
    To evaluate the prediction indices, we used three standard metrics: Area Under
the receiver operating characteristic Curve (AUC), Mean Average Precision
(MAP) and Recall. Table 2 summarizes the results for different target context
sizes (shown in the first column of the table). The entries corresponding to the
highest results among the 12 indices are emphasized in boldface underlined. The
reader may observe that the PA index achieved the highest AUC for every target
context size except 5, where LP reached 92.70% against 90.76% for PA (the AUC
of PA ranges from 83.74% to 95.90% depending on the target context size). The
PA index also obtained the best MAP (37.83%) for the smallest target context,
while the RA index turned out to be more precise (up to 72.42%) for target
contexts with 10 or more triplesets. Table 2 also shows the coverage results. The
PA index obtained the highest recall (from 96.40% to 99.80%), regardless of the
size of the target context.
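
For reference, the three metrics can be sketched for a single target tripleset as below; MAP is then the mean of the average precision over all targets. The exact definitions used in the evaluation are not spelled out here, so this sketch assumes the common ones: AUC as the probability that a relevant tripleset outscores a non-relevant one (ties count one half), and recall as the fraction of relevant triplesets that appear in the ranking. The sample scores are hypothetical.

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at the positions of the relevant items."""
    hits, total = 0, 0.0
    for position, u in enumerate(ranking, start=1):
        if u in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def recall(ranking, relevant):
    """Fraction of the relevant items that appear in the ranking."""
    return len(set(ranking) & relevant) / len(relevant) if relevant else 0.0


def auc(scores, relevant):
    """Probability that a relevant item outscores a non-relevant one."""
    pos = [s for u, s in scores.items() if u in relevant]
    neg = [s for u, s in scores.items() if u not in relevant]
    if not pos or not neg:
        return 0.0
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores; "e" is relevant but was never scored.
scores = {"a": 3, "b": 2, "c": 2, "d": 0}
relevant = {"a", "c", "e"}
# Only triplesets with a positive score are listed in the ranking.
ranking = sorted((u for u in scores if scores[u] > 0),
                 key=lambda u: (-scores[u], u))
ap = average_precision(ranking, relevant)
rec = recall(ranking, relevant)
area = auc(scores, relevant)
```

In this sample the ranking is a, b, c, so two of the three relevant triplesets are retrieved, at positions 1 and 3.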
         Table 2: AUC, MAP and Recall (%) of the local and quasi-local indices

AUC          CN   Salton  Jaccard Sørensen      HPI      HDI      LHN       PA       AA       RA       LP      LRW
  1       70.52    47.79    69.84    69.28    48.94    69.31    48.00    83.74    71.31    70.53    70.74    69.67
  5       87.10    55.73    81.20    80.93    58.78    80.17    52.24    90.76    88.45    88.02    92.70    83.21
 10       92.42    57.14    85.06    84.85    60.84    83.79    52.87    92.81    92.37    92.40    92.25    86.69
 20       92.77    58.47    88.34    88.30    59.45    86.54    51.39    94.33    92.53    92.64    92.76    88.22
 50       92.84    59.10    92.96    92.99    56.27    92.09    52.30    95.90    92.17    92.72    91.91    90.26

MAP          CN   Salton  Jaccard Sørensen      HPI      HDI      LHN       PA       AA       RA       LP      LRW
  1       18.17    14.49    16.30    14.73    17.08    15.00    14.80    37.83    18.06    17.80    18.46    15.57
  5       49.48    25.07    21.80    20.36    35.14    19.20    18.38    48.26    52.20    51.48    58.23    26.05
 10       63.49    30.99    30.40    28.71    41.81    24.41    19.44    52.62    63.43    63.71    62.63    31.91
 20       71.20    34.22    44.37    43.56    38.14    34.14    17.90    53.97    71.46    72.38    70.59    34.66
 50       71.13    27.73    69.49    70.55    20.64    66.14    15.92    47.30    70.99    72.42    67.51    39.03

Recall       CN   Salton  Jaccard Sørensen      HPI      HDI      LHN       PA       AA       RA       LP      LRW
  1       48.72    49.69    49.86    49.76    49.68    49.55    50.02    96.40    50.81    48.74    48.81    49.12
  5       81.45    83.80    82.69    83.03    83.68    82.23    82.43    98.45    83.63    82.83    86.90    82.42
 10       89.52    88.73    89.42    89.29    89.35    89.85    89.17    98.74    89.49    89.28    88.96    89.21
 20       90.03    90.31    89.68    89.18    89.12    89.53    89.19    99.80    89.50    89.58    89.84    90.01
 50       90.05    90.16    90.15    90.21    90.04    89.38    88.45    99.56    89.06    89.58    89.02    89.71


4    Conclusions
In this paper, we proposed the use of link prediction techniques to address the
tripleset recommendation problem in the Linked Data domain and presented a
tool that implements the techniques. The tool computes local and quasi-local
indices to predict links between triplesets. The results showed that the tool
performs better, with respect to both AUC and recall, when the PA index is
adopted. In terms of MAP, the PA index should be adopted for smaller context
sizes, while the RA index should be adopted for larger context sizes.
Acknowledgments. This work was partly supported by CNPq, under grants
160326/2012-5, 301497/2006-0, 475717/2011-2 and 57128/2009-9, and by FAPERJ,
under grants E-26/170028/2008 and E-26/103.070/2011.


References
1. Leme, L.A.P.P., Lopes, G.R., Nunes, B.P., Casanova, M.A., Dietze, S.: Identifying
   candidate datasets for data interlinking. In: Daniel, F., Dolog, P., Li, Q. (eds.): ICWE.
   Volume 7977 of Lecture Notes in Computer Science, Springer (2013) 354–366
2. Lopes, G.R., Leme, L.A.P.P., Nunes, B.P., Casanova, M.A., Dietze, S.: Recommending
   tripleset interlinking through a social network approach. In: Proceedings
   of WISE’13. (2013, to appear)
3. Lü, L., Jin, C.H., Zhou, T.: Similarity index based on local paths for link prediction
   of complex networks. Physical Review E 80(4) (2009) 046122