             Predicting Invariant Nodes in Large Scale Semantic Graphs


            Damian Barsotti       Martín Ariel Domínguez      Pablo Ariel Duboué
                               FaMAF-UNC / Cordoba, Argentina
                        damian,mdoming,pablod@famaf.unc.edu.ar




                              Abstract

    We are interested in understanding and predicting how large
    knowledge graphs change over time. An important subproblem is
    predicting which nodes within the graph won't have any edges
    deleted or changed (what we call add-only nodes). Predicting
    add-only nodes correctly has practical importance, as such nodes
    can then be cached or represented using a more efficient data
    structure. This paper presents a logistic regression approach
    using attribute-values as features that achieves 95%+ precision
    on DBpedia yearly changes, trained using Apache Spark. It
    concludes by outlining how we plan to use these models for
    Natural Language Generation.

1 Introduction

We are interested in understanding and predicting how large knowledge
graphs change over time. An important subproblem is predicting which
nodes within the graph won't have any edges deleted or changed (what
we call add-only nodes) or undergo any changes at all (what we call
constant nodes). Predicting add-only nodes correctly has practical
importance, as such nodes can then be cached or represented using a
more efficient data structure. In this paper we show a logistic
regression approach using attribute-values as features that achieves
95%+ precision on DBpedia^1 yearly changes, as trained using Apache
Spark. We conclude by outlining how we plan to use these models for
Natural Language Generation.

   Definition. Given a multigraph $G_0$ with named edges such that
each source node $S$ is linked through an edge labeled $V$ to a
target node $O$, which we will call a triple $\langle S, V, O
\rangle$, we say a given node $S$ is an add-only node if in the next
version ($G_1$) of the multigraph, all triples starting on $S$ in
$G_0$ are also in $G_1$. That is, $S$ is add-only iff:

    $\forall v, o : \langle S, v, o \rangle \in G_0 \Rightarrow
    \langle S, v, o \rangle \in G_1$

   This type of node can be efficiently represented as static
information, for example by leveraging large scale perfect hashes
(Botelho and Ziviani, 2007).
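
As a concrete illustration (ours, not from the paper), the definition
can be checked directly from two versions of the triple set. The
following Python sketch, over hypothetical toy graphs, marks each
subject of G0 as add-only and/or constant:

    from collections import defaultdict

    def classify_subjects(g0, g1):
        """Classify each subject of G0 as add-only and/or constant.

        g0, g1: iterables of (subject, verb, object) triples.
        S is add-only iff every triple leaving S in G0 is also in
        G1; it is constant iff its outgoing triple sets are equal.
        """
        out0, out1 = defaultdict(set), defaultdict(set)
        for s, v, o in g0:
            out0[s].add((v, o))
        for s, v, o in g1:
            out1[s].add((v, o))
        labels = {}
        for s, edges in out0.items():
            add_only = edges <= out1[s]   # subset: nothing deleted
            constant = edges == out1[s]   # equality: nothing changed
            labels[s] = (add_only, constant)
        return labels

    # Hypothetical toy versions of the graph:
    G0 = [("S", "a", "T"), ("S", "a", "U"), ("U", "b", "V")]
    G1 = [("S", "a", "T"), ("S", "a", "U"), ("S", "a", "W"),
          ("U", "c", "W")]
    print(classify_subjects(G0, G1))
    # {'S': (True, False), 'U': (False, False)}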

   Our intuition is that in large scale semantic graphs holding an
imperfect representation of the real world, there will be two types
of changes: (1) model enhancements, where the truth about the world
is better captured by the model, and (2) model corrections, where the
world has changed and the model is updated. Updates of the first type
result in new information added to the graph, without modifying
existing data. Finding such nodes is the objective of our work.
   This work is structured as follows. In the next section we
summarize related work. In Section 3 we discuss DBpedia, the semantic
graph we used for our experiments. Our methods and results follow,
closing with a discussion of our intended application in Natural
Language Generation.

^1 http://dbpedia.org

2 Related Work

Mining graphs for nodes with special properties is not new to Big
Data mining (Drury et al., 2015). With the development of DBpedia,
much research has been devoted to exploiting this resource in AI
tasks as well as to modeling its changes. For example, there is
research on modeling DBpedia's currency (Rula et al., 2014), that is,
the age of the data in it and the speed at which those changes can be
captured by any system.
Although currency could be computed based on the modification or
creation dates of the resources, this information is not always
present in Wikipedia pages. To overcome this, the authors propose a
model to estimate currency combining information from the original
related pages with a couple of currency metrics measuring the speed
of retrieval by a system and basic currency, or timestamp. Their
experiments suggest that entities with high system currency are
associated with more complete DBpedia resources and that entities
with low system currency appear associated with Wikipedia pages that
are not easily tractable (or that "could not provide real world
information," according to the authors). While those authors and we
both look into changes in DBpedia, we are interested in changes that
for the most part do not result from changes in the real world, which
are the changes Lehmann and others are interested in.
   The need to account for changes in ontologies has long been
acknowledged, given that they may not be useful in real world
applications if the representation of the knowledge they contain is
outdated. Eder and Koncilia (2004) present a formalism to represent
ontologies as graphs that contain a time model including time
intervals and valid times for concepts. They base their formalism on
techniques developed for temporal databases, namely the versioning of
databases instead of their evolution, and they provide some
guidelines about its possible implementation. Our work can be used to
improve the internal representation of such temporal databases (Cheng
et al., 2014).
   Another source of ontology transformation is spatiotemporal
change. Dealing with spatial changes in historical data (or over time
series) is crucial for some NLP tasks, such as information retrieval
(Kauppinen and Hyvönen, 2007). In their case, the authors deal with
the evolution of the ontology's underlying domain instead of its
versioning or evolution due to developments or refinements. Their
main result is the definition of partial overlaps between concepts in
a given time series, which was applied to build a Finnish Temporal
Region Ontology, showing promising results.
   Finally, we see parallelisms with change tracking in other large
graphs: object graphs in garbage collection systems. State of the art
garbage collection will single out objects that survive multiple
garbage collections (Stefanović et al., 1999) and stop considering
them for collection. It is this type of optimization that we expect
the detection of invariant nodes to bring to semantic graph updates.

3 Data

As a large scale naturally occurring knowledge graph with a rich
update history, we use DBpedia, a knowledge graph derived from the
Wikipedia collaborative encyclopedia, which was started in January
2001 and at present contains over 37 million articles in 284
languages.
   Given that the content in Wikipedia pages is stored in a
structured way, it is possible to extract and organize it in an
ontology-like manner, as implemented in the DBpedia community
project. This is accomplished by mapping Wikipedia infoboxes from
each page to a curated shared ontology that contains 529 classes and
around 2,300 different properties. DBpedia contains the knowledge
from 111 different language editions of Wikipedia and, for English,
the knowledge base consists of more than 400 million facts describing
3.7 million things (Lehmann et al., 2015). A notable feature of this
resource is that it is freely available to download in the form of
dumps, or it can be consulted using specific tools developed to query
it.
   These dumps contain the information in a language called Resource
Description Framework (RDF) (Lassila et al., 1998). The World Wide
Web Consortium (W3C) developed RDF to encode the knowledge present in
web pages, so that it is comprehensible and exploitable by agents
during any information search. RDF is based on the concept of making
statements about (web) resources using expressions in the
subject-predicate-object form. These expressions are known as
triples, where the subject denotes the resource being described and
the predicate denotes a characteristic of the subject, describing the
relation between the subject and the object. A collection of such RDF
declarations can be formally represented as a labeled directed
multigraph, naturally appropriate for representing ontologies.
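
To make the triple-to-multigraph correspondence concrete, the
following Python sketch (ours, not from the paper) loads
N-Triples-style lines into a subject-indexed multigraph; the regular
expression is a simplification that only handles IRI objects, not the
full RDF syntax:

    import re
    from collections import defaultdict

    # Simplified pattern for one N-Triples line: <s> <p> <o> .
    TRIPLE = re.compile(r'<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.')

    def load_multigraph(lines):
        """Build a labeled directed multigraph: subject -> {(p, o)}."""
        graph = defaultdict(set)
        for line in lines:
            m = TRIPLE.match(line)
            if m:
                s, p, o = m.groups()
                graph[s].add((p, o))
        return graph

    dump = ['<http://dbpedia.org/resource/Cordoba> '
            '<http://dbpedia.org/ontology/country> '
            '<http://dbpedia.org/resource/Argentina> .']
    print(load_multigraph(dump))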

   Table 1 shows the different years employed in this work. The
DBpedia project obtains its data through a series of scripts run over
Wikipedia, which is itself a user-generated resource. Changes to the
DBpedia scripts or to Wikipedia itself sometimes result in dramatic
differences from one year to the next. Besides the overall sizes,
what is relevant to this work is the total number of additions and
deletions, shown in Table 2.

Table 1: Data sizes.

    Version     # Nodes     # Links
    2010-3.6    1,668,503   19,969,604
    2011-3.7    1,831,219   26,770,236
    2012-3.8    2,350,907   33,742,028
    2013-3.9    3,243,478   41,804,713
    2014        4,218,628   61,481,509
    2015-04     4,080,388   37,791,134
    2015-10     4,995,949   40,617,978
    2016-04     5,109,879   40,407,197

Table 2: Percentage of entities which remain constant or add-only,
calculated between two consecutive versions of DBpedia.

    Consecutive versions     % Const.   % Add.
    2010-3.6  2011-3.7        9.71      45.73
    2011-3.7  2012-3.8       30.28      65.51
    2012-3.8  2013-3.9       38.72      76.79
    2013-3.9  2014           16.61      49.32
    2014      2015-04        14.01      29.01
    2015-04   2015-10         3.76      20.54
    2015-10   2016-04        83.63      90.52

4 Methods

Our prediction system is implemented in Apache Spark,^2 using its
logistic regression package. In our algorithm the feature vector is
comprised of binary features indicating whether or not a given
relation-object pair holds for the subject in OLD; that is, we do not
look at whether the $\xrightarrow{V_i} O_i$ edges have changed, just
at their existence in OLD. The class is, given a node in subject
position $S$:

   add-only: $\{(V_i, O_i)\}_{OLD} \subseteq \{(V_i, O_i)\}_{NEW}$
   constant: $\{(V_i, O_i)\}_{OLD} = \{(V_i, O_i)\}_{NEW}$

   The feature vector underneath has a dimension of $\|V\| \times
\|O\|$, a potentially very large number given the millions of values
in $O$. We leverage Apache Spark MLlib pipelines to filter this
extremely large feature vector down to the top million entries.
   Figure 1 shows a small example of feature and class extraction. A
four-node graph OLD evolves into a five-node graph NEW. The classes
for each node are computed over OLD, using three binary features.

^2 https://spark.apache.org/

[Figure 1 drawing: the OLD graph (four nodes S, T, U, V) and the NEW
graph (five nodes); the original image is not reproduced here.]

    Node   Features   Target
    S      a=T, a=U   add-only, ¬constant
    T      ∅          add-only, ¬constant
    U      b=V        ¬add-only, ¬constant
    V      ∅          add-only, constant

Figure 1: Feature generation from an ontology OLD and NEW. In this
example most nodes are add-only (shaded in OLD); only U loses a
relation and is thus not add-only.
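
The paper describes but does not include its training code. As a
hedged sketch of what the Section 4 pipeline could look like, the
following PySpark program builds binary (V, O) features and fits a
logistic regression; the two-row dataset is hypothetical, and
CountVectorizer's vocabSize is used here as a stand-in for the
top-million feature filtering described above:

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import CountVectorizer
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("add-only-nodes").getOrCreate()

    # One row per subject: its (relation, object) pairs in OLD,
    # encoded as "V=O" strings, plus the add-only label computed by
    # comparing OLD against NEW (1.0 = add-only).
    train = spark.createDataFrame(
        [("S", ["a=T", "a=U"], 1.0),
         ("U", ["b=V"], 0.0)],
        ["subject", "rel_obj_pairs", "label"])

    # Binary (V, O) features, keeping only the most frequent entries
    # to tame the |V| x |O| dimensionality.
    vectorizer = CountVectorizer(inputCol="rel_obj_pairs",
                                 outputCol="features",
                                 vocabSize=1000000,  # top million
                                 binary=True)
    lr = LogisticRegression(featuresCol="features", labelCol="label")

    model = Pipeline(stages=[vectorizer, lr]).fit(train)
    model.transform(train).select("subject", "prediction").show()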


5 Results

Using the machine learning method described in the previous section,
we took three consecutive years, $G_{y_1}$, $G_{y_2}$, $G_{y_3}$,
built a model $M$ on $G_{y_1} \rightarrow G_{y_2}$, applied $M$ to
$G_{y_2}$, obtaining $G'_{y_3}$, and evaluated it by comparing it to
$G_{y_3}$. Table 3 shows our results. We can see that for pairs close
in size (from Table 1) we obtain a precision close to 90% with recall
ranging from 20% to 58% (and F-measure as high as 66%). As our
numbers were obtained by optimizing F1 on a binary classification
problem, precision and recall are dual and treated as identical
quantities by the optimizer (this can be easily seen from a confusion
table). The rows marked with an asterisk in the table were inverted
in our experiments. The low numbers for the last row in the table can
be attributed to ontological restructuring on Wikipedia/DBpedia in
the 2015-04/2015-10 period (second-to-last row of Table 2), where few
entities remained constant through 2015.
   From the table we can also see that detecting constant nodes, on
the other hand, is a much more difficult task that might be
ill-defined given the nature of the updates discussed in the
introduction. Table 4 shows some examples of correctly and
incorrectly predicted nodes.

Table 3: Training on two consecutive years and evaluating on a
third; training maximizes F1.

         Train                   Eval          Add-only            Constant
    Source      Target          Target     Precision  Recall  Precision  Recall
    2010-3.6    2011-3.7      2012-3.8       0.560    0.579     0.704    0.887
    2011-3.7    2012-3.8      2013-3.9       0.448    0.444     0.658    0.569
    2012-3.8    2013-3.9      2014           0.916    0.224     0.890    0.472
    2013-3.9    2014          2015-04        0.971    0.506     0.965    0.770
    2014        2015-04       2015-10        0.989    0.650     0.971    0.820
    2015-04     2015-10       2016-04        0.945    0.196     0.908    0.068

Table 4: Example predictions and mispredictions, using 2015-04 to
2015-10 as training and testing on 2016-04.

    Correctly predicted add-only:
    Iasi Botanical Garden         constant
    USS Breakwater (SP-681)       constant
    Interborough Handicap         constant
    Thode Island                  added archipelago: Marshall Archipelago
    Colonel Reeves Stakes         added location: Perth
                                  added location: Australia

    Incorrectly predicted as add-only:
    Beverly Hills Handicap        disappears due to name change
    First Ward Park               disappears due to name change
    2012 Shonan Bellmare season   changes league: 2012 J. League Division 2
                                  to league: 2012 J.League Division 2
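
To make the evaluation protocol concrete, this small Python sketch
(ours, not the paper's) scores a set of predicted add-only nodes
against the nodes that actually turned out to be add-only between
$G_{y_2}$ and $G_{y_3}$:

    def precision_recall(predicted, actual):
        """predicted, actual: sets of node ids labeled add-only.

        Precision: fraction of predicted add-only nodes that truly
        are; recall: fraction of truly add-only nodes predicted.
        """
        tp = len(predicted & actual)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(actual) if actual else 0.0
        return precision, recall

    # Hypothetical node ids:
    pred = {"S", "T", "V"}       # flagged add-only by the model
    gold = {"S", "T", "U", "V"}  # actually add-only in the new dump
    print(precision_recall(pred, gold))  # (1.0, 0.75)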




5.1 Discussion

How useful are these results? For the task we have in mind, building
statistically plausible future versions of existing semantic graphs
for the purpose of testing Natural Language Generation algorithms
(Duboue et al., 2016), successfully predicting add-only nodes helps
us immediately with the performance of the prediction system. Our
high precision results will then carry over to direct improvements on
our full system: if our system has an error rate of 30% and 25% of
the nodes are add-only, our current system will reduce the error by
up to 12% (in the case of 50% recall).
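
Spelling out that estimate (our reading, under the assumption that
every correctly recalled add-only node stops contributing errors):

    $\underbrace{25\%}_{\text{add-only nodes}} \times
     \underbrace{50\%}_{\text{recall}} = 12.5\%$

of all nodes can then be carried over verbatim, eliminating up to
roughly 12% of the error.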

   Another use case is maintaining the data: the add-only nodes and
relations can be pre-cached using more efficient data structures such
as perfect hashes (Botelho and Ziviani, 2007).

6 Conclusions and Future Work

In Natural Language Generation (Reiter and Dale, 2000), Referring
Expressions Generation (REG) is the task that, given an entity (the
referent) and a set of competing entities (the set of distractors),
involves creating a mention to the referent so that, in the eyes of
the reader, it is clearly distinguishable from any other entity in
the set of distractors. Therefore REG algorithms are expected to
select attributes that unambiguously identify an entity with respect
to a set of distractors.
   Our current work is part of a plan to simulate natural
perturbations on the data in order to find the conditions under which
REG algorithms start to fail (for example, on a simulated DBpedia 25
years in the future).
   In previous work we explored robustness for the particular case of
Referring Expressions Generation (REG) algorithms by means of
different versions of an ontology (Duboue et al., 2016).
   In (Duboue and Domínguez, 2016) we presented experiments on two
types of entities (people and organizations); using different
versions of DBpedia, we found that the robustness of the tuned
algorithm and its parameters do coincide, but more work is needed to
learn these parameters from data in a generalizable fashion.
   We plan to extend the current model with a specific model for
additions and deletions using techniques from statistical machine
translation (Koehn, 2010), and to investigate techniques based on
knowledge embedding models (Xie et al., 2017).

Acknowledgments

The authors would like to thank the Secretaria de Ciencia y Tecnica
of Cordoba Province for support and the anonymous reviewers for
helpful comments and suggestions.
References

Fabiano C. Botelho and Nivio Ziviani. 2007. External perfect hashing
  for very large key sets. In Proceedings of the Sixteenth ACM
  Conference on Information and Knowledge Management (CIKM '07).
  ACM, New York, NY, USA, pages 653-662.
  https://doi.org/10.1145/1321440.1321532.

S. Cheng, A. Termehchy, and V. Hristidis. 2014. Efficient prediction
  of difficult keyword queries over databases. IEEE Transactions on
  Knowledge and Data Engineering 26(6):1507-1520.
  https://doi.org/10.1109/TKDE.2013.140.

Brett Drury, Jorge Carlos Valverde-Rebaza, and Alneu de Andrade
  Lopes. 2015. Causation generalization through the identification
  of equivalent nodes in causal sparse graphs constructed from text
  using node similarity strategies. In Proceedings of the 2nd Annual
  International Symposium on Information Management and Big Data
  (SIMBig 2015), Cusco, Peru, September 2-4, 2015, pages 58-65.
  http://ceur-ws.org/Vol-1478/paper6.pdf.

Pablo Ariel Duboue and Martín Ariel Domínguez. 2016. Using
  robustness to learn to order semantic properties in referring
  expression generation. Springer International Publishing, Cham,
  pages 163-174.

Pablo Ariel Duboue, Martín Ariel Domínguez, and Paula Estrella.
  2016. On the robustness of standalone referring expression
  generation algorithms using RDF data. In WebNLG 2016, page 17.

Johann Eder and Christian Koncilia. 2004. Modelling changes in
  ontologies. In Proceedings of On The Move - Federated Conferences
  (OTM 2004), LNCS 3292, Springer, pages 662-673.

Tomi Kauppinen and Eero Hyvönen. 2007. Modeling and reasoning about
  changes in ontology time series. In Integrated Series in
  Information Systems. Springer-Verlag, pages 319-338.

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge
  University Press, New York, NY, USA, 1st edition.

Ora Lassila and Ralph R. Swick. 1998. Resource description framework
  (RDF) model and syntax specification. World Wide Web Consortium.

Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris
  Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey,
  Patrick van Kleef, Sören Auer, and Christian Bizer. 2015. DBpedia -
  a large-scale, multilingual knowledge base extracted from
  Wikipedia. Semantic Web Journal 6(2):167-195.

Ehud Reiter and Robert Dale. 2000. Building Natural Language
  Generation Systems. Cambridge University Press.

Anisa Rula, Luca Panziera, Matteo Palmonari, and Andrea Maurino.
  2014. Capturing the currency of DBpedia descriptions and get
  insight into their validity. In Proceedings of the 5th
  International Workshop on Consuming Linked Data (COLD 2014),
  co-located with the 13th International Semantic Web Conference
  (ISWC 2014), Riva del Garda, Italy, October 20, 2014.

Darko Stefanović, Kathryn S. McKinley, and J. Eliot B. Moss. 1999.
  Age-based garbage collection. ACM SIGPLAN Notices 34(10):370-381.

Qizhe Xie, Xuezhe Ma, Zihang Dai, and Eduard Hovy. 2017. An
  interpretable knowledge transfer model for knowledge base
  completion. In ACL 2017: Proceedings of the Annual Meeting of the
  Association for Computational Linguistics. Association for
  Computational Linguistics, Vancouver, Canada.