=Paper=
{{Paper
|id=Vol-2029/paper4
|storemode=property
|title=Predicting Invariant Nodes in Large Scale Semantic Graphs
|pdfUrl=https://ceur-ws.org/Vol-2029/paper4.pdf
|volume=Vol-2029
|authors=Damian Barsotti,Martín Ariel Domínguez,Pablo Ariel Duboué
|dblpUrl=https://dblp.org/rec/conf/simbig/BarsottiDD17
}}
==Predicting Invariant Nodes in Large Scale Semantic Graphs==
Damian Barsotti, Martín Ariel Domínguez, Pablo Ariel Duboué
FaMAF-UNC / Córdoba, Argentina
{damian,mdoming,pablod}@famaf.unc.edu.ar

Abstract

We are interested in understanding and predicting how large knowledge graphs change over time. An important subproblem is predicting which nodes within the graph won't have any edges deleted or changed (what we call add-only nodes). Predicting add-only nodes correctly has practical importance, as such nodes can then be cached or represented using a more efficient data structure. This paper presents a logistic regression approach using attribute-values as features that achieves 95%+ precision on DBpedia yearly changes, trained using Apache Spark. It concludes by outlining how we plan to use these models for Natural Language Generation.

1 Introduction

We are interested in understanding and predicting how large knowledge graphs change over time. An important subproblem is predicting which nodes within the graph won't have any edges deleted or changed (what we call add-only nodes) or won't undergo any changes at all (what we call constant nodes). Predicting add-only nodes correctly has practical importance, as such nodes can then be cached or represented using a more efficient data structure. In this paper we show a logistic regression approach using attribute-values as features that achieves 95%+ precision on yearly changes of DBpedia (http://dbpedia.org), as trained using Apache Spark. We conclude by outlining how we plan to use these models for Natural Language Generation.

Definition. Given a multigraph G0 with named edges, such that each source node S is linked through an edge labeled V to a target node O (a triple we write ⟨S, V, O⟩), we say a given node S is an add-only node if, in the next version G1 of the multigraph, all triples starting on S in G0 are also in G1. That is, S is add-only iff:

  ∀v, o: ⟨S, v, o⟩ ∈ G0 ⇒ ⟨S, v, o⟩ ∈ G1

This type of node can be efficiently represented as static information, for example by leveraging large scale perfect hashes (Botelho and Ziviani, 2007).
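To make the definition concrete, here is a minimal sketch (ours, not the authors' code) that labels nodes as add-only or constant given two graph versions represented as sets of (subject, relation, object) tuples; the function names are hypothetical:

<pre>
# Label nodes as add-only / constant across two graph versions.
# A graph is a set of (S, V, O) triples.

def outgoing(graph, node):
    """All (V, O) pairs of triples whose subject is `node`."""
    return {(v, o) for (s, v, o) in graph if s == node}

def is_add_only(g0, g1, node):
    # Every triple leaving `node` in G0 must also be in G1.
    return outgoing(g0, node) <= outgoing(g1, node)

def is_constant(g0, g1, node):
    # No triple leaving `node` may be added or removed.
    return outgoing(g0, node) == outgoing(g1, node)
</pre>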
Our intuition is that in large scale semantic graphs holding an imperfect representation of the real world there will be two types of changes: (1) model enhancements, where the truth about the world is better captured by the model, and (2) model corrections, where the world has changed and the model is updated. Updates of the first type result in new information being added to the graph without modifying existing data. Finding such nodes is the objective of our work.

This work is structured as follows. In the next section we summarize related work. In Section 3 we discuss DBpedia, the semantic graph we used for our experiments. Our methods and results follow, closing with a discussion of our intended application in Natural Language Generation.

2 Related Work

Mining graphs for nodes with special properties is not new to Big Data mining (Drury et al., 2015). With the development of DBpedia, much research has been devoted to exploiting this resource in AI tasks as well as to modeling its changes. For example, there is research on modeling DBpedia's currency (Rula et al., 2014), that is, the age of the data in it and the speed at which those changes can be captured by any system. Although currency could be computed based on the modification/creation dates of the resources, this information is not always present in Wikipedia pages. To overcome this, the authors propose a model that estimates currency by combining information from the related original pages with two currency metrics measuring the speed of retrieval by a system and basic currency, or timestamp. Their experiments suggest that entities with high system currency are associated with more complete DBpedia resources, while entities with low system currency appear associated with Wikipedia pages that are not easily tractable (or that "could not provide real world information", according to the authors). While both those authors and we look into changes in DBpedia, we are interested in changes that for the most part do not result from changes in the real world, which are the changes Lehmann and others are interested in.

The need to account for changes in ontologies has long been acknowledged, given that ontologies may not be useful in real world applications if the representation of the knowledge they contain is outdated. Eder and Koncilia (2004) present a formalism to represent ontologies as graphs that contain a time model, including time intervals and valid times for concepts. They base their formalism on techniques developed for temporal databases, namely the versioning of databases instead of their evolution, and they provide some guidelines about its possible implementation. Our work can be used to improve the internal representation of such temporal databases (Cheng et al., 2014).

Another source of ontology transformation is spatiotemporal change. Dealing with spatial changes in historical data (or over time series) is crucial for some NLP tasks, such as information retrieval (Kauppinen and Hyvönen, 2007). In their case, the authors deal with the evolution of the ontology's underlying domain instead of its versioning or its evolution due to developments or refinements. Their main result is the definition of partial overlaps between concepts in a given time series, which was applied to build a Finnish Temporal Region Ontology, showing promising results.

Finally, we see a parallelism with change tracking in another kind of large graph: object graphs in garbage collection systems. State of the art garbage collectors single out objects that survive multiple garbage collections (Stefanović et al., 1999) and stop considering them for collection. It is this type of optimization that we expect the detection of invariant nodes to enable for semantic graph updates.

3 Data

As a large scale, naturally occurring knowledge graph with a rich update history we use DBpedia, a knowledge graph derived from Wikipedia, the collaborative encyclopedia started in January 2001 and at present containing over 37 million articles in 284 languages.

Given that the content in Wikipedia pages is stored in a structured way, it is possible to extract and organize it in an ontology-like manner, as implemented in the DBpedia community project. This is accomplished by mapping Wikipedia infoboxes from each page to a curated shared ontology that contains 529 classes and around 2,300 different properties. DBpedia contains the knowledge from 111 different language editions of Wikipedia and, for English, the knowledge base consists of more than 400 million facts describing 3.7 million things (Lehmann et al., 2015). A notable feature of this resource is that it is freely available for download in the form of dumps, or it can be consulted using specific tools developed to query it.

These dumps contain the information in a language called Resource Description Framework (RDF) (Lassila et al., 1998). The WWW Consortium (W3C) developed RDF to encode the knowledge present in web pages, so that it is comprehensible and exploitable by agents during any information search. RDF is based on the concept of making statements about (web) resources using expressions in the subject-predicate-object form. These expressions are known as triples, where the subject denotes the resource being described and the predicate denotes a characteristic of the subject and describes the relation between the subject and the object. A collection of such RDF declarations can be formally represented as a labeled directed multi-graph, naturally appropriate for representing ontologies.
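For concreteness, a fact in these dumps is serialized as one subject-predicate-object line in N-Triples syntax; the following example is ours (illustrative, not taken from the paper):

<pre>
<http://dbpedia.org/resource/Berlin> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/Germany> .
</pre>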
Table 1 shows the different DBpedia versions employed in this work. The DBpedia project obtains its data through a series of scripts run over Wikipedia, which is itself a user-generated resource. Changes to the DBpedia scripts or to Wikipedia itself sometimes result in dramatic differences from one year to the next. Besides the overall sizes, what is relevant to this work is the total number of additions and deletions, shown in Table 2.

Table 1: Data sizes.

  Version   | # Nodes   | # Links
  2010-3.6  | 1,668,503 | 19,969,604
  2011-3.7  | 1,831,219 | 26,770,236
  2012-3.8  | 2,350,907 | 33,742,028
  2013-3.9  | 3,243,478 | 41,804,713
  2014      | 4,218,628 | 61,481,509
  2015-04   | 4,080,388 | 37,791,134
  2015-10   | 4,995,949 | 40,617,978
  2016-04   | 5,109,879 | 40,407,197

Table 2: Percentage of entities which remain constant or add-only, calculated between two consecutive versions of DBpedia.

  Consecutive versions | % Const. | % Add.
  2010-3.6 → 2011-3.7  | 9.71     | 45.73
  2011-3.7 → 2012-3.8  | 30.28    | 65.51
  2012-3.8 → 2013-3.9  | 38.72    | 76.79
  2013-3.9 → 2014      | 16.61    | 49.32
  2014 → 2015-04       | 14.01    | 29.01
  2015-04 → 2015-10    | 3.76     | 20.54
  2015-10 → 2016-04    | 83.63    | 90.52

4 Methods

Our prediction system is implemented using Apache Spark (https://spark.apache.org/) and its Logistic Regression package. In our algorithm the feature vector is comprised of binary features indicating whether or not a given relation-object pair (Vi, Oi) holds for the subject in OLD; that is, we do not look at whether the Vi → Oi have changed, just at their existence in OLD. The class, given a node in subject position S, is:

  add-only: {(Vi, Oi)}_OLD ⊆ {(Vi, Oi)}_NEW
  constant: {(Vi, Oi)}_OLD = {(Vi, Oi)}_NEW

The feature vector underneath has a dimension of |V| × |O|, a potentially very large number given the millions of values in O. We leverage Apache Spark MLlib pipelines to filter this extremely large feature vector down to the top million entries.

Figure 1 shows a small example of feature and class extraction. A four-node graph OLD evolves into a five-node graph NEW. The classes for each node are computed over OLD, using three binary features.

Figure 1: Feature generation from an ontology OLD and NEW (graphs not reproduced). In this example most nodes are add-only; only U loses a relation and is thus not add-only. The extracted features and classes are:

  Node | Features | Target
  S    | a=T, a=U | add-only, ¬constant
  T    | (none)   | add-only, ¬constant
  U    | b=V      | ¬add-only, ¬constant
  V    | (none)   | add-only, constant
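The paper does not include its implementation, but the pipeline described above can be sketched in PySpark as follows. This is our illustration only: the column names, the toy input, and the hashed feature width are made up, and ChiSqSelector is our stand-in for the unspecified step that keeps the top entries of the feature vector.

<pre>
# Sketch: binary (relation, object) features -> feature selection
# -> MLlib logistic regression, assembled as a Spark ML pipeline.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, ChiSqSelector
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("add-only-nodes").getOrCreate()

# Hypothetical input: one row per subject node, with the list of
# "relation=object" strings it has in OLD and a 0/1 label
# (1.0 = the node turned out to be add-only in NEW).
train = spark.createDataFrame(
    [("S", ["a=T", "a=U"], 1.0),
     ("U", ["b=V"], 0.0)],
    ["node", "feats", "label"])

# Hash the sparse indicator features into a fixed-width binary vector.
tf = HashingTF(inputCol="feats", outputCol="rawFeatures",
               binary=True, numFeatures=1 << 20)

# Keep only the most informative entries of the huge vector
# (our stand-in for the paper's top-million filtering).
selector = ChiSqSelector(numTopFeatures=1000, featuresCol="rawFeatures",
                         outputCol="features", labelCol="label")

lr = LogisticRegression(featuresCol="features", labelCol="label")

model = Pipeline(stages=[tf, selector, lr]).fit(train)
</pre>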
5 Results

Using the machine learning method described in the previous section, we took three consecutive years, Gy1, Gy2, Gy3, built a model M on Gy1 → Gy2, applied M to Gy2, obtaining G′y3, and evaluated it by comparing it to Gy3. Table 3 shows our results. We can see that for pairs close in size (from Table 1) we obtain a precision close to 90%, with recall ranging from 20% to 58% (and F-measure as high as 66%). As our numbers were obtained by optimizing F1 on a binary classification problem, precision and recall are dual and treated as identical quantities by the optimizer (this can be easily seen from a confusion table). The rows marked with an asterisk in the table were inverted in our experiments. The low numbers for the last row in the table can be attributed to an ontological restructuring of Wikipedia/DBpedia in the 2015-04/2015-10 period (second to last row in Table 2), where few entities remained constant through 2015.

From the table we can also see that detecting constant nodes, on the other hand, is a much more difficult task, one that might be ill-defined given the nature of the updates discussed in the introduction. Table 4 shows some examples of correctly and incorrectly predicted nodes.

Table 3: Training on two consecutive years and evaluating on a third. Training maximizes F1.

  Train source | Train target | Eval target | Add-only prec. | Add-only rec. | Constant prec. | Constant rec.
  2010-3.6     | 2011-3.7     | 2012-3.8    | 0.560 | 0.579 | 0.704 | 0.887
  2011-3.7     | 2012-3.8     | 2013-3.9    | 0.448 | 0.444 | 0.658 | 0.569
  2012-3.8     | 2013-3.9     | 2014        | 0.916 | 0.224 | 0.890 | 0.472
  2013-3.9     | 2014         | 2015-04     | 0.971 | 0.506 | 0.965 | 0.770
  2014         | 2015-04      | 2015-10     | 0.989 | 0.650 | 0.971 | 0.820
  2015-04      | 2015-10      | 2016-04     | 0.945 | 0.196 | 0.908 | 0.068

Table 4: Example predictions and mispredictions, using 2015-04 → 2015-10 as training and testing on 2016-04.

Correctly predicted:
  Iasi Botanical Garden: constant
  USS Breakwater (SP-681): constant
  Interborough Handicap: constant
  Thode Island: added archipelago → Marshall Archipelago
  Colonel Reeves Stakes: added location → Perth, location → Australia

Incorrectly predicted as add-only:
  Beverly Hills Handicap: disappears due to name change
  First Ward Park: disappears due to name change
  2012 Shonan Bellmare season: changes league → 2012 J. League Division 2 to league → 2012 J.League Division 2

5.1 Discussion

How useful are these results? For the task we have in mind, building statistically plausible future versions of existing semantic graphs for the purpose of testing Natural Language Generation algorithms (Duboue et al., 2016), successfully predicting add-only nodes helps us immediately with the performance of the prediction system. Our high precision results will then carry over into direct improvements of our full system: if our system has an error rate of 30% and 25% of the nodes are add-only, our current system will reduce the error by up to 12% (in the case of 50% recall).
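Our reading of the arithmetic behind that "up to 12%" figure, spelled out (the interpretation as a relative reduction is ours):

<pre>
# If 25% of nodes are add-only and we recover half of them with
# (near-)perfect precision, 12.5% of all nodes can be copied over
# verbatim and contribute no error: a relative reduction of ~12%.
add_only_fraction = 0.25
recall = 0.50
frozen = add_only_fraction * recall   # 0.125 of all nodes
print(frozen)                         # 0.125 -> "up to 12%"
</pre>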
Another use case is maintaining the data: the add-only nodes and relations can be pre-cached using more efficient data structures, such as perfect hashes (Botelho and Ziviani, 2007).

6 Conclusions and Future Work

In Natural Language Generation (Reiter and Dale, 2000), Referring Expressions Generation (REG) is the task that, given an entity (the referent) and a set of competing entities (the set of distractors), involves creating a mention of the referent so that, in the eyes of the reader, it is clearly distinguishable from any other entity in the set of distractors. REG algorithms are therefore expected to select attributes that unambiguously identify an entity with respect to a set of distractors.

Our current work is part of a plan to simulate natural perturbations on the data in order to find the conditions under which REG algorithms start to fail (for example, on a simulated DBpedia 25 years in the future).

In previous work we explored the robustness of Referring Expressions Generation (REG) algorithms by means of different versions of an ontology (Duboue et al., 2016). In (Duboue and Domínguez, 2016) we presented experiments on two types of entities (people and organizations); using different versions of DBpedia, we found that the robustness of the tuned algorithm and of its parameters do coincide, but more work is needed to learn these parameters from data in a generalizable fashion.

We plan to extend the current model with a specific model for additions and deletions using techniques from statistical machine translation (Koehn, 2010), and to investigate techniques based on knowledge embedding models (Xie et al., 2017).

Acknowledgments

The authors would like to thank the Secretaría de Ciencia y Técnica of Córdoba Province for support and the anonymous reviewers for helpful comments and suggestions.

References

Fabiano C. Botelho and Nivio Ziviani. 2007. External perfect hashing for very large key sets. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM '07). ACM, New York, NY, USA, pages 653–662. https://doi.org/10.1145/1321440.1321532

S. Cheng, A. Termehchy, and V. Hristidis. 2014. Efficient prediction of difficult keyword queries over databases. IEEE Transactions on Knowledge and Data Engineering 26(6):1507–1520. https://doi.org/10.1109/TKDE.2013.140

Brett Drury, Jorge Carlos Valverde-Rebaza, and Alneu de Andrade Lopes. 2015. Causation generalization through the identification of equivalent nodes in causal sparse graphs constructed from text using node similarity strategies. In Proceedings of the 2nd Annual International Symposium on Information Management and Big Data - SIMBig 2015, Cusco, Peru, September 2-4, 2015, pages 58–65. http://ceur-ws.org/Vol-1478/paper6.pdf

Pablo Ariel Duboue and Martín Ariel Domínguez. 2016. Using robustness to learn to order semantic properties in referring expression generation. Springer International Publishing, Cham, pages 163–174.

Pablo Ariel Duboue, Martín Ariel Domínguez, and Paula Estrella. 2016. On the robustness of standalone referring expression generation algorithms using RDF data. WebNLG 2016, page 17.

Johann Eder and Christian Koncilia. 2004. Modelling changes in ontologies. In Proceedings of the On The Move Federated Conferences, OTM 2004. Springer, LNCS 3292, pages 662–673.

Tomi Kauppinen and Eero Hyvönen. 2007. Modeling and reasoning about changes in ontology time series. In Integrated Series in Information Systems. Springer-Verlag, pages 319–338.

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York, NY, USA, 1st edition.

Ora Lassila, Ralph R. Swick, and World Wide Web Consortium. 1998. Resource Description Framework (RDF) model and syntax specification.

Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Christian Bizer. 2015. DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal 6(2):167–195.

Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press.

Anisa Rula, Luca Panziera, Matteo Palmonari, and Andrea Maurino. 2014. Capturing the currency of DBpedia descriptions and get insight into their validity. In Proceedings of the 5th International Workshop on Consuming Linked Data (COLD 2014), co-located with the 13th International Semantic Web Conference (ISWC 2014), Riva del Garda, Italy, October 20, 2014.

Darko Stefanović, Kathryn S. McKinley, and J. Eliot B. Moss. 1999. Age-based garbage collection. ACM SIGPLAN Notices 34(10):370–381.
Qizhe Xie, Xuezhe Ma, Zihang Dai, and Eduard Hovy. 2017. An interpretable knowledge transfer model for knowledge base completion. In ACL 2017: Proceedings of the Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Vancouver, Canada.