Scalable Link Discovery for Modern Data-Driven Applications

Kleanthi Georgala

Universität Leipzig, Institut für Informatik, AKSW, Postfach 100920, D-04009 Leipzig, Germany
georgala@informatik.uni-leipzig.de

Abstract. The constant growth in volume and velocity of knowledge bases on the Linked Data Web has led to an increasing need for scalable techniques for linking resources. Modern data-driven applications often have to integrate large amounts of data, relying on fast yet accurate Link Discovery (LD) solutions that frequently operate under time or space constraints. Additionally, most LD frameworks rely on complex link specifications to determine candidates for links, for which the scalability of execution is of significant importance. The main focus of our work is the implementation of time-efficient and scalable data linking approaches that utilize Semantic Web technologies. In this work, we address these LD challenges by presenting a novel approach for time-constrained linking and for the efficient computation and scalable execution of link specifications, with applications to periodically updated knowledge bases.

1 Problem Statement

Over the last years, the Linked Data Web has grown to contain billions of triples distributed over hundreds of knowledge bases (KBs) [1]. Recent technological progress in hardware and network infrastructures has led to the collection of large amounts of data in scenarios as diverse as monitoring industrial plants [6], monitoring open SPARQL endpoints [15], and implementing the Internet of Things (IoT) and Cloud Computing [2]. The efficient identification of links between KBs is one of the key challenges when creating a linked data set, as reflected by the fourth Linked Data principle (http://www.w3.org/DesignIssues/LinkedData.html). Current LD frameworks utilize complex link specifications (LSs) to identify links between KBs by implementing a sequence of independent steps: the initial LS is potentially rewritten, then planned and finally executed. Most planners create a static plan for the initial LS, using a set of cost functions to estimate the runtime of the LS. However, this linear process towards the execution of a LS misses one important aspect that significantly influences the performance of the LD framework: the execution engine knows more about the actual runtimes of a LS than the planner itself.

Furthermore, the increasing need for scalable and time-efficient linking approaches comes at the cost of completeness. In real-time applications, such as Linked-Data-driven manufacturing infrastructures and complex event processing (CEP), the data changes rapidly and events are updated periodically. As a result, LD and rule-based CEP frameworks must complete their learning and update process within a strict time frame. Additionally, performing accurate LD under time constraints has been addressed by various machine learning algorithms. Herein, the main challenge is to efficiently identify an appropriate set of link candidates for training the machine learning algorithms.

In this work, we present a novel approach to scalable LD for modern data-driven applications, focusing on the time-efficient execution of link specifications under constraints pertaining to the expected recall of the LD task.
In contrast to state-of-the-art approaches, we transform the LS execution task into a dynamic process in which the planner re-plans a LS using information provided by the execution engine. Furthermore, we propose the (to the best of our knowledge) first partial-recall LD approach, which efficiently computes a portion of the links returned by a LS while achieving a guaranteed expected recall.

2 Relevancy

Scalable execution of LSs and time-efficient computation of links can provide solutions to key issues of both the scientific community and industry. The constant growth of Semantic Web technologies has led to rapid growth in the number and size of Linked Data sets. For example, data sets such as LinkedGeoData (http://linkedgeodata.org) have grown to more than 30 billion triples over the years. Additionally, independent providers are constantly publishing data sets that pertain to related or even the same real-world entities. Linking and integrating these constantly growing KBs is a task that requires both fast and efficient LD techniques.

Additionally, identifying an appropriate set of link candidates is a major challenge for machine learning algorithms for LD. Supervised and semi-supervised learning algorithms often have to train under time or space constraints, or in the absence of a full set of training examples. Therefore, our proposed method for sampling links can prove beneficial for such algorithms. Another important research area to which our approach can be applied is rule-based CEP. The increasing amount of sensor data, along with its heterogeneous nature, requires scalable and flexible methods. The benefits for rule-based CEP models are similar to those for machine learning algorithms, since these approaches focus on detecting and classifying new events based on the knowledge obtained from classifiers trained on the existing sensor data.

3 Hypotheses and Research Questions

This thesis includes the following hypotheses and research questions involving fast and scalable LD:

(H1): Given an initial LS, a recall constraint and a refinement time constraint, a presumably subsumed LS will achieve a lower execution runtime than the initial LS while complying with the input recall constraint. (Q1): Will the subsumed LS be more efficient than the initial LS in terms of time complexity? (Q2): How will different values of the predefined recall constraint influence the performance of the algorithm? (Q3): To what extent will different values of the refinement time constraint influence the overall runtime of our approach? (Q4): How much will the sampling of links influence supervised machine learning for LD?

(H2): The execution engine knows more about the runtimes of a LS than the planner itself; therefore, by introducing a dynamic flow of information between the engine and the planner, a LD framework will execute a LS faster. (Q1): Will the overall execution time of a LS be significantly improved by dynamic planning? (Q2): How much time will the dynamic planner spend planning? (Q3): How will different sizes of a LS influence the overall performance of the dynamic planner? (Q4): How does the dynamic planning algorithm perform compared to state-of-the-art static approaches?

4 Proposed Approach

In this section, we present the main idea behind our approach for fast and scalable LD. We begin by giving a brief but formal description of the fundamental concepts of LD. Then, we explain in detail our approach for dynamic planning and for ensuring selectivity when executing a link specification.
4.1 Preliminaries

A knowledge base K is a set of triples (s, p, o) ∈ (R ∪ B) × P × (R ∪ B ∪ L), where R is the set of all RDF resources, P ⊆ R is the set of all RDF properties, B is the set of all RDF blank nodes and L is the set of all literals. A LD framework aims to compute the set M = {(s, t) ∈ S × T : R(s, t)}, where S and T are sets of RDF resources and R is a binary relation. Given that M is commonly difficult to compute directly, LD frameworks compute an approximation M′ ⊆ S × T of M by executing a LS.

An atomic LS L is a pair L = (m, θ), where m is a similarity measure that compares properties of pairs (s, t) from S × T and θ is a similarity threshold. LSs can be combined by means of operators and filters. Here, we consider the binary operators ⊔ (union), ⊓ (intersection) and \ (difference). Filters are pairs (f, τ), where (1) f is either empty or a combination of similarity measures joined by means of specification operators, and (2) τ is a threshold. A complex LS L is a triple (f, τ, ω(L1, L2)), where ω is a specification operator, (f, τ) is a filter, and L1 and L2 are the left and the right child of L, respectively. Note that an atomic specification can be regarded as a filter (f, τ, X) with [[X]] = S × T. We call (f, τ) the filter of L and denote it by ϕ(L). We denote the operator of a LS L by op(L); for L = (f, τ, ω(L1, L2)), op(L) = ω. We denote the semantics of a LS L (i.e., the results of L for given sets of resources S and T) by [[L]] and call it a mapping.
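To make the formalism above concrete, the following minimal Python sketch models atomic and complex LSs and computes [[L]] by naive enumeration over toy data. All names (sim, Atomic, Complex, evaluate) are hypothetical and not taken from any LD framework; production systems such as LIMES avoid the quadratic pair enumeration shown here by means of time-efficient filtering algorithms, so the sketch only fixes the semantics.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Union

def sim(a: str, b: str) -> float:
    """Stand-in similarity measure m(ps, pt) with values in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

@dataclass(frozen=True)   # frozen: specifications are hashable, so they
class Atomic:             # can be cached and deduplicated in later sketches
    theta: float          # L = (m, theta)

@dataclass(frozen=True)
class Complex:            # L = (f, tau, omega(L1, L2)); f is left empty here
    op: str               # 'union' | 'intersection' | 'difference'
    left: 'LS'
    right: 'LS'
    tau: float = 0.0      # filter threshold

LS = Union[Atomic, Complex]

def evaluate(L: LS, S: list, T: list) -> set:
    """Compute the mapping [[L]] by naive enumeration of S x T."""
    if isinstance(L, Atomic):
        return {(s, t) for s in S for t in T if sim(s, t) >= L.theta}
    M1, M2 = evaluate(L.left, S, T), evaluate(L.right, S, T)
    M = {'union': M1 | M2,
         'intersection': M1 & M2,
         'difference': M1 - M2}[L.op]
    # apply the filter (f, tau); with an empty f only the threshold acts
    return {(s, t) for (s, t) in M if sim(s, t) >= L.tau}

S = ["Leipzig", "Berlin"]
T = ["Leipzig (Saxony)", "Berlin, Germany"]
L = Complex('intersection', Atomic(0.5), Atomic(0.55))
print(evaluate(L, S, T))  # pairs whose similarity clears both thresholds
```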
4.2 Approach

As introduced in Sect. 3, our hypothesis consists of two parts: (1) discovering a presumably subsumed LS L in order to efficiently compute a portion of the links returned by the initial LS while achieving a guaranteed expected recall, and (2) dynamic planning, which changes the existing static execution infrastructure in order to further improve the runtime of the subsumed LS.

Linking with Guaranteed Partial Recall. Given an initial LS L0, the main goal of our approach is to discover a LS L subsumed by L0 that can be given as input to our dynamic execution infrastructure; once executed, its result mapping is a portion of the links retrieved by L0, achieving a guaranteed expected recall. Our approach relies on a refinement operator, which allows exploring the space of potential solutions to this problem efficiently.

Definition 1 (Subsumption of Specifications). The LS L is subsumed by the LS L′ (denoted L ⊑ L′) when [[L]] ⊆ [[L′]] for all possible S and T.

A key observation underlying our approach is that if the interpretations of L1 = (m(ps, pt), θ) and L2 = (m(ps, pt), θ′) are carried out on the same sets S and T, then the following holds:

Proposition 1. ∀θ, θ′ ∈ [0, 1]: θ > θ′ → (m(ps, pt), θ) ⊑ (m(ps, pt), θ′).

Proposition 2. ⊑ is a quasi-ordering (i.e., reflexive and transitive) on LS.

Definition 2 (Refinement Operator and Properties). In the quasi-ordered space (LS, ⊑), we call any function from LS to 2^LS an (LS) operator. A downward refinement operator ρ is an operator such that for all L ∈ LS, L′ ∈ ρ(L) implies L′ ⊑ L. L′ is called a specialisation of L; we denote L′ ∈ ρ(L) by L ⇝ρ L′. A refinement operator r over a quasi-ordered space (S, ≼) abides by the following criteria: (1) r is finite iff r(s) is finite for all s ∈ S. (2) r is proper if ∀s ∈ S: s′ ∈ r(s) ⇒ s ≠ s′. (3) r is complete if for all s and s′, s′ ≼ s implies that there is an s″ with s″ ≼ s′ ∧ s′ ≼ s″ such that a refinement chain from s to s″ exists. (4) r is redundant if two different refinement chains can exist between s ∈ S and s′ ∈ S.

We define our refinement operator over the space (LS, ⊑) as follows:

ρ(L) =
  ∅                              if L = L∅,
  L∅                             if L = (m(ps, pt), 1),
  (m(ps, pt), next(θ))           if L = (m(ps, pt), θ) ∧ θ < 1,    (1)
  (ρ(L1) ⊔ L2) ∪ (L1 ⊔ ρ(L2))    if L = L1 ⊔ L2,
  (ρ(L1) ⊓ L2) ∪ (L1 ⊓ ρ(L2))    if L = L1 ⊓ L2,
  ρ(L1) \ L2                     if L = L1 \ L2.

In words, our operator works as follows: If L is the empty specification L∅, we return the empty set of specifications, i.e., L is not refined any further. If L is an atomic specification with a threshold of 1, our approach returns L∅; by these means, we can compute refinement chains from L = L1 ⊔ L2 to L1 and L2. If θ < 1, our approach alters the threshold θ by applying the next function. Formally, for given input data sets S and T and any θ, next always returns values from N(m(ps, pt)) = {n : ∃(s, t) ∈ S × T : n = m(ps, pt)} ∪ {∅}. Given a threshold θ, ∅ is returned if L is to be refined to the empty specification; else, next returns the smallest value from N(m(ps, pt)) that is larger than θ. Note that (m(ps, pt), next(θ)) ⊑ (m(ps, pt), θ) always holds. If L is complex, the refinement depends on the operator of the specification. To explicate the set of LSs returned by ρ, we extend the semantics of op(ρ(L1), L2) resp. op(L1, ρ(L2)) to be the set of all specifications that can be computed by applying op to all L′ ∈ ρ(L1) resp. L′ ∈ ρ(L2). If op = ⊓ or op = ⊔, then ρ returns the union of all specifications that can be generated by applying ρ to one child of L and combining the result with the other child. If op = \, we combine all refinements of L's left child with L's right child. We do not do this the other way around for this particular operator simply because ρ would then not be a refinement operator, as we could not guarantee ρ(L) ⊑ L.
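The operator of Eq. (1) translates almost directly into code. The sketch below is again illustrative and builds on the hypothetical Atomic/Complex classes of the previous listing: next is realized over an explicit list of observed similarity values, and combinations with the empty specification are simplified away, which produces the refinement chains from L1 ⊔ L2 to L1 and L2 mentioned above.

```python
# The empty specification and the refinement operator rho of Eq. (1);
# rho reuses the Atomic/Complex classes from the sketch above.
EMPTY = None  # stands for the empty specification

def next_theta(theta: float, observed: list) -> float:
    """The next function: smallest observed similarity value above theta,
    or None if the atomic LS is to be refined to the empty specification."""
    larger = sorted(v for v in set(observed) if v > theta)
    return larger[0] if larger else None

def combine(op: str, L1, L2):
    """Combine children, simplifying around the empty specification
    (e.g. EMPTY union L2 is L2), yielding the chains from L1 to L2."""
    if L1 is EMPTY:
        return L2 if op == 'union' else EMPTY
    if L2 is EMPTY:
        return L1 if op in ('union', 'difference') else EMPTY
    return Complex(op, L1, L2)

def rho(L, observed: list) -> set:
    """Return the set of refinements of L."""
    if L is EMPTY:
        return set()                        # L_empty is not refined further
    if isinstance(L, Atomic):
        if L.theta >= 1.0:
            return {EMPTY}                  # threshold 1 refines to L_empty
        t = next_theta(L.theta, observed)
        return {EMPTY if t is None else Atomic(t)}
    out = {combine(L.op, Lp, L.right) for Lp in rho(L.left, observed)}
    if L.op != 'difference':                # only the left child of \ is refined
        out |= {combine(L.op, L.left, Lp) for Lp in rho(L.right, observed)}
    return out

# refining a union touches one child at a time
L = Complex('union', Atomic(0.8), Atomic(0.9))
print(rho(L, observed=[0.7, 0.8, 0.85, 0.9, 1.0]))
```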
In our initial implementation, our algorithm (C-RO) takes as input the desired expected recall k ∈ [0, 1] and a refinement time constraint maxOpt; it then estimates the selectivity of L0 using an oracle and computes the desired selectivity as a fraction thereof using k. The approach starts by initialising a refinement tree with the given LS L0 and proceeds by selecting the previously unvisited LS of the tree that (1) has the lowest expected runtime and (2) abides by the settings provided by the user. Our algorithm computes the whole refinement of this LS, by virtue of ρ's finiteness. Redundant refinement results are subsequently detected (as ρ is redundant) and not added to the tree. Our algorithm terminates when the refinement tree has been explored completely or the time threshold for running the refinements has been reached. Additionally, we have implemented a variation of this approach that makes use of the potential monotonicity of the runtimes of specifications (RO-MA).
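The C-RO loop just described can be sketched schematically as follows; this is our reading of the algorithm, not its actual implementation. The selectivity oracle and the runtime estimates (estimate_selectivity, estimate_runtime) are stubbed, user-supplied functions, the refinement tree is represented implicitly by a priority queue ordered by expected runtime, and pruning relies on the fact that refinements are subsumed and therefore cannot regain selectivity.

```python
# Schematic sketch of the C-RO search loop; reuses rho/EMPTY from the
# previous sketch. All names are hypothetical.
import heapq
import time

def c_ro(L0, k, max_opt, estimate_runtime, estimate_selectivity, observed):
    """Search refinements of L0 until the tree is exhausted or the time
    budget max_opt (seconds) is spent; return the cheapest LS found whose
    expected selectivity still meets k * selectivity(L0)."""
    target = k * estimate_selectivity(L0)       # desired selectivity
    deadline = time.monotonic() + max_opt
    frontier = [(estimate_runtime(L0), 0, L0)]  # (cost, tiebreak, LS)
    seen, best, counter = {L0}, L0, 1
    while frontier and time.monotonic() < deadline:
        _, _, L = heapq.heappop(frontier)       # cheapest unvisited LS
        for Lp in rho(L, observed):             # full refinement (rho is finite)
            if Lp is EMPTY or Lp in seen:       # redundant results are dropped
                continue
            seen.add(Lp)
            # refinements are subsumed, so selectivity only decreases: an LS
            # below the target can be pruned along with all its refinements
            if estimate_selectivity(Lp) >= target:
                heapq.heappush(frontier, (estimate_runtime(Lp), counter, Lp))
                counter += 1
                if estimate_runtime(Lp) < estimate_runtime(best):
                    best = Lp
    return best
```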
Dynamic Planning. The basic idea behind our novel planning approach is to construct a plan with a small expected runtime for a LS in a non-linear fashion. Our method incorporates a cost function cost, which approximates the expected runtime of a plan P of a LS. The basic insight behind the approach is that if P corresponds to a complex specification L, i.e., L = (f, τ, ω(L1, L2)), then running a plan for L1 or L2 can help overwrite the initial approximation of the cost function with a better estimate and thus lead to the plan for L being reshaped and made more efficient. Additionally, through the constant flow of information between the engine and the planner, our method re-uses previous result sets, ensuring that duplicated steps are executed exactly once. Our existing implementation consists of two basic functions: plan and execute.

Plan. Given a LS L as input, the plan function returns a plan P(L) with the smallest expected cost according to the current cost(P) function. Our plan function distinguishes between executed and non-executed LSs; if a LS has been executed previously, it is not planned again. If L is atomic, its plan consists of a simple run command. If L = (f, τ, ω(L1, L2)), the algorithm derives plans for L1 and L2 sequentially, computes the possible plans for L and then selects the least costly one. At this point, the plan function discriminates the possible plans for L based on its operator. If op(L) = ⊔, then P(L) consists of executing the plans for L1 and L2 and then computing the union of the resulting sets. If op(L) = ⊓ or \, the algorithm can choose between two alternatives: (1) executing both plans for L1 and L2 and then performing the corresponding set operation, or (2) executing the plan of one of the children and using the other one as a filter on the resulting mapping. The final plan for L is chosen based on the cost function. As a further optimization, if both children have already been executed, the algorithm is forced to choose the first alternative, since the estimated cost of this plan is 0. If only one of the child LSs has been executed, then this child cannot be used as a filter operator, and the possible set of plans changes accordingly.

Execute. Similarly to the plan function, execute takes as input a LS L and returns the corresponding mapping M. Initially, the execution engine is informed by the planner whether or not the plan of a LS has been executed before and, if so, retrieves the cached set of links. For a non-executed LS, execute checks whether a LS L′ with [[L]] ⊆ [[L′]] has already been executed; if so, it retrieves the corresponding set of links and applies a filtering function using (f, τ) = ϕ(L). If no such LS exists, the algorithm discriminates between atomic and complex LSs. For an atomic LS, the engine executes the corresponding plan returned by the plan function. For a complex LS, the algorithm calls the plan function, executes the first sub-plan assigned to P(L), re-plans the remaining steps of L by invoking the plan function again, and then proceeds to execute the second sub-plan of P(L), which can differ from the initially derived sub-plan given the intermediate results of the first sub-plan.
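The following minimal sketch illustrates the plan/execute interplay under several simplifying assumptions: the first sub-plan is always the left child, only atomic right children may act as filters, and the subsumption-based reuse of cached mappings ([[L]] ⊆ [[L′]]) is omitted. It is meant to show the two core mechanisms described above, re-planning after the first sub-plan and result caching, rather than the actual CONDOR implementation; cost(kind, L) and run_atomic(L, S, T) are stubbed, user-supplied functions.

```python
# Sketch of the plan/execute interplay; names are hypothetical and do
# not reflect the CONDOR/LIMES API. Reuses sim/Atomic from the first sketch.
cache = {}  # LS -> mapping, shared between planner and engine

def plan(L, cost) -> str:
    """Return the cheapest plan kind for L under current cost estimates."""
    if isinstance(L, Atomic):
        return 'run'                    # a simple run command
    if L.op == 'union':
        return 'setop'                  # execute both children, then union
    if L.left in cache and L.right in cache:
        return 'setop'                  # both executed: estimated cost is 0
    options = ['setop']
    if L.right not in cache and isinstance(L.right, Atomic):
        options.append('filter')        # an executed child cannot filter
    return min(options, key=lambda kind: cost(kind, L))

def execute(L, cost, run_atomic, S, T) -> set:
    """Execute L, caching mappings and re-planning between sub-plans."""
    if L in cache:
        return cache[L]                 # duplicated steps run exactly once
    if isinstance(L, Atomic):
        M = run_atomic(L, S, T)
    else:
        M1 = execute(L.left, cost, run_atomic, S, T)  # first sub-plan
        choice = plan(L, cost)          # re-plan: executing L1 gave the
                                        # planner better cost information
        if choice == 'filter':
            keep = (L.op == 'intersection')
            # the right child filters the intermediate mapping: its measure
            # is evaluated on the pairs of M1 only, not on all of S x T
            M = {(s, t) for (s, t) in M1
                 if (sim(s, t) >= L.right.theta) == keep}
        else:
            M2 = execute(L.right, cost, run_atomic, S, T)
            M = {'union': M1 | M2, 'intersection': M1 & M2,
                 'difference': M1 - M2}[L.op]
    cache[L] = M
    return M
```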
5 Evaluation Plan

In order to test our hypotheses and research questions in practice, we will use two groups of datasets: (1) a set of benchmark datasets (Abt-Buy, Amazon-Google Products, DBLP-ACM and DBLP-Scholar) described in [5], and (2) a set of datasets (MOVIES, TOWNS and VILLAGES) constructed to test the scalability of our methodology using the real data sets DBpedia, LinkedGeoData and LinkedMDB (http://www.linkedmdb.org/). All LSs used during our experiments will be generated automatically by the unsupervised version of the genetic-programming-based machine learning approach EAGLE [12].

For the partial-recall algorithm, we plan to report the overall runtime of our approach (the initial algorithm C-RO and its variation RO-MA) along with that of the baseline method of running the original LS, and to compare their performance in terms of time efficiency. Additionally, we will explore the performance of our method for different values of (1) the desired expected recall k and (2) the maximum time maxOpt for finding a subsumed LS. Finally, we would like to study the loss in F-measure of a machine learning approach when presented with the results of partial-recall link discovery, in comparison with the F-measure it would achieve using the full results. For the dynamic planning approach CONDOR, we will compare the execution time of our dynamic method with that of the state-of-the-art CANONICAL and HELIOS planners as implemented in LIMES [10]. Note that the CANONICAL planner serves as the baseline method, since it incorporates the static, linear execution infrastructure.

6 Preliminary Results

Figure 1 presents preliminary results of applying our methods for scalable LD to the DBLP-ACM dataset. The left plot shows execution times for the baseline, C-RO and RO-MA for k = 0.2 and maxOpt = 1.6 s. The right plot compares the runtimes of CANONICAL, HELIOS and CONDOR. In both plots, the x-axis represents the number of specifications and the y-axis the cumulative execution time in seconds.

[Fig. 1. Preliminary Results for DBLP-ACM: (a) Partial Recall (baseline, C-RO, RO-MA); (b) Dynamic Planning (CANONICAL, HELIOS, CONDOR).]

7 Related Work

Over the last few years, a large number of frameworks, such as SILK [4], LIMES [9] and KnoFuss [13], were developed to address the scalability of link discovery; they rely on scalable approaches for computing simple and complex specifications. For example, the SILK framework implements MultiBlock [4], a multi-dimensional blocking approach. KnoFuss [13], on the other hand, implements classical blocking approaches derived from databases. Zhishi.links [14] is another framework that scales through an indexing-based approach. These approaches are not guaranteed to achieve result completeness. Such theoretical and practical guarantees of completeness and efficiency are given by the LIMES framework. LIMES reduces the time complexity of the LD procedure by combining techniques such as PPJoin+ [16] and HR3 [8] with set-theoretical operators and planning algorithms. The problem of identifying appropriate LSs using machine learning techniques has been explored in various previous works that focus on minimizing the need for training examples. EAGLE [12], RAVEN [11] and AGP [3] are some of the approaches that require fewer labeled examples while maintaining a high level of accuracy. The only method that focuses on identifying informative link candidates for training is COALA [7].

8 Conclusions and Future Work

In this thesis, we present an approach for fast and scalable LD for data-driven applications using Semantic Web technologies. Our contribution will be two-fold: (1) the first partial-recall LD approach, which uses a refinement operator and insights pertaining to the subsumption of LSs in order to efficiently detect LSs with guaranteed expected recall, and (2) a dynamic planning approach with subsumptions and result caching. During the completion of this PhD, we aim to investigate and evaluate the hypotheses and research questions proposed above by comparing our method with the state of the art.

References

1. Auer, S., Lehmann, J., Ngonga Ngomo, A.C.: Introduction to Linked Data and Its Lifecycle on the Web, pp. 1–75. Springer Berlin Heidelberg, Berlin, Heidelberg (2011)
2. Mehdiyev, N., Krumeich, J., Enke, D., Werth, D., Loos, P.: Determination of Rule Patterns in Complex Event Processing Using Machine Learning Techniques. Procedia Computer Science 61, 395–401 (2015)
3. de Freitas, J., Pappa, G., da Silva, A., Gonçalves, M., Moura, E., Veloso, A., Laender, A., de Carvalho, M.: Active Learning Genetic Programming for Record Deduplication. In: Evolutionary Computation (CEC), 2010 IEEE Congress on. pp. 1–8 (July 2010)
4. Isele, R., Jentzsch, A., Bizer, C.: Efficient Multidimensional Blocking for Link Discovery without Losing Recall. In: WebDB (2011)
5. Köpcke, H., Thor, A., Rahm, E.: Evaluation of entity resolution approaches on real-world match problems. PVLDB 3(1), 484–493 (2010)
6. Loskyll, M., Schlick, J., Hodek, S., Ollinger, L., Gerber, T., Prvu, B.: Semantic service discovery and orchestration for manufacturing processes. In: Emerging Technologies & Factory Automation (ETFA), 2011 IEEE 16th Conference on. pp. 1–8 (Sept 2011)
7. Ngonga Ngomo, A.C., Lyko, K., Christen, V.: COALA – Correlation-Aware Active Learning of Link Specifications, pp. 442–456. Springer Berlin Heidelberg, Berlin, Heidelberg (2013)
8. Ngonga Ngomo, A.C.: Link Discovery with Guaranteed Reduction Ratio in Affine Spaces with Minkowski Measures, pp. 378–393. Springer Berlin Heidelberg, Berlin, Heidelberg (2012)
9. Ngonga Ngomo, A.C.: On Link Discovery using a Hybrid Approach. Journal on Data Semantics 1(4), 203–217 (2012)
10. Ngonga Ngomo, A.C.: HELIOS – Execution Optimization for Link Discovery, pp. 17–32. Springer International Publishing, Cham (2014)
11. Ngonga Ngomo, A.C., Lehmann, J., Auer, S., Höffner, K.: RAVEN – Active Learning of Link Specifications. In: Proceedings of OM@ISWC (2011)
12. Ngonga Ngomo, A.C., Lyko, K.: EAGLE: Efficient Active Learning of Link Specifications Using Genetic Programming, pp. 149–163. Springer Berlin Heidelberg, Berlin, Heidelberg (2012)
13. Nikolov, A., d'Aquin, M., Motta, E.: Unsupervised learning of link discovery configuration. In: 9th Extended Semantic Web Conference (ESWC 2012) (2012)
14. Niu, X., Rong, S., Zhang, Y., Wang, H.: Zhishi.links results for OAEI 2011. Ontology Matching, p. 220 (2011)
15. Saleem, M., Ali, M.I., Hogan, A., Mehmood, Q., Ngonga Ngomo, A.C.: LSQ: The Linked SPARQL Queries Dataset, pp. 261–269. Springer International Publishing, Cham (2015)
16. Xiao, C., Wang, W., Lin, X., Yu, J.X., Wang, G.: Efficient Similarity Joins for Near-duplicate Detection. ACM Trans. Database Syst. 36(3), 15:1–15:41 (Aug 2011)