=Paper=
{{Paper
|id=Vol-2788/om2020_Tpaper10
|storemode=property
|title=LIGER - link discovery with partial recall
|pdfUrl=https://ceur-ws.org/Vol-2788/om2020_STpaper4.pdf
|volume=Vol-2788
|authors=Kleanthi Georgala,Mohamed Ahmed Sherif,Axel-Cyrille Ngonga Ngomo
|dblpUrl=https://dblp.org/rec/conf/semweb/GeorgalaSN20
}}
==LIGER - link discovery with partial recall==
LIGER – Link Discovery with Partial Recall

Kleanthi Georgala¹,², Mohamed Ahmed Sherif², and Axel-Cyrille Ngonga Ngomo¹,²
¹ Department of Computer Science, Paderborn University, Germany
² Department of Computer Science, University of Leipzig, Germany
georgala@informatik.uni-leipzig.de, {mohamed.sherif,axel.ngonga}@upb.de

Abstract. In this work, we present a novel approach for link discovery under constraints pertaining to the expected recall of a link discovery task. Given a link specification, the approach aims to find a subsumed link specification that achieves a lower run time than the input specification while abiding by a predefined constraint on the expected recall it has to achieve. Our approach combines downward refinement operators with monotonicity assumptions to detect such specifications. Our results suggest that our different implementations can detect subsumed specifications that abide by expected recall constraints efficiently, thus leading to significantly shorter overall run times than our baseline.

1 Introduction

Sensor data is used in a plethora of modern Industry-4.0 applications such as condition monitoring and predictive maintenance. An increasing number of such machines generate knowledge graphs in the Resource Description Framework (RDF) format. A key step for learning axioms which generalize well is to learn them across several machines. However, single machines generate independent data streams. Hence, time-efficient data integration (in particular link discovery, short LD) approaches must precede the machine learning approaches to integrate data streams from several machines. Given that new data batches are available periodically (e.g., every 2 hours), practical applications of machine learning on RDF streams demand LD solutions which can guarantee the completion of their computation under constraints such as time (i.e., their total run time for a particular integration task) or expected recall (i.e., the estimated fraction of a given LD task they are guaranteed to complete).

In this paper, we address the problem of LD with partial recall by proposing LIGER, the first partial-recall LD approach. Given a link specification L that is to be executed, LIGER aims to compute a portion of the links returned by L efficiently, while achieving a guaranteed expected recall. LIGER relies on a refinement operator, which allows the efficient exploration of potential solutions to this problem. The main contributions of our work are: (1) we present a downward refinement operator that allows the detection of subsumed LSs with partial recall; (2) we use a monotonicity assumption to improve the time efficiency of our approach; (3) we evaluate LIGER using benchmark datasets.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). This work has been supported by the EU H2020 project KnowGraphs (GA no. 860801) as well as the BMVI projects LIMBO (GA no. 19F2029C) and OPAL (GA no. 19F2028A).

2 Linking with Guaranteed Expected Recall

A knowledge base K is a set of triples (s, p, o) ∈ (I ∪ B) × I × (I ∪ B ∪ L), where I is the set of all Internationalized Resource Identifiers (IRIs), B is the set of all RDF blank nodes and L is the set of all literals. Given two sets of RDF resources S and T from two (not necessarily distinct) knowledge bases as well as a relation R, the main goal of LD is to discover the mapping µ = {(s, t) ∈ S × T : R(s, t)}.
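As a concrete illustration of this task, the following minimal sketch computes the mapping of an atomic similarity filter over S × T. The token-level Jaccard measure and the toy data are illustrative stand-ins, not the measures or datasets used by LIGER.

```python
# Minimal sketch of the core LD task from Section 2: compute the mapping
# of pairs whose similarity meets a threshold. The Jaccard token overlap
# is only a stand-in for the measures m : S x T x P^2 -> [0, 1] used by
# LIMES-style frameworks; names and data are illustrative.

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def atomic_mapping(S, T, measure, theta):
    """[[(m, theta)]]: all pairs (s, t) with m(s, t) >= theta."""
    return {(s, t) for s in S for t in T if measure(s, t) >= theta}

S = ["berlin city", "paris"]
T = ["berlin", "london city"]
mapping = atomic_mapping(S, T, jaccard, 0.5)
print(mapping)                           # {('berlin city', 'berlin')}
print(len(mapping) / (len(S) * len(T)))  # sel(L) = |[[L]]| / |S x T| = 0.25
```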
To achieve this goal, declarative LD frameworks rely on link specifications (LSs), which describe the conditions under which R(s, t) can be assumed to hold for pairs (s, t) ∈ S × T. Several grammars have been used for describing LSs in previous works [3]. In general, these grammars assume that a LS consists of (i) similarity measures m (m : S × T × P² → [0, 1]), through which property values of resources found in the input datasets S and T can be compared, and (ii) operators op. An atomic LS is a filter f = (m, τ), where m is a similarity measure and τ ∈ [0, 1] is a threshold. Operators combine two LSs L1 and L2 to a more complex specification L = (f, τ, op(L1, L2)). For L = (f, τ, op(L1, L2)), we call op the operator of L. We define the mapping [[L]] ⊆ S × T of the LS L as the set of links that will be computed by L when applied to S × T. We denote the size of a mapping [[L]] by |[[L]]|.

We define the selectivity of a link specification L as sel(L) = |[[L]]| / |S × T|. The aim of sel is to encode the predicted value of |[[L]]| as a fraction of |S × T|. This is akin to the selectivity definition often used in the database literature. A LS L′ is said to achieve a recall k w.r.t. L if k × |[[L]]| = |[[L]] ∩ [[L′]]|. If [[L′]] ⊆ [[L]], then the recall k of L′ abides by the simpler equation k × |[[L]]| = |[[L′]]|. A specification L′ with [[L′]] ⊆ [[L]] is said to achieve an expected recall k w.r.t. L if k × sel(L) = sel(L′). Formally, given a specification L, the aim of partial-recall LD is to detect a rapidly executable LS L′ ⊑ L with an expected recall of at least k ∈ [0, 1], i.e., a LS L′ with sel(L′) ≥ k × sel(L), where k ∈ [0, 1] is a minimal expected recall set by the user.

The LS L′ is subsumed by the LS L (denoted L′ ⊑ L) when [[L′]] ⊆ [[L]] for any fixed pair of sets S and T. Note that ⊑ is a quasi-ordering (i.e., reflexive and transitive) on the set of all LSs, which we denote LS. A key observation that underlies our approach is as follows: ∀θ, θ′ ∈ [0, 1]: θ > θ′ → (m, θ) ⊑ (m, θ′). This observation can be extended to LSs as follows: (1) L1 ⊑ L1′ → (L1 ⊔ L2) ⊑ (L1′ ⊔ L2), (2) L1 ⊑ L1′ → (L1 ⊓ L2) ⊑ (L1′ ⊓ L2), (3) L1 ⊑ L1′ → (L1 \ L2) ⊑ (L1′ \ L2), and (4) L2 ⊑ L2′ → (L1 \ L2′) ⊑ (L1 \ L2).

We call ρ : LS → 2^LS a downward refinement operator if ∀L ∈ LS : L′ ∈ ρ(L) → L′ ⊑ L, where (LS, ⊑) is a quasi-ordered space. L′ is called a specialisation of L. We denote L′ ∈ ρ(L) with L ⇝ρ L′. Given L as input, the idea behind our approach is to use a refinement operator to compute L′ ⊑ L with at least a given expected recall k w.r.t. L. We define the corresponding refinement operator over the space (2^LS, ⊑) as follows:

ρ(L) =  ∅                               if L = L∅,
        L∅                              if L = (m, 1),
        (m, next(θ))                    if L = (m, θ) ∧ θ < 1,
        (ρ(L1) ⊔ L2) ∪ (L1 ⊔ ρ(L2))     if L = L1 ⊔ L2,          (1)
        (ρ(L1) ⊓ L2) ∪ (L1 ⊓ ρ(L2))     if L = L1 ⊓ L2,
        ρ(L1) \ L2                      if L = L1 \ L2.

Our refinement operator ρ is finite, incomplete, proper and redundant if L, S and T are finite. ρ being incomplete is not a restriction for our purposes, given that we aim to find LSs that run faster and thus do not want to refine the input LS L to a L′ that might make our implementation of the operator slower. Given that ρ is finite, we can generate ρ(L) for any chosen node completely in our implementation. ρ being redundant means that after a refinement, we need to check whether we have already seen the newly generated LS. Hence, we need to keep a set of seen LSs. Finally, ρ being proper means that, while checking for redundancy, there is no need to compare a LS with any of its parents.
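A direct transcription of ρ into code may help make the case analysis in Eq. (1) concrete. This is a minimal sketch under assumptions the paper does not fix: the tuple-based LS encoding, the threshold grid behind next(), and the measure names are all illustrative.

```python
# Sketch of the downward refinement operator rho from Eq. (1).
# Atomic LS: ("atom", m, theta); complex LS: (op, L1, L2) with
# op in {"or", "and", "minus"}. The EMPTY marker stands for L_empty.

EMPTY = ("empty",)  # the LS whose mapping is always empty

def next_threshold(theta, step=0.1):
    """Next higher threshold on a fixed grid (an assumption; the paper
    leaves the concrete choice of next() open)."""
    return min(1.0, round(theta + step, 10))

def rho(L):
    """Return the set of direct refinements of L; each is subsumed by L."""
    if L == EMPTY:
        return set()
    if L[0] == "atom":
        _, m, theta = L
        if theta >= 1.0:
            return {EMPTY}
        return {("atom", m, next_threshold(theta))}
    op, L1, L2 = L
    if op in ("or", "and"):
        # refine either operand, keeping the other fixed
        return ({(op, R1, L2) for R1 in rho(L1)} |
                {(op, L1, R2) for R2 in rho(L2)})
    if op == "minus":
        # only the left operand may be refined: shrinking L2 would
        # enlarge L1 \ L2 and break subsumption (cf. rule (4) above)
        return {("minus", R1, L2) for R1 in rho(L1)}
    raise ValueError(f"unknown operator {op!r}")

L = ("or", ("atom", "jaccard", 0.5), ("atom", "trigrams", 0.8))
for refinement in rho(L):   # set order may vary
    print(refinement)
# ('or', ('atom', 'jaccard', 0.6), ('atom', 'trigrams', 0.8))
# ('or', ('atom', 'jaccard', 0.5), ('atom', 'trigrams', 0.9))
```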
3 Approach

The basic goal behind LIGER is to find the LS Λ ∈ ρ*(L0) that achieves the lowest expected run time while (1) being subsumed by L0 and (2) achieving at least a predefined expected recall k ∈ [0, 1]. LIGER is based on ρ as described in Section 2. The basic implementation of LIGER is dubbed C-RO. Our approach takes as input a LS L0, an oracle O which can predict the run times and selectivity of LSs, the minimal expected recall k and a refinement time constraint maxOpt.

We begin by asking O to provide the algorithm with an estimation of the selectivity of L0 (sel(L0)). We define a refinement tree with L0 as its root. For each refined LS, the set of refined LSs is added as children nodes to the currently refined LS; a leaf node is a LS that cannot be refined any further. We assign L0 as the best subsumed LS Λ and initialize the best run time rtΛ with L0's run time estimation from O. The algorithm computes the desired selectivity value (sel_des) as a fraction of L0's selectivity. Then, we add L0 to the set Buffer, which contains the LSs obtained by refining L0 that have not yet been refined themselves. All LSs that were generated through the refinement procedure, as well as the input LS L0, are stored in another set named Total. By keeping track of these LSs, we avoid refining a LS more than once and address the redundancy of our refinement operator. The refinement of L0 stops when the refinement time has exceeded maxOpt, when Buffer is empty, or when the selectivity of Λ returned by O is equal to sel_des.

At each iteration, the algorithm selects the next node for refinement as follows: first, it retrieves the run time estimation of each LS L that belongs to Buffer using O. Then, it selects as the next LS for refinement the LS with the minimum run time estimation (Lxt). The algorithm then checks whether Lxt receives a better run time score and, if so, assigns Lxt as the new value of Λ and updates rtΛ accordingly. Finally, the algorithm refines Lxt by applying ρ. For each subsumed LS, the algorithm checks whether it already exists in the set Total, to ensure that LIGER does not explore LSs that have already been seen before. Each remaining subsumed LS is added to Total and the algorithm proceeds to compute its selectivity. If its selectivity is higher than or equal to the desired selectivity, the algorithm updates Buffer by adding the new LS. A condensed sketch of this loop is shown at the end of this section.

One key observation pertaining to the run time of L′ ∈ ρ*(L) is that, by virtue of L′ ⊑ L, RT(L′) ≤ RT(L) will most probably hold. By virtue of the transitivity of ≤, L1 ∈ ρ(L) ∧ L2 ∈ ρ(L) ∧ RT(L1) ≤ RT(L2) → ∀L′ ∈ ρ*(L1) : RT(L′) ≤ RT(L2) also holds. We call this assumption the monotonicity of run times. Since the implementation of C-RO does not take this monotonicity into consideration, we wanted to know whether this assumption can potentially improve the run time of our approach.
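The C-RO search loop described above can be condensed as follows. This is a minimal sketch, assuming the rho() from the previous listing and an oracle object exposing runtime(L) and selectivity(L) estimators; the names, the float-equality stopping test, and the time handling are illustrative simplifications, not the paper's implementation.

```python
# Condensed sketch of C-RO (Section 3). Assumes rho() from the previous
# listing; `oracle` is any object with runtime(L) and selectivity(L).
import time

def c_ro(L0, oracle, k, max_opt):
    """Find a subsumed LS with expected recall >= k and a low run time."""
    sel_des = k * oracle.selectivity(L0)   # desired selectivity
    best, best_rt = L0, oracle.runtime(L0)
    buffer, total = {L0}, {L0}             # unrefined LSs / all seen LSs
    deadline = time.monotonic() + max_opt
    while buffer and time.monotonic() < deadline:
        # pick the unrefined LS with the lowest estimated run time
        L_next = min(buffer, key=oracle.runtime)
        buffer.discard(L_next)
        rt = oracle.runtime(L_next)
        if rt < best_rt:                   # better run time score found
            best, best_rt = L_next, rt
            if oracle.selectivity(best) == sel_des:
                break                      # cannot do better w.r.t. k
        for L_ref in rho(L_next):
            if L_ref in total:             # redundancy check
                continue
            total.add(L_ref)
            if oracle.selectivity(L_ref) >= sel_des:
                buffer.add(L_ref)          # still meets the recall bound
    return best
```

Keeping Total separate from Buffer is what addresses the redundancy of ρ: a LS that is generated along two different refinement paths is only considered once.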
We then implemented an extension of LIGER with the monotonicity assumption (dubbed RO-MA). RO-MA uses a hierarchical ordering on the set of unrefined nodes. By incorporating RO-MA as a search strategy, the refinement tree is expanded using a "top-down" approach until there are no nodes to be further explored in a particular path.

Table 1: Average execution times of Baseline, C-RO and RO-MA for k = 0.1 and different values of maxOpt over 100 LSs per dataset. All times are in seconds.

maxOpt |  Abt-Buy                |  Amazon-GP              |  DBLP-ACM               |  DBLP-Scholar
  (s)  |  Baseline  C-RO  RO-MA  |  Baseline  C-RO  RO-MA  |  Baseline  C-RO  RO-MA  |  Baseline  C-RO    RO-MA
  0.1  |  0.66      0.52  0.52   |  5.71      3.91  3.82   |  1.08      0.25  0.25   |  792.81    596.53  598.44
  0.2  |  0.66      0.55  0.54   |  5.71      3.81  2.89   |  1.08      0.26  0.26   |  792.81    545.22  546.01
  0.4  |  0.66      0.45  0.44   |  5.71      3.04  2.91   |  1.08      0.26  0.25   |  792.81    589.72  587.87
  0.8  |  0.66      0.55  0.53   |  5.71      3.27  3.15   |  1.08      0.28  0.26   |  792.81    598.54  599.03
  1.6  |  0.66      0.54  0.51   |  5.71      3.47  3.18   |  1.08      0.33  0.28   |  792.81    554.82  557.06

maxOpt |  MOVIES                 |  TOWNS                   |  VILLAGES
  (s)  |  Baseline  C-RO  RO-MA  |  Baseline  C-RO   RO-MA  |  Baseline  C-RO   RO-MA
  0.1  |  4.05      1.89  1.89   |  44.52     31.15  31.20  |  123.58    15.32  15.26
  0.2  |  4.05      1.75  1.76   |  44.52     32.23  32.19  |  123.58    15.65  15.71
  0.4  |  4.05      1.91  1.90   |  44.52     34.21  34.09  |  123.58    14.10  14.12
  0.8  |  4.05      1.77  1.76   |  44.52     34.08  34.10  |  123.58    14.69  14.52
  1.6  |  4.05      1.93  1.89   |  44.52     34.38  34.00  |  123.58    15.65  15.17

4 Evaluation

We evaluated our approach on seven datasets [4, 2]. All LSs used during our experiments were generated automatically by the unsupervised version of the genetic-programming-based ML approach EAGLE [5] as implemented in LIMES [6]. Regarding the oracle O mentioned in Section 3, LIGER assumes that it can (i) approximate run times using a linear model described in [1], and (ii) estimate selectivities as follows: (1) For an atomic LS, the selectivity values were computed using |[[L]]| / (|S| × |T|), where |[[L]]| is the size of the mapping returned by the LS L, and |S| and |T| are the sizes of the source and target data. To do so, we pre-computed the real selectivity of atomic LSs that were based on a set of measures using the methodology presented in [7] for thresholds between 0.1 and 1. (2) For complex LSs, which are binary combinations of two LSs L1 (selectivity: sel(L1)) and L2 (selectivity: sel(L2)), the run time approximation was computed by summing up the individual run times of L1 and L2. For the selectivity, we derived the following estimates: (I) op(L) = ⊓ → sel(L) = ½ sel(L1) sel(L2), (II) op(L) = ⊔ → sel(L) = ½ (1 − (1 − sel(L1))(1 − sel(L2))) and (III) op(L) = \ → sel(L) = ½ sel(L1)(1 − sel(L2)), as sketched below. The results achieved with L0 were our Baseline.
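The oracle's selectivity estimation can be sketched as a direct recursion over the LS structure. The lookup table atomic_sel standing in for the pre-computed atomic selectivities is an assumption for illustration; the combination rules (I)–(III) follow the text above.

```python
# Sketch of the oracle's selectivity estimates (Section 4), using the
# tuple-based LS encoding from the earlier listings. atomic_sel maps an
# atomic LS to its precomputed |[[L]]| / (|S| * |T|) value (assumed here).

def estimate_selectivity(L, atomic_sel):
    if L[0] == "atom":
        return atomic_sel[L]               # precomputed atomic selectivity
    op, L1, L2 = L
    s1 = estimate_selectivity(L1, atomic_sel)
    s2 = estimate_selectivity(L2, atomic_sel)
    if op == "and":                        # rule (I): intersection
        return 0.5 * s1 * s2
    if op == "or":                         # rule (II): union
        return 0.5 * (1 - (1 - s1) * (1 - s2))
    if op == "minus":                      # rule (III): difference
        return 0.5 * s1 * (1 - s2)
    raise ValueError(f"unknown operator {op!r}")

# e.g., the union of two LSs with selectivities 0.2 and 0.3:
# 0.5 * (1 - 0.8 * 0.7) = 0.22
```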
We first compared the execution times of C-RO and RO-MA against the Baseline for all seven datasets (see Table 1). As expected, all variations of LIGER require less execution time than the Baseline. As a result, LIGER produces more time-efficient LSs, even when maxOpt is set to a high value. LIGER performs best on VILLAGES for k = 0.1 and maxOpt = 0.4 s, where it can reduce the average run time of the 100 LSs we considered by 88%. On the smaller DBLP-ACM dataset, RO-MA performs best and achieves a reduction of the run time by 77.5%. Furthermore, we studied how the strategies C-RO and RO-MA compare to each other (see Table 1). Our results suggest that RO-MA outperforms C-RO on average. The statistical significance of these results is confirmed by a paired t-test on the average run time distributions (significance level = 0.95).

Our intuition that the monotonicity of run times can potentially improve the run time of our approach is supported by the results on three out of the seven datasets (Abt-Buy, DBLP-ACM and Amazon-GP). On the remaining four datasets, C-RO outperforms RO-MA on average. Still, when C-RO outperforms RO-MA, the absolute differences are minute. Additionally, the subsumed LSs returned by both strategies received the same selectivity. Hence, when the available refinement time is limited, RO-MA should be preferred when aiming to carry out partial-recall LD. The highest absolute difference between C-RO and RO-MA is achieved on the DBLP-Scholar dataset, where RO-MA is 1179.59 s faster than C-RO, while the highest relative gain of 776.28% by C-RO against the Baseline is achieved on VILLAGES (k = 0.1, maxOpt = 0.4 s), which is the largest dataset of our experiments.

Finally, we wanted to measure the loss of F-measure of a machine-learning approach when presented with the results of partial-recall LD vs. the F-measure it would achieve using the full results. We used WOMBAT [8], which is currently the only approach for learning LSs from positive examples. Our results show that with an expected partial recall of 50%, WOMBAT achieves at least 76.6% of the F-measure that it achieves when presented with all the data generated by EAGLE (recall = 100%).

5 Conclusions and Future Work

We presented LIGER, the first partial-recall LD approach. We provided a formal definition of a downward refinement operator along with its characteristics, which we used to develop an algorithm for partial-recall LD. We evaluated our approach on seven datasets and showed that, by using our refinement operator, we are able to detect LSs with guaranteed expected recall efficiently. Our extension of the LIGER algorithm with a monotonicity assumption pertaining to the run times of the LSs was shown to be slightly better than the basic LIGER implementation. In future work, we will build upon LIGER to guarantee the real selectivity and recall of our approaches with a given probability.

References

1. Georgala, K., Hoffmann, M., Ngonga Ngomo, A.C.: An Evaluation of Models for Runtime Approximation in Link Discovery. In: Proceedings of the International Conference on Web Intelligence (WI) (2017)
2. Georgala, K., Obraczka, D., Ngonga Ngomo, A.C.: Dynamic Planning for Link Discovery. In: The Semantic Web. pp. 240–255 (2018)
3. Isele, R., Jentzsch, A., Bizer, C.: Efficient Multidimensional Blocking for Link Discovery without Losing Recall. In: Marian, A., Vassalos, V. (eds.) WebDB (2011)
4. Köpcke, H., Thor, A., Rahm, E.: Evaluation of Entity Resolution Approaches on Real-world Match Problems. Proc. VLDB Endow. 3(1-2), 484–493 (Sep 2010)
5. Ngonga Ngomo, A.C., Lyko, K.: EAGLE: Efficient Active Learning of Link Specifications Using Genetic Programming. In: Extended Semantic Web Conference. pp. 149–163. Springer (2012)
6. Ngonga Ngomo, A.C.: On Link Discovery Using a Hybrid Approach. Journal on Data Semantics 1(4), 203–217 (2012), http://dx.doi.org/10.1007/s13740-012-0012-y
7. Ngonga Ngomo, A.C.: HELIOS – Execution Optimization for Link Discovery. pp. 17–32. Springer International Publishing, Cham (2014)
8. Sherif, M., Ngonga Ngomo, A.C., Lehmann, J.: WOMBAT – A Generalization Approach for Automatic Link Discovery. In: 14th Extended Semantic Web Conference. Springer (2017)