Noise-aware Blocking over Heterogeneous Data

Tiago B. Araújo1,2, Carlos Eduardo Pires1, and Kostas Stefanidis2
1 Federal University of Campina Grande, Brazil
{tiagobrasileiro@copin,cesp@dsc}.ufcg.edu.br
2 University of Tampere, Finland
kostas.stefanidis@uta.fi

Abstract. Entity resolution (ER) emerges as a fundamental step to integrate multiple knowledge bases or identify similarities between entities. Blocking is widely applied as an initial step of ER to avoid computing similarities between all pairs of entities. Heterogeneous and noisy data increase the difficulties faced by blocking, since these issues directly interfere with block generation. In this work, we propose a technique capable of tolerating noisy data in order to extract information about the data schema and generate high-quality blocks. We apply Locality Sensitive Hashing (LSH) to hash the entity values, enabling the generation of high-quality blocks even in the presence of noise in the values. In the experiments, we show that our approach achieves better effectiveness than the state-of-the-art technique, while producing fewer comparisons.

1 Introduction

Entity resolution (ER) over big Web data typically deals with two Vs: volume, as it handles large amounts of entities; and variety, since different formats are used to represent entities [3]. Beyond the two Vs, we highlight another problem tackled by ER: noisy data, commonly characterized by pronunciation/spelling errors and typos in the entity values [5]. In practical scenarios, people are less careful with the lexical accuracy of content written informally or under pressure. In ER, noisy data directly impacts the identification of similar entities, since a spelling difference in their values can cause two truly similar entities not to be regarded as similar by the ER task. In this work, we consider the two most common types of noise in data: typos and misspelling errors [1].
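To make the impact of noise concrete, the following minimal sketch (hypothetical values; `shared_tokens` is an illustrative helper, not part of our technique) shows how a single typo prevents two otherwise identical values from sharing a matching token, which is exactly what breaks token-based blocking:

```python
def shared_tokens(v1, v2):
    """Token overlap between two attribute values; token-based
    blocking relies on such shared tokens to co-locate entities."""
    return set(v1.lower().split()) & set(v2.lower().split())

clean = "snow white"
noisy = "sn0w white"  # one-character typo

# Without noise, both tokens match; with the typo,
# 'snow' no longer matches and only 'white' survives.
assert shared_tokens(clean, clean) == {"snow", "white"}
assert shared_tokens(clean, noisy) == {"white"}
```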
To deal with the large volume of data handled by ER, blocking techniques are applied. Blocking groups similar entities into blocks and performs comparisons only between entities within the same block. The variety of data is related to the fact that entities do not share the same schema. In this sense, schema-agnostic blocking techniques (e.g., token blocking) have been proposed to address the variety challenge, since they disregard the schema and consider only the values of the entity attributes [3]. Among the schema-agnostic techniques, BLAST [7] emerges as one of the most promising regarding effectiveness. Although it is a schema-agnostic technique, it exploits schema information, based on statistics collected from the data, to enhance the quality of the blocks in a metablocking approach. However, the presence of noise in the attribute values compromises the effectiveness of BLAST, since it relies on the accuracy of the values to exploit the schema information and to generate and prune the blocks.

Our work proposes a noise-aware schema-agnostic blocking method for ER. This is a novel technique capable of tolerating noisy data in order to extract schema information from the data (i.e., group similar attributes based on their values) and enhance the quality of the generated blocks. The method applies Locality Sensitive Hashing (LSH) to hash the entity attribute values and enable the generation of high-quality blocks (i.e., blocks that contain a significant number of entities with high chances of being considered similar), even in the presence of noise in the values. The technique is evaluated against the state-of-the-art method regarding efficiency and effectiveness, using a real dataset.

2 Noise-aware Schema-agnostic Blocking

Our approach is based on metablocking [2, 6, 4], which exploits blocking information to improve efficiency with a minimum impact on effectiveness.
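Token blocking, mentioned above as the typical schema-agnostic starting point, can be sketched as follows (hypothetical entities; the real setting involves two data sources):

```python
from collections import defaultdict

def token_blocking(entities):
    """Schema-agnostic token blocking: ignore attribute names and
    create one block per token that appears in the attribute
    values of at least two entities."""
    blocks = defaultdict(set)
    for eid, attrs in entities.items():
        for value in attrs.values():
            for token in str(value).lower().split():
                blocks[token].add(eid)
    # keep only tokens shared by at least two entities
    return {t: ids for t, ids in blocks.items() if len(ids) >= 2}

entities = {
    "e1": {"title": "Snow White", "year": "1937"},
    "e2": {"name": "Snow White and the Seven Dwarfs"},
    "e3": {"title": "Frozen"},
}
blocks = token_blocking(entities)
# 'snow' and 'white' each appear in e1 and e2, so they form blocks;
# tokens unique to a single entity (e.g., 'frozen') produce no block
```

Note that e1 and e2 end up sharing two blocks even though their schemas differ, which is the redundancy metablocking later exploits and prunes.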
To this end, metablocking restructures a set of blocks into a new one that involves significantly fewer comparisons, while maintaining the original level of effectiveness. Initially, a schema-agnostic blocking technique, e.g., token blocking, is applied to block the heterogeneous data. Token blocking extracts tokens from the entity attribute values and creates an individual block for every token that appears in at least two entities. Blocks generated by token blocking result in a large number of redundant comparisons between entities. For this reason, the blocks are transformed into a weighted graph, such that each entity is represented by a node and an edge between a pair of nodes indicates that the corresponding entities share at least one block. Based on the number of blocks shared by a pair of entities, metablocking performs pruning, which aims to discard comparisons between entities with few chances of being a correspondence.

Our technique performs three steps: (i) schema information extraction, (ii) block generation, and (iii) pruning. It receives as input two data sources D1 and D2. Each data source is an entity collection D = {e1, e2, ..., en} with attributes A(D) = {a1, a2, ..., ak}. Since entities can follow different schemas, each entity e ∈ D has a specific attribute set and a value associated with each attribute, denoted by A_e = {⟨a1, v1⟩, ⟨a2, v2⟩, ..., ⟨ak, vk⟩}. In the schema information extraction step, all attributes of the entities in D1 and D2 are extracted. All values associated with the same attribute ai are grouped into a set V_ai, i.e., V_ai = ⋃_{e∈D} {v | ⟨ai, v⟩ ∈ A_e}. Thus, the pair ⟨ai, V_ai⟩ represents the values associated with ai. The attributes in D1 and D2 are grouped based on the similarity of their values: G(D1, D2) = {g1, g2, ..., gm}, such that ∀g ∈ G(D1, D2) : g ⊆ (A(D1) × A(D2)) and ∀g ∈ G(D1, D2), ∀⟨ai, aj⟩ ∈ g : V_ai ≃ V_aj. The sets V_ai and V_aj are considered similar if sim(V_ai, V_aj) ≥ Φ.
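The schema information extraction step can be sketched as follows, using exact Jaccard similarity as sim over hypothetical value sets and an illustrative threshold Φ = 0.3 (a sketch only: the actual technique avoids this exact pairwise computation via LSH, as described next):

```python
from itertools import product

def jaccard(s1, s2):
    """Exact Jaccard similarity between two value sets."""
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

def group_attributes(values1, values2, phi=0.3):
    """Pair attributes of the two sources whose value sets satisfy
    sim(V_ai, V_aj) >= phi, yielding the attribute groups G(D1, D2)."""
    return [
        (a1, a2)
        for (a1, v1), (a2, v2) in product(values1.items(), values2.items())
        if jaccard(v1, v2) >= phi
    ]

# Hypothetical value sets V_ai collected per attribute
V1 = {"title": {"snow", "white", "frozen"}, "year": {"1937", "2013"}}
V2 = {"name": {"snow", "white", "brave"}, "released": {"1937", "2012"}}

print(group_attributes(V1, V2, phi=0.3))
# → [('title', 'name'), ('year', 'released')]
```

Here "title"/"name" are grouped despite their different labels, which is the loose-schema information the block generation step relies on.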
In the block generation step, each set of attribute values V (associated with an attribute a) is converted into a hash signature S (provided by LSH), given by hash(⟨a, V⟩) = ⟨a, S⟩. Computing the similarity of all pairs of attributes exactly has complexity O(|U_D1| · |U_D2|), where U_D1 = ⋃_{ai ∈ A(D1)} V_ai and U_D2 = ⋃_{aj ∈ A(D2)} V_aj. This complexity is impractical for semi-structured Web data, since data sources commonly have hundreds of attributes and millions of attribute values [7]. For this reason, LSH, which has a linear cost in relation to the set size, is applied to reduce the dimensionality of the sets U_D1 and U_D2, reducing the complexity to a linear cost [8]. The set of LSH signatures ⟨a, S⟩ guides the block generation, since entities with a similar LSH signature are grouped into the same block. The loose-schema information (i.e., G(D1, D2)) is applied in the block generation step to prevent similar LSH signatures that originate from attributes with different semantics (i.e., attributes that are not in the same group g) from being inserted into the same block. The output of the block generation step is a collection of blocks B = {b1, b2, ..., bx}, such that ∀b ∈ B : (e1 ∈ b ∧ e2 ∈ b) ⇔ (∃⟨a1, a2⟩ ∈ (A(D1) × A(D2)) : ⟨a1, a2⟩ ∈ ⋃_{g ∈ G(D1,D2)} g ∧ hash(⟨a1, v1⟩) ∼ hash(⟨a2, v2⟩), with ⟨a1, v1⟩ ∈ A_e1 and ⟨a2, v2⟩ ∈ A_e2).

Finally, in the pruning step, metablocking discards comparisons between entities connected by low-weight edges, which represent low similarity. In this sense, the collection B provided by block generation is restructured based on the intuition that the more blocks two entities share, the more likely they are to result in a correspondence. The output of the pruning step is a restructured collection of blocks B′.

3 Experiments

We evaluate our approach³ against BLAST [7], the state-of-the-art method. We used the IMDB (27,615 entities, four attributes) vs.
DBpedia (23,182 entities, seven attributes) datasets with movies provided by imdb.com and dbpedia.org. For effectiveness, we apply: Pair Completeness (PC, similar to recall), Pair Quality (PQ, similar to precision), and F-Measure (FM, the harmonic mean of PC and PQ) [7]. For efficiency, we measure the execution time of all steps of the techniques, and the number of comparisons for all generated blocks.

To evaluate effectiveness, we synthetically insert typos and misspellings (i.e., noise) into the attribute values. To simulate the occurrence of typos/misspellings [5], one character of each token (i.e., relevant word) present in the attribute values is randomly exchanged for another character, or additional characters are inserted into the token. For instance, the token “snow” can be modified to “sn0w”. We vary the level of noise in the dataset in the interval from 0 (no noise is inserted into the values) to 1 (noise is inserted into the values of all entities). For instance, noise level 0.3 indicates that 30% of the entities (contained in the first dataset) had their values modified.

Effectiveness: Figure 1 illustrates the results of our analysis. For all effectiveness measures, our approach outperforms BLAST for all variations of the noise level. It is important to highlight that, as the noise level increases, the effectiveness metrics decrease for both techniques. This decrease occurs because the noise in the data negatively interferes with block generation. However, for BLAST this decrease is abrupt compared to our approach: since the latter applies strategies to tolerate noisy data, the effectiveness decrease is amortized. For FM, our approach shows on average a 23% smaller proportional decrease (1 − FM(noise=1.0)/FM(noise=0.0)) than BLAST.

³ https://bitbucket.org/tbrasileiro/na-blocker

Fig. 1. Effectiveness results of the IMDB-DBpedia dataset.

Our most significant result for PQ was achieved in
the scenario without noise, where our approach is 27% better than BLAST. The main reason is that BLAST generates multiple tokens per attribute (based on a particular attribute value) as blocking keys. Since non-matching entities eventually share multiple tokens, they are erroneously included in the same block. On the other hand, our technique generates a single hash value for each particular value. Thus, non-matching entities sharing the same hash value occur less often than non-matching entities sharing common tokens. This is why we enhance PQ. Based on these results and a pairwise (over the noise levels) Student's t-test (with 95% confidence), our technique achieves better effectiveness.

Efficiency: Regarding efficiency (we omit the figures due to space limitations), BLAST achieves better results than our approach. On average, there is an increase of around 38% in execution time, as expected, since our technique needs additional time to generate the LSH signatures and to determine the similarity of values based on the approximate similarity. On the other hand, our technique produces fewer comparisons to be executed in the ER task. On average, the generated blocks indicate a total number of comparisons around 36% smaller than for the blocks generated by BLAST. Thus, the execution time overhead may be compensated by the efficiency gains from performing fewer comparisons between entities in the subsequent ER tasks.

References

1. S. Agarwal, S. Godbole, D. Punjani, and S. Roy. How much noise is too much: A study in automatic text classification. In ICDM, 2007.
2. T. B. Araújo, C. E. S. Pires, and T. P. da Nóbrega. Spark-based streamlined metablocking. In ISCC, 2017.
3. V. Christophides, V. Efthymiou, and K. Stefanidis. Entity resolution in the web of data. Synthesis Lectures on the Semantic Web, 5(3), 2015.
4. V. Efthymiou, G. Papadakis, G. Papastefanatos, K. Stefanidis, and T. Palpanas.
Parallel meta-blocking for scaling entity resolution over big heterogeneous data. Inf. Syst., 65:137–157, 2017.
5. H. Liang, Y. Wang, P. Christen, and R. Gayler. Noise-tolerant approximate blocking for dynamic real-time entity resolution. In PAKDD, 2014.
6. G. Papadakis, G. Koutrika, T. Palpanas, and W. Nejdl. Meta-blocking: Taking entity resolution to the next level. IEEE TKDE, 26(8):1946–1960, 2014.
7. G. Simonini, S. Bergamaschi, and H. Jagadish. BLAST: a loosely schema-aware meta-blocking approach for entity resolution. Proceedings of the VLDB Endowment, 9(12):1173–1184, 2016.
8. J. Wang, W. Liu, S. Kumar, and S.-F. Chang. Learning to hash for indexing big data: a survey. Proceedings of the IEEE, 104(1):34–57, 2016.