<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Italian Symposium on Advanced Database Systems, June</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Extracting a Unified Database from Heterogeneous Bills of Materials via Data Integration Approaches</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luigi De Lucia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guglielmo Faggioli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Orobelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bedeschi S.p.A.</institution>
          ,
          <addr-line>Limena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Padova</institution>
          ,
          <addr-line>Padova</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>1</volume>
      <fpage>9</fpage>
      <lpage>22</lpage>
      <abstract>
        <p>Since the beginning of the information era, manufacturing companies have collected a huge amount of data and acknowledged the importance of proper collection and curation policies. Nevertheless, an open problem is the exploitation of historical and legacy data. Such data is heterogeneous: it has often been accumulated over decades by different curators with different degrees of attention toward the data. In this work, we focus on the historical data collected by Bedeschi S.p.A., an international company dealing with the design and production of turn-key solutions for Bulk Material Handling, Container Logistics and Bricks. Our task is to extract knowledge, available in the form of partially structured Bills of Materials (BoMs), to distil a unified database over a mediated schema that can be used by designers and the sales department to rapidly obtain a global overview of the products developed over the years. We present a feasibility study investigating the performance of several traditional data integration approaches applied to our specific task. In particular, we treat BoMs as sources that need to be aggregated in the unified mediated schema. We achieved satisfactory results, indicating the direction that the company should follow to properly extract knowledge from its historical data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Our goal is to extract the knowledge available in historical Bills of Materials (BoMs) and organize
it in a uniform database over a global schema, providing an overview of the different models
of Bedeschi machines developed over decades. This work is a feasibility study to determine
whether it is possible to process these BoMs automatically. In particular, to determine the most
promising approaches, we focus on a single Bedeschi machine type, the BEL-C, in production
at Bedeschi S.p.A. for decades. Furthermore, machines of this type have rich BoMs that allow us
to recognize emerging challenges and critical issues. Moreover, different designers and curators
have produced, collected, and organized the data about this type of machine over the years
using different media, from paper sheets – now digitalized – to modern information
access systems. We interpret our task under the data integration framework. In particular, we
consider the BoMs associated with different machines as our data sources. These data sources are
then integrated to obtain a global schema.</p>
      <p>Our contributions are the following:
• We formalize the unified database extraction task from an industrial perspective as a data
integration task;
• We identify a set of techniques drawn from the schema matching and instance matching
domains to address the task;
• We empirically evaluate the performance of such techniques.</p>
      <p>The remainder is organized as follows: Section 2 describes the main efforts in the Data
Integration domain. Section 3 describes the task at hand and the available data.
Section 4 details our methodology, while Section 5 contains its evaluation. Finally, in Section 6
we draw some conclusions and outline future steps.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Data integration aims to provide unified access to data that usually resides on multiple,
autonomous data sources, such as different databases or data retrieved from websites [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. One
of the major challenges in data integration is data ambiguity. This ambiguity arises whenever
there is uncertainty about the intended meaning of the data. Among the data integration systems
developed, some handle the whole integration process, such as Do and Rahm [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Fagin
et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], Seligman et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], while others, e.g., Bernstein et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Falconer and Noy [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], focus
only on the schema mapping task. Among the most prominent endeavours, we list the following.
COMA [
        <xref ref-type="bibr" rid="ref3">3, 8, 9</xref>
        ] automatically generates matchings between source and target schemas by
combining multiple matching approaches. Clio, developed by Fagin
et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], supports a visual matching representation, and is capable of working with schemas
without knowing the mediated schema. Cupid [10] is a schema matching algorithm that exploits
similarities of different aspects of the schema elements: keys, referential constraints, and views.
      </p>
      <p>
        Several different endeavours have been devoted to the schema matching task. Rahm [11]
presents an overview of the possible approaches, based on different types of matching, to detect
the correspondences between schema elements. When instances are available, two schema
elements can be considered similar if they have close instances, as shown by Bernstein
et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Madhavan et al. [12].
      </p>
      <p>Figure 1: (a) a BEL-C machine; (b) its Bill of Materials (BoM).</p>
      <p>Additionally, different works exploit the information provided by instance values. COMA++,
an extension of COMA proposed by Engmann and Massmann [8], considers the average length
of instance strings, average values, and standard deviations. A second approach, proposed
by Bilke and Naumann [13], exploits duplicate instances to find matches. Despite
the popularity of the task, there is no universal benchmark to evaluate schema matching
approaches [14].</p>
      <p>Among the ancillary techniques needed to achieve data integration, we can list data profiling,
reverse database engineering, and the combination of matching approaches. Data profiling is the set
of processes and actions performed to retrieve specific information, also called metadata, about
a given dataset [15]. Generally, profiling tasks support manual data exploration and improve the
data gazing [16] process. Data profiling further helps data integration by revealing
inconsistencies and errors in the data [17]. Reverse database engineering allows identifying different
characteristics of the database entities, like relations, attributes, domain semantics, foreign keys
and cardinalities, that can be introduced into the data integration process via explicit rules [18].
Combining different matching approaches allows detecting a higher number of correct matches
with respect to using a single matcher [19]; examples of such approaches appear in [12, 10, 20]. Hybrid and
composite matchings are two different ways of combining matching approaches. The former
uses multiple matching criteria together to compute a single score. The latter, which we employ in
our work, computes the matching scores independently and then combines them.</p>
    </sec>
    <sec id="sec-3">
      <title>3. The Task and Dataset</title>
      <p>3.1. Task
In the design process of a new product, it is common practice to look at the design of previous
ones to retrieve reference specifications and understand the design choices. To this end, it is
necessary to have an overview of previously realized products. In this work, we focus on a
specific product: a machine type, called BEL-C, which has been in production at the Bedeschi
company for decades and has multiple versions. Figure 1a depicts an example of a BEL-C machine:
this machine includes several “parts”, called assemblies; e.g., the mechanical arm on the right
side is an assembly composed of multiple articles such as wheels and buckets.</p>
      <p>The description of each realization of a BEL-C machine is available in the form of a Bill of
Materials (BoM). A BoM is the list of the components, also called articles, needed to manufacture
a certain product. Multiple articles can be further aggregated into assemblies. Thus, the elements
of a BoM are organized hierarchically as a tree. The root node represents the final product, while
internal nodes and leaves represent assemblies and articles, respectively. Figure 1b illustrates
the top portion of the BoM for the machine in Figure 1a.</p>
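<p>As a sketch of the hierarchical structure just described, the following illustrative Python fragment models a BoM as a tree whose root is the product, internal nodes are assemblies, and leaves are articles. All names (BomNode, the codes, the descriptions) are hypothetical, not taken from the Bedeschi data.</p>

```python
from dataclasses import dataclass, field

@dataclass
class BomNode:
    """Node of a BoM tree: the root is the final product, internal nodes
    are assemblies, leaves are articles. Fields are illustrative."""
    code: str
    description: str = ""
    children: list["BomNode"] = field(default_factory=list)

    def articles(self):
        """Yield the leaf articles of this (sub)tree."""
        if not self.children:
            yield self
        for child in self.children:
            yield from child.articles()

# Toy instance echoing Figure 1: an assembly (mechanical arm) with two articles.
bom = BomNode("BEL-C", "machine", [
    BomNode("ASM-1", "mechanical arm",
            [BomNode("ART-1", "wheel"), BomNode("ART-2", "bucket")]),
])
```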
      <p>BoMs referring to different products of the same type, as in our case, might be highly
heterogeneous. In fact, they might have been produced decades apart, can be digitalizations of
old analog production data, and might include heterogeneous descriptions. This results in
data ambiguities that need to be resolved. We aim to realize a tool that summarizes all the relevant
information about the different BEL-C machines produced over the years. In more detail, we
have a manually compiled example of a database constructed over a mediated schema, and we
want to learn how to extract a similar database automatically from the BoMs. This example
unified database is based on recent realizations of the BEL-C. Therefore, mapping historical
BoMs into the database over the mediated schema presents two main challenges: i) example
data is expressed in natural language, and it is not possible to find the very same attribute
values among our data; ii) the modern BoMs used to construct the example unified database
are not among the historical ones and thus are not in our dataset. Therefore, we cannot use
traditional entity resolution and linkage approaches. Even though they share a similar structure,
the heterogeneity between different BoMs makes them closer to different data sources rather
than to different instances of the same database. In this sense, the schema underlying the
unified database that combines the information of different BoMs can be seen as a mediated
schema. Therefore, we model our task in a data integration framework as follows: given a set of
BoMs, the description of the articles they contain, and an example of a unified database over the
mediated schema, we have to: 1) understand how BoMs are mapped into the mediated schema;
2) automatically construct the unified database for previously unseen BoMs.</p>
      <sec id="sec-3-1">
        <title>3.2. Dataset</title>
        <p>To address our task, we have a set of BoMs, a database of the articles, and a manually built
example of the required final mediated schema.</p>
        <p>BoMs Different BEL-C machines might have diverse specifications, i.e., the same component
can have different dimensions, or distinct motors can be installed. Each BoM, identified by a
unique serial number, is an instance representing a specific realization of a BEL-C machine
with its characteristics. The BoM contains the IDs of the articles used in building the machine. Inside
each BoM, articles might be organized into assemblies. Each assembly has a unique identifier
and a description. The assembly description allows understanding which part of the machine
the tree portion refers to. In our historical data, we have BoMs associated with 16 distinct
machines. BoMs contain from 6 to 375 articles.</p>
        <p>Articles The article database contains the description of the characteristics of the articles listed
in BoMs. Articles are organized into classes – e.g., chains, engines, etc. – uniquely identified
by a mnemonic code. Articles belonging to different classes have distinct sets of attributes.
The domains of such attributes are not formally described: in most cases, attribute values
are expressed via natural-language strings even when only typed (e.g., numeric or Boolean)
values are expected. The only attribute common to all classes is “description”: a free-text field
containing the description of the article and its usage. Several articles have multiple null values
– such values are not really empty: they have not been correctly reported, or were lost while moving
from paper to digital archives. In this sense, the “description” attribute is often the only one
(ab-)used, containing also the values for all the other attributes. Figure 2 reports the ideal
Entity–Relationship (ER) schema for our dataset: BoMs are composed of assemblies or articles,
and assemblies include either other assemblies or articles.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Example Database over the Global Schema</title>
        <p>To understand which are the relevant attributes and their domains, domain experts provided us with an example of the database constructed
over the global schema that we would like to distil from the BoMs. This database was constructed
manually, and often the values of its fields are not in the domain of those in the articles dataset
– e.g., different units of measure, strings instead of numbers. Furthermore, it is based on recent
BEL-C machines that are not among the historical BoMs. The main challenge is understanding how
to map articles’ attributes to the global schema fields by comparing their domains and relations.
The example database organizes the data as follows: each column contains a different machine
realization – 20 of them – while each row represents a machine specification. Our dataset
includes 47 unique fields for the machine specification. Rows are divided into groups – e.g.,
technical features, chain – that provide semantic meaning to the fields they include.
Manual mapping As mentioned above, historical BoMs and those available in the example
unified database do not overlap. Furthermore, the values of the fields in the example unified
database are manually set and not available in the domain of the attributes in the articles
dataset. Therefore, we cannot directly determine whether the automatic mapping produced using data
integration techniques is correct. Thus, with the help of domain experts, we built a manual
mapping between global schema fields and articles’ attributes. Such a mapping allows us to
define a query that immediately provides the database over the mediated schema. The construction of this
manual mapping took several hours and has two main drawbacks: it does not work on
machines other than the BEL-C, nor does it work if new attributes are taken into account. Therefore,
we use the manual mapping to evaluate the different automatic integration approaches, so that
they can be used in previously unseen scenarios for which we do not have the time or resources to
construct a manual mapping.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>Our objective is to understand the mapping between fields of the target schema and article
attributes. Given ℱ, the set of fields of the mediated schema, and 𝒜, the set of articles’ attributes,
for each f ∈ ℱ our objective is to select the article attribute a ∈ 𝒜 that maps into f. To do
so, we compute a similarity measure σ(f, a) which describes how similar f and a are. Then the
value of the field f is extracted from the attribute a* such that:
a* = argmax_{a ∈ 𝒜} σ(f, a)</p>
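<p>The selection rule above can be sketched in a few lines of Python. This is an illustrative fragment, not the authors’ implementation: the token-overlap similarity and all names are placeholders standing in for an arbitrary similarity function σ.</p>

```python
def map_fields(fields, attributes, sigma):
    """For every field f, return the attribute a maximizing sigma(f, a)."""
    return {f: max(attributes, key=lambda a: sigma(f, a)) for f in fields}

# Toy similarity: Dice-like fraction of shared lowercase tokens.
def token_overlap(f, a):
    tf, ta = set(f.lower().split()), set(a.lower().split())
    return 2 * len(tf & ta) / (len(tf) + len(ta))

# Hypothetical field and attribute names, for illustration only.
mapping = map_fields(
    fields=["chain pitch", "engine power"],
    attributes=["pitch of chain", "rated power", "wheel diameter"],
    sigma=token_overlap,
)
```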
      <p>Our task becomes identifying the best similarity function σ to describe the relation between a
field f and an article attribute a. To do so, we model three aspects: the similarity function σ, the
representation for the field f, and the representation for the article attribute a. Depending on the
representation used for f and a, there are two main categories of matching approaches:
schema-based and instance-based matching approaches. The former relies on schema information about
both field and article, such as their names, group or class. Instance-based matching approaches
represent articles and fields using their instances. A third class of approaches is the one based
on composite matchings, where the outputs of multiple matchers are combined.</p>
      <p>Finally, using the manual mapping developed with the contribution of two domain experts,
we can evaluate the performance of each similarity function or mapping strategy.</p>
      <sec id="sec-4-1">
        <title>4.1. Data Normalization</title>
        <p>Following common practices in the schema matching domain [10, 21], we start by normalizing
the entities in our dataset. The normalization pipeline includes converting characters to
lowercase, removing trailing spaces, removing non-informative tokens (e.g., single characters and
stopwords).</p>
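<p>A minimal sketch of such a normalization pipeline, assuming a toy stopword list (the paper does not specify one):</p>

```python
import re

# Illustrative stopword subset; a real pipeline would use a full list.
STOPWORDS = {"the", "of", "and", "a", "an", "di", "per"}

def normalize(text: str) -> list[str]:
    """Pipeline from the text: lowercase, strip surrounding whitespace,
    drop single-character tokens and stopwords."""
    tokens = re.split(r"\s+", text.strip().lower())
    return [t for t in tokens if len(t) > 1 and t not in STOPWORDS]
```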
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Schema-Based Matching</title>
        <p>
          Using the schema information, such as labels, data types, and structural properties is a widely
used approach [
          <xref ref-type="bibr" rid="ref3">3, 10, 12, 22, 23</xref>
          ]. In this scenario, we can exploit elements from the schema of
the unified database, the mediated schema, and the schema of the articles database. For the
mediated schema fields f we consider the following representations:
• Field name (F): name of the attribute of the target schema.
• Group name (G): name of the group the field belongs to.
        </p>
        <p>• Concatenation between the field and the group it belongs to (GF).</p>
        <p>Concerning the article attributes a, on the other hand, we can represent them as follows:
• Article attribute name (A): name of the article attribute in the articles database.
• Article class (C): name of the article class. Being a mnemonic code, it contains relevant
semantic information.</p>
        <p>• Concatenation between the article class and attribute name (CA).</p>
        <p>Once we have defined the representations for the fields and articles, we need to define the
similarity function σ. In particular, we employ the following matching approaches:
• Dice’s coefficient (CW): size of the intersection between the terms of two strings.
• Edit Distance (ED): number of edit operations required to transform one string into the other.
• Pairwise Edit Distance (EDP): if two strings contain identical words in a different order, they
have a high edit distance. EDP computes the edit distance between each possible pair of
tokens from the two strings, returning the average distance over all the combinations.
• N-Grams (NG): number of common n-grams between two strings [24].</p>
        <p>Besides the pairwise edit distance, which is normalized by default, all the above-mentioned matching
approaches are employed in both their original and normalized forms.</p>
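<p>The four matchers can be sketched as follows. This is an illustrative implementation: the paper does not give formulas, so the exact normalizations (e.g., dividing the n-gram overlap by the smaller gram set, or normalizing each token-pair edit distance by the longer token) are assumptions.</p>

```python
from itertools import product

def dice(s1: str, s2: str) -> float:
    """Dice's coefficient (CW): overlap between the token sets of two strings."""
    t1, t2 = set(s1.split()), set(s2.split())
    return 2 * len(t1 & t2) / (len(t1) + len(t2)) if t1 or t2 else 0.0

def edit_distance(s1: str, s2: str) -> int:
    """Edit distance (ED): Levenshtein distance via dynamic programming."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = cur
    return prev[-1]

def pairwise_edit_distance(s1: str, s2: str) -> float:
    """Pairwise edit distance (EDP): average normalized edit distance over all
    token pairs, so that word order does not inflate the distance."""
    pairs = list(product(s1.split(), s2.split()))
    return sum(edit_distance(a, b) / max(len(a), len(b)) for a, b in pairs) / len(pairs)

def ngram_sim(s1: str, s2: str, n: int = 3) -> float:
    """N-gram similarity (NG): shared character n-grams over the smaller set."""
    g1 = {s1[i:i + n] for i in range(len(s1) - n + 1)}
    g2 = {s2[i:i + n] for i in range(len(s2) - n + 1)}
    return len(g1 & g2) / min(len(g1), len(g2)) if g1 and g2 else 0.0
```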
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Instance-Based Matching</title>
        <p>In the case of instance-based matching, we represent the field f and the article attribute a using
the values that they assume for the different instances. In particular, given a field f, we
concatenate the values of that field for each BoM in the example global schema and obtain a
document d_f. Similarly, taking all the values of the attribute a in the articles database, we
concatenate them and define a document d_a that represents the attribute. Given the size of d_f
and d_a, the similarities used in the schema-based scenario are computationally too demanding.
Therefore, we propose to use document similarity measures, such as those typically used
in Information Retrieval (IR). Given a field document d_f, we rank all the attribute documents
d_a based on their similarity, and select the attribute that maximizes the similarity with the
field document. Among the similarity measures σ, we investigate the following: Jaccard’s
coefficient [25, 26], cosine similarity [25] and BM25 [27].</p>
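<p>A minimal sketch of the instance-based ranking, with Jaccard and term-frequency cosine as examples (BM25 is omitted for brevity, and the document-building step is assumed to have already concatenated the instance values):</p>

```python
from collections import Counter
from math import sqrt

def jaccard(doc_f: str, doc_a: str) -> float:
    """Jaccard coefficient between the token sets of two documents."""
    t1, t2 = set(doc_f.split()), set(doc_a.split())
    return len(t1 & t2) / len(t1 | t2) if t1 | t2 else 0.0

def cosine(doc_f: str, doc_a: str) -> float:
    """Cosine similarity between term-frequency vectors of two documents."""
    v1, v2 = Counter(doc_f.split()), Counter(doc_a.split())
    dot = sum(v1[t] * v2[t] for t in v1)
    norm = sqrt(sum(c * c for c in v1.values())) * sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

def best_attribute(doc_f, attribute_docs, sim):
    """Rank the attribute documents by similarity to the field document
    and return the best-matching attribute."""
    return max(attribute_docs, key=lambda a: sim(doc_f, attribute_docs[a]))
```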
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Composite Matchings</title>
        <p>To further improve matching performance, we also investigate the fusion of matching strategies. In
particular, similarly to what is typically done in IR, for each field we rank the attributes based on their
similarity with the field. Then, we fuse the different rankings using CombSUM and CombMNZ [28].
We also propose an approach, dubbed weighted average (WA), that weights the similarities
computed by the different matching approaches as follows:</p>
        <p>WA(f, a) = (w₁σ₁(f, a) + w₂σ₂(f, a) + ... + wₙσₙ(f, a)) / (w₁ + w₂ + ... + wₙ) (1)
where σ₁, σ₂, ..., σₙ are the similarity matrices and w₁, w₂, ..., wₙ the corresponding weights.</p>
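<p>The three fusion strategies can be sketched as follows, treating each matcher’s output as a dictionary of attribute scores for one field. The input shapes are assumptions; the paper only fixes the weighted-average formula of Eq. (1).</p>

```python
def comb_sum(score_lists):
    """CombSUM: sum each attribute's scores across the matchers."""
    fused = {}
    for scores in score_lists:
        for attr, s in scores.items():
            fused[attr] = fused.get(attr, 0.0) + s
    return fused

def comb_mnz(score_lists):
    """CombMNZ: CombSUM multiplied by the number of matchers that
    returned a non-zero score for the attribute."""
    fused = comb_sum(score_lists)
    hits = {a: sum(1 for sc in score_lists if sc.get(a, 0) > 0) for a in fused}
    return {a: fused[a] * hits[a] for a in fused}

def weighted_average(score_lists, weights):
    """Weighted average (WA) of Eq. (1): weighted mean of the matchers' scores."""
    total_w = sum(weights)
    return {
        a: sum(w * sc.get(a, 0.0) for w, sc in zip(weights, score_lists)) / total_w
        for a in set().union(*score_lists)
    }
```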
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Empirical Evaluation</title>
      <p>This section contains the empirical analysis of the proposed methodology. We adopt
Mean Reciprocal Rank (MRR) [29] to evaluate our matching strategies. Furthermore, we also
report the proportion of exact matches (precision at 1), using two distinct tie-breaking policies:
“first”, which considers the first match in alphabetical order, and “any”, which measures whether,
among the ties, one of them is the correct match. To assess the statistical significance of the
differences between the matching strategies’ performances, we conduct an ANOVA test.</p>
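<p>For reference, MRR and precision at 1 over a field-to-attribute mapping can be computed as in this sketch (the “first”/“any” tie-breaking is omitted; the rankings are assumed to be already tie-broken):</p>

```python
def mrr(rankings, gold):
    """Mean Reciprocal Rank: average of 1/rank of the correct attribute,
    where rankings maps each field to its ranked attribute list."""
    total = 0.0
    for fld, ranked in rankings.items():
        if gold[fld] in ranked:
            total += 1.0 / (ranked.index(gold[fld]) + 1)
    return total / len(rankings)

def precision_at_1(rankings, gold):
    """Proportion of fields whose top-ranked attribute is the correct one."""
    return sum(ranked[0] == gold[f] for f, ranked in rankings.items()) / len(rankings)
```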
      <sec id="sec-5-1">
        <title>5.1. Schema-Based Matching Evaluation</title>
        <p>The performance of the schema-based matching strategies is reported in Table 1. Concerning
MRR, the best performing strategy is the 3-gram algorithm with the “Field/Class and Attribute”
input. This configuration achieves the best MRR score (33.6%) among the tested matching
strategies, but it is statistically different from only 9 other configurations. Table 1 also illustrates how
the normalized versions of the algorithms outperform the non-normalized ones. Furthermore,
3-gram and 2-gram usually achieve better MRR scores, but Dice’s coefficient achieves a higher number
of matches under the “any” policy – it exhibits high recall.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Instance-Based Matching Evaluation</title>
        <p>The performance of the three instance-based matching strategies is reported in Table 2. Notice
that these similarity functions produce real-valued scores that are much more widely spread.
This dramatically reduces the chance of ties: thanks to this, we no longer need a tie-breaking
policy. BM25 and cosine are statistically equivalent, while Jaccard similarity is
significantly worse than both. It is interesting to observe that different matching strategies
identify different correct matches. This suggests that a composite matching strategy might
improve the overall quality of the approach.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Composite Matching Strategies</title>
        <p>When considering composite matching strategies, almost any combination of matchers
and representations is a suitable candidate. Therefore, in the following analysis we report the
results for the best performing or most interesting composite matching techniques, selected
also with the help of domain experts.</p>
        <p>Table 1 reports, for each input representation (F,A; F,CA; GF,A; GF,CA) and each matcher
(CW, ED, EDP, NG2, NG3), the MRR of both the original and the normalized variant; Table 2 reports
the MRR of the instance-based matchers (cosine .178, BM25 .155, Jaccard .022).</p>
        <p>Among the composition strategies that we considered, the first approach consists in using
simple matching strategies and combining them through traditional rank fusion techniques. Based
on Tables 1 and 2, we select the following matchers as fusion candidates: normalized 2-gram and
3-gram, normalized CW, EDP, BM25 and cosine similarity. To represent fields and attributes
in the schema-based approaches, for each matcher we use the best performing representation
according to Table 1. CombMNZ and CombSUM are then used to aggregate the similarity scores
computed by the single matching strategies.</p>
        <p>Furthermore, we experiment with the WA fusion strategy. In particular, we use and test the following
similarity functions:
• NG2(F,A): normalized 2-gram similarity between the field and attribute names.
• NG2(GF,C): normalized 2-gram similarity between the concatenation of field and group names
and the attribute name.
• NG3(F,CA): normalized 3-gram similarity between the field name and the concatenation of
attribute and class names.
• cos(d_f, d_a): cosine similarity between the document representations of field f and article
attribute a.
• BM25(d_f, d_a): BM25 similarity between the document representations of field f and article
attribute a.</p>
        <p>Among all the different combinations we experimented with, the above-mentioned matchers have
been selected due to their diversity. To find the best weight combination, we perform a grid
search through a subset of parameters. Each weight can assume values between 0 and 1
with a pace of 0.2, resulting in 16807 arrangements with repetition.</p>
        <p>The resulting formula for the weighted average approach is:
WA(f, a) = (α NG2(F,A) + β NG2(GF,C) + γ NG3(F,CA) + δ cos(d_f, d_a) + ε BM25(d_f, d_a)) / (α + β + γ + δ + ε) (2)
The empirical analysis, carried out following a 5-fold cross-validation strategy, shows that the top
performing set of parameters is: α = 0.2, β = 1, γ = 1, δ = 0.6, ε = 0.8¹. Table 3 illustrates
the results of the composite approach using the CombSUM, CombMNZ, and Weighted Average
strategies, over fused runs such as BM25+cos, BM25+NG2, cos+NG2, EDP+NG2, cos+EDP, cos+CW,
NG2+NG3, BM25+cos+NG2+EDP, and BM25+cos+NG2+NG3+CW+EDP. It is possible to observe
that WA is the best performing strategy, achieving 46.7% correct matches and an MRR score
of 51.9%. We observe a general improvement of performance with respect to the single matching
strategies (Tables 1 and 2). CombMNZ aggregation provides better results than CombSUM.
Finally, we observe that, as a general trend, including more matching strategies improves the
quality of the aggregation.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This work stems from the need to exploit historical production data to obtain a mediated
schema of the machines built by a company over several years. We interpreted the problem
as a data integration task that required us to join multiple semi-structured BoMs in a unified
database over the global schema. We therefore focused on lexical schema- and instance-based
matching strategies. Among the schema-based ones, we considered matching
strategies that exploit the lexical similarity of the element names. For instance-based matching, we
used document similarity measures. Finally, we investigated the combination of the developed
matchers, which showed a performance increase with respect to the single matching strategies. We showed
that the number of correct matches increases if we combine multiple matching approaches.
Results are satisfactory and considered promising by the Bedeschi company. Our study
highlighted challenges and issues emerging when approaching this task and indicated promising
research paths. Future work will analyze the performance of our proposed solutions on
different datasets, as well as the use of additional similarity measures.</p>
      <p>¹ Due to space reasons, we omit the full analysis of the parametrization.</p>
      <p>8. D. Engmann, S. Massmann, Instance matching with COMA++, in: BTW Workshops, volume 7, 2007, pp. 28–37.
9. S. Massmann, S. Raunich, D. Aumüller, P. Arnold, E. Rahm, Evolution of the COMA match system, Ontology Matching 49 (2011) 49–60.
10. J. Madhavan, P. A. Bernstein, E. Rahm, Generic schema matching with Cupid, in: VLDB, volume 1, Citeseer, 2001, pp. 49–58.
11. E. Rahm, Towards large-scale schema and ontology matching, in: Schema Matching and Mapping, Springer, 2011, pp. 3–27.
12. J. Madhavan, P. A. Bernstein, A. Doan, A. Halevy, Corpus-based schema matching, in: 21st International Conference on Data Engineering (ICDE’05), IEEE, 2005, pp. 57–68.
13. A. Bilke, F. Naumann, Schema matching using duplicates, in: 21st International Conference on Data Engineering (ICDE’05), IEEE, 2005, pp. 69–80.
14. Z. Bellahsene, A. Bonifati, F. Duchateau, Y. Velegrakis, On evaluating schema matching and mapping, in: Schema Matching and Mapping, Springer, 2011, pp. 253–291.
15. Z. Abedjan, L. Golab, F. Naumann, T. Papenbrock, Data profiling, Synthesis Lectures on Data Management 10 (2018) 1–154.
16. A. Maydanchik, Data Quality Assessment, Technics Publications, 2007.
17. S. Kandel, R. Parikh, A. Paepcke, J. M. Hellerstein, J. Heer, Profiler: Integrated statistical analysis and visualization for data quality assessment, in: Proceedings of AVI, 2012.
18. Z. Abedjan, L. Golab, F. Naumann, Profiling relational data: a survey, The VLDB Journal 24 (2015) 557–581.
19. E. Rahm, P. A. Bernstein, A survey of approaches to automatic schema matching, The VLDB Journal 10 (2001) 334–350.
20. R. Dhamankar, Y. Lee, A. Doan, A. Halevy, P. Domingos, iMAP: Discovering complex semantic matches between database schemas, in: Proceedings of ACM SIGMOD, 2004, pp. 383–394.
21. P. Mork, L. Seligman, A. Rosenthal, J. Korb, C. Wolf, The Harmony integration workbench, in: Journal on Data Semantics XI, Springer, 2008, pp. 65–93.
22. F. Giunchiglia, M. Yatskevich, Element level semantic matching, 2004.
23. W. H. Gomaa, A. A. Fahmy, et al., A survey of text similarity approaches, International Journal of Computer Applications 68 (2013) 13–18.
24. G. Kondrak, N-gram similarity and distance, in: International Symposium on String Processing and Information Retrieval, Springer, 2005, pp. 115–126.
25. A. Huang, et al., Similarity measures for text document clustering, in: Proceedings of NZCSRSC, volume 4, 2008, pp. 9–56.
26. R. Ferdous, et al., An efficient k-means algorithm integrated with Jaccard distance measure for document clustering, in: Proceedings of AH-ICI, IEEE, 2009, pp. 1–6.
27. S. Robertson, H. Zaragoza, The Probabilistic Relevance Framework: BM25 and Beyond, Now Publishers Inc, 2009.
28. J. H. Lee, Analyses of multiple evidence combination, in: Proc. SIGIR, 1997, pp. 267–276.
29. E. M. Voorhees, et al., The TREC-8 question answering track report, in: TREC, 1999, pp. 77–82.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>F.</given-names>
            <surname>Azzalini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Renzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tanca</surname>
          </string-name>
          ,
          <article-title>Blocking techniques for entity linkage: A semantics-based approach</article-title>
          ,
          <source>Data Science and Engineering</source>
          <volume>6</volume>
          (
          <year>2021</year>
          )
          <fpage>20</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Halevy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ives</surname>
          </string-name>
          ,
          <source>Principles of Data Integration</source>
          , Elsevier
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>H.-H.</given-names>
            <surname>Do</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          ,
          <article-title>COMA - a system for flexible combination of schema matching approaches</article-title>
          ,
          <source>in: Proceedings of the VLDB, Elsevier</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>610</fpage>
          -
          <lpage>621</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>R.</given-names>
            <surname>Fagin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Haas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Popa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Velegrakis</surname>
          </string-name>
          ,
          <article-title>Clio: Schema mapping creation and data exchange</article-title>
          ,
          <source>in: Conceptual modeling: foundations and applications</source>
          , Springer,
          <year>2009</year>
          , pp.
          <fpage>198</fpage>
          -
          <lpage>236</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>L.</given-names>
            <surname>Seligman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Halevy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Carey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Madhavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kannan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Burdick</surname>
          </string-name>
          ,
          <article-title>OpenII: an open source information integration toolkit</article-title>
          ,
          <source>in: Proceedings of the 2010 ACM SIGMOD</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>1057</fpage>
          -
          <lpage>1060</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Madhavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          ,
          <article-title>Generic schema matching, ten years later</article-title>
          ,
          <source>Proceedings of the VLDB Endowment</source>
          <volume>4</volume>
          (
          <year>2011</year>
          )
          <fpage>695</fpage>
          -
          <lpage>701</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Falconer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Noy</surname>
          </string-name>
          ,
          <article-title>Interactive techniques to support ontology matching</article-title>
          ,
          <source>in: Schema Matching and Mapping</source>
          , Springer,
          <year>2011</year>
          , pp.
          <fpage>29</fpage>
          -
          <lpage>51</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>