Entity Identification on the Semantic Web

Alexis Morris, Yannis Velegrakis, Paolo Bouquet
University of Trento
{morris,velgias,bouquet}@disi.unitn.eu

Abstract. At the core of every information integration and data exchange effort lies the ability to identify whether two pieces of information refer to the same real world entity. This ability is of paramount importance for all those applications and systems currently operating in the highly heterogeneous web environment. Research in data management has long exploited features like keys or schema constraints for dealing with this issue, but the web reality has brought new challenges. In this work we survey a number of entity disambiguation and identification techniques and tools that can be used in semantic web applications and, more specifically, in an entity management system for the semantic web.

1 Introduction

Making good business decisions depends heavily not only on the amount of information available, but also on the quality of the data at hand. To successfully locate, retrieve and integrate information related to a given task, the ability to identify whether two pieces of information refer to the same real world entity is of paramount importance. Database management and federated systems have long studied this specific problem by exploiting special structures such as keys or referential integrity schema constraints [1].

The advent of the web enabled the exchange of data among data sources of different organizations and individuals. Since these sources have typically been developed by different people, at different times and with different assumptions and requirements in mind, a natural degree of heterogeneity is prevalent. This makes the problem of entity identification even harder. The same entity may be represented in different sources using different data models or meanings (semantic heterogeneity), it may be structured differently (structural heterogeneity), or it may have varying spelling values (syntactic heterogeneity). The name of a person, for instance, may be recorded in one data source using two fields for the first and last name, in a second using one field for both, while a third may again use one field but with the first name abbreviated (a small illustration of reconciling such variants follows below). Furthermore, the fact that different organizations or individuals typically have different interests and priorities results in a situation in which data sources model different (possibly partially overlapping) parts of the same information. For instance, a data source may store the name and date of birth of a person, while another may store only the name and the city of birth.

Entity identification can be defined as the ability of a system to accurately match semantically similar terms to the same concept. Entity disambiguation, a task that typically precedes entity identification, is the ability to select the most fitting categorization of a concept among a range of candidate options. In the literature the two terms are most of the time used interchangeably. The problem of entity identification has been known to the data management research community for more than two decades under different names, e.g., record linkage or record matching [2], merge-purge [3], data deduplication, database hardening and, very recently, reference reconciliation [4]. Most existing techniques are designed for relational systems, and may involve techniques typically met in schema matching [5] or mapping [6].
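To make the syntactic heterogeneity problem concrete, consider the following minimal sketch (our own illustration, not taken from any of the surveyed systems) that reduces the three name layouts mentioned above to a single comparable key; the field layouts and normalization rules are simplifying assumptions.

import re

def normalize_name(*fields):
    # Collapse the heterogeneous layouts discussed above -- ("John", "Smith"),
    # ("John Smith",) or ("J. Smith",) -- into a comparable (initial, last) key.
    text = " ".join(fields).strip()
    if "," in text:  # also accept a "Last, First" ordering
        last, first = [p.strip() for p in text.split(",", 1)]
    else:
        parts = text.split()
        first, last = " ".join(parts[:-1]), parts[-1]
    initial = first[0].lower() if first else ""
    return initial, re.sub(r"\W", "", last).lower()

# All three heterogeneous records collapse to the same key ('j', 'smith').
assert normalize_name("John", "Smith") == normalize_name("John Smith") \
       == normalize_name("J. Smith")

Such normalization handles only the syntactic layer; the semantic and structural forms of heterogeneity discussed above require the richer techniques surveyed in the following sections.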
Ontologies have been used extensively to communicate the semantics of the data, mostly when schemas are limited in successfully delivering such a task. Nevertheless, ontologies are not free from the many identification problems either. Due to the size of the web, global agreement on the modeling, structure and use of ontologies is hard to achieve. Ontologies developed by different communities are potentially diverse, and mapping techniques across ontologies have become essential [7]. Ontology integration has become one of the major challenges for the semantic web [8]. In a recent survey [9] of 25 information integration approaches that involve ontologies, one can easily note the prevalence of a need for entity identification techniques in each one of them.

An effort has recently been initiated aiming at facilitating entity identification on the Semantic Web. This effort is currently being implemented within the OKKAM project [10] (http://www.okkam.org). The goal of OKKAM, in general, is to enable the Web of Entities, namely a virtual space where any collection of data and information about any type of web published entity, e.g., person, location, organization, event, product, etc., can be integrated into a single virtual, decentralized, open knowledge base. It provides a common global reference for every web document, application, or any other entity that has a representation on the semantic web. The success of OKKAM is of critical importance since it can offer to the Semantic Web a benefit similar to the one hypertext has offered to the Web [11]. Having entity identification as one of its core components, the success of OKKAM highly depends on the efficiency and effectiveness of the entity identification process.

In this work we survey a number of techniques and tools related to the entity identification process in the context of the semantic web. Special focus is given to approaches that use ontologies or are applicable to semantic web applications. This is the first step towards the development of the high accuracy entity identification mechanisms required in OKKAM. Domain knowledge plays a critical role in all these processes. It has been noted [12] as a critical tool for successfully performing tasks such as matching [5], mapping [6], evolution [13], query translation and integration [14]. The WordNet [15] taxonomy, for instance, has been extensively used to improve the accuracy of matching methods. Of course, since dealing with semantics is a hard task, a fully automatic solution may not always be possible. Human intervention will always be needed, at least to verify the correctness of the generated results. Nevertheless, rich semantic models and knowledge representation reasoning techniques can significantly reduce the required human effort and allow entity identification among data containing thousands or millions of entries [8].

The list of techniques presented here is definitely not exhaustive, mainly due to space limitations. Nevertheless, we report all the categories we have found throughout our study. For each category we have selected one or two characteristic representative works, which we report. The categories include techniques from machine learning (Section 2), lexicons/taxonomies (Section 3), similarity functions (Section 4), structural approaches (Section 5), word-sense disambiguation methods (Section 6), metadata assisted solutions (Section 7), linguistic algorithms (Section 8) and semantic coordination (Section 9). The goal of the current work is to provide a comprehensive picture and an understanding of how these techniques can be integrated, or maybe orchestrated, together to form a robust and efficient solution for entity identification.
2 Machine Learning

Machine learning based approaches are among the earliest works in this area. Neural networks have been very popular in computing semantic similarity [16]. The idea is to use a classifier to categorize attributes according to their field specifications and data values, then train a neural network to recognize similar attributes. A data set that has been used extensively is the DBLP bibliographic collection (http://www.informatik.uni-trier.de/~ley/db), due to its nature, size and format.

A characteristic machine learning technique for entity identification is the supervised learning approach of Han et al. [17]. The primary focus of this work is the disambiguation of names within citation data. The authors identify name entities by the use of a "canonical" name, which is a "minimal invariant" that uniquely identifies an individual author. This technique is commonly used in libraries to overcome the problem of identity uncertainty. Although token based methods are used, the focus is not entirely on string-based similarity calculations. Two models are used: a generative and a discriminative one. The generative model produces new sample data through the use of a Naive Bayes algorithm. A support vector machine adapted for multi-class classification is used as the discriminative model. To illustrate the specific mechanism in more detail, we provide a high level description of the steps performed on the DBLP data set. The Naive Bayes model is initially set up by computing a probability based on a group of citation entries that have been parsed with regular expressions to determine a sample set. Each entry is analyzed to see whether the maximum posterior probability indicates that the author entry is the author of a paper. The processing of the author information is used in future iterations as prior knowledge or as training data. The Bayes rule is used to compute which author wrote which paper. Subsequently, the training data is used to compute the probability of writing a paper in the future with some other authors. To do so, sub-probabilities are calculated, such as the probability of writing a paper alone, with co-authors in general, with previously seen co-authors, with unseen co-authors, etc. These individual probabilities are then combined with the terms found in the citations to keep track of author interests versus co-author patterns. In a different process, a support vector machine approach is used to classify a citation to the nearest author. Citations are given a vector containing features of the author information and other parameters. Finally, a decision function is computed and the classification features are ranked for future interactions.

Although the use of a machine learning approach is generic, the above ideas can easily be extended to take into consideration similarities between terms and concepts as provided by ontologies or taxonomies. For instance, WordNet has been suggested for that purpose [17]. Clustering based on that similarity can then be applied. The advantage of the machine learning approaches is that they achieve high precision and recall, but on the other hand, they require training, which may not always be easy to obtain. Furthermore, as can be noticed above, the analysis performed on the data is not based on similarities alone, but also takes into consideration behavioral characteristics, i.e., with what persons an author has previously collaborated. The findings can then be used to predict future behaviour.
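As an illustration of the Naive Bayes step described above, the sketch below (a simplification of ours, not the authors' full two-model pipeline) picks, among candidate identities for an ambiguous author name, the one with the maximum posterior probability given the co-authors of a new citation; the example names, the add-one smoothing and the assumed co-author vocabulary size are our own choices.

from collections import Counter, defaultdict
import math

def train(citations):
    # citations: (author, coauthor-list) pairs with known identities.
    prior = Counter()
    coauth = defaultdict(Counter)
    for author, coauthors in citations:
        prior[author] += 1
        coauth[author].update(coauthors)
    total = sum(prior.values())
    return {a: n / total for a, n in prior.items()}, coauth

def most_probable_author(candidates, coauthors, prior, coauth, vocab=1000):
    # Maximum-posterior choice: log P(author) + sum of log P(coauthor | author),
    # with add-one smoothing over an assumed co-author vocabulary of size `vocab`.
    def score(a):
        seen, n = coauth[a], sum(coauth[a].values())
        return math.log(prior[a]) + sum(
            math.log((seen[c] + 1) / (n + vocab)) for c in coauthors)
    return max(candidates, key=score)

prior, coauth = train([("J. Smith (physics)", ["A. Jones", "B. Lee"]),
                       ("J. Smith (physics)", ["A. Jones"]),
                       ("J. Smith (biology)", ["C. Wu"])])
print(most_probable_author(["J. Smith (physics)", "J. Smith (biology)"],
                           ["A. Jones"], prior, coauth))

The co-author evidence here is exactly the behavioral characteristic mentioned above: the known collaboration with "A. Jones" pulls the ambiguous citation towards the physics identity.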
3 Lexicon/Taxonomy

A commonly used approach to discovering the semantics of entities is to use a taxonomy structure, an ontology, a concept map, or a lexicon, i.e., a data dictionary. The Adapted Lesk algorithm [18] is one of those proposed for word sense disambiguation. It works by computing how semantically close two words are. To disambiguate and assign the right meaning to a word, context information is used. The notion of gloss overlap is also used as a measure, where the "gloss" is the definition of a word, as in a glossary, and the corresponding "overlap" is a measure of the number of words that are common to different gloss groups. The approach is based on taxonomy concept hierarchies such as WordNet. The gloss calculation is based on a weighted scoring method. A central issue in lexicon-based approaches is the definition of the similarity between lexicon terms. Among the different metrics that can be used for such a purpose are the length of the path between the two words in the hierarchy, the kind of edges that exist in such a path, the context information, or a combination of the above. Another similarity measure is the ratio of the amount of information needed to state the commonality of the two concepts over the amount of information needed to describe them. To decide whether two concepts correspond to the same real world entity or not, one needs to compute the similarity between their respective concepts and compare it to the similarity of other concepts. The decision is based on some predefined cut-off value.

The advantage of using taxonomies is that they can be leveraged as a source of domain information for different purposes. Furthermore, taxonomies facilitate semi-automatic solutions. On the other hand, the limitation of taxonomy based approaches is that they require a shared taxonomy to always be available and a domain expert to tune the cut-off values on which the entity identification decisions are based.
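The following sketch illustrates the gloss-overlap idea at the heart of the Lesk family of algorithms, using the WordNet interface of the NLTK Python library; the full Adapted Lesk algorithm additionally scores the glosses of related synsets and weights multi-word overlaps, which we omit, and the stop-word list is an assumption of ours.

from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

STOP = {"a", "an", "the", "of", "in", "or", "and", "to", "into", "is"}

def lesk_sense(word, context):
    # Pick the sense of `word` whose gloss (dictionary definition) shares
    # the most non-stop-words with the surrounding context.
    context_words = set(w.lower() for w in context) - STOP
    def overlap(synset):
        gloss = set(synset.definition().lower().split()) - STOP
        return len(gloss & context_words)
    return max(wn.synsets(word), key=overlap)

# 'bank' in a financial versus a river context (results depend on the
# installed WordNet version).
print(lesk_sense("bank", "he deposited money into his account".split()).definition())
print(lesk_sense("bank", "the river overflowed the sloping land".split()).definition())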
4 Similarity Functions

The most commonly used approach in entity identification is probably the use of similarity functions. A similarity function is a function that computes a score based on how many components two entities have in common. Popular functions are the vector model [19], distance measurements such as those proposed in information theory [20], or those based on relational schemes [21]. A system for global schema generation and integration based on ontologies has recently been developed [22]. The system integrates local ontologies into a global counterpart. It performs similarity inference using description logic reasoning. This reasoning provides a means of disambiguation through equality, specialization, overlapping, and disjointness relationships.

Seeker [23] is one of the largest, in terms of scale, semantic tagging efforts to date. It is a text analysis engine used mainly for web annotation. It performs automatic entity disambiguation using a technique called Taxonomy Based Disambiguation. It employs similarity functions, along with machine learning techniques, to calculate the semantic value of a given word and generate semantically meaningful tags.

For a similar goal, i.e., semantic annotation, similarity functions have been used [24] to disambiguate author names from citation data using an "ontology category utility". In this particular approach, authors are initially represented as clusters of their published work. Key terms are extracted from the titles and the abstracts based on the concepts that exist in a domain ontology. Authors are then compared to each other based on four types of similarities: exact-match, taxonomy similarity, subsumption similarity, and relation similarity. The result of each comparison is a set of four numbers, known as the four category utilities. The four category utilities are then summed up, and the total represents the final score that determines whether two authors are actually the same person or not.

The advantage of similarity functions is that they are easy to formalize and can easily be tuned to adapt to the nature of the data at hand in order to maximize recall and precision.
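A minimal sketch of the vector-model flavour of similarity functions follows: entities are represented as bags of tokens, the similarity is the cosine of their term vectors, and a predefined cut-off (the 0.6 here is an arbitrary assumption) turns the score into a match decision.

from collections import Counter
import math

def cosine(tokens_a, tokens_b):
    # Cosine of the term-frequency vectors of the two token bags.
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def same_entity(e1, e2, cutoff=0.6):
    # The cut-off would be tuned per data set to balance recall and precision.
    return cosine(e1, e2) >= cutoff

e1 = "john smith university of trento".split()
e2 = "j smith univ trento".split()
print(cosine(e1, e2))       # ~0.45: a true match scored below the cut-off,
print(same_entity(e1, e2))  # illustrating why tuning (or better tokens) matters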
5 Structure Exploitation

A great part of the semantics of every piece of information is stored in its structure. This is why good data modeling is considered one of the most critical tasks in data management. This extends to ontological data as well. One can analyze the structure of ontology concepts in order to determine whether two concepts refer to the same real world entity or a similar one. Methods that have been developed for semi-structured data [25] can easily be adapted to the case of ontologies. A recent study [26] suggests three different measures for comparing concepts, namely graph matching, filter-to-one descriptions and probabilistic measures. The same work also describes a structural entity matching approach based on tree-edit distance, which is intended to be used as a first step towards the building of more general integration approaches. It aims at creating a computational model that assesses semantic similarity among entity classes from different and independent ontologies without constructing a shared ontology a priori. To compute the structural distance, three specific matchings are taken into consideration, i.e., word matching, feature matching, and neighborhood matching. The results of each one are combined into a weighted-sum similarity function which also incorporates the depth of the taxonomy or the ontology tree. For word matching, the authors check the words in synonym sets by comparing the number of similar and different words. This approach is able to find the highest degree of similar terms between different synonym sets and gives an estimate of similarity. Feature matching refers to the process of determining the distance between distinguishing features. These features are lexicographic, and involve string matching over words in synonym sets. Finally, for semantic-neighbourhood matching, entity classes are grouped into a semantic cluster according to the synonym set or feature matching. This type of matching is based on a quantity defined as the cardinality of the neighbourhoods over the cardinality of their intersection.

The accuracy of structural exploitation can be greatly improved through knowledge of the distinguishing features of certain types of entities. For instance, knowing that the date and place of birth, along with the name, are distinguishing features of a person, one can conclude that two entities refer to the same person if they agree on these distinguishing attributes, even if they differ in the rest of their structure.

6 Word-sense Disambiguation

One of the most important tasks in entity identification is word sense disambiguation. Naturally, one cannot expect the same word to always have the same meaning in every web application. This observation has led to a number of studies on finding ways to make sense of words based on the context in which they appear. Id-Rank [27] is one such approach. The target domain is the multi-domain news archives of multiple news agencies. It is designed similarly to the PageRank algorithm, but uses a news metadata ontology and a natural language processing engine, along with heuristic/deductive database techniques. It is based on the notion of semantic coherence, i.e., common appearances of entities in certain contexts, and trends, i.e., important references to a particular entity or event.

In analogy to PageRank, which values a page based on its incoming reference links, Id-Rank values the meaning of a word based on the number of news documents in which the word appears, and is mainly used to rank candidate entities. A semantic network is formed from the entities in the candidate set. An edge between two entities means that the entities have at some point appeared in the same context. Edges have weights. An equation incorporates these weights in a computation process in order to come up with a final score, which is used to generate the final ranking of the candidate entities similar to the entity at hand.

The advantage of word-sense disambiguation approaches is that they can find similarities between entities modeled at different times and in different fields, since they do not require a common vocabulary. Nevertheless, due to the semantic information that is involved in the task, a verification process by a domain expert may often be necessary.
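The sketch below shows the generic PageRank-style iteration that Id-Rank is modelled on, over a small weighted co-occurrence network; the graph, weights and damping factor are illustrative assumptions of ours, not the project's actual equation. Candidates that co-occur with highly ranked entities end up ranked higher themselves.

def rank(edges, damping=0.85, iters=50):
    # edges: {entity: {neighbour: weight}} for co-occurrence in news items.
    # Dangling nodes are left unhandled, a simplification of full PageRank.
    nodes = set(edges) | {n for nbrs in edges.values() for n in nbrs}
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        score = {n: (1 - damping) / len(nodes) + damping * sum(
                     score[m] * w[n] / sum(w.values())
                     for m, w in edges.items() if n in w)
                 for n in nodes}
    return score

graph = {"Paris (France)": {"Eiffel Tower": 5, "Seine": 3},
         "Eiffel Tower":   {"Paris (France)": 5},
         "Paris (Texas)":  {"Eiffel Tower": 1}}
print(sorted(rank(graph).items(), key=lambda kv: -kv[1]))

In this toy network the French capital, strongly connected to highly ranked context entities, ends up well above the Texan namesake, mirroring how Id-Rank ranks candidate entities for an ambiguous mention.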
7 Metadata Information

The importance of metadata has already been recognized in many different fields [28]. For entity identification specifically, metadata can play a significant role. The term metadata refers to any kind of information that is not considered part of the data but is related to it and helps in better communicating its semantics. Metadata may include provenance information, language information or quality information, i.e., time, accuracy, authority, etc.

Ontologies are one of the main tools used to communicate metadata information. Mappings between different ontologies can be used to achieve interoperability between them [29]. This can be done either on a one-to-one basis, i.e., in a P2P style, or in a style similar to information integration systems. In the latter case, a global ontology is created and mappings are then introduced to associate the terms and concepts of the global ontology to those of the individual local ontologies. Mappings are typically logical expressions. This provides the additional advantage that inference and integration systems can use, for instance, description logic rules to find subsumption relationships and inconsistencies, and to provide query optimization services. Additional kinds of metadata, such as language information, can be used to link different values that may represent the same real world entity but look different simply because they are expressed in different languages.

8 Linguistic Analysis

Social factors are an important source of heterogeneity. Different people may use different words or expressions to refer to the same entity. In such cases, linguistic analysis technologies can prove highly beneficial in disambiguating words or expressions in the data or in the user queries. A successful application of linguistic techniques is the Linguistic Combination System (LCS) [30]. First, LCS maps terms of two ontologies. The results are then provided as input to matchers which produce a similarity cube that is then aggregated into a similarity matrix and used for richer mapping discovery. The approach makes use of similarity functions, but proposes its own linguistic aggregation operation to determine matches. It involves several phases, proceeding via matching functions for the two candidate ontologies, as well as aggregation functions for deducing the similarity possibilities, and heuristics to discover mappings based on the prior phases of execution. The approach begins with ontology matchers for Name, Name path, Taxonomy, Domain and Range, and Mother-concept. The initial matcher, the Name matcher, provides the first similarity matrix that is used by the other matchers as input. The Linguistic Aggregation Operators (LAO) are then chosen by a user and are applied for every entity pair (x, y). These are operators for Max, Min, Avg, Most, Alh (at least half), and Amap (as many as possible). They are used to group the match results as needed, based on which matchers are satisfied or not. Once the matchers are combined to form the similarity cube, the LAOs are used to further combine these into a similarity matrix for selecting match candidates.
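To clarify the cube-to-matrix step, here is a small sketch of ours: each matcher contributes one similarity matrix over entity pairs, stacking them yields the "similarity cube", and an aggregation operator (we show Max, Min and Avg; LCS also defines Most, Alh and Amap) collapses the cube into the single matrix from which match candidates are selected. The matcher names, scores and the 0.5 threshold are made up for illustration.

# Two matchers, each producing a similarity matrix over entity pairs.
name_matcher     = {("author", "writer"): 0.7, ("title", "name"): 0.4}
taxonomy_matcher = {("author", "writer"): 0.9, ("title", "name"): 0.3}
cube = [name_matcher, taxonomy_matcher]   # the "similarity cube"

LAO = {"max": max, "min": min, "avg": lambda xs: sum(xs) / len(xs)}

def aggregate(cube, operator):
    # Collapse the cube into one matrix by applying the chosen linguistic
    # aggregation operator across the matchers, pair by pair.
    pairs = set().union(*cube)
    return {p: LAO[operator]([m.get(p, 0.0) for m in cube]) for p in pairs}

matrix = aggregate(cube, "avg")
candidates = [p for p, s in matrix.items() if s >= 0.5]
print(matrix)       # ('author', 'writer') averages to 0.8
print(candidates)   # and is therefore kept as a match candidate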
9 Semantic Coordination

Semantic coordination, namely the problem of finding an agreement on the meaning of heterogeneous semantic models, is one of the key issues in the development of the Semantic Web. A new algorithm has recently been proposed [31] for discovering semantic mappings across hierarchical classifications based on semantic coordination. It employs lexical structure and semantic similarity. The specific approach shifts the problem of semantic coordination from the problem of computing linguistic or structural similarities (what most other proposed approaches do) to the problem of deducing relations between sets of logical formulae that represent the meaning of concepts belonging to different models. The approach is based on the intuition that there is an essential conceptual difference between coordinating generic abstract structures (e.g., arbitrary labelled graphs) and coordinating structures whose labels are taken from the language spoken by the community of their users. An important conclusion of the above work is that the intended use of the data is a very important factor for entity identification, as important as, or even more important than, the actual structure of the data.

10 Discussion and Conclusions

We have carried out a study of the different approaches, techniques and tools that can be used for entity identification on the semantic web. Entity identification is not a new problem; many studies have already been performed and a number of interesting surveys have already been published [32]. However, these works have concentrated their effort on database management techniques. Their main interest is in duplicate record detection on databases with millions of tuples. Here we put more effort on the semantic aspect of the problem. We chose to present approaches that are either based on semantic web applications and ontologies, or can be used for entity identification on ontological data. The motivation for our work was an advanced entity matching mechanism that is currently under development within the OKKAM project, a project aiming to enable the web of entities by providing an infrastructure that can be used to assign identifiers to every entity available on the semantic web.

The conclusion of our study is that there is no silver bullet. No single technique can offer a service that performs well in all situations. Despite the depth and breadth of the existing techniques, we believe that there is ample space for optimization and improvement, mainly in two directions. The first is the development of methods that exploit the results of different approaches and combine them to reach one single decision. The second direction is the incorporation of domain knowledge into the existing techniques.

Another observation that we have made is that all the existing techniques operate on the assumption of static, or relatively static, data. The evolving nature of the data and schemas has not been taken into consideration. The modern web is a very volatile environment, since millions of users are not only accessing the data but are also modifying it. Thus, mechanisms to easily and quickly adapt to new requirements and new data are needed [33]. Furthermore, recent advances in sensor networks and other technologies have introduced numerous streaming data sources. In those cases, many of the presented entity identification techniques will be hard to apply [34]. The reason is that the streaming nature does not allow for expensive computations or for large amounts of data to be kept in memory or on the hard disk; thus, the existing methods will have to be adapted and new ones may have to be invented.

Acknowledgements: This work has been partially funded by the EU grant ICT-215032.

References

1. Arens, Y., Chee, C.Y., Hsu, C., Knoblock, C.A.: Retrieving and Integrating Data from Multiple Information Sources. JICIS 2 (1993) 127–158
2. Newcombe, H.B.: Record linking: The design of efficient systems for linking records into individual and family histories. American Journal of Human Genetics 19 (1967) 335–359
3. Hernandez, M.A., Stolfo, S.J.: Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem. Data Mining and Knowledge Discovery 2 (1998) 9–37
4. Dong, X., Halevy, A.Y., Madhavan, J.: Reference Reconciliation in Complex Information Spaces. In: SIGMOD Conference. (2005) 85–96
5. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB Journal 10 (2001) 334–350
6. Popa, L., Velegrakis, Y., Miller, R.J., Hernandez, M.A., Fagin, R.: Translating Web Data. In: VLDB. (2002) 598–609
7. Choi, N., Song, I., Han, H.: A survey on ontology mapping. SIGMOD Record 35 (2006) 34–41
8. Noy, N.F.: Semantic integration: a survey of ontology-based approaches. SIGMOD Record 33 (2004) 65–70
9. Wache, H., Vögele, T., Visser, U., Stuckenschmidt, H., Schuster, G., Neumann, H., Hübner, S.: Ontology-based integration of information — a survey of existing approaches. In: IJCAI-01 Workshop: Ontologies and Information Sharing. (2001) 108–117
10. Palpanas, T., Chaudhry, J., Andritsos, P., Velegrakis, Y.: Entity Data Management in OKKAM. In: SWAP DEXA Workshop. (2008) 729–733
11. Bouquet, P., Stoermer, H., Bazzanella, B.: An Entity Name System (ENS) for the Semantic Web. In: ESWC. (2008) 258–272
12. Doan, A., Halevy, A.Y.: Semantic-integration research in the database community. AI Mag. 26 (2005) 83–94
13. Velegrakis, Y., Miller, R.J., Popa, L.: Mapping Adaptation under Evolving Schemas. In: VLDB. (2003) 584–595
14. Pinto, H., Pérez, A., Martins, J.: Some Issues on Ontology Integration. In: IJCAI-99 Workshop on Ontologies and Problem-Solving Methods: Lessons Learned and Future Trends. (1999)
15. Niles, I., Pease, A.: Linking lexicons and ontologies: Mapping WordNet to the Suggested Upper Merged Ontology. In: Proceedings of the 2003 International Conference on Information and Knowledge Engineering (IKE '03), Las Vegas, Nevada. (2003)
16. Li, W., Clifton, C.: Semantic Integration in Heterogeneous Databases Using Neural Networks. In: VLDB. (1994) 1–12
17. Han, H., Giles, L., Zha, H., Li, C., Tsioutsiouliklis, K.: Two supervised learning approaches for name disambiguation in author citations. In: JCDL. (2004) 296–305
18. Patwardhan, S., Banerjee, S., Pedersen, T.: Using Measures of Semantic Relatedness for Word Sense Disambiguation. In: International Conference on Intelligent Text Processing and Computational Linguistics. (2003) 241–257
19. Frakes, W.B., Baeza-Yates, R.A.: Information Retrieval: Data Structures & Algorithms. Prentice-Hall (1992)
20. Lin, D.: An Information-Theoretic Definition of Similarity. In: ICML. (1998) 296–304
21. Brauner, D.F., Intrator, C., Freitas, J.C., Casanova, M.A.: An Instance-based Approach for Matching Export Schemas of Geographical Database Web Services. In: GeoInfo. (2007) 109–120
22. Hakimpour, F., Geppert, A.: Resolving semantic heterogeneity in schema integration. In: FOIS. (2001) 297–308
23. Dill, S., Eiron, N., Gibson, D., Gruhl, D., Guha, R.: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation. In: WWW. (2003) 178–186
24. Park, Y., Kim, J.: OnCU system: ontology-based category utility approach for author name disambiguation. In: ICUIMC. (2008) 63–68
25. Madhavan, J., Bernstein, P.A., Rahm, E.: Generic schema matching with Cupid. In: VLDB. (2001) 49–58
26. Rodriguez, M.A., Egenhofer, M.J.: Determining Semantic Similarity among Entity Classes from Different Ontologies. TKDE 15 (2003) 442–456
27. García, N.F., del Toro, J.M.B., Sánchez, L., Bernardi, A.: IdentityRank: Named Entity Disambiguation in the Context of the NEWS Project. In: ESWC. (2007) 640–654
28. Srivastava, D., Velegrakis, Y.: Intensional Associations between Data and Metadata. In: SIGMOD. (2007) 401–412
29. Kwon, J., Jeong, D., Lee, L., Baik, D.: Intelligent semantic concept mapping for semantic query rewriting/optimization in ontology-based information integration system. International Journal of Software Engineering and Knowledge Engineering 14 (2004) 519–542
30. Ji, Q., Liu, W., Qi, G., Bell, D.A.: LCS: A Linguistic Combination System for Ontology Matching. In: KSEM. (2006) 176–189
31. Bouquet, P., Serafini, L., Zanobini, S.: Semantic Coordination: A New Approach and an Application. (2003) 130–145
32. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate Record Detection: A Survey. TKDE 19 (2007) 1–16
33. Velegrakis, Y.: On the Importance of Updates in Information Integration and Data Exchange Systems. In: DBISP2P. (2008)
34. Tantono, F.I., Manerikar, N., Palpanas, T.: Efficiently discovering recent frequent items in data streams. In: SSDBM. (2008) 222–239