Entity Lifecycle Management for OKKAM 1

Junaid Chaudhry a, Themis Palpanas a, Periklis Andritsos a, Antonio Maña b
a University of Trento
b University of Malaga

Abstract. In this paper, we examine the special requirements of lifecycle management for entities in the context of an entity management system for the semantic web. We study the requirements with respect to creating and modifying these entities, as well as to managing their evolution over time. Furthermore, we present the issues arising from the access control models needed for the management of a large, distributed repository of entities. Finally, we discuss the research directions that can offer solutions to the above problems, and give a brief overview of techniques and methods relevant to these solution directions.

Keywords: semantic web, entity lifecycle management

1 This work was partially supported by the FP7 EU Large-scale Integrating Project OKKAM - Enabling a Web of Entities (contract no. ICT-215032). For more details, visit http://www.okkam.org.

1. Introduction

To date, the natural growth path for computer systems has been in supporting technologies such as data storage density, processing capability, and per-user network bandwidth. The usefulness of internet and intranet networks has fueled the growth of computing applications and, in turn, the complexity of their administration. The wealth of data that is currently found on the World Wide Web (WWW) is of limited use if it cannot be converted into meaningful information. The Semantic Web (SW) is an evolving extension of the WWW, in which the meaning of data and services is defined by attaching semantic concepts to them, making it possible for applications and machines to make sense of the web content [2]. The SW is evolving in a direction that leads to understanding, processing, and utilizing the information on the web. One of the major problems that has emerged through the semantic web effort is the problem of uniquely identifying entities 2 [3]. Entities play a major role in the SW, since they represent the atomic objects of reference and reasoning. Nevertheless, we currently face the problem of identifying and referencing these entities, which prohibits us from moving to the next step towards the goal of the SW, that of reasoning about entities. The problem derives from the fact that different users, or systems, assign and use different identifiers for the same real-world entity. As a result, we cannot effectively reason about this entity, exactly because it is not consistently assigned the same identifier. We claim that the above problem is at the core of the semantic web effort. Along with the problem of assigning global identifiers to entities in the semantic web also come the problems of managing these identifiers throughout the entire lifetime of the entities. Giving solutions to the above issues is the goal of OKKAM [3], a web-scale system for assigning and managing unique, global identifiers to entities in the WWW. In such a system, with which a large number of users is expected to interact, it is natural to observe diversity among user queries about the same entity, since the users' knowledge and viewpoints about the specific entity may differ. Exploiting this innate property of system interaction, the entities in the system repository should respond to this diversity and evolve with time. This evolution can be triggered by the discovery of nascent knowledge (i.e., new knowledge) or of tacit knowledge (e.g.,
duplication, coexistence, resemblance, etc.). The aim of this study is to discuss requirements and propose solution directions for the problems of keeping the data in the OKKAM system that represent entities consistent, up-to-date, and clean. We will collectively refer to all the above problems as the problem of entity lifecycle management.
The rest of the paper is organized as follows. We review the relevant literature in Section 2, and give an overview of the OKKAM system in Section 3. In Section 4, we describe the requirements of entity lifecycle management. Section 5 outlines the solution we envision and the research directions we are currently pursuing. Finally, we conclude in Section 6.

2 In the rest of this paper, we will use the term entity to refer to individuals, particulars, and instances. This notion of entity is quite liberal, and includes things like products, organizations, associations, countries, events, publications, hotels, people, etc. It may also include fictional objects (e.g., Pegasus), objects from the past (e.g., Plato), or abstract objects (e.g., Gödel's Theorem).

2. Related Work

In this section, we review different approaches related to entity lifecycle management that have been proposed in the literature.

2.1 Data Storage

There has been a lot of work on rule-based reasoning for data partitioning and placement techniques [6], but empirical evidence shows that such systems still need to rely on a human user. The placement of data based on recency is proposed in [7]. The most recently queried data are placed in the top tier of the memory, because it is assumed that these data will be accessed again in the near future. Douglis et al. [1] propose a storage infrastructure that effectively takes into account not only disk reads and writes, but also data creation and deletion. The storage system they propose uses a mechanism based on relative values in order to decide which portions of the data to retain in the case of space shortage. Mitra et al. [8] propose and evaluate query-based partitioning, a novel approach for partitioning documents and indexes across the storage hierarchy, based on the insight that documents not present in the top-K results of a query are unlikely to be accessed through that query. Various techniques that employ different strategies have been proposed for efficiently storing different versions of data objects [37][38]. Versioning has also been studied in the context of semi-structured documents [39], and efficient query answering algorithms have been proposed [40].

2.2 Data Lineage

When entities are created and modified, we are interested in keeping track of information related to the provenance of the entity data stored in the repository [34]. An important issue in data provenance is its characterization, that is, answering questions like "why is a piece of data in the output?" and "where is the piece of data copied from?". Buneman et al. [35] target these issues and propose a framework for describing and understanding provenance using a special tree-like model, where the location of a piece of data can be uniquely described by a path from the root of this tree. Sometimes the propagation of annotations depends on the syntax of the query, and one may want to control how annotations propagate through a schema. Custom propagation schemes allow the user to specify where annotations should be obtained from. Bhagwat et al.
[36] present propagation schemes that are essentially based on where data is copied from.

2.3 Entity Resolution

A lot of structural heterogeneity is expected in the OKKAM system (for example, representing a date as year/month/day in place of day/month/year, or the location of a room as room number-building-university in place of university-room number-building). For such situations, various data cleaning techniques are necessary [11]. Elmagarmid et al. [9] target the record linkage and record matching problems, and enumerate various techniques suitable for resolving lexical heterogeneity. They divide duplicate record detection techniques into two broad categories: ad hoc and probabilistic. The ad hoc methods perform efficiently on existing relational databases. On the other hand, the probabilistic methods outperform ad hoc techniques in terms of accuracy; however, they are only efficient for relatively small datasets. Jaro [13] uses the linear sum assignment model to pair related records together. A string similarity value is calculated by taking into account all the characters of the corresponding strings; then, a weight adjustment strategy is applied, which leads to an efficient way of recognizing typographical errors. Using the Bayes decision rule for minimum cost, Dempster et al. [14] use the expectation maximization algorithm for parameter estimation from incomplete data. Winkler [15] introduces probabilistic methods for accounting for certain types of typographical variation (and hence, duplicate detection). An important point of this research is that no calibration datasets are needed to train the algorithm. The ALIAS system [19] uses the "reject region" scheme to reduce the size of the dataset, but it needs human intervention when the level of uncertainty is high. Using rule-based approaches [27][28][29], it is easier to create a large number of training pairs that are either clearly non-duplicates or clearly duplicates. Despite that, the rule-based approaches require user intervention in rule management scenarios. Recent approaches have also focused on the problem of how to efficiently support the duplicate identification operation in the context of relational database systems [22][24]. Duplicate detection through record linkage has also been studied [16][17][18][26]. These approaches are based on different flavors of clustering algorithms. McCallum [12] discusses anaphora resolution, where the problem is to locate different mentions of the same entity in a document, and proposes the use of canopy-like structures for speeding up the duplicate detection process. The first step is to use a cheap comparison metric to group records into clusters. Then the records are compared in a pairwise fashion, using a more expensive similarity metric that leads to better qualitative results. Benjelloun et al. [10] propose three algorithms for solving the entity resolution problem, namely, G-Swoosh, R-Swoosh, and F-Swoosh. These algorithms take into account the characteristics of the match and merge functions, and can also provide approximate results.

3. Background

In this section, we give a brief overview of the OKKAM system (a more detailed presentation can be found elsewhere [3]). We will use OKKAM as the basis for our discussion on the requirements and solution directions for entity lifecycle management, presented in the rest of this study.
Note, however, that our discussion is not restricted to the OKKAM system, but is general, and relevant to any system for entity identification management. The overall goal of OKKAM is to handle the process of assigning and managing unique identifiers for entities in the WWW. These identifiers are global, with the purpose of consistently identifying a specific entity across system boundaries, regardless of who references this entity and where. Figure 1 shows a high-level conceptual view of the OKKAM system and the interactions with its environment.

Figure 1: Schematic of OKKAM system and interactions.

The OKKAM system has a repository for storing the entity identifiers (note that in reality this repository will be distributed and replicated), along with a small amount of descriptive information for each entity. The purpose of storing this information is to use it for discriminating among entities, not for exhaustively describing them. Note that there is no fixed schema for this kind of information. Entities are described by a number of attribute-value pairs, where the attribute names and the potential values are user-defined (arbitrary) strings. The exact number of attribute-value pairs used to represent an entity is not preset, but may vary according to the information provided for each specific entity. Clients interact with the system through the OKKAM Services layer. Clients can be both human users and applications. There are two types of interaction. First, clients inquire about the identifier of an entity by providing a set of attributes that describes this entity. If the entity exists in the repository, the system returns its identifier. Second, clients may insert a new entity into the system. The system returns the newly assigned identifier. As shown in Figure 1, the end result is that all instances of the same entity (i.e., mentioned in different systems, ontologies, web pages, etc.) are assigned the same OKKAM identifier. Therefore, entity identity resolution becomes trivial, and is done without any further interactions with OKKAM.

4. Requirements for Entity Lifecycle Management

In this section we discuss the requirements for entity lifecycle management, in the context of a large, distributed repository of entities for the semantic web. We start by examining the representation of entities. Then, we present the issues arising for the processes of creating, modifying, versioning, and merging entities, as well as keeping track of information relevant to provenance and access control.

4.1 Entity Representation

The OKKAM system is designed to store arbitrary entities, referring to very diverse domains, including (but not limited to) persons, buildings, documents, and products. As such, the representation of the entities in the system has to be flexible in order to accommodate the requirements of all the different domains. Note that in OKKAM we are merely interested in assigning and managing unique ids for entities, which means that we do not need to represent all the known information about an entity, but rather only a small amount of data that can help us discriminate this entity from all the rest. Nevertheless, the set of data that needs to be stored can vary drastically among entities.

4.2 Creating and Modifying Entities

The creation of new entities can be initiated in one of the following two ways: an automatic OKKAM-ization process, or a manual interface-based method.
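To make the above interaction model concrete, the following minimal sketch (in Python) mimics the two client operations described in Section 3, over the schema-free attribute-value descriptions of Section 4.1: a lookup of an identifier by a set of descriptive attributes, and the insertion of a new entity. The class name, the exact-match rule, and the identifier format are illustrative assumptions, not part of the OKKAM design.

    # Minimal, illustrative stand-in for the OKKAM repository (all names are assumptions).
    # Entities are described by arbitrary attribute-value pairs, with no fixed schema.
    import uuid

    class ToyEntityRepository:
        def __init__(self):
            self.entities = {}  # identifier -> dict of attribute-value pairs

        def lookup(self, attributes):
            """Return the identifier of the first stored entity whose description
            contains all the given attribute-value pairs, or None if there is no match."""
            for entity_id, profile in self.entities.items():
                if all(profile.get(name) == value for name, value in attributes.items()):
                    return entity_id
            return None

        def insert(self, attributes):
            """Assign a fresh, globally unique identifier to a new entity description."""
            entity_id = "okkam:" + uuid.uuid4().hex
            self.entities[entity_id] = dict(attributes)
            return entity_id

    repo = ToyEntityRepository()
    eid = repo.insert({"name": "University of Trento", "type": "organization"})
    assert repo.lookup({"name": "University of Trento"}) == eid

In the real system, the exact-match lookup above would be replaced by the entity matching functionality discussed later, and the repository would be distributed and replicated; an online duplicate check, as discussed next, amounts to performing such a lookup before the insertion.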
When a document is parsed, entities are identified using an automatic entity identification process that is part of the OKKAM-ization process. When new entities are created, we have to check whether these entities already exist in the system. This check can be of two kinds: (a) offline check: the entities are added first and checked for duplication later; and (b) online check: the entities are checked for duplication before they are actually added to the database. If the entity is unique, it is assigned a unique identifier and stored in the system. After the entity has been added to the system, it is subject to updates. New attributes may be added to the description of the entity, or the values of existing attributes may change (e.g., when the information is outdated). If the description of an entity changes, we once again have to check whether this entity is a duplicate of another entity stored in the repository of the system.

4.3 Data Lineage

When entities are created and modified, we are interested in keeping track of information related to the provenance of the entity data stored in the repository [34]. This includes information related to the source of the corresponding data, as well as the owner and the creation and modification time of an entity's schema. The above information can potentially be very useful for other algorithms operating on the entities in the repository, such as matching and merging. For example, this kind of information can allow the matching algorithm to differentiate between attributes modified by humans (i.e., manually) and by computers (i.e., automatically), as well as between attributes that were changed in the recent past (i.e., up-to-date) and the distant past (i.e., out-of-date). Obviously, the knowledge of how and when each piece of data representing an entity has been edited may lead to different strategies for performing entity matching. The information on data lineage can refer to each entity as a whole, or be more fine-grained and refer to each individual attribute of every entity in the repository. The latter alternative results in a much more detailed view of how all the entity data was inserted in the repository, but also leads to higher space requirements and management cost. The proposed solution will have to take into account this tradeoff between flexibility and implementation cost.

4.4 Versioning of Entities

Regular updates of the attributes describing an entity lead to the creation of different versions of the same entity. In several cases, it is beneficial to store old information for future reference. For example, queries may ask for entities based on old attributes or attribute values. When a query is performed, the results returned to the user reflect the latest information, as collected after the update procedures. In many cases, once the information is updated, the old information is deleted. Of course, we have to define a limit (by date, or by version) with respect to the storage of old records. The user should be given the facility to query across different versions of the entity or, in some cases, to search the changes that have been performed on an entity over a certain period of time. The possibility of having different versions of the same entity raises some efficiency questions. The OKKAM system will be deemed efficient if it exhibits the following properties:
Storage: it is important to define the storage schemes and physical representation for the versions and the up-to-date information.
Query partitioning and redirection: defining whether the user wants to search the up-to-date records or the versions, and appropriately redirecting the query.
Search and matching: the search algorithms that fetch the query results for the user.

4.5 Merging Entities

Merging of entities may take place when two or more entities exceed some threshold of mutual similarity. Before making this decision, one has to carefully take into account the nature of the considered entities and the tools used to compute this similarity. As we have mentioned, entities are described by a set of attributes. The values of these attributes can be either numerical (e.g., year, age) or categorical (e.g., name, address). It is evident that the same proximity measure cannot be used in both cases. Therefore, suitable distance measures will need to be devised, which will also take into account the different characteristics of the various types of attributes. As more and more entities are added to the system, it may be the case that a single entity is represented by multiple instances in the repository. In such cases, we would like to employ techniques that detect such situations and merge the duplicated entities. These techniques will monitor the evolution of entities and their attributes and/or corresponding values, and will take the necessary actions either automatically or after interacting with the curator of the OKKAM repository. Ideally, we would like to be able to identify duplicates at the time when new entities are inserted in the system, using on-line algorithms. This assumes that (a) summaries exist that store the necessary information from existing entities, and (b) a proper threshold is used to guide the decision of whether the entities can be merged or not. We envision the use of clustering techniques to build attribute and attribute-value summaries, and to use them to decide whether a new or existing entity can be merged with one or more entities in the repository (a small illustrative sketch of such a threshold-based merge decision is given at the end of this section).

4.6 Access Control

An important issue that we have not discussed so far is the access control (AC) model. This is probably the most important aspect to take into account regarding the security of the OKKAM infrastructure from the users' point of view. The huge amount and the heterogeneity of the information stored require very scalable and flexible access control mechanisms in order to adequately protect each individual piece of information. Additionally, the information stored in OKKAM repositories must respect different (and sometimes very strong) privacy requirements. Furthermore, the information to be protected is not only the one returned by queries to the repositories, but also the information that can be derived (for instance, by means of data mining techniques) from the data stored. Addressing this situation requires the ability to define restrictions on the data that can be used for searching, and on the data that can be returned as a query result. For the access control scheme to be able to suit the needs of OKKAM, we have identified the following sub-requirements: flexibility; scalability; manageability; and provision of advanced features such as controlled delegation, owner-retained control, dynamic self-adaptability (i.e., content-based and context-aware access), anonymous access, and provisional authorization. All these characteristics pose important challenges for the protection and access control mechanisms.
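The following sketch illustrates the threshold-based merge decision of Section 4.5 under a set of simplifying assumptions: numerical attributes are compared with a simple numerical proximity, categorical attributes with a generic string similarity, and the two are averaged over the shared attributes. The function names, the similarity formulas, and the 0.9 threshold are arbitrary choices for illustration only, not the measures that OKKAM will eventually use.

    # Illustrative sketch of a merge decision over mixed attribute types (Section 4.5).
    from difflib import SequenceMatcher

    def attribute_similarity(v1, v2):
        # Numerical values: proximity decreases with their absolute difference.
        if isinstance(v1, (int, float)) and isinstance(v2, (int, float)):
            return 1.0 / (1.0 + abs(v1 - v2))
        # Categorical values: similarity of their string representations.
        return SequenceMatcher(None, str(v1), str(v2)).ratio()

    def entity_similarity(profile1, profile2):
        shared = set(profile1) & set(profile2)
        if not shared:
            return 0.0
        return sum(attribute_similarity(profile1[a], profile2[a]) for a in shared) / len(shared)

    def should_merge(profile1, profile2, threshold=0.9):
        """Flag two entity descriptions as merge candidates when their average
        per-attribute similarity exceeds the threshold."""
        return entity_similarity(profile1, profile2) >= threshold

In practice, the resulting merge would be performed either automatically or after interacting with the curator of the repository, as discussed above.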
Furthermore, we believe that the importance of automation and ease of management must not be underestimated: due to the large-scale and distributed nature of OKKAM, it is essential that security management is highly automated and does not require a lot of administration, as otherwise the system would rapidly become vulnerable due to the human limitations of the administrators and the high complexity of the system. In fact, this aspect, which is usually overlooked when designing a security infrastructure, is the most important source of vulnerabilities and attacks.

5. Proposed Approach

In what follows, we briefly outline the directions that we will pursue in order to address the issues related to entity lifecycle management that we identified in the previous section.

5.1 Entity Representation Conceptual Model

In OKKAM, we represent an entity E as a tuple ⟨P, M⟩.
We call the set P the profile of the entity, because it stores the unique identifier of the entity, and contains information that specifies the (semantic) type of the entity, as well as the relationship of the entity with other entities in OKKAM (if we know that it is identical to another entity stored in the system), or outside OKKAM (if there exists another id assigned to the entity by another system). The entity profile is also composed of a set of arbitrary, user-defined attributes that describe the entity. For example, if the entity is a person, possible attributes are name, date of birth, and nationality. Note that this set of attributes can be different for every entity, even for entities in the same domain. The set M refers to the metadata of the entity. These metadata are used to support complex algorithms for the other functionalities offered by OKKAM, such as entity matching, which uses the attributes in P to determine whether two entities are the same or not. Examples of the information that these metadata may carry are the creation time and usage patterns of the entity, and of each individual attribute in P. The information in M can also be used to store some simple information regarding the version history of the entity. Nevertheless, specialized data structures are needed in order to keep track of the history of changes in P.

5.2 Processing of Usage Patterns

The way the users access the system and interact with it may determine several aspects of entity lifecycle management. Consider the following example. Assume that many users search for an entity with attributes A1 and A2, and always select entity E1, which is the only entity in the repository that contains attribute A1 in its profile. If E1 does not contain A2 as well, we may choose to add it to the profile of E1, because many users refer to E1 using A2. Alternatively, assume that the query for entities with attributes A1 and A2 returns n entities, E1, E2, ..., En, that satisfy the search conditions, but the interested users always select entity Ek, 1 ≤ k ≤ n. In this case, we may choose to increase the importance of entity Ek, so that it ranks first for the particular query. In both of the above situations, we are interested in monitoring the usage patterns of the system, in order to obtain knowledge that can help the system perform better. By monitoring and analyzing the way users interact with OKKAM, we can determine which entities, or profile attributes, are relevant to specific queries or to certain contexts. This information can subsequently be used for updating the profile or the metadata of the entities, and ultimately for producing more relevant search results. The above kind of processing can be done automatically, in an online fashion, and be flexible enough to allow effective and efficient data analysis [5][41], as well as time-decaying representations that evolve with the arrival of new data [4]. Some of the results of the above processing will be part of the entity representation (stored as metadata). If necessary, additional data structures will be created as well, in order to fully exploit the knowledge hidden in the usage patterns.

5.3 Repository Adaptation

The results of the usage-pattern monitoring techniques that we discussed in the previous paragraphs are also relevant to the repository evolution process. One of the important aspects of this process is the entity merging operation, which takes place when we discover that two entities in the repository represent the same real-world entity.
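The following minimal sketch illustrates the conceptual model of Section 5.1 and the usage-pattern bookkeeping of Section 5.2. The field names, the selection counters, and the update function are illustrative assumptions, not the actual OKKAM data structures.

    # Illustrative sketch of an entity E = <P, M> (all names are assumptions).
    import time
    from dataclasses import dataclass, field

    @dataclass
    class Entity:
        # P: the profile -- unique identifier, semantic type, links to other identifiers,
        # and an arbitrary set of user-defined attribute-value pairs.
        okkam_id: str
        entity_type: str = ""
        alternative_ids: list = field(default_factory=list)
        attributes: dict = field(default_factory=dict)
        # M: the metadata -- creation time and simple per-attribute usage statistics.
        created_at: float = field(default_factory=time.time)
        attribute_usage: dict = field(default_factory=dict)
        selections: int = 0

    def record_selection(entity, query_attributes):
        """Update the metadata M when a user poses a query with the given attributes
        and selects this entity as the answer."""
        entity.selections += 1
        for name in query_attributes:
            entity.attribute_usage[name] = entity.attribute_usage.get(name, 0) + 1

Counters of this kind could, for instance, drive the decision to add a frequently used query attribute (such as A2 in the example above) to the profile of an entity, or to boost the ranking of an entity for a recurring query; a time-decaying variant would follow the amnesic representations of [4].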
As we already mentioned, when merging entities it is important to consider the type of values at hand, i.e., numerical vs. categorical, with their corresponding measures. When the merging of the entities takes place off-line, we plan to employ standard clustering techniques in order to group similar entities together. Although time may not be an issue in this case, we seek to perform the merging in as efficient a fashion as possible. Adopting techniques like BIRCH [32] for numerical data, and LIMBO [33] for categorical data, we may assess the similarity among entities very efficiently, as these algorithms promise complexity that grows linearly with the size of the input. The greatest advantage of employing techniques such as the ones mentioned above is that we get to summarize the input data set, i.e., the OKKAM entities, in summaries that retain as much of the initial information as possible. We plan to extend the construction of summaries so that (a) they can handle both numerical and categorical data at the same time, and (b) they can be used effectively with streaming data. The big challenge in this case is the maintenance of the summaries: we will employ policies, either automatic or user-driven, with which we can "forget" the summaries [4] of entities on the verge of extinction, or change the group (cluster) to which entities belong as their properties (attributes and attribute values) evolve over time.

5.4 Management of Access Control Policies

In current access control models, the complexity of management grows exponentially with the size and complexity of the contents. This is due to (i) the existence of different related artifacts, such as roles and groups, which capture semantics in a "hidden" way; (ii) the possibility of creating conflicting policies, which in turn introduces the need for conflict resolution strategies; (iii) the direct relation between the location where contents are stored and the policies; (iv) the lack of dynamism and context-awareness in the AC model; and (v) the lack of policy validation tools. In this situation, when the size, dynamism, and heterogeneity of the content set increase, it becomes impossible for administrators to understand the result of the policies they define, which inevitably leads to errors. The OKKAM security model, and in particular the access control model, is based on the use of semantic modeling of the contents, policies, and users. Semantic modeling brings important advances to access control, because it facilitates the creation of easy-to-understand policies that can be automatically assigned to contents. Additionally, it facilitates the interoperability of users' credentials and the development of mechanisms providing advanced access control features such as controlled delegation and user-retained control. Among the traditional access control models, Role-Based Access Control (RBAC) is the most popular and has received much attention from researchers, but despite such effort, there are situations that are not well handled by this model. In RBAC, a role must be defined for each different set of access criteria required by the group of resources controlled. This means that when the number of resources and the heterogeneity of access conditions are very high, the administration of RBAC systems becomes very complex. A more general approach is needed for these new environments, and in particular for OKKAM repositories.
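To make the contrast with role-based administration concrete, the following sketch expresses a policy as a predicate over the content of an entity and the context of the request, so that a single policy covers every entity whose content matches it, instead of requiring one role per combination of access conditions. This is a purely illustrative toy, not the SAC model discussed next; all names and the example rule are assumptions.

    # Purely illustrative content- and context-aware policy check (not the SAC model).
    def personal_data_policy(entity_attributes, context):
        """Example rule: entities marked as personal data are readable only for the
        'research' purpose and may never be exported; other entities are readable."""
        if entity_attributes.get("personal_data", False):
            return context.get("purpose") == "research" and not context.get("export", False)
        return True

    def is_allowed(entity_attributes, context, policies):
        # Access is granted only if every applicable policy accepts the request.
        return all(policy(entity_attributes, context) for policy in policies)

    # The same policy automatically applies to any entity whose content matches it,
    # with no need to define a new role for every group of resources.
    allowed = is_allowed({"personal_data": True}, {"purpose": "research"}, [personal_data_policy])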
The Semantic Access Control (SAC) model [30] provides the foundations for an appropriate solution to the aforementioned problems. Moreover, the flexibility of the SAC model allows it to easily simulate other models, such as MAC, DAC, or RBAC. Although it represents a good foundation for OKKAM, SAC lacks several features that we need. We are currently working on extending the SAC model to support controlled delegation and provisional access, and to incorporate privacy-preserving mechanisms like those provided by the Interactive Access Control model (iAccess) [31].

6. Conclusions

The web is quickly moving in the direction of adding semantics to online information, and using these semantics for enabling a vastly richer range of applications and user experiences. A major step, necessary for achieving this goal, is to have a way of uniquely identifying entities in the emerging semantic web. In this paper, we argue for an entity naming and management system, and we discuss the various problems that arise when considering entity lifecycle management in this context. We examine the special requirements of representing entities, creating new and modifying existing entities, detecting duplicates, merging and versioning entities, and finally controlling access to entities. Evidently, all the above functionalities are interrelated and affect each other. This work identifies these relationships, and presents the directions we are currently pursuing for realizing the functionalities outlined above.

References

[1] Fred Douglis, John Palmer, Elizabeth S. Richards, David Tao, William H. Tetzlaff, John M. Tracey, and Jian Yin, "Position: Short Object Lifetimes Require a Delete-Optimized Storage System," 11th ACM SIGOPS European Workshop, September 2004.
[2] Nigel Shadbolt, Tim Berners-Lee, Wendy Hall: The Semantic Web Revisited. IEEE Intelligent Systems 21(3): 96-101 (2006).
[3] Paolo Bouquet, Heiko Stoermer, and Daniel Giacomuzzi. OKKAM. In Proceedings of the WWW2007 Workshop i3: Identity, Identifiers and Identification, Banff, Canada, May 8, 2007, CEUR Workshop Proceedings, ISSN 1613-0073, May 2007.
[4] Themis Palpanas, Michail Vlachos, Eamonn J. Keogh, Dimitrios Gunopulos, Wagner Truppel: Online Amnesic Approximation of Streaming Time Series. ICDE 2004: 338-349.
[5] Themis Palpanas, Vana Kalogeraki, Dimitrios Gunopulos: Online Distribution Estimation for Streaming Data: Framework and Applications. SEBD 2007: 430-438.
[6] E. Pierre. Introduction to ILM: A Tutorial. http://www.snia.org/, 2004.
[7] C. Johnson. ILM Case Study: Complete Data Lifecycle Management Solution. http://www.snia.org/, 2004.
[8] Soumyadeb Mitra, Marianne Winslett, and Windsor Hsu: Query-based Partitioning of Documents and Indexes for Information Lifecycle Management. SIGMOD 2008.
[9] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, Vassilios S. Verykios: Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng. 19(1): 1-16 (2007).
[10] O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S.E. Whang, and J. Widom. Swoosh: A Generic Approach to Entity Resolution. To appear in VLDB Journal, 2008.
[11] S. Sarawagi, ed., Special Issue on Data Cleaning, IEEE Data Eng. Bull., vol. 23, no. 4, Dec. 2000.
[12] A. McCallum, Information Extraction: Distilling Structured Data from Unstructured Text, ACM Queue, vol. 3, no. 9, pp. 48-57, 2005.
[13] M.A. Jaro, "Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida," J. Am. Statistical Assoc., vol. 84, no. 406, pp. 414-420, June 1989.
[14] A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," J. Royal Statistical Soc., vol. B, no. 39, pp. 1-38, 1977.
[15] W.E. Winkler, "Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage," Technical Report Statistical Research Report Series RR93/12, US Bureau of the Census, Washington, D.C., 1993.
[16] N. Bansal, A. Blum, and S. Chawla, "Correlation Clustering," Machine Learning, vol. 56, nos. 1-3, pp. 89-113, 2004.
[17] W.W. Cohen and J. Richman, "Learning to Match and Cluster Large High-Dimensional Data Sets for Data Integration," Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '02), 2002.
[18] P. Singla and P. Domingos, "Multi-Relational Record Linkage," Proc. KDD-2004 Workshop on Multi-Relational Data Mining, pp. 31-48, 2004.
[19] S. Sarawagi and A. Bhamidipaty, "Interactive Deduplication Using Active Learning," Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '02), pp. 269-278, 2002.
[20] S. Tejada, C.A. Knoblock, and S. Minton, "Learning Domain-Independent String Transformation Weights for High Accuracy Object Identification," Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '02), 2002.
[21] W.W. Cohen, "Data Integration Using Similarity Joins and a Word-Based Information Representation Language," ACM Trans. Information Systems, vol. 18, no. 3, pp. 288-321, 2000.
[22] N. Koudas, A. Marathe, and D. Srivastava, "Flexible String Matching against Large Databases in Practice," Proc. 30th Int'l Conf. Very Large Databases (VLDB '04), pp. 1078-1086, 2004.
[23] D. Dey, S. Sarkar, and P. De, "Entity Matching in Heterogeneous Databases: A Distance Based Decision Model," Proc. 31st Ann. Hawaii Int'l Conf. System Sciences (HICSS '98), pp. 305-313, 1998.
[24] S. Guha, N. Koudas, A. Marathe, and D. Srivastava, "Merging the Results of Approximate Match Operations," Proc. 30th Int'l Conf. Very Large Databases (VLDB '04), pp. 636-647, 2004.
[25] T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk, "Mining Database Structure; or, How to Build a Data Quality Browser," Proc. 2002 ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '02), pp. 240-251, 2002.
[26] X. Dong, A. Halevy, and J. Madhavan, "Reference Reconciliation in Complex Information Spaces," Proc. 2005 ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '05), Baltimore, Maryland, June 14-16, 2005.
[27] E.-P. Lim, J. Srivastava, S. Prabhakar, and J. Richardson, "Entity Identification in Database Integration," Proc. Ninth IEEE Int'l Conf. Data Eng. (ICDE '93), pp. 294-301, 1993.
[28] M.A. Hernández and S.J. Stolfo, "Real-World Data Is Dirty: Data Cleansing and the Merge/Purge Problem," Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 9-37, Jan. 1998.
[29] H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita, "Declarative Data Cleaning: Language, Model, and Algorithms," Proc. 27th Int'l Conf. Very Large Databases (VLDB '01), pp. 371-380, 2001.
[30] M.I. Yagüe, A. Maña, J. López, and J.M. Troya, "Applying the Semantic Web Layers to Access Control," Proc. Int'l Workshop on Web Semantics, DEXA 2003, IEEE Computer Society Press, September 2003.
[31] H. Koshutanski and F. Massacci, "A Negotiation Scheme for Access Rights Establishment in Autonomic Communication," Journal of Network and Systems Management (JNSM), Springer-Verlag (to appear).
[32] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In SIGMOD, Montreal, QC, Canada, 4-6 June 1996.
[33] Periklis Andritsos, Panayiotis Tsaparas, Renée J. Miller, and Kenneth C. Sevcik. LIMBO: Scalable Clustering of Categorical Data. In EDBT, Heraklion, Greece, 14-18 March 2004.
[34] Wang Chiew Tan: Provenance in Databases: Past, Current, and Future. IEEE Data Eng. Bull. 30(4): 3-12, 2007.
[35] P. Buneman, S. Khanna, and W.C. Tan: On Propagation of Deletions and Annotations through Views. PODS 2002.
[36] D. Bhagwat, L. Chiticariu, W. Tan, and G. Vijayvargiya: An Annotation Management System for Relational Databases. VLDB Journal, Vol. 14, No. 4, Nov. 2005.
[37] Douglas S. Santry, Michael J. Feeley, Norman C. Hutchinson, Alistair C. Veitch, Ross W. Carton, Jacob Ofir: Deciding When to Forget in the Elephant File System. SOSP 1999: 110-123.
[38] Mallik Mahalingam, Chunqiang Tang, Zhichen Xu: Towards a Semantic, Deep Archival File System. FTDCS 2003: 115-121.
[39] Shu-Yao Chien, Vassilis J. Tsotras, Carlo Zaniolo: Efficient Schemes for Managing Multiversion XML Documents. VLDB J. 11(4): 332-353 (2002).
[40] Shu-Yao Chien, Vassilis J. Tsotras, Carlo Zaniolo, Donghui Zhang: Supporting Complex Queries on Multiversion XML Documents. ACM Trans. Internet Techn. 6(1): 53-84 (2006).
[41] Ferry Irawan Tantono, Nishad Manerikar, Themis Palpanas: Efficiently Discovering Recent Frequent Items in Data Streams. SSDBM 2008.