<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Scalable Matching of Industry Models - A Case Study</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Brian</forename><surname>Byrne</surname></persName>
							<email>byrneb@us.ibm.com</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">IBM Software Group</orgName>
								<orgName type="department" key="dep2">Information Management</orgName>
								<address>
									<settlement>Austin</settlement>
									<region>Texas</region>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Achille</forename><surname>Fokoue</surname></persName>
							<email>achille@us.ibm.com</email>
							<affiliation key="aff1">
								<orgName type="institution">IBM T. J. Watson Research Center</orgName>
								<address>
									<settlement>Hawthorne</settlement>
									<region>New York</region>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Aditya</forename><surname>Kalyanpur</surname></persName>
							<email>adityakal@us.ibm.com</email>
							<affiliation key="aff1">
								<orgName type="institution">IBM T. J. Watson Research Center</orgName>
								<address>
									<settlement>Hawthorne</settlement>
									<region>New York</region>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Kavitha</forename><surname>Srinivas</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">IBM T. J. Watson Research Center</orgName>
								<address>
									<settlement>Hawthorne</settlement>
									<region>New York</region>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Min</forename><surname>Wang</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">IBM T. J. Watson Research Center</orgName>
								<address>
									<settlement>Hawthorne</settlement>
									<region>New York</region>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Scalable Matching of Industry Models - A Case Study</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">03555809B0AE596AED4BB9A19D069A3D</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T12:30+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>Model matching</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>A recent approach to ontology matching has been to convert the problem into one of information retrieval. We explore the utility of this approach in matching model elements of real UML, ER, EMF and XML-Schema models, where the semantics of the models are less precisely defined. We validate this approach with domain experts for industry models drawn from very different domains (healthcare, insurance, and banking). We also observe that in the field, manually constructed mappings for such large industry models are prone to serious errors. We describe a novel tool that we developed to detect suspicious mappings and thereby quickly isolate these errors.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The world of business is centered around information. Every business deals with a myriad of different semantic expressions of key business information, and expends huge resources working around the inconsistencies, challenges and errors introduced by a variety of information models. Typically, these information models organize the data, services, business processes, or vocabulary of an enterprise, and they may exist in different forms such as ER models, UML models, thesauri, ontologies or XML schema. A common problem is that these varying models rarely share a common terminology, because they have emerged as a result of several inputs. In some cases, mergers of organizations operating in the same business result in different information models that express exactly the same concepts. In other cases, they may have been developed by different organizational units to express overlapping business concepts, but in slightly different domains.</p><p>Irrespective of how these models came about, today's business is faced with many different information models, and an increasing need to integrate across these models, through data integration, shared processes and rules, or reusable business services. In all of these cases, the ability to relate, or map, between different models is critical. However, both manual attempts to map different information models and the use of tools to automate mappings are very error prone in the real world. For humans, the errors have several sources:</p><p>-The size of these models (typically, these models have several thousand elements each).</p><p>-The fact that lexical names of model elements rarely match, or when they do match, it is for the wrong reasons (e.g., a document may have an endDate attribute, as does a claim, but the two endDate attributes reflect semantically different things, although they match at the lexical level).</p><p>
-Models often express concepts at different levels of granularity, and it may not always be apparent at what level the concept should be mapped. In many real-world mappings, we have observed a tendency for human analysts to map everything to generic concepts rather than more specific concepts. While these mappings are not necessarily invalid, they have limited utility in data integration scenarios, or in solution building.</p><p>The above points make it clear that there is a need for a tool to perform semi-automated model mapping, where the tool suggests appropriate mappings to a human analyst. Literature on ontology matching and alignment is clearly helpful in designing such a tool. Our approach to building such a tool is similar in spirit to the ideas implemented in Falcon-AO <ref type="bibr" target="#b0">[1]</ref>, <ref type="bibr" target="#b1">[2]</ref> and PRIOR <ref type="bibr" target="#b2">[3]</ref>, except that we adapted their techniques to UML, ER and EMF models. Matching or alignment across these models is different from matching ontologies, because the semantics of these models are poorly defined compared to those of ontologies. Perhaps for this reason, schema mapping approaches tend to focus mostly on lexical and structural analysis. However, existing schema mapping approaches scale very poorly to large models. Most analysts in the field therefore tend to revert to manual mapping, despite the availability of many schema mapping tools.</p><p>We observe, however, that in most industry models, the semantics of model elements is buried in documentation (either within the model, or in separate PDF, Excel or Word files). We therefore use techniques described by Falcon-AO and PRIOR to build a generic representation that allows us to exploit the structural and lexical information about model elements along with semantics in documentation. 
The basic idea, as described in PRIOR, is to convert the model mapping problem into a problem of information retrieval. Specifically, each model element is converted into a virtual document with a number of fields that encode the structural, lexical and semantic information associated with that model element. This information is in turn expressed as a term vector for a document. Mapping across model elements is then measured as a function of document similarity, i.e., the cosine similarity between two term vectors. This approach scales very well because we use the Apache Lucene text search engine for indexing and searching these virtual documents.</p><p>The novelty in our approach is that we also developed an engine to identify suspicious mappings produced either by our tool or by human analysts. We call this tool a Lint engine for model mappings, after the popular Lint tool which checks C programs for common software errors. The key observation that motivated our development of the Lint engine was that the human model mappings were shockingly poor for 3 of the 4 model mappings that were produced in real business scenarios. Common errors made by human analysts included the following:</p><p>-Mapping elements to overly general classes (equivalent to Thing).</p><p>-Mapping elements to subtypes even when the superclass was the appropriate match. As an example, Hierarchy was mapped to HierarchyType when Hierarchy existed in the other model.</p><p>-Mapping elements that were simply invalid or wrong.</p><p>We encoded 6 different heuristics to flag suspicious mappings, including heuristics that can identify common errors made by our own algorithm (e.g., the tendency to match across elements with duplicate, copied documentation). The Lint engine for model mappings is thus incorporated as a key filter for the semi-automated model mapping tool, to reduce the number of false positives that the human analyst needs to examine. 
A second use of our tool is of course to review the quality of human mappings in cases where the model mappings were produced manually.</p><p>Our key contributions are as follows:</p><p>-We describe how to extend existing techniques in ontology mapping to the problem of model mapping across UML, ER, and EMF models. Unlike existing approaches in schema mapping, we exploit semantic information embedded in documentation along with structural and lexical information to perform the mapping.</p><p>-We describe a novel Lint engine which can be used to review the quality of model mappings produced either by a human or by our algorithm.</p><p>-We perform a detailed evaluation of the semi-automated tool on 7 real-world model mappings. Four of the seven mappings had human mappings that were performed in a business context. We evaluated the Lint engine on these 4 mappings. The mappings involved large industry-specific framework models with thousands of elements in each model in the domains of healthcare, insurance, and banking, as well as customer models in the domains of healthcare and banking. Our approach has therefore been validated on mappings that were performed for real business scenarios. In all cases, we validated the output of both tools with domain experts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Ontology matching, or the related problem of schema matching, is a well-studied problem, with approaches too numerous to be outlined here in detail. We refer the reader instead to surveys of ontology or schema matching <ref type="bibr" target="#b3">[4]</ref><ref type="bibr" target="#b4">[5]</ref><ref type="bibr" target="#b5">[6]</ref>. A sampling of ontology matching approaches includes GLUE <ref type="bibr" target="#b6">[7]</ref>, PROMPT <ref type="bibr" target="#b7">[8]</ref>, HCONE-merge <ref type="bibr" target="#b8">[9]</ref> and SAMBO <ref type="bibr" target="#b9">[10]</ref>. Sample approaches to schema matching include Cupid <ref type="bibr" target="#b10">[11]</ref>, Artemis <ref type="bibr" target="#b11">[12]</ref>, and Clio <ref type="bibr" target="#b12">[13]</ref><ref type="bibr" target="#b13">[14]</ref><ref type="bibr" target="#b14">[15]</ref><ref type="bibr" target="#b15">[16]</ref>. Our work is most closely related to Falcon-AO <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref> and PRIOR <ref type="bibr" target="#b2">[3]</ref>, two recent approaches to ontology matching that combine some of the advantages of earlier approaches, such as linguistic and structural matching, within an information-retrieval approach, and that seem well positioned to be extended to address matching in shallow-structured models such as UML, ER and EMF models. Both Falcon-AO and PRIOR have been compared with existing systems in OAEI 2007 and appear to scale well in terms of performance. Because our work addresses matching across very large UML, ER and EMF data models (about 5000 elements), we adapted the approaches described in Falcon-AO and PRIOR to these models.</p><p>Matching or alignment across these models is different from matching ontologies, because the semantics of these models are poorly defined compared to those of ontologies. 
More importantly, we report the results of applying these techniques to 7 real ontology matching problems in the field, and describe scenarios where the approach is most effective.</p><p>3 Overall Approach</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Matching algorithm</head><p>Casting the matching problem to an IR problem. Similar to approaches outlined in Falcon-AO <ref type="bibr" target="#b0">[1]</ref>, <ref type="bibr" target="#b1">[2]</ref> and PRIOR <ref type="bibr" target="#b2">[3]</ref>, a fundamental principle in our approach is to cast the problem of model matching into a classical Information Retrieval problem. Model elements (e.g. attributes or classes) from various modeling representations (e.g. XML Schema, UML, EMF, ER) are transformed into virtual documents. A virtual document consists of one or more fields capturing the structural, lexical and semantic information associated with the corresponding model element.</p><p>A Vector Space Model (VSM) <ref type="bibr" target="#b16">[17]</ref> is then adopted: each field F of a document is represented as a vector in an N_F-dimensional space, with N_F denoting the number of distinct words in field F of all documents. Traditional TF-IDF (Term Frequency - Inverse Document Frequency) values are used as the values of the coordinates associated with terms. 
Formally, let D_F denote the vector associated with the field F of a virtual document D, and let D_F[i] denote the ith coordinate of that vector:</p><formula xml:id="formula_0">D_F[i] = tf_i \cdot idf_i <label>(1)</label></formula><formula xml:id="formula_1">tf_i = |t_i| / N_F <label>(2)</label></formula><formula xml:id="formula_2">idf_i = 1 + \log(N_D / d_i) <label>(3)</label></formula><p>where</p><p>-|t_i| represents the number of occurrences, in the field F of document D, of the term t corresponding to the ith coordinate of the vector D_F,</p><p>-N_D corresponds to the total number of documents, and d_i is the number of documents in which t appears at least once in F.</p><p>The similarity sim(A, B) between two model elements A and B is computed as the weighted mean of the cosine of the angle formed by their field vectors.</p><p>Formally, let D and D' be the virtual documents corresponding to A and B, respectively. Let q be the number of distinct field names in all documents.</p><formula xml:id="formula_3">sim(A, B) = \frac{\sum_{k=1}^{q} \alpha_k \cdot cosine(D_{F_k}, D'_{F_k})}{\sum_{k=1}^{q} \alpha_k} \quad (4) \\ cosine(D_{F_k}, D'_{F_k}) = \frac{\sum_{i=1}^{N_{F_k}} D_{F_k}[i] \cdot D'_{F_k}[i]}{|D_{F_k}| \cdot |D'_{F_k}|} <label>(5)</label></formula><formula xml:id="formula_4">|D_F| = \sqrt{\sum_{i=1}^{N_F} (D_F[i])^2} <label>(6)</label></formula><p>where α_k is the weight associated with the field F_k, which indicates the relative importance of information encoded by that field.</p><p>In our Lucene<ref type="foot" target="#foot_0">3</ref>-based implementation, standard transformations such as stemming/lemmatization, stop-word removal and lowercasing are performed before building document vectors. In addition to these standard transformations, we also convert camel case words (e.g. "firstName") into the corresponding group of space-separated words (e.g. "first name").</p><p>Transforming model elements into virtual documents. A key step in our approach is the transformation of elements of a data model into virtual documents. 
For simplicity of presentation, we assume that the data model is encoded as a UML Class diagram<ref type="foot" target="#foot_1">4</ref>. The input of the transformation is a model element (e.g. attribute, reference/association, or class). The output is a virtual document with the following fields:</p><p>-name. This field consists of the name of the input element.</p><p>-documentation. This field contains the documentation of the input model element.</p><p>-containerClass. For attributes, references and associations, this field contains the name and documentation of their containing class.</p><p>-path. This field contains the path from the model root package to the model element (e.g. for an attribute "bar" of the class "foo" located in the package "example", the path is /example/foo/bar).</p><p>-body. This field is made of the union of terms in all fields except path.</p><p>While the first two fields encode only lexical information, the next two fields (containerClass and path) capture some of the structure of the modeling elements. In our implementation, when the models to be compared appear very similar, which translates to a very large number of discovered mappings, we typically adjust the weights of the "containerClass" and "path" fields upwards empirically, to give more importance to structural similarity.</p><p>For the simple UML model shown in Figure <ref type="figure">1</ref>, 5 virtual documents will be created, among which are the following:</p><p>1. Virtual document corresponding to the class "Place":</p><p>-name: "Place"</p><p>-documentation: "a bounded area defined by nature by an external authority such as a government or for an internal business purpose used to identify a location in space that is not a structured address for example country city continent postal area or risk area a place may also be used to define a logical place in a computer or telephone network e.g. laboratory e.g. hospital e.g. home e.g. doctor's office e.g. 
clinic"</p><p>-containerClass: ""</p><p>-path: "/simple test model/place"</p><p>-body: "place, a bounded area defined by nature by an external authority such as a government or for an internal business purpose used to identify a location in space that is not a structured address for example country city continent postal area or risk area a place may also be used to define a logical place in a computer or telephone network e.g. laboratory e.g. hospital e.g. home e.g. doctor's office e.g. clinic"</p><p>2. Virtual document corresponding to the attribute "Place id":</p><p>-name: "place id"</p><p>-documentation: "the unique identifier of a place"</p><p>-containerClass: "place, a bounded area defined by nature by an external authority such as a government or for an internal business purpose used to identify a location in space that is not a structured address for example country city continent postal area or risk area a place may also be used to define a logical place in a computer or telephone network e.g. laboratory e.g. hospital e.g. home e.g. doctor's office e.g. clinic"</p><p>-path: "/simple test model/place/place id"</p><p>-body: "place id, the unique identifier of a place, place, a bounded area defined by nature by an external authority such as a government or for an internal business purpose used to identify a location in space that is not a structured address for example country city continent postal area or risk area a place may also be used to define a logical place in a computer or telephone network e.g. laboratory e.g. hospital e.g. home e.g. doctor's office e.g. clinic"</p><p>Adding lexical and semantic similarity between terms. The cosine scoring scheme presented above (<ref type="formula">4</ref>) is intolerant of even minor lexical or semantic variations in terms. 
For example, the cosine score computed using equation (<ref type="formula">4</ref>) for the document vectors (gender: 1, sex: 0) and (gender: 0, sex: 1) will be 0, although "gender" mentioned in the first document is clearly semantically related to "sex" appearing in the second document. To address this limitation, we modify the initial vector to add, for a given term t, the indirect contributions of terms related to t as measured by a term similarity metric. Formally, instead of using D_{F_k} (resp. D'_{F_k}) in equation (<ref type="formula">4</ref>), we use the modified document vector (marked here with a tilde)</p><formula xml:id="formula_5">\tilde{D}_{F_k} \text{ whose coordinates } \tilde{D}_{F_k}[i], \text{ for } 1 \le i \le N_{F_k},</formula><p>are defined as follows:</p><formula xml:id="formula_6">\tilde{D}_{F_k}[i] = D_{F_k}[i] + \beta_i \cdot \sum_{j=1,\ j \neq i}^{N_{F_k}} termSim(t_i, t_j) \cdot D_{F_k}[j] <label>(7)</label></formula><formula xml:id="formula_7">\beta_i = \begin{cases} 0 &amp; \text{if, for all } j \neq i,\ D_{F_k}[j] = 0, \\ \dfrac{1}{\sum_{j=1,\ j \neq i,\ D_{F_k}[j] \neq 0}^{N_{F_k}} 1} &amp; \text{otherwise} \end{cases} \quad (8)</formula><p>where</p><p>-termSim is a term similarity measure such as the Jaccard or Levenshtein similarity measure (for lexical similarity), a semantic similarity measure based on WordNet <ref type="bibr" target="#b17">[18]</ref> [19], or a combination of similarity measures. The product termSim(t_i, t_j) * D_{F_k}[j] in (<ref type="formula" target="#formula_6">7</ref>) measures the contribution to the term t_i of the potentially related term t_j.</p><p>-β_i is the weight assigned to indirect contributions of related terms.</p><p>For efficiency, when comparing two document vectors, we add to the modified document vectors only the contributions of terms that have a non-zero coordinate in at least one of the two vectors.</p><p>Equation (<ref type="formula" target="#formula_6">7</ref>) applied to the previous example transforms (gender: 1, sex: 0) to (gender: 1, sex: termSim("sex", "gender")) and (gender: 0, sex: 1) to (gender: termSim("gender", "sex"), sex: 1). 
Assuming that termSim("sex", "gender"), which is the same as termSim("gender", "sex"), is not equal to zero, the cosine score of the transformed vectors will be non-zero, and will reflect the similarity between the terms "gender" and "sex".</p><p>For the results reported in the evaluation section, only the Levenshtein similarity measure was used. Using semantic similarity measures based on WordNet significantly increased the running time of the algorithm while yielding only a marginal improvement in the quality of the resulting mappings. The running time of semantic similarity measures based on WordNet was still unacceptable after restricting related terms to synonyms and hyponyms.</p><p>Our approach provides a tighter integration of the cosine scoring scheme and a term similarity measure. In previous work, e.g. Falcon-AO <ref type="bibr" target="#b1">[2]</ref>, the application of the term similarity measure (the Levenshtein measure in Falcon-AO) is limited to names of model elements, and the final score is simply a linear combination of the cosine score and the measure of similarity between model element names.</p><p>To evaluate the model matching algorithm, we accumulated industry models and customer data models from IBM architects who regularly build solutions for customers. The specific model comparisons we chose were ones that IBM architects need mapped in the field. In 4 of the 7 model matching comparisons, the matching had been performed manually by IBM solutions teams. We tried to use these as a 'gold standard' to evaluate the model matching algorithm, but unfortunately found that in 3 of the 4 cases, the quality of the manual model matching was exceedingly poor. 
We address this issue with a tool to assess matching quality in the next section.</p><p>As shown in Table <ref type="table" target="#tab_0">1</ref>, the industry models we used in the comparisons included BDW (a logical data model for financial services), HPDM (a logical data model for healthcare), MDM (a model for IBM's solution for master data management), RDWM (a model for warehouse solutions for retail organizations), and IAA (a model for insurance). Model A in the table is a customer ER model in the healthcare solutions space, model B is a customer logical data model in financial services, and model C is a customer logical data model in retail. To evaluate our model matching results, we had two IBM architects assess the precision of the best possible match produced by our algorithm. Manual evaluation of the matches was performed on sample sizes of 100 in 5 of the 7 cases (all cases except the IAA-BDW and A-HPDM comparisons). For IAA-BDW, we used a sample size of 50 because the algorithm produced fewer than 100 matches. For A-HPDM, we relied on previously created manual mappings to evaluate both precision and recall (recall was 25%). The sizes of these models varied from 300 elements to 5000 elements.</p><p>We make two observations about our results:</p><p>-(a) The results show a great deal of variability, ranging from 100% precision in the top 100 matches to 52% precision. This reflects the degree to which the models shared a common lineage or common vocabulary in their development. For example, RDWM was actually derived from BDW, and this is clearly reflected in the model matching results. IAA and BDW target different industries (and therefore do not have much in common), and this is a scenario where the algorithm tends to make more errors. 
We should point out that although IAA and BDW target different industries (insurance and banking, respectively), there is a real business need for mapping common or overlapping concepts across these disparate models, so the matching exercise is not a purely academic one.</p><p>-(b) Even in cases where the precision (or recall) was low, the IBM architects attested to the utility of such a semi-automated approach to model matching, because their current process is entirely manual, tedious and error prone. None of the model mapping tools currently available to them provides results that are usable or verifiable.</p></div>
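<div xmlns="http://www.tei-c.org/ns/1.0"><p>The scoring scheme of equations (1) to (8) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the Lucene-based implementation used for the reported results: virtual documents are plain dictionaries, stemming and stop-word removal are omitted, all function names are hypothetical, and the Levenshtein similarity is assumed to be one minus the edit distance divided by the length of the longer term.</p><p>
```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split camel case ("firstName" -> "first name"), lowercase, keep alphanumeric runs."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return re.findall(r"[a-z0-9]+", spaced.lower())


def field_vectors(docs, field):
    """TF-IDF vector of one field for every virtual document, equations (1) to (3)."""
    token_lists = [tokenize(doc.get(field, "")) for doc in docs]
    n_f = len({t for tokens in token_lists for t in tokens})  # N_F
    n_d = len(docs)  # N_D
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))  # d_i
    return [
        {t: (c / n_f) * (1 + math.log(n_d / doc_freq[t])) for t, c in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(u, v):
    """Equations (5) and (6) on sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity(docs, a, b, weights):
    """Equation (4): weighted mean of the per-field cosines of documents a and b."""
    total = 0.0
    for field, alpha in weights.items():
        vectors = field_vectors(docs, field)
        total += alpha * cosine(vectors[a], vectors[b])
    return total / sum(weights.values())


def levenshtein(s, t):
    """Classic edit distance computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def term_sim(s, t):
    """Assumed normalization: 1 - distance / length of the longer term."""
    longest = max(len(s), len(t))
    return 1.0 - levenshtein(s, t) / longest if longest else 1.0


def expand(vec, terms):
    """Equations (7) and (8): add the averaged contributions of the other non-zero terms."""
    out = {}
    for ti in terms:
        others = [tj for tj in vec if tj != ti and vec[tj] != 0]
        beta = 1.0 / len(others) if others else 0.0
        out[ti] = vec.get(ti, 0.0) + beta * sum(term_sim(ti, tj) * vec[tj] for tj in others)
    return out


# Hypothetical virtual documents with only two of the fields described above.
docs = [
    {"name": "firstName", "documentation": "the given name of a person"},
    {"name": "first_name", "documentation": "given name of the person"},
    {"name": "endDate", "documentation": "date on which a claim is closed"},
]
weights = {"name": 2.0, "documentation": 1.0}  # illustrative field weights
print(similarity(docs, 0, 1, weights))
print(similarity(docs, 0, 2, weights))

terms = {"gender", "sex"}
print(cosine(expand({"gender": 1.0}, terms), expand({"sex": 1.0}, terms)))
```
</p><p>In this sketch the first pair of documents scores higher than the second because their name fields tokenize to the same terms, and the expanded gender/sex vectors obtain a non-zero cosine where the unexpanded ones score 0. Recomputing the field vectors on every call keeps the code short; an indexed implementation would compute them once.</p></div>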
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Lint Engine</head><p>We turn now to another aspect of our work, which is to measure the quality of ontology matching in the field. As mentioned earlier, we initially started our work with the hope of using manual matchings as a gold standard to measure the output of our matching algorithm, but were surprised to find a rather large number of errors in the manually generated model mappings. Many of these errors were presumably due to the ad hoc nature of the manual mapping process, which leads to poor transcription of names (e.g., changes in spaces or appended package names when writing mapping results in a separate spreadsheet) and to the specification of new classes, attributes or relationships to make up a mapping when the elements did not exist in the original models. Also, there were cases in which mappings were made to an absurdly generic class (such as Thing), which rendered them meaningless.</p><p>In order to deal with the above issues, and also to improve the accuracy of our mapping tool, we decided to write a Lint Engine to detect suspicious mappings. The engine runs through a set of suspicious mapping patterns, with each pattern being assigned a severity rating and a user-friendly explanation, both specified by the domain expert. We have currently implemented the following six mapping patterns based on discussions with a domain expert:</p><p>-Element not found: The pattern detects mappings where one or more of the elements involved do not exist in any of the models. This pattern is assigned a high severity since it indicates something clearly suspicious or wrong.</p><p>-Exact name mismatches: Detects mappings where a model element with an exact lexical match was not returned. This does not necessarily indicate an incorrect mapping, but it does alert the user to a potentially interesting alternative that may have been missed. 
-Duplicate documentation: Detects mappings where the exact same documentation is provided for both elements involved in the mapping. This may arise when models or portions of models are copied and pasted across.</p><p>-Many-to-1 or 1-to-Many: Detects cases where a single element in one model is mapped to a suspiciously large number of elements in another model. As mentioned earlier, these typically denote mappings to an absurdly generic class/relation.</p><p>-Class-Attribute proliferations: Detects cases when the attributes/relations of a single class are mapped to attributes/relations of several different classes in the other model. What makes this case suspicious is that model mappings are a means to an end, typically used to specify instance transformations. Transformations can become extremely complex when class-attribute proliferations exist.</p><p>-Mapping without documentation: Detects cases where all the elements involved in the mapping have no associated documentation. Such mappings can arise because lexical and structural information plays a role in the mapping, but the lack of documentation points to a potentially weaker match.</p><p>We applied our Lint engine to the manual mappings to see if it could reveal in more detail the defects we had observed. The results are summarized in Tables 2 to 5, which cover the B-BDW, BDW-MDM, A-HPDM and MDM-HPDM manual mappings, respectively.</p><p>The results are quite shocking: in the BDW-MDM case, for example, all 702 mappings specified an element that did not exist in either of the two models. The only explanation for this result is that mapping exercises, typically performed in tools such as Excel, are highly inaccurate; in particular, significant approximation of the source and target elements is pervasive. Another point to note is that human analysts tend to take the shortcut of mapping at a generic level, and this practice seems to be quite pervasive, as such mappings were discovered in almost all the cases. 
Finally, the lack or duplication of documentation can be identified in many ways (e.g. with products such as SoDA from Rational<ref type="foot" target="#foot_2">5</ref>), but surfacing it during mapping validation is very helpful: it gives an estimate of the degree of confidence in the foundation of the mapping, namely the understanding of the elements being mapped.</p><p>The results were analyzed in detail by a domain expert, who verified that the accuracy and usefulness of the suspicious mappings was very high (in the B-BDW case, only 1 suspicious mapping produced by Lint was actually correct). The fact that the Lint engine found fewer than about 1 valid mapping for every 10 suspicious ones is an indication of the inefficiency of manual mapping practices. What the engine managed to do effectively is to filter, from a huge pool of mappings, the small subset that needs human attention, while hinting to the user what may be wrong by grouping the suspicious mappings under different categories.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Simple Model Example</figDesc></figure>
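<div xmlns="http://www.tei-c.org/ns/1.0"><p>Four of the six patterns described for the Lint Engine can be expressed as simple checks over a list of mappings. The Python sketch below is a hypothetical illustration, not the engine evaluated above: each model is assumed to be a dictionary from element names to documentation strings, the function name lint and the fan_out_limit threshold for a "suspiciously large" number of mappings are invented for the example, and severity ratings and explanations are left out.</p><p>
```python
from collections import Counter


def lint(mappings, source_docs, target_docs, fan_out_limit=3):
    """Return {mapping: [pattern names]} for mappings matched by at least one pattern.

    source_docs and target_docs map element names to documentation strings;
    fan_out_limit is a hypothetical threshold for "suspiciously large".
    """
    source_uses = Counter(s for s, _ in mappings)
    target_uses = Counter(t for _, t in mappings)
    report = {}
    for source, target in mappings:
        flags = []
        if source not in source_docs or target not in target_docs:
            # Element not found: a mapped name is absent from its model.
            flags.append("element not found")
        else:
            doc_s, doc_t = source_docs[source], target_docs[target]
            if not doc_s and not doc_t:
                flags.append("mapping without documentation")
            elif doc_s == doc_t:
                flags.append("duplicate documentation")
        if source_uses[source] > fan_out_limit or target_uses[target] > fan_out_limit:
            # One element takes part in too many mappings (e.g. a generic Thing).
            flags.append("many-to-1 or 1-to-many")
        if flags:
            report[(source, target)] = flags
    return report


# Hypothetical toy models and manual mappings.
source_docs = {
    "Hierarchy": "a tree of organization units",
    "Claim.endDate": "",
    "Policy": "an insurance contract",
}
target_docs = {
    "Hierarchy": "a tree of organization units",
    "Thing": "",
    "Agreement": "a binding arrangement",
}
mappings = [
    ("Hierarchy", "Hierarchy"),
    ("Claim.endDate", "Thing"),
    ("Policy", "Agreement"),
    ("Customer", "Thing"),
]
report = lint(mappings, source_docs, target_docs, fan_out_limit=1)
for mapping, flags in report.items():
    print(mapping, flags)
```
</p><p>Grouping the flags per mapping mirrors the way the engine presents suspicious mappings under different categories; the exact-name and class-attribute proliferation patterns would need access to the model structure and are therefore not sketched here.</p></div>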
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Model matching results</figDesc><table><row><cell>Models Compared</cell><cell>Number of matches</cell><cell>Precision</cell></row><row><cell>A-HPDM</cell><cell>43</cell><cell>67%</cell></row><row><cell>B-BDW</cell><cell>197</cell><cell>74%</cell></row><row><cell>MDM-BDW</cell><cell>149</cell><cell>71%</cell></row><row><cell>MDM-HPDM</cell><cell>324</cell><cell>54%</cell></row><row><cell>RDWM-BDW</cell><cell>3632</cell><cell>100%</cell></row><row><cell>C-BDW</cell><cell>3263</cell><cell>96%</cell></row><row><cell>IAA-BDW</cell><cell>69</cell><cell>52%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 .</head><label>2</label><figDesc>Evaluation of B-BDW manual mappings using our Lint Engine</figDesc><table><row><cell>Total number of mappings</cell><cell>306</cell></row><row><cell>Total number of suspicious mappings</cell><cell>151 (51 %)</cell></row><row><cell>One To Many Mappings</cell><cell>143 (46 %)</cell></row><row><cell>Mapping Without Documentation</cell><cell>40 (25 %)</cell></row><row><cell>Exact Name Not Match</cell><cell>13 (8 %)</cell></row><row><cell>Duplicate Documentation</cell><cell>2 (1 %)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 .</head><label>3</label><figDesc>Evaluation of BDW-MDM manual mappings using our Lint Engine</figDesc><table><row><cell>Total number of mappings</cell><cell>702</cell></row><row><cell>Total number of suspicious mappings</cell><cell>702 (100 %)</cell></row><row><cell>Name Not Found in Models</cell><cell>702 (100 %)</cell></row><row><cell>Mapping Without Documentation</cell><cell>702 (100 %)</cell></row><row><cell>Exact Name Not Match</cell><cell>30 (4 %)</cell></row><row><cell>One To Many Mappings</cell><cell>312 (44 %)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4 .</head><label>4</label><figDesc>Evaluation of A-HPDM manual mappings using our Lint Engine</figDesc><table><row><cell>Total number of mappings</cell><cell>117</cell></row><row><cell>Total number of suspicious mappings</cell><cell>95 (81 %)</cell></row><row><cell>Mapping Without Documentation</cell><cell>95 (100 %)</cell></row><row><cell>One To Many Mappings</cell><cell>10 (10 %)</cell></row><row><cell>Duplicate Documentation Checker</cell><cell>9 (9 %)</cell></row><row><cell>Name Not Found in Models</cell><cell>2 (2 %)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5 .</head><label>5</label><figDesc>Evaluation of MDM-HPDM manual mappings using our Lint Engine</figDesc><table><row><cell>Total number of mappings</cell><cell>748</cell></row><row><cell>Total number of suspicious mappings</cell><cell>748 (100 %)</cell></row><row><cell>Mapping Without Documentation</cell><cell>741 (99 %)</cell></row><row><cell>Name Not Found in Models</cell><cell>459 (61 %)</cell></row><row><cell>Class Attribute Mapping Proliferation</cell><cell>472 (63 %)</cell></row><row><cell>Duplicate Documentation Checker</cell><cell>378 (50 %)</cell></row><row><cell>One To Many Mappings</cell><cell>321 (42 %)</cell></row><row><cell>Exact Name Not Match</cell><cell>33 (4 %)</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">http://lucene.apache.org/java/docs/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1">Our implementation is able to handle more data model representations, including XML Schemas, ER diagrams, and EMF ECore models.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_2">http://www-01.ibm.com/software/awdtools/soda/index.html</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Falcon-AO: Aligning ontologies with Falcon</title>
		<author>
			<persName><forename type="first">N</forename><surname>Jian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of K-CAP Workshop on Integrating Ontologies</title>
				<meeting>K-CAP Workshop on Integrating Ontologies</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Constructing virtual documents for ontology matching</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Qu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Cheng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 15th international conference on World Wide Web</title>
				<meeting>the 15th international conference on World Wide Web<address><addrLine>Edinburgh, UK</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A profile propagation and information retrieval based ontology mapping approach</title>
		<author>
			<persName><forename type="first">M</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Spring</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 3rd International Conference on Semantics, Knowledge and Grid (research track)</title>
				<meeting>the 3rd International Conference on Semantics, Knowledge and Grid (research track)<address><addrLine>Xian, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Semantic integration: a survey of ontology-based approaches</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">F</forename><surname>Noy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIGMOD Record</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="65" to="70" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Schema mappings, data exchange, and metadata management</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">G</forename><surname>Kolaitis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 24th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems</title>
				<meeting>the 24th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems<address><addrLine>Baltimore, Maryland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Model management 2.0: manipulating richer mappings</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">A</forename><surname>Bernstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Melnik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACM SIGMOD International Conference on Management of Data</title>
				<meeting>the ACM SIGMOD International Conference on Management of Data<address><addrLine>Beijing, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Learning to match ontologies on the semantic web</title>
		<author>
			<persName><forename type="first">A</forename><surname>Doan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Madhavan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Dhamankar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Domingos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Halevy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The VLDB Journal</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="303" to="319" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">PROMPT: Algorithm and tool for automated ontology merging and alignment</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">F</forename><surname>Noy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Musen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence</title>
				<meeting>the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence<address><addrLine>Austin, Texas, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Capturing semantics towards automatic coordination of domain ontologies</title>
		<author>
			<persName><forename type="first">K</forename><surname>Kotis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Vouros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Stergiou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">AIMSA</title>
		<imprint>
			<biblScope unit="page" from="22" to="32" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">SAMBO - a system for aligning and merging biomedical ontologies</title>
		<author>
			<persName><forename type="first">P</forename><surname>Lambrix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Tan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Web Semant</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="196" to="206" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Generic schema matching with Cupid</title>
		<author>
			<persName><forename type="first">J</forename><surname>Madhavan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">A</forename><surname>Bernstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Rahm</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">VLDB &apos;01: Proceedings of the 27th International Conference on Very Large Data Bases</title>
				<meeting><address><addrLine>San Francisco, CA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Morgan Kaufmann Publishers Inc</publisher>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="49" to="58" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Global viewing of heterogeneous data sources</title>
		<author>
			<persName><forename type="first">S</forename><surname>Castano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>De Antonellis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>De Capitani di Vimercati</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Trans. on Knowl. and Data Eng</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="277" to="297" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Schema mapping as query discovery</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Miller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">M</forename><surname>Haas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Hernández</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of 26th International Conference on Very Large Data Bases</title>
				<meeting>26th International Conference on Very Large Data Bases<address><addrLine>Cairo, Egypt</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">The Clio project: Managing heterogeneity</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Miller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Hernández</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">M</forename><surname>Haas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">L</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">T H</forename><surname>Ho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Fagin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Popa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIGMOD Record</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="issue">1</biblScope>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Model management and schema mappings: Theory and practice</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">A</forename><surname>Bernstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ho</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 33rd International Conference on Very Large Data Bases</title>
				<meeting>the 33rd International Conference on Very Large Data Bases<address><addrLine>Vienna, Austria</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Clio: A schema mapping tool for information integration</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Hernández</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Popa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Naumann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 8th International Symposium on Parallel Architectures, Algorithms and Networks</title>
				<meeting>the 8th International Symposium on Parallel Architectures, Algorithms and Networks<address><addrLine>Las Vegas, Nevada, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">A critical analysis of vector space model for information retrieval</title>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">V</forename><surname>Raghavan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K M</forename><surname>Wong</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Society for Information Science</title>
		<imprint>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="279" to="287" />
			<date type="published" when="1986">1986</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Semantic similarity based on corpus statistics and lexical taxonomy</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">J</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">W</forename><surname>Conrath</surname></persName>
		</author>
		<idno>CoRR cmp-lg/9709008</idno>
		<imprint>
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">An information-theoretic definition of similarity</title>
		<author>
			<persName><forename type="first">D</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICML &apos;98: Proceedings of the Fifteenth International Conference on Machine Learning</title>
				<meeting><address><addrLine>San Francisco, CA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Morgan Kaufmann Publishers Inc</publisher>
			<date type="published" when="1998">1998</date>
			<biblScope unit="page" from="296" to="304" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
