<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Context-based Approach for Complex Semantic Matching</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Youssef</forename><surname>Bououlid</surname></persName>
							<email>bououlii@iro.umontreal.ca</email>
							<affiliation key="aff0">
								<orgName type="department">DIRO</orgName>
								<orgName type="institution">University of Montreal</orgName>
								<address>
									<settlement>Montreal</settlement>
									<region>QC</region>
									<country key="CA">Canada</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Julie</forename><surname>Vachon</surname></persName>
							<email>vachon@iro.umontreal.ca</email>
							<affiliation key="aff0">
								<orgName type="department">DIRO</orgName>
								<orgName type="institution">University of Montreal</orgName>
								<address>
									<settlement>Montreal</settlement>
									<region>QC</region>
									<country key="CA">Canada</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A Context-based Approach for Complex Semantic Matching</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">A4E7CDD86D1042BDFFD3188E9CBF5F55</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T06:56+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>complex matching</term>
					<term>semantic matching</term>
					<term>context analysis</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Semantic matching<ref type="foot" target="#foot_0">1</ref> is a fundamental step in implementing data sharing applications. Most systems automating this task, however, limit themselves to finding simple (one-to-one) matches. Complex (many-to-many) matching is a far more difficult problem because the search space of concept combinations is often very large. This article presents Indigo, a system that discovers complex matches in two steps. First, it semantically enriches data sources with complex concepts extracted from their development artifacts. It then aligns the data sources thus enhanced.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Semantic matching consists in finding semantic correspondences between heterogeneous sources. When done manually, this task can prove very tedious and error-prone <ref type="bibr" target="#b2">[3]</ref>. To date, many systems <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b3">4]</ref> have addressed the automation of this task. However, most solutions confine themselves to simple (one-to-one) matching, although complex (many-to-many) matching is frequently required in practice. Linking the concept address to the concatenation of the concepts street + city is a typical example of a complex match. The scarcity of work on complex matching can be explained by the fact that finding complex matches is much harder than discovering simple ones, because the space of possible concept combinations is often very large. To cope with this challenging problem, this article introduces the solution implemented by our matching system Indigo<ref type="foot" target="#foot_1">2</ref>. Indigo avoids searching such large spaces of possible concept combinations. Instead, it implements an innovative solution based on the exploration of the data sources' informational context, which can contain very useful semantic hints about concept combinations. The informational context of a data source is composed of all the available textual and formal artifacts documenting, specifying, or implementing this data source. Indigo distinguishes two main sets of documents in the informational context (cf. <ref type="bibr" target="#b1">[2]</ref> for details). The first set, called the descriptive context, gathers all the available data source specification and documentation files produced during the different development stages. The second set, called the operational context, is composed of formal artifacts such as programs, forms or XML files. In formal settings, significant combinations of concepts are more easily located (e.g. they can be found in formulas, function declarations, etc.). Indigo thus favors the exploration of the operational context to identify relevant concept combinations that can form new complex concepts. Complex concepts are added to data sources as new candidates for the matching phase. As an experiment, Indigo was used for the semantic matching of two database schemas taken from two open-source e-commerce applications, Java Pet Store <ref type="bibr" target="#b4">[5]</ref> and eStore <ref type="bibr" target="#b5">[6]</ref>, provided with all their source code files.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Indigo's architecture</head><p>To handle both context analysis and semantic matching, Indigo's architecture is composed of two main modules: a Context Analyzer and a Mapper module. The Context Analyzer module takes the data sources to be matched along with their related contexts and enriches them before delivering them to the Mapper module for the actual matching.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Context Analyzer</head><p>The Context Analyzer comprises two main modules, each specialized in a specific type of concept extraction. 1) The Concept name collector explores the descriptive context of a data source to find (simple) concept names related to the ones found in the data source's schema. 2) The Complex concept extractor analyzes the operational context to extract complex concepts. The left side of Figure <ref type="figure" target="#fig_0">1</ref> shows the current architecture of our Context Analyzer. Modules are either basic analyzers or meta-analyzers. In all cases, the analysis performed is based on heuristic rules.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Complex concept Generator</head><formula xml:id="formula_0">SP_1 || SP_2 ... || SP_n → extractionAction.</formula><p>The left part is a disjunction of syntactic patterns (denoted SP) that basic analyzers try to match<ref type="foot" target="#foot_2">3</ref> when parsing a document. An SP is a regular expression which can contain the pattern variables name, type, exp_1, exp_2, ..., exp_n (e.g. for an accessor method in a Java program: SP_i = {public type getname * return exp_1}). When a basic analyzer recognizes one of the SPs appearing on the left-hand side of a rule, the pattern variables are assigned values (by pattern matching) and the corresponding right-hand side action of the rule is executed. This action builds a complex concept &lt;name, type, concept combination&gt; using the pattern variables' values. Each basic analyzer applies its own set of heuristic extraction rules to each of the artifacts it is assigned. Our current basic analyzers deal with forms, programs, SQL queries, and DTD and XSD schemas.</p></div>
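<div xmlns="http://www.tei-c.org/ns/1.0"><p>The rule format described above can be illustrated with a minimal sketch, assuming a rule that holds a single syntactic pattern for Java accessor methods. The regular expression and the names ComplexConcept, RULE_PATTERNS and extract_complex_concepts are hypothetical and are not taken from Indigo's implementation. The filter on the + operator is likewise a simplifying assumption for deciding which return expressions combine several concepts.</p>
<p>
```python
import re
from typing import List, NamedTuple


class ComplexConcept(NamedTuple):
    """Triple (name, type, concept combination) built by an extraction action."""
    name: str
    type: str
    combination: str


# Hypothetical syntactic pattern for a Java accessor method:
#   public TYPE getNAME(...) { ... return EXP; }
# Groups: 1 = type, 2 = name, 3 = returned expression.
ACCESSOR_PATTERN = re.compile(
    r"public\s+(\w+)\s+get(\w+)\s*\([^)]*\)\s*\{[^}]*?return\s+([^;]+);"
)

# A rule is a disjunction of patterns; this sketch holds a single one.
RULE_PATTERNS = [ACCESSOR_PATTERN]


def extract_complex_concepts(source: str) -> List[ComplexConcept]:
    """Try every pattern of the rule and run the extraction action on each match."""
    concepts = []
    for pattern in RULE_PATTERNS:
        for match in pattern.finditer(source):
            type_name, name, expression = match.groups()
            # Assumption: only expressions using "+" combine several concepts.
            if "+" in expression:
                concepts.append(ComplexConcept(name, type_name, expression.strip()))
    return concepts


JAVA_SNIPPET = """
public String getAddress() {
    return street + ", " + city;
}
public String getCity() {
    return city;
}
"""

for concept in extract_complex_concepts(JAVA_SNIPPET):
    print(concept)
```
</p><p>On this snippet, only getAddress yields a complex concept, since getCity returns a single field rather than a combination.</p></div>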
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Meta-analyzers.</head><p>Each meta-analyzer is in charge of a set of artifacts composing the informational context. Its role essentially consists in classifying these artifacts and assigning each of them to a relevant child. To do so, it applies heuristics such as checking file name extensions or parsing the internal structure of files. The meta-analyzer module at the head of the Context Analyzer coordinates the Concept name collector and the Complex concept extractor. It enhances data sources with the simple and complex concepts respectively delivered by these two modules. For complex concepts, this enrichment step requires not only the names of the enriching concepts but also the values<ref type="foot" target="#foot_3">4</ref> associated with them. The Complex concept extractor, for its part, coordinates the actions of the basic analyzers responsible for complex concept extraction.</p><p>In addition, it relies on an internal module, called the complex concept generator, that validates discovered concept combinations and generates complex concepts by replacing expressions (coming from source code) with appropriate concepts of the data source.</p></div>
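<div xmlns="http://www.tei-c.org/ns/1.0"><p>The classification step performed by a meta-analyzer could look like the following sketch, which assumes that file name extensions alone decide the assignment. The extension table, the analyzer labels and the function name dispatch_artifacts are hypothetical; the text only states that heuristics of this kind are applied.</p>
<p>
```python
from collections import defaultdict
from pathlib import PurePath
from typing import Dict, Iterable, List

# Hypothetical mapping from file name extensions to child analyzers.
EXTENSION_TO_ANALYZER = {
    ".java": "program analyzer",
    ".sql": "SQL analyzer",
    ".dtd": "DTD analyzer",
    ".xsd": "XSD analyzer",
    ".html": "form analyzer",
}


def dispatch_artifacts(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group artifact paths by the child analyzer expected to process them."""
    assignments = defaultdict(list)
    for path in paths:
        extension = PurePath(path).suffix.lower()
        # Artifacts with an unknown extension are set aside, not guessed.
        analyzer = EXTENSION_TO_ANALYZER.get(extension, "unclassified")
        assignments[analyzer].append(path)
    return dict(assignments)


print(dispatch_artifacts(["src/Account.java", "db/schema.SQL", "README"]))
```
</p><p>A fuller version would fall back on parsing the internal structure of a file when its extension is missing or ambiguous, as the text suggests.</p></div>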
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Mapper module</head><p>As shown on the right side of Figure <ref type="figure" target="#fig_0">1</ref>, the current architecture of the Mapper is composed of several hierarchically organized matching modules. Each aligner supports a given matching strategy and is responsible for generating a mapping between data sources. Above the aligners, each coordinator supervises a given set of aligners and combines the results they return. The current implementation of the Mapper comprises the following three aligners: (1) The Name-based aligner proposes matches between concepts having similar names according to the Jaro-Winkler lexical similarity metric. (2) The Whirl-based aligner uses an adapted version of the so-called WHIRL technique developed by Cohen and Hirsh <ref type="bibr" target="#b6">[7]</ref> to match concepts having similar instances. Finally, (3) the Statistic-based aligner compares the content of concepts, which is represented in this case by a normalized vector describing seven characteristics (e.g. minimum value, maximum, variance, etc.).</p></div>
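<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the Name-based aligner more concrete, here is a minimal sketch of the Jaro-Winkler metric and of a name matcher built on it. It uses the usual Jaro-Winkler parameters (prefix scaling factor 0.1, common prefix capped at four characters). The 0.85 acceptance threshold, the lower-casing of names and the function name name_based_align are assumptions made for illustration, not details of Indigo.</p>
<p>
```python
from typing import List, Tuple


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in the range 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they lie within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1) -> float:
    """Jaro similarity boosted for strings sharing a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


def name_based_align(source_names: List[str], target_names: List[str],
                     threshold: float = 0.85) -> List[Tuple[str, str, float]]:
    """Propose (source, target, score) matches whose score reaches the threshold."""
    proposals = []
    for source in source_names:
        for target in target_names:
            score = jaro_winkler(source.lower(), target.lower())
            if score >= threshold:
                proposals.append((source, target, round(score, 3)))
    return proposals


print(name_based_align(["lastname", "zip"], ["last_name", "price"]))
```
</p><p>In a hierarchy such as the one described above, a coordinator would then combine these proposals with those of the other aligners.</p></div>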
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Experimental results</head><p>Indigo's Context Analyzer was applied to the data sources of the eStore and PetStore applications. Two measures, called significance and relevance, were defined to evaluate its performance. The significance measure indicates the percentage of extracted complex concepts presenting a semantically sound combination of concepts (e.g. concat(first name, last name)). The relevance measure gives the proportion of significant complex concepts which actually appear in the final mapping of the two data sources. Overall, the Context Analyzer discovered 31 complex concepts, of which 87% were significant. Of course, not all of them were relevant. A manual examination of the data sources revealed that eStore's data source contained only two complex concepts, while PetStore's contained none. Indigo nevertheless succeeded in discovering both relevant complex concepts of eStore. After their enhancement with complex concepts, the PetStore and eStore data sources were matched by the Mapper module, which was able to discover all complex matches.</p></div>
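<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two measures can be written as simple ratios, as in the sketch below. The function names are hypothetical. The figure of 31 extracted concepts and the two relevant eStore concepts come from the text; the count of 27 significant concepts is an assumption, inferred as the only integer consistent with the reported 87% of 31.</p>
<p>
```python
def significance(extracted: int, significant: int) -> float:
    """Share of extracted complex concepts that are semantically sound."""
    if extracted == 0:
        return 0.0
    return significant / extracted


def relevance(significant: int, in_final_mapping: int) -> float:
    """Share of significant complex concepts that appear in the final mapping."""
    if significant == 0:
        return 0.0
    return in_final_mapping / significant


# 31 extracted concepts and 2 relevant ones are taken from the text;
# 27 significant concepts is inferred from the reported 87% (assumption).
print(f"significance = {significance(31, 27):.0%}")
print(f"relevance    = {relevance(27, 2):.1%}")
```
</p><p>Under that assumption, the low relevance ratio reflects the small number of complex concepts actually present in the two data sources rather than a weakness of the extraction step.</p></div>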
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion</head><p>We proposed Indigo, an innovative solution for the discovery of complex matches between data sources. Rather than searching the unbounded space of possible concept combinations, Indigo discovers complex concepts by searching the operational context artifacts of data sources. Newly discovered complex concepts are added to data sources as new candidates for complex matching. Indigo implements a Context Analyzer and a Mapper module, both offering a flexible and extensible hierarchical architecture. Specialized analyzers and aligners can be added to allow the application of new mining and matching strategies. Extensibility and adaptability are undoubtedly appreciable qualities of Indigo. Our first experiments with Indigo showed the relevance of this approach and give hope for promising future results.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1.</head><label>1</label><figDesc>Fig. 1. The Context Analyzer and the Mapper modules</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">also called semantic alignment, or simply matching</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">INteroperability and Data InteGratiOn</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">N.B. This kind of text matching is called "pattern matching".</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">These values are assessed by querying the database using SQL SELECT statements.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A Survey of Approaches to Automatic Schema Matching</title>
		<author>
			<persName><forename type="first">E</forename><surname>Rahm</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">A</forename><surname>Bernstein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">VLDB Journal</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="334" to="350" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Context Analysis for Semantic Mapping of Data Sources Using a Multi-Strategy Machine Learning Approach</title>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">Y</forename><surname>Bououlid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Vachon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the International Conf. on Enterprise Information Systems (ICEIS05)</title>
				<meeting>of the International Conf. on Enterprise Information Systems (ICEIS05)<address><addrLine>Miami, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="445" to="448" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Semantic Integration in Heterogeneous Databases Using Neural Networks</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">S</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Clifton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 20th Conf. on Very Large Databases (VLDB)</title>
				<meeting>of the 20th Conf. on Very Large Databases (VLDB)</meeting>
		<imprint>
			<date type="published" when="1994">1994</date>
			<biblScope unit="page" from="1" to="12" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Euzenat</surname></persName>
		</author>
		<idno>IST-2004-507482</idno>
		<title level="m">State of the Art on Ontology Alignment</title>
				<imprint>
			<publisher>Knowledge Web Consortium</publisher>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
	<note>Part of a research project funded by the IST Program of the Commission of the European Communities, project number IST-2004-507482</note>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<ptr target="http://java.sun.com/developer/releases/petstore/" />
		<title level="m">Sun Microsystems</title>
				<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Developing Pet Store Using RUP and XDE</title>
		<author>
			<persName><forename type="first">R</forename><surname>Mcumber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Web Site</title>
		<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Joins that Generalize: Text Classification using Whirl</title>
		<author>
			<persName><forename type="first">W</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hirsh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the Fourth Int. Conf. on Knowledge Discovery and Data Mining</title>
				<meeting>of the Fourth Int. Conf. on Knowledge Discovery and Data Mining</meeting>
		<imprint>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
