<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">ObjectCoref &amp; Falcon-AO: Results for OAEI 2010</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Wei</forename><surname>Hu</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Jianfeng</forename><surname>Chen</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Gong</forename><surname>Cheng</surname></persName>
						</author>
						<author role="corresp">
							<persName><forename type="first">Yuzhong</forename><surname>Qu</surname></persName>
							<email>yzqu@nju.edu.cn</email>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science and Technology</orgName>
								<orgName type="institution">Nanjing University</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="laboratory">State Key Laboratory for Novel Software Technology</orgName>
								<orgName type="institution">Nanjing University</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">ObjectCoref &amp; Falcon-AO: Results for OAEI 2010</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">E3F28E763259F2941C54EB35F523E459</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T05:50+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this report, we mainly present an overview of ObjectCoref, which follows a self-training framework to resolve object coreference on the Semantic Web. In addition, we show preliminary results of Falcon-AO (2010) for this year's OAEI campaign, covering the benchmark and conference tracks.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The Semantic Web is an ongoing effort by the W3C Semantic Web Activity to realize data integration and sharing across different applications and organizations. To date, a number of prominent ontologies have emerged to publish data for specific domains, such as the Friend of a Friend (FOAF). These specifications recommend common identifiers for classes and properties in the form of URIs <ref type="bibr" target="#b0">[1]</ref> that are widely and consistently used across data sources.</p><p>At the instance level, however, sources are far from agreeing on the use of common URIs to identify specific objects on the Semantic Web. In fact, due to the decentralized and dynamic nature of the Semantic Web, it frequently happens that different URIs from various sources, more likely originating from different RDF documents, are used to identify the same real-world object, i.e., refer to an identical thing (also known as URI aliases <ref type="bibr" target="#b4">[5]</ref>). Examples exist in the domains of people, academic publications, and encyclopedic or geographical resources.</p><p>Object coreference resolution, also called consolidation or identification <ref type="bibr" target="#b1">[2]</ref>, is a process for identifying multiple URIs of the same real-world object, that is, determining URI aliases (called coreferent URIs in this report) that denote a unique object. At present, object coreference resolution is recognized to be useful for data-centric applications, e.g. heterogeneous data integration or mining systems, and semantic search, query and browsing engines.</p><p>We introduce a new approach, ObjectCoref, for bootstrapping object coreference resolution on the Semantic Web. The architecture of the proposed approach follows a common self-training framework (see Fig. <ref type="figure" target="#fig_0">1</ref>). 
Self-training <ref type="bibr" target="#b5">[6]</ref> is a major kind of semi-supervised learning, which assumes that unlabeled examples are abundant in the real world but the number of labeled training examples is limited. We believe that self-training is an appropriate way to resolve object coreference on the Semantic Web.</p><p>Falcon-AO <ref type="bibr" target="#b3">[4]</ref> is an automatic ontology matching system with acceptable to good performance and a number of remarkable features. It is written in Java and is open source. ObjectCoref and Falcon-AO together help better enable interoperability between applications that use heterogeneous Semantic Web data. The semantics of owl:sameAs dictates that all the URIs linked with this property have the same identity; if a property is declared to be inverse functional (IFP), then the object of each property statement uniquely determines the subject (some individual); a functional property (FP) is a property that can have only one unique value for each individual; while cardinality (or max-cardinality) allows the specification of exactly (or at most) the number of elements in a relation, in the context of a particular class description, and when the number equals 1, it is somewhat similar to an FP but applies only to that particular class.</p><p>Next, ObjectCoref learns the discriminability of pairs of properties based on the coreferent URIs, in order to find more coreferent URIs for extending the training set. The discriminability reflects how well each pair of properties can be used to determine whether two URIs are coreferent or not. As an extreme example, IFPs (e.g. foaf:mbox) have very good discriminability.</p><p>In RDF graphs, each URI is involved in a number of RDF triples whose subject is the URI, and the predicates and objects in these RDF triples form &lt;property, value&gt; pairs, which can be considered as features describing that URI. 
ObjectCoref compares the values of the &lt;property, value&gt; pairs from coreferent URIs, and finds which two properties have similar values and how frequently. The significance of a property pair is the percentage of the coreferent URIs in the training set that can be found by that pair of discriminant properties. If the significance is greater than a given threshold, the property pair is chosen for further resolution. Please note that the same property pair may have different discriminability in different domains.</p><p>For example, a pair of rdfs:labels is discriminant for the biomedical domain but not for people.</p><p>If new coreferent URIs are found, ObjectCoref selects highly accurate ones and adds them to the training set. The whole process iterates several times and terminates when the property discriminability is not significant enough or no more discriminant property pairs can be found.</p></div>
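<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration only, the significance test described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the ObjectCoref implementation: the function names, the case-insensitive equality used as the similarity test, the toy URIs with the made-up ex: prefix, and the threshold value of 0.8 are all introduced here and do not come from the system.</p>
<p>
```python
from itertools import product


def similar(value1, value2):
    # Assumed similarity test (not from the report): case-insensitive equality.
    return value1.strip().lower() == value2.strip().lower()


def significance(training_pairs, descriptions, prop1, prop2):
    """Share of coreferent URI pairs in the training set that the
    property pair (prop1, prop2) can find through similar values."""
    if not training_pairs:
        return 0.0
    hits = 0
    for uri1, uri2 in training_pairs:
        values1 = descriptions.get(uri1, {}).get(prop1, [])
        values2 = descriptions.get(uri2, {}).get(prop2, [])
        if any(similar(v1, v2) for v1, v2 in product(values1, values2)):
            hits += 1
    return hits / len(training_pairs)


def discriminant_pairs(training_pairs, descriptions, candidates, threshold):
    # Keep the property pairs whose significance exceeds the threshold.
    return [
        (p1, p2)
        for p1, p2 in candidates
        if significance(training_pairs, descriptions, p1, p2) > threshold
    ]


# Toy data with hypothetical URIs and properties (ex: is a made-up prefix).
descriptions = {
    "ex:a1": {"rdfs:label": ["Aspirin"], "ex:code": ["1"]},
    "ex:b1": {"rdfs:label": ["aspirin"], "ex:code": ["9"]},
    "ex:a2": {"rdfs:label": ["Ibuprofen"], "ex:code": ["2"]},
    "ex:b2": {"rdfs:label": ["ibuprofen "], "ex:code": ["2"]},
}
training_pairs = [("ex:a1", "ex:b1"), ("ex:a2", "ex:b2")]
candidates = [("rdfs:label", "rdfs:label"), ("ex:code", "ex:code")]

selected = discriminant_pairs(training_pairs, descriptions, candidates, 0.8)
print(selected)
```
</p><p>On this toy data, the pair of rdfs:labels matches both training pairs (significance 1.0), whereas the pair of ex:code properties matches only one of the two (significance 0.5), so only the label pair passes the assumed threshold. In a self-training loop, the selected pairs would then be used to propose new coreferent URIs for the next iteration.</p></div>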
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3">Adaptations made for the evaluation</head><p>For ObjectCoref, there is no explicit equivalence semantics in the DI and PR tracks. In order to establish the initial training set of coreferent URIs, we randomly extract 20 mappings from the reference alignment for each test case. All the mappings generated by ObjectCoref are based on the same parameters.</p><p>For Falcon-AO, we do not make any specific adaptation in the OAEI 2010 campaign. All the mappings output by Falcon-AO for the benchmark and conference tracks are based on the same parameters.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.4">Link to the system and parameters file</head><p>We implement an online service for ObjectCoref, and run it over a large-scale dataset collected by the Falcons <ref type="bibr" target="#b2">[3]</ref> search engine up to Sept. 2008. The dataset consists of nearly 600 million RDF triples describing over 76 million URIs. The service is still under development. Please visit: http://ws.nju.edu.cn/objectcoref.</p><p>In addition, we follow the SEALS platform to publish Falcon-AO (2010) as a service. Please access it from http://219.219.116.154:8083/falconWS?wsdl. The offline version can be downloaded from our website: http://ws.nju.edu.cn/falcon-ao.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.5">Link to the set of provided alignments (in align format)</head><p>The alignments for this year's OAEI campaign should be available at the official website: http://oaei.ontologymatching.org/2010/.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Results</head><p>In this section, we will present the results of ObjectCoref and Falcon-AO (2010) on the tracks provided by the OAEI 2010 campaign.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">DI</head><p>In this track, we use ObjectCoref to resolve object coreference between three pairs of datasets, namely diseasome vs. sider, dailymed vs. sider and drugbank vs. sider. Table 1 shows the discriminant property pairs that ObjectCoref learns by self-training. For example, diseasome:name and sider:siderEffectName are a pair of discriminant properties: if some URI in the diseasome dataset has a value for diseasome:name that is similar to the sider:siderEffectName value of some URI in the sider dataset, these two URIs can be considered coreferent. In this track, the training process converges after two iterations for each pair of datasets. With these discriminant property pairs, ObjectCoref finds a number of coreferent URIs for each pair of datasets. As shown in Table <ref type="table" target="#tab_1">2</ref>, the precision and recall are moderate. Because the type of each object is not considered, the precision is not very good, so further inference-based debugging of coreferent URIs is left for future work. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">PR</head><p>In this track, ObjectCoref uses the same self-training process to recognize coreferent URIs for each pair of datasets, two of which are related to persons and the other is about restaurants. The discriminant property pairs are listed in Table <ref type="table" target="#tab_2">3</ref>. Based on these discriminant properties, ObjectCoref finds a set of coreferent URIs, where the precision and recall are quite good (see Table <ref type="table" target="#tab_3">4</ref>). In particular, the good recall reflects that our learning approach identifies the key properties for resolving object coreference in this track. But we also notice that some combinations of properties may also be helpful. For example, first name + last name can be used for identifying the same person. </p></div>
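<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers who wish to check how the reported measures relate to each other, the following Python sketch computes precision, recall and F-measure from raw counts. It uses the standard definitions; the function name is an assumption introduced here, and the number of correct mappings for person1 (499) is not reported in the tables but is inferred from the 499 found mappings and the precision of 1.000.</p>
<p>
```python
def evaluate(found, existing, correct):
    """Return (precision, recall, F-measure) from raw counts.

    found:    number of mappings produced by the system
    existing: number of mappings in the reference alignment
    correct:  number of produced mappings that are in the reference
    """
    precision = correct / found if found else 0.0
    recall = correct / existing if existing else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# person1 row: 499 found, 500 existing; correct=499 is an inferred value.
p, r, f = evaluate(found=499, existing=500, correct=499)
print(f"{p:.3f} {r:.3f} {f:.3f}")  # prints "1.000 0.998 0.999"
```
</p><p>The printed values agree with the person1 row of the PR results.</p></div>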
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Benchmark &amp; conference</head><p>We use Falcon-AO (2010) to participate in the benchmark and conference tracks. The average precision and recall are given in Table <ref type="table" target="#tab_4">5</ref>. Compared to OAEI 2007, the benchmark track adds some new cases. Falcon-AO failed in several cases due to Jena parsing errors. For the detailed results, please see the Appendix. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">General comments</head><p>In this section, we first discuss several possible ways to improve ObjectCoref, and then comment on the OAEI 2010 test cases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Discussions on the way to improve the proposed system</head><p>The preliminary results of ObjectCoref demonstrate that using property discriminability to find coreferent URIs on the Semantic Web is feasible. However, we also see several shortcomings of the proposed approach, which will be considered in the next version.</p><p>1. How to divide objects into different domains? For the tasks in this year's OAEI, we may not see the importance of recognizing domains, but on the whole Semantic Web, different domains may have different discriminant properties, and a single property pair may have different discriminability in different domains. A uniform measurement is therefore ineffective.</p><p>2. How to avoid error accumulation? In self-training, an important issue is to prevent error accumulation, since a wrongly labeled example would lead to misclassification in further propagation. In our evaluation, because the training process converges in a few iterations, this issue is not significant. In the real world, however, it must be taken into account.</p><p>3. How to find discriminant property combinations? A single property may not be good enough for resolving object coreference, while a combination of several properties would be more discriminant. However, we need to avoid overfitting. So, we plan to mine frequent patterns in the RDF data for describing objects and refine these frequent patterns to form property combinations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Comments on the OAEI 2010 test cases</head><p>The proposed matching tasks cover a large portion of real-world domains, and the discrepancies between them are significant. Doing experiments on these tasks is helpful for improving algorithms and systems. To enhance applicability, we list some problems encountered in our experiments, which might help the organizers improve the campaign in the future.</p><p>1. In the DI track, the organizers provide 4 downloadable datasets for the biomedical domain; however, the interlinking track also involves a number of others, e.g., linkedct, lifescience, bio2rdf. These datasets are not only very large, but their latest versions are also difficult to find, and most of them are not even available for download. Furthermore, using SPARQL endpoints in the experiment is very time-consuming, especially at such a large scale. So, we would hope that all the datasets can be made available offline (perhaps temporarily) next year.</p><p>2. Falcon-AO (2010) uses Jena 2.6.3 as the RDF parser. In the benchmark track, some ontologies may have problems and cause the Jena exception "Unqualified typed nodes are not allowed. Type treated as a relative URI". So, we would expect the organizers to fix this next year.</p><p>Object coreference resolution is an important way of establishing interoperability among (Semantic) Web applications that use heterogeneous data. We implement an online system for resolving object coreference called ObjectCoref, which follows a self-training framework focusing on learning property discriminability. From the experiments in this year's DI and PR tracks, we gained both positive and negative experience for improving our system. In the near future, we look forward to making steady progress towards building a comprehensive object coreference resolution system for the Semantic Web.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1.</head><label>1</label><figDesc>Fig. 1. Self-training process</figDesc><graphic coords="2,137.64,165.73,339.62,115.19" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1.</head><label>1</label><figDesc>Property discriminability on the DI track</figDesc><table><row><cell></cell><cell>Property in dataset1</cell><cell>Property in dataset2</cell></row><row><cell>diseasome vs. sider</cell><cell>rdfs:label</cell><cell>sider:sideEffectName</cell></row><row><cell></cell><cell>diseasome:name</cell><cell>sider:siderEffectName</cell></row><row><cell></cell><cell>rdfs:label</cell><cell>rdfs:label</cell></row><row><cell></cell><cell>diseasome:name</cell><cell>rdfs:label</cell></row><row><cell>dailymed vs. sider</cell><cell>dailymed:genericMedicine</cell><cell>sider:drugName</cell></row><row><cell></cell><cell>dailymed:name</cell><cell>sider:drugName</cell></row><row><cell></cell><cell>dailymed:genericMedicine</cell><cell>rdfs:label</cell></row><row><cell></cell><cell>dailymed:name</cell><cell>rdfs:label</cell></row><row><cell>drugbank vs. sider</cell><cell>drugbank:genericName</cell><cell>sider:drugName</cell></row><row><cell></cell><cell>rdfs:label</cell><cell>sider:drugName</cell></row><row><cell></cell><cell>drugbank:genericName</cell><cell>rdfs:label</cell></row><row><cell></cell><cell>rdfs:label</cell><cell>rdfs:label</cell></row><row><cell></cell><cell>drugbank:synonym</cell><cell>sider:drugName</cell></row><row><cell></cell><cell>drugbank:synonym</cell><cell>rdfs:label</cell></row><row><cell></cell><cell>drugbank:pubchemCompoundId</cell><cell>sider:siderDrugId</cell></row><row><cell></cell><cell>drugbank:brandName</cell><cell>sider:drugName</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2.</head><label>2</label><figDesc>Performance of ObjectCoref on the DI track</figDesc><table><row><cell></cell><cell>Found</cell><cell>Existing</cell><cell>Precision</cell><cell>Recall</cell><cell>F-measure</cell></row><row><cell>diseasome vs. sider</cell><cell>190</cell><cell>238</cell><cell>0.837</cell><cell>0.668</cell><cell>0.743</cell></row><row><cell>dailymed vs. sider</cell><cell>2903</cell><cell>1592</cell><cell>0.548</cell><cell>0.999</cell><cell>0.708</cell></row><row><cell>drugbank vs. sider</cell><cell>933</cell><cell>283</cell><cell>0.302</cell><cell>0.996</cell><cell>0.464</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3.</head><label>3</label><figDesc>Property discriminability on the PR track</figDesc><table><row><cell></cell><cell>Property in dataset1</cell><cell>Property in dataset2</cell></row><row><cell>person1</cell><cell>person11:has address</cell><cell>person12:has address</cell></row><row><cell></cell><cell>person11:phone number</cell><cell>person12:phone number</cell></row><row><cell></cell><cell>person11:soc sec id</cell><cell>person12:soc sec id</cell></row><row><cell>person2</cell><cell>person21:has address</cell><cell>person22:has address</cell></row><row><cell></cell><cell>person21:phone number</cell><cell>person22:phone number</cell></row><row><cell></cell><cell>person21:soc sec id</cell><cell>person22:soc sec id</cell></row><row><cell>restaurants</cell><cell>restaurant1:has address</cell><cell>restaurant2:has address</cell></row><row><cell></cell><cell>restaurant1:name</cell><cell>restaurant2:name</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4.</head><label>4</label><figDesc>Performance of ObjectCoref on the PR track</figDesc><table><row><cell></cell><cell>Found</cell><cell>Existing</cell><cell>Precision</cell><cell>Recall</cell><cell>F-measure</cell></row><row><cell>person1</cell><cell>499</cell><cell>500</cell><cell>1.000</cell><cell>0.998</cell><cell>0.999</cell></row><row><cell>person2</cell><cell>362</cell><cell>400</cell><cell>1.000</cell><cell>0.900</cell><cell>0.947</cell></row><row><cell>restaurants</cell><cell>91</cell><cell>112</cell><cell>0.989</cell><cell>0.804</cell><cell>0.887</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5.</head><label>5</label><figDesc>Performance of Falcon-AO (2010) on the benchmark and conference tracks</figDesc><table><row><cell></cell><cell>Precision</cell><cell>Recall</cell></row><row><cell>Benchmark</cell><cell>0.76</cell><cell>0.64</cell></row><row><cell>Conference</cell><cell>0.60</cell><cell>0.60</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>This work is supported in part by the NSFC under Grants 61003018 and 60773106. We would like to thank Ming Li for his valuable comments on self-training.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Appendix: Complete results</head><p>In this appendix, we show the complete results of Falcon-AO (2010) on the benchmark and conference tracks. Tests were carried out on a machine with two Intel Xeon Quad 2.40GHz CPUs and 8GB memory, running Red Hat Enterprise Linux Server 5.4 (x64), with the Java 6 compiler and MySQL 5.0.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Matrix of Results</head><p>In the following tables, the results are shown by precision (Prec.) and recall (Rec.). </p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Uniform Resource Identifier (URI): Generic Syntax</title>
		<author>
			<persName><forename type="first">T</forename><surname>Berners-Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Fielding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Masinter</surname></persName>
		</author>
		<idno>RFC 3986</idno>
		<ptr target="http://www.ietf.org/rfc/rfc3986.txt" />
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Data Fusion</title>
		<author>
			<persName><forename type="first">J</forename><surname>Bleiholder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Naumann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="41" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Searching Linked Objects with Falcons: Approach, Implementation and Evaluation</title>
		<author>
			<persName><forename type="first">G</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal on Semantic Web and Information Systems</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="49" to="70" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Falcon-AO: A Practical Ontology Matching System</title>
		<author>
			<persName><forename type="first">W</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Web Semantics</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="237" to="239" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Architecture of the World Wide Web</title>
		<author>
			<persName><forename type="first">I</forename><surname>Jacobs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Walsh</surname></persName>
		</author>
		<ptr target="http://www.w3.org/TR/webarch/" />
	</analytic>
	<monogr>
		<title level="m">W3C Recommendation</title>
		<imprint>
			<date type="published" when="2004-12-15">15 December 2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Semi-supervised learning by disagreement</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Knowledge and Information Systems</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="415" to="439" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
