Matching Instances in GeoLink Michelle Cheatham, Reihaneh Amini, and Chandan Patel DaSe Lab, Wright State University, Dayton OH 45435, USA, {michelle.cheatham, amini.2, patel.383}@wright.edu Abstract. We propose the use of the GeoLink data repository as an in- stance matching benchmark. The GeoLink project brings together seven datasets related to geoscience research. Both the T-box and the A-box of GeoLink are significantly larger than current benchmarks, and they have interesting challenges, such as geospatial and temporal data. GeoLink is part of the NSF EarthCube initiative. Seven diverse geoscience datasets have been brought together into a single data repository. The ontology is documented at http://schema.geolink.org, and the triple store is accessible at http://data.geolink.org. There are currently 282 classes, 338 properties, 5,118,150 instances and 45,093,750 triples in the knowledge base. The are also owl:sameAs and skos:closeMatch links between instances of different types. The sameAs links were manually generated by the data providers, while the close- Match links were generated by an automated coreference resolution system. We highlight three different classes within the GeoLink schema that pose different opportunities for evaluating and challenging coreference resolution systems: Per- son, Cruise, and Organization. Person Instances of Person appear in a variety of contexts such as Chief Scien- tist on a cruise, Principal Investigator on a project, participant in a meeting, or creator of a dataset or paper. Key object properties related to the person class reflect these different contexts. Related data properties include name, email ad- dress, and ORCID.1 GeoLink considers the NSF dataset to be “canonical” for the Person class, meaning that Person instances in each of the other datasets have been mapped to NSF instances. The NSF dataset contains 335,504 people, so it is not feasible to compare each person from one of the constituent datasets to every person in the NSF datset. This benchmark can therefore be used to encourage development of systems that employ effective filtering or other mech- anisms to achieve scalablility. The triple store currently contains 15,660 people not in the NSF dataset. There are 790 sameAs and 1,405 closeMatch links be- tween these people and those within the NSF data. Cruise There are 12,070 cruises in the GeoLink repository, potentially allowing an m by n comparison. There are 1,356 sameAs links and 368 closeMatch links among cruises. The cruise coreference task is intriguing because cruises have geospatial and temporal elements, which are considered an important challenge 1 http://orcid.org 2 Cheatham, Amini and Patel for coreference resolution systems [3]. Two properties of particular interest are hasTrack and hasPortCall properties. A cruise’s track is generally a series of latitude and longitude coordinates. The Cruise class also has properties has- StartPortCall, hasMidPortCall, and hasEndPortCall. The PortCall class is in the domain of the properties hasTimeStamp and hasPort, whereas the range of hasTimeStamp is a date time literal and the range of hasPort is Place. A place can be described in terms of its latitude and longitude, but it might also be identified using a gazetteer term. Organization Compared to Person and Cruise, the GeoLink knowledge base contains relatively little information about instances of the Organization class. There is often little data other than the organization’s title and the set of people who are affiliated with it in the knowledge base. Finding coreferences in this situation is likely to be difficult for approaches that rely on extensive schema information; however, approaches that rely on, for instance, the degree of overlap between the people affiliated with two organizations to measure their similarity, may perform quite well. Because there are nearly 300,000 organizations within GeoLink, this is again a task in which approaches that do not perform some type of filtering are unlikely to be feasible. There are currently no sameAs links between organizations, but 268 closeMatch links have been established. There are several existing coreference resolution benchmarks. The dominant existing benchmark is that of the OAEI, which has included an instance matching track since 2009 [1]. Some tasks within this track are synthetic (generated via SPIMBENCH [2]) while others are real-world. The benchmark proposed here differs because it is less narrowly focused and involves a much larger schema and A-box. On the other hand, because the current set of links in GeoLink is likely not exhaustive, only recall (and not precision) can be evaluated. Acknowledgments The authors sincerely thank the GeoLink team.2 This work was supported by the National Science Foundation GeoLink project (1440202). References 1. Ferrara, A., Nikolov, A., Noessner, J., Scharffe, F.: Evaluation of instance matching tools: The experience of OAEI. Web semantics: Science, services and agents on the World Wide Web 21, 49–60 (2013) 2. Saveta, T., Daskalaki, E., Flouris, G., Fundulaki, I., Herschel, M., Ngonga Ngomo, A.C.: Pushing the limits of instance matching systems: A semantics-aware bench- mark for linked data. In: Proceedings of the 24th International Conference on World Wide Web. pp. 105–106. ACM (2015) 3. Shvaiko, P., Euzenat, J.: Ontology matching: State of the art and future challenges. IEEE Transactions on Knowledge and Data Engineering 25(1), 158–176 (2013) 2 http://www.geolink.org/team.html