Unification of Geospatial Reasoning, Temporal Logic, & Social Network Analysis in an RDF Database Jans Aasman Franz, Inc. 555 12th Street, Suite 1450 Oakland, CA 94607 +1-510-452-2000 ja@franz.com Abstract - This paper is about a new type of event supports a combination of description logic, geospatial reasoning, database that enables efficient reasoning about things, people, temporal reasoning, and knowledge about the social relationships companies, relationships between people and companies, and between people. about places and events. The event database is built on top of a The commercial vendors of Semantic Technologies also see scalable distributed RDF triple store that can handle literally a number of use cases that all center around events and require billions of events. Like objects, events have at least one actor, the aforementioned query capabilities. We currently see but usually more, a start-time and possibly an end-time, a companies using large data warehouses with very disparate RDF place where the event happened, and the type of the event. An based triple stores describing various types of events where each event can have many additional properties and annotations. event has at least two actors, usually a begin and end time, and On top of this event database we implemented libraries for very often a geospatial component. These events are literally RDFS++ logic reasoning, for geospatial and temporal everywhere: in Health Care applications we see hospital visits, capabilities, and an extensive social network analysis package. drugstore visits, and medical procedures. In the Communications This paper focuses on a query framework that makes it easy to Industry we see telephone call detail records, including location. combine all of the aforementioned capabilities in a user An email and calendar database of a large company is nothing friendly query language. more than a social network database filled with events in time and, in many cases, space. In the Financial Industry every Index Terms – Geotemporal logic, geospatial reasoning, transaction is essentially an event. In the Insurance Industry claims are important events and they desperately need more RDF database, graph database, RDFS, OWL, SPARQL, social activity recognition. In the Intelligence community basically network analytics, business intelligence, event-based systems, everything revolves around events and actors. The REWERSE event-driven architectures, metadata, semantic technologies. program from the 6th Framework Programme of the EU Commission [3] is one of the few systematic efforts to combine RDFS/OWL with geotemporal reasoning, although the social aspect hasn't been addressed yet. The recent book “The I. INTRODUCTION Geospatial Web” [4] currently provides the state of the art overview on how to work with people and events on a web scale This paper describes the design and use of a unifying query and what kind of applications we might expect in the near future. framework for geospatial reasoning, temporal logic, social network analytics, RDFS and OWL in Event-based systems [1]. In this introduction we will first go into why we need such a II. FRAMEWORK REQUIREMENTS framework and the requirements for such a framework. The Semantic Web community has made great strides in the The reason for such a framework can be answered by area of ontologies and description logic, and some initial work in looking at the vision of the semantic web and understanding how the areas of geospatial reasoning [5], temporal reasoning [6], companies use semantic technologies. Tim Berners-Lee, James social network analysis [7], and event ontologies [8]. All of this Hendler and Ora Lassila’s Scientific American article (May, is based on RDF as the data representation. Based on this W3C 2000) [2] provides a compelling vision of the Semantic Web. It standard the combination of all these different reasoning contains some interesting use cases for what the Semantic Web capabilities in one unified framework will propel further industry will bring. These use cases assume that software agents know adoption of Semantic Technology. Given that we have seen a how to roam the web and reason over things, people, companies, direct need for query capabilities that handle relationships between people and companies and about places geospatial/temporal/social/rdfs/owl, we have designed a and events. Clearly these agents need a query capability that framework. The main requirements we identified were: of Allen’s 13 interval predicates. Note that we do purely 1. User and programmer friendly: We wanted the quantitative temporal reasoning. So if you provide a number of framework to be an extension of SPARQL, with events with a start time and an end time or a duration then we SPARQL as the foundation. Certainly the framework can perform queries like the following. This example will return should not be anymore complex than SPARQL. all intervals ?i2 that happened in interval ?i1. SPARQL is relatively user friendly, and as languages go, (select ?x (interval-during ?i1 ?i2)) the adoption rate is such that one could make the Temporal reasoning uses the range query capabilities to the argument that it is sufficient to address most use cases. fullest extent. If you want to find all the events that happened between Jan. 1, 2008 and Jan. 2, 2008, the triple store performs a 2. Implementer friendly: We need many people to straight triple query with only one cursor scan. It is still possible experiment with this proposed framework such that the to blow up the query time spectacularly by doing things like Semantic Web community can converge on a standard. (select (?x ?y) (point-before ?x ?y)) 3. Efficient: Given that we work with very large databases as that will generate every before/after pair. However, we do with millions of events where the response time has to consider that to be the responsibility of the user. In many cases a be on the sub second level, the implementation of the query optimizer can warn for that or rearrange the clauses to bind query language and query engine needs to be very fast ?x or ?y. 4. We want the query language to work on distributed databases. Currently we’ve designed the query engine to work on federations of triple stores. Once we develop B. Geospatial Primitives efficient caching techniques for distributed RDF knowledge stores residing all over the web, it will also Our original intention of adding Geospatial capabilities was be efficient for agents that need to roam the web. not so much to compete with existing spatial databases but instead make it very easy for RDF users to be able to deal with 5. Practical & Easily Extendible: We want the API to be locations of objects very efficiently. In order to make this fast we such that it can be easily modified to allow for ongoing implemented a variation of an R-Tree to encode two-dimensional experimentation. data very efficiently directly in the triple indices [10]. A detailed description of how this geospatial representation works can be found in the geospatial tutorial included with the AllegroGraph 6. Works well with RDFS and OWL reasoning. documentation [11]. Currently we support a number of predicates that can be used in the query language. Some examples of the predicates: III. DISCUSSION (geo-distance ?x ?y ?dist) -> given, x and y, return distance In the remainder of the paper we show how we can combine (geo-within-radius ?x ?y 10.0) -> find y within 10 miles from x geospatial reasoning, temporal logic, social network analytics, (geo-inside-polygon ?polygon ?place ?lon ?lat) and RDFS reasoning all in one query language. One question that people ask who are familiar with triple stores is: how can this work efficiently on very large data sets For our benchmarking we use the open source GeoNames containing billions of triples? Most first generation triple stores database that can be freely downloaded from GeoNames.org [12]. store the URIs and literals that constitute the parts of a triple as The database contains nearly 7 million points of interest on strings in a dictionary. So, when doing range queries over earth. From interesting points in nature, to populated areas, to numeric values, for example, "select * from person where age > schools and churches, etc. Each point has 12 features such as 50”, the triple store engine has to go through each value for the asciiname, the local name, elevation level, longitude, latitude, predicate ‘age’. One way around this is to add btrees for every population, etc. Actually, it is not a database but a csv file that numeric type but that in general is a very inefficient solution in programmers can modify as necessary. For our purposes we triple stores. The triple store that we use is AllegroGraph which obviously transform it into RDF triples. We can retrieve all 459 is actually a hybrid between a relational database and triple geo-points around Berkeley less than 4 miles away in less than 5 store, the internal representation of the triples is such that is milliseconds. We would argue that the basic retrieval speed is allows for very efficient range queries. comparable to or better than current commercial spatial databases. Here are some typical example queries that you can do on the GeoNames database: A. Temporal Reasoning Our temporal reasoning is based on James Allen’s Interval Find the distance between Oakland and the one and only Logic [9]. This logic looks at all the 13 ways two temporal Berkeley in California. intervals can relate to each other. We provide predicates for each (select (?dist) (q ?x geo:name “Oakland”) measure. This predicate will start with the most important one (q ?y geo:name "Berkeley") first. (q ?y geo:admin1_code "CA") (geo-distance ?x ?y ?dist)) (select ?x (ego-group fr:Person1 knows 2 ?group) Put in a Google map all the places within 10 miles from Oakland (actor-centrality-members ?group knows ?x)) (google-map (select (?name ?lat ?lon) (q ?x geo:asciiname “Oakland”) AllegroGraph is a native, general graph database, written specifically to make graph search faster. However, the bottleneck (geo-within-radius ?x ?y 10) is still getting triples from disk as fast as possible and having the (q ?y geo:asciiname ?name) smartest algorithms and best caching available. For example, many of the centrality measures that are used to compute the (q ?y geo:isAt5 ?pos) importance of an actor in a known group need to compute the (pos->lon/lat ?pos ?lon ?lat))) shortest path between every actor in the group. We have created special constructors to cache these groups in a transparent way so that most computations can be done without minimal IO. C. Social Network Analysis (SNA) IV. AN OVERVIEW EXAMPLE Many RDF resources are about people and relationships between people, or between people and companies, or between companies and other companies. We added Social Network In order to give the reader an impression of the breadth and Analysis methods to make it easier to reason about relationships depth of the query language, we provide a typical example that and groups. The functions that we provide address the five basic combines geospatial, temporal, SNA and RDFS reasoning. questions from Social Network Analysis. (1) How far is person A (select (?x) from person B, (2) if there is a link between A and B then how (ego-group person:jans knows ?group 2) strong is this relationship, (3) given a particular actor A, in what group does this actor ‘live’, (4) given an actor in a group, how (actor-centrality-members ?group knows ?x ?num) important is this actor in the group and finally, (5) given a group, (q ?event fr:actor ?x) how dense are the relationships in the group and does this group have a leader or a set of leaders. The SNA library encompasses a (qs ?event!rdf:type fr:Meeting) set of well know SNA algorithms. We provide a set of general (interval-during ?event “2008-12-01” “2008-12-05”) functions and have developed the concept of a generator. A (geo-box-around geoname:Berkeley ?event 5 miles) generator is basically a function that takes as an input one node and then creates a set of output nodes. The search functions and !) SNA functions that we provide take these generators as first class In English this translates into: arguments. For example: say we have a database with Find the group of friends and friends of friends around the relationships between people, the generator ‘knows’ will take as person “Jans”. Find within this group the most important an input a person and return a set of person(s) by following person first. Find if this person was part of an event that fr:went-to-dinner-with and fr:went-to-movies in both directions. was of type Meeting and happened in a particular time interval within 5 miles of Berkeley. (defgenerator knows () Note that we seamlessly mix Social Network Analysis in the first (bidirectional fr:went-to-dinner fr:went-to-movies)) two clauses, a simple database look up in the third, an RDFS inference about the type of event, and then a temporal and a geospatial constraint. This current example and the examples We can use this generator to find, for example, the shortest shown above utilize Prolog. We expect in early 2009 to have a path between two people. In this case the query will return a list SPARQL engine that will perform this identical query. of persons. The syntax of the SPARQL query will be slightly more contrived due to the fact that SPARQL normally only allows (select ?x patterns that map directly on triples (see example below). Note that we introduced the non-standard ‘=’ or assignment construct. (shortest-path knows fr:Person1 fr:Person2 ?x)) We are planning to discuss this topic with the SPARQL committees. Or we can use the generator to first create a group of friends select ?x where { and friends of friends in the ego-group predicate, and then we ?group = ego-group(person:jans knows 2) . find the importance of each member using the actor-centrality ?x = actor-centrality-members(?group knows ?x) . ?event fr:actor ?x ; rdf:type fr:Meeting . [8] Raimond, Y. Abdallay, S., Event Ontology, 2007, http://motools.sourceforge.net/event/event.html FILTER (interval-during ?event '2007-12-01' '2007-12-31') FILTER (geo-box-around geoname:Berkeley ?event 5miles) [9] Allen, J.F.: Time and Time Again: The Many Ways to Represent Time. International Journal of Intelligent } Systems, Vol. 6, No. 4 (1991) V. SUMMARY AND FUTURE RESEARCH [10] Wikipedia R-tree data structure, http://en.wikipedia.org/wiki/Rtree In this paper we have discussed how RDF can serve as a basis for an event database where events are defined as ‘things’ [11] Geospatial tutorial section of Franz Inc.’s AllegroGraph 3.0 that (1) require RDFS++ reasoning because events have types, documentation, (2) require geospatial reasoning because events happen http://agraph.franz.com/support/documentation/current/geos somewhere, (3) require temporal reasoning because events nearly patial-tutorial.html always have a start and duration and (4) require some form of social analysis because most interesting events have one or more [12] GeoNames Data Access, http://www.geonames.org/export/ actors. We demonstrated how all of these capabilities can be used in one query language, in this case Prolog. And we expect that in the near future these capabilities will be available in SPARQL as well. The primary research effort for the current version of the query framework is to enhance query-optimization. Notice that in the example shown above, most clauses are not direct matches against the database but functors that do computations. Some of these functors can act both as generators and as filters (as is common in Prolog). In case a functor acts as generator we need to research better statistical predictions for how many solutions can be expected so that we can do better re-ordering of clauses. VI. REFERENCES [1] Aasman, J., Unification of geospatial reasoning, temporal logic, & social network analysis in event-based systems, Distributed Event Based Systems (DEBS 2008) http://portal.acm.org/citation.cfm?id=1386007 [2] 1st Scientific American article on the Semantic Web, http://www.sciam.com/article.cfm?articleID=00048144- 10D2-1C70-84A9809EC588EF21&ref=sciam [3] REWERSE, http://rewerse.net [4] Scharl, A., Tochtermann, K.: The Geospatial Web: How Geobrowsers, Social Software and the Web 2.0 are Shaping the Network Society. Springer (2007) [5] W3C Geospatial Incubator Group, http://www.w3.org/2005/Incubator/geo/ [6] Gutierrez, C., Hurtado, C., and Vaisman, A. Temporal RDF. In European Conference on the Semantic Web (ECSW’05) (Best paper award), pages 93–107, 2005, http://www.dcc.uchile.cl/~cgutierr/papers/temporalRDF.pdf [7] Mika, P.: Social Networks and the Semantic Web. Springer (2007)