! ! ! ! Proceedings!of ! DeRiVE!2015! Events!in!the!Semantic!Web ! ! 4th!International!Workshop!on!Detection,!Representation,!and!Exploitation!of! ! Co#located*with*the*12th*Extended*Semantic*Web*Conference*(ESWC*2015),*31*May*–* 4*June,*Portoroz,*Slovenia! Preface This volume contains the papers presented at DeRiVE2015: 4th Workshop on Detection Representation and Exploitation of Events in the Semantic Web held on Sunday May 31, 2015 in Portoroz (Co-located with the 12th Extended Se- mantic Web Conference - ESWC 2015) In recent years, researchers in several communities involved in aspects of in- formation science have begun to realise the potential benefits of assigning an important role to events in the representation and organisation of knowledge and media-benefits which can be compared to those of representing entities such as persons or locations instead of just dealing with more superficial ob- jects such as proper names and geographical coordinates. While a good deal of relevant research for example, on the modeling of events has been done in the semantic web community, much complementary research has been done in other, partially overlapping communities, such as those involved in multimedia processing, information extraction, sensor processing and information retrieval research. However, these areas often deal with events with a di↵erent perspec- tive. The attendance of DeRiVE 2011, DeRiVE 2012 and DeRiVE 2013 proved that there is a great interest from many di↵erent communities in the role of events. The results presented in there also indicated that dealing with events is still an emerging topic. The goal of this workshop is to advance research on the role of events within the information extraction and semantic web communities, both building on existing work and integrating results and methods from other areas, while focusing on issues of special importance for the semantic web. We have defined questions for the two main directions that characterise cur- rent research into events on the semantic web. Orthogonal to that, we have identified a number of application domains in which we will actively seek con- tributions. Question 1: How can events be detected and extracted for the semantic web? – How can events be detected, extracted and/or summarized in particular types of content on the web, such as calendars of public events, social media, semantic wikis, and regular web pages? – What is the quality and veracity of events extracted from noisy data such as microblogging sites? – How can a system recognise a complex event that comprises several sub- events? – How can a system recognise duplicate events? Question 2: How can events be modelled and represented in the semantic web? – How are events currently represented on the Web? In particular, how de- ployed is the schema.org Event class? Should scheduled events versus break- ing events be represented the same way? v – To what extent can the many di↵erent event infoboxes of Wikipedia be rec- onciled? How to deal with the numerous Timeline of xxx topics in knowledge bases? – How can existing event representations developed in other communities be adapted to the needs of the semantic web? To what extent can/should a unified event model be employed for di↵erent types of events? – How do social contexts (Facebook, Twitter, etc.) change the implicit content semantics? Application Domains: Research into detection (question 1) and representa- tion (question 2) of events is being implemented in various application domains. Known application domains that we target are: – Personal events – Cultural and sports events – Making something out of ”raw” events – Historic events and events in news and other media – Scientific observation events – Supply chain events Among the submissions we received, 6 papers were selected for full presentation at the workshop: – Jean-Paul Calbimonte and Karl Aberer - Reactive Processing of RDF Streams of Events – Selver Softic, Laurens De Vocht, Erik Mannens, Martin Ebner and Rik Van de Walle - COLINDA: Modeling, Representing and Using Scientific Events in the Web of Data – Michael Färber and Achim Rettinger - Toward Real Event Detection – Gregory Katsios, Svitlana Vakulenko, Anastasia Krithara and Georgios Paliouras - Towards Open Domain Event Extraction from Twitter: REVEALing Entity Relations – Loris Bozzato, Stefano Borgo, Alessio Palmero Aprosio, Marco Rospocher and Luciano Serafini - A Contextual Framework for Reasoning on Events – Jacobo Rouces, Gerard de Melo and Katja Hose - Representing specialized events with FrameBase We would like to thank the members of the program committee and the additional reviewers for their time and e↵orts. All papers included here have been revised and improved based on your valuable feedback, thus setting the basis for an exciting workshop programme. Finally, we would like to thank our sponsor, the Newsreader (www.newsreader- project.eu) FP7 European Project, for funding the workshop. vi May 31, 2015 Marieke Van Erp Raphaël Troncy Marco Rospocher Willem Robert Van Hage David A. Shamma vii Program Committee Eneko Agirre University of the Basque Country, Spain Stefano Borgo LOA - CNR, Italy Loris Bozzato Fondazione Bruno Kessler Christian Hirsch University of Auckland, New Zealand Jane Hunter University of Queensland, Australia Tomi Kauppinen Aalto University, Finland Azam Khan Autodesk Research, Canada Erik Mannens Ghent University – IBBT, Belgium Ingrid Mason Intersect Australia Ltd Diana Maynard University of Sheffield, UK Adrian Paschke Freie Universiteit Berlin, Germany Giuseppe Rizzo EURECOM, France Ansgar Scherp Kiel University and Leibniz Information Center for Economics, Kiel, Germany Ryan Shaw University of North Carolina at Chapel Hill, USA Thomas Steiner Google Inc, Germany Kerry Taylor CSIRO & Australian National University Denis Teyssou Agence France-Presse viii Reactive Processing of RDF Streams of Events Jean-Paul Calbimonte and Karl Aberer Faculty of Computer Science and Communication Systems EPFL, Switzerland. firstname.lastname@epfl.ch Abstract. Events on the Web are increasingly being produced in the form of data streams, and are present in many different scenarios and applications such as health monitoring, environmental sensing or social networks. The heterogeneity of event streams has raised the challenges of integrating, interpreting and processing them coherently. Semantic tech- nologies have shown to provide both a formal and practical framework to address some of these challenges, producing standards for representa- tion and querying, such as RDF and SPARQL. However, these standards are not suitable for dealing with streams for events, as they do not in- clude the concpets of streaming and continuous processing. The idea of RDF stream processing (RSP) has emerged in recent years to fill this gap, and the research community has produced prototype engines that cover aspects including complex event processing and stream reasoning to varying degrees. However, these existing prototypes often overlook key principles of reactive systems, regarding the event-driven processing, responsiveness, resiliency and scalability. In this paper we present a re- active model for implementing RSP systems, based on the Actor model, which relies on asynchronous message passing of events. Furthermore, we study the responsiveness property of RSP systems, in particular for the delivery of streaming results. 1 Introduction Processing streams of events is challenging task in a large number of systems in the Web. Events can encode different types of information at different levels, e.g. concerts, financial patterns, traffic events, sensor alerts, etc., generating large and dynamic volumes of streaming data. Needless to say, the diversity and the heterogeneity of the information that they produce would make it impossible to interpret and integrate these data, without the appropriate tools. Semantic Web standards such as RDF1 and SPARQL2 provide a way to address these chal- lenges, and guidelines exist to produce and consume what we know as Linked Data. While these principles and standards have already gained a certain degree of maturity and adoption, they are not always suitable for dealing with data streams. The lack of order and time in RDF, and its stored and bounded char- acteristics contrast with the inherently dynamic and potentially infinite nature 1 RDF 1.1 Primerhttp://www.w3.org/TR/rdf11-primer/ 2 SPARQL 1.1 http://www.w3.org/TR/sparql11-query/ of the time-ordered streams. Furthermore, SPARQL is governed by one-time semantics as opposed to the continuous semantics of a stream event processor. It is in this context that it is important to ask How can streaming events can be modeled and queried in the Semantic Web?. Several approaches have been proposed in the last years, advocating for extensions to RDF and SPARQL for querying streams of RDF events. Examples of these RDF stream processing (RSP) engines include C-SPARQL [4], SPARQLstream [6], EP-SPARQL [3] or CQELS [11], among others. Although these extensions target different scenarios and have heterogeneous semantics, they share an important set of common features, e.g. similar RDF stream models, window operators and continuous queries. There is still no stan- dard set of these extensions, but there is an ongoing effort to agree on them in the community 3 . The RSP prototypes that have been presented so far focus al- most exclusively in the query evaluation and the different optimizations that can be applied to their algebra operators. However, the prototypes do not consider a broader scenario where RDF stream systems can reactively produce and con- sume RDF events asynchronously, and deliver continuous results dynamically, depending on the demands of the stream consumer. In this paper we introduce a model that describes RSP producers and con- sumers, and that is adaptable to the specific case of RSP query processing. This model is based on the Actor Model, where lightweight objects interact exclusively by interchanging immutable messages. This model allows composing networks of RSP engines in such a way that they are composable, yet independent, and we show how this can be implemented using existing frameworks in the family of the JVM (Java Virtual Machine) languages. In particular, we focus on specifying how RSP query results can be delivered in scenarios where the stream producer is faster than the consumer, and takes into account its demand to push only the volumes of triples that can be handled by the other end. This dynamic push delivery can be convenient on scenarios where receivers have lower storage and processing capabilities, such as constrained devices and sensors in the IoT. The remainder of the paper is structured as follows: we briefly describe RSP systems and some of their limitations in Section 2, then we present the actor-based model on Section 3. We provide details of the dynamic push delivery on Section 4, and the implementation and experimentation are described in Section 5. We present the related work on Section 6 before concluding in Section 7. 2 RSP Engines, Producers and Consumers In general RSP query engines can be informally described as follows: given as input a set of RDF streams and graphs, and a set of continuous queries, the RSP engine will produce a stream of continuous answers matching the queries (see Figure 1). This high-level model of an RSP engine is simple yet enough to describe most stream query processing scenarios. Nevertheless, this model, 3 W3C RDF Stream Processing Community Group http://www.w3.org/community/rsp and the existing implementations of it, does not detail how stream producers communicate with RSP engines, and how stream consumers receive results from RSP engines. This ambiguity or lack of specification has resulted in different implementations that may result in a number of issues, especially regarding responsiveness, elasticity and resiliency. Fig. 1: Evaluation of continuous queries in RDF Stream Processing. The data stream flows through the engine, while continuous queries are registered and results that match them are streamed out. 2.1 RSP Query Engines To illustrate these issues, let’s consider first how streams are produced in these systems. On the producer side, RDF streams are entities to which the RSP engine subscribes, so that whenever a stream element is produced, the engine is notified (Figure 2). The issues with this model arise from the fact that the RSP engine and the stream producer are tightly coupled. In some cases like C- SPARQL or SPARQLstream , the coupling is at the process level, i.e. both the producer and the engine coexist in the same application process. A first issue regards scalability: it is not possible to dynamically route the stream items from the producer to a different engine or array of engines, since the subscription is hard-wired on the code. Moreover, if the stream producer is faster than the RSP engine, the subscription notifications can flood the latter, potentially overflowing its capacity. A second issue is related to resilience: failures on the stream producer can escalate and directly affect or even halt the RSP engine. Fig. 2: Implementation of an RSP query engine based on tightly coupled publisher and subscribers. Looking at the stream consumer side, the situation is similar. The continuous queries, typically implemented as SPARQL extensions, are registered into the RSP engine, acting as subscription systems. Then, for each of the continuous queries, a consumer can be attached so that it can receive notifications of the continuous answers to the queries (see Figure 2). Again, we face the problem of tightly coupled publisher and subscribers that have fixed routing configura- tion and shared process space, which may hinder the scalability, elasticity and resiliency of the system. Added to that, the delivery mode of the query results is fixed and cannot be tuned to the needs of the consumer. It is possible to see these issues in concrete implementations: for instance in Listing 1 the C-SPARQL code produces an RDF stream. Here, the stream data structure is mixed with the execution of the stream producer (through a dedicated thread). Even more important, the tightly coupled publishing is done when the RDF quad is made available through the put method. The engine (in this case acting as a consumer) is forced to receive quad-by-quad whenever the RDF Stream has new data. public class SensorsStreamer extends RdfStream implements Runnable { public void run() { while(true){ RdfQuadruple q=new RdfQuadruple(subject,predicate,object, System.currentTimeMillis()); this.put(q); } } } Listing 1: Example of generation of an RDF stream in C-SPARQL. A similar scenario can be observed on query results recipient. The continuous listener code for the CQELS engine in Listing 2 represents a query registration (ContinuousSelect) to which one or more listeners can be attached. The subscrip- tion is tightly coupled, and results are pushed mapping by mapping, forcing the consumer to receive these updates and act accordingly. String queryString =" SELECT ?person ?loc " ContinuousSelect selQuery=context.registerSelect(queryString); selQuery.register(new ContinuousListener() { public void update(Mapping mapping){ String result=""; for(Iterator vars=mapping.vars();vars.hasNext();){ result+=" "+context.engine().decode(mapping.get(vars.next())); System.out.println(result); } } }); Listing 2: Example of generation of an RDF stream in CQELS. 2.2 Results Delivery for Constrained Consumers In the previous section we discussed some of the general issues of current RSP engines regarding producing and consuming RDF streams. Now we focus on the particular case where a stream consumer is not necessarily able to cope with the rate of the stream producer, and furthermore, when the stream generation rate fluctuates. As an example, consider the case of an RDF stream of annotated geo- located triples that mobile phones communicate to stationary sensors that detect proximity (e.g. for a social networking application, or for public transportation congestion studies), In this scenario the number of RDF stream producers can greatly vary (from a handful to thousands, depending on how many people are nearby in a certain time of the day), and also the stream rate can fluctuate. In this and other examples the assumption that all consumers can handle any type of stream load does not always hold, and RSP engines need to consider this fact. Some approaches have used load shedding, eviction and discarding methods to alleviate the load, and could be applicable in these scenarios [1, 9]. Complementary to that, it should be possible for stream producers to regulate the rate and the number of items they dispatch to a consumer, depending on the data needs and demand of the latter. 3 An Actor Architecture for RDF Stream Processing A central issue in the previous systems is that several aspects are mixed into a single implementation. An RDF stream in these systems encapsulates not only the stream data structure, but also its execution environment (threading model) and the way that data is delivered (subscriptions). In distributed systems, one of the most successful models for decentralized asynchronous programming is the Actor model [2, 10]. This paradigm introduces actors, lightweight objects that communicate through messages in an asynchronous manner, with no-shared mutable state between them. Each actor is responsible of managing its own state, which is not accessible by other actors. The only way for actors to interact is through asynchronous and immutable messages that they can send to each other either locally or remotely, as seen in figure 3. Fig. 3: Actor model: actors communicate through asynchronous messages that arrive to their mail- boxes. There is no shared mutable state, as each actor handles its own state exclusively. We can characterize an actor A as a tuple: A = (s, b, mb), where s is the actor state, b is the actor behavior and mb is its message box. The state s is accessible and modifiable only by the actor itself, and no other Actor can either read or write on it. The mailbox mb is a queue of messages mi that are received from other actors. Each message mi = (asi , ari , di ) is composed of a data item di , a reference to the sender actor asi , and a reference to the receiver actor ari . The behavior is a function b(mi , s) where mi is a message received through the mailbox. The behavior can change the actor state depending on the message acquired. Given a reference to an actor a, an actor can send a message mi through the send(mi , a) operation. References to actors can be seen as addresses of an actor, which can be used for sending messages. We propose a simple execution model for RDF stream processing that is composed of three generic types of actors: a stream producer, a processor and a consumer, as depicted in Figure 4. A producer actor generates and transmits messages that encapsulate RDF streams to the consumer actors. The processor actor is a special case that implements both a producer (producer of results) and a consumer (consumes the input RDF streams), as well as some processing logic. Following the above definitions the data di of a message mi emitted by a producer actor, or received by a consumer actor, is a set of timestamped triples. This model does not prevent these actors to receive and send also other types of messages. In this model there is a clear separation of the data and the execution: the data stream is modeled as an infinite sequence of immutable event messages, each containing a set of RDF triples. Communication between producers and consumers is governed through asynchronous messaging that gets to the mail- boxes of the actors. In that way, the subscribers are not tightly coupled with the producers of RDF streams, and in fact any consumer can feed from any stream generated by any producer. Moreover, this separation allows easily isolating fail- ures in either end. Failures on consumers do not directly impact other consumers nor the producers, and vice-versa. Fig. 4: RSP actors: RDF stream producers, processors and consumers. All actors send the stream elements as asynchronous messages. An RSP query engine is both a consumer (receives an input RDF stream) and a producer (produces a stream of continuous answers). Event-driven asynchronous communication within RSP actors, as well as avoiding blocking operators, guarantees that the information flow is not stuck unnecessarily. Also, adaptive delivery of query results using dynamic push and pull, can prevent data bottlenecks and overflow, as we will see later. By handling stream delays, data out of order and reacting gracefully to failures, the system can maintain availability, even under stress or non-ideal conditions. Similarly, elasticity can boost the system overall responsiveness by efficiently distribut- ing the load and adapting to the dynamic conditions of the system. The actor model results convenient for RDF stream processing, as it constitutes a basis for constructing what is commonly called a reactive system 4 . Reactive systems are characterized for being event-driven, resilient, elastic, and responsive. 4 The reactive manifesto http://www.reactivemanifesto.org/ 4 Dynamic Push Delivery In RSP engines there are typically two types of delivery modes for the stream of results associated to a continuous query: pull and push. In pull mode, the con- sumer actively requests the producer for more results, i.e. it has control of when the results are retrieved. While this mode has the advantage of guaranteeing that the consumer only receives the amount and rate of data that it needs, it may incur in delays that depend on the polling frequency. In the push mode, on the contrary, the producer pushes the data directly to the consumer, as soon as it is available. While this method can be more responsive and requires no active polling communication, it forces the consumer to deal with bursts of data, and potential message flooding. In some cases, when the consumer is faster than the producer, the push mode may be appropriate, but if the rate of messages exceeds the capacity of the consumer, then it may end up overloaded, causing system disruption, or requiring shedding or other techniques to deal with the problem (see Figure 5a). (a) Push: overload if the producer(b) Dynamic push: demand on the pushes too fast. side of the consumer. Fig. 5: Delivery modes in RSP engines. As an alternative, we propose using a dynamic push approach for delivering stream items to an RDF stream consumer, taking into consideration the capacity and availability of the latter (see Figure 5b ). The dynamic mechanism consists in allowing the consumer to explicitly indicate its demand to the producer. This can be simply done by issuing a message that indicates the capacity (e.g. volume of data) that it can handle. Then, knowing the demand of the consumer, the stream producer can push only the volume of data that is required, thus avoiding any overload on the consumer side. If the demand is lower than the supply, then this behavior results in a normal push scenario. Otherwise, the consumer can ask for more data, i.e. pull, when it is ready to do so. Notice that the consumer can at any point in time notify about its demand. If the consumer is overloaded with processing tasks for a period of time, it can notify a low demand until it is free again, and only then raise it and let the producer know about it. 5 Implementing RSP Dynamic Push In order to validate the proposed model, and more specifically, to verify the feasibility of the dynamic push in a RSP engine, we have implemented this mechanism on top of an open-source RSP query processor. We have used the Akka library5 , which is available for both Java and Scala, to implement our 5 Akka: http://akka.io/ RSP Actors. Akka provides a fully fledged implementation of the actor model, including routing, serialization, state machine support, remoting and failover, among other features. By using the Akka library, we were able to create producer and consumer actors that receive messages, i.e. streams of triples. For example, a Scala snippet of a consumer is detailed in Listing 3, where we declare a consumer that extends the Akka Actor class, and implements a receive method. The receive method is executed when the actor receives a message on its mailbox, i.e. in our case an RDF stream item. class RDFConsumer extends Actor { def receive ={ case d:Data => // process the triples in the data message } } Listing 3: Scala code snippet of an RDF consumer actor. To show that an RSP engine can be adapted to the actor model, we have used CQELS, which is open source and is written in Java, as it has demonstrated to be one of the most competitive prototype implementations, at least in terms of performance [12]. More concretely, we have added the dynamic push delivery of CQELS query results, so that a consumer actor can be fed with the results of a CQELS continuous query. To show the feasibility of our approach and the implementation of the dy- namic push, we used a synthetic stream generator based on the data and vocab- ularies of the SRBench [15] benchmark for RDF stream processing engines. As a sample query, consider the CQELS query in Listing 4 that constructs a stream of triples consisting of an observation event and its observed timestamp, for the last second. PREFIX omOwl: http://knoesis.wright.edu/ssw/ont/sensor-observation.owl#. CONSTRUCT {?observation ?time} WHERE { STREAM [RANGE 1000ms] { ?observation omOwl:timestamp ?time } } Listing 4: Example of generation of CQELS query over the SRBench dataset. In the experiments, we focused on analyzing the processing throughput of the CQELS dynamic push, compared to the normal push operation. We tested using different processing latencies, i.e. considering that the processing on the consumer side can cause a delay of 10, 50, 100 and 500 milliseconds. This sim- ulates a slow stream consumer, and we tested its behavior with different values for the fluctuating demand: e.g. from 5 to 10 thousand triples per execution. The results of these experiments are depicted in Figure 6, where each plot cor- responds to a different delay value, the Y axis is the throughput, and the X axis is the demand of the consumer. As it can be seen, when the demand of the consumer is high, the behavior is similar to the push mode. However if the consumer specifies a high demand Fig. 6: Results of the experimentation: throughput of the results delivery after query processing for bot dynamic and normal push. The delay per processing execution is of 500, 100, 50, 10 milliseconds from left to right, top to bottom. The Y axis is the throughput, and the X axis is the demand. but has a slow processing time, the throughput is slowly degraded. When the processing time is fast (e.g. 10 ms), the push delivery throughput is almost constant, as expected, although it is important to notice that in this mode, if the supply is greater than the demand, the system simply drops and does not process the exceeding items. In that regard, the dynamic push can help alleviate this problem, although it has a minor penalty in terms of throughput. 6 Related Work & Discussion RDF stream processors have emerged in the latest years as a response to the chal- lenge of producing, querying and consuming streams of RDF events. These efforts resulted in a series of implementation and approaches in this area, proposing their own set of stream models and extensions to SPARQL [5, 11, 7, 3, 9]. These and other RSP engines have focused on the execution of SPARQL streaming queries and the possible optimization and techniques that can be applied in that context. However, their models and implementation do not include details about the stream producers and consumers, resulting in prototypes that overlook the issues described in Section 2. For handling continuous queries over streams, several Data Stream Manage- ment Systems (DSMS) have been designed and built in the past years, exploiting the power of continuous query languages and providing pull and push-based data access. Other systems, cataloged as complex event processors (CEP), emphasize on pattern matching in query processing and defining complex events from basic ones through a series of operators [8]. Although none of the commercial CEP solutions provides semantically rich annotation capabilities on top of their query interfaces, systems as the ones dexfibed in [14, 13] have proposed different types of semantic processing models on top of CEPs. More recently, a new sort of stream processing platforms has emerged, spin- ning off the massively parallel distributed Map-Reduce based frameworks. Ex- amples of this include Storm6 or Spark Streaming7 , which represent stream processing as workflows of operators that can be deployed in the cloud, hiding the complexity of parallel and remote communication. The actor based model can be complementary to such platforms (e.g. Spark Streaming allows feeding streams from Akka Actors on its core implementation). 7 Conclusions Event streams are one of the most prevalent and ubiquitous source of Big Data on the web, and it is a key challenge to design and build systems that cope with them in an effective and usable way. In this paper we have seen how RDF Stream Processing engines can be adapted to work in an architecture that responds to the principles of reactive systems. this model is based on the usage of lightweight actors that communicate via asynchronous event messages. We have shown that using this paradigm we can avoid the tight coupled design of current RSP en- gines, while opening the way for building more resilient, responsive and elastic systems. More specifically, we have shown a technique for delivering the contin- uous results of queries in an RSP engine through a dynamic push that takes into consideration the demand of the stream consumer. The resulting prototype implementation, on top of the well known CQELS engine, shows that is feasible to adapt an RSP to include this mode, while keeping a good throughput. When processing streams of data, whether they are under the RDF umbrella or not, it is important to take architectural decisions that guarantee that the system aligns with the characteristics of a reactive system. Otherwise, regardless of how performant a RSP engine is, if it is not able to be responsive, resilient to failures and scalable, it will not be able to match the challenges of streaming applications such as the Internet of Things. we have seen that there are many pitfalls in systems design that prevent most of RSP engines to be reactive, in the sense that they do not always incorporate the traits of resilience, responsiveness, elasticity and message driven nature. We strongly believe that these principles have to be embraced at all levels of RDF stream processing to be successful. As future work, we plan to extend the reactive actor model to all aspects of an RSP engine, including the stream generation, linking with stored datasets and dealing with entailment regimes. We also envision to use this architecture to show that different and heterogeneous RSP engines can be combined together, forming 6 http://storm.apache.org/ 7 https://spark.apache.org/streaming/ a network of producers and consumers that can communicate via messaging in a fully distributed scenario. Acknowledgments Partially supported by the SNSF-funded Osper and Nano- Tera OpenSense2 projects. References 1. Abadi, D.J., Carney, D., Cetintemel, U., Cherniack, M., Convey, C., Lee, S., Stone- braker, M., Tatbul, N., Zdonik, S.: Aurora: a new model and architecture for data stream management. The VLDB Journal 12(2), 120–139 (August 2003) 2. Agha, G.: Actors: A model of concurrent computation in distributed systems. Tech. rep., MIT (1985) 3. Anicic, D., Fodor, P., Rudolph, S., Stojanovic, N.: EP-SPARQL: a unified language for event processing and stream reasoning. In: WWW, pp. 635–644 (2011) 4. Barbieri, D.F., Braga, D., Ceri, S., Della Valle, E., Grossniklaus, M.: C-SPARQL: SPARQL for continuous querying. In: WWW, pp. 1061–1062 (2009) 5. Barbieri, D.F., Braga, D., Ceri, S., Della Valle, E., Grossniklaus, M.: Incremental reasoning on streams and rich background knowledge. In: Proc. 7th Extended Semantic Web Conference, pp. 1–15 (2010) 6. Calbimonte, J.P., Corcho, O., Gray, A.J.G.: Enabling ontology-based access to streaming data sources. In: ISWC, pp. 96–111 (2010) 7. Calbimonte, J.P., Jeung, H., Corcho, O., Aberer, K.: Enabling query technologies for the semantic sensor web. International Journal On Semantic Web and Infor- mation Systems (IJSWIS) 8(1), 43–63 (2012) 8. Cugola, G., Margara, A.: Processing flows of information: From data stream to complex event processing. ACM Computing Surveys 44(3), 15:1–15:62 (2011) 9. Gao, S., Scharrenbach, T., Bernstein, A.: The clock data-aware eviction approach: Towards processing linked data streams with limited resources. In: ESWC, pp. 6–20. Springer (2014) 10. Karmani, R.K., Shali, A., Agha, G.: Actor frameworks for the jvm platform: a com- parative analysis. In: Proceedings of the 7th International Conference on Principles and Practice of Programming in Java. pp. 11–20. ACM (2009) 11. Le-Phuoc, D., Dao-Tran, M., Xavier Parreira, J., Hauswirth, M.: A native and adaptive approach for unified processing of linked streams and linked data. In: ISWC, pp. 370–388 (2011) 12. Le-Phuoc, D., Nguyen-Mau, H.Q., Parreira, J.X., Hauswirth, M.: A middleware framework for scalable management of linked streams. Web Semantics: Science, Services and Agents on the World Wide Web 16, 42–51 (2012) 13. Paschke, A., Vincent, P., Alves, A., Moxey, C.: Tutorial on advanced design pat- terns in event processing. In: DEBS. pp. 324–334. ACM (2012) 14. Taylor, K., Leidinger, L.: Ontology-driven complex event processing in heteroge- neous sensor networks. In: ISWC, pp. 285–299. Springer (2011) 15. Zhang, Y., Duc, P., Corcho, O., Calbimonte, J.P.: SRBench: A Streaming RDF/S- PARQL Benchmark. In: ISWC, pp. 641–657. Springer (2012) COLINDA: Modeling, Representing and Using Scientific Events in the Web of Data Selver Softic1 , Laurens De Vocht2 , Martin Ebner1 , Erik Mannens2 , and Rik Van de Walle2 1 Graz University of Technology, Inffeldgasse 16c, 8010 Graz, Austria {selver.softic,martin.ebner}@tugraz.at 2 Ghent University, iMinds - Multimedialab, Sint-Pietersnieuwstraat 41, 9000 Ghent, Belgium {laurens.devocht,erik.mannens,rik.vandewalle}@ugent.be Abstract. Conference Linked Data (COLINDA)3 , a recent addition to the LOD (Linked Open Data) Cloud4 , exposes information about scientific events (confer- ences and workshops) for the period from 2002 up to 2015. Beside title, descrip- tion and time COLINDA includes venue information of scientific events which is interlinked with Linked Data sets of GeoNames5 , and DBPedia6 . Additionally in- formation about events is enhanced with links to corresponding proceedings from DBLP (L3S)7 and Semantic Web Dog Food 8 repositories. The main sources of COLINDA are WikiCfP9 and Eventseer10 . The research questions addressed by this work in particular are: how scientific events can be extracted and summa- rized from the Web, how to model them in Semantic Web to be useful for mining and adapting of research related social media content in particular micro blogs, and finally how they can be interlinked with other scientific information from the Linked Data Cloud to be used as base for explorative search for researchers . Keywords: Linked Data, Scientific Events, Linked Science, Research 2.0 1 Introduction and Motivation COLINDA11 contains information about scientific events worldwide (including loca- tion and proceedings references), published as Linked Data. The data contained in COLINDA is extracted and accumulated from the data dumps of WikiCfP , which are published yearly and freely available on request for research12 purposes, and from data 3 http://colinda.org 4 http://lod-cloud.net/ 5 http://www.geonames.org/ 6 http://dbpedia.org 7 http://dblp.l3s.de/d2r/ 8 http://data.semanticweb.org/ 9 http://www.wikicfp.com/ 10 http://eventseer.net/ 11 Available at: http://colinda.org/, see also http://datahub.io/dataset/colinda 12 http://www.wikicfp.com/cfp/data.jsp 2 Selver Softic et al. gathered via JSON interface from Eventseer. WikiCfP and Eventseer are two very pop- ular online scientific event archives. WikiCfP contains calls for paper for about approx- imately 30.000 conferences and has approximately 100.000 registered users. Eventseer contains according the latest information13 calls for around 21000 events and serves more then 1 million users. We also track the Twitter14 feeds of both sites integrating on the fly arrival of upcoming scientific events using the Twitter API15 to recieve the data from Twitter profiles of Wiki CfP and Eventseer. Currently COLINDA includes data about more than 15000 conferences. Event instances are enriched with informa- tion from Linked Data proceedings repositories DBLP (L3S)16 and Semantic Web Dog Food17 as well by location information from Geonames and DBPedia. Primary intention of COLINDA was to provide hashtag based identification system for scientific events in Twitter in the manner of the "5-star" quality Open Data18 . Researchers are using very often hashtags, while they are discussing on Twitter. Specially during scientific events, they are using hashtags as abbreviated reference to the event they are attend- ing [6]. E.g. ISWC (International Semantic Web Conference) 2012 is often referred as "iswc12" or "iswc2012". DBLP (L3S) Linked Dataset and Semantic Web Dog Food also use this kind of notation to reference the event of conference proceedings19 ,20 . The overall idea of COLINDA is to serve as mining reference for creation of semantically driven microblog data Mash Ups for Research 2.0 and as interlinking hub for other sci- ence relevant sources from the LOD cloud in order to enhance explorative search for researchers. Efforts made in this field using COLINDA will be introduced in detail in section 3. 2 Extraction, Modeling, Creation and Publishing of Linked Scientific Events COLINDA data covers generally three domains: The first domain originates from Wi- kiCfP and Eventseer and describes the Conference as basic scientific event with a start date, location, description, label and link to the event web page. Second domain is the Location of the event with geographic parameters resolved using the GeoNames and DBPedia data set in interlinking process. Each location contains reference to the city, country and coordinates of the location. Further as extension and third domain we have Proceedings of the conference represented by the links from DBLP (L3S) or Semantic Web Dog Food. 2.1 Linked Scientific Events Creation Process The data creation process comprises the following steps: 13 http://eventseer.net/data/ 14 http://www.twitter.com/ 15 COLINDA 16 http://datahub.io/dataset/l3s-dblp 17 http://datahub.io/dataset/semantic-web-dog-food 18 http://5stardata.info/ 19 e.g. for ’iswc2012’ at DBLP(L3S): http://dblp.l3s.de/d2r/page/publications/conf/ISWC/2012 20 e.g. for ’iswc2012’ at SW Doog Food: http://data.semanticweb.org/conference/iswc/2012/ COLINDA: Modeling, Representing and Using Scientific Events in the Web of Data 3 – Extraction - extraction and pre-processing of data sources (Subsection 2.2) – Modeling of Events using SWRC Ontology - concept coverage (Subsection 2.3) – Triplification - creating RDF data triples (Subsection 2.4) – Interlinking - connection to other Linked Data sets (Subsection 2.5) Fig. 1. Creation process of linked scientific events. 2.2 Data Extraction COLINDA is constructed from variously structured sources. Therefore we defined a minimal set of properties that describe the Conference concept for a single RDF in- stance. During extraction, all properties from sources are being mapped to defined nor- malized set in order to harmonize the federated data. The Location and Proceedings concepts related to conference events as such are considered as optional enrichment which will be treated in the interlinking process. We made this decision having in mind that all conference descriptions do not explicitly include the venue information. The quality of source data depends on the users that provide the information. Thus such data sources implicitly exclude assumption of completeness. Table 1 represents the minimal set of properties a Conference and Location instance should include. The Extraction process includes steps of either pre-processing of XML dumps from WikiCfP or JSON from Tweets and Eventseer into the temporary tables of values formatted as Comma 4 Selver Softic et al. Separated Value (CSV). During the pre-processing cycle data fields like e.g. date or labels are being normalized to achieve uniform representation, and to provide easier processable input for triplification step which converts the extracted values from tem- porary tables into RDF formatted instances of Linked Data. Table 1. Harmonized COLINDA - minimal properties set. Entries denoted with * are optional. Concept Property Conference label title description date* link* location* Proceedings proceedings* Location placename city country longitude latitude 2.3 Modeling Scientific Events in the Web of Data Basic representation of scientific events was well elaborated in previous research work about the SWRC ontology introduced by Sure et al [7]. This practice has been already approved and adapted by the implementation of Linked Data proceedings repositories DBLP (L3S) and Semantic Web Dog Food. We also followed the good practice of re-using existing vocabularies before we define our own. Minimal field set defined in table 1 for RDF instance generation matches well the range of SWRC concepts. There- fore, we have chosen the SWRC Ontology21 and basic RDFS Schema22 as established vocabularies to describe Conference instances. The same approach was applied for Location concept; needed set of geographical features to describe conference venues is well covered by elements from GeoNames23 and Basic Geo (WGS84) Vocabulary24 . Complete model with interlinked properties (proceeding and location) can be seen in figure 2, where a single complete and interlinked instance of a conference (ISWC2012) is depicted. Matchings between features and the vocabulary properties is shown in ta- ble 2. 2.4 Triplification - Creation of RDF Instances of Scientific Events The triplification25 process uses as input temporary data tables in CSV like format gen- erated in extraction and pre-processing step. Input generated in this way represents tab- 21 http://ontoware.org/swrc/ 22 http://www.w3.org/TR/rdf-schema/ 23 http://www.geonames.org/ontology/ 24 http://www.w3.org/2003/01/geo/wgs84_pos# 25 Under ’triplification’ we understand ’triple-wise’ creation of Linked Data instances as RDF graphs COLINDA: Modeling, Representing and Using Scientific Events in the Web of Data 5 Table 2. COLINDA concept to ontology model mapping (note: geonames - GeoNames Ontology, geo - W3C GEO Vocabulary, swrc - SWRC Ontology). Entries denoted with * are optional. Concept/Property RDF Class/Property Conference swrc:Conference label rdfs:label title swrc:eventTitle description swrc:description date* swrc:startDate link* owl:sameAs location reference* swrc:location location reference* dcterms:spatial Proceedings* rdfs:seeAlso Location* geo:SpatialThing placename* geonames:P city* geonames:name country* geonames:countryName longitude* geo:long latitude* geo:lat Fig. 2. Sample interlinked Conference RDF instance of ISWC 2012 generated by Visual RDF. ular set of values compatible with properties from table 1. This input is then parsed line by line and conference instance is generated as single RDF graph using the vocabulary properties defined in table 2. Each conference instance is accessible via REST (Repre- sentational State Transfer) call as described in subsection 2.6. To make them accessible by SPARQL endpoint, background batch process loads the conference instances into the ARC226 RDF triple store running on the server. 2.5 Interlinking to Other Interesting Sources In order to provide 5-star data and led by the design issues described in [1], we used swrc:location as interlinking property in order to interlink the location data with GeoN- ames. The interlinking process uses GeoNames query service to resolve geographical 26 https://github.com/semsol/arc2/ 6 Selver Softic et al. information and retrieve coordinates. Although usually owl:sameAs is used to interlink to other data set we used this property to resolve the connection to the conference web page and since swrc:location seems regarding the GeoNames to be more appropriate choice. How this connection looks like can be seen in the sample depicted in figure 2 as well as online27 ,28 . Further we use dumps of DBPedia and Semantic Web Dog Food to enhance the instances with DBPedia location info using the dcterms:spatial property and for interlinking the proceedings from DBLP (L3S) ans Semantic Web Dog Food we match the conference’s rdfs:label to the corresponding labels in those data sets via SPARQL queries. In matching case a link is established with correlating results using the rdfs:seeAlso property. 2.6 URI Design and Public Accessibility Access to instances of COLINDA is possible via URIs with following pattern: – http://colinda.org/resource/conference/{label}/{year} All responses from COLINDA are formatted as RDF/XML fragment. Other supported formats are: HTML, Text, N3, NTRIPLES format29 . Alternative access offers the SPARQL 30 endpoint. Current endpoint supports up to 250000 result triples per query and deliv- ers results in different formats like: JSON, RDF/XML, XML, TSV etc. How to query the endpoint is shown by simple example in listings 1.1. Results from the query re- turn the COLINDA link, city, country and the geo-location of ISWC 2012 conference. Recently, a dump of COLINDA was made available as Linked Data Fragments31 . COL- Listing 1.1. Sample SPARQL query for retrieval of conference (geo) location. PREFIX swrc: PREFIX gn: PREFIX geo: PREFIX rdfs: SELECT DISTINCT ?x ?city ?country ?long ?lat { ?x rdfs:label "ISWC2012"; swrc:location ?loc. OPTIONAL { ?loc gn:name ?city; gn:countryName ?country; geo:lat ?lat; geo:long ?long. } } INDA RDF data dumps are also accessible via the CKAN Registry32 of LOD Cloud. 27 http://www.colinda.org/resource/conference/ISWC/2012?format=html 28 http://graves.cl/visualRDF/?url=www.colinda.org/resource/conference/ISWC/2012 29 e.g. http://www.colinda.org/resource/conference/ISWC/2012?format=html 30 http://colinda.org/endpoint.php 31 http://data.linkeddatafragments.org/colinda#dataset 32 http://datahub.io/dataset/colinda COLINDA: Modeling, Representing and Using Scientific Events in the Web of Data 7 2.7 Actuality of Data COLINDA is kept up-to-date by a daily cron job which grabs the newest event an- nouncements over the Twitter API for accounts of WikiCfP and Eventseer. The cron job parses, creates, interlinks and synchs new events into the triple store. Each tweet also includes information about the call page link which allows retrieval of the extended information about events via web (WikiCfP) or available JSON (Eventseer) interface during the update task. Additionally to the automated job, also manual updates are ran as soon as the fresh dumps from both sites are available. 3 Applications and Use Cases Both use cases introduced in following subsections address the challenges of Research 2.0. Research 2.0 as adaptation of the Web 2.0 for researchers defines researchers as main consumers of the information. The purpose of these research activities is to of- fer a set of tools and services which researcher can use to discover resources, such as publications or events they might be interested in, as well as to collaborate with each other via the web. These tools and services, according to the specifications of Research 2.0, are considered as Mash Ups, APIs, publishing feeds and specially designed inter- faces based on social profiles [5, 8]. The role of COLINDA is addressed separately in application description. 3.1 Affinity Browser The "Researcher Affinity browser" was developed as a tool to demonstrate semantically driven aggregation of microbolog data for Research 2.0. (use of Web 2.0 tools in sci- entific research). In this context, COLINDA was used as mining source for the faceted detection of similar scientist Twitter profiles based upon conferences they visited as special affinity criteria. This is done by matching the COLINDA tags with the hashtags of the Twitter user. Adequate demo video showing the "Researcher Affinity Browser" in action can be also viewed online33 . The "Researcher Affinity Browser Application" [4] is depicted in figure 3. At the beginning it retrieves a list of relevant users. Those results represent a current snapshot which means that every time users produce new tweets on Twitter, the analysis result evolves with it. The relevance is measured according to the number of common conceptual affinities. Different affinity facets are displayed on the left. Users can explore three types of affinities: conferences, tags and mentions. Activation of a certain affinity filters the list of matching persons. There is the result table that displays detailed information about each person and how many affinities are shared. Further there is a map view and an affinity plot synchronized with the result table. The purpose of the map is to get a better impression of where the affiliations of the found persons would lie. The affinity plot visualizes in a quick overview affinity correspondence between the analyzed profile and other profiles in the system. 8 Selver Softic et al. Fig. 3. "Researcher Affinity Browser Application" snapshot. Fig. 4. Mapping of keywords 3.2 ResXplorer "ResXplorer"34 is an Research 2.0 [8] aggregated interface for search and exploration of the underlying Linked Data Knowledge Base. A demo video explaining the interface shown in figure 5 is available online35 . Data from Linked Data Knowledge Base orig- inates from: DBLP (L3S)36 which is a bibliography of computer science conference proceedings, COLINDA37 which is a main binding hub data set including informa- 33 http://www.youtube.com/watch?v=A25DrP3Mv8w 34 http://www.resxplorer.org 35 https://www.youtube.com/watch?v=tZU97BQxE-0 36 http://dblp.l3s.de/ 37 http://colinda.org COLINDA: Modeling, Representing and Using Scientific Events in the Web of Data 9 tion about scientific events and links to venue and proceedings, common Linked Data Knowledge Base DBPedia38 and Open Linked Data repository with geographical infor- mation GeoNames39 . The role of COLINDA is to act as a hub which connects all data sets in the knowledge base by pointing with links to other data sets. In this context it is used both for keyword matching together with other data sets and for enabling the al- gorithms in back-end to find better connections and paths between the terms visualized in the interface as well as for their expansion. Within ResXplorer interface a real-time keyword disambiguation guides researchers by expressing their needs. User selects the correct meaning from a typeahead drop down menu. Query expansion of terms happens in real-time. Figure 4 shows the typeahead expansion of "ResXplorer" in action. At the same time background modules also fetch neighbor links which match the selected suggestion. As result, selection of various resources is then presented to the researchers within radial interface. In case they have no idea which object or topic to investigate next, they get an overview of possible objects of interest (like points of interest on a street map e.g. figure 5). As shown in figure 5 features like color, shape and size of the items are used to enhance the guidance of the user during the search and exploration process [2]. Different shapes and colors represent different entities like conferences, persons, publications or proceedings. The explored items are marked black, and rela- tions are marked red and clearly highlight the context and history of a search. Each presented resource is somehow related with some of other resources. This is expressed through lines and description of the relation which is a RDF property. The path distance in hops over links is expressed through the orbital layers. As additional feature in ResXplorer is that researchers, when they sign in with their Twitter account, they can either use the mentions and hashtags automatically for search setup instead of typing keywords or to check visually the status of their network. This happens through visualization of recent collaboration and interactions based upon data from their Twitter accounts [3] (link to video on this procedure 40 ). Figure 6 depicts the network of a researcher. The size of the scholar is in the middle between the minimum and maximum size of a node. As much as possible users are placed around the focused researcher. The more publications someone coauthored with the scholar, the bigger the node. Several visual aspects aid the user in focusing and exploring the current state of their network: – Spatial: the number of co-authorships determines distance to center, a higher num- ber results in a closer distance. – Size: a higher frequency of being mentioned together on social media (i.e. Twitter) increases the size. – Color: green, already in their Twitter network; red, not in their Twitter network. – Tooltip: displays facts about the collaborations (e.g. co-authorships and mentions), i.e. the number of mentions for a specific conference (where conference identifica- tion is done over user hashtags which are automatically matched with COLINDA conference labels) and the the number of co-publications. The co-autorships are re- 38 http://dbpedia.org 39 http://geonames.org 40 http://youtu.be/QopnPvWIFzw 10 Selver Softic et al. Fig. 5. ResXplorer - discovering scholar artifacts like conferences (represented as stars), miscel- laneous related resources such as locations or microblog posts (represented as dots in different colors) etc. The distance to central node represents the intensity of relation. Fig. 6. The scholar is centered in the middle and the network is visualized in nodes around the central (blue) node. COLINDA: Modeling, Representing and Using Scientific Events in the Web of Data 11 solved by bibliographic records from DBLP which are matched pair-wise between the users. Users who whether have no co-autorships or common mentions and conference hash- tags with central user profile are not included in visualization. 4 Conclusion and Outlook In this work we described how we extract, model and create scientific events as Linked Data from known conference portals. We showed also how those events can be en- hanced with additional relevant information and applied as as mining source for gen- eration and enhancement of Researcher Affinity Browser as well as main interlinking hub for discovery of research related artifacts for the ResXplorer. This potential has been also recognized by the LinkedUp Challenge at the ISWC 201441 and upcoming Semantic Publishing Challenge 201542 at ESWC 2015 where COLINDA is nominated as reference Linked Data set for scientific events. As one of the future efforts we also want to implement a DBPedia Lookup43 and Spotlight44 like service for detection and identification of scientific events with COLINDA. We also want to link the instances to WorldCat URIs of the published proceeding volumes and to he Crossref DOIs of the published conference articles to make it more useful for the library linked data commu- nity. Finally, to verify the quality of COLINDA we will run in the future an evaluation against Linked Data Integration Benchmark (LODIB)45 . Acknowledgments. The research activities that have been described in this paper were funded by Ghent University, the Social Learning Department at Graz University of Technology, iMinds (an independent research institute founded by the Flemish government to stimulate ICT innovation), the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), the Fund for Scientific Research-Flanders (FWO-Flanders), and the European Union. References 1. Berners-Lee, T.: Linked data (2006), http://www.w3.org/DesignIssues/ LinkedData.html 2. De Vocht, L., Mannens, E., Van de Walle, R., Softic, S., Ebner, M.: A search interface for researchers to explore affinities in a linked data knowledge base. In: Proceedings of the 12th International Semantic Web Conference Posters & Demonstrations Track. pp. 21–24. CEUR- WS (2013) 41 http://data.linkededucation.org/linkedup/catalog/browse/ 42 https://github.com/ceurws/lod/wiki/SemPub2015 43 http://lookup.dbpedia.org 44 http://dbpedia-spotlight.github.io/demo/ 45 http://lodib.wbsg.de/ 12 Selver Softic et al. 3. De Vocht, L., Softic, S., Dimou, A., Verborgh, R., Ebner, M., Mannens, E., Van de Walle, R.: Visualizing collaborations and online social interactions at scientific conferences for schol- arly networking. In: Proceedings of the Workshop on Semantics, Analytics, Visualisation: Enhancing Scholarly Data; 24th International World Wide Web Conference (May 2015) 4. De Vocht, L., Softic, S., Ebner, M., Mühlburger, H.: Semantically driven social data aggre- gation interfaces for research 2.0. In: Proceedings of the 11th International Conference on Knowledge Management and Knowledge Technologies. pp. 43:1–43:9. i-KNOW ’11, ACM, New York, NY, USA (2011), http://doi.acm.org/10.1145/2024288.2024339 5. Parra Chico, G., Duval, E.: Filling the gaps to know More! about a researcher. In: Proceed- ings of the 2nd International Workshop on Research 2.0. At the 5th European Conference on Technology Enhanced Learning: Sustaining TEL,. pp. 18–22. CEUR-WS (Sep 2010) 6. Reinhardt, W., Ebner, M., Beham, G., Costa, C.: How people are using Twitter during confer- ences. In: Hornung-Prähauser, V., Luckmann, M.(Hg.): 5th EduMedia conference, Salzburg. pp. 145–156 (2009) 7. Sure, Y., Bloehdorn, S., Haase, P., Hartmann, J., Oberle, D.: The SWRC ontology - Seman- tic Web for Research Communities. Progress in Artificial Intelligence pp. 218–231 (2005), http://dx.doi.org/10.1007/11595014\_22 8. Ullmann, T.D., Wild, F., Scott, P., Duval, E., Vandeputte, B., Parra Chico, G.A., Reinhardt, W., Heinze, N., Kraker, P., Fessl, A., Lindstaedt, S., Nagel, T., Gillet, D.: Components of a research 2.0 infrastructure. In: Lecture Notes in Computer Science,. pp. 590–595. Springer (2010) Toward Real Event Detection Michael Färber? and Achim Rettinger Karlsruhe Institute of Technology (KIT), Institute AIFB, Karlsruhe, Germany {michael.faerber,rettinger}@kit.edu Abstract. News agencies and other news providers or consumers are confronted with the task of extracting events from news articles. This is done i) either to monitor and, hence, to be informed about events of specific kinds over time and/or ii) to react to events immediately. In the past, several promising approaches to extracting events from text have been proposed. Besides purely statistically-based approaches there are methods to represent events in a semantically-structured form, such as graphs containing actions (predicates), participants (entities), etc. However, it turns out to be very difficult to automatically determine whether an event is real or not. In this paper, we give an overview of approaches which proposed solutions for this research problem. We show that there is no gold standard dataset where real events are annotated in text documents in a fine-grained, semantically-enriched way. We present a methodology of creating such a dataset with the help of crowdsourcing and present preliminary results. Keywords: Event Detection, Information Extraction, Factuality. 1 Motivation News agencies and other digital media publishers publish each day news articles in the magnitude of dozens of thousands. They also process the news for further business tasks such as trend prediction and market change detection. This is still mainly done manually today. Even if knowledge workers at news agencies have access to all this information, it is infeasible for them to read all the news and to determine, whether the articles contain information which is not only interesting for people in their domains, but which contain real events and, hence, have a significant, immediate impact on business such as financial operations (shares) and political happenings. Consider for example the first sentence of a news article: “Apple may acquire Beats Electronics next week.” (1) ? This work was carried out with the support of the German Federal Ministry of Education and Research (BMBF) within the Software Campus project SUITE (Grant 01IS12051). Here, it remains unclear whether Apple is really going to acquire Beats (and does not cancel it in the last minute) or whether this is just a rumor. The sentence “Apple confirmed that it acquired Beats Electronics on Wednesday.” (2) in contrary, reveals that the acquisition already happened (besides the confirmation which is an event per se). This demonstrates the di↵erentiating characteristic between real events and events in general. As humans we can estimate that the first article is not a trigger for immediate shifts in the stock market (besides psychological e↵ects), but maybe the second mentioned article. Machines, in contrast, have their difficulties in distinguishing real events from other events. We envision building a decision support tool for agents like stockbrokers. The aim of the system is to inform the user quickly and automatically when some detected event has really happened and hence might influence the invested assets of the user. The user should also have the possibility to store purely real events in his database. For such purposes, an event extraction system would consist of two steps: i) It extracts events in a structured, semantically enriched representation and ii) determines based on linguistic cues whether the event is real or not. Research on real event detection has been very limited so far. In this paper, we present an approach to define events and real events in a setting as described. Since no suitable gold standard for evaluating a real event detection system exists, we present our setting of creating one using crowdsourcing. Preliminary results regarding this gold standard are presented, as well as challenges which we came across. The remainder of this paper is organized as follows: First we present definitions of event detection in Section 2, before considering definitions of real event detection in Section 3. After discussing our setup of creating a gold standard for real event detection in Section 4, we conclude in Section 5. 2 General Event Definitions 2.1 Event Definitions in Use We can distinguish between the following classes of event representation (see also Fig. 1 for examples): 1. Something happened : In this event representation, events are only roughly covered. There are no types and deeper meanings gathered, only what topic the document/sentence is about. This topic is often characterized by the words occurring in the document (bag-of-words model) and/or by the set of recognized named entities. 2. This happened : For this representation, the event type of the event is detected. The event type can be quite generic such as earthquake. The number of events which can be detected is often very limited. Events may 2 Wednesday Event type Acquisition Beats acquired Electronics Participant Apple Apple Participant Beats Electronics confirmed (a) Event Representation Class 1 (b) Event Representation Class 2 "Wednesday" :Beats Electronics :time :patient :acquire :agent :subevent :Apple Inc. :confirm :agent (c) Event Representation Class 3 Fig. 1: Examples of event representations for the di↵erent event representation classes regarding the example sentence “Apple confirmed that it acquired Beats Electronics on Wednesday.” have attributes or slots which are pre-defined for the single event types. Instead of predefined entity types such as earthquake or accident sometimes only the entity types Per, Loc, Org, and Misc are used. 3. This happened to these objects in this way: If we use this representation format, we have a deeper understanding in the actual event. Events of this class are quite specific and include not only specific actions, but also participants, and maybe time, place, and manner of the action. Often linguistic theories such as Semantic Role Labeling provide the basis for event representations of this class. Related work using event definitions corresponding to the first event representation class do not define events at all [1,2,3,4,5]. This is due to the fact that here it must be only known that something happened (something that is, for instance, di↵erent to what has been seen so far), but not what. Events do not need to be represented on its own; instead, events are indirectly represented by the document in which they are expressed. Documents are compared against each other, either by using the bag-of-words model [2,3,4] or in addition by taking detected named entities (with the classical entity types PER, LOC, ORG, MISC) into account [1,5]. Approaches using the second event definition have in common that coarse-grained events such as accidents or earthquakes are represented. Each event has therefore an event type. Property-value-pairs can be assigned to the events, whereas the assignable properties are pre-defined for all event types. Often templates are used for storing the information about events [6]. In case of event representations of the third kind, structural representations of fine-grained events are extracted from text – here, typically from single sentences or clauses. Research based on this event class usually does not 3 introduce a new definition of events, but instead either uses linguistic definitions of events where events consist of happenings with agents, locations, time, etc. [7,8,9] or abstracts from it to a certain, but limited extend [10]. Bejan [10] characterizes an event as a happening at a given location and in a specific time interval. Each event has semantic relations to agents, to a location, time, etc. as parts of the event. These are the semantic/thematic roles of an event in the linguistic understanding. Events can contain several sub-events. Events of an event scenario (as higher-order structure) are connected by event relations. An example is the cause relation where one event causes another event. Xie et al. [7] propose two approaches which are based on Semantic Frames – constructed by the tool SEMAFOR. Also, Wang et al. [8] use semantic parsing which is based on PropBank in order to represent events. Yeh et al. [9] regard events as similar to frames in FrameNet. Each event encodes knowledge about the participants, where (and when) the event occurred and the events which are caused by this event. A buy event, for instance, is about the object bought, the donor, and the recipient. 2.2 Event Definition In this paper we focus on the detection and semantically-structured representation of real events of the third-mentioned event class, which is the most expensive one. More specifically, an event in our scenario is characterized by – specific participants (agents or objects) – situations (events or states) which are described within the event – taking place at a specific place and/or time – being not a state. States are hereby defined as lasting for an indefinite period of time and which are not really observable. Given the example sentence 2 in Section 1 we can extract two events from it: i) The event that Apple confirmed something (which is an event itself) and ii) the event that Apple acquired Beats Electronics. Fig. 1c shows how these events can be represented as a semantically-structured graph. Hereby, Event ii) can either be part of Event i) (as depicted in the figure) or be stored as a separate graph. Nodes in each event graph can be either predicate nodes (representing actions), entity nodes (representing participants), or literal nodes (representing the time, etc.). Predicate and entity nodes can be linked to entries in knowledge bases such as DBpedia (for entities) and WordNet (for predicates). This enables having unique identifiers for resources and to resolve ambiguities. The edges in these event graphs arise from the semantic roles assigned by a Semantic Role Labeling tool. In the depicted figure, the semantic roles are grounded as RDF predicates. 4 3 Real Event Detection 3.1 Definitions of Real Events We define real event detection as the task of determining whether a given event expressed in text is real. Real events are events according to the definition in Section 2.2 and have already happened or are happening. Thus, the definition of events is extended by this aspect. We can split the task of real event detection therefore into two subtasks: 1. Determining if the situation described in the text is about an event according to our definition. 2. Determining if the event already happened or is currently happening. Regarding the first subtask, we can refer to two areas of linguistic work: i) The distinction of events from states, and ii) the identification of factuality of events. In the following, we amplify these two areas with respect to our goal of real event detection. We hereby use the term situation as a generic concept which encompasses both event and state (cf. [11]). Ad i) The classification of situations can be traced back to Aristotle who distinguished between verbs that have a defined end or result, and others that do not [12]. Vendler [13] distinguished situations into four aspectual classes (also called aktionsarten) and performed empirical experiments. The aspectual classes are based on the temporal structure of events. These classes are namely: state, activity, accomplishment, and achievement. A state is something in which an entity remains for a longer, often unspecified period of time (e.g., “Jack knows the answer”). The three other classes in the aspectual classification cover di↵erent types of events in the narrower sense. An event is characterized as something which happens or occurs in a definite time interval or at a specific point in time. It often comes along with predicates such as “write”, “push”, etc. An event usually causes some state change. To determine which aspectual class a given situation belongs to, we can di↵er between telic, dynamic, and durative Table 1: Vendler’s four-way situations (see Table 1). Telic situations distinction between verbs based always have a culmination point beyond on their aspectual features [13]. which the situation cannot continue. Dynamic situations consist of internal Class Telic Dynamic Durative sub-events which change over time and state - - X are, hence, intrinsically heterogeneous. activity - X X For instance, walking consists of several accomplishment X X X X X alternating subevents. Durative situations achievement - (e.g., eating) last for a specifiable period in time and are not punctual. In our case we want to distinguish events from states. But how can we determine which aspectual class holds for a given situation? For Vendler [13] and others who worked on top of his theories it became apparent that it is not trivial to determine the class automatically. See [13,11,14,15,16] for more details on linguistic rules for that purpose. 5 Moens and Steedman [14] propose another classification of situations. Here, situations are also either states or events. Events are sub-classified by two dimensions: 1. Events are either atomic or durative events. 2. Entities of events are in a consequent state or not. We refer to [14] for more information. Ad ii) Other researchers have focused on determining the factuality of events, i.e. to recognize whether events are presented in the sentences as corresponding to real situations in the world, as situations that have not happened, or as situations of uncertain status. The focus is, hence, the trustfulness of events in text. Factuality can be characterized by two dimensions: Polarity and epistemic modality. Polarity – more concrete: polarity on actuality and not subjective polarity – is a discrete category and can be either positive or negative. Epistemic modality, in contrast, expresses the speaker’s degree of commitment to the truth of a proposition [17]. It ranges from uncertain (also called “possible”) to absolutely certain (also called “necessary”). According to Horn [18], modality is a continuous category. Sauri [19] spans the factuality values space from positive, negative, to unknown for the polarity dimension, and certain, probable, possible, to unknown for the modality dimension. Unknown is true for cases of uncommitment. In this way, a tuple of polarity value and epistemic modality value states the factuality of the event. How is factuality expressed in the text? This is done by lexical markers as well as syntactic markers. Lexical modal markers are modal auxiliaries (e.g., “could”, “may”, “must”), as well as clausal/sentential adverbial modifiers (e.g., “maybe”, “likely”, “possibly”). Examples of lexical polarity markers are adverbs (e.g., “not”, “until”), quantifiers (e.g., “no”, “none”), and pronouns (e.g., “nobody”). Syntactic constructs are necessary to consider since often one clause is embedded in another. Considerable are in this context especially relative clauses and that-clauses as in the example sentences. What are the challenges to determine the factuality? Factuality markers interact with each other. The local modality and polarity operators (e.g., of the current clause) are therefore not enough. Instead, a global consideration is necessary. For instance, in case of that-clauses, the factuality of the inner event is dependent on the factuality of the outer event. Furthermore, what makes the factuality much more complex is the fact that the source of an event is often not only the author. These additional sources are introduced by means of predicates of reporting (such as “say” or “tell”), knowledge and opinion (such as “believe”, “know”), psychological reaction (such as “regret”), etc. Sauri and Pustejovsky [19] calls these predicates due to their role Source Introducing Predicates (SIPs). The difficulty is that the status of the other sources often di↵ers from the author. The reader does not have direct access to the factual assessment of these other sources. In the sentence, “The Guardian wrote that the G-7 leaders pretended everything was OK in Russia’s economy.”, the reader cannot assess directly the “frame of mind” of The Guardian with respect to the factuality of the event of “pretended”. However, the factuality assessment has to be relative to the relevant sources. 6 3.2 Requirements of a Gold Standard for Real Event Detection According to our event definition in Section 2.2 and the additional aspect of factuality addressed in Section 3.1 we can list the following requirements a gold standard dataset for the evaluation of a real event detection system must fulfill: 1. Each mention of an action within an event (e.g., “wrote”) is annotated. 2. There is a distinction between events and states, so that all events in the strict sense are annotated. 3. There is no restriction to specific event types. 4. The factuality of the event is annotated (being positive or negative). 5. All participants and participating objects are annotated. 6. All participants and participating objects are linked to prevalent knowledge bases. 7. Subevents of events are annotated and linked. 8. Mentions of place and time of each event are annotated. This gold standard is also suitable when it comes to extracting real events according to the Event Representation Classes 1 and 2 (see Section 2.1). In these cases, the information about the structural representation of events can be neglected. Additional filtering can achieve that only events of specific types such as accidents are detected. 3.3 Datasets for Real Event Detection In the following, we review existing corpora where event factuality was annotated to some degree. The Multi-Perspective Question Answering (MPQA) corpus [20] provides news articles annotated for opinions and other private states such as beliefs or thoughts. It was designed for subjectivity and sentiment research and does not provide any structured representation of (real) events. At most, it might be applicable as negative corpus in a scenario where situations written in text are approved to be not real events. The Penn Discourse TreeBank (PDTB) [21] is a corpus where discourse connectives are annotated along with their arguments (e.g. $arg1 “– even though” $arg2). On top of the original annotation scheme, an extended annotation scheme was released for marking the attribution of abstract objects such as propositions, facts and eventualities associated with discourse relations and their arguments annotated in the PDTB. The events described in the arguments are, however, not transformed into a structured event representation. TimeBank 1.2 [22] is a corpus which was annotated with TimeML [23]. TimeML is a language for representing temporal and event information. TimeBank is suitable for event factuality learning since it uses grammar markers as well as annotations of predicates. Events are classified into occurrence, state, reporting, immediate-action, immediate-state, aspectual, and perception. TimeBank does not contain a structured event representation where all 7 participating objects are annotated. In addition, the event definition is somehow di↵erent to our proposed definition: A huge fraction (25,7%) of phrases annotated as events are not verbs, but nouns, adjectives, etc. Not all phrases that should be regarded as event predicates are annotated. FactBank [19] is a corpus which was built on top of TimeBank and a subset of the documents in the AQUAINT TimeML Corpus (A-TimeML Corpus). It comes along with annotations of explicitly factual information about events. FactBank has the same obstacles as TimeBank. ACE 2005 [24] from the Automatic Content Extraction (ACE) technology evaluation is a dataset dedicated to the detection of events in text. The task was limited to the detection of specific event types which are: Life, Movement, Transaction, Business, Conflict, Contact, Personnel, and Justice. Each type has one to 13 subtypes so that each event is assigned to one main event type and one subtype of it. The limitation to these event types is the main obstacle why ACE 2005 cannot be used in our setting directly. Four attributes are attached to each annotated event: Modality, Polarity, Genericity, and Tense. In accordance with the event type, specific slots (argument roles called here; such as entities, values, and times) can be assigned. ACE entities are categorized in specific classes (namely, Person, Organization, Location, Geo-political entity, Facility, Vehicle, and Weapon) and their subclasses, but are not linked to any knowledge base. In summary, we can state that none of the mentioned corpora contains semantically-structured representations of events to the extent it is needed to evaluate a real event detection system where events are defined as in Section 2.2. Thus, in the following section we provide experiments on how to build a gold standard which fulfills all our requirements. 4 Experiments for Building a Gold Standard Dataset Very first crowdsourcing experiments revealed that letting users annotate real events as described in Section 3.2 at once is too complex for any crowdsourcing job. Therefore, we arranged subtasks where the following questions are answered separately for each event: 1. Which are the actions/predicates inducing a real event? 2. Which are the participating objects? 3. What is the time and place? 4. Which sub-events are contained? In the following we present our approach regarding the first subtask, namely identifying real events and naming the central predicates of them. We performed two crowdsourcing jobs which di↵er in their methodology.1 Run 1 The crowd was asked to read a given sentence, to look for real events (as defined above), and to enter the action verbs of these events as written in the sentence. 1 The crowdsourcing job descriptions and evaluation data is available online at http: //www.aifb.kit.edu/web/Toward_Real_Event_Detection 8 Run 1: "Find real actions" Run 2: "Find observable and non-observable predicates" 187 sentences, 8 test questions, 12¢ per task, 187 sentences, 9 test questions, 12¢ per task, 5 users per judgment 5 users per judgment Our gold Our gold standard: standard: 205 action verbs 205 verbs inducing 354 observable 185 non-observable real events predicates predicates 133/205 (64.9%) of predicates judged as 152/224 (67.9%) of observable are corecct 285/334 (85.3%) of predicates 224 verbs judged by verbs judged as judged as non-observable crowd as inducing inducing real events are correct real events are correct 205 verbs classified 334 predicates classified by crowd by crowd as observable as non-observable Fig. 2: Results of two crowdsourcing runs where the predicates of real events were annotated in English sentences. In both runs, the confidence value of the answers had to be above 0.5 in order to be considered. Run 2 For this second run, the crowd was asked to read each given sentence, look for all verbs, and categorize them into either observable or not-observable. Observable events/facts were defined as follows:2 An observable fact can be an occurrence (e.g., ”arrive“, ”destroy“), a reporting (e.g., ”report“), or an immediate action (e.g., ”approve“). Observable facts are characterized by the fact that they could be observed or confirmed by third persons directly (e.g., in case of ”say“) or indirectly (e.g., in case of ”confirm“). Non-observable facts describe states which characterize persons or objects, but which are not observable by other persons than the persons involved. Such non-observable facts are states which last for an indefinite/unspecified period of time (e.g., ”be happy“), immediate states (e.g., ”believe“, ”worried“), aspects (e.g., ”start“, ”continue“), or perceptions (e.g., ”feel“). The categorization into observable vs. non-observable facts is here done independently of the fact whether the event has happened (or the state is) for sure or not. The categorization into the past/presence or future is performed in a separate crowdsourcing task. As dataset we used all first sentences of news articles which were published on one day (2014/05/28) by the news agency Bloomberg and where the news articles contained some information about Apple Inc. In total we manually annotated 187 sentences to assess the performance of our crowdsourcing tasks. Crowd sourcing was performed on the platform Crowdflower.3 In Run 1 (Run 2), users had to answer 8 (9) quiz test questions before entering the actual task. In both runs, users got 12 cent per task consisting of 4 questions each. For each question we gained results from 5 users and took the answers where there was an inter-rater agreement of at least 50%. The results of our crowdsourcing annotation experiments are summarized in Fig. 2. It became apparent that completing the crowdsourcing tasks requires high cognitive e↵orts in comparison to other crowdsourcing tasks. A considerable amount of users did not pass the test questions at the beginning. Even if we 2 The definition is based on the TimeBank annotation guidelines. 3 http://crowdflower.com 9 admit only users who worked on our job in the past sufficiently well, creating a big annotated corpus is tricky. As Run 2 shows, already the distinction between observable events, i.e. events showing up in the real world, and not-observable events is hard to perform. Although we put much e↵ort in refining the task descriptions the question arises whether a better approach to annotating the factuality of events is achievable. 5 Conclusions If events are extracted from text in a fine-grained manner, huge amounts of events are gathered, but only a fraction of them represent real events and, hence, are worthwhile to process further on. In this paper, we gave an overview of existing linguistic work about the detection of real events. In order to evaluate a proposed system which extracts semantically-structured, real events from text, we defined requirements and proposed a methodology to create a gold standard dataset. Preliminary experiments with crowdsourcing showed that the annotation of text with factual information is non-trivial. Still, we believe that the creation of such a dataset is necessary for many event detection systems in the future. References 1. Gabrilovich, E., Dumais, S., Horvitz, E.: Newsjunkie: providing personalized newsfeeds via analysis of information novelty. WWW ’04, New York, NY, USA, ACM (2004) 482–490 2. Karkali, M., Rousseau, F., Ntoulas, A., Vazirgiannis, M.: Efficient Online Novelty Detection in News Streams. In Lin, X., et al., eds.: Web Information Systems Engineering – WISE 2013. Springer Berlin Heidelberg (2013) 57–71 3. Zhang, Y., Callan, J., Minka, T.: Novelty and Redundancy Detection in Adaptive Filtering. SIGIR ’02, New York, NY, USA, ACM (2002) 81–88 4. Zhang, K., Zi, J., Wu, L.G.: New Event Detection Based on Indexing-tree and Named Entity. SIGIR ’07, New York, NY, USA, ACM (2007) 215–222 5. Li, X., Croft, W.B.: Novelty Detection Based on Sentence Level Patterns. CIKM ’05, New York, NY, USA, ACM (2005) 744–751 6. Kosmerlj, A., Belyaeva, J., Leban, G., Fortuna, B., Grobelnik, M.: Crowdsourcing Event Extraction. In: NewsKDD – Workshop on Data Science for News Publishing at KDD 2014. (2014) 7. Xie, B., Passonneau, R.J., Wu, L., Creamer, G.G.: Semantic Frames to Predict Stock Price Movement. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. (2013) 873–883 8. Wang, D., Li, T., Zhu, S., Ding, C.: Multi-document Summarization via Sentence-level Semantic Analysis and Symmetric Matrix Factorization. SIGIR ’08, New York, NY, USA, ACM (2008) 307–314 9. Yeh, P.Z., Puri, C.A., Kass, A.: A Knowledge Based Approach for Capturing Rich Semantic Representations from Text for Intelligent Systems. Int. J. Adv. Intell. Paradigms 2(1) (November 2010) 33–48 10. Bejan, C.A.: Learning event structures from text. PhD thesis, The University of Texas at Dallas (2009) 10 11. Bach, E.: The Algebra of Events. Linguistics and Philosophy (1986) 5–16 12. Dowty, D.R.: Word Meaning and Montague Grammar: the semantics of verbs and times in generative semantics and in Montague’s PTQ. Reidel (1979) 13. Vendler, Z.: Linguistics in Philosophy. Cornell University Press (1967) 14. Moens, M., Steedman, M.: Temporal Ontology and Temporal Reference. Computational Linguistics 28(3) (1988) 15–28 15. Pustejovsky, J.: The syntax of event structure. Cognition 41 (1991) 47–81 16. Dorr, B.J., Olsen, M.B.: Deriving Verbal and Compositonal Lexical Aspect for NLP Applications. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL) (1997) 151–158 17. Palmer, F.: Mood an Modality. Cambridge University Press (1986) 18. Horn, L.: A Natural History of Negation. University of Chicago Press (1989) 19. Sauri, R., Pustejovsky, J.: From structure to interpretation: A double-layered annotation for event factuality. Proceedings of the Second Linguistic Annotation Workshop (2008) 20. Wiebe, J., Wilson, T., Cardie, C.: Annotating expressions of opinions and emotions in language. Language Resources and Evaluation 39(2) (2005) 165–210 21. Miltsakaki, E., Prasad, R., Joshi, A., Webber, B.: The Penn Discourse Treebank. Proceedings of LREC 2004 (2004) 22. Pustejovsky, J., et al.: The TIMEBANK Corpus. Proceedings of Corpus Linguistics 2003 (2003) 647–656 23. Pustejovsky, J., Knippen, R., Littman, J., Saurı́, R.: Temporal and event information in natural language text. Language Resources and Evaluation 39(2) (2005) 123–164 24. Walker, C., Strassel, S., Medero, J., Maeda, K.: ACE 2005 Multilingual Training Corpus LDC2006T06 (2006) 11 Towards Open Domain Event Extraction from Twitter: REVEALing Entity Relations G. Katsios1 , S. Vakulenko2 , A. Krithara1 , G. Paliouras1 1 Institute of Informatics and Telecommunications, NCSR Demokritos, Greece 2 MODUL University Vienna, Austria Abstract. In the past years social media services received content con- tributions from millions of users, making them a fruitful source for data analysis. In this paper we present a novel approach for mining Twitter data in order to extract factual information concerning trending events. Our approach is based on relation extraction between named entities, such as people, organizations and locations. The experiments and the obtained results suggest that relation extraction can help in extract- ing events in social media, when combined with pre and post-processing steps. Keywords: Event extraction, social media analysis, relation extraction, Twitter 1 Introduction Social media attracts millions of users, and has evolved to become a source of various kinds of information. In Twitter for example, more than 255 million ac- tive users publish over 500 million 140-character “tweets” every day3 . Evidently it has become an important communication medium. More and more people use social media to communicate their ideas and thoughts, as well as to spread important news. Given the enormous size of information exchange happening every day, it is a rather challenging task to process these data and filter out the important and relevant information. Twitter data is part of the Big Data paradigm and is characterized by high Velocity, Veracity and Volume (“the 3 Vs”) [12]. The topics on Twitter span across multiple domains from private issues to important public events in the society. Therefore, filtering out the important or relevant to the user information poses the first challenge for automated processing of tweets. Twitter provides user-generated content in real time. The data is stored in a form of short text messages called tweets. Each tweet has a body that contains text of the message itself, but also a variety of metadata associated with it, e.g. date of creation, author, user mentions, location etc. However, what makes Twitter texts unique is its word count limitation which causes extensive usage 3 https://about.twitter.com/company of acronyms and other abbreviations. Moreover, users often use colloquial words and phrases in tweets, which require context for interpretation. The goal of this research is to develop tools that extract and efficiently sum- marize trending events, the so-called “breaking news“, mined from social media, e.g. Twitter. This task is especially relevant for the professional journalists help- ing them to utilize social media as an information source helping to cope with the information overload. This research was conducted in the context of two European 7th Frame- work projects, REVEAL and DecarboNet. The projects aim at developing new tools and approaches to automatically process digital media content, extracting important information and summarizing it. This paper is reporting on the results of the initial round of experiments, where we combined the current state-of-the-art methods and tools available, and further evaluated them for the task of event extraction from social media. We also enhanced the pipeline with pre- and post-processing procedures in order to adopt it to the specific requirements stemming from the nature of social media data, e.g. spam detection, mention disambiguation and relation selection. These initial investigation and prototyping results aim to reveal the pitfalls and short- comings of the current state-of-the-art approaches and suggest directions for the future work. The definition of an event itself might appear rather blurry and controversial from the first sight. We adopt the wide definition of an ’event’, which goes beyond scheduled events, like a music concert, conference or a football match. In general, we consider any action, which can be observed in the physical world, to constitute an event [21]. Events are often communicated through social media, e.g. ”Chelsea won to- day”, ”We are going to a bar”. Due to the abundance of such event reports on social media we define the notion of an ’important event’, i.e. an event, infor- mation about which is of a potential value to a user of the system. For example, information about an international political summit involving famous politicians may be considered as important for the journalist, while the content of a lunch meal of an average twitter user is likely to be of no particular value. In this work we focus on extracting the factual information about an event, e.g. its location, time and participants. It is important to separate the factual information from the content that expresses an opinion or an emotion related to the event, such as feelings and thoughts of an individual or a group. This can be a rather tricky task, because sentences that are lexically very similar can convey semantically opposite facts. For example: ”Chelsea won today” versus ”I wish Chelsea won today” versus ”I wish Chelsea wins today”. In order to extract event-related information from tweets we adopt and en- hance existing state-of-the-art approaches to automated information extraction, taking into account the unique properties of social media data. We implement and apply the proposed approach to several datasets, evaluate and discuss the results, outlining further directions for the future work. 2 Related Work Existing algorithms for news monitoring typically detect events by grouping together words with similar burst patterns (i.e. words or phrases showing burst in appearance count [24]). They rely on clustering or topic modeling techniques [3, 10, 13]. The draw-back of these approaches is that the resulting bag-of-words representation of the clusters/topics is often not descriptive enough. More sophisticated and precise approach is information extraction on the level of events. Event extraction involves parsing of natural language text with the aim of extracting event-related information. The usual suspects for the event facets are the named entities that belong to actor/place/time classes in Simple Event Model (SEM) [21]. Therefore, many approaches to event extraction in- clude entity recognition stage [5, 19]. In our work we also utilize the assumption that many events are centered around named entities as in [19]. Still open re- mains the question of how to connect the event-related entities, e.g. persons, locations, dates. Most of the approaches use NLP-methods involving a set of regular expressions to extract verbs that are assumed to constitute an event and feed it together with the related entities into the event model [1, 8, 18, 19, 22]. On the contrary, in our approach we utilize the state-of-the-art method for relation extraction [6], that has already been successfully applied to news articles. Relation extraction is the task of identifying relations that hold between entities in text data. Up to now relation extraction systems were only evaluated on news collections, but not on social media data. Therefore, the novelty of the proposed approach is testing the suitability of relation extraction methods for the task of event extraction on Twitter. We also make several modifications in order to adapt the relation extraction approach to the specific nature of social media data and further enhance it to extract event-related relations between the frequent named entities from tweets. There have been a number of projects aiming at extracting events specif- ically from tweets [5, 20, 23]. Tweets are specific in nature and require special treatment, di↵erent from the news articles. Therefore, Twitter-oriented systems often include methods to detect spam, reduce noise and eliminate uninformative messages [5, 20]. Domain-specific event extraction, such as [5, 23], allow fine-tuned event de- tection, but require a set of keywords or event types to be manually predefined. In this work we focus on extracting trending events, i.e. events which are most popular among the users and are most frequently discussed. This approach also allows us to be domain-agnostic and catch previously unknown events. In this respect, our approach is most similar to TwiCal [20]. However, in- stead of training classifier for event extraction on in-domain training data we utilize already trained extractor from ClausIE [6]. The goal of TwiCal is con- structing a calendar of upcoming events. Therefore, it extracts only scheduled events accompanied with explicit date mention. We are primarily interested in information concerning recent or current events, where explicit date annotation is often omitted. 3 Our Approach We adopt the state-of-the-art approach to relation extraction [6] and further enhance it for the task of event extraction from tweets. In our approach we consider any action, which can be observed in the physical world, to constitute an event [21]. We assume that events are indicated by nonstative (dynamic) verbs. Dynamic verbs describe an action, such as ’kick’, ’meet’, ’visit’, as opposed to stative verbs, such as ’believe’, ’like’, ’consider’, etc. Relation extraction approach enables us to extract predicates from a sentence (corresponding to the verbs indicating events) together with their subjects and objects. For example, the sentence: ”The match starts on Sunday” will result in the following relation: The match (Subject) - starts (Predicate) - on Sunday (Object). Objects of the relations often contain event facets that uniquely characterize events in spatial, temporal and social dimensions (e.g. place, date, organizers, participants). Thus, this approach allows for more fine-grained event extraction as opposed to clustering or topic modeling-based approaches which operate with the bag-of-words model, which tend to blend together several lexically similar events. We have extended the initial approach to relation extraction with a few pre- processing steps in order to clean the input data and annotate it with named entities. After the pre-processing we extract relations, link them to named enti- ties and rank according to their frequencies. The resulting pipeline summarizing our approach is presented in Figure 1. In the rest of this section, the di↵erent modules of our approach are described in details. Spam detection Named Entity Recognition Linguistic Relation selection and pre-processing ranking Relation extraction Fig. 1: System’s pipeline 3.1 Spam detection Here we define spam as useless uniformative or malformated messages, which are unlikely to provide us with any meaningful information. Our goal is to pre- process the raw data from Twitter and deliver to the end user only useful and relevant information. Therefore, we attempt to filter out meaningless and mis- leading messages already on the first stage of our pipeline. In the first place, we use a freely distributed black-list of domain names 4 in order to exclude tweets containing links that point to the untrusted web sites. Next, we calculate a “spam score” for each of the remaining tweets and exclude the tweets that receive the score higher than the empirically learned threshold value. The “spam score” is calculated as the number of spam-associated tokens [4, 16] divided by the total number of tokens in the tweet: |U | + |H| + |L| + |S| + |N | spam score = (1) |T | where: – |U |: number of user mentions (e.g. @themichaelowen); – |H|: number of hashtags (e.g. #DavidGill); – |L|: number of web links (e.g. http://t.co/my55ZOoAko); – |S|: number of spam words (from the predefined list5 , e.g. dutyfree, poker, casino); – |N |: number of non-word characters (e.g. %, !); – |T |: total number of tokens in the tweet. The bigger the value of the “spam score”, the more likely that the tweet con- tains spam. We conducted an experiment spanning numerous trials to choose the optimal threshold value for the spam score and arrived at the value of 0.74. Fur- ther one, we identified 3% of the tweets in our datasets as spam and, therefore, excluded them from the next stages in our pipeline. 3.2 Linguistic Pre-processing All the tweets that passed through the Spam Detection module, are further considered in the Linguistic Pre-processing module. The pre-processing steps include tokenization, user mentions resolution, further text cleaning and sentence splitting. Tokenization is used to identify the tokens that will be replaced or removed from the text, such as URLs, user mentions, etc. First, we exploit tweet meta- data to resolve user mentions to their canonical names. In particular, each tweet that contains user mentions, carries a list of the corresponding full user names from the Twitter database. Thus, we substitute the user mentions in the tweet text with the corresponding full names using the tweet metadata. For example, @themichaelowen is resolved to Michael Owen. 4 http://www.squidguard.org/blacklists.html 5 http://notagrouch.com/wp-content/uploads/2009/12/ wordpress-blacklist-words.txt 3.3 Named Entity Recognition In this module, we identify named entities mentioned in the text of the tweet, as well as their types. For example, the tweet containing the following snippet: ”@DavidGill walks out of FIFA meeting in Sao Paulo”, gets annotated with the named entities: David Gill - Person, FIFA - Organization and Sao Paulo - Location. We used Stanford Named Entity Recognizer (Stanford NER) [15] for detect- ing named entities in tweets. According to the benchmark evaluation reported in [7], Stanford NER achieves highest average precision on all three datasets of tweets, when compared with other state-of-the-art Twitter-tailored algorithms. Stanford NER detects the following types of named entities: Location, Per- son, Organization, Date, Money, etc.6 . Due to our pre-processing procedure we also detect the entities “hidden” within the user mentions and hashtags (e.g. @DavidGill ). This would not be feasible, when applying the Stanford NER on the original tweets. 3.4 Relation Extraction The core of our approach is based on extracting relations from the pre-processed tweets. Relation is a triple that consist of subject, predicate and object. Subject and object are entities, predicate is the relation between these entities. For ex- ample, the sentence: ”The match starts on Sunday” will result in the following relation: The match (Subject) - starts (Predicate) - on Sunday (Object). We considered three state-of-the-art systems for the task of relation extrac- tion: ReVerb [9], Ollie [14] and ClausIE [6]. ClausIE was reported to significantly outperform Ollie by the number of propositions extracted [6]. However, it has not been previously applied to social media data. Therefore, we ran our own exper- iments to compare the results returned by ReVerb and ClausIE. Subsequently, we chose ClausIE as the best-suited baseline system. In ClausIE relation triples are extracted from clauses, parts of a sentence that express coherent pieces of information [6]. The clauses are identified based on the results from the dependency parser that helps to reveal the syntactic struc- ture of an input sentence. In particular, ClausIE is using Stanford unlexicalized dependency parser [11]. Additionally, ClausIE has an option to return n-ary predicate by decomposing the object of the relation into several arguments. This option can be useful for extracting complex relations, that consist of several independent but overlapping parts, such as place and time relations. For example, the sentence: ”The match starts on Sunday at Wembley” will result in the following relation: The match (Subject) - starts (Predicate) - ”on Sunday”, ”at Wembley” (Object). We made several modifications to the original implementation of ClausIE in order to adapt it to the task of extracting the relations describing events. Specifically, we enforce omitting the following types of clauses from the relation extraction process: 6 http://nlp.stanford.edu/software/CRF-NER.shtml – conditional clauses (If-clauses), e.g. “If @Chelsea wins I will celebrate till morning!!!!!!!!” – clauses rooted in a stative verb, e.g. ”I believe @Chelsea is the actual winner!” Conditional clauses are used to speculate about what might happen, what could have happened, and what we wish to happen. Stative verbs describe mental state of an agent, but do not signify any action. For example, the following verbs are stative: hate, love, believe, prefer, want, suppose, etc. 3.5 Relation Selection We designed a post-processing step for selecting relations that will appear in the final results. For this we chose the Frequent Pattern Mining approach that helps us to reveal the recurrent information patterns following the assumption that input data from Twitter is often abundant and redundant. Additionally, we employ the following heuristic technique: for the relation to be selected it has to contain popular (frequently occuring) named entities. In this way we get rid of the trivial resuls, e.g. ”I - ate pizza - for breakfast”, but retain the relations such as: ”President Obama - ate pizza - for breakfast”, if they are reported by a considerable number of tweets. Therefore, we combine the results from Relation Extraction (RE) and Named Entity Recognition (NER) modules produced on the previous stages. In particu- lar, we select only those relations that contain named entities in subject and/or object of the relation. The intuition behind this approach enriching relations with NER annotations is that events in real-life are often associated with the corresponding named entities: dates, places and participants. Hints about importance of the relations and named entities are given from their frequencies count. We assume that widely discussed news are more likely to be of importance and interest to the users of our system. Therefore, in order to link NER and RE results we identify frequent named entities and then select frequent relations, in which these entities occur. We use several approaches to select relations between the named entities described below. Firstly, we detect the named entities that occur most frequently in the tweets (⇠ 10 entities for each of the datasets), e.g. Chelsea, Drogba, Ramires. We also identify the most frequently co-occurring pairs of named entities (⇠ 5 pairs per dataset), e.g. Chelsea and Liverpool, Putin and Ukraine. Then, we identify the following relations that hold between named entities: 1. Relations in which the most frequently occurring entities appear in subject or object of the relation; 2. Relations that hold between pairs of the most frequently co-occurring enti- ties; 3. Relations for every combination of entity types pairs from the set: [Person, Organization, Location, Date], e.g. between Person and Organization, Person and Person, Location and Organization, Person and Date etc. Finally, we calculate the support for each of the selected relations, i.e. number of tweets from which the same relation was extracted, and use it for ranking of the relations. The topmost relations are reported in the final results. 4 Experimental Evaluation 4.1 Datasets We conducted experiments using three di↵erent Twitter datasets (see Table 1). All datasets are centered around one or several major events discussed on social media. We have deliberately selected the datasets containing event-related tweets for our evaluation with the goal to uncover the details surrounding these events using our approach. The FACup dataset was created within the Social Sensor project7 and covers the events during the last match of the Football Association Challenge Cup [2]. The SNOW dataset [17] is an attempt to capture the footprint in the social media regarding several important international events: uprising in Ukraine (#ukraine, #euromaidan), protests in Venezuela (#Venezuela), major Bitcoin exchange theft (#bitcoin), etc. The third dataset was collected in June 2014 and con- tains ⇠ 270.000 tweets, that were extracted using the hashtag #WorldCup2014. Dataset # Tweets Hashtags FA Cup ⇠ 20.000 #FACupFinal SNOW ⇠ 1.000.000 #ukraine, #euromaidan, #Venezuela, #bitcoin World Cup ⇠ 270.000 #WorldCup2014 Table 1: Datasets 4.2 Evaluation Method We manually evaluated the results by annotating the relations returned on the last stage of our pipeline (section 3.5). Each of the annotators (3 in total) inde- pendently considered perceived correctness and usefulness (importance) of the relations by looking up the original text of a sample tweet, from which the rela- tion was extracted by the system. The relation was marked as Correct, if the information it provides naturally follows from the original text of the tweet and does not contradict the message conveyed in it. Negation handling is a good example for potential errors in the results returned by the system. If the original tweet reports, that Chelsea did not play better than Liverpool, the relation has to communicate the same fact and not the opposite. For example, Chelsea - play better - than Liverpool relation should be marked as Incorrect in this case. Furthermore, all correct relations were further evaluated with respect to per- ceived importance for the end user of the system. The importance of a relation is harder to evaluate than its correctness, because of the complexity and subjectiv- ity in the notion of importance with respect to an information piece. In general, a relation is considered Important, when it is perceived as being descriptive and 7 http://www.socialsensor.eu/ potentially useful. Meaningless and uninformative relations are marked as Not important, respectively. Collective discussion of the individual annotations resulted in a consensus and a single final evaluation table was constructed. Afterwards, we summarized our evaluation results by counting the number of relations for each of the classes: Correct & Important, Correct & Not important and Incorrect relations (see Table 2). We calculated the ratios and the total number of evaluated relations sepa- rately for each of the datasets. The last row of the evaluation table highlights the average precision values across the three datasets. Correct Dataset Incorrect Not important Important FA Cup 0.17 (8) 0.17 (8) 0.66 (32) SNOW 0.1 (21) 0.14 (32) 0.76 (168) World Cup 0.1 (18) 0.19 (35) 0.71 (134) Average 0.12 (47) 0.17 (75) 0.71 (334) Table 2: Precision of the evaluation results: fraction (total) of relations 4.3 Discussion and Future Directions The average precision of our approach was estimated at 88% taking into account all correctly extracted relations. However, less that 3/4 of the relations returned by the system were considered as potentially valuable for the end users of the system (see Correct & Important in Table 2). The most frequent relations that were selected using our approach from FA Cup dataset are listed in Table 3. These 5 relations provide a short summary of the event by revealing the names of the teams, the place where the game took place, the winner and the final score, as well as the player, who scored. The timestamps of the tweets can disambiguate the mentions ”now”, reveal date of the event and indicate the ”hot spots” on the game timeline, such as the last relation in Table 3. Relations extracted from the SNOW dataset are less homogeneous contain- ing various political statements, business and sport announcements, as well as snapshots of historical events. Sample relations (with their support): Ukraine’s leaders - warn - ”of Crimea separatism threat” (106); Chelsea fans - attending - ”the Galatasaray match”, ”on 26 Feb” (84). World Cup dataset is another noisy collection containing many tweets not related to the football championship. Nevertheless, the three top-most relations reveal the major conflict in the football association: director David Gill - walks out - ”of FIFA meeting in Sao Paulo” (902); director David Gill - says - ”Sepp Blatter should stand down” (901); FA Vice-Chairman David Gill - calls on - ”Sepp Blatter not to stand for re-election as FIFA President” (481). In general, due to our broad definition of ’event’ (as any kind of action re- flected in a physical world) relations can be extracted from virtually any col- lection of tweets. However, in order to achieve comprehensive results the tweets need to be previously clustered according to the common topic, e.g. using a set of hashtags. Subject Predicate Object Count Sample tweet The Chelsea are ”Robbie Di Matteo 129 RT @chelseafc: What celebra- players throwing high in the air” tions! The Chelsea players are throwing Robbie Di Matteo high in the air. And catching ... Chelsea have won ”17 major 58 RT @chelseafc: Chelsea have trophies”, ”now” now won 17 major trophies. We’ve caught Tottenham who are on the same total. Liverpool are out ”for the second 27 RT @chelseafc: Liverpool are out half” for the second half, and Chelsea are on the way. #CFCWembley #FACupFinal (SL) Chelsea beat ”Liverpool 2-1 to 24 RT @premierleague: Chelsea win the FA Cup at beat Liverpool 2-1 to win the Wembley” FA Cup at Wembley, their fourth win in six years in the competition. #cfc #lfc ... Liverpool is ”much”, ”pretty”, 21 RT @espn: Liverpool is pretty ”giving every much giving every Chelsea Chelsea fan a heart fan a heart attack right now: attack right now” http://t.co/MGxAkv94 Table 3: Results from FA Cup dataset We performed only limited experimental evaluation for the proof-of-concept of our approach and can not quantatively compare our results with other ap- proaches to event extraction. Moreover, the relation extraction algorithm is cur- rently computationally rather expensive, which might prevent us from running the system on Twitter stream data in real time. Nevertheless, our initial results provide further motivation and help to outline directions for the future work: 1. Linking relations that convey the same information. Disambiguating and clustering these relations will help to improve quality of the results by in- creasing support of the frequent relations and removing semantic duplicates. This can be achieved by: – grouping the predicates into semantic groups using existing lexical re- sources, such as FrameNet (e.g. verbs related to communication, cogni- tion, perception: say = tell = report, believe = think = consider ); – disambiguating and linking named entities contained in subjects and objects of the relations (e.g. President Obama = Barack Obama, next month = June 2015 ) 2. Linking relations that describe the same event. This can be achieved by building an event knowlege model, e.g. an event ontology, that will incorpo- rate and meaningfully combine event facets extracted from di↵erent sources. 3. Linking events between each other. This task will help to reveal patterns within spatial/temporal/social dimensions by projecting the events on a timeline or a geographic map. This approach may help to learn the common- sense rules useful for reasoning and inference over the event data, such as the ’finish’ event follows the ’start’ event, but also reveal non-trivial patterns and the outliers. 5 Conclusion We presented a novel approach to event extraction from Twitter, which builds upon current state-of-the-art relation extraction techniques. We manually eval- uated the quality of extracted relations in terms of precision on three real-world datasets. Most of the results returned by the system are correct (88%) and con- tain descriptive and potentially useful event-related information (71%). However, recall and computational performance of the system was out of scope of this intial evaluation run. Acknowledgments This work was supported by REVEAL (http://revealproject.eu/) and Decar- boNet (www.decarbonet.eu) projects, which have received funding by the Eu- ropean Unions 7th Framework Program for research, technology development and demonstration under the Grant Agreements No. FP7-610928 and 610829, respectively. References 1. Puneet Agarwal, Rajgopal Vaithiyanathan, Saurabh Sharma, and Gautam Shro↵. Catching the Long-Tail: Extracting Local News Events from Twitter. In ICWSM, 2012. 2. L. M. Aiello, G. Petkos, C. Martin, D. Corney, S. Papadopoulos, R. Skraba, A. Goker, I. Kompatsiaris, and A. Jaimes. Sensing trending topics in twitter. In Multimedia, volume 15, 2013. 3. Hila Becker, Mor Naaman, and Luis Gravano. Beyond Trending Topics: Real-World Event Identification on Twitter. ICWSM, 2011. 4. F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida. Detecting spammers on twitter. In Collaboration, electronic messaging, anti-abuse and spam conference (CEAS), 2010. 5. Smitashree Choudhury and John G. Breslin. Extracting semantic entities and events from sports tweets. In ’Making Sense of Microposts’: Big Things Come in Small Packages, 2011. 6. L. Del Corro and R. Gemulla. Clausie: clause-based open information extraction. In WWW, 2013. 7. Leon Derczynski, Diana Maynard, Giuseppe Rizzo, Marieke van Erp, Genevieve Gorrell, Raphaël Troncy, Johann Petrak, and Kalina Bontcheva. Analysis of named entity recognition and linking for tweets. Information Processing & Management, 51(2), 2015. 8. Peter Exner and Pierre Nugues. Using semantic role labeling to extract events from Wikipedia. In Proceedings of the Workshop on Detection, Representation, and Exploitation of Events in the Semantic Web (DeRiVE), 2011. 9. A. Fader, S. Soderland, and O. Etzioni. Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2011. 10. Yuheng Hu, Ajita John, Dore Duncan Seligmann, and Fei Wang. What Were the Tweets About? Topical Associations between Public Events and Twitter Feeds. In ICWSM, 2012. 11. D. Klein and C. D Manning. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on ACL, 2003. 12. D. Laney. 3D data management: Controlling data volume, velocity, and variety. Technical report, February 2001. 13. Jimmy Lin, Rion Snow, and William Morgan. Smoothing techniques for adaptive online language models: Topic tracking in tweet streams. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011. 14. M., M. Schmitz, R. Bart, S. Soderland, and O. Etzioni. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empiri- cal Methods in Natural Language Processing and Computational Natural Language Learning, 2012. 15. C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the ACL, 2014. 16. M McCord and M Chuah. Spam detection on twitter using traditional classifiers. In Autonomic and Trusted Computing. Springer, 2011. 17. S. Papadopoulos, D. Corney, and L. M. Aiello. Snow 2014 data challenge: Assessing the performance of news topic detection methods in social media. In SNOW- DC@WWW, 2014. 18. Thomas Ploeger, Maxine Kruijt, Lora Aroyo, Frank De Bakker, Iina Hellsten, and Antske Fokkens. Extractivism: Extracting activist events from news articles using existing NLP tools and services. In The 12th International Semantic Web Conference (ISWC), 2013. 19. Ana-Maria Popescu, Marco Pennacchiotti, and Deepa Paranjpe. Extracting events and event descriptions from twitter. In Proceedings of the 20th international con- ference companion on World wide web, 2011. 20. Alan Ritter, Oren Etzioni, Sam Clark, and others. Open domain event extraction from twitter. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, 2012. 21. Willem Robert Van Hage, Vronique Malais, Roxane Segers, Laura Hollink, and Guus Schreiber. Design and use of the Simple Event Model (SEM). Web Semantics: Science, Services and Agents on the World Wide Web, 9(2), 2011. 22. Willem Robert van Hage, Vronique Malais, Marieke Van Erp, and Guus Schreiber. Linked open piracy. In Proceedings of the sixth international conference on Knowl- edge capture, 2011. 23. Guido Van Oorschot, Marieke Van Erp, and Chris Dijkshoorn. Automatic extrac- tion of soccer game events from twitter. In Proc. of the Workshop on Detection, Representation, and Exploitation of Events in the Semantic Web, 2012. 24. Y. Yang, T. Pierce, and J. Carbonell. A study of retrospective and on-line event detection. In Proceedings of the 21st Annual International ACM SIGIR, 1998. A Contextual Framework for Reasoning on Events Loris Bozzato1 , Stefano Borgo2 , Alessio Palmero Aprosio1 , Marco Rospocher1 , and Luciano Serafini1 1 Fondazione Bruno Kessler, Via Sommarive 18, 38123 Trento, Italy 2 Laboratory for Applied Ontology, ISTC CNR, Via alla Cascata 56 C, 38123 Trento, Italy {bozzato,aprosio,rospocher,serafini}@fbk.eu, stefano.borgo@cnr.it Abstract. In this paper we investigate the contextualized representation of events. In particular, we re-interpret the formalization of events and situations adopted in the NewsReader Event and Situation Ontology (ESO) according to the notions of context and module proposed in the Contextualized Knowledge Repositories (CKR) framework. This contextualized formalization sets the basis for exploiting logical reasoning on events and situations, enabling to automatize tasks such as: recognizing incompatible event descriptions or inconsistent situations, inferring missing or implicit events. 1 Introduction With the growth of technologies managing the extraction of events and their participants from texts (e.g. [1,7]), interest has spread in using these low level tools as the base for event processing and reasoning in a higher level abstraction for events. The idea is to be able to discover, starting from low level descriptions of events extracted from text, complex events and their succession (viz. stories) with respect to their participant entities. This is one of the goals of the NewsReader project3 . In this project, Natural Lan- guage Processing (NLP) technologies, such as Named Entity Recognition (NER) and Semantic Role Labelling (SRL), are exploited to process large streams of news in multi- ple languages, to extract events and event participants, locations, dates. At the end of the extraction and processing, events and related information are represented using RDF. In order to define and reason over the features and e↵ects of events, an OWL ontology has been defined for such data, called the Event and Situation Ontology (ESO) [11]. In this paper we propose to reinterpret and formalize the event model defined by the ESO ontology using a context-based framework for representation of Semantic Web data, the Contextualized Knowledge Repositories (CKR) framework [12,2,4,5]: this en- ables to exploit the structure and reasoning possibilities o↵ered by the contextual frame- work in order to perform complex inferences about and inside knowledge associated with events. Intuitively, CKR is a description logics (DL) based framework defined as a two- layered structure: a lower layer contains a set of knowledge bases representing each context, while the upper layer contains context independent knowledge and meta-data 3 www.newsreader-project.eu defining the structure of contexts. The CKR framework has not only been presented as a theoretical framework, but also actual implementations based on its definitions [5,3] have been proposed. In particular, in [5] we presented an implementation for the CKR framework over state-of-the-art tools for storage and inference over RDF graphs: in- tuitively, the CKR architecture can be implemented by representing the global context and the local object contexts as distinct RDF named graphs, while inference inside (and across) named graphs is implemented as SPARQL-based forward rules. From a formal ontology point of view, in the approach we propose in this paper we clearly distinguish ontology from knowledge. The upper layer of our system con- tains the underlying ontology, that is, the description of the organization of the world. This part lists types of entities, relations and constraints that are assumed to exist or to simply be possible. More generally, the upper layer should be thought as including two elements: a general (foundational or domain) ontology describing what can exist, and the organization of a set of knowledge modules characterizing roles and relationships in (typically social) standardized scenarios like economic transactions or soccer games. Regarding the first, we do not make a specific commitment towards one ontology or an- other although we explicitly commit to the existence of physical and social objects, like people and organizations, and temporal happenings like weddings and thunderstorms. These entities come with their usual physical and temporal properties: weight, shape, duration and so on. Regarding the modules, we do explore their role in the system to infer new facts from given knowledge and thus we will present their general setting and how they are used. The lower layer of the system, on the other hand, collects claims about what happens in the world, that is, claims about how things are or change at some time. Throughout the paper these are what we call events, but note that they are not events in the ontological sense, rather they are descriptions of happenings. Each event is associated with three state-like entities, namely happenings charac- terized by some continuously holding property or relation like “begin married”, “being running” or “being employed by” (e.g. see the notion of stative perdurant in DOLCE [9]). These state-like entities are called situations in our system; more specifically the situa- tion holding before the event is called pre-situation, the one holding after post-situation and the one holding during the event itself is the during-situation. Informally, these sit- uations are used to make explicit relevant properties that (a) persist during the whole event, or (b) hold (or don’t hold) before/after the event and their truth values change “because” of the event. We will use situations to formalize (and reason on) the precon- ditions for an event of a certain type to happen, the postconditions (or consequences) of its happening, and what can be taken as stable during the whole event of that type. Although we take the received events at face value, it is possible that the news are imprecise or incomplete, that the text processing used to acquire the news is faulty and misinterprets them, that statements extracted from di↵erent sources about the same happenings contradict each other. For this reason, we take each piece of information ex- tracted from an outside source as a contextual perspective on the world. This means that the content of a news is not equated to absolute knowledge (it is not indisputable). In- deed, the extracted information constitutes a context annotated with its original source (possibly with di↵erent degrees of reliability which can change over time). This allows to integrate pieces of information coming from di↵erent sources into large and artic- ulated event descriptions by integrating di↵erent contexts (in turn, news statements). These extrapolated events reconstruct how an entity like a person or an organization, changes over time by changing its status (size, capacity, death) or its relationships with other entities (property acquisition, employment, marriage). Furthermore, missing in- formation on these complex events can be detected via logical reasoning based on the ontology at the upper layer, leading to infer changes that have not been reported (or detected) in the news. Finally, note that, when a contradiction arises, the system can isolate the conflicting pieces of information and establish which contexts are to be kept apart and need to be verified. In this paper we propose the first sketch and example of use for the contextual- ized ESO model: in the next section we briefly present the ESO model and provide an example of event modelling; in Section 3, we summarize the base definitions for the CKR framework; in the following section we describe the CKR-ESO model, that is our contextual based realization of ESO, we show how it models the previously presented event example and provide some insights in the advantages of such representation; we conclude by outlining some of the current and future directions for our work. 2 Events and Situations One of the objectives in the NewsReader project is the representation of events and of their e↵ects on the entities participating in them. For example, in a “giving” event, we have at least two actors (people, organizations) and an object, e.g., a person A giving something to another person B at some time T . The event also describes a change in these entities: at the time T , person A owns the object, while person B does not (aka, pre-situation); after T , the opposite is true (aka, post-situation). To achieve this objective, the Event and Situation Ontology (ESO) [11] has been developed. It defines two main classes of entities: events and situations. An event de- scribes an happening, typically a change, in the world (for instance, person A giving an object to person B at time T ), has a certain number of participants (here, the two people and the object) and an associated period of time (here, T ). A situation describes a state, i.e., takes a set of statements as describing (part of) the world at some point or interval of time. In the example “person A gives an object to person B at time T ”, we can identify a pre-situation (the state of the world at the initial time point of T ): – person A owns the object, – person B does not own the object and a post-situation (the state of the world at the ending time point of T ): – person A does not own the object – person B owns the object. For instance, in the sentence: “The chairman of India’s Tata Group has confirmed that his company is acquiring the Jaguar and Land Rover businesses from Ford.”, an event is described: the acquisition by Tata Group of Jaguar and Land Rover from Ford. The event is represented in ESO terminology as follows: <:evID, rdf:type, sem:Event> <:evID, rdf:type, eso:Getting> <:evID, rdfs:label, acquire> <:evID, eso:possession-owner 1, dbp:Ford> <:evID, eso:possession-owner 2, dbp:Tata Group> <:evID, eso:possession-theme, :Jag and L. Rover> <:evID, sem:hasTime, #tmxID> where #tmxID is the RDF representation of “August 26th, 2007”. According to ESO, the event eso:Getting has both pre- and post-situation, con- taining eso:hasInPossession and eso:notHasInPossession assertions, respectively. In the pre-situation of our example, Ford has in possession Jaguar and Land Rover, and Tata Group does not have in possession them; in the post-situation, the contrary holds. In the ESO model, such facts specific to a situation are stored in a RDF named graph identified by the URI of the situation. Therefore, the following set of n-tuples is generated:4 <:evID, eso:hasPreSituation, :evID pre> <:evID pre, rdf:type, eso:Situation> <:evID pre, sem:hasTime, #tmxID> <:evID, eso:hasPostSituation, :evID post> <:evID post, rdf:type, eso:Situation> <:evID post, sem:hasTime, #tmxID> In NewsReader, all the events and related information, instantiated according to the ESO metamodel, are stored, together with the original news article from where they were extracted in the KnowledgeStore [6], a scalable, fault-tolerant, and Semantic Web grounded storage system to jointly store, manage, retrieve, and query both struc- tured and unstructured data. 3 Contextualized Knowledge Repositories In the following we provide an informal summary of the definitions for the CKR frame- work: for a formal and detailed description and for complete examples, we refer to [5]. A CKR is a two layered structure: (1) the upper layer consists of a knowledge base G, called global context, containing (a) meta-knowledge, i.e. the structure and properties of contexts, and (b) global (context-independent) object knowledge, i.e., knowledge that applies to every context; (2) the lower layer consists of a set of (local) contexts that contain locally valid facts and can refer to what holds in other contexts. The intuitive structure of a CKR knowledge base is depicted in Figure 1: in the following we detail its formal components and interpretation. Syntax. The meta-knowledge of a CKR is expressed in a DL language containing the elements that define the contextual structure: the meta-vocabulary is a DL signature containing, in particular, the sets of symbols for context names N, module names M and context classes C, including the class of all contexts Ctx. Intuitively, modules represent pieces of knowledge specific to a context or a context class. The role mod defined on N ⇥ M expresses associations between contexts and their modules. The meta-language L of a CKR is a DL language over . The knowledge in contexts of a CKR is expressed via a DL language L⌃ , called object-language, based on an object-vocabulary ⌃. The expressions of the object lan- guage are evaluated locally to each context, i.e., contexts can interpret each symbol 4 Whenever applicable, default named graph is omitted. Metaknowledge Global object knowledge Global context mod_cls1 A ⊑ B, B ⊑ C, ... Class1 R ⊑ S, ... A(a), B(a), ... mod_ctx1 mod_ctx2 mod_ctx3 Rel1 R(a, b), S(a, c) ... Context1 Context2 Context3 Local modules D ⊑ B, D(a)... C ⊑ B, C(a)... E ⊑ B, E(a)... D ⊑ E, D(b)... mod_cls1 mod_ctx1 mod_ctx2 mod_ctx3 Fig. 1. CKR structure independently. To access the interpretation of expressions inside a specific context or context class, we extend L⌃ to Le⌃ with eval expressions of the form eval(X, C), where X is a concept or role expression of L⌃ and C is a concept expression of L (with C v Ctx). Intuitively, eval(X, C) can be read as “the interpretation of X in all the con- texts of type C”. We define a Contextualized Knowledge Repository (CKR) as a structure K = hG, {Km }m2M i where: (i) G is a DL knowledge base over L [ L⌃ ; (ii) every Km is a DL knowledge base over Le⌃ , for each module name m 2 M. We note that the knowledge in a CKR can be expressed by means of any DL language: in our work, we consider SROIQ-RL [5] as language of reference. SROIQ-RL is a restriction of SROIQ syntax corresponding to OWL RL [10]. Semantics. The semantics of CKR basically extends the usual model-based semantics of DL knowledge bases to the two layered structure of the framework. A CKR interpre- tation is a structure I = hM, Ii s.t.: (i) M is a DL interpretation of [ ⌃ (respecting the intuitive interpretation of Ctx as the class of all contexts); (ii) for every context x 2 CtxM , I(x) is a DL interpretation over ⌃ (with same domain and interpretation of individual names of M). The interpretation of ordinary DL expressions on M and I(x) is as usual while eval expressions are interpreted as follows: for every x 2 CtxM , eval(X, C)I(x) represents the union of all elements in X I(e) for all contexts e in CM . A CKR interpretation I is a CKR model of K i↵: (i) for ↵ 2 L⌃ [ L in G, M |= ↵; (ii) for hx, yi 2 modM with y = mM , I(x) |= Km ; (iii) for ↵ 2 G \ L⌃ and x 2 CtxM , I(x) |= ↵. Intuitively, this means that I verifies the contents of global and local modules associated with contexts and global object knowledge has to be propagated to local contexts. Materialization calculus. Reasoning in CKR has been formalized as a materialization calculus [8], a datalog-based calculus for instance checking in SROIQ-RL CKRs. Intuitively, the calculus is based on a translation to datalog of the input CKR. It has three components: (i) the input translations Iglob , Iloc , Irl , where given an axiom ↵ and c 2 N, each I(↵, c) is a set of datalog facts or rules encoding the contents of input global and local DL knowledge bases; (ii) the deduction rules Ploc , Prl , which are sets of datalog rules representing the inference rules for the instance-level reasoning over the translated axioms; and (iii) the output translation O, where given an axiom ↵ and c 2 N, O(↵, c) is a single datalog fact encoding the ABox assertion ↵ that we want to prove to be entailed by the input CKR (in the context c). Intuitively, SROIQ-RL input Irl and deduction Prl rules provide the translation and interpretation of SROIQ-RL axioms from the input CKR. Global input rules in Iglob encode the interpretation of Ctx in the global context. Similarly, local input rules Iloc and deduction rules Ploc provide the translation and rules for the local eval expressions. The rules in O provide the translation of ABox assertions that can be verified to hold in a context c by applying the rules of the final program. The translation of a CKR K to its datalog program PK(K) proceeds in four steps: we first translate G in the global program PG(G) by applying input rules Iglob and Irl to G and adding deduction rules Prl ; then, for every context name c 2 N appearing in PG(G), we compute its knowledge base Kc as the set of modules Km 2 K s.t. mod(c, m) is proved by PG(G); we translate each local program PC(c) by applying input rules Iloc and Irl to Kc and adding deduction rules Ploc and Prl ; the final CKR program PK(K) is then obtained as the union of PG(G) with all local programs PC(c). We say that K entails an axiom ↵ in a context c 2 N if PK(K) |= O(↵, c). We can show (see [5]) that the presented rules and translation process provide a sound and complete calculus for instance checking in SROIQ-RL CKR. CKR implementation on RDF. We recently presented a prototype [5] that implements the forward reasoning procedure over CKR defined by the materialization calculus. The prototype accepts RDF input data expressing OWL-RL axioms and assertions for global and local knowledge modules: these di↵erent pieces of knowledge are represented as distinct named graphs, while we encoded in a OWL vocabulary the CKR contextual primitives (e.g. the class Context of all context individuals, the class Module of all modules and the property hasModule corresponding to the role mod). The prototype is based on an extension of the Sesame RDF Framework5 and structured in a client-server architecture: the main component, called CKR core and residing in the server-side part, o↵ers the ability to compute and materialize the inference closure of the input CKR, add and remove knowledge and execute queries over the complete CKR structure. The distribution of knowledge in di↵erent named graphs asks for a component to compute inference over multiple graphs in a RDF store, since inference mechanisms in current stores usually ignore the graph part. This component has been realized as a general software layer called SPRINGLES (SParql-based Rule Inference over Named Graphs Layer Extending Sesame) [5]. Intuitively, SPRINGLES provides methods to demand a closure materialization on the RDF store data: rules are encoded as (named graphs aware) SPARQL queries and it is possible to customize both the used ruleset and the evaluation strategy. In our case, the ruleset basically encodes the rules of the presented materializa- tion calculus. The rules are evaluated with a strategy that follows the same steps of the translation process defined for the calculus. The plan goes as follows: (i) we com- pute the inference closure on the graph for global context G, by a fixpoint on rules corresponding to Prl ; (ii) we derive associations between contexts and their modules, by adding dependencies for every assertion of the kind hasModule(c, m) in the global closure; (iii) we compute the closure of the contexts, by applying rules encoded from Prl and Ploc and resolving eval expressions by the metaknowledge information in the global closure. 5 http://www.openrdf.org/ 4 Representing events in CKR: CKR-ESO ontology We can now describe how we translated and implemented a first prototype of the ESO model in the form of a contextualized ontology for the CKR, that we call the CKR-ESO ontology. In this model, the event and situation structures are modelled in the metaknowledge. Similarly to the ESO model, each event instance is associated in the metaknowledge with its pre-, during- and post-situations using the object properties hasPreSituation, hasPostSituation and hasDuringSituation, subproperties of hasSituation. Situation el- ements associated with events can be generated automatically by SPRINGLES rules when importing an event. Each event is represented in the metaknowledge as an instance of the class Event: in particular, each event is associated, analogously to the ESO model, with a sub- class of the Event class that determines the type of associated situations: in particular, DynamicEvents (e.g. ChangeOfPossession, Constructing) are typically characterized by their pre- and post-situations, while StaticEvents (e.g. BeingOperational) by their during-situations. This classification is provided by restrictions over the definition of such classes. For example, for the ChangeOfPossession event class, the CKR-ESO ontology states that: ChangeOfPossession v 8hasPreSituation.Pre ChangeOfPossession ChangeOfPossession v 8hasPostSituation.Post ChangeOfPossession Each event individual is associated with a knowledge module that corresponds to the RDF graph of the event in the ESO model. This association is represented in the meta- knowledge by the property hasEventModule. The following chain axiom is defined over this property, asserting that situations related to an event inherit the facts asserted in the event module: (hasSituation) hasEventModule v hasModule. As defined by the ESO model, we expect to find in the event module the instantiation for all the required roles involved in the event. The class Situation is defined as a subclass of the Context class in the CKR vocab- ulary: in other words, in our model we consider situations and their local knowledge as contexts. The particular (pre, post and during) situations associated with event types are modelled by specific context classes. Thus, for example, we have that the pre- and post-situations for events of type ChangeOfPossession are classified as members of the classes Pre ChangeOfPossession and Post ChangeOfPossession. The associ- ation between such type of situations and their local axioms (i.e. what its modelled in the ESO ontology by situation assertions) is performed by linking specific knowl- edge modules to these context classes. For example, in CKR-ESO we declare that ev- ery pre-situation of ChangeOfPossession is associated with the knowledge module pre change-of-possession-m and post-situations to post change-of-possession-m: Pre ChangeOfPossession v 9hasModule.{pre change-of-possession-m} Post ChangeOfPossession v 9hasModule.{post change-of-possession-m} Situation assertions are thus encoded inside these specific modules: the assertions can be basically translated to chain axioms across the roles specified in the event. For ex- ample, assertions for pre-situations of ChangeOfPossession stating that: hasInPossession(possession-owner 1, possession-theme) notHasInPossession(possession-owner 2, possession-theme) Context Event Situation Global context DynamicEvent pre_cop-m Pre_CoP Post_CoP post_cop-m ChangeOf Possession pre-event1 post-event1 hasPostSituation event1 hasPreSituation event1_m Kpre_cop-m (possession-owner_1)– ○ possession-theme ⊑ hasInPossession (possession-owner_2)– ○ possession-theme ⊑ notHasInPossession Local modules Kpost_cop-m (possession-owner_2)– ○ possession-theme ⊑ hasInPossession (possession-owner_1)– ○ possession-theme ⊑ notHasInPossession Kevent1 possession-owner_1(event1,Ford) possession-owner_2(event1,Tata_Group) possession-theme(event1,Jaguar_and_Land_Rover) Fig. 2. Example event in CKR-ESO model. is translated in CKR-ESO to these chain axioms across role properties: (possession-owner 1) possession-theme v hasInPossession (possession-owner 2) possession-theme v notHasInPossession We now can show how to represent our example event from Section 2 using the CKR- ESO model: we depict this modelling in Figure 2. In the global context, we define event1 to be of type ChangeOfPossession and associate it with its situations: ChangeOfPossession(event1) hasPreSituation(event1, pre-event1) hasPostSituation(event1, post-event1) By the above axioms for such event type, we know that the pre- and post-situations of event1 have to be of type Pre ChangeOfPossession and Post ChangeOfPossession. By metalevel reasoning, this implies that: hasModule(pre-event1, pre change-of-possession-m) hasModule(post-event1, post change-of-possession-m) and thus the situation assertions associated with the pre- and post-situations of ChangeOfPossession are imported in the two situations6 . Moreover, the graph associated with the original event is now defined as a module associated with event1 in the metalevel: hasEventModule(event1, event1 m). The event1 m module (i.e. the associated RDF graph) now contains the following facts that are shared with all the situations associated with this event: possession-owner 1(event1, Ford) possession-owner 2(event1, T ata Group) possession-theme(event1, Jaguar and Land Rover) 6 In Figure 2 we abbreviate classes of pre- and post-situations of ChangeOfPossession with Pre CoP and Post CoP and their modules with pre cop-m and post cop-m. Using the situation assertions in the module associated with the pre-situation, the CKR thus derives the following facts in the context of pre-event1: hasInPossession(Ford, Jaguar and Land Rover) notHasInPossession(T ata Group, Jaguar and Land Rover) Similarly, in the context of post-event1 we obtain: notHasInPossession(Ford, Jaguar and Land Rover) hasInPossession(T ata Group, Jaguar and Land Rover) We note that representation of the described event can be completed with its associated during-situation: among the facts that are known to hold during the event, for example, we can assert the existence of the actors and of the possession theme (using the exists property in the ESO). This contextual re-interpretation of the ESO model can bring several advantages from the point of view of reasoning capabilities inside and across events. First of all, every aspect of the reasoning procedure is now strictly ruled by logical reasoning: situation assertions and their association with the type of situation are now directly modelled by the CKR structure and local axioms, without demanding an external reasoner to consider the situation rules and the local reasoning inside situations. Furthermore, the propagation of global object knowledge to local knowledge allows the use of context independent background knowledge in local reasoning. In our example, we can assert in the global knowledge that both actors (Ford and T ata Group) are classified as car companies and their features can be used in local reasoning. More in general, the ad- vantages of an explicit and structured representation of contexts (as the one o↵ered by the CKR) with respect to a modelling based on reification have been shown in [2]. On the other hand, the clear separation of meta and object level reasoning can be exploited to exchange information across the two levels. For example, depending on the type and specific patterns of situation and events, by adding custom SPRINGLES rules it is possible to generate implicit events that have to occur for the completion of the event sequence in a story. In our example, if we have a second event event2 representing another ChangeOfPossession of Jaguar and Land Rover between two companies Company1 and Company2, di↵erent than the two companies from event1, and event2 has a timestamp greater than event1, then we can infer that there have been another two events (possibly being the same one): in one Jaguar and Land Rover has been sold from T ata Group and in the other it has been acquired from Company1. Sim- ilarly, we can recognize cases in which we can assert the equality of certain situations: this can be used to compile sequences of events in a story. Metalevel information for situations and events can be derived from local reason- ing: we might recognize incompatible descriptions of the same event from di↵erent news. For example, let us suppose a di↵erent representation of the scenario shown in the example in Figure 2: assume that event1 is now classified as Buying (subclass of ChangeOfPossession) while another event event2 is extracted as Selling (also sub- class of ChangeOfPossession), but they both represent the same conceptual event (i.e. the acquisition of Jaguar and Land Rover by T ata Group from Ford). Thus, at the level of the metaknowledge, the two events are modelled as: Buying(event1) hasPreSituation(event1, pre-event1) hasPostSituation(event1, post-event1) Selling(event2) hasPreSituation(event2, pre-event2) hasPostSituation(event2, post-event2) Since they represent the same happening, the event modules event1 m and event2 m basically share the same contents: that is, the actors are the same and they take the same role. However, suppose that, due to the extraction from di↵erent news sources, the metamodel property sem:hasTime associated to post-event1 has value “August 26th, 2007” while the value associated to pre-event2 is “August 28th, 2007”. Then, using this metalevel information and the local contents of the event modules, we can easily write a reasoning rule that recognizes that the two events are incompatible and adds the assertion event1 incompatibleWith event2 in the global context. Similarly, we can recognize inconsistent situations by local reasoning: this can be used both to exclude further inferences from inconsistent contexts, by marking as “inconsistent” the situation individual in the metaknowledge, but also to repair (possibly with some ad-hoc rules) the local axioms. We note that, on the other hand, this kind of reasoning requires to define ad-hoc rules to recognize such di↵erent situations. Another interesting possibility is the one of having inter-situation knowledge prop- agation. For example, if two situations or two events are recognized as consequent in a story, unmodified knowledge from the previous situations can be propagated to subse- quent situations (e.g. the marital status of Obama did not change when he was elected US president). This clearly presents problems of non-monotonicity, since one has to consider which knowledge can be seamlessly propagated without incurring in contra- dictory states. In this regard, we recently introduced in CKR a notion of defeasible axioms and their overriding across di↵erent contexts [3]. 5 Conclusions and future works In this paper we introduced the model of the CKR-ESO ontology, a re-interpretation of the Event and Situation Ontology under the contextual view of knowledge o↵ered by the CKR framework. We discussed informally the advantages of such representation and demonstrated its application by means of an example. We are currently completing the translation of the full ESO ontology to its contex- tualized version: we remark that, given the direct translation across the two models, we can easily automatize this transformation. Our goal is to be able to apply some of the proposed complex reasoning services to the events currently represented in the KnowledgeStore of the NewsReader project: to this aim, we plan to integrate the RDF-based CKR implementation with the Knowl- edgeStore and encode such reasoning services with respect to CKR contextual model. Acknowledgments. The research leading to this paper was supported by the European Union’s 7th Framework Programme via the NewsReader Project (ICT-316404). References 1. Björkelund, A., Hafdell, L., Nugues, P.: Multilingual semantic role labelling. In: Proceedings of CoNLL-2009. Boulder, CO, USA (2009) 2. Bozzato, L., Ghidini, C., Serafini, L.: Comparing contextual and flat representations of knowledge: a concrete case about football data. In: K-CAP 2013. pp. 9–16. ACM (2013) 3. Bozzato, L., Eiter, T., Serafini, L.: Contextualized Knowledge Repositories with Justifiable Exceptions. In: DL2014. CEUR-WP, vol. 1193, pp. 112–123. CEUR-WS.org (2014) 4. Bozzato, L., Homola, M., Serafini, L.: Towards More E↵ective Tableaux Reasoning for CKR. In: DL2012. CEUR-WP, vol. 824, pp. 114–124. CEUR-WS.org (2012) 5. Bozzato, L., Serafini, L.: Materialization Calculus for Contexts in the Semantic Web. In: DL2013. CEUR-WP, vol. 1014. CEUR-WS.org (2013) 6. Corcoglioniti, F., Rospocher, M., Cattoni, R., Magnini, B., Serafini, L.: Interlinking unstruc- tured and structured knowledge in an integrated framework. In: Proc. of 7th IEEE Interna- tional Conference on Semantic Computing (ICSC), Irvine, CA, USA (2013), (to appear) 7. Das, D., Schneider, N., Chen, D., Smith, N.: Probabilistic frame-semantic parsing. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. HLT’10 (2010) 8. Krötzsch, M.: Efficient inferencing for OWL EL. In: JELIA 2010. Lecture Notes in Computer Science, vol. 6341, pp. 234–246. Springer (2010) 9. Masolo, C., Borgo, S., Gangemi, A., Guarino, N., Oltramari, A., Schneider, L.: DOLCE: a descriptive ontology for linguistic and cognitive engineering. WonderWeb Project, Deliver- able D17 v2 1 (2003) 10. Motik, B., Fokoue, A., Horrocks, I., Wu, Z., Lutz, C., Grau, B.C.: OWL 2 Web Ontology Lan- guage Profiles. W3C recommendation, W3C (Oct 2009), http://www.w3.org/TR/2009/REC- owl2-profiles-20091027/ 11. Segers, R., Vossen, P., Rospocher, M., Serafini, L., Laparra, E., Rigau, G.: ESO: a Frame based Ontology for Events and Implied Situations. In: Proceedings of the MAPLEX 2015 Workshop (2015) 12. Serafini, L., Homola, M.: Contextualized knowledge repositories for the semantic web. J. of Web Semantics 12 (2012) Representing Specialized Events with FrameBase Jacobo Rouces1 , Gerard de Melo2 , and Katja Hose1 1 Aalborg University, Denmark jrg@es.aau.dk, khose@cs.aau.dk 2 Tsinghua University, China gdm@demelo.org Abstract. Events of various sorts make up an important subset of the entities rel- evant not only in knowledge representation but also in natural language processing and numerous other fields and tasks. How to represent these in a homogeneous yet expressive, extensive, and extensible way remains a challenge. In this paper, we propose an approach based on FrameBase, a broad RDFS-based schema consisting of frames and roles. The concept of a frame, which is a very general one, can be considered as subsuming existing definitions of events. This ensures a broad coverage and a uniform representation of various kinds of events, thus bearing the potential to serve as a unified event model. We show how FrameBase can represent events from several different sources and domains. These include events from a specific taxonomy related to organized crime, events captured using schema.org, and events from DBpedia. 1 Introduction The surge of research on large-scale knowledge bases in recent years has largely been driven by the availability of new sources of information about entities. While structured data about millions of places, people, or companies are very valuable, there have been comparably few new results on capturing events of various sorts. Most existing event- oriented ontologies have introduced only a few abstract classes of events, and typical knowledge bases tend to describe just a small number of specific types of events. Often, however, there is a need to talk about a broad range of very specific sorts of events. For instance, one might want to distinguish battles from both gunfights and from wars, and capture the class-specific details of such events. We adopt a broad notion of events here. This includes the prototypical cases, e.g. local happenings such as concerts, gatherings, or competitions, and world events such as those reported in the news. It also encompasses the more general abstract definition of events, for instance as “happenings in the real world” [15], which would include, e.g., the birth of a person or a commercial transaction between two people. Clearly, such events make up an important aspect of the world that is relevant in knowledge representation, natural language processing, and numerous other fields and tasks. Occasionally, the term eventuality is used to denote a broader notion of events that explicitly includes states, e.g. two people knowing each other. In this paper, we address this challenge of representing many different notions of events under a common schema, from the very prototypical cases to the very abstract, in a way that has both a broad coverage yet supplies sufficient detail to model event-specific properties. For this, we present a new approach for representing event information that is based on FrameBase [12], a broad RDFS-based schema made of frames and their roles. FrameBase provides a predefined vocabulary with event-specific properties for thousands of different kinds of events. For instance, FrameBase’s schema accounts for the fact that a battle takes place in a certain time and place and normally involves two parties. For this, the schema draws on two linguistic resources, FrameNet [2] and WordNet [6]. As these describe important fragments of the English lexicon, their coverage is quite substantial. Additionally, as we illustrate later on, FrameBase can be easily extended. In the following, we prove the suitability of FrameBase for representing different kinds of events by creating rules that integrate instances from different domains: – A taxonomy of event classes relating to organized crime from the EU FP7 project ePOOLICE3 . In the project, the event classes in the taxonomy are used as types of entities that are extracted from documents crawled from the web, as part of a strategic early-warning system. The taxonomy was originally captured using the Conceptual Graphs formalism [17]. We use and integrate the event taxonomy as it is, without ad-hoc modifications to the schema. – The subclasses and properties of the “Event” class in schema.org, which “provides a collection of schemas that webmasters can use to markup HTML pages in ways recognized by major search providers, and that can also be used for structured data interoperability” [1]. – The subclasses and properties of the “Event” class in DBpedia [4], which are extracted from the infoboxes in Wikipedia. – We conclude with a more general overview of how salient aspects of events [15] can be mapped into FrameBase. This paper is structured as follows. After describing previous approaches and research in Section 2, a brief overview of the FrameBase schema is given in Section 3. Section 4 then shows how we can rely on the FrameBase schema to represent events from several different sources and domains. Finally, Section 5 provides concluding remarks and describes avenues for future research and applications of our work. 2 Related Work Considering their importance and unique characteristics, events have been included in numerous upper-level ontologies and vocabularies. In [15], existing event models are reviewed, but these define very broad abstract categorizations or meta-models. Only few example specializations or vocabularies for narrow domains exist, and their overall size is relatively small. For instance, the Simple Event Model (SEM) Ontology [18] introduces the four types Event, Actor, Place, and Time. While it provides a mechanism to create more specific ones by extending these, it does not actually define any specific kinds of events itself. Similarly, the LODE (Linking Open Descriptions of Events) model [16] provides very general concepts, such as the four just mentioned. The event model E [14] proposes a generic structure for the definition of events, but a specific vocabulary is provided 3 https://www.epoolice.eu/ only for the domain of media events with sensor data. The Event Ontology [11] defines a single event class, for which time, place, agents, factors, products, and meronymic relations can be specified, and the domain of focus is music events. Likewise, the Context Ontology (CONON) is limited to the domain of pervasive computing environments [19]. FrameBase’s schema instead aims at a broader coverage of many domains by build- ing on natural language resources. Previous work has made use of natural language processing techniques to extract events from text. For instance, one study [5] relies on semantic role labelling (SRL) in conjunction with VerbNet to collect events from text and convert them to the LODE vocabulary mentioned above. Another system [10] extracts events both from text and from semi-structured data. We believe that such automatic extraction methods would benefit from being able to use a standardized wide-coverage representation schema for their output. 3 The FrameBase Schema The FrameBase schema [12] consists of classes representing frames, and properties representing frame elements. A frame describes any kind of situation, state or action, in which several elements, participants (agents, patients, etc.) or properties are involved. Examples include commercial exchanges, marriages, or the act of stomping. The frame elements refer to the participants or properties that are involved in a particular frame instance. Common general frame elements include those of agent, patient, time, and location, but not all frames involve these. Frame elements are sometimes also referred to as semantic roles, roles, or theta roles, especially when they are very general. The frames and the frame elements in FrameBase are organized in hierarchies of classes (based on subclass relationships) and of properties (based on subproperty rela- tionships), respectively. There are three kinds of frames in FrameBase: LU-microframes, synset-microframes and non-lexical frames. Non-lexical frames are very general and are situated in the upper part of the hierarchy. LU-microframes (lexical unit microframes) descend from non-lexical frames, but are much more specific by being associated with the meaning of particular words (the lexical units). They come from FrameNet [7, 13]. Synset-microframes allow an intermediate level of granularity connecting synonymous LU-microframes, e.g. for marriage and matrimony. These are based on WordNet [6], and thus also have allowed us to extend the coverage of FrameBase beyond that of FrameNet. In the field of linguistics, frames are said to be evoked by words: for example, both the verb to create and the noun creation evoke the Creation frame. FrameBase additionally provides direct binary predicates to directly connect certain values for elements of a given frame. For example, in a creation event, the agent and the place are directly connected via the establishesInPlace relation. This enables more concise queries and representations when only two elements are involved in a particular frame. The frame patterns and the direct binary predicates are logically connected by means of definite clauses that can be used with different kinds of inference systems. For interoperability with existing resources, FrameBase relies on the standard RDF model [9], which has become a common choice for representing knowledge. This is particularly true in the context of the Linked Data [3], a large Web of datasets referring to and reusing each other’s elements. The RDF model uses subject-predicate-object triples to represent statements. Each triple can also be seen as an edge in a directed labelled entity-relationship graph. SPARQL [8] is the standard query language for RDF, which is what we use in order to integrate other event representations into FrameBase. Event frames are specific kinds of frames, subsuming a range of different notions of events, from the very abstract (e.g., “a natural abstraction of happenings in the real world” [15]) to notions with a notably narrower scope, such as that of widely-known events [10]. Frame elements correspond to what are referred to as aspects in the event literature [15]. However, frames can also be more general, and include what the event model E categorizes separately as entities [14]. For example, FrameNet, from which FrameBase is derived, includes a frame People that is evoked by lexical units (LUs) such as the noun man, and with frame elements such as Age and Origin. We believe that the advantage of FrameBase over the existing event models lies on the fact that while extensible as the others, it already provides a broad-coverage vocabulary out of the box in order to bootstrap widespread adoption. Besides, its connection to natural language provides potential advantages, like interfacing with text for question answering or text mining. FrameBase includes, from FrameNet, an Event frame, which inherits from the Change of state scenario frame, and includes a relatively rich hierarchy below for events like creation and destruction events (including more specific ones such as births and deaths), and some others. However, not every event must necessarily fall below this event frame, nor does doing so preclude it from being mapped to other frames that represent other conceptualizations for events, or reflect other perspectives of the frame that stress different aspects than the eventive one. Therefore, the representation of events in FrameBase is not confined to the Event frame and its subframes. We will see examples of this in the next section. 4 Integrating Events In the first subsections of this section, we present manually built rules for integrating events from three different sources into FrameBase. Later, we add further explanations about these rules and discuss the complexity of the integration rules, and the challenges they present, in particular when they are to be established automatically. 4.1 Representing Events about Organized Crime The following list of integration rules shows, for each instance of an event class in the organized crime conceptual graph (in bold), the corresponding representation in RDF that it would have in FrameBase. In particular, the main event instance is represented by the anonymous node _:e. The default prefix indicates elements that already existed in the core FrameBase schema created from FrameNet and WordNet. Event _:e a :frame-Event-event.n Act _:e a :frame-Intentionally_act-act.n Arrest _:e a :frame-Arrest-arrest.n Drug Possession Arrest _:e a :frame-Arrest-arrest.n . _:e :fe-Arrest-Offense _:e2 . _:e2 a :frame-Offenses-possession.n Human Trafficking Arrest _:e a :frame-Arrest-arrest.n . _:e :fe-Arrest-Offense _:e2 . _:e2 a :frame-Commerce_scenario-trafficker.n . _:e2 :fe-Commerce_scenario-Goods :frame-People-human.n Metal Theft Arrest _:e a :frame-Arrest-arrest.n . _:e :fe-Arrest-Offense _:e2 . _:e2 a :frame-Theft-theft.n . _:e2 :fe-Theft-Goods :frame-Substance-metal.n . _:e2 a :frame-Offenses-theft.n Buy _:e a :frame-Commerce_buy-buy.v Crime _:e a :frame-Committing_crime-crime.n Illegal Drug Use _:e a :frame-Ingest_substance-use.v Consume _:e a :frame-Ingestion-consume.v Inhale _:e a :frame-Ingest_substance-sniff.v Inject _:e a :frame-Ingest_substance-inject.v Possession _:e a :frame-Offenses-possession.n Smoke _:e a :frame-Ingest_substance-smoke.v Organised Crime _:e a fbe:frame-Organization-criminal%20organization.n Theft _:e a :frame-Theft-theft.n . _:e a :frame-Offenses-theft.n Metal Theft _:e a :frame-Theft-theft.n . _:e :fe-Theft-Goods :frame-Substance-metal.n . _:e a :frame-Offenses-theft.n Trafficking _:e a :frame-Commerce_scenario-trafficker.n Drug Trafficking _:e a :frame-Commerce_scenario-trafficker.n . _:e :fe-Commerce_scenario-Goods :frame-Intoxicants-drug.n Human Trafficking _:e a :frame-Commerce_scenario-trafficker.n . _:e :fe-Commerce_scenario-Goods :frame-People-human.n Seizure _:e a :frame-Taking-seizure.n Drug Seizure _:e a :frame-Taking-seizure.n . _:e :fe-Taking-Theme :frame-Intoxicants-drug.n Sell _:e a :frame-Commerce_sell-sell.v Transaction _:e a :frame-Commercial_transaction-transaction.n Crime Transaction _:e a :frame-Commercial_transaction-transaction.n . _:e a :frame-Committing_crime-crime.n Drug Trafficking Transaction _:e a :frame-Commercial_transaction-transaction.n . _:e a :frame-Committing_crime-crime.n . _:e :fe-Commercial_transaction-Goods :frame-Intoxicants-drug.n Human Trafficking Transaction _:e a :frame-Commercial_transaction-transaction.n . _:e a :frame-Committing_crime-crime.n . _:e :fe-Commercial_transaction-Goods :frame-People-human.n Metal Theft Transaction _:e a :frame-Commercial_transaction-transaction.n . _:e a :frame-Committing_crime-crime.n . _:e :fe-Commercial_transaction-Goods :frame-Substance-metal.n The hierarchy in the original ontology is not necessarily consistent with the hierarchy in FrameBase. Only in certain cases does a superclass relationship between two elements of the source also exist between the two elements’ respective translations to FrameBase. Therefore, for each translation of an original class of event, the translations of the parents in the original ontology can be added to the set of instances (ABox) in FrameBase, and this will provide additional knowledge that would not always be inferred by the FrameBase schema alone. We minimize the need for declaring new frames and frame elements for specialized domains by making use of the compositionality of most specialized terms, creating complex structures that combine the semantics of simpler, basic elements. For instance, the translation for the event of type “Drug Possession Arrest” declares an event of type arrest, and specifies that it is about drug possession by assigning drug possession (Offenses-possession.n) as the offence. Owing to this flexibility, we merely needed to mint one single new entity that had not existed in the core FrameBase schema (the microframe Organization-criminal%20- organization.n, with the prefix fbe: denoting that this is an extension). This ex- emplifies the potential of FrameBase to represent events from relatively specialized domains, but at the same time the capacity to be extended to fill any possible gaps. For representing timelines, the frame Individual_history-history.n can be used. Each timeline can be represented with one instance of that frame. This instance can be linked with the frame element Individual_history-Domain to the topic, which is preferably an entity (or alternatively, a literal or an anonymous node or dummy entity named with a literal). The instance can also be linked with the frame element Individual_history-Event to each of the elements in the timeline. Additional frame elements are available in FrameBase, originating from FrameNet, for expressing partici- pants, total duration, etc. Then, complex queries such as retrieving all events in a given timeline between two given dates, can be built in SPARQL. Similarly, sub-events can be represented with the property path: ^:fe-Part_whole-Part/:fe-Part_whole-Whole. 4.2 Representing Events from DBpedia.org We now turn to the Event class in DBpedia, and its subclasses, showing how these can be integrated into FrameBase. The integration is implemented using SPARQL CONSTRUCT rules because DBpedia is already in RDF. We only add a couple of subclasses, but most of the properties belong to the parent Event class itself. Top event CONSTRUCT { ?e a :frame-Event-event.n . ?e :fe-Event-Time _:timePeriod . _:timePeriod a fbe:frame-Timespan-period.n ; fbe:fe-Timespan-Start ?o1 ; fbe:fe-Timespan-End ?o2 . _:e2 a :frame-Relative_time-preceding.a ; :fe-Relative_time-Landmark_occasion ?e ; :fe-Relative_time-Focal_occasion ?o3 . _:e3 a :frame-Relative_time-following.a ; :fe-Relative_time-Landmark_occasion ?o3 ; :fe-Relative_time-Focal_occasion ?e . _:e4 a :frame-Relative_time-following.a ; :fe-Relative_time-Landmark_occasion ?e ; :fe-Relative_time-Focal_occasion ?o4 . _:e5 a :frame-Relative_time-preceding.a ; :fe-Relative_time-Landmark_occasion ?o4 ; :fe-Relative_time-Focal_occasion ?e . ?e :fe-Event-Reason ?o5 . ?e a :frame-Social_event-meeting.n ; :fe-Social_event-Attendee ?o8 . } WHERE { ?e a dbpedia-owl:Event . OPTIONAL{?e dbpedia-owl:startDate ?o1} OPTIONAL{?e dbpedia-owl:endDate ?o2} OPTIONAL{?e dbpedia-owl:previousEvent ?o3} OPTIONAL{?e dbpedia-owl:followingEvent|dbpedia-owl:nextEvent ?o4} OPTIONAL{?e dbpedia-owl:causedBy ?o5} OPTIONAL{?e dbpedia-owl:duration ?o6} OPTIONAL{?e dbpedia-owl:numberOfPeopleAttending ?o7} #Omitted OPTIONAL{?e dbpedia-owl:participant ?o8} } For sub-classes of dbpedia-owl:Event CONSTRUCT { ?e a :frame-Social_event-meeting.n . } WHERE {?e a dbpedia-owl:SocietalEvent} For sub-classes of dbpedia-owl:SocietalEvent CONSTRUCT { ?e a :frame-Project-project.n . ?e :fe-Project-Activity dbpedia:Space_exploration . } WHERE {?e a dbpedia-owl:SpaceMission} For sub-classes of dbpedia-owl:SocietalEvent CONSTRUCT { ?e a fbe:frame-Social_event-convention.n . } WHERE {?e a dbpedia-owl:Convention} Out of the 9 properties of the class Event, the only omitted one was numberOfPeopleAttending, because the class Event is too general for it, as it has sub- classes such as NaturalEvent (SolarEclipse) and PersonalEvent (Birth, etc.). The SocietalEvent class appears more appropriate for this. 4.3 Representing Events from schema.org Finally, we present the translation of the Event class in schema.org. Again, SPARQL CONSTRUCT rules are used because schema.org can be expressed using RDFa, and SPARQL offers a standard way of representing knowledge graph transformations. Due to space restrictions, we omit the subclasses here, but these have very few genuine properties, and therefore the specialization is relatively simple. Besides, the taxonomy of schema.org events has some inconsistency issues that makes its use complex: the Event class is defined as capturing events such as concerts, lectures, and festivals, with properties such as “typical age range”, but there are sub-events such as UserInteraction and UserPlusOnes that actually represent a more general kind of events. CONSTRUCT { ?e a :frame-Social_event-meeting.n . ?e :fe-Social_event-Time _:timePeriod . _:timePeriod a fbe:frame-Timespan-period.n ; fbe:fe-Timespan-Start ?Osta ; fbe:fe-Timespan-End ?Oend . ?e :fe-Social_event-Duration ?Odur . ?e :fe-Social_event-Place ?Oloc . ?e :fe-Social_event-Attendee ?Oatt . ?e :fe-Social_event-Host ?Oorg . ?e :fe-Social_event-Occasion ?Osup . ?Osub :fe-Social_event-Occasion ?e . ?Ooff a :frame-Offering-offer.v ; :fe-Offering-Theme ?e . ?e a :frame-Performing_arts-performance.n ; :fe-Performing_arts-Performer ?Oper ; :fe-Performing_arts-Performance ?Owor . _: a :frame-Recording-record.v ; :fe-Recording-Phenomenon ?e ; :fe-Recording-Medium ?Orec . } WHERE { ?e a sch:Event . # Unambiguous translation OPTIONAL{?e sch:startDate ?Osta} OPTIONAL{?e sch:endDate ?Oend} OPTIONAL{?e sch:duration ?Odur} OPTIONAL{?e sch:location ?Oloc} OPTIONAL{?e sch:attendee ?Oatt} OPTIONAL{?e sch:organizer ?Oorg} OPTIONAL{?e sch:superEvent ?Osup} OPTIONAL{?e sch:subEvent ?Osub} OPTIONAL{?e sch:offers ?Ooff} OPTIONAL{?e sch:performer ?Oper} OPTIONAL{?e sch:workPerformed ?Owor} OPTIONAL{?e sch:recordedIn ?Orec} # Ambiguous translation OPTIONAL{?e sch:doorTime ?Odoo} # No translation OPTIONAL{?e sch:eventStatus ?Oeve} OPTIONAL{?e sch:typicalAgeRange ?Otyp} OPTIONAL{?e sch:previousStartDate ?Opre} } The only extension of the FrameBase schema used here was the frame :frame-Timespan-period.n with the start and end frame elements, used to denote periods of time. This, however, is not an ad-hoc extension motivated by a particular need of only one source, but a very general one. Out of the 16 properties of the Event class, 12 were translated without loss of meaning. One was translated with partial loss of meaning (doorTime, translated as a generic start time) and 3 of them were not translated. Whether these can be integrated too, by means of more complex structures, is something we are investigating. 4.4 Mapping Event Aspects to Frame Elements The survey by Scherp and Mezaris [15] proposes a classification of salient aspects of events. We use this classification to show in a more general way how event aspects can relate to frame elements in the FrameNet-based schema of FrameBase. – Time and Space: When applicable, frames include frame elements Time and Place. – Participation: The classification defines this as “participation of objects in event, where objects can be any living as well as non-living things and include peo- ple, buildings, and other even intangible objects like the roles a person plays in a specific situation” [15]. FrameBase provides a large inventory of more spe- cific roles to capture such participants. Often, these correspond to what are some- times called the proto-agent and proto-patient roles, whose realization in Frame- Base depends on the frame. Some examples are :fe-Commerce_buy-Buyer, :fe-Destroying-Destroyer and :fe-Destroying-Undergoer, which are sub- properties of :fe-Getting-Recipient, :fe-Transitive_action-Agent and :fe-Transitive_action-Patient, respectively. – Relations between events. • Mereology: The relation between two events, when one is part of an- other. Some frames will have a frame element that will fill this role, like :fe-Social_event-Occasion in the example of the Event class in schema.org. In other cases, an additional frame instance of type :frame-Part_whole can be used. • Causality: One event is the cause of another. Some frames will have a frame element that will fill this role, like :fe-Event-Reason in the example for the Event class in DBpedia. In other cases, an additional frame instance of type :frame-Causation can be used. • Correlation: When “two (or more) events have a common cause, but this common cause cannot be explained”. If we can assume there is a common cause as in the definition, then the causal relationships can be represented with two instances of :frame-Causation connecting with an anonymous node for the unknown cause. – Documentation: Events can be “documented using some media like photos or videos captured during the event”. This relation is between an event and such documentation. It can be expressed connecting the events by an additional frame of type :frame-Recording-document.v, :frame-Recording-record.v, and :frame-Recording-register.v, or some extension if needed. – Interpretation: This aspect aims at capturing “subjectivity that may exist on the other aspects of events”. This is a very broad category that may include different phenomena. The perspectivization relation in FrameNet [13] connects frames repre- senting objective events with frames describing them from a particular perspective. For instance, :frame-Commerce_Sell and :frame-Commerce_Buy are perspec- tivizations of :frame-Commerce_Scenario. In other cases, an additional frame instance of a pertinent type can be used, for instance :frame-Becoming_aware. 4.5 Complex Transformations Most of the integration rules we have described follow a pattern which involves an event class in the source being translated as a frame class, and each of their outgoing properties being mapped to individual frame elements. However, there are multiple ways in which the rules can differ from this basic pattern. 1. Sometimes, a class integration rule may need to instantiate multiple frames rather than just a single one. We distinguish two main types of this phenomenon. a) The instantiated frame instances may be connected by frame elements. Examples of this include the frame :frame-Timespan-period.n created to represent time periods, and the subframes of Relative_time to express precedence between events (all in the example for dbpedia-owl:Event). The same applies when a frame element is used to specify a frame beyond the lexical unit (see the rule for dbpedia:Space_exploration). b) Several frames can also be evoked separately, without the instances being directly connected by any frame element. When these frames describe dif- ferent perspectives of the same event, there is the possibility that FrameNet links them by means of perspectivization, and therefore FrameBase can in- fer one from another. For example, classes :frame-Commerce_buy-buy.v and :frame-Commerce_sell-sell.v, which are used for classes Buy and Sell in the organized crime taxonomy, are both perspectiviza- tions of :frame-Commerce_goods-transfer. In this case, inference is possible because RDFS subclass and subproperty properties are used in FrameBase to reflect the perspectivization relation between frame classes and frame elements respectively. Another example are :frame-Receive_visitor_scenario and :frame-Visit_host, which are perspectives of :frame-Visitor_and_host. However, in other cases one cannot rely on existing inference. For instance, see how the rule to translate Event from schema.org, besides frames Event-event.n and Timespan-period.n, also instantiates Performing_arts-performance.n, Recording-record.v and Offering-offer.v when certain properties are present. 2. Another possible source of complexity is that frame elements can be inverted. In this case, the integration rules need to invert the order of the arguments, like in the second appearance of :fe-Social_event-Occasion in the integration rule for the class Event in schema.org. 3. Oftentimes, a property (rather than a class) in the source can be translated as evoking a frame on its own. In this case, the two involved entities become connected to the new frame by means of frame elements. This would be the case for a property like fightAgainst, which might evoke an event or frame of type armed conflict, about which additional information could be added. None of the examples we have covered above are of this kind, because we use sources that explicitly represent, or reify, events. In other sources, however, this phenomenon appears quite frequently. Arbitrary combinations of these phenomena are possible (e.g. the rule integrating the Event class from schema.org). Overall, this makes automatic generation of the integration rules a very hard task, because it generates so many free variables that any attempt to train a system would face extreme sparsity. In some cases, it may thus make sense to sacrifice some recall, developing a system that only covers simpler transformations. 4.6 Representational Flexibility Finally, another potential challenge for data integration is that even when a homogeneous schema such as FrameBase is used, certain kinds of knowledge can still be expressed in multiple possible ways. – One example is that there are several ways of narrowing down the meaning of a frame instance. One is creating a new sub-microframe associated with a new lexical unit. Another one is assigning a value to a frame element (see example for SpaceMission), as mentioned above. This may lead to divergent choices of representation even within the core part of the schema that comes from FrameNet. – Another example of this is when a frame element needs to be reified, i.e. represented as a frame instance, to express something additional about it (as would be the case of the property previousStartDate in schema.org), or when there is no direct frame element available and creating it would lead to a combinatorial explosion in the size of the schema. An example of the latter is the difference between our proposal for using the frame Part_whole for expressing sub-event relations, and how we used the frame element Occasion for the frame Social_event, but this is a particularity of that frame. Again, this may lead to an incoherent representations in the knowledge base. One potential way of addressing this would be extending the reification–dereification mechanism of FrameBase [12]. 5 Conclusion We have shown how events from specialized domains can be represented with the FrameBase schema under a unified model, integrating events in the prototypical sense with more general kinds of events in the sense of abstract happenings or situations. This model has proven to have a high degree of coverage because it needed just few extensions to accommodate the integrated knowledge, and we have illustrated how these extensions can be performed when needed. We have also discussed the various challenges and problems one faces when the integration rules from disparate structured sources of event information are to be built automatically. Extremely specialized domains, such as quantum physics, may produce lower cover- age and need more extensions, although in some cases the creators of FrameNet have also been involved in projects that led to the inclusion of specific scientific and technical domains. The integration rules that we produce can be used in the future as gold standards for training and testing automatic methods for creating rules from other schemas. We are currently performing research on these methods to integrate further sources such as YAGO2s, Freebase, and Wikidata. Please refer to http://framebase.org for information on using FrameBase and the integration rules. Acknowledgments The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agree- ment No. FP7-SEC-2012-312651 (ePOOLICE project). as well as China 973 Program Grants 2011CBA00300, 2011CBA00301, and NSFC Grants 61033001, 61361136003, 61450110088. References 1. Schema.org. http://schema.org. 2. C. F. Baker, C. J. Fillmore, and J. B. Lowe. The Berkeley FrameNet Project. ICCL ’98, pages 86–90, 1998. 3. C. Bizer, T. Heath, and T. Berners-Lee. Linked data–the story so far. IJSWIS, 5(3):1–22, 2009. 4. C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia-A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165, 2009. 5. P. Exner and P. Nugues. Using semantic role labeling to extract events from Wikipedia. DeRiVE ’11, 2011. 6. C. Fellbaum, editor. WordNet: An Electronic Lexical Database. The MIT Press, 1998. 7. C. J. Fillmore, C. R. Johnson, and M. R. Petruck. Background to Framenet. International journal of lexicography, 16(3):235–250, 2003. 8. S. Harris and A. Seaborne. SPARQL 1.1 Query Language. W3C Recommendation, W3C Consortium, Mar. 2013. 9. P. Hayes and P. Patel-Schneider. RDF 1.1 semantics. Technical report, W3C, 2014. http://www.w3.org/TR/rdf11-mt/. 10. E. Kuzey and G. Weikum. Extraction of temporal facts and events from wikipedia. In Proceedings of the 2nd Temporal Web Analytics Workshop, pages 25–32. ACM, 2012. 11. Y. Raimond and S. Abdallah. The event ontology. Technical report, Oct. 2007. http://motools.sf.net/event. 12. J. Rouces, G. De Melo, and K. Hose. FrameBase: Representing N-ary Relations using Semantic Frames. In Proceedings of the 12th Extended Semantic Web Conference, ESWC, 2015. 13. J. Ruppenhofer, M. Ellsworth, M. R. Petruck, C. R. Johnson, and J. Scheffczyk. FrameNet II: Extended Theory and Practice. ICSI, 2006. 14. A. Scherp, S. Agaram, and R. Jain. Event-centric media management. In Electronic Imaging 2008, pages 68200C–68200C. International Society for Optics and Photonics, 2008. 15. A. Scherp and V. Mezaris. Survey on modeling and indexing events in multimedia. Multimedia Tools and Applications, 70(1):7–23, 2014. 16. R. Shaw, R. Troncy, and L. Hardman. LODE: Linking Open Descriptions of Events. In ASWC ’09, Lecture Notes in Computer Science, pages 153–167, 2009. 17. J. F. Sowa. Conceptual graphs. In In Handbook of Knowledge Representation, pages 213–237. Elsevier, 2008. 18. W. R. Van Hage, V. Malaisé, R. Segers, L. Hollink, and G. Schreiber. Design and use of the Simple Event Model (SEM). Web Semantics: Science, Services and Agents on the World Wide Web, 9(2):128–136, 2011. 19. X. H. Wang, D. Q. Zhang, T. Gu, and H. K. Pung. Ontology based context modeling and reasoning using owl. In Pervasive Computing and Communications Workshops, 2004. Proceedings of the Second IEEE Annual Conference on, pages 18–22. Ieee, 2004.