The Power of Integrated Abstraction for Data-Centric Human/Machine Computations Atsuyuki Morishima Norihide Shinagawa Shoji Mochizuki University of Tsukuba University of Tsukuba University of Tsukuba mori@slis.tsukuba.ac.jp siena@slis.tsukuba.ac.jp mshoji@slis.tsukuba.ac.jp ABSTRACT description can be used to derive executable codes written in Humans are recognized as important data sources in state-of-the-art implementation languages and frameworks, data-centric applications today. This paper discusses with guaranteed properties. Alternatively, it can be directly the potential of integrated abstraction of data-centric hu- executed by engines along with other codes (Sec. 3). man/machine computations, where data provided by people The significance of abstractions of data-centric hu- plays an important role. We show the potential by using a man/machine computations is increasing for the following language named CyLog, our first attempt to develop such reasons. First, as mentioned, it has been found that ag- abstractions. In CyLog, the closed world assumption is in- gregating the power of machines and humans is a promising terpreted in a broader world in which people are included as approach for achieving some of the computationally difficult rational data sources, that behave rationally in given games. goals. Second, in many of data-centric systems today, com- We argue that such abstractions give us opportunities to ap- putation is not necessarily closed in machines. In fact, some propriately deal with computations not closed in machines. of the problems of data-centric systems in practice do not 1. INTRODUCTION come from bugs of codes but from human factors [7]. Recently, data-centric applications that exploit hu- To make the discussion concrete, this paper introduces man/social computation have emerged, such as GWAPs [1] CyLog [11], as an example of such abstractions, and uses (e.g., the ESP game), Q&A services (e.g., Yahoo Answers), it in scenarios of data integration, acquisition, and search. and many other services that require the power of people. CyLog is a Datalog-like language that introduces open facts Even in some traditional applications, the power of people and borrows concepts from the game theory to model people is essential. For example, data integration and cleaning re- as novel rational data sources, in order to provide a prin- quire not only the processing of large amounts of data by cipled abstraction for describing, analyzing, and executing computers but also help from people (e.g.,[10]). As the ex- such programs. An essential difference from the existing lan- amples suggest, data from people is essential to attain some guages, is that CyLog deals with human computation as a of the difficult goals that computers alone cannot attain. first class component and allows us to design and analyze People serve as important data sources in such applications. the behavior of users, while others give no hints on whether Existing programming languages, including those for users will behave in the expected manner. Web/DB domains, have been designed only to describe the 2. EXAMPLES AND THE POTENTIAL behavior of computers, and do not offer tools for modeling people as components of computation. Human computation A key idea of CyLog is that it does not limit the scope of is out of the scope of the languages; the logic of interaction the closed world assumption to the stored database so that it with people needs to be implemented from the scratch using can naturally incorporate the processes of interactions with primitive functions (e.g., GUIs) or crowdsourcing APIs (e.g., people in the language design. CyLog allows some facts to Amazon Mechanical Turk or AMT). More important, it is be open in the sense that, when the fact is not stored (cannot difficult to analyze or predict the expected behavior of the be derived) in the database, it tries to extend the world by entire system, which includes machine and human activities. asking people whether a fact holds in the real world. This paper discusses the potential of integrated abstrac- Open Facts. Fig. 1 shows a fragment of a CyLog program. tion of data-centric human/machine computations. A good The fragment, except for the last line, can be interpreted as a abstraction serves as a powerful tool both in theoretical re- Datalog program with the closed world assumption. (Note: search and software development; it can be used to describe CyLog adopts the named perspective [2]. In the program, and analyze problems without implementation details. The “x:y” represents a renaming similar to “x as y” in SQL.) That is, the ancestors are computed based on the minimum Herbrand model. Therefore, Ancestor(pam, pat) holds but Ancestor(pam, ann) does not, because the later is not de- Permission to make digital or hard copies of all or part of this work for rived from the facts in the database. personal or classroom use is granted without fee provided that copies are In CyLog, open facts are allowed. For example, the last not made or distributed for profit or commercial advantage and that copies line in Fig. 1 states that if two persons share one parent, bear this notice and the full citation on the first page. To copy otherwise, to there may be some people who know whether the head fact republish, to post on servers or to redistribute to lists, requires prior specific (an instance of Parent(P, C) to state that they share an- permission and/or a fee. This article was presented at: the workshop Very Large Data Search (VLDS) 2011. other parent) holds. In addition, open facts can have “open” Copyright 2011. attributes that do not appear in the rule body (e.g., the Parent(pam, bob) Parent(bob, pat) Player A/Player B Term A Term B Parent(kate, pat) Parent(kate, ann) Ancestor(A, D) <- Parent(P:A, C:D) Term A (1,1), Term A (0,0), NULL Ancestor(X:A, D) <- Parent(P:X, C:A), Ancestor(A, D) Term B (0,0), NULL (1,1), Term B Parent(P, C)/open <-Parent(P,C:Y), Parent(P:Z, C:Y), Parent(P:Z, C) Figure 5: Payoff matrix for duplicate game Figure 1: Fragment of a CyLog program (a) Ideal: Order Date Player Rel Action StuMember(x) <- SIGMOD(x), Student(x) 1 10:10am Kate MetadataInput tennis StuMember(x) <- DBSJ(x), Student(x) 2 10:11am Ann MetadataInput tennis (b) Reality (only a fragment): 3 10:12am Pam MetadataInput ball StuMember(x) <- SIGMOD(x), DBSJ(y), x!=y, Equiv(x,y), IsReallyStu(x) Figure 6: Path Table Figure 2: Data integration example Game A player is rewarded when she gives: Data: Majority the same value as the others. MetadataInput(file, keyword)/open <- File(file) Metadata(file, g(file):keyword)/game:g(file) <- File(file) First a value given faster than others. Game: Unique a unique value. g(file)@time(10): duplicate, {MetadataInput} BestAnswer a value that received the highest evaluations. Figure 3: ESP game Figure 7: Some predefined games Data: Restaurant(x) <- Restaurants([x|y]) ample, some of the recent applications use tools like Twitter Restaurants(y) <- Restaurants([x|y]) as “human sensors;” [16] utilizes tweets to identify the center Better(a, b)/open <- Restaurant(a), Restaurant(b), a!=b Sort(l, result)/game:crowdrank(l) <- Restaurants(l) of the earthquake. CyLog would help us to write programs Game: independent of what the data sources are. Another example crowdrank(l)@deadline(2011/6/7): proportional, {Better} that often occurs in many areas is when a huge system (like Figure 4: Crowd ranking a reservation system) is down due to the failure of some of keyword of the first rule in Fig. 3). The open fact is evalu- its modules, human workers need to substitute for the sys- ated by a particular person, or a group of people, which is tem until the system becomes operational. CyLog gives us specified by an optional parameter of /open or by a built-in a chance to avoid such all-or-nothing situations. If we could predicate. When the key attributes are not open (i.e., bound continue to partially execute the program on alive parts of in the rule body), the semantics of CyLog accepts only the the system, while allowing people to join the execution of input supplied by the person who came first. As shown in the program, the obtained data would be compatible with Sec. 3, the system can wait for people to answer with Web the program. And it would be easier to restore and merge forms or crowdsourcing APIs, or can actively use messaging the data in order to go back to the normal status when the services, in order to have people evaluate open facts. machines became fully operational. Seamless Description of Human/Machine Compu- Second, open predicates allow us the separation of con- tations. The power of “open” facts comes from the fact cerns. We use the term data aspect to denote a quadruple that the data acquisition from people can be uniformly de- I, D, M, R in which I is a set of data sources including scribed using the same set of language constructs. Let us see not only machinery data sources but also people, D is the a data integration scenario that is based on a real integra- data stored in I, M is the mapping to represent how D is tion experienced by one of the authors. He needed to merge stored in I, and R is a set of rules connecting data in D. the member-list databases of two academic societies (ACM Open facts allow data aspects to be written in one module SIGMOD Japan Chapter and Database Society of Japan). and separated from the other code. This makes it easier to Ideally, the integration is straightforward, because it is the understand and maintain the data aspect. In contrast, with set union of two member databases (Fig. 2 (a)). However, the current db languages, data aspects are written by scat- in reality, the task was difficult, confusing, and error-prone, tered queries combined with the code in other languages. As because of the conflicts, inconsistency, and incompleteness shown in Sec. 3, a data aspect support system (DASS) is of data. We had to deal with many cases that were different designed to execute the data aspect written in CyLog and mixtures of db queries and human activities. The number of programs in general-purpose languages in a combined way. lines in the procedure document (excluding comments) was Finally, the integrated abstraction helps us check if some 153, including 51 lines for manual procedures in a natural of the apparently different programs are equivalent, while language and 39 SQL queries. The manual procedures in- it is difficult to confirm that different (and mixed) sets of clude sending emails to members to find out whether they db queries and procedures written in natural/programming were still students and using our knowledge for the entity languages are equivalent. Therefore, it provides us with a resolution. In contrast, we can write the entire integration chance of optimizing and transforming data aspects, not in- process in CyLog. Fig. 2 (b) shows only a fragment of the dividual queries. For example, the data integration scenario actual process, which is constructed by applying expansion used a set of domain-specific expansion rules to safely con- rules to the original query in Fig. 2 (a). In that fragment, struct complex data integration procedures. we need to perform a manual entity resolution using our Reward System. How can we define the semantics of such knowledge and send an email asking if he is still a student, data aspects? The problem is that human factors are incor- if Equiv(x,y) and IsReallyStu(x) are open facts. porated in the executions. Since people might lie and they The seamless description has a set of benefits. First, Cy- need motivation to participate in the computation, it is diffi- Log allows us to execute the same program in flexible ways cult to predict the execution results. One possible approach with different mixtures of human/machine computations. In is to consider each human as a rational data source. By “ra- the example scenario, if the definition of student members tional,” we mean that people are assumed to provide data in allowed us to know, based on the stored data, whether a a way consistent with the expected rewards. Therefore, Cy- member is still a student, the IsReallyStu could be com- Log is required to have a reward system built-in at the lan- puted by machines with other rule definitions or user-defined guage design level. Games are abstract concepts that have functions. They are equivalent (except for who evaluates the been well studied in the literature from both theoretical and part). The feature is interesting in various scenarios. For ex- practical aspects, and the game theory is known to be useful when discussing not just real “games” but any system that CyLog Execution Controller Default involves incentive structures [3, 9]. In fact, the game theory Program Logic Game Open Fact Functions/ Processor Manager API Other has been applied to particular classes of problems [15], such Payoff Notification Programs as the design of networks, auctions, and GWAPs. CyLog Data (optional) adopts terms and concepts from the game theory to design Data and implement the appropriate behavior of rational data sources. Then, the semantics of open facts are defined by ProgramExecutionintheCyberneticDataspace the equilibrium of the game [17]. This is achieved by payoff Figure 8: Chimera prototype system architecture matrices with outputs (Fig. 5), which we explain below. chines and humans. First, we can discuss the semantics of We show two example programs. The first one is a simple such dataspaces. As mentioned, let a data aspect d writ- version of ESP game [1], in which people (players) provide ∗ ten in CyLog be I, D, M, R. Let TR,S (I, D, M) be an keywords for given image files. If they match, the players operator to compute the (possibly infinite in CyLog) set of are rewarded. Fig. 3 is a CyLog program for the ESP game, consequences [2] for the data aspect with strategies S, where where Data: defines rules, and Game: defines games. The strategies describe how the people behave in the given games “/game:g(file)” specifies that the head is computed after [17]. Since people in I are involved in the evaluation, there the game identified by g(file) is over. The g(file) is a are many strategies and corresponding sets of consequences Skolem function to identify game instances. In the program, of a data aspect. When S is a set of best strategies, we call a game instance is created for each file. In the Game: part, ∗ TR,S (I, D, M) be a set of rational consequences of d. We g(file)@time(10) specifies that each game is invoked for a can define the semantics of d, denoted by sem(d), as the given file and ends in ten seconds. collection of all the sets of rational consequences of d. At the end of the execution of each game instance (iden- Second, such an abstraction gives us a chance to discuss tified by g(file)), a special table, called a path table, is some properties that we would not discuss when dealing with constructed. The table maintains the provenance (corre- traditional programs. Although the following theorem seems sponding to a path in the game theory) to show how the trivial for CyLog programs, it is clear that such abstractions game reached the last state. In other words, a path table would promote theoretical developments. A theorem on the records how players have behaved in the game instance. The efficiency of programs is also discussed in [12]. table is constructed with the schema P(order, date, player, Theorem 1. Given a data aspect I, D, M, R written rel, action) (Fig. 6). Each tuple in the table records when in CyLog, there is an algorithm to statically check whether and who gave values for the open attributes of the relations an open fact is not involved in any game definition in R. 2 specified in { ... } (i.e., MetadataInput in Fig. 3). The theorem is important for the following reason: if an The payoffs to players and the output values to be con- open fact is never involved in any game, it is guaranteed that sumed by other rules are computed by game aggregations of there is no feedback given to users. Therefore, the program each path table. For example, Fig. 5 shows a part of the may not be able to continue its execution as intended, or the payoff marix of duplicate game aggregation (Fig. 3) where data given by people may not be appropriate. This is exactly each cell shows not only the payoffs, but the output value what happened in the Japanese pension system problem [7]. (only two players and two terms are shown in the figure). If open facts are involved in games, tools from the game Given the path table in Fig. 6, payoffs (1 for Kate and Ann, theory can be used to predict behavior. For example, the and 0 for Pam) are given to players and the output value is payoff matrix in Fig. 5 is a typical coordination game [17], in tennis given by Kate and Ann. Then, the value is consumed which rational players choose the same strategy. Likewise, a by the second rule in Fig 3, in which the Skolem function is simple analysis let us know that the crowd ranking program also used to denote the value. Assuming that people behave helps us find popular restaurants, and we can change the rationally, it is expected that the value is computed by the game structure to find little known hot spots. state of equilibrium with the aggregations. We call such a Other Issues. Due to space limitations, we describe other game with output values a data game. Applying game ag- issues briefly. First, CyLog provides programmers with gregations to path tables give us a simple and flexible way means to implement data games with various settings, e.g., to implement both extensive- and normal-form games [17]. who (among the players) and when the program should ask The second example is a crowd ranking service (Fig. 4) whether each open fact holds. Second, some game aggre- where a set of restaurants in a city is sorted by subjective gations are predefined in our library (Fig. 7) but users can preferences of the crowd. The user can execute the pro- provide user-defined game aggregations in the data part. gram to find must-go restaurants before visiting the city. Finally, to explain to people their semantics, predicates can The Better(a, b)/open asks each person if she prefers a have text descriptions, which can be accessed through APIs to b. It can be evaluated, for example, through the AMT. provided by the system (as explained next). The Sort(l, result) simply sorts the list l by the num- ber of votes given by the game (the code omitted). Here, 3. DATA ASPECT SUPPORT SYSTEMS DASS should support the execution of the data aspect proportional is a payoff function that pays points accord- descriptions (e.g., CyLog programs) by machines and peo- ing to the proportion of the votes for restaurants. Therefore, ple, communicating with programs in general-purpose lan- there is an incentive for the crowd to vote for the restaurants guages. We discuss one possible set of components and ar- the others prefer too. chitecture for DASSs by showing those of Chimera, a proto- As a final note, even in the integration scenario, data type we developed for CyLog programs (Fig. 8). Chimera games can be incorporated in the program to improve the adopts a semi-naive evaluation strategy in which the rules quality of the results if more than one person is involved. are evaluated in a bottom up way [2]. To communicate Diving into Cybernetic Dataspaces. The integrated with other programs, Chimera takes a simple API-based ap- abstraction allows us to discuss dataspaces involving ma- proach; the programs call the open fact API to receive the data necessary to have people evaluate open facts, and to a-data-source approach, in the sense that their focus is on invoke an event to indicate that a new fact holds. how to provide programmers with the view of a reliable data DASS should support the execution of data aspects alone, source over the underlying set of people as a whole. too, for rapid prototyping. Chimera provides functions to We believe that CyLog is unique in that we take the generate Web sites with HTML forms where active evalua- human-as-a-data-source approach, in which each human tion can be realized by messaging services such as emails or (player) is modeled as a visible data source, and we try to Twitter, and plans to provide functions to call the AMT. design data-centric abstractions to deal with the “new” type of data source for building cybernetic dataspaces in flexible 4. CHALLENGES AND RELATED WORK ways. We introduce the concept of rational data sources This section identifies some of the new challenges and the and give new components, such as data games, to interact related work. with rational data sources. We believe that data-centric hu- Establishing Theories of Cybernetic Dataspaces. man/machine computations are important in many scenar- This is definitely one of the most exciting challenges, which ios [5] beyond the AMT-style crowdsourcing setting which can lead to significant contributions. An important is- provides only limited forms of game situations. Dealing with sue is how to design dataspaces involving human activities, dataspaces with other styles of human involvements are not with guaranteed properties. We showed only a first step, out of the scope of CyLog. It is interesting, however, that and there are many interesting questions. For example, those projects including CyLog employ many overlapped given a program and different mixtures of human/machinery concepts. For example, hQuery is discussed with a Datalog computers (I, D, M, R and I  , D, M , R), how dif- style notation. CNULL [6] and our open attribute values ferent the efficient optimizations are? Given two differ- are similar to each other. The “majority votes” principle, ent programs involving human activities to achieve the discussed in most of the crowd-as-a-data-source researches, same goal (two data aspects d1 : I, D, M, R and d2 : is related to coordination games in CyLog. Some of the I, D, M, R  s.t. sem(d1 ) = sem(d2 )), which one reaches challenges discussed here are also addressed in the projects. the goal faster? How much payoffs are needed to reach the Acknowledgement. The authors are grateful to Prof. goal of a program in general? Are there design criteria or Sugimoto, Prof. Sakaguchi, Prof. Nagamori, and Prof. Wu- normal forms for cybernetic dataspaces? We believe that wongse for the discussion in seminars, and Prof. Tajima data-centric abstraction is promising to establish theories. and Prof. Kitagawa for giving us valuable comments on Of course, games are not a magic wand; in some appli- earlier versions of the paper. This research is supported by cations, it may be difficult to provide real benefits (e.g., PRESTO from the Japan Science and Technology Agency. points, money, and evaluation scores) to be modeled by pay- off values. Humans are not necessarily rational. However, we believe that modeling humans as rational data sources 5. REFERENCES to apply the game theory is a good starting point. An in- [1] L. von Ahn, L. Dabbish: Designing games with a purpose. CACM 51(8): 58-67 (2008) teresting open question is whether we can apply the results [2] S. Abiteboul, R. Hull, V. Vianu. Foundations of Databases, from other fields such as cognitive and behavioral sciences. Addition-Wesley, 1995. [3] G. Anthes: Mechanism design meets computer science. CACM Assignment of Computations to Appropriate Ma- 53(8): 11-13 (2010) chines and Humans. Humans are not homogeneous in [4] X. Chai, B. Vuong, A. Doan, J. F. Naughton: Efficiently their abilities. In CyLog, it is programmers that give decide incorporating user feedback into information extraction and integration programs. SIGMOD Conference 2009: 87-100 who provides the required data (by specifying I, D, M), [5] A. Doan, R. Ramakrishnan, A. Y. Halevy: Crowdsourcing but finding appropriate humans to evaluate open facts is an systems on the World-Wide Web. CACM 54(4): 86-96 (2011) important open problem. [6] M J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, R. Xin: CrowdDB: answering queries with crowdsourcing. SIGMOD Data Aspect Issues. Since open facts allow us to describe 2011: 61-72 not only individual queries but also data-oriented human [7] Ministry of Internal Affairs and Communications, Japan. The activities as a data aspect, we now have an abstraction for report on the pention system problem. http://www.soumu.go .jp/menu news/s-news/2007/071031 3.html. wider data-centric issues. Although an event-based interface [8] J. M. Hellerstein. The declarative imperative: experiences and is fine to connect the data aspect with other programs, it is conjectures in distributed logic SIGMOD Record , 39(1), 2010. [9] S. Jain, D. C. Parkes: The role of game theory in human still an open problem to identify the best interface. computation systems. KDD Workshop on Human Computation Language Design. CyLog adopts Datalog as a basis for 2009: 58-61 the following reasons. First, as recent studies suggest [8], [10] M. Keulen, A. Keijzer: Qualitative effects of knowledge rules and user feedback in probabilistic data integration. VLDB J. languages based on Datalog allow us to concisely describe 18(5): 1191-1217 (2009) data-centric applications. Second, the rule syntax has good [11] A. Morishima: A Database Abstraction for Data-intensive compatibility with user inputs (e.g., [4]). Finally, logic- Social Applications. The 5th Korea-Japan Database Workshop 2010 (KJDB2010), May 28-29, 2010. (slides available) based programming has an affinity toward event-driven ex- [12] A. Morishima, N. Shinagawa: The Power of Integrated ecutions, which data games often require. Although open Abstraction for Data-Centric Human/Machine Computations. predicates and data games are key components of CyLog Technical Report, 2011. [13] A. Marcus, E. Wu, S. Madden, R. C. Miller: Crowdsourced since the beginning [11], other important constructs and the Databases: Query Processing with People. CIDR 2011: good overall design of the language are still open problems. 211-214 Related Work. Qurk [13], hQuery [14] and CrowdDB [6] [14] A. G. Parameswaran, N. Polyzotis: Answering Queries using Humans, Algorithms and Databases. CIDR 2011: 160-166 are independently conducted research projects but closely [15] Y. Shoham: Computer science and game theory. Commun. related to ours. Their focus is to achieve database functions ACM 51(8): 74-79 (2008) on the crowd, and one of the interesting points is to attain [16] T. Sakaki, M. Okazaki, and Y Matsuo: Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors, data independence in the presence of human data sources, WWW2010. on the assumption that a crowd of people are supplied by [17] F. Vega-Redondo. Economics and Theory of Games, crowdsourcing services. Currently, they take the crowd-as- Cambridge University Press, 2003.