Simple User Assistance by Data Posting (Discussion Paper)

Elio Masciari, Domenico Saccà, and Irina Trubitsyna
elio.masciari@icar.cnr.it, {sacca,trubitsyna}@dimes.unical.it
ICAR-CNR, DIMES-Università della Calabria, 87036 Rende (CS), Italy

Copyright © 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. SEBD 2019, June 16-19, 2019, Castiglione della Pescaia, Italy.

Abstract. Big Data has deeply changed the way both researchers and industry approach data management. This impressive amount of data contains a lot of information about user suggestions and searches. In this paper we focus on the effect of users' suggestions on their social environment. Suggestions are provided to users by means of simple rules that are defined on the basis of the data being analyzed. We address the reaction of users in a competitive environment when they are invited to judge each other's choices. This activity is crucial for efficiently identifying users able to spread their influence across the network, which is a key task in several scenarios such as tourism promotion, personalized marketing, and entertainment suggestion.

1 Introduction

The Big Data paradigm is currently driving both research and industrial interest. Many proposals for innovative data analysis techniques for managing impressive amounts of data have been developed. These approaches share the common need to rethink storage and analysis techniques in order to deal with massive, heterogeneous and fast-varying data [1]. Indeed, the Big Data revolution has been driven by the continuous advances of web-based services with millions of users generating new data every day. As an example, applications such as Uber, Facebook or Twitter attract millions of people posting information about their habits and activities. Quick analysis of these data is a key activity in several scenarios, such as tourism promotion, personalized marketing and entertainment offers. To this end, [6] describes a collaborative network that allows users to cooperate with each other for high-performance task execution (by partitioning complex tasks into smaller and easier sub-tasks). Each user can operate as a client and/or an executor. As the tasks assigned to executors may consist of processes defining specific features for a context (e.g., deciding whether a restaurant is better than another), one may have to deal with inconsistent or incomplete data. Tasks requiring details of all possible solutions obtainable through a data exchange process can be solved by applying approximate certain query answering algorithms [8–10, 12]. To extract concise data, chosen among the possible alternatives on the basis of the analyst's indications, the data posting framework can be used [2]. The latter formalism allows us to model different situations. However, it requires the ability to manage count constraints and domain relations (we formally describe these notions in the background section). To simplify the modelling process, we propose to integrate the standard mapping with some parameters, avoiding complex specifications. We call the resulting rules smart mapping rules (more details can be found in [13, 14]). As an example, in Section 3 we consider the task of generating user suggestions on the basis of information posted by other people and available from the Web. We show how this process can be modelled with smart mapping rules.
Since the sub-tasks are solved by independent executors, who are paid for their activities, an important problem concerns the quality evaluation of the obtained results. In our context the clients can ask for search directions. Once the result is obtained, we can ask the client to evaluate the obtained suggestion, i.e. to rank the executor performing this task. Moreover, credits are assigned to executors suggesting useful search directions to clients. We analyze what happens with users' evaluations when they are aware of the evaluations of the others. In particular, we consider a group of users involved in a research project and we let them know each other's preferences on some topics.

We use Exponential Random Graph Models (ERGM) [3] to analyze the network dynamics and to model the revenue flow generated by user interactions. Since maximum likelihood estimation is computationally hard, we use two approaches for sampling data: Metropolis-Hastings sampling and clustering-based sampling. Moreover, since the data are continuously updated, we use two types of statistical analysis: cross-sectional and dynamic.

It has been shown in the literature that, in many settings, the possibility to observe the rankings of other users causes mutual adjustment among raters. In our framework, higher rankings are associated with positive evaluations (thus higher rewards), so that being ranked below others is aversive. To properly measure the above mentioned network dynamics, we compute three values proposed in [11]: (i) Global Nonconformity (GNC) - indicates how much a user conforms with all the others; (ii) Local Nonconformity (LNC) - indicates how much a user conforms with his neighbours; (iii) Deference Aversion (DA) - measures the following phenomenon: if user A ranks B above C (B > C), and user C would like to rank A above B (A > B), then, since the transitivity of ranking would place A above C, user C is negatively influenced in expressing the ranking A > B. Based on these measures we performed a deep experimental analysis.

2 Preliminaries

The data posting framework [7] adapts the well-known data exchange techniques to the new Big Data management and analysis challenges arising in real world scenarios. The data posting setting (S, D, T, Σst, Σt) consists of a finite source database schema S, a finite domain database schema D, a target schema T, a set Σst of non-deterministic source-to-target TGDs and a set Σt of target count constraints. A non-deterministic source-to-target TGD (NdTGD) is a dependency over ⟨S, D, T⟩ of the form ∀x [ φS(x ∪ ỹ) → ψT(z) ], where x and z are lists of universally quantified variables; ỹ is a (possibly empty) list of variables, called non-deterministic, which can occur in φS only in relations from D; x ∩ ỹ = ∅ and z ⊆ x ∪ ỹ; the formulas φS and ψT are conjunctions of atoms with predicate symbols in S ∪ D and in T, respectively. The structure of NdTGDs ensures that any target database on T is finite. An NdTGD can be seen as a standard TGD in which the existentially quantified variables are replaced with non-deterministic variables, whose values can be chosen from the finite domains defined by the domain relations, i.e. the relations from D. The mapping process is performed as usual, but presumes that for every assignment of x a subset of all admissible values for ỹ can be chosen in an arbitrary way.
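To fix ideas, the following LaTeX fragment typesets a minimal NdTGD over hypothetical relations; Query, Topic and Suggest are illustrative names introduced here and are not part of the framework above. The non-deterministic variable ranges over the finite domain provided by the domain relation Topic.

% A minimal illustrative NdTGD (hypothetical relations).
% Source relation Query(u, k): user u issued keyword k.
% Domain relation Topic(t): the finite set of admissible topics.
% Target relation Suggest(u, t): topic t is suggested to user u.
\[
  \forall u \,\forall k \; \big[\; \mathit{Query}(u,k) \,\wedge\, \mathit{Topic}(\tilde{t})
      \;\rightarrow\; \mathit{Suggest}(u,\tilde{t}) \;\big]
\]
% Here x = (u, k), the non-deterministic list is \tilde{y} = (\tilde{t}), which occurs
% only in the domain relation Topic, and z = (u, \tilde{t}) \subseteq x \cup \tilde{y}:
% for every binding of (u, k) a subset of the admissible topics can be chosen
% non-deterministically to populate Suggest, so the target instance stays finite.

In a standard TGD the suggested topic would be existentially quantified; here its admissible values are fixed in advance by the domain relation, which is what keeps the target database finite.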
A count constraint is a dependency over T of the form ∀x [ φT(x) → #({Y : ∃z α(x, Y, z)}) ⊙ β(x) ], where φT is a conjunction of atoms with predicate symbols in T; ⊙ is any of the comparison operators (=, >, ≥, < and ≤); H = {Y : ∃z α(x, Y, z)} is a set term; # is an interpreted function symbol that computes the cardinality of the (possibly empty) set corresponding to H; #(H) is a count term; and β(x) is an integer, a variable in x, or another count term with universally quantified variables in x. The lists x, Y and z do not share variables, and α(x, Y, z) is a conjunction of atoms Ti(x, Y, z) with Ti ∈ T. Let IT be an instance of T. The active domain ADIT is the set of all values occurring in IT. Given a substitution x/v assigning values in ADIT to the universally quantified variables, Kv = {Y : ∃z α(v, Y, z)} is the set of values in ADIT assigned to the free variables in Y for which ∃z α(v, Y, z) is satisfied by IT, and #(Kv) is the cardinality of this set. We say that IT satisfies the count constraint if each substitution x/v that makes its body expression true also makes its head expression true.

The data posting problem is defined as follows: given a finite source instance IS and a finite domain instance ID, find a finite target instance IT such that ⟨IS, ID, IT⟩ satisfies both Σst and Σt. This problem is NP-complete under data complexity. When Σst does not contain non-deterministic variables, the problem becomes polynomial.

3 Setting up user suggestions

In this section we consider the task of generating user suggestions on the basis of information posted on the Web. In the following example we show how this process can be modelled with smart mapping rules.

Example 1. Suppose we have the description of the user ratings (which we call comments) stored in the relation C(U, N, V) with attributes U (user identifier), N (argument of the rating) and V (rating value). Suppose also that we have a relation T(U1, U2, L) that represents the trust level L of user U2 for the user U1. In order to suggest some "relevant" arguments to the user, we can set the following criterion: "An argument n is suggested to the user u if the following conditions hold: 1) it is sufficiently supported, i.e. it is 'supported' by at least 20 users whose trust level towards u is greater than 70%; 2) if different comments corresponding to the same argument are sufficiently supported, only one can be selected, namely the one with the greater level of support". In this case, the choice of the comment has to take into account two different needs: 1) deciding whether the argument has to be suggested to the user; 2) selecting the rating value for this argument. This scenario can be modeled as follows:
– C and T are source relations;
– Add(U, N, V) is a target relation, specifying the comments to be suggested to U;
– the smart mapping rule is reported below:

    C(u2, n, v) ∧ T(u1, u2, l) ∧ l > 0.70  --[u2, 20, ⟨v, unique ∧ max⟩]-->  Add(u1, n, v)

Intuitively, the body of the rule allows us to restrict the attention to the users whose trust level towards u1 is greater than 70%; the selection criterion is synthesized on the arrow, indicating 1) the support variable u2, 2) the minimum number (20) of supporting instances required for the mapping, and 3) the variable v, whose value should be chosen from the set of candidate values following the indication unique ∧ max. □
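As a concrete illustration of the selection criterion of Example 1, the following Python sketch computes, for each user, the arguments supported by at least 20 trusted commenters and keeps a single rating value per argument (the most supported one). It is a minimal sketch of the intended semantics, not the data posting engine itself; the encoding of relations as lists of tuples, the function name suggest and the arbitrary tie-breaking are assumptions made for illustration.

from collections import defaultdict

# Source relations encoded as lists of tuples (an assumed, simplified encoding):
#   C: (user, argument, rating_value)   -- comments
#   T: (u1, u2, trust_level)            -- trust of u2 towards u1
def suggest(C, T, k=20, min_trust=0.70):
    """Sketch of C(u2,n,v) ∧ T(u1,u2,l) ∧ l > 0.70 --[u2,20,⟨v,unique∧max⟩]--> Add(u1,n,v)."""
    # Trusted supporters for each target user u1.
    trusted = defaultdict(set)
    for (u1, u2, l) in T:
        if l > min_trust:
            trusted[u1].add(u2)

    # Count the distinct supporters u2 for each combination (u1, n, v).
    support = defaultdict(set)
    for (u2, n, v) in C:
        for u1, ts in trusted.items():
            if u2 in ts:
                support[(u1, n, v)].add(u2)

    # Keep the combinations supported by at least k users, then enforce
    # ⟨v, unique ∧ max⟩: one value per (u1, n), the most supported one.
    best = {}
    for (u1, n, v), supporters in support.items():
        if len(supporters) >= k:
            key = (u1, n)
            if key not in best or len(supporters) > best[key][1]:
                best[key] = (v, len(supporters))   # ties broken arbitrarily

    # Target relation Add(u1, n, v).
    return [(u1, n, v) for (u1, n), (v, _) in best.items()]

Calling suggest(C, T) on the two relations yields the tuples of Add; dropping the unique part of the indication would amount to returning, for each pair (u1, n), every value v whose support ties the maximum.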
More formally, a smart mapping rule has the form ∀z [ φS(z) --[y, k, ⟨v, f⟩]--> r(x, v) ], where x, y, z, v are vectors of variables such that x ∪ y ∪ v ⊆ z and x, y and v do not share variables; φS is a conjunction of source relations and expressions involving comparison operators (>, <, ≥, ≤, =, ≠) over variables in z or constants; r is a target relation; y is called the support vector; k is a natural number (greater than 0) which indicates the required support value; the pair ⟨v, f⟩ indicates how the choice of the values of v should be performed: f can be max, unique, the conjunction unique ∧ max, or empty. We assume that each target relation can be defined by only one smart mapping rule.

The smart mapping rule specifies that the tuple ⟨X, V⟩ is added to r only if it is supported by at least k (different) initializations {Y1, ..., Yk} of y, i.e. for each j ∈ [1..k] there exists an initialization Zj of z that maps x, y and v to X, Yj and V respectively, and that makes φS(Zj) true. In the case that no further indication of choice is specified (the third arrow label is ⟨v, ⟩, with f empty), all the tuples supported by at least k (different) initializations of y are added to r. Otherwise, the set of tuples to be added is further reduced, using the indication f to select the values in v.

The ⟨v, unique⟩ indication specifies that the relation r must obey the functional dependency x → v, i.e. for each assignment of values in x the assignment of values in v must be unique. In the case that several tuples supported by at least k (different) initializations of y have the same values in x, the choice can be made arbitrarily. The indication ⟨v, max⟩ specifies that, for each X, only the tuples supported by a maximum number of initializations of y must be selected. It is easy to see that this indication does not guarantee the uniqueness of the choice: for example, two tuples ⟨X, V1⟩ and ⟨X, V2⟩ may have the same degree of support, corresponding to the maximum value. The indication ⟨v, unique ∧ max⟩ specifies that, for a fixed X, only one tuple ⟨X, V⟩ can be chosen among those supported by a maximum number of (different) initializations of y.

The set of smart mapping rules can be translated into the standard data posting constructs by translating every mapping rule ρ as follows. We introduce the unary domain relation Dρ: if f is empty, Dρ contains the values −1 and 1, otherwise Dρ contains the values −1, 0 and 1. We also introduce the target relations Aρ(X, Y, V) and Addρ(X, V, Flag), where X, Y and V represent vectors of attributes corresponding to the vectors of variables x, y and v, respectively, while the decision whether to select the pair ⟨x, v⟩ in the target relation r is stored in the attribute Flag: −1 or 0 (not added) and 1 (added). The mapping rules are:

    φS(z) → Aρ(x, y, v)
    φS(z) ∧ Dρ(flag) → Addρ(x, v, flag)

where all the variables are universally quantified. Using the domain relation Dρ ensures that only one value among −1, 0 and 1 is chosen for each initialization of (x, v) in the relation Addρ. The set of target count constraints is constructed as follows (the translation is instantiated for the rule of Example 1 right after this list).

1. We start by adding support constraints, which ensure that the value −1 is assigned to the variable flag iff the degree of support of the combination ⟨x, v⟩ does not reach k. First, we add the constraints for the values 1 and −1:

    Addρ(x, v, 1) → #({Y : Aρ(x, Y, v)}) ≥ k
    Addρ(x, v, −1) → #({Y : Aρ(x, Y, v)}) < k

   If f is not empty, we also add the constraint for the value 0:

    Addρ(x, v, 0) → #({Y : Aρ(x, Y, v)}) ≥ k

2. When f = unique, the uniqueness choice constraint is added:

    Addρ(x, _, flag) ∧ flag ≥ 0 → #({V : Addρ(x, V, 1)}) = 1

   This constraint ensures that exactly one initialization of v is selected for each X.

3. When f = max, we add (i) the optimization constraint

    Addρ(x, v2, 1) ∧ Addρ(x, v, 0) → #({Y : Aρ(x, Y, v2)}) ≥ #({Y : Aρ(x, Y, v)})

   and (ii) the choice constraint, ensuring that at least one initialization of v is selected for each X:

    Addρ(x, _, flag) ∧ flag ≥ 0 → #({V : Addρ(x, V, 1)}) ≥ 1

4. When f = unique ∧ max, we add both the uniqueness choice constraint and the optimization constraint.
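As a sanity check of the translation scheme, the following LaTeX fragment writes out the constructs it would produce for the rule of Example 1, where x = (u1, n), y = (u2), v = (v), k = 20 and f = unique ∧ max. This is our mechanical instantiation of the scheme above, shown only for illustration; the attribute ordering of the auxiliary relations and the final step recovering Add from the flag are our reading of the definitions.

% Domain relation (f is not empty): D_rho = {-1, 0, 1}.
% Auxiliary target relations: A_rho(U1, N, U2, V) and Add_rho(U1, N, V, Flag).
% Mapping rules (all variables universally quantified):
\begin{align*}
  C(u_2,n,v) \wedge T(u_1,u_2,l) \wedge l > 0.70
      &\rightarrow A_\rho(u_1,n,u_2,v)\\
  C(u_2,n,v) \wedge T(u_1,u_2,l) \wedge l > 0.70 \wedge D_\rho(\mathit{flag})
      &\rightarrow \mathit{Add}_\rho(u_1,n,v,\mathit{flag})
\end{align*}
% Support constraints (k = 20):
\begin{align*}
  \mathit{Add}_\rho(u_1,n,v,1)  &\rightarrow \#(\{U_2 : A_\rho(u_1,n,U_2,v)\}) \geq 20\\
  \mathit{Add}_\rho(u_1,n,v,-1) &\rightarrow \#(\{U_2 : A_\rho(u_1,n,U_2,v)\}) < 20\\
  \mathit{Add}_\rho(u_1,n,v,0)  &\rightarrow \#(\{U_2 : A_\rho(u_1,n,U_2,v)\}) \geq 20
\end{align*}
% Since f = unique \wedge max, the uniqueness choice constraint and the
% optimization constraint are added:
\begin{align*}
  \mathit{Add}_\rho(u_1,n,\_,\mathit{flag}) \wedge \mathit{flag} \geq 0
      &\rightarrow \#(\{V : \mathit{Add}_\rho(u_1,n,V,1)\}) = 1\\
  \mathit{Add}_\rho(u_1,n,v_2,1) \wedge \mathit{Add}_\rho(u_1,n,v,0)
      &\rightarrow \#(\{U_2 : A_\rho(u_1,n,U_2,v_2)\}) \geq \#(\{U_2 : A_\rho(u_1,n,U_2,v)\})
\end{align*}
% The suggested comments Add(u_1, n, v) are then the tuples with Flag = 1.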
The use of smart mapping rules offers a notable simplification of the data posting framework. The user and/or designer can construct the selection criterion focusing only on a small set of parameters. The selection criteria modeled by smart mapping rules are the most widespread in practice, and the standardization of their representation also allows us to optimize the implementation of the data posting process. Smart mapping rules, initially developed for our application scenario, can be profitably used in different contexts. In particular, we are investigating their application in P2P Deductive Databases [4, 5].

4 Experimental Evaluation

In this section we describe the evaluation of the users' ranking behaviours. We report the results obtained on a test set of user searches collected over four months during a research project. For the experimental evaluation we devised an approach similar to the one presented in [2], where the authors leverage ERGM and clustering analysis to partition the user interaction graph in order to tackle the complexity issues due to graph structures. We provided each user with two different options: 1) they can ask for search suggestions, or 2) they can suggest search directions on request. At fixed time intervals, users may evaluate the obtained results (both as consumers and as providers of information) by assigning a rank ranging from 0 to 10 to their experience. As we implemented a rewarding strategy, i.e. users pay for targeted suggestions and are paid when they suggest a proper search, the effectiveness evaluation is crucial, since we do not want users to pay for wrong information.

We model our small network interactions by a random graph that is continuously updated as new links among users appear (i.e. new interactions between a pair of users). In order to better understand the information relevant to our goal we leverage two types of analysis: the first one, referred to as cross-sectional (CS), considers only the variation occurring when data are observed, while the second one considers each observation as an independent unit and is thus referred to as dynamic (Dyn). In order to be fair and to evaluate the influence exerted by users on each other, every participant is aware of the others' scores. The latter results in the need to analyze endogenous factors, i.e., the phenomena that may arise when users know each other's rankings, causing the rankings given by a user i to be influenced by the rankings of the users other than i. To this end we measure the values of GNC, LNC and DA.
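To make the deference-aversion factor more concrete, the following Python sketch counts the triples described in Section 1 (A ranks B above C while C ranks A above B) in a table of observed ranks. It is only an illustrative, literal reading of that informal description, assuming a simple dict-of-dicts encoding of the ranks; it is not the ERGM rank-order statistic actually estimated in [11].

def deference_aversion_count(ranks):
    """Count triples (A, B, C) where A ranks B above C while C ranks A above B.

    `ranks` is assumed to be a dict of dicts: ranks[rater][ratee] is the score
    (0-10) that `rater` assigned to `ratee`; missing evaluations are skipped.
    Illustrative count only, not the sufficient statistic used in [11].
    """
    users = list(ranks)
    count = 0
    for a in users:
        for b in users:
            for c in users:
                if len({a, b, c}) < 3:
                    continue  # need three distinct users
                try:
                    a_ranks_b_above_c = ranks[a][b] > ranks[a][c]
                    c_ranks_a_above_b = ranks[c][a] > ranks[c][b]
                except KeyError:
                    continue  # one of the evaluations is missing
                if a_ranks_b_above_c and c_ranks_a_above_b:
                    count += 1
    return count

# Toy data: three users judging each other on a 0-10 scale.
ranks = {
    "A": {"B": 8, "C": 3},
    "B": {"A": 6, "C": 7},
    "C": {"A": 9, "B": 4},
}
print(deference_aversion_count(ranks))  # prints 3 for this toy table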
Evaluation. In order to analyze the evolution of user search effects, we use the following statistics: the overall number of searches performed by each user; the total number of keywords; the mean rank assigned to keywords. For the sake of completeness, we use both CS and Dyn analysis combined with clustering (CL) and Metropolis-Hastings (M-H) sampling. The results are shown in Figure 1.

Fig. 1. Experiments

For the Dyn analysis we compute fifteen graphs (one for each week except the first one), while for the CS analysis we have 16 graphs. It is easy to observe that in Tables (a) and (b) the values obtained for DA are high from the very beginning. It is worth noting that, as we start rewarding users (at week 4), the DA values further increase. More in detail, both for the Dyn and the CS analysis global nonconformity is not a significant factor. The values reported for the other two factors are, on the whole, significant, but they are uniformly smaller in magnitude for the dynamic analysis compared to those of the corresponding weeks in the CS analysis, and they are less precisely estimated (as shown by uniformly greater standard errors). This is because, rather than embodying the structure of the whole network, they embody only the structure of the changes in the network over the week; thus the only important values to rely on are the weekly changes. This means that "instant" social effects (e.g., friendship reciprocation) have been absorbed into the Week 0 observation, which is not modeled in the dynamic analysis. Moreover, the above mentioned phenomena can be considered as a kind of social "envy": users tend to decrease the scores of other users in order to get more requests for themselves and thus obtain more rewards. We note that the LNC value is also high, while GNC is not impressive: in a sense, users tend to agree on the general topics but not on the specific ones. Similar observations can be made for all the analyses reported in Tables (c) and (d). As a final note, we can observe that the results obtained when sampling data by clustering are slightly better. This behaviour can be explained by considering that, when leveraging clustering approaches, it is more likely to obtain more homogeneous groups (i.e. groups of users sharing common interests and features).

5 Conclusion and Future Work

The impressive amount of data shared on the Web contains a lot of information about user suggestions and searches. In this paper we focused on the effect of users' suggestions on their social environment. We showed how personalized suggestions can be extracted by using smart mapping rules. This kind of rules, initially developed for our application scenario, can be profitably used in different contexts. For instance, they allow us to define a fragment of the data posting framework in which the data posting problem becomes polynomial. We leveraged several mathematical tools to model the psychological and social mechanisms that proved to be effective in our scenario. We are aware that, with respect to population-scale networks, the networks considered in our experiments are fairly small. However, this aspect does not decrease the validity of our approach, as ranking data of the form analyzed here is typically of interest only in specific networks, which are by nature fairly small. Thus, highly scalable techniques are less compelling in our setting. Nevertheless, computationally scalable estimation is an interesting challenge for future research in this area, and we intend to address it in the near future.

References

1. D. Agrawal et al. Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States. 2012.
2. E. Bazzi, N. Cassavia, D. Chiggiato, E. Masciari, D. Saccà, A. Spada, and I. Trubitsyna. Evaluating user behaviour in a cooperative environment. Information, 9(12):303, 2018.
3. L. D. Brown. Fundamentals of statistical exponential families with applications in statistical decision theory. Lecture Notes Monograph Series, 1986.
4. L. Caroprese and E. Zumpano. Aggregates and priorities in P2P data management systems. In Proc. of IDEAS 2011, Lisbon, Portugal, pages 1–7, 2011.
5. L. Caroprese and E. Zumpano. Computing a deterministic semantics for P2P deductive databases. In Proc. of IDEAS 2017, Bristol, UK, pages 184–191, 2017.
6. N. Cassavia, S. Flesca, M. Ianni, E. Masciari, and C. Pulice. Distributed computing by leveraging and rewarding idling user resources from P2P networks. J. Parallel Distrib. Comput., 122:81–94, 2018.
7. N. Cassavia, E. Masciari, C. Pulice, and D. Saccà. Discovering user behavioral features to enhance information search on big data. TiiS, 7(2):7:1–7:33, 2017.
8. N. Fiorentino, S. Greco, C. Molinaro, and I. Trubitsyna. ACID: A system for computing approximate certain query answers over incomplete databases. In Proc. of SIGMOD Conference 2018, Houston, TX, USA, pages 1685–1688, 2018.
9. S. Greco, C. Molinaro, and I. Trubitsyna. Computing approximate query answers over inconsistent knowledge bases. In Proc. of IJCAI 2018, Stockholm, Sweden, pages 1838–1846, 2018.
10. S. Greco, C. Molinaro, and I. Trubitsyna. Approximation algorithms for querying incomplete databases. Information Systems (2019), https://doi.org/10.1016/j.is.2019.03.010, 2019.
11. P. N. Krivitsky and C. T. Butts. Exponential-family random graph models for rank-order relational data. http://arxiv.org/abs/1210.
12. T. Lukasiewicz, E. Malizia, and C. Molinaro. Complexity of approximate query answering under inconsistency in Datalog+/-. In Proc. of IJCAI 2018, Stockholm, Sweden, pages 1921–1927, 2018.
13. E. Masciari, D. Saccà, and I. Trubitsyna. Simple user assistance by data posting. In Second IEEE International Conference on Artificial Intelligence and Knowledge Engineering, AIKE 2019, Cagliari, Sardinia, Italy, June 3-5, 2019, pages 1–8, 2019.
14. E. Masciari, D. Saccà, and I. Trubitsyna. Simplified data posting in practice. In Proc. of the 23rd International Database Engineering and Applications Symposium (IDEAS 2019), June 10-12, 2019, Athens, Greece, pages 1–6, 2019.