=Paper=
{{Paper
|id=Vol-1912/paper1
|storemode=property
|title=A Database Framework for Probabilistic Preferences
|pdfUrl=https://ceur-ws.org/Vol-1912/paper1.pdf
|volume=Vol-1912
|authors=Batya Kenig,Benny Kimelfeld,Haoyue Ping,Julia Stoyanovich
|dblpUrl=https://dblp.org/rec/conf/amw/KenigKPS17
}}
==A Database Framework for Probabilistic Preferences==
A Database Framework for Probabilistic Preferences Batya Kenig1 , Benny Kimelfeld1? , Haoyue Ping2 , and Julia Stoyanovich2?? 1 Technion, Israel 2 Drexel University, USA 1 Introduction Preferences are statements about the relative quality or desirability of items. Ever larger amounts of preference information are being collected and analyzed in a variety of domains, including recommendation systems [2, 16, 18], polling and election analysis [3, 6, 7, 15], and bioinformatics [1, 11, 19]. Preferences are often inferred from indirect input (e.g., a ranked list may be inferred from individual choices), and are therefore uncertain in nature. This motivates a rich body of work on uncertain preference models in the statistics literature [14]. More recently, the machine learning community has been develop- ing methods for effective modeling and efficient inference over preferences, with the Mallows model [13] receiving particular attention [4, 5, 12, 17]. In this paper, we take the position that preference modeling and analysis should be accommodated within a general-purpose probabilistic database frame- work. Our framework is based on a deterministic concept that we proposed in a past vision paper [8]. In the present work we focus on handing uncertain preferences, and develop a representation of preferences within a probabilistic preference database, or PPD for short. This paper is an abbreviated version of our PODS 2017 paper, where an interested reader can find additional details about the formalism and proposed algorithmic solutions. 2 Probabilistic Preference Databases A preference schema S is a relational schema with some relation symbols marked as preference symbols (and others as ordinary symbols). Figure 1 gives an exam- ple of a preference database instance, with the ordinary symbols Candidates and Voters, and the preference symbol Polls. An instance over a preference symbol (such as Polls in Figure 1) represents a collection of preferences among a set of items, where each such preference is ? This work was supported in part by ISF Grant No. 1295/15, BSF Grant No. 2014391 and by the Taub Foundation. ?? This work was supported in part by NSF Grants No. 1464327 and 1539856, and BSF Grant No. 2014391. Polls (p) Candidates (o) Voters (o) voter date lcand rcand cand party sex edu voter edu sex age Ann Oct-5 Sanders Clinton Trump R M BS Ann BS F 25 Ann Oct-5 Sanders Rubio Clinton D F JD Bob BS M 35 Ann Oct-5 Sanders Trump Sanders D M BS Cat MS F 40 Ann Oct-5 Clinton Rubio Rubio R M JD Dave MS M 45 Ann Oct-5 Clinton Trump Ann Oct-5 Rubio Trump Bob Oct-5 Sanders Rubio A MAL-instance over Polls (p) Bob Oct-5 Sanders Clinton voter date Preference model MAL(σ, φ) Bob Oct-5 Sanders Trump Ann Oct-5 hClinton, Sanders, Rubio, Trumpi, 0.3 Bob Oct-5 Rubio Clinton Bob Oct-5 hTrump, Rubio, Sanders, Clintoni, 0.3 Bob Oct-5 Rubio Trump Bob Oct-5 Clinton Trump Fig. 1. An example of a preference database itself a binary relation called a session. A a binary relation over a set I = {σ1 , . . . , σn } of items is a (strict) partial order if it is irreflexive and transitive. A linear (or total ) order is a partial order where every two items are comparable. By a slight abuse of notation, we often identify a linear order σ1 · · · σn with the sequence hσ1 , . . . , σn i, and we call it a ranking. Example 1. Our running example is on individual preferences among the set of US presidential candidates I = {Clinton, Rubio, Sanders, Trump}. The ranking τ = hClinton, Rubio, Sanders, Trumpi is an example ranking over I. t u A preference relation instantiates a special relation symbol with a signature of the form (β, Al , Ar ), where β is the session signature, and Al and Ar are the left- hand-side (lhs) attribute and right-hand-side (rhs) attribute, respectively. We use semicolon (;) to distinguish between the different parts and write (β; Al ; Ar ). Example 2. We use the preference signature (voter, date; lcand; rcand) in our run- ning example. Here the components β, Al and Ar are (voter, date), lcand, and rcand, respectively. The table Polls of Figure 1 is an instance of this preference signature that contains two sessions. The session (Ann, Oct-5) is associated with the ranking hSanders, Clinton, Rubio, Trumpi. The tuple (Ann, Oct-5; Sanders; Clinton) denotes that in the session of the voter Ann on October 5th, the can- didate Sanders is preferred to the candidate Clinton. t u We now make the knowledge about voters’ opinions probabilistic, interpreting the preference database of Figure 1 as one possible world of a probabilistic preference database. A probabilistic preference database (abbrv. PPD) over the preference schema S is a probability space over preference databases over S. A PPD can be represented by explicitly specifying the entire sample space; however, we wish to allow for standard compact representations of preferences. A probabilistic preference model is a (finite and typically compact) represen- tation M of a probability space over partial orders over a finite set of items; we denote this probability space by JM K. A model family is a collection M of probabilistic preference models. As prominent examples, we define two model families: RIM is the family of RIM [5] models RIM(σ, Π), and MAL is the family of Mallows [13] models MAL(σ, φ). A Mallows model [13] MAL(σ, φ) is parameterized by a reference ranking σ = hσ1 , . . . , σm i and a dispersion parameter φ ∈ (0, 1]. The model assigns a non-zero probability to every ranking τ : The higher the Kendal’s tau distance [9] of τ is from σ, the lower its probability under the model. Lower values of φ concentrate most of the probability mass around σ, while φ = 1 corresponds to the uniform probability distribution over the rankings. Doignon [5] showed that MAL(σ, φ) can be represented as the insertion model RIM(σ, Π). In the PPD representations we explore, termed RIM-PPD, each session is associated with the parameters of a RIM model. A RIM-PPD represents a prob- ability space over preference databases, where a possible world is obtained by independently sampling a preference from the model of each session. Figure 1 gives an example of a MAL-instance over the p-symbol Polls that associates each session in Polls with a Mallows model. It is straight-forward to extend this representation to a mixture of Mallows, by associating each session with k com- ponents MAL1 (σ 1 , φ1 ), . . . , MALk (σ k , φk ), with the corresponding probabilities p1 , . . . , pk . A possible world would then be obtained by first sampling compo- nent MALi (σ i , φi ) with probability pi independently for each session, and then sampling a preference from MALi (σ i , φi ). 3 Query Evaluation over PPDs We adopt the semantics of probabilistic databases [20] for query evaluation. Specifically, let S be a schema, let Q be a query, and let D = (Ω, π) be a PPD. A possible answer for Q is a tuple a over sig(Q) such that a ∈ Q(D) for some sample D of D. We denote by PosAns(Q, D) the set of all possible answers. The confidence of a possible answer a ∈ PosAns(Q, D), denoted confQ (D, a), is the probability of having a as an answer when querying a sample of D. If E is an M- PPD for some model class M, then evaluating Q on E is the task of computing the following (finite) set: Q(E) = {(a, confQ (JEK, a)) | a ∈ PosAns(JEK)}. We study the data complexity of evaluating Conjunctive Queries (CQs) over RIM-PPDs. We focus on CQs to which we refer as itemwise. Intuitively, these are CQs where items are connected only through preferences. We show a natural fragment of CQs where the itemwise CQs are precisely the CQs in which query evaluation can be done in polynomial time. In the fragment we consider, we prove that every query that is not itemwise is actually #P-hard, and therefore, we establish a dichotomy in complexity. Let S be a preference schema, and let Q be a CQ over S. An atomic formula of Q is called a p-atom if it is over a p-symbol, and an o-atom if it is over an o-symbol. Let P (s1 , . . . , sk ; tl ; tr ) be p-atom of Q. Each term si for i = 1, . . . , k is said to occur in a session position, and each of tl and tr is said to occur in an item position. A session variable of Q is a variable that occurs in a session position, and an item variable of Q is a variable that occurs in an item position. We say that Q is sessionwise if all p-atoms of Q refer to the same session; that is, if P (s1 , . . . , sk ; tl ; tr ) and P 0 (s01 , . . . , s0l ; t0l ; t0r ) are p-atoms of Q, then P = P 0 and (s1 , . . . , sk ) = (s01 , . . . , s0l ). We say that Q is itemwise if Q is sessionwise, and the joins between item variables occur only inside the p-atoms, or through session variables. Put differently, in an itemwise CQ with a constant session, the o-atoms state properties of individual items but do not draw connections between the items. In [10] we define this property more formally, by means of the Gaifman graph of the CQ. Example 3. Consider the following Boolean CQs. The query Q1 asks whether there is a voter with a BS degree who prefers a male Democratic candidate to a female Democratic candidate. Q1 () ← P (v, ; l; r), V (v, BS, , ) , C(l, D, M, ) , C(r, D, F, ) The query Q2 asks whether there is a voter who prefers a male candidate to a female candidate such that both candidates are of the same political party. Q2 () ← P ( , ; l; r), C(l, p, M, ) , C(r, p, F, ) The query Q3 asks whether there is a voter who prefers a female candidate to both Trump and Sanders. Q3 () ← P (v, d; l; Trump), P (v, d; l; Sanders), C(l, , F, ) All of these CQs are sessionwise. Indeed, Q1 and Q2 involve a single p-atom (hence, they are sessionwise by definition), and in Q3 both atoms have (v, d) in their session parts. CQs Q1 and Q3 are itemwise, while Q2 is not itemwise. t u In [10] we prove the following theorem, which states that every itemwise Boolean CQ can be evaluated in polynomial time, under data complexity. Theorem 1. Let S be a preference schema, and let Q be a Boolean CQ over S. If Q is itemwise, then Q can be evaluated in polynomial time on RIM-PPDs. We also prove that the class itemwise CQs are precisely the tractable ones (among the queries in the class), under conventional complexity assumptions. In other words, every Boolean CQ (in the class) that is not itemwise is necessarily hard to evaluate, and therefore, we establish a dichotomy. Theorem 2. Let S be a preference schema, and let Q be a Boolean CQ over S such that Q has no self joins and Q has a single p-atom. If Q is not itemwise, then the evaluation of Q on RIM-PPDs over S is FP#P -hard. In [10] we give is a polynomial-time algorithm for evaluating itemwise CQs. Interestingly, such CQs translate into a natural (and novel) inference problem over RIM. In this problem, every item is associated with one or more labels (e.g., “democratic” party or “comedy” genre), and the goal is to compute the probability that a graph pattern (or equivalently a partial order) over these labels matches the random ranking. References 1. S. Aerts, D. Lambrechts, S. Maity, P. V. Loo, B. Coessens, F. D. Smet, L.-C. Tranchevent, B. D. Moor, P. Marynen, B. Hassan, P. Carmeliet, and Y. Moreau. Gene prioritization through genomic data fusion. Nature Biotechnology, 24(5):537– 544, 2006. 2. S. Balakrishnan and S. Chopra. Two of a kind or the ratings game? adaptive pairwise preferences and latent factor models. Frontiers of Computer Science, 6(2):197–208, 2012. 3. P. Diaconis. A generalization of spectral analysis with applications to ranked data. Annals of Statistics, 17(3):949–979, 1989. 4. W. Ding, P. Ishwar, and V. Saligrama. Learning mixed membership mallows mod- els from pairwise comparisons. CoRR, abs/1504.00757, 2015. 5. J.-P. Doignon, A. Pekeč, and M. Regenwetter. The repeated insertion model for rankings: Missing link between two subset choice models. Psychometrika, 69(1):33– 54, 2004. 6. I. C. Gormley and T. B. Murphy. A latent space model for rank data. In ICML, 2006. 7. I. C. Gormley and T. B. Murphy. A mixture of experts model for rank data with applications in election studies. The Annals of Applied Statistics, 2(4):1452–1477, 12 2008. 8. M. Jacob, B. Kimelfeld, and J. Stoyanovich. A system for management and analysis of preference data. PVLDB, 7(12):1255–1258, 2014. 9. M. G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938. 10. B. Kenig, B. Kimelfeld, H. Ping, and J. Stoyanovich. Querying probabilistic pref- erences in databases. In PODS, 2017. 11. R. Kolde, S. Laur, P. Adler, and J. Vilo. Robust rank aggregation for gene list integration and meta-analysis. Bioinformatics, 28(4):573–580, 2012. 12. T. Lu and C. Boutilier. Effective sampling and learning for mallows models with pairwise-preference data. J. Mach. Learn. Res., 15(1):3783–3829, Jan. 2014. 13. C. L. Mallows. Non-null ranking models. i. Biometrika, 44(1-2):114–130, June 1957. 14. J. I. Marden. Analyzing and Modeling Rank Data. Chapman & Hall, 1995. 15. G. McElroy and M. Marsh. Candidate gender and voter choice: Analysis from a multimember preferential voting system. Political Research Quarterly, 63(4):pp. 822–833, 2010. 16. A. D. Sarma, A. D. Sarma, S. Gollapudi, and R. Panigrahy. Ranking mechanisms in twitter-like forums. In WSDM, pages 21–30, 2010. 17. J. Stoyanovich, L. Ilijasic, and H. Ping. Workload-driven learning of mallows mixtures with pairwise preference data. In WebDB, page 8, 2016. 18. J. Stoyanovich, M. Jacob, and X. Gong. Analyzing crowd rankings. In WebDB, pages 41–47, 2015. 19. J. M. Stuart, E. Segal, D. Koller, and S. K. Kim. A gene-coexpression network for global discovery of conserved genetic modules. Science, 302:249–255, 2003. 20. D. Suciu, D. Olteanu, C. Ré, and C. Koch. Probabilistic Databases. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2011.