
A Database Framework for Probabilistic Preferences

Batya Kenig

Benny Kimelfeld

Haoyue Ping

Julia Stoyanovich

Drexel University, USA; Technion, Israel

* This work was supported in part by ISF Grant No. 1295/15, BSF Grant No. 2014391 and by the Taub Foundation.

** This work was supported in part by NSF Grants No. 1464327 and 1539856, and BSF Grant No. 2014391.

Introduction

Preferences are statements about the relative quality or desirability of items. Ever larger amounts of preference information are being collected and analyzed in a variety of domains, including recommendation systems [2, 16, 18], polling and election analysis [3, 6, 7, 15], and bioinformatics [1, 11, 19].

Preferences are often inferred from indirect input (e.g., a ranked list may be inferred from individual choices), and are therefore uncertain in nature. This motivates a rich body of work on uncertain preference models in the statistics literature [14]. More recently, the machine learning community has been developing methods for effective modeling and efficient inference over preferences, with the Mallows model [13] receiving particular attention [4, 5, 12, 17].

In this paper, we take the position that preference modeling and analysis should be accommodated within a general-purpose probabilistic database framework. Our framework is based on a deterministic concept that we proposed in a past vision paper [8]. In the present work we focus on handling uncertain preferences, and develop a representation of preferences within a probabilistic preference database, or PPD for short.

This paper is an abbreviated version of our PODS 2017 paper [10], where an interested reader can find additional details about the formalism and the proposed algorithmic solutions.

A preference schema $S$ is a relational schema with some relation symbols marked as preference symbols (and others as ordinary symbols). Figure 1 gives an example of a preference database instance, with the ordinary symbols Candidates and Voters, and the preference symbol Polls.

An instance over a preference symbol (such as Polls in Figure 1) represents a collection of preferences among a set of items, where each such preference is itself a binary relation called a session.

Candidates (o)
cand     party  sex  edu
Trump    R      M    BS
Clinton  D      F    JD
Sanders  D      M    BS
Rubio    R      M    JD

Voters (o)
voter  edu  sex  age
Ann    BS   F    25
Bob    BS   M    35
Cat    MS   F    40
Dave   MS   M    45

Polls (p)
voter  date   lcand    rcand
Ann    Oct-5  Sanders  Clinton
Ann    Oct-5  Sanders  Rubio
Ann    Oct-5  Sanders  Trump
Ann    Oct-5  Clinton  Rubio
Ann    Oct-5  Clinton  Trump
Ann    Oct-5  Rubio    Trump
Bob    Oct-5  Sanders  Rubio
Bob    Oct-5  Sanders  Clinton
Bob    Oct-5  Sanders  Trump
Bob    Oct-5  Rubio    Clinton
Bob    Oct-5  Rubio    Trump
Bob    Oct-5  Clinton  Trump

A MAL-instance over Polls (p)
voter  date   Preference model $\mathrm{MAL}(\sigma, \phi)$
Ann    Oct-5  $\langle \text{Clinton}, \text{Sanders}, \text{Rubio}, \text{Trump} \rangle$, 0.3
Bob    Oct-5  $\langle \text{Trump}, \text{Rubio}, \text{Sanders}, \text{Clinton} \rangle$, 0.3

Figure 1: A preference database instance with the ordinary symbols Candidates and Voters and the preference symbol Polls, together with a MAL-instance over Polls.

A binary relation $\succ$ over a set $I = \{\sigma_1, \dots, \sigma_n\}$ of items is a (strict) partial order if it is irreflexive and transitive. A linear (or total) order is a partial order where every two items are comparable. By a slight abuse of notation, we often identify a linear order $\sigma_1 \succ \dots \succ \sigma_n$ with the sequence $\langle \sigma_1, \dots, \sigma_n \rangle$, and we call it a ranking.

Example 1. Our running example is on individual preferences among the set of US presidential candidates $I = \{\text{Clinton}, \text{Rubio}, \text{Sanders}, \text{Trump}\}$. The ranking $\tau = \langle \text{Clinton}, \text{Rubio}, \text{Sanders}, \text{Trump} \rangle$ is an example ranking over $I$. □

A preference relation instantiates a special relation symbol with a signature of the form $(\beta, A_l, A_r)$, where $\beta$ is the session signature, and $A_l$ and $A_r$ are the left-hand-side (lhs) attribute and right-hand-side (rhs) attribute, respectively. We use a semicolon (;) to distinguish between the different parts and write $(\beta; A_l; A_r)$.

Example 2. We use the preference signature (voter, date; lcand; rcand) in our running example. Here the components $\beta$, $A_l$ and $A_r$ are (voter, date), lcand, and rcand, respectively. The table Polls of Figure 1 is an instance of this preference signature that contains two sessions. The session (Ann, Oct-5) is associated with the ranking $\langle \text{Sanders}, \text{Clinton}, \text{Rubio}, \text{Trump} \rangle$. The tuple (Ann, Oct-5; Sanders; Clinton) denotes that in the session of the voter Ann on October 5th, the candidate Sanders is preferred to the candidate Clinton. □
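The correspondence between a ranking and the tuples of a preference relation can be sketched in a few lines of Python. This is only an illustration: the helper names `ranking_to_tuples` and `is_strict_partial_order` are hypothetical and do not come from the paper, and the session is assumed to be given as a (voter, date) pair.

```python
from itertools import combinations

def ranking_to_tuples(session, ranking):
    """Expand a ranking into one (session..., lhs, rhs) tuple per ordered pair."""
    # combinations() keeps list order, so `left` always precedes `right`.
    return [(*session, left, right) for left, right in combinations(ranking, 2)]

def is_strict_partial_order(pairs):
    """Check irreflexivity and transitivity of a set of (lhs, rhs) pairs."""
    pairs = set(pairs)
    irreflexive = all(a != b for a, b in pairs)
    transitive = all((a, d) in pairs
                     for a, b in pairs for c, d in pairs if b == c)
    return irreflexive and transitive

# Ann's session from the running example.
ann = ranking_to_tuples(("Ann", "Oct-5"), ["Sanders", "Clinton", "Rubio", "Trump"])
print(len(ann))  # 6
print(is_strict_partial_order({(l, r) for _, _, l, r in ann}))  # True
```

A ranking over four items expands to six tuples, which matches the six rows that each session contributes to Polls.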

We now make the knowledge about voters' opinions probabilistic, interpreting the preference database of Figure 1 as one possible world of a probabilistic preference database. A probabilistic preference database (PPD) over the preference schema $S$ is a probability space over preference databases over $S$. A PPD can be represented by explicitly specifying the entire sample space; however, we wish to allow for standard compact representations of preferences.

A probabilistic preference model is a (finite and typically compact) representation $M$ of a probability space over partial orders over a finite set of items; we denote this probability space by $[\![M]\!]$. A model family is a collection $\mathcal{M}$ of probabilistic preference models. As prominent examples, we define two model families: RIM is the family of repeated insertion models [5] $\mathrm{RIM}(\sigma, \Pi)$, and MAL is the family of Mallows models [13] $\mathrm{MAL}(\sigma, \phi)$.

A Mallows model [13] $\mathrm{MAL}(\sigma, \phi)$ is parameterized by a reference ranking $\sigma = \langle \sigma_1, \dots, \sigma_m \rangle$ and a dispersion parameter $\phi \in (0, 1]$. The model assigns a non-zero probability to every ranking $\tau$: the larger the Kendall's tau distance [9] between $\tau$ and $\sigma$, the lower the probability of $\tau$ under the model. Lower values of $\phi$ concentrate most of the probability mass around $\sigma$, while $\phi = 1$ corresponds to the uniform probability distribution over the rankings. Doignon et al. [5] showed that $\mathrm{MAL}(\sigma, \phi)$ can be represented as the insertion model $\mathrm{RIM}(\sigma, \Pi)$.
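As a small illustration of the Mallows distribution, the following Python sketch computes the probability of a ranking. It assumes the standard form of the model, in which the probability of a ranking is proportional to the dispersion parameter raised to the Kendall's tau distance from the reference ranking; the text above describes this only qualitatively, and the function names are hypothetical.

```python
from itertools import permutations

def kendall_tau(tau, sigma):
    """Number of item pairs that tau and sigma order differently."""
    pos = {item: i for i, item in enumerate(sigma)}
    ranks = [pos[item] for item in tau]
    return sum(1 for i in range(len(ranks))
               for j in range(i + 1, len(ranks)) if ranks[i] > ranks[j])

def mallows_prob(tau, sigma, phi):
    """Assumed standard form: Pr(tau) = phi ** d(tau, sigma) / Z."""
    # Z is the product over i = 1..m of (1 + phi + ... + phi ** (i - 1)).
    z = 1.0
    for i in range(1, len(sigma) + 1):
        z *= sum(phi ** k for k in range(i))
    return phi ** kendall_tau(tau, sigma) / z

sigma = ("Clinton", "Sanders", "Rubio", "Trump")
total = sum(mallows_prob(t, sigma, 0.3) for t in permutations(sigma))
print(round(total, 9))  # 1.0
```

With a dispersion of 1 every ranking receives the same weight, which recovers the uniform distribution mentioned above.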

In the PPD representations we explore, termed RIM-PPD, each session is associated with the parameters of a RIM model. A RIM-PPD represents a probability space over preference databases, where a possible world is obtained by independently sampling a preference from the model of each session. Figure 1 gives an example of a MAL-instance over the p-symbol Polls that associates each session in Polls with a Mallows model. It is straightforward to extend this representation to a mixture of Mallows models, by associating each session with $k$ components $\mathrm{MAL}_1(\sigma_1, \phi_1), \dots, \mathrm{MAL}_k(\sigma_k, \phi_k)$, with the corresponding probabilities $p_1, \dots, p_k$. A possible world would then be obtained by first sampling a component $\mathrm{MAL}_i(\sigma_i, \phi_i)$ with probability $p_i$, independently for each session, and then sampling a preference from $\mathrm{MAL}_i(\sigma_i, \phi_i)$.
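Sampling a possible world can be sketched as follows. The sketch assumes the usual insertion probabilities under which a repeated insertion model coincides with a Mallows model: the i-th item of the reference ranking is inserted at position j (1 ≤ j ≤ i) with probability proportional to the dispersion raised to the power i − j. The function names and the dictionary layout of the MAL-instance are hypothetical.

```python
import random

def mallows_insertion_probs(i, phi):
    """Insertion probabilities for the i-th item (assumed Mallows-as-RIM form)."""
    weights = [phi ** (i - j) for j in range(1, i + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_rim(sigma, phi, rng):
    """Build a ranking by inserting the items of sigma one at a time."""
    ranking = []
    for i, item in enumerate(sigma, start=1):
        probs = mallows_insertion_probs(i, phi)
        j = rng.choices(range(1, i + 1), weights=probs)[0]
        ranking.insert(j - 1, item)  # position j is 1-based
    return ranking

def sample_world(mal_instance, rng):
    """Sample one ranking per session, independently across sessions."""
    return {session: sample_rim(sigma, phi, rng)
            for session, (sigma, phi) in mal_instance.items()}

# The MAL-instance over Polls from Figure 1.
polls_model = {
    ("Ann", "Oct-5"): (["Clinton", "Sanders", "Rubio", "Trump"], 0.3),
    ("Bob", "Oct-5"): (["Trump", "Rubio", "Sanders", "Clinton"], 0.3),
}
world = sample_world(polls_model, random.Random(0))
```

A mixture of Mallows models would add one step per session: first draw a component index according to the mixture probabilities, then call `sample_rim` with that component's parameters.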

Query Evaluation over PPDs

We adopt the semantics of probabilistic databases [20] for query evaluation. Specifically, let $S$ be a schema, let $Q$ be a query, and let $\mathcal{D} = (\Omega, \pi)$ be a PPD. A possible answer for $Q$ is a tuple $a$ over $\mathrm{sig}(Q)$ such that $a \in Q(D)$ for some sample $D$ of $\mathcal{D}$. We denote by $\mathrm{PosAns}(Q, \mathcal{D})$ the set of all possible answers. The confidence of a possible answer $a \in \mathrm{PosAns}(Q, \mathcal{D})$, denoted $\mathrm{conf}_Q(\mathcal{D}, a)$, is the probability of having $a$ as an answer when querying a sample of $\mathcal{D}$. If $E$ is an $\mathcal{M}$-PPD for some model family $\mathcal{M}$, then evaluating $Q$ on $E$ is the task of computing the following (finite) set: $Q(E) = \{(a, \mathrm{conf}_Q([\![E]\!], a)) \mid a \in \mathrm{PosAns}(Q, [\![E]\!])\}$.
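For intuition, the confidence of a Boolean query can be computed by exhaustively enumerating possible worlds. The Python sketch below does this for the two Mallows sessions of Figure 1; it is exponential in the number of items and sessions, so it only illustrates the semantics and is not an evaluation algorithm. The function names and the sample query are hypothetical, and the Mallows weights assume the standard form in which a ranking's weight is the dispersion raised to its Kendall's tau distance from the reference.

```python
from itertools import permutations, product
from math import prod

def kendall_tau(tau, sigma):
    pos = {item: i for i, item in enumerate(sigma)}
    r = [pos[x] for x in tau]
    return sum(r[i] > r[j] for i in range(len(r)) for j in range(i + 1, len(r)))

def mallows_space(sigma, phi):
    """Explicit probability space: ranking -> probability (normalized by sum)."""
    weights = {tau: phi ** kendall_tau(tau, sigma) for tau in permutations(sigma)}
    z = sum(weights.values())
    return {tau: w / z for tau, w in weights.items()}

def confidence(mal_instance, query):
    """Probability that a Boolean query holds in a random possible world."""
    sessions = list(mal_instance)
    spaces = [list(mallows_space(*mal_instance[s]).items()) for s in sessions]
    conf = 0.0
    for combo in product(*spaces):  # one (ranking, prob) choice per session
        world = {s: tau for s, (tau, _) in zip(sessions, combo)}
        if query(world):
            conf += prod(p for _, p in combo)  # sessions are independent
    return conf

polls = {
    ("Ann", "Oct-5"): (("Clinton", "Sanders", "Rubio", "Trump"), 0.3),
    ("Bob", "Oct-5"): (("Trump", "Rubio", "Sanders", "Clinton"), 0.3),
}

def clinton_over_trump(world):
    """Hypothetical query: does some voter prefer Clinton to Trump?"""
    return any(t.index("Clinton") < t.index("Trump") for t in world.values())

conf = confidence(polls, clinton_over_trump)
```

For a Boolean query the only possible answer is the empty tuple, so the result of evaluation reduces to this single confidence value.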

We study the data complexity of evaluating conjunctive queries (CQs) over RIM-PPDs. We focus on a class of CQs that we call itemwise. Intuitively, these are CQs where items are connected only through preferences. We identify a natural fragment of CQs in which the itemwise CQs are precisely those that can be evaluated in polynomial time: within this fragment, we prove that every query that is not itemwise is $\mathrm{FP}^{\#\mathrm{P}}$-hard, and thereby establish a dichotomy in complexity.

Let $S$ be a preference schema, and let $Q$ be a CQ over $S$. An atomic formula of $Q$ is called a p-atom if it is over a p-symbol, and an o-atom if it is over an o-symbol. Let $P(s_1, \dots, s_k; t_l; t_r)$ be a p-atom of $Q$. Each term $s_i$ for $i = 1, \dots, k$ is said to occur in a session position, and each of $t_l$ and $t_r$ is said to occur in an item position. A session variable of $Q$ is a variable that occurs in a session position, and an item variable of $Q$ is a variable that occurs in an item position. We say that $Q$ is sessionwise if all p-atoms of $Q$ refer to the same session; that is, if $P(s_1, \dots, s_k; t_l; t_r)$ and $P'(s'_1, \dots, s'_\ell; t'_l; t'_r)$ are p-atoms of $Q$, then $P = P'$ and $(s_1, \dots, s_k) = (s'_1, \dots, s'_\ell)$. We say that $Q$ is itemwise if $Q$ is sessionwise, and the joins between item variables occur only inside the p-atoms, or through session variables. Put differently, in an itemwise CQ with a constant session, the o-atoms state properties of individual items but do not draw connections between the items. In [10] we define this property more formally, by means of the Gaifman graph of the CQ.

Example 3. Consider the following Boolean CQs. The query $Q_1$ asks whether there is a voter with a BS degree who prefers a male Democratic candidate to a female Democratic candidate.

$$Q_1() \leftarrow P(v, \_\,; l; r),\ V(v, \mathrm{BS}, \_, \_),\ C(l, \mathrm{D}, \mathrm{M}, \_),\ C(r, \mathrm{D}, \mathrm{F}, \_)$$

The query $Q_2$ asks whether there is a voter who prefers a male candidate to a female candidate such that both candidates are of the same political party.

$$Q_2() \leftarrow P(\_, \_\,; l; r),\ C(l, p, \mathrm{M}, \_),\ C(r, p, \mathrm{F}, \_)$$

The query $Q_3$ asks whether there is a voter who prefers a female candidate to both Trump and Sanders.

$$Q_3() \leftarrow P(v, d; l; \mathrm{Trump}),\ P(v, d; l; \mathrm{Sanders}),\ C(l, \_, \mathrm{F}, \_)$$

All of these CQs are sessionwise. Indeed, $Q_1$ and $Q_2$ involve a single p-atom (hence, they are sessionwise by definition), and in $Q_3$ both p-atoms have $(v, d)$ in their session parts. The CQs $Q_1$ and $Q_3$ are itemwise, while $Q_2$ is not itemwise, because the party variable $p$ joins the item variables $l$ and $r$ through the o-atoms rather than through the p-atom. □
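To make the three queries concrete, the following Python sketch evaluates them on the deterministic instance of Figure 1, that is, on a single possible world. The encoding of the relations as dictionaries and a set of tuples, and the function names `q1`, `q2` and `q3`, are illustrative choices only.

```python
from itertools import combinations

# cand -> (party, sex, edu) and voter -> (edu, sex, age), as in Figure 1.
CANDIDATES = {"Trump": ("R", "M", "BS"), "Clinton": ("D", "F", "JD"),
              "Sanders": ("D", "M", "BS"), "Rubio": ("R", "M", "JD")}
VOTERS = {"Ann": ("BS", "F", 25), "Bob": ("BS", "M", 35),
          "Cat": ("MS", "F", 40), "Dave": ("MS", "M", 45)}
RANKINGS = {("Ann", "Oct-5"): ["Sanders", "Clinton", "Rubio", "Trump"],
            ("Bob", "Oct-5"): ["Sanders", "Rubio", "Clinton", "Trump"]}
# Polls holds one (voter, date, lcand, rcand) tuple per ordered pair.
POLLS = {(v, d, l, r) for (v, d), rk in RANKINGS.items()
         for l, r in combinations(rk, 2)}

def q1():
    """A BS voter prefers a male Democrat to a female Democrat."""
    return any(VOTERS[v][0] == "BS"
               and CANDIDATES[l][:2] == ("D", "M")
               and CANDIDATES[r][:2] == ("D", "F")
               for v, _, l, r in POLLS)

def q2():
    """Some voter prefers a male to a female candidate of the same party."""
    return any(CANDIDATES[l][1] == "M" and CANDIDATES[r][1] == "F"
               and CANDIDATES[l][0] == CANDIDATES[r][0]
               for _, _, l, r in POLLS)

def q3():
    """Some voter prefers a female candidate to both Trump and Sanders."""
    return any(CANDIDATES[l][1] == "F"
               and (v, d, l, "Trump") in POLLS
               and (v, d, l, "Sanders") in POLLS
               for v, d, l, _ in POLLS)

print(q1(), q2(), q3())  # True True False
```

Both sessions rank Sanders above Clinton, which witnesses the first two queries; the third fails because Clinton, the only female candidate, is ranked below Sanders by both voters.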

In [10] we prove the following theorem, which states that every itemwise Boolean CQ can be evaluated in polynomial time, under data complexity.

Theorem 1. Let $S$ be a preference schema, and let $Q$ be a Boolean CQ over $S$. If $Q$ is itemwise, then $Q$ can be evaluated in polynomial time on RIM-PPDs.

We also prove that, under conventional complexity assumptions, the itemwise CQs are precisely the tractable ones among the Boolean CQs that have no self-joins and a single p-atom. In other words, every Boolean CQ in this class that is not itemwise is necessarily hard to evaluate, and therefore, we establish a dichotomy.

Theorem 2. Let $S$ be a preference schema, and let $Q$ be a Boolean CQ over $S$ such that $Q$ has no self-joins and $Q$ has a single p-atom. If $Q$ is not itemwise, then the evaluation of $Q$ on RIM-PPDs over $S$ is $\mathrm{FP}^{\#\mathrm{P}}$-hard.

In [10] we give a polynomial-time algorithm for evaluating itemwise CQs. Interestingly, such CQs translate into a natural (and novel) inference problem over RIM. In this problem, every item is associated with one or more labels (e.g., "Democratic" party or "comedy" genre), and the goal is to compute the probability that a graph pattern (or equivalently a partial order) over these labels matches the random ranking.
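The statement of this inference problem can be illustrated with a brute-force Python sketch that enumerates all rankings of a small Mallows model and sums the probability of those matched by a label pattern. This is explicitly not the polynomial-time algorithm of [10]; it runs in time exponential in the number of items. The encoding of a pattern as labeled nodes plus precedence edges, and all function names, are assumptions made for illustration, as is the standard Mallows weight (dispersion raised to the Kendall's tau distance).

```python
from itertools import permutations, product

def kendall_tau(tau, sigma):
    pos = {item: i for i, item in enumerate(sigma)}
    r = [pos[x] for x in tau]
    return sum(r[i] > r[j] for i in range(len(r)) for j in range(i + 1, len(r)))

def matches(ranking, labels, nodes, edges):
    """True if items can be assigned to pattern nodes so that every node's
    required labels hold and every edge (u, v) has u's item ranked before v's."""
    pos = {item: i for i, item in enumerate(ranking)}
    names = list(nodes)
    candidates = [[it for it in ranking if nodes[n] <= labels[it]] for n in names]
    for choice in product(*candidates):
        assign = dict(zip(names, choice))
        if all(pos[assign[u]] < pos[assign[v]] for u, v in edges):
            return True
    return False

def pattern_probability(sigma, phi, labels, nodes, edges):
    """Probability that a random Mallows ranking matches the label pattern."""
    weights = {t: phi ** kendall_tau(t, sigma) for t in permutations(sigma)}
    z = sum(weights.values())
    return sum(w for t, w in weights.items()
               if matches(t, labels, nodes, edges)) / z

LABELS = {"Clinton": {"D", "F"}, "Sanders": {"D", "M"},
          "Rubio": {"R", "M"}, "Trump": {"R", "M"}}
# Pattern: some male Democrat is ranked before some female Democrat.
NODES = {"x": {"D", "M"}, "y": {"D", "F"}}
EDGES = [("x", "y")]

p = pattern_probability(("Clinton", "Sanders", "Rubio", "Trump"), 0.3,
                        LABELS, NODES, EDGES)
```

The pattern used here corresponds to the item part of the first query in Example 3, restricted to Ann's session model from Figure 1.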

References

1. Aerts, Lambrechts, Maity, P. V. Loo, Coessens, F. D. Smet, L.-C. Tranchevent, B. D. Moor, P. Marynen, B. Hassan, P. Carmeliet, and Y. Moreau. Gene prioritization through genomic data fusion. Nature Biotechnology, 24(5):537-544, 2006.

2. Balakrishnan and Chopra. Two of a kind or the ratings game? Adaptive pairwise preferences and latent factor models. Frontiers of Computer Science, 6(2):197-208, 2012.

3. Diaconis. A generalization of spectral analysis with applications to ranked data. Annals of Statistics, 17(3):949-979, 1989.

4. Ding, Ishwar, and Saligrama. Learning mixed membership Mallows models from pairwise comparisons. CoRR, abs/1504.00757, 2015.

5. J.-P. Doignon, Pekec, and Regenwetter. The repeated insertion model for rankings: Missing link between two subset choice models. Psychometrika, 69(1):33-54, 2004.

6. I. C. Gormley and T. B. Murphy. A latent space model for rank data. In ICML, 2006.

7. I. C. Gormley and T. B. Murphy. A mixture of experts model for rank data with applications in election studies. The Annals of Applied Statistics, 2(4):1452-1477, Dec. 2008.

8. Jacob, Kimelfeld, and Stoyanovich. A system for management and analysis of preference data. PVLDB, 7(12):1255-1258, 2014.

9. M. G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81-93, 1938.

10. Kenig, Kimelfeld, Ping, and Stoyanovich. Querying probabilistic preferences in databases. In PODS, 2017.

11. Kolde, Laur, Adler, and Vilo. Robust rank aggregation for gene list integration and meta-analysis. Bioinformatics, 28(4):573-580, 2012.

12. Lu and Boutilier. Effective sampling and learning for Mallows models with pairwise-preference data. J. Mach. Learn. Res., 15(1):3783-3829, Jan. 2014.

13. C. L. Mallows. Non-null ranking models. I. Biometrika, 44(1-2):114-130, June 1957.

14. J. I. Marden. Analyzing and Modeling Rank Data. Chapman & Hall, 1995.

15. McElroy and Marsh. Candidate gender and voter choice: Analysis from a multimember preferential voting system. Political Research Quarterly, 63(4):822-833, 2010.

16. A. D. Sarma, A. D. Sarma, S. Gollapudi, and R. Panigrahy. Ranking mechanisms in Twitter-like forums. In WSDM, pages 21-30, 2010.

17. J. Stoyanovich, L. Ilijasic, and H. Ping. Workload-driven learning of Mallows mixtures with pairwise preference data. In WebDB, page 8, 2016.

18. J. Stoyanovich, M. Jacob, and X. Gong. Analyzing crowd rankings. In WebDB, pages 41-47, 2015.

19. J. M. Stuart, E. Segal, D. Koller, and S. K. Kim. A gene-coexpression network for global discovery of conserved genetic modules. Science, 302:249-255, 2003.

20. Suciu, Olteanu, C. Ré, and Koch. Probabilistic Databases. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2011.