Close and Loose Associations in Keyword Search from Structural Data Johanna Vainio Marko Junkkari Jaana Kekäläinen University of Tampere University of Tampere University of Tampere Kanslerinrinne 1 Kanslerinrinne 1 Kanslerinrinne 1 FI-33014 University of Tampere, FI-33014 University of Tampere, FI-33014 University of Tampere, Finland Finland Finland s.johanna.vainio@staff.uta.fi marko.junkkari@uta.fi jaana.kekalainen@uta.fi ABSTRACT study which kinds of settings the conceptual associations of a Keyword search over structural data enables users to seek semantic data model serve for the connections of entities. Then, we information from databases without knowing the structure of data analyze their roles in keyword search in relational databases. or mastering actual query languages like SQL. In a keyword query, In information retrieval, keyword search finds documents that data items or text attributes are matched to the keywords and the contain all or some of the keywords and ranks the documents result of a query is typically a set of graphs consisting of connected according to the statistical properties of their words. There is no tuples. The result should be ranked which means that the text need to solve how documents containing the keywords are attributes and connections must be scored and combined. Typically, connected. In context of relational databases, keyword search can the length of a connection is the main criterion in ranking the be used to find the top ranked connections of tuples that contain all connections, i.e. shorter connections are scored higher than longer or some of the keywords. To produce ranking, tuples that contain a ones. The length of a connection is usually based on the foreign key keyword are retrieved, and connections between these tuples are references but their direction has received less attention. At the produced. A connection of tuples may, for example, be a minimal conceptual level, cardinality constrains correspond to foreign key total joining network of tuples (MTJNT) [4] or Steiner tree [1] [2]. references or their combination. In the present paper, we investigate There are also different approaches how to rank the produced the effect of the combinations of cardinality constrains on the result connections. Ranking can be based, simply, on the number of joins of a keyword search. We find that the combination of cardinality of a connection or attribute, tuple or edge level scores or constraints indicates how close the association between keywords combinations of them. [6, 7, 8] Different connections may contain is. We also show that the Minimal Total Joining Network of Tuples different amount of information and different interpretations, even (MTJNT) principle loses semantic connections or fragments the between the same keyword tuples. Therefore, the shortest results of a keyword search from relational databases. connection is not always the best; a longer path may be more appropriate [5, 6]. We draw this conclusion by analyzing the CCS Concepts closeness and looseness of conceptual association. This dimension • Information systems • Information systems ~ Relational is based on the cardinality constrains that appear in the connections database model • Information systems ~ Entity relationship of entities. models • Information systems ~ Database query processing 2. CONCEPTUAL ASSOCIATIONS Keywords Semantic data models are conceptual methods for representing Keyword queries over structured data; associations in relational concepts and the relationships among them. The ER model is the databases; cardinality constraints; ER model. most common semantic data model and its principal primitives are the entity type, attribute, and relationship (type) between the entity 1. INTRODUCTION types. A relationship involves a cardinality constrain that may be Keyword search enables end-users to search data from relational 1:1, 1:N, N:1 or N:M. A constraint determines how many instances databases without knowledge of the syntax of a query language or are participating in the relationship at the extensional (instance) the structure of the data. However, keyword search involves level. Let the ER schema of Figure 1 illustrate this. The schema is ambiguity and raises new challenges. Traditionally, ambiguity is a fragment from [3] but no attributes are represented. The example associated with the nature of keyword search, i.e. matching search contains four entity types (DEPARTMENT, EMPLOYEE, keys to document contents is more or less fuzzy. In the context of DEPENDENT and PROJECT) and four relationships among them. structured data, the nature of relationships among entities and text In the example, several employees may work for a department and attributes may also affect different kinds of semantic an employee works in one department. An employee may have interpretations. Namely, entities may be associated with each other several dependents and a dependent has one employee as a via different kinds of relationships. In the present paper, we first guardian. Furthermore, an employee may work on several projects and a project may have several employees. Finally, a department 2017, Copyright is with the authors. Published in the Workshop may control several projects and a project is controlled by a single Proceedings of the EDBT/ICDT 2017 Joint Conference (March department. 21, 2017, Venice, Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0 {1, …, n} holds that Xi = 1 or i  {1, …, n} holds that Yi = 1 then 1 DEPARTMENT CONTROLS the relationships is functional. In general, the immediate relationships and transitive functional 1 relationships determine a close connection between entities at the WORKS extensional level. FOR N Table 1. Relationships and their cardinalities in the ER schema N N M Relationship Cardinality WORKS PROJECT EMPLOYEE ON 1 department – employee department 1:N employee 1 2 project – employee project N:M employee DEPENDENTS N DEPENDENT 3 department – employee department 1:N employee – dependent 1:N dependent 4 department – project – department 1:N project Figure 1: ER- schema employee N:M employee Figure 1 illustrates that an employee and a department may be in association in two ways. First, an employee works for a department 5 project – department – project N:1 department 1:N and second, (s)he works on a project controlled by the department. employee employee The first alternative involves one relationship and the second two 6 department – project – department 1:N project relationships, i.e. the first path is shorter but the longer one contains employee – dependent N:M employee 1:N more information because it also determines in which project the dependent employee is working. This is an essential issue in keyword search from structural data where the result connections should be ranked. In other words, if we want to emphasize access to more information Transitive relationship 4 consist of 1:N and N:M relationships a longer connection should be ranked before shorter connections. respectively. This means that one department can be associated However, usually longer connections are lower in a result list or not with employees through several projects. In other words, in the results list at all. This is justified, because longer connections employees that work on a project controlled by a department may entail more ambiguity than shorter ones and may lose associations or may not work in the department. For the reason that there are between entities. However, the level of the ambiguity a connection two kinds of semantic interpretations, this kind of transitive involves can be examined, and thus, decreased and controlled. Next relationship may cause a loose connection at the extensional level. we investigate how the level of the ambiguity can be determined Transitive relationship 5 contains two immediate relationships based on the cardinality constraints of the ER-model. having cardinality constraints N:1 and 1:N respectively. This is An entity type involves a set of entities whereas relationships called a transitive N:M relationship because several entities of the determine how the entities can be connected to each other. In the start entity type may be connected to several entities of the end present paper, we conceptualize that a close connection between entity type via a middle entity. This kind of relationship causes a entities means that they are associated with each other more ambiguous interpretation on the connection of entities. unambiguously through their relationships. Table 1 contains a Namely, an employee is associated with a project although s(he) sample of immediate and transitive relationships between entity may not work on it, i.e. an employee is associated with every types. Relationships 1 and 2 represent a situation where two entity project a department is controlled. Therefore, a transitive N:M types and the corresponding entities are connected immediately. In relationships may also cause a loose connection between entities. the immediate relationships, there is no ambiguity in the semantics In general, let X1,Y1,…,Xn,Yn, where X1 ≠ 1 and Yn ≠ 1, of the connections, i.e. the corresponding entities are closely represent the cardinality constraints of a transitive relationships, associated to each other. then the relationship is N:M transitive. Connection 6 contains three immediate relationships. The first relationship possesses the 1:N A transitive relationship contains more than one immediate constraint and the last 1:N constraint. However, this is not transitive relationships, i.e. the corresponding entities are connected to each 1:N relationship because it contains a transitive N:M relationship other via a middle entity. Transitive relationship 3 consists of two as a part of it. Therefore, it allows loose connections at the immediate relationships both having the cardinality constrains 1:N. extensional level. In other words, for one department there are several employees and for each employee there may be several dependents, but not vice Next we demonstrate close and loose connections at the database versa. This means that there is a transitive 1:N relationship between level and their effects on keyword search in relational databases. the entity types department and dependent. In other words, the 3. ASSOCIATIONS IN RELATIONAL connection is (inverse) functional. We interpret both inverse functional connections, only 1:N relationships, and functional DATABASES connections, only N:1 relationships, as functional. This is because Roughly speaking, an ER-schema is implemented in relational a connection can be represented in both directions, i.e. the databases such that for each entity type a relation is implemented. connection 3 in Table 1 can be represented from dependent to For each 1:N relation a foreign key is inserted to the N-site. For department (dependent N:1 employee N:1 department) as well. each N:M relationships a middle relation is formed. This relation A functional relationship may also contain 1:1 relationships. contains the foreign keys from both the participating entity types Therefore, we define that if X1,Y1,…,Xn,Yn represents the (relations in RDB). A foreign key constraint is typically represented cardinality constraints of a transitive relationships such that i  as an arrow from a foreign key to the related primary key. The database schema and database instance of Figure 1 is represented middle relation and the length of the connection would be one if the in Figure 2. Attributes are now represented. conceptual schema were followed. In other words, in conceptual approach middle relations should not be taken into account when calculating the length of a connection. Table 2. Connections in the RDB and lengths of the DEPARTMENT ID D_NAME D_DESCRIPTION connections in the RDB and the ER d1 Cs The main topics of connection length in length in teaching are programming, databases RDB ER and XML. 1 d1(XML) – e1(Smith) 1 1 d2 inf The main topics of teaching are information 2 p1(XML) – w_f1 – e1(Smith) 2 1 retrieval and XML. 3 p1(XML) – d1(XML) – e1(Smith) 2 2 d3 history The main topics of teaching are history of 4 d1(XML) – p1(XML) – w_f1 – 3 2 Scandinavian. e1(Smith) 5 d2(XML) – e2(Smith) 1 1 PROJECT ID D_ID P_NAME P_DESCRIPTION 6 p2(XML) – d2(XML) – e2(Smith) 2 2 p1 d1 DB- Different data models project are integrated, such as 7 d2(XML) – p3 – w_f2 – e2(Smith) 3 2 relational, object and 8 d1 – e3 – t1(Alice) 2 2 XML p2 d2 XML and XML offers a 9 d2 – p2 – w_f3 – e3 – t1(Alice) 4 3 IR notation for structured documents. p3 d2 IR task Task based In a schema (intensional) level, connections 1 and 2 have a close information retrieval association and connections 3 and 4 have a loose association between the entities. However, in an instance level, also connections 3 and 4 have a close association between the entities. WORKS_FOR ESSN P_ID HOURS The connections can be read as follows: e1 p1 40 1) “employee e1(Smith) works for department d1(XML)” e2 p3 56 2) “employee e1(Smith) works on a project p1(XML)” e3 p2 70 3) “employee e1(Smith) works for department d1(XML), that e4 p3 60 controls project p1(XML)” 4) “employee e1(Smith) works on project p1(XML), that is EMPLOYEE SSN L_NAME S_NAME D_ID controlled by department d1(XML). e1 Smith John d1 In this case employee e1 works on project p1 as associated in e2 Smith Barbara d2 connection 3 and employee e1 works for department d1 as associated in connection 4, but this cannot generally be assumed e3 Miller Melina d1 without investigating other connections. This is illustrated next. e4 Walker John d2 The closest and longest association between Barbara Smith and XML relates to the description of her department because she DEPENDENT ID ESSN DEPENDENT_NAME works in a department that matches XML (connections 5 and 7 in Table 2). It is worth noting that Barbara is also associated with t1 e3 Alice project p2 in connection 6 although she does not work in it. This is t2 e3 Theodore because the connection contains N:1 and 1:N relationships. In other Figure 2. Database schema and instance words this connection gives broader interpretation and project p2 and employee e2 (Barbara Smith) are in a loose association. A keyword search typically focuses on attribute values. A keyword may match the whole attribute value or a word in a text attribute. If the rank of connections 1 - 7 were based on the length of the Let us consider a sample keyword search connection in RDB, the best connections are 1 and 5 and the worst connections are 4 and 7. If the length of the ER-model were Smith XML followed and the close associations were emphasized, the best “Smith” matches two first employees whereas “XML” matches two connections are 1, 2 and 5 and the worst connections are 3 and 6. projects and two departments. Connections 1 – 7 in Table 2 In the latter approach connections 4 and 7 have a better rank represents some of the connections for the keyword query “Smith because they do not lose the close association (in the schema level), XML” in the RDB in Figure 2. i.e. the employee works in the department and in the project the John Smith is associated with XML through different connections. connection includes. Connections 8 and 9 in Tables 2 and 3 The shortest and the longest connections are between an employee correspond to relationships 5 and 6 in Table 1. Connection 8 has a and a department as shown in Table 1. John Smith is also associated close association and connections 9 has a loose association between with XML through the project by the connections having two steps entities in both the schema and instance levels. (connections 2 and 3 in Table 2). However, WORKS_FOR is a A commonly used approach to form connections is Minimal Total connection. A more precise approach could be achieved by Joining Network of Tuples (MTJNT) [4] or Steiner tree [1] [2]. In analyzing the actual number of participating entities (tuples) in a the MTJNT approach every keyword exists in at least one tuple of database instance. the joining network. It is not possible to remove any tuple from the We also proposed that the length of connections should be based joining network without losing MTJNT. The MTJNT approach on the relationships at the conceptual level because the N:M returns minimally connected tuples that still contain every relationship corresponds to a conceptual relationship. Moreover, keyword. This approach can lose some meaningful tuples that are 1:N or N:1 relationship can be implemented by a middle relation. associated to keyword queries and MTJNTs. In the previous By using conceptual relationships the length of connections does example connections 3, 4, 6 and 7 are lost, if the MTJNT approach not depend on implementation issues of this kind. were followed. The results of a keyword search may produce several paths between Table 3. Connections and relationships of the connections in tuples and they should be ranked based on their assumed relevance. the RDB One widely used indication has been the length of the path, i.e. the Connection Connection with relationships shortest paths are typically assumed to be more relevant than longer paths. However, longer paths may contain more information than 1 d1(XML) – e1(Smith) d1(XML) 1:N e1(Smith) shorter paths and shorter paths may chop a semantic connection 2 p1(XML) – w_f1 – p1(XML) 1:N w_f1 N:1 e1(Smith) between entities or text attributes/documents. Therefore, there e1(Smith) should be an alternative where the user could select longer paths, if s/he is interested in larger context of matched values or documents. 3 p1(XML) – d1(XML) – p1(XML) N:1 d1(XML) 1:N e1(Smith) e1(Smith) 5. REFERENCES 4 d1(XML) – p1(XML) – d1(XML) 1:N p1(XML) 1:N w_f1 [1] Aditya, B., Bhalotia, G., Chakrabarti, S., Hulgeri, A., Nakhe, w_f1 – e1(Smith) N:1 e1(Smith) C., Parag, P. and Sudarshan, S. BANKS: Browsing and Keyword Searching in Relational Databases. In Proceedings 5 d2(XML) – e2(Smith) d2(XML) 1:N e2(Smith) of the 28th International Conference on Very Large Data 6 p2(XML) – d2(XML) – p2(XML) N:1 d2(XML) 1:N Bases. VLDB ’02. VLDB Endowment, 2002, 1083-1086. e2(Smith) e2(Smith) [2] Bergamaschi, S., Guerra, F. and Simonini, G. Keyword 7 d2(XML) – p3 – w_f2 – d2(XML) 1:N p3 1:N w_f2 N:1 Search over Relational Databases: Issues, Approaches and e2(Smith) e2(Smith) Open Challenges. In Ferro, N. ed. Bridging Between Information Retrieval and Databases: PROMISE Winter 8 d1 – e3 – t1(Alice) d1 1:N e3 1:N t1(Alice) School 2013. Revised Tutorial Lectures. Springer Berlin 9 d2 – p2 – w_f3 – e3 – d2 1:N p2 1:N w_f3 N:1 e3 Heidelberg, Berlin, Heidelberg, 2014, 54-73. 2002, 1083- t1(Alice) 1:N t1(Alice) 1086. [3] Elmasri, R. and Navathe, S. B. Fundamentals of Database Systems, Fourth Edition. Addison-Wesley Longman The association of the keyword query in connection 4 is already Publishing Co., Inc, Boston, MA, USA, 2003. implicitly visible for the user in connections 1 and 2. However, in that case we have to assume that the user browses through these [4] Hristidis, V. and Papakonstantinou, Y. Discover: Keyword two answers and discovers the association from answers. Further, Search in Relational Databases. In Proceedings of the 28th it is not always the case that the association is implicitly visible in International Conference on Very Large Data Bases. VLDB the other returned associations as is the case in connection 7. ’02. VLDB Endowment, 2002, 670-681. [5] Kargar, M., An, A., Cercone, N., Godfrey, P., Szlichta, J. 4. DISCUSSION AND CONCLUSIONS and Yu, X. MeanKS: Meaningful Keyword Search in We have investigated the effects of the types of connections on the Relational Databases with Complex Schema. In Proceedings results of keyword queries over structural data. We considered how of the 2014 ACM SIGMOD International Conference on cardinality constraints affect the ranking of query results. We Management of Data.. SIGMOD ’14. ACM, New York, NY, noticed that cardinality constrains can be utilized to infer the USA, 2014, 905-908. looseness of an association. A loose association gives a more [6] Kargar, M., An, A., Cercone, N., Godfrey, P., Szlichta, J. and extensive result for a keyword query because entities (tuples) are Yu, X. Meaningful keyword search in relational databases associated to each other through a more general entity or several with large and complex schema. In Anonymous 2015 IEEE entities. The closeness of a connection at the extensional level can 31st International Conference on Data Engineering. 2015, partly be inferred from the cardinality constraints of the ER model. 411-422. Immediate and transitive functional relationships ensure the close connection between the corresponding entities. Instead, other [7] Li, L., Petschulat, S., Tang, G., Pei, J. and Luk, W. Efficient combinations allow close or loose connections between and Effective Aggregate Keyword Search on Relational participative entities. For example, in a transitive N:M relationship Databases. Int.J.Data Warehous.Min., 8, 4 (oct 2012), 41-81. several entities may be connected to each other through a more DOI=10.4018/jdwm.2012100103. general entity and the semantics of the relationship is vague. [8] Zeng, Z., Bao, Z., Lee, M. L. and Ling, T. W. Towards An However, further studies are needed for investigating how our Interactive Keyword Search over Relational Databases. In findings could be utilized in ranking the result connections. One Proceedings of the 24th International Conference on World criterion could be the number of transitive N:M relationships in a Wide Web. ACM, New York, NY, USA, 2015, 259-262. .